CN113826382B - Adaptive bit depth conversion in video coding - Google Patents


Info

Publication number
CN113826382B
CN113826382B (application number CN202080036229.4A)
Authority
CN
China
Prior art keywords
picture
bit depth
video
reference picture
sampling bit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202080036229.4A
Other languages
Chinese (zh)
Other versions
CN113826382A (en)
Inventor
张凯
张莉
刘鸿彬
王悦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing ByteDance Network Technology Co Ltd
ByteDance Inc
Original Assignee
Beijing ByteDance Network Technology Co Ltd
ByteDance Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing ByteDance Network Technology Co Ltd and ByteDance Inc
Publication of CN113826382A
Application granted
Publication of CN113826382B

Classifications

    • H: ELECTRICITY; H04: ELECTRIC COMMUNICATION TECHNIQUE; H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/70 characterised by syntax aspects related to video coding, e.g. related to compression standards
    • H04N19/117 Filters, e.g. for pre-processing or post-processing (under H04N19/10 adaptive coding; H04N19/102 characterised by the element, parameter or selection affected or controlled by the adaptive coding)
    • H04N19/136 Incoming video signal characteristics or properties (under H04N19/134 characterised by the element, parameter or criterion affecting or controlling the adaptive coding)
    • H04N19/186 the coding unit being a colour or a chrominance component (under H04N19/169 characterised by the coding unit, i.e. the structural or semantic portion of the video signal being the object or subject of the adaptive coding)
    • H04N19/59 involving spatial sub-sampling or interpolation, e.g. alteration of picture size or resolution (under H04N19/50 predictive coding)

Abstract

Apparatus, systems, and methods for digital video coding are described, including bit depth and color format conversion for video coding. An exemplary method for video processing comprises: applying an adaptive bit depth conversion (ABC) process to a video region within a current picture, wherein in the ABC process one or more sets of sampling bit depths are applicable to one or more color components of the video region and the one or more sets of sampling bit depths are signaled in a particular video unit; and performing a conversion between the video region and a bitstream representation of the video region based on the ABC process.

Description

Adaptive bit depth conversion in video coding
Cross Reference to Related Applications
In accordance with applicable patent law and/or the rules of the Paris Convention, this application claims the priority of and benefit from International Patent Application No. PCT/CN2019/087209, filed on May 16, 2019. The entire disclosure of that application is incorporated by reference as part of the disclosure of this application.
Technical Field
This patent document relates to video codec technology, apparatus, and systems.
Background
Despite advances in video compression technology, digital video still accounts for the largest share of bandwidth usage on the internet and other digital communication networks. As the number of networked user devices capable of receiving and displaying video increases, the bandwidth requirements for digital video usage are expected to continue to grow.
Disclosure of Invention
Devices, systems, and methods related to digital video coding, and in particular to bit depth and color format conversion for video coding, are described. The described methods may be applied to existing video coding standards (e.g., High Efficiency Video Coding (HEVC)) and to future video coding standards or video codecs.
In one representative aspect, a method for video processing is disclosed, comprising: applying an adaptive bit depth conversion (ABC) process to a video region within a current picture, wherein in the ABC process one or more sets of sampling bit depths are applicable to one or more color components of the video region and the one or more sets of sampling bit depths are signaled in a particular video unit; and performing a conversion between the video region and a bitstream representation of the video region based on the ABC process.
In another representative aspect, a method for video processing is disclosed that includes: for a video region within a current picture of a video coded using an adaptive bit depth conversion (ABC) process, determining a relationship between the sampling bit depth of at least one color component of the current picture and the sampling bit depth of at least one color component of a reference picture to which the video region refers; in response to the relationship, performing a particular operation on samples within the reference picture or on a prediction region of the video region; and performing a conversion between the video region and a bitstream representation of the video region based on the particular operation.
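As one concrete illustration of the "particular operation" described above, the sketch below aligns reference samples to the current picture's sampling bit depth by bit shifting, with rounding and clipping on down-conversion. The function name and the shift-based mapping are illustrative assumptions, not the patent's normative procedure.

```python
def align_bit_depth(ref_samples, ref_bd, cur_bd):
    """Convert samples from a reference picture with bit depth ref_bd so
    they can predict a current picture with bit depth cur_bd.

    Hypothetical helper: up-conversion by left shift, down-conversion by
    right shift with round-to-nearest and clipping. A real codec may
    define a different mapping; this only illustrates the kind of
    operation the ABC process implies."""
    if cur_bd > ref_bd:
        return [s << (cur_bd - ref_bd) for s in ref_samples]
    if cur_bd < ref_bd:
        shift = ref_bd - cur_bd
        offset = 1 << (shift - 1)          # rounding offset
        maxv = (1 << cur_bd) - 1           # clip to the target range
        return [min((s + offset) >> shift, maxv) for s in ref_samples]
    return list(ref_samples)               # bit depths already match
```

For example, an 8-bit reference sample value 255 becomes 1020 when predicting a 10-bit picture.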
In yet another representative aspect, a method for video processing is disclosed that includes: applying a first adaptive conversion process to a video region within a current picture; applying a second adaptive conversion process to the video region; and performing a conversion between the bitstream representation of the video region and the video region based on the first adaptive conversion process and the second adaptive conversion process; wherein the first adaptive conversion process is one of an adaptive bit depth conversion (ABC) process and an Adaptive Resolution Change (ARC) process, and the second adaptive conversion process is the other of the adaptive bit depth conversion (ABC) process and the Adaptive Resolution Change (ARC) process.
In yet another representative aspect, a method for video processing is disclosed that includes: determining at least one Merge candidate for a video region within a current picture, wherein the at least one Merge candidate is assigned a priority based on the sampling bit depth of at least one color component of the reference picture to which the at least one Merge candidate refers; determining, based on the priority, whether and/or how to insert the Merge candidate into a Merge candidate list of the video region; and performing a conversion between the video region and a bitstream representation of the video region based on the Merge candidate list.
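A minimal sketch of the priority rule just described, assuming (purely for illustration) that candidates whose reference picture matches the current picture's sampling bit depth are placed earlier in the Merge list. The tuple layout and the stable-sort ordering are assumptions, not the claimed algorithm.

```python
def build_merge_list(candidates, cur_bit_depth, max_size):
    """candidates: list of (candidate_id, ref_pic_bit_depth) tuples in
    their normal spatial/temporal scan order. Candidates whose reference
    picture bit depth equals the current picture's are given higher
    priority; a stable sort preserves the original order within each
    priority group. Only the first max_size candidates are kept."""
    prioritized = sorted(candidates,
                         key=lambda c: 0 if c[1] == cur_bit_depth else 1)
    return [c[0] for c in prioritized[:max_size]]
```

For a 10-bit current picture, candidates referring to 10-bit reference pictures would come before those referring to 8-bit ones.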
In yet another representative aspect, a method for video processing is disclosed that includes: an Adaptive Loop Filtering (ALF) process is applied to samples within the current picture, wherein a plurality of ALF parameters are applicable to the samples based on corresponding sampling bit depths associated with the plurality of ALF parameters.
In yet another representative aspect, a method for video processing is disclosed that includes: applying a luma mapping with chroma scaling (LMCS) process to samples within a current picture, wherein a plurality of LMCS parameters are applicable to the samples based on corresponding sampling bit depths associated with the plurality of LMCS parameters.
In yet another representative aspect, a method for video processing is disclosed that includes: determining whether and/or how to perform a particular coding scheme on a video block within a current picture based on a relationship between a sampling bit depth of at least one of a plurality of reference pictures to which the video block refers and a sampling bit depth of the current picture; and based on the determination, performing a conversion between the bitstream representation of the video block and the video block.
In yet another representative aspect, a method for video processing is disclosed that includes: performing a conversion between a current video block and a bitstream representation of the current video block, wherein, during the conversion, bi-prediction from two reference pictures is disabled for the block if the two reference pictures referred to by the current video block have different sampling bit depths.
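The constraint above can be sketched as a small decision helper. Falling back from bi-prediction to uni-prediction is an illustrative encoder policy, not something mandated by the text.

```python
def select_prediction(ref_bd_l0, ref_bd_l1):
    """Return the allowed inter-prediction direction for a block given
    the sampling bit depths of its two candidate reference pictures.
    Bi-prediction is disabled when the bit depths differ, as described
    above; the uni-prediction fallback is only one possible policy."""
    if ref_bd_l0 == ref_bd_l1:
        return "bi"
    return "uni_l0"  # fall back to prediction from a single list
```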
In yet another representative aspect, the above-described methods are embodied in the form of processor-executable code and stored in a computer-readable program medium.
In yet another representative aspect, an apparatus configured or operable to perform the above-described method is disclosed. The apparatus may include a processor programmed to implement the method.
In yet another representative aspect, a video decoder device may implement the methods described herein.
The above and other aspects and features of the disclosed technology are described in more detail in the accompanying drawings, description and claims.
Drawings
Fig. 1 shows an example of adaptive streaming of two representations of the same content encoded at different resolutions.
Fig. 2 shows another example of adaptive streaming of two representations of the same content encoded at different resolutions, where the segments use either a closed group of pictures (GOP) or an open GOP prediction structure.
Fig. 3 shows an example of an open GOP prediction structure for two representations.
Fig. 4 shows an example of a representation of switching at an open GOP location.
Fig. 5 shows an example of decoding a Random Access Skipped Leading (RASL) picture using a resampled reference picture from another bitstream as a reference.
Fig. 6A-6C illustrate examples of region-wise mixed resolution (RWMR) viewport-dependent 360° streaming based on motion-constrained tile sets (MCTS).
Fig. 7 shows examples of collocated sub-picture representations with different Intra Random Access Point (IRAP) intervals and different sizes.
Fig. 8 shows an example of a segment received when a change in viewing orientation results in a change in resolution at the beginning of the segment.
Fig. 9 shows an example of a viewing orientation change.
Fig. 10 shows an example of a sub-picture representation for two sub-picture positions.
Fig. 11 shows an example of encoder modification for Adaptive Resolution Conversion (ARC).
Fig. 12 shows an example of decoder modification for ARC.
Fig. 13 shows an example of slice group based resampling for ARC.
Fig. 14 shows an example of an ARC process.
Fig. 15 shows an example of an Alternative Temporal Motion Vector Prediction (ATMVP) for a codec unit.
Fig. 16A to 16B show examples of simplified affine motion models.
Fig. 17 shows an example of affine Motion Vector Field (MVF) of each sub-block.
Fig. 18A and 18B show examples of a 4-parameter affine model and a 6-parameter affine model, respectively.
Fig. 19 shows an example of motion vector prediction (MVP) for AF_INTER with inherited affine candidates.
Fig. 20 shows an example of MVP for AF_INTER with constructed affine candidates.
Fig. 21A and 21B show examples of candidates for AF_MERGE.
Fig. 22 shows an example of candidate positions of the affine merge mode.
Fig. 23 is a block diagram of an example of a hardware platform for implementing the visual media decoding or visual media encoding techniques described in this document.
Fig. 24 shows a flowchart of an example method for video processing.
Fig. 25 shows a flow chart of another example method for video processing.
Fig. 26 shows a flowchart of an example method for video processing.
Fig. 27 shows a flow chart of another example method for video processing.
Fig. 28 shows a flow chart of an example method for video processing.
Fig. 29 shows a flow chart of another example method for video processing.
Fig. 30 shows a flowchart of an example method for video processing.
Fig. 31 shows a flow chart of another example method for video processing.
Detailed Description
Embodiments of the disclosed technology may be applied to existing video codec standards (e.g., HEVC, h.265) and future standards to improve compression performance. Chapter headings are used in this document to enhance the readability of the description and in no way limit the discussion or embodiments (and/or implementations) to the corresponding sections.
1. Video codec introduction
Video coding methods and techniques are ubiquitous in modern technology due to the increasing demand for high-resolution video. Video codecs typically include electronic circuitry or software that compresses or decompresses digital video, and they are continually being improved to provide higher coding efficiency. A video codec converts uncompressed video into a compressed format and vice versa. There are complex relationships between video quality, the amount of data used to represent the video (determined by the bit rate), the complexity of the encoding and decoding algorithms, sensitivity to data loss and errors, ease of editing, random access, and end-to-end delay (latency). The compressed format typically conforms to a standard video compression specification, such as the High Efficiency Video Coding (HEVC) standard (also known as H.265 or MPEG-H Part 2), the to-be-finalized Versatile Video Coding (VVC) standard, or other current and/or future video coding standards.
Video coding standards have evolved primarily through the development of the well-known ITU-T and ISO/IEC standards. ITU-T produced H.261 and H.263, ISO/IEC produced MPEG-1 and MPEG-4 Visual, and the two organizations jointly produced the H.262/MPEG-2 Video, H.264/MPEG-4 Advanced Video Coding (AVC), and H.265/HEVC standards. Since H.262, video coding standards have been based on a hybrid video coding structure in which temporal prediction plus transform coding is used. To explore future video coding technologies beyond HEVC, VCEG and MPEG jointly founded the Joint Video Exploration Team (JVET) in 2015. Since then, JVET has adopted many new methods and placed them into reference software named the Joint Exploration Model (JEM). In April 2018, the Joint Video Experts Team (JVET) between VCEG (Q6/16) and ISO/IEC JTC1 SC29/WG11 (MPEG) was created to work on the Versatile Video Coding (VVC) standard, targeting a 50% bit-rate reduction compared to HEVC.
AVC and HEVC do not have the ability to change resolution without introducing IDR or Intra Random Access Point (IRAP) pictures; this capability may be referred to as Adaptive Resolution Change (ARC). There are some use cases or application scenarios that may benefit from the ARC feature, including the following:
Rate adaptation in video telephony and conferencing: to adapt the coded video to changing network conditions, the encoder can adapt by encoding pictures at a smaller resolution when network conditions worsen and the available bandwidth decreases. Currently, the picture resolution can be changed only after an IRAP picture. This has several problems. An IRAP picture of reasonable quality will be much larger than an inter-coded picture and will correspondingly be more complex to decode, which wastes time and resources. A problem also arises if the decoder requests a resolution change for loading reasons. The change may further break low-delay buffer conditions, forcing the audio to resynchronize, and the end-to-end delay of the stream will increase, at least temporarily. All of this can lead to a poor user experience.
Active speaker change in multiparty video conference: for multiparty video conferences, active speakers are typically displayed at a video size that is larger than the video of the other conference participants. When the active speaker changes, it may also be necessary to adjust the picture resolution of each participant. The need to have ARC functionality becomes particularly important when such changes in active speakers occur frequently.
Fast start-up in streaming: for streaming applications, the application typically buffers up a certain length of decoded pictures before starting display. Starting the bitstream at a smaller resolution allows the application to fill the buffer with enough pictures to start display sooner.
Adaptive stream switching in streaming: the Dynamic Adaptive Streaming over HTTP (DASH) specification includes a feature named @mediaStreamStructureId. This feature enables switching between different representations at open-GOP random access points with non-decodable leading pictures (e.g., CRA pictures with associated RASL pictures in HEVC). When two different representations of the same video have different bit rates but the same spatial resolution, and they have the same value of @mediaStreamStructureId, switching between the two representations can be performed at a CRA picture with associated RASL pictures, and the RASL pictures associated with the switched-to CRA picture can be decoded with acceptable quality, enabling seamless switching. With ARC, the @mediaStreamStructureId feature could also be used to switch between DASH representations with different spatial resolutions.
ARC is also known as dynamic resolution conversion.
ARC can also be regarded as a special case of reference picture resampling (RPR), such as H.263 Annex P.
1.1. Reference picture resampling in H.263 Annex P
This mode describes an algorithm that warps the reference picture before it is used for prediction. It can be useful for resampling a reference picture that has a different source format than the picture being predicted. It can also be used for global motion estimation or rotational motion estimation by warping the shape, size, and position of the reference picture. The syntax includes the warping parameters to be used as well as the resampling algorithm. The simplest level of operation for the reference picture resampling mode is implicit factor-of-4 resampling, since only an FIR filter needs to be applied for the upsampling and downsampling processes. In this case, no additional signaling overhead is required, as its use is understood when the size of the new picture (indicated in the picture header) differs from the size of the previous picture.
1.2. Contributions of ARC to VVC
1.2.1.JVET-M0135
The preliminary ARC design described below is proposed to act as a placeholder to trigger discussion; parts of it are taken from JCTVC-F158.
1.2.1.1 basic tool description
The basic tool constraints to support ARC are as follows:
The spatial resolution may differ from the nominal resolution by a factor of 0.5 in both dimensions. The spatial resolution may increase or decrease, resulting in scaling ratios of 0.5 and 2.0.
The aspect ratio and chroma format of the video format are unchanged.
The cropping area scales with the spatial resolution.
Only the reference picture needs to be rescaled as needed and inter prediction applied as usual.
1.2.1.2 zoom operations
It is proposed to use simple, zero-phase, separable down-scaling and up-scaling filters. Note that these filters are used only for prediction; a decoder may use more sophisticated scaling for output purposes. The following 1:2 down-scaling filter with zero phase and 5 taps was used: (-1, 9, 16, 9, -1)/32.
The down-sampled points are located at even sample positions and are co-sited. The same filter is used for both luma and chroma.
For 2:1 up-sampling, additional samples are generated at the odd grid positions using the half-pel motion compensation interpolation filter coefficients in the latest VVC WD.
The combined up-sampling and down-sampling do not change the phase or the positions of the chroma sampling points.
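The text describes a zero-phase 5-tap 1:2 down-scaling filter. Taking the symmetric taps (-1, 9, 16, 9, -1)/32, which sum to 32 and therefore preserve DC, one row can be down-sampled as sketched below. The edge handling by clamping and the function name are assumptions; the proposal does not spell out boundary handling here.

```python
TAPS = (-1, 9, 16, 9, -1)  # zero-phase, symmetric; coefficients sum to 32

def downscale_row_2to1(row):
    """2:1 horizontal down-scaling of one row of samples with the 5-tap
    filter above; the output is co-sited with the even input positions.
    Picture edges are handled by clamping the sample index (assumed)."""
    n = len(row)
    out = []
    for c in range(0, n, 2):           # one output per two input samples
        acc = 16                       # rounding offset for the >> 5
        for k, t in enumerate(TAPS):
            x = min(max(c + k - 2, 0), n - 1)  # clamp at the edge
            acc += t * row[x]
        out.append(acc >> 5)           # normalize by 32
    return out
```

A flat row stays flat, confirming the unity DC gain of the normalized filter.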
1.2.1.3 resolution description in parameter set
The signaled changes in picture resolution in the SPS are shown below; in the following description and in the rest of this document, deletions are marked with double brackets (e.g., [[a]] denotes deletion of the character "a").
Sequence parameter set RBSP syntax and semantics
(The sequence parameter set RBSP syntax table is presented as an image in the original document.)
[[pic_width_in_luma_samples specifies the width of each decoded picture in units of luma samples. pic_width_in_luma_samples shall not be equal to 0 and shall be an integer multiple of MinCbSizeY.
pic_height_in_luma_samples specifies the height of each decoded picture in units of luma samples. pic_height_in_luma_samples shall not be equal to 0 and shall be an integer multiple of MinCbSizeY.]]
num_pic_size_in_luma_samples_minus1 plus 1 specifies the number of picture sizes (width and height) in units of luma samples that may occur in the coded video sequence.
pic_width_in_luma_samples[i] specifies the i-th width of decoded pictures, in units of luma samples, that may occur in the coded video sequence. pic_width_in_luma_samples[i] shall not be equal to 0 and shall be an integer multiple of MinCbSizeY.
pic_height_in_luma_samples[i] specifies the i-th height of decoded pictures, in units of luma samples, that may occur in the coded video sequence. pic_height_in_luma_samples[i] shall not be equal to 0 and shall be an integer multiple of MinCbSizeY.
Picture parameter set RBSP syntax and semantics
pic_parameter_set_rbsp( ) {            Descriptor
    ...
    pic_size_idx                       ue(v)
    ...
}
pic_size_idx specifies the index of the picture size in the sequence parameter set. The width of a picture referring to the picture parameter set is pic_width_in_luma_samples[pic_size_idx] luma samples. Likewise, the height of a picture referring to the picture parameter set is pic_height_in_luma_samples[pic_size_idx] luma samples.
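The SPS/PPS interaction described above amounts to an index lookup. The sketch below assumes the parameter sets have already been parsed into dictionaries keyed by the syntax element names; that packaging is illustrative, not part of the signaling.

```python
def resolve_picture_size(sps, pps):
    """Return the (width, height) in luma samples selected by the PPS's
    pic_size_idx from the candidate size lists signaled in the SPS."""
    idx = pps["pic_size_idx"]
    width = sps["pic_width_in_luma_samples"][idx]
    height = sps["pic_height_in_luma_samples"][idx]
    return width, height

# Example: an SPS signaling two sizes (full and half resolution),
# with the PPS selecting the second entry.
sps = {"pic_width_in_luma_samples": [1920, 960],
       "pic_height_in_luma_samples": [1080, 540]}
pps = {"pic_size_idx": 1}
```

Here resolve_picture_size(sps, pps) yields (960, 540): a picture referring to this PPS is decoded at half resolution.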
1.2.2.JVET-M0259
1.2.2.1. Background: sub-picture
The term "sub-picture track" is defined in the Omnidirectional Media Format (OMAF) as follows: a track that has a spatial relationship with other tracks and that represents a spatial subset of the original video content, which was split into spatial subsets before video encoding at the content production side. A sub-picture track for HEVC can be constructed by rewriting the parameter sets and slice segment headers of a motion-constrained tile set so that it becomes a self-standing HEVC bitstream. A sub-picture representation may be defined as a DASH representation carrying a sub-picture track.
JVET-M0261 uses the term "sub-picture" as a spatial partitioning unit for VVC, summarized as follows:
1. A picture is divided into sub-pictures, tile groups, and tiles.
2. A sub-picture is a rectangular set of tile groups that starts with a tile group having tile_group_address equal to 0.
3. Each sub-picture may refer to its own PPS and may thus have its own tile partitioning.
4. A sub-picture is treated like a picture in the decoding process.
5. The reference picture for decoding the current sub-picture is generated by extracting, from the reference picture in the decoded picture buffer, the region collocated with the sub-picture. The extracted region is a decoded sub-picture, i.e., inter prediction occurs between sub-pictures of the same size and the same position within the picture.
6. A tile group is a sequence of tiles in the tile raster scan of a sub-picture.
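Step 5 above (extracting the collocated region of a reference picture for sub-picture decoding) reduces to a rectangular crop. The sketch below uses a 2-D list of samples and hypothetical parameter names for the sub-picture position and size.

```python
def extract_collocated_region(ref_pic, x0, y0, width, height):
    """ref_pic: 2-D list (rows of samples) of a decoded picture in the
    decoded picture buffer. Returns the region collocated with a
    sub-picture whose top-left corner is at (x0, y0); inter prediction
    for that sub-picture then operates only on this extracted region."""
    return [row[x0:x0 + width] for row in ref_pic[y0:y0 + height]]
```

Because the crop uses the sub-picture's own position and size, prediction stays between sub-pictures of the same size and the same position, as required.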
In this document, we use the term "sub-picture" as defined in JVET-M0261. However, since a track encapsulating a sub-picture sequence as defined in JVET-M0261 has properties very similar to those of the sub-picture track defined in OMAF, the examples given below apply in both cases.
1.2.2.2. Use case
1.2.2.2.1. Adaptive resolution change in streams
Requirements to support adaptive streaming
Section 5.13 of MPEG N17074 ("support for adaptive streaming") includes the following requirement for VVC:
Where an adaptive streaming service offers multiple representations of the same content, each representation having different properties (e.g., spatial resolution or sampling bit depth), the standard shall support fast representation switching. The standard shall enable the use of efficient prediction structures (e.g., so-called open groups of pictures) without compromising the capability to switch quickly and seamlessly between representations of different properties, such as different spatial resolutions.
Example with open GOP prediction Structure representing switching
Content generation for adaptive bit rate streaming includes generating different representations, which may have different spatial resolutions. The client requests segments from the representations and can thereby decide at which resolution and bit rate it receives the content. At the client, the segments of the different representations are concatenated, decoded, and played. The client should be able to achieve seamless playback with a single decoder instance. As shown in fig. 1, a closed GOP structure (starting with an IDR picture) is conventionally used. An open GOP prediction structure (starting with a CRA picture) can achieve better compression performance than the corresponding closed GOP prediction structure. For example, with an IRAP picture interval of 24 pictures, an average bit rate reduction of 5.6% in terms of the luma Bjøntegaard delta bit rate was achieved.
An open GOP prediction structure also reduces subjectively visible quality pumping.
One challenge of using open GOPs in streaming is that RASL pictures cannot be decoded with the correct reference pictures after a representation switch. We describe this challenge below with reference to the representations in fig. 2.
The segments starting with a CRA picture include RASL pictures that have at least one reference picture in the preceding segment. This is illustrated in fig. 3, where picture 0 of both bitstreams resides in the preceding segment and is used as a reference for predicting the RASL pictures.
The representation switch marked with a dashed rectangle in fig. 2 is shown in fig. 4. It can be seen that the reference picture of the RASL pictures ("picture 0") has not been decoded. Consequently, the RASL pictures are not decodable, and there would be a gap in the playout of the video.
However, as described in embodiments of the present invention, it has been found that decoding RASL pictures with resampled reference pictures is subjectively acceptable. The resampling of "picture 0" is shown in fig. 5, and the resampled picture is used as a reference for decoding the RASL pictures.
1.2.2.2.2. Viewport changes in region-wise mixed resolution (RWMR) 360° video streaming
Background: HEVC-based RWMR streaming
RWMR 360° streaming provides a higher effective spatial resolution on the viewport. The scheme in which the tiles covering the viewport originate from a 6K (6144×3072) ERP picture, or an equivalent CMP resolution, with "4K" decoding capability (HEVC Level 5.1) (as shown in fig. 6) is included in clauses D.6.3 and D.6.4 of OMAF and is also adopted in the VR Industry Forum Guidelines. Such resolutions are claimed to be suitable for head-mounted displays using quad-HD (2560×1440) display panels.
Encoding: the content is encoded at two spatial resolutions, having cube face sizes of 1536×1536 and 768×768, respectively. In both bitstreams, a 6 x 4 slice grid is used and a motion-constrained slice set (MCTS) is encoded for each slice position.
Packaging: each MCTS sequence is encapsulated as a sub-picture track and can be used as a sub-picture representation in DASH.
Selection of streamed MCTSs: 12 MCTSs are selected from the high-resolution bitstream, and the complementary 12 MCTSs are extracted from the low-resolution bitstream. Hence, a hemisphere (180°×180°) of the streamed content originates from the high-resolution bitstream.
Merging MCTSs into a bitstream to be decoded: the received MCTSs of a single time instance are merged into a coded picture of 1920×4608 luma samples, which conforms to HEVC Level 5.1. Another option for the merged picture is to have four tile columns of width 768, two tile columns of width 384, and three tile rows of height 768 luma samples, resulting in a picture of 3840×2304 luma samples.
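A quick sanity check of the merged-picture arithmetic above: with the 6×4 tile grid, the high-resolution bitstream (1536×1536 cube faces) yields 768×768 tiles and the low-resolution bitstream yields 384×384 tiles (tile sizes inferred from the resolutions given, not stated verbatim in the text). Both merged layouts then carry exactly the same number of luma samples:

```python
def merged_sample_count():
    """Luma samples carried by the 12 high-resolution MCTSs (768x768)
    plus the 12 complementary low-resolution MCTSs (384x384)."""
    total = 12 * 768 * 768 + 12 * 384 * 384
    # Both merged-picture layouts described above hold this many samples:
    assert total == 1920 * 4608   # first merged layout
    assert total == 3840 * 2304   # alternative merged layout
    return total
```

Both products equal 8 847 360 luma samples, so the two layouts are sample-count equivalent repackagings of the same set of MCTSs.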
Background: several representations of different IRAP intervals for viewport-dependent 360 ° streaming
When the viewing orientation changes in HEVC-based viewport-dependent 360° streaming, a new selection of sub-picture representations can take effect at the next IRAP-aligned segment boundary. The sub-picture representations are merged into coded pictures for decoding; hence, the VCL NAL unit types are aligned in all the selected sub-picture representations.
To trade off the response time to viewing orientation changes against the rate-distortion performance when the viewing orientation is stable, multiple versions of the content can be coded at different IRAP intervals. This is illustrated in fig. 7 for one set of collocated sub-picture representations of the encoding presented in fig. 6.
Fig. 8 shows an example in which the sub-picture position is first selected for reception at a lower resolution (384×384). A change of viewing direction will result in a new choice of sub-picture positions being received with a higher resolution (768 x 768). In this example, the viewing direction change occurs such that segment 4 is received from the short IRAP interval sub-picture representation. Thereafter the viewing direction is stable and therefore a long IRAP interval version can be used starting from segment 5 onwards.
Statement of problem
Since the viewing direction changes gradually in typical viewing situations, in RWMR viewport-dependent streaming the resolution changes only for a subset of the sub-picture positions. Fig. 9 shows the cube face partitions after the viewing direction has moved slightly up and to the right relative to fig. 6. The cube face partitions whose resolution differs from before are marked with "C". A resolution change can be observed for 6 of the 24 cube face partitions. However, as described above, in response to the viewing direction change, segments starting with IRAP pictures need to be received for all 24 cube face partitions. Updating all sub-picture positions with segments starting with IRAP pictures is inefficient in terms of streaming rate-distortion performance.
In addition, the ability to use an open GOP prediction structure with the sub-picture representations of RWMR 360° streaming is desirable, both to improve rate-distortion performance and to avoid the poor visual picture quality caused by a closed GOP prediction structure.
Proposed design goals
The following design goals are presented:
the VVC design should allow merging a sub-picture originating from a random access picture and another sub-picture originating from a non-random access picture into the same codec picture conforming to the VVC.
The VVC design should enable the use of an open GOP prediction structure in the sub-picture representations without affecting the fast seamless representation switching capability between sub-picture representations of different properties, such as different spatial resolutions, while being able to merge the sub-picture representations into a single VVC bitstream.
The design goals can be illustrated with fig. 10, where sub-picture representations of two sub-picture positions are presented. For both sub-picture positions, a separate version of the content is coded for each combination of the two resolutions and the two random access intervals. Some segments start with an open GOP prediction structure. A change of viewing direction causes the resolution of sub-picture position 1 to switch at the beginning of segment 4. Since segment 4 starts with a CRA picture that has associated RASL pictures, those reference pictures of the RASL pictures that are in segment 3 need to be resampled. Note that this resampling applies to sub-picture position 1, while the decoded sub-pictures of some other sub-picture positions are not resampled. In this example, the viewing direction change causes no resolution change at sub-picture position 2, so the decoded sub-pictures of sub-picture position 2 are not resampled. In the first picture of segment 4, the segment for sub-picture position 1 contains a sub-picture originating from a CRA picture, while the segment for sub-picture position 2 contains a sub-picture originating from a non-random-access picture. It is proposed that merging such sub-pictures into a codec picture be allowed in VVC.
1.2.2.2.3. Adaptive resolution change in video conferencing
JCTVC-F158 proposed adaptive resolution change, mainly for video conferencing. The following subsections are reproduced from JCTVC-F158 and describe the cases in which adaptive resolution change was considered useful.
Seamless network adaptation and error recovery
Applications such as video conferencing and streaming over packet networks often require the encoded stream to adapt to changing network conditions, especially when the bit rate is too high and data is being lost. Such applications typically have a return channel that allows the encoder to detect errors and perform adjustments. The encoder has two main tools available: reducing the bit rate and changing the temporal or spatial resolution. A temporal resolution change can be achieved efficiently by using a hierarchical prediction structure for coding. However, to obtain the best quality, it is also necessary to change the spatial resolution, and well-designed encoders for video communication do so.
Changing the spatial resolution in AVC requires sending an IDR frame and resetting the stream. This causes serious problems. An IDR frame of reasonable quality will be much larger than an inter picture and correspondingly more complex to decode: this costs time and resources. A problem arises if the decoder requests a resolution change for load reasons. It may also break the low-latency buffer conditions, forcing the audio to resynchronize, and the end-to-end delay of the stream will increase, at least temporarily. This gives the user a bad experience.
To minimize these problems, IDR frames are typically sent at low quality, using a number of bits similar to a P frame, and it takes a significant amount of time to recover full quality at the given resolution. To keep the delay sufficiently low, the quality can indeed be very low, and visible blurring typically occurs before the image "refocuses". In effect, the intra frame is doing little for compression: it is just a way to restart the stream.
Thus, there is a need for methods in HEVC that allow for changing resolution with minimal impact on the subjective experience, especially under challenging network conditions.
Quick start
It would be useful to have a "fast start" mode in which the first frame is sent at reduced resolution and the resolution is increased over the next few frames, in order to reduce delay and restore normal quality faster without unacceptable image blurring at the beginning.
Conference "compose"
Video conferencing also typically displays the current speaker full-screen and the other participants in smaller-resolution windows. To support this efficiently, the smaller pictures are typically sent at a lower resolution. This resolution is then increased when a participant becomes the speaker and is displayed full-screen. Sending an intra frame at this point causes an unpleasant hiccup in the video stream. This effect can be very noticeable and unpleasant if speakers alternate rapidly.
1.2.2.3. Proposed design goals
The following is a high-level design choice proposed for VVC version 1:
1. it is proposed to include a reference picture resampling process in VVC version 1 for the following use cases:
use of efficient prediction structures (e.g. so-called open groups of pictures) in adaptive streaming without affecting the fast and seamless representation switching capability between representations of different properties, such as different spatial resolutions.
Adapting low-latency conference video content to network conditions and application-induced resolution changes without significant latency or latency variations.
2. It is proposed that the VVC design allow merging a sub-picture originating from a random access picture and another sub-picture originating from a non-random-access picture into the same VVC-conforming coded picture. It is asserted that this efficiently handles viewing direction changes in mixed-quality and mixed-resolution viewport-dependent 360° streaming.
3. It is proposed to include a sub-picture-based resampling process in VVC version 1. It is asserted that this enables efficient prediction structures, in order to handle viewing direction changes more efficiently in mixed-resolution viewport-adaptive 360° streaming.
1.2.3.JVET-N0048
The use cases and design goals of Adaptive Resolution Change (ARC) were discussed in detail in JVET-M0259. A brief summary follows:
1. Real-time communication
JCTVC-F158 initially includes the following use cases for adaptive resolution change:
a. seamless network adaptation and error recovery (by dynamically adapting resolution changes)
b. Quick start (resolution gradually increased upon session start or reset)
c. Conference "compose" (the speaker gets a higher resolution)
2. Adaptive streaming
Section 5.13 of MPEG N17074 ("supporting adaptive streaming") includes the following requirements for VVC:
Where an adaptive streaming service offers multiple representations of the same content, each representation having different properties (e.g., spatial resolution or sampling bit depth), the standard shall support fast representation switching. The standard shall enable the use of efficient prediction structures (e.g. so-called open groups of pictures) without compromising the fast and seamless representation switching capability between representations of different properties, such as different spatial resolutions.
JVET-M0259 discusses how this requirement can be met by resampling the reference pictures of leading pictures.
3. 360-degree viewport-dependent streaming
JVET-M0259 discusses how this use case can be addressed by resampling certain independently coded picture regions of the reference pictures of leading pictures.
This document proposes an adaptive resolution coding method that is asserted to meet all of the above use cases and design goals. Together with JVET-N0045 (which proposes independent sub-picture layers), this proposal handles the 360-degree viewport-dependent, stream start-up, and conference "compose" use cases.
Proposed canonical text
Signaling of
sps_max_rpr
seq_parameter_set_rbsp(){ Descriptor for a computer
sps_max_rpr ue(v)
sps_max_rpr specifies the maximum number of active reference pictures in reference picture list 0 or 1, for any slice group in the CVS, whose pic_width_in_luma_samples and pic_height_in_luma_samples are not equal to pic_width_in_luma_samples and pic_height_in_luma_samples, respectively, of the current picture.
Picture width and height
[ Syntax tables shown as images in the original publication ]
pic_parameter_set_rbsp(){ Descriptor for a computer
pps_pic_parameter_set_id ue(v)
pps_seq_parameter_set_id ue(v)
pic_width_in_luma_samples ue(v)
pic_height_in_luma_samples ue(v)
max_width_in_luma_samples specifies that, for any picture of a CVS for which this SPS is active, pic_width_in_luma_samples in any active PPS shall be less than or equal to max_width_in_luma_samples; this is a bitstream conformance requirement.
max_height_in_luma_samples specifies that, for any picture of a CVS for which this SPS is active, pic_height_in_luma_samples in any active PPS shall be less than or equal to max_height_in_luma_samples; this is a bitstream conformance requirement.
Advanced decoding process
The decoding process of the current picture CurrPic is as follows:
1. Clause 8.2 specifies decoding of NAL units.
2. The process in clause 8.3 specifies the following decoding process using syntax elements in the slice group header layer and above:
- The variables and functions related to picture order count are derived as specified in clause 8.3.1. This needs to be invoked only for the first slice group of a picture.
-invoking the decoding process of the reference picture list construction specified in clause 8.3.2 at the beginning of the decoding process of each slice group of the non-IDR picture to derive reference picture list 0 (RefPicList [0 ]) and reference picture list 1 (RefPicList [1 ]).
- The decoding process for reference picture marking in clause 8.3.3 is invoked, wherein reference pictures may be marked as "unused for reference" or "used for long-term reference". This needs to be invoked only for the first slice group of a picture.
- For each active reference picture in RefPicList[ 0 ] and RefPicList[ 1 ] whose pic_width_in_luma_samples or pic_height_in_luma_samples is not equal to pic_width_in_luma_samples or pic_height_in_luma_samples, respectively, of CurrPic, the following applies:
- The resampling process in clause x.y.z is invoked [ Ed. (MH): details of the invocation parameters to be added ]; its output has the same reference picture marking and picture order count as the input.
The reference picture used as input to the resampling process is marked as "unused for reference".
3. [ Ed. (YK): invocations of the decoding, scaling, transform, in-loop filtering, etc. processes for coding tree units are to be added here ]
4. After all slice groups of the current picture have been decoded, the current decoded picture is marked as "used for short-term reference".
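The per-reference-picture dimension check in step 3 above can be sketched as follows (a minimal Python illustration; the Picture class, the resample stand-in, and the helper names are hypothetical, not from the draft text):

```python
from dataclasses import dataclass

@dataclass
class Picture:
    width: int
    height: int
    poc: int                                  # picture order count
    marking: str = "used for short-term reference"

def resample(ref: Picture, w: int, h: int) -> Picture:
    # Stand-in for the clause x.y.z resampling process: the output keeps the
    # input's reference picture marking and picture order count.
    return Picture(w, h, ref.poc, ref.marking)

def prepare_reference_lists(curr: Picture, ref_lists):
    """For each active reference picture whose dimensions differ from the
    current picture, substitute a resampled version and mark the original
    "unused for reference", mirroring step 3 of the process above."""
    for ref_list in ref_lists:
        for i, ref in enumerate(ref_list):
            if (ref.width, ref.height) != (curr.width, curr.height):
                ref_list[i] = resample(ref, curr.width, curr.height)
                ref.marking = "unused for reference"
    return ref_lists
```

For example, with a 960×540 reference and a 1920×1080 current picture, the list entry becomes a 1920×1080 resampled picture carrying the original POC, while the original reference is marked "unused for reference".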
Resampling process
The SHVC resampling process (HEVC clause h.8.1.4.2) is proposed and the following is added: …
If sps_ref_wraparound_enabled_flag is equal to 0, the sample value tempArray[ n ] for n = 0..7 is derived as follows:
tempArray[ n ] = ( fL[ xPhase, 0 ] * rlPicSampleL[ Clip3( 0, refW - 1, xRef - 3 ), yPosRL ] +
fL[ xPhase, 1 ] * rlPicSampleL[ Clip3( 0, refW - 1, xRef - 2 ), yPosRL ] +
fL[ xPhase, 2 ] * rlPicSampleL[ Clip3( 0, refW - 1, xRef - 1 ), yPosRL ] +
fL[ xPhase, 3 ] * rlPicSampleL[ Clip3( 0, refW - 1, xRef ), yPosRL ] +
fL[ xPhase, 4 ] * rlPicSampleL[ Clip3( 0, refW - 1, xRef + 1 ), yPosRL ] +
fL[ xPhase, 5 ] * rlPicSampleL[ Clip3( 0, refW - 1, xRef + 2 ), yPosRL ] +
fL[ xPhase, 6 ] * rlPicSampleL[ Clip3( 0, refW - 1, xRef + 3 ), yPosRL ] +
fL[ xPhase, 7 ] * rlPicSampleL[ Clip3( 0, refW - 1, xRef + 4 ), yPosRL ] ) >> shift1    (H-38)
Otherwise, the sample value tempArray[ n ] for n = 0..7 is derived as follows:
refOffset=(sps_ref_wraparound_offset_minus1+1)*MinCbSizeY
tempArray[ n ] =
( fL[ xPhase, 0 ] * rlPicSampleL[ ClipH( refOffset, refW, xRef - 3 ), yPosRL ] +
fL[ xPhase, 1 ] * rlPicSampleL[ ClipH( refOffset, refW, xRef - 2 ), yPosRL ] +
fL[ xPhase, 2 ] * rlPicSampleL[ ClipH( refOffset, refW, xRef - 1 ), yPosRL ] +
fL[ xPhase, 3 ] * rlPicSampleL[ ClipH( refOffset, refW, xRef ), yPosRL ] +
fL[ xPhase, 4 ] * rlPicSampleL[ ClipH( refOffset, refW, xRef + 1 ), yPosRL ] +
fL[ xPhase, 5 ] * rlPicSampleL[ ClipH( refOffset, refW, xRef + 2 ), yPosRL ] +
fL[ xPhase, 6 ] * rlPicSampleL[ ClipH( refOffset, refW, xRef + 3 ), yPosRL ] +
fL[ xPhase, 7 ] * rlPicSampleL[ ClipH( refOffset, refW, xRef + 4 ), yPosRL ] ) >> shift1
If sps_ref_wraparound_enabled_flag is equal to 0, the sample value tempArray[ n ] for n = 0..3 is derived as follows:
tempArray[ n ] = ( fC[ xPhase, 0 ] * rlPicSampleC[ Clip3( 0, refWC - 1, xRef - 1 ), yPosRL ] +
fC[ xPhase, 1 ] * rlPicSampleC[ Clip3( 0, refWC - 1, xRef ), yPosRL ] +
fC[ xPhase, 2 ] * rlPicSampleC[ Clip3( 0, refWC - 1, xRef + 1 ), yPosRL ] +
fC[ xPhase, 3 ] * rlPicSampleC[ Clip3( 0, refWC - 1, xRef + 2 ), yPosRL ] ) >> shift1    (H-50)
Otherwise, the sample value tempArray[ n ] for n = 0..3 is derived as follows:
refOffset = ( ( sps_ref_wraparound_offset_minus1 + 1 ) * MinCbSizeY ) / SubWidthC
tempArray[ n ] =
( fC[ xPhase, 0 ] * rlPicSampleC[ ClipH( refOffset, refWC, xRef - 1 ), yPosRL ] +
fC[ xPhase, 1 ] * rlPicSampleC[ ClipH( refOffset, refWC, xRef ), yPosRL ] +
fC[ xPhase, 2 ] * rlPicSampleC[ ClipH( refOffset, refWC, xRef + 1 ), yPosRL ] +
fC[ xPhase, 3 ] * rlPicSampleC[ ClipH( refOffset, refWC, xRef + 2 ), yPosRL ] ) >> shift1
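The clipping behavior that distinguishes the two branches above (Clip3 when wrap-around is disabled, ClipH when it is enabled) can be sketched as follows. The clip3 and clip_h bodies follow the standard HEVC/VVC definitions; luma_tap_positions is an illustrative helper (not a spec function) for the eight horizontal positions addressed in (H-38):

```python
def clip3(lo, hi, x):
    # Clip3( lo, hi, x ) as defined in HEVC/VVC.
    return lo if x < lo else hi if x > hi else x

def clip_h(ref_offset, ref_w, x):
    # Horizontal wrap-around clip used when sps_ref_wraparound_enabled_flag == 1:
    # positions off the left/right edge wrap by refOffset instead of saturating.
    if x < 0:
        return x + ref_offset
    if x > ref_w - 1:
        return x - ref_offset
    return x

def luma_tap_positions(x_ref, ref_w, wraparound, ref_offset=0):
    """Return the eight horizontal sample positions (xRef-3 .. xRef+4) used by
    the 8-tap luma filter, with either Clip3 or ClipH addressing."""
    taps = range(x_ref - 3, x_ref + 5)
    if wraparound:
        return [clip_h(ref_offset, ref_w, x) for x in taps]
    return [clip3(0, ref_w - 1, x) for x in taps]
```

For x_ref = 0 near the left picture edge, saturating Clip3 repeats column 0, whereas ClipH with refOffset equal to the picture width fetches columns from the right edge, which is the intended behavior for 360° content with horizontal wrap-around.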
1.2.4.JVET-N0052
Adaptive resolution change has appeared as a concept in video compression standards at least since 1996, in particular in the H.263+ work on reference picture resampling (RPR, Annex P) and reduced-resolution update (Annex Q). It has recently received renewed attention, first through a Cisco proposal during JCT-VC, then in the context of VP9 (where it is moderately widely deployed today), and most recently in the context of VVC. ARC allows reducing the number of samples needed to code a given picture and upsampling the resulting reference picture to a higher resolution when needed.
We consider ARC of particular interest in two scenarios:
1) Intra-coded pictures (such as IDR pictures) are typically much larger than inter pictures. Downsampling pictures that, for whatever reason, are intended to be intra coded may provide a better operating point for future prediction. It is also clearly advantageous from a rate control viewpoint, at least in low-delay applications.
2) When operating a codec close to its breaking point, which at least some cable and satellite operators routinely do, ARC can be helpful even for non-intra-coded pictures, for example in scene transitions without a hard transition point.
3) Perhaps looking somewhat too far ahead: is the concept of a fixed resolution reasonable at all? With the disappearance of CRTs and the ubiquity of scaling engines in rendering devices, the hard binding between rendering resolution and codec resolution is a thing of the past. We also note studies suggesting that, when a lot of activity occurs in a video sequence, most people cannot concentrate on fine details (potentially associated with high resolution), even if the activity is spatially elsewhere. If this is correct and generally accepted, fine-granularity resolution changes may be a better rate control mechanism than adaptive QP. We raise this discussion point now because we lack data; feedback is appreciated. Of course, abandoning the concept of a fixed-resolution bitstream has countless system-layer and implementation implications, of which we are aware (at least of their existence, if not of their details).
Technically, ARC can be implemented as reference picture resampling. Implementing reference picture resampling has two main aspects: the resampling filters, and the signaling of the resampling information in the bitstream. This document focuses on the latter and touches the former only as far as our practical experience goes; more study of suitable filter designs is encouraged.
Summary of existing ARC embodiments
Fig. 11 and fig. 12 show an embodiment of a conventional ARC encoder and decoder, respectively. Our implementation allows the picture width and height to change at per-picture granularity, regardless of picture type. At the encoder, the input image data is downsampled to the picture size selected for encoding the current picture. After the first input picture is encoded as an intra picture, the decoded picture is stored in the decoded picture buffer (DPB). When a subsequent picture is downsampled at a different sampling rate and encoded as an inter picture, the reference pictures in the DPB are scaled up/down according to the spatial ratio between the reference picture size and the current picture size. At the decoder, decoded pictures are stored in the DPB without resampling. However, when used for motion compensation, the reference pictures in the DPB are scaled up/down according to the spatial ratio between the current decoded picture and the reference. When a decoded picture is output for display, it is upsampled to the original picture size or the desired output picture size. In the motion estimation/compensation process, motion vectors are scaled according to the picture size ratio and the picture order count difference.
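The reference scaling ratio and motion vector scaling described above can be sketched roughly as follows (illustrative arithmetic only; the actual codec works in sub-pel units with spec-defined rounding, which is omitted here):

```python
from fractions import Fraction

def spatial_ratio(curr_size, ref_size):
    # Per-dimension spatial ratio between the current picture and a reference,
    # i.e. the up/down-scaling factor applied to the stored reference.
    return (Fraction(curr_size[0], ref_size[0]),
            Fraction(curr_size[1], ref_size[1]))

def scale_mv(mv, curr_size, ref_size):
    """Scale a motion vector by the picture-size ratio (truncating toward
    zero; the real process additionally scales by POC distance and rounds
    in quarter- or sixteenth-pel units)."""
    rx, ry = spatial_ratio(curr_size, ref_size)
    return (int(mv[0] * rx), int(mv[1] * ry))
```

For a 1920×1080 current picture predicting from a 960×540 reference, the spatial ratio is 2 in both dimensions, so a vector (8, 4) expressed at reference scale maps to (16, 8) at current-picture scale.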
Signaling of ARC parameters
The term ARC parameters is used herein for the combination of any parameters required to operate ARC. In the simplest case, this could be a scaling factor, or an index into a table of defined scaling factors. It could be a target resolution (e.g. in samples or at maximum-CU-size granularity), or an index into a table providing target resolutions, as proposed in JVET-M0135. Also included are filter selectors for the up/downsampling filters in use, or even filter parameters (down to the filter coefficients).
From the outset, we propose here to allow, at least conceptually, different parts of a picture to use different ARC parameters. We propose that the appropriate syntax structure, per the current VVC draft, is the rectangular slice group (TG). Those using scan-order TGs would be restricted to using ARC only for the full picture, or to including the scan-order TGs in a rectangular TG (we do not recall TG nesting having been discussed so far; it is perhaps a bad idea). This can easily be specified by bitstream constraints.
Since different TGs may have different ARC parameters, the appropriate place for the ARC parameters would be either the TG header, or a parameter set with TG scope referenced by the TG header, i.e. the adaptation parameter set in the current VVC draft, or a more indirect reference (an index) into a table in a higher parameter set. Of these three options, we propose at this point that the TG header be used to code a reference to a table entry comprising the ARC parameters, that this table be located in the SPS, and that the maximum table values be coded in the (upcoming) DPS. The scaling factor could also be coded directly into the TG header without using any parameter set values. If, as we do, one takes per-slice-group signaling of ARC parameters as a design criterion, then using the PPS for the reference, as proposed in JVET-M0135, is contraindicated.
For the table entry itself, we can see a number of options:
Code the downsampling factors: one common factor for both dimensions, or independent factors for the X and Y dimensions? This is mainly a (hardware) implementation discussion; some may prefer an outcome where the scaling factor in the X dimension is quite flexible but the scaling factor in the Y dimension is fixed to 1, or chosen from very few options. We consider syntax the wrong place to express such constraints, and prefer to express constraints, where required, as conformance requirements. In other words, keep the syntax flexible.
Code the target resolutions. This is what we propose below. These resolutions could be subject to more or less complex constraints relative to the current resolution, possibly expressed as bitstream conformance requirements.
Per-slice-group downsampling is preferred in order to allow picture composition/extraction. However, this is unimportant from a signaling viewpoint. If the group makes the (in our view unwise) decision to allow ARC only at picture granularity, we can always include a bitstream conformance requirement that all TGs use the same ARC parameters.
ARC related control information. In our following design, this includes the reference picture size.
Is flexibility needed in the filter design? Are there more than a handful of code points? If so, should they go into the APS? (No, please, no more APS-update discussions.) If the downsampling filter changes but the ALF stays the same, we propose that the bitstream has to bear the overhead.
Now, to keep the proposed technology consistent and simple (as much as possible), we propose:
Fixed filter design
Target resolution in table in SPS, with bitstream constraint TBD.
Minimum/maximum target resolutions in the DPS to facilitate capability (cap) exchange/negotiation.
The resulting grammar may be as follows:
decoder parameter set RBSP syntax
[ Decoder parameter set syntax table shown as an image in the original publication ]
max_pic_width_in_luma_samples specifies the maximum width, in units of luma samples, of decoded pictures in the bitstream. max_pic_width_in_luma_samples shall not be equal to 0 and shall be an integer multiple of MinCbSizeY. The value of dec_pic_width_in_luma_samples[ i ] shall not be greater than the value of max_pic_width_in_luma_samples.
max_pic_height_in_luma_samples specifies the maximum height, in units of luma samples, of decoded pictures in the bitstream. max_pic_height_in_luma_samples shall not be equal to 0 and shall be an integer multiple of MinCbSizeY. The value of dec_pic_height_in_luma_samples[ i ] shall not be greater than the value of max_pic_height_in_luma_samples.
Sequence parameter set RBSP syntax
[ Sequence parameter set syntax table shown as an image in the original publication ]
An indication of the number of decoded picture sizes (num_dec_pic_size_in_luma_samples_minus1) and at least one decoded picture size (dec_pic_width_in_luma_samples[ i ], dec_pic_height_in_luma_samples[ i ]) are present in the SPS. The reference picture size (reference_pic_width_in_luma_samples, reference_pic_height_in_luma_samples) is present conditioned on the value of reference_pic_size_present_flag.
output_pic_width_in_luma_samples specifies the width of the output picture in units of luminance samples. output_pic_width_in_luma_samples should not be equal to 0.
output_pic_height_in_luma_samples specify the height of the output picture in units of luma samples. output_pic_height_in_luma_samples should not be equal to 0.
reference_pic_size_present_flag equal to 1 indicates that reference_pic_width_in_luma_samples and reference_pic_height_in_luma_samples are present.
reference_pic_width_in_luma_samples specifies the width of the reference picture in units of luma samples. reference_pic_width_in_luma_samples shall not be equal to 0. When not present, the value of reference_pic_width_in_luma_samples is inferred to be equal to dec_pic_width_in_luma_samples[ i ].
reference_pic_height_in_luma_samples specifies the height of the reference picture in units of luma samples. reference_pic_height_in_luma_samples shall not be equal to 0. When not present, the value of reference_pic_height_in_luma_samples is inferred to be equal to dec_pic_height_in_luma_samples[ i ].
NOTE 1 – The size of the output picture shall be equal to the values of output_pic_width_in_luma_samples and output_pic_height_in_luma_samples. The size of the reference picture, when used for motion compensation, shall be equal to the values of reference_pic_width_in_luma_samples and reference_pic_height_in_luma_samples.
num_dec_pic_size_in_luma_samples_minus1 plus 1 specifies the number of decoded picture sizes (dec_pic_width_in_luma_samples[ i ], dec_pic_height_in_luma_samples[ i ]), in units of luma samples, in the coded video sequence.
dec_pic_width_in_luma_samples[ i ] specifies the i-th width, in units of luma samples, of the decoded picture sizes in the coded video sequence. dec_pic_width_in_luma_samples[ i ] shall not be equal to 0 and shall be an integer multiple of MinCbSizeY.
dec_pic_height_in_luma_samples[ i ] specifies the i-th height, in units of luma samples, of the decoded picture sizes in the coded video sequence. dec_pic_height_in_luma_samples[ i ] shall not be equal to 0 and shall be an integer multiple of MinCbSizeY.
NOTE 2 – The i-th decoded picture size (dec_pic_width_in_luma_samples[ i ], dec_pic_height_in_luma_samples[ i ]) may be equal to the decoded picture size of a decoded picture in the coded video sequence.
Slice group header syntax
[ Slice group header syntax table shown as an image in the original publication ]
dec_pic_size_idx specifies that the width of the decoded picture shall be equal to pic_width_in_luma_samples[ dec_pic_size_idx ] and that the height of the decoded picture shall be equal to pic_height_in_luma_samples[ dec_pic_size_idx ].
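The way the slice group header index resolves against the SPS size list, together with the DPS bounds from the semantics above, can be sketched as follows (the Python containers and helper names are illustrative; only the syntax element names come from the proposal):

```python
def decoded_picture_size(sps_sizes, dec_pic_size_idx):
    """Resolve a slice group header dec_pic_size_idx against the SPS list of
    (dec_pic_width_in_luma_samples[i], dec_pic_height_in_luma_samples[i])."""
    return sps_sizes[dec_pic_size_idx]

def check_dps_bounds(sps_sizes, max_w, max_h, min_cb=4):
    # Bitstream conformance sketch per the DPS/SPS semantics above: each listed
    # size must be a nonzero multiple of MinCbSizeY (assumed 4 here) and no
    # larger than the DPS maxima.
    for w, h in sps_sizes:
        assert w > 0 and h > 0 and w % min_cb == 0 and h % min_cb == 0
        assert w <= max_w and h <= max_h

# Hypothetical SPS list (num_dec_pic_size_in_luma_samples_minus1 == 2):
sps_sizes = [(3840, 2160), (1920, 1080), (960, 540)]
check_dps_bounds(sps_sizes, max_w=3840, max_h=2160)
```

With this list, a slice group header carrying dec_pic_size_idx == 1 selects a 1920×1080 decoded picture.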
Filter
The proposed design conceptually includes four different filters: a downsampling filter from the original picture to the input picture, up- and downsampling filters to rescale reference pictures for motion estimation/compensation, and an upsampling filter from the decoded picture to the output picture. The first and last can remain non-normative matters. Within the specification, the up/downsampling filters need to be signaled explicitly or predefined in an appropriate parameter set.
Our implementation uses the downsampling filter of SHVC (SHM version 12.4), a 12-tap 2D separable filter, for downsampling and for resizing reference pictures for motion compensation. In the current implementation, only dyadic sampling is supported; the phase of the downsampling filter is therefore set to zero by default. For upsampling, an 8-tap interpolation filter with 16 phases is used to shift the phase and align the luma and chroma pixel positions with the original positions.
Tables 1 and 2 provide the 8-tap filter coefficients fL[ p, x ], with p = 0..15 and x = 0..7, for the luma upsampling process, and the 4-tap filter coefficients fC[ p, x ], with p = 0..15 and x = 0..3, for the chroma upsampling process.
Table 3 provides the 12-tap filter coefficients for the downsampling process. The same filter coefficients are used to downsample both luma and chroma.
TABLE 1 luminance upsampling filter with 16 phases
[ Table 1 coefficients shown as an image in the original publication ]
TABLE 2 chroma upsampling filter with 16 phases
[ Table 2 coefficients shown as an image in the original publication ]
TABLE 3 downsampling Filter coefficients for luminance and chrominance
[ Table 3 coefficients shown as an image in the original publication ]
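A simplified sketch of the 16-phase, 8-tap luma upsampling described above. The position/phase derivation drops the offset and rounding terms of the real SHVC process, the coefficient table is supplied by the caller (the spec tables are not reproduced), and edge handling uses simple saturation:

```python
def ref_position_and_phase(x_out, out_w, ref_w):
    """Map an output luma column to an integer reference position and one of
    the 16 filter phases (simplified; real SHVC adds offset/rounding terms)."""
    x_ref16 = (x_out * ref_w * 16) // out_w   # position in 1/16-sample units
    return x_ref16 >> 4, x_ref16 & 15         # integer position, phase 0..15

def upsample_row(row, out_w, coeffs):
    """Apply a 16-phase, 8-tap horizontal filter to one luma row.
    coeffs[phase] holds 8 taps summing to 64, so shift1 == 6 here."""
    ref_w = len(row)
    out = []
    for x in range(out_w):
        x_ref, phase = ref_position_and_phase(x, out_w, ref_w)
        acc = sum(coeffs[phase][k] * row[max(0, min(ref_w - 1, x_ref - 3 + k))]
                  for k in range(8))
        out.append(acc >> 6)
    return out
```

As a sanity check, when the output width equals the reference width, every output column lands on phase 0, and a phase-0 tap set of [0, 0, 0, 64, 0, 0, 0, 0] reproduces the input row exactly.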
We have not tried other filter designs. We expect (perhaps obviously) both subjective and objective gains when filters adapted to the content and/or scaling factors are used.
Slice group boundary discussion
Since much of the slice-group-related work is still in flux, our implementation of slice group (TG) based ARC is not yet complete. We prefer to revisit this implementation once the discussion on spatial composition and extraction of multiple sub-pictures into a composed picture in the compressed domain yields at least a working draft. However, this does not prevent us from anticipating the outcome somewhat and adapting the signaling design accordingly.
For the reasons mentioned above, we consider the slice group header the correct place for syntax such as dec_pic_size_idx, as described above. We use a single ue(v) code point, dec_pic_size_idx, conditionally present in the slice group header, to indicate the ARC parameters in use. To match our implementation of per-picture ARC only, the only thing needed in specification space is to either code only a single slice group, or to make it a bitstream conformance requirement that all TG headers of a given codec picture have the same value of dec_pic_size_idx (when present).
The parameter dec_pic_size_idx can be moved into whatever header starts a sub-picture. Our current sense is that this will likely continue to be the slice group header.
Beyond these syntax considerations, some additional work is needed to enable slice-group-based or sub-picture-based ARC. Perhaps the hardest part is how to handle the unneeded samples in a picture where a sub-picture has been resampled to a smaller size.
Consider the right-hand part of fig. 13, which consists of four sub-pictures (possibly represented as four rectangular slice groups in the bitstream syntax). On the left, the bottom-right TG has been subsampled to half size. How should the samples outside the area marked "half" be handled?
Common to some existing video codec standards is that spatial extraction of parts of a picture in the compressed domain is not supported. This implies that each sample of a picture is represented by one or more syntax elements, and that each syntax element affects at least one sample. If we want to keep that property, we need to fill the area around the samples covered by the downsampled TG marked "half". H.263+ Annex P solved this problem by padding; in fact, the sample values of the padded samples could be signaled in the bitstream (within certain strict limits).
An alternative, which would constitute a significant departure from previous assumptions, would relax the current understanding that every sample of a reconstructed picture must be represented by something in the codec picture (even if that something is just a skipped block). This may be needed anyway if we want to support sub-bitstream extraction (and composition) based on rectangular parts of a picture.
Implementation considerations, system implications and Profile/level
We propose to include basic ARC support in the "baseline/main" profiles. Sub-profiling can be used to remove it if it is not needed for certain application scenarios. Certain restrictions are acceptable. In this regard, we note that certain H.263+ profiles and "recommended modes" (which predate profiles) included support for Annex P only as an "implicit factor of 4", i.e. dyadic downsampling in both dimensions. This is sufficient to support fast start (getting the I-frame over quickly) in video conferencing.
This design leads us to believe that all filtering can be done "on the fly", with no or only a negligible increase in memory bandwidth. Insofar as we can see, there is therefore no need to relegate ARC to exotic profiles.
We do not believe that complex tables or the like can meaningfully be used in capability exchange, as discussed in Marrakech in conjunction with JVET-M0135. Assuming offer-answer and similarly depth-limited handshakes, the number of options is too large to allow meaningful cross-vendor interoperability. Realistically, to support ARC in a capability exchange scenario in a meaningful way, we are back to selecting at most a handful of interoperability points, for example: no ARC, ARC with implicit factor of 4, full ARC. Alternatively, we could mandate support for all of ARC and leave the restriction of bitstream complexity to higher-level SDOs. In any case, this is a strategic discussion we should have (beyond what already exists in the context of sub-profiles and flags).
As for levels: we believe the basic design principle must be that the sample count of a picture after upsampling must fit within the level of the bitstream (as a condition of bitstream conformance), no matter how many samples are signaled in the bitstream. We note that this was not the case in H.263+, where no such bound may have existed.
1.2.5.JVET-N0118
The following aspects are presented:
1) The picture resolution list is signaled in the SPS and the list index is signaled in the PPS to specify the size of the individual picture.
2) For any picture to be output, the decoded picture before resampling is cropped (as needed) and output, i.e., the resampled picture is not used for output, only for inter prediction reference.
3) Resampling ratios of 1.5x and 2x are supported. Arbitrary resampling ratios are not supported. Whether one or more additional resampling ratios are needed is left for further study.
4) Between picture-level resampling and block-level resampling, the proponents prefer block-level resampling.
a. However, if picture level resampling is selected, the following is proposed:
i. When a reference picture is resampled, both the resampled version and the original version of the reference picture are stored in the DPB, and thus both affect DPB fullness.
When a corresponding non-resampled reference picture is marked as "unused for reference", the resampled reference picture is marked as "unused for reference".
The RPL signaling syntax remains unchanged and the RPL construction process is modified as follows: when a reference picture needs to be included in the RPL entry and a version of the reference picture having the same resolution as the current picture is not in the DPB, a picture resampling process is invoked and a resampled version of the reference picture is included in the RPL entry.
The number of resampled reference pictures that may be present in the DPB should be limited, e.g., to less than or equal to 2.
b. Otherwise (block level resampling is selected), the following is suggested:
i. to limit the complexity of the worst-case decoder, it is proposed that bi-prediction is not allowed for blocks from reference pictures having different resolutions than the current picture.
Another option is to combine the two filters and apply them in a single step when both resampling and quarter-pixel interpolation are needed.
5) Whether a picture-based resampling method or a block-based resampling method is selected, it is proposed to apply temporal motion vector scaling as required.
1.2.5.1. Description of the embodiments
ARC software was implemented on the basis of VTM 4.0.1 with the following modifications:
- The list of supported resolutions is signaled in the SPS.
-spatial resolution signaling moves from SPS to PPS.
-implementing a picture-based resampling scheme for resampling reference pictures.
After the picture is decoded, the reconstructed picture may be resampled to a different spatial resolution. Both the original reconstructed picture and the resampled reconstructed picture are stored in the DPB and are available for reference by future pictures in decoding order.
The resampling filter implemented is based on the filter tested in JCTVC-H0234, as follows:
o Upsampling filter: 4-tap, +/- quarter-phase DCTIF with coefficients (-4, 54, 16, -2)/64
o Downsampling filter: h11 filter with coefficients (1, 0, -3, 0, 10, 16, 10, 0, -3, 0, 1)/32
When constructing the reference picture list (i.e. L0 and L1) of the current picture, only reference pictures with the same resolution as the current picture are used. Note that the reference picture may be the original size or the resampled size.
- TMVP and ATMVP may be enabled; however, when the original coding resolutions of the current picture and the reference picture differ, TMVP and ATMVP are disabled for that reference picture.
For convenience and simplicity of the initial software implementation, the decoder outputs the highest available resolution when outputting a picture.
1.2.5.2. Signaling and picture output regarding picture size
1. On a spatial resolution list of encoded and decoded pictures in a bitstream
Currently, all coded pictures in a CVS have the same resolution. Thus, signaling only one resolution (i.e., width and height of the picture) is straightforward in SPS. When ARC support is used, instead of using one resolution, a picture resolution list needs to be signaled, and we propose to signal this list in SPS and the index of this list in PPS to specify the size of the individual pictures.
2. With respect to picture output
We propose that, for any picture to be output, the decoded picture before resampling be cropped (as needed) and output; i.e., a resampled picture is not used for output, only for inter prediction reference. The ARC resampling filters should be designed to optimize the use of resampled pictures for inter prediction, and such filters may not be optimal for picture output/display purposes, whereas video terminal devices typically already have optimized output zooming/scaling functionality implemented.
1.2.5.3. With respect to resampling
The resampling of a decoded picture may be picture-based or block-based. For the final ARC design in VVC, we prefer block-based resampling over picture-based resampling. We recommend that both methods be discussed and that JVET decide which of the two should be specified for ARC support in VVC.
Picture-based resampling
In picture-based ARC resampling, a picture is resampled only once for a particular resolution and then stored in the DPB, while the non-resampled version of the same picture is also retained in the DPB.
There are two problems with ARC using picture-based resampling: 1) An additional DPB buffer is required to store the resampled reference picture, and 2) additional memory bandwidth is required due to the increased operations of reading reference picture data from and writing reference picture data to the DPB.
Keeping only one version of a reference picture in the DPB is not a good idea for picture-based resampling. If we keep only the non-resampled version, a reference picture may need to be resampled multiple times, because multiple pictures may refer to the same reference picture. On the other hand, if a reference picture is resampled and we keep only the resampled version, we need to apply inverse resampling when the reference picture must be output, since, as described above, it is preferable to output the non-resampled picture. This is a problem because the resampling process is not a lossless operation. Take a picture A, downsample it, and then upsample it to obtain A' with the same resolution as A; A and A' are not identical: A' carries less information than A, because some high-frequency information is lost during the downsampling and upsampling processes.
To address the problem of additional DPB buffers and memory bandwidth, we propose that if the ARC design in VVC uses picture-based resampling, the following conditions apply:
1. When a reference picture is resampled, both the resampled version and the original, non-resampled version of the reference picture are stored in the DPB, and thus both affect DPB fullness.
2. The resampled reference picture is marked as "unused for reference" when the corresponding non-resampled reference picture is marked as "unused for reference".
3. The Reference Picture List (RPL) of each slice group contains reference pictures having the same resolution as the current picture. While the RPL signaling syntax need not change, the RPL construction process is modified to ensure what is stated in the previous sentence, as follows: when a reference picture needs to be included in an RPL entry and a version of that reference picture with the same resolution as the current picture is not yet available, the picture resampling process is invoked and the resampled version of the reference picture is included.
4. The number of resampled reference pictures that may be present in the DPB should be limited, e.g. less than or equal to 2.
Furthermore, to enable temporal MV usage (e.g., merge mode and ATMVP) in case the temporal MV comes from a reference frame of a different resolution than the current resolution, we propose to scale the temporal MV to the current resolution as needed.
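As a non-normative illustration, the temporal MV scaling described above can be sketched as follows; the function names and the round-half-away-from-zero rounding convention are assumptions of this sketch, not taken from the proposal:

```python
def _round_div(num: int, den: int) -> int:
    # Round to nearest, half away from zero, computed on the magnitude.
    sign = -1 if num < 0 else 1
    return sign * ((abs(num) + den // 2) // den)

def scale_temporal_mv(mv, ref_size, cur_size):
    # mv: (mvx, mvy) taken from a reference frame of size ref_size (w, h);
    # returns the vector rescaled to the current picture grid cur_size (w, h).
    ref_w, ref_h = ref_size
    cur_w, cur_h = cur_size
    mvx, mvy = mv
    return _round_div(mvx * cur_w, ref_w), _round_div(mvy * cur_h, ref_h)
```

For example, a vector taken from a 1920x1080 reference frame and used in a 960x540 current picture is halved in both components.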
Block-based ARC resampling
In block-based resampling for ARC, a reference block is resampled whenever needed, and no resampled picture is stored in the DPB.
The main problem here is the additional decoder complexity. This is because a block in a reference picture can be referenced multiple times by multiple blocks in another picture and by blocks in multiple pictures.
When a block in the reference picture is referenced by a block in the current picture and the resolutions of the reference picture and the current picture differ, the reference block is resampled by invoking an interpolation filter so that it is available at integer-pixel resolution. When the motion vector points to a quarter-pixel position, the interpolation process is invoked again to obtain the resampled reference block at quarter-pixel resolution. Thus, for each motion compensation operation of a current block from reference blocks of a different resolution, up to two interpolation filtering operations are needed instead of one. Without ARC support, at most one interpolation filtering operation is needed (i.e., to generate the reference block at quarter-pixel resolution).
To limit the worst case complexity, we propose that if the ARC design of VVC uses block-based resampling, the following applies:
Bi-prediction is not allowed for a block whose reference picture has a different resolution than the current picture.
More precisely, the constraint is as follows: for a current block blkA in the current picture picA to refer to a reference block blkB in the reference picture picB, when picA and picB have different resolutions, blkA shall be a uni-prediction block.
Under this constraint, the worst-case number of interpolation operations needed to decode a block is limited to two. If a block references a block from a picture of a different resolution, two interpolation operations are needed, as described above. This is the same as when a block references reference blocks from same-resolution pictures and is coded as a bi-prediction block, since the number of interpolation operations is also two (i.e., one to obtain the quarter-pixel resolution version of each reference block).
To simplify the implementation, we propose another variation, if the ARC design of VVC uses block-based resampling, the following applies:
if the reference frame and the current frame have different resolutions, the corresponding position of each pixel of the predicted value is first calculated, and then interpolation is applied only once. That is, two interpolation operations (i.e., one for resampling and one for quarter-pixel interpolation) are combined into only one interpolation operation. The sub-pixel interpolation filter in the current VVC may be reused, but in this case the granularity of interpolation should be enlarged, but the number of interpolation operations is reduced from two to one.
To enable temporal MV use (e.g., merge mode and ATMVP) in case the temporal MV comes from a reference frame of a different resolution than the current resolution, we propose to scale the temporal MV to the current resolution as required.
Resampling ratio
In JVET-M0135, to start the discussion on ARC, it was proposed to consider only a 2x resampling ratio initially (meaning 2x2 for upsampling and 1/2x1/2 for downsampling). From further discussion of this topic after the Marrakech meeting, we learned that supporting only a 2x resampling ratio is very limiting, as in some cases a smaller difference between the resampled and non-resampled resolutions is more beneficial.
While it may be desirable to support any resampling ratio, it appears to be difficult to support. This is because the number of resampling filters that must be defined and implemented in order to support an arbitrary resampling ratio seems too high and places a great burden on the decoder implementation.
We propose that more than one, but only a small number of, resampling ratios be supported (at least the 1.5x and 2x resampling ratios), and that arbitrary resampling ratios not be supported.
1.2.5.4. Maximum DPB buffer size and buffer fullness
With ARC, the DPB may contain decoded pictures of different spatial resolutions within the same CVS. For DPB management and related aspects, counting DPB size and fullness in units of decoded pictures no longer works.
If ARC is supported, the following is a discussion of some specific aspects of the final VVC specification that need to be addressed and possible solutions (we do not propose to employ the possible solutions at this meeting):
1. Instead of deriving MaxDpbSize (i.e., the maximum number of reference pictures that may be present in the DPB) from the value of PicSizeInSamplesY (where PicSizeInSamplesY = pic_width_in_luma_samples * pic_height_in_luma_samples), MaxDpbSize is derived from the value of MinPicSizeInSamplesY. MinPicSizeInSamplesY is defined as follows:
MinPicSizeInSamplesY = (width of the smallest picture resolution in the bitstream) * (height of the smallest picture resolution in the bitstream)
The derivation of MaxDpbSize is modified as follows (based on the HEVC equation):

if( MinPicSizeInSamplesY <= ( MaxLumaPs >> 2 ) )
    MaxDpbSize = Min( 4 * maxDpbPicBuf, 16 )
Otherwise, if( MinPicSizeInSamplesY <= ( MaxLumaPs >> 1 ) )
    MaxDpbSize = Min( 2 * maxDpbPicBuf, 16 )
Otherwise, if( MinPicSizeInSamplesY <= ( ( 3 * MaxLumaPs ) >> 2 ) )
    MaxDpbSize = Min( ( 4 * maxDpbPicBuf ) / 3, 16 )
Otherwise
    MaxDpbSize = maxDpbPicBuf
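A minimal, non-normative sketch of this modified derivation; the lowercased names and the maxDpbPicBuf default of 6 (the HEVC value) are assumptions of this sketch:

```python
def max_dpb_size(min_pic_size_in_samples_y: int, max_luma_ps: int,
                 max_dpb_pic_buf: int = 6) -> int:
    # Derivation driven by MinPicSizeInSamplesY instead of PicSizeInSamplesY;
    # max_luma_ps is the level's MaxLumaPs.
    if min_pic_size_in_samples_y <= (max_luma_ps >> 2):
        return min(4 * max_dpb_pic_buf, 16)
    elif min_pic_size_in_samples_y <= (max_luma_ps >> 1):
        return min(2 * max_dpb_pic_buf, 16)
    elif min_pic_size_in_samples_y <= ((3 * max_luma_ps) >> 2):
        return min((4 * max_dpb_pic_buf) // 3, 16)
    return max_dpb_pic_buf
```

The smaller the smallest picture in the bitstream relative to the level's maximum luma picture size, the more pictures (capped at 16) the DPB may hold.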
2. Each decoded picture is associated with a value called PictureSizeUnit. PictureSizeUnit is an integer value that specifies how large the decoded picture size is relative to MinPicSizeInSamplesY. The definition of PictureSizeUnit depends on the resampling ratios that ARC supports in VVC.
For example, if ARC supports only a resampling ratio of 2, PictureSizeUnit is defined as follows:
A decoded picture with the smallest resolution in the bitstream is associated with a PictureSizeUnit of 1.
A decoded picture whose resolution is 2-by-2 the smallest resolution in the bitstream is associated with a PictureSizeUnit of 4 (i.e., 1*4).
For another example, if ARC supports both the 1.5 and 2 resampling ratios, PictureSizeUnit is defined as follows:
A decoded picture with the smallest resolution in the bitstream is associated with a PictureSizeUnit of 4.
A decoded picture whose resolution is 1.5-by-1.5 the smallest resolution in the bitstream is associated with a PictureSizeUnit of 9 (i.e., 2.25*4).
A decoded picture whose resolution is 2-by-2 the smallest resolution in the bitstream is associated with a PictureSizeUnit of 16 (i.e., 4*4).
For any other resampling ratios supported by ARC, the PictureSizeUnit value for each picture size should be determined using the same principles as in the examples above.
3. Let the variable MinPictureSizeUnit be the smallest possible value of PictureSizeUnit. That is, if ARC supports only a resampling ratio of 2, MinPictureSizeUnit is 1; if ARC supports the resampling ratios 1.5 and 2, MinPictureSizeUnit is 4; the same principle is used to determine the value of MinPictureSizeUnit in all other cases.
The value range of sps_max_dec_pic_buffering_minus1[ i ] is specified as 0 to MinPictureSizeUnit * ( MaxDpbSize - 1 ). The variable MinPictureSizeUnit is the smallest possible value of PictureSizeUnit.
DPB fullness operations are specified based on PictureSizeUnit as follows:
HRD is initialized at decoding unit 0, with both CPB and DPB set to null (DPB fullness set to 0).
When flushing (flush) the DPB (i.e. removing all pictures from the DPB), the DPB fullness is set to 0.
When a picture is removed from the DPB, the DPB fullness is decreased by the PictureSizeUnit value associated with the removed picture.
When a picture is inserted into the DPB, the DPB fullness is increased by the PictureSizeUnit value associated with the inserted picture.
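The bookkeeping rules above can be illustrated with a small, hypothetical sketch (the class and method names are ours, not from the proposal):

```python
class Dpb:
    # Sketch of PictureSizeUnit-based fullness bookkeeping: fullness is
    # counted in PictureSizeUnit rather than in whole decoded pictures.
    def __init__(self):
        self.pics = []      # list of (picture_id, picture_size_unit)
        self.fullness = 0   # starts at 0: HRD initialization, empty DPB

    def insert(self, pic_id, picture_size_unit):
        self.pics.append((pic_id, picture_size_unit))
        self.fullness += picture_size_unit

    def remove(self, pic_id):
        for i, (pid, unit) in enumerate(self.pics):
            if pid == pic_id:
                self.fullness -= unit
                del self.pics[i]
                return

    def flush(self):
        # Removing all pictures sets the fullness back to 0.
        self.pics.clear()
        self.fullness = 0
```

With the supported ratios {1.5, 2}, inserting a smallest-resolution picture adds 4 units to the fullness, while a 2-by-2 picture adds 16.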
1.2.5.5. Resampling filter
In the software implementation, the resampling filters used are simply taken from the previously available filters described in JCTVC-H0234. Other resampling filters should be tested and used if they provide better performance and/or lower complexity. We propose testing various resampling filters to trade off complexity against performance. Such testing can be done in a CE.
1.2.5.6. Other necessary modifications to existing tools
Some existing codec tools may require modifications and/or additional operations in order to support ARC. For example, in the picture-based resampling implemented in the ARC software, TMVP and ATMVP are disabled for simplicity when the original coding resolutions of the current picture and the reference picture differ.
1.2.6.JVET-N0279
According to the "Requirements for a Future Video Coding Standard", the standard shall "support fast representation switching in the case of adaptive streaming services that offer multiple representations of the same content, each having different properties (e.g., spatial resolution or sample bit depth)". In real-time video communication, allowing the resolution to change within a coded video sequence without inserting an I picture lets the video data adapt seamlessly to dynamic channel conditions or user preferences and removes the jitter effect caused by I pictures. A hypothetical example of adaptive resolution change is shown in fig. 14, where the current picture is predicted from reference pictures of different sizes.
This document proposes high-level syntax to signal adaptive resolution change, as well as modifications to the current motion-compensated prediction process in the VTM. The modifications are limited to motion vector scaling and sub-pixel position derivation, without changing the existing motion-compensation interpolators. This allows the existing interpolators to be reused and avoids new processing blocks to support adaptive resolution change, which would otherwise introduce additional cost.
1.2.6.1. Adaptive resolution change signaling
1.2.6.1.1.SPS
[Figure GDA0003390345630000331: SPS syntax table]
[[ pic_width_in_luma_samples specifies the width of each decoded picture in units of luma samples. pic_width_in_luma_samples should not be equal to 0 and should be an integer multiple of MinCbSizeY. ]]
[[ pic_height_in_luma_samples specifies the height of each decoded picture in units of luma samples. pic_height_in_luma_samples should not be equal to 0 and should be an integer multiple of MinCbSizeY. ]]
max_pic_width_in_luma_samples specifies the maximum width of the decoded picture of the reference SPS in units of luma samples. max_pic_width_in_luma_samples should not be equal to 0 and should be an integer multiple of MinCbSizeY.
max_pic_height_in_luma_samples specifies the maximum height of the decoded picture of the reference SPS in units of luma samples. max_pic_height_in_luma_samples should not be equal to 0 and should be an integer multiple of MinCbSizeY.
1.2.6.1.2.PPS
[Figure GDA0003390345630000341: PPS syntax table]
pic_size_difference_from_max_flag equal to 1 specifies that the PPS signals a picture width or a picture height different from max_pic_width_in_luma_samples and max_pic_height_in_luma_samples in the referenced SPS. pic_size_difference_from_max_flag equal to 0 specifies that pic_width_in_luma_samples and pic_height_in_luma_samples are the same as max_pic_width_in_luma_samples and max_pic_height_in_luma_samples in the referenced SPS.
pic_width_in_luma_samples specifies the width of each decoded picture in units of luma samples. pic_width_in_luma_samples should not be equal to 0 and should be an integer multiple of MinCbSizeY. When pic_width_in_luma_samples is not present, it is inferred to be equal to max_pic_width_in_luma_samples.
pic_height_in_luma_samples specifies the height of each decoded picture in units of luma samples. pic_height_in_luma_samples should not be equal to 0 and should be an integer multiple of MinCbSizeY. When pic_height_in_luma_samples is not present, it is inferred to be equal to max_pic_height_in_luma_samples.
It is a requirement of bitstream conformance that the horizontal and vertical scaling ratios shall be in the range of 1/8 to 2, inclusive, for every active reference picture. The scaling ratios are defined as follows:
– horizontal_scaling_ratio = ( ( reference_pic_width_in_luma_samples << 14 ) + ( pic_width_in_luma_samples / 2 ) ) / pic_width_in_luma_samples
– vertical_scaling_ratio = ( ( reference_pic_height_in_luma_samples << 14 ) + ( pic_height_in_luma_samples / 2 ) ) / pic_height_in_luma_samples
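For illustration, the 14-bit fixed-point ratio computation and the conformance check can be sketched as follows (the function names are ours; in this representation, the range [1/8, 2] maps to [1 << 11, 1 << 15]):

```python
def scaling_ratios(ref_w, ref_h, cur_w, cur_h):
    # 14-bit fixed-point ratios, following the two formulas above.
    hor = ((ref_w << 14) + (cur_w >> 1)) // cur_w
    ver = ((ref_h << 14) + (cur_h >> 1)) // cur_h
    return hor, ver

def ratios_in_range(hor, ver):
    # Conformance range [1/8, 2] in 14-bit fixed point.
    lo, hi = 1 << 11, 1 << 15
    return lo <= hor <= hi and lo <= ver <= hi
```

A reference picture of identical size yields ratio 1 << 14 in both directions; a reference four times as wide and tall as the current picture falls outside the allowed range.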
[Figure GDA0003390345630000351: syntax table]
Reference picture scaling process
When the resolution within a CVS changes, a picture may have a different size from one or more of its reference pictures. This proposal normalizes all motion vectors to the current picture grid rather than to their corresponding reference picture grids. This is asserted to be beneficial for keeping the design consistent and for making resolution changes transparent to the motion vector prediction process. Otherwise, neighboring motion vectors pointing to reference pictures of different sizes could not be used directly for spatial motion vector prediction, owing to the different scaling.
When the resolution changes, both the motion vectors and the reference blocks must be scaled when performing motion-compensated prediction. The scaling range is limited to [1/8, 2], i.e., the up-scaling is limited to 1:8 and the down-scaling is limited to 2:1. Note that up-scaling refers to the case where the reference picture is smaller than the current picture, while down-scaling refers to the case where the reference picture is larger than the current picture. The following sections describe the scaling process in more detail.
Luminance block
The scaling factor and its fixed point representation are defined as:
hori_scale_fp = ( ( reference_pic_width_in_luma_samples << 14 ) + ( pic_width_in_luma_samples / 2 ) ) / pic_width_in_luma_samples
vert_scale_fp = ( ( reference_pic_height_in_luma_samples << 14 ) + ( pic_height_in_luma_samples / 2 ) ) / pic_height_in_luma_samples
the scaling process includes two parts:
1. mapping an upper left corner pixel of the current block to a reference picture;
2. the horizontal and vertical steps are used to address the reference positions of other pixels of the current block.
If the coordinates of the upper-left corner pixel of the current block are (x, y), the sub-pixel position (x', y') in the reference picture pointed to by the motion vector (mvX, mvY) in 1/16-pixel units is specified as follows:
the horizontal position in the reference picture is
x′=((x<<4)+mvX)·hori_scale_fp, (3)
and x' is further scaled down to retain only 10 fractional bits:
x′=Sign(x′)·((Abs(x′)+(1<<7))>>8)。 (4)
Similarly, the vertical position in the reference picture is
y′=((y<<4)+mvY)·vert_scale_fp (5)
And y' is scaled further down to
y′=Sign(y′)·((Abs(y′)+(1<<7))>>8)。 (6)
At this point, the reference position of the upper-left corner pixel of the current block is (x', y'). The other reference sub-pixel/pixel positions are calculated relative to (x', y') using horizontal and vertical step sizes. These step sizes are derived with 1/1024-pixel accuracy from the horizontal and vertical scaling factors above, as follows:
x_step=(hori_scale_fp+8)>>4, (7)
y_step=(vert_scale_fp+8)>>4。 (8)
For example, if a pixel in the current block is i columns and j rows away from the upper-left pixel, the horizontal and vertical coordinates of its corresponding reference pixel are derived as
x′i = x′ + i*x_step, (9)
y′j = y′ + j*y_step. (10)
In sub-pixel interpolation, x′i and y′j must be split into a full-pixel portion and a fractional-pixel portion:
The full-pixel portion of the addressed reference block is equal to
(x′i+32)>>10, (11)
(y′j+32)>>10. (12)
The fractional part, used to select the interpolation filter, is equal to
Δx=((x′i+32)>>6)&15, (13)
Δy=((y′j+32)>>6)&15. (14)
Once the full-pixel and fractional-pixel positions in the reference picture are determined, the existing motion-compensation interpolators can be used without any additional changes: the full-pixel position is used to fetch the reference block patch from the reference picture, and the fractional-pixel position is used to select the appropriate interpolation filter.
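The luma position derivation of equations (3)-(14) can be collected into one non-normative sketch (integer arithmetic as specified; the function name is ours):

```python
def luma_ref_positions(x, y, mv_x, mv_y, hori_scale_fp, vert_scale_fp, i, j):
    # Map the block's upper-left pixel (x, y) plus a 1/16-pel MV into the
    # reference picture, step to the pixel at column i / row j, then split
    # the result into full-pixel and fractional-pixel parts.
    def down(v):  # equations (4)/(6): keep 10 fractional bits
        sign = -1 if v < 0 else 1
        return sign * ((abs(v) + (1 << 7)) >> 8)

    xp = down(((x << 4) + mv_x) * hori_scale_fp)           # (3)-(4)
    yp = down(((y << 4) + mv_y) * vert_scale_fp)           # (5)-(6)
    x_step = (hori_scale_fp + 8) >> 4                      # (7)
    y_step = (vert_scale_fp + 8) >> 4                      # (8)
    xi = xp + i * x_step                                   # (9)
    yj = yp + j * y_step                                   # (10)
    full = ((xi + 32) >> 10, (yj + 32) >> 10)              # (11)-(12)
    frac = (((xi + 32) >> 6) & 15, ((yj + 32) >> 6) & 15)  # (13)-(14)
    return full, frac
```

With a 1:1 ratio (hori_scale_fp = vert_scale_fp = 1 << 14), an MV of 16 (one full pixel) lands on full-pixel position 1 with fractional phase 0, and an MV of 4 (a quarter pixel) lands on full-pixel position 0 with fractional phase 4.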
Chroma block
When the chroma format is 4:2:0, chroma motion vectors have 1/32-pixel precision. The scaling process for chroma motion vectors and chroma reference blocks is almost the same as for luma blocks, except for chroma-format-related adjustments.
When the coordinates of the upper-left corner pixel of the current chroma block are (xc, yc), the initial horizontal and vertical positions in the reference chroma picture are
x′c=((xc<<5)+mvX)·hori_scale_fp, (1)
y′c=((yc<<5)+mvY)·vert_scale_fp, (2)
where mvX and mvY are the original luma motion vectors, now interpreted with 1/32-pixel precision.
x′c and y′c are then further scaled down to maintain 1/1024-pixel accuracy:
x′c=Sign(x′c)·((Abs(x′c)+(1<<8))>>9), (3)
y′c=Sign(y′c)·((Abs(y′c)+(1<<8))>>9). (4)
Compared with the corresponding luma formulas, the right shift uses one extra bit.
The step sizes used are the same as for luma. For the chroma pixel located at (i, j) relative to the upper-left corner pixel, the horizontal and vertical coordinates of its reference pixel are derived as
x′ci=x′c+i*x_step, (5)
y′cj=y′c+j*y_step. (6)
In sub-pixel interpolation, x′ci and y′cj are also split into a full-pixel portion and a fractional-pixel portion:
The full-pixel portion of the addressed reference block is equal to
(x′ci+16)>>10, (7)
(y′cj+16)>>10. (8)
The fractional part, used to select the interpolation filter, is equal to
Δx=((x′ci+16)>>5)&31, (9)
Δy=((y′cj+16)>>5)&31. (10)
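The chroma derivation differs from luma only in the shifts and masks; a matching non-normative sketch (same caveats as for the luma sketch):

```python
def chroma_ref_positions(xc, yc, mv_x, mv_y,
                         hori_scale_fp, vert_scale_fp, i, j):
    # Chroma counterpart of the luma mapping: 1/32-pel MVs, one extra bit in
    # the scale-down shift (equations (3)-(4)), and 5-bit fractional phases.
    def down(v):
        sign = -1 if v < 0 else 1
        return sign * ((abs(v) + (1 << 8)) >> 9)

    xp = down(((xc << 5) + mv_x) * hori_scale_fp)            # (1), (3)
    yp = down(((yc << 5) + mv_y) * vert_scale_fp)            # (2), (4)
    x_step = (hori_scale_fp + 8) >> 4                        # same as luma
    y_step = (vert_scale_fp + 8) >> 4
    xci = xp + i * x_step                                    # (5)
    ycj = yp + j * y_step                                    # (6)
    full = ((xci + 16) >> 10, (ycj + 16) >> 10)              # (7)-(8)
    frac = (((xci + 16) >> 5) & 31, ((ycj + 16) >> 5) & 31)  # (9)-(10)
    return full, frac
```

At a 1:1 ratio, a chroma MV of 32 (one full pixel in 1/32-pel units) lands on full-pixel position 1 with fractional phase 0.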
Interaction with other codec tools
Since the interaction of some codec tools with reference picture scaling introduces additional complexity and memory bandwidth, it is recommended to add the following restrictions to the VVC specification:
When tile_group_temporal_mvp_enabled_flag is equal to 1, the current picture and its collocated picture shall have the same size.
Decoder-side motion vector refinement shall be turned off when resolution change is allowed in the sequence.
When resolution change is allowed in the sequence, sps_bdof_enabled_flag shall be equal to 0.
1.3. Coding tree block (CTB)-based adaptive loop filter in JVET-N0415
Slice-level temporal filter
Adaptive parameter sets (APS) were adopted in VTM4. Each APS contains one set of signaled ALF filters, and up to 32 APSs are supported. In this proposal, a slice-level temporal filter is tested. A slice group can reuse ALF information from an APS to reduce overhead. The APSs are updated as a first-in-first-out (FIFO) buffer.
ALF based on CTB
For the luma component, when ALF is applied to a luma CTB, a choice among 16 fixed, 5 temporal, or 1 signaled filter set is indicated. Only the filter set index is signaled. For one slice, only one new set of 25 filters can be signaled. If a new set is signaled for a slice, all luma CTBs in the same slice share that set. The fixed filter sets may be used to predict a new slice-level filter set and may also serve as candidate filter sets for a luma CTB. The total number of filters is 64.
For the chroma components, when ALF is applied to a chroma CTB, if a new filter is signaled for the slice, the CTB uses the new filter; otherwise, the most recent temporal chroma filter satisfying the temporal scalability constraint is applied.
As with the slice-level temporal filter, the APSs are updated as a first-in-first-out (FIFO) buffer.
1.4. Alternative temporal motion vector prediction (also known as subblock-based temporal Merge candidate in VVC)
In the alternative temporal motion vector prediction (ATMVP) method, temporal motion vector prediction (TMVP) is modified by fetching multiple sets of motion information (including motion vectors and reference indices) from blocks smaller than the current CU. As shown in fig. 15, a sub-CU is a square N×N block (N is set to 8 by default).
ATMVP predicts the motion vectors of sub-CUs within a CU in two steps. The first step is to identify the corresponding block in the reference picture with a so-called temporal vector. The reference picture is also called a motion source picture. The second step is to divide the current CU into sub-CUs and obtain a motion vector and a reference index of each sub-CU from a block corresponding to each sub-CU, as shown in fig. 15.
In the first step, the reference picture and the corresponding block are determined from the motion information of the spatial neighboring blocks of the current CU. To avoid an iterative scanning process over neighboring blocks, the Merge candidate from block A0 (the left block) in the Merge candidate list of the current CU is used. The first available motion vector from block A0 that references the collocated reference picture is set as the temporal vector. In this way, in ATMVP, the corresponding block (sometimes called the collocated block) can be identified more accurately than in TMVP, where the corresponding block is always at the bottom-right or center position relative to the current CU.
In the second step, the corresponding block of each sub-CU is identified by the temporal vector in the motion source picture, by adding the temporal vector to the coordinates of the current CU. For each sub-CU, the motion information of its corresponding block (the smallest motion grid covering the center sample) is used to derive the motion information of the sub-CU. After the motion information of the corresponding N×N block is identified, it is converted into the motion vectors and reference indices of the current sub-CU in the same way as the TMVP of HEVC, with motion scaling and other procedures applying.
1.5. Affine motion prediction
In HEVC, only a translational motion model is applied to motion-compensated prediction (MCP). In the real world, however, there are many kinds of motion, e.g., zoom in/out, rotation, perspective motion, and other irregular motion. In VVC, a simplified affine transform motion-compensated prediction with a 4-parameter affine model and a 6-parameter affine model is applied. As shown in figs. 16A-16B, the affine motion field of a block is described by two control point motion vectors (CPMV) for the 4-parameter affine model and by three CPMVs for the 6-parameter affine model.
The motion vector field (MVF) of a block is described by the following equations: the 4-parameter affine model in equation (1), where the 4 parameters are defined as the variables a, b, e, and f, and the 6-parameter affine model in equation (2), where the 6 parameters are defined as the variables a, b, c, d, e, and f:
mv^h(x,y) = a·x - b·y + e = ((mv1^h - mv0^h)/w)·x - ((mv1^v - mv0^v)/w)·y + mv0^h
mv^v(x,y) = b·x + a·y + f = ((mv1^v - mv0^v)/w)·x + ((mv1^h - mv0^h)/w)·y + mv0^v (1)
mv^h(x,y) = a·x + c·y + e = ((mv1^h - mv0^h)/w)·x + ((mv2^h - mv0^h)/h)·y + mv0^h
mv^v(x,y) = b·x + d·y + f = ((mv1^v - mv0^v)/w)·x + ((mv2^v - mv0^v)/h)·y + mv0^v (2)
where (mv0^h, mv0^v) is the motion vector of the upper-left corner control point, (mv1^h, mv1^v) is the motion vector of the upper-right corner control point, and (mv2^h, mv2^v) is the motion vector of the lower-left corner control point; all three are called control point motion vectors (CPMV). (x, y) represents the coordinates of a representative point relative to the upper-left sample within the current block, and (mv^h(x,y), mv^v(x,y)) is the motion vector derived for the sample located at (x, y). The CP motion vectors may be signaled (as in affine AMVP mode) or derived on-the-fly (as in affine Merge mode). w and h are the width and height of the current block. In practice, the division is implemented by right-shift with rounding. In the VTM, the representative point is defined as the center position of a sub-block; for example, when the coordinates of the upper-left corner of a sub-block relative to the upper-left sample within the current block are (xs, ys), the coordinates of the representative point are defined as (xs+2, ys+2). For each sub-block (i.e., 4×4 in the VTM), the representative point is used to derive the motion vector of the whole sub-block.
To further simplify motion compensated prediction, sub-block based affine transform prediction is applied. To derive the motion vector of each M×N sub-block (M and N are both set to 4 in the current VVC), the motion vector of the center sample of each sub-block, as shown in fig. 17, is calculated according to equations (1) and (2) and rounded to 1/16 fractional accuracy. Then the motion compensated interpolation filters for 1/16-pel are applied to generate the prediction of each sub-block with the derived motion vector. The 1/16-pel interpolation filters are introduced by the affine mode.
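The sub-block motion vector derivation described above can be sketched as follows. This is an illustrative Python sketch of the 4-parameter model of equation (1), not normative VVC code; the function and variable names are our own. CPMVs are assumed to be given in 1/16-pel units, and the representative point of each 4×4 sub-block is its center (xs+2, ys+2) as stated above.

```python
def affine_subblock_mvs(cpmv0, cpmv1, w, h, sb=4):
    """cpmv0/cpmv1: (mvx, mvy) at the top-left / top-right control points,
    in 1/16-pel units. w, h: current block width/height in samples.
    Returns a dict mapping each sub-block's top-left (xs, ys) to its MV."""
    a = (cpmv1[0] - cpmv0[0]) / w   # horizontal gradient of mv_x
    b = (cpmv1[1] - cpmv0[1]) / w   # horizontal gradient of mv_y
    mvs = {}
    for ys in range(0, h, sb):
        for xs in range(0, w, sb):
            x, y = xs + sb // 2, ys + sb // 2   # representative (center) point
            mvx = a * x - b * y + cpmv0[0]      # equation (1), horizontal part
            mvy = b * x + a * y + cpmv0[1]      # equation (1), vertical part
            mvs[(xs, ys)] = (round(mvx), round(mvy))  # keep 1/16-pel integers
    return mvs

# Toy zoom-like motion field: top-right CPMV differs from top-left CPMV.
mvs = affine_subblock_mvs((0, 0), (16, 0), w=16, h=16)
```

With identical CPMVs the model degenerates to pure translation and every sub-block receives the same motion vector, which is a quick sanity check for the sketch.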
After MCP, the high precision motion vector for each sub-block is rounded and saved to the same precision as the normal motion vector.
1.5.1. Signaling of affine predictions
Similar to the translational motion model, there are also two modes for signaling the side information for affine prediction: the AFFINE_INTER and AFFINE_MERGE modes.
1.5.2. AF_INTER mode
For CUs with both width and height larger than 8, the AF_INTER mode can be applied. An affine flag at the CU level is signaled in the bitstream to indicate whether the AF_INTER mode is used.
In this mode, for each reference picture list (list 0 or list 1), an affine AMVP candidate list is constructed with three types of affine motion prediction values in the following order, wherein each candidate includes the estimated CPMVs of the current block. The differences between the best CPMVs found at the encoder side (such as $mv_0$, $mv_1$, $mv_2$ in fig. 18) and the estimated CPMVs are signaled. In addition, the index of the affine AMVP candidate from which the estimated CPMVs are derived is further signaled.
1) Inherited affine motion prediction values
The checking order is similar to that of spatial MVPs in HEVC AMVP list construction. First, a left inherited affine motion prediction value is derived from the first block in {A1, A0} that is affine-coded and has the same reference picture as the current block. Second, an above inherited affine motion prediction value is derived from the first block in {B1, B0, B2} that is affine-coded and has the same reference picture as the current block. The five blocks A1, A0, B1, B0, B2 are depicted in fig. 19.
Once a neighboring block is found to be coded with an affine mode, the CPMVs of the codec unit covering the neighboring block are used to derive the prediction value of the CPMVs of the current block. For example, if A1 is coded with a non-affine mode and A0 is coded with a 4-parameter affine mode, the left inherited affine MV prediction value will be derived from A0. In this case, the CPMVs of the CU covering A0 (denoted $mv^N_0$ for the top-left CPMV and $mv^N_1$ for the top-right CPMV, as shown in fig. 21B) are used to derive the estimated CPMVs of the current block, denoted $mv^C_0$, $mv^C_1$ and $mv^C_2$ for the top-left (with coordinate (x0, y0)), top-right (with coordinate (x1, y1)) and bottom-right (with coordinate (x2, y2)) positions of the current block.
2) Constructed affine motion prediction values
As shown in fig. 20, a constructed affine motion prediction value consists of Control Point Motion Vectors (CPMVs) that are derived from neighboring inter-coded blocks having the same reference picture. If the current affine motion model is a 4-parameter affine, the number of CPMVs is 2; otherwise, if the current affine motion model is a 6-parameter affine, the number of CPMVs is 3. The top-left CPMV $\bar{mv}_0$ is derived from the MV at the first block in the set {A, B, C} that is inter-coded and has the same reference picture as the current block. The top-right CPMV $\bar{mv}_1$ is derived from the MV at the first block in the set {D, E} that is inter-coded and has the same reference picture as the current block. The bottom-left CPMV $\bar{mv}_2$ is derived from the MV at the first block in the set {F, G} that is inter-coded and has the same reference picture as the current block.

- If the current affine motion model is a 4-parameter affine, a constructed affine motion prediction value is inserted into the candidate list only if both $\bar{mv}_0$ and $\bar{mv}_1$ are available, that is, $\bar{mv}_0$ and $\bar{mv}_1$ are used as the estimated CPMVs for the top-left (with coordinate (x0, y0)) and top-right (with coordinate (x1, y1)) positions of the current block.
- If the current affine motion model is a 6-parameter affine, a constructed affine motion prediction value is inserted into the candidate list only if $\bar{mv}_0$, $\bar{mv}_1$ and $\bar{mv}_2$ are all available, that is, $\bar{mv}_0$, $\bar{mv}_1$ and $\bar{mv}_2$ are used as the estimated CPMVs for the top-left (with coordinate (x0, y0)), top-right (with coordinate (x1, y1)) and bottom-right (with coordinate (x2, y2)) positions of the current block.
When inserting the constructed affine motion prediction value into the candidate list, the pruning process is not applied.
3) Normal AMVP motion prediction

The following applies until the number of affine motion prediction values reaches the maximum.

1) Derive an affine motion prediction value by setting all CPMVs equal to $\bar{mv}_0$, if available.
2) Derive an affine motion prediction value by setting all CPMVs equal to $\bar{mv}_1$, if available.
3) Derive an affine motion prediction value by setting all CPMVs equal to $\bar{mv}_2$, if available.
4) Derive an affine motion prediction value by setting all CPMVs equal to HEVC TMVP, if available.
5) Derive an affine motion prediction value by setting all CPMVs to zero MV.

Note that $\bar{mv}_i$ has already been derived in the constructed affine motion prediction values.
In the AF_INTER mode, when the 4/6-parameter affine mode is used, 2/3 control points are required, and therefore 2/3 MVDs need to be coded for these control points, as shown in figs. 18A and 18B. In JVET-K0337, it is proposed to derive the MVs in such a way that $mvd_1$ and $mvd_2$ are predicted from $mvd_0$:

$$mv_0=\bar{mv}_0+mvd_0$$
$$mv_1=\bar{mv}_1+mvd_1+mvd_0$$
$$mv_2=\bar{mv}_2+mvd_2+mvd_0$$

wherein $\bar{mv}_i$, $mvd_i$ and $mv_i$ are the predicted motion vector, the motion vector difference and the motion vector of the top-left pixel (i=0), the top-right pixel (i=1) or the bottom-left pixel (i=2), respectively, as shown in fig. 18B. Note that the addition of two motion vectors (e.g., mvA(xA, yA) and mvB(xB, yB)) is equal to the separate summation of the two components, i.e., newMV = mvA + mvB, with the two components of newMV set to (xA + xB) and (yA + yB), respectively.
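The CPMV reconstruction just described can be sketched as follows. This is an illustrative Python sketch of the JVET-K0337 MVD prediction (mvd₁ and mvd₂ coded as differences from mvd₀), not normative decoder code; the function names are our own.

```python
def add_mv(a, b):
    # component-wise addition of two motion vectors, as defined above
    return (a[0] + b[0], a[1] + b[1])

def reconstruct_cpmvs(pred, mvd):
    """pred: predicted CPMVs [mv̄0, mv̄1, mv̄2]; mvd: decoded differences
    [mvd0, mvd1, mvd2]. mvd1 and mvd2 are coded relative to mvd0, so mvd0
    is added back when reconstructing mv1 and mv2."""
    mv0 = add_mv(pred[0], mvd[0])
    mv1 = add_mv(add_mv(pred[1], mvd[1]), mvd[0])
    mv2 = add_mv(add_mv(pred[2], mvd[2]), mvd[0])
    return mv0, mv1, mv2

cpmvs = reconstruct_cpmvs([(0, 0), (0, 0), (0, 0)], [(1, 2), (3, 4), (5, 6)])
```

With zero predictors, the reconstructed CPMVs directly expose how mvd₀ propagates into mv₁ and mv₂.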
1.5.3. AF_MERGE mode
When a CU is applied in AF_MERGE mode, it gets the first block coded with affine mode from the valid neighboring reconstructed blocks. The selection order of the candidate blocks is from left, above, above-right, left-bottom to above-left, as shown in fig. 21A (denoted by A, B, C, D, E in order). For example, if the neighboring left-bottom block is coded in affine mode, as denoted by A0 in fig. 21B, the Control Point (CP) motion vectors $mv^N_0$, $mv^N_1$ and $mv^N_2$ of the top-left corner, above-right corner and left-bottom corner of the neighboring CU/PU containing block A are fetched. Based on $mv^N_0$, $mv^N_1$ and $mv^N_2$, the motion vectors $mv^C_0$, $mv^C_1$ and $mv^C_2$ of the top-left/top-right/bottom-left positions of the current CU/PU are calculated ($mv^C_2$ is only used for the 6-parameter affine model). It should be noted that in VTM-2.0, if the current block is affine coded, the sub-block (e.g., a 4×4 block in VTM) located at the top-left corner stores mv0 and the sub-block located at the top-right corner stores mv1. If the current block is coded with the 6-parameter affine model, the sub-block located at the bottom-left corner stores mv2; otherwise (with the 4-parameter affine model), LB stores mv2'. The other sub-blocks store the MVs used for MC.

After the CPMVs $mv^C_0$, $mv^C_1$ and $mv^C_2$ of the current CU are derived according to the simplified affine motion model in equations (1) and (2), the MVF of the current CU is generated. In order to identify whether the current CU is coded with the AF_MERGE mode, an affine flag is signaled in the bitstream when there is at least one neighboring block coded in affine mode.
In JVET-L0142 and JVET-L0632, an affine merge candidate list is constructed by the steps of:
1) Inserting inherited affine candidates
Inherited affine candidates refer to candidates that are derived from affine motion models of their valid neighboring affine codec blocks. Up to two inherited affine candidates are derived from affine motion models of neighboring blocks and inserted into the candidate list. For the predicted value on the left, the scan order is { A0, A1}; for the upper predictor, the scan order is { B0, B1, B2}.
2) Inserting constructed affine candidates
If the number of candidates in the affine candidate list is less than MaxNumAffineCand (e.g., 5), constructed affine candidates are inserted into the candidate list. A constructed affine candidate means that the candidate is constructed by combining the neighboring motion information of each control point.
a) The motion information for the control points is first derived from the specified spatial and temporal neighbors shown in fig. 22. CPk (k=1, 2,3, 4) represents the kth control point. A0, A1, A2, B0, B1, B2, and B3 are spatial locations for predicting CPk (k=1, 2, 3); t is the time domain position used to predict CP 4.
The coordinates of CP1, CP2, CP3 and CP4 are (0, 0), (W, 0), (0, H) and (W, H), respectively, where W and H are the width and height of the current block.
Motion information for each control point is obtained according to the following priority order:
For CP1, the checking priority is B2 -> B3 -> A2. B2 is used if it is available. Otherwise, if B3 is available, B3 is used. If neither B2 nor B3 is available, A2 is used.
If none of the three candidates is available, the motion information of CP1 cannot be obtained.
For CP2, the check priority is B1- > B0.
For CP3, the check priority is A1- > A0.
For CP4, T is used.
b) Second, affine Merge candidates are constructed using combinations of control points.
I. Constructing a 6-parameter affine candidate requires motion information of three control points. The three control points may be selected from one of the following four combinations: {CP1, CP2, CP4}, {CP1, CP2, CP3}, {CP2, CP3, CP4}, {CP1, CP3, CP4}. The combinations {CP1, CP2, CP4}, {CP2, CP3, CP4}, {CP1, CP3, CP4} will be converted to a 6-parameter motion model represented by the top-left, top-right and bottom-left control points.
II. Constructing a 4-parameter affine candidate requires motion information of two control points. The two control points may be selected from one of the following two combinations: {CP1, CP2}, {CP1, CP3}. The two combinations will be converted to a 4-parameter motion model represented by the top-left and top-right control points.
Inserting the constructed combination of affine candidates into a candidate list in the following order:
{CP1,CP2,CP3}、{CP1,CP2,CP4}、{CP1,CP3,CP4}、{CP2,CP3,CP4}、{CP1,CP2}、{CP1,CP3}
i. for each combination, the reference index of each CP in list X is checked, and if they are all the same, the combination has a valid CPMV for list X. If the combination has no valid CPMVs for both list 0 and list 1, the combination is marked invalid. Otherwise, it is valid and the CPMV is put into the sub-block merge list.
3) Filling with zero motion vectors
If the number of candidates in the affine candidate list is less than 5, a zero motion vector with a zero reference index is inserted into the candidate list until the list is full.
More specifically, for the sub-block Merge candidate list, 4-parameter Merge candidates are added with MV set to (0, 0) and the prediction direction set to uni-prediction from list 0 (for P slices) or bi-prediction (for B slices).
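The reference-index check of step i above can be sketched as follows. This is an illustrative sketch only; the dictionary-based control point representation is our own, not taken from any specification.

```python
def combination_valid(cp_ref_idx, combo, lst):
    """cp_ref_idx maps a control point name to {list index: reference index}.
    A combination has a valid CPMV for list `lst` only if every control
    point in it has motion information for that list and all reference
    indices in that list are identical."""
    idxs = [cp_ref_idx[cp].get(lst) for cp in combo]
    return None not in idxs and len(set(idxs)) == 1

# CP1 and CP2 share reference index 1 in list 0; CP3 uses index 2.
cp_ref_idx = {"CP1": {0: 1}, "CP2": {0: 1}, "CP3": {0: 2}}
```

A combination that is valid for neither list 0 nor list 1 would then be marked invalid, as stated above.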
2. Disadvantages of the prior embodiments
When applied in VVC, the ARC may suffer from the following problems:
1. In addition to resolution, other basic parameters such as bit depth and color format (such as 4:0:0, 4:2:0 or 4:4:4) may also change from one picture to another within a sequence.
3. Example methods for bit depth and color format conversion
The following detailed inventions should be considered as examples to explain the general concepts and should not be interpreted in a narrow way. Furthermore, these inventions can be combined in any manner.
In the following discussion, satShift (x, n) is defined as
Figure GDA0003390345630000441
Shift (x, n) is defined as Shift (x, n) = (x+offset 0) > > n.
In one example, offset0 and/or offset1 is set to (1 < < n) > >1 or (1 < < (n-1)).
In another example, offset0 and/or offset1 is set to 0.
In another example, offset0 = offset1 = ((1 < < n) > > 1) -1 or ((1 < < (n-1))) -1.
Clip3(min, max, x) is defined as

$$\text{Clip3}(\text{min},\text{max},x)=\begin{cases}\text{min} & \text{if } x<\text{min}\\ \text{max} & \text{if } x>\text{max}\\ x & \text{otherwise}\end{cases}$$
Floor (x) is defined as the largest integer less than or equal to x.
Ceil(x) is defined as the smallest integer greater than or equal to x.
Log2 (x) is defined as the base 2 logarithm of x.
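The helper functions above can be transcribed directly into code. The sketch below uses the rounding variant offset0 = offset1 = (1 << n) >> 1 as the default, which is one of the options listed above; the other offset choices can be passed in explicitly.

```python
def shift(x, n, offset0=None):
    """Shift(x, n) = (x + offset0) >> n."""
    if offset0 is None:
        offset0 = (1 << n) >> 1   # rounding offset (one listed option)
    return (x + offset0) >> n

def sat_shift(x, n, offset0=None, offset1=None):
    """Sign-aware right shift with rounding, per the SatShift definition."""
    if offset0 is None:
        offset0 = (1 << n) >> 1
    if offset1 is None:
        offset1 = (1 << n) >> 1
    if x >= 0:
        return (x + offset0) >> n
    return -((-x + offset1) >> n)

def clip3(lo, hi, x):
    """Clip3(min, max, x): clamp x into [lo, hi]."""
    return lo if x < lo else hi if x > hi else x
```

Note that SatShift rounds magnitudes symmetrically around zero, whereas a plain arithmetic right shift of a negative number would round toward negative infinity.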
Adaptive Resolution Change (ARC)
1. It is proposed to prohibit bi-prediction from two reference pictures with different resolutions.
2. Whether DMVR/BIO or other kinds of motion derivation/refinement are enabled at the decoder side may depend on whether the two reference pictures have the same resolution.
a. If the two reference pictures have different resolutions, motion derivation/refinement, such as DMVR/BIO, is disabled.
b. Alternatively, how motion derivation/refinement is applied at the decoder side may depend on whether the two reference pictures have the same resolution.
i. In one example, the allowable MVDs associated with each reference picture may be scaled, for example, according to resolution.
3. It is proposed that how to manipulate the reference samples may depend on the relationship between the resolution of the reference picture and that of the current picture.
a. In one example, all reference pictures stored in the decoded picture buffer have the same resolution (denoted as first resolution), such as maximum/minimum allowed resolution.
i. Alternatively, furthermore, samples in the decoded picture may be modified first (e.g., by upsampling or downsampling) before being stored in the decoded picture buffer.
1) The modification may be in accordance with the first resolution.
2) The modification may be based on the resolution of the reference picture and the resolution of the current picture.
b. In one illustrative example, all reference pictures stored in the decoded picture buffer may keep the resolution at which each picture was coded.
i. In one example, if the motion vector of a block points to a reference picture of a different resolution than the current picture, conversion of the reference samples may be applied first (e.g., by upsampling or downsampling) before invoking the motion compensation process.
Alternatively, reference samples in the reference picture may be used directly for motion compensation (MC). Thereafter, the prediction block generated by the MC process may be further modified (e.g., by upsampling or downsampling), and the final prediction block of the current block may depend on the modified prediction block.
Adaptive bit depth conversion (ABC)
The maximum allowable bit depth of the i-th color component is represented by ABCMaxBD [ i ], and the minimum allowable bit depth of the color component is represented by ABCMinBD [ i ] (e.g., i is 0..2).
4. It is proposed that one or more sets of sample bit depths of one or more components may be signaled in a video unit, such as DPS, VPS, SPS, PPS, APS, picture header or slice header.
a. Alternatively, one or more sets of sample bit depths of one or more components may be signaled in a Supplemental Enhancement Information (SEI) message.
b. Alternatively, one or more sets of sample bit depths of one or more components may be signaled in a single video unit for adaptive bit depth conversion.
c. In one example, a set of sample bit depths of one or more components may be coupled with a dimension of a picture.
i. In one example, one or more combinations of sample bit depth of one or more components and corresponding dimension/downsampling/upsampling rate of the picture may be signaled in the same video unit.
d. In one example, an indication of ABCMaxBD and/or ABCMinBD may be signaled. Alternatively, differences in other bit depth values compared to ABCMaxBD and/or ABCMinBD may be signaled.
5. It is proposed that when more than one combination of the sample bit depth of one or more components and the corresponding dimension/downsampling ratio/upsampling ratio of pictures is signaled in a single video unit (such as DPS, VPS, SPS, PPS, APS, picture header, slice header, slice group header, etc.) or in a single to-be-named video unit for ARC/ABC, a first combination is not allowed to be identical to a second combination.
6. It is proposed how the reference samples may be manipulated depending on the relation between the sample bit depth of the component in the reference picture and the sample bit depth of the current picture.
a. In one example, all reference pictures stored in the decoded picture buffer are at the same bit depth (denoted as the first bit depth), such as ABCMaxBD[i] or ABCMinBD[i] (i being 0..2, denoting a color component index).
i. Alternatively, furthermore, samples in the decoded picture may be modified via a left or right shift first before being stored in the decoded picture buffer.
1) The modification may be based on the first bit depth.
2) The modification may be made according to a bit depth defined for the reference picture and for the current picture.
b. In one example, all reference pictures stored in the decoded picture buffer are at a bit depth that has been used for encoding.
i. In one example, if a reference sample in a reference picture has a different bit depth than the current picture, the conversion of the reference sample may be applied first before invoking the motion compensation process.
1) Alternatively, the reference samples are used directly for motion compensation. Thereafter, the prediction block generated from the MC process may be further modified (e.g., by shifting), and the final prediction block of the current block may depend on the modified prediction block.
c. It is proposed that if the sample bit depth of the component in the reference picture (denoted BD 1) is lower than the sample bit depth of the component in the current picture (denoted BD 0), the reference samples can be converted accordingly.
i. In one example, the reference sample S may be converted to S', with S' = S << (BD0 - BD1).
Alternatively, S' = (S << (BD0 - BD1)) + (1 << (BD0 - BD1 - 1)).
d. It is proposed that if the sample bit depth of the component in the reference picture (denoted BD 1) is larger than the sample bit depth of the component in the current picture (denoted BD 0), the reference samples can be converted accordingly.
i. In one example, the reference sample S may be converted to S', with S' = Shift(S, BD1 - BD0).
e. In one example, a reference picture with samples that are not converted may be removed after a reference picture is converted according thereto.
i. Alternatively, reference pictures with samples that are not converted may be retained, but marked as unavailable after a reference picture is converted according thereto.
f. In one example, a reference picture with samples that are not converted may be placed into a reference picture list. When reference samples are used for inter prediction, they are transformed.
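The bit depth conversions of bullets 6.c and 6.d can be sketched together as one helper. This is an illustrative Python sketch, not normative code; it uses the plain left shift of 6.c.i when the current picture has the higher bit depth, and the rounded right shift Shift(S, BD1 - BD0) of 6.d.i (with offset (1 << n) >> 1) otherwise.

```python
def convert_sample(s, bd_ref, bd_cur):
    """Convert reference sample s from bit depth bd_ref (BD1) to the
    current picture's bit depth bd_cur (BD0)."""
    if bd_ref < bd_cur:
        return s << (bd_cur - bd_ref)          # S' = S << (BD0 - BD1)
    if bd_ref > bd_cur:
        n = bd_ref - bd_cur
        return (s + ((1 << n) >> 1)) >> n      # S' = Shift(S, BD1 - BD0)
    return s                                   # equal bit depths: unchanged
```

Converting up and then back down is lossless for the left-shift variant, while converting down loses the low-order bits with rounding, which is why the direction of conversion matters for the DPB strategies in bullets 6.a and 6.b.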
7. It is proposed that both ARC and ABC can be performed.
a. In one example, ARC is performed first, followed by ABC.
i. For example, in arc+abc conversion, samples are first up-sampled/down-sampled according to different picture dimensions, and then left-shifted/right-shifted according to different bit depths.
b. In one example, ABC is performed first, followed by ARC.
i. For example, in the abc+arc conversion, samples are first left/right shifted according to different bit depths and then upsampled/downsampled according to different picture dimensions.
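The ARC-then-ABC order of bullet 7.a can be sketched as follows. This is an illustrative sketch under our own assumptions: the resampler is an arbitrary callable standing in for the ARC up/downsampling, and the bit depth step reuses the shift-with-rounding convention defined earlier.

```python
def arc_then_abc(samples, resample, bd_ref, bd_cur):
    """Bullet 7.a: first convert resolution (ARC) via `resample`, then
    convert bit depth (ABC) via left shift or rounded right shift."""
    resized = resample(samples)            # ARC: picture-dimension change
    shift = bd_cur - bd_ref
    if shift >= 0:
        return [s << shift for s in resized]           # ABC: left shift
    n = -shift
    return [(s + ((1 << n) >> 1)) >> n for s in resized]  # ABC: right shift

# Toy 2x nearest-neighbor "upsampler" used only to exercise the sketch.
up2 = lambda xs: [x for x in xs for _ in (0, 1)]
converted = arc_then_abc([1, 2], up2, bd_ref=8, bd_cur=10)
```

Swapping the two steps (bullet 7.b) gives the same result for pure nearest-neighbor resampling, but with interpolating resamplers the order affects intermediate precision, which is why both orders are called out separately.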
8. It is proposed that the Merge candidate referring to the reference picture with higher sample bit depth has a higher priority than the Merge candidate referring to the reference picture with lower bit depth.
a. For example, in the Merge candidate list, a Merge candidate referencing a reference picture having a higher sample bit depth may precede a Merge candidate referencing a reference picture having a lower sample bit depth.
b. For example, a motion vector of a reference picture having a reference sample bit depth lower than the sample bit depth of the current picture cannot be in the Merge candidate list.
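The candidate ordering of bullet 8.a can be sketched as follows. This is a hypothetical sketch; the dict-based candidate representation with a "ref_bit_depth" key is our own, not from any specification.

```python
def order_merge_candidates(cands):
    """Bullet 8.a: Merge candidates referring to reference pictures with
    higher sample bit depth are placed before those with lower bit depth.
    sorted() is stable, so equal-depth candidates keep their original order."""
    return sorted(cands, key=lambda c: -c["ref_bit_depth"])

ordered = order_merge_candidates([
    {"mv": (1, 0), "ref_bit_depth": 8},
    {"mv": (0, 2), "ref_bit_depth": 10},
])
```

The stricter variant of bullet 8.b would instead filter out candidates whose reference bit depth is below that of the current picture before building the list.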
9. It is proposed to filter pictures with ALF parameters associated with the corresponding sample bit depth.
a. In one example, ALF parameters signaled in a video unit (such as APS) may be associated with one or more sample bit depths.
b. In one example, a video unit (such as an APS) signaling ALF parameters may be associated with one or more sample bit depths.
c. For example, a picture may only apply ALF parameters signaled in a video unit (such as APS) associated with the same sample bit depth.
d. Alternatively, the picture may use ALF parameters associated with different sample bit depths.
10. It is proposed that ALF parameters associated with a first corresponding sample bit depth may inherit or predict from ALF parameters associated with a second corresponding sample bit depth.
a. In one example, the first corresponding sample bit depth must be the same as the second corresponding sample bit depth.
b. In one example, the first corresponding sample bit depth may be different from the second corresponding sample bit depth.
11. Different default ALF parameters are proposed for different sample bit depths.
12. It is proposed to shape samples in a picture with LMCS parameters associated with the corresponding sample bit depth.
a. In one example, LMCS parameters signaled in a video unit (such as APS) may be associated with one or more sample bit depths.
b. In one example, a video unit (such as an APS) signaling LMCS parameters may be associated with one or more sample bit depths.
c. For example, a picture may only apply LMCS parameters signaled in a video unit (such as APS) associated with the same sample bit depth.
13. It is proposed that LMCS parameters associated with a first corresponding sample bit depth may inherit or predict from LMCS parameters associated with a second corresponding sample bit depth.
a. In one example, the first corresponding sample bit depth must be the same as the second corresponding sample bit depth.
b. In one example, the first corresponding sample bit depth may be different from the second corresponding sample bit depth.
14. It is proposed that the codec tool X may be disabled for a block if the block references at least one reference picture having a different sample bit depth than the current picture.
a. In one example, information related to codec tool X may not be signaled.
b. Alternatively, if the codec tool X is applied in a block, the block cannot refer to a reference picture having a different sample bit depth from the current picture.
i. In one example, the Merge candidate referencing a reference picture having a different sample bit depth than the current picture may be skipped or not put into the Merge candidate list.
ii. In one example, signaling of a reference index corresponding to a reference picture having a different sample bit depth than the current picture may be skipped or disallowed.
c. The codec tool X may be any one of the following:
i. ALF
ii. LMCS
iii. DMVR
iv. BDOF
v. Affine prediction
vi. TPM
vii. SMVD
viii. MMVD
ix. Inter prediction in VVC
x. LIC
xi. HMVP
xii. Multiple Transform Selection (MTS)
xiii. Sub-block transform (SBT)
15. It is proposed that bi-prediction from two reference pictures with different bit depths may not be allowed.
Adaptive color Format conversion (ACC)
16. It is proposed that one or more color formats (in one example, a color format may refer to 4:4:4, 4:2:2, 4:2:0, or 4:0:0, in another example, a color format may refer to YCbCr or RGB) may be signaled in a video unit (such as DPS, VPS, SPS, PPS, APS, picture header, slice group header).
a. Alternatively, one or more color formats may be signaled in a Supplemental Enhancement Information (SEI) message.
b. Alternatively, one or more color formats may be signaled in a single video unit for the ACC.
c. In one example, the color format may be coupled with the dimensions and/or sample bit depth of the picture.
i. In one example, one or more combinations of color formats, and/or sample bit depths of one or more components, and/or corresponding dimensions of a picture may be signaled in the same video unit.
17. It is proposed that when more than one combination of color formats, and/or sample bit depths of one or more components, and/or corresponding dimensions of a picture is signaled in a single video unit (such as DPS, VPS, SPS, PPS, APS, picture header, slice group header) or in a single to-be-named video unit for ARC/ABC/ACC, a first combination is not allowed to be identical to a second combination.
18. It is proposed to disable prediction from a reference picture with a different color format than the current picture.
a. Alternatively, in addition, pictures with different color formats are not allowed to be put into the reference picture list for the block in the current picture.
19. It is proposed that how to manipulate the reference samples may depend on the color formats of the reference picture and the current picture.
a. It is proposed that if the first color format of the reference picture is not identical to the second color format of the current picture, the reference samples may be converted accordingly.
i. In one example, samples of chroma components in a reference picture may be vertically up-sampled at a ratio of 1:2 if the first format is 4:2:0 and the second format is 4:2:2.
in one example, samples of chroma components in a reference picture may be horizontally up-sampled at a ratio of 1:2 if the first format is 4:2:2 and the second format is 4:4:4.
in one example, if the first format is 4:2:0 and the second format is 4:4:4, samples of chroma components in the reference picture may be vertically up-sampled at a ratio of 1:2 and horizontally up-sampled at a ratio of 1:2.
in one example, samples of chroma components in a reference picture may be vertically downsampled at a ratio of 1:2 if the first format is 4:2:2 and the second format is 4:2:0.
In one example, samples of the chroma components in the reference picture may be downsampled horizontally at a ratio of 1:2 if the first format is 4:4:4 and the second format is 4:2:2.
In one example, if the first format is 4:4:4 and the second format is 4:2:0, samples of the chroma components in the reference picture may be vertically downsampled at a ratio of 1:2 and horizontally downsampled at a ratio of 1:2.
In one example, samples of a luma component in a reference picture may be used to perform inter prediction on a current picture if the first format is not 4:0:0 and the second format is 4:0:0.
In one example, samples of a luma component in a reference picture may be used to perform inter prediction on a current picture if the first format is 4:0:0 and the second format is not 4:0:0.
1) Alternatively, if the first format is 4:0:0 and the second format is not 4:0:0, then the samples in the reference picture cannot be used to perform inter prediction on the current picture.
b. In one example, a reference picture with samples that are not converted may be removed after a reference picture is converted according to it.
i. Alternatively, reference pictures with samples that are not converted may be retained, but marked as unavailable after a reference picture is converted according thereto.
c. In one example, a reference picture with samples that are not converted may be placed in a reference picture list. When reference samples are used for inter prediction, they are transformed.
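The resampling cases of bullet 19.a can be summarized with a small helper mapping chroma formats to subsampling factors. This is a hypothetical sketch; the table and function names are our own. The 4:0:0 (monochrome) cases of bullets vii-viii are excluded, since there are no chroma planes to resample there.

```python
SUBSAMPLING = {          # (horizontal, vertical) chroma subsampling factors
    "4:4:4": (1, 1),
    "4:2:2": (2, 1),
    "4:2:0": (2, 2),
}

def chroma_resample_ratio(ref_fmt, cur_fmt):
    """Return the (horizontal, vertical) scaling to apply to the reference
    chroma planes; values > 1 mean upsampling, < 1 mean downsampling."""
    rh, rv = SUBSAMPLING[ref_fmt]
    ch, cv = SUBSAMPLING[cur_fmt]
    return rh / ch, rv / cv
```

For instance, a 4:2:0 reference used by a 4:2:2 current picture yields a (1, 2) ratio, i.e., the vertical 1:2 chroma upsampling of bullet 19.a.i.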
20. It is proposed that both ARC and ACC can be performed.
a. In one example, ARC is performed first, followed by ACC.
i. In one example, in arc+acc conversion, samples are first downsampled/upsampled according to different picture dimensions and then downsampled/upsampled according to different color formats.
b. In one example, ACC is performed first, followed by ARC.
i. In one example, in the ACC+ARC conversion, samples are first downsampled/upsampled according to the different color formats, and then downsampled/upsampled according to the different picture dimensions.
c. In one example, ACC and ARC may be performed together.
i. For example, in arc+acc or acc+arc conversion, the samples are downsampled/upsampled according to scaling derived from different color formats and different picture dimensions.
21. It is proposed that both ACC and ABC can be performed.
a. In one example, ACC is performed first, followed by ABC.
i. For example, in acc+abc conversion, samples are first up-sampled/down-sampled according to different color formats, and then left-shifted/right-shifted according to different bit depths.
b. In one example, ABC is performed first, followed by ACC.
i. For example, in the abc+acc conversion, samples are first left/right shifted according to different bit depths and then up/down sampled according to different color formats.
22. It is proposed to filter the picture with ALF parameters associated with the corresponding color format.
a. In one example, ALF parameters signaled in a video unit (such as APS) may be associated with one or more color formats.
b. In one example, a video unit (such as an APS) that signals ALF parameters may be associated with one or more color formats.
c. For example, a picture may only apply ALF parameters signaled in a video unit (such as APS) associated with the same color format.
23. It is proposed that ALF parameters associated with a first corresponding color format may inherit or predict from ALF parameters associated with a second corresponding color format.
a. In one example, the first corresponding color format must be the same as the second corresponding color format.
b. In one example, the first corresponding color format may be different from the second corresponding color format.
24. Different default ALF parameters are proposed for different color formats.
a. In one example, different default ALF parameters may be designed for YCbCr and RGB formats.
b. In one example, different default ALF parameters may be designed for 4:4:4, 4:2:2, 4:2:0, 4:0:0 formats.
25. It is proposed to shape samples in a picture with LMCS parameters associated with the corresponding color format.
a. In one example, LMCS parameters signaled in a video unit (such as APS) may be associated with one or more color formats.
b. In one example, a video unit (such as an APS) that signals LMCS parameters may be associated with one or more color formats.
c. For example, a picture may only apply LMCS parameters signaled in a video unit (such as APS) associated with the same color format.
26. It is proposed that LMCS parameters associated with a first corresponding color format may inherit or predict from LMCS parameters associated with a second corresponding color format.
a. In one example, the first corresponding color format must be the same as the second corresponding color format.
b. In one example, the first corresponding color format may be different from the second corresponding color format.
27. It is proposed that the LMCS chroma residual scaling process is not applied to pictures with a color format of 4:0:0.
a. In one example, if the color format is 4:0:0, an indication of whether chroma residual scaling is applied may not be signaled and may be inferred to be "unused".
b. In one example, if the color format is 4:0:0, an indication of whether chroma residual scaling is applied must be "unused" in a conformant bitstream.
c. In one example, if the color format is 4:0:0, a signaled indication of whether chroma residual scaling is applied is ignored by the decoder and set to "unused".
28. It is proposed that the codec tool X may be disabled for a block if the block references at least one reference picture having a different color format than the current picture.
a. In one example, information related to codec tool X may not be signaled.
b. Alternatively, if the codec tool X is applied in a block, the block cannot refer to a reference picture having a different color format from the current picture.
i. In one example, a Merge candidate referencing a reference picture having a different color format than the current picture may be skipped or not put into the Merge candidate list.
ii. In one example, a reference index corresponding to a reference picture having a different color format than the current picture may be skipped or not allowed to be signaled.
c. The codec tool X may be any one of the following.
i. ALF
ii. LMCS
iii. DMVR
iv. BDOF
v. Affine prediction
vi. TPM
vii. SMVD
viii. MMVD
ix. Inter prediction in VVC
x. LIC
xi. HMVP
xii. Multiple Transform Set (MTS)
xiii. Sub-block transform (SBT)
29. It is proposed that bi-prediction from two reference pictures with different color formats is not allowed.
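A minimal sketch of the constraints in items 28 and 29 above, assuming a simple per-block check; the function names and string-valued color formats are illustrative, not from any codec specification:

```python
# Hedged sketch: a coding tool X is disabled for a block when any picture
# it references has a color format different from the current picture's
# (item 28), and bi-prediction from two reference pictures with different
# color formats is disallowed (item 29). All names are hypothetical.

def tool_x_enabled(current_format: str, ref_formats: list[str]) -> bool:
    """Tool X (ALF, LMCS, DMVR, BDOF, ...) stays enabled only if every
    referenced picture shares the current picture's color format."""
    return all(f == current_format for f in ref_formats)

def biprediction_allowed(ref0_format: str, ref1_format: str) -> bool:
    """Bi-prediction from two reference pictures with different color
    formats is not allowed."""
    return ref0_format == ref1_format
```

When `tool_x_enabled` returns False, the syntax related to tool X would also not be signaled, per item 28.a.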
The above examples may be incorporated in the context of the methods described below (e.g., methods 2400-3100), which may be implemented at a video decoder or video encoder.
Fig. 24 shows a flow chart of an exemplary method for video processing. Method 2400 includes, at step 2402, applying an adaptive bit depth conversion (ABC) process to a video region within a current picture, wherein, during the adaptive bit depth conversion (ABC) process, one or more sets of sampling bit depths are applicable to one or more color components of the video region and the one or more sets of sampling bit depths are signaled in a particular video unit; and at step 2404, performing a conversion between the video region and a bitstream representation of the video region based on the adaptive bit depth conversion (ABC) process.
Fig. 25 shows a flow chart of an exemplary method for video processing. Method 2500 includes, at step 2502, for a video region within a current picture of video encoded using an adaptive bit depth conversion (ABC) process, determining a relationship between a sampling bit depth of at least one color component of the current picture and a sampling bit depth of at least one color component of a reference picture referenced by the video region; at step 2504, in response to the relationship, performing a particular operation on samples within the reference picture or on a prediction region of the video region; and at step 2506, performing a conversion between the video region and a bitstream representation of the video region based on the particular operation.
Fig. 26 shows a flow chart of an exemplary method for video processing. The method 2600 includes, at step 2602, applying a first adaptive conversion process to a video region within a current picture; at step 2604, applying a second adaptive conversion process to the video region; and at step 2606, performing a conversion between the bitstream representation of the video region and the video region based on the first adaptive conversion process and the second adaptive conversion process; wherein the first adaptive conversion process is one of an adaptive bit depth conversion (ABC) process and an Adaptive Resolution Change (ARC) process, and the second adaptive conversion process is the other of the adaptive bit depth conversion (ABC) process and the Adaptive Resolution Change (ARC) process.
Fig. 27 shows a flow chart of an exemplary method for video processing. The method 2700 includes, at step 2702, for a video region within a current picture, determining at least one Merge candidate, wherein the at least one Merge candidate is assigned a priority based on a sampling bit depth of at least one color component of a reference picture to which the at least one Merge candidate refers; at step 2704, determining, based on the priority, whether and/or how to insert the Merge candidate into the Merge candidate list of the video region; and at step 2706, performing a conversion between the bitstream representation of the video region and the video region based on the Merge candidate list.
Fig. 28 shows a flow chart of an exemplary method for video processing. Method 2800 includes, at step 2802, applying an Adaptive Loop Filtering (ALF) process to samples within the current picture, wherein a plurality of ALF parameters are applicable to the samples based on corresponding sample bit depths associated with the plurality of ALF parameters.
Fig. 29 shows a flow chart of an exemplary method for video processing. Method 2900 includes, at step 2902, applying a luma mapping with chroma scaling (LMCS) process to samples within a current picture, wherein a plurality of LMCS parameters are applicable to the samples based on corresponding sampling bit depths associated with the plurality of LMCS parameters.
Fig. 30 shows a flow chart of an exemplary method for video processing. The method 3000 includes, at step 3002, determining whether and/or how to perform a particular coding scheme on a video block within a current picture based on a relationship between a sampling bit depth of at least one of a plurality of reference pictures referenced by the video block and a sampling bit depth of the current picture; and at step 3004, performing a conversion between the bitstream representation of the video block and the video block based on the determination.
Fig. 31 shows a flow chart of an exemplary method for video processing. Method 3100 includes, at step 3102, performing a conversion between a bitstream representation of a current video block and the current video block, wherein bi-prediction from two reference pictures is disabled for the current video block during the conversion if the two reference pictures referenced by the current video block have different sampling bit depths.
In one aspect, a method for video processing is disclosed, comprising:
applying an adaptive bit depth conversion (ABC) procedure to a video region within the current picture, wherein in the adaptive bit depth conversion (ABC) procedure, one or more sets of sample bit depths are applicable to one or more color components of the video region and the one or more sets of sample bit depths are signaled in a particular video unit; and
performing a conversion between the video region and a bitstream representation of the video region based on the adaptive bit depth conversion (ABC) process.
In one example, the particular video unit includes at least one of a Decoder Parameter Set (DPS), a Video Parameter Set (VPS), a Sequence Parameter Set (SPS), a Picture Parameter Set (PPS), an Adaptive Parameter Set (APS), a picture header, a slice header, and a slice group header.
In one example, a particular video unit includes a Supplemental Enhancement Information (SEI) message.
In one example, a particular video unit includes a separate video unit for adaptive bit depth conversion.
In one example, the one or more sets of sampling bit depths are combined with corresponding dimension information of the current picture.
In one example, the dimension information of the current picture includes an upsampling rate and a downsampling rate of the current picture.
In one example, the corresponding dimension information for the current picture is signaled in the same particular video unit as the one or more sets of sampling bit depths.
In one example, one set of sampling bit depths includes at least one of a maximum sampling bit depth and a minimum sampling bit depth allowed.
In one example, one set of sampling bit depths includes a bit depth threshold and a difference between the bit depth threshold and at least one of the allowed maximum sampling bit depth and the allowed minimum sampling bit depth.
In one example, more than one combination is configured from the one or more sets of sampling bit depths and the corresponding dimension information, and any two of these combinations are different.
In another aspect, a method for video processing is disclosed, comprising:
for a video region within a current picture of video encoded and decoded using an adaptive bit depth conversion (ABC) process, determining a relationship between a sampling bit depth of at least one color component of the current picture and a sampling bit depth of at least one color component of a reference picture to which the video region refers;
in response to the relationship, performing a particular operation on samples within the reference picture or on a prediction region of the video region; and
based on the specific operation, a transition between the bitstream representation of the video region and the video region is performed.
In one example, if it is determined that the sampling bit depth of the at least one color component of the current picture is different from the sampling bit depth of the at least one color component of the reference picture, the specific operation includes:
Before invoking the motion compensation process for the video region, modifications are performed on samples within the reference picture.
In one example, if it is determined that the sampling bit depth of the at least one color component of the current picture is different from the sampling bit depth of the at least one color component of the reference picture, the specific operation includes:
performing motion compensation on a video region by using samples within a reference picture to generate a prediction region of the video region, and
the modification is performed on samples within the prediction area.
In one example, if the sampling bit depth of at least one color component of the reference picture is not equal to the sampling bit depth of at least one color component of the current picture, the specific operations include:
modifying at least one sample within the reference picture prior to use in inter prediction of the video region;
and placing the modified reference picture in a reference picture list.
In one example, the reference picture is stored in a decoded picture buffer at a sampling bit depth used to encode and decode the reference picture.
In one example, the reference picture is stored in the decoded picture buffer at a sampling bit depth that is a maximum allowable sampling bit depth, a minimum allowable sampling bit depth, or a predetermined sampling bit depth.
In one example, the reference picture is stored in the decoded picture buffer at a sampling bit depth that is based on the sampling bit depth of the current picture and the sampling bit depth of the reference picture.
In one example, the method further comprises:
the decoded picture including the video region is stored in a decoded picture buffer with sampling bit depths used to encode and decode the decoded picture.
In one example, the method further comprises:
the decoded picture including the video region is stored in a decoded picture buffer at a sampling bit depth that is a maximum allowable sampling bit depth, a minimum allowable sampling bit depth, or a predetermined sampling bit depth.
In one example, pictures in the decoded picture buffer have the same sampling bit depth.
In one example, the modification includes at least one of a left shift or a right shift.
In one example, if it is determined that the sampling bit depth of the at least one color component of the reference picture is lower than the sampling bit depth of the at least one color component of the current picture, at least one sample within the reference picture is modified as follows to generate a modified reference picture:
S' = S << (BD0 - BD1); or
S' = S << (BD0 - BD1) + (1 << (BD0 - BD1 - 1)),
Where S denotes at least one sample within the reference picture, BD0 denotes the sampling bit depth of at least one color component of the current picture, and BD1 denotes the sampling bit depth of the corresponding color component of the reference picture, and S' denotes the modified sample.
In one example, if it is determined that the sampling bit depth of the at least one color component of the reference picture is greater than the sampling bit depth of the at least one color component of the current picture, at least one sample within the reference picture is modified as follows to generate a modified reference picture:
S' = Shift(S, BD1 - BD0),
where Shift(x, n) is defined as Shift(x, n) = (x + offset0) >> n, S represents at least one sample within the reference picture, BD0 represents the sampling bit depth of at least one color component of the current picture, BD1 represents the sampling bit depth of the corresponding color component of the reference picture, and S' represents the modified sample.
In one example, the reference picture is removed after the modified reference picture is generated.
In one example, after the modified reference picture is generated, the reference picture is reserved and marked as unavailable.
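The sample modifications above can be sketched in a few lines of code. Note that the document leaves offset0 unspecified; `offset0 = 1 << (n - 1)` below is an assumed rounding offset, and all function names are illustrative:

```python
# Hedged sketch of the reference-sample bit depth modification described
# above. BD0 is the sampling bit depth of the current picture, BD1 that of
# the reference picture. offset0 = 1 << (n - 1) is one plausible rounding
# choice; the text does not define offset0.

def shift(x: int, n: int) -> int:
    """Shift(x, n) = (x + offset0) >> n, assuming offset0 = 1 << (n - 1)."""
    offset0 = 1 << (n - 1) if n > 0 else 0
    return (x + offset0) >> n

def modify_reference_sample(s: int, bd0: int, bd1: int, rounding: bool = False) -> int:
    """Map a reference sample from bit depth bd1 to the current picture's bd0."""
    if bd1 < bd0:
        # Reference has lower bit depth: left shift, optionally adding a
        # half-step offset per the alternative formula above.
        s2 = s << (bd0 - bd1)
        if rounding:
            s2 += 1 << (bd0 - bd1 - 1)
        return s2
    if bd1 > bd0:
        # Reference has higher bit depth: rounded right shift.
        return shift(s, bd1 - bd0)
    return s  # equal bit depths: no modification

# e.g. an 8-bit sample 200 converted for a 10-bit current picture:
# 200 << 2 = 800
```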
In yet another aspect, a method for video processing is disclosed, comprising:
Applying a first adaptive conversion process to a video region within a current picture;
applying a second adaptive conversion process to the video region; and
performing a conversion between the bitstream representation of the video region and the video region based on the first adaptive conversion process and the second adaptive conversion process;
wherein the first adaptive conversion process is one of an adaptive bit depth conversion (ABC) process and an Adaptive Resolution Change (ARC) process, and the second adaptive conversion process is the other of the adaptive bit depth conversion (ABC) process and the Adaptive Resolution Change (ARC) process.
In one example, an Adaptive Resolution Change (ARC) process is performed before or after an adaptive bit depth conversion (ABC) process.
In one example, the Adaptive Resolution Change (ARC) process includes:
for a video region, determining a relationship between a resolution of a current picture and a resolution of a reference picture referenced by the video region; and
in response to the relationship, a modification is performed on the samples within the reference picture.
In one example, the modification includes at least one of upsampling or downsampling.
In one example, the adaptive bit depth conversion (ABC) process includes:
For a video region, determining a relationship between a sampling bit depth of at least one color component of a current picture and a sampling bit depth of at least one color component of a reference picture referenced by the video region; and
in response to the relationship, a modification is performed on the samples within the reference picture.
In one example, the modification includes at least one of a left shift or a right shift.
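As a hedged illustration of combining the two processes, the following toy example resamples a 1-D row of samples (a nearest-neighbour stand-in for a real ARC interpolation filter) and shifts bit depths (the ABC step), in either order; all names and the resampler itself are assumptions for illustration:

```python
def resample_nearest(samples, out_len):
    """Toy ARC step: nearest-neighbour resampling of a 1-D sample row."""
    n = len(samples)
    return [samples[min(i * n // out_len, n - 1)] for i in range(out_len)]

def abc_shift(samples, bd_cur, bd_ref):
    """Toy ABC step: left/right shift reference samples to bd_cur."""
    if bd_ref < bd_cur:
        return [s << (bd_cur - bd_ref) for s in samples]
    if bd_ref > bd_cur:
        n = bd_ref - bd_cur
        return [(s + (1 << (n - 1))) >> n for s in samples]  # rounded right shift
    return list(samples)

def adapt_reference(samples, out_len, bd_cur, bd_ref, arc_first=True):
    """Apply ARC then ABC, or ABC then ARC (both orders are contemplated above)."""
    if arc_first:
        return abc_shift(resample_nearest(samples, out_len), bd_cur, bd_ref)
    return resample_nearest(abc_shift(samples, bd_cur, bd_ref), out_len)
```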
In yet another aspect, a method for video processing is disclosed, comprising:
determining at least one Merge candidate for a video region within the current picture, wherein the at least one Merge candidate is assigned a priority based on a sampling bit depth of at least one color component of a reference picture to which the at least one Merge candidate refers;
determining whether and/or how to insert the Merge candidate into a Merge candidate list of the video region based on the priority; and
based on the Merge candidate list, a transition between the bitstream representation of the video region and the video region is performed.
In one example, at least one Merge candidate is inserted into the Merge candidate list before another Merge candidate that references a reference picture having a lower sampling bit depth than the reference picture referenced by the at least one Merge candidate.
In one example, if the sampling bit depth of the reference picture to which the at least one Merge candidate refers is lower than the sampling bit depth of the current picture, the at least one Merge candidate is not allowed to be inserted into the Merge candidate list.
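The priority rule above might look like the following sketch, where the candidate structure, list size, and tie-breaking are hypothetical:

```python
# Sketch of the Merge-candidate ordering/pruning examples above: candidates
# referencing higher-bit-depth reference pictures get higher priority, and
# candidates whose reference bit depth is below the current picture's may
# be disallowed. Data structures are illustrative only.

from dataclasses import dataclass

@dataclass
class MergeCandidate:
    ref_bit_depth: int  # sampling bit depth of the referenced picture

def build_merge_list(candidates, current_bit_depth, max_size=6):
    # Disallow candidates referencing pictures with a lower sampling bit
    # depth than the current picture (one of the examples above).
    allowed = [c for c in candidates if c.ref_bit_depth >= current_bit_depth]
    # Higher reference bit depth -> inserted earlier (higher priority).
    allowed.sort(key=lambda c: -c.ref_bit_depth)
    return allowed[:max_size]
```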
In yet another aspect, a method for video processing is disclosed, comprising:
an Adaptive Loop Filtering (ALF) process is applied to samples within the current picture, wherein a plurality of ALF parameters are applicable to the samples based on corresponding sampling bit depths associated with the plurality of ALF parameters.
In one example, multiple ALF parameters are signaled in a particular video unit.
In one example, a particular video unit includes an Adaptive Parameter Set (APS).
In one example, a particular video unit is associated with one or more sampling bit depths.
In one example, a plurality of ALF parameters are associated with one or more sampling bit depths.
In one example, only ALF parameters of the plurality of ALF parameters that are associated with the same sampling bit depth may be applied to the samples.
In one example, ALF parameters associated with different sampling bit depths among the plurality of ALF parameters may be applied to the samples.
In one example, the plurality of ALF parameters are divided into at least a first subset and a second subset associated with a first sampling bit depth and a second sampling bit depth, respectively, and the first subset of ALF parameters is inherited or predicted from the second subset of ALF parameters.
In one example, the first sampling bit depth is equal to the second sampling bit depth.
In one example, the first sampling bit depth is different from the second sampling bit depth.
In one example, the plurality of ALF parameters includes a default ALF parameter, wherein different default ALF parameters are designed for different sampling bit depths.
In yet another aspect, a method for video processing is disclosed, comprising:
a luma mapping with chroma scaling (LMCS) process is applied to samples within the current picture, wherein a plurality of LMCS parameters are applicable to the samples based on corresponding sampling bit depths associated with the plurality of LMCS parameters.
In one example, a plurality of LMCS parameters are signaled in a particular video unit.
In one example, a particular video unit includes an Adaptive Parameter Set (APS).
In one example, a particular video unit is associated with one or more sampling bit depths.
In one example, a plurality of LMCS parameters are associated with one or more sampling bit depths.
In one example, only LMCS parameters of the plurality of LMCS parameters that are associated with the same sampling bit depth may be applied to the samples.
In one example, the plurality of LMCS parameters are partitioned into at least a first subset and a second subset associated with a first sampling bit depth and a second sampling bit depth, respectively, and the first subset of LMCS parameters is inherited or predicted from the second subset of LMCS parameters.
In one example, the first sampling bit depth is equal to the second sampling bit depth.
In one example, the first sampling bit depth is different from the second sampling bit depth.
In yet another aspect, a method for video processing is disclosed, comprising:
determining whether and/or how to perform a particular coding scheme on a video block within a current picture based on a relationship between a sampling bit depth of at least one of a plurality of reference pictures to which the video block refers and a sampling bit depth of the current picture; and
based on the determination, a conversion between the bitstream representation of the video block and the video block is performed.
In one example, if it is determined that the sampling bit depth of at least one of the plurality of reference pictures to which the video block refers is different from the sampling bit depth of the current picture, the particular codec scheme is disabled for the video block.
In one example, information about a particular coding scheme is not signaled.
In one example, if the plurality of reference pictures to which the video block refers do not include reference pictures having sampling bit depths that are different from the sampling bit depths of the current picture, a particular codec scheme is enabled for the video block.
In one example, a reference index indicating a reference picture having a sampling bit depth different from the sampling bit depth of the current picture is not signaled or skipped for the video block during the conversion.
In one example, the conversion between the bitstream representation of the video block and the video block is based on at least one of a plurality of Merge candidates arranged in a Merge candidate list, and the Merge candidate list excludes Merge candidates referencing reference pictures having a sampling bit depth different from a sampling bit depth of the current picture.
In one example, the conversion between the bitstream representation of the video block and the video block is based on at least one of a plurality of Merge candidates arranged in a Merge candidate list, and any Merge candidate in the Merge candidate list that references a reference picture having a sample bit depth different from the sample bit depth of the current picture is skipped for the video block during the conversion.
In one example, the particular codec scheme is selected from the group consisting of at least one of: an Adaptive Loop Filtering (ALF) scheme, a luma mapping with chroma scaling (LMCS) scheme, a decoder-side motion vector refinement (DMVR) scheme, a bi-directional optical flow (BDOF) scheme, an affine prediction scheme, a Triangular Partition Mode (TPM) scheme, a Symmetric Motion Vector Difference (SMVD) scheme, a Merge with motion vector difference (MMVD) scheme, an inter prediction scheme in Versatile Video Coding (VVC), a Local Illumination Compensation (LIC) scheme, a history-based motion vector prediction (HMVP) scheme, a multiple transform set (MTS) scheme, and a sub-block transform (SBT) scheme.
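A minimal decision sketch for this aspect, with hypothetical names; a real codec would gate each scheme individually rather than through a single predicate:

```python
# Illustrative gating logic: a particular codec scheme (ALF, LMCS, DMVR,
# BDOF, affine, TPM, SMVD, MMVD, LIC, HMVP, MTS, SBT, ...) is disabled for
# a video block if any referenced picture's sampling bit depth differs from
# the current picture's, and reference indices pointing at such pictures
# are skipped. All names are assumptions for illustration.

def scheme_enabled(current_bd: int, ref_bds: list[int]) -> bool:
    """Scheme stays enabled only if no referenced picture differs in bit depth."""
    return all(bd == current_bd for bd in ref_bds)

def usable_reference_indices(current_bd: int, ref_bds: list[int]) -> list[int]:
    """Indices of reference pictures whose bit depth matches the current picture."""
    return [i for i, bd in enumerate(ref_bds) if bd == current_bd]
```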
In yet another aspect, a method for video processing is disclosed, comprising:
performing a conversion between a bitstream representation of a current video block and the current video block,
wherein bi-prediction from two reference pictures is disabled for the current video block during the conversion if the two reference pictures referenced by the current video block have different sampling bit depths.
In one example, the video region is a video block.
In one example, the converting includes encoding the video region into a bitstream representation of the video region, or decoding the video region from the bitstream representation of the video region.
In yet another aspect, an apparatus in a video system is disclosed that includes a processor and a non-transitory memory having instructions thereon, wherein the instructions, when executed by the processor, cause the processor to implement the above-described method.
In yet another aspect, a non-transitory computer readable medium is disclosed having stored thereon program code which, when executed, causes a processor to implement a method as described above.
4. Example embodiments of the disclosed technology
Fig. 23 is a block diagram of a video processing apparatus 2300. The apparatus 2300 may be used to implement one or more methods described herein. The apparatus 2300 may be embodied in a smartphone, tablet, computer, Internet of Things (IoT) receiver, or the like. The apparatus 2300 may include one or more processors 2302, one or more memories 2304, and video processing hardware 2306. The processor(s) 2302 may be configured to implement one or more methods described in this document, including but not limited to methods 2400 to 3100. The memory(s) 2304 may be used to store data and code for implementing the methods and techniques described herein. Video processing hardware 2306 may be used to implement some of the techniques described in this document in hardware circuitry.
In some embodiments, the video codec method may be implemented using an apparatus implemented on a hardware platform as described with reference to fig. 23.
From the foregoing it will be appreciated that specific embodiments of the presently disclosed technology have been described herein for purposes of illustration, but that various modifications may be made without deviating from the scope of the invention. Accordingly, the presently disclosed technology is not limited except as by the appended claims.
Embodiments of the subject matter and the functional operations described in this patent document can be implemented in various systems, digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer program products, i.e., one or more modules of computer program instructions encoded on a tangible and non-transitory computer-readable medium for execution by, or to control the operation of, data processing apparatus. The computer readable medium can be a machine-readable storage device, a machine-readable storage substrate, a memory device, a composition of matter effecting a machine-readable propagated signal, or a combination of one or more of them. The term "data processing unit" or "data processing apparatus" encompasses all apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. In addition to hardware, the apparatus can include code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.
A computer program (also known as a program, software application, script, or code) can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. The computer program does not necessarily correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers at one site or distributed across multiple sites and interconnected by a communication network.
The processes and logic flows described in this specification can be performed by one or more programmable processors executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a processor for executing instructions and one or more memory devices for storing instructions and data. Typically, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer does not require such a device. Computer readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
The specification, together with the drawings, is to be considered exemplary only, where exemplary means serving as an example. As used herein, the use of "or" is intended to include "and/or" unless the context clearly indicates otherwise.
While this patent document contains many specifics, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features specific to particular embodiments of particular inventions. Certain features that are described in this patent document in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Furthermore, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, although operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. Furthermore, the separation of various system components in the embodiments described in this patent document should not be understood as requiring such separation in all embodiments.
Only a few embodiments and examples are described, and other implementations, enhancements, and variations may be made based on what is described and illustrated in this patent document.

Claims (62)

1. A method for video processing, comprising:
for a first video region within a current picture of video encoded using an adaptive bit depth conversion, ABC, process, determining a relationship between a sampling bit depth of at least one color component of the current picture and a sampling bit depth of at least one color component of a reference picture to which the first video region refers;
performing a particular operation on a sample within the reference picture or a predicted region of the first video region in response to the relationship; and
performing a conversion between the bitstream of the first video region and the first video region based on the particular operation, wherein if it is determined that the sampling bit depth of the at least one color component of the current picture is different from the sampling bit depth of the at least one color component of the reference picture, the particular operation comprises:
performing a modification to a sample within the reference picture prior to invoking a motion compensation process for the first video region; or alternatively
Performing motion compensation on the first video region by using samples within the reference picture to generate a prediction region of the first video region, and performing modification on samples within the prediction region; or alternatively
At least one sample within the reference picture is modified prior to inter-prediction for the first video region, and the modified reference picture is placed in a reference picture list.
2. The method of claim 1, wherein the reference picture is stored in a decoded picture buffer at a sampling bit depth used to encode and decode the reference picture.
3. The method of claim 1, wherein the reference picture is stored in a decoded picture buffer at a maximum allowable sampling bit depth, a minimum allowable sampling bit depth, or a predetermined sampling bit depth.
4. The method of claim 1, wherein the reference picture is stored in a decoded picture buffer at a sampling bit depth that is based on a sampling bit depth of the current picture and a sampling bit depth of the reference picture.
5. The method of claim 1, wherein the method further comprises:
storing a decoded picture including the first video region in a decoded picture buffer at a sampling bit depth used to encode and decode the decoded picture.
6. The method of claim 1, wherein the method further comprises:
storing the decoded picture comprising the first video region in a decoded picture buffer at a maximum allowable sampling bit depth, a minimum allowable sampling bit depth, or a predetermined sampling bit depth.
7. The method of claim 3 or 6, wherein pictures in the decoded picture buffer have the same sampling bit depth.
8. The method of claim 1, wherein the modification comprises at least one of a left shift or a right shift.
9. The method of claim 8, wherein if it is determined that the sampling bit depth of the at least one color component of the reference picture is lower than the sampling bit depth of the at least one color component of the current picture, at least one sample within the reference picture is modified to generate a modified reference picture as follows:
S' = S << (BD0 - BD1); or
S' = (S << (BD0 - BD1)) + (1 << (BD0 - BD1 - 1)),
where S represents at least one sample within the reference picture, BD0 represents the sampling bit depth of the at least one color component of the current picture, BD1 represents the sampling bit depth of the corresponding color component of the reference picture, and S' represents the modified sample.
10. The method of claim 9, wherein if it is determined that the sampling bit depth of the at least one color component of the reference picture is greater than the sampling bit depth of the at least one color component of the current picture, at least one sample within the reference picture is modified to generate a modified reference picture as follows:
S' = Shift(S, BD1 - BD0),
where Shift(x, n) is defined as Shift(x, n) = (x + offset0) >> n, S represents at least one sample within the reference picture, BD0 represents the sampling bit depth of the at least one color component of the current picture, BD1 represents the sampling bit depth of the corresponding color component of the reference picture, and S' represents the modified sample.
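The bit-depth alignment formulas of claims 9 and 10 can be combined into one helper, sketched below. This is illustrative only: the function name is hypothetical, and since the claims leave offset0 unspecified, the sketch assumes the common rounding choice offset0 = 1 << (n - 1).

```python
def align_reference_samples(ref_samples, bd_current, bd_ref, rounding=True):
    """Align reference-picture samples to the current picture's bit depth.

    Lower reference bit depth  -> left shift, optionally with a rounding term
                                  (claim 9: S' = (S << n) + (1 << (n - 1))).
    Higher reference bit depth -> rounded right shift
                                  (claim 10: S' = Shift(S, n) = (S + offset0) >> n).
    """
    if bd_ref < bd_current:
        n = bd_current - bd_ref
        if rounding:
            return [(s << n) + (1 << (n - 1)) for s in ref_samples]
        return [s << n for s in ref_samples]
    if bd_ref > bd_current:
        n = bd_ref - bd_current
        offset0 = 1 << (n - 1)  # assumed value; the claim does not fix offset0
        return [(s + offset0) >> n for s in ref_samples]
    return list(ref_samples)  # equal bit depths: no modification needed
```

For example, converting the 8-bit sample 255 to a 10-bit picture without rounding gives 255 << 2 = 1020; converting the 10-bit sample 1020 back to 8 bits gives (1020 + 2) >> 2 = 255, so the round trip is lossless for samples that originated at the lower bit depth.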
11. The method of any of claims 1, 9 or 10, wherein the reference picture is removed after the modified reference picture is generated.
12. The method of any of claims 1, 9 or 10, wherein after generating the modified reference picture, the reference picture is preserved and marked as unavailable.
13. The method of claim 1, further comprising:
applying a first adaptive conversion process to a second video region within the current picture;
Applying a second adaptive conversion process to the second video region; and
performing a conversion between the bitstream of the second video region and the second video region based on the first adaptive conversion process and the second adaptive conversion process;
wherein the first adaptive conversion process is one of the adaptive bit depth conversion ABC process and the adaptive resolution change ARC process, and the second adaptive conversion process is the other of the adaptive bit depth conversion ABC process and the adaptive resolution change ARC process.
14. The method of claim 13, wherein the adaptive resolution change ARC process is performed before or after the adaptive bit depth conversion ABC process.
15. The method of claim 14, wherein the adaptive resolution change ARC process comprises:
for the second video region, determining a relationship between a resolution of the current picture and a resolution of a reference picture referenced by the second video region; and
in response to the relationship, a modification is performed on samples within the reference picture.
16. The method of claim 15, wherein the modifying comprises at least one of upsampling or downsampling.
17. The method of claim 1, further comprising:
determining at least one Merge candidate for a third video region within the current picture, wherein the at least one Merge candidate is assigned a priority based on a sampling bit depth of at least one color component of a reference picture to which the at least one Merge candidate refers;
determining, based on the priority, whether and/or how to insert the Merge candidate into a Merge candidate list of the third video region; and
performing a conversion between the bitstream of the third video region and the third video region based on the Merge candidate list.
18. The method of claim 17, wherein the at least one Merge candidate is inserted into the Merge candidate list before another Merge candidate that references a reference picture having a lower sampling bit depth than a reference picture referenced by the at least one Merge candidate.
19. The method of claim 17, wherein if a sampling bit depth of a reference picture referenced by the at least one Merge candidate is lower than a sampling bit depth of the current picture, the at least one Merge candidate is not allowed to be inserted into the Merge candidate list.
20. The method of claim 1, further comprising:
determining a plurality of adaptive loop filter (ALF) parameters; and
applying an adaptive loop filtering, ALF, process to samples within the current picture based on the plurality of ALF parameters,
wherein the plurality of ALF parameters are applicable to the samples based on corresponding sampling bit depths associated with the plurality of ALF parameters.
21. The method of claim 20, wherein the plurality of ALF parameters are signaled in a particular video unit.
22. The method of claim 21, wherein the particular video unit comprises an adaptive parameter set APS.
23. The method of claim 21 or 22, wherein the particular video unit is associated with one or more sampling bit depths.
24. The method of any of claims 20-22, wherein the plurality of ALF parameters are associated with one or more sampling bit depths.
25. The method of claim 23, wherein only ALF parameters of the plurality of ALF parameters associated with a same sampling bit depth are applicable to the samples.
26. The method of claim 23, wherein ALF parameters of the plurality of ALF parameters associated with different sampling bit depths are applicable to the samples.
27. The method of claim 23, wherein the plurality of ALF parameters are partitioned into at least first and second subsets associated with first and second sampling bit depths, respectively, and the first subset of the ALF parameters is inherited or predicted from the second subset of the ALF parameters.
28. The method of claim 27, wherein the first sampling bit depth is equal to the second sampling bit depth.
29. The method of claim 27, wherein the first sampling bit depth is different from the second sampling bit depth.
30. The method of claim 20, wherein the plurality of ALF parameters comprises a default ALF parameter, wherein different default ALF parameters are designed for different sampling bit depths.
31. The method of claim 1, further comprising:
determining a plurality of luma mapping with chroma scaling, LMCS, parameters; and
applying an LMCS process to samples within the current picture, wherein the plurality of LMCS parameters are applicable to the samples based on corresponding sampling bit depths associated with the plurality of LMCS parameters.
32. The method of claim 31, wherein the plurality of LMCS parameters are signaled in a particular video unit.
33. The method of claim 32, wherein the particular video unit comprises an adaptive parameter set APS.
34. The method of claim 32 or 33, wherein the particular video unit is associated with one or more sampling bit depths.
35. The method of claim 31 or 32, wherein the plurality of LMCS parameters are associated with one or more sampling bit depths.
36. The method of claim 34, wherein only LMCS parameters of the plurality of LMCS parameters associated with a same sampling bit depth are applicable to the samples.
37. The method of claim 34, wherein the plurality of LMCS parameters are divided into at least first and second subsets associated with first and second sampling bit depths, respectively, and the first subset of the LMCS parameters is inherited or predicted from the second subset of the LMCS parameters.
38. The method of claim 37, wherein the first sampling bit depth is equal to the second sampling bit depth.
39. The method of claim 37, wherein the first sampling bit depth is different from the second sampling bit depth.
40. The method of claim 1, further comprising:
determining whether and/or how to perform a particular codec scheme on a first video block within the current picture based on a relationship between a sampling bit depth of at least one of a plurality of reference pictures referenced by the first video block and a sampling bit depth of the current picture; and
based on the determination, performing a conversion between the bitstream of the first video block and the first video block.
41. The method of claim 40, wherein the particular codec scheme is disabled for the first video block if it is determined that a sampling bit depth of at least one of the plurality of reference pictures referenced by the first video block is different from the sampling bit depth of the current picture.
42. The method of claim 41, wherein information regarding the particular codec scheme is not signaled.
43. The method of claim 40, wherein the particular codec scheme is enabled for the first video block if the plurality of reference pictures referenced by the first video block do not include reference pictures having sampling bit depths that are different from sampling bit depths of the current picture.
44. The method of claim 43, wherein, during the conversion between the bitstream of the first video block and the first video block, a reference index indicating a reference picture having a sampling bit depth different from the sampling bit depth of the current picture is not signaled, or is skipped, for the first video block.
45. The method of claim 43, wherein the conversion between the bitstream of the first video block and the first video block is based on at least one of a plurality of Merge candidates arranged in a Merge candidate list, and the Merge candidate list excludes Merge candidates referencing reference pictures having a sampling bit depth different from a sampling bit depth of the current picture.
46. The method of claim 43, wherein the conversion between the bitstream of the first video block and the first video block is based on at least one of a plurality of Merge candidates arranged in a Merge candidate list, and any Merge candidate in the Merge candidate list that references a reference picture having a sampling bit depth different from the sampling bit depth of the current picture is skipped for the first video block during the conversion.
47. The method of any of claims 40-46, wherein the particular codec scheme is selected from a group comprising at least one of: an adaptive loop filtering ALF scheme, a luma mapping with chroma scaling LMCS scheme, a decoder-side motion vector refinement DMVR scheme, a bi-directional optical flow BDOF scheme, an affine prediction scheme, a triangle partition mode TPM scheme, a symmetric motion vector difference SMVD scheme, a Merge with motion vector difference MMVD scheme, an inter prediction scheme in versatile video coding VVC, a local illumination compensation LIC scheme, a history-based motion vector prediction HMVP scheme, a multiple transform selection MTS scheme, and a sub-block transform SBT scheme.
48. The method of claim 1, further comprising:
performing a conversion between a bitstream of a second video block of the video and the second video block,
wherein, if the two reference pictures referenced by the second video block have different sampling bit depths, bi-prediction from the two reference pictures is disabled for the second video block during the conversion between the bitstream of the second video block and the second video block.
49. The method of claim 1, wherein the converting comprises: encoding the first video region into the bitstream of the first video region, or decoding the first video region from the bitstream of the first video region.
50. The method of claim 1, wherein the first video region is a video block.
51. The method of claim 1, wherein one or more sets of sampling bit depths are applicable to one or more color components of the first video region during the adaptive bit depth conversion ABC process, and the one or more sets of sampling bit depths are signaled in a particular video unit.
52. The method of claim 51, wherein the particular video unit comprises at least one of a decoder parameter set DPS, a video parameter set VPS, a sequence parameter set SPS, a picture parameter set PPS, an adaptive parameter set APS, a picture header, a slice header, and a slice group header.
53. The method of claim 51, wherein the particular video unit comprises a supplemental enhancement information SEI message.
54. The method of claim 51, wherein the particular video unit comprises a separate video unit for the adaptive bit depth conversion.
55. The method of claim 51, wherein the one or more sets of sampling bit depths are coupled to corresponding dimension information of the current picture.
56. The method of claim 55, wherein the dimension information of the current picture includes an upsampling rate and a downsampling rate of the current picture.
57. The method of claim 55 or 56, wherein the dimension information of the current picture is signaled in the same particular video unit as the one or more sets of sampling bit depths.
58. The method of any of claims 51-56, wherein the set of sampling bit depths comprises at least one of an allowed maximum sampling bit depth and an allowed minimum sampling bit depth.
59. The method of any of claims 51-56, wherein the set of sampling bit depths comprises a difference between a bit depth threshold and at least one of an allowed maximum sampling bit depth and an allowed minimum sampling bit depth.
60. The method of claim 57, wherein more than one combination is configured according to the one or more sets of sampling bit depths and the corresponding dimension information, and any two of the more than one combinations are different.
61. An apparatus in a video system comprising a processor and a non-transitory memory having instructions thereon, wherein the instructions, when executed by the processor, cause the processor to implement the method of any of claims 1-60.
62. A non-transitory computer readable medium having stored thereon program code which, when executed, causes a processor to implement the method of any one of claims 1 to 60.
CN202080036229.4A 2019-05-16 2020-05-18 Adaptive bit depth conversion in video coding Active CN113826382B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
CNPCT/CN2019/087209 2019-05-16
CN2019087209 2019-05-16
PCT/CN2020/090800 WO2020228834A1 (en) 2019-05-16 2020-05-18 Adaptive bit-depth conversion in video coding

Publications (2)

Publication Number Publication Date
CN113826382A CN113826382A (en) 2021-12-21
CN113826382B true CN113826382B (en) 2023-06-20

Family

ID=73289350

Family Applications (3)

Application Number Title Priority Date Filing Date
CN202080036230.7A Active CN113841395B (en) 2019-05-16 2020-05-18 Adaptive resolution change in video coding and decoding
CN202080036229.4A Active CN113826382B (en) 2019-05-16 2020-05-18 Adaptive bit depth conversion in video coding
CN202080036227.5A Pending CN113875232A (en) 2019-05-16 2020-05-18 Adaptive color format conversion in video coding and decoding

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN202080036230.7A Active CN113841395B (en) 2019-05-16 2020-05-18 Adaptive resolution change in video coding and decoding

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN202080036227.5A Pending CN113875232A (en) 2019-05-16 2020-05-18 Adaptive color format conversion in video coding and decoding

Country Status (2)

Country Link
CN (3) CN113841395B (en)
WO (3) WO2020228835A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220182643A1 (en) * 2020-12-04 2022-06-09 Ofinno, Llc No Reference Image Quality Assessment Based Decoder Side Inter Prediction
CA3142044A1 (en) * 2020-12-14 2022-06-14 Comcast Cable Communications, Llc Methods and systems for improved content encoding
WO2022180261A1 (en) * 2021-02-26 2022-09-01 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Video coding concept allowing for limitation of drift
EP4087238A1 (en) * 2021-05-07 2022-11-09 Panasonic Intellectual Property Corporation of America Encoder, decoder, encoding method, and decoding method
WO2022268627A1 (en) * 2021-06-24 2022-12-29 Interdigital Vc Holdings France, Sas Methods and apparatuses for encoding/decoding a video
US20230281848A1 (en) * 2022-03-03 2023-09-07 Qualcomm Incorporated Bandwidth efficient image processing
WO2023171940A1 (en) * 2022-03-08 2023-09-14 현대자동차주식회사 Method and apparatus for video coding, using adaptive chroma conversion

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106233726A (en) * 2014-03-14 2016-12-14 Vid拓展公司 The system and method strengthened for rgb video coding
CN106664410A (en) * 2014-06-19 2017-05-10 Vid拓展公司 Systems and methods for model parameter optimization in three dimensional based color mapping

Family Cites Families (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101692829B1 (en) * 2008-06-12 2017-01-05 톰슨 라이센싱 Methods and apparatus for video coding and decoding with reduced bit-depth update mode and reduced chroma sampling update mode
KR101356613B1 (en) * 2009-08-21 2014-02-06 에스케이텔레콤 주식회사 Video Coding Method and Apparatus by Using Adaptive Motion Vector Resolution
JP5763210B2 (en) * 2011-04-21 2015-08-12 メディアテック インコーポレイテッド Method and apparatus for improved loop-type filtering process
JP6190103B2 (en) * 2012-10-29 2017-08-30 キヤノン株式会社 Moving picture coding apparatus, moving picture coding method, and program
CA2897152C (en) * 2013-01-07 2019-03-05 Kemal Ugur Inter-layer video encoding and decoding with adaptive resolution change at indicated switching points
CN105284116B (en) * 2013-04-19 2019-05-21 麦克赛尔株式会社 Image capture method and photographic device
US9225991B2 (en) * 2013-05-30 2015-12-29 Apple Inc. Adaptive color space transform coding
KR20160085891A (en) * 2013-11-24 2016-07-18 엘지전자 주식회사 Method and apparatus for encoding and decoding video signal using adaptive sampling
KR101789954B1 (en) * 2013-12-27 2017-10-25 인텔 코포레이션 Content adaptive gain compensated prediction for next generation video coding
US10182241B2 (en) * 2014-03-04 2019-01-15 Microsoft Technology Licensing, Llc Encoding strategies for adaptive switching of color spaces, color sampling rates and/or bit depths
CA2939434C (en) * 2014-03-04 2021-02-16 Microsoft Techology Licensing, Llc Adaptive switching of color spaces, color sampling rates and/or bit depths
US11070810B2 (en) * 2014-03-14 2021-07-20 Qualcomm Incorporated Modifying bit depths in color-space transform coding
US10104385B2 (en) * 2014-04-17 2018-10-16 Qualcomm Incorporated Signaling reference layers for 3D color prediction for color gamut scalability
CN105960802B (en) * 2014-10-08 2018-02-06 微软技术许可有限责任公司 Adjustment when switching color space to coding and decoding
ES2907602T3 (en) * 2014-12-31 2022-04-25 Nokia Technologies Oy Cross-layer prediction for scalable video encoding and decoding
KR101770300B1 (en) * 2015-06-09 2017-08-22 삼성전자주식회사 Method and apparatus for video encoding, method and apparatus for video decoding
WO2017052250A1 (en) * 2015-09-23 2017-03-30 엘지전자(주) Image encoding/decoding method and device for same
US20170105014A1 (en) * 2015-10-08 2017-04-13 Qualcomm Incorporated Luma-driven chroma scaling for high dynamic range and wide color gamut contents
EP3185556A1 (en) * 2015-12-21 2017-06-28 Thomson Licensing Method and apparatus for combined adaptive resolution and internal bit-depth increase coding
US11750832B2 (en) * 2017-11-02 2023-09-05 Hfi Innovation Inc. Method and apparatus for video coding

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106233726A (en) * 2014-03-14 2016-12-14 Vid拓展公司 The system and method strengthened for rgb video coding
CN106664410A (en) * 2014-06-19 2017-05-10 Vid拓展公司 Systems and methods for model parameter optimization in three dimensional based color mapping

Also Published As

Publication number Publication date
CN113841395A (en) 2021-12-24
CN113841395B (en) 2022-10-25
WO2020228835A1 (en) 2020-11-19
CN113826382A (en) 2021-12-21
WO2020228834A1 (en) 2020-11-19
CN113875232A (en) 2021-12-31
WO2020228833A1 (en) 2020-11-19

Similar Documents

Publication Publication Date Title
CN113826386B (en) Selective use of codec tools in video processing
US11671602B2 (en) Signaling for reference picture resampling
CN113826382B (en) Adaptive bit depth conversion in video coding
CN114556916B (en) High level syntax for video codec tools
JP7391203B2 (en) Use and signaling to refine video coding tools
CN114641992B (en) Signaling of reference picture resampling
CN114556937A (en) Downsampling filter types for chroma blend mask generation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant