CN117651132A - Method and apparatus for signaling post-loop filter information for neural networks

Publication number: CN117651132A
Application number: CN202310098962.4A
Authority: CN (China)
Document language: Chinese (zh)
Legal status: Pending
Inventors: Sachin G. Deshpande (萨钦·G·德施潘德); Ahmed Cheikh Sidiya (艾哈迈德·谢赫·西迪亚)
Original and current assignee: Sharp Corp
Priority claimed from: U.S. application No. 17/956,403 (published as US 2024/0089510 A1)
Prior art keywords: post, nnpfc, picture, video, pictures
Classification: Compression Or Coding Systems Of Tv Signals

Abstract

The present invention discloses a device that can be configured to signal neural network post-filter parameter information. In one example, a device transmits a neural network post-filter characteristic message that signals a syntax element that specifies a number of interpolated pictures generated by a post-processing filter between successive pictures that serve as inputs to the post-processing filter. In one example, the neural network post-filter characteristic message includes a syntax element specifying a number of decoded output pictures that are used as inputs to the post-processing filter.

Description

Method and apparatus for signaling post-loop filter information for neural networks
RELATED APPLICATIONS
The present application claims the benefit of U.S. provisional patent application No. 63/403,598, filed in 2022, which application is incorporated herein by reference in its entirety.
Technical Field
The present disclosure relates to video coding, and more particularly to a system and method for signaling neural network post-filter parameter information of coded video.
Background
Digital video functionality may be incorporated into a variety of devices, including digital televisions, laptop or desktop computers, tablet computers, digital recording devices, digital media players, video gaming devices, cellular telephones (including so-called smartphones), medical imaging devices, and the like. Digital video may be coded according to a video coding standard. A video coding standard defines the format of a compliant bitstream encapsulating coded video data. A compliant bitstream is a data structure that may be received and decoded by a video decoding device to generate reconstructed video data. A video coding standard may incorporate video compression techniques. Examples of video coding standards include ISO/IEC MPEG-4 Visual, ITU-T H.264 (also known as ISO/IEC MPEG-4 AVC), and High Efficiency Video Coding (HEVC). HEVC is described in High Efficiency Video Coding (HEVC), Rec. ITU-T H.265, December 2016, which is incorporated herein by reference and referred to herein as ITU-T H.265. Extensions and improvements to ITU-T H.265 are being considered for the development of next-generation video coding standards. For example, the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG), collectively referred to as the Joint Video Experts Team (JVET), are studying the standardization of video coding technology with a compression capability that significantly exceeds that of the current HEVC standard. The Joint Exploration Model 7 (JEM 7), Algorithm Description of Joint Exploration Test Model 7 (JEM 7), ISO/IEC JTC1/SC29/WG11 document JVET-G1001 (July 2017, Turin, Italy), which is incorporated herein by reference, describes the coding features under coordinated test model study by JVET as potentially enhancing video coding technology beyond the capabilities of ITU-T H.265. It should be noted that the coding features of JEM 7 are implemented in JEM reference software. As used herein, the term JEM may collectively refer to the algorithms included in JEM 7 and implementations of JEM reference software. Further, in response to the "Joint Call for Proposals on Video Compression with Capabilities beyond HEVC" jointly issued by VCEG and MPEG, multiple groups proposed various descriptions of video coding tools at the 10th meeting of ISO/IEC JTC1/SC29/WG11, held 16-20 April 2018 in San Diego, CA. Based on the various descriptions of video coding tools, a resulting initial draft text of a video coding specification is described in "Versatile Video Coding (Draft 1)," 10th meeting of ISO/IEC JTC1/SC29/WG11, 16-20 April 2018, San Diego, CA, document JVET-J1001-v2, which is incorporated herein by reference and referred to as JVET-J1001. This video coding standard development effort by VCEG and MPEG is known as the Versatile Video Coding (VVC) project. "Versatile Video Coding (Draft 10)," 20th meeting of ISO/IEC JTC1/SC29/WG11, held 7-16 October 2020 by teleconference, document JVET-T2001-v2, which is incorporated herein by reference and referred to as JVET-T2001, represents the current iteration of the draft text of the video coding specification corresponding to the VVC project.
Video compression techniques can reduce the data requirements for storing and transmitting video data. Video compression techniques may reduce data requirements by exploiting the redundancies inherent in a video sequence. Video compression techniques may subdivide a video sequence into successively smaller portions (i.e., groups of pictures within the video sequence, a picture within a group of pictures, regions within a picture, subregions within regions, etc.). Intra prediction coding techniques (e.g., spatial prediction techniques within a picture) and inter prediction techniques (i.e., inter-picture, temporal techniques) may be used to generate difference values between a unit of video data to be coded and a reference unit of video data. The difference values may be referred to as residual data. Residual data may be coded as quantized transform coefficients. Syntax elements may relate residual data and a reference coding unit (e.g., intra prediction mode indices and motion information). Residual data and syntax elements may be entropy coded. Entropy encoded residual data and syntax elements may be included in data structures forming a compliant bitstream.
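By way of illustration only (not part of any referenced specification; the helper names are assumptions), the following sketch shows how residual data may be formed as the difference between a current block and its prediction, and how a block may be reconstructed by adding the residual back to the prediction:

```cpp
// Illustrative sketch: residual formation and reconstruction for one block of samples.
#include <cstdint>
#include <vector>

// Residual = current sample minus the co-located prediction sample.
std::vector<int16_t> make_residual(const std::vector<uint8_t>& cur,
                                   const std::vector<uint8_t>& pred) {
    std::vector<int16_t> res(cur.size());
    for (std::size_t i = 0; i < cur.size(); ++i)
        res[i] = static_cast<int16_t>(cur[i]) - static_cast<int16_t>(pred[i]);
    return res;
}

// Reconstruction = prediction plus residual, clipped to the 8-bit sample range.
std::vector<uint8_t> reconstruct(const std::vector<uint8_t>& pred,
                                 const std::vector<int16_t>& res) {
    std::vector<uint8_t> rec(pred.size());
    for (std::size_t i = 0; i < pred.size(); ++i) {
        int v = pred[i] + res[i];
        rec[i] = static_cast<uint8_t>(v < 0 ? 0 : (v > 255 ? 255 : v));
    }
    return rec;
}
```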
Disclosure of Invention
In general, this disclosure describes various techniques for coding video data. In particular, this disclosure describes techniques for signaling neural network post-filter parameter information for coded video data. It should be noted that although the techniques of this disclosure are described with respect to ITU-T H.264, ITU-T H.265, JEM, and JVET-T2001, the techniques of this disclosure are generally applicable to video coding. For example, the coding techniques described herein may be incorporated into video coding systems (including video coding systems based on future video coding standards) including video block structures, intra prediction techniques, inter prediction techniques, transform techniques, filtering techniques, and/or other entropy coding techniques, other than those included in ITU-T H.265, JEM, and JVET-T2001. Thus, references to ITU-T H.264, ITU-T H.265, JEM, and/or JVET-T2001 are for descriptive purposes and should not be construed to limit the scope of the techniques described herein. Further, it should be noted that the incorporation by reference of documents herein is for descriptive purposes and should not be construed to limit or create ambiguity with respect to the terms used herein. For example, where a definition of a term provided in an incorporated reference differs from that of another incorporated reference and/or the term as used herein, the term should be interpreted broadly to include each respective definition and/or to include each particular definition in the alternative.
In one example, a method of encoding video data comprises signaling a neural network post-filter characteristics message, and signaling a syntax element specifying a number of interpolated pictures generated by a post-processing filter between consecutive pictures used as input to the post-processing filter.
In one example, a device comprises one or more processors configured to signal a neural network post-filter characteristics message, and signal a syntax element specifying a number of interpolated pictures generated by a post-processing filter between consecutive pictures used as input to the post-processing filter.
In one example, a non-transitory computer-readable storage medium comprises instructions stored thereon that, when executed, cause one or more processors of a device to signal a neural network post-filter characteristics message, and signal a syntax element specifying a number of interpolated pictures generated by a post-processing filter between consecutive pictures used as input to the post-processing filter.
In one example, an apparatus comprises means for signaling a neural network post-filter characteristics message, and means for signaling a syntax element specifying a number of interpolated pictures generated by a post-processing filter between consecutive pictures used as input to the post-processing filter.
In one example, a method of decoding video data comprises receiving a neural network post-filter characteristics message, and parsing a syntax element specifying a number of interpolated pictures generated by a post-processing filter between consecutive pictures used as input to the post-processing filter.
In one example, a device comprises one or more processors configured to receive a neural network post-filter characteristics message, and parse a syntax element specifying a number of interpolated pictures generated by a post-processing filter between consecutive pictures used as input to the post-processing filter.
In one example, a non-transitory computer-readable storage medium comprises instructions stored thereon that, when executed, cause one or more processors of a device to receive a neural network post-filter characteristics message, and parse a syntax element specifying a number of interpolated pictures generated by a post-processing filter between consecutive pictures used as input to the post-processing filter.
In one example, an apparatus comprises means for receiving a neural network post-filter characteristics message, and means for parsing a syntax element specifying a number of interpolated pictures generated by a post-processing filter between consecutive pictures used as input to the post-processing filter.
The details of one or more examples are set forth in the accompanying drawings and the description below. Other features, objects, and advantages will be apparent from the description and drawings, and from the claims.
Drawings
Fig. 1 is a block diagram illustrating an example of a system that may be configured to encode and decode video data in accordance with one or more techniques of the present disclosure.
Fig. 2 is a conceptual diagram illustrating encoded video data and corresponding data structures according to one or more techniques of the present disclosure.
Fig. 3 is a conceptual diagram illustrating a data structure encapsulating encoded video data and corresponding metadata according to one or more techniques of the present disclosure.
Fig. 4 is a conceptual diagram illustrating an example of components that may be included in a particular implementation of a system that may be configured to encode and decode video data in accordance with one or more techniques of the present disclosure.
Fig. 5 is a block diagram illustrating an example of a video encoder that may be configured to encode video data in accordance with one or more techniques of this disclosure.
Fig. 6 is a block diagram illustrating an example of a video decoder that may be configured to decode video data in accordance with one or more techniques of this disclosure.
Fig. 7 is a conceptual diagram illustrating an example of a packed data channel of a luminance component according to one or more techniques of this disclosure.
Detailed Description
Video content comprises a video sequence consisting of a series of frames (or pictures). A series of frames may also be referred to as a group of pictures (GOP). Each video frame or picture may be divided into one or more regions. The region may be defined according to a base unit (e.g., video block) and a set of rules defining the region. For example, a rule defining a region may be that the region must be an integer number of video blocks arranged in a rectangle. Further, the video blocks in the region may be ordered according to a scan pattern (e.g., raster scan). As used herein, the term "video block" may generally refer to a region of a picture, or may more particularly refer to a largest array of sample values, sub-partitions thereof, and/or corresponding structures that may be predictively encoded. Furthermore, the term "current video block" may refer to an area of a picture being encoded or decoded. A video block may be defined as an array of sample values. It should be noted that in some cases, pixel values may be described as sample values that include corresponding components of video data, which may also be referred to as color components (e.g., luminance (Y) and chrominance (Cb and Cr) components or red, green, and blue components). It should be noted that in some cases, the terms "pixel value" and "sample value" may be used interchangeably. Further, in some cases, a pixel or sample may be referred to as pel. The video sampling format (which may also be referred to as chroma format) may define the number of chroma samples included in a video block relative to the number of luma samples included in the video block. For example, for a 4:2:0 sampling format, the sampling rate of the luma component is twice the sampling rate of the chroma components in both the horizontal and vertical directions.
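By way of illustration only (not drawn from any referenced specification; the helper name is an assumption), the following sketch computes chroma plane dimensions from luma plane dimensions for common chroma formats, using the SubWidthC/SubHeightC convention:

```cpp
// Illustrative sketch: chroma plane dimensions for common chroma formats.
#include <cstdio>
#include <utility>

// Returns {chromaWidth, chromaHeight} given the luma dimensions and subsampling factors.
std::pair<int, int> chromaDims(int lumaW, int lumaH, int subWidthC, int subHeightC) {
    return { lumaW / subWidthC, lumaH / subHeightC };
}

int main() {
    auto c420 = chromaDims(1920, 1080, 2, 2);  // 4:2:0 -> 960 x 540 (subsampled 2x both directions)
    auto c422 = chromaDims(1920, 1080, 2, 1);  // 4:2:2 -> 960 x 1080
    auto c444 = chromaDims(1920, 1080, 1, 1);  // 4:4:4 -> 1920 x 1080
    std::printf("4:2:0 %dx%d, 4:2:2 %dx%d, 4:4:4 %dx%d\n",
                c420.first, c420.second, c422.first, c422.second, c444.first, c444.second);
    return 0;
}
```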
A video encoder may perform predictive encoding on video blocks and sub-divisions thereof. Video blocks and their sub-divisions may be referred to as nodes. ITU-T H.264 specifies a macroblock including 16x16 luma samples. That is, in ITU-T H.264, a picture is segmented into macroblocks. ITU-T H.265 specifies an analogous Coding Tree Unit (CTU) structure, which may be referred to as a Largest Coding Unit (LCU). In ITU-T H.265, pictures are segmented into CTUs. In ITU-T H.265, for a picture, the CTU size may be set as including 16x16, 32x32, or 64x64 luma samples. In ITU-T H.265, a CTU is composed of respective Coding Tree Blocks (CTBs) for each component of video data, e.g., luma (Y) and chroma (Cb and Cr). It should be noted that video having one luma component and the two corresponding chroma components may be described as having two channels, i.e., a luma channel and a chroma channel. Further, in ITU-T H.265, a CTU may be partitioned according to a quadtree (QT) partitioning structure, which results in the CTBs of the CTU being partitioned into Coding Blocks (CBs). That is, in ITU-T H.265, a CTU may be partitioned into quadtree leaf nodes. According to ITU-T H.265, one luma CB together with two corresponding chroma CBs and associated syntax elements is referred to as a Coding Unit (CU). In ITU-T H.265, a minimum allowed size of a CB may be signaled. In ITU-T H.265, the smallest minimum allowed size of a luma CB is 8x8 luma samples. In ITU-T H.265, the decision to code a picture area using intra prediction or inter prediction is made at the CU level.
In ITU-T H.265, a CU is associated with a prediction unit structure that has its root at the CU. In ITU-T H.265, the prediction unit structure allows partitioning of luma CB and chroma CB to generate corresponding reference samples. That is, in ITU-T h.265, luminance CB and chrominance CB may be partitioned into respective luminance prediction blocks and chrominance Prediction Blocks (PB), where PB comprises blocks of sample values to which the same prediction is applied. In ITU-T H.265, CBs can be divided into 1, 2 or 4 PBs. ITU-T h.265 supports PB sizes from 64 x 64 samples down to 4 x 4 samples. In ITU-T h.265, square PB is supported for intra prediction, where CB may form PB or CB may be partitioned into four square PB. In ITU-T h.265, rectangular PB is supported for inter prediction in addition to square PB, where CB may be halved vertically or horizontally to form PB. Furthermore, it should be noted that in ITU-T h.265, for inter prediction, four asymmetric PB partitioning is supported, where CB is partitioned into two PB at one quarter of the height (top or bottom) or width (left or right) of CB. Intra-prediction data (e.g., intra-prediction mode syntax elements) or inter-prediction data (e.g., motion data syntax elements) corresponding to the PB are used to generate reference and/or prediction sample values for the PB.
JEM specifies a CTU having a maximum size of 256x256 luma samples. JEM specifies a quadtree plus binary tree (QTBT) block structure. In JEM, the QTBT structure enables quadtree leaf nodes to be further partitioned by a binary tree (BT) structure. That is, in JEM, the binary tree structure enables quadtree leaf nodes to be recursively divided vertically or horizontally. In JVET-T2001, CTUs are partitioned according to a quadtree plus multi-type tree (QTMT or QT+MTT) structure. The QTMT in JVET-T2001 is similar to the QTBT in JEM. However, in JVET-T2001, the multi-type tree may indicate so-called ternary (or triple tree (TT)) splits in addition to binary splits. A ternary split divides a block vertically or horizontally into three blocks. In the case of a vertical TT split, the block is split at one quarter of its width from the left edge and at one quarter of its width from the right edge, and in the case of a horizontal TT split, the block is split at one quarter of its height from the top edge and at one quarter of its height from the bottom edge.
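By way of illustration only (an assumption for explanatory purposes, not the draft text), the following sketch computes the three sub-block rectangles produced by vertical and horizontal ternary-tree splits at one quarter of the block width or height:

```cpp
// Illustrative sketch: 1:2:1 ternary-tree (TT) split of a block.
#include <array>

struct Rect { int x, y, w, h; };

// Vertical TT split: left and right parts are each w/4 wide, the centre part is w/2 wide.
std::array<Rect, 3> splitTTVertical(const Rect& b) {
    int q = b.w / 4;
    return { Rect{ b.x,           b.y, q,           b.h },
             Rect{ b.x + q,       b.y, b.w - 2 * q, b.h },
             Rect{ b.x + b.w - q, b.y, q,           b.h } };
}

// Horizontal TT split: top and bottom parts are each h/4 tall, the centre part is h/2 tall.
std::array<Rect, 3> splitTTHorizontal(const Rect& b) {
    int q = b.h / 4;
    return { Rect{ b.x, b.y,           b.w, q           },
             Rect{ b.x, b.y + q,       b.w, b.h - 2 * q },
             Rect{ b.x, b.y + b.h - q, b.w, q           } };
}
```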
As described above, each video frame or picture may be divided into one or more regions. For example, according to ITU-T h.265, each video frame or picture may be divided to include one or more slices, and further divided to include one or more tiles, wherein each slice includes a sequence of CTUs (e.g., arranged in raster scan order), and wherein a tile is a sequence of CTUs corresponding to a rectangular region of the picture. It should be noted that in ITU-T h.265, a slice is a sequence of one or more slice segments starting from an independent slice segment and containing all subsequent dependent slice segments (if any) before the next independent slice segment (if any). A slice segment (e.g., slice) is a CTU sequence. Thus, in some cases, the terms "slice" and "slice segment" are used interchangeably to indicate a sequence of CTUs arranged in a raster scan order arrangement. Further, it should be noted that in ITU-T H.265, a tile may be composed of CTUs contained in more than one slice, and a slice may be composed of CTUs contained in more than one tile. However, ITU-T H.265 specifies that one or both of the following conditions should be met: (1) all CTUs in a slice belong to the same tile; and (2) all CTUs in a tile belong to the same slice.
With respect to JVET-T2001, slices are required to consist of an integer number of complete tiles or an integer number of consecutive complete CTU rows within a tile, instead of only being required to consist of an integer number of CTUs. It should be noted that in JVET-T2001, the slice design does not include slice segments (i.e., there are no independent/dependent slice segments). Thus, in JVET-T2001, a picture may include a single tile, where the single tile is contained within a single slice, or a picture may include multiple tiles, where the multiple tiles (or CTU rows thereof) may be contained within one or more slices. In JVET-T2001, the partitioning of a picture into tiles is specified by specifying respective heights for tile rows and respective widths for tile columns. Thus, in JVET-T2001, a tile is a rectangular region of CTUs within a particular tile row and a particular tile column position. Further, it should be noted that JVET-T2001 provides that a picture may be partitioned into subpictures, where a subpicture is a rectangular region of CTUs within a picture. The top-left CTU of a subpicture may be located at any CTU position within the picture, with the subpicture constrained to include one or more slices. Thus, unlike a tile, a subpicture is not necessarily limited to a particular row and column position. It should be noted that subpictures may be useful for encapsulating regions of interest within a picture, and a sub-bitstream extraction process may be used to decode and display only a particular region of interest. That is, as described in further detail below, a bitstream of coded video data includes a sequence of Network Abstraction Layer (NAL) units, where a NAL unit encapsulates coded video data (i.e., video data corresponding to a slice of a picture) or a NAL unit encapsulates metadata used for decoding video data (e.g., a parameter set), and a sub-bitstream extraction process forms a new bitstream by removing one or more NAL units from the bitstream.
Fig. 2 is a conceptual diagram illustrating an example of a picture within a group of pictures partitioned according to tiles, slices, and subpictures. It should be noted that the techniques described herein may be applicable to tiles, slices, subpictures, sub-divisions thereof, and/or equivalent structures thereto. That is, the techniques described herein may be generally applicable regardless of how a picture is partitioned into regions. For example, in some cases, the techniques described herein may be applicable in cases where tiles may be partitioned into so-called bricks, where a brick is a rectangular region of CTU rows within a particular tile. Further, for example, in some cases, the techniques described herein may be applicable in cases where one or more tiles may be included in so-called tile groups, where a tile group includes an integer number of adjacent tiles. In the example illustrated in Fig. 2, Pic3 is illustrated as including 16 tiles (i.e., Tile0 to Tile15) and three slices (i.e., Slice0 to Slice2). In the example illustrated in Fig. 2, Slice0 includes four tiles (i.e., Tile0 to Tile3), Slice1 includes eight tiles (i.e., Tile4 to Tile11), and Slice2 includes four tiles (i.e., Tile12 to Tile15). Further, as illustrated in the example of Fig. 2, Pic3 is illustrated as including two subpictures (i.e., Subpicture0 and Subpicture1), where Subpicture0 includes Slice0 and Slice1 and where Subpicture1 includes Slice2. As described above, subpictures may be useful for encapsulating regions of interest within a picture, and a sub-bitstream extraction process may be used to selectively decode (and display) a region of interest. For example, referring to Fig. 2, Subpicture0 may correspond to an action portion of a sporting event presentation (e.g., a view of the field) and Subpicture1 may correspond to a scrolling banner displayed during the sporting event presentation. By organizing a picture into subpictures in this manner, a viewer may be able to disable the display of the scrolling banner. That is, through a sub-bitstream extraction process, the Slice2 NAL unit may be removed from the bitstream (and thus not decoded and/or displayed) and the Slice0 NAL unit and Slice1 NAL unit may be decoded and displayed. The encapsulation of slices of a picture into respective NAL unit data structures and sub-bitstream extraction are described in further detail below.
For intra prediction encoding, an intra prediction mode may specify the location of a reference sample within a picture. In ITU-T H.265, the possible intra prediction modes that have been defined include a planar (i.e., surface-fitting) prediction mode, a DC (i.e., flat ensemble average) prediction mode, and 33 angular prediction modes (predMode: 2-34). In JEM, the possible intra prediction modes that have been defined include a planar prediction mode, a DC prediction mode, and 65 angular prediction modes. It should be noted that the plane prediction mode and the DC prediction mode may be referred to as a non-directional prediction mode, and the angle prediction mode may be referred to as a directional prediction mode. It should be noted that the techniques described herein may be universally applicable regardless of the number of possible prediction modes that have been defined.
For inter prediction coding, a reference picture is determined and a motion vector (MV) identifies samples in the reference picture that are used to generate a prediction for a current video block. For example, a current video block may be predicted using reference sample values located in one or more previously coded pictures, and a motion vector is used to indicate the location of the reference block relative to the current video block. A motion vector may describe, for example, a horizontal displacement component of the motion vector (i.e., MVx), a vertical displacement component of the motion vector (i.e., MVy), and a resolution for the motion vector (e.g., quarter-pixel precision, half-pixel precision, one-pixel precision, two-pixel precision, four-pixel precision). Previously decoded pictures, which may include pictures output before or after the current picture, may be organized into one or more reference picture lists and identified using reference picture index values. Further, in inter prediction coding, uni-prediction refers to generating a prediction using sample values from a single reference picture, and bi-prediction refers to generating a prediction using respective sample values from two reference pictures. That is, in uni-prediction, a single reference picture and a corresponding motion vector are used to generate a prediction for the current video block, and in bi-prediction, a first reference picture and corresponding first motion vector and a second reference picture and corresponding second motion vector are used to generate a prediction for the current video block. In bi-prediction, the respective sample values are combined (e.g., added, rounded, and clipped, or averaged according to weights) to generate the prediction. Pictures and regions thereof may be classified based on which types of prediction modes may be utilized for encoding the video blocks thereof. That is, for regions having a B type (e.g., a B slice), bi-prediction, uni-prediction, and intra prediction modes may be utilized; for regions having a P type (e.g., a P slice), uni-prediction and intra prediction modes may be utilized; and for regions having an I type (e.g., an I slice), only intra prediction modes may be utilized. As described above, reference pictures are identified through reference indices. For example, for a P slice, there may be a single reference picture list, RefPicList0, and for a B slice, there may be a second independent reference picture list, RefPicList1, in addition to RefPicList0. It should be noted that for uni-prediction in a B slice, one of RefPicList0 or RefPicList1 may be used to generate the prediction. Further, it should be noted that during the decoding process, at the onset of decoding a picture, a reference picture list is generated from previously decoded pictures stored in a decoded picture buffer (DPB).
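By way of illustration only (an assumption, not the normative inter prediction process), the following sketch combines co-located samples from two reference blocks to form a bi-prediction with rounding:

```cpp
// Illustrative sketch: bi-prediction as the rounded average of two reference blocks.
// In a real decoder, each reference block would first be fetched from a reference
// picture at the position indicated by the corresponding motion vector.
#include <cstdint>
#include <vector>

std::vector<uint8_t> biPredict(const std::vector<uint8_t>& ref0,
                               const std::vector<uint8_t>& ref1) {
    std::vector<uint8_t> pred(ref0.size());
    for (std::size_t i = 0; i < ref0.size(); ++i)
        pred[i] = static_cast<uint8_t>((ref0[i] + ref1[i] + 1) >> 1);  // add, round, shift
    return pred;
}
```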
Further, a coding standard may support various modes of motion vector prediction. Motion vector prediction enables the value of a motion vector for a current video block to be derived based on another motion vector. For example, a set of candidate blocks having associated motion information may be derived from spatial neighboring blocks and temporal neighboring blocks of the current video block. Further, generated (or default) motion information may be used for motion vector prediction. Examples of motion vector prediction include Advanced Motion Vector Prediction (AMVP), Temporal Motion Vector Prediction (TMVP), so-called "merge" mode, and "skip" and "direct" motion inference. Further, other examples of motion vector prediction include Advanced Temporal Motion Vector Prediction (ATMVP) and Spatial-Temporal Motion Vector Prediction (STMVP). For motion vector prediction, both a video encoder and a video decoder perform the same process to derive a set of candidates. Thus, for a current video block, the same set of candidates is generated during encoding and decoding.
As described above, for inter prediction coding, reference samples in a previously coded picture are used for coding video blocks in a current picture. Previously coded pictures that are available for use as a reference when coding the current picture are referred to as reference pictures. It should be noted that the decoding order does not necessarily correspond to the picture output order, i.e., the temporal order of pictures in the video sequence. In ITU-T H.265, when a picture is decoded it is stored to a decoded picture buffer (DPB) (which may be referred to as a frame buffer, a reference picture buffer, or the like). In ITU-T H.265, pictures stored to the DPB are removed from the DPB when they have been output and are no longer needed for coding subsequent pictures. In ITU-T H.265, a determination of whether pictures should be removed from the DPB is invoked once per picture, after decoding a slice header, i.e., at the onset of decoding a picture. For example, referring to Fig. 2, Pic2 is illustrated as referencing Pic1. Similarly, Pic3 is illustrated as referencing Pic0. With respect to Fig. 2, assuming the picture numbers correspond to the decoding order, the DPB would be populated as follows: after decoding Pic0, the DPB would include {Pic0}; at the onset of decoding Pic1, the DPB would include {Pic0}; after decoding Pic1, the DPB would include {Pic0, Pic1}; at the onset of decoding Pic2, the DPB would include {Pic0, Pic1}. Pic2 would then be decoded with reference to Pic1, and after decoding Pic2, the DPB would include {Pic0, Pic1, Pic2}. At the onset of decoding Pic3, pictures Pic1 and Pic2 would be marked for removal from the DPB, as they are not needed for decoding Pic3 (or any subsequent pictures, not shown) and, assuming Pic1 and Pic2 have been output, the DPB would be updated to include {Pic0}. Pic3 would then be decoded by referencing Pic0. The process of marking pictures for removal from the DPB may be referred to as reference picture set (RPS) management.
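By way of illustration only (an assumption, not the normative reference picture management process), the following sketch shows minimal DPB bookkeeping in which a picture is removed once it has been output and is no longer needed for reference:

```cpp
// Illustrative sketch: a decoded picture stays in the DPB while it is still needed
// as a reference or has not yet been output.
#include <algorithm>
#include <set>
#include <vector>

struct DpbEntry { int poc; bool neededForReference; bool outputDone; };

void updateDpb(std::vector<DpbEntry>& dpb, const std::set<int>& stillReferenced) {
    // Mark which pictures are still referenced by the pictures left to decode.
    for (auto& e : dpb)
        e.neededForReference = stillReferenced.count(e.poc) != 0;
    // Remove pictures that have been output and are no longer used for reference.
    dpb.erase(std::remove_if(dpb.begin(), dpb.end(),
                             [](const DpbEntry& e) {
                                 return e.outputDone && !e.neededForReference;
                             }),
              dpb.end());
}
```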
As described above, intra prediction data or inter prediction data is used to generate reference sample values for a block of sample values. The difference between sample values included in a current PB, or another type of picture area structure, and the associated reference samples (e.g., those generated using a prediction) may be referred to as residual data. Residual data may include respective arrays of difference values corresponding to each component of video data. Residual data may be in the pixel domain. A transform, such as a Discrete Cosine Transform (DCT), a Discrete Sine Transform (DST), an integer transform, a wavelet transform, or a conceptually similar transform, may be applied to an array of difference values to generate transform coefficients. It should be noted that in ITU-T H.265 and JVET-T2001, a CU is associated with a transform tree structure having its root at the CU level. The transform tree is partitioned into one or more Transform Units (TUs). That is, an array of difference values may be partitioned for purposes of generating transform coefficients (e.g., four 8x8 transforms may be applied to a 16x16 array of residual values). For each component of video data, such sub-divisions of difference values may be referred to as Transform Blocks (TBs). It should be noted that in some cases, a core transform and a subsequent secondary transform may be applied (in the video encoder) to generate transform coefficients. For a video decoder, the order of the transforms is reversed.
A quantization process may be performed on transform coefficients or on residual sample values directly (e.g., in the case of palette coding quantization). Quantization approximates transform coefficients by limiting their amplitudes to a set of specified values. Quantization essentially scales transform coefficients in order to vary the amount of data required to represent a group of transform coefficients. Quantization may include division of transform coefficients (or values resulting from adding an offset value to transform coefficients) by a quantization scaling factor and any associated rounding functions (e.g., rounding to the nearest integer). Quantized transform coefficients may be referred to as coefficient level values. Inverse quantization (or "dequantization") may include multiplication of coefficient level values by the quantization scaling factor, and any reciprocal rounding or offset addition operations. It should be noted that, as used herein, the term quantization process may refer in some cases to division by a scaling factor to generate level values and in some cases to multiplication by a scaling factor to recover transform coefficients. That is, a quantization process may refer to quantization in some cases and to inverse quantization in others. Further, it should be noted that although in some of the examples below quantization processes are described with respect to arithmetic operations associated with decimal notation, such descriptions are for illustrative purposes and should not be construed as limiting. For example, the techniques described herein may be implemented in a device using binary operations and the like. For example, the multiplication and division operations described herein may be implemented using bit-shifting operations and the like.
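By way of illustration only (an assumption, not the normative scaling process), the following sketch shows scalar quantization of a transform coefficient to a coefficient level value by division and rounding, and the corresponding inverse quantization by multiplication:

```cpp
// Illustrative sketch: scalar quantization and dequantization with a scaling factor.
#include <cmath>

// Forward quantization: divide by a scaling factor and round to the nearest integer.
int quantize(double coeff, double qScale) {
    return static_cast<int>(std::lround(coeff / qScale));
}

// Inverse quantization: multiply the level value by the scaling factor.
double dequantize(int level, double qScale) {
    return level * qScale;
}

// Example: quantize(37.6, 8.0) -> 5 (coefficient level value);
//          dequantize(5, 8.0)  -> 40.0, an approximation of the original coefficient.
```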
Quantized transform coefficients and syntax elements (e.g., syntax elements indicating the coding structure of a video block) may be entropy coded according to an entropy coding technique. An entropy coding process includes coding the values of syntax elements using a lossless data compression algorithm. Examples of entropy coding techniques include Content Adaptive Variable Length Coding (CAVLC), Context Adaptive Binary Arithmetic Coding (CABAC), Probability Interval Partitioning Entropy coding (PIPE), and the like. Entropy encoded quantized transform coefficients and corresponding entropy encoded syntax elements may form a compliant bitstream that can be used to reproduce video data at a video decoder. An entropy coding process, for example CABAC, may include binarizing syntax elements. Binarization refers to the process of converting the value of a syntax element into a series of one or more bits. These bits may be referred to as "bins". Binarization may include one or a combination of the following coding techniques: fixed length coding, unary coding, truncated Rice coding, Golomb coding, k-th order exponential Golomb coding, and Golomb-Rice coding. For example, binarization may include representing the integer value 5 of a syntax element as 00000101 using an 8-bit fixed length binarization technique, or representing the integer value 5 as 11110 using a unary coding binarization technique. As used herein, the terms fixed length coding, unary coding, truncated Rice coding, Golomb coding, k-th order exponential Golomb coding, and Golomb-Rice coding may each refer to general implementations of these techniques and/or more specific implementations of these coding techniques. For example, a Golomb-Rice coding implementation may be specifically defined according to a video coding standard. In the example of CABAC, for a particular bin, a context provides a most probable state (MPS) value for the bin (i.e., the MPS of the bin is one of 0 or 1) and a probability value of the bin being the MPS or the least probable state (LPS). For example, a context may indicate that the MPS of a bin is 0 and the probability of the bin being 1 is 0.3. It should be noted that a context may be determined based on the values of previously coded bins, including bins in the current syntax element and in previously coded syntax elements. For example, the values of syntax elements associated with neighboring video blocks may be used to determine the context of a current bin.
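By way of illustration only (an assumption; the unary convention follows the example above, and other conventions exist), the following sketch produces the fixed-length and unary binarizations described above:

```cpp
// Illustrative sketch: two simple binarization schemes, fixed-length and unary.
#include <string>

// Fixed-length binarization: write the value using nBits bits, most significant bit first.
std::string fixedLengthBin(unsigned value, int nBits) {
    std::string bins;
    for (int b = nBits - 1; b >= 0; --b)
        bins += ((value >> b) & 1u) ? '1' : '0';
    return bins;                                   // fixedLengthBin(5, 8) -> "00000101"
}

// Unary binarization (convention matching the example above):
// (value - 1) ones followed by a terminating zero.
std::string unaryBin(unsigned value) {
    return std::string(value > 0 ? value - 1 : 0, '1') + '0';   // unaryBin(5) -> "11110"
}
```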
As described above, the sample values of a reconstructed block may differ from the sample values of the current video block being encoded. Furthermore, it should be noted that in some cases coding video data on a block-by-block basis may result in artifacts (e.g., so-called blocking artifacts, banding artifacts, etc.). For example, blocking artifacts may cause the coding block boundaries of reconstructed video data to be visually perceptible to a user. In this manner, reconstructed sample values may be modified to minimize the differences between the sample values of the current video block being encoded and the reconstructed block and/or to minimize artifacts introduced by the video coding process. Such modifications may generally be referred to as filtering. It should be noted that filtering may occur as part of an in-loop filtering process or a post-loop (or post) filtering process. For an in-loop filtering process, the resulting sample values of the filtering process may be used for predicting video blocks (e.g., stored to a reference frame buffer for subsequent encoding at the video encoder and subsequent decoding at the video decoder). For a post-loop filtering process, the resulting sample values of the filtering process are merely output as part of the decoding process (e.g., not used for subsequent coding). For example, with respect to a video decoder, for an in-loop filtering process, the sample values resulting from filtering a reconstructed block would be used for subsequent decoding (e.g., stored to a reference buffer) and would be output (e.g., to a display). For a post-loop filtering process, the reconstructed block, without modification, would be used for subsequent decoding, and the sample values resulting from filtering the reconstructed block would be output and would not be used for subsequent decoding.
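By way of illustration only (an assumption; the filter functions are placeholders), the following sketch shows the structural difference: in-loop filtered samples are stored for reference and output, whereas post-filtered samples are only output and are not used for subsequent decoding:

```cpp
// Illustrative sketch: where in-loop filtering and post-filtering sit in a decoder.
#include <cstdint>
#include <vector>

using Picture = std::vector<uint8_t>;

// Placeholders standing in for, e.g., deblocking/SAO/ALF and an NN post-processing filter.
Picture applyInLoopFilters(const Picture& reconstructed) { return reconstructed; }
Picture applyPostFilter(const Picture& picture)          { return picture; }

void finishPicture(const Picture& reconstructed,
                   std::vector<Picture>& dpb,            // reference picture buffer
                   std::vector<Picture>& outputQueue) {
    Picture inLoopFiltered = applyInLoopFilters(reconstructed);
    dpb.push_back(inLoopFiltered);                        // used for decoding subsequent pictures
    outputQueue.push_back(applyPostFilter(inLoopFiltered)); // output only, never fed back into prediction
}
```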
Deblocking (or de-blocking), deblock filtering, or applying a deblocking filter refers to the process of smoothing the boundaries of neighboring reconstructed video blocks (i.e., making the boundaries less perceptible to a viewer). Smoothing the boundaries of neighboring reconstructed video blocks may include modifying sample values included in rows or columns adjacent to a boundary. JVET-T2001 provides where a deblocking filter is applied to reconstructed sample values as part of an in-loop filtering process. In addition to applying a deblocking filter as part of an in-loop filtering process, JVET-T2001 provides where Sample Adaptive Offset (SAO) filtering may be applied in the in-loop filtering process. In general, SAO is a process that modifies the deblocked sample values in a region by conditionally adding an offset value. Another type of filtering process includes the so-called Adaptive Loop Filter (ALF). An ALF with block-based adaptation is specified in JEM. In JEM, the ALF is applied after the SAO filter. It should be noted that an ALF may be applied to reconstructed samples independently of other filtering techniques. The process for applying the ALF specified in JEM at the video encoder may be summarized as follows: (1) each 2x2 block of the reconstructed luma component of a picture is classified according to a classification index; (2) sets of filter coefficients are derived for each classification index; (3) filtering decisions are determined for the luma component; (4) a filtering decision is determined for the chroma components; and (5) filter parameters (e.g., coefficients and decisions) are signaled. JVET-T2001 specifies deblocking, SAO, and ALF filters, which may be described as being generally based on the deblocking, SAO, and ALF filters provided in ITU-T H.265 and JEM.
It should be noted that JVET-T2001 corresponds to a pre-publication version of ITU-T H.266 and is thus a nearly finalized draft of the video coding standard produced by the VVC project, and as such may be referred to as the first version of the VVC standard (or VVC version 1 or ITU-T H.266). It should be noted that during the VVC project, Convolutional Neural Network (CNN) based techniques were studied that showed potential for removing artifacts and improving objective quality, but it was decided not to include such techniques in the VVC standard. However, CNN-based techniques are currently being considered for extensions and/or improvements to VVC. Some CNN-based techniques involve post-filtering. For example, "AHG11: Content-adaptive neural network post-filter," 26th meeting of ISO/IEC JTC1/SC29/WG11, held 20-29 April 2022 by teleconference, document JVET-Z0082-v2, referred to herein as JVET-Z0082, describes a content-adaptive neural-network-based post-filter. It should be noted that in JVET-Z0082, content adaptation is achieved by overfitting an NN post-filter to the test video. Further, it should be noted that the result of the overfitting process in JVET-Z0082 is a weight update. JVET-Z0082 describes where the weight update is coded with ISO/IEC FDIS 15938-17, Information technology - Multimedia content description interface - Part 17: Compression of neural networks for multimedia content description and analysis, and the incremental neural network compression test model, N0179, 2022, which may collectively be referred to as the MPEG NNR (neural network representation) or Neural Network Coding (NNC) standard. JVET-Z0082 further describes where the coded weight update is signaled within the video bitstream as an NNR post-filter SEI message. "AHG9: NNR post-filter SEI message," 26th meeting of ISO/IEC JTC1/SC29/WG11, held 20-29 April 2022 by teleconference, document JVET-Z0052-v1, referred to herein as JVET-Z0052, describes the NNR post-filter SEI message utilized by JVET-Z0082. Elements of the NN post-filter described in JVET-Z0082 and of the NNR post-filter SEI message described in JVET-Z0052 were adopted in "Additional SEI messages for VSEI (Draft 2)," 27th meeting of ISO/IEC JTC1/SC29/WG11, held 13-22 July 2022 by teleconference, document JVET-AA2006-v2, referred to herein as JVET-AA2006. JVET-AA2006 provides general supplemental enhancement information messages for coded video bitstreams (VSEI). JVET-AA2006 specifies syntax and semantics for a neural-network post-filter characteristics SEI message and for a neural-network post-filter activation SEI message. The neural-network post-filter characteristics SEI message specifies a neural network that may be used as a post-processing filter. The use of a specified post-processing filter for a particular picture is indicated with a neural-network post-filter activation SEI message. JVET-AA2006 is described in further detail below. The techniques described herein provide techniques for signaling neural network post-filter messages.
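By way of illustration only (an assumption; the structure and field names are simplified and are not the JVET-AA2006 syntax), the following sketch shows how a decoder might associate a neural-network post-filter characteristics message, which defines a filter and assigns it an identifier, with a later activation message that indicates the filter to apply to the current picture:

```cpp
// Illustrative sketch: associating characteristics and activation SEI messages by id.
#include <cstdint>
#include <map>
#include <vector>

struct NnPostFilterCharacteristics {
    uint32_t id;                      // identifies the post-processing filter
    std::vector<uint8_t> payload;     // e.g., an ISO/IEC 15938-17 coded network or weight update
};

struct NnPostFilterActivation {
    uint32_t targetId;                // id of the filter to apply to the current decoded picture
};

// Decoder-side bookkeeping: remember declared filters, look one up on activation.
std::map<uint32_t, NnPostFilterCharacteristics> g_declaredFilters;

void onCharacteristicsSei(const NnPostFilterCharacteristics& c) {
    g_declaredFilters[c.id] = c;
}

const NnPostFilterCharacteristics* onActivationSei(const NnPostFilterActivation& a) {
    auto it = g_declaredFilters.find(a.targetId);
    return it == g_declaredFilters.end() ? nullptr : &it->second;
}
```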
Regarding the formulas used herein, the following arithmetic operators may be used:
+    Addition
−    Subtraction
*    Multiplication, including matrix multiplication
x^y  Exponentiation; specifies x to the power of y. In other contexts, such notation is used for superscripting and is not intended for interpretation as exponentiation.
/    Integer division with truncation of the result toward zero. For example, 7/4 and −7/−4 are truncated to 1, and −7/4 and 7/−4 are truncated to −1.
÷    Used to denote division in mathematical equations where no truncation or rounding is intended.
x/y  Used to denote division in mathematical equations where no truncation or rounding is intended.
Furthermore, the following mathematical functions may be used:
Log2(x)  the base-2 logarithm of x
Ceil(x)  the smallest integer greater than or equal to x
Regarding the example syntax used herein, the following definitions of logical operators may be applied:
x && y    Boolean logical "and" of x and y
x || y    Boolean logical "or" of x and y
!         Boolean logical "not"
x ? y : z If x is TRUE or not equal to 0, evaluates to the value of y; otherwise, evaluates to the value of z.
Furthermore, the following relational operators may be applied:
>    Greater than
>=   Greater than or equal to
<    Less than
<=   Less than or equal to
==   Equal to
!=   Not equal to
Further, it should be noted that in the syntax descriptors used herein, the following descriptors may be applied:
- b(8): byte having any pattern of bit string (8 bits). The parsing process for this descriptor is specified by the return value of the function read_bits(8).
- f(n): fixed-pattern bit string using n bits written (from left to right) with the left bit first. The parsing process for this descriptor is specified by the return value of the function read_bits(n).
- se(v): signed integer 0-th order Exp-Golomb-coded syntax element with the left bit first.
- tb(v): truncated binary using up to maxVal bits, with maxVal defined in the semantics of the syntax element.
- tu(v): truncated unary using up to maxVal bits, with maxVal defined in the semantics of the syntax element.
- u(n): unsigned integer using n bits. When n is "v" in the syntax table, the number of bits varies in a manner dependent on the values of other syntax elements. The parsing process for this descriptor is specified by the return value of the function read_bits(n), interpreted as a binary representation of an unsigned integer with the most significant bit written first.
- ue(v): unsigned integer 0-th order Exp-Golomb-coded syntax element with the left bit first.
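By way of illustration only (an assumption, not the normative parsing process), the following sketch reads a u(n) fixed-length value and a ue(v) unsigned 0-th order Exp-Golomb-coded value from a bit string:

```cpp
// Illustrative sketch: reading u(n) and ue(v) values from a byte buffer, MSB first.
#include <cstdint>
#include <vector>

struct BitReader {
    const std::vector<uint8_t>& data;
    std::size_t bitPos = 0;

    uint32_t readBit() {
        uint32_t bit = (data[bitPos >> 3] >> (7 - (bitPos & 7))) & 1u;
        ++bitPos;
        return bit;
    }
    uint32_t readBits(int n) {            // u(n): n-bit unsigned integer, most significant bit first
        uint32_t v = 0;
        while (n-- > 0) v = (v << 1) | readBit();
        return v;
    }
};

uint32_t readUe(BitReader& br) {          // ue(v): 0-th order Exp-Golomb
    int leadingZeroBits = 0;
    while (br.readBit() == 0)
        ++leadingZeroBits;
    return (1u << leadingZeroBits) - 1 + br.readBits(leadingZeroBits);
}
```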
As described above, video content includes video sequences composed of a series of pictures, and each picture may be divided into one or more regions. In JVET-T2001, a coded representation of a picture comprises the VCL NAL units of a particular layer within an AU and contains all CTUs of the picture. For example, referring again to Fig. 2, the coded representation of Pic3 is encapsulated in three coded slice NAL units (i.e., the Slice0 NAL unit, the Slice1 NAL unit, and the Slice2 NAL unit). It should be noted that the term Video Coding Layer (VCL) NAL unit is used as a collective term for coded slice NAL units, i.e., VCL NAL is a collective term that includes all types of slice NAL units. As described above, and in further detail below, a NAL unit may encapsulate metadata used for decoding video data. A NAL unit encapsulating metadata used for decoding a video sequence is generally referred to as a non-VCL NAL unit. Thus, in JVET-T2001, a NAL unit may be a VCL NAL unit or a non-VCL NAL unit. It should be noted that a VCL NAL unit includes slice header data that provides information used for decoding the particular slice. Thus, in JVET-T2001, the information used for decoding video data (which may be referred to as metadata in some cases) is not limited to being included in non-VCL NAL units. JVET-T2001 specifies that a Picture Unit (PU) is a set of NAL units that are associated with each other according to a specified classification rule, are consecutive in decoding order, and contain exactly one coded picture, and that an Access Unit (AU) is a set of PUs that belong to different layers and contain coded pictures associated with the same time for output from the DPB. JVET-T2001 further specifies that a layer is a set of VCL NAL units that all have a particular value of a layer identifier, together with their associated non-VCL NAL units. Further, in JVET-T2001, a PU consists of zero or one PH NAL unit, one coded picture (which comprises one or more VCL NAL units), and zero or more other non-VCL NAL units. Further, in JVET-T2001, a Coded Video Sequence (CVS) is a sequence of AUs that consists, in decoding order, of a CVSS AU followed by zero or more AUs that are not CVSS AUs, including all subsequent AUs up to but not including any subsequent AU that is a CVSS AU, where a coded video sequence start (CVSS) AU is an AU in which there is a PU for each layer in the CVS and the coded picture in each present picture unit is a Coded Layer Video Sequence Start (CLVSS) picture. In JVET-T2001, a Coded Layer Video Sequence (CLVS) is a sequence of PUs within the same layer that consists, in decoding order, of a CLVSS PU followed by zero or more PUs that are not CLVSS PUs, including all subsequent PUs up to but not including any subsequent PU that is a CLVSS PU. That is, in JVET-T2001, a bitstream may be described as including a sequence of AUs forming one or more CVSs.
Multi-layer video encoding enables a video presentation to be decoded/displayed as a presentation corresponding to a base layer of video data and decoded/displayed as one or more additional presentations corresponding to an enhancement layer of video data. For example, the base layer may enable presentation of video presentations having a base quality level (e.g., high definition presentation and/or 30Hz frame rate), and the enhancement layer may enable presentation of video presentations having an enhanced quality level (e.g., ultra-high definition rendering and/or 60Hz frame rate). The enhancement layer may be encoded by referring to the base layer. That is, a picture in the enhancement layer may be encoded (e.g., using inter-layer prediction techniques), for example, by referencing one or more pictures in the base layer (including scaled versions thereof). It should be noted that the layers may also be encoded independently of each other. In this case, there may be no inter-layer prediction between the two layers. Each NAL unit may include an identifier indicating the video data layer with which the NAL unit is associated. As described above, the sub-bitstream extraction process may be used to decode and display only a specific region of interest of a picture. In addition, the sub-bitstream extraction process may be used to decode and display only a specific video layer. Sub-bitstream extraction may refer to the process by which a device receiving a compatible or compliant bitstream forms a new compatible or compliant bitstream by discarding and/or modifying data in the received bitstream. For example, sub-bitstream extraction may be used to form a new compatible or compliant bitstream corresponding to a particular video representation (e.g., a high quality representation).
In JVET-T2001, each of a video sequence, a GOP, a picture, a slice, and a CTU may be associated with metadata that describes video coding properties, and some types of metadata are encapsulated in non-VCL NAL units. JVET-T2001 defines parameter sets that may be used to describe video data and/or video coding properties. In particular, JVET-T2001 includes the following four types of parameter sets: Video Parameter Set (VPS), Sequence Parameter Set (SPS), Picture Parameter Set (PPS), and Adaptation Parameter Set (APS), where an SPS applies to zero or more entire CVSs, a PPS applies to zero or more entire coded pictures, an APS applies to zero or more slices, and a VPS may be optionally referenced by an SPS. A PPS applies to the individual coded picture that refers to it. In JVET-T2001, parameter sets may be encapsulated as non-VCL NAL units and/or may be signaled as a message. JVET-T2001 also includes a Picture Header (PH), which is encapsulated as a non-VCL NAL unit. In JVET-T2001, a picture header applies to all slices of a coded picture. JVET-T2001 further enables Decoding Capability Information (DCI) and Supplemental Enhancement Information (SEI) messages to be signaled. In JVET-T2001, DCI and SEI messages assist in processes related to decoding, display, or other purposes; however, DCI and SEI messages may not be required for constructing the luma or chroma samples according to the decoding process. In JVET-T2001, DCI and SEI messages may be signaled in the bitstream using non-VCL NAL units. Further, DCI and SEI messages may be conveyed by some mechanism other than being present in the bitstream (i.e., signaled out-of-band).
Fig. 3 shows an example of a bitstream including multiple CVSs, where the CVSs include AUs and the AUs include picture units. The example shown in Fig. 3 corresponds to an example of encapsulating the slice NAL units shown in the example of Fig. 2 in a bitstream. In the example shown in Fig. 3, the picture unit corresponding to Pic3 includes the three coded slice VCL NAL units, i.e., the Slice0 NAL unit, the Slice1 NAL unit, and the Slice2 NAL unit, and two non-VCL NAL units, i.e., a PPS NAL unit and a PH NAL unit. It should be noted that in Fig. 3, the header is a NAL unit header (i.e., not to be confused with a slice header). Further, it should be noted that in Fig. 3, other non-VCL NAL units that are not shown may be included in the CVSs, e.g., SPS NAL units, VPS NAL units, SEI message NAL units, etc. Further, it should be noted that in other examples, a PPS NAL unit used for decoding Pic3 may be included elsewhere in the bitstream, e.g., in the picture unit corresponding to Pic0, or may be provided by an external mechanism. As described in further detail below, in JVET-T2001, the PH syntax structure may be present in the slice header of a VCL NAL unit or in the PH NAL unit of the current PU.
JVET-T2001 defines NAL unit header semantics that specify the type of Raw Byte Sequence Payload (RBSP) data structure included in the NAL unit. Table 1 shows the syntax of the NAL unit header provided in JVET-T2001.
TABLE 1
JVET-T2001 provides the following definitions for the respective syntax elements shown in Table 1.
forbidden_zero_bit should be equal to 0.
nuh_reserved_zero_bit should be equal to 0. The value 1 of nuh_reserved_zero_bit may be specified in the future by ITU-T|ISO/IEC. Although in this version of the specification the value of nuh_reserved_zero_bit is required to be equal to 0, a decoder conforming to this version of the specification should allow the value of nuh_reserved_zero_bit to be equal to 1 to appear in the syntax and should ignore (i.e., delete and discard from the bitstream) NAL units for which nuh_reserved_zero_bit is equal to 1.
nuh_layer_id specifies an identifier of a layer to which a VCL NAL unit belongs or an identifier of a layer to which a non-VCL NAL unit applies. The value of nuh_layer_id should be in the range of 0 to 55 (inclusive). Other values of nuh_layer_id are reserved for future use by ITU-t|iso/IEC. Although in this version of the specification the value of nuh layer id is required to be in the range of 0 to 55 (inclusive), a decoder conforming to this version of the specification should allow values of nuh layer id greater than 55 to appear in the syntax and should ignore (i.e., delete and discard from the bitstream) NAL units with nuh layer id greater than 55.
The value of nuh_layer_id should be the same for all VCL NAL units of a coded picture. The value of nuh_layer_id of a coded picture or PU is the value of the nuh_layer_id of the VCL NAL units of the coded picture or PU.
When nal_unit_type is equal to ph_nut or fd_nut, nuh_layer_id should be equal to the nuh_layer_id of the associated VCL NAL unit.
When nal_unit_type is equal to eos_nut, nuh_layer_id should be equal to one of the nuh_layer_id values of the layers present in the CVS.
Note that the nuh_layer_id values of DCI, OPI, VPS, AUD, and EOB NAL units are not constrained.
nuh_temporal_id_plus1 minus 1 specifies a temporal identifier for the NAL unit.
The value of nuh_temporal_id_plus1 should not be equal to 0.
The variable TemporalId is derived as follows:
TemporalId=nuh_temporal_id_plus1-1
When nal_unit_type is in the range of idr_w_radl to rsv_irap_11 (inclusive), the TemporalId should be equal to 0. When nal_unit_type is equal to STSA_NUT and vps_independent_layer_flag[ GeneralLayerIdx[ nuh_layer_id ] ] is equal to 1, the TemporalId should be greater than 0.
The value of the TemporalId should be the same for all VCL NAL units of the AU. The value of the TemporalId of the encoded picture, PU or AU is the value of the TemporalId of the VCL NAL unit of the encoded picture, PU or AU. The value of the TemporalId for the sub-layer representation is the maximum value of the TemporalId of all VCL NAL units in the sub-layer representation.
The value of the TemporalId of a non-VCL NAL unit is constrained as follows:
If nal_unit_type is equal to dci_nut, opi_nut, vps_nut or sps_nut, then the TemporalId shall be equal to 0 and the TemporalId of the AU containing the NAL unit shall be equal to 0.
Otherwise, if nal_unit_type is equal to ph_nut, then the TemporalId should be equal to the TemporalId of the PU containing the NAL unit.
Otherwise, if nal_unit_type is equal to eos_nut or eob_nut, then the TemporalId should be equal to 0.
Otherwise, if nal_unit_type is equal to aud_nut, fd_nut, prefix_sei_nut, or suffix_sei_nut, then the TemporalId should be equal to the TemporalId of the AU containing the NAL unit.
Otherwise, when nal_unit_type is equal to pps_nut, prefix_aps_nut, or suffix_aps_nut, the TemporalId should be greater than or equal to the TemporalId of the PU containing the NAL unit.
Note that when the NAL unit is a non-VCL NAL unit, the value of the TemporalId is equal to the minimum of the TemporalId values of all AUs to which the non-VCL NAL unit applies. When nal_unit_type is equal to pps_nut, prefix_aps_nut, or suffix_aps_nut, the TemporalId may be greater than or equal to the TemporalId of the containing AU, because all PPSs and APSs may be included at the beginning of the bitstream (e.g., when they are delivered out-of-band and the receiver places them at the beginning of the bitstream), with the first encoded picture having a TemporalId equal to 0.
nal_unit_type specifies the NAL unit type, i.e., the type of RBSP data structure contained in the NAL unit as specified in table 2.
NAL units (whose semantics are not specified) with nal_unit_type in the range of UNSPEC_28..UNSPEC_31 (inclusive) should not affect the decoding process specified in this specification.
Note that NAL unit types within the scope of unspec_28..unspec_31 may be used as determined by the application. The decoding process for these values of nal_unit_type is not specified in this specification. Since different applications may use these NAL unit types for different purposes, special attention is expected when designing encoders that generate NAL units with these nal_unit_type values and when designing decoders that interpret the content of NAL units with these nal_unit_type values. The present specification does not define any management of these values. These nal_unit_type values may only be applicable in contexts where the use of "collision" (i.e. meaning of NAL unit content of the same nal_unit_type value has different definitions) is not important, or is not possible or is managed, e.g. defined or managed in a control application or transport specification, or managed by controlling the environment in which the bitstream is distributed.
For purposes other than determining the amount of data in the DUs of the bitstream, the decoder should ignore (remove from the bitstream and discard) the contents of all NAL units that use reserved values of nal_unit_type. Note that this requirement allows for the future definition of compatible extensions to this specification.
TABLE 2
Note that a Clean Random Access (CRA) picture may have an associated RASL or RADL picture present in the bitstream. Note that an Instantaneous Decoding Refresh (IDR) picture with nal_unit_type equal to idr_n_lp does not have an associated leading picture present in the bitstream. An IDR picture with nal_unit_type equal to idr_w_radl does not have an associated RASL picture present in the bitstream, but may have an associated RADL picture in the bitstream.
The value of nal_unit_type should be the same for all VCL NAL units of a sub-picture. A sub-picture is referred to as having the same NAL unit type as the VCL NAL unit of the sub-picture.
For VCL NAL units of any particular picture, the following applies:
if pps_mixed_nalu_types_in_pic_flag is equal to 0, then the value of nal_unit_type should be the same for all VCL NAL units of the picture, and the picture or PU is said to have the same NAL unit type as the coded slice NAL unit of the picture or PU.
Otherwise (pps_mixed_nalu_types_in_pic_flag equal to 1), all the following constraints apply:
the picture should have at least two sub-pictures.
The VCL NAL units of a picture should have two or more different nal_unit_type values.
The picture should not have VCL NAL units with nal_unit_type equal to gdr_nut.
When nal_unit_type of a VCL NAL unit of a picture is equal to nalUnitTypeA with the value idr_w_radl, idr_n_lp or cra_nut, the nal_unit_type of the other VCL NAL units of the picture should be equal to nalUnitTypeA or trail_nut.
The value of nal_unit_type should be the same for all pictures of IRAP or GDR AU.
When sps_video_parameter_set_id is greater than 0, vps_max_tid_ref_pics_plus1[ i ][ j ] is equal to 0 (for any value of i in the range of j+1 to vps_max_layers_minus1, inclusive), and pps_mixed_nalu_types_in_pic_flag is equal to 1, the value of nal_unit_type should not be equal to IDR_W_RADL, IDR_N_LP, or CRA_NUT.
The following constraints apply for the bitstream compliance requirements:
when the picture is the leading picture of the IRAP picture, the picture should be a RADL or RASL picture.
When the sub-picture is the leading sub-picture of the IRAP sub-picture, the sub-picture should be a RADL or RASL sub-picture.
When the picture is not the leading picture of the IRAP picture, the picture should not be a RADL or RASL picture.
When the sub-picture is not the leading sub-picture of the IRAP sub-picture, the sub-picture should not be an RADL or RASL sub-picture.
RASL pictures should not be present in the bitstream, which RASL pictures are associated with IDR pictures.
RASL sub-pictures should not be present in the bitstream, which RASL sub-pictures are associated with IDR sub-pictures.
RADL pictures should not be present in the bitstream, which RADL pictures are associated with IDR pictures with nal_unit_type equal to idr_n_lp.
Note that random access at the location of the IRAP PU can be performed by discarding all PUs preceding the IRAP AU (and correctly decoding the non-RASL pictures in the IRAP AU and all subsequent AUs in decoding order), provided that each parameter set (in the bitstream or by an external means not specified in this specification) is available when referenced.
RADL sub-pictures should not be present in the bitstream, which RADL sub-pictures are associated with IDR sub-pictures with nal_unit_type equal to idr_n_lp.
Any picture, with nuh_layer_id equal to a particular value layerId, that precedes, in decoding order, an IRAP picture with nuh_layer_id equal to layerId should precede the IRAP picture in output order and should precede any RADL picture associated with the IRAP picture in output order.
Any sub-picture, with nuh_layer_id equal to a particular value layerId and sub-picture index equal to a particular value subpicIdx, that precedes, in decoding order, an IRAP sub-picture with nuh_layer_id equal to layerId and sub-picture index equal to subpicIdx should precede, in output order, the IRAP sub-picture and all its associated RADL sub-pictures.
Any picture, with nuh_layer_id equal to a particular value layerId, that precedes, in decoding order, a recovery point picture with nuh_layer_id equal to layerId should precede the recovery point picture in output order.
Any sub-picture, with nuh_layer_id equal to a particular value layerId and sub-picture index equal to a particular value subpicIdx, that precedes, in decoding order, the sub-picture in a recovery point picture with nuh_layer_id equal to layerId and sub-picture index equal to subpicIdx should precede that sub-picture in the recovery point picture in output order.
Any RASL picture associated with a CRA picture should precede any RADL picture associated with a CRA picture in output order.
Any RASL sub-picture associated with a CRA sub-picture should precede any RADL sub-picture associated with a CRA sub-picture in output order.
Any RASL picture associated with a CRA picture for which nuh_layer_id equals a particular value layerId should be located, in output order, after any IRAP or GDR picture that is located, in decoding order, before the CRA picture with nuh_layer_id equal to layerId.
Any RASL sub-picture associated with a CRA sub-picture for which nuh_layer_id is equal to a particular value layerId and the sub-picture index is equal to a particular value subpicIdx should be located, in output order, after any IRAP or GDR sub-picture that is located, in decoding order, before the CRA sub-picture with nuh_layer_id equal to layerId and sub-picture index equal to subpicIdx.
-if sps_field_seq_flag is equal to 0, the following applies: when the current picture with nuh_layer_id equal to the particular layerId is the leading picture associated with the IRAP picture, then the current picture should precede all non-leading pictures associated with the same IRAP picture in decoding order. Otherwise (sps_field_seq_flag equal to 1), let picA and picB be the first and last leading pictures associated with the IRAP picture, respectively, in decoding order; there should be at most one non-leading picture with nuh_layer_id equal to layerId before picA in decoding order, and there should not be a non-leading picture with nuh_layer_id equal to layerId between picA and picB in decoding order.
-if sps_field_seq_flag is equal to 0, the following applies: when the current sub-picture, with nuh_layer_id equal to a particular value layerId and sub-picture index equal to a particular value subpicIdx, is the leading sub-picture associated with an IRAP sub-picture, then the current sub-picture should precede all non-leading sub-pictures associated with the same IRAP sub-picture in decoding order. Otherwise (sps_field_seq_flag equal to 1), let subpicA and subpicB be the first and last leading sub-pictures associated with the IRAP sub-picture, respectively, in decoding order; at most one non-leading sub-picture with nuh_layer_id equal to layerId and sub-picture index equal to subpicIdx should be present before subpicA in decoding order, and no non-leading sub-picture with nuh_layer_id equal to layerId and sub-picture index equal to subpicIdx should be present between subpicA and subpicB in decoding order.
As provided in table 2, the NAL unit may include a Supplemental Enhancement Information (SEI) syntax structure. Tables 3 and 4 show the Supplemental Enhancement Information (SEI) syntax structure provided in jfet-T2001.
TABLE 3 Table 3
TABLE 4 Table 4
For tables 3 and 4, JHET-T2001 provides the following semantics:
Each SEI message consists of the variables specifying the type, payloadType, and size, payloadSize, of the SEI message payload. SEI message payloads are specified in Rec. ITU-T H.274|ISO/IEC 23002-7 or in this specification. The derived SEI message payload size payloadSize is specified in bytes and should be equal to the number of RBSP bytes in the SEI message payload.
Note that the NAL unit byte sequence containing the SEI message may include one or more emulation prevention bytes (represented by the emulation_prevention_three_byte syntax element). Since the payload size of the SEI message is specified in RBSP bytes, the number of emulation prevention bytes is not included in the size payloadSize of the SEI payload.
payload_type_byte is a byte of the payload type of the SEI message.
payload_size_byte is a byte of the payload size of the SEI message.
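As a non-normative illustration, the following Python sketch shows how payloadType and payloadSize may be accumulated from payload_type_byte and payload_size_byte values, assuming the usual convention of the SEI message syntax of Tables 3 and 4 (a run of 0xFF bytes followed by one final byte); the read_sei_payload_type_and_size function and the read_byte callable are hypothetical.

def read_sei_payload_type_and_size(read_byte):
    # Non-normative sketch: accumulate payloadType and payloadSize of one SEI message.
    # read_byte is a hypothetical callable returning the next byte of the SEI NAL unit payload.
    payload_type = 0
    b = read_byte()
    while b == 0xFF:            # each 0xFF byte adds 255 to the value
        payload_type += 255
        b = read_byte()
    payload_type += b           # final payload_type_byte
    payload_size = 0
    b = read_byte()
    while b == 0xFF:
        payload_size += 255
        b = read_byte()
    payload_size += b           # final payload_size_byte
    return payload_type, payload_size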
It should be noted that JHET-T2001 defines the payload types, and "Additional SEI messages for VSEI (Draft 6)" of the 25th meeting of ISO/IEC JTC1/SC29/WG11 (teleconference, document JHET-Y2006-v1, which is incorporated herein by reference and referred to as JHET-Y2006) defines additional payload types. Table 5 shows in general the sei_payload() syntax structure. That is, Table 5 shows the sei_payload() syntax structure, but for the sake of brevity, not all possible types of payloads are included in Table 5.
TABLE 5
For Table 5, JHET-T2001 provides the following semantics:
The sei_reserved_payload_extension_data should not be present in a bitstream conforming to this version of the present specification. However, a decoder conforming to this version of the present specification should ignore the presence and value of sei_reserved_payload_extension_data. When present, the length, in bits, of sei_reserved_payload_extension_data is equal to 8 * payloadSize - nEarlierBits - nPayloadZeroBits - 1, where nEarlierBits is the number of bits in the sei_payload() syntax structure that precede the sei_reserved_payload_extension_data syntax element, and nPayloadZeroBits is the number of payload_bit_equal_to_zero syntax elements at the end of the sei_payload() syntax structure.
If more_data_in_payload() is TRUE and nPayloadZeroBits is not equal to 7 after parsing the SEI message syntax structure (e.g., the buffering_period() syntax structure), then PayloadBits is set equal to 8 * payloadSize - nPayloadZeroBits - 1; otherwise, PayloadBits is set equal to 8 * payloadSize.
The payload_bit_equal_to_one should be equal to 1.
The payload_bit_equal_to_zero should be equal to 0.
Note that SEI messages with the same payloadType value are conceptually identical SEI messages, regardless of whether they are contained in prefix or suffix SEI NAL units.
Note that for the SEI messages specified in this specification and the VSEI specification (Rec. ITU-T H.274|ISO/IEC 23002-7), the payloadType values are aligned with similar SEI messages specified in AVC (Rec. ITU-T H.264|ISO/IEC 14496-10) and HEVC (Rec. ITU-T H.265|ISO/IEC 23008-2).
The semantics and persistence scope of each SEI message is specified in the semantic specification of each particular SEI message.
Note that the persistence information for the SEI message is summarized in table 142.
jfet-T2001 also provides the following:
SEI messages with the syntax structure identified in table 5 (specified in rec.itu-T h.274|iso/IEC 23002-7) can be used with the bitstreams specified in this specification.
When any particular Rec.ITU-T H.274|ISO/IEC23002-7 SEI message is included in the bit stream specified in this specification, the SEI payload syntax should be included in the sei_payload () syntax structure specified in Table 5, the payloadType value specified in Table 5 should be used, and furthermore, any SEI message specific constraints specified for the particular SEI message in this attachment should apply.
As described above, the PayloadBits value is passed to the parser of the SEI message syntax structure specified in Rec. ITU-T H.274|ISO/IEC 23002-7.
As described above, jfet-AA 2006 provides NN post-filter supplemental enhancement information messages. Specifically, jfet-AA 2006 provides a neural network post-filter characteristics SEI message (payloadType = 210) and a neural network post-filter activation SEI message (payloadType = 211). Tables 6 and 7 show the syntax of the neural network post-filter characteristics SEI message provided in jfet-AA 2006.
TABLE 6
TABLE 7
For tables 6 and 7, JHET-AA 2006 provides the following semantics:
this SEI message specifies the neural network that can be used as a post-processing filter. The use of a specified post-processing filter for a particular picture is indicated by a neural network post-filter activation SEI message.
The use of this SEI message requires the definition of the following variables:
The cropped decoded output picture width and height in luminance samples are denoted herein by CroppedWidth and CroppedHeight, respectively.
-a luma sample array CroppedYPic[ y ][ x ] and chroma sample arrays CroppedCbPic[ y ][ x ] and CroppedCrPic[ y ][ x ] (when present) of the cropped decoded output picture, for vertical coordinates y and horizontal coordinates x, wherein the upper-left corner of the sample array has coordinates y equal to 0 and x equal to 0.
The bit depth BitDepthY of the luma sample array of the cropped decoded output picture.
The bit depth BitDepthC of the chroma sample arrays (if any) of the cropped decoded output picture.
-a chroma format indicator, denoted herein by ChromaFormatIdc.
-the quantized strength value StrengthControlVal when nnpfc_auxiliary_inp_idc is equal to 1.
When this SEI message specifies a neural network that can be used as a post-processing filter, the semantics specify the derivation of the luma sample array FilteredYPic[ x ][ y ] and the chroma sample arrays FilteredCbPic[ x ][ y ] and FilteredCrPic[ x ][ y ], as indicated by the value of nnpfc_out_order_idc, which contain the output of the post-processing filter.
The variables SubWidthC and SubHeightC are derived from ChromaFormatIdc as specified in Table 8.
sps_chroma_format_idc   Chroma format   SubWidthC   SubHeightC
0                       Monochrome      1           1
1                       4:2:0           2           2
2                       4:2:2           2           1
3                       4:4:4           1           1
TABLE 8
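For illustration only, the mapping of Table 8 may be expressed by the following Python sketch; the dictionary and function names are not part of any specification.

# Mapping of Table 8: chroma format indicator -> (SubWidthC, SubHeightC).
CHROMA_SUBSAMPLING = {
    0: (1, 1),  # Monochrome
    1: (2, 2),  # 4:2:0
    2: (2, 1),  # 4:2:2
    3: (1, 1),  # 4:4:4
}

def derive_sub_width_height(chroma_format_idc: int):
    # Return (SubWidthC, SubHeightC) for the given chroma format indicator.
    return CHROMA_SUBSAMPLING[chroma_format_idc]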
The nnpfc_id contains an identification number that can be used to identify the post-processing filter. The value of nnpfc_id should be in the range of 0 to 2^32-2, inclusive.
Values of nnpfc_id from 256 to 511, inclusive, and from 2^31 to 2^32-2, inclusive, are reserved for future use by ITU-T|ISO/IEC. A decoder encountering a value of nnpfc_id in the range of 256 to 511, inclusive, or in the range of 2^31 to 2^32-2, inclusive, should ignore it.
The value of nnpfc_mode_idc equal to 0 specifies that the post-processing filter associated with the value of nnpfc_id is determined by an external means not specified in this specification.
An nnpfc_mode_idc equal to 1 specifies that the post-processing filter associated with the nnpfc_id value is a neural network represented by the ISO/IEC 15938-17 bitstream contained in this SEI message.
The nnpfc_mode_idc equal to 2 specifies that the post-processing filter associated with the nnpfc_id value is a neural network identified by a specified tag uniform resource locator (URI) (nnpfc_uri_tag [ i ]) and a neural network information URI (nnpfc_uri [ i ]).
The value of nnpfc_mode_idc should be in the range of 0 to 255 (inclusive). Values of nnpfc_mode_idc greater than 2 are reserved for future use by ITU-t|iso/IEC specifications and should not be present in bitstreams conforming to this version of the specification. A decoder conforming to this version of the specification should ignore the SEI message that contains the reserved value of nnpfc_mode_idc.
An nnpfc_purpose_and_formatting_flag equal to 0 specifies that no syntax elements related to filter purpose, input format, output format and complexity are present. An nnpfc_purpose_and_formatting_flag equal to 1 specifies that syntax elements related to filter purpose, input format, output format and complexity are present.
When nnpfc_mode_idc is equal to 1 and the current CLVS does not contain, in decoding order, a previous neural network post-filter characteristics SEI message with the value of nnpfc_id equal to the value of nnpfc_id in this SEI message, nnpfc_purpose_and_formatting_flag should be equal to 1.
When the current CLVS includes, in decoding order, a previous neural network post-filter characteristics SEI message with an nnpfc_id value equal to the same value of nnpfc_id in this SEI message, at least one of the following conditions should apply:
the value of nnpfc_mode_idc of this SEI message is equal to 1 and nnpfc_purpose_and_formatting_flag is equal to 0, in order to provide a neural network update.
This SEI message has the same content as the previous neural network post-filter characteristics SEI message.
When this SEI message is the first neural network post-filter characteristics SEI message, in decoding order, with a particular value of nnpfc_id within the current CLVS, it specifies the basic post-processing filter that pertains to the current decoded picture and all subsequent decoded pictures of the current layer, in output order, until the end of the current CLVS. When this SEI message is not the first neural network post-filter characteristics SEI message, in decoding order, with a particular value of nnpfc_id within the current CLVS, this SEI message pertains to the current decoded picture and all subsequent decoded pictures of the current layer, in output order, until the end of the current CLVS or the next neural network post-filter characteristics SEI message with that particular value of nnpfc_id, in output order, within the current CLVS.
nnpfc_purpose indicates the purpose of the post-processing filter as specified in Table 9. The value of nnpfc_purpose should be in the range of 0 to 2^32-2, inclusive. Values of nnpfc_purpose not present in Table 9 are reserved for future use by ITU-T|ISO/IEC and should not be present in bitstreams conforming to this version of the specification. A decoder conforming to this version of the specification should ignore SEI messages that contain reserved values of nnpfc_purpose.
Value   Interpretation
0       Unknown or unspecified
1       Visual quality improvement
2       Chroma upsampling from a 4:2:0 chroma format to a 4:2:2 or 4:4:4 chroma format, or from a 4:2:2 chroma format to a 4:4:4 chroma format
3       Increasing the width or height of the cropped decoded output picture without changing the chroma format
4       Increasing the width or height of the cropped decoded output picture and upsampling the chroma format
TABLE 9
Note that when ITU-T|ISO/IEC uses a reserved value of nnpfc_purpose in the future, the syntax of this SEI message can be extended by syntax elements, the presence of which is conditioned on nnpfc_purpose being equal to this value.
When SubWidthC is equal to 1 and SubHeightC is equal to 1, nnpfc_purpose should not be equal to 2 or 4.
The nnpfc_out_sub_c_flag being equal to 1 specifies that outSubWidthC is equal to 1 and outSubHeightC is equal to 1. The nnpfc_out_sub_c_flag being equal to 0 specifies that outSubWidthC is equal to 2 and outSubHeightC is equal to 1. When nnpfc_out_sub_c_flag does not exist, it is inferred that outSubWidthC is equal to SubWidthC and that outSubHeightC is equal to SubHeightC. If SubWidthC is equal to 2 and SubHeightC is equal to 1, then nnpfc_out_sub_c_flag should not be equal to 0.
The nnpfc_pic_width_in_luma_samples and nnpfc_pic_height_in_luma_samples specify the width and height, respectively, of the luma sample array of the picture that is generated by applying a post-processing filter identified by nnpfc_id to the cropped decoded output picture. When nnpfc_pic_width_in_luma_samples and nnpfc_pic_height_in_luma_samples are not present, it is inferred that they are equal to CroppedWidth and CroppedHeight, respectively.
The nnpfc_component_last_flag being equal to 0 specifies that the second dimension of the input tensor inputTensor to the post-processing filter and of the output tensor outputTensor resulting from the post-processing filter is used for the channel. The nnpfc_component_last_flag being equal to 1 specifies that the last dimension of the input tensor inputTensor to the post-processing filter and of the output tensor outputTensor resulting from the post-processing filter is used for the channel.
Note that-the first dimension in the input tensor and the output tensor is used for batch indexing, which is a practice in some neural network frameworks. Although the semantics of this SEI message use a batch size equal to 1, the batch size used as input for neural network inference is determined by post-processing implementations.
Note that-color components are examples of channels.
The nnpfc_inp_format_flag indicates a method of converting sample values of a clip-decoded output picture into input values input to a post-processing filter. When nnpfc_inp_format_flag is equal to 0, the input value to the post-processing filter is a real number, and functions InpY () and InpC () are specified as follows:
InpY(x)=x÷((1<<BitDepthY)-1)
InpC(x)=x÷((1<<BitDepthC)-1)
when nnpfc_inp_format_flag is equal to 1, the input value to the post-processing filter is an unsigned integer, and functions InpY () and InpC () are specified as follows:
The variable inptensporitdepth is derived from the syntax element nnpfc_inp_tensor_bitdepth_minus8, as specified below.
The nnpfc_inp_tensor_bitdepth_minus8 plus 8 specifies the bit depth of the luminance sample values in the input integer tensor. The value of inpTensorBitDepth is derived as follows:
inpTensorBitDepth=nnpfc_inp_tensor_bitdepth_minus8+8
It is a requirement of bitstream conformance that the value of nnpfc_inp_tensor_bitdepth_minus8 should be in the range of 0 to 24, inclusive.
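For illustration only, the real-number input case (nnpfc_inp_format_flag equal to 0) and the derivation of inpTensorBitDepth given above may be sketched in Python as follows; the function names are illustrative, and the integer input case (nnpfc_inp_format_flag equal to 1) is not reproduced here.

def inp_y(x: int, bit_depth_y: int) -> float:
    # InpY() for nnpfc_inp_format_flag equal to 0: normalize a luma sample value to [0, 1].
    return x / ((1 << bit_depth_y) - 1)

def inp_c(x: int, bit_depth_c: int) -> float:
    # InpC() for nnpfc_inp_format_flag equal to 0: normalize a chroma sample value to [0, 1].
    return x / ((1 << bit_depth_c) - 1)

def derive_inp_tensor_bit_depth(nnpfc_inp_tensor_bitdepth_minus8: int) -> int:
    # inpTensorBitDepth derivation; the syntax element value is constrained to the range [0, 24].
    assert 0 <= nnpfc_inp_tensor_bitdepth_minus8 <= 24
    return nnpfc_inp_tensor_bitdepth_minus8 + 8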
The nnpfc_auxiliary_inp_idc not equal to 0 specifies that auxiliary input data is present in the input tensor of the neural network post-filter. An nnpfc_auxiliary_inp_idc equal to 0 indicates that auxiliary input data is not present in the input tensor. nnpfc_auxiliary_inp_idc equal to 1 specifies that auxiliary input data is derived as specified in Table 12. The value of nnpfc_auxiliary_inp_idc should be in the range of 0 to 255 (inclusive). Values of nnpfc_auxiliary_inp_idc greater than 1 are reserved for future specifications of ITU-T|ISO/IEC and should not be present in a bitstream conforming to this version of the specification. A decoder consistent with this version of this specification should ignore SEI messages that include reserved values of nnpfc_auxiliary_inp_idc.
The nnpfc_separate_colour_description_present_flag being equal to 1 indicates that different combinations of color primaries, transmission characteristics, and matrix coefficients of pictures generated by the post-processing filter are specified in the SEI message syntax structure. An nnpfc_separate_colour_description_present_flag equal to 0 indicates that the combination of color primaries, transmission characteristics, and matrix coefficients of the picture generated by the post-processing filter is the same as that indicated in the VUI parameters of CLVS.
The nnpfc_color_primaries have the same semantics as specified for the vui_color_primaries syntax element as follows: vui_color_primary indicates chromaticity coordinates of the source color primary. The semantics of which are specified for the ColourPrimaries parameter in Rec. ITU-T H.273|ISO/IEC 23091-2. When the vui_color_primaries syntax element is not present, the value of vui_color_primaries is inferred to be equal to 2 (chroma is unknown or unspecified or determined by other means not specified in this specification). The value of vui_color_primaries is identified as reserved for future use by rec.itu-T h.273|iso/IEC 23091-2 and should not be present in a bitstream conforming to this version of the specification. The decoder should interpret the reserved value of vui_color_primary as equal to the value 2.
Except for the following cases:
nnpfc_color_primaries specify the color primaries of the picture produced by applying the neural network post-filter specified in the SEI message, instead of the color primaries for CLVS.
-when nnpfc_colour_primaries are not present in the neural network post-filter characteristics SEI message, deducing that the value of nnpfc_colour_primaries is equal to vui_colour_primaries.
The nnpfc_transfer_characteristics has the same semantics as specified for the vui_transfer_characteristics syntax element as follows: vui_transfer_characteristics indicate a transfer characteristic function of the color representation. Its semantics are as specified for the TransferCharacteristics parameter in Rec. ITU-T H.273|ISO/IEC 23091-2. When the vui_transfer_characteristics syntax element is not present, the value of vui_transfer_characteristics is inferred to be equal to 2 (the transmission characteristics are unknown or unspecified or determined by other means not specified in the present specification). The value of vui_transfer_characteristics is identified as reserved for future use by Rec.ITU-T H.273|ISO/IEC 23091-2 and should not be present in a bitstream conforming to this version of the specification. The decoder should interpret the reserved value of vui_transfer_characteristics as equal to the value 2.
Except for the following cases:
the nnpfc_transfer_characteristics specifies the transmission characteristics of the pictures produced by applying the neural network post-filter specified in the SEI message, instead of the transmission characteristics for CLVS.
-when the nnpfc_transfer_characteristics is not present in the neural network post-filter characteristic SEI message, deducing that the value of nnpfc_transfer_characteristics is equal to vui_transfer_characteristics.
The nnpfc_matrix_coeffs has the same semantics as specified for the vui_matrix_coeffs syntax element as follows: vui_matrix_coeffs describes equations for deriving luminance and chrominance signals from green, blue and red or Y, Z and the X primary colors. The semantics of which are as specified for the matrixcoeffients parameter in rec.itu-T h.273|iso/IEC 23091-2.
vui_matrix_coeffs should not be equal to 0 unless both of the following conditions are true:
BitDepthC is equal to BitDepthY.
ChromaFormatIdc is equal to 3 (4:4:4 chroma format).
The specification of using vui_matrix_coeffs equal to 0 under all other conditions is reserved for future use by ITU-t|iso/IEC.
vui_matrix_coeffs should not be equal to 8 unless one of the following conditions is true:
BitDepthC is equal to BitDepthY,
BitDepthC equals BitDepthY + 1 and ChromaFormatIdc equals 3 (4:4:4 chroma format).
The specification of using vui_matrix_coeffs equal to 8 under all other conditions is reserved for future use by ITU-t|iso/IEC.
When the vui_matrix_coeffs syntax element is not present, the value of vui_matrix_coeffs is inferred to be equal to 2 (unknown or unspecified or determined by other means not specified in this specification).
Except for the following cases:
the nnpfc matrix coeffs specifies the matrix coefficients of the picture produced by applying the neural network post-filter specified in the SEI message, instead of the matrix coefficients for CLVS.
-when nnpfc_matrix_coeffs is not present in the neural network post-filter characteristic SEI message, deducing that the value of nnpfc_matrix_coeffs is equal to vui_matrix_coeffs.
The value allowed by nnpfc_matrix_coeffs is not constrained by the chroma format of the decoded video picture, which is indicated by the ChromaFormatIdc value for VUI parameter semantics.
When nnpfc_matrix_coeffs is equal to 0, nnpfc_out_order_idc should not be equal to 1 or 3.
nnpfc_inp_order_idc indicates a method of ordering a sample array of the clip-decoded output picture as an input of the post-processing filter. Table 10 contains an informative description of the values of nnpfc_inp_order_idc. The semantics of the nnpfc_inp_order_idc in the range of 0 to 3 (including the end values) are specified in table 12, which specifies the procedure for deriving the input tensor for different values of the nnpfc_inp_order_idc, and the given vertical sample coordinates cTop and horizontal sample coordinates cLeft of the upper left sample position of the sample block included in the input tensor. When the chroma format of the cropped decoded output picture is not 4:2:0, nnpfc_inp_order_idc should not be equal to 3. The value of nnpfc_inp_order_idc should be in the range of 0 to 255 (inclusive). Values of nnpfc_inp_order_idc greater than 3 are reserved for future use by ITU-t|iso/IEC specifications and should not be present in bitstreams conforming to this version of the specification. A decoder conforming to this version of the present description should ignore the SEI message that contains the reserved value of nnpfc_inp_order_idc.
Table 10
A block is a rectangular array of samples from a component of a picture (e.g., a luma or chroma component).
The value of nnpfc_constant_patch_size_flag equal to 0 specifies that the post-processing filter accepts as input any block size that is a positive integer multiple of the block sizes indicated by nnpfc_patch_width_minus1 and nnpfc_patch_height_minus1. When nnpfc_constant_patch_size_flag is equal to 0, the block size width should be less than or equal to CroppedWidth. When nnpfc_constant_patch_size_flag is equal to 0, the block size height should be less than or equal to CroppedHeight. The nnpfc_constant_patch_size_flag being equal to 1 specifies that the post-processing filter accepts exactly as input the block sizes indicated by nnpfc_patch_width_minus1 and nnpfc_patch_height_minus1.
The nnpfc_patch_width_minus1 plus 1 specifies the horizontal sample count of the block size required as input to the post-processing filter when nnpfc_constant_patch_size_flag is equal to 1. When nnpfc_constant_patch_size_flag is equal to 0, any positive integer multiple of (nnpfc_patch_width_minus1 + 1) can be used as the horizontal sample count of the block size of the input to the post-processing filter. The value of nnpfc_patch_width_minus1 should be in the range of 0 to Min(32766, CroppedWidth - 1), inclusive.
The nnpfc_patch_height_minus1 plus 1 specifies the vertical sample count of the block size required as input to the post-processing filter when nnpfc_constant_patch_size_flag is equal to 1. When nnpfc_constant_patch_size_flag is equal to 0, any positive integer multiple of (nnpfc_patch_height_minus1 + 1) can be used as the vertical sample count of the block size of the input to the post-processing filter. The value of nnpfc_patch_height_minus1 should be in the range of 0 to Min(32766, CroppedHeight - 1), inclusive.
nnpfc_overlap specifies the overlapping horizontal and vertical sample counts of adjacent input tensors of the post-processing filter. The value of nnpfc_overlap should be in the range of 0 to 16383 (inclusive).
The variables inpPatchWidth, inpPatchHeight, outPatchWidth, outPatchHeight, horCScaling, verCScaling, outPatchCWidth, outPatchCHeight, and overlapSize are derived as follows:
inpPatchWidth=nnpfc_patch_width_minus1+1
inpPatchHeight=nnpfc_patch_height_minus1+1
outPatchWidth=(nnpfc_pic_width_in_luma_samples*inpPatchWidth)/CroppedWidth
outPatchHeight=(nnpfc_pic_height_in_luma_samples*inpPatchHeight)/CroppedHeight
horCScaling=SubWidthC/outSubWidthC
verCScaling=SubHeightC/outSubHeightC
outPatchCWidth=outPatchWidth*horCScaling
outPatchCHeight=outPatchHeight*verCScaling
overlapSize=nnpfc_overlap
It is a requirement of bitstream conformance that outPatchWidth * CroppedWidth should be equal to nnpfc_pic_width_in_luma_samples * inpPatchWidth and outPatchHeight * CroppedHeight should be equal to nnpfc_pic_height_in_luma_samples * inpPatchHeight.
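For illustration only, the derivations and the conformance check described above may be sketched in Python as follows; the dictionary keys mirror the syntax element names, and the function name is illustrative.

def derive_patch_variables(nnpfc, cropped_width, cropped_height,
                           sub_width_c, sub_height_c,
                           out_sub_width_c, out_sub_height_c):
    # Non-normative sketch of the patch-size and chroma-scaling derivations above.
    inp_patch_w = nnpfc["nnpfc_patch_width_minus1"] + 1
    inp_patch_h = nnpfc["nnpfc_patch_height_minus1"] + 1
    out_patch_w = (nnpfc["nnpfc_pic_width_in_luma_samples"] * inp_patch_w) // cropped_width
    out_patch_h = (nnpfc["nnpfc_pic_height_in_luma_samples"] * inp_patch_h) // cropped_height
    hor_c_scaling = sub_width_c / out_sub_width_c
    ver_c_scaling = sub_height_c / out_sub_height_c
    # Bitstream conformance: output patch size scales consistently with the picture size.
    assert out_patch_w * cropped_width == nnpfc["nnpfc_pic_width_in_luma_samples"] * inp_patch_w
    assert out_patch_h * cropped_height == nnpfc["nnpfc_pic_height_in_luma_samples"] * inp_patch_h
    return {
        "inpPatchWidth": inp_patch_w,
        "inpPatchHeight": inp_patch_h,
        "outPatchWidth": out_patch_w,
        "outPatchHeight": out_patch_h,
        "outPatchCWidth": int(out_patch_w * hor_c_scaling),
        "outPatchCHeight": int(out_patch_h * ver_c_scaling),
        "overlapSize": nnpfc["nnpfc_overlap"],
    }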
The nnpfc_padding_type specifies the padding process used when referring to sample positions outside the boundaries of the cropped decoded output picture, as described in Table 11. The value of nnpfc_padding_type should be in the range of 0 to 15, inclusive.
nnpfc_padding_type   Description
0                    Zero padding
1                    Replication padding
2                    Reflection padding
3                    Wrap-around padding
4                    Fixed padding
5..15                Reserved
TABLE 11
The nnpfc_luma_padding_val specifies the luminance value for padding when nnpfc_padding_type is equal to 4.
The nnpfc_cb_padding_val specifies the Cb value for padding when nnpfc_padding_type equals 4.
nnpfc_cr_padding_val specifies the Cr value for padding when nnpfc_padding_type equals 4.
The function InpSampleVal(y, x, picHeight, picWidth, croppedPic), with inputs being a vertical sample position y, a horizontal sample position x, a picture height picHeight, a picture width picWidth, and a sample array croppedPic, returns the value of sampleVal derived as follows:
table 12
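The derivation of sampleVal and the input tensor derivation process of Table 12 are specified normatively in jfet-AA 2006 and are not reproduced above. As a non-normative illustration only, the following Python sketch shows one plausible behavior for the padding types named in Table 11; the exact normative formulas may differ.

def inp_sample_val(y, x, pic_height, pic_width, cropped_pic, padding_type, fixed_val=0):
    # Illustrative (non-normative) sketch of InpSampleVal() for the padding types of Table 11.
    # cropped_pic[y][x] is the sample array of the cropped decoded output picture.
    if 0 <= y < pic_height and 0 <= x < pic_width:
        return cropped_pic[y][x]
    if padding_type == 0:        # zero padding
        return 0
    if padding_type == 1:        # replication padding: clamp coordinates to the picture border
        yc = min(max(y, 0), pic_height - 1)
        xc = min(max(x, 0), pic_width - 1)
        return cropped_pic[yc][xc]
    if padding_type == 2:        # reflection padding: mirror coordinates at the picture border
        yr = -y if y < 0 else (2 * pic_height - 2 - y if y >= pic_height else y)
        xr = -x if x < 0 else (2 * pic_width - 2 - x if x >= pic_width else x)
        return cropped_pic[yr][xr]
    if padding_type == 3:        # wrap-around padding
        return cropped_pic[y % pic_height][x % pic_width]
    if padding_type == 4:        # fixed padding: nnpfc_luma/cb/cr_padding_val passed as fixed_val
        return fixed_val
    raise ValueError("nnpfc_padding_type values 5..15 are reserved")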
An nnpfc_complexity_idc greater than 0 specifies that one or more syntax elements indicating the complexity of the post-processing filter associated with nnpfc_id may be present. An nnpfc_complexity_idc equal to 0 specifies that no syntax element indicating the complexity of the post-processing filter associated with nnpfc_id is present. The value of nnpfc_complexity_idc should be in the range of 0 to 255, inclusive. Values of nnpfc_complexity_idc greater than 1 are reserved for future use by ITU-T|ISO/IEC specifications and should not be present in bitstreams conforming to this version of the specification. A decoder conforming to this version of the specification should ignore SEI messages that contain reserved values of nnpfc_complexity_idc.
The nnpfc_out_format_flag being equal to 0 indicates that the sample values output by the post-processing filter are real numbers, and the functions OutY() and OutC() for converting the luminance sample values and the chrominance sample values output by the post-processing filter into integer values at bit depths BitDepthY and BitDepthC, respectively, are specified as follows:
OutY(x)=Clip3(0,(1<<BitDepthY)-1,Round(x*((1<<BitDepthY)-1)))
OutC(x)=Clip3(0,(1<<BitDepthC)-1,Round(x*((1<<BitDepthC)-1)))
the nnpfc_out_format_flag being equal to 1 indicates that the sample value output by the post-processing filter is an unsigned integer and the functions OutY () and OutC () are specified as follows:
the variable outTensorBitDepth is derived from the syntax element nnpfc_out_tensor_bitdepth_minus8 as described below.
The nnpfc_out_tensor_bitdepth_minus8 plus 8 specifies the bit depth of the sample values in the output integer tensor. The value of the outTensorBitDepth is derived as follows:
outTensorBitDepth=nnpfc_out_tensor_bitdepth_minus8+8
It is a requirement of bitstream conformance that the value of nnpfc_out_tensor_bitdepth_minus8 should be in the range of 0 to 24, inclusive.
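For illustration only, the real-number output case (nnpfc_out_format_flag equal to 0) given above may be sketched in Python as follows; spec_round() approximates the Round() operation for non-negative values, and the function names are illustrative.

def clip3(lo, hi, v):
    # Clip3(lo, hi, v) as used in the formulas above.
    return max(lo, min(hi, v))

def spec_round(x):
    # Round() for non-negative inputs: round half up.
    return int(x + 0.5)

def out_y(x, bit_depth_y):
    # OutY() for nnpfc_out_format_flag equal to 0.
    return clip3(0, (1 << bit_depth_y) - 1, spec_round(x * ((1 << bit_depth_y) - 1)))

def out_c(x, bit_depth_c):
    # OutC() for nnpfc_out_format_flag equal to 0.
    return clip3(0, (1 << bit_depth_c) - 1, spec_round(x * ((1 << bit_depth_c) - 1)))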
nnpfc_out_order_idc indicates the output order of samples generated from the post-processing filter. Table 13 contains an informative description of the value of nnpfc out order idc. The semantics of the nnpfc out order idc in the range of 0 to 3 (including the end values) are specified in table 14, which specifies the procedure for deriving the sample values in the filtered output sample array FilteredYPic, filteredCbPic and FilteredCrPic from the output tensor for the different values of nnpfc out order idc, and the given vertical sample coordinates cTop and horizontal sample coordinates cLeft for the upper left sample position of the sample block included in the input tensor. When nnpfc_purose is equal to 2 or 4, nnpfc_out_order_idc should not be equal to 3. The value of nnpfc_out_order_idc should be in the range of 0 to 255 (inclusive). Values of nnpfc out order idc greater than 3 are reserved for future use by ITU-t|iso/IEC specifications and should not be present in bitstreams conforming to this version of the specification. A decoder conforming to this version of the specification should ignore the SEI message that contains the reserved value of nnpfc out order idc.
TABLE 13
TABLE 14
The basic post-processing filter for the clipped decoded output picture picA is the filter identified in decoding order by the first neural network post-filter characteristic SEI message with a specific nnpfc_id value within CLVS.
If there is another neural network post-filter characteristic SEI message having the same value of nnpfc_id, having nnpfc_mode_idc equal to 1, having different content from the neural network post-filter characteristic SEI message defining the basic post-processing filter and related to the picture picA, the basic post-processing filter is updated by decoding the ISO/IEC 15938-17 bitstream in the neural network post-filter characteristic SEI message to obtain post-processing filter (). Otherwise, post-processing filter () is designated as the same as the basic post-processing filter.
The following process is used to filter the cropped decoded output picture with post-processing filter() to produce a filtered picture, which includes arrays of Y, Cb, and Cr samples, FilteredYPic, FilteredCbPic, and FilteredCrPic, respectively, as indicated by nnpfc_out_order_idc.
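As a non-normative illustration of the selection rule described above, the following Python sketch returns the characteristics SEI message that defines the post-processing filter for a picture; the dictionary keys are illustrative only.

def resolve_post_processing_filter(base_sei, candidate_sei=None):
    # Non-normative sketch: the basic post-processing filter is defined by the first
    # characteristics SEI message with a given nnpfc_id in the CLVS (base_sei). It is
    # updated when another message with the same nnpfc_id, nnpfc_mode_idc equal to 1,
    # and different content pertains to the picture (candidate_sei).
    if (candidate_sei is not None
            and candidate_sei["nnpfc_id"] == base_sei["nnpfc_id"]
            and candidate_sei["nnpfc_mode_idc"] == 1
            and candidate_sei != base_sei):
        # The ISO/IEC 15938-17 bitstream carried in candidate_sei updates the basic filter.
        return candidate_sei
    return base_sei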
The nnpfc_reserved_zero_bit should be equal to 0.
The nnpfc_uri_tag[ i ] contains a UTF-8 string ending with NULL that specifies the tag URI. The UTF-8 string should contain a URI having syntax and semantics as specified in IETF RFC 4151, uniquely identifying the format and associated information of the neural network specified by the nnpfc_uri[ i ] value to be used as a post-processing filter.
Note that the nnpfc_uri_tag[ i ] element represents a "tag" URI that enables the format of the neural network data specified by the nnpfc_uri[ i ] value to be uniquely identified without requiring a central registration authority.
The nnpfc_uri [ i ] should contain a UTF-8 string ending with NULL, as specified in ISO/IEC 10646. The UTF-8 string should contain a URI having syntax and semantics as specified in IETF internet standard 66 to identify the neural network information (e.g., data representation) that is used as a post-processing filter.
The nnpfc_payload_byte [ i ] contains the ith byte of the ISO/IEC 15938-17 compliant bitstream. The byte sequence nnpfc_payload_byte i for all current values of i should be a complete bitstream compliant with ISO/IEC 15938-17.
An nnpfc_parameter_type_idc equal to 0 indicates that the neural network uses only integer parameters. An nnpfc_parameter_type_idc equal to 1 indicates that the neural network may use floating point or integer parameters. An nnpfc_parameter_type_idc equal to 2 indicates that the neural network uses only binary parameters. The nnpfc_parameter_type_idc equal to 3 is reserved for future use by ITU-T|ISO/IEC specifications and should not be present in bitstreams conforming to this version of the specification. A decoder conforming to this version of the present description should ignore the SEI message that contains the reserved value of nnpfc_parameter_type_idc.
The nnpfc_log2_parameter_bit_length_minus3 being equal to 0, 1, 2, and 3 indicates that the neural network does not use parameters with bit lengths greater than 8, 16, 32, and 64, respectively. When nnpfc_parameter_type_idc exists and nnpfc_log2_parameter_bit_length_minus3 does not exist, the neural network does not use a parameter with a bit length greater than 1.
nnpfc_num_parameters_idc indicates the maximum number of neural network parameters for the post-processing filter in units of a power of 2048. An nnpfc_num_parameters_idc equal to 0 indicates that the maximum number of neural network parameters is unspecified. The value of nnpfc_num_parameters_idc should be in the range of 0 to 52, inclusive. Values of nnpfc_num_parameters_idc greater than 52 are reserved for future use by ITU-T|ISO/IEC specifications and should not be present in bitstreams conforming to this version of the specification. A decoder conforming to this version of the present specification should ignore the SEI message that contains the reserved value of nnpfc_num_parameters_idc.
If the value of nnpfc_num_parameters_idc is greater than 0, the variable maxNumParameters is derived as follows:
maxNumParameters=(2048<<nnpfc_num_parameters_idc)-1
It is a requirement of bitstream conformance that the number of neural network parameters of the post-processing filter should be less than or equal to maxNumParameters.
An nnpfc_num_kmac_operations_idc greater than 0 specifies that the maximum number of multiply-accumulate operations per sample of the post-processing filter is less than or equal to nnpfc_num_kmac_operations_idc. An nnpfc_num_kmac_operations_idc equal to 0 specifies that the maximum number of multiply-accumulate operations is unspecified. The value of nnpfc_num_kmac_operations_idc should be in the range of 0 to 2^32-1, inclusive.
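For illustration only, the complexity limits signalled by nnpfc_num_parameters_idc and nnpfc_num_kmac_operations_idc may be checked, following the semantics as stated above, as in the following Python sketch; the function and parameter names are illustrative.

def complexity_within_limits(num_parameters, mac_per_sample,
                             nnpfc_num_parameters_idc, nnpfc_num_kmac_operations_idc):
    # Non-normative sketch: a value of 0 for either indicator leaves the corresponding maximum unspecified.
    ok = True
    if nnpfc_num_parameters_idc > 0:
        max_num_parameters = (2048 << nnpfc_num_parameters_idc) - 1
        ok = ok and (num_parameters <= max_num_parameters)
    if nnpfc_num_kmac_operations_idc > 0:
        ok = ok and (mac_per_sample <= nnpfc_num_kmac_operations_idc)
    return ok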
Table 15 shows the syntax of the neural network postfilter activation SEI message provided in jfet-AA 2006.
nn_post_filter_activation( payloadSize ) {        Descriptor
    nnpfa_id                                      ue(v)
}
TABLE 15
For Table 15, JHET-AA 2006 provides the following semantics:
this SEI message specifies the neural network post-processing filters that are available for the post-processing filters of the current picture.
The neural network post-processing filter activation SEI message is valid only for the current picture.
Note that there may be several neural network post-processing filter activation SEI messages for the same picture, for example, when the post-processing filters are intended for different purposes or to filter different color components.
The nnpfa_id specifies that a neural network post-processing filter specified by one or more neural network post-processing filter characteristics SEI messages related to the current picture and having nnpfc_id equal to nnpfa_id may be used to post-process filter the current picture.
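As a non-normative illustration, a receiver may associate activation SEI messages with previously received characteristics SEI messages as in the following Python sketch; the variable names are illustrative.

def activated_post_filters(nnpfa_ids, characteristics_by_nnpfc_id):
    # Non-normative sketch: nnpfa_ids holds the nnpfa_id values of the activation SEI
    # messages of the current picture, and characteristics_by_nnpfc_id maps nnpfc_id to
    # the characteristics SEI message currently in effect. The activation applies only
    # to the current picture.
    return [characteristics_by_nnpfc_id[i] for i in nnpfa_ids if i in characteristics_by_nnpfc_id]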
Furthermore, it should be noted that in some cases, for the use of NN post-filter characteristics SEI messages:
for the purpose of explaining the neural network post-filter characteristics SEI message, the following variables are specified:
-InpPicWidthInLumaSamples is set equal to pps_pic_width_in_luma_samples - SubWidthC * ( pps_conf_win_left_offset + pps_conf_win_right_offset ).
-InpPicHeightInLumaSamples is set equal to pps_pic_height_in_luma_samples - SubHeightC * ( pps_conf_win_top_offset + pps_conf_win_bottom_offset ).
When present, the luma sample array CroppedYPic[ y ][ x ] and the chroma sample arrays CroppedCbPic[ y ][ x ] and CroppedCrPic[ y ][ x ] are set to the 2-dimensional arrays of decoded sample values of the 0th, 1st and 2nd components, respectively, of the cropped decoded output picture to which the neural network post-filter characteristics SEI message applies.
Both BitDepthY and BitDepthC are set equal to BitDepth.
InpSubWidthC is set equal to SubWidthC.
-InpSubHeightC is set equal to SubHeightC.
The SliceQPY is set equal to SliceQpY.
When the neural network post-filter characteristic SEI messages with the same nnpfc_id and different contents are present in the same picture unit, the two neural network post-filter characteristic SEI messages should be present in the same SEI NAL unit.
The neural network post-filter characteristics SEI message provided in jfet-AA 2006 may be less than ideal. In particular, the signaling in the jfet-AA 2006 may be insufficient to indicate various neural network post-filter parameters. In accordance with the techniques described herein, additional syntax and semantics are provided for indicating various neural network post-filter parameters.
Fig. 1 is a block diagram illustrating an example of a system that may be configured to encode (e.g., encode and/or decode) video data in accordance with one or more techniques of the present disclosure. System 100 represents an example of a video data system that may be packaged in accordance with one or more techniques of the present disclosure. As shown in fig. 1, system 100 includes a source device 102, a communication medium 110, and a target device 120. In the example shown in fig. 1, source device 102 may include any device configured to encode video data and transmit the encoded video data to communication medium 110. Target device 120 may include any device configured to receive encoded video data and decode the encoded video data via communication medium 110. The source device 102 and/or the target device 120 may include computing devices equipped for wired and/or wireless communication, and may include, for example, set-top boxes, digital video recorders, televisions, desktop, laptop or tablet computers, gaming consoles, medical imaging devices, and mobile devices (including, for example, smart phones, cellular phones, personal gaming devices).
Communication medium 110 may include any combination of wireless and wired communication media and/or storage devices. Communication medium 110 may include coaxial cable, fiber optic cable, twisted pair cable, wireless transmitters and receivers, routers, switches, repeaters, base stations, or any other device that may be used to facilitate communications between various devices and sites. Communication medium 110 may include one or more networks. For example, the communication medium 110 may include a network configured to allow access to the world wide web, such as the Internet. The network may operate in accordance with a combination of one or more telecommunications protocols. The telecommunications protocol may include proprietary aspects and/or may include standardized telecommunications protocols. Examples of standardized telecommunications protocols include the Digital Video Broadcasting (DVB) standard, the Advanced Television Systems Committee (ATSC) standard, the Integrated Services Digital Broadcasting (ISDB) standard, the Data Over Cable Service Interface Specification (DOCSIS) standard, the global system for mobile communications (GSM) standard, the Code Division Multiple Access (CDMA) standard, the 3 rd generation partnership project (3 GPP) standard, the European Telecommunications Standards Institute (ETSI) standard, the Internet Protocol (IP) standard, the Wireless Application Protocol (WAP) standard, and the Institute of Electrical and Electronics Engineers (IEEE) standard.
The storage device may include any type of device or storage medium capable of storing data. The storage medium may include a tangible or non-transitory computer readable medium. The computer readable medium may include an optical disk, flash memory, magnetic memory, or any other suitable digital storage medium. In some examples, the memory device or portions thereof may be described as non-volatile memory, and in other examples, portions of the memory device may be described as volatile memory. Examples of volatile memory can include Random Access Memory (RAM), dynamic Random Access Memory (DRAM), and Static Random Access Memory (SRAM). Examples of non-volatile memory may include magnetic hard disk, optical disk, floppy disk, flash memory, or forms of electrically programmable memory (EPROM) or Electrically Erasable and Programmable (EEPROM) memory. The storage devices may include memory cards (e.g., secure Digital (SD) memory cards), internal/external hard disk drives, and/or internal/external solid state drives. The data may be stored on the storage device according to a defined file format.
Fig. 4 is a conceptual diagram illustrating an example of components that may be included in a particular implementation of system 100. In the exemplary implementation shown in fig. 4, system 100 includes one or more computing devices 402A-402N, a television service network 404, a television service provider site 406, a wide area network 408, a local area network 410, and one or more content provider sites 412A-412N. The implementations shown in fig. 4 represent examples of systems that may be configured to allow digital media content (such as movies, live sporting events, etc.) and data and applications associated therewith, as well as media presentations, to be distributed to and accessed by multiple computing devices (such as computing devices 402A-402N). In the example shown in fig. 4, computing devices 402A-402N may include any device configured to receive data from one or more of television services network 404, wide area network 408, and/or local area network 410. For example, computing devices 402A-402N may be equipped for wired and/or wireless communication and may be configured to receive services over one or more data channels and may include televisions, including so-called smart televisions, set-top boxes, and digital video recorders. Further, computing devices 402A-402N may include desktop, laptop or tablet computers, gaming consoles, mobile devices (including, for example, "smart" phones, cellular phones, and personal gaming devices).
Television services network 404 is an example of a network configured to enable distribution of digital media content that may include television services. For example, television service network 404 may include a public over-the-air television network, a public or subscription-based satellite television service provider network, and a public or subscription-based cable television provider network and/or a cloud or internet service provider. It should be noted that although in some examples, television services network 404 may be primarily used to allow provision of television services, television services network 404 may also allow provision of other types of data and services according to any combination of the telecommunication protocols described herein. Further, it should be noted that in some examples, television service network 404 may allow bi-directional communication between television service provider site 406 and one or more of computing devices 402A-402N. Television services network 404 may include any combination of wireless and/or wired communication media. Television services network 404 may include coaxial cables, fiber optic cables, twisted pair cables, wireless transmitters and receivers, routers, switches, repeaters, base stations, or any other device that may be used to facilitate communications between various devices and sites. The television services network 404 may operate according to a combination of one or more telecommunications protocols. The telecommunications protocol may include proprietary aspects and/or may include standardized telecommunications protocols. Examples of standardized telecommunication protocols include DVB standards, ATSC standards, ISDB standards, DTMB standards, DMB standards, data Over Cable Service Interface Specification (DOCSIS) standards, hbbTV standards, W3C standards, and UPnP standards.
Referring again to fig. 4, television service provider site 406 may be configured to distribute television services via television service network 404. For example, television service provider site 406 may include one or more broadcast stations, cable television providers, or satellite television providers, or internet-based television providers. For example, television service provider site 406 may be configured to receive transmissions (including television programs) via satellite uplink/downlink. Further, as shown in fig. 4, television service provider site 406 may be in communication with wide area network 408 and may be configured to receive data from content provider sites 412A through 412N. It should be noted that in some examples, television service provider site 406 may include a television studio and content may originate from the television studio.
Wide area network 408 may comprise a packet-based network and operate in accordance with a combination of one or more telecommunications protocols. The telecommunications protocol may include proprietary aspects and/or may include standardized telecommunications protocols. Examples of standardized telecommunications protocols include the global system for mobile communications (GSM) standard, the Code Division Multiple Access (CDMA) standard, the third generation partnership project (3 GPP) standard, the European Telecommunications Standards Institute (ETSI) standard, the european standard (EN), the IP standard, the Wireless Application Protocol (WAP) standard, and the Institute of Electrical and Electronics Engineers (IEEE) standard, such as one or more IEEE 802 standards (e.g., wi-Fi). Wide area network 408 may include any combination of wireless and/or wired communication media. Wide area network 408 may include coaxial cables, fiber optic cables, twisted pair cables, ethernet cables, wireless transmitters and receivers, routers, switches, repeaters, base stations, or any other device that may be used to facilitate communications between various devices and sites. In one example, wide area network 408 may include the internet. The local area network 410 may comprise a packet-based network and operate in accordance with a combination of one or more telecommunications protocols. Local area network 410 may be distinguished from wide area network 408 based on access level and/or physical infrastructure. For example, local area network 410 may include a secure home network.
Referring again to fig. 4, content provider sites 412A-412N represent examples of sites that may provide multimedia content to television service provider site 406 and/or computing devices 402A-402N. For example, the content provider site may include a studio having one or more studio content servers configured to provide multimedia files and/or streams to the television service provider site 406. In one example, content provider sites 412A-412N may be configured to provide multimedia content using an IP suite. For example, the content provider site may be configured to provide multimedia content to the receiver device according to Real Time Streaming Protocol (RTSP), HTTP, or the like. Further, the content provider sites 412A-412N may be configured to provide data including hypertext-based content and the like to one or more of the receiver devices 402A-402N and/or the television service provider sites 406 via the wide area network 408. Content provider sites 412A-412N may include one or more web servers. The data provided by the content provider sites 412A through 412N may be defined according to a data format.
Referring again to fig. 1, source device 102 includes a video source 104, a video encoder 106, a data packager 107, and an interface 108. Video source 104 may include any device configured to capture and/or store video data. For example, video source 104 may include a camera and a storage device operatively coupled thereto. Video encoder 106 may include any device configured to receive video data and generate a compatible bitstream representing the video data. A compatible bitstream may refer to a bitstream from which a video decoder may receive and reproduce video data. Aspects of a compatible bitstream may be defined in accordance with a video coding standard. When generating a compatible bitstream, the video encoder 106 may compress the video data. Compression may be lossy (perceptible or imperceptible to an observer) or lossless. Fig. 5 is a block diagram illustrating an example of a video encoder 500 that may implement the techniques for encoding video data described herein. It should be noted that although the exemplary video encoder 500 is shown with different functional blocks, such illustration is intended for descriptive purposes and not to limit the video encoder 500 and/or its subcomponents to a particular hardware or software architecture. The functions of video encoder 500 may be implemented using any combination of hardware, firmware, and/or software implementations.
The video encoder 500 may perform intra prediction encoding and inter prediction encoding of picture regions, and thus may be referred to as a hybrid video encoder. In the example shown in fig. 5, a video encoder 500 receives a source video block. In some examples, a source video block may include picture regions that have been partitioned according to an encoding structure. For example, the source video data may include macroblocks, CTUs, CBs, sub-partitions thereof, and/or additional equivalent coding units. In some examples, video encoder 500 may be configured to perform additional subdivision of the source video block. It should be noted that the techniques described herein are generally applicable to video encoding, regardless of how the source video data is partitioned prior to and/or during encoding. In the example shown in fig. 5, the video encoder 500 includes an adder 502, a transform coefficient generator 504, a coefficient quantization unit 506, an inverse quantization and transform coefficient processing unit 508, an adder 510, an intra prediction processing unit 512, an inter prediction processing unit 514, a filter unit 516, and an entropy encoding unit 518. As shown in fig. 5, a video encoder 500 receives source video blocks and outputs a bitstream.
In the example shown in fig. 5, video encoder 500 may generate residual data by subtracting a predicted video block from a source video block. The selection of the predicted video block is described in detail below. Summer 502 represents a component configured to perform the subtraction operation. In one example, the subtraction of video blocks occurs in the pixel domain. The transform coefficient generator 504 applies a transform such as a Discrete Cosine Transform (DCT), a Discrete Sine Transform (DST), or a conceptually similar transform (e.g., four 8 x 8 transforms may be applied to a 16 x 16 array of residual values) to the residual block or sub-partitions thereof to produce a set of residual transform coefficients. The transform coefficient generator 504 may be configured to perform any and all combinations of the transforms included in the series of discrete trigonometric transforms, including approximations thereof. The transform coefficient generator 504 may output the transform coefficients to the coefficient quantization unit 506. The coefficient quantization unit 506 may be configured to perform quantization of the transform coefficients. The quantization process may reduce the bit depth associated with some or all of the coefficients. The degree of quantization may change the rate distortion (i.e., the relationship of bit rate to video quality) of the encoded video data. The degree of quantization may be modified by adjusting a Quantization Parameter (QP). The quantization parameter may be determined based on a slice level value and/or a CU level value (e.g., a CU delta QP value). QP data may include any data used to determine a QP for quantizing a particular set of transform coefficients. As shown in fig. 5, the quantized transform coefficients (which may be referred to as level values) are output to an inverse quantization and transform coefficient processing unit 508. The inverse quantization and transform coefficient processing unit 508 may be configured to apply inverse quantization and inverse transform to generate reconstructed residual data. As shown in fig. 5, reconstructed residual data may be added to the predicted video block at summer 510. In this way, the encoded video block may be reconstructed and the resulting reconstructed video block may be used to evaluate the coding quality of a given prediction, transform and/or quantization. The video encoder 500 may be configured to perform multiple encoding passes (e.g., perform encoding while changing one or more of the prediction, transform parameters, and quantization parameters). The rate-distortion or other system parameters of the bitstream may be optimized based on the evaluation of the reconstructed video block. Furthermore, the reconstructed video block may be stored and used as a reference for predicting a subsequent block.
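As an illustration of the quantization step described above, the following sketch shows uniform scalar quantization of transform coefficients with a step size derived from QP. It assumes the conventional relationship in which the step size approximately doubles for every increase of 6 in QP; the function names, block interface, and rounding are illustrative only and do not reproduce the normative VVC quantizer.

#include <math.h>

/* Illustrative uniform scalar quantization of transform coefficients.
 * A larger qp yields a larger step size and therefore coarser levels. */
void quantize_block(const double *coeff, int *level, int numCoeff, int qp)
{
    double qStep = pow(2.0, (qp - 4) / 6.0); /* step roughly doubles every 6 QP */
    for (int i = 0; i < numCoeff; i++) {
        int sign = coeff[i] < 0 ? -1 : 1;
        level[i] = sign * (int)(fabs(coeff[i]) / qStep + 0.5);
    }
}

/* Inverse quantization as used in the encoder's reconstruction path. */
void dequantize_block(const int *level, double *recCoeff, int numCoeff, int qp)
{
    double qStep = pow(2.0, (qp - 4) / 6.0);
    for (int i = 0; i < numCoeff; i++)
        recCoeff[i] = level[i] * qStep;
}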
Referring again to fig. 5, the intra-prediction processing unit 512 may be configured to select an intra-prediction mode for the video block to be encoded. The intra prediction processing unit 512 may be configured to evaluate the frame and determine an intra prediction mode used to encode the current block. As described above, possible intra prediction modes may include a planar prediction mode, a DC prediction mode, and an angular prediction mode. Further, it should be noted that in some examples, the prediction mode of the chroma component may be inferred from the intra prediction mode of the luma component. The intra-prediction processing unit 512 may select the intra-prediction mode after performing one or more encoding passes. Further, in one example, intra-prediction processing unit 512 may select a prediction mode based on rate-distortion analysis. As shown in fig. 5, the intra-prediction processing unit 512 outputs intra-prediction data (e.g., syntax elements) to the entropy encoding unit 518 and the transform coefficient generator 504. As described above, the transforms performed on the residual data may be mode dependent (e.g., a secondary transform matrix may be determined based on the prediction mode).
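As a concrete example of one of the intra prediction modes mentioned above, the following sketch implements a simplified DC predictor that fills a block with the average of its reconstructed left and top neighboring samples. The interface is illustrative, and boundary filtering and unavailable-neighbor handling are omitted.

#include <stdint.h>

/* Simplified DC intra prediction for an nTbS x nTbS block; left[y] and
 * top[x] hold the reconstructed neighboring samples. */
void intra_predict_dc(uint8_t *pred, int stride, int nTbS,
                      const uint8_t *left, const uint8_t *top)
{
    int sum = 0;
    for (int i = 0; i < nTbS; i++)
        sum += left[i] + top[i];
    uint8_t dcVal = (uint8_t)((sum + nTbS) / (2 * nTbS)); /* rounded mean */
    for (int y = 0; y < nTbS; y++)
        for (int x = 0; x < nTbS; x++)
            pred[y * stride + x] = dcVal;
}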
Referring again to fig. 5, the inter prediction processing unit 514 may be configured to perform inter prediction encoding for the current video block. The inter prediction processing unit 514 may be configured to receive the source video block and calculate a motion vector for a PU of the video block. The motion vector may indicate a displacement of a prediction unit of a video block within the current video frame relative to a prediction block within the reference frame. Inter prediction coding may use one or more reference pictures. Further, the motion prediction may be unidirectional prediction (using one motion vector) or bidirectional prediction (using two motion vectors). The inter prediction processing unit 514 may be configured to select a prediction block by calculating pixel differences determined by, for example, sum of Absolute Differences (SAD), sum of Squared Differences (SSD), or other difference metrics. As described above, a motion vector can be determined and specified from motion vector prediction. As described above, the inter prediction processing unit 514 may be configured to perform motion vector prediction. The inter prediction processing unit 514 may be configured to generate a prediction block using the motion prediction data. For example, the inter-prediction processing unit 514 may locate a predicted video block (not shown in fig. 5) within a frame buffer. It should be noted that the inter prediction processing unit 514 may be further configured to apply one or more interpolation filters to the reconstructed residual block to calculate sub-integer pixel values for motion estimation. The inter prediction processing unit 514 may output motion prediction data of the calculated motion vector to the entropy encoding unit 518.
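The SAD-based selection of a prediction block described above can be sketched as a brute-force full search over integer displacements. Practical encoders use fast search patterns and sub-pixel refinement; the function names are illustrative, and bounds checking against picture borders is omitted.

#include <stdint.h>
#include <stdlib.h>
#include <limits.h>

/* Sum of absolute differences between a current block and a reference block. */
static int sad_block(const uint8_t *cur, const uint8_t *ref,
                     int stride, int bw, int bh)
{
    int sad = 0;
    for (int y = 0; y < bh; y++)
        for (int x = 0; x < bw; x++)
            sad += abs(cur[y * stride + x] - ref[y * stride + x]);
    return sad;
}

/* Full search over a +/-range window around (bx, by); writes the winning
 * integer motion vector to (*mvx, *mvy) and returns its SAD. */
int motion_search(const uint8_t *cur, const uint8_t *ref, int stride,
                  int bx, int by, int bw, int bh, int range,
                  int *mvx, int *mvy)
{
    int best = INT_MAX;
    for (int dy = -range; dy <= range; dy++) {
        for (int dx = -range; dx <= range; dx++) {
            int cost = sad_block(cur + by * stride + bx,
                                 ref + (by + dy) * stride + (bx + dx),
                                 stride, bw, bh);
            if (cost < best) { best = cost; *mvx = dx; *mvy = dy; }
        }
    }
    return best;
}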
As shown in fig. 5, the filter unit 516 receives the reconstructed video block and the encoding parameters and outputs modified reconstructed video data. The filter unit 516 may be configured to perform deblocking and/or Sample Adaptive Offset (SAO) filtering. SAO filtering is a type of nonlinear amplitude mapping that can be used to improve reconstruction by adding an offset to the reconstructed video data. It should be noted that as shown in fig. 5, the intra prediction processing unit 512 and the inter prediction processing unit 514 may receive the modified reconstructed video block via the filter unit 516. The entropy encoding unit 518 receives quantized transform coefficients and prediction syntax data (i.e., intra prediction data and motion prediction data). It should be noted that in some examples, coefficient quantization unit 506 may perform a scan of a matrix comprising quantized transform coefficients before outputting the coefficients to entropy encoding unit 518. In other examples, entropy encoding unit 518 may perform the scanning. The entropy encoding unit 518 may be configured to perform entropy encoding according to one or more of the techniques described herein. As such, video encoder 500 represents an example of a device configured to generate encoded video data in accordance with one or more techniques of this disclosure.
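The SAO filtering mentioned above can be illustrated with a band-offset sketch in which samples are classified into amplitude bands and a per-band offset is added. The 32-band arrangement for 8-bit samples follows common practice, but the function and parameter names are illustrative and edge-offset classification is not shown.

#include <stdint.h>

static inline uint8_t clip8(int v) { return v < 0 ? 0 : (v > 255 ? 255 : (uint8_t)v); }

/* Band-offset SAO sketch for 8-bit samples: 32 bands of width 8, with an
 * offset applied to four consecutive bands starting at bandStart. */
void sao_band_offset(uint8_t *rec, int numSamples, int bandStart, const int offset[4])
{
    for (int i = 0; i < numSamples; i++) {
        int band = rec[i] >> 3;        /* 256 / 32 = 8 sample values per band */
        int idx = band - bandStart;
        if (idx >= 0 && idx < 4)
            rec[i] = clip8(rec[i] + offset[idx]);
    }
}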
Referring again to fig. 1, data encapsulator 107 can receive encoded video data and generate a compatible bitstream, e.g., a sequence of NAL units, according to a defined data structure. A device receiving a compatible bitstream may reproduce video data therefrom. Further, as described above, sub-bitstream extraction may refer to the process by which a device receiving a compatible bitstream forms a new compatible bitstream by discarding and/or modifying data in the received bitstream. It should be noted that the term compliant bitstream may be used instead of the term compatible bitstream. In one example, the data encapsulator 107 may be configured to generate syntax in accordance with one or more techniques described herein. It should be noted that the data encapsulator 107 need not necessarily be located in the same physical device as the video encoder 106. For example, the functions described as being performed by the video encoder 106 and the data encapsulator 107 may be distributed among the devices shown in fig. 4.
As described above, the signaling provided in JVET-AA2006 may be inadequate. In particular, a neural network post-processing filter may be used to upsample the frame rate of the decoded video. JVET-AA2006 does not provide signaling for frame rate upsampling. In accordance with the techniques herein, signaling of an additional post-processing filter purpose for frame rate upsampling is provided. In one example, a neural network post-processing filter for frame rate upsampling may take as input two or more decoded output pictures and generate one or more interpolated pictures between those pictures. Table 16 shows the syntax of an exemplary neural network post-filter characteristics SEI message in accordance with the techniques herein.
Table 16
For table 16, semantics may be based on the semantics provided above and based on the following:
nnpfc_purpose indicates the purpose of the post-processing filter, as specified in Table 17. The value of nnpfc_purpose should be in the range of 0 to 2^32 - 2, inclusive. Values of nnpfc_purpose not present in Table 17 are reserved for future use by ITU-T | ISO/IEC specifications and should not be present in a bitstream conforming to this version of this specification. A decoder conforming to this version of this specification should ignore SEI messages that contain a reserved value of nnpfc_purpose.
Value of nnpfc_purpose    Interpretation
0    Unknown or unspecified
1    Visual quality improvement
2    Chroma upsampling from the 4:2:0 chroma format to the 4:2:2 or 4:4:4 chroma format, or from the 4:2:2 chroma format to the 4:4:4 chroma format
3    Increasing the width or height of the cropped decoded output picture without changing the chroma format
4    Increasing the width or height of the cropped decoded output picture and upsampling the chroma format
5    Frame rate upsampling
TABLE 17
Note that when ITU-T | ISO/IEC specifies a use for a reserved value of nnpfc_purpose in the future, the syntax of this SEI message can be extended with syntax elements whose presence is conditioned on nnpfc_purpose being equal to that value.
In other examples, another term (e.g., frame rate interpolation or frame rate increase) may be used instead of the term frame rate upsampling.
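For reference, the purpose values of Table 17 can be collected in an enumeration. Only the numeric values come from the table; the identifier names are illustrative.

/* nnpfc_purpose values as listed in Table 17 (identifier names are illustrative). */
typedef enum {
    NNPFC_PURPOSE_UNKNOWN                = 0,
    NNPFC_PURPOSE_VISUAL_QUALITY         = 1,
    NNPFC_PURPOSE_CHROMA_UPSAMPLING      = 2,
    NNPFC_PURPOSE_RESOLUTION_INCREASE    = 3,
    NNPFC_PURPOSE_RESOLUTION_AND_CHROMA  = 4,
    NNPFC_PURPOSE_FRAME_RATE_UPSAMPLING  = 5
} NnpfcPurpose;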
The nnpfc_number_of_input_pictures_minus2 plus 2 specifies the number of decoded output pictures that are used as input to the post-processing filter.
nnpfc_interpolated_pictures[i] specifies the number of interpolated pictures generated by the post-processing filter between the i-th and (i+1)-th pictures used as inputs to the post-processing filter.
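Taken together, these two syntax elements determine how many pictures the post-processing filter consumes and how many interpolated pictures it produces in total. The following sketch assumes the loop structure implied by the semantics, namely one nnpfc_interpolated_pictures[i] value per pair of consecutive input pictures; the syntax table itself is not reproduced above, so this structure should be treated as an assumption.

/* Given the parsed syntax elements, return the total number of interpolated
 * pictures produced by the post-processing filter. Assumes one
 * nnpfc_interpolated_pictures[i] entry per consecutive input picture pair. */
int total_interpolated_pictures(int nnpfc_number_of_input_pictures_minus2,
                                const int *nnpfc_interpolated_pictures)
{
    int numInputPics = nnpfc_number_of_input_pictures_minus2 + 2;
    int total = 0;
    for (int i = 0; i < numInputPics - 1; i++)
        total += nnpfc_interpolated_pictures[i];
    return total;
}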
nnpfc_inp_order_idc indicates the method of ordering the sample arrays of a cropped decoded output picture as the input to the post-processing filter. Table 18 contains an informative description of the values of nnpfc_inp_order_idc. The semantics of nnpfc_inp_order_idc in the range of 0 to 3, inclusive, are specified in Table 19, which specifies the process for deriving the input tensor for different values of nnpfc_inp_order_idc and a given vertical sample coordinate cTop and horizontal sample coordinate cLeft of the top-left sample position of the sample block included in the input tensor. When the chroma format of the cropped decoded output picture is not 4:2:0, nnpfc_inp_order_idc should not be equal to 3. The value of nnpfc_inp_order_idc should be in the range of 0 to 255, inclusive. Values of nnpfc_inp_order_idc greater than 3 are reserved for future use by ITU-T | ISO/IEC specifications and should not be present in bitstreams conforming to this version of this specification. A decoder conforming to this version of this specification should ignore SEI messages that contain a reserved value of nnpfc_inp_order_idc.
TABLE 18
nnpfc_overlap specifies the overlapping horizontal and vertical sample counts of adjacent input tensors of the post-processing filter. The value of nnpfc_overlap should be in the range of 0 to 16383 (inclusive).
The variables inpPatchWidth, inpPatchHeight, outPatchWidth, outPatchHeight, horCScaling, verCScaling, outPatchCWidth, outPatchCHeight, and overlapSize are derived as follows:
It is a requirement of bitstream conformance that outPatchWidth multiplied by the width of the cropped decoded output picture should be equal to inpPatchWidth multiplied by nnpfc_pic_width_in_luma_samples, and that outPatchHeight multiplied by the height of the cropped decoded output picture should be equal to inpPatchHeight multiplied by nnpfc_pic_height_in_luma_samples.
TABLE 19
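Although the normative derivation of the patch variables is not reproduced above, the general role of nnpfc_overlap can be illustrated by counting how many overlapping patches are needed to cover one picture dimension when each new patch advances by the patch size minus the overlap. The function below is only a sketch of that idea and is not the input tensor derivation of Table 19.

/* Illustrative count of overlapping patches needed to cover picSize samples
 * when consecutive patches overlap by overlapSize samples. */
int num_patches(int picSize, int patchSize, int overlapSize)
{
    int step = patchSize - overlapSize;
    if (step <= 0)
        return -1;                       /* degenerate configuration */
    int count = 1;                       /* patch anchored at position 0 */
    for (int pos = 0; pos + patchSize < picSize; pos += step)
        count++;
    return count;
}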
nnpfc_out_order_idc indicates the output order of samples generated from the post-processing filter. Table 13 contains an informative description of the values of nnpfc_out_order_idc. The semantics of nnpfc_out_order_idc in the range of 0 to 3, inclusive, are specified in Table 20, which specifies the process for deriving the sample values in the filtered output sample arrays FilteredYPic, FilteredCbPic, and FilteredCrPic from the output tensor for different values of nnpfc_out_order_idc and a given vertical sample coordinate cTop and horizontal sample coordinate cLeft of the top-left sample position of the sample block included in the input tensor. When nnpfc_purpose is equal to 2 or 4, nnpfc_out_order_idc should not be equal to 3. The value of nnpfc_out_order_idc should be in the range of 0 to 255, inclusive. Values of nnpfc_out_order_idc greater than 3 are reserved for future use by ITU-T | ISO/IEC specifications and should not be present in bitstreams conforming to this version of this specification. A decoder conforming to this version of this specification should ignore SEI messages that contain a reserved value of nnpfc_out_order_idc.
Table 20
For the semantics provided above for table 16, in a variant, the syntax element nnpfc_number_of_input_pictures_minus2 may instead be signaled as nnpfc_number_of_input_pictures_minus1, where the semantics are as follows: the nnpfc_number_of_input_pictures_minus1 plus 1 specifies the number of decoded output pictures that are used as input to the post-processing filter. In one example, the value of nnpfc_number_of_input_pictures_minus1 should not be equal to 0.
It should be noted that signaling the syntax element as nnpfc_interpolated_pictures[i] allows specifying (by setting the value of nnpfc_interpolated_pictures[i] equal to 0) that no interpolated picture is generated between the particular i-th picture and the (i+1)-th picture that are used as inputs to the post-processing filter. In a variant, the syntax element nnpfc_interpolated_pictures[i] may instead be signaled as nnpfc_interpolated_pictures_minus1[i]. In this example, the semantics may be as follows:
nnpfc_interpolated_pictures_minus1[i] plus 1 specifies the number of interpolated pictures generated by the post-processing filter between the i-th and (i+1)-th pictures used as inputs to the post-processing filter.
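One way to visualize the effect of the per-pair counts is to lay out the resulting output picture order: each input picture is followed by the interpolated pictures generated before the next input picture, and a count of 0 places two input pictures back to back. This layout is illustrative only, and the function names do not appear in any specification.

#include <stdio.h>

/* Print an illustrative output picture order: interpolated[i] pictures are
 * placed between input picture i and input picture i+1. */
void print_output_order(int numInputPics, const int *interpolated)
{
    for (int i = 0; i < numInputPics; i++) {
        printf("input[%d] ", i);
        if (i < numInputPics - 1)
            for (int k = 0; k < interpolated[i]; k++)
                printf("interp[%d.%d] ", i, k);
    }
    printf("\n");
}

/* Example: with 3 input pictures and per-pair counts { 2, 0 }, the call
 * print_output_order(3, (int[]){ 2, 0 }) prints
 * "input[0] interp[0.0] interp[0.1] input[1] input[2]". */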
In one example, according to the techniques herein, a separate syntax element nnpfc_number_of_total_interpolated_pictures_minus1 may be signaled, and the loop may signal one fewer syntax element, such that nnpfc_interpolated_pictures[nnpfc_number_of_input_pictures_minus2] is not signaled but is instead inferred. Table 21 shows the syntax of an exemplary neural network post-filter characteristics SEI message in accordance with the techniques herein.
Table 21
For table 21, the semantics may be based on the semantics provided above, including the semantics provided for the syntax elements nnpfc_number_of_input_pictures_minus2, nnpfc_purpose, nnpfc_inp_order_idc, nnpfc_overlap, and nnpfc_out_order_idc of table 16, and based on the following:
nnpfc_number_of_total_interpolated_pictures_minus1 plus 1 specifies the total number of interpolated pictures generated by the post-processing filter.
nnpfc_interpolated_pictures[i] specifies the number of interpolated pictures generated by the post-processing filter between the i-th and (i+1)-th pictures used as inputs to the post-processing filter.
nnpfc_interpolated_pictures[nnpfc_number_of_input_pictures_minus2] is inferred as follows:
for(i=0,nI=0;i<nnpfc_number_of_input_pictures_minus2;i++)
nI+=nnpfc_interpolated_pictures[i]
nnpfc_interpolated_pictures[nnpfc_number_of_input_pictures_minus2]=(nnpfc_number_of_total_interpolated_pictures_minus1+1)-nI
When nnpfc_number_of_input_pictures_minus2 is equal to 0, nnpfc_interpolated_pictures[0] is inferred to be equal to nnpfc_number_of_total_interpolated_pictures_minus1+1.
For table 21, the derivation of numInputImages and numOutputImages may be as follows:
numInputImages=(nnpfc_purpose==5)?(nnpfc_number_of_input_pictures_minus2+2):1
numOutputImages=(nnpfc_purpose==5)?(nnpfc_number_of_total_interpolated_pictures_minus1+1):1
In one example, according to the techniques herein, it may be assumed that frame rate upsampling utilizes two input pictures, so that only the one syntax element nnpfc_number_of_total_interpolated_pictures_minus1 may be signaled. Table 22 shows the syntax of an exemplary neural network post-filter characteristics SEI message in accordance with the techniques herein.
Table 22
For table 22, semantics may be based on the semantics provided above, including those provided for the syntax elements nnpfc_purpose, nnpfc_inp_order_idc, nnpfc_overlap, and nnpfc_out_order_idc of table 16, and based on the following:
nnpfc_number_of_total_interpolated_pictures_minus1 plus 1 specifies the total number of interpolated pictures generated by the post-processing filter.
If nnpfc_purpose is equal to 5, nnpfc_number_of_input_pictures_minus2 is inferred to be equal to 0.
For table 22, the derivation of numinputimages and numOutputImages can be as follows:
numInputImages=(nnpfc_purpose==5)?2:1
numOutputImages=(nnpfc_purpose==5)?(nnpfc_number_of_total_interpolated_pictures_minus1+1):1
It should be noted that although the above syntax tables use ue(v) coding for nnpfc_number_of_input_pictures_minus1, nnpfc_interpolated_pictures[i], and nnpfc_number_of_total_interpolated_pictures_minus1, in another example, u(v) or fixed-length u(8), u(12), or u(16) coding may be used for one or more of these syntax elements.
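For context, ue(v) denotes unsigned Exp-Golomb coding, whose code length grows with the coded value, whereas u(8), u(12), and u(16) are fixed-length codes and u(v) uses a length determined by other syntax. The following sketch encodes a value with ue(v) into a string of '0'/'1' characters purely for illustration; the function names are not part of any specification.

#include <stdint.h>

/* Number of bits used by ue(v) for value v: 2 * floor(log2(v + 1)) + 1. */
int ue_bits(uint32_t v)
{
    int leadingZeros = 0;
    for (uint32_t x = v + 1; x > 1; x >>= 1)
        leadingZeros++;
    return 2 * leadingZeros + 1;
}

/* Encode v with ue(v) as a '0'/'1' string: leadingZeros zeros followed by
 * the binary representation of (v + 1). out must hold ue_bits(v) + 1 chars. */
void ue_encode(uint32_t v, char *out)
{
    int leadingZeros = ue_bits(v) / 2;
    int pos = 0;
    for (int i = 0; i < leadingZeros; i++)
        out[pos++] = '0';
    for (int i = leadingZeros; i >= 0; i--)
        out[pos++] = (((v + 1) >> i) & 1) ? '1' : '0';
    out[pos] = '\0';
}

/* Example: ue_encode(4, buf) yields "00101" (5 bits), whereas a fixed u(8)
 * code for the same value always spends 8 bits ("00000100"). */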
In this way, video encoder 500 represents an example of a device configured to signal a neural network post-filter characteristic message and to signal a syntax element that specifies a number of interpolated pictures generated by a post-processing filter between successive pictures that serve as inputs to the post-processing filter.
Referring again to fig. 1, interface 108 may comprise any device configured to receive data generated by data encapsulator 107 and to transmit and/or store the data to a communication medium. Interface 108 may include a network interface card such as an Ethernet card, and may include an optical transceiver, a radio frequency transceiver, or any other type of device that may transmit and/or receive information. Further, interface 108 may include a computer system interface that may enable files to be stored on a storage device. For example, interface 108 may include support for Peripheral Component Interconnect (PCI) and Peripheral Component Interconnect Express (PCIe) bus protocols, a proprietary bus protocol, a Universal Serial Bus (USB) protocol, I2C, or any other logical and physical structure that may be used to interconnect peer devices.
Referring again to fig. 1, target device 120 includes an interface 122, a data decapsulator 123, a video decoder 124, and a display 126. Interface 122 may comprise any device configured to receive data from a communication medium. Interface 122 may include a network interface card such as an Ethernet card, and may include an optical transceiver, a radio frequency transceiver, or any other type of device that may receive and/or transmit information. In addition, interface 122 may include a computer system interface that allows for retrieving compatible video bitstreams from a storage device. For example, interface 122 may include support for PCI and PCIe bus protocols, proprietary bus protocols, USB protocols, I2C, or any other logical and physical structure that may be used to interconnect peer devices. The data decapsulator 123 may be configured to receive and parse any of the example syntax structures described herein.
Video decoder 124 may include any device configured to receive a bitstream (e.g., sub-bitstream extraction) and/or acceptable variations thereof and to reproduce video data therefrom. Display 126 may include any device configured to display video data. The display 126 may include one of a variety of display devices such as a Liquid Crystal Display (LCD), a plasma display, an Organic Light Emitting Diode (OLED) display, or another type of display. The display 126 may include a high definition display or an ultra high definition display. It should be noted that although in the example shown in fig. 1, video decoder 124 is described as outputting data to display 126, video decoder 124 may be configured to output video data to various types of devices and/or subcomponents thereof. For example, video decoder 124 may be configured to output video data to any communication medium, as described herein.
Fig. 6 is a block diagram illustrating an example of a video decoder that may be configured to decode video data in accordance with one or more techniques of the present disclosure (e.g., the decoding process for reference picture list construction described above). In one example, the video decoder 600 may be configured to decode the transform data and reconstruct residual data from the transform coefficients based on the decoded transform data. The video decoder 600 may be configured to perform intra prediction decoding and inter prediction decoding, and thus may be referred to as a hybrid decoder. The video decoder 600 may be configured to parse any combination of the syntax elements described above in tables 1-22. The video decoder 600 may decode video based on or in accordance with the above-described procedure and also based on the parsed values in tables 1 to 22.
In the example shown in fig. 6, the video decoder 600 includes an entropy decoding unit 602, an inverse quantization unit 604, an inverse transform coefficient processing unit 606, an intra prediction processing unit 608, an inter prediction processing unit 610, a summer 612, a post-filter unit 614, and a reference buffer 616. The video decoder 600 may be configured to decode video data in a manner consistent with a video encoding system. It should be noted that although the exemplary video decoder 600 is shown with different functional blocks, such illustration is intended for descriptive purposes and not to limit the video decoder 600 and/or its subcomponents to a particular hardware or software architecture. The functions of video decoder 600 may be implemented using any combination of hardware, firmware, and/or software implementations.
As shown in fig. 6, the entropy decoding unit 602 receives an entropy-encoded bitstream. The entropy decoding unit 602 may be configured to decode syntax elements and quantized coefficients from the bitstream according to a process that is reciprocal to the entropy encoding process. The entropy decoding unit 602 may be configured to perform entropy decoding according to any of the entropy encoding techniques described above. The entropy decoding unit 602 may determine values of syntax elements in the encoded bitstream in a manner consistent with the video encoding standard. As shown in fig. 6, the entropy decoding unit 602 may determine quantization parameters, quantization coefficient values, transform data, and prediction data from a bitstream. In the example shown in fig. 6, the inverse quantization unit 604 and the inverse transform coefficient processing unit 606 receive quantized coefficient values from the entropy decoding unit 602, and output reconstructed residual data.
Referring again to fig. 6, the reconstructed residual data may be provided to a summer 612. Summer 612 may add the reconstructed residual data to the prediction video block and generate reconstructed video data. The prediction video block may be determined according to a prediction video technique (i.e., intra-prediction and inter-prediction). The intra prediction processing unit 608 may be configured to receive the intra prediction syntax element and retrieve the predicted video block from the reference buffer 616. The reference buffer 616 may include a memory device configured to store one or more frames of video data. The intra prediction syntax element may identify an intra prediction mode, such as the intra prediction mode described above. The inter prediction processing unit 610 may receive the inter prediction syntax element and generate a motion vector to identify a prediction block in one or more reference frames stored in the reference buffer 616. The inter prediction processing unit 610 may generate a motion compensation block, possibly performing interpolation based on an interpolation filter. An identifier of an interpolation filter for motion estimation with sub-pixel precision may be included in the syntax element. The inter prediction processing unit 610 may calculate interpolation values of sub-integer pixels of the reference block using interpolation filters. Post-filter unit 614 may be configured to perform filtering on the reconstructed video data. For example, post-filter unit 614 may be configured to perform deblocking and/or Sample Adaptive Offset (SAO) filtering, e.g., based on parameters specified in the bitstream. Further, it should be noted that in some examples, post-filter unit 614 may be configured to perform dedicated arbitrary filtering (e.g., visual enhancement such as mosquito noise cancellation). As shown in fig. 6, the video decoder 600 may output the reconstructed video block. In this way, video decoder 600 represents an example of a device configured to receive a neural network post-filter characteristic message and parse a syntax element that specifies a number of interpolated pictures generated by a post-processing filter between consecutive pictures that are used as inputs to the post-processing filter.
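To connect the decoder back to the signaling discussed above, the following sketch generates interpolated pictures between two consecutive decoded output pictures, using the number of interpolated pictures parsed from the neural network post-filter characteristic message for that picture pair. A simple linear blend stands in for the neural network inference, and all function and parameter names are illustrative; error handling is omitted.

#include <stdint.h>
#include <stdlib.h>

/* Stand-in for neural network frame interpolation: a linear blend of two
 * decoded output pictures at fractional position alpha in (0, 1). */
static void blend_pictures(const uint8_t *pic0, const uint8_t *pic1,
                           uint8_t *out, int numSamples, double alpha)
{
    for (int i = 0; i < numSamples; i++)
        out[i] = (uint8_t)((1.0 - alpha) * pic0[i] + alpha * pic1[i] + 0.5);
}

/* Generate numInterp pictures between pic0 and pic1, where numInterp is the
 * value parsed for this picture pair (e.g., nnpfc_interpolated_pictures[i]). */
uint8_t **frame_rate_upsample(const uint8_t *pic0, const uint8_t *pic1,
                              int numSamples, int numInterp)
{
    uint8_t **interp = malloc(numInterp * sizeof(*interp));
    for (int k = 0; k < numInterp; k++) {
        interp[k] = malloc(numSamples);
        double alpha = (double)(k + 1) / (numInterp + 1);
        blend_pictures(pic0, pic1, interp[k], numSamples, alpha);
    }
    return interp;
}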
In one or more examples, the functions described may be implemented by hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium, and executed by a hardware-based processing unit. The computer-readable medium may comprise a computer-readable storage medium corresponding to a tangible medium, such as a data storage medium, or a propagation medium comprising any medium that facilitates the transfer of a computer program from one place to another, for example, according to a communication protocol. As such, the computer-readable medium may generally correspond to: (1) A non-transitory tangible computer readable storage medium, or (2) a communication medium such as a signal or carrier wave. Data storage media may be any available media that can be accessed by one or more computers or one or more processors to retrieve instructions, code and/or data structures for implementing the techniques described in this disclosure. The computer program product may include a computer-readable medium.
By way of example, and not limitation, such computer-readable storage media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, flash memory, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer. Also, any connection is properly termed a computer-readable medium. For example, if the instructions are transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. However, it should be understood that computer-readable storage media and data storage media do not include connections, carrier waves, signals, or other transitory media, but are instead directed to non-transitory tangible storage media. Disk and disc, as used herein, includes Compact Disc (CD), laser disc, optical disc, digital Versatile Disc (DVD), floppy disk and Blu-ray disc where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
The instructions may be executed by one or more processors, such as one or more Digital Signal Processors (DSPs), general purpose microprocessors, application Specific Integrated Circuits (ASICs), field programmable logic arrays (FPGAs), or other equivalent integrated or discrete logic circuitry. Thus, the term "processor" as used herein may refer to any of the foregoing structure or any other structure suitable for implementation of the techniques described herein. Furthermore, in some aspects, the functionality described herein may be provided within dedicated hardware and/or software modules configured for encoding and decoding, or incorporated into a combined codec. Moreover, these techniques may be implemented entirely in one or more circuits or logic elements.
The techniques of this disclosure may be implemented in various devices or apparatuses, including a wireless handset, an Integrated Circuit (IC), or a set of ICs (e.g., a chipset). Various components, modules, or units are described in this disclosure to emphasize functional aspects of devices configured to perform the disclosed techniques but do not necessarily require realization by different hardware units. Rather, as described above, the various units may be combined in a codec hardware unit or provided by an interoperating hardware unit comprising a set of one or more processors as described above, in combination with suitable software and/or firmware.
Further, each functional block or various features of the base station apparatus and the terminal apparatus used in each of the above-described embodiments may be realized or executed by a circuit (typically, an integrated circuit or a plurality of integrated circuits). Circuits designed to perform the functions described in this specification may include general purpose processors, digital Signal Processors (DSPs), application specific or general purpose integrated circuits (ASICs), field Programmable Gate Arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic, or discrete hardware components, or combinations thereof. A general purpose processor may be a microprocessor, or in the alternative, the processor may be a conventional processor, controller, microcontroller, or state machine. The general purpose processor or each of the above circuits may be configured by digital circuitry or may be configured by analog circuitry. In addition, when a technology of manufacturing an integrated circuit that replaces the current integrated circuit occurs due to progress in semiconductor technology, the integrated circuit produced by the technology can also be used.
Various examples have been described. These and other examples are within the scope of the following claims.

Claims (11)

1. A method of signaling neural network post-filter information for encoded video data, the method comprising:
signaling a neural network post-filter characteristic message; and
signaling a syntax element that specifies a number of interpolated pictures generated by a post-processing filter between successive pictures that serve as inputs to the post-processing filter.
2. An apparatus comprising one or more processors configured to:
signal a neural network post-filter characteristic message; and
signal a syntax element that specifies a number of interpolated pictures generated by a post-processing filter between successive pictures that serve as inputs to the post-processing filter.
3. The apparatus of claim 2, wherein the one or more processors are further configured to signal a syntax element specifying a number of decoded output pictures to be used as input to the post-processing filter.
4. The apparatus of claim 3, wherein the one or more processors are further configured to signal a syntax element indicating a purpose of the post-processing filter.
5. The apparatus of claim 4, wherein a value of 5 for the syntax element indicating a purpose of the post-processing filter indicates frame rate upsampling.
6. The apparatus of claim 2, wherein the apparatus comprises a video encoder.
7. An apparatus comprising one or more processors configured to:
receive a neural network post-filter characteristic message; and
parse a syntax element that specifies a number of interpolated pictures generated by a post-processing filter between successive pictures that are used as inputs to the post-processing filter.
8. The apparatus of claim 7, wherein the one or more processors are further configured to parse a syntax element specifying a number of decoded output pictures to be used as input to the post-processing filter.
9. The apparatus of claim 8, wherein the one or more processors are further configured to parse a syntax element indicating a purpose of the post-processing filter.
10. The apparatus of claim 9, wherein a value of 5 for the syntax element indicating a purpose of the post-processing filter indicates frame rate upsampling.
11. The apparatus of claim 7, wherein the apparatus comprises a video decoder.