CN117917080A - System and method for interpolating reconstructed feature data in encoding of multi-dimensional data

System and method for interpolating reconstructed feature data in encoding of multi-dimensional data

Info

Publication number
CN117917080A
Authority
CN
China
Prior art keywords
video
data
picture
vps
layer
Legal status
Pending
Application number
CN202280060456.XA
Other languages
Chinese (zh)
Inventor
Kiran Mukesh Misra
Tianying Ji
Christopher Andrew Segall
Current Assignee
Sharp Corp
Original Assignee
Sharp Corp
Application filed by Sharp Corp
Priority claimed from PCT/JP2022/033489 (WO2023038038A1)
Publication of CN117917080A

Landscapes

  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

A method of interpolating inferred data corresponding to reconstructed feature data is disclosed. The method comprises the following steps: receiving reconstructed feature data, wherein the reconstructed feature data corresponds to video data that has been temporally downsampled; generating a bounding box for the reconstructed feature data; and interpolating a bounding box for a temporally downsampled portion of the video.

Description

System and method for interpolating reconstructed feature data in encoding of multi-dimensional data
Technical Field
The present disclosure relates to encoding multi-dimensional data, and more particularly to techniques for interpolating reconstructed feature data.
Background
Digital video and audio functions may be incorporated into a variety of devices, including digital televisions, computers, digital recording devices, digital media players, video gaming devices, smart phones, medical imaging devices, surveillance systems, tracking and monitoring systems, and the like. Digital video and audio may be represented as a collection of arrays. Data represented as a set of arrays may be referred to as multi-dimensional data. For example, a picture in a digital video may be represented as a collection of two-dimensional arrays of sample values. That is, for example, the video resolution provides the width and height dimensions of each sample value array, and each component of the color space provides the number of two-dimensional arrays in the collection. Furthermore, the number of pictures in a digital video sequence provides another data dimension. For example, one second of 60 Hz video at 1080p resolution with three color components may correspond to four dimensions of data values, i.e., the number of samples may be expressed as 1920×1080×3×60. Thus, digital video and images are examples of multi-dimensional data. It should be noted that additional and/or alternative dimensions (e.g., number of layers, number of views/channels, etc.) may be used to represent digital video.
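To make the dimensionality above concrete, the following short Python/numpy sketch (illustrative only; the axis ordering is an assumption of this example, not something specified by the disclosure) shows the four-dimensional shape for one second of 60 Hz, 1080p video with three color components:

```python
import numpy as np

# One second of 60 Hz, 1080p video with three color components, viewed as a
# four-dimensional collection of sample values.  Axis order is illustrative:
# width x height x number of components x number of pictures.
shape = (1920, 1080, 3, 60)

print(np.prod(shape))  # 373248000 sample values, i.e. 1920*1080*3*60
```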
Digital video may be encoded according to a video coding standard. A video coding standard defines the format of a compatible bitstream encapsulating coded video data. A compatible bitstream is a data structure that may be received and decoded by a video decoding device to generate reconstructed video data. Typically, the reconstructed video data is intended for human consumption (i.e., viewing on a display). Examples of video coding standards include ISO/IEC MPEG-4 Visual and ITU-T H.264 (also known as ISO/IEC MPEG-4 AVC) and High Efficiency Video Coding (HEVC). HEVC is described in ITU-T Recommendation H.265, High Efficiency Video Coding (HEVC), December 2016, which is incorporated herein by reference and is referred to herein as ITU-T H.265. The ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG), collectively referred to as the Joint Video Experts Team (JVET), have been working to standardize video coding technology with compression capability beyond HEVC. This standardization effort is known as the Versatile Video Coding (VVC) project. "Versatile Video Coding (Draft 10)," document JVET-T2001-v2, from the 20th meeting of ISO/IEC JTC1/SC29/WG11, held 7-16 October 2020 (which is incorporated herein by reference and is referred to herein as VVC), represents the current iteration of draft text corresponding to the video coding specification of the VVC project.
Video coding standards may utilize video compression techniques. Video compression techniques reduce the data requirements for storing and/or transmitting video data by exploiting redundancy inherent in video sequences. Video compression techniques typically subdivide a video sequence into successively smaller portions (i.e., groups of pictures within the video sequence, pictures within a group of pictures, regions within a picture, etc.) and utilize intra-prediction coding techniques (e.g., spatial prediction techniques within a picture) and inter-prediction techniques (i.e., inter-picture (temporal) techniques) to generate difference values between a unit of video data to be encoded and a reference unit of video data. The difference values may be referred to as residual data. Syntax elements may relate the residual data to a reference coding unit (e.g., an intra-prediction mode index and motion information). The residual data and syntax elements may be entropy encoded. Entropy encoded residual data and syntax elements may be included in a data structure forming a compatible bitstream.
Disclosure of Invention
In one example, a method of interpolating inferred data corresponding to reconstructed feature data includes: receiving reconstructed feature data, wherein the reconstructed feature data corresponds to video data that has been temporally downsampled; generating a bounding box for the reconstructed feature data; and interpolating a bounding box for a temporally downsampled portion of the video.
In one example, an apparatus includes one or more processors configured to: receive reconstructed feature data, wherein the reconstructed feature data corresponds to video data that has been temporally downsampled; generate a bounding box for the reconstructed feature data; and interpolate a bounding box for a temporally downsampled portion of the video.
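The summary above states the steps only at a high level. As one plausible illustration, and only under the assumption that the interpolation is a simple linear blend of (x, y, width, height) coordinates between two coded anchor pictures, a sketch might look as follows; it is not presented as the claimed method itself:

```python
def interpolate_box(box_prev, box_next, t):
    """Linearly interpolate one (x, y, width, height) bounding box.

    box_prev, box_next: boxes inferred for two coded (anchor) pictures.
    t: fractional position of a temporally downsampled (skipped) picture
       between the anchors, with 0 < t < 1.  Linear interpolation is an
       assumption made for illustration only.
    """
    return tuple((1.0 - t) * a + t * b for a, b in zip(box_prev, box_next))

# A picture halfway between two anchor pictures:
print(interpolate_box((100.0, 50.0, 32.0, 64.0), (120.0, 58.0, 32.0, 64.0), 0.5))
# (110.0, 54.0, 32.0, 64.0)
```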
The details of one or more examples are set forth in the accompanying drawings and the description below. Other features, objects, and advantages will be apparent from the description and drawings, and from the claims.
Drawings
Fig. 1 is a conceptual diagram illustrating video data as a multi-dimensional dataset (MDDS) according to one or more techniques of the present disclosure.
Fig. 2A is a conceptual diagram illustrating an example of encoding a block of video data using an available typical video encoding technique according to one or more techniques of this disclosure.
Fig. 2B is a conceptual diagram illustrating an example of encoding a block of video data using an available typical video encoding technique according to one or more techniques of this disclosure.
Fig. 3 is a conceptual diagram illustrating encoded video data and corresponding data structures associated with an available typical video encoding technique according to one or more techniques of the present disclosure.
Fig. 4 is a block diagram illustrating an example of a system that may be configured to encode and decode multidimensional data in accordance with one or more techniques of the present disclosure.
Fig. 5 is a block diagram illustrating an example of a video encoder that may be configured to encode video data according to typical video encoding techniques, which may be utilized with one or more techniques of the present disclosure.
Fig. 6 is a block diagram illustrating an example of a video decoder that may be configured to decode video data according to typical video decoding techniques, which may be utilized with one or more techniques of the present disclosure.
Fig. 7A is a conceptual diagram illustrating an example of encoding a block of video data according to an automatic encoding technique that may be utilized with one or more techniques of the present disclosure.
Fig. 7B is a conceptual diagram illustrating an example of encoding a block of video data according to an automatic encoding technique that may be utilized with one or more techniques of the present disclosure.
Fig. 8 is an example of an encoding system that may encode a multi-dimensional dataset according to one or more techniques of the present disclosure.
Fig. 9 is an example of an encoding system that may encode a multi-dimensional dataset according to one or more techniques of the present disclosure.
Fig. 10 is a block diagram illustrating an example of a video encoder that may be configured to encode video data in accordance with one or more techniques of the present disclosure.
Fig. 11 is a block diagram illustrating an example of a video decoder that may be configured to decode video data in accordance with one or more techniques of this disclosure.
FIG. 12 is a block diagram illustrating an example of a compression engine that may be configured to encode a multi-dimensional dataset according to one or more techniques of the present disclosure.
Fig. 13 is a block diagram illustrating an example of a decompression engine that may be configured to decode a multi-dimensional dataset according to one or more techniques of the present disclosure.
Fig. 14 is an example of an encoding system that may encode a multi-dimensional dataset according to one or more techniques of the present disclosure.
Fig. 15 is an example of an encoding system that may encode a multi-dimensional dataset according to one or more techniques of the present disclosure.
Fig. 16 is an example of an encoding system that may encode a multi-dimensional dataset according to one or more techniques of the present disclosure.
Fig. 17 is a conceptual diagram illustrating an example of generating feature data according to techniques that may be utilized with one or more techniques of this disclosure.
Fig. 18 is an example of an encoding system that may encode a multi-dimensional dataset according to one or more techniques of the present disclosure.
Fig. 19 is an example of a regional proposal network in accordance with one or more techniques of the present disclosure.
Fig. 20 is an example of a box header in accordance with one or more techniques of this disclosure.
Fig. 21 is an example of a box header in accordance with one or more techniques of the present disclosure.
Fig. 22 is an example of a mask header in accordance with one or more techniques of the present disclosure.
FIG. 23 is an example of an encoding system that may encode a multi-dimensional dataset according to one or more techniques of the present disclosure.
FIG. 24 is an example of an encoding system that may encode a multi-dimensional dataset according to one or more techniques of the present disclosure.
Fig. 25 is an example of an encoding system that can encode a multi-dimensional dataset according to one or more techniques of the present disclosure.
Fig. 26 is an example of an encoding system that can encode a multi-dimensional dataset according to one or more techniques of the present disclosure.
Fig. 27 is a conceptual diagram illustrating a data structure of encapsulating encoded video data and corresponding metadata according to one or more techniques of the present disclosure.
Detailed Description
In general, the present disclosure describes various techniques for encoding multi-dimensional data, which may be referred to as multi-dimensional data sets (MDDS) and may include, for example, video data, audio data, and the like. It should be noted that the techniques described herein for encoding multi-dimensional data may be used for other applications in addition to reducing the data requirements for providing multi-dimensional data for human consumption. For example, the techniques described herein may be used for so-called machine consumption. That is, for example, in the case of surveillance, it may be useful for a surveillance application running on a central server to be able to quickly identify and track objects from any one of a plurality of video feeds. In this case, the encoded video data need not necessarily be capable of being reconstructed into human-consumable form, but need only be capable of enabling the object to be identified. As described in further detail below, object detection, segmentation, and/or tracking (i.e., object recognition tasks) generally involve receiving an image (e.g., an image included in a single image or video sequence), generating feature data corresponding to the image, analyzing the feature data, and generating inference data, wherein the inference data may be indicative of a type of object and a spatial location of the object within the image. The spatial position of the object within the image may be specified by a bounding box having spatial coordinates (e.g., x, y) and dimensions (e.g., height and width). The present disclosure describes techniques for compressing and reconstructing feature data. In particular, this disclosure describes techniques for interpolating reconstructed feature data. The techniques described in this disclosure may be particularly useful for allowing object recognition tasks to be distributed across a communication network and optimizing video encoding. For example, in some applications, the acquisition device (e.g., camera and accompanying hardware) may have power and/or computational constraints. In this case, the generation of the feature data may be optimized for the capabilities at the acquisition device, but the analysis and inference may be more suitable for execution at one or more devices with additional capabilities distributed across the network. In such a case, compression of the feature set may facilitate efficient distribution of the object recognition task (e.g., reduced bandwidth and/or latency). It should be noted that, as described in further detail below, inferred data (e.g., spatial position of objects within an image) may be used to optimize encoding of video data (e.g., adjust encoding parameters to improve relative image quality in areas where objects of interest are present, etc.). Furthermore, the video encoding device utilizing inferred data may be located at a different location than the acquisition device. For example, the distribution network may include a plurality of distribution servers (at various physical locations) that perform compression and distribution of acquired video.
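As a rough illustration of the inference data described above (an object type together with a bounding box having spatial coordinates and dimensions), a container along the following lines could be used; all names and fields here are assumptions for illustration, not structures defined by the disclosure:

```python
from dataclasses import dataclass

@dataclass
class Detection:
    """Hypothetical record for one item of inference data."""
    label: str      # object type, e.g. "person"
    score: float    # detection confidence
    x: float        # x coordinate of the bounding box (e.g. top-left corner)
    y: float        # y coordinate of the bounding box
    width: float    # bounding box width
    height: float   # bounding box height

d = Detection(label="person", score=0.92, x=640.0, y=320.0, width=80.0, height=200.0)
print(d)
```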
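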
It should be noted that as used herein, the term "typical video coding standard" or "typical video coding" may refer to a video coding standard that utilizes one or more of the following video compression techniques: video partitioning techniques, intra-prediction techniques, inter-prediction techniques, residual transformation techniques, reconstructed video filtering techniques, and/or entropy encoding techniques for residual data and syntax elements. For example, the term "typical video coding standard" may refer to any of ITU-T H.264, ITU-T H.265, VVC, etc., alone or together. Furthermore, it should be noted that the incorporation of documents by reference herein is for descriptive purposes and should not be construed as limiting or creating ambiguity with respect to the terms used herein. For example, where a definition of a term provided in a particular incorporated reference differs from that in another incorporated reference and/or from the term as used herein, the term should be interpreted broadly to include each respective definition and/or to include each particular definition in the alternative.
Video content comprises a video sequence consisting of a series of frames (or pictures). A series of frames may also be referred to as a group of pictures (GOP). For encoding purposes, each video frame or picture may be divided into one or more regions, which may be referred to as video blocks. As used herein, the term "video block" may generally refer to a region of a picture, a sub-partition thereof, and/or a corresponding structure that may be encoded (e.g., according to a prediction technique). Furthermore, the term "current video block" may refer to a region of a picture that is currently being encoded or decoded. A video block may be defined as an array of sample values. It should be noted that in some cases, pixel values may be described as sample values comprising the respective components of the video data, which may also be referred to as color components (e.g., luminance (Y) and chrominance (Cb and Cr) components or red, green, and blue components (RGB)). It should be noted that in some cases, the terms "pixel value" and "sample value" may be used interchangeably. Further, in some cases, a pixel or sample may be referred to as a pel. The video sampling format (which may also be referred to as the chroma format) may define the number of chroma samples included in a video block relative to the number of luma samples included in the video block. For example, for the 4:2:0 sampling format, the sampling rate of the luma component is twice the sampling rate of the chroma components in both the horizontal and vertical directions.
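As a small worked example of the sampling-format relationship noted above, the helper below (a hypothetical function, names assumed) derives chroma plane dimensions from the luma dimensions for a few common formats:

```python
def chroma_plane_size(luma_width, luma_height, chroma_format="4:2:0"):
    """Return (width, height) of each chroma plane for common sampling formats."""
    if chroma_format == "4:2:0":
        # Chroma subsampled by 2 both horizontally and vertically.
        return luma_width // 2, luma_height // 2
    if chroma_format == "4:2:2":
        # Chroma subsampled by 2 horizontally only.
        return luma_width // 2, luma_height
    if chroma_format == "4:4:4":
        return luma_width, luma_height
    raise ValueError("unsupported chroma format")

print(chroma_plane_size(1920, 1080, "4:2:0"))  # (960, 540)
```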
Digital video data comprising one or more video sequences is an example of multi-dimensional data. Fig. 1 is a conceptual diagram illustrating video data represented as multi-dimensional data. Referring to fig. 1, the video data includes respective groups of pictures for two layers. For example, each layer may be a view (e.g., left and right) or a temporal layer of the video. As shown in fig. 1, each layer includes three components of video data (e.g., RGB, BGR, YCbCr, etc.), and each component includes four pictures having width (W) × height (H) sample values (e.g., 1920×1080, 1280×720, etc.). Thus, in the example shown in fig. 1, there are 24 W×H arrays of sample values, and each array of sample values may be described as two-dimensional data. Further, the arrays may be grouped according to one or more other dimensions (e.g., channel, component, and/or temporal sequence of frames). For example, component 1 of the GOP of layer 1 may be described as a three-dimensional dataset (i.e., W×H×number of pictures), all components of the GOP of layer 1 may be described as a four-dimensional dataset (i.e., W×H×number of pictures×number of components), and all components of the GOP of layer 1 and the GOP of layer 2 may be described as a five-dimensional dataset (i.e., W×H×number of pictures×number of components×number of layers).
Multi-layer video coding enables a video presentation to be decoded/displayed as a presentation corresponding to a base layer of video data and decoded/displayed as one or more additional presentations corresponding to enhancement layers of video data. For example, the base layer may enable presentation of a video presentation having a base level of quality (e.g., a high-definition presentation and/or a 30 Hz frame rate), and an enhancement layer may enable presentation of a video presentation having an enhanced level of quality (e.g., an ultra-high-definition presentation and/or a 60 Hz frame rate). An enhancement layer may be encoded by referencing the base layer. That is, a picture in the enhancement layer may be encoded (e.g., using inter-layer prediction techniques), for example, by referencing one or more pictures in the base layer (including scaled versions thereof). It should be noted that layers may also be encoded independently of each other. In this case, there may be no inter-layer prediction between two layers. A sub-bitstream extraction process may be used to decode and display only a particular video layer. Sub-bitstream extraction may refer to a process by which a device receiving a compatible or compliant bitstream forms a new compatible or compliant bitstream by discarding and/or modifying data in the received bitstream.
A video encoder operating in accordance with a typical video coding standard may perform predictive coding on video blocks and sub-partitions thereof. For example, a picture may be partitioned into video blocks, which are the largest arrays of video data that may be predictively encoded, and a video block may be further partitioned into nodes. For example, in ITU-T H.265, Coding Tree Units (CTUs) are partitioned into Coding Units (CUs) according to a Quadtree (QT) partitioning structure. A node may be associated with a prediction unit data structure and a residual unit data structure having their roots at the node. The prediction unit data structure may include intra-prediction data (e.g., intra-prediction mode syntax elements) or inter-prediction data (e.g., motion data syntax elements) that may be used to generate reference and/or prediction sample values for the node. For intra-prediction coding, a defined intra-prediction mode may specify the location of reference samples within a picture. For inter-prediction coding, a reference picture may be determined, and a motion vector (MV) may identify samples in the reference picture that are used to generate a prediction for the current video block. For example, a current video block may be predicted using reference sample values located in one or more previously encoded pictures, and a motion vector may be used to indicate the position of the reference block relative to the current video block. The motion vector may describe, for example, a horizontal displacement component of the motion vector (i.e., MVx), a vertical displacement component of the motion vector (i.e., MVy), and a resolution of the motion vector (i.e., pixel precision). Previously decoded pictures may be organized into one or more reference picture lists and identified using reference picture index values. Further, in inter-prediction coding, uni-prediction refers to generating a prediction using sample values from a single reference picture, and bi-prediction refers to generating a prediction using corresponding sample values from two reference pictures. That is, in uni-prediction, a single reference picture is used to generate a prediction for the current video block, while in bi-prediction, a first reference picture and a second reference picture may be used to generate a prediction for the current video block. In bi-prediction, the corresponding sample values may be combined (e.g., added, rounded, and clipped, or averaged according to weights) to generate the prediction. Furthermore, typical video coding standards may support various motion vector prediction modes. Motion vector prediction enables the value of a motion vector for a current video block to be derived based on another motion vector. For example, a set of candidate blocks having associated motion information may be derived from spatially neighboring blocks of the current video block, and a motion vector for the current video block may be derived from a motion vector associated with one of the candidate blocks.
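The motion vector prediction described above can be pictured as selecting a predictor from a list of candidates derived from neighboring blocks and adding a signalled difference. The sketch below is a deliberately simplified assumption for illustration and does not reproduce the candidate derivation of any particular standard:

```python
def reconstruct_mv(candidates, candidate_index, mvd):
    """Derive a motion vector from a predictor candidate plus a signalled difference.

    candidates: list of (MVx, MVy) pairs taken from spatially neighboring blocks.
    candidate_index: index of the candidate selected as the predictor.
    mvd: signalled motion vector difference (dx, dy).
    """
    px, py = candidates[candidate_index]
    dx, dy = mvd
    return px + dx, py + dy

print(reconstruct_mv([(4, 0), (2, -1)], candidate_index=0, mvd=(1, 2)))  # (5, 2)
```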
As described above, intra-prediction data or inter-prediction data may be used to generate reference sample values for a current block of sample values. The difference between the sample values included in the current block and the associated reference samples may be referred to as residual data. The residual data may include a respective array of difference values corresponding to each component of the video data. The residual data may initially be calculated in the pixel domain, i.e., by subtracting sample amplitude values of a component of the video data. A transform, such as a Discrete Cosine Transform (DCT), a Discrete Sine Transform (DST), an integer transform, a wavelet transform, or a conceptually similar transform, may be applied to an array of sample differences to generate transform coefficients. It should be noted that in some cases, a core transform and a subsequent secondary transform may be applied to generate transform coefficients. A quantization process may be performed directly on transform coefficients or residual sample values (e.g., in the case of palette coding quantization). Quantization approximates the transform coefficients (or residual sample values) by limiting their amplitudes to a set of specified values. Quantization essentially scales the transform coefficients in order to change the amount of data required to represent a set of transform coefficients. Quantization may include dividing the transform coefficients (or values resulting from adding an offset value to the transform coefficients) by a quantization scaling factor and applying any associated rounding function (e.g., rounding to the nearest integer). Quantized transform coefficients may be referred to as coefficient level values. Inverse quantization (or "dequantization") may include multiplying coefficient level values by the quantization scaling factor, as well as any reciprocal rounding and/or offset addition operations. It should be noted that, as used herein, the term "quantization process" may refer in some examples to generating level values (or similar values) and in some examples to recovering transform coefficients (or similar values). That is, the quantization process may refer to quantization in some cases, and may refer to inverse quantization (also referred to as dequantization) in other cases. Further, it should be noted that while in some of the examples the quantization process is described with respect to arithmetic operations related to decimal notation, such description is for illustrative purposes and should not be construed as limiting. For example, the techniques described herein may be implemented in a device using binary operations or the like. For example, the multiplication and division operations described herein may be implemented using bit-shift operations or the like.
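A minimal sketch of the quantization and inverse quantization (dequantization) operations described above, assuming a single scalar quantization scaling factor and rounding to the nearest integer; offsets and the binary/shift-based implementations mentioned in the text are omitted:

```python
import numpy as np

def quantize(coeffs, scale):
    """Approximate transform coefficients as integer coefficient level values."""
    return np.round(coeffs / scale).astype(np.int64)

def dequantize(levels, scale):
    """Recover approximate transform coefficients from coefficient level values."""
    return levels.astype(np.float64) * scale

coeffs = np.array([52.0, -7.3, 3.9, 0.4])
levels = quantize(coeffs, scale=4.0)     # [13, -2,  1,  0]
recon = dequantize(levels, scale=4.0)    # [52., -8., 4., 0.]  (approximation is lossy)
print(levels, recon)
```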
Quantized transform coefficients and syntax elements (e.g., syntax elements indicating a prediction for a video block) may be entropy encoded according to an entropy encoding technique. An entropy encoding process includes encoding syntax element values using a lossless data compression algorithm. Examples of entropy coding techniques include Content Adaptive Variable Length Coding (CAVLC), Context Adaptive Binary Arithmetic Coding (CABAC), Probability Interval Partitioning Entropy coding (PIPE), and the like. The entropy encoded quantized transform coefficients and corresponding entropy encoded syntax elements may form a compatible bitstream that may be used to render video data at a video decoder. An entropy encoding process, such as CABAC as implemented in ITU-T H.265, may include performing binarization on syntax elements. Binarization refers to the process of converting the value of a syntax element into a sequence of one or more bits. These bits may be referred to as "bins". Binarization may include one or a combination of the following coding techniques: fixed-length coding, unary coding, truncated Rice coding, Golomb coding, k-th order exponential Golomb coding, and Golomb-Rice coding. For example, binarization may include representing the integer value 5 of a syntax element as 00000101 using an 8-bit fixed-length binarization technique, or representing the integer value 5 as 11110 using a unary coding binarization technique. As used herein, the terms fixed-length coding, unary coding, truncated Rice coding, Golomb coding, k-th order exponential Golomb coding, and Golomb-Rice coding may each refer to a general implementation of these techniques and/or a more specific implementation of these coding techniques. For example, a Golomb-Rice coding implementation may be specifically defined in accordance with a video coding standard. In the example of CABAC, for a particular bin, a context provides a most probable state (MPS) value for the bin (i.e., the MPS of the bin is one of 0 or 1) and a probability value of the bin being the MPS or the least probable state (LPS). For example, the context may indicate that the MPS of a bin is 0 and the probability of the bin being 1 is 0.3. It should be noted that a context may be determined based on the values of previously encoded bins, including bins in the current syntax element and bins in previously encoded syntax elements.
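The fixed-length and unary binarization examples given above (the value 5 represented as 00000101 and as 11110) can be reproduced with the small helpers below; the unary helper follows the convention of the example in the text, although definitions of unary coding vary between references:

```python
def fixed_length(value, num_bits):
    """Fixed-length binarization, most significant bit first."""
    return format(value, "0{}b".format(num_bits))

def unary(value):
    """Unary binarization matching the example above: 5 -> '11110'.

    Here the value is represented by (value - 1) ones followed by a
    terminating zero; other references define unary coding with 'value'
    ones before the zero.
    """
    return "1" * (value - 1) + "0"

print(fixed_length(5, 8))  # 00000101
print(unary(5))            # 11110
```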
Fig. 2A to 2B are conceptual diagrams illustrating an example of encoding a block of video data. As shown in fig. 2A, a current block of video data (e.g., a region of a picture corresponding to a video component) is encoded by subtracting a set of prediction values from the current block of video data to generate a residual, performing a transform on the residual, and quantizing the transform coefficients to generate level values. As shown in fig. 2B, a block of current video data is decoded by performing inverse quantization on the level values, performing an inverse transform, and adding a set of predictors to the resulting residual. It should be noted that in the examples of fig. 2A-2B, the sample values of the reconstructed block are different from the sample values of the current video block being encoded. Specifically, fig. 2B shows a reconstruction error, which is the difference between the current block and the reconstructed block. In this way, the encoding may be considered lossy. However, for an observer reconstructing the video, the difference in sample values may be considered to be hardly noticeable. That is, it can be said that the reconstructed video is suitable for human consumption. However, it should be noted that in some cases, encoding video data block by block may lead to artifacts (e.g., so-called block artifacts, banding artifacts, etc.). For example, block artifacts may result in encoded block boundaries of reconstructed video data being visually perceived by a user. In this way, reconstructed sample values may be modified to minimize reconstruction errors and/or to minimize perceptible artifacts introduced by the video encoding process. Such modifications may be generally referred to as filtering. It should be noted that the filtering may occur as part of a filtering process in a loop or a filtering process after a loop. For the in-loop filtering process, the resulting sample values of the filtering process may be used for further reference, and for the post-loop filtering process, the resulting sample values of the filtering process are output only as part of the decoding process (e.g., not used for subsequent encoding).
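The lossy encode/decode round trip of Figs. 2A and 2B can be sketched as follows, assuming a 4x4 block, a flat prediction, an orthonormal DCT, and a single quantization scale. This is an illustration of the general idea only, not the coding process of any particular standard:

```python
import numpy as np
from scipy.fft import dctn, idctn

def encode_block(current, prediction, qscale):
    """Subtract the prediction, transform the residual, quantize to level values."""
    residual = current.astype(np.float64) - prediction
    coeffs = dctn(residual, norm="ortho")
    return np.round(coeffs / qscale).astype(np.int64)

def decode_block(levels, prediction, qscale):
    """Inverse quantize, inverse transform, and add the prediction back."""
    coeffs = levels * float(qscale)
    residual = idctn(coeffs, norm="ortho")
    return np.clip(np.round(prediction + residual), 0, 255).astype(np.uint8)

current = np.array([[107, 108, 110, 112],
                    [109, 110, 112, 113],
                    [112, 113, 114, 116],
                    [114, 115, 117, 118]], dtype=np.uint8)
prediction = np.full((4, 4), 112.0)

levels = encode_block(current, prediction, qscale=8.0)
recon = decode_block(levels, prediction, qscale=8.0)
print(recon - current.astype(np.int64))  # nonzero entries: the reconstruction error
```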
Typical video coding standards may utilize so-called deblocking (or de-blocking), which refers to a process of smoothing the boundaries of neighboring reconstructed video blocks (i.e., making the boundaries less perceptible to a viewer) and is part of an in-loop filtering process. In addition to applying a deblocking filter as part of the in-loop filtering process, typical video coding standards may also utilize Sample Adaptive Offset (SAO), a process that modifies deblocked sample values in a region by conditionally adding offset values. Furthermore, typical video coding standards may utilize one or more additional filtering techniques. For example, in VVC, a so-called Adaptive Loop Filter (ALF) may be applied.
As described above, for encoding purposes, each video frame or picture may be divided into one or more regions, which may be referred to as video blocks. It should be noted that in some cases, other overlapping and/or independent regions may be defined. For example, according to a typical video coding standard, each video picture may be divided to include one or more slices, and further divided to include one or more tiles. With respect to VVC, a slice needs to be made up of an integer number of complete tiles or an integer number of consecutive complete CTU rows within a tile, rather than just an integer number of CTUs. Thus, in a VVC, a picture may include a single tile, where the single tile is contained within a single slice, or a picture may include multiple tiles, where the multiple tiles (or rows of CTUs thereof) may be contained within one or more slices. Furthermore, it should be noted that VVC specifies that a picture may be divided into sub-pictures, wherein a sub-picture is a rectangular CTU region within the picture. The upper left CTU of a sub-picture may be located at any CTU position within the picture, where the sub-picture is constrained to include one or more slices. Thus, unlike tiles, sub-pictures need not be limited to specific row and column positions. It should be noted that the sub-pictures may be used to encapsulate regions of interest within the picture, and that the sub-bitstream extraction process may be used to decode and display only specific regions of interest. That is, the bitstream of encoded video data may include a Network Abstraction Layer (NAL) unit sequence, where NAL units encapsulate the encoded video data (i.e., video data corresponding to a picture slice), or NAL units encapsulate metadata (e.g., parameter sets) for decoding the video data, and the sub-bitstream extraction process forms a new bitstream by removing one or more NAL units from the bitstream.
Fig. 3 is a conceptual diagram illustrating an example of pictures within a group of pictures divided according to tiles, slices, and sub-pictures, with the corresponding encoded video data encapsulated in NAL units. It should be noted that the techniques described herein may be applicable to tiles, slices, sub-pictures, sub-partitions thereof, and/or equivalent structures thereof. That is, the techniques described herein are generally applicable regardless of how a picture is divided into regions. In the example shown in fig. 3, Pic3 is shown as including 16 tiles (i.e., tile 0 through tile 15) and three slices (i.e., slice 0 through slice 2). In the example shown in fig. 3, slice 0 includes four tiles (i.e., tile 0 through tile 3), slice 1 includes eight tiles (i.e., tile 4 through tile 11), and slice 2 includes four tiles (i.e., tile 12 through tile 15). Further, as shown in the example of fig. 3, Pic3 includes two sub-pictures (i.e., sub-picture 0 and sub-picture 1), where sub-picture 0 includes slice 0 and slice 1, and sub-picture 1 includes slice 2. As described above, sub-pictures may be used to encapsulate a region of interest within a picture, and a sub-bitstream extraction process may be used to selectively decode (and display) the region of interest. For example, referring to fig. 3, sub-picture 0 may correspond to the action portion of a sporting event presentation (e.g., a view of the venue), and sub-picture 1 may correspond to a scrolling banner displayed during the sporting event presentation. By organizing the picture into sub-pictures in this way, a viewer may be able to disable the display of the scrolling banner. That is, through the sub-bitstream extraction process, the slice 2 NAL unit may be removed from the bitstream (and thus not decoded and/or displayed), and the slice 0 NAL unit and slice 1 NAL unit may be decoded and displayed.
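Sub-bitstream extraction as described above amounts to forming a new bitstream by dropping the NAL units that are not needed for the region of interest. The toy sketch below uses an assumed, highly simplified model of NAL units (plain dictionaries) purely for illustration:

```python
def extract_sub_bitstream(nal_units, keep_slices):
    """Keep parameter-set NAL units and only the slice NAL units listed in keep_slices."""
    return [nal for nal in nal_units
            if nal["type"] != "slice" or nal["slice_id"] in keep_slices]

bitstream = [
    {"type": "params"},                 # metadata (e.g. parameter sets)
    {"type": "slice", "slice_id": 0},   # sub-picture 0
    {"type": "slice", "slice_id": 1},   # sub-picture 0
    {"type": "slice", "slice_id": 2},   # sub-picture 1 (scrolling banner)
]

# Keep only sub-picture 0 (slice 0 and slice 1); the banner slice is removed.
print(extract_sub_bitstream(bitstream, keep_slices={0, 1}))
```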
As described above, for inter-prediction coding, reference samples in previously coded pictures are used to code video blocks in a current picture. Previously coded pictures that are available for use as references when coding the current picture are referred to as reference pictures. It should be noted that the decoding order does not necessarily correspond to the picture output order, i.e., the temporal order of the pictures in the video sequence. According to a typical video coding standard, when a picture is decoded, it may be stored in a Decoded Picture Buffer (DPB) (which may be referred to as a frame buffer, a reference picture buffer, etc.). For example, referring to fig. 3, Pic2 is shown as referencing Pic1. Similarly, Pic3 is shown as referencing Pic0. With respect to fig. 3, assuming that the picture numbers correspond to the decoding order, the DPB would be populated as follows: after decoding Pic0, the DPB would include {Pic0}; at the onset of decoding Pic1, the DPB would include {Pic0}; after decoding Pic1, the DPB would include {Pic0, Pic1}; at the onset of decoding Pic2, the DPB would include {Pic0, Pic1}. Pic2 would then be decoded with reference to Pic1, and after Pic2 is decoded, the DPB would include {Pic0, Pic1, Pic2}. At the onset of decoding Pic3, pictures Pic1 and Pic2 would be marked for removal from the DPB, as they are not needed for decoding Pic3 (or any subsequent pictures, not shown), and assuming Pic1 and Pic2 have been output, the DPB would be updated to include {Pic0}. Pic3 would then be decoded with reference to Pic0. The process of marking pictures for removal from the DPB may be referred to as Reference Picture Set (RPS) management.
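The DPB states walked through above can be traced with a few lines of Python (picture names only; no actual decoding or output timing is modeled):

```python
dpb = []

dpb.append("Pic0")                 # after decoding Pic0:            {Pic0}
dpb.append("Pic1")                 # after decoding Pic1:            {Pic0, Pic1}
dpb.append("Pic2")                 # Pic2 decoded referencing Pic1:  {Pic0, Pic1, Pic2}

# At the onset of decoding Pic3, Pic1 and Pic2 are no longer needed as
# references (Pic3 references Pic0) and are assumed to have been output,
# so they are marked for removal and dropped from the DPB.
for pic in ("Pic1", "Pic2"):
    dpb.remove(pic)

print(dpb)                         # ['Pic0']; Pic3 is then decoded referencing Pic0
```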
Fig. 4 is a block diagram illustrating an example of a system that may be configured to encode (i.e., encode and/or decode) a multi-dimensional data set (MDDS) in accordance with one or more techniques of the present disclosure. It should be noted that in some cases, the MDDS may be referred to as a tensor. System 100 represents an example of a system that may encapsulate encoded data in accordance with one or more techniques of the present disclosure. As shown in fig. 4, system 100 includes a source device 102, a communication medium 110, and a target device 120. In the example shown in fig. 4, source device 102 may include any device configured to encode multi-dimensional data and transmit the encoded data to communication medium 110. Target device 120 may include any device configured to receive encoded data via communication medium 110 and decode the encoded data. Source device 102 and/or target device 120 may include computing devices equipped for wired and/or wireless communication, and may include, for example, set-top boxes, digital video recorders, televisions, computers, gaming consoles, medical imaging devices, and mobile devices (including, for example, smartphones).
Communication medium 110 may include any combination of wireless and wired communication media and/or storage devices. Communication medium 110 may include coaxial cable, fiber optic cable, twisted pair cable, wireless transmitters and receivers, routers, switches, repeaters, base stations, or any other device that may be used to facilitate communications between various devices and sites. Communication medium 110 may include one or more networks. For example, the communication medium 110 may include a network configured to allow access to the world wide web, such as the Internet. The network may operate in accordance with a combination of one or more telecommunications protocols. The telecommunications protocol may include proprietary aspects and/or may include standardized telecommunications protocols. Examples of standardized telecommunications protocols include the Digital Video Broadcasting (DVB) standard, the Advanced Television Systems Committee (ATSC) standard, the Integrated Services Digital Broadcasting (ISDB) standard, the Data Over Cable Service Interface Specification (DOCSIS) standard, the global system for mobile communications (GSM) standard, the Code Division Multiple Access (CDMA) standard, the 3 rd generation partnership project (3 GPP) standard, the European Telecommunications Standards Institute (ETSI) standard, the Internet Protocol (IP) standard, the Wireless Application Protocol (WAP) standard, and the Institute of Electrical and Electronics Engineers (IEEE) standard.
The storage device may include any type of device or storage medium capable of storing data. The storage medium may include a tangible or non-transitory computer readable medium. The computer readable medium may include an optical disk, flash memory, magnetic memory, or any other suitable digital storage medium. In some examples, the memory device or portions thereof may be described as non-volatile memory, and in other examples, portions of the memory device may be described as volatile memory. Examples of volatile memory can include Random Access Memory (RAM), dynamic Random Access Memory (DRAM), and Static Random Access Memory (SRAM). Examples of non-volatile memory may include magnetic hard disk, optical disk, floppy disk, flash memory, or forms of electrically programmable memory (EPROM) or Electrically Erasable and Programmable (EEPROM) memory. The storage devices may include memory cards (e.g., secure Digital (SD) memory cards), internal/external hard disk drives, and/or internal/external solid state drives. The data may be stored on the storage device according to a defined file format.
Referring again to fig. 4, source device 102 includes a data source 104, a data encoder 106, an encoded data encapsulator 107, and an interface 108. The data source 104 may include any device configured to capture and/or store multidimensional data. For example, the data source 104 may include a camera and a storage device operatively coupled thereto. The data encoder 106 may include any device configured to receive multidimensional data and generate a bitstream representing the data. A bitstream may refer to a general bitstream (i.e., binary values representing encoded data) or a compatible bitstream, wherein aspects of the compatible bitstream may be defined in accordance with a standard (e.g., a video coding standard). The encoded data encapsulator 107 can receive a bitstream and encapsulate the bitstream for storage and/or transmission purposes. For example, the encoded data encapsulator 107 can encapsulate the bitstream according to a file format. It should be noted that the encoded data encapsulator 107 need not necessarily be located in the same physical device as the data encoder 106. For example, the functions described as being performed by the data source 104, the data encoder 106, and/or the encoded data encapsulator 107 may be distributed among various devices in the computing system (e.g., at different server locations, etc.). Interface 108 may include any device configured to receive data generated by the encoded data encapsulator 107 and to transmit and/or store the data to a communication medium. Interface 108 may include a network interface card such as an Ethernet card, and may include an optical transceiver, a radio frequency transceiver, or any other type of device that may transmit and/or receive information. Further, interface 108 may include a computer system interface that may enable files to be stored on a storage device. For example, interface 108 may include a chipset supporting Peripheral Component Interconnect (PCI) and Peripheral Component Interconnect Express (PCIe) bus protocols, a proprietary bus protocol, a Universal Serial Bus (USB) protocol, I2C, or any other logical and physical structure that may be used to interconnect peer devices.
Referring again to fig. 4, target device 120 includes an interface 122, an encoded data decapsulator 123, a data decoder 124, and an output 126. Interface 122 may include any device configured to receive data from a communication medium. Interface 122 may include a network interface card such as an Ethernet card, and may include an optical transceiver, a radio frequency transceiver, or any other type of device that may receive and/or transmit information. In addition, interface 122 may include a computer system interface that allows for retrieving compatible video bitstreams from a storage device. For example, interface 122 may include a chipset supporting PCI and PCIe bus protocols, a proprietary bus protocol, a USB protocol, I2C, or any other logical and physical structure that may be used to interconnect peer devices. The encoded data decapsulator 123 may be configured to receive the encapsulation format and extract the bitstream from the encapsulation format. For example, in the case of video encoded according to a typical video encoding standard stored on a physical medium according to a defined file format, the encoded data decapsulator 123 may be configured to extract a compatible bitstream from the file. The data decoder 124 may include any device configured to receive a bitstream and/or acceptable variations thereof and render multi-dimensional data therefrom. The rendered multi-dimensional data may then be received by output 126. For example, in the case of video, output 126 may include a display device configured to display video data. Further, it should be noted that the data decoder 124 may be configured to output multi-dimensional data to various types of devices and/or subcomponents thereof. For example, the data decoder 124 may be configured to output data to any communication medium. Furthermore, as described above, the techniques described in this disclosure may be particularly useful for allowing object recognition tasks to be distributed across a communication network. Thus, in some examples, source device 102 may represent an acquisition device where data source 104 acquires video data and generates corresponding feature data, data encoder 106 compresses the feature data (e.g., according to one or more techniques described herein), and target device 120 is a device that performs analysis and inference on the reconstructed feature data. It should be noted that, for example, with respect to the examples described above, the data encoder 106 and the data decoder 124 may be configured to code multiple types of data. For example, in the case of video data, the data encoder 106 may receive source video and corresponding feature data and generate a compatible bitstream according to a video encoding standard, and generate a bitstream comprising compressed feature data (e.g., according to the techniques described herein). In this case, in one example, the target device 120 may be a headend-type device that reconstructs video (e.g., a high quality representation) and feature data from the received bitstream and encodes the reconstructed video (e.g., at output 126) for further distribution (e.g., to nodes in a media distribution system) based on the feature data.
As described above, the data encoder 106 may include any device configured to receive multi-dimensional data, and examples of multi-dimensional data include video data that may be encoded according to a typical video encoding standard. As described in further detail below, in some examples, the techniques described herein for encoding multidimensional data may be utilized in connection with techniques utilized in a typical video standard. Fig. 5 is a block diagram illustrating an example of a video encoder that may be configured to encode video data according to typical video encoding techniques. It should be noted that although the exemplary video encoder 200 is shown with different functional blocks, such illustration is intended for descriptive purposes and not to limit the video encoder 200 and/or its subcomponents to a particular hardware or software architecture. The functions of video encoder 200 may be implemented using any combination of hardware, firmware, and/or software implementations. The video encoder 200 may perform intra prediction encoding and inter prediction encoding of picture regions, and thus may be referred to as a hybrid video encoder. In the example shown in fig. 5, video encoder 200 receives a source video block. In some examples, a source video block may include picture regions that have been partitioned according to an encoding structure. For example, the source video data may include CTUs, sub-partitions thereof, and/or additional equivalent coding units. In some examples, video encoder 200 may be configured to perform additional subdivision of the source video block. It should be noted that the techniques described herein are generally applicable to video encoding, regardless of how the source video data is partitioned prior to and/or during encoding. In the example shown in fig. 5, the video encoder 200 includes a summer 202, a transform coefficient generator 204, a coefficient quantization unit 206, an inverse quantization and transform coefficient processing unit 208, a summer 210, an intra prediction processing unit 212, an inter prediction processing unit 214, a reference block buffer 216, a filter unit 218, a reference picture buffer 220, and an entropy encoding unit 222. As shown in fig. 5, the video encoder 200 receives source video blocks and outputs a bitstream.
In the example shown in fig. 5, video encoder 200 may generate residual data by subtracting a predicted video block from a source video block. Summer 202 represents a component configured to perform the subtraction operation. In one example, the subtraction of video blocks occurs in the pixel domain. The transform coefficient generator 204 applies a transform, such as a DCT or a conceptually similar transform, to the residual block or sub-partition thereof (e.g., four 8 x 8 transforms may be applied to the 16 x 16 array of residual values) to produce a set of transform coefficients. The transform coefficient generator 204 may be configured to perform any and all combinations of the transforms included in the series of discrete trigonometric transforms, including approximations thereof. The transform coefficient generator 204 may output the transform coefficients to the coefficient quantization unit 206. The coefficient quantization unit 206 may be configured to perform quantization on the transform coefficients. The quantization process may reduce the bit depth associated with some or all of the coefficients. The degree of quantization may change the rate distortion (i.e., the relationship of bit rate to video quality) of the encoded video data. In a typical video coding standard, the degree of quantization may be modified by adjusting a Quantization Parameter (QP), and the quantization parameter may be determined based on signaled values and/or predicted values. The quantization data may include any data used to determine a QP for quantizing a particular set of transform coefficients. As shown in fig. 5, the quantized transform coefficients (which may be referred to as level values) are output to an inverse quantization and transform coefficient processing unit 208. The inverse quantization and transform coefficient processing unit 208 may be configured to apply inverse quantization and inverse transform to generate reconstructed residual data. As shown in fig. 5, at summer 210, reconstructed residual data may be added to the predicted video block. The reconstructed video block may be stored to a reference block buffer 216 and used as a reference for predicting a subsequent block (e.g., using intra prediction).
Referring again to fig. 5, the intra-prediction processing unit 212 may be configured to select an intra-prediction mode for the video block to be encoded. The intra prediction processing unit 212 may be configured to evaluate the reconstructed block stored to the reference block buffer 216 and determine an intra prediction mode for encoding the current block. In a typical video coding standard, possible intra prediction modes may include a planar prediction mode, a DC prediction mode, and an angular prediction mode. As shown in fig. 5, the intra-prediction processing unit 212 outputs intra-prediction data (e.g., syntax elements) to the entropy encoding unit 222.
Referring again to fig. 5, the inter prediction processing unit 214 may be configured to perform inter prediction encoding for the current video block. The inter prediction processing unit 214 may be configured to receive a source video block, select a reference picture from among pictures stored to the reference buffer 220, and calculate a motion vector of the video block. The motion vector may indicate a displacement of a prediction unit of a video block within the current video picture relative to a prediction block within the reference picture. Inter prediction coding may use one or more reference pictures. The inter prediction processing unit 214 may be configured to select a prediction block by calculating pixel differences determined by, for example, sum of Absolute Differences (SAD), sum of Squared Differences (SSD), or other difference metrics. As described above, a motion vector can be determined and specified from motion vector prediction. As described above, the inter prediction processing unit 214 may be configured to perform motion vector prediction. The inter prediction processing unit 214 may be configured to generate a prediction block using the motion prediction data. For example, the inter-prediction processing unit 214 may locate a predicted video block within the reference picture buffer 220. It should be noted that the inter prediction processing unit 214 may be further configured to apply one or more interpolation filters to the reconstructed residual block to calculate sub-integer pixel values for motion estimation. The inter prediction processing unit 214 may output motion prediction data of the calculated motion vector to the entropy encoding unit 222.
Referring again to fig. 5, filter unit 218 receives the reconstructed video block from reference block buffer 216 and outputs the filtered picture to reference picture buffer 220. That is, in the example of fig. 5, the filter unit 218 is part of an in-loop filtering process. The filter unit 218 may be configured to perform one or more of deblocking, SAO filtering, and/or ALF filtering, e.g., according to typical video coding standards. The entropy encoding unit 222 receives data representing level values (i.e., quantized transform coefficients) and prediction syntax data (i.e., intra-prediction data and motion prediction data). It should be noted that the data representing the level values may include, for example, flags, absolute values, sign values, delta values, and the like (e.g., significant coefficient flags as provided in typical video coding standards). The entropy encoding unit 222 may be configured to perform entropy encoding according to one or more of the techniques described herein and output a bitstream (e.g., a compatible bitstream) according to a typical video encoding standard.
Referring again to fig. 4, as described above, the data decoder 124 may comprise any device configured to receive encoded multidimensional data, and examples of encoded multidimensional data include video data that may be encoded according to a typical video encoding standard. Fig. 6 is a block diagram illustrating an example of a video decoder that may be configured to decode video data according to typical video decoding techniques, which may be utilized with one or more techniques of the present disclosure. In the example shown in fig. 6, the video decoder 300 includes an entropy decoding unit 302, an inverse quantization unit 304, an inverse transform coefficient processing unit 306, an intra prediction processing unit 308, an inter prediction processing unit 310, a summer 312, a post-filter unit 314, and a reference buffer 316. It should be noted that although the exemplary video decoder 300 is shown with different functional blocks, such illustration is intended for descriptive purposes and not to limit the video decoder 300 and/or its subcomponents to a particular hardware or software architecture. The functions of video decoder 300 may be implemented using any combination of hardware, firmware, and/or software implementations.
As shown in fig. 6, the entropy decoding unit 302 receives an entropy-encoded bitstream. The entropy decoding unit 302 may be configured to decode syntax elements and level values from the bitstream according to a process that is reciprocal to the entropy encoding process. Entropy decoding unit 302 may be configured to perform entropy decoding according to any of the entropy encoding techniques described above and/or determine values of syntax elements in the encoded bitstream in a manner consistent with the video encoding standard. As shown in fig. 6, the entropy decoding unit 302 may determine a level value, quantized data, and prediction data from a bitstream. In the example shown in fig. 6, the inverse quantization unit 304 receives quantized data and level values and outputs transform coefficients to the inverse transform coefficient processing unit 306. The inverse transform coefficient processing unit 306 outputs reconstructed residual data. Thus, the inverse quantization unit 304 and the inverse transform coefficient processing unit 306 operate in a similar manner to the inverse quantization and transform coefficient processing unit 208 described above.
Referring again to fig. 6, the reconstructed residual data is provided to summer 312. Summer 312 may add the reconstructed residual data to the prediction video block and generate reconstructed video data. The prediction video block may be determined according to a prediction video technique (i.e., intra-prediction and inter-prediction). The intra prediction processing unit 308 may be configured to receive the intra prediction syntax element and retrieve the predicted video block from the reference buffer 316. The reference buffer 316 may include a memory device configured to store one or more pictures (and corresponding regions) of video data. The intra prediction syntax element may identify an intra prediction mode, such as the intra prediction mode described above. The inter-prediction processing unit 310 may receive the inter-prediction syntax elements and generate motion vectors to identify prediction blocks in one or more reference frames stored in the reference buffer 316. The inter prediction processing unit 310 may generate a motion compensation block, possibly performing interpolation based on interpolation filters. An identifier of an interpolation filter for motion estimation with sub-pixel precision may be included in the syntax element. The inter prediction processing unit 310 may calculate interpolation values for sub-integer pixels of the reference block using interpolation filters. Post-filter unit 314 may be configured to perform filtering on reconstructed video data. For example, the post-filter unit 314 may be configured to perform deblocking based on parameters specified in the bitstream. Further, it should be noted that in some examples, post-filter unit 314 may be configured to perform dedicated arbitrary filtering (e.g., visual enhancement such as mosquito noise cancellation). As shown in fig. 6, the video decoder 300 may output the reconstructed video, for example, to a display.
As described above with respect to fig. 2A through 2B, a block of video data (i.e., a data array included within an MDDS) may be encoded by generating a residual, performing a transform on the residual, and quantizing the transform coefficients to generate level values, and decoded by performing inverse quantization on the level values, performing an inverse transform, and adding the resulting residual to a prediction. The data arrays included within the MDDS may also be encoded using so-called auto-encoding techniques. In general, automatic encoding may refer to learning techniques that impose bottlenecks into a network to force the generation of compressed representations of inputs. That is, an automatic encoder may be referred to as a nonlinear Principal Component Analysis (PCA), which attempts to represent input data in a lower dimensional space. Examples of automatic encoders include convolutional automatic encoders that use a single convolution operation to compress an input. Convolutional automatic encoders can be used in so-called deep Convolutional Neural Networks (CNNs).
Fig. 7A shows an example of automatic encoding using two-dimensional discrete convolution. In the example shown in fig. 7A, a discrete convolution is performed on a current block of video data (i.e., the block of video data shown in fig. 2A) to generate an output feature map (OFM), where the discrete convolution is defined in terms of a padding operation, a kernel, and a stride function. It should be noted that while fig. 7A illustrates discrete convolution of a two-dimensional input using a two-dimensional kernel, discrete convolution may be performed on a higher-dimensional data set. For example, a three-dimensional kernel (e.g., a cubic kernel) may be used to perform discrete convolution of a three-dimensional input. In the case of video data, such convolutions may downsample the video in both the spatial and temporal dimensions. Further, it should be noted that while the example shown in FIG. 7A illustrates a square kernel convolving over a square input, in other examples the kernel and/or input may be a non-square rectangle. In the example shown in fig. 7A, the 4 x 4 array of video data is enlarged to a 6 x 6 array by copying the nearest value at the boundary. This is an example of a padding operation. Generally, padding operations increase the size of an input data set by inserting values. Typically, zeros may be inserted into the array to achieve a particular array size prior to convolution. It should be noted that the padding functionality may include one or more of inserting zeros (or another default value) at particular locations, symmetric expansion at various locations of the data set, replication expansion, and cyclic expansion. For example, for symmetric expansion, input array values outside the array boundary may be calculated by specularly reflecting the array across the array boundary along the filled dimension. For replication expansion, it may be assumed that the input array value outside the array boundary is equal to the nearest array boundary value along the filled dimension. For cyclic expansion, input array values outside the array boundaries may be calculated by implicitly assuming that the input array is periodic along the filled dimension.
Referring again to fig. 7A, an output feature map is generated by convolving the 3 x 3 kernel over the 6 x 6 array according to a stride function. That is, the stride shown in FIG. 7A gives the upper left position of the kernel at the corresponding position in the 6 x 6 array. That is, for example, at stride position 1, the upper left of the kernel is aligned with the upper left of the 6 x 6 array. At each discrete stride position, the kernel is used to generate a weighted sum. The generated weighted sum values are then used to populate corresponding positions in the output feature map. For example, at position 1 of the stride function, the output of 107 (107=1/16×107+1/8×107+1/16×103+1/8×107+1/4×107+1/8×103+1/16×111+1/8×111+1/16×108) corresponds to the upper left position of the output feature map. It should be noted that in the example shown in fig. 7A, the stride function corresponds to a so-called unit stride, i.e., the kernel slides across each position of the input. In other examples, a non-unity or arbitrary stride may be used. For example, the stride function may include only positions 1, 4, 13, and 16 in the stride shown in FIG. 7A to generate a 2 x 2 output feature map. In this way, in the case of two-dimensional discrete convolution, for input data having a width w i and a height h i, an arbitrary padding function, an arbitrary stride function, and a kernel having a width w k and a height h k may be used to create an output feature map having a desired width w o and height h o. It should be noted that, similar to the kernel, a stride function may be defined for multiple dimensions (e.g., a three-dimensional stride function may be defined). It should be noted that in some cases, for a particular kernel size and stride function, the kernel may be located outside the support area. In some cases, the output at such locations is invalid. In some cases, a corresponding value is derived for locations outside the support area, e.g., according to a padding operation.
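By way of illustration only, the padding, kernel, and stride operations described above may be sketched in PyTorch as follows; the 4 x 4 sample values are placeholders (not the actual values of fig. 2A), while the kernel weights follow the weighted-sum example above:

import torch
import torch.nn.functional as F

# Placeholder 4 x 4 block of sample values (not the actual block of fig. 2A)
x = torch.tensor([[107., 103., 101.,  98.],
                  [111., 108., 104., 100.],
                  [117., 116., 112., 106.],
                  [121., 119., 117., 108.]]).reshape(1, 1, 4, 4)

# Padding operation: enlarge the 4 x 4 array to 6 x 6 by replicating the nearest boundary values
x_padded = F.pad(x, (1, 1, 1, 1), mode="replicate")        # 1 x 1 x 6 x 6

# 3 x 3 kernel of weights used for the weighted sums
kernel = torch.tensor([[1/16, 1/8, 1/16],
                       [1/8,  1/4, 1/8 ],
                       [1/16, 1/8, 1/16]]).reshape(1, 1, 3, 3)

# Unit stride: the kernel visits every position, giving a 4 x 4 output feature map
ofm_unit_stride = F.conv2d(x_padded, kernel, stride=1)     # 1 x 1 x 4 x 4

# A stride of 3 visits only positions 1, 4, 13, and 16, giving a 2 x 2 output feature map
ofm_stride_3 = F.conv2d(x_padded, kernel, stride=3)        # 1 x 1 x 2 x 2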
It should be noted that in the example shown in fig. 7A, the 4 x 4 array of video data is shown as being downsampled to the 2 x 2 output feature map by selecting the underlined values of the 4 x 4 output feature map. The 4 x 4 output feature map is shown for illustration purposes, i.e., to show a typical unit stride function. Typically, no calculation would be made for the discarded values. That is, as described above, a 2 x 2 output feature map would be derived by performing weighted sum operations with the kernel at positions 1, 4, 13, and 16. However, it should be noted that in other examples, so-called pooling operations (such as max pooling) may be performed on the input (before performing the convolution) or on the output feature map to downsample the data set. For example, in the example shown in fig. 7A, a 2×2 output feature map may be generated by taking the local maximum (i.e., 108, 104, 117, and 108) of each 2×2 region in the 4×4 output feature map. That is, there are many ways to perform automatic encoding, including performing convolution on input data to represent the data as a downsampled output feature map.
Finally, as indicated in fig. 7A, the output feature map may be quantized in a manner similar to that described above with respect to the transform coefficients (e.g., the amplitudes may be limited to a set of specified values). In the example shown in fig. 7A, the amplitudes of the 2×2 output feature map are quantized by dividing by 2. In this case, quantization can be described as uniform quantization defined by the following equation:
QOFM(x,y)=round(OFM(x,y)/Stepsize)
where
QOFM (x, y) is the quantized value corresponding to position (x, y);
OFM (x, y) is a value corresponding to the position (x, y);
Stepsize is a scalar; and
Round (x) rounds x to the nearest integer.
Thus, for the example shown in fig. 7A, Stepsize = 2 and x = 0…1, y = 0…1. In this example, at the auto decoder, the inverse quantization used to derive the restored output feature map ROFM (x, y) may be defined as follows:
ROFM(x,y)=QOFM(x,y)*Stepsize
It should be noted that in one example, a respective Stepsize, i.e., stepsize (x,y), may be provided for each location. It should be noted that this may be referred to as uniform quantization, since quantization (i.e., scaling) is the same across the possible amplitude ranges at locations in the OFM (x, y).
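A minimal sketch of the uniform quantization and inverse quantization described above, written in PyTorch (the 2 x 2 values below are placeholders), is the following:

import torch

def quantize_uniform(ofm: torch.Tensor, step_size: float) -> torch.Tensor:
    # QOFM(x, y) = Round(OFM(x, y) / Stepsize)
    return torch.round(ofm / step_size)

def inverse_quantize_uniform(qofm: torch.Tensor, step_size: float) -> torch.Tensor:
    # ROFM(x, y) = QOFM(x, y) * Stepsize
    return qofm * step_size

ofm = torch.tensor([[107., 103.],
                    [117., 108.]])                      # placeholder output feature map values
qofm = quantize_uniform(ofm, step_size=2.0)             # e.g., 107 -> 54
rofm = inverse_quantize_uniform(qofm, step_size=2.0)    # e.g., 54 -> 108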
In one example, the quantization may be non-uniform. That is, the quantization may differ across the range of possible amplitudes. For example, the respective Stepsize may vary across the range of values. That is, in one example, the non-uniform quantization function may be defined as follows:
QOFM(x,y)=round(OFM(x,y)/Stepsizei)
where
Stepsizei = scalar0 : if OFM(x,y) < value0
            scalar1 : if value0 ≤ OFM(x,y) ≤ value1
            …
            scalarN-1 : if valueN-2 ≤ OFM(x,y) ≤ valueN-1
            scalarN : if OFM(x,y) > valueN-1
Further, it should be noted that, as described above, quantization may include mapping the amplitudes in a range to specific values. That is, in one example, the non-uniform quantization function may be defined as:
where valuei+1 > valuei, and for i ≠ j, valuei+1 − valuei is not necessarily equal to valuej+1 − valuej.
The inverse of the non-uniform quantization process may be defined as:
The inverse process corresponds to a look-up table and may be signaled in a bitstream.
Finally, it should be noted that a combination of the above quantization techniques may be utilized, and in some cases, a particular quantization function may be specified and signaled. For example, a quantization table may be signaled in a manner similar to the signaling of quantization tables in VVC.
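As an illustrative sketch only, non-uniform quantization with an amplitude-dependent step size, and an inverse process realized as a look-up table, might be implemented as follows (the boundary values, step sizes, and reconstruction levels are assumptions, not values from this disclosure):

import torch

# Amplitude boundaries (value0 ... valueN-1) and per-interval step sizes (scalar0 ... scalarN)
boundaries = torch.tensor([64., 128., 192.])
step_sizes = torch.tensor([1., 2., 4., 8.])

def quantize_nonuniform(ofm: torch.Tensor) -> torch.Tensor:
    interval = torch.bucketize(ofm, boundaries)      # which amplitude range OFM(x, y) falls in
    return torch.round(ofm / step_sizes[interval])

# Quantization that maps amplitude ranges to indices; the inverse process is then a
# look-up table of reconstruction values, which could be signaled in a bitstream
recon_lut = torch.tensor([32., 96., 160., 224.])

def quantize_to_index(ofm: torch.Tensor) -> torch.Tensor:
    return torch.bucketize(ofm, boundaries)

def inverse_quantize_lut(indices: torch.Tensor) -> torch.Tensor:
    return recon_lut[indices]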
Referring again to fig. 7A, although not shown, entropy encoding may be performed on the quantized output feature map data as described in further detail below. Thus, as shown in fig. 7A, the quantized output feature map is a compressed representation of the current video block.
As shown in fig. 7B, the current block of video data is decoded by performing inverse quantization on the quantized output feature map, performing a padding operation on the restored output feature map, and convolving the padded output feature map with a kernel. Similar to fig. 2B, fig. 7B shows a reconstruction error, which is the difference between the current block and the restored block. It should be noted that the padding operation performed in fig. 7B is different from the padding operation performed in fig. 7A, and the kernel utilized in fig. 7B is different from the kernel utilized in fig. 7A. That is, in the example shown in fig. 7B, zero values are interleaved with the restored output feature map and the 3 x 3 kernel is convolved on a 6 x 6 input using a unit stride to produce a restored MDDS block. It should be noted that such convolution operations performed during auto-decoding may be referred to as convolution transpose (convT). It should be noted that in some cases, the convolution transpose may define a particular relationship between kernels at each of the auto-encoder and auto-decoder, and in other cases, the term "convolution transpose" may be more general. It should be noted that there may be several ways in which automatic decoding may be implemented. That is, FIG. 7B provides an illustrative case of convolution transpose, and there are a number of ways in which convolution transpose (and auto-decoding) may be performed and/or implemented. The techniques described herein are generally applicable to automatic decoding. For example, with respect to the example shown in fig. 7B, in a simple case, each of the four values shown in the restored output feature map may be replicated to create a 4 x 4 array (i.e., an array whose top left four values are 108, whose top right four values are 102, whose bottom left four values are 116, and whose bottom right four values are 108). In addition, other padding operations, kernels, and/or stride functions may be utilized. Essentially, at an auto-decoder, the auto-decoding process can be selected in a manner that achieves a desired objective (e.g., reduces reconstruction errors). It should be noted that other desired goals may include reducing visual artifacts, increasing the probability of detecting an object, and so forth.
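For illustration, one possible auto-decoding step for the restored 2 x 2 output feature map may be sketched in PyTorch as follows; the 2 x 2 all-ones kernel is an assumption that simply replicates each restored value (the simple case described above), not the kernel actually used in fig. 7B:

import torch
import torch.nn.functional as F

# Restored 2 x 2 output feature map from the example of fig. 7B
rofm = torch.tensor([[108., 102.],
                     [116., 108.]]).reshape(1, 1, 2, 2)

# Convolution transpose with a 2 x 2 kernel and a stride of 2: each restored value
# is spread over a 2 x 2 region of the restored block (i.e., replication)
kernel = torch.ones(1, 1, 2, 2)
restored_block = F.conv_transpose2d(rofm, kernel, stride=2)    # 1 x 1 x 4 x 4

# The zero-interleaving description above corresponds to building an upsampled array
# with zeros between the restored samples, which may then be padded and convolved
upsampled = torch.zeros(1, 1, 4, 4)
upsampled[..., ::2, ::2] = rofm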
As described above, the techniques for encoding multi-dimensional data described herein may be utilized in connection with techniques utilized in typical video standards. As described above with respect to fig. 5, the degree of quantization applied during video encoding may alter the rate distortion of the encoded video data. In addition, typical video encoders select an intra prediction mode for intra prediction and reference frames and motion information for inter prediction. These choices also alter the rate distortion. That is, in general, video encoding includes selecting video encoding parameters in a manner that optimizes and/or provides desired rate distortion. In accordance with the techniques herein, in one example, automatic encoding may be used during video encoding in order to select video encoding parameters to achieve desired rate distortion. That is, for example, as described above, inferred data derived from the feature data (e.g., where the object is located within the image) may be used to optimize the encoding of the video data (e.g., adjust encoding parameters to improve relative image quality in areas where the object of interest is present).
Fig. 8 is an example of an encoding system that may encode a multi-dimensional dataset according to one or more techniques of the present disclosure. In the example shown in fig. 8, an automatic encoder unit 402 receives a multi-dimensional dataset, i.e., video data, and generates one or more output feature maps corresponding to the video data. That is, for example, an auto-encoder may perform a two-dimensional discrete convolution on a region within a video sequence as described above. It should be noted that in fig. 8, the encoding parameters shown as received by the automatic encoder unit 402 correspond to the selection of parameters for performing automatic encoding. That is, for example, in the case of two-dimensional discrete convolution, the selection of w i and h i, the selection of the fill function, the selection of the stride function, and the selection of the kernel. As shown in fig. 8, the encoder control unit 404 receives the output feature map and provides encoding parameters (e.g., QP, intra prediction mode, motion information, etc.) to the video encoder 200. The video encoder 200 receives video data and provides a bitstream based on encoding parameters according to a typical video encoding standard as described above. The video decoder 300 receives the bitstream and reconstructs the video data according to the typical video coding standard as described above. As shown in fig. 8, summer 406 subtracts the reconstructed video data from the source video data and generates a reconstruction error (i.e., in a manner similar to that described above with respect to fig. 2B, for example). As shown in fig. 8, the encoder control unit 404 receives the reconstruction error. It should be noted that although not explicitly shown in fig. 8, the encoder control unit 404 may also determine a bit rate corresponding to the bit stream. Accordingly, the encoder control unit 404 may correlate the output profile (i.e., statistics thereof, for example) corresponding to the video data, the encoding parameters used to encode the video, the reconstruction error, and the bit rate. That is, the encoder control unit 404 may determine the rate distortion for video data encoded using a particular set of encoding parameters and having a particular OFM. In this way, by iterating multiple times encoding the same video data (or training set of video data) with different encoding parameters, the encoder control unit 404 may be considered to be able to learn (or train) which encoding parameters optimize the rate distortion of various types of video data. That is, for example, an output feature map having a relatively low variance may be associated with an image having a large low texture region, and may be relatively insensitive to changes in quantization level. That is, in this case, for this type of image, rate distortion can be optimized by increasing quantization.
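A highly simplified sketch of the kind of rate-distortion iteration performed by an encoder control unit is given below; the Lagrangian cost, the candidate parameters, and the encode/decode callables (stand-ins for video encoder 200 and video decoder 300) are assumptions for illustration only:

import torch

def select_encoding_parameters(video, candidate_qps, encode, decode, lam=0.1):
    # Encode the same video data with different encoding parameters, measure the
    # reconstruction error and bit rate, and keep the parameter set with the best
    # rate-distortion trade-off (Lagrangian cost J = D + lambda * R).
    best_qp, best_cost = None, float("inf")
    for qp in candidate_qps:
        bitstream = encode(video, qp)          # placeholder for video encoder 200
        reconstruction = decode(bitstream)     # placeholder for video decoder 300
        distortion = torch.mean((video - reconstruction) ** 2).item()
        rate = len(bitstream)                  # number of bits (or bytes) produced
        cost = distortion + lam * rate
        if cost < best_cost:
            best_qp, best_cost = qp, cost
    return best_qp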
As described above with respect to fig. 7A-7B, automatic encoding may be performed on video data to generate quantized output feature map data. The quantized output feature map is a compressed representation of the current video block. In some cases, i.e., based on how the auto-coding is performed, the output profile may effectively be a downsampled version of the video data. For example, referring to fig. 7A, a4 x 4 array of video data may be compressed into a 2x 2 array (either before or after quantization). In the case where the 4×4 video data array is one of several 4×4 video data arrays included in 1920×1080 resolution pictures, automatically encoding each 4×4 array may effectively downsample 1920×1080 resolution pictures to 960×540 resolution pictures, as shown in fig. 7A. It should be noted that in some cases, quantization may include adjusting the number of bits used to represent the sample value. That is, for example, a 10-bit value is mapped to an 8-bit value. In this case, the quantized value may have the same amplitude range as the non-quantized value, but fidelity of the amplitude data is reduced. In one example, such a downsampled video data representation may be encoded according to typical video encoding standards in accordance with the techniques herein. Furthermore, according to the techniques herein, automatic encoding may be used during video encoding in order to select video encoding parameters to achieve desired rate distortion, e.g., as described above with respect to fig. 8.
Fig. 9 is an example of an encoding system that may encode a multi-dimensional dataset according to one or more techniques of the present disclosure. The system in fig. 9 is similar to the system shown in fig. 8 and further comprises a quantizer unit 408, an inverse quantizer unit 410, and an auto decoder unit 412. As shown in fig. 9, the quantizer unit 408 receives the one or more output feature maps corresponding to the video data and quantizes the output feature maps. As described above, quantization may include reducing the bit depth such that the amplitude range of the quantized OFM values is the same as the input video data. As shown in fig. 9, the video encoder 200 receives the quantized output feature map, encodes the quantized output feature map based on encoding parameters according to a typical video encoding standard as described above, and outputs a bitstream. The video decoder 300 receives the bitstream and reconstructs the quantized output feature map according to the typical video coding standard as described above. It should be noted that although not shown in fig. 9, in some examples, additional processing may be performed on the quantized OFM for the purpose of encoding data according to a video encoding standard. That is, in some examples, the data may be rearranged, scaled, etc. Furthermore, a reciprocal process may be performed on the reconstructed quantized OFM. The inverse quantizer unit 410 receives the restored quantized output feature map and performs inverse quantization, and the automatic decoder unit 412 performs automatic decoding. That is, the inverse quantizer unit 410 and the auto decoder unit 412 may operate in a manner similar to that described above with respect to fig. 7B. In this way, in the system shown in fig. 9, the bitstream output of the video encoder 200 is an encoded downsampled input video data representation, and the video decoder, inverse quantizer unit 410, and auto decoder unit 412 reconstruct the input video data from the bitstream. Further, as shown in fig. 9, in a manner similar to that described above with respect to fig. 8, the encoder control unit 404 may determine rate distortion for the quantized output feature map encoded using a particular set of encoding parameters and video data having a particular OFM. That is, the encoder control unit 404 may optimize the encoding of the downsampled video data representation. In addition, the encoder control unit 404 may optimize downsampling of the input video data. That is, for example, in accordance with the techniques herein, the encoder control unit 404 may determine which types of video data (e.g., high detail image vs. low detail image (or region thereof)) are more or less sensitive to reconstruction errors as a result of downsampling.
As described above with respect to fig. 5, with a typical video encoder, residual data is encoded in the bitstream as level values. It should be noted that, similar to the input video data, the residual data is an example of a multi-dimensional dataset. Thus, in one example, residual data (e.g., pixel domain residual data) may be encoded using an automatic encoding technique in accordance with the techniques herein. Fig. 10 is a block diagram illustrating an example of a video encoder that may be configured to encode video data in accordance with the techniques described herein. It should be noted that although the exemplary video encoder 500 is shown with different functional blocks, such illustration is intended for descriptive purposes and not to limit the video encoder 500 and/or its subcomponents to a particular hardware or software architecture. The functions of video encoder 500 may be implemented using any combination of hardware, firmware, and/or software implementations. As shown in fig. 10, the video encoder 500 receives a source video block and outputs a bitstream, and includes a summer 202, a summer 210, an intra prediction processing unit 212, an inter prediction processing unit 214, a reference block buffer 216, a filter unit 218, a reference picture buffer 220, and an entropy encoding unit 222, similar to the video encoder 200. Accordingly, video encoder 500 may perform intra-prediction encoding and inter-prediction encoding of picture regions and receive source video blocks in a manner similar to that described above with respect to video encoder 200.
As shown in fig. 10, the video encoder 500 includes an auto encoder/quantizer unit 502, an inverse quantizer and auto decoder unit 504, and an entropy encoding unit 506. As shown in fig. 10, the auto encoder/quantizer unit 502 receives residual data and outputs a quantized Residual Output Feature Map (ROFM). That is, the auto encoder/quantizer unit 502 may perform auto encoding in accordance with the techniques described herein, for example, in a manner similar to that described above with respect to fig. 7A. As shown in fig. 10, the inverse quantizer and auto decoder unit 504 receives the quantized Residual Output Feature Map (ROFM) and outputs reconstructed residual data. That is, the inverse quantizer and auto decoder unit 504 may perform auto decoding according to the techniques described herein, for example, in a manner similar to that described above with respect to fig. 7B. In this way, the video encoder 200 shown in fig. 5 and the video encoder 500 shown in fig. 10 each have an encoding/decoding loop for reconstructing residual data, which is then added to the predicted video block for subsequent encoding. As shown in fig. 10, the entropy encoding unit 506 receives the quantized residual output feature map and outputs a bit sequence. That is, entropy encoding unit 506 may perform entropy encoding according to the entropy encoding techniques described herein. As further shown in fig. 10, the entropy encoding unit 222 receives null level values. That is, because the video encoder 500 outputs encoded residual data as a bit sequence, and a video decoder (e.g., the video decoder 600 shown in fig. 11) may derive residual data from the bit sequence, in some cases, the residual data may not be derived from a bitstream conforming to a typical video encoding standard. For example, a bitstream generated from video encoder 500 may set coded block flags (e.g., cbf_luma, cbf_cb, and cbf_cr in ITU-T H.265) to zero to indicate that there are no transform coefficient level values that are not equal to 0. It should be noted that although in the example shown in fig. 10, the transform coefficient generator 204, the coefficient quantization unit 206, and the inverse quantization and transform coefficient processing unit 208 are not included, in some examples, the video encoder 500 may be configured to additionally/alternatively encode residual data using one or more of the techniques described above. That is, the type of encoding used to encode the residual data may be selectively applied, for example, on a sequence-by-sequence, picture-by-picture, slice-by-slice, and/or component-by-component basis. Further, as shown in fig. 10, the auto encoder/quantizer unit 502 and the entropy encoding unit 506 are controlled by encoding parameters. That is, an encoder control unit (e.g., encoder control unit 404 described with respect to fig. 8 and fig. 9) may be used in conjunction with video encoder 500. That is, video encoder 500 may be used in a system that optimizes rate distortion based on the techniques described herein.
Fig. 11 is a block diagram illustrating an example of a video decoder that may be configured to decode video data in accordance with the techniques described herein. As shown in fig. 11, the video decoder 600 receives an entropy-encoded bitstream and a bit sequence, and outputs a reconstructed video. Similar to the video decoder 300 shown in fig. 6, the video decoder 600 includes an entropy decoding unit 302, an intra prediction processing unit 308, an inter prediction processing unit 310, a summer 312, a post-filter unit 314, and a reference buffer 316. Thus, the video decoder 600 may be configured to derive a predicted video block from the conforming bitstream and add the predicted video block to the reconstructed residual to generate the reconstructed video in a manner similar to that described above with respect to fig. 6. As further shown in the example shown in fig. 11, the video decoder 600 includes an entropy decoding unit 602. The entropy decoding unit 602 may be configured to decode the quantized residual output feature map from the bit sequence according to a process that is reciprocal to the entropy encoding process. That is, entropy decoding unit 602 may be configured to perform entropy decoding according to the entropy encoding technique performed by entropy encoding unit 506 described above. As shown in fig. 11, the inverse quantizer unit 604 receives the quantized residual output feature map and outputs the restored residual output feature map to the auto decoder unit 606. The auto decoder unit 606 outputs reconstructed residual data. Thus, the inverse quantizer unit 604 and the auto decoder unit 606 operate in a manner similar to the inverse quantizer and auto decoder unit 504 described above. That is, the inverse quantizer unit 604 and the auto decoder unit 606 may perform auto decoding according to the techniques described herein. Thus, in the example shown in fig. 11, video decoder 600 may be configured to decode video data in accordance with the techniques described herein. It should be noted that predictive coding may be used on data other than video data, as described in further detail below. Thus, in one example, the video decoder 600 may decode a non-video MDDS from the conforming bitstream. For example, video decoder 600 may decode data for consumption by a machine. Similarly, the video encoder 500 may encode a non-video MDDS having a compatible input structure format. That is, for example, the source video may undergo some preprocessing and be converted to a non-video MDDS. In summary, a typical video encoder and decoder may not know whether the data being encoded is actually video data (e.g., human-consumable video data).
As described above, predictive video coding techniques (i.e., intra-prediction and inter-prediction) generate a prediction of a current video block from stored reconstructed reference video data. As further described above, in one example, the downsampled video data representation (which is an output feature map) may be encoded according to predictive video encoding techniques in accordance with the techniques herein. Thus, predictive coding techniques for encoding video data are generally applicable to output feature maps. That is, in one example, an output feature map (e.g., an output feature map corresponding to video data) may be predictively encoded using predictive video encoding techniques in accordance with the techniques herein. Furthermore, in some examples, according to the techniques herein, the corresponding residual data (i.e., e.g., the difference of the current region of the OFM and the prediction) may be encoded using an automatic encoding technique. Thus, in one example, a multi-dimensional dataset may be automatically encoded, a resulting output feature map may be predictively encoded, and residual data corresponding to the output feature map may be automatically encoded, in accordance with the techniques herein.
FIG. 12 is a block diagram illustrating an example of a compression engine that may be configured to encode a multi-dimensional dataset according to one or more techniques of the present disclosure. It should be noted that while the exemplary compression engine 700 is shown as having different functional blocks, such illustration is intended for descriptive purposes and not to limit the compression engine 700 and/or its subcomponents to a particular hardware or software architecture. The functionality of compression engine 700 may be implemented using any combination of hardware, firmware, and/or software implementations. In the example shown in fig. 12, compression engine 700 includes automatic encoder units 402A and 402B, encoder control unit 404, summer 406, quantizer units 408A and 408B, inverse quantizer units 410A and 410B, automatic decoder units 412A and 412B, summer 414, and entropy encoding unit 506. As further shown in fig. 12, compression engine 700 includes a reference buffer 702, an OFM prediction unit 704, a prediction generation unit 706, and an entropy encoding unit 710. As shown in fig. 12, the compression engine 700 receives the MDDS and outputs a first bit sequence and a second bit sequence.
The auto encoder units 402A and 402B and the quantizer units 408A and 408B are configured to operate in a manner similar to the auto encoder unit 402 and the quantizer unit 408 described above with respect to fig. 9. That is, the auto encoder units 402A and 402B and the quantizer units 408A and 408B are configured to receive the MDDS and output quantized OFMs. Specifically, in the example shown in fig. 12, the automatic encoder unit 402A and the quantizer unit 408A receive the source MDDS and output the quantized OFM, and the automatic encoder unit 402B and the quantizer unit 408B receive the residual data that is the MDDS as described above and output the quantized OFM. Furthermore, the inverse quantizer units 410A and 410B and the auto decoder units 412A and 412B are configured to operate in a manner similar to the inverse quantizer units 410 and auto decoder units 412 described above with respect to fig. 9. That is, the inverse quantizer units 410A and 410B and the auto-decoder units 412A and 412B are configured to receive the quantized output feature maps, perform inverse quantization and auto-decode to generate a reconstructed data set. Specifically, in the example shown in fig. 12, the inverse quantizer unit 410B and the auto decoder unit 412B receive the quantized residual output feature map and output reconstructed residual data as part of the encoding/decoding cycle. As shown in fig. 12, at summer 426, the reconstructed residual data is added to the predicted video block for subsequent encoding. As described in further detail below, the predictions are generated by the prediction generation unit 706 and are quantized OFMs. As shown in fig. 12, the output of summer 426 is the reconstructed quantized OFM, and inverse quantizer units 410A and 410B receive the reconstructed quantized OFM and output the reconstructed MDDS as part of the encoding/decoding cycle. That is, as shown in fig. 12, the summer 406 provides a reconstruction error that can be evaluated by the encoder control unit 404 in a manner similar to that described above. Accordingly, compression engine 700 is similar to the encoders and systems described above in that rate distortion can be optimized based on reconstruction errors. As shown in fig. 12, the entropy encoding unit 506 receives the quantized residual output feature map and outputs a bit sequence. In this way, entropy encoding unit 506 operates in a manner similar to entropy encoding unit 506 described above with respect to fig. 10.
As described above, the output feature map may be predictively encoded. Referring again to fig. 12, reference buffer 702, OFM prediction unit 704, and prediction generation unit 706 represent components of compression engine 700 configured to predictively encode an output profile. That is, the output profile may be stored in the reference buffer 702. The OFM prediction unit 704 may be configured to analyze the current OFM and the OFM stored to the reference buffer 702 and generate prediction data. That is, for example, the OFM prediction unit 704 may process the OFM in a manner similar to processing a picture in typical video encoding, and select motion information of the reference OFM and the current OFM. In the example shown in fig. 12, the prediction generation unit 706 receives the prediction data and generates a prediction (e.g., retrieves an area of the OFM) from the OFM data stored to the reference buffer 702. It should be noted that in fig. 12, the OFM prediction unit 704 is shown as receiving the encoding parameters. In this case, the encoder control unit 404 may control how the prediction data is generated, for example, based on rate-distortion analysis. For example, OFM data may be particularly sensitive to various types of artifacts that are relatively small relative to video data, and thus may disable prediction modes associated with such artifacts. Finally, as shown in fig. 12, the entropy encoding unit 710 receives the encoding parameters and the prediction data, and outputs a bit sequence. That is, entropy encoding unit 710 may be configured to perform the entropy encoding techniques described herein. It should be noted that although not shown in fig. 12, the first bit sequence and the second bit sequence may be multiplexed (e.g., before or after entropy encoding) to form a single bit stream.
Fig. 13 is a block diagram illustrating an example of a decompression engine that may be configured to decode a multi-dimensional dataset according to one or more techniques of the present disclosure. As shown in fig. 13, the decompression engine 800 receives the entropy-encoded first bit sequence, the entropy-encoded second bit sequence, and the encoding parameters, and outputs a reconstructed MDDS. That is, decompression engine 800 may operate in a reciprocal manner to compression engine 700. As shown in fig. 13, decompression engine 800 includes inverse quantizer units 410A and 410B, auto-decoder units 412A and 412B, and summer 426, each of which may be configured to operate in a manner similar to like numbered components described above with respect to fig. 12. As further shown in fig. 13, the decompression engine 800 includes an entropy decoding unit 802, a prediction generation unit 804, a reference buffer 806, and an entropy decoding unit 808. As shown in fig. 13, the entropy decoding unit 802 and the entropy decoding unit 808 receive the corresponding bit sequences and output corresponding data. That is, the entropy decoding unit 802 and the entropy decoding unit 808 may operate in a reciprocal manner to the entropy encoding unit 710 and the entropy encoding unit 506 described above with respect to fig. 12. As shown in fig. 13, the reference buffer 806 stores the reconstructed quantized OFM, the prediction generation unit 804 receives the prediction data, and the encoding parameters generate the prediction. That is, the prediction generation unit 804 and the reference buffer 806 may operate in a manner similar to the prediction generation unit 706 and the reference buffer 702 described above with respect to fig. 12. Accordingly, decompression engine 800 may be configured to decode encoded MDDS data in accordance with the techniques described herein.
It should be noted that in the examples described above, in fig. 8, 9 and 12, each encoder control unit 404 is shown as receiving a reconstruction error. In some examples, the encoder control unit may not receive the reconstruction error. That is, in some examples, full decoding may not occur at the encoder. For example, referring to fig. 8, in one example, the video decoder 300 and summer 406 (i.e., the decoding loop) may not be present, and encoder control unit 404 may simply receive the OFM to determine the encoding parameters.
As described above, in addition to performing discrete convolution on a two-dimensional (2D) data set, convolution may also be performed on a one-dimensional data set (1D) or higher-dimensional data set (e.g., a 3D data set). There are several ways in which video data can be mapped to a multi-dimensional dataset. In general, video data may be described as having a plurality of spatial data input channels. That is, video data may be described as an N i x W x H dataset, where N i is the number of input channels, W is the spatial width and H is the spatial height. It should be noted that in some examples, N i may be a time dimension (e.g., number of pictures). For example, N i in N i ×w×h may indicate the number of 1920×1080 monochrome pictures. Further, in some examples, N i may be a component dimension (e.g., the number of color components). For example, N i x W x H may comprise a single 1024 x 742 image with RGB components, i.e. in this case N i is equal to 3. Further, it should be noted that in some cases, there may be N input channels for both multiple components (e.g., N Ci) and multiple pictures (e.g., N Pi). In this case, the video data may be designated as N Ci×NPi ×w×h, that is, as a four-dimensional dataset. According to the N Ci×NPi ×w×h format, an example of 60 1920×1080 monochrome pictures can be expressed as 1×60×1920×1080, and a single 1024×742RGB image can be expressed as 3×1×1024×742. It should be noted that in these cases, each of the four-dimensional data sets has a dimension of 1, and may be referred to as a three-dimensional data set, and is reduced to 60×1920×1080 and 3×1024×742, respectively. That is, 60 and 3 are both input channels in a three-dimensional dataset, but refer to different dimensions (i.e., time and components).
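The shape conventions above can be illustrated with PyTorch tensors as follows; note that the ordering here follows the N Ci×NPi×W×H convention of the text rather than PyTorch's usual N×C×H×W layout:

import torch

# Four-dimensional N_Ci x N_Pi x W x H representations of the two examples above
mono_video = torch.empty(1, 60, 1920, 1080)    # 60 monochrome 1920 x 1080 pictures
rgb_image = torch.empty(3, 1, 1024, 742)       # a single 1024 x 742 RGB image

# Dimensions of size 1 may be dropped, giving the reduced three-dimensional data sets
mono_video_3d = mono_video.squeeze(0)          # 60 x 1920 x 1080 (input channels are temporal)
rgb_image_3d = rgb_image.squeeze(1)            # 3 x 1024 x 742 (input channels are components)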
As described above, in some cases, the 2D OFM may correspond to a downsampled video component (e.g., luminance) in both the spatial dimension and the temporal dimension. Further, in some cases, the 2D OFM may correspond to downsampled video in both the spatial dimension and the component dimension. That is, for example, a single 1024×742 RGB image (i.e., 3×1024×742) can be downsampled to a 1×342×248 OFM, i.e., downsampling by 3 in the spatial dimensions and by 3 in the component dimension, respectively. It should be noted that in this case 1024 may be padded by 2 to 1026 and 742 may be padded by 2 to 744 such that each is a multiple of 3. Further, in one example, 60 1920×1080 monochrome pictures (i.e., 60×1920×1080) may be downsampled to a 1×640×360 OFM, i.e., downsampling by 3 in the spatial dimensions and by 60 in the temporal dimension, respectively.
It should be noted that in the above cases, the downsampling may be achieved by using N i x 3 x 3 kernels with a stride of 3 in the spatial dimensions. That is, for a 3 x 1026 x 744 data set, the convolution generates a single value for each 3 x 3 x 3 data point, and for a 60 x 1920 x 1080 data set, the convolution generates a single value for each 60 x 3 x 3 data point. It should be noted that in some cases it may be useful to perform discrete convolutions on the data set multiple times (e.g., using multiple kernels and/or strides). That is, for example, with respect to the above example, multiple instances (e.g., each having different values) of an N i x 3 x 3 kernel may be defined and used to generate a corresponding multiple of OFM instances. In this case, the number of instances may be referred to as the number of output channels, i.e., N O. Thus, where the N i×Wi×Hi input dataset is downsampled according to N O instances of an N i×Wk×Hk kernel, the resulting output data may be represented as N O×WO×HO, where W O is a function of W i, W k, and the stride in the horizontal dimension, and H O is a function of H i, H k, and the stride in the vertical dimension. That is, each of W O and H O is determined by the spatial downsampling. It should be noted that in some examples, the N O×WO×HO dataset may be used for object/feature detection in accordance with the techniques herein. That is, for example, each of the N O data sets may be compared to one another, and the relationships in common regions may be used to identify the presence of an object (or another feature) in the original N i×Wi×Hi input data set. For example, the comparisons/tasks may be performed over multiple NN layers. Furthermore, algorithms such as non-maximum suppression for selecting among the available options may be used. In this way, as described above, the encoding parameters of a typical video encoder may be optimized based on the N O×WO×HO dataset, such as quantization that varies based on an indication of objects/features in the video. In this manner, in accordance with the techniques herein, the data encoder 106 represents an example of a device configured to: receive a data set having a size specified by a number of channel dimensions, a height dimension, and a width dimension; generate an output data set corresponding to the input data by performing a discrete convolution on the input data set, wherein performing the discrete convolution includes spatially downsampling the input data set according to a number of instances of a kernel; and encode the received data set based on the generated output data set. It should be noted that, in theory, the stride may be less than one, and in this case, convolution may be used to upsample the data.
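A sketch of the N O-instance convolution described above is given below for the 60-picture example (N i = 60 and N O = 32 follow the examples in this description; the random input is a placeholder). Note that PyTorch orders the spatial dimensions as height then width:

import torch

n_i, n_o = 60, 32

# N_O instances of an N_i x 3 x 3 kernel applied with a stride of 3:
# each output channel of the convolution corresponds to one kernel instance
conv = torch.nn.Conv2d(in_channels=n_i, out_channels=n_o, kernel_size=3, stride=3)
print(conv.weight.shape)               # torch.Size([32, 60, 3, 3]) -> N_O x N_i x 3 x 3

x = torch.randn(1, n_i, 1080, 1920)    # one N_i x H_i x W_i input data set
y = conv(x)
print(y.shape)                         # torch.Size([1, 32, 360, 640]) -> N_O x H_O x W_O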
In one example, where multiple instances of K x K kernels, each having a channel dimension equal to N i, are used in the processing of an N i×Wi×Hi dataset, whether a convolution or a convolution transpose is performed, the kernel size, the stride and padding functions, and the number of output channels of the discrete convolution may be indicated using the following notation:
conv2d: 2D convolution, conv2dT: 2D convolution transpose;
kK: a kernel of size K in all dimensions (e.g., K x K);
sS: a stride of S in all dimensions (e.g., (S, S));
pP: padding of P zero-valued samples on both sides of all dimensions (e.g., (P, P) for 2D); and
nN: number of output channels.
It should be noted that in the exemplary notation provided above, the operations are symmetrical, i.e., square. It should be noted that in some examples, for a generally rectangular case, the symbols may be as follows:
conv2d: 2D convolution, conv2dT: 2D convolution transpose;
kK wKh: a kernel having a width dimension of K w and a height dimension of K h (e.g., K w×Kh);
sS wSh: a stride having a width dimension of S w and a height dimension of S h (e.g., S w×Sh);
pP wPh: padding of P w zero-valued samples on both sides of the width dimension and P h zero-valued samples on both sides of the height dimension (e.g., P w×Ph); and
nN: number of output channels.
It should be noted that in some examples, a combination of the above symbols may be used. For example, in some examples, the kK, sS, and pP wPh symbols may be used together. Further, it should be noted that in other examples, the padding may be asymmetric with respect to a spatial dimension (e.g., padding of 1 row above and 2 rows below).
Further, as described above, convolution may be performed on a one-dimensional data set (1D) or a higher-dimensional data set (e.g., a 3D data set). It should be noted that in some cases, the above symbols may be generalized for multidimensional convolution as follows:
conv1d: 1D convolution, conv2d: 2D convolution, conv3d: 3D convolution
conv1dT: 1D convolution transpose, conv2dT: 2D convolution transpose, conv3dT: 3D convolution transpose
kK: a kernel of size K in all dimensions (e.g., K for 1D, K x K for 2D, K x K x K for 3D)
sS: a stride of S in all dimensions (e.g., S for 1D, (S, S) for 2D, (S, S, S) for 3D)
pP: padding of P zero-valued samples on both sides of all dimensions (e.g., (P) for 1D, (P, P) for 2D, (P, P, P) for 3D)
nN: number of output channels
The symbols provided above may be used to efficiently signal auto-encoding and auto-decoding operations. For example, the case of downsampling a single 1024×742 RGB image to a 342×248 OFM, as described above, using 256 kernel instances can be described as follows:
Input data: 3 × 1024 × 742
Operation: conv2d, k3, s3, p1, n256
Resulting output data: 256 × 342 × 248
Similarly, the case of downsampling 60 1920×1080 monochrome pictures to a 640×360 OFM, as described above, using 32 kernel instances can be described as follows:
Input data: 60 × 1920 × 1080
Operation: conv2d, k3, s3, p0, n32
Resulting output data: 32 × 640 × 360
It should be noted that there may be a variety of ways to perform convolution on input data to represent the data as an output feature map (e.g., a 1st padding, a 1st convolution, a 2nd padding, a 2nd convolution, etc.). For example, the resulting dataset 256×342×248 may be further downsampled by 3 in the spatial dimensions and by 8 in the channel dimension as follows:
Input data: 256 × 342 × 248
Operation: conv2d, k3, s3, p0.1, n32
Resulting output data: 32 × 114 × 84
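The output sizes in the examples above can be checked with the standard output-size relation for a discrete convolution; the padding interpretations for the second and third examples below are inferences from the stated output dimensions rather than values given explicitly above:

def conv_output_size(n_in, kernel, stride, pad):
    # Output length along one dimension for a discrete convolution
    return (n_in + 2 * pad - kernel) // stride + 1

# Example 1: conv2d, k3, s3, p1, n256 applied to 3 x 1024 x 742
print(conv_output_size(1024, 3, 3, 1), conv_output_size(742, 3, 3, 1))    # 342 248

# Example 2: the stated 640 x 360 output is consistent with k3, s3 and no padding
print(conv_output_size(1920, 3, 3, 0), conv_output_size(1080, 3, 3, 0))   # 640 360

# Example 3: 342 -> 114 follows with no padding; 248 -> 84 appears to require a
# total of 4 padded samples in that dimension (e.g., 2 on each side)
print(conv_output_size(342, 3, 3, 0), conv_output_size(248, 3, 3, 2))     # 114 84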
In one example, the operation of an auto-decoder may be well-defined and known to an auto-encoder in accordance with the techniques herein. That is, the auto encoder knows the size of the input (e.g., OFM) received at the decoder (e.g., 256×342×248, 32×640×360, or 32×114×84 in the above example). This information can be used with k and s in the known convolution/convolution transpose stage to determine the data set size at a particular location of the auto-decoder.
As described above, object recognition tasks generally involve receiving an image, generating feature data corresponding to the image, analyzing the feature data, and generating inference data. Examples of typical object detection systems include, for example, YOLO, RetinaNet, and versions of Fast R-CNN. Detailed descriptions of object detection systems, performance assessment techniques, and performance comparisons are provided in various journals and the like. For example, Redmon et al., "YOLOv3: An Incremental Improvement", arXiv:1804.02767, 8 April 2018, generally describes YOLOv3 and provides a comparison with other object detection systems. Everingham M., Eslami S. M. A., Van Gool L., et al., "The PASCAL Visual Object Classes Challenge: A Retrospective", International Journal of Computer Vision, 2015, Vol. 111(1): pp. 98-136, describes the mAP (mean average precision) evaluation metric for evaluating object detection and segmentation. Wu et al., "Detectron2" (GitHub, facebookresearch/detectron2, 2019), provide libraries and associated documentation for Detectron2, Detectron2 being a Facebook Artificial Intelligence (AI) research platform for object detection, segmentation, and other visual recognition tasks.
It should be noted that, for purposes of explanation, in some cases the techniques described herein are described with a particular exemplary object detection system (e.g., Detectron2). However, it should be noted that the techniques herein are generally applicable to any object detection system. Furthermore, the techniques described herein may be applicable to any system that generates feature tensors for an MDDS. For example, the techniques described herein may be generally applicable to other types of MDDS (e.g., multi-channel audio, omni-directional video, etc.). That is, regardless of what the input data represents, the feature tensors generated therefrom may be compressed according to the techniques described herein. Referring to fig. 14, in general, with respect to image data, an object detection system can be described as follows: image data is received at backbone network unit 900 (which may be based on, e.g., ResNet-101-C4, ResNet-101-FPN, Inception-ResNet-v2, Inception-ResNet-v2-TDM, DarkNet-19, ResNet-101-SSD, ResNet-101-DSSD, ResNeXt-101-SSD, Darknet-53, etc.) and feature data (also referred to as an OFM, feature tensor, feature map, etc.) is generated, and the feature data is received at inference network unit 1000 and inference data is generated. It should be noted that there may be several methods (or algorithmic strategies) for generating the inference data at the inference network unit 1000, including, for example, so-called one-stage methods and two-stage methods. Regardless of how the inference data is generated, the techniques described herein generally apply. As described above, the techniques described in this disclosure may be particularly useful for allowing object recognition tasks to be distributed across a communication network. That is, referring to fig. 15, in accordance with the techniques herein, each of backbone network unit 900 and inference network unit 1000 may be coupled to communication medium 110, and thus, in some examples, located at different physical locations.
Fig. 16 is an example of an encoding system that may encode a multi-dimensional dataset according to one or more techniques of the present disclosure. As shown in fig. 16, the system includes a backbone network unit 900, an inference network unit 1000, and a communication medium 110. In addition, as shown in fig. 16, the system includes a compression engine 1100 and a decompression engine 1200. Compression engine 1100 may be configured to compress the feature data according to one or more of the techniques described herein, and decompression engine 1200 may be configured to perform reciprocal operations to reconstruct the feature data. As described above, the feature data may be generated from a defined backbone network. Typically, the feature data may be multi-scale feature maps with different receptive fields. The backbone network may be based on a backbone model (e.g., R-50, R101, X-101, ResNet-101-C4, ResNet-101-FPN, Inception-ResNet-v2, Inception-ResNet-v2-TDM, DarkNet-19, ResNet-101-SSD, ResNet-101-DSSD, ResNeXt-101-SSD, Darknet-53, Base-RCNN-FPN, etc.). Typically, backbone networks include stages that include multiple bottlenecks. A stage may correspond to a scale. For example, for a 2D image, a stage may correspond to 1/4 downsampling of data (e.g., 1920x1080 data values to 480x270 data values). A bottleneck may include convolution layers. That is, a bottleneck may include performing multiple convolution operations with various kernel sizes and strides. Furthermore, it should be noted that the backbone network may further process features from each stage. That is, features generated, for example, from bottlenecks may be provided as input for one or more additional processes. That is, the backbone network may comprise so-called fully connected layers and/or activation layers. For example, Base-RCNN-FPN includes lateral and output convolution layers, an upsampler, and a final max pooling layer. Thus, there are a number of ways in which a backbone network can be implemented. The techniques described herein generally apply regardless of the backbone network used to generate the feature data. However, it should be noted that in some cases it may be useful to use a common (e.g., standardized) backbone network for a particular task. That is, for some applications, advantages similar to those achieved by video coding standards with defined compliant bitstreams may be achieved by implementing a common/standardized backbone network. As described in detail below, the techniques described herein are particularly useful for common/standardized backbone networks because they allow compression of feature data without requiring modification of a particular backbone network. With respect to modifying the backbone network, it should be noted that developing a useful backbone model may require analysis of a large amount of training data and may therefore not be a simple process.
As noted above, for purposes of explanation, in some cases the techniques described herein are described with a particular exemplary object detection system (such as Detectron2). Fig. 14 shows an example in which the exemplary image data, feature data, and inferred data correspond to Detectron2. That is, in Detectron2, a Feature Pyramid Network (FPN), Base-RCNN-FPN, extracts feature maps from the BGR input image at different scales. It should be noted that a complete description of Detectron2 is not provided herein for the sake of brevity. However, the document "Hiroto Honda, 'Digging into Detectron 2', parts 1 to 5, Medium, January 5 to July 2020" provides an overview of Detectron2. Detectron2 generates feature maps at 1/4 scale, 1/8 scale, 1/16 scale, 1/32 scale, and 1/64 scale, and outputs 256 channels at each scale. That is, as described above, data is generated for each of the 256 instances of the kernel at each scale. Specifically, in the example of Detectron2, at each scale, one or more convolution and pooling operations are performed to generate feature data (e.g., a 7 x 7 convolution with stride = 2 and max pooling with stride = 2). Fig. 17 is a conceptual diagram showing a general example of generating feature data. As shown in fig. 17, for input data having a width W and a height H, at each scale (i.e., 1/2 scale, 1/4 scale, 1/8 scale, and 1/16 scale), there is a corresponding number N i of output channels of the feature data. Further, at each scale, feature data may be generated according to one or more automatic encoding techniques, such as one or more of the automatic encoding techniques described above. As described above, a particular automatic encoding technique may be specified according to the backbone model. Regardless of the number of scales, the number of output channels, and/or the techniques used to generate the feature data, the techniques described herein are generally applicable to compressing the feature data.
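For illustration only, a toy multi-scale backbone of the general form of fig. 17 may be sketched as follows; the layer structure, channel count, and activation are assumptions and do not correspond to Base-RCNN-FPN or any other standardized backbone model:

import torch
import torch.nn as nn

class ToyMultiScaleBackbone(nn.Module):
    # Each stage halves the spatial resolution and emits a fixed number of feature
    # channels, producing feature maps at 1/2, 1/4, 1/8, and 1/16 scale
    def __init__(self, out_channels=256):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Conv2d(3 if i == 0 else out_channels, out_channels,
                      kernel_size=3, stride=2, padding=1)
            for i in range(4)
        ])

    def forward(self, image):
        features = {}
        x = image
        for i, stage in enumerate(self.stages):
            x = torch.relu(stage(x))
            features["1/%d" % (2 ** (i + 1))] = x
        return features

image = torch.randn(1, 3, 256, 256)              # placeholder input image
feature_maps = ToyMultiScaleBackbone()(image)    # e.g., feature_maps["1/4"] is 1 x 256 x 64 x 64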
In some cases, the generated characteristic data may include redundant and/or data that does not significantly contribute to the output. That is, some feature data may not significantly contribute to the subsequent generation of inferred data. For example, referring to the example shown in FIG. 17, for some input data sets, numerous channels of 1/2 scaled feature data (and/or 1/4, 1/8, 1/16 scaled feature data) do not significantly contribute to the subsequent generation of inferred data. That is, in this case, feature data from other scales may provide a more significant contribution to inferred data generation. For example, when the inferred data comprises a bounding box, a particular inferred data generation method may only require a subset of the feature data to generate a particular bounding box. Thus, in these cases, the feature data may be compressed without degrading the overall performance of object detection for particular input data, in accordance with the techniques herein. As described in detail below, in one example, the channels of the feature data may be pruned in accordance with the techniques herein. It should be noted that while redundant and/or unimportant feature data may be removed in some cases by modifying the backbone network (e.g., by removing phases from the backbone network), such an approach may be less than ideal. That is, for example, as described above, a common/standardized backbone network may be implemented, and modifications to such a backbone network may not be possible and/or practical, depending on the particular application. That is, for example, modifying the backbone network may require extensive retraining and/or fine-tuning of the backbone network (and/or the inference network) to maintain overall performance. In other cases, modifying the backbone network may compromise future scalability (i.e., the ability of the same backbone output to be used for future tasks). Furthermore, it should be noted that the input data may vary significantly. For example, video clips depicting a particular scene may vary significantly (e.g., one large slowly moving object versus several small fast moving objects), and it may not be possible and/or practical to develop a backbone network that does not generate redundant feature data for at least some of the variations of the input data.
As described above, in one example, the channels of the feature data may be pruned in accordance with the techniques herein. Pruning redundant and/or unimportant feature data may be particularly useful for compressing feature data for distribution over a communication network. That is, for example, referring to fig. 16, in accordance with the techniques herein, compression engine 1100 may be configured to prune feature data (e.g., to form a bitstream) in accordance with one or more of the techniques described herein such that less data needs to be transmitted across the communication network. Decompression engine 1200 may be configured to perform operations that are reciprocal to the pruning operations to reconstruct the feature data for subsequent processing. As described above, some feature data may be redundant and/or contribute insignificant to the generation of output and thus may be pruned (and reconstructed) while degrading system performance (e.g., object detection performance) in a negligible manner.
In one example, in accordance with the techniques herein, compression engine 1100 may be configured to determine which channels (or scales) to prune in accordance with one or more of the algorithms described herein. Further, in one example, compression engine 1100 may be configured to signal which channels have been pruned. For example, with respect to the example of Detectron2, where the backbone network generates feature data including 256 channels at each of 1/4 scale, 1/8 scale, 1/16 scale, 1/32 scale, and 1/64 scale, the compression engine 1100 may be configured to signal 256 bits for each scale (i.e., 1280 bits in total (256 bits x 5 scales)), where the value of the bit corresponding to a channel (i.e., 1 or 0) indicates whether the channel has been pruned, i.e., not included in the feature data. It should be noted that in some examples, the signaling bits may be encoded to reduce the amount of signaling data, for example, by using run-length coding or the like. In one example, decompression engine 1200 may be configured to fill zeros into the pruned channels. In other examples, decompression engine 1200 may be configured to insert other values (e.g., a median, an average, values calculated for the channel, etc.) into the pruned channels. Further, in one example, compression engine 1100 may be configured to signal a data value (or set of data values) to be inserted into the pruned channels. Further, in one example, each of compression engine 1100 and decompression engine 1200 may store a lookup table of data sets, and compression engine 1100 may signal an index into the lookup table. The decompression engine 1200 may determine the data set to be inserted into the pruned channels based on the stored lookup table and the received index.
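As an illustration of the pruning signaling and reconstruction described above, a minimal PyTorch sketch may be written as follows. It should be noted that this is merely an illustrative sketch; the function names and the use of a boolean keep mask are assumptions, and entropy coding of the signaling bits (e.g., run-length coding) is omitted.
import torch

def prune_and_signal(x, keep_mask):
    # x: feature tensor of shape [C, H, W]; keep_mask: boolean tensor of shape [C]
    pruned = x[keep_mask]                     # only the kept channels are transmitted
    signal_bits = keep_mask.to(torch.uint8)   # one bit per channel (e.g., 256 bits per scale)
    return pruned, signal_bits

def reconstruct_channels(pruned, signal_bits, fill_value=0.0):
    # re-insert the pruned channels, filling them with a constant (zero by default)
    C = signal_bits.numel()
    H, W = pruned.shape[1], pruned.shape[2]
    x_rec = torch.full((C, H, W), fill_value)
    x_rec[signal_bits.bool()] = pruned
    return x_rec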
As described above, compression engine 1100 may be configured to determine which channels to prune according to an algorithm. In one example, compression engine 1100 may be configured to prune a channel when all (or a large number of) the tensor values in the channel are less than a threshold. For example, for feature data with tensor x [ C, H, W ] (e.g., the feature data for a scale), where C is the number of channels, H is the height, and W is the width, for a threshold T, an exemplary pruning algorithm may be as follows:
It should be noted that the above algorithm provides a standard logical expression for pruning, and that there are many ways to implement such an algorithm to achieve computational efficiency. For example, the algorithm may be written in PyTorch as follows:
x_max = torch.amax(x, dim=(1, 2))
for c=1 to C do
prune channel c if x_max[c]<T
where x has the shape of [ C, H, W ] and x_max has the shape of [ C ].
It should be noted that PyTorch is an open source optimized tensor library for deep learning using GPUs and CPUs. PyTorch is based on the Torch library. A detailed description of the PyTorch functions is provided in the PyTorch documentation maintained by its developer, the Facebook AI Research (FAIR) laboratory. The current stable version of PyTorch is v1.9.0, released on June 15, 2021. For brevity, a detailed description of the PyTorch functions is not provided herein; however, reference is made to the PyTorch documentation.
In the above example, if a channel does not contain a tensor value greater than the threshold T, the channel is pruned. For example, according to the exemplary algorithm described above, for feature data including 256 channels at an exemplary scale, x [256, 20, 40], and a threshold T=5.0, for each of channels 1 through 256, if all 800 (20x40=800) tensor values in a given channel are less than 5.0, then that channel is pruned. As described above, compression engine 1100 may be configured to prune a channel when all or a substantial number of the tensor values in the channel are less than a threshold. In the case where a channel is pruned when the number of tensor values in the channel that are greater than the threshold is less than an effective number M, the following in the algorithm described above:
if count[c]==0
prune channel c
Can be modified as follows:
if count[c]<M
prune channel c
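It should be noted that the modified condition above may similarly be written in PyTorch. The following is merely an illustrative sketch; the function name is an assumption, and the returned mask simply marks the channels to be pruned.
import torch

def prune_mask_by_count(x, T, M):
    # x has the shape [C, H, W]; count has the shape [C]
    count = (x > T).sum(dim=(1, 2))   # number of tensor values in each channel greater than T
    return count < M                  # True for channels to be pruned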
In one example, the compression engine 1100 may be configured to prune a predetermined number of channels based on the ranking. For example, the compression engine 1100 may be configured to rank/order channels based on the number of tensor values in the channel that are greater than a threshold, and prune the plurality of channels that have the smallest number of tensor values that are greater than the threshold. For example, for feature data with tensors x [ C, H, W ], where C is the number of channels, H is the height, and W is the width, for a threshold T, an exemplary pruning algorithm may be as follows:
It should be noted that the above algorithm provides a standard logical expression for pruning, and that there are many ways to implement such an algorithm to achieve computational efficiency. For example, the algorithm may be written in PyTorch as follows:
x_threshold = (x > T).float()
x_count = torch.sum(x_threshold, dim=(1, 2))
sort the channels by x_count and prune the N channels with the smallest counts
where x has the shape of [ C, H, W ] and x_count has the shape of [ C ].
For example, according to the exemplary algorithm described above, for feature data including 256 channels at an exemplary scale, x [256, 20, 40], a threshold T=5.0, and a number of channels to be pruned N=3, all 800 (20x40=800) tensor values in each channel are compared to the threshold 5.0 and the number of tensor values greater than 5.0 is counted, the channels are sorted according to the count, and the bottom 3 channels having the fewest tensor values greater than the threshold are pruned.
It should be noted that for a feature map tensor with C channels, if the target bit saving is m percent, the number of channels to be pruned is N = C×m/100. For example, for the feature map tensor x [256, 20, 40] and a target bit saving of 5%, the number of channels to prune is N = 256×5/100 = 13 (12.8, rounded up). It should be noted that there may be a tradeoff between bit savings and performance.
In one example, compression engine 1100 may be configured to rank/order channels based on statistics corresponding to tensor values in the channels. For example, the compression engine 1100 may be configured to determine the standard deviation of the tensor values in the channels and prune the channels with the smallest standard deviation. For example, for feature data with tensors x [ C, H, W ], where C is the number of channels, H is the height, and W is the width, for a threshold T, an exemplary pruning algorithm may be as follows:
where std(x) returns the standard deviation of the elements of x.
For example, according to the exemplary algorithm described above, for feature data including 256 channels at an exemplary scale, x [256, 20, 40], and a number of channels to be pruned N=3, the standard deviation of the tensor values in each of channels 1 through 256 is calculated, the channels are sorted according to the calculated standard deviation, and the bottom 3 channels with the smallest standard deviation are pruned. It should be noted that in one example, similar to the examples described above, the standard deviation of a channel may be compared to a threshold value, and if the standard deviation is not greater than the threshold value, the channel may be pruned. In this way, one or more statistics of a channel may be compared to a corresponding threshold, and if one (or all, or a large number) of these statistics is not greater than the threshold, the channel is pruned.
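It should be noted that this ranking may likewise be written in PyTorch. The following is merely an illustrative sketch; the function name is an assumption, and the returned mask marks the N channels with the smallest standard deviation for pruning.
import torch

def prune_mask_by_std(x, N):
    # x has the shape [C, H, W]; x_std has the shape [C]
    x_std = torch.std(x.reshape(x.shape[0], -1), dim=1)
    _, order = torch.sort(x_std)                 # ascending: smallest standard deviation first
    mask = torch.zeros(x.shape[0], dtype=torch.bool)
    mask[order[:N]] = True                       # True for channels to be pruned
    return mask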
As described above, the inference network (e.g., inference network unit 1000) receives the feature data and generates inference data. With respect to Detectron2, and in general, in some examples, an inference network may be described as including a region proposal network and a sub-class of ROI (region of interest) headers, which may be generally referred to as a box header. Fig. 18 shows the coding system of fig. 16 with an inference network unit 1000 comprising a region proposal network unit 1020 and a box header unit 1050. The region proposal network unit 1020 may be configured to perform region proposal network functions including, for example, those described in Detectron2. In Detectron2, the region proposal network receives feature maps at 1/4 scale, 1/8 scale, 1/16 scale, 1/32 scale, and 1/64 scale, each feature map having 256 channels, as described above, and outputs 1000 box proposals with confidence scores (where 1000 is a default setting). That is, each of the 1000 box proposals includes anchor coordinates, a height, a width, and a score. Generally, the region proposal network in Detectron2 can be described as including an RPN header and an RPN output. Fig. 19 shows an example of a region proposal network 1020 including an RPN header 1022 and an RPN output 1024. In Detectron2, for each feature scale, the RPN header generates objectness logits and anchor deltas. The objectness logits form a probability map of the existence of an object, and the anchor deltas give the relative box shape and position with respect to the anchors. As shown in fig. 19, an initial conv2d k3 n256 operation is performed on the feature map. To generate the objectness logits, conv2d k1 n3 is performed after the initial conv2d k3 n256 operation. To generate the anchor deltas, conv2d k1 n(3x4) is performed on the data after the initial conv2d k3 n256 operation. As shown in fig. 19, RPN output 1024 receives the objectness logits and defined parameters (including, for example, anchors and ground truth boxes) and generates box proposals. In Detectron2, the generation of the box proposals includes anchor generation, ground truth preparation, loss calculation, and proposal selection. In essence, in Detectron2, the output feature maps of objectness logits and anchor deltas are associated with ground truth boxes to generate scored prediction boxes, and a maximum of 1,000 scored boxes are selected as output.
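As an illustration of the RPN header operations described above, a minimal PyTorch sketch may be written as follows. It should be noted that this is merely an illustrative sketch and not Detectron2's actual implementation; the class name is an assumption, and the use of three anchors per location follows the n3 and n(3x4) outputs described above.
import torch
import torch.nn as nn

# Illustrative sketch of an RPN-header-like module: a shared 3x3 convolution followed by
# 1x1 convolutions producing objectness logits and anchor deltas for each feature scale.
class RPNHeaderSketch(nn.Module):
    def __init__(self, channels=256, num_anchors=3):
        super().__init__()
        self.shared = nn.Conv2d(channels, channels, kernel_size=3, padding=1)   # conv2d k3 n256
        self.objectness = nn.Conv2d(channels, num_anchors, kernel_size=1)       # conv2d k1 n3
        self.deltas = nn.Conv2d(channels, num_anchors * 4, kernel_size=1)       # conv2d k1 n(3x4)

    def forward(self, feature_maps):
        objectness_logits, anchor_deltas = [], []
        for x in feature_maps:                          # one feature map per scale
            t = torch.relu(self.shared(x))
            objectness_logits.append(self.objectness(t))
            anchor_deltas.append(self.deltas(t))
        return objectness_logits, anchor_deltas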
As described above, the inference network may include a box header unit. In general, the box header in Detectron2 may be described as including an ROI pooler, a box header, and a box predictor. Fig. 20 shows an example of a box header unit 1050 that includes an ROI pooler 1052, a box header unit 1054, and a box predictor unit 1056. In Detectron2, the ROI pooler pools rectangular regions of the feature maps specified by the box proposals. Essentially, the ROI pooler generates a tensor, which is a set of cropped instance features that include balanced foreground and background ROIs. In Detectron2, this tensor may have a size of [N x batch size, 256, 7, 7], with an ROI size of 7x7. In Detectron2, the box header may be FastRCNNConvFCHead and the box predictor may be FastRCNNOutputLayers. It should be noted that the ROI pooler may generate tensors of other sizes. It should be noted that, although not shown in fig. 20, the tensor generated by the ROI pooler is flattened to a tensor of 256x7x7=12,544 values before being input into the box header unit 1054.
As shown in fig. 20, the box header unit 1054 performs two Linear () operations. The Linear () operation is specified as follows:
Linear(in_features_count,out_features_count,bias)
Applying a linear transformation to the input data:
y = xA^T + b
Parameters
- in_features - size of each input sample
- out_features - size of each output sample
- bias - if set to False, the layer will not learn an additive bias. Default: True
Shape
- Input: (N, *, H_in), where * represents any number of additional dimensions, and H_in = in_features
- Output: (N, *, H_out), where all dimensions except the last dimension have the same shape as the input, and H_out = out_features.
Variables
- weight - the learnable weights of the module, of shape (out_features, in_features). The values are initialized from U(-sqrt{k}, sqrt{k}), where k = 1/in_features and sqrt{} is the square root operation
- bias - the learnable bias of the module, of shape (out_features). If bias is True, the values are initialized from U(-sqrt{k}, sqrt{k}), where k = 1/in_features
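As an illustration of the two Linear() operations applied to the flattened ROI features, a minimal PyTorch sketch may be written as follows. It should be noted that the hidden size of 1024 is an assumption chosen for illustration and is not taken from the description above.
import torch
import torch.nn as nn

roi_features = torch.randn(16, 256, 7, 7)   # a batch of ROI features from the ROI pooler
x = roi_features.flatten(start_dim=1)        # flatten to [16, 256x7x7] = [16, 12544]
fc1 = nn.Linear(12544, 1024)                 # Linear(in_features, out_features)
fc2 = nn.Linear(1024, 1024)
x = torch.relu(fc1(x))
x = torch.relu(fc2(x))                       # box header output features, shape [16, 1024]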
The box header unit 1054 classifies objects within the ROI and fine-tunes the box position and shape. The box predictor unit 1056 generates classification scores and bounding box predictions. The classification scores and bounding box predictions may be used to output bounding boxes. Typically, in Detectron2, non-maximum suppression (NMS) is used to filter the output down to at most 100 bounding boxes. It should be noted that the maximum number of bounding boxes is configurable and that it may be useful to vary this number depending on the particular application.
As described above, in Detectron2, the inferred data includes bounding boxes. In some applications, it may be useful to have so-called instance segmentation information, which may, for example, provide a pixel-by-pixel classification within a bounding box. That is, the instance segmentation information may indicate whether pixels within the bounding box form part of the object. Further, instance segmentation information may include, for example, a binary mask for the ROI. As described above, with respect to the example in fig. 20, the ROI pooler essentially generates tensors that are a set of cropped instance features, and these tensors may be input to the FastRCNNConvFCHead box header. In other implementations where the generation of segmentation information is useful, a so-called mask header, based on, for example, Mask R-CNN or the like, may operate in parallel with the FastRCNNConvFCHead box header. Fig. 21 shows an example in which the box header 1050 shown in fig. 20 additionally includes a mask header unit 1060. Mask header unit 1060 essentially receives the set of cropped instance features and generates a segmentation mask for the ROI. Mask header unit 1060 may be configured to generate a mask according to a mask header (e.g., Mask R-CNN). It should be noted that, as provided above, the box header 1050 is a generic term for a sub-class of ROI (region of interest) headers. Thus, the mask header may be considered a sub-class of the ROI header. Furthermore, as described above, the ROI pooler may generate tensors of sizes other than [N x batch size, 256, 7, 7], where the ROI size is 7x7. With respect to the tensor input into the mask header unit 1060, this tensor may have a size of [N x batch size, 256, 14, 14], where the ROI size is 14x14. Furthermore, it should be noted that the channel count C, the height H, and the width W are all configurable parameters of the ROI pooler.
Fig. 22 shows an example of the mask header unit 1060. As shown in fig. 22, the mask header unit 1060 performs four consecutive conv2d k3 s1 p1 n256 operations before performing a conv2dT k2 s2 p0 n256 up-sampling operation. In addition, in the example shown in fig. 22, ReLU refers to an operation in which ReLU(x)=max(0,x). That is, if the output is negative, it is set to 0. Finally, a conv2d k1 s1 p0 n80 predictor operation generates the mask. Thus, as shown in fig. 22, the final 1x1 convolutional layer is utilized for mask prediction, where n80 specifies the number of classes.
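As an illustration of the mask header operations described above, a minimal PyTorch sketch may be written as follows. It should be noted that this is merely an illustrative sketch and not Detectron2's actual implementation; the class name is an assumption.
import torch
import torch.nn as nn

# Illustrative sketch of a mask-header-like module: four 3x3 convolutions with ReLU,
# a transposed-convolution up-sampling step, and a final 1x1 predictor over 80 classes.
class MaskHeaderSketch(nn.Module):
    def __init__(self, channels=256, num_classes=80):
        super().__init__()
        self.convs = nn.Sequential(*[
            nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1),  # conv2d k3 s1 p1 n256
                nn.ReLU())
            for _ in range(4)])
        self.upsample = nn.ConvTranspose2d(channels, channels, kernel_size=2, stride=2)  # conv2dT k2 s2 p0 n256
        self.predictor = nn.Conv2d(channels, num_classes, kernel_size=1, stride=1)       # conv2d k1 s1 p0 n80

    def forward(self, x):                    # x: [number of ROIs, 256, 14, 14]
        x = self.convs(x)
        x = torch.relu(self.upsample(x))     # up-sample the ROI features to 28x28
        return self.predictor(x)             # per-class mask logits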
As described above, inferred data (e.g., the spatial location of objects within an image) may be used to optimize encoding of video data (e.g., adjust encoding parameters to improve relative image quality in areas where objects of interest are present, etc.). Fig. 23 shows an example in which output data output by the inference network unit 1000 is input into the encoder control unit 404 to generate encoding parameters for the video encoder 200. As further described above, there may be several ways to compress/decompress feature data for communication over a communication network, e.g., quantization, channel pruning, etc. In addition to directly compressing the feature data (e.g., via channel pruning), there may be several methods of reducing the amount of data that needs to be transmitted across the communication network. In one example, in accordance with the techniques herein, the amount of feature data may be reduced by reducing the amount of input data processed by an automatic encoder (e.g., a backbone network). That is, referring to the example in fig. 16, the amount of feature data input into the compression engine 1100 may be reduced by reducing the amount of input data input into the backbone network unit 900 (resulting in a reduction of the generated bitstream). For example, in the case of video data, the video data may be temporally downsampled prior to processing by the backbone network in accordance with the techniques herein. For example, every Xth picture (e.g., every 10th picture) may be input into the backbone network. For example, in the case of 60Hz video, instead of inputting 60 pictures per second, 6 pictures per second may be input. In another example, pictures in a sequence may be divided into groups and each group assigned a group ID. For each group, a different process may be used for temporal downsampling. For example, group 0 may contain pictures with indices 5×m and 5×(m+1)−1, where m = 0, 1, 2, … k, and the remaining pictures belong to group 1. In one example, only the pictures in one group, or only a subset of the pictures within a group of pictures, may be processed by the backbone. In one example, for the exemplary group assignment process described above, if, for example, the number of remaining pictures is below a particular threshold (e.g., 5), the group assignment for pictures having an index greater than 5×(k+1)−1 may not be regular. In another example, group 0 may contain pictures with index 4×m, where m = 0, 1, 2, … k, while the remaining pictures belong to group 1. If, for example, the number of remaining pictures is below a particular threshold (e.g., 4), the group assignment for pictures having an index greater than 4×k may not be regular.
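As an illustration of the first group assignment rule described above, a minimal sketch may be written as follows. It should be noted that the function name is an assumption, and the irregular handling of a short trailing remainder described above is omitted.
def assign_group(picture_index):
    # pictures with index 5*m or 5*(m+1)-1 (m = 0, 1, 2, ...) are assigned to group 0,
    # and the remaining pictures are assigned to group 1
    return 0 if picture_index % 5 in (0, 4) else 1

# example: the first ten pictures receive group IDs [0, 1, 1, 1, 0, 0, 1, 1, 1, 0]
group_ids = [assign_group(i) for i in range(10)]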
Fig. 24 shows an example in which the encoding system shown in fig. 16 additionally includes a downsampling unit 1300 and an interpolation unit 1400. The downsampling unit 1300 is configured to downsample input data. For example, as described above, the downsampling unit may be configured to temporally downsample video data at fixed intervals. It should be noted that for some implementations, in addition to reducing the amount of feature data, temporally downsampling the input data may reduce processing requirements and thus increase the throughput of the backbone network 900.
The interpolation unit 1400 is configured to interpolate inferred data corresponding to information removed due to downsampling information. For example, in the example of video data, where the feature data is generated such that the inferred data includes bounding boxes for each picture input into the backbone network, the interpolation unit 1400 may be configured to interpolate the bounding boxes of downsampled (i.e., intermediate) pictures. In one example, in accordance with the techniques herein, generating (i.e., predicting) an intermediate bounding box may be based on the following equation:
wherein,
i=0,1
(x_0,0, y_0,0, x_0,1, y_0,1) is the bounding box of the object in picture 0, and (x_1,0, y_1,0, x_1,1, y_1,1) is the corresponding bounding box of the object in picture 1. (x_0,0, y_0,0, x_0,1, y_0,1) and (x_1,0, y_1,0, x_1,1, y_1,1) may be referred to as reference bounding boxes.
t_0 and t_1 are time instances (e.g., picture counts in display order) corresponding to picture 0 and picture 1, respectively.
t_intermediate is a time instance (e.g., picture count in display order) corresponding to an intermediate picture.
In one example, the correspondence may be established by: (1) Measuring the displacement between each pair of bounding boxes from picture 0 and picture 1, and pruning the pair list based on the threshold of displacement; and (2) identifying a closest bounding box for each bounding box in picture 0 and discarding remaining pairs corresponding to bounding boxes in picture 0, wherein the closest bounding box may be determined, for example, by spatial displacement and content contained within the bounding box (e.g., object type, SAD between samples of the bounding box). In one example, multiple bounding boxes may be selected from picture 1, e.g., a single representative bounding box in picture 1 may be acquired using n-nearest and average/median. It should be noted that interpolation can be extended to be based on M bounding boxes, where M is greater than two, more generally:
where picture 0 is the earliest picture among all M. In some cases, there may be more than one reference bounding box in a picture.
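As an illustration of intermediate bounding box generation from two reference bounding boxes, a minimal sketch may be written as follows. It should be noted that this sketch assumes simple linear interpolation in display-order time between the two reference bounding boxes; the function name and the (x0, y0, x1, y1) tuple layout are assumptions, and the extension to M reference bounding boxes is omitted.
def interpolate_bounding_box(bbox0, bbox1, t0, t1, t_intermediate):
    # bbox0 and bbox1 are corresponding reference bounding boxes in picture 0 and picture 1
    w = (t_intermediate - t0) / (t1 - t0)            # 0 at picture 0, 1 at picture 1
    return tuple(c0 + w * (c1 - c0) for c0, c1 in zip(bbox0, bbox1))

# example: an object at (10, 10, 50, 50) in picture count 0 and at (30, 10, 70, 50) in picture
# count 10 yields an interpolated bounding box of (20.0, 10.0, 60.0, 50.0) at picture count 5
intermediate_box = interpolate_bounding_box((10, 10, 50, 50), (30, 10, 70, 50), 0, 10, 5)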
Thus, for one or more downsampled pictures, an intermediate bounding box may be generated. For example, in an example in which 60Hz video is downsampled to 3Hz video, bounding boxes may be interpolated for the 15 th and 45 th pictures in the original sequence. Furthermore, in the above example where pictures are downsampled according to group assignments, interpolation may be adapted based on temporal picture distance. For example, interpolation rules may be specified for the temporal distance size. That is, for temporal picture distances, the number and space between bounding boxes to be interpolated for a picture may be defined. It should be noted that the rate of downsampling and interpolation may be determined based on the desired data compression and/or how the interpolation data is used to modify video encoding. Furthermore, the downsampling may be determined based on the expected throughput of a particular backbone network implementation. For example, in the case where interpolation data is used to ensure low level quantization and/or coarse filtering of the ROI is turned off (i.e., to ensure details are preserved), the downsampling rate may be relatively high and the rate at which interpolation occurs may be relatively low, i.e., as described above (60 Hz downsampled to 3Hz and 15 th and 45 th pictures interpolated), for example. In another example, where the interpolated data is used for motion prediction, the rate of downsampling may be relatively low and the rate at which interpolation occurs may be relatively high. It should be noted that the picture and the ROI therein may be used as a reference during bounding box interpolation and may be used as a reference during inter prediction. In one example, the frequency at which the picture/ROI is used for reference may be used to determine the quality at which the picture is encoded. It should be noted that the frequency may include an indirect reference, wherein the picture is used for bounding box interpolation and the interpolated bounding box is used for reference during inter prediction.
It should be noted that information about the movement of the bounding box may be used to assist the video encoder in selecting motion vectors for inter prediction. This may improve encoder performance. For example, as described above, the process of establishing correspondence between bounding boxes results in the generation of motion vectors between regions of corresponding pictures. In one example, these derived motion vectors may be anchored in a motion search space used in conventional video coding for regions in a picture that include a reference bounding box and an intermediate/interpolated bounding box, in accordance with the techniques herein. In one example, BDOF (i.e., bi-directional optical flow) and/or MVR (motion vector refinement) techniques may be used to search around corresponding motion vectors determined when establishing correspondence between bounding boxes. Further, in one example, a motion vector determined when a correspondence between bounding boxes is established may be added to a motion vector predictor list (e.g., a merge list). In a video encoder, motion estimation for regions within a picture may be performed within a reference picture determined by motion vectors derived when correspondence between bounding boxes is established.
As described above, discrete convolutions may be performed on video data, and such convolutions may downsample the video in both the spatial and temporal dimensions. Such a process may also be used to reduce the feature data prior to inputting the feature data into the compression engine. Furthermore, temporal downsampling may be achieved using pooling. It should be noted that the interpolation techniques described herein are generally applicable regardless of how temporal downsampling is implemented.
As shown in the example of fig. 24, parameters may be transferred to the downsampling unit 1300 and the interpolation unit 1400. Such parameters may include downsampling/interpolation rate. It should be noted that in some examples, a common set of parameters may be stored at each of the downsampling unit 1300 and the interpolation unit 1400. For example, in one example, the downsampling unit 1300 may operate according to a predefined downsampling process, and the interpolation unit 1400 may operate according to a corresponding predefined interpolation process. Further, in one example, parameters for downsampling/interpolation may be communicated out-of-band to each of the downsampling unit 1300 and interpolation unit 1400. For example, the downsampling rate may be determined based on the transmitted interpolation process and vice versa. Further, in one example, the parameters for downsampling/interpolation may be transmitted to each of the downsampling unit 1300 and interpolation unit 1400 using a bitstream (i.e., via multiplexing, for example).
It should be noted that, in one example, in accordance with the techniques herein, reconstructed feature data may be interpolated in addition to, or instead of, performing interpolation on inferred data. Fig. 25 shows an example in which the encoding system shown in fig. 16 additionally includes a downsampling unit 1300 and an interpolation unit 1500. Fig. 26 shows an example in which the encoding system shown in fig. 16 additionally includes a downsampling unit 1300, an interpolation unit 1400, and an interpolation unit 1500. The interpolation unit 1500 is configured to interpolate feature data corresponding to the information removed due to downsampling. As described above, for example with respect to fig. 12, the output feature map may be predictively encoded in a manner similar to that of video data (i.e., using typical video encoding techniques). Similarly, typical interpolation techniques for video coding, for example, Frame Rate Up Conversion (FRUC) techniques, may be used to interpolate the output feature map. In addition, typical BDOF and MVR techniques may be utilized. Interpolation unit 1500 may be configured to interpolate the reconstructed feature data, for example, using techniques similar to typical video coding interpolation techniques.
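As an illustration, a minimal sketch of interpolating an intermediate reconstructed feature map is provided below. It should be noted that this sketch uses a simple weighted average in display-order time as a stand-in for motion-compensated (e.g., FRUC-style) interpolation; the function name is an assumption.
import torch

def interpolate_feature_map(f0, f1, t0, t1, t_intermediate):
    # f0 and f1 are reconstructed feature tensors of shape [C, H, W] at times t0 and t1
    w = (t_intermediate - t0) / (t1 - t0)
    return (1.0 - w) * f0 + w * f1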
As described above, video content includes a video sequence composed of a series of pictures, and each picture may be divided into one or more regions. In VVC, the coded representation of a picture comprises the VCL NAL units of a particular layer within an AU and contains all CTUs of the picture. For example, referring again to fig. 3, the encoded representation of pic 3 is encapsulated in three encoded slice NAL units (i.e., slice 0 NAL unit, slice 1 NAL unit, and slice 2 NAL unit). It should be noted that the term Video Coding Layer (VCL) NAL unit is used as a generic term for coded slice NAL units, i.e., VCL NAL is a generic term including all types of slice NAL units. As described above, and described in further detail below, NAL units may encapsulate metadata for decoding video data. NAL units that encapsulate metadata for decoding a video sequence are commonly referred to as non-VCL NAL units. Thus, in VVC, a NAL unit may be a VCL NAL unit or a non-VCL NAL unit. It should be noted that the VCL NAL unit includes slice header data that provides information for decoding a particular slice. Thus, in VVC, information for decoding video data (which may be referred to as metadata in some cases) is not limited to being included in non-VCL NAL units. VVC specifies that a Picture Unit (PU) is a set of NAL units that are associated with each other according to a specified classification rule, are consecutive in decoding order, and contain exactly one coded picture, and that an Access Unit (AU) is a set of PUs that belong to different layers and contain coded pictures associated with the same time for output from the DPB. VVC further specifies that a layer is a set of VCL NAL units that all have a particular value of a layer identifier, and the associated non-VCL NAL units. Furthermore, in VVC, a PU consists of zero or one PH NAL unit, one coded picture (which consists of one or more VCL NAL units), and zero or more other non-VCL NAL units. Further, in VVC, a Coded Video Sequence (CVS) is a sequence of AUs that consists, in decoding order, of a coded video sequence start (CVSS) AU followed by zero or more AUs that are not CVSS AUs, including all subsequent AUs up to, but not including, any subsequent AU that is a CVSS AU, where a CVSS AU is an AU in which there is a PU for each layer in the CVS and the coded picture in each present picture unit is a Coded Layer Video Sequence Start (CLVSS) picture. In VVC, a Coded Layer Video Sequence (CLVS) is a sequence of PUs within the same layer that consists, in decoding order, of a CLVSS PU followed by zero or more PUs that are not CLVSS PUs, including all subsequent PUs up to, but not including, any subsequent PU that is a CLVSS PU. That is, in VVC, a bitstream may be described as including a sequence of AUs forming one or more CVSs.
Multi-layer video encoding enables a video presentation to be decoded/displayed as a presentation corresponding to a base layer of video data and decoded/displayed as one or more additional presentations corresponding to an enhancement layer of video data. For example, the base layer may enable presentation of video presentations having a base quality level (e.g., high definition presentation and/or 30Hz frame rate), and the enhancement layer may enable presentation of video presentations having an enhanced quality level (e.g., ultra-high definition rendering and/or 60Hz frame rate). The enhancement layer may be encoded by referring to the base layer. That is, a picture in the enhancement layer may be encoded (e.g., using inter-layer prediction techniques), for example, by referencing one or more pictures in the base layer (including scaled versions thereof). It should be noted that the layers may also be encoded independently of each other. In this case, there may be no inter-layer prediction between the two layers. Each NAL unit may include an identifier indicating the video data layer with which the NAL unit is associated. The sub-bitstream extraction process may be used to decode and display only a specific region of interest of a picture. In addition, the sub-bitstream extraction process may be used to decode and display only a specific video layer. Sub-bitstream extraction may refer to the process by which a device receiving a compatible or compliant bitstream forms a new compatible or compliant bitstream by discarding and/or modifying data in the received bitstream. For example, sub-bitstream extraction may be used to form a new compatible or compliant bitstream corresponding to a particular video representation (e.g., a high quality representation).
In VVC, each of video sequences, GOPs, pictures, slices, and CTUs may be associated with metadata describing video coding properties, and some types of metadata are encapsulated in non-VCL NAL units. VVC defines a set of parameters that may be used to describe video data and/or video coding properties. Specifically, VVC includes the following four parameter sets: Video Parameter Sets (VPSs), Sequence Parameter Sets (SPSs), Picture Parameter Sets (PPSs), and Adaptive Parameter Sets (APSs), wherein an SPS applies to zero or more entire CVSs, a PPS applies to zero or more entire encoded pictures, an APS applies to zero or more slices, and a VPS may optionally be referenced by an SPS. A PPS applies to the individual coded pictures that reference it. In VVC, parameter sets may be encapsulated as non-VCL NAL units and/or may be signaled as messages. VVC also includes a Picture Header (PH) that is encapsulated as a non-VCL NAL unit. In VVC, a picture header applies to all slices of an encoded picture. VVC further enables signaling of Decoding Capability Information (DCI) and Supplemental Enhancement Information (SEI) messages. In VVC, DCI and SEI messages assist in processes related to decoding, display, or other purposes; however, DCI and SEI messages may not be required for constructing the luminance or chrominance samples by the decoding process. In VVC, DCI and SEI messages may be signaled in a bitstream using non-VCL NAL units. Further, DCI and SEI messages may be conveyed by some mechanism other than by being present in the bitstream (i.e., signaled out-of-band).
Fig. 27 shows an example of a bitstream including a plurality of CVSs, wherein the CVSs include AUs, and the AUs include picture units. The example shown in fig. 27 corresponds to an example of encapsulating the slice NAL unit shown in the example of fig. 2 in a bitstream. In the example shown in fig. 27, the corresponding picture units of Pic 3 include three VCL NAL coded slice NAL units, namely a slice 0 NAL unit, a slice 1 NAL unit, and a slice 2 NAL unit, and two non-VCL NAL units, namely a PPS NAL unit and a PH NAL unit. It should be noted that in fig. 27, the header is a NAL unit header (i.e., not confused with a slice header). Further, it should be noted that in fig. 27, other non-VCL NAL units, not shown, may be included in the CVS, such as SPS NAL units, VPS NAL units, SEI message NAL units, and the like. Further, it should be noted that in other examples, PPS NAL units used to decode Pic 3 may be included elsewhere in the bitstream, e.g., in picture units corresponding to Pic 0, or may be provided by an external entity. As described in further detail below, in VVC, the PH syntax structure may be present in a slice header of the VCL NAL unit or in the PH NAL unit of the current PU.
VVC defines NAL unit header semantics that specify the type of original byte sequence payload (RBSP) data structure included in the NAL unit. Table 1 shows the syntax of the NAL unit header provided in VVC.
It should be noted that for the syntax descriptors used herein, the following may apply:
- b(8): byte with any bit string pattern (8 bits). The parsing process for this descriptor is specified by the return value of the function read_bits(8).
- f(n): a fixed-pattern bit string written using n bits (left to right) from the leftmost bit. The parsing process for this descriptor is specified by the return value of the function read_bits(n).
- se(v): signed integer 0th-order Exp-Golomb encoded syntax element, starting from the leftmost bit.
- tb(v): truncated binary using up to maxVal bits, with maxVal defined in the semantics of the syntax element.
- tu(v): truncated unary code using up to maxVal bits, where maxVal is defined in the semantics of the syntax element.
- u(n): unsigned integer using n bits. When n is "v" in the syntax table, the number of bits varies in a manner depending on the values of other syntax elements. The parsing process for this descriptor is specified by the return value of the function read_bits(n), interpreted as a binary representation of an unsigned integer with the most significant bit written first.
- ue(v): unsigned integer 0th-order Exp-Golomb encoded syntax element, starting from the leftmost bit.
TABLE 1
VVC provides the following definitions for the corresponding syntax elements shown in Table 1.
forbidden_zero_bit should be equal to 0.
Nuh_reserved_zero_bit should be equal to 0. The value 1 of nuh_reserved_zero_bit may be specified in the future by ITU-T|ISO/IEC. Although in this version of the specification the value of nuh_reserved_zero_bit is required to be equal to 0, a decoder conforming to this version of the specification should allow the value of nuh_reserved_zero_bit to be equal to 1 to appear in the syntax and should ignore (i.e., delete and discard from the bitstream) NAL units for which nuh_reserved_zero_bit is equal to 1.
Nuh_layer_id specifies an identifier of a layer to which a VCL NAL unit belongs or an identifier of a layer to which a non-VCL NAL unit applies. The value of nuh_layer_id should be in the range of 0 to 55 (inclusive). Other values of nuh_layer_id are reserved for future use by ITU-t|iso/IEC. Although in this version of the specification the value of nuh layer id is required to be in the range of 0 to 55 (inclusive), a decoder conforming to this version of the specification should allow values of nuh layer id greater than 55 to appear in the syntax and should ignore (i.e., delete and discard from the bitstream) NAL units with nuh layer id greater than 55.
The values of nuh layer id of all VCL NAL units of a coded picture should be the same. The value of nuh_layer_id of a coded picture or PU is the value of nuh_layer_id of the VCL NAL unit of the coded picture or PU.
When nal_unit_type is equal to ph_nut or fd_nut, nuh_layer_id should be equal to the nuh_layer_id of the associated VCL NAL unit.
When nal_unit_type is equal to eos_nut, nuh_layer_id should be equal to one of the nuh_layer_id values of the layers present in the CVS.
Note that the values of nuh_layer_id for DCI, OPI, VPS, AUD, and EOB NAL units are not constrained.
nuh_temporal_id_plus1 minus 1 specifies the temporal identifier of the NAL unit.
The value of nuh_temporal_id_plus1 should not be equal to 0.
The variable TemporalId is derived as follows:
TemporalId=nuh_temporal_id_plus1-1
When nal_unit_type is in the range of idr_w_radl to rsv_irap_11 (inclusive), TemporalId should be equal to 0.
When nal_unit_type is equal to STSA_NUT and vps_independent_layer_flag[ GeneralLayerIdx[ nuh_layer_id ] ] is equal to 1, TemporalId should be greater than 0.
The value of TemporalId should be the same for all VCL NAL units of an AU. The value of the TemporalId of an encoded picture, PU, or AU is the value of the TemporalId of the VCL NAL units of the encoded picture, PU, or AU. The value of the TemporalId of a sub-layer representation is the maximum value of the TemporalId of all VCL NAL units in the sub-layer representation.
The value of the TemporalId of a non-VCL NAL unit is constrained as follows:
If nal_unit_type is equal to dci_nut, opi_nut, vps_nut, or sps_nut, then TemporalId shall be equal to 0 and the TemporalId of the AU containing the NAL unit shall be equal to 0.
Otherwise, if nal_unit_type is equal to ph_nut, then the TemporalId should be equal to the TemporalId of the PU containing the NAL unit.
Otherwise, if nal_unit_type is equal to eos_nut or eob_nut, then the TemporalId should be equal to 0.
Otherwise, if nal_unit_type is equal to aud_nut, fd_nut, prefix_sei_nut, or suffix_sei_nut, then the TemporalId should be equal to the TemporalId of the AU containing the NAL unit.
Otherwise, when nal_unit_type is equal to pps_nut, prefix_aps_nut, or suffix_aps_nut, the TemporalId should be greater than or equal to the TemporalId of the PU containing the NAL unit.
Note that when the NAL unit is a non-VCL NAL unit, the value of the TemporalId is equal to the minimum of the TemporalId values of all AUs to which the non-VCL NAL unit applies. When nal_unit_type is equal to pps_nut, prefix_aps_nut, or suffix_aps_nut, the TemporalId may be greater than or equal to the TemporalId of the containing AU, because all PPSs and APSs may be included at the beginning of the bitstream (e.g., when they are transported out-of-band and the receiver places them at the beginning of the bitstream), with the first encoded picture having a TemporalId equal to 0.
Nal_unit_type specifies the NAL unit type, i.e., the type of RBSP data structure contained in the NAL unit as specified in table 2.
NAL units (whose semantics are not specified) with nal_unit_type in the range of UNSPEC_28..UNSPEC_31 (inclusive) should not affect the decoding process specified in this specification.
Note that NAL unit types within UNSPEC _28.. UNSPEC _31 may be used as determined by the application. The decoding process for these values of nal_unit_type is not specified in this specification. Since different applications may use these NAL unit types for different purposes, special attention is expected when designing encoders that generate NAL units with these nal_unit_type values and when designing decoders that interpret the content of NAL units with these nal_unit_type values. The present specification does not define any management of these values. These nal_unit_type values may only be applicable in contexts where the use of "collision" (i.e. meaning of NAL unit content of the same nal_unit_type value has different definitions) is not important, or is not possible or is managed, e.g. defined or managed in a control application or transport specification, or managed by controlling the environment in which the bitstream is distributed.
For purposes other than determining the amount of data in the DUs of the bitstream (as specified in annex C), the decoder will ignore (remove from the bitstream and discard) the contents of all NAL units that use the reserved value of nal_unit_type. Note-this requirement allows future definitions of compatible extensions to this specification.
TABLE 2
Note that a Clean Random Access (CRA) picture may have an associated RASL or RADL picture present in the bitstream.
Note that an Instantaneous Decoding Refresh (IDR) picture with nal_unit_type equal to idr_n_lp does not have an associated leading picture present in the bitstream. An IDR picture with nal_unit_type equal to idr_w_radl does not have an associated RASL picture present in the bitstream, but may have an associated RADL picture in the bitstream.
The value of nal_unit_type should be the same for all VCL NAL units of a sub-picture. A sub-picture is referred to as having the same NAL unit type as the VCL NAL unit of the sub-picture.
For VCL NAL units of any particular picture, the following applies:
If pps_mixed_ nalu _types_in_pic_flag is equal to 0, then the value of nal_unit_type should be the same for all VCL NAL units of the picture, and the picture or PU is said to have the same NAL unit type as the coded slice NAL unit of the picture or PU.
Otherwise (pps_mixed_ nalu _types_in_pic_flag equal to 1), all the following constraints apply:
The picture should have at least two sub-pictures.
The VCL NAL units of a picture should have two or more different nal_unit_type values.
VCL NAL units for pictures where nal_unit_type equals gdr_nut will not exist.
When nal_unit_type of a VCL NAL unit of a picture is equal to nalUnitTypeA of the value idr_w_radl, idr_n_lp or cra_nut, the nal_unit_type of the other VCL NAL units of the picture should be equal to nalUnitTypeA or trail_nut.
The value of nal_unit_type should be the same for all pictures of IRAP or GDR AU.
When the sps_video_parameter_set_id is greater than 0, vps_max_tid_ref_pics_plus 1[ i ] [ j ] is equal to 0 (for any value of i in the range of j equal to GeneralLayerIdx [ nuh_layer_id ] and j+1 to vps_max_layers_minus1 (including the end value)), and pps_mixed_ nalu _types_in_pic_flag is equal to 1, the value of nal_unit_type will not be equal to IDR_W_RADL, IDR_N_LP, or CRA_NUT.
The following constraints apply for the bitstream compliance requirements:
When the picture is the leading picture of the IRAP picture, the picture should be a RADL or RASL picture.
When the sub-picture is the leading sub-picture of the IRAP sub-picture, the sub-picture should be a RADL or RASL sub-picture.
When the picture is not the leading picture of the IRAP picture, the picture should not be a RADL or RASL picture.
When the sub-picture is not the leading sub-picture of the IRAP sub-picture, the sub-picture should not be an RADL or RASL sub-picture.
RASL pictures should not be present in the bitstream, which RASL pictures are associated with IDR pictures.
RASL sub-pictures should not be present in the bitstream, which RASL sub-pictures are associated with IDR sub-pictures.
RADL pictures should not be present in the bitstream, which RADL pictures are associated with IDR pictures with nal_unit_type equal to idr_n_lp.
Note that random access at the location of the IRAP PU can be performed by discarding all PUs preceding the IRAP AU (and correctly decoding the non-RASL pictures in the IRAP AU and all subsequent AUs in decoding order), provided that each parameter set (in the bitstream or by an external means not specified in this specification) is available when referenced.
RADL sub-pictures should not be present in the bitstream, which RADL sub-pictures are associated with IDR sub-pictures with nal_unit_type equal to idr_n_lp.
Any picture in decoding order with nuh_layer_id equal to a particular value layerId that precedes the IRAP picture with nuh_layer_id equal to layerId should precede the IRAP picture in output order and should precede any RADL picture associated with the IRAP picture in output order.
Any sub-picture in decoding order that is located before the IRAP sub-picture with nuh layer id equal to layerId and sub-picture index equal to subpicIdx, whose sub-picture index is equal to the specific value layerId and whose sub-picture index is equal to the specific value subpicIdx, should be located before the IRAP sub-picture and all its associated RADL sub-pictures in output order.
Any picture in decoding order with nuh layer id equal to the specific value layerId that precedes the recovery point picture with nuh layer id equal to layerId should precede the recovery point picture in output order.
Any sub-picture in decoding order that is located in the recovery point picture, nuh layer id equal to the specific value layerId and sub-picture index equal to the specific value subpicIdx, that is located before the sub-picture in the recovery point picture, nuh layer id equal to layerId and sub-picture index equal to subpicIdx, should be located before this sub-picture in output order.
Any RASL picture associated with a CRA picture should precede any RADL picture associated with a CRA picture in output order.
Any RASL sub-picture associated with a CRA sub-picture should precede any RADL sub-picture associated with a CRA sub-picture in output order.
Any RASL picture with nuh layer id equal to a particular value layerId associated with a CRA picture should be located in output order after any IRAP or GDR picture with nuh layer id equal to layerId, which is located in decoding order before the CRA picture.
Any RASL sub-picture associated with a CRA sub-picture for which nuh layer id is equal to a particular value layerId and the sub-picture index is equal to a particular value subpicIdx should be located in output order after any IRAP or GDR sub-picture located before the CRA sub-picture in decoding order for which nuh layer id is equal to layerId and the sub-picture index is equal to subpicIdx.
-If sps_field_seq_flag is equal to 0, the following applies: when the current picture with nuh layer id equal to the particular value layerId is a leading picture associated with an IRAP picture, then the current picture should precede all non-leading pictures associated with the same IRAP picture in decoding order. Otherwise (sps field seq flag equal to 1), let picA and picB be the first and last leading pictures associated with IRAP pictures, respectively, in decoding order, there should be at most one non-leading picture with nuh layer id equal to layerId before picA in decoding order, and there should not be a non-leading picture with nuh layer id equal to layerId between picA and picB in decoding order.
-If sps_field_seq_flag is equal to 0, the following applies: when nuh_layer_id is equal to a specific value layerId and the current sub-picture with a sub-picture index equal to a specific value subpicIdx is the leading sub-picture associated with the IRAP sub-picture, then the current sub-picture should precede all non-leading sub-pictures associated with the same IRAP sub-picture in decoding order. Otherwise (sps field_seq_flag equal to 1), let subpicA and subpicB be the first leading sub-picture and the last leading sub-picture associated with an IRAP sub-picture, respectively, in decoding order, there should be at most one non-leading sub-picture with nuh layer id equal to layerId and sub-picture index equal to subpicIdx before subpicA in decoding order, and there should not be a non-leading picture with nuh layer id equal to layerId and sub-picture index equal to subpicIdx between picA and picB in decoding order.
It should be noted that in general, an Intra Random Access Point (IRAP) picture is a picture that does not refer to any picture other than itself for prediction during its decoding process. In VVC, IRAP pictures may be Clean Random Access (CRA) pictures or Instantaneous Decoding Refresh (IDR) pictures. In VVC, the first picture in the bitstream, which is arranged in decoding order, must be an IRAP picture or a progressive decoding refresh (GDR) picture. VVC describes the concept of a leading picture, which is a picture preceding an associated IRAP picture in output order. VVC also describes the concept of a trailing picture, which is a non-IRAP picture following an associated IRAP picture in output order. The trailing picture associated with the IRAP picture is also after the IRAP picture in decoding order. For IDR pictures, there are no trailing pictures that need to reference pictures decoded before the IDR picture. VVC specifies that a CRA picture may have a leading picture following the CRA picture in decoding order and includes inter-picture prediction that references a picture decoded prior to the CRA picture. Thus, when CRA pictures are used as random access points, these leading pictures may not be decodable and are identified as Random Access Skip Leading (RASL) pictures. Another type of picture that may follow an IRAP picture in decoding order and precede the IRAP picture in output order is a Random Access Decodable Leading (RADL) picture that may not contain a reference to any picture that precedes the IRAP picture in decoding order. The GDR picture is a picture in which each VCL NAL unit has nal_unit_type equal to gdr_nut. If the current picture is a GDR picture associated with a picture header that sends a signaling syntax element, recovery_poc_cnt, and there is a picture picA in CLVS that follows the current GDR picture in decoding order and has a PicOrderCntVal equal to the PicORderCntVal of the current GDR picture plus the value of recovery_poc_cnt, then the picture picA is referred to as a recovery point picture.
As provided in table 2, the NAL unit may include a Video Parameter Set (VPS) syntax structure.
Table 3 shows the video parameter set syntax structure provided in JVET-T2001.
TABLE 3
With respect to Table 3, VVC provides the following semantics:
Before being referenced, the VPS RBSP shall be available for the decoding process, be included in at least one AU with a TemporalId equal to 0, or be provided by external means.
All VPS NAL units having a specific value of vps_video_parameter_set_id in the CVS should have the same content.
Vps_video_parameter_set_id provides an identifier of the VPS for reference by other syntax elements. The value of vps_video_parameter_set_id should be greater than 0.
Vps_max_layers_minus1 plus 1 specifies the number of layers specified by the VPS, which is the maximum allowed number of layers in each CVS of the reference VPS.
vps_max_sublayers_minus1 plus 1 specifies the maximum number of temporal sub-layers that may exist in a layer specified by the VPS. The value of vps_max_sublayers_minus1 should be in the range of 0 to 6 (inclusive).
vps_default_ptl_dpb_hrd_max_tid_flag equal to 1 specifies that the syntax elements vps_ptl_max_tid [ i ], vps_dpb_max_tid [ i ], and vps_hrd_max_tid [ i ] are not present and are inferred to be equal to the default value vps_max_sublayers_minus1. vps_default_ptl_dpb_hrd_max_tid_flag equal to 0 specifies that the syntax elements vps_ptl_max_tid [ i ], vps_dpb_max_tid [ i ], and vps_hrd_max_tid [ i ] are present. When not present, the value of vps_default_ptl_dpb_hrd_max_tid_flag is inferred to be equal to 1.
All layers specified by vps_all_independent_layers_flag equal to 1 specifying VPS are independently encoded without using inter-layer prediction, and one or more of the layers specified by vps_all_independent_layers_flag equal to 0 specifying VPS may use inter-layer prediction. When not present, it is inferred that the value of vps_all_independent_layers_flag is equal to 1.
Vps_layer_id [ i ] specifies the nuh_layer_id value of the i-th layer. For any two non-negative integer values of m and n, when m is less than n, the value of vps_layer_id [ m ] should be less than vps_layer_id [ n ].
Vps_independent_layer_flag [ i ] equal to 1 specifies that layers with index i do not use inter-layer prediction. A vps_independent_layer_flag [ i ] equal to 0 specifies that a layer with index i can use inter-layer prediction, and that there is a syntax element vps_direct_ref_layer_flag [ i ] [ j ] in the VPS, where j is in the range of 0 to i-1 (inclusive). When not present, the value of vps_independent_layer_flag [ i ] is inferred to be equal to 1.
vps_max_tid_ref_present_flag [ i ] equal to 1 specifies that the syntax element vps_max_tid_il_ref_pics_plus1[ i ] [ j ] may be present. vps_max_tid_ref_present_flag [ i ] equal to 0 specifies that the syntax element vps_max_tid_il_ref_pics_plus1[ i ] [ j ] is not present.
Vps_direct_ref_layer_flag [ i ] [ j ] equal to 0 specifies that the layer with index j is not a direct reference layer for the layer with index i. vps_direct_ref_layer_flag [ i ] [ j ] equals 1, designating the layer with index j as a direct reference layer for the layer with index i. When i and j are in the range of 0 to vps_max_layers_minus1 (inclusive), vps_direct_ref_layer_flag [ i ] [ j ] is not present, it is inferred to be equal to 0. When vps_independent_layer_flag [ i ] is equal to 0, at least one value of j should be in the range of 0 to i-1 (inclusive), such that the value of vps_direct_ref_layer_flag [ i ] [ j ] is equal to 1.
Variables NumDirectRefLayers [ i ], directRefLayerIdx [ i ] [ d ], numRefLayers [ i ], referenceLayerIdx [ i ] [ r ], and LayerUsedAsRefLayerFlag [ j ] are derived as follows:
The variable GeneralLayerIdx [ i ] specifying the layer index of the layer whose nuh_layer_id is equal to vps_layer_id [ i ] is derived as follows:
for(i=0;i<=vps_max_layers_minus1;i++)
GeneralLayerIdx[vps_layer_id[i]]=i
For any two different values of i and j, each in the range of 0 to vps_max_layers_minus1 (inclusive), the values of sps_chroma_format_idc and sps_ bitdepth _minus8 applied to the ith layer should be equal to the values of sps_chroma_format_idc and sps_ bitdepth _minus8 applied to the jth layer, respectively, when DEPENDENCYFLAG [ i ] [ j ] equals 1.
vps_max_tid_il_ref_pics_plus1[ i ] [ j ] equal to 0 specifies that pictures of the j-th layer that are neither IRAP pictures nor GDR pictures with ph_recovery_poc_cnt equal to 0 are not used as ILRPs for decoding pictures of the i-th layer. vps_max_tid_il_ref_pics_plus1[ i ] [ j ] greater than 0 specifies that, for decoding pictures of the i-th layer, pictures from the j-th layer with TemporalId greater than vps_max_tid_il_ref_pics_plus1[ i ] [ j ] - 1 are not used as ILRPs, and no APS with nuh_layer_id equal to vps_layer_id [ j ] and TemporalId greater than vps_max_tid_il_ref_pics_plus1[ i ] [ j ] - 1 is referenced. When not present, the value of vps_max_tid_il_ref_pics_plus1[ i ] [ j ] is inferred to be equal to vps_max_sublayers_minus1+1.
vps_each_layer_is_an_ols_flag equal to 1 specifies that each OLS specified by the VPS contains only one layer and that each layer specified by the VPS is itself an OLS with the single included layer being the only output layer. vps_each_layer_is_an_ols_flag equal to 0 specifies that at least one OLS specified by the VPS contains more than one layer. If vps_max_layers_minus1 is equal to 0, the value of vps_each_layer_is_an_ols_flag is inferred to be equal to 1. Otherwise, when vps_all_independent_layers_flag is equal to 0, the value of vps_each_layer_is_an_ols_flag is inferred to be equal to 0.
Vps_ols_mode_idc equal to 0 specifies: the total number of OLS specified by the VPS is equal to vps_max_layers_minus1+1, the ith OLS includes layers with layer indexes from 0 to i (inclusive), and for each OLS, only the highest layer in the OLS is the output layer.
Vps_ols_mode_idc equal to 1 specifies: the total number of OLS specified by the VPS is equal to vps_max_layers_minus1+1, the ith OLS includes layers with layer indexes from 0 to i (inclusive), and for each OLS, all layers in the OLS are output layers.
Vps_ols_mode_idc equal to 2 specifies: the total number of OLS specified by the VPS is signaled explicitly, and for each OLS, the output layer is signaled explicitly, and the other layers are layers that are direct or indirect reference layers to the output layer of the OLS.
The value of vps_ols_mode_idc should be in the range of 0 to 2 (inclusive). The value 3 of vps_ols_mode_idc is reserved for future use by ITU-t|iso/IEC. Decoders consistent with this version of the present specification should ignore OLS where vps_ols_mode_idc equals 3.
When vps_all_independent_layers_flag is equal to 1 and vps_each_layer_is_an_ols_flag is equal to 0, the value of vps_ols_mode_idc is inferred to be equal to 2.
vps_num_output_layer_sets_minus2 plus 2 specifies the total number of OLS specified by the VPS when vps_ols_mode_idc is equal to 2.
The variable olsModeIdc is derived as follows:
The variable TotalNumOlss specifying the total number of OLS specified by the VPS is derived as follows:
vps_ols_output_layer_flag[ i ][ j ] equal to 1 specifies that, when vps_ols_mode_idc is equal to 2, the layer with nuh_layer_id equal to vps_layer_id[ j ] is an output layer of the i-th OLS. vps_ols_output_layer_flag[ i ][ j ] equal to 0 specifies that, when vps_ols_mode_idc is equal to 2, the layer with nuh_layer_id equal to vps_layer_id[ j ] is not an output layer of the i-th OLS.
A variable NumOutputLayersInOls [ i ] specifying the number of output layers in the ith OLS, a variable NumSubLayersInLayerInOls [ i ] [ j ] specifying the number of sub-layers in the jth layer in the ith OLS, a variable OutputLayerIdInOls [ i ] [ j ] specifying the nuh_layer_id value of the jth output layer in the ith OLS, and a variable LayerUsedAsOutputLayerFlag [ k ] specifying whether the kth layer is used as an output layer in at least one of the OLS are derived as follows:
For each value of i in the range of 0 to vps_max_layers_minus1 (inclusive), the values of LayerUsedAsRefLayerFlag[ i ] and LayerUsedAsOutputLayerFlag[ i ] should not both be equal to 0. In other words, there should be no layer that is neither an output layer of at least one OLS nor a direct reference layer of any other layer.
For each OLS, at least one layer should be the output layer. In other words, for any value of i in the range of 0 to TotalNumOlss-1 (inclusive), the value of NumOutputLayersInOls [ i ] should be greater than or equal to 1.
A variable NumLayersInOls [ i ] specifying the number of layers in the ith OLS, a variable LayerIdInOls [ i ] [ j ] specifying the nuh_layer_id value of the j-th layer in the ith OLS, a variable NumMultiLayerOlss specifying the number of multi-layer OLS (i.e., OLS containing more than one layer), and a variable MultiLayerOlsIdx [ i ] specifying the index to the list of multi-layer OLS of the ith OLS when NumLayersInOls [ i ] is greater than 0 are derived as follows:
Note that the 0-th OLS contains only the lowest layer (i.e., the layer with nuh_layer_id equal to vps_layer_id[ 0 ]), and for the 0-th OLS the only included layer is output.
The lowest layer in each OLS should be an independent layer. In other words, for each i in the range of 0 to TotalNumOlss-1 (inclusive), the value of vps_independent_layer_flag[ GeneralLayerIdx[ LayerIdInOls[ i ][ 0 ] ] ] should be equal to 1.
Each layer should be included in at least one OLS specified by the VPS. In other words, for each layer with a particular value of nuh_layer_id, nuhLayerId, equal to one of vps_layer_id[ k ] for k in the range of 0 to vps_max_layers_minus1 (inclusive), there should be at least one pair of values of i and j, where i is in the range of 0 to TotalNumOlss-1 (inclusive) and j is in the range of 0 to NumLayersInOls[ i ]-1 (inclusive), such that the value of LayerIdInOls[ i ][ j ] is equal to nuhLayerId.
vps_num_ptls_minus1 plus 1 specifies the number of profile_tier_level( ) syntax structures in the VPS. The value of vps_num_ptls_minus1 should be less than TotalNumOlss. When not present, the value of vps_num_ptls_minus1 is inferred to be equal to 0.
vps_pt_present_flag[ i ] equal to 1 specifies that profile, tier, and general constraint information is present in the i-th profile_tier_level( ) syntax structure in the VPS. vps_pt_present_flag[ i ] equal to 0 specifies that profile, tier, and general constraint information is not present in the i-th profile_tier_level( ) syntax structure in the VPS. The value of vps_pt_present_flag[ 0 ] is inferred to be equal to 1. When vps_pt_present_flag[ i ] is equal to 0, the profile, tier, and general constraint information for the i-th profile_tier_level( ) syntax structure in the VPS is inferred to be the same as that for the (i-1)-th profile_tier_level( ) syntax structure in the VPS.
vps_ptl_max_tid[ i ] specifies the TemporalId of the highest sub-layer representation for which the level information is present in the i-th profile_tier_level( ) syntax structure in the VPS, and the TemporalId of the highest sub-layer representation present in an OLS with OLS index olsIdx such that vps_ols_ptl_idx[ olsIdx ] is equal to i. The value of vps_ptl_max_tid[ i ] should be in the range of 0 to vps_max_sublayers_minus1 (inclusive). When vps_default_ptl_dpb_hrd_max_tid_flag is equal to 1, the value of vps_ptl_max_tid[ i ] is inferred to be equal to vps_max_sublayers_minus1.
Vps_ptl_alignment_zero_bit should be equal to 0.
Vps_ols_ptl_idx [ i ] specifies the index to the list of profile_tier_level () syntax structures in the VPS of the profile_tier_level () syntax structure applied to the ith OLS. When present, the value of vps_ols_ptl_idx [ i ] should be in the range of 0 to vps_num_ ptls _minus1 (inclusive).
When not present, the value of vps_ols_ptl_idx [ i ] is inferred as follows:
-if vps_num_ ptls _minus1 is equal to 0, the value of vps_ols_ptl_idx [ i ] is inferred to be equal to 0.
Otherwise (vps_num_ ptls _minus1 is greater than 0 and vps_num_ ptls _minus1+1 is equal to TotalNumOlss), the value of vps_ols_ptl_idx [ i ] is inferred to be equal to i.
When NumLayersInOls[ i ] is equal to 1, the profile_tier_level( ) syntax structure applied to the i-th OLS is also present in the SPS referenced by the layer in the i-th OLS. It is a requirement of bitstream conformance that, when NumLayersInOls[ i ] is equal to 1, the profile_tier_level( ) syntax structures signaled in the VPS and in the SPS for the i-th OLS should be the same.
Each profile_tier_level () syntax structure in the VPS should be referenced by at least one value of vps_ols_ptl_idx [ i ], where i is in the range of 0 to TotalNumOlss-1 (inclusive).
vps_num_dpb_params_minus1 plus 1, when present, specifies the number of dpb_parameters( ) syntax structures in the VPS. The value of vps_num_dpb_params_minus1 should be in the range of 0 to NumMultiLayerOlss-1 (inclusive).
A variable VpsNumDpbParams specifying the number of dpb_parameters () syntax structures in VPS is derived as follows:
vps_sublayer_dpb_params_present_flag is used to control the presence of the syntax elements dpb_max_dec_pic_buffering_minus1[ j ], dpb_max_num_reorder_pics[ j ], and dpb_max_latency_increase_plus1[ j ] in the dpb_parameters( ) syntax structures in the VPS for j in the range of 0 to vps_dpb_max_tid[ i ]-1 (inclusive), when vps_dpb_max_tid[ i ] is greater than 0. When not present, the value of vps_sublayer_dpb_params_present_flag is inferred to be equal to 0.
vps_dpb_max_tid[ i ] specifies the TemporalId of the highest sub-layer representation for which the DPB parameters may be present in the i-th dpb_parameters( ) syntax structure in the VPS. The value of vps_dpb_max_tid[ i ] should be in the range of 0 to vps_max_sublayers_minus1 (inclusive). When not present, the value of vps_dpb_max_tid[ i ] is inferred to be equal to vps_max_sublayers_minus1.
For each m-th multi-layer OLS, the value of vps_dpb_max_tid[ vps_ols_dpb_params_idx[ m ] ] should be greater than or equal to vps_ptl_max_tid[ vps_ols_ptl_idx[ n ] ], where m is in the range of 0 to NumMultiLayerOlss-1 (inclusive) and n is the OLS index of the m-th multi-layer OLS among all OLSs.
vps_ols_dpb_pic_width[ i ] specifies the width, in units of luma samples, of each picture storage buffer for the i-th multi-layer OLS.
vps_ols_dpb_pic_height[ i ] specifies the height, in units of luma samples, of each picture storage buffer for the i-th multi-layer OLS.
Vps_ols_dpb_chroma_format [ i ] specifies the maximum allowed value of sps_chroma_format_idc for all SPS referenced by CLVS in the CVS for the ith multi-layer OLS.
vps_ols_dpb_bitdepth_minus8[ i ] specifies the maximum allowed value of sps_bitdepth_minus8 for all SPS referenced by CLVS in the CVS for the i-th multi-layer OLS. The value of vps_ols_dpb_bitdepth_minus8[ i ] should be in the range of 0 to 2 (inclusive).
Note that, to decode the i-th multi-layer OLS, the decoder can safely allocate memory for the DPB according to the values of the syntax elements vps_ols_dpb_pic_width[ i ], vps_ols_dpb_pic_height[ i ], vps_ols_dpb_chroma_format[ i ], and vps_ols_dpb_bitdepth_minus8[ i ].
vps_ols_dpb_params_idx[ i ] specifies the index, into the list of dpb_parameters( ) syntax structures in the VPS, of the dpb_parameters( ) syntax structure applied to the i-th multi-layer OLS. When present, the value of vps_ols_dpb_params_idx[ i ] should be in the range of 0 to VpsNumDpbParams-1 (inclusive).
When vps_ols_dpb_params_idx [ i ] is absent, it is inferred as follows:
-If VpsNumDpbParams is equal to 1, the value of vps_ols_dpb_params_idx[ i ] is inferred to be equal to 0.
-Otherwise (VpsNumDpbParams is greater than 1 and equal to NumMultiLayerOlss), the value of vps_ols_dpb_params_idx [ i ] is inferred to be equal to i.
For single layer OLS, the applicable dpb_parameters () syntax structure exists in the SPS referenced by the layers in the OLS.
Each dpb_parameters( ) syntax structure in the VPS should be referenced by at least one value of vps_ols_dpb_params_idx[ i ], where i is in the range of 0 to NumMultiLayerOlss-1 (inclusive).
Vps_time_hrd_parameters_present_flag equal to 1 specifies that the VPS contains a general_time_hrd_parameters () syntax structure and other HRD parameters, and vps_time_hrd_parameters_present_flag equal to 0 specifies that the VPS does not contain a general_time_hrd_parameters () syntax structure or other HRD parameters.
When NumLayersInOls [ i ] is equal to 1, the general_time_hrd_parameters () syntax structure and the ols_time_hrd_parameters () syntax structure applied to the ith OLS exist in the SPS referenced by the layer in the ith OLS.
vps_sublayer_cpb_parameters_present_flag equal to 1 specifies that the i-th ols_time_hrd_parameters( ) syntax structure in the VPS contains HRD parameters for the sub-layer representations with TemporalId in the range of 0 to vps_hrd_max_tid[ i ] (inclusive). vps_sublayer_cpb_parameters_present_flag equal to 0 specifies that the i-th ols_time_hrd_parameters( ) syntax structure in the VPS contains HRD parameters for the sub-layer representation with TemporalId equal to vps_hrd_max_tid[ i ] only. When vps_max_sublayers_minus1 is equal to 0, the value of vps_sublayer_cpb_parameters_present_flag is inferred to be equal to 0.
When vps_sublayer_cpb_parameters_present_flag is equal to 0, the HRD parameters for the sub-layer representations with TemporalId in the range of 0 to vps_hrd_max_tid[ i ]-1 (inclusive) are inferred to be the same as the HRD parameters for the sub-layer representation with TemporalId equal to vps_hrd_max_tid[ i ]. These include the HRD parameters starting from the fixed_pic_rate_general_flag[ i ] syntax element up to the sublayer_hrd_parameters( i ) syntax structure immediately under the condition "if( general_vcl_hrd_parameters_present_flag )" in the ols_time_hrd_parameters syntax structure.
Vps_num_ols_time_hrd_parameters_minus1 plus 1 specifies the number of ols_time_hrd_parameters () syntax structures present in the VPS when vps_time_hrd_parameters_present_flag equals 1. The value of vps_num_ols_time_hrd_params_minus1 should be in the range of 0 to NumMultiLayerOlss-1 (inclusive).
vps_hrd_max_tid[ i ] specifies the TemporalId of the highest sub-layer representation for which the HRD parameters are contained in the i-th ols_time_hrd_parameters( ) syntax structure. The value of vps_hrd_max_tid[ i ] should be in the range of 0 to vps_max_sublayers_minus1 (inclusive). When not present, the value of vps_hrd_max_tid[ i ] is inferred to be equal to vps_max_sublayers_minus1.
For each mth multi-layer OLS, the value of vps_hrd_max_tid [ vps_ols_timing_hrd_idx [ m ] ] should be greater than or equal to vps_ptl_max_tid [ vps_ols_ptl_idx [ n ] ], where m is 0 to NumMultiLayerOlss-1 (including all endpoints), and n is the OLS index of the mth multi-layer OLS among all OLSs.
Vps_ols_time_hrd_idx [ i ] specifies the index to the list of ols_time_hrd_parameters () syntax structures in the VPS of the ols_time_hrd_parameters () syntax structures applied to the i-th multi-layer OLS. The value of vps_ols_time_hrd_idx [ i ] should be in the range of 0 to vps_num_ols_time_hrd_params_minus1 (inclusive).
When vps_ols_timing_hrd_idx [ i ] is not present, it is inferred as follows:
-if vps_num_ols_time_hrd_params_minus1 is equal to 0, then the value of vps_ols_time_hrd_idx [ i ] is inferred to be equal to 0.
Otherwise (vps_num_ols_time_hrd_params_minus1+1 is greater than 1 and equal to NumMultiLayerOlss), the value of vps_ols_time_hrd_idx [ i ] is inferred to be equal to i.
For single layer OLS, the applicable ols_time_hrd_parameters () syntax structure is present in the SPS referenced by the layers in the OLS.
Each ols_time_hrd_parameters () syntax structure in the VPS should be referenced by at least one value of vps_ols_time_hrd_idx [ i ], where i is in the range of 1 to NumMultiLayerOlss-1 (inclusive).
The vps_extension_flag equal to 0 specifies that the vps_extension_data_flag syntax element does not exist in the VPS RBSP syntax structure. A vps_extension_flag equal to 1 specifies that a vps_extension_data_flag syntax element may exist in the VPS RBSP syntax structure. In a bitstream conforming to this version of the specification, vps_extension_flag should be equal to 0. However, some uses where vps_extension_flag is equal to 1 may be specified in some future versions of the present specification, and a decoder conforming to this version of the present specification should allow a value of vps_extension_flag equal to 1 to appear in the syntax.
vps_extension_data_flag may have any value. Its presence and value do not affect the decoding process specified in this version of the specification. A decoder conforming to this version of the present specification should ignore all vps_extension_data_flag syntax elements.
As described above, output feature maps including those corresponding to temporally downsampled video may be predictively encoded in a similar manner to video data (i.e., using typical video encoding techniques). Thus, in one example, an output feature map corresponding to a temporally downsampled video may be encoded as a video layer, i.e., a feature layer, in accordance with the techniques herein. For example, the feature layer may be encapsulated as an ITU-T h.265 video layer or a VVC video layer, depending on the structure and syntax provided above. In one example, the machine task related syntax/structure may be encapsulated in NAL units of different types. In one example, when the syntax/structure used to signal the feature layer is the same as, for example, the ITU-t h.265 or VVC syntax, the packing/unpacking process may be defined to organize feature tensors into pictures in a corresponding chroma format. In one example, the syntax/structure may include additional syntax/structures of ITU-T h.265 or VVC syntax, e.g., as an extension of VVC.
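To make the packing/unpacking idea above concrete, the following is a minimal sketch, assuming a C x H x W floating-point feature tensor is tiled into a single monochrome (4:0:0) picture and uniformly quantized to the sample bit depth so that it can be fed to a conventional video encoder. The function name, tiling layout, and quantization scheme are illustrative assumptions and are not specified by the source.

import numpy as np

def pack_feature_tensor(features: np.ndarray, bit_depth: int = 10) -> np.ndarray:
    # Tile a C x H x W feature tensor into a single 2-D picture of unsigned
    # integer samples (monochrome frame). Layout and quantization are
    # illustrative only (assumption, not the source's normative process).
    c, h, w = features.shape
    cols = int(np.ceil(np.sqrt(c)))      # tiles per row
    rows = int(np.ceil(c / cols))        # tile rows
    picture = np.zeros((rows * h, cols * w), dtype=np.uint16)

    # Uniformly quantize the floating-point features to the sample range.
    f_min, f_max = features.min(), features.max()
    scale = ((1 << bit_depth) - 1) / max(f_max - f_min, 1e-9)

    for k in range(c):
        r, col = divmod(k, cols)
        tile = np.round((features[k] - f_min) * scale).astype(np.uint16)
        picture[r * h:(r + 1) * h, col * w:(col + 1) * w] = tile
    return picture

The inverse (unpacking) step at the decoder would undo the tiling and dequantize using the same minimum value and scale, which would therefore also need to be conveyed.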
In some cases, it may be useful to include, in a bitstream that includes a feature layer, a layer that corresponds to an encoded version of the video that has been temporally downsampled (i.e., the encoded input video). For example, after the encoded input video has been reconstructed, the feature data may be used for one or more enhancements. For example, a bounding box may be overlaid on the video during presentation and/or objects in the image may be enhanced. In one example, the same bitstream may be used for both machine task execution and human consumption. In such an organization, the layers needed for the target consumption can be extracted more easily.
In one example, in accordance with the techniques herein, one or more of the syntax elements provided above for the VVC NAL unit header may have a value indicating that a layer includes an output feature map corresponding to temporally downsampled video. For example, a nuh_layer_id value greater than 55 and/or a nal_unit_type value equal to a currently reserved value may be used to indicate that a layer is a feature layer. In one example, in accordance with the techniques herein, a flag may be included in the VPS indicating that a layer is a feature layer. For example, the VPS extension data may include an indication that a layer is a feature layer. In one example, it may be inferred that inter-layer prediction is disabled between a video layer and a feature layer, in which case the corresponding (i.e., inter-layer prediction) syntax elements need not be included in the bitstream.
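As a rough illustration of the layer-identification options just described, the sketch below checks a reserved nuh_layer_id range and a VPS-extension flag. The threshold constant FEATURE_LAYER_ID_MIN and the flag name vps_feature_layer_flag are hypothetical placeholders introduced here for illustration; they are not defined by the source or by VVC.

FEATURE_LAYER_ID_MIN = 56   # hypothetical: nuh_layer_id values greater than 55

def is_feature_layer(nuh_layer_id: int, vps_feature_layer_flag: dict) -> bool:
    # Decide whether a layer carries an output feature map, using either a
    # reserved nuh_layer_id range or a hypothetical per-layer VPS-extension flag.
    if nuh_layer_id >= FEATURE_LAYER_ID_MIN:
        return True
    return bool(vps_feature_layer_flag.get(nuh_layer_id, 0))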
In one example, in accordance with the techniques herein, a maximum number of inference predictions (e.g., object detections) that can be generated at a decoder may be specified. In one example, the maximum number of inference predictions may be specified by a profile corresponding to the feature layer. It should be noted that the maximum number of inference predictions helps to bound the computational complexity of the inference engine. In one example, in accordance with the techniques herein, the number of inference predictions (e.g., object detections) to be generated at a decoder may be specified. In one example, the number of inference predictions may be signaled in a parameter set within the bitstream, e.g., as VPS extension data. In other examples, the number of inference predictions may be signaled in the SPS and/or PPS. Alternatively, the number of inference predictions may be signaled in more than one parameter set, with lower-level parameter set values overriding higher-level parameter set values.
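A minimal sketch of the parameter-set override behaviour described above follows, assuming hypothetical VPS-, SPS-, and PPS-level values for the number of inference predictions; the function name, argument names, and the default value are assumptions made for illustration only.

def max_inference_predictions(vps_val=None, sps_val=None, pps_val=None, default=100):
    # Resolve the number of inference predictions the decoder should generate,
    # with lower-level parameter sets overriding higher-level ones
    # (PPS over SPS over VPS). All names and the default are hypothetical.
    for val in (pps_val, sps_val, vps_val):
        if val is not None:
            return val
    return default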
In one example, the layer corresponding to the classification (e.g., the linear layer used in the classification score generator in the ROI head of the Faster R-CNN network) may be retrained. Each retraining produces a different set of parameter values for that layer. In one example, each set of parameter values for the layer may correspond to an operating point of the feature compression engine. In one example, the operating point may be a rate constraint used during training. The set of parameter values for the layer may be signaled along with the compressed feature data. The signaling may consist of transmitting an index. That is, signaling the entire set of parameter values for the layer may be too expensive (in terms of bit cost). Thus, all (or a subset) of the parameter value sets for the layer for different typical operating points may be predetermined and associated with an index. In the case where only a subset of the parameter value sets is predetermined, the non-predetermined parameter values may be signaled explicitly.
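The sketch below illustrates the index-based selection just described, assuming a hypothetical predetermined table of classification-layer parameter sets keyed by operating point; the table contents, file names, and function name are placeholders and not part of the source.

# Hypothetical table of predetermined weight sets for the retrained
# classification layer, indexed by operating point (e.g., rate constraint).
PREDETERMINED_CLS_PARAMS = {
    0: "cls_weights_low_rate.npz",
    1: "cls_weights_mid_rate.npz",
    2: "cls_weights_high_rate.npz",
}

def select_classification_params(op_point_idx, explicit_params=None):
    # Select the parameter set for the classification layer: either use
    # explicitly signaled (non-predetermined) parameters, or look up a
    # predetermined set by the signaled operating-point index.
    if explicit_params is not None:
        return explicit_params
    return PREDETERMINED_CLS_PARAMS[op_point_idx]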
In one example, in accordance with the techniques herein, a size to be used by a decoder performing inference (e.g., a size used in spatially scaling a prediction) may be signaled. That is, the neural network may produce predictions at a resolution different from the resolution of the picture used as input. For example, a predicted bounding box (x_left_top, y_left_top, x_right_bottom, y_right_bottom) may be output in a coordinate space where 0 represents the corresponding left or top edge of the picture, width_pred and height_pred represent the corresponding right and bottom edges of the picture, respectively, and coordinate values that are negative or greater than width_pred or height_pred, respectively, lie outside the picture. In this case, the prediction must be scaled back to the original picture resolution to identify the corresponding spatial location (e.g., the spatial location of a bounding box or segmentation mask). To facilitate scaling, the original resolution of the input picture may be included in the bitstream. For example, in one example, the scaled prediction may be specified as follows:
x_scaled[ i ] = x_pred[ i ] * ( width_orig / width_pred )
y_scaled[ i ] = y_pred[ i ] * ( height_orig / height_pred )
where i indexes the vertex coordinates of the bounding box/segmentation mask, (x_pred[ i ], y_pred[ i ]) are the predicted coordinates, and width_orig and height_orig are the width and height of the original input picture. Here, the predicted coordinates are absolute positions. Generally, (x_scaled[ i ], y_scaled[ i ]) is the absolute position in the original picture.
In another example, the scaled prediction may be specified as follows:
x_scaled[ i ] = x_pred[ i ] * width_orig
y_scaled[ i ] = y_pred[ i ] * height_orig
where i indexes the vertex coordinates of the bounding box/segmentation mask. Here, the predicted coordinates (x_pred[ i ], y_pred[ i ]) are relative positions between 0 and 1 (inclusive). Generally, (x_scaled[ i ], y_scaled[ i ]) is the absolute position in the original picture.
For each of these cases, the linear scaling to be used may need to be signaled. One way to achieve this is to signal the original resolution, since the decoder is typically aware of the prediction resolution. Using the original resolution and the prediction resolution, the decoder may derive the corresponding linear scaling values. Furthermore, if the transform to be performed at the decoder is a transform other than a linear scaling (e.g., an affine transform), then the parameters of the transform may need to be (and may be) signaled.
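A sketch of this scaling derivation follows, assuming the decoder knows the prediction resolution and receives the original resolution in the bitstream. The relative flag distinguishes the absolute-coordinate and relative-coordinate cases described above; the function and parameter names are illustrative.

def scale_bbox_to_original(bbox, pred_size, orig_size, relative=False):
    # Map a predicted bounding box back to the original picture grid.
    # bbox = (x_left, y_top, x_right, y_bottom); pred_size and orig_size
    # are (width, height). If coordinates are relative (in [0, 1]), only
    # the original size is needed.
    w_pred, h_pred = pred_size
    w_orig, h_orig = orig_size
    sx = w_orig if relative else w_orig / w_pred
    sy = h_orig if relative else h_orig / h_pred
    x0, y0, x1, y1 = bbox
    return (x0 * sx, y0 * sy, x1 * sx, y1 * sy)

# Example: scale_bbox_to_original((12, 30, 80, 96), (128, 128), (1920, 1080))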
In this way, the encoding system described herein represents an example of a device configured to: receive reconstructed feature data, wherein the reconstructed feature data corresponds to video data that has been temporally downsampled; generate a bounding box for the reconstructed feature data; and interpolate the bounding box for a portion of the video that has been temporally downsampled.
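To make the interpolation step concrete, the following sketch linearly interpolates bounding-box coordinates for an intermediate (skipped) picture from boxes generated for two temporally neighbouring reconstructed pictures, using picture order counts as the time axis. Linear interpolation and the POC-based weighting are assumptions chosen for illustration; the techniques herein do not mandate a particular interpolation.

def interpolate_bboxes(bbox_prev, bbox_next, poc_prev, poc_next, poc_mid):
    # Linearly interpolate the four bounding-box coordinates of an
    # intermediate picture (poc_mid) from the boxes of two temporally
    # neighbouring pictures (poc_prev < poc_mid < poc_next assumed).
    t = (poc_mid - poc_prev) / float(poc_next - poc_prev)
    return tuple((1 - t) * a + t * b for a, b in zip(bbox_prev, bbox_next))

# Example: interpolate_bboxes((10, 20, 50, 80), (14, 22, 56, 84), 0, 4, 2)
# yields the box for the picture halfway between the two reconstructed pictures.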
In one or more examples, the functions described may be implemented by hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium, and executed by a hardware-based processing unit. The computer-readable medium may comprise a computer-readable storage medium corresponding to a tangible medium, such as a data storage medium, or a propagation medium comprising any medium that facilitates the transfer of a computer program from one place to another, for example, according to a communication protocol. As such, the computer-readable medium may generally correspond to: (1) A non-transitory tangible computer readable storage medium, or (2) a communication medium such as a signal or carrier wave. Data storage media may be any available media that can be accessed by one or more computers or one or more processors to retrieve instructions, code and/or data structures for implementing the techniques described in this disclosure. The computer program product may include a computer-readable medium.
By way of example, and not limitation, such computer-readable storage media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, flash memory, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer. Also, any connection is properly termed a computer-readable medium. For example, if the instructions are transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. However, it should be understood that computer-readable storage media and data storage media do not include connections, carrier waves, signals, or other transitory media, but are instead directed to non-transitory tangible storage media. Disk and disc, as used herein, includes Compact Disc (CD), laser disc, optical disc, digital Versatile Disc (DVD), floppy disk and Blu-ray disc where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
The instructions may be executed by one or more processors, such as one or more Digital Signal Processors (DSPs), general purpose microprocessors, application Specific Integrated Circuits (ASICs), field programmable logic arrays (FPGAs), or other equivalent integrated or discrete logic circuitry. Thus, the term "processor" as used herein may refer to any of the foregoing structure or any other structure suitable for implementation of the techniques described herein. Furthermore, in some aspects, the functionality described herein may be provided within dedicated hardware and/or software modules configured for encoding and decoding, or incorporated into a combined codec. Moreover, these techniques may be implemented entirely in one or more circuits or logic elements.
The techniques of this disclosure may be implemented in various devices or apparatuses, including a wireless handset, an Integrated Circuit (IC), or a set of ICs (e.g., a chipset). Various components, modules, or units are described in this disclosure to emphasize functional aspects of devices configured to perform the disclosed techniques but do not necessarily require realization by different hardware units. Rather, as described above, the various units may be combined in a codec hardware unit or provided by an interoperating hardware unit comprising a set of one or more processors as described above, in combination with suitable software and/or firmware.
Further, each functional block or various features of the base station apparatus and the terminal apparatus used in each of the above-described embodiments may be realized or executed by a circuit (typically, an integrated circuit or a plurality of integrated circuits). Circuits designed to perform the functions described in this specification may include general purpose processors, digital Signal Processors (DSPs), application specific or general purpose integrated circuits (ASICs), field Programmable Gate Arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic, or discrete hardware components, or combinations thereof. A general purpose processor may be a microprocessor, or in the alternative, the processor may be a conventional processor, controller, microcontroller, or state machine. The general purpose processor or each of the above circuits may be configured by digital circuitry or may be configured by analog circuitry. In addition, when a technology of manufacturing an integrated circuit that replaces the current integrated circuit occurs due to progress in semiconductor technology, the integrated circuit produced by the technology can also be used.
Various examples have been described. These and other examples are within the scope of the following claims.
< Cross-reference >
This non-provisional application claims priority under 35 U.S.C. §119 to provisional application No. 63/243,065, filed on September 10, 2021, and provisional application No. 63/243,554, filed on September 13, 2021, the entire contents of which are hereby incorporated by reference.

Claims (12)

1. A method of interpolating inferred data corresponding to reconstructed feature data, the method comprising:
receiving reconstructed feature data, wherein the reconstructed feature data corresponds to video data that has been temporally downsampled;
generating a bounding box for the reconstructed feature data; and
interpolating a bounding box for a portion of the video that has been temporally downsampled.
2. The method of claim 1, wherein the temporally downsampled video corresponds to video downsampled by sampling pictures from a source video at fixed intervals.
3. The method of claim 1, wherein generating a bounding box for the reconstructed feature data comprises generating a bounding box from a defined region proposal network.
4. The method of claim 3, wherein the defined region proposal network is defined according to a Detectron-based object detection system.
5. The method of claim 1, wherein interpolating a bounding box for a portion of the video that has been temporally downsampled comprises: calculating spatial coordinates of a bounding box for an intermediate picture based on spatial coordinates of a generated bounding box corresponding to a picture having a temporal relationship with the intermediate picture.
6. The method of claim 5, further comprising performing motion prediction using the interpolated bounding box.
7. An apparatus comprising one or more processors configured to:
receive reconstructed feature data, wherein the reconstructed feature data corresponds to video data that has been temporally downsampled;
generate a bounding box for the reconstructed feature data; and
interpolate a bounding box for a portion of the video that has been temporally downsampled.
8. The apparatus of claim 7, wherein the temporally downsampled video corresponds to video downsampled by sampling pictures from a source video at fixed intervals.
9. The apparatus of claim 7, wherein generating a bounding box for the reconstructed feature data comprises generating a bounding box according to a defined region proposal network.
10. The apparatus of claim 9, wherein the defined region proposal network is defined according to a Detectron-based object detection system.
11. The apparatus of claim 9, wherein interpolating a bounding box for a portion of the video that has been temporally downsampled comprises: calculating spatial coordinates of a bounding box for an intermediate picture based on spatial coordinates of a generated bounding box corresponding to a picture having a temporal relationship with the intermediate picture.
12. The apparatus of claim 11, wherein the one or more processors are further configured to perform motion prediction using the interpolated bounding box.