WO2023151903A1 - A method, an apparatus and a computer program product for video coding - Google Patents

Info

Publication number
WO2023151903A1
Authority
WO
WIPO (PCT)
Prior art keywords
model
probability
finetuning
elements
representation
Prior art date
Application number
PCT/EP2023/050956
Other languages
French (fr)
Inventor
Honglei Zhang
Francesco Cricrì
Nannan ZOU
Hamed REZAZADEGAN TAVAKOLI
Original Assignee
Nokia Technologies Oy
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nokia Technologies Oy
Publication of WO2023151903A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/13 Adaptive entropy coding, e.g. adaptive variable length coding [AVLC] or context adaptive binary arithmetic coding [CABAC]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/146 Data rate or code amount at the encoder output
    • H04N19/147 Data rate or code amount at the encoder output according to rate distortion criteria
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/17 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/90 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using coding techniques not provided for in groups H04N19/10-H04N19/85, e.g. fractals
    • H04N19/91 Entropy coding, e.g. variable length coding [VLC] or arithmetic coding

Definitions

  • the present solution generally relates to video encoding and coding.
  • Video Coding for Machines VCM
  • an apparatus for encoding comprising means for receiving a representation of input media; means for performing a compression of the representation by means of at least a probability model and arithmetic codec to generate bitstreams to be delivered to a decoder, wherein the probability model comprises a prediction model to estimate probability distribution of the elements of the representation to be encoded, and wherein the arithmetic encoder is configured to encode elements to bitstream using at least the estimated probability distributions; means to adapt the prediction model according to a set of finetuning parameters by using elements that have already been processed as true values, wherein finetuning comprises minimizing a rate loss calculated by the probability distribution estimation for a resolution level; wherein the apparatus further comprises means for selecting a prediction model with optimal rate loss as a best model; and means for using the best model as the probability model.
  • an apparatus for decoding comprising means for receiving an encoded bitstream; means for obtaining finetuning parameters; means for decompressing the bitstream to generate a representation of an output media by means of a probability model and arithmetic decoder, wherein the probability model comprises a prediction model to estimate probability distribution of the elements of the representation, and wherein the arithmetic decoder is configured to decode elements from the bitstream using at least the estimated probability distributions; means for adapting the prediction model according to at least a set of finetuning parameters by using elements that have already been processed as true values, wherein finetuning comprises minimizing a rate loss calculated by the probability distribution estimation for a resolution level; wherein the apparatus further comprises means for selecting a prediction model as a best model according to the set of finetuning parameters; and means for using the best model as the probability model.
  • a method for encoding comprising: receiving a representation of input media; performing a compression of the representation by means of at least a probability model and arithmetic encoder to generate bitstreams to be delivered to a decoder, wherein the probability model comprises a prediction model to estimate probability distribution of the elements of the representation to be encoded, and wherein the arithmetic encoder is configured to encode elements to bitstream using at least the estimated probability distributions; adapting the prediction model according to a set of finetuning parameters by using elements that have already been processed as true values, wherein finetuning comprises minimizing a rate loss calculated by the probability distribution estimation for a resolution level; wherein the method further comprises selecting a prediction model with optimal rate loss as a best model and using the best model as the probability model.
  • a method for decoding comprising: receiving an encoded bitstream; obtaining finetuning parameters; decompressing the bitstream to generate a representation of an output media by means of a probability model and arithmetic decoder, wherein the probability model comprises a prediction model to estimate probability distribution of the elements of the representation, and wherein the arithmetic decoder is configured to decode elements from the bitstream using at least the estimated probability distributions; adapting the prediction model according to at least a set of finetuning parameters by using elements that have already been processed as true values, wherein finetuning comprises minimizing a rate loss calculated by the probability distribution estimation for a resolution level; wherein the method further comprises selecting a prediction model as a best model according to the set of finetuning parameters; using the best model as the probability model.
  • an apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following: receive a representation of input media; perform a compression of the representation by means of at least a probability model and arithmetic encoder to generate bitstreams to be delivered to a decoder, wherein the probability model comprises a prediction model to estimate probability distribution of the elements of the representation to be encoded, and wherein the arithmetic encoder is configured to encode elements to bitstream using at least the estimated probability distributions; adapt the prediction model according to a set of finetuning parameters by using elements that have already been processed as true values, wherein finetuning comprises minimizing a rate loss calculated by the probability distribution estimation for a resolution level; and select a prediction model with optimal rate loss as a best model and use the best model as the probability model.
  • an apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following: receive an encoded bitstream; obtain finetuning parameters from; decompress the bitstream to generate a representation of an output media by means of a probability model and arithmetic decoder, wherein the probability model comprises a prediction model to estimate probability distribution of the elements of the representation, and wherein the arithmetic decoder is configured to decode elements from the bitstream using the estimated probability distributions; adapt the prediction model according to a set of finetuning parameters by using elements that have already been processed as true values, wherein finetuning comprises minimizing rate loss calculated by the probability distribution estimation for a resolution level; wherein the apparatus further comprises means for selecting a prediction model as a best model according to the set of finetuning parameters ; and use the best model as the probability model.
  • According to a seventh aspect, there is provided a computer program product comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to: receive a representation of input media; perform a compression of the representation by means of at least a probability model and arithmetic encoder to generate bitstreams to be delivered to a decoder, wherein the probability model comprises a prediction model to estimate probability distribution of the elements of the representation to be encoded, and wherein the arithmetic encoder is configured to encode elements to bitstream using at least the estimated probability distributions; adapt the prediction model according to a set of finetuning parameters by using elements that have already been processed as true values, wherein finetuning comprises minimizing a rate loss calculated by the probability distribution estimation for a resolution level; and select a prediction model with optimal rate loss as a best model and use the best model as the probability model.
  • computer program product comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to: receive an encoded bitstream; obtain finetuning parameters from; decompress the bitstream to generate a representation of an output media by means of a probability model and arithmetic decoder, wherein the probability model comprises a prediction model to estimate probability distribution of the elements of the representation, and wherein the arithmetic decoder is configured to decode elements from the bitstream using at least the estimated probability distributions; adapt the prediction model according to the set of finetuning parameters by using elements that have already been processed as true values, wherein finetuning comprises minimizing a rate loss calculated by the probability distribution estimation for a resolution level; wherein the apparatus further comprises means for selecting a prediction model as a best model according to the set of finetuning parameters ; and use the best model as the probability model.
  • the probability model is configured to partition the elements into a plurality of groups, and estimate the probability distribution of elements in said groups and adapt the prediction model by using groups that have already been processed as true values.
  • the partition is based on one of the following: spatial dimensions, temporal dimensions, channel dimensions.
  • the representation is a latent representation of an image, or a frame of a video sequence, or a block of an image, or a slice of an image, where the latent representation may be obtained by one or more processing steps such as by applying a neural network on the input media; or the representation is an unprocessed or substantially unprocessed version of the input media or an uncompressed image or video.
  • parameters used in finetuning are delivered to a decoder.
  • the parameters further comprise a number of iterations needed for the finetuning.
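  • The adaptation described above can be illustrated with a minimal sketch. It is not the claimed neural prediction model: a simple count-based frequency table stands in for the prediction model, the update size `step` is a hypothetical finetuning parameter, and the arithmetic codec is replaced by the ideal code length of -log2(p) bits per element. The names `adaptive_rate` and `select_best_step` are illustrative only.

```python
import math

def adaptive_rate(symbols, alphabet_size, step):
    """Ideal code length (bits) of `symbols` under a count-based model that
    is adapted only with elements that have already been processed."""
    counts = [1.0] * alphabet_size           # uniform starting model
    bits = 0.0
    for s in symbols:
        p = counts[s] / sum(counts)          # estimated probability of s
        bits += -math.log2(p)                # rate loss for this element
        counts[s] += step                    # adapt using the true value
    return bits

def select_best_step(symbols, alphabet_size, candidate_steps):
    """Try each candidate finetuning parameter and keep the one with the
    lowest total rate; this value would be delivered to the decoder."""
    return min(candidate_steps,
               key=lambda step: adaptive_rate(symbols, alphabet_size, step))

symbols = [0] * 20 + [1] * 2
best = select_best_step(symbols, 2, [0.0, 1.0, 4.0])
print(best, adaptive_rate(symbols, 2, best))
```

  • Because each update uses only elements that were already coded, a decoder that receives the selected parameter can repeat the same updates on the decoded elements and stay synchronized with the encoder.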
  • the apparatus for encoding further comprises means for determining a representative data from the input dataset; and means for adapting the probability model to the representative data; and means for using the adapted probability model for other data in the input dataset.
  • the apparatus for encoding further comprises means for clustering the input data into a plurality of clusters; and means for adapting the probability model to the centroid data of each cluster; and means for assigning an identifier (ID) to the adapted probability model for a centroid data; and means for using the adapted probability model for the centroid data in a cluster to other data in the same cluster; and means for delivering the ID of the adapted probability model to the decoder.
  • ID identifier
  • the computer program product is embodied on a non-transitory computer readable medium.
  • Fig. 1 shows an example of a codec with neural network (NN) components
  • Fig. 2 shows another example of a video coding system with neural network components
  • Fig. 3 shows an example of a neural auto-encoder architecture
  • Fig. 4 shows an example of a neural network-based end-to-end learned video coding system
  • Fig. 5 shows an example of a video coding for machines
  • Fig. 6 shows an example of a pipeline for end-to-end learned system
  • Fig. 7 shows an example of training an end-to-end learned system
  • Fig. 8 shows an example of an end-to-end learned image or video codec
  • Fig. 9 shows an example of a multi-scale probability model
  • Fig. 10 shows an example of prediction model
  • Fig. 11 is a flowchart illustrating a method for encoding according to an embodiment
  • Fig. 12 is a flowchart illustrating a method for decoding according to an embodiment.
  • Fig. 13 shows an apparatus according to an embodiment.
  • Description of Example Embodiments
  • the present embodiments relate to adaptive probability model for video coding.
  • a neural network is a computation graph consisting of several layers of computation. Each layer consists of one or more units, where each unit performs an elementary computation. A unit is connected to one or more other units, and the connection may have an associated weight. The weight may be used for scaling the signal passing through the associated connection. Weights are learnable parameters, i.e., values which can be learned from training data. There may be other learnable parameters, such as those of batch-normalization layers.
  • Feedforward neural networks are such that there is no feedback loop: each layer takes input from one or more of the layers before and provides its output as the input for one or more of the subsequent layers. Also, units inside a certain layer take input from units in one or more of preceding layers and provide output to one or more of following layers.
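  • As a rough illustration of the layer-by-layer computation, the following sketch passes an input through fully connected layers in which each unit scales its inputs by weights and sums them. The bias term and the ReLU non-linearity are assumptions made for the example; the text above does not prescribe them.

```python
def dense_layer(inputs, weights, biases):
    """One layer: every unit computes a weighted sum of its inputs plus a
    bias, followed by a ReLU (an assumed choice of non-linearity)."""
    outputs = []
    for unit_weights, bias in zip(weights, biases):
        total = bias + sum(w * x for w, x in zip(unit_weights, inputs))
        outputs.append(max(0.0, total))
    return outputs

def feedforward(inputs, layers):
    """Feed the output of each layer into the next one; no feedback loop."""
    for weights, biases in layers:
        inputs = dense_layer(inputs, weights, biases)
    return inputs

layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),  # layer with two units
    ([[1.0, 1.0]], [0.0]),                    # layer with one unit
]
print(feedforward([2.0, 1.0], layers))
```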
  • Initial layers extract semantically low-level features such as edges and textures in images, and intermediate and final layers extract more high-level features.
  • After the feature extraction layers there may be one or more layers performing a certain task, such as classification, semantic segmentation, object detection, denoising, style transfer, super-resolution, etc.
  • In recurrent neural networks, there is a feedback loop, so that the network becomes stateful, i.e., it is able to memorize information or a state.
  • Neural networks are being utilized in an ever-increasing number of applications for many different types of devices, such as mobile phones. Examples include image and video analysis and processing, social media data analysis, device usage data analysis, etc.
  • neural networks are able to learn properties from input data, either in supervised way or in unsupervised way. Such learning is a result of a training algorithm, or of a meta-level neural network providing the training signal.
  • the training algorithm consists of changing some properties of the neural network so that its output is as close as possible to a desired output.
  • the output of the neural network can be used to derive a class or category index which indicates the class or category that the object in the input image belongs to.
  • Training usually happens by minimizing or decreasing the output’s error, also referred to as the loss. Examples of losses are mean squared error, cross-entropy, etc.
  • training is an iterative process, where at each iteration the algorithm modifies the weights of the neural net to make a gradual improvement of the network’s output, i.e., to gradually decrease the loss.
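  • A minimal sketch of such an iterative process, assuming plain gradient descent on a mean squared error loss for a model with a single weight (the model `y = w * x`, the learning rate and the iteration count are illustrative assumptions):

```python
def train_scalar_weight(samples, learning_rate=0.1, iterations=50):
    """Fit y = w * x by gradient descent on the mean squared error."""
    w = 0.0
    losses = []
    for _ in range(iterations):
        # Loss of the current weight over all (x, y) training samples.
        loss = sum((w * x - y) ** 2 for x, y in samples) / len(samples)
        # Derivative of the loss with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in samples) / len(samples)
        losses.append(loss)
        w -= learning_rate * grad            # small step that lowers the loss
    return w, losses

w, losses = train_scalar_weight([(1.0, 2.0), (2.0, 4.0)])
print(w, losses[0], losses[-1])
```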
  • The terms “model” and “neural network” are used interchangeably, and also the weights of neural networks are sometimes referred to as learnable parameters or simply as parameters.
  • Training a neural network is an optimization process.
  • the goal of the optimization or training process is to make the model learn the properties of the data distribution from a limited training dataset.
  • the goal is to use a limited training dataset in order to learn to generalize to previously unseen data, i.e., data which was not used for training the model. This is usually referred to as generalization.
  • data may be split into at least two sets, the training set and the validation set.
  • the training set is used for training the network, i.e., to modify its learnable parameters in order to minimize the loss.
  • the validation set is used for checking the performance of the network on data, which was not used to minimize the loss, as an indication of the final performance of the model.
  • the errors on the training set and on the validation set are monitored during the training process to understand the following things:
  • the training set error should decrease, otherwise the model is in the regime of underfitting.
  • the validation set error needs to decrease and not be much higher than the training set error. If the training set error is low, but the validation set error is much higher than the training set error, or it does not decrease, or it even increases, the model is in the regime of overfitting. This means that the model has just memorized the training set’s properties and performs well only on that set but performs poorly on a set not used for tuning its parameters.
  • Lately, neural networks have been used for compressing and decompressing data such as images, i.e., in an image codec.
  • the most widely used architecture for realizing one component of an image codec is the auto-encoder, which is a neural network consisting of two parts: a neural encoder and a neural decoder.
  • the neural encoder takes as input an image and produces a code which requires fewer bits than the input image. This code may be obtained by applying a binarization or quantization process to the output of the encoder.
  • the neural decoder takes in this code and reconstructs the image which was input to the neural encoder.
  • Such neural encoder and neural decoder may be trained to minimize a combination of bitrate and distortion, where the distortion may be based on one or more of the following metrics: Mean Squared Error (MSE), Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index Measure (SSIM), or similar.
  • MSE Mean Squared Error
  • PSNR Peak Signal-to-Noise Ratio
  • SSIM Structural Similarity Index Measure
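  • Two of the distortion metrics named above can be computed as in the following sketch. It uses the usual definitions of MSE and PSNR; the default peak value of 255 assumes 8-bit samples, and the flat sample lists are a simplification of real image arrays.

```python
import math

def mse(original, reconstructed):
    """Mean squared error between two equally long sample lists."""
    return sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)

def psnr(original, reconstructed, max_value=255):
    """Peak signal-to-noise ratio in dB; infinite for identical inputs."""
    error = mse(original, reconstructed)
    if error == 0:
        return math.inf
    return 10 * math.log10(max_value ** 2 / error)

print(mse([10, 20, 30], [12, 20, 28]))
print(psnr([10, 20, 30], [12, 20, 28]))
```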
  • A video codec comprises an encoder that transforms the input video into a compressed representation suited for storage/transmission and a decoder that can decompress the compressed video representation back into a viewable form.
  • An encoder may discard some information in the original video sequence in order to represent the video in a more compact form (that is, at lower bitrate).
  • the H.264/AVC standard was developed by the Joint Video Team (JVT) of the Video Coding Experts Group (VCEG) of the Telecommunications Standardization Sector of International Telecommunication Union (ITU-T) and the Moving Picture Experts Group (MPEG) of International Organization for Standardization (ISO) / International Electrotechnical Commission (IEC).
  • JVT Joint Video Team
  • VCEG Video Coding Experts Group
  • MPEG Moving Picture Experts Group
  • ISO International Organization for Standardization
  • IEC International Electrotechnical Commission
  • the H.264/AVC standard is published by both parent standardization organizations, and it is referred to as ITU-T Recommendation H.264 and ISO/IEC International Standard 14496-10, also known as MPEG-4 Part 10 Advanced Video Coding (AVC).
  • Extensions of the H.264/AVC include Scalable Video Coding (SVC) and Multiview Video Coding (MVC).
  • High Efficiency Video Coding (H.265/HEVC, a.k.a. HEVC)
  • JCT-VC Joint Collaborative Team - Video Coding
  • the standard was published by both parent standardization organizations, and it is referred to as ITU-T Recommendation H.265 and ISO/IEC International Standard 23008-2, also known as MPEG-H Part 2 High Efficiency Video Coding (HEVC).
  • HEVC MPEG-H Part 2 High Efficiency Video Coding
  • Versatile Video Coding (H.266, a.k.a. VVC)
  • ISO/IEC 23090-3
  • a specification of the AV1 bitstream format and decoding process was developed by the Alliance for Open Media (AOM).
  • AOM is reportedly working on the AV2 specification.
  • An elementary unit for the input to a video encoder and the output of a video decoder, respectively, in most cases is a picture.
  • a picture given as an input to an encoder may also be referred to as a source picture, and a picture decoded by a decoder may be referred to as a decoded picture or a reconstructed picture.
  • the source and decoded pictures each comprise one or more sample arrays, such as one of the following sets of sample arrays:
  • Luma and two chroma (YCbCr or YCgCo),
  • Green, Blue and Red (GBR, also known as RGB),
  • Arrays representing other unspecified monochrome or tri-stimulus color samplings (for example, YZX, also known as XYZ).
  • a component may be defined as an array or single sample from one of the three sample arrays (luma and two chroma) that compose a picture, or the array or a single sample of the array that compose a picture in monochrome format.
  • Hybrid video codecs may encode the video information in two phases. Firstly, pixel values in a certain picture area (or “block”) are predicted for example by motion compensation means (finding and indicating an area in one of the previously coded video frames that corresponds closely to the block being coded) or by spatial means (using the pixel values around the block to be coded in a specified manner). Secondly the prediction error, i.e., the difference between the predicted block of pixels and the original block of pixels, is coded.
  • This may be done by transforming the difference in pixel values using a specified transform (e.g., Discrete Cosine Transform (DCT) or a variant of it), quantizing the coefficients and entropy coding the quantized coefficients.
  • The encoder can control the balance between the accuracy of the pixel representation (picture quality) and size of the resulting coded video representation (file size or transmission bitrate).
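  • The two phases of hybrid coding can be sketched on a one-dimensional block as follows. This is a simplified illustration, not a codec: it assumes an orthonormal 1-D DCT-II, a single uniform quantization step `qstep` (a hypothetical parameter), and it omits entropy coding. A larger `qstep` gives fewer non-zero levels to code at the price of a less accurate reconstruction.

```python
import math

def _scale(k, n):
    # Orthonormal DCT-II scaling factor for coefficient index k.
    return math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)

def dct(block):
    """Forward orthonormal 1-D DCT-II."""
    n = len(block)
    return [_scale(k, n) * sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                               for i, x in enumerate(block))
            for k in range(n)]

def idct(coeffs):
    """Inverse of dct()."""
    n = len(coeffs)
    return [sum(_scale(k, n) * c * math.cos(math.pi * (i + 0.5) * k / n)
                for k, c in enumerate(coeffs))
            for i in range(n)]

def code_block(original, prediction, qstep):
    """Transform and quantize the prediction error, then reconstruct the
    block the way a decoder would."""
    residual = [o - p for o, p in zip(original, prediction)]
    levels = [round(c / qstep) for c in dct(residual)]    # to be entropy-coded
    decoded_residual = idct([level * qstep for level in levels])
    reconstruction = [p + r for p, r in zip(prediction, decoded_residual)]
    return levels, reconstruction

levels, recon = code_block([52, 55, 61, 66], [50, 54, 60, 64], qstep=1.0)
print(levels, recon)
```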
  • Inter prediction, which may also be referred to as temporal prediction, motion compensation, or motion-compensated prediction, exploits temporal redundancy. In inter prediction the sources of prediction are previously decoded pictures.
  • Intra prediction utilizes the fact that adjacent pixels within the same picture are likely to be correlated. Intra prediction can be performed in spatial or transform domain, i.e., either sample values or transform coefficients can be predicted. Intra prediction is typically exploited in intra coding, where no inter prediction is applied.
  • One outcome of the coding procedure is a set of coding parameters, such as motion vectors and quantized transform coefficients.
  • Many parameters can be entropy-coded more efficiently if they are predicted first from spatially or temporally neighboring parameters.
  • a motion vector may be predicted from spatially adjacent motion vectors and only the difference relative to the motion vector predictor may be coded.
  • Prediction of coding parameters and intra prediction may be collectively referred to as in-picture prediction.
  • the decoder reconstructs the output video by applying prediction means similar to the encoder to form a predicted representation of the pixel blocks (using the motion or spatial information created by the encoder and stored in the compressed representation) and prediction error decoding (inverse operation of the prediction error coding recovering the quantized prediction error signal in spatial pixel domain). After applying prediction and prediction error decoding means, the decoder sums up the prediction and prediction error signals (pixel values) to form the output video frame.
  • the decoder (and encoder) can also apply additional filtering means to improve the quality of the output video before passing it for display and/or storing it as prediction reference for the forthcoming frames in the video sequence.
  • the motion information may be indicated with motion vectors associated with each motion compensated image block.
  • Each of these motion vectors represents the displacement of the image block in the picture to be coded (in the encoder side) or decoded (in the decoder side) and the prediction source block in one of the previously coded or decoded pictures.
  • Motion vectors may be coded differentially with respect to block-specific predicted motion vectors.
  • the predicted motion vectors may be created in a predefined way, for example calculating the median of the encoded or decoded motion vectors of the adjacent blocks.
  • Another way to create motion vector predictions is to generate a list of candidate predictions from adjacent blocks and/or co-located blocks in temporal reference pictures and signaling the chosen candidate as the motion vector predictor.
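  • The median-based prediction and differential coding of motion vectors can be sketched as below. The component-wise median and the use of exactly three neighbouring vectors in the example are assumptions; for an even number of neighbours this sketch picks the upper of the two middle values.

```python
def median_predictor(neighbor_mvs):
    """Component-wise median of the motion vectors of adjacent blocks."""
    xs = sorted(mv[0] for mv in neighbor_mvs)
    ys = sorted(mv[1] for mv in neighbor_mvs)
    mid = len(neighbor_mvs) // 2
    return (xs[mid], ys[mid])

def encode_mv(mv, neighbor_mvs):
    """Return the motion vector difference that would be coded."""
    px, py = median_predictor(neighbor_mvs)
    return (mv[0] - px, mv[1] - py)

def decode_mv(mvd, neighbor_mvs):
    """Rebuild the motion vector from the difference and the same predictor."""
    px, py = median_predictor(neighbor_mvs)
    return (mvd[0] + px, mvd[1] + py)

neighbors = [(4, -2), (6, 0), (5, 3)]
mvd = encode_mv((7, 1), neighbors)
print(mvd, decode_mv(mvd, neighbors))
```

  • The decoder forms the same predictor from motion vectors it has already decoded, so only the (typically small) difference needs to be coded.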
  • the reference index of previously coded/decoded picture can be predicted.
  • the reference index is typically predicted from adjacent blocks and/or co-located blocks in temporal reference picture.
  • high efficiency video codecs can employ an additional motion information coding/decoding mechanism, often called merging/merge mode, where all the motion field information, which includes motion vector and corresponding reference picture index for each available reference picture list, is predicted and used without any modification/correction.
  • predicting the motion field information may be carried out using the motion field information of adjacent blocks and/or co-located blocks in temporal reference pictures, and the used motion field information is signaled among a motion field candidate list filled with motion field information of available adjacent/co-located blocks.
  • the prediction residual after motion compensation may be first transformed with a transform kernel (like DCT) and then coded.
  • Video encoders may utilize Lagrangian cost functions to find optimal coding modes, e.g., the desired coding mode for a block, block partitioning, and associated motion vectors.
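  • This kind of cost function uses a weighting factor λ to tie together the (exact or estimated) image distortion due to lossy coding methods and the (exact or estimated) amount of information that is required to represent the pixel values in an image area: $C = D + \lambda R$, where C is the Lagrangian cost to be minimized, D is the image distortion, and R is the rate, i.e., the amount of information needed to represent the data.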
  • the rate R may be the actual bitrate or bit count resulting from encoding. Alternatively, the rate R may be an estimated bitrate or bit count.
  • One possible way of estimating the rate R is to omit the final entropy encoding step and use e.g., a simpler entropy encoding or an entropy encoder where some of the context states have not been updated according to previous encoding mode selections.
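  • A Lagrangian mode decision of the form cost = distortion + λ · rate can be sketched as follows. The candidate mode names and their distortion and rate figures are made up for illustration.

```python
def lagrangian_cost(distortion, rate, lam):
    """Cost that ties distortion and rate together with the weight lam."""
    return distortion + lam * rate

def choose_mode(candidates, lam):
    """candidates: list of (name, distortion, rate) tuples.
    Returns the name of the candidate with the lowest Lagrangian cost."""
    best = min(candidates, key=lambda c: lagrangian_cost(c[1], c[2], lam))
    return best[0]

# Hypothetical candidates: (mode name, distortion, rate in bits).
candidates = [("intra", 100.0, 40), ("inter", 150.0, 10), ("merge", 300.0, 2)]
for lam in (1, 5, 60):
    print(lam, choose_mode(candidates, lam))
```

  • A small λ favours low distortion, while a large λ favours modes that need few bits.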
  • Conventionally used distortion metrics may comprise, but are not limited to, peak signal-to-noise ratio (PSNR), mean squared error (MSE), sum of absolute differences (SAD), sum of absolute transformed differences (SATD), and structural similarity (SSIM), typically measured between the reconstructed video/image signal (that is or would be identical to the decoded video/image signal) and the "original" video/image signal provided as input for encoding.
  • PSNR peak signal-to-noise ratio
  • MSE mean squared error
  • SAD sum of absolute differences
  • SATD sum of absolute transformed differences
  • SSIM structural similarity
  • a partitioning may be defined as a division of a set into subsets such that each element of the set is in exactly one of the subsets.
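  • The definition can be checked mechanically, as in this small sketch (the function name `is_partitioning` is illustrative):

```python
def is_partitioning(full_set, subsets):
    """True if every element of full_set lies in exactly one subset and the
    subsets contain nothing outside full_set."""
    seen = set()
    for subset in subsets:
        for element in subset:
            if element in seen or element not in full_set:
                return False                 # duplicate or foreign element
            seen.add(element)
    return seen == set(full_set)             # nothing was left out

print(is_partitioning({1, 2, 3, 4}, [{1, 2}, {3}, {4}]))
```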
  • a bitstream may be defined as a sequence of bits, which may in some coding formats or standards be in the form of a network abstraction layer (NAL) unit stream or a byte stream, that forms the representation of coded pictures and associated data forming one or more coded video sequences.
  • a bitstream format may comprise a sequence of syntax structures.
  • a syntax element may be defined as an element of data represented in the bitstream.
  • a syntax structure may be defined as zero or more syntax elements present together in the bitstream in a specified order.
  • a NAL unit may be defined as a syntax structure containing an indication of the type of data to follow and bytes containing that data in the form of an RBSP interspersed as necessary with start code emulation prevention bytes.
  • a raw byte sequence payload (RBSP) may be defined as a syntax structure containing an integer number of bytes that is encapsulated in a NAL unit.
  • An RBSP is either empty or has the form of a string of data bits containing syntax elements followed by an RBSP stop bit and followed by zero or more subsequent bits equal to 0.
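  • The interspersing of start code emulation prevention bytes can be sketched as follows. This is a simplified illustration, assuming the byte-level rule used in H.264/AVC, H.265/HEVC and H.266/VVC (a byte 0x03 is inserted after two consecutive zero bytes whenever the next byte is 0x00 to 0x03); the function names are hypothetical and corner cases of the standards are not covered.

```python
def add_emulation_prevention(rbsp: bytes) -> bytes:
    """Insert 0x03 after each pair of zero bytes followed by a byte <= 0x03."""
    out = bytearray()
    zeros = 0  # number of consecutive zero bytes just written
    for b in rbsp:
        if zeros >= 2 and b <= 3:
            out.append(0x03)
            zeros = 0
        out.append(b)
        zeros = zeros + 1 if b == 0 else 0
    return bytes(out)


def remove_emulation_prevention(payload: bytes) -> bytes:
    """Drop each 0x03 that directly follows two zero bytes."""
    out = bytearray()
    zeros = 0
    for b in payload:
        if zeros >= 2 and b == 0x03:
            zeros = 0
            continue
        out.append(b)
        zeros = zeros + 1 if b == 0 else 0
    return bytes(out)


# Hypothetical RBSP bytes that would otherwise look like a start code prefix.
rbsp = bytes([0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x00, 0x03])
nal_payload = add_emulation_prevention(rbsp)
recovered = remove_emulation_prevention(nal_payload)
```

  • After insertion, the payload no longer contains the byte pattern 00 00 01, and removal recovers the original RBSP bytes.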
  • a parameter may be defined as a syntax element of a parameter set.
  • a parameter set may be defined as a syntax structure that contains parameters and that can be referred to from or activated by another syntax structure for example using an identifier.
  • a coding standard or specification may specify several types of parameter sets. It needs to be understood that embodiments may be applied but are not limited to the described types of parameter sets and embodiments could likewise be applied to any parameter set type.
  • a parameter set may be activated when it is referenced e.g., through its identifier.
  • An adaptation parameter set (APS) may be defined as a syntax structure that applies to zero or more slices. There may be different types of adaptation parameter sets.
  • An adaptation parameter set may for example contain filtering parameters for a particular type of a filter.
  • three types of APSs are specified carrying parameters for one of: adaptive loop filter (ALF), luma mapping with chroma scaling (LMCS), and scaling lists.
  • a scaling list may be defined as a list that associates each frequency index with a scale factor for the scaling process, which multiplies transform coefficient levels by a scaling factor, resulting in transform coefficients.
  • an APS is referenced through its type (e.g., ALF, LMCS, or scaling list) and an identifier. In other words, different types of APSs have their own identifier value ranges.
  • An Adaptation Parameter Set may comprise parameters for decoding processes of different types, such as adaptive loop filtering or luma mapping with chroma scaling.
  • Video coding specifications may enable the use of supplemental enhancement information (SEI) messages or alike.
  • Some video coding specifications include SEI network abstraction layer (NAL) units, and some video coding specifications contain both prefix SEI NAL units and suffix SEI NAL units, where the former type can start a picture unit or alike and the latter type can end a picture unit or alike.
  • An SEI NAL unit contains one or more SEI messages, which are not required for the decoding of output pictures but may assist in related processes, such as picture output timing, post-processing of decoded pictures, rendering, error detection, error concealment, and resource reservation.
  • SEI messages are specified in H.264/AVC, H.265/HEVC, H.266/VVC, and H.274/VSEI standards, and the user data SEI messages enable organizations and companies to specify SEI messages for their own use.
  • the standards may contain the syntax and semantics for the specified SEI messages but a process for handling the messages in the recipient might not be defined. Consequently, encoders may be required to follow the standard specifying a SEI message when they create SEI message(s), and decoders might not be required to process SEI messages for output order conformance.
  • One of the reasons to include the syntax and semantics of SEI messages in standards is to allow different system specifications to interpret the supplemental information identically and hence interoperate. It is intended that system specifications can require the use of particular SEI messages both in the encoding end and in the decoding end, and additionally the process for handling particular SEI messages in the recipient can be specified.
  • SEI messages are generally not extended in future amendments or versions of the standard.
  • the phrase along the bitstream (e.g., indicating along the bitstream) or along a coded unit of a bitstream (e.g., indicating along a coded tile) may be used in claims and described embodiments to refer to transmission, signaling, or storage in a manner that the "out-of-band" data is associated with but not included within the bitstream or the coded unit, respectively.
  • the phrase decoding along the bitstream or along a coded unit of a bitstream or alike may refer to decoding the referred out-of-band data (which may be obtained from out-of-band transmission, signaling, or storage) that is associated with the bitstream or the coded unit, respectively.
  • the phrase along the bitstream may be used when the bitstream is contained in a container file, such as a file conforming to the ISO Base Media File Format, and certain file metadata is stored in the file in a manner that associates the metadata to the bitstream, such as boxes in the sample entry for a track containing the bitstream, a sample group for the track containing the bitstream, or a timed metadata track associated with the track containing the bitstream.
  • Image and video codecs may use a set of filters to enhance the visual quality of the predicted visual content; these filters can be applied in-loop, out-of-loop, or both.
  • In the case of in-loop filters, the filter applied on one block in the currently encoded frame will affect the encoding of another block in the same frame and/or in another frame which is predicted from the current frame.
  • An in-loop filter can affect the bitrate and/or the visual quality. In fact, an enhanced block will cause a smaller residual (difference between original block and predicted-and-filtered block), thus requiring fewer bits to be encoded.
  • An out-of-loop filter is applied on a frame after it has been reconstructed; the filtered visual content is not used as a source for prediction, and thus it may only impact the visual quality of the frames that are output by the decoder.
  • In-loop filters in a video/image encoder and decoder may comprise, but may not be limited to, one or more of the following: deblocking filter (DBF); sample adaptive offset (SAO) filter; adaptive loop filter (ALF) for luma and/or chroma components; cross-component adaptive loop filter (CC-ALF).
  • a deblocking filter may be configured to reduce blocking artefacts due to block-based coding.
  • A deblocking filter may be applied (only) to samples located at prediction unit (PU) and/or transform unit (TU) boundaries, except at the picture boundaries or when disabled at slice and/or tile boundaries.
  • Horizontal filtering may be applied (first) for vertical boundaries, and vertical filtering may be applied for horizontal boundaries.
  • a sample adaptive offset may be another in-loop filtering process that modifies decoded samples by conditionally adding an offset value to a sample (possibly to each sample), based on values in lookup tables transmitted by the encoder.
  • SAO may have one or more (e.g., two) operation modes: band offset and edge offset modes.
  • In band offset mode, an offset may be added to the sample value depending on the sample amplitude.
  • the full sample amplitude range may be divided into a number of bands (e.g., 32 bands), and sample values belonging to four of these bands may be modified by adding a positive or negative offset, which may be signalled for each coding tree unit (CTU).
  • In edge offset mode, the horizontal, vertical, and two diagonal gradients may be used for classification.
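  • The band offset mode can be sketched as below. This is a simplified illustration, assuming 32 equal-width bands, four consecutive modified bands starting at a signalled band position, and clipping to the valid sample range; the function and parameter names are hypothetical.

```python
def sao_band_offset(samples, start_band, offsets, bit_depth=8):
    """Add an offset to samples falling into the bands covered by `offsets`.

    The full amplitude range is split into 32 bands, so the band index is
    given by the five most significant bits of the sample value.
    """
    shift = bit_depth - 5              # 2**5 = 32 bands
    max_value = (1 << bit_depth) - 1
    filtered = []
    for s in samples:
        k = (s >> shift) - start_band  # position within the modified bands
        if 0 <= k < len(offsets):
            s = min(max(s + offsets[k], 0), max_value)
        filtered.append(s)
    return filtered


# Hypothetical 8-bit samples; bands 2..5 receive the four offsets.
samples = [0, 16, 24, 47, 48, 255]
filtered = sao_band_offset(samples, start_band=2, offsets=[1, -2, 3, 4])
```

  • Samples outside the four selected bands pass through unchanged, which is why only a small lookup table needs to be transmitted per coding tree unit.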
  • An Adaptive Loop Filter may apply block-based filter adaptation. For example, for the luma component, one among 25 filters may be selected for each 4x4 block, based on the direction and activity of local gradients, which are derived using the sample values of that 4x4 block.
  • the ALF classification may be performed on 2x2 block units, for instance. When all of the vertical, horizontal and diagonal gradients are below a first threshold value, the block may be classified as texture (not containing edges). Otherwise, the block may be classified to contain edges, a dominant edge direction may be derived from horizontal, vertical and diagonal gradients, and a strength of the edge (e.g., strong or weak) may be further derived from the gradient values.
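  • The texture/edge decision described above can be sketched as follows. This is an illustrative sketch only: the 1-D Laplacian gradient definition, the accumulation over the 2x2 block itself, and the threshold value are assumptions and do not reproduce the normative ALF derivation.

```python
def directional_gradients(img, y0, x0, size=2):
    """Sum 1-D Laplacian magnitudes in four directions over a size x size block.

    The block must not touch the picture border, since each sample needs
    its eight neighbours.
    """
    grads = {"horizontal": 0, "vertical": 0, "diag_down": 0, "diag_up": 0}
    for y in range(y0, y0 + size):
        for x in range(x0, x0 + size):
            c = 2 * img[y][x]
            grads["horizontal"] += abs(c - img[y][x - 1] - img[y][x + 1])
            grads["vertical"] += abs(c - img[y - 1][x] - img[y + 1][x])
            grads["diag_down"] += abs(c - img[y - 1][x - 1] - img[y + 1][x + 1])
            grads["diag_up"] += abs(c - img[y - 1][x + 1] - img[y + 1][x - 1])
    return grads


def classify_block(img, y0, x0, threshold):
    """Return ("texture" or "edge", gradients) for the 2x2 block at (y0, x0)."""
    grads = directional_gradients(img, y0, x0)
    if all(g < threshold for g in grads.values()):
        return "texture", grads
    return "edge", grads


# Hypothetical 4x4 pictures: a flat area and a vertical step edge.
flat = [[10] * 4 for _ in range(4)]
edge = [[0, 0, 100, 100] for _ in range(4)]

flat_label, flat_grads = classify_block(flat, 1, 1, threshold=50)
edge_label, edge_grads = classify_block(edge, 1, 1, threshold=50)
```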
  • the filtering may be performed by applying a 7x7 diamond filter, for example, to the luma component.
  • An ALF filter set may comprise one filter for each chroma component, and a 5x5 diamond filter may be applied to the chroma components, for example.
  • the filter coefficients use point-symmetry relative to the center point.
  • An ALF design may include clipping of the difference between the neighboring sample value and the current to-be-filtered sample, which provides adaptability related to both spatial relationship and value similarity between samples.
  • cross-component ALF uses luma sample values to refine each chroma component by applying an adaptive linear filter to the luma channel and then using the output of this filtering operation for chroma refinement.
  • Filtering in CC-ALF is accomplished by applying a linear, diamond shaped filter to the luma channel.
  • ALF filter parameters are signalled in Adaptation Parameter Set (APS). For example, in one APS, up to 25 sets of luma filter coefficients and clipping value indices, and up to eight sets of chroma filter coefficients and clipping value indices could be signalled. To reduce the overhead, filter coefficients of different classification for luma component can be merged. In slice header, the identifiers of the APSs used for the current slice are signaled.
  • ALF APS indices can be signaled to specify the luma filter sets that are used for the current slice.
  • the filtering process can be further controlled at coding tree block (CTB) level.
  • a flag is signalled to indicate whether ALF is applied to a luma CTB.
  • a filter set among 16 fixed filter sets and the filter sets from APSs selected in the slice header may be selected per each luma CTB by the encoder and may be decoded per each luma CTB by the decoder.
  • a filter set index is signaled for a luma CTB to indicate which filter set is applied.
  • the 16 fixed filter sets are pre-defined in the VVC standard and hardcoded in both the encoder and the decoder.
  • the 16 fixed filter sets may be referred to as the pre-defined ALFs.
  • LMCS luma mapping with chroma scaling
  • the luma sample values of an input video signal to the encoder and output video signal from the decoder are represented in the original (unmapped) sample domain.
  • Forward luma mapping maps luma sample values from the original sample domain to the mapped sample domain.
  • Inverse luma mapping maps luma sample values from the mapped sample domain to the original sample domain.
  • the processes in the mapped sample domain include inverse quantization, inverse transform, luma intra prediction and summing the luma prediction with the luma residue values.
  • the processes in the original sample domain include in-loop filters (e.g., deblocking, SAO, ALF), inter prediction, and storage of pictures in the decoded picture buffer (DPB).
  • Inverse quantization and inverse transform are applied to the decoded luma transform coefficients to produce the luma residues in the mapped sample domain, Y’res.
  • Reconstructed luma sample values in the mapped sample domain, Y’r, are obtained by summing Y’res with the corresponding predicted luma values in the mapped sample domain, Y’pred.
  • For intra-predicted blocks, Y’pred is directly obtained by performing intra prediction in the mapped sample domain.
  • For inter-predicted blocks, the predicted luma values in the original sample domain, Ypred, are first obtained by motion compensation using reference pictures from the DPB, and then forward luma mapping is applied to produce the luma values in the mapped sample domain, Y’pred.
  • Inverse luma mapping is applied to the reconstructed values Y’r to produce reconstructed luma sample values in the original sample domain, which are processed by in-loop filters (deblocking, sample adaptive offset, and adaptive loop filter) before being stored in the DPB.
  • LMCS syntax elements are signalled in an adaptation parameter set (APS) with aps_params_type equal to 1 (LMCS APS).
  • The value range for an adaptation parameter set identifier (aps_adaptation_parameter_set_id) is from 0 to 3, inclusive, for LMCS APSs.
  • the use of LMCS can be enabled or disabled in a picture header.
  • The LMCS APS identifier value used for the picture (ph_lmcs_aps_id) is included in the picture header.
  • The same LMCS parameters are used for the entire picture.
  • The chroma scaling part can be enabled or disabled in the picture header through ph_chroma_residual_scale_flag.
  • LMCS is further enabled or disabled in the slice header for each slice.
  • LMCS data within an LMCS APS comprises syntax related to a piecewise linear model of up to 16 pieces for luma mapping.
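  • The luma sample value range of the piecewise linear forward mapping function is uniformly sampled into 16 pieces of the same length OrgCW.
  • For example, each piece may contain OrgCW = 64 input codewords (as for a 10-bit input, where 1024 / 16 = 64).
  • The number of output (mapped) codewords of piece i, binCW[i], is determined in the encoding process.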
  • the difference between binCW[i] and OrgCW is signalled in LMCS APS.
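  • The piecewise linear forward mapping can be sketched as below. This is a minimal sketch assuming a 10-bit signal (OrgCW = 64), that binCW[i] is the number of mapped codewords assigned to piece i, rounded integer interpolation instead of the fixed-point arithmetic of the standard, and a sign convention of binCW[i] minus OrgCW for the signalled difference; the function names and the binCW values are hypothetical.

```python
ORG_CW = 64        # input codewords per piece for a 10-bit signal (1024 / 16)
NUM_PIECES = 16


def build_pivots(bin_cw):
    """Mapped-domain start value of each piece (cumulative sum of binCW)."""
    pivots = [0]
    for cw in bin_cw:
        pivots.append(pivots[-1] + cw)
    return pivots


def forward_map(y, bin_cw, pivots):
    """Map a luma sample from the original domain to the mapped domain."""
    i = min(y // ORG_CW, NUM_PIECES - 1)
    return pivots[i] + (bin_cw[i] * (y - i * ORG_CW) + ORG_CW // 2) // ORG_CW


def signalled_deltas(bin_cw):
    """Per-piece differences between binCW[i] and OrgCW."""
    return [cw - ORG_CW for cw in bin_cw]


# Hypothetical encoder choice: fewer codewords for dark samples, more for bright.
bin_cw = [32] * 8 + [96] * 8
pivots = build_pivots(bin_cw)

mapped_mid = forward_map(512, bin_cw, pivots)
mapped_max = forward_map(1023, bin_cw, pivots)
deltas = signalled_deltas(bin_cw)
```

  • When every binCW[i] equals OrgCW, the mapping reduces to the identity and all signalled differences are zero.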
  • Neural networks (NNs) may replace, or be used in addition to, one or more components of a traditional codec such as VVC/H.266.
  • Here, "traditional" refers to those codecs whose components and their parameters may not be learned from data. Examples of such components are:
  • Additional in-loop filter, for example by having the NN as an additional in-loop filter with respect to the traditional loop filters.
  • Figure 1 illustrates examples of functioning of NNs as components of a traditional codec's pipeline, in accordance with an embodiment.
  • Figure 1 illustrates an encoder, which also includes a decoding loop.
  • Figure 1 is shown to include components described below:
  • A luma intra pred block or circuit 101: this block or circuit performs intra prediction in the luma domain, for example, by using already reconstructed data from the same frame.
  • the operation of the luma intra pred block or circuit 101 may be performed by a deep neural network such as a convolutional auto-encoder.
  • A chroma intra pred block or circuit 102: this block or circuit performs intra prediction in the chroma domain, for example, by using already reconstructed data from the same frame.
  • the chroma intra pred block or circuit 102 may perform cross-component prediction, for example, predicting chroma from luma.
  • the operation of the chroma intra pred block or circuit 102 may be performed by a deep neural network such as a convolutional auto-encoder.
  • An intra pred block or circuit 103 and an inter-pred block or circuit 104: these blocks or circuits perform intra prediction and inter-prediction, respectively.
  • the intra pred block or circuit 103 and the inter-pred block or circuit 104 may perform the prediction on all components, for example, luma and chroma.
  • the operations of the intra pred block or circuit 103 and inter-pred block or circuit 104 may be performed by two or more deep neural networks such as convolutional auto-encoders.
  • A probability estimation block or circuit 105 for entropy coding: this block or circuit performs prediction of probability for the next symbol to encode or decode, which is then provided to the entropy coding module 112, such as the arithmetic coding module, to encode or decode the next symbol.
  • the operation of the probability estimation block or circuit 105 may be performed by a neural network.
  • A transform and quantization (T/Q) block or circuit 106: these are actually two blocks or circuits.
  • the transform and quantization block or circuit 106 may perform a transform of input data to a different domain, for example, the FFT transform would transform the data to frequency domain.
  • the transform and quantization block or circuit 106 may quantize its input values to a smaller set of possible values.
  • There may also be an inverse quantization block or circuit and an inverse transform block or circuit 113.
  • One or both of the transform block or circuit and quantization block or circuit may be replaced by one or two or more neural networks.
  • One or both of the inverse transform block or circuit and inverse quantization block or circuit 113 may be replaced by one or two or more neural networks.
  • An in-loop filter block or circuit 107: operations of the in-loop filter block or circuit 107 are performed in the decoding loop; it filters the output of the inverse transform block or circuit, or in any case the reconstructed data, in order to enhance the reconstructed data with respect to one or more predetermined quality metrics. This filter may affect both the quality of the decoded data and the bitrate of the bitstream output by the encoder.
  • the operation of the in-loop filter block or circuit 107 may be performed by a neural network, such as a convolutional auto-encoder. In examples, the operation of the in-loop filter may be performed by multiple steps or filters, where the one or more steps may be performed by neural networks.
  • the postprocessing filter block or circuit 108 may be performed only at decoder side, as it may not affect the encoding process.
  • the postprocessing filter block or circuit 108 filters the reconstructed data output by the in-loop filter block or circuit 107, in order to enhance the reconstructed data.
  • the postprocessing filter block or circuit 108 may be replaced by a neural network, such as a convolutional auto-encoder.
  • A resolution adaptation block or circuit 109: this block or circuit may downsample the input video frames prior to encoding. Then, in the decoding loop, the reconstructed data may be upsampled, by the upsampling block or circuit 110, to the original resolution.
  • The operation of the resolution adaptation block or circuit 109 may be performed by a neural network such as a convolutional auto-encoder.
  • An encoder control block or circuit 111: this block or circuit performs optimization of the encoder's parameters, such as what transform to use, what quantization parameters (QP) to use, what intra-prediction mode (out of N intra-prediction modes) to use, and the like.
  • the operation of the encoder control block or circuit 111 may be performed by a neural network, such as a classifier convolutional network, or such as a regression convolutional network.
  • An ME/MC block or circuit 114 performs motion estimation and/or motion compensation, which are two key operations to be performed when performing inter-frame prediction.
  • ME/MC stands for motion estimation / motion compensation.
  • NNs are used as the main components of the image/video codecs.
  • In end-to-end learned compression, there are two main options:
  • Option 1: re-use the video coding pipeline but replace most or all of the components with NNs.
  • Figure 2 illustrates an example of a modified video coding pipeline based on a neural network, in accordance with an embodiment.
  • An example of a neural network may include, but is not limited to, a compressed representation of a neural network.
  • Figure 2 is shown to include following components:
  • A neural transform block or circuit 202: this block or circuit transforms the output of a summation/subtraction operation 203 to a new representation of that data, which may have lower entropy and thus be more compressible.
  • A quantization block or circuit 204: this block or circuit quantizes input data 201 to a smaller set of possible values.
  • Inverse transform and inverse quantization blocks or circuits 206 perform the inverse or approximately inverse operation of the transform and the quantization, respectively.
  • An encoder parameter control block or circuit 208: this block or circuit may control and optimize some or all the parameters of the encoding process, such as parameters of one or more of the encoding blocks or circuits.
  • An entropy coding block or circuit 210: this block or circuit may perform lossless coding, for example based on entropy.
  • One popular entropy coding technique is arithmetic coding.
  • A neural intra-codec block or circuit 212: this block or circuit may be an image compression and decompression block or circuit, which may be used to encode and decode an intra frame.
  • An encoder 214 may be an encoder block or circuit, such as the neural encoder part of an autoencoder neural network.
  • a decoder 216 may be a decoder block or circuit, such as the neural decoder part of an auto-encoder neural network.
  • An intra-coding block or circuit 218 may be a block or circuit performing some intermediate steps between encoder and decoder, such as quantization, entropy encoding, entropy decoding, and/or inverse quantization.
  • A deep loop filter block or circuit 220: this block or circuit performs filtering of reconstructed data, in order to enhance it.
  • A decoded picture buffer block or circuit 222 is a memory buffer keeping the decoded frames, for example, reconstructed frames 224 and enhanced reference frames 226 to be used for inter prediction.
  • An inter-prediction block or circuit 228: this block or circuit performs inter-frame prediction, for example, predicts from frames, for example, frames 232, which are temporally nearby.
  • An ME/MC 230 performs motion estimation and/or motion compensation, which are two key operations to be performed when performing inter-frame prediction.
  • ME/MC stands for motion estimation / motion compensation.
  • Option 2: re-design the whole pipeline, as follows.
  • Encoder NN is configured to perform a non-linear transform.
  • Decoder NN is configured to perform a non-linear inverse transform.
  • Figure 3 shows an encoder NN and a decoder NN being parts of a neural auto-encoder architecture, in accordance with an example.
  • The Analysis Network 301 is the Encoder NN and the Synthesis Network 302 is the Decoder NN; together they may be referred to as spatial correlation tools 303, or as a neural auto-encoder.
  • the input data 304 is analyzed by the Encoder NN (Analysis Network 301), which outputs a new representation of that input data.
  • the new representation may be more compressible.
  • This new representation may then be quantized, by a quantizer 305, to a discrete number of values.
  • The quantized data is then losslessly encoded, for example by an arithmetic encoder 306, thus obtaining a bitstream 307.
  • the example shown in Figure 3 includes an arithmetic decoder 308 and an arithmetic encoder 306.
  • the arithmetic encoder 306, or the arithmetic decoder 308, or the combination of the arithmetic encoder 306 and arithmetic decoder 308 may be referred to as arithmetic codec in some embodiments.
  • On the decoding side, the bitstream is first losslessly decoded, for example, by using the arithmetic decoder 308.
  • the lossless decoded data is dequantized and then input to the Decoder NN, Synthesis Network 302.
  • the output is the reconstructed or decoded data 309.
  • the lossy steps may comprise the Encoder NN and/or the quantization.
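  • The quantization and dequantization steps around the lossless codec can be sketched as below. This is a minimal sketch assuming a uniform scalar quantizer with a fixed step size; the function names, the step size, and the latent values are hypothetical, and a learned codec may use a different quantizer.

```python
import math


def quantize(values, step):
    """Map continuous values to integer indices (round half up)."""
    return [math.floor(v / step + 0.5) for v in values]


def dequantize(indices, step):
    """Map integer indices back to reconstruction values."""
    return [q * step for q in indices]


# Hypothetical output of the Encoder NN for a few latent elements.
latents = [0.12, -1.49, 2.26]
step = 0.5

indices = quantize(latents, step)           # what the arithmetic encoder codes
reconstructed = dequantize(indices, step)   # what the Decoder NN receives
max_error = max(abs(a - b) for a, b in zip(latents, reconstructed))
```

  • The indices survive the lossless codec unchanged, so the only loss in this part of the chain is the rounding error, which is bounded by half the step size.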
  • a training objective function (also called “training loss”) may be utilized, which may comprise one or more terms, or loss terms, or simply losses.
  • the training loss comprises a reconstruction loss term and a rate loss term.
  • the reconstruction loss encourages the system to decode data that is similar to the input data, according to some similarity metric. Examples of reconstruction losses are:
  • Multi-scale structural similarity (MS-SSIM);
  • Losses derived from the use of a pretrained neural network: for example, error(f1, f2), where f1 and f2 are the features extracted by a pretrained neural network for the input data and the decoded data, respectively, and error() is an error or distance function, such as the L1 norm or L2 norm;
  • Losses derived from the use of a neural network that is trained simultaneously with the end-to-end learned codec: for example, an adversarial loss can be used, which is the loss provided by a discriminator neural network that is trained adversarially with respect to the codec, following the settings proposed in the context of Generative Adversarial Networks (GANs) and their variants.
  • the rate loss encourages the system to compress the output of the encoding stage, such as the output of the arithmetic encoder.
  • By compressing, we mean reducing the number of bits output by the encoding stage.
  • The rate loss typically encourages the output of the Encoder NN to have low entropy.
  • Examples of rate losses are the following:
  • A sparsification loss, i.e., a loss that encourages the output of the Encoder NN or the output of the quantization to have many zeros. Examples are the L0 norm, the L1 norm, and the L1 norm divided by the L2 norm;
  • One or more of reconstruction losses may be used, and one or more of the rate losses may be used, as a weighted sum.
  • the different loss terms may be weighted using different weights, and these weights determine how the final system performs in terms of rate-distortion loss. For example, if more weight is given to the reconstruction losses with respect to the rate losses, the system may learn to compress less but to reconstruct with higher accuracy (as measured by a metric that correlates with the reconstruction losses).
  • These weights may be considered to be hyper-parameters of the training session and may be set manually by the person designing the training session, or automatically for example by grid search or by using additional neural networks.
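  • The weighted combination of loss terms can be sketched as below. This is an illustrative sketch on plain numbers; the loss values and weights are hypothetical, and a real training session would compute the terms with a deep learning framework.

```python
def training_loss(reconstruction_losses, rate_losses, recon_weights, rate_weights):
    """Weighted sum of reconstruction loss terms and rate loss terms."""
    total = 0.0
    for loss, weight in zip(reconstruction_losses, recon_weights):
        total += weight * loss
    for loss, weight in zip(rate_losses, rate_weights):
        total += weight * loss
    return total


# Hypothetical per-batch loss values: two reconstruction terms, one rate term.
reconstruction_losses = [0.02, 0.10]
rate_losses = [0.75]

# The weights are hyper-parameters of the training session.
quality_first = training_loss(reconstruction_losses, rate_losses, [1.0, 0.5], [0.01])
rate_first = training_loss(reconstruction_losses, rate_losses, [1.0, 0.5], [1.0])
```

  • Raising the rate weight makes the rate term dominate the objective, which pushes the trained system toward stronger compression at the expense of reconstruction accuracy.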
  • a neural network-based end-to-end learned video coding system may contain an encoder 401, a quantizer 402, a probability model 403, an entropy codec 420 (for example arithmetic encoder 405 / arithmetic decoder 406), a dequantizer 407, and a decoder 408.
  • the encoder 401 and decoder 408 may be two neural networks, or mainly comprise neural network components.
  • the probability model 403 may also comprise mainly neural network components.
  • Quantizer 402, dequantizer 407 and entropy codec 420 may not be based on neural network components, but they may also comprise neural network components, potentially.
  • the encoder component 401 takes a video x 409 as input and converts the video from its original signal space into a latent representation that may comprise a more compressible representation of the input.
  • The latent representation may be a 3-dimensional tensor, where two dimensions represent the vertical and horizontal spatial dimensions, and the third dimension represents the “channels” which contain information at that specific location.
  • For example, for an input of shape 128x128x3, the latent representation may be a tensor of dimensions (or “shape”) 64x64x32 (i.e., with horizontal size of 64 elements, vertical size of 64 elements, and 32 channels).
  • the channel dimension may be the first dimension, so for the above example, the shape of the input tensor may be represented as 3x128x128, instead of 128x128x3.
  • another dimension in the input tensor may be used to represent temporal information.
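  • The two layout conventions differ only in the order of the axes. The sketch below converts a channel-last tensor into a channel-first one using nested Python lists; the helper names are hypothetical and a tiny 2x4x3 tensor stands in for the 128x128x3 example.

```python
def shape(tensor):
    """Shape of a regular nested-list tensor, e.g. (2, 4, 3)."""
    dims = []
    while isinstance(tensor, list):
        dims.append(len(tensor))
        tensor = tensor[0]
    return tuple(dims)


def channel_last_to_first(tensor):
    """Reorder axes (rows, cols, channels) -> (channels, rows, cols)."""
    rows, cols, channels = shape(tensor)
    return [
        [[tensor[y][x][c] for x in range(cols)] for y in range(rows)]
        for c in range(channels)
    ]


# Hypothetical 2x4 picture with 3 channels; each value encodes its own index.
image = [[[y * 100 + x * 10 + c for c in range(3)] for x in range(4)] for y in range(2)]
channel_first = channel_last_to_first(image)
```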
  • the quantizer component 402 quantizes the latent representation into discrete values given a predefined set of quantization levels.
  • Probability model 403 and arithmetic codec component 420 work together to perform lossless compression for the quantized latent representation and generate bitstreams to be sent to the decoder side.
  • For each symbol to be encoded or decoded, the probability model 403 estimates the probability distribution of all possible values for that symbol based on a context that is constructed from available information at the current encoding/decoding state, such as the data that has already been encoded/decoded.
  • the arithmetic encoder 405 encodes the input symbols to bitstream using the estimated probability distributions.
  • On the decoder side, the arithmetic decoder 406 and the probability model 403 first decode symbols from the bitstream to recover the quantized latent representation. Then the dequantizer 407 reconstructs the latent representation in continuous values and passes it to the decoder 408 to recover the input video/image. Note that the probability model 403 in this system is shared between the encoding and decoding systems. In practice, this means that a copy of the probability model 403 is used at the encoder side, and another exact copy is used at the decoder side.
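  • The role of a shared, context-driven probability model can be sketched with a simple stand-in. The sketch below replaces the neural probability model with an adaptive counting model (an assumption made purely for illustration) and accumulates the ideal code length of -log2(p) bits per symbol that an arithmetic coder would approach; the class and function names are hypothetical.

```python
import math


class CountingModel:
    """Toy probability model whose context is the symbols coded so far."""

    def __init__(self, alphabet):
        self.counts = {symbol: 1 for symbol in alphabet}

    def probabilities(self):
        total = sum(self.counts.values())
        return {symbol: count / total for symbol, count in self.counts.items()}

    def update(self, symbol):
        self.counts[symbol] += 1


def ideal_code_length(symbols, alphabet):
    """Total bits an ideal entropy coder would spend under the model."""
    model = CountingModel(alphabet)
    bits = 0.0
    for symbol in symbols:
        bits += -math.log2(model.probabilities()[symbol])
        model.update(symbol)  # the decoder applies the same update after decoding
    return bits


# Encoder-side and decoder-side copies stay identical when fed the same symbols.
encoder_model, decoder_model = CountingModel([0, 1]), CountingModel([0, 1])
for s in [0, 0, 1, 0]:
    encoder_model.update(s)
    decoder_model.update(s)

skewed_bits = ideal_code_length([0, 0, 0, 0], alphabet=[0, 1])
```

  • Because both sides update their copy only with symbols that have already been coded, the decoder can reproduce exactly the distributions the encoder used.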
  • the encoder 401, probability model 403, and decoder 408 may be based on deep neural networks.
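  • The system may be trained in an end-to-end manner by minimizing the following rate-distortion loss function: L = D + λR, where D is the distortion loss term, R is the rate loss term, and λ is the weight that controls the balance between the two.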
  • the distortion loss term may be the mean square error (MSE), structure similarity (SSIM) or other metrics that evaluate the quality of the reconstructed video. Multiple distortion losses may be used and integrated into D, such as a weighted sum of MSE and SSIM.
  • the rate loss term is normally the estimated entropy of the quantized latent representation, which indicates the number of bits necessary to represent the encoded symbols, for example, bits-per-pixel (bpp).
  • For lossless compression, the system may contain only the probability model 403 and arithmetic encoder/decoder 405, 406.
  • the system loss function contains only the rate loss, since the distortion loss is always zero (i.e., no loss of information).
  • Reducing the distortion in image and video compression is often intended to increase human perceptual quality, as humans are considered to be the end users, i.e., consuming/watching the decoded image.
  • However, machines (i.e., autonomous agents) may also analyze the decoded data independently of humans.
  • Examples of such analysis are object detection, scene classification, semantic segmentation, video event detection, anomaly detection, pedestrian tracking, etc.
  • Example use cases and applications are self-driving cars, video surveillance cameras and public safety, smart sensor networks, smart TV and smart advertisement, person reidentification, smart traffic monitoring, drones, etc.
  • Video Coding for Machines (VCM) concerns the encoding of video streams to allow consumption by machines.
  • The term "machine" refers to any device other than a human.
  • Examples of machines include a mobile phone, an autonomous vehicle, a robot, and similar intelligent devices which may have a degree of autonomy or run an intelligent algorithm to process the decoded stream beyond reconstructing the original input stream.
  • a machine may perform one or multiple tasks on the decoded stream. Examples of tasks can comprise the following:
  • Classification: classify an image or video into one or more predefined categories.
  • the output of a classification task may be a set of detected categories, also known as classes or labels.
  • the output may also include the probability and confidence of each predefined category.
  • Object detection: detect one or more objects in a given image or video.
  • the output of an object detection task may be the bounding boxes and the associated classes of the detected objects.
  • the output may also include the probability and confidence of each detected object.
  • Instance segmentation: identify one or more objects in an image or video at the pixel level.
  • the output of an instance segmentation task may be binary mask images or other representations of the binary mask images, e.g., closed contours, of the detected objects.
  • the output may also include the probability and confidence of each object for each pixel.
  • Semantic segmentation: assign the pixels in an image or video to one or more predefined semantic categories.
  • the output of a semantic segmentation task may be binary mask images or other representations of the binary mask images, e.g., closed contours, of the assigned categories.
  • the output may also include the probability and confidence of each semantic category for each pixel.
  • Object tracking: track one or more objects in a video sequence.
  • the output of an object tracking task may include frame index, object ID, object bounding boxes, probability, and confidence for each tracked object.
  • Captioning: generate one or more short text descriptions for an input image or video.
  • the output of the captioning task may be one or more short text sequences.
  • Human pose estimation: estimate the position of the key points, e.g., wrists, elbows, knees, etc., of one or more human bodies in an image or video.
  • the output of a human pose estimation task includes sets of locations of each key point of a human body detected in the input image or video.
  • Human action recognition: recognize the actions, e.g., walking, talking, shaking hands, of one or more people in an input image or video.
  • the output of the human action recognition task may be a set of predefined actions, together with the probability and confidence of each identified action.
  • Anomaly detection: detect abnormal objects or events in an input image or video.
  • the output of an anomaly detection task may include the locations of detected abnormal objects or the segments of frames where abnormal events are detected in the input video.
  • the receiver-side device has multiple “machines” or task neural networks (Task-NNs). These multiple machines may be used in a certain combination which is for example determined by an orchestrator sub-system. The multiple machines may be used for example in succession, based on the output of the previously used machine, and/or in parallel. For example, a video which was compressed and then decompressed may be analyzed by one machine (NN) for detecting pedestrians, by another machine (another NN) for detecting cars, and by another machine (another NN) for estimating the depth of all the pixels in the frames.
  • The terms “task machine”, “machine”, and “task neural network” are used interchangeably, and refer to any process or algorithm (learned from data or not) which analyzes or processes data for a certain task.
  • The terms “recipient-side” and “decoder-side” refer to the physical or abstract entity or device which contains one or more machines, and runs these one or more machines on an encoded and eventually decoded video representation which is encoded by another physical or abstract entity or device, the “encoder-side device”.
  • the encoded video data may be stored into a memory device, for example as a file.
  • the stored file may later be provided to another device.
  • the encoded video data may be streamed from one device to another.
  • FIG. 5 is a general illustration of the pipeline of Video Coding for Machines.
  • a VCM encoder 502 encodes the input video into a bitstream 504.
  • a bitrate 506 may be computed 508 from the bitstream 504 in order to evaluate the size of the bitstream.
  • a VCM decoder 510 decodes the bitstream output by the VCM encoder 502.
  • the output of the VCM decoder 510 is referred to as “Decoded data for machines” 512. This data may be considered as the decoded or reconstructed video. However, in some implementations of this pipeline, this data may not have the same or similar characteristics as the original video which was input to the VCM encoder 502.
  • this data may not be easily understandable by a human by simply rendering the data onto a screen.
  • the output of VCM decoder is then input to one or more task neural networks 514.
  • among the task-NNs 514, there are three example task-NNs and one unspecified task-NN (Task-NN X).
  • the goal of VCM is to obtain a low bitrate while guaranteeing that the task-NNs still perform well in terms of the evaluation metric 516 associated to each task.
  • FIG. 6 illustrates an example of a pipeline for the end-to-end learned approach.
  • the video is input to a neural network encoder 601.
  • the output of the neural network encoder 601 is input to a lossless encoder 602, such as an arithmetic encoder, which outputs a bitstream 604.
  • the lossless codec may use a probability model 603, both in the lossless encoder and in the lossless decoder, which predicts the probability of the next symbol to be encoded and decoded.
  • the probability model 603 may also be learned, for example it may be a neural network.
  • the bitstream 604 is input to a lossless decoder 605, such as an arithmetic decoder, whose output is input to a neural network decoder 606.
  • the output of the neural network decoder 606 is the decoded data for machines 607, that may be input to one or more task-NNs 608.
  • Figure 7 illustrates an example of how the end-to-end learned system may be trained. For the sake of simplicity, only one task-NN 707 is illustrated.
  • a rate loss 705 may be computed from the output of the probability model 703. The rate loss 705 provides an approximation of the bitrate required to encode the input video data.
  • a task loss 710 may be computed 709 from the output 708 of the task-NN 707.
  • the rate loss 705 and the task loss 710 may then be used to train 711 the neural networks used in the system, such as the neural network encoder 701, the probability model 703, the neural network decoder 706. Training may be performed by first computing gradients of each loss with respect to the neural networks that are contributing or affecting the computation of that loss. The gradients are then used by an optimization method, such as Adam, for updating the trainable parameters of the neural networks.
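The gradient-then-optimizer update described above can be sketched for a single scalar parameter. The update rule below is the standard Adam rule; the toy quadratic loss, the hyperparameter values, and the names `adam_step`, `loss`, and `grad` are assumptions for illustration and stand in for the real rate and task losses of the system.

```python
import math

def adam_step(theta, grad, state, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad          # first moment
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad   # second moment
    m_hat = state["m"] / (1 - beta1 ** state["t"])                # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps)

def loss(theta):
    return (theta - 3.0) ** 2      # toy stand-in for a rate or task loss

def grad(theta):
    return 2.0 * (theta - 3.0)     # analytic gradient of the toy loss

theta = 0.0
state = {"t": 0, "m": 0.0, "v": 0.0}
initial_loss = loss(theta)
for _ in range(300):
    theta = adam_step(theta, grad(theta), state)
final_loss = loss(theta)
```

In a real system the gradient would come from back-propagation through the neural networks that contribute to each loss rather than from a closed-form expression.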
  • the machine tasks may be performed at decoder side (instead of at encoder side) for multiple reasons, for example because the encoder-side device does not have the capabilities (computational, power, memory) for running the neural networks that perform these tasks, or because some aspects or the performance of the task neural networks may have changed or improved by the time that the decoder-side device needs the task results (e.g., different or additional semantic classes, better neural network architecture). Also, there could be a customization need, where different clients would run different neural networks for performing these machine learning tasks.
  • a video codec for machines can be realized by using a traditional codec such as H.266/VVC.
  • another possible design may comprise using a traditional "base" codec, such as H.266/VVC, which additionally comprises one or more neural networks.
  • the one or more neural networks may replace or be an alternative of one of the components of the traditional codec, such as: one or more in-loop filters; one or more intra-prediction modes; one or more inter-prediction modes; one or more transforms; one or more inverse transforms; one or more probability models, for lossless coding; one or more post-processing filters.
  • the one or more neural networks may function as an additional component, such as: one or more additional in-loop filters; one or more additional intra-prediction modes; one or more additional inter-prediction modes; one or more additional transforms; one or more additional inverse transforms; one or more additional probability models, for lossless coding; one or more additional post-processing filters.
  • another possible design may comprise using any codec architecture (such as a traditional codec, or a traditional codec which includes one or more neural networks, or an end-to-end learned codec), and having a post-processing neural network which adapts the output of the decoder so that it can be analyzed more effectively by one or more machines or task neural networks.
  • the encoder and decoder may be conformant to the H.266/VVC standard, a post-processing neural network takes the output of the decoder, and the output of the post-processing neural network is then input to an object detection neural network.
  • the object detection neural network is the machine or task neural network.
  • Figure 8 illustrates an example including an encoder, a decoder, a post-processing filter, and a set of task-NNs.
  • the encoder and decoder may represent a traditional image or video codec, such as a codec conformant with the VVC/H.266 standard, or may represent an end-to-end (E2E) learned image or video codec.
  • the post-processing filter may be a neural network-based filter.
  • the task-NNs may be neural networks that perform tasks such as object detection, object segmentation, object tracking, etc.
  • a probability model may be used in an end-to-end learned codec to estimate the probability distribution of the elements in the latent tensor, which is the output of the encoder.
  • the estimated probability distribution may be used by an arithmetic encoder to encode the latent tensor into a bitstream at the encoding stage, or by an arithmetic decoder to decode the latent tensor from the bitstream at the decoding stage.
  • the probability model estimates the probability distribution of the elements in the input image or video for the arithmetic encoder and decoder to encode and decode the input image or video.
  • The term “latent tensor” may also refer to the input image or video in a lossless image or video compression system.
  • a multi-scale progressive probability model may partition the elements in a latent tensor into multiple groups. The elements in one group may be processed in parallel and the groups may be processed sequentially.
  • Figure 9 shows the architecture of a multi-scale progressive probability model at the encoding stage.
  • Input latent tensor $x^{(0)}$ 910 is first downsampled into a certain number of low-resolution representations, e.g., $x^{(1)}$ 920 and $x^{(2)}$ 930.
  • the downsampling operation may use the nearest-neighbor, bilinear, or bicubic algorithm. If the downsampling algorithm is not the nearest-neighbor method, extra information may be transferred from the encoder to the decoder to recover the round-off error due to the downsampling operation.
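A minimal sketch of building the multi-resolution representations with nearest-neighbor downsampling is shown below. The factor of two per level, the 4x4 toy tensor, and the function names are assumptions for illustration.

```python
def downsample_nearest(tensor, factor=2):
    """Keep every `factor`-th element along both spatial axes."""
    return [row[::factor] for row in tensor[::factor]]

def build_pyramid(tensor, levels):
    """Return [x0, x1, ..., x_levels]; index 0 is the highest resolution."""
    pyramid = [tensor]
    for _ in range(levels):
        pyramid.append(downsample_nearest(pyramid[-1]))
    return pyramid

# 4x4 toy latent tensor holding the values 0..15 in row-major order.
x0 = [[r * 4 + c for c in range(4)] for r in range(4)]
pyramid = build_pyramid(x0, levels=2)
```

Because nearest-neighbor sampling only copies existing values, every low-resolution element is an exact element of the input, so this variant introduces no round-off error that would need to be signaled.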
  • the probability distribution of the elements in the representation at the lowest resolution may be modeled as identically and independently distributed with a Gaussian distribution model, a uniform distribution model, or a mixture of probability distribution models.
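The i.i.d. modeling of the lowest-resolution representation can be illustrated with a discretized Gaussian. The sketch assumes integer-valued (quantized) elements, a fixed mean and scale shared by all elements, and a probability mass taken over a unit-width bin around each integer; these choices and the function names are illustrative assumptions only.

```python
import math

def gaussian_cdf(value, mean, scale):
    """Cumulative distribution function of a Gaussian."""
    return 0.5 * (1.0 + math.erf((value - mean) / (scale * math.sqrt(2.0))))

def symbol_probability(symbol, mean, scale):
    """Probability mass of an integer symbol: Gaussian mass on [s-0.5, s+0.5]."""
    upper = gaussian_cdf(symbol + 0.5, mean, scale)
    lower = gaussian_cdf(symbol - 0.5, mean, scale)
    return upper - lower

def estimated_bits(symbols, mean=0.0, scale=1.0):
    """Ideal code length when all symbols share one i.i.d. model."""
    return -sum(math.log2(symbol_probability(s, mean, scale)) for s in symbols)

lowest_level = [0, 1, -1, 0]     # toy quantized elements at the lowest resolution
bits = estimated_bits(lowest_level)
p_zero = symbol_probability(0, 0.0, 1.0)
```

An arithmetic coder driven by these probabilities would spend approximately `bits` bits on the four elements.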
  • the probability of elements in the latent tensors in resolution levels other than the lowest one may be modeled by a conditional distribution model (whose parameters are estimated by a prediction model), where the conditioning information (also referred to as context) may comprise the representation at lower resolution level.
  • $p^{(i)}$ denotes the estimated parameters of the (conditional) probability distribution model for elements at resolution level $i$.
  • the prediction model may be implemented by a deep neural network with a plurality of parameters.
  • The terms “parameter” and “weight” are used interchangeably for a deep neural network.
  • the latent tensor at the lowest resolution, i.e., $x^{(2)}$ 930 in Figure 9, may be first decoded from the bitstream using the predefined probability distribution model.
  • a multi-scale progressive probability model may use the elements in the latent tensor at resolution level $i$, e.g., $x^{(i)}$ 920, as the context to estimate the parameters of the distribution model for the elements in the latent tensor at a higher resolution level, e.g., $x^{(i-1)}$.
  • the estimated probability distribution may be used by the arithmetic decoder to decode the elements in the bitstream.
  • the procedure may repeat until all elements in the latent tensor at the highest resolution level, i.e., $x^{(0)}$ 910, are decoded.
  • the prediction models at different resolution levels may share the weights or a subset of the weights.
  • the prediction models at different resolutions are the same or substantially the same.
  • the elements in the latent tensor at a resolution level may be further partitioned into several groups.
  • the groups may be processed sequentially.
  • the elements in a group are modeled by independent conditional distribution models using the elements that have already been processed as the context. I.e., the elements of the latent tensor at a resolution level are processed in steps, where the elements in a group associated to a step are processed in parallel.
  • An example of the architecture of the prediction model is shown in Figure 10.
  • the distribution predictor 1010 predicts the parameters of the probability distribution for the elements in the latent representation at resolution level i and step j.
  • $z^{(i,j)}$ is the auxiliary input for the distribution predictor 1010.
  • $x^{(i,0)}$ is derived by upsampling $x^{(i+1)}$. $m^{(i,j)}$ is a binary-valued mask tensor with the same shape as $x^{(i,j)}$, indicating the positions of the elements in $x^{(i,j)}$ that have the true values.
  • $p^{(i,j)}$ denotes the estimated parameters of the probability distribution for the elements in group $j$ at resolution level $i$.
  • the tensor updater component 1020 may update the elements in group $j$ with the corresponding true values in $x^{(i)}$ to generate $x^{(i,j+1)}$, and the mask updater component 1030 may update the mask tensor $m^{(i,j)}$ accordingly to generate $m^{(i,j+1)}$.
  • the calculated $p^{(i,j)}$ is used to decode the corresponding elements from the bitstream.
  • the corresponding values are updated in $x^{(i,j)}$ to generate $x^{(i,j+1)}$, and the mask tensor $m^{(i,j)}$ is updated accordingly to generate $m^{(i,j+1)}$.
  • the prediction model repeats this operation in steps until all elements at the resolution level i are processed.
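The step-wise processing at one resolution level, with a mixture tensor and a binary mask that are updated after each group, might be sketched as follows. The 1-D layout, the two-group partition, and the `predict_means` function (a hypothetical stand-in for the distribution predictor that simply averages already-known neighbours) are assumptions for illustration and not the predictor of the described system.

```python
def upsample_nearest(low, factor=2):
    """Repeat each element `factor` times (1-D nearest-neighbor upsampling)."""
    return [v for v in low for _ in range(factor)]

def predict_means(x_mix, mask, group):
    """Hypothetical predictor: mean of the neighbours flagged as true values,
    otherwise the current estimate at that position."""
    means = {}
    for pos in group:
        known = [x_mix[n] for n in (pos - 1, pos + 1)
                 if 0 <= n < len(x_mix) and mask[n] == 1]
        means[pos] = sum(known) / len(known) if known else x_mix[pos]
    return means

def process_level(x_true, x_lower, groups):
    x_mix = upsample_nearest(x_lower)   # initial estimate from the lower level
    mask = [0] * len(x_true)            # 1 marks positions holding true values
    log = []
    for group in groups:                # groups are handled sequentially
        log.append(predict_means(x_mix, mask, group))   # one batch per group
        for pos in group:               # tensor updater and mask updater
            x_mix[pos] = x_true[pos]
            mask[pos] = 1
    return x_mix, mask, log

x_true = [4, 6, 8, 10]
x_lower = [4, 8]                        # every second element of x_true
groups = [[0, 2], [1, 3]]               # assumed two-group partition
x_final, mask_final, log = process_level(x_true, x_lower, groups)
```

In the second step the predictions for positions 1 and 3 already benefit from the true values written in the first step, which is the purpose of processing the groups progressively.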
  • a probability model used in an end-to-end learned codec may be jointly trained together with the encoder and decoder using a large training dataset.
  • the training may be performed by optimizing a loss function that involves minimizing the divergence of the estimated probability distribution function and the true probability distribution function of the latent tensor.
  • input data may have different characteristics from the data in the training dataset.
  • the model learned from the training dataset may not be optimal when compressing the input data.
  • An inference-time overfitting technique has been developed to improve the performance of the codec.
  • One way to perform model overfitting at the inference stage is to finetune the parameters, or a subset of the parameters, of the components at the decoder side, including the probability model, using the input data.
  • the update of the parameters that have been changed after the finetuning may be compressed and sent to the decoder.
  • the decoder may update the corresponding weights before decoding the bitstream.
  • the overhead caused by the weight update may significantly increase the size of the bitstream and diminish the gain from the decoder overfitting.
  • the “input image” may refer to an image, a frame from a video sequence, a block or slice from an image, or a frame in a video sequence.
  • the present embodiments provide adaptive methods to finetune a multi-scale progressive probability model at inference time.
  • the proposed embodiments may improve the compression performance significantly. Even though some additional information is encoded compared to the case where no overfitting is performed, such additional information may represent only a small fraction of the total bitstream, which is compensated for by the bitrate saving brought by the proposed overfitting technique.
  • This bitrate overhead is likely to be smaller than the overhead incurred when the neural network’s parameters are overfitted at the encoder side and the signal, derived from the overfitted parameter or from the difference between the overfitted parameters and the original parameters, is transmitted to the decoder in or along with the bitstream.
  • the probability model may be adapted to the input data in multiple steps. At each step, the probability model may be finetuned using the data that has already been decoded from the bitstream.
  • the encoder may signal one or more additional parameters to the decoder to perform the finetuning.
  • the present embodiments provide a solution to determine the data that may be used to finetune the probability model when a set of images, in particular, for a certain application, are encoded.
  • the input data of a probability model may be the latent tensor that is the output of the encoder in an end-to-end learned codec or an uncompressed image or video in an end-to-end learned lossless codec.
  • the probability model may process the input data in a multi-scale and progressive manner. I.e., the input data may be downsampled several times, generating representations at different resolution levels.
  • the elements in one resolution level may be partitioned into several groups. The partition may be done along spatial dimensions, temporal dimension, channel dimension, or any combination of these dimensions.
  • the probability model may process the elements in the input data sequentially by a predefined group and resolution order. All elements in one group may be processed in parallel as a batch.
  • the elements that have already been processed may be used by the probability model as the context information to determine the probability distribution of the elements that have not been processed.
  • the determined probability distributions may be used by an arithmetic encoder to encode the input data into a bitstream at the encoding stage, or by an arithmetic decoder to decode a bitstream into output at the decoding stage.
  • the input data of an arithmetic encoder and the output data of an arithmetic decoder are the same or substantially the same.
  • the processing order of the groups and resolution levels may be predefined and hardcoded into the encoder and decoder.
  • the processing order of the groups and resolution levels may be determined at encoding time by the encoder and signaled to the decoder in or along the bitstream.
  • the system may complete all groups in one resolution level before processing the groups in another resolution level.
  • the system may process all the groups with the same channel IDs on all resolution levels before moving to groups with another channel ID.
  • the groups may be processed in an order defined by the encoder, and the processing order is signaled to the decoder along with the bitstream or via a supplementary message.
Probability model overfitting at different resolution levels
  • the prediction model component in a probability model estimates the probability distribution of the elements in the latent representation at a resolution level using the elements at a lower resolution as the context.
  • the estimated probability distribution may be used to calculate the rate loss for compressing the latent representation. For example, at resolution level $i$, the probability distribution of elements in the latent representation $x^{(i)}$ is estimated using latent representation $x^{(i+1)}$ as the context.
  • latent representations $x^{(i+1)}$ and $x^{(i)}$ may be used as the input data and the ground truth data, respectively, to finetune the prediction model to be used at resolution level $i-1$.
  • the finetuning may be performed by minimizing the rate loss calculated from the probability distribution estimation $p^{(i)}$ for resolution level $i$.
  • the parameters of the prediction model may be updated using the gradient descent method, where the gradients are calculated by back-propagating the gradient of the rate loss to the parameters, i.e., by using the back-propagation algorithm.
  • any suitable method for obtaining gradients of a loss function with respect to the parameter of a neural network may be used.
  • the optimization may use a certain optimizer, for example, Adam, with a set of predefined optimization parameters including learning rate.
  • the finetuning may be performed in several iterations. At each iteration, the rate loss for latent representation $x^{(i-1)}$ is calculated using the probability distribution $p^{(i-1)}$ estimated by the finetuned model. The iteration may be terminated when the rate loss for latent representation $x^{(i-1)}$ stops decreasing for a certain number of iterations or a predefined maximum number of iterations is reached. After the finetuning, the encoder records the finetuning parameters and the number of iterations at which the optimal rate loss for latent representation $x^{(i-1)}$ is reached.
  • the encoder may send the recorded finetuning parameters and the number of iterations to the decoder in or along the bitstream.
  • the encoder may compress the finetuning parameters and the number of iterations and send the compressed data to the decoder in or along the bitstream.
  • the rate loss calculation function $R(\cdot)$ used in the algorithm takes the parameters of the estimated distribution function and the ground truth values of the elements as the input and calculates the cross-entropy value as the rate loss.
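A rate loss of this kind can be sketched as a cross entropy measured in bits. The sketch assumes that the distribution parameters are given directly as one categorical probability vector per element, which is only one possible parameterization; the function name `rate_loss` and the toy values are illustrative.

```python
import math

def rate_loss(predicted_probs, true_symbols):
    """Cross entropy in bits: sum over elements of -log2 p(true symbol)."""
    return -sum(math.log2(probs[symbol])
                for probs, symbol in zip(predicted_probs, true_symbols))

# One categorical distribution per element over the alphabet {0, 1, 2, 3}.
predicted = [
    [0.5, 0.25, 0.125, 0.125],
    [0.25, 0.25, 0.25, 0.25],
]
truth = [0, 3]
bits = rate_loss(predicted, truth)   # 1 bit for the first element, 2 for the second
```

The better the predicted distribution concentrates on the true value, the fewer bits the arithmetic coder needs, which is why minimizing this quantity serves as the finetuning objective.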
  • the auxiliary information at resolution level $i+1$ may be used to estimate the probability distributions for latent representation $x^{(i)}$.
  • the best model of the prediction model may be used to estimate the probability distribution of the elements in the latent representation $x^{(i-1)}$ for the arithmetic encoder.
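The encoder-side procedure described above (finetune on the pair of already-coded levels, monitor the rate for the next level, and record the best iteration) can be sketched with a deliberately tiny model. Everything specific here is an assumption for illustration: 1-D signals, a one-parameter predictor whose mean is `theta` times the upsampled context, a unit-variance Gaussian rate proxy, and plain gradient descent in place of an optimizer such as Adam.

```python
import math

def upsample(low):
    """1-D nearest-neighbor upsampling by a factor of two."""
    return [v for v in low for _ in range(2)]

def rate_loss(theta, context, target):
    """Bits-per-element proxy: Gaussian negative log-likelihood, constant dropped."""
    sq = sum(0.5 * (t - theta * c) ** 2 for c, t in zip(context, target))
    return sq / (len(target) * math.log(2))

def rate_grad(theta, context, target):
    """Derivative of rate_loss with respect to theta."""
    g = sum(-(t - theta * c) * c for c, t in zip(context, target))
    return g / (len(target) * math.log(2))

def encoder_finetune(x_ip1, x_i, x_im1, theta, lr=0.01, max_iters=50, patience=3):
    """Finetune on (x^(i+1) -> x^(i)); keep the iteration that best codes x^(i-1)."""
    train_ctx, eval_ctx = upsample(x_ip1), upsample(x_i)
    best_theta, best_iter = theta, 0
    best_rate = rate_loss(theta, eval_ctx, x_im1)
    stale = 0
    for it in range(1, max_iters + 1):
        theta -= lr * rate_grad(theta, train_ctx, x_i)
        rate = rate_loss(theta, eval_ctx, x_im1)
        if rate < best_rate:
            best_theta, best_iter, best_rate, stale = theta, it, rate, 0
        else:
            stale += 1
            if stale >= patience:   # rate stopped decreasing
                break
    return best_theta, best_iter, best_rate

x_im1 = [2, 3, 4, 5, 6, 7, 8, 9]    # level i-1, still to be encoded
x_i = x_im1[::2]                    # level i
x_ip1 = x_i[::2]                    # level i+1
baseline = rate_loss(1.0, upsample(x_i), x_im1)
best_theta, best_iter, best_rate = encoder_finetune(x_ip1, x_i, x_im1, theta=1.0)
```

Only `best_iter` (and, if not predefined, the learning rate and similar settings) would need to be signaled, since the training pair itself is available on both sides.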
  • the finetuning parameters and the number of iterations are obtained. If the encoder has signaled this information to the decoder, the finetuning parameters and the number of iterations are decoded from the bitstream or a supplementary message carrying this information. Alternatively, the finetuning parameters may have been predefined in a decoder. Then, the decoder may perform the probability finetuning while decoding the elements from the bitstream. The following example algorithm describes the finetuning procedure at the decoder side.
  • Given: latent representations $x^{(i+1)}$ and $x^{(i)}$ at resolution levels $i+1$ and $i$, respectively; an optimizer with a learning rate and other finetuning parameters; a pretrained prediction model $P$ to be used at resolution level $i-1$ with learnable parameters $\theta$; a rate loss calculation function $R(\cdot)$; and the number of iterations best_iter.
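Given these inputs, the decoder-side finetuning can be sketched as replaying exactly best_iter updates on data that has already been decoded. The one-parameter model, the Gaussian rate proxy, the plain gradient descent update, and all names below are assumptions for illustration rather than the procedure of the described system.

```python
import math

def upsample(low):
    """1-D nearest-neighbor upsampling by a factor of two."""
    return [v for v in low for _ in range(2)]

def rate_grad(theta, context, target):
    """Gradient of a Gaussian negative log-likelihood (in bits) w.r.t. theta."""
    scale = len(target) * math.log(2)
    return sum(-(t - theta * c) * c for c, t in zip(context, target)) / scale

def decoder_finetune(x_ip1, x_i, theta, best_iter, lr=0.01):
    """Replay exactly `best_iter` updates using only already-decoded data."""
    context = upsample(x_ip1)
    for _ in range(best_iter):
        theta -= lr * rate_grad(theta, context, x_i)
    return theta

x_ip1, x_i = [2, 6], [2, 4, 6, 8]   # decoded levels i+1 and i (toy values)
theta_pretrained = 1.0
theta_used = decoder_finetune(x_ip1, x_i, theta_pretrained, best_iter=2)
theta_unchanged = decoder_finetune(x_ip1, x_i, theta_pretrained, best_iter=0)
```

The resulting `theta_used` would then parameterize the model that predicts the distributions for level i-1. A scheme like this relies on the encoder and decoder performing numerically identical updates, since any mismatch in the parameters would desynchronize the arithmetic decoder.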
  • the elements in latent representation at a resolution level may be partitioned into several groups.
  • the elements in one group may be processed in parallel while different groups may be processed sequentially in steps. In each step, elements in one group are processed in parallel.
  • the distribution predictor 1010 in the prediction model component may be finetuned using the groups of elements that have already been processed.
  • the encoder may determine the parameters for the finetuning, including the number of iterations and learning rate.
  • the finetuning parameters may be signaled to the decoder along with the bitstream or via a supplementary message. According to another embodiment, the finetuning parameters may be compressed before being signaled to the decoder.
  • Let $x^{(i,j)}$ be the mixture latent representation at resolution level $i$ and step $j$, where the positions of the elements that have already been processed are set to the true values and the other positions are set to the estimated values.
  • the following example algorithm illustrates the finetuning procedure at the encoder side at resolution level i and step j.
  • the finetuned distribution predictor model may be used to estimate the probability distribution of the elements in step j + 1.
  • auxiliary information at step j may be used to estimate the probability distributions for step j+1.
  • the best model of the prediction model may be used to estimate the probability distribution of the elements for step j+1 and the estimated probability distribution may be used by the arithmetic encoder to encode the elements to be processed in step j+1 to a bitstream.
  • the finetuning parameters and the number of iterations are obtained. If the encoder has signaled this information to the decoder, the finetuning parameters and the number of iterations are decoded from the bitstream or a supplementary message. Alternatively, the finetuning parameters may have been predefined in the decoder. Next, a finetuning procedure may be performed. An example of the finetuning procedure is given in the following:
  • the finetuned model may be used to estimate the probability distributions of the elements for step j+1.
  • the auxiliary information at step j may be used to estimate the probability distributions for step j+1.
  • the estimated probability distributions may be used by the arithmetic decoder to decode the elements for step j+1 from the bitstream.
Efficient finetuning technique on a batch of images or video frames
  • the proposed finetuning techniques may increase the encoding and decoding complexity because of the iterative finetuning procedures.
  • the present embodiments also provide a solution to reduce the encoding and decoding complexity when the input data is a set of images, blocks from an image or frame, slices from an image or frame, or frames from one or more video sequences, in particular when the input images, blocks, slices, or frames share common characteristics, for example, a dataset containing images in a specific domain, such as radiographic images for medical purposes.
  • The term “image” may refer to an image, a block from an image or frame, a slice from an image or frame, or a frame from a video sequence.
  • The term “input dataset” may refer to a collection of images, one or more blocks from an image or frame, one or more slices from an image or frame, or one or more frames from a video sequence.
  • the encoder may select a representative image from images in the input dataset.
  • the encoder may perform probability model finetuning at each resolution level and each step when encoding the representative image.
  • the representative image may be selected according to the similarity, for example, measured by MSE in the pixel domain, to other images in the input dataset.
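One way to read this selection rule is to pick the image with the smallest total pixel-domain MSE to all other images in the dataset. The sketch below follows that reading, which is an assumption, as are the flattened toy images and the function names.

```python
def mse(a, b):
    """Mean squared error between two flattened images of equal size."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

def select_representative(images):
    """Index of the image with the smallest total MSE to all other images."""
    totals = [sum(mse(img, other) for other in images) for img in images]
    return min(range(len(images)), key=totals.__getitem__)

dataset = [
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [2, 2, 2, 2],
    [9, 9, 9, 9],
]
representative = select_representative(dataset)
```

The computation is quadratic in the number of images, so for large datasets a subsample or a feature-domain distance may be more practical.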
  • the finetuned probability models at each resolution level and each step are stored by the decoder while decoding the representative image.
  • the encoder may determine whether the original pretrained model or the model finetuned on the representative image shall be used to decode an input image. The decision may be signaled to the decoder in or along the bitstream.
  • One example of this embodiment is to use the first image as the representative image. There may be more than one representative image, and more than one finetuned probability model may be obtained by using the one or more representative images.
  • the images in the input dataset may be clustered into a number of clusters, for example, using the k-means algorithm with the distance measured by MSE in the pixel domain or in a feature domain.
  • the centroid image of each cluster is used as the representative image for the images belonging to the same cluster.
  • the encoder may signal an indicator to the decoder, and the decoder may store the finetuned probability models to be used for other images.
  • the encoder may signal the ID of the model finetuned on a centroid image to the decoder.
  • the encoder and decoder may perform the probability model finetuning using the model finetuned by a representative image as the base model and continue the finetuning from the base model.
  • the encoder may determine the base model for the probability model finetuning.
  • the identifier (ID) of the base model may be signaled to the decoder to perform the probability model finetuning.
  • the performance of the proposed method has been evaluated using a lossless image compression codec with a multi-scale progressive probability model.
  • When the probability model is finetuned, the number of iterations is sent to the decoder along with the bitstream.
  • the probability model may be finetuned eight times. Here, one byte is used for the number of iterations. With this setup, the total overhead of the bitstream is eight bytes for each image.
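The overhead figure above can be put in perspective by converting it to bits per pixel. The 512x512 image size used below is an assumption chosen for illustration; only the eight finetunings and the one byte per iteration count come from the text.

```python
def overhead_bpp(num_finetunings, bytes_per_count, width, height):
    """Signaling overhead in bits per pixel for one image."""
    overhead_bits = num_finetunings * bytes_per_count * 8
    return overhead_bits / (width * height)

overhead_bytes = 8 * 1                     # eight finetunings, one byte each
bpp_512 = overhead_bpp(8, 1, 512, 512)     # 64 bits spread over 262144 pixels
```

For an image of this assumed size the overhead is well below a thousandth of a bit per pixel, so it only needs a very small rate saving to pay for itself.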
  • the baseline is a multi-scale progressive probability model without adaptive technology.
  • the proposed method uses the same multi-scale progressive probability model with adaptive technology at every scale and every step.
  • the numbers in Table 1 are bit rates measured in bits-per-pixel (BPP).
  • the method generally comprises receiving 1110 a representation of input media; performing 1120 a compression of the representation by means of a probability model and arithmetic encoder to generate bitstreams to be delivered to a decoder, wherein the probability model comprises a prediction model to estimate probability distribution of the elements of the representation to be encoded, and wherein the arithmetic encoder is configured to encode elements to bitstream using at least the estimated probability distributions; adapting 1130 the prediction model according to a set of finetuning parameters by using elements that have already been processed as true values, wherein finetuning comprises minimizing a rate loss calculated by the probability distribution estimation for a resolution level; wherein the method further comprises selecting 1140 a prediction model with optimal rate loss as a best model and using 1150 the best model as the probability model.
  • Each of the steps can be implemented by a respective module of a computer system.
  • An apparatus comprises means for receiving a representation of input media; means for performing a compression of the representation by means of a probability model and arithmetic encoder to generate bitstreams to be delivered to a decoder, wherein the probability model comprises a prediction model to estimate probability distribution of the elements of the representation to be encoded, and wherein the arithmetic encoder is configured to encode elements to bitstream using at least the estimated probability distributions; means for adapting the prediction model according to a set of finetuning parameters by using elements that have already been processed as true values, wherein finetuning comprises minimizing a rate loss calculated by the probability distribution estimation for a resolution level; means for selecting a prediction model with optimal rate loss as a best model and means for using the best model as the probability model.
  • the means comprises at least one processor, and a memory including a computer program code, wherein the processor may further comprise processor circuitry.
  • the memory and the computer program code are configured to, with the at least one processor, cause the apparatus to perform the method of Figure 11 according to various embodiments.
  • the method for decoding generally comprises receiving an encoded bitstream 1200; obtaining 1210 a set of finetuning parameters; decompressing 1220 the bitstream to generate a representation of an output media by means of a probability model and arithmetic decoder, wherein the probability model comprises a prediction model to estimate probability distribution of the elements of the representation, and wherein the arithmetic decoder is configured to decode elements from the bitstream using the estimated probability distributions; adapting 1230 the prediction model according to at least a set of finetuning parameters by using elements that have already been processed as true values, wherein finetuning comprises minimizing rate loss calculated by the probability distribution estimation for a resolution level; selecting 1240 a prediction model as the best model according to the set of finetuning parameters; and using 1250 the best model as the probability model.
  • Each of the steps can be implemented by a respective module of a computer system.
  • An apparatus comprises means for receiving an encoded bitstream; means for obtaining finetuning parameters; means for decompressing the bitstream to generate a representation of an output media by means of a probability model and arithmetic decoder, wherein the probability model comprises a prediction model to estimate probability distribution of the elements of the representation, and wherein the arithmetic decoder is configured to decode elements from the bitstream using the estimated probability distributions; means for adapting the prediction model according to at least a set of finetuning parameters by using elements that have already been processed as true values, wherein finetuning comprises minimizing rate loss calculated by the probability distribution estimation for a resolution level; wherein the apparatus further comprises means for selecting a prediction model as a best model according to the set of finetuning parameters; and means for using the best model as the probability model.
  • the means comprises at least one processor, and a memory including a computer program code, wherein the processor may further comprise processor circuitry.
  • the memory and the computer program code are configured to, with the at least one processor, cause the apparatus to perform the method of Figure 12 according to various embodiments.
  • the apparatus is a user equipment for the purposes of the present embodiments.
  • the apparatus 90 comprises a main processing unit 91, a memory 92, a user interface 94, and a communication interface 93.
  • the apparatus may also comprise a camera module 95.
  • the apparatus may be configured to receive image and/or video data from an external camera device over a communication network.
  • the memory 92 stores data including computer program code in the apparatus 90.
  • the computer program code is configured to implement the method according to various embodiments by means of various computer modules.
  • the camera module 95 or the communication interface 93 receives data, in the form of images or video stream, to be processed by the processor 91.
  • the communication interface 93 forwards processed data, i.e., the image file, for example to a display of another device, such as a virtual reality headset.
  • the apparatus 90 is a video source comprising the camera module 95
  • user inputs may be received from the user interface.
  • a device may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the device to carry out the features of an embodiment.
  • a network device like a server may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the network device to carry out the features of various embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The embodiments relate to a method for encoding, comprising receiving a representation of input media; performing a compression of the representation by means of a probability model and arithmetic encoder to generate bitstreams to be delivered to a decoder, wherein the probability model comprises a prediction model to estimate probability distribution of the elements of the representation to be encoded, and wherein the arithmetic encoder is configured to encode elements to bitstream using the estimated probability distributions; adapting the prediction model by using elements that have already been processed as true values, wherein finetuning comprises minimizing rate loss calculated by the probability distribution estimation for a resolution level; wherein the method further comprises selecting a prediction model with optimal rate loss as a best model and using the best model as the probability model. In addition, the present embodiments relate to a method for decoding, and apparatus for the methods.

Description

A METHOD, AN APPARATUS AND A COMPUTER PROGRAM PRODUCT FOR VIDEO CODING
Technical Field
The present solution generally relates to video encoding and decoding.
Background
One of the aims of image and video compression is to compress data while maintaining a quality that satisfies human perception. However, with recent developments in machine learning, machines can replace humans in analyzing data, for example to detect events and/or objects in video or images. Thus, when decoded image data is consumed by machines, the quality required of the compression can differ from the quality that humans would accept. Therefore, the concept of Video Coding for Machines (VCM) has been introduced.
Summary
The scope of protection sought for various embodiments of the invention is set out by the independent claims. The embodiments and features, if any, described in this specification that do not fall under the scope of the independent claims are to be interpreted as examples useful for understanding various embodiments of the invention.
Various aspects include a method, an apparatus and a computer readable medium comprising a computer program stored therein, which are characterized by what is stated in the independent claims. Various embodiments are disclosed in the dependent claims.
According to a first aspect, there is provided an apparatus for encoding comprising means for receiving a representation of input media; means for performing a compression of the representation by means of at least a probability model and arithmetic codec to generate bitstreams to be delivered to a decoder, wherein the probability model comprises a prediction model to estimate probability distribution of the elements of the representation to be encoded, and wherein the arithmetic encoder is configured to encode elements to bitstream using at least the estimated probability distributions; means to adapt the prediction model according to a set of finetuning parameters by using elements that have already been processed as true values, wherein finetuning comprises minimizing a rate loss calculated by the probability distribution estimation for a resolution level; wherein the apparatus further comprises means for selecting a prediction model with optimal rate loss as a best model; and means for using the best model as the probability model. 
According to a second aspect, there is provided an apparatus for decoding, comprising means for receiving an encoded bitstream; means for obtaining finetuning parameters; means for decompressing the bitstream to generate a representation of an output media by means of a probability model and arithmetic decoder, wherein the probability model comprises a prediction model to estimate probability distribution of the elements of the representation, and wherein the arithmetic decoder is configured to decode elements from the bitstream using at least the estimated probability distributions; means for adapting the prediction model according to at least a set of finetuning parameters by using elements that have already been processed as true values, wherein finetuning comprises minimizing a rate loss calculated by the probability distribution estimation for a resolution level; wherein the apparatus further comprises means for selecting a prediction model as a best model according to the set of finetuning parameters; and means for using the best model as the probability model.
According to a third aspect, there is provided a method for encoding, comprising: receiving a representation of input media; performing a compression of the representation by means of at least a probability model and arithmetic encoder to generate bitstreams to be delivered to a decoder, wherein the probability model comprises a prediction model to estimate probability distribution of the elements of the representation to be encoded, and wherein the arithmetic encoder is configured to encode elements to bitstream using at least the estimated probability distributions; adapting the prediction model according to a set of finetuning parameters by using elements that have already been processed as true values, wherein finetuning comprises minimizing a rate loss calculated by the probability distribution estimation for a resolution level; wherein the method further comprises selecting a prediction model with optimal rate loss as a best model and using the best model as the probability model.
According to a fourth aspect, there is provided a method for decoding, comprising: receiving an encoded bitstream; obtaining finetuning parameters; decompressing the bitstream to generate a representation of an output media by means of a probability model and arithmetic decoder, wherein the probability model comprises a prediction model to estimate probability distribution of the elements of the representation, and wherein the arithmetic decoder is configured to decode elements from the bitstream using at least the estimated probability distributions; adapting the prediction model according to at least a set of finetuning parameters by using elements that have already been processed as true values, wherein finetuning comprises minimizing a rate loss calculated by the probability distribution estimation for a resolution level; wherein the method further comprises selecting a prediction model as a best model according to the set of finetuning parameters; using the best model as the probability model.
According to a fifth aspect, there is provided an apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following: receive a representation of input media; perform a compression of the representation by means of at least a probability model and arithmetic encoder to generate bitstreams to be delivered to a decoder, wherein the probability model comprises a prediction model to estimate probability distribution of the elements of the representation to be encoded, and wherein the arithmetic encoder is configured to encode elements to bitstream using at least the estimated probability distributions; adapt the prediction model according to a set of finetuning parameters by using elements that have already been processed as true values, wherein finetuning comprises minimizing a rate loss calculated by the probability distribution estimation for a resolution level; and select a prediction model with optimal rate loss as a best model and use the best model as the probability model.
According to a sixth aspect, there is provided an apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following: receive an encoded bitstream; obtain finetuning parameters; decompress the bitstream to generate a representation of an output media by means of a probability model and arithmetic decoder, wherein the probability model comprises a prediction model to estimate probability distribution of the elements of the representation, and wherein the arithmetic decoder is configured to decode elements from the bitstream using the estimated probability distributions; adapt the prediction model according to a set of finetuning parameters by using elements that have already been processed as true values, wherein finetuning comprises minimizing rate loss calculated by the probability distribution estimation for a resolution level; wherein the apparatus further comprises means for selecting a prediction model as a best model according to the set of finetuning parameters; and use the best model as the probability model.
According to a seventh aspect, there is provided a computer program product comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to: receive a representation of input media; perform a compression of the representation by means of at least a probability model and arithmetic encoder to generate bitstreams to be delivered to a decoder, wherein the probability model comprises a prediction model to estimate probability distribution of the elements of the representation to be encoded, and wherein the arithmetic encoder is configured to encode elements to bitstream using at least the estimated probability distributions; adapt the prediction model according to a set of finetuning parameters by using elements that have already been processed as true values, wherein finetuning comprises minimizing a rate loss calculated by the probability distribution estimation for a resolution level; and select a prediction model with optimal rate loss as a best model and use the best model as the probability model.
According to an eighth aspect, there is provided a computer program product comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to: receive an encoded bitstream; obtain finetuning parameters; decompress the bitstream to generate a representation of an output media by means of a probability model and arithmetic decoder, wherein the probability model comprises a prediction model to estimate probability distribution of the elements of the representation, and wherein the arithmetic decoder is configured to decode elements from the bitstream using at least the estimated probability distributions; adapt the prediction model according to the set of finetuning parameters by using elements that have already been processed as true values, wherein finetuning comprises minimizing a rate loss calculated by the probability distribution estimation for a resolution level; wherein the apparatus further comprises means for selecting a prediction model as a best model according to the set of finetuning parameters; and use the best model as the probability model.
According to an embodiment, the probability model is configured to partition the elements into a plurality of groups, and estimate the probability distribution of elements in said groups and adapt the prediction model by using groups that have already been processed as true values.
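As an illustration of adapting a model with already processed elements, the following is a minimal sketch and not the claimed method. It assumes a simple count-based estimator in place of the neural prediction model, and it measures the rate loss as the ideal code length, -log2(p), which an arithmetic coder approaches. The names `AdaptiveModel` and `code_length_bits` are hypothetical.

```python
import math


class AdaptiveModel:
    """Count-based stand-in for a prediction model (illustrative assumption)."""

    def __init__(self, alphabet_size):
        # Start from a uniform estimate: one pseudo-count per symbol.
        self.counts = [1] * alphabet_size

    def prob(self, symbol):
        return self.counts[symbol] / sum(self.counts)

    def update(self, symbol):
        # Adapt using an element that has already been processed as the true value.
        self.counts[symbol] += 1


def code_length_bits(symbols, alphabet_size, adapt):
    """Ideal code length (rate loss) in bits for the whole symbol sequence."""
    model = AdaptiveModel(alphabet_size)
    bits = 0.0
    for s in symbols:
        bits += -math.log2(model.prob(s))  # cost of coding s with the current estimate
        if adapt:
            model.update(s)                # only past symbols influence future estimates
    return bits


symbols = [0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 1]
static_bits = code_length_bits(symbols, 3, adapt=False)
adaptive_bits = code_length_bits(symbols, 3, adapt=True)
```

Because each update uses only symbols that have already been coded, a decoder that applies the same updates after decoding each symbol would hold the same estimates as the encoder. On this skewed sequence the adapted model needs fewer bits than the fixed uniform one.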
According to an embodiment, the partition is based on one of the following: spatial dimensions, temporal dimensions, channel dimensions.
According to an embodiment, the representation is a latent representation of an image, or a frame of a video sequence, or a block of an image, or a slice of an image, where the latent representation may be obtained by one or more processing steps such as by applying a neural network on the input media; or the representation is an unprocessed or substantially unprocessed version of the input media or an uncompressed image or video.
According to an embodiment, parameters used in finetuning are delivered to a decoder.
According to an embodiment, the parameters further comprise a number of iterations needed for the finetuning.
According to an embodiment, the apparatus for encoding further comprises means for determining representative data from the input dataset; and means for adapting the probability model to the representative data; and means for using the adapted probability model for other data in the input dataset.
According to an embodiment, the apparatus for encoding further comprises means for clustering the input data into a plurality of clusters; and means for adapting the probability model to the centroid data of each cluster; and means for assigning an identifier (ID) to the adapted probability model for a centroid data; and means for using the adapted probability model for the centroid data in a cluster to other data in the same cluster; and means for delivering the ID of the adapted probability model to the decoder.
According to an embodiment, the computer program product is embodied on a non-transitory computer readable medium.
Description of the Drawings
In the following, various embodiments will be described in more detail with reference to the appended drawings, in which
Fig. 1 shows an example of a codec with neural network (NN) components;
Fig. 2 shows another example of a video coding system with neural network components;
Fig. 3 shows an example of a neural auto-encoder architecture;
Fig. 4 shows an example of a neural network-based end-to-end learned video coding system;
Fig. 5 shows an example of a video coding for machines;
Fig. 6 shows an example of a pipeline for end-to-end learned system;
Fig. 7 shows an example of training an end-to-end learned system;
Fig. 8 shows an example of an end-to-end learned image or video codec;
Fig. 9 shows an example of a multi-scale probability model;
Fig. 10 shows an example of prediction model;
Fig. 11 is a flowchart illustrating a method for encoding according to an embodiment;
Fig. 12 is a flowchart illustrating a method for decoding according to an embodiment; and
Fig. 13 shows an apparatus according to an embodiment.
Description of Example Embodiments
The following description and drawings are illustrative and are not to be construed as unnecessarily limiting. The specific details are provided for a thorough understanding of the disclosure. However, in certain instances, well-known or conventional details are not described in order to avoid obscuring the description. References to one or an embodiment in the present disclosure can be, but not necessarily are, reference to the same embodiment and such references mean at least one of the embodiments.
Reference in this specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the disclosure.
The present embodiments relate to an adaptive probability model for video coding.
Before discussing the present embodiments in more detail, a short reference to related technology is given.
A neural network (NN) is a computation graph consisting of several layers of computation. Each layer consists of one or more units, where each unit performs an elementary computation. A unit is connected to one or more other units, and the connection may be associated with a weight. The weight may be used for scaling the signal passing through the associated connection. Weights are learnable parameters, i.e., values which can be learned from training data. There may be other learnable parameters, such as those of batch-normalization layers.
Two widely used architectures for neural networks are feed-forward and recurrent architectures. Feed-forward neural networks are such that there is no feedback loop: each layer takes input from one or more of the preceding layers and provides its output as the input for one or more of the subsequent layers. Also, units inside a certain layer take input from units in one or more of the preceding layers and provide output to one or more of the following layers.
Initial layers (those close to the input data) extract semantically low-level features such as edges and textures in images, and intermediate and final layers extract more high-level features. After the feature extraction layers there may be one or more layers performing a certain task, such as classification, semantic segmentation, object detection, denoising, style transfer, super-resolution, etc. In recurrent neural nets, there is a feedback loop, so that the network becomes stateful, i.e., it is able to memorize information or a state. Neural networks are being utilized in an ever-increasing number of applications for many different types of devices, such as mobile phones. Examples include image and video analysis and processing, social media data analysis, device usage data analysis, etc.
One of the important properties of neural networks (and other machine learning tools) is that they are able to learn properties from input data, in either a supervised or an unsupervised way. Such learning is a result of a training algorithm, or of a meta-level neural network providing the training signal.
In general, the training algorithm consists of changing some properties of the neural network so that its output is as close as possible to a desired output. For example, in the case of classification of objects in images, the output of the neural network can be used to derive a class or category index which indicates the class or category that the object in the input image belongs to. Training usually happens by minimizing or decreasing the output’s error, also referred to as the loss. Examples of losses are mean squared error, cross-entropy, etc. In recent deep learning techniques, training is an iterative process, where at each iteration the algorithm modifies the weights of the neural net to make a gradual improvement of the network’s output, i.e., to gradually decrease the loss.
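The iterative training described above can be sketched with a deliberately tiny example. It assumes a one-parameter model y ≈ w·x, a mean squared error loss and plain gradient descent; the function name `train`, the learning rate and the data are illustrative choices, not taken from the embodiments.

```python
def train(xs, ys, lr=0.05, iterations=200):
    """Fit y ≈ w * x by gradient descent on the mean squared error."""
    w = 0.0
    losses = []
    n = len(xs)
    for _ in range(iterations):
        # Loss: mean squared error between the model output and the desired output.
        loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / n
        losses.append(loss)
        # Gradient of the loss with respect to the single weight w.
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        # Each iteration modifies the weight slightly to decrease the loss.
        w -= lr * grad
    return w, losses


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w, losses = train(xs, ys)
```

With this data the weight converges toward 2, and the recorded losses fall from their initial value toward zero.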
In this description, the terms “model” and “neural network” are used interchangeably, and the weights of neural networks are sometimes referred to as learnable parameters or simply as parameters.
Training a neural network is an optimization process. The goal of the optimization or training process is to make the model learn the properties of the data distribution from a limited training dataset. In other words, the goal is to learn to use a limited training dataset in order to learn to generalize to previously unseen data, i.e., data which was not used for training the model. This is usually referred to as generalization. In practice, data may be split into at least two sets, the training set and the validation set. The training set is used for training the network, i.e., to modify its learnable parameters in order to minimize the loss. The validation set is used for checking the performance of the network on data, which was not used to minimize the loss, as an indication of the final performance of the model. In particular, the errors on the training set and on the validation set are monitored during the training process to understand the following things:
If the network is learning at all - in this case, the training set error should decrease, otherwise the model is in the regime of underfitting.
If the network is learning to generalize - in this case, also the validation set error needs to decrease and to be not too much higher than the training set error. If the training set error is low, but the validation set error is much higher than the training set error, or it does not decrease, or it even increases, the model is in the regime of overfitting. This means that the model has just memorized the training set’s properties and performs well only on that set but performs poorly on a set not used for tuning its parameters.
Lately, neural networks have been used for compressing and decompressing data such as images, i.e., in an image codec. The most widely used architecture for realizing one component of an image codec is the auto-encoder, which is a neural network consisting of two parts: a neural encoder and a neural decoder. The neural encoder takes as input an image and produces a code which requires fewer bits than the input image. This code may be obtained by applying a binarization or quantization process to the output of the encoder. The neural decoder takes in this code and reconstructs the image which was input to the neural encoder.
Such a neural encoder and neural decoder may be trained to minimize a combination of bitrate and distortion, where the distortion may be based on one or more of the following metrics: Mean Squared Error (MSE), Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index Measure (SSIM), or similar. These distortion metrics are meant to be correlated to the human visual perception quality, so that minimizing or maximizing one or more of these distortion metrics results in improving the visual quality of the decoded image as perceived by humans.
A video codec comprises an encoder that transforms the input video into a compressed representation suited for storage/transmission and a decoder that can decompress the compressed video representation back into a viewable form. An encoder may discard some information in the original video sequence in order to represent the video in a more compact form (that is, at lower bitrate).
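For concreteness, two of the metrics named above and one possible combined objective can be sketched as follows. The weighted sum `rate + lam * distortion` is a common way of combining bitrate and distortion and is an assumption here; the text does not fix the form of the combination. The function names and the sample values are illustrative.

```python
import math


def mse(original, decoded):
    """Mean squared error between two equally long sample sequences."""
    return sum((a - b) ** 2 for a, b in zip(original, decoded)) / len(original)


def psnr(original, decoded, max_value=255.0):
    """Peak signal-to-noise ratio in dB; higher means less distortion."""
    error = mse(original, decoded)
    if error == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / error)


def rate_distortion_loss(rate_bits, distortion, lam):
    """Assumed training objective: bitrate plus weighted distortion."""
    return rate_bits + lam * distortion


original = [52, 55, 61, 66]
decoded = [50, 56, 60, 68]
distortion = mse(original, decoded)
quality_db = psnr(original, decoded)
loss = rate_distortion_loss(rate_bits=24.0, distortion=distortion, lam=0.5)
```

MSE is minimized while PSNR is maximized, which is why the text speaks of minimizing or maximizing the metrics. A larger `lam` favours quality over bitrate in this sketch.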
The H.264/AVC standard was developed by the Joint Video Team (JVT) of the Video Coding Experts Group (VCEG) of the Telecommunications Standardization Sector of International Telecommunication Union (ITU-T) and the Moving Picture Experts Group (MPEG) of International Organisation for Standardization (ISO) / International Electrotechnical Commission (IEC). The H.264/AVC standard is published by both parent standardization organizations, and it is referred to as ITU-T Recommendation H.264 and ISO/IEC International Standard 14496-10, also known as MPEG-4 Part 10 Advanced Video Coding (AVC). Extensions of the H.264/AVC include Scalable Video Coding (SVC) and Multiview Video Coding (MVC).
The High Efficiency Video Coding (H.265/HEVC a.k.a. HEVC) standard was developed by the Joint Collaborative Team - Video Coding (JCT-VC) of VCEG and MPEG. The standard was published by both parent standardization organizations, and it is referred to as ITU-T Recommendation H.265 and ISO/IEC International Standard 23008-2, also known as MPEG-H Part 2 High Efficiency Video Coding (HEVC). Later versions of H.265/HEVC included scalable, multiview, fidelity range, three-dimensional, and screen content coding extensions which may be abbreviated SHVC, MV-HEVC, REXT, 3D-HEVC, and SCC, respectively. Versatile Video Coding (H.266 a.k.a. VVC), defined in ITU-T Recommendation H.266 and equivalently in ISO/IEC 23090-3 (also referred to as MPEG-I Part 3), is a video compression standard developed as the successor to HEVC.
A specification of the AV1 bitstream format and decoding process was developed by the Alliance for Open Media (AOM). The AV1 specification was published in 2018. AOM is reportedly working on the AV2 specification.
An elementary unit for the input to a video encoder and the output of a video decoder, respectively, in most cases is a picture. A picture given as an input to an encoder may also be referred to as a source picture, and a picture decoded by a decoder may be referred to as a decoded picture or a reconstructed picture.
The source and decoded pictures each comprise one or more sample arrays, such as one of the following sets of sample arrays:
Luma (Y) only (monochrome),
Luma and two chroma (YCbCr or YCgCo),
Green, Blue and Red (GBR, also known as RGB),
Arrays representing other unspecified monochrome or tri-stimulus color samplings (for example, YZX, also known as XYZ).
A component may be defined as an array or single sample from one of the three sample arrays (luma and two chroma) that compose a picture, or the array or a single sample of the array that compose a picture in monochrome format.
Hybrid video codecs, for example ITU-T H.263 and H.264, may encode the video information in two phases. Firstly, pixel values in a certain picture area (or “block”) are predicted, for example by motion compensation means (finding and indicating an area in one of the previously coded video frames that corresponds closely to the block being coded) or by spatial means (using the pixel values around the block to be coded in a specified manner). Secondly, the prediction error, i.e., the difference between the predicted block of pixels and the original block of pixels, is coded. This may be done by transforming the difference in pixel values using a specified transform (e.g., Discrete Cosine Transform (DCT) or a variant of it), quantizing the coefficients and entropy coding the quantized coefficients. By varying the fidelity of the quantization process, the encoder can control the balance between the accuracy of the pixel representation (picture quality) and the size of the resulting coded video representation (file size or transmission bitrate).
Inter prediction, which may also be referred to as temporal prediction, motion compensation, or motion-compensated prediction, exploits temporal redundancy. In inter prediction the sources of prediction are previously decoded pictures.
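The two-phase scheme (prediction, then transform coding of the prediction error) can be sketched on a one-dimensional block of four samples. This is a simplified illustration under stated assumptions: a flat prediction from a neighbouring sample, an orthonormal DCT-II, uniform quantization with a single step size, and no entropy coding. Real codecs use two-dimensional integer transforms; the function names are hypothetical.

```python
import math


def dct(block):
    """Orthonormal DCT-II of a 1-D block."""
    n = len(block)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * sum(
            x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(block)))
    return out


def idct(coeffs):
    """Inverse of dct() (orthonormal DCT-III)."""
    n = len(coeffs)
    out = []
    for i in range(n):
        total = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            total += scale * c * math.cos(math.pi * (i + 0.5) * k / n)
        out.append(total)
    return out


def code_block(block, prediction, step):
    """Predict, transform the residual, quantize; return levels and reconstruction."""
    residual = [b - p for b, p in zip(block, prediction)]
    levels = [round(c / step) for c in dct(residual)]    # quantized coefficients
    decoded_residual = idct([q * step for q in levels])  # decoder-side inverse
    reconstruction = [p + r for p, r in zip(prediction, decoded_residual)]
    return levels, reconstruction


def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


block = [104.0, 107.0, 109.0, 112.0]
prediction = [103.0] * 4  # assumed spatial prediction from a neighbouring sample
fine_levels, fine_rec = code_block(block, prediction, step=1.0)
coarse_levels, coarse_rec = code_block(block, prediction, step=8.0)
```

The coarse step produces smaller levels with more zeros, which are cheaper to entropy-code, at the cost of a larger reconstruction error. This is the quality versus bitrate balance that the quantization fidelity controls.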
Intra prediction utilizes the fact that adjacent pixels within the same picture are likely to be correlated. Intra prediction can be performed in spatial or transform domain, i.e., either sample values or transform coefficients can be predicted. Intra prediction is typically exploited in intra coding, where no inter prediction is applied.
One outcome of the coding procedure is a set of coding parameters, such as motion vectors and quantized transform coefficients. Many parameters can be entropy-coded more efficiently if they are predicted first from spatially or temporally neighboring parameters. For example, a motion vector may be predicted from spatially adjacent motion vectors and only the difference relative to the motion vector predictor may be coded. Prediction of coding parameters and intra prediction may be collectively referred to as in-picture prediction.
The decoder reconstructs the output video by applying prediction means similar to the encoder to form a predicted representation of the pixel blocks (using the motion or spatial information created by the encoder and stored in the compressed representation) and prediction error decoding (inverse operation of the prediction error coding recovering the quantized prediction error signal in spatial pixel domain). After applying prediction and prediction error decoding means, the decoder sums up the prediction and prediction error signals (pixel values) to form the output video frame. The decoder (and encoder) can also apply additional filtering means to improve the quality of the output video before passing it for display and/or storing it as prediction reference for the forthcoming frames in the video sequence.
In video codecs, the motion information may be indicated with motion vectors associated with each motion compensated image block. Each of these motion vectors represents the displacement between the image block in the picture to be coded (in the encoder side) or decoded (in the decoder side) and the prediction source block in one of the previously coded or decoded pictures. In order to represent motion vectors efficiently, they may be coded differentially with respect to block-specific predicted motion vectors. In video codecs, the predicted motion vectors may be created in a predefined way, for example by calculating the median of the encoded or decoded motion vectors of the adjacent blocks. Another way to create motion vector predictions is to generate a list of candidate predictions from adjacent blocks and/or co-located blocks in temporal reference pictures and to signal the chosen candidate as the motion vector predictor. In addition to predicting the motion vector values, the reference index of a previously coded/decoded picture can be predicted. The reference index is typically predicted from adjacent blocks and/or co-located blocks in a temporal reference picture. Moreover, high efficiency video codecs can employ an additional motion information coding/decoding mechanism, often called merging/merge mode, where all the motion field information, which includes a motion vector and a corresponding reference picture index for each available reference picture list, is predicted and used without any modification/correction. Similarly, predicting the motion field information may be carried out using the motion field information of adjacent blocks and/or co-located blocks in temporal reference pictures, and the used motion field information is signaled among a motion field candidate list filled with motion field information of available adjacent/co-located blocks.
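As a non-limiting illustration of the differential motion vector coding described above, the following Python sketch forms a component-wise median predictor from three adjacent blocks and codes only the difference. The function names and the choice of three neighbours are assumptions made for this example and are not taken from any particular codec.

```python
from statistics import median

def median_mv_predictor(neighbor_mvs):
    """Component-wise median of the motion vectors (x, y) of adjacent blocks."""
    xs = [mv[0] for mv in neighbor_mvs]
    ys = [mv[1] for mv in neighbor_mvs]
    return (median(xs), median(ys))

def encode_mvd(mv, predictor):
    """Encoder side: only the difference to the predictor is coded."""
    return (mv[0] - predictor[0], mv[1] - predictor[1])

def decode_mv(mvd, predictor):
    """Decoder side: the same predictor is derived and the difference added back."""
    return (mvd[0] + predictor[0], mvd[1] + predictor[1])

# Hypothetical motion vectors of the left, above and above-right blocks.
neighbors = [(4, -2), (6, 0), (5, 3)]
pred = median_mv_predictor(neighbors)   # (5, 0)
mvd = encode_mvd((7, 1), pred)          # (2, 1)
```

Because the decoder derives the same predictor from already decoded neighbours, adding the decoded difference recovers the original motion vector exactly.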
In video codecs the prediction residual after motion compensation may be first transformed with a transform kernel (like DCT) and then coded. The reason for this is that often there still exists some correlation among the residual and transform can in many cases help reduce this correlation and provide more efficient coding.
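The transform-and-quantize step may be sketched as follows, assuming for simplicity a one-dimensional orthonormal DCT-II applied to a four-sample residual and a uniform quantizer with step size `step`. Practical codecs use two-dimensional integer transforms; the helper names here are hypothetical.

```python
import math

def dct_1d(x):
    """Orthonormal DCT-II of a list of samples."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        out.append(scale * s)
    return out

def idct_1d(c):
    """Inverse of dct_1d (the transpose of the orthonormal DCT-II)."""
    n = len(c)
    out = []
    for i in range(n):
        s = 0.0
        for k in range(n):
            scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
            s += scale * c[k] * math.cos(math.pi * (i + 0.5) * k / n)
        out.append(s)
    return out

def code_block(original, prediction, step):
    """Transform and quantize the prediction error, then reconstruct the block."""
    residual = [o - p for o, p in zip(original, prediction)]
    levels = [round(c / step) for c in dct_1d(residual)]   # quantized coefficients
    recon_residual = idct_1d([lv * step for lv in levels])  # dequantize + inverse
    recon = [p + r for p, r in zip(prediction, recon_residual)]
    return levels, recon

original = [52, 55, 61, 66]
prediction = [50, 54, 60, 70]
levels_fine, recon_fine = code_block(original, prediction, 1.0)
levels_coarse, recon_coarse = code_block(original, prediction, 4.0)
```

A larger step yields quantized levels of smaller magnitude, which are cheaper to entropy code, at the cost of a larger reconstruction error.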
Video encoders may utilize Lagrangian cost functions to find optimal coding modes, e.g., the desired coding mode for a block, block partitioning, and associated motion vectors. This kind of cost function uses a weighting factor λ to tie together the (exact or estimated) image distortion due to lossy coding methods and the (exact or estimated) amount of information that is required to represent the pixel values in an image area:
C = D + λR
where C is the Lagrangian cost to be minimized, D is the image distortion (e.g., Mean Squared Error) with the mode and motion vectors considered, and R is the number of bits needed to represent the required data to reconstruct the image block in the decoder (including the amount of data to represent the candidate motion vectors). The rate R may be the actual bitrate or bit count resulting from encoding. Alternatively, the rate R may be an estimated bitrate or bit count. One possible way of estimating the rate R is to omit the final entropy encoding step and use e.g., a simpler entropy encoding or an entropy encoder where some of the context states have not been updated according to previous encoding mode selections.
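By way of example only, a mode decision based on the Lagrangian cost C = D + λR may be sketched in Python as follows. The candidate modes and their distortion and rate values are invented for illustration.

```python
def lagrangian_cost(distortion, rate_bits, lam):
    """C = D + lambda * R."""
    return distortion + lam * rate_bits

def select_mode(candidates, lam):
    """Return the candidate with the lowest Lagrangian cost."""
    return min(candidates, key=lambda c: lagrangian_cost(c["D"], c["R"], lam))

# Hypothetical (distortion, rate) measurements for three coding modes.
candidates = [
    {"mode": "intra", "D": 120.0, "R": 40},
    {"mode": "inter", "D": 90.0, "R": 70},
    {"mode": "merge", "D": 150.0, "R": 8},
]

low_lambda_choice = select_mode(candidates, 0.5)["mode"]   # favours low distortion
high_lambda_choice = select_mode(candidates, 2.0)["mode"]  # favours low rate
```

With a small λ the encoder selects the mode with the lowest distortion even though it costs more bits; with a large λ the cheapest mode in bits wins.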
Conventionally used distortion metrics may comprise, but are not limited to, peak signal-to-noise ratio (PSNR), mean squared error (MSE), sum of absolute differences (SAD), sum of absolute transformed differences (SATD), and structural similarity (SSIM), typically measured between the reconstructed video/image signal (that is or would be identical to the decoded video/image signal) and the "original" video/image signal provided as input for encoding.
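Three of the listed metrics may, for instance, be computed as in the following Python sketch, which assumes flat lists of 8-bit sample values. SATD and SSIM are omitted for brevity.

```python
import math

def mse(orig, recon):
    """Mean squared error between two equally sized sample lists."""
    return sum((o - r) ** 2 for o, r in zip(orig, recon)) / len(orig)

def sad(orig, recon):
    """Sum of absolute differences."""
    return sum(abs(o - r) for o, r in zip(orig, recon))

def psnr(orig, recon, max_value=255):
    """Peak signal-to-noise ratio in dB; infinite for identical signals."""
    err = mse(orig, recon)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / err)

orig = [100, 102, 98, 101]
recon = [101, 100, 98, 103]
```

For these example values the MSE is 2.25 and the SAD is 5.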
A partitioning may be defined as a division of a set into subsets such that each element of the set is in exactly one of the subsets. A bitstream may be defined as a sequence of bits, which may in some coding formats or standards be in the form of a network abstraction layer (NAL) unit stream or a byte stream, that forms the representation of coded pictures and associated data forming one or more coded video sequences.
A bitstream format may comprise a sequence of syntax structures.
A syntax element may be defined as an element of data represented in the bitstream. A syntax structure may be defined as zero or more syntax elements present together in the bitstream in a specified order.
A NAL unit may be defined as a syntax structure containing an indication of the type of data to follow and bytes containing that data in the form of an RBSP interspersed as necessary with start code emulation prevention bytes. A raw byte sequence payload (RBSP) may be defined as a syntax structure containing an integer number of bytes that is encapsulated in a NAL unit. An RBSP is either empty or has the form of a string of data bits containing syntax elements followed by an RBSP stop bit and followed by zero or more subsequent bits equal to 0.
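As a simplified illustration of start code emulation prevention, the following Python sketch inserts a byte equal to 0x03 whenever two consecutive zero bytes would otherwise be followed by a byte in the range 0x00 to 0x03, which is the rule used in H.264/AVC, H.265/HEVC and H.266/VVC. The function names are hypothetical, and other constraints of the standards (for example the handling of trailing zero bytes) are not covered.

```python
def add_emulation_prevention(rbsp: bytes) -> bytes:
    """Encapsulate RBSP bytes so that no start code prefix can appear inside."""
    out = bytearray()
    zeros = 0  # number of consecutive zero bytes just written
    for b in rbsp:
        if zeros >= 2 and b <= 3:
            out.append(3)  # emulation prevention byte
            zeros = 0
        out.append(b)
        zeros = zeros + 1 if b == 0 else 0
    return bytes(out)

def remove_emulation_prevention(payload: bytes) -> bytes:
    """Recover the RBSP by dropping each 0x03 that follows two zero bytes."""
    out = bytearray()
    zeros = 0
    for b in payload:
        if zeros >= 2 and b == 3:
            zeros = 0
            continue
        out.append(b)
        zeros = zeros + 1 if b == 0 else 0
    return bytes(out)

rbsp = bytes([0x00, 0x00, 0x01, 0x25, 0x00, 0x00, 0x00, 0x80])
payload = add_emulation_prevention(rbsp)
```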
Some coding formats specify parameter sets that may carry parameter values needed for the decoding or reconstruction of decoded pictures. A parameter may be defined as a syntax element of a parameter set. A parameter set may be defined as a syntax structure that contains parameters and that can be referred to from or activated by another syntax structure for example using an identifier.
A coding standard or specification may specify several types of parameter sets. It needs to be understood that embodiments may be applied but are not limited to the described types of parameter sets and embodiments could likewise be applied to any parameter set type.
A parameter set may be activated when it is referenced e.g., through its identifier. An adaptation parameter set (APS) may be defined as a syntax structure that applies to zero or more slices. There may be different types of adaptation parameter sets. An adaptation parameter set may for example contain filtering parameters for a particular type of a filter. In VVC, three types of APSs are specified carrying parameters for one of: adaptive loop filter (ALF), luma mapping with chroma scaling (LMCS), and scaling lists. A scaling list may be defined as a list that associates each frequency index with a scale factor for the scaling process, which multiplies transform coefficient levels by a scaling factor, resulting in transform coefficients. In VVC, an APS is referenced through its type (e.g., ALF, LMCS, or scaling list) and an identifier. In other words, different types of APSs have their own identifier value ranges.
An Adaptation Parameter Set (APS) may comprise parameters for decoding processes of different types, such as adaptive loop filtering or luma mapping with chroma scaling. Video coding specifications may enable the use of supplemental enhancement information (SEI) messages or alike. Some video coding specifications include SEI network abstraction layer (NAL) units, and some video coding specifications contain both prefix SEI NAL units and suffix SEI NAL units, where the former type can start a picture unit or alike and the latter type can end a picture unit or alike. An SEI NAL unit contains one or more SEI messages, which are not required for the decoding of output pictures but may assist in related processes, such as picture output timing, post-processing of decoded pictures, rendering, error detection, error concealment, and resource reservation. Several SEI messages are specified in H.264/AVC, H.265/HEVC, H.266/VVC, and H.274/VSEI standards, and the user data SEI messages enable organizations and companies to specify SEI messages for their own use. The standards may contain the syntax and semantics for the specified SEI messages but a process for handling the messages in the recipient might not be defined. Consequently, encoders may be required to follow the standard specifying a SEI message when they create SEI message(s), and decoders might not be required to process SEI messages for output order conformance. One of the reasons to include the syntax and semantics of SEI messages in standards is to allow different system specifications to interpret the supplemental information identically and hence interoperate. It is intended that system specifications can require the use of particular SEI messages both in the encoding end and in the decoding end, and additionally the process for handling particular SEI messages in the recipient can be specified. SEI messages are generally not extended in future amendments or versions of the standard.
The phrase along the bitstream (e.g., indicating along the bitstream) or along a coded unit of a bitstream (e.g., indicating along a coded tile) may be used in claims and described embodiments to refer to transmission, signaling, or storage in a manner that the "out-of-band" data is associated with but not included within the bitstream or the coded unit, respectively. The phrase decoding along the bitstream or along a coded unit of a bitstream or alike may refer to decoding the referred out-of-band data (which may be obtained from out-of-band transmission, signaling, or storage) that is associated with the bitstream or the coded unit, respectively. For example, the phrase along the bitstream may be used when the bitstream is contained in a container file, such as a file conforming to the ISO Base Media File Format, and certain file metadata is stored in the file in a manner that associates the metadata to the bitstream, such as boxes in the sample entry for a track containing the bitstream, a sample group for the track containing the bitstream, or a timed metadata track associated with the track containing the bitstream.
Image and video codecs may use a set of filters to enhance the visual quality of the predicted visual content. These filters can be applied either in-loop or out-of-loop, or both. In the case of in-loop filters, the filter applied on one block in the currently-encoded frame will affect the encoding of another block in the same frame and/or in another frame which is predicted from the current frame. An in-loop filter can affect the bitrate and/or the visual quality. In fact, an enhanced block will cause a smaller residual (difference between original block and predicted-and-filtered block), thus requiring fewer bits to be encoded. An out-of-loop filter is applied on a frame after it has been reconstructed; the filtered visual content is not used as a source for prediction, and thus it may only impact the visual quality of the frames that are output by the decoder.
In-loop filters in a video/image encoder and decoder may comprise, but may not be limited to, one or more of the following: deblocking filter (DBF); sample adaptive offset (SAO) filter; adaptive loop filter (ALF) for luma and/or chroma components; cross-component adaptive loop filter (CC-ALF).
A deblocking filter may be configured to reduce blocking artefacts due to block-based coding. A deblocking filter may be applied (only) to samples located at prediction unit (PU) and/or transform unit (TU) boundaries, except at the picture boundaries or when disabled at slice and/or tile boundaries. Horizontal filtering may be applied (first) for vertical boundaries, and vertical filtering may be applied for horizontal boundaries.
A sample adaptive offset (SAO) may be another in-loop filtering process that modifies decoded samples by conditionally adding an offset value to a sample (possibly to each sample), based on values in lookup tables transmitted by the encoder. SAO may have one or more (e.g., two) operation modes: band offset and edge offset modes. In the band offset mode, an offset may be added to the sample value depending on the sample amplitude. The full sample amplitude range may be divided into a number of bands (e.g., 32 bands), and sample values belonging to four of these bands may be modified by adding a positive or negative offset, which may be signalled for each coding tree unit (CTU). In the edge offset mode, the horizontal, vertical, and two diagonal gradients may be used for classification.
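The band offset mode may be sketched in Python as follows, under the assumptions that the sample range is split into 32 equal bands and that the four modified bands are consecutive, starting at a signalled band index (`band_start`, a hypothetical parameter name). The offsets shown are arbitrary example values.

```python
def sao_band_offset(samples, band_start, offsets, bit_depth=8):
    """Add an offset to samples falling into consecutive bands from band_start."""
    shift = bit_depth - 5            # 32 bands: the top 5 bits select the band
    max_val = (1 << bit_depth) - 1
    out = []
    for s in samples:
        idx = (s >> shift) - band_start
        if 0 <= idx < len(offsets):
            # Apply the offset and clip to the valid sample range.
            s = min(max(s + offsets[idx], 0), max_val)
        out.append(s)
    return out

samples = [10, 64, 72, 80, 95, 96, 255]
filtered = sao_band_offset(samples, band_start=8, offsets=[2, -1, 3, -4])
```

Only the samples whose amplitude falls into bands 8 to 11 (values 64 to 95 for 8-bit content) are modified; all other samples pass through unchanged.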
An Adaptive Loop Filter (ALF) may apply block-based filter adaptation. For example, for the luma component, one among 25 filters may be selected for each 4x4 block, based on the direction and activity of local gradients, which are derived using the sample values of that 4x4 block. The ALF classification may be performed on 2x2 block units, for instance. When all of the vertical, horizontal and diagonal gradients are below a first threshold value, the block may be classified as texture (not containing edges). Otherwise, the block may be classified to contain edges, a dominant edge direction may be derived from horizontal, vertical and diagonal gradients, and a strength of the edge (e.g., strong or weak) may be further derived from the gradient values. When a filter within a filter set has been selected based on the classification, the filtering may be performed by applying a 7x7 diamond filter, for example, to the luma component. An ALF filter set may comprise one filter for each chroma component, and a 5x5 diamond filter may be applied to the chroma components, for example. In an example, the filter coefficients use point-symmetry relative to the center point. An ALF design may comprise clipping of the difference between the neighboring sample value and the current to-be-filtered sample, which provides adaptability related to both spatial relationship and value similarity between samples.
In an example, cross-component ALF (CC-ALF) uses luma sample values to refine each chroma component by applying an adaptive linear filter to the luma channel and then using the output of this filtering operation for chroma refinement. Filtering in CC-ALF is accomplished by applying a linear, diamond shaped filter to the luma channel.
In an approach, ALF filter parameters are signalled in Adaptation Parameter Set (APS). For example, in one APS, up to 25 sets of luma filter coefficients and clipping value indices, and up to eight sets of chroma filter coefficients and clipping value indices could be signalled. To reduce the overhead, filter coefficients of different classification for luma component can be merged. In slice header, the identifiers of the APSs used for the current slice are signaled.
In VVC slice header, up to 7 ALF APS indices can be signaled to specify the luma filter sets that are used for the current slice. The filtering process can be further controlled at coding tree block (CTB) level. A flag is signalled to indicate whether ALF is applied to a luma CTB. A filter set among 16 fixed filter sets and the filter sets from APSs selected in the slice header may be selected per each luma CTB by the encoder and may be decoded per each luma CTB by the decoder. A filter set index is signaled for a luma CTB to indicate which filter set is applied. The 16 fixed filter sets are pre-defined in the VVC standard and hardcoded in both the encoder and the decoder. The 16 fixed filter sets may be referred to as the pre-defined ALFs.
A feature known as luma mapping with chroma scaling (LMCS) is included in H.266/VVC. The luma mapping (LM) part remaps luma sample values. It may be used to utilize the full luma sample value range (e.g., 0 to 1023, inclusive, for a bit depth of 10 bits per sample) in content that would otherwise occupy only a subset of the range.
The luma sample values of an input video signal to the encoder and output video signal from the decoder are represented in the original (unmapped) sample domain. Forward luma mapping maps luma sample values from the original sample domain to the mapped sample domain. Inverse luma mapping maps luma sample values from the mapped sample domain to the original sample domain.
In an example codec architecture, the processes in the mapped sample domain include inverse quantization, inverse transform, luma intra prediction and summing the luma prediction with the luma residue values. The processes in the original sample domain include in-loop filters (e.g., deblocking, SAO, ALF), inter prediction, and storage of pictures in the decoded picture buffer (DPB). In an example decoder, one or more of the following steps may be performed:
Inverse quantization and inverse transform are applied to the decoded luma transform coefficients to produce the luma residues in the mapped sample domain, Y’res;
Reconstructed luma sample values in the mapped sample domain, Y’r, are obtained by summing Y’res with the corresponding predicted luma values in the mapped sample domain, Y’pred.
For intra prediction, Y’pred is directly obtained by performing intra prediction in mapped sample domain.
For inter prediction, the predicted luma values in original sample domain, Ypred, are first obtained by motion compensation using reference pictures from the DPB, and then forward luma mapping is applied to produce the luma values in the mapped sample domain, Y’pred.
Inverse luma mapping is applied to reconstructed values Y’r to produce reconstructed luma sample values in the original sample domain, which are processed by in-loop filters (deblocking, sample adaptive offset, and adaptive loop filter) before being stored in the DPB.
In VVC, LMCS syntax elements are signalled in an adaptation parameter set (APS) with aps_params_type equal to 1 (LMCS APS). The value range for an adaptation parameter set identifier (aps_adaptation_parameter_set_id) is from 0 to 3, inclusive, for LMCS APSs. The use of LMCS can be enabled or disabled in a picture header. When LMCS is enabled in a picture header, the LMCS APS identifier value used for the picture (ph_lmcs_aps_id) is included in the picture header. Thus, the same LMCS parameters are used for the entire picture. Note also that when LMCS is enabled in a picture header and a chroma format including the chroma components is in use, the chroma scaling part can be enabled or disabled in the picture header through ph_chroma_residual_scale_flag. When a picture has multiple slices, LMCS is further enabled or disabled in the slice header for each slice.
In VVC, LMCS data within an LMCS APS comprises syntax related to a piecewise linear model of up to 16 pieces for luma mapping. The luma sample value range of the piecewise linear forward mapping function is uniformly divided into 16 pieces of the same length OrgCW. For example, for a 10-bit input video, each of the 16 pieces contains OrgCW = 64 input codewords. For each piece of index i, the number of output (mapped) codewords is defined as binCW[i]. binCW[i] is determined at the encoding process. The difference between binCW[i] and OrgCW is signalled in LMCS APS. The slopes scaleY[i] and invScaleY[i] of the functions FwdMap and InvMap are respectively derived as:
scaleY[i] = binCW[i] / OrgCW
invScaleY[i] = OrgCW / binCW[i]
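For illustration, the piecewise linear forward and inverse luma mapping may be sketched in Python using the slopes binCW[i] / OrgCW and OrgCW / binCW[i]. This sketch uses floating-point arithmetic and invented binCW values, whereas a conforming VVC decoder uses fixed-point arithmetic; the helper names are assumptions of this example.

```python
def build_lmcs(bin_cw, bit_depth=10):
    """Derive OrgCW, the slopes and the mapped-domain pivots from binCW."""
    org_cw = (1 << bit_depth) // len(bin_cw)
    scale_y = [cw / org_cw for cw in bin_cw]                     # FwdMap slopes
    inv_scale_y = [org_cw / cw if cw else 0.0 for cw in bin_cw]  # InvMap slopes
    mapped_pivot = [0]
    for cw in bin_cw:
        mapped_pivot.append(mapped_pivot[-1] + cw)
    return org_cw, scale_y, inv_scale_y, mapped_pivot

def fwd_map(x, org_cw, scale_y, mapped_pivot):
    """Map a luma value from the original to the mapped sample domain."""
    i = min(x // org_cw, len(scale_y) - 1)
    return mapped_pivot[i] + scale_y[i] * (x - i * org_cw)

def inv_map(y, org_cw, inv_scale_y, mapped_pivot):
    """Map a luma value from the mapped back to the original sample domain."""
    i = 0
    while i < len(inv_scale_y) - 1 and y >= mapped_pivot[i + 1]:
        i += 1
    return i * org_cw + inv_scale_y[i] * (y - mapped_pivot[i])

# Hypothetical codeword allocation: fewer codewords for dark pieces, more for mid-tones.
bin_cw = [32] * 4 + [80] * 8 + [64] * 4
org_cw, scale_y, inv_scale_y, mapped_pivot = build_lmcs(bin_cw)
mapped = fwd_map(300, org_cw, scale_y, mapped_pivot)
restored = inv_map(mapped, org_cw, inv_scale_y, mapped_pivot)
```

Here the input value 300 lies in piece 4, whose slope is 80 / 64 = 1.25, so it maps to 128 + 1.25 × 44 = 183, and the inverse mapping returns 300.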
Recently, neural networks (NNs) have been used in the context of image and video compression, by following mainly two approaches. In one approach, NNs are used to replace one or more of the components of a traditional codec such as VVC/H.266. Here, the term “traditional” refers to those codecs whose components and their parameters may not be learned from data. Examples of such components are:
Additional in-loop filter, for example by having the NN as an additional in-loop filter with respect to the traditional loop filters.
Single in-loop filter, for example by having the NN replacing all traditional in-loop filters.
Intra-frame prediction.
Inter-frame prediction.
Transform and/or inverse transform.
Probability model for the arithmetic codec.
Etc.
Figure 1 illustrates examples of functioning of NNs as components of a traditional codec's pipeline, in accordance with an embodiment. In particular, Figure 1 illustrates an encoder, which also includes a decoding loop. Figure 1 is shown to include components described below:
- A luma intra pred block or circuit 101. This block or circuit performs intra prediction in the luma domain, for example, by using already reconstructed data from the same frame. The operation of the luma intra pred block or circuit 101 may be performed by a deep neural network such as a convolutional auto-encoder.
- A chroma intra pred block or circuit 102. This block or circuit performs intra prediction in the chroma domain, for example, by using already reconstructed data from the same frame. The chroma intra pred block or circuit 102 may perform cross-component prediction, for example, predicting chroma from luma. The operation of the chroma intra pred block or circuit 102 may be performed by a deep neural network such as a convolutional auto-encoder.
- An intra pred block or circuit 103 and inter-pred block or circuit 104. These blocks or circuits perform intra prediction and inter-prediction, respectively. The intra pred block or circuit 103 and the inter-pred block or circuit 104 may perform the prediction on all components, for example, luma and chroma. The operations of the intra pred block or circuit 103 and inter-pred block or circuit 104 may be performed by two or more deep neural networks such as convolutional auto-encoders.
- A probability estimation block or circuit 105 for entropy coding. This block or circuit performs prediction of probability for the next symbol to encode or decode, which is then provided to the entropy coding module 112, such as the arithmetic coding module, to encode or decode the next symbol. The operation of the probability estimation block or circuit 105 may be performed by a neural network.
- A transform and quantization (T/Q) block or circuit 106. These are actually two blocks or circuits. The transform and quantization block or circuit 106 may perform a transform of input data to a different domain, for example, an FFT would transform the data to the frequency domain. The transform and quantization block or circuit 106 may quantize its input values to a smaller set of possible values. In the decoding loop, there may be an inverse quantization block or circuit and an inverse transform block or circuit 113. One or both of the transform block or circuit and quantization block or circuit may be replaced by one or two or more neural networks. One or both of the inverse transform block or circuit and inverse quantization block or circuit 113 may be replaced by one or two or more neural networks.
- An in-loop filter block or circuit 107. Operations of the in-loop filter block or circuit 107 are performed in the decoding loop, and it performs filtering on the output of the inverse transform block or circuit, or in any case on the reconstructed data, in order to enhance the reconstructed data with respect to one or more predetermined quality metrics. This filter may affect both the quality of the decoded data and the bitrate of the bitstream output by the encoder. The operation of the in-loop filter block or circuit 107 may be performed by a neural network, such as a convolutional auto-encoder. In examples, the operation of the in-loop filter may be performed by multiple steps or filters, where one or more of the steps may be performed by neural networks.
- A postprocessing filter block or circuit 108. The postprocessing filter block or circuit 108 may be performed only at decoder side, as it may not affect the encoding process. The postprocessing filter block or circuit 108 filters the reconstructed data output by the in-loop filter block or circuit 107, in order to enhance the reconstructed data. The postprocessing filter block or circuit 108 may be replaced by a neural network, such as a convolutional auto-encoder.
- A resolution adaptation block or circuit 109: this block or circuit may downsample the input video frames, prior to encoding. Then, in the decoding loop, the reconstructed data may be upsampled, by the upsampling block or circuit 110, to the original resolution. The operation of the resolution adaptation block or circuit 109 may be performed by a neural network such as a convolutional auto-encoder.
- An encoder control block or circuit 111. This block or circuit performs optimization of encoder's parameters, such as what transform to use, what quantization parameters (QP) to use, what intra-prediction mode (out of N intra-prediction modes) to use, and the like. The operation of the encoder control block or circuit 111 may be performed by a neural network, such as a classifier convolutional network, or such as a regression convolutional network.
- An ME/MC block or circuit 114 performs motion estimation and/or motion compensation, which are two key operations to be performed when performing inter-frame prediction. ME/MC stands for motion estimation / motion compensation.
In another approach, commonly referred to as “end-to-end learned compression”, NNs are used as the main components of the image/video codecs. In this second approach, there are two main options:
Option 1: re-use the video coding pipeline but replace most or all the components with NNs. Figure 2 illustrates an example of a modified video coding pipeline based on a neural network, in accordance with an embodiment. An example of a neural network may include, but is not limited to, a compressed representation of a neural network. Figure 2 is shown to include the following components:
- A neural transform block or circuit 202: this block or circuit transforms the output of a summation/subtraction operation 203 to a new representation of that data, which may have lower entropy and thus be more compressible.
- A quantization block or circuit 204: this block or circuit quantizes an input data 201 to a smaller set of possible values.
- An inverse transform and inverse quantization blocks or circuits 206. These blocks or circuits perform the inverse or approximately inverse operation of the transform and the quantization, respectively.
- An encoder parameter control block or circuit 208. This block or circuit may control and optimize some or all the parameters of the encoding process, such as parameters of one or more of the encoding blocks or circuits.
- An entropy coding block or circuit 210. This block or circuit may perform lossless coding, for example based on entropy. One popular entropy coding technique is arithmetic coding.
- A neural intra-codec block or circuit 212. This block or circuit may be an image compression and decompression block or circuit, which may be used to encode and decode an intra frame. An encoder 214 may be an encoder block or circuit, such as the neural encoder part of an autoencoder neural network. A decoder 216 may be a decoder block or circuit, such as the neural decoder part of an auto-encoder neural network. An intra-coding block or circuit 218 may be a block or circuit performing some intermediate steps between encoder and decoder, such as quantization, entropy encoding, entropy decoding, and/or inverse quantization.
- A deep loop filter block or circuit 220. This block or circuit performs filtering of reconstructed data, in order to enhance it.
- A decode picture buffer block or circuit 222. This block or circuit is a memory buffer, keeping the decoded frame, for example, reconstructed frames 224 and enhanced reference frames 226 to be used for inter prediction.
- An inter-prediction block or circuit 228. This block or circuit performs inter-frame prediction, for example, predicts from frames, for example, frames 232, which are temporally nearby. An ME/MC 230 performs motion estimation and/or motion compensation, which are two key operations to be performed when performing inter-frame prediction. ME/MC stands for motion estimation / motion compensation.
Option 2: re-design the whole pipeline, as follows.
- Encoder NN is configured to perform a non-linear transform;
- Quantization and lossless encoding of the encoder NN's output;
- Lossless decoding and dequantization;
- Decoder NN is configured to perform a non-linear inverse transform.
An example of option 2 is described in detail in Figure 3 which shows an encoder NN and a decoder NN being parts of a neural auto-encoder architecture, in accordance with an example. In Figure 3, the Analysis Network 301 is an Encoder NN, and the Synthesis Network 302 is the Decoder NN, which may together be referred to as spatial correlation tools 303, or as neural auto-encoder.
As shown in Figure 3, the input data 304 is analyzed by the Encoder NN (Analysis Network 301), which outputs a new representation of that input data. The new representation may be more compressible. This new representation may then be quantized, by a quantizer 305, to a discrete number of values. The quantized data is then losslessly encoded, for example by an arithmetic encoder 306, thus obtaining a bitstream 307. The example shown in Figure 3 includes an arithmetic decoder 308 and an arithmetic encoder 306. The arithmetic encoder 306, or the arithmetic decoder 308, or the combination of the arithmetic encoder 306 and arithmetic decoder 308 may be referred to as arithmetic codec in some embodiments. On the decoding side, the bitstream is first losslessly decoded, for example, by using the arithmetic decoder 308. The losslessly decoded data is dequantized and then input to the Decoder NN, Synthesis Network 302. The output is the reconstructed or decoded data 309.
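The quantization and dequantization steps of such a pipeline may be illustrated with the following Python sketch, which assumes a uniform scalar quantizer with a hypothetical step size. The analysis and synthesis networks and the arithmetic codec are omitted, and the latent values are invented.

```python
def quantize(latent, step=1.0):
    """Map each latent value to an integer symbol (uniform scalar quantizer)."""
    return [round(v / step) for v in latent]

def dequantize(symbols, step=1.0):
    """Map integer symbols back to reconstructed latent values."""
    return [s * step for s in symbols]

# Hypothetical output of the Encoder NN for a few latent elements.
latent = [0.26, -1.49, 2.5, 3.74]
symbols = quantize(latent, step=0.5)        # losslessly coded into the bitstream
latent_hat = dequantize(symbols, step=0.5)  # input to the Decoder NN
```

The lossless coding stage reproduces `symbols` exactly, so in this sketch the only information loss is the rounding, which is bounded by half the step size per element.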
In case of lossy compression, the lossy steps may comprise the Encoder NN and/or the quantization.
In order to train this system, a training objective function (also called “training loss”) may be utilized, which may comprise one or more terms, or loss terms, or simply losses. In one example, the training loss comprises a reconstruction loss term and a rate loss term. The reconstruction loss encourages the system to decode data that is similar to the input data, according to some similarity metric. Examples of reconstruction losses are:
Mean squared error (MSE);
Multi-scale structural similarity (MS-SSIM);
Losses derived from the use of a pretrained neural network. For example, error(f1, f2), where f1 and f2 are the features extracted by a pretrained neural network for the input data and the decoded data, respectively, and error() is an error or distance function, such as L1 norm or L2 norm;
Losses derived from the use of a neural network that is trained simultaneously with the end-to-end learned codec. For example, adversarial loss can be used, which is the loss provided by a discriminator neural network that is trained adversarially with respect to the codec, following the settings proposed in the context of Generative Adversarial Networks (GANs) and their variants.
The rate loss encourages the system to compress the output of the encoding stage, such as the output of the arithmetic encoder. By “compressing”, we mean reducing the number of bits output by the encoding stage.
When an entropy-based lossless encoder is used, such as an arithmetic encoder, the rate loss typically encourages the output of the Encoder NN to have low entropy. Examples of rate losses are the following:
A differentiable estimate of the entropy;
A sparsification loss, i.e., a loss that encourages the output of the Encoder NN or the output of the quantization to have many zeros. Examples are L0 norm, L1 norm, L1 norm divided by L2 norm;
A cross-entropy loss applied to the output of a probability model, where the probability model may be a NN used to estimate the probability of the next symbol to be encoded by an arithmetic encoder.
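As an illustrative sketch of the cross-entropy rate loss, the number of bits an ideal arithmetic encoder would need can be estimated as the sum of -log2 of the probability that the model assigns to each symbol actually coded. The fixed probability table below is an assumption standing in for the output of a probability model NN.

```python
import math

def estimated_bits(symbols, prob_model):
    """Cross-entropy in bits: sum of -log2 p(symbol) over the coded symbols."""
    return -sum(math.log2(prob_model[s]) for s in symbols)

# Hypothetical, context-free probability model over three symbols.
prob_model = {0: 0.5, 1: 0.25, 2: 0.25}
bits = estimated_bits([0, 0, 1, 2], prob_model)
```

In this example the two symbols with probability 0.5 cost one bit each and the two symbols with probability 0.25 cost two bits each, giving six bits in total. The better the model predicts the symbols, the lower this estimate.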
One or more of the reconstruction losses may be used, and one or more of the rate losses may be used, as a weighted sum. The different loss terms may be weighted using different weights, and these weights determine how the final system performs in terms of rate-distortion loss. For example, if more weight is given to the reconstruction losses with respect to the rate losses, the system may learn to compress less but to reconstruct with higher accuracy (as measured by a metric that correlates with the reconstruction losses). These weights may be considered to be hyper-parameters of the training session and may be set manually by the person designing the training session, or automatically for example by grid search or by using additional neural networks.
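The weighted sum may, for example, be formed as in the following Python sketch. The loss values and weights are arbitrary placeholders; in an actual training session the loss terms would be differentiable tensors rather than plain numbers.

```python
def training_loss(recon_losses, rate_losses, recon_weights, rate_weights):
    """Weighted sum of reconstruction loss terms and rate loss terms."""
    distortion = sum(w * v for w, v in zip(recon_weights, recon_losses))
    rate = sum(w * v for w, v in zip(rate_weights, rate_losses))
    return distortion + rate

# Hypothetical values: two reconstruction terms (e.g., MSE and a perceptual loss)
# and one rate term (e.g., estimated bits per pixel).
loss = training_loss(
    recon_losses=[0.02, 0.1],
    rate_losses=[0.5],
    recon_weights=[1.0, 0.5],
    rate_weights=[0.01],
)
```

Increasing the rate weight relative to the reconstruction weights pushes training towards lower bitrate at the expense of reconstruction accuracy, and vice versa.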
As shown in Figure 4, a neural network-based end-to-end learned video coding system may contain an encoder 401, a quantizer 402, a probability model 403, an entropy codec 420 (for example arithmetic encoder 405 / arithmetic decoder 406), a dequantizer 407, and a decoder 408. The encoder 401 and decoder 408 may be two neural networks, or mainly comprise neural network components. The probability model 403 may also comprise mainly neural network components. Quantizer 402, dequantizer 407 and entropy codec 420 may not be based on neural network components, but they may potentially also comprise neural network components. On the encoder side, the encoder component 401 takes a video x 409 as input and converts the video from its original signal space into a latent representation that may comprise a more compressible representation of the input. In the case of an input image, the latent representation may be a 3-dimensional tensor, where two dimensions represent the vertical and horizontal spatial dimensions, and the third dimension represents the “channels” which contain information at that specific location. If the input image is a 128x128x3 RGB image (with horizontal size of 128 pixels, vertical size of 128 pixels, and 3 channels for the Red, Green, Blue color components), and if the encoder downsamples the input tensor by 2 and expands the channel dimension to 32 channels, then the latent representation is a tensor of dimensions (or “shape”) 64x64x32 (i.e., with horizontal size of 64 elements, vertical size of 64 elements, and 32 channels). Please note that the order of the different dimensions may differ depending on the convention which is used; in some cases, for the input image, the channel dimension may be the first dimension, so for the above example, the shape of the input tensor may be represented as 3x128x128, instead of 128x128x3.
In the case of an input video (instead of just an input image), another dimension in the input tensor may be used to represent temporal information.
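The shape arithmetic of the example above can be checked with a small sketch. The helper names `latent_shape` and `to_channels_first` are hypothetical and only restate the 128x128x3 to 64x64x32 example; they do not describe any particular encoder architecture.

```python
def latent_shape(height, width, downsample_factor, latent_channels):
    """Shape (H, W, C) of the latent tensor for a channels-last input."""
    if height % downsample_factor or width % downsample_factor:
        raise ValueError("spatial size must be divisible by the downsampling factor")
    return (height // downsample_factor,
            width // downsample_factor,
            latent_channels)

def to_channels_first(shape_hwc):
    """Re-express an (H, W, C) shape in the (C, H, W) convention."""
    h, w, c = shape_hwc
    return (c, h, w)

print(latent_shape(128, 128, 2, 32))        # (64, 64, 32)
print(to_channels_first((128, 128, 3)))     # (3, 128, 128)
```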
The quantizer component 402 quantizes the latent representation into discrete values given a predefined set of quantization levels. Probability model 403 and arithmetic codec component 420 work together to perform lossless compression for the quantized latent representation and generate bitstreams to be sent to the decoder side. Given a symbol to be encoded into the bitstream, the probability model 403 estimates the probability distribution of all possible values for that symbol based on a context that is constructed from available information at the current encoding/decoding state, such as the data that has already been encoded/decoded. Then, the arithmetic encoder 405 encodes the input symbols to bitstream using the estimated probability distributions.
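As a rough illustration of the two steps described above, the sketch below quantizes a few latent values to a predefined set of levels and computes the number of bits an ideal entropy coder would approach for a given probability estimate. The level set, the latent values and the fixed distribution `model` are assumptions; in the described system the distribution would be re-estimated from the context for each symbol.

```python
import math

def quantize(value, levels):
    """Map a continuous value to the nearest predefined quantization level."""
    return min(levels, key=lambda q: abs(q - value))

def ideal_code_length_bits(symbols, probabilities):
    """Sum of -log2 p(symbol), the cost an ideal entropy coder approaches."""
    return sum(-math.log2(probabilities[s]) for s in symbols)

levels = [-1, 0, 1]
latent = [0.2, -0.9, 0.4, 1.3]
symbols = [quantize(v, levels) for v in latent]
model = {-1: 0.25, 0: 0.5, 1: 0.25}   # hypothetical estimated distribution
print(symbols, ideal_code_length_bits(symbols, model))
```

Symbols that the model considers likely cost fewer bits, which is why a more accurate probability model directly reduces the bitstream size.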
On the decoder side, opposite operations are performed. The arithmetic decoder 406 and the probability model 403 first decode symbols from the bitstream to recover the quantized latent representation. Then the dequantizer 407 reconstructs the latent representation in continuous values and passes it to decoder 408 to recover the input video/image. Note that the probability model 403 in this system is shared between the encoding and decoding systems. In practice, this means that a copy of the probability model 403 is used at encoder side, and another exact copy is used at decoder side.
In this system, the encoder 401, probability model 403, and decoder 408 may be based on deep neural networks. The system may be trained in an end-to-end manner by minimizing the following rate-distortion loss function:
L = D + λR, where D is the distortion loss term, R is the rate loss term, and λ is the weight that controls the balance between the two losses. The distortion loss term may be the mean square error (MSE), structural similarity (SSIM) or other metrics that evaluate the quality of the reconstructed video. Multiple distortion losses may be used and integrated into D, such as a weighted sum of MSE and SSIM. The rate loss term is normally the estimated entropy of the quantized latent representation, which indicates the number of bits necessary to represent the encoded symbols, for example, bits-per-pixel (bpp).
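The weighted sum of distortion and rate described above can be sketched numerically. The function names, the toy signals and the weight value `lam = 0.1` are assumptions for illustration; the rate is estimated here from the probabilities that a model assigned to the coded symbols.

```python
import math

def mse(original, reconstructed):
    """Mean square error between two equally long sequences."""
    return sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)

def rate_bpp(symbol_probabilities, num_pixels):
    """Estimated bits per pixel from the probabilities of the coded symbols."""
    total_bits = sum(-math.log2(p) for p in symbol_probabilities)
    return total_bits / num_pixels

def rd_loss(distortion, rate, lam):
    """Rate-distortion loss: distortion plus the weighted rate term."""
    return distortion + lam * rate

d = mse([0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 2.0, 2.0])
r = rate_bpp([0.5, 0.5, 0.25, 0.25], num_pixels=4)
print(d, r, rd_loss(d, r, lam=0.1))
```

Raising the weight makes every bit more expensive relative to distortion, so a system trained with a larger weight tends toward lower bitrates and higher distortion.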
For lossless video/image compression, the system may contain only the probability model 403 and arithmetic encoder/decoder 405, 406. The system loss function contains only the rate loss, since the distortion loss is always zero (i.e., no loss of information).
Reducing the distortion in image and video compression is often intended to increase human perceptual quality, as humans are considered to be the end users, i.e., consuming/watching the decoded image. Recently, with the advent of machine learning, especially deep learning, there is a rising number of machines (i.e., autonomous agents) that analyze data independently from humans and that may even take decisions based on the analysis results without human intervention. Examples of such analysis are object detection, scene classification, semantic segmentation, video event detection, anomaly detection, pedestrian tracking, etc. Example use cases and applications are self-driving cars, video surveillance cameras and public safety, smart sensor networks, smart TV and smart advertisement, person reidentification, smart traffic monitoring, drones, etc. When the decoded data is consumed by machines, a different quality metric shall be used instead of human perceptual quality. Also, dedicated algorithms for compressing and decompressing data for machine consumption are likely to be different than those for compressing and decompressing data for human consumption. The set of tools and concepts for compressing and decompressing data for machine consumption is referred to here as Video Coding for Machines (VCM).
VCM concerns the encoding of video streams to allow consumption by machines. The term “machine” refers to any device other than a human. Examples of machines are a mobile phone, an autonomous vehicle, a robot, and similar intelligent devices which may have a degree of autonomy or run an intelligent algorithm to process the decoded stream beyond reconstructing the original input stream.
A machine may perform one or multiple tasks on the decoded stream. Examples of tasks can comprise the following:
Classification: classify an image or video into one or more predefined categories. The output of a classification task may be a set of detected categories, also known as classes or labels. The output may also include the probability and confidence of each predefined category.
Object detection: detect one or more objects in a given image or video. The output of an object detection task may be the bounding boxes and the associated classes of the detected objects. The output may also include the probability and confidence of each detected object.
Instance segmentation: identify one or more objects in an image or video at the pixel level. The output of an instance segmentation task may be binary mask images or other representations of the binary mask images, e.g., closed contours, of the detected objects. The output may also include the probability and confidence of each object for each pixel.
Semantic segmentation: assign the pixels in an image or video to one or more predefined semantic categories. The output of a semantic segmentation task may be binary mask images or other representations of the binary mask images, e.g., closed contours, of the assigned categories. The output may also include the probability and confidence of each semantic category for each pixel.
Object tracking: track one or more objects in a video sequence. The output of an object tracking task may include frame index, object ID, object bounding boxes, probability, and confidence for each tracked object.
Captioning: generate one or more short text descriptions for an input image or video. The output of the captioning task may be one or more short text sequences.
Human pose estimation: estimate the position of the key points, e.g., wrist, elbows, knees, etc., from one or more human bodies in an image or video. The output of a human pose estimation task includes sets of locations of each key point of a human body detected in the input image or video.
Human action recognition: recognize the actions, e.g., walking, talking, shaking hands, of one or more people in an input image or video. The output of the human action recognition may be a set of predefined actions, probability, and confidence of each identified action.
Anomaly detection: detect abnormal objects or events in an input image or video. The output of an anomaly detection task may include the locations of detected abnormal objects or the segments of frames where abnormal events are detected in the input video.
It is likely that the receiver-side device has multiple “machines” or task neural networks (Task-NNs). These multiple machines may be used in a certain combination which is for example determined by an orchestrator sub-system. The multiple machines may be used for example in succession, based on the output of the previously used machine, and/or in parallel. For example, a video which was compressed and then decompressed may be analyzed by one machine (NN) for detecting pedestrians, by another machine (another NN) for detecting cars, and by another machine (another NN) for estimating the depth of all the pixels in the frames.
In this description, “task machine”, “machine” and “task neural network” are used interchangeably, and each refers to any process or algorithm (learned or not from data) which analyzes or processes data for a certain task. In the rest of the description, other assumptions made regarding the machines considered in this disclosure may be specified in further detail. Also, the terms “receiver-side” and “decoder-side” are used to refer to the physical or abstract entity or device which contains one or more machines, and runs these one or more machines on an encoded and eventually decoded video representation which is encoded by another physical or abstract entity or device, the “encoder-side device”.
The encoded video data may be stored into a memory device, for example as a file. The stored file may later be provided to another device. Alternatively, the encoded video data may be streamed from one device to another.
Figure 5 is a general illustration of the pipeline of Video Coding for Machines. A VCM encoder 502 encodes the input video into a bitstream 504. A bitrate 506 may be computed 508 from the bitstream 504 in order to evaluate the size of the bitstream. A VCM decoder 510 decodes the bitstream output by the VCM encoder 502. In Figure 5, the output of the VCM decoder 510 is referred to as “Decoded data for machines” 512. This data may be considered as the decoded or reconstructed video. However, in some implementations of this pipeline, this data may not have the same or similar characteristics as the original video which was input to the VCM encoder 502. For example, this data may not be easily understandable by a human by simply rendering the data onto a screen. The output of the VCM decoder is then input to one or more task neural networks 514. In the figure, for the sake of illustrating that there may be any number of task-NNs 514, there are three example task-NNs, and a non-specified one (Task-NN X). The goal of VCM is to obtain a low bitrate while guaranteeing that the task-NNs still perform well in terms of the evaluation metric 516 associated to each task.
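One simple way to evaluate the size of a bitstream is to normalize it by the number of coded pixels. The sketch below assumes bits per pixel as the bitrate measure and uses a hypothetical helper name; the pipeline above does not prescribe a specific measure.

```python
def bits_per_pixel(bitstream: bytes, width: int, height: int, num_frames: int = 1) -> float:
    """Bitstream size in bits divided by the total number of pixels."""
    return 8 * len(bitstream) / (width * height * num_frames)

# A 1024-byte bitstream for a single 128x128 frame.
print(bits_per_pixel(bytes(1024), width=128, height=128))
```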
One of the possible approaches to realize video coding for machines is an end-to-end learned approach. In this approach, the VCM encoder and VCM decoder mainly consist of neural networks. Figure 6 illustrates an example of a pipeline for the end-to-end learned approach. The video is input to a neural network encoder 601. The output of the neural network encoder 601 is input to a lossless encoder 602, such as an arithmetic encoder, which outputs a bitstream 604. The lossless codec may use a probability model 603, both in the lossless encoder and in the lossless decoder, which predicts the probability of the next symbol to be encoded and decoded. The probability model 603 may also be learned, for example it may be a neural network. At decoder-side, the bitstream 604 is input to a lossless decoder 605, such as an arithmetic decoder, whose output is input to a neural network decoder 606. The output of the neural network decoder 606 is the decoded data for machines 607, that may be input to one or more task-NNs 608.
Figure 7 illustrates an example of how the end-to-end learned system may be trained. For the sake of simplicity, only one task-NN 707 is illustrated. A rate loss 705 may be computed from the output of the probability model 703. The rate loss 705 provides an approximation of the bitrate required to encode the input video data. A task loss 710 may be computed 709 from the output 708 of the task-NN 707.
The rate loss 705 and the task loss 710 may then be used to train 711 the neural networks used in the system, such as the neural network encoder 701, the probability model 703, the neural network decoder 706. Training may be performed by first computing gradients of each loss with respect to the neural networks that are contributing or affecting the computation of that loss. The gradients are then used by an optimization method, such as Adam, for updating the trainable parameters of the neural networks.
The machine tasks may be performed at decoder side (instead of at encoder side) for multiple reasons, for example because the encoder-side device does not have the capabilities (computational, power, memory) for running the neural networks that perform these tasks, or because some aspects or the performance of the task neural networks may have changed or improved by the time that the decoder-side device needs the tasks results (e.g., different or additional semantic classes, better neural network architecture). Also, there could be a customization need, where different clients would run different neural networks for performing these machine learning tasks.
As an alternative to an end-to-end trained codec, a video codec for machines can be realized by using a traditional codec such as H.266/VVC.
Alternatively, as described already above for the case of video coding for humans, another possible design may comprise using a traditional "base" codec, such as H.266/VVC, which additionally comprises one or more neural networks. In one possible implementation, the one or more neural networks may replace or be an alternative of one of the components of the traditional codec, such as: one or more in-loop filters; one or more intra-prediction modes; one or more inter-prediction modes; one or more transforms; one or more inverse transforms; one or more probability models, for lossless coding; one or more post-processing filters.
In another possible implementation, the one or more neural networks may function as an additional component, such as: one or more additional in-loop filters; one or more additional intra-prediction modes; one or more additional inter-prediction modes; one or more additional transforms; one or more additional inverse transforms; one or more additional probability models, for lossless coding; one or more additional post-processing filters.
Alternatively, another possible design may comprise using any codec architecture (such as a traditional codec, or a traditional codec which includes one or more neural networks, or an end-to-end learned codec), and having a post-processing neural network which adapts the output of the decoder so that it can be analyzed more effectively by one or more machines or task neural networks. For example, the encoder and decoder may be conformant to the H.266/VVC standard, a post-processing neural network takes the output of the decoder, and the output of the post-processing neural network is then input to an object detection neural network. In this example, the object detection neural network is the machine or task neural network.
Figure 8 illustrates an example including an encoder, a decoder, a post-processing filter, and a set of task-NNs. The encoder and decoder may represent a traditional image or video codec, such as a codec conformant with the VVC/H.266 standard, or may represent an end-to-end (E2E) learned image or video codec. The post-processing filter may be a neural network-based filter. The task-NNs may be neural networks that perform tasks such as object detection, object segmentation, object tracking, etc.
A probability model may be used in an end-to-end learned codec to estimate the probability distribution of the elements in the latent tensor, which is the output of the encoder. The estimated probability distribution may be used by an arithmetic encoder to encode the latent tensor into a bitstream at the encoding stage, or by an arithmetic decoder to decode the latent tensor from the bitstream at the decoding stage. For lossless image and video compression, the probability model estimates the probability distribution of the elements in the input image or video for the arithmetic encoder and decoder to encode and decode the input image or video. In this disclosure, the term “latent tensor” may also refer to the input image or video in a lossless image or video compression system.
A multi-scale progressive probability model may partition the elements in a latent tensor into multiple groups. The elements in one group may be processed in parallel and the groups may be processed sequentially. Figure 9 shows the architecture of a multi-scale progressive probability model at the encoding stage. Input latent tensor x(0) 910 is first downsampled into a certain number of low-resolution representations, e.g., x(1), x(2) 920, 930 respectively. The downsampling operation may use the nearest neighbor, bilinear, or bicubic algorithm. If the downsampling algorithm is not the nearest neighbor method, extra information may be transferred from the encoder to the decoder to recover the round-off error due to the downsampling operation. The probability distribution of the elements in the representation at the lowest resolution, i.e., x(2) 930 in Figure 9, may be modeled as independent and identically distributed with a Gaussian distribution model, a uniform distribution model, or a mixture of probability distribution models. The probability of elements in the latent tensors in resolution levels other than the lowest one may be modeled by a conditional distribution model (whose parameters are estimated by a prediction model), where the conditioning information (also referred to as context) may comprise the representation at a lower resolution level. In Figure 9, z(i) is an auxiliary output from the prediction model at resolution level i, where i=1,2, that may be used by the prediction model at the next resolution level (i-1) as an extra input. p(i) denotes the estimated parameters of the (conditional) probability distribution model for elements at resolution level i. The prediction model may be implemented by a deep neural network with a plurality of parameters. In the disclosure, the terms parameter and weight for a deep neural network are interchangeable.
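The multi-scale representation described above can be sketched for the nearest neighbor case. The sketch assumes a single-channel 2D latent stored as nested lists and a downsampling factor of 2; the function names are hypothetical.

```python
def downsample_nearest(grid, factor=2):
    """Nearest neighbor downsampling: keep every `factor`-th row and column."""
    return [row[::factor] for row in grid[::factor]]

def build_pyramid(x0, num_levels):
    """Return the representations from highest to lowest resolution."""
    pyramid = [x0]
    for _ in range(num_levels):
        pyramid.append(downsample_nearest(pyramid[-1]))
    return pyramid

x0 = [[r * 4 + c for c in range(4)] for r in range(4)]   # 4x4 toy latent
pyramid = build_pyramid(x0, num_levels=2)
for level, x in enumerate(pyramid):
    print(level, x)
```

With this method every element of a lower-resolution level is an unmodified element of the level above it, so no round-off error is introduced by the downsampling.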
At the decoding stage, the latent tensor at the lowest resolution, i.e., x(2) 930 in Figure 9, may be first decoded from the bitstream using the predefined probability distribution model. A multi-scale progressive probability model may use the elements in the latent tensor at resolution level i, e.g., x(i) 920, as the context to estimate the parameters of the distribution model for the elements in the latent tensor at a higher resolution level, e.g., x(i-1). The estimated probability distribution may be used by the arithmetic decoder to decode the elements from the bitstream. The procedure may repeat until all elements in the latent tensor at the highest resolution level, i.e., x(0) 910, are decoded.
The prediction models at different resolution levels may share the weights or a subset of the weights. In an example, the prediction models at different resolutions are the same or substantially the same.
To further improve the accuracy of the probability distribution estimation, the elements in the latent tensor at a resolution level may be further partitioned into several groups. The groups may be processed sequentially. The elements in a group are modeled by independent conditional distribution models using the elements that have already been processed as the context. I.e., the elements of the latent tensor at a resolution level are processed in steps, where the elements in a group associated to a step are processed in parallel. An example of the architecture of the prediction model is shown in Figure 10.
Let N(i) be the number of groups into which the elements in the latent representation at resolution level i are partitioned. Figure 10 shows the prediction model at resolution level i and step j, where j = 1, ..., N(i). The distribution predictor 1010 predicts the parameters of the probability distribution for the elements in the latent representation at resolution level i and step j. z(i,j) is the auxiliary input for the distribution predictor 1010. For the first step, z(i,1) = z(i+1), where z(i+1) is the auxiliary output from resolution level i+1. x(i,j) is a tensor that contains the true values for the elements that have already been processed and the predicted values of the elements that have not been processed at step j. x(i,1) is derived by upsampling x(i+1). m(i,j) is a binary-valued mask tensor with the same shape as x(i,j), indicating the positions of the elements in x(i,j) that have the true values. p(i,j) denotes the estimated parameters of the probability distribution for the elements in group j at resolution level i. At the encoding stage, after p(i,j) is calculated, the tensor updater component 1020 may update the elements in group j with the corresponding true values in x(i) to generate x(i,j+1), and the mask updater component 1030 may update the mask tensor m(i,j) accordingly to generate m(i,j+1). At the last step, i.e., j = N(i), let
Figure imgf000031_0001
At the decoding stage, the calculated p(i,j) is used to decode the corresponding elements in the bitstream. After a group of elements is decoded, the corresponding values are updated in x(i,j) to generate x(i,j+1), and the mask tensor m(i,j) is updated accordingly to generate m(i,j+1). The prediction model repeats this operation in N(i) steps until all elements at resolution level i are processed.
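The step-by-step bookkeeping performed by the tensor updater and the mask updater can be sketched as follows. The distribution predictor itself is omitted, the initial mask is assumed to be all zeros, and the two-group checkerboard partition is a hypothetical choice; none of these details are taken from Figure 10.

```python
def upsample_nearest(grid, factor=2):
    """Nearest neighbor upsampling, used to form the initial mixture tensor."""
    out = []
    for row in grid:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out

def process_level(x_true, x_lower, groups):
    """Apply the tensor and mask updates for every group of one level.

    Returns the (mixture, mask) state before the first step and after each step.
    """
    mixture = upsample_nearest(x_lower)             # predicted values everywhere
    mask = [[0] * len(row) for row in mixture]      # assumed: nothing known yet
    states = [([r[:] for r in mixture], [r[:] for r in mask])]
    for group in groups:                            # groups run sequentially
        # A distribution predictor would use (mixture, mask) here.
        for r, c in group:                          # elements of a group run in parallel
            mixture[r][c] = x_true[r][c]            # tensor updater
            mask[r][c] = 1                          # mask updater
        states.append(([r[:] for r in mixture], [r[:] for r in mask]))
    return states

x_true = [[1, 2], [3, 4]]
x_lower = [[9]]
groups = [[(0, 0), (1, 1)], [(0, 1), (1, 0)]]      # hypothetical checkerboard split
states = process_level(x_true, x_lower, groups)
for mixture, mask in states:
    print(mixture, mask)
```

After the last group the mixture tensor equals the true latent and the mask is all ones, which corresponds to all elements of the level having been processed.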
A probability model used in an end-to-end learned codec may be jointly trained together with the encoder and decoder using a large training dataset. The training may be performed by optimizing a loss function that involves minimizing the divergence of the estimated probability distribution function and the true probability distribution function of the latent tensor.
However, at the inference time, input data may have different characteristics from the data in the training dataset. Thus, the model learned from the training dataset may not be optimal when compressing the input data.
An inference-time overfitting technique has been developed to improve the performance of the codec. One way to perform model overfitting at the inference stage is to finetune the parameters, or a subset of the parameters, of the components at the decoder side, including the probability model, using the input data. After the finetuning, the updates to the parameters that were changed by the finetuning may be compressed and sent to the decoder. The decoder may update the corresponding weights before decoding the bitstream.
However, when the deep neural networks used in the probability model contain a large number of parameters, the overhead caused by the weight update may significantly increase the size of the bitstream and diminish the gain from the decoder overfitting.
In this disclosure terms “further training”, “finetuning”, “overfitting”, “adapting” are used interchangeably. Also, the “input image” may refer to an image, a frame from a video sequence, a block or slice from an image, or a frame in a video sequence.
The present embodiments provide adaptive methods to finetune a multi-scale progressive probability model at inference time. The proposed embodiments may improve the compression performance significantly. Even though some additional information is encoded compared to the case where no overfitting is performed, such additional information may represent only a small fraction of the total bitstream, which is compensated for by the bitrate saving brought by the proposed overfitting technique. This bitrate overhead is likely to be smaller than the overhead incurred when the neural network’s parameters are overfitted at the encoder side and a signal, derived from the overfitted parameters or from the difference between the overfitted parameters and the original parameters, is transmitted to the decoder in or along with the bitstream. With the present embodiments, the probability model may be adapted to the input data in multiple steps. At each step, the probability model may be finetuned using the data that has already been decoded from the bitstream. The encoder may signal one or more additional parameters to the decoder to perform the finetuning.
In addition, the present embodiments provide a solution to determine the data that may be used to finetune the probability model when a set of images, in particular, for a certain application, are encoded.
These embodiments are discussed in more detail next.
The input data of a probability model may be the latent tensor that is the output of the encoder in an end-to-end learned codec or an uncompressed image or video in an end-to-end learned lossless codec. The probability model may process the input data in a multi-scale and progressive manner. I.e., the input data may be downsampled several times, generating representations at different resolution levels. The elements in one resolution level may be partitioned into several groups. The partition may be done along spatial dimensions, temporal dimension, channel dimension, or any combination of these dimensions. The probability model may process the elements in the input data sequentially by a predefined group and resolution order. All elements in one group may be processed in parallel as a batch. The elements that have already been processed may be used by the probability model as the context information to determine the probability distribution of the elements that have not been processed. The determined probability distributions may be used by an arithmetic encoder to encode the input data into a bitstream at the encoding stage, or by an arithmetic decoder to decode a bitstream into output at the decoding stage. The input data of an arithmetic encoder and the output data of an arithmetic decoder are same or substantially the same.
According to an embodiment, the processing order of the groups and resolution levels may be predefined and hardcoded into the encoder and decoder. According to another embodiment, the processing order of the groups and resolution levels may be determined at encoding time by the encoder and signaled to the decoder in or along the bitstream. According to an embodiment, the system may complete all groups in one resolution level before processing the groups in another resolution level. According to another embodiment, the system may process all the groups with the same channel IDs on all resolution levels before moving to groups with another channel ID. According to another embodiment, the groups may be processed in an order defined by the encoder, and the processing order is signaled to the decoder along with the bitstream or via a supplementary message.
Probability model overfitting at different resolution levels
As Figure 9 shows, the prediction model component in a probability model estimates the probability distribution of the elements in the latent representation at a resolution level using the elements at a lower resolution as the context. The estimated probability distribution may be used to calculate the rate loss for compressing the latent representation. For example, at resolution level i, the probability distribution of elements in the latent representation x(i) is estimated using latent representation x(i+1) as the context.
At the encoding stage, latent representations x(i+1) and x(i) may be used as the input data and the ground truth data, respectively, to finetune the prediction model at resolution level i - 1. The finetuning may be performed by minimizing the rate loss calculated from the probability distribution estimation p(i) for resolution level i. The parameters of the prediction model may be updated using the gradient descent method, where the gradients are calculated by back-propagating the gradient of the rate loss to the parameters, i.e., by using the back-propagation algorithm. However, any suitable method for obtaining gradients of a loss function with respect to the parameters of a neural network may be used. The optimization may use a certain optimizer, for example, Adam, with a set of predefined optimization parameters including the learning rate. The finetuning may be performed in several iterations. At each iteration, the rate loss for latent representation x(i-1) is calculated using the probability distribution p(i-1) estimated by the finetuned model. The iteration may be terminated when the rate loss for latent representation x(i-1) stops decreasing for a certain number of iterations or a predefined max number of iterations is reached. After the finetuning, the number of iterations at which the optimal rate loss for latent representation x(i-1) is reached and the finetuning parameters are recorded by the encoder. The encoder may send the recorded finetuning parameters and the number of iterations to the decoder in or along the bitstream. The encoder may compress the finetuning parameters and the number of iterations and send the compressed data to the decoder in or along the bitstream.
The following example algorithm describes the finetuning procedure at resolution level i at the encoder side. The rate loss calculation function R() used in the algorithm takes the parameters of the estimated distribution function and the ground truth values of the elements as the input and calculates cross entropy value as the rate loss.
Given latent representations x(i+1), x(i), x(i-1) at resolution levels i + 1, i, i - 1 respectively, optimizer with a learning rate and other finetuning parameters, pretrained prediction model P to be used at resolution level i - 1 with learnable parameters θ, rate loss calculation function R(·), max number of iterations Nmax, max number of iterations Mmax to stop the finetuning when the performance stops improving.
Let iter = 0
Let best_rate = MAX_FLOAT
Let best_iter = 0
Let best_model = θ
While iter < Nmax
    Calculate p(i-1) = P(x(i), z(i); θ)
    Calculate rate = R(p(i-1), x(i-1))
    If rate < best_rate
        Let best_rate = rate
        Let best_iter = iter
        Let best_model = θ
    If iter - best_iter > Mmax break
    Calculate p(i) = P(x(i+1), z(i+1); θ)
    Calculate rate_loss = R(p(i), x(i))
    Calculate gradients dθ for parameters θ using back-propagation method
    Update parameters θ using the optimizer and the calculated gradients dθ
    Let iter = iter + 1
Return best_iter, best_model
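The encoder-side procedure can be sketched on a toy problem. Everything model-specific here is an assumption made for illustration: the prediction model is replaced by a single-parameter model in which the probability that a binary element equals its context element is sigmoid(theta), the gradient is computed analytically instead of by back-propagation, and plain gradient descent stands in for an optimizer such as Adam. Only the control flow (monitoring the rate, recording the best iteration, stopping after a patience limit) mirrors the procedure.

```python
import math

LN2 = math.log(2.0)

def rate_bits(theta, pairs):
    """Rate loss: cross-entropy in bits of the targets under the toy model."""
    total = 0.0
    for context, target in pairs:
        y = 1.0 if target == context else -1.0
        total += math.log1p(math.exp(-y * theta)) / LN2
    return total

def rate_gradient(theta, pairs):
    """Analytic derivative of rate_bits with respect to theta."""
    grad = 0.0
    for context, target in pairs:
        y = 1.0 if target == context else -1.0
        grad += -y / (1.0 + math.exp(y * theta)) / LN2
    return grad

def finetune_encoder(theta, train_pairs, eval_pairs, lr, n_max, m_max):
    """Train on (level i+1, level i) pairs, monitor (level i, level i-1) pairs."""
    best_rate, best_iter, best_model = float("inf"), 0, theta
    it = 0
    while it < n_max:
        rate = rate_bits(theta, eval_pairs)              # rate at level i-1
        if rate < best_rate:
            best_rate, best_iter, best_model = rate, it, theta
        if it - best_iter > m_max:                       # stopped improving
            break
        theta -= lr * rate_gradient(theta, train_pairs)  # plain gradient descent
        it += 1
    return best_iter, best_model

# Hypothetical (context, target) pairs of binary elements.
train_pairs = [(1, 1)] * 6 + [(1, 0)] * 2
eval_pairs = [(0, 0)] * 5 + [(0, 1)] * 3
best_iter, best_model = finetune_encoder(0.0, train_pairs, eval_pairs,
                                         lr=0.1, n_max=50, m_max=3)
print(best_iter, round(best_model, 4))
```

In this toy data the training pairs favor a larger parameter than the monitored pairs do, so the monitored rate first falls and then rises, and the loop stops early while returning the parameter from the best iteration.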
In another embodiment, the auxiliary information at resolution level i + 1 may be used to estimate the probability distributions for latent representation
Figure imgf000034_0002
θ).
After the finetuning, the best model of the prediction model may be used to estimate the probability distribution of the elements in the latent representation x" - ''' for the arithmetic encoder.
At the decoding stage, the finetuning parameters and the number of iterations are obtained. If the encoder has signaled this information to the decoder, the finetuning parameters and the number of iterations are decoded from the bitstream or a supplementary message carrying this information. Alternatively, the finetuning parameters may have been predefined in a decoder. Then, the decoder may perform the probability finetuning while decoding the elements from the bitstream. The following example algorithm describes the finetuning procedure at the decoder side. Given latent representation x^l+1\ x^ at resolution level i+1, i respectively, optimizer with learning rate and other finetuning parameters, pretrained prediction model P to be used at resolution level i — 1 with learnable parameters 0, rate loss calculation function /?(•)■ the number of iterations best iter
Let iter=0
While iter < best_iter
Calculate
Figure imgf000035_0001
$\theta$)
Calculate rate_loss = $R(p^{(i)}, x^{(i)})$
Calculate gradients $d\theta$ for parameters $\theta$ using the back-propagation method
Update parameters $\theta$ using the optimizer and the calculated gradients $d\theta$
Let iter = iter + 1
Return $\theta$
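A minimal Python sketch of the decoder-side replay is given below. It assumes a one-parameter binary model, with P(1) = sigmoid(theta), in place of the prediction model P, and plain gradient descent in place of the optimizer; the function names and the toy data are hypothetical. The point it illustrates is that the decoder needs only the already decoded elements, the finetuning parameters and the signaled iteration count, and that repeating the same updates is deterministic.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def rate_loss(theta, bits):
    """Average code length in bits for binary symbols with P(1) = sigmoid(theta)."""
    p1 = sigmoid(theta)
    return -sum(math.log2(p1 if b else 1.0 - p1) for b in bits) / len(bits)

def rate_gradient(theta, bits):
    """d(rate_loss)/d(theta) = (sigmoid(theta) - mean(bits)) / ln 2."""
    return (sigmoid(theta) - sum(bits) / len(bits)) / math.log(2)

def finetune_decoder(theta, decoded, lr, best_iter):
    """Run exactly best_iter updates on already decoded elements and return theta."""
    it = 0
    while it < best_iter:
        theta -= lr * rate_gradient(theta, decoded)
        it += 1
    return theta

decoded = [1, 1, 0, 1, 1, 1, 0, 1]   # elements decoded so far (toy data)
theta_a = finetune_decoder(0.0, decoded, lr=0.5, best_iter=7)
theta_b = finetune_decoder(0.0, decoded, lr=0.5, best_iter=7)
```

Both calls return the same value, which is the property the scheme relies on: encoder and decoder must run bit-identical updates, so in a real system the arithmetic would also have to be reproducible across devices.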
After the prediction model is finetuned, the finetuned model may be used to calculate the estimated probability distributions of the elements in the latent representation $x^{(i-1)}$ at resolution level i - 1, i.e., $p^{(i-1)} = P(x^{(i)}, z^{(i)}; \theta)$. In another embodiment, the auxiliary information at resolution level i + 1 may be used to estimate the probability distributions for latent representation $x^{(i-1)}$, i.e., $p^{(i-1)} =$
Figure imgf000035_0002
Probability model finetuning at a resolution level
As Figure 10 shows, the elements in latent representation at a resolution level may be partitioned into several groups. The elements in one group may be processed in parallel while different groups may be processed sequentially in steps. In each step, elements in one group are processed in parallel.
The distribution predictor 1010 in the prediction model component may be finetuned using the groups of elements that have already been processed. The encoder may determine the parameters for the finetuning, including the number of iterations and learning rate. The finetuning parameters may be signaled to the decoder along with the bitstream or via a supplementary message. According to another embodiment, the finetuning parameters may be compressed before being signaled to the decoder. Let $x^{(i,j)}$ be the mixture latent representation at resolution level i and step j, where the positions of the elements that have already been processed are set to the true values and the other positions are set to the estimated values.
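The construction of the mixture latent representation can be illustrated with a short Python sketch. It assumes a flat list of elements and a binary mask in which 1 marks an already processed position; the function name `mixture_latent`, the mask convention and the toy values are assumptions made for illustration only.

```python
def mixture_latent(true_values, estimated_values, mask):
    """Take the true value where mask is 1 (already processed), else the estimate."""
    return [t if m else e for t, e, m in zip(true_values, estimated_values, mask)]

true_values = [5, 7, 2, 9]        # known to the encoder; known to the decoder once decoded
estimated_values = [4, 6, 3, 8]   # current estimates for every position
mask_step_j = [1, 0, 1, 0]        # positions processed up to step j (assumed convention)

mixed = mixture_latent(true_values, estimated_values, mask_step_j)
```

Because the decoder holds the same true values at the masked positions and produces the same estimates elsewhere, it can build an identical mixture at each step.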
The following example algorithm illustrates the finetuning procedure at the encoder side at resolution level i and step j. The finetuned distribution predictor model may be used to estimate the probability distribution of the elements in step j + 1.
Given mixture latent representation
Figure imgf000036_0002
auxiliary information
Figure imgf000036_0001
and mask tensors $m^{(i,j)}$, $m^{(i,j+1)}$ at resolution level i and steps j and j + 1 respectively, an optimizer with a learning rate and other finetuning parameters, a pretrained distribution predictor P with learnable parameters $\theta$, a rate loss calculation function $R(\cdot)$, a maximum number of iterations Nmax, and a maximum number of iterations Mmax after which the finetuning stops when the performance stops improving.
Let iter=0
Let best_rate = MAX_FLOAT
Let best_iter = 0
Let best_model = $\theta$
While iter < Nmax
Figure imgf000036_0004
Let best_iter = iter
Let best_model = $\theta$
If iter - best_iter > Mmax break
Calculate
Calculate
Figure imgf000036_0003
Calculate gradients $d\theta$ for parameters $\theta$ using the back-propagation method
Update parameters $\theta$ using the optimizer and the calculated gradients $d\theta$
Let iter = iter + 1
Return best_iter, best_model
In another embodiment, the auxiliary information at step j may be used to estimate the probability distributions for step
Figure imgf000037_0001
After the finetuning, the best model of the prediction model may be used to estimate the probability distribution of the elements for step j+1 and the estimated probability distribution may be used by the arithmetic encoder to encode the elements to be processed in step j+1 to a bitstream.
At the decoding stage, the finetuning parameters and the number of iterations are obtained. If the encoder has signaled this information to the decoder, the finetuning parameters and the number of iterations are decoded from the bitstream or from a supplementary message. Alternatively, the finetuning parameters may have been predefined in the decoder. Next, a finetuning procedure may be performed. An example of the finetuning procedure is given in the following:
Given mixture latent representation $x^{(i,j)}$, auxiliary information $z^{(i,j)}$ and mask tensor $m^{(i,j)}$ at resolution level i and step j, an optimizer with a learning rate and other finetuning parameters, a pretrained distribution predictor P with learnable parameters $\theta$, a rate loss calculation function $R(\cdot)$, and the number of iterations best_iter.
Let iter=0
While iter < best_iter
Calculate
Figure imgf000037_0002
$\theta$)
Calculate rate_loss =
Figure imgf000037_0003
Calculate gradients $d\theta$ for parameters $\theta$ using the back-propagation method
Update parameters $\theta$ using the optimizer and the calculated gradients $d\theta$
Let iter = iter + 1
Return $\theta$
After the distribution predictor has been finetuned at step j, the finetuned model may be used to estimate the probability distributions of the elements for step j+1, i.e.,
Figure imgf000037_0004
$\theta$).
In another embodiment, the auxiliary information at step j may be used to estimate the probability distributions for step j+1, i.e.,
Figure imgf000037_0005
$\theta$).
The estimated probability distributions may be used by the arithmetic decoder to decode the elements for step j+1 from the bitstream.
Efficient finetuning technique on a batch of images or video frames
The proposed finetuning techniques may increase the encoding and decoding complexity because of the iterative finetuning procedures. The present embodiments also provide a solution for reducing the encoding and decoding complexity when the input data is a set of images, blocks from an image or frame, slices from an image or frame, or frames from one or more video sequences, in particular when the input images, blocks, slices or frames share common characteristics, for example in a dataset containing images in a specific domain, such as radiographic images for medical purposes. In the following, the term image may refer to an image, a block from an image or frame, a slice from an image or frame, or a frame from a video sequence. The term input dataset may refer to a collection of images, one or more blocks from an image or frame, one or more slices from an image or frame, or one or more frames from a video sequence.
According to an embodiment, the encoder may select a representative image from the images in the input dataset. The encoder may perform probability model finetuning at each resolution level and each step when encoding the representative image. The representative image may be selected according to its similarity, for example measured by MSE in the pixel domain, to the other images in the input dataset. The finetuned probability models at each resolution level and each step are stored by the decoder while decoding the representative image. For the other images in the input dataset, the encoder may determine whether the original pretrained model or the model finetuned on the representative image shall be used to decode an input image. The decision may be signaled to the decoder in or along the bitstream. One example of this embodiment is to use the first image as the representative image. There may be more than one representative image, and more than one finetuned probability model may be obtained by using the one or more representative images.
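A minimal Python sketch of one possible selection rule is shown below. The source only states that similarity may be measured by MSE in the pixel domain; choosing the image with the smallest mean MSE to all other images, and choosing between the pretrained and finetuned model by comparing their bit rates, are assumptions made here for illustration. Images are represented as flat lists of pixel values, and all function names are hypothetical.

```python
def mse(a, b):
    """Mean squared error between two equally sized flat pixel lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def select_representative(images):
    """Index of the image with the smallest mean MSE to all other images (assumed rule)."""
    def mean_distance(i):
        others = [mse(images[i], img) for j, img in enumerate(images) if j != i]
        return sum(others) / len(others)
    return min(range(len(images)), key=mean_distance)

def choose_model(rate_pretrained, rate_finetuned):
    """Per-image decision that the encoder could signal to the decoder (assumed rule)."""
    return "finetuned" if rate_finetuned < rate_pretrained else "pretrained"

images = [[10, 10, 10, 10], [12, 12, 12, 12], [11, 11, 11, 11]]  # toy 4-pixel images
rep = select_representative(images)
decision = choose_model(rate_pretrained=3.1, rate_finetuned=2.9)
```

With these toy images the middle-valued image is selected, since it is closest on average to the other two. The per-image decision costs a single flag in the bitstream under this rule.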
According to another embodiment, the images in the input dataset may be clustered into a number of clusters, for example using the k-means algorithm with the distance measured by MSE in the pixel domain or in a feature domain. The centroid image of each cluster is used as the representative image for the images belonging to the same cluster. For a centroid image, the encoder may signal an indicator to the decoder, and the decoder may store the finetuned probability models to be used for other images. For a non-centroid image, the encoder may signal the ID of the model finetuned on a centroid image to the decoder.
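The clustering variant can be sketched in Python as follows, assuming flat pixel lists, MSE in the pixel domain, a deterministic initialization from the first k images, and the convention that the centroid image of a cluster is the member closest to the cluster mean. These choices, and the function names, are illustrative assumptions rather than details given in the text.

```python
def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def kmeans(images, k, iterations=10):
    """Plain k-means with MSE distance; centers start at the first k images (assumption)."""
    centers = [list(img) for img in images[:k]]
    labels = [0] * len(images)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: mse(img, centers[c])) for img in images]
        for c in range(k):
            members = [img for img, lab in zip(images, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

def centroid_images(images, labels, centers):
    """Map each cluster ID to the index of its member image closest to the cluster mean."""
    result = {}
    for c, center in enumerate(centers):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if members:
            result[c] = min(members, key=lambda i: mse(images[i], center))
    return result

images = [[0, 0], [10, 10], [1, 1], [11, 11], [2, 2], [12, 12]]  # toy 2-pixel images
labels, centers = kmeans(images, k=2)
reps = centroid_images(images, labels, centers)
```

In this sketch the cluster label doubles as the model ID: the model is finetuned while coding image `reps[c]`, and every other image with label `c` would be coded with that model after the ID `c` is signaled.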
According to another embodiment, the encoder and decoder may perform the probability model finetuning using the model finetuned by a representative image as the base model and continue the finetuning from the base model. According to another embodiment, the encoder may determine the base model for the probability model finetuning. The identifier (ID) of the base model may be signaled to the decoder to perform the probability model finetuning.
Experimental results
The performance of the proposed method has been evaluated using a lossless image compression codec with a multi-scale progressive probability model. After the probability model is finetuned, the number of iterations is sent to the decoder along with the bitstream. With three resolution levels and three steps at each resolution level, the probability model may be finetuned eight times. Here, one byte is used for the number of iterations. With this setup, the total overhead of the bitstream is eight bytes for each image.
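The signaling overhead described above can be put in perspective with a short calculation. The sketch below assumes an image of 768 x 512 pixels, the size of the Kodak test images; the function name and the choice of image size are assumptions for illustration, not figures taken from the experiments.

```python
def side_info_overhead_bpp(num_finetunings, bytes_per_count, width, height):
    """Bits per pixel spent on signaling the iteration counts for one image."""
    return num_finetunings * bytes_per_count * 8 / (width * height)

# Eight finetuning runs, one byte each, on an assumed 768 x 512 image.
overhead = side_info_overhead_bpp(8, 1, 768, 512)
```

Eight bytes are 64 bits, so for an image of this size the overhead is 64 / 393216 bits per pixel, which is below 0.0002 BPP.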
24 images from the Kodak dataset and 500 images from the validation set of the Open Images dataset are used to evaluate the performance of the proposed technology. The results are shown in Table 1. The baseline is a multi-scale progressive probability model without adaptive technology. The proposed method uses the same multi-scale progressive probability model with adaptive technology at every scale and every step. The numbers in Table 1 are bit rates measured in bits per pixel (BPP). As shown in Table 1, the proposed technique significantly improves the compression performance compared to the baseline method.
Figure imgf000039_0001
Table 1 : Results on the performance of the proposed technology
In the next experiment, shown in Table 2, the impact on the encoding and decoding complexity was studied using the proposed representative image finetuning. The testing dataset contains 30 images that were taken by the Mars Curiosity Rover. Table 2 demonstrates that the proposed technique improves the compression rate significantly without severely increasing the encoding and decoding time.
Figure imgf000039_0002
Figure imgf000040_0001
Table 2: Impact of encoding and decoding complexity
The method according to an embodiment is shown in Figure 11. The method generally comprises receiving 1110 a representation of input media; performing 1120 a compression of the representation by means of a probability model and arithmetic encoder to generate bitstreams to be delivered to a decoder, wherein the probability model comprises a prediction model to estimate probability distribution of the elements of the representation to be encoded, and wherein the arithmetic encoder is configured to encode elements to bitstream using at least the estimated probability distributions; adapting 1130 the prediction model according to a set of finetuning parameters by using elements that have already been processed as true values, wherein finetuning comprises minimizing a rate loss calculated by the probability distribution estimation for a resolution level; wherein the method further comprises selecting 1140 a prediction model with optimal rate loss as a best model and using 1150 the best model as the probability model. Each of the steps can be implemented by a respective module of a computer system.
An apparatus according to an embodiment comprises means for receiving a representation of input media; means for performing a compression of the representation by means of a probability model and arithmetic encoder to generate bitstreams to be delivered to a decoder, wherein the probability model comprises a prediction model to estimate probability distribution of the elements of the representation to be encoded, and wherein the arithmetic encoder is configured to encode elements to bitstream using at least the estimated probability distributions; means for adapting the prediction model according to a set of finetuning parameters by using elements that have already been processed as true values, wherein finetuning comprises minimizing a rate loss calculated by the probability distribution estimation for a resolution level; means for selecting a prediction model with optimal rate loss as a best model and means for using the best model as the probability model. The means comprises at least one processor, and a memory including a computer program code, wherein the processor may further comprise processor circuitry. The memory and the computer program code are configured to, with the at least one processor, cause the apparatus to perform the method of Figure 11 according to various embodiments.
The method for decoding according to an embodiment is shown in Figure 12. The method generally comprises receiving an encoded bitstream 1200; obtaining 1210 a set of finetuning parameters; decompressing 1220 the bitstream to generate a representation of an output media by means of a probability model and arithmetic decoder, wherein the probability model comprises a prediction model to estimate probability distribution of the elements of the representation, and wherein the arithmetic decoder is configured to decode elements from the bitstream using the estimated probability distributions; adapting 1230 the prediction model according to at least a set of finetuning parameters by using elements that have already been processed as true values, wherein finetuning comprises minimizing rate loss calculated by the probability distribution estimation for a resolution level; selecting 1240 a prediction model as the best model according to the set of finetuning parameters; and using 1250 the best model as the probability model. Each of the steps can be implemented by a respective module of a computer system.
An apparatus according to an embodiment comprises means for receiving an encoded bitstream; means for obtaining finetuning parameters; means for decompressing the bitstream to generate a representation of an output media by means of a probability model and arithmetic decoder, wherein the probability model comprises a prediction model to estimate probability distribution of the elements of the representation, and wherein the arithmetic decoder is configured to decode elements from the bitstream using the estimated probability distributions; means for adapting the prediction model according to at least a set of finetuning parameters by using elements that have already been processed as true values, wherein finetuning comprises minimizing rate loss calculated by the probability distribution estimation for a resolution level; wherein the apparatus further comprises means for selecting a prediction model as a best model according to the set of finetuning parameters; and means for using the best model as the probability model. The means comprises at least one processor, and a memory including a computer program code, wherein the processor may further comprise processor circuitry. The memory and the computer program code are configured to, with the at least one processor, cause the apparatus to perform the method of Figure 12 according to various embodiments.
An example of an apparatus is shown in Figure 13. The apparatus is a user equipment for the purposes of the present embodiments. The apparatus 90 comprises a main processing unit 91, a memory 92, a user interface 94, and a communication interface 93. The apparatus according to an embodiment, shown in Figure 13, may also comprise a camera module 95. Alternatively, the apparatus may be configured to receive image and/or video data from an external camera device over a communication network. The memory 92 stores data including computer program code in the apparatus 90. The computer program code is configured to implement the method according to various embodiments by means of various computer modules. The camera module 95 or the communication interface 93 receives data, in the form of images or a video stream, to be processed by the processor 91. The communication interface 93 forwards processed data, i.e., the image file, for example to a display of another device, such as a virtual reality headset. When the apparatus 90 is a video source comprising the camera module 95, user inputs may be received from the user interface.
The various embodiments can be implemented with the help of computer program code that resides in a memory and causes the relevant apparatuses to carry out the method. For example, a device may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the device to carry out the features of an embodiment. Yet further, a network device like a server may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the network device to carry out the features of various embodiments.
If desired, the different functions discussed herein may be performed in a different order and/or concurrently with each other. Furthermore, if desired, one or more of the above-described functions and embodiments may be optional or may be combined.
Although various aspects of the embodiments are set out in the independent claims, other aspects comprise other combinations of features from the described embodiments and/or the dependent claims with the features of the independent claims, and not solely the combinations explicitly set out in the claims.
It is also noted herein that while the above describes example embodiments, these descriptions should not be viewed in a limiting sense. Rather, there are several variations and modifications which may be made without departing from the scope of the present disclosure as defined in the appended claims.

Claims

1. An apparatus comprising: means for receiving a representation of input media; means for performing a compression of the representation by means of a probability model and arithmetic codec to generate bitstreams to be delivered to a decoder, wherein the probability model comprises a prediction model to estimate probability distribution of the elements of the representation to be encoded, and wherein the arithmetic encoder is configured to encode elements to bitstream using at least the estimated probability distributions; means to adapt the prediction model according to a set of finetuning parameters by using elements that have already been processed as true values, wherein finetuning comprises minimizing a rate loss calculated by the probability distribution estimation for a resolution level; wherein the apparatus further comprises means for selecting a prediction model with optimal rate loss as a best model; and means for using the best model as the probability model.
2. The apparatus according to claim 1, wherein the probability model is configured to partition the elements into a plurality of groups, and estimate the probability distribution of elements in said groups and adapt the prediction model by using groups that have already been processed as true values.
3. The apparatus according to claim 2, wherein the partition is based on one of the following: spatial dimensions, temporal dimensions, channel dimensions.
4. The apparatus according to any of the claims 1 to 3, wherein the representation is a latent representation of an image, or a frame of a video sequence, or a block of an image, or a slice of an image where the latent representation may be obtained by one or more processing steps such as by applying a neural network on the input media; or the representation is an unprocessed or substantially unprocessed version of the input media; or an uncompressed image or video.
5. The apparatus according to any of the claims 1 to 4, further comprising means for delivering parameters used in finetuning to a decoder.
6. The apparatus according to claim 5, wherein the parameters further comprise a number of iterations needed for the finetuning.
7. The apparatus according to claims 1 to 6, further comprising: means for determining a representative data from the input dataset; and means for adapting the probability model to the representative data; and means for using the adapted probability model for other data in the input dataset.
8. The apparatus according to claims 1 to 6, further comprising: means for clustering the input data into a plurality of clusters; and means for adapting the probability model to the centroid data of each cluster; and means for assigning an identifier (ID) to the adapted probability model for a centroid data; and means for using the adapted probability model for the centroid data in a cluster to other data in the same cluster; and means for delivering the ID of the adapted probability model to the decoder.
9. An apparatus for decoding comprising means for receiving an encoded bitstream; means for obtaining a set of finetuning parameters; means for decompressing the bitstream to generate a representation of an output media by means of a probability model and arithmetic decoder, wherein the probability model comprises a prediction model to estimate probability distribution of the elements of the representation, and wherein the arithmetic decoder is configured to decode elements from the bitstream using at least the estimated probability distributions; means for adapting the prediction model according to at least a set of finetuning parameters by using elements that have already been processed as true values, wherein finetuning comprises minimizing a rate loss calculated by the probability distribution estimation for a resolution level; wherein the apparatus further comprises means for selecting a prediction model as a best model according to the set of finetuning parameters; means for using the best model as the probability model.
10. The apparatus according to claim 9, wherein the probability model is configured to partition the elements into a plurality of groups, and estimate the probability distribution of elements in said groups and adapt the prediction model by using groups that have already been processed as true values.
11. The apparatus according to claim 9 to 10, further comprising means for receiving parameters used in finetuning from an encoder.
12. The apparatus according to claim 11, wherein the parameters further comprise a number of iterations needed for the finetuning.
13. The apparatus according to any of the claims 9 to 12, further comprising means for determining a representative data from the input dataset; means for adapting the probability model to the representative data; and means for using the adapted probability model for other data in the input dataset.
14. The apparatus according to any of the claims 9 to 12, further comprising means for receiving an ID of the adapted probability model from an encoder; means for determining a probability model matching with the received ID.
15. The apparatus according to any of the claims 9 to 14, further comprising means for decoding the finetuning parameters from the bitstream.
16. A method, comprising: receiving a representation of input media; performing a compression of the representation by means of a probability model and arithmetic encoder to generate bitstreams to be delivered to a decoder, wherein the probability model comprises a prediction model to estimate probability distribution of the elements of the representation to be encoded, and wherein the arithmetic encoder is configured to encode elements to bitstream using the estimated probability distributions; adapting the prediction model according to a set of finetuning parameters by using elements that have already been processed as true values, wherein finetuning comprises minimizing rate loss calculated by the probability distribution estimation for a resolution level; wherein the method further comprises selecting a prediction model with optimal rate loss as a best model and using the best model as the probability model.
17. A method for decoding comprising receiving an encoded bitstream; obtaining a set of finetuning parameters; decompressing the bitstream to generate a representation of an output media by means of a probability model and arithmetic decoder, wherein the probability model comprises a prediction model to estimate probability distribution of the elements of the representation, and wherein the arithmetic decoder is configured to decode elements from the bitstream using at least the estimated probability distributions; adapting the prediction model according to at least a set of finetuning parameters by using elements that have already been processed as true values, wherein finetuning comprises minimizing a rate loss calculated by the probability distribution estimation for a resolution level; wherein the apparatus further comprises means for selecting a prediction model as a best model according to the set of finetuning parameters; and means for using the best model as the probability model.
18. An apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following: receive a representation of input media; perform a compression of the representation by means of a probability model and arithmetic codec to generate bitstreams to be delivered to a decoder, wherein the probability model comprises a prediction model to estimate probability distribution of the elements of the representation to be encoded, and wherein the arithmetic encoder/decoder is configured to encode/decode elements to/from bitstream using the estimated probability distributions; adapt the prediction model according to a set of finetuning parameters by using elements that have already been processed as true values, wherein finetuning comprises minimizing rate loss calculated by the probability distribution estimation for a resolution level; wherein the apparatus is further caused to select a prediction model with optimal rate loss as a best model and use the best model as the probability model.
19. An apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following: receive an encoded bitstream; obtain a set of finetuning parameters; decompress the bitstream to generate a representation of an output media by means of a probability model and arithmetic decoder, wherein the probability model comprises a prediction model to estimate probability distribution of the elements of the representation, and wherein the arithmetic decoder is configured to decode elements from the bitstream using at least the estimated probability distributions; adapt the prediction model according to at least a set of finetuning parameters by using elements that have already been processed as true values, wherein finetuning comprises minimizing a rate loss calculated by the probability distribution estimation for a resolution level; wherein the apparatus further comprises means for selecting a prediction model as a best model according to the set of finetuning parameters; use the best model as the probability model.
PCT/EP2023/050956 2022-02-14 2023-01-17 A method, an apparatus and a computer program product for video coding WO2023151903A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202263309737P 2022-02-14 2022-02-14
US63/309,737 2022-02-14

Publications (1)

Publication Number Publication Date
WO2023151903A1 (en) 2023-08-17

Family

ID=85036216

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2023/050956 WO2023151903A1 (en) 2022-02-14 2023-01-17 A method, an apparatus and a computer program product for video coding

Country Status (1)

Country Link
WO (1) WO2023151903A1 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021255567A1 (en) * 2020-06-16 2021-12-23 Nokia Technologies Oy Guided probability model for compressed representation of neural networks

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
NANNAN ZOU ET AL: "Learning to Learn to Compress", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 1 May 2021 (2021-05-01), XP081950101 *

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23701264

Country of ref document: EP

Kind code of ref document: A1