WO2023172153A1 - Method of video coding by multi-modal processing

Method of video coding by multi-modal processing

Info

Publication number
WO2023172153A1
Authority
WO
WIPO (PCT)
Prior art keywords
representation
neural network
representations
bitstream
parameters
Prior art date
Application number
PCT/RU2022/000067
Other languages
French (fr)
Inventor
Alexey Aleksandrovich LETUNOVSKIY
Denis Vladimirovich Parkhomenko
Xiang Ma
Andrey Sergeevich SHUTKIN
Alexander Andreevich PLETNEV
Ivan Vladimirovich KIRILLOV
Ivan Iurevich ILYIN
Original Assignee
Huawei Technologies Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. filed Critical Huawei Technologies Co., Ltd.
Priority to PCT/RU2022/000067 priority Critical patent/WO2023172153A1/en
Publication of WO2023172153A1 publication Critical patent/WO2023172153A1/en


Classifications

    • H04N19/124 Quantisation (adaptive coding of digital video signals, characterised by the element, parameter or selection affected or controlled by the adaptive coding)
    • H04N19/132 Sampling, masking or truncation of coding units, e.g. adaptive resampling, frame skipping, frame interpolation or high-frequency transform coefficient masking
    • H04N19/15 Data rate or code amount at the encoder output by monitoring actual compressed data size at the memory before deciding storage at the transmission buffer
    • H04N19/17 Adaptive coding characterised by the coding unit, the unit being an image region, e.g. an object
    • H04N19/463 Embedding additional information in the video signal during the compression process by compressing encoding parameters before transmission
    • H04N19/59 Predictive coding involving spatial sub-sampling or interpolation, e.g. alteration of picture size or resolution
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G06N3/0475 Generative networks
    • G06N3/048 Activation functions

Definitions

  • the present disclosure relates to encoding a signal into a bitstream and decoding a signal from a bitstream.
  • the present disclosure relates to encoding and decoding of image data including processing by a neural network.
  • Hybrid image and video codecs have been used for decades to compress image and video data.
  • the signal is typically encoded block-wise by predicting a block and by further coding only the difference between the original block and its prediction.
  • such coding may include transformation, quantization and generating the bitstream, usually including some entropy coding.
  • the three components of hybrid coding methods - transformation, quantization, and entropy coding - are separately optimized.
  • Modern video compression standards like High-Efficiency Video Coding (HEVC), Versatile Video Coding (VVC) and Essential Video Coding (EVC) also use a transformed representation to code the residual signal after prediction.
  • HEVC High-Efficiency Video Coding
  • VVC Versatile Video Coding
  • EVC Essential Video Coding
  • neural network architectures have been applied to image and/or video coding.
  • these neural network (NN) based approaches can be applied in various different ways to the image and video coding.
  • some end-to-end optimized image or video coding frameworks have been discussed.
  • deep learning has been used to determine or optimize some parts of the end-to-end coding framework such as selection or compression of prediction parameters or the like.
  • some neural network based approaches have also been discussed for usage in hybrid image and video coding frameworks, e.g. for implementation as a trained deep learning model for intra or inter prediction in image or video coding.
  • end-to-end optimized image or video coding applications discussed above have in common that they produce some feature map data, which is to be conveyed between encoder and decoder.
  • Neural networks are machine learning models that employ one or more layers of nonlinear units based on which they can predict an output for a received input.
  • Some neural networks include one or more hidden layers in addition to an output layer.
  • a corresponding feature map may be provided as an output of each hidden layer.
  • Such corresponding feature map of each hidden layer may be used as an input to a subsequent layer in the network, i.e., a subsequent hidden layer or the output layer.
  • Each layer of the network generates an output from a received input in accordance with current values of a respective set of parameters.
  • for example, processing by a neural network may be split between devices: a feature map at the output of the layer at the place of splitting is produced by e.g. a first device and conveyed to the remaining layers of the neural network executed by e.g. a second device.
  • the embodiments of the present disclosure provide apparatuses and methods for encoding and decoding of image data including processing by a neural network.
  • a method for decoding image data from a bitstream comprising: decoding a first representation of the image data from the bitstream; obtaining a second representation of the image data; processing the first representation by applying a first neural network, including generating a first latent representation and a second latent representation; processing the second representation by applying a second neural network, including generating a third latent representation and a fourth latent representation; combining processing including combining of the first latent representation and the third latent representation; obtaining reconstructed image data including applying a generation neural network to an output of the combining processing, the second latent representation and the fourth latent representation.
  • the processing of data decoded from a bitstream including a multi-scale enhancement (or reconstruction) neural network which includes at least the first neural network, the second neural network and the generation neural network, improves the quality of the reconstructed image data.
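  • As a rough illustration (not part of the disclosed method) of the decoding flow summarized above, the following minimal Python sketch shows how two decoded representations could be processed by two feature-extraction networks, combined, and fed to a generation network; the modules feature_net_1, feature_net_2 and generation_net are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def reconstruct(rep1, rep2, feature_net_1, feature_net_2, generation_net):
    # rep1: decoded first representation, shape (N, C, H, W)
    # rep2: second representation, e.g. decoded at a lower resolution or
    #       obtained by downscaling rep1; upsampled here to match rep1
    rep2 = F.interpolate(rep2, size=rep1.shape[-2:], mode='bilinear',
                         align_corners=False)
    # each feature network is assumed to return two latent representations
    latent1, latent2 = feature_net_1(rep1)   # first and second latent
    latent3, latent4 = feature_net_2(rep2)   # third and fourth latent
    # combining processing, e.g. concatenation along the channel axis
    combined = torch.cat([latent1, latent3], dim=1)
    # the generation network receives the combined latents plus the
    # remaining latent representations as additional inputs
    return generation_net(combined, latent2, latent4)
```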
  • the obtaining of the second representation comprises decoding the second representation from the bitstream, the second representation having a smaller resolution and/or a different bitrate in the bitstream than the first representation.
  • a second representation having a different resolution and/or a different bitrate may provide additional (different) features to the generation neural network, when being processed by the second neural network.
  • the decoding comprises: splitting the bitstream into at least two subbitstreams; decoding the first representation and the second representation from a respective subbitstream out of the at least two subbitstreams.
  • the decoding comprises: decoding the bitstream; splitting the decoded bitstream into at least the first representation and the second representation.
  • a joint bitstream including the first representation and the second representation enables an exploitation of redundant information in the encoding and thus may result in a lower bitrate of the bitstream.
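  • One possible (purely illustrative) container layout for such a joint bitstream is to length-prefix each subbitstream so that the decoder can split the bitstream again; the actual syntax of the disclosure is not specified here.

```python
import struct

def pack_subbitstreams(subbitstreams):
    """Concatenate subbitstreams, each preceded by a 4-byte length field."""
    out = bytearray()
    for sub in subbitstreams:
        out += struct.pack('>I', len(sub))   # big-endian 32-bit length
        out += sub
    return bytes(out)

def split_subbitstreams(bitstream):
    """Inverse operation: recover the list of subbitstreams."""
    subs, pos = [], 0
    while pos < len(bitstream):
        (length,) = struct.unpack_from('>I', bitstream, pos)
        pos += 4
        subs.append(bitstream[pos:pos + length])
        pos += length
    return subs

joint = pack_subbitstreams([b'\x01\x02\x03', b'\x04\x05'])
assert split_subbitstreams(joint) == [b'\x01\x02\x03', b'\x04\x05']
```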
  • the decoding is performed including one of a decoder of an autoencoding neural network, or a hybrid block-based decoder.
  • the method is readily applied to both a decoder of an autoencoding neural network and a hybrid block-based decoder.
  • the obtaining of a second representation comprises generating the second representation by downscaling of the first representation.
  • a second representation obtained from the decoded first representation may enhance the reconstructed image data by providing additional hidden features through the processing by the corresponding neural network without increasing the bitrate of the bitstream or decreasing one or more bitrates of the representations encoded in the bitstream.
  • the method further comprises decoding one or more parameters from the bitstream, said one or more parameters indicating target bitrates of one or more respective representations encoded in the bitstream, wherein a configuration of at least one of the first neural network, the second neural network, and the generation neural network is selected based on the one or more parameters.
  • Including the parameters in the bitstream facilitates a more efficient processing by selecting an appropriate configuration of the multi-scale reconstruction network including at least the first neural network, the second neural network and the generation neural network.
  • a set of weights within at least one of the first neural network, the second neural network and the generation neural network is selected based on the one or more parameters.
  • Selecting a set of weights for one or more of the neural networks enables a more refined reconstruction of the image data by considering different configurations of input representations.
  • the method further comprises performing an interpolation on the second representation to obtain the resolution of the first representation before processing the second representation by applying the second neural network.
  • the same input resolution enables more efficient processing by the neural networks.
  • the combining processing includes at least one of concatenating the first latent representation and the third latent representation; processing by a fully-connected neural network.
  • a concatenation provides an efficient implementation of the combining, and processing by a fully-connected neural network enables the extraction of hidden features for further processing, thus enabling an improved reconstruction of the image data.
  • the applying of the first neural network includes generating a first set of latent representations, said first set including the second latent representation; the applying of the second neural network includes generating a second set of latent representations, said second set including the fourth latent representation; and the generation neural network receives the first set and the second set as additional input.
  • Considering multiple latent representations output by the first and second neural networks provides additional hidden features for the processing by the generation neural network, thus enabling an improved reconstruction of the image data.
  • the method further comprises: obtaining N representations, N being an integer greater than or equal to two, the N representations including the first representation and the second representation; for each i-th representation out of the N representations, i being one to N, processing the i-th representation by applying an i-th neural network, including generating a (2i-1)-th latent representation and a (2i)-th latent representation; combining processing including combining of each of the (2i-1)-th latent representations; obtaining reconstructed image data including applying a generation neural network to an output of the combining processing and each of the (2i)-th latent representations.
  • Extending the method to process a plurality of representations enables a more refined reconstruction of image data by providing additional input to the generation neural network.
  • the obtaining of N representations further comprises: decoding a first representation out of the N representations from the bitstream; obtaining a j-th representation out of the N representations, j being two to N, by either decoding said j-th representation from the bitstream or by downscaling of the first representation.
  • each of the i-th neural networks includes Ki downsampling layers, a bottleneck layer, and Ki upsampling layers, Ki being an integer greater than zero; wherein the bottleneck layer outputs the (2i-1)-th latent representation and a first layer out of the Ki upsampling layers following the bottleneck layer outputs the (2i)-th latent representation.
  • Such a configuration may provide an efficient implementation of a (feature extraction) neural network.
  • a first block within the generation neural network receives as input the output of the combining processing and each of the (2i)-th latent representations; and a second block within the generation neural network receives as input one or more of: an output of the first block; the output of the combining processing; and, for each i-th neural network, a latent representation output of a second layer out of the Ki up-sampling layers in the i-th neural network, the second layer following the first layer within the Ki up-sampling layers.
  • Such a configuration may provide an efficient implementation of a generation neural network.
  • the generation neural network includes M blocks, said M blocks including the first block and the second block, M being an integer greater than or equal to two; and each q-th block of the M blocks, q being two to M, receives as input one or more of: an output of the (q-1)-th block; the output of the combining processing; and, for each i-th neural network, a latent representation output of a q-th layer out of the up-sampling layers in the i-th neural network.
  • an increased number of blocks enables the processing of additional latent representations generated by the at least two feature extraction networks, thus enabling a more refined reconstruction of image data by providing additional input to the generation neural network.
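  • The sketch below illustrates one possible shape of such a feature-extraction branch (Ki downsampling layers, a bottleneck, Ki upsampling layers, with the bottleneck output and the upsampling-layer outputs kept as latent "taps") and of a generation network whose blocks consume those taps as skip inputs. It is a minimal PyTorch illustration under assumed layer widths and kernel sizes, not the architecture of the disclosure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureExtractionNet(nn.Module):
    """Hypothetical branch with Ki downsampling layers, a bottleneck layer and
    Ki upsampling layers; the bottleneck output and every upsampling-layer
    output are kept as latent representations ("taps")."""
    def __init__(self, in_ch=3, width=32, k=3):
        super().__init__()
        self.down = nn.ModuleList(
            [nn.Conv2d(in_ch if i == 0 else width, width, 3, stride=2, padding=1)
             for i in range(k)])
        self.bottleneck = nn.Conv2d(width, width, 3, padding=1)
        self.up = nn.ModuleList(
            [nn.ConvTranspose2d(width, width, 4, stride=2, padding=1)
             for _ in range(k)])

    def forward(self, x):
        for d in self.down:
            x = F.relu(d(x))
        taps = [self.bottleneck(x)]          # (2i-1)-th latent representation
        x = taps[0]
        for u in self.up:
            x = F.relu(u(x))
            taps.append(x)                   # (2i)-th latent and further taps
        return taps

class GenerationNet(nn.Module):
    """Hypothetical generation network with M blocks; block q consumes the
    (upsampled) previous block output and the q-th upsampling tap of every
    branch, mirroring the skip-connection scheme described above."""
    def __init__(self, width=32, m=3, branches=2, out_ch=3):
        super().__init__()
        chans = [width * branches * 2] + [width * (branches + 1)] * (m - 1)
        self.blocks = nn.ModuleList(
            [nn.Conv2d(c, width, 3, padding=1) for c in chans])
        self.out = nn.Conv2d(width, out_ch, 3, padding=1)

    def forward(self, combined, branch_taps):
        x = combined                         # concatenated bottleneck latents
        for q, block in enumerate(self.blocks, start=1):
            x = F.interpolate(x, scale_factor=2, mode='nearest')
            skips = [taps[q] for taps in branch_taps]   # q-th upsampling taps
            x = F.relu(block(torch.cat([x] + skips, dim=1)))
        return self.out(x)

# Shape check with two branches and 64x64 input representations.
net1, net2, gen = FeatureExtractionNet(), FeatureExtractionNet(), GenerationNet()
rep = torch.randn(1, 3, 64, 64)
taps1, taps2 = net1(rep), net2(rep)
combined = torch.cat([taps1[0], taps2[0]], dim=1)
print(gen(combined, [taps1, taps2]).shape)   # torch.Size([1, 3, 64, 64])
```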
  • the method further comprises obtaining a corresponding subframe in each of the N representations; wherein the processing of the i-th representation comprises applying the i-th neural network to the corresponding subframe of the i-th representation.
  • Limiting the processing by the multi-scale reconstruction network including at least the first neural network, the second neural network and the generation neural network, to a subframe enables a focus on important parts of the image data and may enable a more specialized reconstruction of said subframes.
  • the corresponding subframe includes a face crop.
  • a restriction to subframes including face crops facilitates the applying of a specialized multiscale reconstruction network trained for facial reconstruction.
  • the generation neural network is a generative adversarial network (GAN).
  • GAN generative adversarial network
  • a generative adversarial network provides an efficient implementation for the reconstruction of image data.
  • a method for encoding image data into a bitstream comprising: obtaining a set of parameters based on a cost function, said parameters indicating target bitrates; obtaining two or more representations of the image data based on the set of parameters; encoding the two or more representations into a bitstream according to the set of parameters; wherein the cost function is based on reconstructed image data obtained by processing two or more decoded representations of the same image data.
  • Such encoding generates a bitstream, which is optimized based on the cost function to provide an input to a decoder including a multi-scale reconstruction network.
  • the obtaining of the set of parameters based on a cost function comprises: for each current set of parameters out of a predefined plurality of sets of parameters: obtaining two or more representations of the image data based on the current set of parameters; encoding the two or more representations into a bitstream according to the current set of parameters; decoding the two or more representations from the bitstream; obtaining reconstructed image data based on the two or more decoded representations; determining a quality of the reconstructed image data; and selecting a set of parameters out of the predefined plurality of sets based on the cost function including the determined quality.
  • Selecting a set of parameters out of the predefined plurality of sets based on the quality of the reconstructed image data enables the generation of a bitstream adapted to obtain reconstructed image data of said quality after decoding at a decoder.
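  • A minimal sketch of this exhaustive search, assuming hypothetical encoder, decoder, reconstructor and quality_metric callables (e.g. PSNR) that stand in for the actual codec and the multi-scale reconstruction network:

```python
def select_parameters(image, candidate_sets, encoder, decoder,
                      reconstructor, quality_metric):
    """Evaluate each candidate set of target-bitrate parameters and keep the
    one whose reconstructed output maximizes the quality metric."""
    best_params, best_quality = None, float('-inf')
    for params in candidate_sets:
        bitstream = encoder(image, params)           # encode reps per params
        decoded_reps = decoder(bitstream)            # two or more representations
        reconstructed = reconstructor(decoded_reps)  # multi-scale reconstruction
        quality = quality_metric(image, reconstructed)
        if quality > best_quality:
            best_params, best_quality = params, quality
    return best_params

# Hypothetical candidate sets: split the total target bitrate between the
# full-resolution and the downscaled representation in different proportions.
candidate_sets = [{'rate_full': r, 'rate_down': 1.0 - r}
                  for r in (0.5, 0.6, 0.7, 0.8)]
```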
  • the obtaining of the set of parameters based on a cost function comprises: processing the image data by one or more layers of a neural network to output the set of parameters; wherein said neural network is trained by optimizing the cost function.
  • Applying a trained neural network provides an efficient implementation to obtain the set of parameters.
  • the neural network receives as input one or more of image data; a target bitrate of the bitstream; an indication of complexity of the image data; an indication of motion intensity of the image data.
  • Providing additional input to the neural network may improve the obtaining of the parameters.
  • the encoding of the two or more representations into the bitstream comprises: for each representation out of the two or more representations, encoding a representation into a subbitstream; including each of the subbitstreams into the bitstream.
  • the encoding of the two or more representations into the bitstream comprises: including the set of parameters into the bitstream.
  • Including the parameters in the bitstream facilitates a more efficient processing in the decoding by selecting an appropriate configuration of the multi-scale reconstruction network.
  • the two or more representations represent different resolutions of the image data.
  • Providing different resolutions may facilitate the extraction of additional features by the neural networks applied in the processing of the representations in the decoding.
  • a first representation out of the two or more representations represents an original resolution of the image data and a second representation out of the two or more representations represents a downscaled resolution of the image data.
  • the method is readily applied to the original image data and a downscaled representation of the original image data.
  • a first parameter out of the set of parameters indicates a target bitrate for the first representation representing an original resolution of the image data, and a second parameter out of the set of parameters indicates a target bitrate for the second representation representing a downscaled resolution of the image data.
  • Adapting the target bitrates of the respective representations by the obtaining of the set of parameters may result in an improved reconstruction of the image data at the decoding side.
  • the parameters indicating target bitrates for respective representations within the set of parameters add up to a target bitrate of the bitstream.
  • a target bitrate of the bitstream provides a suitable constraint used in the obtaining of the set of parameters.
  • an encoder used in the encoding selects a quantization parameter for the encoding of a representation to obtain a target bitrate indicated by the parameter within the set of parameters corresponding to said representation.
  • Selecting an appropriate quantization parameter provides an efficient implementation to obtain a target bitrate.
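  • One simple way an encoder could pick such a quantization parameter is a bisection search over the QP range, assuming (as is typical for hybrid codecs) that the produced bitrate decreases monotonically as QP increases; encode_with_qp below is a hypothetical placeholder returning the achieved bitrate.

```python
def select_qp(representation, target_bitrate, encode_with_qp,
              qp_min=0, qp_max=51):
    """Bisection over QP: a larger QP means coarser quantization and a lower
    bitrate; return the smallest QP whose rate fits the target."""
    best_qp = qp_max
    while qp_min <= qp_max:
        qp = (qp_min + qp_max) // 2
        rate = encode_with_qp(representation, qp)
        if rate <= target_bitrate:
            best_qp = qp           # fits the budget; try finer quantization
            qp_max = qp - 1
        else:
            qp_min = qp + 1        # too many bits; quantize more coarsely
    return best_qp
```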
  • the encoding is performed including one of an encoder of an autoencoding neural network, or a hybrid block-based encoder.
  • the method is readily applied to both an encoder of an autoencoding neural network, and a hybrid block-based encoder.
  • a computer program product comprising program code for performing any of the methods described above when executed on a computer or a processor.
  • a non-transitory computer-readable medium carrying a program code which, when executed by a computer device, causes the computer device to perform any of the methods described above.
  • a decoder which comprises processing circuitry configured to carry out any of the methods for decoding image data described above.
  • an encoder which comprises processing circuitry configured to carry out any of the methods for encoding image data described above.
  • a non-transitory storage medium comprising a bitstream encoded by any of the methods for encoding image data described above.
  • HW hardware
  • SW software
  • Fig. 1 is a schematic drawing illustrating channels processed by layers of a neural network
  • Fig. 2 is a schematic drawing illustrating an autoencoder type of a neural network
  • Fig. 3a is a schematic drawing illustrating an exemplary network architecture for encoder and decoder side including a hyperprior model
  • Fig. 3b is a schematic drawing illustrating a general network architecture for encoder side including a hyperprior model
  • Fig. 3c is a schematic drawing illustrating a general network architecture for decoder side including a hyperprior model
  • Fig. 4 is a schematic drawing illustrating an exemplary network architecture for encoder and decoder side including a hyperprior model
  • Fig. 5a is a block diagram illustrating an end-to-end video compression framework based on neural networks
  • Fig. 5b is a block diagram illustrating some exemplary details of application of a neural network for motion field compression
  • Fig. 5c is a block diagram illustrating some exemplary details of application of a neural network for motion compensation
  • Fig. 6 is a block diagram showing an example of a video hybrid encoder configured to implement embodiments of the present disclosure
  • Fig. 7 is a block diagram showing an example of a video hybrid decoder configured to implement embodiments of the present disclosure
  • Fig. 8 is a flow diagram illustrating an exemplary method for decoding image data
  • Fig. 9 is a flow diagram illustrating an exemplary method for encoding image data
  • Fig. 10 is a schematic drawing illustrating an exemplary network structure to obtain reconstructed image data
  • Fig. 11a illustrates an exemplary first implementation of an encoder and a decoder
  • Fig. 11b illustrates an exemplary second implementation of an encoder and a decoder
  • Fig. 11c illustrates an exemplary third implementation of a decoder
  • Fig. 12a is a schematic illustration of an exemplary downsampling (compressing) layer
  • Fig. 12b is a schematic illustration of an exemplary upsampling (decompressing) layer
  • Fig. 13 illustrates an exemplary processing within a generation neural network
  • Fig. 14 illustrates exemplarily the processing by a face detection network
  • Fig. 15 is a schematic drawing illustrating an exemplary decoding of a bitstream
  • Fig. 16 is a schematic drawing illustrating an exemplary encoding of a bitstream
  • Fig. 17 illustrates exemplarily the selection of a set of parameters
  • Fig. 18 is a block diagram showing an example of a video coding system configured to implement embodiments of the present disclosure
  • Fig. 19 is a block diagram showing another example of a video coding system configured to implement embodiments of the present disclosure
  • Fig. 20 is a block diagram illustrating an example of an encoding apparatus or a decoding apparatus
  • Fig. 21 is a block diagram illustrating another example of an encoding apparatus or a decoding apparatus.
  • a disclosure in connection with a described method may also hold true for a corresponding device or system configured to perform the method and vice versa.
  • a corresponding device may include one or a plurality of units, e.g. functional units, to perform the described one or plurality of method steps (e.g. one unit performing the one or plurality of steps, or a plurality of units each performing one or more of the plurality of steps), even if such one or more units are not explicitly described or illustrated in the figures.
  • if a specific apparatus is described based on one or a plurality of units, e.g. functional units, a corresponding method may include one step to perform the functionality of the one or plurality of units (e.g. one step performing the functionality of the one or plurality of units, or a plurality of steps each performing the functionality of one or more of the plurality of units), even if such one or plurality of steps are not explicitly described or illustrated in the figures.
  • Video coding typically refers to processing of a sequence of pictures, where the sequence of pictures forms a video also referred to as video sequence.
  • the terms "picture", "frame", and "image" may be used as synonyms.
  • Video coding (or coding in general) includes two parts: video encoding and video decoding.
  • Video encoding is performed at the source side, typically including processing (for example, by compression) the original video pictures to reduce the amount of data required for representing the video pictures (for more efficient storage and/or transmission).
  • Video decoding is performed on a destination side, and typically includes inverse processing in comparison with processing of the encoder to reconstruct the video picture.
  • Embodiments referring to "coding" of video pictures (or pictures in general) shall be understood to relate to "encoding" or "decoding" of video pictures or respective video sequences.
  • a combination of an encoding part and a decoding part is also referred to as CODEC (encoding and decoding).
  • in the case of lossless video coding, an original video picture can be reconstructed, i.e. a reconstructed video picture has the same quality as the original video picture (assuming that no transmission loss or other data loss occurs during storage or transmission).
  • in the case of lossy video coding, further compression is performed through, for example, quantization, to reduce the amount of data required for representing a video picture, and the video picture cannot be completely reconstructed on the decoder side.
  • quality of a reconstructed video picture is lower or poorer than that of the original video picture.
  • Several video coding standards are used for "lossy hybrid video coding" (that is, spatial and temporal prediction in a sample domain is combined with 2D transform coding for applying quantization in a transform domain).
  • Each picture of a video sequence is typically partitioned into a set of non-overlapping blocks, and coding is typically performed at a block level.
  • a video is usually processed, that is, encoded, at a block (picture block) level.
  • a prediction block is generated through spatial (intra-picture) prediction or temporal (inter-picture) prediction, the prediction block is subtracted from a current block (block being processed or to be processed) to obtain a residual block (prediction error block), and the residual block is transformed into the transform domain and quantized to reduce an amount of data that is to be transmitted (compressed).
  • an inverse processing part relative to the encoder is applied to an encoded block or a compressed block to reconstruct the current block for representation.
  • the encoder duplicates the decoder processing loop such that both generate identical predictions (for example, intra- and inter predictions) and/or re-constructions for processing, that is, coding, the subsequent blocks.
  • ANN Artificial neural networks
  • Artificial neural networks (ANNs) or connectionist systems are computing systems vaguely inspired by the biological neural networks that constitute animal brains. Such systems "learn" to perform tasks by considering examples, generally without being programmed with task-specific rules. For example, in image recognition, they might learn to identify images that contain cats by analyzing example images that have been manually labeled as "cat" or "no cat" and using the results to identify cats in other images. They do this without any prior knowledge of cats, for example, that they have fur, tails, whiskers and cat-like faces. Instead, they automatically generate identifying characteristics from the examples that they process.
  • An ANN is based on a collection of connected units or nodes called artificial neurons, which loosely model the neurons in a biological brain. Each connection, like the synapses in a biological brain, can transmit a signal to other neurons. An artificial neuron that receives a signal then processes it and can signal neurons connected to it.
  • the "signal" at a connection is a real number, and the output of each neuron is computed by some non-linear function of the sum of its inputs.
  • the connections are called edges. Neurons and edges typically have a weight that adjusts as learning proceeds. The weight increases or decreases the strength of the signal at a connection. Neurons may have a threshold such that a signal is sent only if the aggregate signal crosses that threshold. Typically, neurons are aggregated into layers. Different layers may perform different transformations on their inputs. Signals travel from the first layer (the input layer) to the last layer (the output layer), possibly after traversing the layers multiple times.
  • ANNs have been used on a variety of tasks, including computer vision, speech recognition, machine translation, social network filtering, playing board and video games, medical diagnosis, and even in activities that have traditionally been considered as reserved to humans, like painting.
  • CNN convolutional neural network
  • Fig. 1 schematically illustrates a general concept of processing by a neural network such as the CNN.
  • a convolutional neural network consists of an input and an output layer, as well as multiple hidden layers.
  • The input layer is the layer to which the input (such as a portion 11 of an input image as shown in Fig. 1) is provided for processing.
  • the hidden layers of a CNN typically consist of a series of convolutional layers that convolve with a multiplication or other dot product.
  • the result of a layer is one or more feature maps (illustrated by empty solid-line rectangles), sometimes also referred to as channels.
  • the feature maps may become smaller, as illustrated in Fig. 1.
  • a convolution with a stride may also reduce the size of (resample) an input feature map.
  • the activation function in a CNN is usually a ReLU (Rectified Linear Unit) layer, which is subsequently followed by additional layers such as pooling layers, fully connected layers and normalization layers; these are referred to as hidden layers because their inputs and outputs are masked by the activation function and the final convolution.
  • ReLU Rectified Linear Unit
  • the input is a tensor with shape (number of images) x (image width) x (image height) x (image depth).
  • image depth can be constituted by channels of an image.
  • after passing through a convolutional layer, the image becomes abstracted to a feature map, with shape (number of images) x (feature map width) x (feature map height) x (feature map channels).
  • a convolutional layer within a neural network should have the following attributes: convolutional kernels defined by a width and a height (hyper-parameters); the number of input channels and output channels (hyper-parameters); and the depth of the convolution filter (the number of input channels), which should be equal to the number of channels (depth) of the input feature map.
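  • The following small PyTorch example (an illustration only) shows how these attributes relate input and output feature-map shapes:

```python
import torch
import torch.nn as nn

# 3 input channels (e.g. RGB), 16 output channels, 5x5 kernel; the filter
# depth equals the 3 channels of the input feature map.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=5, padding=2)

x = torch.randn(1, 3, 128, 128)          # (images, channels, height, width)
y = conv(x)
print(y.shape)                           # torch.Size([1, 16, 128, 128])

# A convolution with stride 2 resamples (halves) the spatial size.
down = nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1)
print(down(y).shape)                     # torch.Size([1, 32, 64, 64])
```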
  • MLP multilayer perceptron
  • Convolutional neural networks are biologically inspired variants of multilayer perceptrons that are specifically designed to emulate the behavior of a visual cortex. These models mitigate the challenges posed by the MLP architecture by exploiting the strong spatially local correlation present in natural images.
  • the convolutional layer is the core building block of a CNN.
  • the layer's parameters consist of a set of learnable filters (the above-mentioned kernels), which have a small receptive field, but extend through the full depth of the input volume.
  • each filter is convolved across the width and height of the input volume, computing the dot product between the entries of the filter and the input and producing a 2-dimensional activation map of that filter.
  • the network learns filters that activate when it detects some specific type of feature at some spatial position in the input.
  • a feature map, or activation map is the output activations for a given filter.
  • Feature map and activation map have the same meaning. In some papers it is called an activation map because it is a mapping that corresponds to the activation of different parts of the image, and also a feature map because it is also a mapping of where a certain kind of feature is found in the image. A high activation means that a certain feature was found.
  • pooling is a form of non-linear down-sampling.
  • max pooling is the most common. It partitions the input image into a set of non-overlapping rectangles and, for each such sub-region, outputs the maximum.
  • the exact location of a feature is less important than its rough location relative to other features.
  • the pooling layer serves to progressively reduce the spatial size of the representation, to reduce the number of parameters, memory footprint and amount of computation in the network, and hence to also control overfitting. It is common to periodically insert a pooling layer between successive convolutional layers in a CNN architecture.
  • the pooling operation provides another form of translation invariance.
  • the pooling layer operates independently on every depth slice of the input and resizes it spatially.
  • the most common form is a pooling layer with filters of size 2x2 applied with a stride of 2, which downsamples every depth slice in the input by 2 along both width and height, discarding 75% of the activations.
  • every max operation is over 4 numbers.
  • the depth dimension remains unchanged.
  • pooling units can use other functions, such as average pooling or ℓ2-norm pooling. Average pooling was often used historically but has recently fallen out of favour compared to max pooling, which often performs better in practice. Due to the aggressive reduction in the size of the representation, there is a recent trend towards using smaller filters or discarding pooling layers altogether.
  • Region of Interest pooling (also known as ROI pooling) is a variant of max pooling, in which the output size is fixed and the input rectangle is a parameter. Pooling is an important component of convolutional neural networks for object detection based on the Fast R-CNN architecture.
  • ReLU is the abbreviation of rectified linear unit, which applies the nonsaturating activation function. It effectively removes negative values from an activation map by setting them to zero. It increases the nonlinear properties of the decision function and of the overall network without affecting the receptive fields of the convolution layer.
  • Other functions are also used to increase nonlinearity, for example the saturating hyperbolic tangent and the sigmoid function.
  • ReLU is often preferred to other functions because it trains the neural network several times faster without a significant penalty to generalization accuracy.
  • Leaky Rectified Linear Unit also known as Leaky ReLU
  • Leaky ReLU may be considered as a ReLU-based type of activation function that applies a small non-zero slope to negative input values instead of setting them to zero.
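  • The short example below (illustrative only) shows 2x2 max pooling with stride 2 and the difference between ReLU and Leaky ReLU:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 64, 64)

# 2x2 max pooling with stride 2 halves width and height (75% of the
# activations are discarded); the depth (channel) dimension is unchanged.
pooled = nn.MaxPool2d(kernel_size=2, stride=2)(x)
print(pooled.shape)                      # torch.Size([1, 16, 32, 32])

# ReLU sets negative values to zero; Leaky ReLU keeps a small slope instead.
t = torch.tensor([-1.0, 0.5])
print(nn.ReLU()(t))                      # tensor([0.0000, 0.5000])
print(nn.LeakyReLU(0.1)(t))              # tensor([-0.1000, 0.5000])
```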
  • the high-level reasoning in the neural network is done via fully connected layers.
  • Neurons in a fully connected layer have connections to all activations in the previous layer, as seen in regular (non-convolutional) artificial neural networks. Their activations can thus be computed as an affine transformation, with matrix multiplication followed by a bias offset (vector addition of a learned or fixed bias term).
  • the "loss layer" (including the calculation of a loss function) specifies how training penalizes the deviation between the predicted (output) and true labels and is normally the final layer of a neural network.
  • Various loss functions appropriate for different tasks may be used.
  • Softmax loss is used for predicting a single class of C mutually exclusive classes.
  • Sigmoid cross-entropy loss is used for predicting C independent probability values in [0, 1].
  • Euclidean loss is used for regressing to real-valued labels.
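  • The three loss types mentioned above correspond, for example, to the following PyTorch criteria (illustrative only):

```python
import torch
import torch.nn as nn

logits = torch.randn(4, 10)              # batch of 4, C = 10 classes
labels = torch.randint(0, 10, (4,))

# Softmax (cross-entropy) loss: one class out of C mutually exclusive classes.
softmax_loss = nn.CrossEntropyLoss()(logits, labels)

# Sigmoid cross-entropy loss: C independent probabilities in [0, 1].
targets = torch.rand(4, 10)
sigmoid_loss = nn.BCEWithLogitsLoss()(logits, targets)

# Euclidean (mean squared error) loss: regression to real-valued labels.
euclidean_loss = nn.MSELoss()(torch.randn(4, 3), torch.randn(4, 3))

print(softmax_loss.item(), sigmoid_loss.item(), euclidean_loss.item())
```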
  • Fig. 1 shows the data flow in a typical convolutional neural network.
  • the input image is passed through convolutional layers and becomes abstracted to a feature map comprising several channels, corresponding to a number of filters in a set of learnable filters of this layer.
  • the feature map is subsampled using e.g. a pooling layer, which reduces the dimension of each channel in the feature map.
  • the data comes to another convolutional layer, which may have different numbers of output channels.
  • the number of input channels and output channels are hyper-parameters of the layer.
  • the number of input channels for the current layers should be equal to the number of output channels of the previous layer.
  • the number of input channels is normally equal to the number of channels of data representation, for instance 3 channels for RGB or YUV representation of images or video, or 1 channel for grayscale image or video representation.
  • the channels obtained by one or more convolutional layers may be passed to an output layer.
  • Such an output layer may be a convolutional or resampling layer in some implementations.
  • the output layer is a fully connected layer.
  • a generative adversarial network is a deep learning model.
  • the model includes at least two modules: a generative model (generator) and a discriminative model (discriminator). The two modules learn from each other, to generate a better output.
  • Both the generative model and the discriminative model may be neural networks, and may specifically be deep neural networks or convolutional neural networks.
  • a basic principle of the GAN is as follows: Using a GAN for generating a picture as an example, it is assumed that there are two networks: G (Generator) and D (Discriminator).
  • G is a network for generating a picture. G receives random noise z, and generates the picture by using the noise, where the picture is denoted as G(z).
  • D is a discriminator network used to determine whether a picture is "real".
  • An input parameter of D is x, where x represents a picture, and an output D(x) represents the probability that x is a real picture. If the value of D(x) is 1, it indicates that the picture is 100% real. If the value of D(x) is 0, it indicates that the picture cannot be real.
  • an objective of the generative network G is to generate a picture that is as real as possible to deceive the discriminative network D, and an objective of the discriminative network D is to distinguish between the picture generated by G and a real picture as much as possible.
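  • A toy sketch of this adversarial objective (using flat vectors and tiny fully connected networks purely for illustration; real generators and discriminators would be deep convolutional networks):

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 32))  # generator
D = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))   # discriminator
bce = nn.BCEWithLogitsLoss()

real = torch.randn(8, 32)                # stands in for real pictures
z = torch.randn(8, 16)                   # random noise z
fake = G(z)                              # generated pictures G(z)

# D tries to output 1 for real pictures and 0 for generated ones.
d_loss = bce(D(real), torch.ones(8, 1)) + bce(D(fake.detach()), torch.zeros(8, 1))

# G tries to make D classify its output as real (label 1).
g_loss = bce(D(fake), torch.ones(8, 1))
print(d_loss.item(), g_loss.item())
```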
  • a bottleneck within the context of a neural network architecture may be considered as a position in which a bottleneck neural network layer is arranged.
  • a bottleneck neural network layer may be considered as a neural network layer having less neural network nodes than adjacent (in particular preceding and successive) neural network layers.
  • the bottleneck can therefore be considered as the neural network layer having the minimum number of neural network nodes. It can therefore be used to obtain a representation of the input with reduced dimensionality.
  • Training is the adaptation of the network to better handle a task by considering sample observations. Training involves adjusting the weights and other parameters of the network to improve the accuracy of the result. This is done by minimizing the observed errors. After the training is finished, the neural network with the adapted weights is called a trained neural network.
  • An autoencoder is a type of artificial neural network used to learn efficient data codings in an unsupervised manner. A schematic drawing thereof is shown in Fig. 2.
  • the autoencoder includes an encoder side 2010 with an input x inputted into an input layer of an encoder subnetwork 2020 and a decoder side 2050 with output x’ outputted from a decoder subnetwork 2060.
  • the aim of an autoencoder is to learn a representation (encoding) 2030 for a set of data x, typically for dimensionality reduction, by training the network 2020, 2060 to ignore signal “noise”.
  • a reconstructing (decoder) side subnetwork 2060 is learnt, where the autoencoder tries to generate from the reduced encoding 2030 a representation x’ as close as possible to its original input x, hence its name.
  • the encoder stage of the autoencoder maps the input x to h = σ(Wx + b). This image h is usually referred to as code 2030, latent variables, or latent representation.
  • σ is an element-wise activation function such as a sigmoid function or a rectified linear unit.
  • W is a weight matrix
  • b is a bias vector. Weights and biases are usually initialized randomly, and then updated iteratively during training through Backpropagation.
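  • A minimal autoencoder sketch matching this description (assumed sizes; the sigmoid plays the role of the element-wise activation σ):

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 32), nn.Sigmoid())  # h = sigma(Wx + b)
decoder = nn.Sequential(nn.Linear(32, 784), nn.Sigmoid())  # x' reconstructed from h

x = torch.rand(1, 784)        # e.g. a flattened 28x28 image
h = encoder(x)                # code / latent representation (reduced dimension)
x_rec = decoder(h)            # reconstruction x', trained to be close to x

loss = nn.MSELoss()(x_rec, x) # reconstruction error minimized during training
loss.backward()               # weights and biases updated via backpropagation
print(h.shape, loss.item())
```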
  • Variational autoencoder models make strong assumptions concerning the distribution of latent variables. They use a variational approach for latent representation learning, which results in an additional loss component and a specific estimator for the training algorithm called the Stochastic Gradient Variational Bayes (SGVB) estimator. It is assumed that the data is generated by a directed graphical model p_θ(x|h) and that the encoder learns an approximation q_φ(h|x) to the posterior distribution p_θ(h|x), where φ and θ denote the parameters of the encoder (recognition model) and of the decoder (generative model), respectively.
  • the probability distribution of the latent vector of a VAE typically matches that of the training data much closer than a standard autoencoder.
  • the objective of the VAE has the following form: L(φ, θ, x) = D_KL(q_φ(h|x) || p_θ(h)) − E_{q_φ(h|x)}(log p_θ(x|h)).
  • D_KL stands for the Kullback-Leibler divergence.
  • the shape of the variational and the likelihood distributions is chosen such that they are factorized Gaussians: q_φ(h|x) = N(ρ(x), ω²(x)) and p_θ(x|h) = N(μ(h), σ²(h)), where ρ(x) and ω²(x) are the encoder output, while μ(h) and σ²(h) are the decoder outputs.
  • In this context, known as the lossy compression problem, one must trade off two competing costs: the entropy of the discretized representation (rate) and the error arising from the quantization (distortion). Different compression applications, such as data storage or transmission over limited-capacity channels, demand different rate-distortion trade-offs. Joint optimization of rate and distortion is difficult. Without further constraints, the general problem of optimal quantization in high-dimensional spaces is intractable. For this reason, most existing image compression methods operate by linearly transforming the data vector into a suitable continuous-valued representation, quantizing its elements independently, and then encoding the resulting discrete representation using a lossless entropy code. This scheme is called transform coding due to the central role of the transformation.
  • JPEG uses a discrete cosine transform on blocks of pixels
  • JPEG 2000 uses a multi-scale orthogonal wavelet decomposition.
  • the three components of transform coding methods - transform, quantizer, and entropy code - are separately optimized (often through manual parameter adjustment).
  • Modern video compression standards like HEVC, VVC and EVC also use a transformed representation to code the residual signal after prediction.
  • several transforms are used for that purpose, such as discrete cosine and sine transforms (DCT, DST), as well as low-frequency non-separable manually optimized transforms (LFNST).
  • VAE Variational Auto-Encoder
  • Fig. 3a exemplifies the VAE framework.
  • This latent representation may also be referred to as a part of or a point within a “latent space” in the following.
  • the function f() is a transformation function that converts the input signal x into a more compressible representation y.
  • the entropy model, or the hyper encoder/decoder (also known as hyperprior) 103 estimates the distribution of the quantized latent representation y to get the minimum rate achievable with a lossless entropy source coding.
  • the latent space can be understood as a representation of compressed data in which similar data points are closer together in the latent space.
  • Latent space is useful for learning data features and for finding simpler representations of data for analysis.
  • the quantized latent representation ŷ and the side information ẑ of the hyperprior 103 are included into a bitstream (are binarized) using arithmetic coding (AE).
  • AE arithmetic coding
  • a decoder 104 is provided that transforms the quantized latent representation into the reconstructed image x̂, x̂ = g(ŷ).
  • the signal x̂ is the estimation of the input image x. It is desirable that x̂ is as close to x as possible, in other words, that the reconstruction quality is as high as possible.
  • the side information includes bitstream1 and bitstream2 shown in Fig. 3a, which are generated by the encoder and transmitted to the decoder. Normally, the higher the amount of side information, the higher the reconstruction quality. However, a high amount of side information means that the compression ratio is low. Therefore, one purpose of the system described in Fig. 3a is to balance the reconstruction quality and the amount of side information conveyed in the bitstream.
  • the component AE 105 is the Arithmetic Encoding module, which converts samples of the quantized latent representation and the side information z into a binary representation bitstream 1.
  • the samples of y and z might for example comprise integer or floating point numbers.
  • One purpose of the arithmetic encoding module is to convert (via the process of binarization) the sample values into a string of binary digits (which is then included in the bitstream that may comprise further portions corresponding to the encoded image or further side information).
  • the arithmetic decoding (AD) 106 is the process of reverting the binarization process, where binary digits are converted back to sample values.
  • the arithmetic decoding is provided by the arithmetic decoding module 106.
  • the present disclosure is not limited to this particular framework. Moreover, the present disclosure is not restricted to image or video compression, and can be applied to object detection, image generation, and recognition systems as well.
  • In Fig. 3a, there are two subnetworks concatenated to each other.
  • a subnetwork in this context is a logical division between the parts of the total network.
  • the modules 101, 102, 104, 105 and 106 are called the "Encoder/Decoder" subnetwork.
  • the "Encoder/Decoder" subnetwork is responsible for encoding (generating) and decoding (parsing) of the first bitstream "bitstream1".
  • the second network in Fig. 3a comprises modules 103, 108, 109, 110 and 107 and is called “hyper encoder/decoder” subnetwork.
  • the second subnetwork is responsible for generating the second bitstream “bitstream2”. The purposes of the two subnetworks are different.
  • the first subnetwork is responsible for: transforming 101 the input image x into its latent representation y, quantizing 102 the latent representation y into a quantized latent representation ŷ, compressing the quantized latent representation ŷ into bitstream1 using the arithmetic encoding module 105, and parsing bitstream1 via the arithmetic decoding module 106 so that the decoder 104 can obtain the reconstructed image x̂.
  • the purpose of the second subnetwork is to obtain statistical properties (e.g. mean value, variance and correlations between samples of bitstream 1) of the samples of “bitstream 1”, such that the compressing of bitstream 1 by first subnetwork is more efficient.
  • the second subnetwork generates a second bitstream "bitstream2", which comprises the said information (e.g. mean value, variance and correlations between samples of bitstream1).
  • the second network includes an encoding part which comprises transforming 103 the quantized latent representation ŷ into side information z, quantizing the side information z into quantized side information ẑ, and encoding (e.g. binarizing) 109 the quantized side information ẑ into bitstream2.
  • the binarization is performed by an arithmetic encoding (AE).
  • a decoding part of the second network includes arithmetic decoding (AD) 110, which transforms the input bitstream2 into decoded quantized side information z'.
  • z' might be identical to ẑ, since the arithmetic encoding and decoding operations are lossless compression methods.
  • the decoded quantized side information z' is then transformed 107 into decoded side information y'.
  • y' represents the statistical properties of ŷ (e.g. the mean value of the samples of ŷ, or the variance of the sample values, or the like).
  • the decoded latent representation y' is then provided to the above-mentioned Arithmetic Encoder 105 and Arithmetic Decoder 106 to control the probability model of ŷ.
  • Fig. 3a describes an example of a VAE (variational auto-encoder), details of which might be different in different implementations. For example, in a specific implementation additional components might be present to more efficiently obtain the statistical properties of the samples of bitstream 1. In one such implementation a context modeler might be present, which targets extracting cross-correlation information of bitstream 1.
  • the statistical information provided by the second subnetwork might be used by AE (arithmetic encoder) 105 and AD (arithmetic decoder) 106 components.
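  • A minimal sketch of such a hyperprior flow (a simple scale-hyperprior with assumed layer sizes and rounding-based quantization; the arithmetic coding itself is omitted and the module shapes are illustrative, not those of the disclosure):

```python
import torch
import torch.nn as nn

width = 128
hyper_encoder = nn.Sequential(                 # y -> side information z
    nn.Conv2d(width, width, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(width, width, 3, stride=2, padding=1))
hyper_decoder = nn.Sequential(                 # quantized z -> statistics of y
    nn.ConvTranspose2d(width, width, 4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(width, width, 4, stride=2, padding=1))

y = torch.randn(1, width, 16, 16)              # latent representation of the image
z = hyper_encoder(y)                           # side information z
z_hat = torch.round(z)                         # quantized side information
scales = torch.exp(hyper_decoder(z_hat))       # statistical properties of y
y_hat = torch.round(y)                         # quantized latent representation

# 'scales' parameterizes the probability model used by the arithmetic encoder
# and decoder when coding y_hat (bitstream1); z_hat itself is coded into
# bitstream2.
print(y_hat.shape, z_hat.shape, scales.shape)
```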
  • Fig. 3a depicts the encoder and decoder in a single figure.
  • the encoder and the decoder may be, and very often are, embedded in mutually different devices.
  • Fig. 3b depicts the encoder and Fig. 3c depicts the decoder components of the VAE framework in isolation.
  • the encoder receives, according to some embodiments, a picture.
  • the input picture may include one or more channels, such as color channels or other kind of channels, e.g. depth channel or motion information channel, or the like.
  • the output of the encoder (as shown in Fig. 3b) is a bitstream1 and a bitstream2.
  • bitstream1 is the output of the first subnetwork of the encoder and bitstream2 is the output of the second subnetwork of the encoder.
  • In Fig. 3c, the two bitstreams, bitstream1 and bitstream2, are received as input and x̂, which is the reconstructed (decoded) image, is generated at the output.
  • the VAE can be split into different logical units that perform different actions. This is exemplified in Figs. 3b and 3c, so that Fig. 3b depicts components that participate in the encoding of a signal, such as a video, and provide encoded information. This encoded information is then received by the decoder components depicted in Fig. 3c for decoding, for example.
  • the components of the encoder and decoder denoted with numerals 12x and 14x may correspond in their function to the components referred to above in Fig. 3a and denoted with numerals 10x.
  • the encoder comprises the encoder 121 that transforms an input x into a signal y, which is then provided to the quantizer 122.
  • the quantizer 122 provides information to the arithmetic encoding module 125 and the hyper encoder 123.
  • the hyper encoder 123 provides the bitstream2 already discussed above to the hyper decoder 127, which in turn provides the information to the arithmetic encoding module 105 (125).
  • the output of the arithmetic encoding module is bitstream1.
  • bitstream1 and bitstream2 are the output of the encoding of the signal, which are then provided (transmitted) to the decoding process.
  • Although the unit 101 (121) is called "encoder", it is also possible to call the complete subnetwork described in Fig. 3b an "encoder".
  • the term encoder in general refers to the unit (module) that converts an input into an encoded (e.g. compressed) output. It can be seen from Fig. 3b that the unit 121 can actually be considered as the core of the whole subnetwork, since it performs the conversion of the input x into y, which is the compressed version of x.
  • the compression in the encoder 121 may be achieved, e.g. by applying a neural network, or in general any processing network with one or more layers. In such a network, the compression may be performed by cascaded processing including downsampling, which reduces the size and/or the number of channels of the input.
  • the encoder may be referred to, e.g. as a neural network (NN) based encoder, or the like.
  • NN neural network
  • Quantization unit, hyper encoder, hyper decoder, arithmetic encoder/decoder
  • Quantization may be provided to further compress the output of the NN encoder 121 by a lossy compression.
  • the AE 125 in combination with the hyper encoder 123 and hyper decoder 127 used to configure the AE 125 may perform the binarization which may further compress the quantized signal by a lossless compression. Therefore, it is also possible to call the whole subnetwork in Fig. 3b an “encoder”.
  • a majority of Deep Learning (DL) based image/video compression systems reduce dimensionality of the signal before converting the signal into binary digits (bits).
  • the encoder, which is a non-linear transform, maps the input image x into y, where y has a smaller width and height than x. Since y has a smaller width and height, and hence a smaller size, the dimension of the signal is reduced and the signal y is easier to compress. It is noted that, in general, the encoder does not necessarily need to reduce the size in both (or in general all) dimensions. Rather, some exemplary implementations may provide an encoder which reduces the size only in one dimension (or, in general, in a subset of dimensions).
  • GDN generalized divisive normalization
  • This cascaded transformation is followed by uniform scalar quantization (i.e., each element is rounded to the nearest integer), which effectively implements a parametric form of vector quantization on the original image space.
  • the compressed image is reconstructed from these quantized values using an approximate parametric nonlinear inverse transform.
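To make the uniform scalar quantization step concrete, the following minimal NumPy sketch rounds each element of a toy latent tensor to the nearest reconstruction level and applies the corresponding dequantization; the step size delta is a hypothetical parameter, not taken from the figure.

```python
import numpy as np

def quantize(y: np.ndarray, delta: float = 1.0) -> np.ndarray:
    """Uniform scalar quantization: round each element to the nearest step."""
    return np.round(y / delta).astype(np.int32)

def dequantize(q: np.ndarray, delta: float = 1.0) -> np.ndarray:
    """Approximate inverse: map integer indices back to reconstruction levels."""
    return q.astype(np.float32) * delta

y = np.random.randn(4, 4).astype(np.float32)   # toy latent tensor
y_hat = dequantize(quantize(y))                # lossy round trip
print(np.abs(y - y_hat).max())                 # bounded by delta / 2
```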
  • Such an example of the VAE framework is shown in Fig. 4; it utilizes 6 downsampling layers that are marked with 401 to 406.
  • the network architecture includes a hyperprior model.
  • the left side (g a , g s ) shows an image autoencoder architecture
  • the right side (h a , h s ) corresponds to the autoencoder implementing the hyperprior.
  • the factorized-prior model uses the identical architecture for the analysis and synthesis transforms g a and g s .
  • Q represents quantization
  • AE, AD represent arithmetic encoder and arithmetic decoder, respectively.
  • the encoder subjects the input image x to g a , yielding the responses y (latent representation) with spatially varying standard deviations.
  • the encoding g a includes a plurality of convolution layers with subsampling and, as an activation function, generalized divisive normalization (GDN).
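A minimal sketch of a GDN activation is given below, assuming the commonly used formulation y_i = x_i / sqrt(beta_i + sum_j gamma_ij * x_j^2) applied across channels; the parameter constraints and reparameterizations used in practical implementations are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleGDN(nn.Module):
    """Generalized divisive normalization across channels (simplified sketch)."""
    def __init__(self, channels: int):
        super().__init__()
        self.beta = nn.Parameter(torch.ones(channels))
        self.gamma = nn.Parameter(0.1 * torch.eye(channels))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, H, W); pool squared responses across channels with a 1x1 convolution
        weight = self.gamma.view(*self.gamma.shape, 1, 1)
        norm = F.conv2d(x * x, weight, bias=self.beta)
        return x / torch.sqrt(norm)

x = torch.randn(1, 8, 16, 16)
print(SimpleGDN(8)(x).shape)  # torch.Size([1, 8, 16, 16])
```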
  • the responses are fed into h a , summarizing the distribution of standard deviations in z.
  • z is then quantized, compressed, and transmitted as side information.
  • the encoder uses the quantized vector z to estimate σ̂, the spatial distribution of standard deviations, which is used for obtaining probability values (or frequency values) for arithmetic coding (AE), and uses it to compress and transmit the quantized image representation y (or latent representation).
  • the decoder first recovers z from the compressed signal. It then uses h s to obtain σ̂, which provides it with the correct probability estimates to successfully recover y as well. It then feeds y into g s to obtain the reconstructed image.
  • the layers that include downsampling are indicated with the downward arrow in the layer description.
  • the layer description “Conv N,k1,2↓” means that the layer is a convolution layer with N channels and a convolution kernel of size k1 x k1.
  • k1 may be equal to 5 and k2 may be equal to 3.
  • the “2↓” means that a downsampling with a factor of 2 is performed in this layer. Downsampling by a factor of 2 results in one of the dimensions of the input signal being reduced by half at the output. In Fig. 4, the 2↓ indicates that both width and height of the input image are reduced by a factor of 2.
  • the output signal z 413 has width and height equal to w/64 and h/64, respectively.
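The accumulated downscaling can be checked with a small helper: six cascaded layers, each halving width and height, reduce an input of size w x h to w/64 x h/64 (the example assumes dimensions divisible by 64).

```python
def output_size(w: int, h: int, num_downsampling_layers: int = 6, factor: int = 2):
    """Spatial size after a cascade of downsampling layers with the given factor."""
    scale = factor ** num_downsampling_layers          # 2**6 = 64
    return w // scale, h // scale

print(output_size(1920, 1088))  # (30, 17) for a 1920x1088 input
```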
  • Modules denoted by AE and AD are arithmetic encoder and arithmetic decoder, which are explained with reference to Figs. 3a to 3c.
  • the arithmetic encoder and decoder are specific implementations of entropy coding.
  • AE and AD can be replaced by other means of entropy coding.
  • an entropy encoding is a lossless data compression scheme that is used to convert the values of a symbol into a binary representation which is a revertible process.
  • the “Q” in the figure corresponds to the quantization operation that was also referred to above in relation to Fig. 4.
  • the quantization operation and a corresponding quantization unit as part of the component 413 or 415 is not necessarily present and/or can be replaced with another unit.
  • the decoder comprises upsampling layers 407 to 412.
  • a further layer 420 is provided between the upsampling layers 411 and 410 in the processing order of an input; it is implemented as a convolutional layer but does not provide an upsampling to the input received.
  • a corresponding convolutional layer 430 is also shown for the decoder.
  • Such layers can be provided in NNs for performing operations on the input that do not alter the size of the input but change specific characteristics. However, it is not necessary that such a layer is provided.
  • the upsampling layers are run through in reverse order, i.e. from upsampling layer 412 to upsampling layer 407.
  • each upsampling layer is shown here to provide an upsampling with an upsampling ratio of 2, which is indicated by the ↑. It is, of course, not necessarily the case that all upsampling layers have the same upsampling ratio, and other upsampling ratios like 3, 4, 8 or the like may also be used.
  • the layers 407 to 412 are implemented as convolutional layers (conv).
  • the upsampling layers may apply a deconvolution operation to the input received so that its size is increased by a factor corresponding to the upsampling ratio.
  • the present disclosure is not generally limited to deconvolution and the upsampling may be performed in any other manner such as by bilinear interpolation between two neighboring samples, or by nearest neighbor sample copying, or the like.
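The sketch below contrasts the two families of upsampling mentioned above: a learned transposed convolution (deconvolution) with stride 2, and parameter-free interpolation (bilinear or nearest neighbor); the kernel size and channel counts are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 64, 32, 32)

# Learned upsampling: transposed convolution with stride 2 doubles H and W.
deconv = nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1)
print(deconv(x).shape)                     # torch.Size([1, 32, 64, 64])

# Parameter-free alternatives producing the same spatial resolution.
print(F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False).shape)
print(F.interpolate(x, scale_factor=2, mode="nearest").shape)
```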
  • GDN generalized divisive normalization
  • IGDN inverse GDN
  • ReLu activation function applied
  • DNN based image compression methods can exploit large scale end-to-end training and highly non-linear transform, which are not used in the traditional approaches. However, it is non-trivial to directly apply these techniques to build an end-to-end learning system for video compression. First, it remains an open problem to learn how to generate and compress the motion information tailored for video compression. Video compression methods heavily rely on motion information to reduce temporal redundancy in video sequences.
  • Fig. 5a shows an overall structure of an end-to-end trainable video compression framework.
  • a CNN was designed to transform the optical flow v t to the corresponding representations m t suitable for better compression.
  • an auto-encoder style network is used to compress the optical flow.
  • the motion vectors (MV) compression network is shown in Fig. 5b.
  • the network architecture is somewhat similar to the ga/gs of Fig. 4.
  • the optical flow v t is fed into a series of convolution operations and nonlinear transforms including GDN and IGDN.
  • the number of output channels c for convolution is here exemplarily 128 except for the last deconvolution layer, which is equal to 2 in this example.
  • the MV encoder will generate the motion representation m t .
  • motion representation is quantized (Q), entropy coded and sent to bitstream as m t .
  • the MV decoder receives the quantized motion representation m t and reconstructs the motion information v t.
  • the values for k and c may differ from the above mentioned examples as is known from the art.
  • Fig. 5c shows a structure of the motion compensation part.
  • using the previously reconstructed frame x t-1 and the reconstructed motion information, the warping unit generates the warped frame (normally with the help of an interpolation filter such as a bi-linear interpolation filter).
  • a separate CNN with three inputs generates the predicted picture.
  • the architecture of the motion compensation CNN is also shown in Fig. 5c.
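A warping unit of the kind described above can be sketched with a bilinear sampler: the previously reconstructed frame is sampled at positions displaced by the reconstructed optical flow. The flow layout and the normalization convention of torch.nn.functional.grid_sample used below are assumptions of this sketch, not taken from Fig. 5c.

```python
import torch
import torch.nn.functional as F

def warp(prev_frame: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Bilinearly warp prev_frame (N, C, H, W) with a pixel-offset flow (N, 2, H, W)."""
    n, _, h, w = prev_frame.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().unsqueeze(0).to(prev_frame)  # (1, 2, H, W)
    pos = base + flow                                                        # displaced positions
    # Normalize to [-1, 1] as required by grid_sample, ordered as (x, y).
    pos_x = 2.0 * pos[:, 0] / max(w - 1, 1) - 1.0
    pos_y = 2.0 * pos[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((pos_x, pos_y), dim=-1)                               # (N, H, W, 2)
    return F.grid_sample(prev_frame, grid, mode="bilinear", align_corners=True)

frame = torch.randn(1, 3, 64, 64)
flow = torch.zeros(1, 2, 64, 64)       # zero motion returns the frame unchanged
print(torch.allclose(warp(frame, flow), frame, atol=1e-5))
```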
  • the residual information between the original frame and the predicted frame is encoded by the residual encoder network.
  • a highly non-linear neural network is used to transform the residuals to the corresponding latent representation. Compared with discrete cosine transform in the traditional video compression system, this approach can better exploit the power of non-linear transform and achieve higher compression efficiency.
  • CNN based architecture can be applied both for image and video compression, considering different parts of video framework including motion estimation, motion compensation and residual coding.
  • entropy coding is a popular method used for data compression, which is widely adopted by the industry and is also applicable for feature map compression, either for human perception or for computer vision tasks.
  • Fig. 6 shows a schematic block diagram of an example video encoder 20 that is configured to implement the techniques of the present disclosure.
  • the video encoder 20 includes an input 201 (or input interface 201), a residual calculation unit 204, a transform processing unit 206, a quantization unit 208, an inverse quantization unit 210, an inverse transform processing unit 212, a reconstruction unit 214, a loop filter unit 220, a decoded picture buffer (DPB) 230, a mode selection unit 260, an entropy encoding unit 270, and an output 272 (or output interface 272).
  • the mode selection unit 260 may include an inter prediction unit 244, an intra prediction unit 254 and a partitioning unit 262.
  • the inter prediction unit 244 may include a motion estimation unit and a motion compensation unit (not shown).
  • the encoder 20 shown in Fig. 6 may also be referred to as a hybrid video encoder.
  • the residual calculation unit 204, the transform processing unit 206, the quantization unit 208, the mode selection unit 260 may be referred to as forming a forward signal path of the encoder 20, whereas the inverse quantization unit 210, the inverse transform processing unit 212, the reconstruction unit 214, the buffer 216, the loop filter 220, the decoded picture buffer (decoded picture buffer, DPB) 230, the inter prediction unit 244 and the intra-prediction unit 254 may be referred to as forming a backward signal path of the video encoder 20, wherein the backward signal path of the video encoder 20 corresponds to the signal path of the decoder (see the video decoder 30 in Fig. 7).
  • the inverse quantization unit 210, the inverse transform processing unit 212, the reconstruction unit 214, the loop filter 220, the decoded picture buffer (decoded picture buffer, DPB) 230, the inter prediction unit 244 and the intra-prediction unit 254 are also referred to as forming the "built-in decoder" of video encoder 20.
  • the encoder 20 may be configured to receive, for example, via the input 201 , a picture 17 (or picture data 17), for example, a picture in a sequence of pictures forming a video or video sequence.
  • the received picture or picture data may also be a pre-processed picture 19 (or pre-processed picture data 19).
  • the picture 17 may also be referred to as current picture or picture to be coded (in particular in video coding to distinguish the current picture from other pictures, for example, previously encoded and/or decoded pictures of the same video sequence, that is, the video sequence which also includes the current picture).
  • a (digital) picture is or can be considered as a two-dimensional array or matrix of samples with intensity values.
  • a sample in the array may also be referred to as pixel or pel (short form of picture element) and vice versa.
  • a quantity of samples in horizontal and vertical direction (or axis) of the array or picture define the size and/or resolution of the picture.
  • three color components are usually employed, to be specific, the picture may be represented as or include three sample arrays.
  • in RGB format or color space, a picture includes a corresponding red, green and blue sample array.
  • each pixel is typically represented in a luminance and chrominance format or color space, for example, YCbCr, which includes a luminance component indicated by Y (sometimes also L is used instead) and two chrominance components indicated by Cb and Cr.
  • the luminance (or short luma) component Y represents the brightness or grey level intensity (e.g. like in a grey-scale picture), while the two chrominance (or short chroma) components Cb and Cr represent the chromaticity or color information components.
  • a picture in YCbCr format includes a luminance sample array of luminance sample values (Y), and two chrominance sample arrays of chrominance values (Cb and Cr).
  • a picture in RGB format may be converted or transformed into YCbCr format and vice versa, the process is also known as color transformation or conversion.
  • if a picture is monochrome, the picture may include only a luminance sample array. Accordingly, a picture may be, for example, an array of luma samples in monochrome format or an array of luma samples and two corresponding arrays of chroma samples in 4:2:0, 4:2:2, and 4:4:4 colour format.
  • the video encoder may be configured to receive directly a block 203 of the picture 17, for example, one, several or all blocks forming the picture 17.
  • the picture block 203 may also be referred to as current picture block or picture block to be coded.
  • the picture block 203 again is or can be considered as a two-dimensional array or matrix of samples with intensity values (sample values), although of smaller dimension than the picture 17.
  • the block 203 may include, for example, one sample array (for example, a luma array in case of a monochrome picture 17, or a luma or chroma array in case of a color picture) or three sample arrays (for example, a luma and two chroma arrays in case of a color picture 17) or any other number and/or type of arrays depending on the color format applied.
  • a quantity of samples in horizontal and vertical direction (or axis) of the block 203 define the size of block 203.
  • a block may, for example, be an A x B (A-column by B-row) array of samples, or an A x B array of transform coefficients.
  • the residual calculation unit 204 may be configured to calculate a residual block 205 (also referred to as residual 205) based on the picture block 203 and a prediction block 265 (further details about the prediction block 265 are provided later), for example, by subtracting sample values of the prediction block 265 from sample values of the picture block 203, sample by sample (pixel by pixel) to obtain the residual block 205 in the sample domain.
  • the transform processing unit 206 may be configured to apply a transform, for example, a discrete cosine transform (DCT) or discrete sine transform (DST), on the sample values of the residual block 205 to obtain transform coefficients 207 in a transform domain.
  • DCT discrete cosine transform
  • DST discrete sine transform
  • the transform coefficients 207 may also be referred to as transform residual coefficients and represent the residual block 205 in the transform domain.
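As a minimal numerical illustration of the residual calculation and the forward transform, the sketch below subtracts a prediction block from the current block sample by sample and applies a 2-D DCT to the residual; the 4x4 blocks are arbitrary toy data, and the floating-point DCT from SciPy stands in for the integer transforms used by real codecs.

```python
import numpy as np
from scipy.fft import dctn, idctn

block = np.random.randint(0, 255, size=(4, 4)).astype(np.float32)       # current picture block
prediction = np.random.randint(0, 255, size=(4, 4)).astype(np.float32)  # prediction block

residual = block - prediction                      # sample-by-sample residual (block 205)
coefficients = dctn(residual, norm="ortho")        # transform coefficients (207)
reconstructed = idctn(coefficients, norm="ortho")  # inverse transform recovers the residual

print(np.allclose(reconstructed, residual, atol=1e-4))  # True (no quantization applied)
```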
  • Embodiments of the video encoder 20 may be configured to output transform parameters, for example, a type of transform or transforms, for example, directly or encoded or compressed via the entropy encoding unit 270, so that, for example, the video decoder 30 may receive and use the transform parameters for decoding.
  • the quantization unit 208 may be configured to quantize the transform coefficients 207 to obtain quantized coefficients 209, for example, by applying scalar quantization or vector quantization.
  • the quantized transform coefficient 209 may also be referred to as a quantized residual coefficient 209.
  • Embodiments of the video encoder 20 may be configured to output quantization parameters (quantization parameter, QP), for example, directly or encoded via the entropy encoding unit 270, so that, for example, the video decoder 30 may receive and apply the quantization parameters for decoding.
  • quantization parameter QP
  • the inverse quantization unit 210 is configured to apply the inverse quantization of the quantization unit 208 on the quantized coefficients to obtain dequantized coefficients 211, for example, by applying the inverse of the quantization scheme applied by the quantization unit 208 based on or using the same quantization step size as the quantization unit 208.
  • the dequantized coefficients 211 may also be referred to as dequantized residual coefficients 211 and correspond - although typically not identical to the transform coefficients due to the loss by quantization - to the transform coefficients 207.
  • the inverse transform processing unit 212 is configured to apply the inverse transform of the transform applied by the transform processing unit 206, for example, an inverse discrete cosine transform (DCT) or inverse discrete sine transform (DST) or other inverse transforms, to obtain a reconstructed residual block 213 (or corresponding dequantized coefficients 213) in the sample domain.
  • the reconstructed residual block 213 may also be referred to as transform block 213.
  • the reconstruction unit 214 (for example, adder or summer 214) is configured to add the transform block 213 (that is, reconstructed residual block 213) to the prediction block 265 to obtain a reconstructed block 215 in the sample domain, for example, by adding the sample values of the reconstructed residual block 213 and the sample values of the prediction block 265.
  • the loop filter unit 220 (or short "loop filter” 220), is configured to filter the reconstructed block 215 to obtain a filtered block 221 , or in general, to filter reconstructed samples to obtain filtered sample values.
  • the loop filter unit is, for example, configured to smooth pixel transitions, or otherwise improve the video quality.
  • the loop filter unit 220 may include one or more loop filters such as a de-blocking filter, a sample-adaptive offset (sample-adaptive offset, SAO) filter or one or more other filters, for example, an adaptive loop filter (adaptive loop filter, ALF), a noise suppression filter (noise suppression filter, NSF), or any combination thereof.
  • SAO sample-adaptive offset
  • ALF adaptive loop filter
  • NSF noise suppression filter
  • the loop filter unit 220 may include a de-blocking filter, a SAO filter and an ALF filter.
  • the order of the filtering process may be the deblocking filter, SAO and ALF.
  • a process called the luma mapping with chroma scaling (luma mapping with chroma scaling, LMCS) (namely, the adaptive in-loop reshaper) is added. This process is performed before deblocking.
  • the deblocking filter process may be also applied to internal sub-block edges, for example, affine sub-blocks edges, ATMVP sub-blocks edges, sub-block transform (sub-block transform, SBT) edges and intra sub-partition (intra subpartition, ISP) edges.
  • loop filter unit 220 is shown as an in-loop filter in Fig. 6, in another implementation, the loop filter unit 220 may be implemented as a post-loop filter.
  • the filtered block 221 may also be referred to as filtered reconstructed block 221.
  • Embodiments of the video encoder 20 may be configured to output loop filter parameters (such as SAO filter parameters or ALF filter parameters or LMCS parameters), for example, directly or encoded via the entropy encoding unit 270, so that, for example, a decoder 30 may receive and apply the same loop filter parameters or respective loop filters for decoding.
  • the decoded picture buffer (decoded picture buffer, DPB) 230 may be a memory that stores reference pictures, or in general reference picture data, for encoding video data by the video encoder 20.
  • the mode selection unit 260 may include an inter prediction unit 244, an intra prediction unit 254 and a partitioning unit 262.
  • Inter prediction unit 244 may include a motion estimation unit and a motion compensation unit (not shown).
  • a video encoder 20 as shown in Fig. 6 may also be referred to as a hybrid video encoder or a video encoder according to a hybrid video codec.
  • the entropy encoding unit 270 is configured to apply, for example, an entropy encoding algorithm or scheme (for example, a variable length coding (VLC) scheme, a context adaptive VLC scheme (CAVLC), an arithmetic coding scheme, a binarization, a context adaptive binary arithmetic coding (CABAC), syntax-based context-adaptive binary arithmetic coding (SBAC), probability interval partitioning entropy (PIPE) coding or another entropy encoding methodology or technique) or bypass (no compression) on the quantized coefficients 209, inter prediction parameters, intra prediction parameters, loop filter parameters and/or other syntax elements to obtain encoded picture data 21 which can be output via the output 272, for example, in the form of an encoded bitstream 21, so that, for example, the video decoder 30 may receive and use the parameters for decoding.
  • the encoded bitstream 21 may be transmitted to the video decoder 30, or stored in a memory for later transmission or retrieval by the video decoder 30.
  • a non-transform based encoder 20 may quantize a residual signal directly without the transform processing unit 206 for some blocks or frames.
  • the encoder 20 may have the quantization unit 208 and the inverse quantization unit 210 that are combined into a single unit.
  • Fig. 7 shows an example of a video decoder 30 that is configured to implement the techniques of the present disclosure.
  • the video decoder 30 is configured to receive encoded picture data 21 (for example, encoded bitstream 21), for example, encoded by encoder 20, to obtain a decoded picture 331.
  • the encoded picture data or bitstream includes information for decoding the encoded picture data, for example, data that represents picture blocks of an encoded video slice (and/or tile groups or tiles) and associated syntax elements.
  • the decoder 30 includes an entropy decoding unit 304, an inverse quantization unit 310, an inverse transform processing unit 312, a reconstruction unit 314 (for example, a summer 314), a loop filter 320, a decoded picture buffer (DBP) 330, a mode application unit 360, an inter prediction unit 344 and an intra prediction unit 354.
  • the inter prediction unit 344 may be or include a motion compensation unit.
  • the video decoder 30 may perform a decoding process that is roughly inverse to the encoding process described with respect to the video encoder 20 in Fig. 6.
  • the inverse quantization unit 210, the inverse transform processing unit 212, the reconstruction unit 214, the loop filter 220, the decoded picture buffer (DPB) 230, the inter prediction unit 244 and the intra prediction unit 254 are also referred to as forming the “built-in decoder” of video encoder 20.
  • the inverse quantization unit 310 may be identical in function to the inverse quantization unit 210
  • the inverse transform processing unit 312 may be identical in function to the inverse transform processing unit 212
  • the reconstruction unit 314 may be identical in function to reconstruction unit 214
  • the loop filter 320 may be identical in function to the loop filter 220
  • the decoded picture buffer 330 may be identical in function to the decoded picture buffer 230. Therefore, the explanations provided for the respective units and functions of the video encoder 20 apply correspondingly to the respective units and functions of the video decoder 30.
  • the entropy decoding unit 304 is configured to parse the bitstream 21 (or in general encoded picture data 21) and perform, for example, entropy decoding to the encoded picture data 21 to obtain, for example, quantized coefficients 309 and/or decoded coding parameters (not shown in Fig. 7), for example, any or all of inter prediction parameters (for example, reference picture index and motion vector), intra prediction parameter (for example, intra-prediction mode or index), transform parameters, quantization parameters, loop filter parameters, and/or other syntax elements.
  • the entropy decoding unit 304 may be configured to apply the decoding algorithms or schemes corresponding to the encoding schemes as described with regard to the entropy encoding unit 270 of the encoder 20.
  • the entropy decoding unit 304 may be further configured to provide inter prediction parameters, intra prediction parameter and/or other syntax elements to the mode application unit 360 and other parameters to other units of the decoder 30.
  • the video decoder 30 may receive the syntax elements at the video slice level and/or the video block level. In addition or as an alternative to slices and respective syntax elements, tile groups and/or tiles and respective syntax elements may be received and/or used.
  • the inverse quantization unit 310 may be configured to receive quantization parameters (QP) (or in general information related to the inverse quantization) and quantized coefficients from the encoded picture data 21 (for example, by parsing and/or decoding, for example, by entropy decoding unit 304) and to apply based on the quantization parameters an inverse quantization on the decoded quantized coefficients 309 to obtain dequantized coefficients 311 , which may also be referred to as transform coefficients 311.
  • the inverse quantization process may include use of a quantization parameter determined by video encoder 20 for each video block in the video slice (or tile or tile group) to determine a degree of quantization and, likewise, a degree of inverse quantization that should be applied.
  • the inverse transform processing unit 312 may be configured to receive dequantized coefficients 311, also referred to as transform coefficients 311, and to apply a transform to the dequantized coefficients 311 in order to obtain reconstructed residual blocks 313 in the sample domain.
  • the reconstructed residual blocks 313 may also be referred to as transform blocks 313.
  • the transform may be an inverse transform, for example, an inverse DCT, an inverse DST, an inverse integer transform, or a conceptually similar inverse transform process.
  • the inverse transform processing unit 312 may be further configured to receive transform parameters or corresponding information from the encoded picture data 21 (for example, by parsing and/or decoding, for example, by entropy decoding unit 304) to determine the transform to be applied to the dequantized coefficients 311.
  • the reconstruction unit 314 (for example, adder or summer 314) may be configured to add the reconstructed residual block 313, to the prediction block 365 to obtain a reconstructed block 315 in the sample domain, for example, by adding the sample values of the reconstructed residual block 313 and the sample values of the prediction block 365.
  • the loop filter unit 320 (either in the coding loop or after the coding loop) is configured to filter the reconstructed block 315 to obtain a filtered block 321 , for example, to smooth pixel transitions, or otherwise improve the video quality.
  • the loop filter unit 320 may include one or more loop filters such as a de-blocking filter, a sample-adaptive offset (sample-adaptive offset, SAO) filter or one or more other filters, for example, an adaptive loop filter (adaptive loop filter, ALF), a noise suppression filter (noise suppression filter, NSF), or any combination thereof.
  • the loop filter unit 320 may include a de-blocking filter, a SAO filter and an ALF filter.
  • the order of the filtering process may be the deblocking filter, SAO and ALF.
  • a process called the luma mapping with chroma scaling (luma mapping with chroma scaling, LMCS) (namely, the adaptive in-loop reshaper) is added. This process is performed before deblocking.
  • the deblocking filter process may be also applied to internal sub-block edges, for example, affine sub-blocks edges, ATMVP sub-blocks edges, sub-block transform (sub-block transform, SBT) edges and intra sub-partition (intra sub- partition, ISP) edges.
  • the loop filter unit 320 is shown as an in-loop filter in Fig. 7, in another implementation, the loop filter unit 320 may be implemented as a post-loop filter.
  • decoded video blocks 321 of a picture are then stored in decoded picture buffer 330, which stores the decoded pictures 331 as reference pictures for subsequent motion compensation for other pictures and/or for output respectively display.
  • the decoder 30 is configured to output the decoded picture 331, for example via output 332, for presentation or viewing to a user.
  • the mode application unit 360 may include an inter prediction unit 344 and an intra prediction unit 354.
  • the inter prediction unit 344 may include a motion estimation unit and a motion compensation unit (not shown).
  • Embodiments of the video decoder 30 as shown in Fig. 7 may be configured to partition and/or decode the picture by using slices (also referred to as video slices), wherein a picture may be partitioned into or decoded using one or more slices (typically non-overlapping).
  • each slice may include one or more blocks (for example, CTUs) or one or more groups of blocks (for example, tiles (H.265/HEVC and VVC) or bricks (VVC)).
  • Embodiments of the video decoder 30 as shown in Fig. 7 may be configured to partition and/or decode the picture by using slices/tile groups (also referred to as video tile groups) and/or tiles (also referred to as video tiles), where a picture may be partitioned into or decoded using one or more slices/tile groups (typically non-overlapping), and each slice/tile group may include, for example, one or more blocks (for example, CTUs) or one or more tiles, where each tile, for example, may be of rectangular shape and may include one or more blocks (for example, CTUs), for example, complete or fractional blocks.
  • the decoder 30 can be used to decode the encoded picture data 21.
  • the decoder 30 may generate an output video stream without processing by the loop filter unit 320.
  • a non-transform based decoder 30 may inverse-quantize a residual signal directly without the inverse transform processing unit 312 for some blocks or frames.
  • the video decoder 30 may have the inverse quantization unit 310 and the inverse transform processing unit 312 combined into a single unit.
  • although embodiments of the present disclosure have been primarily described based on video coding, it should be noted that embodiments of the coding system 10, the encoder 20 and the decoder 30 and the other embodiments described herein may also be configured for still picture processing or coding, that is, the processing or coding of an individual picture independent of any preceding or consecutive picture as in video coding. In general, only the inter prediction units 244 (encoder) and 344 (decoder) may not be available in case the picture processing or coding is limited to a single picture 17.
  • All other functionalities (also referred to as tools or technologies) of the video encoder 20 and the video decoder 30 may equally be used for still picture processing, for example, residual calculation 204/304, transform 206, quantization 208, inverse quantization 210/310, (inverse) transform 212/312, partitioning 262/362, intra-prediction 254/354, and/or loop filtering 220, 320, and entropy coding 270 and entropy decoding 304.
  • One or more blocks within the exemplary scheme of the hybrid encoder explained above with reference to Figs. 6 and 7 may be implemented by a processing by neural networks.
  • the inter prediction unit 244 and/or the intra prediction unit 254 may involve processing by neural networks.
  • Image data which is decoded from a bitstream may be processed by a neural network to obtain enhanced decoded image data, having an improved quality compared to the image data without enhancement.
  • processing is performed, for example, by a so-called multi-scale reconstruction network.
  • At a decoding side including such a multi-scale reconstruction network at least two representations of the same image data are processed by a neural network to obtain reconstructed image data. Such processing is shown exemplarily in the flowchart of Fig. 8 and in the schematic drawing of Fig. 10.
  • a first representation 1010 is decoded S810 from the bitstream.
  • a second representation 1011 of the image data is obtained S820.
  • the second representation 1011 may be decoded from the bitstream.
  • a second representation 1011 being decoded from the bitstream may have a smaller resolution and/or a different bitrate than the first representation 1010.
  • the bitrate of a representation encoded in the bitstream may be adjusted by the selection of quantization parameters in the encoding.
  • the second representation is obtained by decoding said second representation from the bitstream.
  • the second representation 1011 may have a smaller resolution and possibly a different bitrate in the bitstream than the first representation 1010.
  • the second representation 1011 may have the same resolution and a different bitrate than the first representation 1010.
  • the second representation 1011 may have a different (i.e. smaller) resolution than the first representation 1010.
  • the first representation 1010 of the image data has a resolution of H1 x W1, wherein H1 represents the height of the first representation 1010 and W1 represents the width of the first representation 1010.
  • the resolution of the second representation 1011 is H2 x W2, wherein the height H2 and the width W2 of the second representation 1011 are smaller than the height and the width of the first representation 1010: H2 < H1 and W2 < W1.
  • the resolution of the second representation 1011 may be, for example, H1/2 x W1/2.
  • the present invention is not limited to two representations of the same image data. Any plurality of representations of the same image data including at least the first and the second representation may be used.
  • a third representation included in the plurality of representations may be obtained similar to the second representation, i.e. by decoding from the bitstream, by downscaling of the first representation, or the like.
  • the third representation has a resolution H3 x W3 which may be smaller than the resolution of the first and the second representation.
  • the encoded third representation may have a bitrate different from a bitrate of the encoded first representation or from a bitrate of the encoded second representation.
  • the first representation 1010 may have a resolution of H1 x W1,
  • the second representation 1011 may have a resolution of H1/2 x W1/2,
  • the third representation may have a resolution of H1/4 x W1/4.
  • the decoding of the first and the second representation from the bitstream may include a splitting of the bitstream into at least two subbitstreams.
  • the first and the second representation may then be decoded from a respective subbitstream.
  • the decoding of the respective subbitstreams may be performed sequentially or in parallel.
  • the decoding of the respective subbitstreams may be performed by a single decoder or by multiple decoders.
  • Such decoding is exemplarily shown in Figs. 11 a and 11b.
  • a bitstream 1130 is decoded using multiple decoders 1141 , 1142 and 1143. Said multiple decoders may be multiple instances of a same decoder or may be different decoders.
  • a second exemplary implementation shown in Fig. 11b uses a single decoder 1140 to decode representations from the bitstream 1131.
  • the decoding may comprise decoding the bitstream and splitting the decoded bitstream into at least the first and the second representation.
  • the decoding of the bitstream may be, for instance, performed by the single decoder 1140 of the second exemplary implementation.
  • Any above-mentioned decoding may be performed by a decoder of an autoencoding neural network, which is explained in detail in section Autoencoders and unsupervised learning, or by a hybrid block-based decoder as explained in detail in section Hybrid block-based encoder and decoder.
  • the obtaining of a second representation 1011 may include a downscaling of the first representation 1010.
  • a second representation 1011 that is obtained by downscaling the first representation 1010 has a smaller resolution than the first representation 1010.
  • a downscaling may be performed by downsampling (i.e. leaving out a plurality of pixels of the first representation), by applying a filtering, which results in a lower resolution, or the like. This is depicted in a third exemplary implementation in Fig. 11c.
  • an exemplary first representation 1150 is decoded from the bitstream 1132 by a decoder 1140.
  • An exemplary second representation 1151 is generated from the exemplary first representation 1150, the exemplary second representation 1151 having a lower resolution than the exemplary first representation 1150. From the exemplary second representation 1151 an exemplary third representation 1152 is generated having a lower resolution than the exemplary second representation 1151.
  • the generating includes a downscaling, e.g. by leaving out a plurality of pixels.
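Both downscaling variants mentioned above can be sketched in a few lines: plain subsampling that leaves out every second pixel, and a filtered (area-averaging) downscale; the factor of 2 is illustrative.

```python
import torch
import torch.nn.functional as F

first_representation = torch.randn(1, 3, 128, 128)   # toy stand-in for representation 1150

# Variant 1: leave out every second pixel in each dimension (no filtering).
subsampled = first_representation[:, :, ::2, ::2]

# Variant 2: low-pass by area averaging, giving the same resolution reduction.
filtered = F.interpolate(first_representation, scale_factor=0.5, mode="area")

print(subsampled.shape, filtered.shape)  # both torch.Size([1, 3, 64, 64])
```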
  • the first representation 1010 is processed S830 by applying a first neural network 1020 (first feature extraction network).
  • the processing generates a first latent representation and a second latent representation.
  • the first feature extraction network 1020 may include at least a plurality of downsampling (compressing) layers 1021 , a bottleneck layer 1022 and a plurality of upsampling (decompressing) layers 1023 to 1025.
  • a downsampling layer out of the plurality of downsampling layers, which reduce spatial resolution and similarly increase channel resolution, may have a structure as shown for example in Fig. 12a.
  • the exemplary downsampling layer in Fig. 12a includes three convolutional layers, each of said convolutional layers being followed by a LeakyReLU.
  • An upsampling layer out of the plurality of upsampling layers, which increase spatial resolution and similarly decrease channel resolution, may have a structure as shown for example in Fig. 12b.
  • the exemplary upsampling layer in Fig. 12b includes three convolutional layers, each of said convolutional layers being followed by a LeakyReLU.
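A sketch of the layer structure described for Figs. 12a and 12b could look as follows: three convolutional layers, each followed by a LeakyReLU, where the downsampling variant uses a strided convolution and the upsampling variant a transposed convolution. Kernel sizes, strides and channel counts are assumptions of this sketch.

```python
import torch
import torch.nn as nn

def downsampling_block(in_ch: int, out_ch: int) -> nn.Sequential:
    """Three convolutions with LeakyReLU; the first halves the spatial resolution."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1), nn.LeakyReLU(0.2),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.LeakyReLU(0.2),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.LeakyReLU(0.2),
    )

def upsampling_block(in_ch: int, out_ch: int) -> nn.Sequential:
    """Three convolutions with LeakyReLU; the first doubles the spatial resolution."""
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1), nn.LeakyReLU(0.2),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.LeakyReLU(0.2),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.LeakyReLU(0.2),
    )

x = torch.randn(1, 32, 64, 64)
print(downsampling_block(32, 64)(x).shape)  # torch.Size([1, 64, 32, 32])
print(upsampling_block(32, 16)(x).shape)    # torch.Size([1, 16, 128, 128])
```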
  • the output of such an upsampling or downsampling layer is a so-called latent representation.
  • the first latent representation is an output of the bottleneck layer 1022 of the first feature extraction network 1020.
  • the second latent representation is from a layer of the first feature extraction network 1020 different from the bottleneck layer 1022.
  • the second latent representation is an output of the upsampling part of the first feature extraction network 1020.
  • the second latent representation may be an output of a first upsampling layer 1023 in the first feature extraction network 1020 subsequently to the bottleneck layer 1022.
  • the second representation 1011 is processed S840 by applying a second neural network 1030 (second feature extraction network).
  • the processing generates a third latent representation and a fourth latent representation.
  • the second feature extraction network 1030 may include at least a plurality of downsampling layers 1031 , a bottleneck layer 1032 and a plurality of upsampling layers 1033 to 1035.
  • the second representation 1011 out of the two exemplary representations may be processed to obtain a same resolution as the first representation before applying the second neural network. Such processing may involve an interpolation, for example a bicubic upsampling, to obtain the same resolution as the first representation.
  • the processed second representation 1012 is provided as input to a second exemplary feature extraction network 1030.
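The pre-processing of the second representation to the resolution of the first representation can be sketched with a bicubic interpolation, for example:

```python
import torch
import torch.nn.functional as F

first = torch.randn(1, 3, 256, 256)    # first representation (H1 x W1)
second = torch.randn(1, 3, 128, 128)   # second, lower-resolution representation

# Bicubic upsampling of the second representation to H1 x W1 before feature extraction.
second_upsampled = F.interpolate(second, size=first.shape[-2:], mode="bicubic",
                                 align_corners=False)
print(second_upsampled.shape)          # torch.Size([1, 3, 256, 256])
```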
  • the third latent representation is an output of the bottleneck layer of the second feature extraction network.
  • the fourth latent representation is from a layer of the second feature extraction network different from the bottleneck layer 1032.
  • the fourth latent representation is an output of the upsampling part of the second feature extraction network 1030.
  • the fourth latent representation may be an output of a first upsampling layer 1033 in the second feature extraction network 1030 subsequently to the bottleneck layer 1032.
  • the first and the third latent representation are processed S850 by a combining processing, which combines said two representations.
  • the combining processing may include concatenating 1040 of the first latent representation and the third latent representation.
  • the combining processing may include a processing by a fully-connected neural network 1041.
  • a fully-connected neural network 1041 may be a multilayer perceptron.
  • the combining processing may include the concatenating 1040 and the processing by a fully-connected neural network 1041.
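A minimal sketch of the combining processing, under the assumption that the two bottleneck latents are flattened before concatenation and fused by a small fully-connected network (multilayer perceptron):

```python
import torch
import torch.nn as nn

class CombineLatents(nn.Module):
    """Concatenate two bottleneck latents and fuse them with a fully-connected network."""
    def __init__(self, dim_first: int, dim_third: int, dim_out: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim_first + dim_third, dim_out), nn.LeakyReLU(0.2),
            nn.Linear(dim_out, dim_out),
        )

    def forward(self, first_latent: torch.Tensor, third_latent: torch.Tensor) -> torch.Tensor:
        combined = torch.cat((first_latent.flatten(1), third_latent.flatten(1)), dim=1)  # 1040
        return self.mlp(combined)                                                        # 1041

z1 = torch.randn(1, 512)   # first latent representation (bottleneck of network 1020)
z3 = torch.randn(1, 512)   # third latent representation (bottleneck of network 1030)
print(CombineLatents(512, 512, 512)(z1, z3).shape)  # torch.Size([1, 512])
```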
  • An obtaining S860 of reconstructed image data 1060 includes the processing by a generation neural network 1050.
  • the generation neural network 1050 is applied to an output of the combining processing, the second latent representation and the fourth latent representation.
  • Such a generation neural network may be a generative adversarial network (GAN), which is explained above in section Artificial neural networks.
  • the applying of the first feature extraction network may generate a first set of latent representations (first set of extracted features). Said first set includes the second latent representation.
  • the applying of the second feature extraction network may generate a second set of latent representations (second set of extracted features), which includes the fourth latent representation. Said first and second set may be provided as additional input to the generation neural network.
  • N representations may be obtained, wherein N is an integer greater than two and the N representations include the first representation and the second representation.
  • the first representation is decoded from the bitstream.
  • a j-th representation out of the N representations, j being two to N, may be obtained by either decoding said j-th representation from the bitstream or by downscaling of the first representation.
  • each of the N representations may be decoded from the bitstream.
  • each j-th representation out of the N representations, j being two to N may be obtained by a downscaling of the first representation.
  • a subset of representations including the first representation may be decoded from the bitstream and the remaining representations out of the N representations, which are not included in said subset may be obtained by downscaling.
  • Each i-th of the N representations may be processed by applying an i-th neural network (i-th feature extraction network).
  • Said applying of the i-th neural network includes generating a (2i-1)-th latent representation and a (2i)-th latent representation.
  • Each of said N networks is applied to one of the N representations, respectively.
  • the applying of each of said N feature extraction networks includes the generating of at least two latent representations.
  • One of the at least two latent representations is an output of the bottleneck layer of the respective feature extraction network.
  • Each of the exemplary feature extraction networks 1020 and 1030 in Fig. 10 includes a plurality of downsampling layers 1021 and 1031 , one bottleneck layer 1022 and 1032 and a plurality of upsampling layers 1023 to 1025 and 1033 to 1035.
  • the i-th neural network includes Ki downsampling layers, a bottleneck layer, and Ki upsampling layers, Ki being an integer greater than zero.
  • the (2i-1)-th latent representation is an output of the bottleneck layer of the i-th neural network.
  • the (2i)-th latent representation is an output of a layer in the decompressing (upsampling) part of the i-th neural network, i.e. out of the Ki upsampling layers.
  • each i-th out of the N feature extraction networks may output an i-th set of latent representations, including the (2i)-th latent representations.
  • Each of the latent representations is output of a layer within the upsampling part of the i-th neural network.
  • the generation neural network 1050 includes at least two blocks 1051 to 1053, wherein a block is a neural subnetwork including at least one layer.
  • a first block 1051 of the generation neural network 1050 receives as input the output of the combining processing and each of the (2i)-th latent representations.
  • a second block 1052 receives as input one or more of (i) an output of the first block, (ii) the output of the combining processing, and (iii) for each i-th neural network, a latent representation output of a second layer out of the Ki up-sampling layers in the i-th neural network, the second layer following the first layer within the Ki up-sampling layers.
  • said first layer, e.g. the first upsampling layer 1023 in the first feature extraction network, outputs the (2i)-th latent representation, which is input to the first block 1051.
  • examples for such second layers are the second layer 1024 in the first feature extraction network 1020 and the second layer 1034 in the second feature extraction network 1030.
  • the output 1310 of every generator block is split into three parts (P0, P1, P2), for example in proportion (0.5, 0.25, 0.25).
  • the respective outputs of the first feature extraction network 1311 and the second feature extraction network 1312 are processed by convolutional layers 1320 and 1321, respectively, which generate style scale coefficients (scale1, scale2) and style shift coefficients (shift1, shift2).
  • P1 is scaled with scale1 1330 and shifted with shift1 1340.
  • the next generator block receives the output 1350 (P0, P'1, P'2) as an input.
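The style-modulation step of a generator block, as described for Fig. 13, can be sketched as follows: the block output is split channel-wise into (P0, P1, P2) in proportion (0.5, 0.25, 0.25), per-channel scale and shift coefficients are derived from the feature-extraction outputs with 1x1 convolutions, and the modulated parts are concatenated again. Channel counts and the 1x1 kernels are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class StyleModulation(nn.Module):
    """Modulate parts of a generator-block output with features from two extraction networks."""
    def __init__(self, channels: int, feat1_ch: int, feat2_ch: int):
        super().__init__()
        self.part = (channels // 2, channels // 4, channels // 4)   # proportions 0.5/0.25/0.25
        # 1x1 convolutions producing scale and shift for P1 and P2, respectively.
        self.to_style1 = nn.Conv2d(feat1_ch, 2 * self.part[1], kernel_size=1)
        self.to_style2 = nn.Conv2d(feat2_ch, 2 * self.part[2], kernel_size=1)

    def forward(self, block_out, feat1, feat2):
        p0, p1, p2 = torch.split(block_out, self.part, dim=1)       # (P0, P1, P2)
        scale1, shift1 = torch.chunk(self.to_style1(feat1), 2, dim=1)
        scale2, shift2 = torch.chunk(self.to_style2(feat2), 2, dim=1)
        p1 = p1 * scale1 + shift1                                    # P'1
        p2 = p2 * scale2 + shift2                                    # P'2
        return torch.cat((p0, p1, p2), dim=1)                        # input to the next block

x = torch.randn(1, 64, 32, 32)                 # generator block output 1310
f1 = torch.randn(1, 128, 32, 32)               # output of the first feature extraction network 1311
f2 = torch.randn(1, 128, 32, 32)               # output of the second feature extraction network 1312
print(StyleModulation(64, 128, 128)(x, f1, f2).shape)  # torch.Size([1, 64, 32, 32])
```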
  • the present invention is not limited to two blocks in the generation neural network 1050.
  • the generation neural network 1050 may include M blocks, where M is an integer greater than or equal to two. Said M blocks include the first and the second block.
  • each q-th block of the M blocks, q being two to M, receives as input one or more of (i) an output of the (q-1)-th block, (ii) the output of the combining processing, and (iii) for each i-th neural network, a latent representation output (intermediate image features) of a q-th layer out of the Ki up-sampling layers in the i-th neural network.
  • Said latent representation outputs of each of the q-th layer in the i-th neural network are included in the i-th set of latent representations.
  • the output of each block of the generation neural network may be an input to a directly following block.
  • each block may receive as input a latent representation of a corresponding layer within the upsampling part of each of the N feature extraction networks.
  • a third block 1053 receives as input the latent representation output of the third layer 1025 of the first neural network 1020, the output of the third layer 1035 of the second neural network 1030 and possibly further outputs of third layers within the remaining neural networks out of the N neural networks (not shown in Fig. 10).
  • a corresponding subframe in each of the N representations may be obtained.
  • the processing of the i-th representation may comprise applying the i-th neural network to the corresponding subframe of the i-th representation.
  • the subframe may include a face crop 1440.
  • a face crop 1440 may be obtained from a representation 1410 by applying a face detection network 1420, which identifies a face within the representation 1430.
  • a face detection network is a neural network trained to recognize faces in image data.
  • a corresponding subframe in each of the N representations relates to a same part in the image data represented by each of the representations.
  • the corresponding subframe includes the face crop of a face in each of the representations.
  • the generating neural network is trained on a large dataset of faces to generate high-quality faces from input noise using input subframes having different resolutions.
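A sketch of obtaining a corresponding subframe (face crop) in each of the N representations is given below; the face detection network is replaced by a hypothetical bounding box in normalized coordinates, so that the same image region can be cropped from representations of different resolutions.

```python
import torch

def crop_subframe(representation: torch.Tensor, box_norm) -> torch.Tensor:
    """Crop the region given by a normalized (x0, y0, x1, y1) box from an (N, C, H, W) tensor."""
    _, _, h, w = representation.shape
    x0, y0, x1, y1 = box_norm
    return representation[:, :, int(y0 * h):int(y1 * h), int(x0 * w):int(x1 * w)]

# Hypothetical face box returned by a face detection network, in normalized coordinates.
face_box = (0.25, 0.25, 0.75, 0.75)

representations = [torch.randn(1, 3, 256, 256),   # first representation
                   torch.randn(1, 3, 128, 128)]   # second (downscaled) representation
face_crops = [crop_subframe(r, face_box) for r in representations]
print([c.shape[-2:] for c in face_crops])   # same image region: 128x128 and 64x64 crops
```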
  • the bitstream may include, in addition to one or more representations encoded in the bitstream, one or more parameters.
  • Said one or more parameters indicate target bitrates of one or more respective representations encoded in the bitstream. This is shown exemplarily in Fig. 15, where the input bitstream includes three encoded representations R1, R2 and R4, together with so-called split parameters B1, B2 and B4 indicating said target bitrates.
  • a configuration of at least one of the first neural network, the second neural network, and the generation neural network may be selected based on the one or more parameters.
  • a set of weights within at least one of the first neural network, the second neural network and the generation neural network is selected based on the one or more parameters.
  • each of the one or more parameters included in the bitstream indicates a target bitrate of a representation having a predefined resolution.
  • a mapping of a resolution to a corresponding parameter may be defined by a standard or may be defined at the encoding side and being transmitted to the decoding side.
  • an exemplary implementation is shown in Fig. 15, where three representations R1, R2 and R4 having resolutions of H x W, H/2 x W/2 and H/4 x W/4, respectively, may be encoded into the bitstream.
  • the split parameters B1, B2 and B4 indicate the target bitrates of the encoded representations. The value of such a target bitrate may be zero, i.e. the representation corresponding to said target bitrate is not included in the bitstream.
  • in the case that a target bitrate has a value greater than zero, the corresponding representation is decoded from the bitstream as explained above and all of the decoded representations are processed by the multi-scale reconstruction network including at least the first neural network, the second neural network and the generation network to obtain a reconstructed output video (reconstructed image data).
  • the split parameters may be used to select the configuration of the networks within the multiscale reconstruction network.
  • the number N of feature extraction networks depends on the number of split parameters, which are greater than zero.
  • the configuration of the generation neural network may depend on the number N of feature extraction networks.
  • the structure and/or weights of the two or more feature extraction networks may be selected according to the resolutions and/or bitrates of the representations encoded in the bitstream.
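A sketch of how a decoding side might derive the configuration of the multi-scale reconstruction network from the split parameters: one feature extraction network is instantiated per representation whose target bitrate is greater than zero, and a weight set is chosen per active resolution. The parameter names and the weight-lookup structure are hypothetical.

```python
from typing import Dict, List, Tuple

def select_configuration(split_params: Dict[str, float],
                         weight_sets: Dict[str, str]) -> Tuple[int, List[str]]:
    """Return the number N of feature extraction networks and the weight set per active scale."""
    active_scales = [scale for scale, bitrate in split_params.items() if bitrate > 0.0]
    n = len(active_scales)                          # number of representations to decode
    weights = [weight_sets[scale] for scale in active_scales]
    return n, weights

# Hypothetical split parameters (target bitrates in Mbit/s) and weight files per scale.
split_params = {"full": 2.0, "half": 0.5, "quarter": 0.0}
weight_sets = {"full": "extract_full.pt", "half": "extract_half.pt", "quarter": "extract_quarter.pt"}

print(select_configuration(split_params, weight_sets))
# (2, ['extract_full.pt', 'extract_half.pt']) -- the quarter-resolution branch is disabled
```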
  • a multi-scale reconstruction network as described above processes two or more representations of the same image data. Said representations may be decoded from a bitstream.
  • a set of parameters is obtained S910 based on a cost function.
  • the parameters (split parameters) within the set of parameters indicate target bitrates.
  • two or more representations of the image data are obtained S920.
  • the two or more representations may represent different resolutions of the image data.
  • a first representation 1110 out of the two or more representations represents an original resolution of the image data and a second representation 1111 out of the two or more representations represents a downscaled resolution of the image data.
  • a downscaling may be performed by down sampling (i.e. leaving out a plurality of pixels of the first representation), by applying a filtering, which results in a lower resolution, or the like.
  • the first representation 1110 has a resolution of H1 x W1, wherein H1 represents the height of the first representation of the image data and W1 represents the width of the first representation of the image data.
  • the resolution of the second representation 1111 is H2 x W2, wherein the height H2 and the width W2 of the second representation 1111 may be smaller than the height and the width of the first representation 1110: H2 < H1 and W2 < W1.
  • the resolution of the second representation 1111 may be, for example, H1/2 x W1/2.
  • the present invention is not limited to two representations of the same image data. Any plurality of representations of the same image data including at least the first and the second representation may be used.
  • a third representation 1112 included in the plurality of representations may be obtained according to the set of parameters.
  • the third representation 1112 has a resolution H3 x W3, which may be smaller than the resolution of the first and the second representation.
  • the first representation 1110 may have a resolution of H1 x W1,
  • the second representation 1111 may have a resolution of H1/2 x W1/2,
  • the third representation 1112 may have a resolution of H1/4 x W1/4.
  • a first parameter out of the set of parameters may indicate a target bitrate for the first representation 1110 representing an original resolution of the image data.
  • a second parameter out of the set of parameters may indicate a target bitrate for the second representation 1111 representing a downscaled resolution of the image data.
  • the target bitrates indicated by the parameters within the set of parameters for the respective representations may add up to a target bitrate of the bitstream.
  • the values of the parameters (B1, B2, B3) may correspond to the target bitrates for a first, second and third representation, respectively.
  • the values of the parameters may be represented by 32-bit floating-point numbers.
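If the parameters are represented as 32-bit floating-point numbers, they can, for example, be serialized into and parsed from the bitstream as follows; the little-endian layout is an assumption of this sketch.

```python
import struct

split_params = (2.0, 0.5, 0.25)                    # target bitrates B1, B2, B3 in Mbit/s

payload = struct.pack("<3f", *split_params)        # 12 bytes: three 32-bit floats, little-endian
print(len(payload))                                # 12
print(struct.unpack("<3f", payload))               # (2.0, 0.5, 0.25)
```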
  • the two or more representations are encoded S930 according to the set of parameters.
  • an encoder used in the encoding may select a quantization parameter for the encoding of a representation to obtain a target bitrate corresponding to said representation.
  • the quantization in encoding is explained in detail above for the hybrid block-based encoder and for the autoencoder. Said target bitrate is indicated by a parameter within the set of parameters.
  • each representation may be encoded into a respective subbitstream, thereby generating two or more subbitstreams.
  • Each of the subbitstreams may be included into a bitstream.
  • the encoding of the respective subbitstreams may be performed by a single encoder or by multiple encoders.
  • the set of parameters may be included into the bitstream.
  • each representation 1110, 1111 and 1112 is encoded by a respective encoder 1121, 1122 and 1123.
  • Said respective encoders may be multiple instances of a same encoder or may be different encoders. These encoders may encode the two or more representations in parallel.
  • a second exemplary implementation shown in Fig. 11b uses a single encoder 1120 to encode the two or more representations into the bitstream 1131.
  • Any above-mentioned encoding may be performed by an encoder of an autoencoding neural network, which is explained in detail in section Autoencoders and unsupervised learning, or by a hybrid block-based encoder as explained in detail in section Hybrid block-based encoder and decoder.
  • An encoder used in the encoding may select a quantization parameter (QP) for the encoding of a representation to obtain a target bitrate indicated by the parameter within the set of parameters corresponding to said representation.
  • QPs for the downscaled representations and the original representation are chosen such that the total bitrate of the downscaled representations and the original representation is equal to the target bitrate of the bitstream. For example, a QP is selected based on prior knowledge of how the QP affects the bitrate.
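A sketch of selecting a QP from prior knowledge of how the QP affects the bitrate: the prior knowledge is modeled here as a hypothetical measured table of average bitrates per QP, and the smallest QP whose expected bitrate does not exceed the target is chosen. The numbers are illustrative only.

```python
# Hypothetical prior knowledge: measured average bitrate (Mbit/s) per QP for one resolution.
RATE_TABLE = {22: 6.0, 27: 3.0, 32: 1.5, 37: 0.8, 42: 0.4}

def select_qp(target_bitrate: float, rate_table=RATE_TABLE) -> int:
    """Pick the smallest QP whose expected bitrate stays at or below the target bitrate."""
    for qp in sorted(rate_table):                  # lower QP -> higher quality and bitrate
        if rate_table[qp] <= target_bitrate:
            return qp
    return max(rate_table)                         # fall back to the coarsest quantization

print(select_qp(1.0))   # 37 -> expected 0.8 Mbit/s <= 1.0 Mbit/s target
```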
  • Fig. 16 shows an exemplary implementation of encoding representations based on split parameters included in the set of parameters.
  • the values of the split parameters (B₁, B₂, B₄) indicate the respective target bitrates of a first representation V with resolution H × W, a second representation V₂ having a resolution H/2 × W/2 and a third representation V₄ having resolution H/4 × W/4. Said target bitrates add up to the target bitrate of the bitstream.
  • the first representation V is encoded with target bitrate B₁, if B₁ is greater than zero. If B₂ is greater than zero, the second representation V₂ is obtained by downscaling of the first representation.
  • the second representation V₂ is encoded with target bitrate B₂.
  • the third representation V₄ is obtained by downscaling of the first representation.
  • the third representation V₄ is encoded with target bitrate B₄.
  • a bitstream is generated including the first encoded representation R₁, the second encoded representation R₂, the third encoded representation R₄ and the split parameters. A minimal sketch of this split-parameter encoding flow is given after this list.
  • the above-mentioned cost function is based on reconstructed image data.
  • Said reconstructed image data is obtained by processing two or more decoded representations of the same image data.
  • the set of parameters is chosen by optimizing (i.e. minimizing or maximizing) the cost function.
  • the obtaining may include the processing by a multi-scale reconstruction network as described in detail above. For each current set of parameters, quality of the reconstructed image data is determined.
  • Such a determination may include applying a video metric to the reconstructed image data.
  • a video metric may be an objective metric of reconstructed videos, for example, peak signal-to-noise ratio (PSNR), multi-scale structural similarity index (MS-SSIM), or the like.
  • a set of parameters is selected out of the predefined plurality of sets based on the cost function including the determined quality. For example, an optimizing of the cost function may select the set of parameters leading to the highest quality of reconstructed image data within the plurality of sets of parameters.
  • the plurality of the sets of parameters may include one or more predefined options for the target bitrates of the respective representations.
  • a set of parameters may be obtained by processing the image data by one or more layers of a neural network to output the set of parameters.
  • Said neural network is trained by optimizing the cost function.
  • a neural network 1740 may receive as input one or more of (i) the image data 1710, (ii) a target bitrate of the bitstream, (iii) an indication of complexity 1720 of the image data, and (iv) an indication of motion intensity 1730 of the image data. Based on one or more of these inputs, the neural network 1740 outputs the set of parameters 1750.
  • the two or more representations are encoded into the bitstream based on the selected or obtained set of parameters.
  • FIGs. 11a-c provide exemplary implementations of the encoding and/or decoding as described above.
  • a first representation 1110 of the image data, a second (downscaled) representation 1111 of the same image data and a third (further downscaled) representation 1112 of the same image data are encoded by respective encoders 1121, 1122 and 1123 into a bitstream 1130.
  • the encoders may be multiple instances of a same encoder or may be different encoders.
  • the encoders may perform the encoding in parallel or sequentially.
  • the bitstream 1130 is decoded using multiple decoders 1141, 1142 and 1143.
  • Said multiple decoders may be multiple instances of a same decoder or may be different decoders.
  • Said representations are input to a multi-scale reconstruction neural network (MSNN) 1160, which includes at least the first feature extraction network, the second feature extraction network and the generation neural network as subnetworks.
  • the MSNN 1160 outputs the reconstructed image data 1170.
  • a first representation 1110 of the image data, a second (downscaled) representation 1111 of the same image data and a third (further downscaled) representation 1112 of the same image data are encoded by an encoder 1120 into a bitstream 1130.
  • Said bitstream 1130 is decoded by a decoder 1140.
  • Said representations are input to a multi-scale reconstruction network (MSNN) 1160, which includes at least the first feature extraction network, the second feature extraction network and the generation neural network as subnetworks.
  • the MSNN 1160 outputs the reconstructed image data 1170.
  • a first representation 1110 of the image data is encoded by an encoder 1120 into a bitstream 1130.
  • Said bitstream 1130 is decoded by a decoder 1140 to obtain the first decoded representation 1150.
  • the first decoded representation is downscaled to obtain a second representation 1151 and a third representation 1152.
  • Said representations are input to a multi-scale reconstruction network (MSNN) 1160, which includes at least the first feature extraction network, the second feature extraction network and the generation neural network as subnetworks.
  • the MSNN 1160 outputs the reconstructed image data 1170.
  • the present invention is not limited to these exemplary implementations.
  • the encoding and the decoding may be performed and combined as illustrated in Figs. 11a-c. However, any of the above described encodings and any of the above described decodings may be combined with each other or may be combined differently.
  • any of the encoding devices described in the following with references to Figs. 18-21 may provide means in order to carry out the entropy encoding of image data into a bitstream.
  • a processing circuitry within any of these exemplary devices is configured to obtain a set of parameters based on a cost function, said parameters indicating target bitrates, to obtain two or more representations of the image data based on the set of parameters, and to encode the two or more representations into a bitstream according to the set of parameters, wherein the cost function is based on reconstructed image data obtained by processing two or more decoded representations of the same image data, according to the method described above.
  • the decoding devices in any of Figs. 18-21 may contain a processing circuitry, which is adapted to perform the decoding method.
  • the processing circuitry within any of these exemplary devices is configured to decode a first representation of the image data from the bitstream, obtain a second representation of the image data, process the first representation by applying a first neural network, including generating a first latent representation and a second latent representation, process the second representation by applying a second neural network, including generating a third latent representation and a fourth latent representation, perform combining processing including combining of the first latent representation and the third latent representation, and obtain reconstructed image data including applying a generation neural network to an output of the combining processing, the second latent representation and the fourth latent representation, according to the method described above.
  • Fig. 18 is a schematic block diagram illustrating an example coding system, e.g. a video, image, audio, and/or other coding system (or short coding system) that may utilize techniques of this present application.
  • Video encoder 20 (or short encoder 20) and video decoder 30 (or short decoder 30) of video coding system 10 represent examples of devices that may be configured to perform techniques in accordance with various examples described in the present application.
  • the video coding and decoding may employ a neural network, which may be distributed and which may apply the above-mentioned bitstream parsing and/or bitstream generation to convey feature maps between the distributed computation nodes (two or more).
  • the coding system 10 comprises a source device 12 configured to provide encoded picture data 21 e.g. to a destination device 14 for decoding the encoded picture data 13.
  • the source device 12 comprises an encoder 20, and may additionally, i.e. optionally, comprise a picture source 16, a pre-processor (or pre-processing unit) 18, e.g. a picture pre-processor 18, and a communication interface or communication unit 22.
  • the picture source 16 may comprise or be any kind of picture capturing device, for example a camera for capturing a real-world picture, and/or any kind of a picture generating device, for example a computer-graphics processor for generating a computer animated picture, or any kind of other device for obtaining and/or providing a real-world picture, a computer generated picture (e.g. a screen content, a virtual reality (VR) picture) and/or any combination thereof (e.g. an augmented reality (AR) picture).
  • the picture source may be any kind of memory or storage storing any of the aforementioned pictures.
  • the picture or picture data 17 may also be referred to as raw picture or raw picture data 17.
  • Pre-processor 18 is configured to receive the (raw) picture data 17 and to perform preprocessing on the picture data 17 to obtain a pre-processed picture 19 or pre-processed picture data 19.
  • Pre-processing performed by the pre-processor 18 may, e.g., comprise trimming, color format conversion (e.g. from RGB to YCbCr), color correction, or de-noising.
  • the pre-processing unit 18 may be an optional component.
  • the preprocessing may also employ a neural network (such as in any of Figs. 1 to 7) which uses the presence indicator signaling.
  • the video encoder 20 is configured to receive the pre-processed picture data 19 and provide encoded picture data 21.
  • Communication interface 22 of the source device 12 may be configured to receive the encoded picture data 21 and to transmit the encoded picture data 21 (or any further processed version thereof) over communication channel 13 to another device, e.g. the destination device 14 or any other device, for storage or direct reconstruction.
  • the destination device 14 comprises a decoder 30 (e.g. a video decoder 30), and may additionally, i.e. optionally, comprise a communication interface or communication unit 28, a post-processor 32 (or post-processing unit 32) and a display device 34.
  • the communication interface 28 of the destination device 14 is configured to receive the encoded picture data 21 (or any further processed version thereof), e.g. directly from the source device 12 or from any other source, e.g. a storage device, e.g. an encoded picture data storage device, and provide the encoded picture data 21 to the decoder 30.
  • the communication interface 22 and the communication interface 28 may be configured to transmit or receive the encoded picture data 21 or encoded data 13 via a direct communication link between the source device 12 and the destination device 14, e.g. a direct wired or wireless connection, or via any kind of network, e.g. a wired or wireless network or any combination thereof, or any kind of private and public network, or any kind of combination thereof.
  • the communication interface 22 may be, e.g., configured to package the encoded picture data 21 into an appropriate format, e.g. packets, and/or process the encoded picture data using any kind of transmission encoding or processing for transmission over a communication link or communication network.
  • the communication interface 28, forming the counterpart of the communication interface 22, may be, e.g., configured to receive the transmitted data and process the transmission data using any kind of corresponding transmission decoding or processing and/or de-packaging to obtain the encoded picture data 21.
  • Both, communication interface 22 and communication interface 28 may be configured as unidirectional communication interfaces as indicated by the arrow for the communication channel 13 in Fig. 18 pointing from the source device 12 to the destination device 14, or bidirectional communication interfaces, and may be configured, e.g. to send and receive messages, e.g. to set up a connection, to acknowledge and exchange any other information related to the communication link and/or data transmission, e.g. encoded picture data transmission.
  • the decoder 30 is configured to receive the encoded picture data 21 and provide decoded picture data 31 or a decoded picture 31.
  • the post-processor 32 of destination device 14 is configured to post-process the decoded picture data 31 (also called reconstructed picture data), e.g. the decoded picture 31 , to obtain post-processed picture data 33, e.g. a post-processed picture 33.
  • the post-processing performed by the post-processing unit 32 may comprise, e.g. color format conversion (e.g. from YCbCr to RGB), color correction, trimming, or re-sampling, or any other processing, e.g. for preparing the decoded picture data 31 for display, e.g. by display device 34.
  • the display device 34 of the destination device 14 is configured to receive the post-processed picture data 33 for displaying the picture, e.g. to a user or viewer.
  • the display device 34 may be or comprise any kind of display for representing the reconstructed picture, e.g. an integrated or external display or monitor.
  • the displays may, e.g. comprise liquid crystal displays (LCD), organic light emitting diodes (OLED) displays, plasma displays, projectors, micro LED displays, liquid crystal on silicon (LCoS), digital light processor (DLP) or any kind of other display.
  • Although Fig. 18 depicts the source device 12 and the destination device 14 as separate devices, embodiments of devices may also comprise both devices or both functionalities, i.e. the source device 12 or corresponding functionality and the destination device 14 or corresponding functionality. In such embodiments the source device 12 or corresponding functionality and the destination device 14 or corresponding functionality may be implemented using the same hardware and/or software or by separate hardware and/or software or any combination thereof.
  • the encoder 20 e.g. a video encoder 20
  • the decoder 30 e.g. a video decoder 30
  • both encoder 20 and decoder 30 may be implemented via processing circuitry, such as one or more microprocessors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), discrete logic, hardware, dedicated video coding hardware, or any combinations thereof.
  • the encoder 20 may be implemented via processing circuitry 46 to embody the various modules including the neural network or its parts.
  • the decoder 30 may be implemented via processing circuitry 46 to embody any coding system or subsystem described herein.
  • the processing circuitry may be configured to perform the various operations as discussed later.
  • a device may store instructions for the software in a suitable, non-transitory computer-readable storage medium and may execute the instructions in hardware using one or more processors to perform the techniques of this disclosure.
  • Either of video encoder 20 and video decoder 30 may be integrated as part of a combined encoder/decoder (CODEC) in a single device, for example, as shown in Fig. 19.
  • Source device 12 and destination device 14 may comprise any of a wide range of devices, including any kind of handheld or stationary devices, e.g. notebook or laptop computers, mobile phones, smart phones, tablets or tablet computers, cameras, desktop computers, set-top boxes, televisions, display devices, digital media players, video gaming consoles, video streaming devices (such as content services servers or content delivery servers), broadcast receiver device, broadcast transmitter device, or the like and may use no or any kind of operating system.
  • the source device 12 and the destination device 14 may be equipped for wireless communication.
  • the source device 12 and the destination device 14 may be wireless communication devices.
  • video coding system 10 illustrated in Fig. 18 is merely an example and the techniques of the present application may apply to video coding settings (e.g., video encoding or video decoding) that do not necessarily include any data communication between the encoding and decoding devices.
  • data is retrieved from a local memory, streamed over a network, or the like.
  • a video encoding device may encode and store data to memory, and/or a video decoding device may retrieve and decode data from memory.
  • the encoding and decoding is performed by devices that do not communicate with one another, but simply encode data to memory and/or retrieve and decode data from memory.
  • Fig. 20 is a schematic diagram of a video coding device 8000 according to an embodiment of the disclosure.
  • the video coding device 8000 is suitable for implementing the disclosed embodiments as described herein.
  • the video coding device 8000 may be a decoder such as video decoder 30 of Fig. 18 or an encoder such as video encoder 20 of Fig. 18.
  • the video coding device 8000 comprises ingress ports 8010 (or input ports 8010) and receiver units (Rx) 8020 for receiving data; a processor, logic unit, or central processing unit (CPU) 8030 to process the data; transmitter units (Tx) 8040 and egress ports 8050 (or output ports 8050) for transmitting the data; and a memory 8060 for storing the data.
  • the video coding device 8000 may also comprise optical-to-electrical (OE) components and electrical-to-optical (EO) components coupled to the ingress ports 8010, the receiver units 8020, the transmitter units 8040, and the egress ports 8050 for egress or ingress of optical or electrical signals.
  • the processor 8030 is implemented by hardware and software.
  • the processor 8030 may be implemented as one or more CPU chips, cores (e.g., as a multi-core processor), FPGAs, ASICs, and DSPs.
  • the processor 8030 is in communication with the ingress ports 8010, receiver units 8020, transmitter units 8040, egress ports 8050, and memory 8060.
  • the processor 8030 comprises a neural network based codec 8070.
  • the neural network based codec 8070 implements the disclosed embodiments described above. For instance, the neural network based codec 8070 implements, processes, prepares, or provides the various coding operations.
  • the inclusion of the neural network based codec 8070 therefore provides a substantial improvement to the functionality of the video coding device 8000 and effects a transformation of the video coding device 8000 to a different state.
  • the neural network based codec 8070 is implemented as instructions stored in the memory 8060 and executed by the processor 8030.
  • the memory 8060 may comprise one or more disks, tape drives, and solid-state drives and may be used as an over-flow data storage device, to store programs when such programs are selected for execution, and to store instructions and data that are read during program execution.
  • the memory 8060 may be, for example, volatile and/or non-volatile and may be a read-only memory (ROM), random access memory (RAM), ternary content-addressable memory (TCAM), and/or static random-access memory (SRAM).
  • Fig. 21 is a simplified block diagram of an apparatus that may be used as either or both of the source device 12 and the destination device 14 from Fig. 18 according to an exemplary embodiment.
  • a processor 9002 in the apparatus 9000 can be a central processing unit.
  • the processor 9002 can be any other type of device, or multiple devices, capable of manipulating or processing information now-existing or hereafter developed.
  • Although the disclosed implementations can be practiced with a single processor as shown, e.g. the processor 9002, advantages in speed and efficiency can be achieved using more than one processor.
  • a memory 9004 in the apparatus 9000 can be a read only memory (ROM) device or a random access memory (RAM) device in an implementation. Any other suitable type of storage device can be used as the memory 9004.
  • the memory 9004 can include code and data 9006 that is accessed by the processor 9002 using a bus 9012.
  • the memory 9004 can further include an operating system 9008 and application programs 9010, the application programs 9010 including at least one program that permits the processor 9002 to perform the methods described here.
  • the application programs 9010 can include applications 1 through N, which further include a video coding application that performs the methods described here.
  • the apparatus 9000 can also include one or more output devices, such as a display 9018.
  • the display 9018 may be, in one example, a touch sensitive display that combines a display with a touch sensitive element that is operable to sense touch inputs.
  • the display 9018 can be coupled to the processor 9002 via the bus 9012.
  • the bus 9012 of the apparatus 9000 can be composed of multiple buses.
  • a secondary storage can be directly coupled to the other components of the apparatus 9000 or can be accessed via a network and can comprise a single integrated unit such as a memory card or multiple units such as multiple memory cards.
  • the apparatus 9000 can thus be implemented in a wide variety of configurations.
  • Feature extraction networks generate latent representations for each of the representations.
  • the latent representations are processed by a generation neural network to obtain reconstructed image data.
  • the representations are encoded into a bitstream according to a set of split parameters indicating target bitrates.
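The split-parameter encoding flow referenced above (Fig. 16) can be summarized in a short sketch. The following Python snippet is only an illustration under assumptions: the helpers downscale, select_qp_for_bitrate and encode_with_qp are hypothetical placeholders for the downscaling, rate-control and encoder stages described above, and the simple QP-to-bitrate rule stands in for any prior knowledge of how the QP affects the bitrate.

```python
# Minimal sketch under assumptions: all helper functions are hypothetical
# placeholders; any encoder honouring a QP could be substituted.

def downscale(frame, factor):
    """Hypothetical spatial downscaling of a frame (list of pixel rows) by an integer factor."""
    h, w = len(frame), len(frame[0])
    return [[frame[y * factor][x * factor] for x in range(w // factor)]
            for y in range(h // factor)]

def select_qp_for_bitrate(target_bitrate):
    """Hypothetical rate model: a larger QP yields a lower bitrate."""
    return max(0, min(51, int(51 - target_bitrate / 100_000)))

def encode_with_qp(frame, qp):
    """Stands in for a hybrid block-based or autoencoder-based encoder."""
    return bytes([qp])  # placeholder sub-bitstream payload

def encode_split(frame, split_params):
    """split_params = (B1, B2, B4): target bitrates for V (H x W), V2 (H/2 x W/2)
    and V4 (H/4 x W/4); a target bitrate of zero skips the corresponding representation."""
    b1, b2, b4 = split_params
    subbitstreams = []
    if b1 > 0:
        subbitstreams.append(encode_with_qp(frame, select_qp_for_bitrate(b1)))
    if b2 > 0:
        subbitstreams.append(encode_with_qp(downscale(frame, 2),
                                            select_qp_for_bitrate(b2)))
    if b4 > 0:
        subbitstreams.append(encode_with_qp(downscale(frame, 4),
                                            select_qp_for_bitrate(b4)))
    # The split parameters themselves are included in the bitstream.
    header = b"".join(int(b).to_bytes(4, "big") for b in split_params)
    return header + b"".join(subbitstreams)
```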

Abstract

Methods and apparatuses are described for encoding and decoding of image data, which includes processing two or more representations of the image data by a multi-scale reconstruction network. Feature extraction networks generate latent representations for each of the representations. The latent representations are processed by a generation neural network to obtain reconstructed image data. The representations are encoded into a bitstream according to a set of split parameters indicating target bitrates.

Description

Method of video coding by multi-modal processing
The present disclosure relates to encoding a signal into a bitstream and decoding a signal from a bitstream. In particular, the present disclosure relates to encoding and decoding of image data including processing by a neural network.
BACKGROUND
Hybrid image and video codecs have been used for decades to compress image and video data. In such codecs, the signal is typically encoded block-wise by predicting a block and by further coding only the difference between the original block and its prediction. In particular, such coding may include transformation, quantization and generating the bitstream, usually including some entropy coding. Typically, the three components of hybrid coding methods - transformation, quantization, and entropy coding - are separately optimized. Modern video compression standards like High-Efficiency Video Coding (HEVC), Versatile Video Coding (VVC) and Essential Video Coding (EVC) also use a transformed representation to code the residual signal after prediction.
Recently, neural network architectures have been applied to image and/or video coding. In general, these neural network (NN) based approaches can be applied in various different ways to the image and video coding. For example, some end-to-end optimized image or video coding frameworks have been discussed. Moreover, deep learning has been used to determine or optimize some parts of the end-to-end coding framework such as selection or compression of prediction parameters or the like. Besides, some neural network based approaches have also been discussed for usage in hybrid image and video coding frameworks, e.g. for implementation as a trained deep learning model for intra or inter prediction in image or video coding.
The end-to-end optimized image or video coding applications discussed above have in common that they produce some feature map data, which is to be conveyed between encoder and decoder.
Neural networks are machine learning models that employ one or more layers of nonlinear units based on which they can predict an output for a received input. Some neural networks include one or more hidden layers in addition to an output layer. A corresponding feature map may be provided as an output of each hidden layer. Such corresponding feature map of each hidden layer may be used as an input to a subsequent layer in the network, i.e., a subsequent hidden layer or the output layer. Each layer of the network generates an output from a received input in accordance with current values of a respective set of parameters. In a neural network that is split between devices, e.g. between encoder and decoder, a device and a cloud or between different devices, a feature map at the output of the place of splitting (e.g. a first device) is compressed and transmitted to the remaining layers of the neural network (e.g. to a second device).
Further improvement of encoding and decoding using trained network architectures may be desirable.
SUMMARY
The embodiments of the present disclosure provide apparatuses and methods for encoding and decoding of image data including processing by a neural network.
The embodiments of the invention are defined by the features of the independent claims, and further advantageous implementations of the embodiments by the features of the dependent claims.
According to an embodiment a method is provided for decoding image data from a bitstream, comprising: decoding a first representation of the image data from the bitstream; obtaining a second representation of the image data; processing the first representation by applying a first neural network, including generating a first latent representation and a second latent representation; processing the second representation by applying a second neural network, including generating a third latent representation and a fourth latent representation; combining processing including combining of the first latent representation and the third latent representation; obtaining reconstructed image data including applying a generation neural network to an output of the combining processing, the second latent representation and the fourth latent representation.
The processing of data decoded from a bitstream including a multi-scale enhancement (or reconstruction) neural network, which includes at least the first neural network, the second neural network and the generation neural network, improves the quality of the reconstructed image data.
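To make the data flow of this decoding method more concrete, the following PyTorch sketch shows one possible arrangement under assumptions: the class names (FeatureExtractor, MultiScaleReconstruction), layer counts and channel widths are illustrative and not taken from the disclosure; only the flow of the four latent representations, the combining processing and the generation stage follows the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureExtractor(nn.Module):
    """Hypothetical feature extraction network producing two latent representations."""
    def __init__(self, ch: int = 32):
        super().__init__()
        self.down = nn.Conv2d(3, ch, 3, stride=2, padding=1)            # downsampling layer
        self.bottleneck = nn.Conv2d(ch, ch, 3, padding=1)               # -> first/third latent
        self.up = nn.ConvTranspose2d(ch, ch, 4, stride=2, padding=1)    # -> second/fourth latent

    def forward(self, x):
        latent_a = self.bottleneck(F.relu(self.down(x)))
        latent_b = F.relu(self.up(latent_a))
        return latent_a, latent_b

class MultiScaleReconstruction(nn.Module):
    """Sketch of the multi-scale reconstruction data flow for two representations."""
    def __init__(self, ch: int = 32):
        super().__init__()
        self.first_net, self.second_net = FeatureExtractor(ch), FeatureExtractor(ch)
        self.combine = nn.Conv2d(2 * ch, ch, 1)                           # combining processing
        self.gen_up = nn.ConvTranspose2d(ch, ch, 4, stride=2, padding=1)  # generation network
        self.gen_out = nn.Conv2d(3 * ch, 3, 3, padding=1)

    def forward(self, first_rep, second_rep):
        l1, l2 = self.first_net(first_rep)    # first and second latent representations
        l3, l4 = self.second_net(second_rep)  # third and fourth latent representations
        combined = self.combine(torch.cat([l1, l3], dim=1))
        x = F.relu(self.gen_up(combined))
        return self.gen_out(torch.cat([x, l2, l4], dim=1))

# Usage: both representations are assumed to have been brought to the same resolution first.
reconstructed = MultiScaleReconstruction()(torch.randn(1, 3, 64, 64),
                                           torch.randn(1, 3, 64, 64))
```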
In an exemplary implementation, the obtaining of the second representation comprises decoding the second representation from the bitstream, the second representation having a smaller resolution and/or a different bitrate in the bitstream than the first representation. A second representation having a different resolution and/or a different bitrate may provide additional (different) features to the generation neural network, when being processed by the second neural network.
For example, the decoding comprises: splitting the bitstream into at least two subbitstreams; decoding the first representation and the second representation from a respective subbitstream out of the at least two subbitstreams.
Such an independent decoding of two separate substreams provides a prerequisite for parallelization.
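One simple way to realize such a split into subbitstreams is a length-prefixed framing; the 4-byte length prefix below is an assumption for illustration, as the disclosure does not mandate a particular bitstream syntax.

```python
def split_bitstream(bitstream: bytes):
    """Assumed framing: each subbitstream is preceded by a 4-byte big-endian length.
    Returns the list of subbitstreams, which can then be decoded independently
    (and in parallel) into the first and second representations."""
    subbitstreams, pos = [], 0
    while pos < len(bitstream):
        length = int.from_bytes(bitstream[pos:pos + 4], "big")
        pos += 4
        subbitstreams.append(bitstream[pos:pos + length])
        pos += length
    return subbitstreams
```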
In an exemplary implementation, the decoding comprises: decoding the bitstream; splitting the decoded bitstream into at least the first representation and the second representation.
A joint bitstream including the first representation and the second representation enables an exploitation of redundant information in the encoding and thus may result in a lower bitrate of the bitstream.
For example, the decoding is performed including one of a decoder of an autoencoding neural network, or a hybrid block-based decoder.
The method is readily applied to both a decoder of an autoencoding neural network and a hybrid block-based decoder.
In an exemplary implementation, the obtaining of a second representation comprises generating the second representation by downscaling of the first representation.
A second representation (i.e. an additional representation) obtained from the decoded first representation may enhance the reconstructed image data by providing additional hidden features through the processing by the corresponding neural network without increasing the bitrate of the bitstream or decreasing one or more bitrates of the representations encoded in the bitstream.
For example, the method is further comprising decoding one or more parameters from the bitstream, said one or more parameters indicating target bitrates of one or more respective representations encoded in the bitstream, wherein a configuration of at least one of the first neural network, the second neural network, and the generation neural network is selected based on the one or more parameters. Including the parameters in the bitstream facilitates a more efficient processing by selecting an appropriate configuration of the multi-scale reconstruction network including at least the first neural network, the second neural network and the generation neural network.
In an exemplary implementation, a set of weights within at least one of the first neural network, the second neural network and the generation neural network is selected based on the one or more parameters.
Selecting set of weights for one or more of the neural networks enables a more refined obtaining of the reconstructed image data by considering different configurations of input representations.
For example, the method is further comprising performing an interpolation on the second representation to obtain a resolution of the first representation, before processing the second representation by applying the second neural network.
A same input resolution enables a more efficient processing by the neural networks.
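As a minimal sketch of the two steps above, assuming bilinear interpolation (the disclosure does not fix a particular filter), the second representation may be obtained by downscaling the decoded first representation and then interpolated back to the resolution of the first representation before the second neural network is applied:

```python
import torch
import torch.nn.functional as F

decoded_first = torch.randn(1, 3, 128, 128)   # first (decoded) representation, H1 x W1

# Obtain the second representation by downscaling the first one (no additional bits needed).
second = F.interpolate(decoded_first, scale_factor=0.5, mode="bilinear",
                       align_corners=False)   # H1/2 x W1/2

# Interpolate back to the resolution of the first representation before it is
# fed to the second feature extraction network.
second_upsampled = F.interpolate(second, size=decoded_first.shape[-2:],
                                 mode="bilinear", align_corners=False)
```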
In an exemplary implementation, the combining processing includes at least one of concatenating the first latent representation and the third latent representation; processing by a fully-connected neural network.
A concatenation provides an efficient implementation of a combining and a processing by a fully connected neural network enables the extraction of hidden features for further processing, thus enabling an improved reconstruction of image data.
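A small sketch of the combining processing, assuming channel-wise concatenation followed by a position-wise fully-connected mapping (realized here as a 1x1 convolution, which applies the same fully-connected layer at every spatial position):

```python
import torch
import torch.nn as nn

first_latent = torch.randn(1, 32, 16, 16)
third_latent = torch.randn(1, 32, 16, 16)

# Concatenation along the channel dimension ...
combined = torch.cat([first_latent, third_latent], dim=1)   # (1, 64, 16, 16)

# ... optionally followed by a fully-connected mixing layer applied per position.
fc_mix = nn.Conv2d(64, 32, kernel_size=1)
output = fc_mix(combined)                                    # (1, 32, 16, 16)
```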
For example, the applying of the first neural network includes generating a first set of latent representations, said first set including the second latent representation; the applying of the second neural network includes generating a second set of latent representations, said second set including the fourth latent representation; and the generation neural network receives the first set and the second set as additional input.
Considering multiple latent representations being output by the first and second neural network provides additional hidden features for the processing by the generation neural network thus enabling an improved reconstruction of the image data.
In an exemplary implementation, the method is further comprising: obtaining N representations, N being an integer greater or equal to two, the N representations including the first representation and the second representation; for each i-th representation out of the N representations, i being one to N, processing the i-th representation by applying a i-th neural network, including generating a (2i-1)-th latent representation and a (2i)-th latent representation; combining processing including combining of each of the (2i-1)-th latent representations; obtaining reconstructed image data including applying a generation neural network to an output of the combining processing and each of the (2i)-th latent representations.
Extending the method to process a plurality of representations enables a more refined reconstruction of image data by providing additional input to the generation neural network.
For example, the obtaining of N representations further comprises; decoding a first representation out of the N representations from the bitstream; obtaining a j-th representation out of the N representations, j being two to N, by either decoding said j-th representation from the bitstream or by downscaling of the first representation.
This enables the obtaining of additional representations by decoding from the bitstream and/or downscaling. These additional representations provide further input to the generation neural network.
In an exemplary implementation, each of the i-th neural networks includes Ki downsampling layers, a bottleneck layer, and Ki upsampling layers, Ki being an integer greater than zero; wherein the bottleneck layer outputs the (2i-1)-th latent representation and a first layer out of the up-sampling layers following the bottleneck layer outputs the (2i)-th latent representation.
Such a configuration may provide an efficient implementation of a (feature extraction) neural network.
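The following sketch illustrates such a feature extraction network for Ki = 2; the channel width and kernel sizes are assumptions, and only the structure (Ki downsampling layers, a bottleneck layer outputting the (2i-1)-th latent representation, and Ki upsampling layers outputting the further latent representations) follows the description above.

```python
import torch
import torch.nn as nn

class ExtractionNetwork(nn.Module):
    """Sketch of the i-th feature extraction network: K downsampling layers,
    a bottleneck layer, and K upsampling layers (K and channel widths assumed)."""
    def __init__(self, k: int = 2, ch: int = 32):
        super().__init__()
        self.down = nn.ModuleList(
            [nn.Conv2d(3 if j == 0 else ch, ch, 3, stride=2, padding=1) for j in range(k)])
        self.bottleneck = nn.Conv2d(ch, ch, 3, padding=1)
        self.up = nn.ModuleList(
            [nn.ConvTranspose2d(ch, ch, 4, stride=2, padding=1) for _ in range(k)])

    def forward(self, x):
        for layer in self.down:
            x = torch.relu(layer(x))
        latents = [self.bottleneck(x)]      # (2i-1)-th latent: output of the bottleneck layer
        x = latents[0]
        for layer in self.up:
            x = torch.relu(layer(x))
            latents.append(x)               # (2i)-th latent from the first upsampling layer;
                                            # further outputs feed later generation blocks
        return latents
```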
For example, a first block within the generation neural network receives as input the output of the combining processing and each of the (2i)-th latent representations; and a second block within the generation neural network receives as input one or more of an output of the first block; the output of the combining processing; and for each i-th neural network, a latent representation output of a second layer out of the Ki up-sampling layers in the i-th neural network, the second layer following the first layer within the Ki up-sampling layers.
Such a configuration may provide an efficient implementation of a generation neural network.
In an exemplary implementation, the generation neural network includes M blocks, said M blocks including the first block and the second block, M being an integer greater than or equal to two; and each q-th block of the M blocks, q being two to M, receives as input one or more of an output of the (q-1)-th block; the output of the combining processing; and for each i-th neural network, a latent representation output of a q-th layer out of the Ki up-sampling layers in the i-th neural network. In such a configuration of the generation neural network, an increased number of blocks enables the processing of additional latent representations generated by the at least two feature extraction networks, thus enabling a more refined reconstruction of image data by providing additional input to the generation neural network.
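A possible arrangement of such a generation neural network with M blocks is sketched below; the channel widths, kernel sizes and use of transposed convolutions are assumptions. Each block upsamples the previous output and fuses it with the corresponding upsampled latent representation of every feature extraction network, matching the input scheme described above. With the extraction-network sketch above, per_block_latents[q] could collect the (q+1)-th entry of each network's latent list.

```python
import torch
import torch.nn as nn

class GenerationNetwork(nn.Module):
    """Sketch of a generation neural network with M blocks (widths and M assumed)."""
    def __init__(self, m: int = 2, ch: int = 32, num_extractors: int = 2):
        super().__init__()
        self.ups = nn.ModuleList(
            [nn.ConvTranspose2d(ch, ch, 4, stride=2, padding=1) for _ in range(m)])
        self.fuses = nn.ModuleList(
            [nn.Conv2d(ch * (1 + num_extractors), ch, 3, padding=1) for _ in range(m)])
        self.to_image = nn.Conv2d(ch, 3, 3, padding=1)

    def forward(self, combined, per_block_latents):
        # combined: output of the combining processing (lowest resolution).
        # per_block_latents[q]: one latent per extraction network, taken from the
        # corresponding upsampling layer of that network.
        x = combined
        for up, fuse, latents in zip(self.ups, self.fuses, per_block_latents):
            x = torch.relu(up(x))                                   # block q upsamples its input
            x = torch.relu(fuse(torch.cat([x] + latents, dim=1)))   # fuse with the latents
        return self.to_image(x)
```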
For example, the method is further comprising obtaining a corresponding subframe in each of the N representations; wherein the processing the i-th representation comprises applying the i-th neural network to the corresponding subframe of the i-th representation.
Limiting the processing by the multi-scale reconstruction network including at least the first neural network, the second neural network and the generation neural network, to a subframe enables a focus on important parts of the image data and may enable a more specialized reconstruction of said subframes.
In an exemplary implementation, the corresponding subframe includes a face crop.
A restriction to subframes including face crops facilitates the applying of a specialized multiscale reconstruction network trained for facial reconstruction.
For example, the generation neural network is a generative adversarial network (GAN).
A generative adversarial network provides an efficient implementation for reconstruction of image data.
According to an embodiment, a method is provided for encoding image data into a bitstream, comprising: obtaining a set of parameters based on a cost function, said parameters indicating target bitrates; obtaining two or more representations of the image data based on the set of parameters; encoding the two or more representations into a bitstream according to the set of parameters; wherein the cost function is based on reconstructed image data obtained by processing two or more decoded representations of the same image data.
Such encoding generates a bitstream, which is optimized based on the cost function to provide an input to a decoder including a multi-scale reconstruction network.
In an exemplary implementation, the obtaining set of parameters based on a cost function comprises: for each current set of parameters out of a predefined plurality of sets of parameters: obtaining two or more representations of the image data based on the current set of parameters; encoding the two or more representations into a bitstream according to the current set of parameters; decoding the two or more representations from the bitstream; obtaining reconstructed image data based on the two or more decoded representations; determining quality of reconstructed image data; selecting a set of parameters out of the predefined plurality of sets based on the cost function including the determined quality.
Selecting a set of parameters out of a set of parameters based on the quality of the reconstructed image data enables the generation of a bitstream adapted to obtain reconstructed image data of said quality after decoding at a decoder.
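A minimal sketch of this selection loop is given below, using PSNR as an example quality term of the cost function; encode, decode and reconstruct are hypothetical stand-ins for the encoding, decoding and multi-scale reconstruction stages described above, and the image and its reconstruction are treated as flat lists of pixel values for simplicity.

```python
import math

def psnr(original, reconstructed, max_val=255.0):
    """Peak signal-to-noise ratio over flat lists of pixel values (example metric)."""
    mse = sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / len(original)
    return float("inf") if mse == 0 else 10.0 * math.log10(max_val ** 2 / mse)

def select_parameter_set(image, candidate_sets, encode, decode, reconstruct):
    """Hypothetical selection loop: the candidate set maximizing reconstruction quality wins."""
    best_set, best_quality = None, float("-inf")
    for params in candidate_sets:
        bitstream = encode(image, params)              # encode the representations
        representations = decode(bitstream)            # decode them from the bitstream
        reconstruction = reconstruct(representations)  # multi-scale reconstruction
        quality = psnr(image, reconstruction)          # quality term of the cost function
        if quality > best_quality:
            best_set, best_quality = params, quality
    return best_set
```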
For example, the obtaining set of parameters based on a cost function comprises: processing the image data by one or more layers of a neural network to output the set of parameters; wherein said neural network is trained by optimizing the cost function.
Applying a trained neural network provides an efficient implementation to obtain the set of parameters.
In an exemplary implementation, the neural network receives as input one or more of image data; a target bitrate of the bitstream; an indication of complexity of the image data; an indication of motion intensity of the image data.
Providing additional input to the neural network may improve the obtaining of the parameters.
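One conceivable form of such a neural network is sketched below; the architecture (a small convolutional feature branch pooled to a vector, concatenated with the scalar side inputs and mapped to a softmax split of the total target bitrate) is an assumption for illustration, not the disclosed design.

```python
import torch
import torch.nn as nn

class SplitParameterNet(nn.Module):
    """Hypothetical network: pooled image features plus scalar side inputs
    (target bitrate, complexity, motion intensity) are mapped to a bitrate split."""
    def __init__(self, num_representations: int = 3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Sequential(
            nn.Linear(16 + 3, 32), nn.ReLU(),
            nn.Linear(32, num_representations))

    def forward(self, image, target_bitrate, complexity, motion_intensity):
        side = torch.stack([target_bitrate, complexity, motion_intensity], dim=1)
        logits = self.head(torch.cat([self.features(image), side], dim=1))
        fractions = torch.softmax(logits, dim=1)
        # The predicted per-representation bitrates add up to the target bitrate of the bitstream.
        return fractions * target_bitrate.unsqueeze(1)

# Example call with a batch of two images and scalar side inputs per image.
params = SplitParameterNet()(torch.randn(2, 3, 64, 64),
                             torch.tensor([1e6, 2e6]),
                             torch.tensor([0.5, 0.8]),
                             torch.tensor([0.1, 0.9]))
```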
For example, the encoding of the two or more representations into the bitstream comprises: for each representation out of the two or more representations, encoding a representation into a subbitstream; including each of the subbitstreams into the bitstream.
Such an independent encoding of two separate substreams provides a prerequisite for parallelization.
In an exemplary implementation, the encoding of the two or more representations into the bitstream comprises: including the set of parameters into the bitstream.
Including the parameters in the bitstream facilitates a more efficient processing in the decoding by selecting an appropriate configuration of the multi-scale reconstruction network.
For example, the two or more representations represent different resolutions of the image data.
Providing different resolutions may facilitate the extraction of additional features by the neural networks applied in the processing of the representations in the decoding.
In an exemplary implementation, a first representation out of the two or more representations represents an original resolution of the image data and a second representation out of the two or more representations represents a downscaled resolution of the image data. The method is readily applied to the original image data and a downscaled representation of the original image data.
For example, a first parameter out of the set of parameters indicates a target bitrate for the first representation representing an original resolution of the image data; a second parameter out of the set of parameters indicates a target bitrate for the second representation representing a downscaled resolution of the image data.
Adapting the target bitrates of the respective representations by the obtaining of the set of parameters may result in an improved reconstruction of the image data at the decoding side.
In an exemplary implementation, the parameters indicating target bitrates for respective representations within the set of parameters add up to a target bitrate of the bitstream.
A target bitrate of the bitstream provides a suitable constraint used in the obtaining of the set of parameters.
For example, an encoder used in the encoding selects a quantization parameter for the encoding of a representation to obtain a target bitrate indicated by the parameter within the set of parameters corresponding to said representation.
Selecting an appropriate quantization parameter provides an efficient implementation to obtain a target bitrate.
In an exemplary implementation, the encoding is performed including one of an encoder of an autoencoding neural network, or a hybrid block-based encoder.
The method is readily applied to both an encoder of an autoencoding neural network, and a hybrid block-based encoder.
For example, a computer program product is provided, comprising program code for performing any of the methods described above when executed on a computer or a processor.
In an exemplary implementation, a non-transitory computer-readable medium is provided, carrying a program code which, when executed by a computer device, causes the computer device to perform any of the methods described above.
According to an embodiment, a decoder is provided, which comprises processing circuitry configured to carry out any of the methods for decoding image data described above. According to an embodiment, an encoder is provided, which comprises processing circuitry configured to carry out any of the methods for encoding image data described above.
According to an embodiment, a non-transitory storage medium is provided, comprising a bitstream encoded by any of the methods for encoding image data described above.
The programs and apparatuses provide the advantages of the methods described above.
The invention can be implemented in hardware (HW) and/or software (SW) or in any combination thereof. Moreover, HW-based implementations may be combined with SW-based implementations.
Details of one or more embodiments are set forth in the accompanying drawings and the description below. Other features, objects, and advantages will be apparent from the description, drawings, and claims.
BRIEF DESCRIPTION OF THE DRAWINGS
In the following embodiments of the present disclosure are described in more detail with reference to the attached figures and drawings, in which
Fig. 1 is a schematic drawing illustrating channels processed by layers of a neural network;
Fig. 2 is a schematic drawing illustrating an autoencoder type of a neural network;
Fig. 3a is a schematic drawing illustrating an exemplary network architecture for encoder and decoder side including a hyperprior model;
Fig. 3b is a schematic drawing illustrating a general network architecture for encoder side including a hyperprior model;
Fig. 3c is a schematic drawing illustrating a general network architecture for decoder side including a hyperprior model;
Fig. 4 is a schematic drawing illustrating an exemplary network architecture for encoder and decoder side including a hyperprior model;
Fig. 5a is a block diagram illustrating end-to-end video compression framework based on a neural networks; Fig. 5b is a block diagram illustrating some exemplary details of application of a neural network for motion field compression;
Fig. 5c is a block diagram illustrating some exemplary details of application of a neural network for motion compensation;
Fig. 6 is a block diagram showing an example of a video hybrid encoder configured to implement embodiments of the present disclosure;
Fig. 7 is a block diagram showing an example of a video hybrid decoder configured to implement embodiments of the present disclosure;
Fig. 8 is a flow diagram illustrating an exemplary method for decoding image data;
Fig. 9 is a flow diagram illustrating an exemplary method for encoding image data;
Fig. 10 is a schematic drawing illustrating an exemplary network structure to obtain reconstructed image data;
Fig. 11 a illustrates an exemplary first implementation of an encoder and a decoder;
Fig. 11b illustrates an exemplary second implementation of an encoder and a decoder;
Fig. 11c illustrates an exemplary third implementation of a decoder;
Fig. 12a is a schematic illustration of an exemplary downsampling (compressing) layer;
Fig. 12b is a schematic illustration of an exemplary upsampling (decompressing) layer;
Fig. 13 illustrates an exemplary processing within a generation neural network;
Fig. 14 illustrates exemplarily the processing by a face detection network;
Fig. 15 is a schematic drawing illustrating an exemplary decoding of a bitstream;
Fig. 16 is a schematic drawing illustrating an exemplary encoding of a bitstream;
Fig. 17 illustrates exemplarily the selection of a set of parameters;
Fig. 18 is a block diagram showing an example of a video coding system configured to implement embodiments of the present disclosure; Fig. 19 is a block diagram showing another example of a video coding system configured to implement embodiments of the present disclosure;
Fig. 20 is a block diagram illustrating an example of an encoding apparatus or a decoding apparatus;
Fig. 21 is a block diagram illustrating another example of an encoding apparatus or a decoding apparatus.
Like reference numbers and designations in different drawings may indicate similar elements.
DETAILED DESCRIPTION OF THE EMBODIMENTS
In the following description, reference is made to the accompanying figures, which form part of the disclosure, and which show, by way of illustration, specific aspects of embodiments of the present disclosure or specific aspects in which embodiments of the present disclosure may be used. It is understood that embodiments of the present disclosure may be used in other aspects and comprise structural or logical changes not depicted in the figures. The following detailed description, therefore, is not to be taken in a limiting sense, and the scope of the present disclosure is defined by the appended claims.
For instance, it is understood that a disclosure in connection with a described method may also hold true for a corresponding device or system configured to perform the method and vice versa. For example, if one or a plurality of specific method steps are described, a corresponding device may include one or a plurality of units, e.g. functional units, to perform the described one or plurality of method steps (e.g. one unit performing the one or plurality of steps, or a plurality of units each performing one or more of the plurality of steps), even if such one or more units are not explicitly described or illustrated in the figures. On the other hand, for example, if a specific apparatus is described based on one or a plurality of units, e.g. functional units, a corresponding method may include one step to perform the functionality of the one or plurality of units (e.g. one step performing the functionality of the one or plurality of units, or a plurality of steps each performing the functionality of one or more of the plurality of units), even if such one or plurality of steps are not explicitly described or illustrated in the figures. Further, it is understood that the features of the various exemplary embodiments and/or aspects described herein may be combined with each other, unless specifically noted otherwise.

Video coding typically refers to processing of a sequence of pictures, where the sequence of pictures forms a video, also referred to as a video sequence. In the field of video coding, the terms "picture", "frame", and "image" may be used as synonyms. Video coding (or coding in general) includes two parts, video encoding and video decoding. Video encoding is performed at the source side, typically including processing (for example, by compression) the original video pictures to reduce the amount of data required for representing the video pictures (for more efficient storage and/or transmission). Video decoding is performed on a destination side, and typically includes inverse processing in comparison with processing of the encoder to reconstruct the video picture. Embodiments referring to "coding" of video pictures (or pictures in general) shall be understood to relate to "encoding" or "decoding" of video pictures or respective video sequences. A combination of an encoding part and a decoding part is also referred to as CODEC (encoding and decoding).
In a case of lossless video coding, an original video picture can be reconstructed. In other words, a reconstructed video picture has same quality as the original video picture (assuming that no transmission loss or other data loss occurs during storage or transmission). In a case of lossy video coding, further compression is performed through, for example, quantization, to reduce an amount of data required for representing a video picture, and the video picture cannot be completely reconstructed on a decoder side. In other words, quality of a reconstructed video picture is lower or poorer than that of the original video picture.
Several video coding standards are used for "lossy hybrid video coding" (that is, spatial and temporal prediction in a sample domain is combined with 2D transform coding for applying quantization in a transform domain). Each picture of a video sequence is typically partitioned into a set of non-overlapping blocks, and coding is typically performed at a block level. To be specific, at an encoder side, a video is usually processed, that is, encoded, at a block (picture block) level. For example, a prediction block is generated through spatial (intra-picture) prediction or temporal (inter-picture) prediction, the prediction block is subtracted from a current block (block being processed or to be processed) to obtain a residual block (prediction error block), and the residual block is transformed into the transform domain and quantized to reduce an amount of data that is to be transmitted (compressed). At a decoder side, an inverse processing part relative to the encoder is applied to an encoded block or a compressed block to reconstruct the current block for representation. Furthermore, the encoder duplicates the decoder processing loop such that both generate identical predictions (for example, intra- and inter predictions) and/or re-constructions for processing, that is, coding, the subsequent blocks.
In the following, an overview over some of the used technical terms and framework within which the embodiments of the present disclosure may be employed is provided. Artificial neural networks
Artificial neural networks (ANN) or connectionist systems are computing systems vaguely inspired by the biological neural networks that constitute animal brains. Such systems "learn" to perform tasks by considering examples, generally without being programmed with task-specific rules. For example, in image recognition, they might learn to identify images that contain cats by analyzing example images that have been manually labeled as "cat" or "no cat" and using the results to identify cats in other images. They do this without any prior knowledge of cats, for example, that they have fur, tails, whiskers and cat-like faces. Instead, they automatically generate identifying characteristics from the examples that they process.
An ANN is based on a collection of connected units or nodes called artificial neurons, which loosely model the neurons in a biological brain. Each connection, like the synapses in a biological brain, can transmit a signal to other neurons. An artificial neuron that receives a signal then processes it and can signal neurons connected to it.
In ANN implementations, the "signal" at a connection is a real number, and the output of each neuron is computed by some non-linear function of the sum of its inputs. The connections are called edges. Neurons and edges typically have a weight that adjusts as learning proceeds. The weight increases or decreases the strength of the signal at a connection. Neurons may have a threshold such that a signal is sent only if the aggregate signal crosses that threshold. Typically, neurons are aggregated into layers. Different layers may perform different transformations on their inputs. Signals travel from the first layer (the input layer), to the last layer (the output layer), possibly after traversing the layers multiple times.
The original goal of the ANN approach was to solve problems in the same way that a human brain would. Over time, attention moved to performing specific tasks, leading to deviations from biology. ANNs have been used on a variety of tasks, including computer vision, speech recognition, machine translation, social network filtering, playing board and video games, medical diagnosis, and even in activities that have traditionally been considered as reserved to humans, like painting.
The name “convolutional neural network” (CNN) indicates that the network employs a mathematical operation called convolution. Convolution is a specialized kind of linear operation. Convolutional networks are neural networks that use convolution in place of a general matrix multiplication in at least one of their layers. Fig. 1 schematically illustrates a general concept of processing by a neural network such as the CNN. A convolutional neural network consists of an input and an output layer, as well as multiple hidden layers. Input layer is the layer to which the input (such as a portion 11 of an input image as shown in Fig. 1) is provided for processing. The hidden layers of a CNN typically consist of a series of convolutional layers that convolve with a multiplication or other dot product. The result of a layer is one or more feature maps (illustrated by empty solid-line rectangles), sometimes also referred to as channels. There may be a resampling (such as subsampling) involved in some or all of the layers. As a consequence, the feature maps may become smaller, as illustrated in Fig. 1. It is noted that a convolution with a stride may also reduce the size (resample) an input feature map. The activation function in a CNN is usually a ReLU (Rectified Linear Unit) layer, and is subsequently followed by additional convolutions such as pooling layers, fully connected layers and normalization layers, referred to as hidden layers because their inputs and outputs are masked by the activation function and final convolution. Though the layers are colloquially referred to as convolutions, this is only by convention. Mathematically, it is technically a sliding dot product or cross-correlation. This has significance for the indices in the matrix, in that it affects how the weight is determined at a specific index point.
When programming a CNN for processing images, as shown in Fig. 1, the input is a tensor with shape (number of images) x (image width) x (image height) x (image depth). It should be noted that the image depth can be constituted by channels of an image. After passing through a convolutional layer, the image becomes abstracted to a feature map, with shape (number of images) x (feature map width) x (feature map height) x (feature map channels). A convolutional layer within a neural network should have the following attributes: convolutional kernels defined by a width and height (hyper-parameters); the number of input channels and output channels (hyper-parameters); and the depth of the convolution filter (the input channels), which should be equal to the number of channels (depth) of the input feature map.
In the past, traditional multilayer perceptron (MLP) models have been used for image recognition. However, due to the full connectivity between nodes, they suffered from high dimensionality, and did not scale well with higher resolution images. A 1000*1000-pixel image with RGB color channels has 3 million weights, which is too high to feasibly process efficiently at scale with full connectivity. Also, such network architecture does not take into account the spatial structure of data, treating input pixels which are far apart in the same way as pixels that are close together. This ignores locality of reference in image data, both computationally and semantically. Thus, full connectivity of neurons is wasteful for purposes such as image recognition that are dominated by spatially local input patterns. Convolutional neural networks are biologically inspired variants of multilayer perceptrons that are specifically designed to emulate the behavior of a visual cortex. These models mitigate the challenges posed by the MLP architecture by exploiting the strong spatially local correlation present in natural images. The convolutional layer is the core building block of a CNN. The layer's parameters consist of a set of learnable filters (the above-mentioned kernels), which have a small receptive field, but extend through the full depth of the input volume. During the forward pass, each filter is convolved across the width and height of the input volume, computing the dot product between the entries of the filter and the input and producing a 2-dimensional activation map of that filter. As a result, the network learns filters that activate when it detects some specific type of feature at some spatial position in the input.
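As a small illustration of this sliding dot product (cross-correlation), the following plain-Python function computes a single-channel activation map; the kernel values are arbitrary example numbers, not taken from the disclosure.

```python
def cross_correlate2d(image, kernel):
    """The 'convolution' used in CNNs is a sliding dot product (cross-correlation):
    the kernel is slid over the input and a dot product is computed at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[y + i][x + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for x in range(out_w)]
            for y in range(out_h)]

# A 3x3 Laplacian-like kernel applied to a 4x4 single-channel input yields a 2x2 feature map.
feature_map = cross_correlate2d(
    [[1, 2, 3, 0], [4, 5, 6, 1], [7, 8, 9, 2], [1, 0, 1, 3]],
    [[0, 1, 0], [1, -4, 1], [0, 1, 0]])
```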
Stacking the activation maps for all filters along the depth dimension forms the full output volume of the convolution layer. Every entry in the output volume can thus also be interpreted as an output of a neuron that looks at a small region in the input and shares parameters with neurons in the same activation map. A feature map, or activation map, is the output activations for a given filter; the two terms have the same meaning. The term activation map is used because it is a mapping that corresponds to the activation of different parts of the image, and the term feature map is used because it is also a mapping of where a certain kind of feature is found in the image. A high activation means that a certain feature was found.
Another important concept of CNNs is pooling, which is a form of non-linear down-sampling. There are several non-linear functions to implement pooling among which max pooling is the most common. It partitions the input image into a set of non-overlapping rectangles and, for each such sub-region, outputs the maximum.
Intuitively, the exact location of a feature is less important than its rough location relative to other features. This is the idea behind the use of pooling in convolutional neural networks. The pooling layer serves to progressively reduce the spatial size of the representation, to reduce the number of parameters, memory footprint and amount of computation in the network, and hence to also control overfitting. It is common to periodically insert a pooling layer between successive convolutional layers in a CNN architecture. The pooling operation provides another form of translation invariance.
The pooling layer operates independently on every depth slice of the input and resizes it spatially. The most common form is a pooling layer with filters of size 2x2 applied with a stride of 2, which downsamples every depth slice in the input by 2 along both width and height, discarding 75% of the activations. In this case, every max operation is over 4 numbers. The depth dimension remains unchanged. In addition to max pooling, pooling units can use other functions, such as average pooling or ℓ2-norm pooling. Average pooling was often used historically but has recently fallen out of favour compared to max pooling, which often performs better in practice. Due to the aggressive reduction in the size of the representation, there is a recent trend towards using smaller filters or discarding pooling layers altogether. "Region of Interest" pooling (also known as ROI pooling) is a variant of max pooling, in which the output size is fixed and the input rectangle is a parameter. Pooling is an important component of convolutional neural networks for object detection based on the Fast R-CNN architecture.
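As a non-limiting illustration of the 2x2 max pooling with stride 2 described above, the following Python/PyTorch sketch (the library and the tensor sizes are assumptions of this example) halves width and height while leaving the depth unchanged:

    import torch
    import torch.nn as nn

    fmap = torch.randn(1, 16, 32, 32)             # a feature map with 16 channels

    pool = nn.MaxPool2d(kernel_size=2, stride=2)  # 2x2 window, stride 2
    pooled = pool(fmap)

    print(pooled.shape)  # torch.Size([1, 16, 16, 16]): width and height halved,
                         # depth unchanged, 75% of the activations discarded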
The above-mentioned ReLU is the abbreviation of rectified linear unit, which applies the non-saturating activation function f(x) = max(0, x). It effectively removes negative values from an activation map by setting them to zero. It increases the nonlinear properties of the decision function and of the overall network without affecting the receptive fields of the convolution layer. Other functions are also used to increase nonlinearity, for example the saturating hyperbolic tangent and the sigmoid function. ReLU is often preferred to other functions because it trains the neural network several times faster without a significant penalty to generalization accuracy.
Leaky Rectified Linear Unit, also known as Leaky ReLU, may be considered as a ReLU-based type of activation function which, instead of clamping negative values to zero, applies a small non-zero slope to negative input values.
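For illustration purposes only, the following short Python/PyTorch sketch (an assumption of this example, not part of the described embodiments) contrasts the two activation functions:

    import torch
    import torch.nn as nn

    x = torch.tensor([-2.0, -0.5, 0.0, 1.5])

    relu = nn.ReLU()
    leaky = nn.LeakyReLU(negative_slope=0.01)  # small slope for negative inputs

    relu(x)   # negative inputs clamped to zero:      0.0,  0.0,    0.0, 1.5
    leaky(x)  # negative inputs scaled by the slope: -0.02, -0.005, 0.0, 1.5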
After several convolutional and max pooling layers, the high-level reasoning in the neural network is done via fully connected layers. Neurons in a fully connected layer have connections to all activations in the previous layer, as seen in regular (non-convolutional) artificial neural networks. Their activations can thus be computed as an affine transformation, with matrix multiplication followed by a bias offset (vector addition of a learned or fixed bias term).
The "loss layer" (including calculating of a loss function) specifies how training penalizes the deviation between the predicted (output) and true labels and is normally the final layer of a neural network. Various loss functions appropriate for different tasks may be used. Softmax loss is used for predicting a single class of C mutually exclusive classes. Sigmoid cross-entropy loss is used for predicting C independent probability values in [0, 1]. Euclidean loss is used for regressing to real-valued labels.
In summary, Fig. 1 shows the data flow in a typical convolutional neural network. First, the input image is passed through convolutional layers and becomes abstracted to a feature map comprising several channels, corresponding to the number of filters in the set of learnable filters of this layer. Then, the feature map is subsampled using e.g. a pooling layer, which reduces the dimension of each channel in the feature map. Next, the data comes to another convolutional layer, which may have a different number of output channels. As was mentioned above, the number of input channels and output channels are hyper-parameters of the layer. To establish connectivity of the network, those parameters need to be synchronized between two connected layers, such that the number of input channels for the current layer is equal to the number of output channels of the previous layer. For the first layer which processes input data, e.g. an image, the number of input channels is normally equal to the number of channels of the data representation, for instance 3 channels for an RGB or YUV representation of images or video, or 1 channel for a grayscale image or video representation. The channels obtained by one or more convolutional layers (and possibly resampling layer(s)) may be passed to an output layer. Such an output layer may be a convolutional or a resampling layer in some implementations. In an exemplary and non-limiting implementation, the output layer is a fully connected layer.
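A non-limiting Python/PyTorch sketch of such a stack (the library and all layer sizes are assumptions of this example) may look as follows; it illustrates the synchronization of channel counts between connected layers and the subsampling by pooling:

    import torch
    import torch.nn as nn

    # The number of input channels of each layer equals the number of output
    # channels of the preceding layer; the first layer takes the number of
    # channels of the data representation (3 for RGB, 1 for grayscale).
    net = nn.Sequential(
        nn.Conv2d(3, 32, kernel_size=3, padding=1),   # first layer: 3 input channels (RGB)
        nn.ReLU(),
        nn.MaxPool2d(2),                              # subsampling: halves width and height
        nn.Conv2d(32, 64, kernel_size=3, padding=1),  # 32 input channels = 32 outputs above
        nn.ReLU(),
        nn.MaxPool2d(2),
    )

    out = net(torch.randn(1, 3, 64, 64))
    print(out.shape)  # torch.Size([1, 64, 16, 16])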
A generative adversarial network (GAN) is a deep learning model. The model includes at least two modules: a generative model (generator) and a discriminative model (discriminator). The two modules learn from each other to generate a better output. Both the generative model and the discriminative model may be neural networks, and may specifically be deep neural networks or convolutional neural networks. A basic principle of the GAN is as follows: Using a GAN for generating a picture as an example, it is assumed that there are two networks: G (Generator) and D (Discriminator). G is a network for generating a picture. G receives random noise z, and generates the picture by using the noise, where the picture is denoted as G(z). D is a discriminator network used to determine whether a picture is "real". An input parameter of D is x, where x represents a picture, and an output D(x) represents the probability that x is a real picture. If the value of D(x) is 1, it indicates that the picture is 100% real. If the value of D(x) is 0, it indicates that the picture cannot be real. In the process of training the generative adversarial network, the objective of the generative network G is to generate a picture that is as real as possible to deceive the discriminative network D, and the objective of the discriminative network D is to distinguish between the picture generated by G and a real picture as well as possible. In this way, a dynamic "gaming" process, namely the "adversarial" aspect of the "generative adversarial network", exists between G and D. The final result of this game is that, in an ideal state, G can generate a picture G(z) that can hardly be distinguished from a real picture, and it is difficult for D to determine whether the picture generated by G is real, that is, D(G(z)) = 0.5. In this way, an excellent generative model G is obtained, and can be used to generate a picture.
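For illustration purposes only, the following non-limiting Python/PyTorch sketch (the library, the tiny network architectures and all sizes are assumptions of this example) shows one training step of this adversarial game between G and D:

    import torch
    import torch.nn as nn

    G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 784))         # generator G(z)
    D = nn.Sequential(nn.Linear(784, 64), nn.LeakyReLU(0.2), nn.Linear(64, 1))  # discriminator D(x)

    opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
    opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
    bce = nn.BCEWithLogitsLoss()

    real = torch.randn(8, 784)  # stand-in for a batch of real pictures
    z = torch.randn(8, 16)      # random noise z
    fake = G(z)

    # D is trained to output 1 for real pictures and 0 for generated ones
    loss_d = bce(D(real), torch.ones(8, 1)) + bce(D(fake.detach()), torch.zeros(8, 1))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # G is trained to make D output 1 for generated pictures (to deceive D)
    loss_g = bce(D(G(z)), torch.ones(8, 1))
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()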
A bottleneck within the context of a neural network architecture may be considered as a position in which a bottleneck neural network layer is arranged. A bottleneck neural network layer may be considered as a neural network layer having fewer neural network nodes than the adjacent (in particular preceding and successive) neural network layers. According to some embodiments, the bottleneck can therefore be considered as the neural network layer having the minimum number of neural network nodes. It can therefore be used to obtain a representation of the input with reduced dimensionality.
Training is the adaptation of the network to better handle a task by considering sample observations. Training involves adjusting the weights and other parameters of the network to improve the accuracy of the result. This is done by minimizing the observed errors. After the training is finished, the neural network with the adapted weights is called a trained neural network.
Autoencoders and unsupervised learning
An autoencoder is a type of artificial neural network used to learn efficient data codings in an unsupervised manner. A schematic drawing thereof is shown in Fig. 2. The autoencoder includes an encoder side 2010 with an input x inputted into an input layer of an encoder subnetwork 2020 and a decoder side 2050 with output x’ outputted from a decoder subnetwork 2060. The aim of an autoencoder is to learn a representation (encoding) 2030 for a set of data x, typically for dimensionality reduction, by training the network 2020, 2060 to ignore signal “noise”. Along with the reduction (encoder) side subnetwork 2020, a reconstructing (decoder) side subnetwork 2060 is learnt, where the autoencoder tries to generate from the reduced encoding 2030 a representation x’ as close as possible to its original input x, hence its name. In the simplest case, given one hidden layer, the encoder stage of an autoencoder takes the input x and maps it to h: h = σ(Wx + b).
This image h is usually referred to as code 2030, latent variables, or latent representation. Here, σ is an element-wise activation function such as a sigmoid function or a rectified linear unit, W is a weight matrix and b is a bias vector. Weights and biases are usually initialized randomly, and then updated iteratively during training through backpropagation. After that, the decoder stage of the autoencoder maps h to the reconstruction x’ of the same shape as x: x’ = σ’(W’h + b’), where σ’, W’ and b’ for the decoder may be unrelated to the corresponding σ, W and b for the encoder.
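A non-limiting Python/PyTorch sketch of such an autoencoder (the library, the layer sizes and the choice of a sigmoid activation are assumptions of this example) is given below; it mirrors the encoder mapping h = σ(Wx + b) and the decoder mapping x’ = σ’(W’h + b’):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Autoencoder(nn.Module):
        def __init__(self, dim_x=784, dim_h=32):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(dim_x, dim_h), nn.Sigmoid())  # h = sigma(Wx + b)
            self.decoder = nn.Sequential(nn.Linear(dim_h, dim_x), nn.Sigmoid())  # x' = sigma'(W'h + b')

        def forward(self, x):
            h = self.encoder(x)      # code / latent representation (bottleneck)
            return self.decoder(h)   # reconstruction x' of the same shape as x

    model = Autoencoder()
    x = torch.rand(16, 784)
    x_rec = model(x)
    loss = F.mse_loss(x_rec, x)  # training drives x' to be as close as possible to x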
Variational autoencoder models make strong assumptions concerning the distribution of latent variables. They use a variational approach for latent representation learning, which results in an additional loss component and a specific estimator for the training algorithm called the Stochastic Gradient Variational Bayes (SGVB) estimator. It assumes that the data is generated by a directed graphical model p_θ(x|h) and that the encoder is learning an approximation q_φ(h|x) to the posterior distribution p_θ(h|x), where φ and θ denote the parameters of the encoder (recognition model) and decoder (generative model) respectively. The probability distribution of the latent vector of a VAE typically matches that of the training data much closer than a standard autoencoder. The objective of the VAE has the following form:
L(θ, φ) = D_KL( q_φ(h|x) ∥ p_θ(h) ) − E_{q_φ(h|x)}[ log p_θ(x|h) ]
Here, D_KL stands for the Kullback-Leibler divergence. The prior over the latent variables is usually set to be the centered isotropic multivariate Gaussian p_θ(h) = N(0, I). Commonly, the shapes of the variational and the likelihood distributions are chosen such that they are factorized Gaussians:
q_φ(h|x) = N( μ(x), diag(σ²(x)) ),    p_θ(x|h) = N( μ(h), diag(σ²(h)) ),
where μ(x) and σ²(x) are the encoder outputs, while μ(h) and σ²(h) are the decoder outputs.
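For illustration purposes only, the following non-limiting Python/PyTorch sketch (the library, the linear encoder/decoder and all sizes are assumptions of this example) evaluates such an objective for a factorized Gaussian q_φ(h|x) and a standard normal prior, using the usual reparameterization of the latent sample:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    enc = nn.Linear(784, 2 * 32)  # encoder outputs mu(x) and log sigma^2(x)
    dec = nn.Linear(32, 784)      # decoder maps the latent h back to x

    x = torch.rand(8, 784)
    mu, logvar = enc(x).chunk(2, dim=1)

    # Reparameterization: sample h ~ q_phi(h|x) = N(mu(x), sigma^2(x))
    h = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
    x_rec = dec(h)

    # Objective: reconstruction term plus KL divergence to the prior p_theta(h) = N(0, I)
    rec = F.mse_loss(x_rec, x, reduction='sum')
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    loss = rec + kl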
Recent progress in the area of artificial neural networks, and especially in convolutional neural networks, has stimulated researchers’ interest in applying neural-network-based technologies to the task of image and video compression. For example, End-to-end Optimized Image Compression has been proposed, which uses a network based on a variational autoencoder.
Accordingly, data compression is considered as a fundamental and well-studied problem in engineering, and is commonly formulated with the goal of designing codes for a given discrete data ensemble with minimal entropy. The solution relies heavily on knowledge of the probabilistic structure of the data, and thus the problem is closely related to probabilistic source modeling. However, since all practical codes must have finite entropy, continuous-valued data (such as vectors of image pixel intensities) must be quantized to a finite set of discrete values, which introduces an error.
In this context, known as the lossy compression problem, one must trade off two competing costs: the entropy of the discretized representation (rate) and the error arising from the quantization (distortion). Different compression applications, such as data storage or transmission over limited-capacity channels, demand different rate-distortion trade-offs. Joint optimization of rate and distortion is difficult. Without further constraints, the general problem of optimal quantization in high-dimensional spaces is intractable. For this reason, most existing image compression methods operate by linearly transforming the data vector into a suitable continuous-valued representation, quantizing its elements independently, and then encoding the resulting discrete representation using a lossless entropy code. This scheme is called transform coding due to the central role of the transformation.
For example, JPEG uses a discrete cosine transform on blocks of pixels, and JPEG 2000 uses a multi-scale orthogonal wavelet decomposition. Typically, the three components of transform coding methods - transform, quantizer, and entropy code - are separately optimized (often through manual parameter adjustment). Modern video compression standards like HEVC, VVC and EVC also use a transformed representation to code the residual signal after prediction. Several transforms are used for that purpose, such as discrete cosine and sine transforms (DCT, DST), as well as low-frequency non-separable manually optimized transforms (LFNST).
Variational image compression
The Variational Auto-Encoder (VAE) framework can be considered as a nonlinear transform coding model. The transforming process can be mainly divided into four parts, as exemplified in Fig. 3a showing the VAE framework.
In Fig. 3a, the encoder 101 maps an input image x into a latent representation (denoted by y) via the function y = f(x). This latent representation may also be referred to as a part of or a point within a “latent space” in the following. The function f() is a transformation function that converts the input signal x into a more compressible representation y. The quantizer 102 transforms the latent representation y into the quantized latent representation ŷ with (discrete) values, ŷ = Q(y), with Q representing the quantizer function. The entropy model, or the hyper encoder/decoder (also known as hyperprior) 103, estimates the distribution of the quantized latent representation ŷ to get the minimum rate achievable with a lossless entropy source coding.
The latent space can be understood as a representation of compressed data in which similar data points are closer together in the latent space. The latent space is useful for learning data features and for finding simpler representations of data for analysis. The quantized latent representation ŷ and the quantized side information ẑ of the hyperprior 103 are included into a bitstream (are binarized) using arithmetic coding (AE). Furthermore, a decoder 104 is provided that transforms the quantized latent representation into the reconstructed image x̂, x̂ = g(ŷ). The signal x̂ is the estimation of the input image x. It is desirable that x̂ is as close to x as possible, in other words that the reconstruction quality is as high as possible. However, the higher the similarity between x̂ and x, the higher the amount of side information necessary to be transmitted. The side information includes bitstream 1 and bitstream 2 shown in Fig. 3a, which are generated by the encoder and transmitted to the decoder. Normally, the higher the amount of side information, the higher the reconstruction quality. However, a high amount of side information means that the compression ratio is low. Therefore, one purpose of the system described in Fig. 3a is to balance the reconstruction quality and the amount of side information conveyed in the bitstream.
In Fig. 3a the component AE 105 is the Arithmetic Encoding module, which converts samples of the quantized latent representation and of the side information into a binary representation, bitstream 1. The samples of ŷ and ẑ might for example comprise integer or floating point numbers. One purpose of the arithmetic encoding module is to convert (via the process of binarization) the sample values into a string of binary digits (which is then included in the bitstream that may comprise further portions corresponding to the encoded image or further side information).
The arithmetic decoding (AD) 106 is the process of reverting the binarization process, where binary digits are converted back to sample values. The arithmetic decoding is provided by the arithmetic decoding module 106.
It is noted that the present disclosure is not limited to this particular framework. Moreover, the present disclosure is not restricted to image or video compression, and can be applied to object detection, image generation, and recognition systems as well.
In Fig. 3a there are two subnetworks concatenated to each other. A subnetwork in this context is a logical division between the parts of the total network. For example, in Fig. 3a the modules 101, 102, 104, 105 and 106 are called the “Encoder/Decoder” subnetwork. The “Encoder/Decoder” subnetwork is responsible for encoding (generating) and decoding (parsing) of the first bitstream “bitstream 1”. The second network in Fig. 3a comprises the modules 103, 108, 109, 110 and 107 and is called the “hyper encoder/decoder” subnetwork. The second subnetwork is responsible for generating the second bitstream “bitstream 2”. The purposes of the two subnetworks are different. The first subnetwork is responsible for:
• the transformation 101 of the input image x into its latent representation y (which is easier to compress than x),
• quantizing 102 the latent representation y into a quantized latent representation ŷ,
• compressing the quantized latent representation ŷ using the AE by the arithmetic encoding module 105 to obtain the bitstream “bitstream 1”,
• parsing the bitstream 1 via AD using the arithmetic decoding module 106, and
• reconstructing 104 the reconstructed image (x̂) using the parsed data.
The purpose of the second subnetwork is to obtain statistical properties (e.g. mean value, variance and correlations between samples of bitstream 1) of the samples of “bitstream 1”, such that the compressing of bitstream 1 by the first subnetwork is more efficient. The second subnetwork generates a second bitstream “bitstream 2”, which comprises said information (e.g. mean value, variance and correlations between samples of bitstream 1).
The second network includes an encoding part which comprises transforming 103 the quantized latent representation ŷ into side information z, quantizing the side information z into quantized side information ẑ, and encoding (e.g. binarizing) 109 the quantized side information ẑ into bitstream 2. In this example, the binarization is performed by an arithmetic encoding (AE). A decoding part of the second network includes arithmetic decoding (AD) 110, which transforms the input bitstream 2 into decoded quantized side information ẑ'. The ẑ' might be identical to ẑ, since the arithmetic encoding and decoding operations are lossless compression methods. The decoded quantized side information ẑ' is then transformed 107 into decoded side information ŷ'. ŷ' represents the statistical properties of ŷ (e.g. the mean value of samples of ŷ, or the variance of sample values, or the like). The decoded side information ŷ' is then provided to the above-mentioned Arithmetic Encoder 105 and Arithmetic Decoder 106 to control the probability model of ŷ.
Fig. 3a describes an example of a VAE (variational auto encoder), the details of which might be different in different implementations. For example, in a specific implementation additional components might be present to more efficiently obtain the statistical properties of the samples of bitstream 1. In one such implementation a context modeler might be present, which targets extracting cross-correlation information of the bitstream 1. The statistical information provided by the second subnetwork might be used by the AE (arithmetic encoder) 105 and AD (arithmetic decoder) 106 components.
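For illustration purposes only, the following non-limiting Python/PyTorch sketch (the library, the layer configurations and the mapping to the numbered modules are assumptions of this example; the actual arithmetic coding of the bitstreams is omitted) outlines the data flow of the two subnetworks described above:

    import torch
    import torch.nn as nn

    g_a = nn.Sequential(nn.Conv2d(3, 128, 5, stride=2, padding=2), nn.ReLU(),
                        nn.Conv2d(128, 192, 5, stride=2, padding=2))     # encoder 101: y = f(x)
    g_s = nn.Sequential(nn.ConvTranspose2d(192, 128, 5, stride=2, padding=2, output_padding=1), nn.ReLU(),
                        nn.ConvTranspose2d(128, 3, 5, stride=2, padding=2, output_padding=1))  # decoder 104
    h_a = nn.Conv2d(192, 128, 3, stride=2, padding=1)                              # hyper encoder 103
    h_s = nn.ConvTranspose2d(128, 192, 3, stride=2, padding=1, output_padding=1)   # hyper decoder 107

    x = torch.randn(1, 3, 64, 64)
    y = g_a(x)              # latent representation
    y_hat = torch.round(y)  # quantization 102; entropy coding into bitstream 1 omitted
    z = h_a(y)              # side information
    z_hat = torch.round(z)  # quantized side information for bitstream 2
    stats = h_s(z_hat)      # statistical properties used to control the probability model
    x_hat = g_s(y_hat)      # reconstructed image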
Fig. 3a depicts the encoder and decoder in a single figure. As is clear to those skilled in the art, the encoder and the decoder may be, and very often are, embedded in mutually different devices.
Fig. 3b depicts the encoder and Fig. 3c depicts the decoder components of the VAE framework in isolation. As input, the encoder receives, according to some embodiments, a picture. The input picture may include one or more channels, such as color channels or other kind of channels, e.g. a depth channel or a motion information channel, or the like. The output of the encoder (as shown in Fig. 3b) is a bitstream 1 and a bitstream 2. The bitstream 1 is the output of the first subnetwork of the encoder and the bitstream 2 is the output of the second subnetwork of the encoder.
Similarly, in Fig. 3c, the two bitstreams, bitstream 1 and bitstream 2, are received as input and x̂, which is the reconstructed (decoded) image, is generated at the output. As indicated above, the VAE can be split into different logical units that perform different actions. This is exemplified in Figs. 3b and 3c, such that Fig. 3b depicts components that participate in the encoding of a signal, like a video, and provide encoded information. This encoded information is then received by the decoder components depicted in Fig. 3c for decoding, for example. It is noted that the components of the encoder and decoder denoted with numerals 12x and 14x may correspond in their function to the components referred to above in Fig. 3a and denoted with numerals 10x.
Specifically, as is seen in Fig. 3b, the encoder comprises the encoder 121 that transforms an input x into a signal y which is then provided to the quantizer 122. The quantizer 122 provides information to the arithmetic encoding module 125 and the hyper encoder 123. The hyper encoder 123 provides the bitstream 2 already discussed above to the hyper decoder 127 that in turn provides the information to the arithmetic encoding module 105 (125).
The output of the arithmetic encoding module is the bitstream 1. The bitstream 1 and bitstream 2 are the output of the encoding of the signal, which are then provided (transmitted) to the decoding process. Although the unit 101 (121) is called “encoder”, it is also possible to call the complete subnetwork described in Fig. 3b an “encoder”. The process of encoding in general means the unit (module) that converts an input to an encoded (e.g. compressed) output. It can be seen from Fig. 3b that the unit 121 can actually be considered as the core of the whole subnetwork, since it performs the conversion of the input x into y, which is the compressed version of x. The compression in the encoder 121 may be achieved, e.g. by applying a neural network, or in general any processing network with one or more layers. In such a network, the compression may be performed by cascaded processing including downsampling which reduces the size and/or number of channels of the input. Thus, the encoder may be referred to, e.g. as a neural network (NN) based encoder, or the like.
The remaining parts in the figure (quantization unit, hyper encoder, hyper decoder, arithmetic encoder/decoder) are all parts that either improve the efficiency of the encoding process or are responsible for converting the compressed output y into a series of bits (bitstream). Quantization may be provided to further compress the output of the NN encoder 121 by a lossy compression. The AE 125 in combination with the hyper encoder 123 and hyper decoder 127 used to configure the AE 125 may perform the binarization which may further compress the quantized signal by a lossless compression. Therefore, it is also possible to call the whole subnetwork in Fig. 3b an “encoder”.
A majority of Deep Learning (DL) based image/video compression systems reduce the dimensionality of the signal before converting the signal into binary digits (bits). In the VAE framework for example, the encoder, which is a non-linear transform, maps the input image x into y, where y has a smaller width and height than x. Since y has a smaller width and height and hence a smaller size, the dimension of the signal is reduced and it is therefore easier to compress the signal y. It is noted that in general, the encoder does not necessarily need to reduce the size in both (or in general all) dimensions. Rather, some exemplary implementations may provide an encoder which reduces the size only in one dimension (or in general in a subset of the dimensions).
In J. Balle, L. Valero Laparra, and E. P. Simoncelli (2015), “Density Modeling of Images Using a Generalized Normalization Transformation”, in: arXiv e-prints, presented at the 4th Int. Conf. for Learning Representations, 2016 (referred to in the following as “Balle”), the authors proposed a framework for end-to-end optimization of an image compression model based on nonlinear transforms. The authors optimize for mean squared error (MSE), but use more flexible transforms built from cascades of linear convolutions and nonlinearities. Specifically, the authors use a generalized divisive normalization (GDN) joint nonlinearity that is inspired by models of neurons in biological visual systems, and has proven effective in Gaussianizing image densities. This cascaded transformation is followed by uniform scalar quantization (i.e., each element is rounded to the nearest integer), which effectively implements a parametric form of vector quantization on the original image space. The compressed image is reconstructed from these quantized values using an approximate parametric nonlinear inverse transform. Such an example of the VAE framework is shown in Fig. 4, and it utilizes 6 downsampling layers that are marked with 401 to 406. The network architecture includes a hyperprior model. The left side (ga, gs) shows an image autoencoder architecture, the right side (ha, hs) corresponds to the autoencoder implementing the hyperprior. The factorized-prior model uses the identical architecture for the analysis and synthesis transforms ga and gs. Q represents quantization, and AE, AD represent the arithmetic encoder and arithmetic decoder, respectively. The encoder subjects the input image x to ga, yielding the responses y (latent representation) with spatially varying standard deviations. The encoding ga includes a plurality of convolution layers with subsampling and, as an activation function, generalized divisive normalization (GDN).
The responses are fed into ha, summarizing the distribution of standard deviations in z. z is then quantized, compressed, and transmitted as side information. The encoder then uses the quantized vector ẑ to estimate σ̂, the spatial distribution of standard deviations, which is used for obtaining probability values (or frequency values) for arithmetic coding (AE), and uses it to compress and transmit the quantized image representation ŷ (or latent representation). The decoder first recovers ẑ from the compressed signal. It then uses hs to obtain σ̂, which provides it with the correct probability estimates to successfully recover ŷ as well. It then feeds ŷ into gs to obtain the reconstructed image.
The layers that include downsampling are indicated with the downward arrow in the layer description. The layer description „Conv N,k1,2↓“ means that the layer is a convolution layer with N channels and a convolution kernel of size k1xk1. For example, k1 may be equal to 5 and k2 may be equal to 3. As stated, the 2↓ means that a downsampling with a factor of 2 is performed in this layer. Downsampling by a factor of 2 results in one of the dimensions of the input signal being reduced by half at the output. In Fig. 4, the 2↓ indicates that both width and height of the input image are reduced by a factor of 2. Since there are 6 downsampling layers, if the width and height of the input image 414 (also denoted with x) are given by w and h, the output signal z 413 has width and height equal to w/64 and h/64, respectively. Modules denoted by AE and AD are the arithmetic encoder and arithmetic decoder, which are explained with reference to Figs. 3a to 3c. The arithmetic encoder and decoder are specific implementations of entropy coding. AE and AD can be replaced by other means of entropy coding. In information theory, entropy encoding is a lossless data compression scheme that is used to convert the values of a symbol into a binary representation, which is a revertible process. Also, the “Q” in the figure corresponds to the quantization operation that was also referred to above. Also, the quantization operation and a corresponding quantization unit as part of the component 413 or 415 is not necessarily present and/or can be replaced with another unit. In Fig. 4, there is also shown the decoder comprising upsampling layers 407 to 412. A further layer 420 is provided between the upsampling layers 411 and 410 in the processing order of an input, which is implemented as a convolutional layer but does not provide an upsampling to the input received. A corresponding convolutional layer 430 is also shown for the decoder. Such layers can be provided in NNs for performing operations on the input that do not alter the size of the input but change specific characteristics. However, it is not necessary that such a layer is provided.
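As a short worked example of the cumulative downsampling factor mentioned above (the input resolution is chosen arbitrarily for illustration), in Python:

    # Six downsampling layers, each reducing width and height by a factor of 2:
    w, h = 1024, 768
    factor = 2 ** 6                  # 64
    print(w // factor, h // factor)  # 16 12: width and height of the signal after all six layers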
When seen in the processing order of bitstream 2 through the decoder, the upsampling layers are run through in reverse order, i.e. from upsampling layer 412 to upsampling layer 407. Each upsampling layer is shown here to provide an upsampling with an upsampling ratio of 2, which is indicated by the ↑. It is, of course, not necessarily the case that all upsampling layers have the same upsampling ratio, and other upsampling ratios like 3, 4, 8 or the like may also be used. The layers 407 to 412 are implemented as convolutional layers (conv). Specifically, as they may be intended to provide an operation on the input that is reverse to that of the encoder, the upsampling layers may apply a deconvolution operation to the input received so that its size is increased by a factor corresponding to the upsampling ratio. However, the present disclosure is not generally limited to deconvolution and the upsampling may be performed in any other manner, such as by bilinear interpolation between two neighboring samples, or by nearest neighbor sample copying, or the like.
In the first subnetwork, some convolutional layers (401 to 403) are followed by generalized divisive normalization (GDN) at the encoder side and by the inverse GDN (IGDN) at the decoder side. In the second subnetwork, the activation function applied is ReLU. It is noted that the present disclosure is not limited to such an implementation and in general, other activation functions may be used instead of GDN or ReLU.
End-to-end image or video compression
DNN based image compression methods can exploit large scale end-to-end training and highly non-linear transform, which are not used in the traditional approaches. However, it is non-trivial to directly apply these techniques to build an end-to-end learning system for video compression. First, it remains an open problem to learn how to generate and compress the motion information tailored for video compression. Video compression methods heavily rely on motion information to reduce temporal redundancy in video sequences.
A straightforward solution is to use the learning based optical flow to represent motion information. However, current learning based optical flow approaches aim at generating flow fields as accurate as possible. The precise optical flow is often not optimal for a particular video task. In addition, the data volume of optical flow increases significantly when compared with motion information in the traditional compression systems and directly applying the existing compression approaches to compress optical flow values will significantly increase the number of bits required for storing motion information. Second, it is unclear how to build a DNN based video compression system by minimizing the rate-distortion based objective for both residual and motion information. Rate-distortion optimization (RDO) aims at achieving higher quality of reconstructed frame (i.e., less distortion) when the number of bits (or bit rate) for compression is given. RDO is important for video compression performance. In order to exploit the power of end-to-end training for learning based compression system, the RDO strategy is required to optimize the whole system.
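The rate-distortion objective mentioned above is commonly expressed as a Lagrangian cost J = D + λR. As a purely illustrative example (the numerical values and the helper function are hypothetical), the following Python snippet compares two operating points under such a cost:

    def rd_cost(distortion, rate, lmbda):
        """Lagrangian rate-distortion cost J = D + lambda * R.

        A larger lambda penalizes rate more strongly and thus favours
        smaller bitstreams at the price of a higher distortion."""
        return distortion + lmbda * rate

    # Choosing between two hypothetical operating points:
    j_low_rate = rd_cost(distortion=4.0, rate=100.0, lmbda=0.05)   # 9.0
    j_high_rate = rd_cost(distortion=2.5, rate=160.0, lmbda=0.05)  # 10.5, so the first point wins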
In Guo Lu, Wanli Ouyang, Dong Xu, Xiaoyun Zhang, Chunlei Cai, Zhiyong Gao, "DVC: An End-to-end Deep Video Compression Framework", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 11006-11015, the authors proposed the end-to-end deep video compression (DVC) model that jointly learns motion estimation, motion compression, and residual coding.
Such an encoder is illustrated in Fig. 5a. In particular, Fig. 5a shows the overall structure of the end-to-end trainable video compression framework. In order to compress motion information, a CNN was designed to transform the optical flow vt to the corresponding representations mt suitable for better compression. Specifically, an auto-encoder style network is used to compress the optical flow. The motion vectors (MV) compression network is shown in Fig. 5b. The network architecture is somewhat similar to the ga/gs of Fig. 4. In particular, the optical flow vt is fed into a series of convolution operations and nonlinear transforms including GDN and IGDN. The number of output channels c for convolution (deconvolution) is here exemplarily 128, except for the last deconvolution layer, for which it is equal to 2 in this example. The kernel size is k, e.g. k=3. Given the optical flow, the MV encoder generates the motion representation mt. The motion representation is then quantized (Q), entropy coded and sent to the bitstream as m̂t. The MV decoder receives the quantized representation m̂t and reconstructs the motion information v̂t. In general, the values for k and c may differ from the above-mentioned examples as is known from the art.
Fig. 5c shows the structure of the motion compensation part. Here, using the previously reconstructed frame x̂t-1 and the reconstructed motion information, the warping unit generates the warped frame (normally with the help of an interpolation filter such as a bi-linear interpolation filter). Then a separate CNN with three inputs generates the predicted picture. The architecture of the motion compensation CNN is also shown in Fig. 5c. The residual information between the original frame and the predicted frame is encoded by the residual encoder network. A highly non-linear neural network is used to transform the residuals to the corresponding latent representation. Compared with the discrete cosine transform in the traditional video compression system, this approach can better exploit the power of non-linear transforms and achieve higher compression efficiency.
From the above overview it can be seen that CNN-based architectures can be applied both for image and video compression, considering different parts of the video framework including motion estimation, motion compensation and residual coding. Entropy coding is a popular method used for data compression, which is widely adopted by the industry and is also applicable for feature map compression, either for human perception or for computer vision tasks.
Hybrid block-based encoder and decoder
Fig. 6 shows a schematic block diagram of an example video encoder 20 that is configured to implement the techniques of the present disclosure. In the example of Fig. 6, the video encoder 20 includes an input 201 (or input interface 201), a residual calculation unit 204, a transform processing unit 206, a quantization unit 208, an inverse quantization unit 210, an inverse transform processing unit 212, a reconstruction unit 214, a loop filter unit 220, a decoded picture buffer (DPB) 230, a mode selection unit 260, an entropy encoding unit 270, and an output 272 (or output interface 272). The mode selection unit 260 may include an inter prediction unit 244, an intra prediction unit 254 and a partitioning unit 262. The inter prediction unit 244 may include a motion estimation unit and a motion compensation unit (not shown). The encoder 20 shown in Fig. 6 may also be referred to as a hybrid video encoder.
The residual calculation unit 204, the transform processing unit 206, the quantization unit 208 and the mode selection unit 260 may be referred to as forming a forward signal path of the encoder 20, whereas the inverse quantization unit 210, the inverse transform processing unit 212, the reconstruction unit 214, the buffer 216, the loop filter 220, the decoded picture buffer (decoded picture buffer, DPB) 230, the inter prediction unit 244 and the intra-prediction unit 254 may be referred to as forming a backward signal path of the video encoder 20, wherein the backward signal path of the video encoder 20 corresponds to the signal path of the decoder (see the video decoder 30 in Fig. 7). The inverse quantization unit 210, the inverse transform processing unit 212, the reconstruction unit 214, the loop filter 220, the decoded picture buffer (decoded picture buffer, DPB) 230, the inter prediction unit 244 and the intra-prediction unit 254 are also referred to as forming the "built-in decoder" of video encoder 20. The encoder 20 may be configured to receive, for example, via the input 201, a picture 17 (or picture data 17), for example, a picture in a sequence of pictures forming a video or video sequence. The received picture or picture data may also be a pre-processed picture 19 (or pre-processed picture data 19). For simplicity, the following description refers to the picture 17. The picture 17 may also be referred to as current picture or picture to be coded (in particular in video coding to distinguish the current picture from other pictures, for example, previously encoded and/or decoded pictures of the same video sequence, that is, the video sequence which also includes the current picture).
A (digital) picture is or can be considered as a two-dimensional array or matrix of samples with intensity values. A sample in the array may also be referred to as pixel or pel (short form of picture element) and vice versa. A quantity of samples in the horizontal and vertical direction (or axis) of the array or picture define the size and/or resolution of the picture. For representation of color, three color components are usually employed, to be specific, the picture may be represented as or include three sample arrays. In RGB format or color space a picture includes a corresponding red, green and blue sample array. However, in video coding each pixel is typically represented in a luminance and chrominance format or color space, for example, YCbCr, which includes a luminance component indicated by Y (sometimes also L is used instead) and two chrominance components indicated by Cb and Cr. The luminance (or short luma) component Y represents the brightness or grey level intensity (e.g. like in a grey-scale picture), while the two chrominance (or short chroma) components Cb and Cr represent the chromaticity or color information components. Accordingly, a picture in YCbCr format includes a luminance sample array of luminance sample values (Y), and two chrominance sample arrays of chrominance values (Cb and Cr). Pictures in RGB format may be converted or transformed into YCbCr format and vice versa; the process is also known as color transformation or conversion. If a picture is monochrome, the picture may include only a luminance sample array. Accordingly, a picture may be, for example, an array of luma samples in monochrome format or an array of luma samples and two corresponding arrays of chroma samples in 4:2:0, 4:2:2, and 4:4:4 colour format.
In further embodiments, the video encoder may be configured to receive directly a block 203 of the picture 17, for example, one, several or all blocks forming the picture 17. The picture block 203 may also be referred to as current picture block or picture block to be coded.
Like the picture 17, the picture block 203 again is or can be considered as a two-dimensional array or matrix of samples with intensity values (sample values), although of smaller dimension than the picture 17. In other words, the block 203 may include, for example, one sample array (for example, a luma array in case of a monochrome picture 17, or a luma or chroma array in case of a color picture) or three sample arrays (for example, a luma and two chroma arrays in case of a color picture 17) or any other number and/or type of arrays depending on the color format applied. A quantity of samples in the horizontal and vertical direction (or axis) of the block 203 define the size of block 203. Accordingly, a block may, for example, be an A x B (A-column by B-row) array of samples, or an A x B array of transform coefficients.
The residual calculation unit 204 may be configured to calculate a residual block 205 (also referred to as residual 205) based on the picture block 203 and a prediction block 265 (further details about the prediction block 265 are provided later), for example, by subtracting sample values of the prediction block 265 from sample values of the picture block 203, sample by sample (pixel by pixel) to obtain the residual block 205 in the sample domain.
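For illustration purposes only, the following non-limiting Python sketch (using PyTorch tensors; the block size and sample values are assumptions of this example) shows the sample-by-sample residual calculation and the corresponding reconstruction by adding the residual back to the prediction; without the quantization loss discussed further below, the reconstruction is exact:

    import torch

    picture_block = torch.randint(0, 256, (16, 16))  # picture block 203
    prediction = torch.randint(0, 256, (16, 16))     # prediction block 265

    residual = picture_block - prediction            # residual block 205, sample by sample
    reconstructed = prediction + residual            # reconstruction adds the residual back
    assert torch.equal(reconstructed, picture_block)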
The transform processing unit 206 may be configured to apply a transform, for example, a discrete cosine transform (DCT) or discrete sine transform (DST), on the sample values of the residual block 205 to obtain transform coefficients 207 in a transform domain. The transform coefficients 207 may also be referred to as transform residual coefficients and represent the residual block 205 in the transform domain.
Embodiments of the video encoder 20 (respectively transform processing unit 206) may be configured to output transform parameters, for example, a type of transform or transforms, for example, directly or encoded or compressed via the entropy encoding unit 270, so that, for example, the video decoder 30 may receive and use the transform parameters for decoding.
The quantization unit 208 may be configured to quantize the transform coefficients 207 to obtain quantized coefficients 209, for example, by applying scalar quantization or vector quantization. The quantized transform coefficient 209 may also be referred to as a quantized residual coefficient 209.
Embodiments of the video encoder 20 (respectively quantization unit 208) may be configured to output quantization parameters (quantization parameter, QP), for example, directly or encoded via the entropy encoding unit 270, so that, for example, the video decoder 30 may receive and apply the quantization parameters for decoding.
The inverse quantization unit 210 is configured to apply the inverse quantization of the quantization unit 208 on the quantized coefficients to obtain dequantized coefficients 211, for example, by applying the inverse of the quantization scheme applied by the quantization unit 208 based on or using the same quantization step size as the quantization unit 208. The dequantized coefficients 211 may also be referred to as dequantized residual coefficients 211 and correspond - although typically not identical to the transform coefficients due to the loss by quantization - to the transform coefficients 207.
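For illustration purposes only, the following non-limiting Python sketch (the step size and coefficient values are arbitrary; a real codec derives the step size from the quantization parameter and applies further scaling) shows scalar quantization and the corresponding inverse quantization, including the loss introduced by rounding:

    import torch

    def quantize(coeffs, step):
        return torch.round(coeffs / step)  # scalar quantization as in quantization unit 208

    def dequantize(levels, step):
        return levels * step               # inverse quantization as in unit 210 / 310

    transform_coeffs = torch.tensor([52.7, -3.1, 0.4, 17.9])
    step = 4.0                                 # step size derived from the QP (assumed here)
    levels = quantize(transform_coeffs, step)  # tensor([13., -1., 0., 4.])
    recon = dequantize(levels, step)           # tensor([52., -4., 0., 16.]): not identical to the
                                               # original coefficients due to the quantization loss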
The inverse transform processing unit 212 is configured to apply the inverse transform of the transform applied by the transform processing unit 206, for example, an inverse discrete cosine transform (DCT) or inverse discrete sine transform (DST) or other inverse transforms, to obtain a reconstructed residual block 213 (or corresponding dequantized coefficients 213) in the sample domain. The reconstructed residual block 213 may also be referred to as transform block 213.
The reconstruction unit 214 (for example, adder or summer 214) is configured to add the transform block 213 (that is, reconstructed residual block 213) to the prediction block 265 to obtain a reconstructed block 215 in the sample domain, for example, by adding the sample values of the reconstructed residual block 213 and the sample values of the prediction block 265.
The loop filter unit 220 (or short "loop filter" 220), is configured to filter the reconstructed block 215 to obtain a filtered block 221 , or in general, to filter reconstructed samples to obtain filtered sample values. The loop filter unit is, for example, configured to smooth pixel transitions, or otherwise improve the video quality. The loop filter unit 220 may include one or more loop filters such as a de-blocking filter, a sample-adaptive offset (sample-adaptive offset, SAO) filter or one or more other filters, for example, an adaptive loop filter (adaptive loop filter, ALF), a noise suppression filter (noise suppression filter, NSF), or any combination thereof. In an example, the loop filter unit 220 may include a de-blocking filter, a SAO filter and an ALF filter. The order of the filtering process may be the deblocking filter, SAO and ALF. In another example, a process called the luma mapping with chroma scaling (luma mapping with chroma scaling, LMCS) (namely, the adaptive in-loop reshaper) is added. This process is performed before deblocking. In another example, the deblocking filter process may be also applied to internal sub-block edges, for example, affine sub-blocks edges, ATMVP sub-blocks edges, sub-block transform (sub-block transform, SBT) edges and intra sub-partition (intra subpartition, ISP) edges. Although the loop filter unit 220 is shown as an in-loop filter in Fig. 6, in another implementation, the loop filter unit 220 may be implemented as a post-loop filter. The filtered block 221 may also be referred to as filtered reconstructed block 221.
Embodiments of the video encoder 20 (respectively loop filter unit 220) may be configured to output loop filter parameters (such as SAO filter parameters or ALF filter parameters or LMCS parameters), for example, directly or encoded via the entropy encoding unit 270, so that, for example, a decoder 30 may receive and apply the same loop filter parameters or respective loop filters for decoding.
The decoded picture buffer (decoded picture buffer, DPB) 230 may be a memory that stores reference pictures, or in general reference picture data, for encoding video data by the video encoder 20.
The mode selection unit 260 may include an inter prediction unit 244, an intra prediction unit 254 and a partitioning unit 262. Inter prediction unit 244 may include a motion estimation unit and a motion compensation unit (not shown). A video encoder 20 as shown in Fig. 6 may also be referred to as a hybrid video encoder or a video encoder according to a hybrid video codec.
The entropy encoding unit 270 is configured to apply, for example, an entropy encoding algorithm or scheme (for example, a variable length coding (VLC) scheme, a context adaptive VLC scheme (CAVLC), an arithmetic coding scheme, a binarization, a context adaptive binary arithmetic coding (CABAC), syntax-based context-adaptive binary arithmetic coding (SBAC), probability interval partitioning entropy (PIPE) coding or another entropy encoding methodology or technique) or bypass (no compression) on the quantized coefficients 209, inter prediction parameters, intra prediction parameters, loop filter parameters and/or other syntax elements to obtain encoded picture data 21 which can be output via the output 272, for example, in the form of an encoded bitstream 21, so that, for example, the video decoder 30 may receive and use the parameters for decoding. The encoded bitstream 21 may be transmitted to video decoder 30, or stored in a memory for later transmission or retrieval by video decoder 30.
Other structural variations of the video encoder 20 can be used to encode the video stream. For example, a non-transform based encoder 20 may quantize a residual signal directly without the transform processing unit 206 for some blocks or frames. In another implementation, the encoder 20 may have the quantization unit 208 and the inverse quantization unit 210 that are combined into a single unit.
Fig. 7 shows an example of a video decoder 30 that is configured to implement the techniques of the present disclosure. The video decoder 30 is configured to receive encoded picture data 21 (for example, encoded bitstream 21), for example, encoded by encoder 20, to obtain a decoded picture 331. The encoded picture data or bitstream includes information for decoding the encoded picture data, for example, data that represents picture blocks of an encoded video slice (and/or tile groups or tiles) and associated syntax elements. In the example of Fig. 7, the decoder 30 includes an entropy decoding unit 304, an inverse quantization unit 310, an inverse transform processing unit 312, a reconstruction unit 314 (for example, a summer 314), a loop filter 320, a decoded picture buffer (DPB) 330, a mode application unit 360, an inter prediction unit 344 and an intra prediction unit 354. The inter prediction unit 344 may be or include a motion compensation unit. In some examples, the video decoder 30 may perform a decoding process that is roughly inverse to the encoding process described with respect to the video encoder 20 in Fig. 6.
As explained with regard to the encoder 20, the inverse quantization unit 210, the inverse transform processing unit 212, the reconstruction unit 214, the loop filter 220, the decoded picture buffer (DPB) 230, the inter prediction unit 244 and the intra prediction unit 254 are also referred to as forming the “built-in decoder” of video encoder 20. Accordingly, the inverse quantization unit 310 may be identical in function to the inverse quantization unit 210, the inverse transform processing unit 312 may be identical in function to the inverse transform processing unit 212, the reconstruction unit 314 may be identical in function to reconstruction unit 214, the loop filter 320 may be identical in function to the loop filter 220, and the decoded picture buffer 330 may be identical in function to the decoded picture buffer 230. Therefore, the explanations provided for the respective units and functions of the video encoder 20 apply correspondingly to the respective units and functions of the video decoder 30.
The entropy decoding unit 304 is configured to parse the bitstream 21 (or in general encoded picture data 21) and perform, for example, entropy decoding on the encoded picture data 21 to obtain, for example, quantized coefficients 309 and/or decoded coding parameters (not shown in Fig. 7), for example, any or all of inter prediction parameters (for example, reference picture index and motion vector), intra prediction parameters (for example, intra-prediction mode or index), transform parameters, quantization parameters, loop filter parameters, and/or other syntax elements. The entropy decoding unit 304 may be configured to apply the decoding algorithms or schemes corresponding to the encoding schemes as described with regard to the entropy encoding unit 270 of the encoder 20. The entropy decoding unit 304 may be further configured to provide inter prediction parameters, intra prediction parameters and/or other syntax elements to the mode application unit 360 and other parameters to other units of the decoder 30. The video decoder 30 may receive the syntax elements at the video slice level and/or the video block level. In addition or as an alternative to slices and respective syntax elements, tile groups and/or tiles and respective syntax elements may be received and/or used.
The inverse quantization unit 310 may be configured to receive quantization parameters (QP) (or in general information related to the inverse quantization) and quantized coefficients from the encoded picture data 21 (for example, by parsing and/or decoding, for example, by entropy decoding unit 304) and to apply based on the quantization parameters an inverse quantization on the decoded quantized coefficients 309 to obtain dequantized coefficients 311 , which may also be referred to as transform coefficients 311. The inverse quantization process may include use of a quantization parameter determined by video encoder 20 for each video block in the video slice (or tile or tile group) to determine a degree of quantization and, likewise, a degree of inverse quantization that should be applied.
Inverse transform processing unit 312 may be configured to receive dequantized coefficients 311 , also referred to as transform coefficients 311 , and to apply a transform to the dequantized coefficients 311 in order to obtain reconstructed residual blocks 213 in the sample domain. The reconstructed residual blocks 213 may also be referred to as transform blocks 313. The transform may be an inverse transform, for example, an inverse DCT, an inverse DST, an inverse integer transform, or a conceptually similar inverse transform process. The inverse transform processing unit 312 may be further configured to receive transform parameters or corresponding information from the encoded picture data 21 (for example, by parsing and/or decoding, for example, by entropy decoding unit 304) to determine the transform to be applied to the dequantized coefficients 311.
The reconstruction unit 314 (for example, adder or summer 314) may be configured to add the reconstructed residual block 313, to the prediction block 365 to obtain a reconstructed block 315 in the sample domain, for example, by adding the sample values of the reconstructed residual block 313 and the sample values of the prediction block 365.
The loop filter unit 320 (either in the coding loop or after the coding loop) is configured to filter the reconstructed block 315 to obtain a filtered block 321, for example, to smooth pixel transitions, or otherwise improve the video quality. The loop filter unit 320 may include one or more loop filters such as a de-blocking filter, a sample-adaptive offset (sample-adaptive offset, SAO) filter or one or more other filters, for example, an adaptive loop filter (adaptive loop filter, ALF), a noise suppression filter (noise suppression filter, NSF), or any combination thereof. In an example, the loop filter unit 320 may include a de-blocking filter, a SAO filter and an ALF filter. The order of the filtering process may be the deblocking filter, SAO and ALF. In another example, a process called the luma mapping with chroma scaling (luma mapping with chroma scaling, LMCS) (namely, the adaptive in-loop reshaper) is added. This process is performed before deblocking. In another example, the deblocking filter process may be also applied to internal sub-block edges, for example, affine sub-block edges, ATMVP sub-block edges, sub-block transform (sub-block transform, SBT) edges and intra sub-partition (intra sub-partition, ISP) edges. Although the loop filter unit 320 is shown as an in-loop filter in Fig. 7, in another implementation, the loop filter unit 320 may be implemented as a post-loop filter.
The decoded video blocks 321 of a picture are then stored in decoded picture buffer 330, which stores the decoded pictures 331 as reference pictures for subsequent motion compensation for other pictures and/or for output respectively display.
The decoder 30 is configured to output the decoded picture 331, for example, via the output 332, for presentation or viewing to a user.
The mode application unit 360 may include an inter prediction unit 344 and an intra prediction unit 354. The inter prediction unit 344 may include a motion estimation unit and a motion compensation unit (not shown).
Embodiments of the video decoder 30 as shown in Fig. 7 may be configured to partition and/or decode the picture by using slices (also referred to as video slices), wherein a picture may be partitioned into or decoded using one or more slices (typically non-overlapping). Each slice may include one or more blocks (for example, CTUs) or one or more groups of blocks (for example, tiles (H.265/HEVC and WC) or bricks (WC)).
Embodiments of the video decoder 30 as shown in Fig. 7 may be configured to partition and/or decode the picture by using slices/tile groups (also referred to as video tile groups) and/or tiles (also referred to as video tiles), where a picture may be partitioned into or decoded using one or more slices/tile groups (typically non-overlapping), and each slice/tile group may include, for example, one or more blocks (for example, CTUs) or one or more tiles, where each tile, for example, may be of rectangular shape and may include one or more blocks (for example, CTUs), for example, complete or fractional blocks.
Other variations of the video decoder 30 can be used to decode the encoded picture data 21. For example, the decoder 30 may generate an output video stream without processing by the loop filter unit 320. For example, a non-transform based decoder 30 may inverse-quantize a residual signal directly without the inverse transform processing unit 312 for some blocks or frames. In another implementation, the video decoder 30 may have the inverse quantization unit 310 and the inverse transform processing unit 312 combined into a single unit.
Although embodiments of the present disclosure have been primarily described based on video coding, it should be noted that embodiments of the coding system 10, the encoder 20 and the decoder 30 and the other embodiments described herein may also be configured for still picture processing or coding, that is, the processing or coding of an individual picture independent of any preceding or consecutive picture as in video coding. In general, only the inter prediction units 244 (encoder) and 344 (decoder) may not be available in case the picture processing or coding is limited to a single picture 17. All other functionalities (also referred to as tools or technologies) of the video encoder 20 and the video decoder 30 may equally be used for still picture processing, for example, residual calculation 204/304, transform 206, quantization 208, inverse quantization 210/310, (inverse) transform 212/312, partitioning 262/362, intra-prediction 254/354, and/or loop filtering 220/320, and entropy coding 270 and entropy decoding 304.
One or more blocks within the exemplary scheme of the hybrid encoder explained above with reference to Figs. 6 and 7 may be implemented by processing by neural networks. For example, the inter prediction unit 244 and/or the intra prediction unit 254 may involve processing by neural networks.
Multi-scale reconstruction networks in video coding
Image data, which is decoded from a bitstream, may be processed by a neural network to obtain enhanced decoded image data having an improved quality compared to the image data without enhancement. Such processing is performed, for example, by a so-called multi-scale reconstruction network.
At a decoding side including such a multi-scale reconstruction network, at least two representations of the same image data are processed by a neural network to obtain reconstructed image data. Such processing is shown exemplarily in the flowchart of Fig. 8 and in the schematic drawing of Fig. 10.
A first representation 1010 is decoded S810 from the bitstream. A second representation 1011 of the image data is obtained S820. For example, the second representation 1011 may be decoded from the bitstream. A second representation 1011 being decoded from the bitstream may have a smaller resolution and/or a different bitrate than the first representation 1010. The bitrate of a representation encoded in the bitstream may be adjusted by the selection of quantization parameters in the encoding.
For example, the second representation is obtained by decoding said second representation from the bitstream. The second representation 1011 may have a smaller resolution and possibly a different bitrate in the bitstream than the first representation 1010. For example the second representation 1011 may have the same resolution and a different bitrate than the first representation 1010. For example, the second representation 1011 may have a different (i.e. smaller) resolution than the first representation 1010.
In an exemplary implementation, the first representation 1010 of the image data has a resolution of H1 x W1, wherein H1 represents the height of the first representation 1010 and W1 represents the width of the first representation 1010. The resolution of the second representation 1011 is H2 x W2, wherein the height H2 and the width W2 of the second representation 1011 are smaller than the height and the width of the first representation 1010: H2 < H1 and W2 < W1. For example, the resolution of the second representation 1011 may be H1/2 x W1/2.
The present invention is not limited to two representations of the same image data. Any plurality of representations of the same image data including at least the first and the second representation may be used. For example, a third representation included in the plurality of representations may be obtained similarly to the second representation, i.e. by decoding from the bitstream, by downscaling of the first representation, or the like. The third representation has a resolution H3 x W3, which may be smaller than the resolution of the first and the second representation. In the case when the third representation is decoded from the bitstream, the encoded third representation may have a bitrate different from a bitrate of the encoded first representation or from a bitrate of the encoded second representation.
In an exemplary implementation, the first representation 1010 may have a resolution of H1 x W1, the second representation 1011 may have a resolution of H1/2 x W1/2, and the third representation may have a resolution of H1/4 x W1/4.
The decoding of the first and the second representation from the bitstream may include a splitting of the bitstream into at least two subbitstreams. The first and the second representation may then be decoded from a respective subbitstream. The decoding of the respective subbitstreams may be performed sequentially or in parallel. Moreover, the decoding of the respective subbitstreams may be performed by a single decoder or by multiple decoders.
Such decoding is exemplarily shown in Figs. 11a and 11b. In a first exemplary implementation as shown in Fig. 11a, a bitstream 1130 is decoded using multiple decoders 1141, 1142 and 1143. Said multiple decoders may be multiple instances of a same decoder or may be different decoders. A second exemplary implementation shown in Fig. 11b uses a single decoder 1140 to decode representations from the bitstream 1131. Alternatively, the decoding may comprise decoding the bitstream and splitting the decoded bitstream into at least the first and the second representation. The decoding of the bitstream may be, for instance, performed by the single decoder 1140 of the second exemplary implementation.
Any above-mentioned decoding may be performed by a decoder of an autoencoding neural network, which is explained in detail in section Autoencoders and unsupervised learning, or by a hybrid block-based decoder as explained in detail in section Hybrid block-based encoder and decoder.
For example, the obtaining of a second representation 1011 may include a downscaling of the first representation 1010. A second representation 1011 that is obtained by downscaling the first representation 1010 has a smaller resolution than the first representation 1010. A downscaling may be performed by downsampling (i.e. leaving out a plurality of pixels of the first representation), by applying a filtering, which results in a lower resolution, or the like. This is depicted in a third exemplary implementation in Fig. 11c. In the third exemplary implementation, an exemplary first representation 1150 is decoded from the bitstream 1132 by a decoder 1140. An exemplary second representation 1151 is generated from the exemplary first representation 1150, the exemplary second representation 1151 having a lower resolution than the exemplary first representation 1150. From the exemplary second representation 1151 an exemplary third representation 1152 is generated having a lower resolution than the exemplary second representation 1151. The generating includes a downscaling, e.g. by leaving out a plurality of pixels.
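A minimal sketch of such a downscaling, assuming a PyTorch-style tensor layout, is given below; the function name, the bicubic interpolation mode and the chosen scale factors are illustrative assumptions and not mandated by the described method.

```python
# Hypothetical sketch: deriving lower-resolution representations from the first
# decoded representation by downscaling (here bicubic interpolation).
import torch
import torch.nn.functional as F

def build_representations(first_rep: torch.Tensor, factors=(2, 4)):
    """first_rep: decoded image tensor of shape (N, C, H, W).
    Returns [first_rep, downscaled by 2, downscaled by 4, ...]."""
    reps = [first_rep]
    _, _, h, w = first_rep.shape
    for f in factors:
        reps.append(F.interpolate(first_rep, size=(h // f, w // f),
                                  mode="bicubic", align_corners=False))
    return reps

# Usage: three representations at H x W, H/2 x W/2 and H/4 x W/4.
reps = build_representations(torch.rand(1, 3, 256, 256))
```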
The first representation 1010 is processed S830 by applying a first neural network 1020 (first feature extraction network). The processing generates a first latent representation and a second latent representation. The first feature extraction network 1020 may include at least a plurality of downsampling (compressing) layers 1021, a bottleneck layer 1022 and a plurality of upsampling (decompressing) layers 1023 to 1025. A downsampling layer out of the plurality of downsampling layers, which reduce the spatial resolution and correspondingly increase the number of channels, may have a structure as shown for example in Fig. 12a. Such an exemplary downsampling layer receives an input of (height) x (width) x (number of channels) = H x W x C and generates an output of dimension H/2 x W/2 x 2C. The exemplary downsampling layer in Fig. 12a includes three convolutional layers, each of said convolutional layers being followed by a LeakyReLU.
An upsampling layer out of the plurality of upsampling layers, which increase the spatial resolution and correspondingly decrease the number of channels, may have a structure as shown for example in Fig. 12b. Such an exemplary upsampling layer receives an input of (height) x (width) x (number of channels) = H x W x C and generates an output of dimension 2H x 2W x C/2. The exemplary upsampling layer in Fig. 12b includes three convolutional layers, each of said convolutional layers being followed by a LeakyReLU.
The output of such an upsampling or downsampling layer is a so-called latent representation.
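The following sketch illustrates one possible realization of such layers, assuming PyTorch; the kernel sizes, the stride-2 convolution for downscaling, the transposed convolution for upscaling and the LeakyReLU slope are assumptions, since Figs. 12a and 12b only fix the input/output dimensions and the number of convolutions.

```python
import torch
import torch.nn as nn

class DownLayer(nn.Module):
    """H x W x C -> H/2 x W/2 x 2C; three convolutions, each followed by LeakyReLU."""
    def __init__(self, c: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c, 2 * c, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(2 * c, 2 * c, 3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(2 * c, 2 * c, 3, padding=1), nn.LeakyReLU(0.2),
        )
    def forward(self, x):
        return self.body(x)

class UpLayer(nn.Module):
    """H x W x C -> 2H x 2W x C/2; three convolutions, each followed by LeakyReLU."""
    def __init__(self, c: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.ConvTranspose2d(c, c // 2, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(c // 2, c // 2, 3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(c // 2, c // 2, 3, padding=1), nn.LeakyReLU(0.2),
        )
    def forward(self, x):
        return self.body(x)

x = torch.rand(1, 16, 64, 64)
print(DownLayer(16)(x).shape)   # torch.Size([1, 32, 32, 32])
print(UpLayer(16)(x).shape)     # torch.Size([1, 8, 128, 128])
```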
In an exemplary implementation, the first latent representation is an output of the bottleneck layer 1022 of the first feature extraction network 1020. The second latent representation is an output of a layer of the first feature extraction network 1020 different from the bottleneck layer 1022. For example, the second latent representation is an output of the upsampling part of the first feature extraction network 1020. The second latent representation may be an output of a first upsampling layer 1023 in the first feature extraction network 1020 directly following the bottleneck layer 1022.
The second representation 1011 is processed S840 by applying a second neural network 1030 (second feature extraction network). The processing generates a third latent representation and a fourth latent representation. Similarly to the first feature extraction network, the second feature extraction network 1030 may include at least a plurality of downsampling layers 1031 , a bottleneck layer 1032 and a plurality of upsampling layers 1033 to 1035.
The second representation 1011 out of the two exemplary representations may be processed to obtain a same resolution as the first representation before applying the second neural network. Such processing may involve an interpolation, for example a bicubic upsampling, to obtain the same resolution as the first representation. The processed second representation 1012 is provided as input to a second exemplary feature extraction network 1030.
In an example, the third latent representation is an output of the bottleneck layer of the second feature extraction network. The fourth latent representation is an output of a layer of the second feature extraction network different from the bottleneck layer 1032. For example, the fourth latent representation is an output of the upsampling part of the second feature extraction network 1030. The fourth latent representation may be an output of a first upsampling layer 1033 in the second feature extraction network 1030 directly following the bottleneck layer 1032.
The first and the third latent representation are processed S850 by a combining processing, which combines said two representations. The combining processing may include concatenating 1040 of the first latent representation and the third latent representation. The combining processing may include a processing by a fully-connected neural network 1041. For example, such a fully-connected neural network 1041 may be a multilayer perceptron. The combining processing may include both the concatenating 1040 and the processing by a fully-connected neural network 1041.
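A minimal sketch of such a combining processing is shown below, assuming PyTorch; the concatenation axis, the flattening step and the MLP sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

def combine(latent1: torch.Tensor, latent3: torch.Tensor, mlp: nn.Module) -> torch.Tensor:
    """Concatenate the first and third latent representations and apply an MLP.
    latent1, latent3: bottleneck outputs of shape (N, C, h, w)."""
    z = torch.cat([latent1, latent3], dim=1)          # (N, 2C, h, w)
    z = z.flatten(start_dim=1)                        # (N, 2C*h*w)
    return mlp(z)

# Hypothetical dimensions: bottleneck latents of shape (1, 128, 4, 4).
mlp = nn.Sequential(nn.Linear(2 * 128 * 4 * 4, 512), nn.LeakyReLU(0.2),
                    nn.Linear(512, 512))
combined = combine(torch.rand(1, 128, 4, 4), torch.rand(1, 128, 4, 4), mlp)
print(combined.shape)  # torch.Size([1, 512])
```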
An obtaining S860 of reconstructed image data 1060 includes the processing by a generation neural network 1050. The generation neural network 1050 is applied to an output of the combining processing, the second latent representation and the fourth latent representation. Such a generation neural network may be a generative adversarial network (GAN), which is explained above in section Artificial neural networks.
The applying of the first feature extraction network may generate a first set of latent representations (first set of extracted features). Said first set includes the second latent representation. Similarly, the applying of the second feature extraction network may generate a second set of latent representations (second set of extracted features), which includes the fourth latent representation. Said first and second set may be provided as additional input to the generation neural network.
As already indicated above, there may be more than two representations of the same image data. In particular, N representations may be obtained, wherein N is an integer greater than two and the N representations include the first representation and the second representation.
As explained in detail above, the first representation is decoded from the bitstream. A j-th representation out of the N representations, j being two to N, may be obtained by either decoding said j-th representation from the bitstream or by downscaling of the first representation.
For instance, each of the N representations may be decoded from the bitstream. For example, each j-th representation out of the N representations, j being two to N, may be obtained by a downscaling of the first representation. For example, a subset of representations including the first representation may be decoded from the bitstream, and the remaining representations out of the N representations, which are not included in said subset, may be obtained by downscaling.
Each i-th of the N representations, i being an integer in the range one to N, may be processed by applying an i-th neural network (i-th feature extraction network). Said applying of the i-th neural network includes generating a (2i-1)-th latent representation and a (2i)-th latent representation. In other words, there are N neural networks (feature extraction networks), including the first neural network and the second neural network. Each of said N networks is applied to one of the N representations, respectively. The applying of each of said N feature extraction networks includes the generating of at least two latent representations. One of the at least two latent representations is an output of the bottleneck layer of the respective feature extraction network.
Each of the exemplary feature extraction networks 1020 and 1030 in Fig. 10 includes a plurality of downsampling layers 1021 and 1031, a bottleneck layer 1022 and 1032, and a plurality of upsampling layers 1023 to 1025 and 1033 to 1035, respectively.
For example, the i-th neural network includes Ki downsampling layers, a bottleneck layer, and Ki upsampling layers, Ki being an integer greater than zero. For example, each of the i-th neural networks may include a same number of downsampling/upsampling layers, i.e. Ki = K for all i from one to N.
The (2i-1)-th latent representation is an output of the bottleneck layer of the i-th neural network. For example, the (2i-1)-th latent representation may be an output of the bottleneck layer 1022 or of the bottleneck layer 1032 for i=1 or 2, respectively. The (2i)-th latent representation is an output of a layer in the decompressing (upsampling) part of the i-th neural network, i.e. out of the Ki upsampling layers. For example, the (2i)-th latent representation may be an output of the first layer 1023 or 1033 of the upsampling part for i=1 or 2, respectively, directly following the bottleneck layer 1022 or 1032.
Analogously to what was explained above in detail for the first and the second feature extraction network, each i-th out of the N feature extraction networks may output an i-th set of latent representations, including the (2i)-th latent representation. Each of the latent representations in said set is an output of a layer within the upsampling part of the i-th neural network.
In an exemplary implementation, the generation neural network 1050 includes at least two blocks 1051 to 1053, wherein a block is a neural subnetwork including at least one layer. A first block 1051 of the generation neural network 1050 receives as input the output of the combining processing and each of the (2i)-th latent representations.
A second block 1052 receives as input one or more of (i) an output of the first block, (ii) the output of the combining processing, and (iii) for each i-th neural network, a latent representation output of a second layer out of the Ki up-sampling layers in the i-th neural network, the second layer following the first layer within the Ki up-sampling layers. Said first layer (e.g. the first upsampling layer 1023 in the first feature extraction network) outputs the (2i)-th latent representation, which is input to the first block 1051. In Fig. 10, examples for such second layers are the second layer 1024 in the first feature extraction network 1020 and the second layer 1034 in the second feature extraction network 1030. In an example shown in Fig. 13, the output 1310 of every generator block is split into three parts (P0, P1, P2), for example in proportion (0.5, 0.25, 0.25). The respective outputs of the first feature extraction network 1311 and the second feature extraction network 1312 are processed by convolutional layers 1320 and 1321, respectively, which generate style scale coefficients (scale1, scale2) and style shift coefficients (shift1, shift2). P1 is scaled with scale1 1330 and shifted with shift1 1340. P2 is scaled with scale2 1331 and shifted with shift2 1341: P'1 = P1 * scale1 + shift1 and P'2 = P2 * scale2 + shift2. The next generator block receives the output 1350 (P0, P'1, P'2) as an input.
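The split/scale/shift of Fig. 13 may be sketched as follows, assuming PyTorch; the channel counts, the convolution shapes producing the style coefficients and the channel-wise split are illustrative assumptions.

```python
import torch
import torch.nn as nn

def style_modulate(block_out, feat1, feat2, conv1, conv2):
    """block_out: generator block output (N, C, H, W), C divisible by 4.
    feat1, feat2: latent outputs of the two feature extraction networks, spatially
    matching block_out; conv1/conv2 map them to 2*(C//4) channels (scale and shift)."""
    c = block_out.shape[1]
    p0, p1, p2 = torch.split(block_out, [c // 2, c // 4, c // 4], dim=1)
    scale1, shift1 = torch.chunk(conv1(feat1), 2, dim=1)   # style coefficients
    scale2, shift2 = torch.chunk(conv2(feat2), 2, dim=1)
    p1 = p1 * scale1 + shift1     # P'1 = P1 * scale1 + shift1
    p2 = p2 * scale2 + shift2     # P'2 = P2 * scale2 + shift2
    return torch.cat([p0, p1, p2], dim=1)  # input to the next generator block

# Hypothetical shapes: block output with 64 channels, features with 32 channels.
conv1 = nn.Conv2d(32, 2 * 16, 3, padding=1)
conv2 = nn.Conv2d(32, 2 * 16, 3, padding=1)
out = style_modulate(torch.rand(1, 64, 32, 32),
                     torch.rand(1, 32, 32, 32), torch.rand(1, 32, 32, 32),
                     conv1, conv2)
```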
The present invention is not limited to two blocks in the generation neural network 1050. The generation neural network 1050 may include M blocks, where M is an integer greater than or equal to two. Said M blocks include the first and the second block. Each q-th block of the M blocks, q being two to M, receives as input one or more of (i) an output of the (q-1)-th block, (ii) the output of the combining processing, and (iii) for each i-th neural network, a latent representation output (intermediate image features) of a q-th layer out of the Ki up-sampling layers in the i-th neural network. Said latent representation outputs of each of the q-th layers in the i-th neural network are included in the i-th set of latent representations.
In other words, the output 1350 of each block of the generation neural network may be an input to a directly following block. Moreover, each block may receive as input a latent representation of a corresponding layer within the upsampling part of each of the N feature extraction networks. For example, a third block 1053 receives as input the latent representation output of the third layer 1025 of the first neural network 1020, the output of the third layer 1035 of the second neural network 1030 and possibly further outputs of third layers within the remaining neural networks out of the N neural networks (not shown in Fig. 10).
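A sketch of such block chaining is given below, assuming PyTorch; the block internals (here a toy upsample-and-fuse block), the channel counts and the assumption that the combined latent has been reshaped to a small spatial tensor are illustrative and not prescribed by the description above.

```python
import torch
import torch.nn as nn

class GenBlock(nn.Module):
    """Toy generator block: upsample, then fuse the per-network skip latents."""
    def __init__(self, in_c, skip_c, out_c, n_nets):
        super().__init__()
        self.fuse = nn.Conv2d(in_c + n_nets * skip_c, out_c, 3, padding=1)

    def forward(self, x, skips):
        x = nn.functional.interpolate(x, scale_factor=2, mode="nearest")
        skips = [nn.functional.interpolate(s, size=x.shape[-2:], mode="nearest")
                 for s in skips]
        return torch.relu(self.fuse(torch.cat([x] + skips, dim=1)))

class GenerationNet(nn.Module):
    """Chains M blocks; block q also receives the latent from the q-th upsampling
    layer of each of the N feature extraction networks."""
    def __init__(self, blocks):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)

    def forward(self, combined, latents_per_net):
        # latents_per_net[i][q]: output of the q-th upsampling layer of network i
        x = combined
        for q, block in enumerate(self.blocks):
            x = block(x, [lat[q] for lat in latents_per_net])
        return x  # reconstructed image data

gen = GenerationNet([GenBlock(64, 32, 64, n_nets=2) for _ in range(3)])
latents = [[torch.rand(1, 32, 8, 8), torch.rand(1, 32, 16, 16), torch.rand(1, 32, 32, 32)]
           for _ in range(2)]
out = gen(torch.rand(1, 64, 4, 4), latents)   # -> (1, 64, 32, 32)
```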
Before processing the N representations, a corresponding subframe in each of the N representations may be obtained. The processing of the i-th representation may comprise applying the i-th neural network to the corresponding subframe of the i-th representation. For example, the subframe may include a face crop 1440. As shown in Fig. 14, such a face crop 1440 may be obtained from a representation 1410 by applying a face detection network 1420, which identifies a face within the representation 1430. Such a face detection network is a neural network trained to recognize faces in image data.
A corresponding subframe in each of the N representations relates to a same part in the image data represented by each of the representations. For example, the corresponding subframe includes the face crop of a face in each of the representations. In the case when subframes including faces are processed, the generation neural network is trained on a large dataset of faces to generate high-quality faces from input noise using input subframes having different resolutions.
The bitstream may include, in addition to one or more representations encoded in the bitstream, one or more parameters. Said one or more parameters indicate target bitrates of one or more respective representations encoded in the bitstream. This is shown exemplarily in Fig. 15, where the input bitstream includes three encoded representations R1, R2 and R4, together with so-called split parameters B1, B2 and B4 indicating said target bitrates. A configuration of at least one of the first neural network, the second neural network, and the generation neural network may be selected based on the one or more parameters.
For example, a set of weights within at least one of the first neural network, the second neural network and the generation neural network is selected based on the one or more parameters.
For example, each of the one or more parameters included in the bitstream indicates a target bitrate of a representation having a predefined resolution. A mapping of a resolution to a corresponding parameter may be defined by a standard or may be defined at the encoding side and being transmitted to the decoding side.
An exemplary implementation is shown in Fig. 15, where three representations R1, R2 and R4 having resolutions of H x W, H/2 x W/2 and H/4 x W/4, respectively, may be encoded into the bitstream. The split parameters B1, B2 and B4 indicate the target bitrates of the encoded representations. The value of such a target bitrate may be zero, i.e. the representation corresponding to said target bitrate is not included in the bitstream. In the case that a target bitrate has a value greater than zero, the corresponding representation is decoded from the bitstream as explained above, and all of the decoded representations are processed by the multi-scale reconstruction network including at least the first neural network, the second neural network and the generation network to obtain a reconstructed output video (reconstructed image data).
The split parameters may be used to select the configuration of the networks within the multiscale reconstruction network. For example, the number N of feature extraction networks depends on the number of split parameters, which are greater than zero. Thus, the configuration of the generation neural network may depend on the number N of feature extraction networks. For example, the structure and/or weights of the two or more feature extraction networks may be selected according to the resolutions and/or bitrates of the representations encoded in the bitstream.
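A minimal sketch of such a configuration selection is given below; the dictionary-based representation of the split parameters and the weight-set naming are purely hypothetical.

```python
# Hypothetical sketch: selecting the multi-scale network configuration from split
# parameters decoded from the bitstream (zero means "representation not present").
def select_configuration(split_params: dict) -> dict:
    """split_params maps a downscale factor (1, 2, 4, ...) to its target bitrate."""
    active = [f for f, b in sorted(split_params.items()) if b > 0.0]
    return {
        "num_feature_extraction_nets": len(active),   # N
        "active_scales": active,                      # which representations to decode
        # e.g. a key used to look up a pre-trained weight set for this combination
        "weight_set_id": "msnn_" + "_".join(str(f) for f in active),
    }

print(select_configuration({1: 0.5, 2: 0.25, 4: 0.25}))
print(select_configuration({1: 0.5, 2: 0.5, 4: 0.0}))
```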
Encoding of representations of image data based on split parameters
A multi-scale reconstruction network as described above processes two or more representations of the same image data. Said representations may be decoded from a bitstream.
The encoding of image data into a bitstream is exemplarily shown in the flowchart in Fig. 9 and in the schematic drawings in Figs. 11a and b. A set of parameters is obtained S910 based on a cost function. The parameters (split parameters) within the set of parameters indicate target bitrates. Based on said set of parameters two or more representations of the image data are obtained S920.
For example, the two or more representations may represent different resolutions of the image data. In an exemplary implementation, a first representation 1110 out of the two or more representations represents an original resolution of the image data and a second representation 1111 out of the two or more representations represents a downscaled resolution of the image data.
A downscaling may be performed by downsampling (i.e. leaving out a plurality of pixels of the first representation), by applying a filtering, which results in a lower resolution, or the like.
For example, the first representation 1110 has a resolution of H1 x W1, wherein H1 represents the height of the first representation of the image data and W1 represents the width of the first representation of the image data. The resolution of the second representation 1111 is H2 x W2, wherein the height H2 and the width W2 of the second representation 1111 may be smaller than the height and the width of the first representation 1110: H2 < H1 and W2 < W1. For example, the resolution of the second representation 1111 may be H1/2 x W1/2.
The present invention is not limited to two representations of the same image data. Any plurality of representations of the same image data including at least the first and the second representation may be used. For example, a third representation 1112 included in the plurality of representations may be obtained according to the set of parameters. The third representation 1112 has a resolution H3 x W3, which may be smaller than the resolution of the first and the second representation. In an exemplary implementation, the first representation 1110 may have a resolution of H1 x W1, the second representation 1111 may have a resolution of H1/2 x W1/2, and the third representation 1112 may have a resolution of H1/4 x W1/4.
A first parameter out of the set of parameters may indicate a target bitrate for the first representation 1110 representing an original resolution of the image data. A second parameter out of the set of parameters may indicate a target bitrate for the second representation 1111 representing a downscaled resolution of the image data.
The target bitrates indicated by the parameters for the respective representations within the set of parameters may add up to a target bitrate of the bitstream. For example, the values of the parameters (B1, B2, B3) may correspond to the target bitrates for a first, second and third representation, respectively. The target bitrate of the bitstream in this example is B = B1 + B2 + B3, where B1, B2, B3 > 0. For example, the values of the parameters (B1, B2, B3) may correspond to the fraction of the target bitrate for a representation compared to the target bitrate of the bitstream, i.e. B1 + B2 + B3 = 1, where B1, B2, B3 > 0. For example, the values of the parameters may be represented by 32-bit floating point numbers.
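For instance, interpreting the split parameters as fractions, the absolute per-representation target bitrates for a given total bitrate could be derived as in the following sketch (values are illustrative):

```python
def absolute_bitrates(total_bitrate_bps, fractions):
    """fractions (B1, B2, B3, ...) are expected to sum to one."""
    assert abs(sum(fractions) - 1.0) < 1e-6
    return [total_bitrate_bps * b for b in fractions]

# e.g. a 4 Mbit/s bitstream split as (0.5, 0.25, 0.25) over three representations
print(absolute_bitrates(4_000_000, [0.5, 0.25, 0.25]))  # [2000000.0, 1000000.0, 1000000.0]
```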
The two or more representations are encoded S930 according to the set of parameters. For example, an encoder used in the encoding may select a quantization parameter for the encoding of a representation to obtain a target bitrate corresponding to said representation. The quantization in encoding is explained in detail above for the hybrid block-based encoder and for the autoencoder. Said target bitrate is indicated by a parameter within the set of parameters.
The encoding of the respective representations may be performed sequentially or in parallel. For example, each representation may be encoded into a respective subbitstream, thereby generating two or more subbitstreams. Each of the subbitstreams may be included into a bitstream. Moreover, the encoding of the respective subbitstreams may be performed by a single encoder or by multiple encoders. Furthermore, the set of parameters may be included into the bitstream.
Such encoding is exemplarily shown in Figs. 11a and 11b. In a first exemplary implementation as shown in Fig. 11a, each representation 1110, 1111 and 1112 is encoded by a respective encoder 1121, 1122 or 1123. Said respective encoders may be multiple instances of a same encoder or may be different encoders. These encoders may encode the two or more representations in parallel. A second exemplary implementation shown in Fig. 11b uses a single encoder 1120 to encode the two or more representations into the bitstream 1131. Any above-mentioned encoding may be performed by an encoder of an autoencoding neural network, which is explained in detail in section Autoencoders and unsupervised learning, or by a hybrid block-based encoder as explained in detail in section Hybrid block-based encoder and decoder.
An encoder used in the encoding may select a quantization parameter (QP) for the encoding of a representation to obtain a target bitrate indicated by the parameter within the set of parameters corresponding to said representation. QPs for the downscaled representations and the original representation are chosen such that the total bitrate of the downscaled representations and the original representation is equal to the target bitrate of the bitstream. For example, a QP is selected based on prior knowledge of how the QP affects the bitrate.
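One possible sketch of such a QP selection is shown below; the rate table is a hypothetical stand-in for the prior knowledge of the rate-QP relationship, and real encoders typically use rate-control models instead.

```python
# Hypothetical sketch: choosing a QP for one representation so that its expected
# bitrate does not exceed the target bitrate assigned by the split parameter.
def select_qp(target_bitrate_bps, rate_table):
    """rate_table maps QP -> expected bitrate; pick the smallest QP whose
    expected bitrate does not exceed the target (larger QP = coarser quantization)."""
    for qp in sorted(rate_table):                      # ascending QP, descending rate
        if rate_table[qp] <= target_bitrate_bps:
            return qp
    return max(rate_table)                             # fall back to the coarsest QP

rate_table_full_res = {22: 6_000_000, 27: 3_000_000, 32: 1_500_000, 37: 800_000}
print(select_qp(2_000_000, rate_table_full_res))       # 32
```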
Fig. 16 shows an exemplary implementation of encoding representations based on split parameters included in the set of parameters. The values of the split parameters (B1, B2, B4) indicate the respective target bitrates of a first representation V with resolution H x W, a second representation V2 having a resolution of H/2 x W/2 and a third representation V4 having a resolution of H/4 x W/4. Said target bitrates add up to the target bitrate of the bitstream. In the example of Fig. 16, the first representation V is encoded with target bitrate B1, if B1 is greater than zero. If B2 is greater than zero, the second representation V2 is obtained by downscaling of the first representation, and the second representation V2 is encoded with target bitrate B2. If B4 is greater than zero, the third representation V4 is obtained by downscaling of the first representation, and the third representation V4 is encoded with target bitrate B4. A bitstream is generated including the first encoded representation R1, the second encoded representation R2, the third encoded representation R4 and the split parameters.
The above-mentioned cost function is based on reconstructed image data. Said reconstructed image data is obtained by processing two or more decoded representations of the same image data. For example, the set of parameters is chosen by optimizing (i.e. minimizing or maximizing) the cost function.
For example, there may be a predefined plurality of sets of parameters. For each current set of parameters out of the predefined plurality of sets of parameters, two or more representations of the image data are obtained based on the current set of parameters. Said two or more representations are encoded into a bitstream according to the current set of parameters. The two or more representations are then decoded from the bitstream, and reconstructed image data based on the two or more decoded representations is obtained. The obtaining may include the processing by a multi-scale reconstruction network as described in detail above. For each current set of parameters, the quality of the reconstructed image data is determined.
Such a determination may include applying a video metric to the reconstructed image data. Such a video metric may be an objective metric of reconstructed videos, for example, the peak signal-to-noise ratio (PSNR), the multi-scale structural similarity index (MS-SSIM), or the like.
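As a minimal example of such an objective metric, PSNR can be computed as follows (a sketch assuming 8-bit image arrays):

```python
import numpy as np

def psnr(reconstructed, original, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two images of the same shape."""
    mse = np.mean((reconstructed.astype(np.float64) - original.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)
```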
In this exemplary implementation, a set of parameters is selected out of the predefined plurality of sets based on the cost function including the determined quality. For example, an optimizing of the cost function may select the set of parameters leading to the highest quality of reconstructed image data within the plurality of sets of parameters.
For example, in the implementation of Fig. 16 with a bitrate B of the bitstream including two or more of the exemplary representations V, V2 and V4, the plurality of the sets of parameters may include one or more of the following options for the target bitrates of the respective representations:
- B/2 for representation V, B/2 for representation V2,
- B/2 for representation V2, B/2 for representation V,
- B/2 for representation V, B/4 for representation V2, B/4 for representation V4,
- B/4 for representation V, B/2 for representation V2, B/4 for representation V4,
- B/4 for representation V, B/4 for representation V2, B/2 for representation V4.
However, note that these options provide merely an example. The present invention is not limited to this example. The plurality of sets of parameters may include other options and/or additional resolutions, and/or the like.
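A sketch of this selection by exhaustive search is given below; encode_decode_reconstruct() and quality_metric() are hypothetical placeholders for the encoding/decoding chain and the video metric described above.

```python
# Hypothetical sketch of selecting the split parameters from a predefined plurality
# of candidate sets; the best-scoring set according to the cost function is kept.
def select_split_parameters(image_data, candidate_sets,
                            encode_decode_reconstruct, quality_metric):
    best_set, best_quality = None, float("-inf")
    for params in candidate_sets:
        reconstructed = encode_decode_reconstruct(image_data, params)
        q = quality_metric(reconstructed, image_data)   # cost function: maximize quality
        if q > best_quality:
            best_set, best_quality = params, q
    return best_set

# e.g. candidate_sets = [(0.5, 0.5, 0.0), (0.5, 0.25, 0.25), (0.25, 0.5, 0.25), ...]
```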
Alternatively, a set of parameters may be obtained by processing the image data by one or more layers of a neural network to output the set of parameters. Said neural network is trained by optimizing the cost function. For example, as shown schematically in Fig. 17, such a neural network 1740 may receive as input one or more of (i) the image data 1710, (ii) a target bitrate of the bitstream, (iii) an indication of complexity 1720 of the image data, and (iv) an indication of motion intensity 1730 of the image data. Based on one or more of these inputs, the neural network 1740 outputs the set of parameters 1750.
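A minimal sketch of such a parameter-estimating network is shown below, assuming PyTorch; the architecture, the pooled image features and the softmax normalization (so that the split parameters sum to one) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SplitParamNet(nn.Module):
    """Maps (pooled image features, target bitrate, complexity, motion intensity)
    to a set of split parameters that sums to one (fractions of the target bitrate)."""
    def __init__(self, num_representations: int = 3):
        super().__init__()
        self.features = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                                      nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Sequential(nn.Linear(16 + 3, 64), nn.ReLU(),
                                  nn.Linear(64, num_representations))

    def forward(self, image, target_bitrate, complexity, motion_intensity):
        f = self.features(image)                                                # (N, 16)
        side = torch.stack([target_bitrate, complexity, motion_intensity], dim=1)  # (N, 3)
        return torch.softmax(self.head(torch.cat([f, side], dim=1)), dim=1)

net = SplitParamNet()
params = net(torch.rand(2, 3, 128, 128), torch.rand(2), torch.rand(2), torch.rand(2))
print(params.sum(dim=1))  # each row sums to 1
```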
The two or more representations are encoded into the bitstream based on the selected or obtained set of parameters.

Some exemplary implementations in hardware and software
As mentioned above Figs. 11a-c provide exemplary implementations of the encoding and/or decoding as described above.
In the first exemplary implementation as shown in Fig. 11a, a first representation 1110 of the image data, a second (downscaled) representation 1111 of the same image data and a third (further downscaled) representation 1112 of the same image data are encoded by respective encoders 1121 , 1122 and 1123 into a bitstream 1130. The encoders may be multiple instances of a same encoder or may be different encoders. The encoders may perform the encoding in parallel or sequentially.
The bitstream 1130 is decoded using multiple decoders 1141, 1142 and 1143. Said multiple decoders may be multiple instances of a same decoder or may be different decoders. In the first exemplary implementation, there are three representations having different resolutions. Said representations are input to a multi-scale reconstruction neural network (MSNN) 1160, which includes at least the first feature extraction network, the second feature extraction network and the generation neural network as subnetworks. The MSNN 1160 outputs the reconstructed image data 1170.
In the second exemplary implementation as shown in Fig. 11b, a first representation 1110 of the image data, a second (downscaled) representation 1111 of the same image data and a third (further downscaled) representation 1112 of the same image data are encoded by an encoder 1120 into a bitstream 1130. Said bitstream 1130 is decoded by a decoder 1140. Said representations are input to a multi-scale reconstruction network (MSNN) 1160, which includes at least the first feature extraction network, the second feature extraction network and the generation neural network as subnetworks. The MSNN 1160 outputs the reconstructed image data 1170.
In the third exemplary implementation as shown in Fig. 11c, a first representation 1110 of the image data is encoded by an encoder 1120 into a bitstream 1130. Said bitstream 1130 is decoded by a decoder 1140 to obtain the first decoded representation 1150. The first decoded representation is downscaled to obtain a second representation 1151 and a third representation 1152. Said representations are input to a multi-scale reconstruction network (MSNN) 1160, which includes at least the first feature extraction network, the second feature extraction network and the generation neural network as subnetworks. The MSNN 1160 outputs the reconstructed image data 1170. The present invention is not limited to these exemplary implementations. The encoding and the decoding may be performed and combined as illustrated in Figs. 11a-c. However, any of the above described encodings and any of the above described decodings may be combined with each other or may be combined differently.
Some further implementations in hardware and software are described in the following.
Any of the encoding devices described in the following with reference to Figs. 18-21 may provide means in order to carry out the encoding of image data into a bitstream. A processing circuitry within any of these exemplary devices is configured to: obtain a set of parameters based on a cost function, said parameters indicating target bitrates; obtain two or more representations of the image data based on the set of parameters; and encode the two or more representations into a bitstream according to the set of parameters, wherein the cost function is based on reconstructed image data obtained by processing two or more decoded representations of the same image data according to the method described above.
The decoding devices in any of Figs. 18-21 may contain a processing circuitry, which is adapted to perform the decoding method. The processing circuitry within any of these exemplary devices is configured to: decode a first representation of the image data from the bitstream; obtain a second representation of the image data; process the first representation by applying a first neural network, including generating a first latent representation and a second latent representation; process the second representation by applying a second neural network, including generating a third latent representation and a fourth latent representation; perform a combining processing including combining of the first latent representation and the third latent representation; and obtain reconstructed image data including applying a generation neural network to an output of the combining processing, the second latent representation and the fourth latent representation, according to the method described above.
The corresponding system which may deploy the above-mentioned encoder-decoder processing chain is illustrated in Fig. 18. Fig. 18 is a schematic block diagram illustrating an example coding system, e.g. a video, image, audio, and/or other coding system (or short coding system) that may utilize techniques of this present application. Video encoder 20 (or short encoder 20) and video decoder 30 (or short decoder 30) of video coding system 10 represent examples of devices that may be configured to perform techniques in accordance with various examples described in the present application. For example, the video coding and decoding may employ a neural network, which may be distributed and which may apply the above-mentioned bitstream parsing and/or bitstream generation to convey feature maps between the distributed computation nodes (two or more). As shown in Fig. 18, the coding system 10 comprises a source device 12 configured to provide encoded picture data 21 e.g. to a destination device 14 for decoding the encoded picture data 13.
The source device 12 comprises an encoder 20, and may additionally, i.e. optionally, comprise a picture source 16, a pre-processor (or pre-processing unit) 18, e.g. a picture pre-processor 18, and a communication interface or communication unit 22.
The picture source 16 may comprise or be any kind of picture capturing device, for example a camera for capturing a real-world picture, and/or any kind of a picture generating device, for example a computer-graphics processor for generating a computer animated picture, or any kind of other device for obtaining and/or providing a real-world picture, a computer generated picture (e.g. a screen content, a virtual reality (VR) picture) and/or any combination thereof (e.g. an augmented reality (AR) picture). The picture source may be any kind of memory or storage storing any of the aforementioned pictures.
In distinction to the pre-processor 18 and the processing performed by the pre-processing unit 18, the picture or picture data 17 may also be referred to as raw picture or raw picture data 17.
Pre-processor 18 is configured to receive the (raw) picture data 17 and to perform preprocessing on the picture data 17 to obtain a pre-processed picture 19 or pre-processed picture data 19. Pre-processing performed by the pre-processor 18 may, e.g., comprise trimming, color format conversion (e.g. from RGB to YCbCr), color correction, or de-noising. It can be understood that the pre-processing unit 18 may be an optional component. It is noted that the preprocessing may also employ a neural network (such as in any of Figs. 1 to 7) which uses the presence indicator signaling.
The video encoder 20 is configured to receive the pre-processed picture data 19 and provide encoded picture data 21.
Communication interface 22 of the source device 12 may be configured to receive the encoded picture data 21 and to transmit the encoded picture data 21 (or any further processed version thereof) over communication channel 13 to another device, e.g. the destination device 14 or any other device, for storage or direct reconstruction.
The destination device 14 comprises a decoder 30 (e.g. a video decoder 30), and may additionally, i.e. optionally, comprise a communication interface or communication unit 28, a post-processor 32 (or post-processing unit 32) and a display device 34. The communication interface 28 of the destination device 14 is configured to receive the encoded picture data 21 (or any further processed version thereof), e.g. directly from the source device 12 or from any other source, e.g. a storage device, e.g. an encoded picture data storage device, and provide the encoded picture data 21 to the decoder 30.
The communication interface 22 and the communication interface 28 may be configured to transmit or receive the encoded picture data 21 or encoded data 13 via a direct communication link between the source device 12 and the destination device 14, e.g. a direct wired or wireless connection, or via any kind of network, e.g. a wired or wireless network or any combination thereof, or any kind of private and public network, or any kind of combination thereof.
The communication interface 22 may be, e.g., configured to package the encoded picture data 21 into an appropriate format, e.g. packets, and/or process the encoded picture data using any kind of transmission encoding or processing for transmission over a communication link or communication network.
The communication interface 28, forming the counterpart of the communication interface 22, may be, e.g., configured to receive the transmitted data and process the transmission data using any kind of corresponding transmission decoding or processing and/or de-packaging to obtain the encoded picture data 21.
Both, communication interface 22 and communication interface 28 may be configured as unidirectional communication interfaces as indicated by the arrow for the communication channel 13 in Fig. 18 pointing from the source device 12 to the destination device 14, or bidirectional communication interfaces, and may be configured, e.g. to send and receive messages, e.g. to set up a connection, to acknowledge and exchange any other information related to the communication link and/or data transmission, e.g. encoded picture data transmission. The decoder 30 is configured to receive the encoded picture data 21 and provide decoded picture data 31 or a decoded picture 31.
The post-processor 32 of destination device 14 is configured to post-process the decoded picture data 31 (also called reconstructed picture data), e.g. the decoded picture 31 , to obtain post-processed picture data 33, e.g. a post-processed picture 33. The post-processing performed by the post-processing unit 32 may comprise, e.g. color format conversion (e.g. from YCbCr to RGB), color correction, trimming, or re-sampling, or any other processing, e.g. for preparing the decoded picture data 31 for display, e.g. by display device 34.
The display device 34 of the destination device 14 is configured to receive the post-processed picture data 33 for displaying the picture, e.g. to a user or viewer. The display device 34 may be or comprise any kind of display for representing the reconstructed picture, e.g. an integrated or external display or monitor. The displays may, e.g. comprise liquid crystal displays (LCD), organic light emitting diodes (OLED) displays, plasma displays, projectors , micro LED displays, liquid crystal on silicon (LCoS), digital light processor (DLP) or any kind of other display.
Although Fig. 18 depicts the source device 12 and the destination device 14 as separate devices, embodiments of devices may also comprise both devices or both functionalities, i.e. the source device 12 or corresponding functionality and the destination device 14 or corresponding functionality. In such embodiments the source device 12 or corresponding functionality and the destination device 14 or corresponding functionality may be implemented using the same hardware and/or software or by separate hardware and/or software or any combination thereof.
As will be apparent for the skilled person based on the description, the existence and (exact) split of functionalities of the different units or functionalities within the source device 12 and/or destination device 14 as shown in Fig. 18 may vary depending on the actual device and application.
The encoder 20 (e.g. a video encoder 20) or the decoder 30 (e.g. a video decoder 30) or both encoder 20 and decoder 30 may be implemented via processing circuitry, such as one or more microprocessors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), discrete logic, hardware, video coding dedicated or any combinations thereof. The encoder 20 may be implemented via processing circuitry 46 to embody the various modules including the neural network or its parts. The decoder 30 may be implemented via processing circuitry 46 to embody any coding system or subsystem described herein. The processing circuitry may be configured to perform the various operations as discussed later. If the techniques are implemented partially in software, a device may store instructions for the software in a suitable, non-transitory computer-readable storage medium and may execute the instructions in hardware using one or more processors to perform the techniques of this disclosure. Either of video encoder 20 and video decoder 30 may be integrated as part of a combined encoder/decoder (CODEC) in a single device, for example, as shown in Fig. 19.
Source device 12 and destination device 14 may comprise any of a wide range of devices, including any kind of handheld or stationary devices, e.g. notebook or laptop computers, mobile phones, smart phones, tablets or tablet computers, cameras, desktop computers, set- top boxes, televisions, display devices, digital media players, video gaming consoles, video streaming devices(such as content services servers or content delivery servers), broadcast receiver device, broadcast transmitter device, or the like and may use no or any kind of operating system. In some cases, the source device 12 and the destination device 14 may be equipped for wireless communication. Thus, the source device 12 and the destination device 14 may be wireless communication devices.
In some cases, video coding system 10 illustrated in Fig. 18 is merely an example and the techniques of the present application may apply to video coding settings (e.g., video encoding or video decoding) that do not necessarily include any data communication between the encoding and decoding devices. In other examples, data is retrieved from a local memory, streamed over a network, or the like. A video encoding device may encode and store data to memory, and/or a video decoding device may retrieve and decode data from memory. In some examples, the encoding and decoding is performed by devices that do not communicate with one another, but simply encode data to memory and/or retrieve and decode data from memory.
Fig. 20 is a schematic diagram of a video coding device 8000 according to an embodiment of the disclosure. The video coding device 8000 is suitable for implementing the disclosed embodiments as described herein. In an embodiment, the video coding device 8000 may be a decoder such as video decoder 30 of Fig. 18 or an encoder such as video encoder 20 of Fig. 18.
The video coding device 8000 comprises ingress ports 8010 (or input ports 8010) and receiver units (Rx) 8020 for receiving data; a processor, logic unit, or central processing unit (CPU) 8030 to process the data; transmitter units (Tx) 8040 and egress ports 8050 (or output ports 8050) for transmitting the data; and a memory 8060 for storing the data. The video coding device 8000 may also comprise optical-to-electrical (OE) components and electrical-to-optical (EO) components coupled to the ingress ports 8010, the receiver units 8020, the transmitter units 8040, and the egress ports 8050 for egress or ingress of optical or electrical signals.
The processor 8030 is implemented by hardware and software. The processor 8030 may be implemented as one or more CPU chips, cores (e.g., as a multi-core processor), FPGAs, ASICs, and DSPs. The processor 8030 is in communication with the ingress ports 8010, receiver units 8020, transmitter units 8040, egress ports 8050, and memory 8060. The processor 8030 comprises a neural network based codec 8070. The neural network based codec 8070 implements the disclosed embodiments described above. For instance, the neural network based codec 8070 implements, processes, prepares, or provides the various coding operations. The inclusion of the neural network based codec 8070 therefore provides a substantial improvement to the functionality of the video coding device 8000 and effects a transformation of the video coding device 8000 to a different state. Alternatively, the neural network based codec 8070 is implemented as instructions stored in the memory 8060 and executed by the processor 8030.
The memory 8060 may comprise one or more disks, tape drives, and solid-state drives and may be used as an over-flow data storage device, to store programs when such programs are selected for execution, and to store instructions and data that are read during program execution. The memory 8060 may be, for example, volatile and/or non-volatile and may be a read-only memory (ROM), random access memory (RAM), ternary content-addressable memory (TCAM), and/or static random-access memory (SRAM).
Fig. 21 is a simplified block diagram of an apparatus that may be used as either or both of the source device 12 and the destination device 14 from Fig. 18 according to an exemplary embodiment.
A processor 9002 in the apparatus 9000 can be a central processing unit. Alternatively, the processor 9002 can be any other type of device, or multiple devices, capable of manipulating or processing information now-existing or hereafter developed. Although the disclosed implementations can be practiced with a single processor as shown, e.g., the processor 9002, advantages in speed and efficiency can be achieved using more than one processor.
A memory 9004 in the apparatus 9000 can be a read only memory (ROM) device or a random access memory (RAM) device in an implementation. Any other suitable type of storage device can be used as the memory 9004. The memory 9004 can include code and data 9006 that is accessed by the processor 9002 using a bus 9012. The memory 9004 can further include an operating system 9008 and application programs 9010, the application programs 9010 including at least one program that permits the processor 9002 to perform the methods described here. For example, the application programs 9010 can include applications 1 through N, which further include a video coding application that performs the methods described here.
The apparatus 9000 can also include one or more output devices, such as a display 9018. The display 9018 may be, in one example, a touch sensitive display that combines a display with a touch sensitive element that is operable to sense touch inputs. The display 9018 can be coupled to the processor 9002 via the bus 9012.
Although depicted here as a single bus, the bus 9012 of the apparatus 9000 can be composed of multiple buses. Further, a secondary storage can be directly coupled to the other components of the apparatus 9000 or can be accessed via a network and can comprise a single integrated unit such as a memory card or multiple units such as multiple memory cards. The apparatus 9000 can thus be implemented in a wide variety of configurations.
While operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In certain implementations, multitasking and parallel processing may be advantageous.
Summarizing, methods and apparatuses are described for encoding and decoding of image data, which includes processing two or more representations of the image data by a multi-scale reconstruction network. Feature extraction networks generate latent representations for each of the representations. The latent representations are processed by a generation neural network to obtain reconstructed image data. The representations are encoded into a bitstream according to a set of split parameters indicating target bitrates.

Claims

1 . A method for decoding image data from a bitstream (1130), comprising: decoding (S810) a first representation (1010) of the image data from the bitstream (1130); obtaining (S820) a second representation (1011) of the image data; processing (S830) the first representation (1010) by applying a first neural network (1020), including generating a first latent representation and a second latent representation; processing (S840) the second representation (1011) by applying a second neural network (1030), including generating a third latent representation and a fourth latent representation; combining processing (S850) including combining of the first latent representation and the third latent representation; obtaining (S860) reconstructed image data including applying a generation neural network (1050) to an output of the combining processing, the second latent representation and the fourth latent representation.
2. The method according to claim 1 , wherein the obtaining of the second representation (1011) comprises decoding the second representation (1011) from the bitstream (1130), the second representation (1011) having a smaller resolution and/or a different bitrate in the bitstream (1130) than the first representation (1010).
3. The method according to claim 2, wherein the decoding comprises: splitting the bitstream (1130) into at least two subbitstreams; decoding the first representation (1010) and the second representation (1011) from a respective subbitstream out of the at least two subbitstreams.

4. The method according to claim 2, wherein the decoding comprises: decoding the bitstream (1131); splitting the decoded bitstream into at least the first representation (1010) and the second representation (1011).

5. The method according to any of claims 3 and 4, wherein the decoding is performed including one of
- a decoder of an autoencoding neural network, or
- a hybrid block-based decoder.

6. The method according to claim 1, wherein the obtaining of a second representation (1011) comprises generating the second representation (1011) by downscaling of the first representation (1010).

7. The method according to any of the claims 1 to 6, further comprising decoding one or more parameters from the bitstream (1130), said one or more parameters indicating target bitrates of one or more respective representations encoded in the bitstream (1130), wherein a configuration of at least one of
- the first neural network (1020),
- the second neural network (1030), and
- the generation neural network (1050) is selected based on the one or more parameters.
8. The method according to claim 7, wherein a set of weights within at least one of the first neural network (1020), the second neural network (1030) and the generation neural network (1050) is selected based on the one or more parameters.
9. The method according to any of the claims 1 to 8 further comprising performing an interpolation on the second representation (1011) to obtain a resolution of the first representation (1010), before processing the second representation by applying the second neural network (1030).
10. The method according to any of the claims 1 to 9, wherein the combining processing includes at least one of
- concatenating (1040) the first latent representation and the third latent representation;
- processing by a fully-connected neural network (1041).
11. The method according to any of the claims 1 to 10, wherein the applying of the first neural network (1020) includes generating a first set of latent representations, said first set including the second latent representation; the applying of the second neural network (1030) includes generating a second set of latent representations, said second set including the fourth latent representation; and the generation neural network (1050) receives the first set and the second set as additional input.
12. The method according to any of the claims 1 to 11, the method further comprising: obtaining N representations, N being an integer greater than or equal to two, the N representations including the first representation (1010) and the second representation (1011); for each i-th representation out of the N representations, i being one to N, processing the i-th representation by applying an i-th neural network, including generating a (2i-1)-th latent representation and a (2i)-th latent representation; combining processing including combining of each of the (2i-1)-th latent representations; obtaining reconstructed image data including applying a generation neural network (1050) to an output of the combining processing and each of the (2i)-th latent representations.
13. The method according to claim 12, wherein the obtaining of N representations further comprises: decoding (S810) a first representation (1010) out of the N representations from the bitstream (1130); obtaining (S820) a j-th representation out of the N representations, j being two to N, by either decoding said j-th representation from the bitstream (1130) or by downscaling of the first representation (1010).
14. The method according to any of claims 12 and 13, wherein each of the i-th neural networks includes Ki downsampling layers, a bottleneck layer (1022, 1032), and Ki up-sampling layers, Ki being an integer greater than zero; wherein the bottleneck layer (1022, 1032) outputs the (2i-1)-th latent representation and a first layer out of the Ki up-sampling layers (1023, 1033) following the bottleneck layer (1022, 1032) outputs the (2i)-th latent representation.
15. The method according to any of claims 12 to 14, wherein a first block (1051) within the generation neural network (1050) receives as input the output of the combining processing and each of the (2i)-th latent representations; and a second block (1052) within the generation neural network (1050) receives as input one or more of
- an output of the first block (1051);
- the output of the combining processing; and
- for each i-th neural network, a latent representation output of a second layer (1024, 1034) out of the Ki up-sampling layers in the i-th neural network, the second layer (1024, 1034) following the first layer (1023, 1033) within the Ki up-sampling layers.
16. The method according to claim 15, wherein the generation neural network (1050) includes M blocks, said M blocks including the first block (1051) and the second block (1052), M being an integer greater than or equal to two; and each q-th block of the M blocks, q being two to M, receives as input one or more of
- an output of the (q-1)-th block;
- the output of the combining processing; and
- for each i-th neural network, a latent representation output of a q-th layer out of the Ki up-sampling layers in the i-th neural network.
17. The method according to any of claims 12 to 16 further comprising obtaining a corresponding subframe in each of the N representations; wherein the processing the i-th representation comprises applying the i-th neural network to the corresponding subframe of the i-th representation.
18. The method according to claim 17, wherein the corresponding subframe includes a face crop (1440).
19. The method according to any of the claims 1 to 18, wherein the generation neural network (1050) is a generative adversarial network (GAN).
20. A method for encoding image data into a bitstream (1130), comprising: obtaining (S910) a set of parameters based on a cost function, said parameters indicating target bitrates; obtaining (S920) two or more representations (1110, 1111, 1112) of the image data based on the set of parameters; encoding (S930) the two or more representations (1110, 1111, 1112) into a bitstream (1130) according to the set of parameters; wherein the cost function is based on reconstructed image data obtained by processing two or more decoded representations of the same image data.
21. The method according to claim 20, wherein the obtaining of the set of parameters based on the cost function comprises: for each current set of parameters out of a predefined plurality of sets of parameters: obtaining two or more representations (1110, 1111, 1112) of the image data based on the current set of parameters; encoding the two or more representations (1110, 1111, 1112) into a bitstream (1130) according to the current set of parameters; decoding the two or more representations (1110, 1111, 1112) from the bitstream (1130); obtaining reconstructed image data based on the two or more decoded representations; determining a quality of the reconstructed image data; and selecting a set of parameters out of the predefined plurality of sets based on the cost function including the determined quality.
22. The method according to claim 20, wherein the obtaining of the set of parameters based on the cost function comprises: processing the image data by one or more layers of a neural network (1740) to output the set of parameters (1750); wherein said neural network (1740) is trained by optimizing the cost function.
23. The method according to claim 22, wherein the neural network (1740) receives as input one or more of
- image data (1710);
- a target bitrate of the bitstream;
- an indication of complexity (1720) of the image data;
- an indication of motion intensity (1730) of the image data.
24. The method according to any of the claims 20 to 23, wherein the encoding of the two or more representations (1110, 1111, 1112) into the bitstream (1130) comprises: for each representation out of the two or more representations (1110, 1111, 1112), encoding a representation into a subbitstream; including each of the subbitstreams into the bitstream (1130).
25. The method according to any of the claims 20 to 24, wherein the encoding of the two or more representations (1110, 1111, 1112) into the bitstream (1130) comprises: including the set of parameters into the bitstream.
26. The method according to any of claims 20 to 25, wherein the two or more representations (1110, 1111, 1112) represent different resolutions of the image data.
27. The method according to claim 26, wherein a first representation (1110) out of the two or more representations (1110, 1111, 1112) represents an original resolution of the image data and a second representation (1111) out of the two or more representations (1110, 1111, 1112) represents a downscaled resolution of the image data.
28. The method according to claim 27, wherein a first parameter out of the set of parameters indicates a target bitrate for the first representation (1110) representing an original resolution of the image data; and a second parameter out of the set of parameters indicates a target bitrate for the second representation (1111) representing a downscaled resolution of the image data.
29. The method according to any of claims 20 to 28, wherein the parameters indicating target bitrates for respective representations within the set of parameters add up to a target bitrate of the bitstream (1130).
30. The method according to any of claims 20 to 29, wherein an encoder used in the encoding selects a quantization parameter for the encoding of a representation to obtain a target bitrate indicated by the parameter within the set of parameters corresponding to said representation.
31. The method according to claim 30, wherein the encoding is performed including one of
- an encoder of an autoencoding neural network, or
- a hybrid block-based encoder.
32. A computer program product comprising program code for performing the method according to any one of the claims 1 to 31 when executed on a computer or a processor.
33. A non-transitory computer-readable medium carrying a program code which, when executed by a computer device, causes the computer device to perform the method of any one of the claims 1 to 31.
34. A decoder (30) comprising processing circuitry configured to carry out the method according to any one of claims 1 to 19.
35. An encoder (20) comprising processing circuitry configured to carry out the method according to any one of claims 20 to 31.
36. A non-transitory storage medium comprising a bitstream encoded by the method of any one of the claims 20 to 21.
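As an illustrative (non-claim) note, the encoder-side search over candidate split parameters recited in claims 20 and 21 can be sketched as a small rate-distortion loop. All helper names below (make_representations, encode, decode_and_reconstruct, distortion) and the Lagrangian weighting of rate against distortion are hypothetical placeholders chosen for the sketch; the claims only require that a cost function based on the reconstructed image data drives the selection.

# Hypothetical sketch of selecting a split-parameter set by exhaustive search.
def select_split_parameters(image, candidate_sets, total_rate, lam=0.01):
    best_params, best_cost = None, float('inf')
    for params in candidate_sets:                         # e.g. [(0.8, 0.2), (0.5, 0.5), ...]
        reps = make_representations(image, params)        # e.g. original + downscaled copies
        bitstream, rate = encode(reps, params, total_rate)    # per-representation target bitrates
        recon = decode_and_reconstruct(bitstream)         # multi-scale reconstruction as above
        cost = distortion(image, recon) + lam * rate      # cost includes the measured quality
        if cost < best_cost:
            best_params, best_cost = params, cost
    return best_params

Claim 22 replaces this search with a trained network that predicts the parameter set directly from the image data (and, per claim 23, optionally from the target bitrate, a complexity indication, and a motion-intensity indication), trained by optimizing the same kind of cost function.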
PCT/RU2022/000067 2022-03-09 2022-03-09 Method of video coding by multi-modal processing WO2023172153A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/RU2022/000067 WO2023172153A1 (en) 2022-03-09 2022-03-09 Method of video coding by multi-modal processing


Publications (1)

Publication Number Publication Date
WO2023172153A1 (en) 2023-09-14

Family

ID=81648125


Country Status (1)

Country Link
WO (1) WO2023172153A1 (en)



Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117915107A (en) * 2024-03-20 2024-04-19 北京智芯微电子科技有限公司 Image compression system, image compression method, storage medium and chip
CN117915107B (en) * 2024-03-20 2024-05-17 北京智芯微电子科技有限公司 Image compression system, image compression method, storage medium and chip


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22723212

Country of ref document: EP

Kind code of ref document: A1