WO2023066507A1 - An autoencoder for data compression - Google Patents

An autoencoder for data compression

Info

Publication number
WO2023066507A1
WO2023066507A1 PCT/EP2021/083240 EP2021083240W
Authority
WO
WIPO (PCT)
Prior art keywords
latent variables
quantized
autoencoder
latent
neural network
Prior art date
Application number
PCT/EP2021/083240
Other languages
French (fr)
Inventor
Yun Li
Original Assignee
Telefonaktiebolaget Lm Ericsson (Publ)
Priority date
Filing date
Publication date
Application filed by Telefonaktiebolaget Lm Ericsson (Publ) filed Critical Telefonaktiebolaget Lm Ericsson (Publ)
Publication of WO2023066507A1 publication Critical patent/WO2023066507A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/20 Ensemble learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/0895 Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00 Computing arrangements based on specific mathematical models
    • G06N7/01 Probabilistic graphical models, e.g. probabilistic networks
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals

Definitions

  • Embodiments of the present disclosure relate to computer-implemented methods and apparatus relating to an autoencoder and, in particular, to computer-implemented methods and apparatus for training an autoencoder for data compression.
  • Autoencoders for image compression may comprise functions of input analysis, for determining a plurality of latent variables describing input image data; quantization for quantizing the plurality of latent variables (e.g., by rounding); and encoding for encoding the quantized latent variables (e.g., to compress the data).
  • Input analysis may be performed by a data-driven neural network, to downsample data for lossless encoding.
  • An autoencoder can be trained to do this by encoding training data and then validating the encoding by attempting to regenerate the training data from the encoded data.
  • the autoencoder may be trained by encoding an image in accordance with the steps outlined above (e.g., input analysis using a neural network, quantizing the latent variables that result from that analysis, and then encoding the quantized latent variables), attempting to reconstruct (e.g. decode) the encoded image, and updating weights of the neural network based on a comparison of the reconstructed image with the input image.
  • weights can be updated using gradient-based optimisation techniques (e.g. gradient-descent methods) to find the weights which minimise the information lost during encoding.
  • A computer-implemented method of training an autoencoder for data compression is provided, in which the autoencoder comprises a first neural network for processing and downsampling input training data to generate latent variables, a probability estimator for determining probability distributions of the latent variables, and an entropy encoder for compressing the latent variables based on the probability distributions.
  • the method comprises inputting training data into the first neural network to generate a plurality of first latent variables and determining a distortion measure based on a comparison of the training data to a reconstruction of the training data.
  • the reconstruction of the training data is based on a plurality of first quantized latent variables obtained by quantizing the plurality of first latent variables.
  • the method further comprises using the probability estimator to, for respective first latent variables of the plurality of first latent variables, obtain one or more parameters of respective mixture models representing a probability distribution of the respective first latent variable.
  • a first rate measure is determined based on the mixture models and a plurality of approximately quantized latent variables, in which the plurality of approximately quantized latent variables are obtained by applying a differentiable function to the plurality of first latent variables to approximate the quantization of the plurality of first latent variables.
  • One or more parameters of the autoencoder are updated based on the distortion measure and the first rate measure.
  • A further aspect of the present disclosure provides an apparatus configured to perform the aforementioned method.
  • the apparatus may comprise, for example, a computer.
  • a computer program is provided.
  • the computer program comprises instructions which, when executed on at least one processor of an apparatus, cause the apparatus to carry out the aforementioned method.
  • a carrier containing the computer program is provided, in which the carrier is one of an electronic signal, optical signal, radio signal, or non-transitory machine-readable storage medium.
  • a still further aspect of the present disclosure provides an apparatus for training an autoencoder for data compression, in which the autoencoder comprises a first neural network for processing and downsampling input data to generate latent variables, a probability estimator for determining probability distributions of the latent variables, and an entropy encoder for compressing the latent variables based on the probability distributions.
  • the apparatus comprises a processor and a machine-readable medium, in which the machine-readable medium contains instructions executable by the processor such that the apparatus is operable to input training data into the first neural network to generate a plurality of first latent variables and determine a distortion measure based on a comparison of the training data to a reconstruction of the training data.
  • the reconstruction of the training data is based on a plurality of first quantized latent variables obtained by quantizing the plurality of first latent variables.
  • the apparatus is further operable to use the probability estimator to, for respective first latent variables of the plurality of first latent variables, obtain one or more parameters of respective mixture models representing a probability distribution of the respective first latent variable.
  • the apparatus is further operable to determine a first rate measure based on the mixture models and a plurality of approximately quantized latent variables, in which the plurality of approximately quantized latent variables are obtained by applying a differentiable function to the plurality of first latent variables to approximate the quantization of the plurality of first latent variables.
  • the apparatus is further operable to update one or more parameters of the autoencoder based on the distortion measure and the first rate measure.
  • aspects of the disclosure thus provide an improved method of training an autoencoder for data compression which uses mixed quantisation and mixture models during training to provide a more effective autoencoder.
  • Figure 1 shows an example of an autoencoder for compressing data
  • Figure 2 shows an example of an autoencoder according to embodiments of the disclosure
  • Figure 3 shows a computer-implemented method of training an autoencoder according to embodiments of the disclosure
  • Figure 4 shows rate-distortion curves for different data compression techniques
  • Figures 5 and 6 show structures of neural networks in an autoencoder according to embodiments of the disclosure
  • Figure 7 shows an apparatus for training an autoencoder according to embodiments of the disclosure.
  • Autoencoders can be used to compress data, such as images.
  • An example autoencoder may comprise a neural network which generates latent variables representing features of the data to be compressed, a quantization unit or quantizer which discretises the latent variables and an entropy encoder which further compresses the quantized latent variables.
  • the entropy encoder encodes the quantized latent variables based on the probabilities of the quantized variables having a particular value or range of values. There are various ways in which these probabilities may be determined.
  • One way of modelling the probability distributions of the latent variables is by first generating hyperpriors which capture dependencies between the latent variables and then using the hyperpriors to determine parameters of the likelihoods of the latent variables.
  • the dependencies between latent variables may be indicative of underlying properties of the data to be encoded. For example, neighbouring elements of latent variables generated based on image data may be correlated. This additional information may be reflected in the hyperprior.
  • the hyperpriors may be generated by analysing the latent variables with a second neural network.
  • the hyperpriors may comprise further latent variables.
  • the hyperpriors may be used to predict the parameters of entropy models to model the probability distributions of the latent variables. These probability distributions or likelihoods may then be used by an entropy encoder in the autoencoder to compress the quantized latent variables.
  • the hyperpriors may be provided with the encoded data to improve data reconstruction.
  • the hyperpriors may be compressed (e.g. encoded using an entropy encoder) and provided with the encoded data such that both the encoded data and the compressed hyperprior may be used to reconstruct the input data.
  • Hyperpriors may thus be used to provide side information to accompany the encoded data and aid decoding, as well as being used to more accurately determine the parameters of the models (e.g. entropy models) of the probability distributions of the latent variables. This can reduce the bit rate more effectively than using a fully factorized entropy model.
  • the accuracy of the models used for the probability distributions of the latent variables is still limited by the choice of model.
  • the probability distribution of a latent variable can, for example, be modelled using single univariate Gaussians parameterised by the scale, or by the scale and mean. However, these models may not accurately represent the underlying distributions.
  • the effectiveness of an autoencoder can partially be limited by how accurately the probability distributions of the latent variables are modelled.
  • an autoencoder can be trained to compress data by compressing training data and then validating the compression by attempting to regenerate the training data from the compressed data.
  • the effectiveness of the autoencoder can be captured in a rate-distortion function, which compares the distortion caused by encoding the data to how effectively the data is compressed (e.g. how much smaller the encoded data is than the input data).
  • Autoencoders can be trained by seeking to optimise the rate-distortion function and updating parameters of the autoencoder accordingly. There are many techniques which may be used to optimise (e.g. minimise) the rate-distortion function.
  • the autoencoder 100 comprises an input unit 102, an analysis network 104, a quantization unit 106 and an encoding unit 108.
  • the input unit 102 of the autoencoder 100 obtains an image x.
  • the image x is input to the analysis network 104 which determines a plurality of latent variables y representing features of the image.
  • a latent variable may be a variable which is inferred from the input data.
  • the analysis network 104 comprises a neural network that has been trained to process an input image to extract features and generate corresponding latent variables.
  • the analysis network 104 may thus be operable to map the image to a plurality of variables in a latent space. By representing image features using latent variables, the analysis network 104 can generate a compressed representation of the image.
  • the analysis network 104 thus downsamples and processes the input image to obtain the latent variables.
  • the latent variables y are output to the quantization unit 106, which quantizes the plurality of latent variables to obtain a plurality of quantized latent variables ŷ.
  • the quantization unit 106 may round each value in the latent variables y to the closest integer in order to discretise the latent variables y received from the analysis network 104.
  • the encoding unit 108 encodes the quantized latent variable data to further compress the image data.
  • the encoding unit 108 uses an entropy encoder for this purpose.
  • the skilled person will be familiar with entropy encoders, so they are not discussed here in detail.
  • entropy encoding is a lossless encoding technique for encoding a plurality of discrete values.
  • each discrete value is assigned an associated codeword.
  • the length of a codeword for a particular value may be determined based on the probability of that value occurring in the data to be compressed. Thus, for example, shorter codewords may be assigned to values that occur (or are expected to occur) more frequently. By using shorter codewords for common values in a dataset, the dataset can be compressed without loss of information.
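  • As an illustration of this entropy-coding principle (a minimal sketch, not part of the patent text; the symbol probabilities below are hypothetical), the ideal codeword length for a value occurring with probability p is roughly -log2(p) bits, so more frequent values receive shorter codewords:

```python
import math

# Hypothetical probabilities of quantized latent values occurring in the data.
probabilities = {0: 0.50, 1: 0.25, -1: 0.15, 2: 0.10}

# Entropy coding assigns roughly -log2(p) bits to a value with probability p,
# so more frequent values receive shorter codewords.
for value, p in probabilities.items():
    print(f"value {value:+d}: p = {p:.2f}, ideal codeword length = {-math.log2(p):.2f} bits")

# The entropy (expected bits per symbol) lower-bounds the achievable lossless rate.
entropy = sum(-p * math.log2(p) for p in probabilities.values())
print(f"entropy = {entropy:.2f} bits/symbol")
```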
  • the encoding unit 108 determines, for each value in the quantized latent variable data, the likelihood of that value occurring.
  • the encoding unit 108 uses this probability information to determine codeword lengths and thus to encode the quantized latent variables received from the quantization unit 106 using entropy encoding.
  • the probability estimator 110 may, for example, output one or more parameters of probability mass functions or probability density functions for the latent variables and the encoding unit 108 may use these parameters to determine, for the quantized latent variables ŷ, the probability of particular values or ranges of values occurring.
  • the encoding unit 108 thus losslessly compresses the quantized latent variables to obtain an encoded image. Due to the downsampling and processing performed by the analysis network 104 and the encoding performed by the encoding unit 108, the encoded image will be smaller in size (e.g. will take up less storage space) than the input image.
  • the encoded image may be stored or transmitted.
  • the encoded image may, for example, be decoded to reconstruct the input image.
  • Autoencoders can thus use a data-driven neural network, such as the analysis network 104, to process and downsample data for lossless encoding using an entropy encoder.
  • An autoencoder can be trained to do this by compressing training data and then validating the compression process by attempting to regenerate the training data from the compressed data.
  • the analysis network 104 of the autoencoder 100 may be trained by encoding an image in accordance with the steps outlined above, attempting to reconstruct (e.g. decode) the encoded image, and updating weights of the analysis network 104 based on a comparison of the reconstructed image with the input image. These weights can be updated during training using gradient-based optimisation techniques (e.g. gradient-descent methods).
  • aspects of the disclosure seek to address these and other problems by training an autoencoder using mixture models to model the probability density functions of latent variables and mixed quantisation, in which latent variables are quantized when reconstructing the input data to measure distortion, but a differentiable function is applied to the latent variables to approximate quantization when determining the rate measure.
  • A computer-implemented method of training an autoencoder for data compression is provided, in which the autoencoder comprises a first neural network for processing and downsampling input data to generate latent variables, a probability estimator for determining probability distributions of the latent variables, and an entropy encoder for compressing the latent variables based on the probability distributions.
  • the method comprises inputting training data into the first neural network to generate a plurality of first latent variables and determining a distortion measure based on a comparison of the training data to a reconstruction of the training data.
  • the reconstruction of the training data is based on a plurality of first quantized latent variables obtained by quantizing the plurality of first latent variables.
  • the method further comprises using the probability estimator to, for respective first latent variables of the plurality of first latent variables, obtain one or more parameters of respective mixture models representing a probability distribution of the respective first latent variable.
  • a first rate measure is determined based on the mixture models and a plurality of approximately quantized latent variables, in which the plurality of approximately quantized latent variables are obtained by applying a differentiable function to the plurality of first latent variables to approximate the quantization of the plurality of first latent variables.
  • One or more parameters of the autoencoder are updated based on the distortion measure and the first rate measure.
  • a computer-implemented method of training an autoencoder is described in respect of Figure 2, which shows an example of an autoencoder 200 for data compression.
  • the method may be performed by the autoencoder 200 itself.
  • Alternatively, the method may be performed by another entity (e.g. another node or computer).
  • the autoencoder 200 comprises an input unit 202, a first analysis network 204, first and second quantization units 206a, 206b (Q1 and Q2 in Figure 2), a probability estimator 210, a first entropy estimator 212, a first synthesis network 214 and a distortion calculator 216. Although not illustrated, the autoencoder 200 further comprises an entropy encoder.
  • the autoencoder 200 is trained by seeking to optimise (e.g. minimise) a rate-distortion function.
  • the rate-distortion function may alternatively be referred to as a loss function.
  • the rate-distortion function may take the form $L = \lambda D + R_1 + R_2$, in which $\lambda$ is a configurable parameter, $D$ is a distortion, $R_1$ is a first rate measure and $R_2$ is a second rate measure.
  • the process begins with obtaining training data x at the input unit 202.
  • the training data may comprise any data which the autoencoder 200 can compress.
  • the training data may comprise image data.
  • the training data may comprise an array defining position information for the image (e.g. coordinates, such as pixel coordinates) and one or more other arrays defining properties of the image at those positions (colour, transparency and/or any other suitable properties).
  • the training data may comprise any data having at least two dimensions.
  • the training data may be provided in any suitable form such as, for example, an array.
  • the training data is input to the first analysis network 204.
  • the first analysis network 204 comprises a neural network, such as a convolutional neural network.
  • the first analysis network 204 may, for example, comprise a deep neural network having layers of neurons.
  • Based on the training data, the first analysis network 204 generates a plurality of first latent variables y representing features in the training data.
  • the first latent variables may represent features in an image.
  • the first analysis network 204 downsamples and processes the training data to generate the first latent variables.
  • the first analysis network 204 thus effectively compresses the training data by processing the training data to extract relevant (e.g. important or significant) features and representing those features using the latent variables.
  • the first analysis network 204 may be considered to concentrate the training data (for example as opposed to merely reducing the dimensionality) because prominent features in the training data are not lost.
  • the first latent variables y are input to the first quantization unit 206a, the second quantization unit 206b and the probability estimator 210.
  • the first quantization unit 206a quantizes the plurality of first latent variables y and outputs the first quantized latent variables ŷ to the first synthesis network 214.
  • quantization may refer to restricting the values of the first quantized latent variables to a prescribed set of values.
  • the prescribed set of values may, for example, be provided as specific values (e.g. a table of predetermined values) or as a class of values (e.g. integers or values limited to a number of significant digits).
  • the first latent variables may be quantized.
  • Quantization may be deterministic. Rounding is one example of deterministic quantization which can be used to simply and efficiently quantize the first latent variables.
  • the first quantization unit 206a quantizes the first latent variables by applying a rounding function.
  • the rounding function may round the first latent variables to the closest integer. More generally, the rounding function may round the first latent variables to a specified number of digits or significant digits.
  • the first quantization unit 206a effectively discretises the first latent variables so that they would be suitable for encoding with the entropy encoder.
  • the first quantization unit 206a outputs the plurality of first quantized latent variables to the first synthesis network 214.
  • the first synthesis network 214 attempts to reconstruct the training data x based on the plurality of first quantized latent variables ŷ.
  • the first synthesis network 214 comprises a neural network which, through training of the autoencoder 200, may be trained for this purpose.
  • the neural network may be, for example a convolutional neural network.
  • the synthesis network 214 outputs the reconstructed training data x̂ to a distortion calculator 216.
  • the distortion calculator 216 compares the training data x to the reconstructed training data x̂ to determine a distortion (or distortion measure), D, caused by the autoencoder 200.
  • the distortion indicates the information lost or altered due to the compression and subsequent reconstruction of the training data.
  • the distortion D may use any suitable measure to quantify the difference between the training data and the reconstructed training data.
  • the distortion is based on the mean-squared error of the training data and the reconstructed data.
  • the distortion may be calculated according to $D = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \hat{x}_i\right)^2$, in which $n$ is the number of elements in the training data, $x_i$ is the $i$-th element of the training data and $\hat{x}_i$ is the corresponding element of the reconstructed training data.
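  • For example, a minimal sketch (not from the patent; the arrays below are illustrative) of the mean-squared-error distortion measure described above:

```python
import numpy as np

def mse_distortion(x: np.ndarray, x_hat: np.ndarray) -> float:
    """Mean-squared error between the training data x and its reconstruction x_hat."""
    return float(np.mean((x - x_hat) ** 2))

# Illustrative example: a small random "image" and a slightly noisy reconstruction.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=(8, 8))
x_hat = x + rng.normal(0.0, 0.05, size=x.shape)
print(mse_distortion(x, x_hat))
```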
  • more complex metrics can be used to indicate the distortion.
  • Other suitable distortion measures may include, for example, a Structure Similarity Index Measure (SSIM), and/or a neural network-based perceptual quality metric (see, for example, “Deep Perceptual Image Quality Assessment For Compression”, Mier et al., arXiv:2103.01114). Deterministic quantization can be more advantageous when these more complex metrics are used since they may be more sensitive to high frequency noise which can result from using stochastic quantization.
  • the distortion D is calculated using a reconstruction of the training data determined using latent variables quantized by the first quantization unit 206a.
  • the first rate measure R₁ is determined using approximately quantized latent variables determined by the second quantization unit 206b. Rather than quantizing the latent variables, the second quantization unit 206b approximates the quantization performed by the first quantization unit 206a.
  • noise may be applied to the first latent variables to approximate quantization.
  • the second quantization unit 206b may apply, to the first latent variables, noise sampled from a distribution spanning the quantization interval (e.g. from -1/2 to 1/2). This may be particularly effective when the first quantization unit 206a quantizes the latent variables by rounding to the nearest integer.
  • the noise may be sampled from any suitable distribution.
  • noise may be sampled from a uniform distribution.
  • the approximately quantized first latent variables ỹ may be determined according to $\tilde{y} = y + \mathcal{U}\left(-\tfrac{1}{2}, \tfrac{1}{2}\right)$, in which $\mathcal{U}\left(-\tfrac{1}{2}, \tfrac{1}{2}\right)$ is a function which samples from a uniform noise distribution with a lower limit of $-\tfrac{1}{2}$ and an upper limit of $\tfrac{1}{2}$.
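  • A short sketch of this additive-noise approximation (illustrative only; the latent values are hypothetical), contrasting hard rounding with the uniform-noise surrogate:

```python
import numpy as np

def approx_quantize_with_noise(y: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Approximate rounding to the nearest integer by adding noise drawn from U(-1/2, 1/2)."""
    return y + rng.uniform(-0.5, 0.5, size=y.shape)

rng = np.random.default_rng(0)
y = np.array([0.2, 1.7, -3.4])
print(np.round(y))                          # hard quantization used for the distortion term
print(approx_quantize_with_noise(y, rng))   # differentiable surrogate used for the rate term
```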
  • a softmax function may be used to approximately quantize the first latent variables.
  • For example, the approximately quantized first latent variables may be determined according to $\tilde{y}_i = \sum_{j=1}^{n} C_j \, \frac{\exp\left(-\alpha\,\mathrm{abs}(y_i - C_j)\right)}{\sum_{k=1}^{n}\exp\left(-\alpha\,\mathrm{abs}(y_i - C_k)\right)}$, in which $\alpha$ is a parameter for controlling the hardness of the quantization (for example, a larger value of $\alpha$ can be used to generate more precise values for the intended quantization level), $n$ is the number of quantization levels, $C_j$ are the quantization levels and the function abs(.) returns the absolute value or modulus of the quantity to which it is applied.
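  • A minimal sketch of such a softmax-based soft quantizer (illustrative; the quantization levels, latent values and the value of alpha are assumptions, not taken from the patent):

```python
import numpy as np

def soft_quantize(y: np.ndarray, levels: np.ndarray, alpha: float) -> np.ndarray:
    """Softmax-weighted assignment of each latent value to the quantization levels C.

    A larger alpha concentrates the weights on the nearest level, so the output
    approaches hard quantization while remaining differentiable.
    """
    distances = np.abs(y[:, None] - levels[None, :])   # |y_i - C_j|
    weights = np.exp(-alpha * distances)
    weights /= weights.sum(axis=1, keepdims=True)      # softmax over the levels
    return weights @ levels                            # weighted sum of the levels

levels = np.arange(-4.0, 5.0)                # hypothetical quantization levels C
y = np.array([0.2, 1.7, -3.4])
print(soft_quantize(y, levels, alpha=2.0))   # soft assignment
print(soft_quantize(y, levels, alpha=50.0))  # close to np.round(y)
```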
  • the second quantization unit 206b uses a differentiable function to approximate quantization of the first latent variables (e.g. to approximate the quantization performed by the first quantization unit 206a).
  • the approximately quantized first latent variables are output to the first entropy estimator 212 to use, with parameters of the probability distributions of the latent variables y obtained from the probability estimator 210, to calculate the first rate measure.
  • the calculation of the first rate measure is discussed in more detail below.
  • the probability estimator 210 comprises a second analysis network 218, a third quantization unit 220, a second entropy estimator 222 and a second synthesis network 224.
  • the probability estimator 210 uses the second analysis network 218 to generate a plurality of second latent variables, z, based on the first latent variables y.
  • the second latent variables may be referred to as a hyperprior or hyperpriors.
  • the probability estimator 210 may fulfil a second purpose of generating the second latent variables to accompany the encoded data (e.g. as side information to aid decoding).
  • the autoencoder 200 may additionally compress the second latent variables and provide the compressed latent variables with the encoded data.
  • the second analysis network 218 comprises a neural network which further downsamples the first latent variables to obtain the second latent variables z.
  • the neural network may be, for example, a convolutional neural network.
  • the second latent variables may comprise features extracted from the first latent variables.
  • the second latent variables may reflect dependencies (e.g. spatial dependencies) amongst the first latent variables.
  • the second analysis network 218 may be trained to generate the second latent variables as part of training the autoencoder 200.
  • the second latent variables are input to a third quantization unit 220.
  • the third quantization unit 220 quantizes or approximately quantizes the plurality of second latent variables to obtain a plurality of second quantized latent variables.
  • quantization refers to restricting the values of the second latent variables to a prescribed set of values and approximating quantization refers to performing one or more operations on the second latent variables to simulate or imitate quantization.
  • the second latent variables may be quantized by rounding to the nearest integer.
  • the second latent variables may be approximately quantized by applying noise sampled from a uniform distribution in the range $-\tfrac{1}{2}$ to $\tfrac{1}{2}$. Adding noise in this manner effectively simulates the rounding process, since rounding a number to the closest integer involves adding a value in the range $-\tfrac{1}{2}$ to $\tfrac{1}{2}$ to the number.
  • the third quantization unit 220 may thus use any suitable quantization or approximate quantization process to obtain the second quantized latent variables.
  • the third quantization unit 220 may use any of the approaches described above in respect of the first and second quantization units 206a, 206b, for example.
  • the third quantization unit 220 outputs the plurality of second quantized latent variables ẑ to a second entropy estimator 222.
  • the second entropy estimator 222 determines a second rate measure R 2 based on the second quantized latent variables.
  • the second rate measure is indicative of the code length or bit rate of the second quantized latent variables. Since, in use (e.g. in the inference stage or after deployment), the autoencoder 200 is operable to provide the second quantized latent variables with the encoded data as side information, this second rate measure is indicative of the cost of providing this side information.
  • the side information provides information regarding the parameters of the entropy model for the first quantized latent variables.
  • the second entropy estimator 222 may determine the second rate measure R 2 based on probability distributions of the plurality of second quantized latent variables.
  • the second entropy estimator 222 may fit models to the probability density distributions or probability mass functions of the second quantized latent variables. For example, the second entropy estimator 222 may, for each of the second quantized latent variables, determine one or more parameters of a model which represents its probability mass or density function.
  • the models may be referred to as fully factorized entropy models, for example.
  • the model captures the probability of a particular second quantized latent variable having a particular value or range of values.
  • the second entropy estimator 222 may assume any suitable model for fitting the second quantized latent variables.
  • the distributions of the second quantised latent variables may be modelled using a piece-wise function or one or more Laplace distributions.
  • a further neural network may be used to determine the parameters of the models used for the probability distributions of the second quantized latent variables.
  • the second rate measure may be determined based on probability distributions of the plurality of second quantized latent variables.
  • the second rate measure may be determined according to $R_2 = -\frac{1}{S}\sum_{i} \log_2 p_{\hat{z}_i}$, in which $p_{\hat{z}_i}$ is a probability of a second quantized latent variable $\hat{z}_i$ in the plurality of second quantized latent variables $\hat{z}$ having its value, as determined using the models fitted by the second entropy estimator 222.
  • S is the size of the training data.
  • S may be the resolution of the image patch for example.
  • the size may be quantified in any suitable way.
  • the size may be the number of data points in the training data.
  • the size of training data comprising an image may be the number of pixels in the image, the number of pixels per unit of length (e.g. the number of pixels per inch, ppi) or the number of pixels per unit of area.
  • the factor of 1/S may thus normalize the second rate measure. In other examples, other normalisation factors may be used. Alternatively, the normalisation factor may be omitted.
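  • As an illustrative sketch (not from the patent), the second rate measure can be computed from per-element probabilities produced by a fitted factorized entropy model; the probability values and data size below are placeholders:

```python
import numpy as np

def second_rate_measure(probs_z_hat: np.ndarray, data_size: int) -> float:
    """R2 = -(1/S) * sum_i log2 p(z_hat_i), normalised by the training-data size S."""
    return float(-np.sum(np.log2(probs_z_hat)) / data_size)

# Placeholder probabilities that a fitted factorized entropy model might assign to
# each second quantized latent variable, and a placeholder training-patch size S.
probs_z_hat = np.array([0.40, 0.25, 0.10, 0.05])
S = 64 * 64
print(second_rate_measure(probs_z_hat, S))
```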
  • the second entropy estimator 222 also outputs the second quantized latent variables to the second synthesis network 224.
  • the second synthesis network 224 may receive the second quantized latent variables from the third quantization unit 220.
  • the second synthesis network 224 is operable to determine, based on the second quantized latent variables, parameters of models representing the probability distributions P(y) of the first latent variables.
  • the probability distribution of a first latent variable indicates how likely the first latent variable is to have a particular value or range of values.
  • the probability distribution of a latent variable may comprise a probability density function or a probability mass function.
  • the second synthesis network 224 may be operable, for each of the first latent variables y, to determine one or more parameters of a model representing its probability distribution function.
  • the probability distributions of the first latent variables are modelled using mixture models. Any suitable mixture models may be used such as, for example, Laplace mixture models or Logistic mixture models.
  • the second synthesis network 224 fits the probability distributions of the first latent variables with Gaussian mixture models.
  • the probability density function for a first latent variable $y_i$ may be represented by $f(y_i) = \sum_{j=1}^{N} w_j\,\mathcal{N}(y_i;\,\mu_j,\sigma_j^2)$ according to a Gaussian mixture model having $N$ (e.g. $N > 1$) components, in which $w_j$, $\mu_j$ and $\sigma_j$ are respectively the weight, mean and standard deviation of the $j$-th component.
  • the second synthesis network 224 may determine the parameters $w_j$, $\mu_j$ and $\sigma_j$ for each of the components of the Gaussian mixture model for the respective first latent variable.
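  • A small sketch (illustrative; the component weights, means and standard deviations are hypothetical) of evaluating the probability density of a latent value under such a Gaussian mixture model:

```python
import numpy as np

def gmm_pdf(y: float, weights, means, stds) -> float:
    """Probability density of a latent value y under a Gaussian mixture model."""
    weights, means, stds = map(np.asarray, (weights, means, stds))
    components = np.exp(-0.5 * ((y - means) / stds) ** 2) / (stds * np.sqrt(2.0 * np.pi))
    return float(np.sum(weights * components))

# Hypothetical two-component mixture for a single first latent variable.
print(gmm_pdf(0.3, weights=[0.6, 0.4], means=[0.0, 2.0], stds=[1.0, 0.5]))
```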
  • the second synthesis network 224 may be configured with a predetermined number of components, N, to use.
  • the second synthesis network 224 may be configured to use Gaussian mixture models having two components.
  • the number of components may also be determined by the second synthesis network 224 (e.g. as part of the model fitting process).
  • the number of components used in model fitting may be determined through training the autoencoder 200.
  • the second synthesis network 224 determines one or more parameters of mixture models representing probability distributions of the first latent variable y based on the plurality of second quantized latent variables z.
  • the second synthesis network 224 comprises a neural network (e.g. a convolutional neural network) which can be trained for this purpose. For example, one or more weights of the second synthesis network 224 may be updated as the autoencoder 200 is trained.
  • the second synthesis network 224 outputs the mixture model parameters to the first entropy estimator 212.
  • the first entropy estimator 212 uses the approximately quantized first latent variables ỹ and the mixture model parameters to determine the first rate measure R₁.
  • the first rate measure is indicative of the code length or bit rate of the first quantized latent variables.
  • the first rate measure may be determined according to $R_1 = -\frac{1}{S}\sum_{i} \log_2 p_{\tilde{y}_i}$, in which $p_{\tilde{y}_i}$ is the probability of an approximately quantized first latent variable $\tilde{y}_i$ having its respective value. This probability is determined using the probability distributions of the latent variables $y$ according to the parameters provided by the probability estimator 210.
  • the probability of an approximately quantized first latent variable having a value $\tilde{y}_i$ may be determined according to $p_{\tilde{y}_i} = \int_{\tilde{y}_i - \frac{1}{2}}^{\tilde{y}_i + \frac{1}{2}} f(y_i)\,\mathrm{d}y_i$, in which $f(y_i)$ is the probability density function of $y_i$ based on the parameters provided by the second synthesis network 224.
  • the probability $p_{\tilde{y}_i}$ may be determined according to $p_{\tilde{y}_i} = \sum_{j=1}^{N} \frac{w_j}{2}\left[\mathrm{Erf}\left(\frac{\tilde{y}_i + \frac{1}{2} - \mu_j}{\sqrt{2}\,\sigma_j}\right) - \mathrm{Erf}\left(\frac{\tilde{y}_i - \frac{1}{2} - \mu_j}{\sqrt{2}\,\sigma_j}\right)\right]$ for a Gaussian mixture model having $N$ components with weights $w_j$, means $\mu_j$ and standard deviations $\sigma_j$ (e.g. as determined by the probability estimator 210), in which Erf(.) is the Gaussian error function.
  • S is the size of the training data as described above in respect of the second rate measure.
  • the first rate measure may thus be normalised based on the size of the training data. In other examples, other normalisation factors may be used. Alternatively, the normalisation factor may be omitted.
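  • A minimal sketch (not from the patent; the latent values and mixture parameters are hypothetical) of computing the first rate measure by evaluating the mixture-model probability of each approximately quantized latent over a unit-width interval using the Gaussian error function:

```python
import math

def gmm_interval_probability(y_tilde, weights, means, stds, half_width=0.5):
    """P(y_tilde - 1/2 < Y <= y_tilde + 1/2) under a Gaussian mixture model,
    evaluated with the Gaussian error function Erf(.)."""
    p = 0.0
    for w, mu, sigma in zip(weights, means, stds):
        upper = math.erf((y_tilde + half_width - mu) / (sigma * math.sqrt(2.0)))
        lower = math.erf((y_tilde - half_width - mu) / (sigma * math.sqrt(2.0)))
        p += w * 0.5 * (upper - lower)
    return p

def first_rate_measure(y_tilde_values, mixture_params, data_size):
    """R1 = -(1/S) * sum_i log2 p(y_tilde_i)."""
    bits = 0.0
    for y_tilde, (weights, means, stds) in zip(y_tilde_values, mixture_params):
        bits -= math.log2(gmm_interval_probability(y_tilde, weights, means, stds))
    return bits / data_size

# Hypothetical approximately quantized latents and per-latent mixture parameters.
y_tilde_values = [0.1, -1.2]
mixture_params = [([0.6, 0.4], [0.0, 2.0], [1.0, 0.5]),
                  ([1.0], [-1.0], [0.8])]
print(first_rate_measure(y_tilde_values, mixture_params, data_size=64 * 64))
```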
  • the first entropy estimator 212 determines the first rate measure based on the approximately quantized first latent variables obtained from the second quantization unit 206b and the mixture model parameters obtained by the synthesis network 224 in the probability estimator 210.
  • the parameters of the autoencoder 200 may be updated by seeking to optimise (e.g. minimise) the rate-distortion function, in which the rate-distortion function is based on the first rate measure, second rate measure and the distortion. This may be referred to as training the autoencoder 200.
  • the skilled person will appreciate that there are various parameters of the autoencoder 200 which may be updated as part of this process.
  • the autoencoder 200 comprises four neural networks 204, 214, 218, 224. One or more weights of at least one of these neural networks 204, 214, 218, 224 may be updated as part of the training process.
  • the calculation of the first rate measure, second rate measure and the distortion as described above may be referred to as a forward pass of the autoencoder 200.
  • the calculation of the rate-distortion function may form part of the forward pass.
  • One or more parameters of the autoencoder 200 may be updated by seeking to optimise the rate-distortion function on the forward pass. For example, forward passes of the autoencoder 200 may be performed for a plurality of training data to explore the parameter space defined by the one or more parameters of the autoencoder 200 and find a combination of parameters which optimise (e.g. minimise) the rate-distortion function.
  • a backward pass of the autoencoder 200 may also be performed.
  • gradients of the rate-distortion function with respect to the one or more parameters of the autoencoder 200 are calculated. These gradients may indicate the sensitivity of the rate-distortion function to changes in the parameters of the autoencoder 200.
  • the gradients can be used to determine which parameters of the autoencoder 200 to change and/or by how much. The skilled person will appreciate that there are various ways in which the gradients may be used to update the one or more parameters of the autoencoder 200.
  • a parameter of the network (e.g. a weight of one of the neural networks in the autoencoder 200), $w_i$, can be updated by computing the gradient of the rate-distortion function, $L'(w_i)$, with respect to $w_i$ and updating the parameter according to $w_i \leftarrow w_i - k\,L'(w_i)$, in which $k$ is the learning rate.
  • the value of k can be used to control how much the parameters of the autoencoder 200 are changed with each iteration of the training process.
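  • A toy sketch of this update rule (illustrative; the weights, gradients and learning rate are hypothetical):

```python
def gradient_step(weights, gradients, learning_rate):
    """One plain gradient-descent update: w_i <- w_i - k * L'(w_i)."""
    return [w - learning_rate * g for w, g in zip(weights, gradients)]

# Hypothetical weights and gradients of the rate-distortion (loss) function.
weights = [0.5, -1.2, 3.0]
gradients = [0.1, -0.4, 0.02]
print(gradient_step(weights, gradients, learning_rate=1e-3))
```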
  • one or more weights of the autoencoder 200 may be updated based on the gradients of the distortion, the first rate measure and the second rate measure.
  • the first rate measure is determined using the approximately quantized latent variables, which are approximately quantized using a differentiable function. This means that the first rate measure is differentiable and thus the gradient of the first rate measure can be directly calculated.
  • the distortion is based on the quantized first latent variables obtained by the first quantization unit.
  • the quantization performed by the first quantization unit 206a may be replaced by an identity function on the backward pass. This may be used when universal quantization is used on the forward pass, for example.
  • any suitable differentiable approximation may be used to approximate the quantization performed by the first quantization unit 206a on the backward pass.
  • any of the approximations discussed in respect of the second quantization unit 206b may be used.
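  • One common way to realise "quantize on the forward pass, identity gradient on the backward pass" in an automatic-differentiation framework is the straight-through trick sketched below (an illustrative PyTorch snippet, not taken from the patent):

```python
import torch

def round_with_identity_gradient(y: torch.Tensor) -> torch.Tensor:
    """Forward pass: hard rounding. Backward pass: the gradient flows through unchanged,
    because the (round(y) - y) correction is detached from the computation graph."""
    return y + (torch.round(y) - y).detach()

y = torch.tensor([0.2, 1.7, -3.4], requires_grad=True)
y_hat = round_with_identity_gradient(y)
y_hat.sum().backward()
print(y_hat)    # rounded values: 0., 2., -3.
print(y.grad)   # identity gradient: 1., 1., 1.
```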
  • the second rate measure can be determined based on the quantized second latent variables obtained by the third quantization unit 220.
  • the quantization performed by the third quantization unit 220 may be replaced by an identity function on the backward pass.
  • any suitable differentiable approximation may be used to approximate the quantization performed by the third quantization unit 220 on the backward pass.
  • any of the approximations discussed in respect of the second quantization unit 206b may be used.
  • a gradient descent process may be used to update the one or more parameters of the autoencoder 200.
  • Any suitable gradient descent process may be used such as, for example, the Adam process described in Adam: A Method for Stochastic Optimization, Diederik P. Kingma and Jimmy Ba, arXiv: 1412.6980, 2017; Stochastic Gradient Descent (SGD); Root Mean Squared Propagation (RMSprop), Adadelta or Adagrad.
  • the Adam optimizer was demonstrated to be particularly efficient for solving deep learning problems.
  • the rate-distortion function used when updating the model takes the form $\lambda D + R_1 + R_2$.
  • one or more parameters of the autoencoder 200 may be updated based on the distortion and the first rate measure.
  • the autoencoder may be trained by seeking to minimise the distortion D whilst meeting a constraint (e.g. a minimum or maximum value) on the first rate measure.
  • a simpler rate-distortion function may be used which omits the second rate measure.
  • the rate-distortion function may take the form $\lambda D + R_1$.
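  • A small sketch (illustrative; the lambda value and the distortion and rate values are hypothetical) of the two forms of rate-distortion loss described above:

```python
def rate_distortion_loss(distortion, rate_1, rate_2=None, lam=50.0):
    """Loss used to train the autoencoder: lam * D + R1, plus R2 when the hyperprior
    side information is also transmitted."""
    loss = lam * distortion + rate_1
    if rate_2 is not None:
        loss += rate_2
    return loss

# Hypothetical distortion and rate values for one training batch.
print(rate_distortion_loss(distortion=0.002, rate_1=0.85, rate_2=0.05))
print(rate_distortion_loss(distortion=0.002, rate_1=0.85))  # simpler form without R2
```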
  • aspects of the disclosure thus provide an improved method of training an autoencoder for data compression which uses mixed quantisation and mixture models during training to provide a more effective autoencoder.
  • Autoencoders trained using the methods disclosed herein can compress data more effectively, reducing the resources needed to store and/or transmit the compressed data (e.g. storage space, network resources etc.). Moreover, these advantages can be achieved whilst minimising distortion and/or data loss which can occur during data compression.
  • Although the training method described in respect of the autoencoder 200 shown in Figure 2 comprises determining a second rate measure and updating the parameters of the autoencoder 200 based on the second rate measure, the skilled person will appreciate that the disclosure is not limited as such.
  • the advantages of using mixed quantisation with mixture models may still be achieved even when a second rate measure is not calculated.
  • omitting the calculation of the second rate measure may reduce the complexity of the training process.
  • the present disclosure further provides a computer-implemented method of compressing data using the autoencoder 200.
  • the method may, for example, be used to compress data when the autoencoder 200 is deployed (e.g. after training).
  • one or more components of the autoencoder 200 may be provided for the purpose of training and may thus be omitted from the autoencoder 200 once deployed.
  • one or more of the first entropy estimator 212, the second quantization unit 206b and the second entropy estimator 222 may be omitted from the deployed autoencoder since the primary purpose of calculating the first and second rate measures is training.
  • One or more of the first quantization unit 206a, the first synthesis network 214 and the distortion calculator 216 may be omitted for similar reasons.
  • the training of the autoencoder 200 may continue after deployment (e.g. the autoencoder 200 may be updated through use after training on an initial dataset).
  • the autoencoder 200 may retain some or all the aforementioned components after initial training has been performed.
  • the first and second entropy estimators 212, 222 may be replaced with the entropy encoder during inference time (e.g. once deployed).
  • the method of compressing data using the autoencoder 200 begins with obtaining data to be compressed at the input unit 202.
  • the first analysis network 204 generates a plurality of first latent variables based on the data.
  • the first analysis network 204 may operate in substantially the same way during training and during deployment (e.g. subject to any changes to parameters, such as weights, of the first analysis network 204 during training).
  • the first analysis network 204 outputs the plurality of first latent variables to the probability estimator 210, and a fourth quantization unit (not illustrated).
  • the probability estimator 210 uses the second analysis network 218 to generate a plurality of second latent variables, based on the first latent variables.
  • the third quantization unit 220 quantizes the plurality of second latent variables to provide a plurality of second quantized latent variables.
  • the third quantization unit 220 may perform quantization in the same manner as described above in respect of the training process, for example.
  • the probability estimator 210 further comprises a second entropy encoder which encodes the plurality of second quantized latent variables to be provided with the encoded first quantized latent variables.
  • the second quantized latent variables may be provided as side information to aid in the decoding process.
  • the second entropy encoder may assume any suitable model for the probability distributions of the second quantized latent variables for this encoding process. In some examples, the same models may be used by the second entropy estimator 222 during training and the second entropy encoder during deployment.
  • the quantized second latent variables are also provided to the second synthesis network 224 which determines, based on the second quantized latent variables, parameters of mixture models representing the probability distributions P(y) of the first latent variables.
  • the second synthesis network 224 may operate in substantially the same way during training and deployment, for example (e.g. subject to any changes to parameters, such as weights, of the second synthesis network 224 during training).
  • the second synthesis network 224 outputs the parameters of the mixture models to the first entropy encoder.
  • the fourth quantization unit quantizes the plurality of first latent variables to obtain a plurality of first quantized latent variables.
  • the fourth quantization unit may quantize the first latent variables in a same or similar manner to the first quantization unit 206a described above in respect of the training process.
  • the fourth quantization unit may comprise the first quantization unit 206a.
  • the fourth quantization unit outputs the plurality of first quantized latent variables to the first entropy encoder (not illustrated in Figure 2).
  • the first entropy encoder encodes the plurality of first quantized latent variables based on mixture models representing a probability distribution of the first latent variables obtained by the probability estimator 210.
  • the first entropy encoder may operate in a similar manner to the encoding unit 108 described above in respect of Figure 1, except that the mixture models are used for the probability distributions of the quantized latent variables.
  • the first entropy encoder may use any suitable form of entropy encoding such as, for example, arithmetic encoding or range encoding.
  • the autoencoder 200 can compress data to generate encoded data comprising the encoded first quantized latent variables and the encoded second quantized latent variables.
  • the encoded data may be stored, processed and/or transmitted in any suitable way.
  • Although the aforementioned method for compressing data using the autoencoder 200 comprises providing both the encoded first quantized latent variables and the encoded second quantized latent variables, the skilled person will appreciate that the disclosure is not limited as such.
  • Although the encoded second quantized latent variables aid in the decoding of the encoded first quantized latent variables, they may, in other examples, not be output by the autoencoder 200.
  • the compressed data may comprise only the encoded first quantized latent variables.
  • the autoencoder may comprise only a single entropy encoder which performs the operations of both the first entropy encoder and the second entropy encoder.
  • the present disclosure further provides a computer-implemented method for decompressing compressed data, in which the compressed data was compressed using the autoencoder 200 (e.g. using the aforementioned compression method).
  • Decoding may be performed by the autoencoder 200 itself or a modified version thereof.
  • one or more components of the autoencoder 200 may be omitted.
  • the first entropy estimator 212, the second quantization unit 206b and the second entropy estimator 222 may be omitted from the deployed autoencoder since the primary purpose of calculating the first and second rate measures is training.
  • One or more of the first quantization unit 206a and the distortion calculator 216 may be omitted for similar reasons.
  • the first analysis network 204 and/or the second analysis network 218 may also be omitted, since these networks may be operable to compress data, rather than reconstruct compressed data.
  • the method begins with obtaining compressed data.
  • the compressed data may comprise encoded first quantized latent variables and encoded second quantized latent variables.
  • the encoded second quantized latent variables may be considered to be side information to assist with the decoding of encoded first quantized latent variables.
  • the autoencoder 200 may comprise a first entropy decoder and a second entropy decoder. However, the skilled person will appreciate that, in other embodiments, the autoencoder 200 may only comprise a single entropy decoder which is operable to perform the operations of both the first and second entropy decoders.
  • the encoded second quantized latent variables may be decoded first by inputting the encoded second quantized latent variables into the second entropy decoder.
  • the skilled person will be familiar with entropy decoding, so the process will not be discussed in detail here.
  • an entropy decoder can recover information from a set of codewords or the partitioned intervals based on the probability distribution used during the encoding process.
  • the second entropy decoder decodes the encoded second quantized latent variables to obtain the second quantized latent variables. Since entropy encoding is, in general, lossless, the second quantized latent variables may be reconstructed with little or no data loss. The second entropy decoder is thus operable to undo or perform the inverse of the second entropy encoder described above in respect of the encoding process.
  • the autoencoder 200 uses the second synthesis network 224 to determine, based on the second quantized latent variables obtained from the second entropy decoder, one or more parameters of mixture models representing the probability distributions of the first quantized latent variables (e.g. the probability distributions of information encoded in the encoded first quantized latent variables).
  • the second synthesis network 224 may make this determination in the same or similar manner as it determines the mixture model parameters for the first entropy estimator 212 during the training process, as described above.
  • the first entropy decoder uses the one or more parameters of the mixture models obtained by the second synthesis network 224 to decode the encoded first quantized latent variables and obtain the first quantized latent variables.
  • the first entropy decoder is thus operable to undo or perform the inverse of the first entropy encoder described above in respect of the encoding process.
  • the autoencoder 200 inputs the first quantized latent variables into the first synthesis network 214 to reconstruct the compressed data.
  • the first synthesis network 214 may operate in the same or substantially the same way as described above in respect of the training process.
  • the first synthesis network 214 outputs the reconstructed data.
  • the autoencoder 200 may be used to decompress or decode compressed data.
  • a single autoencoder may be operable to compress and reconstruct data.
  • a single autoencoder may be trained according to the methods described above and be operable to both compress and reconstruct data.
  • the autoencoder may be provided with a single unit which is operable to perform both the entropy encoding and entropy decoding steps outlined above. This unit may be referred to as, for example, an entropy encoder-decoder or an entropy coding unit.
  • Figure 3 shows a computer-implemented method 300 of training an autoencoder for data compression according to embodiments of the disclosure.
  • the method 300 is for training an autoencoder which comprises a first neural network, a probability estimator, and an entropy encoder.
  • the method 300 may be used to train the autoencoder 200 described above in respect of Figure 2, for example.
  • the first neural network is for processing and downsampling input data to generate latent variables.
  • the first neural network may be considered to concentrate the input data (for example as opposed to merely reducing the dimensionality) because essential or prominent features in the data are not lost.
  • the first neural network may be any suitable type of neural network such as, for example, a convolutional neural network.
  • the first neural network may comprise the first analysis network 204 described above in respect of Figure 2, for example.
  • the probability estimator is for determining probability distributions of the latent variables.
  • the probability estimator is operable to model the probability distributions of the latent variables using mixture models, such as Gaussian mixture models.
  • the probability estimator may comprise the probability estimator 210 described above in respect of Figure 2. However, the skilled person will appreciate that the present disclosure is not limited as such and, in general, any suitable probability estimator may be used.
  • the entropy encoder is for compressing the latent variables based on the probability distributions.
  • the entropy encoder is operable to apply a lossless compression scheme to encode the latent variables.
  • the entropy encoder may use any suitable entropy encoding technique for this purpose such as, for example, arithmetic coding. Although the entropy encoder may not be involved in training the autoencoder, it may be used during deployment of the autoencoder to encode the latent variables and thereby provide a further compression layer.
  • the method begins in step 302 in which training data is input to the first neural network to generate a plurality of first latent variables.
  • This step may be performed in accordance with the operation of the input unit 202 and the first analysis network 204 described above in respect of Figure 2, for example.
  • the training data may comprise the training data described above in respect of Figure 2.
  • a distortion measure is determined based on a comparison of the training data to a reconstruction of the training data.
  • the reconstruction of the training data is based on a plurality of first quantized latent variables obtained by quantizing the plurality of first latent variables.
  • Step 304 may be performed in accordance with the operations of the first quantisation unit 206a, the first synthesis network 214 and the distortion calculator 216 described above in respect of Figure 2, for example.
  • the first latent variables may be quantized in the manner described above in respect of the first quantization unit 206a to obtain the plurality of first quantized latent variables.
  • the plurality of first quantized latent variables may be input to a second neural network (e.g. the first synthesis network 214) to reconstruct the training data.
  • the second neural network may comprise a convolutional neural network, for example.
  • any other suitable method for reconstructing the training data based on the plurality of first quantized latent variables may be used.
  • the first quantized latent variables are effectively decoded to attempt to obtain the original training data.
  • the reconstruction of the training data is compared to the training data that was input to the autoencoder in order to determine the distortion measure.
  • the distortion measure may be determined in accordance with the operations of the distortion calculator 216.
  • the probability estimator is used to, for respective first latent variables of the plurality of first latent variables, obtain one or more parameters of respective mixture models representing a probability distribution of the respective first latent variable.
  • the probability estimator may be used to determine respective parameters of a mixture model for each of the plurality of first latent variables.
  • the probability estimator may determine the one or more parameters by first generating a plurality of second latent variables (e.g. a hyperprior) and then using the second latent variables to determine the probability distribution (e.g. probability density function) of the first latent variables.
  • the probability estimator may quantize the plurality of second latent variables and input the second quantized latent variables into a third neural network (e.g. the second synthesis network 224) to determine the one or more parameters.
  • Step 306 may be performed in accordance with the operation of the probability estimator 210, for example.
  • the probability estimator may use any other suitable method for obtaining the one or more parameters in step 306. For example, the probability estimator may assume a prior for the first latent variables and determine the one or more parameters of the mixture models based on the assumed prior.
  • a first rate measure is determined based on the mixture models and a plurality of approximately quantized latent variables.
  • the plurality of approximately quantized latent variables are obtained by applying a differentiable function to the plurality of first latent variables to approximate the quantization of the plurality of first latent variables.
  • the plurality of first latent variables may be approximately quantized in any suitable manner. For example, the quantization of the plurality of first latent variables may be approximated in accordance with the operation of the second quantization unit 206b described in respect of Figure 2.
  • Although the first rate measure is determined using the approximately quantized latent variables, rather than the first quantized latent variables specifically, it may still be indicative of a bit-rate or code length of the first quantized latent variables since the approximately quantized latent variables are effectively an approximation of the first quantized latent variables.
  • the first rate measure may be determined in accordance with the determination of a first rate measure R by the first entropy estimator 212 as described above in respect of Figure 2.
  • the method 300 may further comprise calculating a second rate measure.
  • the second rate measure may be determined based on the plurality of second latent variables generated by the probability estimator. This may be particularly beneficial when the autoencoder is operable to, when compressing data, provide side information based on the second latent variables to aid in decoding the compressed data.
  • the second rate measure may be determined in the same or a similar manner to the second rate measure R 2 described above in respect of Figure 2, for example.
  • one or more parameters of the autoencoder is updated based on the distortion measure and the first rate measure.
  • the one or more parameters may be updated based on a rate-distortion function or loss function determined based on the distortion measure and the first rate measure.
  • the updating of the parameters of the autoencoder may also be based on the second rate measure.
  • the one or more parameters may be updated based on a loss function or rate-distortion function determined based on the distortion measure, the first rate measure and the second rate measure.
  • the parameters of the autoencoder may comprise one or more weights of any neural networks comprised in the autoencoder, such as the first neural network.
  • where the autoencoder comprises further neural networks (e.g. the first and second synthesis networks and/or the second analysis network described above in respect of Figure 2), one or more weights of at least one of the further neural networks may be updated in addition to or instead of the weights of the first neural network.
  • Step 310 may be performed in accordance with the updating of the autoencoder 200 described above in respect of Figure 2, for example.
  • Although the training of the autoencoder 200 involved calculating a second rate measure, the skilled person will appreciate that the method described above in respect of Figure 2 may be modified to omit the determination and use of the second rate measure.
  • a method of compressing data using an autoencoder trained using the method 300 is also provided. In some embodiments, the following steps may form part of the method 300 itself.
  • the method of compressing data comprises inputting data to be compressed into the first neural network to generate a plurality of third latent variables. This step may be performed in accordance with step 302 described above, except in respect of the data to be compressed, rather than training data.
  • the data to be compressed and the training data may be a same type of data.
  • the autoencoder may be trained using one or more first images and then used to compress one or more second images.
  • the autoencoder may be trained using first video data and used to compress second video data.
  • the plurality of third latent variables are quantized to obtain a plurality of third quantized latent variables.
  • the plurality of third latent variables may be quantized in accordance with the operations of the first quantization unit 206a of the autoencoder 200 described above, for example.
  • the probability estimator of the autoencoder is used to, for respective third latent variables of the plurality of third latent variables, obtain one or more parameters of respective second mixture models representing a probability distribution of the respective third latent variable.
  • the skilled person will appreciate that there are various ways in which the probability estimator may obtain the parameters of the mixture models.
  • the probability estimator may operate in the same or similar manner as the probability estimator 210 described above in respect of the encoding method using the autoencoder 200 of Figure 2.
  • the plurality of third quantized latent variables are encoded using the entropy encoder to obtain encoded data.
  • This step may be performed in accordance with the operation of the first entropy encoder described above, for example.
  • the method of compressing data further comprises outputting the encoded data.
  • the encoded data may be, for example, stored, processed and/or transmitted to another node or unit.
  • image data includes both still and moving images (e.g., video data).
  • Figure 4 shows the rate-distortion curve for data compressed by a first autoencoder trained according to the embodiments of the disclosure. This is shown by the upper dashed line with crosses as markers.
  • Figure 4 also includes rate-distortion curves for data compressed by a second autoencoder trained using mixed quantization (dotted line, point-like marker), data compressed using a mean scale hyperprior (MSH; dot-dashed line, square marker) and data compressed using Better Portable Graphics (BPG; lower dashed line, circular marker).
  • All four lines were generated by compressing the Kodak standard image data set (http://r0k.us/graphics/kodak/) with the respective compression processes.
  • the Kodak standard image data set comprises 24 test images.
  • the first and second autoencoders were implemented for the mean-scale model in the CompressAI framework developed by InterDigital, which is available at https://github.com/InterDigitalInc/CompressAI (Accessed: Sep. 20, 2021). Both the first and second autoencoders were trained using the ImageNet dataset (www.image-net.org/). The second autoencoder was trained using mixed quantisation and a univariate Gaussian distribution.
  • the first autoencoder (trained according to the embodiments of the disclosure) was trained using mixed quantization and a Gaussian mixture model having two components (e.g. two mixtures). More specifically, the first autoencoder was implemented and trained in accordance with the autoencoder 200 and training method described above in respect of Figure 2.
  • the first autoencoder comprises a first analysis network 204, a first synthesis network 214, a second analysis network 218 and a second synthesis network 224, as described in respect of Figure 2.
  • the architectures of these neural networks are shown in Figures 5 and 6; an illustrative code sketch based on this description is provided after this list.
  • the first analysis network 204 comprises four convolutional layers, with a generalized divisive normalization (GDN) layer between subsequent convolutional layers.
  • the GDN layers may be nonlinear layers which are operable to Gaussianise inputs across channels.
  • Each convolutional layer applies a convolution with a kernel size of 5x5 in the spatial dimensions and produces an output of C channels, denoted by 5x5xC, in which C is the number of channels produced by the convolution.
  • the convolutional layers are operable to downsample (d) input data by a factor of 2. However, the skilled person will appreciate that these convolutional layers may be adapted to downsample input data by any suitable factor.
  • the first synthesis network 214 has an analogous structure, comprising four convolutional layers operable to take data with a convolutional kernel 5x5xC. Subsequent convolutional layers are separated by respective IGDN layers, each of which is an approximate inverse of a GDN layer.
  • An IGDN layer may be referred to as an inverse generalized divisive normalization layer.
  • the IGDN layers may be nonlinear.
  • the convolutional layers are operable to upsample (u) input data.
  • the final convolutional layer outputs data having a channel dimension of 3, i.e., where there are three (R, G, B) channels in the image data.
  • the second analysis network 218 comprises three convolutional layers which are operable to downsample input data. Subsequent convolutional layers are separated by respective Rectified Linear Unit (ReLU) or rectified linear activation function layers.
  • the ReLU layers can act as nonlinear activation function layers.
  • the three convolutional layers, each separated by a respective ReLU layer, are operable to process the data with a convolution of 5x5xC.
  • the first convolutional layer performs a convolution with a kernel. The first convolutional layer may thus perform data processing.
  • Although the second and third convolutional layers are operable to process and downsample input data by a factor of 2, the skilled person will appreciate that these convolutional layers may be adapted to downsample input data by any suitable factor.
  • the second synthesis network 224 comprises six convolutional layers.
  • the first three convolutional layers are operable to upsample data with convolutions of 5x5xC, 5x5xCx(3/2) and 5x5xCx3 respectively.
  • the ReLU layers may be nonlinear layers.
  • the subsequent three convolutional layers are also separated by respective ReLU layers but are operable to process data with convolutions of 1x1xC, 1x1x(Cx3x(N/2)) and 1x1x(CxNx3) respectively.
  • (1x1) indicates the kernel size
  • 3 is the number of parameters (mean, scale and mixing factor)
  • N is the number of mixtures
  • C is the number of channels (e.g., 192).
  • These convolutional layers produce parameters of the Gaussian mixture models (e.g. mean, scale and/or mixing factors) used to model the probability distributions of the first latent variables.
  • the rate-distortion curves show the peak signal-to-noise ratio (PSNR) for different bits per pixel (bpp).
  • the rate-distortion curves reflect how effectively compression preserves image features for different bit-rates.
  • the autoencoder trained according to embodiments of the disclosure (the first autoencoder) provides a higher PSNR at all bitrates than MSH and BPG encoding. Further, it outperforms the second autoencoder which uses mixed quantization with a univariate Gaussian, showing that the combination of mixture models with mixed quantization provides more effective image compression.
  • Embodiments of the disclosure thus provide an improved method for training an autoencoder for data compression.
  • methods of compressing and reconstructing data using the trained autoencoder are provided.
  • Training an autoencoder using the methods disclosed herein can advantageously improve compression efficiency, which reduces the resources needed to store and/or transmit data compressed using the trained autoencoder. This can reduce demands on storage space as well as network resources (e.g. if the compressed data is transmitted over a network).
  • these advantages can be achieved whilst minimising any distortion or data loss which may otherwise occur during data compression.
  • FIG 7 is a schematic diagram of an apparatus 700 for training an autoencoder for data compression according to embodiments of the disclosure.
  • the autoencoder comprises a first neural network for processing and downsampling input data to generate latent variables, a probability estimator for determining probability distributions of the latent variables, and an entropy encoder for compressing the latent variables based on the probability distributions.
  • the autoencoder may comprise the autoencoder 200 described above in respect of Figure 2, for example.
  • the apparatus 700 may comprise any suitable apparatus such as, for example, a computer, an embedded system, an accelerator (e.g. a neural network accelerator), and/or a custom ASIC.
  • the apparatus 700 may be operable to carry out the example method 300 described with reference to Figure 3 and possibly any other processes or methods disclosed herein. It is also to be understood that the method 300 of Figure 3 may not necessarily be carried out solely by the apparatus 700. At least some operations of the method can be performed by one or more other entities.
  • the apparatus 700 comprises processing circuitry 702 (such as one or more processors, digital signal processors, general purpose processing units, etc), a machine-readable medium 704 (e.g., memory such as read-only memory (ROM), random-access memory, cache memory, flash memory devices, optical storage devices, etc) and one or more interfaces 706.
  • the machine-readable medium 704 contains (e.g. stores) instructions which are executable by the processor such that the apparatus is operable to input training data into the first neural network to generate a plurality of first latent variables and determine a distortion measure based on a comparison of the training data to a reconstruction of the training data.
  • the reconstruction of the training data is based on a plurality of first quantized latent variables obtained by quantizing the plurality of first latent variables.
  • the apparatus 700 is further operable to use the probability estimator to, for respective first latent variables of the plurality of first latent variables, obtain one or more parameters of respective mixture models representing a probability distribution of the respective first latent variable.
  • the apparatus 700 is further operable to determine a first rate measure based on the mixture models and a plurality of approximately quantized latent variables, in which the plurality of approximately quantized latent variables are obtained by applying a differentiable function to the plurality of first latent variables to approximate the quantization of the plurality of first latent variables.
  • the apparatus 700 is further operable to update one or more parameters of the autoencoder based on the distortion measure and the first rate measure.
  • the machine-readable medium may store instructions which, when executed by the processing circuitry 702, cause the apparatus 700 to perform the steps described above.
  • processing circuitry 702 may be configured to directly perform the method, or to cause the apparatus 700 to perform the method, without executing instructions stored in the non-transitory machine-readable medium 704, e.g., through suitably configured dedicated circuitry.
  • the one or more interfaces 706 may comprise hardware and/or software suitable for communicating with other nodes of a communication network using any suitable communication medium.
  • the interfaces 706 may comprise one or more wired interfaces, using optical or electrical transmission media. Such interfaces may therefore utilize optical or electrical transmitters and receivers, as well as the necessary software to encode and decode signals transmitted via the interface.
  • the interfaces 706 may comprise one or more wireless interfaces. Such interfaces may therefore utilize one or more antennas, baseband circuitry, etc.
  • the components are illustrated coupled together in series; however, those skilled in the art will appreciate that the components may be coupled together in any suitable manner (e.g., via a system bus or suchlike).
  • the apparatus 700 may comprise power circuitry (not illustrated).
  • the power circuitry may comprise, or be coupled to, power management circuitry and is configured to supply the components of apparatus 700 with power for performing the functionality described herein.
  • Power circuitry may receive power from a power source.
  • the power source and/or power circuitry may be configured to provide power to the various components of apparatus 700 in a form suitable for the respective components (e.g., at a voltage and current level needed for each respective component).
  • the power source may either be included in, or external to, the power circuitry and/or the apparatus 700.
  • the apparatus 700 may be connectable to an external power source (e.g., an electricity outlet) via an input circuitry or interface such as an electrical cable, whereby the external power source supplies power to the power circuitry.
  • the power source may comprise a source of power in the form of a battery or battery pack which is connected to, or integrated in, the power circuitry.
  • the battery may provide backup power should the external power source fail.
  • Other types of power sources, such as photovoltaic devices, may also be used.
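As a reading aid for the architecture bullets above (Figures 5 and 6), the following is a minimal, hypothetical PyTorch-style sketch of the four networks. It assumes the GDN/IGDN layers provided by the CompressAI framework cited above; the kernel sizes, strides and channel widths are an illustrative interpretation of the description (in particular, the third 5x5 layer of the second synthesis network is given stride 1 here so that the hyper-decoder output matches the resolution produced by the two downsampling layers of the second analysis network, which is an assumption for internal consistency), not the patent's definitive architecture.

```python
# Hypothetical sketch of the networks described for Figures 5 and 6 (first analysis
# network 204, first synthesis network 214, second analysis network 218, second
# synthesis network 224). Widths and strides are an illustrative reading of the text.
import torch.nn as nn
from compressai.layers import GDN  # GDN and (with inverse=True) IGDN layers

C = 192  # number of channels
N = 2    # number of Gaussian mixture components

def conv(cin, cout, stride=2):
    return nn.Conv2d(cin, cout, kernel_size=5, stride=stride, padding=2)

def deconv(cin, cout):
    return nn.ConvTranspose2d(cin, cout, kernel_size=5, stride=2,
                              padding=2, output_padding=1)

# First analysis network (204): four 5x5 convs, each downsampling by 2, GDN in between.
g_a = nn.Sequential(
    conv(3, C), GDN(C),
    conv(C, C), GDN(C),
    conv(C, C), GDN(C),
    conv(C, C),
)

# First synthesis network (214): four 5x5 transposed convs, each upsampling by 2,
# IGDN in between, final layer producing the 3 (R, G, B) image channels.
g_s = nn.Sequential(
    deconv(C, C), GDN(C, inverse=True),
    deconv(C, C), GDN(C, inverse=True),
    deconv(C, C), GDN(C, inverse=True),
    deconv(C, 3),
)

# Second analysis network (218): three 5x5 convs with ReLUs; the first does not
# downsample, the second and third downsample by 2.
h_a = nn.Sequential(
    conv(C, C, stride=1), nn.ReLU(inplace=True),
    conv(C, C), nn.ReLU(inplace=True),
    conv(C, C),
)

# Second synthesis network (224): upsampling 5x5 layers followed by 1x1 layers whose
# final output holds C * N * 3 values (mean, scale and mixing weight per component
# and channel). The stride-1 third 5x5 layer is an assumption (see above).
h_s = nn.Sequential(
    deconv(C, C), nn.ReLU(inplace=True),
    deconv(C, C * 3 // 2), nn.ReLU(inplace=True),
    conv(C * 3 // 2, C * 3, stride=1), nn.ReLU(inplace=True),
    nn.Conv2d(C * 3, C * 3, kernel_size=1), nn.ReLU(inplace=True),
    nn.Conv2d(C * 3, C * N * 3 // 2, kernel_size=1), nn.ReLU(inplace=True),
    nn.Conv2d(C * N * 3 // 2, C * N * 3, kernel_size=1),
)
```

With these definitions, g_a reduces the spatial resolution of an input image by a factor of 16 while producing C = 192 latent channels, and h_s produces, for every latent element, the 3N Gaussian mixture parameters referred to in the bullets above.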


Abstract

A computer-implemented method of training an autoencoder for data compression is provided. The method comprises inputting data into a neural network to generate first latent variables and determining a distortion measure based on comparing the data to a reconstruction of the data. The reconstruction of the data is based on first quantized latent variables obtained by quantizing the first latent variables. The probability estimator is used to, for respective first latent variables, obtain one or more parameters of respective mixture models representing a probability distribution of the respective first latent variable. A first rate measure is determined based on the mixture models and approximately quantized latent variables, in which the approximately quantized latent variables are obtained by applying a differentiable function to the first latent variables to approximate quantization. One or more parameters of the autoencoder are updated based on the distortion measure and the first rate measure.

Description

AN AUTOENCODER FOR DATA COMPRESSION
Technical Field
Embodiments of the present disclosure relate to computer-implemented methods and apparatus relating to an autoencoder and, in particular, to computer-implemented methods and apparatus for training an autoencoder for data compression.
Background
Significant success has been achieved using autoencoders for data compression. With the advancement of neural network acceleration platforms, compression systems which use neural networks, such as autoencoders, can be run in devices with graphics processing units (GPUs), neural network accelerators, or dedicated application-specific integrated circuit (ASIC) devices. Whilst autoencoders have been successfully used to compress various types of data, particular success has been achieved when using autoencoders for image compression.
Autoencoders for image compression may comprise functions of input analysis, for determining a plurality of latent variables describing input image data; quantization for quantizing the plurality of latent variables (e.g., by rounding); and encoding for encoding the quantized latent variables (e.g., to compress the data).
Input analysis may be performed by a data-driven neural network, to downsample data for lossless encoding. An autoencoder can be trained to do this by encoding training data and then validating the encoding by attempting to regenerate the training data from the encoded data. In other words, the autoencoder may be trained by encoding an image in accordance with the steps outlined above (e.g., input analysis using a neural network, quantizing the latent variables that result from that analysis, and then encoding the quantized latent variables), attempting to reconstruct (e.g. decode) the encoded image, and updating weights of the neural network based on a comparison of the reconstructed image with the input image. These weights can be updated using gradient-based optimisation techniques (e.g. gradient-descent methods) to find the weights which minimise the information lost during encoding. A paper by Cheng et al (“Learned Image Compression with Discretized Gaussian Mixture Likelihoods and Attention Modules””, downloadable from https://arxiv.org/pdf/2001.01568.pdf at the date of filing) discloses a method of image compression that adopts this approach. Summary
One problem identified by the inventor is that gradient-based optimisation techniques may only be applied when the components of the autoencoder are differentiable. As quantization is an indifferentiable function, this poses a challenge for autoencoder training.
In one aspect, a computer-implemented method of training an autoencoder for data compression is provided. The autoencoder comprises a first neural network for processing and downsampling input training data to generate latent variables, a probability estimator for determining probability distributions of the latent variables, and an entropy encoder for compressing the latent variables based on the probability distributions. The method comprises inputting training data into the first neural network to generate a plurality of first latent variables and determining a distortion measure based on a comparison of the training data to a reconstruction of the training data. The reconstruction of the training data is based on a plurality of first quantized latent variables obtained by quantizing the plurality of first latent variables. The method further comprises using the probability estimator to, for respective first latent variables of the plurality of first latent variables, obtain one or more parameters of respective mixture models representing a probability distribution of the respective first latent variable. A first rate measure is determined based on the mixture models and a plurality of approximately quantized latent variables, in which the plurality of approximately quantized latent variables are obtained by applying a differentiable function to the plurality of first latent variables to approximate the quantization of the plurality of first latent variables. One or more parameters of the autoencoder are updated based on the distortion measure and the first rate measure.
In a further aspect, an apparatus configured to perform the aforementioned method is provided. The apparatus may comprise, for example, a computer. In another aspect, a computer program is provided. The computer program comprises instructions which, when executed on at least one processor of an apparatus, cause the apparatus to carry out the aforementioned method. In a further aspect, a carrier containing the computer program is provided, in which the carrier is one of an electronic signal, optical signal, radio signal, or non-transitory machine-readable storage medium.
A still further aspect of the present disclosure provides an apparatus for training an autoencoder for data compression, in which the autoencoder comprises a first neural network for processing and downsampling input data to generate latent variables, a probability estimator for determining probability distributions of the latent variables, and an entropy encoder for compressing the latent variables based on the probability distributions. The apparatus comprises a processor and a machine-readable medium, in which the machine-readable medium contains instructions executable by the processor such that the apparatus is operable to input training data into the first neural network to generate a plurality of first latent variables and determine a distortion measure based on a comparison of the training data to a reconstruction of the training data. The reconstruction of the training data is based on a plurality of first quantized latent variables obtained by quantizing the plurality of first latent variables. The apparatus is further operable to use the probability estimator to, for respective first latent variables of the plurality of first latent variables, obtain one or more parameters of respective mixture models representing a probability distribution of the respective first latent variable. The apparatus is further operable to determine a first rate measure based on the mixture models and a plurality of approximately quantized latent variables, in which the plurality of approximately quantized latent variables are obtained by applying a differentiable function to the plurality of first latent variables to approximate the quantization of the plurality of first latent variables. The apparatus is further operable to update one or more parameters of the autoencoder based on the distortion measure and the first rate measure.
Aspects of the disclosure thus provide an improved method of training an autoencoder for data compression which uses mixed quantisation and mixture models during training to provide a more effective autoencoder.
Detailed description of the drawings
Embodiments of the disclosure will now be described with reference to, by way of example only, the following drawings:
Figure 1 shows an example of an autoencoder for compressing data;
Figure 2 shows an example of an autoencoder according to embodiments of the disclosure;
Figure 3 shows a computer-implemented method of training an autoencoder according to embodiments of the disclosure;
Figure 4 shows rate-distortion curves for different data compression techniques;
Figures 5 and 6 show structures of neural networks in an autoencoder according to embodiments of the disclosure; and
Figure 7 shows an apparatus for training an autoencoder according to embodiments of the disclosure.
Detailed description
Autoencoders can be used to compress data, such as images. An example autoencoder may comprise a neural network which generates latent variables representing features of the data to be compressed, a quantization unit or quantizer which discretises the latent variables and an entropy encoder which further compresses the quantized latent variables. The entropy encoder encodes the quantized latent variables based on the probabilities of the quantized variables having a particular value or range of values. There are various ways in which these probabilities may be determined.
One way of modelling the probability distributions of the latent variables is by first generating hyperpriors which capture dependencies between the latent variables and then using the hyperpriors to determine parameters of the likelihoods of the latent variables. The dependencies between latent variables may be indicative of underlying properties of the data to be encoded. For example, neighbouring elements of latent variables generated based on image data may be correlated. This additional information may be reflected in the hyperprior.
The hyperpriors may be generated by analysing the latent variables with a second neural network. The hyperpriors may comprise further latent variables. The hyperpriors may be used to predict the parameters of entropy models to model the probability distributions of the latent variables. These probability distributions or likelihoods may then be used by an entropy encoder in the autoencoder to compress the quantized latent variables.
In addition, since the hyperpriors reflect properties of the latent variables, the hyperpriors may be provided with the encoded data to improve data reconstruction. In particular, the hyperpriors may be compressed (e.g. encoded using an entropy encoder) and provided with the encoded data such that both the encoded data and the compressed hyperprior may be used to reconstruct the input data.
Hyperpriors may thus be used to provide side information to accompany the encoded data and aid decoding, as well as being used to more accurately determine the parameters of the models (e.g. entropy models) of the probability distributions of the latent variables. This can reduce the bit rate more effectively than using a fully factorized entropy model. However, the accuracy of the models used for the probability distributions of the latent variables is still limited by the choice of model. The probability distribution of a latent variable can, for example, be modelled using single univariate Gaussians parameterised by the scale or scale and mean. However, these models may not accurately represent the underlying distributions. The effectiveness of an autoencoder can partially be limited by how accurately the probability distributions of the latent variables are modelled.
The effectiveness of an autoencoder also depends on its training. In general, an autoencoder can be trained to compress data by compressing training data and then validating the compression by attempting to regenerate the training data from the compressed data. The effectiveness of the autoencoder can be captured in a rate-distortion function, which compares the distortion caused by encoding the data to how effectively the data is compressed (e.g. how much smaller the encoded data is than the input data). Autoencoders can be trained by seeking to optimise the rate-distortion function and updating parameters of the autoencoder accordingly. There are many techniques which may be used to optimise (e.g. minimise) the rate-distortion function. Training methods which use gradients of the rate-distortion function to achieve quicker convergence may be particularly promising. However, these methods can only be used effectively when the components of the autoencoder are differentiable. As quantization is an indifferentiable function, this poses a challenge for autoencoder training.
An exemplary autoencoder 100 for image compression is shown in Figure 1. The autoencoder 100 comprises an input unit 102, an analysis network 104, a quantization unit 106 and an encoding unit 108. In operation, the input unit 102 of the autoencoder 100 obtains an image x.
The image x is input to the analysis network 104 which determines a plurality of latent variables y representing features of the image. In this context, a latent variable may be a variable which is inferred from the input data. The analysis network 104 comprises a neural network that has been trained to process an input image to extract features and generate corresponding latent variables. The analysis network 104 may thus be operable to map the image to a plurality of variables in a latent space. By representing image features using latent variables, the analysis network 104 can generate a compressed representation of the image. The analysis network 104 thus downsamples and processes the input image to obtain the latent variables. The latent variables y are output to the quantization unit 106, which quantizes the plurality of latent variables to obtain a plurality of quantized latent variables y. For example, the quantization unit 106 may round each value in the latent variables y to the closest integer in order to discretise the latent variables y received from the analysis network 104.
The encoding unit 108 encodes the quantized latent variable data to further compress the image data. The encoding unit 108 uses an entropy encoder for this purpose. The skilled person will be familiar with entropy encoders, so they are not discussed here in detail. Briefly, entropy encoding is a lossless encoding technique for encoding a plurality of discrete values. In entropy encoding, each discrete value is assigned an associated codeword. The length of a codeword for a particular value may be determined based on the probability of that value occurring in the data to be compressed. Thus, for example, shorter codewords may be assigned to values that occur (or are expected to occur) more frequently. By using shorter codewords for common values in a dataset, the dataset can be compressed without loss of information.
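As a toy illustration of the codeword-length idea (not the patent's encoder), the ideal code length of a symbol with probability p is about -log2(p) bits, which an arithmetic coder can approach; the probabilities below are hypothetical.

```python
# Illustrative only: ideal code lengths for a hypothetical set of quantized values.
import math

probabilities = {0: 0.7, 1: 0.2, 2: 0.1}  # assumed probabilities of quantized values
for value, p in probabilities.items():
    print(f"value {value}: ~{-math.log2(p):.2f} bits")
# value 0: ~0.51 bits, value 1: ~2.32 bits, value 2: ~3.32 bits. Frequent values
# therefore receive shorter codewords, which is what makes the encoding lossless
# yet smaller on average.
```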
Based on information from the probability estimator 110, the encoding unit 108 determines, for each value in the quantized latent variable data, the likelihood of that value occurring. The encoding unit 108 uses this probability information to determine codeword lengths and thus to encode the quantized latent variables received from the quantization unit 106 using entropy encoding. The probability estimator 110 may, for example, output one or more parameters of probability mass functions or probability density functions for the latent variables and the encoding unit 108 may use these parameters to determine, for the quantized latent variables y, the probability of particular values or ranges of values occurring.
The encoding unit 108 thus losslessly compresses the quantized latent variables to obtain an encoded image. Due to the downsampling and processing performed by the analysis network 104 and the encoding performed by the encoding unit 108, the encoded image will be smaller in size (e.g. will take up less storage space) than the input image. The encoded image may be stored or transmitted. The encoded image may, for example, be decoded to reconstruct the input image.
Autoencoders can thus use a data-driven neural network, such as the analysis network 104, to process and downsample data for lossless encoding using an entropy encoder. An autoencoder can be trained to do this by compressing training data and then validating the compression process by attempting to regenerate the training data from the compressed data. Thus, the analysis network 104 of the autoencoder 100 may be trained by encoding an image in accordance with the steps outlined above, attempting to reconstruct (e.g. decode) the encoded image, and updating weights of the analysis network 104 based on a comparison of the reconstructed image with the input image. These weights can be updated during training using gradient-based optimisation techniques (e.g. gradient-descent methods) to find the weights which minimise the information lost during compression. However, these techniques may only be applied when the components of the autoencoder are differentiable. As quantization is an indifferentiable function, this poses a challenge for autoencoder training.
Aspects of the disclosure seek to address these and other problems by training an autoencoder using mixture models to model the probability density functions of latent variables and mixed quantisation, in which latent variables are quantized when reconstructing the input data to measure distortion, but a differentiable function is applied to the latent variables to approximate quantization when determining the rate measure.
In one aspect, a computer-implemented method of training an autoencoder for data compression is provided. The autoencoder comprises a first neural network for processing and downsampling input data to generate latent variables, a probability estimator for determining probability distributions of the latent variables, and an entropy encoder for compressing the latent variables based on the probability distributions.
The method comprises inputting training data into the first neural network to generate a plurality of first latent variables and determining a distortion measure based on a comparison of the training data to a reconstruction of the training data. The reconstruction of the training data is based on a plurality of first quantized latent variables obtained by quantizing the plurality of first latent variables. The method further comprises using the probability estimator to, for respective first latent variables of the plurality of first latent variables, obtain one or more parameters of respective mixture models representing a probability distribution of the respective first latent variable. A first rate measure is determined based on the mixture models and a plurality of approximately quantized latent variables, in which the plurality of approximately quantized latent variables are obtained by applying a differentiable function to the plurality of first latent variables to approximate the quantization of the plurality of first latent variables. One or more parameters of the autoencoder are updated based on the distortion measure and the first rate measure.
Using mixed quantisation during training of an autoencoder that uses mixture models results in an autoencoder which can compress data more effectively. In particular, by modelling the probability distributions of latent variables using mixture models, the autoencoder uses more accurate entropy modelling in which the probability distributions of the latent variables generated by the autoencoder are more accurately represented. Using mixed quantisation enables more effective training of the autoencoder.
A computer-implemented method of training of an autoencoder is described in respect of Figure 2, which shows an example of an autoencoder 200 for data compression. The method may be performed by the autoencoder 200 itself. Alternatively, another entity (e.g. another node or computer) may use the autoencoder to perform the following operations which are described as being performed by components of the autoencoder and the other entity may perform any other operations which are not described as being performed by the autoencoder.
The autoencoder 200 comprises an input unit 202, a first analysis network 204, first and second quantization units 206a, 206b (Q1 and Q2 in Figure 2), a probability estimator 210, a first entropy estimator 212, a first synthesis network 214 and a distortion calculator 216. Although not illustrated, the autoencoder 200 further comprises an entropy encoder.
The autoencoder 200 is trained by seeking to optimise (e.g. minimise) a rate-distortion function. The rate-distortion function may alternatively be referred to as a loss function. The rate-distortion function may correspond to

$$L = \lambda D + R_1 + R_2,$$

in which $\lambda$ is a configurable parameter, $D$ is a distortion, $R_1$ is a first rate measure and $R_2$ is a second rate measure. The process for determining, using the autoencoder 200, the distortion, first rate measure and second rate measure based on training data is as follows.
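Before walking through that process, note that the loss itself is straightforward to express; the following is a minimal sketch, assuming D, R1 and R2 are already available as scalar tensors and with lambda_ standing for the configurable parameter $\lambda$ (its value here is only an example).

```python
# Minimal sketch of the rate-distortion (loss) function L = lambda * D + R1 + R2.
import torch

def rd_loss(D: torch.Tensor, R1: torch.Tensor, R2: torch.Tensor,
            lambda_: float = 0.01) -> torch.Tensor:
    # lambda_ trades distortion against rate; lower loss is better.
    return lambda_ * D + R1 + R2
```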
The process begins with obtaining training data x at the input unit 202. The training data may comprise any data which the autoencoder 200 can compress. In particular examples, the training data may comprise image data. Thus, for example, the training data may comprise an array defining position information for the image (e.g. coordinates, such as pixel coordinates) and one or more other arrays defining properties of the image at those positions (colour, transparency and/or any other suitable properties). In more general examples, the training data may comprise any data having at least two dimensions. The training data may be provided in any suitable form such as, for example, an array.
The training data is input to the first analysis network 204. The first analysis network 204 comprises a neural network, such as a convolutional neural network. The first analysis network 204 may, for example, comprise a deep neural network having layers of neurons. Based on the training data, the first analysis network 204 generates a plurality of first latent variables y representing features in the training data. Thus for example, the first latent variables may represent features in an image. The first analysis network 204 downsamples and processes the training data to generate the first latent variables. The first analysis network 204 thus effectively compresses the training data by processing the training data to extract relevant (e.g. important or significant) features and representing those features using the latent variables. The first analysis network 204 may be considered to concentrate the training data (for example as opposed to merely reducing the dimensionality) because prominent features in the training data are not lost.
The first latent variables y are input to the first quantization unit 206a, the second quantization unit 206b and the probability estimator 210.
The first quantization unit 206a quantizes the plurality of first latent variables y and outputs the first quantized latent variables y to the first synthesis network 214. In this context, quantization (or discretisation) may refer to restricting the values of the first quantized latent variables to a prescribed set of values. The prescribed set of values may, for example, be provided as specific values (e.g. a table of predetermined values) or as a class of values (e.g. integers or values limited to a number of significant digits).
The skilled person will appreciate that there are various ways in which the first latent variables may be quantized. Quantization may be deterministic. Rounding is one example of deterministic quantization which can be used to simply and efficiently quantize the first latent variables. Thus, in one example, the first quantization unit 206a quantizes the first latent variables by applying a rounding function. The rounding function may round the first latent variables to the closest integer. More generally, the rounding function may round the first latent variables to a specified number of digits or significant digits. The first quantization unit 206a effectively discretises the first latent variables so that they would be suitable for encoding with the entropy encoder.
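A small, hypothetical illustration of deterministic quantization by rounding, which also shows why it is awkward for gradient-based training: its gradient is zero almost everywhere.

```python
import torch

y = torch.tensor([0.2, 1.7, -2.4], requires_grad=True)  # example first latent variables
y_hat = torch.round(y)                                   # quantized latent variables
y_hat.sum().backward()
print(y_hat)   # tensor([ 0.,  2., -2.])
print(y.grad)  # tensor([0., 0., 0.]) -- no useful gradient reaches the analysis network
```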
The first quantization unit 206a outputs the plurality of first quantized latent variables to the first synthesis network 214. The first synthesis network 214 attempts to reconstruct the training data x based on the plurality of first quantized latent variables y. The first synthesis network 214 comprises a neural network which, through training of the autoencoder 200, may be trained for this purpose. The neural network may be, for example a convolutional neural network.
The synthesis network 214 outputs the reconstructed training data $\hat{x}$ to a distortion calculator 216. The distortion calculator 216 compares the training data $x$ to the reconstructed training data $\hat{x}$ to determine a distortion (or distortion measure), D, caused by the autoencoder 200. The distortion indicates the information lost or altered due to the compression and subsequent reconstruction of the training data. The distortion D may use any suitable measure to quantify the difference between the training data and the reconstructed training data. In one example, the distortion is based on the mean-squared error of the training data and the reconstructed data. For example, the distortion may be calculated according to

$$D = \frac{1}{n} \sum_{i=1}^{n} \left( x_i - \hat{x}_i \right)^2,$$

for training data $x$ comprising $n$ data points. In some embodiments, more complex metrics can be used to indicate the distortion. Other suitable distortion measures may include, for example, a Structure Similarity Index Measure (SSIM), and/or a neural network-based perceptual quality metric (see, for example, "Deep Perceptual Image Quality Assessment For Compression", Mier et. al, arXiv:2103.01114). Deterministic quantization can be more advantageous when these more complex metrics are used since they may be more sensitive to high frequency noise which can result from using stochastic quantization.
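A short sketch of the mean-squared-error distortion described above; other metrics, such as SSIM, could be substituted in the same place.

```python
import torch

def distortion(x: torch.Tensor, x_hat: torch.Tensor) -> torch.Tensor:
    # Mean-squared error between the training data and its reconstruction.
    return torch.mean((x - x_hat) ** 2)
```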
Thus, the distortion D is calculated using a reconstruction of the training data determined using latent variables quantized by the first quantization unit 206a.
In contrast, the first rate measure R is determined using approximately quantized latent variables determined by the second quantization unit 206b. Rather than quantizing the latent variables, the second quantization unit 206b approximates the quantization performed by the first quantization unit 206a.
The skilled person will appreciate that there are different differentiable functions which may be suitable for approximating quantization. Moreover, the specific choice of differentiable function may depend on the quantization method used by the first quantization unit 206a.
In particular examples, a stochastic approach may be used to approximate quantization. In some examples, noise may be applied to the first latent variables to approximate quantization. For example, the second quantization unit 206b may apply noise sampled from a distribution spanning the range $\left[-\tfrac{1}{2}, \tfrac{1}{2}\right]$ to the first latent variables. This may be particularly effective when the first quantization unit 206a quantizes the latent variables by rounding to the nearest integer.

The noise may be sampled from any suitable distribution. In particular examples, noise may be sampled from a uniform distribution. For example, the approximately quantized first latent variables $\tilde{y}$ may be determined according to

$$\tilde{y} = y + u, \qquad u \sim \mathcal{U}\!\left(-\tfrac{1}{2}, \tfrac{1}{2}\right),$$

in which $\mathcal{U}\!\left(-\tfrac{1}{2}, \tfrac{1}{2}\right)$ is a function which samples from a uniform noise distribution with a lower limit of $-\tfrac{1}{2}$ and an upper limit of $\tfrac{1}{2}$. In another example, the approximately quantized first latent variables $\tilde{y}$ may be determined according to

$$\tilde{y} = \mathrm{round}(y - u) + u,$$

in which $u$ is sampled from a uniform distribution $\mathcal{U}\!\left(-\tfrac{1}{2}, \tfrac{1}{2}\right)$ and the function $\mathrm{round}(\cdot)$ is operable to round its argument to the nearest integer. This may be referred to as universal quantization.
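The two stochastic approximations above can be sketched as follows (illustrative only; in both cases the noise u is drawn independently of y):

```python
import torch

def approx_quantize_noise(y: torch.Tensor) -> torch.Tensor:
    # Additive uniform noise: y_tilde = y + u, with u ~ U(-1/2, 1/2).
    u = torch.empty_like(y).uniform_(-0.5, 0.5)
    return y + u

def approx_quantize_universal(y: torch.Tensor) -> torch.Tensor:
    # Universal quantization: y_tilde = round(y - u) + u.
    u = torch.empty_like(y).uniform_(-0.5, 0.5)
    return torch.round(y - u) + u
```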
In another example, a softmax function may be used to approximately quantize the first latent variables. For example, an approximately quantized latent variable $\tilde{y}_i$ may be determined based on

$$\tilde{y}_i = \sum_{j=1}^{n} \mathrm{softmax}(q_i)_j \, C_j,$$

in which $q_{i,j} = -\mathrm{abs}(y_i - C_j)\,\alpha$ for a latent variable $y_i$. $\alpha$ is a parameter for controlling the hardness of the quantization. For example, a larger value of $\alpha$ can be used to generate more precise values for the intended quantization level. $n$ is the number of quantization levels, $C$ is the vector of quantization levels and the function $\mathrm{abs}(\cdot)$ returns the absolute value or modulus of the quantity to which it is applied. In one example, $C$ corresponds to $C = [-128, -127, \ldots, 0, \ldots, 126, 127]$.
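A hedged sketch of the softmax-based approximation, following the reconstruction above; the exact expression in the original filing figure is not reproduced, so the form of q is taken from the surrounding text.

```python
import torch

def soft_quantize(y: torch.Tensor, alpha: float = 10.0) -> torch.Tensor:
    # Quantization levels C = [-128, -127, ..., 126, 127]; alpha controls hardness.
    C = torch.arange(-128.0, 128.0)
    q = -torch.abs(y.unsqueeze(-1) - C) * alpha   # q_ij = -|y_i - C_j| * alpha
    w = torch.softmax(q, dim=-1)                  # soft assignment to each level
    return (w * C).sum(dim=-1)                    # differentiable stand-in for rounding
```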
Thus, the second quantization unit 206b uses a differentiable function to approximate quantization of the first latent variables (e.g. to approximate the quantization performed by the first quantization unit 206a). The approximately quantized first latent variables are output to the first entropy estimator 212, which uses them, together with parameters of the probability distributions of the latent variables y obtained from the probability estimator 210, to calculate the first rate measure. The calculation of the first rate measure is discussed in more detail below.
The probability estimator 210 comprises a second analysis network 218, a third quantization unit 220, a second entropy estimator 222 and a second synthesis network 224.
The probability estimator 210 uses the second analysis network 218 to generate a plurality of second latent variables, z, based on the first latent variables y. The second latent variables may be referred to as a hyperprior or hyperpriors. When the autoencoder 200 is used to encode data after deployment (e.g. rather than during training), the probability estimator 210 may fulfil a second purpose of generating the second latent variables to accompany the encoded data (e.g. as side information to aid decoding). Thus, for example, the autoencoder 200 may additionally compress the second latent variables and provide the compressed latent variables with the encoded data.
Returning to Figure 2, the second analysis network 218 comprises a neural network which further downsamples the first latent variables to obtain the second latent variables z. The neural network may be, for example, a convolutional neural network. The second latent variables may comprise features extracted from the first latent variables. The second latent variables may reflect dependencies (e.g. spatial dependencies) amongst the first latent variables. The second analysis network 218 may be trained to generate the second latent variables as part of training the autoencoder 200. The second latent variables are input to a third quantization unit 220. The third quantization unit 220 quantizes or approximately quantizes the plurality of second latent variables to obtain a plurality of second quantized latent variables. In this context, quantization refers to restricting the values of the second latent variables to a prescribed set of values and approximating quantization refers to performing one or more operations on the second latent variables to simulate or imitate quantization. For example, the second latent variables may be quantized by rounding to the nearest integer. In another example, the second latent variables may be approximately quantized by applying noise sampled from a uniform distribution in the range $-\tfrac{1}{2}$ to $\tfrac{1}{2}$.

Adding noise in this manner effectively simulates the rounding process since rounding a number to the closest integer involves adding a value in the range $-\tfrac{1}{2}$ to $\tfrac{1}{2}$ to the number.
The third quantization unit 220 may thus use any suitable quantization or approximate quantization process to obtain the second quantized latent variables. The third quantization unit 220 may use any of the approaches described above in respect of the first and second quantization units 206a, 206b, for example.
The third quantization unit 220 outputs the plurality of second quantized latent variables z to a second entropy estimator 222. The second entropy estimator 222 determines a second rate measure R2 based on the second quantized latent variables. The second rate measure is indicative of the code length or bit rate of the second quantized latent variables. Since, in use (e.g. in the inference stage or after deployment), the autoencoder 200 is operable to provide the second quantized latent variables with the encoded data as side information, this second rate measure is indicative of the cost of providing this side information. The side information provides information regarding the parameters of the entropy model for the first quantized latent variables.
The second entropy estimator 222 may determine the second rate measure R2 based on probability distributions of the plurality of second quantized latent variables. The second entropy estimator 222 may fit models to the probability density distributions or probability mass functions of the second quantized latent variables. For example, the second entropy estimator 222 may, for each of the second quantized latent variables, determine one or more parameters of a model which represents its probability mass or density function. The models may be referred to as fully factorized entropy models, for example. The model captures the probability of a particular second quantized latent variable having a particular value or range of values.
The second entropy estimator 222 may assume any suitable model for fitting the second quantized latent variables. For example, the distributions of the second quantised latent variables may be modelled using a piece-wise function or one or more Laplace distributions. In particular examples, a further neural network may be used to determine the parameters of the models used for the probability distributions of the second quantized latent variables.
The skilled person will appreciate that there are various ways in which the second rate measure may be determined based on probability distributions of the plurality of second quantized latent variables. In one example, the second rate measure may be determined according to

$$R_2 = -\frac{1}{S} \sum_i \log_2 p_{\hat{z}_i},$$

in which $p_{\hat{z}_i}$ is the probability of a second quantized latent variable $\hat{z}_i$ in the plurality of second quantized latent variables $\hat{z}$ having its respective value, as determined using the models fitted by the second entropy estimator 222.

S is the size of the training data. For training data comprising an image, S may be the resolution of the image patch, for example. The size may be quantified in any suitable way. For example, the size may be the number of data points in the training data. The size of training data comprising an image may be the number of pixels in the image, the number of pixels per unit of length (e.g. the number of pixels per inch, ppi) or the number of pixels per unit of area. The factor of $\frac{1}{S}$ may thus normalize the second rate measure. In other examples, other normalisation factors may be used. Alternatively, the normalisation factor may be omitted.
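A small sketch of this rate estimate, assuming the per-element probabilities have already been evaluated under the fitted factorized models and S is the size of the input (e.g. its number of pixels):

```python
import torch

def rate_measure(p: torch.Tensor, S: int, eps: float = 1e-9) -> torch.Tensor:
    # R = -(1/S) * sum(log2 p); eps guards against log(0).
    return -torch.log2(p + eps).sum() / S
```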
The second entropy estimator 222 also outputs the second quantized latent variables to the second synthesis network 224. Alternatively, the second synthesis network 224 may receive the second quantized latent variables from the second quantization unit 220. The second synthesis network 224 is operable to determine, based on the second quantized latent variables, parameters of models representing the probability distributions P(y) of the first latent variables. The probability distribution of a first latent variable indicates how likely the first latent variable is to have a particular value or range of values. Thus the probability distribution of a latent variable may comprise a probability density function or a probability mass function.
The second synthesis network 224 may be operable to determine, for each of the first latent variables y, one or more parameters of a model representing its probability distribution function. The probability distributions of the first latent variables are modelled using mixture models. Any suitable mixture models may be used such as, for example, Laplace mixture models or Logistic mixture models.
In one example, the second synthesis network 224 fits the probability distributions of the first latent variables with Gaussian mixture models. For example, the probability density function for a first latent variable, $y_i$, may be represented by

$$p(y_i) = \sum_{j=1}^{N} \pi_j \, \mathcal{N}(y_i; \mu_j, \sigma_j^2)$$

according to a Gaussian mixture model having N (e.g. N > 1) components, in which $\mathcal{N}(y_i; \mu_j, \sigma_j^2)$ is a Gaussian distribution with mean $\mu_j$ and standard deviation $\sigma_j$, $\pi_j$ is the mixture weight of component j, which indicates the prior probability of component j, and $\sum_j \pi_j = 1$. The second synthesis network 224 may determine the parameters $\mu_j$, $\sigma_j$ and $\pi_j$ for each of the components of the Gaussian mixture model for the respective first latent variable. The second synthesis network 224 may be configured with a predetermined number of components, N, to use. For example, the second synthesis network 224 may be configured to use Gaussian mixture models having two components. Alternatively, the number of components may also be determined by the second synthesis network 224 (e.g. as part of the model fitting process). In a yet further alternative, the number of components used in model fitting may be determined through training the autoencoder 200.
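A minimal sketch of how raw network outputs might be mapped to mixture weights, means and scales, and how the resulting mixture density could be evaluated, is given below; the output layout and the softmax/softplus mappings are assumptions made for illustration rather than the exact operation of the second synthesis network 224.

```python
# Sketch: turning raw per-latent outputs of a synthesis network into Gaussian
# mixture parameters and evaluating the mixture density.  The layout (N weight
# logits, N means, N raw scales per latent) is an assumption.
import numpy as np

def gmm_params(raw):
    # raw has shape (..., 3 * N): mixture logits, means, raw scales.
    logits, means, raw_scales = np.split(raw, 3, axis=-1)
    weights = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)  # softmax
    scales = np.log1p(np.exp(raw_scales)) + 1e-6                           # softplus, strictly positive
    return weights, means, scales

def gmm_pdf(y, weights, means, scales):
    # Weighted sum of Gaussian component densities evaluated at y.
    comp = np.exp(-0.5 * ((y[..., None] - means) / scales) ** 2) / (scales * np.sqrt(2 * np.pi))
    return np.sum(weights * comp, axis=-1)

raw = np.random.randn(5, 3 * 2)        # 5 latents, N = 2 components each
w, m, s = gmm_params(raw)
print(gmm_pdf(np.zeros(5), w, m, s))
```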
Thus the second synthesis network 224 determines one or more parameters of mixture models representing probability distributions of the first latent variables y based on the plurality of second quantized latent variables z. The second synthesis network 224 comprises a neural network (e.g. a convolutional neural network) which can be trained for this purpose. For example, one or more weights of the second synthesis network 224 may be updated as the autoencoder 200 is trained.
The second synthesis network 224 outputs the mixture model parameters to the first entropy estimator 212.
The first entropy estimator 212 uses the approximately quantized first latent variables $\tilde{y}$ and the mixture model parameters to determine the first rate measure R1. The first rate measure is indicative of the code length or bit rate of the first quantized latent variables. The first rate measure may be determined according to

$$R_1 = -\frac{1}{S}\sum_i \log_2 p_{\tilde{y}_i}(\tilde{y}_i)$$

in which $p_{\tilde{y}_i}(\tilde{y}_i)$ is the probability of an approximately quantized first latent variable $\tilde{y}_i$ having its respective value. This probability is determined using the probability distributions of the latent variables y according to the parameters provided by the probability estimator 210. The probability of an approximately quantized first latent variable having a value $\tilde{y}_i$ may be determined according to

$$p_{\tilde{y}_i}(\tilde{y}_i) = \int_{\tilde{y}_i - \frac{1}{2}}^{\tilde{y}_i + \frac{1}{2}} p_{y_i}(y)\,\mathrm{d}y$$

in which $p_{y_i}(y)$ is the probability density function of $y_i$ based on the parameters provided by the second synthesis network 224. When Gaussian mixture models are used by the probability estimator 210, $p_{\tilde{y}_i}(\tilde{y}_i)$ may be determined according to

$$p_{\tilde{y}_i}(\tilde{y}_i) = \frac{1}{2}\sum_{j=1}^{N} \pi_j \left[\mathrm{Erf}(a_{ij}) - \mathrm{Erf}(b_{ij})\right]$$

in which

$$a_{ij} = \frac{\tilde{y}_i + \frac{1}{2} - \mu_j}{\sqrt{2}\,\sigma_j}$$

and

$$b_{ij} = \frac{\tilde{y}_i - \frac{1}{2} - \mu_j}{\sqrt{2}\,\sigma_j}$$

for a Gaussian mixture model having N components, means $\mu_j$ and standard deviations $\sigma_j$ (e.g. as determined by the probability estimator 210). Erf(·) is the Gaussian error function, $\pi_j$ is the mixture weight of component j in a Gaussian mixture model, which indicates the prior probability of component j, and $\sum_j \pi_j = 1$. S is the size of the training data as described above in respect of the second rate measure. The first rate measure may thus be normalised based on the size of the training data. In other examples, other normalisation factors may be used. Alternatively, the normalisation factor may be omitted.
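The bin-integration of a Gaussian mixture via the error function, and the resulting normalised rate estimate, can be sketched as follows; the variable names, unit bin width and toy inputs are assumptions.

```python
# Sketch: probability of an (approximately) quantized latent under a Gaussian
# mixture, obtained by integrating each component over the unit quantization bin
# via the error function, followed by the normalised rate estimate.
import numpy as np
from scipy.special import erf

def gmm_bin_probability(y_tilde, weights, means, scales, half_bin=0.5):
    upper = erf((y_tilde[..., None] + half_bin - means) / (np.sqrt(2.0) * scales))
    lower = erf((y_tilde[..., None] - half_bin - means) / (np.sqrt(2.0) * scales))
    p = 0.5 * np.sum(weights * (upper - lower), axis=-1)
    return np.clip(p, 1e-12, 1.0)

def first_rate_measure(y_tilde, weights, means, scales, S):
    p = gmm_bin_probability(y_tilde, weights, means, scales)
    return -np.sum(np.log2(p)) / S

y_tilde = np.round(np.random.randn(1000))      # toy quantized latents
weights = np.tile([0.6, 0.4], (1000, 1))       # two-component mixtures
means = np.zeros((1000, 2))
scales = np.ones((1000, 2))
print(first_rate_measure(y_tilde, weights, means, scales, S=64 * 64))
```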
Thus the first entropy estimator 212 determines the first rate measure based on the approximately quantized first latent variables obtained from the second quantization unit 206b and the mixture model parameters obtained by the second synthesis network 224 in the probability estimator 210.
As discussed above, the parameters of the autoencoder 200 may be updated by seeking to optimise (e.g. minimise) the rate-distortion function, in which the rate-distortion function is based on the first rate measure, second rate measure and the distortion. This may be referred to as training the autoencoder 200. The skilled person will appreciate that there are various parameters of the autoencoder 200 which may be updated as part of this process. The autoencoder 200 comprises four neural networks 202, 214, 218, 224. One or more weights of at least one of these neural networks 202, 214, 218, 224 may be updated as part of the training process.
The calculation of the first rate measure, second rate measure and the distortion as described above may be referred to as a forward pass of the autoencoder 200. Thus, the calculation of the rate-distortion function may form part of the forward pass. One or more parameters of the autoencoder 200 may be updated by seeking to optimise the rate-distortion function on the forward pass. For example, forward passes of the autoencoder 200 may be performed for a plurality of training data to explore the parameter space defined by the one or more parameters of the autoencoder 200 and find a combination of parameters which optimise (e.g. minimise) the rate-distortion function.
In further examples, a backward pass of the autoencoder 200 may also be performed. As part of the backward pass, gradients of the rate-distortion function with respect to the one or more parameters of the autoencoder 200 are calculated. These gradients may indicate the sensitivity of the rate-distortion function to changes in the parameters of the autoencoder 200. As such, the gradients can be used to determine which parameters of the autoencoder 200 to change and/or by how much. The skilled person will appreciate that there are various ways in which the gradients may be used to update the one or more parameters of the autoencoder 200.
As an example, a parameter of the network (e.g. a weight of one of the neural networks in the autoencoder 200) $w_i$ can be updated by computing the gradient of the rate-distortion function $L'(w_i)$ with respect to $w_i$ and updating the parameter according to

$$w_i = w_i - k\,L'(w_i)$$

in which k is the learning rate. The value of k can be used to control how much the parameters of the autoencoder 200 are changed with each iteration of the training process.
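A toy illustration of this update rule is given below; the quadratic loss and learning rate are placeholder assumptions, and in practice the gradient would come from back-propagating the rate-distortion function.

```python
# Sketch: one plain gradient-descent update of a single parameter, as in the
# update rule above.
def gradient_step(w, grad_fn, k=1e-4):
    return w - k * grad_fn(w)

loss_grad = lambda w: 2.0 * (w - 3.0)   # gradient of the toy loss (w - 3)^2
w = 0.0
for _ in range(5):
    w = gradient_step(w, loss_grad, k=0.1)
print(w)  # moves towards the minimum at w = 3
```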
For the autoencoder 200 which uses a rate-distortion function L = λD + R1 + R2, one or more weights of the autoencoder 200 may be updated based on the gradients of the distortion, the first rate measure and the second rate measure. As described above, the first rate measure is determined using the approximately quantized latent variables, which are approximately quantized using a differentiable function. This means that the first rate measure is differentiable and thus the gradient of the first rate measure can be directly calculated.
However, the distortion is based on the quantized first latent variables obtained by the first quantization unit. The quantization performed by the first quantization unit 206a may not be differentiable. Therefore, in order to calculate the gradient of the distortion on the backward pass, it may be assumed that ŷ = y. Thus, in effect, the quantization performed by the first quantization unit 206a may be replaced by an identity function on the backward pass. This may be used when universal quantization is used on the forward pass, for example. In general, any suitable differentiable approximation may be used to approximate the quantization performed by the first quantization unit 206a on the backward pass. Thus, for example, any of the approximations discussed in respect of the second quantization unit 206b may be used.
Similarly, the second rate measure can be determined based on the quantized second latent variables obtained by the third quantization unit 220. The quantization performed by the third quantization unit 220 may not be differentiable. Therefore, in order to calculate the gradient of the second rate measure on the backward pass, it may be assumed that ẑ = z. Thus, in effect, the quantization performed by the third quantization unit 220 may be replaced by an identity function on the backward pass. Alternatively, any suitable differentiable approximation may be used to approximate the quantization performed by the third quantization unit 220 on the backward pass. Thus, for example, any of the approximations discussed in respect of the second quantization unit 206b may be used.
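One possible way of realising hard quantization on the forward pass with identity gradients on the backward pass is sketched here in PyTorch; it is an assumption for illustration, not the prescribed implementation.

```python
# Sketch: replacing non-differentiable rounding by an identity function on the
# backward pass (a "straight-through" style estimator) in PyTorch.
import torch

def quantize_ste(y):
    # Forward: hard rounding.  Backward: gradient of the identity, because the
    # (round(y) - y) term is detached from the computation graph.
    return y + (torch.round(y) - y).detach()

y = torch.randn(4, requires_grad=True)
y_hat = quantize_ste(y)
y_hat.sum().backward()
print(y_hat, y.grad)   # gradients are all ones, as for the identity function
```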
In particular examples, a gradient descent process (or algorithm) may be used to update the one or more parameters of the autoencoder 200. Any suitable gradient descent process may be used such as, for example, the Adam process described in Adam: A Method for Stochastic Optimization, Diederik P. Kingma and Jimmy Ba, arXiv:1412.6980, 2017; Stochastic Gradient Descent (SGD); Root Mean Squared Propagation (RMSprop); Adadelta; or Adagrad. The Adam optimizer has been demonstrated to be particularly efficient for solving deep learning problems.
In the aforementioned embodiment, the rate-distortion function used when updating the model takes the form λD + R1 + R2. However, the skilled person will appreciate that, in general, one or more parameters of the autoencoder 200 may be updated based on the distortion and the first rate measure. Thus, for example, rather than seeking to optimise a rate-distortion function, the autoencoder may be trained by seeking to minimise the distortion D whilst meeting a constraint (e.g. a minimum or maximum value) on the first rate measure. In other examples, a simpler rate-distortion function may be used which omits the second rate measure. For example, the rate-distortion function may take the form D + R1.
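A sketch of such a rate-distortion objective, assuming mean-squared error as the distortion and an arbitrary weighting value, is shown below.

```python
# Sketch: a rate-distortion training objective of the form lambda * D + R1 + R2,
# with mean-squared error as the distortion.  The weighting value and the use of
# MSE are assumptions; other distortion measures or constrained formulations
# could be used as discussed above.
import torch

def rate_distortion_loss(x, x_rec, R1, R2, lam=0.01):
    D = torch.mean((x - x_rec) ** 2)     # distortion
    return lam * D + R1 + R2             # rate-distortion function

x = torch.rand(1, 3, 8, 8)
x_rec = x + 0.05 * torch.randn_like(x)
print(rate_distortion_loss(x, x_rec, R1=torch.tensor(0.8), R2=torch.tensor(0.1)))
```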
Aspects of the disclosure thus provide an improved method of training an autoencoder for data compression which uses mixed quantisation and mixture models during training to provide a more effective autoencoder. Autoencoders trained using the methods disclosed herein can compress data more effectively, reducing the resources needed to store and/or transmit the compressed data (e.g. storage space, network resources etc.). Moreover, these advantages can be achieved whilst minimising distortion and/or data loss which can occur during data compression.
Although the training method described in respect of the autoencoder 200 shown in Figure 2 comprises determining a second rate measure and updating the parameters of the autoencoder 200 based on the second rate measure, the skilled person will appreciate that the disclosure is not limited as such. The advantages of using mixed quantisation with mixture models may still be achieved even when a second rate measure is not calculated. Moreover, omitting the calculation of the second rate measure may reduce the complexity of the training process.
The present disclosure further provides a computer-implemented method of compressing data using the autoencoder 200. The method may, for example, be used to compress data when the autoencoder 200 is deployed (e.g. after training).
The skilled person will appreciate that one or more components of the autoencoder 200 may be provided for the purpose of training and may thus be omitted from the autoencoder 200 once deployed. For example, one or more of the first entropy estimator 212, the second quantization unit 206b and the second entropy estimator 222 may be omitted from the deployed autoencoder, since the primary purpose of calculating the first and second rate measures is training. One or more of the first quantization unit 206a, the first synthesis network 214 and the distortion calculator 216 may be omitted for similar reasons. Alternatively, the training of the autoencoder 200 may continue after deployment (e.g. the autoencoder 200 may be updated through use after training on an initial dataset). Thus, the autoencoder 200 may retain some or all of the aforementioned components after initial training has been performed. For example, the first and second entropy estimators 212, 222 may be replaced with the entropy encoder during inference time (e.g. once deployed).
The method of compressing data using the autoencoder 200 begins with obtaining data to be compressed at the input unit 202. The first analysis network 204 generates a plurality of first latent variables based on the data. The first analysis network 204 may operate in substantially the same way during training and during deployment (e.g. subject to any changes to parameters, such as weights, of the first analysis network 204 during training). The first analysis network 204 outputs the plurality of first latent variables to the probability estimator 210, and a fourth quantization unit (not illustrated).
The probability estimator 210 uses the second analysis network 218 to generate a plurality of second latent variables, based on the first latent variables. The third quantization unit 220 quantizes the plurality of second latent variables to provide a plurality of second quantized latent variables. The third quantization unit 220 may perform quantization in the same manner as described above in respect of the training process, for example. The probability estimator 210 further comprises a second entropy encoder which encodes the plurality of second quantized latent variables to be provided with the encoded first quantized latent variables. Thus the second quantized latent variables may be provided as side information to aid in the decoding process. The second entropy encoder may assume any suitable model for the probability distributions of the second quantized latent variables for this encoding process. In some examples, the same models may be used by the second entropy estimator 222 during training and the second entropy encoder during deployment.
The quantized second latent variables are also provided to the second synthesis network 224 which determines, based on the second quantized latent variables, parameters of mixture models representing the probability distributions P(y) of the first latent variables. The second synthesis network 224 may operate in substantially the same way during training and deployment, for example (e.g. subject to any changes to parameters, such as weights, of the second synthesis network 224 during training). The second synthesis network 224 outputs the parameters of the mixture models to the first entropy encoder.
The fourth quantization unit quantizes the plurality of first latent variables to obtain a plurality of first quantized latent variables. The fourth quantization unit may quantize the first latent variables in a same or similar manner to the first quantization unit 206a described above in respect of the training process. Thus the fourth quantization unit may comprise the first quantization unit 206a.
However, in contrast to the first quantization unit 206a, the fourth quantization unit outputs the plurality of first quantized latent variables to the first entropy encoder (not illustrated in Figure 2).
The first entropy encoder encodes the plurality of first quantized latent variables based on mixture models representing a probability distribution of the first latent variables obtained by the probability estimator 210. The first entropy encoder may operate in a similar manner to the encoding unit 108 described above in respect of Figure 1, except that the mixture models are used for the probability distributions of the quantized latent variables. The first entropy encoder may use any suitable form of entropy encoding such as, for example, arithmetic encoding or range encoding. Thus, in deployment, the autoencoder 200 can compress data to generate encoded data comprising the encoded first quantized latent variables and the encoded second quantized latent variables. The encoded data may be stored, processed and/or transmitted in any suitable way.
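The overall order of operations at deployment can be sketched as follows; the network callables and the entropy_encode helper are hypothetical placeholders standing in for the trained networks and an arithmetic or range coder, and are not real library calls.

```python
# Sketch: order of operations when compressing data with the trained autoencoder.
import numpy as np

def compress(x, analysis, hyper_analysis, hyper_synthesis, entropy_encode):
    y = analysis(x)                          # first latent variables
    z = hyper_analysis(y)                    # second latent variables (hyperprior)
    z_hat = np.round(z)                      # quantized side information
    mixture_params = hyper_synthesis(z_hat)  # mixture-model parameters for the first latents
    y_hat = np.round(y)                      # quantized first latent variables
    bits_z = entropy_encode(z_hat, mixture_params=None)          # factorized model
    bits_y = entropy_encode(y_hat, mixture_params=mixture_params)
    return bits_y, bits_z

# Toy stand-ins so the sketch runs; real networks and an arithmetic/range coder
# would be used in practice.
toy_net = lambda t: 0.5 * t
toy_coder = lambda symbols, mixture_params=None: symbols.astype(np.int32).tobytes()
print(len(compress(np.random.randn(16), toy_net, toy_net, toy_net, toy_coder)[0]))
```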
Further, although the aforementioned method for compressing data using the autoencoder 200 comprises providing both the encoded first quantized latent variables and the encoded second quantized latent variables, the skilled person will appreciate that the disclosure is not limited as such. Although the encoded second quantized latent variables aid in decoding of the encoded first quantized latent variables, they may, in other examples, not be output by the autoencoder 200. Thus, in some examples, the compressed data may comprise only the encoded first quantized latent variables.
Although the aforementioned method for compressing data uses both a first entropy encoder and a second entropy encoder, the skilled person will appreciate that, in other embodiments, the autoencoder may comprise only a single entropy encoder which performs the operations of both the first entropy encoder and the second entropy encoder.
The present disclosure further provides a computer-implemented method for decompressing compressed data, in which the compressed data was compressed using the autoencoder 200 (e.g. using the aforementioned compression method). Decoding may be performed by the autoencoder 200 itself or a modified version thereof. For example, when deployed as a decoder (e.g. a standalone decoder), one or more components of the autoencoder 200 may be omitted. Thus, for example, one or more of the first entropy estimator 212, the second quantization unit 206b, the second entropy estimator 222 may be omitted from the deployed autoencoder since the primary purpose of calculating the first and second rate measure is training. One or more of the first quantization unit 206a and the distortion calculator 216 may be omitted for similar reasons. The first analysis network 204 and/or the second analysis network 218 may also be omitted, since these networks may be operable to compress data, rather than reconstruct compressed data.
The method begins with obtaining compressed data. The compressed data may comprise encoded first quantized latent variables and encoded second quantized latent variables. The encoded second quantized latent variables may be considered to be side information to assist with the decoding of encoded first quantized latent variables.
The autoencoder 200 may comprise a first entropy decoder and a second entropy decoder. However, the skilled person will appreciate that, in other embodiments, the autoencoder 200 may only comprise a single entropy decoder which is operable to perform the operations of both the first and second entropy decoders.
The encoded second quantized latent variables may be decoded first by inputting the encoded second quantized latent variables into the second entropy decoder. The skilled person will be familiar with entropy decoding, so the process will not be discussed in detail here. Briefly, an entropy decoder can recover information from a set of codewords or the partitioned intervals based on the probability distribution used during the encoding process.
The second entropy decoder decodes the encoded second quantized latent variables to obtain the second quantized latent variables. Since entropy encoding is, in general, lossless, the second quantized latent variables may be reconstructed with little or no data loss. The second entropy decoder is thus operable to undo or perform the inverse of the second entropy encoder described above in respect of the encoding process.
The autoencoder 200 uses the second synthesis network 224 to determine, based on the second quantized latent variables obtained from the second entropy decoder, one or more parameters of mixture models representing the probability distributions of the first quantized latent variables (e.g. the probability distributions of information encoded in the encoded first quantized latent variables). The second synthesis network 224 may make this determination in the same or similar manner as it determines the mixture model parameters for the first entropy estimator 212 during the training process, as described above.
The first entropy decoder uses the one or more parameters of the mixture models obtained by the second synthesis network 224 to decode the encoded first quantized latent variables and obtain the first quantized latent variables. The first entropy decoder is thus operable to undo or perform the inverse of the first entropy encoder described above in respect of the encoding process. The autoencoder 200 inputs the first quantized latent variables into the first synthesis network 214 to reconstruct the compressed data. In this respect, the first synthesis network 214 may operate in the same or substantially the same way as described above in respect of the training process. The first synthesis network 214 outputs the reconstructed data.
Thus the autoencoder 200 may be used to decompress or decode compressed data.
Although the encoding and decoding processes are described above separately, the skilled person will appreciate that, in general, a single autoencoder may be operable to compress and reconstruct data. Thus, a single autoencoder may be trained according to the methods described above and be operable to both compress and reconstruct data. Moreover, rather than having separate first and second entropy encoders and first and second entropy decoders, in some embodiments the autoencoder may be provided with a single unit which is operable to perform both the entropy encoding and entropy decoding steps outlined above. This unit may be referred to as, for example, an entropy encoder-decoder or an entropy coding unit.
The skilled person will also appreciate that the autoencoder 200 described above is provided as an exemplary way of implementing the embodiments of the disclosure. The skilled person will be able to envisage modifications, additions and deletions thereto which still provide the same or similar functionality and thus still fall within the scope of the disclosure.
Figure 3 shows a computer-implemented method 300 of training an autoencoder for data compression according to embodiments of the disclosure. The method 300 is for training an autoencoder which comprises a first neural network, a probability estimator, and an entropy encoder. The method 300 may be used to train the autoencoder 200 described above in respect of Figure 2, for example.
The first neural network is for processing and downsampling input data to generate latent variables. The first neural network may be considered to concentrate the input data (for example, as opposed to merely reducing the dimensionality) because essential or prominent features in the data are not lost. The first neural network may be any suitable type of neural network such as, for example, a convolutional neural network. The first neural network may comprise the first analysis network 204 described above in respect of Figure 2, for example.
The probability estimator is for determining probability distributions of the latent variables. In particular, the probability estimator is operable to model the probability distributions of the latent variables using mixture models, such as Gaussian mixture models. The probability estimator may comprise the probability estimator 210 described above in respect of Figure 2. However, the skilled person will appreciate that the present disclosure is not limited as such and, in general, any suitable probability estimator may be used.
The entropy encoder is for compressing the latent variables based on the probability distributions. The entropy encoder is operable to apply a lossless compression scheme to encode the latent variables. The entropy encoder may use any suitable entropy encoding technique for this purpose such as, for example, arithmetic coding. Although the entropy encoder may not be involved in training the autoencoder, it may be used during deployment of the autoencoder to encode the latent variables and thereby provide a further compression layer.
The method begins in step 302 in which training data is input to the first neural network to generate a plurality of first latent variables. This step may be performed in accordance with the operation of the input unit 202 and the first analysis network 204 described above in respect of Figure 2, for example. For example, the training data may comprise the training data described above in respect of Figure 2.
In step 304, a distortion measure is determined based on a comparison of the training data to a reconstruction of the training data. The reconstruction of the training data is based on a plurality of first quantized latent variables obtained by quantizing the plurality of first latent variables. Step 304 may be performed in accordance with the operations of the first quantisation unit 206a, the first synthesis network 214 and the distortion calculator 216 described above in respect of Figure 2, for example. Thus, the first latent variables may be quantized in the manner described above in respect of the first quantization unit 206a to obtain the plurality of first quantized latent variables.
The plurality of first quantized latent variables may be input to a second neural network (e.g. the first synthesis network 214) to reconstruct the training data. The second neural network may comprise a convolutional neural network, for example. Alternatively, any other suitable method for reconstructing the training data based on the plurality of first quantized latent variables may be used. The first quantized latent variables are effectively decoded to attempt to obtain the original training data.
The reconstruction of the training data is compared to the training data that was input to the autoencoder in order to determine the distortion measure. For example, the distortion measure may be determined in accordance with the operations of the distortion calculator 216.
In step 306, the probability estimator is used to, for respective first latent variables of the plurality of first latent variables, obtain one or more parameters of respective mixture models representing a probability distribution of the respective first latent variable. For example, the probability estimator may be used to determine respective parameters of a mixture model for each of the plurality of first latent variables. The probability estimator may determine the one or more parameters by first generating a plurality of second latent variables (e.g. a hyperprior) and then using the second latent variables to determine the probability distribution (e.g. probability density function) of the first latent variables. The probability estimator may quantize the plurality of second latent variables and input the second quantized latent variables into a third neural network (e.g. the second synthesis network 224) to determine the one or more parameters.
Step 306 may be performed in accordance with the operation of the probability estimator 210, for example.
Alternatively, the probability estimator may use any other suitable method for obtaining the one or more parameters in step 306. For example, the probability estimator may assume a prior for the first latent variables and determine the one or more parameters of the mixture models based on the assumed prior.
In step 308, a first rate measure is determined based on the mixture models and a plurality of approximately quantized latent variables. The plurality of approximately quantized latent variables are obtained by applying a differentiable function to the plurality of first latent variables to approximate the quantization of the plurality of first latent variables. The plurality of first latent variables may be approximately quantized in any suitable manner. For example, the quantization of the plurality of first latent variables may be approximated in accordance with the operation of the second quantization unit 206b described in respect of Figure 2. Whilst the first rate measure is determined using the approximately quantized latent variables, rather than the first quantized latent variables specifically, it may still be indicative of a bit-rate or code length of the first quantized latent variables since the approximately quantized latent variables are effectively an approximation of the first quantized latent variables.
The first rate measure may be determined in accordance with the determination of a first rate measure R1 by the first entropy estimator 212 as described above in respect of Figure 2.
The method 300 may further comprise calculating a second rate measure. The second rate measure may be determined based on the plurality of second latent variables generated by the probability estimator. This may be particularly beneficial when the autoencoder is operable to, when compressing data, provide side information based on the second latent variables to aid in decoding the compressed data. The second rate measure may be determined in the same or a similar manner to the second rate measure R2 described above in respect of Figure 2, for example.
In step 310, one or more parameters of the autoencoder is updated based on the distortion measure and the first rate measure. For example, the one or more parameters may be updated based on a rate-distortion function or loss function determined based on the distortion measure and the first rate measure. In particular examples, the updating of the parameters of the autoencoder may also be based on the second rate measure. Thus, for example, the one or more parameters may be updated based on a loss function or rate-distortion function determined based on the distortion measure, the first rate measure and the second rate measure.
The parameters of the autoencoder may comprise one or more weights of any neural networks comprised in the autoencoder, such as the first neural network. In embodiments in which the autoencoder comprises further neural networks (e.g. the first and second synthesis networks and/or the second analysis network described above in respect of Figure 2), one or more weights of at least one of the further neural networks may be updated in addition to or instead of the weights of the first neural network. Step 310 may be performed in accordance with the updating of the autoencoder 200 described above in respect of Figure 2, for example. Although the training of the autoencoder 200 involved calculating a second rate measure, the skilled person will appreciate that the method described above in respect of Figure 2 may be modified to omit the determination and use of the second rate measure.
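For illustration, a single training iteration following steps 302 to 310 is sketched below, with toy linear layers standing in for the convolutional networks, a single Gaussian (rather than a mixture) as the conditional model, and the second rate measure omitted; all of these simplifications are assumptions made only to keep the example short and runnable.

```python
# Sketch of one training iteration (steps 302-310) with toy "networks".
import torch, torch.nn as nn

analysis = nn.Linear(16, 8)          # stands in for the first neural network
synthesis = nn.Linear(8, 16)         # reconstructs the input from the quantized latents
hyper_analysis = nn.Linear(8, 4)     # generates the second latent variables z
hyper_synthesis = nn.Linear(4, 16)   # outputs a mean and scale for each of the 8 latents
params = list(analysis.parameters()) + list(synthesis.parameters()) + \
         list(hyper_analysis.parameters()) + list(hyper_synthesis.parameters())
opt = torch.optim.Adam(params, lr=1e-4)

def noisy_quantize(t):               # differentiable approximation (additive uniform noise)
    return t + (torch.rand_like(t) - 0.5)

x = torch.rand(32, 16)                                    # a batch of training data
y = analysis(x)                                           # step 302
y_hat = y + (torch.round(y) - y).detach()                 # hard quantization, identity backward
D = torch.mean((synthesis(y_hat) - x) ** 2)               # step 304: distortion

z = hyper_analysis(y)
z_hat = noisy_quantize(z)
mu, raw_scale = hyper_synthesis(z_hat).chunk(2, dim=-1)   # step 306: distribution parameters
scale = nn.functional.softplus(raw_scale) + 1e-6
gauss = torch.distributions.Normal(mu, scale)
y_tilde = noisy_quantize(y)                               # step 308: approximate quantization
p_y = (gauss.cdf(y_tilde + 0.5) - gauss.cdf(y_tilde - 0.5)).clamp_min(1e-9)
R1 = -p_y.log2().sum() / x.numel()

loss = 0.01 * D + R1                                      # step 310: update parameters
opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))
```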
A method of compressing data using an autoencoder trained using the method 300 is also provided. In some embodiments, the following steps may form part of the method 300 itself.
The method of compressing data comprises inputting data to be compressed into the first neural network to generate a plurality of third latent variables. This step may be performed in accordance with step 302 described above, except in respect of the data to be compressed, rather than training data. The data to be compressed and the training data may be a same type of data. For example, the autoencoder may be trained using one or more first images and then used to compress one or more second images. In another example, the autoencoder may be trained using first video data and used to compress second video data.
The plurality of third latent variables are quantized to obtain a plurality of third quantized latent variables. The plurality of latent variables may be quantized in accordance with the operations of the first quantization unit 206a of the autoencoder 200 described above, for example.
The probability estimator of the autoencoder is used to, for respective third latent variables of the plurality of third latent variables, obtain one or more parameters of respective second mixture models representing a probability distribution of the respective third latent variable. The skilled person will appreciate that there are various ways in which the probability estimator may obtain the parameters of the mixture models. For example, the probability estimator may operate in the same or similar manner as the probability estimator 210 described above in respect of the encoding method using the autoencoder 200 of Figure 2.
Based on the second mixture models, the plurality of third quantized latent variables are encoded using the entropy encoder to obtain encoded data. This step may be performed in accordance with the operation of the first entropy encoder described above, for example.
The method of compressing data further comprises outputting the encoded data. The encoded data may be, for example, stored, processed and/or transmitted to another node or unit.
References herein to “image data” will be understood by those skilled in the art to include all information which contains images. Thus the term “image data” includes both still and moving images (e.g., video data).
Figure 4 shows the rate-distortion curve for data compressed by a first autoencoder trained according to the embodiments of the disclosure. This is shown by the upper dashed line with crosses as markers. For comparison, Figure 4 also includes rate-distortion curves for data compressed by a second autoencoder trained using mixed quantization (dotted line, point-like marker), data compressed using a mean scale hyperprior (MSH; dot-dashed line, square marker) and data compressed using Better Portable Graphics (BPG; lower dashed line, circular marker). MSH and BPG are representative methods of image compression using deep learning and traditional techniques respectively. All four lines were generated by compressing the Kodak standard image data set (http://r0k.us/graphics/kodak/) with the respective compression processes. The Kodak standard image data set comprises 24 test images.
The first and second autoencoders were implemented for the mean-scale model in the CompressAI framework developed by InterDigital, which is available at https://github.com/InterDigitalInc/CompressAI (Accessed: Sep. 20, 2021). Both the first and second autoencoders were trained using the ImageNet dataset (www.image-net.org/). The second autoencoder was trained using mixed quantisation and a univariate Gaussian distribution.
The first autoencoder (trained according to the embodiments of the disclosure) was trained using mixed quantization and a Gaussian mixture model having two components (e.g. two mixtures). More specifically, the first autoencoder was implemented and trained in accordance with the autoencoder 200 and training method described above in respect of Figure 2. Thus, the first autoencoder comprises a first analysis network 204, a first synthesis network 214, a second analysis network 218 and a second synthesis network 224, as described in respect of Figure 2. The architectures of these neural networks are shown in Figures 5 and 6.
As illustrated in Figure 5, the first analysis network 204 comprises four convolutional layers, with a generalized divisive normalization (GDN) layer in between subsequent convolutional layers. The GDN layers may be nonlinear layers which are operable to Gaussianise inputs across channels. Each convolutional layer is operable to take inputs with a convolution of kernel size 5x5 in spatial dimension and with an output of C channels, denoted by 5x5xC, in which C is the number of channels to be produced after the convolutions. The convolutional layers are operable to downsample (d) input data by a factor of 2. However, the skilled person will appreciate that these convolutional layers may be adapted to downsample input data by any suitable factor.
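A sketch of such an analysis transform is given below, with ReLU used as a stand-in for the GDN nonlinearity and C = 192 assumed; a GDN layer (e.g. from a library such as CompressAI) could be substituted where available.

```python
# Sketch of a four-layer convolutional analysis transform with 5x5 kernels and
# stride-2 downsampling, as described above.  ReLU replaces GDN here only to
# keep the sketch self-contained.
import torch, torch.nn as nn

C = 192
analysis_net = nn.Sequential(
    nn.Conv2d(3, C, kernel_size=5, stride=2, padding=2), nn.ReLU(),   # GDN in the document
    nn.Conv2d(C, C, kernel_size=5, stride=2, padding=2), nn.ReLU(),
    nn.Conv2d(C, C, kernel_size=5, stride=2, padding=2), nn.ReLU(),
    nn.Conv2d(C, C, kernel_size=5, stride=2, padding=2),
)
x = torch.rand(1, 3, 256, 256)
print(analysis_net(x).shape)   # torch.Size([1, 192, 16, 16]) after four downsamplings by 2
```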
The first synthesis network 214 has an analogous structure, comprising four convolutional layers operable to take data with a convolutional kernel of 5x5xC. Subsequent convolutional layers are separated by respective IGDN layers, each of which is an approximate inverse of a GDN layer. An IGDN layer may be referred to as an Inverse Generalized Divisive Normalization layer. The IGDN layers may be nonlinear. The convolutional layers are operable to upsample (u) input data. The final convolutional layer outputs data having a channel dimension of 3, i.e., where there are three (R, G, B) channels in the image data.
As illustrated in Figure 6, the second analysis network 218 comprises three convolutional layers which are operable to downsample input data. Subsequent convolutional layers are separated by respective Rectified Linear Unit (ReLU) or rectified linear activation function layers. The ReLU layers can act as nonlinear activation function layers. The three convolutional layers, each separated by a respective ReLU layer, are operable to process the data with a convolution of 5x5xC. The first convolutional layer performs a convolution with a kernel. The first convolutional layer may thus perform data processing. The second and third convolutional layers are operable to process and downsample input data by a factor of 2. However, the skilled person will appreciate that these convolutional layers may be adapted to downsample input data by any suitable factor.
The second synthesis network 224 comprises six convolutional layers. The first three convolutional layers, each separated by a respective ReLU layer, are operable to upsample data with convolutions of 5x5xC, 5x5xCx(3/2) and 5x5xCx3 respectively. The ReLU layers may be nonlinear layers. The subsequent three convolutional layers are also separated by respective ReLU layers but are operable to process data with convolutions of 1x1xC, 1x1x(Cx3x(N/2)) and 1x1x(CxNx3) respectively. In this context, (1x1) indicates the kernel size, 3 is the number of parameters (mean, scale and mixing factor), N is the number of mixtures and C is the number of channels (e.g., 192). These convolutional layers produce parameters of the Gaussian mixture models (e.g. mean, scale and/or mixing factors) used to model the probability distributions of the first latent variables.
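The 1x1 convolutional head described above can be sketched as follows for N = 2 and C = 192; the activation choices and the chunking of the output into means, scales and weights are assumptions for illustration.

```python
# Sketch of the 1x1 convolutional head mapping hyper-decoder features to the
# 3 x N x C Gaussian-mixture parameters (means, scales and mixture weights for
# N components per channel), following the channel widths described above.
import torch, torch.nn as nn

C, N = 192, 2
gmm_head = nn.Sequential(
    nn.Conv2d(3 * C, C, kernel_size=1), nn.ReLU(),               # 1x1xC
    nn.Conv2d(C, C * 3 * N // 2, kernel_size=1), nn.ReLU(),      # 1x1x(Cx3x(N/2))
    nn.Conv2d(C * 3 * N // 2, C * N * 3, kernel_size=1),         # 1x1x(CxNx3)
)
features = torch.rand(1, 3 * C, 16, 16)          # upsampled hyper-decoder features
out = gmm_head(features)
means, scales, weights = out.chunk(3, dim=1)     # each has N * C channels
print(out.shape, means.shape)
```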
Returning to Figure 4, the rate-distortion curves show the peak signal-to-noise ratio (PSNR) for different bits per pixel (bpp). The rate-distortion curves reflect how effectively compression preserves image features for different bit-rates. As shown in Figure 4, the autoencoder trained according to embodiments of the disclosure (the first autoencoder) provides a higher PSNR at all bitrates than MSH and BPG encoding. Further, it outperforms the second autoencoder which uses mixed quantization with a univariate Gaussian, showing that the combination of mixture models with mixed quantization provides more effective image compression. This is further confirmed by the Bjontegaard Delta rate (BD-rate) difference for the first autoencoder compared to the second autoencoder, which is -1.611%. Whilst these tests were performed on image data, the methods disclosed herein are applicable to compressing other types of data, and thus comparable performance improvements are expected when the methods are applied to other types of data.
Embodiments of the disclosure thus provide an improved method for training an autoencoder for data compression. In addition, methods of compressing and reconstructing data using the trained autoencoder are provided. Training an autoencoder using the methods disclosed herein can advantageously improve compression efficiency, which reduces the resources needed to store and/or transmit data compressed using the trained autoencoder. This can reduce demands on storage space as well as network resources (e.g. if the compressed data is transmitted over a network). Moreover, these advantages can be achieved whilst minimising any distortion or data loss which may otherwise occur during data compression.
Figure 7 is a schematic diagram of an apparatus 700 for training an autoencoder for data compression according to embodiments of the disclosure. The autoencoder comprises a first neural network for processing and downsampling input data to generate latent variables, a probability estimator for determining probability distributions of the latent variables, and an entropy encoder for compressing the latent variables based on the probability distributions. The autoencoder may comprise the autoencoder 200 described above in respect of Figure 2, for example.
The apparatus 700 may comprise any suitable apparatus such as, for example, a computer, an embedded system, an accelerator (e.g. a neural network accelerator), and/or a custom ASIC.
The apparatus 700 may be operable to carry out the example method 300 described with reference to Figure 3 and possibly any other processes or methods disclosed herein. It is also to be understood that the method 300 of Figure 3 may not necessarily be carried out solely by the apparatus 700. At least some operations of the method can be performed by one or more other entities.
The apparatus 700 comprises processing circuitry 702 (such as one or more processors, digital signal processors, general purpose processing units, etc), a machine-readable medium 704 (e.g., memory such as read-only memory (ROM), random-access memory, cache memory, flash memory devices, optical storage devices, etc) and one or more interfaces 706.
In one embodiment, the machine-readable medium 704 contains (e.g. stores) instructions which are executable by the processor such that the apparatus is operable to input training data into the first neural network to generate a plurality of first latent variables and determine a distortion measure based on a comparison of the training data to a reconstruction of the training data. The reconstruction of the training data is based on a plurality of first quantized latent variables obtained by quantizing the plurality of first latent variables. The apparatus 700 is further operable to use the probability estimator to, for respective first latent variables of the plurality of first latent variables, obtain one or more parameters of respective mixture models representing a probability distribution of the respective first latent variable. The apparatus 700 is further operable to determine a first rate measure based on the mixture models and a plurality of approximately quantized latent variables, in which the plurality of approximately quantized latent variables are obtained by applying a differentiable function to the plurality of first latent variables to approximate the quantization of the plurality of first latent variables. The apparatus 700 is further operable to update one or more parameters of the autoencoder based on the distortion measure and the first rate measure.
Thus, the machine-readable medium may store instructions which, when executed by the processing circuitry 702, cause the apparatus 700 to perform the steps described above.
In other embodiments, the processing circuitry 702 may be configured to directly perform the method, or to cause the apparatus 700 to perform the method, without executing instructions stored in the non-transitory machine-readable medium 704, e.g., through suitably configured dedicated circuitry.
The one or more interfaces 706 may comprise hardware and/or software suitable for communicating with other nodes of the communication network using any suitable communication medium. For example, the interfaces 706 may comprise one or more wired interfaces, using optical or electrical transmission media. Such interfaces may therefore utilize optical or electrical transmitters and receivers, as well as the necessary software to encode and decode signals transmitted via the interface. In a further example, the interfaces 706 may comprise one or more wireless interfaces. Such interfaces may therefore utilize one or more antennas, baseband circuitry, etc. The components are illustrated coupled together in series; however, those skilled in the art will appreciate that the components may be coupled together in any suitable manner (e.g., via a system bus or suchlike).
In further embodiments of the disclosure, the apparatus 700 may comprise power circuitry (not illustrated). The power circuitry may comprise, or be coupled to, power management circuitry and is configured to supply the components of apparatus 700 with power for performing the functionality described herein. Power circuitry may receive power from a power source. The power source and/or power circuitry may be configured to provide power to the various components of apparatus 700 in a form suitable for the respective components (e.g., at a voltage and current level needed for each respective component). The power source may either be included in, or external to, the power circuitry and/or the apparatus 700. For example, the apparatus 700 may be connectable to an external power source (e.g., an electricity outlet) via an input circuitry or interface such as an electrical cable, whereby the external power source supplies power to the power circuitry. As a further example, the power source may comprise a source of power in the form of a battery or battery pack which is connected to, or integrated in, the power circuitry. The battery may provide backup power should the external power source fail. Other types of power sources, such as photovoltaic devices, may also be used.

It should be noted that the above-mentioned examples illustrate rather than limit the invention, and that those skilled in the art will be able to design many alternative examples without departing from the scope of the appended statements. The word "comprising" does not exclude the presence of elements or steps other than those listed in a claim, "a" or "an" does not exclude a plurality, and a single processor or other unit may fulfil the functions of several units recited in the statements below. Where the terms "first", "second" etc. are used, they are to be understood merely as labels for the convenient identification of a particular feature. In particular, they are not to be interpreted as describing the first or the second feature of a plurality of such features (i.e. the first or second of such features to occur in time or space) unless explicitly stated otherwise. Steps in the methods disclosed herein may be carried out in any order unless expressly otherwise stated. Any reference signs in the statements shall not be construed so as to limit their scope.

Claims

1. A computer-implemented method (300) of training an autoencoder (200) for data compression, the autoencoder comprising a first neural network (204) for processing and downsampling input data to generate latent variables, a probability estimator (210) for determining probability distributions of the latent variables, and an entropy encoder for compressing the latent variables based on the probability distributions, the method comprising: inputting (302) training data into the first neural network to generate a plurality of first latent variables; determining (304) a distortion measure based on a comparison of the training data to a reconstruction of the training data, wherein the reconstruction of the training data is based on a plurality of first quantized latent variables obtained by quantizing the plurality of first latent variables; using (306) the probability estimator to, for respective first latent variables of the plurality of first latent variables, obtain one or more parameters of respective mixture models representing a probability distribution of the respective first latent variable; determining (308) a first rate measure based on the mixture models and a plurality of approximately quantized latent variables, wherein the plurality of approximately quantized latent variables are obtained by applying a differentiable function to the plurality of first latent variables to approximate the quantization of the plurality of first latent variables; and updating (310) one or more parameters of the autoencoder based on the distortion measure and the first rate measure.
2. The method of claim 1, wherein determining a first rate measure R1 based on the mixture models and a plurality of approximately quantized latent variables comprises determining the first rate measure R1 according to

$$R_1 = -\frac{1}{S}\sum_i \log_2 p_{\tilde{y}_i}(\tilde{y}_i)$$

wherein $p_{\tilde{y}_i}(\tilde{y}_i)$ is a probability of an approximately quantized latent variable in the plurality of approximately quantized latent variables $\tilde{y}$ having a value $\tilde{y}_i$ according to the mixture models, and S is a size of the training data.
3. The method of any of the preceding claims, wherein the mixture models comprise Gaussian mixture models.
4. The method of any of the preceding claims, wherein the probability estimator comprises a second neural network (218) and a third neural network (224), and the method further comprises: inputting the plurality of latent variables into the second neural network to generate a plurality of second latent variables indicative of one or more dependencies between the plurality of latent variables; using the third neural network to determine the one or more parameters of the mixture models based on a plurality of second quantized latent variables obtained by quantizing or approximately quantizing the plurality of second latent variables.
5. The method of claim 4, wherein updating one or more parameters of the autoencoder comprises one or more of the following: updating one or more weights of the second neural network and updating one or more weights of the third neural network.
6. The method of any of claims 4-5, wherein the one or more parameters of the autoencoder are updated based on the distortion measure, the first rate measure and a second rate measure, and the method further comprises: determining the second rate measure based on the plurality of second quantized latent variables.
7. The method of claim 6, wherein the second rate measure, R2, is determined based on the second plurality of quantized latent variables z according to
$$R_2 = -\frac{1}{S}\sum_i \log_2 p_{\hat{z}_i}(\hat{z}_i)$$

wherein $p_{\hat{z}_i}(\hat{z}_i)$ is a probability of a second quantized latent variable in the plurality of second quantized latent variables having a value $\hat{z}_i$, and S is a size of the training data.
8. The method of any of claims 6-7, wherein the one or more parameters of the autoencoder are updated by optimising a function determined based on the distortion measure, the first rate measure and a second rate measure.
9. The method of claim 8, wherein optimising the function comprises minimising the function
$$L = \lambda D + R_1 + R_2$$

wherein λ is a parameter, D is the distortion, R1 is the first rate measure and R2 is the second rate measure.
10. The method of any of the preceding claims, wherein the autoencoder further comprises a fourth neural network (214) and the method further comprises: generating the reconstruction of the training data by inputting the plurality of first quantized latent variables into the fourth neural network.
11. The method of claim 10, wherein updating one or more parameters of the autoencoder comprises updating one or more weights of the fourth neural network.
12. The method of any of the preceding claims, wherein the plurality of latent variables are quantized to obtain the quantized latent variables by applying a rounding function to the plurality of latent variables.
13. The method of any of the preceding claims, wherein applying a differentiable function to the plurality of first latent variables to approximate the quantization of the plurality of first latent variables comprises applying noise to the plurality of first latent variables.
14. The method of claim 13, wherein the noise is sampled from a uniform distribution U.
15. The method of claim 14, wherein applying a differentiable function to the plurality of first latent variables y to approximate the quantization of the plurality of first latent variables comprises determining the approximately quantized latent variables $\tilde{y}$ according to

$$\tilde{y} = \mathrm{round}(y - u) + u$$

wherein u is sampled from a uniform distribution $\mathcal{U}\!\left(-\tfrac{1}{2}, \tfrac{1}{2}\right)$ and the function round(·) is operable to round its argument to the nearest integer.
16. The method of any of the preceding claims, wherein the autoencoder is for image compression and the training data comprises an image.
17. The method of any of the preceding claims, wherein the mixture models are first mixture models and the method further comprises: inputting data to be compressed into the first neural network to generate a plurality of third latent variables; quantizing the plurality of third latent variables to obtain a plurality of third quantized latent variables; using the probability estimator to, for respective third latent variables of the plurality of third latent variables, obtain one or more parameters of respective second mixture models representing a probability distribution of the respective third latent variable; based on the second mixture models, encoding the plurality of third quantized latent variables using the entropy encoder to obtain encoded data; and outputting the encoded data.
18. A computer program, comprising instructions which, when executed on at least one processor, cause the at least one processor to carry out the method according to any of the preceding claims.
19. A carrier containing a computer program according to claim 18, wherein the carrier comprises one of an electronic signal, optical signal, radio signal or computer readable storage medium.
20. A computer program product comprising non-transitory computer readable media having stored thereon a computer program according to claim 18.
21. An apparatus adapted to perform the method according to any of claims 1-17.
22. An apparatus (700) for training an autoencoder (200) for data compression, the autoencoder comprising a first neural network (204) for processing and downsampling input data to generate latent variables, a probability estimator (210) for determining probability distributions of the latent variables, and an entropy encoder for compressing the latent variables based on the probability distributions, the apparatus comprising a processor (702) and a machine-readable medium (704), wherein the machine-readable medium contains instructions executable by the processor such that the apparatus is operable to: input (302) training data into the first neural network to generate a plurality of first latent variables; determine (304) a distortion measure based on a comparison of the training data to a reconstruction of the training data, wherein the reconstruction of the training data is based on a plurality of first quantized latent variables obtained by quantizing the plurality of first latent variables; use (306) the probability estimator to, for respective first latent variables of the plurality of first latent variables, obtain one or more parameters of respective mixture models representing a probability distribution of the respective first latent variable; determine (308) a first rate measure based on the mixture models and a plurality of approximately quantized latent variables, wherein the plurality of approximately quantized latent variables are obtained by applying a differentiable function to the plurality of first latent variables to approximate the quantization of the plurality of first latent variables; and update (310) one or more parameters of the autoencoder based on the distortion measure and the first rate measure.
23. The apparatus of claim 22, wherein determining a first rate measure R1 based on the mixture models and a plurality of approximately quantized latent variables comprises determining the first rate measure R1 according to

$$R_1 = -\frac{1}{S}\sum_i \log_2 p_{\tilde{y}_i}(\tilde{y}_i)$$

wherein $p_{\tilde{y}_i}(\tilde{y}_i)$ is a probability of an approximately quantized latent variable in the plurality of approximately quantized latent variables $\tilde{y}$ having a value $\tilde{y}_i$ according to the mixture models, and S is a size of the training data.
24. The apparatus of claim 22 or 23, wherein the mixture models comprise Gaussian mixture models.
25. The apparatus of any of claims 22-24, wherein the probability estimator comprises a second neural network (218) and a third neural network (224), and the apparatus is further operable to: input the plurality of latent variables into the second neural network to generate a plurality of second latent variables indicative of one or more dependencies between the plurality of latent variables; use the third neural network to determine the one or more parameters of the mixture models based on a plurality of second quantized latent variables obtained by quantizing or approximately quantizing the plurality of second latent variables.
26. The apparatus of claim 25, wherein updating one or more parameters of the autoencoder comprises one or more of the following: updating one or more weights of the second neural network and updating one or more weights of the third neural network.
27. The apparatus of any of claims 25-26, wherein the one or more parameters of the autoencoder are updated based on the distortion measure, the first rate measure and a second rate measure, and the apparatus is further operable to: determine the second rate measure based on the plurality of second quantized latent variables.
28. The apparatus of claim 27, wherein the second rate measure, R2, is determined based on the second plurality of quantized latent variables z according to
$$R_2 = -\frac{1}{S}\sum_i \log_2 p_{\hat{z}_i}(\hat{z}_i)$$

wherein $p_{\hat{z}_i}(\hat{z}_i)$ is a probability of a second quantized latent variable in the plurality of second quantized latent variables having a value $\hat{z}_i$, and S is a size of the training data.
29. The apparatus of any of claims 27-28, wherein the one or more parameters of the autoencoder are updated by optimising a function determined based on the distortion measure, the first rate measure and a second rate measure.
30. The apparatus of any of claims 22-29, wherein the autoencoder further comprises a fourth neural network (214) and the apparatus is further operable to: generate the reconstruction of the training data by inputting the plurality of first quantized latent variables into the fourth neural network.
31. The apparatus of claim 30, wherein updating one or more parameters of the autoencoder comprises updating one or more weights of the fourth neural network.
32. The apparatus of any of claims 22-31, wherein the plurality of latent variables are quantized to obtain the quantized latent variables by applying a rounding function to the plurality of latent variables.
33. The apparatus of any of claims 22-32, wherein applying a differentiable function to the plurality of first latent variables to approximate the quantization of the plurality of first latent variables comprises applying noise sampled from a uniform distribution 𝒰 to the plurality of first latent variables.
34. The apparatus of claim 33, wherein applying a differentiable function to the plurality of first latent variables y to approximate the quantization of the plurality of first latent variables comprises determining the approximately quantized latent variables ŷ according to

\hat{y} = \operatorname{round}(y - u) + u

wherein u is sampled from a uniform distribution 𝒰 and the function round(f) is operable to round f to the nearest integer.
35. The apparatus of any of claims 22-34, wherein the mixture models are first mixture models and the apparatus is further operable to:
input data to be compressed into the first neural network to generate a plurality of third latent variables;
quantize the plurality of third latent variables to obtain a plurality of third quantized latent variables;
use the probability estimator to, for respective third latent variables of the plurality of third latent variables, obtain one or more parameters of respective second mixture models representing a probability distribution of the respective third latent variable;
based on the second mixture models, encode the plurality of third quantized latent variables using the entropy encoder to obtain encoded data; and
output the encoded data.
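The claims above describe the training and compression procedures in functional terms. The short Python sketches that follow are illustrative only and are not part of the claims; every helper name, parameter choice and library call in them is an assumption made for the sake of example. First, a minimal sketch of the differentiable approximation of quantization recited in claims 33 and 34. The claims only state that u is sampled from a uniform distribution and give the formula ŷ = round(y - u) + u; the bounds (-0.5, 0.5) and the straight-through gradient are assumptions borrowed from common practice.

import torch

def approx_quantize(y: torch.Tensor) -> torch.Tensor:
    # Approximate quantization y_hat = round(y - u) + u (claim 34).
    # The bounds of the uniform distribution are assumed, not claimed.
    u = torch.rand_like(y) - 0.5          # assumed u ~ U(-0.5, 0.5)
    y_hat = torch.round(y - u) + u        # formula recited in claim 34
    # round() has zero gradient almost everywhere, so gradients are passed
    # straight through as if y_hat = y (an assumption, not claim language).
    return y + (y_hat - y).detach()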
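The rate measures of claims 23 and 28 share the form R = -(1/S) Σᵢ log₂ p. Below is a sketch of how such a measure could be computed once the per-variable probabilities have been evaluated; the epsilon guarding log₂(0) is an implementation detail, not claim language.

import torch

def rate_measure(probs: torch.Tensor, data_size: int) -> torch.Tensor:
    # R = -(1/S) * sum_i log2(p_i), as in claims 23 (R1) and 28 (R2).
    # probs: probability of each (approximately) quantized latent variable
    # under its model; data_size: S, the size of the training data.
    eps = 1e-9                            # assumed guard against log2(0)
    return -torch.sum(torch.log2(probs + eps)) / data_size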
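For claims 24 and 25, the mixture models are Gaussian mixture models whose parameters are produced by the third neural network (224) from the quantized second latent variables. The sketch below evaluates the probability of a quantized (or approximately quantized) latent variable under such a mixture; the discretization over half-integer bin edges follows common learned-image-compression practice (see Cheng et al., cited below) and is an assumption rather than something the claims specify.

import torch
from torch.distributions import Normal

def gmm_probability(y_hat, weights, means, scales):
    # weights, means and scales are assumed to carry one extra trailing
    # dimension of size K (the number of mixture components), with weights
    # assumed to sum to 1 over that dimension.
    y = y_hat.unsqueeze(-1)               # broadcast against the K components
    comp = Normal(means, scales)
    upper = comp.cdf(y + 0.5)             # assumed half-integer bin edges
    lower = comp.cdf(y - 0.5)
    return torch.sum(weights * (upper - lower), dim=-1)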
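Putting the pieces together, here is one possible training update consistent with claims 22 and 27-31, reusing the helpers above. MSE as the distortion measure, the weighting lam, the straight-through rounding on the reconstruction path, the reading of S as the number of input elements, and the callable hyper_prior_probability (a probability model over the second quantized latent variables, which the claims leave unspecified) are all assumptions.

import torch
import torch.nn.functional as F

def training_step(x, encoder, decoder, hyper_encoder, hyper_decoder,
                  hyper_prior_probability, optimizer, lam=0.01):
    # encoder/decoder stand in for the first (204) and fourth (214) neural
    # networks, hyper_encoder/hyper_decoder for the second (218) and third
    # (224) neural networks; all are assumed to be torch.nn.Module instances.
    y = encoder(x)                                 # first latent variables
    y_hard = torch.round(y)                        # first quantized latent variables (claim 32)
    y_tilde = approx_quantize(y)                   # approximately quantized latent variables

    z = hyper_encoder(y)                           # second latent variables
    z_hat = approx_quantize(z)                     # approximately quantized second latent variables
    weights, means, scales = hyper_decoder(z_hat)  # parameters of the mixture models

    # Reconstruction from the quantized first latent variables; the
    # straight-through pass is assumed so gradients can flow through round().
    x_rec = decoder(y + (y_hard - y).detach())
    distortion = F.mse_loss(x_rec, x)              # distortion measure (assumed MSE)

    S = x.numel()                                  # assumed reading of "size of the training data"
    r1 = rate_measure(gmm_probability(y_tilde, weights, means, scales), S)
    r2 = rate_measure(hyper_prior_probability(z_hat), S)

    loss = distortion + lam * (r1 + r2)            # assumed combination of D, R1 and R2 (claim 29)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                               # updates parameters of all four networks
    return loss.item()

In this sketch the optimizer is assumed to have been constructed over the parameters of all four networks, for example with torch.optim.Adam.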
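Finally, a sketch of the compression path of claim 35, again reusing the helpers above. The entropy_encoder argument is a placeholder for the entropy encoder of the claims (for instance an arithmetic coder); its interface is hypothetical, and the side information for the hyper-latents, which a practical codec would also need to signal, is omitted for brevity.

import torch

@torch.no_grad()
def compress(x, encoder, hyper_encoder, hyper_decoder, entropy_encoder):
    y = encoder(x)                                 # third latent variables
    y_hat = torch.round(y)                         # third quantized latent variables
    z_hat = torch.round(hyper_encoder(y))          # hard quantization at inference time
    weights, means, scales = hyper_decoder(z_hat)  # parameters of the second mixture models
    probs = gmm_probability(y_hat, weights, means, scales)
    return entropy_encoder(y_hat, probs)           # encoded data to be output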
PCT/EP2021/083240 2021-10-22 2021-11-26 An autoencoder for data compression WO2023066507A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EPPCT/EP2021/079362 2021-10-22
EP2021079362 2021-10-22

Publications (1)

Publication Number Publication Date
WO2023066507A1 true WO2023066507A1 (en) 2023-04-27

Family

ID=78824980

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2021/083240 WO2023066507A1 (en) 2021-10-22 2021-11-26 An autoencoder for data compression

Country Status (1)

Country Link
WO (1) WO2023066507A1 (en)

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CHENG ET AL.: "Learned Image Compression with Discretized Gaussian Mixture Likelihoods and Attention Modules", Retrieved from the Internet <URL:https://arxiv.org/pdf/2001.01568.pdf>
CHENG, ZHENGXUE ET AL.: "Learned Image Compression with Discretized Gaussian Mixture Likelihoods and Attention Modules", 30 March 2020 (2020-03-30), XP055945197, Retrieved from the Internet <URL:https://arxiv.org/pdf/2001.01568.pdf> [retrieved on 2022-07-21] *
GUO, ZONGYU ET AL.: "Soft then Hard: Rethinking the Quantization in Neural Image Compression", ARXIV.ORG, Cornell University Library, Ithaca, NY, 17 June 2021 (2021-06-17), XP081980613 *

Similar Documents

Publication Publication Date Title
Ho et al. Compression with flows via local bits-back coding
US6704718B2 (en) System and method for trainable nonlinear prediction of transform coefficients in data compression
CN111641832B (en) Encoding method, decoding method, device, electronic device and storage medium
US11869221B2 (en) Data compression using integer neural networks
JP7356513B2 (en) Method and apparatus for compressing neural network parameters
US20230300354A1 (en) Method and System for Image Compressing and Coding with Deep Learning
EP3602415B1 (en) Stop code tolerant image compression neural networks
CN111641826B (en) Method, device and system for encoding and decoding data
CN110753225A (en) Video compression method and device and terminal equipment
WO2023278889A1 (en) Compressing audio waveforms using neural networks and vector quantizers
US8285544B2 (en) Restrained vector quantisation
Zhang et al. SAR image compression using discretized Gaussian adaptive model and generalized subtractive normalization
CN115699020A (en) Quantization for neural networks
WO2023066507A1 (en) An autoencoder for data compression
WO2023088562A1 (en) An autoencoder for data compression
Sun et al. Hlic: Harmonizing optimization metrics in learned image compression by reinforcement learning
US20230334718A1 (en) Online training computer vision task models in compression domain
US20230306239A1 (en) Online training-based encoder tuning in neural image compression
US20230316048A1 (en) Multi-rate computer vision task neural networks in compression domain
US20230316588A1 (en) Online training-based encoder tuning with multi model selection in neural image compression
US20230336738A1 (en) Multi-rate of computer vision task neural networks in compression domain
US11683515B2 (en) Video compression with adaptive iterative intra-prediction
KR20230127716A (en) Method and apparatus for designing and testing an audio codec using white noise modeling

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21820560

Country of ref document: EP

Kind code of ref document: A1