WO2021205066A1 - Training a data coding system for use with machines - Google Patents

Training a data coding system for use with machines

Info

Publication number
WO2021205066A1
Authority
WO
WIPO (PCT)
Prior art keywords
model
neural network
auxiliary
primary
gradients
Application number
PCT/FI2021/050236
Other languages
French (fr)
Inventor
Francesco Cricri
Miska Hannuksela
Emre Baris Aksu
Hamed REZAZADEGAN TAVAKOLI
Honglei Zhang
Nam Le
Original Assignee
Nokia Technologies Oy
Application filed by Nokia Technologies Oy
Publication of WO2021205066A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/90 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using coding techniques not provided for in groups H04N19/10-H04N19/85, e.g. fractals
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/217 Validation; Performance evaluation; Active pattern learning techniques
    • G06F18/2178 Validation; Performance evaluation; Active pattern learning techniques based on feedback of a supervisor
    • G06F18/2185 Validation; Performance evaluation; Active pattern learning techniques based on feedback of a supervisor the supervisor being an automated module, e.g. intelligent oracle
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133 Distances to prototypes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443 Local feature extraction by analysis of parts of the pattern by matching or filtering
    • G06V10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451 Biologically inspired filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Definitions

  • the present application generally relates to encoding and decoding of data for different types of applications.
  • some example embodiments of the present application relate to training of encoder and/or decoder neural networks for use with machine learning related applications or other non-human purposes.
  • Machine learning (ML) or other automated processes may be utilized for different applications in different types of devices, such as for example mobile phones.
  • Example applications include compression and analysis of data, such as for example image data, video data, audio data, speech data, or text data.
  • An encoder may be configured to transform input data into a compressed representation suitable for storage or transmission.
  • a decoder may be configured to reconstruct the data based on the compressed representation.
  • a machine, such as for example a neural network (NN), may perform a task based on the reconstructed data.
  • Example embodiments improve training of encoder and/or decoder neural networks. This may be achieved by the features of the independent claims. Further implementation forms are provided in the dependent claims, the description, and the drawings.
  • an apparatus comprises at least one processor; and at least one memory including computer program code; the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to: obtain a primary model configured to perform a task; obtain an auxiliary model, wherein an effective depth of the auxiliary model is lower than an effective depth of the primary model; train the auxiliary model for performing the task based on a first loss function comprising an output of the primary model and an output of the auxiliary model; determine a plurality of gradients of a second loss function with respect to an input of the auxiliary model, wherein the second loss function comprises the output of the auxiliary model; and train an encoder neural network and/or a decoder neural network based on the plurality of gradients.
  • a method comprises obtaining a primary model configured to perform a task; obtaining an auxiliary model, wherein an effective depth of the auxiliary model is lower than an effective depth of the primary model; training the auxiliary model for performing the task based on a first loss function comprising an output of the primary model and an output of the auxiliary model; determining a plurality of gradients of a second loss function with respect to an input of the auxiliary model, wherein the second loss function comprises the output of the auxiliary model; and training an encoder neural network and/or a decoder neural network based on the plurality of gradients.
  • a computer program comprises instructions for causing an apparatus to perform at least the following: obtaining a primary model configured to perform a task; obtaining an auxiliary model, wherein an effective depth of the auxiliary model is lower than an effective depth of the primary model; training the auxiliary model for performing the task based on a first loss function comprising an output of the primary model and an output of the auxiliary model; determining a plurality of gradients of a second loss function with respect to an input of the auxiliary model, wherein the second loss function comprises the output of the auxiliary model; and training an encoder neural network and/or a decoder neural network based on the plurality of gradients.
  • an apparatus comprises means for obtaining a primary model configured to perform a task; means for obtaining an auxiliary model, wherein an effective depth of the auxiliary model is lower than an effective depth of the primary model; means for training the auxiliary model for performing the task based on a first loss function comprising an output of the primary model and an output of the auxiliary model; means for determining a plurality of gradients of a second loss function with respect to an input of the auxiliary model, wherein the second loss function comprises the output of the auxiliary model; and means for training an encoder neural network and/or a decoder neural network based on the plurality of gradients.
  • FIG. 1 illustrates an example of a data coding system comprising an encoder device, a decoder device, and a machine configured to perform a task, according to an example embodiment
  • FIG. 2 illustrates an example of an apparatus configured to practice one or more example embodiments
  • FIG. 3 illustrates an example of a neural network, according to an example embodiment
  • FIG. 4 illustrates an example of an elementary computation unit, according to an example embodiment
  • FIG. 5 illustrates an example of a convolutional classification neural network, according to an example embodiment
  • FIG. 6 illustrates an example of an auto-encoder comprising an encoder neural network and a decoder neural network, according to an example embodiment
  • FIG. 7 illustrates an example of a video codec targeted for both machine and human consumption of video, according to an example embodiment
  • FIG. 8 illustrates another example of a video codec targeted for both machine and human consumption of video, according to an example embodiment
  • FIG. 9 illustrates an example of a human-targeted encoder neural network, according to an example embodiment
  • FIG. 10 illustrates an example of a method for teaching a student neural network for performing a task, according to an example embodiment
  • FIG. 11 illustrates an example of a method for training encoder and decoder neural networks based on a student neural network, according to an example embodiment
  • FIG. 12 illustrates an example of a method for training encoder and decoder neural networks based on a student neural network and a teacher neural network, according to an example embodiment
  • FIG. 13 illustrates an example of a method for training an encoder neural network and/or a decoder neural network, according to an example embodiment.
  • Reducing distortion in image and video compression may be intended for increasing human perceptual quality, because the human user may be considered as the consumer for the decompressed data.
  • machines operating as autonomous agents may be configured to analyze data and even make decisions without human intervention. Examples of such analysis tasks include object detection, scene classification, semantic segmentation, video event detection, anomaly detection, pedestrian tracking, etc.
  • Example use cases and applications include self-driving cars, video surveillance cameras and public safety, smart sensor networks, smart TV and smart advertisement, person re-identification, smart traffic monitoring, drones, etc.
  • VCM: video coding for machines
  • a decoder device may comprise or have access to multiple machines, for example machine learning (ML) functions such as neural networks (NN).
  • ML functions may be used in a certain combination with or without other machines, such as for example non-ML functions including, but not limited to, user related functions.
  • Execution of the functions may be controlled by an orchestrator sub-system, which may be for example configured to determine an order of execution among functions.
  • Multiple machines may be used for example in succession, based on the output of the previously used machine, and/or in parallel.
  • a decompressed video may be analyzed by one machine (e.g. a first neural network) for detecting pedestrians, by another machine (e.g. a second neural network) for detecting cars, and by another machine (e.g. a third neural network) for estimating depth of pixels in the video frames.
  • a neural network is one type of a machine, but it is appreciated that any process or algorithm, either learned from data or not, which analyzes or processes data for a certain task may be considered as a machine.
  • the receiver or decoder side may refer to a physical or abstract entity or device, which may contain, in addition to the decoder, one or more machines, and which may be configured to run the machine(s) on a decoded video representation.
  • the video may have been encoded by another physical or abstract entity or device, which may be referred to as transmitter or encoder side.
  • the encoder and decoder may be also embodied in a single device.
  • an apparatus may obtain a primary model configured to perform a task, for example a neural network configured to classify an image.
  • the apparatus may further obtain an auxiliary model and train the auxiliary model to perform the same task, for example based on a first loss function comprising an output of the primary model and an output of the auxiliary model.
  • the apparatus may further determine gradients of a second loss function with respect to the input of the auxiliary model, where the second loss function may comprise the output of the auxiliary model.
  • An effective depth of the auxiliary model may be lower than an effective depth of the primary model in order to obtain gradients with higher magnitude.
  • the apparatus may then train the encoder and/or decoder neural networks based on the gradients of the second loss function with respect to input of the auxiliary model. Training efficiency may be improved due to the higher magnitude of the gradients provided by the auxiliary model.
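The training flow described above can be sketched in code. The following is a minimal NumPy illustration, not the claimed implementation: the primary model is approximated by a fixed stack of tanh layers, the auxiliary model by a single linear layer (a lower effective depth), the first loss is a mean-squared distillation error between the two outputs, and the gradient of the second loss with respect to the auxiliary input is written in closed form. All names, shapes, and learning rates are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Primary" model: a deeper stack of tanh layers, standing in for a large task-NN.
deep_weights = [rng.standard_normal((8, 8)) * 0.5 for _ in range(6)]

def primary(x):
    for W in deep_weights:
        x = np.tanh(W @ x)
    return x

# "Auxiliary" model: a single linear layer, i.e. a lower effective depth.
W_aux = rng.standard_normal((8, 8)) * 0.1

# Step 1: train the auxiliary model to mimic the primary model. The first
# loss is an MSE between the two outputs (a simple distillation objective).
for _ in range(500):
    x = rng.standard_normal(8)
    y_primary, y_aux = primary(x), W_aux @ x
    grad_W = np.outer(y_aux - y_primary, x)   # d/dW of 0.5 * ||y_aux - y_primary||^2
    W_aux -= 0.05 * grad_W

# Step 2: the second loss compares the auxiliary output with a target; its
# gradient with respect to the auxiliary *input* is available in closed form
# for this linear model, and is not attenuated by many layers.
x_hat = rng.standard_normal(8)                # e.g. output of a decoder network
target = primary(x_hat)
grad_x = W_aux.T @ (W_aux @ x_hat - target)   # dL/dx for L = 0.5 * ||W_aux x - target||^2

# Step 3: these input gradients would then be backpropagated further into the
# encoder and/or decoder neural networks to update their weights.
assert grad_x.shape == x_hat.shape
```

Because the auxiliary path is only one layer deep, the input gradient `grad_x` passes through a single matrix multiplication rather than many saturating layers, which illustrates why a model of lower effective depth can yield gradients of higher magnitude.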
  • FIG. 1 illustrates an example of a data coding system 100 comprising an encoder device 110, a decoder device 120, and a machine 130, according to an example embodiment.
  • Encoder device 110 may be configured to receive input data and produce encoded data, which may comprise an encoded representation of the input data.
  • the encoded data may for example comprise a compressed version of the input data.
  • the encoded data may be delivered to decoder device 120 by various means, for example over a communication network. Therefore, the encoder device 110 may comprise a transmitter.
  • the decoder device 120 may comprise a receiver.
  • encoded data may be stored on a storage medium such as for example a hard drive or an external memory and retrieved from the memory by decoder device 120.
  • the decoder device 120 may be configured to reconstruct the input data based on the encoded data received from the encoder device 110, or otherwise accessed by decoder device 120. As a result, decoded data may be output by the decoder device 120.
  • encoder device 110 and decoder device 120 may be embodied as separate devices. It is however possible that a single device comprises one or more encoders and one or more decoders, for example as dedicated software and/or hardware components.
  • Encoder device 110 may comprise a video encoder or video compressor.
  • Decoder device 120 may comprise a video decoder or video decompressor.
  • an encoder may be implemented as a neural encoder comprising an encoder neural network.
  • the neural encoder may further comprise one or more additional functions, for example quantization and lossless encoder after the encoder neural network.
  • a decoder may be implemented as a neural decoder comprising a decoder neural network.
  • the neural decoder may further comprise one or more additional functions, for example dequantization and lossless decoder before the decoder neural network.
  • the decoded data provided by decoder device 120 may be processed by one or more machines 130.
  • the one or more machines 130 may be located at decoder device 120.
  • the one or more machines 130 may comprise one or more machine learning (ML) functions configured to perform one or more machine learning tasks. Examples of machine learning tasks include detecting an object of interest, classifying an object, recognizing identity of an object, or the like.
  • the one or more machines 130 may comprise one or more neural networks.
  • the one or more machines 130 may comprise one or more non-ML functions such as for example algorithms or other non-learned functions.
  • the non-ML functions 325 may for example include algorithms for performing similar tasks as the machine learning functions.
  • Different machines 130 may be associated with different metrics for encoding and/or decoding quality, and therefore example embodiments provide methods for efficiently training a neural encoder and/or a neural decoder for use with particular machine(s) 130.
  • the one or more machines 130 may comprise models such as for example neural networks (e.g. task-NNs), for which it is possible to compute gradients of their output with respect to their input.
  • Gradients with respect to a variable X, for example an input of a model, may be determined based on a loss function.
  • gradients of the loss function may be computed with respect to variable X.
  • If the machines 130 comprise parametric models (such as neural networks), the gradients may be obtained by computing the gradients of their output first with respect to their internal parameters and then with respect to their input, for example by using the chain rule for differentiation.
  • backpropagation may be used to obtain the gradients of the output of a neural network with respect to its input.
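As a concrete illustration of obtaining input gradients by backpropagation, the sketch below runs a tiny two-layer network forward, then applies the chain rule backwards through the layers to reach the gradient of a scalar loss with respect to the input. The network sizes, the seed, and the quadratic loss are illustrative assumptions; the result is verified against a finite difference.

```python
import numpy as np

rng = np.random.default_rng(1)
W1, W2 = rng.standard_normal((4, 3)), rng.standard_normal((2, 4))
x = rng.standard_normal(3)

# Forward pass through a tiny two-layer network.
h = np.tanh(W1 @ x)
y = W2 @ h
loss = 0.5 * np.sum(y ** 2)      # an arbitrary scalar loss on the output

# Backward pass: apply the chain rule layer by layer, ending with the
# gradient of the loss with respect to the *input* x.
dy = y                           # dL/dy
dh = W2.T @ dy                   # dL/dh
dpre = dh * (1.0 - h ** 2)       # through the tanh non-linearity
dx = W1.T @ dpre                 # dL/dx, the input gradient

# Sanity check against a finite difference in the first input component.
eps = 1e-6
x_shift = x.copy()
x_shift[0] += eps
loss_shift = 0.5 * np.sum((W2 @ np.tanh(W1 @ x_shift)) ** 2)
assert abs((loss_shift - loss) / eps - dx[0]) < 1e-3
```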
  • FIG. 2 illustrates an example of an apparatus configured to practice one or more example embodiments, according to an example embodiment.
  • the apparatus 200 may for example comprise the encoder device 110 or the decoder device 120.
  • Apparatus 200 may comprise at least one processor 202.
  • the at least one processor may comprise, for example, one or more of various processing devices, such as for example a co-processor, a microprocessor, a controller, a digital signal processor (DSP), processing circuitry with or without an accompanying DSP, or various other processing devices including integrated circuits such as, for example, an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a microcontroller unit (MCU), a hardware accelerator, a special-purpose computer chip, or the like.
  • the apparatus may further comprise at least one memory 204.
  • the memory may be configured to store, for example, computer program code or the like, for example operating system software and application software.
  • the memory may comprise one or more volatile memory devices, one or more non-volatile memory devices, and/or a combination thereof.
  • the memory may be embodied as magnetic storage devices (such as hard disk drives, floppy disks, magnetic tapes, etc.), optical magnetic storage devices, or semiconductor memories (such as mask ROM, PROM (programmable ROM), EPROM (erasable PROM), flash ROM, RAM (random access memory), etc.).
  • Apparatus 200 may further comprise communication interface 208 configured to enable apparatus 200 to transmit and/or receive information, for example compressed video data to/from other devices.
  • the communication interface may be configured to provide at least one wireless radio connection, such as for example a 3GPP mobile broadband connection (e.g. 3G, 4G, 5G); a wireless local area network (WLAN) connection such as for example standardized by the IEEE 802.11 series or Wi-Fi Alliance; a short-range wireless network connection such as for example a Bluetooth, NFC (near-field communication), or RFID connection; a local wired connection such as for example a local area network (LAN) connection or a universal serial bus (USB) connection; or a wired Internet connection.
  • Apparatus 200 may further comprise a user interface 210 comprising an input device and/or an output device.
  • the input device may take various forms, such as a keyboard, a touch screen, or one or more embedded control buttons.
  • the output device may for example comprise a display, a speaker, a vibration motor, or the like.
  • one or more components of the apparatus 200, such as for example the at least one processor 202 and/or the memory 204, may be configured to implement this functionality.
  • this functionality may be implemented using program code 206 comprised, for example, in the memory 204.
  • the functionality described herein may be performed, at least in part, by one or more computer program product components such as software components.
  • the apparatus comprises a processor or processor circuitry, such as for example a microcontroller, configured by the program code when executed to execute the embodiments of the operations and functionality described.
  • the functionality described herein can be performed, at least in part, by one or more hardware logic components.
  • illustrative types of hardware logic components include Field-Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), System-on-a-Chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), and Graphics Processing Units (GPUs).
  • the apparatus 200 may comprise means for performing at least one method described herein.
  • the means comprises the at least one processor 202, the at least one memory 204 including program code 206, the at least one memory 204 and the program code 206 configured to, with the at least one processor 202, cause the apparatus 200 to perform the method(s).
  • Apparatus 200 may comprise a computing device such as for example a mobile phone, a tablet computer, a laptop, an internet of things (IoT) device, or the like. Examples of IoT devices include, but are not limited to, consumer electronics, wearables, and smart home appliances. In one example, apparatus 200 may comprise a vehicle such as for example a car. Although apparatus 200 is illustrated as a single device, it is appreciated that, wherever applicable, functions of apparatus 200 may be distributed to a plurality of devices, for example to implement example embodiments as a cloud computing service.
  • FIG. 3 illustrates an example of a neural network, according to an example embodiment.
  • a neural network may comprise a computation graph with several layers of computation.
  • neural network 300 may comprise an input layer, one or more hidden layers, and an output layer. Nodes of the input layer, i_1 to i_n, may be connected to one or more of the m nodes of the first hidden layer, n_11 to n_1m. Nodes of the first hidden layer may be connected to one or more of the k nodes of the second hidden layer, n_21 to n_2k. It is appreciated that even though the example neural network of FIG. 3 illustrates two hidden layers, a neural network may apply any number and any type of hidden layers.
  • Neural network 300 may further comprise an output layer.
  • Nodes of the last hidden layer, in the example of FIG. 3 the nodes of the second hidden layer, may be connected to one or more nodes of the output layer, o_1 to o_j. It is noted that the number of nodes may be different for each layer of the network.
  • a node may be also referred to as a neuron, a computation unit, or an elementary computation unit. Terms neural network, neural net, network, and model may be used interchangeably.
  • a model may comprise a neural network, but a model may also refer to another learnable model. Weights of the neural network may be referred to as learnable parameters or simply as parameters.
  • one or more of the layers may be fully connected layers, for example layers where each node is connected to every node of a previous layer.
  • Two example architectures of neural networks include feed-forward and recurrent architectures.
  • Feed-forward neural networks are such that there is no feedback loop.
  • Each layer takes input from one or more previous layers and provides its output as the input for one or more of the subsequent layers.
  • units inside certain layers may take input from units in one or more of preceding layers and provide output to one or more of following layers.
  • Initial layers may extract semantically low-level features.
  • the low-level features may correspond to edges and textures in images or video frames.
  • Intermediate and final layers may extract more high-level features.
  • After the feature extraction layers there may be one or more layers performing a certain task, such as classification, semantic segmentation, object detection, denoising, style transfer, super-resolution, or the like.
  • In recurrent neural networks there is a feedback loop from one or more nodes of one or more subsequent layers. This causes the network to become stateful. For example, the network may be able to memorize information or a state.
  • FIG. 4 illustrates an example of an elementary computation unit, according to an example embodiment.
  • the elementary computation unit may comprise a node 401, which may be configured to receive one or more inputs, a_1 to a_n, from one or more nodes of one or more previous layers and compute an output based on the input values received.
  • the node 401 may also receive feedback from one or more nodes of one or more subsequent layers.
  • Inputs may be associated with parameters to adjust the influence of a particular input on the output. For example, weights w_1 to w_n associated with the inputs a_1 to a_n may be used to multiply the input values.
  • the node 401 may be further configured to combine the inputs into an output, or an activation.
  • the node 401 may be configured to sum the modified input values.
  • a bias or offset b may be also applied to add a constant to the combination of modified inputs.
  • Weights and biases may be learnable parameters. For example, when the neural network is trained for a particular task, the values of the weights and biases associated with different inputs and different nodes may be updated such that an error associated with performing the task is reduced to an acceptable level.
  • an activation function f() may be applied to control when and how the node 401 provides the output.
  • The activation function may be, for example, a non-linear function that is substantially linear in the region of zero but limits the output of the node when the input increases or decreases.
  • Examples of activation functions include, but are not limited to, a step function, a sigmoid function, a tanh function, a ReLu (rectified linear unit) function.
  • the output may be provided to nodes of one or more following layers of the network, and/or to one or more nodes of one or more previous layers of the network.
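The weighted sum, bias, and activation described above can be captured in a few lines. This is a hedged sketch; the `node` helper and its sigmoid activation are illustrative choices rather than anything prescribed by the text.

```python
import math

def node(inputs, weights, bias):
    # Weighted sum of inputs plus a bias, passed through a sigmoid activation.
    z = sum(w * a for w, a in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Two inputs a1 = 0.5 and a2 = -1.0 with weights w1 = 2.0, w2 = 1.0 and bias 0.25:
# z = 0.5 * 2.0 + (-1.0) * 1.0 + 0.25 = 0.25, and sigmoid(0.25) is about 0.562.
out = node([0.5, -1.0], [2.0, 1.0], bias=0.25)
assert 0.0 < out < 1.0   # a sigmoid output is always bounded to (0, 1)
```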
  • a forward propagation or a forward pass may comprise feeding a set of input data through the layers of the neural network 300 and producing an output. During this process the weights and biases of the neural network 300 affect the activations of individual nodes and thereby the output provided by the output layer.
  • neural networks and other machine learning tools are able to learn properties from input data, for example in a supervised or unsupervised way. Learning may be based on teaching the network by a training algorithm or based on a meta-level neural network providing a training signal.
  • a training algorithm may include changing some properties of the neural network such that its output becomes as close as possible to a desired output.
  • the output of the neural network may be used to derive a class or category index, which indicates the class or category that the object in the input data belongs to. Training may happen by minimizing or decreasing the output's error, also referred to as the loss.
  • the generated or predicted output may be compared to a desired output, for example ground-truth data provided for training purposes, to compute an error value or a loss value.
  • the error may be calculated based on a loss function.
  • Updating the neural network may then be based on calculating a derivative with respect to the learnable parameters of the network. This may be done for example using a backpropagation algorithm that determines gradients for each layer, starting from the final layer of the network, until gradients for the learnable parameters have been obtained. Parameters of each layer are updated accordingly such that the loss is iteratively decreased. Examples of losses include mean squared error, cross-entropy, or the like.
  • training may comprise an iterative process, where at each iteration the algorithm modifies parameters of the neural network to make a gradual improvement of the network's output, that is, to gradually decrease the loss.
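Such an iterative loop can be illustrated with a single learnable weight. The data, learning rate, and iteration count below are illustrative assumptions; the point is that each update follows the negative gradient of the loss, so the loss gradually decreases.

```python
# Ten training pairs following the rule y = 2x; the single learnable
# parameter w should converge towards 2.
data = [(x / 10.0, 2.0 * (x / 10.0)) for x in range(1, 11)]

w = 0.0        # initial parameter value
lr = 0.1       # learning rate

for _ in range(200):                        # training iterations (epochs)
    for x, y_true in data:
        y_pred = w * x                      # forward pass
        grad = 2.0 * (y_pred - y_true) * x  # dL/dw for L = (y_pred - y_true)^2
        w -= lr * grad                      # update that decreases the loss

assert abs(w - 2.0) < 1e-3
```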
  • Deep neural networks may suffer from vanishing gradients, which may cause updates to the learnable parameters to be so small that training the neural network becomes slow or stops completely.
  • each weight associated with the nodes of the layers of the neural network may receive an update that is proportional to a partial derivative of the loss function. If the number of layers is high, the update may not cause any significant change in the weights and thereby also the output of the neural network.
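The effect is easy to demonstrate numerically. In the sketch below, assuming a sigmoid activation and, purely for illustration, a pre-activation held at zero, the backpropagated signal shrinks by a factor of at most 0.25 per layer.

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

# Backpropagation multiplies one local derivative per layer. The sigmoid
# derivative is at most 0.25, so across 30 layers the product shrinks
# geometrically: 0.25 ** 30 is about 8.7e-19.
grad = 1.0
for _ in range(30):
    s = sigmoid(0.0)
    grad *= s * (1.0 - s)       # local derivative of the sigmoid

assert grad < 1e-15             # the update signal has effectively vanished
```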
  • Training phase of the neural network may be ended after reaching an acceptable error level.
  • the trained neural network may be applied for a particular task, for example, to provide a classification of an unseen image to one of a plurality of classes based on content of an input image.
  • Training a neural network may be seen as an optimization process, but the final goal may be different from a typical goal of optimization.
  • the goal may be to minimize a functional.
  • a goal of the optimization or training process is to make the model learn the properties of the data distribution from a limited training dataset.
  • the goal is to learn to use a limited training dataset in order to learn to generalize to previously unseen data, that is, data which was not used for training the model. This may be referred to as generalization.
  • data may be split into at least two sets, a training data set and a validation data set. The training data set may be used for training the network, for example to modify its learnable parameters in order to minimize the loss.
  • the validation data set may be used for checking performance of the network on data which was not used to minimize the loss as an indication of the final performance of the model.
  • the errors on the training set and on the validation data set may be monitored during the training process to understand the following issues: 1) if the network is learning at all - in this case, the training data set error should decrease, otherwise the model is in the regime of underfitting; 2) if the network is learning to generalize - in this case, also the validation set error should decrease and not be much higher than the training data set error. If the training data set error is low, but the validation data set error is much higher than the training data set error, or it does not decrease, or it even increases, the model is in the regime of overfitting. This means that the model has merely memorized properties of the training data set and performs well on that set, but performs poorly on a set not used for tuning its parameters.
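The underfitting/overfitting heuristics above may be sketched as a small helper; the thresholds `tol` and `gap_factor` are hypothetical choices introduced for illustration, not values from the text:

```python
def diagnose_training(train_errors, val_errors, tol=1e-3, gap_factor=1.5):
    """Classify the training regime from the two error curves.

    Hypothetical thresholds: `tol` decides whether a curve still decreases,
    `gap_factor` decides when validation error is "much higher" than
    training error.
    """
    train_decreasing = train_errors[-1] < train_errors[0] - tol
    val_decreasing = val_errors[-1] < val_errors[0] - tol
    if not train_decreasing:
        return "underfitting"   # the network is not learning at all
    if not val_decreasing or val_errors[-1] > gap_factor * train_errors[-1]:
        return "overfitting"    # memorized the training set, generalizes poorly
    return "generalizing"

print(diagnose_training([1.0, 0.5, 0.2], [1.0, 0.6, 0.3]))   # generalizing
print(diagnose_training([1.0, 0.5, 0.1], [1.0, 0.9, 1.1]))   # overfitting
print(diagnose_training([1.0, 1.0, 1.0], [1.0, 1.0, 1.0]))   # underfitting
```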
  • FIG. 5 illustrates an example of a convolutional classification neural network 500.
  • a convolutional neural network 500 comprises at least one convolutional layer.
  • a convolutional layer performs convolutional operations to extract information from input data, for example image 502, to form a plurality of feature maps 506.
  • a feature map may be generated by applying a filter or a kernel to a subset of input data, for example block 504 in image 502, and sliding the filter through the input data to obtain a value for each element of the feature map.
  • the filter may comprise a matrix or a tensor, which may be for example multiplied with the input data to extract features corresponding to that filter.
  • a plurality of feature maps may be generated based on applying a plurality of filters.
  • a further convolutional layer may take as input the feature maps from a previous layer and apply the same filtering principle on the feature maps 506 to generate another set of feature maps 508. Weights of the filters may be learnable parameters and they may be updated during a training phase, similar to parameters of neural network 300. Similar to node 501, an activation function may be applied to the output of the filter(s).
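A minimal sketch of the sliding-filter operation described above (the image values and the edge-detecting kernel are hypothetical examples; in a trained network the kernel weights would be learnable parameters):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` ('valid' positions only) to build one feature map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    fmap = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # element-wise multiply the filter with the current block and sum
            fmap[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return fmap

image = np.arange(25, dtype=float).reshape(5, 5)   # stand-in for image 502
# A fixed horizontal-edge filter used only for illustration.
kernel = np.array([[ 1.0,  1.0,  1.0],
                   [ 0.0,  0.0,  0.0],
                   [-1.0, -1.0, -1.0]])
fmap = conv2d_valid(image, kernel)
# Each of the 3x3 output elements comes from one placement of the filter.
```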
  • the convolutional neural network may further comprise one or more other type of layers such as for example fully connected layers 510 after and/or between the convolutional layers. An output may be provided by an output layer 512, which in the example of FIG. 5 comprises a classification layer.
  • the output may comprise an approximation of a probability distribution, for example an N-dimensional array 514, where N is the number of classes, and where the sum of the values in the array is 1.
  • Each element of the array 514 may indicate a probability of the input image belonging to a particular class, such as for example a class of cats, dogs, horses, or the like.
  • Elements of the array 514 may be called bins.
  • the output may be represented either as one-hot representation where only one class bin is one and other class bins are zero, or as soft labels where the array comprises a probability distribution instead of a one-hot representation. In the latter case, all bins of the output array 514 may have a value different from zero, as illustrated in FIG. 5.
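The soft-label and one-hot representations above may be sketched as follows (the logits and the four-class setup are hypothetical illustrations):

```python
import numpy as np

def softmax(logits):
    """Map raw scores to an approximate probability distribution (bins sum to 1)."""
    z = np.exp(logits - np.max(logits))   # subtract max for numerical stability
    return z / z.sum()

def to_one_hot(soft_labels):
    """Collapse soft labels to a one-hot vector via a max operation."""
    one_hot = np.zeros_like(soft_labels)
    one_hot[np.argmax(soft_labels)] = 1.0
    return one_hot

# Hypothetical raw scores for N = 4 classes (e.g. cat, dog, horse, other).
logits = np.array([2.0, 1.0, 0.1, -1.0])
soft = softmax(logits)      # soft labels: all bins non-zero, summing to 1
hard = to_one_hot(soft)     # one-hot representation: only the first bin is 1
```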
  • neural networks may be also applied at encoder device 110 or decoder device 120. Neural networks may be used either to perform the whole encoding or decoding process or to perform some steps of the encoding or decoding process. The former option may be referred to as end-to-end learned compression. Learned compression may be for example based on an auto-encoder structure that is trained to encode and decode video data.
  • FIG. 6 illustrates an example of an auto-encoder comprising an encoder neural network and a decoder neural network, according to an example embodiment.
  • An auto-encoder is a neural network comprising an encoder network 611, which may be configured to compress data or to make the input data more compressible at its output, for example by having lower entropy.
  • the auto-encoder may further comprise a decoder network 621, which may take the compressed data, for example the data output by the encoder network 611 or data output by a step performed after the encoder network 611, and output a reconstruction of the original data, possibly with some loss. It is noted that example embodiments may be applied to various other type of neural networks configured to be applied in encoding or decoding process and the auto-encoder 600 is provided only as an example.
  • the auto-encoder 600 may be trained based on a training dataset. For each training iteration, a subset of data may be sampled from the training dataset and input to the encoder network 611. The output of the encoder network 611 may be subject to further processing steps, such as for example binarization, quantization, and/or entropy coding. Finally, the output, which may be also referred to as a code, may be input to the decoder network 621 which may reconstruct the original data input to the encoder network 611. The reconstructed data may differ from the original input data. The difference between the input data and the reconstructed data may be referred to as the loss. However, the auto-encoder pipeline may be also designed in a way that there is no loss in reconstruction.
  • a loss or error value may be computed by comparing the output of the decoder network 621 to the input of the encoder network 611.
  • the loss value may be computed for example based on a mean-squared error (MSE), a peak signal-to-noise ratio (PSNR), structural similarity (SSIM), or the like.
  • Another loss function may be used for encouraging the output of the encoder to be more compressible, for example to have low entropy.
  • This loss may be used in addition to the loss measuring the quality of data reconstruction.
  • a plurality of losses may be computed and then added together for example via a linear combination (weighted average) to obtain a combined loss.
  • the combined loss value may be then differentiated with respect to the weights and/or other parameters of the encoder network 611 and decoder network 621. Differentiation may be done for example based on backpropagation, as described above.
  • the obtained gradients may then be used to update or change the parameters (e.g. weights), for example based on a stochastic gradient descent algorithm or any other suitable algorithm.
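A toy sketch of the training loop above, using a linear auto-encoder, an L1 penalty on the code as a stand-in for the compressibility loss, and finite differences in place of backpropagation (all of these are hypothetical simplifications of the described pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))              # a batch sampled from the training dataset

# Toy linear auto-encoder: 8-d input -> 3-d code -> 8-d reconstruction.
W_enc = rng.normal(scale=0.1, size=(8, 3))
W_dec = rng.normal(scale=0.1, size=(3, 8))

def combined_loss(lam=0.01):
    code = x @ W_enc                      # encoder network output (the "code")
    recon = code @ W_dec                  # decoder network reconstruction
    mse = np.mean((recon - x) ** 2)       # reconstruction-quality loss
    rate = np.mean(np.abs(code))          # proxy for compressibility (low entropy)
    return mse + lam * rate               # linear combination of the losses

def num_grad(W, eps=1e-6):
    """Finite-difference gradients standing in for backpropagation."""
    g = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        W[idx] += eps; hi = combined_loss()
        W[idx] -= 2 * eps; lo = combined_loss()
        W[idx] += eps
        g[idx] = (hi - lo) / (2 * eps)
    return g

before = combined_loss()
for _ in range(50):                       # plain gradient-descent parameter updates
    W_enc -= 0.1 * num_grad(W_enc)
    W_dec -= 0.1 * num_grad(W_dec)
after = combined_loss()
# The combined loss decreases over the iterations.
```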
  • the encoder device 110 may comprise a neural encoder, for example the encoder network 611 of the auto-encoder 600.
  • the decoder device 120 may comprise a neural decoder, for example the decoder network 621 of the auto-encoder 600.
  • Video coding may be alternatively performed by algorithmic video codecs.
  • algorithmic video codecs include hybrid video codecs, such as for example similar to ITU-T H.263, H.264, H.265, and H.266 standards.
  • Hybrid video encoders may code video information in two phases. Firstly, pixel values in a certain picture area, for example a block, may be predicted for example by motion compensation means or spatial means. Motion compensation may comprise finding and indicating an area in one of the previously coded video frames that corresponds to the block being coded. Applying spatial means may comprise using pixel values around the block to be coded in a specified manner.
  • Secondly, the prediction error, for example the difference between the predicted block of pixels and the original block of pixels, may be coded. This may be done based on transforming the difference in pixel values using a transform, such as for example discrete cosine transform (DCT) or a variant of DCT, quantizing the coefficients, and entropy coding the quantized coefficients.
  • By varying the fidelity of the quantization process, the encoder can control the balance between the accuracy of the pixel representation (picture quality) and the size of the resulting coded video representation (file size or transmission bitrate).
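A minimal sketch of how quantization fidelity trades picture quality against code size (uniform quantization on hypothetical coefficient values; the number of distinct levels is only a rough proxy for the entropy-coded size):

```python
import numpy as np

def quantize_roundtrip(coeffs, step):
    """Uniform quantization: a coarser `step` gives fewer levels but more error."""
    q = np.round(coeffs / step)            # quantized transform coefficients
    recon = q * step                       # dequantized values at the decoder
    error = np.mean((coeffs - recon) ** 2) # distortion introduced by quantization
    levels = len(np.unique(q))             # rough proxy for entropy-coded size
    return error, levels

coeffs = np.linspace(-10, 10, 101)         # stand-in for transform coefficients
fine_err, fine_levels = quantize_roundtrip(coeffs, step=0.5)
coarse_err, coarse_levels = quantize_roundtrip(coeffs, step=4.0)
# Coarse quantization: fewer levels to entropy-code, but higher distortion.
```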
  • Inter prediction which may also be referred to as temporal prediction, motion compensation, or motion- compensated prediction, exploits temporal redundancy.
  • inter prediction the sources of prediction are previously decoded pictures.
  • Intra prediction utilizes the fact that adjacent pixels within the same picture are likely to be correlated. Intra prediction can be performed in spatial or transform domain, which means that either sample values or transform coefficients can be predicted. Intra prediction may be exploited in intra coding, where no inter prediction is applied.
  • One outcome of the coding procedure may comprise a set of coding parameters, such as motion vectors and quantized transform coefficients. Many parameters can be entropy-coded more efficiently if they are predicted first from spatially or temporally neighbouring parameters. For example, a motion vector may be predicted from spatially adjacent motion vectors and the difference relative to the motion vector predictor may be coded. Prediction of coding parameters and intra prediction may be collectively referred to as in-picture prediction.
  • a video decoder may reconstruct the output video based on prediction means similar to the encoder to form a predicted representation of the pixel blocks. Reconstruction may be based on motion or spatial information created by the encoder and stored in the compressed representation and prediction error decoding, which may comprise an inverse operation of the prediction error coding to recover the quantized prediction error signal in spatial pixel domain. After applying prediction and prediction error decoding means the decoder may sum up the prediction and prediction error signals, for example pixel values, to form the output video frame. The decoder, and also the encoder, can also apply additional filtering means to improve the quality of the output video before passing it for display and/or storing it as prediction reference for the forthcoming frames in the video sequence.
  • the motion information may be indicated with motion vectors associated with each motion compensated image block.
  • Each of these motion vectors represents the displacement of the image block in the picture to be coded or decoded relative to the prediction source block in one of the previously coded or decoded pictures.
  • the motion vectors may be coded differentially with respect to block specific predicted motion vectors.
  • the predicted motion vectors may be created in a predefined way, for example based on calculating a median of encoded or decoded motion vectors of adjacent blocks.
  • Another way to create motion vector predictions may comprise generating a list of candidate predictions from adjacent blocks and/or co-located blocks in temporal reference pictures and signalling the chosen candidate as the motion vector predictor.
  • the reference index of a previously coded or decoded picture can be predicted.
  • the reference index may be predicted from adjacent blocks and/or co-located blocks in a temporal reference picture.
  • high efficiency video codecs may employ an additional motion information coding or decoding mechanism, which may be called a merging/merge mode, where motion field information, comprising a motion vector and a corresponding reference picture index for each available reference picture list, is predicted and used without any modification/correction.
  • predicting the motion field information may be carried out based on using the motion field information of adjacent blocks and/or co-located blocks in temporal reference pictures, and the used motion field information may be signalled with an index into a motion field candidate list filled with motion field information of available adjacent/co-located blocks.
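The median-based motion vector prediction and differential coding described above may be sketched as follows (the motion vectors of the adjacent blocks are hypothetical):

```python
import numpy as np

def predict_mv(neighbor_mvs):
    """Median-predict a motion vector from adjacent blocks, component-wise."""
    mvs = np.array(neighbor_mvs, dtype=float)
    return np.median(mvs, axis=0)

# Hypothetical (dx, dy) motion vectors of three adjacent, already-coded blocks.
neighbors = [(4, 1), (5, 1), (3, 2)]
predictor = predict_mv(neighbors)          # component-wise median -> [4., 1.]

actual_mv = np.array([5, 1], dtype=float)
mv_difference = actual_mv - predictor      # only this small residual is coded
```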
  • the prediction residual after motion compensation may be first transformed with a transform kernel, for example DCT, and then coded.
  • One reason for this is that there may still be some correlation within the residual, and applying the transform may reduce this correlation and enable more efficient coding.
  • Video encoders may utilize Lagrangian cost functions to find optimal coding modes, for example the desired macroblock mode and associated motion vectors.
  • This kind of cost function may use a weighting factor λ to tie together the (exact or estimated) image distortion due to lossy coding methods and the (exact or estimated) amount of information that is required to represent the pixel values in an image area:

C = D + λR

where C is the Lagrangian cost to be minimized, D is the image distortion, for example mean squared error (MSE), with the mode and motion vectors considered, and R is the number of bits needed to represent the required data to reconstruct the image block in the decoder, which may include the amount of data needed to represent the candidate motion vectors.
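A sketch of mode selection with this cost function (the candidate modes, distortions, bit counts, and the λ value are all hypothetical):

```python
def lagrangian_cost(distortion, rate_bits, lam):
    """C = D + lambda * R: the rate-distortion cost to be minimized."""
    return distortion + lam * rate_bits

# Hypothetical candidate coding modes for one block: (name, D as MSE, R in bits).
candidates = [("intra", 12.0, 300),
              ("inter_mv_a", 15.0, 120),
              ("skip", 40.0, 10)]

lam = 0.05
best = min(candidates, key=lambda c: lagrangian_cost(c[1], c[2], lam))
# With this lambda the encoder prefers 'inter_mv_a' (cost 15 + 0.05*120 = 21),
# trading slightly higher distortion for a much lower bit cost than 'intra'.
```

A larger λ would shift the choice toward cheaper modes such as 'skip'; a smaller λ would favour the lowest-distortion mode.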
  • Neural networks may be used in image and video compression, either to perform the whole compression or decompression process or to perform some steps of the compression or decompression process.
  • an encoder neural network may be for example configured to perform a step which takes as an input a decorrelated version of the input data, for example an output of a transform such as for example Fourier transform or discrete cosine transform (DCT).
  • the encoder neural network may be for example configured to perform one of the last steps of the encoding process.
  • the encoder neural network may be followed by one or more post-processing steps such as for example binarization, quantization, and/or an arithmetic encoding step, for example a lossless encoder such as an entropy encoder.
  • the decoder neural network may be located at a corresponding position at the decoder and be configured to perform a corresponding inverse function.
  • Example embodiments provide an encoder and decoder which may be applied for compression and decompression of data to be consumed by machine(s).
  • the decompressed data may also be consumed by humans, either at the same time or at different times with respect to consumption of the decompressed data at the machine(s).
  • a codec may comprise multiple parts, where some parts may be used for compressing or decompressing data for machine consumption, and other parts may be used for compressing or decompressing data for human consumption.
  • FIG. 7 illustrates an example of a video codec targeted for both machine and human consumption of video, according to an example embodiment.
  • Video codec 700 may comprise an encoder 712, for example an algorithmic encoder as illustrated in FIG. 7, such as for example according to or based on the H.266 standard.
  • the encoder 712 may comprise a neural network.
  • the encoder 712 may be configured to provide a first bitstream, which may be a human- targeted bitstream.
  • the video codec 700 may further comprise a neural encoder 714, for example the encoder network 611 of the auto-encoder 600.
  • the neural encoder 714 may be configured to provide a second bitstream, which may be a machine-targeted bitstream.
  • the encoder 712 and neural encoder 714 may form or be comprised at an encoder part of the video codec 700.
  • the encoder part may further comprise instances of the decoder 722 and/or neural decoder 724.
  • Video codec 700 may further comprise a decoder 722, for example an algorithmic decoder as illustrated in FIG. 7, such as for example according to or based on the H.266 standard.
  • the decoder 722 may comprise a neural network.
  • the algorithmic decoder 722 may be configured to decode the first bitstream, for example to output a decoded human-targeted video.
  • the video codec 700 may further comprise a neural decoder 724, for example the decoder network 621 of the auto-encoder 600.
  • the neural decoder 724 may be configured to decode the second bitstream, for example to output a decoded machine-targeted video to one or more machines 730.
  • the algorithmic decoder 722 and neural decoder 724 may form or be comprised in a decoder part of the video codec 700.
  • the encoder and decoder parts of the video codec 700 may be embodied separately, for example at different devices or as separate software and/or hardware components.
  • the encoder part may be comprised in the encoder device 110.
  • the decoder part may be comprised in the decoder device 120.
  • the encoder device 110 may have access to the algorithmic decoder 722 and/or the neural decoder 724.
  • feedback 726 may be provided from the algorithmic decoder 722 to the neural encoder 714.
  • Feedback 728 may be provided from the algorithmic decoder 722 to the neural decoder 724.
  • the decoded first bitstream output by the algorithmic decoder 722 may be provided to the neural encoder 714 and/or to the neural decoder 724.
  • the encoder device 110 may be embedded with an instance of the algorithmic decoder 722 and a feedback may be provided internally within the encoder device 110.
  • the feedback 726 may be provided to inform the neural encoder 714 about what information has been already reconstructed at the decoder device 120 side by its algorithmic decoder 722, and thus to understand what additional information to encode for the machines 730.
  • the feedback 728 may be provided within the decoder device 120.
  • the feedback 728 may be applied to provide the neural decoder 724 of the decoder device 120 with information about the already decoded information from the algorithmic decoder 722 of the decoder device 120, and to combine it with the machine-targeted bitstream to decode the machine targeted video.
  • the above approach may be also applied to features extracted from video data.
  • the human-targeted video and the machine- targeted video may be processed in either order. It is also possible to apply the algorithmic encoder 712 and the algorithmic decoder 722 for machine-targeted video.
  • the neural encoder 714 and the neural decoder 724 may be alternatively used for human-targeted video.
  • FIG. 8 illustrates another example of a video codec targeted for both machine and human consumption of video, according to an example embodiment.
  • Video codec 800 may comprise a first encoder neural network 811, which may comprise a machine-targeted encoder neural network.
  • the first encoder neural network 811 may operate as a generic feature extractor and be configured to output spatio-temporal machine-targeted features (M-features) based on input video data.
  • the extracted spatio-temporal features may be quantized at a first quantization function 812.
  • the video codec 800 may further comprise a first entropy encoder 813 configured to entropy encode the quantized M-features provided by the first quantization function 812.
  • the output of the first entropy encoder 813 may comprise the quantized and entropy encoded spatio-temporal M-features (M-code).
  • the video codec 800 may further comprise a second encoder neural network 814, which may comprise a human-targeted encoder neural network.
  • the second encoder neural network 814 may be configured to output spatio-temporal human-targeted features (H-features) based on the input video data and optionally the quantized M-features provided by the first quantization function 812, as will be further discussed in relation to FIG. 9.
  • an initial operation may comprise dequantizing the quantized M-features, applying a machine-targeted decoder NN (similar to 821) and then subtracting the dequantized M-features from the video data.
  • Other suitable operations for reducing the amount of data by using the quantized M-features and the video data may be utilized.
  • the extracted spatio-temporal features may be quantized at a second quantization function 815.
  • the video codec 800 may further comprise a second entropy encoder 816 configured to entropy encode the quantized H-features provided by the second quantization function 815.
  • the output of the second entropy encoder 816 may comprise the quantized and entropy encoded spatio-temporal H-features (H-code).
  • the quantized M-features and the quantized H-features may be encoded by the same entropy encoder.
  • the first and second encoder neural networks 811, 814, the first and second quantization functions 812, 815, and/or the first and second entropy encoders 813, 816 may form or be comprised at an encoder part of the video codec 800. Operations of the first and second entropy encoders 813, 816 may be performed in a single entropy encoder. Operations of the first and second entropy decoders 823, 826 may be implemented in a single entropy decoder.
  • the video codec 800 may further comprise a first entropy decoder 823 configured to entropy decode an M-code received from the first entropy encoder 813, or another data source.
  • the video codec 800 may further comprise a first inverse quantization function 822 configured to de-quantize the entropy decoded M-code.
  • the video codec 800 may further comprise a first decoder neural network 821.
  • the first decoder neural network 821 may comprise a machine-targeted decoder neural network, which may be configured to decode the entropy decoded and de-quantized M-code. However, in some cases the video codec 800 may not comprise the first decoder neural network 821.
  • the video codec 800 may further comprise a second entropy decoder 826 configured to entropy decode an H-code received from the second entropy encoder 816, or another data source.
  • the M-code and the H-code are both decoded by the same entropy-decoder.
  • the video codec 800 may further comprise a second inverse quantization function 825 configured to de-quantize the entropy decoded H-code.
  • the video codec 800 may further comprise a second decoder neural network 824.
  • the second decoder neural network 824 may comprise a human-targeted decoder neural network, which may be configured to decode the entropy decoded and de-quantized H-code.
  • the second decoder neural network 824 may further receive as input the entropy decoded and de-quantized M-code from the first inverse quantization function 822.
  • the decoder device 120 may already have some decoded information (e.g. the output of the inverse quantization 822), and therefore the data sent by the human-targeted encoder NN 814 of the encoder device 110 may be additional information.
  • for example, if the human-targeted encoder 814 has performed an initial step comprising a subtraction between the dequantized M-features and the original video, the human-targeted decoder NN 824 may first decode the dequantized H-features, and then add the decoded H-features to the dequantized M-features.
  • the second decoder neural network 824 may further receive as input the output of the machine-targeted decoder neural network 821.
  • the first and second decoder neural networks 821, 824, the first and second inverse quantization functions 822, 825, and/or the first and second entropy decoders 823, 826, may form or be comprised in a decoder part of the video codec 800.
  • the encoder and decoder parts of the video codec 800 may be embodied as a single codec, or separately for example at different devices or as separate software and/or hardware components.
  • the video codec 800 may further comprise or be configured to be coupled to one or more machines, for example one or more task neural networks (task-NN) 830.
  • the task-NNs 830 may be representative of the task-NNs which will be used at an inference stage, for example when the video codec 800, or parts thereof, are deployed and used for compressing and/or decompressing data.
  • the task-NNs 830 may be pre-trained before the development stage.
  • training data in a domain suitable to be input to the task-NNs 830 may be obtained for the development stage.
  • the training data may not be annotated, for example, the training data may not contain ground-truth labels.
  • FIG. 9 illustrates an example of a human-targeted encoder neural network, according to an example embodiment.
  • the second encoder neural network 814 may comprise a human-targeted encoder neural network.
  • Human- targeted encoder neural network 914 is provided as an example of the second encoder neural network 814.
  • the human-targeted neural network 914 may process received video data with initial layers 915 of the human-targeted neural network 914.
  • the human-targeted neural network 914 may comprise an inverse quantization function 916 configured to de-quantize quantized M-features provided for example by the first quantization function 812.
  • the human-targeted neural network 914 may further comprise a combiner 917, which may combine the de-quantized M-features with features extracted by the initial layers 915.
  • the combiner 917 may for example concatenate the de-quantized M-features with features extracted by the initial layers 915.
  • the combiner 917 may first apply a machine-targeted decoder neural network similar to 821 on the dequantized M-features, and then concatenate the decoded M-features with features extracted by the initial layers 915.
  • the output of the combiner 917 may be provided to the final layers 918 of the human-targeted neural network 914.
  • the output of the human-targeted neural network 914 may comprise the spatio-temporal H-features output from the final layers 918.
  • FIG. 10 illustrates an example of a method for teaching a student neural network for performing a task, according to an example embodiment.
  • One or more auxiliary models may be obtained, for example as new version(s) of the one or more machines 130, 730, such as the task-NNs 830.
  • These auxiliary models may be trained to replicate behavior, for example the input-output mapping, of the machines 130, 730, which may be considered as primary models.
  • a targeted difference between the auxiliary model(s) and the primary model is that the auxiliary model(s) may be able to provide gradients of its output with respect to its input, which are more suitable for training the neural networks of the encoder device 110 and/or the decoder device 120.
  • the gradients may have larger magnitudes, since the effective depth of the auxiliary network may be lower than the effective depth of the primary model.
  • an apparatus may obtain a primary model configured to perform a task.
  • the apparatus may further obtain an auxiliary model.
  • An effective depth of the auxiliary model may be lower than an effective depth of the primary model.
  • the effective depth of the primary model may comprise or be determined based on a number of layers of the primary model.
  • the effective depth of the auxiliary model may comprise or be determined based on a number of layers of the auxiliary model.
  • the effective depth of the primary model may be determined based on the number of layers of the primary model and a number of skip connections of the primary model.
  • the effective depth of the auxiliary model may be determined based on the number of layers of the auxiliary model and a number of skip connections of the auxiliary model.
  • An effective depth of the primary or auxiliary model may relate to the ability of the model to provide sufficient gradients for training purposes.
  • the effective depth may be equal to the number of layers.
  • the neural network 300 could have skip connections that skip some of the layers of the neural network 300, thus allowing forward and backward signals to avoid passing through some of the layers.
  • neural network 300 could have one or more skip connections directly from the input nodes i_n to the second hidden layer nodes n_2,k instead of the first hidden layer nodes n_1,m. Therefore, the effective depth may be alternatively determined based on the number of layers and the number of skip connections. For example, the number of layers may be scaled with a ratio of the number of skip connections to the total number of connections. In general, the effective depth may reflect an amount of steps needed when backpropagating gradients through a model, such as for example neural network 300.
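One possible reading of the scaling described above, sketched as a small helper. The exact scaling rule is an assumption (here, scaling the layer count by the fraction of non-skip connections, so that more skip connections lower the effective depth, consistent with the surrounding text), and the connection counts are hypothetical:

```python
def effective_depth(num_layers, num_skip_connections, total_connections):
    """Scale the layer count by the fraction of non-skip connections.

    More skip connections mean gradients traverse fewer steps on average
    during backpropagation, lowering the effective depth.
    """
    non_skip_ratio = 1.0 - num_skip_connections / total_connections
    return num_layers * non_skip_ratio

plain = effective_depth(num_layers=50, num_skip_connections=0, total_connections=60)
residual = effective_depth(num_layers=50, num_skip_connections=20, total_connections=60)
# plain -> 50.0; residual is lower, reflecting the shortcuts available to gradients.
```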
  • An example of the primary model is a task neural network (task-NN) configured to perform a task.
  • the task-NN may be used as a teacher task-NN (TNN) 1002 to train the auxiliary model, an example of which is provided by the student task-NN (SNN) 1004.
  • the primary and auxiliary models may comprise other learnable models as well.
  • the TNN 1002 may be pretrained.
  • the SNN 1004 may be pretrained or it may be initialized by other means, for example based on a random process such as sampling from a probability distribution. Since architectures of models may be heterogeneous, the SNN 1004 may be engineered from scratch or be obtained using an architecture search given a specific criterion such as for example memory and/or power consumption.
  • a goal of the teacher-student training is to transfer knowledge or information from the primary model to the auxiliary model.
  • the apparatus may train the auxiliary model for performing the task based on a first loss function comprising an output of the primary model and an output of the auxiliary model.
  • input data such as for example video data may be provided to TNN 1002, which may provide a corresponding output.
  • the TNN 1002 may output a 1-dimensional array with N bins, where N is the number of considered classes.
  • the array may represent soft-labels, for example an approximation of a probability distribution over the considered classes.
  • values of the bins may be non-zero and the sum of all bin values may be equal to one.
  • the array may be represented as a one-hot vector, where only one bin value may be equal to one and other bin values may be equal to zero.
  • the one-hot representation may be also obtained from the soft-labels representation, for example based on applying a max operation on the soft-labels array.
  • the same input data may be input to SNN 1004, which may provide a corresponding output.
  • the apparatus may compute a first loss. The first loss may be computed based on the first loss function comprising the output of the TNN 1002 and the output of the SNN 1004, or the output of the SNN 1004 and a post-processed version of the output of the TNN 1002, for example a one-hot version of the probability distribution output by the TNN 1002.
  • the apparatus may differentiate the first loss with respect to parameters, for example weights, of the SNN 1004.
  • a parameter update may be computed based on the gradients obtained from the differentiation operation.
  • the parameter update may be used for performing one training iteration on the SNN 1004.
  • Training the SNN 1004 may be performed on training data available during the development stage.
  • the training data may comprise data that was used for training the TNN 1002, or different data.
  • the training data may be split into two or more subsets, such as for example a training subset, which may be used to update the SNN 1004, and validation subset, which may be used to evaluate performance of the SNN 1004.
  • Training the SNN 1004 may continue until detecting that at least one of the following conditions is satisfied: a predetermined number of training iterations is reached; a training loss does not decrease anymore, based on a predetermined value; a validation loss does not decrease anymore, based on a predetermined value; or other suitable stopping criteria.
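The teacher-student iterations above may be sketched with toy linear models, using cross-entropy against the teacher's soft labels as the first loss and finite differences in place of backpropagation (the model shapes, learning rate, and iteration count are all hypothetical simplifications):

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Toy linear classifiers: frozen teacher (TNN 1002), trainable student (SNN 1004).
W_teacher = rng.normal(size=(8, 4))             # stands in for the pretrained TNN
W_student = rng.normal(scale=0.1, size=(8, 4))  # randomly initialized SNN

x = rng.normal(size=(32, 8))                    # training batch; no labels needed
soft_labels = softmax(x @ W_teacher)            # teacher output: soft labels

def first_loss(W_s):
    """Cross-entropy between the TNN soft labels and the SNN output."""
    student_out = softmax(x @ W_s)
    return -np.mean(np.sum(soft_labels * np.log(student_out + 1e-12), axis=1))

def num_grad(W, eps=1e-5):
    """Finite differences standing in for differentiating the first loss."""
    g = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        W[idx] += eps; hi = first_loss(W)
        W[idx] -= 2 * eps; lo = first_loss(W)
        W[idx] += eps
        g[idx] = (hi - lo) / (2 * eps)
    return g

before = first_loss(W_student)
for _ in range(150):                            # training iterations on the SNN only
    W_student -= 0.1 * num_grad(W_student)
after = first_loss(W_student)
# The student's outputs move toward the teacher's soft labels: the loss decreases.
```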
  • the teacher-student training described above provides one possible technique for obtaining the auxiliary model, for example a version of the TNN 1002 which provides better gradients for training models of an encoder and/or a decoder.
  • other techniques may be used as well for obtaining the auxiliary model.
  • FIG. 11 illustrates an example of a method for training encoder and decoder neural networks based on a student neural network, according to an example embodiment.
  • the auxiliary model, for example the SNN 1004, may be used for developing an encoder neural network 1102 and/or a decoder neural network 1104.
  • steps in the signal-path between the encoder neural network 1102, the decoder neural network 1104, and the SNN 1004 may be differentiable.
  • the encoder neural network 1102 may receive input data, such as for example video data.
  • the encoder neural network 1102 may provide encoded video data to a decoder neural network 1104, which may provide the decoded video data to the SNN 1004.
  • the SNN 1004 may perform the task and provide an output, for example classification of an object appearing in the video data.
  • the apparatus may compute a second loss.
  • the second loss may be computed based on a second loss function comprising an output of the auxiliary model, for example the SNN 1004.
  • a value of the second loss function may be determined based on comparing the output of the SNN 1004 to ground-truth data, such as for example correct classification labels associated with the input video data.
  • ground-truth labels may be obtained as the output of either the TNN 1002 or the SNN 1004 when the input is the original video data, i.e., the video data input to the encoder neural network 1102.
  • the apparatus may determine a plurality of gradients of the second loss function with respect to input of the auxiliary model.
  • Training the encoder neural network 1102 and/or the decoder neural network 1104 may be further based on additional loss(es), such as for example compression loss(es).
  • training may be based on a combination, for example a linear combination, of one or more compression losses and the loss(es) determined based on output of the SNN 1004 and/or the TNN 1002.
  • a compression loss may for example act on the encoded and/or quantized and/or entropy coded features, for example the M-code and/or the H-code.
  • the apparatus may train the encoder neural network 1102 and/or the decoder neural network 1104 based on the plurality of gradients of the second loss function with respect to input of the auxiliary model. Gradients may be for example backpropagated through each layer of the SNN 1004 and provided to the decoder neural network 1104, where the backpropagation may proceed layer wise towards the input of the decoder neural network 1104. Gradients output from the input layer of the decoder neural network may be further backpropagated at the encoder neural network 1102 until the first layer of the encoder neural network 1102.
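A minimal sketch of this differentiable signal path, assuming PyTorch and toy dimensions (all module names and sizes here are hypothetical stand-ins for the encoder NN 1102, decoder NN 1104, and SNN 1004): the second loss is computed on the SNN's output and backpropagated through the frozen SNN into the decoder and then the encoder:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy stand-ins; dimensions are illustrative only.
encoder = nn.Sequential(nn.Linear(64, 16), nn.ReLU())
decoder = nn.Sequential(nn.Linear(16, 64))
snn = nn.Sequential(nn.Linear(64, 10))
snn.requires_grad_(False)  # the task network itself is not updated

opt = torch.optim.Adam(list(encoder.parameters()) +
                       list(decoder.parameters()))
ce = nn.CrossEntropyLoss()

x = torch.randn(8, 64)               # input data, e.g. flattened frames
labels = torch.randint(0, 10, (8,))  # ground-truth classification labels

decoded = decoder(encoder(x))          # every step in the path is differentiable
second_loss = ce(snn(decoded), labels) # second loss on the SNN's output
second_loss.backward()                 # gradients flow back through the SNN,
opt.step()                             # but update encoder and decoder only

enc_updated = all(p.grad is not None for p in encoder.parameters())
dec_updated = all(p.grad is not None for p in decoder.parameters())
snn_frozen = all(p.grad is None for p in snn.parameters())
```

Freezing the SNN with `requires_grad_(False)` means its layers pass gradients backward without accumulating parameter gradients themselves, matching the layer-wise backpropagation described above.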
  • FIG. 12 illustrates an example of a method for training encoder and decoder neural networks based on a student neural network and a teacher neural network, according to an example embodiment.
  • the apparatus may determine the plurality of gradients of the second loss function with respect to the input of the auxiliary model.
  • the apparatus may further determine a second plurality of gradients of a third loss function with respect to the input of the primary model.
  • the apparatus may train the encoder neural network 1102 and/or a decoder neural network 1104 based on the (first) plurality of gradients and the second plurality of gradients.
  • the third loss function may comprise the output of the primary model, for example the output of the TNN 1002.
  • a value of the third loss function may be determined based on comparing the output of the TNN 1002 to ground-truth data, such as for example the correct classification labels associated with the input video data.
  • ground-truth labels may be obtained as the output of the TNN 1002 when the input is the original video data, i.e., the video data input to the encoder neural network 1102.
  • the second loss may be obtained from the SNN 1004 and therefore when performing differentiation of the second loss with respect to encoder and decoder parameters, gradients of the second loss function with respect to the encoder and decoder parameters may be obtained through SNN 1004 (backward path).
  • the third loss may be obtained from the TNN 1002 and therefore when performing differentiation of the third loss with respect to the encoder and decoder parameters, the gradients of the third loss function with respect to the encoder and decoder parameters may be obtained through TNN 1002 (backward path).
  • the apparatus may determine the plurality of gradients with respect to the encoder and decoder parameters based on a weighted sum of the second loss function and the third loss function. For example, with reference to FIG. 12, both the teacher task-NN (TNN) 1002 and the student task-NN (SNN) 1004 may be used for training the encoder neural network 1102 and/or the decoder neural network 1104.
  • the apparatus may compute a third loss based on the third loss function, which may comprise the output of the TNN 1002.
  • the apparatus may compute the second loss based on the second loss function, which may comprise the output of the SNN 1004.
  • the second loss and the third loss obtained based on the outputs of the SNN 1004 and TNN 1002, respectively, may be weighted differently.
  • the apparatus may determine weights for the two losses.
  • the sum of weights may be equal to one.
  • the weights may be predetermined and/or fixed during training of the encoder neural network 1102 and/or the decoder neural network 1104. Alternatively, the weights may be changed during the training.
  • Training with variable weights may comprise starting with a high weight for the SNN loss (second loss) and a low weight for the TNN loss (third loss). The weight of the SNN loss may be gradually decreased and the weight of the TNN loss may be gradually increased. Therefore, the apparatus may iteratively increase the weight of the third loss function and iteratively decrease the weight of the second loss function to train the encoder neural network 1102 and/or the decoder neural network 1104.
  • the apparatus may compute a weighted sum of the second loss and the third loss based on the fixed or variable weights determined at 1212. Based on the weighted sum of the second and third losses, the apparatus may determine a plurality of gradients of the second loss with respect to the input of the SNN 1004 and a plurality of gradients of the third loss with respect to the input of the TNN 1002.
  • the apparatus may train the encoder neural network 1102 and/or the decoder neural network 1104 based on the plurality of gradients of the second loss function with respect to the input of the auxiliary model, similar to FIG. 11. This makes it possible to benefit from the stronger gradients provided by the SNN 1004 while also exploiting the better performing TNN 1002 to train the encoder neural network 1102 and/or the decoder neural network 1104.
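The variable-weight schedule described above — starting with a high weight on the SNN (second) loss and shifting weight to the TNN (third) loss — can be sketched as a small helper. The linear schedule and the 0.9/0.1 endpoints are illustrative choices, not specified by the disclosure:

```python
def loss_weights(step, total_steps, w_snn_start=0.9, w_snn_end=0.1):
    """Linearly move weight from the SNN (second) loss to the TNN (third)
    loss over training. The two weights always sum to one."""
    t = step / max(total_steps - 1, 1)
    w_snn = w_snn_start + t * (w_snn_end - w_snn_start)
    return w_snn, 1.0 - w_snn

# At each iteration the combined loss would be formed as
#   total_loss = w_snn * second_loss + w_tnn * third_loss
w_early = loss_weights(0, 101)    # early: SNN loss dominates
w_late = loss_weights(100, 101)   # late: TNN loss dominates
```

Differentiating the weighted sum then yields gradients through both backward paths (SNN 1004 and TNN 1002) in the proportions set by the schedule.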
  • the apparatus may obtain a plurality of primary models.
  • the plurality of primary models may be configured for the same or different task(s) and may comprise a plurality of task neural networks (task-NNs).
  • One or more of the task-NNs may be converted to models for which it may be possible to obtain better gradients for training the encoder neural network 1102 and/or the decoder neural network 1104.
  • the apparatus may obtain a plurality of auxiliary models corresponding to a first subset of the plurality of primary models.
  • the first subset of the plurality of primary models may be determined based on at least one of: a number of layers of at least one of the plurality of primary models; an effective depth of the at least one of the plurality of primary models; or a magnitude, for example an average magnitude, of gradients of an output of the at least one of the plurality of primary models with respect to an input of the at least one of the plurality of primary models.
  • primary model(s) having a high number of layers or a high effective depth, for example exceeding predetermined threshold(s), may be selected into the first subset and thereby converted into auxiliary model(s).
  • the plurality of auxiliary models may be determined based on an average magnitude of the gradients of the output of the NN with respect to its input. For example, primary models that are determined not to provide sufficient gradients may be selected to be included in the first subset of primary models.
  • the plurality of auxiliary models enables training of the encoder neural network 1102 and/or the decoder neural network 1104 by providing sufficient gradients. Selecting a subset of primary models for conversion to auxiliary model(s) simplifies the training process, since not all primary models need to be converted into auxiliary models having lower effective depth.
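The gradient-magnitude selection criterion above can be sketched as follows: models whose average input-gradient magnitude falls below a threshold are marked for conversion. The models, the threshold value, and the probe input are all hypothetical; the deep sigmoid stack simply demonstrates the vanishing-gradient effect that motivates conversion:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def avg_input_grad_magnitude(model, x):
    """Average magnitude of the gradients of the model's output with
    respect to its input, used as a proxy for gradient strength."""
    x = x.clone().requires_grad_(True)
    model(x).sum().backward()
    return x.grad.abs().mean().item()

# Hypothetical primary models: a shallow task-NN and a much deeper one
# whose many sigmoid layers strongly attenuate backpropagated gradients.
shallow = nn.Sequential(nn.Linear(16, 16))
deep = nn.Sequential(*[m for _ in range(30)
                       for m in (nn.Linear(16, 16), nn.Sigmoid())])

x = torch.randn(4, 16)
threshold = 1e-3  # illustrative threshold, not from the disclosure
# Primary models that do not provide sufficient gradients are selected
# into the first subset, i.e. marked for conversion to auxiliary models.
first_subset = [m for m in (shallow, deep)
                if avg_input_grad_magnitude(m, x) < threshold]
```

Here only the deep model ends up in the first subset; the shallow model would be used directly as part of the second subset of primary models.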
  • the apparatus may determine a third plurality of gradients with respect to inputs of the plurality of auxiliary models.
  • the third plurality of gradients may comprise a plurality of gradients with respect to inputs of each of the plurality of auxiliary models.
  • the apparatus may train the encoder neural network 1102 and/or the decoder neural network 1104 based on the third plurality of gradients with respect to inputs of the plurality of auxiliary models.
  • the encoder neural network 1102 and/or decoder neural network 1104 may be trained sequentially based on gradients with respect to input of one auxiliary network at a time.
  • the gradients with respect to inputs of the auxiliary networks may be combined to train the encoder neural network 1102 and/or the decoder neural network 1104.
  • one or more of these task-NNs may be used for training the encoder neural network 1102 and/or the decoder neural network 1104.
  • Some of the models used during the training may be the primary task-NNs, whereas some of the models may be the converted (SNN or auxiliary) versions of the primary task-NNs, based on the selection criteria discussed above.
  • the encoder neural network 1102 and/or the decoder neural network 1104 may be trained based on a second subset of the plurality of primary models.
  • the second subset of the plurality of primary models may comprise primary models that are not converted to auxiliary models, e.g., primary models that are not selected to the first subset of primary models.
  • the apparatus may train the encoder neural network 1102 and/or the decoder neural network 1104 based on the third plurality of gradients and a fourth plurality of gradients for the second subset of primary models.
  • the fourth plurality of gradients may be determined based on a set of loss functions, each loss function comprising an output of one of the second subset of primary models.
  • the second subset of primary models may comprise a set of primary models not selected to the first subset of primary models, for example primary models for which auxiliary model(s) have not been obtained. This enables the encoder neural network 1102 and/or the decoder neural network 1104 to be trained with sufficient gradients and with a more diverse set of task-NNs.
  • the apparatus may train the encoder neural network 1102 and/or the decoder neural network 1104 based on the third plurality of gradients with respect to inputs of the plurality of auxiliary models and a fifth plurality of gradients with respect to inputs of the plurality of primary models.
  • the fifth plurality of gradients may comprise a plurality of gradients with respect to inputs of each of the plurality of primary models.
  • the fifth plurality of gradients may be determined based on a set of loss functions, each loss function comprising an output of one of the plurality of primary models.
  • the fifth plurality of gradients may comprise the fourth plurality of gradients determined for the second subset of primary models.
  • the training may be based on sequentially training the encoder neural network 1102 and/or the decoder neural network 1104 with each model or combining the gradients provided by each model to train the encoder neural network 1102 and/or the decoder neural network 1104.
  • the apparatus may obtain a plurality of auxiliary models corresponding to a plurality of effective depths.
  • a plurality of auxiliary models may be obtained for a single primary model or a plurality of auxiliary models may be obtained for each of the first subset of primary models.
  • the apparatus may sequentially train the encoder neural network 1102 and/or the decoder neural network 1104 based on the plurality of auxiliary models with an increasing effective depth. For example, the apparatus may train multiple SNN versions, where each SNN may have a different number of layers. If a TNN has 50 layers, the apparatus may for example obtain five SNN versions with 5, 10, 20, 30, and 40 layers, respectively.
  • the shallower version with 5 layers may be used initially, and, for example every M training iterations, or when the training loss achieves a predetermined value or flattens, a slightly deeper version of the SNN may be used.
  • the encoder neural network 1102 and/or the decoder neural network 1104 may be trained with an increasing effective depth, which enables faster training because the shallower versions of the SNN are able to provide stronger gradients and therefore the initial phase of the training is more effective.
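The depth schedule above — five SNN versions of 5, 10, 20, 30, and 40 layers for a 50-layer TNN, switched every M iterations — can be expressed as a small selection rule. The switching interval shown is an illustrative assumption:

```python
def student_depth(iteration, m=1000, depths=(5, 10, 20, 30, 40)):
    """Return the depth of the SNN version to use at a given training
    iteration: after every m iterations, switch to the next deeper
    version, staying at the deepest once the list is exhausted."""
    index = min(iteration // m, len(depths) - 1)
    return depths[index]
```

A loss-based trigger (switching when the training loss flattens) would replace the fixed iteration count `m` with a check on recent loss values.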
  • the plurality of auxiliary models may be for example obtained based on inserting a varying number of identity layers between layers of an auxiliary model.
  • an auxiliary model may comprise a first, already trained SNN with 5 layers.
  • a second SNN with 10 layers may be formed by re-using the layers of the first SNN with already trained parameters and by inserting one or more identity layers between the pretrained layers.
  • one identity layer may be inserted between a first pretrained layer of the first SNN and a second pretrained layer of the first SNN.
  • An identity layer may comprise a layer which performs an identity mapping, that is, a layer for which the output is the same as its input.
  • the second SNN is configured to replicate the behavior of the first SNN.
  • some of the weight-update values may be zeroed out in order to train the identity layers effectively. This may be beneficial because of the zeros initially included in the identity layers.
  • the weight update values to be zeroed out may be determined based on a random process.
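A sketch of the identity-layer insertion described above, assuming PyTorch: a linear layer initialized to the identity mapping is placed between two pretrained layers, so the deeper second SNN initially replicates the first. Activations between layers are omitted here for brevity; with nonlinear activations an identity block would need matching care:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def identity_layer(dim):
    """A linear layer initialized to the identity mapping:
    weight = I, bias = 0, so its output equals its input."""
    layer = nn.Linear(dim, dim)
    with torch.no_grad():
        layer.weight.copy_(torch.eye(dim))
        layer.bias.zero_()
    return layer

# Hypothetical pretrained first SNN (feature width 8 is illustrative).
first_snn = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8))

# Deeper second SNN: reuse the pretrained layers and insert an identity
# layer between the first and second pretrained layers.
second_snn = nn.Sequential(first_snn[0], identity_layer(8), first_snn[1])

x = torch.randn(4, 8)
# Initially the deeper network replicates the shallower one's behavior.
replicates = torch.allclose(first_snn(x), second_snn(x), atol=1e-6)
```

During subsequent training, randomly zeroing some weight-update values, as described above, could be implemented by multiplying the identity layer's gradients by a random binary mask before the optimizer step.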
  • the TNNs may be adapted to the updated encoder and/or decoder neural networks.
  • Fine-tuning may comprise an iterative and collaborative scheme where a TNN trains an SNN, the SNN fine-tunes the encoder NN and the decoder NN, for example the encoder neural network 1102, and the TNN is updated.
  • the SNN, or in general any of the plurality of auxiliary networks, may be stored at the apparatus and retrieved upon a request for fine-tuning, for example fine-tuning of the encoder neural network 1102 and/or the decoder neural network 1104.
  • Although example embodiments have been described using video data as an example of input data, it is appreciated that example embodiments may be applied to other types of data such as for example image data, audio data, text data, or features extracted by a preceding model, for example another neural network.
  • At least one of image data, video data, audio data, text data, or features extracted by a preceding model may be encoded with the encoder neural network, for example to compress the data.
  • At least one of image data, video data, audio data, text data, or features extracted by a preceding model may be decoded with the decoder neural network, for example to decompress the data.
  • the encoder neural network 1102 and/or the decoder neural network 1104 may be stored or transmitted to another device for use in encoding and/or decoding of at least one of image data, video data, audio data, or text data.
  • the features extracted by a preceding model may be extracted for example from image data, video data, audio data, or text data.
  • FIG. 13 illustrates an example of a method 1300 for training an encoder neural network and/or a decoder neural network, according to an example embodiment.
  • the method may comprise obtaining a primary model configured to perform a task.
  • the method may comprise obtaining an auxiliary model, wherein an effective depth of the auxiliary model is lower than an effective depth of the primary model.
  • the method may comprise training the auxiliary model for performing the task based on a first loss function comprising an output of the primary model and an output of the auxiliary model.
  • the method may comprise determining a plurality of gradients of a second loss function with respect to an input of the auxiliary model, wherein the second loss function comprises the output of the auxiliary model.
  • the method may comprise training an encoder neural network and/or a decoder neural network based on the plurality of gradients.
  • An apparatus, for example encoder device 110, decoder device 120, or a combination thereof, may be configured to perform or cause performance of any aspect of the method(s) described herein.
  • a computer program may comprise instructions for causing, when executed, an apparatus to perform any aspect of the method(s) described herein.
  • a computer program may be configured to, when executed, cause an apparatus at least to perform any aspect of the method(s) described herein.
  • a computer program product or a computer readable medium may comprise program instructions for causing an apparatus to perform any aspect of the method(s) described herein.
  • an apparatus may comprise means for performing any aspect of the method(s) described herein.
  • the means comprises the at least one processor, and the at least one memory including program code, the at least one memory and the program code configured to, with the at least one processor, cause the apparatus at least to perform any aspect of the method(s).
  • circuitry may refer to one or more or all of the following: (a) hardware-only circuit implementations (such as implementations in only analog and/or digital circuitry); (b) combinations of hardware circuits and software, such as (as applicable): (i) a combination of analog and/or digital hardware circuit(s) with software/firmware and (ii) any portions of hardware processor(s) with software (including digital signal processor(s)), software, and memory(ies) that work together to cause an apparatus, such as a mobile phone or server, to perform various functions; and (c) hardware circuit(s) and/or processor(s), such as a microprocessor(s) or a portion of a microprocessor(s), that requires software (e.g., firmware) for operation, but the software may not be present when it is not needed for operation.
  • This definition of circuitry applies to all uses of this term in this application, including in any claims.
  • circuitry also covers an implementation of merely a hardware circuit or processor (or multiple processors) or portion of a hardware circuit or processor and its (or their) accompanying software and/or firmware.
  • circuitry also covers, for example and if applicable to the particular claim element, a baseband integrated circuit or processor integrated circuit for a mobile device or a similar integrated circuit in a server, a cellular network device, or other computing or network device.

Abstract

Example embodiments relate to training (1110) encoder and/or decoder neural networks (1102, 1104). An apparatus may obtain a primary model (1002) configured to perform a task. The apparatus may further obtain an auxiliary model (1004) and train (1008) the auxiliary model to perform the same task, for example based on a first loss function (1006) comprising an output of the primary model (1002) and an output of the auxiliary model (1004). The apparatus may further determine gradients of a second loss function (1108) with respect to input of the auxiliary model (1004). The second loss function may comprise the output of the auxiliary model. An effective depth of the auxiliary model may be lower than an effective depth of the primary model in order to obtain gradients with higher magnitude. The apparatus may then train (1110) the encoder and/or decoder neural networks (1102, 1104) based on the gradients of the second loss function. Apparatuses, methods, and computer programs are disclosed.

Description

TRAINING A DATA CODING SYSTEM FOR USE WITH MACHINES
TECHNICAL FIELD
[0001] The present application generally relates to encoding and decoding of data for different types of applications. In particular, some example embodiments of the present application relate to training of encoder and/or decoder neural networks for use with machine learning related applications or other non-human purposes.
BACKGROUND
[0002] Machine learning (ML) or other automated processes may be utilized for different applications in different types of devices, such as for example mobile phones. Example applications include compression and analysis of data, such as for example image data, video data, audio data, speech data, or text data. An encoder may be configured to transform input data into a compressed representation suitable for storage or transmission. A decoder may be configured to reconstruct the data based on the compressed representation. Subsequently a machine, such as for example a neural network (NN), may perform a task based on the reconstructed data.
SUMMARY
[0003] This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
[0004] Example embodiments improve training of encoder and/or decoder neural networks. This may be achieved by the features of the independent claims. Further implementation forms are provided in the dependent claims, the description, and the drawings.
[0005] According to an aspect, an apparatus comprises at least one processor; and at least one memory including computer program code; the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to: obtain a primary model configured to perform a task; obtain an auxiliary model, wherein an effective depth of the auxiliary model is lower than an effective depth of the primary model; train the auxiliary model for performing the task based on a first loss function comprising an output of the primary model and an output of the auxiliary model; determine a plurality of gradients of a second loss function with respect to an input of the auxiliary model, wherein the second loss function comprises the output of the auxiliary model; and train an encoder neural network and/or a decoder neural network based on the plurality of gradients.
[0006] According to an aspect, a method comprises obtaining a primary model configured to perform a task; obtaining an auxiliary model, wherein an effective depth of the auxiliary model is lower than an effective depth of the primary model; training the auxiliary model for performing the task based on a first loss function comprising an output of the primary model and an output of the auxiliary model; determining a plurality of gradients of a second loss function with respect to an input of the auxiliary model, wherein the second loss function comprises the output of the auxiliary model; and training an encoder neural network and/or a decoder neural network based on the plurality of gradients.
[0007] According to an aspect, a computer program comprises instructions for causing an apparatus to perform at least the following: obtaining a primary model configured to perform a task; obtaining an auxiliary model, wherein an effective depth of the auxiliary model is lower than an effective depth of the primary model; training the auxiliary model for performing the task based on a first loss function comprising an output of the primary model and an output of the auxiliary model; determining a plurality of gradients of a second loss function with respect to an input of the auxiliary model, wherein the second loss function comprises the output of the auxiliary model; and training an encoder neural network and/or a decoder neural network based on the plurality of gradients.
[0008] According to an aspect, an apparatus comprises means for obtaining a primary model configured to perform a task; means for obtaining an auxiliary model, wherein an effective depth of the auxiliary model is lower than an effective depth of the primary model; means for training the auxiliary model for performing the task based on a first loss function comprising an output of the primary model and an output of the auxiliary model; means for determining a plurality of gradients of a second loss function with respect to an input of the auxiliary model, wherein the second loss function comprises the output of the auxiliary model; and means for training an encoder neural network and/or a decoder neural network based on the plurality of gradients.
[0009] Many of the attendant features will be more readily appreciated as they become better understood by reference to the following detailed description considered in connection with the accompanying drawings.
DESCRIPTION OF THE DRAWINGS
[0010] The accompanying drawings, which are included to provide a further understanding of the example embodiments and constitute a part of this specification, illustrate example embodiments and together with the description help to explain the example embodiments. In the drawings:
[0011] FIG. 1 illustrates an example of a data coding system comprising an encoder device, a decoder device, and a machine configured to perform a task, according to an example embodiment;
[0012] FIG. 2 illustrates an example of an apparatus configured to practice one or more example embodiments;
[0013] FIG. 3 illustrates an example of a neural network, according to an example embodiment;
[0014] FIG. 4 illustrates an example of an elementary computation unit, according to an example embodiment;
[0015] FIG. 5 illustrates an example of a convolutional classification neural network, according to an example embodiment;
[0016] FIG. 6 illustrates an example of an auto-encoder comprising an encoder neural network and a decoder neural network, according to an example embodiment;
[0017] FIG. 7 illustrates an example of a video codec targeted for both machine and human consumption of video, according to an example embodiment;
[0018] FIG. 8 illustrates another example of a video codec targeted for both machine and human consumption of video, according to an example embodiment;
[0019] FIG. 9 illustrates an example of a human-targeted encoder neural network, according to an example embodiment;
[0020] FIG. 10 illustrates an example of a method for teaching a student neural network for performing a task, according to an example embodiment;
[0021] FIG. 11 illustrates an example of a method for training encoder and decoder neural networks based on a student neural network, according to an example embodiment;
[0022] FIG. 12 illustrates an example of a method for training encoder and decoder neural networks based on a student neural network and a teacher neural network, according to an example embodiment; and
[0023] FIG. 13 illustrates an example of a method for training an encoder neural network and/or a decoder neural network, according to an example embodiment.
[0024] Like references are used to designate like parts in the accompanying drawings.
DETAILED DESCRIPTION
[0025] Reference will now be made in detail to example embodiments, examples of which are illustrated in the accompanying drawings. The detailed description provided below in connection with the appended drawings is intended as a description of the present examples and is not intended to represent the only forms in which the present examples may be constructed or utilized. The description sets forth the functions of the example and a possible sequence of steps for constructing and operating the example. However, the same or equivalent functions and sequences may be accomplished by different examples.
[0026] Reducing distortion in image and video compression may be intended for increasing human perceptual quality, because the human user may be considered as the consumer for the decompressed data. However, with the advent of machine learning, for example deep learning, machines operating as autonomous agents may be configured to analyze data and even make decisions without human intervention. Examples of such analysis tasks include object detection, scene classification, semantic segmentation, video event detection, anomaly detection, pedestrian tracking, etc. Example use cases and applications include self-driving cars, video surveillance cameras and public safety, smart sensor networks, smart TV and smart advertisement, person re-identification, smart traffic monitoring, drones, etc. Since decoded data is more likely to be consumed by machines, it may be desirable to apply other quality metrics in addition or alternative to human perceptual quality, when considering media compression for inter-machine communication. Also, dedicated algorithms for compressing and decompressing data for machine consumption may be different from algorithms for compressing and decompressing data for human consumption. The set of tools and concepts for compressing and decompressing data for machine consumption may be referred to as video coding for machines (VCM).
[0027] Furthermore, a decoder device may comprise or have access to multiple machines, for example machine learning (ML) functions such as neural networks (NN). ML functions may be used in a certain combination with or without other machines, such as for example non-ML functions including, but not limited to, user related functions. Execution of the functions may be controlled by an orchestrator sub-system, which may be for example configured to determine an order of execution among functions. Multiple machines may be used for example in succession, based on the output of the previously used machine, and/or in parallel. For example, a decompressed video may be analyzed by one machine (e.g. a first neural network) for detecting pedestrians, by another machine (e.g. a second neural network) for detecting cars, and by another machine (e.g. a third neural network) for estimating depth of pixels in the video frames.
[0028] A neural network is one type of a machine, but it is appreciated that any process or algorithm, either learned from data or not, which analyzes or processes data for a certain task may be considered a machine. Furthermore, the receiver or decoder side may refer to a physical or abstract entity or device, which may contain, in addition to the decoder, one or more machines, and which may be configured to run the machine(s) on a decoded video representation. The video may have been encoded by another physical or abstract entity or device, which may be referred to as the transmitter or encoder side. The encoder and decoder may also be embodied in a single device.
[0029] According to an example embodiment, an apparatus may obtain a primary model configured to perform a task, for example a neural network configured to classify an image. The apparatus may further obtain an auxiliary model and train the auxiliary model to perform the same task, for example based on a first loss function comprising an output of the primary model and an output of the auxiliary model. The apparatus may further determine gradients of a second loss function with respect to input of the auxiliary model, which may comprise the output of the auxiliary model. An effective depth of the auxiliary model may be lower than an effective depth of the primary model in order to obtain gradients with higher magnitude. The apparatus may then train the encoder and/or decoder neural networks based on the gradients of the second loss function with respect to input of the auxiliary model. Training efficiency may be improved due to the higher magnitude of the gradients provided by the auxiliary model.
[0030] FIG. 1 illustrates an example of a data coding system 100 comprising an encoder device 110, a decoder device 120, and a machine 130, according to an example embodiment. Encoder device 110 may be configured to receive input data and produce encoded data, which may comprise an encoded representation of the input data. The encoded data may for example comprise a compressed version of the input data.

[0031] The encoded data may be delivered to decoder device 120 by various means, for example over a communication network. Therefore, the encoder device 110 may comprise a transmitter. The decoder device 120 may comprise a receiver. Alternatively, encoded data may be stored on a storage medium such as for example a hard drive or an external memory and retrieved from the memory by decoder device 120. The decoder device 120 may be configured to reconstruct the input data based on the encoded data received from the encoder device 110, or otherwise accessed by decoder device 120. As a result, decoded data may be output by the decoder device 120.
[0032] According to an example embodiment, encoder device 110 and decoder device 120 may be embodied as separate devices. It is however possible that a single device comprises one or more encoders and one or more decoders, for example as dedicated software and/or hardware components. Encoder device 110 may comprise a video encoder or video compressor. Decoder device 120 may comprise a video decoder or video decompressor. As will be further described below, an encoder may be implemented as a neural encoder comprising an encoder neural network. The neural encoder may further comprise one or more additional functions, for example quantization and a lossless encoder after the encoder neural network. A decoder may be implemented as a neural decoder comprising a decoder neural network. The neural decoder may further comprise one or more additional functions, for example dequantization and a lossless decoder before the decoder neural network. Even though some example embodiments are directed to video encoders and video decoders, it is appreciated that example embodiments may also be applied to other types of data, such as for example image data, audio data, speech data, or text data.
[0033] The decoded data provided by decoder device 120 may be processed by one or more machines 130. The one or more machines 130 may be located at decoder device 120. The one or more machines 130 may comprise one or more machine learning (ML) functions configured to perform one or more machine learning tasks. Examples of machine learning tasks include detecting an object of interest, classifying an object, recognizing identity of an object, or the like. The one or more machines 130 may comprise one or more neural networks. Alternatively, or additionally, the one or more machines 130 may comprise one or more non-ML functions such as for example algorithms or other non-learned functions. The non-ML functions may for example include algorithms for performing similar tasks as the machine learning functions. Different machines 130 may be associated with different metrics for encoding and/or decoding quality and therefore example embodiments provide methods for efficiently training a neural encoder and/or a neural decoder for use with particular machine(s) 130.
[0034] The one or more machines 130 may comprise models such as for example neural networks (e.g. task-NNs), for which it is possible to compute gradients of their output with respect to their input. Gradients with respect to a variable X, for example an input of a model, may be determined based on a loss function. For example, gradients of the loss function may be computed with respect to variable X. If the machines 130 comprise parametric models (such as neural networks), the gradients may be obtained based on computing the gradients of their output first with respect to their internal parameters and then with respect to their input, for example by using the chain rule for differentiation in mathematics. In case of neural networks, backpropagation may be used to obtain the gradients of the output of a neural network with respect to its input.

[0035] FIG. 2 illustrates an example of an apparatus configured to practice one or more example embodiments, according to an example embodiment. The apparatus 200 may for example comprise the encoder device 110 or the decoder device 120. Apparatus 200 may comprise at least one processor 202. The at least one processor may comprise, for example, one or more of various processing devices, such as for example a co-processor, a microprocessor, a controller, a digital signal processor (DSP), processing circuitry with or without an accompanying DSP, or various other processing devices including integrated circuits such as, for example, an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a microcontroller unit (MCU), a hardware accelerator, a special-purpose computer chip, or the like.
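As an illustrative sketch of obtaining gradients of a loss with respect to an input, the following fragment uses a hypothetical one-layer model (all values arbitrary), computes dL/dx with the chain rule, and verifies it against a finite-difference estimate:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# A one-layer "machine": y = sigmoid(w*x), loss L = 0.5*(y - t)^2.
w, t = 1.3, 0.2

def loss(x):
    return 0.5 * (sigmoid(w * x) - t) ** 2

# Gradient of the loss with respect to the input x via the chain rule:
# dL/dx = dL/dy * dy/dz * dz/dx = (y - t) * y*(1 - y) * w
def loss_grad_wrt_input(x):
    y = sigmoid(w * x)
    return (y - t) * y * (1.0 - y) * w

x, h = 0.5, 1e-5
analytic = loss_grad_wrt_input(x)
numeric = (loss(x + h) - loss(x - h)) / (2.0 * h)  # central-difference check
```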
[0036] The apparatus may further comprise at least one memory 204. The memory may be configured to store, for example, computer program code or the like, for example operating system software and application software. The memory may comprise one or more volatile memory devices, one or more non-volatile memory devices, and/or a combination thereof. For example, the memory may be embodied as magnetic storage devices (such as hard disk drives, floppy disks, magnetic tapes, etc.), optical magnetic storage devices, or semiconductor memories (such as mask ROM, PROM (programmable ROM), EPROM (erasable PROM), flash ROM, RAM (random access memory), etc.).
[0037] Apparatus 200 may further comprise communication interface 208 configured to enable apparatus 200 to transmit and/or receive information, for example compressed video data, to/from other devices. The communication interface may be configured to provide at least one wireless radio connection, such as for example a 3GPP mobile broadband connection (e.g. 3G, 4G, 5G); a wireless local area network (WLAN) connection such as for example standardized by the IEEE 802.11 series or Wi-Fi Alliance; a short-range wireless network connection such as for example a Bluetooth, NFC (near-field communication), or RFID connection; a local wired connection such as for example a local area network (LAN) connection or a universal serial bus (USB) connection, or the like; or a wired Internet connection.
[0038] Apparatus 200 may further comprise a user interface 210 comprising an input device and/or an output device. The input device may take various forms such as a keyboard, a touch screen, or one or more embedded control buttons. The output device may for example comprise a display, a speaker, a vibration motor, or the like.
[0039] When the apparatus is configured to implement some functionality, some component and/or components of the apparatus 200, such as for example the at least one processor 202 and/or the memory 204, may be configured to implement this functionality. Furthermore, when the at least one processor 202 is configured to implement some functionality, this functionality may be implemented using program code 206 comprised, for example, in the memory 204.
[0040] The functionality described herein may be performed, at least in part, by one or more computer program product components such as software components. According to an example embodiment, the apparatus comprises a processor or processor circuitry, such as for example a microcontroller, configured by the program code when executed to execute the embodiments of the operations and functionality described. Alternatively, or in addition, the functionality described herein can be performed, at least in part, by one or more hardware logic components. For example, and without limitation, illustrative types of hardware logic components that can be used include Field-Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), System-on-a-Chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), and Graphics Processing Units (GPUs).
[0041] The apparatus 200 may comprise means for performing at least one method described herein. In one example, the means comprises the at least one processor 202, the at least one memory 204 including program code 206, the at least one memory 204 and the program code 206 configured to, with the at least one processor 202, cause the apparatus 200 to perform the method(s).
[0042] Apparatus 200 may comprise a computing device such as for example a mobile phone, a tablet computer, a laptop, an internet of things (IoT) device, or the like. Examples of IoT devices include, but are not limited to, consumer electronics, wearables, and smart home appliances. In one example, apparatus 200 may comprise a vehicle such as for example a car. Although apparatus 200 is illustrated as a single device, it is appreciated that, wherever applicable, functions of apparatus 200 may be distributed to a plurality of devices, for example to implement example embodiments as a cloud computing service.
[0043] FIG. 3 illustrates an example of a neural network, according to an example embodiment. A neural network may comprise a computation graph with several layers of computation. For example, neural network 300 may comprise an input layer, one or more hidden layers, and an output layer. Nodes of the input layer, i1 to in, may be connected to one or more of the m nodes of the first hidden layer, n11 to n1m. Nodes of the first hidden layer may be connected to one or more of the k nodes of the second hidden layer, n21 to n2k. It is appreciated that even though the example neural network of FIG. 3 illustrates two hidden layers, a neural network may apply any number and any type of hidden layers. Neural network 300 may further comprise an output layer. Nodes of the last hidden layer, in the example of FIG. 3 the nodes of the second hidden layer, n21 to n2k, may be connected to one or more nodes of the output layer, o1 to oj. It is noted that the number of nodes may be different for each layer of the network. A node may also be referred to as a neuron, a computation unit, or an elementary computation unit. The terms neural network, neural net, network, and model may be used interchangeably. A model may comprise a neural network, but a model may also refer to another learnable model. Weights of the neural network may be referred to as learnable parameters or simply as parameters. In the example of FIG. 3, one or more of the layers may be fully connected layers, for example layers where each node is connected to every node of a previous layer.
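A forward pass through a small fully connected network of this shape may be sketched as follows (plain Python; the layer sizes and weights are arbitrary illustrative choices):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# One fully connected layer: every output node is connected to every input node.
def dense(inputs, weights, biases, act):
    return [act(sum(w * a for w, a in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

# Input layer (3 nodes) -> first hidden layer (4 nodes)
# -> second hidden layer (2 nodes) -> output layer (2 nodes).
x = [0.5, -1.0, 2.0]
h1 = dense(x, [[0.1] * 3] * 4, [0.0] * 4, sigmoid)    # first hidden layer
h2 = dense(h1, [[0.2] * 4] * 2, [0.1] * 2, sigmoid)   # second hidden layer
out = dense(h2, [[0.3] * 2] * 2, [0.0] * 2, sigmoid)  # output layer
```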
[0044] Two example architectures of neural networks include feed-forward and recurrent architectures. Feed-forward neural networks are such that there is no feedback loop. Each layer takes input from one or more previous layers and provides its output as the input for one or more of the subsequent layers. Also, units inside certain layers may take input from units in one or more of preceding layers and provide output to one or more of following layers.
[0045] Initial layers, for example layers close to the input data, may extract semantically low-level features. For example, in image or video data the low-level features may correspond to edges and textures in images or video frames. Intermediate and final layers may extract more high-level features. After the feature extraction layers there may be one or more layers performing a certain task, such as classification, semantic segmentation, object detection, denoising, style transfer, super-resolution, or the like.

[0046] In recurrent neural networks there is a feedback loop from one or more nodes of one or more subsequent layers. This causes the network to become stateful. For example, the network may be able to memorize information or a state.
[0047] FIG. 4 illustrates an example of an elementary computation unit, according to an example embodiment. The elementary computation unit may comprise a node 401, which may be configured to receive one or more inputs, a1 to an, from one or more nodes of one or more previous layers and compute an output based on the input values received. The node 401 may also receive feedback from one or more nodes of one or more subsequent layers. Inputs may be associated with parameters to adjust the influence of a particular input on the output. For example, weights w1 to wn associated with the inputs a1 to an may be used to multiply the input values a1 to an. The node 401 may be further configured to combine the inputs into an output, or an activation. For example, the node 401 may be configured to sum the modified input values. A bias or offset b may also be applied to add a constant to the combination of modified inputs. Weights and biases may be learnable parameters. For example, when the neural network is trained for a particular task, the values of the weights and biases associated with different inputs and different nodes may be updated such that an error associated with performing the task is reduced to an acceptable level.
[0048] Furthermore, an activation function f() may be applied to control when and how the node 401 provides the output. The activation function may be for example a non-linear function that is substantially linear in the region of zero but limits the output of the node when the input increases or decreases. Examples of activation functions include, but are not limited to, a step function, a sigmoid function, a tanh function, or a ReLU (rectified linear unit) function. The output may be provided to nodes of one or more following layers of the network, and/or to one or more nodes of one or more previous layers of the network.
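The computation of a single node, a weighted sum of the inputs plus a bias followed by an activation function, may be sketched as follows (the input, weight, and bias values are illustrative):

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Elementary computation unit: weighted sum of the inputs a1..an with
# weights w1..wn, plus bias b, passed through an activation function f().
def node_output(inputs, weights, bias, activation):
    z = sum(w * a for w, a in zip(weights, inputs)) + bias
    return activation(z)

a = [1.0, 2.0, 3.0]       # inputs a1..an
w = [0.5, -0.25, 0.1]     # learnable weights w1..wn
y = node_output(a, w, 0.2, relu)  # 0.5 - 0.5 + 0.3 + 0.2 = 0.5
```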
[0049] A forward propagation or a forward pass may comprise feeding a set of input data through the layers of the neural network 300 and producing an output. During this process the weights and biases of the neural network 300 affect the activations of individual nodes and thereby the output provided by the output layer.
[0050] One property of neural networks and other machine learning tools is that they are able to learn properties from input data, for example in a supervised or an unsupervised way. Learning may be based on teaching the network by a training algorithm or based on a meta-level neural network providing a training signal.
[0051] In general, a training algorithm may include changing some properties of the neural network such that its output becomes as close as possible to a desired output. For example, in the case of classification of objects in images or video frames, the output of the neural network may be used to derive a class or category index, which indicates the class or category that the object in the input data belongs to. Training may happen by minimizing or decreasing the output's error, also referred to as the loss.
[0052] During training the generated or predicted output may be compared to a desired output, for example ground-truth data provided for training purposes, to compute an error value or a loss value. The error may be calculated based on a loss function. Updating the neural network may be then based on calculating a derivative with respect to learnable parameters of the network. This may be done for example using a backpropagation algorithm that determines gradients for each layer starting from the final layer of the network until gradients for the learnable parameters have been obtained. Parameters of each layer are updated accordingly such that the loss is iteratively decreased. Examples of losses include mean squared error, cross-entropy, or the like. In deep learning, training may comprise an iterative process, where at each iteration the algorithm modifies parameters of the neural network to make a gradual improvement of the network's output, that is, to gradually decrease the loss.
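The iterative loss-minimization loop described above may be sketched for a minimal one-parameter model; the training data, learning rate, and iteration count are hypothetical, and the gradient is written out analytically instead of being produced by a backpropagation library:

```python
# Training data: inputs paired with desired (ground-truth) outputs, y = 2x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

# Loss: mean squared error between predicted and desired outputs.
def mse(w):
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

w, lr = 0.0, 0.01
initial_loss = mse(w)
for _ in range(200):  # iterative training: gradual improvement per iteration
    grad = sum(2.0 * (w * x - y) * x for x, y in data) / len(data)
    w -= lr * grad    # update the learnable parameter to decrease the loss
final_loss = mse(w)
```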
[0053] Deep neural networks may suffer from vanishing gradients, which may cause updates to the learnable parameters to be so small that training the neural network becomes slow or stops completely. For example, each weight associated with the nodes of the layers of the neural network may receive an update that is proportional to a partial derivative of the loss function. If the number of layers is high, the update may not cause any significant change in the weights and thereby also in the output of the neural network.
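The effect can be demonstrated numerically: with a chain of sigmoid layers, each additional layer multiplies the gradient reaching the first layer's weight by another sigmoid derivative (at most 0.25), so the update shrinks rapidly with depth (illustrative sketch; unit weights and a squared-error loss with target 0 are assumed):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Gradient of L = 0.5*(y - t)^2 with respect to the first-layer weight of a
# chain of sigmoid layers (all weights fixed to 1 for clarity): the chain
# rule multiplies one sigmoid derivative per layer.
def first_layer_weight_grad(depth, x=1.0, t=0.0):
    a, sprimes = x, []
    for _ in range(depth):
        a = sigmoid(a)
        sprimes.append(a * (1.0 - a))  # each factor is at most 0.25
    g = a - t
    for sp in sprimes:
        g *= sp
    return g * x

shallow = first_layer_weight_grad(3)   # shallow network
deep = first_layer_weight_grad(20)     # deep network: vanishing gradient
```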
[0054] The training phase of the neural network may be ended after reaching an acceptable error level. In the inference phase, the trained neural network may be applied to a particular task, for example, to provide a classification of an unseen image to one of a plurality of classes based on content of an input image.
[0055] Training a neural network may be seen as an optimization process, but the final goal may be different from a typical goal of optimization. In optimization, the goal may be to minimize a functional. In machine learning, a goal of the optimization or training process is to make the model learn the properties of the data distribution from a limited training dataset. In other words, the goal is to use a limited training dataset in order to learn to generalize to previously unseen data, that is, data which was not used for training the model. This may be referred to as generalization.

[0056] In practice, data may be split into at least two sets, a training data set and a validation data set. The training data set may be used for training the network, for example to modify its learnable parameters in order to minimize the loss. The validation data set may be used for checking the performance of the network on data which was not used to minimize the loss, as an indication of the final performance of the model. In particular, the errors on the training set and on the validation data set may be monitored during the training process to understand the following issues: 1) whether the network is learning at all - in this case, the training data set error should decrease, otherwise the model is in the regime of underfitting; 2) whether the network is learning to generalize - in this case, the validation set error should also decrease and not be much higher than the training data set error. If the training data set error is low, but the validation data set error is much higher than the training data set error, or it does not decrease, or it even increases, the model is in the regime of overfitting. This means that the model has merely memorized properties of the training data set and performs well on that set, but performs poorly on a set not used for tuning its parameters.
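The two monitoring checks may be expressed as a simple rule of thumb; the 1.5x threshold for "much higher" below is a hypothetical illustrative choice, not a standard value:

```python
# Diagnose the training regime from monitored error histories
# (first and last recorded values of the training and validation errors).
def diagnose(train_errors, val_errors):
    if train_errors[-1] >= train_errors[0]:
        return "underfitting"   # training error is not decreasing
    if val_errors[-1] > 1.5 * train_errors[-1] or val_errors[-1] > val_errors[0]:
        return "overfitting"    # validation error much higher or increasing
    return "generalizing"       # both errors decrease and stay close

history_ok = ([1.0, 0.5, 0.2], [1.1, 0.6, 0.25])
history_bad = ([1.0, 0.3, 0.05], [1.0, 0.8, 0.9])
```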
[0057] FIG. 5 illustrates an example of a convolutional classification neural network 500. A convolutional neural network 500 comprises at least one convolutional layer. A convolutional layer performs convolutional operations to extract information from input data, for example image 502, to form a plurality of feature maps 506. A feature map may be generated by applying a filter or a kernel to a subset of input data, for example block 504 in image 502, and sliding the filter through the input data to obtain a value for each element of the feature map. The filter may comprise a matrix or a tensor, which may be for example multiplied with the input data to extract features corresponding to that filter. A plurality of feature maps may be generated based on applying a plurality of filters. A further convolutional layer may take as input the feature maps from a previous layer and apply the same filtering principle on the feature maps 506 to generate another set of feature maps 508. Weights of the filters may be learnable parameters and they may be updated during a training phase, similar to parameters of neural network 300. Similar to node 401, an activation function may be applied to the output of the filter(s). The convolutional neural network may further comprise one or more other types of layers such as for example fully connected layers 510 after and/or between the convolutional layers. An output may be provided by an output layer 512, which in the example of FIG. 5 comprises a classification layer. For a classification task the output may comprise an approximation of a probability distribution, for example an N-dimensional array 514, where N is the number of classes, and where the sum of the values in the array is 1. Each element of the array 514 may indicate a probability of the input image belonging to a particular class, such as for example a class of cats, dogs, horses, or the like. Elements of the array 514 may be called bins.
The output may be represented either as one-hot representation where only one class bin is one and other class bins are zero, or as soft labels where the array comprises a probability distribution instead of a one-hot representation. In the latter case, all bins of the output array 514 may have a value different from zero, as illustrated in FIG. 5.
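The probability-distribution output and the one-hot alternative may be sketched with a standard softmax (the logit values are illustrative):

```python
import math

# Softmax turns the network's raw outputs (logits) into an approximate
# probability distribution over the N class bins, summing to 1.
def softmax(logits):
    m = max(logits)                           # subtract max for stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 0.5, -1.0])             # soft labels: all bins non-zero

# One-hot representation: only the winning class bin is one, others are zero.
one_hot = [1 if i == probs.index(max(probs)) else 0 for i in range(len(probs))]
```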
[0058] In addition to implementing the one or more machines 130, neural networks may also be applied at encoder device 110 or decoder device 120. Neural networks may be used either to perform the whole encoding or decoding process or to perform some steps of the encoding or decoding process. The former option may be referred to as end-to-end learned compression. Learned compression may be for example based on an auto-encoder structure that is trained to encode and decode video data.

[0059] FIG. 6 illustrates an example of an auto-encoder comprising an encoder neural network and a decoder neural network, according to an example embodiment. An auto-encoder is a neural network comprising an encoder network 611, which may be configured to compress data or make the input data more compressible at its output, for example by having lower entropy. The auto-encoder may further comprise a decoder network 621, which may take the compressed data, for example the data output by the encoder network 611 or data output by a step performed after the encoder network 611, and output a reconstruction of the original data, possibly with some loss. It is noted that example embodiments may be applied to various other types of neural networks configured to be applied in an encoding or decoding process and the auto-encoder 600 is provided only as an example.
[0060] The auto-encoder 600 may be trained based on a training dataset. For each training iteration, a subset of data may be sampled from the training dataset and input to the encoder network 611. The output of the encoder network 611 may be subject to further processing steps, such as for example binarization, quantization, and/or entropy coding. Finally, the output, which may be also referred to as a code, may be input to the decoder network 621, which may reconstruct the original data input to the encoder network 611. The reconstructed data may differ from the original input data. The difference between the input data and the reconstructed data may be referred to as the loss. However, the auto-encoder pipeline may also be designed in a way that there is no loss in reconstruction. A loss or error value may be computed by comparing the output of the decoder network 621 to the input of the encoder network 611. The loss value may be computed for example based on a mean squared error (MSE), a peak signal-to-noise ratio (PSNR), structural similarity (SSIM), or the like. Such distortion metrics may be inversely proportional to the human visual perception quality. Methods for training the encoder network 611 and/or the decoder network 621 for non-human purposes are also disclosed herein.
[0061] Another loss function may be used for encouraging the output of the encoder to be more compressible, for example to have low entropy. This loss may be used in addition to the loss measuring the quality of data reconstruction. In general, a plurality of losses may be computed and then added together for example via a linear combination (weighted average) to obtain a combined loss. The combined loss value may be then differentiated with respect to the weights and/or other parameters of the encoder network 611 and decoder network 621. Differentiation may be done for example based on backpropagation, as described above. The obtained gradients may then be used to update or change the parameters (e.g. weights), for example based on a stochastic gradient descent algorithm or any other suitable algorithm. This process may be iterated until a stopping criterion is met. As a result, the neural auto-encoder is trained to compress the input data and to reconstruct original data from the compressed representation. According to an example embodiment, the encoder device 110 may comprise a neural encoder, for example the encoder network 611 of the auto-encoder 600. According to an example embodiment, the decoder device 120 may comprise a neural decoder, for example the decoder network 621 of the auto-encoder 600.
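Training with such a combined loss may be sketched with scalar stand-ins for the encoder and decoder networks; the code-magnitude term is only a crude proxy for entropy, and the weighting factor, learning rate, and data are all arbitrary illustrative values:

```python
# Scalar stand-ins: encoder code = e*x, decoder reconstruction = d*code.
data = [0.5, 1.0, 1.5]
lam = 0.1  # illustrative weight of the compressibility term

# Combined loss: reconstruction error plus a weighted "rate" term on the code.
def combined_loss(e, d):
    return sum((d * e * x - x) ** 2 + lam * (e * x) ** 2 for x in data) / len(data)

e, d, lr = 0.5, 0.5, 0.01
before = combined_loss(e, d)
for _ in range(300):
    for x in data:
        code, rec = e * x, d * e * x
        ge = 2.0 * (rec - x) * d * x + 2.0 * lam * code * x  # dL/de
        gd = 2.0 * (rec - x) * code                          # dL/dd
        e -= lr * ge  # stochastic gradient descent on both "networks"
        d -= lr * gd
after = combined_loss(e, d)
```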
[0062] Video coding may be alternatively performed by algorithmic video codecs. Examples of algorithmic video codecs include hybrid video codecs, such as for example codecs similar to the ITU-T H.263, H.264, H.265, and H.266 standards. Hybrid video encoders may code video information in two phases. Firstly, pixel values in a certain picture area, for example a block, may be predicted for example by motion compensation means or spatial means. Motion compensation may comprise finding and indicating an area in one of the previously coded video frames that corresponds to the block being coded. Applying spatial means may comprise using pixel values around the block to be coded in a specified manner.
[0063] Secondly, the prediction error, for example the difference between the predicted block of pixels and the original block of pixels, may be coded. This may be done based on transforming the difference in pixel values using a transform, such as for example a discrete cosine transform (DCT) or a variant of DCT, quantizing the coefficients, and entropy coding the quantized coefficients. By varying the fidelity of the quantization process, the encoder can control the balance between the accuracy of the pixel representation (picture quality) and the size of the resulting coded video representation (file size or transmission bitrate).
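The fidelity/size trade-off controlled by the quantization step may be sketched as follows; the coefficient values are made up, and the count of non-zero quantization levels stands in for the coded size:

```python
# Uniform quantization of transform coefficients with step size q:
# a larger q gives coarser fidelity (more distortion) but a smaller coded size.
def quantize(coeffs, q):
    return [round(c / q) for c in coeffs]

def dequantize(levels, q):
    return [l * q for l in levels]

coeffs = [12.3, -4.1, 0.7, 0.2]  # e.g. DCT coefficients of a residual block

def distortion(q):
    rec = dequantize(quantize(coeffs, q), q)
    return sum(abs(c - r) for c, r in zip(coeffs, rec))

def nonzero_levels(q):  # crude proxy for the entropy-coded size
    return sum(1 for l in quantize(coeffs, q) if l != 0)
```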
[0064] Inter prediction, which may also be referred to as temporal prediction, motion compensation, or motion- compensated prediction, exploits temporal redundancy. In inter prediction the sources of prediction are previously decoded pictures.
[0065] Intra prediction utilizes the fact that adjacent pixels within the same picture are likely to be correlated. Intra prediction can be performed in the spatial or transform domain, which means that either sample values or transform coefficients can be predicted. Intra prediction may be exploited in intra coding, where no inter prediction is applied.

[0066] One outcome of the coding procedure may comprise a set of coding parameters, such as motion vectors and quantized transform coefficients. Many parameters can be entropy-coded more efficiently if they are predicted first from spatially or temporally neighbouring parameters. For example, a motion vector may be predicted from spatially adjacent motion vectors and the difference relative to the motion vector predictor may be coded. Prediction of coding parameters and intra prediction may be collectively referred to as in-picture prediction.

[0067] A video decoder may reconstruct the output video based on prediction means similar to the encoder, to form a predicted representation of the pixel blocks. Reconstruction may be based on motion or spatial information created by the encoder and stored in the compressed representation, and on prediction error decoding, which may comprise an inverse operation of the prediction error coding to recover the quantized prediction error signal in the spatial pixel domain. After applying prediction and prediction error decoding means, the decoder may sum up the prediction and prediction error signals, for example pixel values, to form the output video frame. The decoder, and also the encoder, can also apply additional filtering means to improve the quality of the output video before passing it for display and/or storing it as a prediction reference for the forthcoming frames in the video sequence.
[0068] In video codecs the motion information may be indicated with motion vectors associated with each motion compensated image block. Each of these motion vectors represents the displacement of the image block in the picture to be coded or decoded relative to the prediction source block in one of the previously coded or decoded pictures. In order to represent motion vectors efficiently, the motion vectors may be coded differentially with respect to block-specific predicted motion vectors. The predicted motion vectors may be created in a predefined way, for example based on calculating a median of encoded or decoded motion vectors of adjacent blocks.
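Median-based motion vector prediction and differential coding may be sketched as follows (the motion vectors are hypothetical):

```python
import statistics

# Predict the current block's motion vector as the component-wise median
# of the motion vectors of adjacent blocks, and code only the difference.
def predict_mv(neighbor_mvs):
    return (statistics.median(mv[0] for mv in neighbor_mvs),
            statistics.median(mv[1] for mv in neighbor_mvs))

neighbors = [(4, -2), (6, 0), (5, 3)]  # MVs of adjacent blocks
mv = (7, 1)                            # MV of the current block

pred = predict_mv(neighbors)
residual = (mv[0] - pred[0], mv[1] - pred[1])  # coded difference

# Decoder side: same prediction plus the decoded difference.
decoded = (pred[0] + residual[0], pred[1] + residual[1])
```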
[0069] Another way to create motion vector predictions may comprise generating a list of candidate predictions from adjacent blocks and/or co-located blocks in temporal reference pictures and signalling the chosen candidate as the motion vector predictor. In addition to predicting the motion vector values, the reference index of a previously coded or decoded picture can be predicted. The reference index may be predicted from adjacent blocks and/or co-located blocks in a temporal reference picture.
[0070] Moreover, high efficiency video codecs may employ an additional motion information coding or decoding mechanism, which may be called a merging/merge mode, where motion field information, comprising a motion vector and a corresponding reference picture index for each available reference picture list, is predicted and used without any modification/correction. Similarly, predicting the motion field information may be carried out based on using the motion field information of adjacent blocks and/or co-located blocks in temporal reference pictures, and the used motion field information may be signalled by means of a candidate list filled with the motion field information of available adjacent/co-located blocks.
[0071] The prediction residual after motion compensation may be first transformed with a transform kernel, for example DCT, and then coded. One reason for this is that there may still be some correlation within the residual, and applying the transform may reduce this correlation and enable more efficient coding.

[0072] Video encoders may utilize Lagrangian cost functions to find optimal coding modes, for example the desired macroblock mode and associated motion vectors. This kind of cost function may use a weighting factor λ to tie together the (exact or estimated) image distortion due to lossy coding methods and the (exact or estimated) amount of information that is required to represent the pixel values in an image area: C = D + λR, where C is the Lagrangian cost to be minimized, D is the image distortion, for example the mean squared error (MSE) with the mode and motion vectors considered, and R is the number of bits needed to represent the required data to reconstruct the image block in the decoder, which may include the amount of data to represent the candidate motion vectors.
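The Lagrangian mode decision may be sketched as follows; the distortion and rate values per mode are made up for illustration:

```python
# Lagrangian cost C = D + lambda*R: lambda trades distortion against rate.
def lagrangian_cost(distortion, rate, lam):
    return distortion + lam * rate

modes = {
    "intra": (100.0, 40.0),  # (distortion D, bits R)
    "inter": (120.0, 10.0),
}

# Choose the coding mode that minimizes the Lagrangian cost.
def best_mode(lam):
    return min(modes, key=lambda m: lagrangian_cost(*modes[m], lam))
```

A large lambda penalizes rate and favors the cheap "inter" mode; a small lambda favors the low-distortion "intra" mode.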
[0073] Neural networks may be used in image and video compression, either to perform the whole compression or decompression process or to perform some steps of the compression or decompression process. When used for some step(s), an encoder neural network may for example be configured to perform a step which takes as input a decorrelated version of the input data, for example an output of a transform such as a Fourier transform or a discrete cosine transform (DCT). The encoder neural network may be for example configured to perform one of the last steps of the encoding process. The encoder neural network may be followed by one or more post-processing steps such as for example binarization, quantization, and/or an arithmetic encoding step, for example a lossless encoder such as an entropy encoder. The decoder neural network may be located at a corresponding position at the decoder and be configured to perform a corresponding inverse function.
[0074] Example embodiments provide an encoder and decoder which may be applied for compression and decompression of data to be consumed by machine(s). The decompressed data may also be consumed by humans, either at the same time or at different times with respect to consumption of the decompressed data at the machine(s). A codec may comprise multiple parts, where some parts may be used for compressing or decompressing data for machine consumption, and other parts may be used for compressing or decompressing data for human consumption.
[0075] FIG. 7 illustrates an example of a video codec targeted for both machine and human consumption of video, according to an example embodiment. Video codec 700 may comprise an encoder 712, for example an algorithmic encoder as illustrated in FIG. 7, such as for example according to or based on the H.266 standard. Alternatively, the encoder 712 may comprise a neural network. The encoder 712 may be configured to provide a first bitstream, which may be a human-targeted bitstream. The video codec 700 may further comprise a neural encoder 714, for example the encoder network 611 of the auto-encoder 600. The neural encoder 714 may be configured to provide a second bitstream, which may be a machine-targeted bitstream. The encoder 712 and neural encoder 714 may form or be comprised in an encoder part of the video codec 700. The encoder part may further comprise instances of the decoder 722 and/or neural decoder 724.
[0076] Video codec 700 may further comprise a decoder 722, for example an algorithmic decoder as illustrated in FIG. 7, such as for example according to or based on the H.266 standard. Alternatively, the decoder 722 may comprise a neural network. The algorithmic decoder 722 may be configured to decode the first bitstream, for example to output a decoded human-targeted video. The video codec 700 may further comprise a neural decoder 724, for example the decoder network 621 of the auto-encoder 600. The neural decoder 724 may be configured to decode the second bitstream, for example to output a decoded machine-targeted video to one or more machines 730. The algorithmic decoder 722 and neural decoder 724 may form or be comprised in a decoder part of the video codec 700. Alternatively, the encoder and decoder parts of the video codec 700 may be embodied separately, for example at different devices or as separate software and/or hardware components. For example, the encoder part may be comprised in the encoder device 110. The decoder part may be comprised in the decoder device 120.
[0077] At the training phase, the encoder device 110, or other training device, may have access to the algorithmic decoder 722 and/or the neural decoder 724. For example, feedback 726 may be provided from the algorithmic decoder 722 to the neural encoder 714. Feedback 728 may be provided from the algorithmic decoder 722 to the neural decoder 724. Hence, the decoded first bitstream output by the algorithmic decoder 722 may be provided to the neural encoder 714 and/or to the neural decoder 724.
[0078] At the inference phase, there may not be feedback from the decoder device 120 to the encoder device 110. However, the encoder device 110 may embed an instance of the algorithmic decoder 722, and feedback may be provided internally within the encoder device 110. The feedback 726 may be provided to inform the neural encoder 714 about what information has already been reconstructed at the decoder device 120 side by its algorithmic decoder 722, and thus to determine what additional information to encode for the machines 730.
[0079] At the inference phase, the feedback 728 may be provided within the decoder device 120. The feedback 728 may be applied to provide the neural decoder 724 of the decoder device 120 with information about the already decoded information from the algorithmic decoder 722 of the decoder device 120, and to combine it with the machine-targeted bitstream to decode the machine targeted video.
[0080] It is noted that instead of video data, the above approach may also be applied to features extracted from video data. Furthermore, the human-targeted video and the machine-targeted video may be processed in either order. It is also possible to apply the algorithmic encoder 712 and the algorithmic decoder 722 for machine-targeted video. The neural encoder 714 and the neural decoder 724 may alternatively be used for human-targeted video.
[0081] FIG. 8 illustrates another example of a video codec targeted for both machine and human consumption of video, according to an example embodiment. Video codec 800 may comprise a first encoder neural network 811, which may comprise a machine-targeted encoder neural network. The first encoder neural network 811 may operate as a generic feature extractor and be configured to output spatio-temporal machine-targeted features (M-features) based on input video data. The extracted spatio-temporal features may be quantized at a first quantization function 812. The video codec 800 may further comprise a first entropy encoder 813 configured to entropy encode the quantized M-features provided by the first quantization function 812. The output of the first entropy encoder 813 may comprise the quantized and entropy encoded spatio-temporal M-features (M-code). The video codec 800 may further comprise a second encoder neural network 814, which may comprise a human-targeted encoder neural network. The second encoder neural network 814 may be configured to output spatio-temporal human-targeted features (H-features) based on the input video data and optionally the quantized M-features provided by the first quantization function 812, as will be further discussed in relation to FIG. 9. If the quantized M-features are provided to the human-targeted encoder NN 814, an initial operation may comprise dequantizing the quantized M-features, applying a machine-targeted decoder NN (similar to 821), and then subtracting the dequantized M-features from the video data. Other suitable operations for reducing the amount of data by using the quantized M-features and the video data may be utilized. The extracted spatio-temporal features may be quantized at a second quantization function 815. The video codec 800 may further comprise a second entropy encoder 816 configured to entropy encode the quantized H-features provided by the second quantization function 815.
The output of the second entropy encoder 816 may comprise the quantized and entropy encoded spatio-temporal H-features (H-code). In another embodiment, the quantized M-features and the quantized H-features may be encoded by the same entropy encoder. The first and second encoder neural networks 811, 814, the first and second quantization functions 812, 815, and/or the first and second entropy encoders 813, 816, may form or be comprised in an encoder part of the video codec 800. Operations of the first and second entropy encoders 813, 816 may be performed in a single entropy encoder. Similarly, operations of the first and second entropy decoders 823, 826 may be implemented in a single entropy decoder.
[0082] The video codec 800 may further comprise a first entropy decoder 823 configured to entropy decode an M-code received from the first entropy encoder 813, or another data source. The video codec 800 may further comprise a first inverse quantization function 822 configured to de-quantize the entropy decoded M-code. The video codec 800 may further comprise a first decoder neural network 821. The first decoder neural network 821 may comprise a machine-targeted decoder neural network, which may be configured to decode the entropy decoded and de-quantized M-code. However, in some cases the video codec 800 may not comprise the first decoder neural network 821. The video codec 800 may further comprise a second entropy decoder 826 configured to entropy decode an H-code received from the second entropy encoder 816, or another data source. In another embodiment, the M-code and the H-code are both decoded by the same entropy decoder. The video codec 800 may further comprise a second inverse quantization function 825 configured to de-quantize the entropy decoded H-code. The video codec 800 may further comprise a second decoder neural network 824. The second decoder neural network 824 may comprise a human-targeted decoder neural network, which may be configured to decode the entropy decoded and de-quantized H-code. The second decoder neural network 824 may further receive as input the entropy decoded and de-quantized M-code from the first inverse quantization function 822. As discussed above, the decoder device 120 may already have some decoded information (e.g. the output of the inverse quantization 822), and therefore the data sent by the human-targeted encoder NN 814 of the encoder device 110 may be additional information.
Thus, if the human-targeted encoder 814 has performed an initial step comprising a subtraction between the dequantized M-features and the original video, the human-targeted decoder NN 824 may first decode the dequantized H-features, and then add the decoded H-features to the dequantized M-features. In another embodiment, the second decoder neural network 824 may further receive as input the output of the machine-targeted decoder neural network 821. The first and second decoder neural networks 821, 824, the first and second inverse quantization functions 822, 825, and/or the first and second entropy decoders 823, 826, may form or be comprised in a decoder part of the video codec 800. The encoder and decoder parts of the video codec 800 may be embodied as a single codec, or separately, for example at different devices or as separate software and/or hardware components.

[0083] The video codec 800 may further comprise or be configured to be coupled to one or more machines, for example one or more task neural networks (task-NN) 830. During a development stage, for example when training the video codec 800, the task-NNs 830 may be representative of the task-NNs which will be used at an inference stage, for example when the video codec 800, or parts thereof, are deployed and used for compressing and/or decompressing data. The task-NNs 830 may be pre-trained before the development stage. Furthermore, training data in a domain suitable to be input to the task-NNs 830 may be obtained for the development stage. The training data may not be annotated, for example, the training data may not contain ground-truth labels.
[0084] FIG. 9 illustrates an example of a human-targeted encoder neural network, according to an example embodiment. As discussed above, the second encoder neural network 814 may comprise a human-targeted encoder neural network. Human- targeted encoder neural network 914 is provided as an example of the second encoder neural network 814. The human-targeted neural network 914 may process received video data with initial layers 915 of the human-targeted neural network 914. Furthermore, the human-targeted neural network 914 may comprise an inverse quantization function 916 configured to de-quantize quantized M-features provided for example by the first quantization function 812. The human-targeted neural network 914 may further comprise a combiner 917, which may combine the de-quantized M-features with features extracted by the initial layers 915. The combiner 917 may for example concatenate the de-quantized M-features with features extracted by the initial layers 915. Alternatively, the combiner 917 may first apply a machine-targeted decoder neural network similar to 821 on the dequantized M-features, and then concatenate the decoded M-features with features extracted by the initial layers 915. The output of the combiner 917 may be provided to the final layers 918 of the human-targeted neural network 914. The output of the human-targeted neural network 914 may comprise the spatio-temporal H-features output from the final layers 918.
[0085] FIG. 10 illustrates an example of a method for teaching a student neural network for performing a task, according to an example embodiment. At a development stage, new version(s) of the one or more machines 130, 730, for example the task-NNs 830, may be obtained. These auxiliary models may be trained to replicate the behavior, for example the input-output mapping, of the machines 130, 730, which may be considered as primary models. A targeted difference between the auxiliary model(s) and the primary model is that the auxiliary model(s) may be able to provide gradients of their outputs with respect to their inputs which are more suitable for training the neural networks of the encoder device 110 and/or the decoder device 120. For example, the gradients may have a larger magnitude, since the effective depth of the auxiliary network may be lower than the effective depth of the primary model.
[0086] According to an example embodiment, an apparatus may obtain a primary model configured to perform a task. The apparatus may further obtain an auxiliary model. An effective depth of the auxiliary model may be lower than an effective depth of the primary model. The effective depth of the primary model may comprise or be determined based on a number of layers of the primary model. The effective depth of the auxiliary model may comprise or be determined based on a number of layers of the auxiliary model. Alternatively, the effective depth of the primary model may be determined based on the number of layers of the primary model and a number of skip connections of the primary model. The effective depth of the auxiliary model may be determined based on the number of layers of the auxiliary model and a number of skip connections of the auxiliary model.
[0087] An effective depth of the primary or auxiliary model may relate to the ability of the model to provide sufficient gradients for training purposes. For example, in case of the fully-connected neural network 300, the effective depth may be equal to the number of layers. However, the neural network 300 could have skip connections that skip some of the layers of the neural network 300, thus allowing forward and backward signals to avoid passing through some of the layers. For example, neural network 300 could have one or more skip connections directly from the input nodes to the second hidden layer nodes instead of the first hidden layer nodes. Therefore, the effective depth may alternatively be determined based on the number of layers and the number of skip connections. For example, the number of layers may be scaled with the ratio of the number of skip connections to the total number of connections. In general, the effective depth may reflect the number of steps needed when backpropagating gradients through a model, such as for example neural network 300.
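The scaling described above can be sketched as follows. The formula below is one possible reading of the text (scaling the layer count by the fraction of connections that are not skip connections); it is an illustrative assumption, not a normative definition.

```python
# Illustrative sketch of the "effective depth" heuristic: a model
# with more skip connections behaves, for gradient propagation,
# like a shallower model.
def effective_depth(num_layers, num_skip, num_total):
    """Scale the layer count by the share of non-skip connections.

    num_skip / num_total is the ratio of skip connections to the
    total number of connections mentioned in the text.
    """
    return num_layers * (1 - num_skip / num_total)

# A 50-layer network where 10 of 60 connections are skips behaves
# like a network shallower than 50 layers under this heuristic.
depth = effective_depth(50, 10, 60)
```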
[0088] An example of the primary model is a task neural network (task-NN) configured to perform a task. The task-NN may be used as a teacher task-NN (TNN) 1002 to train the auxiliary model, an example of which is provided by the student task-NN (SNN) 1004. However, the primary and auxiliary models may comprise other learnable models as well. The TNN 1002 may be pretrained. The SNN 1004 may be pretrained or it may be initialized by other means, for example based on a random process such as sampling from a probability distribution. Since the architectures of the models may be heterogeneous, the SNN 1004 may be engineered from scratch or be obtained using an architecture search given a specific criterion, such as for example memory and/or power consumption. A goal of the teacher-student training is to transfer knowledge or information from the primary model to the auxiliary model.
[0089] According to an example embodiment, the apparatus may train the auxiliary model for performing the task based on a first loss function comprising an output of the primary model and an output of the auxiliary model. For example, input data such as video data may be provided to TNN 1002, which may provide a corresponding output. In case of a classifier model, the TNN 1002 may output a 1-dimensional array with N bins, where N is the number of considered classes. The array may represent soft-labels, for example an approximation of a probability distribution over the considered classes. Thus, values of the bins may be non-zero and the sum of all bin values may be equal to one. Alternatively, the array may be represented as a one-hot vector, where only one bin value may be equal to one and the other bin values may be equal to zero. The one-hot representation may also be obtained from the soft-labels representation, for example based on applying a max operation on the soft-labels array. The same input data may be input to SNN 1004, which may provide a corresponding output.

[0090] At 1006, the apparatus may compute a first loss. The first loss may be computed based on the first loss function comprising the output of the TNN 1002 and the output of the SNN 1004, or the output of the SNN 1004 and a post-processed version of the output of the TNN 1002, for example a one-hot version of the probability distribution output by the TNN 1002.

[0091] At 1008, the apparatus may differentiate the first loss with respect to parameters, for example weights, of the SNN 1004. A parameter update may be computed based on the gradients obtained from the differentiation operation. Finally, the parameter update may be used for performing one training iteration on the SNN 1004.
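A minimal sketch of the first (distillation) loss for a classifier is given below: cross-entropy between the teacher's soft labels and the student's predicted distribution. The function names and the example values are illustrative assumptions; a deep-learning framework would normally provide these primitives.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution over N bins."""
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distillation_loss(teacher_probs, student_logits):
    """Cross-entropy H(teacher, student): low when the student's
    distribution matches the teacher's soft labels."""
    student_probs = softmax(student_logits)
    eps = 1e-12  # avoid log(0)
    return -sum(t * math.log(s + eps)
                for t, s in zip(teacher_probs, student_probs))

teacher = [0.7, 0.2, 0.1]          # soft labels, summing to one
student_logits = [2.0, 0.5, -1.0]  # raw student outputs
loss = distillation_loss(teacher, student_logits)
```

The same function covers the one-hot case mentioned above: passing a one-hot teacher vector reduces the loss to ordinary cross-entropy against a hard label.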
[0092] Training the SNN 1004 may be performed on training data available during the development stage. The training data may comprise data that was used for training the TNN 1002, or different data. The training data may be split into two or more subsets, such as for example a training subset, which may be used to update the SNN 1004, and a validation subset, which may be used to evaluate performance of the SNN 1004. Training the SNN 1004 may continue until detecting that at least one of the following conditions is satisfied: a predetermined number of training iterations is reached; a training loss does not decrease anymore by more than a predetermined value; a validation loss does not decrease anymore by more than a predetermined value; or other suitable stopping criteria are met.
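The stopping conditions above can be sketched as a small early-stopping helper. The class name and the parameters (`patience`, `min_delta`) are illustrative assumptions; they correspond to the "predetermined value" mentioned in the text.

```python
# Minimal early-stopping sketch: stop when a maximum iteration
# count is reached, or when the monitored loss has not improved
# by at least min_delta for `patience` consecutive checks.
class EarlyStopper:
    def __init__(self, max_iters, patience=3, min_delta=1e-4):
        self.max_iters = max_iters
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.stale = 0

    def should_stop(self, iteration, val_loss):
        if iteration >= self.max_iters:
            return True  # predetermined number of iterations reached
        if val_loss < self.best - self.min_delta:
            self.best = val_loss  # loss still improving
            self.stale = 0
        else:
            self.stale += 1  # no meaningful decrease this check
        return self.stale >= self.patience
```

The same helper may monitor either the training loss or the validation loss, matching the two loss-based criteria listed above.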
[0093] The teacher-student training described above provides one possible technique for obtaining the auxiliary model, for example a version of the TNN 1002 which provides better gradients for training models of an encoder and/or a decoder. However, other techniques may be used as well for obtaining the auxiliary model.
[0094] FIG. 11 illustrates an example of a method for training encoder and decoder neural networks based on a student neural network, according to an example embodiment. Once the auxiliary model, for example SNN 1004, has been trained, the auxiliary model may be used for developing an encoder neural network 1102 and/or a decoder neural network 1104. To this end, steps in the signal-path between the encoder neural network 1102, the decoder neural network 1104, and the SNN 1004 may be differentiable.
[0095] The encoder neural network 1102 may receive input data, such as for example video data. The encoder neural network 1102 may provide encoded video data to a decoder neural network 1104, which may provide the decoded video data to the SNN 1004. The SNN 1004 may perform the task and provide an output, for example classification of an object appearing in the video data.
[0096] At 1108, the apparatus may compute a second loss. The second loss may be computed based on a second loss function comprising an output of the auxiliary model, for example the SNN 1004. For example, a value of the second loss function may be determined based on comparing the output of the SNN 1004 to ground-truth data, such as for example correct classification labels associated with the input video data. Alternatively, ground-truth labels may be obtained as the output of either the TNN or the SNN when the input is the original video data, i.e., the video data input to encoder NN 1102. The apparatus may determine a plurality of gradients of the second loss function with respect to input of the auxiliary model. Training the encoder neural network 1102 and/or the decoder neural network 1104 may be further based on additional loss(es), such as for example compression loss(es). For example, training may be based on a combination, for example a linear combination, of one or more compression losses and the loss(es) determined based on output of the SNN 1004 and/or the TNN 1002. A compression loss may for example act on the encoded and/or quantized and/or entropy coded features, for example the M-code and/or the H-code.
[0097] At 1110, the apparatus may train the encoder neural network 1102 and/or the decoder neural network 1104 based on the plurality of gradients of the second loss function with respect to input of the auxiliary model. Gradients may for example be backpropagated through each layer of the SNN 1004 and provided to the decoder neural network 1104, where the backpropagation may proceed layer-wise towards the input of the decoder neural network 1104. Gradients output from the input layer of the decoder neural network may be further backpropagated through the encoder neural network 1102 until the first layer of the encoder neural network 1102. Since the SNN 1004 is able to provide higher-quality gradients than the TNN 1002 (i.e., gradients which are more suitable for training the encoder and decoder), the decoder neural network 1104 and the encoder neural network 1102 may be trained more effectively.

[0098] FIG. 12 illustrates an example of a method for training encoder and decoder neural networks based on a student neural network and a teacher neural network, according to an example embodiment. According to an example embodiment, the apparatus may determine the plurality of gradients of the second loss function with respect to the input of the auxiliary model. The apparatus may further determine a second plurality of gradients of a third loss function with respect to the input of the primary model. The apparatus may train the encoder neural network 1102 and/or the decoder neural network 1104 based on the (first) plurality of gradients and the second plurality of gradients. The third loss function may comprise the output of the primary model, for example the output of the TNN 1002. A value of the third loss function may be determined based on comparing the output of the TNN 1002 to ground-truth data, such as for example the correct classification labels associated with the input video data.
Alternatively, ground-truth labels may be obtained as the output of the TNN 1002 when the input is the original video data, i.e., the video data input to encoder NN 1102. As discussed above, the second loss may be obtained from the SNN 1004, and therefore, when performing differentiation of the second loss with respect to encoder and decoder parameters, gradients of the second loss function with respect to the encoder and decoder parameters may be obtained through the SNN 1004 (backward path). The third loss may be obtained from the TNN 1002, and therefore, when performing differentiation of the third loss with respect to the encoder and decoder parameters, the gradients of the third loss function with respect to the encoder and decoder parameters may be obtained through the TNN 1002 (backward path).

[0099] According to an example embodiment, the apparatus may determine the plurality of gradients with respect to the encoder and decoder parameters based on a weighted sum of the second loss function and the third loss function. For example, with reference to FIG. 12, both the teacher task-NN (TNN) 1002 and the student task-NN (SNN) 1004 may be used for training the encoder neural network 1102 and/or the decoder neural network 1104. At 1208, the apparatus may compute a third loss based on the third loss function, which may comprise the output of the TNN 1002. At 1108, the apparatus may compute the second loss based on the second loss function, which may comprise the output of the SNN 1004. The second loss and the third loss, obtained based on the outputs of the SNN 1004 and TNN 1002, respectively, may be weighted differently.
[00100] At 1212, the apparatus may determine weights for the two losses. The sum of the weights may be equal to one. The weights may be predetermined and/or fixed during training of the encoder neural network 1102 and/or the decoder neural network 1104. Alternatively, the weights may be changed during the training. Training with variable weights may comprise starting with a high weight for the SNN loss (second loss) and a low weight for the TNN loss (third loss). The weight of the SNN loss may be gradually decreased and the weight of the TNN loss may be gradually increased. Therefore, the apparatus may iteratively increase the weight of the third loss function and iteratively decrease the weight of the second loss function to train the encoder neural network 1102 and/or the decoder neural network 1104.

[00101] At 1214, the apparatus may compute a weighted sum of the second loss and the third loss based on the fixed or variable weights determined at 1212. Based on the weighted sum of the second and third losses, the apparatus may determine a plurality of gradients of the second loss with respect to input of the SNN 1004 and a plurality of gradients of the third loss with respect to input of the TNN 1002.
[00102] At 1110, the apparatus may train the encoder neural network 1102 and/or the decoder neural network 1104 based on the plurality of gradients of the second loss of the auxiliary model with respect to the input of the auxiliary model, similar to FIG. 11. This makes it possible to benefit from the stronger gradients provided by the SNN 1004 while also exploiting the better-performing TNN 1002 to train the encoder neural network 1102 and/or the decoder neural network 1104.
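The variable weighting described above can be sketched as a simple schedule where the SNN weight decays and the TNN weight grows, with the two always summing to one. The linear interpolation and the start/end values are illustrative assumptions, not part of the embodiment.

```python
# Illustrative variable-weight schedule for combining the second
# (SNN) loss and third (TNN) loss described in the text.
def loss_weights(step, total_steps, w_snn_start=0.9, w_snn_end=0.1):
    """Return (second-loss weight, third-loss weight) at `step`."""
    frac = min(step / total_steps, 1.0)
    w_snn = w_snn_start + (w_snn_end - w_snn_start) * frac
    return w_snn, 1.0 - w_snn  # weights sum to one by construction

def combined_loss(snn_loss, tnn_loss, step, total_steps):
    """Weighted sum of the two losses, as computed at block 1214."""
    w2, w3 = loss_weights(step, total_steps)
    return w2 * snn_loss + w3 * tnn_loss
```

Early in training the combined loss is dominated by the SNN term (stronger gradients); late in training it is dominated by the TNN term (better-performing model).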
[00103] According to an example embodiment, the apparatus may obtain a plurality of primary models. The plurality of primary models may be configured for same or different task(s) and they may comprise a plurality of task neural networks (task-NN). One or more of the task-NNs may be converted to models for which it may be possible to obtain better gradients for training the encoder neural network 1102 and/or the decoder neural network 1104.
[00104] For example, the apparatus may obtain a plurality of auxiliary models corresponding to a first subset of the plurality of primary models. The first subset of the plurality of primary models may be determined based on at least one of: a number of layers of at least one of the plurality of primary models; an effective depth of the at least one of the plurality of primary models; or a magnitude, for example an average magnitude, of gradients of an output of the at least one of the plurality of primary models with respect to an input of the at least one of the plurality of primary models. For example, primary model(s) having a high number of layers or a high effective depth, for example exceeding predetermined threshold(s), may be selected for the first subset and thereby be converted into auxiliary model(s). Alternatively, or additionally, the plurality of auxiliary models may be determined based on an average magnitude of the gradients of the output of the NN with respect to its input. For example, primary models that are determined not to provide sufficient gradients may be selected to be included in the first subset of primary models. The plurality of auxiliary models enables training of the encoder neural network 1102 and/or the decoder neural network 1104 by providing sufficient gradients. Selecting a subset of primary models for conversion to auxiliary model(s) simplifies the training process, since not all primary models need to be converted into auxiliary models having lower effective depth.
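The selection criteria above can be sketched as follows. The dictionary fields, thresholds, and model names are illustrative assumptions chosen for the example.

```python
# Hedged sketch of partitioning primary models: those that are too
# deep, or whose input gradients are too weak, go into the first
# subset (to be converted into auxiliary models); the rest form the
# second subset (used directly as primary models).
def select_for_conversion(models, depth_threshold=30, grad_threshold=1e-3):
    """models: list of dicts with 'name', 'effective_depth', and
    'avg_grad_magnitude' entries (field names are illustrative)."""
    first_subset, second_subset = [], []
    for m in models:
        too_deep = m["effective_depth"] > depth_threshold
        weak_grads = m["avg_grad_magnitude"] < grad_threshold
        target = first_subset if (too_deep or weak_grads) else second_subset
        target.append(m["name"])
    return first_subset, second_subset

models = [
    {"name": "detector", "effective_depth": 50, "avg_grad_magnitude": 1e-4},
    {"name": "classifier", "effective_depth": 12, "avg_grad_magnitude": 5e-2},
]
converted, direct = select_for_conversion(models)
```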
[00105] Furthermore, the apparatus may determine a third plurality of gradients with respect to inputs of the plurality of auxiliary models. The third plurality of gradients may comprise a plurality of gradients with respect to inputs of each of the plurality of auxiliary models. The apparatus may train the encoder neural network 1102 and/or the decoder neural network 1104 based on the third plurality of gradients with respect to inputs of the plurality of auxiliary models. For example, the encoder neural network 1102 and/or decoder neural network 1104 may be trained sequentially based on gradients with respect to input of one auxiliary network at a time. Alternatively, the gradients with respect to inputs of the auxiliary networks may be combined to train the encoder neural network 1102 and/or the decoder neural network 1104.
[00106] If there are multiple task-NNs to be considered during the development stage, one or more of these task-NNs may be used for training the encoder neural network 1102 and/or the decoder neural network 1104. Some of the models used during the training may be the primary task-NNs, whereas some of the models may be the converted (SNN or auxiliary) versions of the primary task-NNs, based on the selection criteria discussed above. Hence, in addition to the plurality of auxiliary models, the encoder neural network 1102 and/or the decoder neural network 1104 may be trained based on a second subset of the plurality of primary models. The second subset of the plurality of primary models may comprise primary models that are not converted to auxiliary models, e.g., primary models that are not selected for the first subset of primary models. The apparatus may train the encoder neural network 1102 and/or the decoder neural network 1104 based on the third plurality of gradients and a fourth plurality of gradients for the second subset of primary models. The fourth plurality of gradients may be determined based on a set of loss functions, each loss function comprising an output of one of the second subset of primary models. The second subset of primary models may comprise a set of primary models not selected for the first subset of primary models, for example primary models for which auxiliary model(s) have not been obtained. This enables the encoder neural network 1102 and/or the decoder neural network 1104 to be trained with sufficient gradients and with a more diverse set of task-NNs.
[00107] It is also possible to train the encoder neural network 1102 and/or the decoder neural network 1104 based on both the plurality of primary models and the plurality of auxiliary models. Hence, for some primary models, their corresponding auxiliary models may also be used during training. For example, the apparatus may train the encoder neural network 1102 and/or the decoder neural network 1104 based on the third plurality of gradients with respect to inputs of the plurality of auxiliary models and a fifth plurality of gradients with respect to inputs of the plurality of primary models. The fifth plurality of gradients may comprise a plurality of gradients with respect to inputs of each of the plurality of primary models. The fifth plurality of gradients may be determined based on a set of loss functions, each loss function comprising an output of one of the plurality of primary models. Hence, the fifth plurality of gradients may comprise the fourth plurality of gradients determined for the second subset of primary models. When training with multiple models, the training may be based on sequentially training the encoder neural network 1102 and/or the decoder neural network 1104 with each model, or on combining the gradients provided by each model to train the encoder neural network 1102 and/or the decoder neural network 1104.
[00108] According to an example embodiment, the apparatus may obtain a plurality of auxiliary models corresponding to a plurality of effective depths. A plurality of auxiliary models may be obtained for a single primary model or a plurality of auxiliary models may be obtained for each of the first subset of primary models. The apparatus may sequentially train the encoder neural network 1102 and/or the decoder neural network 1104 based on the plurality of auxiliary models with an increasing effective depth. For example, the apparatus may train multiple SNN versions, where each SNN may have a different number of layers. If a TNN has 50 layers, the apparatus may for example obtain five SNN versions with 5, 10, 20, 30, and 40 layers, respectively. During training of the encoder neural network 1102 and/or the decoder neural network 1104, the shallower version with 5 layers may be used initially, and, for example every M training iterations, or when the training loss achieves a predetermined value or flattens, a slightly deeper version of SNN is used. Hence, encoder neural network 1102 and/or the decoder neural network 1104 may be trained with an increasing effective depth, which enables faster training because the shallower versions of the SNN are able to provide stronger gradients and therefore the initial phase of the training is more effective.
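The depth curriculum above can be sketched as a simple iteration schedule. The depths (5 to 40 layers for a 50-layer TNN) follow the example in the text; the switching rule (every `switch_every` iterations) is one of the two criteria mentioned, and the function names are illustrative.

```python
# Illustrative depth-curriculum schedule: start with the shallowest
# SNN and switch to a deeper version every `switch_every` training
# iterations, clamping at the deepest available SNN.
def curriculum(snn_depths, total_iters, switch_every):
    """Yield (iteration, snn_depth) pairs with increasing depth."""
    for it in range(total_iters):
        idx = min(it // switch_every, len(snn_depths) - 1)
        yield it, snn_depths[idx]

# Five SNN versions of a 50-layer TNN, as in the text's example.
schedule = list(curriculum([5, 10, 20, 30, 40],
                           total_iters=10, switch_every=2))
```

In practice the switch could instead be triggered when the training loss reaches a predetermined value or flattens, as the text also allows.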
[00109] The plurality of auxiliary models may be for example obtained based on inserting a varying number of identity layers between layers of an auxiliary model. For example, an auxiliary model may comprise a first SNN with 5 layers that has already been trained. Then, a second SNN with 10 layers may be formed by re-using the layers of the first SNN with already trained parameters and by inserting one or more identity layers between the pretrained layers. For example, one identity layer may be inserted between a first pretrained layer of the first SNN and a second pretrained layer of the first SNN. An identity layer may comprise a layer which performs an identity mapping, that is, a layer for which the output is the same as its input. For convolutional filters, this may be implemented for example based on setting kernel parameters of each filter to zero except for the mid location parameter, which may be set to one. Initially, the second SNN is configured to replicate the behavior of the first SNN. During the first iterations of training of the second SNN, some of the weight-update values may be zeroed out in order to train the identity layers effectively. This may be beneficial because of the zeros initially included in the identity layers. The weight-update values to be zeroed out may be determined based on a random process.
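An identity layer for convolutional filters, as described, zeroes every kernel tap except the centre tap connecting each output channel to the same input channel. A minimal NumPy sketch under that description (the helper names and the naive convolution routine are illustrative assumptions):

```python
import numpy as np

def identity_conv_kernel(channels: int, kernel_size: int = 3) -> np.ndarray:
    """Conv kernel performing an identity mapping: all weights zero
    except the centre tap from each output channel to the matching
    input channel, which is set to one."""
    k = np.zeros((channels, channels, kernel_size, kernel_size))
    mid = kernel_size // 2
    for c in range(channels):
        k[c, c, mid, mid] = 1.0
    return k

def conv2d_same(x: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Naive 'same'-padded 2-D convolution; x has shape (C, H, W),
    kernel has shape (C_out, C_in, kH, kW) with odd kH, kW."""
    c_out, c_in, kh, kw = kernel.shape
    _, h, w = x.shape
    pad = kh // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    y = np.zeros((c_out, h, w))
    for o in range(c_out):
        for i in range(c_in):
            for dy in range(kh):
                for dx in range(kw):
                    y[o] += kernel[o, i, dy, dx] * xp[i, dy:dy + h, dx:dx + w]
    return y

x = np.random.randn(4, 8, 8)
k = identity_conv_kernel(channels=4)
assert np.allclose(conv2d_same(x, k), x)  # the layer replicates its input
```

Inserting such a kernel between two pretrained layers therefore leaves the network's function unchanged until training perturbs the zeros.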
[00110] According to an example embodiment, after the SNN(s) have been used for one or more training iterations for training the encoder neural network and/or decoder neural network, the TNNs may be adapted to the updated encoder and/or decoder neural networks. Fine-tuning may comprise an iterative and collaborative scheme where a TNN trains an SNN, the SNN fine-tunes the encoder NN and the decoder NN, for example the encoder neural network 1102, and the TNN is updated. The SNN, or in general any of the plurality of auxiliary networks, may be stored at the apparatus and retrieved upon a request for fine-tuning, for example fine-tuning of the encoder neural network 1102 and/or the decoder neural network 1104.
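The iterative, collaborative scheme reduces to a simple loop. In this sketch the `distill`, `finetune`, and `adapt` callables are hypothetical stand-ins for the three stages named in the text:

```python
def collaborative_finetuning(tnn, snn, encoder, distill, finetune, adapt,
                             rounds=3):
    """One possible reading of the iterative scheme: in each round
    the TNN trains the SNN, the SNN fine-tunes the encoder, and the
    TNN is then adapted to the updated encoder."""
    for _ in range(rounds):
        snn = distill(tnn, snn)           # TNN trains the SNN
        encoder = finetune(encoder, snn)  # SNN fine-tunes the encoder
        tnn = adapt(tnn, encoder)         # TNN adapts to the new encoder
    return tnn, snn, encoder
```

The number of rounds, like the switching criteria earlier, would in practice be governed by a training-loss criterion rather than a fixed count.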
[00111] Even though some example embodiments have been described using video data as an example of input data, it is appreciated that example embodiments may be applied to other types of data such as for example image data, audio data, text data, or features extracted by a preceding model, for example another neural network.
[00112] Therefore, according to an example embodiment, at least one of image data, video data, audio data, text data, or features extracted by a preceding model may be encoded with the encoder neural network, for example to compress the data. At least one of image data, video data, audio data, text data, or features extracted by a preceding model may be decoded with the decoder neural network, for example to decompress the data. Alternatively, or additionally, the encoder neural network 1102 and/or the decoder neural network 1104 may be stored or transmitted to another device for use in encoding and/or decoding of at least one of image data, video data, audio data, or text data. The features extracted by a preceding model may be extracted for example from image data, video data, audio data, or text data.
[00113] FIG. 13 illustrates an example of a method 1300 for training an encoder neural network and/or a decoder neural network, according to an example embodiment.
[00114] At 1301, the method may comprise obtaining a primary model configured to perform a task.
[00115] At 1302, the method may comprise obtaining an auxiliary model, wherein an effective depth of the auxiliary model is lower than an effective depth of the primary model.
[00116] At 1303, the method may comprise training the auxiliary model for performing the task based on a first loss function comprising an output of the primary model and an output of the auxiliary model.
[00117] At 1304, the method may comprise determining a plurality of gradients of a second loss function with respect to an input of the auxiliary model, wherein the second loss function comprises the output of the auxiliary model.
[00118] At 1305, the method may comprise training an encoder neural network and/or a decoder neural network based on the plurality of gradients.
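The five steps of method 1300 can be exercised end-to-end on a linear toy problem. In the hypothetical NumPy sketch below, plain matrices stand in for the primary model, the auxiliary model, and the encoder; distillation is a least-squares fit and the encoder update is a single gradient step. All names and the step size are illustrative assumptions, not from the source.

```python
import numpy as np

rng = np.random.default_rng(0)

W_primary = rng.standard_normal((3, 8))   # "deep" primary model (step 1301)
W_encoder = rng.standard_normal((8, 8))   # encoder NN to be trained

x = rng.standard_normal(8)
code = W_encoder @ x                      # encoder output fed to the models

# Steps 1302-1303: obtain a shallower auxiliary model and train it to
# mimic the primary model's outputs (first loss: primary vs auxiliary).
codes = rng.standard_normal((64, 8))
targets = codes @ W_primary.T
W_aux = np.linalg.lstsq(codes, targets, rcond=None)[0].T

# Step 1304: gradient of a second loss, 0.5*||W_aux @ code - label||^2,
# with respect to the auxiliary model's input (the code).
label = rng.standard_normal(3)
err = W_aux @ code - label
grad_wrt_input = W_aux.T @ err

# Step 1305: back-propagate through the encoder and take one step.
W_encoder -= 1e-3 * np.outer(grad_wrt_input, x)

new_err = W_aux @ (W_encoder @ x) - label
assert np.sum(new_err ** 2) < np.sum(err ** 2)   # task loss decreased
```

In the described system the gradient step of 1305 would of course be applied through the full chain of encoder (and decoder) layers rather than to a single matrix.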
[00119] Further features of the method directly result from the functionalities and parameters of the encoder device 110 and/or the decoder device 120, as described in the appended claims and throughout the specification, and are therefore not repeated here. It is noted that one or more steps of the method may be performed in a different order.
[00120] An apparatus, for example encoder device 110, decoder device 120, or a combination thereof, may be configured to perform or cause performance of any aspect of the method(s) described herein. Further, a computer program may comprise instructions for causing, when executed, an apparatus to perform any aspect of the method(s) described herein. Further, a computer program may be configured to, when executed, cause an apparatus at least to perform any aspect of the method(s) described herein. Further, a computer program product or a computer readable medium may comprise program instructions for causing an apparatus to perform any aspect of the method(s) described herein. Further, an apparatus may comprise means for performing any aspect of the method(s) described herein. According to an example embodiment, the means comprises the at least one processor, and the at least one memory including program code, the at least one memory and the program code configured to, with the at least one processor, cause the apparatus at least to perform any aspect of the method(s).
[00121] Any range or device value given herein may be extended or altered without losing the effect sought. Also, any embodiment may be combined with another embodiment unless explicitly disallowed.
[00122] Although the subject matter has been described in language specific to structural features and/or acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as examples of implementing the claims and other equivalent features and acts are intended to be within the scope of the claims.
[00123] It will be understood that the benefits and advantages described above may relate to one embodiment or may relate to several embodiments. The embodiments are not limited to those that solve any or all of the stated problems or those that have any or all of the stated benefits and advantages. It will further be understood that reference to 'an' item may refer to one or more of those items. Furthermore, references to 'at least one' item or 'one or more' items may refer to one or a plurality of those items.
[00124] The steps or operations of the methods described herein may be carried out in any suitable order, or simultaneously where appropriate. Additionally, individual blocks may be deleted from any of the methods without departing from the scope of the subject matter described herein. Aspects of any of the embodiments described above may be combined with aspects of any of the other embodiments described to form further embodiments without losing the effect sought.
[00125] The term 'comprising' is used herein to mean including the method, blocks, or elements identified, but that such blocks or elements do not comprise an exclusive list and a method or apparatus may contain additional blocks or elements.
[00126] As used in this application, the term 'circuitry' may refer to one or more or all of the following: (a) hardware-only circuit implementations (such as implementations in only analog and/or digital circuitry); (b) combinations of hardware circuits and software, such as (as applicable): (i) a combination of analog and/or digital hardware circuit(s) with software/firmware and (ii) any portions of hardware processor(s) with software (including digital signal processor(s)), software, and memory(ies) that work together to cause an apparatus, such as a mobile phone or server, to perform various functions; and (c) hardware circuit(s) and/or processor(s), such as a microprocessor(s) or a portion of a microprocessor(s), that requires software (e.g., firmware) for operation, but the software may not be present when it is not needed for operation. This definition of circuitry applies to all uses of this term in this application, including in any claims.
[00127] As a further example, as used in this application, the term circuitry also covers an implementation of merely a hardware circuit or processor (or multiple processors) or a portion of a hardware circuit or processor and its (or their) accompanying software and/or firmware. The term circuitry also covers, for example and if applicable to the particular claim element, a baseband integrated circuit or processor integrated circuit for a mobile device or a similar integrated circuit in a server, a cellular network device, or other computing or network device.
[00128] It will be understood that the above description is given by way of example only and that various modifications may be made by those skilled in the art. The above specification, examples and data provide a complete description of the structure and use of exemplary embodiments. Although various embodiments have been described above with a certain degree of particularity, or with reference to one or more individual embodiments, those skilled in the art could make numerous alterations to the disclosed embodiments without departing from the scope of this specification.

Claims

1. An apparatus, comprising: at least one processor; and at least one memory including computer program code; the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to: obtain a primary model configured to perform a task; obtain an auxiliary model, wherein an effective depth of the auxiliary model is lower than an effective depth of the primary model; train the auxiliary model for performing the task based on a first loss function comprising an output of the primary model and an output of the auxiliary model; determine a plurality of gradients of a second loss function with respect to an input of the auxiliary model, wherein the second loss function comprises the output of the auxiliary model; and train an encoder neural network and/or a decoder neural network based on the plurality of gradients.
2. The apparatus according to claim 1, wherein the effective depth of the primary model comprises a number of layers of the primary model and wherein the effective depth of the auxiliary model comprises a number of layers of the auxiliary model, or, wherein the effective depth of the primary model is determined based on the number of layers of the primary model and a number of skip connections of the primary model and wherein the effective depth of the auxiliary model is determined based on the number of layers of the auxiliary model and a number of skip connections of the auxiliary model.
3. The apparatus according to claim 1 or claim 2, wherein the at least one memory and the computer program code are further configured to cause the apparatus to: determine a second plurality of gradients with respect to an input of the primary model based on a third loss function comprising the output of the primary model; and train the encoder neural network and/or a decoder neural network based on the plurality of gradients and the second plurality of gradients.
4. The apparatus according to claim 3, wherein the at least one memory and the computer program code are further configured to cause the apparatus to: determine the plurality of gradients with respect to the input of the auxiliary model based on a weighted sum of the second loss function and the third loss function.
5. The apparatus according to claim 4, wherein the at least one memory and the computer program code are further configured to cause the apparatus to: iteratively increase a weight of the third loss function and iteratively decrease a weight of the second loss function to train the encoder neural network and/or the decoder neural network.
6. The apparatus according to any preceding claim, wherein the at least one memory and the computer program code are further configured to cause the apparatus to: obtain a plurality of primary models; obtain a plurality of auxiliary models corresponding to a first subset of the plurality of primary models; determine a third plurality of gradients with respect to inputs of the plurality of auxiliary models; and train the encoder neural network and/or the decoder neural network based on the third plurality of gradients with respect to inputs of the plurality of auxiliary models.
7. The apparatus according to claim 6, wherein the at least one memory and the computer program code are further configured to cause the apparatus to: select the first subset of the plurality of primary models based on at least one of: a number of layers of at least one of the plurality of primary models, an effective depth of the at least one of the plurality of primary models, or a magnitude of gradients of an output of the at least one of the plurality of primary models with respect to an input of the at least one of the plurality of primary models.
8. The apparatus according to claim 6 or 7, wherein the at least one memory and the computer program code are further configured to cause the apparatus to: train the encoder neural network and/or the decoder neural network based on the third plurality of gradients and a fourth plurality of gradients with respect to inputs of a second subset of primary models, wherein the second subset of primary models comprises a set of primary models not selected to the first subset of primary models.
9. The apparatus according to claim 6 or 7, wherein the at least one memory and the computer program code are further configured to cause the apparatus to: train the encoder neural network and/or the decoder neural network based on the third plurality of gradients and a fifth plurality of gradients with respect to inputs of the plurality of primary models.
10. The apparatus according to any of claims 1 to 5, wherein the at least one memory and the computer program code are further configured to cause the apparatus to: obtain a plurality of auxiliary models corresponding to a plurality of effective depths, wherein the plurality of auxiliary models is obtained based on inserting a varying number of identity layers between layers of the auxiliary model; sequentially train the encoder neural network and/or the decoder neural network based on the plurality of auxiliary models with an increasing effective depth.
11. The apparatus according to any preceding claim, wherein an input to the encoder neural network comprises video data, image data, audio data, text data, or features extracted by a preceding model.
12. The apparatus according to any of claims 1 to 10, wherein the at least one memory and the computer program code are further configured to cause the apparatus to: encode and/or decode at least one of image data, video data, audio data, text data, or features extracted by a preceding model with the encoder neural network and/or the decoder neural network; or store or transmit the encoder neural network and/or the decoder neural network for use in encoding and/or decoding of at least one of the image data, the video data, the audio data, the text data, or the features extracted by the preceding model.
13. A method, comprising: obtaining a primary model configured to perform a task; obtaining an auxiliary model, wherein an effective depth of the auxiliary model is lower than an effective depth of the primary model; training the auxiliary model for performing the task based on a first loss function comprising an output of the primary model and an output of the auxiliary model; determining a plurality of gradients of a second loss function with respect to an input of the auxiliary model, wherein the second loss function comprises the output of the auxiliary model; and training an encoder neural network and/or a decoder neural network based on the plurality of gradients.
14. A computer program comprising instructions for causing an apparatus to perform at least the following: obtaining a primary model configured to perform a task; obtaining an auxiliary model, wherein an effective depth of the auxiliary model is lower than an effective depth of the primary model; training the auxiliary model for performing the task based on a first loss function comprising an output of the primary model and an output of the auxiliary model; determining a plurality of gradients of a second loss function with respect to an input of the auxiliary model, wherein the second loss function comprises the output of the auxiliary model; and training an encoder neural network and/or a decoder neural network based on the plurality of gradients.
15. An apparatus, comprising: means for obtaining a primary model configured to perform a task; means for obtaining an auxiliary model, wherein an effective depth of the auxiliary model is lower than an effective depth of the primary model; means for training the auxiliary model for performing the task based on a first loss function comprising an output of the primary model and an output of the auxiliary model; means for determining a plurality of gradients of a second loss function with respect to an input of the auxiliary model, wherein the second loss function comprises the output of the auxiliary model; and means for training an encoder neural network and/or a decoder neural network based on the plurality of gradients.
PCT/FI2021/050236 2020-04-09 2021-03-31 Training a data coding system for use with machines WO2021205066A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
FI20205379 2020-04-09
FI20205379 2020-04-09

Publications (1)

Publication Number Publication Date
WO2021205066A1 true WO2021205066A1 (en) 2021-10-14

Family

ID=78023808

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/FI2021/050236 WO2021205066A1 (en) 2020-04-09 2021-03-31 Training a data coding system for use with machines

Country Status (1)

Country Link
WO (1) WO2021205066A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180174047A1 (en) * 2016-12-15 2018-06-21 WaveOne Inc. Data compression for machine learning tasks
WO2019222401A2 (en) * 2018-05-17 2019-11-21 Magic Leap, Inc. Gradient adversarial training of neural networks
EP3633990A1 (en) * 2018-10-02 2020-04-08 Nokia Technologies Oy An apparatus, a method and a computer program for running a neural network
CN110942100A (en) * 2019-11-29 2020-03-31 山东大学 Working method of spatial modulation system based on deep denoising neural network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"Video Coding for Machine: Use Cases", MPEG DOCUMENT N18662, 12 July 2019 (2019-07-12), XP030221910, Retrieved from the Internet <URL:https://dms.mpeg.expert/doc_end_user/documents/127_Gothenburg/wg11/w18662.zip> [retrieved on 20210714] *
HE, K. ET AL.: "Identity Mappings in Deep Residual Networks. In", ECCV 2016: COMPUTER VISION, 17 September 2016 (2016-09-17), XP055479634, Retrieved from the Internet <URL:https://link.springer.com/chapter/10.1007/978-3-319-46493-0_38> [retrieved on 20210714], DOI: 10.1007/978-3-319-46493-0_38 *
SHARMA, R. ET AL.: "Are Existing Knowledge Transfer Techniques Effective For Deep Learning with Edge Devices?", 2018 IEEE INTERNATIONAL CONFERENCE ON EDGE COMPUTING (EDGE, 27 September 2007 (2007-09-27), XP033408084, Retrieved from the Internet <URL:https://ieeexplore.ieee.org/document/8473375> [retrieved on 20210714], DOI: 10.1109/EDGE.2018.00013 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP4181511A3 (en) * 2021-11-15 2023-07-26 Nokia Technologies Oy Decoder-side fine-tuning of neural networks for video coding for machines
WO2023126568A1 (en) * 2021-12-27 2023-07-06 Nokia Technologies Oy A method, an apparatus and a computer program product for video encoding and video decoding
CN114386067A (en) * 2022-01-06 2022-04-22 承德石油高等专科学校 Equipment production data safe transmission method and system based on artificial intelligence
CN116091773A (en) * 2023-02-02 2023-05-09 北京百度网讯科技有限公司 Training method of image segmentation model, image segmentation method and device
CN116091773B (en) * 2023-02-02 2024-04-05 北京百度网讯科技有限公司 Training method of image segmentation model, image segmentation method and device

Similar Documents

Publication Publication Date Title
US11729406B2 (en) Video compression using deep generative models
EP3934254A1 (en) Encoding and decoding of extracted features for use with machines
US11388416B2 (en) Video compression using deep generative models
US20230164336A1 (en) Training a Data Coding System Comprising a Feature Extractor Neural Network
US11405626B2 (en) Video compression using recurrent-based machine learning systems
US20220279183A1 (en) Image compression and decoding, video compression and decoding: methods and systems
US11924445B2 (en) Instance-adaptive image and video compression using machine learning systems
WO2021205066A1 (en) Training a data coding system for use with machines
US20220385907A1 (en) Implicit image and video compression using machine learning systems
US20210168395A1 (en) Video processing
US11516478B2 (en) Method and apparatus for coding machine vision data using prediction
EP3849180A1 (en) Encoding or decoding data for dynamic task switching
Fischer et al. Boosting neural image compression for machines using latent space masking
WO2021001594A1 (en) Neural network for variable bit rate compression
Jeong et al. An overhead-free region-based JPEG framework for task-driven image compression
US11936866B2 (en) Method and data processing system for lossy image or video encoding, transmission and decoding
US20240121398A1 (en) Diffusion-based data compression
US20240015318A1 (en) Video coding using optical flow and residual predictors
CN117716687A (en) Implicit image and video compression using machine learning system
WO2022229495A1 (en) A method, an apparatus and a computer program product for video encoding and video decoding
WO2023222313A1 (en) A method, an apparatus and a computer program product for machine learning
KR20240054975A (en) Instance-adaptive image and video compression in network parameter subspace using machine learning systems
WO2023031503A1 (en) A method, an apparatus and a computer program product for video encoding and video decoding

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21785056

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21785056

Country of ref document: EP

Kind code of ref document: A1