WO2023085968A1 - Device and method for neural network pruning - Google Patents

Device and method for neural network pruning

Info

Publication number
WO2023085968A1
WO2023085968A1 (PCT/RU2021/000505)
Authority
WO
WIPO (PCT)
Prior art keywords
denotes
neural network
parameter representation
continuous parameter
data processing
Prior art date
Application number
PCT/RU2021/000505
Other languages
French (fr)
Inventor
Kirill Igorevich SOLODSKIKH
Azim Edgarovich KURBANOV
Ruslan Daurenovich AYDARKHANOV
Dehua SONG
Alexander Nikolaevich Filippov
Youliang Yan
Original Assignee
Huawei Technologies Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to PCT/RU2021/000505
Publication of WO2023085968A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G06N 3/082: Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 7/00: Computing arrangements based on specific mathematical models
    • G06N 7/01: Probabilistic graphical models, e.g. probabilistic networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • Algebra (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure relates to methods and devices for performing neural network pruning in the field of machine learning. In particular, each layer of a trained neural network that comprises a first set of discrete weights is represented by a continuous parameter representation. The continuous parameter representation is based on a linear combination of Riemann integrable functions. Then, said continuous parameter representation may be discretized to obtain a second set of discrete weights of a desired size for that layer. In this way, neural network pruning is performed and the size of the neural network may be changed at the inference phase. Moreover, there is no need to fine-tune the second set of discrete weights because of the continuous parameter representation.

Description

DEVICE AND METHOD FOR NEURAL NETWORK PRUNING
TECHNICAL FIELD
The present disclosure relates to devices and methods in the fields of computer science, in particular, artificial intelligence (AI). The disclosure relates especially to devices and methods for neural network pruning.
BACKGROUND
In the field of artificial intelligence, such as machine learning, pruning a neural network (or neural network pruning) refers to a process of deleting unnecessary or least important parameters, such as weights and neurons, from a trained neural network. Neural network pruning is widely used for neural network compression to achieve a lightweight trained neural network model of any desired size. The lightweight trained neural network may have the benefits of a reduced size and an accelerated execution speed during an inference phase. In this disclosure, neural network pruning may be simply referred to as pruning. A neural network may be simply referred to as a model.
SUMMARY
One issue with conventional neural network pruning is that retraining (or fine-tuning) is often required in order to reduce the accuracy drop. Sometimes, to ensure model performance similar to that of the original model, several iterations of pruning and retraining may be required.
A further issue with conventional neural network pruning is that changing the size of the neural network during the inference phase is not allowed after pruning is done.
In view of the above, an objective of this disclosure is to enable neural network pruning to an arbitrary size without retraining. Another objective is to allow the size of a neural network to change flexibly during the inference phase.
These and other objectives are achieved by the solutions of the present disclosure as described in the independent claims. Advantageous implementations are further defined in the dependent claims. An idea described in the present disclosure is to use a continuous representation, such as integral operations, to replace a discrete transformation denoted by weights of neural network layers. Then, this continuous representation of the neural network layer(s) may be discretized to an arbitrary size at any time, also during the inference phase.
A first aspect of the present disclosure provides a data processing apparatus configured to obtain a trained neural network comprising a plurality of neural network layers, in which each neural network layer comprises a first set of discrete weights. For each neural network layer, the data processing apparatus is configured to determine a continuous parameter representation for the first set of discrete weights based on a linear combination of Riemann integrable functions. Then, for each neural network layer, the data processing apparatus is configured to discretize the continuous parameter representation to obtain a second set of discrete weights, and generate a layer output by processing a layer input based on the second set of discrete weights.
As a result, each neural network layer of the trained neural network comprises the second set of discrete weights. For example, for each neural network layer, the size of the second set of discrete weights may be less than that of the first set of discrete weights. That is, the size of the neural network may be reduced and the trained neural network may be pruned.
Moreover, the first set of discrete weights may be represented by multidimensional continuous surfaces based on the linear combination of the Riemann integrable functions. This may allow generating a neural network of any desired size without fine-tuning or retraining. This is because trained parameters may have an inertial effect on the neighborhood of each parameter to some extent. Unlike conventional neural network pruning, where the information of the pruned (or deleted) elements is simply lost, a discretization based on the continuous parameter representation may include this inertial effect on the neighborhood for each weight.
A further advantage of the present disclosure is that, instead of conventionally storing the first set of discrete weights for each neural network layer, the data processing apparatus may ultimately only need to store the continuous parameter representation for each neural network layer in order to store the trained neural network. In this way, storage space of the data processing apparatus may be saved.
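How the continuous parameter representation is obtained from a trained layer's weights is described only at this general level. The following is a minimal sketch under simplifying assumptions: a 1-D toy layer, a fixed Gaussian basis, and a least-squares fit of only the combination coefficients. The names gaussian_basis, w_first and w_second are illustrative and not taken from the disclosure.

```python
import numpy as np

def gaussian_basis(x, mu, sigma):
    """Evaluate 1-D Gaussian basis functions at points x; columns index the basis."""
    d = x[:, None] - mu[None, :]
    return np.exp(-0.5 * (d / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)

# First set of discrete weights of one (toy, 1-D) layer, assumed to come from training,
# together with their positions on the unit interval.
w_first = np.array([0.1, 0.4, 0.9, 0.7, 0.3, 0.05, 0.2, 0.6])
x_first = (np.arange(len(w_first)) + 0.5) / len(w_first)

# Fixed Gaussian locations on a uniform grid; only the combination weights are fitted here.
mu = np.linspace(0.0, 1.0, 6)
sigma = 0.15
lam, *_ = np.linalg.lstsq(gaussian_basis(x_first, mu, sigma), w_first, rcond=None)

# Discretize the fitted continuous representation to a smaller, second set of weights.
x_second = (np.arange(5) + 0.5) / 5
w_second = gaussian_basis(x_second, mu, sigma) @ lam
print(w_second)
```

In the disclosure the locations and widths are themselves trainable; fixing them here only keeps the example short.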
Optionally, values of the first set of discrete weights are in the range of [0, 1]. Optionally, a Riemann integrable function, in general, may refer to a function whose lower and upper integrals are equal.
Optionally, the data processing apparatus may be configured to, only during the inference phase of the trained neural network, perform the discretization of the continuous parameter representation and the generation of the layer output. Alternatively, the data processing apparatus may be configured to perform the generation of the layer output only during the inference phase.
In a possible implementation form of the first aspect, for discretizing the continuous parameter representation, the data processing apparatus may be configured to:
- obtain a first discretization by applying a meshgrid operation to the continuous parameter representation within [0, 1]^n, wherein n denotes a dimension of the continuous parameter representation, and
- adjust the first discretization according to a numerical integration method to obtain the second set of discrete weights.
Optionally, applying the meshgrid operation may refer to using a meshgrid function to create a rectangular grid out of the continuous parameter representation. In the rectangular grid, the data processing apparatus may be configured to use the numerical integration method to compute a quadrature over a partition of the rectangular grid and obtain the result of the numerical integration method as a discrete weight of the second set of discrete weights.
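As an illustration of this meshgrid-and-quadrature discretization, the sketch below assumes a 2-D continuous representation on [0, 1]^2, uniform partitions and a midpoint rule; f_w and discretize are illustrative stand-ins rather than the disclosed implementation.

```python
import numpy as np

def f_w(x, y):
    # Toy stand-in for the continuous parameter representation on [0, 1]^2.
    return np.sin(2.0 * np.pi * x) * np.cos(2.0 * np.pi * y)

def discretize(f, shape):
    """Meshgrid the unit square into `shape` uniform cells and return one
    midpoint-rule quadrature per cell as a discrete weight."""
    rows, cols = shape
    h1, h2 = 1.0 / rows, 1.0 / cols
    xs = (np.arange(rows) + 0.5) * h1            # cell midpoints along dimension 1
    ys = (np.arange(cols) + 0.5) * h2            # cell midpoints along dimension 2
    gx, gy = np.meshgrid(xs, ys, indexing="ij")  # first discretization (the meshgrid)
    return f(gx, gy) * h1 * h2                   # adjusted by the quadrature rule

W_small = discretize(f_w, (4, 3))     # one possible second set of discrete weights
W_large = discretize(f_w, (16, 12))   # a different size from the same representation
print(W_small.shape, W_large.shape)
```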
In a possible implementation form of the first aspect, for discretizing the continuous parameter representation, the data processing apparatus may be configured to apply uniform partitions in each dimension of the continuous parameter representation.
It is noted that any suitable kind of partition may be used. By using the uniform partitions, computational complexity may be further reduced.
In a possible implementation form of the first aspect, the data processing apparatus may be further configured to adapt a size of the second set of discrete weights based on computational complexity of an inference phase of the trained neural network. Optionally, the size of the second set of discrete weights may be understood as the size of each layer.
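The disclosure leaves open how the target size follows from the complexity constraint. One possible heuristic, shown purely as an assumption, bounds the cost of a dense layer (roughly 2 · in · out multiply-accumulates per input) by a FLOPs budget and picks the largest output size that fits.

```python
def choose_layer_size(flops_budget: int, in_features: int, max_out_features: int) -> int:
    """Largest output size whose dense-layer cost (~2 * in * out FLOPs per input)
    stays within the budget; the continuous representation would then be
    discretized to exactly this size."""
    out_features = flops_budget // (2 * in_features)
    return max(1, min(max_out_features, out_features))

print(choose_layer_size(flops_budget=100_000, in_features=256, max_out_features=512))  # 195
```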
In a possible implementation form of the first aspect, the data processing apparatus may be further configured to:
- determine a further continuous parameter representation for a set of discrete inputs based on the linear combination of the Riemann integrable functions; and
- perform a numerical integration based on the continuous parameter representation and the further continuous parameter representation, to obtain, as a result, the layer output.
Optionally, the data processing apparatus may be configured to execute these two steps during the inference phase. In this way, a non-linear transformation performed by the neural network may be turned into a numerical integration. Thus, the size of the neural network and/or its computational complexity may be adapted on-demand during the inference phase.
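A minimal sketch of this idea for a single fully-connected layer, assuming midpoint quadrature on [0, 1] and toy stand-ins f_w and f_x for the two continuous representations; the layer output is obtained as a weighted sum approximating the integral of their product.

```python
import numpy as np

def f_w(s, t):
    # Toy continuous weight surface on [0, 1]^2 (input coordinate s, output coordinate t).
    return np.exp(-((s - t) ** 2) / 0.1)

def f_x(s):
    # Toy continuous representation of the layer input.
    return np.sin(np.pi * s)

def layer_output(f_w, f_x, n_in, n_out):
    """Approximate y(t_j) = integral of F_w(s, t_j) * F_x(s) ds with a uniform
    partition of [0, 1] and midpoint quadrature weights q = 1 / n_in."""
    s = (np.arange(n_in) + 0.5) / n_in       # integration points along the input axis
    t = (np.arange(n_out) + 0.5) / n_out     # output sampling points
    W = f_w(s[:, None], t[None, :])          # (n_in, n_out) evaluations of F_w
    return (1.0 / n_in) * (f_x(s) @ W)       # weighted sum approximates the integral

print(layer_output(f_w, f_x, n_in=64, n_out=8))    # layer size chosen on demand
print(layer_output(f_w, f_x, n_in=256, n_out=8))   # finer partition, same layer
```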
In a possible implementation form of the first aspect, the Riemann integrable functions may be based on wavelet functions.
It is noted that the wavelet functions may also simply refer to wavelets in the field of mathematics. Examples of the wavelet functions include but are not limited to: a Gaussian function, Morlet wavelet, and Ricker wavelet. Preferably, the Gaussian function may be used.
In a possible implementation form of the first aspect, the continuous parameter representation may be as follows:
F_W(\theta, x) = \sum_{i} \frac{\lambda_i}{(2\pi)^{k/2}\sqrt{\det(\sigma_i)}} \exp\!\left(-\tfrac{1}{2}(x-\mu_i)^{\top}\sigma_i^{-1}(x-\mu_i)\right)

wherein Fw() denotes the continuous parameter representation, θ denotes a vector of parameters including μi, σi and λi, det() denotes a determinant operation, μi denotes a location parameter, σi denotes a diagonal positive definite matrix, λi denotes weights of the linear combination, k denotes the number of dimensions, i denotes the number of Gaussian functions, and i, k are integers.

In a possible implementation form of the first aspect, the plurality of neural network layers may comprise at least one 2-dimensional, 2D, convolutional layer, and for discretizing the continuous parameter representation to obtain a second set of discrete weights, the data processing apparatus may be configured to perform:
W_{i,j,k,l} = F_W\!\left(\theta, (i \cdot h_1,\ j \cdot h_2,\ k \cdot h_3,\ l \cdot h_4)\right)

for each 2D convolutional layer, wherein W denotes the second set of discrete weights, i, j, k, l denote the four dimensions of the weights of the 2D convolutional layer, and h denotes a step along each dimension.
In a possible implementation form of the first aspect, the plurality of neural network layers may comprise at least one fully-connected layer, and for discretizing the continuous parameter representation to obtain a second set of discrete weights, the data processing apparatus may be configured to perform:
W_{i,j} = F_W\!\left(\theta, (i \cdot h_1,\ j \cdot h_2)\right)
for each fully-connected layer, wherein W denotes the second set of discrete weights, i,j denote two dimensions of the weights of the fully-connected layer, and h denotes a step along each dimension.
A second aspect of the present disclosure provides a data processing method. The method comprises the following steps:
- obtaining a trained neural network comprising a plurality of neural network layers, wherein each neural network layer comprises a first set of discrete weights,
- for each neural network layer, determining a continuous parameter representation for the first set of discrete weights based on a linear combination of Riemann integrable functions;
- discretizing the continuous parameter representation to obtain a second set of discrete weights; and
- generating a layer output by processing a layer input based on the second set of discrete weights. Optionally, the method may be performed by a single apparatus. Alternatively, the steps of the method may be performed by a plurality of distributed apparatus. For example, the method may be performed by the apparatus of the first aspect.
In a possible implementation form of the second aspect, the step of discretizing the continuous parameter representation may comprise:
- obtaining a first discretization by applying a meshgrid operation to the continuous parameter representation within [0, 1]^n, wherein n denotes a dimension of the continuous parameter representation, and
- adjusting the first discretization according to a numerical integration method to obtain the second set of discrete weights.
In a possible implementation form of the second aspect, the step of discretizing the continuous parameter representation may comprise applying uniform partitions in each dimension of the continuous parameter representation.
In a possible implementation form of the second aspect, the method may further comprise adapting a size of the second set of discrete weights based on a computational complexity of an inference phase of the trained neural network.
In a possible implementation form of the second aspect, the method may further comprise:
- determining a further continuous parameter representation for a set of discrete inputs based on the linear combination of the Riemann integrable functions; and
- performing a numerical integration based on the continuous parameter representation and the further continuous parameter representation to obtain as a result the layer output.
In a possible implementation form of the second aspect, the Riemann integrable functions may be based on wavelet functions. In a possible implementation form of the second aspect, the continuous parameter representation may be:
F_W(\theta, x) = \sum_{i} \frac{\lambda_i}{(2\pi)^{k/2}\sqrt{\det(\sigma_i)}} \exp\!\left(-\tfrac{1}{2}(x-\mu_i)^{\top}\sigma_i^{-1}(x-\mu_i)\right)
wherein Fw() denotes the continuous parameter representation, θ denotes a vector of parameters including μi, σi and λi, det() denotes a determinant operation, μi denotes a location parameter, σi denotes a diagonal positive definite matrix, λi denotes weights of the linear combination, k denotes the number of dimensions, i denotes the number of Gaussian functions, and i, k are integers.
In a possible implementation form of the second aspect, the plurality of neural network layers may comprise at least one 2-dimensional, 2D, convolutional layer, and the discretizing the continuous parameter representation to obtain a second set of discrete weights may comprise performing:
W_{i,j,k,l} = F_W\!\left(\theta, (i \cdot h_1,\ j \cdot h_2,\ k \cdot h_3,\ l \cdot h_4)\right)
for each 2D convolutional layer, wherein W denotes the second set of discrete weights, i, j, k, l denote the four dimensions of the weights of the 2D convolutional layer, and h denotes a step along each dimension.
In a possible implementation form of the second aspect, the plurality of neural network layers may comprise at least one fully-connected layer, and the step of discretizing the continuous parameter representation to obtain a second set of discrete weights may comprise performing:
W_{i,j} = F_W\!\left(\theta, (i \cdot h_1,\ j \cdot h_2)\right)
for each fully-connected layer, wherein W denotes the second set of discrete weights, i, j denote the two dimensions of the weights of the fully-connected layer, and h denotes a step along each dimension. A third aspect of the present disclosure provides a computer program or program product comprising a program code for performing the method according to the second aspect or any implementation form thereof, when executed on a computer.
A fourth aspect of the present disclosure provides a computer-readable medium comprising instructions which, when executed by a computer, cause the computer to carry out the method according to any one of the second aspect or any implementation form thereof.
A fifth aspect of the present disclosure provides a chipset comprising instructions which, when executed by the chipset, cause the chipset to carry out the method according to any one of the second aspect or any implementation form thereof.
It has to be noted that all apparatus, devices, elements, units, and means described in the present application could be implemented in software or hardware elements or any kind of combination thereof. All steps which are performed by the various entities described in the present application as well as the functionalities described to be performed by the various entities are intended to mean that the respective entity is adapted to or configured to perform the respective steps and functionalities. Even if, in the following description of specific embodiments, a specific functionality or step to be performed by external entities is not reflected in the description of a specific detailed element of that entity, which performs that specific step or functionality, it should be clear for a skilled person that these methods and functionalities can be implemented in respective software or hardware elements, or any kind of combination thereof.
BRIEF DESCRIPTION OF DRAWINGS
The above-described aspects and implementation forms will be explained in the following description of specific embodiments in relation to the enclosed drawings, in which:
FIG. 1 shows an example of neural network pruning according to the present disclosure;
FIG. 2 shows an illustration of neural network pruning according to the present disclosure;
FIG. 3 shows a flow-diagram of a method for neural network pruning according to the present disclosure;
FIG. 4 shows an illustrative result of neural network pruning; FIG. 5 shows an application scenario of the present disclosure; and
FIG. 6 shows another application scenario of the present disclosure.
DETAILED DESCRIPTION OF EMBODIMENTS
In FIGs. 1-6, corresponding elements are labelled with the same reference signs, may share the same features and may function likewise. Moreover, it is noted that the number of elements, graphs of functions depicted and values depicted in FIGs. 1-6 are for illustration purposes only and shall not be interpreted as limitations to embodiments of the present disclosure.
FIG. 1 shows an example of neural network pruning according to the present disclosure.
For performing neural network pruning, a data processing apparatus 100 is firstly configured to obtain a trained neural network. The trained neural network comprises a plurality of neural network layers, which may comprise an input layer, at least one hidden layer, and an output layer. In the present disclosure, the neural network layer may be simply referred to as a layer. Neural network layers are subsequently connected by establishing connections between neurons in neighboring layers. The input layer may be configured to receive an input. Each of the output layer and the at least one hidden layer may be configured to receive a layer output from its previous layer and apply a non-linear transformation based on its weights to generate its layer output. The layer output of the output layer may refer to an output of the neural network.
Each layer may comprise weights and optional biases. The trained neural network may refer to a neural network that has been trained based on a training set for a particular purpose, such as image processing. Training a neural network may be referred to as a training phase, while applying the trained neural network may be referred to as an inference phase. The first set of discrete weights may refer to parameters fine-tuned during the training phase.
The first set of discrete weights 111 shown in FIG. 1 is for illustration purposes, and may comprise a set of discrete values of weights associated with neurons of a layer. The trained neural network comprises multiple layers, wherein each layer may comprise a set of discrete weights similar to the first set of discrete weights 111 shown in FIG. 1. In the following, aspects referring to the first set of discrete weights 111 shall apply likewise to any other layer of the trained neural network. For each neural network layer, the data processing apparatus is configured to determine a continuous parameter representation 131 (labeled as Fw) for the first set of discrete weights. The continuous parameter representation 131 is based on a linear combination of a plurality of Riemann integrable functions 121, 122, 123. Optionally, the continuous parameter representation 131 may be based on any reasonable number of Riemann integrable functions. The data processing apparatus may be further configured to determine an upper limit of the number of Riemann integrable functions that can be used based on its computational capability.
Without prejudice to the commonly known meaning in the field of mathematics, a function is Riemann integrable under the following condition: let f(x_1, ..., x_n) be a multivariate function defined on the n-dimensional unit cube Ω = [0, 1]^n. The function f() is Riemann integrable if the following limit exists:

\int_{\Omega} f(x_1, \ldots, x_n)\, dx = \lim_{\Delta \to 0} \sum_{i} f(\xi_i)\, \Delta_i

where Δ_i denotes the volume of partition i, Δ denotes the maximum volume over all partitions, and ξ_i denotes a point inside partition i. The same notations apply in the following where the same symbols are used, unless otherwise specified.

If such a limit exists, then it is called the Riemann integral of the function f(x_1, ..., x_n). This definition may be intuitively interpreted as the volume under the surface defined by the function f. A finer partition of the cube Ω with smaller Δ may ensure a more precise integral estimation.
Optionally, to calculate a numerical value of a definite integral, different numerical quadratures may be used. Numerical integration methods can generally be described as combining evaluations of the integrand to get an approximation to the integral, which may be written as follows:

\int_{\Omega} f(x)\, dx \approx \sum_{i} q_i\, f(x_i)

where x_i denote the integration points and q_i the corresponding weights.
The integrand may be evaluated at a finite set of points called integration points and a weighted sum of these values is used to approximate the integral. The integration points and weights depend on the specific method used and the accuracy required from the approximation. The evaluation of multiple integrals can be reduced to iterated integral by iteratively applying such quadratures.
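For concreteness, a small sketch of two such quadratures over [0, 1] (midpoint and trapezoidal rules); the integrand is a toy example and the helper name quadrature is not from the disclosure.

```python
import numpy as np

def quadrature(f, points, weights):
    """Approximate an integral as a weighted sum of integrand evaluations
    at the integration points: sum_i q_i * f(x_i)."""
    return np.sum(weights * f(points))

f = lambda x: np.exp(-x ** 2)
n = 100

# Midpoint rule: uniform partition, equal weights 1/n.
x_mid = (np.arange(n) + 0.5) / n
print(quadrature(f, x_mid, np.full(n, 1.0 / n)))

# Trapezoidal rule: partition end points, halved weights at the two boundaries.
x_trap = np.linspace(0.0, 1.0, n + 1)
w_trap = np.full(n + 1, 1.0 / n)
w_trap[[0, -1]] = 0.5 / n
print(quadrature(f, x_trap, w_trap))
```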
A wavelet function is Riemann integrable. Therefore, in some embodiments of the present disclosure, the Riemann integrable functions may be based on wavelet functions.
In a preferred embodiment, the continuous parameter representation may be based on a linear combination of Gaussian functions as follows:
F_W(\theta, x) = \sum_{i} \frac{\lambda_i}{(2\pi)^{k/2}\sqrt{\det(\sigma_i)}} \exp\!\left(-\tfrac{1}{2}(x-\mu_i)^{\top}\sigma_i^{-1}(x-\mu_i)\right)    (1)

wherein Fw() denotes the continuous parameter representation and may define the parameter surface of the weights, θ denotes a vector of parameters including μi, σi and λi, det() denotes a determinant operation, μi denotes a location parameter, σi denotes a diagonal positive definite matrix, λi denotes weights of the linear combination, k denotes the number of dimensions, i denotes the number of Gaussian functions, and i, k are integers.
Optionally, μi, σi, and λi are trainable parameters and may be determined in a learnable way.
For training the trainable parameters, they may be randomly initialized. Alternatively, the initialization of the location parameters μi may be based on a uniform grid. The number of functions along each axis is defined by the shape of the discretization. The univariate Gaussian functions are mostly concentrated in the segment [μ - 3σ, μ + 3σ], so that neighboring Gaussian functions may have high support at the intersections to ensure the necessary gradient behavior.
An advantage of using Gaussian functions is that Gaussian functions are mostly localized in a bounded area, which makes it possible to concentrate the training of the trainable parameters within the cube Ω. Diagonal covariance matrices may lead to a fast evaluation while allowing complex surfaces to be trained with a small number of parameters.
When other Riemann integrable functions are used, FW(θ, x) is not limited to the form of equation (1).
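A minimal sketch of equation (1), assuming the standard multivariate Gaussian density with diagonal covariance; the class name GaussianSurface, the random initialisation and the fixed widths are illustrative assumptions and not the disclosed training procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

class GaussianSurface:
    """Continuous parameter representation F_w(theta, x) as a linear combination
    of k-dimensional Gaussian functions with diagonal covariances (equation (1));
    mu, sigma (diagonal entries) and lam play the roles of the trainable parameters."""

    def __init__(self, n_gaussians, k):
        self.mu = rng.uniform(0.0, 1.0, size=(n_gaussians, k))   # location parameters
        self.sigma = np.full((n_gaussians, k), 0.05)             # diagonals of the covariances
        self.lam = rng.normal(0.0, 1.0, size=n_gaussians)        # weights of the combination
        self.k = k

    def __call__(self, x):
        """Evaluate F_w at points x of shape (m, k) inside the unit cube."""
        d = x[:, None, :] - self.mu[None, :, :]                  # (m, n, k)
        quad = np.sum(d ** 2 / self.sigma[None, :, :], axis=-1)  # Mahalanobis terms
        norm = (2.0 * np.pi) ** (self.k / 2) * np.sqrt(np.prod(self.sigma, axis=-1))
        return np.sum(self.lam * np.exp(-0.5 * quad) / norm, axis=-1)

f_w = GaussianSurface(n_gaussians=16, k=2)
grid = np.stack(np.meshgrid(np.linspace(0, 1, 5), np.linspace(0, 1, 4),
                            indexing="ij"), axis=-1).reshape(-1, 2)
print(f_w(grid).reshape(5, 4))   # a 5x4 set of discrete weights sampled from F_w
```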
After the data processing apparatus 100 determines the continuous parameter representation 131, the data processing apparatus 100 may be configured to store the linear combination Fw () instead of storing the first set of discrete weights for each layer. During the inference phase, the data processing apparatus 100 is configured to discretize the continuous parameter representation to obtain a second set of discrete weights. The second set of discrete weights may be seen as part of a pruned neural network, which is lightweight and may help to reduce the computational complexity of the inference phase.
FIG. 1 further illustrates an example of a discretization of the continuous parameter representation to obtain one of the second set of discrete weights. In particular, during the inference phase, the data processing apparatus 100 may be configured to determine the size of the second set of discrete weights 141, which is fourteen in this example. Then, the data processing apparatus 100 may be configured to determine a corresponding number of partitions and perform an integration over each partition to obtain, as a result, a discrete weight of the second set of discrete weights 141.
When a layer is a 2-dimensional (2D) convolutional (conv) layer, the second set of discrete weights may be obtained by:
W_{i,j,k,l} = F_W\!\left(\theta, (i \cdot h_1,\ j \cdot h_2,\ k \cdot h_3,\ l \cdot h_4)\right)    (2)

where i, j, k, l denote the four dimensions of the weights of the 2D convolutional layer, and h denotes a step along each dimension where a meshgrid is defined. For each point of the meshgrid, the function Fw is evaluated to obtain, as a result, a discrete weight of the second set of discrete weights.
When a layer is a fully connected (FC) layer, the second set of discrete weights may be obtained by:
W_{i,j} = F_W\!\left(\theta, (i \cdot h_1,\ j \cdot h_2)\right)    (3)
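The sketch below mirrors the form of equations (2) and (3): the continuous representation is evaluated on a uniform meshgrid of whichever shape is requested for the layer, with a step h = 1/size along each dimension; f_w and sample_weights are illustrative stand-ins, not the disclosed implementation.

```python
import numpy as np

def f_w(x):
    # Hypothetical continuous weight representation on the unit cube,
    # evaluated at points x of shape (..., n).
    return np.exp(-np.sum((x - 0.5) ** 2, axis=-1) / 0.1)

def sample_weights(f, shape):
    """Evaluate F_w on a uniform meshgrid of the requested shape, one grid
    point (i * h along each dimension) per output weight."""
    axes = [(np.arange(s) + 0.5) / s for s in shape]
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1)
    return f(grid)

conv_w = sample_weights(f_w, (8, 3, 3, 3))   # e.g. out-channels, in-channels, kernel H, kernel W
fc_w = sample_weights(f_w, (10, 32))         # e.g. out-features, in-features
print(conv_w.shape, fc_w.shape)
```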
FIG. 2 shows an illustration of neural network pruning according to the present disclosure.
In FIG. 2, neural network pruning on a three-dimensional cube is illustrated. Similar to FIG. 1, weights of higher dimensions (larger than two) of each layer may be represented by a linear combination of surfaces or wavelets on a high-dimensional unit cube, optionally via smooth integral kernel evaluation. The linear combination may be represented by the function Fw() and a continuous surface may be formed. Then, the data processing apparatus 100 may be configured to apply a meshgrid operation to obtain a meshgrid on the continuous surface. Then, the function Fw() may be discretized by performing an integration on partitions of the meshgrid according to a desired shape and quadrature (e.g., according to a desired neural network size).
FIG. 3 shows a flow-diagram of a method 300 according to the present disclosure. The method 300 may be performed by the apparatus 100.
The method 300 comprises the following steps:
- step 301: obtaining a trained neural network comprising a plurality of neural network layers, wherein each neural network layer comprises a first set of discrete weights; for each neural network layer:
- step 302: determining a continuous parameter representation for the first set of discrete weights based on a linear combination of Riemann integrable functions;
- step 303: discretizing the continuous parameter representation to obtain a second set of discrete weights; and during an inference phase of the trained neural network,
- step 304: generating a layer output by processing a layer input based on the second set of discrete weights. Optionally, a single apparatus may be configured to execute the method 300. Alternatively, multiple apparatus or components of a device may be configured to execute different steps of the method 300. It is noted that although an apparatus 100 is used with respect to FIG. 1, this does not exclude an embodiment where multiple apparatus may be configured to execute the steps mentioned in the method 300.
For example, a first apparatus may be configured to execute steps 301-302 once a trained neural network is obtained. Then, the first apparatus may be configured to provide the continuous parameter representation to a second apparatus. The second apparatus may be specifically configured to execute the trained neural network during the inference phase. During the inference phase, the second apparatus may be configured to execute steps 303 and 304. As an example, the first apparatus may be a server, while the second apparatus may be a terminal.
In another scenario, a first execution unit of a device may be configured to execute steps 301-302, while a second execution unit of the device may be configured to execute steps 303-304. For example, the device may be a mobile device and may comprise multiple cores in its CPU. The multiple cores may comprise relatively battery-saving and slower processor cores (known as ‘little cores’), and relatively more powerful and power-hungry processor cores (known as ‘big cores’). The one or more big cores may be configured to execute steps 301-302, while the one or more little cores may be configured to execute steps 303-304. The device may further comprise an AI accelerator chip, such as a tensor core, neural processing unit, tensor processing unit, or graphics processing unit. The AI accelerator chip may be used to assist AI-related computations in steps 301-304.
Moreover, the steps of the method 300 may share the same functions and details as described above with respect to FIGs. 1 and 2. Therefore, the corresponding method implementations are not described again at this point.
FIG. 4 shows an illustrative result of neural network pruning.
In FIG. 4, an initially obtained neural network 410 is pruned into a lightweight neural network 420. The initially obtained neural network 410 comprises exemplarily three layers. Each layer comprises a first set of discrete weights 411, 412 and 413. After performing the neural network pruning, each layer of the light-weight neural network 420 comprises a second set of discrete weights 421, 422 and 423.
FIG. 5 shows an application scenario of the present disclosure.
In FIG. 5, the first two 2D convolutional layers and the last fully connected (FC) layer of a trained neural network are represented as continuous parameter representations (original NN). These continuous parameter representations can be adapted into pruned neural networks (NNs 1-3) of different sizes, which are illustrated exemplarily in FIG. 5.
FIG. 6 shows another application scenario of the present disclosure.
In FIG. 6, the present disclosure may be applied to smartphones or self-driving cars where AI-based applications are often used. It can be seen that the present disclosure may allow conducting flexible inference strategies for AI-based applications depending on battery power, CPU usage, memory usage, environmental conditions, etc.
The present disclosure is described mainly with reference to the weights comprised in each layer of the neural network. It is noted that each layer may optionally comprise a set of biases, and embodiments of the present disclosure may apply similarly to the set of biases.
It is noted that the apparatus 100 in the present disclosure may comprise processing circuitry configured to perform, conduct or initiate the various operations of the device described herein, respectively. The processing circuitry may comprise hardware and software. The hardware may comprise analog circuitry or digital circuitry, or both analog and digital circuitry. The digital circuitry may comprise components such as application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), or multi-purpose processors. In one embodiment, the processing circuitry comprises one or more processors and a non-transitory memory connected to the one or more processors. The non-transitory memory may carry executable program code which, when executed by the one or more processors, causes the device to perform, conduct or initiate the operations or methods described herein, respectively. It is further noted that the apparatus 100 in the present disclosure may be a single electronic device capable of computing, or may comprise a set of connected electronic devices or modules capable of computing with shared system memory. It is well-known in the art that such computing capabilities may be incorporated into many different devices, and therefore the term “device” may comprise a chip, chipset, computer (including an in-vehicle computer), server, navigation equipment, radar microcontroller (MCU), advanced driver assistance system (ADAS), autonomous vehicle, drone, mobile terminal, tablet, wearable device, game console, graphics processing unit, graphics card, and the like.
The present disclosure has been described in conjunction with various embodiments as examples as well as implementations. However, other variations can be understood and effected by those skilled in the art practicing the claimed subject matter, from studying the drawings, this disclosure and the independent claims. In the claims as well as in the description, the word “comprising” does not exclude other elements or steps and the indefinite article “a” or “an” does not exclude a plurality. A single element or another unit may fulfill the functions of several entities or items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used in an advantageous implementation.

Claims

1. A data processing apparatus (100) configured to: obtain a trained neural network comprising a plurality of neural network layers, wherein each neural network layer comprises a first set of discrete weights (111); for each neural network layer, determine a continuous parameter representation (131) for the first set of discrete weights (111) based on a linear combination of Riemann integrable functions (121, 122, 123); discretize the continuous parameter representation (131) to obtain a second set of discrete weights (141); and generate a layer output by processing a layer input based on the second set of discrete weights (141).
2. The data processing apparatus (100) according to claim 1, wherein for discretizing the continuous parameter representation (131), the data processing apparatus (100) is configured to: obtain a first discretization by applying a meshgrid operation to the continuous parameter representation (131) within [0, 1]^n, wherein n denotes a dimension of the continuous parameter representation (131); and adjust the first discretization according to a numerical integration method to obtain the second set of discrete weights (141).
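For illustration, a minimal NumPy sketch of this two-step discretization follows; it assumes a hypothetical continuous representation fw that accepts one coordinate array per dimension and that every requested size is at least 2, and the trapezoidal rule stands in for the otherwise unspecified numerical integration method:

```python
import numpy as np

def discretize(fw, theta, shape):
    """Meshgrid sampling of a continuous representation on [0, 1]^n,
    followed by a trapezoidal-rule adjustment of the samples."""
    # Meshgrid operation with uniform partitions of [0, 1] along each dimension.
    axes = [np.linspace(0.0, 1.0, s) for s in shape]
    grid = np.meshgrid(*axes, indexing="ij")
    first = fw(theta, *grid)                      # first discretization
    # Adjustment according to a numerical integration method: composite
    # trapezoidal weights h * [1/2, 1, ..., 1, 1/2] applied along each axis.
    for axis, s in enumerate(shape):
        w = np.full(s, 1.0 / (s - 1))
        w[0] *= 0.5
        w[-1] *= 0.5
        first = first * w.reshape([-1 if a == axis else 1 for a in range(len(shape))])
    return first                                  # second set of discrete weights
```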
3. The data processing apparatus (100) according to claim 1 or 2, wherein for discretizing the continuous parameter representation (131), the data processing apparatus (100) is configured to apply uniform partitions in each dimension of the continuous parameter representation (131).
4. The data processing apparatus (100) according to any one of claims 1 to 3, further configured to adapt a size of the second set of discrete weights (141) based on computational complexity of an inference phase of the trained neural network.
5. The data processing apparatus (100) according to any one of claims 1 to 4, further configured to: determine a further continuous parameter representation (131) for a set of discrete inputs based on the linear combination of the Riemann integrable functions (121, 122, 123); and perform a numerical integration based on the continuous parameter representation (131) and the further continuous parameter representation (131), to obtain as a result the layer output.
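One possible one-dimensional reading of this claim is sketched below; fw_weight and fw_input are hypothetical continuous representations of the weights and of the inputs, and the sample count is an arbitrary choice:

```python
import numpy as np

def layer_output(fw_weight, fw_input, theta_w, theta_x, n_samples=128):
    """Layer output as a numerical integral over [0, 1] of the product of
    the weight representation and the input representation (1-D case)."""
    t = np.linspace(0.0, 1.0, n_samples)                   # integration variable
    integrand = fw_weight(theta_w, t) * fw_input(theta_x, t)
    return np.trapz(integrand, t)                          # trapezoidal-rule integration
```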
6. The data processing apparatus (100) according to any one of claims 1 to 5, wherein the Riemann integrable functions (121, 122, 123) are based on wavelet functions.
7. The data processing apparatus (100) according to claim 6, wherein the continuous parameter representation (131) is:

F_W(θ, x) = Σ_{i=1..n} λ_i · ((2π)^k · det(σ_i))^(−1/2) · exp(−½ · (x − μ_i)^T · σ_i^(−1) · (x − μ_i)),

wherein F_W() denotes the continuous parameter representation (131), θ denotes a vector of parameters including μ_i, σ_i and λ_i, det() denotes a determinant operation, μ_i denotes a location parameter, σ_i denotes a diagonal positive definite matrix, λ_i denotes weights of the linear combination, k denotes the number of dimensions, n denotes the number of the Gaussian functions, and i, k are integers.
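For illustration, a minimal NumPy sketch of such a linear combination of Gaussian functions with diagonal σ_i, under the reconstruction of the formula given above; the packing of θ as three parallel arrays is an assumption:

```python
import numpy as np

def f_w(theta, x):
    """Linear combination of k-dimensional Gaussian functions.
    theta = (mus, sigmas, lams): locations mu_i, the diagonals of sigma_i,
    and the combination weights lambda_i, one entry per Gaussian."""
    mus, sigmas, lams = theta
    x = np.asarray(x, dtype=float)
    k = x.shape[-1]                                   # number of dimensions
    out = np.zeros(x.shape[:-1])
    for mu, sigma, lam in zip(mus, sigmas, lams):
        diff = x - mu
        # det(sigma_i) of a diagonal matrix is the product of its diagonal entries.
        norm = (2.0 * np.pi) ** (-k / 2) * np.prod(sigma) ** -0.5
        out = out + lam * norm * np.exp(-0.5 * np.sum(diff ** 2 / sigma, axis=-1))
    return out
```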
8. The data processing apparatus (100) according to claim 7, wherein the plurality of neural network layers comprises at least one 2-dimensional, 2D, convolutional layer, and for discretizing the continuous parameter representation (131) to obtain a second set of discrete weights (141), the data processing apparatus (100) is configured to perform:

W(i, j, k, l) = F_W(θ, i·h_out, j·h_in, k·h_w1, l·h_w2)

for each 2D convolutional layer, wherein W denotes the second set of discrete weights (141), i, j, k, l denote the four dimensions of the weights of the 2D convolutional layer, and h denotes a step along each dimension.
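A sketch of sampling the representation on the 4-D grid above, reusing the hypothetical f_w from the previous sketch; the dimension names c_out, c_in, k_h, k_w and the choice of step h = 1/size per dimension are assumptions:

```python
import numpy as np

def discretize_conv2d(f_w, theta, c_out, c_in, k_h, k_w):
    """Sample the continuous representation on a uniform 4-D grid to obtain
    2D-convolution weights of shape (c_out, c_in, k_h, k_w)."""
    h = (1.0 / c_out, 1.0 / c_in, 1.0 / k_h, 1.0 / k_w)   # step along each dimension
    i, j, k, l = np.meshgrid(np.arange(c_out), np.arange(c_in),
                             np.arange(k_h), np.arange(k_w), indexing="ij")
    points = np.stack([i * h[0], j * h[1], k * h[2], l * h[3]], axis=-1)
    return f_w(theta, points)                              # shape (c_out, c_in, k_h, k_w)
```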
9. The data processing apparatus (100) according to claim 7, wherein the plurality of neural network layers comprises at least one fully-connected layer, and for discretizing the continuous parameter representation (131) to obtain a second set of discrete weights (141), the data processing apparatus (100) is configured to perform:

W(i, j) = F_W(θ, i·h_out, j·h_in)

for each fully-connected layer, wherein W denotes the second set of discrete weights (141), i, j denote the two dimensions of the weights of the fully-connected layer, and h denotes a step along each dimension.
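The fully-connected case reduces to a 2-D grid; a corresponding sketch, again reusing the hypothetical f_w and assuming steps h_out = 1/n_out and h_in = 1/n_in:

```python
import numpy as np

def discretize_fc(f_w, theta, n_out, n_in):
    """Sample the continuous representation on a uniform 2-D grid to obtain
    fully-connected weights of shape (n_out, n_in)."""
    i, j = np.meshgrid(np.arange(n_out), np.arange(n_in), indexing="ij")
    points = np.stack([i / n_out, j / n_in], axis=-1)
    return f_w(theta, points)                              # shape (n_out, n_in)
```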
10. A data processing method (300) comprising: obtaining (301) a trained neural network comprising a plurality of neural network layers, wherein each neural network layer comprises a first set of discrete weights (111); for each neural network layer, determining (302) a continuous parameter representation (131) for the first set of discrete weights (111) based on a linear combination of Riemann integrable functions (121, 122, 123); discretizing (303) the continuous parameter representation (131) to obtain a second set of discrete weights (141); and generating (304) a layer output by processing a layer input based on the second set of discrete weights (141).
11. The data processing method according to claim 10, wherein the discretizing (303) the continuous parameter representation (131) comprises: obtaining a first discretization by applying a meshgrid operation to the continuous parameter representation (131) within [0, 1]^n, wherein n denotes a dimension of the continuous parameter representation (131); and adjusting the first discretization according to a numerical integration method to obtain the second set of discrete weights (141).
12. The data processing method according to claim 10 or 11, wherein the discretizing (303) the continuous parameter representation (131) comprises applying uniform partitions in each dimension of the continuous parameter representation (131).
13. The data processing method according to any one of claims 10 to 12, further comprising adapting a size of the second set of discrete weights (141) based on computational complexity of an inference phase of the trained neural network.
14. The data processing method according to any one of claims 10 to 13, further comprising: determining a further continuous parameter representation (131) for a set of discrete inputs based on the linear combination of the Riemann integrable functions (121, 122, 123); and performing a numerical integration based on the continuous parameter representation (131) and the further continuous parameter representation (131) to obtain as a result the layer output.
15. The data processing method according to any one of claims 10 to 14, wherein the Riemann integrable functions (121, 122, 123) are based on wavelet functions.
16. The data processing method according to claim 15, wherein the continuous parameter representation (131) is:

F_W(θ, x) = Σ_{i=1..n} λ_i · ((2π)^k · det(σ_i))^(−1/2) · exp(−½ · (x − μ_i)^T · σ_i^(−1) · (x − μ_i)),

wherein F_W() denotes the continuous parameter representation (131), θ denotes a vector of parameters including μ_i, σ_i and λ_i, det() denotes a determinant operation, μ_i denotes a location parameter, σ_i denotes a diagonal positive definite matrix, λ_i denotes weights of the linear combination, k denotes the number of dimensions, n denotes the number of the Gaussian functions, and i, k are integers.
17. The data processing method according to claim 16, wherein the plurality of neural network layers comprises at least one 2-dimensional, 2D, convolutional layer, and the discretizing the continuous parameter representation (131) to obtain a second set of discrete weights (141) comprises performing
W(i, j, k, l) = F_W(θ, i·h_out, j·h_in, k·h_w1, l·h_w2)

for each 2D convolutional layer, wherein W denotes the second set of discrete weights (141), i, j, k, l denote the four dimensions of the weights of the 2D convolutional layer, and h denotes a step along each dimension.
18. The data processing method according to claim 16, wherein the plurality of neural network layers comprises at least one fully-connected layer, and the discretizing the continuous parameter representation (131) to obtain a second set of discrete weights (141) comprises performing
W(i, j) = F_W(θ, i·h_out, j·h_in)

for each fully-connected layer, wherein W denotes the second set of discrete weights (141), i, j denote the two dimensions of the weights of the fully-connected layer, and h denotes a step along each dimension.
19. A computer program product comprising instructions which, when the program is executed by a computer, cause the computer to perform the method according to any one of claims 10 to 18.