WO2022037756A1 - Data processing apparatus and method for operating multi-output neural networks - Google Patents

Data processing apparatus and method for operating multi-output neural networks

Info

Publication number
WO2022037756A1
Authority
WO
WIPO (PCT)
Prior art keywords
tensor
approximation
factor
processing apparatus
data processing
Prior art date
Application number
PCT/EP2020/073001
Other languages
French (fr)
Inventor
Alexander GRIGORIEVSKIY
Were OYOMNO
Muhammad AMAD-UD-DIN
Mark VAN HEESWIJK
Jonathan Paul FERNANDEZ STRAHL
Adrian Flanagan
Kuan Eeik TAN
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. filed Critical Huawei Technologies Co., Ltd.
Priority to PCT/EP2020/073001 priority Critical patent/WO2022037756A1/en
Priority to PCT/EP2021/069676 priority patent/WO2022037856A1/en
Publication of WO2022037756A1 publication Critical patent/WO2022037756A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/004Artificial life, i.e. computing arrangements simulating life
    • G06N3/006Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • G06N20/20Ensemble learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/02Marketing; Price estimation or determination; Fundraising
    • G06Q30/0282Rating or review of business operators or products
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/06Buying, selling or leasing transactions
    • G06Q30/0601Electronic shopping [e-shopping]
    • G06Q30/0631Item recommendations

Definitions

  • the present disclosure relates to data processing. More specifically, the present disclosure relates to a data processing apparatus and method for operating a multi-output neural network.
  • FL Federated learning
  • FL is an approach for training ML models on a plurality of distributed mobile devices, such as smartphones.
  • One of the core features of FL is that user data does not leave the respective mobile device.
  • FL is an approach which allows private machine learning.
  • FL requires inference and training capabilities on the distributed mobile devices as well as the capability of transferring a complete ML model from a respective mobile device to a FL server (often several times).
  • MONNs, which are a preferred choice for several machine learning tasks, such as language modelling, recommendation systems (also referred to as recommender systems), extreme classification and the like, have to keep track of every item/word.
  • the number of parameters is usually proportional to the number of items. Consequently, the number of parameters scales linearly with the number of items.
  • the final MONN layer may comprise at least 10⁷ parameters, which occupy about 10⁷ · 16 bits, or about 20 MB, of memory.
  • a memory space of this size just for a FL recommender system is usually not available on a mobile device with limited hardware resources.
  • MONN multi-output neural network
  • the data processing apparatus comprises a processing circuitry configured to operate, i.e. implement a neural network, in particular a multi-output neural network (MONN), wherein the neural network comprises a plurality of processing layers for sequentially processing data, including a current processing layer.
  • the input data of the current processing layer, i.e. the data provided by the preceding processing layer, is an input data vector.
  • the output data provided by the current processing layer is an output data vector depending on the input data vector and a plurality of parameters of the current processing layer.
  • the plurality of parameters of the current processing layer comprise a plurality of weights of the current processing layer.
  • the plurality of parameters of the current processing layer comprise a plurality of biases of the current processing layer.
  • the current processing layer implemented by the processing circuitry is configured to apply a first parameter tensor to the input data vector for obtaining a first factor, i.e. component of an approximation tensor and a second parameter tensor to the input data vector for obtaining a second factor, i.e. component of the approximation tensor.
  • the current processing layer implemented by the processing circuitry is configured to apply one or more further parameter tensors to the input data vector for obtaining one or more further factors, i.e. components of the approximation tensor.
  • the current processing layer implemented by the processing circuitry is configured to generate the output data vector on the basis of the approximation tensor, wherein the approximation tensor is a tensor product of at least the first factor of the approximation tensor and the second factor of the approximation tensor.
  • the approximation tensor is the tensor product of the first factor of the approximation tensor, the second factor of the approximation tensor and one or more further factors of the approximation tensor.
  • the first factor and the second factor of the approximation tensor may have the same number of dimensions as the approximation tensor.
  • a data processing apparatus is thus provided that allows implementing a neural network with lower memory requirements than conventional neural networks.
  • the current processing layer is a final processing layer of the plurality of processing layers and the output data vector is a prediction vector.
  • the approximation tensor is an approximation matrix having two dimensions, wherein the approximation matrix is a matrix product of the first factor of the approximation matrix and the second factor of the approximation matrix.
  • the first factor and the second factor of the approximation matrix may be matrices having two dimensions as well.
  • the output data vector comprises N elements and the approximation matrix comprises M x M elements, wherein M denotes an integer equal to or larger than √N.
  • the first factor of the approximation matrix comprises M x R elements and the second factor of the approximation matrix comprises R x M elements, wherein R denotes an integer approximation parameter.
  • the input data vector comprises D elements, wherein the first parameter tensor comprises M x R x D elements and the second parameter tensor comprises R x M x D elements.
  • the processing circuitry of the data processing apparatus is configured to adjust the integer approximation parameter R.
  • the processing circuitry of the data processing apparatus is configured to generate the output data vector, in particular prediction vector by a row-wise or a column-wise reshaping of the approximation matrix, i.e. by concatenating one or more rows or one or more columns of the approximation matrix for generating the output data vector, in particular prediction vector.
  • the processing circuitry of the data processing apparatus is configured to prune the output data vector, in particular prediction vector (e.g. from either one of the two ends), in case M is larger than √N.
  • the processing circuitry of the data processing apparatus is configured to generate the output data vector, in particular prediction vector on the basis of a sum of the approximation tensor, in particular approximation matrix and a bias tensor, in particular bias matrix, wherein the bias tensor is the tensor product of at least a first factor of the bias tensor and a second factor of the bias tensor, wherein the first factor of the bias tensor has the same size as the first factor of the approximation tensor and the second factor of the bias tensor has the same size as the second factor of the approximation tensor.
  • the processing circuitry of the data processing apparatus is further configured to adjust the order of the elements of the approximation tensor, in particular approximation matrix for obtaining a reordered approximation tensor, in particular approximation matrix and to generate the output data vector, in particular prediction vector on the basis of the reordered approximation tensor, in particular approximation matrix.
  • each element of the approximation tensor, in particular approximation matrix is associated with a respective item of a plurality of items, e.g. N items, wherein the processing circuitry of the data processing apparatus is further configured to adjust the order of the elements of the approximation tensor, in particular approximation matrix on the basis of information about a respective score of a respective item.
  • the respective score of a respective item may be a measure of a popularity or rating of an item.
  • each element of the approximation tensor, in particular approximation matrix is associated with a respective item of a plurality of items, wherein each item is defined by one or more item features and wherein the processing circuitry of the data processing apparatus is further configured to adjust the order of the elements of the approximation tensor, in particular approximation matrix on the basis of the one or more item features of each of the plurality of items.
  • the processing circuitry of the data processing apparatus is configured to operate the neural network in a training mode, wherein in the training mode the processing circuitry of the data processing apparatus is configured to train, i.e. determine the elements of the first parameter tensor and the elements of the second parameter tensor as well as corresponding biases.
  • a data processing method comprises the steps of: operating a neural network, in particular a MONN, wherein the neural network comprises a plurality of processing layers for sequentially processing data, including a current processing layer, wherein input data of the current processing layer is an input data vector and wherein output data provided by the current processing layer is an output data vector depending on the input data vector and a plurality of parameters of the current processing layer; applying a first parameter tensor to the input data vector for obtaining a first factor of an approximation tensor and a second parameter tensor to the input data vector for obtaining a second factor of the approximation tensor; and generating the output data vector on the basis of the approximation tensor, wherein the approximation tensor is a tensor product of at least the first factor of the approximation tensor and the second factor of the approximation tensor.
  • the data processing method according to the second aspect of the present disclosure can be performed by the data processing apparatus according to the first aspect of the present disclosure.
  • further features of the data processing method according to the second aspect of the present disclosure result directly from the functionality of the data processing apparatus according to the first aspect of the present disclosure as well as its different implementation forms described above and below.
  • a computer program product comprising a non-transitory computer-readable storage medium carrying program code which causes a computer or a processor to perform the method according to the second aspect when the program code is executed by the computer or the processor.
  • Fig. 1 is a schematic diagram illustrating a communication system including a data processing apparatus according to an embodiment in the form of a smartphone;
  • Fig. 2a is a schematic diagram illustrating different aspects of a processing layer of a conventional neural network
  • Fig. 2b is a schematic diagram illustrating different aspects of a processing layer of a neural network operated by a data processing apparatus according to an embodiment
  • Fig. 3 is a schematic diagram illustrating different aspects of a processing layer of a neural network operated by a data processing apparatus according to an embodiment
  • Fig. 4 is a schematic diagram illustrating different aspects of a processing layer of a neural network operated by a data processing apparatus according to an embodiment
  • Fig. 5 is a diagram illustrating the performance of different neural networks operated by a data processing apparatus according to an embodiment in comparison with a conventional neural network
  • Fig. 6 is a flow diagram illustrating steps of a data processing method according to an embodiment.
  • a disclosure in connection with a described method may also hold true for a corresponding device or system configured to perform the method and vice versa.
  • a corresponding device may include one or a plurality of units, e.g. functional units, to perform the described one or plurality of method steps (e.g. one unit performing the one or plurality of steps, or a plurality of units each performing one or more of the plurality of steps), even if such one or more units are not explicitly described or illustrated in the figures.
  • a specific apparatus is described based on one or a plurality of units, e.g.
  • FIG. 1 is a schematic diagram of a communication system 100.
  • the communication system 100 includes a plurality of data processing apparatuses 110.
  • one of the plurality of data processing apparatuses 110 is a smartphone 110, i.e. an electronic user device 110 with reduced hardware capabilities with respect to computational power, memory storage and/or battery capacity.
  • the plurality of data processing apparatuses 110 may further comprise, for instance, smart watches, tablet computers, laptop computers, or other types of mobile user or IoT devices.
  • the plurality of data processing apparatuses 110 may be configured to communicate via a base station 120 of a wireless communication network with each other and/or one or more cloud servers 130.
  • the data processing apparatus 110 may comprise processing circuitry 111, for instance, a processor 111 for processing and generating data, a communication interface 113, including, for instance, an antenna, for exchanging data with the other components of the communication system 100, and a non-transitory memory 115 for storing data.
  • the processor 111 of the data processing apparatus 110 may be implemented in hardware and/or software.
  • the hardware may comprise digital circuitry, or both analog and digital circuitry.
  • Digital circuitry may comprise components such as application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), or general-purpose processors.
  • the non-transitory memory 115 may store data, such as executable program code which, when executed by the processor 111, causes the client device 110 to perform the functions, operations and methods described herein.
  • the communication interface 113 may comprise a wired or wireless communication interface 113.
  • the processing circuitry 111 of the data processing apparatus 110 is configured to operate, i.e. implement a neural network, in particular a multi-output neural network (MONN).
  • the neural network comprises a plurality of processing layers for sequentially processing data, including a current processing layer.
  • the input data of the current processing layer, i.e. the output data of the processing layer preceding the current processing layer, is an input data vector 211 (as illustrated in figure 2b).
  • the output data provided by the current processing layer is a prediction vector 219 (as illustrated in figure 2b) depending on the input data vector 211 and a plurality of parameters, e.g. weights and biases of the current processing layer.
  • the current processing layer configured to process the input data vector 211 into the output data vector 219 is a final processing layer of the plurality of processing layers, i.e. the last processing layer along the data processing stream.
  • the output data vector 219 is often referred to as prediction vector 219.
  • the elements of the prediction vector may be a plurality of scores or likelihoods associated with a plurality of items, for instance, video files.
  • the generation of a prediction vector 209 having N elements by a final processing layer of a conventional neural network is illustrated in figure 2a.
  • the N elements of the prediction vector 209 are the result of applying a matrix 203 based on the parameters of the final processing layer of the conventional neural network to an input data vector 201 having D elements.
  • Figure 2b illustrates the generation of the prediction vector 219 having N elements by the final processing layer of the neural network provided by the processing circuitry 111 of the data processing apparatus 110 according to an embodiment.
  • the final processing layer of the neural network operated by the processing circuitry 111 of the data processing apparatus 110 is configured to apply a first parameter tensor 213a to the input data vector 211 for obtaining a first factor 215a, i.e. component of an approximation tensor 217.
  • the approximation tensor 217 is a matrix, i.e. having two dimensions.
  • the approximation tensor 217 may have more than two dimensions, for instance, three or four dimensions.
  • the final processing layer of the neural network operated by the processing circuitry 111 of the data processing apparatus 110 is configured to apply a second parameter tensor 213b to the input data vector 211 for obtaining a second factor, i.e. component of the approximation matrix 217.
  • the processing circuitry 111 of the data processing apparatus 110 implementing the final processing layer of the neural network may be configured to apply one or more further parameter tensors to the input data vector 211 for obtaining one or more further factors, i.e. components of the approximation tensor 217.
  • the final processing layer of the neural network operated by the processing circuitry 111 of the data processing apparatus 110 is configured to generate the prediction vector 219 on the basis of the approximation matrix 217, wherein the approximation matrix 217 is a matrix product of the first factor 215a of the approximation matrix 217 and the second factor 215b of the approximation matrix 217.
  • the processing circuitry 111 of the data processing apparatus 110 implementing the final processing layer of the neural network may be configured to generate the prediction vector 219 on the basis of the approximation tensor 217, wherein the approximation tensor 217 is the tensor product of the first factor 215a of the approximation tensor 217, the second factor 215b of the approximation tensor 217 and one or more further factors of the approximation tensor 217.
  • the first factor 215a of the approximation matrix 217 comprises √N x R elements and the second factor 215b of the approximation matrix 217 comprises R x √N elements, wherein R denotes an adjustable integer approximation parameter that allows managing the trade-off between the number of parameters and the accuracy of the neural network model, as will be described in further detail in the context of figure 5 below.
  • the approximation matrix 217, being the matrix product of the first factor 215a and the second factor 215b, comprises √N x √N elements.
  • the first factor 215a of the approximation matrix 217 may comprise M x R elements and the second factor 215b of the approximation matrix 217 may comprise R x M elements, wherein M denotes an integer larger than √N (and smaller than N).
  • the approximation matrix 217 comprises M x M elements.
  • the processing circuitry 111 of the data processing apparatus 110 is configured to generate the prediction vector 219 on the basis of a sum of the approximation matrix 217 and a bias matrix (not shown in figure 2b), wherein the bias matrix is the matrix product, i.e. tensor product of a first factor of the bias matrix and a second factor of the bias matrix, wherein the first factor of the bias matrix has the same size as the first factor 215a of the approximation matrix 217 and the second factor of the bias matrix has the same size as the second factor 215b of the approximation matrix 217.
  • the bias matrix is the matrix product, i.e. tensor product of a first factor of the bias matrix and a second factor of the bias matrix, wherein the first factor of the bias matrix has the same size as the first factor 215a of the approximation matrix 217 and the second factor of the bias matrix has the same size as the second factor 215b of the approximation matrix 217.
  • the activations from the previous processing layer of size D, i.e. the elements of the input data vector 211, are multiplied using matrix multiplication by the first parameter tensor 213a and the second parameter tensor 213b, respectively.
  • thereby, the two matrices 215a, 215b, i.e. the first factor 215a and the second factor 215b of the approximation matrix 217, are obtained.
  • these two matrices 215a, 215b are multiplied together along the dimension R.
  • the approximation matrix 217 with dimensions √N x √N (or M x M) is obtained.
  • the two bias matrices, i.e. the first factor of the bias matrix and the second factor of the bias matrix, may also be multiplied along the second dimension, which results in a bias matrix having the same dimensions as the approximation matrix 217.
  • the bias matrix may be added to the approximation matrix to obtain the biased approximation matrix 217.
  • the processing circuitry 111 of the data processing apparatus 110 may be configured to reshape the (biased) approximation matrix 217 row-wise or column-wise into the prediction vector 219, i.e. by concatenating the rows or the columns of the approximation matrix 217 for generating the prediction vector 219.
  • the processing circuitry 111 of the data processing apparatus 110 may be further configured to reduce the number of elements of the prediction vector 219 to N elements by pruning the prediction vector 219, i.e. by removing one or more elements from either end thereof.
  • the processing circuitry 111 of the data processing apparatus 110 may output the final prediction vector 219 for providing predictions in an inference mode of the neural network operated by the data processing apparatus 110.
  • the final prediction vector 219 may be used by the processing circuitry 111 operating the neural network in a training mode for a cost function for training the elements of the first parameter tensor 213a and the elements of the second parameter tensor 213b (as well as the corresponding bias tensors, if present).
  • the processing circuitry 111 of the data processing apparatus 110 may output the prediction vector 219 in the training mode into a Binary Cross-Entropy loss function which acts independently on each output dimension. Other loss functions may be used as well. Backpropagation may be used in the usual way.
  • the elements of the prediction vector 219 may be interpreted as probabilities (scores) indicating, for instance, how much a user likes an item.
  • the final processing layer of the neural network provided by the processing circuitry 111 of the data processing apparatus 110 may be considered to constitute a low rank activation approximation (LRAA) layer that solves the problem of linear size scalability of a conventional output embedding layer.
  • LRAA low rank activation approximation
  • a further illustration of the general idea implemented by the final processing layer of the neural network provided by the processing circuitry 111 of the data processing apparatus 110 is provided in figure 3.
  • the prediction vector 219 is reshaped into the approximation matrix 217 of size √N x √N (or M x M).
  • the approximation matrix 217 is approximated by the low rank decomposition into a multiplication of the two low-dimensional factors 215a, 215b of hidden dimensionality R.
  • for R = 1, the low rank decomposition can also be viewed as an outer product of two vectors.
  • the low rank approximation illustrated in figures 2b and 3 and implemented by the processing circuitry 111 of the data processing apparatus 110 may have one or more of the following properties.
  • the approximation is easily reversible. For instance, for two low rank matrices (which may be the outputs of a neural network), a vector with scores of the original dimension may be obtained by performing the steps illustrated in figures 2b and 3 in the inverse order.
  • the dimensions of the approximation matrix 217 do not necessarily have to be √N x √N, but may be M x M with M denoting an integer equal to or larger than √N.
  • when the approximation matrix 217 is reshaped into the prediction vector 219, these extra elements may be cropped, i.e. pruned, so that the prediction vector 219 has exactly the size N.
  • the reduction of the number of parameters provided by embodiments of the data processing apparatus 110 may be largest when the dimensions of the approximation matrix 217 are equal to or close to √N x √N.
  • the dimensions of the approximation matrix 217 may be M1 x M2, with M1 and M2 denoting integers such that M1 · M2 ≥ N.
  • the processing circuitry 111 of the data processing apparatus 110 may be configured to determine M1 and M2 by decomposing N into integer factors.
  • the elements of the approximation matrix 217 are linearly dependent, as illustrated in figure 4.
  • the elements in the first row 411 of the approximation matrix 217 are linearly dependent on the elements of the first row 401 of the first factor 215a.
  • the elements in the second column 413 of the approximation matrix 217 are linearly dependent on the elements of the second column 403 of the second factor 215b.
  • Further embodiments of the data processing apparatus 110 which will be described in the following, make advantageous use of this property of the elements of the approximation matrix 217 for allowing the processing circuitry 111 to determine the approximation matrix 217 more efficiently.
  • the data processing apparatus 110 may be configured to use item features for rearranging the elements, i.e. items of the approximation matrix 217.
  • the processing circuitry 111 of the data processing apparatus 110 may be configured to organize items in a matrix having the same dimensions as the approximation matrix 217 using item features. Having item features, items can be projected to a 2D space by any known algorithm. Thereafter, each item can be mapped to a certain position in the approximation matrix 217. The principle of mapping is that items which are close in the 2D space should be close in the approximation matrix 217. This determines the enumeration of items, since all positions in the approximation matrix 217 can be enumerated by integers.
  • the processing circuitry 111 of the data processing apparatus 110 may be configured to use the popularity of items in the training data to organize them in a matrix having the same dimensions as the approximation matrix 217.
  • the items are enumerated based on their popularity or another type of score (for instance, the most popular item has number 1, the second most popular item has number 2 and so forth; see the popularity-layout sketch following this list). This enumeration determines the position of an item in the approximation matrix 217.
  • the processing circuitry 111 of the data processing apparatus 110 may be configured to divide items using item features into √N groups, wherein each group comprises √N items having the same or similar features.
  • these numbers should be understood in an approximate sense, since they depend on the dimensions of the approximation matrix 217, which may be larger than √N x √N (as described above).
  • the processing circuitry 111 of the data processing apparatus 110 may arrange items in the same group in the same row (or column) of the approximation matrix 217.
  • the processing circuitry 111 of the data processing apparatus 110 may be configured to take at least a second possible grouping of the items into account (possibly using other features).
  • the second grouping of the items is in some sense orthogonal to the first grouping.
  • This second grouping will define the order of items in each row of the approximation matrix 217 which is equivalent to defining which column of the approximation matrix 217 each item belongs to.
  • the processing circuitry 111 of the data processing apparatus 110 is configured to sort the items of the approximation matrix 217 first with respect to the first grouping and then with respect to the second grouping, for instance, by means of a sorting algorithm.
  • items of the approximation matrix 217 are sorted, i.e. rearranged with respect to the first grouping, and within each such group they are sorted, i.e. rearranged with respect to the second grouping.
  • the position of an item in the approximation matrix may be determined by the processing circuitry 111 of the data processing apparatus 110 on the basis of at least two "orthogonal" groupings.
  • these groupings can be provided by the cloud server 130, which may be the central server of a Federated Learning system, while the data processing apparatuses 110 receive the groupings from the cloud server 130, i.e. the arrangement of the items in the approximation matrix 217.
  • these groupings can be done by projecting items to a 2D space using, for instance, one of the following methods: Principal Component Analysis (PCA), Independent Component Analysis (ICA), Uniform Manifold Approximation and Projection (UMAP), T-distributed Stochastic Neighborhood Embedding (t-SNE), Autoencoders.
  • PCA Principal Component Analysis
  • ICA Independent Component Analysis
  • UMAP Uniform Manifold Approximation and Projection
  • t-SNE T-distributed Stochastic Neighborhood Embedding
  • Autoencoders Autoencoders.
  • the processing circuitry 111 of the data processing apparatus 110 may be configured to enumerate, i.e. order items on the basis of the popularity, i.e. score of the items in the training data so that the most popular item has number 1, the second most popular item has number 2 and so forth.
  • the processing circuitry 111 can then determine the approximation matrix 217 more efficiently, because in this case items in the first row (or column) of the approximation matrix 217 should have larger scores than the items in the second row and so forth.
  • Figure 5 is a diagram illustrating the performance of different neural networks operated by the data processing apparatus 110 according to an embodiment in comparison with a conventional neural network.
  • the parameter R allows controlling the neural network model size.
  • Figure 6 is a flow diagram illustrating steps of a data processing method 600 according to an embodiment.
  • the data processing method 600 illustrated in figure 6 can be performed by the data processing apparatus 110 described above.
  • further features of the data processing method 600 result directly from the functionality of the data processing apparatus 110 described above.
  • a neural network in particular a MONN, is operated, wherein the neural network comprises a plurality of processing layers for sequentially processing data, including a current processing layer, for instance, a final processing layer of the plurality of processing layers.
  • the input data of the current processing layer is an input data vector 211 and the output data provided by the current processing layer is an output data vector 219 depending on the input data vector 211 and a plurality of parameters of the current processing layer.
  • a further step 603 of the data processing method 600 comprises applying a first parameter tensor 213a to the input data vector 211 for obtaining a first factor 215a of an approximation tensor 217, in particular approximation matrix 217 and a second parameter tensor 213b to the input data vector 211 for obtaining a second factor 215b of the approximation tensor 217, in particular approximation matrix 217.
  • the output data vector 219 is generated on the basis of the approximation tensor 217 (in particular approximation matrix 217), wherein the approximation tensor 217 (in particular approximation matrix 217) is a tensor product of at least the first factor 215a of the approximation tensor 217 (in particular approximation matrix 217) and the second factor 215b of the approximation tensor 217 (in particular approximation matrix 217).
  • the current, in particular final processing layer described herein may be generalized to higher dimensional tensors.
  • the prediction vector 219 can be reshaped to a three-dimensional tensor (in contrast to a matrix, which is a two-dimensional tensor), a four-dimensional tensor and so forth. Then, there may be several parameter tensors with different numbers of dimensions. The only requirement is that, after the parameter tensors are multiplied with the activations from the previous layer and multiplied with each other, the resulting tensor can be reshaped to a vector of size N or "slightly" larger. Multiplication here means multiplication of tensors.
  • the final processing layer may have three parameter tensors of the same shape, i.e. with the dimensions (M, R, D). After multiplying each of them by the input data vector along the last dimension, three parts, i.e. factors of the approximation tensor 217 are obtained. Each part has the shape (M, R). Finally, the approximation tensor 217 may be obtained by multiplying all three factors along the last dimension. If these three tensors are denoted as A, B, C, then the tensor multiplication in Einstein notation would be: A_ij B_kj C_mj. The result of the product is the approximation tensor 217 of the shape (M, M, M) (see the three-factor sketch following this list).
  • the approximation tensor 217 is the analog of the approximation matrix 217 described above and can be reshaped to the prediction vector 219. Note that item enumeration based on item features should then project the item features into a 3D space; thereafter the procedure generalizes straightforwardly.
  • the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • functional units in the embodiments of the invention may be integrated into one processing unit, or each of the units may exist alone physically, or two or more units are integrated into one unit.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Engineering & Computer Science (AREA)
  • Finance (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Accounting & Taxation (AREA)
  • Strategic Management (AREA)
  • Development Economics (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Economics (AREA)
  • Computational Linguistics (AREA)
  • General Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Game Theory and Decision Science (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure relates to a data processing apparatus (110). The apparatus comprises a processing circuitry (110a) configured to implement a neural network including a current processing layer. The input data of the current processing layer is an input data vector and the output data provided by the current processing layer is an output data vector depending on the input data vector. The current processing layer is configured to apply a first parameter tensor to the input data vector for obtaining a first factor of an approximation tensor and a second parameter tensor to the input data vector for obtaining a second factor of the approximation tensor. Moreover, the current processing layer is configured to generate the output data vector on the basis of the approximation tensor, wherein the approximation tensor is a tensor product of at least the first factor of the approximation tensor and the second factor of the approximation tensor.

Description

DATA PROCESSING APPARATUS AND METHOD FOR OPERATING MULTI-OUTPUT
NEURAL NETWORKS
TECHNICAL FIELD
The present disclosure relates to data processing. More specifically, the present disclosure relates to a data processing apparatus and method for operating a multi-output neural network.
BACKGROUND
Artificial Intelligence (AI), for instance, in the form of machine learning (ML), is being implemented in more and more electronic devices. ML, however, is usually demanding with respect to computational resources and consumes a significant amount of energy, for instance, due to frequent memory accesses. Therefore, it is a challenge to implement ML models on mobile electronic devices with reduced hardware capabilities in terms of processing power, memory and energy consumption, such as smartphones or other types of IoT devices. Reducing the memory size required by ML models is one possible approach to address this challenge.
Federated learning (FL) is an approach for training ML models on a plurality of distributed mobile devices, such as smartphones. One of the core features of FL is that user data does not leave the respective mobile device. Thus, FL is an approach which allows private machine learning. To this end, however, FL requires inference and training capabilities on the distributed mobile devices as well as the capability of transferring a complete ML model from a respective mobile device to a FL server (often several times). Thus, it would be desirable to have small ML models, i.e. ML models that require a small amount of memory.
However, certain components of ML models usually cannot be reduced in size, such as the final layer of a Multi-output Neural Network (MONN). This is because MONNs, which are a preferred choice for several machine learning tasks, such as language modelling, recommendation systems (also referred to as recommender systems), extreme classification and the like, have to keep track of every item/word. More specifically, for the final layer of a MONN the number of parameters is usually proportional to the number of items. Consequently, the number of parameters scales linearly with the number of items. For instance, in a MONN providing a recommender system for recommending 10 million videos, the final MONN layer may comprise at least 10⁷ parameters, which occupy about 10⁷ · 16 bits, or about 20 MB, of memory. A memory space of this size just for a FL recommender system is usually not available on a mobile device with limited hardware resources.
SUMMARY
It is an objective of the present disclosure to provide an improved data processing apparatus and method configured to operate a multi-output neural network (MONN) having a smaller size than conventional MONNs.
The foregoing and other objectives are achieved by the subject matter of the independent claims. Further implementation forms are apparent from the dependent claims, the description and the figures.
According to a first aspect a data processing apparatus is provided. The data processing apparatus comprises a processing circuitry configured to operate, i.e. implement a neural network, in particular a multi-output neural network (MONN), wherein the neural network comprises a plurality of processing layers for sequentially processing data, including a current processing layer. The input data of the current processing layer, i.e. the data provided by the preceding processing layer, is an input data vector and the output data provided by the current processing layer is an output data vector depending on the input data vector and a plurality of parameters of the current processing layer. In an embodiment, the plurality of parameters of the current processing layer comprise a plurality of weights of the current processing layer. In an embodiment, the plurality of parameters of the current processing layer comprise a plurality of biases of the current processing layer.
The current processing layer implemented by the processing circuitry is configured to apply a first parameter tensor to the input data vector for obtaining a first factor, i.e. component of an approximation tensor and a second parameter tensor to the input data vector for obtaining a second factor, i.e. component of the approximation tensor. In an embodiment, the current processing layer implemented by the processing circuitry is configured to apply one or more further parameter tensors to the input data vector for obtaining one or more further factors, i.e. components of the approximation tensor. Moreover, the current processing layer implemented by the processing circuitry is configured to generate the output data vector on the basis of the approximation tensor, wherein the approximation tensor is a tensor product of at least the first factor of the approximation tensor and the second factor of the approximation tensor. In an embodiment, the approximation tensor is the tensor product of the first factor of the approximation tensor, the second factor of the approximation tensor and one or more further factors of the approximation tensor. As will be appreciated, the first factor and the second factor of the approximation tensor may have the same number of dimensions as the approximation tensor.
Thus, advantageously, a data processing apparatus is provided that allows implementing a neural network with lower memory requirements than conventional neural networks.
In a further possible implementation form of the first aspect, the current processing layer is a final processing layer of the plurality of processing layers and the output data vector is a prediction vector.
In a further possible implementation form of the first aspect, the approximation tensor is an approximation matrix having two dimensions, wherein the approximation matrix is a matrix product of the first factor of the approximation matrix and the second factor of the approximation matrix. As will be appreciated, the first factor and the second factor of the approximation matrix may be matrices having two dimensions as well.
In a further possible implementation form of the first aspect, the output data vector comprises N elements and the approximation matrix comprises M x M elements, wherein M denotes an integer equal to or larger than √N.
In a further possible implementation form of the first aspect, the first factor of the approximation matrix comprises M x R elements and the second factor of the approximation matrix comprises R x M elements, wherein R denotes an integer approximation parameter.
In a further possible implementation form of the first aspect, the input data vector comprises D elements, wherein the first parameter tensor comprises M x R x D elements and the second parameter tensor comprises R x M x D elements.
In a further possible implementation form of the first aspect, the processing circuitry of the data processing apparatus is configured to adjust the integer approximation parameter R.
In a further possible implementation form of the first aspect, the processing circuitry of the data processing apparatus is configured to generate the output data vector, in particular prediction vector by a row-wise or a column-wise reshaping of the approximation matrix, i.e. by concatenating one or more rows or one or more columns of the approximation matrix for generating the output data vector, in particular prediction vector.
In a further possible implementation form of the first aspect, the processing circuitry of the data processing apparatus is configured to prune the output data vector, in particular prediction vector (e.g. from either one of the two ends), in case M is larger than √N.
In a further possible implementation form of the first aspect, the processing circuitry of the data processing apparatus is configured to generate the output data vector, in particular prediction vector on the basis of a sum of the approximation tensor, in particular approximation matrix and a bias tensor, in particular bias matrix, wherein the bias tensor is the tensor product of at least a first factor of the bias tensor and a second factor of the bias tensor, wherein the first factor of the bias tensor has the same size as the first factor of the approximation tensor and the second factor of the bias tensor has the same size as the second factor of the approximation tensor.
In a further possible implementation form of the first aspect, the processing circuitry of the data processing apparatus is further configured to adjust the order of the elements of the approximation tensor, in particular approximation matrix for obtaining a reordered approximation tensor, in particular approximation matrix and to generate the output data vector, in particular prediction vector on the basis of the reordered approximation tensor, in particular approximation matrix.
In a further possible implementation form of the first aspect, each element of the approximation tensor, in particular approximation matrix is associated with a respective item of a plurality of items, e.g. N items, wherein the processing circuitry of the data processing apparatus is further configured to adjust the order of the elements of the approximation tensor, in particular approximation matrix on the basis of information about a respective score of a respective item. In an embodiment, the respective score of a respective item may be a measure of a popularity or rating of an item.
In a further possible implementation form of the first aspect, each element of the approximation tensor, in particular approximation matrix is associated with a respective item of a plurality of items, wherein each item is defined by one or more item features and wherein the processing circuitry of the data processing apparatus is further configured to adjust the order of the elements of the approximation tensor, in particular approximation matrix on the basis of the one or more item features of each of the plurality of items.
In a further possible implementation form of the first aspect, the processing circuitry of the data processing apparatus is configured to operate the neural network in a training mode, wherein in the training mode the processing circuitry of the data processing apparatus is configured to train, i.e. determine the elements of the first parameter tensor and the elements of the second parameter tensor as well as corresponding biases.
According to a second aspect a data processing method is provided. The data processing method comprises the steps of: operating a neural network, in particular a MONN, wherein the neural network comprises a plurality of processing layers for sequentially processing data, including a current processing layer, wherein input data of the current processing layer is an input data vector and wherein output data provided by the current processing layer is an output data vector depending on the input data vector and a plurality of parameters of the current processing layer; applying a first parameter tensor to the input data vector for obtaining a first factor of an approximation tensor and a second parameter tensor to the input data vector for obtaining a second factor of the approximation tensor; and generating the output data vector on the basis of the approximation tensor, wherein the approximation tensor is a tensor product of at least the first factor of the approximation tensor and the second factor of the approximation tensor.
The data processing method according to the second aspect of the present disclosure can be performed by the data processing apparatus according to the first aspect of the present disclosure. Thus, further features of the data processing method according to the second aspect of the present disclosure result directly from the functionality of the data processing apparatus according to the first aspect of the present disclosure as well as its different implementation forms described above and below.
According to a third aspect a computer program product is provided comprising a non-transitory computer-readable storage medium carrying program code which causes a computer or a processor to perform the method according to the second aspect when the program code is executed by the computer or the processor.
Details of one or more embodiments are set forth in the accompanying drawings and the description below. Other features, objects, and advantages will be apparent from the description, drawings, and claims.
BRIEF DESCRIPTION OF THE DRAWINGS
In the following, embodiments of the present disclosure are described in more detail with reference to the attached figures and drawings, in which:
Fig. 1 is a schematic diagram illustrating a communication system including a data processing apparatus according to an embodiment in the form of a smartphone;
Fig. 2a is a schematic diagram illustrating different aspects of a processing layer of a conventional neural network;
Fig. 2b is a schematic diagram illustrating different aspects of a processing layer of a neural network operated by a data processing apparatus according to an embodiment;
Fig. 3 is a schematic diagram illustrating different aspects of a processing layer of a neural network operated by a data processing apparatus according to an embodiment;
Fig. 4 is a schematic diagram illustrating different aspects of a processing layer of a neural network operated by a data processing apparatus according to an embodiment;
Fig. 5 is a diagram illustrating the performance of different neural networks operated by a data processing apparatus according to an embodiment in comparison with a conventional neural network; and
Fig. 6 is a flow diagram illustrating steps of a data processing method according to an embodiment.
In the following identical reference signs refer to identical or at least functionally equivalent features.
DETAILED DESCRIPTION OF THE EMBODIMENTS
In the following description, reference is made to the accompanying figures, which form part of the disclosure, and which show, by way of illustration, specific aspects of embodiments of the present disclosure or specific aspects in which embodiments of the present disclosure may be used. It is understood that embodiments of the present disclosure may be used in other aspects and comprise structural or logical changes not depicted in the figures. The following detailed description, therefore, is not to be taken in a limiting sense, and the scope of the present disclosure is defined by the appended claims.
For instance, it is to be understood that a disclosure in connection with a described method may also hold true for a corresponding device or system configured to perform the method and vice versa. For example, if one or a plurality of specific method steps are described, a corresponding device may include one or a plurality of units, e.g. functional units, to perform the described one or plurality of method steps (e.g. one unit performing the one or plurality of steps, or a plurality of units each performing one or more of the plurality of steps), even if such one or more units are not explicitly described or illustrated in the figures. On the other hand, for example, if a specific apparatus is described based on one or a plurality of units, e.g. functional units, a corresponding method may include one step to perform the functionality of the one or plurality of units (e.g. one step performing the functionality of the one or plurality of units, or a plurality of steps each performing the functionality of one or more of the plurality of units), even if such one or plurality of steps are not explicitly described or illustrated in the figures. Further, it is understood that the features of the various exemplary embodiments and/or aspects described herein may be combined with each other, unless specifically noted otherwise.
Figure 1 is a schematic diagram of a communication system 100. In the embodiment shown in figure 1, the communication system 100 includes a plurality of data processing apparatuses 110. In the embodiment shown in figure 1, one of the plurality of data processing apparatuses 110, by way of example, is a smartphone 110, i.e. an electronic user device 110 with reduced hardware capabilities with respect to computational power, memory storage and/or battery capacity. In an embodiment, the plurality of data processing apparatuses 110 may further comprise, for instance, smart watches, tablet computers, laptop computers, or other types of mobile user or IoT devices. As illustrated in figure 1, the plurality of data processing apparatuses 110 may be configured to communicate via a base station 120 of a wireless communication network with each other and/or one or more cloud servers 130.
As illustrated in figure 1, the data processing apparatus 110 may comprise processing circuitry 111, for instance, a processor 111 for processing and generating data, a communication interface 113, including, for instance, an antenna, for exchanging data with the other components of the communication system 100, and a non-transitory memory 115 for storing data. The processor 111 of the data processing apparatus 110 may be implemented in hardware and/or software. The hardware may comprise digital circuitry, or both analog and digital circuitry. Digital circuitry may comprise components such as application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), or general-purpose processors. The non-transitory memory 115 may store data, such as executable program code which, when executed by the processor 111, causes the client device 110 to perform the functions, operations and methods described herein. The communication interface 113 may comprise a wired or wireless communication interface 113.
As will be described in more detail below under further reference to figures 2a, 2b and 3, the processing circuitry 111 of the data processing apparatus 110 is configured to operate, i.e. implement, a neural network, in particular a multi-output neural network (MONN). The neural network comprises a plurality of processing layers for sequentially processing data, including a current processing layer. The input data of the current processing layer, i.e. the output data of the processing layer preceding the current processing layer, is an input data vector 211 (as illustrated in figure 2b). The output data provided by the current processing layer is a prediction vector 219 (as illustrated in figure 2b) depending on the input data vector 211 and a plurality of parameters, e.g. weights and biases, of the current processing layer. Moreover, in the following, embodiments of the data processing apparatus 110 will be described wherein the current processing layer, which is configured to process the input data vector 211 into the output data vector 219, is a final processing layer of the plurality of processing layers, i.e. the last processing layer along the data processing stream. For the case of the current processing layer being the final processing layer of the plurality of processing layers, the output data vector 219 is often referred to as prediction vector 219. In an embodiment, the elements of the prediction vector 219 may be a plurality of scores or likelihoods associated with a plurality of items, for instance, video files.
The generation of a prediction vector 209 having N elements by a final processing layer of a conventional neural network is illustrated in figure 2a. The N elements of the prediction vector 209 are the result of applying a matrix 203 based on the parameters of the final processing layer of the conventional neural network to an input data vector 201 having D elements.
Figure 2b illustrates the generation of the prediction vector 219 having N elements by the final processing layer of the neural network provided by the processing circuitry 111 of the data processing apparatus 110 according to an embodiment. As illustrated in figure 2b, the final processing layer of the neural network operated by the processing circuitry 111 of the data processing apparatus 110 is configured to apply a first parameter tensor 213a to the input data vector 211 for obtaining a first factor 215a, i.e. a component, of an approximation tensor 217. In the embodiment shown in figure 2b, the approximation tensor 217 is a matrix, i.e. has two dimensions. However, as will be described in more detail further below, in other embodiments the approximation tensor 217 may have more than two dimensions, for instance, three or four dimensions.
Likewise, the final processing layer of the neural network operated by the processing circuitry 111 of the data processing apparatus 110 is configured to apply a second parameter tensor 213b to the input data vector 211 for obtaining a second factor 215b, i.e. a component, of the approximation matrix 217. In embodiments with an approximation tensor 217 having more than two dimensions, the processing circuitry 111 of the data processing apparatus 110 implementing the final processing layer of the neural network may be configured to apply one or more further parameter tensors to the input data vector 211 for obtaining one or more further factors, i.e. components, of the approximation tensor 217. Moreover, the final processing layer of the neural network operated by the processing circuitry 111 of the data processing apparatus 110 is configured to generate the prediction vector 219 on the basis of the approximation matrix 217, wherein the approximation matrix 217 is a matrix product of the first factor 215a of the approximation matrix 217 and the second factor 215b of the approximation matrix 217. In embodiments with an approximation tensor 217 having more than two dimensions, the processing circuitry 111 of the data processing apparatus 110 implementing the final processing layer of the neural network may be configured to generate the prediction vector 219 on the basis of the approximation tensor 217, wherein the approximation tensor 217 is the tensor product of the first factor 215a of the approximation tensor 217, the second factor 215b of the approximation tensor 217 and one or more further factors of the approximation tensor 217.
In the embodiment shown in figure 2b, the first factor 215a of the approximation matrix 217 comprises √N x R elements and the second factor 215b of the approximation matrix 217 comprises R x √N elements, wherein R denotes an adjustable integer approximation parameter that allows managing the trade-off between the number of parameters and the accuracy of the neural network model, as will be described in further detail in the context of figure 5 below. In this case the approximation matrix 217, being the matrix product of the first factor 215a and the second factor 215b, comprises √N x √N elements.
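By way of a purely illustrative calculation (the values of D, N and R below are assumptions chosen for the example, not values given in this disclosure), the following Python snippet contrasts the parameter count of a conventional dense final layer with that of the factorized layer:

import math

D, N, R = 256, 1_000_000, 100        # input width, number of outputs, rank (assumed)
sqrt_n = math.isqrt(N)               # sqrt(N) = 1000 for this choice of N

conventional = D * N                 # dense final layer: D x N weights
lraa = 2 * sqrt_n * R * D            # two parameter tensors, each sqrt(N) x R x D
print(f"conventional: {conventional:,} parameters")  # 256,000,000
print(f"LRAA, R={R}: {lraa:,} parameters")           # 51,200,000

For this illustrative choice, the factorized layer thus requires roughly a fifth of the parameters of the conventional layer, and the ratio improves further as N grows.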
In further embodiments, the first factor 215a of the approximation matrix 217 may comprise M x R elements and the second factor 215b of the approximation matrix 217 may comprise R x M elements, wherein M denotes an integer larger than √N (and smaller than N). For such embodiments the approximation matrix 217 comprises M x M elements.
In an embodiment, the first parameter tensor 213a comprises M x R x D elements and the second parameter tensor 213b comprises R x M x D elements (with the case M = √N illustrated in figure 2b).
In an embodiment, the processing circuitry 111 of the data processing apparatus 110 is configured to generate the prediction vector 219 on the basis of a sum of the approximation matrix 217 and a bias matrix (not shown in figure 2b), wherein the bias matrix is the matrix product, i.e. tensor product of a first factor of the bias matrix and a second factor of the bias matrix, wherein the first factor of the bias matrix has the same size as the first factor 215a of the approximation matrix 217 and the second factor of the bias matrix has the same size as the second factor 215b of the approximation matrix 217.
The processing steps performed by the final processing layer of the neural network provided by the processing circuitry 111 of the data processing apparatus 110, as illustrated in figure 2b, may be summarized as follows.
In a first processing stage, the activations from the previous processing layer of size D, i.e. the elements of the input data vector 211 are multiplied using matrix multiplication by the first parameter tensor 213a and the second parameter tensor 213b, respectively. As a result, the two matrices 215a, 215b, i.e. the first factor 215a and the second factor 215b of the approximation matrix 217 are obtained.
In a further processing stage, these two matrices 215a, 215b, i.e. the first factor 215a and the second factor 215b of the approximation matrix 217, are multiplied together along the dimension R. As a result, the approximation matrix 217 with dimensions √N x √N (or M x M) is obtained.
As already described above, in a further processing stage two bias matrices, i.e. the first factor of the bias matrix and the second factor of the bias matrix, may also be multiplied along the dimension R, which results in a bias matrix having the same dimensions as the approximation matrix 217. The bias matrix may be added to the approximation matrix to obtain the biased approximation matrix 217.
In a further processing stage, the processing circuitry 111 of the data processing apparatus 110 may be configured to reshape the (biased) approximation matrix 217 row-wise or column-wise into the prediction vector 219, i.e. by concatenating the rows or the columns of the approximation matrix 217 for generating the prediction vector 219.
In case the prediction vector 219 resulting from the previous processing stage has more than N elements, the processing circuitry 111 of the data processing apparatus 110 may be further configured to reduce the number of elements of the prediction vector 219 to N elements by pruning the prediction vector 219, i.e. by removing one or more elements of the prediction vector 219 starting from either end thereof. In an embodiment, the processing circuitry 111 of the data processing apparatus 110 may output the final prediction vector 219 for providing predictions in an inference mode of the neural network operated by the data processing apparatus 110. In a further embodiment, the final prediction vector 219 may be used by the processing circuitry 111 operating the neural network in a training mode as input to a cost function for training the elements of the first parameter tensor 213a and the elements of the second parameter tensor 213b (as well as the corresponding bias tensors, if present). In an embodiment, the processing circuitry 111 of the data processing apparatus 110 may output the prediction vector 219 in the training mode into a binary cross-entropy loss function which acts independently on each output dimension. Other loss functions may be used as well. Backpropagation may be used in the usual way. As already described above, in the inference mode or phase, the elements of the prediction vector 219 may be interpreted as probabilities (scores) indicating, for instance, how much a user likes an item.
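The processing stages described above may be illustrated by the following minimal Python/NumPy sketch. The sizes, the variable names and the use of random parameter tensors are assumptions made for the example only; it is a sketch of the data flow, not a definitive implementation of the claimed apparatus:

import numpy as np

rng = np.random.default_rng(0)

D, N, R = 64, 100, 8                 # illustrative sizes (assumptions)
M = int(np.ceil(np.sqrt(N)))         # M = ceil(sqrt(N)) = 10, so M * M = 100 = N

# Parameter tensors of the final layer (trained in practice; random here).
W1 = rng.standard_normal((M, R, D))  # first parameter tensor,  M x R x D
W2 = rng.standard_normal((R, M, D))  # second parameter tensor, R x M x D
b1 = rng.standard_normal((M, R))     # first factor of the bias matrix
b2 = rng.standard_normal((R, M))     # second factor of the bias matrix

h = rng.standard_normal(D)           # activations of the previous layer

# Stage 1: contract each parameter tensor with the input along the dimension D.
F1 = np.einsum('mrd,d->mr', W1, h)   # first factor,  M x R
F2 = np.einsum('rmd,d->rm', W2, h)   # second factor, R x M

# Stage 2: multiply the two factors along the dimension R.
A = F1 @ F2                          # approximation matrix, M x M

# Stage 3: add the factorized bias matrix.
A = A + b1 @ b2                      # biased approximation matrix

# Stage 4: reshape row-wise and prune to exactly N scores.
prediction = A.reshape(-1)[:N]       # prediction vector of size N
print(prediction.shape)              # (100,)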
As will be appreciated, contrary to the conventional processing scheme illustrated in figure 2a, where the number of parameters depends linearly on N, in the processing scheme illustrated in figure 2b the number of parameters depends linearly on √N and, thus, increases at a slower rate for an increasing number of outputs N. Thus, the final processing layer of the neural network provided by the processing circuitry 111 of the data processing apparatus 110 may be considered to constitute a low rank activation approximation (LRAA) layer that solves the problem of linear size scalability of a conventional output embedding layer.
A further illustration of the general idea implemented by the final processing layer of the neural network provided by the processing circuitry 111 of the data processing apparatus 110 is provided in figure 3. As illustrated in figure 3, the prediction vector 219 is reshaped into the approximation matrix 217 of size √N x √N (or M x M). The approximation matrix 217 is approximated by the low rank decomposition into a multiplication of the two low-dimensional factors 215a, 215b of hidden dimensionality R. For the special case of R = 1 the low rank decomposition can also be viewed as an outer product of two vectors.
The low rank approximation illustrated in figures 2b and 3 and implemented by the processing circuitry 111 of the data processing apparatus 110 may have one or more of the following properties. The approximation is easily reversible. For instance, for two low rank matrices (which may be the outputs of a neural network), a vector with scores of the original dimension may be obtained by performing the steps illustrated in figures 2b and 3 in the inverse order. Moreover, as already described above, the dimensions of the approximation matrix 217 do not necessarily have to be √N x √N, but may be M x M with M denoting an integer equal to or larger than √N. In this case, when the approximation matrix 217 is reshaped into the prediction vector 219, these extra elements may be cropped, i.e. pruned, so that the prediction vector 219 has exactly the size N. As will be appreciated, the reduction of the number of parameters provided by embodiments of the data processing apparatus 110 may be largest when the dimensions of the approximation matrix 217 are equal to or close to √N x √N.
In a further embodiment, the dimensions of the approximation matrix 217 may be M1 x M2, with M1 and M2 denoting integers equal to or larger than √N and M1 · M2 ≥ N. The processing circuitry 111 of the data processing apparatus 110 may be configured to determine M1 and M2 by decomposing N into integer factors.
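By way of illustration, one conceivable way of carrying out such a decomposition is sketched below in Python. The heuristic shown (taking the divisor pair of N closest to √N, with a padded square grid as a fallback for, e.g., prime N) is an assumption made for the example, not an algorithm prescribed by this disclosure:

import math

def grid_dims(n: int) -> tuple[int, int]:
    # Assumed heuristic: pick the divisor pair of n closest to sqrt(n); if n
    # has no useful factorization (e.g. n is prime), pad to a square grid and
    # rely on the cropping step to discard the extra elements.
    for m1 in range(math.isqrt(n), 1, -1):
        if n % m1 == 0:
            return m1, n // m1
    m = math.ceil(math.sqrt(n))
    return m, m                      # m * m >= n, the excess is pruned later

print(grid_dims(1000))   # (25, 40), since 25 * 40 = 1000
print(grid_dims(997))    # (32, 32), since 997 is prime and 32 * 32 = 1024 >= 997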
Because in the embodiments described above the approximation matrix 217 is estimated by the multiplication of two lower-dimensional matrices, namely the first factor 215a and the second factor 215b, the elements of the approximation matrix 217 are linearly dependent, as illustrated in figure 4. For instance, the elements in the first row 411 of the approximation matrix 217 are linearly dependent on the elements of the first row 401 of the first factor 215a. Likewise, the elements in the second column 413 of the approximation matrix 217 are linearly dependent on the elements of the second column 403 of the second factor 215b. Further embodiments of the data processing apparatus 110, which will be described in the following, make advantageous use of this property of the elements of the approximation matrix 217 for allowing the processing circuitry 111 to determine the approximation matrix 217 more efficiently.
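This rank property may be verified numerically. The following short sketch (with arbitrary example sizes, assumed for illustration) confirms that the product of the two factors has rank R and that each row of the approximation matrix is the corresponding linear combination of the rows of the second factor:

import numpy as np

rng = np.random.default_rng(1)
M, R = 10, 3
F1 = rng.standard_normal((M, R))
F2 = rng.standard_normal((R, M))
A = F1 @ F2

# Every row of A is a linear combination of the R rows of F2, with the
# coefficients given by the corresponding row of F1, so A has rank R.
print(np.linalg.matrix_rank(A))      # 3
row0 = F1[0] @ F2                    # reconstruct the first row from F1[0]
print(np.allclose(row0, A[0]))       # True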
In an embodiment, the data processing apparatus 110 may be configured to use item features for rearranging the elements, i.e. items, of the approximation matrix 217. In a first embodiment, the processing circuitry 111 of the data processing apparatus 110 may be configured to organize items in a matrix having the same dimensions as the approximation matrix 217 using item features. Given item features, items can be projected to a 2D space by any known algorithm. Thereafter, each item can be mapped to a certain position in the approximation matrix 217. The principle of mapping is that items which are close in the 2D space should be close in the approximation matrix 217. This determines the enumeration of items, since all positions in the approximation matrix 217 can be enumerated by integers. In a second embodiment, the processing circuitry 111 of the data processing apparatus 110 may be configured to use the popularity of items in the training data to organize them in a matrix having the same dimensions as the approximation matrix 217. In other words, in the second embodiment, the items are enumerated based on their popularity or another type of score (for instance, the most popular item has number 1, the second most popular item has number 2 and so forth). This enumeration determines the position of an item in the approximation matrix 217. Each of these embodiments will be described in the following in more detail.
In the first embodiment, the processing circuitry 111 of the data processing apparatus 110 may be configured to divide the items, using item features, into √N groups, wherein each group comprises √N items having the same or similar features. As will be appreciated, these numbers should be understood in an approximate sense, since they depend on the dimensions of the approximation matrix 217, which may be larger than √N x √N (as described above). As items in the same group have the same or similar features, the processing circuitry 111 of the data processing apparatus 110 may arrange items in the same group in the same row (or column) of the approximation matrix 217. For determining the order of items within the same group, the processing circuitry 111 of the data processing apparatus 110 may be configured to take at least a second possible grouping of the items into account (possibly using other features). Preferably, the second grouping of the items is in some sense orthogonal to the first grouping. This second grouping defines the order of items in each row of the approximation matrix 217, which is equivalent to defining which column of the approximation matrix 217 each item belongs to. In other words, the processing circuitry 111 of the data processing apparatus 110 is configured to sort the items of the approximation matrix 217 first with respect to the first grouping and then with respect to the second grouping, for instance, by means of a sorting algorithm. As a result, the items of the approximation matrix 217 are sorted, i.e. rearranged, with respect to the first grouping, and within each such group they are sorted, i.e. rearranged, with respect to the second grouping.
Thus, according to the first embodiment, the position of an item in the approximation matrix may be determined by the processing circuitry 111 of the data processing apparatus 110 on the basis of at least two "orthogonal" groupings. In an embodiment, these groupings can be provided by the cloud server 130, which may be the central server of a Federated Learning system, and the data processing apparatuses 110 receive the groupings, i.e. the arrangement of the items in the approximation matrix 217, from the cloud server 130. In an embodiment, these groupings can be obtained by projecting items to a 2D space using, for instance, one of the following methods: Principal Component Analysis (PCA), Independent Component Analysis (ICA), Uniform Manifold Approximation and Projection (UMAP), t-distributed Stochastic Neighbor Embedding (t-SNE), or autoencoders.
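For illustration, a minimal sketch of this first embodiment is given below, using a PCA projection computed via an SVD (any of the methods named above could be substituted). The grid-assignment heuristic, sorting items into row groups by the first coordinate and ordering each group by the second coordinate, is one possible realization assumed for the example:

import numpy as np

rng = np.random.default_rng(2)
n_items, n_features, M = 16, 8, 4    # illustrative sizes; M * M = n_items

X = rng.standard_normal((n_items, n_features))   # item feature vectors

# Project items to 2D with PCA via an SVD of the centered feature matrix.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
coords = Xc @ Vt[:2].T                           # n_items x 2

# Map the two coordinates to grid cells: sort items by the first coordinate
# into M row groups, then sort each group by the second coordinate to fix
# the column order within the row.
rows = np.argsort(coords[:, 0])
position = np.empty(n_items, dtype=int)          # item -> cell in the M x M grid
for g in range(M):
    group = rows[g * M:(g + 1) * M]              # items of row group g
    group = group[np.argsort(coords[group, 1])]  # order within the group
    position[group] = g * M + np.arange(M)

print(position)   # flat (row-major) grid index assigned to each item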
In the second embodiment, the processing circuitry 111 of the data processing apparatus 110 may be configured to enumerate, i.e. order, the items on the basis of the popularity, i.e. score, of the items in the training data, so that the most popular item has number 1, the second most popular item has number 2 and so forth. On the basis of this numbering the processing circuitry 111 can determine the approximation matrix 217 more efficiently, because in this case the items in the first row (or column) of the approximation matrix 217 should have larger scores than the items in the second row and so forth.
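A minimal sketch of this popularity-based enumeration may look as follows; the popularity counts are hypothetical values assumed for the example:

import numpy as np

# Hypothetical popularity counts from the training data (assumed).
popularity = np.array([120, 950, 40, 330, 610])

# Enumerate items so the most popular gets position 0, the next position 1, ...
order = np.argsort(-popularity)          # item ids from most to least popular
position = np.empty_like(order)
position[order] = np.arange(len(order))  # item -> position in the matrix
print(order)      # [1 4 3 0 2]
print(position)   # [3 0 4 2 1]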
Figure 5 is a diagram illustrating the performance of different neural networks operated by the data processing apparatus 110 according to an embodiment in comparison with a conventional neural network. As can be taken from figure 5 for different choices of the adjustable parameter R (i.e. R = 7, 100, 500 and 1000), the number of parameters and, thus, the size of the neural network scales less than linearly with N. The parameter R allows controlling the neural network model size.
Figure 6 is a flow diagram illustrating steps of a data processing method 600 according to an embodiment. The data processing method 600 illustrated in figure 6 can be performed by the data processing apparatus 110 described above. Thus, further features of the data processing method 600 result directly from the functionality of the data processing apparatus 110 described above.
In a first step 601 of the data processing method 600 a neural network, in particular a MONN, is operated, wherein the neural network comprises a plurality of processing layers for sequentially processing data, including a current processing layer, for instance, a final processing layer of the plurality of processing layers. The input data of the current processing layer is an input data vector 211 and the output data provided by the current processing layer is an output data vector 219 depending on the input data vector 211 and a plurality of parameters of the current processing layer.
A further step 603 of the data processing method 600 comprises applying a first parameter tensor 213a to the input data vector 211 for obtaining a first factor 215a of an approximation tensor 217, in particular an approximation matrix 217, and a second parameter tensor 213b to the input data vector 211 for obtaining a second factor 215b of the approximation tensor 217, in particular the approximation matrix 217.
In a further step 605 of the data processing method 600 the output data vector 219 is generated on the basis of the approximation tensor 217 (in particular approximation matrix 217), wherein the approximation tensor 217 (in particular approximation matrix 217) is a tensor product of at least the first factor 215a of the approximation tensor 217 (in particular approximation matrix 217) and the second factor 215b of the approximation tensor 217 (in particular approximation matrix 217).
As already described above, the current, in particular final, processing layer described herein may be generalized to higher dimensional tensors. In particular, the prediction vector 219 can be reshaped to a three-dimensional tensor (in contrast to a matrix, which is a two-dimensional tensor), a four-dimensional tensor and so forth. Then, there may be several parameter tensors with different numbers of dimensions. The only requirement is that, after the parameter tensors are multiplied with the activations from the previous layer and multiplied with each other, the resulting tensor can be reshaped to a vector of size N or "slightly" larger. Multiplication here means multiplication of tensors. For example, the final processing layer may have three parameter tensors of the same shape, i.e. with the dimensions (∛N, R, D). After multiplying each of them by the input data vector along the last dimension, three parts, i.e. factors, of the approximation tensor 217 are obtained. Each part has the shape (∛N, R). Finally, the approximation tensor 217 may be obtained by multiplying all three factors along the last dimension. If these three tensors are denoted as A, B and C, then the tensor multiplication in the Einstein notation is AijBkjCmj. The result of the product is the approximation tensor 217 of the shape (∛N, ∛N, ∛N). The approximation tensor 217 is the analog of the approximation matrix 217 described above and can be reshaped to the prediction vector 219. Note that the items' enumeration based on item features should then project the item features into a 3D space; the procedure generalizes straightforwardly.
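For illustration, the three-factor case may be sketched as follows in Python/NumPy; all sizes are assumptions made for the example:

import numpy as np

rng = np.random.default_rng(3)
N = 27                               # illustrative output size, a cube of 3
K = round(N ** (1 / 3))              # K = cbrt(N) = 3
R, D = 4, 16

# Three parameter tensors of identical shape (K, R, D), as in the example.
A_p, B_p, C_p = (rng.standard_normal((K, R, D)) for _ in range(3))
h = rng.standard_normal(D)           # activations of the previous layer

# Contract each parameter tensor with the input along the dimension D.
A = np.einsum('krd,d->kr', A_p, h)   # K x R
B = np.einsum('krd,d->kr', B_p, h)
C = np.einsum('krd,d->kr', C_p, h)

# Multiply the three factors along the shared dimension R
# (AijBkjCmj in Einstein notation) to obtain a K x K x K tensor.
T = np.einsum('ij,kj,mj->ikm', A, B, C)
prediction = T.reshape(-1)[:N]       # reshape to a vector of size N
print(T.shape, prediction.shape)     # (3, 3, 3) (27,)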
The person skilled in the art will understand that the "blocks" ("units") of the various figures (method and apparatus) represent or describe functionalities of embodiments of the present disclosure (rather than necessarily individual "units" in hardware or software) and thus describe equally functions or features of apparatus embodiments as well as method embodiments (unit = step). In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus, and method may be implemented in other manners. For example, the described embodiment of an apparatus is merely exemplary. For example, the unit division is merely logical function division and may be another division in an actual implementation. For example, a plurality of units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the displayed or discussed mutual couplings or direct couplings or communication connections may be implemented by using some interfaces. The indirect couplings or communication connections between the apparatuses or units may be implemented in electronic, mechanical, or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, functional units in the embodiments of the invention may be integrated into one processing unit, or each of the units may exist alone physically, or two or more units are integrated into one unit.

Claims

1. A data processing apparatus (110), comprising: processing circuitry (111) configured to operate a neural network, wherein the neural network comprises a plurality of processing layers for sequentially processing data, including a current processing layer, wherein input data of the current processing layer is an input data vector (211) and wherein output data provided by the current processing layer is an output data vector (219) depending on the input data vector (211) and a plurality of parameters of the current processing layer, wherein the current processing layer is configured to: apply a first parameter tensor (213a) to the input data vector (211) for obtaining a first factor (215a) of an approximation tensor (217) and a second parameter tensor (213b) to the input data vector (211) for obtaining a second factor (215b) of the approximation tensor (217); and generate the output data vector (219) on the basis of the approximation tensor (217), wherein the approximation tensor (217) is a tensor product of at least the first factor (215a) of the approximation tensor (217) and the second factor (215b) of the approximation tensor (217).
2. The data processing apparatus (110) of claim 1, wherein the current processing layer is a final processing layer of the plurality of processing layers and wherein the output data vector (219) is a prediction vector (219).
3. The data processing apparatus (110) of claim 1 or 2, wherein the approximation tensor (217) is an approximation matrix (217) and wherein the approximation matrix (217) is a matrix product of the first factor (215a) of the approximation matrix (217) and the second factor (215b) of the approximation matrix (217).
4. The data processing apparatus (110) of claim 3, wherein the output data vector (219) comprises N elements and the approximation matrix (217) comprises M x M elements, wherein M denotes an integer equal to or larger than √N.
5. The data processing apparatus (110) of claim 4, wherein the first factor (215a) of the approximation matrix (217) comprises M x R elements and the second factor (215b) of the approximation matrix (217) comprises R x M elements, wherein R denotes an integer approximation parameter.
6. The data processing apparatus (110) of claim 5, wherein the input data vector (211) comprises D elements, wherein the first parameter tensor (213a) comprises M x R x D elements and the second parameter tensor (213b) comprises R x M x D elements.
7. The data processing apparatus (110) of claim 5 or 6, wherein the processing circuitry (111) of the data processing apparatus (110) is configured to adjust the integer approximation parameter R.
8. The data processing apparatus (110) of any one of claims 3 to 7, wherein the processing circuitry (111) of the data processing apparatus (110) is configured to generate the output data vector (219) by a row-wise or a column-wise reshaping of the approximation matrix (217).
9. The data processing apparatus (110) of claim 8, wherein the processing circuitry (111) of the data processing apparatus (110) is configured to prune the output data vector (219).
10. The data processing apparatus (110) of any one of the preceding claims, wherein the processing circuitry (111) of the data processing apparatus (110) is configured to generate the output data vector (219) on the basis of a sum of the approximation tensor (217) and a bias tensor, wherein the bias tensor is the tensor product of at least a first factor of the bias tensor and a second factor of the bias tensor, wherein the first factor of the bias tensor has the same size as the first factor (215a) of the approximation tensor (217) and the second factor of the bias tensor has the same size as the second factor (215b) of the approximation tensor (217).
11. The data processing apparatus (110) of any one of the preceding claims, wherein the processing circuitry (111) of the data processing apparatus (110) is further configured to adjust the order of the elements of the approximation tensor (217) for obtaining a reordered approximation tensor and to generate the output data vector (219) on the basis of the reordered approximation tensor.
12. The data processing apparatus (110) of claim 11, wherein each element of the approximation tensor (217) is associated with a respective item of a plurality of items and wherein the processing circuitry (111) of the data processing apparatus (110) is further configured to adjust the order of the elements of the approximation tensor (217) on the basis of information about a respective score of a respective item.
13. The data processing apparatus (110) of claim 11, wherein each element of the approximation tensor (217) is associated with a respective item of a plurality of items, wherein each item is defined by one or more item features and wherein the processing circuitry (111) of the data processing apparatus (110) is further configured to adjust the order of the elements of the approximation tensor (217) on the basis of the one or more item features of each of the plurality of items.
14. The data processing apparatus (110) of any one of the preceding claims, wherein the processing circuitry (111) is configured to operate the neural network in a training mode, wherein in the training mode the processing circuitry (111) is configured to train the elements of the first parameter tensor (213a) and the elements of the second parameter tensor (213b).
15. A data processing method (600), comprising: operating (601) a neural network, wherein the neural network comprises a plurality of processing layers for sequentially processing data, including a current processing layer, wherein input data of the current processing layer is an input data vector (211) and wherein output data provided by the current processing layer is an output data vector (219) depending on the input data vector (211) and a plurality of parameters of the current processing layer; applying (603) a first parameter tensor (213a) to the input data vector (211) for obtaining a first factor (215a) of an approximation tensor (217) and a second parameter tensor (213b) to the input data vector (211) for obtaining a second factor (215b) of the approximation tensor (217); and generating (605) the output data vector (219) on the basis of the approximation tensor (217), wherein the approximation tensor (217) is a tensor product of at least the first factor (215a) of the approximation tensor (217) and the second factor (215b) of the approximation tensor (217).
16. A computer program product comprising a computer-readable storage medium for storing program code which causes a computer or a processor to perform the method (600) of claim 15, when the program code is executed by the computer or the processor.
PCT/EP2020/073001 2020-08-17 2020-08-17 Data processing apparatus and method for operating multi-output neural networks WO2022037756A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/EP2020/073001 WO2022037756A1 (en) 2020-08-17 2020-08-17 Data processing apparatus and method for operating multi-output neural networks
PCT/EP2021/069676 WO2022037856A1 (en) 2020-08-17 2021-07-15 Federated learning server and method for a federated recommender system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2020/073001 WO2022037756A1 (en) 2020-08-17 2020-08-17 Data processing apparatus and method for operating multi-output neural networks

Publications (1)

Publication Number Publication Date
WO2022037756A1 true WO2022037756A1 (en) 2022-02-24

Family

ID=72145398

Family Applications (2)

Application Number Title Priority Date Filing Date
PCT/EP2020/073001 WO2022037756A1 (en) 2020-08-17 2020-08-17 Data processing apparatus and method for operating multi-output neural networks
PCT/EP2021/069676 WO2022037856A1 (en) 2020-08-17 2021-07-15 Federated learning server and method for a federated recommender system

Family Applications After (1)

Application Number Title Priority Date Filing Date
PCT/EP2021/069676 WO2022037856A1 (en) 2020-08-17 2021-07-15 Federated learning server and method for a federated recommender system

Country Status (1)

Country Link
WO (2) WO2022037756A1 (en)

Also Published As

Publication number Publication date
WO2022037856A1 (en) 2022-02-24

Similar Documents

Publication Publication Date Title
KR102545128B1 (en) Client device with neural network and system including the same
JP2022540550A (en) Systems and methods for reading and writing sparse data in neural network accelerators
JP5235666B2 (en) Associative matrix method, system and computer program product using bit-plane representation of selected segments
CN111898703B (en) Multi-label video classification method, model training method, device and medium
US20210110269A1 (en) Neural network dense layer sparsification and matrix compression
CN112232165B (en) Data processing method, device, computer and readable storage medium
CN116580257A (en) Feature fusion model training and sample retrieval method and device and computer equipment
CN115017178A (en) Training method and device for data-to-text generation model
CN112966729B (en) Data processing method and device, computer equipment and storage medium
CN114119997A (en) Training method and device for image feature extraction model, server and storage medium
CN116777727B (en) Integrated memory chip, image processing method, electronic device and storage medium
CN116797850A (en) Class increment image classification method based on knowledge distillation and consistency regularization
Guo et al. Efficient convolutional networks learning through irregular convolutional kernels
CN116108752A (en) Model compression method, device, electronic equipment and storage medium
WO2022037756A1 (en) Data processing apparatus and method for operating multi-output neural networks
CN116384471A (en) Model pruning method, device, computer equipment, storage medium and program product
Liu et al. Margin-based two-stage supervised hashing for image retrieval
US11615320B1 (en) Method, product, and apparatus for variable precision weight management for neural networks
CN113780324A (en) Data processing method and device, electronic equipment and storage medium
Zhang et al. ProLFA: Representative prototype selection for local feature aggregation
Li et al. Overview of deep convolutional neural network pruning
US20220121926A1 (en) Tensor ring decomposition for neural networks
US20220382741A1 (en) Graph embeddings via node-property-aware fast random projection
US11899745B1 (en) Systems and methods for speech or text processing using matrix operations
Zheng et al. Learning multiple linear manifolds with self-organizing networks

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 20757890
    Country of ref document: EP
    Kind code of ref document: A1
NENP Non-entry into the national phase
    Ref country code: DE
122 Ep: pct application non-entry in european phase
    Ref document number: 20757890
    Country of ref document: EP
    Kind code of ref document: A1