WO2019035770A1 - Adaptive computer system, and methods and apparatus for training the adaptive computer system - Google Patents

Adaptive computer system, and methods and apparatus for training the adaptive computer system

Info

Publication number: WO2019035770A1
Authority: WO (WIPO PCT)
Application number: PCT/SG2018/050419
Other languages: French (fr)
Inventors: Aakash Shantaram PATIL, Arindam Basu
Original assignee: Nanyang Technological University
Prior art keywords: values, sum, multiplicative, computer system, sum values
Application filed by: Nanyang Technological University
Publication of: WO2019035770A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons, using electronic means
    • G06N 3/08: Learning methods

Definitions

  • the present invention relates to an adaptive computer system incorporating a multiplicative stage, and an adaptive layer defined by variable parameters.
  • the invention further relates to methods for training the computer system, and operating the trained system.
  • the adaptive computer system is thus a two layer neural network.
  • the first layer (the VLSI random projection network) includes multiplicative units for receiving respective components of the input vector.
  • Each multiplicative unit is fabricated using a CMOS (complementary metal-oxide- semiconductor) fabrication process, and has random fixed input weights due to inherent transistor random mismatch.
  • CMOS complementary metal-oxide- semiconductor
  • the ELM provides massive parallelism and programmability, and is a very power efficient solution to perform multiplication-and-accumulation (MAC) operations.
  • the method performed by the adaptive computer system (also referred to as the ELM algorithm) is a two layer neural feed-forward network with L hidden neurons having an activation function g: R → R [1].
  • the network includes an input layer (which is not counted as one of the two layers) containing D input neurons with associated values x_1, x_2, ..., x_D, which can also be denoted as a vector x with D components.
  • D is the dimension of the input to the network.
  • the outputs of these D input neurons are input to a multiplicative section comprising a hidden layer of L hidden neurons having an activation function g: R → R.
  • the values {β_1, ..., β_L} are variable parameters obtained by training.
  • the value g(w_j^T x + b_j) may be referred to as the activation h_j of the j-th neuron of the hidden layer.
  • a sigmoidal form of g( ) is assumed though other functions have also been used.
  • variable parameters can be implemented in digital circuits (e.g. of a conventional digital processor) that facilitate accurate tuning.
  • the fixed random input weights of the hidden neurons can be easily realized by a VLSI random projection network microchip comprising a plurality of multiplicative units. Each multiplicative unit is subject to random transistor mismatch, which commonly exists and becomes even more pronounced as modern deep sub-micrometer CMOS processes scale down.
  • the VLSI random projection network microchip can co-operate with a conventional digital processor to form the adaptive computer system of Fig. 1 as a machine learning system using ELM.
  • the architecture of the input layer and the hidden layer for the proposed classifier that exploits the D x L random weights of the input layer is shown in Fig. 2.
  • the random weights {b_j} arise from tolerances in the leak currents of the CCO neurons.
  • a decoder 10 receives input data to the computer system, and separates it into D data signals.
  • the VLSI random projection network consists of three parts: (a) input handling circuits (IHCs) to convert digital input to analog current, (b) a current mirror synapse array 11 of synapses (multiplication units) for multiplication of input current with a random weight, and (c) a current-controlled-oscillator (CCO) neuron based ADC 12 for summing up along columns.
  • a single hidden-layer neuron comprises a column of analog circuits (each of which is labelled a synapse in Fig. 2, and acts as a multiplicative unit) and a sum unit (the CCO and corresponding counter) to generate a sum value which is the activation.
  • the hidden neuron also includes a portion of the functionality of a processing unit (e.g. a digital signal processor) to calculate the output of the hidden neuron from the activation value.
  • if n-bit digital data are used as the input of the IHCs, the IHCs directly convert it into the input current for the current mirror synapse array by an n-bit DAC.
  • Different preprocessing circuits can be implemented in the IHCs to extract features from various input signals.
  • the CCO neurons which perform the ADC each consist of a neural CCO and a counter. They convert output current from each column of the current mirror synapse array into a digital number, corresponding to hidden layer output of the ELM. The hidden layer output is transmitted out of the microchip for further processing.
  • the circuit diagram of CCO-neuron is presented in Fig. 2.
  • the output of the CCO-neuron is a pulse frequency modulated digital signal with frequency proportional to the input current I_in.
  • a digital signal processor (DSP) is usually provided as an output layer of the ELM computer system.
  • the DSP receives the sum values from the VLSI random projection networks, obtains the corresponding outputs of the hidden layer neurons, and generates final outputs by further processing the data, for instance, passing it through an output stage operation which comprises an adaptive neural network with one or more output neurons, associated with respective variable parameters.
  • the DSP thus implements an adaptive network.
  • the adaptive network is trained to perform a computational task. Normally, this is supervised learning in which sets of training input signals are presented to the decoder 10, and the adaptive network is trained to generate corresponding outputs. Once the training is over, the entire adaptive computer system (i.e. the portion shown in Fig. 2 plus the DSP) is used to perform useful computational tasks.
  • VLSI random projection network of [1] is not the only known implementation of an ELM.
  • Another way of implementing an ELM is for the multiplicative section of the adaptive model (and indeed optionally the entire adaptive model) to be implemented in a digital system by a set of one or more digital processors.
  • the fixed numerical parameters of the hidden neurons may be defined by respective numerical values stored in a memory of the digital system.
  • the numerical values may be randomly-set, such as by a pseudo-random number generator algorithm.
  • the dimension of the input data is quite large (more than a few thousand data values).
  • the network requires a large number of hidden layer neurons (also more than a few thousands) to achieve the best performance. This poses a big challenge to the hardware implementation. This is true both in the case that the ELM is implemented using the tolerances of electrical components to implement the random numerical parameters of the hidden neurons, and in the case that the random numerical parameters are stored in the memory of a digital system. For example, if the required input dimension for a given application is D, and the adaptive model requires L hidden layer neurons, conventionally, at least D*L random projections are needed for classification. Each neuron requires D random weights.
  • citation [2] proposes that the input layer of the computer system provides a controllable mapping of the input data values to the hidden neuron inputs, and/or the output layer provides a controllable mapping of the hidden neuron outputs to the neurons of the output layer. This makes it possible to re-use the hidden neurons, so as to increase the effective input dimensionality of the computer system, and/or the effective number of neurons.
  • the present invention proposes increasing the number of random feature values available for use in the output layer by forming new feature values by arithmetic combinations of the outputs of the sum units.
  • the invention proposes an adaptive computer system including a multiplication and summation section, such as one implemented using a VLSI integrated circuit, which processes data input to the model by multiplying inputs by fixed, random weights and summing them; a combination unit which forms feature values by combining respective sub-sets of the outputs of the multiplication and summation section as a function of a linear arithmetic combination of the respective sub-set of the outputs of the multiplication and summation section; and an adaptive output layer which combines the feature values using variable parameters (weights).
  • the variable parameters are trained, but the multiplication and summation section and the combination unit are not trained.
  • the number of feature values is greater than the number of outputs of the multiplication and summation section. This may mean that the effective number of feature values which can be obtained from a multiplication and summation section implemented by hardware with a limited number of random weights, can be increased. From another point of view, it means that, for a given number of feature values, the memory requirements needed to implement the multiplication and summation section may be reduced. Furthermore, the number of arithmetic operations, and thus the energy consumption, needed to implement the multiplication and summation section may be reduced.
  • the multiplication and summation section may have the same structure as the one explained above, in which each multiplication operation is performed by a respective one of an array of multiplication units.
  • the multiplication units may be nonprogrammable. That is, the corresponding multiplication operation is to multiply an input to the multiplication unit by a fixed multiplicative factor.
  • at least part of the multiplication and summation unit may be implemented as a VLSI integrated circuit, having a respective electronic circuit for each multiplication operation.
  • the electronic circuit may be non-programmable, and the corresponding multiplicative factor may be random, according to tolerances in the fabrication of the electronic circuit.
  • the combination unit may be thought of as a virtual hidden layer, having a number of outputs (the feature values) which is greater than the number of inputs (the sum values).
  • the arithmetic operations correspond to respective neurons of the virtual hidden layer. Note however that, in contrast to a conventional multi-layer neural network in which each neuron of each layer performs a non-linear function of a summation value, the sum values output by the multiplication and summation section are combined linearly at the combination unit (e.g. as a weighted sum of the sum values, according to weights which may not all be equal).
  • the combination unit may be implemented as nonprogrammable electronic circuitry (analogue or digital), or using a digital signal processor which processes software to perform the function of the combination unit.
  • the respective sub-set of the sum values are combined as linear combinations.
  • the feature values input to the adaptive output layer (i.e. the feature values output by the virtual hidden layer) are functions of linear combinations of the sum values.
  • Citations [3]-[6] obtain W by training and then try to approximate it as Bα.
  • the present invention is directed to a random mapping where W can consist of any random numbers, and embodiments of the invention exploit this freedom to reduce energy. Depending on the extent to which this freedom is exploited, embodiments of the invention permit greater energy savings and/or allow a given number of feature values to be obtained for the output layer.
  • in citations [3]-[6], the matrix W was obtained by training, such that it is undesirable to approximate it.
  • the weights of the multiplication and summation section are random anyway.
  • the non-zero weights may all be expressible as ±2^n for corresponding values of n.
  • the value of n may be the same for all weights, or it may be higher for some weight(s) than for other(s).
  • the matrix α may be sparse, e.g. contain more (preferably, many more) elements which are zero than non-zero elements.
  • This will further reduce the complexity to D*N MAC operations + L add operations.
  • here r = a/m ≪ 1 because the energy requirement for add operations is much less than for MAC operations.
  • the invention may be used in a computer system of the type explained in the background section above, in which the multiplicative stage comprises a very-large-scale-integration (VLSI) integrated circuit including a plurality of multiplicative units which are analog circuits, each analog circuit performing multiplicative operations according to inherent tolerances of its components. Note that in the present case, however, there is no need for the values {b_j}, so that random leak currents due to the CCO are not needed.
  • the invention may be implemented as a dedicated digital multiplying circuit, or simulated using software by a digital processor, such as the arithmetic logic unit of a central processing unit (CPU).
  • the various aspects of the invention may be implemented within an ELM.
  • a first layer of the adaptive model employs fixed randomly-set parameters to perform multiplicative operations of the input signals, and the results are summed.
  • adaptive model is used in this document to mean a computer-implemented model defined by a plurality of numerical parameters, including at least some which can be modified.
  • the modifiable parameters are set (usually, but not always, iteratively) using training data illustrative of a computational task the adaptive model is to perform.
  • the present invention may be expressed in terms of a computer system employing the novel architecture.
  • the computer system includes at least one integrated circuit comprising electronic circuits (digital or analogue) having random tolerances used by the multiplication and summation section.
  • the multiplication and summation section may be implemented by software, e.g. using random values in a data structure stored in a memory unit.
  • the invention may also be expressed as a computer system including one or more digital processors for implementing the adaptive model (in this case, the computer system may be a personal computer (PC) or a server).
  • the invention may be expressed as a method or apparatus for training such a computer system, or even as program code (e.g. stored in a non-transitory manner in a tangible data storage device) for automatic performance of the method.
  • Fig. 1 shows the structure of a known ELM model
  • Fig. 2 shows the structure of a known machine-learning co-processor of the ELM model of Fig. 1
  • Fig. 3 shows the structure of a first embodiment of the invention which is a modified ELM network
  • Fig. 4 shows numerically the function of a virtual hidden layer (combination unit) of the embodiment of Fig. 3;
  • Fig. 5 shows a second embodiment which is a generalised form of the first embodiment
  • Fig. 6 shows a third embodiment which is a second generalised form of the first embodiment.
  • the embodiment is an adaptive computer system which comprises an input layer and a (first) hidden layer (referred to here as a multiplication and summation section) which have the same general form as the hidden layer of the known system of Fig. 1.
  • the input to the network is a vector x having D components.
  • W is a DxN dimensional matrix.
  • the input layer and multiplication and summation section may be implemented using the architecture of Fig. 2.
  • the number of IHCs is equal to D.
  • the multiplication and summation section may be at least partly implemented by an integrated circuit with an array (mirror synapse array) of respective multiplication units to implement the respective synapses.
  • each neuron includes a corresponding set of D synapses (that is, one column of synapses in the array 11).
  • the decoder 10 transmits the D input data values to the D respective IHCs.
  • the weights of the matrix W are fixed and random, as the result of tolerances in the electronic circuits which constitute the respective synapses in Fig. 2.
  • the embodiment of Fig. 3 includes an additional combination unit which implements a virtual hidden layer of L neurons downstream of the multiplication and summation section, where L > N (in the particular example shown in Fig. 3, N is 4 and L is 6).
  • C is the transpose of the matrix α described in the previous section.
  • in other words, H = g(α^T·h).
  • the values H are referred to as feature values; the number of feature values is equal to L.
  • the combination unit may be hardwired, but alternatively it may be implemented by a programmed digital signal processor. Due to the combination unit, there are DxL virtual random weights. Each virtual random weight is the weight which a corresponding component of the input vector x has in a corresponding component of C.h.
  • the matrix C denotes the transpose of the matrix α in the explanation given above. Its structure is as indicated in Fig. 4. In the case of Fig. 3, each row of C contains two non-zero values, and N-2 zero values. Thus, each neuron of the virtual hidden layer generates a respective feature value from a corresponding sub-set of the sum values, specifically a sub-set which consists of two of the sum values.
  • the feature value is the output of a corresponding arithmetic operation g(h_a, h_b) performed, by the virtual neuron, on the corresponding sub-set of the sum values. In particular, it is performed on a linear combination of the sub-set of the sum values.
  • the arithmetic operation g(h_a, h_b) may be a weighted sum of the sum values h_a and h_b or, more generally, a function, e.g. a non-linear function, of a weighted sum of the sum values h_a and h_b.
  • the function g may optionally encode a non-linear function of the weighted sum of the corresponding sub-set of the sum values, which represents a non-linear activation function of the corresponding neuron of the virtual hidden layer.
  • each of the neurons of the virtual hidden layer receives a respective sub-set (pair) of the sum values, and combines them using the corresponding arithmetic operation to generate the respective feature value.
  • the respective sub-set of sum values may comprise more than two sum values.
  • the adaptive computer system further includes an adaptive output layer which receives the L feature values, and multiplies them by L respective variable parameters which constitute the vector ⁇ .
  • the variable parameters are adaptively chosen, whereas the values for the matrices W (which define multiplicative operations) and C (which define arithmetic operations on the sum values output by the multiplication and summation section) are fixed.
  • the hidden layer requires L physical summation units, so at least DxL random projections are needed for classification.
  • random feature values are created by the combination unit combining respective sub-sets of the N sum values (the outputs of the multiplication and summation section) using an arithmetic operation.
  • both the corresponding sub-set of sum values, and the corresponding arithmetic operation are defined by the corresponding row of the matrix C.
  • Two efficient ways to implement C are as a pairwise average (i.e. the two non-zero entries of each row of C (i.e. each row of α^T) are the same, e.g. they may be equal to one half), or as a pairwise difference (i.e. one non-zero value is 1, and the other is -1).
  • Implementing C as the pairwise difference has the additional advantage that the virtual random weights nevertheless have zero mean, even in the case that the weights of the multiplication and summation section are not zero-mean.
  • the j-th neuron of the virtual hidden layer receives h_a and h_b.
  • it would be possible for the matrix C to have additionally one or more rows with only a single non-zero value.
  • the adaptive output layer would additionally employ feature values which are produced using only a single respective one of the sum values.
  • this is not the case (i.e. preferably all the feature values received by the adaptive output layer are formed from a respective plurality of the sum values).
  • the random weights associated with feature values produced from a single sum value may be statistically different from those produced using multiple sum values. They may, for example, have non-zero mean, where "mean" refers to the average over all training inputs.
  • in Fig. 5 a second embodiment of the invention is shown; it is a generalised form of the embodiment of Fig. 3.
  • in Fig. 6 another embodiment is shown. This differs from the embodiment of Fig. 3 only in that the output layer generates a plurality of output values.
  • the vector of variable parameters ⁇ in the embodiment of Fig. 3 is replaced by a matrix ⁇ . Respective columns of the matrix of variable parameters are used to generate respective outputs of the computer system.
  • the output layer of the embodiment of Fig. 6 has multiple output values, where each output value is obtained using the product of a respective one of the columns of the matrix β and the L outputs of the virtual hidden layer implemented by the combination unit.
  • Preferred embodiments of the invention are able to achieve speedup in operation due to a reduction in number of operations required.
  • the amount of the speed-up depends on the skewness (L/N) and sparsity (fraction of zero elements) in C.
  • the skewness may be at least 20, at least 50, at least 80, or even at least 100.
  • the sparsity may be at least 70%, at least 80%, at least 90%, at least 95% or even at least 98% (i.e. at least 98% of the elements of C are zero).
  • certain embodiments of the proposed method for random weights can achieve a speedup/reduction in the number of operations of greater than 100 times, with skewness as high as 100 and sparsity as high as 98%, for less than 1% loss in accuracy (a rough numerical sketch of this trade-off follows this list).
  • previous methods for trained weights [3] can achieve a speedup of only 4.5X for a 1% reduction in accuracy, since their sparsity and skewness are lower.
  • Table 1a shows simulation results which are the average classification error for 100 runs for common benchmarking datasets, while Table 1b shows the resources required.
  • the terms "25 original neurons” and "250 original neurons” refer to the commonly used cases of an input matrix of size Dx25 and Dx250 respectively in the known system of Fig. 1.
  • the proposed method in the case of "250 virtual neurons” allows about a factor of 10 in energy saving, and a factor of 10 in memory saving with negligible (0% to 1.3%) increase in error.
  • the column of energy values is independent of hardware used for implementation.
  • the embodiment of Fig. 3 has a lower energy requirement due to lower memory usage and fewer operations being required, and is thus superior to the known system of Fig. 1 for any hardware which needs 'm' pJ per multiply operation, and 'a' pJ per addition operation, provided that 'a' ≪ 'm'.
  • embodiments of the invention can potentially generate much more than 250 virtual random neurons, but the datasets tabulated in Table 1a do not exhibit much accuracy improvement for an even higher number of random neurons.
  • the embodiment has been tested using applications such as image classification which typically require a much higher number of random neurons for good accuracy. Using the embodiments will allow users to target complex applications even with resource constrained processors.
  • the terms “5120 original neurons” and “12800 original neurons” refer to the commonly used cases with an input matrix of size Dx5120 and Dx12800 respectively;
  • the term "12800 virtual neurons" refers to the corresponding case of the proposed embodiment;
  • the embodiment in the case of "5120 virtual neurons" has a 12% lower error with the same memory resource, and a negligible (factor of 1.005) increase in energy resource.
  • the energy result is independent of the hardware used for implementation.
  • the embodiment has a lower energy requirement due to lower memory usage and fewer operations being required, and is thus an improvement for any hardware which needs 'm' pJ per multiply operation, and 'a' pJ per addition operation, provided 'a' ≪ 'm'.
  • the embodiment has also been compared to the ELM of Fig. 1 suggested in [1] for classification of the MNIST dataset.
  • An embodiment of the present invention can be used in any random projection-based machine learning system, and in particular for any application requiring data-based decision making in low-power, such as the applications described in [1] and [2].
  • Implantable/Wearable Medical Devices: There has been a huge increase in wearable devices that monitor ECG/EKG/blood pressure/glucose level etc. in a bid to promote healthy and affordable lifestyles. Typically, these devices operate under a limited energy budget, with the biggest energy hog being the wireless transmitter.
  • An embodiment of the invention may either eliminate the need for such transmission or drastically reduce the data rate of transmission.
  • as an example of a wearable device, consider a wireless EEG monitor that is worn by epileptic patients to monitor and detect the onset of a seizure.
  • An embodiment of the invention may cut down on wireless transmission by directly detecting seizure onset in the wearable device and triggering a remedial stimulation or alerting a caregiver.
  • Wireless Sensor Networks: These are used to monitor the structural health of buildings and bridges, for collecting data for weather prediction, or even in smart homes to intelligently control air conditioning. In all such cases, being able to take decisions on the sensor node through intelligent machine learning will enable a long lifetime of the sensors without requiring a change of batteries. In fact, the power dissipation of the node can be reduced sufficiently for energy harvesting to be a viable option. This is also facilitated by the fact that the weights are stored in a non-volatile manner in this architecture.
  • Video surveillance: Modern defence systems rely on cameras to keep a watch on remote areas where it is difficult to keep a human guard.
  • these systems need to transmit raw videos to a base station where there is a person watching all videos. This transmission of raw videos needs a lot of energy, and since the video generation devices are typically battery powered, the battery needs to be replaced often.
  • Machine learning algorithms using the architecture of the embodiments will help to perform low-energy object recognition within the device having the camera, so that minimal information needs to be transmitted from that device to the base station.
  • a reservoir computing system refers to a time-variant dynamical system with two parts: (1) a recurrently connected set of nodes (referred to as the "liquid" or "the reservoir") with fixed connection weights to which the input is connected, and (2) a readout with tunable weights that is trained according to the task.
  • the states of these nodes, x_M(t), are used by a trainable readout f, which is trained to use these states to approximate a target function.
  • in an LSM, each node is considered to be a spiking neuron that communicates with other nodes only when its local state variable exceeds a threshold and the neuron emits a "spike", whereas in an ESN, each node has an analog value and communicates constantly with other nodes.
  • the communication between nodes for ESN and state updates are made at a fixed discrete time step.
  • the connection between input and hidden nodes is all-to-all in ELM while it may be sparse in LSM or ESN.
  • the neurons or hidden nodes in ELM have an analog output value and are typically not spiking neurons. However, they may be implemented by using spiking neuronal oscillators followed by counters as shown in the patent draft. It is explained in [1] how an ELM can be implemented using a VLSI integrated circuit, and this applies to the presently disclosed techniques also.
  • H is preferably obtained by continuous time subtraction through analog circuits.
  • if h(t) is already discretized in time to h(n), it is possible to use digital subtraction.
  • although the embodiments are explained with respect to adaptive models in which the synapses (multiplicative units) are implemented as analog circuits each comprising electrical components, with the random numerical parameters being due to random tolerances in the components, in an alternative the multiplicative section of the adaptive model (and indeed optionally the entire adaptive model) may be implemented by one or more digital processors.
  • the numerical parameters of the corresponding multiplicative units may be defined by respective numerical values stored in a memory.
  • the numerical values may be randomly-set, such as by a pseudo-random number generator algorithm.
  • the computer system may be a multi-layer network comprising successively an input layer, two or more hidden layers, and an output layer. At least one of these hidden layers (not necessarily the hidden layer closest to the input layer of the computer system, or the hidden layer closest to the output layer) is random and fixed (i.e. not trained during the training procedure), and this fixed hidden layer is provided at its output with a virtual hidden layer as described above with reference to Fig. 5 or Fig. 6.
  • this single fixed hidden layer of the multi-layer network and its corresponding virtual hidden layer would be as described in Fig. 5 and Fig. 6 but with the input to the fixed layer (the "input signals") being denoted by x, and/or with the "output layer" of Figs. 5 and 6 replaced by the corresponding next hidden layer of the network.
  • the weights of the virtual hidden layer(s) would be changed (e.g. in tandem with the variation of weights in other layers), but the weights of the corresponding fixed hidden layer(s) (multiplication and summation section(s)) are not.
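As noted in the bullet above on skewness and sparsity, a rough numerical sketch of how these two quantities translate into the quoted operation-count reduction is given below. D, N and the relative add cost r = a/m are assumed example values, not figures taken from the patent's tables; the factor approaches the skewness L/N as the add cost becomes negligible.

    # Rough estimate of the speedup / operation-count reduction from skewness and sparsity of C.
    D, N = 784, 128                            # assumed input dimension and number of physical neurons
    skewness = 100                             # L / N
    sparsity = 0.98                            # fraction of zero entries in C
    r = 0.1                                    # assumed cost of an add relative to a MAC

    L = skewness * N
    nonzeros = (1.0 - sparsity) * N * L        # add operations performed by the combination unit

    ops_known = D * L                          # MACs of the conventional D x L projection
    ops_new = D * N + r * nonzeros             # MAC-equivalent cost with the combination unit
    print(ops_known / ops_new)                 # roughly 97x here; tends to the skewness (100x) as r -> 0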

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

An adaptive computer system is proposed including: a multiplication and summation section, such as one implemented using a VLSI integrated circuit, which processes data input to the model by multiplying inputs by fixed, random weights and summing them; a combination unit which forms feature values by combining respective sub-sets of the outputs of the multiplication and summation section; and an adaptive output layer which combines the feature values using variable weight parameters. During the training, the variable weight parameters are trained, but the multiplication and summation section and the combination unit are not trained. The number of feature values is greater than the number of outputs of the multiplication and summation section.

Description

Adaptive computer system, and methods and apparatus for training the adaptive computer system
Field of the invention
The present invention relates to an adaptive computer system incorporating a multiplicative stage, and an adaptive layer defined by variable parameters. The invention further relates to methods for training the computer system, and operating the trained system.
Background of the Invention
Previous literature [1] has proposed an adaptive computer system referred to as an Extreme Learning Machine (ELM) incorporating a VLSI random projection network to which an input vector is applied, and a trained output layer which receives the output of the VLSI random projection network. The adaptive computer system is thus a two layer neural network. The first layer (the VLSI random projection network) includes multiplicative units for receiving respective components of the input vector. Each multiplicative unit is fabricated using a CMOS (complementary metal-oxide-semiconductor) fabrication process, and has random fixed input weights due to inherent transistor random mismatch. Thus, the ELM provides massive parallelism and programmability, and is a very power efficient solution to perform multiplication-and-accumulation (MAC) operations. As depicted in Fig. 1, the method performed by the adaptive computer system (also referred to as the ELM algorithm) is a two layer neural feed-forward network with L hidden neurons having an activation function g: R → R [1].
The network includes an input layer (which is not counted as one of the two layers) containing D input neurons with associated values x_1, x_2, ..., x_D, which can also be denoted as a vector x with D components. Thus, D is the dimension of the input to the network.
The outputs of these D input neurons are input to a multiplicative section comprising a hidden layer of L hidden neurons having an activation function g: R → R. Without loss of generality, we consider a scalar output in this case. The output o of the network is given by:

o = Σ_{j=1}^{L} β_j g(w_j^T x + b_j),   w_j, x ∈ R^D, b_j ∈ R    (1)
The values {β_1, ..., β_L} are variable parameters obtained by training. The value g(w_j^T x + b_j) may be referred to as the activation h_j of the j-th neuron of the hidden layer. Note that in a variation of the embodiment, there are multiple outputs, each having an output which is a scalar product of {h_j} with a respective vector of L weights β. In general, a sigmoidal form of g(·) is assumed, though other functions have also been used.
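For concreteness, the forward pass of Eq. (1) can be sketched numerically as below. This is a minimal illustration only: the sigmoid choice for g and the dimensions D = 8, L = 20 are arbitrary assumptions, not parameters of the hardware described here.

    import numpy as np

    rng = np.random.default_rng(0)

    D, L = 8, 20                     # input dimension and number of hidden neurons (example values)
    W = rng.normal(size=(D, L))      # fixed random input weights; column j is w_j
    b = rng.normal(size=L)           # fixed random biases b_j
    beta = rng.normal(size=L)        # output weights beta_j (the only trained parameters)

    def g(z):
        # sigmoidal activation g: R -> R
        return 1.0 / (1.0 + np.exp(-z))

    x = rng.normal(size=D)           # one input vector with D components
    h = g(W.T @ x + b)               # activations h_j of the L hidden neurons
    o = beta @ h                     # scalar output o of Eq. (1)
    print(o)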
Compared to a traditional back-propagation learning rule that modifies all the weights, in ELM, w_j and b_j are set to random values and only the output weights need to be tuned based on the desired outputs of N items of training data T = [t_1, ..., t_n, ..., t_N], where n = 1, ..., N, and t_n is the desired output for the n-th input vector x_n. Therefore, the hidden-layer output matrix h (that is, h is an NxL matrix, where the n-th row is {h_j} for the n-th training example) is actually unchanged after initialization of the input weights, reducing the training of this single hidden layer feed-forward neural network into a linear optimization problem of finding a least-squares solution of β for hβ = T, where β is the vector of output weights and T is the target of the training.
The desired output weights (variable parameters) β are then the solution of the following optimization problem:

β = argmin_β ||hβ - T||    (2)

where β = [β_1 ... β_L] and T = [t_1 ... t_N]. The solution is given by β = h†T, where h† denotes the Moore-Penrose generalized inverse of the matrix h. Using this expression, a fast training speed can be obtained, resulting in good generalization.
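The one-shot training step can be sketched in the same way; the sketch below solves hβ = T in the least-squares sense with a numerical pseudoinverse, using random placeholder training data and the same assumed dimensions as before.

    import numpy as np

    rng = np.random.default_rng(1)
    D, L, N = 8, 20, 100                      # inputs, hidden neurons, training examples (example values)

    W = rng.normal(size=(D, L))               # fixed random input weights
    b = rng.normal(size=L)                    # fixed random biases
    X = rng.normal(size=(N, D))               # N training input vectors x_n
    T = rng.normal(size=N)                    # desired outputs t_n (placeholders)

    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))    # N x L hidden-layer output matrix h
    beta = np.linalg.pinv(H) @ T              # Moore-Penrose solution of h beta = T
    print(np.linalg.norm(H @ beta - T))       # residual of the least-squares fit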
The variable parameters can be implemented in digital circuits (e.g. of a conventional digital processor) that facilitate accurate tuning. The fixed random input weights of the hidden neurons, however, can be easily realized by a VLSI random projection network microchip comprising a plurality of multiplicative units. Each multiplicative unit is subject to random transistor mismatch, which commonly exists and becomes even more pronounced as modern deep sub-micrometer CMOS processes scale down. Thus, the VLSI random projection network microchip can co-operate with a conventional digital processor to form the adaptive computer system of Fig. 1 as a machine learning system using ELM.
The architecture of the input layer and the hidden layer for the proposed classifier that exploits the D x L random weights of the input layer is shown in Fig. 2. The random weights {b_j} arise from tolerances in the leak currents of the CCO neurons. A decoder 10 receives input data to the computer system, and separates it into D data signals. The VLSI random projection network consists of three parts: (a) input handling circuits (IHCs) to convert digital input to analog current, (b) a current mirror synapse array 11 of synapses (multiplication units) for multiplication of input current with a random weight, and (c) a current-controlled-oscillator (CCO) neuron based ADC 12 for summing up along columns. Thus, a single hidden-layer neuron comprises a column of analog circuits (each of which is labelled a synapse in Fig. 2, and acts as a multiplicative unit) and a sum unit (the CCO and corresponding counter) to generate a sum value which is the activation. The hidden neuron also includes a portion of the functionality of a processing unit (e.g. a digital signal processor) to calculate the output of the hidden neuron from the activation value.
If n-bit digital data are used as the input of the IHCs, the IHCs directly convert it into the input current for the current mirror synapse array by n-bit DAC. Different preprocessing circuits can be implemented in the IHCs to extract features from various input signals.
In the implementation of the concept in [1], minimum-sized transistors are employed to generate random input weights w_ij by exploiting random transistor mismatch, leading to a log-normal distribution of the input weights, determined by w_ij = e^(ΔV_t / U_T), where U_T is the thermal voltage and ΔV_t is the mismatch of the transistor threshold voltage, which follows a zero-mean normal distribution in modern CMOS processes.
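The sketch below samples weights from such a log-normal distribution. The thermal voltage of roughly 26 mV and the 20 mV standard deviation assumed for the threshold-voltage mismatch are illustrative values, not measurements from the chip of [1].

    import numpy as np

    rng = np.random.default_rng(2)
    U_T = 0.026                                      # thermal voltage in volts (about room temperature)
    sigma_dVt = 0.020                                # assumed std. dev. of threshold-voltage mismatch (V)

    dVt = rng.normal(0.0, sigma_dVt, size=(8, 20))   # zero-mean normal mismatch, one value per synapse
    W = np.exp(dVt / U_T)                            # log-normally distributed random weights w_ij
    print(W.mean(), np.median(W))                    # heavy-tailed: the mean exceeds the median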
The CCO neurons which perform the ADC each consist of a neural CCO and a counter. They convert output current from each column of the current mirror synapse array into a digital number, corresponding to the hidden layer output of the ELM. The hidden layer output is transmitted out of the microchip for further processing. The circuit diagram of the CCO-neuron is presented in Fig. 2. The output of the CCO-neuron is a pulse frequency modulated digital signal with frequency proportional to the input current I_in. As noted above, a digital signal processor (DSP) is usually provided as an output layer of the ELM computer system. The DSP receives the sum values from the VLSI random projection networks, obtains the corresponding outputs of the hidden layer neurons, and generates final outputs by further processing the data, for instance, passing it through an output stage operation which comprises an adaptive neural network with one or more output neurons, associated with respective variable parameters. The DSP thus implements an adaptive network. The adaptive network is trained to perform a computational task. Normally, this is supervised learning in which sets of training input signals are presented to the decoder 10, and the adaptive network is trained to generate corresponding outputs. Once the training is over, the entire adaptive computer system (i.e. the portion shown in Fig. 2 plus the DSP) is used to perform useful computational tasks.
Note that the VLSI random projection network of [1] is not the only known implementation of an ELM. Another way of implementing an ELM is for the multiplicative section of the adaptive model (and indeed optionally the entire adaptive model) to be implemented in a digital system by a set of one or more digital processors. The fixed numerical parameters of the hidden neurons may be defined by respective numerical values stored in a memory of the digital system. The numerical values may be randomly-set, such as by a pseudo-random number generator algorithm.
For some applications of the ELM adaptive model described above, the dimension of the input data is quite large (more than a few thousand data values). For some other applications, the network requires a large number of hidden layer neurons (also more than a few thousand) to achieve the best performance. This poses a big challenge to the hardware implementation. This is true both in the case that the ELM is implemented using the tolerances of electrical components to implement the random numerical parameters of the hidden neurons, and in the case that the random numerical parameters are stored in the memory of a digital system. For example, if the required input dimension for a given application is D, and the adaptive model requires L hidden layer neurons, conventionally, at least D*L random projections are needed for classification. Each neuron requires D random weights. However, if the maximum input dimension for the hardware is only k (k < D) and the number of implemented hidden layer neurons is N (N < L), the hardware provides a k*N random projection matrix W_ij (i = 1, 2, ..., k and j = 1, 2, ..., N), which is smaller than DxL. This physical limitation makes it hard to use a VLSI random projection network in applications which require a number of hidden layer neurons L which is greater than N.
To address this, citation [2] proposes that the input layer of the computer system provides a controllable mapping of the input data values to the hidden neuron inputs, and/or the output layer provides a controllable mapping of the hidden neuron outputs to the neurons of the output layer. This makes it possible to re-use the hidden neurons, so as to increase the effective input dimensionality of the computer system, and/or the effective number of neurons.
Summary of the invention
In general terms, the present invention proposes increasing the number of random feature values available for use in the output layer by forming new feature values by arithmetic combinations of the outputs of the sum units.
Thus, in one form, the invention proposes an adaptive computer system including a multiplication and summation section, such as one implemented using a VLSI integrated circuit, which processes data input to the model by multiplying inputs by fixed, random weights and summing them; a combination unit which forms feature values by combining respective sub-sets of the outputs of the multiplication and summation section as a function of a linear arithmetic combination of the respective sub-set of the outputs of the multiplication and summation section; and an adaptive output layer which combines the feature values using variable parameters (weights). During the training, the variable parameters are trained, but the multiplication and summation section and the combination unit are not trained. The number of feature values is greater than the number of outputs of the multiplication and summation section. This may mean that the effective number of feature values which can be obtained from a multiplication and summation section implemented by hardware with a limited number of random weights, can be increased. From another point of view, it means that, for a given number of feature values, the memory requirements needed to implement the multiplication and summation section may be reduced. Furthermore, the number of arithmetic operations, and thus the energy consumption, needed to implement the multiplication and summation section may be reduced.
The multiplication and summation section may have the same structure as the one explained above, in which each multiplication operation is performed by a respective one of an array of multiplication units. The multiplication units may be nonprogrammable. That is, the corresponding multiplication operation is to multiply an input to the multiplication unit by a fixed multiplicative factor. For example, at least part of the multiplication and summation unit may be implemented as a VLSI integrated circuit, having a respective electronic circuit for each multiplication operation. The electronic circuit may be non-programmable, and the corresponding multiplicative factor may be random, according to tolerances in the fabrication of the electronic circuit. The combination unit may be thought of as a virtual hidden layer, having a number of outputs (the feature values) which is greater than the number of inputs (the sum values). The arithmetic operations correspond to respective neurons of the virtual hidden layer. Note however that, in contrast to a conventional multi-layer neural network in which each neuron of each layer performs a non-linear function of a summation value, the sum values output by the multiplication and summation section are combined linearly at the combination unit (e.g. as a weighted sum of the sum values, according to weights which may not all be equal). The combination unit may be implemented as nonprogrammable electronic circuitry (analogue or digital), or using a digital signal processor which processes software to perform the function of the combination unit.
Suppose that the number of feature values (L) is greater than the number of outputs (N) of the multiplicative and summation section, by an increment factor E. This means that, for each of the outputs of the multiplication and summation section (which constitute respective random projections of the input vector), the combination unit needs to perform multiply-and-accumulate (MAC) operations r = ceil(E) times, where "ceil(E)" denotes the smallest integer which is not below E. This implies that the energy consumption per random projection is increased r = ceil(E) times.
Using the arithmetic operations, the respective sub-sets of the sum values are combined as linear combinations. Thus, the feature values input to the adaptive output layer (i.e. the feature values output by the virtual hidden layer) are functions of linear combinations of the sum values.
The concept of approximating a larger feature set using a linear combination of smaller basis sets of features is not entirely new (see [3], [4], [5] and [6]). Note that these works are not in the context of an ELM, but rather one in which the weights W of a first hidden layer of a neural network are obtained by training. They assume that the weight matrix of a hidden layer (which, for a number of inputs D and outputs L, can be denoted by the DxL matrix W = [w_1, ..., w_L], where each vector w_j has D components), can be approximately decomposed as the multiplication of a smaller basis feature set matrix B ∈ R^(DxN) for N < L, and a matrix α ∈ R^(NxL), to give W ≈ W_approx = Bα. This makes it possible to approximate the computation-heavy term w_j^T x of Eq. (1) as below:

w_j^T x = Σ_i W_ij x_i ≈ Σ_k h_k α_jk   for j = 1, 2, ..., L    (3)

where h_k = Σ_i b_ki x_i, and {b_ki} are the components of B.    (4)
The original computation of Eq. (1) would have needed D*L MAC operations, while the approximate calculation requires D*N + N*L MAC operations, which results in reducing the computational load by a factor of (D*L)/(N*(D+L)). This implies the technique is most beneficial if N < (D*L)/(D+L), i.e. we can decompose the original matrix W into very small basis feature sets. In general, reducing N will increase the error in approximating W by W_approx, resulting in a loss of classification accuracy, but this increase may be acceptably low. Citation [3] reports a 2.5X speedup with no loss in accuracy, and a 4.5X speedup with only a drop of 1% in classification accuracy. Citations [3]-[6] obtain W by training and then try to approximate it as Bα. By contrast, the present invention is directed to a random mapping where W can consist of any random numbers, and embodiments of the invention exploit this freedom to reduce energy. Depending on the extent to which this freedom is exploited, embodiments of the invention permit greater energy savings and/or allow a given number of feature values to be obtained for the output layer. Whereas in the citations [3]-[6] the matrix W was obtained by training, such that it is undesirable to approximate it, in embodiments of the present invention the weights of the multiplication and summation section are random anyway. Thus, even if it is approximated using some random basis feature set B, thereby obtaining a larger random feature set based on W_approx, there should be no loss of quality because W_approx is as good as W. We can thus select N ≪ L to reduce the computational load by a factor of (D*L)/(D*N+N*L) ≫ 1. For any hardware which needs 'm' pJ (pico-Joules) per MAC, this is equivalent to energy savings by a factor of (D*L)/(D*N+N*L) ≫ 1. Further, we can use a matrix α consisting of only 0 and powers of 2 (±2^n = ±1, ±2, ±4, ...). That is, the non-zero weights may all be expressible as ±2^n for corresponding values of n. The value of n may be the same for all weights, or it may be higher for some weight(s) than for other(s). Multiplication by powers of 2 can be performed in most digital circuits by a bit-shift operation, so the computational cost, in a digital signal processor, of multiplying sum values by weights which are powers of 2 is negligible (indeed, in the special case in which the matrix α consists of only 0s and 1s, even shift operations are not required), so calculating the right side of Eq. (3) amounts, at most, to performing N add operations for each j. This further reduces the total computational load to only D*N MAC operations + N*L add operations. This gives an energy saving of (D*L)/(D*N+r*N*L) ≫ (D*L)/(D*N+N*L) ≫ 1. Here, r = a/m ≪ 1 because the energy requirement for an add operation ('a' pJ per add operation) is much less than that for a MAC operation.
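As a concrete illustration of this decomposition, the sketch below builds W_approx = Bα with α restricted to zeros and signed powers of two, and counts the operations of each stage. The sizes, the exponent range and the sparsity pattern are arbitrary assumptions; in hardware the second stage would be realised with shifts and adds, whereas the sketch simply multiplies.

    import numpy as np

    rng = np.random.default_rng(3)
    D, N, L = 64, 8, 40                              # example sizes with N << L

    B = rng.normal(size=(D, N))                      # random basis feature set
    exponents = rng.integers(0, 3, size=(N, L))      # n in 2**n, here n in {0, 1, 2}
    signs = rng.choice([-1.0, 0.0, 1.0], size=(N, L), p=[0.25, 0.5, 0.25])
    alpha = signs * (2.0 ** exponents)               # entries are 0 or +/- 2**n

    x = rng.normal(size=D)
    h = B.T @ x                                      # first stage: D*N MAC operations
    features = alpha.T @ h                           # second stage: shift-and-add only in hardware

    mac_full = D * L                                 # MACs for a full D x L random projection
    mac_new = D * N                                  # MACs actually needed by the decomposition
    adds_new = int(np.count_nonzero(alpha))          # shift/add operations in the second stage
    print(mac_full, mac_new, adds_new)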
The matrix α may be sparse, e.g. contain more (preferably, many more) elements which are zero than non-zero elements. As a special case we can restrict each column of α (that is, α_i ∈ R^N (i = 1, 2, ..., L)) to contain N-2 elements which are '0', and two elements which are '1', or more generally '±1'. This will further reduce the complexity to D*N MAC operations + L add operations. This gives an energy saving of (D*L)/(D*N+r*L) ≫ (D*L)/(D*N+r*N*L) ≫ (D*L)/(D*N+N*L) ≫ 1. Here, r = a/m ≪ 1 because the energy requirement for add operations is much less than for MAC operations.
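A short worked estimate of these counts is given below for one assumed set of sizes and an assumed relative add cost; the resulting factor of roughly ten is only an example consistent with the expression (D*L)/(D*N + r*L), not a measured figure.

    # Assumed example values: input dimension D, physical neurons N, virtual features L,
    # and per-operation energies m (MAC) and a (add) with a << m.
    D, N, L = 400, 25, 250
    m, a = 1.0, 0.1                      # pJ per MAC and per add (assumed)

    ops_known = D * L                    # MACs of the conventional D x L projection
    macs_new = D * N                     # MACs of the smaller D x N projection
    adds_new = L                         # one add per virtual feature (two +/-1 entries per column)

    energy_known = m * ops_known
    energy_new = m * macs_new + a * adds_new
    print(energy_known / energy_new)     # energy-saving factor (D*L)/(D*N + (a/m)*L), about 10 here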
In one application, the invention may be used in a computer system of the type explained in the background section above, in which the multiplicative stage comprises a very-large-scale-integration (VLSI) integrated circuit including a plurality of multiplicative units which are analog circuits, each analog circuit performing multiplicative operations according to inherent tolerances of its components. Note that in the present case, however, there is no need for the values {b_j}, so that random leak currents due to the CCO are not needed.
In other applications, the invention may be implemented as a dedicated digital multiplying circuit, or simulated using software by a digital processor, such as the arithmetic logic unit of a central processing unit (CPU). The various aspects of the invention may be implemented within an ELM. However, as an alternative to an ELM, the present approach can be also used in other adaptive signal processing algorithms, such as liquid state machines (LSM) or echo state networks (ESN), since they too require random projections of the input. That is, in these networks too, a first layer of the adaptive model employs fixed randomly-set parameters to perform multiplicative operations of the input signals, and the results are summed.
Additionally, the techniques of the present invention may be freely combined with those of reference [2].
The term "adaptive model" is used in this document to mean a computer-implemented model defined by a plurality of numerical parameters, including at least some which can be modified. The modifiable parameters are set (usually, but not always, iteratively) using training data illustrative of a computational task the adaptive model is to perform.
The present invention may be expressed in terms of a computer system employing the novel architecture. In one possibility, the computer system includes at least one integrated circuit comprising electronic circuits (digital or analogue) having random tolerances used by the multiplication and summation section. Alternatively, the multiplication and summation section may be implemented by software, e.g. using random values in a data structure stored in a memory unit.
The invention may also be expressed as a computer system including one or more digital processors for implementing the adaptive model (in this case, the computer system may be a personal computer (PC) or a server). Alternatively, the invention may be expressed as a method or apparatus for training such a computer system, or even as program code (e.g. stored in a non-transitory manner in a tangible data storage device) for automatic performance of the method.
Brief description of the Drawings
Embodiments of the invention will now be described for the sake of example only with reference to the following figures in which:
Fig. 1 shows the structure of a known ELM model;
Fig. 2 shows the structure of a known machine-learning co-processor of the ELM model of Fig. 1;
Fig. 3 shows the structure of a first embodiment of the invention which is a modified ELM network;
Fig. 4 shows numerically the function of a virtual hidden layer (combination unit) of the embodiment of Fig. 3;
Fig. 5 shows a second embodiment which is a generalised form of the first embodiment; and
Fig. 6 shows a third embodiment which is a second generalised form of the first embodiment.
Detailed description of the Embodiments
We now describe an embodiment of the invention having various features as described below, with reference to Fig. 3.
The embodiment is an adaptive computer system which comprises an input layer and a (first) hidden layer (referred to here as a multiplication and summation section) which have the same general form as the hidden layer of the known system of Fig. 1. Thus, the input to the network is a vector x having D components. In the case of the embodiment, we denote the function performed by the multiplication and summation section as h = W^T·x. Here W is a DxN dimensional matrix.
Optionally, the input layer and multiplication and summation section may be implemented using the architecture of Fig. 2. The number of IHCs is equal to D. The multiplication and summation section may be at least partly implemented by an integrated circuit with an array (mirror synapse array) of respective multiplication units to implement the respective synapses. We denote the number of neurons of the current mirror synapse array 11 by N, and each neuron includes a corresponding set of D synapses (that is, one column of synapses in the array 11). The decoder 10 transmits the D input data values to the D respective IHCs. The current mirror synapse array 11 multiplies this D dimensional input with the random matrix W_ij (i = 1, 2, ..., D and j = 1, 2, ..., N). Thus, the weights of the matrix W are fixed and random, as the result of tolerances in the electronic circuits which constitute the respective synapses in Fig. 2. In contrast to the known system of Fig. 1, the embodiment of Fig. 3 includes an additional combination unit which implements a virtual hidden layer of L neurons downstream of the multiplication and summation section, where L > N (in the particular example shown in Fig. 3, N is 4 and L is 6). The function performed by the virtual hidden layer is denoted H = g(C·h). Note that C is the transpose of the matrix α described in the previous section. In other words, H = g(α^T·h). The values H are referred to as feature values; the number of feature values is equal to L. The combination unit may be hardwired, but alternatively it may be implemented by a programmed digital signal processor. Due to the combination unit, there are DxL virtual random weights. Each virtual random weight is the weight which a corresponding component of the input vector x has in a corresponding component of C·h.
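The data path just described can be sketched numerically as follows. The sizes N = 4 and L = 6 follow Fig. 3; the random weights, the pairwise-difference choice for C and the tanh activation are illustrative assumptions. The final check confirms that C·h equals W_virtual^T·x, i.e. that the combination unit behaves as a set of DxL virtual random weights W·C^T.

    import numpy as np

    rng = np.random.default_rng(4)
    D, N, L = 5, 4, 6                                # D is an assumed input dimension

    W = rng.normal(size=(D, N))                      # fixed random weights of the physical stage
    C = np.zeros((L, N))                             # each row: two non-zero entries (pairwise difference)
    pairs = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
    for row, (p, q) in enumerate(pairs):
        C[row, p], C[row, q] = 1.0, -1.0

    g = np.tanh                                      # stand-in non-linear activation

    x = rng.normal(size=D)
    h = W.T @ x                                      # N sum values from the multiplication and summation section
    H = g(C @ h)                                     # L feature values of the virtual hidden layer

    W_virtual = W @ C.T                              # D x L virtual random weights
    print(np.allclose(C @ h, W_virtual.T @ x))       # True: C.h is the same projection as W_virtual^T.x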
The matrix C denotes the transpose of the matrix α in the explanation given above. Its structure is as indicated in Fig. 4. In the case of Fig. 3, each row of C contains two non-zero values, and N-2 zero values. Thus, each neuron of the virtual hidden layer generates a respective feature value from a corresponding sub-set of the sum values, specifically a sub-set which consists of two of the sum values.
The feature value is the output of a corresponding arithmetic operation g(h_a, h_b) performed, by the virtual neuron, on the corresponding sub-set of the sum values. In particular, it is performed on a linear combination of the sub-set of the sum values. As described below, the arithmetic operation g(h_a, h_b) may be a weighted sum of the sum values h_a and h_b or, more generally, a function, e.g. a non-linear function, of a weighted sum of the sum values h_a and h_b. That is, the function g may optionally encode a non-linear function of the weighted sum of the corresponding sub-set of the sum values, which represents a non-linear activation function of the corresponding neuron of the virtual hidden layer. Note that in the embodiment of Fig. 3, each of the neurons of the virtual hidden layer receives a respective sub-set (pair) of the sum values, and combines them using the corresponding arithmetic operation to generate the respective feature value. As described below with reference to Fig. 5, in other embodiments, for one or more neurons of the virtual hidden layer, the respective sub-set of sum values may comprise more than two sum values.
The adaptive computer system further includes an adaptive output layer which receives the L feature values, and multiplies them by L respective variable parameters which constitute the vector β. Thus, the variable parameters are adaptively chosen, whereas the values for the matrices W (which define multiplicative operations) and C (which define arithmetic operations on the sum values output by the multiplication and summation section) are fixed. In the known computer system of Fig. 1, in the case of an application in which the output layer requires at least L feature values, the hidden layer requires L physical summation units, so at least D x L random projections are needed for classification.
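The text specifies only that the variable parameters β are adapted while W and C stay fixed; it does not prescribe a particular solver. As a hedged sketch, a regularised least-squares fit of β to the feature matrix (a common choice for output-layer-only training) could look like the following; the ridge term and function names are assumptions for illustration.

```python
import numpy as np

def train_output_layer(H_train, T_train, ridge=1e-3):
    """Fit only the variable parameters beta, leaving W (multiplicative operations)
    and C (arithmetic operations) untouched.

    H_train: (num_examples, L) feature values from the virtual hidden layer
    T_train: (num_examples,) or (num_examples, K) training targets
    ridge:   small regulariser; an assumption, any beta-only solver would do
    """
    L = H_train.shape[1]
    A = H_train.T @ H_train + ridge * np.eye(L)
    return np.linalg.solve(A, H_train.T @ T_train)
```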
By contrast, in the embodiment of Fig. 3, the number of physical summation units is N (N < L), and the multiplication and summation section provides a D x N random projection matrix W_ij (i = 1, 2, ..., D and j = 1, 2, ..., N), which is smaller than the D x L matrix W_ij (i = 1, 2, ..., D and j = 1, 2, ..., L) required in the known system of Fig. 1. Thus, in the embodiment random feature values are created by the combination unit combining respective sub-sets of the N sum values (the outputs of the multiplication and summation section) using an arithmetic operation. For a given neuron of the virtual hidden layer, both the corresponding sub-set of sum values and the corresponding arithmetic operation are defined by the corresponding row of the matrix C.
Two efficient ways to implement C are as a pairwise average (i.e. the two non-zero entries of each row of C (i.e. each row of α^T) are the same, e.g. they may be equal to one half), or as a pairwise difference (i.e. one non-zero value is 1, and the other is -1). Implementing C as the pairwise difference has the additional advantage that the virtual random weights nevertheless have zero mean, even in the case that the weights of the multiplication and summation section are not zero-mean. If the virtual random weights have zero mean, this has the advantage of making the system robust against common-mode variations in all weights of the chip or in the input vector x. Thus, even in cases (discussed below) in which the number of non-zero values in each row of C is not equal to two, it is preferable if the values of each row sum to zero.
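A small numerical check of the common-mode point made above (all values arbitrary): a row of C that sums to zero (pairwise difference) cancels a constant offset added to all sum values, whereas a pairwise-average row passes it through.

```python
import numpy as np

c_avg  = np.array([0.5, 0.5, 0.0, 0.0])    # pairwise average: row does not sum to zero
c_diff = np.array([1.0, -1.0, 0.0, 0.0])   # pairwise difference: row sums to zero

h = np.array([0.2, -0.4, 0.1, 0.3])        # arbitrary sum values
offset = 5.0                                # a common-mode shift applied to every sum value

print(c_avg  @ (h + offset) - c_avg  @ h)   # 5.0 -> the offset leaks into the feature
print(c_diff @ (h + offset) - c_diff @ h)   # 0.0 -> the offset is rejected
```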
Denoting the positions of the non-zero values in the j-th row (1 <= j <= L) of C as a (1 <= a <= N) and b (1 <= b <= N), the j-th neuron of the virtual hidden layer receives h_a and h_b. This neuron generates the j-th random feature as H_j = g(h_b + h_a) or H_j = g(h_b - h_a). Using the pairwise differences of two hardware hidden nodes can give up to N*(N-1)/2 random feature values. One specific way of defining the matrix C would be such that, for j = A*N + b, a = mod(A + b + 1, N).
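To make the counting concrete, the sketch below enumerates pairwise-difference features; with N sum values it produces up to N*(N-1)/2 of them. The index rule j = A*N + b, a = mod(A + b + 1, N) quoted above is only one convention, so plain enumeration of all pairs is used here as an assumption.

```python
import numpy as np
from itertools import combinations

def pairwise_difference_features(h, g=np.tanh):
    """All N*(N-1)/2 features H_j = g(h_b - h_a) obtainable from N sum values."""
    return np.array([g(h[b] - h[a]) for a, b in combinations(range(len(h)), 2)])

h = np.random.default_rng(1).standard_normal(4)
print(pairwise_difference_features(h).shape)   # (6,) since 4*3/2 = 6
```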
Implementation of an ELM input stage producing L random features using a known ELM as shown in Fig. 1 would need D*L random numbers and D*L MAC operations. Citation [2] can obtain the L random features from only D*N random numbers, but still needs energy and time resources to perform D*L MAC operations. By contrast, the embodiment of Fig. 3 can obtain L random features from only D*N random numbers and also reduce energy and time resources from (D*L MAC operations) to (D*N MAC operations and L add operations). The add operations need much less energy and time resource compared to multiply operations, and hence the extra resource cost of the extra L add operations needed for the pairwise addition to implement the ELM of Fig. 3 is much less expensive than the D*(L-N) MAC operations which it saves. This gives an overall reduction in resources required.
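As a quick check of the operation-count comparison above, the sketch below simply tallies the counts contrasted in the text (the example sizes are borrowed from Table 1a later in this description).

```python
def operation_counts(D, N, L):
    """Counts contrasted in the text: D*L MACs for the known ELM of Fig. 1 versus
    D*N MACs plus L add operations for the Fig. 3 embodiment."""
    return {"known_MACs": D * L, "embodiment_MACs": D * N, "embodiment_adds": L}

print(operation_counts(D=36, N=25, L=250))
# {'known_MACs': 9000, 'embodiment_MACs': 900, 'embodiment_adds': 250}
```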
Note that, in principle, it would be possible for the matrix C to have additionally one or more rows with only a single non-zero value. In other words, the adaptive output layer would additionally employ feature values which are produced using only a single respective one of the sum values. However, preferably, this is not the case (i.e. preferably all the feature values received by the adaptive output layer are formed from a respective plurality of the sum values). For one thing, the random weights associated with feature values produced from a single sum value may be statistically different from those produced using multiple sum values. They may, for example, have non-zero mean, where "mean" refers to the average over all training inputs.
Turning to Fig. 5, a second embodiment of the invention is shown. This differs from the embodiment of Fig. 3 only in that the matrix C is different, such that the number of neurons in the virtual hidden layer is higher than in Fig. 3. Specifically, some of the neurons of the virtual hidden layer receive a respective sub-set of three of the sum values produced by the multiplication and summation section; some of the neurons of the virtual hidden layer receive a respective sub-set of four of the sum values of the multiplication and summation section; and so on. Taking all sub-set sizes up to p, the total number of neurons of the virtual hidden layer can thus be up to L_max = C(N,2) + C(N,3) + ... + C(N,p) >> N, where C(N,k) denotes the number of sub-sets of size k. This reduces the energy and time resources from D*L MAC operations to D*N MAC operations plus {1 + 2 + 3 + ... + (p-1)}*L add operations.
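The upper bound L_max quoted above can be evaluated directly; the sketch below just sums binomial coefficients for sub-set sizes 2 through p (the N and p values are illustrative only).

```python
from math import comb

def max_virtual_neurons(N, p):
    """L_max = C(N,2) + C(N,3) + ... + C(N,p): the largest possible virtual hidden
    layer when virtual neurons may combine between 2 and p of the N sum values."""
    return sum(comb(N, k) for k in range(2, p + 1))

print(max_virtual_neurons(N=4,   p=2))   # 6, matching Fig. 3
print(max_virtual_neurons(N=128, p=3))   # 8128 pairs + 341376 triples = 349504
```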
Turning to Fig. 6, another embodiment is shown. This differs from the embodiment of Fig. 3 only in that the output layer generates a plurality of output values. The vector of variable parameters β in the embodiment of Fig. 3 is replaced by a matrix β. Respective columns of the matrix of variable parameters are used to generate respective outputs of the computer system. Thus, the output layer of the embodiment of Fig. 6 has multiple output values, where each output value is obtained using the product of a respective one of the columns of the matrix β and the L outputs of the virtual hidden layer implemented by the combination unit.
Preferred embodiments of the invention are able to achieve a speedup in operation due to a reduction in the number of operations required. The amount of the speed-up depends on the skewness (L/N) and the sparsity (fraction of zero elements) of C. The skewness may be at least 20, at least 50, at least 80, or even at least 100. The sparsity may be at least 70%, at least 80%, at least 90%, at least 95% or even at least 98% (i.e. at least 98% of the elements of C are zero). Experimentally we have found that certain embodiments of the proposed method with random weights can achieve a speedup/reduction in the number of operations of greater than 100 times, by having a skewness as high as 100 and a sparsity as high as 98%, for less than 1% loss in accuracy. In contrast, previous methods for trained weights [3] can achieve only a speedup of 4.5X for a 1% reduction in accuracy, since their sparsity and skewness are lower.
Results
Table 1a shows simulation results which are the average classification error for 100 runs for common benchmarking datasets, while Table 1b shows the resources required. In Table 1a, the terms "25 original neurons" and "250 original neurons" refer to the commonly used cases of an input matrix of size D x 25 and D x 250 respectively in the known system of Fig. 1. The case "250 virtual neurons" is the embodiment of Fig. 3 in the case that the input matrix is D x 25 and a pairwise difference H_j = g(h_b - h_a) of its 25 outputs is used to get 250 random features.
The simulation results shown in Table 1a and Table 1b support the claims that:
- compared to the case "250 original neurons", the proposed method in the case of "250 virtual neurons" allows about a factor of 10 in energy saving, and a factor of 10 in memory saving, with a negligible (0% to 1.3%) increase in error;
- with comparable energy and memory resource requirements, the proposed method of "250 virtual neurons" has a much lower classification error than "25 original neurons".

Table 1a: Classification error for different datasets using different network implementations
Database                  sat     Australian   diabetes   mushroom   Vowel
Input dimension (D)       36      14           8          22         10
Method                    Classification error (%)
25 original neurons       24.3    14.5         29.3       19.6       64.5
250 virtual neurons       13.9    13.9         27.1       11.2       51.2
250 original neurons      13.9    13.9         26.8       11.4       49.9

Table 1b: Resources required for different implementations
Note that the column of energy values in Table 1b is independent of the hardware used for implementation. The embodiment of Fig. 3 has a lower energy requirement due to lower memory usage and fewer operations being required, and is thus superior to the known system of Fig. 1 for any hardware which needs 'm' pJ per multiply operation and 'a' pJ per addition operation, provided that 'a' << 'm'. The relative energy values assume that a = 0.1m. The relative energy value for the embodiment varies with D (e.g. it is different for D=8 than for D=36), and the energy savings are greater for applications with a higher-dimensional input. Any further hardware improvements made to the embodiment of Fig. 3 to lower the energy per MAC (e.g. using analog VLSI techniques) or the energy per memory unit would give additional benefits.
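The relative-energy figures follow from the simple cost model stated above ('m' per multiply, 'a' per add, a = 0.1m); a sketch of that calculation, using the Table 1a sizes as examples, shows how the saving grows with the input dimension D.

```python
def relative_energy(D, N, L, m=1.0, a=0.1):
    """Energy of the Fig. 3 embodiment relative to the known system of Fig. 1,
    counting D*N MACs plus L adds against D*L MACs (cost model from the text)."""
    return (D * N * m + L * a) / (D * L * m)

for D in (8, 36):                      # the relative value varies with the input dimension D
    print(D, relative_energy(D, N=25, L=250))
# roughly 0.11 for D=8 and 0.10 for D=36: savings grow as D grows
```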
Note that embodiments of the invention can potentially generate many more than 250 virtual random neurons, but the datasets tabulated in Table 1a do not exhibit much accuracy improvement for an even higher number of random neurons.
The embodiment has also been tested on applications such as image classification, which typically require a much higher number of random neurons for good accuracy. Using the embodiments will allow users to target such complex applications even with resource-constrained processors.
Table 2 shows the average classification error for 10 runs for a common image recognition dataset, "mnist" (D=784). In this table, the terms "5120 original neurons" and "12800 original neurons" refer to the commonly used cases with an input matrix of size D x 5120 and D x 12800 respectively; the term "5120 virtual neurons" refers to the embodiment of Fig. 3, where the input matrix is D x 128 and the embodiment uses the pairwise difference H_j = g(h_b - h_a) of its 128 outputs to get 5120 random features; the term "12800 virtual neurons" refers to the embodiment of Fig. 5, where the input matrix is D x 128, and the virtual hidden layer generates 12800 random features from (i) 8064 respective pairs of the 128 outputs of the multiplication and summation section (sum values), using the arithmetic operation H_j = g(h_b - h_a), and (ii) 4736 respective sub-sets of three of the 128 outputs of the multiplication and summation section (sum values), using the linear combination g(h_c + h_b - 2*h_a) of the three sum values.
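For concreteness, a sketch of the "12800 virtual neurons" construction just described (128 sum values, 8064 pairwise differences plus 4736 three-value combinations); which particular pairs and triples are selected is not specified in the text, so lexicographic order is assumed here for illustration.

```python
import numpy as np
from itertools import combinations

def mnist_style_features(h, n_pairs=8064, n_triples=4736, g=np.tanh):
    """12800 features from 128 sum values: g(h_b - h_a) for pairs and
    g(h_c + h_b - 2*h_a) for sub-sets of three (selection order is an assumption)."""
    pairs   = list(combinations(range(len(h)), 2))[:n_pairs]
    triples = list(combinations(range(len(h)), 3))[:n_triples]
    features  = [g(h[b] - h[a]) for a, b in pairs]
    features += [g(h[c] + h[b] - 2 * h[a]) for a, b, c in triples]
    return np.array(features)

h = np.random.default_rng(2).standard_normal(128)
print(mnist_style_features(h).shape)   # (12800,)
```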
The simulation results support the claims that:
- compared to the case "12800 original neurons", the embodiment in the case of "12800 virtual neurons" allows an energy saving of a factor of approximately 80 (=100/1.222), and a memory saving of a factor of approximately 100, with a negligible (0.8%) increase in error;
- compared to the case "5120 original neurons", the embodiment in the case of "5120 virtual neurons" allows an energy saving of a factor of about 40 (=100/1.005), and a memory saving of about a factor of 40, with a negligible (0.7%) increase in error;
- compared to the case "128 original neurons", the embodiment in the case of "5120 virtual neurons" has a 12% lower error with the same memory resource, and a negligible increase in energy resource (a factor of 1.005).
Table 2: Classification error for the mnist dataset, showing much higher savings
Again, the energy result is independent of the hardware used for implementation. The embodiment has a lower energy requirement due to lower memory usage and fewer operations being required, and is thus an improvement for any hardware which needs 'm' pJ per multiply operation and 'a' pJ per addition operation, provided 'a' << 'm'. The relative energy calculation assumes a = 0.1m. Any further hardware improvements made to the embodiment of Fig. 3 to lower the energy per MAC (e.g. using analog VLSI techniques) or the energy per memory unit would give additional benefits.
The embodiment of Fig. 3 has also been compared to the ELM suggested in [1] for classification of the MNIST dataset. Using the system of [1] with a hidden layer having 128 outputs, to get only 128 random feature values, resulted in a classification error of ~18%. However, the embodiment of Fig. 3, using a pairwise difference H_j = g(h_b - h_a) to create a virtual hidden layer of L=5120 neurons, reduced the classification error to ~4%. To obtain a similar error, the known approach would have needed 40 (=5120/128) chips, resulting in an area and an energy consumption which are both larger by a factor of 40.
Commercial applications of the Invention
An embodiment of the present invention can be used in any random projection-based machine learning system, and in particular for any application requiring data-based decision making at low power, such as the applications described in [1] and [2]. Here, we outline several other possible use cases:
1. Implantable/Wearable Medical Devices: There has been a huge increase in wearable devices that monitor ECG/EKG/blood pressure/glucose level etc. in a bid to promote healthy and affordable lifestyles. Typically, these devices operate under a limited energy budget, with the biggest energy hog being the wireless transmitter. An embodiment of the invention may either eliminate the need for such transmission or drastically reduce the data rate of transmission. As an example of a wearable device, consider a wireless EEG monitor that is worn by epileptic patients to monitor and detect the onset of a seizure. An embodiment of the invention may cut down on wireless transmission by directly detecting seizure onset in the wearable device and triggering a remedial stimulation or alerting a caregiver.
In the realm of implantable devices, we can take the example of a cortical prosthetic aimed at restoring motor function in paralyzed patients or amputees. The amount of power available to such devices is very limited and unreliable; being able to decode the motor intentions within the body on a micropower budget enables a drastic reduction in the data to be transmitted out.
2. Wireless Sensor Networks: Wireless sensor nodes are used to monitor the structural health of buildings and bridges, to collect data for weather prediction, or even in smart homes to intelligently control air conditioning. In all such cases, being able to take decisions on the sensor node through intelligent machine learning will enable a long lifetime of the sensors without requiring a change of batteries. In fact, the power dissipation of the node can be reduced sufficiently for energy harvesting to be a viable option. This is also facilitated by the fact that the weights are stored in a non-volatile manner in this architecture.
3. Data centres: Today, data centres are becoming more prevalent due to the increasing popularity of cloud-based computing. But power bills are the largest recurring cost for a data centre [23]. Hence, low-power machine learning solutions could enable the data centres of the future by cutting their energy bills drastically.
4. Video surveillance: Modern defence systems rely on cameras to keep watch on remote areas where it is difficult to post a human guard. Currently these systems need to transmit raw video to a base station where a person watches all the videos. This transmission of raw video needs a lot of energy, and since the video generation devices are typically battery powered, the battery needs to be replaced often. Machine learning algorithms using the architecture of the embodiments will help to perform low-energy object recognition within the device having the camera, so that minimal information needs to be transmitted from that device to the base station.
Variations of the invention
A number of variations of the invention are possible within the scope and spirit of the invention, and within the scope of the claims, as will be clear to a skilled reader. One of these variations is that many of the techniques explained above are applicable to reservoir computing systems, which are closely related to ELMs. In general, a reservoir computing system refers to a time-variant dynamical system with two parts: (1) a recurrently connected set of nodes (referred to as the "liquid" or "the reservoir") with fixed connection weights, to which the input is connected, and (2) a readout with tunable weights that is trained according to the task. Two major types of reservoir computing systems are popularly used: the Liquid State Machine (LSM) and the Echo State Network (ESN). Fig. 13 shows a depiction of an LSM network where the input signal u(t) is connected to the "liquid" of the reservoir, which implements a function L^M on the input to create internal states x^M(t), i.e. x^M(t) = (L^M u)(t). The states of these nodes, x^M(t), are used by a trainable readout f, which is trained to use these states and approximate a target function. The major difference between LSM and ESN is that in an LSM each node is considered to be a spiking neuron that communicates with other nodes only when its local state variable exceeds a threshold and the neuron emits a "spike", whereas in an ESN each node has an analog value and communicates constantly with other nodes. In practice, for an ESN the communication between nodes and the state updates are made at a fixed discrete time step.
Extreme Learning Machines (ELM) can be considered as a special case of reservoir learning where there are no feedback or recurrent connections within the reservoir. Also, typically the connection between input and hidden nodes is all-to-all in an ELM, while it may be sparse in an LSM or ESN. Finally, the neurons or hidden nodes in an ELM have an analog output value and are typically not spiking neurons. However, they may be implemented by using spiking neuronal oscillators followed by counters as shown in the patent draft. It is explained in [1] how an ELM can be implemented using a VLSI integrated circuit, and this applies to the presently disclosed techniques also. Since LSM and ESN have time-varying hidden neuron outputs (they are defined as continuous-time systems), for an analog implementation of h, H is preferably obtained by continuous-time subtraction through analog circuits. For a digital implementation of h, since h(t) is already discretized in time to h(n), it is possible to use digital subtraction.
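For the digital reservoir case just mentioned, where h(t) has already been discretised to h(n), the same pairwise-difference combination can simply be applied at every time step; a minimal sketch (the pair list and activation are assumptions for illustration):

```python
import numpy as np

def virtual_features_over_time(h_seq, pairs, g=np.tanh):
    """Apply H_j(n) = g(h_b(n) - h_a(n)) to time-discretised reservoir states.

    h_seq: array of shape (T, N) holding the N node states at T discrete time steps
    pairs: list of (a, b) index pairs defining the virtual hidden nodes
    """
    return np.stack([np.array([g(h[b] - h[a]) for a, b in pairs]) for h in h_seq])

states = np.random.default_rng(3).standard_normal((10, 4))          # T = 10 steps, N = 4 nodes
print(virtual_features_over_time(states, [(0, 1), (2, 3)]).shape)   # (10, 2)
```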
Furthermore, although the embodiments are explained with respect to adaptive models in which the synapses (multiplicative units) are implemented as analog circuits each comprising electrical components, with the random numerical parameters being due to random tolerances in the components, in an alternative the multiplicative section of the adaptive model (and indeed optionally the entire adaptive model) may be implemented by one or more digital processors. The numerical parameters of the corresponding multiplicative units may be defined by respective numerical values stored in a memory. The numerical values may be randomly-set, such as by a pseudo-random number generator algorithm.
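Where the multiplicative section is realised digitally, as described above, the fixed random parameters can simply be drawn once from a pseudo-random number generator and kept in memory; a minimal sketch (the distribution and seed are assumptions, the text only requires fixed, randomly-set values):

```python
import numpy as np

def make_fixed_random_weights(D, N, seed=1234):
    """Digital stand-in for the mismatch-defined analog weights: a D x N matrix of
    randomly-set values generated once and then stored, never updated during training."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((D, N))

W = make_fixed_random_weights(D=784, N=128)   # e.g. the mnist configuration discussed above
```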
Also, although the discussion of the embodiments above assumed that the multiplication and summation section received the inputs x to the computer system, and that the virtual hidden layer directly feeds the output layer of the computer system, neither of these assumptions is necessarily the case. Instead, the computer system may be a multi-layer network comprising successively an input layer, two or more hidden layers, and an output layer. At least one of these hidden layers (not necessarily the hidden layer closest to the input layer of the computer system, or the hidden layer closest to the output layer) is random and fixed (i.e. not trained during the training procedure), and this fixed hidden layer is provided at its output with a virtual hidden layer as described above with reference to Fig. 5 or Fig. 6.
In other words, the operation of this single fixed hidden layer of the multi-layer network and its corresponding virtual hidden layer would be as described in Fig. 5 and Fig. 6, but with the input to the fixed layer (the "input signals") being denoted by x, and/or with the "output layer" of Figs. 5 and 6 replaced by the corresponding next hidden layer of the network. During the training of the computer system, the weights of the virtual hidden layer(s) would be changed (e.g. in tandem with the variation of weights in other layers), but the weights of the corresponding fixed hidden layer(s) (multiplication and summation section(s)) are not.
References
The disclosure of the following references is incorporated herein:
[1] G.-B. Huang, H. Zhou, X. Ding, and R. Zhang, "Extreme learning machine for regression and multiclass classification," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 42, no. 2, pp. 513-529, Apr. 2012.
[2] "Computer system incorporating an adaptive model, and methods for training the adaptive model," PCT patent application PCT/SG2016/050450.
[3] M. Jaderberg, A. Vedaldi, and A. Zisserman, "Speeding up Convolutional Neural Networks with Low Rank Expansions," arXiv:1405.3866.
[4] R. Rigamonti, A. Sironi, V. Lepetit, and P. Fua, "Learning separable filters," in Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pp. 2754-2761. IEEE, 2013.
[5] H. O. Song, S. Zickler, T. Althoff, R. Girshick, M. Fritz, C. Geyer, P. Felzenszwalb, and T. Darrell, "Sparselet models for efficient multiclass object detection," in Computer Vision - ECCV 2012, pp. 802-815. Springer, 2012.
[6] H. O. Song, T. Darrell, and R. B. Girshick, "Discriminatively activated sparselets," in Proceedings of the 30th International Conference on Machine Learning (ICML-13),

Claims

1. A computer-implemented method for training an adaptive computer system, the computer system including:
a multiplication and summation section arranged to perform multiplicative operations on corresponding ones of a plurality of input signals according to respective numerical parameters, and form a respective plurality of sum values which are the sums of a respective plurality of the results of the multiplicative operations;
a combination unit for generating, from the sum values, a number of feature values which is greater than the number of sum values; and
a processing unit for receiving the feature values, and generating at least one output as a function of the feature values and a respective set of variable parameters; wherein the combination unit is arranged to generate one or more of the feature values from corresponding sub-sets of the sum values comprising a plurality of the sum values, each of the one or more feature values being the output of a corresponding arithmetic operation performed on the corresponding sub-set of the sum values;
the method including:
training the computer system based on a corpus of training examples relating inputs to the computer system and respective outputs of the computer system,
wherein the training is performed by varying the variable parameters without varying the multiplicative operations or the arithmetic operations.
2. A method according to claim 1 in which, for each of the plurality of the feature values, the corresponding arithmetic operation is a function of a weighted sum of the corresponding sub-set of sum values.
3. A method according to claim 2 in which the function is a non-linear function.
4. A method according to claim 2 or claim 3, in which in each weighted sum, each of the sub-set of the sum values is weighted by a corresponding non-zero weight, each of the weights being proportional to ±2^n, where n is a corresponding integer.
5. A method according to any preceding claim in which one or more of the feature values are obtained from corresponding sub-sets of sum values which consist of two of the sum values.
6. A method according to claim 5 in which substantially all the plurality of sub-sets of sum values consist of two of the sum values.
7. A method according to claim 5 or 6, when dependent upon claim 2, in which, for the one or more feature values, the weighted sum is a difference of the two sum values.
8. A method according to claim 5 or 6, when dependent upon claim 2, in which, for the one or more feature values, the weighted sum is a mean of the two sum values.
9. A method according to any preceding claim in which the multiplicative units implement multiplication by respective random multiplicative factors.
10. A method according to any preceding claim in which at least the multiplicative section is implemented by an integrated circuit in which each of the multiplicative units is formed by static circuitry.
11. A method according to any preceding claim in which the combination unit is arranged to generate substantially all the feature values received by the processing unit from corresponding sub-sets of the sum values comprising a plurality of the sum values, each feature value being the output of a corresponding arithmetic operation performed on the corresponding sub-set of the sum values.
12. An apparatus for training an adaptive computer system, the computer system including:
a multiplication and summation section arranged to perform multiplicative operations on corresponding ones of a plurality of input signals according to respective numerical parameters, and form a respective plurality of sum values which are the sums of a respective plurality of the results of the multiplicative operations;
a combination unit for generating, from the sum values, a number of feature values which is greater than the number of sum values; and
a processing unit for receiving the feature values, and generating at least one output as a function of the feature values and a respective set of variable parameters; wherein the combination unit is arranged to generate one or more of the feature values from corresponding sub-sets of the sum values comprising a plurality of the sum values, each of the one or more feature values being the output of a corresponding arithmetic operation performed on the corresponding sub-set of the sum values;
the apparatus including a processor and a memory device storing program instructions operative, when performed by the processor, to train the computer system, by performing a method according to any preceding claim.
13. A computer system including:
a multiplicative section comprising a plurality of multiplicative units, the multiplicative units being arranged to perform multiplicative operations on corresponding ones of a plurality of input signals according to respective numerical parameters;
a summation section comprising a plurality of sum units for forming a respective plurality of sum values, each sum value being the sum of a respective plurality of the results of the multiplicative operations;
a combination unit for generating, from the sum values, a number of feature values which is greater than the number of sum values; and
a processing unit for receiving the feature values, and generating an output as a function of the feature values and a respective set of variable parameters; wherein at least the multiplicative section is implemented by an integrated circuit in which the multiplicative units are respective non-programmable electronic circuits; and
wherein the combination unit is arranged to generate one or more of the feature values from corresponding sub-sets of the sum values comprising a plurality of the sum values, each of the one or more feature values being the output of a corresponding arithmetic operation performed on the corresponding sub-set of the sum values.
PCT/SG2018/050419 2017-08-18 2018-08-17 Adaptive computer system, and methods and apparatus for training the adaptive computer system WO2019035770A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
SG10201706772V 2017-08-18
SG10201706772V 2017-08-18

Publications (1)

Publication Number Publication Date
WO2019035770A1 true WO2019035770A1 (en) 2019-02-21

Family

ID=65362025

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/SG2018/050419 WO2019035770A1 (en) 2017-08-18 2018-08-17 Adaptive computer system, and methods and apparatus for training the adaptive computer system

Country Status (1)

Country Link
WO (1) WO2019035770A1 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111025041A (en) * 2019-11-07 2020-04-17 深圳供电局有限公司 Electric vehicle charging pile monitoring method and system, computer equipment and medium
CN111045407A (en) * 2019-12-23 2020-04-21 东莞东阳光科研发有限公司 Method and device for controlling production of corrosion foil
CN114781692A (en) * 2022-03-24 2022-07-22 国网河北省电力有限公司衡水供电分公司 Short-term power load prediction method and device and electronic equipment
WO2023037503A1 (en) * 2021-09-10 2023-03-16 Nec Corporation Computing device, learning control device, computing method, learning control method, and storage medium
CN114781692B (en) * 2022-03-24 2024-11-08 国网河北省电力有限公司衡水供电分公司 Short-term power load prediction method and device and electronic equipment

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5812992A (en) * 1995-05-24 1998-09-22 David Sarnoff Research Center Inc. Method and system for training a neural network with adaptive weight updating and adaptive pruning in principal component space
CN105184368A (en) * 2015-09-07 2015-12-23 中国科学院深圳先进技术研究院 Distributed extreme learning machine optimization integrated framework system and method
CN107679543A (en) * 2017-02-22 2018-02-09 天津大学 Sparse autocoder and extreme learning machine stereo image quality evaluation method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
KASUN L. L. C. ET AL.: "Representational Learning with ELMs for Big Data", IEEE INTELLIGENT SYSTEMS, vol. 28, no. 6, 31 December 2013 (2013-12-31), pages 31 - 34, XP055575768 *
KIAEE F. ET AL.: "A double-layer ELM with added feature selection ability using a sparse Bayesian approach", NEUROCOMPUTING, vol. 216, 8 August 2016 (2016-08-08), pages 371 - 380, XP029784942, [retrieved on 20181024], DOI: doi:10.1016/j.neucom.2016.08.011 *


Similar Documents

Publication Publication Date Title
Lee et al. Complex-valued neural networks: A comprehensive survey
Dai et al. NeST: A neural network synthesis tool based on a grow-and-prune paradigm
US20180356771A1 (en) Computer system incorporating an adaptive model and methods for training the adaptive model
US10311375B2 (en) Systems and methods for classifying electrical signals
Durbin et al. Product units: A computationally powerful and biologically plausible extension to backpropagation networks
Marino Predictive coding, variational autoencoders, and biological connections
US20100312736A1 (en) Critical Branching Neural Computation Apparatus and Methods
Greenblatt et al. Introducing quaternion multi-valued neural networks with numerical examples
CN108364064A (en) Operate method, corresponding network, device and the computer program product of neural network
WO2019035770A1 (en) Adaptive computer system, and methods and apparatus for training the adaptive computer system
Chowdhury et al. Spatio-temporal pruning and quantization for low-latency spiking neural networks
Yin et al. Single-cell based random neural network for deep learning
Eshraghian et al. The fine line between dead neurons and sparsity in binarized spiking neural networks
Shen et al. Brain-inspired neural circuit evolution for spiking neural networks
Bizopoulos et al. Sparsely activated networks
Gautam et al. CNN-VSR: a deep learning architecture with validation-based stopping rule for time series classication
Molin et al. A neuromorphic proto-object based dynamic visual saliency model with a hybrid fpga implementation
Wei et al. Automatic group-based structured pruning for deep convolutional networks
Lee et al. Exact gradient computation for spiking neural networks via forward propagation
Xia et al. Structured Bayesian compression for deep neural networks based on the turbo-VBI approach
Löhr et al. Complex neuron dynamics on the IBM TrueNorth neurosynaptic system
Ortega-Zamorano et al. FPGA implementation of neurocomputational models: comparison between standard back-propagation and C-Mantec constructive algorithm
Salehinejad et al. Pruning of convolutional neural networks using ising energy model
Divya et al. Human epithelial type-2 cell image classification using an artificial neural network with hybrid descriptors
Berlin et al. R-STDP based spiking neural network for human action recognition

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18846201

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18846201

Country of ref document: EP

Kind code of ref document: A1