WO1991000591A1 - Pattern recognition - Google Patents

Pattern recognition

Info

Publication number
WO1991000591A1
Authority
WO
WIPO (PCT)
Prior art keywords
weight
input
vector
weights
training
Prior art date
Application number
PCT/GB1990/001002
Other languages
French (fr)
Inventor
Philip Charles Woodland
Original Assignee
British Telecommunications Public Limited Company
Priority date
Filing date
Publication date
Application filed by British Telecommunications Public Limited Company
Publication of WO1991000591A1
Priority to FI916155A
Priority to HK132896A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24317Piecewise classification, i.e. whereby each classification requires several discriminant rules
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/16Speech classification or search using artificial neural networks


Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Complex Calculations (AREA)
  • Image Analysis (AREA)

Abstract

A neural net is trained on training data, with the weight values allowed to increase only up to a predetermined maximum value M; any weight values which would otherwise exceed M are set equal to M. The method is useful for training multi-layer perceptrons for speech recognition, and results in weight values which are more easily quantised and hence give more robust performance.

Description

PATTERN RECOGNITION
This invention relates to pattern recognition apparatus or the like using neural networks, and a method of producing such apparatus; in particular, but not exclusively, using neural networks of the multi-layer perceptron (MLP) type.
Neural networks of this type in general comprise a plurality of parallel processing units ("neurons"), each connected to receive a plurality of inputs comprising an input vector, each input being connected to one of a respective plurality of weighting units, the weighting factors of which comprise a respective weight vector, the output of the neuron being a scalar function of the input vector and the weight vector. Typically, the output is a function of the sum of the weighted inputs. Such networks have been proposed since the 1950s (or before), and have been applied to a wide variety of problems such as visual object recognition, speech recognition and text-to-speech conversion.
It is also known to implement such networks using a single, suitably programmed, digital computing device to perform the processing of all such neurons. Although the speed achievable by such an arrangement is of necessity lower than that of a parallel network, the advantages of adaptive pattern recognition can give such an implementation greater speed and simplicity than would the use of conventional pattern recognition techniques.
The perceptron illustrated in Figure 1 consists of simple processing units (neurons) arranged in layers connected together via 'weights' (synapses). The output of each unit in a layer is the weighted sum of the outputs from the previous layer. During training, the values of these weights are adjusted so that a pattern on the input layer is 'recognised' by a particular set of output units being activated above a threshold. Interest in perceptrons faded in the 1960s, and did not revive again until the mid 1980s, when two innovations gave perceptrons new potential. The first was the provision of a non-linear compression following each neuron, which had the effect that the transformation between layers was also non-linear. This meant that, in theory at least, such a device was capable of performing complex, non-linear mappings. The second innovation was the invention of a weight-adjustment algorithm known as the 'generalised delta rule'. These innovations are discussed in
Rumelhart, D.E., Hinton, G.E. & Williams, R.J.
(1986). "Learning internal representations by error propagation."
In Parallel Distributed Processing Eds McClelland & Rumelhart. MIT Press.
Since a sequence of non-linear transformations is not, generally, equivalent to a single non-linear transformation, the new perceptron could have as many layers as necessary to perform its complicated mappings. Thus the new device came to be known as the multi-layer perceptron (MLP). The generalised delta rule enabled it to learn patterns by a simple error back-propagation training process. A pattern to be learned is supplied and latched ('clamped') to the input units of the device, and the corresponding required output is presented to the output units. The weights, which connect input to output via the multiple layers, are adjusted so that the error between the actual and required output is reduced. The standard back-propagation training algorithm for these networks employs a gradient descent algorithm with weights and biases adjusted by an amount proportional to the gradient of the error function with respect to each weight; the constant of proportionality is known as the learning rate. A 'momentum' term is also usually added that smooths successive weight updates by adding a constant proportion of the previous weight update to the current one. For large MLPs training on large amounts of data, an alternative algorithm computes a variable learning rate and momentum smoothing. The adaptation scheme ensures that steep gradients do not cause excessively large steps in weight space, but still permits reasonable step sizes with small gradients. This process is repeated many times for all the patterns in the training set. After an appropriate number of iterations, the MLP will recognise the patterns in the training set. If the data is structured, and if the training set is representative, then the MLP will also recognise patterns not in the training set.
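As an illustrative sketch only (not code from the patent; the function and variable names are my own), the update just described, a gradient-descent step scaled by the learning rate plus a momentum term, might be written as follows. The default values of 0.01 and 0.9 are those used in the experiments reported later.

```python
import numpy as np

def delta_rule_update(weights, gradients, prev_updates, learning_rate=0.01, momentum=0.9):
    """One generalised-delta-rule step: gradient descent plus momentum smoothing.

    `weights`, `gradients` and `prev_updates` are lists of NumPy arrays of
    matching shapes (one per layer). Returns the updated weights and the
    updates just applied, which become `prev_updates` on the next call.
    """
    new_weights, updates = [], []
    for w, grad, prev in zip(weights, gradients, prev_updates):
        # step proportional to the error gradient, smoothed by a constant
        # proportion of the previous weight update (the 'momentum' term)
        step = -learning_rate * grad + momentum * prev
        new_weights.append(w + step)
        updates.append(step)
    return new_weights, updates
```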
Such training techniques are discussed in, for example, British Telecom Technology Journal Vol 6, No. 2, April 1988, p131-139, "Multi-layer perceptrons applied to speech technology", N. McCulloch et al, and are well known in the art. Similar training procedures are used for other types of neural network.
To be effective, the network must learn an underlying mapping from input patterns to output patterns by using a sparse set of examples (training data). This mapping should also be applicable to previously unseen data, i.e. the network should generalise well. This is especially important for pattern classification systems in which the data forms natural clusters in the input feature space, such as speech data. In the following, generalisation is defined as the difference between the classification performance on the training data set and that on a test set drawn from a population with the same underlying statistics. Why does a given network fail to generalise? A net is specified by a set of parameters that must be learned from a set of training examples. The more training data is available, the better in general the weight estimates will be, and the more likely the net is to generalise. In all practical applications, the amount of training data available is limited and strategies must be developed to constrain a network so that a limited amount of data will produce good weight values. Limiting the internal complexity of the network (numbers of hidden nodes and connectivity) is one prior method of constraint.
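To make the definition concrete, a small illustrative snippet (not from the patent) of the quantity meant by generalisation here, using figures taken from Table 5 below:

```python
def generalisation_gap(train_accuracy_pct, test_accuracy_pct):
    """Generalisation as defined above: the difference between classification
    performance on the training set and on a test set drawn from a population
    with the same underlying statistics. A smaller gap means better generalisation."""
    return train_accuracy_pct - test_accuracy_pct

# e.g. 99.8% on the training set but only 94.0% on the test set (Table 5,
# 15 hidden nodes, no weight limit) gives a gap of 5.8 percentage points.
print(generalisation_gap(99.8, 94.0))
```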
The standard algorithm used for training multi-layer perceptrons is the error back-propagation algorithm discussed above. To learn a particular mapping, the algorithm adjusts network weight and bias values so as to reduce the sum-squared error between the actual network output and some desired output value. In a classification task such as speech or visual pattern recognition, the correct output is a vector that represents a particular class. However, the same class label will be assigned to an input vector whether or not it is close to a class boundary. Boundary effects, and as importantly for "real-world" problems, noise at the level of the feature description, mean that, to minimise the output error, fast transitions of the output nodes in the input space are required. To build up fast transitions, large weight values need to be used. The back-propagation algorithm, if run for a large number of iterations, builds up such large values. It can be seen that generalisation in the presence of noise and limited training data is promoted if smooth decision surfaces are formed in the input feature space. This means that a small change in input values will lead to only a relatively small change in output value. This smoothness can be guaranteed if the connection weight magnitudes are kept to low values. Although it may not be possible for the network to learn the training set to such a high degree of accuracy, the difference between training set and test set performance decreases and test set performance can increase.
Previously, the generalisation problem has been tackled simply by initialising a network using small random weights and stopping the training process after a small number of cycles. This is done so that the training set should not be learnt in too great detail. Additionally, some workers (for example Haffner et al., "Fast back-propagation learning methods for phonemic neural networks", Eurospeech 89, Paris) have realised that large network weight values lead to poor generalisation ability, and that it is for this reason that training cycles should be limited. However, the correct number of training cycles to choose is dependent on the problem, the network size, network connectivity and on the learning algorithm parameters. Hence, simply limiting training cycles is a poor solution to an important problem.
In any digital hardware implementation of an MLP or other network, the question of weight quantisation must be addressed. It is known that biological neurons do not perform precise arithmetic, so it might be hoped that weights in a neural network would be robust to quantisation. Normally, quantisation takes place after the network has been trained. However, if, as described above, the network has built up large weight values, then node output may depend on small differences between large values. This is an undesirable situation in any numerical computation, especially one in which robustness to quantisation errors is required.
The issue of weight quantisation is also an area that has not been approached with a view to performing MLP training subject to a criterion that will improve quantisation performance. MLP weights would normally be examined after training and then a suitable quantisation scheme devised. It is true that the prior art technique of limiting the number of training cycles, as discussed above, will improve weight quantisation simply because weight values will not yet have grown to large values, but the extent to which it does so depends, as discussed above, on a number of parameters which may be data-related. It is thus not a general-purpose solution to the problem of providing readily quantised weights. We have found that both generalisation performance and robustness to weight quantisation are improved by including explicit weight-range limiting in the MLP training procedure. According to the invention there is provided a method of deriving weight vectors (each comprising a plurality of multibit digital weights each having one of a range of discrete values) for a neural net comprising: vector input means for receiving a plurality of input values comprising an input vector; and vector processing means for generating a plurality of scalar outputs in dependence upon the input vector and respective reference weight vectors; the method comprising the steps of: selecting a sequence of sample input vectors (corresponding to predetermined net outputs); generating, using a digital processing device employing relatively high-precision digital arithmetic, an approximation to the scalar outputs which would be produced by the neural net processing means; generating therefrom an approximation to the outputs of the net corresponding to the respective input vectors; and iteratively modifying the weight vectors so as to reduce the difference between the said approximated net outputs and the predetermined outputs; characterised in that, if the said modifying step would result in the magnitude of a weight, or weight vector, exceeding a predetermined value M, then that weight, or weight vector, magnitude is constrained to be equal to (or less than) M.
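A minimal sketch of the characterising step, assuming the preferred per-weight form of the limit described later (the names here are my own, not the patent's): after each iterative modification, any weight whose magnitude would exceed M is constrained back to M.

```python
import numpy as np

def limit_weights(weights, max_magnitude):
    """Constrain every weight to lie in [-M, +M] after a training update.

    Weights that the back-propagation step would push beyond the limit are
    set equal to the limit, so training proceeds subject to |w| <= M.
    """
    return [np.clip(w, -max_magnitude, max_magnitude) for w in weights]

# Inside a training loop this would follow each weight modification, e.g. using
# the earlier illustrative update sketch:
#   weights, prev_updates = delta_rule_update(weights, gradients, prev_updates)
#   weights = limit_weights(weights, max_magnitude=2.0)
```

The alternative named in the claim, limiting the magnitude of a whole weight vector rather than each component, would instead rescale any vector whose magnitude exceeds M rather than clipping component-wise.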
Our results clearly show the effectiveness of the weight range-limiting technique in both improving generalisation performance and increasing the robustness of the structure to weight quantisation, although it may seem surprising that good test-set accuracy can be obtained by networks with very limited weight ranges (down to ±1.5). It is important to note that good performance is due to the fact that the weight-limiting technique is incorporated into the learning procedure, and hence the MLP parameters are optimised subject to constraints on the weight values. This process can be thought of as incorporating knowledge into the structure by disallowing weight-space configurations that would give poor generalisation due to the inherent noise in data for real-world problems.
Weight limiting also improves general network robustness to numerical inaccuracies - hence weight quantisation performance improves. It is seen here that with suitable weight limits as few as three bits per weight can be used. Large weights can cause a network node to compute small differences between large (weighted) inputs. This in turn gives rise to sensitivity to numerical inaccuracies and leads to a lessening of the inherent robustness of the MLP structure. The weight limited MLP according to the invention is able to deal with inaccuracies in activation function evaluation and low resolution arithmetic better than a network with larger weight values. These factors combine to mean that the technique is useful in any limited precision, fixed-point MLP implementation. The numerical robustness is increased so that, with suitable weight limits, as few as three bits per weight can be used to represent trained weight values.
Simulations on a "real-world" speech recognition problem show that, for a fixed MLP structure, although classification performance on the training set decreases, improved generalisation leads to improved test set performance.
A neural network employing weights derived according to the above method will be distinguishable, in general, from prior art networks because the distribution of the weight values will not be even (i.e. tailing off at high positive and negative weight sizes), but will be skewed towards the maximum level M, with a substantial proportion of weight magnitudes equal to M. It will also, as discussed above, have an improved generalisation performance. The invention thus extends to such a network. Such a network (which need not be a multi-layer perceptron network but could be, for example, a single-layer perceptron) is, as discussed above, useful for speech recognition, but is also useful in visual object recognition and other classification tasks, and in estimation tasks such as echo cancelling or optimisation tasks such as telephone network management. In some applications (for example, speaker-dependent recognition), it may be useful to supply an "untrained" network which includes training means (e.g. a microcomputer) programmed to accept a series of test data input vectors and derive the weights therefrom in the manner discussed above during a training phase.
The magnitude referred to is generally the absolute (i.e. ±) magnitude, and preferably the magnitude of any single weight (i.e. vector component) is constrained not to exceed M.
The invention will now be described by way of example only, with reference to the accompanying drawings, in which: Figure 1 shows schematically a (prior art) perceptron;
Figure 2 shows schematically a multi-layer perceptron;
Figure 3 shows schematically a training algorithm according to one aspect of the invention;
Figure 4a shows schematically an input stage for an embodiment of the invention,
Figure 4b shows schematically a neuron for an embodiment of the invention, and Figure 5 shows schematically a network of such neurons. The following describes an experimental evaluation of the weight range-limiting method of the invention. First, the example problem and associated database are described, followed by a series of experiments and the results. The problem was required to be a "real-world" problem having noisy features but also being of limited dimension.
The problem chosen was the one of speaker independent recognition of telephone quality utterances of "yes" and "no" using a simple feature description. This task has all the required properties listed above and also the added advantages of known performance using several other techniques; see
Woodland, P.C. & Millar, W. (1989). "Fixed dimensional classifiers for speech recognition". In Speech and Language Processing Eds. Wheddon & Linggard. Kogan-Page.
The speech data used was part of a large database collected by British Telecom from long-distance telephone calls over the public switched telephone network. It consisted of a single utterance of the words "yes" and "no" from more than 700 talkers. 798 utterances were available for MLP training and a further 620 for testing. The talkers in the training set were completely distinct from the test set talkers. The speech was digitally sampled at 8 kHz and manually endpointed. The resulting data set included samples with impulsive switching noise and very high background noise levels.
The data was processed by an energy-based segmentation scheme into five variable-length portions. Within each segment, low-order LPC analysis was used to produce two cepstral coefficients and the normalised prediction error. The complete utterance was therefore described by a single fifteen-dimensional vector.
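As a sketch of how such a description might be assembled (illustrative only; the segmentation and LPC analysis themselves are outside the scope of this snippet, and the function name is an assumption):

```python
import numpy as np

def utterance_feature_vector(segment_features):
    """Build the fifteen-dimensional utterance description.

    `segment_features` is a list of five (c1, c2, err) tuples, one per
    variable-length segment, where c1 and c2 are the two cepstral coefficients
    from the low-order LPC analysis and err is the normalised prediction error
    for that segment.
    """
    assert len(segment_features) == 5, "energy-based segmentation gives five portions"
    vector = np.array([value for segment in segment_features for value in segment])
    assert vector.shape == (15,)  # 5 segments x 3 features per segment
    return vector
```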
A number of MLPs were trained using this database to assess the effects of weight range-limiting on generalisation and weight quantisation. In all cases the
MLPs had a single hidden layer, full connection between the input and hidden layers and full connection between the hidden and output layers. There were no direct input/output connections. The back-propagation training algorithm was used with updating after each epoch, i.e. after every input/output pattern has been presented. This update scheme ensures that the weights are changed in a direction that reduces error over all the training patterns, and the result of the training procedure does not depend on the order of pattern presentation. All networks used a single output node. During training, the desired output was set to 0.9 for "yes" and to 0.1 for "no". During testing, all utterances that gave an output of greater than 0.5 were classified as "yes" and the remainder as "no". The learning rate used was 0.01 and the momentum constant was set to 0.9. Weight range-limited networks were trained subject to a maximum absolute weight value of M. If, after a weight update, a particular weight exceeded this value, then it was reset to the maximum value. Classification performance results for values of M of 1.5, 2.0, 3.0, 5.0 and no limit are shown in Tables 1 to 5.
TABLE 1 - Performance with M = 1.5
TABLE 2 - Performance with M = 2.0
Number of Hidden Nodes   Training Set Accuracy %   Test Set Accuracy %
 2                        92.9                      92.7
 3                        94.1                      94.0
 4                        94.6                      95.2
 5                        94.9                      95.3
10                        96.9                      95.0
15                        97.9                      95.7
20                        97.6                      95.3
25                        98.1                      96.3
30                        97.9                      95.2
TABLE 3 - Performance with M = 3.0
Number of Hidden Nodes   Training Set Accuracy %   Test Set Accuracy %
 2                        94.4                      95.0
 3                        95.2                      95.2
 4                        96.2                      95.3
 5                        97.0                      94.8
10                        98.9                      95.3
15                        99.0                      95.2

TABLE 4 - Performance with M = 5.0

Number of Hidden Nodes   Training Set Accuracy %   Test Set Accuracy %
 2                        95.6                      95.5
 3                        96.5                      95.3
 4                        97.6                      94.7
 5                        97.9                      95.3
10                        99.5                      94.3
15                        99.5                      93.7
TABLE 5 - Performance with no weight limits
Number of Hidden Nodes   Training Set Accuracy %   Test Set Accuracy %
 2                        97.0                      95.2
 3                        97.9                      93.7
 4                        98.6                      95.0
 5                        98.9                      94.5
10                        99.8                      94.0
15                        99.8                      94.0
It can be seen from these tables that as M is increased the training set accuracy increases for a given number of hidden nodes, tending towards the figures with no weight constraints. It can also be seen that training set and test set accuracies converge as M is decreased, i.e. generalisation performance improves. Further, as the number of hidden nodes increases, the test set performance gets worse for the no-limit case, while for small values of M this effect is not in evidence. It should be noted that the best test set performance occurs for M = 2.0 with 25 hidden nodes. Also, for all finite values of M tested, there is at least one MLP configuration that gives superior test-set performance to the best achieved with no weight limiting.
Experiments to ascertain the relative degradations due to weight quantisation were also performed. Once the network had been fully trained using a floating-point arithmetic training algorithm, the weights and biases were quantised into a fixed (odd) number, L, of equally spaced levels. The maximum absolute value in the network determined the maximum positive and maximum negative level value. Each network weight and bias was then mapped to the nearest quantisation level value and network testing performed using floating-point arithmetic. The effects of weight quantisation after training were tested for the 15 hidden node networks described above. Table 6 gives the RMS error (between actual and desired outputs) on the training set for different numbers of levels and a range of maximum absolute weights.
TABLE 6 - RMS error on training set for differing L & M
From Table 6 the approximate number of levels required, and hence the number of bits per weight, for no significant degradation can be calculated. Table 7 lists the number of bits per weight required such that the RMS error is less than 5% greater than for unquantised weights for each of the values of M under consideration.
TABLE 7 - Bits/weight required
Maximum weight M   Bits per weight required
1.5                3
2.0                4
3.0                6
5.0                5
∞ (no limit)       8
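The quantisation procedure described above might be sketched as follows (an illustrative reconstruction, not code from the patent; the names are assumptions): each weight and bias is mapped to the nearest of L equally spaced levels spanning ± the largest absolute value in the network, and the number of bits per weight follows as the smallest number of bits able to index L levels.

```python
import math
import numpy as np

def quantise_weights(weights, num_levels):
    """Quantise trained weights and biases to an odd number of equally spaced levels.

    The maximum absolute value in the network fixes the most positive and most
    negative level; each value is then mapped to its nearest level. Network
    testing would proceed with these quantised values.
    """
    assert num_levels % 2 == 1, "an odd number of levels keeps zero representable"
    w_max = max(float(np.max(np.abs(w))) for w in weights)
    levels = np.linspace(-w_max, w_max, num_levels)
    quantised = [levels[np.argmin(np.abs(levels - w[..., None]), axis=-1)] for w in weights]
    bits_per_weight = math.ceil(math.log2(num_levels))  # e.g. 7 levels -> 3 bits
    return quantised, bits_per_weight
```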
Although, as discussed, the invention is primarily intended for implementation in neural networks of the well known perceptron type which need no further description to the skilled man, some exemplary embodiments will now be briefly disclosed with reference to Figures 4a, 4b & 5 (with particular reference to speech recognition).
A real-world input (speech) signal is received, sampled, digitised, and processed to extract a feature vector by an input processing circuit 1. For a speech input, the sampling rate may be 8 kHz; the digitisation may involve A-law PCM coding; and the feature vector may for example comprise a set of LPC coefficients, Fourier transform coefficients or, preferably, mel-frequency cepstral coefficients.
For speech, the start and end points of the utterance are determined by an endpointing device 1a, using for example the method described in Wilpon J.G., Rabiner and Martin T.: 'An improved word-detection algorithm for telephone quality speech incorporating both syntactic and semantic constraints', AT&T Bell Labs Tech J, 63 (1984) (or any other well-known method), and between these points n-dimensional feature vectors X are supplied periodically (for example, every 10-100 msec) to the input node 2 of the net. The vector X may be a set of 15 8-bit coefficients x_i (i.e. n = 15).
The feature vectors are conveniently supplied in time-division multiplexed form to a single input 2a connected to common data bus 2b.
The input bus is connectable to each neuron 3a, 3b, 3c in the input layer.
Referring to Figure 4b, each comprises a weight vector ROM 5 storing the 15 weight values. The clock signal controlling the multiplex also controls the address bus of the ROM 5, so that as successive input values x_i (i = 1 to n) of the input vector X are successively supplied to a first input of hardware digital multiplier 6, corresponding weight values w_i are addressed and placed on the data bus of the ROM 5 and supplied to a second input of the multiplier 6. The weighted input value produced at the output of the multiplier 6 is supplied to a digital accumulating adder 7, and added to the current accumulated total. The clock signal also controls a latch 8 on the output of the accumulating adder 7 so that when all weighted input values of an input vector are accumulated, the (scalar) total is latched in latch 8 for the duration of the next cycle. This is achieved by dividing the clock by n. The accumulating adder 7 is then reset to zero (or to any desired predetermined bias value). The total y (= Σ x_i w_i) is supplied to a non-linear compression circuit 9 which generates in response an output of compressed range, typically given by the function

c(y) = 1 / (1 + exp(-y))
The compression circuit 9 could be a lookup ROM, the total being supplied to the address lines and the output being supplied from the data lines but is preferably as disclosed in our UK application GB8922528.8 and the article 'Efficient Implementation of piecewise linear activation function for digital VLSI Neural Networks'; Myers et al, Electronics letters, 23 November 1989, Vol 25 no 24. Referring to Figure 5, the (latched) scalar output Y of each neuron is connected as an input to the neurons 10a, 10b, 10c of the intermediate or 'hidden' layer. Typically, fewer neurons will be required in the hidden layer. The output values may be multiplexed onto a common intermediate bus 11 (but clocked at a rate of 1/n that of the input values by, for example, a flipflop circuit) in which case each hidden layer neuron 10a, 10b, 10c may be equivalent to the input layer neurons 3a, 3b, 3c, receiving as their input vector the output values of the input layer. This is a useful embodiment in that it allows a flexible implementation of various nets with low interconnection complexity.
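In software terms, the arithmetic performed by each neuron of Figure 4b amounts to a multiply-accumulate followed by the compressive non-linearity; a floating-point sketch (illustrative only, with assumed names; the hardware forms one product per clock cycle):

```python
import numpy as np

def neuron_output(x, w, bias=0.0):
    """Forward computation of one neuron of Figure 4b.

    In the hardware embodiment each product w_i * x_i is formed by multiplier 6
    and accumulated by adder 7, one per clock cycle, and the total is latched
    (latch 8) before being passed to the compression circuit 9. Here the same
    arithmetic is done in a single step.
    """
    y = float(np.dot(w, x)) + bias       # accumulated total (bias = preset of adder 7)
    return 1.0 / (1.0 + np.exp(-y))      # c(y) = 1 / (1 + exp(-y))
```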
The neurons 10a, 10b, 10c of the hidden layer likewise supply scalar outputs which act as inputs to neurons 12a, 12b of the output layer. In a classification application, there is typically an output neuron for each class (and possibly others; for example one for "not recognised"). For a speech recogniser intended to recognise "yes" and "no", two output neurons 12a, 12b could be provided as shown (although a single output could be used to discriminate between two classes). Each receives as its input vector the set of outputs of the hidden layer below, and applies its weight vector (stored, as above, in ROMs) via a multiplier, to produce a net output value. Preferably the output layer neurons also use a compression function. The class which corresponds to the output layer neuron producing the largest output value is allocated to the input vector (although other 'decision rules' could be used).
Various modifications and substitutions to the above example are possible.
Further intermediate layers may be included. The skilled man will realise that the functions of all the neurons of one or more layers could be performed by a single computing device sequentially, in environments where high speed is inessential; the multiply and accumulate/add operations are of course common in microprocessors and in dedicated digital signal processing (DSP) devices such as the Texas Instruments TMS320 family or the Motorola 56000 family.
Equally, all the ROMs 5 could be realised as areas of a single memory device. Alternatively, the invention may be realised using hybrid analogue digital networks in which digital weights act on analogue signals, using neurons of the type shown in Figs 11 and 12 of WO 89/02134 (assigned to the present applicants).
In applications such as hybrid video coding, speaker recognition or speaker-dependent speech recognition, the network needs to be 'trainable' - that is, it must be capable of devising satisfactory weight values on training data. Accordingly, in this type of apparatus a weight adjusting device 13 is provided; typically a microprocessor operating according to a stored program. The input test patterns are supplied to the input 1, and the initial weight values are adjusted, using an error back-propagation algorithm, to reduce the difference between the net outputs (to which the device 13 is connectable) and the predetermined outputs corresponding to the inputs. The device 13 thus calculates the output error; calculates the necessary weight value increment for each weight; limits the weight magnitudes (if necessary) to M; accesses the weights in each store 5 (which must, of course, in this embodiment be a read/write store, not a ROM); adds the increments; and rewrites the new weight values to the stores 5; the method is discussed above. Consequently, the adjusting device 13 is connected to the address busses of all stores 5 and to the net outputs. It is also connected to a source of correct output values; for example, in training a speaker-dependent word recogniser these are derived from a prompt device (not shown) which instructs the speaker to speak a given word. The correct output (say 0.9) for the output neuron corresponding to that word is supplied to weight adjuster 13 together with correct outputs (say 0.1) for all other output neurons.

Claims

1. A method of deriving weight vectors (each comprising a plurality of multibit digital weights each having one of a range of discrete values) for a neural net comprising: vector input means for receiving a plurality of input values comprising an input vector; and vector processing means for generating a plurality of scalar outputs in dependence upon the input vector and respective reference weight vectors; the method comprising the steps of: selecting a sequence of sample input vectors (corresponding to predetermined net outputs); generating, using a digital processing device employing relatively high-precision digital arithmetic, an approximation to the scalar outputs which would be produced by the neural net processing means; generating therefrom an approximation to the outputs of the net corresponding to the respective input vectors; and iteratively modifying the weight vectors so as to reduce the difference between the said approximated net outputs and the predetermined outputs; characterised in that, if the said modifying step would result in the magnitude of a weight, or weight vector, exceeding a predetermined value M, then that weight, or weight vector, magnitude is constrained to be equal to (or less than) M.
2. A method according to claim 1, further comprising the step of quantising the thus-derived weights or said weight vectors to a lower precision, one of the quantised levels being said value M.
3. A trainable neural network comprising: vector input means for receiving an input vector; and vector processing means for generating a plurality of scalar outputs in dependence upon the input vector and a plurality of respective reference weight vectors, each comprising a plurality of multibit digital weights each having one of a range of discrete values; further comprising: training means for deriving, in a training phase, the weights of said reference weight vectors; characterised in that the training means includes means for limiting, during the training phase, said reference weight vectors or weights to a predetermined maximum value M by constraining the value of weight vectors or weights which would otherwise exceed M not to do so.
4. A network according to claim 3, further comprising quantising means for quantising said thus-derived weights subsequent to said training phase.
5. A network according to claim 3 or claim 4, arranged to comprise a multi-layer perceptron.
6. A neural network having weight vectors derived by the method of claim 1 or claim 2.
7. A neural network according to claim 6, connected to comprise a multi-layer perceptron.
8. A neural network in which the distribution of weight values is skewed towards a maximum magnitude M, a substantial number of said weights having said magnitude.
9. A method of training a neural network substantially as described herein, with reference to the accompanying Figure 3.
10. A neural network substantially as herein described.
PCT/GB1990/001002 1989-06-30 1990-06-29 Pattern recognition WO1991000591A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
FI916155A FI916155A0 (en) 1989-06-30 1991-12-30 GESTALTIDENTIFIERING.
HK132896A HK132896A (en) 1989-06-30 1996-07-25 Pattern recognition

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB8915085.8 1989-06-30
GB898915085A GB8915085D0 (en) 1989-06-30 1989-06-30 Pattern recognition

Publications (1)

Publication Number Publication Date
WO1991000591A1 true WO1991000591A1 (en) 1991-01-10

Family

ID=10659352

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB1990/001002 WO1991000591A1 (en) 1989-06-30 1990-06-29 Pattern recognition

Country Status (7)

Country Link
EP (1) EP0484344A1 (en)
JP (1) JPH04506424A (en)
CA (1) CA2063426A1 (en)
FI (1) FI916155A0 (en)
GB (2) GB8915085D0 (en)
HK (1) HK132896A (en)
WO (1) WO1991000591A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2258311A (en) * 1991-07-27 1993-02-03 Nigel Andrew Dodd Monitoring a plurality of parameters
DE4300159A1 (en) * 1993-01-07 1994-07-14 Lars Dipl Ing Knohl Reciprocal portrayal of characteristic text
DE4436692A1 (en) * 1993-10-14 1995-04-20 Ricoh Kk Training system for a speech (voice) recognition system
US11106973B2 (en) 2016-03-16 2021-08-31 Hong Kong Applied Science and Technology Research Institute Company Limited Method and system for bit-depth reduction in artificial neural networks

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0683781B2 (en) * 1986-03-14 1994-10-26 住友化学工業株式会社 Method for producing flaky material

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5040230A (en) * 1988-01-11 1991-08-13 Ezel Incorporated Associative pattern conversion system and adaptation method thereof

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
IEEE Expert, Vol. 3, No. 1, Spring 1988, IEEE, (New York, NY, US), P.D. WASSERMAN et al.: "Neural Networks, part 2", pages 10-15 *
IEEE International Conference on Neural Networks, San Diego, 24-27 July 1988, IEEE, (New York, US), N.H. FARHAT et al.: "Bimodal Stochastic Optical Learning Machine", pages 365-372 *
Proceedings of the Fall Joint Computer Conference, San Francisco, 9-11 December 1968, (AFIPS Conference Proceedings, Vol. 33, part 2), The Thompson Book Co., (Washington, US), E.R. IDE et al.: "Some Conclusions on the use of Adaptive Linear Decision Functions", pages 1117-1124 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2258311A (en) * 1991-07-27 1993-02-03 Nigel Andrew Dodd Monitoring a plurality of parameters
GB2258311B (en) * 1991-07-27 1995-08-30 Nigel Andrew Dodd Apparatus and method for monitoring
US5621858A (en) * 1992-05-26 1997-04-15 Ricoh Corporation Neural network acoustic and visual speech recognition system training method and apparatus
DE4300159A1 (en) * 1993-01-07 1994-07-14 Lars Dipl Ing Knohl Reciprocal portrayal of characteristic text
DE4436692A1 (en) * 1993-10-14 1995-04-20 Ricoh Kk Training system for a speech (voice) recognition system
DE4436692C2 (en) * 1993-10-14 1998-04-30 Ricoh Kk Training system for a speech recognition system
US11106973B2 (en) 2016-03-16 2021-08-31 Hong Kong Applied Science and Technology Research Institute Company Limited Method and system for bit-depth reduction in artificial neural networks

Also Published As

Publication number Publication date
JPH04506424A (en) 1992-11-05
GB2253295B (en) 1993-11-03
GB8915085D0 (en) 1989-09-20
EP0484344A1 (en) 1992-05-13
GB9127502D0 (en) 1992-03-11
GB2253295A (en) 1992-09-02
HK132896A (en) 1996-08-02
CA2063426A1 (en) 1990-12-31
FI916155A0 (en) 1991-12-30

Similar Documents

Publication Publication Date Title
Virtanen Sound source separation using sparse coding with temporal continuity objective
US5095443A (en) Plural neural network system having a successive approximation learning method
US5033087A (en) Method and apparatus for the automatic determination of phonological rules as for a continuous speech recognition system
US5621848A (en) Method of partitioning a sequence of data frames
EP0623914A1 (en) Speaker independent isolated word recognition system using neural networks
NZ331431A (en) Speech processing via voice recognition
TW202001874A (en) Voice activity detection system
WO1993013519A1 (en) Composite expert
CN111899757A (en) Single-channel voice separation method and system for target speaker extraction
EP1576580A1 (en) Method of optimising the execution of a neural network in a speech recognition system through conditionally skipping a variable number of frames
AU685626B2 (en) Speech-recognition system utilizing neural networks and method of using same
Campbell Analog i/o nets for syllable timing
US5745874A (en) Preprocessor for automatic speech recognition system
WO1991000591A1 (en) Pattern recognition
JPH0540497A (en) Speaker adaptive voice recognizing device
Selouani et al. Automatic birdsong recognition based on autoregressive time-delay neural networks
Woodland Weight limiting, weight quantisation and generalisation in multi-layer perceptrons
US5732393A (en) Voice recognition device using linear predictive coding
CN115116448A (en) Voice extraction method, neural network model training method, device and storage medium
Lee et al. Recurrent neural networks for speech modeling and speech recognition
Renals et al. Learning phoneme recognition using neural networks
CN112786068A (en) Audio source separation method and device and storage medium
Jou et al. Mandarin syllables recognition based on one class one net neural network with modified selective update algorithm
WO2000008634A1 (en) Methods and apparatus for phoneme estimation using neural networks
Reynolds et al. Spoken letter recognition with neural networks

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): CA FI GB JP US

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH DE DK ES FR GB IT LU NL SE

WWE Wipo information: entry into national phase

Ref document number: 1990909478

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: 2063426

Country of ref document: CA

WWE Wipo information: entry into national phase

Ref document number: 916155

Country of ref document: FI

WWP Wipo information: published in national office

Ref document number: 1990909478

Country of ref document: EP

WWR Wipo information: refused in national office

Ref document number: 1990909478

Country of ref document: EP

WWW Wipo information: withdrawn in national office

Ref document number: 1990909478

Country of ref document: EP