WO1993024898A1 - Method of training a neural network - Google Patents

Method of training a neural network Download PDF

Info

Publication number
WO1993024898A1
Authority
WO
WIPO (PCT)
Prior art keywords
network
pram
output
prams
pattern
Prior art date
Application number
PCT/GB1993/001180
Other languages
French (fr)
Inventor
John Gerald Taylor
Denise Gorse
Trevor Grant Clarkson
Original Assignee
University College London
King's College London
Priority date
Filing date
Publication date
Priority claimed from GB929211780A external-priority patent/GB9211780D0/en
Priority claimed from GB929211910A external-priority patent/GB9211910D0/en
Application filed by University College London and King's College London
Publication of WO1993024898A1

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Definitions

  • FIG. 7 shows patterns representing the digits 0 to 9 as the patterns to be recognised.
  • Each pattern is an image of 7 x 5 pixels, so that the hidden layer 10a of the pRAM network needs to consist of five 7-pRAMs, as shown in Figure 8.
  • the pRAMs of the hidden layer are connected to an output layer 20a which consists of four 5-pRAMs.
  • Each pixel of the input image is applied to a respective one of the 35 input lines of the hidden layer 10a.
  • the reinforcement mechanism may be tuned in terms of two aspects: (a) the training rate ρ and (b) the reward and penalty signals r and p.
  • the relevant parameters may be adapted in response to a suitable error measure.
  • Reward and penalty signals may be applied in a graded fashion, dependent on the performance of the network.
  • Figure 9 shows the average firing rate of the output nodes with desired output "1" (denoted by A) and the average firing rate of those with desired output "0" (denoted by B).
  • the output firing rate is calculated by accumulating the output spikes of the network for a constant training period (1000 runs for instance). If over 90% of the outputs meet the target for all the training patterns (the training error is taken in this example as being 0.1), the training finishes. If not, the training continues until the firing rates are within the error limit.
  • the vertical axis represents the firing rates of the pRAM network averaged over a period of time and over the ten patterns, and the horizontal axis represents the percentage noise added to the input images.
  • the generalisation ability of the network is shown by Figure 9, where substantial amounts of noise (i.e. divergence from the training patterns) still lead to a correct output. It can be seen that while maintaining the discrimination property, the generalisation ability increases as the training noise increases. Even with 20% added noise, the network maintains a 20% confidence margin when trained with 10% training noise.

Abstract

A method is described of training a pRAM network, in which a pRAM network is repeatedly presented with a pattern to which it is to be trained to respond, to which noise has been added, and the contents of at least some of the storage locations of the pRAM network are altered in dependence on whether the output of the network represents success or failure in responding to the pattern. The noise may be added by applying the pattern via noise-adding pRAMs, or by some other suitable measure, such as the use of a lookup table.

Description

Method of training a neural network
This invention relates to a method of training a neural network, and to a neural network which is trainable by this method. The invention concerns neural networks composed of probabilistic RAMs (pRAMs), such networks being referred to below as pRAM networks. pRAMs are described, for example, in Proceedings of the First IEE International Conference on Artificial Neural Networks, IEE, 1989, No. 313, pp 242-246, and in two International Patent Applications Nos. WO 92/00572 and WO 92/00573. Attention is directed to these documents, which are incorporated herein by reference, for a full description of pRAMs.
In brief, however, a pRAM is an artificial neuron comprising a memory having a plurality of storage locations at each of which a number representing a probability is stored, the memory having at least one address line to define a succession of storage location addresses, and means for causing a succession of output signals to appear at the output of the device when the storage locations are addressed, the probability of an output signal taking a given one of its two possible values (firing or not firing) being determined by the number at the addressed location. As described in the documents identified in the preceding paragraph, the desired probabilistic effect can be achieved by using a comparator connected to receive as an input the contents of each of the successively addressed locations, and a noise generator for inputting to the comparator a succession of random numbers representing noise.
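The addressing-and-comparator behaviour just described can be put in a few lines of Python. This is an illustrative software sketch, not the patent's hardware realisation; the class and method names are invented for this example:

```python
import random

class PRAM:
    """Minimal sketch of a probabilistic RAM (pRAM) neuron.

    Each of the 2**n memory locations stores a firing probability.
    The binary inputs form an address; the stored number at that
    address is compared with a fresh random number (playing the
    role of the noise generator), and the neuron emits a 1 (a
    spike) when the random number falls below the stored value.
    """

    def __init__(self, n_inputs, initial=0.5):
        self.n = n_inputs
        self.memory = [initial] * (2 ** n_inputs)

    def address(self, inputs):
        # Interpret the binary input lines as a memory address.
        addr = 0
        for bit in inputs:
            addr = (addr << 1) | bit
        return addr

    def fire(self, inputs):
        # Comparator: spike with the probability stored at the
        # addressed location.
        return 1 if random.random() < self.memory[self.address(inputs)] else 0
```

Because `random.random()` returns a value in [0, 1), a stored probability of 1.0 fires on every access and 0.0 never fires.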
A pRAM can be provided with means to enable it to learn, and a form of reinforcement training is described in the above mentioned International Applications, and in a paper entitled "Training strategies for probabilistic RAMs" in Proceedings Parallel Processing in Neural Systems and Computers, March 19, 1990, Düsseldorf, FRG, pages 161-164, incorporated herein by reference. In reinforcement training of a pRAM network, the network is presented with the pattern or patterns to which it is desired that the network should learn to respond, and the contents of the pRAM storage locations are then altered in a sense which depends on whether the network has succeeded or failed in responding correctly to the pattern concerned and which is such that the probability of the network correctly responding to a pattern is increased. This process is carried out repeatedly.
However, noise is present in virtually all patterns which a pRAM network might be called on to respond to in real life. It has now been found that a pRAM network can be trained to respond to noisy patterns, by carrying out the training on patterns to which noise has been deliberately added.
Accordingly the present invention provides a method of training a pRAM network in which a pRAM network is presented, normally repeatedly, with a pattern to which it is to be trained to respond, to which noise has been added, and the contents of at least some of the storage locations of the pRAM network are altered in dependence on whether the output of the network represents success or failure in responding to the pattern. The invention further provides a pRAM network which is adapted to be trained by the foregoing method. It should be understood that although in the examples given below the patterns are visual patterns, they need not be, and that the term is used to cover any type of pattern. It should further be understood that although the following description is in terms of recognising patterns, the invention is also applicable where the response which the network is to produce when presented with a pattern does not amount to recognition.
It will be seen that by adding noise during training to the input vectors which represent the patterns on which the pRAM is being trained, for any vector the probability of all nearest-neighbour vectors being generated is greater than zero. This assists in the recovery of information from the network in response to a previously unknown input vector, in that it gives the network the property of generalisation.
Some examples of the invention will now be described with reference to the accompanying drawings, in which:
Figure 1 shows a pRAM network which is trainable by the method of the present invention;
Figure 2 shows two patterns on which the network of Figure 1 may be trained;
Figure 3 is a graph showing the effect of using different amounts of noise in training on the patterns of Figure 2;
Figure 4 is a graph showing the effect of different amounts of noise on the confidence limits obtainable in relation to the patterns of Figure 2;
Figure 5 shows another set, this time of four patterns, on which the network of Figure 1 may be trained;
Figure 6 is a set of four graphs showing results corresponding to those of Figure 3, but for the patterns of Figure 5;
Figure 7 shows yet another set, this time of ten patterns, on which a pRAM network may be trained;
Figure 8 shows a network for dealing with the patterns of Figure 7; and
Figure 9 is a graph showing results corresponding to those of Figure 3, but for the patterns of Figure 7.
Referring first to Figures 1 and 2, a description will now be given of the training of the network shown in Figure 1 on the patterns shown in Figure 2. In this example the patterns used are binary images of 5 x 5 pixels. The black blocks in an image have the pixel value of 1 and the white blocks have the pixel value of 0. In order to reduce the number of connections needed between neurons and to reduce the amount of memory needed, a pyramidal structure is adopted which has two layers of pRAM nodes, a hidden layer 10 and an output layer 20. The network is fully interconnected, but it is believed that the invention is also applicable where full connectivity does not apply. As is explained below, an additional layer 30 is present for the purpose of training, but this is not used when the network is used after training to recognise a pattern.
In the network shown in Figure 1, the hidden layer consists of five 5-pRAMs and the output layer consists of two 5-pRAMs. A 5-pRAM is one which has five input lines and therefore 2^5 = 32 memory locations (more generally, an n-pRAM has n input lines and 2^n memory locations, where n ≥ 1). Each pixel in the input image is applied to a respective one of the 25 input lines of the hidden layer. The pixels can be applied in any order to the input lines, but in this example the five pixels in the first row of the input image are taken to the five inputs of the first pRAM in the hidden layer, the five pixels in the second row to the inputs of the second pRAM, and so on.
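The pyramidal arrangement just described (five 5-pRAMs in the hidden layer, two in the output layer, one image row per hidden pRAM) can be sketched as follows. The helper name, data layout, and the 0.5 initialisation are illustrative assumptions for this example, not text from the patent:

```python
import random

def pram_fire(memory, inputs):
    # One pRAM output: spike with the probability stored at the
    # address formed by the binary input lines.
    addr = int("".join(str(b) for b in inputs), 2)
    return 1 if random.random() < memory[addr] else 0

# Pyramid of Figure 1 as described in the text: five 5-pRAMs in
# the hidden layer and two 5-pRAMs in the output layer, each with
# 2**5 = 32 memory locations, initialised at 0.5.
hidden = [[0.5] * 32 for _ in range(5)]
output = [[0.5] * 32 for _ in range(2)]

def forward(image):
    """image: 25 binary pixels in row-major 5 x 5 order."""
    # Row i of the image feeds the five inputs of hidden pRAM i.
    rows = [image[i * 5:(i + 1) * 5] for i in range(5)]
    h = [pram_fire(mem, row) for mem, row in zip(hidden, rows)]
    # Both output pRAMs see the five hidden spikes.
    return tuple(pram_fire(mem, h) for mem in output)  # (A, B)
```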
As already explained, the learning process takes place in the presence of noise. This is introduced by the layer 30, which constitutes a "noise layer" and consists of twenty-five 1-pRAMs, one for each of the 25 input lines of the hidden layer 10. Each of these pRAMs sits between one pixel of the input image and the corresponding input line of the pRAM network, so that the input image has a preset amount of added noise. There are two memory locations in a 1-pRAM, here called L0 and L1; illuminated pixels (with a value of 1) are defined here as addressing location L1, and pixels not illuminated (with a value of 0) as addressing location L0. Firing probabilities p_n and 1-p_n are stored in L0 and L1 respectively, so that a pixel value of 0 will have a probability p_n of firing and a pixel value of 1 a probability p_n of not firing. Statistically this is equivalent to p_n*100 percent of pixels being inverted in the training pattern. It should be noted that the added noise does not have to be the same for the inputs of 0 and 1, so that some bias can be imposed. Thus, it is not essential that the memory pair in the training layer 30 be set to p_n and 1-p_n, though this will normally be so.
While all the training noise in the previous discussion has been assumed to be of a linear distribution, any appropriate distribution for the training noise (resembling the noise distribution in the expected input) could be used, in which case the noise might be added by the use of a lookup table mechanism.
Since the value of p_n can be arbitrarily set in the range from 0 to 1, training with different degrees of noise can be realised. This imposes realistic "white" noise on the pattern, such as will be obtained from real transducers (e.g. a camera).
Training can be carried out using the reinforcement rule:
Δα_u(t) = ρ [ (a − α_u) r + λ (ā − α_u) p ] δ_u,u(t)     (1)

where ā = 1 − a is the complement of the output.
In this equation α_u is the content of the selected pRAM memory location u and a is the state of the pRAM output. The Kronecker delta, δ_u,u(t), reflects the fact that only currently accessed locations are adapted. The values of the reward rate (ρ) and penalty rate (λ) are chosen in this example to be 0.1 and 0.05 respectively. For simplicity, the values are kept fixed during training, though faster convergence could be achieved if they were tuned during training, and this is described below with reference to another example. The value of acceptable training error was set at 0.05. r and p are the reward and penalty factors. The stochastic property of these factors is an important feature in reinforcement training of pRAM networks, because it allows the possibility of "neutral" actions which are neither punished nor rewarded but which may correspond to a useful exploration of the environment.
The above rule is used to provide global reinforcement, and in every pRAM in the network the contents of the memory location which was accessed to give the output concerned is updated. However, alternative training schemes are possible, for example one in which only the memory contents in a particular layer are updated. The training rule of equation (1) may be applied using software in a conventional computer to calculate the required alteration to the memory contents of the pRAMs, the computer then sending the updated memory contents to the pRAMs. Alternatively, the operation may be done in dedicated hardware, and examples are disclosed in detail in the above mentioned International Applications.
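A single application of the reinforcement update to one pRAM might be sketched as below, under the reading that the accessed memory content α_u moves toward the output a under reward and toward its complement under penalty, with ρ and λ the rates given in the text. The clamping to [0, 1] is an added safeguard, not something stated in the source:

```python
def reinforce(memory, addr, a, r, p, rho=0.1, lam=0.05):
    """One step of the reinforcement rule for a single pRAM:

        delta = rho * ((a - alpha)*r + lam*((1 - a) - alpha)*p)

    where alpha is the stored firing probability at the accessed
    location, a is the pRAM's output (0 or 1), and r and p are
    the binary reward and penalty signals.  Only the accessed
    location is updated, which plays the role of the Kronecker
    delta in the rule.
    """
    alpha = memory[addr]
    abar = 1 - a
    delta = rho * ((a - alpha) * r + lam * (abar - alpha) * p)
    # Keep the stored value a valid probability (safeguard).
    memory[addr] = min(1.0, max(0.0, alpha + delta))
    return memory[addr]
```

Under reward (r = 1, p = 0) the stored probability moves toward the action just taken; under penalty it drifts slowly toward the opposite action, the λ factor making penalties gentler than rewards.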
It is observed that training easily converges in this simple kind of pyramidal pRAM network. This enables us to take r = 1 − p, with r, p ∈ {0, 1}.
Another important feature in pRAM network training, which differs from at least most training methods used in non-pRAM networks, lies in the initial weight values of the network. Normally where weights are used to denote the connection strength between neurons, a small real number near to zero is set as the initial value of each weight. In pRAM training, however, the memory contents are firing probabilities rather than deterministic connection weights. The firing probabilities near to 0 possess the same confidence in the pRAM's behaviour as those near to 1. Therefore the initial values of the memory contents of the pRAMs in layers 10 and 20 are set to be 0.5 or 0.5+e, where e is a small fraction which varies randomly from pRAM to pRAM.
Outputs (A B) in Figure 1 are the output pair of the pRAM network. The output (1 0) is defined to be the training target for pattern 0 of Figure 2, and output (0 1) the training target for pattern 1. A reward signal is sent when the output pair meets the target and a penalty signal is sent otherwise. This is carried out for a predetermined training period (1000 iterations for instance). The output firing rate is calculated by accumulating the output spikes of the network over a fixed window at the end of the training period (say the last 100 iterations). If over 95% (since the training error is defined as 0.05) of the outputs in the window meet the target for both the training patterns, the training finishes. If not, the training continues, moving the window of 100 iterations forwards, until the firing rates are within the error limit.
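The sliding-window stopping criterion described above can be sketched as a simple test applied after each presentation. The function and variable names are invented for this reading of the text; each history records, per iteration, whether the corresponding output met its target:

```python
def trained(history_a, history_b, window=100, error=0.05):
    """Windowed stopping test: training is deemed complete once,
    over the last `window` presentations, both outputs have met
    their targets on more than (1 - error) of the iterations,
    e.g. more than 95% for an acceptable training error of 0.05.
    """
    if len(history_a) < window or len(history_b) < window:
        return False
    rate_a = sum(list(history_a)[-window:]) / window
    rate_b = sum(list(history_b)[-window:]) / window
    return rate_a > 1 - error and rate_b > 1 - error
```

In use, the training loop would append a 1 or 0 to each history every iteration and keep presenting (noisy) patterns until `trained(...)` returns True, which matches the moving-window behaviour described.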
Figure 3 shows the results of training the network using different levels of noise and then presenting to it, for recognition, pattern 0 in the presence successively of various levels of noise. For each level of training noise, the data comes from a set of five pattern recognition trials each with a different amount of noise in the pattern to be recognised, and each trial comprises 200 pattern recognition runs. The vertical axis represents the firing rates of the pRAM network averaged over a period of time, and the horizontal axis represents the percentage noise added to the input patterns.
In the first set of pattern recognition trials, the network is trained alternately on two patterns to completion with no training noise (TN) added (TN = 0%). Thereafter it is presented, for recognition, with the pattern 0. The pattern is presented successively with increasing levels of noise (0, 5, 10, 20 and 30%) and the output recorded.
In the next set of pattern recognition trials the network is trained to completion in the presence of 5% noise, and again presented with pattern 0 for recognition with the five different levels of noise. This process is repeated with 10%, 20% and 30% training noise.
The generalisation ability of the network is shown by Figure 3, where substantial amounts of noise (i.e. divergence from the training pattern) still lead to a correct output. An output is correctly discriminated if the associated output pRAM (A or B) fires for more than 50% of the time. It can be seen that while maintaining the discrimination property, the generalisation ability increases as the training noise increases. Even with 45% added noise in the pattern presented for recognition, the network maintains a 20% confidence margin when trained with 30% training noise. The way in which the confidence of the result varies with the level of the training noise and the noise present in the input pattern is shown in Figure 4. This shows that the confidence level decreases with input noise but increases with training noise.
However, training time increases exponentially with training noise, as demonstrated by Table 1 below, which gives the results of an experiment done to determine the number of iterations required to reach an error level of 0.05 for a given level of training noise.
TABLE 1
training noise iterations
0% 200
5% 400
10% 500
20% 800
30% 1500
35% 2750
Thus, it might be appropriate to select a lower value of training noise in applications where fast training is a priority.
The sequence in which the input patterns are applied to the network is an important factor which influences the training results. The training scheme above adopts a regular sequence in which the patterns are applied in the order: pattern 0, pattern 1, pattern 0, pattern 1, and so on. This is referred to as alternate training. Other sequences may be used, such as random training, in which the input patterns are applied in a random order, and sequential training, in which the input patterns are applied one at a time without being interleaved. Since the memory contents are not clamped during this training, a different sequence will generate different results in terms of generalisation, discrimination and training times.
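The three presentation orders described above (alternate, random and sequential) can be sketched as simple index generators; this is illustrative only, and the function names are assumptions:

```python
import random

def alternate_order(n_patterns, cycles):
    """Alternate training: 0, 1, ..., n-1 repeated cycle after cycle."""
    return [i for _ in range(cycles) for i in range(n_patterns)]

def random_order(n_patterns, presentations, rng=random):
    """Random training: each presentation drawn independently at random."""
    return [rng.randrange(n_patterns) for _ in range(presentations)]

def sequential_order(n_patterns, repeats):
    """Sequential training: each pattern shown `repeats` times in a row,
    without interleaving, before moving on to the next pattern."""
    return [i for i in range(n_patterns) for _ in range(repeats)]
```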
During training, the output pair (A, B) is coded as (01) or (10), the desired output for each input pattern. This is unary coding, which reserves one output node to "fire" for each input pattern. Other codings, such as binary coding, can be used in the output representation. The output can be coded as (00), (01), (10) and (11) and trained for binary coding where four input patterns are provided. This has been shown to produce similar results to those given in Figure 3, with each output (A or B) responding positively to two of the four training patterns and negatively to the other two (again in the presence of noise). The advantage of using binary coding is that fewer nodes are needed in the output layer, since N output nodes can be coded as the responses for 2^N input patterns, instead of only N input patterns in unary coding.
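The capacity gain of binary over unary coding can be illustrated with a small helper (a sketch; the function name is an assumption) that derives the binary target code for the k-th input pattern:

```python
def target_code(pattern_index, n_outputs):
    """Binary target for the output layer, most significant bit first:
    with two outputs, pattern 2 has the target (1, 0)."""
    return tuple((pattern_index >> (n_outputs - 1 - i)) & 1
                 for i in range(n_outputs))
```

Two output nodes thus distinguish the four patterns (00), (01), (10) and (11); in general, n output nodes distinguish 2^n patterns.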
An example of the use of binary coding is given with reference to Figures 5 and 6. Here the pRAM network of Figure 1 is trained to recognise the four patterns shown in Figure 5. The output (00) is defined to be the training target for pattern 0, output (01) the training target for pattern 1, and so forth. Training is carried out in the same way as in the example of Figures 2 to 4, except that in this case only two levels of training noise were used, 0% and 30%. The recognition ability of the network was tested by presenting it successively with each of the patterns 0, 1, 2 and 3, each pattern being presented with five different levels of noise, namely 0%, 5%, 10%, 20% and 30%. The results are shown in Figure 6. It can be seen that the discrimination of noisy input patterns is improved as the training noise increases, due to the generalisation behaviour of the pRAM network. Even with 45% added noise, the network maintains a 20% confidence margin when trained with 30% training noise.
Another example of binary coding is given in Figures 7 to 9, where Figure 7 shows patterns representing the digits 0 to 9 as the patterns to be recognised. Each pattern is an image of 7x5 pixels, so that the hidden layer 10a of the pRAM network needs to consist of five 7-pRAMs, as shown in Figure 8. The pRAMs of the hidden layer are connected to an output layer 20a which consists of four 5-pRAMs. Each pixel of the input image is applied to a respective one of the 35 input lines of the hidden layer 10a.
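The pRAM model underlying this topology can be sketched minimally as follows (illustrative only; the class name and interface are assumptions, and the patent's full model includes details omitted here): each pRAM stores one firing probability per input address, and emits a 1 with that probability when addressed.

```python
import random

class PRAM:
    """Minimal probabilistic RAM: one firing probability per address."""
    def __init__(self, n_inputs, init=0.5, rng=random):
        self.mem = [init] * (2 ** n_inputs)  # one probability per address
        self.rng = rng

    def _address(self, inputs):
        addr = 0
        for bit in inputs:  # the binary inputs form the memory address
            addr = (addr << 1) | bit
        return addr

    def fire(self, inputs):
        """Emit 1 with the probability stored at the addressed location."""
        return 1 if self.rng.random() < self.mem[self._address(inputs)] else 0

# The topology of Figure 8: five 7-input pRAMs (35 input lines in all)
# feeding an output layer of four 5-input pRAMs.
hidden_layer = [PRAM(7) for _ in range(5)]
output_layer = [PRAM(5) for _ in range(4)]
```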
There are sixteen possible output codes from the output layer. Of these, ten are chosen to represent the ten digits, in accordance with Table 2 below. They are chosen in such a way that every output node has the same number of "1" (active) and "0" (not active) states. This is to take account of the influence of the output Hamming distance on the network's memory distribution. For a discussion of the concept of Hamming distance as the measure of the similarity between two patterns, attention is directed to I. Aleksander & H. Morton, "An Introduction to Neural Computing", Chapman and Hall, London, 1990. In this example, there are five 1s and five 0s for each node, and this maintains the firing rate of the pRAM network at 50% (which means no knowledge) when the input noise increases to about 50%.
TABLE 2 (reproduced as an image in the original document)
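Balanced code sets of the kind used in Table 2 can be found by brute force (an illustrative sketch, not the patent's own procedure; the function name is an assumption): enumerate subsets of the sixteen 4-bit codes and keep those in which every bit position is active in exactly half of the codes.

```python
from itertools import combinations

def balanced_code_sets(n_bits, n_codes):
    """Yield sets of n_codes distinct n_bits-bit codes in which each bit
    position is 1 in exactly n_codes // 2 of the codes."""
    codes = [tuple((c >> (n_bits - 1 - i)) & 1 for i in range(n_bits))
             for c in range(2 ** n_bits)]
    for subset in combinations(codes, n_codes):
        if all(sum(col) == n_codes // 2 for col in zip(*subset)):
            yield subset

# One set of ten 4-bit codes with five 1s and five 0s per output node:
example = next(balanced_code_sets(4, 10))
```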
To ensure the convergence and fast training of the network, the reinforcement mechanism may be tuned in terms of two aspects: (a) the training rate ρ and (b) the reward and penalty signals r and p.
In both cases the relevant parameters (ρ, r or p) may be adapted in response to a suitable error measure.
(a) During training, the scale of the parameter changes is tuned as the output firing rates change. Since, as can be seen from equation (1), rewarding is provided by the factor ρ and penalising by the factor ρλ, this can be achieved by varying ρ alone in response to the performance of the network.
(b) Reward and penalty signals may be applied in a graded fashion, dependent on the performance of the network. The Hamming distance between the real outputs and the desired outputs may be used as a measure of how far the actual output is from its target. In our example, the maximum Hamming distance in the output is 4 (since there are four output nodes) and the minimum Hamming distance is 0. If the Hamming distance is 4, the network state is said to be too far away from the desired output, so a full punishment is applied (i.e. p = 1). If the Hamming distance is 3, a smaller punishment factor is used (e.g. p = 0.75), and so on. Only if the Hamming distance is 0 is a reward factor applied. Results of simulations show that the training converges more quickly when this graded reinforcement method is used (reduced from above 2000 to about 700 iterations), compared with the reinforcement method described earlier in this application.
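The graded scheme described in (b) can be sketched as follows (illustrative; the function names are assumptions): the penalty factor grows linearly with the output Hamming distance, and a reward is applied only on an exact match.

```python
def hamming(a, b):
    """Number of positions in which two equal-length binary tuples differ."""
    return sum(x != y for x, y in zip(a, b))

def graded_signals(actual, desired):
    """Return (reward factor, penalty factor) for one presentation.
    A distance of 4 out of 4 gives full punishment p = 1, a distance
    of 3 gives p = 0.75, and so on; reward applies only at distance 0."""
    d = hamming(actual, desired)
    if d == 0:
        return 1.0, 0.0
    return 0.0, d / len(desired)
```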
For simplification, Figure 9 shows the average firing rate of the output nodes with desired output "1" (denoted by A) and the average firing rate of those with desired output "0" (denoted by B). The output firing rate is calculated by accumulating the output spikes of the network over a constant training period (1000 runs, for instance). If over 90% of the outputs meet the target for all the training patterns (the training error is taken in this example as being 0.1), the training finishes. If not, the training continues until the firing rates are within the error limit. The vertical axis represents the firing rates of the pRAM network averaged over a period of time and over the ten patterns, and the horizontal axis represents the percentage noise added to the input images.
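The stopping criterion just described can be sketched as follows (illustrative; function names are assumptions): accumulate each output node's spikes over the test period, and stop training when every node's firing rate is within the error limit of its target.

```python
def node_meets_target(spikes, target, error=0.1):
    """spikes: accumulated 0/1 outputs of one node over the test period."""
    rate = sum(spikes) / len(spikes)
    return rate >= 1.0 - error if target == 1 else rate <= error

def training_finished(histories, targets, error=0.1):
    """True when every output node meets its target firing rate."""
    return all(node_meets_target(s, t, error)
               for s, t in zip(histories, targets))
```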
The generalisation ability of the network is shown by Figure 9, where substantial amounts of noise (i.e. divergence from the training patterns) still lead to a correct output. It can be seen that while maintaining the discrimination property, the generalisation ability increases as the training noise increases. Even with 20% added noise, the network maintains a 20% confidence margin when trained with 10% training noise.

CLAIMS:
1. A method of training a network comprising a plurality of pRAMs, each pRAM comprising a memory having a plurality of storage locations at each of which a number representing a probability is stored, the memory having at least one address line to define a succession of storage location addresses, and means for causing to appear at the output of the pRAM, when the storage locations are addressed, a succession of output signals each having a first or a second value, the probability of the output signal having a given one of the first and second values being determined by the number at the addressed location, in which the pRAM network is presented with a pattern to which it is to be trained to respond, to which noise has been added, and the contents of at least some of the storage locations of the pRAM network are altered in dependence on whether the output of the network is that desired in response to the pattern.
2. A method as claimed in claim 1, wherein the said network comprises a plurality of input pRAMs constituting a hidden layer and at least one output pRAM connected thereto to constitute an output layer.
3. A method as claimed in claim 2, wherein the said network comprises a plurality of output pRAMs.
4. A method as claimed in claim 3, wherein the outputs of the output pRAMs are treated as constituting a binary code, and the network is trained to generate a given binary code in response to the input of a given pattern.
5. A method as claimed in claim 4, wherein the number of possible binary codes exceeds the number of patterns on which the network is trained, and wherein the binary codes selected to represent the patterns are so chosen that, considered over all the patterns, each output of the output pRAMs has an equal chance of being 0 or 1.
6. A method as claimed in any one of claims 2 to 5, wherein the address lines of the hidden layer pRAMs each receive a respective signal via an additional noise-adding pRAM.
7. A method as claimed in claim 6, wherein there is a plurality of the said noise-adding pRAMs, each being a 1-pRAM having two storage locations, with each address line of the pRAMs in the hidden layer receiving a signal via a respective one of the 1-pRAMs.
8. A method as claimed in claim 7, wherein the sum of the numbers stored at the two storage locations of each noise-adding 1-pRAM is 1.
9. A method as claimed in any preceding claim, wherein training is carried out, for each pRAM being trained, according to the rule:

Δα_u(t) = ρ((a - α_u)r + λ(ā - α_u)p)·δ_u,x(t)

where α_u is the memory contents of the pRAM, a ∈ {0,1} is the state of the output of the pRAM, ā = 1 - a, ρ and λ are reward and penalty rates respectively, r and p are reward and penalty factors, and δ_u,x(t) is a delta function denoting the fact that only currently accessed locations are adapted.
10. A method as claimed in claim 9, wherein r = 1 - p.
11. A method according to claim 9 or 10, wherein the value of p is varied during training.
12. A method according to claims 9, 10 or 11, wherein the values of r and/or p are varied during training.
13. A method according to any one of claims 9 to 11, wherein, for the or each pattern, the Hamming distance is used as a measure of the extent to which the actual output(s) of the network differ from the desired value(s) , and reinforcement is applied in a way which depends on this measure.
14. A method according to any preceding claim, wherein the initial memory contents of each of the storage locations of the pRAMs to be trained is equal to 0.5.
15. A method according to any one of claims 1 to 13, wherein the initial memory contents of each of the storage locations of the pRAMs to be trained is equal to 0.5 ± e, where e is a small fraction which varies randomly from pRAM to pRAM.
16. A method according to any preceding claim, wherein the network is trained to respond to a plurality of different patterns.
17. A method according to claim 16, in which the different patterns are successively applied to the network in a fixed order, which is repeated.
18. A method according to claim 16, in which the different patterns are applied in a random order.
19. A method according to claim 16, in which the patterns are applied in a fixed order, with each pattern being applied a plurality of times before the next pattern is applied.
20. A trainable network comprising a plurality of pRAMs, each pRAM comprising a memory having a plurality of storage locations at each of which a number representing a probability is stored, the memory having at least one address line to define a succession of storage location addresses, and means for causing to appear at the output of the pRAM, when the storage locations are addressed, a succession of output signals each having a first or a second value, the probability of the output signal having a given one of the first and second values being determined by the number at the addressed location, the pRAM network comprising means for receiving a pattern to which it is to be trained to respond to which noise has been added, and means for altering the contents of at least some of the storage locations of the pRAM network in dependence on whether the output of the network is that desired in response to the pattern.
21. A network as claimed in claim 20, which comprises a plurality of input pRAMs constituting a hidden layer and at least one output pRAM connected thereto to constitute an output layer.
22. A network as claimed in claim 21, which comprises a plurality of output pRAMs.
23. A network as claimed in claim 22, wherein the outputs of the output pRAMs are treated as constituting a binary code, and the network is trained to generate a given binary code in response to the input of a given pattern.
24. A network as claimed in claim 23, wherein the number of possible binary codes exceeds the number of patterns on which the network is trained, and wherein the binary codes selected to represent the patterns are so chosen that, considered over all the patterns, each output of the output pRAMs has an equal chance of being 0 or 1.
25. A network as claimed in any one of claims 21 to 24, wherein the address lines of the hidden layer pRAMs each receive a respective signal via an additional noise-adding pRAM.
26. A network as claimed in claim 25, wherein there is a plurality of the said noise-adding pRAMs, each being a 1-pRAM, having two storage locations, with each address line of the pRAMs in the hidden layer being adapted to receive a signal via a respective one of the 1-pRAMs.
27. A network as claimed in claim 26, wherein the sum of the numbers stored at the two storage locations of each noise-adding 1-pRAM is 1.
28. A network as claimed in any one of claims 20 to 27, wherein training is carried out, for each pRAM being trained, according to the rule:

Δα_u(t) = ρ((a - α_u)r + λ(ā - α_u)p)·δ_u,x(t)

where α_u is the memory contents of the pRAM, a ∈ {0,1} is the state of the output of the pRAM, ā = 1 - a, ρ and λ are reward and penalty rates respectively, r and p are reward and penalty factors, and δ_u,x(t) is a delta function denoting the fact that only currently accessed locations are adapted.
29. A network as claimed in claim 28, wherein r = 1 - p.
30. A network according to claim 28 or 29, comprising means for varying the value of p during training.
31. A network according to claims 28, 29 or 30, wherein the values of r and/or p are varied during training.
32. A network according to any one of claims 28 to 31, wherein, for the or each pattern, the Hamming distance is used as a measure of the extent to which the actual output(s) of the network differ from the desired value(s) , and reinforcement is applied in a way which depends on this measure.
33. A network according to any one of claims 20 to 32, wherein the initial memory contents of each of the storage locations of the pRAMs to be trained is equal to 0.5.
34. A network according to any one of claims 20 to 32, wherein the initial memory contents of each of the storage locations of the pRAMs to be trained is equal to 0.5 ± e, where e is a small fraction which varies randomly from pRAM to pRAM.
35. A network according to any one of claims 20 to 34, wherein the network is trainable to respond to a plurality of different patterns.
36. A network according to claim 35, in which the different patterns are successively received by the network in a fixed order, which is repeated.
37. A network according to claim 35, in which the different patterns are received in a random order.
38. A network according to claim 35, in which the patterns are received in a fixed order, with each pattern being received a plurality of times before the next pattern is applied.
PCT/GB1993/001180 1992-06-04 1993-06-03 Method of training a neural network WO1993024898A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
GB929211780A GB9211780D0 (en) 1992-06-04 1992-06-04 Method of training a neural network
GB9211780.3 1992-06-04
GB9211910.6 1992-06-05
GB929211910A GB9211910D0 (en) 1992-06-05 1992-06-05 Method of training a neural network

Publications (1)

Publication Number Publication Date
WO1993024898A1 true WO1993024898A1 (en) 1993-12-09

Family

ID=26300990

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB1993/001180 WO1993024898A1 (en) 1992-06-04 1993-06-03 Method of training a neural network

Country Status (2)

Country Link
AU (1) AU4341093A (en)
WO (1) WO1993024898A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR2806186A1 (en) * 2000-03-09 2001-09-14 Didier Henri Michel Louis Cugy Providing an imagination capability in a technical system, uses variation of components of vector representing knowledge, with feedback loop to control variation function

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
IEEE FIRST INTERNATIONAL CONFERENCE ON NEURAL NETWORKS vol. 2, 21 June 1987, SAN DIEGO , USA pages 541 - 548 KAN 'A probabilistic logic neuron network for associative learning' *
IEEE INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS vol. 3, 18 November 1991, SINGAPORE pages 1891 - 1897 NG 'A probabilistic RAM controller with local reinforcement learning' *
IEEE TRANSACTIONS ON SYSTEMS, MAN AND CYBERNETICS vol. 22, no. 3, May 1992, NEW YORK US pages 436 - 440 MATSUOKA 'Noise injection into inputs in back-propagation learning' *
IJCNN INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS vol. 3, 7 June 1992, BALTIMORE , USA pages 660 - 665 GUAN 'The application of noisy reward/penalty learning to pyramidal pRAM structures' *
IJCNN-91-SEATTLE INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS vol. 2, 8 July 1991, SEATTLE , USA pages 525 - 530 FULCHER 'Autoassociative memory with "inverted pyramid" logic networks' *


Also Published As

Publication number Publication date
AU4341093A (en) 1993-12-30


Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AT AU BB BG BR BY CA CH CZ DE DK ES FI GB HU JP KP KR KZ LK LU MG MN MW NL NO NZ PL PT RO RU SD SE SK UA US VN

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH DE DK ES FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN ML MR NE SN TD TG

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
121 Ep: the epo has been informed by wipo that ep was designated in this application
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: CA