WO1994002907A1

WO1994002907A1 - Dynamic neural networks

Info

Publication number: WO1994002907A1
Application number: PCT/GB1993/001494
Authority: WO
Inventors: Leslie Samuel Smith; Kevin Michael Swingler
Original assignee: British Telecommunications Public Limited Company
Priority date: 1992-07-16
Filing date: 1993-07-15
Publication date: 1994-02-03
Also published as: GB9500605D0; AU4578493A; GB2283605A; GB2283605B

Abstract

A neural network, specifically for analysing time-dependent sequences such as spoken utterances, comprises an input layer (I) and an output layer (H). On each of the connections between the input and output layer there is a filter (P). As the sequence to be analysed is received by the input layer, activation signals are sent via the connections to the output layer. Each filter (P) has a dual role: it will only allow a maximum amount of activation to pass at any one given time, this maximum amount decreasing as the sequence proceeds, and it also determines the maximum cumulative activation that may be passed along the respective connection throughout the entire presentation of the input sequence. Several such networks may be used in parallel, each feeding their outputs to a common top layer (T) which adjudicates between the differing outputs and provides a composite answer.

Description

DYNAMIC NEURAL NETWORKS

The present invention relates to dynamic neural networks, and in particular to networks which are capable of sequence recognition. One specific application of sequence recognition, which is described by way of example, is that of speech analysis and in particular spoken vowel recognition. The most popular applications of neural networks to date have been in static recognition tasks, in other words tasks which involve a static pattern which can be presented to the network at a single instance and as one entity. There are however many tasks, of which speech perception is one, which involve patterns which grow and change over time. It is a desirable goal that such dynamic patterns should be processed, not as a whole after they have been completed,^" but rather continuously as they are developing. A number of methods have been employed in the past to deal with the problem of sequence recognition within dynamic patterns. Perhaps the most obvious approach is to record the pattern as it develops over time, and then analyse the complete recording with a static neural network, using time as one of the spatial dimensions. The disadvantage with such an approach, of course, is that the developing pattern has to be recorded in some way, and analysis cannot commence until the pattern has fully developed. The resultant time delay makes this approach particularly unattractive for dynamic sequences which are reσuired to be identified rapidly, for example real-time speech analysis.

Another approach has been to use time delay neural networks (TDNNs) . These are networks which include weighted delay lines in addition to the weighted connections that are familiar in standard static networks. These lines delay the information which flows along them so that it may be integrated over time. Such networks effectively incorporate a moving temporal window, which allow each unit to take past events into account during the learning process. The main problem with such a network is the fact that it can work only within its given receptive field, the length of the sequences that can be processed being governed by the number of delay lines used. It is therefore very inflexible and is not naturally suited to such tasks as speech analysis, in which the length of the sequences to be recognised cannot necessarily be defined in advance. To overcome the problem of fixed length input, some researchers have made use of recurrent neural networks. The idea behind such networks, as used in sequence recognition, is to maintain an extra set of units which represents the last previous state of the network. These extra units feed into the set which represent the current state, and in this way, past events build up in an accumulated representation. Unfortunately, it has been found in practice that such networks are difficult to analyse and design, and also to train. As the input length grows, the additional units tend to contain increasingly blurred representations of past states.

Recurrency can also be used in other ways, one potential area being to allow a bank of neuron pairs to oscillate at given frequencies to which they have evolved to be sensitive. Using this method, a simple resonator can be trained to be sensitive to particular frequencies. Associative connecrions between such neuron pairs may then be set up to represent varying tonal information. Such an approach has been found, in practice, to be not particularly useful for sequence recognition, and to be more suited to use as a "front end".

The presently most popular conventional approaches to sequence and in particular speech recognition involve the use of neuron networks in combination with other conventional methods. In particular, researchers are making use of Hidden Markov Models (HMM) , which utilise a set of local probabilities of given elements following other given elements. Another approach is known as Dynamic Time Warping (DTW) , which is a template matching method which transforms the input to match the template form. Studies aimed at combining multi-layer perceptrons (MLPs) with HMM and DTW approaches have aimed at improving performance and providing a base for real time hardware implementation.

One final method of applying neural networks to sequential processing is to build a network capable of associatively stepping itself through a sequence of learned states on being presented with an initial state, so that a sequence of network states is recalled. Although a number of researchers have investigated such methods, no network of this type has so far been tested on speech data, the main reason for this being that such associative networks are primarily designed for recall rather than recognition. The problem of speech recognition is one of classification, not perfect recall. It is an object of the present invention at least to alleviate the problems of the prior art.

It is a further specific object to overcome two of the problems of TDNNs, namely the need to apply the whole sequence before updating the weights, and the very slow convergence of the TDNN.

It is a further object to provide a neural network which is not limited by a receptive field of a specific length.

It is yet another object to provide a neural network capable of sequence recognition in spoken speech, and which is capable at least in principle of providing a real time output.

These objects are achieved by the invention set out in the claims of the present patent application. The particular inventive neural network which is defined in those claims will in this specification be called a "Filtered Activation Network" (FAN) .

The present invention may be considered either as a neural network, or as a method for operating a neural network, but for the sake of simplicity the claims have been drafted in terms of apparatus rather than method. It will be appreciated by the skilled man, of course, that the networks of this invention may be implemented either in hardware or in software.

According to a first aspect of the present invention, a neural network is characterised by those features set out in the characterising portion of Claim 1.

According to a second aspect, a neural network may comprise a plurality of individual neural networks, as previously defined, and as set out in Claim 16. One particularly advantageous use of the neural network of the present invention is to analyse spoken utterances. The invention extends to a method of analysing such utterances as set out in Claim 22.

The invention may be carried into practice in a number of ways and one specific embodiment will now be described, by way of example, with reference to the accompanying drawings, in which:

Figure 1 shows a simple prior art neural network for recognising patterns; Figure 2 shows the result of adding natural inhibition and variable weights to the network of Figure i;

Figure 3 shows a filtered activation network according to an embodiment of the present invention;

Figure 4 shows a larger network incorporating three FA s;

Figure 5 shows a network incorporate two FANs, for use in identifying the sequences ABBB and BAAA; Figure 6 is a table showing how the network of

Figure 5 recognises the target sequences;

Figure 7 is a table illustrating how the FAN of Figure 3 may be trained to recognise a particular sequence; Figure 8 is a table showing how the sequence of

Figure 7 is recognised during the sequence recognition stage; and

Figure 9 is a table showing how a different sequence may be distinguished.

A simple network for pattern recognition Before describing the FAN model in detail, it will be convenient to consider how a conventional network undertakes pattern recognition. With reference to Figure 1, it will be seen that a conventional network consists of an input layer and an output layer, with connections between the layers.

Each input unit in the input layer is triggered by receipt of a particular element in an input sequence, here a single letter v;ithin a sequence of letters. On triggering, the input unit corresponding to the particular letter which has just been received passes a signal to each unit in the output layer to which it is connected. Each unit in the output layer is representative of a particular sequence to be recognised, here a particular sequence of letters.

If the sequence of letters applied to the input layer is c,a,t, three signals are sent to the cat unit in the output layer, two to the bat unit and two to the can unit. The network recognises the word cat by determining that the cat output unit has received three signals from below, where the bat and can units have each only received two signals. Unfortunately, this simple network is not sensitive to the order in which the letters are received. For example, if the three letters are t,a,c, in that order, the network will again recognise the word cat because as before the cat unit receives three signals from below while the bat and can units receive only two. This can cause problems if one needs to distinguish between two words having the same letters, for example bat and tab. If we were to add another output unit, tab, then this will be activated to exactly the same extent as the bat unit whatever letters are input. Accordingly, the network will never be able to distinguish between the words tab and bat.

One way of dealing with this problem is to add lateral inhibition between the output units, and to use variable weights. This is what is shown in Figure 2. In this example, the weight on the connection relating to the first letter of a particular output sequence is three units, the weight corresponding to the middle letter is two units, and the weight corresponding to the final letter one unit. The lateral inhibition between the output units, bat and tab in the example, acts to decrease the cumulative signal on the tab output unit when the signal on the bat unit increases, and similarly to decrease the cumulative signal on the bat unit when the signal on the tab unit increases.

Such an arrangement allows sequences to be correctly recognised. If, for example, the input layer receives the letters b,a,t, in that order, the initial letter b provides the bat unit with three points of excitation, and the tab unit with only one point. The subsequent letter a provides both output units with two points of excitation. The final letter t provides the bat unit with one point, and the tab unit with three points. However, even though that tab unit has finally received the same amount of excitation from below as has the bat unit, the initial lateral inhibition, arising from the fact that bat started at three points and tab only at one point, will never allow the tab unit to catch up. Accordingly, at the end of the sequence b,a,t, the bat unit has a higher score than the tab unit and the network therefore recognises the word bat.

Although the simple network of Figure 2 may be perfectly satisfactory where certain types of sequences are to be identified, the network does not work well if there may be repeated elements in the sequence. If, for example, the sequence to be recognised is bata the weight between the input unit a and the output unit for bata would have to reflect the fact that a is both the second and the fourth letter. This causes a particular problem, because in practice the individual weights within any practical network are not normally calculated but are instead learned by training the network on a particular set of known data. To build up a pattern of weights such as is shown in Figure 2, the training scheme would simply provide increased weights on letters towards the beginning of the sequence, and reduced weights on letters towards the end. Now if there are repeated letters, the problem is that simply increasing a weight by a lesser amount as the sequence is learned will cause repeated occurrence of a given letter to build up during training such a strong weight that the first occurrence of this letter during subsequent recognition will cause far too great activation on its connected output unit too early on in the sequence. Thus, the fact that the letter a occurs twice in bata means that the connection between the a input unit and the bata output unit will be disproportionately heavily weighted, so affecting the network's recognition capabilities.

The simple FAN model

Figure 3 shows a simple Filtered Activation Network (FAN) according to an embodiment of the present invention. As before, the network consists of a layer of input units I and a layer of output units H, each input unit being connected to each output unit by an appropriately weighted connection W. There are also lateral inhibition connections C between all possible pairings of output units H.

The network also has a pre-unit filter P, schematically illustrated in the drawing by a pair of parallel lines, on each of the weighted connections W between the input units and the output units. During running of the network (the recognition phase) , each pre- unit filter acts to prevent more than a given maximum amount of activation passing through to the corresponding output unit at any given time, thus allowing the rest of the sequence to be input before the network becomes too set on a given answer. In addition, each filter takes note of the amount of activation that is being passed througn it during the presentation of a particular sequence to the input units I, and it ensures that during a single sequence the total activation which can be passed through that filter to the corresponding output unit H cannot exceed a given amount. This amount corresponds in the present embodiment to the weight W on that particular connection.

Data representation

As with the prior art examples of Figures 1 and 2, local representations are used both for the input and the output of the network of Figure 3. If for example the network were to be used to identify specific sequences of letters from a series of individual letters applied to the input units, each input unit I would represent a particular letter, and each output unit H a particular sequence of letters to be recognised. It will be evident that there may be more or fewer than three input and output units, and that the number of input and output units is not necessarily the same.

The activation on a particular input unit is in this embodiment either 1 or 0: in other words, each individual input unit is either activated or it is not. The input units are activated mutually exclusively, so that at most one input unit will be active at any one time. That unit will have an activation of 1, and the others will have activations of 0. As in Figures 1 and 2, the activations of the input units feed through to the output units as the input sequence is being processed, and if at the end of the input sequence one of the output units H has received a much greater excitation than the others, the network can be said to have recognised the particular sequence which is represented by that output unit.

Definitions

It will be assumed in the following discussion that the network has m input units, and that there are n output units or in other words that there are n sequences to be learnt or recognised. The entire network may then formally be represented or defined as the set of the following vectors: a) I,, where i = 0...m-l, and each I, is either 0 or 1. I, represents the activation of the i'^h input unit. b) H_h, where h = 0...n-l, where each H_h lies between 0 and 1. H_h is the output of the network corresponding to the ^Λ learned sequence. c) W_hl, where i = 0...m-l and h = o...n-l, and W_hj lies between 0 and 1. W_hl represents the weight from the input unit i to the output unit h. d) C,., where i = 0...n-l and j = 0...n-l. C„ lies between -l and 0 if i is not equal to j, and is 0 where i is equal to j . C„ represents the inhibitory weights between the units of the output layer. e) P_hl, where i = 0—m-1 and h - 0...n-l, and P_hl lies between 0 and 1. P_h, represents the synaptic filters on the connection between the input unit i and the output unit h.

Sequence recognition

Before the network can recognise sequences, it has to be trained to recognise them. Although chronologically training naturally comes before recognition, in the present discussion we shall consider recognition first as it is rather more straightforward and better illustrates the primary purpose of the filters.

When the recognition process starts, the network will essentially be defined by the set of weights W,„ which have been determined during training. The recognition process starts with each weight value being copied into the appropriate filter, so that each P,„ starts off being equal to W_hl. From then on, until the start of another input sequence, it is only the filter values and not the weight values which determine the netv/ork behaviour. During recognition, the weight values are therefore used solely as initialisation variables for the filter values. The filters can control the flow of activation to the output layer by two means: 1) Activation potential: This is the maximum amount that the filter can let through over the presentation of any one entire sequence. The activation potential corresponds to the value of the filter at the beginning of the sequence, and is therefore equal to the weight on the given connection. 2) Local allowance: This is the amount that the filter will allow through at any given moment. At the start of each sequence, the local allowance is set to an initial value σ. The local allowance decreases by a factor of τ on each cycle, so that at time t the filter will allow through a maximum activation of σ*τ'. Under no circumstances will a filter let through an amount greater than its original weight, so once the activation potential has been used up, no further excitation is allowed to pass whatever the value of the local allowance.

The values of both T and σ are greater than 0 but less than 1.

The filter can thus be imagined as having a potential, equal at first to the value of its attached weight, and an ever shrinking door through which to allow an equally shrinking proportion of that potential. The practical purpose of the filters is to reduce the effect that new input can have on a decision currently being taken, so that once certain answers are ruled out (or inhibited) they cannot easily be resurrected later in the input sequence.

The formal algorithms used during recognition will now be set out. First, the excitation rule, which calculates the effect of all the inputs to a particular output unit at time t. On each input connection, the minimum of the amount the filter has left and the local allowance at time t is determined, and the resultant values are then summed over all of the connections to the particular output unit h, remembering that at most one value of I, will be non-zero.

Ex(h,t) = Σ(min(P_h,,σ*τ')*I₁) - (l) i

The inhibition rule calculates the lateral inhibition from other output units to a given output unit. The lateral inhibition is the sum of the strength of every other output unit multiplied by its connecting weight. Dividing this by the number of active output units in the network (f(h)) ensures that the network does not start to settle on an answer until several possibilities have been ruled out. The inhibition rule is:

ln(h,t) = Σ (H,(t-l)*c_h.) joh - (2) f(h)

The symbol "<>", as used in this specification, means "not equal to".

The propagation of input rule uses the results of the excitation rule (1) and the inhibition rule (2) to update the value of H (t). The incoming excitation is multiplied by (1 minus the current activation strength of the unit in question) , and from that is subtracted the product of the incoming inhibition with the activation strength of the current unit. This ensure that the output units activity stays between 0 and 1. The propagation of input rule is:

H„(t) = HJt-l) +(l-H_h(t-l))*Ex(h,t)+H_h(t-l)*In(h,t) - (3)

Finally, the filters are updated according to the filter update rule. As previously explained, P_hl(t) is set to be equal to W_hl when t = 0, and the filters are then decremented at every time interval provided that this will not cause them to become negative. If a filter would otherwise become negative, it is simply left untouched. The filter update rule is:

P_hl(t) = P_hi(t-1) - σ*τ¹-' if P_hl(t-l)>=σ*r¹-' - (4) = P_hi(t-1) otherwise

The algorithms set out in equations (1) to (4) above are then repeated for each output unit h, where h = 0... (n-l) .

The inhibitory weights C„ may be set according to the application, and may for example be

C„ = -1 if ioj - (5)

- 0 otherwise

Referring now to Figure 8, there is given an example of the operation of the network of Figure 3 in identifying a particular sequence of letters (abac) that it has been trained to recognise. For ease of calculation and understanding, the inhibitory weights C„ are all assumed to be set to zero in this example, so that equation (3) can be simplified to read

H_h(t) = H_h(t-1) + (l-H_h(t-l))*Ex(h,t) - (3a) = H_h(t-1) + Ex(h,t)-H_h(t-l)*Ex(H,t)

The subtraction of the value H,,(t-l) *Ex(h,t) keeps the value of H_h(t) less than 1. This is particularly important when σ and τ are greater than 0.5; otherwise the subtracted term could be omitted. It is assumed in Figure 8 that the network has already been trained to recognise the sequence abac at output unit H₀. As will be seen from the table, an input a is represented by activation of I„, b by activation of I,, and c by activation ofI,. At time t=0, the filter values P,„,, P„, and P,,₂ are set to be equal to the weights on the respective lines. After the training process for recognising the sequence abac at output H₀, these weights are respectively 0.62, 0.25 and 0.06. The derivation of these particular figures will be illustrated later, in connection with the discussion of the training process.

The value of a is taken to be 0.5, and the value of T also taken to be 0.5.

Using equation (1) , at time t=0 σ*τ' is lower than P , so that amount of excitation (0.5) is passed through the filter to the output unit H„ using equation (3a) . The value of P_ικ, is then decremented by 0.5 using equation

(4).

At time t=l, p,,, is equal to σ*τ' so that value is sent through to the output unit using equation (3a) and then decremented from the filter P„,.

At time t=2 ?,,, is equal to σ*τ', so P is accordingly decremented by that amount (leaving 0) and the same amount passed through to the output unit using equation ( 3a ) .

Finally, at t=3, P,₎₂ is equal to σ*τ^l . P₀₂ is therefore decremented, and the equation (3a) used again to update the output unit. At the end of the procedure, the output unit H₀ holds a cumulative activity of 0.692, and each of the filters P_oo, P_0I and P_oj is 0. The fact that all the filters are exactly 0 shows that the network has correctly recognised the sequence on which it has been trained. Compare now what happens to the activity of H„ when a different sequence abbe is applied. This is of course similar but not identical to the sequence abac on which the unit H„ was originally trained.

The sequence starts as before, and for t=0 and t=l there is of course no difference as the first two letters

(ab) are the same. At t=2, however, the incorrect letter b is presented. Using equation (1) the minimum value of

P_0I and σ*τ^l is 0, all the activation potential on that filter having already been used up. No excitation can therefore be passed through to the output unit, and the value of H₀(t) at t=2 remains at 0.625.

At t=3, a letter c is presented. Again using equation (1) the value of P₀₂ is the same as σ*τ' and that value is accordingly passed through to the output unit and used to decrement the value of the filter.

The final value of H„ is, in this case, 0.648, rather lower than the 0.692 of Figure 8. The lower output value represents the poorer concurrence between the input sequence and the sequence the network was trained to recognise. The existence of a mismatch can also be deduced from the fact that P₍₁₀ still remains at 0.12, in other words not all of the activation potential was used

UD. Training the network of Figure 3

Before the network can be used to recognise sequences, it first has to be trained to do so. The training sets up the weights W_()| which, during the subsequent running of the network, are used as the initialising values for the filters P--.

Briefly, to train the network to recognise a single example of a sequence, one of the output units, say H„, is chosen to represent that sequence. The sequence is then applied to the input units, and the weights on all of the connections between the input units and the chosen output unit H_υ are iteratively updated, using a particular learning rule, until the sequence has been learnt. If a second sequence is then to be learnt, a second output unit, say H,, is chosen and the weights W,. are similarly learnt for each of the connections between the input units and the output unit H,.

Formally, all the weights are first set to zero, that is W_hi(0) = 0 for all h and i. For the particular sequence h being recognised, the weights are then iteratively incremented using the following learning rule:

ΔW_hi(t) = (l-W_lli(t-l))*I,*σ*τ¹ - (6)

where:

W_hi(t) is the weight from i to h at time t

I, is the input to the input unit i σ is the starting value for the input filter (0<σ<=l)

T is the input filter decay value (0<τ<=l) .

On each learning cycle, the next element in the sequence to be learned is presented to the input units, and the weights from the activated input unit to the output unit chosen to represent the current sequence are increased according to the above Hebbian rule. Note that this weight adaptation rule ensures that the weights are always between 0 and 1. Using this rule, the network can now learn any given sequence, limited only by the accuracy required as the sequences get longer and σ*τ' grows smaller.

Example Turning now to Figure 7, the table illustrates a simple example in which the network of Figure 3 is trained to recognise the sequence abac at the output unit H₀. During the training procedure, therefore, it is the weights W,_s which will be updated, in this instance W_(X,, W_(M and W„₂. The representations of a, b and c are as before, and a and τ are again taken to be l and respectively.

Starting with all the weights W„, set at 0, the weights are learned as shown in the table of Figure 7 so that at the end of the sequence W₍₎₀, W,„ and W₀₂ are respectively 0.625, 0.25 and 0.0625. As will be appreciated, these were the values of the weights that were used to initialise the filter values P during running of the network, in the tables of Figures 8 and 9.

Learning general representations with the network of Figure 3.

In a practical situation, one often cannot provide a series of single specific examples for the network to learn. Either the examples are likely to be contaminated by noise, or alternatively the data may be of the type (for example spoken speech) where there is no single "right" sequence. What is needed is a method for showing the network several \noisy) examples of a given sequence and allowing gradual learning of a representation common to all of them. This will mean that classification of an input will be based, not on a single example as above, but on a representation built out of several such examples. The learning method used in the present embodiment relies on incremental weight changes. The rule looks at the filter on the particular line being assessed, as this tells us whether the connection has over-used or under¬ used its weight strength in recognising the given sequence. If the filter is depleted too early, then the original weight was not large enough, and so must be increased. On the other hand, if the filter is still not empty by the end of a sequence, the weight must have been too large and it is therefore decreased slightly. The learning algorithm comprises the following steps, which are carried out sequentially:

1. Set all weights W_hl to 0.

2. Choose an output unit which is to have its weights trained (call this H_t.) .

3. Take an example (call this E.) in the set to be learned from; set the time t to be egual to 0.

4. Initialise the filter values to the weights: P.—W,-.

5. Take the next element in the current example sequence E_c, and assume that this element actuates a particular input unit I_c; let t = t+1. 6(a) If the filter (P.) between H. and I. is 0, then increase W_c by learning rule (7) and go to step 7.

(b) Else, if the weight W_c is <σ*τ'_; then reduce the filter P_c by rule (8) and go to step 7.

(c) Else, reduce the value of P_c by _:.

7. If P. is <0, then increase W. by rule (9) and set P. equal to 0.

8. Repeat steps 5 to 7 for all elements in the current sequence E_c. 9(a) If P_c is not equal to 0 then reduce the weight W. by rule (10) . (b) Otherwise, leave weight unchanged. 10. Repeat steps 3 to 9 for all examples (E.) in the set to be learned from. 11. Repeat steps 2 to 10 for all output units (H.) in the net.

The learning rules (7) to (10) , referred to in the algorithm above are as follows:

__Wl. = _σ*τ'*e*(l-W. - (7)

ΔP_hi = -σ*τ" - (8) ΔW_hi = σ*τ^l*e*|P_hl| (1-W_hi) - (9)

ΔW_h, = -W_hi*P_hl*e*W_hl - (10)

where e is a constant which determines the rate of learning: 0<e l Equations (8) and (10) are shown with an initial minus sign, since these are used to decrement rather than increment the respective values. Increments in weight have a multiplier of (l-W,.,), and decrements a multiplier of W_hi, to prevent the weights going outside the range 0 to 1.

Steps 1 to 7 of the above algorithm run the sequence to be learned through the network in its current state, detecting where there is an attempt to use more activation through a filter than that filter will allow. When this occurs, the value of the weight is increased so that on future training or recognition runs, the new larger weight would be copied into the filter, thus giving a larger capacity to accept the current training pattern. Step 9 examines the state of the filters after a sequence has been completed, and if there is excess potential to activate on a filter which was not used, the weight is reduced so that, on a future presentation, too much activation will not be allowed through.

Since in training all the weights are initially set to zero, the first example in each sequence to be learned will required the updating of zero weights. This simply uses the update rule (7) at step 6(a) , which is effectively equivalent to the update rule (6) previously described with reference to the learning of single examples.

After this first example, further examples use rules

(9) and (10) above to move the weight slightly closer to a general representation of the patterns.

If an input sequence is exactly that sequence which the weights are set to accept, then the filter value will reach exactly zero when the last element of the sequence is input. Any deviation from this ideal input (in the form of further differing examples of that input class) will cause the filter value to be either too high or too low. It is in this situation that weight adaptation must occur. The correspondence between this use of the filter in training and the use in subsequent recognition will be clear.

The specific function of each of the rules (7) to

(10) is as follows:

Rule (7) : This rule is used for the first example of any given sequence. It may also come into play later in the sequence if the filter value reaches zero while the filter is still required to carry activation. If that happens, then the weight which was originally copied into that filter was evidently not large enough. This rule adds to the weight sufficiently to allow this extra activation through next time.

Rule (8) : This simply reduces the filter to allow only the maximum permitted activation for time t (the local allowance) to be processed. Rule (9) : This is used in a similar way to rule (7) above. There is a difference, however, in that the weight is only increased- by a fraction which is proportional to the distance below zero that the filter drops when it first becomes negative. Rule (10) : If the filter still holds some activation potential once the whole sequence to be trained on has been fully input, then the original weight was evidently too high. This rule reduces that weight by an amount in proportion to the value left in the filter and the size of the weight. Since large weights are decremented by large amounts, and small weights by small amounts, this ensures that the weight never falls below zero.

If one follows the above algorithm through, it will be seen that it effectively reduces to the update rule (6) if the network is simply trained on a number of single (non-noisy) examples. In that event, all of the updating is done using the rule (7) and since steps 6(b) and 6(c) are never reached, the filters do not come into play and always remain at zero. The filters come into play, and steps 6(b) and/or 6(c) are activated, when the network is trained on a selection of samples, some or all of which are noisy. The network is effectively then generalising, and extracting from the noisy data what it believes to be a general representation of the signal. The value of e in equation (7) determines the size of the weight changes (e is effectively unity in equation (6)) . £ may be less than unity, particularly where there are a numoer of examples of each sequence in the input. Repeated-Filtered Activation Networks

Although the network of Figure 3 can be used in its own right, there are many situations where more complex analysis is required, and for those purposes several filtered activation networks may be combined into one larger network, as shown in Figure 4. That figure shows three identical filtered activation networks operating in parallel, and feeding their outputs to a common top level output layer T. For ease of illustration, only the cross-inhibitory connections are shown for subnet zero, and only the weighted connections and the filters for subnet 1. It is to be understood, however, that each of the subnets in fact has all of the connections shown in Figure 3. While three subnets are shown, the number of subnets is arbitrary, and may be chosen according to the application.

With a combined network of this type, the output is now taken from the top units T,- and the output units of each of the individual subnets now represent an intermediate hidden layer H.

The number of units n in the top or output layer T is equal to the number of sequences the network has to recognise, and that in turn is equal to the number of hidden units H within each of the individual subnets. In operation, each of the subnets operates essentially independently to choose a particular answer, and the purpose of the top layer is to adjudicate between any discrepancies there may be between the answers from the individual subnets. To achieve this, there are feed forward connections between each of the hidden units and each of the top units (only some of which are shown in Figure 4) . The feed forward connections are defined by the values F._ht where a represents the number of the subnet, h the number of the hidden unit within the subnet, and t the top unit. If there are s subnets in total, then: h = 0...n-l, t = 0...n-l, a = 0...S-1.

The value of each of the feed forward connections F._hl lies between zero and 1.

There are also feedback connections B_nht between each of the top units and each, of the units in the hidden layer. As before: h = 0...n-l, t = 0...n-l, a = 0...S-1. The feedback and feedforward weights are so chosen that:

F_an = 1

F._hl = 0 for hot

B_an = 1 B^_t - -1 for hot

This makes each output unit excited by its corresponding hidden unit in all the assemblies, and they, in turn, excite their corresponding output unit. They also inhibit all the other output units. Such an arrangement turns the top part of the net into a Winner- Takes All cluster.

The individual units, connections, weights and filters within each of the subnets are labelled in exactly the same way as before except that they now have an additional subscript a representing the number of the subnet. Thus, for example, the input units I- are now labelled I,,, the weights W-- are now labelled W.-- and so on. In each case, a = 0...S-1 where s = the number of subnets. The state of the network shown in Figure 4 can now be uniquely defined as the following set of vectors:

I_ai = input units' activation

H., = hidden units' activation

T, = top layer units' activation

W.j_j = the weights from the input layer to the hidden layer F = the forward weights from the hidden layer to the top layer B„, = the backward weights from the top layer to the hidden layer

C_nι| = the inhibitory weights within the hidden layer P._n| = the filters between the input layer and the hidden layer.

Data representation for Figure 4

As before, a local representation is used for both input and output, so for example if a sequence of letters is to be recognised an input letter a might activate the first input element in each subnet, an input letter b the second element in each subnet, input letter c the third and so on. At most one input unit in each subnetwork will be active at any one time. The first top unit represents a particular sequence to be recognised, say abac, and as before that sequence is also to be associated with the first hidden unit in each of the subnets. The second top unit will represent a different sequence to be recognised, say abbe, and that sequence will also be associated with the second hidden unit within each of the subnets. By using parallel subnets in this way, several different approaches can be brought to bear on the same problem, with the top layer adjudicating between the various answers received from the subnets. For example, if one were to use the network to identify letters of the alphabet from their printed representations, the input to subnet zero might be the number of horizontal lines in a particular representation, the input to subnet 1 the numDer of vertical lines and the input to subnet 2 the number of curved lines. Each subnet would then determine individually which letter of the alphabet the representation would be most likely to be, and the top layer would then collate the individual results. This example, of course, does not make use of the network's capability of sequence-recognition. Another more important example is in the analysis of spoken utterances. Here, the input to each of the subnets represents the power at time t of the utterance within a particular frequency band. In one particular example, spoken utterances were passed through eight separate band pass filters to provide output in the form of eight sequences of measures of the power on each of the eight filtered bands. This raw data was then taken and compressed logarithmically for input to the network. Thus, subnetwork zero deals with the sequence of power, as the utterance develops, within the lowest frequency band. After logarithmic compression, the power on that band is represented as a sequence of integers, say 0,2,1... At time t=0, with this example, the power on that particular band is zero, and the input unit I_(l will be activated. At time t=l the power is 2 and the input unit I₀₂ will be activated. At time t=3, the power is 1 and the unit I₀₁ will be activated, and so on. At the same time, corresponding sequences of integers, representative of the power on the other seven bands, are being sent to the input units of the other subnets.

With such a coding of the input, the weight coming into a given hidden unit h W._h), represents the importance of the strength i of the frequency band a in the sequence h. This importance is governed by the position in the sequence and the number of times the element appears in that sequence.

In this example, each of the output units reDresented a Darticuiar sooken vowel sound. Recognition using the network of Figure 4

The recognition rules (1) to (5) need to be modified to take into account the forward connections F and the feedback connections B. Neither the feed forward connections nor the feedback connections vary during recognition, and they may for example be set as follows:

F._hl = 1 if t = h

•= 0 otherwise B_3t = 1 if t = h = -1 otherwise

These weights could be changed to give priority to certain channels.

The feedback from the top layer to the hidden layer not only allows the subnetworks to be affected by the output of the others, but it also causes one output unit to climb to an activation value of 1, and all the others to be forced downwards towards zero. Accordingly, once the network has settled on what it believes to be the correct answer, that answer (in other words one of the top units) is enhanced, and all the other competing answers (top units) are inhibited.

During recognition, the excitation rule (eguation

(1)), the inhibition rule (equation (2)) , and the filter update rule (equation (4)) are all used as before, with each of the subnetworks being treated independently.

Each of those equations is therefore used for each of the subnetworks, it being recognised that the subscript a, indexing the individual subnetworks, has effectively been dropped as the equations are the same for all of the subnetworks. In equation (2) the function f(h) represents the number of hidden units in that particular subnet which have a positive activation.

Additional rules now need to be provided to deal both with the feedback between the top layer and the hidden layer, and also the feedforward connections between the hidden layer and the top layer.

To update the top units from the hidden units, we of course first need to know what the values on the hidden units are, and those are given by equation (3) above. That equation is used to update each of the hidden units H_Λ, where it will be recalled that the subscript a indexes each of the S subnets (a = 0...S-1) . The updated values of H,_h, obtained from that equation, will be written as H_ah(t) .

The rule for updating the top units from the hidden units, by means of the feedforward connections, is as follows:

T.(t) = [Σ F._hhH;_h(t)]/n - (11)

where n is the number of subnets. Note that since F_ahh = 1, this is the mean of the activations of the corresponding hidden units.

If a more general network were to be used at the top, equation (11) could be generalised to read:

T.(t) = [Σ [ΣF,_hlH,(t) ]]/n - (lla)

J *

Following updating of the top units from the hidden units, the hidden units then need to be updated from the top units using the feedback connections. This is carried out in three parts:

(a) calculate the excitatary feedback from the top layer, Fbl (h,t) ;

(b) calculate the mnibitory feedbacK from the top layer, Fb2 (h,t) ; (c) update H*_h(t) to produce H.„(t) .

For step (a) _; the positive feedback comes from the unit in the top layer with the same index, in other ^■ words:

Fbl(h,t) = B.,._h*T_h(t) - (12)

For step (b) , the negative feedback comes from all the other units in the top layer:

Fb2(h,t) = Σ B,_h,*T,(t) (13) j< >h

For step (d) , the following equation is used:

H_lh(t) = H°_h(t) + (l-H_h(t))*Fbl(h,t) + H^_h(t)*Fb2(h,t)

(14)

This update equation limits the positive effect of Fbl by multiplying it by (l-H",,(t) ) , and limits the negative effect of Fb2 by multiplying it by H"_h(t) .

On each cycle, steps (a) , (b) and (c) are repeated.

Setting F^, and B_ώ. as indicated above, immediately under the heading "Recognition using the Network of Figure 4" makes the top network into a "winner takes all" network which converges on:

^τk ^{= H}»k ⁼ ! f°^r same k, and all a

T, = H„ = 0 for lok, and all a - (15)

given a non-zero output.

It should be noted that the multiplications by (1- H°_h(t)) and H,_h(t) keep the hidden activations between zero and 1, and that since the top unit activations are means of a number of hidden unit activations, these too remain between zero and 1.

The inhibitory weights inside each subnet, and the weights between the hidden layer and the top layer mean that the total net is thoroughly recurrent. In a truly parallel implementation, all the updatings would occur simultaneously, but in a sequential implementation (as, for example, in a computer program) they must necessarily follow one after another. It will be appreciated from the above description that the preferred sequential implementation updates in the following order:

(a) update the hidden units from the inputs;

(b) update the top units from the hidden units;

(c) update the hidden units from the top units. This sequence is repeated in each cycle. Typically, multiple cycles are run (not restricted to pattern length) , allowing the top/hidden subnetwork to continue to settle after the input has been presented.

In alternative embodiments (not shown) the updating could be carried out in a different sequence. It would also be possible, for example, to run the updating of the top layer by the hidden layer and the updating of the hidden layer by the top layer more often than the update of the hidden layer by the input. This of course would have an effect on the final results, but it is considered that the sequence of updating, and the frequency with which various parts of the network are to be updated is well within the capabilities of the skilled man in the art. It should be noted that the values of F-,„ and B-_hi are not learned. They are fixed and they do not vary either during training or during recognition. It is envisaged, however, that these connections could be learned, in a similar way to those of the weights W.,„. A simple example

To illustrate how the network of Figure 4 operates in practice, consider the simple example network shown in

Figure 5. This consists of two subnets zero and 1, each of which have been trained to recognise the two sequences

ABBB and BAAA.

The parameters used were σ = 0.5, τ = 0.5, C_aM set up as equation (5) and F.,., and B,_ht set up as previously specified. The initial weights W_ahl, after training, were as shown in Figure 5.

The results of running the trained network, and presenting the two patterns, is shown in the table of Figure 6.

It will be seen that after six time steps, following presentation of ABBB, T„ is 0.497 and T, is 0.145. The pattern ABBB is therefore taken as being recognised. Six time steps after presentation of the alternative pattern BAAA, T„ stands at 0.006 and T, at 0.962. The pattern BAAA has therefore been recognised.

Experimental results

Several sets of synthetic test data were created to test the FAN model shown in general terms in Figure 4. It was found in practice that sequences do not need to be of any uniform length, as there is no set receptive field. However, the longer the sequences, the smaller will be the difference between the weights where the sequences do not differ until their ends.

Given a string to be recognised, embedded in noise, the success of the network in finding the sequence depends upon its position in the noise. The closer to the front it is, the better chance it has of being discovered. For example, if the network has been trained to recognise the sequence ABCDEF, it will be more likely to identify this in the sequence ABCDEFdfjgds than in dshfABCDEFdfk.

The network was found to be extremely robust under conditions of missing or damaged data. As it does not rely on chained associations from one element to another, it is not disrupted by slight changes in order or by missing elements.

By making use of several subnets, the top layer receives input on a number of channels. All of the channels interact using a mechanism which excites one unit in each of the subnets, and inhibits all others. This has several effects: one is to keep the overall activation low if no units are significantly higher in activation than the average, thus preventing noise causing an incorrect response. It also means that all but one of the channels could contain noise without disrupting the recognition process. If the different channels disagree, then the fact that the top layer feeds back to them will tend to pull them all towards a common response.

A network using eight individual subnets was then tested on a sample of real speech utterances, each utterance representing a particular vowel sound. In subsequent recognition, the network recognised correctly 95.5% of the vowel sounds. To test generalisation capabilities, the same data was then split into two groups, and the network was trained on the first and tested on the second. Due to the relatively small size of the data set, and the large differences in the vowel sounds, the network did not generalise as successfully, scoring only 52% correct.

As a comparison, the same data was tested using a multi-layer perceptron network. This recognised at best 92% of the examples, compared with 95.5% with the FAN model. Note, however, that the M P required 600 epoch? to achieve this level, whereas the FAN learned άn only 10 epochs. To test generalisation capabilities, the MLP was trained on half of the data set and tested on the other half. A score of 62% correct was achieved, which at first sight appears to compare favourably with the 52% scored by the FAN model, but the 62% required some 200 epochs (one epoch being the presentation of one complete set of training data) . The speed with which the FAN model trains itself is determined by the constant e in equations (7) , (9) and (10) above. It is found in practice that setting e equal to 1 and using one epoch gives fairly good results. Reducing e and increasing the number of epochs gives some improvement, but generally the number of epochs required to train on particular data was substantially less than the number required using the MLP model. Fewer than 10 epochs .were needed with the FAN model to arrive at the 52% figure given above.

Possible alternatives

One problem with the network shown in Figure 4 is the fact that the filters start to close as soon as the input starts, so somewhat limiting the network's ability to skip over extraneous information to find meaningful data. A second, related point concerns the fact that the network places more importance on the start of the sequence than it does on later elements. This is effective when recognising a sequence as it allows possibilities which do not match at the start to be ruled out, so preventing them from becoming active at a later stage. An alternative possibility, however, which could relatively easily be encoded, would be for the filters to be at a maximum during the most salient part of the sequence. This would include the elements most likely to cause a correct decision.

Such an arrangement could be implemented by linking the local allowance of the filters to the outputs of the hidden units H. Due to inhibition, the filters can then be kept open during noisy data, when all units will be active at a similarly low level. As the input becomes meaningful, a small number of units will start to rise above the generally low level of the others, at which point the filters can start their decay.

Alternatively, the local allowance of the filters could be linked to the output of the top units T.

In the described specific embodiment, the local allowance σ*τ' is the same for all of the filters at a particular time step t. If greater flexibility were required, that could be changed so that the value of each individual filter local allowance depends upon its own individual hidden unit, or perhaps the corresponding top unit. As previously mentioned, the feed forward weights F and the feedback weights B connecting the top layer with the hidden layer could be altered so that the connection strengths are learned. Similarly, the inhibitory values C could also be learned. The learning process could also be modified to increase the effect more salient channels have, for example by increasing during training the values F, arising from a particular subnet, when without feedback, that subnet's output agrees with the correct answer. Another refinement of the network could involve multiple layers in the subnetworks, to form intermediate representations of the data set. This could be achieved in ucn the same way as additional hidden units were added to perceptrons to overcome the problem of representing linearly dependent patterns.

Finally, one could have distributed rather than local representations for the input and/or output.

In the specific embodiment, the activation on any one input unit I is always either 1 or zero. In other embodiments (not shown) it is envisaged that the activation on any individual input unit could either be continuously variable, or could take one of a number of discrete values. It would also be possible to use a distributed (non local) coding for input and/or output.

Claims

1. A sequence-analysing neural network comprising a plurality of input units forming an input layer for receiving, sequentially, the elements of an input sequence to be processed, a plurality of output units forming an output layer, each input unit being connected to at least some of the output units by respective connections arranged to transfer activity from the said input unit to its connected output units; characterised in that each connection has an associated filter arranged to restrict the activation that can pass at any one time, the amount being allowed through the filter varying as the elements of the input sequence are sequentially received by the input layer.

2. A neural network as claimed in Claim 1 in which the amount of activation permitted to pass by the filter at any one time reduces with time as elements of the input sequence are sequentially received.

3. A neural network as claimed in Claim 1 or Claim 2 in which the filter is further arranged to prevent more than a given cumulative activation from passing during presentation of the entire input sequence.

4. A neural network as claimed in Claim 3 in which an initial value associated with the filter defines the maximum amount of cumulative activation the filter will allow to pass during presentation of an entire input sequence, a record being kept as the sequence is processed of the cumulative amount of activation so far allowed to pass, or the unused part of the initial value, the filter preventing any further activation from passing once the cumulative activation that has been allowed to pass reaches the amount defined by the initial value.

5. A neural network as claimed in Claim 4 in which, at any given time, the filter allows through the unused part of the initial value, or a local allowance, the local allowance varying as the elements of the input sequence are sequentially received by the input layer.

6. A neural network as claimed in Claim 5 in which, each time a further element of the input sequence is received, and the unused part of the initial value is greater than the local allowance, the unused part of the initial value is decremented by the local allowance.

7. A neural network as claimed in any one of the preceding claims in which, after a given iteration, the amount of activation allowed to pass through the filter depends only upon the number of elements already processed in the input sequence, and one or more fixed constants.

8. A neural network as claimed in any one of Claims 1 to 6 in which, after a given iteration, the amount of activation allowed to pass through the filter depends at least partially upon the amount of activation of the respective output unit.

9. A neural network as claimed in any one of the preceding claims in which, after a given iteration, the maximum amount of activation allowed to pass through each of the filters during t e next iteration is the same.

10. A neural network as claimed in any one of the preceding claims including cross-inhibitory connections between the output units of the output layer, an increase in activation of an output unit at one end of such a connection tending to reduce the activation of an output unit at the other end.

11. A neural network as claimed in any one of the preceding claims including a further layer of units arranged to hold intermediate representations of the incoming input sequence.

12. A neural network as claimed in any one of the preceding claims in which a given output unit is trained by presenting a sequential test input sequence to the input units and iteratively calculating a weight value associated with each connection to the output unit being trained, the maximum possible iterative change to a weight, at any given instant, varying as the elements of the test input sequence are sequentially received by the input layer.

13. A neural network as claimed in Claim 12 in which the maximum possible iterative change decreases with time as elements of the test input sequence are sequentially received.

14. A neural network as claimed in Claim 12 or Claim 13, when dependent upon Claim 4, in which, based upon the best current estimate of initial values for each of the filters being trained, a record is kept as the test sequence is processed of the cumulative amount of activation so far allowed to pass each filter, or the unused parts of the filter initial values, the weight on a given connection being decreased if, at the end of the test sequence, the filter still has the potential to allow further activation to pass, and the weight being increased if the filter becomes exhausted before the end of the test sequence is reached.

15. A neural network as claimed in Claim 14 in which the best current estimate of initial values for the filters, at the beginning of a new test sequence, is determined on the basis of the weights as at the end of the last preceding test sequence.

16. A sequence-analysing neural network comprising a plurality of parallel neural networks, each as claimed in any one of the preceding claims, defining a respective plurality of subnets, the output layers of the subnets together defining a hidden layer; and the network further including a plurality of top units forming a top layer, each output unit of the hidden layer being connected to at least some of the top units by respective forward connections arranged to transfer activity from the said output unit to its connected top units.

17. A neural network as claimed in Claim 16 including feedback connections from each top unit to at least some of the output units, each feedback connection being arranged to reduce activity on its respective output unit when activity on its respective top unit increases.

18. A neural network as claimed in Claim 16 or Claim 17 in which, during teaching, weights associated with the forward connections are learned.