GB2179483A - Speech recognition

Info

Publication number: GB2179483A
Application number: GB8619696A
Other versions: GB8619696D0, GB2179483B
Authority: GB (United Kingdom)
Prior art keywords: state, signals, sounds, word, probability function
Legal status: Granted; expired
Inventor: Gideon Abraham Senensieb
Original and current assignee: National Research Development Corp UK
Application filed by National Research Development Corp UK
Priority claimed from GB858520777A (GB8520777D0) and GB858527898A (GB8527898D0)

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]

Description

SPECIFICATION

Apparatus and methods for speech recognition

The present invention relates to apparatus and methods for recognising spoken words in a predetermined vocabulary and to apparatus and methods for training speech recognition apparatus.
An approach to speech recognition which has been investigated employs mathematical models termed hidden Markov models. In this approach, a mathematical model of each word to be recognised is established and its parameters are estimated using a suitable training procedure.
Given an observation sequence obtained in a prescribed way from an unknown spoken word, it is possible to compute the maximum a posteriori probabilities that each of the word-models could give rise to that observation sequence. The word corresponding to the model yielding the highest probability can then be recognised as the most likely identity of the unknown word. This concept can be extended to the recognition of sequences of words spoken in connected fashion.
A Markov model is a finite state machine comprising a set of states; for example see the four state model of Fig. 3. The model can only be in any one state at any time. At the end of each discrete time interval the model makes a transition from the current state either to another state or to the current state itself. Given that the model is in one state, it is not possible to predict with certainty what the next state to be visited will be. Rather, the probabilities of making all the possible transitions from one state to another or to itself are known.
During each discrete time interval the model gives rise to an observation which is, in general, a multi-dimensional measurable quantity. In a hidden Markov model, a knowledge of the state is not sufficient to predict with certainty what the observation will be. Instead, the state determines the a priori probability distribution of observations which the model generates whilst in that state.
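By way of illustration only, the following Python sketch shows how such a hidden Markov model generates observations; the three-state structure, the transition probabilities and the one-dimensional Gaussian observation distributions are invented for the example and are not taken from the embodiments described later.

```python
import random

# Hypothetical three-state left-to-right model.  Each state has a self
# transition and a transition to the next state, and each state owns a
# one-dimensional Gaussian observation distribution (mean, standard deviation).
TRANSITIONS = {
    "s1": [("s1", 0.6), ("s2", 0.4)],
    "s2": [("s2", 0.7), ("s3", 0.3)],
    "s3": [("s3", 1.0)],
}
OBSERVATION_PDF = {"s1": (1.0, 0.2), "s2": (2.5, 0.3), "s3": (0.5, 0.1)}

def generate(n_frames, start="s1"):
    """Walk the model for n_frames, emitting one observation per frame."""
    state, sequence = start, []
    for _ in range(n_frames):
        mean, sigma = OBSERVATION_PDF[state]
        sequence.append(random.gauss(mean, sigma))           # drawn from the state's PDF
        states, weights = zip(*TRANSITIONS[state])
        state = random.choices(states, weights=weights)[0]   # next state chosen probabilistically
    return sequence

print(generate(10))
```

Running generate(10) gives a different observation sequence each time, which is the sense in which knowledge of the model, or even of the current state, does not determine the observations.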
The approach to speech recognition mentioned above suffers from the disadvantage that it is necessary to make a large number of calculations in a very short time as samples of sounds making up speech to be recognised are received by a recogniser. This problem is particularly acute where continuous speech is to be recognised. In addition the requirement for a large number of calculations in a short time makes the recognition apparatus too costly for many applications such as personal computers and data entry by voice for example by telephone.
According to a first aspect of the present invention there is provided apparatus for recognising words in a predetermined vocabulary comprising means for successively sampling sounds and obtaining respective sets of signals representative of attributes of the sounds,
means storing data representing a number of finite state machines one for, and forming an individual model of, each word in the predetermined vocabulary, the data specifying the states which form the finite state machines and assigning a probability function to each state with at least one probability function assigned to more than one state, and each probability function specifying the probabilities that the signals representative of the attributes of the sounds would assume any observed values if the models produced real sounds and any model was in a state to which that probability function is assigned, means for determining from the sets of signals and the probability functions, the probabilities that any given set of signals would be generated if the models produced real sounds and any model was in any given state, means for determining from the probabilities calculated and the properties of the finite state machines the maximum likelihoods that sequences of sound samples represent the predetermined words, and means for providing an output, based on the maximum likelihoods detected, indicating one of the predetermined words as being most likely to have been spoken.
According to a second aspect of the present invention there is provided a method of recognising words in a predetermined vocabulary comprising the steps of successively sampling sounds and obtaining respective sets of signals representative of attributes of the sounds, storing data representing a number of finite state machines one for, and forming an individual model of, each word in the predetermined vocabulary, the data specifying the states which form the finite state machines and assigning a probability function to each state with at least one probability function assigned to more than one state, and each probability function specifying the probabilities that the signals representative of the attributes of the sounds would assume any observed values if the models produced real sounds and any model was in a state to which that probability function is assigned, determining from the sets of signals and the probability functions, the probabilities that any given set of signals would be generated if the models produced real sounds and any model was in any given state, determining from the probabilities calculated and the properties of the finite state machines the maximum likelihoods that sequences of sound samples represent the predetermined words, and providing an output, based on the maximum likelihoods detected, indicating one of the predetermined words as being most likely to have been spoken.
The main advantage of the invention is that the number of probability functions can be greatly reduced from, for example, 1024 for a vocabulary of 64 words (that is 64 models each of 16 states) to, for example, about 100, because usually most of the probability functions are assigned to a number of states and therefore to a number of finite state machines. The complexity of calculation, the number of periodic operations required and the amount of storage used can therefore also be reduced with consequent cost savings.
The finite state machines are alternatively known as hidden Markov models.
The means for determining maximum likelihoods may comprise means for implementing the Viterbi algorithm (see S.E. Levinson, L.R. Rabiner and M.M. Sondhi, "An Introduction to the Application of the Theory of Probabilistic Functions of a Markov Process to Automatic Speech Recognition", Bell System Technical Journal, Vol. 62, No. 4, Part 1, April 1983, pp. 1035-1074), or the "forward-backward" algorithm (see S.E. Levinson et al, op. cit.).
According to a third aspect of the present invention there is provided apparatus for recognising words in a predetermined vocabulary comprising means for successively sampling sounds and obtaining respective sets of signals representative of attributes of the sounds, means for storing data representing a number of finite state machines one for, and forming an individual model of, each word in the predetermined vocabulary, the data specifying the states which form the finite state machines and assigning a probability function to each state with at least one probability function assigned to more than one state, and each probability function specifying the probabilities that the signals representative of the attributes of the sounds would assume any observed values if the models produced real sounds and any model was in a state to which that probability function is assigned, means for calculating a distance vector from the probability functions and each set of signals as they occur and for generating a control signal each time a vector has been calculated, each vector being representative of the probabilities that any given set of signals would be generated if the models produced real sounds and any model was in any given state, logic circuit means specially constructed to determine, on receipt of one of the control signals, values representative of the minimum cumulative distances of each state of each finite state machine using the current distance vector, and means for providing an output, based on the minimum cumulative distances calculated, indicating one of the predetermined words as being most likely to have been spoken.
The invention reduces the time required for, and complexity of, calculation by using the logic circuit means which is specially constructed in hardware form to make calculations of cumulative distances.
For hidden Markov models a finite state machine may comprise a plurality of states linked by transitions, each state and each transition having a corresponding signal set probability function and a transition penalty, respectively. A state penalty may then be computed for any observed signal set to represent the reciprocal of the probability of the said observed signal set given the probability function. Each finite state machine represents a word in the vocabulary to be recognised and the transition penalties are characteristic of that word. The state penalties however depend on probability functions and on sounds currently received.
The means for storing data may store the minimum cumulative distance for each machine, and the logic circuit means may be constructed to determine for each transition of a current state a value dependent on the minimum cumulative distance of the originating state of that transition and the transition penalty, to determine the minimum said value and to determine the minimum cumulative distance for the current state from the minimum said value and the state penalty for the current state. Each minimum cumulative distance may be regarded, in effect, as indicating the highest probability that a sequence of sounds forming part of a word in the vocabulary has been uttered.
Apart from the logic circuit means, one or more of the other means mentioned above can be formed by a general purpose computer, preferably one based on a microprocessor and specially permanently programmed for the operations required.
According to a fourth aspect of the invention there is provided apparatus for analysing data relating to the existence of one of a number of conditions, comprising means for receiving sets of signals representing data to be analysed, means for storing data representing a number of finite state machines, one for, and individually modelling, each of the conditions, including a number of functions, at least one of which is assigned to more than one of the states of the finite state machines, and each specifying the probabilities that the signals representative of the attributes of data would assume any observed values if the models produced real signals and any model was in a state to which that function is assigned, means for determining indicators, from the functions and sets of data as they occur, representative of the probabilities that any given set of signals would be generated if the models produced real signals and any model was in any given state, and logic circuit means constructed to determine values of minimum cumulative distance of each state of each finite state machine using the said indicators.
The means for successively sampling sounds may include a microphone coupled by way of an amplifier having a wide dynamic range to an analogue-to-digital converter. In addition these means may include, for example, a bank of digital filters to analyse incoming signals into frequency spectra the intensities of which then form respective said sets of signals or means for obtaining the said sets by linear prediction (see J.D. Markel and A.H. Gray, "Linear Prediction of Speech", Springer-Verlag, 1976).
The means for calculating probabilities preferably includes means for calculating in each frame (that is a predetermined number of sample periods) a distance vector having a number of elements each of which is calculated from the probability function of a state and one of the said sets of signals.
The apparatus and method of the first four aspects of the invention usually require word models defined by machine states and transition probabilities, and the states require the specification of probability functions. These states and functions are obtained by "training" with words of the predetermined vocabulary. This vocabulary can be changed as required by retraining.
According to a fifth aspect of the present invention there is provided, therefore, a method of selecting a number of states for finite state machines to represent words in a predetermined vocabulary and for deriving data to characterise the states comprising successively sampling sounds making up each word in the vocabulary to derive sets of signals representative of attributes of the sounds making up the word, deriving data defining the said states for each word from the sets of signals, deriving one probability function for each state from the sets of signals, merging some of the probability functions to provide a number of the said functions less than a predetermined number, merging being carried out according to criteria which relate to suitability for merging and to the predetermined number, calculating the probability of transition from each state to every allowed subsequent state, in each finite state machine, entering data identifying each word as it is spoken, storing data defining the states which form each machine and identifying the word which is represented by that machine, and storing data defining each merged probability function.
The invention also includes apparatus for carrying out the method of the fifth aspect thereof.
Preferably the step of deriving data defining the states includes merging some successive sets of signals for each word according to criteria which relate to suitability for merging, and maximum and minimum numbers of states for each word to provide data defining the states for that word.
The suitability of two successive sets of signals for merging may be assessed by calculating the logarithm of the ratio of the likelihood that each set of signals for potential merging arose from the merged sets to the likelihood that each set of signals arose from distinct sets. The sets can then be merged if this logarithm exceeds a threshold, which can either be predetermined by experiment or obtained using a test statistic. Similarly the suitability of two probability functions for merging may be assessed by calculating the logarithm of the ratio of the likelihood that each probability function for potential merging arose from the same probability function to the likelihood that each probability function arose from distinct probability functions. The functions can then be merged if this logarithm exceeds a threshold obtained in a similar way to the other threshold.
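The merging test described above can be sketched as follows for diagonal-covariance Gaussian statistics. This is a hypothetical Python illustration: the pooling of means and variances follows the usual formulae, and the exact form of the log-likelihood ratio and the threshold value are assumptions rather than the expressions given in the embodiments.

```python
import math

def pooled_stats(mean1, var1, n1, mean2, var2, n2):
    """Mean and variance of the set obtained by merging two Gaussian-summarised sets."""
    n = n1 + n2
    mean = [(n1 * m1 + n2 * m2) / n for m1, m2 in zip(mean1, mean2)]
    # E[x^2] of the merged set, minus the square of the merged mean.
    var = [
        (n1 * (v1 + m1 * m1) + n2 * (v2 + m2 * m2)) / n - m * m
        for v1, m1, v2, m2, m in zip(var1, mean1, var2, mean2, mean)
    ]
    return mean, var

def merge_log_likelihood_ratio(mean1, var1, n1, mean2, var2, n2):
    """ln(L_merged / L_distinct) for diagonal-covariance Gaussians; never positive.

    Variances are assumed to have been floored above zero (see the variance
    floor used when forming template vectors).
    """
    _, var_y = pooled_stats(mean1, var1, n1, mean2, var2, n2)
    lam = 0.0
    for v1, v2, vy in zip(var1, var2, var_y):
        lam += 0.5 * (n1 * math.log(v1) + n2 * math.log(v2) - (n1 + n2) * math.log(vy))
    return lam

def should_merge(mean1, var1, n1, mean2, var2, n2, threshold=-5.0):
    # threshold is an arbitrary tuning constant, set by experiment as the text suggests
    return merge_log_likelihood_ratio(mean1, var1, n1, mean2, var2, n2) > threshold
```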
Obtaining the said sets of signals preferably includes the known process of dynamic time warping.
Apparatus according to the invention may be constructed for training and/or recognition.
Usually the apparatus will be formed by one or more mask programmed logic arrays and one or more microprocessors, each such array or processor forming a plurality of the means mentioned above.
According to a further aspect of the invention there is provided apparatus for speech recognition comprising means for sampling sounds, means for storing data representing a number of finite state machines corresponding to words to be recognised, and means for determining from the output of the sampling means and the stored data, the likelihoods that sequences of sound samples represent the said words.
The invention also includes methods corresponding to the further aspect and apparatus and methods for determining the stored data for the further aspect from known spoken words.
Certain embodiments of the invention will now be described by way of example with reference to the accompanying drawings in which:

Figures 1(a) and 1(b) form a flow diagram of a method according to the invention for selecting states and deriving probability density functions for a word,
Figure 2 is a flow diagram of a method according to the invention for recognising words,
Figure 3 is a block diagram of apparatus according to the invention,
Figure 4 is a diagram of a finite state machine used in explaining an embodiment of the invention,
Figure 5 is the outline of a table indicating how input values to a Viterbi engine used in one embodiment of the invention occur and are accessed,
Figure 6 is a table indicating quantities which are calculated and stored one row at a time by the Viterbi engine,
Figure 7 is an example indicating how transitions between finite state machines representing different words may occur,
Figure 8 is a diagram of storage locations in a Viterbi engine used in one embodiment of the invention,
Figure 9 is a flow diagram of a Viterbi engine used in one embodiment of the invention, and
Figure 10 is a block diagram of a dedicated circuit for a Viterbi engine used in one embodiment of the invention.
A method of obtaining the parameters of a hidden Markov model of a word from spoken examples of that word during training will first be described. In Fig. 1(a) one of the words of the vocabulary is entered into training apparatus in an operation 10. Briefly, the training apparatus may comprise a microphone, an amplifier with appropriate dynamic gain, an analogue-to-digital converter and a bank of digital filters, the word to be entered being spoken into the microphone.
An operation 11 is then carried out in which feature vectors are formed, one for each frame period, a frame comprising, for example, 128 sampling periods of the analogue-to-digital converter. Each feature vector may comprise a number F, for example 10, of elements corresponding to intensities in respective frequency bands spread across the audio frequency band from 0 to 4.8 kHz, for example, and may be derived using a bank of filters and means for extracting energy values at the filter outputs. Thus a feature vector can be expressed as [x'1, x'2, x'3, ..., x'F]T (where T signifies transposition to the more usual column or vertical vector), for example [x'1, x'2, x'3, ..., x'10]T. In general different utterances of the same word are not identical, an important source of variation being a non-linear distortion of the timescale of one utterance with respect to the timescale of another. For this reason a process of dynamic time warping (operation 12) is carried out which matches the two time-scales by aligning feature vectors for the most recently spoken utterance of the word with the feature vector means of a composite template which, except for the first utterance, has already been derived as described below from previous utterances of the word. Dynamic time warping is a known process which can be carried out using the dynamic programming algorithm and will therefore be only briefly described here. In effect the feature vectors of the most recently spoken utterance are divided into groups of one or more vectors and each group corresponds to one template vector. The correspondence is derived in such a way that the resulting total spectral distance between word vectors and corresponding template vectors is minimised.
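A rough Python sketch of the dynamic programming alignment is given below. The squared Euclidean local distance, the step pattern (each utterance frame either stays on the current template vector or advances by one) and the assumption that the utterance is at least as long as the template are simplifications chosen for brevity.

```python
def dtw_align(utterance, template):
    """Align each utterance frame with a template vector, minimising total distance.

    utterance: list of feature vectors; template: list of template mean vectors.
    Returns, for each utterance frame, the index of the template vector it is
    aligned with.  The utterance is assumed to be at least as long as the template.
    """
    def dist(a, b):  # simple squared Euclidean spectral distance (an assumption)
        return sum((x - y) ** 2 for x, y in zip(a, b))

    INF = float("inf")
    rows, cols = len(utterance), len(template)
    cost = [[INF] * cols for _ in range(rows)]
    back = [[None] * cols for _ in range(rows)]
    cost[0][0] = dist(utterance[0], template[0])
    for i in range(1, rows):
        for j in range(cols):
            best, step = INF, None
            # predecessors: previous frame on the same template vector, or on the one before it
            for pj in (j, j - 1):
                if pj >= 0 and cost[i - 1][pj] < best:
                    best, step = cost[i - 1][pj], (i - 1, pj)
            if step is not None:
                cost[i][j] = best + dist(utterance[i], template[j])
                back[i][j] = step
    # trace back from the final cell to recover one template index per frame
    path, cell = [], (rows - 1, cols - 1)
    while cell is not None:
        path.append(cell[1])
        cell = back[cell[0]][cell[1]]
    return list(reversed(path))
```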
Template vectors are formed in an operation 13 by merging each group of feature vectors with the corresponding template vector according to the alignment derived by the dynamic time warping. Each template vector comprises F elements (using the previous example there are 10 elements) each of which is formed by weighted averaging of each element of the template vector with those elements in the group of feature vectors corresponding to that template vector. A template vector can therefore be expressed as the feature vector means [x̄1, x̄2, x̄3, ..., x̄F]T. In addition each template vector includes elements [σ1², σ2², ..., σF²]T compiled from

σi² = x̄i² - (x̄i)²; if σi² < σmin² then σi² = σmin²

where i takes values 1 to F, σi² is the variance of the set of feature vectors used in forming that template vector, x̄i² is the mean square of that set, and σmin² is a positive constant.
Further, the template vector has associated with it a number N, representing the number of feature vectors merged into that template vector; that is the number of corresponding feature vectors in each group obtained in time warping summed for all repetitions of a word. Each time the operation 13 is carried out the means, mean squares and number of feature vectors are updated. When a word is entered for the first time, the composite template is formed directly from the sequence of feature vectors available from operation 11. Each template vector has its feature vector mean set equal to the corresponding feature vector obtained from the spoken word. The number of feature vectors, N, is set to 1.
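The statistics kept for each template vector and their running update (operation 13) might be sketched in Python as follows; the class layout and the value of the variance floor are assumptions, but the quantities maintained (element means, element mean squares, the floor σmin² and the count N) follow the description above.

```python
SIGMA_MIN_SQ = 1e-4  # the positive constant sigma_min^2 (value chosen arbitrarily here)

class TemplateVector:
    """Per-state statistics: element means, element mean squares and the count N."""

    def __init__(self, feature_vector):
        # First utterance of a word: the template vector is the feature vector itself.
        self.mean = list(feature_vector)
        self.mean_sq = [x * x for x in feature_vector]
        self.count = 1  # N, the number of feature vectors merged so far

    def merge_group(self, group):
        """Weighted-average a group of aligned feature vectors into this vector."""
        for fv in group:
            n = self.count
            self.mean = [(n * m + x) / (n + 1) for m, x in zip(self.mean, fv)]
            self.mean_sq = [(n * s + x * x) / (n + 1) for s, x in zip(self.mean_sq, fv)]
            self.count = n + 1

    def variance(self):
        """sigma_i^2 = mean square minus squared mean, floored at sigma_min^2."""
        return [max(s - m * m, SIGMA_MIN_SQ) for s, m in zip(self.mean_sq, self.mean)]
```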
When a predetermined number of examples of a word have been entered, for example ten, a test 14 prevents further return to the operation 10, and the composite template eventually provides the data defining the states of a hidden Markov model modelling that word. Each of the template vectors can be regarded as representing a model state although at this time there are more states than required and each state is not sufficiently characterised by a template vector. The next group of operations in Fig. 1(a) reduces the number of states to a smaller number.
As a first step the template is segmented. A quantity λ is calculated (operation 15) for each pair of successive template vectors, with the pairs overlapping in the sense that each template vector except the first and last appears in two pairs. Thus λ is calculated for all pairs n and n-1 from n=2 to n=nm, where nm is the number of template vectors in that template. Let the value of λ computed for the pair of template vectors n and n-1 be denoted λ(n). If the likelihood that the observations which occurred arose from the probability distribution obtained when the two template vectors in a pair are merged is Ls, where the observations are the feature vectors contributing to the template vectors, and the likelihood that the observations which occurred arose from two different most likely distributions is Ld, then

λ(n) = ln(Ls / Ld).
λ is thus a measure of how suitable two states are for merging. It can be shown that, assuming Gaussian multivariate distributions with diagonal covariance matrices,

λ(n) = (1/2) Σf [ N1 ln(σ1f²) + N2 ln(σ2f²) - (N1 + N2) ln(σyf²) ]
where the subscripts 1 and 2 refer to the two consecutive template vectors, the subscript f refers to the fth feature in a feature vector, N is the number of feature vectors which have been merged into the template vector and the subscript y applies to a hypothetical template vector which would be formed if the two consecutive template vectors were merged into one.
Modified values λ'(n) are now computed (operation 15') for n=2 to n=nm according to

λ'(n) = 2λ(n) - λ(n-1) - λ(n+1),

with arbitrarily high values set for λ(1) and λ(nm+1). The λ'(n) values are second differences used to indicate large changes in λ(n) and therefore to indicate suitable positions for segmentation as is now described.
The template vectors for a model are stored sequentially and a number of segmentation markers are now inserted between the template vectors to indicate how template vectors are to be merged: where there is no marker adjacent template vectors will be merged. When a certain number of markers have been inserted the number of final states in the model will equal that number plus one and therefore a test 16 is carried out to determine whether the number of states so far defined is less than a maximum value, for example eight. If so, a test 17 is carried out to determine whether the maximum value of λ'(n), λ'max, which has not already been used to insert a segmentation marker is less than a threshold value. If it is less, no further segmentation is required. Provided λ'max is above the threshold a new segmentation marker is inserted in an operation 18 at the position between template vectors corresponding to this value of λ'max and then the test 16 is repeated. If, however, test 17 indicates that λ'max is less than the threshold then a test 19 is performed to determine whether the number of states defined by markers is less than or equal to a minimum value. If it is, a further segmentation marker must be inserted even though λ'max is less than the threshold value and therefore the operation 18 is carried out, this loop (that is the tests 16, 17, 19 and the operation 18) being followed until the number of states becomes greater than or equal to the minimum value, for example three.
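The marker-insertion logic of tests 16, 17 and 19 and operation 18 can be sketched as follows; the dictionary representation of the second differences and the example limits of eight and three states are taken from the text, while the threshold value is left as a parameter.

```python
def insert_segmentation_markers(lambda_dd, threshold, max_states=8, min_states=3):
    """Return the boundary positions n at which segmentation markers are placed.

    lambda_dd maps each position n to the second difference lambda'(n).  Markers
    are placed at the largest unused values first, stopping when the maximum
    number of states would be exceeded or when the remaining values fall below
    the threshold (unless more markers are still needed to reach min_states).
    """
    markers = []
    candidates = sorted(lambda_dd, key=lambda_dd.get, reverse=True)  # positions, best first
    for n in candidates:
        n_states = len(markers) + 1
        if n_states >= max_states:
            break                                    # test 16: enough states already
        if lambda_dd[n] < threshold and n_states >= min_states:
            break                                    # tests 17 and 19: stop inserting
        markers.append(n)                            # operation 18: insert a marker
    return sorted(markers)
```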
If the criterion for the minimum number of states is met in the test 19, before or after inserting markers corresponding to λ'max less than the threshold, then merging of template vectors takes place in an operation 20 of Fig. 1(b). To merge template vectors the feature vector means and mean squares are averaged taking account of the total number of feature vectors giving rise to each template vector merged. Each merged vector so produced with its associated standard deviation defines a probability density function (PDF) assuming that the probability is Gaussian. However, the PDFs so formed are temporary in the sense that the next stage of Fig. 1(b) merges the temporary PDFs with PDFs which have already been stored for other words (if any), or if inappropriate stores further PDFs. Before merging PDFs, the temporary PDFs are stored as a temporary model of the word in an operation 20.
In operation 22 a λ value is calculated between each temporary PDF and each stored PDF unless the word currently used for training is the first word in the vocabulary, in which case the temporary PDFs are stored and training commences on the next word in the vocabulary. Since the temporary and stored PDFs are stored in the same form as template vectors, the calculation of λ is then carried out in the same way as when merging template vectors.
Having completed operation 22 the minimum λ value for one temporary PDF is selected in an operation 23 and compared in a test 24 with a threshold value which indicates whether merging is desirable. If the comparison indicates that merging should take place then that temporary PDF and the stored PDF corresponding to λmin are merged by the averaging process described above in an operation 25 and then new values of λ are calculated between the merged PDF (which is not stored) and each remaining temporary PDF (operation 26).
If the test 24 indicates that the value of λ is greater than the threshold then a test 31 is carried out to determine whether the number of PDFs already stored is less than the maximum number allowed (for example 48) and if so the temporary PDF currently under examination is stored in an operation 32. However if no more PDFs can be stored, as indicated by the test 31, then the temporary PDF must be merged although its λ value is higher than the threshold and therefore the operation 25 is carried out.
As each PDF is merged in the operation 25, the model of the current word is enhanced (operation 29) by substituting a label identifying the merged PDF for the temporary PDF.
An operation 27 is now carried out to determine if there are any more temporary PDFs to be considered for merging and if so a return to operation 23 takes place.
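The loop of operations 22 to 32 might be sketched as below. The sketch assumes, as in this part of the description, that a small value of λ indicates two PDFs are alike enough to merge; the functions dissimilarity and merge_pdfs stand for the λ calculation and the averaging of operation 25 and are placeholders rather than definitions from the specification.

```python
MAX_STORED_PDFS = 48  # example maximum from the text

def assign_pdfs(temporary_pdfs, stored_pdfs, dissimilarity, merge_pdfs, threshold):
    """Merge each temporary PDF into the closest stored PDF or store it as new.

    Returns, for each temporary PDF, the index (label) of the stored PDF that now
    represents it, so the word model can be built from PDF labels (operation 29).
    """
    labels = []
    for temp in temporary_pdfs:                                    # operations 23 and 27
        if stored_pdfs:
            best = min(range(len(stored_pdfs)),
                       key=lambda i: dissimilarity(temp, stored_pdfs[i]))
            best_value = dissimilarity(temp, stored_pdfs[best])
        else:
            best, best_value = None, None
        if (best is None or best_value > threshold) and len(stored_pdfs) < MAX_STORED_PDFS:
            stored_pdfs.append(temp)                               # operation 32: store as a new PDF
            labels.append(len(stored_pdfs) - 1)
        else:
            stored_pdfs[best] = merge_pdfs(stored_pdfs[best], temp)  # operation 25: merge by averaging
            labels.append(best)
    return labels
```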
If there are no more temporary PDFs for merging, then the probabilities of transition from one state to another or of return to the same state (see arrows 36 and 37, respectively, in Fig. 3) are calculated in an operation 28. The probability of transition from one state to the following state can be calculated by dividing the number of example utterances used to train the word by the total number of feature vectors merged into the first-mentioned state. The probability of transition from that state to itself can be obtained by subtracting the probability thus obtained from unity. For example, if ten utterances were used in training and forty feature vectors were merged into a state, the probability of leaving that state is 10/40 = 0.25 and the probability of returning to it is 0.75.
Finally the transition probabilities are stored in an operation 30 with the stored PDF labels to form a complete model of the word. The processes of Figs. 1(a) and 1(b) are repeated for every word with consequent additions to the models stored and changes to the stored PDFs. However since stored PDFs are referenced in models by labels no further processing of completed models is required.
The flow diagram of Figs. 1(a) and 1(b) can be implemented by programming a computer, such as a microprocessor, and since such programming is a well known technique it is not described here. If it is required to allow the processing required for each word utterance to take place during training in times which are roughly equal to the time taken for the word to be spoken, it may be necessary to employ computers which are at least partially specially constructed and one such computer is described later in this specification in connection with the recognition system which uses the models derived in training.
Models are derived for each word in the vocabulary to be recognised, for example for 64 words, by repeating each word, say ten times into the microphone of training apparatus. The apparatus is associated with a data input device such as a keyboard to allow each word to be entered in writing and stored as a label for its model as that word is spoken repeatedly during training.
Moving on now to speech recognition using the hidden Markov models obtained in training, the same analogue circuits are used and the output from the analogue-to-digital converter is separated into the frequency domain using digital filtering as in training.
In each frame period the filter outputs are first formed in an operation 42 (Fig. 2) as a feature vector having, referring to the previous example, ten elements: x'1, x'2, ..., x'10.
Each PDF obtained during training is taken in turn and used with the elements of the feature vector to calculate a probability that the speech sample at that time was generated by the state corresponding to that PDF (operation 43). Each such computation derives, according to the "distance" expression given below, one element in a distance vector [d1, d2, d3, ..., dm]T where m is the total number of PDFs and may, using the previous example, be 48. The distance vectors produced in this way are stored in an operation 44. For a one dimensional Gaussian distribution, the probability that any model state gives rise to an observation equivalent to a feature vector is given by

p(x) = (1 / (σ √(2π))) exp(-(x - x̄)² / (2σ²))

where σ is the standard deviation, x is the observation and x̄ is the mean observation.
If a distribution has J dimensions, as in the present method of speech recognition where J corresponds to the number of elements in the feature vector, the above probability is given by

p(x1, ..., xJ) = Π (1 / (σj √(2π))) exp(-(xj - x̄j)² / (2σj²)), the product being taken over j = 1 to J.

In the above expression Π means (according to known mathematical convention) that the terms obtained by substituting for j are multiplied together. The values for xj are obtained from the feature vector and those for x̄j and σj from the template vector. Minus twice the natural logarithm of this probability, known in this specification as the "distance" of the probability, is given by

d = Σ [ ln(σj²) + (xj - x̄j)² / σj² ], summed over j = 1 to J,

where a term J ln(2π) has been omitted since it is constant and not required in comparing relative "distances". Each element of the resulting distance vector is a measure of how improbable the speech sample is given the corresponding PDF and is computed as proportional to the logarithm of the reciprocal of the probability of the speech sample given the corresponding PDF.
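A direct Python rendering of the per-frame distance calculation (operation 43) using the expression above, with the constant J ln(2π) term omitted, is given below; the representation of a stored PDF as a pair of lists of means and variances is an assumption.

```python
import math

def distance_vector(feature_vector, pdfs):
    """One "distance" per stored PDF: -2 ln p(feature_vector | PDF), less J*ln(2*pi).

    Each PDF is assumed to be a pair (means, variances) of equal-length lists.
    """
    distances = []
    for means, variances in pdfs:
        d = 0.0
        for x, mean, var in zip(feature_vector, means, variances):
            d += math.log(var) + (x - mean) ** 2 / var   # ln(sigma_j^2) + (x_j - mean_j)^2 / sigma_j^2
        distances.append(d)
    return distances
```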
The use of "distances" allows the number of multiplications and divisions to be reduced so speeding recognition and cutting cost of special purpose circuits for implementing the flow diagram.
Having obtained a distance vector which gives the probabilities that the current utterance arises from each possible model state, each model is considered in turn to determine the likelihood that the word uttered corresponds to that model. Thus in operation 45 a model is selected and in operation 46 the minimum distance for each state in the model is computed.
This process is described in detail below in connection with Figs. 4 to 10.
The above process is continued until each model has been processed as indicated by a test 47 and then an operation 48 is carried out in which the smallest minimum distance determined for all states of all models is found, this value being used in operation 49 to normalise all the minimum cumulative distances found. (The operations 48 and 49 are also described in connection with Figs. 4 to 10.) The normalised distances are next stored as a cumulative distance vector in an operation 50. Thus this vector comprises columns of elements D'1, D'2, D'3, ..., D'n, where n is the maximum number of states in a model, with one column for each model. The smallest minimum cumulative distance is also stored. An operation 51 is then carried out to determine from the final state of each model which model has the minimum D'n and provide an indication of the corresponding word as the word considered most likely to have been spoken at the current sampling period. This approach can be readily extended to the recognition of connected words, as has been described in the literature; see J.S. Bridle, M.D. Brown and R.M. Chamberlain, "An Algorithm for Connected Word Recognition", JSRU Research Report 1010, where traceback pointers indicating the paths taken through the models which result in the minimum cumulative distances are used. One way in which traceback pointers can be determined is described below. The following two papers also describe suitable methods for recognising connected words using minimum cumulative distances and traceback pointers: "Partial Traceback and Dynamic Programming", P.F. Brown, J.C. Spohrer, P.H. Hochschild and J.K. Baker, pages 1629 to 1632 of Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 1982; and "The use of a one-stage dynamic programming algorithm for connected word recognition" by Hermann Ney, IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-32, No. 2, April 1984, pages 263 to 271.
The process of Fig. 2 is repeated each frame period and for this purpose a fast special purpose computer is usually required. Such a machine is now described.
Apparatus which can be used for both training and speech recognition is shown in Fig. 3 where a microphone 60 is coupled by way of an amplifier 61 to the input of an analogue-to-digital converter 62. The amplifier 61 has automatic gain control to provide a wide dynamic range but this amplifier and its microphone 60 are not described in detail since suitable circuits for this part of the apparatus are well known. A processor 63 which may for example be a microprocessor type Intel 8051 provides an 8 kHz sampling frequency for the converter 62 and a front end processor 64 by way of a connection 71 and thus each cycle of processing shown in Fig. 2 is completed in 125 microseconds. The processor 64 carries out the digital filtering 40 mentioned above preferably with passbands divided approximately arithmetically between 0 and 1 kHz and logarithmically between 1 kHz and 4.6 kHz. Thus the feature vectors become available on an 8 bit bus 65 where they are passed to a logic circuit 66 which may be a mask programmed array. Handshake between the processor 64 and the logic circuit 66 is controlled by lines 67.
The processor 63, the logic circuit 66 and a random access memory (RAM) 67 are interconnected by address and data busses 68 and 69, and the host computer 72 is coupled by a bus 73 and control line 74 to the processor 63. Items 62 to 67 are expected to form apparatus which is manufactured and sold as a peripheral to be used with a host computer so that the host computer 72 provides input facilities during training, where spoken words must be identified, and a display to indicate words recognised. Where the apparatus is used for recognition only, the RAM 67 may be partially replaced by a read only memory and the host computer 72 may be replaced by a simple or intelligent display to indicate words recognised. The logic circuit 66 performs the Viterbi algorithm and thus carries out operations equivalent to the operations 45 to 48 so determining the minimum cumulative distance. The remainder of the operations shown in Figs. 1(a), 1(b) and 2 are carried out by the processor 63. The RAM 67 stores the composite templates of operation 13 and the models of operation 29 including the PDFs. The RAM 67 also stores the distance vector of operation 44 and the cumulative distance vector of operation 50.
Since programming the processors is a known technique which can be followed given the algorithms of Figs. 1(a), 1(b) and Fig. 2, it is not described further, except that one way in which the operations 46, 48 and 49 may be carried out is now described and then an example of a Viterbi engine for carrying out these operations is given.
Although the example described indicates how traceback pointers are derived, these pointers need not, of course, be derived, if as indicated as an option in connection with the operation 51 of Fig. 2, the recognition process used does not employ such pointers.
The finite state machine of Fig. 4 represents a single word in the vocabulary of the recogniser.
It has three normal states A, B and C, although in practice instead of these three states it would typically have many more. In addition the finite state machine has start and end dummy states (SD and ED). The normal model states are separated by transitions shown as arrows and in speech models only left to right transitions or transitions from one state to itself are employed.
In a simple example, for a word consisting of three sounds the sounds must occur in the left to right order, or, for the duration of one or more time frames, the model may stay in the same state.
The information supplied to the processor 63 by the Viterbi engine, to allow words to be indicated as recognised, concerns maximum likelihood paths through the models and the lengths of these paths. Maximum likelihood is determined by allocating two types of penalties to the models: transition penalties tp(s1,s2) which are individual to the transitions shown by the arrows between, and back to, the states, and state penalties sp(I,t) which are allocated to a normal state at a particular iteration t of the Viterbi engine by means of an index I(s), where s indicates the state.
The transition penalties and indices need not remain constant for the models of a particular Viterbi engine but in this example they are held constant. Table 1 in Fig. 5 shows the way in which the values associated with the indices vary with time. For the first iteration (0) the values associated with the indices I1, I2, ... are formed by the elements of the distance vector for the first frame (0) and successive distance vectors make up the columns 1, 2, ..., i of Table 1. Only the current distance vector is held by the Viterbi engine.
The way in which the minimum distance through each model is calculated is now described. In each iteration the Viterbi engine calculates a minimum cumulative distance for each normal state starting at the rightmost state and progressing left. Thus for iteration 0 and state C the minimum cumulative distance from three paths is determined and in each path a stored cumulative distance for the state at the beginning of the path is added to the transition penalty for being in that path. The minimum value found is added to the penalty for being in state C obtained using the index allocated to that state and the current distance vector (that is the currently available column of Table 1). The minimum cumulative distance (MCD) found in this way for each state is held in a store which at any time holds one row of Table 2 in Fig. 6. At the first iteration 0 the values in row 0 are initially set to a maximum so that when iteration 1 is carried out the cumulative distances for states A, B and C are available to update the minimum cumulative distances as just described. For example to update the minimum cumulative distance for state C the previous cumulative distances (row 0) for states A, B and C are available. Having decided which is the minimum cumulative distance for state C at iteration 1 this distance is used to update MCD and at the same time a traceback pointer (TB), previously set to 0 at iteration 0, is also updated by incrementing by one the previous traceback pointer for the state from which the new minimum cumulative distance was calculated. Thus at each iteration (t) for each state (s) a minimum cumulative distance (MCD(s,t)) and a traceback pointer (TB(s,t)) are given by

MCD(s,t) = DV(I(s),t) + min_x {tp(x,s) + MCD(x,t-1)}
X = argmin_x {tp(x,s) + MCD(x,t-1)}
TB(s,t) = TB(X,t-1) + 1

where t refers to the current iteration, DV(I(s),t) is the element of the distance vector obtained by using the index I(s) for a state s at iteration t, tp(x,s) is the transition penalty associated with making a transition from state x to state s, MCD(s,t) is the minimum cumulative distance to state s at iteration t, TB(s,t) is the traceback pointer associated with state s at iteration t, min_x {} signifies the minimum of the expression for all valid values of x, and argmin_x {} signifies the value of x which causes the expression to be minimised.
As has been mentioned the Viterbi engine finds in each frame the minimum cumulative distance and traceback pointer for each state. For each finite state machine these minimum cumulative distances and the corresponding traceback pointers are passed to the decision logic as data for the logic to indicate which words have apparently been spoken. In addition to the operations described above, the Viterbi engine finds, at the end of each complete iteration through all the models, the minimum of the minimum cumulative distances for all models and in the next iteration subtracts this minimum from all the stored minimum cumulative distance values. This scaling operation is carried out to prevent, as far as possible, the stored cumulative distance values from increasing beyond the size which locations in the store can hold, that is to prevent overflow. From the above equations it can be seen that minimum cumulative distances are formed by addition processes so they inevitably increase. By subtracting a minimum value from all values found the tendency to increase is reduced. However, if a stored minimum cumulative distance reaches the maximum size which can be stored it is held at that value without being increased and, in order to prevent misleading cumulative distance values being stored, the subtraction process just described is not carried out once the maximum has been reached. The maximum values automatically exit from the store eventually because at each iteration minimum cumulative distances are stored, so eventually eliminating the maximum values.
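The recurrence and the scaling step can be combined into the following Python sketch of one iteration for a single word model. The list-based model representation stands in for the RAM "skeleton" described later, and the overflow handling is simplified to clamping at a large constant.

```python
BIG = 10 ** 9  # stands for the maximum value a store location can hold

def viterbi_iteration(model, distance_vector, prev_mcd, prev_tb, prev_global_min):
    """One iteration (one frame) of the Viterbi recurrence for one word model.

    model: list of states; each state is (pdf_index, [(origin_state, transition_penalty), ...]).
    prev_mcd / prev_tb: minimum cumulative distances and traceback pointers from the
    previous iteration.  prev_global_min is subtracted from every value for scaling.
    """
    mcd, tb = [], []
    for pdf_index, incoming in model:
        best, best_origin = BIG, None
        for origin, penalty in incoming:
            candidate = prev_mcd[origin] + penalty      # tp(x,s) + MCD(x, t-1)
            if candidate < best:
                best, best_origin = candidate, origin
        state_penalty = distance_vector[pdf_index]      # DV(I(s), t)
        value = best - prev_global_min + state_penalty  # scaled minimum plus state penalty
        mcd.append(min(value, BIG))                     # clamp instead of overflowing
        tb.append(prev_tb[best_origin] + 1 if best_origin is not None else 0)
    return mcd, tb
```

After every model has been processed in a frame, the smallest element of all the returned mcd lists would become prev_global_min for the next frame; the start and end dummy states are handled separately as described next.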
The two dummy states of Fig. 4 are used to simplify transitions from one finite state machine, representing one word, to another. In Fig. 7 two different words have similar models A, B, C and A', B' and C', respectively, although as mentioned above the transition penalties and indices are different. A normal transition from the end of the first model to the beginning of the second model is indicated by the arrow 110 from state C to state A'. However transitions such as indicated by the arrows 111 and 112 omitting the last state of the first model or the first state of the second model often occur when parts of words are not pronounced. The transition from state B to state B' also often occurs. To take account of all the transitions which could occur between 64 words in the vocabulary of a typical recogniser would be extremely complicated if carried out on the basis of Fig. 5 but this problem is avoided by the use of the dummy states in a way which is now described. When the Viterbi engine has completed updating the normal states for a word it updates the end dummy state (ED) by finding the minimum cumulative distance for that state in the same way as for a normal state except that there is no state penalty for the end dummy state. A traceback pointer is also stored for the end dummy state and found in the same way as other such pointers except that the traceback pointer corresponding to the minimum cumulative distance selected is not incremented before storage. When the Viterbi engine has processed all the models the end dummy with the smallest minimum cumulative distance is used to update the start dummy states (SD) of selected word models, and the other start states are updated to a maximum value. The traceback pointers for all start dummies are set to zero. Start states are selected for updating with the smallest minimum cumulative distance on a grammatical basis according to which words in the vocabulary of words to be recognised can or cannot follow previous words. Where there is no such grammar, or the grammar is ignored, all start states are updated with the smallest minimum cumulative distance found. In this way the minimum path and cumulative distance for the transition from any finite state machine to another such machine is recorded for each machine.
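The dummy-state bookkeeping at the end of an iteration might look as follows in software; representing the grammar as a mapping from each word to the set of words allowed to follow it is an assumption about how the grammatical basis could be encoded.

```python
BIG = 10 ** 9  # stands for the largest value a store location can hold

def update_dummy_states(models, grammar=None):
    """After all normal states are updated, propagate between-word transitions.

    models: dict mapping word name to an object with end_mcd, start_mcd and start_tb
    attributes (the end and start dummy states).  grammar, if given, maps a word to
    the set of words allowed to follow it; if None, any word may follow any other.
    """
    best_word = min(models, key=lambda w: models[w].end_mcd)
    best_mcd = models[best_word].end_mcd
    allowed = grammar.get(best_word, set(models)) if grammar else set(models)
    for word, model in models.items():
        model.start_tb = 0                               # traceback pointers of start dummies reset
        model.start_mcd = best_mcd if word in allowed else BIG
    return best_word
```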
The Viterbi engine may be constructed using a general purpose computer or microprocessor but for continuous speech recognition it is preferable to construct a special purpose computer either from discrete integrated circuits or preferably on a single chip having a specially metallised gate or logic array in order to carry out processing in a sufficiently short time. The special purpose computer, an example of which is described later, may for example comprise an arithmetic logic unit with output connected to a number of latches some connected to the address bus of a random access memory and others to the data bus.
In order to make a computer as described in the previous paragraph flexible for this application, that is to avoid hard wiring it in such a way that the number of states in the models of finite state machines and the transitions between states cannot be changed, part of the RAM 67 is allocated as a "skeleton" determining the number of states in each model and the transition paths between states. Various portions of the RAM are shown in Fig. 8 and one of these portions 115 is the "skeleton". The portions 116 to 118 contain most of the data for three respective models and there is one similar portion for each model. Taking the model of Fig. 4 to be represented by the RAM portion 116 it will be seen that the three lower locations contain the three entry transition penalties tp1 to tp3 corresponding to the transitions to state C from other states and from itself. The fourth location contains the index which allows the state penalty to be found from Table 1. Portion 116 is divided into other locations for the states A, B and ED.
The state C has one more location than the states A and B since there are three entry transitions and the state ED has one less location than the state A because there is no penalty associated with being in this state. State SD has no locations in the portion 116 since there are no transitions to it and no associated penalty. As will be described below, in carrying out iterations the Viterbi engine uses pointers to move through the portions 116, 117 and 118 and those of other models in sequence. A pointer is also used to move through the skeleton RAM portion 115 where, for each state, an offset is stored for each transition penalty.
Further portions of the RAM, one for each word, are set aside to store the minimum cumulative distances and traceback pointers associated with each state in each word. Examples of these portions are shown at 120, 121 and 122 in Fig. 8. For the word shown in Fig. 4 the RAM portion 120 is divided into five pairs of locations, one for each state, each pair containing the cumulative distance and traceback pointer for that state.
In order to update the minimum cumulative distances three pointers indicated by arrows 123, 124 and 125 and held, in the special purpose computer, by latches, are used. The first pointer 123 initially points to the first transition penalty of the last state of the first model to be updated, in this example held by the lowest location in the RAM portion 116. The Viterbi engine has to determine which cumulative distance, at this time to the state C, is a minimum. In order to do so it must find the cumulative distances for each of the paths to the state C and then determine which is the minimum. Having obtained the first transition penalty, the pointer 124 points to an offset in the skeleton RAM portion 115 which determines the position of the cumulative distance in the state from which the first transition originates. Thus if the originating state is the current state then the offset is zero and the cumulative distance is that given by the pointer 125. The distance found is stored and the pointers 123 and 124 are incremented by one so that the second transition penalty is added to the cumulative distance given by offset 2 which in this case might be "2" so that the value held at a location with an address two greater than that of the pointer 125 is read, that is the cumulative distance of state B. The distance found in this way is stored if it is smaller than that previously found. The pointers 123 and 124 are again incremented to give the third transition and the cumulative distance to be added to it, and then the distance found is again stored if smaller than the distance already stored. In this way the minimum of the cumulative distances found is stored. The pointers 123 and 124 are again incremented and the pointer 123 then gives the index for finding the appropriate state penalty and the pointer 124 indicates that the end of iterations for state C has been reached. The minimum cumulative distance found together with its appropriate traceback pointer is now stored in the position given by the pointer 125 and all three pointers are incremented. The cumulative distances for states B and A are then updated in a similar manner as the pointers 123, 124 and 125 are incremented. However, the cumulative distance and traceback pointer for the end state (ED) is obtained when all the states in a model have been updated and, as has been mentioned above, when all models have been updated (except for their start states) then the minimum cumulative distance held by any end dummy state is entered in selected start states (SDs) on a grammatical basis.
The use of the "skeleton" allows the models to be changed in three ways: firstly, by changing the offsets the transitions between states are altered; secondly, if the number of offsets between two markers is changed the number of transitions to a state changes, it being this marker which indicates the completion of updating of one state; and thirdly, the number of states in each model can be changed by increasing the number of groups of offsets stored by the RAM portion 115. In addition the number of models can be increased by allocating more portions such as the portions 116, 117 and 118 and 120, 121 and 122.
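A loose Python analogue of this layout, with invented offsets, penalties and indices, may help to show how the skeleton separates model structure from model data; it is illustrative only and does not reproduce the actual RAM format.

```python
# One entry per model state, in the order the Viterbi engine visits them.
# 'offsets' says where, relative to the cumulative-distance pointer, the
# cumulative distance of each originating state is stored; changing the offsets
# changes the transitions, changing how many there are changes the number of
# transitions to a state, and adding entries changes the number of states.
SKELETON = [
    {"offsets": [0, 1, 2]},   # state C: transitions from itself, B and A
    {"offsets": [0, 1]},      # state B: transitions from itself and A
    {"offsets": [0]},         # state A: transition from itself only
    {"offsets": [1]},         # end dummy ED: reached from C, with no state penalty
]

# Per-model data mirroring portions 116 to 118: one transition penalty per offset,
# plus the index used to look up the state penalty in the current distance vector.
WORD_MODEL = [
    {"penalties": [3, 7, 12], "pdf_index": 5},     # state C
    {"penalties": [2, 9],     "pdf_index": 1},     # state B
    {"penalties": [4],        "pdf_index": 1},     # state A (shares its PDF with B)
    {"penalties": [0],        "pdf_index": None},  # end dummy: no state penalty
]
```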
The operation of the flow chart of Fig. 9, which shows one iteration of the Viterbi engine, will mainly be apparent from the chart. In an operation 130 the distance vector is read into the RAM at a location which represents the current column of Table 1. A loop 131 is then entered in which the cumulative distance CDn for one transition of the last normal state of the first model is calculated. If the value of CDn is less than a reset value of the minimum cumulative distance for that state or a previously calculated value (MCD), as appropriate, then MCD is set to CDn in an operation 132. At the same time the corresponding traceback pointer TB for the cumulative distance of the originating state of that transition is incremented by one. A test 133 determines whether there are further transitions to the current state and if not then a loop 134 is followed in which scaling by subtracting the smallest minimum distance from the previous iteration is carried out, the state penalty at that state is added, and the smallest MCD is retained. At the end of the loop 134 MCD and TB for that state are retained and the loop 131 is repeated for the next state. In carrying out this repeat MCD is reset for the new state and the values of CDn, tp, CDo and TB relate to the new state. If the new state is an end state as determined by a test 135 then loops 131 and 134 are repeated as indicated by a jump 136 but again variables are reset as required to represent the new models and new states. When all the models have been processed as indicated by a test 137 the start dummy states of all models are updated and the various values of MCD and TB relating to all the states of all the models are ready for output to the processor 63 to allow decisions on words to be indicated as recognised. The next iteration can now take place starting with the operation 130.
A circuit 140 which may be used for the Viterbi engine and thus the logic circuit 66 is shown in Fig. 10, and may, for example, be constructed from discrete integrated circuits or a customised gate array. The Viterbi engine is coupled to the RAM 67 by means of a 16-bit address bus 142 and an 8-bit data bus 143. The RAM 67 includes the areas 115 to 120 of Fig. 8 and similar areas.
When an iteration of the Viterbi engine is to take place a "GO" signal is applied to a terminal 144 connected to a controller 145 which, in one example, comprises a sequencer and a memory (neither shown). In response to the "GO" signal the sequencer cycles through all its states and in so doing addresses the controller memory which outputs a sequence of patterns of binary bits on terminals 146. An enable terminal of every circuit of the Viterbi engine 140 (except the controller 145) is connected individually to a respective one of the terminals 146 so that as the patterns of bits appear different circuits or groups of circuits are enabled.
The first operation uses a parameter base address held permanently by a latch to address a parameter array in the RAM 67. A buffer 148 is enabled by the controller 145 and the RAM is addressed by way of the bus 142. The parameter array holds the addresses of the four pointers 123 (penalty pointer-Pempt), 124 (skeleton pointer-Tpt), 125 (cumulative distance pointer-CDpt) and an index pointer (Dvec) for reading Table 1. These pointers are read into latches 150, 151, 152 and 153, respectively, each in two serial 8-bit bytes by way of a buffer 154 and a latch 155, the controller 145 again applying the required control signals (from now on in this description the function of the controller will be taken to be understood except where special functions are required). The quantity by which cumulative distances should be reduced is also initialised by reading a further element of the parameter array and entering the result into a latch 158. To complete initialisation two latches 156 and 157, representing the smallest value of a previous cumulative distance plus a transition penalty and the smallest cumulative distance for the current iteration, respectively, are set to maximum.
The skeleton pointer 124 in the latch 151 is now used to read the first offset from the RAM 67, which is loaded into a latch 160 by way of the buffer 154. Then the pointer 124 is incremented by application to one input of an arithmetic logic unit (ALU) 161 through a buffer 162 at the same time as a buffer 163 is used to force a "one" on to another input of the ALU.
A buffer 159 loads the incremented pointer 124 back to the latch 151.
The cumulative distance pointer 125 is passed from the latch 152 to the ALU where it is added to the required offset now held in the latch 160, and the result is passed to a latch 164 holding a temporary pointer (Vpt). Vpt is used to read the cumulative distance of the state originating the first transition (for example A to C in Fig. 1) held in the RAM 67, since the offset is the incremental address of this distance in the area 120 from the pointer 125. The distance read is loaded into a latch 165 and Vpt is incremented by one using the buffer 162, the buffer 163, the ALU and the buffer 159. Thus Vpt now points at the traceback pointer corresponding to the cumulative distance held by the latch 165 and this pointer is read into a latch 166. Now the penalty pointer 123 is used to read the appropriate transition penalty from the RAM 67 into the latch 160, and the ALU sums the contents of the latches 160 and 165, that is the cumulative distance and the transition penalty, and the result is loaded into the latch 160 by way of a buffer 167. The latch 156 normally holds the smallest value of cumulative distance found so far for a state but for the first transition it is, as already mentioned, set to maximum.
Thus normally the ALU compares the contents of the latches 156 and 160 and if the contents of the latch 160 are less than the smallest value found so far a flag is set on a control line 168 which causes the controller to read from the latch 160 into the latch 156, and to read the traceback pointer from the latch 166 into a latch 169. Thus the best cumulative distance plus transition penalty found so far and the corresponding traceback pointer have been found.
The skeleton pointer is now used to initiate the investigation of another transition to the first state and the cumulative distance plus transition penalty found is checked to determine whether its sum is smaller than the value held in the latch 156. If so it is stored, together with the corresponding traceback pointer, in the latches 156 and 169. This process continues until the skeleton pointer 124 reaches the first marker, which is of a type known as an end of state (EOS) marker. When such a marker appears at the output of the buffer 154 it is detected by a detector 170 and the controller 145 is signalled to continue in the way now described.
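The scan just described can be pictured as a walk along a flat array of offsets terminated by a marker. The sketch below uses an invented sentinel value for the EOS marker and an assumed layout in which each cumulative distance is immediately followed by its traceback value and each offset is paired with a transition penalty; these details are illustrative and not the patent's actual RAM format.

```python
# Sketch of the skeleton-driven scan of transitions into one state.
# EOS is a hypothetical sentinel standing in for the end-of-state marker.

EOS = -1

def best_transition(skeleton, s_ptr, cd_area, cd_base, penalties, p_ptr):
    """Scan offsets from s_ptr until an EOS marker, returning the smallest
    (previous cumulative distance + transition penalty), its traceback value,
    and the advanced skeleton and penalty pointers."""
    best, best_tb = float("inf"), 0
    while skeleton[s_ptr] != EOS:
        offset = skeleton[s_ptr]
        cd = cd_area[cd_base + offset]        # cumulative distance of the originating state
        tb = cd_area[cd_base + offset + 1]    # traceback value assumed to be stored just after it
        cand = cd + penalties[p_ptr]
        if cand < best:
            best, best_tb = cand, tb          # keep the minimum found so far
        s_ptr += 1
        p_ptr += 1
    return best, best_tb, s_ptr + 1, p_ptr    # skip past the EOS marker
```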
The ALU subtracts the smallest cumulative distance found in the last iteration, held, as mentioned above, in the latch 158, from the contents of the latch 156 and writes the result into the latch 165. The address indicated by the penalty pointer 123, that is the index for Table 1, is used to write the index into the latch 160 and the ALU adds these contents to the contents of the latch 153, which as a result of the initialisation mentioned above now holds the base address of that area of the RAM 67 which contains the elements of the distance vector (that is the current column of Table 1). The result is written to the latch 164 and becomes the pointer Vpt, which is used to read the appropriate element of the distance vector into the latch 160. The ALU adds the contents of the latches 160 and 165 (holding the scaled smallest cumulative distance) to give the updated cumulative distance, which is entered into the location indicated by the pointer 125 held in the latch 152. The traceback pointer in the latch 169 is now incremented by one in the ALU and the result entered into the latch 160. The contents of the latch 152 are also incremented by one in the ALU and the address so formed is used to enter the traceback pointer in the latch 160 into the RAM 67 by way of a buffer 171. In order to be ready to process the next state the contents of the latch 152 are again incremented by one.
The operations for one state have now been concluded and the skeleton pointer is again used to read the RAM, and the other normal states are processed in a similar way. Eventually the detector 170 finds the marker between the last normal state and the end dummy, this marker being of a type known as end of word (EOW), and when the next state (the end state) is processed the following operations are omitted: the operation employing the index and the operation of incrementing the traceback pointer. Also, since the end dummy processing is in the opposite direction through the model (see Fig. 4), in those operations requiring an offset between the pointer 125 and the cumulative distances to be read, the offsets are subtracted instead of added.
When the next marker after an EOW marker is encountered by the skeleton pointer 124 initialisation of the latches 151 and 156 takes place, and then the next word model is processed. An invalid transition penalty is held in the location above the area 116 of the RAM 67 and when the controller attempts to continue processing after the last word in the vocabulary has been processed, the first transition penalty read out is found by the detector 170 to be the invalid penalty and an end of vocabulary (EOV) control signal reaches the controller, signalling the end of an iteration. The controller then writes out the smallest cumulative distance found in this iteration for use in the next iteration and applies a "FINISH" signal to a terminal 172. The decision logic, usually in the form of a microprocessor, connected thereto then carries out an iteration using the cumulative distances and traceback pointers which have been transferred to the RAM 67. No further operations take place in the Viterbi engine until the decision logic or an associated microprocessor applies a further "GO" signal to the terminal 144.
As seen in Fig. 10 the upper input of the ALU 161 is 16-bit while the lower input is 8-bit.
Since the outputs of the latches 156 to 158, 165 and 169 are 8-bit, a latch 174 is provided which forces eight zero bits at the high significance side of the upper input of the ALU when these latches pass input to the ALU.
A detector 175 is connected at the output of the ALU to detect saturation and to control a buffer 176 so that whenever saturation occurs the output of the buffer 176 is enabled instead of the output of the buffer 167. The buffer 176 contains the maximum possible number while the buffer 167 may contain an overflow. The detector 175 also has an input from the buffer 163 and if this input indicates that the ALU is performing an operation on the maximum value then saturation is also deemed to have been detected and the output of the buffer 176 (that is the maximum value) is used as the result of the operation.
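The behaviour of the detector 175 and buffer 176 amounts to saturating (clamped) arithmetic. A minimal sketch, assuming 8-bit unsigned distances with an illustrative maximum of 255:

```python
# Sketch of the saturating addition implied by detector 175 / buffer 176,
# assuming 8-bit unsigned cumulative distances (MAX = 255 is illustrative).

MAX = 0xFF

def sat_add(a, b):
    """Add two distances; once a value reaches MAX it stays at MAX."""
    if a == MAX or b == MAX:   # operating on the maximum value counts as saturation
        return MAX
    s = a + b
    return MAX if s > MAX else s
```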
For synchronisation of the various circuits operated by the controller 145 a conventional clock pulse system is used, with an external clock pulse applied to a terminal 173. The arrows to the right of each latch indicate a clock terminal while the circles to the left indicate an enable terminal.
The apparatus described above relates to the recognition of isolated words but it can be extended to recognise continuous speech formed from a limited trained vocabulary by tracking the various word model states which are traversed using, for example, the traceback pointer (see the above mentioned references by P.F. Brown et al, and by Ney).
It will be clear that the invention can be put into practice in many other ways than those specifically described above. For example, other hardware for finding minimum cumulative distances may be specially designed, the distance vector can be derived from a linear prediction function, and the Viterbi process can be replaced by a forward-backward algorithm.
Probability functions may be derived in other ways than those specifically described, but the number of such functions must be smaller, preferably considerably smaller, than the total number of states in all the finite state machines.
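This sharing of probability functions between states (often called tying) is what keeps the per-frame work proportional to the number of functions rather than the number of states. The sketch below assumes diagonal-covariance Gaussian functions purely for illustration; the pool sizes, parameters and names are invented.

```python
import math

# Sketch of probability-function tying: states hold only an index into a shared
# pool of functions, so each frame needs one distance per function, not one per state.

functions = [                                   # shared pool (smaller than the state count)
    {"mean": [1.0, 2.0], "var": [0.5, 0.5]},
    {"mean": [0.0, 0.5], "var": [1.0, 1.0]},
]
state_to_function = [0, 0, 1, 1, 0]             # five states share two functions

def neg_log_gauss(x, f):
    """Negative log likelihood of frame x under a diagonal Gaussian."""
    d = 0.0
    for xi, m, v in zip(x, f["mean"], f["var"]):
        d += 0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
    return d

def distance_vector(frame):
    return [neg_log_gauss(frame, f) for f in functions]   # one element per function

frame = [0.8, 1.7]
dvec = distance_vector(frame)
state_penalties = [dvec[state_to_function[s]] for s in range(len(state_to_function))]
```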

Claims (19)

1. Apparatus for recognising words in a predetermined vocabulary comprising means for successively sampling sounds and obtaining respective sets of signals representative of attributes of the sounds, means storing data representing a number of finite state machines one for, and forming an individual model of, each word in the predetermined vocabulary, the data specifying the states which form the finite state machines and assigning a probability function to each state with at least one probability function assigned to more than one state, and each probability function specifying the probabilities that the signals representative of the attributes of the sounds would assume any observed values if the models produced real sounds and any model was in a state to which that probability function is assigned, means for determining from the sets of signals and the probability functions, the probabilities that any given set of signals would be generated if the models produced real sounds and any model was in any given state, means for determining from the probabilities calculated and the properties of the finite state machines the maximum likelihoods that sequences of sound samples represent the predetermined words, and means for providing an output, based on the maximum likelihoods detected, indicating one of the predetermined words as being most likely to have been spoken.
2. Apparatus according to Claim 1 wherein the means storing data stores, for each probability function, a template vector having elements relating to the intensities and/or variance of sounds in respective frequency ranges.
3. Apparatus according to Claim 1 or 2 wherein the means for calculating probabilities comprises means for calculating a distance vector from the probability functions and each set of signals as they occur, each element of the distance vector has a corresponding probability function and is representative of the reciprocal of the probability of observing the current signal set if any finite state machines produced real sounds and occupied a state to which the corresponding probability function had been assigned, and a special purpose logic circuit means constructed to determine the minimum cumulative distances of each state of each finite state machine using the current distance vector.
4. A method of recognising words in a predetermined vocabulary comprising the steps of successively sampling sounds and obtaining respective sets of signals representative of attributes of the sounds, storing data representing a number of finite state machines one for, and forming an individual model of, each word in the predetermined vocabulary, the data specifying the states which form the finite state machines and assigning a probability function to each state with at least one probability function assigned to more than one state, and each probability function specifying the probabilities that the signals representative of the attributes of the sounds would assume any observed values if the models produced real sounds and any model was in a state to which that probability function is assigned, determining from the sets of signals and the probability functions, the probabilities that any given set of signals would be generated if the models produced real sounds and any model was in any given state, determining from the probabilities calculated and the properties of the finite state machines the maximum likelihoods that sequences of sound samples represent the predetermined words, and providing an output, based on the maximum likelihoods detected, indicating one of the predetermined words as being most likely to have been spoken.
5. Apparatus for analysing data relating to the existence of one of a number of conditions, comprising means for receiving sets of signals representing data to be analysed, means for storing data representing a number of finite state machines one for, and individually modelling, each of the conditions, including a number of functions at least one assigned to more than one of the states of the finite state machines and each specifying the probabilities that the signals representative of the attributes of data would assume any observed values if the models produced real signals and any model was in a state to which that function is assigned, means for determining indicators, from the functions and sets of data as they occur, representative of the probabilities that any given set of signals would be generated if the models produced real signals and any model was in any given state, and logic circuit means constructed to determine values of minimum cumulative distances of each state of each finite state machine using the said indicators.
6. Apparatus for recognising words in a predetermined vocabulary comprising means for successively sampling sounds and obtaining respective sets of signals representative of attributes of the sounds, means for storing data representing a number of finite state machines one for, and forming an individual model of, each word in the predetermined vocabulary, the data specifying the states which form the finite state machines and assigning a probability function to each state with at least one probability function assigned to more than one state, and each probability function specifying the probabilities that the signals representative of the attributes of the sounds would assume any observed values if the models produced real sounds and any model was in a state to which that probability function is assigned,
means for calculating a distance vector from the probability functions and each set of signals as they occur and for generating a control signal each time a vector has been calculated, each vector being representative of the probabilities that any given set of signals would be generated if the models produced real sounds and any model was in any given state, logic circuit means specially constructed to determine, on receipt of one of the control signals, values representative of the minimum cumulative distances of each state of each finite state machine using the current distance vector, and means for providing an output, based on the minimum cumulative distances calculated, indicating one of the predetermined words as being most likely to have been spoken.
7. Apparatus according to any of Claims 1 to 3, 5 and 6 wherein each finite state machine comprises a plurality of the said states linked by transitions, each state and each transition having a corresponding state and transition penalty, respectively, held as part of the data representing the finite state machines, each transition penalty being a characteristic of the finite state machine of which it is a part and each state penalty depending on the probability function assigned to the corresponding state and on current said sets of signals.
8. Apparatus according to Claim 7 insofar as dependent on Claim 3 or 4 including means for storing the minimum cumulative distance for each machine, and wherein the logic circuit means is constructed to determine for each transition of a current state a value dependent on the minimum cumulative distance of the originating state of that transition and the transition penalty, to determine the minimum said value and to determine the minimum cumulative distance for the current state from the minimum said value and the state penalty for the current state.
9. A method of forming finite state machines to represent words in a predetermined vocabulary comprising sampling sounds making up each word in the vocabulary to obtain sets of signals representative of the words, forming an individual finite state machine for each word comprising states based on the sets of signals obtained for that word, forming probability functions, and assigning one probability function to each state, with at least one probability function assigned to more than one state, each probability function relating the sets of signals to the probability of observing the sounds when any of the finite state machines containing a state to which that probability function is assigned correctly models an utterance by being in that state.
10. A method of selecting a number of states for finite state machines to represent words in a predetermined vocabulary and for deriving data to characterise the states comprising successively sampling sounds making up each word in the vocabulary to derive sets of signals representative of attributes of the sounds making up the word, deriving data defining the said states for each word from the sets of signals, deriving one probability function for each state from the sets of signals, merging some of the probability functions to provide a number of the said functions less than a predetermined number, merging being carried out according to criteria which relate to suitability for merging and to the predetermined number, calculating the probability of transition from each state to every allowed subsequent state, in each finite state machine, entering data identifying each word as it is spoken, storing data defining the states which form each machine and identifying the word which is represented by that machine, and storing data defining each merged probability function.
11. A method according to Claim 10 wherein deriving the data defining the states and the one probability function for each state includes merging some sets of signals for each word according to criteria which relate to suitability for merging to provide data defining the states for that word.
12. A method according to Claim 11 wherein the criteria for merging sets of signals for each word include predetermined maximum and minimum numbers of states for each word.
13. A method according to Claim 11 or 12 wherein successive sets of signals are merged and the suitability of two successive sets for merging is assessed by calculating the logarithm of the ratio of the likelihood that the sets of signals for potential merging arose from the same set to the likelihood that each set of signals arose from distinct sets.
14. A method according to any of Claims 10 to 13 wherein the suitability of two probability functions for merging is assessed by calculating the logarithm of the ratio of the likelihood that each probability function for potential merging arose from the same probability function to the likelihood that each probability function arose from distinct probability functions.
15. Apparatus arranged to carry out the method of any of Claims 9 to 14.
16. Apparatus for speech recognition comprising means for sampling sounds, means for storing data representing a number of finite state machines corresponding to words to be recognised, and means for determining from the output of the sampling means and the stored data, the likelihoods that sequences of sound samples represent the said words.
17. Apparatus for recognising words in a predetermined vocabulary substantially as hereinbefore described with reference to Fig. 3.
18. A method of recognising words in a predetermined vocabulary substantially as hereinbefore described with reference to Fig. 2.
19. A method of selecting states for finite state machines to represent words in a predetermined vocabulary substantially as hereinbefore described with reference to Figs. 1a and 1b.
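Claims 13 and 14 describe a log likelihood ratio test for deciding whether two groups of frames, or two probability functions, should be merged. The sketch below illustrates the idea for one-dimensional single-Gaussian models; the maximum-likelihood fits, the variance floor and the interpretation of the score are assumptions made for the example and are not taken from the patent.

```python
import math

# Sketch of the merging test of Claims 13-14: compare the likelihood that two
# groups of frames came from one common distribution with the likelihood that
# they came from two distinct distributions. One-dimensional Gaussians are
# assumed purely for illustration.

def log_lik(xs, mean, var):
    return sum(-0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var) for x in xs)

def merge_score(a, b, floor=1e-3):
    """Log ratio: likelihood under one shared Gaussian over likelihood under
    separate Gaussians. Larger (less negative) values favour merging."""
    def fit(xs):
        m = sum(xs) / len(xs)
        v = max(sum((x - m) ** 2 for x in xs) / len(xs), floor)
        return m, v
    m_ab, v_ab = fit(a + b)
    m_a, v_a = fit(a)
    m_b, v_b = fit(b)
    same = log_lik(a + b, m_ab, v_ab)
    distinct = log_lik(a, m_a, v_a) + log_lik(b, m_b, v_b)
    return same - distinct
```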
GB8619696A 1985-08-20 1986-08-13 Apparatus and methods for analysing data arising from conditions which can be represented by finite state machines Expired GB2179483B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB8619696A GB2179483B (en) 1985-08-20 1986-08-13 Apparatus and methods for analysing data arising from conditions which can be represented by finite state machines

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
GB858520777A GB8520777D0 (en) 1985-08-20 1985-08-20 Speech recognition
GB858527898A GB8527898D0 (en) 1985-11-12 1985-11-12 Speech recognition
GB8619696A GB2179483B (en) 1985-08-20 1986-08-13 Apparatus and methods for analysing data arising from conditions which can be represented by finite state machines

Publications (3)

Publication Number Publication Date
GB8619696D0 GB8619696D0 (en) 1986-09-24
GB2179483A true GB2179483A (en) 1987-03-04
GB2179483B GB2179483B (en) 1989-08-02

Family

ID=27262767

Family Applications (1)

Application Number Title Priority Date Filing Date
GB8619696A Expired GB2179483B (en) 1985-08-20 1986-08-13 Apparatus and methods for analysing data arising from conditions which can be represented by finite state machines

Country Status (1)

Country Link
GB (1) GB2179483B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB1554884A (en) * 1975-10-24 1979-10-31 Ibm Character recognition apparatus
GB1603928A (en) * 1978-04-27 1981-12-02 Dialog Syst Continuous speech recognition method
GB2085628A (en) * 1980-09-19 1982-04-28 Hitachi Ltd A pattern recognition method
US4486899A (en) * 1981-03-17 1984-12-04 Nippon Electric Co., Ltd. System for extraction of pole parameter values
US4587670A (en) * 1982-10-15 1986-05-06 At&T Bell Laboratories Hidden Markov model speech recognition arrangement
EP0125648A1 (en) * 1983-05-16 1984-11-21 Kabushiki Kaisha Toshiba Speech recognition apparatus

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4829575A (en) * 1985-11-12 1989-05-09 National Research Development Corporation Apparatus and methods for analyzing transitions in finite state machines
GB2209419A (en) * 1985-11-12 1989-05-10 Nat Res Dev Analysing transitions in finite state machines
GB2209418A (en) * 1985-11-12 1989-05-10 Nat Res Dev Analysing transitions in finite state machines
GB2183072B (en) * 1985-11-12 1989-09-27 Nat Res Dev Apparatus and methods for analysing transitions in finite state machines
GB2209418B (en) * 1985-11-12 1989-10-11 Nat Res Dev Apparatus amd methods for analysing transitions in finite state machines
GB2209419B (en) * 1985-11-12 1989-10-11 Nat Res Dev Apparatus and methods for analysing transitions in finite state machines
WO1992017739A1 (en) * 1991-04-05 1992-10-15 Vermont Castings, Inc. Gas log fireplace with high heat output
GB2305288A (en) * 1995-09-15 1997-04-02 Ibm Speech recognition system
GB2347253B (en) * 1999-02-23 2001-03-07 Motorola Inc Method of selectively assigning a penalty to a probability associated with a voice recognition system
US6233557B1 (en) 1999-02-23 2001-05-15 Motorola, Inc. Method of selectively assigning a penalty to a probability associated with a voice recognition system

Also Published As

Publication number Publication date
GB8619696D0 (en) 1986-09-24
GB2179483B (en) 1989-08-02

Similar Documents

Publication Publication Date Title
US4829575A (en) Apparatus and methods for analyzing transitions in finite state machines
US6523005B2 (en) Method and configuration for determining a descriptive feature of a speech signal
Gish et al. Parametric trajectory models for speech recognition
EP0380297B1 (en) Method and apparatus for speech recognition
US4748670A (en) Apparatus and method for determining a likely word sequence from labels generated by an acoustic processor
US5825978A (en) Method and apparatus for speech recognition using optimized partial mixture tying of HMM state functions
US6735588B2 (en) Information search method and apparatus using Inverse Hidden Markov Model
US4977599A (en) Speech recognition employing a set of Markov models that includes Markov models representing transitions to and from silence
EP0706171A1 (en) Speech recognition method and apparatus
Lleida et al. Utterance verification in continuous speech recognition: Decoding and training procedures
EP0285353A2 (en) Speech recognition system and technique
EP0282272B1 (en) Voice recognition system
EP0573553A4 (en) Method for recognizing speech using linguistically-motivated hidden markov models.
Chang et al. A Segment-based Speech Recognition System for Isolated Mandarin Syllables
JPH09127972A (en) Vocalization discrimination and verification for recognitionof linked numeral
Paliwal Lexicon-building methods for an acoustic sub-word based speech recognizer
EP0215573B1 (en) Apparatus and methods for speech recognition
GB2179483A (en) Speech recognition
EP0606913A2 (en) Method and system for recognizing pattern
JPH0555039B2 (en)
GB2209418A (en) Analysing transitions in finite state machines
Gopalakrishnan et al. Fast match techniques
JPS62111296A (en) Voice recognition method and apparatus
CN111276121B (en) Voice alignment method and device, electronic equipment and storage medium
Lyu et al. Isolated Mandarin base-syllable recognition based upon the segmental probability model

Legal Events

Date Code Title Description
7732 Case decided by the comptroller ** patent revoked (sect. 73(2)/1977)