GB2387008A - Signal Processing System - Google Patents


Info

Publication number
GB2387008A
GB2387008A (application GB0207343A)
Authority
GB
United Kingdom
Prior art keywords
gmm
data encoding
elements
vector
hmm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB0207343A
Other versions
GB0207343D0 (en)
Inventor
Christopher John St Cla Webber
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qinetiq Ltd
Original Assignee
Qinetiq Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qinetiq Ltd filed Critical Qinetiq Ltd
Priority to GB0207343A priority Critical patent/GB2387008A/en
Publication of GB0207343D0 publication Critical patent/GB0207343D0/en
Priority to US10/509,527 priority patent/US7664640B2/en
Priority to PCT/GB2003/001244 priority patent/WO2003083831A1/en
Priority to EP03712399A priority patent/EP1488411B1/en
Priority to AT03712399T priority patent/ATE343197T1/en
Priority to JP2003581170A priority patent/JP4264006B2/en
Priority to DE60309142T priority patent/DE60309142T2/en
Priority to AU2003217013A priority patent/AU2003217013A1/en
Publication of GB2387008A publication Critical patent/GB2387008A/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142 Hidden Markov Models [HMMs]
    • G10L15/144 Training of HMMs

Abstract

A signal processing system is disclosed which includes a Gaussian Mixture Model (GMM) based Hidden Markov Model (HMM), parameters of which are constrained during the optimisation procedure. Also disclosed is a constraint system applied to input vectors representing the input signal to the system. The invention is particularly, but not exclusively, related to speech recognition systems. The invention reduces the tendency, common in prior art systems, to get caught in local minima associated with highly anisotropic Gaussian components, which reduces recogniser performance, by employing the constraint system above, whereby the anisotropy of such components may be minimised. The invention also covers a method of processing a signal, and a speech recogniser trained according to the method.

Description

Signal Processing System

This invention relates to a system and method for processing signals to aid their classification and recognition. More specifically, the invention relates to a modified process for training and using both Gaussian Mixture Models and Hidden Markov Models to improve classification performance, particularly but not exclusively with regard to speech.
Gaussian Mixture Models (GMMs) and Hidden Markov Models (HMMs) are often used in signal classifiers to help identify an input signal when given a set of example inputs, known as training data. Uses of the technique include speech recognition, where the audio speech signal is digitised and input to the classifier, and the classifier attempts to generate from its vocabulary of words the set of words most likely to correspond to the input audio signal. Another application is in radar, where radar signal returns from a scene are processed to provide an estimate of the contents of the scene. Published International specification WO02/08783 demonstrates the use of Hidden Markov Model processing of radar signals.
Before a GMM or HMM can be used to classify a signal, it must be trained with an appropriate set of training data to initialise parameters within the model to provide most efficient performance. There are thus two distinct stages associated with practical use of these models, the training stage and the classification stage. With both of these stages, data is presented to the classifier in a similar manner. When applied to speech recognition, a set of vectors representing the speech signal is typically generated in the following manner. The incoming audio signal is digitised and divided into 10ms segments. The frequency spectrum of each segment is then taken, with windowing functions being employed if necessary to compensate for truncation effects, to produce a spectral vector. Each element of the spectral vector typically measures the logarithm of the integrated power within a different frequency band. The audible frequency range is typically spanned by around 25 such contiguous bands, but one element of the spectral vector is conventionally reserved to measure the logarithm of the integrated power across all frequency bands, i.e. the logarithm of the overall loudness of the sound. Thus, each spectral vector conventionally has around 25+1=26 elements; in other words, the vector space is conventionally 26-dimensional. These spectral vectors are time-ordered and constitute the input to the HMM or GMM, as a spectrogram representation of the audio signal.
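The conventional encoding just described can be sketched as follows; the Hann window, equal-width frequency bands and 16 kHz sample rate are illustrative assumptions for this sketch, not details taken from the specification.

```python
import numpy as np

def log_power_spectral_vector(frame, n_bands=25):
    """Conventional log-power encoding of one 10 ms frame of audio.

    Illustrative sketch: real systems choose band edges perceptually
    rather than splitting the spectrum into equal-width bands.
    """
    windowed = frame * np.hanning(len(frame))        # soften truncation effects
    power = np.abs(np.fft.rfft(windowed)) ** 2       # power spectrum of the frame
    eps = 1e-12                                      # guard against log(0)
    # Integrate power within n_bands contiguous frequency bands.
    band_power = np.array([b.sum() for b in np.array_split(power, n_bands)])
    vec = np.log(band_power + eps)
    # One further element conventionally carries the log of the overall power.
    return np.append(vec, np.log(power.sum() + eps)) # 25 + 1 = 26 elements

frame = np.random.default_rng(0).standard_normal(160)  # 10 ms at 16 kHz
assert log_power_spectral_vector(frame).shape == (26,)
```

Time-ordering such vectors, one per 10ms frame, yields the spectrogram representation that forms the model's input.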
Training both the GMM and HMM involves establishing an optimised set of parameters associated with the processes using training data, such that optimal classification occurs when the model is subjected to unseen data.
A GMM is a model of the probability density function (PDF) of its input vectors (e.g. spectral vectors) in their vector space, parameterised as a weighted sum of Gaussian components, or classes. Available parameters for optimization are the means and covariance matrices for each class, and the prior class probabilities. The prior class probabilities are the weights of the weighted sum of the classes. These adaptive parameters are typically optimised for a set of training data by an adaptive, iterative, re-estimation procedure such as the Expectation Maximisation (EM) or log-likelihood gradient ascent algorithms, which are well known procedures for finding a set of values for all the adaptive parameters that maximises the training-set average of the logarithm of the model's likelihood function (log-likelihood). These iterative procedures refine the values of the adaptive parameters from one iteration to the next, starting from initial estimates, which may just be random numbers lying in sensible ranges. Once the adaptive parameters of a GMM have been optimised, those trained parameters may subsequently be used for identifying the most likely of the set of alternative models for any observed spectral vector, i.e. for classification of the spectral vector. The classification step involves the conventional procedure for computing the likelihood that each component of the GMM could have given rise to the observed spectral vector.
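A minimal sketch of the likelihood computation that underlies both EM training and the classification step might look like this; isotropic (scalar) variances are assumed purely for brevity, and all names are illustrative rather than taken from the specification.

```python
import numpy as np

def gmm_log_likelihood(x, means, variances, priors):
    """Log-likelihood of one input vector under a GMM whose classes have
    isotropic covariances (variance * identity). Illustrative sketch only;
    GMMs in general also admit full covariance matrices."""
    n = x.size
    per_class = []
    for mu, var, p in zip(means, variances, priors):
        # log of an n-dimensional isotropic Gaussian, weighted by its prior
        log_gauss = -0.5 * (n * np.log(2.0 * np.pi * var)
                            + np.sum((x - mu) ** 2) / var)
        per_class.append(np.log(p) + log_gauss)
    per_class = np.array(per_class)
    m = per_class.max()                      # log-sum-exp for numerical stability
    return m + np.log(np.exp(per_class - m).sum())

# One standard-normal component centred at the origin:
x = np.zeros(2)
ll = gmm_log_likelihood(x, [np.zeros(2)], [1.0], [1.0])
assert abs(ll + np.log(2.0 * np.pi)) < 1e-9
```

EM re-estimation maximises the training-set average of this quantity; classification picks the model (or component) with the highest likelihood for an observed vector.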
Whereas a GMM is a model of the PDF of individual input vectors irrespective of their mutual temporal correlations, a HMM is a model of the PDF of time-ordered sequences of input vectors. The adaptive parameters of an ordinary HMM are the observation probabilities (the PDF of input vectors given each possible hidden state of the Markov chain) and the transition probabilities (the set of probabilities that the Markov chain will make a transition between each pair-wise combination of possible hidden states).
For the case of an ordinary GMM-based HMM, the observation probabilities are parameterised as a weighted sum of Gaussian components ('classes'), i.e. the observation probabilities are parameterised as GMMs. Thus, a prescription for optimising the HMM's observation probabilities can be recast as a prescription for optimising the associated GMM's class means, covariance matrices and prior class probabilities.
Training, or optimization, of the adaptive parameters of a HMM is done so as to maximise the overall likelihood function of the model of the input signal, such as a speech sequence. One common way of doing this is to use the Baum-Welch re-estimation algorithm, which is a development of the technique of expectation maximization of the model's log-likelihood function, extended to allow for the probabilistic dependence of the hidden states on their earlier values in the speech sequence. A HMM is first initialised with initial, possibly random, assumptions for the values of the transition and observation probabilities. For each one of a set of sequences of input training vectors, such as speech-sequences, the Baum-Welch forward-backward algorithm is applied, to deduce the probability that the HMM could have given rise to the observed sequence. On the basis of all these per-sequence model likelihoods, the Baum-Welch re-estimation formula updates the model's assumed values for the transition probabilities and the observation probabilities (i.e. the GMM class means, covariance matrices and prior class probabilities), so as to maximise the increase in the model's average log-likelihood. This process is iterated, using the Baum-Welch forward-backward algorithm to deduce revised model likelihoods for each training speech-sequence and, on the basis of these, using the Baum-Welch re-estimation formula to provide further updates to the adaptive parameters.
Each iteration of the conventional Baum-Welch re-estimation procedure can be broken down into five steps for every GMM-based HMM: (a) applying the Baum-Welch forward-backward algorithm on every training speech-sequence, (b) the determination of what the updated values of the GMM class means should be for the next iteration, (c) the determination of what the updated values of the GMM class covariance matrices should be for the next iteration, (d) the determination of what the updated values of the GMM prior class probabilities should be for the next iteration, and (e) the determination of what the updated values of the HMM transition probabilities should be for the next iteration. Thus, the Baum-Welch re-estimation procedure for optimising a GMM-based HMM can be thought of as a generalization of the EM algorithm for optimising a GMM, but with the updated transition probabilities as an extra, fourth output.
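The forward pass at the heart of step (a) can be sketched as follows, assuming the per-frame log observation likelihoods have already been computed (in a GMM-based HMM, by each state's GMM); the backward pass and the re-estimation formulae of steps (b) to (e) are omitted, and all names are illustrative.

```python
import numpy as np

def forward_log_likelihood(obs_loglik, trans, init):
    """Forward pass of the forward-backward algorithm: log-likelihood that
    the HMM generated the observed sequence.

    obs_loglik[t, s] is the log observation likelihood of frame t under
    hidden state s; trans[s, s2] and init[s] are the transition and initial
    state probabilities. Illustrative sketch only.
    """
    T, _ = obs_loglik.shape
    log_alpha = np.log(init) + obs_loglik[0]
    for t in range(1, T):
        # Recursion: alpha_t(s2) = sum_s alpha_{t-1}(s) * trans[s, s2] * obs
        m = log_alpha.max()                  # log-sum-exp for stability
        log_alpha = m + np.log(np.exp(log_alpha - m) @ trans) + obs_loglik[t]
    m = log_alpha.max()
    return m + np.log(np.exp(log_alpha - m).sum())

# Sanity check: uniform 2-state model, observation likelihood 1 everywhere,
# so the total sequence likelihood is exactly 1 (log-likelihood 0).
trans = np.full((2, 2), 0.5)
init = np.array([0.5, 0.5])
assert abs(forward_log_likelihood(np.zeros((4, 2)), trans, init)) < 1e-9
```

In the full Baum-Welch procedure, the quantities accumulated during this pass (together with the backward pass) drive the re-estimation of the GMM parameters and transition probabilities.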
For certain applications, HMMs are employed that do not have their observation probabilities parameterised as GMMs, but instead use lower level HMMs. Thus, a hierarchy is formed that comprises at the top a "high level" HMM, and at the bottom a GMM, with each layer having its observation probabilities defined by the next stage down. This technique is common in subword-unit based speech recognition systems, where the structure comprises two nested levels of HMM, with the lowest one having GMM based observation probabilities.
The procedure for optimising the observation probabilities of a high-level HMM reduces to the conventional procedure for optimising both the transition probabilities and the observation probabilities (i.e. the GMM parameters) of the ordinary HMMs at the lower level, which is as described above. The procedure for optimising the high-level HMM's transition probabilities is the same as the conventional procedure for optimising ordinary HMMs' transition probabilities, which is as described above.
HMMs can be stacked into multiple-level hierarchies in this way. The procedure for optimising the observation probabilities at any level reduces to
the conventional procedure for optimising the transition probabilities at all lower levels combined with the conventional procedure for optimising the GMM parameters at the lowest level. The procedure for optimising the transition probabilities at any level is the same as the conventional procedure for optimising ordinary HMMs' transition probabilities. Thus, the procedure for optimising hierarchical HMMs can be described in terms of recursive application of the conventional procedures for optimising the transition and observation probabilities of ordinary HMMs.
Once the HMM's adaptive parameters have been optimised, the trained HMM may subsequently be used for identifying the most likely of a set of alternative models of an observed sequence of input vectors - spectral vectors in the case of speech classification. This process is conventionally achieved using the Baum-Welch forward-backward algorithm, which computes the likelihood of generating the observed sequence of input vectors from each of a set of alternative HMMs with different optimised transition and observation probabilities.

The classification methods described above have certain disadvantages.
When optimising the observation probabilities of the GMMs, and hence of the HMMs that may be hierarchically above them, as well as the transition probabilities of the HMM, there is a tendency for the optimization to get caught in local minima, which prevents the system from achieving optimal classification. This can often be attributed to a tendency for class likelihood PDFs to become "tangled up" with one another if they are free to become too highly anisotropic. Also, regarding speech recogniser technology, current recognisers are poor at capturing subtle variations and intrinsic characteristics of real speech, such as the full, specific variability of speakers' vowels under very different speaking conditions. In particular, individual vowels occupy complex shapes in spectral vector space, and attempting to represent these shapes as Gaussian distributions, as is conventionally done, can lead to unfaithful representation of the speech sounds.
According to the present invention there is provided a signal processing system for processing a plurality of multi-element data encoding vectors, the system:
- having means for deriving the data encoding vectors from input signals;
- being arranged to process the data encoding vectors using a Gaussian Mixture Model (GMM) based Hidden Markov Model (HMM), the GMM based HMM having at least one class mean vector having multiple elements;
- being arranged to process the elements of the class mean vector(s) by an iterative optimization procedure;
characterized in that the system is also arranged to scale the elements of the class mean vector(s) during the optimization procedure to provide for the class mean vector(s) to have constant modulus at each iteration, and to normalise the data encoding vectors input to the GMM based HMM.
A GMM-based HMM is a generalization of a GMM such that the HMM has observation probabilities parameterised as Gaussian PDFs or weighted sums of Gaussian PDFs, i.e. as a GMM. The observation probabilities of a GMM based HMM are parameterised as a GMM, but the GMM-based HMM is not itself a GMM. An input stage can be added to a GMM based HMM however, where this input stage comprises a simple GMM. The log-likelihood of a GMM-based HMM is the log-likelihood of an HMM whose observation probabilities are constrained to be parameterised as GMMs; it is not the log-likelihood of a GMM. Consequently, the optimization procedure of a GMM-based HMM is not the same as that of a GMM.
Preferably the moduli of the mean vectors of each of the GMMs are rescaled after each iteration of the optimization procedure so that they are all of equal value. Most signal processing systems of the type discussed in this specification incorporate a GMM that represents the probability density function of all data encoding vectors in the training sequence. The constraint of limiting the elements of the class mean vector to have constant modulus leads to simplified processing of the GMMs making up the signal processing system, as the class means of each GMM will lie on the surface of a hypersphere having dimensionality (n - 1), where n is the dimension of an individual vector.
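The constant-modulus rescaling described above might be sketched as follows; the function name and the unit target modulus are illustrative assumptions.

```python
import numpy as np

def rescale_class_means(means, target_modulus=1.0):
    """After each update of the class means, rescale every mean vector so
    its modulus equals target_modulus, placing all class means on the
    surface of the same hypersphere. Illustrative sketch."""
    means = np.asarray(means, dtype=float)
    norms = np.linalg.norm(means, axis=1, keepdims=True)
    return target_modulus * means / norms

means = rescale_class_means([[3.0, 4.0], [0.0, 2.0]])
assert np.allclose(np.linalg.norm(means, axis=1), 1.0)
```

Applied once per iteration of the re-estimation procedure, this keeps every class mean at the same modulus while leaving its direction (the part that carries the spectral-shape information) to be optimised freely.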
Preferably each covariance matrix is constrained so as to be isotropic and diagonal, and to have a variance constrained to be a constant value. This eliminates the possibility of certain classes of severe local minima associated with highly anisotropic Gaussian components, and so prevents such sub-optimal configurations from forming during the training process. Note that a covariance matrix that is so constrained may be regarded mathematically as a scalar value, and hence a scalar value may be used to represent such a covariance matrix.
Each GMM, and therefore GMM based HMM, has a set of prior class probabilities. Preferably the prior class probabilities associated with the GMM are constrained to be equal, and to remain constant throughout the optimization procedure.
Prior art signal processing systems incorporating GMMs generally avoid putting constraints on the model parameters; other than that covariance matrices are on occasion constrained to be equal across classes, requirements are rarely imposed on the class means, covariance matrices, prior class probabilities and hidden-state transition probabilities other than that their values are chosen to make the average log-likelihood as large as possible.
Preferably, each data encoding vector that is also an input vector, derived from the input signal during both training and classifying stages of using the GMM, is constrained such that its elements xᵢ are proportional to the square roots of the integrated power within different frequency bands.
Advantageously, the elements of each such data encoding vector are scaled such that the squares of the elements of the vector sum to a constant value that is independent of the total power of the original input.
Preferably each such data encoding vector is augmented with the addition of one or more elements representing the overall power in the vector. The scaling of the vector elements described above removes any indication of the power, so the additional element(s) provide the only indication of the power, or loudness, within the vector. Clearly, the computation of the value of the elements representing power would need to be based on pre-scaled elements of the vector.
Certain applications, notably subword-unit based models, advantageously employ a HMM that uses as its observation probability a GMM constrained according to the current invention, and that itself acts as the observation probability for a further HMM. In this way, a hierarchy of HMMs can be built up, in the manner of the prior art, but with the difference that the constraints on the model parameters according to the current invention are applied at each level of the hierarchy.
Advantageously, the hierarchy may incorporate two GMMs as two lower levels, with a HMM at the highest level. The lowest level GMM provides posterior probabilities as a data encoding vector to a second, higher level GMM. This second GMM provides observation probabilities to a HMM at the third level. This arrangement allows individual speech-sounds to be represented in the spectral-vector space not as individual Gaussian ellipsoids, as is conventional, but as assemblies of many smaller Gaussian hypercircles tiling the unit hypersphere, offering the potential for more faithful representation of highly complex-shaped speech-sounds, and thus improved classification performance.
Note that in this specification the terms "input vector" and "spectral vector" are used interchangeably in the context of providing an input to the lowest level of the system hierarchy. The vector at this level may represent the actual power spectrum of the input signal, and hence be spectral coefficients, or may represent some modified form of the power spectrum. In practice, the input vector will generally represent a power spectrum of a segment of a temporal input signal, but this will not be the case for all applications. Further processing of the temporal input signal is used in some applications, e.g. a cosine transform. A "data encoding vector" is, within this specification, any vector that is used as an input to any level of the hierarchy, depending on the context, i.e. any vector that is used as the direct input to the particular level of the hierarchy being discussed in that context. A data encoding vector is thus an input vector only when it represents the information entering the system at the lowest level of the hierarchy.
Note also that normalising a vector is the process of rescaling all its elements by the same factor, in order to achieve some criterion defined on the whole vector of elements. What that factor is depends on the criterion chosen for normalization. A vector can generally be normalised by one of two useful criteria; one is to normalise such that the elements sum to a constant after normalization, the other is to normalise such that the squares of the elements sum to a constant after normalization. By the first criterion, the rescaling factor should be proportional to the reciprocal of the sum of the values of the elements before normalization. By the second criterion, the rescaling factor should be proportional to the reciprocal of the square root of the sum of the squares of the values of the elements before normalization. A vector of exclusive probabilities is an example of a vector normalised by the first criterion, such that the sum of those probabilities is 1. A (real-valued) unit vector is an example of a vector normalised according to the second criterion; the sum of the squares of the elements of a (real-valued) unit vector is 1. A vector whose elements comprise the square roots of a set of exclusive probabilities is also an example of a vector normalised by the second criterion.
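The two normalization criteria can be sketched as follows; the function names are illustrative.

```python
import numpy as np

def normalise_sum(v, c=1.0):
    """First criterion: rescale so the elements sum to c (e.g. a vector of
    exclusive probabilities when c = 1)."""
    v = np.asarray(v, dtype=float)
    return c * v / v.sum()

def normalise_sum_of_squares(v, c=1.0):
    """Second criterion: rescale so the squares of the elements sum to c
    (a real-valued unit vector when c = 1)."""
    v = np.asarray(v, dtype=float)
    return np.sqrt(c) * v / np.sqrt((v ** 2).sum())

v = np.array([1.0, 2.0, 2.0])
assert abs(normalise_sum(v).sum() - 1.0) < 1e-9
assert abs((normalise_sum_of_squares(v) ** 2).sum() - 1.0) < 1e-9
```

Taking the element-wise square root of the output of the first function yields a vector normalised by the second criterion, as in the square-root-of-probabilities example above.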
According to another aspect of the current invention there is provided a method of processing a signal, the signal comprising a plurality of multi-element data encoding vectors, wherein the data encoding vectors are derived from an analogue or digital input, and where the method employs at least one Gaussian Mixture Model (GMM) based Hidden Markov Model (HMM), the GMM based HMM having at least one class mean vector having multiple elements, and the elements of the class mean vector(s) are optimised in an iterative procedure, characterized in that the elements of the class mean vectors are scaled during the optimization procedure such that the class mean vectors have a constant modulus at each iteration, and the data encoding vectors input to the GMM based HMM are processed such that they are normalised.

Note that the user(s) of a system trained according to the method of the current invention may be different to the user(s) who performed the training.
This is due to the distinction between the training and the classification modes of the invention.

According to another aspect of the current invention there is provided a computer program designed to run on a computer and arranged to implement a signal processing system for processing one or more multi-element input vectors, the system:
- having means for deriving the data encoding vectors from input signals;
- being arranged to process the data encoding vectors using a Gaussian Mixture Model (GMM) based Hidden Markov Model (HMM), the GMM based HMM having at least one class mean vector having multiple elements;
- being arranged to process the elements of the class mean vector(s) by an iterative optimization procedure;
characterized in that the system is also arranged to scale the elements of the class mean vector(s) during the optimization procedure to provide for the class mean vector(s) to have constant modulus at each iteration, and to normalise the data encoding vectors input to the GMM based HMM.
The present invention can be implemented on a conventional computer system. A computer can be programmed so as to implement a signal processing system according to the current invention to run on the computer hardware.
According to another aspect of the current invention there is provided a speech recogniser incorporating a signal processing system for processing one or more multi-element input vectors, the recogniser:
- having means for deriving the data encoding vectors from input signals;
- being arranged to process the data encoding vectors using a Gaussian Mixture Model (GMM) based Hidden Markov Model (HMM), the GMM based HMM having at least one class mean vector having multiple elements;
- being arranged to process the elements of the class mean vector(s) by an iterative optimization procedure;
characterized in that the system is also arranged to scale the elements of the class mean vector(s) during the optimization procedure to provide for the class mean vector(s) to have constant modulus at each iteration, and to normalise the data encoding vectors input to the GMM based HMM.
A speech recogniser may advantageously incorporate a signal processing system as described herein, and may incorporate a method of signal processing as described herein.
The current invention will now be described in more detail, by way of example only, with reference to the accompanying Figures, of which:

Figure 1 diagrammatically illustrates a typical hardware arrangement suitable for use with the current invention when implemented in a speech recogniser;
Figure 2 shows in block diagrammatic form the conventional re-estimation procedure adopted by the prior art systems employing GMM or HMM based classifiers;

Figure 3 shows in block diagrammatic form one of the pre-processing stages carried out on input vectors based on frames of speech, relating to the frame's spectral shape;

Figure 4 shows in block diagrammatic form a further pre-processing stage carried out on the input vectors relating to the overall loudness of a frame of speech;

Figure 5 shows in block diagrammatic form the modified re-estimation procedure of GMMs, or ordinary or hierarchical HMMs, as per the current invention;

Figure 6 shows in more detail the class mean re-scaling constraint shown in Figure 5;

Figure 7 shows in block diagrammatic form the implementation of a complete system; and

Figure 8 shows graphically one advantage of the current invention using the example of a simplified three-dimensional input vector space.
The current invention would typically be implemented on a computer system having some sort of analogue input, an analogue to digital converter, and digital processing means. The digital processing means would comprise a digital store and a processor. As shown in Figure 1, a speech recogniser embodiment typically has a microphone 1 acting as a transducer from the speech itself, the electrical output of which is fed to an analogue to digital converter (ADC) 2. There may also be some analogue processing before the ADC (not shown). The ADC feeds its output to a circuit 3 that divides the digital signal into 10ms slices, and carries out a spectral analysis on each slice, to produce a spectral vector. These spectral vectors are then fed into the signal processor 4, in which is implemented the current invention. The signal processor 4 will have associated with it a digital storage 5. Some applications may have as an input a signal that has been digitised at some remote point, and so would not have the ADC. Other hardware configurations are also possible within the scope of the current invention.
A typical signal processing system of the current invention will comprise a simple GMM and a GMM-based HMM, together used to classify an input signal. Before either of those models can be used for classification purposes, they must first be optimised, or trained, using a set of training data. There are thus two distinct modes of operation of a classification model: the training phase, and the classification phase.
Figure 2 shows generically the steps used by prior art systems in training both a GMM and a HMM based classifier. Figure 2 depicts the optimization of hierarchical GMM-based HMMs as well as the optimization of ordinary GMM-based HMMs and simple GMMs, because the steps relating to initializing and re-estimating HMM transition probabilities relate to the initialization and re-estimation of HMM transition probabilities at all levels of the hierarchy.
The flow chart is entered from the top when it is required to establish an improved set of parameters in the model to improve the classification performance. First the various classes need to be initialized, these being the class means, class covariance matrices and prior class probabilities. HMMs have the additional step of initializing the transition probabilities. These initialization values may be random, or they may be a "best guess" resulting either from some previous estimation procedure or from some other method.
These initializations form the adaptive parameters for the first iteration of the training procedure, which proceeds as follows. A data encoding vector or vector sequence (for the HMM case) from the training sequence is obtained, and processed using a known re-estimation procedure. For GMMs the EM algorithm is often used, and for HMMs the Baum-Welch re-estimation procedure is commonplace. This is the inner loop of the re-estimation procedure, and is carried out for all data encoding vectors in the training sequence. Following this, the information gained during the inner loop processing is used to compute the new classes and, for the HMM case, the new transition probabilities. Convergence of this new data is tested by comparing it with the previous set or by judging whether the likelihood function has achieved a stable maximum, and the process re-iterated if necessary using the newly computed data as a starting point.
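The outer loop of this generic re-estimation flow might be sketched as follows; update_step stands in for the inner loop (the EM step for a GMM, the Baum-Welch re-estimation step for an HMM), and every name here is illustrative.

```python
import numpy as np

def train(initial_params, update_step, training_data, tol=1e-6, max_iters=100):
    """Outer loop of the generic re-estimation procedure of Figure 2.

    update_step re-estimates all adaptive parameters from the full training
    set and returns the new parameters together with the new average
    log-likelihood; iteration stops once that likelihood is stable.
    Illustrative sketch only.
    """
    params, log_lik = initial_params, -np.inf
    for _ in range(max_iters):
        params, new_log_lik = update_step(params, training_data)
        if new_log_lik - log_lik < tol:      # likelihood has stabilised
            break
        log_lik = new_log_lik
    return params

# Toy update_step whose log-likelihood converges as the parameter shrinks:
halve = lambda p, data: (p / 2.0, -abs(p / 2.0))
assert abs(train(1.0, halve, None)) < 1e-3
```

In the real procedure the parameters are the class means, covariance matrices, prior class probabilities and (for HMMs) transition probabilities, and the convergence test may equally compare successive parameter sets directly.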
Moving to the current invention, one embodiment applied to speech recognition employs a modified spectral vector that is pre-processed in a manner that is different from the conventional log-power representation of the prior art. The spectral vector itself comprises a spectral representation of a 10ms slice of speech, divided up into typically 25 frequency bins.
The objective of the first stage of the pre-processing is that the elements xᵢ (i = 1, ..., m) of the n-dimensional (m < n) spectral vector x should be proportional to the square roots of the integrated power Pᵢ within different frequency bands, rather than the conventional logarithms of integrated power within different frequency bands. Further, the elements xᵢ (i = 1, ..., m) should be scaled such that their squares sum to a constant value that is independent of the total power integrated across all frequency bands within the frame corresponding to that spectral vector. Thus, if the frame is sampled into m frequency bands, m of the elements xᵢ of the n-dimensional (m < n) spectral vector x should satisfy

xᵢ = A √Pᵢ / √(Σⱼ Pⱼ)   (i = 1, ..., m)   (Equation 1)

which implies Σᵢ xᵢ² = A². The value of the constant A has no functional significance; all that matters is that it doesn't change from one spectral vector to the next.
25 The advantage of this normalized square root power representation for spectral vectors is that the degree of match of the shape of spectral vector x; (i=1,...,m), compared with a class mean vector w; (i=1,...,n), is then proportional to the scalar product xjw;, irrespective of the modulus (vector length) of the template. This provides the freedom to constrain the 30 modulus of the template without losing the functionality of being able to
- 15 determine the degree of match of the template by computing the scalar product. The steps involved in the novel encoding of spectral vectors are represented 5 in the flow diagram of Figure 3 and listed as follows (a-e). After (a) choosing a value for the constant A to be used for all frames of speech, (b) the first step to be applied for each individual frame of speech is the same as the conventional process for conducting a spectral analyisis in order to obtain m values of the integrated power Pj (i=1,..,m) within m different frequency 10 bands spanning the audible frequency range. Then, instead of taking the logarithms of these power-values as is conventional in the prior art, (c) their
sum Σ_{j=1}^m P_j and (d) their square roots √P_i (i = 1, ..., m) are computed. (e) Each square-root value is then divided by the square root of the total power, √(Σ_{j=1}^m P_j), (and multiplied by a constant scaling factor A as desired) to obtain the elements x_i (i = 1, ..., m) of the novel encoding of the spectral vector defined by Equation 1.
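Steps (c)-(e) above amount to a few lines of array arithmetic. A minimal sketch (the function name is hypothetical, not from the patent):

```python
import numpy as np

def encode_spectral_shape(P, A=1.0):
    """Normalised square-root power encoding of Equation 1:
    x_i = A * sqrt(P_i) / sqrt(sum_j P_j), so that sum_i x_i**2 = A**2
    regardless of the frame's total power."""
    P = np.asarray(P, dtype=float)
    total = P.sum()                          # step (c): sum of band powers
    return A * np.sqrt(P) / np.sqrt(total)   # steps (d) and (e)

# The encoding depends only on spectral shape, not on loudness:
x = encode_spectral_shape([4.0, 9.0, 3.0])
y = encode_spectral_shape([40.0, 90.0, 30.0])   # same shape, 10x louder
```

Both calls return the same vector, and in each case the squares of the elements sum to A² = 1, illustrating the loudness-invariance claimed above.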
As a second part of the pre-processing of the spectral vectors, the vector is also augmented with extra elements that represent the overall loudness of the speech at that frame, i.e. the total power Σ_j P_j integrated across all frequency bands.
This is particularly useful in conjunction with the novel way of encoding spectral shape defined by Equation 1. This is because the elements x_i (i = 1, ..., m) defined by Equation 1 are clearly independent of the overall loudness Σ_j P_j and therefore encode no information about it, so those m elements need to be augmented with additional information if the spectral vector is to convey loudness information.
In the current embodiment, two extra elements x_{m+1} and x_{m+2} are added to the spectral vector, beyond the m elements used to encode the spectral shape.
Thus the spectral vector will have n = m + 2 dimensions. These two elements depend on the overall loudness L = Σ_{j=1}^m P_j in the following way:
x_{m+1} = B f(L) / √([f(L)]² + [g(L)]²),   x_{m+2} = B g(L) / √([f(L)]² + [g(L)]²)   (Equation 2)

where f() and g() are two (different) functions of the overall loudness L, and B is a constant. The significance of B is that the ratio B/A determines the relative contributions to the squared modulus |x|² = x·x = Σ_i x_i² made by the two subsets of elements (i = m+1, m+2) and (i = 1, ..., m); the values of these contributions are clearly B² and A² respectively. The ratio B/A may therefore be used to control the relative importance assigned to overall loudness and spectral shape in the coding of spectral vectors; for example, choosing B = 0 assigns no importance to overall loudness, while choosing similar values of A and B assigns similar importance to both aspects of the speech. The value of A² + B² can be chosen to be 1 for simplicity, which will make the squared modulus |x|² = x·x = Σ_i x_i² = A² + B² equal to 1 for all spectral vectors regardless of their speech content.
The advantages of this novel representation of loudness are (a) that the moduli of all spectral vectors will have the same constant value regardless of overall loudness, which frees one to constrain the moduli of the templates (class means) w = (w_1, ..., w_n), as is proposed in the main claims, and (b) that the ratio B/A may be used to control the relative importance assigned to overall loudness and spectral shape in the coding of spectral vectors.
Possible choices for the functions f() and g() include

f(L) = sin( (π/2) (log L − log L_min) / (log L_max − log L_min) )
g(L) = cos( (π/2) (log L − log L_min) / (log L_max − log L_min) )   (Equation 3)

where L_min and L_max are constants chosen to correspond to the quietest and loudest volumes (total integrated power) typically encountered in individual frames of speech.
Useful values for the pair of constants (A, B) are (1, 0) and (1/√2, 1/√2), which both satisfy A² + B² = 1.
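A minimal sketch of the loudness elements of Equations 2 and 3 (the function name is hypothetical; with the sin/cos choice the normalising denominator of Equation 2 is identically 1, so it is retained only for generality):

```python
import numpy as np

def loudness_elements(L, L_min, L_max, B=1.0):
    """Compute the two extra elements x_{m+1}, x_{m+2} of Equations 2/3.
    Because f = sin(theta) and g = cos(theta) satisfy f**2 + g**2 = 1,
    the pair always satisfies x_{m+1}**2 + x_{m+2}**2 = B**2."""
    theta = (np.pi / 2) * (np.log(L) - np.log(L_min)) / (np.log(L_max) - np.log(L_min))
    f, g = np.sin(theta), np.cos(theta)
    norm = np.sqrt(f**2 + g**2)   # identically 1 here; kept for generality
    return B * f / norm, B * g / norm
```

At L = L_min the pair is (0, B), and at L = L_max it is (B, 0); intermediate loudnesses trace a quarter-circle of radius B between those points.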
Once the functions f() and g() and the constants B, L_min and L_max, to be used for all frames of speech, have been chosen, the steps involved in the process required to incorporate the loudness encoding as described above are shown
in Figure 4. The process involves (a) summing the integrated powers P_i within the m frequency ranges i = 1, ..., m for each frame of speech to obtain the overall loudness L for that frame of speech, (b) evaluating the two extra elements x_{m+1} and x_{m+2} for that frame of speech according to Equation 2, and (c) for that frame of speech, appending the two extra elements to the m elements obtained from the process of Figure 3 to obtain an n = m + 2 dimensional spectral vector incorporating the novel encodings of spectral shape and loudness.
The steps as shown in Figures 3 and 4 comprise the pre-processing of the spectral vectors according to the embodiment of the current invention.
The input vectors pre-processed as described above are used when optimising the various parameters of the GMMs and GMM-based HMMs. The inner loop of the optimisation procedure, as described in relation to Figure 1 above, is carried out using conventional methods such as EM re-estimation and Baum-Welch re-estimation, respectively. Further novel stages are concerned with applying constraints to the parameters between iterations of this inner loop. Figure 5 shows the re-estimation procedure of the current invention, with additional processes present as compared to that shown in Figure 2. These additional processes relate to the initialisation of the classes before the iterative part of the procedure starts, and to the re-scaling of the class means following each iteration to take into account the constraints to be imposed.
Note that for the HMM case the transition probability processing is unchanged from the prior art.
One of the constraints is concerned with the class mean vectors of the GMM or HMM. The constraint takes the form of re-scaling the set of n-dimensional vectors w_j = (w_j1, ..., w_jn) which represent the class means.
This constraint is applied to all the class means, as soon as they have been re-estimated, every time they are re-estimated (by the EM or Baum-Welch re-estimation procedures, for example), and also when they are first initialised (see Figure 5). These extra steps, illustrated in the flow diagram of Figure 5, are: (a) by summing the squares of its elements and then taking the square root of the sum, the modulus |w_j| of each of the N re-estimated class means w_j is first computed as

|w_j| = √( Σ_{i=1}^n w_ji² )   (Equation 4)

for all N classes j = 1, ..., N; (b) after computing the modulus |w_j| of each re-estimated class mean, all the elements of each class mean are divided by that corresponding modulus (and scaled by the constant D), i.e.

w_ji → D w_ji / |w_j|,  for all elements i = 1, ..., n of all GMM classes j = 1, ..., N   (Equation 5)

These steps have the effect of re-scaling all the class means w_j to constant modulus D until the next iteration of their re-estimation, after which they are re-scaled again to constant modulus D by applying these steps again, as depicted in Figure 5. The value of the constant D is preferably set equal to the modulus |x| of the data vectors x. (For example, for a GMM receiving input data having moduli |x| = √(A² + B²), the value of D should be set equal to √(A² + B²).)

The advantages of re-scaling the class means to constant modulus are that this encourages speech recognition algorithms to adopt novel encodings of speech data that may improve speech classification performance (such as hierarchical sparse coding), and that it may reduce the vulnerability of speech recognition algorithms to becoming trapped in undesirable sub-optimal configurations ('local minima') during training. These advantages result from the fact that the dynamics of learning have simplified degrees of freedom, because the class means are constrained to remain on a hypersphere (of radius D) as they adapt.
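The re-scaling of Equations 4 and 5 amounts to one line of array arithmetic per step. A minimal sketch (the function name is hypothetical):

```python
import numpy as np

def rescale_class_means(W, D=1.0):
    """Re-scale every class mean w_j (the rows of W, shape N x n) to
    constant modulus D: |w_j| is computed as in Equation 4, and each
    element is then divided by it and scaled by D, as in Equation 5."""
    moduli = np.sqrt((W**2).sum(axis=1, keepdims=True))   # Equation 4
    return D * W / moduli                                  # Equation 5

W = rescale_class_means(np.array([[3.0, 4.0], [0.0, 2.0]]), D=1.0)
```

After the call, every row of W lies on the unit hypersphere, e.g. [3, 4] becomes [0.6, 0.8].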
Re-scaling the class means w_j to constant modulus is particularly appropriate in conjunction with scaling the data vectors x to constant modulus. This is because
the degree of match between a data vector x and a class mean w_j can then be determined purely by computing the scalar product w_j · x.
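A small numerical check of this point (not from the patent itself): for unit-modulus x and w_j, the squared Euclidean distance reduces to |x − w_j|² = 2 − 2 w_j·x, so ranking classes by scalar product is equivalent to ranking them by distance on the hypersphere.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5)
x /= np.linalg.norm(x)                          # unit-modulus data vector
W = rng.normal(size=(3, 5))
W /= np.linalg.norm(W, axis=1, keepdims=True)   # unit-modulus class means

dots = W @ x                                    # scalar products w_j . x
dists = np.linalg.norm(W - x, axis=1) ** 2      # squared distances
# On the unit sphere: |x - w_j|^2 = |x|^2 + |w_j|^2 - 2 w_j.x = 2 - 2 w_j.x
```

The identity holds term by term, so the class with the largest scalar product is always the nearest class mean.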
Further to the embodiment of the current invention, the covariance matrices C_j of the Gaussian distributions that constitute the GMMs are constrained to be isotropic and of constrained variance V, i.e. they are not optimised according to the conventional re-estimation procedures for covariance matrices (such as the EM algorithm for GMMs and the Baum-Welch procedure for GMM-based HMMs), but are defined once and for all in terms of the isotropic identity matrix I and the constrained variance V by

C_j = V I  for all classes j = 1, ..., N   (Equation 6)

V is a free parameter chosen (for example by trial and error) to give the speech recognition system the best classification performance; V must be greater than zero, as a covariance matrix has non-negative eigenvalues, and V is preferably significantly smaller than the value of D². The benefit of setting V much smaller than D² is that it leads to a sparse distribution of the first-level simple GMM's posterior probabilities, which in the main embodiment feed the data encoding vector space of the GMM-based HMM at the second level.
This is because each Gaussian component of the first level simple GMM will 20 individually only span a small area on the spectral vector hypersphere.
This process for choosing covariance matrices involves the following steps: (a) choosing a value for the constant of proportionality V so as to optimise the classification performance, for example by trial and error, (b) setting all the diagonal elements of the class covariance matrices equal to V, and (c) setting all the off-diagonal elements of the class covariance matrices equal to zero.
Thus, the covariance matrix according to this embodiment of the present invention is both isotropic and diagonal.
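Steps (b) and (c) above can be sketched as follows (the function name is hypothetical):

```python
import numpy as np

def constrained_covariances(N, n, V):
    """Build the fixed class covariances of Equation 6: C_j = V * I for
    all classes j = 1, ..., N -- isotropic and diagonal, with diagonal
    elements set to V (step b) and off-diagonal elements set to zero
    (step c).  These matrices are never re-estimated."""
    assert V > 0, "a covariance matrix must have non-negative eigenvalues"
    return np.tile(V * np.eye(n), (N, 1, 1))

C = constrained_covariances(100, 12, 0.01)   # e.g. N = 100 classes, V = 0.01
```

The result is an array of N identical n-by-n matrices, each V on the diagonal and zero elsewhere.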
Used in conjunction with the above techniques for constraining the moduli of data vectors x and class means w_j, constraining the class covariances in this way gives the advantage of encouraging speech recognition algorithms to adopt novel encodings of speech data that may improve speech recognition
performance (such as hierarchical sparse coding), and reducing the vulnerability of speech recognition algorithms to becoming trapped in undesirable sub-optimal configurations ('local minima') during training. Sparse coding results from representing individual speech-sounds as assemblies of many small isotropic Gaussian hypercircles tiling the unit hypersphere in the spectral-vector space, offering the potential for more faithful representation of highly complex-shaped speech-sounds than is permitted by representation as a single anisotropic ellipsoid, and thus improved classification performance. Because this constraint does away with the need for the conventional unconstrained re-estimation of the covariance matrices, Figure 5's modified procedure for optimising GMMs does not involve re-estimation of covariance matrices as does the conventional procedure of Figure 2.
A further constraint imposed on this embodiment of the current invention relates to the choice of prior class probabilities. The N prior probabilities Pr(j) for the GMM classes j = 1, ..., N may be constrained to be constants, i.e. not optimised according to the conventional re-estimation procedures for prior class probabilities (such as the EM algorithm for GMMs and the Baum-Welch procedure for GMM-based HMMs), but are defined once and for all by the step of setting

Pr(j) = 1/N  for all classes j = 1, ..., N   (Equation 7)

Used in conjunction with the above innovations for constraining the moduli of data vectors x, class means w_j and the covariance matrices C_j, constraining the prior class probabilities in this way gives the advantage of reducing the vulnerability of speech recognition algorithms to becoming trapped in undesirable sub-optimal configurations ('local minima') during training.
Because this innovation does away with the need for the conventional unconstrained re-estimation of the prior class probabilities, Figure 5's modified procedure for optimising GMMs does not involve re-estimation of prior class probabilities as does the conventional procedure of Figure 2.
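Putting the three constraints together, one iteration of the modified GMM re-estimation of Figure 5 might look as follows. This is a simplified sketch under stated assumptions (hypothetical function name; the empty-class guard is an implementation detail not discussed in the patent), not the patent's implementation:

```python
import numpy as np

def constrained_em_iteration(X, W, V, D=1.0):
    """One inner-loop iteration with all three constraints: priors fixed
    at 1/N (Equation 7), covariances fixed at V*I (Equation 6), and only
    the class means re-estimated, then re-scaled to modulus D
    (Equations 4 and 5).  X is (num_frames, n); W is (N, n)."""
    # E-step: posteriors under isotropic, equal-prior Gaussians.  With
    # C_j = V*I and Pr(j) = 1/N, everything except exp(-|x - w_j|^2 / 2V)
    # cancels in the posterior.
    d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
    logp = -0.5 * d2 / V
    logp -= logp.max(axis=1, keepdims=True)        # numerical stability
    post = np.exp(logp)
    post /= post.sum(axis=1, keepdims=True)
    # M-step: re-estimate the means only (no covariance / prior update).
    mass = post.sum(axis=0)
    upd = mass > 1e-12                             # guard empty classes
    W = W.copy()
    W[upd] = (post.T @ X)[upd] / mass[upd][:, None]
    # Constraint: re-scale every class mean back to modulus D.
    return D * W / np.linalg.norm(W, axis=1, keepdims=True)
```

After every iteration the class means sit on the hypersphere of radius D, as Figure 5 requires, while the priors and covariances never change.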
It will be understood by people skilled in the relevant arts that the constraints applied to a GMM or HMM as described above in the training phase of the model will equally need to be applied during the classifying phase of use of the models. If they were employed during training, the steps for encoding spectral shape and overall loudness according to the present invention as described above will need to be applied to every spectral vector of any new speech to be classified.
An implementation of the invention, which combines all of the constraints detailed above, is illustrated in Figure 6. This implementation uses conventional spectral analysis of each frame of speech, followed by the novel steps described above to encode both spectral shape and overall loudness into each spectral vector and to scale every spectral vector's modulus to the constant value of 1. The parameters A and B are both set equal to 1/√2, and D is set equal to 1.
Such unit-modulus spectral vectors are input to a GMM having a hundred Gaussian classes (N = 100), with class means all constrained to have moduli equal to 1, with class prior probabilities all constrained to have constant and equal values of 1/100, and covariance matrices constrained to be isotropic and to have constant variances (i.e. not re-estimated at each iteration according to a procedure such as the EM algorithm). A good choice for that constant variance V has been found to be 0.01, although other values could be chosen by trial and error so as to give the best speech classification performance of the whole system; the right choice for V will lie between 0 and 1. For each spectral vector input to this GMM, posterior probabilities for the classes are computed in the conventional way.
Each set of GMM posterior probabilities computed above for each spectral vector is used to compute a unit-modulus data-encoding vector for input to an ordinary GMM-based HMM by taking the square roots of those posterior probabilities.
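Because posterior probabilities sum to 1 over the classes, taking their square roots automatically yields a unit-modulus vector. A minimal sketch (the function name is hypothetical):

```python
import numpy as np

def encode_posteriors(posteriors):
    """Square-root encoding of GMM posterior probabilities.  Since the
    posteriors sum to 1 over classes, the resulting data-encoding
    vector always has unit modulus."""
    return np.sqrt(np.asarray(posteriors, dtype=float))

v = encode_posteriors([0.5, 0.3, 0.2])
```

Here |v|² = 0.5 + 0.3 + 0.2 = 1, so v is a valid unit-modulus observation vector for the HMM.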
These unit-modulus data-encoding vectors are input to the HMM as observation vectors. The class means of the Gaussian mixture that constitutes
the parameterisation of the HMM's observation probabilities are all constrained to have moduli equal to 1. The number N of Gaussian classes used to parameterise the HMM's observation probabilities is chosen by trial and error so as to give the best speech classification performance of the whole system. The prior probabilities of those classes are then determined by that choice of N; they are all constrained and set equal to 1/N. The covariance matrices of those classes are all constrained to be isotropic and to have constant variances (i.e. not re-estimated unconstrained according to a procedure such as the EM algorithm). The choice of that constant variance V would be determined by trial and error so as to give the best speech classification performance of the whole system; the right choice for V will lie between 0 and 1.
The preferred implementation of the invention can be operated in training mode and classification mode. In classification mode, the HMM is used to classify the input observation vectors according to a conventional HMM classification method (the Baum-Welch forward-backward algorithm or the Viterbi algorithm), subject to the modifications described above.
In training mode, (a) the GMM is optimised on the training set of unit-modulus spectral vectors (encoded as described above) according to a conventional procedure for optimising GMM class means (e.g. the EM re-estimation algorithm), subject to the innovative modifications to re-scale the GMM class means to have constant moduli equal to 1, and to omit the conventional steps for re-estimating the GMM class covariance matrices and prior class probabilities. (b) Once the GMM has been optimised, it is used as described above to compute a set of data-encoding vectors from the training set of speech spectral vectors. (c) This set of data-encoding vectors is then used for training the HMM according to a conventional procedure for optimising HMM class means (e.g. the Baum-Welch re-estimation procedure), subject to the innovative modifications to re-scale the HMM class means to have constant moduli equal to 1, and to omit the conventional steps for re-estimating the HMM class covariance matrices and prior class probabilities. No modification is made to the conventional steps for re-estimating HMM transition
probabilities; the conventional Baum-Welch re-estimation procedure may be used for re-estimating HMM transition probabilities.

Figure 8 illustrates the advantage of employing the constraints of the current invention. This shows a spectral vector x = (x_1, x_2, x_3), where |x| = 1.
Constraining this spectral vector, e.g. 101, to have a constant modulus has the implication that the class means 102 will all lie on the surface of a hypersphere. In the case shown the hypersphere has two dimensions, and so is an ordinary 2-sphere 103 in an ordinary three-dimensional space.
Constraining the covariance matrices to be isotropic and diagonal has the effect that the individual classes will project onto this hypersphere in the form of circles 104. This arrangement allows individual speech-sounds to be represented in the spectral-vector space not as individual Gaussian ellipsoids, as is conventional, but as assemblies 105 of many smaller Gaussian hypercircles 104 tiling the unit hypersphere 103, offering the potential for more faithful representation of highly complex-shaped speech-sounds, and thus improved classification performance. Each class (hypercircle), e.g. 104, will span just a small area within the complex shape that delimits the set of all spectral vectors (which must all lie on the spectral-vector hypersphere 103) that could correspond to alternative pronunciations of a particular individual speech-sound; collectively, many such classes 104 will be able to span that whole complex shape much more faithfully than could the single anisotropic ellipsoid conventionally used to represent an individual speech-sound.
Other sets of Gaussian classes within the same mixture model will be able to span parts of other complex shapes on the spectral-vector hypersphere, i.e. of other speech sounds. The posterior probability associated with each of these Gaussian classes (hypercircles) is a measure of how close the current spectral vector is (on the spectral-vector hypersphere) to the corresponding Gaussian class mean 102 (hypercircle centre). Learning which sets of classes correspond to which speech sounds, on the basis of all the temporal correlations between them that are present in the training speech sequences, is the function of the GMM-based HMM, whose inputs are fed from the set of all those posterior probabilities.
To use an analogy, a large number of hypercircles helps one to avoid local minima far better than would a small number of anisotropic ellipsoids, for effectively the same reason that a bunch of sticks gets tangled more easily than a tray of marbles. (In this analogy, minimising the total gravitational potential of the set of marbles plays the analogous role to maximising the model likelihood.) Similarly, one can map out highly complex shapes much more faithfully by using a lot of marbles than by using a few sticks.
The skilled person will be aware that other embodiments within the scope of the invention may be envisaged, and thus the invention should not be limited to the embodiments as herein described.
References

A.R. Webb, Statistical Pattern Recognition, Arnold (London), 1999.
B.H. Juang & L.R. Rabiner, Hidden Markov models for speech recognition, Technometrics 33(3), American Statistical Association, 1991.

Claims (29)

Claims
1. A signal processing system for processing a plurality of multi-element data encoding vectors, the system:
- having means for deriving the data encoding vectors from input signals;
- being arranged to process the data encoding vectors using a Gaussian Mixture Model (GMM) based Hidden Markov Model (HMM), the GMM based HMM having at least one class mean vector having multiple elements;
- being arranged to process the elements of the class mean vector(s) by an iterative optimization procedure;
characterized in that the system is also arranged to scale the elements of the class mean vector(s) during the optimization procedure to provide for the class mean vector(s) to have constant modulus at each iteration, and to normalise the data encoding vectors input to the GMM based HMM.

2. A system as claimed in claim 1 wherein the GMM based HMM has a covariance matrix, the elements of which remain constrained during the optimization procedure such that the matrix is isotropic and diagonal, and the value of the non-zero diagonal elements remains constant throughout the optimization procedure.

3. A system as claimed in claim 1 or claim 2 wherein prior class probabilities associated with the GMM based HMM are constrained to be equal, and to remain unchanged throughout the optimization procedure.

4. A system as claimed in any of the above claims wherein, when the elements of the data encoding vectors represent spectral coefficients, the normalization of the data encoding vectors is such that the data encoding vectors have equal moduli.
5. A system as claimed in claim 4 wherein the modulus of each data encoding vector is independent of the overall spectral power in the vector.
6. A system as claimed in claim 4 or claim 5 wherein elements forming spectral coefficients of a data encoding vector are arranged to be individually proportional to the square root of the power in their corresponding spectral band divided by the square root of the overall power contained in spectral bands represented in the vector.

7. A system as claimed in any of claims 4 to 6 wherein the system is arranged to add at least one additional element to each data encoding vector, wherein the added element(s) encode the overall power contained in spectral bands represented in the vector.

8. A system as claimed in claim 7 wherein the system is arranged to add two elements to each data encoding vector to represent the overall power in spectral bands, these two elements arranged such that the sum of their squares is a constant across all data encoding vectors that represent the spectrum of the input signal.

9. A system as claimed in any of claims 1 to 8 wherein the GMM based HMM provides the observation probabilities for a higher level HMM.

10. A system as claimed in any of claims 1 to 9 wherein the derivation of the data encoding vectors from the input signal involves the use of a GMM, whereby this GMM provides the data encoding vectors to the GMM based HMM that comprise elements derived from the GMM's posterior probabilities.
11. A system as claimed in claim 10 wherein elements of the data encoding vectors input from the GMM to the GMM based HMM are proportional to the square root of posterior probabilities of the additional GMM.
12. A system as claimed in claim 10 wherein elements of the data encoding vectors input from the GMM to the GMM based HMM are proportional to posterior probabilities of the additional GMM.
13. A system as claimed in any of claims 9 to 12 wherein the constant values for the modulus of each of the class mean vectors may be different at each level.
14. A method of processing a signal, the signal comprising a plurality of multi-element data encoding vectors, wherein the data encoding vectors are derived from an analogue or digital input, and where the method employs at least one Gaussian Mixture Model (GMM) based Hidden Markov Model (HMM), the GMM based HMM having at least one class mean vector having multiple elements, and the elements of the class mean vector(s) are optimised in an iterative procedure, characterized in that the elements of the class mean vectors are scaled during the optimization procedure such that the class mean vectors have a constant modulus at each iteration, and the data encoding vectors input to the GMM based HMM are processed such that they are normalised.
15. A method as claimed in claim 14 wherein a covariance matrix within the GMM based HMM has one or more elements, all of which are constrained during the optimization procedure such that the matrix is isotropic and diagonal, and the value of its non-zero elements remains constant throughout the optimization procedure.

16. A method as claimed in claim 14 or claim 15 wherein prior class probabilities associated with the GMM based HMM are constrained to be equal, and to remain unchanged throughout the optimization procedure.

17. A method as claimed in any of claims 14 to 16 wherein, when the elements of the data encoding vectors represent spectral coefficients, the data encoding vectors are scaled in a pre-processing stage before being input to the GMM based HMM, such that the moduli of all data encoding vectors are equal.
18. A method as claimed in claim 17 wherein the modulus of each data encoding vector is independent of the overall power in the vector.
19. A method as claimed in claim 17 or claim 18 wherein elements forming spectral coefficients of a data encoding vector are arranged to be individually proportional to the square root of the power in their corresponding spectral band, divided by the square root of the overall power contained in spectral bands represented in the vector.

20. A method as claimed in any of claims 17 to 19 wherein at least one additional element is added to each data encoding vector, wherein the added element(s) encode the overall power contained in spectral bands represented in the vector.

21. A method as claimed in claim 20 wherein two elements are added to each data encoding vector to represent the overall power in spectral bands, these two elements arranged such that the sum of their squares is a constant across all input vectors that represent the spectrum of the input signal.
22. A method as claimed in any of claims 14 to 21 wherein the GMM based HMM provides the observation probabilities for a higher level HMM.
23. A method as claimed in any of claims 14 to 21 wherein the derivation of the data encoding vectors from the input signal involves the use of a GMM, whereby this GMM provides the data encoding vectors to the GMM based HMM that comprise elements derived from the GMM's posterior probabilities.
24. A method as claimed in claim 23 wherein elements of the data encoding vectors input from the GMM to the GMM based HMM are proportional to the square root of posterior probabilities of the additional GMM.
25. A method as claimed in claim 23 wherein elements of the data encoding vectors input from the GMM to the GMM based HMM are proportional to posterior probabilities of the additional GMM.
26. A method as claimed in any of claims 22 to 25 wherein the constant values for the modulus of each of the class mean vectors may be different at each level.
27. A signal processing system that has been trained according to the method of any of claims 14 to 26.
28. A computer program designed to run on a computer and arranged to implement a signal processing system for processing one or more multi-element input vectors, the system:
- having means for deriving the data encoding vectors from input signals;
- being arranged to process the data encoding vectors using a Gaussian Mixture Model (GMM) based Hidden Markov Model (HMM), the GMM based HMM having at least one class mean vector having multiple elements;
- being arranged to process the elements of the class mean vector(s) by an iterative optimization procedure;
characterized in that the system is also arranged to scale the elements of the class mean vector(s) during the optimization procedure to provide for the class mean vector(s) to have constant modulus at each iteration, and to normalise the data encoding vectors input to the GMM based HMM.

29. A speech recogniser incorporating a signal processing system for processing one or more multi-element input vectors, the recogniser:
- having means for deriving the data encoding vectors from input signals;
- being arranged to process the data encoding vectors using a Gaussian Mixture Model (GMM) based Hidden Markov Model (HMM), the GMM based HMM having at least one class mean vector having multiple elements;
- being arranged to process the elements of the class mean vector(s) by an iterative optimization procedure;
characterised in that the system is also arranged to scale the elements of the class mean vector(s) during the optimization procedure to provide for the class mean vector(s) to have constant modulus at each iteration, and to normalise the data encoding vectors input to the GMM based HMM.
GB0207343A 2002-03-28 2002-03-28 Signal Processing System Withdrawn GB2387008A (en)

Priority Applications (8)

Application Number Priority Date Filing Date Title
GB0207343A GB2387008A (en) 2002-03-28 2002-03-28 Signal Processing System
US10/509,527 US7664640B2 (en) 2002-03-28 2003-03-24 System for estimating parameters of a gaussian mixture model
PCT/GB2003/001244 WO2003083831A1 (en) 2002-03-28 2003-03-24 System for estimating parameters of a gaussian mixture model
EP03712399A EP1488411B1 (en) 2002-03-28 2003-03-24 System for estimating parameters of a gaussian mixture model (gmm), or a gmm based hidden markov model
AT03712399T ATE343197T1 (en) 2002-03-28 2003-03-24 DEVICE FOR DETERMINING PARAMETERS OF A GAUSSIC MIXTURE MODEL (GMM) OR A GMM BASED HIDDEN MARKOV MODEL
JP2003581170A JP4264006B2 (en) 2002-03-28 2003-03-24 System for estimating parameters of Gaussian mixture model
DE60309142T DE60309142T2 (en) 2002-03-28 2003-03-24 System for estimating parameters of a Gaussian mixture model (GMM) or a GMM-based hidden Markov model
AU2003217013A AU2003217013A1 (en) 2002-03-28 2003-03-24 System for estimating parameters of a gaussian mixture model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB0207343A GB2387008A (en) 2002-03-28 2002-03-28 Signal Processing System

Publications (2)

Publication Number Publication Date
GB0207343D0 GB0207343D0 (en) 2002-05-08
GB2387008A true GB2387008A (en) 2003-10-01

Family

ID=9933907

Family Applications (1)

Application Number Title Priority Date Filing Date
GB0207343A Withdrawn GB2387008A (en) 2002-03-28 2002-03-28 Signal Processing System

Country Status (8)

Country Link
US (1) US7664640B2 (en)
EP (1) EP1488411B1 (en)
JP (1) JP4264006B2 (en)
AT (1) ATE343197T1 (en)
AU (1) AU2003217013A1 (en)
DE (1) DE60309142T2 (en)
GB (1) GB2387008A (en)
WO (1) WO2003083831A1 (en)

Families Citing this family (52)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7167587B2 (en) * 2002-08-30 2007-01-23 Lockheed Martin Corporation Sequential classifier for use in pattern recognition system
US20040086185A1 (en) * 2002-10-31 2004-05-06 Eastman Kodak Company Method and system for multiple cue integration
JP2005141601A (en) * 2003-11-10 2005-06-02 Nec Corp Model selection computing device, dynamic model selection device, dynamic model selection method, and program
JP4511850B2 (en) * 2004-03-03 2010-07-28 学校法人早稲田大学 Person attribute identification method and system
US8010356B2 (en) 2006-02-17 2011-08-30 Microsoft Corporation Parameter learning in a hidden trajectory model
US20070219796A1 (en) * 2006-03-20 2007-09-20 Microsoft Corporation Weighted likelihood ratio for pattern recognition
JP4880036B2 (en) * 2006-05-01 2012-02-22 日本電信電話株式会社 Method and apparatus for speech dereverberation based on stochastic model of sound source and room acoustics
US8234116B2 (en) * 2006-08-22 2012-07-31 Microsoft Corporation Calculating cost measures between HMM acoustic models
US7937270B2 (en) * 2007-01-16 2011-05-03 Mitsubishi Electric Research Laboratories, Inc. System and method for recognizing speech securely using a secure multi-party computation protocol
US20080275743A1 (en) * 2007-05-03 2008-11-06 Kadambe Shubha L Systems and methods for planning
CN101352165B (en) * 2008-05-26 2011-06-15 辽宁石油化工大学 Formulation and formulating method for promoting coloring of Nanguo pear fruit
US8521530B1 (en) 2008-06-30 2013-08-27 Audience, Inc. System and method for enhancing a monaural audio signal
US9767806B2 (en) * 2013-09-24 2017-09-19 Cirrus Logic International Semiconductor Ltd. Anti-spoofing
AU2010201891B2 (en) * 2009-05-13 2015-02-12 The University Of Sydney A method and system for data analysis and synthesis
US9008329B1 (en) * 2010-01-26 2015-04-14 Audience, Inc. Noise reduction using multi-feature cluster tracker
US8473287B2 (en) 2010-04-19 2013-06-25 Audience, Inc. Method for jointly optimizing noise reduction and voice quality in a mono or multi-microphone system
US8538035B2 (en) 2010-04-29 2013-09-17 Audience, Inc. Multi-microphone robust noise suppression
US8781137B1 (en) 2010-04-27 2014-07-15 Audience, Inc. Wind noise detection and suppression
US9558755B1 (en) 2010-05-20 2017-01-31 Knowles Electronics, Llc Noise suppression assisted automatic speech recognition
US8447596B2 (en) 2010-07-12 2013-05-21 Audience, Inc. Monaural noise suppression based on computational auditory scene analysis
US8554553B2 (en) * 2011-02-21 2013-10-08 Adobe Systems Incorporated Non-negative hidden Markov modeling of signals
US9047867B2 (en) 2011-02-21 2015-06-02 Adobe Systems Incorporated Systems and methods for concurrent signal recognition
US8849663B2 (en) 2011-03-21 2014-09-30 The Intellisis Corporation Systems and methods for segmenting and/or classifying an audio signal from transformed audio information
US8767978B2 (en) 2011-03-25 2014-07-01 The Intellisis Corporation System and method for processing sound signals implementing a spectral motion transform
US9183850B2 (en) 2011-08-08 2015-11-10 The Intellisis Corporation System and method for tracking sound pitch across an audio signal
US8548803B2 (en) 2011-08-08 2013-10-01 The Intellisis Corporation System and method of processing a sound signal including transforming the sound signal into a frequency-chirp domain
US8620646B2 (en) 2011-08-08 2013-12-31 The Intellisis Corporation System and method for tracking sound pitch across an audio signal using harmonic envelope
US8843364B2 (en) 2012-02-29 2014-09-23 Adobe Systems Incorporated Language informed source separation
US9640194B1 (en) 2012-10-04 2017-05-02 Knowles Electronics, Llc Noise suppression for speech processing based on machine-learning mask estimation
US9058820B1 (en) 2013-05-21 2015-06-16 The Intellisis Corporation Identifying speech portions of a sound model using various statistics thereof
US9484044B1 (en) 2013-07-17 2016-11-01 Knuedge Incorporated Voice enhancement and/or speech features extraction on noisy audio signals using successively refined transforms
US9530434B1 (en) 2013-07-18 2016-12-27 Knuedge Incorporated Reducing octave errors during pitch determination for noisy audio signals
US9208794B1 (en) 2013-08-07 2015-12-08 The Intellisis Corporation Providing sound models of an input signal using continuous and/or linear fitting
KR101559364B1 (en) * 2014-04-17 2015-10-12 한국과학기술원 Mobile apparatus executing face to face interaction monitoring, method of monitoring face to face interaction using the same, interaction monitoring system including the same and interaction monitoring mobile application executed on the same
CN106797512B (en) 2014-08-28 2019-10-25 美商楼氏电子有限公司 Method, system and the non-transitory computer-readable storage medium of multi-source noise suppressed
WO2016036163A2 (en) * 2014-09-03 2016-03-10 Samsung Electronics Co., Ltd. Method and apparatus for learning and recognizing audio signal
US9842611B2 (en) 2015-02-06 2017-12-12 Knuedge Incorporated Estimating pitch using peak-to-peak distances
US9870785B2 (en) 2015-02-06 2018-01-16 Knuedge Incorporated Determining features of harmonic signals
US9922668B2 (en) 2015-02-06 2018-03-20 Knuedge Incorporated Estimating fractional chirp rate with multiple frequency representations
US9721569B2 (en) * 2015-05-27 2017-08-01 Intel Corporation Gaussian mixture model accelerator with direct memory access engines corresponding to individual data streams
US10056076B2 (en) * 2015-09-06 2018-08-21 International Business Machines Corporation Covariance matrix estimation with structural-based priors for speech processing
US20170255864A1 (en) * 2016-03-05 2017-09-07 Panoramic Power Ltd. Systems and Methods Thereof for Determination of a Device State Based on Current Consumption Monitoring and Machine Learning Thereof
CN105933323B (en) * 2016-06-01 2019-05-31 百度在线网络技术(北京)有限公司 Voiceprint registration, authentication method and device
US10754959B1 (en) * 2017-01-20 2020-08-25 University Of South Florida Non-linear stochastic models for predicting exploitability
US20200019875A1 (en) * 2017-02-17 2020-01-16 Nec Corporation Parameter calculation device, parameter calculation method, and non-transitory recording medium
US10650150B1 (en) * 2017-02-28 2020-05-12 University Of South Florida Vulnerability life cycle exploitation timing modeling
US10659488B1 (en) * 2017-02-28 2020-05-19 University Of South Florida Statistical predictive model for expected path length
US11017096B2 (en) * 2018-06-01 2021-05-25 University Of South Florida Prediction of software vulnerabilities
US11349857B1 (en) * 2019-03-21 2022-05-31 Snap Inc. Suspicious group detection
US11190534B1 (en) 2019-03-21 2021-11-30 Snap Inc. Level of network suspicion detection
CN111327558B (en) * 2020-02-28 2022-06-21 杭州电子科技大学 Method and system for GMM non-uniform quantization for filter multi-carrier modulation optical communication
US11483335B1 (en) * 2020-12-03 2022-10-25 University Of South Florida Cybersecurity: reliability of a computer network

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5450523A (en) * 1990-11-15 1995-09-12 Matsushita Electric Industrial Co., Ltd. Training module for estimating mixture Gaussian densities for speech unit models in speech recognition systems
US5839105A (en) * 1995-11-30 1998-11-17 Atr Interpreting Telecommunications Research Laboratories Speaker-independent model generation apparatus and speech recognition apparatus each equipped with means for splitting state having maximum increase in likelihood

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6044344A (en) * 1997-01-03 2000-03-28 International Business Machines Corporation Constrained corrective training for continuous parameter system
US6374221B1 (en) * 1999-06-22 2002-04-16 Lucent Technologies Inc. Automatic retraining of a speech recognizer while using reliable transcripts
US6993452B2 (en) * 2000-05-04 2006-01-31 At&T Corp. Distance measure for probability distribution function of mixture type
US6609093B1 (en) * 2000-06-01 2003-08-19 International Business Machines Corporation Methods and apparatus for performing heteroscedastic discriminant analysis in pattern recognition systems
GB0017989D0 (en) 2000-07-24 2001-08-08 Secr Defence Target recognition system
TW473704B (en) * 2000-08-30 2002-01-21 Ind Tech Res Inst Adaptive voice recognition method with noise compensation
US7295978B1 (en) * 2000-09-05 2007-11-13 Verizon Corporate Services Group Inc. Systems and methods for using one-dimensional gaussian distributions to model speech

Also Published As

Publication number Publication date
EP1488411A1 (en) 2004-12-22
WO2003083831A1 (en) 2003-10-09
DE60309142D1 (en) 2006-11-30
US20060178887A1 (en) 2006-08-10
DE60309142T2 (en) 2007-08-16
JP4264006B2 (en) 2009-05-13
US7664640B2 (en) 2010-02-16
EP1488411B1 (en) 2006-10-18
GB0207343D0 (en) 2002-05-08
AU2003217013A1 (en) 2003-10-13
ATE343197T1 (en) 2006-11-15
JP2005521906A (en) 2005-07-21

Similar Documents

Publication Publication Date Title
GB2387008A (en) Signal Processing System
Woodland Speaker adaptation for continuous density HMMs: A review
Hossan et al. A novel approach for MFCC feature extraction
EP0966736B1 (en) Method for discriminative training of speech recognition models
US8301445B2 (en) Speech recognition based on a multilingual acoustic model
JP4913204B2 (en) Dynamically configurable acoustic model for speech recognition systems
EP0771461B1 (en) Method and apparatus for speech recognition using optimised partial probability mixture tying
Bahl et al. Multonic Markov word models for large vocabulary continuous speech recognition
JPH10512686A (en) Method and apparatus for speech recognition adapted to individual speakers
Mao et al. Automatic training set segmentation for multi-pass speech recognition
US20100174539A1 (en) Method and apparatus for vector quantization codebook search
Shao et al. Bayesian separation with sparsity promotion in perceptual wavelet domain for speech enhancement and hybrid speech recognition
Stuttle A Gaussian mixture model spectral representation for speech recognition
US7574359B2 (en) Speaker selection training via a-posteriori Gaussian mixture model analysis, transformation, and combination of hidden Markov models
US6438519B1 (en) Apparatus and method for rejecting out-of-class inputs for pattern classification
Frankel et al. Speech recognition using linear dynamic models
Rao et al. Deterministically annealed design of hidden Markov model speech recognizers
JP2004004906A (en) Adaptation method between speaker and environment including maximum likelihood method based on unique voice
Kumar et al. Comparative analysis of different feature extraction and classifier techniques for speaker identification systems: A review
Takahashi et al. Discrete mixture HMM
Cipli et al. Multi-class acoustic event classification of hydrophone data
Myrvoll et al. On divergence based clustering of normal distributions and its application to HMM adaptation.
Zhang et al. Phoneme-based vector quantization in a discrete HMM speech recognizer
Basu et al. Power exponential densities for the training and classification of acoustic feature vectors in speech recognition
KR100576501B1 (en) Method for modificating state

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)