GB2480085A - An adaptive speech recognition system and method using a cascade of transforms


Info

Publication number
GB2480085A
Authority
GB
United Kingdom
Prior art keywords
transform
model
speech
transforms
parent
Prior art date
Legal status
Granted
Application number
GB201007525A
Other versions
GB201007525D0 (en)
GB2480085B (en)
Inventor
Catherine Breslin
Kean Kheong Chin
Mark John Francis Gales
Katherine Mary Knill
Haitian Xu
Current Assignee
Toshiba Europe Ltd
Original Assignee
Toshiba Research Europe Ltd
Priority date
Filing date
Publication date
Application filed by Toshiba Research Europe Ltd filed Critical Toshiba Research Europe Ltd
Priority to GB201007525A
Publication of GB201007525D0
Publication of GB2480085A
Application granted
Publication of GB2480085B
Expired - Fee Related
Anticipated expiration


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/065 Adaptation
    • G10L15/08 Speech classification or search
    • G10L15/10 Speech classification or search using distance or distortion measures between unknown speech and reference templates

Abstract

A speech processing method, comprising receiving a speech input in a first environment which comprises a sequence of observations, and determining the likelihood of a sequence of words arising from the sequence of observations using an acoustic model and a language model. This comprises providing an acoustic model for performing speech recognition on an input signal which comprises a sequence of observations, wherein said model has been trained to recognize speech in a second environment, said model having a plurality of model parameters relating to the probability distribution of a word or part thereof being related to an observation. The model trained in the second environment is adapted to that of the first environment. The speech recognition method further comprises determining the likelihood of a sequence of observations occurring in a given language using a language model, combining the likelihoods determined by the acoustic model and the language model, and outputting a sequence of words identified from said speech input signal. Adapting the model trained in the second environment comprises using a cascade of transforms successively applied to the observations or model parameters, where a parent transform is a transform which is applied earlier in said succession of transforms, and a child transform is a transform applied later in the succession of transforms than said parent transform. The child transform is estimated on the basis of a prior transform, and said prior transform is also said parent transform.

Description

A Speech Recognition System and Method

The present invention is concerned with the field of speech recognition. More specifically, the present invention is concerned with adaptive speech recognition systems and methods which can adapt between different environments or conditions, such as between different speakers and different noise environments.
The best speech recognition performance is obtained when the acoustic model matches the environment in which it is being used including speaker, noise, microphone etc. Since we cannot model every environment in the world in advance, this means the acoustic model that comes with a speech recogniser needs to be adapted to the current operating environment. For a task such as dictation on a PC the speaker can be asked to provide several minutes of training data for this adaptation before they use the system "in anger".
However, in most other applications where speech recognition is used, such as in-car phone dialling or navigation, voice commands on a mobile phone or MP3 player and speech recognition enabled interactive voice response systems, the speaker expects good performance from the first interaction with the system and cannot be expected to (at least consciously) supply adaptation data. As a user's first experience with a system affects how they feel about the system, it is very important to aim to achieve as good recognition accuracy as possible from the first utterance. Rapid adaptation using the speaker's utterances is therefore very important.
Known popular speaker adaptation methods include Maximum Likelihood Linear Regression (MLLR), Constrained Maximum Likelihood Linear Regression (CMLLR) and Vocal Tract Length Normalisation (VTLN).
MLLR and CMLLR estimate a set of transforms from recognition data by maximising the likelihood of a hypothesis on the adaptation data. MLLR transforms are applied to Hidden Markov Model (HMM) parameters such as means and/or covariances while CMLLR transforms are applied to the model or speech feature vectors to adapt the model means and covariances. If the hypothesis is not known, then a first decoding pass is normally performed to estimate the best hypothesis. Cascades of transforms can be estimated and combined during decoding by applying them in turn. Components of HMM state output distributions are normally clustered using a regression tree, so the same transform is applied to multiple components.
CMLLR and MLLR perform better when more data is available, and with limited data it may not be possible to robustly estimate all the parameters. With only one utterance available for adaptation, the accuracy achieved by these methods is still some way from what can be achieved when many utterances are available for adaptation. Full transforms outperform diagonal or block-diagonal transforms, but need more data to be robustly estimated. Full or block diagonal MLLR transforms are more computationally expensive to use as they convert the generally diagonal model covariances to block-diagonal or full.
CAT-CMLLR is a variant of CMLLR where a number of transforms are estimated from the training set, and are interpolated during decoding. Then, only a small number of interpolation weights need to be estimated during decoding. CAT-CMLLR has a small number of parameters, and so it can perform well with limited data but its performance is not normally improved by using more data. Typically, only small improvements over the baseline are seen, hence its effect is limited.
CAT-CMLLR performs well in clean conditions, but does not work well when the test and training conditions are mismatched, such as when the two have different noise conditions. This is because the set of transforms estimated in training are mismatched to the test conditions, and hence can degrade performance.
VTLN warps the frequency axis to account for variations in vocal tract lengths between speakers. VTLN has a small number of parameters to estimate (typically just one warping factor for each regression class) and so it is useful for fast adaptation as the parameters can be robustly estimated from small amounts of data. VTLN can be implemented as a linear transform of the features and hence used in the same framework as CMLLR and CAT-CMLLR. VTLN transforms for multiple warping factors can be pre-computed, and then the same statistics accumulated for CMLLR can be used to select the best of these transforms for each utterance or speaker.
VTLN, as for CAT-CMLLR, has a small number of parameters, and so it can perform well with limited data but its performance is not normally improved by using more data.
Typically, only small improvements over the baseline are seen, hence its effect is limited. Recent work has implemented VTLN as a linear transform, in which case it can be seen as a CMLLR transform which has been estimated using physiological constraints. So, the effect of VTLN is often subsumed by CMLLR when the two are used in conjunction.
MAPLR directly uses a prior distribution for estimating model-space linear adaptation transforms. A prior distribution P(W) is estimated from a set of transforms derived from an HMM set and a training data set. SMAPLR and CSMAPLR are related in their use of a prior distribution, but first use a regression tree to cluster states or components.
For any particular node in the tree, the transform previously estimated at the parent node is used in a prior distribution for estimating the transform at the current node. Thus robust transform estimates can be obtained for clusters where there is limited data.
MAPLR requires defining a prior distribution over the transforms in order to estimate an MLLR transform. It has issues in terms of dynamic range of the two quantities being interpolated to give the updated statistics. The prior transform and adaptive statistics are not consistent and thus can lead to issues in tuning a suitable interpolation weight.
Recently, F. Flego and M.J.F. Gales, "Incremental Predictive and Adaptive Noise Compensation," in ICASSP, 2009 have proposed a method using a combination of predictive and adaptive statistics for noise compensation.
The present invention builds upon the work of Flego and Gales and uses a cascade of transforms to at least partially address some of the problems of the above identified prior art. In a first aspect, it provides a speech processing method, comprising: receiving a speech input in a first environment which comprises a sequence of observations; determining the likelihood of a sequence of words arising from the sequence of observations using an acoustic model and a language model, comprising: providing an acoustic model for performing speech recognition on an input signal which comprises a sequence of observations, wherein said model has been trained to recognise speech in a second environment, said model having a plurality of model parameters relating to the probability distribution of a word or part thereof being related to an observation; adapting the model trained in the second environment to that of the first environment; the speech recognition method further comprising determining the likelihood of a sequence of observations occurring in a given language using a language model; combining the likelihoods determined by the acoustic model and the language model and outputting a sequence of words identified from said speech input signal, wherein adapting the model trained in the second environment comprises: using a cascade of transforms successively applied to the observations or model parameters, where a parent transform is a transform which is applied earlier in said succession of transforms, and a child transform is a transform applied later in the succession of transforms than said parent transform, and wherein the child transform is estimated on the basis of a prior transform and said prior transform is also said parent transform.
The use of the parent transformation to provide prior information for the child transform allows a framework to be devised which is computationally very efficient.
This new method can be adopted both for recognition and for adaptive training. When used for speech recognition, robustness to different speakers is improved when using limited amounts of data for adaptation by incorporating prior information in a count-smoothing framework.
Experiments indicate that the method of a preferred embodiment is able to outperform related methods in terms of recognition accuracy, with very little additional computational cost.
Using prior information allows full transforms to be estimated from limited adaptation data, when without prior information there are not enough frames to estimate a full transform. Very little additional computational power is needed as the appropriate statistics can be computed offline and cached.
In an embodiment, the present invention is used in a system where there is a training mode for said new speaker and said new speaker reads known text. The speech data and the text which is read can then be used to estimate the transforms to adapt the acoustic model to the new speaker.
In a further embodiment, the system receives speech for which the text is not known. For example, if the system is used as part of a voice controlled sat nav system, an MP3 player, smart phone etc., there is generally no distinct adaptive training phase. In such systems, text corresponding to the input speech will be estimated on the basis of a hypothesis. For example, the text may first be estimated using the model without adaptation or using a different model. The system then uses this hypothesis to estimate the transforms. The transforms may then be continually estimated and adapted as more speech is received.
The parent and child transforms may be applied to said model parameters, and in some cases, for example MLLR, to said observations.
The parent transform itself may comprise a plurality of cascading transforms.
In a preferred embodiment, the child transform is determined by interpolating between prior and adaptive statistics.
The above framework allows the parent and child to have different structures of regression tree, including a different number of regression classes.
At least one of the parent transform or the child transform may adapt for speaker variations, and/or at least one of the parent transform or the child transform adapts for noise variations.
The parent transform may be linear or non-linear. The parent transform may be trained from adaptation data, the parent transform may be selected from a plurality of pre-set transforms, the selection between such transforms may be made on the basis of adaptation data or other factors. For example, if the system is aware that it is being used in a car, then it can select a parent transform appropriate to that environment. The parent transform may also be a single pre-set transform which is used in all situations.
In a second aspect, the present invention provides a method of adapting an acoustic model for speech processing to the speech of a new speaker, the method comprising: receiving a speech input in a first environment which comprises a sequence of observations; providing an acoustic model for performing speech recognition on an input signal which comprises a sequence of observations, wherein said model has been trained to recognise speech in a second environment, said model having a plurality of model parameters relating to the probability distribution of a word or part thereof being related to an observation; using a cascade of transforms successively applied to the observations or model parameters, where a parent transform is a transform which is applied earlier in said succession of transforms, and a child transform is a transform applied later in the succession of transforms than said parent transform, and wherein the child transform is estimated on the basis of a prior transform and said prior transform is also said parent transform.
In a third aspect, the present invention provides a speech processing apparatus, said apparatus comprising: a receiver for receiving a speech input in a first noise environment which comprises a sequence of observations; a processor configured to: determine the likelihood of a sequence of words arising from the sequence of observations using an acoustic model and a language model, comprising: provide an acoustic model for performing speech recognition on an input signal which comprises a sequence of observations, wherein said model has been trained to recognise speech in a second environment, said model having a plurality of model parameters relating to the probability distribution of a word or part thereof being related to an observation; adapt the model trained in the second environment to that of the first environment; determine the likelihood of a sequence of observations occurring in a given language using a language model; combine the likelihoods determined by the acoustic model and the language model; and an output configured to output a sequence of words identified from said speech input signal, wherein adapting the model trained in the second environment comprises: using a cascade of transforms successively applied to the observations or model parameters, where a parent transform is a transform which is applied earlier in said succession of transforms, and a child transform is a transform applied later in the succession of transforms than said parent transform, and wherein the child transform is estimated on the basis of a prior transform and said prior transform is also said parent transform.
The present invention can be implemented either in hardware or in software on a general purpose computer. Further, the present invention can be implemented in a combination of hardware and software. The present invention can also be implemented by a single processing apparatus or a distributed network of processing apparatuses.
Since the present invention can be implemented by software, the present invention encompasses computer code provided to a general purpose computer on any suitable carrier medium. The carrier medium can comprise any storage medium such as a floppy disk, a CD ROM, a magnetic device or a programmable memory device, or any transient medium such as any signal e.g. an electrical, optical or microwave signal.
The present invention will now be described with reference to the following non-limiting embodiments in which:
Figure 1 is a schematic of a general speech recognition system;
Figure 2 is a schematic of the components of a speech recognition processor;
Figure 3 is a schematic of a Gaussian probability function;
Figure 4 is a schematic plot of acoustic space representing both probability density functions and an observation vector;
Figure 5 is a flow diagram showing a known method to determine a VTLN adaptation transform;
Figure 6 is a figure used to demonstrate VTLN;
Figure 7 is a flow diagram showing a known method to determine an adaptation transform;
Figure 8 is a flow diagram showing a speech recognition method in accordance with an embodiment of the present invention;
Figure 9 is a flow diagram showing a speech recognition method in accordance with a further embodiment of the present invention;
Figure 10 shows a method of adapting an acoustic model for speech processing in accordance with an embodiment of the present invention; and
Figure 11 is a plot of word error rate against the log of a weighting factor used to smooth statistics used to obtain a transform for a speech recognition system in accordance with an embodiment of the present invention.
Figure 1 is a schematic of a very basic speech recognition system. A user (not shown) speaks into microphone 1 or other collection device for an audio system. The device 1 could be substituted by a memory which contains audio data previously recorded or the device 1 may be a network connection for receiving audio data from a remote location.
The speech signal is then directed into a speech processor 3 which will be described in more detail with reference to figure 2.
The speech processor 3 takes the speech signal and turns it into text corresponding to the speech signal. Many different forms of output are available. For example, the output may be in the form of a display 5 which outputs to a screen. Alternatively, the output could be directed to a printer or the like. Also, the output could be in the form of an electronic signal which is provided to a further system 9. For example, the further system 9 could be part of a speech translation system which takes the outputted text from processor 3 and then converts it into a different language. The converted text is then outputted via a further text or speech system.
Alternatively, the text outputted by the processor 3 could be used to operate different types of equipment, for example, it could be part of a mobile phone, car, etc. where the user controls various functions via speech. The output could be used in an in-car navigation system to direct the user to a named location.
Figure 2 is a block diagram of the standard components of a speech recognition processor 3 of the type shown in figure 1. The speech signal received from microphone, through a network or from a recording medium 1 is directed into front-end unit 11.
The front end unit 11 digitises the received speech signal and splits it into frames of equal lengths. The speech signals are then subjected to a spectral analysis to determine various parameters which are plotted in an "acoustic space" or feature space. The parameters which are derived will be discussed in more detail later.
The front end unit 11 also removes signals which are believed not to be speech signals and other irrelevant information. Popular front end units comprise apparatus which use filter bank (FBANK) parameters, Mel Frequency Cepstral Coefficients (MFCC) and Perceptual Linear Predictive (PLP) parameters. The output of the front end unit is in the form of an input vector which is in n-dimensional acoustic space.
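For illustration only, a very simplified front end along these lines can be sketched with numpy: pre-emphasis, splitting into fixed-length frames, windowing and a log power spectrum per frame. The frame length, shift and FFT size below are assumed values, and a real front end would go on to derive FBANK, MFCC or PLP features from the spectra.

```python
import numpy as np

def simple_frontend(signal, sample_rate=16000, frame_ms=25, shift_ms=10, n_fft=512):
    """Split a 1-D speech signal into frames and return log power spectra."""
    signal = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])   # pre-emphasis
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    n_frames = max(1 + (len(signal) - frame_len) // shift, 0)
    window = np.hamming(frame_len)
    feats = []
    for i in range(n_frames):
        frame = signal[i * shift:i * shift + frame_len] * window
        spec = np.abs(np.fft.rfft(frame, n_fft)) ** 2
        feats.append(np.log(spec + 1e-10))        # log power spectrum for this frame
    return np.array(feats)                        # shape: (n_frames, n_fft // 2 + 1)

if __name__ == "__main__":
    audio = np.random.randn(16000)                # one second of dummy audio
    print(simple_frontend(audio).shape)
```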
The input vector is then fed into a decoder 13 which cooperates with both an acoustic model section 15 and a language model section 17. The acoustic model section 15 will generally operate using Hidden Markov Models. However, it is also possible to use acoustic models based on connectionist models and hybrid models.
The acoustic model unit 15 derives the likelihood of a sequence of observations corresponding to a word or part thereof on the basis of the acoustic input alone.
The language model section 17 contains information concerning probabilities of a certain sequence of words or parts of words following each other in a given language.
Generally a static model is used. The most popular method is the N-gram model.
The decoder 13 then traditionally uses a dynamic programming (DP) approach to find the best transcription for a given speech utterance using the results from the acoustic model 15 and the language model 17.
This is then output via the output device 19 which allows the text to be displayed, presented or converted for further use e.g. in speech to speech translation or to control a voice activated device.
This description will be mainly concerned with the use of an acoustic model which is a Hidden Markov Model (HMM). However, it could also be used with other models.
The actual model used in this embodiment is a standard model, the details of which are outside the scope of this patent application. However, the model will require the provision of probability density functions (pdfs) which relate to the probability of an observation represented by an acoustic vector (speech vector or feature vector) being related to a word or part thereof. Generally, this probability distribution will be a Gaussian distribution in n-dimensional space.
A schematic example of a generic Gaussian distribution is shown in figure 3. Here, the horizontal axis corresponds to a parameter of the input vector in one dimension and the probability distribution is for a particular word or part thereof relating to the observation. For example, in figure 3, an observation corresponding to an acoustic vector x has a probability p1 of corresponding to the word whose probability distribution is shown in figure 3. The shape and position of the Gaussian is defined by its mean and variance. These parameters are determined during training for the phonemes or phonetic units which the acoustic model covers; they will be referred to as the "model parameters".
In a HMM, once the model parameters have been determined, the model can be used to determine the likelihood of a sequence of observations corresponding to a sequence of words or parts of words.

Figure 4 is a schematic plot of acoustic space where an observation is represented by an observation vector or feature vector x1. The open circles g correspond to the means of Gaussians or other probability distribution functions plotted in acoustic space.
During decoding, the acoustic model will calculate a number of different likelihoods that the feature vector x1 corresponds to a word or part thereof represented by the Gaussians. These likelihoods are then used in the acoustic model and combined with probabilities from the language model to determine the text spoken.
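As an illustration of this step, the following minimal numpy sketch evaluates the log density of one feature vector under a set of diagonal-covariance Gaussian components; the means and variances are random placeholders rather than trained model parameters.

```python
import numpy as np

def log_gaussian_diag(x, means, variances):
    """Log N(x; mu_m, diag(var_m)) for every component m (rows of means/variances)."""
    diff = x[None, :] - means                                       # (M, D)
    return -0.5 * (np.sum(np.log(2.0 * np.pi * variances), axis=1)
                   + np.sum(diff * diff / variances, axis=1))       # (M,)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    D, M = 39, 8                        # feature dimension, number of components
    means = rng.normal(size=(M, D))
    variances = rng.uniform(0.5, 2.0, size=(M, D))
    x = rng.normal(size=D)
    print(log_gaussian_diag(x, means, variances))
```

In a real decoder these per-component log likelihoods would be combined within each HMM state and then with the language model scores.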
However, the acoustic model which is to be used for speech recognition will need to cope under different conditions such as for different speakers and/or under different noise conditions.
Speaker adaptation methods like Maximum Likelihood Linear Regression (MLLR) and Constrained MLLR (CMLLR) have proven successful for adapting to new speakers.
However, these methods give better performance as more speaker data is available, and effective adaptation using limited data is still a challenge.
In an embodiment, adaptation under different noise environments and for different speakers is achieved using prior information in the form of linear transforms to obtain more robust estimates of transforms used to adapt acoustic model parameters to different conditions.
Prior information which can be used to estimate a transform can come from a variety of sources. For example, previously, when using (MAP) adaptation to estimate MLLR transforms, the prior used was a Normal-Wishart distribution over a set of transforms estimated from training data. Also, a regression class tree has been used and the prior transform for a node was the transform estimated at the parent node. Thus prior transforms estimated higher up the tree using more frames are propagated down the tree and used to obtain more robust estimates for transforms at nodes with few observations.
VTLN is another example of a set of prior transforms obtained via knowledge about the effect of vocal tract length on speech features. VTLN has been used directly for adaptation, but not as a prior. VTLN warps the frequency axis to compensate for the difference in individual speaker vocal tract lengths. This uses prior information about the form of frequency warping, and works effectively with little adaptation data as only a small number of parameters must be estimated. However, as more data becomes available, VTLN does not improve and so its effect is limited. VTLN can be implemented as a set of linear transforms and thus provides a source of prior information which can be robustly estimated on a single utterance. In this embodiment, speaker adaptation will be discussed, but the principles can be applied to adaptation to other conditions such as different noise environments. Further, in this embodiment Vocal Tract Length Normalisation (VTLN) will be used for prior information.
However, other transform techniques could be used.
Prior transforms could also be obtained from the training data. A global CMLLR transform is one example, or training data may be clustered and a set of linear transforms estimated.
In the preferred embodiment, the above techniques obtain prior transforms which are combined with adaptive techniques such as CMLLR, where $o_t$ is the feature vector to be transformed and $\hat{o}_t$ is the feature vector after transformation.
A CMLLR transform $W^{(r)} = [b^{(r)}\; A^{(r)}]$ can be applied as a linear transformation of the feature vectors:

$$\hat{o}_t = A^{(r)} o_t + b^{(r)} \qquad (1)$$

yielding likelihood

$$p(o_t \mid m) = |A^{(r)}|\, \mathcal{N}\big(A^{(r)} o_t + b^{(r)};\, \mu^{(m)}, \Sigma^{(m)}\big) \qquad (2)$$

where component $m$ in regression class $r$ has mean $\mu^{(m)}$ and variance $\Sigma^{(m)}$. Equivalently the transform can be written as:

$$\hat{o}_t = W^{(r)} \zeta_t \qquad (3)$$

where the extended observation vector is $\zeta_t = [1\; o_t^{\mathrm{T}}]^{\mathrm{T}}$ and

$$W^{(r)} = [b^{(r)}\; A^{(r)}] \qquad (4)$$

To estimate the CMLLR transform parameters using maximum likelihood, the auxiliary function $Q(W, \hat{W})$ is used:

$$Q(W, \hat{W}) = \sum_t \sum_m \gamma_t^{(m)} \log\big(|A^{(r)}|\, \mathcal{N}(W \zeta_t;\, \mu^{(m)}, \Sigma^{(m)})\big) \qquad (5)$$

The $i$th row of the transform can be estimated iteratively using

$$w_i^{(r)} = (\alpha p_i + k_i^{(r)})\, G_i^{(r)-1} \qquad (6)$$

where $p_i$ is the extended cofactor row vector of $A^{(r)}$ and $\alpha$ satisfies the quadratic equation

$$\alpha^2\, p_i G_i^{(r)-1} p_i^{\mathrm{T}} + \alpha\, p_i G_i^{(r)-1} k_i^{(r)\mathrm{T}} - \beta^{(r)} = 0 \qquad (7)$$

The statistics accumulated from the data to obtain the optimal $W$ are

$$G_i^{(r)} = \sum_{m \in r} \frac{1}{\sigma_i^{(m)2}} \sum_{t=1}^{T} \gamma_t^{(m)} \begin{bmatrix} 1 \\ o_t \end{bmatrix} [1\; o_t^{\mathrm{T}}], \qquad k_i^{(r)} = \sum_{m \in r} \frac{\mu_i^{(m)}}{\sigma_i^{(m)2}} \sum_{t=1}^{T} \gamma_t^{(m)} [1\; o_t^{\mathrm{T}}] \qquad (8)$$

where $m$ is a component in regression class $r$, $\gamma_t^{(m)}$ is the posterior probability that frame $o_t$ is aligned with component $m$, which has mean $\mu^{(m)}$ and variance $\Sigma^{(m)}$. The total occupancy count for a regression class is

$$\beta^{(r)} = \sum_{m \in r} \sum_{t=1}^{T} \gamma_t^{(m)} \qquad (9)$$

Linear transforms can also be estimated in a predictive rather than an adaptive framework, and have proven successful for efficient noise adaptation. The predictive transform parameters are obtained by minimising the KL divergence between a CMLLR adapted distribution and a target distribution. This is a powerful technique when the target distribution is complex, e.g. full covariance, and the PCMLLR transform provides a computationally efficient approximation.
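A minimal numpy sketch of the accumulation in equations (8)-(9) may make the notation concrete. It assumes diagonal model covariances and a single regression class, and takes the frame/component posteriors gamma as given (e.g. from a forward/backward alignment); it is an illustration, not the patent's implementation.

```python
import numpy as np

def accumulate_cmllr_stats(obs, gamma, means, variances):
    """obs: (T, D) frames; gamma: (T, M) posteriors; means/variances: (M, D).
    Returns G with shape (D, D+1, D+1), k with shape (D, D+1) and the count beta."""
    T, D = obs.shape
    zeta = np.hstack([np.ones((T, 1)), obs])                # extended observations [1 o_t]
    G = np.zeros((D, D + 1, D + 1))
    k = np.zeros((D, D + 1))
    for m in range(gamma.shape[1]):
        gm = gamma[:, m]
        outer = (zeta * gm[:, None]).T @ zeta               # sum_t gamma_t [1 o_t][1 o_t]^T
        first = gm @ zeta                                    # sum_t gamma_t [1 o_t]
        for i in range(D):
            G[i] += outer / variances[m, i]                  # equation (8), left
            k[i] += means[m, i] * first / variances[m, i]    # equation (8), right
    beta = gamma.sum()                                       # equation (9)
    return G, k, beta
```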
In F. Flego and M.J.F. Gales, "Incremental Predictive and Adaptive Noise Compensation," in ICASSP, 2009, the benefits of adaptive and predictive frameworks for noise robustness are combined by using the predictive scheme (PCMLLR) as a prior for an adaptive scheme (CMLLR). A similar framework can be used here, where the prior information is contained in a transform $W_{\mathrm{pr}} = [b_{\mathrm{pr}}\; A_{\mathrm{pr}}]$. The target distribution for the adapted model is:

$$p(o_t \mid m) = |A_{\mathrm{pr}}|\, \mathcal{N}\big(A_{\mathrm{pr}} o_t + b_{\mathrm{pr}};\, \mu^{(m)}, \Sigma^{(m)}\big) \qquad (10)$$

The statistics that are used to estimate the prior transforms are:

$$G_{\mathrm{pr},i}^{(r)} = \sum_{m \in r} \frac{\gamma^{(m)}}{\sigma_i^{(m)2}} \begin{bmatrix} 1 & \mathcal{E}\{o^{\mathrm{T}} \mid m\} \\ \mathcal{E}\{o \mid m\} & \mathcal{E}\{o o^{\mathrm{T}} \mid m\} \end{bmatrix}, \qquad k_{\mathrm{pr},i}^{(r)} = \sum_{m \in r} \frac{\gamma^{(m)} \mu_i^{(m)}}{\sigma_i^{(m)2}} [1\; \mathcal{E}\{o^{\mathrm{T}} \mid m\}] \qquad (11)$$

where the occupancy counts $\gamma^{(m)}$ are estimated from the training data. If the extended mean vector is

$$\xi^{(m)} = [1\; \mu^{(m)\mathrm{T}}]^{\mathrm{T}} \qquad (12)$$

and the extended covariance matrix is

$$\hat{\Sigma}^{(m)} = \begin{bmatrix} 0 & 0 \\ 0 & \Sigma^{(m)} \end{bmatrix} \qquad (13)$$

then the prior statistics can be calculated using

$$\mathcal{E}\{o^{\mathrm{T}} \mid m\} = \big(A_{\mathrm{pr}}^{-1} (\mu^{(m)} - b_{\mathrm{pr}})\big)^{\mathrm{T}} \qquad (14)$$

$$\mathcal{E}\{o o^{\mathrm{T}} \mid m\} = A_{\mathrm{pr}}^{-1} \big(\Sigma^{(m)} + (\mu^{(m)} - b_{\mathrm{pr}})(\mu^{(m)} - b_{\mathrm{pr}})^{\mathrm{T}}\big) A_{\mathrm{pr}}^{-\mathrm{T}} \qquad (15)$$

The prior and adaptive statistics are then combined in a count-smoothing fashion, so the prior transform $W_{\mathrm{pr}}$ acts as a prior for estimating a new smoothed adaptive transform:

$$\bar{G}_i^{(r)} = G_i^{(r)} + \frac{\tau}{\sum_{m \in r} \gamma^{(m)}}\, G_{\mathrm{pr},i}^{(r)} \qquad (16)$$

$$\bar{k}_i^{(r)} = k_i^{(r)} + \frac{\tau}{\sum_{m \in r} \gamma^{(m)}}\, k_{\mathrm{pr},i}^{(r)} \qquad (17)$$

The prior statistics are normalised so that they effectively contribute $\tau$ frames to the final statistics. With this normalisation the total occupancy count becomes:

$$\bar{\beta}^{(r)} = \sum_{m \in r} \sum_{t=1}^{T} \gamma_t^{(m)} + \tau \qquad (18)$$
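The prior statistics and the count-smoothing above can be sketched in the same style. This is a minimal illustration under the same assumptions (diagonal covariances, one regression class); A_pr, b_pr, the training occupancies and the weight tau are placeholders supplied by the caller.

```python
import numpy as np

def prior_stats(A_pr, b_pr, means, variances, occupancies):
    """G_pr (D, D+1, D+1) and k_pr (D, D+1) from model parameters, eqs (11), (14)-(15)."""
    D = means.shape[1]
    A_inv = np.linalg.inv(A_pr)
    G_pr = np.zeros((D, D + 1, D + 1))
    k_pr = np.zeros((D, D + 1))
    for m in range(means.shape[0]):
        mu, var, occ = means[m], variances[m], occupancies[m]
        e_o = A_inv @ (mu - b_pr)                                                  # eq (14)
        e_oo = A_inv @ (np.diag(var) + np.outer(mu - b_pr, mu - b_pr)) @ A_inv.T   # eq (15)
        block = np.empty((D + 1, D + 1))
        block[0, 0], block[0, 1:], block[1:, 0], block[1:, 1:] = 1.0, e_o, e_o, e_oo
        ext = np.concatenate([[1.0], e_o])
        for i in range(D):
            G_pr[i] += occ * block / var[i]
            k_pr[i] += occ * mu[i] * ext / var[i]
    return G_pr, k_pr

def smooth_stats(G, k, G_pr, k_pr, occupancies, tau):
    """Count-smoothing of equations (16)-(17): the prior contributes tau 'frames'."""
    scale = tau / occupancies.sum()
    return G + scale * G_pr, k + scale * k_pr
```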
As more data becomes available, the adaptive CMLLR statistics will dominate, but for small amounts of data the prior statistics are more important.
The $i$th row of the smoothed adaptive transform $\bar{W}$ can be estimated iteratively using

$$\bar{w}_i = (\alpha p_i + \bar{k}_i^{(r)})\, \bar{G}_i^{(r)-1} \qquad (19)$$

where $p_i$ is the extended cofactor row vector of $\bar{A}$ and $\alpha$ satisfies the quadratic equation

$$\alpha^2\, p_i \bar{G}_i^{(r)-1} p_i^{\mathrm{T}} + \alpha\, p_i \bar{G}_i^{(r)-1} \bar{k}_i^{(r)\mathrm{T}} - \bar{\beta}^{(r)} = 0 \qquad (20)$$

with $\bar{W} = [\bar{b}\; \bar{A}]$ and $\hat{o}_t = \bar{A} o_t + \bar{b}$. The value of the occupancy count $\bar{\beta}^{(r)}$ is given in equation (18) above.
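The row-by-row update of equations (19)-(20) (and equally (6)-(7)) can be sketched as follows, under the same assumptions as the earlier sketches. The root of the quadratic is chosen by evaluating the row-dependent part of the auxiliary function; numerical safeguards are omitted.

```python
import numpy as np

def estimate_transform(G_bar, k_bar, beta_bar, n_iters=20):
    """Return W = [b A] (D x (D+1)) estimated from (smoothed) statistics."""
    D = k_bar.shape[0]
    W = np.hstack([np.zeros((D, 1)), np.eye(D)])            # start from an identity transform
    for _ in range(n_iters):
        for i in range(D):
            A = W[:, 1:]
            cof = np.linalg.inv(A).T * np.linalg.det(A)     # matrix of cofactors of A
            p = np.concatenate([[0.0], cof[i]])             # extended cofactor row vector
            G_inv = np.linalg.inv(G_bar[i])
            a = p @ G_inv @ p
            b = p @ G_inv @ k_bar[i]
            disc = np.sqrt(b * b + 4.0 * a * beta_bar)      # roots of alpha^2 a + alpha b - beta = 0
            best_w, best_q = None, -np.inf
            for alpha in ((-b + disc) / (2.0 * a), (-b - disc) / (2.0 * a)):
                w = (alpha * p + k_bar[i]) @ G_inv          # equation (19)
                q = beta_bar * np.log(abs(w @ p)) - 0.5 * w @ G_bar[i] @ w + w @ k_bar[i]
                if q > best_q:
                    best_w, best_q = w, q
            W[i] = best_w
    return W
```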
For efficiency, the elements of the prior statistics in equations (14) and (15) which depend only on the model set can be accumulated offline.
In this embodiment the prior and the smoothed adaptive transforms use the same regression tree. In another embodiment, different regression trees are used for the prior and smoothed adaptive transform.
The above framework provides significant advantages when the prior transform is also used as a parent transform. In speech recognition and speech synthesis, cascading transforms are often used to adapt between different environments. This framework can be used to estimate a child transform in such a cascade.
With a parent transform for regression class $r_p$, $W_{\mathrm{pa}}^{(r_p)} = [b_{\mathrm{pa}}^{(r_p)}\; A_{\mathrm{pa}}^{(r_p)}]$, and a child transform for regression class $r_c$, $W_{\mathrm{ch}}^{(r_c)} = [b_{\mathrm{ch}}^{(r_c)}\; A_{\mathrm{ch}}^{(r_c)}]$, the mean parameters of the HMM under these cascading transforms can be represented as:

$$\hat{\mu}^{(m)} = A_{\mathrm{ch}}^{(r_c)} \big(A_{\mathrm{pa}}^{(r_p)} \mu^{(m)} + b_{\mathrm{pa}}^{(r_p)}\big) + b_{\mathrm{ch}}^{(r_c)} \qquad (21)$$

and a parent transform $A_{\mathrm{pa}}^{(r_p)}$ and child transform $A_{\mathrm{ch}}^{(r_c)}$ applied to the covariance parameters is represented by

$$\hat{\Sigma}^{(m)} = A_{\mathrm{ch}}^{(r_c)} A_{\mathrm{pa}}^{(r_p)} \Sigma^{(m)} A_{\mathrm{pa}}^{(r_p)\mathrm{T}} A_{\mathrm{ch}}^{(r_c)\mathrm{T}} \qquad (22)$$

Equivalently, in another embodiment, if the same transform is applied to both the means and the covariances, a computationally efficient implementation is to transform the speech feature vectors:

$$\hat{o}_t = A_{\mathrm{ch}}^{(r_c)} \big(A_{\mathrm{pa}}^{(r_p)} o_t + b_{\mathrm{pa}}^{(r_p)}\big) + b_{\mathrm{ch}}^{(r_c)} \qquad (23)$$

where $\hat{o}_t$ is the observation vector after the transforms have been applied and $o_t$ is the untransformed observation vector.
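Equation (23) is simply two affine maps applied in sequence. The small sketch below applies an (arbitrary, placeholder) parent transform followed by a child transform to one feature vector.

```python
import numpy as np

def apply_cascade(o, A_pa, b_pa, A_ch, b_ch):
    """Parent first, then child: o_hat = A_ch (A_pa o + b_pa) + b_ch, as in eq (23)."""
    return A_ch @ (A_pa @ o + b_pa) + b_ch

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    D = 39
    o = rng.normal(size=D)
    A_pa, b_pa = 1.1 * np.eye(D), 0.01 * rng.normal(size=D)   # e.g. a VTLN-style parent
    A_ch, b_ch = np.eye(D), np.zeros(D)                        # identity child
    print(apply_cascade(o, A_pa, b_pa, A_ch, b_ch)[:5])
```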
In the general case the transforms are applied to the acoustic model. Equations (21) and (22) above give the transform of the means and covariances for the general case. It should be noted that this principle also applies for non parent-child transforms, i.e. in the general case adaptive transforms are applied to the model means and/or covariances, and in a particular case transforms are applied to the speech vectors directly.
In such a parent and child arrangement, one transform may account for noise and the other for speaker adaptation. Alternatively, both transforms may model different speaker attributes or noise attributes. In a further variation, a plurality of transforms are successively applied.
It is computationally very efficient for a parent transform to also be used as a prior in the above framework.
Suppose that the prior information $W_{\mathrm{pr}}$ is already known or has been estimated, and is to be used as a parent transform in a cascade of transforms. The adaptive statistics to estimate the child transform $W^{(r_c)}$, where

$$W^{(r_c)} = [b^{(r_c)}\; A^{(r_c)}] \qquad (24)$$

are accumulated in the domain of the parent transform. If $\hat{o}_t$ is the original observation transformed by the parent transform,

$$\hat{o}_t = A_{\mathrm{pa}} o_t + b_{\mathrm{pa}} \qquad (25)$$

with

$$W_{\mathrm{pa}} = [b_{\mathrm{pa}}\; A_{\mathrm{pa}}] \qquad (26)$$

then

$$G_i^{(r_c)} = \sum_{m \in r_c} \frac{1}{\sigma_i^{(m)2}} \sum_{t=1}^{T} \gamma_t^{(m)} \begin{bmatrix} 1 \\ \hat{o}_t \end{bmatrix} [1\; \hat{o}_t^{\mathrm{T}}], \qquad k_i^{(r_c)} = \sum_{m \in r_c} \frac{\mu_i^{(m)}}{\sigma_i^{(m)2}} \sum_{t=1}^{T} \gamma_t^{(m)} [1\; \hat{o}_t^{\mathrm{T}}] \qquad (27)$$

Prior statistics to estimate the child must also be in the domain of the parent:

$$G_{\mathrm{pr},i}^{(r_c)} = \sum_{m \in r_c} \frac{\gamma^{(m)}}{\sigma_i^{(m)2}} \begin{bmatrix} 1 & \mathcal{E}\{\hat{o}^{\mathrm{T}} \mid m\} \\ \mathcal{E}\{\hat{o} \mid m\} & \mathcal{E}\{\hat{o}\hat{o}^{\mathrm{T}} \mid m\} \end{bmatrix}, \qquad k_{\mathrm{pr},i}^{(r_c)} = \sum_{m \in r_c} \frac{\gamma^{(m)} \mu_i^{(m)}}{\sigma_i^{(m)2}} [1\; \mathcal{E}\{\hat{o}^{\mathrm{T}} \mid m\}] \qquad (28)$$

i.e. the prior statistics must also be transformed by $W_{\mathrm{pr}}$ to be in the domain of the parent. If the parent is equal to the prior, i.e. $W_{\mathrm{pa}} = W_{\mathrm{pr}}$, this cancels out the pre-existing transformation by $W_{\mathrm{pr}}$ and becomes equivalent to using an identity matrix prior in the space defined by the parent, i.e.

$$\mathcal{E}\{\hat{o}^{\mathrm{T}} \mid m\} = \mu^{(m)\mathrm{T}} \qquad (29)$$

$$\mathcal{E}\{\hat{o}\hat{o}^{\mathrm{T}} \mid m\} = \Sigma^{(m)} + \mu^{(m)} \mu^{(m)\mathrm{T}} \qquad (30)$$

A child transform $W^{(r_c)}$ is estimated from the smoothed statistics

$$\bar{G}_{\mathrm{pa},i}^{(r_c)} = G_i^{(r_c)} + \frac{\tau}{\sum_{m \in r_c} \gamma^{(m)}}\, G_{\mathrm{pr},i}^{(r_c)} \qquad (31)$$

$$\bar{k}_{\mathrm{pa},i}^{(r_c)} = k_i^{(r_c)} + \frac{\tau}{\sum_{m \in r_c} \gamma^{(m)}}\, k_{\mathrm{pr},i}^{(r_c)} \qquad (32)$$

The $i$th row of the transform can be estimated iteratively using:

$$w_i^{(r_c)} = (\alpha p_i + \bar{k}_{\mathrm{pa},i}^{(r_c)})\, \bar{G}_{\mathrm{pa},i}^{(r_c)-1} \qquad (33)$$

where $p_i$ is the extended cofactor row vector of $A^{(r_c)}$ and $\alpha$ satisfies the quadratic equation

$$\alpha^2\, p_i \bar{G}_{\mathrm{pa},i}^{(r_c)-1} p_i^{\mathrm{T}} + \alpha\, p_i \bar{G}_{\mathrm{pa},i}^{(r_c)-1} \bar{k}_{\mathrm{pa},i}^{(r_c)\mathrm{T}} - \bar{\beta}^{(r_c)} = 0 \qquad (34)$$

and

$$\bar{\beta}^{(r_c)} = \sum_{m \in r_c} \sum_{t=1}^{T} \gamma_t^{(m)} \qquad (35)$$

Both parent and child must then be used in decoding:

$$p(o_t \mid m) = |A^{(r_c)}|\, |A_{\mathrm{pa}}|\, \mathcal{N}\big(A^{(r_c)}(A_{\mathrm{pa}} o_t + b_{\mathrm{pa}}) + b^{(r_c)};\, \mu^{(m)}, \Sigma^{(m)}\big) \qquad (36)$$

This is a convenient framework for using a parent transform as a prior, as there is no longer any need to transform the prior statistics by the prior transform before using the count smoothing framework. The prior statistics can be fully cached offline. The structure of the parent and child can differ, and any form of linear or non-linear transform can easily be used as a parent and prior, yet adaptive statistics only need to be accumulated at the regression class level for the child. For example, a 64 regression class block diagonal PCMLLR transform could be used as a prior for estimating a 2 regression class diagonal CMLLR transform, and statistics only need be accumulated for the 2 regression classes of the child, not the 64 classes of the parent.
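To make the offline-caching point concrete, here is a minimal sketch of the identity-prior statistics of equations (28)-(30): they depend only on the model means, variances and training occupancies, so they can be computed once and stored. Diagonal covariances and a single child regression class are assumed; the result feeds the smoothing of equations (31)-(32) exactly as before.

```python
import numpy as np

def identity_prior_stats(means, variances, occupancies):
    """G_pr, k_pr built from the model alone (equations (28)-(30))."""
    M, D = means.shape
    G_pr = np.zeros((D, D + 1, D + 1))
    k_pr = np.zeros((D, D + 1))
    for m in range(M):
        mu, var, occ = means[m], variances[m], occupancies[m]
        e_oo = np.diag(var) + np.outer(mu, mu)                  # equation (30)
        block = np.empty((D + 1, D + 1))
        block[0, 0], block[0, 1:], block[1:, 0], block[1:, 1:] = 1.0, mu, mu, e_oo  # eq (29)
        ext = np.concatenate([[1.0], mu])
        for i in range(D):
            G_pr[i] += occ * block / var[i]
            k_pr[i] += occ * mu[i] * ext / var[i]
    return G_pr, k_pr
```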
To illustrate the differences between the present invention and the prior art, figure 5 shows a process for obtaining a VTLN transform in accordance with a known method.
First, speech is input in step S101. As this method is used for adapting an acoustic model, the speech input in step S101 corresponds to a known or hypothesised text. In such systems a transcription of the text is not normally available; this is the unsupervised mode.
Instead, the text will have usually been estimated from a previous recognition pass with the model for which the transform is to be estimated, but could be estimated using another model. Alternatively in supervised mode the text is known.
In this embodiment, the forward/backward algorithm is used, which presumes a soft assignment of a frame to a state. However, other implementations of the Baum-Welch algorithm may also be used which also assume a soft assignment. In a further embodiment, the Viterbi algorithm is used, which assumes a hard assignment of frame to state.
From this step S103, the statistics $G$ and $k$ of equation (8) can be obtained:

$$G_i^{(r)} = \sum_{m \in r} \frac{1}{\sigma_i^{(m)2}} \sum_{t=1}^{T} \gamma_t^{(m)} \begin{bmatrix} 1 \\ o_t \end{bmatrix} [1\; o_t^{\mathrm{T}}], \qquad k_i^{(r)} = \sum_{m \in r} \frac{\mu_i^{(m)}}{\sigma_i^{(m)2}} \sum_{t=1}^{T} \gamma_t^{(m)} [1\; o_t^{\mathrm{T}}] \qquad (8)$$

Prior to performing decoding, a number of VTLN transforms were saved corresponding to different values of the warping factor $\alpha$.
Standard VTLN attempts to find a linear transform $W$ which maps the unwarped cepstral coefficients $o_t$ onto the corresponding warped values $\hat{o}_t$. VTLN applies a warping to the frequency axis to account for differences in speaker vocal tract length.
A piecewise linear warping can be used, which is expressed by the constant $\alpha$. Figure 6 is a plot showing how $\alpha$ can be expressed graphically.
G and k statistics are accumulated in step S103 for the adaptation data and are used to select the best $\alpha$ in step S105. In step S107, the transform corresponding to the best value of $\alpha$ is output.
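A sketch of this selection step, assuming the candidate VTLN transforms have already been pre-computed as linear transforms W_alpha = [b A] for a grid of warping factors: each candidate is scored with the CMLLR auxiliary function built from the accumulated G, k statistics and the occupancy count, and the highest-scoring transform is returned.

```python
import numpy as np

def auxiliary(W, G, k, beta):
    """Q(W) = beta log|det A| - 0.5 sum_i w_i G_i w_i^T + sum_i w_i k_i^T (constants dropped)."""
    A = W[:, 1:]
    q = beta * np.log(abs(np.linalg.det(A)))
    for i in range(W.shape[0]):
        q += -0.5 * W[i] @ G[i] @ W[i] + W[i] @ k[i]
    return q

def select_vtln(candidates, G, k, beta):
    """candidates: dict mapping warping factor alpha -> W_alpha of shape (D, D+1)."""
    scores = {alpha: auxiliary(W, G, k, beta) for alpha, W in candidates.items()}
    best_alpha = max(scores, key=scores.get)
    return best_alpha, candidates[best_alpha]
```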
Figure 7 is a flow diagram showing a known CMLLR/MLLR method. Input speech is input in step S111. The input speech corresponds to a known or hypothesised text. The text will have usually been obtained as for VTLN. In the same manner as above, the forward/backward algorithm is run in step S113. This outputs the statistics G and k which correspond to equation (8) as shown above.
From this, the transform is estimated in step S115. In step S117 the transform is output.
The method of figure 5 is essentially a method where a transform based on modelling some physical relationship (in figure 5, differences in the vocal tract length of different speakers) is used to adapt the speech from one speaker to a different speaker.
In figure 7, an adaptive method is used in which the speech from a new speaker is processed without any underlying assumption about how the physical differences between speakers arise; the differences in acoustic realisations between speakers are simply fitted directly. The transform parameters in step S115 are estimated to give a "best fit" match of the current acoustic model to the new speech. In step S117 the transform is output.
Figure 8 shows a speech recognition system capable of adaptation in accordance with an embodiment of the present invention using cascading transforms where the prior is also used as a parent.
The system adapts model parameters S307 based on adaptation data S301. The adaptation data S301 comprises data from an environment to which the system is to be adapted (for example, a particular speaker, noise environment etc). To estimate the transforms a hypothesis is required which allows the speech to be related to text. In one embodiment, an initial decoding pass is performed which provides such text.
Adaptive statistics are accumulated in step S305. In this preferred embodiment, the adaptive statistics are given, in the parent transform space, by equation (27):

$$G_i^{(r_c)} = \sum_{m \in r_c} \frac{1}{\sigma_i^{(m)2}} \sum_{t=1}^{T} \gamma_t^{(m)} \begin{bmatrix} 1 \\ \hat{o}_t \end{bmatrix} [1\; \hat{o}_t^{\mathrm{T}}], \qquad k_i^{(r_c)} = \sum_{m \in r_c} \frac{\mu_i^{(m)}}{\sigma_i^{(m)2}} \sum_{t=1}^{T} \gamma_t^{(m)} [1\; \hat{o}_t^{\mathrm{T}}] \qquad (27)$$

The observation vector $\hat{o}_t$ is derived from the adaptation data and the parent transform.
The $\gamma_t^{(m)}$ parameters are obtained from an alignment of observations of the adaptation data to the model states in model S307. The $\mu^{(m)}$ and $\sigma^{(m)}$ parameters are derived from the parameters of model S307.
A prior transform S304 is obtained in step S303. The prior transform S304 will be used to compute the adaptive statistics.
In this embodiment, the prior transform is linear or non-linear, for example VTLN.
However, other transforms either linear or non-linear may be used such as MLLR, CMLLR, VTLN, JUD, VTS. The transform may be selected from a plurality of stored transforms. For example, a plurality of transforms may be stored dependent on the noise environment, e.g. in car etc. In further embodiments, the transforms may be estimated from the adaptation data without selecting from a plurality of pre-set transforms.
Where adaptation data is required to estimate the prior transform, the prior transform needs to be estimated on-line for example from an initial pass of the method of figure 8, i.e. first pass to estimate parent, then a second to estimate child. However, when the prior transform is pre-set, the transform can be calculated off-line. When the prior transform is selected from a plurality of transforms, the plurality of transforms can be calculated off-line and stored.
In step S309 prior statistics are then obtained as below; these prior statistics are in the parent transform space (equation (28)):

$$G_{\mathrm{pr},i}^{(r_c)} = \sum_{m \in r_c} \frac{\gamma^{(m)}}{\sigma_i^{(m)2}} \begin{bmatrix} 1 & \mathcal{E}\{\hat{o}^{\mathrm{T}} \mid m\} \\ \mathcal{E}\{\hat{o} \mid m\} & \mathcal{E}\{\hat{o}\hat{o}^{\mathrm{T}} \mid m\} \end{bmatrix}, \qquad k_{\mathrm{pr},i}^{(r_c)} = \sum_{m \in r_c} \frac{\gamma^{(m)} \mu_i^{(m)}}{\sigma_i^{(m)2}} [1\; \mathcal{E}\{\hat{o}^{\mathrm{T}} \mid m\}] \qquad (28)$$

where

$$\mathcal{E}\{\hat{o}^{\mathrm{T}} \mid m\} = \mu^{(m)\mathrm{T}} \qquad (29)$$

$$\mathcal{E}\{\hat{o}\hat{o}^{\mathrm{T}} \mid m\} = \Sigma^{(m)} + \mu^{(m)} \mu^{(m)\mathrm{T}} \qquad (30)$$

The prior statistics need to be transformed to be in the domain of the parent. However, in the case where the parent transform is the same as the prior transform, they cancel with the prior transformation and the prior statistics are calculated in step S309 on the basis of an identity transform S308. Thus, the prior statistics can be calculated off-line since they are purely dependent on the model parameters S307.
In step S311, the adaptive statistics are smoothed with the prior statistics. This uses the prior statistics obtained in step S309 and the adaptive statistics accumulated in step S305 using equations (31) and (32):

$$\bar{G}_{\mathrm{pa},i}^{(r_c)} = G_i^{(r_c)} + \frac{\tau}{\sum_{m \in r_c} \gamma^{(m)}}\, G_{\mathrm{pr},i}^{(r_c)} \qquad (31)$$

$$\bar{k}_{\mathrm{pa},i}^{(r_c)} = k_i^{(r_c)} + \frac{\tau}{\sum_{m \in r_c} \gamma^{(m)}}\, k_{\mathrm{pr},i}^{(r_c)} \qquad (32)$$

Here $\tau$ is a weighting factor. The magnitude of the weighting factor is determined by trial and error; in a preferred embodiment, it is selected based on decoding accuracy. The value of $\tau$ can be determined off-line. The data which is used to select $\tau$ should be independent of the adaptation data used on-line.
From these statistics, a smoothed adaptive transform is computed in step S313 using equation (33):

$$w_i^{(r_c)} = (\alpha p_i + \bar{k}_{\mathrm{pa},i}^{(r_c)})\, \bar{G}_{\mathrm{pa},i}^{(r_c)-1} \qquad (33)$$

where $p_i$ is the extended cofactor row vector of $A^{(r_c)}$ and $\alpha$ satisfies the quadratic equation

$$\alpha^2\, p_i \bar{G}_{\mathrm{pa},i}^{(r_c)-1} p_i^{\mathrm{T}} + \alpha\, p_i \bar{G}_{\mathrm{pa},i}^{(r_c)-1} \bar{k}_{\mathrm{pa},i}^{(r_c)\mathrm{T}} - \bar{\beta}^{(r_c)} = 0 \qquad (34)$$

and

$$\bar{\beta}^{(r_c)} = \sum_{m \in r_c} \sum_{t=1}^{T} \gamma_t^{(m)} \qquad (35)$$

This smoothed adaptive transform is then fed into the decoder S317 as the child transform.
In decoder S317, speech recognition is performed using an acoustic model or features which have been adapted using the transform output in step S315 as a child transform and the prior transform of S304 as a parent transform.
Thus, the acoustic model is transformed by both parent (S304) and child (S315) transforms. The parent transform is applied before the child.
Referring back to figure 2, the above is equivalent to replacing the acoustic model in 15 after each adaptation is completed. In another embodiment the cascaded adaptive transforms can be applied to the speech feature vectors rather than directly adapting the model. In this embodiment the speech features output by 11 in Figure 2 are adapted prior to input to the speech decoder 13. The acoustic model 15 remains unchanged across utterances.
This has significant advantages as it is possible to retain the full power of the prior transform right up to the decoding process. For example, if a powerful transform such as a 64 regression class block diagonal PCMLLR transform is used as a prior to a 2 regression class full CMLLR transform, the full power of the prior transform is retained as it is used directly during the recognition stage S317. The end result will effectively be 64 different full transforms being applied to the data. After being estimated, any cascade of transforms may be collapsed to an equivalent single transform. The regression tree for the collapsed transform is the intersection of the regression trees of the constituent transforms in the cascade.
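Collapsing a cascade is plain composition of affine maps: composing (A2, b2) after (A1, b1) gives (A2 A1, A2 b1 + b2). The sketch below composes a list of transforms (arbitrary matrices here) and checks the single collapsed transform against applying them one after the other.

```python
import numpy as np

def collapse(transforms):
    """transforms: list of (A, b) pairs in application order (first applied first)."""
    A_tot, b_tot = transforms[0]
    for A, b in transforms[1:]:
        A_tot, b_tot = A @ A_tot, A @ b_tot + b
    return A_tot, b_tot

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    D = 4
    parent = (rng.normal(size=(D, D)), rng.normal(size=D))
    child = (rng.normal(size=(D, D)), rng.normal(size=D))
    A, b = collapse([parent, child])
    o = rng.normal(size=D)
    direct = child[0] @ (parent[0] @ o + parent[1]) + child[1]
    print(np.allclose(A @ o + b, direct))     # True: the collapsed transform is equivalent
```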
The above framework can also be used when there are more than two cascading transforms. In this situation, the prior transform would be a cascade of all of the transforms except the last applied child transform. For example, ignoring any bias terms, for four cascading transforms applied to the observation vectors:

$$\hat{o}_t = T\, U\, V\, W\, o_t$$

where $T$, $U$, $V$ and $W$ are transforms, $UVW$ will be applied as the parent and prior transform with $T$ as the child transform. Multiple cascading transforms could also be applied directly to the model parameters.
Figure 9 is a flow diagram showing a speech recognition method in accordance with an embodiment of the present invention. Speech is input in step S601. In most embodiments, the input speech will not correspond to a transcription. However, a hypothesis will be used to estimate the text which corresponds to the speech. Typically, this will be done by performing a first pass of the decoder to obtain a first estimate.
Possible operations will be described after step S611.
The forward/backward algorithm is then run in step S603 and, from the input data, the prior transform (which will also act as the parent) is determined. The prior transform may be set before recognition starts. The adaptive statistics G and k are accumulated.
Transforms are calculated in step S605 using the above calculated adaptive statistics and prior statistics. The prior statistics have been calculated off-line.
The acoustic model parameters of the speech recognition system are then directly transformed or are indirectly modified using CMLLR or MLLR transforms in step S607 using both the child transform of step S605 and the prior/parent transform.
The acoustic model is then run for the new speaker using the modified parameters in step S609 and the identified words are output in step S611.
In the above embodiment, a first pass decoding run is performed to obtain an initial transcription. Any model can be used for the first pass decoding, but it is often the baseline (speaker independent) model. With an initial transcription and corresponding data, a transform can be estimated for adaptation. This transform can then either be used to redecode the current data (i.e. that used to estimate the transform) to improve the initial hypothesis, or it can be used on other test data.
In use, the user is not likely to be aware that adaptation was ongoing. The user would only perceive recognition as taking place.
The transforms may be continually re-estimated or there may be an adaptation training phase where new transforms are estimated.
In general, there will be some continual re-estimation of transforms, which is especially useful when there are multiple users, noise conditions etc. Three different scenarios which might take place follow:

Example 1 - incremental mode
a. user 1 makes a request "play let it be"
b. system incorrectly recognises and outputs "play you and me" (user 1 unhappy)
c. system then obtains transform using hypothesis "play you and me"
d. user 1 makes a second request "play the beatles"
e. system uses transform to decode and output hypothesis "play the beatles" (user 1 happy)
f. system uses second utterance to improve transform, etc.

Example 2 - similar but redecoding
a. user 1 makes a request "play let it be"
b. system incorrectly recognises "play you and me"
c. system then obtains transform using hypothesis "play you and me"
d. system uses transform to redecode original data and outputs new hypothesis "play let it be" (user 1 happy)
e. user 1 makes a request "play the zutons"
f. system uses first transform to decode "play the zutons"
g. system uses second utterance to improve first transform
h. system redecodes second utterance using improved transform, etc.

Example 3 - redecoding but multiple users
a. user 1 makes a request "play let it be"
b. system incorrectly recognises "play you and me"
c. system then obtains transform using hypothesis "play you and me"
d. system uses transform to redecode original data and outputs new hypothesis "play let it be" (user 1 happy)
e. user 2 makes a request "play the zutons"
f. system uses original model to decode "play the beatles"
g. system estimates user 2 transform using "play the beatles"
h. system redecodes second utterance using the user 2 transform to get new hypothesis "play the zutons" (user 2 also happy)

In the above, the system may also receive user feedback, such that the system does not use a hypothesis which the user has indicated is incorrect to re-estimate transforms.
Figure 10 is a simplified flow diagram showing the steps taken when estimating transforms in accordance with an embodiment of the present invention.
Speech is input in step S701. The forward/backward algorithm is run in step S703 and a prior transform is obtained in step S705.
Adaptive statistics are generated in step S707 using the prior transform.
Prior statistics are then retrieved in step S708. As the prior transform is the same as the parent transform, the actions of the two transforms cancel one another, and thus the prior statistics are based on a unity transform and can be calculated off-line and retrieved for use on-line.
In step S709 smoothed adaptive statistics are computed and the child transform is estimated from these statistics as previously explained.
To demonstrate the effectiveness of methods in accordance with the above embodiments, experiments were performed. The linear transform version of VTLN was used as a convenient form of prior information which is expected to yield more information than an identity transform.
Gender-independent US English acoustic models were trained using a 39-dimensional MFCC feature vector, with static, delta and delta-delta parameters. A total of 312 hours of data from WSJ, TIDIGITS, TIMIT and internally collected noisy data was used for training triphone acoustic models. Decision tree clustering was used to yield 650 unique states. 12 Gaussian components were used per speech state and 24 Gaussian components per silence state, yielding approximately 8000 components. For adaptive training, transforms were estimated on a per-speaker basis, using a transform type consistent with decoding. Experiments were carried out on two tasks:
* Toshiba in-car task - a database recorded in real driving conditions with phone numbers, 4 digits, command and control, and city names subtasks. Each subtask includes two noisy conditions, engine on and highway, and there are a total of 8983 utterances spoken by native speakers with an average of 463 frames per utterance.
* Multi-accent task - a database recorded in studio conditions with additional noise. There are approximately 14k utterances split between telephone and TV control, spoken by users with a mixture of accents, with an average of 226 frames per utterance.
Unless stated otherwise, a separate transform is estimated for each test set utterance using two regression classes (speech and silence) to limit the number of parameters and allow for rapid adaptation. The baseline hypothesis was used for estimation of all transforms.
Experimental results are given in table 1. The first lines show results obtained for the baseline system without adaptation, and with standard VTLN and CMLLR. These results show that VTLN consistently yields small gains, e.g. on the multi-accent set the baseline error rate of 15.90% is improved to 15.44%. Diagonal CMLLR improves over VTLN on the in-car and multi-accent tests. Full CMLLR does not impact on performance, suggesting that one utterance does not give enough data to robustly estimate the parameters.
Table 1
Word error rates (%) on the in-car and multi-accent (Mlt-acc) test sets, for standard and adaptively trained models.

Transform | Form  | Parent | Prior    | In-car (standard) | Mlt-acc (standard) | In-car (adaptive) | Mlt-acc (adaptive)
Baseline  | -     | -      | -        | 2.38 | 15.90 | -    | -
VTLN      | Block | -      | -        | 2.33 | 15.44 | 2.17 | 15.11
CMLLR     | Diag  | -      | -        | 1.86 | 15.17 | 1.74 | 14.96
CMLLR     | Full  | -      | -        | 2.36 | 15.90 | 2.34 | 15.84
CMLLR     | Diag  | VTLN   | -        | 1.79 | 14.98 | 1.77 | 14.94
CMLLR     | Full  | VTLN   | -        | 2.35 | 15.90 | 2.35 | 15.90
CMLLR     | Full  | -      | Identity | 1.92 | 15.11 | 2.11 | 14.58
CMLLR     | Full  | -      | VTLN     | 1.87 | 14.82 | 1.63 | 13.54

Next, VTLN was used as a parent transform, although not as a prior, when estimating a CMLLR child transform to be used in a cascade during decoding. The results show that using VTLN as a parent transform for estimating a diagonal CMLLR transform can give performance gains. For example, on the Toshiba in-car set, the error rates are 2.33% and 1.86% for VTLN and diagonal CMLLR respectively, but 1.79% when cascading VTLN and CMLLR. However, when used as a parent to estimate a full CMLLR transform, very little difference in error rate is seen. The error rates for VTLN and full CMLLR on the Toshiba set are 2.33% and 2.36% respectively, and 2.35% when cascading the two transforms. These results suggest that incorporating prior knowledge as a parent transform with no prior does not improve the robustness of poorly estimated CMLLR transforms when using limited data.
Combining prior and adaptive statistics for more robust transform estimates.
Experiments were carried out using an identity matrix as prior (i.e. $W_{\mathrm{pr}} = I$ in equations 14 and 16) and also using VTLN as a prior with the method described. Here, a full CMLLR transform was trained for each utterance by combining the prior and adaptive statistics according to equations (16) and (17). Figure 11 plots word error rate against $\log \tau$; note that full CMLLR transforms with no prior are equivalent to $\tau = 0$. As can be seen, for large values of $\tau$ the resulting transform can give better performance than either the full CMLLR or VTLN transforms alone. For some values of $\tau$, the transforms from interpolated statistics give gains over the robust diagonal CMLLR transform. VTLN appears to be a relatively weak prior as it does not give large gains by itself, and only gives small improvements over an identity prior.
Results are given in table 1 for $\tau = 50000$ for the two test sets. On the in-car set, the identity prior yields an error rate of 1.92% and the VTLN prior gives 1.86%, which are relative gains of 20% and 22% respectively over the baseline model.
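As a quick arithmetic check of the quoted relative gains (reduction in word error rate divided by the baseline rate), using the figures from the text:

```python
baseline, identity_prior, vtln_prior = 2.38, 1.92, 1.86   # % WER on the in-car set
for name, wer in [("identity prior", identity_prior), ("VTLN prior", vtln_prior)]:
    print(name, f"{100.0 * (baseline - wer) / baseline:.0f}% relative")   # ~20% and ~22%
```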
Finally, adaptive training was performed using VTLN, CMLLR and CMLLR with a prior. When using VTLN as only a parent to estimate full CMLLR transforms, the VTLN adaptively trained model was used. As expected, gains are seen from adaptive training using VTLN and CMLLR. However, the use of the VTLN prior in adaptive training yields further improvement in word error rate. On the Toshiba in-car set, adaptive training using VTLN and diagonal CMLLR yielded results of 2.17% and 1.77% WER respectively, but a further improvement to 1.63% was seen using the proposed approach. This is an improvement of 46% relative over the baseline performance. The relative improvement on the multi-accent task is 15%.

Claims (20)

  1. A speech processing method, comprising: receiving a speech input in a first environment which comprises a sequence of observations; determining the likelihood of a sequence of words arising from the sequence of observations using an acoustic model and a language model, comprising: providing an acoustic model for performing speech recognition on an input signal which comprises a sequence of observations, wherein said model has been trained to recognise speech in a second environment, said model having a plurality of model parameters relating to the probability distribution of a word or part thereof being related to an observation; adapting the model trained in the second environment to that of the first environment; the speech recognition method further comprising determining the likelihood of a sequence of observations occurring in a given language using a language model; combining the likelihoods determined by the acoustic model and the language model and outputting a sequence of words identified from said speech input signal, wherein adapting the model trained in the second environment comprises: using a cascade of transforms successively applied to the observations or model parameters, where a parent transform is a transform which is applied earlier in said succession of transforms, and a child transform is a transform applied later in the succession of transforms than said parent transform, and wherein the child transform is estimated on the basis of a prior transform and said prior transform is also said parent transform.
  2. A method according to claim 1, wherein adapting the model to the new speaker comprises receiving speech from the new speaker corresponding to known text.
  3. 3. A method according to claim 1, wherein adapting the model to the new speaker comprises receiving speech from said new speaker and making a first estimate of the text corresponding to said speech.
  4. 4. A speech processing method according to claim 1, wherein said parent and child transforms are applied to said model parameters.
  5. 5. A speech processing method according to claim 1, wherein said parent and child transforms are applied to said observations.
  6. 6. A speech processing method according to claim 1, wherein the parent transform comprises a plurality of cascading transforms.
  7. 7. A speech processing method according to claim 1, wherein said child transform is determined by interpolating between prior and adaptive statistics.
  8. A speech processing method according to claim 7, wherein said adaptive statistics have the form:
$$G_i^{(r_c)} = \sum_{m \in r_c} \frac{1}{\sigma_i^{(m)2}} \sum_{t=1}^{T} \gamma_t^{(m)} \begin{bmatrix} 1 & \hat{o}_t^{T} \\ \hat{o}_t & \hat{o}_t \hat{o}_t^{T} \end{bmatrix}, \qquad k_i^{(r_c)} = \sum_{m \in r_c} \frac{\mu_i^{(m)}}{\sigma_i^{(m)2}} \sum_{t=1}^{T} \gamma_t^{(m)} \begin{bmatrix} 1 & \hat{o}_t^{T} \end{bmatrix} \qquad (27)$$
where $m$ is a component in regression class $r_c$, $\gamma_t^{(m)}$ is the posterior probability that frame $\hat{o}_t$ is aligned with model component $m$, which has mean $\mu^{(m)}$ and variance $\Sigma^{(m)}$, and $\mu_i^{(m)}$ and $\sigma_i^{(m)2}$ are the $i$-th elements of $\mu^{(m)}$ and $\Sigma^{(m)}$ respectively.
  9. A speech recognition method according to claim 7, wherein said prior statistics have the form:
$$G_{pr,i}^{(r_c)} = \sum_{m \in r_c} \gamma^{(m)} \frac{1}{\sigma_i^{(m)2}} \begin{bmatrix} 1 & E\{\hat{o}^{T} \mid m\} \\ E\{\hat{o} \mid m\} & E\{\hat{o}\hat{o}^{T} \mid m\} \end{bmatrix}, \qquad k_{pr,i}^{(r_c)} = \sum_{m \in r_c} \gamma^{(m)} \frac{\mu_i^{(m)}}{\sigma_i^{(m)2}} \begin{bmatrix} 1 & E\{\hat{o}^{T} \mid m\} \end{bmatrix} \qquad (28)$$
where
$$E\{\hat{o} \mid m\} = \hat{\mu}^{(m)} \qquad (29)$$
$$E\{\hat{o}\hat{o}^{T} \mid m\} = \hat{\Sigma}^{(m)} + \hat{\mu}^{(m)} \hat{\mu}^{(m)T} \qquad (30)$$
and $m$ is a component in regression class $r_c$ which has mean $\mu^{(m)}$ and variance $\Sigma^{(m)}$, $\gamma^{(m)}$ is the posterior probability that a frame $\hat{o}$ is aligned with model component $m$ and is obtained from the training set, and $\mu_i^{(m)}$ and $\sigma_i^{(m)2}$ are the $i$-th elements of $\mu^{(m)}$ and $\Sigma^{(m)}$ respectively.
  10. A speech processing method according to claim 1, wherein the prior statistics are calculated off-line.
  11. A speech processing method according to claim 1, wherein the parent and child transforms have different regression trees.
  12. A speech processing method according to claim 11, wherein the parent and child transforms have different regression classes.
  13. A speech processing method according to claim 1, wherein at least one of the parent transform or the child transform adapts for speaker variations.
  14. A speech processing method according to claim 1, wherein at least one of the parent transform or the child transform adapts for noise variations.
  15. A speech processing method according to claim 1, wherein said parent transform is linear.
  16. A speech processing method according to claim 1, wherein said parent transform is non-linear.
  17. A speech processing method according to claim 1, wherein the parent transform is selected from at least one transform which has been stored.
  18. A method of adapting an acoustic model for speech processing to the speech of a new speaker, the method comprising: receiving a speech input in a first environment which comprises a sequence of observations; providing an acoustic model for performing speech recognition on an input signal which comprises a sequence of observations, wherein said model has been trained to recognise speech in a second environment, said model having a plurality of model parameters relating to the probability distribution of a word or part thereof being related to an observation; using a cascade of transforms successively applied to the observations or model parameters, where a parent transform is a transform which is applied earlier in said succession of transforms, and a child transform is a transform applied later in the succession of transforms than said parent transform, and wherein the child transform is estimated on the basis of a prior transform and said prior transform is also said parent transform.
  19. A carrier medium carrying computer readable instructions for controlling the computer to carry out the method of any preceding claim.
  20. A speech processing apparatus, said apparatus comprising: a receiver for receiving a speech input in a first noise environment which comprises a sequence of observations; a processor configured to: determine the likelihood of a sequence of words arising from the sequence of observations using an acoustic model and a language model, comprising: provide an acoustic model for performing speech recognition on an input signal which comprises a sequence of observations, wherein said model has been trained to recognise speech in a second noise environment, said model having a plurality of model parameters relating to the probability distribution of a word or part thereof being related to an observation; adapt the model trained in the second environment to that of the first environment; determine the likelihood of a sequence of observations occurring in a given language using a language model; combine the likelihoods determined by the acoustic model and the language model; and an output configured to output a sequence of words identified from said speech input signal, wherein adapting the model trained in the second environment comprises: using a cascade of transforms successively applied to the observations or model parameters, where a parent transform is a transform which is applied earlier in said succession of transforms, and a child transform is a transform applied later in the succession of transforms than said parent transform, and wherein the child transform is estimated on the basis of a prior transform and said prior transform is also said parent transform.
GB201007525A 2010-05-05 2010-05-05 A speech processing system and method Expired - Fee Related GB2480085B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB201007525A GB2480085B (en) 2010-05-05 2010-05-05 A speech processing system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB201007525A GB2480085B (en) 2010-05-05 2010-05-05 A speech processing system and method

Publications (3)

Publication Number Publication Date
GB201007525D0 GB201007525D0 (en) 2010-06-23
GB2480085A true GB2480085A (en) 2011-11-09
GB2480085B GB2480085B (en) 2012-08-08

Family

ID=42314881

Family Applications (1)

Application Number Title Priority Date Filing Date
GB201007525A Expired - Fee Related GB2480085B (en) 2010-05-05 2010-05-05 A speech processing system and method

Country Status (1)

Country Link
GB (1) GB2480085B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013169232A1 (en) * 2012-05-08 2013-11-14 Nuance Communications, Inc. Differential acoustic model representation and linear transform-based adaptation for efficient user profile update techniques in automatic speech recognition

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4837831A (en) * 1986-10-15 1989-06-06 Dragon Systems, Inc. Method for creating and using multiple-word sound models in speech recognition
US4903305A (en) * 1986-05-12 1990-02-20 Dragon Systems, Inc. Method for representing word models for use in speech recognition
EP0881625A2 (en) * 1997-05-27 1998-12-02 AT&T Corp. Multiple models integration for multi-environment speech recognition
US20020091521A1 (en) * 2000-11-16 2002-07-11 International Business Machines Corporation Unsupervised incremental adaptation using maximum likelihood spectral transformation
US20050021335A1 (en) * 2003-06-30 2005-01-27 Ibm Corporation Method of modeling single-enrollment classes in verification and identification tasks
US20080010057A1 (en) * 2006-07-05 2008-01-10 General Motors Corporation Applying speech recognition adaptation in an automated speech recognition system of a telematics-equipped vehicle
US20090094022A1 (en) * 2007-10-03 2009-04-09 Kabushiki Kaisha Toshiba Apparatus for creating speaker model, and computer program product

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4903305A (en) * 1986-05-12 1990-02-20 Dragon Systems, Inc. Method for representing word models for use in speech recognition
US4837831A (en) * 1986-10-15 1989-06-06 Dragon Systems, Inc. Method for creating and using multiple-word sound models in speech recognition
EP0881625A2 (en) * 1997-05-27 1998-12-02 AT&T Corp. Multiple models integration for multi-environment speech recognition
US20020091521A1 (en) * 2000-11-16 2002-07-11 International Business Machines Corporation Unsupervised incremental adaptation using maximum likelihood spectral transformation
US20050021335A1 (en) * 2003-06-30 2005-01-27 Ibm Corporation Method of modeling single-enrollment classes in verification and identification tasks
US20080010057A1 (en) * 2006-07-05 2008-01-10 General Motors Corporation Applying speech recognition adaptation in an automated speech recognition system of a telematics-equipped vehicle
US20090094022A1 (en) * 2007-10-03 2009-04-09 Kabushiki Kaisha Toshiba Apparatus for creating speaker model, and computer program product

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
http://mi.eng.cam.ac.uk/~ff257/pdf/flego_gales_icassp09.pdf *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013169232A1 (en) * 2012-05-08 2013-11-14 Nuance Communications, Inc. Differential acoustic model representation and linear transform-based adaptation for efficient user profile update techniques in automatic speech recognition
US9406299B2 (en) 2012-05-08 2016-08-02 Nuance Communications, Inc. Differential acoustic model representation and linear transform-based adaptation for efficient user profile update techniques in automatic speech recognition

Also Published As

Publication number Publication date
GB201007525D0 (en) 2010-06-23
GB2480085B (en) 2012-08-08

Similar Documents

Publication Publication Date Title
US8620655B2 (en) Speech processing system and method
US7698133B2 (en) Noise reduction device
JP3457431B2 (en) Signal identification method
US8024184B2 (en) Speech recognition device, speech recognition method, computer-executable program for causing computer to execute recognition method, and storage medium
US8595006B2 (en) Speech recognition system and method using vector taylor series joint uncertainty decoding
JP5418223B2 (en) Speech classification device, speech classification method, and speech classification program
EP0886263A2 (en) Environmentally compensated speech processing
JP2005249816A (en) Device, method and program for signal enhancement, and device, method and program for speech recognition
JP5242782B2 (en) Speech recognition method
GB2478314A (en) Incorporating context dependency in an acoustic model for both speech recognition and synthesis
MX2007015446A (en) Multi-sensory speech enhancement using a speech-state model.
EP1457968B1 (en) Noise adaptation system of speech model, noise adaptation method, and noise adaptation program for speech recognition
Kim et al. Noisy constrained maximum-likelihood linear regression for noise-robust speech recognition
Breslin et al. Prior information for rapid speaker adaptation.
Buera et al. Cepstral vector normalization based on stereo data for robust speech recognition
CN102237086A (en) Compensation device and method for voice recognition equipment
GB2480084A (en) An adaptive speech processing system
Buera et al. Unsupervised data-driven feature vector normalization with acoustic model adaptation for robust speech recognition
GB2480085A (en) An adaptive speech recognition system and method using a cascade of transforms
Kim et al. Speech feature mapping based on switching linear dynamic system
JP4510517B2 (en) Acoustic model noise adaptation method and apparatus for implementing the method
Minami et al. Recognition method with parametric trajectory generated from mixture distribution HMMs
Aalburg et al. Single-and Two-Channel Noise Reduction for Robust Speech Recognition
JP3036706B2 (en) Voice recognition method
JP5885686B2 (en) Acoustic model adaptation apparatus, acoustic model adaptation method, and program

Legal Events

Date Code Title Description
PCNP Patent ceased through non-payment of renewal fee

Effective date: 20230505