GB2480084A - An adaptive speech processing system - Google Patents

An adaptive speech processing system

Info

Publication number
GB2480084A
GB2480084A (application GB201007524A)
Authority
GB
United Kingdom
Prior art keywords
model
statistics
transform
prior
speech
Prior art date
Legal status
Granted
Application number
GB201007524A
Other versions
GB201007524D0 (en)
GB2480084B (en)
Inventor
Catherine Breslin
Kean Kheong Chin
Mark John Francis Gales
Katherine Mary Knill
Haitian Xu
Current Assignee
Toshiba Europe Ltd
Original Assignee
Toshiba Research Europe Ltd
Priority date
Filing date
Publication date
Application filed by Toshiba Research Europe Ltd filed Critical Toshiba Research Europe Ltd
Priority to GB201007524A
Publication of GB201007524D0
Publication of GB2480084A
Application granted
Publication of GB2480084B
Status: Expired - Fee Related

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/065 Adaptation
    • G10L 15/08 Speech classification or search
    • G10L 15/18 Speech classification or search using natural language modelling
    • G10L 15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L 15/187 Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams
    • G10L 15/19 Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
    • G10L 15/197 Probabilistic grammars, e.g. word n-grams

Abstract

A speech recognition method, comprising receiving a speech input from a new speaker which comprises a sequence of observations and determining the likelihood of a sequence of words arising from the sequence of observations using an acoustic model and a language model, comprising providing an acoustic model for performing speech recognition on an input signal which comprises a sequence of observations, wherein said model has been trained to recognize speech from a different speaker or speakers. The model has a plurality of model parameters relating to the probability distribution of a word or part thereof being related to an observation and the model trained for a different speaker or speakers is adapted to the new speaker. The speech recognition method further comprises determining the likelihood of a sequence of observations occurring in a given language using a language model and combining the likelihoods determined by the acoustic model and the language model and outputting a sequence of words identified from said speech input signal. Adapting the model to the new speaker comprises calculating adaptive statistics, said adaptive statistics being generated by comparing the speech of the new speaker with that of the acoustic model trained for other speakers; determining prior statistics, said prior statistics derived from a prior transform which models the differences between speakers based on heuristic knowledge of the differences in acoustic realizations between speakers; and interpolating said adaptive statistics and selected prior statistics to produce smoothed statistics and using said smoothed statistics to estimate a new transform and applying said transform to said model.

Description

A Speech Processing System and Method
The present invention is concerned with the field of speech recognition. More specifically, the present invention is concerned with adaptive speech recognition systems and methods which can adapt between or to different speakers.
The best speech recognition performance is obtained when the acoustic model matches the environment in which it is being used, including speaker, noise, microphone etc. Since we cannot model every environment in the world in advance, this means the acoustic model that comes with a speech recogniser needs to be adapted to the current operating environment. For a task such as dictation on a PC the speaker can be asked to provide several minutes of training data for this adaptation before they use the system "in anger".
However, in most other applications where speech recognition is used, such as in-car phone dialling or navigation, voice commands on a mobile phone or MP3 player and speech recognition enabled interactive voice response systems, the speaker expects good performance from the first interaction with the system and cannot be expected to (at least consciously) supply adaptation data. As a user's first experience with a system affects how they feel about the system, it is very important to aim to achieve as good recognition accuracy as possible from the first utterance. Rapid adaptation using the speaker's utterances is therefore very important.
Known popular speaker adaptation methods include Maximum Likelihood Linear Regression (MLLR), Constrained Maximum Likelihood Linear Regression (CMLLR) and Vocal Tract Length Normalisation (VTLN).
MLLR and CMLLR estimate a set of transforms from recognition data by maximising the likelihood of a hypothesis on the adaptation data. MLLR transforms are applied to Hidden Markov Model (HMM) parameters such as means and/or covariances while CMLLR transforms are applied to the model or speech feature vectors to adapt the model means and covariances. If the hypothesis is not known, then a first decoding pass is normally performed to estimate the best hypothesis. Cascades of transforms can be estimated and combined during decoding by applying them in turn. Components of HMM state output distributions are normally clustered using a regression tree, so the same transform is applied to multiple components.
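As a rough illustration of the distinction (this is not code from the patent; the function and variable names are ours), the sketch below applies an MLLR-style transform to a model mean and a CMLLR-style transform to a feature vector:

```python
import numpy as np

def apply_mllr_to_mean(mu, A, b):
    """MLLR-style adaptation: transform a Gaussian mean with [A, b]."""
    return A @ mu + b

def apply_cmllr_to_feature(o_t, A, b):
    """CMLLR-style adaptation: transform the observation (feature) vector,
    leaving the model means and covariances untouched."""
    return A @ o_t + b

dim = 3
rng = np.random.default_rng(0)
A = np.eye(dim) + 0.05 * rng.standard_normal((dim, dim))   # toy transform matrix
b = 0.1 * rng.standard_normal(dim)                          # toy bias vector

print(apply_mllr_to_mean(rng.standard_normal(dim), A, b))
print(apply_cmllr_to_feature(rng.standard_normal(dim), A, b))
```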
CMLLR and MLLR perform better when more data is available, and with limited data it may not be possible to robustly estimate all the parameters. With only one utterance available for adaptation, the accuracy achieved by these methods is still some way from what can be achieved when many utterances are available for adaptation. Full transforms outperform diagonal or block-diagonal transforms, but need more data to be robustly estimated. Full or block diagonal MLLR transforms are more computationally expensive to use as they convert the generally diagonal model covariances to block-diagonal or full.
CAT-CMLLR is a variant of CMLLR where a number of transforms are estimated from the training set, and are interpolated during decoding. Then, only a small number of interpolation weights need to be estimated during decoding. CAT-CMLLR has a small number of parameters, and so it can perform well with limited data but its performance is not normally improved by using more data. Typically, only small improvements over the baseline are seen, hence its effect is limited.
CAT-CMLLR performs well in clean conditions, but does not work well when the test and training conditions are mismatched, such as when the two have different noise conditions. This is because the set of transforms estimated in training are mismatched to the test conditions, and hence can degrade performance.
VTLN warps the frequency axis to account for variations in vocal tract lengths between speakers. VTLN has a small number of parameters to estimate (typically just one warping factor for each regression class) and so it is useful for fast adaptation as the parameters can be robustly estimated from small amounts of data. VTLN can be implemented as a linear transform of the features and hence used in the same framework as CMLLR and CAT-CMLLR. VTLN transforms for multiple warping factors can be pre-computed, and then the same statistics accumulated for CMLLR can be used to select the best of these transforms for each utterance or speaker.
VTLN, as for CAT-CMLLR, has a small number of parameters, and so it can perform well with limited data but its performance is not normally improved by using more data.
Typically, only small improvements over the baseline are seen, hence its effect is limited. Recent work has implemented VTLN as a linear transform, in which case it can be seen as a CMLLR transform which has been estimated using physiological constraints. So, the effect of VTLN is often subsumed by CMLLR when the two are used in conjunction.
MAPLR directly uses a prior distribution for estimating model-space linear adaptation transforms. A prior distribution P(W) is estimated from a set of transforms derived from an HMM set and a training data set. SMAPLR and CSMAPLR are related in their use of a prior distribution, but first use a regression tree to cluster states or components.
For any particular node in the tree, the transform previously estimated at the parent node is used in a prior distribution for estimating the transform at the current node. Thus robust transform estimates can be obtained for clusters where there is limited data.
MAPLR requires defining a prior distribution over the transforms in order to estimate an MLLR transform. It has issues in terms of dynamic range of the two quantities being interpolated to give the updated statistics. The prior transform and adaptive statistics are not consistent and thus can lead to issues in tuning a suitable interpolation weight.
Recently, F. Flego and M.J.F. Gales, "Incremental Predictive and Adaptive Noise Compensation," in ICASSP, 2009 have proposed a method using a combination of predictive and adaptive statistics for noise compensation.
The present invention builds upon the work of Flego and Gales and at least partially addresses some of the problems of the above identified prior art and, in a first aspect provides a speech processing method, comprising: receiving a speech input from a new speaker which comprises a sequence of observations; determining the likelihood of a sequence of words arising from the sequence of observations using an acoustic model and a language model, comprising: providing an acoustic model for performing speech recognition on an input signal which comprises a sequence of observations, wherein said model has been trained to recognise speech from a different speaker or speakers, said model having a plurality of model parameters relating to the probability distribution of a word or part thereof being related to an observation; adapting the model trained for a different speaker or speakers to the new speaker; the speech recognition method further comprising determining the likelihood of a sequence of observations occurring in a given language using a language model; combining the likelihoods determined by the acoustic model and the language model and outputting a sequence of words identified from said speech input signal, wherein adapting the model to the new speaker comprises: calculating adaptive statistics, said adaptive statistics being generated by comparing the speech of the new speaker with that of the acoustic model trained for other speakers; determining prior statistics, said prior statistics derived from a prior transform which models the differences between speakers based on heuristic knowledge of the differences in acoustic realisations between speakers; and interpolating said adaptive statistics and selected prior statistics to produce smoothed statistics and using said smoothed statistics to estimate a new transform and applying said transform to said model.
In an embodiment, the invention uses prior information, in the form of linear transforms. The prior statistics are derived from the prior linear transform. These transformed prior statistics are interpolated with actual statistics from test data for transform estimation, e.g. MLLR or CMLLR. The prior information is a set of one or more quantised linear transforms. VTLN is one form of prior information that can be used. VTLN (vocal tract length normalisation) warps the frequency axis to account for differences in vocal tract length between speakers, and can be implemented as a set of linear transforms. Prior information could also come from the training data, or from other known constraints concerning channel conditions, microphone, speaker etc. In one embodiment, the prior transform is selected from a plurality of transforms used to model differences between speakers, the selection being made on the basis of adaptive statistics. However, the smoothed transform does not need to be limited to speaker adaptation and can also be used for noise adaptation in addition to speaker adaptation.
This new method can be adopted both for recognition and for adaptive training. When used for speech recognition, robustness to different speakers is improved when using limited amounts of data for adaptation by incorporating prior information in a count-smoothing framework.
Experiments indicate that the method of a preferred embodiment is able to outperform related methods in terms of recognition accuracy, with very little additional computational cost.
Using prior information allows full transforms to be estimated from limited adaptation data, when without prior information there are not enough frames to estimate a full transform. Very little additional computational power is needed as the appropriate prior statistics can be partially or fully computed offline and cached.
In an embodiment, the present invention is used in a system where there is a training mode for said new speaker and said new speaker reads known text. The speech data and the text which is read can then be used to estimate the transforms to adapt the acoustic model to the new speaker.
In a further embodiment, speech data is received for which text is not known. For example, if the system is used as part of a voice controlled sat nav system, an MP3 player, smart phone etc, there is generally no distinct adaptation training phase. In such systems, text corresponding to the input speech will be estimated on the basis of a hypothesis. For example, the text may first be estimated using the model without adaptation. The system then uses this hypothesis to estimate the transforms. The transforms may then be continually estimated and adapted as more speech is received.
The above two embodiments may be combined such that a system with an adaptation training phase also continually updates transforms after said adaptation training phase has been completed and when said system is in use.
In an embodiment, the prior transform is selected from an identity transform or fixed linear speaker transforms. An identity transform can be thought of to represent the case where the new speaker is initially assumed to be the same as the speakers used to initially train the acoustic model.
In a further embodiment, the prior transform is a transform selected from a plurality of pre-determined transforms on the basis of said adaptive statistics. The plurality of transforms may be determined on-line or stored off-line.
The prior statistics may be partially or fully cached off-line.
The prior transform may be a single linear transform, or be a non-linear transform consisting of a plurality of linear transforms. For example, a regression tree may be used to partition the acoustic space and one of a plurality of linear transforms applied separately to each partition.
The new transform may be applied directly to the model parameters or to the observation vectors.
In an embodiment, the adaptive statistics are prepared on data where there is a different noise environment to that used while training the acoustic model. The prior statistics are estimated on the basis of a speaker transform.
In a further embodiment said new transform is one of a cascade of transforms applied to said model. Preferably, said new transform is a child transform applied in a cascade of transforms and said prior transform is a parent transform in said cascade of transforms.
In a second aspect, the present invention provides a method of adapting an acoustic model for speech processing to the speech of a new speaker, the method comprising: receiving a speech input from a new speaker which comprises a sequence of observations; providing an acoustic model for performing speech recognition on an input signal which comprises a sequence of observations, wherein said model has been trained to recognise speech from a different speaker or speakers, said model having a plurality of model parameters relating to the probability distribution of a word or part thereof being related to an observation; calculating adaptive statistics, said adaptive statistics being generated by comparing the speech of the new speaker with that of the acoustic model trained for other speakers; determining prior statistics, said prior statistics derived from a prior transform which models the differences between speakers based on heuristic knowledge of the differences between speakers; and interpolating said adaptive statistics and selected prior statistics to adapt the model.
In a third aspect, the present invention provides a speech processing apparatus, said apparatus comprising: a receiver for receiving a speech input from a first speaker which comprises a sequence of observations; a processor configured to: determine the likelihood of a sequence of words arising from the sequence of observations using an acoustic model and a language model, comprising: provide an acoustic model for performing speech recognition on an input signal which comprises a sequence of observations, wherein said model has been trained to recognise speech from a different speaker or speakers, said model having a plurality of model parameters relating to the probability distribution of a word or part thereof being related to an observation; adapt the model trained for a different speaker or speakers to the new speaker; and determine the likelihood of a sequence of observations occurring in a given language using a language model; an output configured to output a sequence of words identified from said speech input signal, wherein adapting the model to the new speaker comprises: calculating adaptive statistics, said adaptive statistics being generated by comparing the speech of the new speaker with that of the acoustic model trained for other speakers; determining prior statistics, said prior statistics derived from a prior transform which models the differences between speakers based on heuristic knowledge of the differences in acoustic realisations between speakers; and interpolating said adaptive statistics and selected prior statistics to produce smoothed statistics and using said smoothed statistics to estimate a new transform and applying said transform to said model.
The present invention can be implemented either in hardware or on software in a general purpose computer. Further the present invention can be implemented in a combination of hardware and software. The present invention can also be implemented by a single processing apparatus or a distributed network of processing apparatuses.
Since the present invention can be implemented by software, the present invention encompasses computer code provided to a general purpose computer on any suitable carrier medium. The carrier medium can comprise any storage medium such as a floppy disk, a CD ROM, a magnetic device or a programmable memory device, or any transient medium such as any signal e.g. an electrical, optical or microwave signal.
The present invention will now be described with reference to the following non-limiting embodiments in which:
Figure 1 is a schematic of a general speech recognition system;
Figure 2 is a schematic of the components of a speech recognition processor;
Figure 3 is a schematic of a Gaussian probability function;
Figure 4 is a schematic plot of acoustic space representing both probability density functions and an observation vector;
Figure 5 is a flow diagram showing a known method to determine a VTLN adaptation transform;
Figure 6 is a figure used to demonstrate VTLN;
Figure 7 is a flow diagram showing a known method to determine an adaptation transform;
Figure 8 is a flow diagram showing a speech processing method in accordance with an embodiment of the present invention;
Figure 9 is a flow diagram showing a speech recognition method in accordance with a further embodiment of the present invention;
Figure 10 shows a method of adapting an acoustic model for speech processing in accordance with an embodiment of the present invention; and
Figure 11 is a plot of word error rate against the log of a weighting factor used to smooth statistics used to obtain a transform for a speech recognition system in accordance with an embodiment of the present invention.
Figure 1 is a schematic of a very basic speech recognition system. A user (not shown) speaks into microphone 1 or other collection device for an audio system. The device 1 could be substituted by a memory which contains audio data previously recorded or the device 1 may be a network connection for receiving audio data from a remote location.
The speech signal is then directed into a speech processor 3 which will be described in more detail with reference to figure 2.
The speech processor 3 takes the speech signal and turns it into text corresponding to the speech signal. Many different forms of output are available. For example, the output may be in the form of a display 5 which outputs to a screen. Alternatively, the output could be directed to a printer or the like. Also, the output could be in the form of an electronic signal which is provided to a further system 9. For example, the further system 9 could be part of a speech translation system which takes the outputted text from processor 3 and then converts it into a different language. The converted text is then outputted via a further text to speech system.
Alternatively, the text outputted by the processor 3 could be used to operate different types of equipment, for example, it could be part of a mobile phone, car, etc. where the user controls various functions via speech. The output could be used in an in-car navigation system to direct the user to a named location.
Figure 2 is a block diagram of the standard components of a speech recognition processor 3 of the type shown in figure 1. The speech signal received from microphone, through a network or from a recording medium 1 is directed into front-end unit 11.
The front end unit 11 digitises the received speech signal and splits it into frames of equal lengths. The speech signals are then subjected to a spectral analysis to determine various parameters which are plotted in an "acoustic space" or feature space. The parameters which are derived will be discussed in more detail later.
The front end unit 11 also removes signals which are believed not to be speech signals and other irrelevant information. Popular front end units comprise apparatus which use filter bank (FBANK) parameters, Mel Frequency Cepstral Coefficients (MFCC) and Perceptual Linear Predictive (PLP) parameters. The output of the front end unit is in the form of an input vector which is in n-dimensional acoustic space.
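The following is a minimal, hedged sketch of the framing idea described above; a real front end would add pre-emphasis, Mel filter banks and a DCT to produce MFCCs, and the names and parameter values here (frame length, hop, number of bins) are illustrative only:

```python
import numpy as np

def frame_signal(signal, frame_len=400, hop=160):
    """Split a digitised signal into overlapping frames of equal length."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop:i * hop + frame_len] for i in range(n_frames)])

def log_spectral_features(frames, n_bins=13):
    """Crude per-frame spectral feature vectors (stand-in for MFCC/FBANK/PLP)."""
    spectrum = np.abs(np.fft.rfft(frames * np.hanning(frames.shape[1]), axis=1))
    return np.log(spectrum[:, :n_bins] + 1e-8)

signal = np.random.default_rng(0).standard_normal(16000)   # 1 s of audio at 16 kHz (toy data)
features = log_spectral_features(frame_signal(signal))
print(features.shape)   # (n_frames, n_bins): one point in acoustic space per frame
```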
The input vector is then fed into a decoder 13 which cooperates with both an acoustic model section 15 and a language model section 17. The acoustic model section 15 will generally operate using Hidden Markov Models. However, it is also possible to use acoustic models based on connectionist models and hybrid models.
The acoustic model unit 15 derives the likelihood of a sequence of observations corresponding to a word or part thereof on the basis of the acoustic input alone.
The language model section 17 contains information concerning probabilities of a certain sequence of words or parts of words following each other in a given language.
Generally a static model is used. The most popular method is the N-gram model.
The decoder 13 then traditionally uses a dynamic programming (DP) approach to find the best transcription for a given speech utterance using the results from the acoustic model 15 and the language model 17.
This is then output via the output device 19 which allows the text to be displayed, presented or converted for further use e.g. in speech to speech translation or to control a voice activated device.
This description will be mainly concerned with the use of an acoustic model which is a Hidden Markov Model (HMM). However, it could also be used for other models.
The actual model used in this embodiment is a standard model, the details of which are outside the scope of this patent application. However, the model will require the provision of probability density functions (pdfs) which relate to the probability of an observation represented by an acoustic vector (speech vector or feature vector) being related to a word or part thereof. Generally, this probability distribution will be a Gaussian distribution in n-dimensional space.
A schematic example of a generic Gaussian distribution is shown in figure 3. Here, the horizontal axis corresponds to a parameter of the input vector in one dimension and the probability distribution is for a particular word or part thereof relating to the observation. For example, in figure 3, an observation corresponding to an acoustic vector x has a probability p1 of corresponding to the word whose probability distribution is shown in figure 3. The shape and position of the Gaussian is defined by its mean and variance. These parameters are determined during training for phonemes or phonetic units which the acoustic model covers, they will be referred to as the "model parameters".
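For illustration only (the patent does not give code), the probability p1 of an observation x under a diagonal-covariance Gaussian component can be evaluated as in the sketch below:

```python
import numpy as np

def gaussian_logpdf(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian in n dimensions."""
    x, mean, var = map(np.asarray, (x, mean, var))
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

x = np.array([1.2, -0.3])                       # toy observation vector
p1 = np.exp(gaussian_logpdf(x, mean=[1.0, 0.0], var=[0.5, 0.8]))
print(p1)
```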
In an HMM, once the model parameters have been determined, the model can be used to determine the likelihood of a sequence of observations corresponding to a sequence of words or parts of words.
Figure 4 is a schematic plot of acoustic space where an observation is represented by an observation vector or feature vector x1. The open circles g correspond to the means of Gaussians or other probability distribution functions plotted in acoustic space.
During decoding, the acoustic model will calculate a number of different likelihoods that the feature vector x1 corresponds to a word or part thereof represented by the Gaussians. These likelihoods are then used in the acoustic model and combined with probabilities from the language model to determine the text spoken.
However, the acoustic model which is to be used for speech recognition will need to cope under different conditions such as for different speakers and/or under different noise conditions.
Speaker adaptation methods like Maximum Likelihood Linear Regression (MLLR) and Constrained MLLR (CMLLR) have proven successful for adapting to new speakers.
However, these methods give better performance as more speaker data is available, and effective adaptation using limited data is still a challenge.
In an embodiment, adaptation under different noise environments and for different speakers is achieved using prior information in the form of linear transforms to obtain more robust estimates of transforms used to adapt acoustic model parameters to different conditions.
Prior information which can be used to estimate a transform can come from a variety of sources. For example, previously, when using (MAP) adaptation to estimate MLLR transforms, the prior used was a Normal-Wishart distribution over a set of transforms estimated from training data. Also, a regression class tree has been used and the prior transform for a node was the transform estimated at the parent node. Thus prior transforms estimated higher up the tree using more frames are propagated down the tree and used to obtain more robust estimates for transforms at nodes with few observations.
VTLN is another example of a set of prior transforms obtained via knowledge about the effect of vocal tract length on speech features. VTLN has been used directly for adaptation, but not as a prior. VTLN warps the frequency axis to compensate for the difference in individual speaker vocal tract lengths. This uses prior information about the form of frequency warping, and works effectively with little adaptation data as only a small number of parameters must be estimated. However, as more data becomes available, VTLN does not improve and so its effect is limited. VTLN can be implemented as a set of linear transforms and thus provides a source of prior information which can be robustly estimated on a single utterance. In this embodiment, speaker adaptation will be discussed, but the principles can be applied to adaptation to other conditions such as different noise environments. Further, in this embodiment Vocal Tract Length Normalisation (VTLN) will be used for prior information.
However, other transform techniques could be used.
Prior transforms could also be obtained from the training data. A global CMLLR transform is one example, or training data may be clustered and a set of linear transforms estimated.
In the preferred embodiment, the above techniques obtain prior transforms which are combined with adaptive techniques such as CMLLR, where $o_t$ is the feature vector to be transformed and $\hat{o}_t$ is the feature vector after transformation.
A CMLLR transform $W^{(r)} = [b^{(r)}\; A^{(r)}]$ can be applied as a linear transformation of the feature vectors:

$$\hat{o}_t = A^{(r)} o_t + b^{(r)} \qquad (1)$$

yielding likelihood

$$p(o_t|m) = |A^{(r)}|\,\mathcal{N}(A^{(r)} o_t + b^{(r)};\, \mu^{(m)}, \Sigma^{(m)}) \qquad (2)$$

where component $m$ in regression class $r$ has mean $\mu^{(m)}$ and variance $\Sigma^{(m)}$.
Equivalently the transform can be written as:

$$\hat{o}_t = W^{(r)} \zeta_t \qquad (3)$$

where the extended observation vector is $\zeta_t = [1\; o_t^T]^T$ and

$$W^{(r)} = [b^{(r)}\; A^{(r)}] \qquad (4)$$

To estimate the CMLLR transform parameters using maximum likelihood, the auxiliary function $Q(W, \hat{W})$ is used:

$$Q(W, \hat{W}) = \sum_{m} \sum_{t} \gamma_t^{(m)} \log\big(|A^{(r)}|\,\mathcal{N}(W^{(r)}\zeta_t;\, \mu^{(m)}, \Sigma^{(m)})\big) \qquad (5)$$

The $i$th row of the transform can be estimated iteratively using

$$w_i^{(r)} = (\alpha p_i + k_i^{(r)})\, G_i^{(r)-1} \qquad (6)$$

where $p_i$ is the extended co-factor row vector of $A^{(r)}$ and $\alpha$ satisfies the quadratic equation

$$\alpha^2 p_i G_i^{(r)-1} p_i^T + \alpha p_i G_i^{(r)-1} k_i^{(r)T} - \beta^{(r)} = 0 \qquad (7)$$

The statistics accumulated from the data to obtain the optimal $W^{(r)}$ are

$$G_i^{(r)} = \sum_{m \in r} \frac{1}{\sigma_i^{(m)2}} \sum_{t=1}^{T} \gamma_t^{(m)} \begin{bmatrix} 1 & o_t^T \\ o_t & o_t o_t^T \end{bmatrix}, \qquad k_i^{(r)} = \sum_{m \in r} \frac{\mu_i^{(m)}}{\sigma_i^{(m)2}} \sum_{t=1}^{T} \gamma_t^{(m)}\, [1\; o_t^T] \qquad (8)$$

where $m$ is a component in regression class $r$, $\gamma_t^{(m)}$ is the posterior probability that frame $o_t$ is aligned with component $m$, which has mean $\mu^{(m)}$ and variance $\Sigma^{(m)}$. The total occupancy count for a regression class is

$$\beta^{(r)} = \sum_{m \in r} \sum_{t=1}^{T} \gamma_t^{(m)} \qquad (9)$$

Linear transforms can be estimated in a predictive rather than an adaptive framework, and have proven successful for efficient noise adaptation. In an embodiment, the predictive transform parameters are obtained by minimising the KL divergence between a CMLLR adapted distribution and a target distribution. This is a powerful technique when the target distribution is complex, e.g. full covariance, and the PCMLLR transform provides a computationally efficient approximation.
In F. Flego and M.J.F. Gales, "Incremental Predictive and Adaptive Noise Compensation," in ICASSP, 2009, the benefits of adaptive and predictive frameworks for noise robustness are combined by using the predictive scheme (PCMLLR) as a prior for an adaptive scheme (CMLLR). A similar framework can be used where the prior information is contained in a transform $W_{pr}^{(r)} = [b_{pr}^{(r)}\; A_{pr}^{(r)}]$. The target distribution for the adapted model is:

$$p(o_t|m) = |A_{pr}^{(r)}|\,\mathcal{N}(A_{pr}^{(r)} o_t + b_{pr}^{(r)};\, \mu^{(m)}, \Sigma^{(m)}) \qquad (10)$$

The statistics that are used to estimate the prior transforms are:

$$G_{pr,i}^{(r)} = \sum_{m \in r} \frac{\gamma^{(m)}}{\sigma_i^{(m)2}} \begin{bmatrix} 1 & \mathcal{E}\{o_t^T|m\} \\ \mathcal{E}\{o_t|m\} & \mathcal{E}\{o_t o_t^T|m\} \end{bmatrix}, \qquad k_{pr,i}^{(r)} = \sum_{m \in r} \frac{\gamma^{(m)} \mu_i^{(m)}}{\sigma_i^{(m)2}}\, [1\; \mathcal{E}\{o_t^T|m\}] \qquad (11)$$

where the occupancy counts are estimated from the training data. If the extended mean vector is

$$\hat{\mu}^{(m)} = [1\; \mu^{(m)T}]^T \qquad (12)$$

and the extended covariance matrix is

$$\hat{\Sigma}^{(m)} = \begin{bmatrix} 1 & 0 \\ 0 & \Sigma^{(m)} \end{bmatrix} \qquad (13)$$

then the prior statistics can be calculated using:

$$\mathcal{E}\{o_t^T|m\} = \big(A_{pr}^{(r)-1}(\mu^{(m)} - b_{pr}^{(r)})\big)^T \qquad (14)$$

$$\mathcal{E}\{o_t o_t^T|m\} = A_{pr}^{(r)-1}\big(\Sigma^{(m)} + (\mu^{(m)} - b_{pr}^{(r)})(\mu^{(m)} - b_{pr}^{(r)})^T\big)\, A_{pr}^{(r)-T} \qquad (15)$$

The prior and adaptive statistics are then combined in a count-smoothing fashion so that the prior transform acts as a prior for estimating a new smoothed adaptive transform:

$$\hat{G}_i^{(r)} = G_i^{(r)} + \frac{\tau}{\sum_{m \in r} \gamma^{(m)}}\, G_{pr,i}^{(r)} \qquad (16)$$

$$\hat{k}_i^{(r)} = k_i^{(r)} + \frac{\tau}{\sum_{m \in r} \gamma^{(m)}}\, k_{pr,i}^{(r)} \qquad (17)$$

The prior statistics are normalised so that they effectively contribute $\tau$ frames to the final statistics. For normalisation:

$$\hat{\beta}^{(r)} = \sum_{m \in r} \sum_{t=1}^{T} \gamma_t^{(m)} + \tau \qquad (18)$$

As more data becomes available, the adaptive CMLLR statistics will dominate, but for small amounts of data the prior statistics are more important.
The $i$th row of the smoothed adaptive transform $\hat{W}^{(r)}$ can be estimated iteratively using

$$\hat{w}_i^{(r)} = (\alpha p_i + \hat{k}_i^{(r)})\, \hat{G}_i^{(r)-1} \qquad (19)$$

where $p_i$ is the extended co-factor row vector of $\hat{A}^{(r)}$ and $\alpha$ satisfies the quadratic equation

$$\alpha^2 p_i \hat{G}_i^{(r)-1} p_i^T + \alpha p_i \hat{G}_i^{(r)-1} \hat{k}_i^{(r)T} - \hat{\beta}^{(r)} = 0 \qquad (20)$$

with $\hat{W}^{(r)} = [\hat{b}^{(r)}\; \hat{A}^{(r)}]$ and $\hat{o}_t = \hat{A}^{(r)} o_t + \hat{b}^{(r)}$. The value of the occupancy count $\hat{\beta}^{(r)}$ is given in equation (18) above.
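A minimal sketch of the count-smoothing step of equations (16) to (18) is given below; it assumes the per-row statistics have already been accumulated, scales the prior statistics so that they contribute τ effective frames, and omits the row-by-row transform update of equations (19) and (20). All names are illustrative:

```python
import numpy as np

def smooth_statistics(G_i, k_i, G_pr_i, k_pr_i, beta, beta_pr, tau):
    """Interpolate adaptive and prior statistics in a count-smoothing fashion."""
    scale = tau / beta_pr            # normalise prior occupancy to tau effective frames
    G_hat = G_i + scale * G_pr_i     # cf. eq. (16)
    k_hat = k_i + scale * k_pr_i     # cf. eq. (17)
    beta_hat = beta + tau            # cf. eq. (18)
    return G_hat, k_hat, beta_hat

ext = 4                              # extended dimension: 1 + feature dimension
rng = np.random.default_rng(1)
G_i = rng.standard_normal((ext, ext)); G_i = G_i @ G_i.T     # toy adaptive statistics (few frames)
k_i = rng.standard_normal(ext)
G_pr = rng.standard_normal((ext, ext)); G_pr = G_pr @ G_pr.T # toy prior statistics (from model + prior transform)
k_pr = rng.standard_normal(ext)

G_hat, k_hat, beta_hat = smooth_statistics(G_i, k_i, G_pr, k_pr, beta=50.0, beta_pr=50.0, tau=500.0)
print(beta_hat)   # adaptive frames plus tau effective prior frames
```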
For efficiency, the elements of the prior statistics in equations (14) and (15) which depend only on the model set can be accumulated offline.
In this embodiment the prior and the smoothed adaptive transforms use the same regression tree. In another embodiment, different regression trees are used for the prior and smoothed adaptive transform. In this embodiment when reference is made to equation 8, the adaptive statistics used to compute the prior transform will vary in terms of the regression tree r from those used to compute the adaptive transform.
To illustrate differences between the present invention and the prior art, figure 5 shows a process for obtaining a VTLN transform in accordance with a known method.
First, speech is input in step S101. As this method is used for adapting an acoustic model, the speech input in step S101 corresponds to a known or hypothesised text. In such systems a transcription of the text is not normally available; this is the unsupervised mode.
Instead, the text will have usually been estimated from a previous recognition pass with either the model for which the transform is to be estimated or another model.
Alternatively in supervised mode the text is known.
In this embodiment, the forward/backward algorithm is used, which presumes a soft assignment of a frame to a state. However, other implementations of the Baum-Welch algorithm may also be used which also assume a soft assignment. In a further embodiment, the Viterbi algorithm is used, which assumes a hard assignment of frame to state.
From this step S103, the statistics $G_i^{(r)}$ and $k_i^{(r)}$ given in equation (8) above can be obtained. Prior to performing decoding, a number of VTLN transforms were saved corresponding to different values of the warping factor α.
Standard VTLN attempts to find a linear transform W which maps the unwarped cepstral coefficients $o_t$ onto the corresponding warped values $\hat{o}_t$. VTLN applies warping to the frequency axes to correspond to differences in speaker vocal tract length.
A piecewise linear warping is one example of a frequency warping which can be used and is expressed by the constant α. Figure 6 is a plot showing how α can be expressed graphically.
G and k statistics are accumulated in step S103 for the adaptation data and are used to select the best α in step S105. In step S107, the transform corresponding to the best value of α is outputted.
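A hedged sketch of this selection step is shown below: a set of pre-computed linear transforms (one per warping factor α) is scored with the standard CMLLR auxiliary function built from the accumulated G and k statistics, and the best-scoring factor is kept. The exact objective used in the patent is not reproduced here, and the candidate transforms and statistics in the example are toy values:

```python
import numpy as np

def cmllr_aux(W, G, k, beta):
    """Standard CMLLR auxiliary: beta*log|A| + sum_i (w_i k_i^T - 0.5 w_i G_i w_i^T)."""
    A = W[:, 1:]                                   # W = [b A] in the extended-vector convention
    score = beta * np.log(abs(np.linalg.det(A)))
    for i, w_i in enumerate(W):
        score += w_i @ k[i] - 0.5 * w_i @ G[i] @ w_i
    return score

dim, ext = 3, 4                                    # feature dimension and extended dimension
rng = np.random.default_rng(0)
G = [np.eye(ext) * 10 for _ in range(dim)]         # toy accumulated statistics per row
k = [rng.standard_normal(ext) for _ in range(dim)]
beta = 100.0                                       # toy occupancy count

# candidate transforms for warping factors alpha = 0.9, 1.0, 1.1 (toy, diagonal scaling only)
candidates = {a: np.hstack([np.zeros((dim, 1)), np.eye(dim) * a]) for a in (0.9, 1.0, 1.1)}
best_alpha = max(candidates, key=lambda a: cmllr_aux(candidates[a], G, k, beta))
print("selected warping factor:", best_alpha)
```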
Figure 7 is a flow diagram showing a known CMLLR/MLLR method. Input speech is input in step S111. The input speech corresponds to a known or hypothesised text. The text will have usually been obtained as for VTLN. In the same manner as above, the forward/backward algorithm is run in step S113. This outputs the statistics G and k which correspond to equation (8) as shown above.
From this, the transform is estimated in step S115. In step S117 the transform is outputted.
The method of figure 5 is essentially a method where a transform which is based on modelling some physical relationship, in figure 5 differences in vocal tract length of different speakers, is used to adapt the speech from one speaker to a different speaker.
In figure 7, an adaptive method is used where the speech from a new speaker is processed on the basis that the differences in acoustic realisations between speakers are not based on an underlying assumption concerning how the physical differences arise. The transform parameters in step S115 are estimated to give a "best fit" match of the current acoustic model to the new speech. In step S117 the transform is outputted.
Figure 8 shows a speech recognition system capable of adaptation in accordance with an embodiment of the present invention.
The system adapts model parameters S207 based on adaptation data S201. The adaptation data S201 comprises speech data for a particular condition to which the system is to be adapted (for example, a particular speaker, noise environment etc). To estimate the transforms a hypothesis is required which allows the speech to be related to text. In one embodiment, an initial decoding pass is performed which provides such text.
Adaptive statistics are accumulated in step S205. In this preferred embodiment, the adaptive statistics are the $G_i^{(r)}$ and $k_i^{(r)}$ given in equation (8) above. The observation vectors $o_t$ are derived from the adaptation data. The $\gamma$ parameters are obtained from an alignment of the observations of the adaptation data to the model states in model S207. The $\mu$ and $\sigma$ parameters are derived from the parameters of model S207.
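As an illustrative sketch only (the patent does not specify an implementation), the per-row statistics of equation (8) can be accumulated from the observations, posteriors γ and diagonal-covariance model parameters μ, σ² roughly as follows:

```python
import numpy as np

def accumulate_adaptive_stats(obs, gammas, mu, sigma2):
    """obs: (T, d); gammas: (T, M); mu, sigma2: (M, d). Returns per-row lists of G_i, k_i."""
    T, d = obs.shape
    zeta = np.hstack([np.ones((T, 1)), obs])             # extended observations [1 o_t]
    G = [np.zeros((d + 1, d + 1)) for _ in range(d)]
    k = [np.zeros(d + 1) for _ in range(d)]
    for m in range(mu.shape[0]):
        gz = gammas[:, m:m + 1] * zeta                    # posterior-weighted extended observations
        outer = zeta.T @ gz                               # sum_t gamma_t^(m) zeta_t zeta_t^T
        row_sum = gz.sum(axis=0)                          # sum_t gamma_t^(m) zeta_t
        for i in range(d):
            G[i] += outer / sigma2[m, i]
            k[i] += (mu[m, i] / sigma2[m, i]) * row_sum
    return G, k

T, d, M = 50, 3, 4
rng = np.random.default_rng(2)
obs = rng.standard_normal((T, d))                         # toy adaptation data
gammas = rng.dirichlet(np.ones(M), size=T)                # toy soft component posteriors
mu = rng.standard_normal((M, d)); sigma2 = np.ones((M, d))
G, k = accumulate_adaptive_stats(obs, gammas, mu, sigma2)
print(G[0].shape, k[0].shape)                             # (4, 4) (4,)
```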
A prior transform S204 is obtained in step S203. In this embodiment the prior transform is intended to compensate for differences in acoustic realisations between speakers. In one embodiment, the prior transform is an identity transform. In a further embodiment the prior transform is a given fixed linear transform.
In a further embodiment, adaptive statistics as explained with reference to equation (8) are used to select a preferred prior transform from a plurality of prior transforms. In this embodiment, the prior transform is a linear transform, for example VTLN.
However, other transforms either linear or non-linear may be used such as MLLR, CMLLR, VTLN, JUD, VTS.
In further embodiments, a prior transform which is used to achieve speaker adaptation may be estimated from the adaptation data without selecting from a plurality of pre-set transforms.
Where adaptation data is required to estimate the prior transform, the prior transform needs to be estimated on-line. However, when the prior transform is pre-set as in the case of an identity transform, the transform can be calculated off-line. When the prior transform is selected from a plurality of transforms, the plurality of transforms can be calculated off-line and stored.
In step S209 prior statistics are then obtained using equation (11) above, where the occupancy counts are estimated from the training data.
The expectations required by equation (11) are calculated using equations (14) and (15) above. Equations (14) and (15) are determined from both the model parameters S207 and the prior transform S204. If the transform has been calculated offline then the prior statistics may be determined offline. If considerable memory is available then it is also possible to cache the prior statistics for all saved transforms offline. If the transform is estimated using the adaptation data S201 then the transform must be estimated and applied online. However, terms derived from the model parameters may be calculated offline and cached, and the transform applied online.
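A small illustrative sketch of equations (14) and (15), assuming a diagonal model covariance and using made-up values for the model parameters and prior transform, is given below; these terms depend only on the model set and the prior transform, which is why they can be cached offline:

```python
import numpy as np

def prior_expectations(mu_m, sigma2_m, A_pr, b_pr):
    """Map model moments through the inverse prior transform: eqs (14)-(15)."""
    A_inv = np.linalg.inv(A_pr)
    e_o = A_inv @ (mu_m - b_pr)                                       # eq. (14)
    e_oo = A_inv @ (np.diag(sigma2_m)
                    + np.outer(mu_m - b_pr, mu_m - b_pr)) @ A_inv.T   # eq. (15)
    return e_o, e_oo

mu_m = np.array([0.5, -1.0]); sigma2_m = np.array([0.3, 0.7])         # toy component parameters
A_pr = np.array([[1.05, 0.0], [0.0, 0.95]]); b_pr = np.array([0.1, -0.1])  # toy prior transform
print(prior_expectations(mu_m, sigma2_m, A_pr, b_pr))
```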
In step S211, the adaptive statistics are smoothed with the prior statistics in a count-smoothing framework. This uses the prior statistics obtained in step S209 and the adaptive statistics accumulated in step S205, combined according to equations (16) and (17) above. τ is a weighting factor. The magnitude of the weighting factor is determined by trial and error. In a preferred embodiment, it is selected based on decoding accuracy. The value of τ can be determined off-line. The data which is used to select τ should be independent of the adaptation data used on-line.
From these statistics, a smoothed adaptive transform, expressed in terms of $[\hat{b}\; \hat{A}]$ and which transforms observation vector $o_t$ to observation vector $\hat{o}_t$ as $\hat{o}_t = \hat{A} o_t + \hat{b}$, is produced in step S213 and outputted in step S215. The $i$th row of the smoothed adaptive transform can be estimated iteratively from the above smoothed statistics using equations (19) and (20), with the occupancy count $\hat{\beta}^{(r)}$ given by equation (18). This smoothed adaptive transform is then fed into decoder S217. In decoder S217, speech recognition is performed using an acoustic model which has been adapted using the transform output in step S215. In Figure 2 this is equivalent to replacing the acoustic model 15 after each adaptation is completed. In another embodiment the adaptive transform can be applied to the speech feature vectors rather than directly adapting the model. In this embodiment the speech features output by the front end unit 11 in Figure 2 are adapted prior to input to the speech decoder 13. The acoustic model then remains unchanged across utterances.
Figure 9 is a flow diagram showing a speech recognition method in accordance with an embodiment of the present invention. Speech is input in step S401. In most embodiments, the input speech will not correspond to a known transcription. However, a transcription can be estimated which corresponds to the speech. Typically, this will be done by performing a first pass of the decoder to obtain a first estimate. Possible operations will be described after step S411.
The forward/backward algorithm is then run in step S403 and, from the input data, model set and prior, the statistics G and k are estimated. Transforms from these statistics are then estimated in step S405.
The acoustic model parameters of the speech recognition system are then directly transformed using the transform of step S405 or are indirectly modified using CMLLR transforms in step S407.
A speech recognition system will also comprise a language model in addition to an acoustic model. A language model is used to indicate the probability that the sequences of words output by an acoustic model occur in a given language. The probabilities for various sequences are scored both by the acoustic model and the language model and the probabilities are combined.
In one embodiment, recognition is run with the acoustic model and unchanged language model for the new speaker using the modified parameters in step S409, and the identified words are output in step S411. Recognition with the adapted acoustic model can take one of two forms: (1) standard ASR - this needs the acoustic model and language model to determine possible recognition hypotheses, and could be a 2nd pass on the same data or use the new acoustic model on the next speech input; (2) rescoring - a lattice or N-best list of possible hypotheses from the 1st recognition pass used to determine the text in the 1st estimate is saved, and the language model scores are saved with the lattice/N-best list. Then full recognition does not need to be run. Only acoustic model scores are required at each frame, which are combined with the stored language model scores to "rescore" the saved lattice paths. Rescoring would only apply if rerunning on the same data. For complex recognition tasks, it is much faster to rescore than to run a whole recognition pass from scratch.
In the above embodiment, a first pass decoding run is performed to obtain an initial transcription. Any model can be used for the first-pass decoding, but often the baseline (speaker independent) model is used. With an initial transcription and corresponding data, a transform can be estimated for adaptation. This transform can then either be used to redecode the current data (i.e. that used to estimate the transform) and improve the initial hypothesis, or it can be used on other test data.
In use, the user is likely not to be aware that adaptation is ongoing. The user would only perceive recognition as taking place.
The transforms may be continually re-estimated or there may be an adaptation training phase where new transforms are estimated.
In general, there will be some continual re-estimation of transforms, which is especially useful when there are multiple users, noise conditions etc. Three different scenarios which might take place follow:

Example 1 - incremental mode
a. user 1 makes a request "play let it be"
b. system incorrectly recognises and outputs "play you and me" (user 1 unhappy)
c. system then obtains transform using hypothesis "play you and me"
d. user 1 makes a second request "play the beatles"
e. system uses transform to decode and output hypothesis "play the beatles" (user 1 happy)
f. system uses second utterance to improve transform... etc.

Example 2 - similar but redecoding
a. user 1 makes a request "play let it be"
b. system incorrectly recognises "play you and me"
c. system then obtains transform using hypothesis "play you and me"
d. system uses transform to redecode original data and outputs new hypothesis "play let it be" (user 1 happy)
e. user 1 makes a request "play the zutons"
f. system uses first transform to decode "play the zutons"
g. system uses second utterance to improve first transform
h. system redecodes second utterance using improved transform... etc.

Example 3 - redecoding but multiple users
a. user 1 makes a request "play let it be"
b. system incorrectly recognises "play you and me"
c. system then obtains transform using hypothesis "play you and me"
d. system uses transform to redecode original data and outputs new hypothesis "play let it be" (user 1 happy)
e. user 2 makes a request "play the zutons"
f. system uses original model to decode, giving hypothesis "play the beatles"
g. system estimates user 2 transform using "play the beatles"
h. system redecodes second utterance using the user 2 transform to get new hypothesis "play the zutons" (user 2 also happy)

In the above, the system may also receive user feedback, such that the system does not use a hypothesis which the user has indicated is incorrect to re-estimate transforms.
Figure 10 is a simplified flow diagram showing the steps taken when estimating transforms in accordance with an embodiment of the present invention. In this example, the prior transform is a VTLN transform selected from a plurality of pre-computed or pre-determined prior transforms.
Speech is input in step S501. The forward backward algorithm is run in step S503.
Adaptive statistics are generated in step S503. In step S505 the best warping factor α and corresponding VTLN transform is selected using the adaptive statistics generated in S503. This transform is used in step S507 to obtain the prior statistics. The adaptive statistics are re-used in step S509 and interpolated with the prior statistics obtained in step S507. The transform is then estimated in step S511 using the smoothed statistics from step S509.
In a further embodiment, a cascade of transforms is used, where the above smoothed adaptive transform is used as a child transform in a sequence of transforms. In a further preferred embodiment, the prior transform which is used in the estimation of the child is also used as a parent transform. In speech recognition and speech synthesis, cascading transforms are often used to adapt between different environments. This framework can be used to estimate a child transform in such a cascade.
With a parent transform with regression class $r_p$,

$$W_{pt}^{(r_p)} = [b_{pt}^{(r_p)}\; A_{pt}^{(r_p)}]$$

and a child transform with regression class $r_c$,

$$W_{ch}^{(r_c)} = [b_{ch}^{(r_c)}\; A_{ch}^{(r_c)}]$$

the mean parameters in the HMM under these cascading transforms can be represented as

$$\hat{\mu}^{(m)} = A_{pt}^{(r_p)}\big(A_{ch}^{(r_c)} \mu^{(m)} + b_{ch}^{(r_c)}\big) + b_{pt}^{(r_p)} \qquad (21)$$

and a parent transform and child transform applied to the covariance parameters is represented by

$$\hat{\Sigma}^{(m)} = A_{pt}^{(r_p)} A_{ch}^{(r_c)}\, \Sigma^{(m)}\, A_{ch}^{(r_c)T} A_{pt}^{(r_p)T} \qquad (22)$$

Equivalently, in another embodiment, if the same transform is applied to both mean and covariance, a computationally efficient implementation is to transform the speech feature vectors:

$$\hat{o}_t = A_{ch}^{(r_c)}\big(A_{pt}^{(r_p)} o_t + b_{pt}^{(r_p)}\big) + b_{ch}^{(r_c)} \qquad (23)$$

where $\hat{o}_t$ is the observation vector after the transforms have been applied and $o_t$ is the untransformed observation vector.
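As a hedged sketch of the feature-space cascade of equation (23) (variable names are ours, and the composition order follows the reconstruction above, parent first then child), the cascade can also be collapsed into a single equivalent linear transform:

```python
import numpy as np

def apply_cascade(o_t, parent, child):
    """Apply the parent transform to the feature vector, then the child transform."""
    A_p, b_p = parent
    A_c, b_c = child
    return A_c @ (A_p @ o_t + b_p) + b_c

def collapse(parent, child):
    """Collapse the cascade into one equivalent transform (A, b)."""
    A_p, b_p = parent
    A_c, b_c = child
    return A_c @ A_p, A_c @ b_p + b_c

o_t = np.array([0.2, -0.4])                                        # toy observation
parent = (np.diag([1.1, 0.9]), np.array([0.05, 0.0]))              # toy parent transform
child = (np.array([[1.0, 0.1], [0.0, 1.0]]), np.array([0.0, -0.02]))  # toy child transform

A, b = collapse(parent, child)
assert np.allclose(apply_cascade(o_t, parent, child), A @ o_t + b)  # collapsed form is equivalent
print(A, b)
```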
In the general case the transforms are applied to the acoustic model. Above equations 21 and 22 give the transform of the means and covariances for a general case. It should be noted that this principle applies for non parent child transforms, i.e. in the general case adaptive transforms are applied to the model means and/or covariances and in a particular case transforms are applied to the speech vectors directly.
In such a parent and child arrangement, one transform may account for noise and the other for speaker adaptation. Alternatively, both transforms may model different speaker attributes or noise attributes. In a further variation, a plurality of transforms are successively applied.
It is computationally very efficient for a parent transform to also be used as a prior in the above framework.
Suppose that the prior transform $W_{pr}^{(r)}$ is already known or has been estimated, and is to be used as a parent transform in a cascade of transforms. The adaptive statistics to estimate the child transform

$$W_{ch}^{(r_c)} = [b_{ch}^{(r_c)}\; A_{ch}^{(r_c)}] \qquad (24)$$

are accumulated in the domain of the parent transform. If $\tilde{o}_t$ is the original observation transformed by the parent transform:

$$\tilde{o}_t = A_{pt}^{(r_p)} o_t + b_{pt}^{(r_p)} \qquad (25)$$

then with

$$W_{pt}^{(r_p)} = [b_{pt}^{(r_p)}\; A_{pt}^{(r_p)}] \qquad (26)$$

the adaptive statistics for the child are

$$G_{pa,i}^{(r_c)} = \sum_{m \in r_c} \frac{1}{\sigma_i^{(m)2}} \sum_{t=1}^{T} \gamma_t^{(m)} \begin{bmatrix} 1 & \tilde{o}_t^T \\ \tilde{o}_t & \tilde{o}_t \tilde{o}_t^T \end{bmatrix}, \qquad k_{pa,i}^{(r_c)} = \sum_{m \in r_c} \frac{\mu_i^{(m)}}{\sigma_i^{(m)2}} \sum_{t=1}^{T} \gamma_t^{(m)}\, [1\; \tilde{o}_t^T] \qquad (27)$$

Prior statistics to estimate the child must also be in the domain of the parent:

$$G_{pr,i}^{(r_c)} = \sum_{m \in r_c} \frac{\gamma^{(m)}}{\sigma_i^{(m)2}} \begin{bmatrix} 1 & \mathcal{E}\{\tilde{o}_t^T|m\} \\ \mathcal{E}\{\tilde{o}_t|m\} & \mathcal{E}\{\tilde{o}_t \tilde{o}_t^T|m\} \end{bmatrix}, \qquad k_{pr,i}^{(r_c)} = \sum_{m \in r_c} \frac{\gamma^{(m)} \mu_i^{(m)}}{\sigma_i^{(m)2}}\, [1\; \mathcal{E}\{\tilde{o}_t^T|m\}] \qquad (28)$$

i.e. the prior statistics must also be transformed by the parent transform to be in the domain of the parent. If the parent is equal to the prior, i.e. $W_{pt}^{(r_p)} = W_{pr}^{(r)}$, this cancels out the pre-existing transformation by $W_{pr}^{(r)-1}$ and becomes equivalent to using an identity matrix prior in the space defined by the parent, i.e.

$$\mathcal{E}\{\tilde{o}_t^T|m\} = \mu^{(m)T} \qquad (29)$$

$$\mathcal{E}\{\tilde{o}_t \tilde{o}_t^T|m\} = \Sigma^{(m)} + \mu^{(m)} \mu^{(m)T} \qquad (30)$$

A child transform $W_{ch}^{(r_c)}$ is estimated from the smoothed statistics

$$\hat{G}_{pa,i}^{(r_c)} = G_{pa,i}^{(r_c)} + \frac{\tau}{\sum_{m \in r_c} \gamma^{(m)}}\, G_{pr,i}^{(r_c)} \qquad (31)$$

$$\hat{k}_{pa,i}^{(r_c)} = k_{pa,i}^{(r_c)} + \frac{\tau}{\sum_{m \in r_c} \gamma^{(m)}}\, k_{pr,i}^{(r_c)} \qquad (32)$$

The $i$th row of the transform can be estimated iteratively using

$$w_{ch,i}^{(r_c)} = (\alpha p_i + \hat{k}_{pa,i}^{(r_c)})\, \hat{G}_{pa,i}^{(r_c)-1} \qquad (33)$$

where $p_i$ is the extended co-factor row vector of $A_{ch}^{(r_c)}$ and $\alpha$ satisfies the quadratic equation

$$\alpha^2 p_i \hat{G}_{pa,i}^{(r_c)-1} p_i^T + \alpha p_i \hat{G}_{pa,i}^{(r_c)-1} \hat{k}_{pa,i}^{(r_c)T} - \hat{\beta}^{(r_c)} = 0 \qquad (34)$$

and

$$\hat{\beta}^{(r_c)} = \sum_{m \in r_c} \sum_{t=1}^{T} \gamma_t^{(m)} + \tau \qquad (35)$$

Both parent and child must be used in decoding:

$$p(o_t|m) = |A_{ch}^{(r_c)}||A_{pt}^{(r_p)}|\, \mathcal{N}\big(A_{ch}^{(r_c)}(A_{pt}^{(r_p)} o_t + b_{pt}^{(r_p)}) + b_{ch}^{(r_c)};\, \mu^{(m)}, \Sigma^{(m)}\big) \qquad (36)$$

This is a convenient framework for using a parent transform as a prior as there is no longer any need to transform the prior statistics by the prior transform. The prior statistics can be fully cached offline. The structure of the parent and child can differ, and any form of linear or non-linear transform can easily be used as a parent and prior, yet adaptive statistics only need to be accumulated at the regression class level for the child. For example, a 64 regression class block diagonal PCMLLR transform could be used as a prior for estimating a 2 regression class diagonal CMLLR transform, and statistics only need be accumulated for the 2 regression classes of the child, not the 64 classes of the parent. After being estimated, any cascade of transforms may be collapsed to an equivalent single transform. The regression tree for the collapsed transform is the intersection of the regression trees of the constituent transforms in the cascade.

To demonstrate the effectiveness of methods in accordance with the above embodiments, experiments were performed. The linear transform version of VTLN was used as a convenient form of prior information which is expected to yield more information than an identity transform.
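A small illustrative fragment of equations (29) and (30), assuming a diagonal model covariance purely for demonstration, shows why the prior statistics can be cached fully offline when the parent equals the prior:

```python
import numpy as np

# When the parent transform equals the prior transform, the prior expectations
# in the parent's domain reduce to the model moments themselves (eqs (29)-(30)),
# so they depend only on the model set and can be pre-computed and cached.
mu_m = np.array([0.5, -1.0])                         # toy component mean
sigma2_m = np.array([0.3, 0.7])                      # toy diagonal covariance
e_o = mu_m                                           # eq. (29): E{o~ | m} = mu^(m)
e_oo = np.diag(sigma2_m) + np.outer(mu_m, mu_m)      # eq. (30): E{o~ o~^T | m} = Sigma^(m) + mu mu^T
print(e_o, e_oo, sep="\n")
```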
Gender independent US English acoustic models were trained using a 39 dimensional MFCC feature vector, with static, delta and delta-delta parameters. A total of 312 hours of data from WSJ, TIDIGITS, TIMIT and internally collected noisy data was used for training triphone acoustic models. Decision tree clustering was used to yield 650 unique states. 12 Gaussian components were used per speech state and 24 Gaussian components per silence state, yielding approximately 8000 components. For adaptive training, transforms were estimated on a per speaker basis, using a transform type consistent with decoding. Experiments were carried out on two tasks: Toshiba in-car task -a database recorded in real driving conditions with phone numbers, 4 digits, command and control, and city names subtasks. Each sub task includes two noisy conditions, engine on and highway, and there are a total of 8983 utterances spoken by native speakers with an average of 463 frames per utterance.
Multi-accent task -a database recorded in studio conditions with additional noise.
There are approximately 14k utterances split between telephone and TV control, spoken by users with a mixture of accents, with an average of 226 frames per utterance.
Unless stated otherwise, a separate transform is estimated for each test set utterance using two regression classes -speech and silence -to limit the number of parameters and allow for rapid adaptation. The baseline hypothesis was used for estimation of all transforms.
Experimental results are given in table 1. The first lines show results obtained for the baseline system without adaptation, and with standard VTLN and CMLLR. These results show that VTLN consistently yields small gains, e.g. on the multi-accent set the baseline error rate of 15.90% is improved to 15.44%. Diagonal CMLLR improves over VTLN on the in-car and multi-accent tests. Full CMLLR does not impact on performance, suggesting that one utterance does not give enough data to robustly estimate the parameters.
Table 1: Word error rates (%)

                                           Standard           Adaptive
Transform  Structure  Parent  Prior     In-car  Mlt-acc    In-car  Mlt-acc
Baseline   -          -       -          2.38    15.90       -       -
VTLN       Block      -       -          2.33    15.44      2.17    15.11
CMLLR      Diag       -       -          1.86    15.17      1.74    14.96
CMLLR      Full       -       -          2.36    15.90      2.34    15.84
CMLLR*     Diag       VTLN    -          1.79    14.98      1.77    14.94
CMLLR*     Full       VTLN    -          2.35    15.90      2.35    15.90
CMLLR      Full       -       Identity   1.92    15.11      2.11    14.58
CMLLR      Full       -       VTLN       1.87    14.82      1.63    13.54

Next, VTLN was used as a parent transform, although not as a prior, when estimating a CMLLR child transform to be used in a cascade during decoding. The results show that using VTLN as a parent transform for estimating a diagonal CMLLR transform can give performance gains. For example, on the Toshiba in-car set, the error rates are 2.33% and 1.86% for VTLN and diagonal CMLLR respectively, but 1.79% when cascading VTLN and CMLLR. However, when used as a parent to estimate a full CMLLR transform, very little difference in error rate is seen. The error rates for VTLN and full CMLLR on the Toshiba set are 2.33% and 2.36% respectively, and 2.35% when cascading the two transforms. These results suggest that incorporating prior knowledge as a parent transform with no prior does not improve the robustness of poorly estimated CMLLR transforms when using limited data.
Combining prior and adaptive statistics for more robust transform estimates.
Experiments were carried out using an identity matrix as prior (i.e. W = I in equations 14 and 16) and also using VTLN as a prior with the method described. Here, a full CMLLR transform was trained for each utterance by combining the prior and adaptive statistics according to equations 16 and 17. Note that full CMLLR transforms are equivalent to τ = 0. As can be seen, for large values of τ the resulting transform can give better performance than either the full CMLLR or VTLN transforms alone. For some values of τ, the transforms from interpolated statistics give gains over the robust diagonal CMLLR transform. VTLN appears to be a relatively weak prior as it does not give large gains by itself, and only gives small improvements over an identity prior.
Results are given in table 1 for τ = 50000 for the two test sets. On the in-car set, the identity prior yields an error rate of 1.92% and the VTLN prior gives 1.86%, which are relative gains of 20% and 22% respectively over the baseline model.
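For illustration, a minimal sketch of this count-smoothing step is given below. The function names, the array layout and the normalisation of the prior statistics by their total occupancy are assumptions made for the example, not code from this document: the prior statistics are scaled by τ divided by the prior occupancy and added to the adaptive statistics, so τ = 0 recovers standard full CMLLR and large τ pulls the estimate towards the prior transform.

```python
import numpy as np

def smooth_cmllr_statistics(G_adapt, k_adapt, G_prior, k_prior, prior_occupancy, tau):
    """Interpolate adaptive and prior CMLLR statistics (count smoothing).

    G_adapt, k_adapt : statistics accumulated from the adaptation data
    G_prior, k_prior : statistics derived from the prior transform and the model
    prior_occupancy  : total occupancy of the prior statistics for this regression class
    tau              : smoothing weight; tau = 0 recovers standard full CMLLR
    """
    scale = tau / prior_occupancy
    G_smoothed = G_adapt + scale * G_prior
    k_smoothed = k_adapt + scale * k_prior
    return G_smoothed, k_smoothed

# Illustrative shapes for d-dimensional features: for each dimension i there is a
# (d+1) x (d+1) matrix G[i] and a (d+1) vector k[i], from which row i of the extended
# transform W = [A b] is estimated (e.g. by the usual iterative row-by-row update).
d = 39
G_adapt = np.stack([np.eye(d + 1) for _ in range(d)])
k_adapt = np.zeros((d, d + 1))
G_prior = np.stack([np.eye(d + 1) for _ in range(d)])
k_prior = np.zeros((d, d + 1))
G_s, k_s = smooth_cmllr_statistics(G_adapt, k_adapt, G_prior, k_prior,
                                   prior_occupancy=1000.0, tau=50000.0)
```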
Finally, adaptive training was performed using VTLN, CMLLR and CMLLR with a prior. When using VTLN only as a parent to estimate full CMLLR transforms, the VTLN adaptively trained model was used. As expected, gains are seen from adaptive training using VTLN and CMLLR. However, the use of the VTLN prior in adaptive training yields a further improvement in word error rate. On the Toshiba in-car set, adaptive training using VTLN and diagonal CMLLR yielded 2.17% and 1.77% WER respectively, but a further improvement to 1.63% was seen using the approach proposed in section 3. This is an improvement of 46% relative over the baseline performance. The relative improvement on the multi-accent task is 15%.

Claims (14)

1. A speech processing method, comprising: receiving a speech input from a new speaker which comprises a sequence of observations; determining the likelihood of a sequence of words arising from the sequence of observations using an acoustic model and a language model, comprising: providing an acoustic model for performing speech recognition on an input signal which comprises a sequence of observations, wherein said model has been trained to recognise speech from a different speaker or speakers, said model having a plurality of model parameters relating to the probability distribution of a word or part thereof being related to an observation; adapting the acoustic model trained for a different speaker or speakers to the new speaker; the speech processing method further comprising determining the likelihood of a sequence of observations occurring in a given language using a language model; combining the likelihoods determined by the acoustic model and the language model and outputting a sequence of words identified from said speech input signal, wherein adapting the model to the new speaker comprises: calculating adaptive statistics, said adaptive statistics being generated by comparing the speech of the new speaker with that of the acoustic model trained for other speakers; determining prior statistics, said prior statistics derived from a prior transform which models the differences between speakers based on heuristic knowledge of the differences in acoustic realisations between speakers; and interpolating said adaptive statistics and selected prior statistics to produce smoothed statistics and using said smoothed statistics to estimate a new transform and applying said transform to said model.
2. A method according to claim 1, wherein adapting the acoustic model to the new speaker comprises receiving speech from the new speaker corresponding to known text.
3. A method according to claim 1, wherein adapting the acoustic model to the new speaker comprises receiving speech from said new speaker and making a first estimate of the text corresponding to said speech.
4. A speech recognition method according to claim 1, wherein the prior transform is selected from an identity transform, a fixed linear speaker transform or a non-linear speaker transform.
5. A speech recognition method according to claim 1, wherein the prior transform is a transform selected from a plurality of pre-determined transforms on the basis of said adaptive statistics.
6. A speech recognition method according to claim 1, wherein the prior statistics are partially cached off-line.
7. A method according to claim 1, wherein said adaptive statistics are expressed by:

$$
\mathbf{G}^{(ri)} = \sum_{m \in r} \frac{1}{\sigma_i^{(m)2}} \sum_{t} \gamma_t^{(m)} \begin{bmatrix} 1 & \mathbf{o}_t^{\mathrm{T}} \\ \mathbf{o}_t & \mathbf{o}_t \mathbf{o}_t^{\mathrm{T}} \end{bmatrix}, \qquad
\mathbf{k}^{(ri)} = \sum_{m \in r} \frac{\mu_i^{(m)}}{\sigma_i^{(m)2}} \sum_{t} \gamma_t^{(m)} \begin{bmatrix} 1 & \mathbf{o}_t^{\mathrm{T}} \end{bmatrix} \qquad (8)
$$

where m is a component in regression class r, γ_t^(m) is the posterior probability that frame o_t is aligned with model component m, which has mean μ^(m) and variance Σ^(m), and μ_i^(m) and σ_i^(m)² are the ith elements of μ^(m) and Σ^(m) respectively.
8. A method according to claim 1, wherein said prior statistics are expressed by:

$$
\mathbf{G}_p^{(ri)} = \sum_{m \in r} \frac{\gamma^{(m)}}{\sigma_i^{(m)2}} \begin{bmatrix} 1 & E\{\mathbf{o}^{\mathrm{T}} \mid m\} \\ E\{\mathbf{o} \mid m\} & E\{\mathbf{o}\mathbf{o}^{\mathrm{T}} \mid m\} \end{bmatrix}, \qquad
\mathbf{k}_p^{(ri)} = \sum_{m \in r} \frac{\mu_i^{(m)} \gamma^{(m)}}{\sigma_i^{(m)2}} \begin{bmatrix} 1 & E\{\mathbf{o}^{\mathrm{T}} \mid m\} \end{bmatrix} \qquad (11)
$$

where the prior statistics are calculated from the prior transform W = [A b] using:

$$
E\{\mathbf{o} \mid m\} = \mathbf{A}^{-1}\bigl(\boldsymbol{\mu}^{(m)} - \mathbf{b}\bigr) \qquad (14)
$$

$$
E\{\mathbf{o}\mathbf{o}^{\mathrm{T}} \mid m\} = \mathbf{A}^{-1}\bigl(\boldsymbol{\Sigma}^{(m)} + (\boldsymbol{\mu}^{(m)} - \mathbf{b})(\boldsymbol{\mu}^{(m)} - \mathbf{b})^{\mathrm{T}}\bigr)\mathbf{A}^{-\mathrm{T}} \qquad (15)
$$

and m is a component in regression class r which has mean μ^(m) and variance Σ^(m), γ^(m) is the posterior probability that a frame o is aligned with model component m and is obtained from the training set, and μ_i^(m) and σ_i^(m)² are the ith elements of μ^(m) and Σ^(m) respectively.
9. A method according to claim 1, wherein said interpolated adaptive and prior statistics are expressed by:

$$
\tilde{\mathbf{G}}^{(ri)} = \mathbf{G}^{(ri)} + \frac{\tau}{\sum_{m \in r} \gamma^{(m)}}\,\mathbf{G}_p^{(ri)} \qquad (16)
$$

$$
\tilde{\mathbf{k}}^{(ri)} = \mathbf{k}^{(ri)} + \frac{\tau}{\sum_{m \in r} \gamma^{(m)}}\,\mathbf{k}_p^{(ri)} \qquad (17)
$$
10. A method according to claim 1, wherein the prior transform is linear or a plurality of linear transforms.
11. A method according to claim 1, wherein the new transform is applied indirectly to the acoustic model via direct application to observation vectors.
12. A method according to claim 1, wherein the adaptive statistics are prepared on data where there is a different noise environment to that used while training the acoustic model.
13. A method according to any preceding claim, wherein said new transform is one of a cascade of transforms applied to said model.
14. A method according to claim 13, wherein said new transform is a child transform applied in a cascade of transforms and said prior transform is a parent transform in said cascade of transforms.
15. A method of adapting an acoustic model for speech processing to the speech of a new speaker, the method comprising: receiving a speech input from a new speaker which comprises a sequence of observations; providing an acoustic model for performing speech recognition on an input signal which comprises a sequence of observations, wherein said model has been trained to recognise speech from a different speaker or speakers, said model having a plurality of model parameters relating to the probability distribution of a word or part thereof being related to an observation; calculating adaptive statistics, said adaptive statistics being generated by comparing the speech of the new speaker with that of the acoustic model trained for other speakers; determining prior statistics, said prior statistics derived from a prior transform which models the differences between speakers based on heuristic knowledge of the differences in acoustic realisations between speakers; and interpolating said adaptive statistics and selected prior statistics to produce smoothed statistics and using said smoothed statistics to estimate a new transform and applying said transform to said model.
16. A carrier medium carrying computer readable instructions for controlling a computer to carry out the method of any preceding claim.
17. A speech processing apparatus, said apparatus comprising: a receiver for receiving a speech input from a first speaker which comprises a sequence of observations; a processor configured to: determine the likelihood of a sequence of words arising from the sequence of observations using an acoustic model and a language model, comprising: provide an acoustic model for performing speech recognition on an input signal which comprises a sequence of observations, wherein said model has been trained to recognise speech from a different speaker or speakers, said model having a plurality of model parameters relating to the probability distribution of a word or part thereof being related to an observation; adapt the acoustic model trained for a different speaker or speakers to the new speaker; and determine the likelihood of a sequence of observations occurring in a given language using a language model; and an output configured to output a sequence of words identified from said speech input signal, wherein adapting the model to the new speaker comprises: calculating adaptive statistics, said adaptive statistics being generated by comparing the speech of the new speaker with that of the acoustic model trained for other speakers; determining prior statistics, said prior statistics derived from a prior transform which models the differences between speakers based on heuristic knowledge of the differences in acoustic realisations between speakers; and interpolating said adaptive statistics and selected prior statistics to produce smoothed statistics and using said smoothed statistics to estimate a new transform and applying said transform to said model.
GB201007524A 2010-05-05 2010-05-05 A speech processing system and method Expired - Fee Related GB2480084B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB201007524A GB2480084B (en) 2010-05-05 2010-05-05 A speech processing system and method

Publications (3)

Publication Number Publication Date
GB201007524D0 GB201007524D0 (en) 2010-06-23
GB2480084A true GB2480084A (en) 2011-11-09
GB2480084B GB2480084B (en) 2012-08-08

Family

ID=42314880

Family Applications (1)

Application Number Title Priority Date Filing Date
GB201007524A Expired - Fee Related GB2480084B (en) 2010-05-05 2010-05-05 A speech processing system and method

Country Status (1)

Country Link
GB (1) GB2480084B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2020064197A (en) * 2018-10-18 2020-04-23 コニカミノルタ株式会社 Image forming device, voice recognition device, and program

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1996022514A2 (en) * 1995-01-20 1996-07-25 Sri International Method and apparatus for speech recognition adapted to an individual speaker
US5737487A (en) * 1996-02-13 1998-04-07 Apple Computer, Inc. Speaker adaptation based on lateral tying for large-vocabulary continuous speech recognition
WO2001063596A2 (en) * 2000-02-25 2001-08-30 Speechworks International, Inc. Automatically retraining a speech recognition system
US20060074630A1 (en) * 2004-09-15 2006-04-06 Microsoft Corporation Conditional maximum likelihood estimation of naive bayes probability models
US20060074657A1 (en) * 2004-10-01 2006-04-06 Microsoft Corporation Transformation and combination of hidden Markov models for speaker selection training

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
http://mi.eng.cam.ac.uk/~ff257/pdf/flego_gales_icassp09.pdf *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103871403A (en) * 2012-12-13 2014-06-18 北京百度网讯科技有限公司 Method of setting up speech recognition model, speech recognition method and corresponding device
CN103871403B (en) * 2012-12-13 2017-04-12 北京百度网讯科技有限公司 Method of setting up speech recognition model, speech recognition method and corresponding device
CZ307393B6 (en) * 2013-06-26 2018-07-25 Speechtech, S.R.O. A device for real time speech to text conversion

Also Published As

Publication number Publication date
GB201007524D0 (en) 2010-06-23
GB2480084B (en) 2012-08-08

Similar Documents

Publication Publication Date Title
US8620655B2 (en) Speech processing system and method
US9043213B2 (en) Speech recognition and synthesis utilizing context dependent acoustic models containing decision trees
JP5418223B2 (en) Speech classification device, speech classification method, and speech classification program
US8595006B2 (en) Speech recognition system and method using vector taylor series joint uncertainty decoding
EP1457968B1 (en) Noise adaptation system of speech model, noise adaptation method, and noise adaptation program for speech recognition
Afify et al. Stereo-based stochastic mapping for robust speech recognition
Stouten et al. Model-based feature enhancement with uncertainty decoding for noise robust ASR
GB2464093A (en) Adapting a speech recognition model
Deligne et al. A robust high accuracy speech recognition system for mobile applications
JP5180928B2 (en) Speech recognition apparatus and mask generation method for speech recognition apparatus
GB2480084A (en) An adaptive speech processing system
Zgank et al. Predicting the acoustic confusability between words for a speech recognition system using Levenshtein distance
US9311916B2 (en) Apparatus and method for improving voice recognition
Liu Environmental adaptation for robust speech recognition
Buera et al. Unsupervised data-driven feature vector normalization with acoustic model adaptation for robust speech recognition
JP5740362B2 (en) Noise suppression apparatus, method, and program
Seltzer et al. Training wideband acoustic models using mixed-bandwidth training data for speech recognition
Kim et al. Speech feature mapping based on switching linear dynamic system
GB2480085A (en) An adaptive speech recognition system and method using a cascade of transforms
Han et al. Switching linear dynamic transducer for stereo data based speech feature mapping
Wang Model-based approaches to robust speech recognition in diverse environments
Park et al. Achieving a reliable compact acoustic model for embedded speech recognition system with high confusion frequency model handling
Munteanu et al. Robust Romanian language automatic speech recognizer based on multistyle training
Tryfou et al. A reassigned front-end for speech recognition
Betkowska Cavalcante et al. Robust speech recognition in the car environment

Legal Events

Date Code Title Description
PCNP Patent ceased through non-payment of renewal fee

Effective date: 20230505