GB2401469A - Pattern recognition - Google Patents

Pattern recognition

Info

Publication number
GB2401469A
Authority
GB
United Kingdom
Prior art keywords
pattern recognition
model
recognition model
noise
observations
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
GB0310720A
Other versions
GB2401469B (en)
GB0310720D0 (en)
Inventor
Murray Holt
Konstantinos Koumpis
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Domain Dynamics Ltd
Original Assignee
Domain Dynamics Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Domain Dynamics Ltd filed Critical Domain Dynamics Ltd
Priority to GB0310720A priority Critical patent/GB2401469B/en
Publication of GB0310720D0 publication Critical patent/GB0310720D0/en
Publication of GB2401469A publication Critical patent/GB2401469A/en
Application granted granted Critical
Publication of GB2401469B publication Critical patent/GB2401469B/en
Anticipated expiration legal-status Critical
Expired - Fee Related legal-status Critical Current


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L 15/142 Hidden Markov Models [HMMs]
    • G10L 15/144 Training of HMMs
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Complex Calculations (AREA)

Abstract

A speech recognition system uses hidden Markov models. Noise-robust hidden Markov models are generated by measuring clean speech samples under quiet conditions, measuring background noise samples, synthesising noisy speech samples using the clean speech samples and the noise samples, obtaining a set of clean speech hidden Markov models and refining these models based on variance re-estimation using the synthesised noisy speech samples.

Description

Pattern recognition
The present invention relates to pattern recognition particularly, but not exclusively, to speech recognition.
Pattern recognition may be used in many different applications involving different types or sources of signal. For example, pattern recognition may be used to analyse and enhance images and to monitor industrial machinery.
Pattern recognition can also be applied to speech. For example, speech recognition may be used to convert spoken words and phrases into text for entering data into a word-processing application or into commands for controlling a device such as a computer.
Speech recognition typically involves providing a sample of speech, extracting features from the sample and analysing the features to identify patterns corresponding to given words or phrases using speech models.
Speech recognition systems generally suffer the drawback that they require a quiet environment to operate. However, such conditions are rarely found even in the home and office. It is desirable that a speech recognition system can tolerate noise and thus be used in noisy surroundings such as in a factory, on private or public transport, or in a crowd.
Portable electronic devices, such as palm-held computers and mobile telephone handsets, can potentially benefit from speech recognition. These types of devices can be used in many different environments. Thus, it is also desirable that a speech recognition system can tolerate a wide variety of noise and a wide range of noise intensity.
One approach to achieving noise tolerance involves filtering the input signal in an attempt to remove background noise. However, this is liable to reduce the discriminatory features of the speech and reduce recognition accuracy.
Another approach is to employ ad-hoc model adaptation. This increases processing overheads, which can increase recognition delays and enlarge the footprint of a voice-interfaced application. Also, model adaptation techniques need to be carefully designed to maintain stability. This increases development overheads.
Yet another approach to achieving noise robustness is to switch between models or filters according to different noise conditions. However, switching can be distracting or inconvenient to the user.
Finally, another approach involves training the speech recognition system in different environments and using different speaking styles. This increases development overheads since it necessitates collecting large corpora of speech recordings in different noisy environments. Multi-style training is described in "Multi-Style Training for Robust Isolated-Word Speech Recognition", by R. P. Lippmann, E. A. Martin and D. B. Paul, Proc. IEEE ICASSP-87, pp. 705-708, Dallas, Texas, April 1987.
The present invention seeks to provide a method of training pattern recognition models.
According to a first aspect of the present invention there is provided a method of training a pattern recognition model, the method comprising training a pattern recognition model using a first set of observations derived from a relatively noise-free signal and refining the pattern recognition model using a second set of observations derived from a relatively noisy signal obtained from the relatively noise-free signal and a noise signal.
The method may comprise training the pattern recognition model using a first plurality of sets of observations, each set of observations derived from a respective relatively noise-free signal, and refining the pattern recognition model using a second plurality of sets of observations, each set of observations derived from a respective relatively noisy signal obtained from a corresponding relatively noise-free signal and a noise signal.
The method may further comprise combining the substantially noise-free signal and the noise signal to obtain the relatively noisy signal. The method may comprise filtering the substantially noise-free signal to produce an enhanced substantially noise-free signal and combining the enhanced substantially noise-free signal and the noise signal to obtain the relatively noisy signal. Filtering the substantially noise-free signal may comprise using a minimum mean-square error short-time spectral amplitude estimator.
The method may include using a statistical model as the pattern recognition model, such as a hidden Markov model, and using continuous observation densities to model observations.
Training the pattern recognition model may comprise determining model parameters, the model parameters including, for each of at least one model state, respective parameters for defining a statistical distribution of observations associated with a model state. Refining the pattern recognition model may comprise re-estimating only some of the parameters for each of the model states. Refining the pattern recognition model may comprise re-estimating statistical distribution spread parameters, but not re-estimating statistical distribution position parameters.
Training the pattern recognition model may comprise determining model parameters, the model parameters including, for each of at least one model state, respective plural sets of parameters for defining corresponding plural statistical distributions of observations associated with a model state and plural weighting parameters, the weighting parameters for allowing combination of the plural statistical distributions for producing a mixture statistical distribution. Refining the pattern recognition model may comprise re-estimating statistical distribution spread parameters, but not re-estimating statistical distribution position parameters and not re-estimating weighting parameters. Training the pattern recognition model may comprise determining model parameters, the model parameters including, for each of at least one model state, respective at least one set of vectors for defining a statistical distribution of observation vectors emitted by a model state. Training the pattern recognition model may comprise determining model parameters, the model parameters including, for each at least one of the model states, a respective plurality of sets of vectors and plurality of sets of weighting parameters, each set of vectors for defining a statistical distribution of observation vectors, the statistical distributions being combinable according to the weighting parameters to form a mixture of distributions associated with a model state. Refining the pattern recognition model may comprise not re-estimating the weighting parameters. Each set of vectors may include a mean vector for defining a mean of the statistical distribution of observation vectors and a variance vector for defining a variance of the statistical distribution of observation vectors. Refining the pattern recognition model may comprise re-estimating the variance vector and not re-estimating the mean vector. Training the pattern recognition model may comprise determining model parameters, the model parameters including a set of state transition probabilities. Refining the pattern recognition model may comprise not re-estimating the set of state transition probabilities.
According to a second aspect of the present invention there is provided a method of training a plurality of pattern recognition models, the method comprising training each pattern recognition model with respective first and second sets of observations using the method of training a pattern recognition model.
According to a third aspect of the present invention there is provided a method of training a plurality of pattern recognition models, the method comprising training each pattern recognition model with respective first and second pluralities of sets of observations using the method of training a pattern recognition model.
The method may comprise training and refining the plurality of pattern recognition models collectively. Refining the plurality of pattern recognition models collectively may comprise using embedded re-estimation.
Each pattern recognition model may be associated with a different pattern and may be associated with a different phrase or word.
The method may comprise generating a further pattern recognition model using the first and second sets of observations or the first and second pluralities of sets of observations, the further recognition model being associated with background noise. The method may comprise training a background noise recognition model using a set of observations derived from the noise signal, and replacing the further pattern recognition model with the background noise recognition model.
According to a fourth aspect of the present invention there is provided a method of generating a pattern recognition model, the method comprising providing a pattern recognition model trained using a first set of observations derived from a relatively noise-free signal and refining the pattern recognition model using a second set of observations derived from a relatively noisy signal obtained from the relatively noise-free signal and a noise signal.
According to a fifth aspect of the present invention there is provided a method of training a speech recognition system using the method of training a pattern recognition model.
According to a sixth aspect of the present invention there is provided a computer program comprising program instructions for causing a computer to perform the method.
According to a seventh aspect of the present invention there is provided a pattern recognition model generated by the method.
The pattern recognition model may be carried on an electrical or optical carrier signal, stored in memory or held on a server.
According to an eighth aspect of the present invention there is provided apparatus for generating a pattern recognition model comprising means for training a pattern recognition model using a first set of observations derived from a relatively noise-free signal and means for refining the pattern recognition model using a second set of observations derived from a relatively noisy signal obtained from the substantially noise-free signal and a noise signal.
Embodiments of the present invention will now be described with reference to the accompanying drawings, in which:
Figure 1 is a schematic diagram of a speech recognition system;
Figure 2 is a schematic diagram of a speech recognition device;
Figure 3 is a schematic diagram of a model trainer;
Figure 4 illustrates a process of generating noise-robust models;
Figure 5 shows an example of a speech signal;
Figure 6 shows sets of clean speech signals and digital representations of the sets of clean speech signals;
Figure 7 shows an example of a noise signal;
Figure 8 shows sets of noise signals and digital representations of the sets of noise signals;
Figure 9 illustrates an enhancing process;
Figure 10 shows sets of digital enhanced clean speech signals;
Figure 11 illustrates a combining process;
Figure 12 shows a set of digital noisy speech signals;
Figure 13 illustrates a framing process;
Figure 14 illustrates a transforming process;
Figure 15 shows a set of feature vectors;
Figure 16 shows sets of clean speech feature vectors;
Figure 17 shows sets of noisy speech feature vectors;
Figure 18 shows sets of background noise feature vectors;
Figure 19 illustrates a model training and optimisation process;
Figure 20 shows a hidden Markov model;
Figure 21 illustrates a set of clean speech models;
Figure 22 is a process diagram of a model training and optimisation process;
Figure 23 illustrates a model re-estimation process;
Figure 24 is a process diagram of model re-estimation;
Figure 25 illustrates a set of noise-robust speech models;
Figure 26 illustrates a model combining process;
Figure 27 shows a set of background noise models;
Figure 28 illustrates a combined set of noise-robust speech models;
Figure 29 is a schematic diagram of a speech recognition architecture; and
Figure 30 illustrates a process of speech recognition using noise-robust speech models.
Referring to Figure 1, a pattern recognition system is shown. The pattern recognition system is configured to recognise speech and so is referred to hereinafter as a speech recognition system.
The speech recognition system 1 comprises a device 2 for performing speech recognition and a trainer 3 for building models for use in speech recognition. In this example, the device 2 and trainer 3 are separate devices, although they may be integrated into a single device. The device 2 is in the form of a personal digital assistant (PDA) and the trainer 3 is in the form of a desktop personal computer.
Speech recognition comprises two stages, namely training and testing.
During training, one or more users 4 are given prompts 5 to provide speech samples 6, for example in the form of phrases, words or sub-word units. An advantage of using more than one user 4 is that it permits the definition of speaker-independent models.
To provide so-called "clean" speech samples, the user 4 and the device 2 are placed in a quiet or substantially silent environment, such as an anechoic chamber, where the user 4 speaks.
To provide background noise samples 8, the device 2 is taken into one or more noisy environments each having one or more noise sources 7.
The samples 6, 8 are recorded and digitised to form digital samples 9 which are fed, together with a list 10 of prompts, into the trainer 3. The trainer 3 processes the digital samples 9 and generates noise-robust speech models 11, preferably in the form of hidden Markov models, which are uploaded to the device 2 or other similar devices (not shown).
The device 2 holds the speech models 11 ready to test future speech samples.
During testing, a user 4 is given prompts 5 to provide speech samples 6. The user 4 may be located in a quiet or noisy environment. The speech samples 6 are recorded, digitised, processed and compared with the speech models 11 to determine a best match in an attempt to recognise any phrase, word or sub-word spoken by the user 4.
Referring to Figure 2, the speech recognition device 2 is shown in more detail.
The device 2 includes a microphone 12 into which a user may provide a spoken response and which converts a sound signal into an electrical signal, an amplifier 13 for amplifying the electrical signal, an analog-to-digital (A/D) converter 14 for sampling the amplified signal and generating a digital signal, a filter 15, a processor 16 for performing signal processing on the digital signal and controlling the device 2, volatile memory 17, non-volatile memory 18 and, optionally, storage 19 in the form of a hard disk drive, a removable drive and/or a drive for receiving a removable disk. In this example, the A/D converter 14 samples the amplified signal at 11025 Hz and provides a mono, linear, 16-bit pulse code modulation (PCM) representation of the signal.
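By way of illustration, such a raw 16-bit mono PCM stream can be read into a normalised float array as in the following sketch. The file name and little-endian byte order are our assumptions, not details from the patent.

    import numpy as np

    SAMPLE_RATE = 11025  # Hz, as used by the A/D converter 14
    pcm = np.fromfile("response.raw", dtype="<i2")   # signed 16-bit mono samples (assumed little-endian)
    signal = pcm.astype(np.float32) / 32768.0        # normalise to [-1, 1)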
The device 2 further includes a digital-to-analog (D/A) converter 20, another amplifier 21, a speaker 22 for providing audio prompts to the user 4 and a display 23 for providing text prompts to the user 4. The device 2 also includes an interface 24, such as a keyboard or touch pad. The device 2 also includes input/output interfaces 25, such as wired and/or wireless networking interfaces.
Referring to Figure 3, the speech model trainer 3 is shown in more detail.
The speech model trainer 3 includes a processor 26 for performing signal processing and controlling the trainer 3, volatile memory 27, non-volatile memory 28 and storage 29 in the form of a hard disk drive, a removable drive and/or a drive for receiving a removable disk. A digital signal processor 30 may also be provided. The trainer 3 further includes a display 31, an interface 32, such as a keyboard and mouse, a display 33 and an interface 34.
A method of training speech models according to the present invention will now be described.
Referring to Figure 1, the device 2 prompts the user 4 to say a plurality of phrases, words and/or other units of speech. Preferably, the user 4 repeats his or her response so as to provide a plurality of samples for each phrase, word or other speech unit. Other speech units may include numbers, for instance "56", punctuation, such as "comma", letters, for example "a", and sub-word units, such as "ch".
Referring to Figure 5, an example of a user response 34 is shown, which includes leading and trailing silences.
Referring to Figures 4 and 6, each user response 34 is captured by the microphone 12 (Figure 1) and a corresponding signal 35_{1,1,1}, ..., 35_{P,N,M} is generated (step S1). A signal captured in a quiet or substantially silent environment, without appreciable background noise, is referred to as a "clean speech signal". A clean speech signal is generated for each of a number of users (P), for each of a number of samples (N), for each of a number (M) of phrases. However, the number of samples N may vary from user to user and, for the same user, from word to word. The number of words or phrases M may vary from user to user. Sets of clean speech signals 35_{1,1,1}, ..., 35_{P,N,M} are thus generated.
Each clean speech signal 35_{1,1,1}, ..., 35_{P,N,M} is amplified, filtered and sampled to produce a corresponding digital clean speech signal 36_{1,1,1}, ..., 36_{P,N,M}, forming sets of digital clean speech signals (step S2). Amplification and filtration are optional.
Referring to Figure 7, an example of a noise sample 37 is shown.
Referring to Figures 4 and 8, noise samples 37 are captured and corresponding signals 38_{1,1}, ..., 38_{R,Q} are generated (step S3). A signal 38_{1,1}, ..., 38_{R,Q} without any intended speech is referred to as a "background noise signal". A background noise signal is generated for each of a number (R) of environments, for each of a number (Q) of samples. Environments in which the device 2 is expected to operate are selected, such as in a car, train, office or street. A single sample 38_1, ..., 38_R may suffice for each environment, i.e. Q = 1.
Each background noise signal 38_{1,1}, ..., 38_{R,Q} is amplified, filtered and sampled to produce a corresponding digital background noise signal 39_{1,1}, ..., 39_{R,Q}, forming a set 39 of background noise signals (step S4). Amplification and filtration are optional.
Preferably, any pre-processing of the signals 35_{1,1,1}, ..., 35_{P,N,M}, 38_{1,1}, ..., 38_{R,Q} obtained during training is the same as any pre-processing performed during testing.
Generating noise-robust speech models
The trainer 3 processes the digital clean speech signals 36_{1,1,1}, ..., 36_{P,N,M} (Figure 6) and the digital noise signals 39_{1,1}, ..., 39_{R,Q} (Figure 8) to obtain noise-robust speech models.
Referring again to Figure 4, processing includes enhancing the digital clean speech signals 36_{1,1,1}, ..., 36_{P,N,M} (Figure 6) (step S5) and combining the enhanced digital clean speech signals 43_{1,1,1}, ..., 43_{P,N,M} (Figure 10) with the background noise signals 38_{1,1}, ..., 38_{R,Q} (Figure 8) to synthesise digital noisy speech signals 48_{1,1,1,1,1}, ..., 48_{P,N,M,R,Q} (Figure 12) (step S6).
Processing also includes generating clean speech feature vectors 54_{1,1,1}, ..., 54_{P,N,M} (Figure 16) from the digital clean speech signals 36_{1,1,1}, ..., 36_{P,N,M} (Figure 6) (step S7) and generating clean speech models 62_1, ..., 62_{M+1} (Figure 21) (step S8).
Processing also includes generating noisy speech feature vectors 55_{1,1,1,1,1}, ..., 55_{P,N,M,R,Q} (Figure 17) from the digital noisy speech signals 48_{1,1,1,1,1}, ..., 48_{P,N,M,R,Q} (Figure 12) (step S9), which are used to modify the clean speech models 62_1, ..., 62_M (Figure 21) to produce noise-robust speech models 66_1, ..., 66_{M+1} (Figure 25) (step S10).
Optionally, processing may include generating background noise feature vectors 56_{1,1}, ..., 56_{R,Q} (Figure 18) from the digital background noise signals 39_{1,1}, ..., 39_{R,Q} (Figure 8) (step S11) and generating silence models 70_1, ..., 70_Q (Figure 27) (step S12). The silence model 66_{M+1}, which forms one of the noise-robust speech models 66_1, ..., 66_{M+1}, may be replaced with a silence model 70_1, ..., 70_Q.
Finally, the noise-robust speech models 66_1, ..., 66_{M+1} are stored (step S13).
These processes will now be described in more detail.
Filtering signals
A process of filtering signals may be used to enhance the digital clean speech signals 36_{1,1,1}, ..., 36_{P,N,M} (Figure 6) and/or the digital noise signals 39_{1,1}, ..., 39_{R,Q} (Figure 8).
Referring to Figure 9, a digital signal 40, such as a digital clean speech signal 36_{1,1,1}, ..., 36_{P,N,M} (Figure 6), is filtered to produce an enhanced signal 41 using a filtering process 42. A reason for filtering the signal 40 is to attenuate known noise introduced by the microphone 12 (Figure 1) or other parts of the device 2 (Figure 1).
A preferred filtering process is based on modelling the speech and noise spectral components of the digital signal 40 as statistically independent Gaussian random variables. The process comprises estimating a short-time spectral amplitude (STSA) of the signal 40, deriving a complex exponential of the noisy phase where speech is absent and combining the two. The process is described in "Speech Enhancement Using a Minimum Mean-Square Error Short-Time Spectral Amplitude Estimator", by Y. Ephraim and D. Malah, IEEE Transactions on Acoustics, Speech, and Signal Processing, Volume ASSP-32, Number 6, pp. 1109 to 1121 (December 1984).
Referring to Figure 10, the trainer 3 filters the digital clean speech signals 36_{1,1,1}, ..., 36_{P,N,M} (Figure 6) in the manner just described so as to form enhanced digital clean speech signals 43_{1,1,1}, ..., 43_{P,N,M} (step S5). Alternatively, however, the digital noise signals 39_{1,1}, ..., 39_{R,Q} (Figure 8) may be filtered.
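As a minimal sketch, the gain function from the Ephraim and Malah paper can be computed per frequency bin as below. It assumes the signal has already been transformed to the short-time spectral domain and that a priori and a posteriori SNR estimates are available (for example via the decision-directed approach, not shown); the function and variable names are ours, not the patent's.

    import numpy as np
    from scipy.special import i0e, i1e  # exponentially scaled modified Bessel functions

    def mmse_stsa_gain(xi, gamma):
        # xi:    a priori SNR estimate per frequency bin
        # gamma: a posteriori SNR estimate per frequency bin
        v = xi * gamma / (1.0 + xi)
        # exp(-v/2) * I0(v/2) == i0e(v/2); the scaled Bessels avoid overflow
        bessel_term = (1.0 + v) * i0e(v / 2.0) + v * i1e(v / 2.0)
        return (np.sqrt(np.pi) / 2.0) * (np.sqrt(v) / gamma) * bessel_term

    # The enhanced spectral amplitude is the gain times the noisy amplitude:
    # A_hat = mmse_stsa_gain(xi, gamma) * noisy_magnitude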
Combining signals
A process of combining signals, such as the enhanced digital clean speech signals 43_{1,1,1}, ..., 43_{P,N,M} (Figure 10) and the digital noise signals 39_{1,1}, ..., 39_{R,Q} (Figure 8), may be used to synthesise digital noisy speech signals 48_{1,1,1,1,1}, ..., 48_{P,N,M,R,Q} (Figure 12).
Referring to Figure 11, a first digital signal 44, such as an enhanced digital clean speech signal 43_{1,1,1}, ..., 43_{P,N,M} (Figure 10), and a second digital signal 45 are combined to form a combined signal 46 using a combining process 47. Preferably, the combining process 47 is an adding process. However, weighted adding processes or averaging processes may also be used. In particular, weighted adding can be applied to obtain a combined signal with a desired signal-to-noise ratio.
Referring to Figure 12, the trainer 3 combines each of the enhanced digital clean speech signals 43_{1,1,1}, ..., 43_{P,N,M} (Figure 10) with each digital noise signal 39_{1,1}, ..., 39_{R,Q} (Figure 8) in the manner just described so as to synthesise the digital noisy speech signals 48_{1,1,1,1,1}, ..., 48_{P,N,M,R,Q} (step S6).
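A minimal sketch of such a weighted adding process is given below; it scales the noise to hit a target SNR before adding. The formula and names are illustrative assumptions, as the patent does not fix a particular weighting.

    import numpy as np

    def mix_at_snr(clean, noise, snr_db):
        # Repeat or truncate the noise so it covers the whole utterance
        noise = np.resize(noise, clean.shape)
        p_clean = np.mean(clean.astype(np.float64) ** 2)   # speech power
        p_noise = np.mean(noise.astype(np.float64) ** 2)   # noise power
        # Scale the noise so that 10*log10(p_clean / p_scaled_noise) == snr_db
        scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
        return clean + scale * noise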
Parameterising signals
A process of parameterising signals is applied to the digital clean speech signals 36_{1,1,1}, ..., 36_{P,N,M} (Figure 6), the digital noisy speech signals 48_{1,1,1,1,1}, ..., 48_{P,N,M,R,Q} (Figure 12) and, optionally, the digital noise signals 39_{1,1}, ..., 39_{R,Q} (Figure 8).
Referring to Figure 13, a digital signal 49, such as a digital clean speech signal 36_{1,1,1}, ..., 36_{P,N,M}, a digital noisy speech signal 48_{1,1,1,1,1}, ..., 48_{P,N,M,R,Q} or a digital noise signal 39_{1,1}, ..., 39_{R,Q}, is divided into frames 50, each having a constant duration of 50 ms, by a framing process 51. The framing process 51 may define overlapping frames 50.
Referring to Figure 14, each frame 50 is converted into a feature vector 52 using a feature transform 53.
The content of each feature vector 52 depends on the transform 53 used. In general, a feature vector 52 is a one-dimensional data structure comprising data related to acoustic information-bearing attributes of the frame 50. Typically, a feature vector 52 comprises a string of numbers, for example 10 to 50 numbers, which represent the acoustic features of the signal contained in the frame 50.
In this example, a mel-cepstral transform 53 is used. A mel-cepstral transform 53 is a cosine transform of the real part of a logarithmic-scale energy spectrum. A mel is a measure of the perceived pitch or frequency of a tone by the human auditory system.
Thus, in this example, for a sampling rate of 11025 Hz, each feature vector 52 comprises twelve signed 8-bit integers, typically representing the second to thirteenth calculated mel-cepstral coefficients. Data relating to energy (in dB) may be included as a 13th feature.
The transform 53 may also calculate first and second differentials, referred to as "delta" and "delta-delta" values.
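A hedged sketch of this parameterisation using librosa is shown below. librosa's MFCC pipeline differs in detail from the scheme described above (it returns coefficients starting at c0 rather than keeping c2 to c13 plus a dB energy term), so treat it as an approximation; the frame and hop lengths are our choices.

    import librosa
    import numpy as np

    signal, sr = librosa.load("response.wav", sr=11025)
    # ~50 ms frames (551 samples at 11025 Hz) with 50% overlap
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13, n_fft=551, hop_length=275)
    delta = librosa.feature.delta(mfcc)              # "delta" coefficients
    delta2 = librosa.feature.delta(mfcc, order=2)    # "delta-delta" coefficients
    features = np.vstack([mfcc, delta, delta2]).T    # one 39-dimensional vector per frame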
Further details regarding mel-cepstral transforms may be found in "Fundamentals of Speech Recognition" by L. Rabiner and B.-H. Juang (Prentice Hall, 1993) and also in "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences" by S. B. Davis and P. Mermelstein, IEEE Transactions on Acoustics, Speech and Signal Processing, Volume 28, pp. 357 to 366 (1980).
Other transforms may be used. For example, a linear predictive coefficient (LPC) transform may be used in conjunction with a regression algorithm so as to produce LPC cepstral coefficients. Alternatively, a TESPAR transform may be used.
The linear predictive coefficient (LPC) transform is described in "Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification" by B. S. Atal, Journal of the Acoustical Society of America, Vol. 55, pp. 1304-1312, June 1974. Further details regarding the TESPAR transform may be found in GB-B2162025.
Referring to Figure 15, a set 54 of feature vectors 52_1, ..., 52_T is shown. The feature vectors 52_1, ..., 52_T corresponding to a signal 34, 37 are used as a sequence of observations to train a model.
Referring to Figure 16, the digital clean speech signals 36_{1,1,1}, ..., 36_{P,N,M} (Figure 6) are enframed and transformed in the manner just described using processes 51, 53 to produce clean speech feature vector sets 54_{1,1,1}, ..., 54_{P,N,M}, referred to hereinafter as clean speech features.
Referring to Figure 17, the trainer 3 enframes and transforms the digital noisy speech signals 48_{1,1,1,1,1}, ..., 48_{P,N,M,R,Q} (Figure 12) in the manner just described using processes 51, 53 to produce noisy speech feature vector sets 55_{1,1,1,1,1}, ..., 55_{P,N,M,R,Q}, referred to hereinafter as noisy speech features.
Referring to Figure 18, the trainer 3 may enframe and transform the digital noise signals 39_{1,1}, ..., 39_{R,Q} (Figure 8) in the manner just described using processes 51, 53 to produce noise feature vector sets 56_{1,1}, ..., 56_{R,Q}, referred to hereinafter as background noise features.
Modelling
Referring to Figure 19, a plurality of sets 57_{1,1,1}, ..., 57_{Z,Y,X} of feature vectors, such as the clean speech feature vector sets 54_{1,1,1}, ..., 54_{P,N,M} (Figure 16), are fed into a process 58 for training and optimising a set of models 59_1, ..., 59_{X+1}. There is a set of feature vectors 57_{1,1,1}, ..., 57_{Z,Y,X} for each of Z users, for each of Y repetitions of a phrase, for each of X phrases. The modelling process produces a set of X phrase models 59_1, ..., 59_X and a silence model 59_{X+1}. Preferably, hidden Markov models 59_1, ..., 59_{X+1} are used, although other statistical models may be employed.
Referring to Figure 20, an example of a hidden Markov model 59 in the form of a left-to-right, whole-word, Gaussian mixture model is shown. A hidden Markov model of this form is preferably used to characterise each respective phrase and, optionally, to characterise silence.
The model 59 may be considered as a finite state machine which changes state 60_1, ..., 60_N once every time unit. Thus, at time t there is a change of state, and the probability associated with a transition from state i to state j is a_ij. Each time a new state j is entered, a feature vector o_t is generated according to a continuous observation density b_j(o_t). Thus, a sequence of feature vectors 61_1, ..., 61_T is generated. In this example, the first and last states 60_1, 60_N are non-emitting, although this need not be the case.
The hidden Markov model 59 is defined in terms of a number of parameters, including the type of observation vector used, the number of states N, the number of data streams S, the number of mixture components M_{js} in each stream and a transition matrix A = {a_ij}. The parameters also include, for each emitting state (labelled j), each stream (labelled s) and each mixture (labelled m), a mixture component weight c_{jsm}, a mean mu_{jsm} and a covariance matrix Sigma_{jsm}. Other parameters may be included, such as a stream weight vector. A continuous observation density b_j(o_t) is modelled using Gaussian mixture densities.
Each observation o_t at time t is split into a number of independent data streams o_{st} and modelled with a probability density function of the form:

$$b_j(\mathbf{o}_t) = \prod_{s=1}^{S}\left[\sum_{m=1}^{M_s} c_{jsm}\, K\!\left(\mathbf{o}_{st};\, \boldsymbol{\mu}_{jsm}, \boldsymbol{\Sigma}_{jsm}\right)\right]^{\gamma_s} \qquad (1)$$

where M_s is the number of mixture components in stream s and c_{jsm} is the weight of the m-th component in the s-th stream in the j-th state.
K is the multivariate Gaussian with mean vector mu and covariance matrix Sigma, namely:

$$K(\mathbf{o};\, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{\sqrt{(2\pi)^{n}\,|\boldsymbol{\Sigma}|}}\, e^{-\frac{1}{2}(\mathbf{o}-\boldsymbol{\mu})^{\mathsf T}\boldsymbol{\Sigma}^{-1}(\mathbf{o}-\boldsymbol{\mu})} \qquad (2)$$

where n is the dimensionality of o.
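As a concrete illustration, the following numpy sketch evaluates equations (1) and (2) for a single-stream state with diagonal covariances (a common HTK configuration); the shapes and names are ours.

    import numpy as np

    def gaussian_pdf(o, mean, var):
        # Equation (2) with a diagonal covariance matrix, dimensionality n
        n = o.shape[-1]
        norm = np.sqrt((2.0 * np.pi) ** n * np.prod(var))
        return np.exp(-0.5 * np.sum((o - mean) ** 2 / var, axis=-1)) / norm

    def output_density(o, weights, means, variances):
        # Equation (1) for a single stream (S = 1): a weighted sum of M Gaussians
        # weights: (M,), means: (M, D), variances: (M, D)
        return sum(w * gaussian_pdf(o, m, v)
                   for w, m, v in zip(weights, means, variances))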
During training, for each phrase, a model is found by choosing parameters that best fit an observed sequence of feature vectors.
During testing, an observed sequence of feature vectors is compared with different phrase models to determine which model best fits the observed sequence and, thus, which phrase is most likely.
In this example, the hidden Markov models are built using the Hidden Markov Model Toolkit (HTK) (version 3.1), described in "The HTK Book for HTK Version 3.1" by S. Young et al. These may be obtained via the WWW at http://htk.eng.cam.ac.uk or from Cambridge University Engineering Department, University of Cambridge, Trumpington Street, Cambridge, CB2 1PZ, United Kingdom.
Hidden Markov model parameters are described in more detail in Chapter 7 of "The HTK Book", ibid.
Referring to Figures 21 and 22, a process by which clean speech models 62_1, ..., 62_M, 62_{M+1} are generated is shown.
Starting with a first phrase (step S15), a prototype model is produced and initialised (step S16). For example, this may be achieved using the HInit module in HTK.
Each model 62_1, ..., 62_M, 62_{M+1} is trained using Baum-Welch re-estimation until the model converges (step S17). For example, this may be achieved using the HRest module in HTK. Steps S16 and S17 are repeated for each phrase (steps S18 and S19).
The full set of models 62_1, ..., 62_M, 62_{M+1} is refined using embedded re-estimation (step S20). For example, this may be achieved using the HERest module in HTK.
Hidden Markov model parameter estimation is described in more detail in Chapter 8 of "The HTK Book for version 3.1", ibid. Parameter re-estimation formulae are given in Section 8.7 of "The HTK Book for version 3.1", ibid. The clean speech models 62_1, ..., 62_M, 62_{M+1} are output (step S21).
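For orientation, the clean-speech training procedure (steps S16 to S20) maps onto HTK command lines roughly as follows. This is a sketch from memory of the HTK tools named above; the file names are illustrative and the exact options should be checked against the HTK Book.

    HInit  -S trainlist -l phrase01 -o phrase01 proto     # initialise from a prototype (step S16)
    HRest  -S trainlist -l phrase01 phrase01              # Baum-Welch until convergence (step S17)
    # ... repeat HInit/HRest for each phrase (steps S18 and S19) ...
    HERest -S trainlist -I labs -H dir1/hmacs -M dir2 hmmlist   # embedded re-estimation (step S20)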
Steps S15 to S21 may be repeated a number of times. Each time, a new prototype model and a new set of initial model parameters may be chosen. This may include choosing a new model topology, a model with a fewer or greater number of states, a model using different numbers of mixtures and/or a model initialised with different model parameters. For each repetition, a respective set of clean speech models is generated; these sets may be compared against one another and tested on an independent data set so as to determine the best set of clean speech models. The choice of new prototype model and set of initial model parameters may depend upon a particular word or phrase, or the length or complexity of a word or phrase.
Hidden Markov models are also described in more detail in "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition" by L. Rabiner, Proceedings of the IEEE, volume 77, Number 2, pages 257 to 286 (February 1989).
Noise-robust model generation
Referring to Figures 23, 24 and 25, a set of models 59_1, ..., 59_{X+1}, in this case the set of clean speech models 62_1, ..., 62_M, 62_{M+1} (Figure 21), and further sets of feature vectors 63_{1,1,1}, ..., 63_{F,E,D}, in this case the noisy speech features 55_{1,1,1,1,1}, ..., 55_{P,N,M,R,Q} (Figure 17), are fed into a process 64 for training and optimising a new set of models 65_1, ..., 65_{X+1}. The parameters within the set of models 59_1, ..., 59_{X+1} are used as initial estimates and the feature vectors 63_{1,1,1}, ..., 63_{F,E,D} as training data to generate the new set of models 65_1, ..., 65_{X+1} using embedded re-estimation (steps S22 to S24).
However, unlike the embedded re-estimation described earlier, the variances Sigma are updated, whereas the means mu, the mixture weights and the transition probabilities are not. This may be achieved using the HERest module in HTK, using the "-u" option with parameter "v", for example:

HERest -S trainlist -I labs -H dir1/hmacs -M dir2 -u v hmmlist

Thus, referring to Equation 8.2 in "The HTK Book for version 3.1", ibid., Sigma_{jsm} is re-estimated, whereas mu_{jsm} and c_{jsm} are not.
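In terms of the re-estimation formulae, this variance-only update amounts to the following numpy sketch for a single state and mixture component, with the state-occupation probabilities gamma taken from a forward-backward pass. The names and shapes are ours, and the means are frozen at their clean-speech values.

    import numpy as np

    def reestimate_variance(obs, gamma, mean):
        # obs:   (T, D) noisy-speech observation vectors
        # gamma: (T,)   occupation probabilities for this state/component
        # mean:  (D,)   frozen clean-speech mean vector
        diff = obs - mean                  # deviations about the fixed mean
        return (gamma[:, None] * diff ** 2).sum(axis=0) / gamma.sum()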
Referring again to Figure 23, the parameters of the clean speech models 62_1, ..., 62_M, 62_{M+1} (Figure 21) are used as initial estimates and the noisy speech feature vector sets 55_{1,1,1,1,1}, ..., 55_{P,N,M,R,Q} (Figure 17) are used as training data to produce a set 11 of noise-robust speech models 66_1, ..., 66_M, 66_{M+1} using the process 64.
In this way, a set of noise-robust speech models 66_1, ..., 66_M, 66_{M+1} is generated. The models have the advantage that, during testing, they can be used to recognise speech in noisy environments with improved accuracy. Furthermore, recognition accuracy in quiet conditions is not significantly reduced.
Combining models
Referring to Figure 26, a first set of models 65_1, ..., 65_{X+1}, such as the noise-robust speech models 66_1, ..., 66_{M+1} (Figure 25), and a second set of models 67_1, ..., 67_Y, such as the silence models 70_1, ..., 70_Q (Figure 27), may be selectively combined using a combination process 68 to produce a new set of models 69_1, ..., 69_Z.
Referring to Figures 27 and 28, the silence model 66_{M+1} (Figure 25) included in the noise-robust speech models 66_1, ..., 66_{M+1} (Figure 25) is replaced with a silence model 70_1, ..., 70_Q corresponding to the environment for which the noise-robust speech models were generated, so as to generate a combined set 11' of noise-robust speech models 66_1, ..., 66_M, 70_Q.
The set of combined noise-robust speech models 66_1, ..., 66_M, 70_Q is tested on the clean and noisy speech features 54_{1,1,1}, ..., 54_{P,N,M} (Figure 16) and 55_{1,1,1,1,1}, ..., 55_{P,N,M,R,Q} (Figure 17) for recognition accuracy. Several iterations may be required to determine an optimum silence model configuration for good recognition in both clean and noisy conditions.
The method of generating noise-robust speech models may be applied to existing speech models for which original training data is still available and so improve the models without significant effort or cost. For example, using samples of background noise, noisy speech samples may be synthesised and used to refine the existing speech models in the manner described earlier. This process may include enhancing either the original training data or the background noise data.
The processes described earlier are performed by the processor 26 (Figure 3) executing one or more computer programs. However, the processes may be implemented in hardware.
Speech recognition (Testing)
The noise-robust speech models 66_1, ..., 66_M, 66_{M+1} (Figure 25) or the combined noise-robust speech models 66_1, ..., 66_M, 70_Q are stored in the speech recognition device 2 (Figure 1), ready to be used in speech recognition, i.e. testing.
Referring to Figure 29, a speech recognition architecture 70 is shown, which comprises a hardware layer 72, a recogniser layer 73 and an application interface layer 74. The hardware layer includes the microphone 12 for receiving speech and memory 18 or storage 19 for storing the noise-robust speech models 66_1, ..., 66_M, 66_{M+1}.
A method of speech recognition will now be described.
Referring to Figures 29 and 30, the recogniser layer 73 receives from the application layer 74 a list 75 of valid phrases associated with a particular stage of an application.
The active vocabulary is identical to, or a subset of, the vocabulary used in the model training described above (step S25).
A grammar 76 is generated (step S26). The grammar 76 places constraints on the ordering of phrases. In this example, a simple grammar is used, preferably of the form <silence> <active phrase> <silence>.
Speech is captured (step S27) and digitised (step S28) in a substantially identical manner to that described earlier, and parameterised (step S29). This generates a set of feature vectors 79, i.e. an observation sequence. The noise-robust speech models 66_1, ..., 66_M, 66_{M+1} are retrieved from memory 18 (step S30).
A Viterbi decoding process is used to determine an optimum path through the grammar with respect to the feature vectors 79 and the models 66_1, ..., 66_M, 66_{M+1} (step S31). Identification of the optimum path yields the most likely spoken phrase.
Reference is made to "Error bounds for convolutional codes and an asymptotically optimum decoding algorithm" by A. J. Viterbi, IEEE Transactions on Information Theory, Volume 13, pp. 260-269 (1967). However, other processes for identifying the most likely phrase may be used.
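A minimal log-domain Viterbi sketch over a single model is given below. A real recogniser decodes over the grammar network of models; the names here are ours.

    import numpy as np

    def viterbi(log_pi, log_a, log_b):
        # log_pi: (N,)   initial log state probabilities
        # log_a:  (N, N) log transition probabilities a_ij
        # log_b:  (T, N) log output densities log b_j(o_t) per frame
        T, N = log_b.shape
        psi = np.zeros((T, N), dtype=int)     # back-pointers
        delta = log_pi + log_b[0]
        for t in range(1, T):
            scores = delta[:, None] + log_a   # score of moving from state i to j
            psi[t] = scores.argmax(axis=0)
            delta = scores.max(axis=0) + log_b[t]
        path = [int(delta.argmax())]
        for t in range(T - 1, 0, -1):
            path.append(int(psi[t, path[-1]]))
        return path[::-1], float(delta.max())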
The identity of the most likely spoken phrase 80 is returned to the application interface layer 74 (step S32).
The processes described above are performed by the processor 16 (Figure 2) executing one or more computer programs. However, the processes may be implemented in hardware.
The use of the noise-robust speech models 66_1, ..., 66_M, 66_{M+1} helps to improve recognition accuracy in noisy conditions. Nevertheless, they may also be used when recognising speech in substantially quiet or silent environments. This helps avoid the need to use different models.
It will be appreciated that many modifications may be made to the embodiments hereinbefore described. For example, the pattern recognition system may be modified to recognise other forms of signal, such as images, vibration or handwriting. Consequently, transducers for generating a signal, such as a charge-coupled device (CCD), accelerometer or touch pad, may be provided.
The device may be a user-portable device, such as a notebook personal computer, mobile telephone handset, game device or watch, or a fixed device, such as a desktop personal computer, a vehicular on-board computer or industrial control apparatus. The noise-robust model may be downloaded from a server, which need not be the model trainer.

Claims (39)

  1. A method of training a pattern recognition model, the method comprising: training a pattern recognition model using a first set of observations derived from a relatively noise-free signal and refining said pattern recognition model using a second set of observations derived from a relatively noisy signal obtained from said relatively noise-free signal and a noise signal.
  2. A method according to claim 1, the method comprising: training said pattern recognition model using a first plurality of sets of observations, each set of observations derived from a respective relatively noise-free signal and refining said pattern recognition model using a second plurality of sets of observations, each set of observations derived from a respective relatively noisy signal obtained from a corresponding relatively noise-free signal and a noise signal.
  3. A method according to claim 1 or 2, further comprising: combining said substantially noise-free signal and said noise signal to obtain said relatively noisy signal.
  4. A method according to any preceding claim, comprising: filtering said substantially noise-free signal to produce an enhanced substantially noise-free signal and combining said enhanced substantially noise-free signal and said noise signal to obtain said relatively noisy signal.
  5. A method according to claim 4, wherein filtering said substantially noise-free signal comprises: using a minimum mean-square error short-time spectral amplitude estimator.
  6. A method according to any preceding claim, comprising: using a statistical model as said pattern recognition model.
  7. A method according to any preceding claim, comprising: using a hidden Markov model as said pattern recognition model.
  8. A method according to claim 7, comprising: using continuous observation densities to model observations.
  9. A method according to any preceding claim, wherein training said pattern recognition model comprises: determining model parameters, said model parameters including, for each of at least one model state, respective parameters for defining a statistical distribution of observations associated with a model state.
  10. A method according to claim 9, wherein refining said pattern recognition model comprises: re-estimating only some of the parameters for each of said model states.
  11. A method according to claim 9 or 10, wherein refining said pattern recognition model comprises: re-estimating statistical distribution spread parameters, but not re-estimating statistical distribution position parameters.
  12. A method according to any preceding claim, wherein training said pattern recognition model comprises: determining model parameters, said model parameters including, for each of at least one model state, respective plural sets of parameters for defining corresponding plural statistical distributions of observations associated with a model state and plural weighting parameters, said weighting parameters for allowing combination of said plural statistical distributions for producing a mixture statistical distribution.
  13. A method according to claim 12, wherein refining said pattern recognition model comprises: re-estimating statistical distribution spread parameters, but not re-estimating statistical distribution position parameters and not re-estimating weighting parameters.
  14. A method according to any preceding claim, wherein training said pattern recognition model comprises: determining model parameters, said model parameters including, for each of at least one model state, respective at least one set of vectors for defining a statistical distribution of observation vectors emitted by a model state.
  15. A method according to claim 14, wherein training said pattern recognition model comprises: determining model parameters, said model parameters including, for each at least one of said model states, a respective plurality of sets of vectors and plurality of sets of weighting parameters, each set of vectors for defining a statistical distribution of observation vectors, said statistical distributions being combinable according to said weighting parameters to form a mixture of distributions associated with a model state.
  16. A method according to claim 15, wherein refining said pattern recognition model comprises: not re-estimating said weighting parameters.
  17. A method according to any one of claims 12 to 16, wherein each set of vectors includes: a mean vector for defining a mean of the statistical distribution of observation vectors and a variance vector for defining a variance of the statistical distribution of observation vectors.
  18. A method according to claim 17, wherein refining said pattern recognition model comprises: re-estimating said variance vector and not re-estimating said mean vector.
  19. A method according to any preceding claim, wherein training said pattern recognition model comprises: determining model parameters, said model parameters including a set of state transition probabilities.
  20. A method according to claim 19, wherein refining said pattern recognition model comprises not re-estimating said set of state transition probabilities.
  21. A method of training a plurality of pattern recognition models, the method comprising: training each pattern recognition model with respective first and second sets of observations using the method according to any preceding claim.
  22. A method of training a plurality of pattern recognition models, the method comprising: training each pattern recognition model with respective first and second pluralities of sets of observations using the method according to any preceding claim.
  23. A method according to claim 21 or 22, the method comprising: training said plurality of pattern recognition models collectively.
  24. A method according to any one of claims 21 to 23, the method comprising: refining said plurality of pattern recognition models collectively.
  25. A method according to claim 24, wherein refining said plurality of pattern recognition models collectively comprises: using embedded re-estimation.
  26. A method according to any one of claims 19 to 25, wherein: each pattern recognition model is associated with a different pattern.
  27. A method according to any one of claims 19 to 25, wherein: each pattern recognition model is associated with a different phrase or word.
  28. A method according to any one of claims 19 to 25, the method comprising: generating a further pattern recognition model using said first and second sets of observations or said first and second pluralities of observations, said further recognition model being associated with background noise.
  29. A method according to claim 28, the method comprising: training a background noise recognition model using a set of observations derived from said noise signal, and replacing said further pattern recognition model with said background noise recognition model.
  30. A method of generating a pattern recognition model, the method comprising: providing a pattern recognition model trained using a first set of observations derived from a relatively noise-free signal and refining said pattern recognition model using a second set of observations derived from a relatively noisy signal obtained from said relatively noise-free signal and a noise signal.
  31. A method of training a plurality of pattern recognition models substantially as hereinbefore described with reference to Figures 1 to 30 of the accompanying drawings.
  32. A method of training a speech recognition system using the method according to any preceding claim.
  33. A computer program comprising program instructions for causing a computer to perform the method according to any preceding claim.
  34. A pattern recognition model generated using the method according to any one of claims 1 to 32.
  35. A pattern recognition model according to claim 34, carried on an electrical or optical carrier signal.
  36. A pattern recognition model according to claim 34, stored in memory.
  37. A pattern recognition model according to claim 34, held on a server.
  38. Apparatus for generating a pattern recognition model comprising: means for training a pattern recognition model using a first set of observations derived from a relatively noise-free signal and means for refining said pattern recognition model using a second set of observations derived from a relatively noisy signal obtained from said substantially noise-free signal and a noise signal.
  39. Apparatus for generating a pattern recognition model comprising a processor configured to train a pattern recognition model using a first set of observations derived from a relatively noise-free signal and to refine said pattern recognition model using a second set of observations derived from a relatively noisy signal obtained from said substantially noise-free signal and a noise signal.
GB0310720A 2003-05-09 2003-05-09 Pattern recognition Expired - Fee Related GB2401469B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB0310720A GB2401469B (en) 2003-05-09 2003-05-09 Pattern recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB0310720A GB2401469B (en) 2003-05-09 2003-05-09 Pattern recognition

Publications (3)

Publication Number Publication Date
GB0310720D0 GB0310720D0 (en) 2003-06-11
GB2401469A true GB2401469A (en) 2004-11-10
GB2401469B GB2401469B (en) 2006-11-22

Family

ID=9957777

Family Applications (1)

Application Number Title Priority Date Filing Date
GB0310720A Expired - Fee Related GB2401469B (en) 2003-05-09 2003-05-09 Pattern recognition

Country Status (1)

Country Link
GB (1) GB2401469B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022115267A1 (en) * 2020-11-24 2022-06-02 Google Llc Speech personalization and federated training using real world noise

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000075890A (en) * 1998-09-01 2000-03-14 Oki Electric Ind Co Ltd Learning method of hidden markov model and voice recognition system
EP1168301A1 (en) * 2000-06-28 2002-01-02 Matsushita Electric Industrial Co., Ltd. Training of acoustic models for noise robustness

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000075890A (en) * 1998-09-01 2000-03-14 Oki Electric Ind Co Ltd Learning method of hidden markov model and voice recognition system
EP1168301A1 (en) * 2000-06-28 2002-01-02 Matsushita Electric Industrial Co., Ltd. Training of acoustic models for noise robustness

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022115267A1 (en) * 2020-11-24 2022-06-02 Google Llc Speech personalization and federated training using real world noise
US11741944B2 (en) 2020-11-24 2023-08-29 Google Llc Speech personalization and federated training using real world noise

Also Published As

Publication number Publication date
GB2401469B (en) 2006-11-22
GB0310720D0 (en) 2003-06-11

Similar Documents

Publication Publication Date Title
US6711543B2 (en) Language independent and voice operated information management system
Deng et al. Structured speech modeling
US6529866B1 (en) Speech recognition system and associated methods
US7630878B2 (en) Speech recognition with language-dependent model vectors
US6182036B1 (en) Method of extracting features in a voice recognition system
US20070239444A1 (en) Voice signal perturbation for speech recognition
JP2005221678A (en) Speech recognition system
KR20040088368A (en) Method of speech recognition using variational inference with switching state space models
Stuttle A Gaussian mixture model spectral representation for speech recognition
Le Roux et al. Single and Multiple F0 Contour Estimation Through Parametric Spectrogram Modeling of Speech in Noisy Environments
Kadyan et al. Prosody features based low resource Punjabi children ASR and T-NT classifier using data augmentation
Kurcan Isolated word recognition from in-ear microphone data using hidden markov models (HMM)
JP3531342B2 (en) Audio processing device and audio processing method
Kaur et al. Power-Normalized Cepstral Coefficients (PNCC) for Punjabi automatic speech recognition using phone based modelling in HTK
GB2401469A (en) Pattern recognition
Darling et al. Feature extraction in speech recognition using linear predictive coding: an overview
Shahnawazuddin et al. A fast adaptation approach for enhanced automatic recognition of children’s speech with mismatched acoustic models
Gardner-Bonneau et al. Spoken language interfaces for embedded applications
Cincarek et al. Utterance-based selective training for the automatic creation of task-dependent acoustic models
Koliousis Real-time speech recognition system for robotic control applications using an ear-microphone
Tabassum A study on speaker independent speech recognition of isolated words in room environment
Alsteris Short-time phase spectrum in human and automatic speech recognition
Cheng Design and Implementation of Three-tier Distributed VoiceXML-based Speech System
Selouani “Well Adjusted”: Using Robust and Flexible Speech Recognition Capabilities in Clean to Noisy Mobile Environments
JP2005195975A (en) Speech signal analysis method and device for implementing the analysis method, speech recognition device using the speech signal analyzing device, and program implementing the analysis method and storage medium thereof

Legal Events

Date Code Title Description
732E Amendments to the register in respect of changes of name or changes affecting rights (sect. 32/1977)
PCNP Patent ceased through non-payment of renewal fee

Effective date: 20130509