GB2216320A

GB2216320A - Selective addition of noise to templates employed in automatic speech recognition systems

Info

Publication number: GB2216320A
Application number: GB8902475A
Authority: GB
Inventors: Jack Elliott Porter
Original assignee: International Standard Electric Corp
Current assignee: International Standard Electric Corp
Priority date: 1988-02-29
Filing date: 1989-02-03
Publication date: 1989-10-04
Also published as: GB2216320B; JPH01255000A; FR2627887B1; JP3046029B2; FR2627887A1; GB8902475D0

Abstract

A speech recognition system employs templates which are compared, 160, with incoming speech. When templates have the same signal-to-noise ratio as an unknown speech signal, recognition is improved. The signal-to-noise ratio of an incoming signal is predicted and the templates are modified, 164, before they are used in such a way that they are as if they were generated from speech with the same signal-to-noise ratio as the incoming unknown speech.

Description

APPARATUS AND METHODS FOR THE SELECTIVE ADDITION OF NOISE TO TEMPLATES EMPLOYED IN AUTOMATIC SPEECH RECOGNITION SYSTEMS

This invention relates to speech recognition systems in general and more particularly to a speech recognition system employing templates each generated by the selective addition of noise to increase the probability of speech recognition.

The art of speech recognition in general has been vastly developed in the last few years and speech recognition systems have been employed in many forms. The concept of recognising speech is associated with the idea that the information contained in a spoken sound can be utilised directly to activate a computer or other means. Essentially, the prior art understood that a key element in recognising information in a spoken sound is the distribution of sound energy with respect to frequency. The formant frequencies, which are those at which the energy peaks, are particularly important.

The formant frequencies are the acoustic resonances of the mouth cavity and are controlled by the tongue, jaw and lips. For a human listener the determination of the first two or three formant frequencies is usually enough to characterise vowel sounds. In this manner, speech recogniser machines of the prior art included some means of determining the amplitude or power spectrum of the incoming speech signal. This first step of speech recognition is referred to as preprocessing as it transforms a speech signal into features or parameters that are recognisable and reduces the data flow to manageable proportions. One means of accomplishing this is the measurement of the zero crossing rate of the signal in several broad frequency bands to give an estimate of the formant frequencies in these bands.

Another means is representing the speech signal in terms of the parameters of the filter whose spectrum best fits that of the input speech signal. This technique is known as linear predictive coding (LPC). LPC has gained popularity because of its efficiency, accuracy and simplicity. The recognition features extracted from speech are typically averaged over 10 to 40 milliseconds then sampled 50-100 times per second.

The parameters used to represent speech for recognition purposes may be directly or indirectly related to the amplitude or power spectrum. Formant frequencies and linear predictor filter coefficients are examples of parameters indirectly related to the speech spectrum. Other examples are cepstral parameters and log-area ratio parameters. In these and most other cases the speech parameters used in recognition are, or can be, derived from spectral parameters. This invention is related to the selective addition of noise to spectrum parameters generating speech recognition parameters. This invention applies to all forms of speech recognition which use speech parameters which are, or can be, derived from spectral parameters.

One of the most popular approaches to speech recognition in the past has been the use of templates to provide matching. In this approach words are typically represented in the form of parameter sequences. Recognition is achieved by using a predefined similarity measure to compare the unknown template token against stored templates. In many cases, time alignment algorithms are used to account for variability in the rate of production of words. Thus, template matching systems can achieve high performance with a small set of acoustically distinct words. Some researchers have questioned the ability of such systems to ultimately make fine phonetic distinctions among the wide range of talkers. See for example an article entitled "Performing Fine Phonetic Distinctions: Templates versus Features" in "Variability and Invariance in Speech Processes" by J.S.Perkel and D.H.Klatt, editors, Hillsdale, New Jersey: Lawrence Erlbaum Associates, 1985, authors R.A.Cole, R.M.Stern and M.J.Lasry.

Thus as an alternative, many people propose a feature-based approach to speech recognition in which one must first identify a set of acoustic features that capture the phonetically relevant information in the speech signal. With this knowledge, algorithms can be developed to extract the features from the speech signal. A classifier is then used to combine the features and arrive at a recognition decision. It is argued that a feature-based system is better able to perform fine phonetic distinctions than a template matching scheme and thus is inherently superior. In any event, template matching is a technique which is often used in pattern recognition whereby an unknown is compared to prototypes in order to determine which one it most closely resembles.

By this definition, feature-based speech recognisers that use multi-variate Gaussian models for classification also perform template matching. In this case, the statistical classifier merely uses a feature vector as a pattern. Similarly, if one regards spectrum amplitude and LPC coefficients as features then spectrum based techniques are feature-based as well.

In regard to use, template matching and feature-based systems really represent different points along a continuum. One of the most serious problems with the template matching approach is the difficulty of defining distance measures that are sensitive enough for fine phonetic distinctions but insensitive to the irrelevant spectral changes.

One manifestation of this problem is the excessive weight given to unimportant frame-to-frame variations in the spectrum of a long steady-state vowel. Thus the prior art, aware of such problems, has proposed a number of distance metrics that are intended to be sensitive to phonetic distances and are insensitive to irrelevant acoustic differences.

See for example an article entitled "Prediction of Perceived Phonetic Distance from Critical Band Spectra" by D.H.Klatt, published in the Proceedings of ICASSP-82, IEEE Catalog No. CH1746-7, pages 1278-1281, 1982.

In any event, in order to gain a better understanding of speech communication systems, reference is made to Proceedings of the IEEE, November 1985, Volume 73, No. 11, pages 1537 - 1696.

This issue of the IEEE presents various papers regarding man-machine speech communications systems and gives good insight to the particular problems involved. A major aspect in regard to any speech recognition system is the ability of the system to perform its allocated task--namely, to recognise speech in regard to all types of environments.

As indicated, many speech recognition systems utilise templates. Essentially, such systems convert utterances into parameter sequences which are stored in the computer. Sound waves travel from a speaker's mouth through a microphone to an analog-to-digital converter where they are filtered and digitised along with, for example, background noise, which may be present. The digitised signal is then further filtered and converted to recognition parameters, in which form it is compared with stored speech templates to determine the most likely choice for the spoken word. For further examples of such techniques, reference is made to the IEEE Spectrum, Vol. 24, No. 4, published April 1987. See an article entitled "Putting Speech Recognizers to Work" pages 55-57 by T.Walich.

As can be ascertained from that article, the utilisation of speech recognition systems is constantly being expanded in application and there are many models already available which are employed for various applications as indicated in that article. The formation of templates is also quite well known in the prior art. Such templates are employed with many different types of speech recognition systems. One particular type of system is known as "A Key word recognition system" as described in the publication entitled "An Efficient Elastic-Template method for Determining Given Words in Running Speech" by J.S.Bridle, "British Acoustical Society Spring Meeting", pages 1-4, April 1973. In this article the author discusses the derivation of elastic templates from a parametric representation of spoken examples of key words to be detected. A similar parametric representation of the incoming speech is continuously compared with these templates to measure the similarity between the speech and the key words from which the templates were derived.

A word is determined by the recogniser to have been spoken when a segment of the incoming speech is sufficiently similar to the corresponding template.

The word templates are termed "elastic" because they can be expanded and compressed in time to account for variations in the talking speed and local variations in the rate of word pronunciation.
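By way of illustration only, the elastic matching described above may be sketched with dynamic time warping, which is one common time alignment technique and is an assumption here; the cited article may use a different procedure. The function names `frame_distance` and `dtw_distance`, and the use of a Euclidean frame distance, are hypothetical choices made for this sketch.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two parameter vectors (an assumed measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(template, utterance):
    """Accumulated distance of the best time alignment between two frame sequences.

    The template may be stretched or compressed in time: each step either
    advances both sequences or holds one of them in place.
    """
    n, m = len(template), len(utterance)
    inf = float("inf")
    # cost[i][j] is the best accumulated distance aligning the first i
    # template frames with the first j utterance frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(template[i - 1], utterance[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # template frame advances, utterance frame repeats
                cost[i][j - 1],      # utterance frame advances, template frame repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


template = [[0.0], [1.0], [2.0]]
slow_utterance = [[0.0], [0.0], [1.0], [1.0], [2.0]]  # same word, spoken more slowly
other_utterance = [[5.0], [5.0], [5.0]]
print(dtw_distance(template, slow_utterance))
print(dtw_distance(template, other_utterance))
```

In this sketch the slowly spoken utterance aligns with the template at zero accumulated distance, whereas the unrelated utterance does not, which is the behaviour the term "elastic" is intended to convey.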

Key word recognition is similar to conventional speech recognition. In the former, templates are stored only for "key" words to be recognised within a context of arbitrary words or sounds, whereas in the latter templates are stored for all the speech anticipated to be spoken. All such systems, whether they be key word recognition systems or conventional speech recognition systems that employ templates, encounter the same problems--namely the inability of the system to recognise the spoken word as uttered for example by different individuals or as uttered by the same individual under different conditions.

The present invention seeks to provide an apparatus and method for improved automatic speech recognition.

The present invention also seeks to provide an apparatus and method of speech recognition which automatically adapts to a noisy environment.

According to one aspect of the invention, there is provided a speech recognition system including a spectrum analyser for providing spectral magnitude values of utterances at an output and for comparing stored templates with processed spectral values to provide an output upon a favourable comparison indicative of the presence of speech in the utterance, characterised in the provision of apparatus for generating the stored templates, comprising first means coupled with the spectrum analyser for providing a signal indicative of the predicted noise signal of an incoming signal, and means coupled to said first means and responsive to said predicted noise signal to generate templates which are modified according to the predicted noise signal.

According to a second aspect of the invention there is provided a method of forming templates for use in a speech recognition system, comprising the steps of providing a signal indicative of an expected noise level for a forthcoming signal and modifying a predetermined template according to the provided signal to provide a template having the expected noise level.

As will be further understood from the appended specification, most speech recognition systems suffer from a degraded operation in the presence of noise. This degradation is particularly severe when the templates have been derived from speech with little or no noise, or with a noise of different quality from that present when recognition is attempted. Previous methods of reducing this difficulty require the production of new templates in the presence of the new noise. This production necessitates the collection of new speech and noise.

In this particular system there is an analytical addition of noise to templates which permits an improved recognition probability, thereby substantially improving the system performance, and which does not require collecting new speech for template generation.

In order that the invention and its various other preferred features may be understood more easily, some embodiments thereof will now be described, by way of example only with reference to the drawings, in which:

Figure 1A is a block diagram showing a speech recognition system, using recognition parameters derived from spectra, employing the present invention,

Figure 1B is a block diagram showing an alternative speech recognition system, using recognition parameters which are spectral in nature, employing the present invention,

Figure 2 is a detailed block diagram showing a technique employing the invention for forming operational template data,

Figure 3 is a Table giving the definitions of various outputs indicated in Figure 2,

Figure 4 is a detailed block diagram of an alternative embodiment of the invention,

Figures 5A-5C are detailed flow charts depicting the operation of a speech and noise tracker operating in accordance with the invention,

Figure 6 is a Table giving the definitions of engineering parameters shown in Figures 5A-5C.

The present invention applies to all recognition systems which use parameters which are, or are derived from those which are, spectral in nature. In the latter case it may be necessary to store templates in two forms: spectral for analytic addition of noise and operational templates.

Referring to Figure 1A, there is shown a block diagram of a speech recognition system operating in accordance with the invention using recognition parameters derived from spectra.

A microphone 10, into which a speaker utilising the system speaks, conventionally converts sound waves into electrical signals which are amplified by means of an amplifier 11. The output of the amplifier 11 is coupled to a spectrum analyser 12.

The spectrum analyser may be a wide band or a narrow band type with a short term analysis capability. The function and structure of the spectrum analyser is essentially well known and can be implemented by a number of techniques.

The spectrum analyser operates to divide the speech into short frames and provides a parametric representation of each frame at its output. The particular type of acoustic analysis performed by the spectrum analyser is not critical to this invention and many known acoustic analysers or spectrum analysers may be employed. Examples of such are given in U.S. Patent Application Serial No. 439,018 filed on November 3, 1982 for G.Vensko et al. and Serial No. 473,422 filed on March 9, 1983 for G.Vensko et al., both commonly assigned to ITT Corporation, the assignee herein, and incorporated herein by reference.

Reference is also made to U.S.Patent Application Serial No. 655,958 filed on September 28, 1984 for A.L.Higgins et al. and entitled KEYWORD RECOGNITION SYSTEM AND METHOD USING TEMPLATE CONCATENATION MODEL.

The spectrum analyser 12 may include a 14 channel bandpass filter array and utilises a frame size of 20 milliseconds or greater. These spectral parameters are processed as shown in Figure 1A. As illustrated, the output of the spectrum analyser is coupled to a switch 13 which can be operated in a Recognise, a Form Template, or a Modify Template Mode.

When the switch 13 is placed in the Form Template Mode, the output of the spectrum analyser 12 is coupled to a module 14 designated as spectral form of templates. The purpose of module 14 is to assist in forming templates from the output of the spectrum analyser. Many techniques for forming such templates are well known. Essentially, in the Form Template Mode the output of the spectrum analyser 12 is processed by the module 14 which provides templates in response to utterances made by a speaker through microphone 10. The speaker is prompted to speak words to be recognised and templates representative of the spoken words are generated. These templates are utilised by module 15 to derive recognition parameters from the spectral form of the templates for the generation of final templates for a low or no noise condition, which templates are stored in module 16. These stored templates are indicative of particular utterances, as for example words, phrases and so on, uttered by the particular speaker.

The templates, as stored, are coupled via a switch 100 to a processor 160 which performs a recognition algorithm. Thus, as one can ascertain, the processor 160 operates in the Recognise Mode to compare unknown speech with templates, as stored in module 16, which were generated for a no noise condition. Thus, as indicated in Figure 1A, in the Form Template Mode there is provided a spectral form of templates to obtain template parameters which template parameters are then utilised to form templates for no noise or low noise conditions. The processor, as implemented, can operate with the templates stored in module 16 for a no noise or low noise condition, as will be explained. The function of the processor 160 is also well known and operates to provide a match based on various distance measurements or other algorithms. When such a match is made there is an indication provided that this is a correct word and hence that word or sound is then the system output.

Switch 13, when placed in a Recognise Mode, enables the output of the spectrum analyser 12 to be coupled to a derive parameters module 161 which essentially derives parameters from the spectrum analyser output, which parameters will be compared with the stored templates as, for example, described above and stored in module 16. As seen in Figure 1A the switch 13 can also be placed in the centre position.

In the centre position, also designated as the Modify Template Mode, the output of the spectrum analyser enters an estimate noise statistics module 162. As one will understand, the function of module 162 is to provide a noise analysis or to process noise to provide an estimate of noise statistics.

Noise is selectively added to form templates to implement speech recognition and to achieve an improvement in such recognition in the presence of such additive noise.

Thus the estimate noise statistics module 162, which will be further described, operates to modify the spectral templates in module 164, which is coupled to and receives its information from module 14. The output of module 164 is used to derive recognition parameters in module 165, which parameters are utilised to form templates, as indicated by module 166, for use with noise or at low noise levels. Accordingly, the system depicted in Figure 1A enables recognition to be performed with templates for use with noise or templates for use with very low noise or no noise via switch 100.

As indicated briefly above, in the Recognise Mode the spectral parameter output of the spectrum analyser 12 is provided to the input of the processor 160 via the derive parameter module 161.

The processor 160 typically performs an algorithm which again is not critical to the invention. The processor 160 determines the sequence of the stored templates and provides the best matches to the incoming speech to be recognised. Hence the output of the processor 160 is essentially a string of template labels where each label represents one template in the best matching template sequence.

For example, each template may be assigned a number, and a label may be a multibit representation of that number. This output is provided to a template search system included in the processor 160 which, when there is a multibit representation, for example, may be a comparator with a storage device for template labels. Thus the processor 160 operates to compare each incoming template label with the stored templates. The processor 160 can then provide an indication that a particular word or phrase has been spoken as well as what word or phrase.

In either the Form or Modify Template Mode the user speaks various words and recognition parameters are derived from the spectrum output of the spectrum analyser 12. In the Modify Template Mode the system operates to produce various templates for use in conjunction with the system in the Recognise Mode which templates, as indicated above, are modified by the selective addition of noise via the estimate noise statistics generator 162. The selective addition of noise by means of the generator 162 provides a more reliable system operation, as will be further explained.

Referring now to Figure 1B there is shown a recognition system which employs recognition parameters which are spectral in nature. In any event, the same functioning parts have been designated by the same reference numerals in Figure 1B. As can be seen, there is a microphone 10 coupled to the input of an amplifier 11 whose output is coupled to the input of a spectrum analyser 12.

The spectrum analyser 12 output is again coupled to a switch 13 which can be operated in a Form Template, a Modify Template or a Recognise Mode.

In the Form Template Mode templates are formed for low or no noise conditions via module 170. This module 170 forms templates directly providing recognition parameters which are spectral in nature.

The formed templates are then stored and also coupled to module 171 which modifies the spectral templates, as for example derived from module 170, under the influence of an estimate noise statistics generator 172 which functions similarly to the noise generator 162. The output of the modified spectral template module 171 is coupled to module 173 which stores templates for use with a noise condition. Again, a processor 177 is depicted which processor can operate either with the templates as stored in module 170 or with the templates as stored in module 173.

Known methods which accomplish the task of template generation are automatic and normally employ a multi-stage or a two-stage procedure. In one such approach speech data from the training utterance (Template Mode) is divided into segments.

These segments are then applied as an input for a statistical clustering analysis which selects a subset of segments that maximises a mathematical function based on a measure of distance between segments. The segments belonging to the selected subset are used as templates.

Such techniques are described in the previously mentioned copending U.S. Patent Application Serial No. 655,958. In any event, the various techniques for measuring distances are well known as indicated by some of the previously mentioned references. One technique of providing distance measurements which is widely employed is known as the Mahalanobis distance computation. For examples of this particular technique reference is made to a copending U.S. Patent Application entitled MULTIPLE PARAMETER SPEAKER RECOGNITION SYSTEM AND METHODS, filed on January 16, 1987, Serial No. 003,971 in the name of E.Wrench et al. This application gives various other examples of techniques employed in speaker recognition systems and describes in detail some of the algorithms employed in conjunction with such systems. The major application of the present invention is related to such speech recognition systems as shown in Figures 1A and 1B which utilise templates to provide a comparison with incoming speech to thereby make a decision as to which word is being spoken. The technique can be employed in key word recognition systems, speech recognition systems, speaker recognition systems, speaker verification systems, language recognition systems, or any system which utilises templates or a combination of various templates to make a decision in regard to an uttered sound.

Before proceeding with an explanation of the structure and techniques employed in this invention, certain aspects and considerations of the present invention are considered.

The inventor has determined that when templates have the same signal-to-noise ratio as the unknown or uttered speech, the recognition performance is better than with templates with less noise or more noise. Hence if it is assumed that the signal-to-noise ratio of the audio signal can be predicted then recognition performance can be optimised by modifying templates before they are used in such a way that they are "as if" they were generated from speech with the same signal-to-noise ratio as the upcoming unknown speech.

Thus in order to practice the present invention, the following considerations are applicable. The first is to predict the signal-to-noise ratio of upcoming speech and the second is to modify the templates to meet the "as if" requirement. Prediction is based on both theoretical and empirical considerations. In most applications one can reasonably expect a user to speak at a relatively constant level, either absolutely in the case of low level or constant noise or at a relatively constant level above that noise. One can then use the speech and noise level to predict the signal-to-noise ratio of the unknown speech. As will be explained, this is accomplished by the use of a speech and noise level tracker module. In certain instances it is assumed that both the speaking level and the noise level in each filter channel change slowly enough that the current values are useful estimates of the values in the near future.
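Purely as an illustrative sketch, the prediction step may be pictured as follows. It is assumed here that the tracked levels are held as linear power values, that a slowly varying level is followed by simple exponential smoothing, and that the speaking level is a single value whereas the noise level is held per filter channel. The names `track_level` and `predicted_snr_db` and the `rate` parameter are hypothetical; the tracker actually used is the one described with reference to Figures 5A-5C.

```python
import math


def track_level(previous_level, observed_level, rate=0.05):
    """Move a slowly varying level estimate a small step towards a new observation.

    The smoothing rule and the value of `rate` are assumptions of this sketch.
    """
    return previous_level + rate * (observed_level - previous_level)


def predicted_snr_db(speech_power, noise_power_per_channel):
    """Predicted signal-to-noise ratio in decibels for each filter channel."""
    return [10.0 * math.log10(speech_power / n) for n in noise_power_per_channel]


# Current tracked values are taken as estimates for the upcoming speech.
speech_power = 100.0
noise_power = [1.0, 10.0]
print(predicted_snr_db(speech_power, noise_power))
```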

Hence the modification of noise-free or relatively noise-free templates, so that they are the same "as if" they had been made from noisier speech, is based on both empirical and theoretical considerations.

Research has determined that it is an excellent approximation to assume that noise and speech powers add in each individual filter bank channel. A more accurate approximation is that the combination of speech and noise has a non-central Chi squared distribution, with the number of degrees of freedom related to the filterbank channel bandwidth. From this and other considerations, more precise estimations can be made of the expected value of the combination of known speech power with a noise of known statistical properties. The increased precision in the "addition of noise" thus obtained does increase the precision of the templates produced, but does not significantly increase recognition accuracy beyond the improvement obtained using the "powers add" rule. The following discussion therefore continues to refer to the powers add rule, although the process can be made more theoretically precise by substituting an alternative method of estimating the expected value of the combination of speech and noise power. Such a substitution does not alter the intent or practice of this invention.
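A minimal sketch of the "powers add" rule follows, under the assumption that a template frame and the noise estimate are each held as one value per filter bank channel. The function names are hypothetical. Where the stored values are spectral magnitudes rather than powers, they are squared before the addition and the square root is taken afterwards.

```python
import math


def add_noise_powers(speech_power, noise_power):
    """'Powers add' rule: expected channel power is speech power plus noise power."""
    return [s + n for s, n in zip(speech_power, noise_power)]


def add_noise_to_magnitudes(template_magnitude, noise_magnitude):
    """Apply the same rule to magnitude values by working in the power domain."""
    powers = add_noise_powers(
        [m * m for m in template_magnitude],
        [m * m for m in noise_magnitude],
    )
    return [math.sqrt(p) for p in powers]


# A channel with strong speech power is barely altered, whereas a quiet
# channel is raised towards the noise floor.
print(add_noise_powers([400.0, 1.0], [4.0, 4.0]))
print(add_noise_to_magnitudes([3.0], [4.0]))
```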

It is further observed that both internal electronic and quantisation noise combines with acoustic noise and signal in accordance with the "powers add" rule. They may be smaller than the acoustic noise of interest but this is applicable.

Hence one can use the "powers add" result in constructing various models so that the application of the research work is manifested through a continuing effort to use the numbers which are derived from valid models. This will be explained subsequently.

It has been shown that templates which would result from a noise power equal to its average value behave very well in producing reliable recognition outputs. Hence it is not necessary to predict the frame-to-frame variability of the noise power and it is sufficient to use the average value. The template parameters which are sought are those which would be produced from the same speech power as that effectively in the Base Form template combined with the current average noise power.

The channel noise power values from the system are estimates of the noise power and they may be taken to be related to the average noise power as can be mathematically determined. For a full understanding of the present procedure and the justification therefore, the following considerations are applicable.

It is first indicated that the probability distribution of the output of a single discrete Fourier transform (DFT) of a speech signal corrupted by additive zero mean Gaussian noise can easily be calculated. The next factor to be considered and important for extending the model of how speech and noise combine so as to be applicable to each channel of a bandpass filter bank indicates that the channels have or may have a much wider bandwidth than a single DFT channel. Hence the noise power parameter and the number of contributing channels can be estimated by observing the output of the bandpass filter in the absence of speech and in the presence of noise.

The next step was to realise that speech recognition templates formed in the absence of noise might be improved for use in the presence of noise by modifying them to be equal to their expected value in the presence of noise. Hence the method to be employed is that for each speech sample and each bandpass filter channel represented in a noise-free template, there is substituted the expected value of the noise-free template as modified by the presence of the current noise.

Hence by measuring the average and the variance at the output of the bandpass filter channel it is possible to estimate the properties of the channel by the way it passes Gaussian noise. As will be appreciated (and many of the previous considerations have been mathematically proven) there is both a theoretical and empirical basis for the implementation of this invention. Essentially as indicated, the nature of this invention is the analytical addition of noise to form templates which formed templates operate to increase the reliability of speech recognition systems.
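The measurement of the average and the variance may be sketched as below. The frames are assumed to be channel power values collected while only noise is present. The final quantity, an effective number of degrees of freedom, is obtained by a method of moments estimate that assumes the channel output follows a scaled Chi squared distribution, for which the mean is k times the scale and the variance is 2k times the scale squared; this estimator and the name `channel_noise_statistics` are assumptions of the sketch and are not taken from the specification.

```python
def channel_noise_statistics(noise_frames):
    """Return (mean, variance, degrees of freedom) for each filter channel.

    noise_frames is a list of frames, each a list of per-channel power
    values observed in the absence of speech.
    """
    n_frames = len(noise_frames)
    n_channels = len(noise_frames[0])
    stats = []
    for c in range(n_channels):
        values = [frame[c] for frame in noise_frames]
        mean = sum(values) / n_frames
        variance = sum((v - mean) ** 2 for v in values) / n_frames
        # Scaled Chi squared assumption: k = 2 * mean^2 / variance.
        dof = 2.0 * mean * mean / variance if variance > 0 else float("inf")
        stats.append((mean, variance, dof))
    return stats


frames = [[1.0, 2.0], [3.0, 2.0]]
for mean, variance, dof in channel_noise_statistics(frames):
    print(mean, variance, dof)
```

A channel whose output varies little from frame to frame yields a large number of degrees of freedom, which is consistent with a wide channel that sums many independent contributions.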

In any event, there are two ways to add noise to template data collected in a noise-free environment and thus form new templates for use in a noisy environment. A rigorous way is to add noise to each template token then average the results. An approximate way is to average the noise-free tokens to form Base Form data and modify the data by adding noise appropriate to current conditions, using the "powers add" or other convenient or more precise rule. The rigorous way requires keeping all the templates and tokens around and requires excessive storage. The approximate way provides substantially the same templates and recognition results. There is a main assumption which is implicit in the implementation. That is that template data are noise free relative to the environment in which they are used.
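The two ways may be compared in a short sketch for a single channel of a single frame. It is assumed here that tokens are averaged as natural logarithms of power and that noise is added by the "powers add" rule; both the function names and the numerical values are hypothetical.

```python
import math


def rigorous(tokens_log_power, noise_power):
    """Add noise to every token, then average the noisy tokens."""
    noisy = [math.log(math.exp(t) + noise_power) for t in tokens_log_power]
    return sum(noisy) / len(noisy)


def approximate(tokens_log_power, noise_power):
    """Average the noise-free tokens into a Base Form value, then add noise once."""
    base_form = sum(tokens_log_power) / len(tokens_log_power)
    return math.log(math.exp(base_form) + noise_power)


tokens = [2.0, 2.2, 1.8]  # log power of one channel in three utterances of a word
noise = 1.0
print(rigorous(tokens, noise))
print(approximate(tokens, noise))
```

Only the approximate way can be carried out from the stored Base Form value alone; the rigorous way needs every token to be retained. In this sketch the two results differ by a small amount that shrinks as the tokens become more alike.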

Referring to Figure 2, there is shown a detailed block diagram of a template formation technique employed by adding noise to a Base Form template. The Base Form Template is itself an average formed over a set of word "tokens". Each token consists of parameters taken from one utterance of the given word. One or more tokens may be averaged to form a Base Form Template. Base Form templates are formed during quiet conditions and stored in module 16 of Figure 1A or module 170 of Figure 1B. It should be noted that Figure 3 is a Table defining each value depicted in Figure 2. In Figure 2, there is shown again the microphone 10 into which a speaker utters speech. The output of the microphone is coupled to the input of amplifier 11 which then has its output coupled to the spectrum analyser which is indicated as BPF or bandpass filter bank 12. The switch 13 is in the modify template position. The output from the bandpass filter bank 12 is the vector of the bandpass filter spectral magnitude values and is applied to a module 20 which serves to average frame pairs.

The averaging of frame pairs is a well known technique and may be implemented by many well known circuits. The output of the module 20 is the result of averaging successive pairs of the input from the spectrum analyser 12 and module 20 serves to halve the effective frame rate. The output of module 20 is applied to a scale bit module 21 and to a square component module 22. The square component module 22 provides a vector output equal to the squared magnitude which is the power value of the output of the average frame pair module 20.
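The operation of modules 20 and 22 may be sketched as follows, assuming that each frame is a list of spectral magnitude values, one per filter channel. The function names are hypothetical, and the dropping of an unpaired final frame is a choice made for this sketch only.

```python
def average_frame_pairs(frames):
    """Average successive pairs of frames, halving the effective frame rate.

    An unpaired final frame, if any, is dropped in this sketch.
    """
    pairs = zip(frames[0::2], frames[1::2])
    return [[(a + b) / 2.0 for a, b in zip(f1, f2)] for f1, f2 in pairs]


def squared_components(frame):
    """Square each magnitude component to give the channel power values."""
    return [v * v for v in frame]


frames = [[1.0, 3.0], [3.0, 5.0], [2.0, 2.0], [4.0, 6.0]]
averaged = average_frame_pairs(frames)
print(averaged)
print([squared_components(f) for f in averaged])
```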

The scale bit module 21 takes twice the average of the successive pairs and applies a series of shifts so that the largest vector component fits into a 7-bit scale. Hence the module 21 is a shift register which provides a number of right shifts to implement the described operation. The output from the scale bit module 21 is directed to a logarithmic converter 23 which produces at its output a scaled log spectral parameter vector. This parameter vector is then averaged over a given set of template tokens by the module 24 to provide at the output the scaled log spectral parameter, which is one parameter of the Base Form template. The output from the square component module 22 is directed to an input of module 25, designated as a relativised energy module, and to an input of module 26, designated as a speech and noise level tracker.
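The chain of modules 20, 21, 22 and 23 can be sketched as follows. This is an illustration under assumptions rather than the circuit of Figure 2: magnitudes are taken to be non-negative integers, the log table is assumed to be `round((127 / ln 127) * ln v)` with index 0 mapped to 0, and all function names are hypothetical.

```python
import math

LOG_SCALE = 127 / math.log(127)  # about 26.2, so the table maps 127 to 127

# Hypothetical 7-bit lookup table; the entry for input 0 is an assumption.
LOG_TABLE = [0] + [round(LOG_SCALE * math.log(v)) for v in range(1, 128)]

def average_frame_pairs(frames):
    """Average successive pairs of magnitude vectors, halving the frame rate."""
    return [[(a + b) / 2 for a, b in zip(frames[i], frames[i + 1])]
            for i in range(0, len(frames) - 1, 2)]

def scale_to_7_bits(values):
    """Right-shift every channel by the same count S until the largest fits 7 bits."""
    shift = 0
    while max(values) >> shift > 127:
        shift += 1
    return [v >> shift for v in values], shift

def scaled_log_parameters(pair_sum):
    """Return (scaled log parameter vector, S) for one summed frame pair."""
    scaled, shift = scale_to_7_bits(pair_sum)
    return [LOG_TABLE[v] for v in scaled], shift

def squared_components(averaged):
    """Power vector: squared magnitude of each averaged channel value."""
    return [v * v for v in averaged]

frames = [[300, 40], [500, 60]]                  # two frames, two channels
averaged = average_frame_pairs(frames)[0]        # output of module 20
pair_sum = [a + b for a, b in zip(*frames)]      # twice the average
log_params, shift = scaled_log_parameters(pair_sum)
powers = squared_components(averaged)
```

Here the shared shift count keeps the relative channel levels intact, so the count S has to be carried alongside the log parameters for the true level to be recoverable.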

The output from the relative energy module 25 is a parameter indicative of the relative energy, as for example determined by averaging the energy from the output of the square component module 22. This is averaged over template tokens by module 36 to provide the relative energy parameter, which is another Base Form data value. The output from the speech and noise level tracker 26 is, as will be explained, indicative of the energy level, which is again averaged by module 27 to provide at its output the energy level as still another Base Form data value. The speech and noise level tracker, which will be further described, provides two additional outputs. One is a logarithmic indicator of the speaking level averaged over word time and channel, which is a scalar attached to words. The other is the vector of the noise level in each channel averaged over time but not with reference to channel. This is also a vector attached to word recognition units. The output from module 27 is applied to a first adder module 30 which receives an additional output from the speech and noise level tracker. The output of adder 30 is applied to one input of an adder 31 which receives at its other input an output derived from the scale bit module 21. The output of the scale bit module 21 is multiplied via module 32 by a factor K which is equal to 18.172 and which is further defined in Figure 2. This value is then averaged by module 33 to produce at its output the Base Form value of the log value which is applied to the other input of adder 31. The output of adder 31 is applied to adder 32. Adder 32 receives as another input the output from the speech and noise level tracker 26 which again is the vector of the noise levels in each channel. The output of the adder 32 is applied to one input of a function module 40 which receives at its other input the output from module 24.
The output from the function module 40 is the scale log spectral parameter vector for a noise added template. This is applied to a function module 41 to provide at its output a recognition parameter vector which is the mel-cosine transform matrix for the particular utterance. Thus the output from module 41 and the output from module 26 are utilised to provide the operational template data.

As indicated, most of the outputs associated with the block diagram of Figure 2 are described in Figure 3. The effective spectral magnitude value for the Base Form template as derived from Figure 2 is given by the following equation.

$B_B = 2^{S_B} \exp_b(l_B)$

and the effective power is given by the following equation.

$P_B = B_B^2 = 2^{2 S_B} \exp_b(2 l_B)$

See Figure 3 for definitions.

Before adding noise, the power in each frame is modified so that the average speaking level of the template, indicated at the output of module 27 of Figure 2, is the same as the current speaking level indicated by the output of the speech and noise level tracker 26 as applied to the input of adder 30. Since the values are in recognition units (0.331 dB), the effective power in the Base Form is changed accordingly, as indicated at the output of module 26. To this the current noise power level is added, giving the effective power level of the noise added template, so that the effective magnitude of the noise added template is shown as the output of module 41.
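The level adjustment followed by noise addition can be sketched in log units as follows. This is an illustration under assumptions, not the arithmetic of Figure 3: log parameters are taken to be base-b logarithms of magnitude (so power is b raised to twice the parameter), speaking levels are taken to be base-b logarithms of power, noise is supplied as linear power per channel, and the function names are hypothetical.

```python
import math

BASE_B = math.exp(math.log(127) / 127)   # assumed log base b, about 1.03888

def log_b(x):
    """Logarithm to the base b."""
    return math.log(x) / math.log(BASE_B)

def noise_added_log_magnitude(base_log_mag, template_level, current_level,
                              noise_power):
    """Return per-channel log magnitudes of one noise-added template frame.

    base_log_mag   : Base Form log magnitudes in base-b units, one per channel
    template_level : speaking level stored with the template (log power units)
    current_level  : current speaking level from the tracker (log power units)
    noise_power    : current linear noise power, one value per channel
    """
    level_shift = current_level - template_level        # log power units
    result = []
    for log_mag, noise in zip(base_log_mag, noise_power):
        speech_power = BASE_B ** (2 * log_mag + level_shift)  # level-matched power
        total_power = speech_power + noise                    # "powers add"
        result.append(0.5 * log_b(total_power))               # back to log magnitude
    return result

quiet = noise_added_log_magnitude([100.0], 50.0, 50.0, [0.0])
louder = noise_added_log_magnitude([100.0], 50.0, 60.0, [0.0])
equal_noise = noise_added_log_magnitude([100.0], 50.0, 50.0, [BASE_B ** 200])
```

With no noise and matched levels the Base Form value is returned unchanged; when the noise power equals the speech power the log magnitude rises by half of log_b(2), about 9.1 units.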

Thus all the operational recognition parameters are in the mel-cosine transform of the log spectral parameters and are relative energy measures. All of this should be clear to one skilled in the art after reviewing Figure 2 together with the definitions of Figure 3, and the mathematics should therefore be apparent.

By using similar techniques, templates can be formed by adding noise to template tokens and then averaging. The process for accomplishing this is similar to that shown in Figure 2 whereby the same outputs can be obtained as shown in Figure 2 with the exception that the averaging may be accomplished after the functional unit 40.

Referring to Figure 4, there is shown a more detailed block diagram of a typical system employing a template formation scheme as previously mentioned.

In Figure 4 the same reference numerals have been utilised to indicate similar functioning components.

An AGC or automatic gain control module 45 is included which is coupled to one input of an adder 46 with the output of the adder coupled to a coder/decoder (CODEC) module and lineariser circuit 47. The coder/decoder module may be an analog-to-digital converter followed by a digital-to-analog converter. The output of the CODEC is applied to a synthesiser or bandpass filter bank 12.

The output from the bandpass filter bank 12 goes to an averaging frame pairs module 20 which again is associated with a scale module 21 and a speech and noise tracker module 26 which will be explained. The output lines as shown on the right side of Figure 4 provide the various operational template data values which are utilised to thereby form templates in the presence of noise.

A major functioning module is the speech and noise tracker 26 which will be further described. Again, referring to Figure 4, it is shown that the inputs to the microphone 10 are labelled Nc and Sc, which are the significant signal and noise sources. By the subscript "c" it is indicated that these represent the average spectral magnitude over the bandpass of each of the filter bank channels forming the spectrum analyser 12. Each subscript "c" has 14 values, one for each filter in the filter bank.

Therefore Sc is the spectral magnitude in channel c of the acoustic speech signal while Nc is the root mean square spectral magnitude of the acoustic noise for that channel. The outputs from adders 50 and 46 are spectral magnitudes of electronic noise which is injected before and after the AGC gain control 45.

The output from the CODEC 47 contains the spectral magnitude of the quantization noise introduced by the CODEC. The output of the bandpass filter bank 12 is the vector of the bandpass filter spectral magnitude values while the output of the average frame pair module 20 is the result of averaging successive pairs of the spectral magnitude values.

The effective output signal of the filter bank 12 is an estimate of the spectral magnitude of the signal at the filter bank input over the pass band of the filter bank and this is indicated for each channel in the filter bank. Successive pairs of these values are averaged to produce the output from the module 20 at a rate of 50 per second.

The set of all values for all 14 channels are all shifted right by the same number S in module 21 so that the largest occupies 7 bits or less and the resultant values are converted by a Table lookup to a number proportional to the logarithm. The table returns 127 for an input of 127 so that the result can be considered as 26.2 times the natural logarithm of the input, or equivalently, the logarithm to the base b, where b equals 1.03888. 20 millisecond frame values are also used by the tracker 26 to produce a measure of the peak speech energy and an estimate of the mean noise energy for each channel. The speaking level is an estimate of the log to the base b of the total speech energy at the microphone 10 plus an arbitrary constant.
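The constants quoted in this description are mutually consistent, which the following arithmetic sketch checks. It assumes that the table constant is exactly 127/ln 127 and that one log unit applies to spectral magnitude, so that decibels are 20 log10; both are inferences from the text rather than statements in it, and the names are hypothetical.

```python
import math

scale = 127 / math.log(127)             # table constant, about 26.2
base_b = math.exp(1 / scale)            # log base b, about 1.03888
k_factor = scale * math.log(2)          # log units per right shift, about 18.172
db_per_unit = 20 * math.log10(base_b)   # about 0.331 dB per recognition unit

def to_log_units(magnitude):
    """Logarithm to the base b of a spectral magnitude."""
    return math.log(magnitude) / math.log(base_b)
```

Under these assumptions each right shift in the scale bit module corresponds to a factor of two in magnitude, which is the factor K = 18.172 in log units, and one recognition unit corresponds to 0.331 dB.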

The effect of the AGC gain is effectively removed and thus is not a spectral value. For example, it is related to the total energy of the pass band of the whole filter bank. The speaking level estimate is also word or phrase related. Its time constants are such that it is a measure of the level at which short utterances are spoken. There is therefore only one level value to be associated with each template or unknown segment of template duration. The time constants of the noise estimates from the tracker 26 are also such that only one noise level estimate should be assigned to each channel over the time periods of utterance length. Hence the output values from the speech and noise tracker 26 as coupled to the logarithmic circuit 54 of Figure 4 are mean energy estimates for the output of the filter bank. They are therefore affected by AGC gain and they are directly proportional to the average spectral energy without logarithmic transformation.

It is assumed that the signal and the various noise sources are statistically independent and that their energies add on the average. This is not only a convenient way for determining the internal noise sources but has been demonstrated to be an excellent approximation for both the acoustic noise and the signal sources. Furthermore, it is assumed that there are noise values which can be referred to as equivalent noise power at the microphone. These values include acoustic noise power and other system noise power some of which are reduced by the AGC 45 gain.
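The assumption that energies add on the average can be illustrated with synthetic data. The sketch below is only a demonstration with independent Gaussian sequences standing in for the signal and a noise source; the sample count, variances and names are arbitrary choices, not values from the specification.

```python
import random

def mean_power(samples):
    """Average of the squared sample values."""
    return sum(x * x for x in samples) / len(samples)

rng = random.Random(0)                                  # fixed seed for repeatability
speech = [rng.gauss(0.0, 1.0) for _ in range(20000)]    # stand-in for the signal
noise = [rng.gauss(0.0, 0.5) for _ in range(20000)]     # independent noise source
mixture = [s + n for s, n in zip(speech, noise)]

power_sum = mean_power(speech) + mean_power(noise)
power_mix = mean_power(mixture)
relative_gap = abs(power_mix - power_sum) / power_sum
```

The gap between the power of the mixture and the sum of the separate powers is the cross term, which shrinks as more samples are averaged when the sources are uncorrelated.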

Thus the scale factors as derived from Figure 4 and as indicated in Figures 2 and 3 are provided to produce noise related templates. Thus by employing the template averaging process, an average template can be produced which is the same or equivalent to what would be attained by averaging the log spectral parameters of all tokens at the same speaking level and signal-to-noise ratio. Thus in order to simplify the entire problem, one makes the assumption that there are equal signal-to-noise ratios in all templates as well as in all template tokens. This can be accomplished by adjusting the speaking levels in all tokens to be equal, and therefore the result of equal signal-to-noise ratios is an equal noise value in all tokens.

Under this assumption, one can make all forms of averaging the noise equivalent.

As previously indicated, research has shown that when templates have the same signal-to-noise ratio as the unknown speech, recognition performance is better than with templates with less noise or more noise. Thus it is indicated that based on the above techniques, the signal-to-noise ratio of the audio signal can be predicted and therefore recognition performance can be optimised by modifying templates before they are used in such a way that they are "as if" they were generated from speech with the same signal-to-noise ratio as the upcoming unknown speech.

Thus as indicated, two steps are employed. One is to predict the signal-to-noise ratio of forthcoming speech and the other is to modify the templates to meet this requirement. Hence as will be explained, the speech and noise tracker 26 does not form an estimate of the speech power in each channel, as this would vary from word to word depending on the phonetic content of each. Thus since one cannot predict what words will be spoken, the data would have no predictive power. The consequence is that for normal procedures, one would have no estimate of the signal-to-noise ratio for each channel.

Therefore, the template modification procedure as previously described avoids the use of specific signal-to-noise values on a channel basis. Therefore templates which result from a noise power equal to its average value behave very well in a recognition system.

By way of further explanation, it is not necessary to be concerned about the frame-to-frame variability of the noise power as it is sufficient to use the average value. The template parameters are then those which would be produced from the same speech power as effectively exists in the "base form" templates combined with the current average noise power. The speech and noise tracker module 26 is a digital signal processing (DSP) circuit which operates to execute an algorithm which provides a measure of the power level of a speech signal in the presence of additional acoustic noise and also a measure of the average noise power in the bandpass filter bank channels of any arbitrary form.

The measure of speaking level found is indicative of the speaker's conversational level and is suitable for adjusting the signal-to-noise ratio for purposes of speech recognition. Other measures of speaking level vary quickly and/or with the relative frequency of occurrence of voiced and unvoiced sounds within the speech. The measure found by the speech and noise tracker avoids these problems by detecting the slightly smoothed peak power in vocalic nuclei.

More specifically, it tracks the slightly smoothed peak power in the more energetic vocalic nuclei. By ignoring power peaks during unstressed syllable nuclei and during speech intervals which are not vocalic nuclei, the measure is a continuous indication of the general speaking level. The tracker is intended for use in the presence of additive noise uncorrelated with the speech present when the total noise power usually varies slowly compared to the rate of vocalic nuclei production in speech (typically 5 to 15 per second). The tracker also is operative to recover from more rapid changes in noise level. The speech and noise tracker 26 uses a logarithmic or compression technique whereby there is provided a measure of total speech power over the frequency range of interest. This measure is first subjected to a slow-rise, fast-fall filtering process with the rise and fall time constants chosen so that a large positive difference exists between the instantaneous signal power and the filtered value during the first few milliseconds of vocalic nuclei while large negative values of that difference do not occur.

Hence a non-linear function of the difference between the instantaneous signal power and the slow-rise, fast-fall filtered value is then directed to a moving box-car integration process of suitable duration so that the resulting value rises above an appropriate threshold only during normal or stressed vocalic nuclei in speech intervals, usually skipping unstressed vowel nuclei. The crossing of that threshold is then used to identify an interval of high signal power as due to speech nuclei. Only intervals thus identified are used for speaking level tracking. Values from the box-car integration process which are greater than a second threshold, which is lower than the speech nuclei threshold, are then used to identify intervals which contain speech power as well as noise power. Only intervals where the box-car integration value is lower than the second (lower) threshold and where the instantaneous power is not more than a third threshold above its slow-rise, fast-fall filtered value are used as the input to the noise power tracking function.
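The three-threshold interval selection described above can be sketched as follows. This is an illustration only: the filter coefficients, window length and threshold values are placeholders, the non-linear function is assumed to be a half-wave rectifier, and the function name is hypothetical; the actual program is the one of Figures 5A-5C.

```python
from collections import deque

def classify_frames(log_energy, rise_coeff=0.05, fall_coeff=0.5, window=5,
                    nucleus_threshold=40.0, speech_threshold=15.0,
                    rise_threshold=3.0):
    """Label each frame 'nucleus', 'speech' or 'noise' (None if unused).

    All coefficient and threshold values are illustrative placeholders.
    """
    filtered = log_energy[0]
    recent = deque(maxlen=window)     # box-car window of rectified rises
    labels = []
    for value in log_energy:
        # Slow-rise, fast-fall smoothing: lag on the way up, follow on the way down.
        coeff = rise_coeff if value > filtered else fall_coeff
        filtered += coeff * (value - filtered)
        rise = value - filtered
        recent.append(max(rise, 0.0))  # assumed non-linearity: half-wave rectifier
        integral = sum(recent)
        if integral > nucleus_threshold:
            labels.append("nucleus")   # used for speaking level tracking
        elif integral > speech_threshold:
            labels.append("speech")    # speech plus noise: feeds neither tracker
        elif rise <= rise_threshold:
            labels.append("noise")     # input to the noise power tracking
        else:
            labels.append(None)
    return labels

# Quiet background, a loud burst standing in for a vocalic nucleus, then quiet.
labels = classify_frames([50.0] * 10 + [100.0] * 10 + [50.0] * 10)
```

In this toy input the quiet frames are labelled as noise and the burst as a nucleus; a few frames after the burst remain excluded from noise tracking until the box-car window empties.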

The noise power tracking module may include a digital signal processor module implemented by an integrated circuit chip. Many such chips are available which are programmable and adapted to execute various types of algorithms. The algorithm which is associated with the noise and signal tracking function operates to determine both the signal energy content and the noise energy content and does so in the following manner.

Firstly a mathematical value is obtained which is indicative of the channel energies. This is done in each and every frame. Then the total energy is calculated. The system can then proceed to accommodate for automatic gain control changes. Once the energy is calculated, the results are then smoothed over predetermined time intervals. After the smoothed energy value is obtained, the logarithmic value of the total energy is computed.

After computing the logarithmic value of the total energy, a box-car integration of average for speech level estimate is performed at the input to the bandpass filter array. The next step incorporates the use of an asymmetric filter to filter the log energy for speech detection by monitoring the rise time of the speech signal. It should be appreciated that the speech signal is referred to generically and the incoming signal can be noise, an artifact (art) which is not a noise or a speech signal and may be due to heavy breathing or some other characteristic of the speaker's voice which essentially is not intelligence and is not noise.

In any event, it may also be a true speech signal.

To determine this, the rise of the instantaneous value of the logarithmic energy over the smoothed energy is monitored. The algorithm operates to divide the time interval which is associated with the rise and fall times of the signal into given periods. When the rise is positive as compared to negative, certain decisions are made as to the nature of the incoming signal to be recognised. These decisions, as indicated above, determine whether it is speech, an artifact or pure noise. For example, for an interval where the rise is negative, it is assumed that if the rise continues to be negative then this is a noise signal. The noise signal is accepted and the system continues to track the signal by smoothing the noise values, using these values to contribute to the averaged noise energy and applying the calculated values to the noise estimate. This is then utilised to form the template. The case of a positive transition is more difficult.

A positive transition can indicate noise, an artifact or speech. In order to make this determination, an integral of a non-linear function is employed. Based on comparing the integral value with certain thresholds, it can be determined whether or not a positive rise is indicative of speech, noise or an artifact. In this manner the values which emanate from the speech and noise tracker module are indicative of the true speech value. The programs for the speech and noise tracker are shown in Figures 5A-5C where the complete program is indicated.

Figure 6 shows the definition of engineering parameters which are necessary to understand the programming formats depicted in Figures 5A-5C. For further explanation, the procedure is accomplished once every single frame and operates as follows.

The first step in the procedure as indicated in Figure 5A is to obtain the energy in each channel as well as the total energy. This is shown in Steps 1 and 2. Then the energy is filtered in each channel, taking the automatic gain control scale changes into account, as shown in Steps 3 and 4. The next step is to smooth the energy values so that one obtains smoothed log values of the energy which are corrected for AGC. This is shown in Steps 5, 6 and 7. The next step is to obtain a box-car average for the speech level estimate at Step 8. Then the asymmetric filter value of the energy and the rise of the current energy over the filtered value are obtained, as shown in Steps 9 and 10. This particular aspect of the program is then exited and we proceed to Figure 5B. The variable r which is shown in Step 10 of Figure 5A is the amount by which the current log energy exceeds its asymmetrically smoothed value. During vocalic nuclei r goes positive and stays positive for a considerable interval. Special significance therefore attaches to its positive and negative intervals, and special processing is required when it first becomes positive or first becomes negative. This is shown in detail in Figure 5B. In any event, when r first becomes positive, the frame number is recorded as the possible beginning of a definite speech nucleus. Then the value of P, which is used to decide if it is speech, is reset. Then noise tracking is suspended. While r remains positive, the values of p are accumulated and the artifact and speech flags are set if P exceeds specified thresholds. These are indicated to the left of Figure 5B. When r first becomes negative, the noise tracker resets to the last known noise values and then resumes noise tracking after a given delay if speech or an artifact was detected, while making sure that the speech level assumed is high enough above the noise level. If speech was detected in this rise, the frame number is recorded as the end of a known speech interval.
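The handling of the intervals of r can be sketched as a small state machine. This is a simplified illustration and not the program of Figure 5B: the thresholds, delay and smoothing constant are placeholders, p is assumed equal to r, the check that the speech level sits far enough above the noise level is omitted, and the class and attribute names are hypothetical.

```python
class RiseStateMachine:
    """Tracks positive and negative intervals of the rise r, frame by frame."""

    def __init__(self, artifact_threshold=20.0, speech_threshold=60.0,
                 resume_delay=3, smoothing=0.1):
        self.artifact_threshold = artifact_threshold
        self.speech_threshold = speech_threshold
        self.resume_delay = resume_delay
        self.smoothing = smoothing
        self.positive = False
        self.accumulated = 0.0       # P, the running sum of p over this rise
        self.is_artifact = False
        self.is_speech = False
        self.noise = None            # running noise estimate
        self.saved_noise = None      # noise value known before the rise began
        self.hold = 0                # frames to wait before tracking noise
        self.start_frame = None
        self.speech_intervals = []   # (start_frame, end_frame) pairs

    def step(self, frame, rise, energy):
        if rise > 0 and not self.positive:       # start of a rise
            self.positive = True
            self.start_frame = frame             # possible nucleus beginning
            self.accumulated = 0.0
            self.is_artifact = self.is_speech = False
            self.saved_noise = self.noise        # noise tracking is suspended
        elif rise <= 0 and self.positive:        # end of the rise
            self.positive = False
            self.noise = self.saved_noise        # restore last known noise value
            if self.is_speech or self.is_artifact:
                self.hold = self.resume_delay    # delay before tracking resumes
            if self.is_speech:
                self.speech_intervals.append((self.start_frame, frame))
        if self.positive:
            self.accumulated += rise             # assumed p = r
            self.is_artifact = self.accumulated > self.artifact_threshold
            self.is_speech = self.accumulated > self.speech_threshold
        elif self.hold > 0:
            self.hold -= 1
        elif self.noise is None:
            self.noise = energy                  # first noise sample
        else:
            self.noise += self.smoothing * (energy - self.noise)

machine = RiseStateMachine()
rises = [0, 0, 0, 30, 30, 30, -5, -5, -5, -5, -5]
energies = [10, 10, 10, 80, 80, 80, 10, 10, 10, 10, 10]
for frame, (rise, energy) in enumerate(zip(rises, energies)):
    machine.step(frame, rise, energy)
```

In this toy run the loud frames never reach the noise estimate, and one speech interval from frame 3 to frame 6 is recorded.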

Whilst r remains negative noise tracking continues after a predetermined delay. This is all shown in flow charts which clearly illustrate the various operations provided.

Figure 5C shows the generation of output variables which as indicated are used to provide the operational template data as for example shown in Figures 2 and 4. Thus the major aspect of the present system is to provide templates whereby noise is added in a correct and anticipated way to formulate a template which has an expected signal to noise ratio level associated therewith. The noise level associated with the template is indicative of an estimate of the noise which will be present in the upcoming signal. In this manner the recognition probability of a speech recognition system is substantially increased.

It will be understood that the generation of such templates utilising the addition of noise as indicated above can be employed in any speech recognition system utilising templates to provide a comparison with an incoming signal to determine whether that signal is in fact speech, an artifact, or noise. Thus the system operates to provide speech recognition templates which are first formed in the absence of noise and which are improved for use in the presence of noise by modifying them to be equal to their expected value in the presence of noise.

Claims

CLAIMS:

1. A speech recognition system including a spectrum analyser for providing spectral magnitude values of utterances at an output and for comparing stored templates with processed spectral values to provide an output upon a favourable comparison indicative of the presence of speech in the utterance, characterised in the provision of apparatus for generating the stored templates, comprising first means coupled with the spectrum analyser for providing a signal indicative of the predicted noise signal of an incoming signal, and means coupled to said first means and responsive to said predicted noise signal to generate templates which are modified according to the predicted noise signals.

2. A speech recognition system as claimed in claim 1, wherein said first means includes a speech and noise level tracking means operative to provide at an output a first signal indicative of the power level of a speech signal in the presence of noise, and a second signal indicative of the average noise power.

3. A speech recognition system as claimed in claim 1 or 2, wherein said spectrum analyser comprises a plurality of bandpass filters arranged in a filter bank array with each filter adapted to pass a given spectral component according to the bandwidth of said filter.

4. The speech recognition system as claimed in any one of the preceding claims, wherein said second means includes means for generating templates under low noise conditions and for modifying said templates according to said predicted noise signal.

5. A speech recognition system as claimed in any one of the preceding claims, wherein said first means includes means for predicting the signal-to-noise ratio of a forthcoming speech signal.

6. A speech recognition system according to claim 3, wherein said first means includes means for measuring the average and the variance of said bandpass filters to provide an estimate of the noise passing properties of each filter.

7. A speech recognition system as claimed in claim 6, wherein said noise estimate is estimated on the basis of said filter response to Gaussian noise.

8. A speech recognition system as claimed in claim 4, wherein said templates generated in the absence of noise are noise free token templates and means responsive to said templates to provide an average value to provide at outputs Base Form data, and means for modifying said Base Form data according to a current predicted noise signal.

9. A speech recognition system including a spectrum analyser for providing spectral magnitude values of utterances at an output and for comparing predetermined stored templates with processed spectral values to provide an output upon a favourable comparison indicative of the presence of speech in the utterance, characterised in the provision of apparatus for generating the templates, comprising processing means coupled to the analyser for generating templates for storage by modifying the predetermined templates according to an expected calculated value indicative of the presence of noise, means for comparing the generated templates with incoming signals to provide the output.

10. A speech recognition system as claimed in Claim 9, wherein in the processing means said expected calculated value is indicative of the presence of Gaussian noise.

11. A speech recognition system as claimed in claim 9 or 10, wherein the processing means includes means for averaging noise-free templates to provide Base Form data outputs and modifying said Base Form data outputs by adding to said data, noise data calculated.

12. A speech recognition system as claimed in claim 9, 10 or 11, wherein the processing means includes averaging means for providing at an output the average value of successive pairs of the spectral magnitude values as provided by said analyser, scaling means coupled to the averaging means output for providing a given length field signal and means for converting the given length field signal to a logarithmic signal for providing one of the Base Form data outputs.

13. A speech recognition system as claimed in claim 12, comprising squaring means coupled to the averaging means for providing at an output a vector signal indicative of the squared magnitude of the average value of successive pairs and means coupled to the output of the squaring means for providing other ones of the Base Form data outputs.

14. A speech recognition system as claimed in claim 13, wherein the means coupled to the output of the squaring means includes relative energy forming means responsive to the vector signal to provide a Base Form energy parameter and speech and noise level tracker means for providing at an output a Base Form parameter indicative of the power level of both speech and noise.

15. A method of forming templates for use in a speech recognition system, comprising steps of providing a signal indicative of an expected noise level for a forthcoming signal and modifying a predetermined template according to the provided signal to provide a template having the expected noise level.

16. A method as claimed in claim 15, wherein the step of providing includes measuring the response of a given speech processing channel in response to noise and estimating the signal to be provided based on the measurement.

17. A method as claimed in claim 15 or 16, wherein the step of modifying includes first forming a Base Form template relatively free from noise and modifying the Base Form template according to the signal indicative of the expected noise level.

18. A method as claimed in claim 15 or 16, wherein the step of modifying includes forming Base Form templates relatively free from noise, adding noise to each template and, averaging the added noise template data to form new templates according to the analysed data.

19. A method as claimed in claim 15, wherein the step of providing a signal includes predicting the signal-to-noise ratio of a forthcoming signal to be recognised by modifying the power in a present signal by averaging the log spectral parameters of all templates at the same speaking level and signal-to-noise ratio and, using the averaged parameters to form modified templates.

20. A method of forming templates for use in a speech recognition system, comprising the steps of modifying formed templates before they are used for comparison by adding a noise signal to the templates indicative of a predicted value so that the templates as modified behave as if they were generated from a speech signal having the same signal-to-noise ratio as a forthcoming signal to be recognised.

21. A method as claimed in claim 20, wherein the steps of modifying includes, predicting the signal-to-noise ratio of a forthcoming speech signal by using a current signal-to-noise ratio as the predicted value based on a current speaking level, and averaging the current noise power and speech power to define the added noise signal.

22. A method of forming templates for use in a speech recognition system substantially as described with reference to the drawings.

23. A speech recognition system substantially as described with reference to the drawings.