IL98060A - Speech recognition system - Google Patents

Speech recognition system

Info

Publication number
IL98060A
Authority
IL
Israel
Prior art keywords
feature
background noise
feature sets
test
speech
Prior art date
Application number
IL9806091A
Other versions
IL98060A0 (en)
Inventor
Alberto Berstein
Original Assignee
Dsp Group Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dsp Group Inc filed Critical Dsp Group Inc
Priority to IL9806091A priority Critical patent/IL98060A/en
Publication of IL98060A0 publication Critical patent/IL98060A0/en
Publication of IL98060A publication Critical patent/IL98060A/en

Links

Landscapes

  • Image Analysis (AREA)

Description

A SPEECH RECOGNITION SYSTEM
THE DSP GROUP, INC.
Inventor: Alberto Berstein
FIELD OF THE INVENTION The present invention relates to pattern recognition generally and to speech recognition in adverse background noise conditions in particular.
BACKGROUND OF THE INVENTION Prior art speech recognition systems analyze a voice signal and compare it to stored speech patterns in order to determine what was said. When the stored speech patterns and the voice signal under analysis are acquired in different environments, the pattern similarity is corrupted by the mismatched conditions, an effect which leads to recognition errors.
Prior art speech recognizers typically implement supervised learning, or training, in order to provide stored speech patterns. Supervised learning is performed in a "clean" environment (e.g. one with little or no background noise). In the training phase, a speech recognizer "learns" a reference vocabulary by digitally storing a set of patterns, known as templates, representing acoustical features of the words comprising the vocabulary.
A testing phase, during which the words to be recognized, known as test utterances, are spoken, is performed in a natural environment which is typically noisy. During this phase, the acoustical features of the word to be recognized are extracted and compared with those of each template. By selecting the template(s) showing the maximum similarity, a decision about the utterance being tested can be reached.
Speech is a non-stationary process and therefore, speech recognizers segment spoken words into time frames of approximately 20 to 30 msec, typically with a 50% overlap between frames. These time frames are typically assumed to be stationary.
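By way of illustration, the following sketch shows one conventional way of performing such segmentation. The 30 msec frame length and 8000 Hz sampling rate are taken from the embodiment described hereinbelow; the Hamming window is an added assumption, as the text does not specify a window.

```python
import numpy as np

def segment(signal, fs=8000, frame_ms=30, overlap=0.5):
    """Split a sampled signal into overlapping, windowed time frames."""
    frame_len = int(fs * frame_ms / 1000)      # 240 samples at 8 kHz
    hop = int(frame_len * (1 - overlap))       # 50% overlap -> hop of 120 samples
    window = np.hamming(frame_len)             # window choice is an assumption
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop : i * hop + frame_len] * window
                     for i in range(n_frames)])
```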
The acoustical features, mentioned hereinabove, are typically extracted from each frame and are combined together into a feature set, or feature vector, for each frame. The most commonly used features are the coefficients of an autoregressive model or transformations of them. Typical features include Linear Prediction Coefficients, Cepstrum coefficients, bank-of-filter energies, etc. In general, feature sets reflect vocal tract characteristics.
Short time spectral estimations of segments of speech can be obtained from such sets of coefficients according to methods known in the art.
A detailed description of different sets of features may be found in "Digital Processing of Speech Signals" by L.R. Rabiner and R.W. Schafer, Prentice Hall, Chapter 8.
Speech Recognition systems can be classified as follows: Isolated Word Recognition, Connected Speech Recognition and Continuous Speech Recognition. Alternatively, they can be classified as Speaker Dependent systems which require the user to train the system, and Speaker Independent systems which utilize data bases containing speech of many speakers. A description of many available systems can be found in "Putting Speech Recognizers to Work", P. Wallich, IEEE Spectrum, April 1987, pp. 55-57.
There are many approaches to recognizing speech. The Dynamic Programming approach, as described in U.S. Patent 4,488,243 to Brown et al., stores a feature vector for each time frame and the entirety of feature vectors are utilized as a time series of vectors. Through a dynamic programming algorithm, the Dynamic Programming approach identifies the best match between an uttered word, known as the test utterance, and a given set of reference word templates. For each reference template, the algorithm determines a global similarity score between the test utterance and the reference template. The test utterance is identified by the reference template which yields the highest similarity score.
A Vector Quantization (VQ) approach is described in the article, "Isolated Word Speech Recognition Using Multisection Vector Quantization Codebooks", by D.K. Burton, J.H. Shore and J.T. Buck, published in IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. ASSP-33, August 1985, pp. 837-849. In this method, the feature vectors are quantized according to a sequence of codebooks which characterize each reference word. A global similarity score is calculated from the quantization errors and used to determine the most similar reference word.
In a Hidden Markov Model (HMM) approach, each reference word is represented by a model consisting of a sequence of states, each characterized by a probability distribution. In the recognition procedure, a dynamic programming algorithm is applied to find the best match between the test utterance feature vectors and the reference word states. A probability that the test feature vectors correspond to the given reference template is computed. The test utterance is identified as the reference word which yields the greatest probability.
There are two forms of the HMM approach: a discrete distribution approach and a continuous distribution approach. In the discrete distribution approach, also known as HMM-VQ, the test feature vectors are first quantized and labelled in accordance with a predetermined VQ codebook and the probabilities are computed for the resultant VQ labels. In the continuous distribution approach, the probabilities are directly computed for the feature vectors.
All of these methods may be extended to the recognition of connected or continuous speech by finding a sequence of reference templates which best match the connected speech test utterance in the sense that it provides a best global similarity score. A global similarity score algorithm is described in the article "A Model Based Connected Digit Recognizer Using Either Hidden Markov Models or Templates", by L.R. Rabiner, J.G. Wilpon, and B.H. Juang, published in Computer Speech and Language, Vol. 1, Dec. 1986, pp. 167-197.
Another similar application of the abovementioned methods is speech verification, in which the global similarity score is compared to a threshold to determine whether or not the test utterance contains a given reference word or words.
In the methods outlined above, there is a similarity measurement between feature vectors of the test utterance and feature vectors stored as templates, models or VQ codewords. This measurement, often called a local distance or local distortion measure, is strongly affected by the presence of noise, or more precisely, by differences in the background noise characteristics of the training and testing phases.
Prior art speech recognition systems resolve the problem by training in a "clean" environment and by applying speech enhancement techniques to the noisy test words in order to input noise-reduced utterances to the recognition system. For example, J.H.L. Hansen and Mark A. Clements, in "Constrained Iterative Speech Enhancement with Applications to Automatic Speech Recognition", published in Proceedings of the International Conference on Acoustics, Speech and Signal Processing 1988, pp. 561-564, disclose a preprocessor that "would produce speech or recognition features which are less sensitive to background noise so that existing recognition system may be employed".
A similar approach is offered by Y. Ephraim et al. in "A Linear Predictive Front-End Processor for Speech Recognition in Noisy Environments", Proceedings of the International Conference on Acoustics, Speech and Signal Processing 1987, pp. 1324-1327. Their system "takes into account the noise presence in estimating the feature vector" in order "to make existing speech recognition systems, which have proved to perform successfully in a laboratory environment, immune to noise".
SUMMARY OF THE INVENTION Because prior art pattern recognition systems operate serially, i.e. noise effects are first reduced and only afterwards is a standard recognition phase applied, important information contained in the clean templates is ignored. This information is valuable when eliminating noise effects in the feature estimation step. A need, therefore, exists for an improved pattern recognition system, capable of estimating the feature vectors of the noisy patterns, which fully exploits the information present in the clean templates.
It is therefore an object of the present invention to provide an improved method and apparatus for pattern recognition in noisy environments.
Briefly described, the present invention performs a feature correction on noisy test patterns in order to eliminate noise effects. The correction utilizes correction filters and is performed when computing the local distance between a clean template or codeword feature vector and a noisy test pattern feature vector. An estimate of the background noise is used in conjunction with different template hypotheses to build the correction filters. In this way, the present invention recognizes patterns using all the useful information available to it.
For each local distance computation, a Wiener filter is constructed according to well known techniques. A signal Power Spectral Density (PSD) is estimated using a feature vector of the template under hypothesis and a noise PSD is estimated by measuring the background noise surrounding the test pattern. A Signal to Noise Ratio (SNR) matching gain is applied to the Wiener filter in order to match the filter to the SNR of the test pattern.
The filtering is performed in the frequency domain and produces a corrected spectral representation from which the corrected feature vector is obtained. The distance (similarity) between the corrected feature vector and the feature vector of the template under hypothesis is then evaluated. The correction produces a minimum global distance between the template and the noisy pattern only when the template hypothesis is correct, i.e. when the feature vectors used to build the filters pertained to a template which is similar to the speech present in the noisy pattern under test.
There is provided, in accordance with the present invention, a speech recognition system for identifying spoken words in a noisy environment. The system includes apparatus for providing a test feature set of an input signal characterizing at least a portion of a spoken speech utterance, apparatus for providing a plurality of reference feature sets of reference speech utterances spoken in a quiet environment, apparatus for providing a background noise feature set of background noise present in the speech utterance and feature comparison apparatus for producing corrected test feature sets from the test, reference and background noise feature sets and for comparing the corrected test feature sets with the reference feature sets thereby to recognize which reference speech utterance was spoken in the input signal.
There is also provided, in accordance with the present invention, a pattern recognition system including apparatus for providing a test feature set of a generally noisy input signal characterizing at least a portion of an input pattern contained within the input signal, apparatus for providing a plurality of reference feature sets of reference patterns produced in a quiet environment, apparatus for providing a background noise feature set of background noise present in the input signal, and feature comparison apparatus for producing a corrected test feature set from the test, reference and background noise feature sets and for comparing the corrected test feature set with the reference feature sets thereby to recognize which reference pattern exists in the input signal.
Additionally, in accordance with the present invention, the feature comparison apparatus includes apparatus for calculating a noise reducing filter from each one of the reference feature sets and the background noise feature set. The feature comparison apparatus also includes apparatus for calculating a plurality of corrected test feature sets via application of the plurality of noise reducing filters to the test feature set. The feature comparison apparatus further includes apparatus for calculating a global similarity measure between the reference feature sets and the plurality of corrected test feature sets and apparatus for selecting a corrected test feature set which is most similar to the reference feature sets.
Further, in accordance with the present invention, the noise reducing filters are Wiener filters.
Still further, in accordance with the present invention, the test, background and reference feature sets include spectral characterizations of the input signal, the background noise and the reference patterns, respectively. The spectral characterization is preferably obtained by linear prediction analysis.
Moreover, in accordance with the present invention, the apparatus for providing a background noise feature set includes apparatus for averaging an autocorrelation function of the background noise surrounding each spoken word.
Additionally, in accordance with the present invention, the apparatus for calculating a global similarity measure performs a Dynamic Time Warping algorithm. Alternatively, the apparatus for calculating a global similarity measure performs a Hidden Markov Model algorithm. A second alternative is that the apparatus for calculating a global similarity measure includes apparatus for performing vector quantization.
Furthermore, in accordance with the present invention, the reference feature sets incorporate at least one vector quantization codebook having a plurality of codewords and the feature comparison apparatus provides the codeword which is most similar to the corrected feature set.
Still further, in accordance with the present invention, the system includes apparatus for estimating verbal contents of the speech utterance from the global similarity measure.
Additionally, in accordance with the present invention, the system includes apparatus for controlling output devices in accordance with the verbal contents. The system can also include apparatus for determining the identity of a speaker of the input signal from the global similarity measure. Alternatively, or in addition, the system can include apparatus for providing an index of a reference speech word from the global similarity measure.
Still further, in accordance with the present invention, the system includes apparatus for transmitting a sequence of at least one index along a communication channel.
There is also provided, in accordance with the present invention, a speech recognition method for identifying spoken words in a noisy environment including the steps of providing a test feature set of an input signal characterizing at least a portion of a spoken speech utterance, providing a plurality of reference feature sets of reference speech utterances spoken in a quiet environment, providing a background noise feature set of background noise present before and after the speech utterance, and producing corrected test feature sets from the test, reference and background noise feature sets and comparing the corrected test feature sets with the reference feature sets thereby to recognize which reference speech utterance was spoken in the input signal.
Finally, there is provided, in accordance with the present invention, a method for pattern recognition including the steps of providing a test feature set of a generally noisy input signal characterizing at least a portion of an input pattern contained within the input signal, providing a plurality of reference feature sets of reference patterns produced in a quiet environment, providing a background noise feature set of background noise present in the input signal, and producing corrected test feature sets from the test, reference and background noise feature sets and comparing the corrected test feature sets with the reference feature sets thereby to recognize which reference pattern exists in the input signal.
BRIEF DESCRIPTION OF THE DRAWINGS Fig. 1 is a block diagram illustration of an improved pattern recognition system employing a background noise estimator and a feature correction step constructed and operative in accordance with the present invention; Fig. 2 is a block diagram illustration of pattern recognition hardware operative to implement the system of Fig. 1; Fig. 3 is a block diagram illustration of the feature correction step of Fig. 1; Figs. 4 and 5 are graphic illustrations of the operation of Fig. 3 on two templates; Fig. 6 is a pictorial illustration of a DTW searching grid and the different parameters involved in each local distance computation useful in understanding the implementation of the present invention in a DTW Dynamic Programming Algorithm; and Fig. 7 is a block diagram illustration of an alternative embodiment of the present invention utilizing a Hidden Markov Model approach.
DESCRIPTION OF THE PREFERRED EMBODIMENT Reference is now made to Fig. 1 which illustrates, in block diagram form, a pattern recognition system constructed and operative in accordance with the present invention. The pattern recognition system will be described in the context of a speech recognition system, it being understood that any type of pattern, such as an ECG, can be recognized.
The speech recognition system typically comprises an input device 8, such as a microphone or a telephone handset, for acquiring a speech utterance in a necessarily quiet environment for training and in a not-necessarily-quiet environment for recognition.
The system additionally comprises a band pass filter 10 for receiving the speech utterance and for eliminating from it frequencies below a first frequency, typically 150 Hz, and above a second frequency, typically 3200 Hz. Typically, band pass filter 10 is also an anti-aliasing filter, thereby to enable proper sampling of the speech utterance.
For other types of pattern recognition systems, the input device 8 is any type of input device capable of receiving the training and test signal. In such systems, devices capable of conditioning the input and preparing it for analog to digital conversion are typically substituted for the band pass filter 10.
The speech recognition system additionally comprises an Analog-to-Digital Converter (ADC) 12 for sampling the analog band-passed speech utterance, typically at an 8000 Hz sampling rate, and a segmenter 14 for segmenting the sampled speech utterance into frames of approximately 30 msec in length. A feature extractor 16 computes a feature set or vector for each frame using any known analysis method, such as the Linear Prediction method. The feature vector produced by the Linear Prediction method represents acoustic features of the vocal tract.
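By way of illustration, a Linear Prediction feature extractor of the kind described for block 16 might be sketched as follows. The disclosure only names the Linear Prediction method; the autocorrelation method with Levinson-Durbin recursion shown here is a conventional choice rather than one stated in the text, and the residual energy E is used as the model gain q^2 of equation (1) hereinbelow.

```python
import numpy as np

def autocorrelation(frame, p=10):
    """First p+1 autocorrelation coefficients R(0)..R(p) of one frame."""
    return np.array([np.dot(frame[:len(frame) - k], frame[k:])
                     for k in range(p + 1)])

def levinson_durbin(R):
    """LPC coefficients a(1..p) and residual energy E from autocorrelations R."""
    p = len(R) - 1
    a = np.zeros(p + 1)
    E = R[0]
    for i in range(1, p + 1):
        k = (R[i] - np.dot(a[1:i], R[i - 1:0:-1])) / E   # reflection coefficient
        a_new = a.copy()
        a_new[i] = k
        a_new[1:i] = a[1:i] - k * a[i - 1:0:-1]          # update lower-order coefficients
        a = a_new
        E *= (1 - k * k)                                 # prediction-error update
    return a[1:], E                                      # E plays the role of q^2
```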
A switch 18, connected to a switch 19, switches the system between training and recognition modes. In the training mode (switch position T) the system learns a predetermined set of template patterns. When switch 18 is positioned in the T position switch 19 is forced to be open.
It will be noted that switches 18 and 19 are for illustration only; they depict the connections between different steps performed by a microprocessor, described in more detail hereinbelow with respect to Fig. 2, and are typically implemented in software.
For the training mode, the system comprises a template creation block 20 for creating reference templates from features extracted by feature extractor 16 and a reference template storage block 22 for storing the reference templates until they are needed. Template creation block 20 creates the reference templates according to well known techniques, such as Dynamic Time Warping (DTW), Vector Quantization (VQ), or Hidden Markov Model (HMM).
For DTW, reference templates are comprised of a sequence of feature vectors for the entirety of frames forming a spoken word. For VQ, each reference template is represented by a sequence of indices of VQ codewords and for HMM each reference template, also known as a model, is represented by a sequence of probability distributions. For HMM-VQ, the HMM model is based on a VQ codebook, which is common to all templates.
When switch 18 is set to recognition mode (R position) , switch 19 is automatically closed and input speech is acquired in a typically noisy environment.
In accordance with a preferred embodiment of the present invention, for the recognition mode, the system additionally comprises a background noise estimation unit 30 for estimating spectral properties of the background noise, a feature comparison unit 32 for producing a similarity measure between a test feature vector and a reference feature vector given the noise estimate, and a global scoring unit 34 for producing a global score for the similarity between a reference template and a test utterance, based on a multiplicity of similarity measures produced by the feature comparison unit 32.
Background noise estimation unit 30 typically comprises a Voice Operated Switch (VOX) 35, for identifying when no speech utterance is present, a background noise estimator 36 for estimating noise characteristics of noise present between words (i.e. when no speech is present) and for computing a noise feature vector, and a noise template storage unit 38 for storing the computed noise template for later utilization by feature comparison unit 32.
A suitable VOX 35 is described in U.S. Patent 4,959,865 to Stettiner et al.
For other pattern recognition systems, VOX 35 is typically replaced by a suitable detector typically for detecting the moment that the signal energy rises above a background noise level.
The background noise estimator 36 is typically an exponential decay averager which produces, as the noise feature vector, an average value of an Autocorrelation Function (ACF) of the input signal of the frames having no speech activity. The noise feature vector is the noise template.
For each noisy speech frame under test, feature comparison unit 32 takes as input the speech feature vector of the noisy speech frame, the stored background noise template and a frame of the reference template whose similarity to the speech feature vector is to be measured. Using the background noise template and the frame of the reference template, a Wiener filter is created. The frame of the noisy speech feature vector is filtered producing a corrected frame feature vector. The local similarity between the corrected feature vector and the reference template feature vector is then calculated.
The local similarities are computed for each reference template frame. A global score is then computed for the entirety of the reference template based on the local scores.
The above-described operation is described in more detail hereinbelow with reference to Fig. 3.
It will be appreciated that the reference templates can be any type of template. They can consist of a plurality of different words spoken by one person, for identifying the spoken word or words, or they can consist of average properties of utterances spoken by a plurality of people for identifying the speaker rather than his words. In speech recognition, each template represents a word or portion of a word in the vocabulary to be recognized. In speaker recognition, each template represents the identity of a person. Reference templates are described in the following article, incorporated herein by reference: G. Doddington, "Speaker Recognition: Identifying People by Their Voices," Proceedings of the IEEE, Vol. 73, 1985, pp. 1651-1664.
The operation described hereinabove is performed for each of the reference templates stored in storage unit 22. The template offering the maximum similarity is selected, in accordance with the algorithm being used for the global scoring unit 34, as the best candidate for the speech utterance under test. Such an algorithm might be DTW, VQ or HMM.
It will be appreciated that the system of the present invention can alternatively perform connected or continuous speech recognition. In such a system, the global scoring unit 34 will select the sequence of reference templates which yields the best total similarity score.
Once a decision is reached, an output device 40, such as a voice actuated device, a communication channel or a storage device, is operated in response to the meaning of the recognized word or words contained in the speech utterance or in response to the identity of the speaker.
Reference is now briefly made to Fig. 2 which illustrates a hardware configuration for implementing the block diagram of Fig. 1. The system typically comprises an input device 50 for acquiring a speech utterance or background noise, a COder-DECoder (CODEC) 52 for implementing the band pass filter 10 and the ADC 12, an output device 56 for operating in response to the identified word or words, and a microprocessor 54 for implementing the remaining elements of the block diagram of Fig. 1.
Microprocessor 54 typically works in conjunction with a Random Access Memory (RAM) 58 and a Read Only Memory (ROM) 60, as is known in the art. RAM 58 typically serves to implement reference template storage unit 22 and noise template storage unit 38. ROM 60 is operative to store a computer program which incorporates the method of the present invention. Data and address buses connect the entirety of the elements of Fig. 2 in accordance with conventional digital techniques.
Input device 50 may be, as mentioned hereinabove, a microphone or a telephone handset. CODEC 52 may be a type TCM29C13 integrated circuit made by Texas Instruments Inc., Houston, Texas. RAM 58 may be a type LC3664NML 64 Kbit Random Access Memory manufactured by Sanyo, Tokyo, Japan. ROM 60 may be a 128 Kbit Programmable Read Only Memory manufactured by Cypress Semiconductor, San Jose, California. Microprocessor 54 may be a TMS320C25 digital signal microprocessor made by Texas Instruments Inc., Houston, Texas. The output device 56 may be a dialing mechanism, a personal computer or any other device to be activated by known voice commands. Alternatively, it may be apparatus for communicating the identified word or words to a communication channel or for storing the identified word or words.
Reference is now made to Fig. 3 which illustrates elements of the feature comparison unit 32, in accordance with a preferred embodiment of the present invention. The feature comparison unit 32 typically comprises a filter generator 70 for generating filters, such as Wiener filters, from the Power Spectral Density (PSD) representations of the feature vectors of the following: the reference template, the test speech utterance and the background noise template.
Filter generator 70 is operative to generate noise suppression filters of which Wiener filters are one type. The Wiener filter is an optimal linear filter designed to reduce additive noise according to the criterion of minimum mean-squared error. Other types of filters are possible and are discussed, for example, in the following paper which is incorporated herein by reference: R.J. McAulay and M.L. Malpass, "Speech Enhancement Using a Soft-Decision Noise Suppression Filter," IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. ASSP-28, No. 2, 1980, pp. 137-145.
Unit 32 also comprises a filtering unit 72 for applying a Wiener filter W(w) to the PSD of feature vectors of the test speech utterance to produce corrected feature vectors. Unit 32 further comprises a similarity measurement unit 74 for comparing the corrected feature vector to the reference feature vector.
The feature vectors, denoted in Fig. 3 by the letter E, can be any kind of vector provided that they can be relatively easily converted to a PSD representation. Typical examples are the coefficients of the Autocorrelation Function (ACF), denoted Rs(n), Linear Prediction Coefficients (LPC), denoted a(n), and Cepstrum (CEP) coefficients, denoted C(n). The index "n" indicates the frame number.
The features are referred to as vectors with dimension "p", where p is typically 10, and are evaluated as described in Chapter 8 of the book by L.R. Rabiner and R.W. Schafer entitled Digital Processing of Speech Signals, published by Prentice Hall, Inc., Englewood Cliffs, NJ, 1978.
Unit 32 also comprises a plurality of E-P conversion units 69 for converting feature vectors to PSD representations, denoted by the letter P. An estimate of a PSD, denoted P(w), of the frame can be obtained from the LPC a(n) as follows:

P(w) = q^2 / | 1 - Σ_{k=1..p} a_k · e^{-jkw} |^2     (1)

by applying a Fast Fourier Transform (FFT) to the LPC vector a(n), padded with zeroes, as is known in the art. A typical number of frequency components "w" is 64. "q" is the LPC model gain obtained from the Linear Prediction analysis, as is known in the art.
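A minimal sketch of equation (1), assuming the LPC vector and gain produced by the Levinson-Durbin sketch hereinabove; the 64 frequency components follow the text, while the 128-point FFT length is an assumption.

```python
import numpy as np

def lpc_to_psd(a, q2, n_freq=64):
    """P(w) = q^2 / |1 - sum_k a_k e^{-jkw}|^2 from LPC a(1..p) and gain q^2."""
    # FFT of the zero-padded polynomial [1, -a_1, ..., -a_p] evaluates the denominator
    A = np.fft.fft(np.concatenate(([1.0], -np.asarray(a))), 2 * n_freq)
    return q2 / np.abs(A[:n_freq]) ** 2        # keep frequencies w in [0, pi)
```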
As can be seen from Fig. 3, E-P conversion units 69 convert reference feature vectors Es to PSDs Ps(w), test feature vectors Ey to PSDs Py(w) and the background noise feature vector Eb to PSD Pb(w). Units 69 also calculate the amount of energy in each feature vector, which is also the first coefficient R(0) of the ACF.
It will be appreciated that the feature vectors might be PSDs, in which case, E-P conversion units 69 only calculate coefficients R(0) of the ACF of each feature vector.
The background noise feature vector, denoted Eb, is evaluated whenever there is background noise only, typically both before and after a speech utterance is spoken. An ACF coefficient vector Rd(n) is first calculated for a number of frames before the word is spoken. An estimate of the average of the background noise ACF coefficients is produced by iterating the following equation for a number of frames before and after the spoken test utterance:

Rb(n) = a · Rb(n-1) + (1-a) · Rd(n)     (2)

where
Rb = estimate of the background noise ACF
Rd = ACF vector of the frame
n = frame index
a = time constant (typically, a = 0.95)

Rb is restarted before the next spoken utterance.
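A sketch of the exponential-decay averager of equation (2); frames classified by VOX 35 as containing no speech update the running background ACF estimate, which is restarted before each utterance.

```python
import numpy as np

class NoiseEstimator:
    """Exponential-decay average of the background noise ACF, per equation (2)."""
    def __init__(self, alpha=0.95):
        self.alpha = alpha
        self.Rb = None                  # restarted before each spoken utterance

    def update(self, Rd):
        """Fold the ACF vector Rd of one noise-only frame into the estimate."""
        Rd = np.asarray(Rd, dtype=float)
        if self.Rb is None:
            self.Rb = Rd
        else:
            self.Rb = self.alpha * self.Rb + (1 - self.alpha) * Rd
        return self.Rb
```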
Alternatively, the noise template can consist of a plurality of noise feature vectors, one per frame in the speech utterance. For this embodiment, the ACF of the noise is separately estimated both before and after the speech utterance and the feature vectors per frame are interpolated from the values of the ACF before and after the speech utterance.
The noise feature vector Eb is typically comprised of LPC coefficients β of the background noise and is evaluated from the ACF noise estimate using Linear Prediction analysis as described in Chapter 8 of Digital Processing of Speech Signals by Rabiner and Schafer. The background noise Power Spectral Density Pb(w) is estimated using the LPC coefficients β in a manner similar to that shown in equation 1.
The Wiener filter W(w) for the present frame and the present reference template is produced as follows:

W(w) = G · |Ps(w)| / ( G · |Ps(w)| + |Pb(w)| )     (3)

where Ps is the PSD of the present frame of the present reference template, Pb is the PSD of the interpolated frame of the background noise, and G is a Signal to Noise Ratio matching gain defined as follows:

G = [ Ry(0) - Rb(0) ] / Rs(0)     (4)

where Ry(0), Rb(0) and Rs(0) are the first coefficients of the ACF of the present frame of the test speech utterance, of the interpolated frame of the background noise and of the present frame of the present reference template, respectively. These coefficients represent the estimated powers of their respective frames.
In the filtering unit 72, the Wiener filter W(w) is multiplied by the PSD of the speech utterance Py, thereby to produce a corrected PSD Pc. Pc is converted, in a P-E conversion unit 73, to a corrected feature vector Ec which is then compared, in similarity measurement unit 74, to the reference template feature vector Es in accordance with known similarity measurement techniques.
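By way of illustration, the following sketch combines equations (3) and (4) with the filtering step of unit 72. The clamping of G at zero and the small epsilon in the denominator are added numerical safeguards, not steps stated in the text.

```python
import numpy as np

def wiener_correct(Ps, Pb, Py, Rs0, Rb0, Ry0, eps=1e-12):
    """Return the corrected test PSD Pc(w) = W(w) * Py(w)."""
    G = max(Ry0 - Rb0, 0.0) / Rs0                             # eq. (4), clamped at zero
    W = G * np.abs(Ps) / (G * np.abs(Ps) + np.abs(Pb) + eps)  # eq. (3)
    return W * Py
```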
The similarities are then processed by the global scoring unit 34 in order to reach a decision about the spoken word or words in the input test utterance.
Reference is now made to Figs. 4 and 5 which illustrate, in exemplary form, the operation of the feature comparison unit 32. In this example, two reference feature vectors are to be compared to a noisy test feature vector which is defined as a feature vector similar to the first reference feature vector to which noise has been added.
In Fig. 4A, the first reference feature vector Es1 and the background noise feature vector Eb are converted to PSDs Ps1(w) and Pb(w), respectively, and Ps1(w) and Pb(w) are combined to produce a first Wiener filter W1(w).
In Fig. 4B, the PSD Py(w) of the noisy test feature vector is filtered by the Wiener filter W1(w) and a corrected PSD Pc1 is produced.
In Fig. 5A, the second reference feature vector Es2 and the background noise feature vector Eb are converted to PSDs Ps2(w) and Pb(w), respectively, and Ps2(w) and Pb(w) are combined to produce a second Wiener filter W2(w).
In Fig. 5B, the PSD of the noisy test feature vector Py(w) is filtered by the Wiener filter W2(w) and a corrected PSD Pc2 is produced.
It will be appreciated that Pc1 is closer to Ps1 than Pc2 is to Ps2. This is reflected in the respective feature vectors, Ec1 and Ec2 (not shown). Thus, in the similarity measurement unit 74, the noisy test feature vector will be found similar to the first feature vector.
The above example illustrates a particular case of a pattern recognition system where each reference template and the unclassified noisy test pattern are represented by single feature vectors. More realistic recognition systems deal with templates constructed as series or sets of feature vectors. In such cases, a global scoring mechanism is introduced to reach a decision using the similarities between individual vectors.
In a speech recognition system, feature vectors may represent phonemes and a series of feature vectors may represent a word. In order to match the verbal contents of a test speech utterance with reference utterances, a global score must be evaluated using the similarities between each phoneme (feature vector) in the test speech utterance and the phonemes of each reference template. Such a global scoring mechanism is typically based on different types of algorithms like Dynamic Time Warping or Hidden Markov Models.
A specific embodiment of the present invention, which makes use of the Dynamic Time Warping global scoring approach will now be described. Modifications using different similarity measurement approaches or feature sets will be known to those skilled in the art.
In the DTW approach, a warping function giving the best time alignment between two sequences of features is searched for. A global distance accumulating the local distances over the warping function represents the similarity between the words. A detailed explanation of the DTW algorithm can be found in the article, incorporated herein by reference, by H. Sakoe and S. Chiba entitled "Dynamic Programming Algorithm Optimization for Spoken Word Recognition", IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. 26, 1978, pp. 43-49.
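A minimal DTW sketch using the unconstrained symmetric path steps (i-1,j), (i,j-1) and (i-1,j-1); the slope constraints and weightings of Sakoe and Chiba, to which the text defers, are omitted, and the path-length normalization is a common convention rather than one stated here.

```python
import numpy as np

def dtw_global_distance(local_dist):
    """Accumulate an (I x J) matrix of local distances into a global score."""
    I, J = local_dist.shape
    D = np.full((I, J), np.inf)
    D[0, 0] = local_dist[0, 0]
    for i in range(I):
        for j in range(J):
            if i == 0 and j == 0:
                continue
            best_prev = min(D[i - 1, j] if i > 0 else np.inf,
                            D[i, j - 1] if j > 0 else np.inf,
                            D[i - 1, j - 1] if i > 0 and j > 0 else np.inf)
            D[i, j] = local_dist[i, j] + best_prev
    return D[-1, -1] / (I + J)     # common path-length normalization
```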
Reference is now made to Fig. 6 which illustrates a standard DTW searching grid as is known in the art. In an embodiment of the invention, the local distance, defined as the similarity between the speech feature vector and the chosen reference feature vector at a given frame, is evaluated at each point in the DTW grid using the Wiener filtering as described hereinabove. Thus, at each grid point all the relevant information present in the recognition system is used. This information includes the background noise estimated before and after the speech utterance was spoken and the feature vectors of the reference templates acquired during the training phase.
The resultant global similarity score between the chosen reference template and the spoken word is saved and the process repeated using the next reference template. When the comparison of the entirety of reference templates is completed, the reference template most similar to the spoken word is selected as the recognized word, wherein the term "most similar" is defined as is known for DTW algorithms. Microprocessor 54 then sends the selected word to the actuator device 56.
Fig. 6 additionally illustrates the operation of the present invention with a feature set of cepstrum coefficients. In the grid of Fig. 6, the sequences of features corresponding to the spoken word and to the reference template are mapped to the coordinates. For each frame "i" of the reference template, Ps(w,i), Rs(0,i) and Cs(i) are stored. For each frame "j" of the spoken word, Ry(0,j) and Cy(j) are stored. Additionally, the background noise features Pb(w) and Rb(0) are available. Index "0" refers to the first component of the Autocorrelation vector, which represents the power of the corresponding frame of the words and, for the background noise, represents the average power level of the noise. A typical warping function 90 is also shown in Fig. 6.
The local distance is evaluated at each grid point using the Wiener filtering in accordance with the present invention. For example, at point j=2, i=3 the following steps are performed:

1. The Wiener filter W2,3(w) is evaluated according to:

W2,3(w) = G · |Ps(w,3)| / ( G · |Ps(w,3)| + |Pb(w)| )     (5)

where G is defined as

G = [ Ry(0,2) - Rb(0) ] / Rs(0,3)     (6)

In other words, the filter W2,3(w) is built under the assumption that Ps(w,3) is similar to the speech spectrum present in the second noisy speech frame (j=2).

2. The cepstral representation Cw of W2,3(w) is typically computed by taking the natural logarithm of each component of W2,3(w), performing an Inverse Fast Fourier Transform (IFFT) and truncating the resulting vector to its first ten components.
An alternative method for computing the cepstral representation comprises separately computing the cepstra of the numerator and denominator of the function in equation 5 from their respective ACF functions. As is known in the art, cepstral vectors can be obtained from ACF functions through linear prediction analysis.

3. Feature correction is performed by adding the resulting vector Cw to the cepstrum of the noisy speech frame Cy. This operation is equivalent to filtering the noisy frame using the Wiener filter. This kind of processing is called homomorphic filtering and is well known in the art. It is described in L. Rabiner et al., Digital Processing of Speech Signals, Chapter 7, incorporated herein by reference.

4. The local distance L2,3 is computed at the specific point of the grid according to the following:

L2,3 = || Cs(3) - ( Cy(2) + Cw ) ||^2

It will be appreciated that this use of hypothesized reference information when reducing noise effects at each local distance computation is a unique feature of this invention.
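A sketch of steps 2 through 4 above: the filter's cepstrum Cw is obtained by logarithm, inverse transform and truncation, added to the noisy frame's cepstrum Cy (homomorphic filtering), and compared against the reference cepstrum Cs. Treating the 64 filter samples as the non-negative-frequency half of an even spectrum, and the floor guarding the logarithm, are implementation assumptions.

```python
import numpy as np

def filter_cepstrum(W, n_cep=10):
    """Truncated cepstrum of a Wiener filter sampled on 64 frequency points."""
    log_W = np.log(np.maximum(W, 1e-12))      # floor guards against log(0)
    return np.fft.irfft(log_W)[:n_cep]        # real inverse FFT, first 10 terms

def local_distance(Cs, Cy, Cw):
    """L = || Cs - (Cy + Cw) ||^2, the corrected-cepstrum local distance."""
    return float(np.sum((Cs - (Cy + Cw)) ** 2))
```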
A second embodiment of the invention, which makes use of the Hidden Markov Model global similarity approach, will now be discussed. A tutorial description of HMM is given in the paper by Rabiner, L.R. and Juang, B.H., "An Introduction to Hidden Markov Models", IEEE ASSP Magazine, Vol. 3, No. 1, 1986, pp. 4-16, incorporated herein by reference.
The training phase, as in the DTW approach, is performed using words spoken in a quiet environment. For each word in the vocabulary, a HMM is obtained according to well known techniques.
Reference is now made to Fig. 7 which discloses an isolated word HMM recognizer that uses discrete observation symbol densities. Those units which are similar to those in the previous embodiment have similar reference numerals.
As in the previous embodiment, the training phase is performed using words spoken in a quiet environment. In accordance with the present embodiment, a vector quantizer (VQ) unit 100 is used in the training phase to map each feature vector of the reference spoken words into a discrete codebook index. Vector quantizer unit 100 is described in the article "Vector Quantization", by R.M. Gray, published in the IEEE ASSP Magazine, Vol. 1, No. 2, April 1984, pp. 4-29, incorporated herein by reference.
Using the codebook in unit 100, for each reference word, a Hidden Markov Model (HMM) is built, via a HMM training unit 101, according to techniques described in "An Introduction to Hidden Markov Models", incorporated herein by reference. The plurality of HMM models corresponding to the spoken reference words are then stored in HMM model storage unit 102.
In the recognition mode, the set of feature test vectors of a word spoken in a noisy environment are utilized to produce an observation sequence using the hypothesized filter approach described hereinabove.
As before, each feature test vector is filtered in a feature comparison unit, here labelled VQ feature comparison unit 104. However, in this embodiment, the codewords of the VQ codebook 100 are used when building the filters, rather than the feature vectors of the reference templates.
As before, the estimated background noise feature vector is utilized in the filtering process of the feature comparison unit 104. For each input feature test vector, unit 104 produces a set of corrected feature test vectors corresponding to the codewords of VQ codebook 100.
In a VQ mapping unit 106, each corrected feature test vector is compared to the codeword used to correct it. The codeword which is closest to its corrected feature test vector is chosen as the codeword representing the input feature test vector and its index provided as an output. The closeness comparison is typically performed in accordance with standard techniques, such as those described in the article by R.M. Gray.
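By way of illustration, units 104 and 106 might be sketched together as follows; correct_with is a hypothetical callable standing in for the Wiener correction chain described hereinabove, with each codeword serving as the template hypothesis.

```python
import numpy as np

def vq_map(test_vec, codebook, correct_with):
    """Return the index of the codeword representing one test feature vector."""
    best_idx, best_dist = -1, np.inf
    for idx, codeword in enumerate(codebook):
        corrected = correct_with(test_vec, codeword)  # hypothesis = this codeword
        dist = float(np.sum((corrected - codeword) ** 2))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx
```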
The set of indices for a spoken test word, called an "observation sequence", is provided to a HMM global scoring unit 108 which, as is known in the art, typically comprises a probability computation unit 110 and a maximum search unit 112. The operation of units 110 and 112 is described in the article "An Introduction to Hidden Markov Models".
The HMM models stored in HMM model storage unit 102 are also provided to the HMM global scoring unit 108 which is operative to select the HMM model which most closely matches the observation sequence.
Probability computation unit 110 computes the probability that the observation sequence pertains to one of the HMM models and maximum search unit 112, utilizing the probability results, selects the HMM model most likely to match the observation sequence. Maximum search unit 112 outputs an index of the selected word.
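A sketch of the probability computation of unit 110, assuming the standard scaled forward algorithm for a discrete-observation HMM (pi: initial state probabilities, A: transition matrix, B: per-state observation probabilities); the text defers to the Rabiner-Juang tutorial for the actual procedure. Maximum search unit 112 then simply takes the argmax of these scores over the stored models.

```python
import numpy as np

def forward_log_prob(obs, pi, A, B):
    """log P(observation sequence | model) via the scaled forward algorithm."""
    alpha = pi * B[:, obs[0]]
    log_prob = 0.0
    for t in range(1, len(obs)):
        alpha = (alpha @ A) * B[:, obs[t]]
        scale = alpha.sum()            # rescale to avoid numerical underflow
        log_prob += np.log(scale)
        alpha /= scale
    return log_prob + np.log(alpha.sum())
```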
It will be appreciated that the embodiment of Fig. 7 can be extended to any suitable VQ based recognition system.
While the invention has been shown and described with reference to preferred embodiments in speech recognition, further modifications and improvements may be made by those skilled in the art to other forms of pattern recognition. All such modifications which retain the basic underlying principles disclosed and claimed herein are within the scope of this invention.

Claims (46)

1. A pattern recognition system comprising: means for providing a test feature set of a generally noisy input signal characterizing at least a portion of an input pattern contained within said input signal; means for providing a plurality of reference feature sets of reference patterns produced in a quiet environment; means for providing a background noise feature set of background noise present in said input signal; and feature comparison means for producing a corrected test feature set from said test, reference and background noise feature sets and for comparing said corrected test feature set with said reference feature sets thereby to recognize which reference pattern exists in said input signal.
2. A system according to claim 1 and wherein said feature comparison means includes means for calculating a noise reducing filter from each one of said reference feature sets and said background noise feature set and means for calculating a plurality of corrected test feature sets via application of each of said noise reducing filters to said test feature set.
3. A system according to claim 2 and wherein said feature comparison means also include means for calculating a global similarity measure between said plurality of reference feature sets and said corrected test feature sets and means for selecting a reference feature set which is most similar to said corrected test feature sets.
4. A system according to claim 1 and wherein said test, background and reference feature sets include spectral characterizations of said input signal, said background noise and said reference patterns, respectively.
5. A system according to claim 4 and wherein said noise reducing filters are Wiener filters.
6. A system according to claim 4 and wherein said spectral characterization is obtained by linear prediction analysis.
7. A system according to claim 1 and wherein said means for providing a background noise feature set include means for averaging an autocorrelation function of the background noise surrounding each said spoken word.
8. A system according to claim 3 and wherein said means for calculating a global similarity measure perform a Dynamic Time Warping algorithm.
9. A system according to claim 3 and wherein said means for calculating a global similarity measure perform a Hidden Markov Model algorithm.
10. A system according to any of claims 1 - 7 and wherein said reference feature sets incorporate at least one vector quantization codebook having a plurality of codewords and said feature comparison means provides the codeword which is most similar to said corrected feature set.
11. A system according to any of the previous claims and wherein said input pattern is a speech signal.
12. A speech recognition system for identifying spoken words in a noisy environment comprising: means for providing a test feature set of an input signal characterizing at least a portion of a spoken speech utterance; means for providing a plurality of reference feature sets of reference speech utterances spoken in a quiet environment; means for providing a background noise feature set of background noise present in said speech utterance; and feature comparison means for producing corrected test feature sets from said test, reference and background noise feature sets and for comparing said corrected test feature sets with said reference feature sets thereby to recognize which reference speech utterance was spoken in said input signal.
13. A system according to claim 12 and wherein said feature comparison means include means for calculating a noise reducing filter from each one of said reference feature sets and said background noise feature set.
14. A system according to claim 13 and wherein said feature comparison means also include means for calculating a plurality of corrected test feature sets via application of said plurality of noise reducing filters to said test feature sets.
15. A system according to claim 14 and wherein said feature comparison means also include means for calculating a global similarity measure between said reference feature sets and said plurality of corrected test feature sets and means for selecting a corrected test feature set which is most similar to said reference feature sets.
16. A system according to any of claims 13 - 15 and wherein said noise reducing filters are Wiener filters.
17. A system according to claim 12 and wherein said test, background and reference feature sets include spectral characterizations of said input signal, said background noise and said reference patterns, respectively.
18. A system according to claim 17 and wherein said spectral characterization is obtained by linear prediction analysis.
19. A system according to claim 12 and wherein said means for providing a background noise feature set include means for averaging an autocorrelation function of the background noise surrounding each said spoken word.
20. A system according to claim 15 and wherein said means for calculating a global similarity measure perform a Dynamic Time Warping algorithm.
21. A system according to claim 15 and wherein said means for calculating a global similarity measure perform a Hidden Markov Model algorithm.
22. A system according to any of claims 12 - 20 and wherein said reference feature sets incorporate at least one vector quantization codebook having a plurality of codewords and said feature comparison means provides the codeword which is most similar to said corrected feature set.
23. A system according to claims 15 - 22 and also comprising means for estimating verbal contents of said speech utterance from said global similarity measure.
24. A system according to claim 23 and also comprising means for controlling output devices in accordance with said verbal contents.
25. A system according to claims 15 - 22 and also comprising means for determining the identity of a speaker of said input signal from said global similarity measure.
26. A system according to claims 15 - 22 and also comprising means for providing an index of a reference speech word from said global similarity measure.
27. A system according to claim 26 and also comprising means for transmitting a sequence of at least one index along a communication channel.
28. A system according to claim 26 and also comprising means for storing said index.
29. A system according to claim 15 and wherein said means for calculating a global similarity measure comprises means for performing vector quantization.
30. A system according to claims 15 - 22 and also comprising means for verifying the identity of a speaker of said input signal from said global similarity measure.
31. A system according to either of claims 25 or 30 and also comprising means for controlling output devices in accordance with the determined identity of said speaker.
32. A method for pattern recognition comprising the steps of: providing a test feature set of a generally noisy input signal characterizing at least a portion of an input pattern contained within said input signal; providing a plurality of reference feature sets of reference patterns produced in a quiet environment; providing a background noise feature set of background noise present in said input signal; and producing corrected test feature sets from said test, reference and background noise feature sets and comparing said corrected test feature sets with said reference feature sets thereby to recognize which reference pattern exists in said input signal.
33. A method according to claim 32 and wherein said step of producing includes the step of calculating a noise reducing filter from each one of said reference feature sets and said background noise feature set and the step of calculating a plurality of corrected test feature sets via application of each of said noise reducing filters to said test feature set.
34. A method according to claim 33 and wherein said step of producing also includes the step of calculating a global similarity measure between said plurality of reference feature sets and said plurality of corrected test feature sets and selecting a reference feature set which is most similar to said corrected test feature sets.
35. A method according to claim 32 and wherein said step of providing a background noise feature set includes the step of averaging an autocorrelation function of the background noise surrounding each said spoken word.
36. A method according to claim 34 and wherein said step of calculating a global similarity measure performs a Dynamic Time Warping algorithm.
37. A method according to claim 34 and wherein said step of calculating a global similarity measure performs a Hidden Markov Model algorithm.
38. A speech recognition method for identifying spoken words in a noisy environment comprising the steps of: providing a test feature set of an input signal characterizing at least a portion of a spoken speech utterance; providing a plurality of reference feature sets of reference speech utterances spoken in a quiet environment; providing a background noise feature set of background noise present before and after said speech utterance; and producing corrected test feature sets from said test, reference and background noise feature sets and comparing said corrected test feature sets with said reference feature sets thereby to recognize which reference speech utterance was spoken in said input signal.
39. A method according to either of claims 32 or 38 and also comprising the step of estimating verbal contents of said speech utterance from said global similarity measure.
40. A method according to claim 39 and also comprising the step of controlling output devices in accordance with said verbal contents.
41. A method according to either of claims 32 or 38 and also comprising the step of determining the identity of a speaker of said input signal from said global similarity measure.
42. A method according to either of claims 32 or 38 and also comprising the step of providing an index of a reference speech word from said global similarity measure.
43. A method according to claim 42 and also comprising the step of transmitting a sequence of at least one index along a communication channel.
44. Apparatus substantially as shown and described hereinabove.
45. Apparatus substantially as illustrated in any of the drawings.
46. A method substantially as shown and described hereinabove.
IL9806091A 1991-05-03 1991-05-03 Speech recognition system IL98060A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
IL9806091A IL98060A (en) 1991-05-03 1991-05-03 Speech recognition system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
IL9806091A IL98060A (en) 1991-05-03 1991-05-03 Speech recognition system

Publications (2)

Publication Number Publication Date
IL98060A0 IL98060A0 (en) 1992-06-21
IL98060A true IL98060A (en) 1995-10-31

Family

ID=11062410

Family Applications (1)

Application Number Title Priority Date Filing Date
IL9806091A IL98060A (en) 1991-05-03 1991-05-03 Speech recognition system

Country Status (1)

Country Link
IL (1) IL98060A (en)

Also Published As

Publication number Publication date
IL98060A0 (en) 1992-06-21

Similar Documents

Publication Publication Date Title
US5778342A (en) Pattern recognition system and method
Togneri et al. An overview of speaker identification: Accuracy and robustness issues
KR100631786B1 (en) Method and apparatus for speech recognition by measuring frame's confidence
US6671669B1 (en) combined engine system and method for voice recognition
JP4802135B2 (en) Speaker authentication registration and confirmation method and apparatus
US6029124A (en) Sequential, nonparametric speech recognition and speaker identification
KR0123934B1 (en) Low cost speech recognition system and method
US5459815A (en) Speech recognition method using time-frequency masking mechanism
KR101892733B1 (en) Voice recognition apparatus based on cepstrum feature vector and method thereof
KR20010102549A (en) Speaker recognition
JP2004504641A (en) Method and apparatus for constructing a speech template for a speaker independent speech recognition system
JP2006235243A (en) Audio signal analysis device and audio signal analysis program for
JP4858663B2 (en) Speech recognition method and speech recognition apparatus
Marković et al. Application of teager energy operator on linear and mel scales for whispered speech recognition
Zolnay et al. Extraction methods of voicing feature for robust speech recognition.
JP3098593B2 (en) Voice recognition device
Kumar et al. Effective preprocessing of speech and acoustic features extraction for spoken language identification
Sharma et al. Speech recognition of Punjabi numerals using synergic HMM and DTW approach
IL98060A (en) Speech recognition system
JP4749990B2 (en) Voice recognition device
Hernando Pericás et al. A comparative study of parameters and distances for noisy speech recognition
JP2007508577A (en) A method for adapting speech recognition systems to environmental inconsistencies
Krishnamoorthy et al. Application of combined temporal and spectral processing methods for speaker recognition under noisy, reverberant or multi-speaker environments
Genoud et al. Deliberate Imposture: A Challenge for Automatic Speaker Verification Systems.
Stainhaouer et al. Automatic detection of allergic rhinitis in patients

Legal Events

Date Code Title Description
KB Patent renewed
KB Patent renewed
KB Patent renewed
EXP Patent expired