CA2387091A1 - Method and system for detection of phonetic features - Google Patents


Info

Publication number
CA2387091A1
Authority
CA
Canada
Prior art keywords
parameters
outcome
speech data
conjunctive
operator
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
CA002387091A
Other languages
French (fr)
Inventor
Jont B. Allen
Mazin G. Rahim
Lawrence K. Saul
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
AT&T Corp
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Publication of CA2387091A1 publication Critical patent/CA2387091A1/en
Abandoned legal-status Critical Current

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/16Speech classification or search using artificial neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025Phonemes, fenemes or fenones being the recognition units
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band

Landscapes

  • Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

In various embodiments, techniques for detecting phonetic features in a stream of speech data are provided by first dividing the stream of speech data into a plurality of critical bands, segmenting the critical bands into streams of consecutive windows and determining various parameters for each window per critical band. The various parameters can then be combined using various operators in a multi-layered network including a first layer that can process the various parameters by using a sigmoid operator on a weighted sum of the parameters. The processed sums can then be further combined using a hierarchy of conjunctive and disjunctive operators to produce a stream of detected features.

Description

METHOD AND SYSTEM FOR DETECTION OF PHONETIC FEATURES
BACKGROUND OF THE INVENTION
1. Field of Invention This invention relates to speech recognition systems that detect phonetic features.
Description of Related Art A goal of automated speech recognition (ASR) is to recognize human speech with the accuracy of human listeners, including recognizing speech that has been degraded by background noise or distorted by various filters inherent in communication devices such as telephones. Unfortunately, ASR systems rarely perform with the accuracy of human listeners. However, by modeling an ASR system after the human organic recognition system, ASR accuracy can theoretically rise to that of a human listener.
Unfortunately, while various aspects of the human recognition system are well documented, conventional ASR systems have failed to capitalize on the human model.
Accordingly, these conventional ASR systems do not maintain the robust capacity to recognize speech in poor listening conditions as do humans. Thus, there is a need for new speech recognition methods and apparatus that provide accurate phonetic feature recognition.
SUMMARY OF THE INVENTION
In various embodiments, methods and systems are provided for phonetic feature recognition based on critical bands of speech. In various embodiments, techniques for detecting phonetic features in a stream of speech data are provided by first dividing the stream of speech data into a plurality of critical bands, segmenting the critical bands into streams of consecutive windows and determining various parameters for each window per critical band. The various parameters can then be combined using various operators in a multi-layered network.
A first layer of the multi-layered network can process the various parameters by weighting the parameters, forming a sum for each critical band and processing the sums using sigmoid operators. The processed sums can then be further combined using a hierarchy of conjunctive and disjunctive operators to produce a stream of detected features.
In other exemplary embodiments, a training technique is tailored to the multi-layered network by iteratively detecting features, comparing the detected features to a stream of predetermined feature labels and updating various internal weights using various approaches such as an expectation-maximization technique and a maximum likelihood estimation technique.
BRIEF DESCRIPTION OF THE DRAWINGS
The invention is described in detail with regard to the following figures, wherein like numerals reference like elements, and where:
Figure 1 is a block diagram of an exemplary feature recognition system;
Figure 2 is a block diagram of the feature recognizer of Figure 1;
Figure 3 is a block diagram of an exemplary front-end of the feature recognizer of Figure 2;
Figure 4 is a block diagram of the exemplary back-end of the feature recognizer of Figure 2;
Figure 5 is a block diagram of a portion of the back-end of Figure 4 with various training circuits to enable learning;
Figure 6 is a block diagram of an exemplary first-layer combiner according to the present invention; and Figure 7 is a flowchart outlining an exemplary method for recognizing and training on various phonetic features.
DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
A speech recognition system can provide a powerful tool to automate various transactions such as buying or selling over a telephone line, automatic dictation machines and various other transactions where information must be acquired from a speaker.
Unfortunately, speech recognition systems can produce unacceptable error rates in the presence of background noise or when human speech has been filtered by various electronic systems such as telephones, recording mechanisms and the like.
However, by applying various organic models to automatic speech recognition (ASR) systems, it is theoretically possible to enable machines to detect speech with the accuracy of a human listener.
One organic model of interest is based on a hypothesis that different parts of the frequency spectrum can be independently analyzed in the early stages of speech recognition. That is, by dividing a broad frequency spectrum into a number of narrow frequency ranges known as "critical bands" and properly processing and combining the resultant information, various phonetic features can be accurately recognized.
Unfortunately, previous attempts at deriving an accurate ASR model based on critical bands have proven largely unsuccessful.
However, by first forming and processing various critical bands of speech, then appropriately combining the processed critical bands using hidden variable modeling and various learning techniques, an ASR system can be developed that can reliably detect various phonetic features (also known as distinctive features) even in the presence of extreme background noise.
One desirable goal of a feature recognition system is to distinguish two phonetic features known as sonorants and obstruents. A sonorant ("[+sonorant]") can be one of a group of auditory features recognizable as vowels, nasals and approximants, and can be characterized by periodic vibrations of the vocal cords. That is, much of the energy of a sonorant can be found in a particular narrow frequency range and its harmonics. An obstruent ("[-sonorant]"), on the other hand, can include phonetic features such as stops, fricatives and affricates, and can be characterized as speech sounds having an obstructed air stream. Table 1 below demonstrates the breakout between sonorants [+sonorant] and obstruents [-sonorant].
Table 1

                              [+voiced]                              [-voiced]
[-sonorant]   stops           b (bee), d (day), g (gay)              p (pea), t (tea), k (key)
(obstruents)  fricatives      z (zone), v (van), dh (then), zh (azure)
              affricates      jh (joke)                              ch (choke)
[+sonorant]   nasals          m (mom), n (noon), ng (sing)
(sonorants)   approximants    l (lay), r (ray), w (way), y (yacht)

While phonetic features such as [+sonorant] can be described as having periodicity, the particular frequency ranges of periodicity will be established by a speaker's pitch. However, by dividing a spectrum into many critical bands, some of those critical bands can not only detect [+sonorant] vocal energies, but those critical bands containing [+sonorant] energy can display an improved signal-to-noise ratio (SNR) as compared to processing [+sonorant] speech against broad-band noise. Accordingly, by emulating the multi-band processing approach of the human auditory system, phonetic features such as [+sonorant] and [-sonorant] can be distinguished, even in noisy and filtered environments.
Figure 1 is an exemplary block diagram of a speech recognition system 100. The system 100 includes a feature recognizer 120 that is connected to a data source 110 through an input link 112 and to a data sink 130 through an output link 122.
The exemplary feature recognizer 120 can receive speech data from the data source 110 and detect various phonetic features in the speech data. The exemplary feature recognizer 120 can detect phonetic features by dividing the spectrum of the speech data into a number of critical bands, processing each critical band to produce various cues and advantageously combining the various cues to produce a stream of phonetic features that can be passed to the data sink 130 via the output link 122.
The data source 110 can provide the feature recognizer 120 with either physical speech or speech data in any format that can represent physical speech, including binary data, ASCII data, Fourier data, wavelet data, data contained in a word processing file, a WAV file, an MPEG file or any other file format that contains compressed or uncompressed speech data. Furthermore, the data source 110 can be any one of a number of different types of data sources, such as a person, a computer, a storage device, or any known or later developed combination of hardware and software capable of generating, relaying or recalling from storage a message or any other information capable of representing physical speech. Similarly, the data sink 130 can be any device capable of receiving phonetic feature data, such as a digital computer, a communications network element, or any combination of hardware or software capable of receiving, relaying, storing, sensing or perceiving data or information representing phonetic features.
The links 112 and 122 can be any known or later developed device or system for connecting the data source 110 or the data sink 130 to the feature recognizer 120. Such devices include a direct serial/parallel cable connection, a connection over a wide area network or a local area network, a connection over an intranet, a connection over the Internet, or any connection over any other distributed processing network or system.
Additionally, the input link 112 or the output link 122 can be any software devices linking various software systems. In general, the links 112 and 122 can be any known or later developed connection system, computer program, or any structure usable to connect the data source 110 or the data sink 130 to the feature recognizer 120.
Figure 2 is an exemplary feature recognizer 120 according to the present invention. The feature recognizer 120 includes a front-end 210 coupled to a back-end 220 via link 212.
In a first mode of operation, the front-end 210 receives a stream of training data via link 112, including a stream of speech data with a respective stream of feature labels that can indicate whether a particular window of speech data contains a particular phonetic feature. For example, a first segment in a stream of speech data can contain the phoneme /h/ ("hay") and have a respective feature label of [+sonorant], while a second segment that contains the phoneme /d/ ("ladder") can have a [-sonorant]
respective feature label and a third segment can contain random noise directed to neither a [+sonorant] nor a [-sonorant] feature. While the exemplary feature recognizer 120 can distinguish between sonorants [+sonorant] and obstruents [-sonorant], it should be appreciated that the various features distinguished and/or recognized by the feature recognizer 120 can vary without departing from the spirit and scope of the present invention. For example, in other exemplary embodiments, the feature recognizer 120 can detect/distinguish other phonetic features such as voicing, nasality and the like. Other phonetic features can include at least any one-bit phonetic feature described or otherwise referenced in Miller, G., and Nicely, P., "An analysis of perceptual confusions among some English consonants", J. Acoust. Soc. Am. 27(2):338-352 (1955), herein incorporated by reference in its entirety. Once the front-end 210 receives the training data, the front-end can perform a first set of processes on the speech data to produce a stream of processed speech data. The front-end 210 can then pass the stream of processed speech data, along with the stream of respective feature labels, to the back-end 220 using link 212. Using the processed speech data and respective feature labels, the back-end 220 can adapt various internal weights (not shown) such that the feature recognizer 120 can effectively learn to distinguish various features. The exemplary back-end 220 uses an expectation-maximization (EM) technique along with an iterative maximum likelihood estimation (MLE) technique to learn to distinguish various features. Information about the expectation-maximization and maximum likelihood techniques can be found in at least Dempster, A., Laird, N., and Rubin, D., "Maximum likelihood from incomplete data via the EM algorithm",
Journal of the Royal Statistical Society, B 39:1-38 (1977) herein incorporated by reference in its entirety. While the exemplary back-end 220 can use a combination of EM and MLE
techniques to learn from examples of training data, it should be appreciated that any combination of known or later developed techniques suitable to train a device or system to distinguish various phonetic features can be used without departing from the spirit and scope of the present invention.
During the training operation, the back-end 220 can receive the processed speech data and respective feature labels and train its internal weights (not shown) until the back-end 220 can effectively distinguish between various phonetic features of interest. Once the feature recognizer 120 is trained, the feature recognizer 120 can then operate according to a second mode of operation.
In the second mode of operation, the front-end 210 can receive a stream of speech data, process the stream of speech data as in the first mode of operation and provide a stream of processed speech data to the back-end 220. The back-end 220 can accordingly receive the stream of processed speech data, advantageously combine the different cues provided by the stream of processed speech data using the trained internal weights to detect/distinguish between various phonetic features and provide a stream of the detected features to link 122.
Figure 3 is an exemplary front-end 210 according to the present invention. The front-end 210 contains a filter bank 310, a nonlinear device 312 such as rectifying/squaring device, a low pass filter (LPF)/down-sampling device 314, a windowing/parameter measuring device 316 and a thresholding device 318. In operation, the filter bank 310 can first receive a stream of speech data via link 112.
The filter bank 310 can then transform the stream of speech data into an array of narrow bands of frequencies, i.e., critical bands, using a number of band-pass filters (not shown) incorporated into the filter bank 310. The exemplary filter bank 310 can divide speech data into twenty-four separate bands having center frequencies between two-hundred twenty-five (225) Hz and three-thousand six-hundred twenty-five (3625) Hz, with bandwidths ranging from one-half to one-third octave. However, it should be appreciated that the filter bank 310 can divide speech data into any number of critical bands having various center frequencies and bandwidths without departing from the spirit and scope of the present invention. Once the filter bank 310 divides the stream of speech data into its various critical bands, the streams of narrow-band speech data are provided to the nonlinear device 312.
The nonlinear device 312 receives the streams of narrow-band speech data, rectifies, i.e., removes the negative components of the narrow-band speech data, squares the streams of rectified speech data and then provides the streams of rectified/squared speech data to the LPF/down-sampling device 314.
The LPF/down-sampling device 314 receives the streams of rectified/squared speech data, removes the high frequency components from the streams of rectified/squared speech data to smooth the speech data, digitizes the streams of smooth speech data and provides the streams of digitized speech data to the windowing/parameter measuring device 316.
The windowing/parameter measuring device 316 receives the streams of digitized speech data and divides each stream of digitized speech data into a stream of sixteen millisecond (16 ms) contiguous non-overlapping windows. While the exemplary windowing/parameter measuring device 316 divides speech into contiguous non-overlapping windows of sixteen milliseconds, it should be appreciated that, in various exemplary embodiments, the size of the windows can vary as desired or otherwise required by design without departing from the spirit and scope of the present invention.
Furthermore, in other various exemplary embodiments, it should be appreciated that the various windows can be either non-overlapping or overlapping as desired, determined advantageous or otherwise required by design.
For each window of speech data, the windowing/parameter measuring device 316 can determine a number of statistical parameters associated with each window.
The exemplary windowing/parameter measuring device 316 determines six (6) statistics per window per critical band: the first two parameters being running estimates of the signal to noise ratio of a particular critical band, and the remaining four parameters being autocovariance statistics.
While the exemplary windowing/parameter measuring device 316 measures six parameters relating to signal-to-noise ratios and autocovariance statistics, it should be appreciated that, in various exemplary embodiments, the windowing/parameter measuring device 316 can determine various other qualities and/or determine a different number of parameters without departing from the spirit and scope of the present invention. For example, the windowing/parameter measuring device 316 can first determine the six parameters above for a particular window, then determine the first and second derivatives of the parameters using the previous and subsequent windows. Once the various parameters have been determined, the windowing/parameter measuring device 316 can provide the various parameters to the thresholding device 318.
The thresholding device 318 can receive the various parameters and normalize the values of the parameters. That is, the various parameters can be scaled according to a number of predetermined threshold values that can be derived, for example, from identically processed bands of white noise. After the thresholding device 318 normalizes the various channel parameters, the normalized channel parameters can be exported via links 212-1, 212-2, . . . 212-n. As discussed above, the exemplary front-end 210 can divide a stream of speech data into twenty-four channels. Accordingly, given that the front end 210 produces six parameters per channel, the exemplary thresholding device 318 can produce a total of one-hundred forty-four (144) parameters per sixteen millisecond window of speech at a rate of about sixty windows per second.
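For illustration only, the following Python sketch mirrors the front-end chain described above (filter bank 310 through thresholding device 318). The filter design, the crude noise-floor estimate, the particular six statistics and all helper names (critical_band_filters, front_end) are assumptions made for the sketch, not details taken from the disclosure.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def critical_band_filters(fs, n_bands=24, f_lo=225.0, f_hi=3625.0):
    """Band-pass filter bank with log-spaced band edges (illustrative design)."""
    edges = np.geomspace(f_lo, f_hi, n_bands + 1)
    return [butter(4, (edges[i], edges[i + 1]), btype="bandpass", fs=fs, output="sos")
            for i in range(n_bands)]

def front_end(speech, fs, win_ms=16, n_bands=24):
    """Divide speech into critical bands and measure six parameters per window per band."""
    win = int(fs * win_ms / 1000)                              # samples per 16 ms window
    lpf = butter(2, 60.0, btype="lowpass", fs=fs, output="sos")
    bands = []
    for sos in critical_band_filters(fs, n_bands):
        band = sosfilt(sos, speech)                            # narrow-band speech
        env = sosfilt(lpf, np.maximum(band, 0.0) ** 2)         # rectify, square, smooth
        noise_floor = np.percentile(env, 10) + 1e-12           # crude noise reference
        feats = []
        for w in range(len(env) // win):
            seg = env[w * win:(w + 1) * win]
            snr_mean = 10 * np.log10(seg.mean() / noise_floor + 1e-12)
            snr_peak = 10 * np.log10(seg.max() / noise_floor + 1e-12)
            c = seg - seg.mean()
            ac = np.correlate(c, c, "full")
            mid = len(ac) // 2
            feats.append([snr_mean, snr_peak,                  # two SNR-like estimates
                          *(ac[mid + 1:mid + 5] / (ac[mid] + 1e-12))])  # four autocovariances
        bands.append(feats)
    return np.asarray(bands)   # shape (n_bands, n_windows, 6), later threshold-normalized
```

The returned array corresponds to the 144 parameters per window (24 bands times 6 parameters) described above, before the white-noise threshold normalization applied by the thresholding device 318.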
Figure 4 is a block diagram of an exemplary back-end 220 according to the present invention. The exemplary back-end 220 includes a first number of first-layer combiners 410-1, 410-2, . . . 410-n, a second number of second-layer combiners 420-1, 420-2, . . . 420-m and a third-layer combiner 430. In operation, the first-layer combiners 410-1, 410-2, . . . 410-n each receive streams of parameters associated with various critical bands of speech via links 212-1, 212-2, . . . 212-n. The exemplary parameters in the streams of parameters can be sets of six measurements relating to signal-to-noise ratios and autocovariance statistics for contiguous, non-overlapping windows of speech data. However, as discussed above, the number, type and nature of the parameters can vary as desired or otherwise required by design without departing from the spirit and scope of the present invention. As each window of parameters is received by the first-layer combiners 410-1, 410-2, . . . 410-n, the various first-layer combiners 410-1, 410-2, . . . 410-n can perform a first combining operation according to Eq. (1):

Pr[X_ij = 1 | M_i] = σ(θ_ij · M_i),   (1)

wherein M_i is a set of measurements, i.e., a parameter vector, related to an i-th window of speech data, θ_ij is a set of weights {θ_1, θ_2, . . . θ_n} associated with M_i, and σ(z) = [1 + e^(-z)]^(-1) is the logistic function, also known as a sigmoid function. As discussed above, the weights θ_ij can be estimated by various training techniques. However, the weights θ_ij
can alternatively be derived by any method that can provide weights useful for detecting/distinguishing between various phonetic features without departing from the spirit and scope of the present invention.

For each window of speech data, each first-layer combiner 410-1, 410-2, . . . 410-n can multiply each parameter in M_i by its respective weight in θ_ij, add the respective products and process the product-sum using a sigmoid function. After each set of weights is processed, the output of each first-layer combiner 410-1, 410-2, . . . 410-n can be provided to the second-layer combiners 420-1, 420-2, . . . 420-m.
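As a concrete reading of Eq. (1), the short Python sketch below forms the weighted sum θ_ij · M_i for one window and passes it through the logistic function; the numeric values and the names sigmoid and first_layer_combiner are illustrative only.

```python
import numpy as np

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z)) used in Eq. (1)."""
    return 1.0 / (1.0 + np.exp(-z))

def first_layer_combiner(m_i, theta_ij):
    """Eq. (1): Pr[X_ij = 1 | M_i] = sigma(theta_ij . M_i) for one window, band and test."""
    return sigmoid(np.dot(theta_ij, m_i))

# illustrative usage with made-up numbers: one 6-parameter vector and one weight vector
m_i = np.array([0.8, 0.6, 0.9, 0.4, 0.2, 0.1])
theta_ij = np.array([1.5, 0.7, 2.0, -0.3, 0.1, -0.5])
p_ij = first_layer_combiner(m_i, theta_ij)   # a probability strictly between 0 and 1
```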
As shown in Figure 4, each second-layer combiner 420-1, 420-2, . . . 420-m can receive the outputs from three first-layer combiners 410-1, 410-2, . . . 410-n. However, it should be appreciated that, in various exemplary embodiments, each second-layer combiner 420-1, 420-2, . . . 420-m can receive any number of first-layer combiner outputs. Furthermore, it should be appreciated that each first-layer combiner 410-1, 410-2, . . . 410-n can provide its output to more than one second-layer combiner 420-1, 420-2, . . . 420-m without departing from the spirit and scope of the present invention.
Once each second-layer combiner 420-1, 420-2, . . . 420-m has received its respective first-layer combiner data, each second-layer combiner 420-1, 420-2, . . . 420-m can perform a second combining operation on its received data according to Eq. (2):

Pr[Y_i = 1 | M_i] = ∏_j Pr[X_ij = 1 | M_i],   (2)

where Pr[X_ij = 1 | M_i] is the conditional probability distribution of a first-layer combiner outcome given M_i, and X_ij denotes the first-layer combiner outcome of the j-th test in a particular critical band. As the output of each first-layer combiner can vary from zero to one, Eq. (2) suggests that the output of each second-layer combiner 420-1, 420-2, . . . 420-m can also vary from zero to one, and that the effect of Eq. (2) is to effectively perform a conjunction. That is, Eq. (2) can form an ANDing operation. For example, if a given second-layer combiner receives three first-layer combiner outputs and one of the first-layer combiners has an output of zero, the output of the second-layer combiner will also be zero regardless of the output values of the other first-layer combiners. Once each second-layer combiner 420-1, 420-2, . . . 420-m performs its second combining operation, the output of each second-layer combiner 420-1, 420-2, . . . 420-m can be provided to the third-layer combiner 430.

The third-layer combiner 430 receives the outputs from each second-layer combiner 420-1, 420-2, . . . 420-m and performs a third combining operation on the second-layer outputs according to Eq. (3):

Pr[Z = 1 | M] = 1 − ∏_i (1 − Pr[Y_i = 1 | M_i]),   (3)

where M = {M_1, M_2, . . .} denotes the entire set of parameter measurements, Z is a binary random variable and Pr[Z = 1 | M] is the conditional probability distribution for Z given M. The effect of the third-layer combiner 430 is to effect a disjunction of the various second-layer outputs Y_i. That is, the third-layer combiner 430 effectively performs an OR operation. For example, if any one of the second-layer combiner outputs is one, the output of the third-layer combiner 430 will also be one regardless of the output values of the other second-layer combiners.
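Continuing the illustration, Eqs. (2) and (3) reduce to a product within each band followed by a noisy-OR across bands; the Python sketch below assumes a simple array p of first-layer probabilities (rows indexed by critical band, columns by test) and uses illustrative names second_layer and third_layer.

```python
import numpy as np

def second_layer(p_i):
    """Eq. (2): conjunction (AND) over the first-layer outcomes X_ij within one band."""
    return np.prod(p_i)                           # Pr[Y_i = 1 | M_i]

def third_layer(p):
    """Eq. (3): disjunction (noisy-OR) over the second-layer outcomes Y_i across bands."""
    y = np.array([second_layer(p_i) for p_i in p])
    return 1.0 - np.prod(1.0 - y)                 # Pr[Z = 1 | M]

# illustrative usage: 3 bands x 2 tests of first-layer probabilities
p = np.array([[0.90, 0.80],
              [0.10, 0.70],
              [0.95, 0.90]])
pr_z = third_layer(p)   # near 1, because one band strongly supports the feature
```

A single band whose tests are all near one is enough to drive the disjunctive output toward one, mirroring the OR behavior described above.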
As discussed, the back-end 220 can determine Pr[Z | M], i.e., the probability that a window of speech data contains a particular phonetic feature such as a [+sonorant], based on critical band measurements of periodicity and SNR. This inference can involve a bottom-up propagation of information through the layered network of Figure 4.
However, other inferences can also be made, involving a combination of bottom-up and top-down reasoning. For example, posterior probabilities such as Pr[X_ij | Y_i, M_i] (the conditional probability distribution of a particular first-layer output X_ij given parameter measurements M_i and the output of a respective second-layer combiner) and Pr[Y_i | Z, M] (the conditional probability distribution of a particular second-layer output given parameter measurements M and the third-layer output Z) can be useful for learning from examples, i.e., training. Certain of these posterior probabilities can follow from the ANDing and ORing operations of Eqs. (2) and (3) above. For example, based on the AND operation, one can make the inference Pr[X_ij = 1 | Y_i = 1, M] = 1, i.e., given that a particular second-layer combiner output is one (Y_i = 1), a first-layer combiner feeding that second-layer combiner is also one. Furthermore, if Y_i = 1, then the inference X_ij = 1 can be made for all first-layer combiners feeding that second-layer combiner.
Likewise, based on the OR operation of Eq. (3), one can make the inference Pr[Y_i = 1 | Z = 0, M] = 0, i.e., given that the output of the third-layer combiner is zero (Z = 0), the output of a particular second-layer combiner is also zero. Furthermore, assuming Z = 0, then the inference Y_i = 0 can be made for all second-layer combiners.
Still other posterior probabilities can be computed from Bayes rule. To simplify the resulting expressions, let p_ij = Pr[X_ij = 1 | M_i] denote the conditional probabilities computed by the first-layer combiners of the back-end 220. Accordingly, in [-sonorant] critical bands, one can make an inference according to Eq. (4):

Pr[X_ij = 1 | Y_i = 0, M_i] = p_ij [(1 − ∏_{k≠j} p_ik) / (1 − ∏_k p_ik)],   (4)

where the term in square brackets in this equation is always less than one. Thus, Eq. (4) suggests that when an output in a second-layer combiner is known to be negative, one should decrease the probability that any particular test is positive for any respective first-layer combiner. Likewise, in [+sonorant] windows of speech, an inference can arise according to Eq. (5):

Pr[Y_i = 1 | Z = 1, M] = ∏_k p_ik / [1 − ∏_m (1 − ∏_k p_mk)],   (5)

where the denominator in this equation is always less than one. Equation (5) suggests that when a [+sonorant] feature has been detected in one or more critical bands, one should increase the probability that a [+sonorant] feature was detected in any particular first-level combiner. An advantage of the probabilistic graphical model described above is that it can formalize the intuitions embodied in Eqs. (4) and (5) in a quantitatively precise way.
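The posterior adjustments of Eqs. (4) and (5) can be written directly in terms of the first-layer probabilities p_ij, as in the Python sketch below, which reuses the illustrative p array layout of the earlier sketch; the function names are assumptions for the sketch.

```python
import numpy as np

def posterior_x_given_y0(p, i, j):
    """Eq. (4): Pr[X_ij = 1 | Y_i = 0, M_i] -- demote test j when band i is known negative."""
    others = np.prod(np.delete(p[i], j))              # product over k != j of p_ik
    return p[i, j] * (1.0 - others) / (1.0 - np.prod(p[i]))

def posterior_y_given_z1(p, i):
    """Eq. (5): Pr[Y_i = 1 | Z = 1, M] -- promote band i when the feature was detected."""
    pr_z1 = 1.0 - np.prod([1.0 - np.prod(p_m) for p_m in p])
    return np.prod(p[i]) / pr_z1

p = np.array([[0.90, 0.80],
              [0.10, 0.70],
              [0.95, 0.90]])
print(posterior_x_given_y0(p, 0, 0))   # smaller than the prior p[0, 0] = 0.90
print(posterior_y_given_z1(p, 0))      # larger than the prior prod(p[0]) = 0.72
```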
In various embodiments, it should be appreciated that the exemplary back-end 220 can be modified/extended without departing from the spirit and scope of the present invention. For example, as discussed above, the measurement vector M_i can consist of sets of six measurements relating to SNR and autocovariance statistics.
However, as also discussed above, the sets of measurements can be extended by including first and second-order time derivatives of the SNR and autocovariance statistics. Still further, a second extension can be had by feeding parameter measurements from consecutive windows, as opposed to the same window, to the logistic regressions under each second-layer combiner. Finally, while the exemplary back-end 220 has been described in reference to detecting [+/-sonorants], as discussed above, the back-end 220 can alternatively be used to detect other phonetic features such as voicing, nasality or any one-bit phonetic feature without departing from the spirit and scope of the present invention.
Figure 5 is a block diagram of a portion of the exemplary back-end 220 of Figure 4 used in conjunction with a set of training circuits 510 that can enable the back-end 220 to learn to distinguish various phonetic features. In operation, the exemplary back-end 220 can receive windows of processed speech data, and combine the processed speech data according to Eqs. (1)-(3) above. As each window of processed speech data is combined, the exemplary training circuits 510 can receive data from the third-layer combiner output, which can consist of a stream of predicted phonetic features.
The training circuits 510 can further receive the stream of processed speech data, along with a respective stream of phonetic labels from the data source 110 indicating whether a particular window of speech data actually contains a phonetic feature of interest.
Using the streams of processed speech data, predicted phonetic features and actual phonetic features (labels), the training circuits 510 can iteratively train the various weights in the first-layer combiners 410-1, 410-2, . . . 410-n. The training circuits 510 can estimate the various weights using an EM technique combined with an MLE technique.
The exemplary EM technique consists of two alternating steps, an E-step and an M-step. The exemplary E-step can include computing the posterior probabilities Pr[X_ij | Z, M] conditioned on the labels provided by the data source 110. The M-step can include updating the various parameters θ_ij in each logistic regression, using the posterior probabilities as target values.
The exemplary training data can be derived from wideband speech data and can, in various embodiments, be optionally contaminated with various noise sources, filtered or otherwise distorted. The exemplary training data can also contain a stream of [+/-sonorant] labels having phonetic alignment. For each window of speech data, a set of acoustic measurements, M^t, and a target label, z^t ∈ {0, 1}, indicating whether or not each window is a [+sonorant] feature can be associated. The first-layer parameters can then be chosen to maximize a log-likelihood (LL) according to Eq. (6):

LL = Σ_t log Pr[Z^t = z^t | M^t],   (6)

such that the back-end's output predictions match the data source's labels.
The EM process consists of two alternating steps, an E-step and an M-step. The E-step in this model can compute the posterior probabilities of the hidden variables, conditioned on the labels provided by the phonetic alignment. The calculations here are different for [-sonorant] and [+sonorant] windows of speech. The posterior probabilities for [-sonorant] windows can be calculated according to Eq. (7):

Pr[X_ij = 1 | Z = 0, M] = p_ij [(1 − ∏_{k≠j} p_ik) / (1 − ∏_k p_ik)],   (7)

while the posterior probabilities for [+sonorant] windows can be calculated according to Eq. (8):

Pr[X_ij = 1 | Z = 1, M] = p_ij [1 − (1 − ∏_{k≠j} p_ik) ∏_{m≠i} (1 − ∏_k p_mk)] / [1 − ∏_m (1 − ∏_k p_mk)].   (8)

The posterior probabilities of Eqs. (7) and (8) can be derived by applying Bayes rule to the left-hand sides of Eqs. (7) and (8), marginalizing the hidden variable Y_i, and making repeated use of Eqs. (4) and (5).
Next, the M-step of the EM process can update the parameters in each logistic regression to provide updated parameter estimates θ′_ij. Let q_ij^t = Pr[X_ij = 1 | Z^t = z^t, M^t] denote the posterior probabilities computed by Eqs. (7) and (8) using the current estimates θ_ij, and let p′_ij^t = Pr[X_ij = 1 | M^t] denote the prior probabilities computed from Eq. (1) using the updated parameter estimates θ′_ij. The M-step can then include replacing θ_ij by θ′_ij, where θ′_ij can be derived by Eq. (9):

θ′_ij = argmax Σ_t [q_ij^t log p′_ij^t + (1 − q_ij^t) log(1 − p′_ij^t)].   (9)

Because each term in Eq. (9) defines a concave function of θ′_ij, so that Eq. (9) poses a convex optimization problem, the maximization of Eq. (9) can be performed either by Newton's method or by a gradient ascent technique.

Once the new estimates θ′_ij are determined, the training circuits 510 can provide the new estimates θ′_ij to their respective first-layer combiners 410-1 through 410-n.
Accordingly, the first-layer combiners 410-1 through 410-n can incorporate the new estimates θ′_ij, and the next window of speech data can be similarly processed until the entire stream of training data has been processed, the back-end 220 displays adequate performance or as otherwise desired.
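As a rough illustration of the training loop of Eqs. (6)-(9), the Python sketch below alternates an E-step that computes the posterior targets of Eqs. (7) and (8) with an M-step that refits each logistic test by plain gradient ascent on Eq. (9). The array layout (one weight vector per band and test), the learning rate and iteration counts, and all function names are assumptions of the sketch rather than the patented training circuits.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def first_layer(theta, M):
    """Eq. (1) for one window: p[i, j] = sigma(theta_ij . M_i), all bands and tests at once."""
    return sigmoid(np.einsum("ijd,id->ij", theta, M))

def e_step(p, z):
    """Posterior targets q_ij for one labelled window: Eq. (7) if z == 0, Eq. (8) if z == 1."""
    band = p.prod(axis=1, keepdims=True)                   # prod_k p_ik for each band i
    others = band / np.clip(p, 1e-12, None)                # prod over k != j of p_ik
    if z == 0:                                             # Eq. (7)
        return p * (1.0 - others) / np.clip(1.0 - band, 1e-12, None)
    not_any = np.prod(1.0 - band)                          # prod_m (1 - prod_k p_mk)
    rest = not_any / np.clip(1.0 - band, 1e-12, None)      # product over m != i
    return p * (1.0 - (1.0 - others) * rest) / max(1.0 - not_any, 1e-12)   # Eq. (8)

def m_step(theta, windows, targets, lr=0.05, iters=200):
    """Gradient ascent on Eq. (9): refit each logistic test to its posterior targets."""
    for _ in range(iters):
        for M, Q in zip(windows, targets):                 # one labelled window at a time
            P = first_layer(theta, M)
            grad = (Q - P)[..., None] * M[:, None, :]      # gradient of the Eq. (9) terms
            theta = theta + lr * grad
    return theta

def em_train(theta, windows, labels, epochs=5):
    """Alternate E- and M-steps over a stream of windowed, labelled measurements."""
    for _ in range(epochs):
        targets = [e_step(first_layer(theta, M), z) for M, z in zip(windows, labels)]
        theta = m_step(theta, windows, targets)
    return theta
```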
Figure 6 is a block diagram of an exemplary first-layer combiner 410-i according to the present invention. The exemplary first-layer combiner 410-i contains a number of multipliers 610-1, 610-2, . . . 610-j, a summing node 620 and a sigmoid node 630. In operation, various parameters can be presented to each multiplier 610-1, 610-2, . . . 610-j via links 212-i-1, 212-i-2, . . . 212-i-j. The multipliers 610-1, 610-2, . . . 610-j can receive the various parameters, multiply each parameter by a respective weight θ_i1 through θ_ij and export their respective products to the summing node 620 via links 612-1, 612-2, . . . 612-j, respectively. The summing node 620 can accordingly receive the various products from the multipliers 610-1, 610-2, . . . 610-j, add the various products and provide the sum of the products to the sigmoid node 630 via link 422. The sigmoid node 630 can process the sum using a sigmoid transfer function or other similar function.
Once the sigmoid node 630 has processed the sum, the processed sum can be provided to a second-layer combiner (not shown) via link 412-i.
As discussed above, the various first-layer weights can vary, particularly during a training operation. Accordingly, the various multipliers 610-1, 610-2, . . . 610-j can receive various weight estimates via link 512 during each iteration of a training process. Once each multiplier 610-1, 610-2, . . . 610-j receives a particular weight, the multipliers 610-1, 610-2, . . . 610-j can indefinitely retain the weight until further modified.
Figure 7 is a flowchart outlining an exemplary method for processing critical bands of speech according to the present invention. The process starts in step 710 where a first window of speech data is received. Next, in step 720, a number of front-end filtering operations are performed. As discussed above, a set of front-end operations can include dividing the received speech data into a number of critical bands of speech, rectifying and squaring each critical band of speech, filtering and down-sampling each rectified/squared critical band, windowing, measuring various parameters and normalizing the various parameters for each critical band of speech per window to produce a stream of parameter vectors M. However, as discussed above, it should be appreciated that the various front-end filtering operations can vary as desired or otherwise required without departing from the spirit and scope of the present invention.
The operation continues to step 730.
In step 730, a first-layer combining operation is performed according to Eq.
(1) above. While the exemplary first-layer combining operation generally involves passing a sum of weighted parameters through a sigmoid operator, it should be appreciated that, in various exemplary embodiments, the particular form of the first-layer combining operations can vary without departing from the spirit and scope of the present invention.
The operation continues to step 740.
In step 740, a number of second-layer combining operations are performed using the outputs of step 730. As discussed above, the second-layer combining operations can be conjunctive in nature and can be performed according to Eq. (2) above.
Then, in step 750, a number of third-layer combining operations can be performed on the conjunctive outputs of step 740. The third-layer combining operations can be disjunctive in nature and can take the form of Eq. (3) above. While the exemplary second-layer and third-layer operations can be performed using Eqs. (2) and (3) above, it should be appreciated that the exact forms of steps 740 and 750 can vary and can be any combination of processes that can be useful to detect/distinguish between various phonetic features such as sonorants, obstruents, voicing, nasality and the like without departing from the spirit and scope of the present invention. Control continues to step 760.
In step 760, the estimated feature provided by step 750 is provided to an external device such as a computer. Then, in step 770, a determination is made as to whether a training operation is being performed. If a training operation is being performed, control continues to step 780; otherwise, control jumps to step 800.
In step 780, a window of speech training data, including a phonetic label that can indicate whether the present window of speech data contains a particular phonetic feature, is received. Then, in step 790, a set of weights associated with the first-layer combining operation of step 730 is updated. As discussed above, the exemplary technique can use an EM and MLE technique to update the various weights. However, it should be appreciated that the particular techniques used to update the various weights can vary and can be any combination of techniques useful to train various weights such that a particular device can learn to accurately detect/distinguish phonetic features without departing from the spirit and scope of the present invention. The operation continues to step 800.
In step 800, a determination is made as to whether to stop the process. If the process is to stop, control continues to step 810 where the process stops;
otherwise, control jumps back to step 710 where additional speech data is received such that steps 710-790 can be repeated. The operation can then iteratively perform steps 710-790 until the first-layer weights are adequately trained, the available speech data is exhausted, or as otherwise desired.
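Tying the pieces together, a driver corresponding loosely to steps 710-800 of Figure 7 might look like the sketch below; it reuses the illustrative helpers from the earlier sketches (front_end, first_layer, third_layer, e_step, m_step) and is a schematic outline under those assumptions, not the flowchart's literal implementation.

```python
def run(stream, fs, theta, labels=None):
    """Process a speech stream window by window; adapt the weights if labels are supplied."""
    M_all = front_end(stream, fs)                    # steps 710-720: (n_bands, n_windows, 6)
    detected = []
    for t in range(M_all.shape[1]):
        M = M_all[:, t, :]                           # one window of per-band parameter vectors
        P = first_layer(theta, M)                    # step 730: Eq. (1)
        z_hat = third_layer(P)                       # steps 740-750: Eqs. (2)-(3)
        detected.append(z_hat)                       # step 760: export the estimated feature
        if labels is not None:                       # steps 770-790: training mode
            Q = e_step(P, labels[t])
            theta = m_step(theta, [M], [Q], iters=1)
    return detected, theta
```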
It should be appreciated that the various systems and methods of this invention are preferably implemented on a digital signal processor (DSP) or other integrated circuits.
However, the systems and methods can also be implemented using any combination of one or more general purpose computers, special purpose computers, programmed microprocessors or microcontrollers and peripheral integrated circuit elements, hardware electronic or logic circuits, such as application specific integrated circuits (ASICs), discrete element circuits, programmable logic devices such as a PLD, PLA, FPGA or PAL, or the like. In general, any device on which exists a finite state machine capable of implementing the various elements of Figures 1-6 and/or the flowchart of Figure 7 can be used to implement the feature recognizer 120 functions.
While this invention has been described in conjunction with the specific embodiments thereof, it is evident that many alternatives, modifications, and variations will be apparent to those skilled in the art. Accordingly, preferred embodiments of the invention as set forth herein are intended to be illustrative, not limiting.
There are changes that may be made without departing from the spirit and scope of the present invention.

Claims (27)

WHAT IS CLAIMED IS:
1. A method for processing features in a stream of speech data, comprising:
dividing the stream of speech data into a plurality of bands;
determining a plurality of parameters for at least one of the plurality of bands; and combining the parameters based on at least one of a sum-of-products, a sigmoid operator and a Bayesian technique to determine at least one phonetic feature.
2. The method of claim 1, wherein combining the parameters is based on at least one of a sum-of-products and a sigmoid operator.
3. The method of claim 2, wherein combining the parameters is based on a sum-of-products and a sigmoid operator.
4. The method of claim 3, wherein combining the parameters is based on a sum-of-products, a sigmoid operator and a Bayesian technique.
5. The method of claim 4, wherein the Bayesian technique is conjunctive.
6. The method of claim 5, wherein the Bayesian technique is disjunctive.
7. The method of claim 2, wherein combining the parameters includes:
multiplying a first set of parameters of the plurality of parameters by a set of weights to produce a first set of weighted parameters;
forming a first sum of the first set of weighted parameters; and processing the first sum using a sigmoid operator to form a first processed sum.
8. The method of claim 7, wherein combining the parameters further includes:
combining the first processed sum with at least one other processed sum using a conjunctive operator to produce a first conjunctive outcome.
9. The method of claim 8, wherein the conjunctive operator comprises:
Pr[Y_i = 1 | M_i] = ∏_j Pr[X_ij = 1 | M_i],
wherein M_i is a parameter vector, X_ij denotes a processed sum and Y_i denotes the conjunctive outcome.
10. The method of claim 8, wherein combining the parameters further includes:
combining the first conjunctive outcome with at least one other conjunctive outcome using a disjunctive operator to produce a disjunctive outcome.
11. The method of claim 8, wherein the disjunctive operator comprises:
Pr[Z = 1 | M] = 1 − ∏_i (1 − Pr[Y_i = 1 | M_i]),
wherein M_i is a parameter vector, M = {M_1, M_2, ...} denotes a set of parameter vectors, Y_i denotes a conjunctive outcome and Z is a binary random variable.
12. The method of claim 1, further comprising updating at least one weight of the set of weights.
13. The method of claim 12, wherein the step of updating is based on at least one of an expectation-maximization (EM) technique and a maximum likelihood estimation (MLE) technique.
14. The method of claim 1, wherein at least one parameter of the plurality of parameters includes at least one of a signal-to-noise ratio estimate of a particular band and an autocovariance statistic.
15. The method of claim 1, further including performing a non-linear operation on at least one band of the plurality of bands.
16. The method of claim 15, wherein the non-linear operation is a squaring operation.
17. The method of claim 15, wherein the non-linear operation is a rectification operation.
18. A device for processing phonetic features, comprising:
a front-end that receives a stream of speech data, divides the stream of speech data into a plurality of bands of speech data, segments each of the plurality of bands of speech data into a stream of windows and determines a plurality of parameters for each window; and a back-end that can combine the parameters based on at least one of a sum-of-products, a sigmoid operator and a Bayesian technique to determine at least one phonetic feature.
19. The device of claim 18, wherein the back-end includes a first-layer combiner that weights a first plurality of parameters using a first set of weights, adds the weighted parameters and processes the weighted parameters to produce a first outcome.
20. The device of claim 19, wherein processing the weighted parameters includes transforming the weighted parameters based on a sigmoid operator.
21. The device of claim 20, wherein the back-end further includes a second-layer combiner that combines the first outcome with at least a second outcome using a conjunctive operator to produce a first conjunctive outcome.
22. The device of claim 21, wherein the back-end further includes a third-layer combiner that combines the first conjunctive outcome with at least a second conjunctive outcome using a disjunctive operator to produce a first disjunctive outcome.
23. The device of claim 22, further comprising a training device that updates the first set of weights based on at least the first disjunctive outcome.
24. The device of claim 23, wherein the training device updates the first set of weights further based on at least an expectation-maximization (EM) technique and a maximum likelihood estimation (MLE) technique.
25. The device of claim 20, wherein the front-end includes a non-linear device.
26. The device of claim 25, wherein the non-linear device substantially squares at least one band of speech data.
27. The device of claim 25, wherein the non-linear device substantially rectifies at least one band of speech data.
CA002387091A 1999-10-28 2000-10-27 Method and system for detection of phonetic features Abandoned CA2387091A1 (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US16195599P 1999-10-28 1999-10-28
US60/161,955 1999-10-28
PCT/US2000/041649 WO2001031628A2 (en) 1999-10-28 2000-10-27 Neural networks for detection of phonetic features

Publications (1)

Publication Number Publication Date
CA2387091A1 true CA2387091A1 (en) 2001-05-03

Family

ID=22583531

Family Applications (1)

Application Number Title Priority Date Filing Date
CA002387091A Abandoned CA2387091A1 (en) 1999-10-28 2000-10-27 Method and system for detection of phonetic features

Country Status (4)

Country Link
EP (1) EP1232495A2 (en)
CA (1) CA2387091A1 (en)
TW (1) TW480473B (en)
WO (1) WO2001031628A2 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8588427B2 (en) 2007-09-26 2013-11-19 Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. Apparatus and method for extracting an ambient signal in an apparatus and method for obtaining weighting coefficients for extracting an ambient signal and computer program

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100714721B1 (en) * 2005-02-04 2007-05-04 삼성전자주식회사 Method and apparatus for detecting voice region
US8639510B1 (en) 2007-12-24 2014-01-28 Kai Yu Acoustic scoring unit implemented on a single FPGA or ASIC
US8352265B1 (en) 2007-12-24 2013-01-08 Edward Lin Hardware implemented backend search engine for a high-rate speech recognition system
US8463610B1 (en) 2008-01-18 2013-06-11 Patrick J. Bourke Hardware-implemented scalable modular engine for low-power speech recognition

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH02195400A (en) * 1989-01-24 1990-08-01 Canon Inc Speech recognition device

Also Published As

Publication number Publication date
TW480473B (en) 2002-03-21
EP1232495A2 (en) 2002-08-21
WO2001031628A3 (en) 2001-12-06
WO2001031628A2 (en) 2001-05-03

Similar Documents

Publication Publication Date Title
EP1083541B1 (en) A method and apparatus for speech detection
Jin et al. A supervised learning approach to monaural segregation of reverberant speech
EP1688921B1 (en) Speech enhancement apparatus and method
EP2431972B1 (en) Method and apparatus for multi-sensory speech enhancement
AU649029B2 (en) Method for spectral estimation to improve noise robustness for speech recognition
Wang et al. LSTM-convolutional-BLSTM encoder-decoder network for minimum mean-square error approach to speech enhancement
EP1569422A2 (en) Method and apparatus for multi-sensory speech enhancement on a mobile device
US5594834A (en) Method and system for recognizing a boundary between sounds in continuous speech
US20050259558A1 (en) Noise reduction using correction vectors based on dynamic aspects of speech and noise normalization
WO2008058842A1 (en) Voice activity detection system and method
AU684214B2 (en) System for recognizing spoken sounds from continuous speech and method of using same
CN110120230B (en) Acoustic event detection method and device
Le Roux et al. Computational auditory induction as a missing-data model-fitting problem with Bregman divergence
Ge et al. Explaining deep learning models for spoofing and deepfake detection with SHapley Additive exPlanations
CN112949708A (en) Emotion recognition method and device, computer equipment and storage medium
Ludena-Choez et al. Acoustic Event Classification using spectral band selection and Non-Negative Matrix Factorization-based features
Sharma et al. Study of robust feature extraction techniques for speech recognition system
Saleem et al. Multi-scale decomposition based supervised single channel deep speech enhancement
CN114861835B (en) Noise hearing loss prediction system based on asymmetric convolution
CN112489623A (en) Language identification model training method, language identification method and related equipment
Fan et al. Real-time single-channel speech enhancement based on causal attention mechanism
CN112466284B (en) Mask voice identification method
CA2387091A1 (en) Method and system for detection of phonetic features
Czyżewski et al. Neuro-rough control of masking thresholds for audio signal enhancement
EP3847646B1 (en) An audio processing apparatus and method for audio scene classification

Legal Events

Date Code Title Description
EEER Examination request
FZDE Dead