EP1232495A2 - Method and system for detecting phonetic features - Google Patents

Method and system for detecting phonetic features

Info

Publication number
EP1232495A2
EP1232495A2 (application EP00991894A)
Authority
EP
European Patent Office
Prior art keywords
parameters
outcome
speech data
conjunctive
operator
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP00991894A
Other languages
German (de)
English (en)
Inventor
Jonathan B. Allen
Mazin G. Rahim
Lawrence K. Saul
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
AT&T Corp
Original Assignee
AT&T Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by AT&T Corp filed Critical AT&T Corp
Publication of EP1232495A2 publication Critical patent/EP1232495A2/fr

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/16Speech classification or search using artificial neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025Phonemes, fenemes or fenones being the recognition units
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band

Definitions

  • This invention relates to speech recognition systems that detect phonetic features.
  • ASR: automated speech recognition
  • ASR systems rarely perform with the accuracy of human listeners.
  • ASR accuracy can theoretically rise to that of a human listener.
  • conventional ASR systems have failed to capitalize on the human model. Accordingly, these conventional ASR systems do not maintain the robust capacity to recognize speech in poor listening conditions as do humans.
  • methods and systems are provided for phonetic feature recognition based on critical bands of speech.
  • techniques for detecting phonetic features in a stream of speech data are provided by first dividing the stream of speech data into a plurality of critical bands, segmenting the critical bands into streams of consecutive windows and determining various parameters for each window per critical band. The various parameters can then be combined using various operators in a multi-layered network. A first layer of the multi-layered network can process the various parameters by weighting the parameters, forming a sum for each critical band and processing the sums using sigmoid operators. The processed sums can then be further combined using a hierarchy of conjunctive and disjunctive operators to produce a stream of detected features.
  • a training technique is tailored to the multi-layered network by iteratively detecting features, comparing the detected features to a stream of predetermined feature labels and updating various internal weights using various approaches such as an expectation-maximization technique and a maximum likelihood estimation technique.
  • Figure 1 is a block diagram of an exemplary feature recognition system
  • Figure 2 is a block diagram of the feature recognizer of Figure 1;
  • Figure 3 is a block diagram of an exemplary front-end of the feature recognizer of Figure 2;
  • Figure 4 is a block diagram of the exemplary back-end of the feature recognizer of Figure 2;
  • Figure 5 is a block diagram of a portion of the back-end of Figure 4 with various training circuits to enable learning;
  • Figure 6 is a block diagram of an exemplary first-layer combiner according to the present invention.
  • Figure 7 is a flowchart outlining an exemplary method for recognizing and training on various phonetic features.
  • a speech recognition system can provide a powerful tool to automate various transactions such as buying or selling over a telephone line, automatic dictation machines and various other transactions where information must be acquired from a speaker.
  • speech recognition systems can produce unacceptable error rates in the presence of background noise or when human speech has been filtered by various electronic systems such as telephones, recording mechanisms and the like.
  • One organic model of interest is based on a hypothesis that different parts of the frequency spectrum can be independently analyzed in the early stages of speech recognition.
  • an ASR system can be developed that can reliably detect various phonetic features (also known as distinctive features) even in the presence of extreme background noise.
  • a sonorant (“[+sonorant]”) can be one of a group of auditory features recognizable as vowels, nasals and approximants, and can be characterized by periodic vibrations of the vocal cords. That is, much of the energy of a sonorant can be found in a particular narrow frequency range and its harmonics.
  • An obstruent (“[-sonorant]”), on the other hand, can include phonetic features such as stops, fricatives and affricates, and can be characterized as speech sounds having an obstructed air stream. Table 1 below demonstrates the breakout between sonorants [+sonorant] and obstruents [-sonorant].
  • while phonetic features such as [+sonorants] can be described as having periodicity, the particular frequency ranges of periodicity will be established by a speaker's pitch.
  • some of those critical bands can not only detect [+sonorant] vocal energies, but those critical bands containing [+sonorants] can also display an improved signal-to-noise ratio (SNR) as compared to processing [+sonorants] against broad-band noise.
  • Figure 1 is an exemplary block diagram of a speech recognition system 100.
  • the system 100 includes a feature recognizer 120 that is connected to a data source 110 through an input link 112 and to a data sink 130 through an output link 122.
  • the exemplary feature recognizer 120 can receive speech data from the data source 110 and detect various phonetic features in the speech data.
  • the exemplary feature recognizer 120 can detect phonetic features by dividing the spectrum of the speech data into a number of critical bands, processing each critical band to produce various cues and advantageously combining the various cues to produce a stream of phonetic features that can be passed to the data sink 130 via the output link 122.
  • the data source 110 can provide the feature recognizer 120 with either physical speech or speech data in any format that can represent physical speech, including binary data, ASCII data, Fourier data, wavelet data, data contained in a word processing file, and the like.
  • the data source 110 can be any one of a number of different types of data sources, such as a person, a computer, a storage device, or any known or later developed combination of hardware and software capable of generating, relaying or recalling from storage a message or any other information capable of representing physical speech.
  • the data sink 130 can be any device capable of receiving phonetic feature data, such as a digital computer, a communications network element, or any combination of hardware or software capable of receiving, relaying, storing, sensing or perceiving data or information representing phonetic features.
  • the links 112 and 122 can be any known or later developed device or system for connecting the data source 110 or the data sink 130 to the feature recognizer 120. Such devices include a direct serial/parallel cable connection, a connection over a wide area network or a local area network, a connection over an intranet, a connection over the Internet, or any connection over any other distributed processing network or system. Additionally, the input link 112 or the output link 122 can be any software devices linking various software systems. In general, the links 112 and 122 can be any known or later developed connection system, computer program, or any structure usable to connect the data source 110 or the data sink 130 to the feature recognizer 120.
  • Figure 2 is an exemplary feature recognizer 120 according to the present invention.
  • the feature recognizer 120 includes a front-end 210 coupled to a back-end 220.
  • the front-end 210 receives a stream of training data via link 112, including a stream of speech data with a respective stream of feature labels that can indicate whether a particular window of speech data contains a particular phonetic feature.
  • a first segment in a stream of speech data can contain the phoneme /h/ ("hay") and have a respective feature label of [+sonorant]
  • a second segment that contains the phoneme /d/ (“ladder”) can have a [-sonorant] respective feature label
  • a third segment can contain random noise directed to neither a [+sonorant] nor [-sonorant] feature.
  • while the exemplary feature recognizer 120 can distinguish between sonorants [+sonorant] and obstruents [-sonorant], it should be appreciated that the various features distinguished and/or recognized by the feature recognizer 120 can vary without departing from the spirit and scope of the present invention.
  • the feature recognizer 120 can detect/distinguish other phonetic features such as voicing, nasality and the like.
  • Other phonetic features can include at least any one-bit phonetic feature described or otherwise referenced in Miller, G., and Nicely, P., "An analysis of perceptual confusions among some English consonants", J. Acoust. Soc. Am., vol. 27, pp. 338-352, 1955.
  • the front-end 210 can perform a first set of processes on the speech data to produce a stream of processed speech data.
  • the front-end 210 can then pass the stream of processed speech data, along with the stream of respective feature labels, to the back-end 220 using link 212.
  • the back-end 220 can adapt various internal weights (not shown) such that the feature recognizer 120 can effectively learn to distinguish various features.
  • the exemplary back-end 220 uses an expectation-maximization (EM) technique along with an iterative maximum likelihood estimation (MLE) technique to learn to distinguish various features.
  • the back-end 220 can receive the processed speech data and respective feature labels and train its internal weights (not shown) until the back-end 220 can effectively distinguish between various phonetic features of interest. Once the feature recognizer 120 is trained, the feature recognizer 120 can then operate according to a second mode of operation.
  • the front-end 210 can receive a stream of speech data, process the stream of speech data as in the first mode of operation and provide a stream of processed speech data to the back-end 220.
  • the back-end 220 can accordingly receive the stream of processed speech data, advantageously combine different cues provided by the stream of processed speech data using the trained internal weights to detect/distinguish between various phonetic features and provide a stream of the detected features to link 122.
  • Figure 3 is an exemplary front-end 210 according to the present invention.
  • the front-end 210 contains a filter bank 310, a nonlinear device 312 such as a rectifying/squaring device, a low pass filter (LPF)/down-sampling device 314, a windowing/parameter measuring device 316 and a thresholding device 318.
  • the filter bank 310 can first receive a stream of speech data via link 112. The filter bank 310 can then transform the stream of speech data into an array of narrow-bands of frequencies, i.e., critical bands, using a number of band pass filters (not shown) incorporated into the filter bank 310.
  • the exemplary filter bank 310 can divide speech data into twenty-four separate bands having center frequencies between two-hundred twenty-five (225) Hz and three-thousand six-hundred twenty-five (3625) Hz with bandwidths ranging from one-half to one-third octave.
  • the filter bank 310 can divide speech data into any number of critical bands having various center frequencies and bandwidths without departing from the spirit and scope of the present invention.
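  • By way of illustration, a set of twenty-four center frequencies spanning the range described above can be generated as in the following Python sketch; the logarithmic spacing is an assumption for illustration, since the exact spacing is not reproduced here:

      import numpy as np

      # Hypothetical center-frequency grid: 24 critical bands between 225 Hz and
      # 3625 Hz; the logarithmic spacing is assumed, not taken from the patent.
      centers_hz = np.geomspace(225.0, 3625.0, num=24)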
  • the nonlinear device 312 receives the streams of narrow-band speech data, rectifies, i.e., removes the negative components of the narrow-band speech data, squares the streams of rectified speech data and then provides the streams of rectified/squared speech data to the LPF/down-sampling device 314.
  • the LPF/down-sampling device 314 receives the streams of rectified/squared speech data, removes the high frequency components from the streams of rectified/squared speech data to smooth the speech data, digitizes the streams of smooth speech data and provides the streams of digitized speech data to the windowing/parameter measuring device 316.
  • the windowing/parameter measuring device 316 receives the streams of digitized speech data and divides each stream of digitized speech data into a stream of sixteen millisecond (16 ms) contiguous non-overlapping windows. While the exemplary windowing/parameter measuring device 316 divides speech into contiguous non-overlapping windows of sixteen milliseconds, it should be appreciated that, in various exemplary embodiments, the size of the windows can vary as desired or otherwise required by design without departing from the spirit and scope of the present invention.
  • the various windows can be either non-overlapping or overlapping as desired, determined advantageous, or otherwise required by design.
  • the windowing/parameter measuring device 316 can determine a number of statistical parameters associated with each window.
  • the exemplary windowing/parameter measuring device 316 determines six (6) statistics per window per critical band: the first two parameters being running estimates of the signal-to-noise ratio of a particular critical band, and the remaining four parameters being autocovariance statistics. While the exemplary windowing/parameter measuring device 316 measures six parameters relating to signal-to-noise ratios and autocovariance statistics, it should be appreciated that, in various exemplary embodiments, the windowing/parameter measuring device 316 can determine various other qualities and/or determine a different number of parameters without departing from the spirit and scope of the present invention.
  • the windowing/parameter measuring device 316 can first determine the six parameters above for a particular window, then determine the first and second derivatives of the parameters using the previous and subsequent windows. Once the various parameters have been determined, the windowing/parameter measuring device 316 can provide the various parameters to the thresholding device 318.
  • the thresholding device 318 can receive the various parameters and normalize the values of the parameters. That is, the various parameters can be scaled according to a number of predetermined threshold values that can be derived, for example, from identically processed bands of white noise.
  • the normalized channel parameters can be exported via links 212-1, 212-2, . . . 212-n.
  • the exemplary front-end 210 can divide a stream of speech data into twenty-four channels. Accordingly, given that the front-end 210 produces six parameters per channel, the exemplary thresholding device 318 can produce a total of one-hundred forty-four (144) parameters per sixteen millisecond window of speech at a rate of about sixty windows per second. A minimal sketch of this front-end processing is given below.
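  • A minimal Python sketch of the front-end of Figure 3 (filter bank 310, nonlinear device 312, LPF/down-sampling device 314, windowing/parameter measuring device 316) is shown below. The filter designs, the SNR estimators, the autocovariance lags and the helper name critical_band_front_end are assumptions for illustration only; thresholding (device 318) would follow.

      import numpy as np
      from scipy.signal import butter, lfilter

      def critical_band_front_end(speech, fs, centers_hz, window_ms=16.0):
          params_per_band = []
          for fc in centers_hz:
              # Band-pass filter of roughly one-third octave around fc (assumed design)
              low, high = fc * 2 ** (-1 / 6), fc * 2 ** (1 / 6)
              b, a = butter(4, [low / (fs / 2), high / (fs / 2)], btype="band")
              band = lfilter(b, a, speech)
              # Nonlinear device 312: half-wave rectify, then square
              squared = np.maximum(band, 0.0) ** 2
              # LPF/down-sampling device 314: smooth the squared signal
              b_lp, a_lp = butter(2, 50.0 / (fs / 2))
              smooth = lfilter(b_lp, a_lp, squared)
              # Windowing device 316: contiguous, non-overlapping 16 ms windows
              win = int(fs * window_ms / 1000.0)
              n_win = len(smooth) // win
              frames = smooth[: n_win * win].reshape(n_win, win)
              # Six illustrative statistics per window: two SNR-like estimates and four
              # autocovariance statistics (stand-ins for the patent's measurements)
              noise_floor = np.percentile(frames.mean(axis=1), 10) + 1e-12
              feats = []
              for f in frames:
                  centered = f - f.mean()
                  acov = [np.dot(centered[:-k], centered[k:]) / len(f) for k in range(1, 5)]
                  feats.append([f.mean() / noise_floor, f.max() / noise_floor, *acov])
              params_per_band.append(np.asarray(feats))
          # Shape (n_bands, n_windows, 6); normalization by noise thresholds would follow
          return np.stack(params_per_band)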
  • Figure 4 is a block diagram of an exemplary back-end 220 according to the present invention.
  • the exemplary back-end 220 includes a first number of first-layer combiners 410-1, 410-2, . . . 410-n, a second number of second-layer combiners 420-1, 420-2, . . . 420-m and a third-layer combiner 430.
  • the first-layer combiners 410-1, 410-2, . . . 410-n each receive streams of parameters associated with various critical bands of speech via links 212-1, 212-2, . . . 212-n.
  • the exemplary parameters in the streams of parameters can be sets of six measurements relating to signal-to-noise ratios and autocovariance statistics for contiguous, non-overlapping windows of speech data.
  • the number, type and nature of the parameters can vary as desired or otherwise required by design without departing from the spirit and scope of the present invention.
  • each first-layer combiner 410-1, 410-2, . . . 410-n can perform a first combining operation according to Eq. (1):
  • M_i is a set of measurements, i.e., a parameter vector, related to an i'th window of speech data
  • Θ_ij is a set of weights {θ_1, θ_2, . . . θ_j} associated with M_i
  • the weights Θ_ij can be estimated by various training techniques.
  • the weights Θ_ij can alternatively be derived by any method that can provide weights useful for detecting/distinguishing between various phonetic features without departing from the spirit and scope of the present invention.
  • each first-layer combiner 410-1, 410-2, . . . 410-n can multiply each parameter in M_i by its respective weight in Θ_ij, add the respective products and process the product-sum using a sigmoid function, as sketched below. After each set of weights is processed, the output of each first-layer combiner 410-1, 410-2, . . . 410-n can be provided to the second-layer combiners 420-1, 420-2, . . . 420-m.
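  • Based on the description above (each parameter in M_i weighted by Θ_ij, summed, and passed through a sigmoid), Eq. (1) plausibly takes the form of a logistic regression whose output lies between zero and one; the following is a reconstruction from the surrounding text, not a reproduction of the published equation:

      Pr[X_ij = 1 | M_i] = σ(Θ_ij · M_i),   where σ(z) = 1 / (1 + e^(-z))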
  • each second-layer combiner 420-1, 420-2, . . . 420-m can receive the outputs from three first-layer combiners 410-1, 410-2, . . . 410-n.
  • each second-layer combiner 420-1, 420-2, . . . 420-m can receive any number of first-layer combiner outputs.
  • any first-layer combiner 410-1, 410-2, . . . 410-n can provide its output to more than one second-layer combiner 420-1, 420-2, . . . 420-m without departing from the spirit and scope of the present invention.
  • each second-layer combiner 420-1, 420-2, . . . 420-m can perform a second combining operation on its received data according to Eq. (2):
  • Pr[X_ij | M_i] is the conditional probability distribution of a first-layer combiner outcome given M_i, where X_ij denotes the first-layer combiner outcome of the j'th test in a particular critical band i.
  • Eq. (2) suggests that the output of each second-layer combiner 420-1, 420-2, . . . 420-m can also vary from zero to one, and that the effect of Eq. (2) is to effectively perform a conjunction. That is, Eq. (2) can form an ANDing operation.
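  • Consistent with the ANDing behavior just described, Eq. (2) plausibly takes the form of a product over the first-layer outcomes within a critical band, so that the second-layer output approaches one only when every first-layer test in the band does; again this is a reconstruction, not the published equation:

      Pr[Y_i = 1 | M_i] = Π_j Pr[X_ij = 1 | M_i]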
  • each second-layer combiner 420-1, 420-2, . . . 420-m performs its second combining operation
  • the output of each second-layer combiner 420-1, 420-2, . . . 420-m can be provided to the third-layer combiner 430.
  • the third-layer combiner 430 receives the outputs from each second-layer combiner 420-1, 420-2, . . . 420-m and performs a third combining operation on the second-layer outputs according to Eq. (3):
  • Pr[Z | M] denotes the conditional probability that the phonetic feature of interest is present given the full set of parameter measurements M.
  • the effect of the third-layer combiner 430 is to effect a disjunction of the various second-level outputs Y_i. That is, the third-layer combiner 430 effectively performs an OR operation, as sketched below. For example, if any one of the second-layer combiner outputs is one, the output of the third-layer combiner 430 will also be one regardless of the output values of the other second-layer combiners.
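  • Consistent with the ORing behavior just described, Eq. (3) plausibly takes a noisy-OR form over the second-layer outputs, equaling one whenever any single second-layer output equals one; again this is a reconstruction, not the published equation:

      Pr[Z = 1 | M] = 1 - Π_i (1 - Pr[Y_i = 1 | M_i])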
  • the back-end 220 can determine Pr[Z | M] for each window of speech data.
  • posterior probabilities such as Pr[X_ij | Y_i, M_i] (the conditional probability distribution of a particular first-layer output X_ij given the parameter measurements M_i and the output of a respective second-layer combiner) and Pr[Y_i | Z, M_i] can also be determined.
  • Equation (5) suggests that when a [+sonorant] feature has been detected in one or more critical bands, one should increase the probability that a [+sonorant] feature was detected in any particular first-level combiner.
  • the exemplary back-end 220 can be modified/extended without departing from the spirit and scope of the present invention.
  • the measurement vector M can consist of sets of six measurements relating to SNR and autocovariance statistics.
  • the sets of measurements can be extended by including first- and second-order time derivatives of the SNR and autocovariance statistics.
  • a second extension can be obtained by feeding parameter measurements from consecutive windows, as opposed to the same window, to the logistic regressions under each second-layer combiner.
  • the exemplary back-end 220 has been described in reference to detecting [+/-sonorants], as discussed above, the back-end 220 can alternatively be used to detect other phonetic features such as voicing, nasality or any one-bit phonetic feature without departing from the spirit and scope of the present invention.
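  • The layered combination described above can be summarized in a short Python sketch; the grouping of tests within bands, the array shapes and the helper name detect_feature are assumptions for illustration, and the weights theta would come from training.

      import numpy as np

      def detect_feature(params, theta):
          # params, theta: shape (n_bands, n_tests, n_measurements) for one window of speech
          # First layer (Eq. (1) sketch): logistic regression per test per critical band
          x = 1.0 / (1.0 + np.exp(-np.sum(theta * params, axis=-1)))  # (n_bands, n_tests)
          # Second layer (Eq. (2) sketch): conjunction (AND) of the tests within each band
          y = np.prod(x, axis=-1)                                     # (n_bands,)
          # Third layer (Eq. (3) sketch): disjunction (OR) across the critical bands
          return 1.0 - np.prod(1.0 - y)                               # estimate of Pr[Z = 1 | M]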
  • Figure 5 is a block diagram of a portion of the exemplary back-end 220 of Figure 4 used in conjunction with a set of training circuits 510 that can enable the back-end 220 to learn to distinguish various phonetic features.
  • the exemplary training circuits 510 can receive data from the third-layer combiner output, which can consist of a stream of predicted phonetic features.
  • the training circuits 510 can further receive the stream of processed speech data, along with a respective stream of phonetic labels from the data source 110 indicating whether a particular window of speech data actually contains a phonetic feature of interest.
  • the training circuits 510 can iteratively train the various weights in the first-layer combiners 410-1, 410-2, . . . 410-n.
  • the training circuits 510 can estimate the various weights using an EM technique combined with a MLE technique.
  • the exemplary EM technique consists of two alternating steps, an E-step and an M-step.
  • the exemplary E-step can include computing the posterior probabilities Pr[X_ij | Z, M] conditioned on the labels provided by the data source 110.
  • the M-step can include updating the various parameters Θ_ij in each logistic regression, using the posterior probabilities as target values.
  • the exemplary training data can be derived from wideband speech data and can, in various embodiments, be optionally contaminated with various noise sources, filtered or otherwise distorted.
  • the exemplary training data can also contain a stream of [+/-sonorant] labels having phonetic alignment.
  • a set of acoustic measurements, M', and a target label, z' ∈ {0,1}, indicating whether or not each window contains a [+sonorant] feature, can be associated with each window.
  • the EM process consists of two alternating steps, an E-step and an M-step.
  • the E-step in this model can compute the posterior probabilities of hidden variables, conditioned on the labels provided by the phonetic alignment.
  • the calculations here are different for [-sonorant] and [+sonorant] windows of speech.
  • the posterior probabilities for [-sonorant] windows can be calculated according to Eq. (7):
  • the posterior probabilities of Eqs. (7) and (8) can then be derived by applying Bayes rule to the left hand sides of Eqs. (7) and (8), marginalizing the hidden variable Y_i, and making repeated use of Eqs. (4) and (5).
  • the M-step of the EM process can update the parameters in each logistic regression to provide updated parameter estimates Θ'_ij.
  • let Pr'[X_ij | Z' = z', M'] denote the posterior probabilities computed by Eqs. (7) and (8), using the updated estimates Θ'_ij in place of the current estimates Θ_ij, and let Pr'[X_ij | M'] denote the prior probabilities computed from Eq. (1), using the updated parameter estimates Θ'_ij.
  • the M-step can then include replacing Θ_ij by Θ'_ij, where Θ'_ij can be derived by Eq. (9):
  • Eq. (9) can define a convex function of Θ'_ij
  • the maximization of Eq. (9) can be performed either by Newton's method or by a gradient ascent technique.
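  • A minimal sketch of such an M-step update by gradient ascent is shown below; it refits one first-layer logistic regression to the E-step posteriors used as target values, as described above. The objective, the learning rate and the helper name m_step_update are assumptions and do not reproduce Eq. (9) itself.

      import numpy as np

      def m_step_update(theta_ij, M, q, lr=0.1, n_iters=50):
          # theta_ij: (d,) current weights for one test in one critical band
          # M:        (T, d) parameter vectors for T training windows
          # q:        (T,) posterior targets Pr[X_ij = 1 | Z, M] from the E-step
          theta = theta_ij.copy()
          for _ in range(n_iters):
              p = 1.0 / (1.0 + np.exp(-M @ theta))    # current predictions (Eq. (1) sketch)
              theta += lr * (M.T @ (q - p)) / len(q)  # ascend the expected log-likelihood
          return theta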
  • the training circuits 510 can provide the new estimates Θ'_ij to their respective first-layer combiners 410-1 through 410-n. Accordingly, the first-layer combiners 410-1 through 410-n can incorporate the new estimates Θ'_ij and the next window of speech data can be similarly processed until the entire stream of training data has been processed, the back-end 220 displays adequate performance or as otherwise desired.
  • FIG. 6 is a block diagram of an exemplary first-layer combiner 410-i according to the present invention.
  • the exemplary first-layer combiner 410-i contains a number of multipliers 610-1, 610-2, . . . 610-j, a summing node 620 and a sigmoid node 630.
  • various parameters can be presented to each multiplier 610-1, 610-2, . . . 610-j via links 212-i-l, 212-i-2, . . . 212-i-j.
  • the summing node 620 can accordingly receive the various products from the multipliers 610-1, 610-2, . . . 610-j, add the various products and provide the sum of the products to the sigmoid node 630 via link 422.
  • the sigmoid node 630 can process the sum using a sigmoid transfer function or other similar function. Once the sigmoid node 630 has processed the sum, the processed sum can be provided to a second-layer combiner (not shown) via link 412-i.
  • the various first-layer weights can vary, particularly during a training operation. Accordingly, the various multipliers 610-1, 610-2, . . . 610-j can receive various weight estimates via link 512 during each iteration of a training process. Once each multiplier 610-1, 610-2, . . . 610-j receives a particular weight, the multipliers 610-1, 610-2, . . . 610-j can indefinitely retain the weight until further modified.
  • Figure 7 is a flowchart outlining an exemplary method for processing critical bands of speech according to the present invention. The process starts in step 710 where a first window of speech data is received. Next, in step 720, a number of front-end filtering operations are performed.
  • a set of front-end operations can include dividing the received speech data into a number of critical bands of speech, rectifying and squaring each critical band of speech, filtering and down-sampling each rectified/squared critical band, windowing, measuring various parameters and normalizing the various parameters for each critical band of speech per window to produce a stream of parameter vectors M.
  • the various front-end filtering operations can vary as desired or otherwise required without departing from the spirit and scope of the present invention. The operation continues to step 730.
  • a first-layer combining operation is performed according to Eq. (1) above. While the exemplary first-layer combining operation generally involves passing a sum of weighted parameters through a sigmoid operator, it should be appreciated that, in various exemplary embodiments, the particular form of first-layer combining operations can vary without departing from the spirit and scope of the present invention. The operation continues to step 740.
  • step 740 a number of second-layer combining operations are performed using the outputs of step 730.
  • the second-layer combining operations can be conjunctive in nature and can be performed according to Eq. (2) above.
  • step 750 a number of third-layer combining operations can be performed on the conjunctive outputs of step 740.
  • the third-layer combining operations can be disjunctive in nature and can take the form of Eq. (3) above. While the exemplary second-layer and third-layer operations can be performed using Eqs. (2) and (3), the operations of steps 740 and 750 can vary and can be any combination of processes that can be useful to detect/distinguish between various phonetic features such as sonorants, obstruents, voicing, nasality and the like without departing from the spirit and scope of the present invention. Control continues to step 760.
  • in step 760, the estimated feature from step 750 is provided to an external device such as a computer. Then, in step 770, a determination is made as to whether a training operation is being performed. If a training operation is being performed, control continues to step 780; otherwise, control jumps to step 800.
  • a window of speech training data including a phonetic label that can indicate whether the present window of speech data contains a particular phonetic feature is received.
  • a set of weights associated with the first-layer combining operation of step 730 is updated.
  • the exemplary technique can use an EM and MLE technique to update the various weights.
  • the particular techniques used to update the various weights can vary and can be any combination of techniques useful to train various weights such that a particular device can learn to accurately detect/distinguish phonetic features without departing from the spirit and scope of the present invention.
  • the operation continues to step 800.
  • in step 800, a determination is made as to whether to stop the process. If the process is to stop, control continues to step 810 where the process stops; otherwise, control jumps back to step 710 where additional speech data is received such that steps 710-790 can be repeated. The operation can then iteratively perform steps 710-790 until the first-layer weights are adequately trained, the available speech data is exhausted, or as otherwise desired.
  • the systems and methods of this invention are preferably implemented on a digital signal processor (DSP) or other integrated circuits.
  • the systems and methods can also be implemented using any combination of one or more general purpose computers, special purpose computers, programmed microprocessors or microcontrollers and peripheral integrated circuit elements, hardwired electronic or logic circuits such as application specific integrated circuits (ASICs), discrete element circuits, programmable logic devices such as a PLD, PLA, FPGA, or PAL, or the like.
  • any device on which exists a finite state machine capable of implementing the various elements of Figures 1-6 and/or the flowchart of Figure 7 can be used to implement the feature recognizer 120 functions.

Landscapes

  • Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

In various embodiments, the invention provides techniques for detecting phonetic features in a stream of speech data. These techniques first divide the stream of speech data into a plurality of critical bands, segment the critical bands into streams of consecutive windows, and determine various parameters for each window per critical band. The various parameters can then be combined using several operators in a multi-layered network comprising a first layer capable of processing these parameters by applying a sigmoid operator to a weighted sum of parameters. The processed sums can then in turn be combined using a hierarchy of conjunctive and disjunctive operators, producing a stream of detected features.
EP00991894A 1999-10-28 2000-10-27 Method and system for detecting phonetic features Withdrawn EP1232495A2 (fr)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US16195599P 1999-10-28 1999-10-28
US161955P 1999-10-28
PCT/US2000/041649 WO2001031628A2 (fr) 1999-10-28 2000-10-27 Method and system for detecting phonetic features

Publications (1)

Publication Number Publication Date
EP1232495A2 true EP1232495A2 (fr) 2002-08-21

Family

ID=22583531

Family Applications (1)

Application Number Title Priority Date Filing Date
EP00991894A Withdrawn EP1232495A2 (fr) 1999-10-28 2000-10-27 Procede et systeme de detection de caracteristiques phonetiques

Country Status (4)

Country Link
EP (1) EP1232495A2 (fr)
CA (1) CA2387091A1 (fr)
TW (1) TW480473B (fr)
WO (1) WO2001031628A2 (fr)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100714721B1 (ko) * 2005-02-04 2007-05-04 Samsung Electronics Co., Ltd. Method and apparatus for detecting speech segments
CN101816191B (zh) * 2007-09-26 2014-09-17 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Apparatus and method for extracting an ambient signal
US8639510B1 (en) 2007-12-24 2014-01-28 Kai Yu Acoustic scoring unit implemented on a single FPGA or ASIC
US8352265B1 (en) 2007-12-24 2013-01-08 Edward Lin Hardware implemented backend search engine for a high-rate speech recognition system
US8463610B1 (en) 2008-01-18 2013-06-11 Patrick J. Bourke Hardware-implemented scalable modular engine for low-power speech recognition

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH02195400A (ja) * 1989-01-24 1990-08-01 Canon Inc Speech recognition device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See references of WO0131628A2 *

Also Published As

Publication number Publication date
WO2001031628A2 (fr) 2001-05-03
CA2387091A1 (fr) 2001-05-03
WO2001031628A3 (fr) 2001-12-06
TW480473B (en) 2002-03-21

Similar Documents

Publication Publication Date Title
US10878823B2 (en) Voiceprint recognition method, device, terminal apparatus and storage medium
US10504539B2 (en) Voice activity detection systems and methods
EP1688921B1 (fr) Speech enhancement apparatus and method
EP1083541B1 (fr) Method and apparatus for speech detection
EP1536414B1 (fr) Multi-sensory method and device for speech enhancement
EP2695160B1 (fr) Speech syllable/vowel/phone boundary detection using auditory attention cues
EP2089877B1 (fr) Voice activity detection system and method
US7181390B2 (en) Noise reduction using correction vectors based on dynamic aspects of speech and noise normalization
EP0470245B1 (fr) Spectral estimation method for reducing noise sensitivity in speech recognition
EP2363852B1 (fr) Computer-based method and system for assessing speech intelligibility
US8831942B1 (en) System and method for pitch based gender identification with suspicious speaker detection
US20080281591A1 (en) Method of pattern recognition using noise reduction uncertainty
Czyzewski et al. Intelligent processing of stuttered speech
Mulimani et al. Segmentation and characterization of acoustic event spectrograms using singular value decomposition
JP3298858B2 (ja) Partition-based similarity method for a low-complexity speech recognizer
Wiem et al. Unsupervised single channel speech separation based on optimized subspace separation
Rahman et al. Dynamic time warping assisted svm classifier for bangla speech recognition
Jaiswal Performance analysis of voice activity detector in presence of non-stationary noise
Baelde et al. A mixture model-based real-time audio sources classification method
EP1232495A2 (fr) Method and system for detecting phonetic features
Bendiksen et al. Neural networks for voiced/unvoiced speech classification
Sunija et al. Comparative study of different classifiers for Malayalam dialect recognition system
Wiem et al. Phase‐aware subspace decomposition for single channel speech separation
Ravuri et al. Using spectro-temporal features to improve AFE feature extraction for ASR.
JPH01255000 (ja) Apparatus and method for selectively adding noise to templates used in a speech recognition system

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 20020513

AK Designated contracting states

Kind code of ref document: A2

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE

RBV Designated contracting states (corrected)

Designated state(s): AT BE DE GB

17Q First examination report despatched

Effective date: 20061212

RIC1 Information provided on ipc code assigned before grant

Ipc: G10L 15/16 20060101AFI20090714BHEP

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20100316