EP0694906A1 - Method and system for speech recognition - Google Patents

Method and system for speech recognition Download PDF

Info

Publication number
EP0694906A1
EP0694906A1 (Application EP95111784A)
Authority
EP
European Patent Office
Prior art keywords
speech
noise
feature vector
frame
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
EP95111784A
Other languages
German (de)
French (fr)
Other versions
EP0694906B1 (en)
Inventor
Alejandro Acero
Xuedong Huang
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Microsoft Corp
Original Assignee
Microsoft Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Microsoft Corp filed Critical Microsoft Corp
Publication of EP0694906A1 publication Critical patent/EP0694906A1/en
Application granted granted Critical
Publication of EP0694906B1 publication Critical patent/EP0694906B1/en
Anticipated expiration legal-status Critical
Expired - Lifetime legal-status Critical Current


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/08: Speech classification or search
    • G10L 15/10: Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G10L 15/12: Speech classification or search using dynamic programming techniques, e.g. dynamic time warping [DTW]



Abstract

A method and system for improving speech recognition through front-end normalization of feature vectors are provided. Speech to be recognized is spoken into a microphone, amplified by an amplifier, and converted from an analog signal to a digital signal by an analog-to-digital ("A/D") converter. The digital signal from the A/D converter is input to a feature extractor that breaks down the signal into frames of speech and then extracts a feature vector from each of the frames. The feature vector is input to an input normalizer that normalizes the vector. The input normalizer normalizes the feature vector by computing a correction vector and subtracting the correction vector from the feature vector. The correction vector is computed based on the probability of the current frame of speech being noise and based on the average noise and speech feature vectors for a current utterance and a database of utterances. The normalization of the feature vector reduces the effect of changes in the acoustical environment on the feature vector. The normalized feature vector is input to a pattern matcher that compares the normalized vector to feature models stored in the database to find an exact match or a best match.

Description

    Field of the Invention
  • This invention relates generally to speech recognition and, more particularly, to a method and system for improving speech recognition through front-end normalization of feature vectors.
  • Background of the Invention
  • A variety of speech recognition systems have been developed. These systems enable computers to understand speech. This ability is useful for inputting commands or data into computers. Speech recognition generally involves two phases. The first phase is known as training. During training, the system "learns" speech by inputting a large sample of speech and generating models of the speech. The second phase is known as recognition. During recognition, the system attempts to recognize input speech by comparing the speech to the models generated during training and finding an exact match or a best match. Most speech recognition systems have a front-end that extracts some features from the input speech in the form of feature vectors. These feature vectors are used to generate the models during training and are compared to the generated models during recognition.
  • One problem with such speech recognition systems arises when there are changes in the acoustical environment during and between training and recognition. Such changes could result, for example, from changes in the microphone used, the background noise, the distance between the speaker's mouth and the microphone, and the room acoustics. If changes occur, the system may not work very well because the acoustical environment affects the feature vectors extracted from speech. Thus, different feature vectors may be extracted from the same speech if spoken in different acoustical environments. Since the acoustical environment will rarely remain constant, it is desirable for a speech recognition system to be robust to changes in the acoustical environment. A particular word or sentence should always be recognized as that word or sentence, regardless of the acoustical environment in which the word or sentence is spoken. Some attempts to solve the problem of changes in the acoustical environment have focused on normalizing the input speech feature vectors to reduce the effect of such changes.
  • One attempt to solve this problem is known as mean normalization. Using mean normalization, the input speech feature vector is normalized by computing the mean of all the feature vectors extracted from the entire speech and subtracting the mean from the input speech feature vector using the function:

$$\hat{x}(t) = x(t) - \frac{1}{n}\sum_{\tau=1}^{n} x(\tau)$$

    where $\hat{x}(t)$ is the normalized input speech feature vector, $x(t)$ is the raw input speech feature vector, and $n$ is the number of feature vectors extracted from the entire speech.
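  • For illustration, a minimal sketch of mean normalization (hypothetical helper name, NumPy assumed; not taken from the patent):

```python
import numpy as np

def mean_normalize(features):
    """Mean normalization: subtract the mean of all feature vectors
    extracted from the entire speech from each raw feature vector."""
    # features: (n, q) array, one q-dimensional feature vector per frame.
    # A single mean over the whole recording is what limits the accuracy
    # of this approach, as discussed below.
    return features - features.mean(axis=0)
```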
  • Another attempt to solve this problem is known as signal-to-noise-ratio-dependent ("SNR-dependent") normalization. Using SNR-dependent normalization, the input speech feature vector is normalized by computing the instantaneous SNR of the input speech and subtracting a correction vector that depends on the SNR from the input speech feature vector using the function:

$$\hat{x}(t) = x(t) - y(\mathrm{SNR})$$

    where $\hat{x}(t)$ is the normalized input speech feature vector, $x(t)$ is the raw input speech feature vector, and $y(\mathrm{SNR})$ is the correction vector. The correction vectors are precomputed and stored in a look-up table with the corresponding SNRs.
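  • A minimal sketch of the look-up-table scheme (the nearest-neighbour SNR match and all names are illustrative assumptions):

```python
import numpy as np

def snr_normalize(x, snr, snr_table, correction_table):
    """SNR-dependent normalization: subtract the precomputed correction
    vector whose stored SNR is closest to the measured instantaneous SNR."""
    # snr_table: (m,) SNR values; correction_table: (m, q) vectors y(SNR).
    # The table is fixed after it is precomputed, which is the weakness
    # noted below: the corrections are never dynamically updated.
    i = int(np.argmin(np.abs(snr_table - snr)))
    return x - correction_table[i]
```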
  • None of the prior attempts to solve the problem of changes in the acoustical environment during and between training and recognition have been very successful. Mean normalization allows the input speech feature vectors to be dynamically adjusted, but it is not very accurate because it computes only a single mean for all of the feature vectors extracted from the entire speech. SNR-dependent normalization is more accurate than mean normalization because it computes varying correction vectors depending on the SNR of the input speech, but it does not dynamically update the values of the correction vectors. Therefore, a solution is needed that is both accurate and dynamically updates the values used to normalize the input speech feature vectors.
  • Summary of the Invention
  • One aspect of the present invention provides a method and system for improving speech recognition through front-end normalization of feature vectors. In a speech recognition system of the present invention, speech to be recognized is spoken into a microphone, amplified by an amplifier, and converted from an analog signal to a digital signal by an analog-to-digital ("A/D") converter. The digital signal from the A/D converter is input to a feature extractor that breaks down the signal into frames of speech and then extracts a feature vector from each of the frames. The feature vector is input to an input normalizer that normalizes the vector. The normalized feature vector is input to a pattern matcher that compares the normalized vector to feature models stored in a database to find an exact match or a best match.
  • The input normalizer of the present invention normalizes the feature vector by computing a correction vector and subtracting the correction vector from the feature vector. The correction vector is computed based on the probability of the current frame of speech being noise and based on the average noise and speech feature vectors for a current utterance and the database of utterances. The normalization of feature vectors reduces the effect of changes in the acoustical environment on the feature vectors. By reducing the effect of changes in the acoustical environment on the feature vectors, the input normalizer of the present invention improves the accuracy of the speech recognition system.
  • Brief Description of the Drawings
    • Figure 1 is a block diagram illustrating a speech recognition system incorporating the principles of the present invention;
    • Figure 2 is a high level flow chart illustrating the steps performed by an input normalizer of the system of Figure 1; and
    • Figures 3A and 3B collectively are a high level flow chart illustrating the steps performed in the normalization of feature vectors in the system of Figure 1.
    Detailed Description of the Preferred Embodiment
  • The preferred embodiment of the present invention provides a method and system for improving speech recognition through front-end normalization of feature vectors. The normalization of feature vectors reduces the effect of changes in the acoustical environment on the feature vectors. Such changes could result, for example, from changes in the microphone used, the background noise, the distance between the speaker's mouth and the microphone, and the room acoustics. Without normalization, the effect of changes in the acoustical environment on the feature vectors could cause the same speech to be recognized as different speech. This could occur because the acoustical environment affects the feature vectors extracted from speech. Thus, different feature vectors may be extracted from the same speech if spoken in different acoustical environments. By reducing the effect of changes in the acoustical environment on the feature vectors, the input normalizer of the present invention improves the accuracy of the speech recognition system.
  • Figure 1 illustrates a speech recognition system 10 incorporating the principles of the present invention. In this system, speech to be recognized is spoken into a microphone 12, amplified by an amplifier 14, and converted from an analog signal to a digital signal by an analog-to-digital ("A/D") converter 16. The microphone 12, amplifier 14, and A/D converter 16 are conventional components and are well-known in the art. The digital signal from the A/D converter 16 is input to a computer system 18. More specifically, the digital signal is input to a feature extractor 20 that extracts certain features from the signal in the form of feature vectors. Speech is composed of utterances. An utterance is the spoken realization of a sentence and typically represents 1 to 10 seconds of speech. Each utterance is broken down into evenly-spaced time intervals called frames. A frame typically represents 10 milliseconds of speech. A feature vector is extracted from each frame of speech. That is, the feature extractor 20 breaks down the digital signal from the A/D converter 16 into frames of speech and then extracts a feature vector from each of the frames. In the preferred embodiment of the present invention, the feature vector extracted from each frame of speech comprises cepstral vectors. Cepstral vectors, and the methods used to extract cepstral vectors from speech, are well-known in the art.
  • The feature vector is then input to an input normalizer 22 that normalizes the vector. The normalization of the feature vector reduces the effect of changes in the acoustical environment on the feature vector. The normalized feature vector is then input to a pattern matcher 24 that compares the normalized vector to feature models stored in a database 26 to find an exact match or a best match. The feature models stored in the database 26 were generated from known speech. If there is an acceptable match, the known speech corresponding to the matching feature model is output. Otherwise, a message indicating that the speech could not be recognized is output. Typical pattern matchers are based on networks trained by statistical methods, such as hidden Markov models or neural networks. However, other pattern matchers may be used. Such pattern matchers are well-known in the art.
  • The steps performed by the input normalizer 22 are shown in Figure 2. The input normalizer 22 receives the feature vector $x_j$ for the current frame $j$, where $j$ is an index (step 210). In the preferred embodiment of the present invention, the feature vector comprises cepstral vectors. A cepstral vector is a set of coefficients derived from the energy in different frequency bands by taking the Discrete Cosine Transform ("DCT") of the logarithm of such energies. In the preferred embodiment, the feature vector comprises a static cepstral vector augmented with its first and second order derivatives with time, the delta cepstral vector and the delta-delta cepstral vector, respectively. Each cepstral vector comprises a set of thirteen cepstral coefficients. However, one of ordinary skill in the art will appreciate that cepstral vectors having a different number of cepstral coefficients may be used. Additionally, one of ordinary skill in the art will appreciate that other forms of feature vectors may be used.
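  • The patent does not spell out how the derivatives are estimated; the sketch below uses a simple symmetric difference over a span of ±2 frames, a common but assumed choice:

```python
import numpy as np

def add_deltas(static, k=2):
    """Augment (n_frames, 13) static cepstral vectors with delta and
    delta-delta vectors, giving (n_frames, 39) feature vectors."""
    padded = np.pad(static, ((k, k), (0, 0)), mode="edge")
    delta = (padded[2 * k:] - padded[:-2 * k]) / (2.0 * k)      # first derivative
    padded2 = np.pad(delta, ((k, k), (0, 0)), mode="edge")
    delta2 = (padded2[2 * k:] - padded2[:-2 * k]) / (2.0 * k)   # second derivative
    return np.concatenate([static, delta, delta2], axis=1)
```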
  • Next, the input normalizer 22 computes a correction vector $r(x_j)$, or $r_j$, using the function (step 212):

$$r(x_j) = p_j\,(n_{j-1} - n_{avg}) + (1 - p_j)\,(s_{j-1} - s_{avg})$$

    where $p_j$ is the a posteriori probability of the current frame $j$ being noise, $n_{j-1}$ and $s_{j-1}$ are the average noise and speech feature vectors for the current utterance, and $n_{avg}$ and $s_{avg}$ are the average noise and speech feature vectors for the database of utterances 26. The computation of these values will be discussed below. Lastly, the input normalizer 22 computes a normalized feature vector $\hat{x}_j$ using the function (step 214):

$$\hat{x}_j = x_j - r(x_j)$$

    While the feature vector comprises the three cepstral vectors discussed above, in the preferred embodiment of the present invention only the static cepstral vector is normalized; the delta cepstral vector and the delta-delta cepstral vector are not normalized.
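  • A sketch of steps 212 and 214, assuming the utterance and database statistics cover only the thirteen static coefficients (all names are illustrative):

```python
import numpy as np

def normalize_frame(x, p_noise, n_prev, s_prev, n_avg, s_avg, n_static=13):
    """Compute r(x_j) and subtract it from the static part of x_j."""
    # n_prev, s_prev: per-utterance noise/speech means (static coefficients);
    # n_avg, s_avg: database-wide noise/speech means, same shape.
    r = p_noise * (n_prev - n_avg) + (1.0 - p_noise) * (s_prev - s_avg)
    x_hat = x.copy()
    # Only the static cepstral vector is normalized; the delta and
    # delta-delta parts pass through unchanged.
    x_hat[:n_static] -= r
    return x_hat
```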
  • The computation of the correction vector $r(x_j)$ is simplified based on certain assumptions and estimations. First, assume that noise and speech follow a Gaussian distribution. Based on this assumption, the a posteriori probability of the current frame $j$ being noise, $p_j$, is computed using the function:

$$p_j = \frac{\xi\, N(x_j, n_{j-1}, \Sigma_{n(j-1)})}{\xi\, N(x_j, n_{j-1}, \Sigma_{n(j-1)}) + (1-\xi)\, N(x_j, s_{j-1}, \Sigma_{s(j-1)})}$$

    where $\xi$ is the a priori probability of the current frame $j$ being noise, $N(x_j, n_{j-1}, \Sigma_{n(j-1)})$ and $N(x_j, s_{j-1}, \Sigma_{s(j-1)})$ are the Gaussian probability density functions ("pdf's") for noise and speech, respectively, and $\Sigma_{n(j-1)}$ and $\Sigma_{s(j-1)}$ are the covariance matrices for noise and speech, respectively. The Gaussian pdf's for noise and speech are represented using the standard function for Gaussian pdf's:

$$N(x_j, n_{j-1}, \Sigma_{n(j-1)}) = \frac{1}{(2\pi)^{q/2}\,\lvert\Sigma_{n(j-1)}\rvert^{1/2}}\, \exp\!\left(-\tfrac{1}{2}\,(x_j - n_{j-1})^T\, \Sigma_{n(j-1)}^{-1}\,(x_j - n_{j-1})\right)$$

    and

$$N(x_j, s_{j-1}, \Sigma_{s(j-1)}) = \frac{1}{(2\pi)^{q/2}\,\lvert\Sigma_{s(j-1)}\rvert^{1/2}}\, \exp\!\left(-\tfrac{1}{2}\,(x_j - s_{j-1})^T\, \Sigma_{s(j-1)}^{-1}\,(x_j - s_{j-1})\right)$$

    where $q$ is the dimension of $x_j$, $\exp$ is the exponential function, and $T$ represents the transpose function.
  • Then, the a posteriori probability of the current frame $j$ being noise, $p_j$, is represented by the sigmoid function:

$$p_j = \frac{1}{1 + e^{d(x_j)}}$$

    where

$$d(x_j) = \log\frac{(1-\xi)\, N(x_j, s_{j-1}, \Sigma_{s(j-1)})}{\xi\, N(x_j, n_{j-1}, \Sigma_{n(j-1)})}$$

    where $d(x_j)$, or $d_j$, is the distortion. The distortion is an indication of whether a signal is noise or speech. If the distortion is largely negative, the signal is noise; if the distortion is largely positive, the signal is speech; if the distortion is zero, the signal may be noise or speech.
  • Second, assume that the components of $x_j$ are independent of one another. Based on this assumption, $\Sigma_n$ and $\Sigma_s$ are modelled using diagonal covariance matrices $\sigma_n$ and $\sigma_s$, respectively. Thus, $d(x_j)$ is represented using the function:

$$d(x_j) = \log\frac{1-\xi}{\xi} + \sum_{l=0}^{q-1}\left[\frac{(x_j[l]-n_{j-1}[l])^2}{2\sigma^2_{n(j-1)}[l]} - \frac{(x_j[l]-s_{j-1}[l])^2}{2\sigma^2_{s(j-1)}[l]} + \frac{1}{2}\log\frac{\sigma^2_{n(j-1)}[l]}{\sigma^2_{s(j-1)}[l]}\right]$$

    where $q$ is the dimension of $\sigma_n$ and $\sigma_s$. Further, the most important factor in discriminating noise from speech is the power term ($l = 0$). Thus, $d(x_j)$ is approximated using the function:

$$d(x_j) \approx \log\frac{1-\xi}{\xi} + \frac{(x_j[0]-n_{j-1}[0])^2}{2\sigma^2_{n(j-1)}[0]} - \frac{(x_j[0]-s_{j-1}[0])^2}{2\sigma^2_{s(j-1)}[0]} + \frac{1}{2}\log\frac{\sigma^2_{n(j-1)}[0]}{\sigma^2_{s(j-1)}[0]}$$
     Next, the values of $n$, $s$, $\sigma_n$, $\sigma_s$, and $\xi$ are estimated using a modified version of the well-known Estimate-Maximize ("EM") algorithm. The EM algorithm is discussed in A.P. Dempster, N.M. Laird, and D.B. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm," Journal of the Royal Statistical Society, Series B, 39(1):1-38, 1977. The EM algorithm generates maximum likelihood estimates of the values by refining previous estimates based on new values. This algorithm uses a window function over which the estimates are refined. The window function defines the interval of time over which past estimates are used to refine the current estimates. The standard EM algorithm uses a rectangular window function. A rectangular window function gives equal weight to the data over the entire window. The modified version of the EM algorithm used in the preferred embodiment of the present invention uses an exponential window function. An exponential window function gives more weight to recent data in the window. Thus, the values of $n$, $s$, $\sigma_n$, $\sigma_s$, and $\xi$ are estimated using the functions:

$$n_j = \frac{\sum_{k=0}^{j} w_k\, p_{j-k}\, x_{j-k}}{\sum_{k=0}^{j} w_k\, p_{j-k}} \qquad s_j = \frac{\sum_{k=0}^{j} w_k\,(1-p_{j-k})\, x_{j-k}}{\sum_{k=0}^{j} w_k\,(1-p_{j-k})}$$

$$\sigma^2_{n(j)} = \frac{\sum_{k=0}^{j} w_k\, p_{j-k}\, x^2_{j-k}}{\sum_{k=0}^{j} w_k\, p_{j-k}} - n_j^2 \qquad \sigma^2_{s(j)} = \frac{\sum_{k=0}^{j} w_k\,(1-p_{j-k})\, x^2_{j-k}}{\sum_{k=0}^{j} w_k\,(1-p_{j-k})} - s_j^2 \qquad \xi_j = \frac{\sum_{k=0}^{j} w_k\, p_{j-k}}{\sum_{k=0}^{j} w_k}$$

    where $w_k$ is the exponential window function.
  • The exponential window function $w_k$ is represented by:

$$w_k = \alpha^k$$

    where $\alpha$ is a parameter that controls the rate of adaptation. The rate of adaptation determines how much weight is given to past data relative to the current data. The smaller $\alpha$ is, the less weight that is given to past data relative to the current data; the larger $\alpha$ is, the more weight that is given to past data relative to the current data. The value of $\alpha$ is computed using the function:

$$\alpha = e^{-1/(T F_s)}$$

    where $T$ is a time constant and $F_s$ is the sampling frequency of the A/D converter 16. In the preferred embodiment of the present invention, separate $\alpha$'s are used for noise and for speech. The use of separate $\alpha$'s allows noise and speech to be adapted at different rates. In the preferred embodiment in which separate $\alpha$'s are used, a smaller $\alpha$ is used for noise than for speech. Thus, the functions used to estimate the values of $n$, $s$, $\sigma_n$, $\sigma_s$, and $\xi$ are reduced to:

$$n_j = \frac{a_n^{(j)}}{c_n^{(j)}} \qquad \sigma^2_{n(j)} = \frac{b_n^{(j)}}{c_n^{(j)}} - n_j^2 \qquad s_j = \frac{a_s^{(j)}}{c_s^{(j)}} \qquad \sigma^2_{s(j)} = \frac{b_s^{(j)}}{c_s^{(j)}} - s_j^2 \qquad \xi_j = (1-\alpha_n)\, c_n^{(j)}$$

    where

$$a_n^{(j)} = p_j\, x_j + \alpha_n\, a_n^{(j-1)} \qquad b_n^{(j)} = p_j\, x_j^2 + \alpha_n\, b_n^{(j-1)} \qquad c_n^{(j)} = p_j + \alpha_n\, c_n^{(j-1)}$$

$$a_s^{(j)} = (1-p_j)\, x_j + \alpha_s\, a_s^{(j-1)} \qquad b_s^{(j)} = (1-p_j)\, x_j^2 + \alpha_s\, b_s^{(j-1)} \qquad c_s^{(j)} = (1-p_j) + \alpha_s\, c_s^{(j-1)}$$

    where $\alpha_n$ and $\alpha_s$ are the parameters that control the rate of adaptation for noise and speech, respectively. The computation of initial values for $n$, $s$, $\sigma_n$, $\sigma_s$, $\xi$, $a_n$, $b_n$, $c_n$, $a_s$, $b_s$, and $c_s$ will be discussed below.
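  • These recursions can be carried by six running accumulators per stream. A minimal sketch (per-coefficient NumPy arrays for a and b, scalar c's; the accumulator initialization is derived from the relations above, not quoted from the patent):

```python
import numpy as np

class AdaptiveStats:
    """Exponential-window estimates of the noise/speech means, variances,
    and the a priori noise probability, via the recursions above."""

    def __init__(self, n0, s0, var_n0, var_s0, xi0, alpha_n, alpha_s):
        self.alpha_n, self.alpha_s = alpha_n, alpha_s
        # Initialize the accumulators so the first estimates reproduce
        # the initial values (derived from the update relations).
        self.c_n = xi0 / (1.0 - alpha_n)
        self.c_s = (1.0 - xi0) / (1.0 - alpha_s)
        self.a_n, self.b_n = n0 * self.c_n, (var_n0 + n0 ** 2) * self.c_n
        self.a_s, self.b_s = s0 * self.c_s, (var_s0 + s0 ** 2) * self.c_s

    def update(self, x, p):
        """Fold in frame x with noise probability p; return
        (n_j, s_j, var_n_j, var_s_j, xi_j)."""
        self.a_n = p * x + self.alpha_n * self.a_n
        self.b_n = p * x ** 2 + self.alpha_n * self.b_n
        self.c_n = p + self.alpha_n * self.c_n
        self.a_s = (1.0 - p) * x + self.alpha_s * self.a_s
        self.b_s = (1.0 - p) * x ** 2 + self.alpha_s * self.b_s
        self.c_s = (1.0 - p) + self.alpha_s * self.c_s
        n, s = self.a_n / self.c_n, self.a_s / self.c_s
        var_n = self.b_n / self.c_n - n ** 2
        var_s = self.b_s / self.c_s - s ** 2
        xi = (1.0 - self.alpha_n) * self.c_n
        return n, s, var_n, var_s, xi
```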
  • The steps performed in the normalization of a feature vector are shown in Figures 3A and 3B. First, values for $\alpha_n$ and $\alpha_s$ are selected (step 310). The values for $\alpha_n$ and $\alpha_s$ are selected based on the desired rate of adaptation (as discussed above). Additionally, the value of $j$ is set equal to zero (step 312) and initial values for $n$, $s$, $\sigma_n$, $\sigma_s$, and $\xi$ are estimated (step 314). The initial values for $n$, $s$, $\sigma_n$, $\sigma_s$, and $\xi$ are estimated from the database of utterances 26 using standard EM techniques:

$$n_0 = n_{avg} \qquad s_0 = s_{avg} \qquad \sigma^2_{n(0)} = \sigma^2_{n(avg)} \qquad \sigma^2_{s(0)} = \sigma^2_{s(avg)} \qquad \xi_0 = \xi_{avg}$$

    and the accumulators are initialized consistently with the update relations:

$$c_n^{(0)} = \frac{\xi_0}{1-\alpha_n} \qquad a_n^{(0)} = n_0\, c_n^{(0)} \qquad b_n^{(0)} = \left(\sigma^2_{n(0)} + n_0^2\right) c_n^{(0)}$$

$$c_s^{(0)} = \frac{1-\xi_0}{1-\alpha_s} \qquad a_s^{(0)} = s_0\, c_s^{(0)} \qquad b_s^{(0)} = \left(\sigma^2_{s(0)} + s_0^2\right) c_s^{(0)}$$
     Then, the feature vector $x_j$ for the current frame $j$ is received (step 316). The distortion $d_j$ is computed using the function (step 318):

$$d_j = \log\frac{1-\xi_{j-1}}{\xi_{j-1}} + \frac{(x_j[0]-n_{j-1}[0])^2}{2\sigma^2_{n(j-1)}[0]} - \frac{(x_j[0]-s_{j-1}[0])^2}{2\sigma^2_{s(j-1)}[0]} + \frac{1}{2}\log\frac{\sigma^2_{n(j-1)}[0]}{\sigma^2_{s(j-1)}[0]}$$

    The a posteriori probability of the current frame $j$ being noise, $p_j$, is computed using the function (step 320):

$$p_j = \frac{1}{1 + e^{d_j}}$$

    The correction vector $r_j$ is computed using the function (step 322):

$$r_j[l] = p_j\,(n_{j-1}[l] - n_{avg}[l]) + (1-p_j)\,(s_{j-1}[l] - s_{avg}[l]) \qquad \text{for } l = 0, 1, \ldots, m$$

    The normalized feature vector $\hat{x}_j$ is computed using the function (step 324):

$$\hat{x}_j[l] = x_j[l] - r_j[l] \qquad \text{for } l = 0, 1, \ldots, m$$

    The values of $n$, $s$, $\sigma_n$, $\sigma_s$, and $\xi$ are updated using the functions (step 326):

$$n_j[l] = \frac{a_n^{(j)}[l]}{c_n^{(j)}} \qquad \sigma^2_{n(j)}[l] = \frac{b_n^{(j)}[l]}{c_n^{(j)}} - n_j[l]^2 \qquad s_j[l] = \frac{a_s^{(j)}[l]}{c_s^{(j)}} \qquad \sigma^2_{s(j)}[l] = \frac{b_s^{(j)}[l]}{c_s^{(j)}} - s_j[l]^2 \qquad \xi_j = (1-\alpha_n)\, c_n^{(j)}$$

    for $l = 0, 1, \ldots, m$, where

$$a_n^{(j)}[l] = p_j\, x_j[l] + \alpha_n\, a_n^{(j-1)}[l] \qquad b_n^{(j)}[l] = p_j\, x_j[l]^2 + \alpha_n\, b_n^{(j-1)}[l] \qquad c_n^{(j)} = p_j + \alpha_n\, c_n^{(j-1)}$$

$$a_s^{(j)}[l] = (1-p_j)\, x_j[l] + \alpha_s\, a_s^{(j-1)}[l] \qquad b_s^{(j)}[l] = (1-p_j)\, x_j[l]^2 + \alpha_s\, b_s^{(j-1)}[l] \qquad c_s^{(j)} = (1-p_j) + \alpha_s\, c_s^{(j-1)}$$
     Lastly, the input normalizer 22 determines whether frame $j$ is the last frame in the current utterance (step 328). If frame $j$ is not the last frame in the current utterance, $j$ is incremented (step 330) and steps 316 through 326 are repeated for the next frame. If frame $j$ is the last frame in the current utterance, the input normalizer 22 determines whether the current utterance is the last utterance (step 332). If the current utterance is not the last utterance, $j$ is reset to zero (step 334), the values of $n$, $s$, $\sigma_n$, $\sigma_s$, and $\xi$ are reset to the estimated initial values (step 336), and steps 316 through 326 are repeated for each frame in the next utterance. If the current utterance is the last utterance, the input normalizer 22 returns.
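  • Pulling the steps of Figures 3A and 3B together, a self-contained sketch of the per-utterance loop, restricted to static cepstra; parameter values, shapes, and names are illustrative assumptions, not the patent's:

```python
import numpy as np

def normalize_utterance(frames, n_avg, s_avg, var_n0, var_s0, xi0,
                        alpha_n=0.95, alpha_s=0.995):
    """Normalize one utterance of (n_frames, q) float static cepstra.
    n_avg, s_avg, var_n0, var_s0: (q,) database statistics; xi0: scalar."""
    n, s = n_avg.copy(), s_avg.copy()
    var_n, var_s, xi = var_n0.copy(), var_s0.copy(), xi0
    c_n = xi0 / (1.0 - alpha_n)                 # accumulator initialization
    c_s = (1.0 - xi0) / (1.0 - alpha_s)         # consistent with the updates
    a_n, b_n = n * c_n, (var_n + n ** 2) * c_n
    a_s, b_s = s * c_s, (var_s + s ** 2) * c_s
    out = np.empty_like(frames)
    for j, x in enumerate(frames):
        # Steps 318-320: power-only distortion and noise probability.
        d = (np.log((1.0 - xi) / xi)
             + (x[0] - n[0]) ** 2 / (2.0 * var_n[0])
             - (x[0] - s[0]) ** 2 / (2.0 * var_s[0])
             + 0.5 * np.log(var_n[0] / var_s[0]))
        p = 1.0 / (1.0 + np.exp(np.clip(d, -50.0, 50.0)))
        # Steps 322-324: correction vector and normalized vector.
        r = p * (n - n_avg) + (1.0 - p) * (s - s_avg)
        out[j] = x - r
        # Step 326: update the exponential-window statistics.
        a_n = p * x + alpha_n * a_n
        b_n = p * x ** 2 + alpha_n * b_n
        c_n = p + alpha_n * c_n
        a_s = (1.0 - p) * x + alpha_s * a_s
        b_s = (1.0 - p) * x ** 2 + alpha_s * b_s
        c_s = (1.0 - p) + alpha_s * c_s
        n, var_n = a_n / c_n, b_n / c_n - (a_n / c_n) ** 2
        s, var_s = a_s / c_s, b_s / c_s - (a_s / c_s) ** 2
        xi = (1.0 - alpha_n) * c_n
    return out
```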
  • In order to reduce the computational complexity of the input normalizer 22 of the present invention, one of ordinary skill in the art will appreciate that several modifications could be made to the input normalizer. First, the last term (the logarithm of the variance ratio) could be eliminated from the function used to compute the distortion $d_j$ (step 318). This term does not significantly affect the value of the distortion $d_j$, but it is expensive to compute because it involves a logarithm. Additionally, the a posteriori probability of the current frame $j$ being noise, $p_j$, could be computed using a look-up table, as sketched below. This table would contain the possible values for the distortion $d_j$ and the corresponding values for the a posteriori probability $p_j$. Lastly, the values of $n$, $s$, $\sigma_n$, and $\sigma_s$ could be updated every few frames, instead of every frame, and the value of $\xi$ could be kept at its initial value and not updated at all. Each of these modifications will improve the efficiency of the input normalizer 22 without significantly affecting the accuracy of the input normalizer.
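  • For instance, the sigmoid could be tabulated once; the table range and resolution below are assumptions:

```python
import numpy as np

# Precompute p_j = 1 / (1 + e^d) over a grid of distortion values.
D_MIN, D_MAX, D_STEP = -10.0, 10.0, 0.01
_TABLE = 1.0 / (1.0 + np.exp(np.arange(D_MIN, D_MAX, D_STEP)))

def noise_posterior_lut(d):
    """Look up p_j for distortion d, clamping out-of-range distortions."""
    d = min(max(d, D_MIN), D_MAX - D_STEP)
    return _TABLE[int((d - D_MIN) / D_STEP)]
```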
  • While the normalization of feature vectors has been described only in connection with recognition, the preferred embodiment of the present invention involves the normalization of feature vectors during training as well. Specifically, each utterance in the database 26 is normalized according to the principles of the present invention and then the system is retrained using the database of normalized utterances. The database of normalized utterances is then used during recognition as described above.
  • One of ordinary skill in the art will now appreciate that the present invention provides a method and system for improving speech recognition through front-end normalization of feature vectors. Although the present invention has been shown and described with reference to a preferred embodiment, equivalent alterations and modifications will occur to those skilled in the art upon reading and understanding this specification. The present invention includes all such equivalent alterations and modifications and is limited only by the scope of the following claims.

Claims (21)

  1. A method for improving speech recognition through front-end normalization of feature vectors, speech comprising utterances, each utterance comprising frames of speech, each frame of speech being represented by a feature vector, the method comprising the steps of:
       providing a database of known utterances, the database of utterances having an average noise feature vector and an average speech feature vector;
       receiving a feature vector representing a frame of speech in an utterance to be recognized, the frame of speech having a probability of being noise, the utterance having an average noise feature vector and an average speech feature vector;
       computing a correction vector based on the probability of the frame of speech being noise and based on the average noise and speech feature vectors for the utterance and the database of utterances; and
       computing a normalized feature vector based on the feature vector and the correction vector.
  2. The method of claim 1, wherein the step of receiving a feature vector comprises the step of receiving a cepstral vector.
  3. The method of claim 1, wherein the probability of the frame of speech being noise and the average noise and speech feature vectors for the utterance are updated for each frame of speech.
  4. The method of claim 1, wherein the step of computing a correction vector includes the steps of:
       computing the probability of the frame of speech being noise based on a distortion measure of the frame of speech;
       computing the average noise and speech feature vectors for the utterance;
       computing the average noise and speech feature vectors for the database of utterances; and
       computing the correction vector based on the probability of the frame of speech being noise and the differences between the average noise and speech feature vectors for the utterance and the database of utterances.
  5. A method for improving speech recognition through front-end normalization of feature vectors, speech comprising utterances, each utterance comprising frames of speech, each frame of speech being represented by a feature vector, the method comprising the steps of:
       providing a database of known utterances, the database of utterances having an average noise feature vector and an average speech feature vector;
       receiving a feature vector $x_j$ representing a frame of speech $j$ in an utterance to be recognized, the frame of speech having an a posteriori probability of being noise, the utterance having an average noise feature vector and an average speech feature vector;
       computing a correction vector $r(x_j)$ as:

$$r(x_j) = p_j\,(n_{j-1} - n_{avg}) + (1 - p_j)\,(s_{j-1} - s_{avg})$$

    wherein $p_j$ is the a posteriori probability of the frame of speech $j$ being noise, $n_{j-1}$ and $s_{j-1}$ are the average noise and speech feature vectors for the utterance, and $n_{avg}$ and $s_{avg}$ are the average noise and speech feature vectors for the database of utterances; and
       computing a normalized feature vector $\hat{x}_j$ as:

$$\hat{x}_j = x_j - r(x_j)$$
  6. The method of claim 5, wherein the step of receiving a feature vector includes the step of receiving a cepstral vector.
  7. The method of claim 5, wherein the a posteriori probability of the frame of speech being noise and the average noise and speech feature vectors for the utterance are updated for each frame of speech.
  8. The method of claim 5, wherein the a posteriori probability of the frame of speech $j$ being noise $p_j$ is computed as:

$$p_j = \frac{\xi\, N(x_j, n_{j-1}, \Sigma_{n(j-1)})}{\xi\, N(x_j, n_{j-1}, \Sigma_{n(j-1)}) + (1-\xi)\, N(x_j, s_{j-1}, \Sigma_{s(j-1)})}$$

    wherein $\xi$ is an a priori probability of the frame of speech $j$ being noise, $N(x_j, n_{j-1}, \Sigma_{n(j-1)})$ and $N(x_j, s_{j-1}, \Sigma_{s(j-1)})$ are Gaussian probability density functions for noise and speech, respectively, and $\Sigma_{n(j-1)}$ and $\Sigma_{s(j-1)}$ are covariance matrices for noise and speech, respectively.
  9. The method of claim 8, wherein the Gaussian probability density functions for noise and speech, $N(x_j, n_{j-1}, \Sigma_{n(j-1)})$ and $N(x_j, s_{j-1}, \Sigma_{s(j-1)})$, are computed as:

$$N(x_j, n_{j-1}, \Sigma_{n(j-1)}) = \frac{1}{(2\pi)^{q/2}\,\lvert\Sigma_{n(j-1)}\rvert^{1/2}}\, \exp\!\left(-\tfrac{1}{2}\,(x_j - n_{j-1})^T\, \Sigma_{n(j-1)}^{-1}\,(x_j - n_{j-1})\right)$$

    and

$$N(x_j, s_{j-1}, \Sigma_{s(j-1)}) = \frac{1}{(2\pi)^{q/2}\,\lvert\Sigma_{s(j-1)}\rvert^{1/2}}\, \exp\!\left(-\tfrac{1}{2}\,(x_j - s_{j-1})^T\, \Sigma_{s(j-1)}^{-1}\,(x_j - s_{j-1})\right)$$

    wherein $q$ is a dimension of $x_j$, $\exp$ is an exponential function, and $T$ represents a transpose function.
  10. The method of claim 5, wherein the a posteriori probability of the frame of speech $j$ being noise $p_j$ is computed as:

$$p_j = \frac{1}{1 + e^{d(x_j)}}$$

    wherein $d(x_j)$ is a distortion measure of the frame of speech $j$.
  11. The method of claim 10, wherein the distortion measure $d(x_j)$ is computed as:

$$d(x_j) = \log\frac{1-\xi}{\xi} + \frac{1}{2}(x_j - n_{j-1})^T \Sigma_{n(j-1)}^{-1}(x_j - n_{j-1}) - \frac{1}{2}(x_j - s_{j-1})^T \Sigma_{s(j-1)}^{-1}(x_j - s_{j-1}) + \frac{1}{2}\log\frac{\lvert\Sigma_{n(j-1)}\rvert}{\lvert\Sigma_{s(j-1)}\rvert}$$
  12. The method of claim 10, wherein the distortion measure $d(x_j)$ is computed as:

$$d(x_j) = \log\frac{1-\xi}{\xi} + \sum_{l=0}^{q-1}\left[\frac{(x_j[l]-n_{j-1}[l])^2}{2\sigma^2_{n(j-1)}[l]} - \frac{(x_j[l]-s_{j-1}[l])^2}{2\sigma^2_{s(j-1)}[l]} + \frac{1}{2}\log\frac{\sigma^2_{n(j-1)}[l]}{\sigma^2_{s(j-1)}[l]}\right]$$

    wherein $q$ is a dimension of $\sigma_n$ and $\sigma_s$.
  13. The method of claim 10, wherein the distortion measure $d(x_j)$ is computed as:

$$d(x_j) = \log\frac{1-\xi}{\xi} + \frac{(x_j[0]-n_{j-1}[0])^2}{2\sigma^2_{n(j-1)}[0]} - \frac{(x_j[0]-s_{j-1}[0])^2}{2\sigma^2_{s(j-1)}[0]} + \frac{1}{2}\log\frac{\sigma^2_{n(j-1)}[0]}{\sigma^2_{s(j-1)}[0]}$$
  14. The method of claim 13, wherein the average noise and speech feature vectors for the utterance are computed as:

$$n_j = \frac{\sum_{k=0}^{j} w_k\, p_{j-k}\, x_{j-k}}{\sum_{k=0}^{j} w_k\, p_{j-k}} \qquad s_j = \frac{\sum_{k=0}^{j} w_k\,(1-p_{j-k})\, x_{j-k}}{\sum_{k=0}^{j} w_k\,(1-p_{j-k})}$$

    wherein $w_k$ is an exponential window function represented as:

$$w_k = \alpha^k$$

    wherein $\alpha$ is a parameter that controls a rate of adaptation.
  15. The method of claim 14, wherein the diagonal covariance matrices for noise and speech are computed as:

$$\sigma^2_{n(j)} = \frac{\sum_{k=0}^{j} w_k\, p_{j-k}\, x^2_{j-k}}{\sum_{k=0}^{j} w_k\, p_{j-k}} - n_j^2 \qquad \sigma^2_{s(j)} = \frac{\sum_{k=0}^{j} w_k\,(1-p_{j-k})\, x^2_{j-k}}{\sum_{k=0}^{j} w_k\,(1-p_{j-k})} - s_j^2$$
  16. The method of claim 15, wherein the a priori probability of the frame of speech $j$ being noise $\xi_j$ is computed as:

$$\xi_j = \frac{\sum_{k=0}^{j} w_k\, p_{j-k}}{\sum_{k=0}^{j} w_k}$$
  17. The method of claim 13, wherein the average noise and speech feature vectors for the utterance are computed as:

$$n_j = \frac{a_n^{(j)}}{c_n^{(j)}} \qquad s_j = \frac{a_s^{(j)}}{c_s^{(j)}}$$

    wherein

$$a_n^{(j)} = p_j\, x_j + \alpha_n\, a_n^{(j-1)}, \quad c_n^{(j)} = p_j + \alpha_n\, c_n^{(j-1)}, \quad a_s^{(j)} = (1-p_j)\, x_j + \alpha_s\, a_s^{(j-1)}, \quad c_s^{(j)} = (1-p_j) + \alpha_s\, c_s^{(j-1)}$$

    and wherein $\alpha_n$ and $\alpha_s$ are parameters that control rates of adaptation for noise and speech, respectively.
  18. The method of claim 17, wherein the diagonal covariance matrices for noise and speech are computed as:

$$\sigma^2_{n(j)} = \frac{b_n^{(j)}}{c_n^{(j)}} - n_j^2 \qquad \sigma^2_{s(j)} = \frac{b_s^{(j)}}{c_s^{(j)}} - s_j^2$$

    wherein

$$b_n^{(j)} = p_j\, x_j^2 + \alpha_n\, b_n^{(j-1)} \qquad b_s^{(j)} = (1-p_j)\, x_j^2 + \alpha_s\, b_s^{(j-1)}$$
  19. The method of claim 18, wherein the a priori probability of the frame of speech $j$ being noise $\xi_j$ is computed as:

$$\xi_j = (1-\alpha_n)\, c_n^{(j)}$$
  20. A system for improving speech recognition through front-end normalization of feature vectors, speech comprising utterances, each utterance comprising frames of speech, each frame of speech being represented by a feature vector, the system comprising:
       a database of known utterances, the database of utterances having an average noise feature vector and an average speech feature vector; and
       an input normalizer for:
       receiving a feature vector representing a frame of speech in an utterance to be recognized, the frame of speech having a probability of being noise, the utterance having an average noise feature vector and an average speech feature vector;
       computing a correction vector based on the probability of the frame of speech being noise and based on the average noise and speech feature vectors for the utterance and the database of utterances; and
       computing a normalized feature vector based on the feature vector and the correction vector.
  21. A system for improving speech recognition through front-end normalization of feature vectors, speech comprising utterances, each utterance comprising frames of speech, each frame of speech being represented by a feature vector, the system comprising:
       a database of known utterances, the utterances being represented by feature models, the database of utterances having an average noise feature vector and an average speech feature vector;
       a feature extractor for extracting a feature vector from a frame of speech in an utterance to be recognized, the frame of speech having a probability of being noise, the utterance having an average noise feature vector and an average speech feature vector;
       an input normalizer for normalizing the feature vector by: (i) computing a correction vector based on the probability of the frame of speech being noise and based on the average noise and speech feature vectors for the utterance and the database of utterances, and (ii) computing a normalized feature vector based on the feature vector and the correction vector; and
       a pattern matcher for comparing the normalized feature vector to the feature models in the database.
EP95111784A 1994-07-29 1995-07-26 Method and system for speech recognition Expired - Lifetime EP0694906B1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US283271 1994-07-29
US08/283,271 US5604839A (en) 1994-07-29 1994-07-29 Method and system for improving speech recognition through front-end normalization of feature vectors

Publications (2)

Publication Number Publication Date
EP0694906A1 true EP0694906A1 (en) 1996-01-31
EP0694906B1 EP0694906B1 (en) 2000-09-06

Family

ID=23085287

Family Applications (1)

Application Number Title Priority Date Filing Date
EP95111784A Expired - Lifetime EP0694906B1 (en) 1994-07-29 1995-07-26 Method and system for speech recognition

Country Status (4)

Country Link
US (1) US5604839A (en)
EP (1) EP0694906B1 (en)
JP (1) JPH08110793A (en)
DE (1) DE69518705T2 (en)


Families Citing this family (44)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2797949B2 (en) * 1994-01-31 1998-09-17 日本電気株式会社 Voice recognition device
US5848246A (en) 1996-07-01 1998-12-08 Sun Microsystems, Inc. Object-oriented system, method and article of manufacture for a client-server session manager in an interprise computing framework system
US6304893B1 (en) 1996-07-01 2001-10-16 Sun Microsystems, Inc. Object-oriented system, method and article of manufacture for a client-server event driven message framework in an interprise computing framework system
US5987245A (en) 1996-07-01 1999-11-16 Sun Microsystems, Inc. Object-oriented system, method and article of manufacture (#12) for a client-server state machine framework
US6272555B1 (en) 1996-07-01 2001-08-07 Sun Microsystems, Inc. Object-oriented system, method and article of manufacture for a client-server-centric interprise computing framework system
US6266709B1 (en) 1996-07-01 2001-07-24 Sun Microsystems, Inc. Object-oriented system, method and article of manufacture for a client-server failure reporting process
US6424991B1 (en) 1996-07-01 2002-07-23 Sun Microsystems, Inc. Object-oriented system, method and article of manufacture for a client-server communication framework
US6434598B1 (en) 1996-07-01 2002-08-13 Sun Microsystems, Inc. Object-oriented system, method and article of manufacture for a client-server graphical user interface (#9) framework in an interprise computing framework system
US5999972A (en) 1996-07-01 1999-12-07 Sun Microsystems, Inc. System, method and article of manufacture for a distributed computer system framework
US6038590A (en) 1996-07-01 2000-03-14 Sun Microsystems, Inc. Object-oriented system, method and article of manufacture for a client-server state machine in an interprise computing framework system
JP3195752B2 (en) * 1997-02-28 2001-08-06 シャープ株式会社 Search device
JP3962445B2 (en) * 1997-03-13 2007-08-22 キヤノン株式会社 Audio processing method and apparatus
KR100450787B1 (en) * 1997-06-18 2005-05-03 삼성전자주식회사 Speech Feature Extraction Apparatus and Method by Dynamic Spectralization of Spectrum
KR100442825B1 (en) * 1997-07-11 2005-02-03 삼성전자주식회사 Method for compensating environment for voice recognition, particularly regarding to improving performance of voice recognition system by compensating polluted voice spectrum closely to real voice spectrum
US5946653A (en) * 1997-10-01 1999-08-31 Motorola, Inc. Speaker independent speech recognition system and method
US6173258B1 (en) * 1998-09-09 2001-01-09 Sony Corporation Method for reducing noise distortions in a speech recognition system
US6768979B1 (en) * 1998-10-22 2004-07-27 Sony Corporation Apparatus and method for noise attenuation in a speech recognition system
US6308155B1 (en) * 1999-01-20 2001-10-23 International Computer Science Institute Feature extraction for automatic speech recognition
GB2349259B (en) 1999-04-23 2003-11-12 Canon Kk Speech processing apparatus and method
US6804640B1 (en) * 2000-02-29 2004-10-12 Nuance Communications Signal noise reduction using magnitude-domain spectral subtraction
EP1229517B1 (en) * 2001-02-06 2005-05-04 Sony International (Europe) GmbH Method for recognizing speech with noise-dependent variance normalization
US6985858B2 (en) * 2001-03-20 2006-01-10 Microsoft Corporation Method and apparatus for removing noise from feature vectors
US6959276B2 (en) * 2001-09-27 2005-10-25 Microsoft Corporation Including the category of environmental noise when processing speech signals
US7139703B2 (en) * 2002-04-05 2006-11-21 Microsoft Corporation Method of iterative noise estimation in a recursive framework
US6944590B2 (en) 2002-04-05 2005-09-13 Microsoft Corporation Method of iterative noise estimation in a recursive framework
US7107210B2 (en) * 2002-05-20 2006-09-12 Microsoft Corporation Method of noise reduction based on dynamic aspects of speech
US7174292B2 (en) 2002-05-20 2007-02-06 Microsoft Corporation Method of determining uncertainty associated with acoustic distortion-based noise reduction
US7103540B2 (en) 2002-05-20 2006-09-05 Microsoft Corporation Method of pattern recognition using noise reduction uncertainty
US7047047B2 (en) 2002-09-06 2006-05-16 Microsoft Corporation Non-linear observation model for removing noise from corrupted signals
US7165026B2 (en) * 2003-03-31 2007-01-16 Microsoft Corporation Method of noise estimation using incremental bayes learning
US7720675B2 (en) * 2003-10-27 2010-05-18 Educational Testing Service Method and system for determining text coherence
CN101228577B (en) * 2004-01-12 2011-11-23 语音信号技术公司 Automatic speech recognition channel normalization method and system
US20080208578A1 (en) * 2004-09-23 2008-08-28 Koninklijke Philips Electronics, N.V. Robust Speaker-Dependent Speech Recognition System
US8175877B2 (en) * 2005-02-02 2012-05-08 At&T Intellectual Property Ii, L.P. Method and apparatus for predicting word accuracy in automatic speech recognition systems
KR100714721B1 (en) 2005-02-04 2007-05-04 삼성전자주식회사 Method and apparatus for detecting voice region
US8202098B2 (en) * 2005-02-28 2012-06-19 Educational Testing Service Method of model scaling for an automated essay scoring system
DE102005030855A1 (en) * 2005-07-01 2007-01-11 Müller-BBM GmbH Electro-acoustic method
US7725316B2 (en) * 2006-07-05 2010-05-25 General Motors Llc Applying speech recognition adaptation in an automated speech recognition system of a telematics-equipped vehicle
CN101154380B (en) * 2006-09-29 2011-01-26 株式会社东芝 Method and device for registration and validation of speaker's authentication
JP2008298844A (en) * 2007-05-29 2008-12-11 Advanced Telecommunication Research Institute International Noise suppressing device, computer program, and speech recognition system
US20100094622A1 (en) * 2008-10-10 2010-04-15 Nexidia Inc. Feature normalization for speech and audio processing
US20130158996A1 (en) * 2011-12-19 2013-06-20 Spansion Llc Acoustic Processing Unit
US9953646B2 (en) 2014-09-02 2018-04-24 Belleau Technologies Method and system for dynamic speech recognition and tracking of prewritten script
US9824684B2 (en) 2014-11-13 2017-11-21 Microsoft Technology Licensing, Llc Prediction-based sequence recognition


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0301199A1 (en) * 1987-07-09 1989-02-01 International Business Machines Corporation Normalization of speech by adaptive labelling
US5185848A (en) * 1988-12-14 1993-02-09 Hitachi, Ltd. Noise reduction system using neural network
EP0487309A2 (en) * 1990-11-20 1992-05-27 Canon Kabushiki Kaisha Pattern recognition method and apparatus using a neural network

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0871157A3 (en) * 1997-04-11 1999-06-16 Nokia Mobile Phones Ltd. A method and a device for recognising speech
EP0871157A2 (en) * 1997-04-11 1998-10-14 Nokia Mobile Phones Ltd. A method and a device for recognising speech
US6772117B1 (en) 1997-04-11 2004-08-03 Nokia Mobile Phones Limited Method and a device for recognizing speech
EP1113419A1 (en) * 1999-12-28 2001-07-04 Sony Corporation Model adaptive apparatus and model adaptive method, recording medium, and pattern recognition apparatus
US7043425B2 (en) 1999-12-28 2006-05-09 Sony Corporation Model adaptive apparatus and model adaptive method, recording medium, and pattern recognition apparatus
US6920421B2 (en) 1999-12-28 2005-07-19 Sony Corporation Model adaptive apparatus for performing adaptation of a model used in pattern recognition considering recentness of a received pattern data
US7003455B1 (en) 2000-10-16 2006-02-21 Microsoft Corporation Method of noise reduction using correction and scaling vectors with partitioning of the acoustic space in the domain of noisy speech
EP1199712A2 (en) * 2000-10-16 2002-04-24 Microsoft Corporation Noise reduction method
EP1199712A3 (en) * 2000-10-16 2003-09-10 Microsoft Corporation Noise reduction method
US7254536B2 (en) 2000-10-16 2007-08-07 Microsoft Corporation Method of noise reduction using correction and scaling vectors with partitioning of the acoustic space in the domain of noisy speech
EP1246165A1 (en) * 2001-03-28 2002-10-02 Matsushita Electric Industrial Co., Ltd. Keyword detection in a noisy signal
US6985859B2 (en) 2001-03-28 2006-01-10 Matsushita Electric Industrial Co., Ltd. Robust word-spotting system using an intelligibility criterion for reliable keyword detection under adverse and unknown noisy environments
EP1326233A2 (en) * 2001-12-28 2003-07-09 Kabushiki Kaisha Toshiba Apparatus and method for speech recognition in noise
EP1326233A3 (en) * 2001-12-28 2004-07-28 Kabushiki Kaisha Toshiba Apparatus and method for speech recognition in noise
US7117148B2 (en) 2002-04-05 2006-10-03 Microsoft Corporation Method of noise reduction using correction vectors based on dynamic aspects of speech and noise normalization
US7181390B2 (en) 2002-04-05 2007-02-20 Microsoft Corporation Noise reduction using correction vectors based on dynamic aspects of speech and noise normalization
US7542900B2 (en) 2002-04-05 2009-06-02 Microsoft Corporation Noise reduction using correction vectors based on dynamic aspects of speech and noise normalization
WO2005036525A1 (en) * 2003-10-08 2005-04-21 Philips Intellectual Property & Standards Gmbh Adaptation of environment mismatch for speech recognition systems
GB2422237A (en) * 2004-12-21 2006-07-19 Fluency Voice Technology Ltd Dynamic coefficients determined from temporally adjacent speech frames
CN111489754A (en) * 2019-01-28 2020-08-04 国家电网有限公司客户服务中心 Telephone traffic data analysis method based on intelligent voice technology

Also Published As

Publication number Publication date
DE69518705D1 (en) 2000-10-12
EP0694906B1 (en) 2000-09-06
DE69518705T2 (en) 2001-01-04
JPH08110793A (en) 1996-04-30
US5604839A (en) 1997-02-18

Similar Documents

Publication Publication Date Title
EP0694906A1 (en) Method and system for speech recognition
CA2147772C (en) Method of and apparatus for signal recognition that compensates for mismatching
US5148489A (en) Method for spectral estimation to improve noise robustness for speech recognition
US4905286A (en) Noise compensation in speech recognition
EP0966736B1 (en) Method for discriminative training of speech recognition models
US5806029A (en) Signal conditioned minimum error rate training for continuous speech recognition
US5023912A (en) Pattern recognition system using posterior probabilities
US6493667B1 (en) Enhanced likelihood computation using regression in a speech recognition system
US7269556B2 (en) Pattern recognition
EP0470245B1 (en) Method for spectral estimation to improve noise robustness for speech recognition
US6421640B1 (en) Speech recognition method using confidence measure evaluation
Bahl et al. Multonic Markov word models for large vocabulary continuous speech recognition
US5684924A (en) User adaptable speech recognition system
US6202047B1 (en) Method and apparatus for speech recognition using second order statistics and linear estimation of cepstral coefficients
US6725196B2 (en) Pattern matching method and apparatus
WO1997010587A9 (en) Signal conditioned minimum error rate training for continuous speech recognition
JPH05257492A (en) Voice recognizing system
US5345535A (en) Speech analysis method and apparatus
Welling et al. A model for efficient formant estimation
EP1074018B1 (en) Speech recognition system and method
EP0435336A2 (en) Reference pattern learning system
Rose et al. Robust speaker identification in noisy environments using noise adaptive speaker models
EP0309561B1 (en) An adaptive threshold voiced detector
KR0170317B1 (en) Voice recognition method using hidden markov model having distortion density of observation vector
Pan et al. An efficient vector-quantization preprocessor for speaker independent isolated word recognition

Legal Events

Date Code Title Description
PUAI Public reference made under Article 153(3) EPC to a published international application that has entered the European phase

Free format text: ORIGINAL CODE: 0009012

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): DE FR GB

17P Request for examination filed

Effective date: 19960725

17Q First examination report despatched

Effective date: 19990519

GRAG Despatch of communication of intention to grant

Free format text: ORIGINAL CODE: EPIDOS AGRA

GRAH Despatch of communication of intention to grant a patent

Free format text: ORIGINAL CODE: EPIDOS IGRA

RIC1 Information provided on ipc code assigned before grant

Free format text: 7G 10L 15/20 A

GRAA (expected) grant

Free format text: ORIGINAL CODE: 0009210

AK Designated contracting states

Kind code of ref document: B1

Designated state(s): DE FR GB

REF Corresponds to:

Ref document number: 69518705

Country of ref document: DE

Date of ref document: 20001012

ET FR: Translation filed
PLBE No opposition filed within time limit

Free format text: ORIGINAL CODE: 0009261

STAA Information on the status of an EP patent application or granted EP patent

Free format text: STATUS: NO OPPOSITION FILED WITHIN TIME LIMIT

26N No opposition filed
REG Reference to a national code

Ref country code: GB

Ref legal event code: IF02

PGFP Annual fee paid to national office [announced via post-grant information from national office to EPO]

Ref country code: GB

Payment date: 20130624

Year of fee payment: 19

PGFP Annual fee paid to national office [announced via post-grant information from national office to EPO]

Ref country code: DE

Payment date: 20130731

Year of fee payment: 19

PGFP Annual fee paid to national office [announced via post-grant information from national office to EPO]

Ref country code: FR

Payment date: 20130712

Year of fee payment: 19

REG Reference to a national code

Ref country code: DE

Ref legal event code: R082

Ref document number: 69518705

Country of ref document: DE

Representative's name: GRUENECKER, KINKELDEY, STOCKMAIR & SCHWANHAEUS, DE

REG Reference to a national code

Ref country code: DE

Ref legal event code: R119

Ref document number: 69518705

Country of ref document: DE

REG Reference to a national code

Ref country code: GB

Ref legal event code: 732E

Free format text: REGISTERED BETWEEN 20150108 AND 20150114

REG Reference to a national code

Ref country code: DE

Ref legal event code: R082

Ref document number: 69518705

Country of ref document: DE

Representative's name: GRUENECKER PATENT- UND RECHTSANWAELTE PARTG MB, DE

Effective date: 20150126

Ref country code: DE

Ref legal event code: R081

Ref document number: 69518705

Country of ref document: DE

Owner name: MICROSOFT TECHNOLOGY LICENSING, LLC, REDMOND, US

Free format text: FORMER OWNER: MICROSOFT CORP., REDMOND, WASH., US

Effective date: 20150126

GBPC GB: European patent ceased through non-payment of renewal fee

Effective date: 20140726

REG Reference to a national code

Ref country code: FR

Ref legal event code: ST

Effective date: 20150331

PG25 Lapsed in a contracting state [announced via post-grant information from national office to EPO]

Ref country code: DE

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20150203

REG Reference to a national code

Ref country code: DE

Ref legal event code: R119

Ref document number: 69518705

Country of ref document: DE

Effective date: 20150203

PG25 Lapsed in a contracting state [announced via post-grant information from national office to EPO]

Ref country code: FR

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20140731

Ref country code: GB

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20140726