US20040267530A1 - Discriminative training of hidden Markov models for continuous speech recognition - Google Patents

Discriminative training of hidden Markov models for continuous speech recognition

Info

Publication number
US20040267530A1
US20040267530A1
Authority
US
United States
Prior art keywords
models
recognition
model
training
standard deviation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US10/719,682
Inventor
Chuang He
Jianxiong Wu
Vlad Sejnoha
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nuance Communications Inc
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to US10/719,682
Assigned to SCANSOFT, INC. reassignment SCANSOFT, INC. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: HE, CHUANG, SEJNOHA, VLAD, WU, JIANXIONG
Publication of US20040267530A1
Assigned to NUANCE COMMUNICATIONS, INC. reassignment NUANCE COMMUNICATIONS, INC. MERGER AND CHANGE OF NAME TO NUANCE COMMUNICATIONS, INC. Assignors: SCANSOFT, INC.
Assigned to USB AG, STAMFORD BRANCH reassignment USB AG, STAMFORD BRANCH SECURITY AGREEMENT Assignors: NUANCE COMMUNICATIONS, INC.
Priority to US12/241,811 (issued as US7672847B2)
Assigned to ART ADVANCED RECOGNITION TECHNOLOGIES, INC., A DELAWARE CORPORATION, AS GRANTOR, NUANCE COMMUNICATIONS, INC., AS GRANTOR, SCANSOFT, INC., A DELAWARE CORPORATION, AS GRANTOR, SPEECHWORKS INTERNATIONAL, INC., A DELAWARE CORPORATION, AS GRANTOR, DICTAPHONE CORPORATION, A DELAWARE CORPORATION, AS GRANTOR, TELELOGUE, INC., A DELAWARE CORPORATION, AS GRANTOR, DSP, INC., D/B/A DIAMOND EQUIPMENT, A MAINE CORPORATON, AS GRANTOR reassignment ART ADVANCED RECOGNITION TECHNOLOGIES, INC., A DELAWARE CORPORATION, AS GRANTOR PATENT RELEASE (REEL:017435/FRAME:0199) Assignors: MORGAN STANLEY SENIOR FUNDING, INC., AS ADMINISTRATIVE AGENT
Assigned to MITSUBISH DENKI KABUSHIKI KAISHA, AS GRANTOR, NORTHROP GRUMMAN CORPORATION, A DELAWARE CORPORATION, AS GRANTOR, STRYKER LEIBINGER GMBH & CO., KG, AS GRANTOR, ART ADVANCED RECOGNITION TECHNOLOGIES, INC., A DELAWARE CORPORATION, AS GRANTOR, NUANCE COMMUNICATIONS, INC., AS GRANTOR, SCANSOFT, INC., A DELAWARE CORPORATION, AS GRANTOR, SPEECHWORKS INTERNATIONAL, INC., A DELAWARE CORPORATION, AS GRANTOR, DICTAPHONE CORPORATION, A DELAWARE CORPORATION, AS GRANTOR, HUMAN CAPITAL RESOURCES, INC., A DELAWARE CORPORATION, AS GRANTOR, TELELOGUE, INC., A DELAWARE CORPORATION, AS GRANTOR, DSP, INC., D/B/A DIAMOND EQUIPMENT, A MAINE CORPORATON, AS GRANTOR, NOKIA CORPORATION, AS GRANTOR, INSTITIT KATALIZA IMENI G.K. BORESKOVA SIBIRSKOGO OTDELENIA ROSSIISKOI AKADEMII NAUK, AS GRANTOR reassignment MITSUBISH DENKI KABUSHIKI KAISHA, AS GRANTOR PATENT RELEASE (REEL:018160/FRAME:0909) Assignors: MORGAN STANLEY SENIOR FUNDING, INC., AS ADMINISTRATIVE AGENT
Current legal status: Abandoned

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142 Hidden Markov Models [HMMs]
    • G10L15/144 Training of HMMs

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Electrically Operated Instructional Devices (AREA)
  • Machine Translation (AREA)

Abstract

Methods are given for improving discriminative training of hidden Markov models for continuous speech recognition. In one approach, discriminatively trained mixture models are interpolated with maximum likelihood trained mixture models. In another approach, segmentation and recognition results from one set of models are reused to discriminatively train a second set of models. For example, segmentation and recognition results from detailed match models are mapped and used to discriminatively train fast match models. In addition, gradients for the standard deviation of mixture components are clipped based on the statistics of the gradients. The pronunciation of words may also be used to determine the “correctness” of recognition hypotheses.

Description

  • This application claims priority from provisional application 60/446,198, filed Feb. 10, 2003, and provisional application 60/428,194, filed Nov. 21, 2002, the contents of which are incorporated herein by reference.[0001]
  • FIELD OF THE INVENTION
  • The invention generally relates to automatic speech recognition, and more particularly, to techniques for adjusting the mixture components of hidden Markov models as used in automatic speech recognition. [0002]
  • BACKGROUND ART
  • Most speech recognition systems utilize a statistical model called the hidden Markov model (HMM). Such models consist of sequences of states connected by arcs, and a probability density function (pdf) associated with each state which describes the likelihood of observing any given feature vector at that state. A separate set of probabilities determines the transitions between the states. Most large vocabulary continuous recognition systems use continuous pdfs, which are parametric functions that describe the probability of any arbitrary input feature vector given a model state. [0003]
  • One drawback of using continuous pdfs is that the designer must make explicit assumptions about the nature of the pdfs being modeled—something which can be quite difficult since the true distribution form for the speech signal is not known. The most common class of functions used for this purpose is a mixture of Gaussians, where an arbitrary pdf is modeled by a weighted sum of normal distributions. [0004]
  • The model pdfs are most commonly trained using the maximum likelihood method. In this manner, the model parameters are adjusted so that the likelihood of observing the training data given the model is maximized. However, it is known that this approach does not necessarily lead to the best recognition performance. This problem can be addressed by discriminative training of the mixture models. The idea is to adjust the model parameters so as to minimize the number of recognition errors rather than fit the distributions to the data. One approach to discriminative training in a large vocabulary continuous speech recognition system is described in U.S. Pat. No. 6,490,555, the contents of which are incorporated herein by reference. [0005]
  • SUMMARY OF THE INVENTION
  • Embodiments of the present invention are directed to methods for improving discriminative training of hidden Markov models for a continuous speech recognition system. One embodiment assigns a value to a model parameter of a mixture component of a hidden Markov model state as a weighted sum of a maximum likelihood trained value of the parameter and a discriminatively trained value of the parameter. The interpolation weights are determined by the amount of data used in maximum likelihood training and discriminative training. Different mixture components may have different weights. The model parameter may be, for example, Gaussian mixture mean and standard deviation. [0006]
  • Another embodiment reuses the segmentation and recognition results of a first set of recognition models to discriminatively train a second set of recognition models. Specifically, a first set of recognition models is first used to perform segmentation and recognition of a set of speech training data so as to form a first model reference state sequence and a set of first model hypothesis state sequences. States in the first model reference state sequence are mapped to corresponding states in a second set of recognition models so as to form a second model reference state sequence. States in the set of first model hypothesis state sequences are mapped to corresponding states in the second set of recognition models so as to form a set of second model hypothesis state sequences. Selected model states in the second set of recognition models are then discriminatively trained using the mapped state sequences. In one specific such embodiment, the segmentation and recognition results of the detailed match models are mapped and then used to discriminatively train the fast match models. [0007]
  • In another embodiment, the gradients for the standard deviation of mixture components are clipped to a range. The range is determined by the mean and standard deviation of the gradients of the standard deviation of all the mixture components. [0008]
  • An embodiment of the present invention also avoids the tedious work of text normalization by determining the “correctness” of recognition hypotheses using the pronunciation of words in the reference and hypothesis texts.[0009]
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 shows how to reuse the segmentation and recognition results of the detailed match models to discriminatively train the fast match models. [0010]
  • FIG. 2 shows clipping of the gradients of the standard deviation of mixture components according to one embodiment of the present invention.[0011]
  • DETAILED DESCRIPTION OF SPECIFIC EMBODIMENTS
  • It is well known that discriminative training algorithms are prone to over-training. These algorithms may significantly improve the recognition accuracy of the training data, but the improvement does not necessarily generalize to other independent test sets. In some cases, discriminatively trained models may even degrade the recognition performance on independent test sets. Embodiments of the present invention improve the generalization of discriminative training techniques by interpolating the discriminatively trained and the maximum likelihood trained models. Embodiments also limit the gradients of the standard deviation of mixture components. [0012]
  • Discriminative training algorithms are computationally intensive because segmentation and recognition of the entire training corpus may be required. Traditionally, in order to discriminatively train different models using the same training corpus (for example, models of different sizes, or models used for detailed match and fast match), segmentation and recognition of the training data have to be performed for each of the different models, which is time consuming and inefficient. Embodiments of the present invention reuse the segmentation and recognition results of one particular model for discriminative training of another model. For example, one specific embodiment reuses segmentation and recognition results of detailed match models for discriminative training of fast match models. [0013]
  • In the discriminative training algorithm used in one embodiment of the present invention, the hypothesized words in the recognition results of the training data are marked as “correct” or “incorrect” for discriminative training. Conventionally, this is done by matching the word label of a hypothesized word with the corresponding word in the reference text. To obtain accurate “correct” or “incorrect” labels, tedious manual or semi-manual text normalization typically has to be performed on the reference text. Embodiments of the present invention avoid text normalization by determining the “correctness” of recognition hypotheses using the pronunciation of the words in the reference and hypothesis texts. [0014]
  • Embodiments of the present invention are directed to various techniques for improving discriminative training of mixture models for continuous speech recognition. Such improvements can be considered as contributing to one or both of two system design objectives: (1) improving recognition performance, including recognition accuracy and/or speed, and (2) improving the efficiency of the discriminative training process. Before describing these improvements in any detail, we start by reviewing some background art on one particular type of discriminative training technique called Minimum Classification Error (MCE) training. [0015]
  • In a continuous density pdf using Gaussian mixtures, the standard Gaussian mixture log-probability density function GMLP is described by: [0016]

    $$\mathrm{GMLP}(x(t), s) = -\log\left(\sum_{k=1}^{N(s)} a(s,k)\, G\big(x(t);\, \mu(s,k);\, \Sigma(s,k)\big)\right)$$
  • where N(s) is the number of mixture components, a(s,k) is the weight of mixture component k of state s, and G(x(t);μ(s,k);Σ(s,k)) represents the probability of observing x(t) given a multivariate Gaussian with mean μ(s,k) and covariance Σ(s,k). [0017]
  • However, experimental evidence indicates that a computationally simpler form of Gaussian mixture may be employed as the pdf. Using a simpler mixture model not only reduces computational load, but in addition, the resultant reduction in the number of free parameters in the model significantly improves trainability with limited quantities of data. Accordingly, the continuous density pdf used in the following described embodiments assumes that Σ(s,k) is a diagonal matrix. [0018]
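  • To make the computation concrete, the following is a minimal NumPy sketch of the GMLP score for a single state under the diagonal-covariance assumption above; the function name and array layout are illustrative, not part of the original disclosure:

```python
import numpy as np

def gmlp(x, weights, means, std_devs):
    """GMLP(x(t), s) = -log( sum_k a(s,k) * G(x(t); mu(s,k), Sigma(s,k)) )
    for one state s, with Sigma(s,k) assumed diagonal.

    x        : (D,)   feature vector x(t)
    weights  : (K,)   mixture weights a(s, k)
    means    : (K, D) mean vectors mu(s, k)
    std_devs : (K, D) standard deviation vectors sigma(s, k)
    """
    D = x.shape[0]
    # Log of each diagonal-covariance Gaussian component density.
    log_g = (-0.5 * np.sum(((x - means) / std_devs) ** 2, axis=1)
             - np.sum(np.log(std_devs), axis=1)
             - 0.5 * D * np.log(2.0 * np.pi))
    # Log-sum-exp over mixture components, for numerical stability.
    return -np.logaddexp.reduce(np.log(weights) + log_g)
```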
  • The average score for a path corresponding to an alignment of the input utterance with a reference model i is given by [0019]

    $$D_i = \frac{1}{P} \sum_{p=1}^{P} \mathrm{GMLP}\big(x(p), q_{i,p}\big)$$

  • where x(p) is the feature vector at time p, $q_{i,p}$ is the corresponding state index, and P is the number of feature vectors in the input utterance. [0020]
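  • A sketch of the path score $D_i$, reusing the gmlp function above; the model argument is assumed (for illustration only) to map a state index to that state's mixture parameters:

```python
def alignment_score(features, state_indices, model):
    """D_i: average GMLP over the alignment path, one state q_{i,p} per frame.

    features      : list of P feature vectors x(p)
    state_indices : list of P state indices q_{i,p}
    model         : dict state -> (weights, means, std_devs)
    """
    total = 0.0
    for x, s in zip(features, state_indices):
        weights, means, std_devs = model[s]
        total += gmlp(x, weights, means, std_devs)
    return total / len(features)
```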
  • The first step in the training of the continuous density pdfs is the initialization of the mean vectors μ(s,k) and the standard deviation vectors σ(s,k), which are the square root of the diagonal elements of Σ(s,k). This can be done by training a conventional maximum likelihood Gaussian mixture pdf for each model state from the input utterance frames aligned with that state. The next step consists of discriminative training of the mean and standard deviation vectors. This is accomplished by defining an appropriate training objective function that reflects recognition error rate, and by optimizing the mean and standard deviation vectors so as to minimize this function. [0021]
  • One common technique applicable to the minimization of the objective function is gradient descent optimization. Gradient descent optimization is described, for example, in D. E. Rumelhart et al., Parallel Distributed Processing, Vol. 1, pp. 322-28, MIT Press, the contents of which are incorporated herein by reference. In this approach, the objective function is differentiated with respect to the model parameters to obtain the gradients, and the model parameters are then modified by the addition of the scaled gradients. A new gradient that reflects the modified parameters is then computed, and the parameters are adjusted further. The iteration is continued until convergence is attained, usually determined by monitoring the recognition performance on an evaluation data set which is independent of the training data. [0022]
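  • Schematically, the loop might look like the following sketch, where neg_gradient and eval_performance are hypothetical callables standing in for the differentiation step and the held-out recognition evaluation:

```python
def gradient_descent(params, neg_gradient, eval_performance, w=0.1, max_iter=100):
    """Iterate: add the scaled (negative) gradient, re-evaluate, and stop when
    recognition performance on independent evaluation data stops improving."""
    best = eval_performance(params)
    for _ in range(max_iter):
        candidate = params + w * neg_gradient(params)  # add the scaled gradient
        score = eval_performance(candidate)
        if score <= best:          # no further improvement: convergence
            break
        params, best = candidate, score
    return params
```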
  • A training database is preprocessed by obtaining for each training utterance a short list of candidate recognition models. In a continuous speech recognition system, such a list contains descriptions of model state sequences. U.S. Pat. No. 6,490,555 to Girija Yegnanarayanan et al., incorporated herein by reference, describes one particular approach to generating a set of candidate models. Each candidate list thus contains some number of correct models (subset C), and a number of incorrect models (subset I). [0023]
  • An error function $\varepsilon_n$ for a particular training utterance n is computed from the pair-wise error functions $o_{i,j}$: [0024]

    $$\varepsilon_n = \sum_{i \in C} \sum_{j \in I} o_{i,j}$$

  • where $o_{i,j} = \left(1 + e^{-\beta(D_i - D_j)}\right)^{-1}$, β is a scalar multiplier, $D_i$ is the alignment score between the input token and a correct model $i \in C$, and $D_j$ is the alignment score between the input token and an incorrect model $j \in I$. The sizes of the sets C and I can be controlled to determine how many correct models and incorrect or potential intruder models are used in the training. [0025]
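  • These definitions transcribe directly into a short sketch (names are illustrative; plain math.exp suffices):

```python
import math

def pairwise_error(d_i, d_j, beta=1.0):
    """o_{i,j} = (1 + exp(-beta * (D_i - D_j)))^{-1}; near 1 when the correct
    model's score D_i is much greater (worse) than the intruder's D_j."""
    return 1.0 / (1.0 + math.exp(-beta * (d_i - d_j)))

def utterance_error(correct_scores, intruder_scores, beta=1.0):
    """eps_n: sum of pair-wise errors over all (i in C, j in I) pairs."""
    return sum(pairwise_error(d_i, d_j, beta)
               for d_i in correct_scores
               for d_j in intruder_scores)
```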
  • The error function $o_{i,j}$ takes on values near 1 when the correct model score $D_i$ is much greater (i.e., worse) than the intruder score $D_j$, and near 0 when the converse is true. Values of $o_{i,j}$ greater than 0.5 represent recognition errors, while values less than 0.5 represent correct recognitions. The scalar multiplier parameter β controls the influence of “near-errors” on the training. As previously described, the score $D_i$ between the utterance and model i is obtained by scoring the alignment path [0026]

    $$D_i = \frac{1}{P} \sum_{p=1}^{P} \mathrm{GMLP}\big(x(p), q_{i,p}\big).$$
  • A similar expression can be written for $D_j$. For mixture component k of state s, differentiating the error function with respect to element l of the mean vector μ(s,k) yields the gradient: [0027]

    $$-\frac{\partial \varepsilon_n}{\partial \mu(s,k,l)}.$$
  • Similarly, differentiating the error function with respect to element l of the standard deviation vector σ(s,k) yields the gradient: [0028]

    $$-\frac{\partial \varepsilon_n}{\partial \sigma(s,k,l)}.$$
  • For batch mode processing, in each iteration, the gradient is averaged over all utterances: [0029]

    $$\Delta\mu(s,k,l) = \frac{1}{N}\sum_{n} -\frac{\partial \varepsilon_n}{\partial \mu(s,k,l)}, \qquad \Delta\sigma(s,k,l) = \frac{1}{N}\sum_{n} -\frac{\partial \varepsilon_n}{\partial \sigma(s,k,l)}$$
  • where N is the total number of utterances. The mean and standard deviation of mixture components are modified by the addition of the scaled gradient: [0030]
    $$\hat{\mu}(s,k,l) = \mu(s,k,l) + w_{\mu}\,\Delta\mu(s,k,l)$$

    $$\hat{\sigma}(s,k,l) = \sigma(s,k,l) + w_{\sigma}\,\Delta\sigma(s,k,l)$$
  • where $w_{\mu}$ and $w_{\sigma}$ are weights which determine the magnitude of the changes to the parameter set in one iteration. This process is repeated until some stopping criterion is met. [0031]
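  • One batch iteration can be sketched as follows, assuming the per-utterance negative gradients have already been computed and stacked along a leading axis:

```python
import numpy as np

def batch_update(mu, sigma, neg_grads_mu, neg_grads_sigma, w_mu, w_sigma):
    """Average -d(eps_n)/d(param) over the N utterances, then add the scaled
    result to the current means and standard deviations.

    mu, sigma       : (S, K, L) current parameters
    neg_grads_*     : (N, S, K, L) per-utterance negative gradients
    w_mu, w_sigma   : step weights controlling the change per iteration
    """
    delta_mu = neg_grads_mu.mean(axis=0)        # Delta_mu(s, k, l)
    delta_sigma = neg_grads_sigma.mean(axis=0)  # Delta_sigma(s, k, l)
    return mu + w_mu * delta_mu, sigma + w_sigma * delta_sigma
```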
  • The gradient descent algorithm described above is an unconstrained optimization technique. For Gaussian mixture components, certain constraints must be maintained, e.g., σ(s,k,l) > 0. In Wu Chou, Discriminant-Function-Based Minimum Recognition Error Rate Pattern-Recognition Approach To Speech Recognition, IEEE Proceedings, Vol. 88, No. 8, August 2000, which is incorporated herein by reference, the author applied the gradient descent algorithm to transformed mixture components. For example, the following transforms can be applied to the mean and standard deviation of mixture components: [0032]

    $$\mu_{\mathrm{Transformed}}(s,k,l) = \frac{\mu(s,k,l)}{\sigma(s,k,l)} \qquad \text{and} \qquad \sigma_{\mathrm{Transformed}}(s,k,l) = \log\big(\sigma(s,k,l)\big)$$
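  • One way to realize such a constrained update, sketched under the assumption that the gradients have already been computed with respect to the transformed parameters:

```python
import numpy as np

def transformed_step(mu, sigma, delta_mu_t, delta_sigma_t, w_mu, w_sigma):
    """Step in the transformed (unconstrained) space, then invert the
    transforms; sigma = exp(log sigma) stays positive automatically."""
    mu_t = mu / sigma                   # mu_Transformed = mu / sigma
    sigma_t = np.log(sigma)             # sigma_Transformed = log(sigma)
    mu_t += w_mu * delta_mu_t           # gradient step on transformed mean
    sigma_t += w_sigma * delta_sigma_t  # gradient step on transformed std dev
    new_sigma = np.exp(sigma_t)         # invert: always > 0
    new_mu = mu_t * new_sigma           # invert: mu = mu_Transformed * sigma
    return new_mu, new_sigma
```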
  • Further details of specific approaches to implementing discriminative training in a continuous speech recognition system are given in U.S. Pat. No. 6,490,555 and in Wu Chou, Discriminant-Function-Based Minimum Recognition Error Rate Pattern-Recognition Approach To Speech Recognition, IEEE Proceedings, Vol. 88, No. 8, August 2000, which are incorporated herein by reference. [0033]
  • One embodiment of the present invention is directed to improving recognition performance by interpolating discriminatively trained mixture models with maximum likelihood trained mixture models. Generally, for some model parameter γ, the final trained value of that parameter for a mixture component k in some state s will be a weighted sum of the maximum likelihood trained value of the parameter and the discriminatively trained value of the parameter: [0034]

    $$\gamma_{\mathrm{Final}}(s,k) = a_{s,k}\,\gamma_{\mathrm{ML}}(s,k) + b_{s,k}\,\gamma_{\mathrm{DT}}(s,k)$$

  • where $a_{s,k}$ and $b_{s,k}$ are weighting coefficients, the exact values of which depend on the amount of training data and may be different for different mixture components. [0035]
  • In one specific such embodiment, the model parameters that are interpolated are the Gaussian mixture mean vector and standard deviation vector. For each model state s and mixture component k, an iterative process is used to determine the final trained value of the mean and standard deviation vector. First, maximum likelihood training is used to initialize the mean and standard deviation vector: $\mu_{\mathrm{ML}}(s,k)$ and $\sigma_{\mathrm{ML}}(s,k)$. Then an iterative loop is entered in which discriminative training is applied to determine $\mu_{\mathrm{DT},i}(s,k)$ and $\sigma_{\mathrm{DT},i}(s,k)$ for iteration i, and the discriminatively trained parameters are interpolated with the smoothed parameters from the previous iteration i−1 to determine smoothed model parameters for iteration i: [0036]

    $$\mu_{\mathrm{Smooth},i}(s,k) = a_{s,k,i}\,\mu_{\mathrm{Smooth},i-1}(s,k) + b_{s,k,i}\,\mu_{\mathrm{DT},i}(s,k), \quad i = 1,\ldots,M$$

    $$\sigma_{\mathrm{Smooth},i}(s,k) = a_{s,k,i}\,\sigma_{\mathrm{Smooth},i-1}(s,k) + b_{s,k,i}\,\sigma_{\mathrm{DT},i}(s,k), \quad i = 1,\ldots,M$$

    where $\mu_{\mathrm{Smooth},0}(s,k) = \mu_{\mathrm{ML}}(s,k)$, $\sigma_{\mathrm{Smooth},0}(s,k) = \sigma_{\mathrm{ML}}(s,k)$, and

    $$a_{s,k,i} = \frac{\mathrm{FrameCount}_{\mathrm{ML}}(s,k)}{\mathrm{FrameCount}_{\mathrm{ML}}(s,k) + \mathrm{FrameCount}_{\mathrm{DT},i}(s,k)}, \qquad b_{s,k,i} = 1 - a_{s,k,i}.$$

  • $\mathrm{FrameCount}_{\mathrm{ML}}(s,k)$ is the frame count for mixture component k of state s in maximum likelihood training, and $\mathrm{FrameCount}_{\mathrm{DT},i}(s,k)$ is the corresponding frame count in iteration i of discriminative training. This iterative training loop continues until some stopping criterion is met and a final trained value of the mean $\mu_{\mathrm{Final}}(s,k)$ and standard deviation $\sigma_{\mathrm{Final}}(s,k)$ is established. [0037]
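  • The interpolation step of this loop might be sketched as follows, with the per-(s, k) frame-count weights broadcast over the vector dimension l (array shapes are illustrative):

```python
import numpy as np

def smooth_iteration(mu_prev, sigma_prev, mu_dt, sigma_dt, fc_ml, fc_dt):
    """One smoothing iteration: interpolate discriminatively trained parameters
    with the previous smoothed ones, weighted by frame counts.

    mu_prev, sigma_prev : (S, K, L) smoothed parameters from iteration i-1
    mu_dt, sigma_dt     : (S, K, L) discriminatively trained parameters
    fc_ml, fc_dt        : (S, K) frame counts from ML / current DT iteration
    """
    a = fc_ml / (fc_ml + fc_dt)   # a_{s,k,i}: more ML data => stay nearer ML
    b = 1.0 - a                   # b_{s,k,i}
    mu = a[..., None] * mu_prev + b[..., None] * mu_dt
    sigma = a[..., None] * sigma_prev + b[..., None] * sigma_dt
    return mu, sigma
```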
  • Another embodiment reuses segmentation and recognition results from a first set of recognition models for discriminative training of a second set of recognition models. For each training utterance, the segmentation and recognition results include: [0038]
  • A reference state sequence obtained by performing forced alignment of the training utterance with the reference text, and [0039]
  • A set of N hypothesis state sequences corresponding to the top N hypothesized word sequences, or a lattice representing the recognition results. [0040]
  • In one embodiment, the top N hypothesis state sequences are used. In other embodiments, the lattice can be used. Each arc of the lattice contains the identification of the word associated with the arc, the timing information, and a list of state sequences. The top N hypothesis state sequences or the lattice can be obtained by performing recognition of the training utterance. [0041]
  • States in the reference state sequence of the first model are mapped to corresponding states in a second set of recognition models so as to form a reference state sequence of the second model. States in a set of N hypothesis state sequences of the first model are mapped to corresponding states in the second set of recognition models so as to form a set of N hypothesis state sequences of the second model. Selected model states in the second set of recognition models are then discriminatively trained using the mapped results. [0042]
  • The mapping of the state sequences is performed in the following way: [0043]
  • 1) States in the state sequences corresponding to the first set of models are first mapped to phonemes based on the decision tree of the first set of models. [0044]
  • 2) Phoneme sequences obtained from Step 1) are then mapped to state sequences corresponding to the second set of models based on the decision tree of the second set of models. [0045]
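  • A minimal sketch of this two-step mapping, with plain dictionaries standing in for the decision-tree lookups of the two model sets:

```python
def map_state_sequence(states, first_state_to_phoneme, second_phoneme_to_state):
    """Map a state sequence of the first model set to one of the second set.

    first_state_to_phoneme  : dict derived from the first set's decision tree
    second_phoneme_to_state : dict derived from the second set's decision tree
    """
    phonemes = [first_state_to_phoneme[s] for s in states]      # Step 1
    return [second_phoneme_to_state[p] for p in phonemes]       # Step 2
```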
  • In one specific such embodiment, the segmentation and recognition results of detailed match models are mapped and then used to discriminatively train fast match models. Fast match acoustic models are commonly used to quickly prune the recognition search space. One extended discussion of this subject is provided by P. S. Gopalakrishnan and L. R. Bahl, Fast Match Techniques, pp. 413-428 in “Automatic Speech and Speaker Recognition: Advanced Topics,” Chin-Hui Lee et al., 1996, the contents of which are incorporated herein by reference. In many speech recognition systems, separate models are used for performing fast match. [0046]
  • Segmentation and recognition results for the detailed match models are collected by running segmentation and recognition on the training data. The segmentation and recognition results of the detailed match models are mapped to results of the fast match models using the two-step method described above. Then, the fast match models are discriminatively trained using the mapped segmentation and recognition results. [0047]
  • FIG. 1 shows this concept. Initially, segmentation and recognition are performed on the training data using the detailed match models. For a given input utterance, this results in a detailed match model reference state sequence 101 and a set of detailed match model hypothesis state sequences. For illustration purposes, only one hypothesis state sequence (denoted by 102) is shown in FIG. 1. Based on the segmentation and recognition results of the detailed match models (as in 101 and 102), discriminative training may be performed on the mixture models of the detailed match states. Then, rather than regenerating fast match model reference and hypothesis state sequences from another iteration of segmentation and recognition, an embodiment of the present invention maps: [0048]
  • (1) the identities of the detailed match model reference states in 101 to corresponding fast match model reference states in 103, and [0049]
  • (2) the identities of the detailed match model hypothesis states in 102 to corresponding fast match model hypothesis states in 104. [0050]
  • Then, discriminative training may be performed on the fast match models using the mapped states (as in 103 and 104). [0051]
  • As explained above, this approach avoids the computationally intensive process of regenerating segmentation and recognition results for different models. In one specific embodiment, the discriminative training time of the fast match models was reduced from ten days to one day. In addition, experimental results showed that performing discriminative training of the fast match models produced significant improvement in recognition speed (10-15%) with no decrease in recognition accuracy. [0052]
  • Another embodiment of the present invention improves the generalization of MCE-based discriminative training techniques by limiting or clipping the gradients of the standard deviation of mixture components based on the statistics of these adjustments. The gradient refers to the modification of each of the model standard deviations: [0053]

    $$\Delta\sigma(s,k,l) = \frac{1}{N}\sum_{n} -\frac{\partial \varepsilon_n}{\partial \sigma(s,k,l)}.$$

  • By limiting or clipping the gradient, we mean that if a calculated gradient for the standard deviation is greater or less than some threshold distance from the average of the gradients, then the corresponding maximum or minimum gradient is used instead of the actual calculated gradient. FIG. 2 shows this idea, where a gradient distribution (of all mixture components) curve is centered at some mean value. Any gradient for the standard deviation greater than some high-clip threshold or less than some low-clip threshold will be set to the corresponding high-clip or low-clip threshold instead of the actual calculated gradient, i.e., [0054]

    $$\Delta\sigma(s,k,l)_{\mathrm{clipped}} = \begin{cases} \Delta\sigma(s,k,l)_{\mathrm{calculated}}, & \mathrm{Mean}\{\Delta\sigma\} - \mathrm{Thresh}_{\mathrm{low\text{-}clip}} < \Delta\sigma(s,k,l)_{\mathrm{calculated}} < \mathrm{Mean}\{\Delta\sigma\} + \mathrm{Thresh}_{\mathrm{high\text{-}clip}} \\ \mathrm{Mean}\{\Delta\sigma\} + \mathrm{Thresh}_{\mathrm{high\text{-}clip}}, & \Delta\sigma(s,k,l)_{\mathrm{calculated}} > \mathrm{Mean}\{\Delta\sigma\} + \mathrm{Thresh}_{\mathrm{high\text{-}clip}} \\ \mathrm{Mean}\{\Delta\sigma\} - \mathrm{Thresh}_{\mathrm{low\text{-}clip}}, & \Delta\sigma(s,k,l)_{\mathrm{calculated}} < \mathrm{Mean}\{\Delta\sigma\} - \mathrm{Thresh}_{\mathrm{low\text{-}clip}} \end{cases}$$

  • where $\mathrm{Mean}\{\Delta\sigma\}$ is the mean of Δσ(s,k,l) over all s, k, l, and typically $\mathrm{Thresh}_{\mathrm{high\text{-}clip}} = \mathrm{Thresh}_{\mathrm{low\text{-}clip}} = a \times \mathrm{Std}\{\Delta\sigma\}$, where $\mathrm{Std}\{\Delta\sigma\}$ is the standard deviation of Δσ(s,k,l) over all s, k, l, and a is a constant, typically in the range [2, 3]. [0055]
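  • As a sketch, the clipping reduces to a few NumPy operations over the stacked gradients of all mixture components:

```python
import numpy as np

def clip_sigma_gradients(delta_sigma, a=2.5):
    """Clip standard-deviation gradients to Mean +/- a * Std, where the mean
    and standard deviation are taken over all (s, k, l); a typically in [2, 3].

    delta_sigma : (S, K, L) array of calculated gradients Delta_sigma(s, k, l)
    """
    mean = delta_sigma.mean()
    thresh = a * delta_sigma.std()  # Thresh_high-clip = Thresh_low-clip
    return np.clip(delta_sigma, mean - thresh, mean + thresh)
```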
  • An embodiment of the present invention also avoids the tedious work of text normalization by determining the “correctness” of recognition hypotheses using the pronunciation of words in the reference and hypothesis texts. Traditionally, the word label is used to mark “correctness”. However, in acoustic model training data, the same word in the reference text may appear in a different form in the recognition vocabulary. For example, the word “newborn” may appear as “newborn” in the reference text while appearing as “new-born” in the recognition vocabulary. If the word label is used to determine whether a hypothesized word is correct, then a word recognized as “new-born” will be marked “incorrect” if the corresponding word in the reference text is “newborn”, which is not a correct decision. [0056]
  • To make the form of words in the reference text and the recognition vocabulary match, tedious manual or semi-manual text normalization is typically needed. This problem becomes more severe when training texts are collected from different sources and transcribed using different philosophies. By using the pronunciation of words to determine whether a hypothesized word is “correct”, the text normalization procedure is completely avoided. [0057]
  • Specifically, a hypothesized word is marked as “correct” if its pronunciation is the same as the pronunciation of the corresponding word in the reference text and is marked as “incorrect” if its pronunciation is not the same as the pronunciation of the corresponding word in the reference text. Only the “incorrect” words are used for discriminative training. The correspondence between the hypothesized word and the reference word is determined based on the amount of time overlap of the two words. [0058]
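  • A sketch of this marking procedure, assuming each word hypothesis carries its time boundaries and a pronunciation lookup is available (all names illustrative):

```python
def mark_incorrect(hyp_words, ref_words, pron):
    """Return the hypothesized words whose pronunciation differs from that of
    the time-overlapping reference word; only these feed discriminative training.

    hyp_words, ref_words : lists of (word, start_time, end_time) tuples
    pron                 : dict word -> pronunciation (e.g., phoneme string)
    """
    def overlap(a, b):
        return max(0.0, min(a[2], b[2]) - max(a[1], b[1]))

    incorrect = []
    for h in hyp_words:
        # Corresponding reference word: the one with the largest time overlap.
        r = max(ref_words, key=lambda w: overlap(h, w))
        if pron.get(h[0]) != pron.get(r[0]):
            incorrect.append(h)
    return incorrect
```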
  • Another advantage of using the pronunciation of words to determine the “correctness” of hypothesized words is that it makes discriminative training more focused on correcting errors caused by the acoustic model. If the word label is used to mark “correctness”, then a hypothesized word (e.g., “to”) that has the same pronunciation as the corresponding word (e.g., “two”) in the reference text, but a different word label, will be marked as incorrect. However, from an acoustic point of view, these words are recognized “correctly”; they are errors caused by the language model. If these words are used in discriminative training, they will bias the data statistics used to compute the gradients, thereby making the training less effective in correcting errors truly caused by the acoustic model. Using the pronunciation of words to mark “correctness” eliminates this bias. [0059]
  • Although various exemplary embodiments of the invention have been disclosed, it should be apparent to those skilled in the art that various changes and modifications can be made which will achieve some of the advantages of the invention without departing from the true scope of the invention. [0060]

Claims (5)

What is claimed is:
1. A method of a continuous speech recognition system for discriminatively training hidden Markov models, the method comprising:
performing segmentation and recognition of speech training data using a first set of recognition models so as to form a first model reference state sequence, and a set of first model hypothesis state sequences;
mapping states in the first model reference state sequence to corresponding states in a second set of recognition models so as to form a second model reference state sequence;
mapping states in the set of first model hypothesis sequences to corresponding states in the second set of recognition models so as to form a set of second model hypothesis sequences; and
discriminatively training selected model states in the second set of recognition models using the mapped state sequences.
2. A method according to claim 1, wherein the hypothesis state sequences are represented by a lattice structure.
3. A method according to claim 1, wherein the first set of recognition models are detailed match models, and the second set of recognition models are fast match models.
4. A method of a continuous speech recognition system for discriminatively training hidden Markov models, the method comprising:
for a mixture component of a hidden Markov model state, calculating a gradient adjustment of the standard deviation of the mixture component, and
i. if the calculated gradient adjustment is greater than a first threshold amount, performing an adjustment of the standard deviation of the mixture component using the first threshold, or
ii. if the calculated gradient adjustment is less than a second threshold amount, performing an adjustment of the standard deviation of the mixture component using the second threshold, or else
iii. performing an adjustment of the standard deviation of the mixture component using the calculated gradient adjustment.
5. A method of a continuous speech recognition system for discriminatively training hidden Markov models, the method comprising:
determining correctness of a hypothesized word using pronunciation of the hypothesized word and a corresponding word in a reference text.
US10/719,682 2002-11-21 2003-11-21 Discriminative training of hidden Markov models for continuous speech recognition Abandoned US20040267530A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US10/719,682 US20040267530A1 (en) 2002-11-21 2003-11-21 Discriminative training of hidden Markov models for continuous speech recognition
US12/241,811 US7672847B2 (en) 2002-11-21 2008-09-30 Discriminative training of hidden Markov models for continuous speech recognition

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US42819402P 2002-11-21 2002-11-21
US44619803P 2003-02-10 2003-02-10
US10/719,682 US20040267530A1 (en) 2002-11-21 2003-11-21 Discriminative training of hidden Markov models for continuous speech recognition

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US12/241,811 Division US7672847B2 (en) 2002-11-21 2008-09-30 Discriminative training of hidden Markov models for continuous speech recognition

Publications (1)

Publication Number Publication Date
US20040267530A1 (en) 2004-12-30

Family

ID=32397103

Family Applications (2)

Application Number Title Priority Date Filing Date
US10/719,682 Abandoned US20040267530A1 (en) 2002-11-21 2003-11-21 Discriminative training of hidden Markov models for continuous speech recognition
US12/241,811 Expired - Fee Related US7672847B2 (en) 2002-11-21 2008-09-30 Discriminative training of hidden Markov models for continuous speech recognition

Family Applications After (1)

Application Number Title Priority Date Filing Date
US12/241,811 Expired - Fee Related US7672847B2 (en) 2002-11-21 2008-09-30 Discriminative training of hidden Markov models for continuous speech recognition

Country Status (2)

Country Link
US (2) US20040267530A1 (en)
WO (1) WO2004049305A2 (en)

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050256817A1 (en) * 2004-05-12 2005-11-17 Wren Christopher R Determining temporal patterns in sensed data sequences by hierarchical decomposition of hidden Markov models
US20070083373A1 (en) * 2005-10-11 2007-04-12 Matsushita Electric Industrial Co., Ltd. Discriminative training of HMM models using maximum margin estimation for speech recognition
US20080004876A1 (en) * 2006-06-30 2008-01-03 Chuang He Non-enrolled continuous dictation
US20080046245A1 (en) * 2006-08-21 2008-02-21 Microsoft Corporation Using a discretized, higher order representation of hidden dynamic variables for speech recognition
US20080091424A1 (en) * 2006-10-16 2008-04-17 Microsoft Corporation Minimum classification error training with growth transformation optimization
US20080114593A1 (en) * 2006-11-15 2008-05-15 Microsoft Corporation Noise suppressor for speech recognition
US20080114596A1 (en) * 2006-11-15 2008-05-15 Microsoft Corporation Discriminative training for speech recognition
US20080201139A1 (en) * 2007-02-20 2008-08-21 Microsoft Corporation Generic framework for large-margin MCE training in speech recognition
US20080243503A1 (en) * 2007-03-30 2008-10-02 Microsoft Corporation Minimum divergence based discriminative training for pattern recognition
US20090024390A1 (en) * 2007-05-04 2009-01-22 Nuance Communications, Inc. Multi-Class Constrained Maximum Likelihood Linear Regression
US20090123070A1 (en) * 2007-11-14 2009-05-14 Itt Manufacturing Enterprises Inc. Segmentation-based image processing system
US20090138265A1 (en) * 2007-11-26 2009-05-28 Nuance Communications, Inc. Joint Discriminative Training of Multiple Speech Recognizers
US20110208521A1 (en) * 2008-08-14 2011-08-25 21Ct, Inc. Hidden Markov Model for Speech Processing with Training Method
US8239332B2 (en) 2007-11-20 2012-08-07 Microsoft Corporation Constrained line search optimization for discriminative training of HMMS
WO2016167779A1 (en) * 2015-04-16 2016-10-20 Mitsubishi Electric Corporation Speech recognition device and rescoring device
CN110634474A (en) * 2019-09-24 2019-12-31 腾讯科技(深圳)有限公司 Speech recognition method and device based on artificial intelligence

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8515758B2 (en) 2010-04-14 2013-08-20 Microsoft Corporation Speech recognition including removal of irrelevant information
US9634855B2 (en) 2010-05-13 2017-04-25 Alexander Poltorak Electronic personal interactive device that determines topics of interest using a conversational agent
US8935170B2 (en) * 2012-11-27 2015-01-13 Longsand Limited Speech recognition
US9953646B2 (en) 2014-09-02 2018-04-24 Belleau Technologies Method and system for dynamic speech recognition and tracking of prewritten script

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1116219B1 (en) * 1999-07-01 2005-03-16 Koninklijke Philips Electronics N.V. Robust speech processing from noisy speech models

Cited By (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7542949B2 (en) * 2004-05-12 2009-06-02 Mitsubishi Electric Research Laboratories, Inc. Determining temporal patterns in sensed data sequences by hierarchical decomposition of hidden Markov models
US20050256817A1 (en) * 2004-05-12 2005-11-17 Wren Christopher R Determining temporal patterns in sensed data sequences by hierarchical decomposition of hidden Markov models
US20070083373A1 (en) * 2005-10-11 2007-04-12 Matsushita Electric Industrial Co., Ltd. Discriminative training of HMM models using maximum margin estimation for speech recognition
US20080004876A1 (en) * 2006-06-30 2008-01-03 Chuang He Non-enrolled continuous dictation
WO2008005711A2 (en) * 2006-06-30 2008-01-10 Nuance Communications, Inc. Non-enrolled continuous dictation
WO2008005711A3 (en) * 2006-06-30 2008-09-25 Nuance Communications Inc Non-enrolled continuous dictation
US20080046245A1 (en) * 2006-08-21 2008-02-21 Microsoft Corporation Using a discretized, higher order representation of hidden dynamic variables for speech recognition
US7680663B2 (en) 2006-08-21 2010-03-16 Microsoft Corporation Using a discretized, higher order representation of hidden dynamic variables for speech recognition
US20080091424A1 (en) * 2006-10-16 2008-04-17 Microsoft Corporation Minimum classification error training with growth transformation optimization
US8301449B2 (en) * 2006-10-16 2012-10-30 Microsoft Corporation Minimum classification error training with growth transformation optimization
US20080114596A1 (en) * 2006-11-15 2008-05-15 Microsoft Corporation Discriminative training for speech recognition
US7885812B2 (en) * 2006-11-15 2011-02-08 Microsoft Corporation Joint training of feature extraction and acoustic model parameters for speech recognition
US8615393B2 (en) * 2006-11-15 2013-12-24 Microsoft Corporation Noise suppressor for speech recognition
US20080114593A1 (en) * 2006-11-15 2008-05-15 Microsoft Corporation Noise suppressor for speech recognition
US20080201139A1 (en) * 2007-02-20 2008-08-21 Microsoft Corporation Generic framework for large-margin MCE training in speech recognition
US8423364B2 (en) 2007-02-20 2013-04-16 Microsoft Corporation Generic framework for large-margin MCE training in speech recognition
US20080243503A1 (en) * 2007-03-30 2008-10-02 Microsoft Corporation Minimum divergence based discriminative training for pattern recognition
US8386254B2 (en) 2007-05-04 2013-02-26 Nuance Communications, Inc. Multi-class constrained maximum likelihood linear regression
US20090024390A1 (en) * 2007-05-04 2009-01-22 Nuance Communications, Inc. Multi-Class Constrained Maximum Likelihood Linear Regression
US20090123070A1 (en) * 2007-11-14 2009-05-14 Itt Manufacturing Enterprises Inc. Segmentation-based image processing system
US8260048B2 (en) * 2007-11-14 2012-09-04 Exelis Inc. Segmentation-based image processing system
US8239332B2 (en) 2007-11-20 2012-08-07 Microsoft Corporation Constrained line search optimization for discriminative training of HMMs
US20090138265A1 (en) * 2007-11-26 2009-05-28 Nuance Communications, Inc. Joint Discriminative Training of Multiple Speech Recognizers
US8843370B2 (en) * 2007-11-26 2014-09-23 Nuance Communications, Inc. Joint discriminative training of multiple speech recognizers
US20110208521A1 (en) * 2008-08-14 2011-08-25 21Ct, Inc. Hidden Markov Model for Speech Processing with Training Method
US9020816B2 (en) 2008-08-14 2015-04-28 21Ct, Inc. Hidden Markov model for speech processing with training method
WO2016167779A1 (en) * 2015-04-16 2016-10-20 Mitsubishi Electric Corporation Speech recognition device and rescoring device
CN110634474A (en) * 2019-09-24 2019-12-31 腾讯科技(深圳)有限公司 Speech recognition method and device based on artificial intelligence

Also Published As

Publication number Publication date
US20090055182A1 (en) 2009-02-26
US7672847B2 (en) 2010-03-02
WO2004049305A2 (en) 2004-06-10
WO2004049305A3 (en) 2004-08-19

Similar Documents

Publication Publication Date Title
US7672847B2 (en) Discriminative training of hidden Markov models for continuous speech recognition
JP3053711B2 (en) Speech recognition apparatus and training method and apparatus therefor
US6490555B1 (en) Discriminatively trained mixture models in continuous speech recognition
US6260013B1 (en) Speech recognition system employing discriminatively trained models
Sha et al. Large margin Gaussian mixture modeling for phonetic classification and recognition
US6493667B1 (en) Enhanced likelihood computation using regression in a speech recognition system
EP0635820B1 (en) Minimum error rate training of combined string models
US5684925A (en) Speech representation by feature-based word prototypes comprising phoneme targets having reliable high similarity
US7617103B2 (en) Incrementally regulated discriminative margins in MCE training for speech recognition
US7366669B2 (en) Acoustic model creation method as well as acoustic model creation apparatus and speech recognition apparatus
US7689419B2 (en) Updating hidden conditional random field model parameters after processing individual training samples
EP2888669B1 (en) Method and system for selectively biased linear discriminant analysis in automatic speech recognition systems
US7587321B2 (en) Method, apparatus, and system for building context dependent models for a large vocabulary continuous speech recognition (LVCSR) system
WO1998040876A9 (en) Speech recognition system employing discriminatively trained models
US7885812B2 (en) Joint training of feature extraction and acoustic model parameters for speech recognition
US5615299A (en) Speech recognition using dynamic features
Li et al. Large margin HMMs for speech recognition
US8762148B2 (en) Reference pattern adaptation apparatus, reference pattern adaptation method and reference pattern adaptation program
US6401064B1 (en) Automatic speech recognition using segmented curves of individual speech components having arc lengths generated along space-time trajectories
US5825977A (en) Word hypothesizer based on reliably detected phoneme similarity regions
US8078462B2 (en) Apparatus for creating speaker model, and computer program product
McDermott et al. Prototype-based discriminative training for various speech units
JP3525082B2 (en) Statistical model creation method
JPH0486899A (en) Standard pattern adaption system
Furui Generalization problem in ASR acoustic model training and adaptation

Legal Events

Date Code Title Description
AS Assignment

Owner name: SCANSOFT, INC., MASSACHUSETTS

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:HE, CHUANG;WU, JIANXIONG;SEJNOHA, VLAD;REEL/FRAME:015015/0353

Effective date: 20040211

AS Assignment

Owner name: NUANCE COMMUNICATIONS, INC., MASSACHUSETTS

Free format text: MERGER AND CHANGE OF NAME TO NUANCE COMMUNICATIONS, INC.;ASSIGNOR:SCANSOFT, INC.;REEL/FRAME:016914/0975

Effective date: 20051017

AS Assignment

Owner name: USB AG, STAMFORD BRANCH, CONNECTICUT

Free format text: SECURITY AGREEMENT;ASSIGNOR:NUANCE COMMUNICATIONS, INC.;REEL/FRAME:017435/0199

Effective date: 20060331

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION

AS Assignment

Owner name: NUANCE COMMUNICATIONS, INC., AS GRANTOR, MASSACHUSETTS

Free format text: PATENT RELEASE (REEL:018160/FRAME:0909);ASSIGNOR:MORGAN STANLEY SENIOR FUNDING, INC., AS ADMINISTRATIVE AGENT;REEL/FRAME:038770/0869

Effective date: 20160520

Owner name: NUANCE COMMUNICATIONS, INC., AS GRANTOR, MASSACHUSETTS

Free format text: PATENT RELEASE (REEL:017435/FRAME:0199);ASSIGNOR:MORGAN STANLEY SENIOR FUNDING, INC., AS ADMINISTRATIVE AGENT;REEL/FRAME:038770/0824

Effective date: 20160520

Owner name: DICTAPHONE CORPORATION, A DELAWARE CORPORATION, AS GRANTOR

Free format text: PATENT RELEASE (REEL:017435/FRAME:0199);ASSIGNOR:MORGAN STANLEY SENIOR FUNDING, INC., AS ADMINISTRATIVE AGENT;REEL/FRAME:038770/0824

Effective date: 20160520

Owner name: INSTITIT KATALIZA IMENI G.K. BORESKOVA SIBIRSKOGO OTDELENIA ROSSIISKOI AKADEMII NAUK, AS GRANTOR

Free format text: PATENT RELEASE (REEL:018160/FRAME:0909);ASSIGNOR:MORGAN STANLEY SENIOR FUNDING, INC., AS ADMINISTRATIVE AGENT;REEL/FRAME:038770/0869

Effective date: 20160520

Owner name: NOKIA CORPORATION, AS GRANTOR, FINLAND

Free format text: PATENT RELEASE (REEL:018160/FRAME:0909);ASSIGNOR:MORGAN STANLEY SENIOR FUNDING, INC., AS ADMINISTRATIVE AGENT;REEL/FRAME:038770/0869

Effective date: 20160520

Owner name: SPEECHWORKS INTERNATIONAL, INC., A DELAWARE CORPORATION, AS GRANTOR

Free format text: PATENT RELEASE (REEL:018160/FRAME:0909);ASSIGNOR:MORGAN STANLEY SENIOR FUNDING, INC., AS ADMINISTRATIVE AGENT;REEL/FRAME:038770/0869

Effective date: 20160520

Owner name: NORTHROP GRUMMAN CORPORATION, A DELAWARE CORPORATION, AS GRANTOR

Free format text: PATENT RELEASE (REEL:018160/FRAME:0909);ASSIGNOR:MORGAN STANLEY SENIOR FUNDING, INC., AS ADMINISTRATIVE AGENT;REEL/FRAME:038770/0869

Effective date: 20160520

Owner name: DSP, INC., D/B/A DIAMOND EQUIPMENT, A MAINE CORPORATION, AS GRANTOR

Free format text: PATENT RELEASE (REEL:018160/FRAME:0909);ASSIGNOR:MORGAN STANLEY SENIOR FUNDING, INC., AS ADMINISTRATIVE AGENT;REEL/FRAME:038770/0869

Effective date: 20160520

Owner name: ART ADVANCED RECOGNITION TECHNOLOGIES, INC., A DELAWARE CORPORATION, AS GRANTOR

Free format text: PATENT RELEASE (REEL:017435/FRAME:0199);ASSIGNOR:MORGAN STANLEY SENIOR FUNDING, INC., AS ADMINISTRATIVE AGENT;REEL/FRAME:038770/0824

Effective date: 20160520

Owner name: STRYKER LEIBINGER GMBH & CO., KG, AS GRANTOR, GERMANY

Free format text: PATENT RELEASE (REEL:018160/FRAME:0909);ASSIGNOR:MORGAN STANLEY SENIOR FUNDING, INC., AS ADMINISTRATIVE AGENT;REEL/FRAME:038770/0869

Effective date: 20160520

Owner name: DICTAPHONE CORPORATION, A DELAWARE CORPORATION, AS GRANTOR

Free format text: PATENT RELEASE (REEL:018160/FRAME:0909);ASSIGNOR:MORGAN STANLEY SENIOR FUNDING, INC., AS ADMINISTRATIVE AGENT;REEL/FRAME:038770/0869

Effective date: 20160520

Owner name: TELELOGUE, INC., A DELAWARE CORPORATION, AS GRANTOR

Free format text: PATENT RELEASE (REEL:018160/FRAME:0909);ASSIGNOR:MORGAN STANLEY SENIOR FUNDING, INC., AS ADMINISTRATIVE AGENT;REEL/FRAME:038770/0869

Effective date: 20160520

Owner name: TELELOGUE, INC., A DELAWARE CORPORATION, AS GRANTOR

Free format text: PATENT RELEASE (REEL:017435/FRAME:0199);ASSIGNOR:MORGAN STANLEY SENIOR FUNDING, INC., AS ADMINISTRATIVE AGENT;REEL/FRAME:038770/0824

Effective date: 20160520

Owner name: SPEECHWORKS INTERNATIONAL, INC., A DELAWARE CORPORATION, AS GRANTOR

Free format text: PATENT RELEASE (REEL:017435/FRAME:0199);ASSIGNOR:MORGAN STANLEY SENIOR FUNDING, INC., AS ADMINISTRATIVE AGENT;REEL/FRAME:038770/0824

Effective date: 20160520

Owner name: SCANSOFT, INC., A DELAWARE CORPORATION, AS GRANTOR

Free format text: PATENT RELEASE (REEL:018160/FRAME:0909);ASSIGNOR:MORGAN STANLEY SENIOR FUNDING, INC., AS ADMINISTRATIVE AGENT;REEL/FRAME:038770/0869

Effective date: 20160520

Owner name: ART ADVANCED RECOGNITION TECHNOLOGIES, INC., A DELAWARE CORPORATION, AS GRANTOR

Free format text: PATENT RELEASE (REEL:018160/FRAME:0909);ASSIGNOR:MORGAN STANLEY SENIOR FUNDING, INC., AS ADMINISTRATIVE AGENT;REEL/FRAME:038770/0869

Effective date: 20160520

Owner name: DSP, INC., D/B/A DIAMOND EQUIPMENT, A MAINE CORPORATION, AS GRANTOR

Free format text: PATENT RELEASE (REEL:017435/FRAME:0199);ASSIGNOR:MORGAN STANLEY SENIOR FUNDING, INC., AS ADMINISTRATIVE AGENT;REEL/FRAME:038770/0824

Effective date: 20160520

Owner name: MITSUBISHI DENKI KABUSHIKI KAISHA, AS GRANTOR, JAPAN

Free format text: PATENT RELEASE (REEL:018160/FRAME:0909);ASSIGNOR:MORGAN STANLEY SENIOR FUNDING, INC., AS ADMINISTRATIVE AGENT;REEL/FRAME:038770/0869

Effective date: 20160520

Owner name: HUMAN CAPITAL RESOURCES, INC., A DELAWARE CORPORATION, AS GRANTOR

Free format text: PATENT RELEASE (REEL:018160/FRAME:0909);ASSIGNOR:MORGAN STANLEY SENIOR FUNDING, INC., AS ADMINISTRATIVE AGENT;REEL/FRAME:038770/0869

Effective date: 20160520

Owner name: SCANSOFT, INC., A DELAWARE CORPORATION, AS GRANTOR

Free format text: PATENT RELEASE (REEL:017435/FRAME:0199);ASSIGNOR:MORGAN STANLEY SENIOR FUNDING, INC., AS ADMINISTRATIVE AGENT;REEL/FRAME:038770/0824

Effective date: 20160520