CN1217315C - Hidden Markov model with frame correlation - Google Patents
- Publication number
- CN1217315C (application CN01823553.0A)
- Authority
- CN
- China
- Prior art keywords
- frame
- expectation
- maximization
- model
- calibration
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
- G10L15/142—Hidden Markov Models [HMMs]
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/12—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being prediction coefficients
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Probability & Statistics with Applications (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Complex Calculations (AREA)
- Image Analysis (AREA)
Abstract
The present invention discloses a method and a system for incorporating frame correlation in a hidden Markov model. The method comprises: computing the frame-independent portion of speech; inputting the frame-independent portion to a Gaussian model; computing the frame probability; and then estimating autoregressive coefficients from the frame probability.
Description
Technical field
The present invention relates to hidden Markov models, and in particular to incorporating frame correlation in a hidden Markov model.
A Markov process is a probabilistic model that is useful in analyzing complex systems. The process comprises states and state transitions. A state consists of the values of the variables that describe the current condition of a system; a state transition occurs when the state changes. A probabilistic model of a Markov process specifies only the current state and the probability of each possible transition out of it. Consequently, each state transition of the process, and the probability of its future trajectory, depends only on the current state. A hidden Markov model can be described as a partially observed stochastic (Markov) system model, in which some of the state information is hidden from view.
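The generative view described above can be illustrated with a minimal sketch. It assumes a discrete toy HMM; the function name `sample_hmm`, the state names, and all probabilities are hypothetical and chosen only for illustration.

```python
import random


def sample_hmm(initial, transitions, emissions, length, seed=0):
    """Draw a hidden state path and the observations it emits.

    initial, transitions[state] and emissions[state] are dicts mapping an
    outcome to its probability (hypothetical toy parameters).
    """
    rng = random.Random(seed)

    def draw(dist):
        outcomes = list(dist)
        weights = [dist[o] for o in outcomes]
        return rng.choices(outcomes, weights=weights, k=1)[0]

    states, observations = [], []
    state = draw(initial)
    for _ in range(length):
        states.append(state)                         # hidden from the observer
        observations.append(draw(emissions[state]))  # what is actually seen
        state = draw(transitions[state])             # depends only on the current state
    return states, observations


if __name__ == "__main__":
    initial = {"rain": 0.5, "sun": 0.5}
    transitions = {"rain": {"rain": 0.7, "sun": 0.3},
                   "sun": {"rain": 0.2, "sun": 0.8}}
    emissions = {"rain": {"umbrella": 0.9, "none": 0.1},
                 "sun": {"umbrella": 0.1, "none": 0.9}}
    print(sample_hmm(initial, transitions, emissions, length=5))
```

Only the observation list would be available to a recognizer; the state list is the hidden part of the model.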
The development of the hidden Markov model (HMM) has led to substantial progress in speech recognition technology. This progress is more pronounced in large-vocabulary continuous speech recognition (LVCSR) than in other areas of speech recognition. However, several assumptions made in the HMM are still regarded as obstacles to the potential effectiveness of the model. One problematic assumption is that successive observations are independent and identically distributed within a state. The mechanism of speech production, however, indicates that the observations are inherently dependent and correlated. In addition, under the maximum likelihood (ML) criterion, an HMM-based system depends on how well the model captures the nature of real speech.
Description of drawings
Fig. 1 is a flowchart of an HMM-based frame correlation process according to an embodiment of the invention.
Fig. 2 is a block diagram of an HMM-based frame correlation system according to an embodiment of the invention.
Embodiment
Recognizing the above difficulties, which arise from the unrealistic assumptions made when hidden Markov models (HMMs) are used for speech recognition, the present invention describes a method and system for frame correlation in an HMM-based environment. For purposes of illustration rather than limitation, the embodiments of the invention are described in a manner consistent with that use; the invention is, however, not limited to it.
In the statistical approach to automatic speech recognition, the optimal mathematical solution requires the recognizer to follow the maximum a posteriori (MAP) decision rule. The MAP decision rule can be expressed as:
$$\hat{W} = \arg\max_{W} p(W \mid O) = \arg\max_{W} p(O \mid W)\, p(W)$$
where $W$ is a word-string hypothesis for the given acoustic observation $O$, $p(O \mid W)$ is the acoustic model, and $p(W)$ is the N-gram language model. When the acoustic model score $p(O \mid W)$ is derived, a hidden state sequence $q_1^T$ is usually introduced, giving:
$$p(O \mid W) = \sum_{q_1^T} p(q_1^T \mid W)\, p(o_1^T \mid q_1^T, W)$$
It is thus assumed that the hidden process can fully account for the conditional probability of the speech signal.
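The MAP decision rule can be sketched in the log domain as follows. The hypothesis strings and scores are invented for illustration, and `map_decode` is a hypothetical helper name; a real recognizer searches a far larger hypothesis space.

```python
import math


def map_decode(acoustic_logprob, lm_logprob):
    """Return the hypothesis W that maximizes log p(O|W) + log p(W)."""
    best_w, best_score = None, -math.inf
    for w, acoustic in acoustic_logprob.items():
        score = acoustic + lm_logprob[w]  # log of p(O|W) * p(W)
        if score > best_score:
            best_w, best_score = w, score
    return best_w, best_score


if __name__ == "__main__":
    # Hypothetical scores: the second hypothesis fits the audio slightly
    # better, but the language model prefers the first.
    acoustic = {"recognize speech": -12.0, "wreck a nice beach": -11.5}
    language = {"recognize speech": -3.0, "wreck a nice beach": -6.0}
    print(map_decode(acoustic, language))
```

Working with log probabilities turns the product of the acoustic and language model scores into a sum, which avoids numerical underflow on long utterances.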
In the frame-based HMM approach, the state sequence probability $p(q_1^T \mid W)$ can be rewritten, using the first-order Markov assumption, as
$$p(q_1^T \mid W) = \prod_{t=1}^{T} p(q_t \mid q_{t-1}, W)$$
Given a hidden state sequence $q_1^T$, the joint observation probability $p(o_1^T \mid q_1^T, W)$ can then be written as a product of the probabilities of the individual observation vectors $o_t$, each conditioned on the previous observations $o_1^{t-1}$ and the partial state sequence $q_1^t$. This can be expressed as follows:
$$p(o_1^T \mid q_1^T, W) = \prod_{t=1}^{T} p(o_t \mid o_1^{t-1}, q_1^t, W)$$
To make this computation tractable (for the standard HMM), the frame independence assumption is made. Under this assumption, the observations depend statistically only on the state that produced them and not on previous observations. Therefore,
$$p(o_t \mid o_1^{t-1}, q_1^t, W) = p(o_t \mid q_t)$$
Under the frame independence assumption, the joint observation probability can be rewritten as:
$$p(o_1^T \mid q_1^T, W) = \prod_{t=1}^{T} p(o_t \mid q_t)$$
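Under frame independence, the joint observation log-probability for a given state sequence is simply a sum of per-frame terms. The sketch below assumes one diagonal-covariance Gaussian per state, which is a simplification of the Gaussian mixtures used in the described system; the function names are hypothetical.

```python
import math


def diag_gaussian_logpdf(x, mean, var):
    """Log density of a Gaussian with diagonal covariance."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def frame_independent_loglik(observations, states, means, variances):
    """log p(o_1..o_T | q_1..q_T) = sum over t of log p(o_t | q_t)."""
    return sum(
        diag_gaussian_logpdf(o, means[q], variances[q])
        for o, q in zip(observations, states)
    )


if __name__ == "__main__":
    means = [[0.0, 0.0], [3.0, 3.0]]      # hypothetical state means
    variances = [[1.0, 1.0], [1.0, 1.0]]  # hypothetical diagonal variances
    obs = [[0.1, -0.2], [2.9, 3.2]]
    print(frame_independent_loglik(obs, [0, 1], means, variances))
```

Each frame contributes independently, so no term looks at earlier observations. That is the assumption the invention relaxes.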
Under the maximum likelihood (ML) criterion, the performance of an HMM-based system depends on how well the hidden Markov model characterizes the nature of real speech. For this reason, various methods have been tried in order to provide more realistic models of frame correlation. Much of this work has been devoted to the decomposition of the probability $p(o_t \mid o_1^{t-1}, q_j, \lambda)$.
Recognizing the above difficulties with existing models, this specification describes a system and method comprising a novel hidden Markov model (HMM) with frame correlation modeling. In one embodiment of the system, the frame correlation within a segment of cepstral vectors belonging to one HMM state (or Gaussian mixture) is modeled using an autoregressive (AR) technique. This technique relaxes the frame independence assumption to a correlation among N+1 successive frames for a hypothesized HMM state. The expectation-maximization (EM) procedure is then used to derive estimation formulas for the new HMM parameters, which comprise mean vectors, variance matrices, and a set of correlation matrices. When frame correlation is ignored, the technique reduces to the standard hidden Markov model. Initial experiments on the Wall Street Journal 20K English task showed that, with the additional parameters, the word error rate was reduced from 11.8 (baseline) to 11.4.
1. State-dependent autoregressive feature model
In one embodiment of the system, a state-dependent autoregressive (AR) model is used to capture the correlation between successive observation vectors. The observation vector for a state is generated as follows:
$$o_t = \sum_{i=1}^{N} a_i\, o_{t-i} + e_t + n_t \qquad [6]$$
where $a_i$ is a diagonal matrix, so that an AR model is applied to each component of the vector $o_t$; $e_t$ is the component-dependent mean vector in the HMM state; and $n_t$ is Gaussian noise with zero mean, which can be regarded as the error between the actual observation $o_t$ and the predicted observation
$$\hat{o}_t = \sum_{i=1}^{N} a_i\, o_{t-i} + e_t$$
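A minimal sketch of this generation rule follows. It assumes a single state, stores each diagonal matrix as the list of its diagonal elements, and simply drops lags that fall before the first frame; those conventions and the names `predict` and `generate` are assumptions made for illustration.

```python
import random


def predict(history, coeffs, mean):
    """Predicted observation: mean plus sum_i a_i * o_{t-i}.

    history[-i] is o_{t-i}; coeffs[i - 1] holds the diagonal of a_i.
    """
    pred = list(mean)
    for i, diag in enumerate(coeffs, start=1):
        if i > len(history):
            break  # assumption: lags before the first frame are dropped
        past = history[-i]
        for k in range(len(pred)):
            pred[k] += diag[k] * past[k]
    return pred


def generate(coeffs, mean, noise_std, length, seed=0):
    """Generate observations o_t = prediction + zero-mean Gaussian noise."""
    rng = random.Random(seed)
    observations = []
    for _ in range(length):
        pred = predict(observations, coeffs, mean)
        observations.append(
            [p + rng.gauss(0.0, s) for p, s in zip(pred, noise_std)]
        )
    return observations


if __name__ == "__main__":
    # Hypothetical order-2 model on 2-dimensional features.
    coeffs = [[0.5, 0.3], [-0.2, 0.1]]
    print(generate(coeffs, mean=[1.0, 0.0], noise_std=[0.1, 0.1], length=4))
```

Because the matrices are diagonal, each feature dimension follows its own scalar AR recursion, which keeps the number of extra parameters at N per dimension.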
The advantages of using a state-dependent AR model to characterize frame correlation include the suitability of this model for speech production and the benefits it has provided in speech coding applications. In the time domain, the speech waveform is produced directly by an excitation source and the vocal tract. The vocal tract can be fully parameterized by a time-varying autoregressive filter model. Within this modeling framework, known as linear predictive coding, considerable progress has been made in speech coding to reduce the bit rate. In the cepstral domain, each cepstral frame is extracted from a window of speech samples.
2. Relaxation of the frame independence assumption
Under the state-dependent AR feature model above, it can be assumed that, given the current state $q_t$ and the previous $N$ frames $o_{t-N}, \ldots, o_{t-1}$, the observation $o_t$ is distributed as the noise $n_t$ around its predicted value. This assumption can be formulated as follows:
$$p(o_t \mid o_1^{t-1}, q_1^t, W) = p(o_t \mid o_{t-N}^{t-1}, q_t)$$
Therefore, the likelihood of a state sequence hypothesis can be written as:
$$p(o_1^T \mid q_1^T, W) = \prod_{t=1}^{T} p(o_t \mid o_{t-N}^{t-1}, q_t)$$
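The relaxed likelihood can be sketched by scoring the AR residual of each frame with the state's Gaussian. The sketch assumes one diagonal Gaussian per state and drops lags before the first frame; both are illustrative assumptions, as are the function names.

```python
import math


def diag_gaussian_logpdf(x, mean, var):
    """Log density of a Gaussian with diagonal covariance."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def ar_residual(observations, t, coeffs):
    """o_t minus sum_i a_i * o_{t-i}, using only the frames that exist."""
    resid = list(observations[t])
    for i, diag in enumerate(coeffs, start=1):
        if t - i < 0:
            break  # assumption: lags before the first frame are dropped
        past = observations[t - i]
        for k in range(len(resid)):
            resid[k] -= diag[k] * past[k]
    return resid


def ar_loglik(observations, states, params):
    """params[state] = (coeffs, mean, var); a_i stored as diagonal lists."""
    total = 0.0
    for t, q in enumerate(states):
        coeffs, mean, var = params[q]
        resid = ar_residual(observations, t, coeffs)
        total += diag_gaussian_logpdf(resid, mean, var)
    return total


if __name__ == "__main__":
    obs = [[1.0], [2.0]]
    with_ar = {0: ([[0.5]], [0.0], [1.0])}
    without_ar = {0: ([[0.0]], [0.0], [1.0])}
    print(ar_loglik(obs, [0, 0], with_ar), ar_loglik(obs, [0, 0], without_ar))
```

With all coefficients set to zero the residual equals the observation itself, so the score falls back to the standard frame-independent HMM likelihood, in line with the statement that the technique reduces to the standard model when frame correlation is ignored.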
3. Expectation-maximization procedure
For state sequences modeled by Gaussian mixtures, it has been shown that maximizing the likelihood function $p(O \mid W)$ is equivalent to maximizing a function $Q$, where:
[equation omitted in the source text]
Applying the state-dependent AR feature model, the $Q$ function above can be rewritten as:
[equation omitted in the source text]
To maximize the $Q$ function with respect to the mixture parameters, an expectation-maximization (EM) procedure can be used. For each utterance, the mixture occupancy is the missing data. The following iterative EM procedure can therefore be formulated.
Expectation step: given the mean $e_m$, variance $W_m$, and correlation matrices $a_{m,i}$, the expected alignment $\gamma_m(t)$ can be obtained using the forward-backward technique:
[equation omitted in the source text]
Maximization step: given the expectation of the missing data, differentiating $Q$ with respect to the mixture parameters (mean, variance, and correlation matrices) and setting the derivatives to zero gives the following estimation formulas:
[equations omitted in the source text]
For the diagonal matrices $a_{m,i}$ ($1 \le i \le N$), the vector formed by the $k$-th diagonal elements of the $N$ diagonal matrices can be estimated as:
[equation [13] omitted in the source text]
The $N$ diagonal correlation matrices can therefore be estimated simultaneously, element by element, using the above formula.
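Since the estimation formulas themselves are not reproduced in this text, the sketch below is not the patent's equation [13]. It is a generic weighted least-squares estimate of the AR coefficients for one feature dimension, which is what setting the derivative of a Gaussian log-likelihood to zero yields when the mean is held fixed. The names `solve_linear` and `estimate_ar_coeffs`, and the choice to skip frames without a full history, are assumptions.

```python
def solve_linear(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: not enough variation in the data")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for j in range(col, n + 1):
                aug[row][j] -= factor * aug[col][j]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][j] * x[j] for j in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def estimate_ar_coeffs(series, gammas, mean, order):
    """Weighted least-squares AR coefficients for one feature dimension.

    series[t] is the k-th component of o_t, gammas[t] the occupancy weight
    of the mixture component at frame t, and mean the k-th component of the
    (fixed) mean vector. Returns the k-th diagonal elements of a_1..a_N.
    """
    normal = [[0.0] * order for _ in range(order)]
    rhs = [0.0] * order
    for t in range(order, len(series)):  # frames with a full N-frame history
        lags = [series[t - i] for i in range(1, order + 1)]
        target = series[t] - mean
        for i in range(order):
            rhs[i] += gammas[t] * lags[i] * target
            for j in range(order):
                normal[i][j] += gammas[t] * lags[i] * lags[j]
    return solve_linear(normal, rhs)


if __name__ == "__main__":
    # Noise-free order-1 data o_t = 0.5 * o_{t-1} + 1, so the fit is exact.
    series = [1.0, 1.5, 1.75, 1.875, 1.9375]
    print(estimate_ar_coeffs(series, [1.0] * len(series), mean=1.0, order=1))
```

One N-by-N system is solved per feature dimension, which mirrors the element-by-element estimation of the N diagonal matrices described above.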
4. Embodiments
Fig. 1 is a flowchart of an HMM-based frame correlation process according to an embodiment of the invention. In the illustrated embodiment, the HMM-based frame correlation process is applied to speech recognition. It should be understood, however, that frame correlation can be used in other applications, such as speech synthesis or audio processing.
The frame correlation process includes computing the frame-independent portion of the speech in step 100. This is done by computing the observation vector according to the state-dependent autoregressive (AR) model. As described above, the state-dependent AR model is used to capture the correlation between successive observation vectors. The observation vector is therefore generated according to equation [6] above:
$$o_t = \sum_{i=1}^{N} a_i\, o_{t-i} + e_t + n_t$$
where $a_i$ is a diagonal matrix, $e_t$ is the component-dependent mean vector in the HMM state, and $n_t$ is Gaussian noise with zero mean.
In step 102, the frame-independent portion of the speech is input to the Gaussian model. The frame probability is then computed in step 104. In step 106, the state-dependent AR coefficients are estimated by expectation-maximization, following the steps described in Section 3 above. In one embodiment, equation [13] above can be used to estimate the N diagonal correlation matrices simultaneously.
Fig. 2 is a block diagram of an HMM-based frame correlation system 200 according to an embodiment of the invention. In the illustrated embodiment, the system 200 includes an autoregressive (AR) modeling unit 202 and an expectation-maximization unit 204.
The AR modeling unit 202 receives the diagonal matrices ($a_i$), the component-dependent mean vector ($e_t$), and Gaussian noise ($n_t$) with zero mean. The AR modeling unit 202 then computes an observation vector ($o_t$).
The expectation-maximization unit 204 may include a Gaussian model block 206, an expectation block 208, and a maximization block 210. The Gaussian model block 206 receives the computed observation vector ($o_t$) and computes the frame probability. The observation vector ($o_t$) is passed, together with the mean ($e_m$), variance ($W_m$), and correlation matrices ($a_{m,i}$), to the expectation block 208 to compute the expected alignment $\gamma_m(t)$. In one embodiment, the expectation block 208 uses the forward-backward technique to compute $\gamma_m(t)$.
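The forward-backward computation used by the expectation block can be sketched as follows. This is the textbook scaled forward-backward recursion over states, given per-frame emission likelihoods; it does not reproduce the patent's own formula, and it computes state occupancies rather than per-mixture-component occupancies. The function name and argument layout are assumptions.

```python
def forward_backward(initial, transition, emission_probs):
    """Return gamma[t][j], the posterior probability of state j at frame t.

    initial[j]: initial state probability
    transition[i][j]: probability of moving from state i to state j
    emission_probs[t][j]: likelihood of frame t under state j
    Assumes every frame has nonzero total likelihood.
    """
    num_frames = len(emission_probs)
    num_states = len(initial)

    # Forward pass, rescaled at each frame to avoid underflow.
    alpha, scales = [], []
    for t in range(num_frames):
        if t == 0:
            row = [initial[j] * emission_probs[0][j] for j in range(num_states)]
        else:
            row = [
                emission_probs[t][j]
                * sum(alpha[t - 1][i] * transition[i][j] for i in range(num_states))
                for j in range(num_states)
            ]
        scale = sum(row)
        alpha.append([v / scale for v in row])
        scales.append(scale)

    # Backward pass, using the same scale factors.
    beta = [[1.0] * num_states for _ in range(num_frames)]
    for t in range(num_frames - 2, -1, -1):
        for i in range(num_states):
            beta[t][i] = sum(
                transition[i][j] * emission_probs[t + 1][j] * beta[t + 1][j]
                for j in range(num_states)
            ) / scales[t + 1]

    return [
        [alpha[t][j] * beta[t][j] for j in range(num_states)]
        for t in range(num_frames)
    ]


if __name__ == "__main__":
    gamma = forward_backward(
        initial=[0.6, 0.4],
        transition=[[0.7, 0.3], [0.4, 0.6]],
        emission_probs=[[0.9, 0.2], [0.1, 0.8], [0.5, 0.5]],
    )
    for row in gamma:
        print([round(v, 3) for v in row])
```

In the frame-correlated model, the emission likelihood of a frame would be evaluated on its AR residual rather than on the raw observation.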
The maximization block 210 receives the expected alignment ($\gamma_m(t)$) and estimates the state-dependent AR coefficients ($a_i$). These coefficients can be expressed as diagonal matrices.
5. Experimental results
The model described above was applied to a large-vocabulary, speaker-independent continuous speech recognition task. The experiments were carried out on the Wall Street Journal 20k English task. The baseline system is a gender-independent, within-word-triphone, Gaussian-mixture, tied-state HMM system. In this model set, each speech model has three emitting states and a left-to-right topology. Two silence models are also used. The first silence model, a short-pause model, has a single emitting state that can be skipped. The second silence model is a fully connected emitting-state model used to represent longer periods of silence. The speech is parameterized into 12 mel-frequency cepstral coefficients (MFCCs) and normalized log-energy, together with the first and second differentials of these parameters. This parameterization yields a 39-dimensional feature vector, to which cepstral mean normalization is applied. The speech training data comprise 36696 utterances from the SI-284 WSJ0 and WSJ1 sets. The ICRC large-vocabulary continuous speech recognition (LVCSR) system is trained with decision-tree-based state tying, which yields 6617 tied triphone states. A 24k word list and dictionary are used with the trigram language model. All decoding is performed with a dynamic network decoder.
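The 39-dimensional parameterization (13 static values plus first and second differentials) can be sketched as below. The text does not say how the differentials are computed; the regression formula with a window of 2 and replicated edge frames is an assumption borrowed from common practice, as is applying mean normalization to the static features before taking differentials. The MFCC extraction itself is not shown.

```python
def cepstral_mean_normalize(frames):
    """Subtract the per-dimension mean over the utterance."""
    dim = len(frames[0])
    means = [sum(f[k] for f in frames) / len(frames) for k in range(dim)]
    return [[f[k] - means[k] for k in range(dim)] for f in frames]


def deltas(frames, window=2):
    """Regression-based differentials; edge frames are replicated."""
    num_frames = len(frames)
    dim = len(frames[0])
    denom = 2.0 * sum(th * th for th in range(1, window + 1))
    out = []
    for t in range(num_frames):
        row = []
        for k in range(dim):
            num = 0.0
            for th in range(1, window + 1):
                ahead = frames[min(t + th, num_frames - 1)][k]
                behind = frames[max(t - th, 0)][k]
                num += th * (ahead - behind)
            row.append(num / denom)
        out.append(row)
    return out


def make_features(static_frames):
    """static_frames: 13 values per frame (12 MFCCs + log-energy)."""
    static = cepstral_mean_normalize(static_frames)
    first = deltas(static)
    second = deltas(first)
    return [s + d1 + d2 for s, d1, d2 in zip(static, first, second)]


if __name__ == "__main__":
    # Hypothetical static features: a linear ramp in every dimension.
    frames = [[float(t)] * 13 for t in range(6)]
    feats = make_features(frames)
    print(len(feats), len(feats[0]))
```

Differentials already carry some information about neighbouring frames, but each 39-dimensional vector is still scored independently by a standard HMM.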
For the particular application of the model considered here, the context-dependent states associated with the same monophone are assigned to the same set of diagonal correlation matrices. The order of the AR feature model is chosen to be 3, so only 117 additional parameters (3 diagonal matrices × 39 feature dimensions) are introduced. The correlation matrices are built on models that already have their final number of mixture components. The conversion from the standard HMM to the new HMM is done by setting the 117 additional correlation parameters to 0. Finally, 5 iterations of embedded forward-backward re-estimation are carried out.
The results of this experiment are compared in Table 1. They show that the average word error rate (WER) is reduced from 11.8 (baseline) to 11.4. The data in the table also show that the WER is reduced for most speakers when the present system is used.
System | Average | 440 | 441 | 442 | 443 | 444 | 445 | 446 | 447 |
---|---|---|---|---|---|---|---|---|---|
Standard HMM | 11.8 | 9.9 | 21.0 | 12.4 | 14.6 | 11.9 | 6.1 | 10.3 | 8.4 |
New HMM | 11.4 | 9.5 | 20.8 | 11.8 | 13.0 | 12.0 | 5.9 | 9.7 | 9.6 |
Table 1: WER of the standard system and the frame-correlation system on the 333 test utterances, overall and per speaker (440 to 447)
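The figures in Table 1 can be summarized with a short calculation. The numbers are copied from the table; the variable and function names are illustrative only.

```python
speakers = ["440", "441", "442", "443", "444", "445", "446", "447"]
standard = [9.9, 21.0, 12.4, 14.6, 11.9, 6.1, 10.3, 8.4]
new = [9.5, 20.8, 11.8, 13.0, 12.0, 5.9, 9.7, 9.6]


def relative_reduction(baseline, improved):
    """Fraction of the baseline error that was removed."""
    return (baseline - improved) / baseline


# Speakers whose WER is lower with the frame-correlated model.
improved_speakers = [s for s, a, b in zip(speakers, standard, new) if b < a]

if __name__ == "__main__":
    overall = relative_reduction(11.8, 11.4)
    print(f"relative WER reduction: {100 * overall:.1f}%")
    print(f"improved speakers: {len(improved_speakers)} of {len(speakers)}")
```

The average drop from 11.8 to 11.4 corresponds to a relative reduction of about 3.4%, and six of the eight speakers improve; speakers 444 and 447 do not.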
Although specific embodiments of the invention have been shown and described, the description is for purposes of illustration and not of limitation. Throughout this detailed description, numerous specific details are set forth for purposes of explanation in order to provide a thorough understanding of the invention. It will be apparent to one of ordinary skill in the art, however, that the system and method can be practiced without some of these details. For example, although the illustrated embodiments and examples describe modeling frame correlation in a hidden Markov process for speech recognition, the frame correlation method shown can be used in other applications, such as speech synthesis and/or audio processing. In other instances, well-known structures and functions are not described in detail, to avoid obscuring the subject matter of the invention. Accordingly, the scope and spirit of the invention should be determined by the appended claims.
Claims (22)
1. A method for incorporating frame correlation in a hidden Markov model, comprising:
computing a frame-independent portion of speech;
inputting the frame-independent portion of the speech to a Gaussian model;
computing a frame probability; and
estimating autoregressive coefficients.
2. The method according to claim 1, wherein said computing the frame-independent portion of speech comprises computing an observation vector.
3. The method according to claim 2, wherein said observation vector is based on a state-dependent autoregressive (AR) model.
4. The method according to claim 2, wherein said observation vector for a current state is computed by multiplying diagonal matrices by vectors of past observations, summing the products, and adding the resulting sum to a mean vector.
5. The method according to claim 4, further comprising:
adding a Gaussian noise to said observation vector.
6. The method according to claim 5, wherein said Gaussian noise has zero mean.
7. The method according to claim 1, wherein said computing the frame probability comprises maximizing the likelihood of a state sequence.
8. The method according to claim 1, wherein said computing the frame probability comprises maximizing a Q function.
9. The method according to claim 8, wherein said maximizing the Q function comprises an iterative expectation-maximization procedure.
10. The method according to claim 9, wherein said expectation-maximization comprises formulating an expectation of data.
11. The method according to claim 10, wherein said formulating the expectation of data comprises receiving a mean, a variance, and correlation matrices, and computing an expected alignment.
12. The method according to claim 11, wherein said computing the expected alignment comprises using a forward-backward technique.
13. The method according to claim 9, wherein said expectation-maximization comprises performing a Q-function maximization.
14. The method according to claim 13, wherein said performing the Q-function maximization comprises receiving a mean, a variance, and correlation matrices, differentiating the Q function with respect to the mean, variance, and correlation matrices, and setting the derivatives equal to zero to estimate new values for the mean and variance.
15. The method according to claim 11, wherein said estimating autoregressive coefficients comprises using the estimated mean, the variance, and an expected alignment to estimate a set of cross-correlation matrices.
16. A frame correlation system for a hidden Markov model, comprising:
means for computing a frame-independent portion of speech;
means for computing a frame probability, coupled to said means for computing the frame-independent portion of speech;
means for computing an expected alignment, coupled to said means for computing the frame-independent portion of speech; and
means for estimating autoregressive coefficients, coupled to said means for computing the expected alignment.
17. The system according to claim 16, wherein said means for computing the frame-independent portion of speech comprises an autoregressive modeling unit.
18. A system for incorporating frame correlation into a hidden Markov model, comprising:
an autoregressive modeling unit for computing an observation vector; and
an expectation-maximization unit that estimates coefficients for the autoregressive modeling unit.
19. The system according to claim 18, wherein said expectation-maximization unit comprises a Gaussian model block, an expectation block, and a maximization block.
20. The system according to claim 19, wherein said Gaussian model block receives said observation vector and computes a frame probability.
21. The system according to claim 19, wherein said expectation block receives a mean and a variance, and computes an expected alignment.
22. The system according to claim 21, wherein said maximization block receives the expected alignment and estimates the coefficients for the autoregressive modeling unit.
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2001/001037 WO2003001507A1 (en) | 2001-06-22 | 2001-06-22 | Hidden markov model with frame correlation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN1545695A CN1545695A (en) | 2004-11-10 |
CN1217315C true CN1217315C (en) | 2005-08-31 |
Family
ID=4574818
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN01823553.0A Expired - Fee Related CN1217315C (en) | 2001-06-22 | 2001-06-22 | Hidden Markov model with frame correlation |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN1217315C (en) |
WO (1) | WO2003001507A1 (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9721569B2 (en) * | 2015-05-27 | 2017-08-01 | Intel Corporation | Gaussian mixture model accelerator with direct memory access engines corresponding to individual data streams |
CN113057850B (en) * | 2021-03-11 | 2022-06-10 | 东南大学 | Recovery robot control method based on probability motion primitive and hidden semi-Markov |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH08241095A (en) * | 1995-03-06 | 1996-09-17 | Atr Onsei Honyaku Tsushin Kenkyusho:Kk | Speaker adaptation device and speech recognizing device |
US5924066A (en) * | 1997-09-26 | 1999-07-13 | U S West, Inc. | System and method for classifying a speech signal |
JPH11212591A (en) * | 1998-01-23 | 1999-08-06 | Pioneer Electron Corp | Pattern recognition method, device therefor and recording medium recorded with pattern recognizing program |
-
2001
- 2001-06-22 WO PCT/CN2001/001037 patent/WO2003001507A1/en active Application Filing
- 2001-06-22 CN CN01823553.0A patent/CN1217315C/en not_active Expired - Fee Related
Also Published As
Publication number | Publication date |
---|---|
WO2003001507A1 (en) | 2003-01-03 |
CN1545695A (en) | 2004-11-10 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20050831; termination date: 20150622 |
EXPY | Termination of patent right or utility model | |