CN1217315C - Hidden Markov model with frame correlation

Info

Publication number: CN1217315C
Application number: CN01823553.0A
Authority: CN
Inventors: 李锦宇; 贾颖
Original assignee: Intel China Ltd; Intel Corp
Current assignee: Intel China Ltd; Intel Corp
Priority date: 2001-06-22
Filing date: 2001-06-22
Publication date: 2005-08-31
Anticipated expiration: 2021-06-22
Also published as: WO2003001507A1; CN1545695A

Abstract

The present invention discloses a method and a system for including frame correlation in a hidden Markov model. The method comprises the steps of: computing the frame-independent part of speech; inputting the frame-independent part into a Gaussian model; computing the frame probability; and then estimating autoregressive coefficients from the frame probability.

Description

Hidden Markov model with frame correlation

Technical field

The present invention relates to hidden Markov models, and in particular to including frame correlation in a hidden Markov model.

A Markov process is a probabilistic model that is useful in the analysis of complex systems. Such a process may comprise states and/or state transitions. A state may comprise the values of a number of variables that describe the current condition of a system, and a state transition may occur when a state changes. A probabilistic model of a Markov process gives only the current state and the probability of each possible known state change. Each state transition of the process, and the probability of its future trajectory, therefore depends only on the current state. A hidden Markov model can be described as a partially observed stochastic (Markov) system model in which some of the state information is hidden from view.

The development of the hidden Markov model (HMM) has led to substantial progress in speech recognition technology. This progress is more marked in large-vocabulary continuous speech recognition (LVCSR) than in other areas of speech recognition. However, several assumptions in the hidden Markov model are still regarded as obstacles to the potential effectiveness of the model. One problematic assumption is that successive observations are independent and identically distributed within a state. The mechanism of speech production, however, indicates that the observations are essentially dependent and correlated. In addition, under the maximum likelihood (ML) criterion, the performance of an HMM-based system depends on how well the model represents the nature of actual speech.

Description of drawings

Fig. 1 is a flowchart of a frame correlation process in an HMM-based environment according to an embodiment of the invention.

Fig. 2 is a block diagram of an HMM-based frame correlation system according to an embodiment of the invention.

Embodiment

Recognizing the above difficulties caused by unrealistic assumptions in hidden Markov model (HMM) speech recognition, the present invention describes a method and system for frame correlation in an HMM-based environment. For purposes of illustration and not limitation, the embodiments of the invention are described in a manner consistent with that use, although the invention is clearly not so limited.

In the statistical approach to automatic speech recognition, the optimal mathematical solution requires the recognizer to follow the maximum a posteriori (MAP) decision rule. The MAP decision rule can be expressed as:

\hat{W} = \arg\max_{W} p(W \mid O) = \arg\max_{W} p(O \mid W) \, p(W), \quad [1]

where W is the word-string hypothesis for the given acoustic observation O, p(O|W) is the acoustic model, and

p(W) = \prod_{i=1}^{L} p(w_{i} \mid w_{i-1}, \ldots, w_{i-N})

is the N-gram language model. When the acoustic model score p(O|W) is derived, a hidden state sequence q_1^T is introduced, and the score is usually expressed as:

p(O \mid W) = \sum_{\Gamma} p(o_{1}^{T}, q_{1}^{T} \mid W) = \sum_{\Gamma} p(o_{1}^{T} \mid q_{1}^{T}, W) \cdot p(q_{1}^{T} \mid W). \quad [2]

It is therefore assumed that this hidden process can fully account for the conditional probability of the speech signal.

In the frame-based HMM approach, the state sequence probability p(q_1^T | W) can be rewritten, using the first-order Markov assumption, as

p(q_{1}^{T} \mid W) = p(q_{0}) \prod_{t=1}^{T} p(q_{t} \mid q_{t-1}, W) = \pi_{q_{0}} a_{q_{0} q_{1}} a_{q_{1} q_{2}} \cdots a_{q_{T-1} q_{T}}. \quad [3]
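For illustration only, the product in equation [3] can be sketched in Python as follows. The function name `state_sequence_log_prob`, the two-state model and all numerical values are hypothetical and are not taken from the specification; the computation is carried out in the log domain, which is a common implementation choice rather than something the text prescribes.

```python
import math


def state_sequence_log_prob(initial, transitions, states):
    """Log of equation [3]: pi_{q0} times the product of a_{q_{t-1} q_t}.

    initial[i]        -- initial probability of state i (hypothetical values)
    transitions[i][j] -- transition probability from state i to state j
    states            -- state sequence q_0, q_1, ..., q_T
    """
    log_prob = math.log(initial[states[0]])
    for prev, cur in zip(states, states[1:]):
        log_prob += math.log(transitions[prev][cur])
    return log_prob


# Toy two-state left-to-right model with made-up numbers.
pi = [1.0, 0.0]
A = [[0.6, 0.4],
     [0.0, 1.0]]
path = [0, 0, 1, 1]

# 1.0 * 0.6 * 0.4 * 1.0
prob = math.exp(state_sequence_log_prob(pi, A, path))
```

A path that uses a zero-probability transition would make `math.log` raise a `ValueError` in this sketch; a fuller implementation would need to handle that case explicitly.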

Thus, given a hidden state sequence q_1^T, the joint observation probability p(o_1^T | q_1^T, W) along this state sequence can be written as the product of the probabilities of the individual observation vectors o_t, each conditioned on the previous observations o_1^{t-1} and the partial state sequence q_1^t. This can be expressed as follows:

p(o_{1}^{T} \mid q_{1}^{T}) = \prod_{t=1}^{T} p(o_{t} \mid o_{1}^{t-1}, q_{t}, q_{1}^{t-1}). \quad [4]

To make the above computation tractable (for the standard HMM), the frame-independence assumption is made. This assumption means that the observations depend statistically only on the state that generates them, and not on previous observations. Therefore,

p(o_{t} \mid o_{1}^{t-1}, q_{t}, q_{1}^{t-1}) = p(o_{t} \mid q_{t}).

Under this frame-independence assumption, the joint observation probability can be rewritten as:

p(o_{1}^{T} \mid q_{1}^{T}) = \prod_{t=1}^{T} p(o_{t} \mid o_{1}^{t-1}, q_{t}, q_{1}^{t-1}) = \prod_{t=1}^{T} p(o_{t} \mid q_{t}). \quad [5]

Under the maximum likelihood (ML) criterion, the performance of an HMM-based system depends on how well the hidden Markov model characterizes the nature of actual speech. For this reason, various methods have been tried in order to provide more realistic models of frame correlation. Much of this work has been devoted to the decomposition of the probability p(o_t | o_1^{t-1}, q_j, λ).

Recognizing the above difficulties with existing models, this specification describes a system and method comprising a novel hidden Markov model (HMM) with frame correlation modeling. In an embodiment of the system, the frame correlation within a segment of cepstral vectors belonging to one HMM state (or Gaussian mixture) is modeled using an autoregressive (AR) technique. This technique relaxes the frame-independence assumption to allow correlation among N+1 consecutive frames for a hypothesized HMM state. The expectation-maximization (EM) procedure is then used to derive estimation formulas for the new HMM parameters, which comprise mean vectors, variance matrices and a set of correlation matrices. When frame correlation is ignored, the technique reduces to the standard hidden Markov model. Initial experiments on the Wall Street Journal 20K English task showed that, with the additional parameters, the word error rate is reduced from 11.8 (baseline) to 11.4.

1. State-dependent autoregressive feature model

In an embodiment of the system, a state-dependent autoregressive (AR) model is used to capture the correlation between successive observation vectors. The observation vector for a state is generated as follows:

o_{t} = \sum_{i=1}^{N} a_{i} o_{t-i} + e_{t} + n_{t}, \quad [6]

where a_i is a diagonal matrix, so that an AR model is applied to each component of the vector o_t; e_t is the mean vector associated with a component in the HMM state; and n_t is zero-mean Gaussian noise, which can be regarded as the error between the actual observation o_t and the predicted observation.
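As an illustrative sketch only, the generation of observation vectors by equation [6] might be written in Python as below. The function `generate_observations`, its arguments and the example values are hypothetical. Each diagonal matrix a_i is stored as a list holding its diagonal, and frames before the start of the sequence are treated as zero vectors, which is an assumption that the specification does not address.

```python
import random


def generate_observations(a, e, noise_std, num_frames, seed=0):
    """Generate frames o_t = sum_i a[i] * o_{t-i} + e + n_t (equation [6]).

    a          -- list of N diagonal matrices, each given by its diagonal (length D)
    e          -- mean vector (length D)
    noise_std  -- per-dimension standard deviation of the zero-mean noise n_t
    num_frames -- number of frames to generate
    Frames before t = 0 are treated as zero vectors (an assumption).
    """
    rng = random.Random(seed)
    order, dim = len(a), len(e)
    frames = []
    for t in range(num_frames):
        frame = []
        for k in range(dim):
            # Mean plus zero-mean Gaussian noise for component k.
            value = e[k] + rng.gauss(0.0, noise_std[k])
            # Add the AR contribution of the previous N frames.
            for i in range(1, order + 1):
                if t - i >= 0:
                    value += a[i - 1][k] * frames[t - i][k]
            frame.append(value)
        frames.append(frame)
    return frames


# With zero noise the recursion is deterministic and easy to check by hand.
frames = generate_observations(a=[[0.5, 0.0]], e=[1.0, 2.0],
                               noise_std=[0.0, 0.0], num_frames=3)
```

Because every a_i is diagonal, each feature dimension evolves independently of the others, so the inner loop only ever combines values from the same component k.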

The advantages of using a state-dependent autoregressive model to characterize frame correlation include the grounding of this model in speech production and its success in speech coding applications. In the time domain, the speech waveform is produced directly by an excitation source and the vocal tract. The vocal tract can be parameterized well by a time-varying autoregressive filter model. Within this modeling framework, known as linear predictive coding, considerable progress has been made in speech coding to reduce the bit rate. In the cepstral domain, each cepstral frame is extracted from a window of speech samples.

2. Relaxation of the frame-independence assumption

Based on the above state-dependent autoregressive feature model, it can be assumed that, given the current state q_t and the previous N frames o_{t-N}, ..., o_{t-1}, the observation o_t has the same distribution as n_t. This assumption can be formulated as follows:

p(o_{t} \mid o_{1}^{t-1}, q_{t}) = p(n_{t} \mid o_{t-N}^{t-1}, q_{t}). \quad [7]

The likelihood of a state sequence hypothesis can therefore be written as:

p(o_{1}^{T} \mid q_{1}^{T}) = \prod_{t=1}^{T} p(o_{t} \mid o_{1}^{t-1}, q_{t}, q_{1}^{t-1}) = \prod_{t=1}^{T} p(n_{t} \mid o_{t-N}^{t-1}, q_{t}). \quad [8]
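The following Python sketch illustrates one way to read equations [7] and [8] for a single state: the AR prediction is subtracted from each frame, and the remainder is scored with a diagonal Gaussian whose mean is e. The helper names, the use of a diagonal variance vector, and the zero treatment of missing history are assumptions made for the example, not details given in the specification.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def frame_log_likelihood(frames, t, a, e, var):
    """ln p(o_t | o_{t-N}, ..., o_{t-1}, q_t) for one state, per equation [7].

    The AR prediction sum_i a[i] * o_{t-i} is subtracted from o_t, and the
    remainder is scored against a Gaussian with mean e and variance var.
    Missing history (t - i < 0) is treated as zero, which is an assumption.
    """
    remainder = list(frames[t])
    for i in range(1, len(a) + 1):
        if t - i >= 0:
            for k in range(len(e)):
                remainder[k] -= a[i - 1][k] * frames[t - i][k]
    return log_gaussian_diag(remainder, e, var)


frames = [[1.0, 2.0], [1.5, 2.0], [1.75, 2.0]]
e, var = [1.0, 2.0], [1.0, 1.0]

# Equation [8] in the log domain: sum the per-frame terms over the sequence.
with_ar = sum(frame_log_likelihood(frames, t, [[0.5, 0.0]], e, var)
              for t in range(len(frames)))
# All AR coefficients zero: the score reduces to the standard HMM frame score.
no_ar = sum(frame_log_likelihood(frames, t, [[0.0, 0.0]], e, var)
            for t in range(len(frames)))
```

In this toy sequence the frames follow the AR recursion exactly, so `with_ar` is higher than `no_ar`; setting every coefficient to zero recovers the frame-independent score of equation [5].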

3. Expectation-maximization procedure

For state sequences modeled by Gaussian mixtures, it has been shown that maximizing the likelihood function p(O|W) is equivalent to maximizing the function Q, where:

Q = \sum_{t=1}^{T} \sum_{m=1}^{M} \gamma_{q_{t}, m}(t) \ln p(n_{t} \mid o_{t-N}^{t-1}, q_{t}). \quad [9]

Applying the state-dependent autoregressive feature model, the above Q function can be rewritten as:

Q = \sum_{t=1}^{T} \sum_{m=1}^{M} \gamma_{q_{t}, m}(t) \ln p(n_{t} \mid o_{t-N}^{t-1}, q_{t}) \quad [10]

= \sum_{t=1}^{T} \sum_{m=1}^{M} \gamma_{q_{t}, m}(t) \ln p(o_{t} - \sum_{i=1}^{N} a_{m,i} o_{t-i} - e_{m,t} \mid q_{t})

= \sum_{t=1}^{T} \sum_{m=1}^{M} \gamma_{m}(t) [\ln 2\pi |W_{m}| + (o_{t} - \sum_{i=1}^{N} a_{m,i} o_{t-i} - e_{m,t})^{T} W_{m}^{-1} (o_{t} - \sum_{i=1}^{N} a_{m,i} o_{t-i} - e_{m,t})]

To maximize the Q function with respect to the mixture parameters, the expectation-maximization (EM) procedure can be used. For each utterance, the mixture occupancy is the missing data. The following iterative EM procedure can therefore be formulated.

Expectation step: given the mean e_m, the variance W_m and the correlation matrices a_{m,i}, the expected scaling γ_m(t) can be obtained using the following forward-backward technique:

\gamma_{m}(t) = p(q_{s,m} \mid e_{m,t}, W_{m}, a_{m,i}, o_{t-N}^{t-1}) = \alpha_{m}(t) \beta_{m}(t). \quad [11]
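A generic forward-backward computation of the kind referred to in the expectation step can be sketched in Python as follows. This is a textbook formulation offered under stated assumptions and is not necessarily the exact procedure of the embodiment: the function `occupancies` is hypothetical, it works on states rather than mixture components, it takes precomputed frame likelihoods as input, and it divides the product of the forward and backward variables by the total likelihood so that the values at each frame sum to one, a normalization that equation [11] does not show.

```python
def occupancies(initial, transitions, emission):
    """Return gamma[t][s] = alpha[t][s] * beta[t][s] / p(O).

    initial[s]        -- initial probability of state s
    transitions[r][s] -- transition probability from state r to state s
    emission[t][s]    -- likelihood of frame t in state s (for the model
                         described here, the Gaussian score of the
                         AR remainder)
    """
    num_frames, num_states = len(emission), len(initial)
    alpha = [[0.0] * num_states for _ in range(num_frames)]
    beta = [[1.0] * num_states for _ in range(num_frames)]

    # Forward pass.
    for s in range(num_states):
        alpha[0][s] = initial[s] * emission[0][s]
    for t in range(1, num_frames):
        for s in range(num_states):
            alpha[t][s] = emission[t][s] * sum(
                alpha[t - 1][r] * transitions[r][s] for r in range(num_states))

    # Backward pass.
    for t in range(num_frames - 2, -1, -1):
        for s in range(num_states):
            beta[t][s] = sum(
                transitions[s][r] * emission[t + 1][r] * beta[t + 1][r]
                for r in range(num_states))

    total = sum(alpha[-1])  # p(O) under the model
    return [[alpha[t][s] * beta[t][s] / total for s in range(num_states)]
            for t in range(num_frames)]


# Toy example with made-up numbers: two states, three frames.
gamma = occupancies(initial=[1.0, 0.0],
                    transitions=[[0.5, 0.5], [0.0, 1.0]],
                    emission=[[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]])
```

The sketch uses plain probabilities for readability; on sequences of realistic length the forward and backward variables would underflow, so a practical implementation would rescale them or work with logarithms.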

Maximization step: given the expectation of the missing data, differentiating Q with respect to the mixture parameters (mean, variance and correlation matrices) and setting the result to zero gives the following estimation formulas:

e_{m,t} = \frac{\sum_{t=1}^{T} \gamma_{m}(t) (o_{t} - \sum_{i=1}^{N} a_{m,i} o_{t-i})}{\sum_{t=1}^{T} \gamma_{m}(t)} \quad [12]

W_{m} = \mathrm{diag} [\frac{\sum_{t=1}^{T} \gamma_{m}(t) (o_{t} - \sum_{i=1}^{N} a_{m,i} o_{t-i} - e_{m,t}) (o_{t} - \sum_{i=1}^{N} a_{m,i} o_{t-i} - e_{m,t})^{T}}{\sum_{t=1}^{T} \gamma_{m}(t)}].

For the diagonal matrices a_{m,i} (1 ≤ i ≤ N), the vector formed by the k-th diagonal elements of the N diagonal matrices can be estimated as follows (equation [13]):

The N diagonal correlation matrices can therefore be estimated simultaneously, element by element, using the above formula.
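The estimation formula for the correlation matrices is not reproduced in this text. As a sketch only, one plausible form follows from setting the derivative of Q with respect to the k-th diagonal elements to zero, which yields a weighted least-squares system R a = r for each feature dimension, with R_{ij} = Σ_t γ(t) o_{t-i} o_{t-j} and r_i = Σ_t γ(t) (o_t - e) o_{t-i}. The Python below implements that assumed form for one dimension; the function names are hypothetical, the mean is taken as given, and frames without a full history are skipped.

```python
def solve_linear(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def estimate_ar_coefficients(series, gamma, mean, order):
    """Weighted least-squares AR coefficients for one feature dimension.

    series -- values of one feature component over time
    gamma  -- weight of each frame for the state or mixture
    mean   -- current estimate of the mean for this component
    order  -- AR order N
    """
    R = [[0.0] * order for _ in range(order)]
    r = [0.0] * order
    for t in range(order, len(series)):  # skip frames lacking full history
        past = [series[t - i] for i in range(1, order + 1)]
        for i in range(order):
            r[i] += gamma[t] * (series[t] - mean) * past[i]
            for j in range(order):
                R[i][j] += gamma[t] * past[i] * past[j]
    return solve_linear(R, r)


# Noise-free series generated by o_t = 0.5 * o_{t-1} + 1.0.
series = [1.0, 1.5, 1.75, 1.875, 1.9375]
coeffs = estimate_ar_coefficients(series, gamma=[1.0] * 5, mean=1.0, order=1)
```

Because the matrices are diagonal, the N coefficients belonging to one feature dimension can be solved for together and independently of the other dimensions, which matches the element-by-element estimation described above. A singular system (for example, an all-zero history) would cause a division by zero in this sketch.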

4. Embodiment

Fig. 1 is a flowchart of a frame correlation process in an HMM-based environment according to an embodiment of the invention. In the illustrated embodiment, the HMM-based frame correlation process is applied to speech recognition. It should be understood, however, that this frame correlation can be used in other applications, such as speech synthesis or audio processing.

The frame correlation process includes computing the frame-independent part of the speech in step 100. The frame-independent part is computed by calculating the observation vector according to the state-dependent autoregressive (AR) model. As described above, the state-dependent AR model is used to capture the correlation between successive observation vectors. The observation vector is therefore generated according to equation [6] above:

o_{t} = \sum_{i=1}^{N} a_{i} o_{t-i} + e_{t} + n_{t},

where a_i is a diagonal matrix, e_t is the mean vector associated with a component in the HMM state, and n_t is Gaussian-distributed with zero mean.

In step 102, the frame-independent part of the speech is input to the Gaussian model. The frame probability is then computed in step 104. In step 106, the state-dependent AR coefficients are estimated by performing expectation maximization according to the steps described in Section 3 above. In one embodiment, equation [13] above can be used to estimate the N diagonal correlation matrices simultaneously.

Fig. 2 is a block diagram of an HMM-based frame correlation system 200 according to an embodiment of the invention. In the illustrated embodiment, the system 200 comprises an autoregressive (AR) modeling unit 202 and an expectation-maximization unit 204.

The AR modeling unit 202 receives the diagonal matrices (a_i), the component-dependent mean vector (e_t) and the zero-mean Gaussian noise (n_t). The AR modeling unit 202 then computes an observation vector (o_t).

The expectation-maximization unit 204 may comprise a Gaussian model block 206, an expectation block 208 and a maximization block 210. The Gaussian model block 206 receives the computed observation vector (o_t) and computes the frame probability. The observation vector (o_t) is sent to the expectation block 208 together with the mean (e_m), the variance (W_m) and the correlation matrices (a_{m,i}) in order to compute the expected scaling γ_m(t). In one embodiment, the expectation block 208 uses the forward-backward technique to compute γ_m(t).

The maximization block 210 receives the expected scaling (γ_m(t)) and estimates the state-dependent AR coefficients (a_i). These coefficients can be expressed as diagonal matrices.

5. Experimental results

The above model was applied to a speaker-independent large-vocabulary continuous speech recognition task. The experiment was carried out on the Wall Street Journal 20k English task. The baseline system is a gender-independent within-word-triphone Gaussian-mixture tied-state HMM system. In this model set, each speech model has three emitting states and a left-to-right topology. Two silence models are also used. The first silence model, a short-pause model, has a single emitting state that can be skipped. The second silence model is a fully connected three-emitting-state model used to represent longer periods of silence. The speech is parameterized into 12 Mel-frequency cepstral coefficients (MFCCs) together with the normalized log-energy and the first and second differentials of these parameters. This parameterization yields a 39-dimensional feature vector, to which cepstral mean normalization is applied. The speech training data comprises 36,696 utterances from the SI-284 WSJ0 and WSJ1 sets. The ICRC large-vocabulary continuous speech recognition (LVCSR) system is trained using decision-tree-based state tying, which yields 6,617 tied triphone states. A 24k word list and dictionary are used with a trigram language model. All decoding is performed with a dynamic network decoder.

For the particular application of the model considered above, the context-dependent phone states associated with the same monophone are assigned to the same set of diagonal correlation matrices. The order of the autoregressive feature model is chosen to be 3. This results in only 117 additional parameters. The correlation matrices are built once the final number of mixture components has been reached. The conversion from the standard to the new hidden Markov model is accomplished by setting the 117 additional correlation parameters to 0. Finally, five iterations of embedded forward-backward re-estimation are performed.
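The figure of 117 additional parameters is consistent with one diagonal entry per feature dimension per AR lag for a single set of correlation matrices. The short Python check below spells out that arithmetic; the constant names are hypothetical, and the interpretation of 117 as a per-set count is an assumption drawn from the numbers given in the text.

```python
NUM_MFCC = 12      # Mel-frequency cepstral coefficients per frame
NUM_ENERGY = 1     # normalized log-energy
NUM_STREAMS = 3    # static values, first and second differentials
AR_ORDER = 3       # order of the autoregressive feature model

# 13 static parameters, each with first and second differentials.
feature_dim = (NUM_MFCC + NUM_ENERGY) * NUM_STREAMS

# One diagonal entry per feature dimension for each of the AR lags.
extra_params_per_set = feature_dim * AR_ORDER
```

Here `feature_dim` evaluates to 39 and `extra_params_per_set` to 117, matching the dimensions quoted in the experiment description.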

The results of this experiment are compared in Table 1. They show that the average word error rate (WER) is reduced from 11.8 (baseline) to 11.4. In addition, the data in the table show that the WER is reduced for most speakers when the present system is used.

System	Average	440	441	442	443	444	445	446	447
Standard HMM	11.8	9.9	21.0	12.4	14.6	11.9	6.1	10.3	8.4
New HMM	11.4	9.5	20.8	11.8	13.0	12.0	5.9	9.7	9.6

Table 1: Performance (WER) of the standard system and the frame correlation system on the 333 test utterances, by speaker
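As a simple reading aid, the per-speaker figures of Table 1 can be compared in Python as below. The variable names are hypothetical; the numbers are copied from the table, and the relative reduction is computed from the reported averages rather than recomputed from the per-speaker values.

```python
speakers = ["440", "441", "442", "443", "444", "445", "446", "447"]
standard = [9.9, 21.0, 12.4, 14.6, 11.9, 6.1, 10.3, 8.4]
new = [9.5, 20.8, 11.8, 13.0, 12.0, 5.9, 9.7, 9.6]

# Speakers whose WER is lower with the new HMM than with the standard HMM.
improved = [s for s, old, cur in zip(speakers, standard, new) if cur < old]

# Relative WER reduction based on the reported averages (11.8 -> 11.4).
relative_reduction = (11.8 - 11.4) / 11.8
```

With these figures, six of the eight speakers (all except 444 and 447) show a lower WER with the new HMM, and the relative reduction in the average WER is about 3.4 percent.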

Although specific embodiments of the invention have been shown and described, this description is for purposes of illustration only and not of limitation. Accordingly, throughout this detailed description, various specific details have been set forth for purposes of explanation in order to provide a thorough understanding of the invention. It will be apparent to one of ordinary skill in the art, however, that the system and method can be practiced without these specific details. For example, although the illustrated embodiments and examples are described in terms of modeling frame correlation in a hidden Markov process for speech recognition, the frame correlation method shown can be used in other applications, such as speech synthesis and/or audio processing. In other instances, well-known structures and functions have not been described in detail, in order to avoid obscuring the subject matter of the invention. Accordingly, the scope and spirit of the invention should be determined by the appended claims.

Claims

1. A method for including frame correlation in a hidden Markov model, comprising:

computing a frame-independent part of speech;

inputting the frame-independent part of the speech into a Gaussian model;

computing a frame probability; and

estimating autoregressive coefficients.

2. The method according to claim 1, wherein computing the frame-independent part of the speech comprises computing an observation vector.

3. The method according to claim 2, wherein the observation vector is based on a state-dependent autoregressive (AR) model.

4. The method according to claim 2, wherein the observation vector for a current state is computed by multiplying diagonal matrices by vectors of past observations, summing the products, and adding a mean vector to the resulting sum.

5. The method according to claim 4, further comprising:

adding Gaussian noise to the observation vector.

6. The method according to claim 5, wherein the Gaussian noise has a zero mean.

7. The method according to claim 1, wherein computing the frame probability comprises maximizing the likelihood of a state sequence.

8. The method according to claim 1, wherein computing the frame probability comprises maximizing a Q function.

9. The method according to claim 8, wherein maximizing the Q function comprises an iterative expectation-maximization procedure.

10. The method according to claim 9, wherein the expectation maximization comprises formulating an expectation of data.

11. The method according to claim 10, wherein formulating the expectation of data comprises receiving a mean, a variance and correlation matrices, and computing an expected scaling.

12. The method according to claim 11, wherein computing the expected scaling comprises using a forward-backward technique.

13. The method according to claim 9, wherein the expectation maximization comprises performing Q function maximization.

14. The method according to claim 13, wherein performing the Q function maximization comprises receiving a mean, a variance and correlation matrices, differentiating the Q function with respect to the mean, the variance and the correlation matrices, and setting the result equal to zero, in order to estimate new values for the mean and the variance.

15. The method according to claim 11, wherein estimating the autoregressive coefficients comprises estimating a set of cross-correlation matrices using the estimated mean, the variance and the expected scaling.

16. A frame correlation system for a hidden Markov model, comprising:

means for computing a frame-independent part of speech;

means for computing a frame probability, coupled to the means for computing the frame-independent part of the speech;

means for computing an expected scaling, coupled to the means for computing the frame-independent part of the speech; and

means for estimating autoregressive coefficients, coupled to the means for computing the expected scaling.

17. The system according to claim 16, wherein the means for computing the frame-independent part of the speech comprises an autoregressive modeling unit.

18. A system for incorporating frame correlation into a hidden Markov model, comprising:

an autoregressive modeling unit for computing an observation vector; and

an expectation-maximization unit for estimating coefficients for the autoregressive modeling unit.

19. The system according to claim 18, wherein the expectation-maximization unit comprises a Gaussian model block, an expectation block and a maximization block.

20. The system according to claim 19, wherein the Gaussian model block receives the observation vector and computes a frame probability.

21. The system according to claim 19, wherein the expectation block receives a mean and a variance and computes an expected scaling.

22. The system according to claim 21, wherein the maximization block receives the expected scaling and estimates the coefficients for the autoregressive modeling unit.