CN1397929A - Speech enhancement-feature weighting-log-spectrum addition method for anti-noise speech recognition - Google Patents

Speech enhancement-feature weighting-log-spectrum addition method for anti-noise speech recognition

Info

Publication number
CN1397929A
Authority
CN
China
Prior art keywords
noise
frame
speech
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN02124144A
Other languages
Chinese (zh)
Other versions
CN1162838C (en)
Inventor
曹志刚
许涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CNB021241449A priority Critical patent/CN1162838C/en
Publication of CN1397929A publication Critical patent/CN1397929A/en
Application granted granted Critical
Publication of CN1162838C publication Critical patent/CN1162838C/en
Anticipated expiration legal-status Critical
Expired - Fee Related legal-status Critical Current



Abstract

A "speech intensifying (MMSE)-feature weighting (FW)-logrithmic spectrum addition (LA)" method for anti-interference speech recognition features that according to the speech features in short time, the local S/N ratio is extracted, the confidence of the feature, that is weight, is estimated, and the recognition algorithm is such modified that the weight information is used. Its advantages are high S/N ratio and high recognition percentage up to 80% in strong noise condition.

Description

Speech enhancement-feature weighting-log-spectrum addition method for noise-robust speech recognition
Technical field
The speech enhancement-feature weighting-log-spectrum addition method for noise-robust speech recognition belongs to the field of speech recognition technology.
Background technology
Probabilistic-statistical recognition based on the HMM (Hidden Markov Model) is the most widely used modeling framework in current automatic speech recognition (ASR: Automatic Speech Recognition) research. The introduction of the HMM into speech recognition was a milestone: it describes the production mechanism of speech well, and it comes with simple, clear parameter-estimation (training) and state-search algorithms, which greatly advanced speech recognition technology.
A hidden Markov model can be viewed as a finite-state automaton; see Fig. 1, which shows the most common HMM topology. At each discrete instant, i.e. for any speech frame t, the automaton can occupy only one of a finite set of states. Suppose U states are allowed, denoted S_u, u = 1..U. If q(t) denotes the state of the automaton at frame t, then q(t) can only equal one of S_1..S_U, i.e. q(t) ∈ {S_1, ..., S_U} for all t. Once the automaton starts running at t = 1, the state of each subsequent frame is determined probabilistically by the initial-state probability vector π and the state transition probability matrix A. For any frame t (t ≥ 1), the probability that q(t) takes any of S_1..S_U depends only on the state at the previous frame t - 1 and is independent of the states of earlier frames, so the resulting state sequence q(1), q(2), q(3), ... is a first-order Markov chain. The state q(t) at any frame t is hidden inside the system and not observable from outside; the outside world obtains only the random output (here, the speech signal) that the system produces in that state. Hence the name hidden Markov model.
We know that speech is short-time stationary. Speech can therefore be divided into short segments, each corresponding to one HMM state, with transitions between segments represented by transitions between states. Each state has its own model parameters describing the stationary statistics of one frame of speech. If the next frame has the same statistics, the state does not change, i.e. the next state jumps back to the same state; if instead the statistics of the next frame have changed, the model jumps to the state that matches the statistics of the new segment.
As can be seen from the above, the hidden Markov model is a mathematical model built on a clear physical meaning: each state corresponds to a quasi-stationary configuration of the vocal organs during speech, so the model captures both the time variation and the quasi-stationarity of the speech signal rather well. Fig. 1 illustrates how an HMM describes input speech; the utterance in the figure is the Chinese sentence "He goes to Wuxi City" (他去无锡市). The input speech is labeled with the corresponding phones, each phone label corresponding to one HMM. The figure shows a left-to-right HMM topology in which each state has its own output probability distribution. States 1 and 9 are the initial and final states, respectively; they are transition states that occupy no time and produce no output themselves, and are used to concatenate different HMMs in series. The solid line shows the cepstral means of the speech under the different labels.
For notational convenience, we use state indices i and j directly for the i-th and j-th states of the state set {S_1, ..., S_U}, where U is the total number of model states. The model parameters are:

A - the state transition probability matrix, with elements

$$a_{ij} = P(j \mid i), \qquad 1 \le i, j \le U \qquad (1)$$

the probability of transiting from state i to state j. By the definition of transition probabilities,

$$\sum_{j=1}^{U} a_{ij} = 1, \qquad \forall\, 1 \le i \le U \qquad (2)$$

In the left-to-right topology of Fig. 1, which is the most common, A is in fact a bidiagonal matrix.

B - the output probability densities:

$$p(y_t \mid q(t)=i) = N(y_t; \mu_i, \Sigma_i) = \prod_{r=1}^{R} \frac{1}{\sqrt{2\pi}\,\sigma_{ir}} \exp\left( -\frac{(y_{tr}-\mu_{ir})^2}{2\sigma_{ir}^2} \right) \qquad (3)$$

is the likelihood of the observed speech feature y_t in state q(t) = i. The distribution of the speech features is approximated by a Gaussian, where y_t = [y_{t1}, y_{t2}, ..., y_{tR}] is the R-dimensional observation vector and μ_i = [μ_{i1}, μ_{i2}, ..., μ_{iR}], Σ_i = diag[σ_{i1}², σ_{i2}², ..., σ_{iR}²] are the mean and variance of the Gaussian N(y_t; μ_i, Σ_i). Because y_t is generally obtained through an orthogonal transform, the covariance of the Gaussian is described by a diagonal matrix, and the multivariate Gaussian can be written as a product of one-dimensional Gaussians.

π - the initial probability distribution over states: each element π_i ∈ [0, 1]. In the HMM of Fig. 1, state 1 is the only initial state, so π_1 = 1 and the initial probabilities of the remaining states are 0.
The above parameters are obtained by training: the training speech data are used to adjust them, which amounts to gathering the statistics of the speech features. Once training is complete, recognition can be performed.
HMM-based speech recognition takes the input feature sequence Y = [y_1, y_2, ..., y_T] and, under the maximum-likelihood criterion, searches for the optimal state sequence Q̂ = [q̂(1), q̂(2), ..., q̂(T)], thereby uncovering the hidden part of the HMM; T is the length of the utterance to be recognized, i.e. there are T speech-frame features. This search is usually performed with the Viterbi algorithm. Define

$$\delta_t(i) = \max_{q(1) q(2) \cdots q(t-1)} \log \left[ p\big( q(1) q(2) \cdots q(t-1), q(t)=i, y_1 y_2 \cdots y_t \mid \lambda \big) \right] \qquad (4)$$

the maximum output log-likelihood over the partial observations y_1 y_2 ... y_t and partial paths q(1) q(2) ... q(t-1), q(t) = i, for given model parameters, where λ denotes the trained HMM speech model.

Initialization:

$$\delta_1(i) = \log \pi_i + \log p(y_1 \mid q(1)=i), \qquad \psi_1(i) = 0 \qquad (5)$$

Iteration:

$$\delta_t(j) = \max_{1 \le i \le U} \left[ \delta_{t-1}(i) + \log a_{ij} \right] + \log p(y_t \mid q(t)=j) \qquad (6)$$

$$\psi_t(j) = \arg\max_{1 \le i \le U} \left[ \delta_{t-1}(i) + \log a_{ij} \right] \qquad (7)$$

Termination: the maximum probability

$$p^* = \max_{1 \le i \le U} \delta_T(i) \qquad (8)$$

and the final state of the optimal path

$$\hat{q}(T) = \arg\max_{1 \le i \le U} \delta_T(i) \qquad (9)$$

The remaining states on the optimal path are found in turn by backtracking:

$$\hat{q}(t) = \psi_{t+1}\big( \hat{q}(t+1) \big), \qquad t = T-1, \ldots, 1 \qquad (10)$$

As can be seen, δ_t(i) records the maximum probability of producing the partial output while occupying state i at time t, and ψ_t(j) records the linking information of the path.
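By way of illustration only (not part of the claimed method), the recursion of Eqs. (4)-(10) can be sketched in a few lines of Python; the function name and array layout are our own choices:

```python
import numpy as np

def viterbi_decode(log_pi, log_A, log_B):
    """Viterbi search over one HMM, following Eqs. (4)-(10).
    log_pi: (U,) log initial probabilities; log_A: (U, U) log a_ij;
    log_B: (T, U) log output likelihoods log p(y_t | q(t)=i)."""
    T, U = log_B.shape
    delta = np.empty((T, U))            # delta_t(i), Eq. (4)
    psi = np.zeros((T, U), dtype=int)   # back-pointers psi_t(j), Eq. (7)
    delta[0] = log_pi + log_B[0]        # initialization, Eq. (5)
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A    # candidate i -> j scores
        psi[t] = scores.argmax(axis=0)            # Eq. (7)
        delta[t] = scores.max(axis=0) + log_B[t]  # Eq. (6)
    p_star = delta[-1].max()            # termination, Eq. (8)
    q = np.empty(T, dtype=int)
    q[-1] = delta[-1].argmax()          # Eq. (9)
    for t in range(T - 2, -1, -1):      # backtracking, Eq. (10)
        q[t] = psi[t + 1][q[t + 1]]
    return p_star, q
```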
Recognition of clean speech has by now reached a fairly mature stage. Taking IBM's ViaVoice as representative, recognition rates for continuous speech can exceed 90%, but strict requirements are placed on background noise and the input microphone; otherwise system performance drops sharply. The cause is the mismatch between the training and recognition environments: the parameters of many current recognition systems are trained in the laboratory, with training speech mostly collected against a quiet background over high-quality microphones. In real application scenarios, various factors inevitably make the speech to be recognized mismatch the system parameters, so real-world performance falls far short of laboratory performance.
Many factors cause the mismatch between test and training environments in speech recognition, including the speaker's own state, noise around the speaker, the recording channel, background noise during recording, the transmission channel, and background noise at the receiver. Noise-robust speech recognition considers only the effect of the received background noise and the convolutional channel on the speech signal; the mismatch model is shown in Fig. 2.
The anti-noise problem is currently a focus of the speech recognition field. Ubiquitous noise creates a mismatch between training and recognition environments and thus a rapid degradation of recognizer performance. The goal of noise-robust speech recognition is precisely to eliminate this mismatch, so that recognition performance approaches as far as possible the performance obtained under the training environment. Since current speech recognition systems generally adopt HMM-based statistical models, the mismatch brought by noise can be mapped onto three spaces, as shown in Fig. 3.
In Fig. 3, the mismatch between training and recognition appears in three spaces: signal, feature, and model. In the signal space, Sx denotes the raw speech under the training environment and Sy the speech under the recognition environment; the mismatch between the two is represented by the distortion function Ds(). After the feature extraction procedure, the signal-space mismatch inevitably appears in the feature space as well: Fx is the training-speech feature and Fy the test-speech feature, with mismatch represented by the distortion function Df(). Finally, the feature Fx is used to train the HMM, yielding the model Mx, whereas the model matching the feature Fy would be My; this model-space mismatch is represented by the distortion function Dm().
Following Fig. 3, noise-robust speech recognition methods can be considered from three different angles, and research has essentially settled into the following classes of approach:
1. Signal-space processing. Signal-processing methods are used to improve the noise robustness of the recognition system, e.g. speech enhancement and microphone arrays raise the signal-to-noise ratio (SNR) of the input signal.

2. Feature-space processing. Mainly, knowledge of the human auditory system is used to extract robust speech features insensitive to noise, such as perceptual linear predictive coefficients (PLP: Perceptual Linear Predictive).

3. Model-space processing. The statistical properties of the noise are used to correct the speech models trained under ideal conditions so that they suit the particular recognition environment, e.g. parallel model compensation (PMC: Parallel Model Compensation) and the log-spectrum addition method (LA: Log-Add).
These methods effectively improve the recognition performance of the system under weak background noise, but recognition accuracy still drops sharply under strong background noise. The present invention addresses precisely the speech recognition problem in low-SNR noise environments.
By fusing the minimum mean-square error (MMSE: Minimum Mean Square Error) enhancement of the signal space with the log-spectrum addition (LA: Log-Add) compensation algorithm of the model space, we obtain a solution, referred to as the MMSE-LA scheme, which markedly improves recognition accuracy in low-SNR environments. The present invention further proposes a new feature weighting algorithm in the feature space and, using the MMSE enhancement, gives an effective weight-calculation formula, yielding the multi-space MMSE-FW-LA scheme; FW stands for feature weighting (Feature Weight). The scheme thus removes the training/recognition mismatch caused by noise simultaneously in the signal space, the feature space, and the model space.
Because both the MMSE-LA and MMSE-FW-LA schemes involve the Mel-frequency cepstral coefficient (MFCC: Mel-Frequency Cepstral Coefficient), currently one of the most common acoustic features, it is necessary to introduce it first.
Automatic speech recognition (ASR: Automatic Speech Recognition) is the process by which a machine, given a segment of speech, extracts information from it and determines its linguistic meaning. It must first extract from the speech signal acoustic feature vectors that reflect the essence of the speech, aid recognition, and suit computer processing. Acoustic features have evolved from the time domain to the frequency domain and on to the cepstral domain, incorporating ever more knowledge of human hearing. The Mel-frequency cepstral coefficient (MFCC: Mel-Frequency Cepstral Coefficient) is currently one of the most common acoustic features. We first describe its extraction process, shown in Fig. 4.
Framing and windowing. Framing exploits the short-time stationarity of speech: after framing, each frame can be analyzed as a stationary random signal. Adjacent frames overlap by a fixed amount to preserve the correlation between frames. Windowing reduces spectral leakage; a Hamming window is normally used:

$$h(n) = 0.54 - 0.46 \cos\left( \frac{2\pi n}{N-1} \right), \qquad n = 1, \ldots, N \qquad (11)$$

where N is the frame length and h(n) the Hamming-window coefficient at sample n. With y(n) denoting the sampled raw speech, the framed signal is

$$y(n, t) = y\left( \frac{N \times (t-1)}{2} + n \right), \qquad n = 1, \ldots, N \qquad (12)$$

where t is the frame index and n the sample index within the current frame. After the Hamming window:

$$y_w(n, t) = y(n, t) \times h(n), \qquad n = 1, \ldots, N \qquad (13)$$
Fast Fourier transform (FFT). Because the short-time spectrum of speech plays a decisive role in its perception, each frame is transformed to the spectral domain with the FFT:

$$Y(k, t) = |Y(k, t)|\, e^{j \angle Y(k, t)} = \mathrm{FFT}\{ y_w(n, t) \}, \qquad k = 1, \ldots, N_{fft} \qquad (14)$$

where |Y(k, t)| and e^{j∠Y(k,t)} are the amplitude and phase at the k-th frequency bin and N_fft is the number of FFT points.

Power spectrum. Because the short-time spectral amplitude of speech dominates its perception, while the short-time phase is acoustically rather unimportant by comparison, we compute the power-spectral amplitude and ignore the influence of the phase:

$$Y_p(k, t) = |Y(k, t)|^2, \qquad k = 1, \ldots, N_{fft} \qquad (15)$$
Mel-scaled filterbank. The Mel frequency division was proposed on the basis of research into auditory models. The Mel-scale frequency f_mel and the linear frequency f_Hz are related by

$$f_{mel} = 1127 \ln\left( 1 + \frac{f_{Hz}}{700} \right) \qquad (16)$$

The Mel filterbank is shown in Fig. 5. First, Eq. (16) maps the linear frequencies after the FFT onto the Mel scale, which is then divided uniformly. With M the number of filters of the Mel-scaled bank on the power-spectral domain, i.e. the number of segments on the Mel scale,

$$Mel_m = m \times 1127 \ln\left( 1 + \frac{F_S/2}{700} \right) \Big/ M, \qquad m = 1, \ldots, M \qquad (17)$$

where Mel_m is the m-th Mel segment frequency and F_S the sampling frequency. The Mel segment frequencies are then mapped back to linear frequencies,

$$Lin_m = \left( \exp( Mel_m / 1127 ) - 1 \right) \times 700, \qquad m = 1, \ldots, M \qquad (18)$$

where Lin_m is the linear frequency corresponding to the m-th Mel segment frequency, and the tap coefficients of the Mel filterbank at each linear frequency are computed as triangular filters:

$$H_m(k) = \begin{cases} \dfrac{f_k - Lin_{m-1}}{Lin_m - Lin_{m-1}}, & Lin_{m-1} \le f_k \le Lin_m \\[2mm] \dfrac{Lin_{m+1} - f_k}{Lin_{m+1} - Lin_m}, & Lin_m < f_k \le Lin_{m+1} \\[2mm] 0, & \text{otherwise} \end{cases} \qquad (19)$$

where H_m(k) is the tap coefficient of the m-th Mel filter at the k-th linear frequency bin and f_k is the frequency of bin k:

$$f_k = k \times F_S / N_{fft}, \qquad k = 1, \ldots, N_{fft} \qquad (20)$$

The extracted Mel spectral feature is

$$MBank(m, t) = \sum_{k=1}^{N_{fft}/2} H_m(k) \times Y_p(k, t), \qquad m = 1, \ldots, M \qquad (21)$$

where MBank(m, t) is the m-th dimension of the Mel spectral feature extracted from frame t.
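For illustration, the taps of Eqs. (16)-(20) can be built as below, under the common reading in which M + 2 uniformly spaced Mel points supply the left edge, center, and right edge of each of the M triangular filters; the defaults match the baseline settings given later (M = 26, N_fft = 512, F_S = 16 kHz), and the function name is ours:

```python
import numpy as np

def mel_filterbank(M=26, n_fft=512, fs=16000):
    """Triangular Mel filterbank taps H_m(k), a sketch of Eqs. (16)-(20)."""
    mel_max = 1127.0 * np.log(1.0 + (fs / 2.0) / 700.0)   # Eq. (16) at F_S/2
    mel_pts = mel_max * np.arange(M + 2) / (M + 1)        # uniform Mel grid
    lin_pts = 700.0 * (np.exp(mel_pts / 1127.0) - 1.0)    # back to Hz, Eq. (18)
    f_k = np.arange(n_fft // 2) * fs / n_fft              # bin frequencies, Eq. (20)
    H = np.zeros((M, n_fft // 2))
    for m in range(1, M + 1):
        lo, ctr, hi = lin_pts[m - 1], lin_pts[m], lin_pts[m + 1]
        rise = (f_k - lo) / (ctr - lo)                    # rising edge, Eq. (19)
        fall = (hi - f_k) / (hi - ctr)                    # falling edge, Eq. (19)
        H[m - 1] = np.clip(np.minimum(rise, fall), 0.0, None)
    return H
```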
Log-spectral representation. To account for human hearing, in which the sensation of loudness is roughly linear in the logarithm of the sound intensity, we take the logarithm of the Mel-filterbank outputs, obtaining the log-spectral parameters (log-Spectra):

$$FBank(m, t) = \log\big( MBank(m, t) \big), \qquad m = 1, \ldots, M \qquad (22)$$

where FBank(m, t) is the m-th log-spectral feature extracted from frame t.

Discrete cosine transform (DCT). The DCT acts like an orthogonal transform: it decorrelates the dimensions of the feature vector and lowers its dimension, further extracting and compressing the features. Because the DCT leaves the feature dimensions approximately uncorrelated, the covariance between dimensions can be represented by a diagonal matrix; a diagonalized covariance matrix greatly reduces the computation and makes many efficient algorithms feasible. The DCT is defined as

$$\tilde{c}(r, t) = \alpha(r) \sum_{m=1}^{M} FBank(m, t) \cos\left( \frac{\pi (2m-1)(r-1)}{2M} \right), \qquad r = 1, \ldots, M \qquad (23)$$

$$\alpha(1) = \sqrt{1/M}, \qquad \alpha(r) = \sqrt{2/M}, \quad r = 2, \ldots, M \qquad (24)$$

where c̃(r, t) is the r-th cepstral coefficient extracted from frame t. After the DCT the higher-order cepstral coefficients are very small, so the feature dimension can be reduced: only the first R cepstral coefficients are used in recognition.

Cepstral weighting. Because the low- and high-order cepstral coefficients are comparatively sensitive to noise, the cepstral coefficients are usually weighted with a raised-cosine band-pass function, which improves the robustness of the system to some degree:

$$Lifter(r) = 1 + \frac{L}{2} \sin\left( \frac{\pi (r-1)}{L} \right), \qquad r = 1, \ldots, R \qquad (25)$$

where L is the weighting-filter width. The weighted cepstral coefficients are

$$c(r, t) = Lifter(r) \times \tilde{c}(r, t), \qquad r = 1, \ldots, R \qquad (26)$$

This weighting procedure is called cepstral liftering, and c(r, t) is the static MFCC feature.

Dynamic coefficients reflect the time-varying information of the speech spectrum. They are computed as

$$\Delta c(r, t) = \sum_{\Delta t = -2}^{2} \Delta t \cdot c(r, t + \Delta t) \Big/ 10 \qquad (27)$$

where Δc(r, t) is the first-order MFCC coefficient and Δt the frame offset.
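Strung together, Eqs. (21)-(27) become the short sketch below; R = 13 and L = 22 follow the baseline settings quoted later, and replicating the edge frames for the deltas is our own assumption, since the text does not specify boundary handling:

```python
import numpy as np

def mfcc_from_power_spectrum(Yp, H, R=13, L=22):
    """Static MFCCs c(r,t) and first-order deltas, a sketch of Eqs. (21)-(27).
    Yp: (T, n_fft//2) frame power spectra; H: taps from mel_filterbank()."""
    fbank = np.log(Yp @ H.T)                                # Eqs. (21)-(22)
    T, M = fbank.shape
    r, m = np.arange(M), np.arange(1, M + 1)
    dct = np.cos(np.pi * np.outer(r, 2 * m - 1) / (2 * M))  # Eq. (23)
    alpha = np.full(M, np.sqrt(2.0 / M)); alpha[0] = np.sqrt(1.0 / M)  # Eq. (24)
    c_tilde = (alpha[:R, None] * dct[:R]) @ fbank.T         # keep first R dims
    lifter = 1.0 + (L / 2.0) * np.sin(np.pi * np.arange(R) / L)        # Eq. (25)
    c = (lifter[:, None] * c_tilde).T                       # Eq. (26), shape (T, R)
    pad = np.pad(c, ((2, 2), (0, 0)), mode="edge")          # replicate edge frames
    dc = sum(dt * pad[2 + dt: T + 2 + dt] for dt in range(-2, 3)) / 10.0  # Eq. (27)
    return c, dc
```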
Summary of the invention
The object of the present invention is to provide a speech enhancement-feature weighting-log-spectrum addition method for anti-noise recognition in low-SNR environments.
The starting point of the feature weighting algorithm is the observation that noise damages speech differently at different times and frequencies. That is, in the time-frequency representation of speech (the spectrogram), some regions are less polluted by noise; features extracted from these regions have higher confidence and stronger discriminative power during recognition. Conversely, features extracted from heavily polluted regions interfere with recognition and are the main cause of the drop in the recognition rate.
In HMM-based statistical recognition, the more prior knowledge is acquired and used, the more accurate the recognition result. The feature weighting algorithm exploits information about how severely the noise damages the speech in each time-frequency region, effectively raising recognition performance in noise: from the short-time spectrum it extracts the local SNR of each feature dimension, estimates the confidence of the feature, i.e. its weight, and modifies the recognition algorithm so that the weight information enters the recognition process.
The feature weighting algorithm must solve the following two problems:

1. How to estimate the confidence of a feature and give a weight-calculation formula.

2. How to embed the feature-weighting process in the HMM-based recognition framework.
From the MFCC extraction process (Fig. 4) it can be seen that before the DCT, each dimension of what we call the log-spectral feature is tied to a particular local frequency interval of the noisy speech within the current short-time frame. The confidence of each log-spectral feature dimension can therefore be estimated from the local SNR of that interval.
In speech enhancement, methods based on short-time spectral amplitude (STSA: Short Time Spectral Amplitude) estimation exploit a key auditory property of speech: the short-time spectrum is decisive for perception, within which the short-time spectral amplitude plays the active role while the short-time phase is acoustically rather unimportant by comparison. STSA-based enhancement therefore generally enhances only the STSA of the speech and uses the phase of the noisy speech directly as the phase of the enhanced speech. Fig. 6 gives the general block diagram of this class of methods.
Noise and clean speech superpose in the time domain; the noisy speech can be written

$$y(n) = x(n) + d(n) \qquad (28)$$

where x(n) is the clean speech and d(n) the additive background noise, the two being mutually uncorrelated. Both recognition and enhancement require the speech to be divided into short frames; after framing and windowing, Eq. (28) becomes

$$y_w(n, t) = x_w(n, t) + d_w(n, t), \qquad 1 \le n \le N \qquad (29)$$

where N is the frame length and t the frame index. The short-time discrete spectral amplitudes of d(n,t), x(n,t), y(n,t) are denoted D(k,t), X(k,t), Y(k,t), and their short-time discrete power-spectral amplitudes D_p(k,t), X_p(k,t), Y_p(k,t), where 1 ≤ k ≤ N_fft indexes the frequency bins and N_fft is the length of the per-frame fast Fourier transform (FFT).
STSA-based speech enhancement has a general enhancement estimation formula,

$$\hat{X}(k, t) = G(k, t)\, Y(k, t), \qquad 1 \le k \le N_{fft}/2 + 1 \qquad (30)$$

where G(k, t) is called the gain coefficient of bin k in frame t and takes different functional forms in different enhancement methods, and X̂(k, t) denotes the estimate of X(k, t), i.e. the short-time spectral amplitude of the enhanced speech. The core of the MMSE method is to compute the minimum mean-square error estimate of the clean-speech short-time spectral amplitude X(k, t); under the Gaussian hypothesis for the spectra of speech and noise, the gain coefficient can be expressed as

$$G(k, t) = \frac{1}{2} \sqrt{ \frac{\pi\, \zeta(k, t)}{\gamma(k, t)\,\big(1 + \zeta(k, t)\big)} }\;\; \Psi\!\left( -0.5;\; 1;\; -\frac{\gamma(k, t)\, \zeta(k, t)}{1 + \zeta(k, t)} \right), \qquad 1 \le k \le N_{fft}/2 + 1 \qquad (31)$$

where ζ(k, t) and γ(k, t) are called the a priori and the a posteriori signal-to-noise ratio, and Ψ(a_1; a_2; a_3) is the confluent hypergeometric function, which can be computed by series summation:

$$\Psi(a_1; a_2; a_3) = 1 + \frac{a_1}{a_2} \frac{a_3}{1!} + \frac{a_1 (a_1+1)}{a_2 (a_2+1)} \frac{a_3^2}{2!} + \cdots \qquad (32)$$

with a_1 = -0.5, a_2 = 1, a_3 = -γ(k,t) ζ(k,t) / (1 + ζ(k,t)).

As can be seen, the larger the local SNRs ζ(k, t) and γ(k, t), the larger the gain coefficient G(k, t), and vice versa. G(k, t) can therefore serve as a measure of the local SNR and be used to compute the feature weights.

Since the feature weights attach to the individual dimensions of the feature in the log-spectral domain, we borrow from the log-spectral extraction process (Fig. 4) to compute them:

$$w_m(t) = \sum_{k=1}^{N_{fft}/2} G(k, t) H_m(k) \Big/ \sum_{k=1}^{N_{fft}/2} H_m(k), \qquad 1 \le m \le M \qquad (33)$$

followed by normalization so that

$$\sum_{m=1}^{M} w_m(t) = 1 \qquad (34)$$

Here H_m(k) is the coefficient of the m-th triangular filter of Fig. 5 at the k-th power-spectrum component, see Eq. (19), and w_m(t) is the weight of the m-th log-spectral feature dimension extracted from frame t. M is the number of Mel filters, i.e. the dimension of the log-spectral feature.
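A sketch of Eqs. (31)-(34) follows; truncating the hypergeometric series after a fixed number of terms is our own choice, adequate for moderate arguments:

```python
import numpy as np

def psi_series(a3, terms=25):
    """Confluent hypergeometric function Psi(-0.5; 1; a3) by the series of
    Eq. (32); the fixed truncation is an assumption of this sketch."""
    a1, a2 = -0.5, 1.0
    val, pow_a3 = np.ones_like(a3), np.ones_like(a3)
    num, den, fact = a1, a2, 1.0
    for n in range(1, terms + 1):
        pow_a3 = pow_a3 * a3
        fact *= n
        val = val + (num / den) * pow_a3 / fact
        num *= a1 + n
        den *= a2 + n
    return val

def mmse_gain(zeta, gamma):
    """MMSE spectral-amplitude gain G(k,t) of Eq. (31)."""
    a3 = -gamma * zeta / (1.0 + zeta)
    return 0.5 * np.sqrt(np.pi * zeta / (gamma * (1.0 + zeta))) * psi_series(a3)

def feature_weights(G, H):
    """Log-spectral feature weights of Eqs. (33)-(34): filterbank-smoothed
    gains, normalized to sum to one. G: (K,) gains; H: (M, K) Mel taps."""
    w = (H @ G) / H.sum(axis=1)      # Eq. (33)
    return w / w.sum()               # Eq. (34)
```

In practice the explicit series could equally be replaced by scipy.special.hyp1f1(-0.5, 1, a3).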
Fig. 7 shows, under 0 dB additive white Gaussian noise, the mismatch of the 26-dimensional log-spectral features for one voiced frame (Fig. 7a) and the feature weights obtained with the above method (Fig. 7b). As can be seen, the larger the mismatch the smaller the weight, and vice versa. In particular, the feature weights peak markedly near the two formant frequencies, which carry the speech content information; emphasizing this part of the information helps improve recognition accuracy.
In unvoiced segments of the speech, the weights obtained with the above method do not agree with the actual feature mismatch, so we do not weight the features there; that is, the weight of every dimension is set to 1.
Although we weight the features in the log-spectral domain, log-spectral features recognize less well than cepstral features, while cepstral features are low-dimensional and approximately decorrelated across dimensions, which simplifies the speech models and lowers the recognition cost. Our feature weighting recognizer therefore still uses the cepstral-domain MFCC features.
The recognizer uses Viterbi decoding, i.e. it seeks the state sequence of maximum output log-likelihood:

$$\delta_t(j) = \max_{q(1) \cdots q(t-1)} \log \left[ p\big( q(1) \cdots q(t-1), q(t)=j, y_1 \cdots y_t \mid \lambda \big) \right] = \max_{1 \le i \le U} \left[ \delta_{t-1}(i) + \log a_{ij} \right] + \log p(y_t \mid q(t)=j) \qquad \text{(cf. Eqs. (4), (6))}$$

The core of the feature weighting algorithm is therefore to obtain a log-likelihood formula that is robust to feature mismatch. Substituting Eq. (3) into the log-likelihood computation,

$$\log p(y_t \mid q(t)=i) = \log N(y_t; \mu_i, \Sigma_i) = -\sum_{r=1}^{R} \log\left( \sqrt{2\pi}\, \sigma_{ir} \right) - \sum_{r=1}^{R} \frac{(y_{tr} - \mu_{ir})^2}{2 \sigma_{ir}^2} \qquad (35)$$

For ease of exposition let μ^c denote the mean vector of the Gaussian and Σ^c its variance matrix, the superscript c standing for the cepstral domain. Because the dimensions of the cepstral feature are approximately uncorrelated, Σ^c can be taken diagonal. The log-likelihood of the R-dimensional feature vector y_t^c under this Gaussian model is

$$\log N(y_t^c; \mu^c, \Sigma^c) = C(\Sigma^c) - \tfrac{1}{2}\, d^{cT} (\Sigma^c)^{-1} d^c, \qquad d^c = y_t^c - \mu^c \qquad (36)$$

where C(Σ^c) is a constant term independent of y_t^c; as can be seen, C(Σ^c) corresponds to the term -Σ_{r=1}^{R} log(√(2π) σ_ir) of Eq. (35), and ½ d^{cT}(Σ^c)^{-1} d^c corresponds to Σ_{r=1}^{R} (y_tr - μ_ir)²/(2σ_ir²).

Feature weighting is very intuitive in the log-spectral domain, where it reads

$$\log^* N(y_t^l; \mu^l, \Sigma^l) = C(\Sigma^l) - \tfrac{1}{2}\, d^{lT} W^T (\Sigma^l)^{-1} W\, d^l, \qquad d^l = y_t^l - \mu^l \qquad (37)$$

where y_t^l is the M-dimensional log-spectral feature of frame t, the weight matrix is W = diag{w_1(t), w_2(t), ..., w_M(t)} with elements w_m(t), the weights of the log-spectral feature dimensions, and the superscript l denotes the log-spectral domain.

Combining Eqs. (36) and (37) gives the feature-weighted log-likelihood formula in the cepstral domain:

$$\log^* N(y_t^c; \mu^c, \Sigma^c) = C(\Sigma^c) - \tfrac{1}{2}\, d^{lT} W^T Tr^T (\Sigma^c)^{-1} Tr\, W\, d^l, \qquad d^l = Tr^{-1} \big( y_t^c - \mu^c \big) \qquad (38)$$

The meaning of Eq. (38) is: first compute, in the cepstral domain, the difference vector between the cepstral feature and the state mean; transform it to the log-spectral domain and weight it; then transform it back to the cepstral domain for recognition. Tr denotes the DCT and cepstral weighting of Fig. 4, i.e. the linear transform from the log-spectral feature to the MFCC feature:

$$Tr = DCT \times \mathrm{Diag}\left\{ 1,\; 1 + \frac{L}{2}\sin\frac{\pi}{L},\; \ldots,\; 1 + \frac{L}{2}\sin\frac{\pi (R-1)}{L},\; 0, \ldots, 0 \right\} \qquad (39)$$

the product of the DCT matrix and the cepstral-weighting diagonal matrix, the first R diagonal elements of which are the cepstral weighting coefficients while the remaining M - R are 0. Tr^{-1} is its inverse transform, expressible as

$$Tr^{-1} = \mathrm{Diag}\left\{ 1,\; \frac{1}{1 + \frac{L}{2}\sin\frac{\pi}{L}},\; \ldots,\; \frac{1}{1 + \frac{L}{2}\sin\frac{\pi (R-1)}{L}},\; 0, \ldots, 0 \right\} \times DCT^{-1} \qquad (40)$$

Specifically, because the MFCC dimension R used in recognition is smaller than the log-spectral dimension M, we raise the dimension of the MFCC features when performing this conversion, filling the added dimensions with 0; in recognition itself the R-dimensional MFCC features are still used.
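The transform pair and the weighted likelihood of Eq. (38) can be sketched as follows; Eq. (39) is written here in the column-vector convention Tr = Diag{lifter} × DCT (DCT applied first, consistent with Fig. 4), and the zero-padding of cepstral vectors to M dimensions follows the text:

```python
import numpy as np

def build_transforms(M=26, R=13, L=22):
    """Tr and its (pseudo-)inverse between log-spectral and liftered MFCC
    domains, a sketch of Eqs. (39)-(40); dimensions beyond R are zeroed."""
    r, m = np.arange(M), np.arange(1, M + 1)
    dct = np.cos(np.pi * np.outer(r, 2 * m - 1) / (2 * M))
    alpha = np.full(M, np.sqrt(2.0 / M)); alpha[0] = np.sqrt(1.0 / M)
    dct = alpha[:, None] * dct              # orthonormal DCT, so DCT^-1 = DCT^T
    lifter = np.zeros(M)
    lifter[:R] = 1.0 + (L / 2.0) * np.sin(np.pi * np.arange(R) / L)
    Tr = np.diag(lifter) @ dct              # Eq. (39), column-vector form
    inv_l = np.zeros(M)
    inv_l[:R] = 1.0 / lifter[:R]
    Tr_inv = dct.T @ np.diag(inv_l)         # Eq. (40)
    return Tr, Tr_inv

def weighted_log_likelihood(y_c, mu_c, var_c, w, Tr, Tr_inv, R=13):
    """Feature-weighted Gaussian log-likelihood, a sketch of Eq. (38);
    y_c and mu_c are M-dimensional, zero-padded past dimension R."""
    d_l = Tr_inv @ (y_c - mu_c)             # cepstral difference -> log-spectral
    d_c = Tr @ (w * d_l)                    # weight by W, return to cepstral
    const = -0.5 * np.sum(np.log(2.0 * np.pi * var_c[:R]))
    return const - 0.5 * np.sum(d_c[:R] ** 2 / var_c[:R])
```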
Under the HMM recognition framework, the mismatch between the test and training environments brought by noise can be mapped onto three spaces: signal, feature, and model, and noise-robust speech recognition methods are likewise considered from these three directions. The present invention fuses the noise-robust techniques of all three spaces into a multi-space signal-processing scheme, aiming to further improve recognition accuracy under low-SNR additive noise.
Some error remains between the clean-speech estimate obtained by MMSE processing of the noisy speech and the true value; we call it the residual noise:

$$\hat{d}(n) = \hat{x}(n) - x(n) \qquad (41)$$

where d̂(n) and x(n) denote the residual noise and the clean speech at sample n, and x̂(n) denotes the estimate of x(n). To remove the mismatch of training and test environments brought by this residual noise, we compensate the clean-speech-trained models in the model space. The residual noise after MMSE enhancement retains a certain quasi-stationarity, so it can be described by an HMM with a single Gaussian state. We adopt the Log-Add method in the model space and compensate only the state means of the clean-trained models, which greatly reduces computational complexity without affecting the recognition rate; moreover, the residual noise need not be model-trained, only its feature mean need be estimated. All of this aids real-time implementation of the scheme.

Because the residual noise is present in every speech frame while speech exists only in non-noise frames, for noise frames D̂(k, t) = X̂(k, t), where X̂(k, t) is the estimated clean-speech spectral amplitude at bin k of frame t and D̂(k, t) is the residual-noise spectral amplitude there; that is, in every noise frame the short-time spectral amplitude of the residual noise equals that of the enhanced speech. Using the noise-frame detection information obtained during the signal-space MMSE enhancement, the MFCC features extracted from all enhanced noise frames are averaged, yielding the residual-noise feature mean used for Log-Add model compensation:

$$\mu_n^c(r) = \frac{1}{|\Omega_n|} \sum_{t \in \Omega_n} c(r, t), \qquad r = 1, \ldots, R \qquad (42)$$

where Ω_n is the set of noise frames.
The multi-space fused noise-robust speech recognition technique proposed by the present invention can be summarized as follows:
The MMSE method is chosen for front-end enhancement of the noisy speech. First, its computational complexity is low, allowing real-time processing; second, the side information it produces during processing (the gain coefficients) allows the log-spectral feature weights to be estimated rather accurately, reflecting the mismatch of each feature dimension; finally, MMSE enhancement damages the speech relatively little, and the residual noise after processing keeps its original quasi-stationarity, which benefits the subsequent model compensation.
The feature weighting algorithm proposed above is chosen: the spectral-amplitude gains obtained from the signal-space enhancement algorithm are used to estimate the log-spectral feature weights, and this weight information is introduced into the recognition process.
The log-spectrum addition (Log-Add) compensation algorithm, of lower complexity, is chosen: the MFCC mean components of the clean-speech models and of the noise model are added in the log-spectral domain, giving the log-spectral means of the compensated noisy-speech models. Compared with the classical parallel model compensation (PMC) algorithm it compensates only the means of the models, not the variances, so its computation is far smaller than PMC while reaching essentially the same recognition accuracy. Moreover, Log-Add can compensate not only the static MFCC means but also the dynamic and higher-order ones:

$$\hat{\mu}_m^l = \mu_m^l + \log\left( 1 + \exp( \mu_{nm}^l - \mu_m^l ) \right) \qquad (43)$$

$$\Delta \hat{\mu}_m^l = \frac{ \Delta \mu_m^l }{ 1 + \exp( \mu_{nm}^l - \mu_m^l ) } \qquad (44)$$

where μ̂^l and Δμ̂^l are the static and dynamic state means of the compensated model in the log-spectral domain; μ^l and Δμ^l are the static and dynamic state means, in the log-spectral domain, of the model trained on clean speech; μ_n^l is the feature mean of the residual noise; the superscript l denotes the log-spectral domain and the subscript m the m-th feature dimension.

Because the clean-trained state means and the residual-noise feature mean are obtained in the MFCC cepstral domain, while the actual model compensation is carried out in the log-spectral domain, the conversion between log-spectral and MFCC features is needed here too, exactly as in the feature weighting algorithm: the low-dimensional MFCC features are raised in dimension and converted with the linear transforms Tr and Tr^{-1}, i.e. μ^l = Tr^{-1} μ^c, Δμ^l = Tr^{-1} Δμ^c, μ_n^l = Tr^{-1} μ_n^c, where μ^c, Δμ^c and μ_n^c are the MFCC cepstral features corresponding to μ^l, Δμ^l and μ_n^l, the superscript c denoting the MFCC cepstral domain. Finally the static and dynamic MFCC means of each model state after residual-noise compensation are obtained as μ̂^c = Tr μ̂^l, Δμ̂^c = Tr Δμ̂^l.
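A sketch of the compensation of Eqs. (43)-(44), reusing the Tr / Tr^{-1} pair built in the earlier sketch; all vectors are M-dimensional, with cepstral means zero-padded past dimension R as the text prescribes:

```python
import numpy as np

def log_add_compensate(mu_c, dmu_c, mu_n_c, Tr, Tr_inv):
    """Log-Add compensation of clean-trained state means, Eqs. (43)-(44):
    map to the log-spectral domain, compensate, map back to the MFCC domain."""
    mu_l = Tr_inv @ mu_c                     # clean static mean, log-spectral
    dmu_l = Tr_inv @ dmu_c                   # clean dynamic mean
    mu_n_l = Tr_inv @ mu_n_c                 # residual-noise mean
    mu_hat_l = np.logaddexp(mu_l, mu_n_l)    # = mu + log(1 + exp(mu_n - mu)), Eq. (43)
    dmu_hat_l = dmu_l / (1.0 + np.exp(mu_n_l - mu_l))      # Eq. (44)
    return Tr @ mu_hat_l, Tr @ dmu_hat_l     # back to the cepstral domain
```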
For simplicity and convenience of computation, we describe the residual-noise model with a single-Gaussian-state HMM, and model compensation needs only the feature mean of the residual noise, so no off-line training of a noise model is required, which aids the real-time operation of the recognition scheme.
The algorithm flow of the MMSE-FW-LA scheme is shown in Fig. 8:

1. Input the noisy speech and the speech models obtained by clean-speech training; frame and window the noisy speech and transform it to the frequency domain by FFT.

2. Detect the speech pauses, i.e. the noise segments, and estimate the power-spectral amplitude of the noise.

3. Estimate the clean-speech short-time spectral amplitude with the MMSE method, keeping the spectral-amplitude gain coefficients.

4. Use the spectral-amplitude gain coefficients from the previous step to compute the log-spectral feature weights.

5. Extract MFCC features from the enhanced short-time spectral amplitudes of step 3, i.e. from the clean-speech spectral-amplitude estimates.

6. Using the silence-segment information from step 2 and the enhanced-speech MFCCs from the previous step, compute the MFCC feature mean of the residual noise.

7. In the model space, compensate the clean-trained speech models for residual noise by the Log-Add method, using the residual-noise MFCC feature mean from the previous step.

8. Feed the enhanced-speech MFCC parameters from step 5, the residual-noise-compensated speech models from the previous step, and the log-spectral feature weights from step 4 into the feature-weighted recognition decoder.

9. Obtain the recognition result.
The invention is characterized by containing the following steps in sequence:
(1). Initialize the tap coefficients H_m(k) of the Mel filterbank at each linear frequency bin k, and the transition matrices Tr and Tr^{-1} between the log-spectral features and the MFCC (Mel-frequency cepstral coefficient) features, where k = 1, 2, ..., N_fft/2, N_fft being the number of FFT frequency bins, and m = 1, 2, ..., M, M being the number of Mel filters.
(2). Input the noisy speech and the model parameters obtained by training on clean speech:

μ^c: the static feature means of the model states in the MFCC cepstral domain, obtained by clean-speech training;

Δμ^c: the dynamic feature means of the model states in the MFCC cepstral domain, obtained by clean-speech training;
(3). Framing and windowing:

Let y(n) be the sampled raw speech and h(n) the coefficient of the Hamming window at sample n,

$$h(n) = 0.54 - 0.46 \cos\left( \frac{2\pi n}{N-1} \right), \qquad n = 1, \ldots, N$$

with N the frame length; the framed raw speech is

$$y(n, t) = y\left( \frac{N \times (t-1)}{2} + n \right), \qquad n = 1, \ldots, N$$

with t the frame index; and after the Hamming window the raw speech is

$$y_w(n, t) = y(n, t) \times h(n), \qquad n = 1, \ldots, N$$
(4). Fast Fourier transform (FFT):

Since the short-time spectrum plays a decisive role in the perception of speech, each frame is transformed to the spectral domain by the FFT:

$$Y(k, t) = |Y(k, t)|\, e^{j \angle Y(k, t)} = \mathrm{FFT}\{ y_w(n, t) \}, \qquad k = 1, \ldots, N_{fft}$$

where N_fft is the number of FFT points.
(5). Noise-frame detection and noise spectral-amplitude estimation:

(5.1). Take the first 10 frames of the noisy speech as the initial noise segment; input the short-time spectral amplitude of the current noisy-speech frame t.

(5.2). If the current frame belongs to the initial noise segment, the noise power-spectral-amplitude estimate over the first t frames is

$$\tilde{D}_p(k, t) = \left[ \sum_{s=1}^{t} Y(k, s) \Big/ t \right]^2$$

and when the current frame is the 10th, output the initial noise spectral-amplitude estimate

$$N(k) = \sum_{s=1}^{10} Y(k, s) \Big/ 10$$

and compute the decision threshold x separating noise frames from noisy-speech frames:

$$x = \max_{t=1,2,\ldots,10} \left\{ \sum_{k=1}^{N_{fft}/2+1} \mathrm{Pow}\left[ Y(k,t)/N(k),\, 5 \right] \right\}$$

(5.3). If the current frame is not in the initial noise segment, compute the decision value of the current frame t:

$$\rho = \sum_{k=1}^{N_{fft}/2+1} \mathrm{Pow}\left[ Y(k,t)/N(k),\, 5 \right]$$

(5.3.1). If ρ < x, judge the frame a noise frame of the noisy speech, with noise power-spectral-amplitude estimate

$$\tilde{D}_p(k, t) = 0.98 \times \tilde{D}_p(k, t-1) + 0.02 \times Y_p(k, t)$$

and output it;

(5.3.2). If ρ ≥ x, judge the frame a non-noise frame, i.e. a speech frame containing noise, with noise power-spectral amplitude

$$\tilde{D}_p(k, t) = \tilde{D}_p(k, t-1)$$

and output it.
(6). Using the spectral-amplitude gain coefficient G(k, t), which depends on the a priori SNR ζ and the a posteriori SNR γ, compute the estimate of the clean-speech short-time spectral amplitude and the weight w_m(t) of the m-th log-spectral feature of frame t:

(6.1). Input the short-time spectral amplitude of the current noisy-speech frame t.

(6.2). Compute the a posteriori SNR of bin k of the current frame,

$$\gamma(k, t) = Y_p(k, t) \big/ \tilde{D}_p(k, t)$$

where Y_p(k, t) is the power-spectral amplitude of the noisy speech and D̃_p(k, t) the estimated noise power-spectral amplitude.

(6.2.1). If the current frame is t = 1, initialize the a priori SNR of bin k as ζ(k, t) = 0.1;

(6.2.2). If t > 1, estimate the a priori SNR of bin k by a running average of the previous frame's a priori SNR and the current a posteriori SNR:

$$\zeta(k, t) = 0.98 \times \zeta(k, t-1) + 0.02 \times \left[ \gamma(k, t) - 1 \right]$$

(6.3). The spectral-amplitude gain of bin k of frame t is

$$G(k, t) = \frac{1}{2} \sqrt{ \frac{\pi\, \zeta(k, t)}{\gamma(k, t)\,\big(1 + \zeta(k, t)\big)} }\;\; \Psi\!\left( -0.5;\; 1;\; -\frac{\gamma(k, t)\, \zeta(k, t)}{1 + \zeta(k, t)} \right)$$

computed by series summation,

$$\Psi(a_1; a_2; a_3) = 1 + \frac{a_1}{a_2} \frac{a_3}{1!} + \frac{a_1 (a_1+1)}{a_2 (a_2+1)} \frac{a_3^2}{2!} + \cdots$$

with a_1 = -0.5, a_2 = 1, a_3 = -γ(k,t) ζ(k,t) / (1 + ζ(k,t)).

(6.4). The corresponding estimate of the clean-speech short-time spectral amplitude is

$$\hat{X}(k, t) = G(k, t)\, Y(k, t)$$

(6.5). Recompute the a priori SNR of bin k of the current frame:

$$\zeta(k, t) = |\hat{X}(k, t)|^2 \big/ \tilde{D}_p(k, t)$$

(6.6). Steps (6.2)-(6.5) having been carried out for every bin k of the current frame t (1 ≤ k ≤ N_fft/2 + 1), the values G(k, t), X̂(k, t), and ζ(k, t) are all available.

(6.7). Compute the weight of the m-th log-spectral feature of the current frame t:

$$w_m(t) = \sum_{k=1}^{N_{fft}/2} G(k, t) H_m(k) \Big/ \sum_{k=1}^{N_{fft}/2} H_m(k)$$

(6.8). Compute all M log-spectral feature weights of the current frame, M being the dimension of the log-spectral feature.

(6.9). When X̂(k, t) and w_m(t) have been computed for every frame t = 1, 2, ..., T:

(6.10). Output all clean-speech short-time spectral-amplitude estimates X̂(k, t) and log-spectral feature weights w_m(t).
(7). MFCC feature extraction:

(7.1). Input the clean-speech short-time spectral-amplitude estimates X̂(k, t);

(7.2). Compute the power spectrum: X̂_p(k, t) = |X̂(k, t)|², k = 1, ..., N_fft;

(7.3). Mel filtering:

$$MBank(m, t) = \sum_{k=1}^{N_{fft}/2} H_m(k) \times \hat{X}_p(k, t), \qquad m = 1, \ldots, M$$

(7.4). Log-spectral features: FBank(m, t) = log(MBank(m, t)), m = 1, ..., M;

(7.5). DCT cepstral representation:

$$\tilde{c}(r, t) = \alpha(r) \sum_{m=1}^{M} FBank(m, t) \cos\left( \frac{\pi (2m-1)(r-1)}{2M} \right), \qquad r = 1, \ldots, M$$

where α(1) = √(1/M) and α(r) = √(2/M) for r = 2, ..., M, keeping the first R dimensions of the feature vector;

(7.6). Cepstral weighting: c(r, t) = Lifter(r) × c̃(r, t), r = 1, ..., R, where Lifter(r) = 1 + (L/2) sin(π(r-1)/L) and L is the weighting-filter width;

(7.7). Compute the dynamic coefficients:

$$\Delta c(r, t) = \sum_{\Delta t = -2}^{2} \Delta t \cdot c(r, t + \Delta t) \Big/ 10$$

with Δt the frame offset;

(7.8). Output c(r, t) and Δc(r, t).
(8). Check whether the input of the utterance to be recognized has finished, i.e. whether t = T.
(9). If the input of the utterance to be recognized has finished, compute the static MFCC feature mean of the noise frames, i.e. of the residual noise, which is defined as

$$\hat{d}(n) = \hat{x}(n) - x(n)$$

where x(n) denotes the clean speech at sample n and x̂(n) its estimate after enhancement. Because the residual noise is present in every speech frame while speech exists only in non-noise frames, for noise frames D̂(k, t) = X̂(k, t), i.e. the short-time spectral amplitude of the residual noise equals that of the enhanced speech in every noise frame, and the static MFCC feature mean of the residual noise can be computed as

$$\mu_n^c(r) = \frac{1}{|\Omega_n|} \sum_{t \in \Omega_n} c(r, t), \qquad r = 1, 2, \ldots, R$$

where the set of noise frames Ω_n comprises the initial 10 frames and the frames later judged to be noise.
(10). Log-Add log-spectral additive model compensation:

(10.1). Input the MFCC feature mean of the residual noise and transform it to the log-spectral domain: μ_n^l = Tr^{-1} μ_n^c;

(10.2). Input the state means of the clean-speech-trained models and transform them to the log-spectral domain: μ^l = Tr^{-1} μ^c, Δμ^l = Tr^{-1} Δμ^c;

(10.3). Log-Add model compensation:

$$\hat{\mu}_m^l = \mu_m^l + \log\left( 1 + \exp( \mu_{nm}^l - \mu_m^l ) \right), \qquad m = 1, 2, \ldots, M$$

$$\Delta \hat{\mu}_m^l = \frac{ \Delta \mu_m^l }{ 1 + \exp( \mu_{nm}^l - \mu_m^l ) }$$

(10.4). Transform the compensated model states back to the MFCC cepstral domain: μ̂^c = Tr μ̂^l, Δμ̂^c = Tr Δμ̂^l;

(10.5). When the state input finishes, output the residual-noise-compensated speech models.

(11). Feature-weighted Viterbi recognition decoding:
(11.1). Input the residual-noise-compensated speech models, the MFCC features of the current enhanced-speech frame, and the log-spectral feature weights w_m(t);

(11.2). Compute the log-likelihood of the observed frame under each candidate state:

(11.2.1). Compute, in the MFCC cepstral domain, the difference vector between the MFCC feature and the state mean of the candidate state: d^c = y_t^c - μ^c;

(11.2.2). Transform the difference vector to the log-spectral feature domain: d^l = Tr^{-1} d^c;

(11.2.3). Weight it in the log-spectral domain and transform it back to the MFCC cepstral domain: d̃^c = Tr W d^l;

(11.2.4). Compute the log-likelihood:

$$\log p\big( y_t^c \mid q(t) = i \big) = C(\Sigma^c) - \tfrac{1}{2}\, \tilde{d}^{cT} (\Sigma^c)^{-1} \tilde{d}^c$$

where Σ^c is the state variance matrix in the cepstral domain, a diagonal matrix Σ^c = Diag{σ_{i1}², σ_{i2}², ..., σ_{iR}²}, c denoting the cepstral domain and i the state; C(Σ^c) is the constant term independent of y_t^c, corresponding to -Σ_{r=1}^{R} log(√(2π) σ_ir), with R the dimension of the cepstral feature.

(11.3). After initializing the Viterbi decoding, iterate over the frames t = 1, 2, ..., T;

(11.4). Compute the maximum probability p* = max_{1≤i≤U} δ_T(i) and the final state of the optimal path q̂(T) = argmax_{1≤i≤U} δ_T(i);

(11.5). Output the other states on the optimal path in turn by backtracking: q̂(t) = ψ_{t+1}(q̂(t+1)), t = T-1, ..., 1.
(12). Output the recognition result and finish. Use has proved that the method reaches its preset target.
Description of drawings
Fig. 1: application of the HMM in speech recognition.

Fig. 2: environmental noise model.

Fig. 3: mismatch between training and recognition.

Fig. 4: MFCC feature extraction process.

Fig. 5: structure of the Mel filterbank.

Fig. 6: block diagram of speech enhancement based on STSA estimation.
Fig. 7: log-spectral feature mismatch and weights under white noise at 0 dB SNR:

a: 26-dimensional log-spectral vector;

b: weights of the 26-dimensional log-spectral vector.
Fig. 8: MMSE-FW-LA scheme algorithm flow chart.
Fig. 9: MMSE-LA scheme main program flow chart.
Figure 10: MMSE-FW-LA scheme main program flow chart.
Figure 11: flow chart of the noise-segment detection / noise power-spectral-amplitude estimation kernel routine.

Figure 12: flow chart of the speech enhancement and feature-weight computation kernel routine.

Figure 13: block diagram of the MFCC feature extraction algorithm.

Figure 14: flow chart of the Log-Add model compensation kernel routine.

Figure 15: flow chart of the feature-weighted Viterbi recognition decoding kernel routine.
Figure 16: comparison of the anti-noise recognition performance of front-end MMSE enhancement, feature weighting, and Log-Add model compensation under low-SNR white noise.

Figure 17: comparison of the anti-noise recognition accuracy of feature weighting fused with front-end MMSE enhancement and with Log-Add model compensation, respectively, under low-SNR white noise.

Figure 18: comparison of the anti-noise recognition performance of the MMSE-FW-LA and MMSE-LA schemes under low-SNR white noise.

Figure 19: comparison of the anti-noise recognition performance of front-end MMSE enhancement, feature weighting, and Log-Add model compensation under low-SNR car noise.

Figure 20: comparison of the anti-noise recognition accuracy of feature weighting fused with front-end MMSE enhancement and with Log-Add model compensation, respectively, under low-SNR car noise.

Figure 21: comparison of the anti-noise recognition performance of the MMSE-FW-LA and MMSE-LA schemes under low-SNR car noise.
As can be seen from Figs. 9 and 10, the main program flows of the MMSE-FW-LA and MMSE-LA schemes are essentially identical; MMSE-FW-LA merely adds the log-spectral feature-weight computation module and uses the feature-weighted Viterbi decoder during recognition. The overall algorithm comprises five kernel modules: noise-frame detection and noise power-spectral-amplitude estimation; MMSE speech enhancement and log-spectral feature-weight estimation; MFCC feature extraction; Log-Add model compensation; and the feature-weighted Viterbi decoding algorithm.
Figure 11 gives the flow chart of the noise-frame decision and noise short-time spectral-amplitude estimation module. Its input is the short-time spectral amplitude of the current noisy-speech frame; its outputs are the noise-frame decision result and the noise power-spectral amplitude updated with the current frame's estimate. Noise-frame detection uses an energy-based method.
Because the noisy speech to be recognized always begins with a silent segment, we judge the first 10 frames to be noise frames, with noise power-spectral-amplitude estimate

$$\tilde{D}_p(k, t) = \left[ \sum_{s=1}^{t} Y(k, s) \Big/ t \right]^2 \qquad (45)$$

where Y(k, s) denotes the input noisy-speech short-time spectral amplitude. The decision threshold is then computed as

$$x = \max_{t=1,2,\ldots,10} \left\{ \sum_{k=1}^{N_{fft}/2+1} \mathrm{Pow}\left[ Y(k,t)/N(k),\, 5 \right] \right\} \qquad (46)$$

where N(k) = Σ_{s=1}^{10} Y(k, s)/10 denotes the roughly estimated noise spectral amplitude and the function Pow(x_1, x_2) = x_1^{x_2}. From the 11th frame on, a noise-frame detection decision is needed; first compute the decision parameter

$$\rho = \sum_{k=1}^{N_{fft}/2+1} \mathrm{Pow}\left[ Y(k,t)/N(k),\, 5 \right] \qquad (47)$$

If ρ < x, the frame is judged a noise frame, and the noise power-spectral amplitude must be re-estimated:

$$\tilde{D}_p(k, t) = 0.98 \times \tilde{D}_p(k, t-1) + 0.02 \times Y_p(k, t) \qquad (48)$$

i.e. a smoothed estimate with coefficient 0.98. If ρ ≥ x, the noise power-spectral amplitude is not re-estimated:

$$\tilde{D}_p(k, t) = \tilde{D}_p(k, t-1) \qquad (49)$$
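Transcribed directly into a frame loop (array shapes and the boolean output mask are illustrative conventions of ours), Eqs. (45)-(49) read:

```python
import numpy as np

def detect_noise_frames(Ymag, n_init=10, p=5, alpha=0.98):
    """Energy-based noise-frame detection and noise-PSD tracking,
    a sketch of Eqs. (45)-(49). Ymag: (T, K) short-time spectral amplitudes.
    Returns per-frame noise power-spectrum estimates and a noise mask."""
    T, K = Ymag.shape
    N = Ymag[:n_init].mean(axis=0)                       # rough noise spectrum
    x = max(np.sum((Ymag[t] / N) ** p) for t in range(n_init))   # Eq. (46)
    Dp = np.empty((T, K))
    is_noise = np.zeros(T, dtype=bool)
    for t in range(T):
        if t < n_init:                                   # initial segment, Eq. (45)
            Dp[t] = Ymag[: t + 1].mean(axis=0) ** 2
            is_noise[t] = True
        else:
            rho = np.sum((Ymag[t] / N) ** p)             # Eq. (47)
            if rho < x:                                  # noise frame, Eq. (48)
                Dp[t] = alpha * Dp[t - 1] + (1 - alpha) * Ymag[t] ** 2
                is_noise[t] = True
            else:                                        # speech frame, Eq. (49)
                Dp[t] = Dp[t - 1]
    return Dp, is_noise
```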
Figure 12 gives the flow chart of the MMSE speech enhancement and log-spectral feature-weight estimation module. Its input is the short-time spectral amplitude of the current noisy-speech frame; its outputs are the short-time spectral amplitude of the enhanced speech, i.e. the estimate of the clean-speech short-time spectral amplitude, and the log-spectral feature weights. Computing the short-time spectral-amplitude gain coefficient at each bin requires the a priori and the a posteriori SNR of the noisy speech, as in Eq. (31):

$$G(k, t) = \frac{1}{2} \sqrt{ \frac{\pi\, \zeta(k, t)}{\gamma(k, t)\,\big(1 + \zeta(k, t)\big)} }\;\; \Psi\!\left( -0.5;\; 1;\; -\frac{\gamma(k, t)\, \zeta(k, t)}{1 + \zeta(k, t)} \right) \qquad (31)$$

In actual operation the a priori SNR can be estimated by a running average,

$$\zeta(k, t) = 0.98 \times \zeta(k, t-1) + 0.02 \times \left[ \gamma(k, t) - 1 \right] \qquad (50)$$

while the a posteriori SNR can be computed directly,

$$\gamma(k, t) = Y_p(k, t) \big/ \tilde{D}_p(k, t) \qquad (51)$$

using the estimated noise power-spectral amplitude; see Figure 11.
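Tying Eqs. (50)-(51) to the gain of Eq. (31), the per-frame enhancement loop can be sketched as below, reusing mmse_gain() from the earlier sketch; the small floor on ζ is a numerical safeguard of ours, not from the text:

```python
import numpy as np

def enhance_frames(Yp, Ymag, Dp, alpha=0.98, zeta0=0.1):
    """MMSE enhancement loop: a posteriori SNR (Eq. 51), decision-directed
    a priori SNR (Eq. 50), gain (Eq. 31), enhanced amplitude (Eq. 30),
    with zeta re-estimated from the enhanced frame as in step (6.5)."""
    T, K = Yp.shape
    zeta_prev = np.full(K, zeta0)                # initialization, step (6.2.1)
    X_hat, gains = np.empty((T, K)), np.empty((T, K))
    for t in range(T):
        gamma = Yp[t] / Dp[t]                    # Eq. (51)
        zeta = alpha * zeta_prev + (1 - alpha) * (gamma - 1.0)   # Eq. (50)
        zeta = np.maximum(zeta, 1e-6)            # numerical floor (assumption)
        gains[t] = mmse_gain(zeta, gamma)        # Eq. (31)
        X_hat[t] = gains[t] * Ymag[t]            # Eq. (30)
        zeta_prev = X_hat[t] ** 2 / Dp[t]        # re-estimation, step (6.5)
    return X_hat, gains
```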
Figure 13 gives the program flow chart of the MFCC feature extraction module; its input is the spectral amplitude of the enhanced speech, and its output is the enhanced-speech MFCC parameters.
Figure 14 gives the flow chart of the Log-Add model compensation kernel; its inputs are the clean-speech-trained models and the residual-noise MFCC feature mean, and its output is the residual-noise-compensated speech models.
Figure 15 gives the flow chart of the feature-weighted Viterbi recognition decoding kernel; its inputs are the MMSE-enhanced speech features, the log-spectral feature weights, and the residual-noise-compensated speech models, and its output is the recognition result.
The content of the present invention mainly concerns noise-robust speech recognition under strong background noise; the recognition system targets speaker-independent continuous digit strings. The concrete experiments are described below:
Baseline system (Baseline)
To ease comparison of the experimental results, we first built a continuous speech recognition system consisting of three modules: MFCC feature extraction, a training module and a recognition module.
The baseline system adopts the 26-dimensional MFCC_0_D feature, where MFCC denotes the static cepstral coefficients other than c(1,t), 0 denotes c(1,t), which reflects the speech energy, and D denotes the first-order (delta) cepstra computed from the static MFCC_0 features. The MFCC parameters are set as follows (a short framing sketch is given after the parameter list):
The short-time frame length is 20 ms, i.e. N = 320 samples; frames overlap by 10 ms, i.e. 160 samples. The FFT size is N_fft = 512.
The number of Mel filters is M = 26.
The number of static MFCC parameters is R = 13.
The cepstral weighting (liftering) coefficient is L = 22.
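With these parameters, the framing, Hamming windowing and FFT of the front end can be sketched as follows (a sketch with illustrative names; 16 kHz input and the standard 0-based Hamming indexing are assumed):

```python
import numpy as np

def frame_spectra(y, N=320, shift=160, n_fft=512):
    """Short-time spectral amplitudes: 20 ms frames (N = 320 samples at 16 kHz),
    10 ms shift (160 samples), 512-point FFT."""
    h = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(N) / (N - 1))  # Hamming window
    T = 1 + (len(y) - N) // shift                                 # number of frames
    Y = np.empty((n_fft // 2 + 1, T))
    for t in range(T):
        frame = y[t * shift : t * shift + N] * h
        Y[:, t] = np.abs(np.fft.rfft(frame, n_fft))   # short-time spectral amplitude
    return Y
```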
Since this is small-vocabulary continuous speech recognition, the baseline system adopts 12 continuous-density, left-to-right HMM word models without state skips ('one'~'nine', 'oh', 'zero' and 'sil'); each model has 8 states, and the output probability of each state is approximated by a single diagonal-covariance multidimensional Gaussian distribution.
Speech database
The speech database used for training and testing is TI-Digits, designed by Texas Instruments for training and testing speaker-independent English digit-string recognition systems. It contains 326 speakers (111 men, 114 women, 50 boys and 51 girls), each with 77 digit-string utterances. Training uses 500 utterances from 15 speakers of the TI-Digits corpus; recognition testing uses 100 utterances from 4 speakers not involved in the training. The speech is sampled at 16 kHz with 16-bit resolution.
Noise database
The noise used in the experiments comes from the Noise-92 database; noisy speech is obtained by superimposing noise at signal-to-noise ratios from -5 dB to 15 dB in 5 dB steps. The noise data are also sampled at 16 kHz with 16-bit resolution. The signal-to-noise ratio (SNR) is computed as:

$$\mathrm{SNR} = 10\log_{10}\!\left(\frac{P_s}{P_n}\right) \qquad (52)$$

where $P_s$ and $P_n$ are the linear powers of the signal and the noise, respectively.
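A sketch of superimposing noise on speech at a target SNR under this definition (our own helper, not the exact procedure used to build the test sets):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that 10*log10(P_s/P_n) equals snr_db, then add it."""
    noise = noise[: len(speech)]
    p_s = np.mean(speech ** 2)   # linear signal power
    p_n = np.mean(noise ** 2)    # linear noise power
    scale = np.sqrt(p_s / (p_n * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise
```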
Hardware and software platform
The experimental programs run on a Pentium 450 machine with 128 MB of memory under the Windows 2000 operating system. The noise-robust recognition system used in the experiments comprises front-end enhancement, feature extraction, model training, noise compensation, a recognizer and the corresponding performance-evaluation software.
The recognition performance evaluation criterion
For a speech recognition system, the leading performance index is the recognition rate, also called recognition accuracy (Accuracy); other criteria exist as well, such as recognition speed and vocabulary size. Since our experiments concern small-vocabulary connected-word recognition in noisy environments, and their purpose is to evaluate various noise-robust recognition methods, we mainly consider recognition accuracy.
For $W_N$ words to be recognized, if the recognizer produces $W_S$ substitution errors, $W_D$ deletion errors and $W_I$ insertion errors, recognition accuracy (Accuracy) is defined as:

$$\%\mathrm{accuracy} = \frac{W_N - W_D - W_S - W_I}{W_N} \times 100\% \qquad (53)$$
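As a minimal illustration:

```python
def accuracy(n_words, n_sub, n_del, n_ins):
    """Recognition accuracy of Eq. (53), in percent."""
    return 100.0 * (n_words - n_del - n_sub - n_ins) / n_words

# e.g. 300 test words with 20 substitutions, 10 deletions, 5 insertions:
# accuracy(300, 20, 10, 5) -> 88.33...
```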
Under different noises, three comparisons are made:
the feature-weighting algorithm is compared for noise robustness with front-end MMSE enhancement and with Log-Add model compensation; it should be noted that in our feature-weighting algorithm only the static features are weighted;
the feature-space weighting is fused with the signal-space front-end MMSE speech enhancement and with the model-space Log-Add model compensation, and the noise robustness after fusion is analyzed;
the MMSE-FW-LA scheme proposed by the present invention is compared with the MMSE-LA scheme.
White Gaussian noise (white)
Table 1: Recognition accuracy (%) of different methods under white Gaussian noise
Method        -5dB     0dB      5dB      10dB     15dB
Baseline      6        8        13       30       65.33
MMSE          14.67    24       54       80.33    91
FW            24.33    46.33    65       76.67    82
LA            22.33    34.33    67.67    84.67    92.67
MMSE-FW       46.33    76       85       91.67    92.67
FW-LA         32       57.33    77.67    87.33    94.67
MMSE-LA       65.33    79.67    89.33    93       93.33
MMSE-FW-LA    81       86       89       94.33    94
Here, Baseline denotes the recognition accuracy of the baseline system without any anti-noise measures, while MMSE, FW and LA denote front-end MMSE enhancement, feature weighting and Log-Add model compensation, respectively; a hyphen between method names denotes their fusion. We first compare the anti-noise recognition performance of front-end MMSE enhancement, feature weighting and Log-Add model compensation, as shown in Figure 16:
Front-end MMSE enhancement, feature weighting and Log-Add model compensation all improve recognition performance under noise; Log-Add model compensation outperforms front-end MMSE enhancement over the whole SNR range; under strong background noise (SNR < 5 dB), the feature-weighting algorithm outperforms Log-Add model compensation, and in particular at 0 dB SNR recognition accuracy improves by 12%.
Next, the feature-space weighting is fused with front-end MMSE enhancement and with Log-Add model compensation, and their recognition performances are compared, as shown in Figure 17: fusing feature weighting with front-end MMSE enhancement or with Log-Add model compensation clearly improves recognition accuracy over either method alone; the fusion of feature weighting with front-end MMSE enhancement outperforms the fusion with Log-Add model compensation at low SNR (SNR < 15 dB). Comparing the MMSE-FW-LA scheme with the MMSE-LA scheme, as shown in Figure 18:
Both the MMSE-LA and MMSE-FW-LA schemes markedly improve recognition accuracy under noise: at -5 dB SNR, MMSE-LA reaches 65.33% accuracy and MMSE-FW-LA reaches as high as 81%. The MMSE-FW-LA scheme, which fuses noise-robust recognition techniques in the signal, feature and model spaces, outperforms the MMSE-LA scheme, which fuses only the signal and model spaces; the lower the SNR, the more obvious the advantage of multi-space fusion. At -5 dB SNR, for example, the accuracy of MMSE-FW-LA is 15% higher than that of MMSE-LA.
Automobile noise (leopard)
Table 2: Recognition accuracy (%) of different methods under automobile noise
Method        -5dB     0dB      5dB      10dB     15dB
Baseline      0.67     17.67    41.33    60.67    80
MMSE          48       73       87       95       96
FW            24       29.33    42       69       84.67
LA            55.33    77.67    93.67    95.33    97.33
MMSE-FW       51.33    74.33    88.67    95.33    96.33
FW-LA         74       86       94       97       97
MMSE-LA       85.67    91.67    95       95.33    96
MMSE-FW-LA    87.67    94.33    96.33    96.67    97.33
As before, Baseline denotes the recognition accuracy of the baseline system without any anti-noise measures; MMSE, FW and LA denote front-end MMSE enhancement, feature weighting and Log-Add model compensation, respectively; a hyphen between method names denotes their fusion.
We first compare the anti-noise recognition performance of front-end MMSE enhancement, feature weighting and Log-Add model compensation, as shown in Figure 19: front-end MMSE enhancement and Log-Add model compensation both clearly improve recognition accuracy, and Log-Add model compensation outperforms front-end MMSE enhancement over the whole SNR range;
Compared with front-end MMSE enhancement and Log-Add model compensation, feature weighting shows a clear drop in accuracy. The main reason is that the feature-weighting algorithm does not handle the unvoiced segments effectively; as a result, the strongly fluctuating automobile noise introduces many insertion errors in those segments, lowering the recognition rate;
Compared with the baseline system (Baseline), feature weighting still improves recognition performance, showing that for voiced segments the feature-weight estimation and weighting are effective. We then compare the anti-noise performance after fusing the feature-space weighting with front-end MMSE enhancement and with Log-Add model compensation, as shown in Figure 20: fusing with feature-space weighting clearly improves the anti-noise performance of both front-end MMSE enhancement and Log-Add model compensation; at -5 dB SNR, accuracy improves by 3.33% and 18.67%, respectively. Below 10 dB SNR, the fusion of Log-Add model compensation with feature weighting outperforms the fusion with front-end MMSE enhancement, which differs from the situation under white Gaussian noise. Comparing the MMSE-FW-LA scheme with the MMSE-LA scheme, as shown in Figure 21:
Both the MMSE-LA and MMSE-FW-LA schemes markedly improve recognition accuracy below 5 dB SNR; at -5 dB SNR, for example, both achieve recognition rates above 80%. Above 5 dB SNR they also moderately improve recognizer performance.
The MMSE-FW-LA scheme, which additionally fuses the feature-space noise-robust recognition technique, further outperforms the MMSE-LA scheme, with an average accuracy improvement of nearly 2% over the -5 dB to 15 dB range. The experimental results show that the feature-weighting algorithm effectively improves recognition accuracy in low-SNR environments, where it outperforms front-end MMSE enhancement and Log-Add model compensation. More importantly, since front-end speech enhancement, feature weighting and model compensation address the noise-induced mismatch in the signal, feature and model spaces respectively, the different methods can be fused to improve the noise robustness of the recognition system as a whole. The MMSE-FW-LA scheme proposed by the present invention fuses multi-space anti-noise recognition techniques and very significantly improves recognition accuracy under strong background noise: at -5 dB SNR, recognition accuracy reaches at least 80% under both white Gaussian noise and automobile noise. As for computational complexity, the front-end enhancement and the feature-weight estimation of the MMSE-FW-LA scheme are combined and use the computationally light MMSE estimator, and the model compensation requires no offline estimation of a noise model; all of this favors real-time processing. The MMSE-FW-LA scheme proposed by the present invention therefore has strong practical value.

Claims (1)

1. A speech enhancement-feature weighting-log-spectral addition method for noise-robust speech recognition, comprising a speech enhancement-feature weighting-log-spectral addition procedure that runs on a computer, characterized in that it comprises the following steps in sequence:
(1). Initialize the tap coefficients $H_m(k)$ of the Mel filter bank at each linear frequency k, and the transformation matrices Tr and Tr^{-1} between the log-spectral features and the MFCC (Mel-frequency cepstral coefficient) features, where k = 1, 2, ..., N_fft/2, N_fft being the number of FFT frequencies, and m = 1, 2, ..., M, M being the number of Mel filters.
(2). Input the noisy speech and the model parameters obtained by training on clean speech:
μ^c: the static feature means of the model states in the MFCC cepstral domain, obtained by training on clean speech;
Δμ^c: the dynamic feature means of the model states in the MFCC cepstral domain, obtained by training on clean speech;
(3). Framing and windowing:
Let the sampled raw speech be y(n). The coefficient of the Hamming window at the n-th sample is $h(n) = 0.54 - 0.46\cos\big(\frac{2\pi n}{N-1}\big)$, n = 1, ..., N, where N is the frame length. The framed raw speech signal is $y(n,t) = y\big(\frac{N\times(t-1)}{2} + n\big)$, n = 1, ..., N, where t denotes the frame index; the Hamming-windowed signal is $y_w(n,t) = y(n,t)\times h(n)$, n = 1, ..., N.
(4). Fast Fourier transform (FFT):
Since the short-time spectrum of speech plays a decisive role in speech perception, the speech is transformed frame by frame to the spectral domain by the FFT: $\dot{Y}(k,t) = Y(k,t)\,e^{\,j\angle\dot{Y}(k,t)} = \mathrm{FFT}\{y_w(n,t)\}$, k = 1, ..., N_fft, where N_fft is the number of FFT points.
(5). Noise-frame detection and noise spectral-amplitude estimation:
(5.1). The first 10 frames of the initial segment of the noisy speech are set to be noise frames; input the short-time spectral amplitude of the current noisy-speech frame t;
(5.2). If the current frame is an initial-segment noise frame, the estimate of the noise power-spectrum amplitude over the first t frames is:

$$\tilde{D}_P(k,t) = \Big[\sum_{s=1}^{t} Y(k,s)/t\Big]^2$$

and, when the current frame is the 10th frame, output the estimate of the initial-segment noise spectral amplitude $N(k) = \sum_{s=1}^{10} Y(k,s)/10$ and compute the decision threshold x used to distinguish noise frames from noisy-speech frames:

$$x = \max_{t=1,2,\ldots,10}\Big\{\sum_{k=1}^{N_{fft}/2+1} \mathrm{Pow}\big[Y(k,t)/N(k),\,5\big]\Big\}$$
(5.3). If the current frame is not an initial-segment noise frame, compute the decision value of the current frame t:

$$\rho = \sum_{k=1}^{N_{fft}/2+1} \mathrm{Pow}\big[Y(k,t)/N(k),\,5\big]$$
(5.3.1). If $\rho < x$, the frame is judged to be a noise frame within the noisy speech, and its noise power-spectrum amplitude estimate is

$$\tilde{D}_p(k,t) = 0.98\times\tilde{D}_p(k,t-1) + 0.02\times Y_p(k,t)$$

which is output;
(5.3.2). If $\rho \ge x$, the frame is judged to be a non-noise frame, i.e. a speech frame containing noise, and its noise power-spectrum amplitude is

$$\tilde{D}_p(k,t) = \tilde{D}_p(k,t-1)$$

which is output;
(6). Using the spectral-amplitude gain coefficient G(k,t), which depends on the a priori SNR ζ and the a posteriori SNR γ, compute the estimate of the clean-speech short-time spectral amplitude and the weight $w_m(t)$ of the m-th log-spectral feature of the corresponding frame t:
(6.1). Input the short-time spectral amplitude of the current noisy-speech frame t;
(6.2). Compute the a posteriori signal-to-noise ratio of the k-th frequency of the current frame t: $\gamma(k,t) = Y_P(k,t)\,/\,\tilde{D}_P(k,t)$, where $Y_P(k,t)$ is the power-spectrum amplitude of the noisy speech and $\tilde{D}_P(k,t)$ is the estimated noise power-spectrum amplitude.
(6.2.1). If the current frame t = 1, initialize the a priori signal-to-noise ratio of the k-th frequency of the current frame as ζ(k,t) = 0.1;
(6.2.2). If the current frame t > 1, use the a priori SNR of the previous frame and the a posteriori SNR of the current frame to obtain the a priori SNR of the k-th frequency of the current frame by a running-average estimate:

$$\zeta(k,t) = 0.98\times\zeta(k,t-1) + 0.02\times[\gamma(k,t)-1]$$
(6.3). The spectral-amplitude gain coefficient of the k-th frequency of the current frame t:

$$G(k,t) = \frac{1}{2}\sqrt{\frac{\pi\,\zeta(k,t)}{\gamma(k,t)\,(1+\zeta(k,t))}}\;\Psi\!\Big(-0.5;\,1;\,-\frac{\gamma(k,t)\,\zeta(k,t)}{1+\zeta(k,t)}\Big)$$

where Ψ is evaluated by series summation:

$$\Psi(a_1;a_2;a_3) = 1 + \frac{a_1}{a_2}\,a_3 + \frac{a_1(a_1+1)}{a_2(a_2+1)}\,\frac{a_3^2}{2!} + \cdots$$

with $a_1 = -0.5$, $a_2 = 1$, $a_3 = -\dfrac{\gamma(k,t)\,\zeta(k,t)}{1+\zeta(k,t)}$.
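For reference, the truncated series can be evaluated as follows (a sketch; the truncation length is our own choice):

```python
def kummer_series(a1, a2, a3, n_terms=30):
    """Truncated series for Psi(a1; a2; a3) of step (6.3):
    1 + (a1/a2) a3 + a1(a1+1)/(a2(a2+1)) a3^2/2! + ..."""
    term, total = 1.0, 1.0
    for j in range(n_terms):
        term *= (a1 + j) * a3 / ((a2 + j) * (j + 1))  # ratio of consecutive terms
        total += term
    return total
```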
(6.4). The corresponding estimate of the clean-speech short-time spectral amplitude is: $\hat{X}(k,t) = G(k,t)\,Y(k,t)$
(6.5). Recompute the a priori signal-to-noise ratio of the k-th frequency of the current frame: $\zeta(k,t) = |\hat{X}(k,t)|^2/\tilde{D}_p(k,t)$
(6.6). The values of G(k,t), $\hat{X}(k,t)$ and ζ(k,t) have now been computed for every frequency k (1 ≤ k ≤ N_fft/2+1) of the current frame t.
(6.7). Compute the weight of the m-th log-spectral feature of the current frame t:

$$w_m(t) = \sum_{k=1}^{N_{fft}/2} G(k,t)\,H_m(k) \Big/ \sum_{k=1}^{N_{fft}/2} H_m(k)$$
(6.8). In total M log-spectral feature weights are computed for the current frame, where M is the dimension of the log-spectral feature;
(6.9). Steps (6.1)-(6.8) are repeated until $\hat{X}(k,t)$ and $w_m(t)$ have been computed for every frame t = 1, 2, ..., T;
(6.10). Output all the clean-speech short-time spectral-amplitude estimates $\hat{X}(k,t)$ and the log-spectral feature weights $w_m(t)$;
(7). MFCC feature extraction:
(7.1). Input the clean-speech short-time spectral-amplitude estimate $\hat{X}(k,t)$;
(7.2). Compute the power spectrum: $\hat{X}_p(k,t) = |\hat{X}(k,t)|^2$, k = 1, ..., N_fft;
(7.3). Mel filtering: $\mathrm{MBank}(m,t) = \sum_{k=1}^{N_{fft}/2} H_m(k)\times\hat{X}_p(k,t)$, m = 1, ..., M;
(7.4). Log-spectral features: $\mathrm{FBank}(m,t) = \log(\mathrm{MBank}(m,t))$, m = 1, ..., M;
(7.5). DCT cepstral representation: $\tilde{c}(r,t) = \alpha(r)\sum_{m=1}^{M}\mathrm{FBank}(m,t)\cos\big(\frac{\pi(2m-1)(r-1)}{2M}\big)$, r = 1, ..., M, where $\alpha(1) = \sqrt{1/M}$ and $\alpha(r) = \sqrt{2/M}$ for r = 2, ..., M; the first R feature dimensions are kept;
(7.6). Cepstral weighting (liftering): $c(r,t) = \mathrm{lifter}(r)\times\tilde{c}(r,t)$, r = 1, ..., R, where $\mathrm{lifter}(r) = 1 + \frac{L}{2}\sin\big(\frac{\pi(r-1)}{L}\big)$, r = 1, ..., R, and L is the weighting-filter width;
(7.7). Compute the dynamic coefficients: $\Delta c(r,t) = \sum_{\Delta t=-2}^{2}\Delta t\;c(r,t+\Delta t)/10$, where Δt denotes the frame offset;
(7.8). Output c(r,t) and Δc(r,t);
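A NumPy sketch of steps (7.2)-(7.7) follows; the edge padding used for the delta computation at utterance boundaries is our own choice:

```python
import numpy as np

def mfcc_from_spectrum(X_hat, H, R=13, L=22):
    """MFCC extraction from enhanced spectral amplitudes (sketch).

    X_hat : (K, T) enhanced short-time spectral amplitudes
    H     : (M, K) Mel filter-bank taps H_m(k)
    Returns static cepstra c(r,t) and delta cepstra, both (R, T).
    """
    M = H.shape[0]
    mbank = H @ X_hat ** 2                 # (7.2)-(7.3): power spectrum, Mel filtering
    fbank = np.log(mbank)                  # (7.4): log-spectral features
    r = np.arange(M)[:, None]              # (7.5): DCT basis, 0-based indices
    m = np.arange(M)[None, :]
    alpha = np.where(r == 0, np.sqrt(1.0 / M), np.sqrt(2.0 / M))
    dct = alpha * np.cos(np.pi * (2 * m + 1) * r / (2 * M))
    c = (dct @ fbank)[:R]                  # keep the first R coefficients
    lifter = 1 + (L / 2.0) * np.sin(np.pi * np.arange(R) / L)   # (7.6)
    c = lifter[:, None] * c
    # (7.7): delta coefficients over a +-2 frame window, edges held constant
    T = c.shape[1]
    pad = np.pad(c, ((0, 0), (2, 2)), mode="edge")
    dc = sum(k * pad[:, 2 + k : 2 + k + T] for k in (-2, -1, 1, 2)) / 10.0
    return c, dc
```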
(8). Judge whether the input of the utterance to be recognized has finished, i.e. whether t = T;
(9). If the utterance to be recognized has been fully input, compute the static MFCC feature mean of the noise frames, i.e. of the residual noise. The residual noise is defined as $\hat{d}(n) = \hat{x}(n) - x(n)$, where x(n) is the value of the clean speech at the n-th sample and $\hat{x}(n)$ is its estimate after enhancement. Since residual noise is present in every speech frame while speech exists only in the non-noise frames, for noise frames $\hat{D}(k,t) = \hat{X}(k,t)$, i.e. the short-time spectral amplitude of the residual noise equals that of the enhanced speech in each noise frame. The static MFCC feature mean of the residual noise, $\mu_n^c(r)$, r = 1, 2, ..., R, is therefore computed by applying the MFCC feature extraction of step (7) to $\hat{D}(k,t)$ and averaging over the noise frames, which comprise the 10 initial-segment frames and the frames judged later to be noise frames.
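Building on the mfcc_from_spectrum sketch given after step (7.8), the residual-noise mean might be computed as:

```python
import numpy as np

def residual_noise_mean(X_hat, is_noise, H, R=13, L=22):
    """Static MFCC mean of the residual noise (claim step (9)): apply the
    step-(7) feature extraction to the enhanced spectra of the noise frames
    and average over those frames."""
    c, _ = mfcc_from_spectrum(X_hat[:, is_noise], H, R=R, L=L)
    return c.mean(axis=1)   # mu_n^c, one value per cepstral dimension
```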
(10). Log-Add log-spectral additive model compensation:
(10.1). Input the MFCC feature mean of the residual noise and transform it to the log-spectral domain: $\mu_n^l = Tr^{-1}\mu_n^c$;
(10.2). Input the state means of the model trained on clean speech and transform them to the log-spectral domain: $\mu^l = Tr^{-1}\mu^c$, $\Delta\mu^l = Tr^{-1}\Delta\mu^c$;
(10.3). Log-Add model compensation:

$$\hat{\mu}_m^l = \mu_m^l + \log\big(1+\exp(\mu_{nm}^l - \mu_m^l)\big),\quad m = 1, 2, \ldots, M$$

$$\Delta\hat{\mu}_m^l = \frac{\Delta\mu_m^l}{1+\exp(\mu_{nm}^l - \mu_m^l)}$$
(10.4). Transform the compensated model states back to the MFCC cepstral domain: $\hat{\mu}^c = Tr\,\hat{\mu}^l$, $\Delta\hat{\mu}^c = Tr\,\Delta\hat{\mu}^l$;
(10.5). When the state input finishes, output the residual-noise-compensated speech model;
(11). Feature-weighted Viterbi recognition decoding:
(11.1). Input the residual-noise-compensated speech model, the MFCC feature $y_t^c$ of the current enhanced-speech frame, and the log-spectral feature weights $w_m(t)$;
(11.2). Compute the log-probability likelihood of the observed frame under each candidate state:
(11.2.1). Compute the difference vector in the MFCC cepstral domain between the MFCC feature and the state mean of the candidate state: $d^c = y_t^c - \mu^c$;
(11.2.2). Transform the difference vector to the log-spectral feature domain: $d^l = Tr^{-1}d^c$;
(11.2.3). Apply the weights in the log-spectral domain and transform back to the MFCC cepstral domain: $d^c = Tr\,W\,d^l$, where W is the diagonal matrix of the weights $w_m(t)$;
(11.2.4). Compute the log-probability likelihood:

$$\log\big(p(y_t^c \mid q(t)=i)\big) = C(\Sigma^c) - \tfrac{1}{2}\,d^{cT}(\Sigma^c)^{-1}d^c$$

where $\Sigma^c$ is the state covariance matrix in the cepstral domain, a diagonal matrix $\Sigma^c = \mathrm{diag}\{\sigma_{i1}, \sigma_{i2}, \ldots, \sigma_{ir}, \ldots\}$; c denotes the cepstral domain and i the state; $C(\Sigma^c)$ is the constant term independent of $y_t^c$, corresponding to $-\sum_{r=1}^{R}\log(\sqrt{2\pi}\,\sigma_{ir})$, and R is the dimension of the cepstral feature.
(11.3). After initializing the Viterbi decoding, iterate the computation over the frames t = 1, 2, ..., T;
(11.4). Compute the maximum probability $p^* = \max_{1\le i\le N}[\delta_T(i)]$ and the final state of the optimal path: $\hat{q}(T) = \arg\max_{1\le i\le N}[\delta_T(i)]$;
(11.5). Obtain and output the remaining states on the optimal path in turn by backtracking;
(12). Output the recognition result; end.
CNB021241449A 2002-07-12 2002-07-12 Speech intensifying-characteristic weighing-logrithmic spectrum addition method for anti-noise speech recognization Expired - Fee Related CN1162838C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB021241449A CN1162838C (en) 2002-07-12 2002-07-12 Speech intensifying-characteristic weighing-logrithmic spectrum addition method for anti-noise speech recognization

Publications (2)

Publication Number Publication Date
CN1397929A true CN1397929A (en) 2003-02-19
CN1162838C CN1162838C (en) 2004-08-18

Family

ID=4745350

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB021241449A Expired - Fee Related CN1162838C (en) 2002-07-12 2002-07-12 Speech intensifying-characteristic weighing-logrithmic spectrum addition method for anti-noise speech recognization

Country Status (1)

Country Link
CN (1) CN1162838C (en)

Cited By (54)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100444159C (en) * 2003-07-29 2008-12-17 朗迅科技公司 Content identification system
US9336794B2 (en) 2003-07-29 2016-05-10 Alcatel Lucent Content identification system
CN100358007C (en) * 2005-06-07 2007-12-26 苏州海瑞电子科技有限公司 Method for raising precision of identifying speech by using improved subtractive method of spectrums
CN101223574B (en) * 2005-12-08 2011-06-29 韩国电子通信研究院 Voice recognition apparatus and method using vocal band signal
CN101193090B (en) * 2006-11-27 2011-12-28 华为技术有限公司 Signal processing method and its device
CN101533642B (en) * 2009-02-25 2013-02-13 北京中星微电子有限公司 Method for processing voice signal and device
CN101853666B (en) * 2009-03-30 2012-04-04 华为技术有限公司 Speech enhancement method and device
CN101693371B (en) * 2009-09-30 2011-08-24 深圳先进技术研究院 Robot capable of dancing by following music beats
CN103329200A (en) * 2011-05-24 2013-09-25 三菱电机株式会社 Target sound enhancement device and car navigation system
CN103329200B (en) * 2011-05-24 2016-04-20 三菱电机株式会社 Target sound enhancement device and Vehicular navigation system
CN102254552A (en) * 2011-07-14 2011-11-23 杭州电子科技大学 Semantic enhanced transport vehicle acoustic information fusion method
CN102254552B (en) * 2011-07-14 2012-10-03 杭州电子科技大学 Semantic enhanced transport vehicle acoustic information fusion method
CN102426837A (en) * 2011-12-30 2012-04-25 中国农业科学院农业信息研究所 Robustness method used for voice recognition on mobile equipment during agricultural field data acquisition
CN102426837B (en) * 2011-12-30 2013-10-16 中国农业科学院农业信息研究所 Robustness method used for voice recognition on mobile equipment during agricultural field data acquisition
CN103730123A (en) * 2012-10-12 2014-04-16 联芯科技有限公司 Method and device for estimating attenuation factors in noise suppression
CN102945670B (en) * 2012-11-26 2015-06-03 河海大学 Multi-environment characteristic compensation method for voice recognition system
CN102945670A (en) * 2012-11-26 2013-02-27 河海大学 Multi-environment characteristic compensation method for voice recognition system
CN103915099A (en) * 2012-12-29 2014-07-09 北京百度网讯科技有限公司 Speech pitch period detection method and device
CN103915099B (en) * 2012-12-29 2016-12-28 北京百度网讯科技有限公司 Voice fundamental periodicity detection methods and device
CN103310789A (en) * 2013-05-08 2013-09-18 北京大学深圳研究生院 Sound event recognition method based on optimized parallel model combination
CN103310789B (en) * 2013-05-08 2016-04-06 北京大学深圳研究生院 A kind of sound event recognition method of the parallel model combination based on improving
CN105393305B (en) * 2013-07-18 2019-04-23 三菱电机株式会社 Method for handling voice signal
CN105393305A (en) * 2013-07-18 2016-03-09 三菱电机株式会社 Method for processing acoustic signal
CN103440870A (en) * 2013-08-16 2013-12-11 北京奇艺世纪科技有限公司 Method and device for voice frequency noise reduction
CN103632666B (en) * 2013-11-14 2016-09-28 华为技术有限公司 Audio recognition method, speech recognition apparatus and electronic equipment
US9870771B2 (en) 2013-11-14 2018-01-16 Huawei Technologies Co., Ltd. Environment adaptive speech recognition method and device
CN103632666A (en) * 2013-11-14 2014-03-12 华为技术有限公司 Voice recognition method, voice recognition equipment and electronic equipment
CN110895929B (en) * 2015-01-30 2022-08-12 展讯通信(上海)有限公司 Voice recognition method and device
CN110895929A (en) * 2015-01-30 2020-03-20 展讯通信(上海)有限公司 Voice recognition method and device
CN106331969A (en) * 2015-07-01 2017-01-11 奥迪康有限公司 Enhancement of noisy speech based on statistical speech and noise models
CN105245497A (en) * 2015-08-31 2016-01-13 刘申宁 Identity authentication method and device
CN105245497B (en) * 2015-08-31 2019-01-04 刘申宁 A kind of identity identifying method and device
CN105355198B (en) * 2015-10-20 2019-03-12 河海大学 It is a kind of based on multiple adaptive model compensation audio recognition method
CN105355198A (en) * 2015-10-20 2016-02-24 河海大学 Multiple self-adaption based model compensation type speech recognition method
CN105679321A (en) * 2016-01-29 2016-06-15 宇龙计算机通信科技(深圳)有限公司 Speech recognition method and device and terminal
CN106992003A (en) * 2017-03-24 2017-07-28 深圳北斗卫星信息科技有限公司 Voice signal auto gain control method
CN108169639A (en) * 2017-12-29 2018-06-15 南京康尼环网开关设备有限公司 Method based on the parallel long identification switch cabinet failure of Memory Neural Networks in short-term
CN108257606A (en) * 2018-01-15 2018-07-06 江南大学 A kind of robust speech personal identification method based on the combination of self-adaptive parallel model
CN109346106A (en) * 2018-09-06 2019-02-15 河海大学 A kind of cepstrum domain pitch period estimation method based on subband noise Ratio Weighted
CN109346106B (en) * 2018-09-06 2022-12-06 河海大学 Cepstrum domain pitch period estimation method based on sub-band signal-to-noise ratio weighting
CN109643554A (en) * 2018-11-28 2019-04-16 深圳市汇顶科技股份有限公司 Adaptive voice Enhancement Method and electronic equipment
CN109920406B (en) * 2019-03-28 2021-12-03 国家计算机网络与信息安全管理中心 Dynamic voice recognition method and system based on variable initial position
CN109920406A (en) * 2019-03-28 2019-06-21 国家计算机网络与信息安全管理中心 A kind of dynamic voice recognition methods and system based on variable initial position
CN112309421B (en) * 2019-07-29 2024-03-19 中国科学院声学研究所 Speech enhancement method and system integrating signal-to-noise ratio and intelligibility dual targets
CN112309421A (en) * 2019-07-29 2021-02-02 中国科学院声学研究所 Speech enhancement method and system fusing signal-to-noise ratio and intelligibility dual targets
CN111261148A (en) * 2020-03-13 2020-06-09 腾讯科技(深圳)有限公司 Training method of voice model, voice enhancement processing method and related equipment
CN111261148B (en) * 2020-03-13 2022-03-25 腾讯科技(深圳)有限公司 Training method of voice model, voice enhancement processing method and related equipment
CN111862989A (en) * 2020-06-01 2020-10-30 北京捷通华声科技股份有限公司 Acoustic feature processing method and device
CN111862989B (en) * 2020-06-01 2024-03-08 北京捷通华声科技股份有限公司 Acoustic feature processing method and device
WO2022061700A1 (en) * 2020-09-25 2022-03-31 Beijing University Of Posts And Telecommunications Method, apparatus, electronic device and readable storage medium for estimation of parameter of channel noise
CN115631743B (en) * 2022-12-07 2023-03-21 中诚华隆计算机技术有限公司 High-precision voice recognition method and system based on voice chip
CN115631743A (en) * 2022-12-07 2023-01-20 中诚华隆计算机技术有限公司 High-precision voice recognition method and system based on voice chip
CN117711419A (en) * 2024-02-05 2024-03-15 卓世智星(成都)科技有限公司 Intelligent data cleaning method for data center
CN117711419B (en) * 2024-02-05 2024-04-26 卓世智星(成都)科技有限公司 Intelligent data cleaning method for data center

Also Published As

Publication number Publication date
CN1162838C (en) 2004-08-18


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C19 Lapse of patent right due to non-payment of the annual fee
CF01 Termination of patent right due to non-payment of annual fee