CN100498935C

CN100498935C - Variational Bayesian speech enhancement method based on a speech production model

Info

Publication number: CN100498935C
Application number: CNB2006100283311A
Authority: CN
Inventors: 黄青华; 杨杰; 薛云峰
Original assignee: Shanghai Jiaotong University
Current assignee: Shanghai Jiaotong University
Priority date: 2006-06-29
Filing date: 2006-06-29
Publication date: 2009-06-10
Anticipated expiration: 2026-06-29
Also published as: CN1870136A

Abstract

A variational Bayesian speech enhancement method based on a speech production model. The method establishes a noisy speech model and the state-space equations of the noisy speech model and the speech production model, and expresses the probability distributions of the noise-corruption process and the speech production process. Following the variational Bayesian method, approximate posterior distributions are used to approximate the distributions of the speech production model parameters and of the clean speech signal. Update equations for the parameters of these approximate posterior distributions are derived and iterated cyclically until the algorithm converges.

Description

Variational Bayesian speech enhancement method based on a speech production model

Technical field

The present invention relates to a variational Bayesian speech enhancement method based on a speech production model. It can be widely applied in speech communication, speech recognition and similar areas, and belongs to the field of speech signal processing.

Background art

In practice, speech capture devices and recording environments cannot deliver clean speech: the speech is contaminated by various kinds of background noise. In applications such as speech communication and speech recognition, speech enhancement is therefore an important preprocessing stage, and the enhanced speech better guarantees the accuracy of subsequent speech processing.

To improve speech quality, the main existing speech enhancement methods are as follows:

The first method is thresholding. Its basic principle is that the portions of the signal with small absolute amplitude are mainly noise; these portions are further compressed with a linear or nonlinear compression function to enhance the speech. The main drawback of this algorithm is that a great deal of useful speech information is compressed along with the noise.

The second method is spectral subtraction. Assuming that the noise is stationary or slowly time-varying additive noise, and that the speech signal and the noise are mutually independent, the power spectrum of the noise is subtracted from the power spectrum of the noisy speech to obtain a comparatively clean speech spectrum. A well-known drawback of this method is that the enhanced speech contains an unnatural residual known as "musical" noise, which is unpleasant to the listener.
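Purely by way of illustration, the spectral subtraction of the prior art can be sketched for a single frame as follows. This is a minimal sketch, not part of the claimed method: the function name `spectral_subtraction_frame`, the reuse of the noisy phase and the clipping of negative power to zero are assumptions made for the example, and NumPy is assumed to be available.

```python
import numpy as np

def spectral_subtraction_frame(noisy_frame, noise_power):
    """Subtract an estimated noise power spectrum from one frame of noisy speech.

    noise_power is a scalar or an array with one value per rfft bin
    (hypothetical input: the text does not say how it is estimated).
    """
    noisy_frame = np.asarray(noisy_frame, dtype=float)
    spectrum = np.fft.rfft(noisy_frame)
    noisy_power = np.abs(spectrum) ** 2
    # Power cannot be negative, so differences below zero are clipped.
    # Isolated bins that survive this clipping are heard as "musical" noise.
    clean_power = np.maximum(noisy_power - noise_power, 0.0)
    # Assumption: the phase of the noisy signal is reused unchanged.
    enhanced = np.sqrt(clean_power) * np.exp(1j * np.angle(spectrum))
    return np.fft.irfft(enhanced, n=noisy_frame.size)
```

With a noise power of zero the frame is returned unchanged, and with a noise power larger than every spectral bin the output is silence.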

The third method is enhancement based on a speech production model. Because the parameters of the "clean" speech model cannot be measured directly, they have to be estimated from the noisy signal; if the model is estimated inaccurately, the intelligibility of the enhanced speech degrades. Accurately estimating the model parameters and the model order from the noisy speech is therefore the key to this method. Gannot et al. (S. Gannot, D. Burshtein and E. Weinstein, Iterative and Sequential Kalman Filter-Based Speech Enhancement Algorithms, IEEE Trans. Speech and Audio Processing, Vol. 6, No. 4, July 1998, pp. 373-385) proposed an enhancement algorithm based on Kalman filtering in which the speech production model parameters are estimated by maximum likelihood. However, this method cannot estimate the model order, which must be determined by other methods or from prior knowledge, and the result depends strongly on the initial parameter estimates. Vermaak et al. (J. Vermaak, C. Andrieu, A. Doucet and S. J. Godsill, Particle Methods for Bayesian Modeling and Enhancement of Speech Signals, IEEE Trans. Speech and Audio Processing, Vol. 10, No. 3, 2002, pp. 173-185) proposed estimating the speech production model parameters with a Markov chain Monte Carlo method and estimating the clean speech signal with a Kalman filter. This method likewise cannot estimate the model order, and its computational load is very high, which makes it unsuitable for many applications.

Summary of the invention

The object of the present invention is to address the deficiencies of the prior art by providing a variational Bayesian speech enhancement method based on a speech production model that selects the order of the speech production model automatically and avoids overfitting during parameter estimation, so that the model is estimated more accurately and the speech is enhanced more effectively.

To achieve this object, the technical solution of the invention rests on the following consideration. The variational Bayesian method is a Bayesian approximation method developed in recent years. Its principle is to approximate the true posterior distribution of the unknown variables and parameters with an approximate posterior distribution, which makes Bayesian inference analytically tractable and allows both the model structure and the model parameters to be learned. The invention therefore exploits the ability of the variational Bayesian method to avoid overfitting while learning parameters and to perform model selection, in order to estimate the parameters and the order of the speech production model accurately and thereby enhance the speech more effectively. The invention first establishes the state-space equations of the noisy speech model and the speech production model, and then expresses the probability distributions of the noise-corruption process and the speech production process. Following the variational Bayesian method, approximate posterior distributions are used to approximate the distributions of the speech production model parameters and of the clean speech signal. Finally, update equations for the parameters of these approximate posterior distributions are derived and iterated until the algorithm converges. Automatic model selection treats the order of the speech production model as the independent variable of the variational Bayesian cost function: the order corresponding to the minimum cost function value is the optimal model order, and the speech signal computed with this optimal order is the optimal result.

The variational Bayesian speech enhancement method based on a speech production model according to the invention mainly comprises the following steps:

1. The noisy speech signal is expressed as the sum of a clean speech signal and noise, which establishes the noisy speech model; the speech production model is represented by an autoregressive process; and the state-space equations corresponding to the noisy speech model and the speech production model are established.

2. The noise of the noisy speech model is chosen to be Gaussian, and the driving noise of the speech production model is likewise chosen to be Gaussian. From these two Gaussian distributions and the state-space equations corresponding to the noisy speech model and the speech production model, the probability distributions of the state vector and the observation vector are obtained. The prior distributions of the weight coefficients of the speech production model and of the inverse variances of all the Gaussian distributions are determined from prior knowledge.

3. Based on the cost function of the variational Bayesian method, the probability distributions of the state vector and the observation vector, and the prior distributions of the weight coefficients of the speech production model and of the inverse variances of all the Gaussian distributions, the approximate posterior distributions of the state vector, of the weight coefficients of the speech production model, and of the inverse variances of all the Gaussian distributions are obtained with the variational expectation-maximization algorithm.

4. The update equations for the parameters of the approximate posterior distribution of the state vector are estimated with the variational Kalman smoothing algorithm, and the update equations for the parameters of the approximate posterior distributions of the weight coefficients of the speech production model and of the inverse variances of all the Gaussian distributions are derived from the variational maximization step of the variational expectation-maximization algorithm.

5. Within a predetermined range of speech production model orders, an initial order is selected. The noisy speech signal and the initial order are substituted into the parameter update equations derived in step 4, and the cost function is computed iteratively until the absolute change of the cost function from one iteration to the next is no greater than a predetermined threshold. The cost function value at that point and the corresponding approximate posterior distribution parameters of the state vector are saved.

6. Within the predetermined range of speech production model orders, the model order is varied in turn. Each new order replaces the initial order of step 5 and step 5 is repeated, yielding a set of cost function values and approximate posterior distribution parameters of the state vector, one for each model order.

7. Among all the cost function values obtained, the order corresponding to the minimum is the optimal model order, and the speech signal computed from the approximate posterior distribution parameters of the state vector for this optimal model order is the optimal result.

By fully exploiting the ability of variational Bayesian learning to learn both model parameters and model structure, the invention estimates the parameters and the order of the speech production model more accurately and improves the speech enhancement result.

The variational Bayesian speech enhancement method based on a speech production model proposed by the invention can be widely applied in speech communication, speech recognition and similar areas, and is of considerable practical value.

Detailed description of the embodiments

For a better understanding of the technical solution of the invention, it is described in further detail below.

1. The noisy speech signal x_t is expressed as the sum of the clean speech signal s_t and the noise n_t, which gives the noisy speech model:

x_t = s_t + n_t \qquad (1)

The subscript t denotes time. The speech production model is represented by an autoregressive process:

s_t = \vec{w}^{T} \vec{s}_t^{(p)} + e_t \qquad (2)

where

\vec{w} = [w_1, w_2, \cdots, w_p]^{T}

is the vector of weight coefficients of the autoregressive model,

\vec{s}_t^{(p)} = [s_{t-1}, \cdots, s_{t-p}]^{T}

is the vector of the p past speech values preceding time t, p is the order of the model, and e_t is the driving noise of the autoregressive model. From the noisy speech model (1) and the speech production model (2), the state-space equations are established as follows:

\vec{s}_t = A \vec{s}_{t-1} + B e_t \qquad (3)

x_t = C \vec{s}_t + n_t \qquad (4)

where

\vec{s}_t \triangleq [s_t, s_{t-1}, \cdots, s_{t-p+1}]^{T}

is the p-dimensional state vector, the noisy speech signal x_t is the observation,

A \triangleq \begin{bmatrix} \vec{w}^{T} \\ \begin{matrix} I[p-1] & 0_{(p-1) \times 1} \end{matrix} \end{bmatrix}

is the p × p state transition matrix,

B = C^{T} \triangleq [1, 0, \cdots, 0]^{T},

and I[p-1] is the (p-1) × (p-1) identity matrix.
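As a non-limiting illustration, the state-space equations (3) and (4) can be built and simulated as follows. This is a minimal sketch under stated assumptions: the function names `build_state_space` and `simulate`, the zero initial state, the standard deviations passed as arguments and the use of NumPy are all choices made for the example and do not appear in the text.

```python
import numpy as np

def build_state_space(w):
    """Return the matrices A, B, C of equations (3)-(4) for AR weights w."""
    w = np.asarray(w, dtype=float)
    p = w.size
    A = np.zeros((p, p))
    A[0, :] = w                    # first row holds w^T
    A[1:, :-1] = np.eye(p - 1)     # [I[p-1]  0] shifts the past samples down
    B = np.zeros((p, 1))
    B[0, 0] = 1.0                  # driving noise enters the newest sample only
    C = B.T                        # observation picks out the newest sample
    return A, B, C

def simulate(w, T, drive_std, noise_std, seed=0):
    """Generate T samples of clean speech s_t and noisy speech x_t."""
    rng = np.random.default_rng(seed)
    A, B, C = build_state_space(w)
    state = np.zeros((len(w), 1))  # assumption: the process starts from rest
    clean = np.empty(T)
    noisy = np.empty(T)
    for t in range(T):
        e_t = rng.normal(0.0, drive_std)
        state = A @ state + B * e_t                                  # equation (3)
        clean[t] = state[0, 0]
        noisy[t] = (C @ state)[0, 0] + rng.normal(0.0, noise_std)    # equation (4)
    return clean, noisy
```

For example, `build_state_space([0.5, -0.2])` gives A = [[0.5, -0.2], [1, 0]], so the first row applies the autoregression (2) and the second row carries s_{t-1} forward.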

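2. The noise n_t is chosen to be Gaussian, p(n_t) = \mathcal{N}(n_t | 0, \gamma), and the driving noise e_t of the autoregressive model is likewise chosen to be Gaussian, p(e_t) = \mathcal{N}(e_t | 0, \beta), where \mathcal{N}(y | a, b) denotes that the random variable y follows a Gaussian distribution with mean a and inverse variance b. According to (3), the probability distribution of the state vector \vec{s}_t is:

p(\vec{s}_t | \vec{s}_{t-1}, \vec{w}, \beta) = \mathcal{N}(s_t | \vec{w}^{T} \vec{s}_t^{(p)}, \beta) \qquad (5)

According to (4), the probability distribution of the observation can be written as:

p(x_t | \vec{s}_t, \gamma) = \mathcal{N}(x_t | C \vec{s}_t, \gamma) \qquad (6)

The weight coefficients of the autoregressive model follow a zero-mean Gaussian prior distribution:

p(\vec{w} | \alpha) = \mathcal{N}(\vec{w} | \vec{0}, \alpha I[p]) \qquad (7)

The inverse variances of all the Gaussian distributions follow Gamma prior distributions:

p(\alpha) = Gamma(\alpha | b^{(\alpha)}, c^{(\alpha)}) \qquad (8)

p(\beta) = Gamma(\beta | b^{(\beta)}, c^{(\beta)}) \qquad (9)

p(\gamma) = Gamma(\gamma | b^{(\gamma)}, c^{(\gamma)}) \qquad (10)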
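By way of illustration only, a Gaussian specified by its mean and inverse variance, as used in step 2, can be evaluated as sketched below. The function names are hypothetical. The helper `gamma_mean` further assumes that the first Gamma parameter b acts as a rate and the second parameter c as a shape; this reading is consistent with update equations (27)-(32), where c grows with the number of samples and b with half a sum of squares, but it is an assumption of the example.

```python
import math

def gaussian_logpdf_precision(y, mean, precision):
    """Log density of a Gaussian given its mean and inverse variance (precision)."""
    return 0.5 * math.log(precision / (2.0 * math.pi)) - 0.5 * precision * (y - mean) ** 2

def gamma_mean(b, c):
    """Expected value of a Gamma-distributed inverse variance.

    Assumption: b is the rate and c is the shape, so the mean is c / b.
    """
    return c / b
```

Working with the inverse variance rather than the variance is what makes the Gamma distribution a convenient prior here, since it is conjugate to a Gaussian with known mean.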

3. Let X denote the set of observations {x_1, x_2, ..., x_T}, let S denote the set of state vectors {\vec{s}_1, \vec{s}_2, ..., \vec{s}_T}, and let θ denote the set comprising the weight coefficients of the speech production model and the inverse variances of all the Gaussian distributions, θ = {\vec{w}, α, β, γ}.

The principle of the variational Bayesian method is to approximate the true posterior distribution p(S, θ | X) with an approximate posterior distribution Q(S, θ). The cost function used in practice is

C_{KL} = \left\langle \log \frac{Q(S, \theta)}{p(X, S, \theta)} \right\rangle_Q = \left\langle \log \frac{Q(S) Q(\theta)}{p(X, S, \theta)} \right\rangle_Q \qquad (11)

where \langle \cdot \rangle_Q denotes the expectation under the probability distribution Q(·). Based on the cost function (11) of the variational Bayesian method, the probability distributions (5)-(6) of the state vector and the observation vector, and the prior distributions (7)-(10) of the weight coefficients of the speech production model and of the inverse variances of all the Gaussian distributions, the variational expectation-maximization algorithm yields the following approximate posterior distributions of the state vector, of the weight coefficients of the speech production model, and of the inverse variances of all the Gaussian distributions:

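Q(\vec{s}_t) = \mathcal{N}(\vec{s}_t | \vec{m}_t^{(s)}, V_t^{(s)}) \qquad (12)

Q(\vec{w}) = \mathcal{N}(\vec{w} | \vec{\mu}^{(w)}, \Sigma^{(w)}) \qquad (13)

Q(\alpha) = Gamma(\alpha | \bar{b}^{(\alpha)}, \bar{c}^{(\alpha)}) \qquad (14)

Q(\beta) = Gamma(\beta | \bar{b}^{(\beta)}, \bar{c}^{(\beta)}) \qquad (15)

Q(\gamma) = Gamma(\gamma | \bar{b}^{(\gamma)}, \bar{c}^{(\gamma)}) \qquad (16)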
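For orientation, the cost function (11) can be related to the Kullback-Leibler divergence between the approximate and the true posterior by a standard identity, obtained by writing p(X, S, θ) = p(S, θ | X) p(X). The rearrangement below is a worked illustration and introduces only the symbol KL, which does not appear in the text:

```latex
\begin{aligned}
C_{KL}
  &= \left\langle \log \frac{Q(S,\theta)}{p(X,S,\theta)} \right\rangle_Q
   = \left\langle \log \frac{Q(S,\theta)}{p(S,\theta \mid X)} \right\rangle_Q - \log p(X) \\
  &= \mathrm{KL}\!\left( Q(S,\theta) \,\middle\|\, p(S,\theta \mid X) \right) - \log p(X)
   \;\ge\; -\log p(X).
\end{aligned}
```

Since the divergence is non-negative and log p(X) does not depend on Q, lowering C_{KL} for a fixed model order brings Q closer to the true posterior, and the minimized value bounds the negative log evidence of that order from above. This is consistent with the use of the minimum cost to compare model orders in steps 5 to 7.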

4. The parameters of the approximate posterior distribution (12) of the state vector are obtained with the variational Kalman smoothing algorithm. Let x_1^τ denote the sequence {x_1, ..., x_τ}. First define the conditional expectation

\vec{m}_{t|\tau} = E(\vec{s}_t | x_1^{\tau})

and the conditional covariance matrix

V_{t|\tau} = Var(\vec{s}_t | x_1^{\tau}),

with initial values

\vec{m}_{0|0} = \vec{m}_0

and V_{0|0} = V_0. For t = 1, ..., T, the forward recursion of the Kalman filter is:

\vec{m}_{t|t-1} = \bar{A} \vec{m}_{t-1|t-1} \qquad (17)

V_{t|t-1} = \bar{A} V_{t-1|t-1} \bar{A}^{T} + P \qquad (18)

K_t = V_{t|t-1} C^{T} (C V_{t|t-1} C^{T} + (\langle \gamma \rangle_Q)^{-1})^{-1} \qquad (19)

\vec{m}_{t|t} = \vec{m}_{t|t-1} + K_t (x_t - C \vec{m}_{t|t-1}) \qquad (20)

V_{t|t} = V_{t|t-1} - K_t C V_{t|t-1} \qquad (21)

Here

\bar{A} \triangleq \begin{bmatrix} \langle \vec{w} \rangle_Q^{T} \\ \begin{matrix} I[p-1] & 0_{(p-1) \times 1} \end{matrix} \end{bmatrix},

P = \begin{bmatrix} \begin{matrix} \bar{\beta} & 0_{1 \times (p-1)} \end{matrix} \\ 0_{(p-1) \times p} \end{bmatrix},

\bar{\beta} = (\langle \beta \rangle_Q)^{-1}.

The Gaussian distribution with mean \vec{m}_{t|t} and covariance V_{t|t} is the Kalman-filtered distribution of the state vector \vec{s}_t. The Kalman smoothing algorithm then follows. It is initialized with the corresponding Kalman filter values \vec{m}_{T|T} and V_{T|T}, and for t = T-1, ..., 0 the backward recursion is:

Q_t = V_{t|t} \bar{A}^{T} V_{t+1|t}^{-1} \qquad (22)

\vec{m}_{t|T} = \vec{m}_{t|t} + Q_t (\vec{m}_{t+1|T} - \vec{m}_{t+1|t}) \qquad (23)

V_{t|T} = V_{t|t} + Q_t (V_{t+1|T} - V_{t+1|t}) Q_t^{T} \qquad (24)

The update equations for the parameters of the approximate posterior distribution of the state vector \vec{s}_t are therefore

\vec{m}_t^{(s)} = \vec{m}_{t|T}

and

V_t^{(s)} = [V_{t|T}]^{-1}.
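The forward recursion (17)-(21) and the backward recursion (22)-(24) can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function name `vb_kalman_smoother` and its argument names are hypothetical, the expectations of the weight vector and of the inverse variances β and γ are assumed to be supplied by the caller, the smoother gain Q_t of (22) is called `J` to avoid confusion with the distribution Q, and V_{t+1|t} is assumed to be invertible. NumPy is assumed to be available.

```python
import numpy as np

def vb_kalman_smoother(x, w_mean, beta_mean, gamma_mean, m0, V0):
    """Return smoothed means m_{t|T} and covariances V_{t|T} for t = 0..T.

    x: observations x_1..x_T; w_mean: expected AR weights;
    beta_mean, gamma_mean: expected inverse variances of the driving
    noise and the observation noise.
    """
    x = np.asarray(x, dtype=float)
    w_mean = np.asarray(w_mean, dtype=float)
    T, p = x.size, w_mean.size
    A = np.zeros((p, p))
    A[0, :] = w_mean
    A[1:, :-1] = np.eye(p - 1)            # A-bar built from the expected weights
    P = np.zeros((p, p))
    P[0, 0] = 1.0 / beta_mean             # driving-noise covariance
    C = np.zeros((1, p))
    C[0, 0] = 1.0
    r = 1.0 / gamma_mean                  # observation-noise variance

    m_pred = np.zeros((T + 1, p))
    V_pred = np.zeros((T + 1, p, p))
    m_filt = np.zeros((T + 1, p))
    V_filt = np.zeros((T + 1, p, p))
    m_filt[0], V_filt[0] = m0, V0
    for t in range(1, T + 1):             # forward pass
        m_pred[t] = A @ m_filt[t - 1]                                     # (17)
        V_pred[t] = A @ V_filt[t - 1] @ A.T + P                           # (18)
        S = C @ V_pred[t] @ C.T + r       # 1x1 innovation variance
        K = V_pred[t] @ C.T / S                                           # (19)
        m_filt[t] = m_pred[t] + (K * (x[t - 1] - C @ m_pred[t])).ravel()  # (20)
        V_filt[t] = V_pred[t] - K @ C @ V_pred[t]                         # (21)

    m_smooth, V_smooth = m_filt.copy(), V_filt.copy()
    for t in range(T - 1, -1, -1):        # backward pass
        J = V_filt[t] @ A.T @ np.linalg.inv(V_pred[t + 1])                # (22)
        m_smooth[t] = m_filt[t] + J @ (m_smooth[t + 1] - m_pred[t + 1])   # (23)
        V_smooth[t] = V_filt[t] + J @ (V_smooth[t + 1] - V_pred[t + 1]) @ J.T  # (24)
    return m_smooth, V_smooth
```

The first component of each smoothed mean is the estimate of the clean sample s_t. When the expected inverse variance of the observation noise is large, that estimate stays close to the observation x_t, as expected.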

The update equations for the parameters of the approximate posterior distributions of the weight coefficients of the speech production model and of the inverse variances of all the Gaussian distributions, derived from the variational maximization step of the variational expectation-maximization algorithm, are as follows:

\Sigma^{(w)} = \langle \alpha I[p] \rangle_Q + \sum_{t=1}^{T} \langle \beta \vec{s}_t^{(p)} \vec{s}_t^{(p)T} \rangle_Q \qquad (25)

\vec{\mu}^{(w)} = [\Sigma^{(w)}]^{-1} \left[ \sum_{t=1}^{T} \langle \beta s_t \vec{s}_t^{(p)} \rangle_Q \right] \qquad (26)

\bar{c}^{(\alpha)} = c^{(\alpha)} + \frac{p}{2} \qquad (27)

\bar{b}^{(\alpha)} = b^{(\alpha)} + \frac{1}{2} \langle \vec{w}^{T} \vec{w} \rangle_Q \qquad (28)

\bar{c}^{(\beta)} = c^{(\beta)} + \frac{T}{2} \qquad (29)

\bar{b}^{(\beta)} = b^{(\beta)} + \frac{1}{2} \sum_{t=1}^{T} \langle (s_t - \vec{w}^{T} \vec{s}_t^{(p)})^{2} \rangle_Q \qquad (30)

\bar{c}^{(\gamma)} = c^{(\gamma)} + \frac{T}{2} \qquad (31)

\bar{b}^{(\gamma)} = b^{(\gamma)} + \frac{1}{2} \sum_{t=1}^{T} \langle (x_t - s_t)^{2} \rangle_Q \qquad (32)
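As an illustration of how update equations (25)-(32) might be evaluated, the sketch below takes the required expectations as inputs. The function and argument names are hypothetical. The example assumes that the expectations factorize as ⟨β s s^T⟩ = ⟨β⟩⟨s s^T⟩, which follows from the factorization Q(S)Q(θ) in (11), and that the summed second moments of the state have already been computed, for example from the smoothed means and covariances. NumPy is assumed to be available.

```python
import numpy as np

def update_weight_posterior(alpha_mean, beta_mean, sum_ss, sum_s_st):
    """Equations (25)-(26).

    sum_ss:   sum over t of the expected outer product of the past-sample vector (p x p)
    sum_s_st: sum over t of the expected product of s_t and the past-sample vector (p,)
    Returns the posterior mean and the posterior inverse covariance of the weights.
    """
    sum_ss = np.asarray(sum_ss, dtype=float)
    sum_s_st = np.asarray(sum_s_st, dtype=float)
    p = sum_ss.shape[0]
    precision = alpha_mean * np.eye(p) + beta_mean * sum_ss     # (25)
    mean = np.linalg.solve(precision, beta_mean * sum_s_st)     # (26)
    return mean, precision

def update_gamma_posterior(b_prior, c_prior, count, expected_sq_sum):
    """Common form of (27)-(32): c grows by count / 2, b by half the expected sum of squares."""
    b_post = b_prior + 0.5 * expected_sq_sum
    c_post = c_prior + 0.5 * count
    return b_post, c_post
```

For the weight prior, `count` is the order p and `expected_sq_sum` is the expected squared norm of the weight vector; for the two noise terms, `count` is T and `expected_sq_sum` is the expected sum of squared residuals in (30) or (32).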

5. Within the predetermined range of speech production model orders, an initial order p_1 is selected. The actual noisy signal x_t and the initial order p_1 are substituted into the parameter update equations (17)-(32) derived in step 4, and the cost function (11) is computed iteratively; the iteration stops when the absolute change of the cost function from one iteration to the next is no greater than a predetermined threshold. The cost function value at that point and the corresponding approximate posterior distribution parameters \vec{m}_t^{(s)} and V_t^{(s)} of the state vector are saved;

6. Within the predetermined range of speech production model orders, the model order is varied in turn. Each new order p replaces the initial order p_1 of step 5 and step 5 is repeated, yielding a set of cost function values and approximate posterior distribution parameters of the state vector, one for each model order;

7. Among all the cost function values obtained, the value of p corresponding to the minimum is the optimal model order, and the speech signal

\hat{s}_t = C \vec{m}_t^{(s)}

computed from the approximate posterior distribution parameter \vec{m}_t^{(s)} of the state vector for this optimal model order is the optimal result.
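Steps 5 to 7 amount to an inner convergence loop and an outer search over model orders. The control flow can be sketched as below. The callables `step` and `run_for_order` are hypothetical placeholders for one pass over the update equations and for a full converged run at a given order; the `max_iter` safeguard is an addition of the example and is not mentioned in the text.

```python
from typing import Callable, Dict, Iterable, Tuple

def iterate_until_converged(step: Callable[[], float], tol: float, max_iter: int = 1000) -> float:
    """Repeat step() until the absolute change in cost is no greater than tol.

    step() performs one pass over the update equations and returns the cost.
    """
    previous = step()
    for _ in range(max_iter - 1):
        current = step()
        if abs(current - previous) <= tol:
            return current
        previous = current
    return previous  # not converged within max_iter passes

def select_order(orders: Iterable[int],
                 run_for_order: Callable[[int], Tuple[float, object]]):
    """Run the converged algorithm for each order and keep the minimum-cost one.

    run_for_order(p) returns (cost, saved posterior parameters) for order p.
    """
    results: Dict[int, Tuple[float, object]] = {p: run_for_order(p) for p in orders}
    best_p = min(results, key=lambda p: results[p][0])
    return best_p, results[best_p][1]
```

In use, `run_for_order` would wrap `iterate_until_converged` for a fixed order and return the saved state-vector parameters, from which the enhanced signal of step 7 is read off.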

Claims

1. A variational Bayesian speech enhancement method based on a speech production model, characterized by comprising the following steps:

1) expressing the noisy speech signal as the sum of a clean speech signal and noise to establish the noisy speech model, representing the speech production model by an autoregressive process, and establishing the state-space equations corresponding to the noisy speech model and the speech production model;

2) choosing the noise of the noisy speech model to be Gaussian and the driving noise of the speech production model likewise to be Gaussian, obtaining the probability distributions of the state vector and the observation vector from these two Gaussian distributions and the state-space equations corresponding to the noisy speech model and the speech production model, and determining from prior knowledge the prior distributions of the weight coefficients of the speech production model and of the inverse variances of all the Gaussian distributions;

3) based on the cost function of the variational Bayesian method, the probability distribution of the state vector, the probability distribution of the observation vector, and the prior distributions of the weight coefficients of the speech production model and of the inverse variances of all the Gaussian distributions, obtaining with the variational expectation-maximization algorithm the approximate posterior distribution of the state vector, the approximate posterior distribution of the weight coefficients of the speech production model, and the approximate posterior distributions of the inverse variances of all the Gaussian distributions;

4) estimating the update equations for the parameters of the approximate posterior distribution of the state vector with the variational Kalman smoothing algorithm, and deriving, from the variational maximization step of the variational expectation-maximization algorithm, the update equations for the parameters of the approximate posterior distribution of the weight coefficients of the speech production model and the update equations for the parameters of the approximate posterior distributions of the inverse variances of all the Gaussian distributions;

5) selecting an initial order within a predetermined range of speech production model orders, substituting the noisy speech signal and the initial order into the parameter update equations derived in step 4), computing the cost function iteratively until the absolute change of the cost function from one iteration to the next is no greater than a predetermined threshold, and saving the cost function value at that point together with the corresponding approximate posterior distribution parameters of the state vector;

6) varying the model order in turn within the predetermined range of speech production model orders, replacing the initial order of step 5) with each new order and repeating step 5), to obtain a set of cost function values and approximate posterior distribution parameters of the state vector, one for each model order;

7) among all the cost function values obtained, taking the order corresponding to the minimum as the optimal model order, the speech signal computed from the approximate posterior distribution parameters of the state vector for this optimal model order being the optimal result.