CN106340304A

CN106340304A - Online speech enhancement method for non-stationary noise environment

Info

Publication number: CN106340304A
Application number: CN201610843483.0A
Authority: CN
Inventors: 冯宝; 张绍荣; 孙山林; 郑伟; 张国宁; 武博; 韦周耀
Original assignee: Guilin University of Aerospace Technology
Current assignee: Guilin University of Aerospace Technology
Priority date: 2016-09-23
Filing date: 2016-09-23
Publication date: 2017-01-18
Anticipated expiration: 2036-09-23
Also published as: CN106340304B

Abstract

The invention provides an online speech enhancement method for non-stationary noise environments. The method comprises the steps of (1) establishing a system model for a non-stationary noise environment, (2) framing and windowing, (3) initializing the system, (4) estimating the AR parameters, and (5) estimating the speech signal state sequence. To address the problem that the AR parameters of the speech model cannot be updated in real time as the noise changes, the invention proposes a dual Kalman filtering framework: two Kalman filters run in parallel, speech signal state estimation and AR parameter estimation update each other, and the state estimation and parameter estimation processes alternate. The parameter estimation can thus adapt to changes in the noise, which improves the accuracy of the system model and, in turn, the performance of speech enhancement. To address the problem that the conventional Kalman filtering algorithm cannot handle non-stationary noise, the invention combines it with convex optimization and proposes an improved Kalman filtering framework in which both Gaussian noise and non-stationary noise can be estimated accurately, improving the accuracy of speech enhancement.

Description

An online speech enhancement method for non-stationary noise environments

Technical field

The present invention relates to the field of speech enhancement, and in particular to an online speech enhancement method for non-stationary noise environments.

Background technology

In front-end processing for speech recognition, the speech signal is always corrupted, and sometimes drowned out, by various kinds of noise. Because the interference is random, signal processing can only improve speech quality as far as possible. The main purpose of speech enhancement is to extract the clean original speech from the noisy speech.

Common speech enhancement algorithms fall into the following categories:

1. Noise cancellation: noise components are subtracted directly from the noisy speech, in either the time domain or the frequency domain. The defining feature of this method is that it needs the background noise as a reference signal, and the accuracy of that reference signal directly determines its performance.

2. Harmonic enhancement: voiced speech is clearly periodic. In the frequency domain this periodicity appears as a series of peaks at the fundamental frequency (pitch) and its harmonics, and these frequency components carry most of the speech energy. The periodicity can therefore be exploited for enhancement: a comb filter extracts the pitch and its harmonic components and suppresses other periodic noise and aperiodic broadband noise.

3. Enhancement based on a speech production model: speech production can be modeled as a linear time-varying filter, with different excitation sources for different types of speech. The most widely used speech production model is the all-pole model. A family of enhancement algorithms can be derived from the speech production model, such as time-varying-parameter Wiener filtering and Kalman filtering.

4. Enhancement based on the short-time spectrum: there are many algorithms of this kind, such as spectral subtraction, Wiener filtering, and minimum mean-square-error methods. These methods work over a wide range of signal-to-noise ratios, are simple, and are easy to run in real time.

5. Enhancement based on wavelet decomposition: this approach developed along with wavelet decomposition as a tool of mathematical analysis, and it also incorporates some basic principles of spectral subtraction.

6. Enhancement based on auditory masking: this approach exploits the auditory characteristics of the human ear.

Speech enhancement based on Kalman filtering belongs to the third category above. Conventional Kalman filtering makes two important assumptions in speech enhancement: the process noise and the measurement noise both follow Gaussian distributions. In practical speech enhancement, conventional Kalman filtering shows two limitations. (1) The AR parameters must be estimated accurately. In a real acquisition environment, however, the noise changes continually, so the AR parameters of the speech model must be estimated in real time, and the estimation must account for the various kinds of noise; otherwise enhancement performance degrades. (2) The conventional Kalman filtering algorithm considers only Gaussian noise, which does not match practical applications. During acquisition the speech signal may be contaminated by non-stationary noise that is sparse and follows a Laplacian distribution. Such noise is uncommon, but it does occur and has a large effect on speech quality. If non-stationary noise is treated as Gaussian noise during enhancement, the enhancement quality drops severely, which hampers subsequent semantic recognition of the speech.

In view of these problems, it is important to provide an online speech enhancement technique that can handle, in real time, the case where Gaussian noise and non-stationary noise are present at the same time.

Summary of the invention

The technical problem addressed by the present invention is that existing Kalman filtering methods cannot update the AR parameters of the speech model in real time and cannot handle non-stationary noise in the measurement process. By incorporating convex optimization, the invention provides an online speech enhancement method for non-stationary noise environments that estimates both the AR parameters and the non-stationary noise online.

To achieve the above object, the technical solution provided by the present invention is an online speech enhancement method for non-stationary noise environments, comprising the following steps:

1) Establish the system model for a non-stationary noise environment

1.1) Establish the autoregressive (AR) model for the case where Gaussian noise and sparse noise coexist

A speech signal is generated as an autoregressive process: white noise excites an all-pole linear system, so the current output equals the excitation at the current time plus a weighted sum of the outputs at the previous $p$ times. This AR model is expressed as:

$$ s(k) = \sum_{i=1}^{p} a_i \, s(k-i) + u(k) \tag{1} $$

where $u(k)$ is the Gaussian white-noise excitation at time $k$; $s(k-i)$ is the speech signal at time $(k-i)$; $s(k)$ is the speech signal at time $k$; $a_i$ is the $i$-th linear prediction coefficient, also called an AR model parameter; and $p$ is the order of the AR model.

A speech signal model that matches the actual measurement process is then established. The measurement process is described as:

$$ y(k) = s(k) + n(k) + v(k) \tag{2} $$

where $y(k)$ is the measured speech signal at time $k$; $s(k)$ is the speech signal at time $k$; $n(k)$ is Gaussian white noise at time $k$; and $v(k)$ is the non-stationary noise at time $k$, which is sparse and follows a Laplacian distribution.
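The signal model of Eqs. (1) and (2) can be illustrated with a small simulation. The following is a minimal sketch, not part of the patent: the function name `simulate_noisy_ar`, the noise levels, and the way sparsity is produced (a Laplacian draw gated by a low-probability mask) are all assumptions made for illustration.

```python
import numpy as np

def simulate_noisy_ar(a, n_samples, sigma_u=1.0, sigma_n=0.5,
                      spike_prob=0.02, spike_scale=5.0, seed=0):
    """Simulate s(k) from Eq. (1) and y(k) = s(k) + n(k) + v(k) from Eq. (2).

    a holds [a_1, ..., a_p]; every other parameter is an illustrative assumption.
    """
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    p = a.size
    s = np.zeros(n_samples + p)              # p leading zeros are the initial conditions
    u = rng.normal(0.0, sigma_u, n_samples + p)
    for k in range(p, n_samples + p):
        past = s[k - p:k][::-1]              # [s(k-1), ..., s(k-p)]
        s[k] = np.dot(a, past) + u[k]        # Eq. (1)
    s = s[p:]
    n = rng.normal(0.0, sigma_n, n_samples)  # Gaussian white noise n(k)
    # Sparse noise v(k): zero almost everywhere, Laplacian-distributed where active
    active = rng.random(n_samples) < spike_prob
    v = np.where(active, rng.laplace(0.0, spike_scale, n_samples), 0.0)
    y = s + n + v                            # Eq. (2)
    return s, n, v, y

s, n, v, y = simulate_noisy_ar([0.5, -0.3], 2000)
```

The coefficients `[0.5, -0.3]` were chosen only because they give a stable second-order AR process; real speech would need a higher order and coefficients that change over time.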

1.2) Establish the state-space model of the speech signal

Equations (1) and (2) are converted into a state-space model, described as follows:

$$ x(k) = F x(k-1) + p(k) \tag{3} $$

$$ y(k) = C x(k) + n(k) + v(k) \tag{4} $$

where

$$ F = \begin{bmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \\ a_p(k) & a_{p-1}(k) & a_{p-2}(k) & \cdots & a_1(k) \end{bmatrix} \tag{5} $$

$$ C = [0 \;\; 0 \;\; \cdots \;\; 0 \;\; 1] \tag{6} $$

$$ x(k) = [s(k-p+1) \;\; \cdots \;\; s(k)]^T \tag{7} $$

In the state equation (3) and the measurement equation (4), $x(k)$ is the speech signal state estimation sequence at time $k$, i.e. the optimal state estimate of the speech signal; $x(k-1)$ is the speech signal state estimation sequence at time $(k-1)$; $y(k)$ is the measured speech signal at time $k$; $F$ is the state transition matrix built from the linear prediction coefficients, and its last row $[a_p(k) \; \cdots \; a_1(k)]$ is called the AR parameters; $C = [0 \; 0 \; \cdots \; 0 \; 1]$ is the measurement matrix; $p(k)$ is the state noise at time $k$, which is Gaussian; $n(k)$ is the measurement noise at time $k$, which is Gaussian; and $v(k)$ is the non-stationary noise at time $k$, which follows a Laplacian distribution.

The statistical properties of the state noise $p(k)$ and the measurement noise $n(k)$ are:

$$ E[p(k)] = q, \quad E[n(k)] = r $$

$$ E[p(k) p(j)^T] = Q \, \delta_{kj}, \quad E[n(k) n(j)^T] = R \, \delta_{kj} \tag{8} $$

where $q$ and $r$ are the means of $p(k)$ and $n(k)$, respectively; $Q$ and $R$ are the covariances of $p(k)$ and $n(k)$, respectively; and $\delta_{kj}$ is the Kronecker delta. The speech enhancement problem is to estimate the optimal speech signal $x(k)$ given the measured speech signal $y(k)$.
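To make the structure of Eqs. (5) to (7) concrete, the sketch below builds $F$ and $C$ from a set of AR coefficients and checks that one noise-free step of Eq. (3) reproduces the recursion of Eq. (1). The helper name `ar_state_space` and the numeric values are illustrative assumptions.

```python
import numpy as np

def ar_state_space(a):
    """Build F (Eq. 5) and C (Eq. 6) from AR coefficients a = [a_1, ..., a_p]."""
    a = np.asarray(a, dtype=float)
    p = a.size
    F = np.zeros((p, p))
    F[:-1, 1:] = np.eye(p - 1)   # upper rows slide the sample window forward by one
    F[-1, :] = a[::-1]           # last row is [a_p, ..., a_1]
    C = np.zeros((1, p))
    C[0, -1] = 1.0               # the measurement picks out the newest sample s(k)
    return F, C

a = [0.5, -0.3, 0.1]                   # a_1, a_2, a_3 (made-up values)
F, C = ar_state_space(a)
x_prev = np.array([1.0, 2.0, 3.0])     # x(k-1) = [s(k-3), s(k-2), s(k-1)]
x_next = F @ x_prev                    # noise-free step of Eq. (3)
# Last entry equals a_1*s(k-1) + a_2*s(k-2) + a_3*s(k-3) = 1.5 - 0.6 + 0.1
```

Because only the last row of $F$ depends on the AR parameters, re-estimating those parameters at each time step amounts to overwriting a single row.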

2) Framing and windowing

Speech is short-term stationary: the signal is regarded as unchanged over 10 to 30 ms. It can therefore be divided into short segments for processing, which is called framing. Framing is implemented by weighting the signal with a movable finite-length window. The frame rate is typically 33 to 100 frames per second. Frames are taken by overlapping segmentation; the overlap between one frame and the next is called the frame shift, and the ratio of frame shift to frame length is 0 to 0.5.

3) System initialization

3.1) Initialize the parameters of the improved Kalman filter

Initialize the speech signal state estimate $x(0|0)$ and the covariance matrix $P(0|0)$, ensuring that the covariance matrix is positive definite.

3.2) Initialize the AR parameters

Initialize the AR parameter state estimate $\theta(0|0)$.

4) Estimate the AR parameters

The AR parameters are the last row $[a_p(k) \; \cdots \; a_1(k)]$ of the state transition matrix $F$ in Eq. (3). They describe the speech production process, and their accuracy directly affects the enhancement result. The AR parameter estimation therefore takes into account the speech signal state estimation sequence $x(k-1)$, the state noise $q(k)$, the measurement noise $n(k)$, and the non-stationary noise $v(k)$. A new state-space model for AR parameter estimation is established to achieve online robust estimation of the AR parameters. The real-time estimation proceeds as follows:

4.1) Establish the parameter estimation model for the AR parameters

The AR parameter model in an environment with mixed Gaussian and non-stationary noise is described as:

$$ \theta(k) = \theta(k-1) + q(k) $$

$$ y(k) = A \theta(k) + r(k) + w(k) \tag{9} $$

where $\theta(k) = [a_p(k) \; \cdots \; a_1(k)]^T$ is the AR parameter state sequence at time $k$; $q(k)$ is the state noise at time $k$, which is Gaussian with covariance matrix $D(k)$; $r(k)$ is the measurement noise at time $k$, which is Gaussian with covariance matrix $L(k)$; $w(k)$ is the non-stationary noise at time $k$, which is sparse and follows a Laplacian distribution; $A = x(k-1)^T = [s(k-p) \; \cdots \; s(k-1)]$ is the measurement matrix; and $y(k)$ is the measured speech signal at time $k$. The statistical properties of the state noise $q(k)$ and the measurement noise $r(k)$ are:

$$ E[q(k)] = d, \quad E[r(k)] = l $$

$$ E[q(k) q(j)^T] = D \, \delta_{kj}, \quad E[r(k) r(j)^T] = L \, \delta_{kj} \tag{10} $$

where $d$ and $l$ are the means of $q(k)$ and $r(k)$, respectively; $D$ and $L$ are the covariances of $q(k)$ and $r(k)$, respectively; and $\delta_{kj}$ is the Kronecker delta.

4.2) Reformulate the conventional Kalman filtering problem from the viewpoint of convex optimization

To make the sparse noise easy to estimate, the Kalman filtering problem is reformulated from the viewpoint of convex optimization. The state-space model of conventional Kalman filtering, which has no non-stationary noise term $w(k)$, is:

$$ \theta(k) = \theta(k-1) + q(k) $$

$$ y(k) = A \theta(k) + r(k) \tag{11} $$

By Bayes' rule, the AR parameter estimation problem is to estimate the optimal AR parameter sequence $\theta(k)$ given the measurement data $y(k)$, that is:

$$ p(\theta(k) | y(k)) = \frac{p(y(k) | \theta(k)) \, p(\theta(k))}{p(y(k))} \tag{12} $$

According to maximum likelihood estimation theory, the likelihood functions of $p(y(k) | \theta(k))$ and $p(\theta(k))$ are established:

$$ \begin{aligned} l_1(y(k), \theta(k)) &= \frac{p(\theta(k)) \, p(r(k))}{p(\theta(k))} \\ &= p(r(k)) = \frac{1}{(\sqrt{2\pi})^{m} |L|^{1/2}} \exp\left(-\frac{1}{2} r^T(k) L^{-1} r(k)\right) \end{aligned} \tag{13} $$

$$ \begin{aligned} l_2(\theta(k)) &= p(\theta(k)) \\ &= \frac{1}{(\sqrt{2\pi})^{n} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (\theta(k) - \hat{\theta}(k|k-1))^T \Psi(k)^{-1} (\theta(k) - \hat{\theta}(k|k-1))\right) \end{aligned} \tag{14} $$

where $\Psi(k)$ is the covariance matrix of the conditional probability $p(\theta(k) | y(k))$, $\Psi(k) = P_\theta(k|k) + D(k)$, and $P_\theta(k|k)$ is the updated covariance. When the likelihood functions $l_1(y(k), \theta(k))$ and $l_2(\theta(k))$ reach their maxima, the conditional probability $p(\theta(k) | y(k))$ attains its optimal estimate. Inspecting Eqs. (13) and (14) shows that maximizing $l_1(y(k), \theta(k))$ and $l_2(\theta(k))$ is equivalent to minimizing the exponent terms $r^T(k) L^{-1} r(k)$ and $(\theta(k) - \hat{\theta}(k|k-1))^T \Psi(k)^{-1} (\theta(k) - \hat{\theta}(k|k-1))$ of the likelihood functions. This yields the following optimization problem:

$$ \begin{aligned} \text{minimize} \quad & r^T(k) L^{-1} r(k) + (\theta(k) - \hat{\theta}(k|k-1))^T \Psi(k)^{-1} (\theta(k) - \hat{\theta}(k|k-1)) \\ \text{subject to} \quad & y(k) = A \theta(k) + r(k) \end{aligned} \tag{15} $$

where $\theta(k)$ and $r(k)$ are the variables and $\Psi(k) = P_\theta(k|k) + D(k)$ is the covariance matrix of the Gaussian noise. The estimate of $\theta(k)$ is $\hat{\theta}(k|k)$, and $r(k)$ is the estimate of the Gaussian noise. $P_\theta(k|k)$ is the covariance update matrix:

$$ P_\theta(k|k) = (I - K_\theta(k) A(k)) \, P_\theta(k|k-1) \tag{16} $$

$P_\theta(k|k-1)$ is the covariance prediction matrix:

$$ P_\theta(k|k-1) = P_\theta(k-1|k-1) + D(k-1) \tag{17} $$

$K_\theta(k)$ is the gain matrix:

$$ K_\theta(k) = P_\theta(k|k-1) A^T \left(A P_\theta(k|k-1) A^T + L(k-1)\right)^{-1} \tag{18} $$

4.3) Build the optimization problem for non-stationary noise estimation from the viewpoint of convex optimization

The non-stationary noise follows a Laplacian distribution and is sparse. The core idea of estimating it is to exploit this sparsity. Once step 4.2) has converted the conventional Kalman filtering problem into a convex optimization problem, a sparsity constraint on the non-stationary noise $w(k)$ can be added to the optimization in order to estimate the sparse noise. The new optimization problem is:

$$ \begin{aligned} \text{minimize} \quad & r^T(k) L^{-1} r(k) + (\theta(k) - \hat{\theta}(k|k-1))^T \Psi(k)^{-1} (\theta(k) - \hat{\theta}(k|k-1)) + \lambda \| w(k) \|_1 \\ \text{subject to} \quad & y(k) = A \theta(k) + r(k) + w(k) \end{aligned} \tag{19} $$

where $w(k)$ is the sparse noise. Solving this optimization problem yields the optimal estimate $\theta(k)$ of the AR parameters, with $\theta(k) = \hat{\theta}(k|k)$. The optimization problem in Eq. (19) is convex and can be solved with the interior-point methods used in engineering practice.
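The text solves Eq. (19) with an interior-point method. When the measurement $y(k)$ is a scalar, as in Eq. (9), the same minimizer can also be written in closed form: eliminating $r(k)$ and $\theta(k)$ leaves a one-dimensional problem in $w(k)$ whose solution is a soft-threshold of the innovation. The sketch below uses that closed form as a stand-in for the interior-point solver; it is not the solver described above. It also assumes that $\Psi(k)$ is the predicted covariance of Eq. (17), and the function name `robust_ar_update` and all numeric values are hypothetical.

```python
import numpy as np

def robust_ar_update(theta_prev, P_prev, x_prev, y, D, L, lam):
    """One AR-parameter update for a scalar measurement y(k).

    theta_prev, P_prev : previous estimate and its covariance P_theta(k-1|k-1)
    x_prev             : previous speech state x(k-1), used as measurement row A
    D, L               : state-noise covariance (matrix), measurement-noise variance (scalar)
    lam                : sparsity weight lambda in Eq. (19)
    """
    A = x_prev.reshape(1, -1)
    theta_pred = theta_prev                  # random-walk model: prediction = last estimate
    P_pred = P_prev + D                      # Eq. (17); used here as Psi(k) (assumption)
    S = (A @ P_pred @ A.T).item() + L        # innovation variance
    K = (P_pred @ A.T / S).ravel()           # Eq. (18)
    e = y - (A @ theta_pred).item()          # innovation
    # Minimizing Eq. (19) over w alone: soft-threshold the innovation at lam * S / 2
    w = np.sign(e) * max(abs(e) - lam * S / 2.0, 0.0)
    theta = theta_pred + K * (e - w)         # Kalman update on the spike-free innovation
    r = y - (A @ theta).item() - w           # Gaussian residual from the constraint
    P = (np.eye(theta.size) - np.outer(K, A.ravel())) @ P_pred   # Eq. (16)
    return theta, P, w, r

theta, P, w, r = robust_ar_update(
    np.zeros(2), np.eye(2), np.array([1.0, 0.0]),
    y=10.0, D=np.zeros((2, 2)), L=1.0, lam=1.0)
```

In this toy call the innovation of 10 is far above the threshold $\lambda S / 2 = 1$, so most of it is assigned to the sparse term $w$ and the parameters move only slightly. A very large $\lambda$ drives $w$ to zero and recovers the ordinary Kalman update.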

5) Estimate the speech signal state sequence

5.1) Reformulate the conventional Kalman filtering problem from the viewpoint of convex optimization

To make the sparse noise easy to estimate, the Kalman filtering problem is reformulated from the viewpoint of convex optimization. The state-space model of conventional Kalman filtering is:

$$ x(k) = F x(k-1) + p(k) \tag{20} $$

$$ y(k) = C x(k) + n(k) \tag{21} $$

By Bayes' rule, the Kalman filtering problem is to estimate the optimal speech state sequence $x(k)$ given the measurement data $y(k)$, that is:

$$ p(x(k) | y(k)) = \frac{p(y(k) | x(k)) \, p(x(k))}{p(y(k))} \tag{22} $$

According to maximum likelihood estimation theory, the likelihood functions of $p(y(k) | x(k))$ and $p(x(k))$ are established:

$$ \begin{aligned} l_1(y(k), x(k)) &= \frac{p(x(k)) \, p(n(k))}{p(x(k))} \\ &= p(n(k)) = \frac{1}{(\sqrt{2\pi})^{m} |R|^{1/2}} \exp\left(-\frac{1}{2} n^T(k) R^{-1} n(k)\right) \end{aligned} \tag{23} $$

$$ \begin{aligned} l_2(x(k)) &= p(x(k)) \\ &= \frac{1}{(\sqrt{2\pi})^{n} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (x(k) - \hat{x}(k|k-1))^T \Theta^{-1} (x(k) - \hat{x}(k|k-1))\right) \end{aligned} \tag{24} $$

where $\Theta$ is the covariance matrix of the conditional probability $p(x(k) | y(k-1))$, $\Theta = F P(k-1|k-1) F^T + Q(k-1)$, and $P(k-1|k-1)$ is the updated covariance. When the likelihood functions $l_1(y(k), x(k))$ and $l_2(x(k))$ reach their maxima, the conditional probability $p(x(k) | y(k))$ attains its optimal estimate. Inspecting Eqs. (23) and (24) shows that maximizing $l_1(y(k), x(k))$ and $l_2(x(k))$ is equivalent to minimizing the exponent terms $n^T(k) R^{-1} n(k)$ and $(x(k) - \hat{x}(k|k-1))^T \Theta^{-1} (x(k) - \hat{x}(k|k-1))$ of the likelihood functions. This yields the following optimization problem:

$$ \begin{aligned} \text{minimize} \quad & n^T(k) R^{-1} n(k) + (x(k) - \hat{x}(k|k-1))^T \Theta^{-1} (x(k) - \hat{x}(k|k-1)) \\ \text{subject to} \quad & y(k) = C x(k) + n(k) \end{aligned} \tag{25} $$

where $x(k)$ and $n(k)$ are the variables and $\Theta$ is the covariance matrix of the Gaussian noise. The estimate of $x(k)$ is $\hat{x}(k|k)$, and $n(k)$ is the estimate of the Gaussian noise.

$P(k|k)$ is the covariance update matrix:

$$ P(k|k) = (I - K(k) C(k)) \, P(k|k-1) \tag{26} $$

$P(k|k-1)$ is the covariance prediction matrix:

$$ P(k|k-1) = F(k-1) P(k-1|k-1) F(k-1)^T + Q(k-1) \tag{27} $$

$K(k)$ is the gain matrix:

$$ K(k) = P(k|k-1) C^T \left(C P(k|k-1) C^T + R(k-1)\right)^{-1} \tag{28} $$

5.2) Build the sparse noise estimation problem from the viewpoint of convex optimization

The core idea of estimating the sparse noise is to exploit its sparsity. Once step 5.1) has converted the conventional Kalman filtering problem into a convex optimization problem, a sparsity constraint on the sparse noise $v(k)$ can be added to the optimization in order to estimate it. The new optimization problem is:

$$ \begin{aligned} \text{minimize} \quad & n^T(k) R^{-1} n(k) + (x(k) - \hat{x}(k|k-1))^T \Theta^{-1} (x(k) - \hat{x}(k|k-1)) + \lambda \| v(k) \|_1 \\ \text{subject to} \quad & y(k) = C x(k) + n(k) + v(k) \end{aligned} \tag{29} $$

where $v(k)$ is the sparse noise. Solving this optimization problem yields the optimal estimate $x(k)$ of the speech signal state; $x(k)$ corresponds to the optimal state estimate $\hat{x}(k|k)$ in conventional Kalman filtering. The optimization problem in Eq. (29) is convex and can be solved with the interior-point methods used in engineering practice.

5.3) After the speech signal at time $k$ has been enhanced, the enhancement result $\hat{x}(k|k)$ is fed back to step 4) and used to update the AR parameters $\theta(k+1)$ at time $k+1$. Speech enhancement then continues at time $k+1$ with the estimation of $x(k+1)$, and so on until the entire speech signal has been processed.
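The alternation between steps 4) and 5) can be sketched as a single loop. As before, the interior-point solve of Eqs. (19) and (29) is replaced by the closed-form soft-threshold that applies to scalar measurements, which is a substitution rather than the solver named in the text. The names `soft`, `robust_update`, and `dual_filter`, the zero and identity initial values, and every noise level are untuned placeholders chosen for illustration; only the order $p = 13$ comes from the embodiment.

```python
import numpy as np

def soft(e, t):
    """Soft-thresholding operator: shrink e toward zero by t."""
    return np.sign(e) * max(abs(e) - t, 0.0)

def robust_update(z_pred, P_pred, H, y, R, lam):
    """Scalar-measurement update that minimizes Eq. (19) or Eq. (29) in closed form."""
    S = float(H @ P_pred @ H) + R            # innovation variance
    K = P_pred @ H / S                       # gain, Eq. (18) or Eq. (28)
    e = y - float(H @ z_pred)                # innovation
    spike = soft(e, lam * S / 2.0)           # estimate of the sparse noise
    z = z_pred + K * (e - spike)
    P = (np.eye(len(z)) - np.outer(K, H)) @ P_pred
    return z, P

def dual_filter(y, p=13, Q=1e-2, R=1e-2, D=1e-6, L=1e-2, lam=1.0):
    """Alternate AR-parameter and speech-state updates over one signal."""
    x, P = np.zeros(p), np.eye(p)            # x(0|0), P(0|0)
    theta, P_th = np.zeros(p), np.eye(p)     # theta(0|0) and its covariance
    C = np.zeros(p)
    C[-1] = 1.0                              # Eq. (6)
    Qm = np.zeros((p, p))
    Qm[-1, -1] = Q                           # excitation noise drives s(k) only
    s_hat = np.empty(len(y))
    for k, yk in enumerate(y):
        # Step 4: update theta = [a_p, ..., a_1], with x(k-1) as the measurement row A
        theta, P_th = robust_update(theta, P_th + D * np.eye(p), x, yk, L, lam)
        # Step 5: rebuild F from theta, predict with Eq. (27), then update the state
        F = np.eye(p, k=1)
        F[-1, :] = theta
        x, P = robust_update(F @ x, F @ P @ F.T + Qm, C, yk, R, lam)
        s_hat[k] = x[-1]                     # enhanced sample s(k)
    return s_hat

y = np.zeros(100)
y[50] = 100.0                                # a single large impulse on a silent signal
s_hat = dual_filter(y)
```

With these placeholder settings the isolated impulse is almost entirely attributed to the sparse noise term, so the enhanced output stays close to zero at that sample. On real speech the noise variances and $\lambda$ would have to be matched to the signal scale, and frame-wise processing as in step 2) would wrap this loop.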

Compared with the prior art, the present invention has the following advantages and beneficial effects:

1. To address the problem that the AR parameters of the speech model (in particular the autoregressive AR model) cannot be updated in real time as the noise changes, the invention proposes a dual Kalman filtering framework. Two Kalman filters run in parallel, speech signal state estimation and AR parameter estimation update each other, and the state estimation and parameter estimation processes alternate. The parameter estimation can thus adapt to changes in the noise, which improves the accuracy of the system model and, in turn, the performance of speech enhancement.

2. To address the problem that the conventional Kalman filtering algorithm cannot handle non-stationary noise, the invention combines it with convex optimization and proposes an improved Kalman filtering framework. The new algorithm introduces both a Gaussian noise term and a non-stationary noise term into the measurement process of the speech enhancement model. By building a suitable optimization model with convex optimization, it can estimate Gaussian noise and non-stationary noise accurately and improve the accuracy of speech enhancement.

Brief description of the drawings

Fig. 1 is a flow chart of the speech enhancement method under non-stationary noise.

Fig. 2a is a schematic diagram of the original speech signal.

Fig. 2b is a schematic diagram of the speech signal with Gaussian white noise.

Fig. 2c is a schematic diagram of the speech signal with Gaussian white noise and non-stationary noise.

Fig. 3 is a flow chart of the speech enhancement algorithm based on the dual improved Kalman filter.

Fig. 4a shows the original speech signal.

Fig. 4b is a schematic diagram of the speech enhancement result.

Detailed description of the embodiments

The invention is further described below with reference to a specific embodiment.

As shown in Fig. 1, the online speech enhancement method for non-stationary noise environments described in this embodiment comprises the following steps:

1) Establish the system model for a non-stationary noise environment

The generation of a speech signal can be described as an autoregressive process: white noise excites an all-pole linear system, so the current output equals the excitation at the current time plus a weighted sum of the outputs at the previous $p$ times. This AR model is expressed as:

$$ s(k) = \sum_{i=1}^{p} a_i \, s(k-i) + u(k) \tag{1} $$

where $u(k)$ is the Gaussian white-noise excitation at time $k$; $s(k-i)$ is the speech signal at time $(k-i)$; $s(k)$ is the speech signal at time $k$; $a_i$ is the $i$-th linear prediction coefficient, also called an AR model parameter; and $p$ is the order of the AR model.

As shown in Figs. 2a, 2b, and 2c, a speech signal observed in a real environment is contaminated by various kinds of noise, non-stationary noise in particular. The invention therefore considers Gaussian noise and non-stationary noise together in the measurement process, giving a speech signal model that better matches actual measurement. The measurement process in the present invention can be described as:

$$ y(k) = s(k) + n(k) + v(k) \tag{2} $$

where $y(k)$ is the measured speech signal at time $k$; $s(k)$ is the speech signal at time $k$; $n(k)$ is Gaussian white noise at time $k$; and $v(k)$ is the non-stationary noise at time $k$, which is sparse and follows a Laplacian distribution.

1.2) Establish the state-space model of the speech signal

Equations (1) and (2) are converted into a state-space model, which can be described as follows:

$$ x(k) = F x(k-1) + p(k) \tag{3} $$

$$ y(k) = C x(k) + n(k) + v(k) \tag{4} $$

where

$$ F = \begin{bmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \\ a_p(k) & a_{p-1}(k) & a_{p-2}(k) & \cdots & a_1(k) \end{bmatrix} \tag{5} $$

$$ C = [0 \;\; 0 \;\; \cdots \;\; 0 \;\; 1] \tag{6} $$

$$ x(k) = [s(k-p+1) \;\; \cdots \;\; s(k)]^T \tag{7} $$

In the state equation (3) and the measurement equation (4), $x(k)$ is the speech signal state estimation sequence at time $k$, i.e. the optimal state estimate of the speech signal; $x(k-1)$ is the speech signal state estimation sequence at time $(k-1)$; $y(k)$ is the measured speech signal at time $k$; $F$ is the state transition matrix built from the linear prediction coefficients, and its last row $[a_p(k) \; \cdots \; a_1(k)]$ is called the AR parameters; $C = [0 \; 0 \; \cdots \; 0 \; 1]$ is the measurement matrix; $p(k)$ is the state noise at time $k$, which is Gaussian; $n(k)$ is the measurement noise at time $k$, which is Gaussian; and $v(k)$ is the non-stationary noise at time $k$, which follows a Laplacian distribution.

$$ E[p(k)] = q, \quad E[n(k)] = r $$

$$ E[p(k) p(j)^T] = Q \, \delta_{kj}, \quad E[n(k) n(j)^T] = R \, \delta_{kj} \tag{8} $$

where $q$ and $r$ are the means of $p(k)$ and $n(k)$, respectively; $Q$ and $R$ are the covariances of $p(k)$ and $n(k)$, respectively; and $\delta_{kj}$ is the Kronecker delta. The speech enhancement problem is to estimate the optimal speech signal $x(k)$ given the measured speech signal $y(k)$.

2) Framing and windowing

Speech is short-term stationary (the signal is generally regarded as approximately unchanged over 10 to 30 ms), so it can be divided into short segments for processing, which is called framing. Framing is implemented by weighting the signal with a movable finite-length window. The frame rate is typically about 33 to 100 frames per second. Frames are generally taken by overlapping segmentation; the overlap between one frame and the next is called the frame shift, and the ratio of frame shift to frame length is generally 0 to 0.5. In the present invention the frame length is 25 ms and the frame shift is 10 ms.
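The framing step with the stated 25 ms frame length and 10 ms frame shift can be sketched as follows. Several points are assumptions rather than statements from the text: the 10 ms shift is treated as the hop between consecutive frame starts (the usual convention, although the text describes it as the overlapping part), the window is a Hamming window because no window type is named, and the 16 kHz sampling rate and the helper name `frame_signal` are placeholders.

```python
import numpy as np

def frame_signal(y, fs, frame_ms=25.0, hop_ms=10.0):
    """Split y into overlapping frames and weight each frame with a window."""
    frame_len = int(round(fs * frame_ms / 1000.0))
    hop = int(round(fs * hop_ms / 1000.0))
    if len(y) < frame_len:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (len(y) - frame_len) // hop
    # Row i holds the sample indices of frame i
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    window = np.hamming(frame_len)           # assumed window type
    return y[idx] * window

fs = 16000                                   # assumed sampling rate in Hz
y = np.random.default_rng(0).normal(size=fs) # one second of placeholder signal
frames = frame_signal(y, fs)                 # 400-sample frames every 160 samples
```

Trailing samples that do not fill a complete frame are dropped here; padding the last frame would be an equally reasonable choice.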

3) System initialization

3.1) Initialize the parameters of the improved Kalman filter

Initialize the speech signal state estimate $x(0|0)$ and the covariance matrix $P(0|0)$, ensuring that the covariance matrix is positive definite.

3.2) Initialize the AR parameters

Initialize the AR parameter state estimate $\theta(0|0)$. In the present invention the order of the AR model is 13 (set empirically).

4) Estimate the AR parameters

The AR parameters are the last row $[a_p(k) \; \cdots \; a_1(k)]$ of the state transition matrix $F$ in Eq. (3). They describe the speech production process, and their accuracy directly affects the enhancement result. In practice, AR parameter estimation is strongly affected by the speech signal itself and by various kinds of noise. The invention therefore takes into account the speech signal state estimation sequence $x(k-1)$, the state noise $q(k)$, the measurement noise $n(k)$, the non-stationary noise $v(k)$, and so on, and establishes a new state-space model for AR parameter estimation to achieve online robust estimation of the AR parameters. This is one core point of the invention. As shown in Fig. 3, the real-time estimation of the AR parameters proceeds as follows:

4.1) Establish the parameter estimation model for the AR parameters

$$ \theta(k) = \theta(k-1) + q(k) $$

$$ y(k) = A \theta(k) + r(k) + w(k) \tag{9} $$

where $\theta(k) = [a_p(k) \; \cdots \; a_1(k)]^T$ is the AR parameter state sequence at time $k$; $q(k)$ is the state noise at time $k$, which is Gaussian with covariance matrix $D(k)$; $r(k)$ is the measurement noise at time $k$, which is Gaussian with covariance matrix $L(k)$; $w(k)$ is the non-stationary noise at time $k$, which is sparse and follows a Laplacian distribution; $A = x(k-1)^T = [s(k-p) \; \cdots \; s(k-1)]$ is the measurement matrix; and $y(k)$ is the measured speech signal at time $k$. The statistical properties of the state noise $q(k)$ and the measurement noise $r(k)$ are:

$$ E[q(k)] = d, \quad E[r(k)] = l $$

$$ E[q(k) q(j)^T] = D \, \delta_{kj}, \quad E[r(k) r(j)^T] = L \, \delta_{kj} \tag{10} $$

where $d$ and $l$ are the means of $q(k)$ and $r(k)$, respectively; $D$ and $L$ are the covariances of $q(k)$ and $r(k)$, respectively; and $\delta_{kj}$ is the Kronecker delta.

To make the sparse noise easy to estimate, the Kalman filtering problem is reformulated from the viewpoint of convex optimization. The state-space model of conventional Kalman filtering (without the non-stationary noise $w(k)$) is:

$$ \theta(k) = \theta(k-1) + q(k) $$

$$ y(k) = A \theta(k) + r(k) \tag{11} $$

By Bayes' rule, the AR parameter estimation problem can be expressed as estimating the optimal AR parameter sequence $\theta(k)$ given the measurement data $y(k)$, that is:

$$ p(\theta(k) | y(k)) = \frac{p(y(k) | \theta(k)) \, p(\theta(k))}{p(y(k))} \tag{12} $$

$$ \begin{aligned} l_1(y(k), \theta(k)) &= \frac{p(\theta(k)) \, p(r(k))}{p(\theta(k))} \\ &= p(r(k)) = \frac{1}{(\sqrt{2\pi})^{m} |L|^{1/2}} \exp\left(-\frac{1}{2} r^T(k) L^{-1} r(k)\right) \end{aligned} \tag{13} $$

$$ \begin{aligned} l_2(\theta(k)) &= p(\theta(k)) \\ &= \frac{1}{(\sqrt{2\pi})^{n} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (\theta(k) - \hat{\theta}(k|k-1))^T \Psi(k)^{-1} (\theta(k) - \hat{\theta}(k|k-1))\right) \end{aligned} \tag{14} $$

where $\Psi(k)$ is the covariance matrix of the conditional probability $p(\theta(k) | y(k))$, $\Psi(k) = P_\theta(k|k) + D(k)$, and $P_\theta(k|k)$ is the updated covariance. When the likelihood functions $l_1(y(k), \theta(k))$ and $l_2(\theta(k))$ reach their maxima, the conditional probability $p(\theta(k) | y(k))$ attains its optimal estimate. Inspecting Eqs. (13) and (14) shows that maximizing $l_1(y(k), \theta(k))$ and $l_2(\theta(k))$ is equivalent to minimizing the exponent terms $r^T(k) L^{-1} r(k)$ and $(\theta(k) - \hat{\theta}(k|k-1))^T \Psi(k)^{-1} (\theta(k) - \hat{\theta}(k|k-1))$ of the likelihood functions. This yields the following optimization problem:

$$ \begin{aligned} \text{minimize} \quad & r^T(k) L^{-1} r(k) + (\theta(k) - \hat{\theta}(k|k-1))^T \Psi(k)^{-1} (\theta(k) - \hat{\theta}(k|k-1)) \\ \text{subject to} \quad & y(k) = A \theta(k) + r(k) \end{aligned} \tag{15} $$

where $\theta(k)$ and $r(k)$ are the variables and $\Psi(k) = P_\theta(k|k) + D(k)$ is the covariance matrix of the Gaussian noise. The estimate of $\theta(k)$ is $\hat{\theta}(k|k)$, and $r(k)$ is the estimate of the Gaussian noise. $P_\theta(k|k)$ is the covariance update matrix:

$$ P_\theta(k|k) = (I - K_\theta(k) A(k)) \, P_\theta(k|k-1) \tag{16} $$

$P_\theta(k|k-1)$ is the covariance prediction matrix:

$$ P_\theta(k|k-1) = P_\theta(k-1|k-1) + D(k-1) \tag{17} $$

$K_\theta(k)$ is the gain matrix:

$$ K_\theta(k) = P_\theta(k|k-1) A^T \left(A P_\theta(k|k-1) A^T + L(k-1)\right)^{-1} \tag{18} $$

$$ \begin{aligned} \text{minimize} \quad & r^T(k) L^{-1} r(k) + (\theta(k) - \hat{\theta}(k|k-1))^T \Psi(k)^{-1} (\theta(k) - \hat{\theta}(k|k-1)) + \lambda \| w(k) \|_1 \\ \text{subject to} \quad & y(k) = A \theta(k) + r(k) + w(k) \end{aligned} \tag{19} $$

where $w(k)$ is the sparse noise. Solving this optimization problem yields the optimal estimate $\theta(k)$ of the AR parameters (note: $\theta(k) = \hat{\theta}(k|k)$). The optimization problem in Eq. (19) is convex and can be solved with the interior-point methods that are well established in engineering practice.

5) estimated speech signal status switch.

During speech signal collection, nonstationary noise affects larger on voice quality.In order to improve voice quality, Voice enhancement algorithm allows for tackling the situation of Gaussian noise and nonstationary noise mixing simultaneously.Nonstationary noise is typically obeyed Laplacian distribution, has sparse characteristic, and the estimation of nonstationary noise mainly be make use of with the sparse characteristic of noise.For convenience In optimization problem introduce noise sparsity constraints, initially with convex optimisation technique by traditional Kalman filtering problem reformulation be one Individual convex optimization problem, then introduces the sparsity constraints to sparse noise in the new optimization building, is finally completed speech enhan-cement Task, this is another core point of the present invention.

In order to easily estimate to sparse noise, need to ask from the angle reconstruct Kalman filtering of convex optimization Topic.The state-space model of traditional Kalman filtering is as follows:

X (k)=fx (k-1)+p (k) (20)

Y (k)=cx (k)+n (k) (21)

According to Bayes principle, Kalman filtering problem can be expressed as, under the premise of metric data y (k) is known, estimating Optimum voice status sequence x of meter (k) it may be assumed that

p (x (k) | y (k)) = \frac{p (y (k) | x (k)) p (x (k)}{p (y (k))} - - - (22)

\begin{matrix} l_{1} (y (k), x (k)) = \frac{p (x (k)) p (n (k))}{p (x (k))} \\ = p (w (k)) = \frac{1}{{(\sqrt{2 π})}^{m} {| r |}^{1 / 2}} \exp (- \frac{1}{2} w^{t} (k) r^{- 1} w (k)) \end{matrix} - - - (23)

\begin{matrix} l_{2} (x (k)) = p (x (k)) \\ = \frac{1}{{(\sqrt{2 π})}^{n} {| σ |}^{1 / 2}} \exp (- \frac{1}{2} {(x (k) - \hat{x} (k | k - 1))}^{t} θ^{- 1} (x (k) - \hat{x} (k | k - 1))) \end{matrix} - - - (24)

Here Θ is the covariance matrix of the conditional probability p(x(k) | y(k-1)), Θ = F P(k-1|k-1) F^T + Q(k-1), where P(k-1|k-1) is the updated covariance. The conditional probability p(x(k) | y(k)) attains its optimal estimate when the likelihood functions l_1(y(k), x(k)) and l_2(x(k)) are maximized. From formulas (23) and (24), maximizing l_1(y(k), x(k)) and l_2(x(k)) is equivalent to minimizing the exponents n^T(k) R^{-1} n(k) and (x(k) - \hat{x}(k|k-1))^T Θ^{-1} (x(k) - \hat{x}(k|k-1)) in the likelihood functions. This gives the following optimization form:

\begin{matrix} \text{minimize} & n^{T}(k) R^{-1} n(k) + (x(k) - \hat{x}(k|k-1))^{T} \Theta^{-1} (x(k) - \hat{x}(k|k-1)) \\ \text{subject to} & y(k) = C x(k) + n(k) \end{matrix} \qquad (25)

Here x(k) and n(k) are the optimization variables, and Θ is the covariance matrix of the Gaussian noise. The estimate of x(k) is \hat{x}(k|k), and n(k) is the estimate of the Gaussian noise.
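As a worked check, the constraint of formula (25) can be substituted into its objective by writing the Gaussian noise as y(k) - C x(k). The sketch below, which abbreviates y(k) as y and x(k) as x, sets the gradient to zero and applies the standard matrix inversion lemma in the last line. Because Θ as defined above coincides with the predicted covariance P(k|k-1) of formula (27), the resulting gain has the same form as formula (28).

```latex
\begin{aligned}
J(x) &= (y - Cx)^{T} R^{-1} (y - Cx)
      + (x - \hat{x}(k|k-1))^{T} \Theta^{-1} (x - \hat{x}(k|k-1)) \\
\nabla_{x} J = 0 \;&\Rightarrow\;
  (\Theta^{-1} + C^{T} R^{-1} C)\, x
  = \Theta^{-1} \hat{x}(k|k-1) + C^{T} R^{-1} y \\
\hat{x}(k|k) &= \hat{x}(k|k-1)
  + \Theta C^{T} (C \Theta C^{T} + R)^{-1} (y - C \hat{x}(k|k-1))
\end{aligned}
```

In other words, without the sparse term the convex formulation reproduces the conventional Kalman measurement update, which is what allows a sparsity penalty to be added to it afterwards.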

P(k|k) is the covariance update matrix:

P(k|k) = (I - K(k) C(k)) P(k|k-1)   (26)

P(k|k-1) is the covariance prediction matrix:

P(k|k-1) = F(k-1) P(k-1|k-1) F(k-1)^T + Q(k-1)   (27)

K(k) is the gain matrix:

K(k) = P(k|k-1) C^T (C P(k|k-1) C^T + R(k-1))^{-1}   (28)

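Adding a sparsity constraint on the sparse noise v(k) to this optimization gives the new optimization form:

\begin{matrix} \text{minimize} & n^{T}(k) R^{-1} n(k) + (x(k) - \hat{x}(k|k-1))^{T} \Theta^{-1} (x(k) - \hat{x}(k|k-1)) + \lambda \| v(k) \|_{1} \\ \text{subject to} & y(k) = C x(k) + n(k) + v(k) \end{matrix} \qquad (29)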

Here v(k) is the sparse noise. Solving the above optimization problem yields the optimal estimate x(k) of the speech signal state (note: x(k) corresponds to the optimal state estimate \hat{x}(k|k) in traditional Kalman filtering). The optimization problem in formula (29) is convex and can be solved with the interior point method, which is well established in engineering practice.
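One time step of the speech state estimation can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the implementation of the invention: it replaces the interior point method named in the text with a closed-form soft-thresholding solution, which is exact only because the measurement y(k) is a scalar, and it takes Θ to be the predicted covariance P(k|k-1). The function name `speech_state_step` and its arguments are hypothetical.

```python
import numpy as np

def speech_state_step(x_prev, P_prev, F, y, Q, R, lam):
    """Predict the speech state with F, then update it while removing sparse noise.

    x_prev, P_prev : previous state estimate and covariance
    F              : state-transition matrix built from the current AR parameters
    y              : noisy sample y(k)
    Q, R           : state-noise covariance (p x p) and measurement-noise variance
    lam            : sparsity weight lambda (assumed tuning parameter)
    """
    p = len(x_prev)
    c = np.zeros(p)
    c[-1] = 1.0                                   # C = [0 ... 0 1]
    x_pred = F @ x_prev                           # state prediction, formula (20)
    P_pred = F @ P_prev @ F.T + Q                 # covariance prediction, formula (27)
    S = c @ P_pred @ c + R                        # innovation variance C P C^T + R
    K = P_pred @ c / S                            # gain, formula (28)
    e = y - c @ x_pred                            # innovation before removing sparse noise
    v = np.sign(e) * max(abs(e) - lam * S / 2.0, 0.0)   # estimate of sparse noise v(k)
    x = x_pred + K * (e - v)                      # update with the de-spiked innovation
    P = P_pred - np.outer(K, c) @ P_pred          # covariance update, formula (26)
    return x, P, v
```

The last element of the returned state is the enhanced sample at time k. A sample hit by an impulsive disturbance produces a large innovation, most of which is assigned to v(k) rather than to the speech state.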

As shown in Figs. 4a and 4b, the method proposed by the present invention filters out Gaussian noise and nonstationary noise fairly accurately and enhances the original speech signal.

With the present invention, white noise and nonstationary noise can be accurately estimated and filtered out, achieving speech enhancement under mixed white and nonstationary noise. The invention also provides a cleaner estimated speech signal, which offers front-end support for improving speech recognition accuracy.

Because the present invention establishes two robust Kalman filter models and mathematically models the generation process of the speech signal, both the temporal characteristics and the time-varying characteristics of speech are specifically taken into account. The AR parameter estimation is updated dynamically and iteratively in real time, which meets the requirement of time-varying parameters, and the state estimation in turn estimates the speech signal frame by frame, exploiting the short-term stationarity of speech. As a result, the filtering performance is better than that of traditional Kalman filtering, and the method is worth popularizing.

The embodiment described above is only a preferred embodiment of the present invention and does not limit the scope of implementation of the present invention. Any change made according to the shape and principle of the present invention shall therefore fall within the scope of protection of the present invention.

Claims

1. An online speech enhancement method for a nonstationary noise environment, characterized by comprising the following steps:

1) Establish the system model under a nonstationary noise environment

The generation of a speech signal is an autoregressive process in which white noise excites an all-pole linear system, that is, the current output equals the excitation at the current moment plus a weighted sum of the outputs at the past p moments; this is an autoregressive (AR) model, expressed as follows:

s(k) = \sum_{i=1}^{p} a_{i} s(k-i) + u(k) \qquad (1)

Wherein, u(k) is the white Gaussian noise excitation at time k; s(k-i) is the speech signal at time (k-i); s(k) is the speech signal at time k; a_i is the i-th linear prediction coefficient, also called an AR model parameter; p is the order of the AR model;

y(k) = s(k) + n(k) + v(k)   (2)

Wherein, y(k) is the speech signal measurement sequence at time k; s(k) is the speech signal at time k; n(k) is the white Gaussian noise at time k; v(k) is the nonstationary noise at time k, which obeys a Laplacian distribution and is sparse;
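For illustration, formulas (1) and (2) can be simulated to produce test data. The sketch below is an assumption-laden example, not part of the claimed method: the function name `simulate_noisy_ar`, the noise levels, and the choice to model the sparse noise v(k) as Laplacian-distributed spikes that occur with a small probability are all hypothetical.

```python
import numpy as np

def simulate_noisy_ar(a, num_samples, sigma_u=1.0, sigma_n=0.1,
                      spike_prob=0.02, spike_scale=2.0, seed=0):
    """Generate s(k) from formula (1) and y(k) from formula (2).

    a = [a_1, ..., a_p] are the AR coefficients. Returns (s, y, v).
    """
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    p = len(a)
    s = np.zeros(num_samples + p)               # p leading zeros act as initial conditions
    u = rng.normal(0.0, sigma_u, num_samples)   # white Gaussian excitation u(k)
    for k in range(num_samples):
        past = s[k:k + p][::-1]                 # [s(k-1), ..., s(k-p)]
        s[k + p] = a @ past + u[k]              # formula (1)
    s = s[p:]
    n = rng.normal(0.0, sigma_n, num_samples)   # white Gaussian measurement noise n(k)
    spikes = rng.laplace(0.0, spike_scale, num_samples)
    v = np.where(rng.random(num_samples) < spike_prob, spikes, 0.0)  # sparse v(k)
    return s, s + n + v, v                      # y(k) = s(k) + n(k) + v(k), formula (2)
```

The coefficients passed in should describe a stable all-pole system, for example `a = [0.5, -0.3]`, otherwise the simulated s(k) grows without bound.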

1.2) Establish the speech signal state-space model

x(k) = F x(k-1) + p(k)   (3)

y(k) = C x(k) + n(k) + v(k)   (4)

Wherein,

F = \begin{bmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \\ a_{p}(k) & a_{p-1}(k) & a_{p-2}(k) & \cdots & a_{1}(k) \end{bmatrix} \qquad (5)

C = [0 0 ... 0 1]   (6)

x(k) = [s(k-p+1) ... s(k)]^T   (7)

In the speech signal state equation (3) and the speech signal measurement equation (4), x(k) is the speech signal state estimation sequence at time k, i.e. the optimal state estimate of the speech signal; x(k-1) is the speech signal state estimation sequence at time (k-1); y(k) is the speech signal measurement sequence at time k; F is the state-transition matrix formed by the linear prediction coefficients, and the last row [a_p(k) … a_1(k)] of F is called the AR parameters; C = [0 0 ... 0 1] is the measurement matrix; p(k) is the state noise at time k, which obeys a Gaussian distribution; n(k) is the measurement noise at time k, which obeys a Gaussian distribution; v(k) is the nonstationary noise at time k, which obeys a Laplacian distribution;

E(p(k)) = q, E(n(k)) = r

E(p(k) p(j)^T) = Q δ_kj, E(n(k) n(j)^T) = R δ_kj   (8)

Wherein, q and r are the means of the noises p(k) and n(k), respectively; Q and R are the covariances of the noises p(k) and n(k), respectively; δ_kj is the Kronecker function; the speech enhancement problem is to estimate the optimal speech signal x(k) given the measured speech signal y(k);
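The matrices of formulas (5) and (6) can be built directly from a coefficient vector. The following is a minimal sketch; the function name `build_state_space` and the example coefficients are hypothetical and chosen only to show that the last row of F reproduces the weighted sum of formula (1).

```python
import numpy as np

def build_state_space(a):
    """Return (F, C) of formulas (5) and (6) for a = [a_1, ..., a_p]."""
    a = np.asarray(a, dtype=float)
    p = len(a)
    F = np.zeros((p, p))
    F[:-1, 1:] = np.eye(p - 1)     # upper rows shift the state by one sample
    F[-1, :] = a[::-1]             # last row is [a_p, ..., a_1]
    C = np.zeros((1, p))
    C[0, -1] = 1.0                 # C picks the newest sample s(k)
    return F, C

a = [0.5, -0.3, 0.1]                       # example a_1, a_2, a_3
F, C = build_state_space(a)
x_prev = np.array([1.0, 2.0, 3.0])         # [s(k-3), s(k-2), s(k-1)]
x_next = F @ x_prev                        # noise-free [s(k-2), s(k-1), s(k)]
print(x_next, C @ x_next)
```

Here the predicted s(k) is 0.5·3 + (-0.3)·2 + 0.1·1, i.e. the AR sum over the three past samples, and `C @ x_next` extracts that newest sample.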

2) Framing and windowing

A speech signal is short-term stationary and is regarded as unchanged within 10 to 30 ms, so it can be divided into short segments for processing, which is framing; the framing of the speech signal is realized by weighting it with a movable window of finite length; there are generally 33 to 100 frames per second; framing uses overlapping segmentation, the overlapping part of the previous frame and the next frame is called the frame shift, and the ratio of frame shift to frame length is 0 to 0.5;
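Overlapping framing with a window can be sketched as below. The text does not name a window type or a sampling rate, so the Hamming window, the 16 kHz rate, and the 20 ms frame length are assumptions made only for this example, and `frame_signal` is a hypothetical helper. The `overlap` argument is the number of samples shared by consecutive frames, which is what the text calls the frame shift.

```python
import numpy as np

def frame_signal(signal, frame_len, overlap):
    """Split a 1-D signal into overlapping frames weighted by a Hamming window."""
    signal = np.asarray(signal, dtype=float)
    if not 0 <= overlap < frame_len:
        raise ValueError("overlap must satisfy 0 <= overlap < frame_len")
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    hop = frame_len - overlap                          # advance between frame starts
    num_frames = 1 + (len(signal) - frame_len) // hop  # trailing partial frame is dropped
    idx = np.arange(frame_len)[None, :] + hop * np.arange(num_frames)[:, None]
    return signal[idx] * np.hamming(frame_len)         # one windowed frame per row

fs = 16000                         # assumed sampling rate in Hz
frame_len = fs * 20 // 1000        # 20 ms frame, inside the 10 to 30 ms range
overlap = frame_len // 2           # overlap-to-length ratio of 0.5
frames = frame_signal(np.ones(fs), frame_len, overlap)
print(frames.shape)
```

With these values one second of signal yields 99 frames of 320 samples each, which falls inside the range of 33 to 100 frames per second stated above.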

3) System initialization

3.1) Parameter initialization of the improved Kalman filter

Initialize the speech signal state estimation sequence x(0|0) and the covariance matrix P(0|0), ensuring that the covariance matrix is positive definite;

3.2) AR parameter initialization

Initialize the AR parameter state estimation sequence θ(0|0);

4) Estimate the AR parameters

The AR parameters are the last row [a_p(k) … a_1(k)] of the state-transition matrix F in formula (3); they mainly describe the speech production process, and their accuracy directly affects the speech enhancement result; the estimation of the AR parameters jointly considers the speech signal state estimation sequence x(k-1), the state noise q(k), the measurement noise n(k) and the nonstationary noise v(k), and a new AR parameter estimation state-space model is established to realize online robust estimation of the AR parameters; the real-time estimation process for the AR parameters is as follows:

4.1) Establish the parameter estimation model of the AR parameters

θ(k) = θ(k-1) + q(k)

y(k) = A θ(k) + r(k) + w(k)   (9)

Wherein, θ(k) = [a_p(k) … a_1(k)]^T is the AR parameter state sequence at time k; q(k) is the state noise at time k, which obeys a Gaussian distribution, with covariance matrix D(k); r(k) is the measurement noise at time k, which obeys a Gaussian distribution, with covariance matrix L(k); w(k) is the nonstationary noise at time k, which obeys a Laplacian distribution and is sparse; A = x(k-1)^T = [s(k-p) ... s(k-1)] is the measurement matrix; y(k) is the speech signal measurement sequence at time k; the statistical properties of the state noise q(k) and the measurement noise r(k) are:

E(q(k)) = d, E(r(k)) = l

E(q(k) q(j)^T) = D δ_kj, E(r(k) r(j)^T) = L δ_kj   (10)

Wherein, d and l are the means of the noises q(k) and r(k), respectively; D and L are the covariances of the noises q(k) and r(k), respectively; δ_kj is the Kronecker function;

To estimate the sparse noise conveniently, the Kalman filtering problem needs to be reconstructed from the perspective of convex optimization; the state-space model of traditional Kalman filtering, without the nonstationary noise w(k), is as follows:

θ(k) = θ(k-1) + q(k)

y(k) = A θ(k) + r(k)   (11)

According to Bayes' rule, the AR parameter estimation problem is expressed as estimating the optimal AR parameter sequence θ(k) given the measurement data y(k), that is:

p(\theta(k) | y(k)) = \frac{p(y(k) | \theta(k)) \, p(\theta(k))}{p(y(k))} \qquad (12)

\begin{aligned} l_{1}(y(k), \theta(k)) &= \frac{p(\theta(k)) \, p(r(k))}{p(\theta(k))} = p(r(k)) \\ &= \frac{1}{(\sqrt{2\pi})^{m} |L|^{1/2}} \exp\left(-\frac{1}{2} r^{T}(k) L^{-1} r(k)\right) \end{aligned} \qquad (13)

\begin{aligned} l_{2}(\theta(k)) &= p(\theta(k)) \\ &= \frac{1}{(\sqrt{2\pi})^{n} |\Psi(k)|^{1/2}} \exp\left(-\frac{1}{2} (\theta(k) - \hat{\theta}(k|k-1))^{T} \Psi(k)^{-1} (\theta(k) - \hat{\theta}(k|k-1))\right) \end{aligned} \qquad (14)

Wherein, Ψ is the covariance matrix of the conditional probability p(θ(k) | y(k)), Ψ(k) = P_θ(k|k) + D(k), where P_θ(k|k) is the updated covariance; when the likelihood functions l_1(y(k), θ(k)) and l_2(θ(k)) attain their maximum, the conditional probability p(θ(k) | y(k)) attains its optimal estimate; from formulas (13) and (14), maximizing l_1(y(k), θ(k)) and l_2(θ(k)) is equivalent to minimizing the exponents r^T(k) L^{-1} r(k) and (θ(k) - \hat{θ}(k|k-1))^T Ψ(k)^{-1} (θ(k) - \hat{θ}(k|k-1)) in the likelihood functions; this gives the following optimization form:

\begin{matrix} \text{minimize} & r^{T}(k) L^{-1} r(k) + (\theta(k) - \hat{\theta}(k|k-1))^{T} \Psi(k)^{-1} (\theta(k) - \hat{\theta}(k|k-1)) \\ \text{subject to} & y(k) = A \theta(k) + r(k) \end{matrix} \qquad (15)

Wherein, θ(k) and r(k) are the optimization variables, and Ψ(k) = P_θ(k|k) + D(k) is the covariance matrix of the Gaussian noise; the estimate of θ(k) is \hat{θ}(k|k), and r(k) is the estimate of the Gaussian noise; P_θ(k|k) is the covariance update matrix:

P_θ(k|k) = (I - K_θ(k) A(k)) P_θ(k|k-1)   (16)

P_θ(k|k-1) is the covariance prediction matrix:

P_θ(k|k-1) = P_θ(k-1|k-1) + D(k-1)   (17)

K_θ(k) is the gain matrix:

K_θ(k) = P_θ(k|k-1) A^T (A P_θ(k|k-1) A^T + L(k-1))^{-1}   (18)

Nonstationary noise obeys a Laplacian distribution and is sparse, and the core idea of nonstationary noise estimation is to exploit this sparsity; after step 4.2) converts the traditional Kalman filtering problem into a convex optimization problem, a sparsity constraint on the nonstationary noise w(k) can be added to the optimization to complete the estimation of the sparse noise; the new optimization form is:

\begin{matrix} \text{minimize} & r^{T}(k) L^{-1} r(k) + (\theta(k) - \hat{\theta}(k|k-1))^{T} \Psi(k)^{-1} (\theta(k) - \hat{\theta}(k|k-1)) + \lambda \| w(k) \|_{1} \\ \text{subject to} & y(k) = A \theta(k) + r(k) + w(k) \end{matrix} \qquad (19)

Wherein, w(k) is the sparse noise; solving the above optimization problem yields the optimal estimate θ(k) of the AR parameters, which corresponds to the optimal state estimate \hat{θ}(k|k) in traditional Kalman filtering; the optimization problem in formula (19) is a convex optimization problem and can be solved with the interior point method used in engineering;

5) Estimate the speech signal state sequence

To estimate the sparse noise conveniently, the Kalman filtering problem needs to be reconstructed from the perspective of convex optimization; the state-space model of traditional Kalman filtering is as follows:

x(k) = F x(k-1) + p(k)   (20)

y(k) = C x(k) + n(k)   (21)

According to Bayes' rule, the Kalman filtering problem is expressed as estimating the optimal speech state sequence x(k) given the measurement data y(k), that is:

p(x(k) | y(k)) = \frac{p(y(k) | x(k)) \, p(x(k))}{p(y(k))} \qquad (22)

\begin{aligned} l_{1}(y(k), x(k)) &= \frac{p(x(k)) \, p(n(k))}{p(x(k))} = p(n(k)) \\ &= \frac{1}{(\sqrt{2\pi})^{m} |R|^{1/2}} \exp\left(-\frac{1}{2} n^{T}(k) R^{-1} n(k)\right) \end{aligned} \qquad (23)

\begin{aligned} l_{2}(x(k)) &= p(x(k)) \\ &= \frac{1}{(\sqrt{2\pi})^{n} |\Theta|^{1/2}} \exp\left(-\frac{1}{2} (x(k) - \hat{x}(k|k-1))^{T} \Theta^{-1} (x(k) - \hat{x}(k|k-1))\right) \end{aligned} \qquad (24)

Wherein, Θ is the covariance matrix of the conditional probability p(x(k) | y(k-1)), Θ = F P(k-1|k-1) F^T + Q(k-1), where P(k-1|k-1) is the updated covariance; when the likelihood functions l_1(y(k), x(k)) and l_2(x(k)) attain their maximum, the conditional probability p(x(k) | y(k)) attains its optimal estimate; from formulas (23) and (24), maximizing l_1(y(k), x(k)) and l_2(x(k)) is equivalent to minimizing the exponents n^T(k) R^{-1} n(k) and (x(k) - \hat{x}(k|k-1))^T Θ^{-1} (x(k) - \hat{x}(k|k-1)) in the likelihood functions; this gives the following optimization form:

\begin{matrix} \text{minimize} & n^{T}(k) R^{-1} n(k) + (x(k) - \hat{x}(k|k-1))^{T} \Theta^{-1} (x(k) - \hat{x}(k|k-1)) \\ \text{subject to} & y(k) = C x(k) + n(k) \end{matrix} \qquad (25)

Wherein, x(k) and n(k) are the optimization variables, and Θ is the covariance matrix of the Gaussian noise; the estimate of x(k) is \hat{x}(k|k), and n(k) is the estimate of the Gaussian noise;

P(k|k) is the covariance update matrix:

P(k|k) = (I - K(k) C(k)) P(k|k-1)   (26)

P(k|k-1) is the covariance prediction matrix:

P(k|k-1) = F(k-1) P(k-1|k-1) F(k-1)^T + Q(k-1)   (27)

K(k) is the gain matrix:

K(k) = P(k|k-1) C^T (C P(k|k-1) C^T + R(k-1))^{-1}   (28)

The core idea of sparse noise estimation is to exploit the sparsity of the noise; after step 5.1) converts the traditional Kalman filtering problem into a convex optimization problem, a sparsity constraint on the sparse noise v(k) can be added to the optimization to complete the estimation of the sparse noise; the new optimization form is:

\begin{matrix} \text{minimize} & n^{T}(k) R^{-1} n(k) + (x(k) - \hat{x}(k|k-1))^{T} \Theta^{-1} (x(k) - \hat{x}(k|k-1)) + \lambda \| v(k) \|_{1} \\ \text{subject to} & y(k) = C x(k) + n(k) + v(k) \end{matrix} \qquad (29)

Wherein, v(k) is the sparse noise; solving the above optimization problem yields the optimal estimate x(k) of the speech signal state, which corresponds to the optimal state estimate \hat{x}(k|k) in traditional Kalman filtering; the optimization problem in formula (29) is a convex optimization problem and can be solved with the interior point method used in engineering;

5.3) After the enhancement of the speech signal at time k is completed, the enhancement result \hat{x}(k|k) is returned to step 4) to update the AR parameters θ(k+1) at time k+1; the speech enhancement at time k+1 then continues, estimating x(k+1), until all speech signals have been processed.