CN106340304A - Online speech enhancement method for non-stationary noise environment - Google Patents

Online speech enhancement method for non-stationary noise environment

Info

Publication number
CN106340304A
CN106340304A (application CN201610843483.0A)
Authority
CN
China
Prior art keywords
noise
theta
estimation
parameter
state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610843483.0A
Other languages
Chinese (zh)
Other versions
CN106340304B (en)
Inventor
冯宝
张绍荣
孙山林
郑伟
张国宁
武博
韦周耀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guilin University of Aerospace Technology
Original Assignee
Guilin University of Aerospace Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guilin University of Aerospace Technology filed Critical Guilin University of Aerospace Technology
Priority to CN201610843483.0A priority Critical patent/CN106340304B/en
Publication of CN106340304A publication Critical patent/CN106340304A/en
Application granted granted Critical
Publication of CN106340304B publication Critical patent/CN106340304B/en
Expired - Fee Related
Anticipated expiration

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L21/0216: Noise filtering characterised by the method used for estimating noise
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L21/0264: Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)
  • Measurement Of Velocity Or Position Using Acoustic Or Ultrasonic Waves (AREA)

Abstract

The invention provides an online speech enhancement method for a non-stationary noise environment. The method comprises the steps of (1) establishing a system model in a non-stationary noise environment, (2) framing and windowing, (3) carrying out system initialization, (4) estimating the AR parameters, and (5) estimating the speech signal state sequence. To address the problem that the AR parameters of the speech model cannot be updated in real time as the noise changes, the invention puts forward a dual Kalman filtering framework: two Kalman filters run in parallel, the speech signal state estimate and the AR parameter estimate update each other, and the data estimation and parameter estimation processes alternate, so that the parameter estimation can adapt to the changing noise, the accuracy of the system model is improved, and the speech enhancement performance is enhanced. To address the problem that the traditional Kalman filtering algorithm cannot handle non-stationary noise, an improved Kalman filtering framework is put forward in combination with a convex optimization technique, so that Gaussian noise and non-stationary noise can be estimated accurately and the accuracy of speech enhancement is improved.

Description

Online voice enhancement method suitable for non-stationary noise environment
Technical Field
The invention relates to the field of voice enhancement, in particular to an online voice enhancement method suitable for a non-stationary noise environment.
Background
In speech recognition front-end processing, speech signals are constantly interfered with and submerged by various noises, and because this interference is random, only signal processing techniques can enhance the speech quality as far as possible. The main purpose of speech enhancement is to extract the clean original speech from noisy speech.
The commonly used speech enhancement algorithms are mainly the following:
1. Noise cancellation. The noise component is subtracted directly from the noisy speech in the time or frequency domain. The main characteristic of this method is that a background signal is required as a reference signal, and whether this reference signal is accurate directly determines the performance of the method.
2. Harmonic enhancement. Voiced speech has obvious periodicity, which appears in the frequency domain as a series of peaks corresponding to the fundamental frequency (pitch) and its harmonics; these components carry most of the energy of speech. This periodicity can be exploited for speech enhancement: a comb filter extracts the pitch and its harmonic components, thereby suppressing other periodic noise and aperiodic broadband noise.
3. Enhancement algorithms based on a speech generation model. The speech production process can be modeled as a linear time-varying filter, with different excitation sources for different types of speech. Among the generative models of speech, the all-pole model is the most widely used. Based on the speech generation model, a series of speech enhancement algorithms can be derived, such as time-varying Wiener filtering and Kalman filtering methods.
4. Enhancement algorithms based on short-time spectral estimation. These come in many variants, such as spectral subtraction, Wiener filtering and minimum mean square error methods. Their advantages include a wide usable range of signal-to-noise ratios, simplicity and ease of real-time processing.
5. Wavelet decomposition methods, which developed along with wavelet decomposition as a mathematical analysis tool and combine some basic principles of spectral subtraction.
6. Auditory masking methods, which are enhancement algorithms exploiting the auditory properties of the human ear.
Speech enhancement algorithms based on Kalman filtering belong to the third category above. Conventional Kalman filtering relies on two important assumptions when performing speech enhancement: both the process noise and the measurement noise follow Gaussian distributions. Traditional Kalman filtering therefore has the following limitations in practical speech enhancement. First, the estimation of the AR parameters must be accurate; in a real speech acquisition environment, however, the noise changes constantly, so the AR parameters of the speech model must be estimated in real time and the various noises must be taken into account during that estimation, otherwise the speech enhancement performance degrades. Second, the traditional Kalman filtering algorithm considers only Gaussian noise and is therefore ill-suited to practical applications: the speech acquisition process can be contaminated by non-stationary noise (sparse, following a Laplacian distribution), which is not common but does exist and strongly affects speech quality. If this non-stationary noise is treated as Gaussian noise during speech enhancement, the enhancement quality drops severely, which is detrimental to subsequent speech semantic recognition.
Based on the above problems, it is necessary to provide an online speech enhancement technique that can handle both Gaussian noise and non-stationary noise in real time.
Disclosure of Invention
The technical problem the invention aims to solve is that existing Kalman filtering methods can neither update the AR parameters of the speech model in real time nor handle the non-stationary noise present in the measurement process. By combining a convex optimization technique, the invention provides an online speech enhancement method suitable for a non-stationary noise environment, so that both the AR parameters and the non-stationary noise can be estimated online.
To achieve this purpose, the technical scheme provided by the invention is as follows. An online speech enhancement method suitable for a non-stationary noise environment comprises the following steps:
1) establishing a system model in a non-stationary noise environment
1.1) establishing an autoregressive AR model under the condition that Gaussian noise and sparse noise coexist
The generation process of the speech signal is an autoregressive process excited by white noise and output by an all-pole linear system, namely the current output is equal to the weighted sum of the excitation signal at the current moment and the outputs at p past moments, which is an autoregressive AR model and is expressed as follows:
s(k) = \sum_{i=1}^{p} a_i s(k-i) + u(k)    (1)
where u(k) is the Gaussian white noise excitation at time k; s(k-i) is the speech signal at time (k-i); s(k) is the speech signal at time k; a_i is the i-th linear prediction coefficient, also called an AR model parameter; p is the order of the AR model;
establishing a voice signal model conforming to an actual measurement process, wherein the voice signal measurement process is described as follows:
Y(k)=s(k)+n(k)+v(k) (2)
wherein Y (k) is a measurement sequence of the voice signal at time k; s (k) is a speech signal at time k; n (k) is white Gaussian noise at time k; v (k) is non-stationary noise at the moment k, obeys Laplace distribution and has sparsity;
1.2) establishing a speech signal state space model
Converting equations (1) and (2) into a state space model, described as follows:
X(k)=FX(k-1)+p(k) (3)
Y(k)=CX(k)+n(k)+v(k) (4)
wherein,
F = \begin{bmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \\ a_p(k) & a_{p-1}(k) & a_{p-2}(k) & \cdots & a_1(k) \end{bmatrix}    (5)
C = [0 0 ... 0 1]    (6)
X(k) = [s(k-p+1) ... s(k)]^T    (7)
in the speech signal state equation (3) and the speech signal measurement equation (4), X(k) is the speech signal state estimation sequence at time k, that is, the optimal state estimate of the speech signal; X(k-1) is the speech signal state estimation sequence at time (k-1); Y(k) is the measurement sequence of the speech signal at time k; F is the state transition matrix formed by the linear prediction coefficients, and the last row of F, [a_p(k) ... a_1(k)], is referred to as the AR parameters; C = [0 0 ... 0 1] is the measurement transfer matrix; p(k) is the state noise at time k, which follows a Gaussian distribution; n(k) is the measurement noise at time k, which follows a Gaussian distribution; v(k) is the non-stationary noise at time k, which follows a Laplacian distribution;
the statistical properties of the state noise p(k) and the measurement noise n(k) are:
E(p(k)) = q,  E(n(k)) = r
E(p(k)p(j)^T) = Q δ_{kj},  E(n(k)n(j)^T) = R δ_{kj}    (8)
where q and r are the means of the noises p(k) and n(k), respectively; Q and R are the covariances of p(k) and n(k), respectively; δ_{kj} is the Kronecker delta; the speech enhancement problem is to estimate the optimal speech signal X(k) on the premise that the measured speech signal Y(k) is known;
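For illustration, the companion-form matrices in equations (3)-(7) can be assembled directly from a set of linear prediction coefficients. The following is a minimal sketch assuming NumPy; the function name build_state_space and its argument are illustrative and not part of the claimed method.

```python
import numpy as np

def build_state_space(ar_coeffs):
    """Assemble F (eq. 5) and C (eq. 6) from AR coefficients [a_1, ..., a_p]."""
    p = len(ar_coeffs)
    F = np.zeros((p, p))
    F[:-1, 1:] = np.eye(p - 1)       # shifted identity: rows 1..p-1 of eq. (5)
    F[-1, :] = ar_coeffs[::-1]       # last row: [a_p, a_{p-1}, ..., a_1]
    C = np.zeros((1, p))
    C[0, -1] = 1.0                   # measurement picks out the newest sample s(k)
    return F, C

# The state X(k) of eq. (7) stacks the last p samples: [s(k-p+1), ..., s(k)]^T.
```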
2) framing and windowing
A speech signal is short-time stationary and can be considered unchanged within 10-30 ms, so it can be divided into short segments for processing, i.e. frames; framing is realized by weighting the signal with a finite-length moving window; the number of frames per second is usually 33-100, the framing method is overlapped segmentation, the overlapping part of consecutive frames is called the frame shift, and the ratio of frame shift to frame length is 0-0.5;
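As an illustration of step 2), framing and windowing can be realized as below. This is a minimal sketch assuming NumPy and a Hamming window, with the 25 ms frame length and 10 ms frame shift used later in the embodiment; the function name and arguments are illustrative.

```python
import numpy as np

def frame_signal(x, fs, frame_ms=25, shift_ms=10):
    """Split signal x (sampled at fs Hz) into overlapping, windowed frames.

    Assumes len(x) is at least one frame long; frame/shift values follow the
    embodiment (25 ms frames, 10 ms shift, i.e. a shift/length ratio of 0.4).
    """
    frame_len = int(fs * frame_ms / 1000)
    shift = int(fs * shift_ms / 1000)
    window = np.hamming(frame_len)               # finite-length moving window
    n_frames = 1 + (len(x) - frame_len) // shift
    return np.stack([x[i * shift:i * shift + frame_len] * window
                     for i in range(n_frames)])
```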
3) system initialization
3.1) improved Kalman Filter parameter initialization
Initializing a speech signal state estimation sequence X (0/0) and a covariance matrix P (0/0), and ensuring that the covariance matrix is positive definite;
3.2) AR parameter initialization
Initializing an AR parameter state estimation sequence θ (0/0);
4) estimating AR parameters
The AR parameters are the last row [a_p(k) ... a_1(k)] of the state transition matrix F in equation (3); they mainly describe the speech generation process, and their accuracy has a direct influence on the speech enhancement result; the method proposes that the speech signal state estimation sequence X(k-1), the state noise q(k), the measurement noise n(k) and the non-stationary noise v(k) are all taken into account in the estimation of the AR parameters, and a new AR parameter estimation state space model is established to realize online robust estimation of the AR parameters; the real-time estimation process of the AR parameters is as follows:
4.1) establishing a parameter estimation model of the AR parameters
The AR parameter model under the environment mixed by Gaussian noise and non-stationary noise is described as follows:
θ(k)=θ(k-1)+q(k)
Y(k)=Aθ(k)+r(k)+w(k) (9)
where θ(k) = [a_p(k) ... a_1(k)]^T is the AR parameter state sequence at time k; q(k) is the state noise at time k, which follows a Gaussian distribution with covariance matrix D(k); r(k) is the measurement noise at time k, which follows a Gaussian distribution with covariance matrix L(k); w(k) is the non-stationary measurement noise at time k, which follows a Laplacian distribution and is sparse; A = X(k-1)^T = [s(k-p) ... s(k-1)] is the measurement matrix; Y(k) is the measurement sequence of the speech signal at time k; the statistical properties of the state noise q(k) and the measurement noise r(k) are:
E(q(k)) = d,  E(r(k)) = l
E(q(k)q(j)^T) = D δ_{kj},  E(r(k)r(j)^T) = L δ_{kj}    (10)
where d and l are the means of the noises q(k) and r(k), respectively; D and L are the covariances of q(k) and r(k), respectively; δ_{kj} is the Kronecker delta;
4.2) reconstructing the conventional Kalman filtering problem from a convex optimization perspective
In order to conveniently estimate sparse noise, the kalman filtering problem needs to be reconstructed from the perspective of convex optimization, and a state space model of the conventional kalman filtering does not contain non-stationary noise w (k), as follows:
θ(k)=θ(k-1)+q(k)
Y(k) = Aθ(k) + r(k)    (11)
according to the bayesian principle, the AR parameter estimation problem is expressed as estimating an optimal AR parameter sequence θ (k) on the premise that the measured data y (k) is known, that is:
p(\theta(k) | Y(k)) = \frac{p(Y(k) | \theta(k))\, p(\theta(k))}{p(Y(k))}    (12)
establishing a likelihood function of p (Y (k) | theta (k)) and p (theta (k)) according to the maximum likelihood estimation theory:
L_1(Y(k), \theta(k)) = p(Y(k) | \theta(k)) = p(r(k)) = \frac{1}{\sqrt{(2\pi)^m |L|}} \exp\left(-\frac{1}{2} r^T(k) L^{-1} r(k)\right)    (13)
L_2(\theta(k)) = p(\theta(k)) = \frac{1}{\sqrt{(2\pi)^n |\Psi(k)|}} \exp\left(-\frac{1}{2} (\theta(k) - \hat{\theta}(k|k-1))^T \Psi(k)^{-1} (\theta(k) - \hat{\theta}(k|k-1))\right)    (14)
where Ψ(k) = P_θ(k|k) + D(k) is the covariance matrix of the conditional probability p(θ(k) | Y(k)), and P_θ(k|k) is the covariance update value; when the likelihood functions L_1(Y(k), θ(k)) and L_2(θ(k)) attain their maxima, the conditional probability p(θ(k) | Y(k)) yields the optimal estimate; inspection of equations (13) and (14) shows that maximizing L_1(Y(k), θ(k)) and L_2(θ(k)) is equivalent to minimizing the quadratic exponents r^T(k) L^{-1} r(k) and (θ(k) - \hat{\theta}(k|k-1))^T Ψ(k)^{-1} (θ(k) - \hat{\theta}(k|k-1)), which gives the following optimization form:
minimize    r^T(k) L^{-1} r(k) + (\theta(k) - \hat{\theta}(k|k-1))^T \Psi(k)^{-1} (\theta(k) - \hat{\theta}(k|k-1))
subject to  Y(k) = A\theta(k) + r(k)    (15)
where θ(k) and r(k) are the optimization variables and Ψ(k) = P_θ(k|k) + D(k) is the covariance matrix of the Gaussian noise; the value of θ(k) that solves this problem is the updated estimate of the AR parameters, and the value of r(k) is the estimate of the Gaussian noise; P_θ(k|k) is the covariance update matrix:
P_θ(k|k) = (I - K_θ(k)A(k)) P_θ(k|k-1)    (16)
P_θ(k|k-1) is the covariance prediction matrix:
P_θ(k|k-1) = P_θ(k-1|k-1) + D(k-1)    (17)
K_θ(k) is the filter gain:
K_θ(k) = P_θ(k|k-1) A^T (A P_θ(k|k-1) A^T + L(k-1))^{-1}    (18)
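A minimal NumPy sketch of the covariance recursion (16)-(18) of the AR parameter filter follows; the argument names (P_prev, D, L_cov) are assumptions, and A denotes the 1 x p measurement row X(k-1)^T.

```python
import numpy as np

def ar_covariance_step(P_prev, A, D, L_cov):
    """One pass of eqs. (16)-(18): predict, compute gain, update covariance."""
    P_pred = P_prev + D                                     # eq. (17)
    S = A @ P_pred @ A.T + L_cov                            # innovation covariance
    K = P_pred @ A.T @ np.linalg.inv(S)                     # eq. (18)
    P_upd = (np.eye(P_prev.shape[0]) - K @ A) @ P_pred      # eq. (16)
    return P_pred, P_upd, K
```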
4.3) constructing an optimization problem for non-stationary noise estimation from a convex optimization perspective
The non-stationary noise follows a Laplacian distribution and is sparse. The core idea of non-stationary noise estimation is to exploit this sparsity: after the traditional Kalman filtering problem has been converted into a convex optimization problem in step 4.2), the estimation of the sparse noise is completed by adding a sparsity constraint on the non-stationary noise w(k) to the optimization, giving the new optimization form:
minimize    r^T(k) L^{-1} r(k) + (\theta(k) - \hat{\theta}(k|k-1))^T \Psi(k)^{-1} (\theta(k) - \hat{\theta}(k|k-1)) + \lambda \|w(k)\|_1
subject to  Y(k) = A\theta(k) + r(k) + w(k)    (19)
where w(k) is the sparse noise; solving this optimization problem yields the optimal estimate of the AR parameters θ(k); the optimization problem in (19) is convex and can be solved in practice with an interior-point method;
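The l1-regularized problem (19) can be posed directly in an off-the-shelf convex solver. The sketch below uses CVXPY as one possible tool (the text only requires an interior-point method); the function name, argument names and the default weight lam are illustrative assumptions, with Psi and L_cov standing for the Gaussian covariances Ψ(k) and L passed as arrays.

```python
import numpy as np
import cvxpy as cp

def estimate_ar_parameters(y_k, A, theta_pred, Psi, L_cov, lam=1.0):
    """Solve optimization (19) for theta(k), the Gaussian noise r(k) and the
    sparse noise w(k), given the scalar measurement y_k and A = X(k-1)^T."""
    p = theta_pred.shape[0]
    theta = cp.Variable(p)
    r = cp.Variable(1)                      # Gaussian measurement noise
    w = cp.Variable(1)                      # sparse (Laplacian) noise
    L_inv = np.linalg.inv(L_cov); L_inv = 0.5 * (L_inv + L_inv.T)       # symmetrize
    Psi_inv = np.linalg.inv(Psi); Psi_inv = 0.5 * (Psi_inv + Psi_inv.T)
    cost = (cp.quad_form(r, L_inv)
            + cp.quad_form(theta - theta_pred, Psi_inv)
            + lam * cp.norm1(w))
    constraints = [y_k == A @ theta + r + w]
    cp.Problem(cp.Minimize(cost), constraints).solve()
    return theta.value, r.value, w.value
```

In such a setup the returned w estimate is nonzero only at the instants hit by non-stationary bursts, which is exactly the behaviour the l1 term encourages.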
5) estimating a speech signal state sequence
5.1) reconstructing the conventional Kalman filtering problem from a convex optimization perspective
In order to conveniently estimate sparse noise, the kalman filtering problem needs to be reconstructed from the perspective of convex optimization, and a state space model of the conventional kalman filtering is as follows:
X(k)=FX(k-1)+p(k) (20)
Y(k)=CX(k)+n(k) (21)
according to the bayesian principle, the kalman filtering problem is expressed as estimating an optimal speech state sequence x (k) on the premise that the measured data y (k) is known, that is:
p(X(k) | Y(k)) = \frac{p(Y(k) | X(k))\, p(X(k))}{p(Y(k))}    (22)
establishing a likelihood function of p (Y (k) | X (k)) and p (X (k)) according to the maximum likelihood estimation theory:
L_1(Y(k), X(k)) = p(Y(k) | X(k)) = p(n(k)) = \frac{1}{\sqrt{(2\pi)^m |R|}} \exp\left(-\frac{1}{2} n^T(k) R^{-1} n(k)\right)    (23)
L_2(X(k)) = p(X(k)) = \frac{1}{\sqrt{(2\pi)^n |\Theta|}} \exp\left(-\frac{1}{2} (X(k) - \hat{X}(k|k-1))^T \Theta^{-1} (X(k) - \hat{X}(k|k-1))\right)    (24)
where Θ = F P(k-1|k-1) F^T + Q(k-1) is the covariance matrix of the conditional probability p(X(k) | Y(k-1)), and P(k-1|k-1) is the covariance update value; when the likelihood functions L_1(Y(k), X(k)) and L_2(X(k)) attain their maxima, the conditional probability p(X(k) | Y(k)) yields the optimal estimate; inspection of equations (23) and (24) shows that maximizing L_1(Y(k), X(k)) and L_2(X(k)) is equivalent to minimizing the quadratic exponents n^T(k) R^{-1} n(k) and (X(k) - \hat{X}(k|k-1))^T Θ^{-1} (X(k) - \hat{X}(k|k-1)), which gives the following optimization form:
minimize    n^T(k) R^{-1} n(k) + (X(k) - \hat{X}(k|k-1))^T \Theta^{-1} (X(k) - \hat{X}(k|k-1))
subject to  Y(k) = CX(k) + n(k)    (25)
where X(k) and n(k) are the optimization variables and Θ is the covariance matrix of the Gaussian noise; the value of X(k) that solves this problem is the updated state estimate, and the value of n(k) is the estimate of the Gaussian noise;
P(k|k) is the covariance update matrix:
P(k|k) = (I - K(k)C) P(k|k-1)    (26)
P(k|k-1) is the covariance prediction matrix:
P(k|k-1) = F(k-1) P(k-1|k-1) F(k-1)^T + Q(k-1)    (27)
K(k) is the filter gain:
K(k) = P(k|k-1) C^T (C P(k|k-1) C^T + R(k-1))^{-1}    (28)
5.2) constructing the estimation problem of sparse noise from the convex optimization angle
The core idea of sparse noise estimation is to exploit the sparsity of the noise: after the traditional Kalman filtering problem has been converted into a convex optimization problem in step 5.1), the estimation of the sparse noise is completed by adding a sparsity constraint on the sparse noise v(k) to the optimization:
minimize    n^T(k) R^{-1} n(k) + (X(k) - \hat{X}(k|k-1))^T \Theta^{-1} (X(k) - \hat{X}(k|k-1)) + \lambda \|v(k)\|_1
subject to  Y(k) = CX(k) + n(k) + v(k)    (29)
where v(k) is the sparse noise; solving this optimization problem yields the optimal estimate of the speech signal state X(k), which corresponds to the optimal state estimate of traditional Kalman filtering; the optimization problem in (29) is convex and can be solved in practice with an interior-point method;
5.3) after the enhancement of the speech signal at time k is finished, the enhanced state estimate X(k) is returned to step 4) to update the AR parameter θ(k+1) at time k+1, and speech enhancement then continues at time k+1 to estimate X(k+1), until all speech signals have been processed.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. Aiming at the problem that the AR parameters in the speech model (specifically an autoregressive AR model) cannot be updated in real time as the noise changes, the invention provides a dual Kalman filtering framework: two Kalman filters operate in parallel, the speech signal state estimate and the AR parameter estimate update each other, and the state estimation and parameter estimation processes alternate, so that the parameter estimation can adapt to the changing noise, the accuracy of the system model is improved, and the speech enhancement performance is improved.
2. Aiming at the problem that the traditional Kalman filtering algorithm cannot handle non-stationary noise, the invention provides an improved Kalman filtering framework combined with a convex optimization technique. The new algorithm adds Gaussian noise and non-stationary noise terms to the measurement process of the speech enhancement model, and by establishing a reasonable optimization model with the convex optimization technique, the Gaussian noise and the non-stationary noise can be estimated accurately, improving the accuracy of speech enhancement.
Drawings
FIG. 1 is a flow chart of a method of speech enhancement under non-stationary noise.
FIG. 2a is a diagram of an original speech signal.
FIG. 2b is a diagram of a speech signal with white Gaussian noise.
FIG. 2c is a diagram of a speech signal with white Gaussian noise and non-stationary noise.
FIG. 3 is a flow chart of a speech enhancement algorithm based on dual modified Kalman filtering.
Fig. 4a is an original speech signal.
FIG. 4b is a diagram illustrating the speech enhancement result.
Detailed Description
The present invention will be further described with reference to the following specific examples.
As shown in fig. 1, the online speech enhancement method applicable to a non-stationary noise environment according to this embodiment includes the following steps:
1) establishing a system model in a non-stationary noise environment
1.1) establishing an autoregressive AR model under the condition that Gaussian noise and sparse noise coexist
The generation process of a speech signal can be described as an autoregressive process excited by white noise and output through an all-pole linear system, i.e. the current output equals the weighted sum of the excitation signal at the current moment and the outputs at the past p moments; this is the autoregressive AR model, expressed as follows:
s(k) = \sum_{i=1}^{p} a_i s(k-i) + u(k)    (1)
where u(k) is the Gaussian white noise excitation at time k; s(k-i) is the speech signal at time (k-i); s(k) is the speech signal at time k; a_i is the i-th linear prediction coefficient, also called an AR model parameter; p is the order of the AR model.
As shown in fig. 2a, 2b, and 2c, a speech signal observed in a real environment is polluted by various noises, especially non-stationary noises. The speech signal measurement process of the present invention can be described as follows:
Y(k)=s(k)+n(k)+v(k) (2)
wherein Y (k) is a measurement sequence of the voice signal at time k; s (k) is a speech signal at time k; n (k) is white Gaussian noise at time k; v (k) is non-stationary noise at the time k, follows Laplace distribution, and has sparsity.
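To make the measurement model (2) concrete, the snippet below synthesizes an AR-generated signal as in equation (1) and contaminates it with Gaussian noise plus sparse, Laplacian-distributed bursts, in the spirit of figs. 2a-2c. The AR coefficients, noise scales and burst count are arbitrary illustrative values, not data from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 400, 2
a = np.array([1.3, -0.6])                 # illustrative stable AR(2) coefficients [a_1, a_2]
s = np.zeros(N)
for k in range(p, N):                     # eq. (1): s(k) = sum_i a_i s(k-i) + u(k)
    s[k] = a @ s[k - p:k][::-1] + rng.normal(scale=0.1)

n = rng.normal(scale=0.05, size=N)        # Gaussian measurement noise n(k)
v = np.zeros(N)                           # sparse non-stationary noise v(k)
burst = rng.choice(N, size=8, replace=False)
v[burst] = rng.laplace(scale=0.5, size=8)
Y = s + n + v                             # eq. (2): the observed noisy speech
```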
1.2) establishing a speech signal state space model
Converting equations (1) and (2) into a state space model, the following can be described:
X(k)=FX(k-1)+p(k) (3)
Y(k)=CX(k)+n(k)+v(k) (4)
wherein
F = 0 1 0 ... 0 0 0 1 ... 0 ... ... ... ... ... 0 0 0 ... 1 a p ( k ) a p - 1 ( k ) a p - 2 ( k ) a 1 ( k ) - - - ( 5 )
C=[0 0 … 0 1](6)
X(k)=[S(k-p+1) … S(k)]T(7)
In the speech signal state equation (3) and the speech signal measurement equation (4), x (k) is a speech signal state estimation sequence at the time k, that is, an optimal state estimation of a speech signal; x (k-1) is a speech signal state estimation sequence at the (k-1) moment; y (k) is a measurement sequence of the speech signal at time k; f is a state transition matrix formed by linear prediction coefficients, and the last row [ a ] in Fp(k)… a1(k)]Referred to as AR parameters. (ii) a C ═ 00 … 01]Is a measurement transfer matrix; p (k) is state noise at time k, obeying Gaussian distribution; n (k) is the measurement noise at time k, and follows Gaussian distribution; v (k) is the non-stationary noise at time k, obeying the laplacian distribution.
The state of the speech signal and the statistical properties of the measured noise p (k) and n (k) are:
E(p(k))=q,E(n(k))=r
E(p(k)p(j)T)=Qkj,E(n(k)n(j)T)=Rkj(8)
wherein q and r are mean values of noise p (k) and n (k), respectively; q and R are the covariance of the noise p (k) and n (k), respectively.kjAs a function of Kronecker. The speech enhancement problem is to estimate the optimal speech signal x (k) given the measured speech signal y (k).
2) Framing and windowing
The voice signal has short-time stationarity (the voice signal can be considered to be approximately unchanged within 10-30 ms), so that the voice signal can be divided into a plurality of short sections for processing, namely framing, and framing of the voice signal is realized by adopting a movable window with limited length for weighting. The number of frames per second is generally about 33 to 100 frames. A common framing method is an overlapping segmentation method, the overlapping part of a previous frame and a next frame is called frame shift, and the ratio of the frame shift to the frame length is generally 0-0.5. In the invention, the frame length is 25ms, and the frame shift is 10 ms.
3) System initialization
3.1) improved Kalman Filter parameter initialization
The speech signal state estimation sequence X(0/0) and the covariance matrix P(0/0) are initialized, ensuring that the covariance matrix is positive definite.
3.2) AR parameter initialization
The AR parameter state estimation sequence θ(0/0) is initialized; in the invention the order of the AR model is 13 (set empirically).
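A possible initialization for step 3), assuming the AR order p = 13 of the embodiment; scaled identity matrices are just one convenient way to guarantee positive definite covariances.

```python
import numpy as np

p = 13                        # AR order used in the embodiment
X0 = np.zeros(p)              # X(0/0): initial speech state estimate
theta0 = np.zeros(p)          # theta(0/0): initial AR parameter estimate
P_x0 = 1e2 * np.eye(p)        # P(0/0): positive definite state covariance
P_theta0 = 1e2 * np.eye(p)    # P_theta(0/0): positive definite parameter covariance
```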
4) Estimating AR parameters
The AR parameters are the last row [a_p(k) ... a_1(k)] of the state transition matrix F in equation (3); they mainly describe the speech generation process, and their accuracy has a direct influence on the speech enhancement result. In practical applications, AR parameter estimation is strongly affected by the speech signal and by the various noises, so the invention proposes to take the speech signal state estimation sequence X(k-1), the state noise q(k), the measurement noise n(k) and the non-stationary noise v(k) into account together when estimating the AR parameters, and to establish a new AR parameter estimation state space model; this is one core point of the invention. As shown in fig. 3, the real-time estimation process of the AR parameters is as follows:
4.1) establishing a parameter estimation model of the AR parameters
The AR parameter model under the environment mixed by Gaussian noise and non-stationary noise is described as follows:
θ(k)=θ(k-1)+q(k)
Y(k)=Aθ(k)+r(k)+w(k) (9)
where θ(k) = [a_p(k) ... a_1(k)]^T is the AR parameter state sequence at time k; q(k) is the state noise at time k, which follows a Gaussian distribution with covariance matrix D(k); r(k) is the measurement noise at time k, which follows a Gaussian distribution with covariance matrix L(k); w(k) is the non-stationary measurement noise at time k, which follows a Laplacian distribution and is sparse; A = X(k-1)^T = [s(k-p) ... s(k-1)] is the measurement matrix; Y(k) is the measurement sequence of the speech signal at time k. The statistical properties of the state noise q(k) and the measurement noise r(k) are:
E(q(k)) = d,  E(r(k)) = l
E(q(k)q(j)^T) = D δ_{kj},  E(r(k)r(j)^T) = L δ_{kj}    (10)
where d and l are the means of the noises q(k) and r(k), respectively; D and L are the covariances of q(k) and r(k), respectively. δ_{kj} is the Kronecker delta.
4.2) reconstructing the conventional Kalman filtering problem from a convex optimization perspective
In order to be able to estimate the sparse noise conveniently, the kalman filtering problem needs to be reconstructed from the perspective of convex optimization. The state space model of conventional kalman filtering (without non-stationary noise w (k)) is as follows:
θ(k)=θ(k-1)+q(k)
Y(k)=Aθ(k)+r(k) (11)
according to the bayesian principle, the AR parameter estimation problem can be expressed as estimating an optimal AR parameter sequence θ (k) on the premise that the measured data y (k) is known, that is:
p(\theta(k) | Y(k)) = \frac{p(Y(k) | \theta(k))\, p(\theta(k))}{p(Y(k))}    (12)
establishing a likelihood function of p (Y (k) | theta (k)) and p (theta (k)) according to the maximum likelihood estimation theory:
L_1(Y(k), \theta(k)) = p(Y(k) | \theta(k)) = p(r(k)) = \frac{1}{\sqrt{(2\pi)^m |L|}} \exp\left(-\frac{1}{2} r^T(k) L^{-1} r(k)\right)    (13)
L_2(\theta(k)) = p(\theta(k)) = \frac{1}{\sqrt{(2\pi)^n |\Psi(k)|}} \exp\left(-\frac{1}{2} (\theta(k) - \hat{\theta}(k|k-1))^T \Psi(k)^{-1} (\theta(k) - \hat{\theta}(k|k-1))\right)    (14)
where Ψ(k) = P_θ(k|k) + D(k) is the covariance matrix of the conditional probability p(θ(k) | Y(k)), and P_θ(k|k) is the covariance update value. When the likelihood functions L_1(Y(k), θ(k)) and L_2(θ(k)) attain their maxima, the conditional probability p(θ(k) | Y(k)) yields the optimal estimate. Inspection of equations (13) and (14) shows that maximizing L_1(Y(k), θ(k)) and L_2(θ(k)) is equivalent to minimizing the quadratic exponents r^T(k) L^{-1} r(k) and (θ(k) - \hat{\theta}(k|k-1))^T Ψ(k)^{-1} (θ(k) - \hat{\theta}(k|k-1)), which gives the following optimization form:
minimize    r^T(k) L^{-1} r(k) + (\theta(k) - \hat{\theta}(k|k-1))^T \Psi(k)^{-1} (\theta(k) - \hat{\theta}(k|k-1))
subject to  Y(k) = A\theta(k) + r(k)    (15)
where θ(k) and r(k) are the optimization variables and Ψ(k) = P_θ(k|k) + D(k) is the covariance matrix of the Gaussian noise. The value of θ(k) that solves this problem is the updated estimate of the AR parameters, and the value of r(k) is the estimate of the Gaussian noise. P_θ(k|k) is the covariance update matrix:
P_θ(k|k) = (I - K_θ(k)A(k)) P_θ(k|k-1)    (16)
P_θ(k|k-1) is the covariance prediction matrix:
P_θ(k|k-1) = P_θ(k-1|k-1) + D(k-1)    (17)
K_θ(k) is the filter gain:
K_θ(k) = P_θ(k|k-1) A^T (A P_θ(k|k-1) A^T + L(k-1))^{-1}    (18)
4.3) constructing an optimization problem for non-stationary noise estimation from a convex optimization perspective
The non-stationary noise follows a Laplacian distribution and is sparse. The core idea of non-stationary noise estimation is to exploit this sparsity: after the traditional Kalman filtering problem has been converted into a convex optimization problem in step 4.2), the estimation of the sparse noise is completed by adding a sparsity constraint on the non-stationary noise w(k) to the optimization, giving the new optimization form:
minimize    r^T(k) L^{-1} r(k) + (\theta(k) - \hat{\theta}(k|k-1))^T \Psi(k)^{-1} (\theta(k) - \hat{\theta}(k|k-1)) + \lambda \|w(k)\|_1
subject to  Y(k) = A\theta(k) + r(k) + w(k)    (19)
where w(k) is the sparse noise. Solving the above optimization problem yields the optimal estimate of the AR parameters θ(k). The optimization problem in (19) is convex and can be solved in practice with a mature interior-point method.
5) A sequence of speech signal states is estimated.
In the voice signal acquisition process, non-stationary noise has a large influence on the voice quality. In order to be able to improve speech quality, the speech enhancement algorithm must be able to cope with both gaussian and non-stationary noise mixing. The non-stationary noise generally obeys Laplace distribution and has a sparse characteristic, and the estimation of the non-stationary noise mainly utilizes the sparse characteristic of the noise. In order to introduce noise sparsity constraint in the optimization problem, firstly, the traditional Kalman filtering problem is reconstructed into a convex optimization problem by adopting a convex optimization technology, then sparsity constraint on sparse noise is introduced in newly constructed optimization, and finally, a voice enhancement task is completed, which is another core point of the invention.
5.1) reconstructing the conventional Kalman filtering problem from a convex optimization perspective
In order to be able to estimate the sparse noise conveniently, the kalman filtering problem needs to be reconstructed from the perspective of convex optimization. The state space model of the conventional kalman filter is as follows:
X(k)=FX(k-1)+p(k) (20)
Y(k)=CX(k)+n(k) (21)
according to the bayesian principle, the kalman filtering problem can be expressed as estimating an optimal speech state sequence x (k) on the premise that the measured data y (k) is known, that is:
p(X(k) | Y(k)) = \frac{p(Y(k) | X(k))\, p(X(k))}{p(Y(k))}    (22)
establishing a likelihood function of p (Y (k) | X (k)) and p (X (k)) according to the maximum likelihood estimation theory:
L_1(Y(k), X(k)) = p(Y(k) | X(k)) = p(n(k)) = \frac{1}{\sqrt{(2\pi)^m |R|}} \exp\left(-\frac{1}{2} n^T(k) R^{-1} n(k)\right)    (23)
L_2(X(k)) = p(X(k)) = \frac{1}{\sqrt{(2\pi)^n |\Theta|}} \exp\left(-\frac{1}{2} (X(k) - \hat{X}(k|k-1))^T \Theta^{-1} (X(k) - \hat{X}(k|k-1))\right)    (24)
where Θ = F P(k-1|k-1) F^T + Q(k-1) is the covariance matrix of the conditional probability p(X(k) | Y(k-1)), and P(k-1|k-1) is the covariance update value. When the likelihood functions L_1(Y(k), X(k)) and L_2(X(k)) attain their maxima, the conditional probability p(X(k) | Y(k)) yields the optimal estimate. Inspection of equations (23) and (24) shows that maximizing L_1(Y(k), X(k)) and L_2(X(k)) is equivalent to minimizing the quadratic exponents n^T(k) R^{-1} n(k) and (X(k) - \hat{X}(k|k-1))^T Θ^{-1} (X(k) - \hat{X}(k|k-1)), which gives the following optimization form:
minimize    n^T(k) R^{-1} n(k) + (X(k) - \hat{X}(k|k-1))^T \Theta^{-1} (X(k) - \hat{X}(k|k-1))
subject to  Y(k) = CX(k) + n(k)    (25)
where X(k) and n(k) are the optimization variables and Θ is the covariance matrix of the Gaussian noise. The value of X(k) that solves this problem is the updated state estimate, and the value of n(k) is the estimate of the Gaussian noise.
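The step from the likelihoods (23)-(24) to the quadratic objective in (25) follows from taking the negative logarithm of their product; collecting the normalization terms into a constant gives
-\ln\bigl(L_1(Y(k),X(k))\, L_2(X(k))\bigr) = \tfrac{1}{2}\, n^{T}(k) R^{-1} n(k) + \tfrac{1}{2}\,\bigl(X(k)-\hat{X}(k|k-1)\bigr)^{T} \Theta^{-1} \bigl(X(k)-\hat{X}(k|k-1)\bigr) + \text{const},
so maximizing L_1 L_2 is equivalent to minimizing the two quadratic terms, which is exactly the objective of (25).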
P (k | k) is the covariance update matrix:
P(k|k)=(I-K(k)C(k))P(k|k-1) (26)
p (k | k-1) is the covariance prediction matrix:
P(k|k-1)=F(k-1)P(k-1|k-1)F(k-1)T+Q(k-1) (27)
K(k) is the filter gain:
K(k)=P(k|k-1)CT(CP(k|k-1)CT+R(k-1))-1(28)
5.2) constructing the estimation problem of sparse noise from the convex optimization angle
The core idea of sparse noise estimation is to exploit the sparsity of the noise: after the traditional Kalman filtering problem has been converted into a convex optimization problem in step 5.1), the estimation of the sparse noise is completed by adding a sparsity constraint on the sparse noise v(k) to the optimization:
minimize    n^T(k) R^{-1} n(k) + (X(k) - \hat{X}(k|k-1))^T \Theta^{-1} (X(k) - \hat{X}(k|k-1)) + \lambda \|v(k)\|_1
subject to  Y(k) = CX(k) + n(k) + v(k)    (29)
where v(k) is the sparse noise. Solving this optimization problem yields the optimal estimate of the speech signal state X(k), which corresponds to the optimal state estimate of traditional Kalman filtering. The optimization problem in (29) is convex and can be solved in practice with a mature interior-point method.
5.3) After the enhancement of the speech signal at time k is finished, the enhanced state estimate X(k) is returned to step 4) to update the AR parameter θ(k+1) at time k+1, and speech enhancement then continues at time k+1 to estimate X(k+1), until all speech signals have been processed.
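Putting steps 4) and 5) together, the alternation of the two filters over one frame can be organized as in the sketch below. It is a simplified, self-contained illustration under several assumptions: a single generic robust update is reused for both convex problems (19) and (29), CVXPY is used as the solver, Ψ(k) is taken to be the prediction covariance, and the covariances and λ are arbitrary defaults; it is not the patented implementation.

```python
import numpy as np
import cvxpy as cp

def robust_update(y, H, x_pred, Sigma, R, lam=1.0):
    """Shared robust update for (19) and (29): two Gaussian quadratic terms
    plus an l1 penalty on the sparse noise, subject to the measurement."""
    x = cp.Variable(x_pred.shape[0])
    g = cp.Variable(1)                      # Gaussian measurement noise
    sp = cp.Variable(1)                     # sparse (Laplacian) noise
    R_inv = np.linalg.inv(R); R_inv = 0.5 * (R_inv + R_inv.T)
    S_inv = np.linalg.inv(Sigma); S_inv = 0.5 * (S_inv + S_inv.T)
    cost = (cp.quad_form(g, R_inv) + cp.quad_form(x - x_pred, S_inv)
            + lam * cp.norm1(sp))
    cp.Problem(cp.Minimize(cost), [y == H @ x + g + sp]).solve()
    return x.value

def dual_kalman_enhance(Y, p=13, lam=1.0):
    """Enhance one frame Y by alternating AR-parameter and state estimation."""
    D = 1e-4 * np.eye(p); L = np.array([[1e-2]])     # AR-filter covariances (assumed)
    Q = 1e-3 * np.eye(p); R = np.array([[1e-2]])     # state-filter covariances (assumed)
    theta = np.zeros(p); P_t = np.eye(p)             # theta(0|0), P_theta(0|0)
    X = np.zeros(p); P_x = np.eye(p)                 # X(0|0), P(0|0)
    C = np.zeros((1, p)); C[0, -1] = 1.0
    out = []
    for k in range(len(Y)):
        # step 4): AR-parameter filter with A = X(k-1)^T = [s(k-p) ... s(k-1)]
        A = X[np.newaxis, :]
        P_t_pred = P_t + D                           # eq. (17)
        theta = robust_update(Y[k], A, theta, P_t_pred, L, lam)
        K_t = P_t_pred @ A.T @ np.linalg.inv(A @ P_t_pred @ A.T + L)
        P_t = (np.eye(p) - K_t @ A) @ P_t_pred       # eq. (16)
        # step 5): speech-state filter with F rebuilt from the new theta
        F = np.zeros((p, p)); F[:-1, 1:] = np.eye(p - 1); F[-1, :] = theta
        X_pred = F @ X
        Theta_cov = F @ P_x @ F.T + Q                # prediction covariance
        X = robust_update(Y[k], C, X_pred, Theta_cov, R, lam)
        K = Theta_cov @ C.T @ np.linalg.inv(C @ Theta_cov @ C.T + R)
        P_x = (np.eye(p) - K @ C) @ Theta_cov        # eq. (26)
        out.append(X[-1])                            # enhanced sample s(k)
    return np.array(out)
```

For a full utterance this routine would be applied to each frame produced in step 2), with the enhanced frames recombined by overlap-add.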
As shown in fig. 4a and 4b, the method provided by the present invention can accurately filter gaussian noise and non-stationary noise, and enhance the original speech signal.
The invention can accurately estimate and filter white noise and non-stationary noise, realize voice enhancement under the mixing of the white noise and the non-stationary noise, and simultaneously provide a purer estimated voice signal and provide front-end support for improving the accuracy of voice recognition.
Because two robust Kalman filtering models are established, the generation process of the speech signal is modeled mathematically and the short-time and time-varying characteristics of speech are taken into account in a targeted manner: the AR parameter estimation is updated and iterated dynamically in real time, meeting the time-varying nature of the parameters, while the state estimation estimates the speech signal frame by frame, exploiting the short-time stationarity of speech. The filtering effect is therefore superior to that of traditional Kalman filtering, and the method is worth popularizing.
The above-mentioned embodiments are merely preferred embodiments of the present invention, and the scope of the present invention is not limited thereto, so that any changes made in the shape and principle of the present invention should be covered within the protection scope of the present invention.

Claims (1)

1. An online voice enhancement method suitable for a non-stationary noise environment, comprising the steps of:
1) establishing a system model in a non-stationary noise environment
1.1) establishing an autoregressive AR model under the condition that Gaussian noise and sparse noise coexist
The generation process of the speech signal is an autoregressive process excited by white noise and output by an all-pole linear system, namely the current output is equal to the weighted sum of the excitation signal at the current moment and the outputs at p past moments, which is an autoregressive AR model and is expressed as follows:
s(k) = \sum_{i=1}^{p} a_i s(k-i) + u(k)    (1)
where u(k) is the Gaussian white noise excitation at time k; s(k-i) is the speech signal at time (k-i); s(k) is the speech signal at time k; a_i is the i-th linear prediction coefficient, also called an AR model parameter; p is the order of the AR model;
establishing a voice signal model conforming to an actual measurement process, wherein the voice signal measurement process is described as follows:
Y(k)=s(k)+n(k)+v(k) (2)
wherein Y (k) is a measurement sequence of the voice signal at time k; s (k) is a speech signal at time k; n (k) is white Gaussian noise at time k; v (k) is non-stationary noise at the moment k, obeys Laplace distribution and has sparsity;
1.2) establishing a speech signal state space model
Converting equations (1) and (2) into a state space model, described as follows:
X(k)=FX(k-1)+p(k) (3)
Y(k)=CX(k)+n(k)+v(k) (4)
wherein,
F = \begin{bmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \\ a_p(k) & a_{p-1}(k) & a_{p-2}(k) & \cdots & a_1(k) \end{bmatrix}    (5)
C = [0 0 ... 0 1]    (6)
X(k) = [s(k-p+1) ... s(k)]^T    (7)
in the speech signal state equation (3) and the speech signal measurement equation (4), X(k) is the speech signal state estimation sequence at time k, that is, the optimal state estimate of the speech signal; X(k-1) is the speech signal state estimation sequence at time (k-1); Y(k) is the measurement sequence of the speech signal at time k; F is the state transition matrix formed by the linear prediction coefficients, and the last row of F, [a_p(k) ... a_1(k)], is referred to as the AR parameters; C = [0 0 ... 0 1] is the measurement transfer matrix; p(k) is the state noise at time k, which follows a Gaussian distribution; n(k) is the measurement noise at time k, which follows a Gaussian distribution; v(k) is the non-stationary noise at time k, which follows a Laplacian distribution;
the statistical properties of the state noise p(k) and the measurement noise n(k) are:
E(p(k)) = q,  E(n(k)) = r
E(p(k)p(j)^T) = Q δ_{kj},  E(n(k)n(j)^T) = R δ_{kj}    (8)
where q and r are the means of the noises p(k) and n(k), respectively; Q and R are the covariances of p(k) and n(k), respectively; δ_{kj} is the Kronecker delta; the speech enhancement problem is to estimate the optimal speech signal X(k) on the premise that the measured speech signal Y(k) is known;
2) framing and windowing
The voice signal has short-time stationarity, and the voice signal is considered to be unchanged within 10-30 ms, so that the voice signal can be divided into a plurality of short sections for processing, namely framing, and the framing of the voice signal is realized by adopting a movable window with limited length for weighting; the number of frames per second is usually 33-100 frames, the framing method is an overlapped segmentation method, the overlapped part of a previous frame and a next frame is called frame shift, and the ratio of the frame shift to the frame length is 0-0.5;
3) system initialization
3.1) improved Kalman Filter parameter initialization
Initializing a speech signal state estimation sequence X (0/0) and a covariance matrix P (0/0), and ensuring that the covariance matrix is positive definite;
3.2) AR parameter initialization
Initializing an AR parameter state estimation sequence θ (0/0);
4) estimating AR parameters
The AR parameters are the last row [a_p(k) ... a_1(k)] of the state transition matrix F in equation (3); they mainly describe the speech generation process, and their accuracy has a direct influence on the speech enhancement result; the method proposes that the speech signal state estimation sequence X(k-1), the state noise q(k), the measurement noise n(k) and the non-stationary noise v(k) are all taken into account in the estimation of the AR parameters, and a new AR parameter estimation state space model is established to realize online robust estimation of the AR parameters; the real-time estimation process of the AR parameters is as follows:
4.1) establishing a parameter estimation model of the AR parameters
The AR parameter model under the environment mixed by Gaussian noise and non-stationary noise is described as follows:
θ(k)=θ(k-1)+q(k)
Y(k)=Aθ(k)+r(k)+w(k) (9)
where θ(k) = [a_p(k) ... a_1(k)]^T is the AR parameter state sequence at time k; q(k) is the state noise at time k, which follows a Gaussian distribution with covariance matrix D(k); r(k) is the measurement noise at time k, which follows a Gaussian distribution with covariance matrix L(k); w(k) is the non-stationary measurement noise at time k, which follows a Laplacian distribution and is sparse; A = X(k-1)^T = [s(k-p) ... s(k-1)] is the measurement matrix; Y(k) is the measurement sequence of the speech signal at time k; the statistical properties of the state noise q(k) and the measurement noise r(k) are:
E(q(k)) = d,  E(r(k)) = l
E(q(k)q(j)^T) = D δ_{kj},  E(r(k)r(j)^T) = L δ_{kj}    (10)
where d and l are the means of the noises q(k) and r(k), respectively; D and L are the covariances of q(k) and r(k), respectively; δ_{kj} is the Kronecker delta;
4.2) reconstructing the conventional Kalman filtering problem from a convex optimization perspective
In order to conveniently estimate sparse noise, the kalman filtering problem needs to be reconstructed from the perspective of convex optimization, and a state space model of the conventional kalman filtering does not contain non-stationary noise w (k), as follows:
θ(k)=θ(k-1)+q(k)
Y(k)=Aθ(k)+r(k) (11)
according to the bayesian principle, the AR parameter estimation problem is expressed as estimating an optimal AR parameter sequence θ (k) on the premise that the measured data y (k) is known, that is:
p(\theta(k) | Y(k)) = \frac{p(Y(k) | \theta(k))\, p(\theta(k))}{p(Y(k))}    (12)
establishing a likelihood function of p (Y (k) | theta (k)) and p (theta (k)) according to the maximum likelihood estimation theory:
L_1(Y(k), \theta(k)) = p(Y(k) | \theta(k)) = p(r(k)) = \frac{1}{\sqrt{(2\pi)^m |L|}} \exp\left(-\frac{1}{2} r^T(k) L^{-1} r(k)\right)    (13)
L_2(\theta(k)) = p(\theta(k)) = \frac{1}{\sqrt{(2\pi)^n |\Psi(k)|}} \exp\left(-\frac{1}{2} (\theta(k) - \hat{\theta}(k|k-1))^T \Psi(k)^{-1} (\theta(k) - \hat{\theta}(k|k-1))\right)    (14)
where Ψ(k) = P_θ(k|k) + D(k) is the covariance matrix of the conditional probability p(θ(k) | Y(k)), and P_θ(k|k) is the covariance update value; when the likelihood functions L_1(Y(k), θ(k)) and L_2(θ(k)) attain their maxima, the conditional probability p(θ(k) | Y(k)) yields the optimal estimate; inspection of equations (13) and (14) shows that maximizing L_1(Y(k), θ(k)) and L_2(θ(k)) is equivalent to minimizing the quadratic exponents r^T(k) L^{-1} r(k) and (θ(k) - \hat{\theta}(k|k-1))^T Ψ(k)^{-1} (θ(k) - \hat{\theta}(k|k-1)), which gives the following optimization form:
minimize    r^T(k) L^{-1} r(k) + (\theta(k) - \hat{\theta}(k|k-1))^T \Psi(k)^{-1} (\theta(k) - \hat{\theta}(k|k-1))
subject to  Y(k) = A\theta(k) + r(k)    (15)
where θ(k) and r(k) are the optimization variables and Ψ(k) = P_θ(k|k) + D(k) is the covariance matrix of the Gaussian noise; the value of θ(k) that solves this problem is the updated estimate of the AR parameters, and the value of r(k) is the estimate of the Gaussian noise; P_θ(k|k) is the covariance update matrix:
P_θ(k|k) = (I - K_θ(k)A(k)) P_θ(k|k-1)    (16)
P_θ(k|k-1) is the covariance prediction matrix:
P_θ(k|k-1) = P_θ(k-1|k-1) + D(k-1)    (17)
K_θ(k) is the filter gain:
K_θ(k) = P_θ(k|k-1) A^T (A P_θ(k|k-1) A^T + L(k-1))^{-1}    (18)
4.3) constructing an optimization problem for non-stationary noise estimation from a convex optimization perspective
The non-stationary noise follows a Laplacian distribution and is sparse. The core idea of non-stationary noise estimation is to exploit this sparsity: after the traditional Kalman filtering problem has been converted into a convex optimization problem in step 4.2), the estimation of the sparse noise is completed by adding a sparsity constraint on the non-stationary noise w(k) to the optimization, giving the new optimization form:
minimize    r^T(k) L^{-1} r(k) + (\theta(k) - \hat{\theta}(k|k-1))^T \Psi(k)^{-1} (\theta(k) - \hat{\theta}(k|k-1)) + \lambda \|w(k)\|_1
subject to  Y(k) = A\theta(k) + r(k) + w(k)    (19)
where w(k) is the sparse noise; solving this optimization problem yields the optimal estimate of the AR parameters θ(k); the optimization problem in (19) is convex and can be solved in practice with an interior-point method;
5) estimating a speech signal state sequence
5.1) reconstructing the conventional Kalman filtering problem from a convex optimization perspective
In order to conveniently estimate sparse noise, the kalman filtering problem needs to be reconstructed from the perspective of convex optimization, and a state space model of the conventional kalman filtering is as follows:
X(k)=FX(k-1)+p(k) (20)
Y(k)=CX(k)+n(k) (21)
according to the bayesian principle, the kalman filtering problem is expressed as estimating an optimal speech state sequence x (k) on the premise that the measured data y (k) is known, that is:
p(X(k) | Y(k)) = \frac{p(Y(k) | X(k))\, p(X(k))}{p(Y(k))}    (22)
establishing a likelihood function of p (Y (k) | X (k)) and p (X (k)) according to the maximum likelihood estimation theory:
L_1(Y(k), X(k)) = p(Y(k) | X(k)) = p(n(k)) = \frac{1}{\sqrt{(2\pi)^m |R|}} \exp\left(-\frac{1}{2} n^T(k) R^{-1} n(k)\right)    (23)
L_2(X(k)) = p(X(k)) = \frac{1}{\sqrt{(2\pi)^n |\Theta|}} \exp\left(-\frac{1}{2} (X(k) - \hat{X}(k|k-1))^T \Theta^{-1} (X(k) - \hat{X}(k|k-1))\right)    (24)
where Θ = F P(k-1|k-1) F^T + Q(k-1) is the covariance matrix of the conditional probability p(X(k) | Y(k-1)), and P(k-1|k-1) is the covariance update value; when the likelihood functions L_1(Y(k), X(k)) and L_2(X(k)) attain their maxima, the conditional probability p(X(k) | Y(k)) yields the optimal estimate; inspection of equations (23) and (24) shows that maximizing L_1(Y(k), X(k)) and L_2(X(k)) is equivalent to minimizing the quadratic exponents n^T(k) R^{-1} n(k) and (X(k) - \hat{X}(k|k-1))^T Θ^{-1} (X(k) - \hat{X}(k|k-1)), which gives the following optimization form:
minimize    n^T(k) R^{-1} n(k) + (X(k) - \hat{X}(k|k-1))^T \Theta^{-1} (X(k) - \hat{X}(k|k-1))
subject to  Y(k) = CX(k) + n(k)    (25)
where X(k) and n(k) are the optimization variables and Θ is the covariance matrix of the Gaussian noise; the value of X(k) that solves this problem is the updated state estimate, and the value of n(k) is the estimate of the Gaussian noise;
p (k | k) is the covariance update matrix:
P(k|k)=(I-K(k)C(k))P(k|k-1) (26)
p (k | k-1) is the covariance prediction matrix:
P(k|k-1)=F(k-1)P(k-1|k-1)F(k-1)T+Q(k-1) (27)
K(k) is the filter gain:
K(k)=P(k|k-1)CT(CP(k|k-1)CT+R(k-1))-1(28)
5.2) constructing the estimation problem of sparse noise from the convex optimization angle
The core idea of sparse noise estimation is to exploit the sparsity of the noise: after the traditional Kalman filtering problem has been converted into a convex optimization problem in step 5.1), the estimation of the sparse noise is completed by adding a sparsity constraint on the sparse noise v(k) to the optimization:
minimize    n^T(k) R^{-1} n(k) + (X(k) - \hat{X}(k|k-1))^T \Theta^{-1} (X(k) - \hat{X}(k|k-1)) + \lambda \|v(k)\|_1
subject to  Y(k) = CX(k) + n(k) + v(k)    (29)
where v(k) is the sparse noise; solving this optimization problem yields the optimal estimate of the speech signal state X(k), which corresponds to the optimal state estimate of traditional Kalman filtering; the optimization problem in (29) is convex and can be solved in practice with an interior-point method;
5.3) after the enhancement of the speech signal at time k is finished, the enhanced state estimate X(k) is returned to step 4) to update the AR parameter θ(k+1) at time k+1, and speech enhancement then continues at time k+1 to estimate X(k+1), until all speech signals have been processed.
CN201610843483.0A 2016-09-23 2016-09-23 A kind of online sound enhancement method under the environment suitable for nonstationary noise Expired - Fee Related CN106340304B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610843483.0A CN106340304B (en) 2016-09-23 2016-09-23 A kind of online sound enhancement method under the environment suitable for nonstationary noise

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610843483.0A CN106340304B (en) 2016-09-23 2016-09-23 A kind of online sound enhancement method under the environment suitable for nonstationary noise

Publications (2)

Publication Number Publication Date
CN106340304A true CN106340304A (en) 2017-01-18
CN106340304B CN106340304B (en) 2019-09-06

Family

ID=57840174

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610843483.0A Expired - Fee Related CN106340304B (en) 2016-09-23 2016-09-23 A kind of online sound enhancement method under the environment suitable for nonstationary noise

Country Status (1)

Country Link
CN (1) CN106340304B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110248212A (en) * 2019-05-27 2019-09-17 上海交通大学 360 degree of video stream server end code rate adaptive transmission methods of multi-user and system
CN110648680A (en) * 2019-09-23 2020-01-03 腾讯科技(深圳)有限公司 Voice data processing method and device, electronic equipment and readable storage medium
CN112557925A (en) * 2020-11-11 2021-03-26 国联汽车动力电池研究院有限责任公司 Lithium ion battery SOC estimation method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110305345A1 (en) * 2009-02-03 2011-12-15 University Of Ottawa Method and system for a multi-microphone noise reduction
CN102890935A (en) * 2012-10-22 2013-01-23 北京工业大学 Robust speech enhancement method based on fast Kalman filtering
CN103323815A (en) * 2013-03-05 2013-09-25 上海交通大学 Underwater acoustic locating method based on equivalent sound velocity
CN103903630A (en) * 2014-03-18 2014-07-02 北京捷通华声语音技术有限公司 Method and device used for eliminating sparse noise

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110305345A1 (en) * 2009-02-03 2011-12-15 University Of Ottawa Method and system for a multi-microphone noise reduction
CN102890935A (en) * 2012-10-22 2013-01-23 北京工业大学 Robust speech enhancement method based on fast Kalman filtering
CN103323815A (en) * 2013-03-05 2013-09-25 上海交通大学 Underwater acoustic locating method based on equivalent sound velocity
CN103903630A (en) * 2014-03-18 2014-07-02 北京捷通华声语音技术有限公司 Method and device used for eliminating sparse noise

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
FENG BAO: "Improved Kalman Filtering Algorithm Based on Convex Optimization", Automation & Information Engineering *
WU FEI: "A Kalman Filter with Online Parameter Adjustment and Its Application", Computer Engineering & Science *
WU FEI: "Research on Robust Kalman Algorithms and Their Applications", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110248212A (en) * 2019-05-27 2019-09-17 上海交通大学 360 degree of video stream server end code rate adaptive transmission methods of multi-user and system
CN110248212B (en) * 2019-05-27 2020-06-02 上海交通大学 Multi-user 360-degree video stream server-side code rate self-adaptive transmission method and system
CN110648680A (en) * 2019-09-23 2020-01-03 腾讯科技(深圳)有限公司 Voice data processing method and device, electronic equipment and readable storage medium
CN110648680B (en) * 2019-09-23 2024-05-14 腾讯科技(深圳)有限公司 Voice data processing method and device, electronic equipment and readable storage medium
CN112557925A (en) * 2020-11-11 2021-03-26 国联汽车动力电池研究院有限责任公司 Lithium ion battery SOC estimation method and device
CN112557925B (en) * 2020-11-11 2023-05-05 国联汽车动力电池研究院有限责任公司 Lithium ion battery SOC estimation method and device

Also Published As

Publication number Publication date
CN106340304B (en) 2019-09-06

Similar Documents

Publication Publication Date Title
CN108682418B (en) Speech recognition method based on pre-training and bidirectional LSTM
CN111261183B (en) Method and device for denoising voice
CN110634502B (en) Single-channel voice separation algorithm based on deep neural network
Hu et al. A generalized subspace approach for enhancing speech corrupted by colored noise
CN111860273B (en) Magnetic resonance underground water detection noise suppression method based on convolutional neural network
CN105957537B (en) One kind being based on L1/2The speech de-noising method and system of sparse constraint convolution Non-negative Matrix Factorization
US8296135B2 (en) Noise cancellation system and method
Saleem et al. Deepresgru: residual gated recurrent neural network-augmented kalman filtering for speech enhancement and recognition
CN106340304B (en) A kind of online sound enhancement method under the environment suitable for nonstationary noise
CN111816200B (en) Multi-channel speech enhancement method based on time-frequency domain binary mask
CN110808057A (en) Voice enhancement method for generating confrontation network based on constraint naive
CN115223583A (en) Voice enhancement method, device, equipment and medium
CN112086100A (en) Quantization error entropy based urban noise identification method of multilayer random neural network
CN115171712A (en) Speech enhancement method suitable for transient noise suppression
Talmon et al. Clustering and suppression of transient noise in speech signals using diffusion maps
Li et al. A Convolutional Neural Network with Non-Local Module for Speech Enhancement.
CN112580451A (en) Data noise reduction method based on improved EMD and MED
CN103903630A (en) Method and device used for eliminating sparse noise
CN108573698B (en) Voice noise reduction method based on gender fusion information
Meutzner et al. A generative-discriminative hybrid approach to multi-channel noise reduction for robust automatic speech recognition
Bu et al. A Probability Weighted Beamformer for Noise Robust ASR.
CN112652321B (en) Deep learning phase-based more friendly voice noise reduction system and method
CN113066483B (en) Sparse continuous constraint-based method for generating countermeasure network voice enhancement
Khalil et al. Enhancement of speech signals using multiple statistical models
Deng et al. Recursive noise estimation using iterative stochastic approximation for stereo-based robust speech recognition

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20190906