CN100543842C

CN100543842C - Method for background noise suppression based on multiple statistical models and minimum mean-square error

Info

Publication number: CN100543842C
Application number: CNB2006100811562A
Authority: CN
Inventors: 吴颖谦; 柯昌伟
Original assignee: ZTE Corp
Current assignee: ZTE Corp
Priority date: 2006-05-23
Filing date: 2006-05-23
Publication date: 2009-09-23
Anticipated expiration: 2026-05-23
Also published as: CN101079266A

Abstract

The present invention relates to a background noise suppression method based on multiple statistical models and the minimum mean-square error (MMSE) criterion, comprising: applying a short-time Fourier transform to the speech signal of the current input frame; using the clean-speech amplitude variance estimate and the noise amplitude variance estimate retained from the previous frame for each frequency bin, together with each frequency component of the speech signal in the current input frame, to compute estimates of the real and imaginary parts of each frequency component of the speech signal in the current frame; and computing the a priori speech absence probability for each frequency component of the current input frame and using it to further correct those real-part and imaginary-part estimates. The invention approximates the true distributions of speech and noise in practical applications more accurately; achieves stronger suppression; makes the estimation process more accurate and robust; and offers high noise suppression efficiency at low computational complexity, making it suitable for a wide range of speech communication systems.

Description

Method for background noise suppression based on multiple statistical models and minimum mean-square error

Technical field

The invention belongs to the field of speech processing and relates mainly to a background acoustic noise suppression method based on multiple statistical models and the minimum mean squared error (MMSE) criterion. It is applicable as a pre-processing stage for speech communication, speech recognition, and similar applications.

Background technology

In most speech communication applications, the system input receives only noisy speech that has been corrupted by background noise. The noise severely degrades the quality of speech communication, reduces the clarity and intelligibility of the speech, and adversely affects speech processing modules in the system such as the codec.

Background noise suppression extracts the original speech from noisy speech as cleanly as possible; this line of research belongs to the "speech enhancement" area of speech processing. Noise suppression improves the perceptual quality of noisy speech signals and thus the comfort and service quality of the communication environment. As a speech pre-processing module, it also improves the compression performance and stability of vocoders in noisy environments. In addition, it can effectively improve the robustness of speech recognition systems under background noise.

The speech input model of most communication applications is characterized by single-channel speech input and additive background noise. Under this model, the observed noisy signal can be expressed, in either the time domain or the frequency domain, as the sum of a speech component and a noise component. Among existing noise suppression methods, short-time spectral weighting is the dominant technique. These methods apply a short-time Fourier transform to the noisy speech, design a gain coefficient for each frequency bin from the speech and noise components, multiply the noisy signal by that coefficient to achieve suppression, and obtain the processed speech through an inverse short-time Fourier transform.

Spectral subtraction (S. F. Boll, Suppression of acoustic noise in speech using spectral subtraction, IEEE Trans. ASSP, vol. ASSP-27, pp. 113-120, Apr. 1979) is the most typical example. Its basic principle is to subtract the estimated noise magnitude from the noisy speech magnitude in the frequency domain while leaving the phase unchanged. Because this process is implemented as spectral weighting, the weighting gain coefficient is tied to the signal-to-noise ratio (SNR). Many methods follow this approach but differ in how the gain is computed and how the noise is estimated.

For example, the patent Method and Apparatus for Suppression Noise in a Communication System (US patent 5,659,622) computes the gain from a modified SNR index and the noise energy, and estimates the noise power by computing the spectral deviation from the long-term average power spectrum of each frequency band.

The patent Method and Device for Speech Enhancement in the Presence of Background Noise (WO2005064595) additionally smooths the gain coefficients in the time domain.

The patent Low frequency spectral enhancement system and method (US patent 6,233,549) emphasizes enhancement of the low-frequency components when computing the spectral gain coefficients.

Some spectral weighting methods treat noise suppression as estimation of the spectral amplitude of the original speech and achieve better results. Commonly used estimation criteria include maximum likelihood, ML (R. McAulay, Speech enhancement using a soft-decision noise suppression filter, IEEE Trans. A.S.S.P., 28, 1980), and minimum mean-square error, MMSE (Y. Ephraim, Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator, IEEE Trans. A.S.S.P., 32, 1984).

Of these, MMSE estimation is the most widely used and has been continually refined. A typical example is Y. Ephraim (Y. Ephraim, Speech enhancement using a minimum mean-square error log-spectral amplitude estimator, IEEE Trans. A.S.S.P., 33, 1985), which applies the MMSE criterion to the logarithm of the speech spectral amplitude at every frequency; the spectral weighting gain computed in this way gives better results.

The patent Core Estimator and Adaptive Gains from Signal To Noise Ratio in a Hybrid Speech Enhancement System (US patent 2002002455) uses soft decisions and balances minimization of the estimation error against the speech distortion introduced by suppression, and derives the weighting gain computation from that trade-off.

The key question for MMSE methods is which statistical model to use for the statistical distribution of the frequency-domain speech signal. Existing methods commonly use a Gaussian model, which does not represent the true distribution well. In addition, the accuracy and robustness of the noise estimate are key factors in the performance of a suppression method.

Summary of the invention

The object of the present invention is to provide a background noise suppression method based on multiple statistical models and minimum mean-square error that models the true distributions well and improves the accuracy and robustness of noise estimation.

The invention is implemented as follows:

A background noise suppression method based on multiple statistical models and minimum mean-square error comprises the following steps:

Step 1: apply a short-time Fourier transform to the speech signal of the current input frame;

Step 2: using the clean-speech amplitude variance estimate and the noise amplitude variance estimate retained from the previous frame for each frequency bin, together with each frequency component of the speech signal in the current input frame, model the probability densities of the real and imaginary parts of the noise spectral components with a Laplacian distribution and those of the speech spectral components with a gamma distribution, and use these real-part and imaginary-part densities to compute, under the minimum mean-square error criterion, the conditional expectations of the real and imaginary parts, which serve as the real-part and imaginary-part estimates;

Step 3: compute the a priori speech absence probability for each frequency component of the current input frame and use it to further correct the real-part and imaginary-part estimates computed in step 2;

Step 4: from the corrected real-part and imaginary-part estimates of the current input frame, compute the clean-speech amplitude variance estimate and retain it for use with the next frame;

Step 5: compute the likelihood ratio of the current input frame and decide whether the frame is a pure-noise frame; if so, update the noise amplitude variance estimate;

Step 6: obtain the noise-suppressed speech by inverse short-time Fourier transform and overlap-add.

The probability densities of the real and imaginary parts of the noise and speech spectral components are given by:

\begin{cases} p(N_{R}) = \frac{1}{λ_{n}} \exp(-\frac{2 |N_{R}|}{λ_{n}}) \\ p(N_{I}) = \frac{1}{λ_{n}} \exp(-\frac{2 |N_{I}|}{λ_{n}}) \end{cases}

\begin{cases} p(S_{R}) = \frac{\sqrt[4]{3}}{2 \sqrt{π λ_{s}} \sqrt[4]{2}} |S_{R}|^{-\frac{1}{2}} \exp(-\frac{\sqrt{3} |S_{R}|}{\sqrt{2} λ_{s}}) \\ p(S_{I}) = \frac{\sqrt[4]{3}}{2 \sqrt{π λ_{s}} \sqrt[4]{2}} |S_{I}|^{-\frac{1}{2}} \exp(-\frac{\sqrt{3} |S_{I}|}{\sqrt{2} λ_{s}}) \end{cases}

In the formulas above, N_R, N_I, S_R, and S_I denote the real and imaginary parts of the noise and speech spectral components, respectively, and λ_n and λ_s denote the variances of the noise and speech spectral components, respectively.

The conditional expectation of the real part is:

\begin{aligned}
E[S_{R}(k, l) | Y_{R}(k, l)] &= \frac{\sqrt[4]{1.5}}{2 λ_{n}(k, l) \sqrt{π λ_{s}(k, l)} \, p(Y_{R}(k, l))} \int_{-\infty}^{\infty} S_{R}(k, l) \cdot |S_{R}(k, l)|^{-0.5} \\
&\quad \cdot \exp[-\frac{2 |Y_{R}(k, l) - S_{R}(k, l)|}{λ_{n}(k, l)}] \cdot \exp[-\frac{\sqrt{1.5} |S_{R}(k, l)|}{λ_{s}(k, l)}] \, dS_{R}(k, l) \\
&= \frac{\sqrt[4]{1.5}}{2 λ_{n}(k, l) \sqrt{π λ_{s}(k, l)} \, p[Y_{R}(k, l)]} \Big\{ \frac{2}{3} \exp[\frac{-2 Y_{R}(k, l)}{λ_{n}(k, l)}] Y_{R}^{3/2}(k, l) Φ[1.5, 2.5, -2 Y_{R}(k, l) G_{1}] \\
&\quad + \exp[\frac{-\sqrt{1.5} Y_{R}(k, l)}{λ_{s}(k, l)}] (2 G_{2})^{-1.5} Ψ[-0.5, -0.5, 2 G_{2} Y_{R}(k, l)] \\
&\quad - 0.856 \cdot \exp[\frac{-2 Y_{R}(k, l)}{λ_{n}(k, l)}] (2 G_{2})^{-1.5} \Big\}
\end{aligned}

In the formula above, Y_R(k, l) denotes the real part of the k-th frequency component of the noisy signal Y at time l; Φ(a, b, z) = M(a, b, z) denotes the confluent hypergeometric function of the first kind; Ψ(a, b, z) likewise denotes a hypergeometric function, which can be computed from Φ(a, b, z);

G_{1} = \frac{\sqrt{1.5} λ_{n} + 2 λ_{s}}{2 λ_{s} λ_{n}}

and

G_{2} = \frac{\sqrt{1.5} λ_{n} - 2 λ_{s}}{2 λ_{s} λ_{n}};

The conditional expectation of the imaginary part is:

\begin{aligned}
E[S_{I}(k, l) | Y_{I}(k, l)] &= \frac{\sqrt[4]{1.5}}{2 λ_{n}(k, l) \sqrt{π λ_{s}(k, l)} \, p(Y_{I}(k, l))} \int_{-\infty}^{\infty} S_{I}(k, l) \cdot |S_{I}(k, l)|^{-0.5} \\
&\quad \cdot \exp[-\frac{2 |Y_{I}(k, l) - S_{I}(k, l)|}{λ_{n}(k, l)}] \cdot \exp[-\frac{\sqrt{1.5} |S_{I}(k, l)|}{λ_{s}(k, l)}] \, dS_{I}(k, l) \\
&= \frac{\sqrt[4]{1.5}}{2 λ_{n}(k, l) \sqrt{π λ_{s}(k, l)} \, p[Y_{I}(k, l)]} \Big\{ \frac{2}{3} \exp[\frac{-2 Y_{I}(k, l)}{λ_{n}(k, l)}] Y_{I}^{3/2}(k, l) Φ[1.5, 2.5, -2 Y_{I}(k, l) G_{1}] \\
&\quad + \exp[\frac{-\sqrt{1.5} Y_{I}(k, l)}{λ_{s}(k, l)}] (2 G_{2})^{-1.5} Ψ[-0.5, -0.5, 2 G_{2} Y_{I}(k, l)] \\
&\quad - 0.856 \cdot \exp[\frac{-2 Y_{I}(k, l)}{λ_{n}(k, l)}] (2 G_{2})^{-1.5} \Big\}
\end{aligned}

In the formula above, Y_I(k, l) denotes the imaginary part of the k-th frequency component of the noisy signal Y at time l.

Step 3 further comprises:

Computing the sum of squared magnitudes of the current input frame, computing the SNR and applying recursive time-domain smoothing, and then computing the global probability; computing the SNR at each frequency bin and applying recursive time-domain smoothing, and then computing the local probability; the a priori speech absence probability equals 1 minus the product of the global probability and the local probability;

Using the a priori speech absence probability and the speech presence uncertainty hypothesis to correct the real-part and imaginary-part estimates.

The corrected real-part estimate is:

\begin{aligned}
E[S_{R} | Y_{R}] &= \frac{Γ(k, l)}{1 + Γ(k, l)} E[S_{R} | Y_{R}, H_{1}] \\
&= \frac{\sqrt[4]{1.5}}{2 λ_{n}(k, l) \sqrt{π λ_{s}(k, l)}} \Big\{ 2 \cdot \exp[-\frac{2 Y_{R}(k, l)}{λ_{n}(k, l)}] \sqrt{Y_{R}(k, l)} Φ[0.5, 1.5, -2 G_{1} Y_{R}(k, l)] \\
&\quad + \exp[-\frac{1.5}{λ_{s}(k, l)} Y_{R}(k, l)] \frac{1}{\sqrt{2 G_{2}}} Ψ[0.5, 0.5, 2 G_{2} Y_{R}(k, l)] \\
&\quad + \exp[-\frac{2 Y_{R}(k, l)}{λ_{n}(k, l)}] \sqrt{\frac{π}{2 G_{2}}} \Big\} \cdot \frac{Γ(k, l)}{1 + Γ(k, l)}
\end{aligned}

In the formula above,

Γ(k, l) = \frac{p(Y_{R}(k, l) | H_{1})}{p(Y_{R}(k, l) | H_{0})} \cdot \frac{1 - P(H_{0})}{P(H_{0})},

p(Y_R(k, l) | H_1) denotes the probability density of Y_R(k, l) when speech is present, and p(Y_R(k, l) | H_0) denotes the probability density when only noise is present;

The corrected imaginary-part estimate is:

E[S_{I} | Y_{I}] = \frac{Γ(k, l)}{1 + Γ(k, l)} E[S_{I} | Y_{I}, H_{1}].

After the short-time Fourier transform of step 1, the method further comprises:

A detection step for input consisting of one or more mixed single-frequency tones: compute the energy sum of all frequency components, then find the two largest squared magnitudes and subtract their sum from the energy sum; if the sum of the maxima is greater than the energy sum after subtraction and the two are close to each other, the current input is judged to be a signal tone and no suppression is applied.

The invention uses several statistical models to fit the statistical distributions of the frequency-domain components of speech and of noise separately, and can therefore approximate the true distributions of speech and noise in practical applications more accurately. Because the suppression process accounts for the uncertainty of speech presence in the noisy signal, it achieves stronger suppression. Voice activity detection (VAD) is performed with a maximum likelihood ratio method and the noise power spectrum is estimated on that basis, which makes the estimation more accurate and robust. The a priori speech absence probability is estimated with a combined global and local method. A dedicated procedure avoids adverse effects on single-frequency and multi-tone signals, so that detection of DTMF and fax tones is not impaired. The method offers high noise suppression efficiency at low computational complexity and is suitable for a wide range of speech communication systems.

Description of drawings

Fig. 1 is a block diagram of the method of the invention;

Fig. 2 is a flowchart of the motion detection step in the method of the invention.

Embodiment

The speech input model of most communication applications is characterized by single-channel speech input and additive background noise, and the present invention addresses noise suppression under this model. For this problem the invention proposes an adaptive filtering method based on multiple statistical models. Fig. 1 shows the overall framework of the method. The invention transforms the input signal to the frequency domain with a short-time Fourier transform, uses the parameters obtained from the previous frame to estimate the real and imaginary parts of each frequency component of the speech signal in the current input frame, computes the speech presence probability and corrects the speech estimate, updates the current parameter estimates, and finally obtains the noise-suppressed speech through an inverse short-time Fourier transform.

The embodiment is described step by step in the following seven subsections:

1. Time-frequency analysis of the speech signal

Both speech and background noise are highly non-stationary. A single Fourier transform cannot capture the time-varying spectral content of a signal, such as the time-varying formants of speech or the power spectrum of the noise, so all speech noise suppression methods use time-frequency analysis. The short-time Fourier transform (STFT) is the most important time-frequency analysis method.

The STFT first weights the current speech data with an analysis window, a function that is nonzero only within its support. In the present invention, when the support of the analysis window is L, the speech frame length l is 25% of L. The STFT then applies a discrete Fourier transform (DFT) to the windowed speech data. Comparing L and l shows that adjacent analysis windows overlap by 3/4 in the invention; this process is shown in Fig. 2. The STFT is given by formula (1), where N is the analysis window length and w(n) is the analysis window function.

Y(k, l) = \sum_{n=0}^{N-1} y(n + lN) w(n) \exp[-j (\frac{2π}{N}) nk] \qquad (1)

Noise reduction is applied to Y(k, l) to obtain the enhanced spectrum.

Applying the inverse short-time Fourier transform (STIFT) and overlap-add (OLA) to the enhanced spectrum yields the processed speech signal.

Because the spectral weighting coefficients are time-varying in speech denoising, the STIFT must use a synthesis window that is biorthogonal to h(n) (J. Wexler, Discrete Gabor expansions, Signal Processing, Nov. 1990).
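The analysis and synthesis chain described in this subsection can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the periodic Hann analysis window, the hop of N/4 samples (taken from the 3/4 overlap stated in the text), and all function names are choices made for the example. The synthesis window is the analysis window divided by the summed squared overlapping windows, which is one simple dual window that gives exact reconstruction when the spectrum is left unmodified.

```python
import numpy as np

def analysis_window(N):
    # Periodic Hann window (an assumption; the text does not name the window).
    return 0.5 - 0.5 * np.cos(2.0 * np.pi * np.arange(N) / N)

def synthesis_window(w, hop):
    # Dual window: w divided by the sum of w^2 over all overlapping shifts,
    # so that the overlap-added analysis*synthesis products sum to 1.
    N = len(w)
    overlap_energy = (w ** 2).reshape(N // hop, hop).sum(axis=0)
    return w / np.tile(overlap_energy, N // hop)

def stft(y, w, hop):
    # Windowed DFT of each frame; row l holds Y(k, l) for k = 0..N/2.
    N = len(w)
    n_frames = 1 + (len(y) - N) // hop
    frames = np.stack([y[l * hop:l * hop + N] * w for l in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft_ola(Y, s, hop):
    # Inverse DFT of each frame, synthesis windowing, then overlap-add.
    N = len(s)
    frames = np.fft.irfft(Y, n=N, axis=1) * s
    out = np.zeros(hop * (len(frames) - 1) + N)
    for l, frame in enumerate(frames):
        out[l * hop:l * hop + N] += frame
    return out
```

With N = 64 and hop = 16, samples that are covered by all four overlapping windows are reconstructed exactly; the first and last N - hop samples are only partially covered and are not.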

2. Statistical model of the speech spectral amplitude coefficients

At present, virtually all spectral amplitude weighting methods based on MMSE estimation use a Gaussian distribution to model the probability distribution of each spectral component of noise and speech. The main advantage of this model is mathematical convenience; in practice it does not describe the distributions of the speech and noise spectral components accurately.

If the real and imaginary parts of the speech spectral components are assumed to be Gaussian, the MMSE criterion yields a linear estimator that is a real-valued filter. In other words, the MMSE phase estimate it produces is still equal to the phase of the corresponding spectral component of the noisy speech (Y. Ephraim, Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator, IEEE Trans. A.S.S.P., 32, 1984).

Research and experiments show that the real and imaginary parts of the speech spectral components are better described by a gamma distribution. The MMSE estimator obtained under the gamma distribution is a highly nonlinear, complex-valued filter, which gives better noise suppression performance. The invention therefore models the probability densities of the real and imaginary parts of the noise spectral components with a Laplacian distribution and those of the speech spectral components with a gamma distribution, as given in formulas (2) and (3).

\begin{cases} p(N_{R}) = \frac{1}{λ_{n}} \exp(-\frac{2 |N_{R}|}{λ_{n}}) \\ p(N_{I}) = \frac{1}{λ_{n}} \exp(-\frac{2 |N_{I}|}{λ_{n}}) \end{cases} \qquad (2)

\begin{cases} p(S_{R}) = \frac{\sqrt[4]{3}}{2 \sqrt{π λ_{s}} \sqrt[4]{2}} |S_{R}|^{-\frac{1}{2}} \exp(-\frac{\sqrt{3} |S_{R}|}{\sqrt{2} λ_{s}}) \\ p(S_{I}) = \frac{\sqrt[4]{3}}{2 \sqrt{π λ_{s}} \sqrt[4]{2}} |S_{I}|^{-\frac{1}{2}} \exp(-\frac{\sqrt{3} |S_{I}|}{\sqrt{2} λ_{s}}) \end{cases} \qquad (3)

In the formulas above, N_R, N_I, S_R, and S_I denote the real and imaginary parts of the noise and speech spectral components, respectively, and λ_n and λ_s denote the variances of the noise and speech spectral components, respectively. The real and imaginary parts of the k-th spectral component of speech and noise in the analysis window at time t are written S_R(t, k), S_I(t, k) and N_R(t, k), N_I(t, k); they are gamma-distributed and Laplacian-distributed random variables with variances λ_s and λ_n, respectively.
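As a quick check of formulas (2) and (3), the two densities can be written directly in code. This is only an illustrative sketch; the function names `laplacian_pdf` and `gamma_pdf` are hypothetical and do not appear in the patent.

```python
import math

def laplacian_pdf(x, lam_n):
    # Formula (2): p(N) = (1 / lam_n) * exp(-2 |N| / lam_n)
    return math.exp(-2.0 * abs(x) / lam_n) / lam_n

def gamma_pdf(x, lam_s):
    # Formula (3): p(S) = c * |S|^(-1/2) * exp(-sqrt(1.5) |S| / lam_s),
    # with c = (3/2)^(1/4) / (2 sqrt(pi * lam_s)).
    # The density diverges at x = 0, so x must be nonzero.
    c = 1.5 ** 0.25 / (2.0 * math.sqrt(math.pi * lam_s))
    return c * abs(x) ** -0.5 * math.exp(-math.sqrt(1.5) * abs(x) / lam_s)
```

Both functions integrate to 1 over the real line. For the gamma density the integrable singularity at the origin is easiest to handle numerically with the substitution x = u^2.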

Because speech is highly non-stationary, for each k the distribution parameters obtained at different times t, {λ_s(1, k), λ_s(2, k), λ_s(3, k), ...} and {λ_n(1, k), λ_n(2, k), λ_n(3, k), ...}, are better understood as random sequences, and the invention must estimate these sequences online from the noisy speech.

3. Minimum mean-square estimation of the spectral amplitude coefficients

The human auditory system is more sensitive to changes in the frequency domain, so estimating the frequency-domain components of clean speech from those of noisy speech gives better results. Estimating a random signal requires a known distribution model and error measure, and the mean-square error (MSE) is the most common estimation criterion: it requires that the expected squared error between the computed estimate and the clean speech signal be minimal. Because the invention does not use a Gaussian model, every estimator must include estimates of both the real and imaginary parts (R. Martin, Speech Enhancement using MMSE Short Time Spectral Estimation with Gamma Distributed Speech Priors, Proceedings of IEEE ICASSP, May 2002).

The k-th frequency component of the noisy signal Y at time l is written Y(k, l) = Y_R(k, l) + jY_I(k, l). It contains noise and speech components, i.e. Y(k, l) = [S_R(k, l) + D_R(k, l)] + j[S_I(k, l) + D_I(k, l)]. The MMSE estimation problem is to find, given the observation Y(k, l), the estimate

\hat{S}(k, l) = \hat{S}_{R}(k, l) + j \hat{S}_{I}(k, l)

that minimizes the mean-square error.

From signal estimation theory, the minimum mean-square error estimate of a signal is its conditional expectation, i.e.

\hat{S}(k, l) = E[S(k, l) | Y(0, l), Y(1, l), \ldots].

Assuming independence between the FFT spectral coefficients and between the real and imaginary parts, the MMSE estimator becomes formula (4):

\hat{S}(k, l) = E[S_{R}(k, l) | Y_{R}(k, l)] + j E[S_{I}(k, l) | Y_{I}(k, l)] \qquad (4)

Using the Laplacian and gamma distributions given in the previous subsection, the conditional expectations of the real and imaginary parts are computed from the real-part and imaginary-part probability densities, giving the result in formula (5) (R. Martin, Speech Enhancement using MMSE Short Time Spectral Estimation with Gamma Distributed Speech Priors, Proceedings of IEEE ICASSP, May 2002).

\begin{aligned}
E[S_{R}(k, l) | Y_{R}(k, l)] &= \frac{\sqrt[4]{1.5}}{2 λ_{n}(k, l) \sqrt{π λ_{s}(k, l)} \, p(Y_{R}(k, l))} \int_{-\infty}^{\infty} S_{R}(k, l) \cdot |S_{R}(k, l)|^{-0.5} \\
&\quad \cdot \exp[-\frac{2 |Y_{R}(k, l) - S_{R}(k, l)|}{λ_{n}(k, l)}] \cdot \exp[-\frac{\sqrt{1.5} |S_{R}(k, l)|}{λ_{s}(k, l)}] \, dS_{R}(k, l) \\
&= \frac{\sqrt[4]{1.5}}{2 λ_{n}(k, l) \sqrt{π λ_{s}(k, l)} \, p[Y_{R}(k, l)]} \Big\{ \frac{2}{3} \exp[\frac{-2 Y_{R}(k, l)}{λ_{n}(k, l)}] Y_{R}^{3/2}(k, l) Φ[1.5, 2.5, -2 Y_{R}(k, l) G_{1}] \\
&\quad + \exp[\frac{-\sqrt{1.5} Y_{R}(k, l)}{λ_{s}(k, l)}] (2 G_{2})^{-1.5} Ψ[-0.5, -0.5, 2 G_{2} Y_{R}(k, l)] \\
&\quad - 0.856 \cdot \exp[\frac{-2 Y_{R}(k, l)}{λ_{n}(k, l)}] (2 G_{2})^{-1.5} \Big\}
\end{aligned} \qquad (5)

G_{1} = \frac{\sqrt{1.5} λ_{n} + 2 λ_{s}}{2 λ_{s} λ_{n}}

and

G_{2} = \frac{\sqrt{1.5} λ_{n} - 2 λ_{s}}{2 λ_{s} λ_{n}};

where p(Y_R(k, l)) denotes the probability density of the observed real part of the noisy speech frequency component:

\begin{aligned}
p[Y_{R}(k, l)] &= \frac{\sqrt[4]{1.5}}{2 λ_{n}(k, l) \sqrt{π λ_{s}(k, l)}} \int_{-\infty}^{\infty} |S_{R}(k, l)|^{-0.5} \exp(-\frac{2 |Y_{R}(k, l) - S_{R}(k, l)|}{λ_{n}(k, l)}) \\
&\quad \cdot \exp(-\frac{\sqrt{1.5} |S_{R}(k, l)|}{λ_{s}(k, l)}) \, dS_{R} \\
&= \frac{\sqrt[4]{1.5}}{2 λ_{n}(k, l) \sqrt{π λ_{s}(k, l)}} \Big\{ 2 \cdot \exp[-\frac{2 Y_{R}(k, l)}{λ_{n}(k, l)}] \sqrt{Y_{R}(k, l)} Φ[0.5, 1.5, -2 G_{1} Y_{R}(k, l)] \\
&\quad + \exp[-\frac{1.5}{λ_{s}(k, l)} Y_{R}(k, l)] \frac{1}{\sqrt{2 G_{2}}} Ψ[0.5, 0.5, 2 G_{2} Y_{R}(k, l)] \\
&\quad + \exp[-\frac{2 Y_{R}(k, l)}{λ_{n}(k, l)}] \sqrt{\frac{π}{2 G_{2}}} \Big\}
\end{aligned} \qquad (6)

In these formulas, Φ[a, b, z] = M(a, b, z) denotes the confluent hypergeometric function of the first kind, which can be computed by series summation, and

Ψ[a, b, z] = \frac{Γ(1 - b)}{Γ(a - b + 1)} M(a, b, z) + \frac{Γ(b - 1)}{Γ(a)} z^{1 - b} M(a - b + 1, 2 - b, z) \qquad (7)

(I. S. Gradshteyn, Table of Integrals, Series, and Products, 1994).
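The series summation for M(a, b, z) and the relation in formula (7) can be sketched as below. This is a minimal illustration, assuming moderate |z| so that the power series converges quickly and assuming non-integer b so that the gamma functions in formula (7) are finite; the function names and the stopping tolerance are choices made for the example.

```python
import math

def hyp1f1(a, b, z, tol=1e-15, max_terms=500):
    # Power series M(a, b, z) = sum_n (a)_n / (b)_n * z^n / n!
    term = 1.0
    total = 1.0
    for n in range(max_terms):
        term *= (a + n) * z / ((b + n) * (n + 1))
        total += term
        if abs(term) <= tol * abs(total):
            break
    return total

def hyp_psi(a, b, z):
    # Formula (7): second confluent hypergeometric function built from M.
    first = math.gamma(1.0 - b) / math.gamma(a - b + 1.0) * hyp1f1(a, b, z)
    second = (math.gamma(b - 1.0) / math.gamma(a)
              * z ** (1.0 - b) * hyp1f1(a - b + 1.0, 2.0 - b, z))
    return first + second
```

Two known identities give a convenient sanity check: M(a, a, z) = exp(z), and Ψ(0.5, 0.5, z) = sqrt(π) exp(z) erfc(sqrt(z)), the case that appears in formula (6).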

The conditional expectation of the imaginary part, used as the imaginary-part estimate, is obtained in the same way:

\begin{aligned}
E[S_{I}(k, l) | Y_{I}(k, l)] &= \frac{\sqrt[4]{1.5}}{2 λ_{n}(k, l) \sqrt{π λ_{s}(k, l)} \, p(Y_{I}(k, l))} \int_{-\infty}^{\infty} S_{I}(k, l) \cdot |S_{I}(k, l)|^{-0.5} \\
&\quad \cdot \exp[-\frac{2 |Y_{I}(k, l) - S_{I}(k, l)|}{λ_{n}(k, l)}] \cdot \exp[-\frac{\sqrt{1.5} |S_{I}(k, l)|}{λ_{s}(k, l)}] \, dS_{I}(k, l) \\
&= \frac{\sqrt[4]{1.5}}{2 λ_{n}(k, l) \sqrt{π λ_{s}(k, l)} \, p[Y_{I}(k, l)]} \Big\{ \frac{2}{3} \exp[\frac{-2 Y_{I}(k, l)}{λ_{n}(k, l)}] Y_{I}^{3/2}(k, l) Φ[1.5, 2.5, -2 Y_{I}(k, l) G_{1}] \\
&\quad + \exp[\frac{-\sqrt{1.5} Y_{I}(k, l)}{λ_{s}(k, l)}] (2 G_{2})^{-1.5} Ψ[-0.5, -0.5, 2 G_{2} Y_{I}(k, l)] \\
&\quad - 0.856 \cdot \exp[\frac{-2 Y_{I}(k, l)}{λ_{n}(k, l)}] (2 G_{2})^{-1.5} \Big\}
\end{aligned} \qquad (8)

In this formula, Y_I(k, l) denotes the imaginary part of the k-th frequency component of the noisy signal Y at time l; the other symbols are as explained for the preceding formulas.

p[Y_I(k, l)] is computed in the same way as in formula (6).
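The integral forms of formulas (5) and (6) can also be evaluated numerically, which is a useful cross-check on the closed-form expressions. The sketch below is an assumption-laden illustration rather than the method of the patent: it uses a midpoint rule, the substitution S = ±u² to cancel the |S|^(-1/2) singularity, and a truncation point `s_max` chosen for the example. The function name and grid size are hypothetical.

```python
import numpy as np

def mmse_real_part(y, lam_s, lam_n, n_points=400_000):
    """Return (p_y, s_hat): the density of formula (6) and the conditional
    mean of formula (5), both obtained by numerical integration."""
    a = np.sqrt(1.5) / lam_s               # decay rate of the gamma speech prior
    b = 2.0 / lam_n                        # decay rate of the Laplacian noise model
    s_max = abs(y) + 50.0 * (lam_s + lam_n)  # truncation point (assumption)
    du = np.sqrt(s_max) / n_points
    u = (np.arange(n_points) + 0.5) * du   # midpoint grid, S = +u^2 or -u^2
    s = u ** 2
    # Integrand without |S|^(-1/2); the substitution dS = 2u du cancels it.
    h_pos = np.exp(-a * s - b * np.abs(y - s))   # branch S = +u^2
    h_neg = np.exp(-a * s - b * np.abs(y + s))   # branch S = -u^2
    norm = 2.0 * np.sum(h_pos + h_neg) * du          # integral of |S|^(-1/2) h(S)
    first = 2.0 * np.sum(s * (h_pos - h_neg)) * du   # integral of S |S|^(-1/2) h(S)
    prefactor = 1.5 ** 0.25 / (2.0 * lam_n * np.sqrt(np.pi * lam_s))
    return prefactor * norm, first / norm
```

The same function applies to the imaginary part by passing Y_I instead of Y_R. As expected for an MMSE estimator, the estimate stays close to the observation when the speech variance dominates and shrinks toward zero when the noise variance dominates.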

In real speech communication, the distribution parameters are usually not known a priori and must be estimated from the noisy data; the estimation method is described below.

Formulas (5) and (8) use the joint distribution of two independent random variables, one Laplacian and one gamma distributed. The estimator therefore always contains both speech and noise components; that is, it assumes that speech is unconditionally present. Noisy speech in a real communication environment contains a great deal of silence from pauses, including not only the transitions between sentences but even the pauses between syllables. The presence of speech in the noisy signal is therefore uncertain and holds only with some probability. The assumption above is thus incorrect in practice: the speech in the input data contains many pauses, whereas the background noise is always present.

4. Uncertainty of speech presence

Because formulas (5) and (8) are estimates for the current input frame under the assumption that speech is present, the invention extends them to account for the uncertainty of speech presence. The noisy speech model Y(k, l) = S(k, l) + D(k, l) assumes that speech is always present in the input data. Using H_0 and H_1 to denote speech absence and speech presence, respectively, a more accurate noisy model is:

Y(w) = \begin{cases} S(w) + D(w), & H_{1} \\ D(w), & H_{0} \end{cases} \qquad (9)

For convenience, the indices are omitted from the expressions in this subsection. Taking the uncertainty of speech presence into account, the MMSE estimate E[S_R | Y_R] should be rewritten as formula (10):

E[S_{R} | Y_{R}] = E[S_{R} | Y_{R}, H_{1}] P(H_{1} | Y_{R}) + E[S_{R} | Y_{R}, H_{0}] P(H_{0} | Y_{R}) \qquad (10)

Here E[S_R | Y_R, H_0] denotes the MMSE estimate obtained from Y_R when speech is absent. No speech estimate can be obtained when speech is absent, so this term should be 0. The real-part estimate of the speech signal under speech presence uncertainty is therefore:

E[S_{R} | Y_{R}] = E[S_{R} | Y_{R}, H_{1}] P(H_{1} | Y_{R}) \qquad (11)

Evaluating formula (11) requires the posterior probability P(H_1 | Y_R), which can be computed by Bayes' rule:

P(H_{1} | Y_{R}) = \frac{p(Y_{R}(k, l) | H_{1}) P(H_{1})}{p(Y_{R}(k, l) | H_{1}) P(H_{1}) + p(Y_{R} | H_{0}) P(H_{0})} = \frac{Γ(k, l)}{1 + Γ(k, l)} \qquad (12)

In the formula above,

Γ(k, l) = \frac{p(Y_{R}(k, l) | H_{1})}{p(Y_{R}(k, l) | H_{0})} \cdot \frac{1 - P(H_{0})}{P(H_{0})},

p(Y_R(k, l) | H_1) is computed as in formula (6), and p(Y_R(k, l) | H_0) denotes the conditional probability density when only noise is present, computed as in formula (2). Let P(H_0) = q denote the a priori speech absence probability. Clearly q is not known a priori; how to estimate q is described in subsection 6.
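The Bayes step of formulas (11) and (12) reduces to a few lines. The sketch below assumes the two conditional densities and q have already been computed elsewhere and that 0 < q < 1; the function names are hypothetical.

```python
def speech_presence_posterior(p_y_h1, p_y_h0, q):
    # Generalized likelihood ratio: Gamma = p(Y|H1) / p(Y|H0) * (1 - q) / q
    gamma_ratio = (p_y_h1 / p_y_h0) * (1.0 - q) / q
    # Formula (12): P(H1 | Y) = Gamma / (1 + Gamma)
    return gamma_ratio / (1.0 + gamma_ratio)

def corrected_estimate(s_hat_h1, p_y_h1, p_y_h0, q):
    # Formula (11): weight the speech-present estimate by the posterior.
    return speech_presence_posterior(p_y_h1, p_y_h0, q) * s_hat_h1
```

For example, with p(Y | H_1) = 0.3, p(Y | H_0) = 0.1 and q = 0.4, the ratio is 4.5 and the posterior is 4.5 / 5.5, which matches evaluating Bayes' rule directly as 0.18 / 0.22.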

This gives the MMSE estimate of the invention:

\begin{aligned}
E[S_{R} | Y_{R}] &= \frac{Γ(k, l)}{1 + Γ(k, l)} E[S_{R} | Y_{R}, H_{1}] \\
&= \frac{\sqrt[4]{1.5}}{2 λ_{n}(k, l) \sqrt{π λ_{s}(k, l)}} \Big\{ 2 \cdot \exp[-\frac{2 Y_{R}(k, l)}{λ_{n}(k, l)}] \sqrt{Y_{R}(k, l)} Φ[0.5, 1.5, -2 G_{1} Y_{R}(k, l)] \\
&\quad + \exp[-\frac{1.5}{λ_{s}(k, l)} Y_{R}(k, l)] \frac{1}{\sqrt{2 G_{2}}} Ψ[0.5, 0.5, 2 G_{2} Y_{R}(k, l)] \\
&\quad + \exp[-\frac{2 Y_{R}(k, l)}{λ_{n}(k, l)}] \sqrt{\frac{π}{2 G_{2}}} \Big\} \cdot \frac{Γ(k, l)}{1 + Γ(k, l)}
\end{aligned} \qquad (13)

In the formula above,

Γ(k, l) = \frac{p(Y_{R}(k, l) | H_{1})}{p(Y_{R}(k, l) | H_{0})} \cdot \frac{1 - P(H_{0})}{P(H_{0})},

p(Y_R(k, l) | H_1) denotes the probability density of Y_R(k, l) when speech is present, and p(Y_R(k, l) | H_0) denotes the probability density when only noise is present.

The corrected estimate of the imaginary part of the speech signal is obtained in the same way as for the real part:

E[S_{I} | Y_{I}] = \frac{Γ(k, l)}{1 + Γ(k, l)} E[S_{I} | Y_{I}, H_{1}] \qquad (14)

5. the method for estimation of spectral component variance

The present invention require the variance of voice and noise spectrum component known, but in the actual environment, these two parameters is not know, and without any priori, can only estimate from noisy speech when estimating pure speech manual coefficient.Consider the non-stationary of voice and noise in the actual environment, method should be followed the tracks of the variation of these parameters.The present invention uses previous frame to suppress to handle the amplitude variance of back each frequency component of voice signal as λ _s(k, estimation l)

λ _n(k, estimation l) is more complicated then, and the present invention uses the VAD module to judge whether pure noise frame of current incoming frame, if pure noise frame then upgrades noise parameter.This hard decision method of estimation thinks that input speech signal switches in voice-noise and pure noise two states, the estimation of noise variance only should be carried out in pure noise mode.In the practical communication environment, voice and noise often show height non-stationary characteristics, the time statistical property that becomes make this hard decision method show more steadily and surely, thereby these class methods have obtained to be extensive use of.

The present invention has proposed a kind of VAD method based on likelihood ratio (Likelihood Ratio) on last joint basis, further whether the current incoming frame of relatively adjudicating by likelihood function is pure noise frame on the probability density basis that formula (6) is calculated.Because each frequency domain components comprises approximate independently real part and imaginary part, its likelihood ratio comprises real part and imaginary part simultaneously, and its shape is as the definition of (15) formula, and each spectrum component is uncorrelated each other.

Λ (k, l) = \frac{p [Y_{R} (k, l) | H_{1}] \, p [Y_{I} (k, l) | H_{1}]}{p [Y_{R} (k, l) | H_{0}] \, p [Y_{I} (k, l) | H_{0}]} \qquad (15)

In the above formula, p(Y_R | H_1) and p(Y_R | H_0) are calculated as shown in formulas (6) and (2), respectively; p(Y_I | H_1) and p(Y_I | H_0) are calculated in the same way.

Because VAD must be performed on the whole frame, the likelihood ratio of the whole frame is:

\log [Λ (l)] = \frac{1}{K} \sum_{k = 0}^{K - 1} \log [Λ (k, l)] \qquad (16)

Accordingly, the VAD decision process of the present invention is given by formula (17):

\begin{cases} H_{0}, & \text{if } \log (Λ) < θ_{Λ} \\ H_{1}, & \text{if } \log (Λ) > θ_{Λ} \end{cases} \qquad (17)

When log(Λ) > θ_Λ, the decision is H_1, i.e., a speech-plus-noise frame; otherwise the decision is H_0, i.e., a pure noise frame. If the current input frame is judged to be a pure noise frame, each spectral component of the noise is calculated as in formula (18).

{\hat{σ}}_{n}^{2} (k, l) = \begin{cases} α_{σ} {\hat{σ}}_{n}^{2} (k, l - 1) + (1 - α_{σ}) σ_{n}^{2} (k, l), & H_{0} \\ {\hat{σ}}_{n}^{2} (k, l - 1), & H_{1} \end{cases} \qquad (18)

It can be seen from formula (13) that calculating the suppression result also requires an estimate of the amplitude variance of each frequency component of the clean speech. The present invention directly uses the filtered speech of the previous frame as the estimate of the clean speech, and thereby obtains an estimate of this parameter.
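The frame-level decision of formulas (16) and (17) and the noise update of formula (18) can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the per-bin values log Λ(k, l) are taken as already computed, the function names and the default smoothing factor are hypothetical (the text gives no values for θ_Λ or α_σ), and the update is applied in pure noise frames, as the text states.

```python
def frame_log_likelihood_ratio(log_lr_bins):
    """Formula (16): average the per-bin log likelihood ratios over the frame."""
    return sum(log_lr_bins) / len(log_lr_bins)


def is_speech_frame(log_lr_bins, theta):
    """Formula (17): H1 (speech plus noise) if the frame value exceeds theta."""
    return frame_log_likelihood_ratio(log_lr_bins) > theta


def update_noise_variance(prev_var, inst_var, speech_present, alpha=0.9):
    """Formula (18), read with the text: update only in pure noise frames.

    prev_var: previous per-bin noise variance estimates
    inst_var: per-bin noise variance values of the current frame
    alpha:    smoothing factor (hypothetical default, not from the source)
    """
    if speech_present:
        # H1: keep the previous estimate unchanged.
        return list(prev_var)
    # H0: recursive averaging of the previous estimate and the current value.
    return [alpha * p + (1.0 - alpha) * c for p, c in zip(prev_var, inst_var)]
```

A caller would evaluate `is_speech_frame` once per frame and pass its result as `speech_present`, so that the noise estimate is frozen while speech is active.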

6. Estimation of the a priori speech absence probability

For the noise suppressor of formulas (13) and (14), the a priori speech absence probability is an important parameter. In practical applications this parameter is not only unknown a priori but also varies with time and frequency, so it must be estimated online for each frequency bin. The present invention proposes the following estimation method.

First, the sum of the squared magnitudes of the current input frame is calculated as in formula (19):

A_{Sum}^{2} (l) = \sum_{k = 0}^{K - 1} A^{2} (k, l) \qquad (19)

After the signal-to-noise ratio

η (l) = \frac{A_{Sum}^{2} (l)}{σ_{n}^{2}}

has been calculated, time-domain recursive smoothing is applied as in formula (20):

\bar{η} (l) = β_{η} \bar{η} (l - 1) + (1 - β_{η}) η (l) \qquad (20)

where β_η = 0.9. The global probability P_glob(l) is then calculated as in formula (21):

P_{glob} (l) = \begin{cases} 0, & η_{\min} \geq \bar{η} (l) \\ \frac{\log [\bar{η} (l)] - \log η_{\min}}{\log \frac{η_{\max}}{η_{\min}}}, & η_{\min} \leq \bar{η} (l) \leq η_{\max} \\ 1, & η_{\max} \leq \bar{η} (l) \end{cases} \qquad (21)

where η_max and η_min are empirical constants, equal to -3 dB and -11 dB, respectively.
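The steps of formulas (19) to (21) can be sketched in Python as follows. This is a hedged illustration, not the patented implementation: the function names are hypothetical, the dB limits are assumed to be power ratios converted with 10^(dB/10), and the frame-level noise variance is assumed to be supplied by the caller.

```python
import math


def db_to_linear(db):
    """Convert a power ratio in dB to a linear ratio (assumed convention)."""
    return 10.0 ** (db / 10.0)


def smooth(prev_smoothed, current, beta=0.9):
    """Formula (20): first-order recursive smoothing over time."""
    return beta * prev_smoothed + (1.0 - beta) * current


def log_ramp(snr, snr_min, snr_max):
    """Formula (21): 0 below snr_min, 1 above snr_max, log-linear in between."""
    if snr <= snr_min:
        return 0.0
    if snr >= snr_max:
        return 1.0
    return (math.log(snr) - math.log(snr_min)) / math.log(snr_max / snr_min)


def global_probability(sq_magnitudes, noise_var, prev_smoothed_snr,
                       eta_min_db=-11.0, eta_max_db=-3.0, beta=0.9):
    """Return (P_glob, smoothed SNR) for one frame."""
    snr = sum(sq_magnitudes) / noise_var                 # formula (19) and eta(l)
    smoothed = smooth(prev_smoothed_snr, snr, beta)      # formula (20)
    p_glob = log_ramp(smoothed,
                      db_to_linear(eta_min_db),
                      db_to_linear(eta_max_db))          # formula (21)
    return p_glob, smoothed
```

The smoothed SNR is returned together with the probability so that the caller can feed it back as `prev_smoothed_snr` for the next frame.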

After the signal-to-noise ratio of each frequency bin

γ (k, l) = \frac{A^{2} (k, l)}{σ_{n}^{2}}

has been calculated, time-domain recursive smoothing is applied as in formula (22):

\bar{γ} (k, l) = β_{γ} \bar{γ} (k, l - 1) + (1 - β_{γ}) γ (k, l) \qquad (22)

where β_γ = 0.9. The local probability P_loc(k, l) is then calculated as in formula (23):

P_{loc} (k, l) = \begin{cases} 0, & γ_{\min} \geq \bar{γ} (k, l) \\ \frac{\log [\bar{γ} (k, l)] - \log γ_{\min}}{\log \frac{γ_{\max}}{γ_{\min}}}, & γ_{\min} \leq \bar{γ} (k, l) \leq γ_{\max} \\ 1, & γ_{\max} \leq \bar{γ} (k, l) \end{cases} \qquad (23)

where γ_max and γ_min are empirical constants, equal to -1 dB and -9 dB, respectively. The resulting a priori speech absence probability is

\hat{q} (k, l) = 1 - P_{loc} (k, l) P_{glob} (l) \qquad (24)

Analysis of formula (24) shows that the present invention makes full use of the time-domain correlation between adjacent frames of the speech signal and takes into account both the possibility that speech is globally absent from the current input frame and the possibility that the local speech component is absent at each frequency bin, so the estimation process is more robust.
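As a quick numerical check of formulas (21), (23) and (24), consider the following worked example. The smoothed SNR values of -7 dB for the frame and -5 dB for one frequency bin are assumed purely for illustration, and the limits of -1 dB and -9 dB are assumed to apply to the local ramp of formula (23). Because the logarithm of a ratio is proportional to its value in dB, both ramps can be evaluated directly in dB:

```latex
\begin{aligned}
P_{glob}(l) &= \frac{-7 - (-11)}{-3 - (-11)} = \frac{4}{8} = 0.5 \\
P_{loc}(k, l) &= \frac{-5 - (-9)}{-1 - (-9)} = \frac{4}{8} = 0.5 \\
\hat{q}(k, l) &= 1 - P_{loc}(k, l) \, P_{glob}(l) = 1 - 0.25 = 0.75
\end{aligned}
```

In this example the bin is treated as speech-absent with probability 0.75, even though both the global and the local indicators sit at the midpoint of their ramps. The product form only lowers the speech absence probability when both indicators are high.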

7. Requirements of the ITU-T G.160 recommendation

The present invention is mainly used in various voice enhancement devices (Voice Enhancement Device, VED) to improve the quality of voice communication. Networks, however, often also carry signal tones such as DTMF tones and fax tones. Clearly, a noise suppression algorithm must not adversely affect these signal tones during processing, and the ITU-T G.160 recommendation sets out explicit requirements on this point. These signal tones are all composed of one or two single-frequency tones, and in the frequency domain they appear as spikes at one or more frequencies. The steps of the present invention include a short-time Fourier transform, on the basis of which these spikes are easy to detect, so signal tones can be readily distinguished. To satisfy the requirements of G.160, the present invention adds a detection stage after the short-time Fourier transform that examines the input; if only one or a few distinct spikes are found, the input is judged to be a signal tone and no suppression processing is performed. In the decision process, the present invention calculates the energy sum of all frequency components, then finds the two largest squared magnitudes and subtracts their sum from the energy sum; if the sum of the maxima is greater than the remaining energy and the values are close to each other, the current input is judged to be a signal tone.
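The energy comparison described above can be sketched in Python as follows. This is a rough illustration under stated assumptions: the function name and the `dominance` factor are hypothetical, the input is the list of squared magnitudes of one frame after the short-time Fourier transform, and only the comparison between the two largest peaks and the remaining energy is modelled; the additional closeness condition is not implemented because the text does not specify it precisely.

```python
def is_signal_tone(sq_magnitudes, dominance=1.0):
    """Return True if the two largest spectral peaks dominate the frame energy.

    sq_magnitudes: squared magnitudes of all frequency components of one frame
    dominance:     hypothetical factor; 1.0 means the peak sum must exceed
                   the remaining energy
    """
    total_energy = sum(sq_magnitudes)
    # Sum of the two largest squared magnitudes (fewer if the frame is shorter).
    peak_sum = sum(sorted(sq_magnitudes, reverse=True)[:2])
    # Energy left after removing the two peaks.
    remainder = total_energy - peak_sum
    return peak_sum > dominance * remainder
```

A frame flagged by such a check would bypass the suppression stage entirely, so that DTMF or fax tones pass through unmodified.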

Claims

1. A background noise suppression method based on multiple statistical models and minimum mean squared error, characterized by comprising the following steps:

2. The background noise suppression method based on multiple statistical models and minimum mean squared error according to claim 1, characterized in that:

\begin{cases} p (N_{R}) = \frac{1}{λ_{n}} \exp \left( - \frac{2 | N_{R} |}{λ_{n}} \right) \\ p (N_{I}) = \frac{1}{λ_{n}} \exp \left( - \frac{2 | N_{I} |}{λ_{n}} \right) \end{cases}

\begin{cases} p (S_{R}) = \frac{\sqrt[4]{3}}{2 \sqrt{π λ_{s}} \sqrt[4]{2}} {| S_{R} |}^{- \frac{1}{2}} \exp \left( - \frac{\sqrt{3} | S_{R} |}{\sqrt{2} λ_{s}} \right) \\ p (S_{I}) = \frac{\sqrt[4]{3}}{2 \sqrt{π λ_{s}} \sqrt[4]{2}} {| S_{I} |}^{- \frac{1}{2}} \exp \left( - \frac{\sqrt{3} | S_{I} |}{\sqrt{2} λ_{s}} \right) \end{cases}

where N_R, N_I, S_R and S_I denote the real and imaginary parts of the noise and speech spectral components, respectively, and λ_n and λ_s denote the variances of the noise and speech spectral components, respectively.

3. The background noise suppression method based on multiple statistical models and minimum mean squared error according to claim 1, characterized in that:

The conditional expectation of the real part is:

\begin{aligned}
E [S_{R} (k, l) | Y_{R} (k, l)] &= \frac{\sqrt[4]{1.5}}{2 λ_{n} (k, l) \sqrt{π λ_{s} (k, l)} \, p [Y_{R} (k, l)]} \int_{- \infty}^{\infty} S_{R} (k, l) \cdot {| S_{R} (k, l) |}^{- 0.5} \\
&\quad \cdot \exp \left[ - \frac{2 | Y_{R} (k, l) - S_{R} (k, l) |}{λ_{n} (k, l)} \right] \cdot \exp \left[ - \frac{\sqrt{1.5} | S_{R} (k, l) |}{λ_{s} (k, l)} \right] d S_{R} (k, l) \\
&= \frac{\sqrt[4]{1.5}}{2 λ_{n} (k, l) \sqrt{π λ_{s} (k, l)} \, p [Y_{R} (k, l)]} \Big\{ \frac{2}{3} \exp \left[ \frac{- 2 Y_{R} (k, l)}{λ_{n} (k, l)} \right] Y_{R}^{3 / 2} (k, l) \, Φ [1.5, 2.5, - 2 Y_{R} (k, l) G_{1}] \\
&\quad + \exp \left[ \frac{- \sqrt{1.5} Y_{R} (k, l)}{λ_{s} (k, l)} \right] {(2 G_{2})}^{- 1.5} \, Ψ [- 0.5, - 0.5, 2 G_{2} Y_{R} (k, l)] \\
&\quad - 0.856 \cdot \exp \left[ \frac{- 2 Y_{R} (k, l)}{λ_{n} (k, l)} \right] {(2 G_{2})}^{- 1.5} \Big\}
\end{aligned}

where Y_R(k, l) denotes the real part of the k-th frequency component of the noisy signal Y at time l; Φ(a, b, z) = M(a, b, z) denotes the confluent hypergeometric function of the first kind; Ψ(a, b, z) likewise denotes a hypergeometric function and can be calculated from Φ(a, b, z); and where

G_{1} = \frac{\sqrt{1.5} λ_{n} + 2 λ_{s}}{2 λ_{s} λ_{n}}

and

G_{2} = \frac{\sqrt{1.5} λ_{n} - 2 λ_{s}}{2 λ_{s} λ_{n}};

The conditional expectation of the imaginary part is:

\begin{aligned}
E [S_{I} (k, l) | Y_{I} (k, l)] &= \frac{\sqrt[4]{1.5}}{2 λ_{n} (k, l) \sqrt{π λ_{s} (k, l)} \, p [Y_{I} (k, l)]} \int_{- \infty}^{\infty} S_{I} (k, l) \cdot {| S_{I} (k, l) |}^{- 0.5} \\
&\quad \cdot \exp \left[ - \frac{2 | Y_{I} (k, l) - S_{I} (k, l) |}{λ_{n} (k, l)} \right] \cdot \exp \left[ - \frac{\sqrt{1.5} | S_{I} (k, l) |}{λ_{s} (k, l)} \right] d S_{I} (k, l) \\
&= \frac{\sqrt[4]{1.5}}{2 λ_{n} (k, l) \sqrt{π λ_{s} (k, l)} \, p [Y_{I} (k, l)]} \Big\{ \frac{2}{3} \exp \left[ \frac{- 2 Y_{I} (k, l)}{λ_{n} (k, l)} \right] Y_{I}^{3 / 2} (k, l) \, Φ [1.5, 2.5, - 2 Y_{I} (k, l) G_{1}] \\
&\quad + \exp \left[ \frac{- \sqrt{1.5} Y_{I} (k, l)}{λ_{s} (k, l)} \right] {(2 G_{2})}^{- 1.5} \, Ψ [- 0.5, - 0.5, 2 G_{2} Y_{I} (k, l)] \\
&\quad - 0.856 \cdot \exp \left[ \frac{- 2 Y_{I} (k, l)}{λ_{n} (k, l)} \right] {(2 G_{2})}^{- 1.5} \Big\}
\end{aligned}

where Y_I(k, l) denotes the imaginary part of the k-th frequency component of the noisy signal Y at time l.

4. The background noise suppression method based on multiple statistical models and minimum mean squared error according to claim 1, characterized in that step 3 further comprises:

5. The background noise suppression method based on multiple statistical models and minimum mean squared error according to claim 4, characterized in that:

The corrected estimate of the real part is:

\begin{aligned}
E [S_{R} | Y_{R}] &= \frac{Γ (k, l)}{1 + Γ (k, l)} E [S_{R} | Y_{R}, H_{1}] \\
&= \frac{\sqrt[4]{1.5}}{2 λ_{n} (k, l) \sqrt{π λ_{s} (k, l)}} \Big\{ 2 \cdot \exp \left[ - \frac{2 Y_{R} (k, l)}{λ_{n} (k, l)} \right] \sqrt{Y_{R} (k, l)} \, Φ [0.5, 1.5, - 2 G_{1} Y_{R} (k, l)] \\
&\quad + \exp \left[ - \frac{1.5}{λ_{s} (k, l)} Y_{R} (k, l) \right] \frac{1}{\sqrt{2 G_{2}}} \, Ψ [0.5, 0.5, 2 G_{2} Y_{R} (k, l)] \\
&\quad + \exp \left[ - \frac{2 Y_{R} (k, l)}{λ_{n} (k, l)} \right] \sqrt{\frac{π}{2 G_{2}}} \Big\} \cdot \frac{Γ (k, l)}{1 + Γ (k, l)}
\end{aligned}

where

Γ (k, l) = \frac{p (Y_{R} (k, l) | H_{1})}{p (Y_{R} (k, l) | H_{0})} \cdot \frac{[1 - P (H_{0})]}{P (H_{0})},

The corrected estimate of the imaginary part is:

E [S_{I} | Y_{I}] = \frac{Γ (k, l)}{1 + Γ (k, l)} E [S_{I} | Y_{I}, H_{1}] .

6. The background noise suppression method based on multiple statistical models and minimum mean squared error according to claim 1, characterized in that, after the short-time Fourier transform of step 1, the method further comprises: