CN109817234B - Target speech signal enhancement method, system and storage medium based on continuous noise tracking

Info

Publication number: CN109817234B
Application number: CN201910168105.0A
Authority: CN
Inventors: 张啟权; 王明江; 陆云; 韩宇菲; 张禄; 孙凤娇
Original assignee: Shenzhen Graduate School Harbin Institute of Technology
Current assignee: Shenzhen Graduate School Harbin Institute of Technology
Priority date: 2019-03-06
Filing date: 2019-03-06
Publication date: 2021-01-26
Anticipated expiration: 2039-03-06
Also published as: WO2020177374A1; CN109817234A

Abstract

The invention provides a target speech signal enhancement method, system and storage medium based on continuous noise tracking. The method comprises the following steps: step 1: receiving a noisy speech signal, framing and windowing it, and obtaining the time-frequency representation by a short-time Fourier transform; step 2: estimating the noise power spectrum; step 3: estimating the speech power spectrum; step 4: estimating the speech signal with a speech estimator; and step 5: recovering the speech by inverse Fourier transform, windowing and overlap-add. The beneficial effects of the invention are that the target speech signal is effectively separated, the residual noise in the speech signal is greatly reduced, and the quality of the target signal is greatly improved. This is very important for applications such as automatic speech recognition, speaker recognition, human-machine interfaces and hearing aids.

Description

Target speech signal enhancement method, system and storage medium based on continuous noise tracking

Technical Field

The present invention relates to the field of speech processing technologies, and in particular, to a target speech signal enhancement method and system based on continuous noise tracking, and a storage medium.

Background

Noise is present everywhere in daily life, and the aim of a speech enhancement algorithm is to improve the quality and intelligibility of a target speech signal corrupted by noise. Existing speech enhancement algorithms typically use a voice activity detector to estimate the background noise and thereby enhance the target signal; these algorithms perform well in stationary noise and at high signal-to-noise ratios. However, their performance is very limited at low signal-to-noise ratios, especially in non-stationary noise. Everyday noise is complex (for example, a passing car or train, or pedestrians talking), so it is necessary to develop a speech enhancement algorithm that works well under non-stationary noise.

Disclosure of Invention

The invention provides a target speech signal enhancement method based on continuous noise tracking, which comprises the following steps:

step 1: receiving a noisy speech signal, framing and windowing it, and then obtaining the time-frequency representation by a short-time Fourier transform;

step 2: estimating the noise power spectrum;

step 3: estimating the clean speech power spectrum;

step 4: estimating the clean speech signal with a speech estimator, the a priori signal-to-noise ratio of the speech estimator being obtained with a decision-directed estimator;

step 5: recovering the clean speech by inverse Fourier transform, windowing and overlap-add.

As a further improvement of the present invention, in said step 2, a minimum mean square error estimator of the noise power is used to estimate the noise power spectrum.

As a further improvement of the present invention, in said step 3, a minimum mean square error estimator based on the speech presence probability is used to compute the speech power spectrum.

As a further improvement of the present invention, in said step 4, a generalized gamma prior probability model is applied to obtain the minimum mean square error speech enhancement algorithm.

The invention also provides a target speech signal enhancement system based on continuous noise tracking, which comprises: memory, a processor and a computer program stored on the memory, the computer program being configured to carry out the steps of the method of the invention when called by the processor.

The invention also provides a computer-readable storage medium having stored thereon a computer program configured to, when invoked by a processor, perform the steps of the method of the invention.

The beneficial effects of the invention are that the target speech signal is effectively separated, the residual noise in the speech signal is greatly reduced, and the quality of the target signal is greatly improved. This is very important for applications such as automatic speech recognition, speaker recognition, human-machine interfaces and hearing aids.

Drawings

Fig. 1 is a framework diagram of the present invention.

Fig. 2(a) is a time-domain waveform of noisy speech contaminated by traffic noise at a signal-to-noise ratio of -5 dB.

Fig. 2(b) is a comparison of the tracking performance of the respective noise tracking methods for a rapidly changing noise level.

Fig. 3 shows speech waveforms, in which (a) is the clean speech, (b) is the noisy speech, and (c) is the enhanced speech.

Detailed Description

The invention discloses a target speech signal enhancement method based on continuous noise tracking, which effectively separates the target source signal from the background noise encountered in everyday environments.

As shown in Fig. 1, the framework of the present invention comprises two main parts: a speech estimator and a noise tracker.

Signal model: we consider an additive signal model, y(n) = x(n) + d(n), where y(n) is the noisy speech signal, and x(n) and d(n) are the clean speech signal and the noise signal, respectively. Applying the short-time Fourier transform gives the time-frequency relation Y(l, k) = X(l, k) + D(l, k), where l and k are the frame index and the frequency-bin index, respectively. In polar form, Y = Re^{jα}, X = Ae^{jβ} and D = Ne^{jθ}. E{|X(l, k)|²} = λ_x and E{|D(l, k)|²} = λ_d are the variances of the speech and noise signals, respectively. Fig. 1 shows the main processing flow: 1. the noisy speech signal is framed and windowed, followed by a short-time Fourier transform → 2. noise power spectrum estimation → 3. a priori signal-to-noise ratio estimation → 4. speech signal estimation → 5. synthesis (inverse Fourier transform, windowing and overlap-add to recover the speech).

The target speech signal enhancement method based on continuous noise tracking comprises the following steps.

Step 1: the noisy speech signal is received, framed and windowed, and the time-frequency representation is then obtained by a short-time Fourier transform.
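The framing, windowing and transform of step 1, together with the overlap-add synthesis of step 5, can be sketched as follows. This is a minimal illustration and not the patent's implementation: the frame length of 512 samples, the hop of 256 samples and the square-root Hann window are assumptions chosen so that analysis followed by synthesis reconstructs the interior of the signal exactly; the function names `stft` and `istft` are hypothetical.

```python
import numpy as np

def stft(signal, frame_len=512, hop=256):
    """Frame, window and transform a 1-D signal; returns (frames, bins)."""
    # Periodic square-root Hann window (assumed, not specified by the patent).
    window = np.sqrt(np.hanning(frame_len + 1)[:-1])
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len]
                       for i in range(n_frames)])
    return np.fft.rfft(frames * window, axis=1)

def istft(spec, frame_len=512, hop=256):
    """Inverse transform, synthesis windowing and overlap-add."""
    window = np.sqrt(np.hanning(frame_len + 1)[:-1])
    frames = np.fft.irfft(spec, n=frame_len, axis=1) * window
    out = np.zeros(hop * (spec.shape[0] - 1) + frame_len)
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + frame_len] += frame
    return out
```

With this window at 50% overlap the squared windows of adjacent frames sum to one, so every sample except the first and last `hop` samples is recovered when the spectrum is left unmodified; in an enhancer the spectrum would be modified between the two calls.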

Noise tracker:

Step 2: the noise power spectrum is estimated using a minimum mean square error (MMSE) estimator of the noise power.

Using Bayes' rule, the MMSE estimator can be derived as follows

In formula (1), N (with realization n) denotes the noise spectral amplitude and θ denotes the phase of the noise short-time Fourier transform coefficient.

Since the Fourier coefficients of clean speech and noise are assumed to follow a Gaussian distribution, we obtain

In formula (2), N (with realization n) denotes the noise spectral amplitude and λ_d denotes the noise power spectral density,

where λ_x denotes the speech power spectral density. Carrying out the derivation, we obtain

where ξ = λ_x/λ_d and γ = R²/λ_d denote the a priori and a posteriori signal-to-noise ratios, respectively, and R = |Y(l, k)| is the magnitude of the short-time Fourier transform coefficient of the noisy speech. Formula (3) shows that the noise estimator depends on the a priori signal-to-noise ratio, which in turn requires the speech power spectrum. The next step is therefore the estimation of the clean speech power spectrum.
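The formulas of this passage are not reproduced in the text, so the following sketch does not claim to be the patent's formula (3). It assumes only what the passage states, namely that the speech and noise coefficients are independent zero-mean complex Gaussians. Under that assumption the noise coefficient conditioned on Y is Gaussian with mean Y/(1 + ξ) and variance ξλ_d/(1 + ξ), which gives the closed form E{N² | Y} = R²/(1 + ξ)² + ξλ_d/(1 + ξ). The function name `mmse_noise_power` is hypothetical.

```python
import numpy as np

def mmse_noise_power(R, lambda_x, lambda_d):
    """Conditional mean of the noise power given the noisy magnitude R.

    Assumes independent zero-mean complex Gaussian speech and noise
    coefficients with variances lambda_x and lambda_d.
    """
    xi = lambda_x / lambda_d  # a priori signal-to-noise ratio
    return (1.0 / (1.0 + xi)) ** 2 * R ** 2 + xi / (1.0 + xi) * lambda_d
```

Two limiting cases serve as a sanity check: when ξ = 0 (no speech) the estimate equals the noisy periodogram R², and when ξ is large the estimate stays close to the previous noise variance λ_d, so the tracker does not absorb speech energy into the noise estimate.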

Step 3: the clean speech power spectrum is estimated using a minimum mean square error estimator based on the speech presence probability, expressed as

In formula (4), A (with realization a) denotes the speech spectral amplitude; upper case denotes the random variable and lower case its value, the same convention as used for the noise above. H₁ and H₀ denote the two hypotheses that speech is present and absent, respectively.

Since the second term is zero, only the first term needs to be computed. Using Bayes' rule, it can be calculated as

In formula (5), β denotes the phase of the speech short-time Fourier transform coefficient.

Carrying out the derivation and evaluating the integral with the Bessel function, we obtain

For the speech presence probability, a simple and effective estimate is obtained by using a fixed a priori signal-to-noise ratio. The probability is calculated as

Formula (7) depends on an estimate of the a priori signal-to-noise ratio.
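A speech presence probability with a fixed a priori signal-to-noise ratio can be sketched as below. The formula in this passage is not reproduced in the text, so this is an illustration under stated assumptions rather than the patent's formula (7): the noisy coefficient is taken to be complex Gaussian with variance λ_d under H₀ and λ_d(1 + ξ) under H₁, and the posterior follows from Bayes' rule. The fixed value of 15 dB and the equal prior probabilities are assumptions chosen for illustration, and the function name is hypothetical.

```python
import numpy as np

def speech_presence_probability(R, lambda_d, xi_h1_db=15.0, prior_h1=0.5):
    """Posterior probability of speech presence P(H1 | Y).

    xi_h1_db is the fixed a priori SNR assumed under H1 (illustrative value),
    prior_h1 is the assumed prior probability of speech presence.
    """
    xi_h1 = 10.0 ** (xi_h1_db / 10.0)
    gamma = R ** 2 / lambda_d  # a posteriori signal-to-noise ratio
    # Likelihood ratio p(Y | H0) / p(Y | H1) weighted by the prior odds.
    ratio = ((1.0 - prior_h1) / prior_h1 * (1.0 + xi_h1)
             * np.exp(-gamma * xi_h1 / (1.0 + xi_h1)))
    return 1.0 / (1.0 + ratio)
```

The probability rises monotonically with the a posteriori signal-to-noise ratio: it is close to zero when the observed power is at or below the noise level and approaches one when the observed power is far above it.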

Speech estimator:

The minimum mean square error speech enhancement algorithm is obtained by using a generalized gamma prior probability model. The generalized gamma prior is

In formula (8), a is the speech spectral amplitude, Γ(·) denotes the gamma function, and the other parameters are the shape parameters of the gamma model.

The parameters are chosen as μ = 1 and ν = 6. The a priori signal-to-noise ratio of the speech estimator is obtained with a decision-directed estimator.
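The text names a decision-directed estimator for the a priori signal-to-noise ratio without giving its form. A minimal sketch of the commonly used decision-directed rule is shown below: the estimate is a weighted sum of the previous frame's enhanced power, normalized by the noise variance, and the current maximum-likelihood estimate max(γ − 1, 0). The smoothing factor 0.98 and the lower bound `xi_min` are assumptions for illustration and are not taken from the patent; the function name is hypothetical.

```python
import numpy as np

def decision_directed_xi(prev_amp_est, R, lambda_d, alpha=0.98, xi_min=1e-3):
    """Decision-directed estimate of the a priori SNR for one frame.

    prev_amp_est: estimated clean speech amplitude from the previous frame.
    R: noisy speech magnitude of the current frame.
    alpha, xi_min: assumed smoothing factor and lower bound.
    """
    gamma = R ** 2 / lambda_d  # a posteriori signal-to-noise ratio
    xi = (alpha * prev_amp_est ** 2 / lambda_d
          + (1.0 - alpha) * np.maximum(gamma - 1.0, 0.0))
    return np.maximum(xi, xi_min)
```

The function works element-wise, so it can be applied to all frequency bins of a frame at once by passing arrays for `prev_amp_est`, `R` and `lambda_d`.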

To evaluate the performance of the method, a large number of experiments were carried out; they show that the method effectively suppresses non-stationary noise and thereby enhances the target speech signal. Figs. 2 and 3 illustrate the noise tracking and the suppression of non-stationary noise.

Figs. 2(a) and 2(b) show the noise tracking results: the proposed noise tracker follows rapid changes of the noise level quickly and accurately. Fig. 3 shows the waveforms of the clean, noisy and enhanced speech, from which it can be seen that the non-stationary noise is well suppressed. Overall, the method works very well for suppressing non-stationary noise.

The invention has the following beneficial effects:

1. The target speech signal enhancement method of the present invention eliminates the need for a voice activity detector to distinguish speech segments from non-speech segments.

2. Noise can be tracked continuously even during speech segments, so that a rapidly changing noise level is tracked and estimated accurately.

3. The target voice signal is effectively separated, the noise residual quantity in the voice signal is greatly reduced, and the quality of the target signal is greatly improved. This is very important for applications such as automatic speech recognition, speaker recognition, man-machine interface and hearing aids.

The foregoing is a more detailed description of the invention in connection with specific preferred embodiments and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.

Claims

1. A target speech signal enhancement method based on continuous noise tracking is characterized by comprising the following steps:

step 2: estimating a noise power spectrum;

step 3: estimating a clean speech power spectrum;

step 5: performing inverse Fourier transform, windowing and using an overlap-add technique to achieve clean speech recovery;

in the step 2, estimating a noise power spectrum by using a minimum mean square error estimator of the noise power;

in the step 2, a minimum mean square error estimator is obtained by using Bayes' rule, and the formula is as follows

In the formula (1), θ represents the phase of the noise short-time Fourier transform coefficient, N and n respectively represent the noise spectral amplitude as a random variable and its value, Y represents the short-time Fourier transform coefficient of the noisy speech, E{·} represents the expectation operator, and d is the differential symbol of the integral and has no separate meaning;

since the Fourier coefficients of clean speech and noise are assumed to follow a Gaussian distribution, the result is

wherein λ_x represents the power spectral density of speech, R represents the amplitude of the noisy speech spectrum, and I₀(·) represents the zero-order modified Bessel function of the first kind;

by derivation we obtain

wherein ξ = λ_x/λ_d and γ = R²/λ_d respectively represent the a priori signal-to-noise ratio and the a posteriori signal-to-noise ratio, and R = |Y(l, k)| is the amplitude of the short-time Fourier transform coefficient of the noisy speech;

in the step 3, a minimum mean square error estimator based on the speech presence probability is used to compute the clean speech power spectrum;

in the step 4, a minimum mean square error speech enhancement algorithm is obtained by applying a generalized gamma prior probability model;

the prior generalized gamma probability model is

The parameters μ = 1 and ν = 6;

Γ (·) represents the gamma function, and the other parameters are the shape parameters of the gamma model.

2. A target speech signal enhancement system based on continuous noise tracking, comprising: memory, a processor and a computer program stored on the memory, the computer program being configured to carry out the steps of the method of claim 1 when invoked by the processor.

3. A computer-readable storage medium characterized by: the computer-readable storage medium stores a computer program configured to implement the steps of the method of claim 1 when invoked by a processor.