CN111508514A - Single-channel speech enhancement algorithm based on compensation phase spectrum

Info

Publication number: CN111508514A
Application number: CN202010278564.7A
Authority: CN
Inventors: 张晓如; 许清臣; 张再跃
Original assignee: Jiangsu University of Science and Technology
Current assignee: Jiangsu University of Science and Technology
Priority date: 2020-04-10
Filing date: 2020-04-10
Publication date: 2020-08-07

Abstract

The invention provides a single-channel speech enhancement algorithm based on a compensated phase spectrum, comprising the following steps: preprocessing the noisy speech signal, dividing it into frames and windowing; performing a Fourier transform; dividing critical frequency bands using the ERB scale; calculating the segmental signal-to-noise ratio; calculating a new compensation factor and, at the same time, obtaining a preliminarily enhanced complex speech spectrum by power spectral subtraction; adding the phase spectrum compensation function to the preliminarily enhanced complex speech spectrum to obtain a compensated complex spectrum; taking the phase angle of the compensated complex spectrum to obtain a compensated phase spectrum; and overlap-adding the speech magnitude spectrum obtained by basic spectral subtraction and the compensated phase spectrum obtained in step seven, then performing an inverse Fourier transform to obtain the enhanced speech signal. The improved compensation factor is shown to be effective not only for stationary noise but, to a greater degree, for non-stationary noise, so the speech enhancement algorithm is applicable to a wider range of noise environments.

Description

Single-channel speech enhancement algorithm based on compensation phase spectrum

Technical Field

The invention relates to the field of speech enhancement algorithms, in particular to a single-channel speech enhancement algorithm based on a compensation phase spectrum.

Background

In the conventional phase spectrum compensation speech enhancement algorithm, the clean speech s(t) is assumed to be contaminated by stationary additive Gaussian noise d(t) that is independent of it, so the time-domain representation of the noisy speech x(t) is:

x(t)＝s(t)+d(t) (1)；

Applying the short-time Fourier transform to equation (1) gives the frequency-domain expression:

$$X(n,k)=\sum_{m=-\infty}^{\infty} x(m)\,w(n-m)\,e^{-j2\pi km/N} \tag{2}$$

where n is the frame index, N is the discrete Fourier transform length, k is the frequency bin index, and w(n) is the window function; j is the imaginary unit.

For convenience of representing noisy speech, equation (2) is represented using polar coordinates:

X(n,k)＝|X(n,k)|e^{j∠X(n,k)} (3)；

where |X(n,k)| is the magnitude spectrum of the noisy speech signal and ∠X(n,k) is its phase spectrum.
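The framing, windowing and polar decomposition described above can be sketched in Python with NumPy. This is a minimal illustration, not the patent's implementation: the function name `stft` and the test tone are assumptions, and the frame length of 256 samples, 50% overlap and Hamming window are taken from the experimental settings this document reports.

```python
import numpy as np

def stft(x, frame_len=256, hop=128):
    """Frame x with 50% overlap, apply a Hamming window, and take the DFT of each frame."""
    window = np.hamming(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.fft(frames, axis=1)  # complex array, shape (n_frames, frame_len)

# Illustrative input (an assumption): one second of a 440 Hz tone sampled at 8 kHz.
fs = 8000
x = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)

X = stft(x)
magnitude = np.abs(X)    # |X(n,k)|
phase = np.angle(X)      # phase spectrum in radians
X_polar = magnitude * np.exp(1j * phase)  # polar form of equation (3)
```

Multiplying the magnitude by the complex exponential of the phase recovers the original complex spectrum, which is what allows the two parts to be processed separately and recombined later.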

In the conventional phase spectrum compensation algorithm, the phase spectrum compensation function is defined as:

$$\Lambda(n,k)=\lambda\,\Psi(k)\,|\hat{D}(n,k)| \tag{4}$$

In equation (4), λ is the compensation factor, which the prior art fixes at the empirical constant 3.74 obtained from a large number of experiments; |D̂(n,k)| is the noise estimate obtained from the first few frames of the noisy speech; and Ψ(k) is a decision function whose expression is:

$$\Psi(k)=\begin{cases}1, & 0<k/N<0.5\\ -1, & 0.5<k/N<1\\ 0, & \text{otherwise}\end{cases} \tag{5}$$

The phase spectrum compensation function is then added to the spectrum of the noisy speech signal to obtain the compensated spectrum:

X_Λ(n,k)＝X(n,k)+Λ(n,k) (6)；

Taking the phase angle of the compensated spectrum gives the compensated phase spectrum:

$$\angle X_\Lambda(n,k)=\arctan\frac{\operatorname{Im}\{X_\Lambda(n,k)\}}{\operatorname{Re}\{X_\Lambda(n,k)\}} \tag{7}$$

where Im{·} and Re{·} denote the imaginary and real parts of X_Λ(n,k), respectively.

Finally, the magnitude spectrum of the noisy speech is combined with the compensated phase spectrum to obtain the enhanced speech spectrum:

$$\hat{S}(n,k)=|X(n,k)|\,e^{j\angle X_\Lambda(n,k)} \tag{8}$$

Because the speech signal is real, its short-time Fourier transform is conjugate symmetric: the magnitude spectrum is symmetric and the phase spectrum is antisymmetric. The inverse short-time Fourier transform used in speech synthesis is therefore a process of summing conjugate terms to form a real signal.
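The conventional procedure above can be sketched for a single frame as follows. The function name `psc_frame`, the random test frame and the flat noise estimate are assumptions made for illustration. The antisymmetric decision function is assumed to be +1 on positive-frequency bins, -1 on negative-frequency bins and 0 at DC and Nyquist, and `np.angle` is used as a four-quadrant equivalent of the arctangent of Im/Re.

```python
import numpy as np

def psc_frame(X, noise_mag, lam=3.74):
    """Conventional phase spectrum compensation for one frame.

    X         : complex full-length DFT (length N) of a noisy frame
    noise_mag : noise magnitude estimate per bin, length N
    lam       : fixed compensation factor (3.74 in the prior art)
    """
    N = len(X)
    k = np.arange(N)
    # Assumed antisymmetric decision function: +1, -1, and 0 at DC / Nyquist.
    psi = np.zeros(N)
    psi[(k > 0) & (k < N / 2)] = 1.0
    psi[k > N / 2] = -1.0
    comp = lam * psi * noise_mag          # real-valued compensation function
    X_comp = X + comp                     # compensated spectrum, equation (6)
    phase_comp = np.angle(X_comp)         # compensated phase spectrum
    return np.abs(X) * np.exp(1j * phase_comp)  # noisy magnitude, new phase

# Hypothetical frame and noise estimate, for illustration only.
rng = np.random.default_rng(0)
frame = rng.standard_normal(256) * np.hamming(256)
X = np.fft.fft(frame)
noise_mag = np.full(256, 0.5)
S = psc_frame(X, noise_mag)
s = np.fft.ifft(S).real                   # real part is kept at synthesis
```

The magnitude of `S` equals the noisy magnitude; only the phase changes. Because the offset has opposite signs on the two halves of the spectrum, bins whose magnitude is small relative to the offset lose their conjugate symmetry, so their conjugate pairs tend to cancel when the real signal is formed.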

In conventional speech enhancement algorithms, the noisy phase spectrum is usually retained and combined with the processed speech magnitude spectrum. In the conventional phase spectrum compensation algorithm, moreover, the compensation factor is fixed at the empirically determined constant 3.74 and therefore cannot compensate the phase spectrum of the noisy speech flexibly. Because the background noise in noisy speech changes continuously, a fixed compensation factor cannot produce a phase spectrum that tracks these changes, so the detail in the synthesized speech phase spectrum is inaccurate and the quality of the enhanced speech is limited.

Because the compensation factor in the prior-art phase spectrum compensation function is fixed, the noisy speech cannot be compensated to different degrees as the noise changes. When the noise is strong and the signal-to-noise ratio is low, noise removal is poor and much residual noise remains, so the prior-art phase spectrum compensation algorithm cannot achieve good enhanced speech quality. To overcome this drawback, the present invention improves the compensation factor in the phase spectrum compensation function.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a single-channel speech enhancement algorithm based on a compensated phase spectrum whose improved compensation factor is effective not only for stationary noise but, to a greater degree, for non-stationary noise. It thereby remedies the prior art's weakness in removing non-stationary noise, so that the speech enhancement algorithm is applicable to a wider range of noise environments.

In order to solve the above technical problem, an embodiment of the present invention provides a single-channel speech enhancement algorithm based on a compensated phase spectrum, including the following steps:

step one, preprocessing the noisy speech signal x(n), dividing it into frames and windowing;

step two, performing a Fourier transform to obtain the complex spectrum X(n,k) = |X(n,k)|∠X(n,k),

where |X(n,k)| is the magnitude spectrum of the noisy speech signal and ∠X(n,k) is the phase spectrum of the noisy speech;

step three, dividing critical frequency bands using the ERB scale;

step four, calculating the segmental signal-to-noise ratio;

step five, calculating a new compensation factor, then calculating the improved phase spectrum compensation function Λ'(n,k), and at the same time obtaining a preliminarily enhanced complex speech spectrum by power spectral subtraction;

step six, adding the phase spectrum compensation function to the preliminarily enhanced complex speech spectrum to obtain a compensated complex spectrum;

step seven, taking the phase angle of the compensated complex spectrum to obtain a compensated phase spectrum;

step eight, overlap-adding the speech magnitude spectrum obtained by basic spectral subtraction and the compensated phase spectrum obtained in step seven, and then performing an inverse Fourier transform to obtain the enhanced speech signal.

The segmental signal-to-noise ratio in step four is defined as:

$$\mathrm{SNRseg}=\frac{10}{M}\sum_{m=0}^{M-1}\log_{10}\frac{\sum_{n=Nm}^{Nm+N-1}x^{2}(n)}{\sum_{n=Nm}^{Nm+N-1}\bigl(x(n)-\hat{x}(n)\bigr)^{2}} \tag{I}$$

where x(n) is the original (clean) speech signal, x̂(n) is the enhanced speech signal, N is the frame length, and M is the number of frames in the signal. The segmental signal-to-noise ratio in step four is thus based on the geometric mean of the signal-to-noise ratios of all speech frames.
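As a minimal sketch of this definition, the function below averages the per-frame signal-to-noise ratio in decibels over non-overlapping frames, which corresponds to a geometric mean of the linear ratios. The name `segmental_snr`, the small `eps` guard against division by zero and the test signal are assumptions added for illustration.

```python
import numpy as np

def segmental_snr(clean, enhanced, frame_len=256, eps=1e-12):
    """Mean over frames of 10*log10(signal energy / error energy), in dB."""
    n_frames = len(clean) // frame_len
    snrs = []
    for m in range(n_frames):
        seg = slice(m * frame_len, (m + 1) * frame_len)
        signal_energy = np.sum(clean[seg] ** 2)
        error_energy = np.sum((clean[seg] - enhanced[seg]) ** 2)
        # eps (an assumption) keeps the ratio finite for silent or perfect frames
        snrs.append(10 * np.log10((signal_energy + eps) / (error_energy + eps)))
    return float(np.mean(snrs))

# Illustrative check: an error of 10% in amplitude is 1% in energy, i.e. 20 dB.
clean = np.sin(2 * np.pi * 440 * np.arange(1024) / 8000)
enhanced = 0.9 * clean
snr_seg = segmental_snr(clean, enhanced)
```

Averaging in the log domain weights every frame equally, so quiet frames influence the result as much as loud ones, unlike a single global signal-to-noise ratio.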

In step five, the compensation factor in the phase spectrum compensation function is expressed as:

where c is a fixed empirical value obtained from a large amount of experimental data, c = 2.5; SNRseg'_i is the segmental signal-to-noise ratio of the i-th frequency band:

the numerator and denominator of equation (III) are the noisy speech power spectrum and the estimated noise power spectrum in the i-th frequency band, respectively, and K is the total number of frequency bands;
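A per-band signal-to-noise ratio of the kind described here can be sketched as the ratio of summed noisy power to summed estimated noise power within each band, expressed in decibels. The exact summation limits of equation (III) are not reproduced in this text, so the function name `band_snr_db`, the decibel scaling and the band edges below are assumptions.

```python
import numpy as np

def band_snr_db(noisy_power, noise_power, band_edges):
    """Per-band SNR in dB for one frame.

    noisy_power, noise_power : power spectra over the same FFT bins
    band_edges               : bin indices; band i covers band_edges[i]:band_edges[i+1]
    """
    snrs = []
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        num = np.sum(noisy_power[lo:hi])   # noisy speech power in band i
        den = np.sum(noise_power[lo:hi])   # estimated noise power in band i
        snrs.append(10 * np.log10(num / den))
    return np.array(snrs)

# Hypothetical six-bin frame split into two bands.
noisy_power = np.array([4.0, 4.0, 4.0, 4.0, 100.0, 100.0])
noise_power = np.ones(6)
snr_bands = band_snr_db(noisy_power, noise_power, band_edges=[0, 4, 6])
```

In this example the first band has a power ratio of 4 and the second a ratio of 100, so the second band would be treated as more speech-dominated.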

substituting equation (II) into the phase spectrum compensation function yields the following expression:

in step six, the new phase spectrum compensation function Λ'(n,k) is added to the complex spectrum obtained by the ERB-scale-based multi-band spectral subtraction method to obtain a new compensated complex spectrum:

in step seven, the phase angle of the compensated complex spectrum is taken to obtain a new compensated phase spectrum:

in step eight, the speech magnitude spectrum obtained by basic spectral subtraction and the compensated phase spectrum obtained in step seven are overlap-added, and an inverse Fourier transform is then performed to obtain the enhanced speech signal.

The technical scheme of the invention has the following beneficial effects:

1. In a white Gaussian noise environment, the signal-to-noise ratio improvement obtained by the algorithm is superior to the prior art; in car noise, airport noise and babble noise environments, the algorithm gives the best result as the input signal-to-noise ratio increases. This shows that the improved compensation factor contributes to the estimation of non-stationary noise and can adapt the compensation flexibly to changes in the noise.

2. The improved compensation factor is shown to be effective: it works for stationary noise and is even more beneficial for removing non-stationary noise, remedying the prior art's weakness in removing non-stationary noise, so that the speech enhancement algorithm is applicable to a wider range of noise environments.

3. The residual noise and speech distortion of speech enhanced by the algorithm are acceptable, which corroborates the PESQ results.

4. The algorithm not only applies the phase spectrum compensation function on the basis of the ERB scale but also improves the compensation factor, so more noise components are removed from noisy speech in non-stationary noise environments and the speech quality is better.

5. The improved compensation factor removes background noise while reducing speech distortion. Overall, the quality of speech enhanced by the algorithm is superior to the prior art.

Drawings

FIG. 1 is a block flow diagram of the present invention;

FIG. 2 is a diagram illustrating signal-to-noise ratio improvement values for different noise types in the present invention;

FIG. 3 is a diagram illustrating PESQ values for different noise types according to the present invention;

FIG. 4 shows the speech spectrograms according to the present invention;

FIG. 5 shows the MOS scoring test results of residual noise in the present invention;

FIG. 6 shows the result of MOS scoring test of speech distortion in the present invention.

Detailed Description

In order to make the technical problems, technical solutions and advantages to be solved by the present invention clearer, the following detailed description is made with reference to the accompanying drawings and specific embodiments.

The invention first divides the Fourier-transformed complex spectrum of the noisy speech into frequency bands using the ERB scale and performs power spectral subtraction independently in each band to obtain a preliminarily enhanced speech signal. It then applies the improved phase spectrum compensation to the preliminarily enhanced speech signal, as described in detail below.
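One way to divide the spectrum on an ERB scale is to space the band edges uniformly in ERB-rate units and map them back to FFT bins. The sketch below assumes the Glasberg and Moore ERB-rate formula, 21.4·log10(1 + 0.00437·f), and a hypothetical count of 8 bands; the text states neither which ERB formula nor how many bands the invention uses.

```python
import numpy as np

def hz_to_erb(f):
    """ERB-rate value for frequency f in Hz (Glasberg and Moore formula, assumed)."""
    return 21.4 * np.log10(1.0 + 0.00437 * f)

def erb_to_hz(e):
    """Inverse of hz_to_erb."""
    return (10.0 ** (e / 21.4) - 1.0) / 0.00437

def erb_band_edges(fs=8000, n_fft=256, n_bands=8):
    """FFT-bin edges of n_bands bands equally spaced on the ERB-rate scale."""
    erb_points = np.linspace(hz_to_erb(0.0), hz_to_erb(fs / 2), n_bands + 1)
    freqs = erb_to_hz(erb_points)
    return np.round(freqs / fs * n_fft).astype(int)

edges = erb_band_edges()   # band i covers bins edges[i]:edges[i+1]
```

Bands spaced this way are narrow at low frequencies and wide at high frequencies, so per-band processing has finer resolution where speech energy is concentrated.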

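The invention provides a single-channel speech enhancement algorithm based on a compensated phase spectrum, whose flow is shown in FIG. 1 and which comprises the following steps:

Step one, preprocessing the noisy speech signal x(n), dividing it into frames and windowing;

Step two, performing a Fourier transform to obtain the complex spectrum X(n,k) = |X(n,k)|∠X(n,k),

where |X(n,k)| is the magnitude spectrum of the noisy speech signal and ∠X(n,k) is the phase spectrum of the noisy speech.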

Step three, dividing critical frequency bands using the ERB scale;

Step four, calculating the segmental signal-to-noise ratio;

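The invention improves the compensation factor so that it can be adjusted flexibly according to the noise intensity. The segmental signal-to-noise ratio measures the noise level of the speech signal and can be calculated in either the time domain or the frequency domain. It is defined as:

$$\mathrm{SNRseg}=\frac{10}{M}\sum_{m=0}^{M-1}\log_{10}\frac{\sum_{n=Nm}^{Nm+N-1}x^{2}(n)}{\sum_{n=Nm}^{Nm+N-1}\bigl(x(n)-\hat{x}(n)\bigr)^{2}} \tag{I}$$

where x(n) is the original (clean) speech signal, x̂(n) is the enhanced speech signal, N is the frame length, and M is the number of frames in the signal.

Step five, calculating a new compensation factor, then calculating the improved phase spectrum compensation function Λ'(n,k), and at the same time obtaining a preliminarily enhanced complex speech spectrum by power spectral subtraction.

In this step, the compensation factor in the phase spectrum compensation function is expressed as: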

The new phase spectrum compensation function shows that if the current frame is a speech frame, that is, the segmental signal-to-noise ratio is large, the compensation factor decreases accordingly because of the exponential function, and the phase spectrum compensation applied to the noisy speech is reduced. If the current frame is a noise frame, that is, the segmental signal-to-noise ratio is small, the compensation factor increases accordingly and the compensation of the noisy speech phase spectrum is strengthened, which removes the background noise.
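The trend described above can be illustrated with a compensation factor that decays exponentially with the band signal-to-noise ratio. The exact expression of the compensation factor is not reproduced in this text, so the form `c * exp(-SNR / scale)`, the `scale` parameter, the function names and the example values are all assumptions chosen only to show the monotonic behaviour; only c = 2.5 comes from the document.

```python
import numpy as np

C = 2.5  # empirical constant c stated in the text

def compensation_factor(band_snr_db, scale=10.0):
    """ASSUMED form c * exp(-SNR / scale); shows the described trend only."""
    snr = np.asarray(band_snr_db, dtype=float)
    return C * np.exp(-snr / scale)

def expand_to_bins(band_values, band_edges):
    """Repeat each band's value over the FFT bins that the band covers."""
    return np.repeat(band_values, np.diff(band_edges))

band_snr_db = np.array([0.0, 10.0, 20.0])   # hypothetical per-band SNRs
band_edges = np.array([0, 2, 5, 6])         # hypothetical bin edges
lam_bands = compensation_factor(band_snr_db)
lam_bins = expand_to_bins(lam_bands, band_edges)   # one factor per FFT bin
```

Under this assumed form, a noise-dominated band (low signal-to-noise ratio) receives a larger factor and therefore stronger phase compensation, while a speech-dominated band receives a smaller one.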

Step six, adding the new phase spectrum compensation function Λ'(n,k) to the complex spectrum obtained by the ERB-scale-based multi-band spectral subtraction method to obtain a new compensated complex spectrum:

Step seven, taking the phase angle of the compensated complex spectrum to obtain a new compensated phase spectrum:

Step eight, overlap-adding the speech magnitude spectrum obtained by basic spectral subtraction and the compensated phase spectrum obtained in step seven, and then performing an inverse Fourier transform to obtain the enhanced speech signal.

The basic spectral subtraction in this step is a complete algorithm in its own right: spectral subtraction is first performed on the noisy speech signal, and the resulting magnitude spectrum is then used. In principle, basic spectral subtraction is applied after the Fourier transform.
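The synthesis step can be sketched as follows: a basic power spectral subtraction yields the magnitude, which is paired with a given phase spectrum, inverse transformed frame by frame, and overlap-added. The sketch applies the inverse FFT before the overlap-add, which is the usual order in practice. The function names, the spectral floor of 0.02 and the hop of 128 samples are assumptions for illustration.

```python
import numpy as np

def spectral_subtract_mag(noisy_mag, noise_mag, floor=0.02):
    """Basic power spectral subtraction; returns the enhanced magnitude."""
    power = noisy_mag ** 2 - noise_mag ** 2
    # Spectral floor (an assumption) prevents negative power estimates.
    power = np.maximum(power, (floor * noisy_mag) ** 2)
    return np.sqrt(power)

def synthesize(mag, phase, hop=128):
    """Combine magnitude and phase per frame, inverse FFT, then overlap-add.

    mag, phase : arrays of shape (n_frames, frame_len)
    """
    n_frames, frame_len = mag.shape
    frames = np.fft.ifft(mag * np.exp(1j * phase), axis=1).real
    out = np.zeros((n_frames - 1) * hop + frame_len)
    for i in range(n_frames):
        out[i * hop:i * hop + frame_len] += frames[i]
    return out
```

Taking the real part after the inverse FFT discards whatever imaginary component the modified phase introduces. With 50% overlap, each interior sample receives contributions from two frames.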

Regarding the symbols in the algorithm: the time-domain signal before preprocessing is denoted by lowercase x; after the Fourier transform it becomes a frequency-domain signal denoted by uppercase X; and after the inverse Fourier transform it is converted back to a time-domain signal denoted by lowercase s.

The improved algorithm provided by the invention is compared below with the prior art by experimental simulation, in which the superiority and effectiveness of the proposed speech enhancement algorithm can be seen more intuitively.

The speech data used in this experiment come from the NOIZEUS corpus. In the simulation, white Gaussian noise, car noise, airport noise and babble noise are added at different signal-to-noise ratios (0 dB, 2 dB, 4 dB, 6 dB, 8 dB and 10 dB) to generate noisy speech for performance evaluation. The speech is sampled at 8 kHz with 16-bit precision.
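Noisy test material at a chosen input signal-to-noise ratio can be produced by scaling the noise before adding it to the clean speech. The sketch below is a generic way of doing this and is not taken from the patent; the function name `mix_at_snr` and the random signals standing in for speech and noise recordings are assumptions.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale noise so that clean + noise has the requested global SNR in dB."""
    noise = noise[:len(clean)]             # noise must be at least as long as clean
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    gain = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + gain * noise

# Stand-in signals (assumptions); real use would load speech and noise recordings.
rng = np.random.default_rng(0)
clean = rng.standard_normal(8000)
noise = rng.standard_normal(10000)
noisy = mix_at_snr(clean, noise, snr_db=4.0)
achieved = 10 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
```

Repeating the call for each of the listed signal-to-noise ratios and noise types would give one noisy file per test condition.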

The performance of the speech enhancement algorithm is evaluated in terms of signal-to-noise ratio improvement, speech quality evaluation (PESQ), spectrogram comparison and MOS scoring. To allow direct comparison between algorithms, the basic short-time Fourier transform parameters are kept identical: (1) the speech frame length is 256 samples, i.e. 32 ms; (2) adjacent frames overlap by 50%; (3) the window function is a Hamming window.

One, improvement of SNR

FIG. 2 shows the signal-to-noise ratio improvement at different input signal-to-noise ratios for different background noise types, where FIG. 2(a) is white Gaussian noise; FIG. 2(b) is car noise; FIG. 2(c) is airport noise; and FIG. 2(d) is babble noise.

FIG. 2 shows that the signal-to-noise ratio improvement obtained by the algorithm of the present invention is superior to the prior art in the white Gaussian noise environment, and that in the other three background noise environments the algorithm gives the best result as the input signal-to-noise ratio increases. This shows that the improved compensation factor contributes to the estimation of non-stationary noise and can adapt the compensation flexibly to changes in the noise.

Two, PESQ value

FIG. 3 shows the PESQ values at different input signal-to-noise ratios for different background noise types, where FIG. 3(a) is white Gaussian noise; FIG. 3(b) is car noise; FIG. 3(c) is airport noise; and FIG. 3(d) is babble noise.

As shown in FIG. 3, the PESQ values achieved by the algorithm in all four background noise environments are significantly better than those of the prior art. Under white Gaussian noise, the algorithm gives the best result at both low and high signal-to-noise ratios. In the other three, non-stationary, noise environments, the PESQ value improves as the signal-to-noise ratio increases and is clearly better than the prior art. The improved compensation factor is therefore effective: it works for stationary noise and is even more beneficial for removing non-stationary noise, remedying the prior art's weakness there, so the speech enhancement algorithm is applicable to a wider range of noise environments.

Three, spectrogram

FIG. 4 shows spectrograms for an example of noisy speech, namely babble noise at an input signal-to-noise ratio of 0 dB, used to compare the present invention with the prior art, where FIG. 4(a) is clean speech; FIG. 4(b) is noisy speech with babble noise at a signal-to-noise ratio of 0 dB; FIG. 4(c) is the prior art; and FIG. 4(d) is the algorithm of the present invention.

Comparing the spectrograms in FIG. 4 shows that the algorithm of the present invention gives the best denoising and the least speech distortion. The spectrogram in FIG. 4(c) shows very obvious residual noise and distorted speech. For example, in the interval of about 1 to 1.5 s, the structure of the speech enhanced by the present invention is closer to the clean speech, whereas that of the prior-art algorithm is heavily distorted. This shows that the residual noise and speech distortion of speech enhanced by the algorithm are acceptable, which also corroborates the PESQ results.

Four, MOS scoring

To evaluate the enhancement performance of the algorithm more comprehensively, an informal subjective listening test was carried out in addition to the three evaluations above. The test audio was contaminated by white Gaussian noise and by babble noise, both at an input signal-to-noise ratio of 0 dB. MOS scores for residual noise and for speech distortion are given; the test results are shown in FIG. 5 and FIG. 6.

In FIG. 5, the residual noise result of the algorithm of the present invention is better than the prior art in both the white Gaussian noise and the babble noise environments. Because the prior-art compensation factor is fixed and cannot adapt to changes in the noise, the difference is reflected in the babble noise test result. The algorithm not only applies the phase spectrum compensation function on the basis of the ERB scale but also improves the compensation factor, so more noise components are removed from noisy speech in non-stationary noise environments and the speech quality is better.

FIG. 6 shows that the algorithm of the present invention achieves the best speech distortion test result compared with the prior art, in both noise environments. Although the speech distortion score is not as high as the speech quality score, the speech distortion of the algorithm is the most acceptable and comfortable, which shows that the improved compensation factor removes background noise while reducing speech distortion. Overall, the quality of speech enhanced by the algorithm is superior to the prior art.

The foregoing is a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the principle of the present invention, and these modifications and improvements shall also fall within the protection scope of the present invention.

Claims

1. A single-channel speech enhancement algorithm based on a compensated phase spectrum is characterized by comprising the following steps:

step three, dividing critical frequency bands using the ERB scale;

step four, calculating the segmental signal-to-noise ratio;

2. The compensated phase spectrum-based single-channel speech enhancement algorithm of claim 1, wherein the segment signal-to-noise ratio in step four is defined as follows:

where x(n) represents the original speech signal, x̂(n) is the enhanced speech signal, N is the frame length, and M is the number of frames in the signal.

3. The compensated phase spectrum based single channel speech enhancement algorithm of claim 1, wherein in the fifth step, the function expression of the compensation factor in the phase spectrum compensation function is:

wherein c is a fixed empirical value, c = 2.5; SNRseg'_i is the segmental signal-to-noise ratio of the i-th frequency band:

4. The compensated phase spectrum based single channel speech enhancement algorithm of claim 1, wherein in step six, the new phase spectrum compensation function Λ'(n,k) is added to the complex spectrum obtained by the ERB-scale-based multi-band spectral subtraction method to obtain a new compensated complex spectrum:

in step eight, the speech magnitude spectrum obtained by basic spectral subtraction and the compensated phase spectrum obtained in step seven are overlap-added, and an inverse Fourier transform is then performed to obtain the enhanced speech signal.