CN1212608C

CN1212608C - A multichannel speech enhancement method using postfilter

Info

Publication number: CN1212608C
Application number: CNB031570747A
Authority: CN
Inventors: 杜利民; 阎兆立
Original assignee: Institute of Acoustics CAS
Current assignee: Institute of Acoustics CAS
Priority date: 2003-09-12
Filing date: 2003-09-12
Publication date: 2005-07-27
Anticipated expiration: 2023-09-12
Also published as: CN1523573A

Abstract

The present invention discloses a speech enhancement method that uses a postfilter to enhance multichannel speech signals. The method comprises the following steps: 1. the time delay of the speech signal in each channel is calculated; 2. the channel signals are aligned in the time domain through delay compensation; 3. a beamformer performs beamforming on the signals of all channels; 4. the auto-power spectrum of the clean speech signal and the auto-power spectrum of the noisy signal are estimated and the frequency response function of a Wiener filter is obtained, wherein the estimate of the clean-speech auto-power spectrum is obtained by removing the estimate of the noise cross-power spectrum from the estimate of the noisy-signal cross-power spectrum; 5. the Wiener postfilter filters the output beam of the beamformer, thereby enhancing the speech. Because the invention takes the correlation among the channel noises into account, it better matches actual conditions, removes noise effectively, particularly in the low-frequency band, and improves the speech enhancement result.

Description

A multichannel speech enhancement method using a postfilter

Technical field

The present invention relates to the field of computer speech signal processing, and more particularly to a multichannel speech enhancement method using a postfilter.

Background technology

Speech enhancement is a selective signal processing technique. It mainly addresses the problem of extracting a target speech signal that is as clean as possible from a speech signal contaminated in various ways. Its purpose is to improve the perceived quality and the intelligibility of the speech signal, for use in fields such as communication, hearing aids, monitoring, and audio-visual conferencing. In addition, with the development of speech recognition technology, very high recognition rates can be reached in quiet environments, but the recognition rate degrades severely in noisy environments. Speech enhancement as a front-end processing stage for speech recognition is therefore a very active and important research direction worldwide.

According to the number of microphones that pick up the speech signal, speech enhancement is divided into two types: single-channel and multichannel. A single-channel speech enhancement system needs only one microphone, has low hardware requirements and low algorithmic complexity, but offers limited noise reduction. A multichannel speech enhancement system uses a microphone array; the multichannel signals contain rich spatial and temporal information, which leaves more room for performance improvement. Microphone-array speech enhancement has therefore been a research focus since the 1990s.

The typical workflow of a multichannel speech enhancement method using a microphone array can be summarized as follows:

1) First, a time delay estimation algorithm (such as the generalized cross-correlation function or an adaptive time delay estimation algorithm) is used to obtain the time delay of the speech signal between the channels. Accurate estimation of the signal delay is the basis of multichannel speech enhancement.

2) Then the channel signals are aligned in the time domain through delay compensation.

3) A beamformer performs beamforming on the signals of all channels.

4) A postfilter (a Wiener filter) filters the output beam of the beamformer, thereby enhancing the speech.
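Steps 1) to 3) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation described in this document: it assumes integer-sample delays, uses plain cross-correlation (a generalized cross-correlation with unit weighting) for delay estimation, and uses a circular shift for alignment. The function names `estimate_delay` and `delay_and_sum` are hypothetical.

```python
import numpy as np

def estimate_delay(x_i, x_j, max_lag):
    """Estimate the delay (in samples) of x_i relative to x_j.

    Uses the cross-correlation computed through the FFT; the peak
    position within +/- max_lag is taken as the delay.
    """
    n = len(x_i) + len(x_j)                 # zero-pad to avoid wrap-around
    X_i = np.fft.rfft(x_i, n)
    X_j = np.fft.rfft(x_j, n)
    cc = np.fft.irfft(X_i * np.conj(X_j), n)
    # Reorder so that the lags run from -max_lag to +max_lag.
    cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))
    return int(np.argmax(cc)) - max_lag

def delay_and_sum(channels, delays):
    """Advance each channel by its delay and average the aligned channels.

    Assumption: a circular shift (np.roll) is used for simplicity, so the
    last few samples of each shifted channel are not meaningful.
    """
    aligned = [np.roll(x, -d) for x, d in zip(channels, delays)]
    return np.mean(aligned, axis=0)
```

In a real system the delays are usually fractional and time-varying, so the alignment would typically be done with interpolation or a phase shift in the frequency domain rather than a sample shift.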

In step 4) above, filtering the output beam of the beamformer requires the frequency response function of the Wiener filter.

First, the microphone signals x_i(t) and x_j(t) before delay removal are modeled as the combination of a sound source s(t) and additive noise n(t):

x_i(t) = s(t - τ_i) + n_i(t) \qquad (1)

x_j(t) = s(t - τ_j) + n_j(t) \qquad (2)

where i and j are the microphone/channel indices, and τ_i and τ_j are the propagation times (that is, the time delays) from the sound source to the microphones.

The frequency response function of the Wiener filter has the form:

H(f) = \frac{φ_{ss}(f)}{φ_{xx}(f)} \qquad (3)

where φ_{ss}(f) is the auto-power spectrum of the ideal clean speech signal s(t), and φ_{xx}(f) is the auto-power spectrum of the noisy signal (s(t) + n(t)). The auto-power spectrum of the noisy signal can be computed directly from the measured microphone signal, but the auto-power spectrum of the clean speech signal cannot be known a priori; moreover, speech is a non-stationary signal whose power spectrum changes constantly. The key to the Wiener filter is therefore to obtain, as accurately as possible, the power spectrum of the clean speech signal contained in the noisy speech signal of each channel, and to derive the Wiener filter frequency response function from this power spectrum. Zelinski addressed this problem well by exploiting the multichannel information. He first assumes:

1. The signal and the background noise are uncorrelated.

2. The noises recorded by the different channels are also mutually uncorrelated.

3. The noise power spectrum recorded by every channel is the same.

Thus, after neglecting the correlation between the signal and the background noise and the cross-correlation between the noises, one obtains

φ_{x_i x_j}(f) = φ_{ss}(f) \qquad (4)

where φ_{x_i x_j}(f) is the cross-power spectrum of the noisy signals x_i and x_j. Substituting formula (4) into formula (3) gives the Wiener filter frequency response function. Averaging the spectral densities over all possible microphone combinations gives a more accurate estimate:

\hat{H}(f) = \frac{E[R\{Σ_{i=0}^{N-2} Σ_{j=i+1}^{N-1} \hat{φ}_{x_i x_j}(f)\}]}{E[Σ_{i=0}^{N-1} φ_{x_i x_i}(f)]} \qquad (5)

where N is the number of channels/microphones, and the operator R{·} takes the real part, because a signal auto-power spectrum must be real.
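A minimal sketch of the postfilter in formula (5) is given below. It rests on assumptions that the text does not state: E[·] is read as an average normalized by the number of terms, the denominator is read as the average auto-power spectrum of the channels, and a single frame of spectra stands in for a time-averaged power spectrum. The function name `zelinski_postfilter` is hypothetical.

```python
import numpy as np

def zelinski_postfilter(X):
    """Postfilter gain per frequency bin in the style of formula (5).

    X: complex array of shape (N, F) holding the spectra of the N
       time-aligned channels for one frame (assumed input layout).
    """
    N = X.shape[0]
    cross = np.zeros(X.shape[1])
    for i in range(N - 1):
        for j in range(i + 1, N):
            # Real part of the cross-power spectrum of channels i and j.
            cross += np.real(X[i] * np.conj(X[j]))
    cross /= N * (N - 1) / 2                 # average over all channel pairs
    auto = np.mean(np.abs(X) ** 2, axis=0)   # average auto-power spectrum
    return cross / np.maximum(auto, 1e-12)   # guard against division by zero
```

When all channels carry the same noise-free signal, the numerator equals the denominator and the gain is 1 in every bin.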

However, this method rests on the assumption that the noises recorded by the different channels are mutually uncorrelated. The cross-correlation of the channel noises can be neglected only at high frequencies; at low frequencies it is pronounced and cannot be ignored, so the method is not practical there. An algorithm suited to the low-frequency case is therefore needed.

Summary of the invention

The objective of the invention is to overcome the shortcoming that existing multichannel speech enhancement methods are suitable only for high frequencies, by providing a multichannel speech enhancement method using a postfilter that takes the cross-correlation of the inter-channel noise signals into account.

To achieve this objective, the invention provides a speech enhancement method using a postfilter, for enhancing multichannel speech signals, comprising the following steps:

1) Calculate the time delay of the speech signal in each channel.

2) Align the channel signals in the time domain through delay compensation.

3) Perform beamforming on the signals of all channels with a beamformer.

4) Estimate the auto-power spectrum of the clean speech signal and the auto-power spectrum of the noisy signal, and obtain the frequency response function of the Wiener filter.

The auto-power spectrum of the clean speech signal is obtained as follows:

a) Choose any two channels among all speech channels as a combination;

b) Estimate the noisy-signal cross-power spectrum and the noise cross-power spectrum between the two channels of the combination;

c) Remove the noise cross-power spectrum estimate from the inter-channel noisy-signal cross-power spectrum estimate to obtain an inter-channel estimate of the clean-speech auto-power spectrum;

d) Perform operations b) and c) for every possible channel combination in a), then average all the resulting inter-channel clean-speech auto-power spectrum estimates; the average is used as the clean-speech auto-power spectrum estimate in step 4).

The noisy-signal auto-power spectrum is the average of the noisy-signal auto-power spectra of all channels.

5) Filter the output beam of the beamformer with the Wiener filter as a postfilter, thereby enhancing the speech.

The multichannel speech signal comprises at least two channels of speech signal.

To reduce the amount of computation, the speech enhancement method may be used to enhance only the low-frequency part of the speech signal, while the high-frequency part is still processed with an existing speech enhancement method, for example the Zelinski algorithm.

Because the invention takes the correlation between the channel noises into account when obtaining the auto-power spectrum of the clean speech signal, it better matches actual conditions, removes noise effectively, particularly in the low-frequency band, and improves the speech enhancement result.

Description of drawings

Fig. 1 shows an example of enhancing a segment of noisy speech with the speech enhancement method, where (a) is the original noisy speech, (b) is the enhancement result obtained with the Zelinski postfilter, and (c) is the enhancement result obtained with the method of the present invention.

Embodiment

The present invention is described in further detail below with reference to the drawing and specific embodiments.

Removing the time delays τ_i and τ_j from the signal models x_i(t) and x_j(t) given by formulas (1) and (2), and then taking the Fourier transform, gives

\hat{X}_i(f) = S(f) + N_i(f) e^{j \frac{2π}{W} f τ_i} \qquad (6)

\hat{X}_j(f) = S(f) + N_j(f) e^{j \frac{2π}{W} f τ_j} \qquad (7)

where \hat{X}_i(f) and \hat{X}_j(f) are the Fourier transforms of the delay-compensated signals x_i(t + τ_i) and x_j(t + τ_j), the hat (^) indicates that the signal delay has been removed, S(f) is the Fourier transform of the clean signal, N_i(f) and N_j(f) are the Fourier transforms of the noise, and W is the frame length. In the exponents, j denotes the imaginary unit rather than the channel index. From formulas (6) and (7), the cross-power spectrum of the noisy signals is

\hat{φ}_{x_i x_j}(f) = φ_{ss}(f) + \hat{φ}_{n_i n_j}(f) \qquad (8)

where

\hat{φ}_{n_i n_j}(f) = φ_{n_i n_j}(f) e^{j \frac{2π}{W} f τ_{ij}} \qquad (9)

In these formulas, \hat{φ}_{x_i x_j}(f) is the cross-power spectrum of the noisy signals x_i(t + τ_i) and x_j(t + τ_j), φ_{ss}(f) is the auto-power spectrum of the clean signal, and φ_{n_i n_j}(f) and \hat{φ}_{n_i n_j}(f) are the noise cross-power spectra before and after delay removal, respectively. τ_{ij} = τ_i - τ_j is the time delay between the signals of channels i and j.
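The step from formulas (6) and (7) to formulas (8) and (9) can be spelled out as follows. This is a sketch that assumes, as in Zelinski's first assumption, that the signal and the noise are uncorrelated, so the two mixed terms vanish; the shorthand θ_i = \frac{2π}{W} f τ_i is introduced here only for readability.

```latex
\hat{φ}_{x_i x_j}(f) = E[\hat{X}_i(f) \hat{X}_j^{*}(f)]
  = E[|S|^2] + E[S N_j^{*}] e^{-jθ_j} + E[N_i S^{*}] e^{jθ_i} + E[N_i N_j^{*}] e^{j(θ_i - θ_j)}
  = φ_{ss}(f) + 0 + 0 + φ_{n_i n_j}(f) e^{j \frac{2π}{W} f (τ_i - τ_j)}
  = φ_{ss}(f) + φ_{n_i n_j}(f) e^{j \frac{2π}{W} f τ_{ij}}
```

The remaining noise term is the delay-dependent quantity that the prior art sets to zero.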

Formula (8) shows that, to obtain the clean-signal auto-power spectrum φ_{ss}(f), the noise cross-power spectrum term in the formula must first be estimated; in the prior art this term is neglected. Formula (9) shows that the noise cross-power spectrum \hat{φ}_{n_i n_j}(f) changes as the time delay τ_{ij} changes, which is also why a simple delay-and-sum plus Wiener filtering algorithm cannot handle a moving sound source. Based on the above analysis, the noise cross-power spectrum can be obtained from:

\hat{φ}'_{n_i n_j}(f) = φ'_{n_i n_j}(f) e^{j \frac{2π}{W} f τ_{ij}} \qquad (10)

where \hat{φ}'_{n_i n_j}(f) is the estimate of the noise cross-power spectrum after delay removal, and φ'_{n_i n_j}(f) is the estimate of the original noise cross-power spectrum, which can be obtained during speech pauses. The prime (·)' denotes an estimated value. From formulas (8) and (10), the clean-signal power spectrum estimate is

φ'_{ss}(f) = \hat{φ}_{x_i x_j}(f) - φ'_{n_i n_j}(f) e^{j \frac{2π}{W} f τ_{ij}} \qquad (11)

φ'_{ss}(f) can also be estimated from the auto-power spectrum of the noisy signal. From formula (1),

φ_{x_i x_i}(f) = φ_{ss}(f) + φ_{n_i n_i}(f) \qquad (12)

and therefore

φ'_{ss}(f) = φ_{x_i x_i}(f) - φ'_{n_i n_i}(f) \qquad (13)

where φ'_{n_i n_i}(f) is the noise power spectrum estimate. Averaging the φ'_{ss}(f) values obtained from formulas (11) and (13) over all microphone combinations improves the estimate of the clean-signal auto-power spectrum and gives the Wiener filter estimate

\hat{H}(f) = \frac{R\{E[Σ_{i=0}^{N-1} (φ_{x_i x_i}(f) - φ'_{n_i n_i}(f)) + Σ_{i=0}^{N-2} Σ_{j=i+1}^{N-1} (\hat{φ}_{x_i x_j}(f) - φ'_{n_i n_j}(f) e^{j \frac{2π}{W} f τ_{ij}})]\}}{R\{E[Σ_{i=0}^{N-1} φ_{x_i x_i}(f)]\}} \qquad (14)

R{·} takes the real part. Because the signal power spectrum φ_{ss}(f) can only be a non-negative real number, the estimate is also half-wave rectified to remove any negative values that may occur.
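The estimate in formula (14), including the real-part and half-wave rectification steps, can be sketched as follows. The sketch makes several assumptions: the single-index sums are read as auto-power spectra, E[·] is read as an average over all terms, f is a frequency-bin index, and the delays are given in samples. The function name `proposed_postfilter` and the array layout are hypothetical.

```python
import numpy as np

def proposed_postfilter(phi_xx, phi_nn, tau, W):
    """Postfilter gain in the style of formula (14).

    phi_xx: (N, N, F) complex; phi_xx[i, j] is the cross-power spectrum of
            the delay-compensated noisy channels i and j (diagonal entries
            are the auto-power spectra). Only entries with i <= j are used.
    phi_nn: (N, N, F) complex; noise cross-power spectra before delay
            removal, estimated during speech pauses.
    tau:    (N,) channel delays in samples (assumed unit).
    W:      frame length.
    """
    N, _, F = phi_xx.shape
    f = np.arange(F)                          # frequency-bin index
    total = np.zeros(F, dtype=complex)
    for i in range(N):                        # auto-spectrum terms, formula (13)
        total += phi_xx[i, i] - phi_nn[i, i]
    for i in range(N - 1):                    # cross-spectrum terms, formula (11)
        for j in range(i + 1, N):
            phase = np.exp(1j * 2 * np.pi / W * f * (tau[i] - tau[j]))
            total += phi_xx[i, j] - phi_nn[i, j] * phase
    n_terms = N + N * (N - 1) // 2
    # Real part, then half-wave rectification to remove negative values.
    phi_ss = np.maximum(np.real(total) / n_terms, 0.0)
    # Average noisy-signal auto-power spectrum over all channels.
    phi_x = np.real(np.mean([phi_xx[i, i] for i in range(N)], axis=0))
    return phi_ss / np.maximum(phi_x, 1e-12)
```

With exact spectra the noise terms cancel and the gain reduces to φ_{ss}(f) divided by the noisy-signal auto-power spectrum, as in formula (3).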

In a specific implementation, all power spectra are updated with the following recursive formula

φ_{x_i x_j}(k + 1, f) = α φ_{x_i x_j}(k, f) + (1 - α) X_i(f) X_j^{*}(f), \quad 0 < α \leq 1 \qquad (15)

In this formula x stands for either the signal or the noise; φ_{x_i x_j}(k + 1, f) is the power spectrum estimate for frame k + 1, and φ_{x_i x_j}(k, f) is the power spectrum estimate for frame k. X(f) is the Fourier transform of the signal x(k), and α is a number between 0 and 1 that determines how quickly the power spectrum is updated.
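The recursive update in formula (15) can be sketched as a one-line function. The function name `update_cross_spectrum` and the default value of `alpha` are assumptions chosen for illustration; the text only requires α to lie between 0 and 1.

```python
import numpy as np

def update_cross_spectrum(phi_prev, X_i, X_j, alpha=0.9):
    """One recursive update of a cross-power spectrum, as in formula (15).

    phi_prev: estimate for frame k, complex array of shape (F,).
    X_i, X_j: spectra of channels i and j for the current frame.
    alpha:    smoothing factor; larger values give slower updates.
    """
    return alpha * phi_prev + (1.0 - alpha) * X_i * np.conj(X_j)
```

Passing the same spectrum for `X_i` and `X_j` gives the auto-power spectrum update, and for a stationary input the estimate converges to |X(f)|^2.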

The cross-correlation of the channel noises is pronounced only in the low-frequency part and can largely be neglected in the high-frequency part. To reduce the amount of computation sensibly, formula (14) can therefore be used to filter only the part of the signal below 1 kHz, while the high-frequency part is still processed with the Zelinski algorithm, as in formula (5).
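The band split described above can be sketched as follows, assuming the two gains have already been computed per frequency bin of a real FFT of frame length W. The function name `combine_bands` and the parameter names are hypothetical.

```python
import numpy as np

def combine_bands(H_low, H_high, fs, W, cutoff_hz=1000.0):
    """Use H_low below the cutoff frequency and H_high elsewhere.

    H_low:  gain from formula (14), shape (W // 2 + 1,).
    H_high: gain from formula (5), same shape.
    fs:     sampling rate in Hz.
    W:      frame length used for the FFT.
    """
    freqs = np.fft.rfftfreq(W, d=1.0 / fs)   # centre frequency of each bin
    return np.where(freqs < cutoff_hz, H_low, H_high)
```

Only the bins below the cutoff need the noise cross-power spectrum estimates, which is where the saving in computation comes from.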

Fig. 1 shows the results for a segment of noisy speech, where (a) is the original noisy speech, (b) is the enhancement result obtained with the Zelinski postfilter, and (c) is the enhancement result obtained with the method of the present invention. As the figure shows, the Zelinski postfiltering algorithm cannot effectively remove the low-frequency noise contained in the speech; because this noise lies below 1 kHz, it cannot be removed with high-pass filtering either. The method of the present invention, by contrast, largely removes the low-frequency noise.

Claims

1. A speech enhancement method using a postfilter, for enhancing a multichannel speech signal that comprises at least two channels of speech signal, comprising the following steps:

1) calculating the time delay of the speech signal in each channel;

2) aligning the channel signals in the time domain through delay compensation;

3) performing beamforming on the signals of all channels with a beamformer;

4) estimating the auto-power spectrum of the clean speech signal and the auto-power spectrum of the noisy signal, and obtaining the frequency response function of a Wiener filter;

5) filtering the output beam of the beamformer with the Wiener filter as a postfilter, thereby enhancing the speech;

characterized in that, in step 4), the auto-power spectrum of the clean speech signal is obtained as follows:

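a) choosing any two channels among all speech channels as a combination;

b) estimating the noisy-signal cross-power spectrum and the noise cross-power spectrum between the two channels of the combination;

c) removing the noise cross-power spectrum estimate from the inter-channel noisy-signal cross-power spectrum estimate to obtain an inter-channel estimate of the clean-speech auto-power spectrum;

d) performing operations b) and c) for every possible channel combination in a), then averaging all the resulting inter-channel clean-speech auto-power spectrum estimates, the average being used as the clean-speech auto-power spectrum estimate in step 4).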

2. The speech enhancement method using a postfilter according to claim 1, characterized in that the noisy-signal auto-power spectrum in step 4) is the average of the noisy-signal auto-power spectra of all channels.

3. The speech enhancement method using a postfilter according to claim 1 or 2, characterized in that the speech enhancement method is used to enhance only the low-frequency part of the speech signal.