WO2006097886A1

WO2006097886A1 - Noise power estimation

Info

Publication number: WO2006097886A1
Application number: PCT/IB2006/050771
Authority: WO
Inventors: Ivo Batina; Jesper Jensen; Richard Heusdens
Original assignee: Koninklijke Philips Electronics N.V.
Priority date: 2005-03-16
Filing date: 2006-03-13
Publication date: 2006-09-21

Abstract

A method of estimating the power spectral density of noise in a speech signal uses filtering in the frequency domain. The current power of the speech signal is determined per time segment and per frequency range, the difference of the current power and a current estimate of the power is determined, and then a new estimate of the noise power is provided on the basis of said difference and said current estimate. A speech enhancement method may include these method steps. A device (1) for estimating the power spectral density of noise in the speech signal may be comprised in portable consumer apparatus.

Description

Noise power estimation

The present invention relates to noise estimation. More in particular, the present invention relates to a method of and a device for estimating the power spectral density of noise in a noisy speech signal.

It is well known to estimate the amount of noise in speech signals. As it is very difficult to separate noise from speech, conventional methods estimate noise properties in the pauses between speech, that is, between words. In the (assumed) absence of speech, noise properties such as the amplitude or power spectrum can be determined relatively easily. If it is assumed that the noise in the pauses between speech is identical to the noise during speech (stationary noise), an accurate noise estimate may be obtained. However, experiments have shown that natural noise, for example the noise heard in a moving vehicle, is not necessarily stationary and that noise estimates made during pauses are therefore unreliable predictors of the noise during speech. It is therefore desirable to estimate noise properties during speech.

The paper by Rainer Martin, "Noise power spectral density estimation based on optimal smoothing and minimum statistics", IEEE Transactions on Speech and Audio Processing, 9(5): 504-512, July 2001, discloses a method for noise estimation in which no distinction is made between speech activity and speech pause. This known method, commonly known as the Minimum Statistics (MS) method, is designed to be combined with speech enhancement schemes. The Minimum Statistics method is based on tracking the spectral minima of the noisy speech power spectrum. This known method is computationally complex and updates the noise estimate only slowly when the noise energy level rises suddenly. Consequently, this method is mainly applicable to slowly varying noise sources.

Speech enhancement schemes typically require an estimate of the power spectral density of noise in order to extract a "clean" speech signal estimate from the noisy input speech signal. In many applications, it is desirable to enhance speech in real time. This requires methods having a relatively small computational complexity. This is particularly the case when the speech enhancement is to be carried out by a relatively small device having limited computing power, such as a portable consumer device. Typical portable consumer devices in which speech enhancement may be used include, for example, mobile (cellular) telephones and electronic hearing aids. Prior art methods and devices do not allow effective speech enhancement or noise estimation to be carried out in portable consumer devices.

It is an object of the present invention to overcome these and other problems of the prior art and to provide a method and a device which allow noise properties to be estimated from a noisy speech signal, that is, during speech activity. It is a further object of the present invention to provide a method and a device for noise power estimation which have a low computational complexity and are therefore suitable for real-time applications.

Accordingly, the present invention provides a method of estimating the power spectral density of noise in a speech signal, the method comprising the steps of:
- determining the total power of the speech signal,
- determining the difference of said total power and a current estimate of said total power,
- providing a new estimate of the noise power and the speech power using said difference and a current estimate of the noise power and the speech power,
- deriving a new estimate of said total power from the new estimate of the noise power and the speech power,
wherein the above steps are carried out per time segment and per frequency range.

By providing a new estimate of the noise power and the speech power using said difference and a current estimate of the noise power and the speech power, new estimates of the noise power can be obtained very efficiently, requiring a relatively small number of calculations. Accordingly, the method of the present invention can be carried out even by devices having relatively little computing power, such as portable consumer devices.

The type of calculations made in the method of the present invention is known as Kalman filtering. For these calculations, several efficient algorithms are known, thus enabling an efficient computation. In particular, the recursive nature of these calculations enhances the efficiency.

By carrying out the method per time segment and per frequency range, the noise power estimates are made per frequency band (also called frequency bin). In this way, the power spectral density is obtained. By carrying out the method per time segment, for example per time frame, the calculations can be very efficient.

It is noted that the total power of the speech signal is the combined power of the noise and the "clean" speech signal. The estimate of the noise power and the speech power comprises both a noise power estimate and a separate speech power estimate.

It is preferred that the step of providing a new estimate of the noise power and the speech power involves multiplying the current estimate by a first gain, said first gain preferably being determined using a priori knowledge. In this way, a priori knowledge may be used to improve the estimates. This a priori knowledge may be obtained during (off-line) test sessions or may be based on scientific assumptions on the noise properties. Alternatively, or additionally, the first gain may be updated on-line, while the method is being executed, to track the local statistics of the underlying "clean" speech.

It is further preferred that the step of providing a new estimate of the noise power and the speech power involves multiplying the difference by a second gain, said second gain preferably being determined using a priori knowledge.

As stated above, the method of the present invention is carried out per time segment and per frequency band. To achieve this, it is preferred that the speech signal is transformed using a short-time Fourier transform. By using a short-time Fourier transform (STFT), an efficient transformation may be achieved. In addition, the method may be carried out in the frequency domain. This preferred embodiment may be summarized as Kalman filtering in the frequency domain. It is noted that Kalman filtering is conventionally carried out in the time domain, not in the frequency domain. By using Kalman filtering in the frequency domain, a more efficient and effective estimation of noise properties in a speech signal may be obtained. In particular, no assumptions have to be made on noise properties, such as the typical (but often unrealistic) assumption that the noise is autoregressive.

The present invention further provides a speech enhancement method, comprising the steps of:
- estimating the power spectral density of noise in a speech signal as defined above, and
- using the estimated power spectral density to remove noise from the speech signal.

In an alternative embodiment, the amplitude of the noise is derived from the power spectral density. That is, the noise amplitude (the absolute value of the spectrum) instead of the power spectral density is estimated. Those skilled in the art will appreciate that the said amplitude is equal to the square root of the power spectral density.

The present invention additionally provides a computer program product for carrying out the method as defined above. A computer program product may comprise a set of computer executable instructions stored on a data carrier, such as a CD or a DVD. The set of computer executable instructions, which allow a programmable computer to carry out the method as defined above, may also be available for downloading from a remote server, for example via the Internet.

The present invention also provides a device for estimating the power spectral density of noise in a speech signal, the device comprising:
- means for determining the total power of the speech signal,
- means for determining the difference of said total power and a current estimate of said total power,
- means for providing a new estimate of the noise power and the speech power using said difference and a current estimate of the noise power and the speech power,
- means for deriving a new estimate of said total power from the new estimate of the noise power and the speech power,
wherein said means are arranged for estimating the power spectral density of noise per time segment and per frequency range.

In addition, the present invention provides a speech enhancement device, comprising:
- means for estimating the power spectral density of noise in a speech signal as defined above, and
- means for using the estimated power spectral density to remove noise from the speech signal.

The present invention will further be explained below with reference to exemplary embodiments illustrated in the accompanying drawings, in which:

Fig. 1 schematically shows a noise power estimating device according to the present invention.

Fig. 2 schematically shows a speech enhancement device according to the present invention.

The noise power estimating device 1 shown merely by way of non-limiting example in Fig. 1 comprises a squaring unit 11, a first combination unit 12, a (Kalman) gain unit 13, a second combination unit 14, a delay unit 15, a power factor unit 16, and a further gain unit 17.

The squaring unit 11 receives the absolute value (magnitude) |Y(k,m)| of the frequency spectrum Y(k,m) of a noisy speech signal y. The frequency spectrum Y(k,m) is a function of the frequency range (that is, frequency band or bin) represented by the frequency range index k and the time segment (that is, time frame or other time segment) represented by the time segment index m. That is, a frequency spectrum value Y(k,m) is made available for each frequency range k and each time segment m. The frequency spectrum value Y(k, m) or its absolute value |Y(k,m)| may be produced by a short-time Fourier transform (STFT) which is well known in the art.
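By way of non-limiting illustration, the magnitudes |Y(k,m)| could be obtained along the following lines. This is a minimal sketch only: the function name stft_magnitude, the Hann window, the frame length and the hop size are assumptions made for the example and are not taken from the present description, and a direct DFT is used in place of an FFT for brevity.

```python
import cmath
import math


def stft_magnitude(y, frame_len=8, hop=4):
    """Return frames[m][k] = |Y(k,m)| for a real-valued time signal y.

    frame_len, hop and the Hann window are illustrative assumptions.
    """
    # Periodic Hann window (assumed; the description does not name a window).
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    # Each start position yields one time segment m.
    for start in range(0, len(y) - frame_len + 1, hop):
        segment = [y[start + n] * window[n] for n in range(frame_len)]
        # Direct DFT: one complex value per frequency range k.
        spectrum = [
            sum(segment[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len))
            for k in range(frame_len)
        ]
        frames.append([abs(value) for value in spectrum])
    return frames
```

In a practical device an FFT routine would normally replace the direct DFT; the indexing by time segment m and frequency range k is unchanged.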

The squaring unit 11 outputs the total power |Y(k,m)|² of the frequency spectrum of the noisy speech signal, that is, the power of the speech signal including noise, for each frequency range and each time segment. The first combination unit 12 determines the difference of this power and the current estimate of the total power. This current estimate is produced by the power factor unit 16 on the basis of the noise power and the speech power, as will be explained later in more detail. The difference of the total power and its estimate is fed to the gain unit 13 where it is multiplied by a gain K. This gain K is the so-called Kalman gain, as will be explained later in more detail, and may be based upon a priori knowledge. The result of this multiplication is fed to the second combination unit 14, where it is added to the output of the unit 17. The result of this addition is the new estimate of the noise power and the speech power: x(k, m+1) or, more accurately, x̂(k, m+1), where the hat (^) indicates that the value is an estimate.

It is noted that x(k,m) is a vector which is indicative of the power spectrum of the noise and of the "clean" speech in a frequency range (or bin) k and in a time segment (or frame) m. This will later be explained in more detail. The delay unit 15 is coupled to the second combination unit 14 so as to produce a delayed version of the new estimate x(k, m+1), that is, the current estimate x(k, m). This current estimate is fed to the power factor unit 16 where it is multiplied by a factor C so as to produce the current estimate of the total power (spectrum) Cx(k,m). The current estimate x(k, m) of the noise power and the speech power is also fed to the gain unit 17 where it is multiplied by a factor A. The result of this multiplication is also fed to the second combination unit 14.
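The signal flow of Fig. 1 described above may be sketched as follows. This is an illustrative sketch under stated assumptions, not a definitive implementation: the function name track_noise_power is hypothetical, the estimate is held as a pair ordered (noise power, speech power), the factor C is taken as [1 1] so that the total power estimate is the sum of both entries, and the gain K is assumed to be fixed and pre-computed for each frequency range.

```python
def track_noise_power(magnitudes, a, k_gain, x0):
    """Run the Fig. 1 recursion over a magnitude spectrogram.

    magnitudes[m][k] -- |Y(k,m)| for time segment m and frequency range k
    a[k]             -- (a_n, a_s), diagonal of the factor A for band k
    k_gain[k]        -- (K_n, K_s), fixed gain K for band k (assumption)
    x0[k]            -- (noise power, speech power) initial estimate
    Returns estimates[m][k] = (noise power, speech power) after segment m.
    """
    x_hat = [tuple(x) for x in x0]  # contents of delay unit 15
    estimates = []
    for frame in magnitudes:
        new_x = []
        for k, magnitude in enumerate(frame):
            noise, speech = x_hat[k]
            total_power = magnitude * magnitude          # squaring unit 11
            predicted_total = noise + speech             # power factor unit 16
            difference = total_power - predicted_total   # combination unit 12
            a_n, a_s = a[k]
            k_n, k_s = k_gain[k]
            # Gain unit 17 (A), gain unit 13 (K) and combination unit 14.
            new_x.append((a_n * noise + k_n * difference,
                          a_s * speech + k_s * difference))
        x_hat = new_x  # the new estimate becomes the current estimate
        estimates.append(new_x)
    return estimates
```

The first entry of each returned pair is the noise power estimate for that frequency range and time segment; taken over all frequency ranges it forms the estimated noise power spectral density.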

The method of the present invention, as carried out by the device 1 of Fig. 1, can be expressed mathematically as follows. Assuming that the noisy speech signal y(n) is transformed by the short-time Fourier transform (STFT) to the "short-time" frequency domain (also called "time-frequency domain"), the resulting (total or combined) frequency spectrum may be written as:

Y(k,m) = S(k,m) + N(k,m) (1)

where Y, S and N denote STFT coefficients of the noisy speech signal, the speech and the noise respectively. As before, k denotes the frequency range index and m represents the time segment index.
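As a brief illustration of why the power of the noisy signal can be treated as the sum of a speech contribution and a noise contribution, the expected power implied by equation (1) may be expanded as follows. The assumption that S(k,m) and N(k,m) are zero-mean and mutually uncorrelated is made here for the purpose of the example and is not stated explicitly in the present description; λ_s(k,m) and λ_n(k,m) denote the speech variance and the noise variance.

```latex
\begin{aligned}
E\{|Y(k,m)|^2\}
  &= E\{|S(k,m)|^2\} + E\{|N(k,m)|^2\}
     + 2\,\mathrm{Re}\,E\{S(k,m)\,N^*(k,m)\} \\
  &= \lambda_s(k,m) + \lambda_n(k,m)
\end{aligned}
```

Under that assumption the cross term vanishes, so the expected total power equals the sum of the two variances.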

Starting from equation (1), the power spectrum of the signal y(n) may be modeled as:

$$|Y(k,m)|^2 = C\,x(k,m)\,e(k,m) \qquad (2)$$

where C is a vector [1 1], x(k,m) is a vector

$$x(k,m) = \begin{bmatrix} \lambda_n(k,m) \\ \lambda_s(k,m) \end{bmatrix}$$

in which λ_s(k,m) and λ_n(k,m) are the speech variance and the noise variance respectively, and e(k,m) is an exponentially distributed random variable. Accordingly, the Kalman filtering equations for the device 1 of Fig. 1 can be written as:

$$\hat{x}(k,m+1) = A(k)\,\hat{x}(k,m) + K(k,m)\,\bigl(|Y(k,m)|^2 - C\,\hat{x}(k,m)\bigr) \qquad (3)$$

where x̂ denotes the estimate of x, where A(k) is the matrix

$$A(k) = \begin{bmatrix} a_n(k) & 0 \\ 0 & a_s(k) \end{bmatrix}$$

and where K(k,m) is the Kalman gain for frequency range k and time segment m. The Kalman gain K(k,m) can be written as:

$$K(k,m) = A(k)\,Q_e(k,m)\,C^T \bigl(C\,(2Q_e(k,m) + Q(k,m))\,C^T\bigr)^{-1} \qquad (4)$$

where Q_e(k,m) and Q(k,m) define the variance of the estimation error. As noted above, however, the Kalman gain may be pre-computed, for example using a priori knowledge, or may be determined experimentally. The non-zero coefficients of A(k) may be determined experimentally, for example by numerical optimization for a large, "clean" speech sample. It has been found that a suitable value of a_n is 1, while suitable values for a_s(k) may be given by:
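The gain expression (4) may be evaluated as sketched below for a single frequency range and time segment. The sketch rests on assumptions: the function name kalman_gain is hypothetical, the state is ordered (noise, speech) so that the diagonal of A is (a_n, a_s), C is taken as [1 1], and the 2x2 matrices Q_e and Q are supplied by the caller, since the way they are propagated over time is not shown here.

```python
def kalman_gain(a_diag, q_e, q):
    """Evaluate K = A Q_e C^T (C (2 Q_e + Q) C^T)^-1 for C = [1 1].

    a_diag -- (a_n, a_s), the diagonal of A for this frequency range
    q_e, q -- 2x2 nested lists (assumed given by the caller)
    Returns (K_n, K_s), the gain applied to the power difference.
    """
    # Q_e C^T is the vector of row sums of Q_e because C = [1 1].
    qe_ct = (q_e[0][0] + q_e[0][1], q_e[1][0] + q_e[1][1])
    # C (2 Q_e + Q) C^T is a scalar: the sum of all entries of 2 Q_e + Q.
    denominator = sum(2.0 * q_e[i][j] + q[i][j]
                      for i in range(2) for j in range(2))
    # A is diagonal, so it scales each entry of Q_e C^T separately.
    return (a_diag[0] * qe_ct[0] / denominator,
            a_diag[1] * qe_ct[1] / denominator)
```

Because C has a single row, the term to be inverted is a scalar, so no matrix inversion is needed in this sketch.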

$$a_s(k) = 6.265 \cdot 10^{-6}\,k^2 - 1.9163 \cdot 10^{-3}\,k + 0.87941 \quad \text{for } 1 < k < L/2$$

$$a_s(k) = 6.265 \cdot 10^{-6}\,\eta(k)^2 - 1.9163 \cdot 10^{-3}\,\eta(k) + 0.87941 \quad \text{for } L/2 < k \le L$$

where η(k) = L - k + 1, with L indicating the number of frequency ranges, typically the number of FFT bins.

It will be understood that the above numerical values are exemplary only and that other numerical values can be used without departing from the scope of the present invention.
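The exemplary coefficients a_s(k) given above may be computed as in the following sketch. The function name a_s and the parameter name num_bins (standing for L) are chosen for the example. The index k is assumed to run from 1 to L, and the boundary values k = 1 and k = L/2, which the stated ranges leave open, are assigned to the first expression as an assumption.

```python
def a_s(k, num_bins):
    """Exemplary speech coefficient a_s(k) for a 1-based band index k.

    num_bins corresponds to L, the number of frequency ranges.
    """
    if not 1 <= k <= num_bins:
        raise ValueError("k must lie in 1..num_bins")
    # Upper half of the spectrum mirrors the lower half via eta(k) = L - k + 1.
    index = k if k <= num_bins // 2 else num_bins - k + 1
    return 6.265e-6 * index ** 2 - 1.9163e-3 * index + 0.87941
```

For example, with L = 256 the bands k = 2 and k = 255 receive the same coefficient, reflecting the symmetry of the spectrum of a real-valued signal.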

The speech enhancement device 10 of Fig. 2 comprises a short-time Fourier transform (STFT) unit 2, a noise power estimating unit 1 and a speech enhancement unit 3. The short-time Fourier transform (STFT) unit 2 receives a noisy signal y(n) which is a time signal, n being the sample number. The unit 2 transforms this time signal y(n) into a frequency spectrum Y(k,m) and its absolute value |Y(k,m)|, which are dependent on both time (the time segment or frame index m) and frequency (the frequency range or frequency bin index k). The absolute value |Y(k,m)| is then fed to the noise power estimating (NPE) unit 1, which preferably corresponds to the noise power estimating device 1 of Fig. 1. Both the estimated noise power and the speech signal y(n) are fed to the speech enhancement (SE) unit 3, which may apply a known speech enhancement algorithm, such as a short-time spectral amplitude (STSA) algorithm. This unit 3 outputs a spectral amplitude |S(k,m)| representing a speech signal from which the noise has been substantially removed. The spectral amplitude |S(k,m)| may be transformed into a speech time signal s(n) using means well known in the art, such as an inverse STFT.
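To illustrate how the speech enhancement unit 3 might use the estimated noise power, the following sketch applies a simple power spectral subtraction gain to the magnitudes |Y(k,m)|. This rule is merely a stand-in chosen for the example and is not the STSA algorithm referred to above; the function name enhance_magnitudes and the spectral floor parameter are likewise assumptions.

```python
import math


def enhance_magnitudes(magnitudes, noise_power, floor=0.05):
    """Attenuate each |Y(k,m)| according to the estimated noise power.

    magnitudes[m][k]  -- |Y(k,m)|
    noise_power[m][k] -- estimated noise power for the same m and k
    floor             -- smallest amplitude gain allowed (assumed value)
    Returns enhanced[m][k], an estimate of the clean spectral amplitude.
    """
    enhanced = []
    for frame, noise_frame in zip(magnitudes, noise_power):
        row = []
        for magnitude, noise in zip(frame, noise_frame):
            power = magnitude * magnitude
            if power > 0.0:
                # Fraction of the total power attributed to speech,
                # limited from below to avoid negative values.
                gain_sq = max(1.0 - noise / power, floor * floor)
            else:
                gain_sq = 0.0
            row.append(math.sqrt(gain_sq) * magnitude)
        enhanced.append(row)
    return enhanced
```

The enhanced amplitudes would then be combined with the phase of Y(k,m) and passed to an inverse STFT to obtain a time signal.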

It is noted that instead of the signal y(n), the absolute value |Y(k,m)| of the frequency spectrum may be fed to the speech enhancement (SE) unit 3, in which case the output of the STFT unit 2 is connected to the SE unit 3. The squaring (SQR) unit 11 shown in Fig. 1 may be omitted from the device 1 if the power spectrum |Y(k,m)|² is available. Instead, the squaring unit 11 may for example be part of the device 10 of Fig. 2.

It will be clear from the above description that the present invention provides both devices, as illustrated in Figs. 1 and 2, and methods, as carried out by the exemplary devices of Figs. 1 and 2. Accordingly, the present invention provides a method of estimating the power spectral density of noise in a speech signal, the method comprising the steps of:
- determining the total power of the speech signal, as determined in Fig. 1 by the squaring unit 11,
- determining the difference of said total power and a current estimate of said total power, as determined in Fig. 1 by the combination unit 12,
- providing a new estimate of the noise power and the speech power using said difference and a current estimate of the noise power and the speech power, as provided in Fig. 1 by the units 13, 14, 15 and 17, and
- deriving a new estimate of said total power from the new estimate of the noise power and the speech power, as derived in Fig. 1 by the unit 16,
wherein the above steps are carried out per time segment and per frequency range, as illustrated by the indices k and m.

The present invention is based upon the insight that Kalman filtering in the frequency domain provides an efficient way to estimate noise properties in a noisy speech signal. The present invention benefits from the further insight that a short-time Fourier transform is particularly suitable for pre-processing speech samples for Kalman filtering in the frequency domain.

It is noted that any terms used in this document should not be construed so as to limit the scope of the present invention. In particular, the words "comprise(s)" and "comprising" are not meant to exclude any elements not specifically stated. Single (circuit) elements may be substituted with multiple (circuit) elements or with their equivalents.

It will be understood by those skilled in the art that the present invention is not limited to the embodiments illustrated above and that many modifications and additions may be made without departing from the scope of the invention as defined in the appended claims.

Claims

1. A method of estimating the power spectral density of noise in a speech signal, the method comprising the steps of:
- determining the total power of the speech signal,
- determining the difference of said total power and a current estimate of said total power,
- providing a new estimate of the noise power and the speech power using said difference and a current estimate of the noise power and the speech power,
- deriving a new estimate of said total power from the new estimate of the noise power and the speech power,
wherein the above steps are carried out per time segment and per frequency range.

2. The method according to claim 1, wherein the step of providing a new estimate of the noise power and the speech power involves multiplying the current estimate by a first gain (A), said first gain preferably being determined using a priori knowledge.

3. The method according to claim 1, wherein the step of providing a new estimate of the noise power and the speech power involves multiplying said difference by a second gain (K), said second gain preferably being determined using a priori knowledge.

4. The method according to claim 1, wherein the speech signal is transformed using a short-time Fourier transform.

5. A speech enhancement method, comprising the steps of:
- estimating the power spectral density of noise in a speech signal in accordance with claim 1, and
- using the estimated power spectral density to remove noise from the speech signal.

6. The speech enhancement method according to claim 5, comprising the additional step of deriving the amplitude of the noise from its power spectral density.

7. A computer program product for carrying out the method according to claim 1 or claim 5.

8. A device for estimating the power spectral density of noise in a speech signal, the device comprising:
- means (11) for determining the total power of the speech signal,
- means (12) for determining the difference of said total power and a current estimate of said total power,
- means (13, 14, 15, 17) for providing a new estimate of the noise power and the speech power using said difference and a current estimate of the noise power and the speech power,
- means (16) for deriving a new estimate of said total power from the new estimate of the noise power and the speech power,
wherein said means are arranged for estimating the power spectral density of noise per time segment and per frequency range.

9. The device according to claim 8, wherein the means for providing a new estimate of the noise power and the speech power are arranged for multiplying the current estimate by a first gain (A), said first gain preferably being determined using a priori knowledge.

10. The device according to claim 8, wherein the means for providing a new estimate of the noise power and the speech power are arranged for multiplying the difference by a second gain (K), said second gain preferably being determined using a priori knowledge.

11. The device according to claim 8, arranged for transforming the speech signal using a short-time Fourier transform.

12. A speech enhancement device (10), comprising:
- means (1) for estimating the power spectral density of noise in a speech signal in accordance with claim 1, and
- means (3) for using the estimated power spectral density to remove noise from the speech signal.

13. The device according to claim 12, arranged for deriving the amplitude of the noise from its power spectral density.

14. A consumer device, comprising a device (1; 10) according to claim 8 or claim 12.