EP2363853A1 - A method for estimating the clean spectrum of a signal

Info

Publication number: EP2363853A1
Application number: EP10450036A
Authority: EP
Inventors: Luis Weruaga
Original assignee: Innovationsagentur GmbH; Osterreichische Akademie der Wissenschaften
Current assignee: Innovationsagentur GmbH; Osterreichische Akademie der Wissenschaften
Priority date: 2010-03-04
Filing date: 2010-03-04
Publication date: 2011-09-07

Abstract

The invention proposes a method for estimating the clean spectrum of a signal degraded by additive noise, in particular a speech signal, by determining the coefficients of a predictive model of said clean spectrum, comprising:
computing the spectrum of said signal;
estimating the power spectrum of said noise; and
determining said coefficients by minimizing the cost function

$\int_{2 π} \frac{| X (ω) |^{2}}{{|H (ω)|}^{2} + S_{V} (ω)} - \log \frac{| X (ω) |^{2}}{{|H (ω)|}^{2} + S_{V} (ω)} dω$

with respect to said coefficients, with
X (ω) being the spectrum of said signal,
S_v (ω) being the power spectrum of said noise, and
H(ω) being the transfer function of said model based on said coefficients.

Description

The present invention relates to a method for estimating the clean spectrum of a signal degraded by additive noise, in particular a speech signal, by determining the coefficients of a predictive model of said clean spectrum. The invention further relates to a method for enhancing a signal based on this clean spectrum estimation.
Restoration of single-channel digital audio recordings degraded by additive noise is a technical problem that currently attracts considerable interest from both scientific and commercial points of view. The enhancement of speech by digital signal processing means improves the quality and intelligibility of voice communication for a wide range of applications, such as mobile telephony, hearing aids, teleconference systems, dictation systems, voice coders and automatic speech recognition systems.
Among different solutions proposed for the enhancement of noisy speech, restoration of the short-time speech spectrum has been extensively studied, see e.g. Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator", IEEE Trans. Acoust., Speech, Signal Processing, Vol. 32, No. 6, pp. 1109-1121, 1984; B. Sim, Y. Tong, J. Chang, and C. Tan, "A parametric formulation of the generalized spectral subtraction method", IEEE Transactions on Speech and Audio Processing, Vol. 6, No. 4, pp. 328-337, Jul. 1998; P. J. Wolfe and S. J. Godsill, "Efficient alternatives to the Ephraim and Malah suppression rule for audio signal enhancement", EURASIP J. Applied Signal Processing, Vol. 2003, No. 10, pp. 1043-1051, 2003; and P. C. Loizou, "Speech enhancement: Theory and practice", CRC Press, 2007. This approach is based on estimation of the short-time spectral amplitude of the clean speech from an estimate of the signal-to-noise ratio (SNR) at each frequency. In other cases, the clean speech is assumed to follow a parametric model, such as an autoregressive (AR) model; upon the estimation of that model, an enhancement filter, such as the Wiener filter, is employed to enhance the noisy signal, see J. H. L. Hansen and M. A. Clements, "Constrained iterative speech enhancement with application to speech recognition", IEEE Trans. Signal Processing, Vol. 39, No. 4, pp. 795-805, Apr. 1991. In all cases, an accurate estimation of the power spectrum of the noise is required. This can be accomplished by several techniques, such as minimum statistics, tracking of the spectral floor, or by detecting silences in the speech activity (P. C. Loizou, loc.cit.).
The biggest technical challenge in this problem is thus to obtain the a priori signal-to-noise ratio at each frequency. Since the noise power spectrum is assumed to be available with state-of-the-art techniques, this challenge is equivalent to the estimation of the clean speech spectrum from the available noisy spectrum. This problem has occupied many researchers over the last twenty-five years: a decision-directed method (Y. Ephraim and D. Malah, loc.cit.), subspace methods (P. C. Loizou, loc.cit.), the iterative Wiener filter (J. H. L. Hansen and M. A. Clements, loc.cit.), or Kalman filters (see e.g. E. Zavarehei, S. Vaseghi, and Q. Yan, "Speech enhancement using Kalman filters for restoration of short-time DFT trajectories," IEEE Workshop Automatic Speech Recognition and Understanding, 2005, pp. 313-318) are some of the most popular techniques for this purpose. Among these techniques, the iterative Wiener filter (IWF) is particularly interesting because it aims to estimate the clean speech spectrum only from the current noisy spectrum, iteratively combining Wiener filtering with autoregressive analysis. The problem of that technique is its tendency to generate high resonance peaks, which introduces an unpleasant distortion in the enhanced speech. Further attempts at stabilizing the IWF have been made in T. V. Sreenivas and P. Kirnapure, "Codebook constrained Wiener filtering for speech enhancement," IEEE Trans. Speech, Audio Processing, Vol. 4, No. 5, pp. 383-389, Sep. 1996, but the performance of this technique and its variants is still clearly insufficient.
One possible solution to the problem of estimating parameters of a predictive clean speech model has been disclosed in the earlier application WO 2008/109904 A1 of the same applicant. This prior solution fails to estimate the clean speech spectrum in cases where the signal-to-noise ratio (SNR) of the signal is low. Likewise, the mentioned IWF method is not appropriate in such applications.
It is therefore an object of the invention to provide a method for estimating the clean spectrum of a noise-corrupted signal with improved accuracy.
This object is achieved by means of a method for estimating the clean spectrum of a signal degraded by additive noise, in particular a speech signal, by determining the coefficients of a predictive model of said clean spectrum, comprising:

computing the spectrum of said signal;
estimating the power spectrum of said noise; and
determining said coefficients by minimizing the cost function $\int_{2 π} \frac{| X (ω) |^{2}}{{|H (ω)|}^{2} + S_{V} (ω)} - \log \frac{| X (ω) |^{2}}{{|H (ω)|}^{2} + S_{V} (ω)} dω$

with respect to said coefficients, with
X(ω) being the spectrum of said signal,
S_v (ω) being the power spectrum of said noise, and
H(ω) being the transfer function of said model based on said coefficients.

In the present disclosure the term "minimizing" is intended to comprise both, making the cost function minimal as well as making the cost function at least a sufficiently low value, i.e. a value within a given or acceptable tolerance interval from that minimum.
From the field of bioacoustics it is known that the biological hearing sense responds to the logarithm of the sound intensity. The invention is based on the insight that this bio-acoustic principle of logarithmic sense can be introduced into a novel cost function as stated above which takes into account the actual signal-to-noise ratio in each portion of the signal spectrum. Loosely speaking the proposed cost function fits the model to the data for those regions with high SNR, and - as will be detailed later on - in low-SNR areas the fitting process is driven by the mentioned good fitting performance taking place on adjacent high-SNR areas. The inventive method thus leads to an interpolation effect from high-SNR to low-SNR spectral regions.
According to a preferred embodiment of the invention said cost function is minimized by solving the equation $\int_{- π}^{π} M (ω) E (ω) A (ω) e^{jωl} dω = 0$

with A(ω) being the predictive model based on its coefficients a_m according to $A (ω) = \sum_{m = 0}^{M} a_{m} e^{- jωm},$

E(ω) being the prediction error according to $E (ω) = | X (ω) |^{2} - {|H (ω)|}^{2} - S_{V} (ω),$

and M(ω) being a spectral mask defined as $M (ω) = {(\frac{SNR (ω)}{SNR (ω) + 1})}^{2}$ with $SNR (ω) = \frac{{|H (ω)|}^{2}}{S_{V} (ω)}.$
In particular said equation can be solved by holding E(ω) and M(ω) constant, solving the remaining linear problem, using the solution to re-evaluate the previous constant terms, and proceeding further iteratively.
In general, the method of the invention is suited for any predictive model known in the art. Preferably, a parametric all-pole filter model, an autoregressive coefficients filter (ARC) model, a reflection coefficients filter (RC) model, and/or a line spectral frequencies (LSF) model is used.
In a second aspect of the invention a method for enhancing a digital signal, in particular a speech signal, with increased quality is provided. The inventive method comprises the further steps of
calculating a spectral signal-to-noise ratio on the basis of the clean spectrum and the noise spectrum, and
using the spectral signal-to-noise ratio to enhance the signal.
Preferably, the signal is enhanced by means of a Wiener filter, a MMSE-based enhancement, or variants thereof, using said spectral signal-to-noise ratio.
Further details and advantages of the invention will become apparent from the appended claims and the following detailed description of a preferred embodiment under reference to the enclosed drawings in which

Fig. 1 shows in block diagram form an apparatus for enhancing a digital speech signal, the blocks concurrently illustrating the steps of the method of the invention, and
Fig. 2 shows the function of the clean speech estimation block and step of Fig. 1 in detail.

As a first basis of the present invention, the inventor has found out analytically that the IWF method is equivalent to a method that results from the following minimization problem $\underset{a}{\arg\min} \int_{2 π} \frac{| X (ω) |^{2}}{{|H (ω)|}^{2} + S_{V} (ω)} dω \qquad (1)$

where ω is frequency, X(ω) is the Fourier transform of a short-time segment of the input noisy signal, S_v (ω) is the estimate of the noise power spectral density, and H(ω) is the transfer function of the autoregressive model which relates to the clean speech spectrum. The said transfer function of the autoregressive model is equal to $H (ω) = \frac{1}{\sum_{m = 0}^{M} a_{m} e^{- jωm}} \qquad (2)$

where a = [a_0, a_1, ..., a_M] are the autoregressive coefficients, and M is the autoregressive model order.
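As an illustration, the model spectrum |H(ω)|² can be evaluated numerically by sampling A(ω) with a zero-padded FFT of the coefficient vector. The following Python sketch assumes NumPy and a uniform frequency grid; the function name `ar_power_spectrum`, the grid size `n_fft` and the example coefficients are hypothetical and not part of the original disclosure.

```python
import numpy as np

def ar_power_spectrum(a, n_fft=512):
    """Return |H(w)|^2 = 1 / |A(w)|^2 on n_fft uniform frequencies in [0, 2*pi)."""
    a = np.asarray(a, dtype=float)
    # A(w) = sum_m a_m exp(-j w m) is the DFT of the zero-padded coefficients.
    A = np.fft.fft(a, n_fft)
    return 1.0 / np.abs(A) ** 2

# Hypothetical first-order example: a = [1, -0.9] yields a low-pass spectrum
# (large value at w = 0, small value at w = pi).
S = ar_power_spectrum([1.0, -0.9], n_fft=8)
```

Index 0 of the result corresponds to ω = 0 and index `n_fft // 2` to ω = π.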
As a second basis of the invention, the inventor has found out that a functional built on the ratio between the samples and the model, such as in (1), does not possess the property of frequency selectivity, although such a property would be desirable when not all spectral samples are available: in the case of the spectrum of the noisy signal X(ω), the spectral samples at which the a priori SNR is low or very low do not represent a trustworthy reference for the estimation of the autoregressive model.
To this end, the method of the present invention for estimating the clean speech spectrum is based on maximum likelihood (ML) estimation applied to the ratio between the input noisy spectrum X(ω) and the model of clean speech corrupted by additive noise. Assuming that X(ω) is modelled by a Gaussian distribution, said maximum likelihood estimation turns out to be $\underset{a}{\arg\min} \int_{2 π} \frac{| X (ω) |^{2}}{{|H (ω)|}^{2} + S_{V} (ω)} - \log \frac{| X (ω) |^{2}}{{|H (ω)|}^{2} + S_{V} (ω)} dω \qquad (3)$

where the clean speech follows the autoregressive model defined in (2), a is the vector containing the autoregressive coefficients, and S_v (ω) is the power spectral density of the noise which is available a priori.
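To make the functional (3) concrete, it can be approximated by a Riemann sum over a uniform DFT grid. The sketch below assumes NumPy, strictly positive `X_pow` (the sampled |X(ω)|²) and `S_v` arrays of equal length; the name `ml_cost` and the example values are hypothetical. Because r − log r attains its minimum value 1 at r = 1, the discretized cost cannot fall below 2π, and it reaches that value when the model plus noise matches the noisy power spectrum exactly.

```python
import numpy as np

def ml_cost(a, X_pow, S_v):
    """Riemann-sum approximation of the cost functional over one period 2*pi."""
    n_fft = len(X_pow)
    A = np.fft.fft(np.asarray(a, dtype=float), n_fft)
    P = 1.0 / np.abs(A) ** 2 + S_v      # model of clean speech plus noise
    r = X_pow / P                       # ratio between data and model
    return 2.0 * np.pi * np.mean(r - np.log(r))

# Hypothetical synthetic data generated from a known model.
a_true = np.array([2.0, -1.2, 0.5])
S_v = np.full(64, 0.05)
X_pow = 1.0 / np.abs(np.fft.fft(a_true, 64)) ** 2 + S_v

cost_at_truth = ml_cost(a_true, X_pow, S_v)            # lower bound 2*pi
cost_elsewhere = ml_cost([1.0, 0.0, 0.0], X_pow, S_v)  # strictly larger
```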
By computing the gradient of the functional (3) with respect to the autoregressive coefficients a, one gets to the solution of this problem, given by the following equation $\int_{- π}^{π} M (ω) E (ω) A (ω) e^{jωl} dω = 0 \qquad (4)$

where A(ω) is the linear prediction error filter, defined in terms of the autoregressive coefficients as $A (ω) = \sum_{m = 0}^{M} a_{m} e^{- jωm} \qquad (5)$

E(ω) is the prediction error of the model according to $E (ω) = | X (ω) |^{2} - {|H (ω)|}^{2} - S_{V} (ω) \qquad (6)$

and M(ω) is a so-called spectral mask defined as $M (ω) = {(\frac{SNR (ω)}{SNR (ω) + 1})}^{2} \qquad (7)$
Here, the spectral mask is defined in terms of the a-priori signal-to-noise ratio for each frequency ("spectral" signal-to-noise ratio), SNR(ω). The a-priori (spectral) SNR is defined as the ratio between the clean speech power spectrum and the noise power spectrum, $SNR (ω) = \frac{{|H (ω)|}^{2}}{S_{V} (ω)}$
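The origin of the mask can be traced with a short calculation. This is a sketch rather than the derivation given by the inventor; it assumes real coefficients a_l and uses the shorthand P(ω) = |H(ω)|² + S_V(ω), which is introduced here only for illustration. The integrand of (3) depends on the model only through P, and $\frac{\partial}{\partial P} \left( \frac{| X |^{2}}{P} - \log \frac{| X |^{2}}{P} \right) = - \frac{| X |^{2} - P}{P^{2}} = - \frac{E}{P^{2}}$
Since |H|² = 1/|A|², the model spectrum varies with each coefficient as $\frac{\partial {|H|}^{2}}{\partial a_{l}} = - {|H|}^{4} \frac{\partial {|A|}^{2}}{\partial a_{l}} = - 2 {|H|}^{4} \, \mathrm{Re} \{ A e^{jωl} \}$
The chain rule then gives a gradient proportional to $\int_{- π}^{π} \frac{{|H (ω)|}^{4}}{{({|H (ω)|}^{2} + S_{V} (ω))}^{2}} E (ω) \, \mathrm{Re} \{ A (ω) e^{jωl} \} dω$
Dividing numerator and denominator of the first factor by S_V² turns it into (SNR/(SNR + 1))², i.e. the mask M(ω). For real-valued signals the integrand at −ω is the complex conjugate of the integrand at ω, so the integral is real and the real-part operator can be dropped, which recovers the form of equation (4).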
Since equation (4) is nonlinear with respect to the autoregressive coefficients, its solution must and can be obtained by means of an iterative procedure, in which at each iteration a positive-definite Toeplitz linear system must be solved. Several techniques are available to solve Toeplitz systems, such as the well-known Levinson algorithm. One skilled in the art will immediately recognize that this choice does not affect the essence of the present invention. It is important to mention that the spectral mask (7) weights the importance of the spectral error between the noisy samples and the model of clean speech plus additive noise. This weight at each frequency depends on the respective signal-to-noise ratio. Thus, if the SNR is high at a given frequency, the spectral mask is close to 1 at that frequency, and the information at that frequency is valuable in the estimation. On the contrary, if the SNR is low, the spectral mask tends to zero, which implies that the relevance of the information at the frequency is low.
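The weighting behaviour described above can be checked numerically. The following Python sketch assumes NumPy; the function name `spectral_mask` and the chosen SNR values are hypothetical and serve only to illustrate the formula of the mask.

```python
import numpy as np

def spectral_mask(snr):
    """Mask (SNR / (SNR + 1))^2 for a linear (not dB) a-priori SNR."""
    snr = np.asarray(snr, dtype=float)
    return (snr / (snr + 1.0)) ** 2

# Evaluate the mask at a low, a unit and a high SNR (given in dB for readability).
for snr_db in (-20, 0, 20):
    snr = 10.0 ** (snr_db / 10.0)
    print(snr_db, "dB ->", float(spectral_mask(snr)))
```

At 0 dB (SNR = 1) the mask equals 0.25; it approaches one for large SNR and zero for small SNR.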
The spectral mask, the signal-to-noise ratio, and therewith the clean speech model are estimated in an iterative fashion. The final solution is obtained either after several iterations or when successive partial solutions do not differ from each other substantially.
One iterative approach to solve equation (4) will be discussed in detail. This approach is based on holding E(ω) and M(ω) constant and solving the remaining linear problem; this partial solution is then used to re-evaluate the previously constant terms, and the procedure is repeated iteratively. Thus, the linear residue filter A(ω) is obtained with the following iterative algorithm $S_{X, ω}^{(κ)} = \frac{1}{{|A_{ω}^{(κ)}|}^{2}} \qquad (8a)$
$ξ_{ω}^{(κ)} = \frac{S_{X, ω}^{(κ)}}{S_{V, ω}} \qquad (8b)$
$M_{ω}^{(κ)} = {(\frac{ξ_{ω}^{(κ)}}{ξ_{ω}^{(κ)} + 1})}^{2} \qquad (8c)$
$h_{l}^{(κ)} = \int_{\langle M^{(κ)} \rangle} \frac{e^{jωl}}{A_{ω}^{(κ) *}} dω \qquad (8d)$
$g_{l}^{(κ)} = \int_{\langle M^{(κ)} \rangle} S_{V, ω} A_{ω}^{(κ)} e^{jωl} dω \qquad (8e)$
$\int_{\langle M^{(κ)} \rangle} {|X_{ω}|}^{2} A_{ω}^{(κ + 1)} e^{jωl} dω = h_{l}^{(κ)} + g_{l}^{(κ)} \qquad (8f)$

for l = 0, 1, ..., M, where the superscript (κ) denotes the iteration and the superscript * the complex conjugate. The noise-subtracted power spectrum can be used as initial seed, i.e., $S_{X, ω}^{(1)} = {|{|X_{ω}|}^{2} - S_{V, ω}|}_{\epsilon}$
The notation in the integrals refers to $\int_{\langle M^{(κ)} \rangle} \cdot \; dω \equiv \int_{- π}^{π} M_{ω}^{(κ)} \cdot \; dω$

where $M_{ω}^{(κ)}$ is the spectral weight (mask) M(ω) at the κ-th iteration. Since the spectral weight is present in all terms of the inverse problem (8f), its effect is that of weighting the relevance of the spectral samples. The magnitude of the weight depends on the local SNR ξ_ω, such that in areas with high SNR (ξ_ω ≫ 1) the spectral weight tends to one, while in low-SNR areas (ξ_ω ≤ 1) it tends to zero. Note for comparison that in the noiseless case the spectral weight becomes one for all frequencies, meaning that the noiseless case does not require spectral selectivity.
Finally, the step (8f) is a linear inverse problem involving a positive-semidefinite symmetric Toeplitz system. Thus, it can be efficiently solved with the Levinson algorithm or any other algorithm to solve Toeplitz systems.
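One pass of the iteration, ending in the linear problem (8f), might be sketched in Python as follows. This is an illustrative discretization under stated assumptions, not the applicant's implementation: the integrals are replaced by inverse DFTs on a uniform grid, the signal is assumed real so that all spectra are symmetric, and a general dense solver stands in for the Levinson algorithm. The function name `masked_update` and the synthetic data are hypothetical, and no claim about convergence from an arbitrary starting point is made here.

```python
import numpy as np

def masked_update(a, X_pow, S_v):
    """One pass of the six iteration steps on a uniform DFT grid; returns new a."""
    a = np.asarray(a, dtype=float)
    n_fft, order = len(X_pow), len(a) - 1
    A = np.fft.fft(a, n_fft)                    # A(w) sampled on the grid
    S_x = 1.0 / np.abs(A) ** 2                  # clean-spectrum model 1/|A|^2
    xi = S_x / S_v                              # a-priori spectral SNR
    mask = (xi / (xi + 1.0)) ** 2               # spectral mask
    # The inverse DFT plays the role of the integrals weighted by exp(j w l).
    h = np.fft.ifft(mask / np.conj(A)).real[: order + 1]
    g = np.fft.ifft(mask * S_v * A).real[: order + 1]
    r = np.fft.ifft(mask * X_pow).real[: order + 1]   # masked autocorrelation
    lags = np.abs(np.subtract.outer(np.arange(order + 1), np.arange(order + 1)))
    T = r[lags]                                 # symmetric Toeplitz matrix
    return np.linalg.solve(T, h + g)            # a Levinson solver could be used

# Hypothetical consistency check: if the noisy power spectrum is exactly the
# model plus noise, the generating coefficients are a fixed point of the update.
a_true = np.array([2.0, -1.2, 0.5])
S_v = np.full(128, 0.05)
X_pow = 1.0 / np.abs(np.fft.fft(a_true, 128)) ** 2 + S_v
a_next = masked_update(a_true, X_pow, S_v)
```

In practice the function would be called repeatedly, feeding each result back in, until successive coefficient vectors no longer differ substantially.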
Fig. 1 shows in a simplified fashion the processing-block diagram of a speech enhancement front-end (apparatus 100) that uses the method of the present invention. Fig. 2 shows the function of the clean speech estimation step (block 40) of Fig. 1 in detail.
Block 10 performs the usual segmentation of the input digital signal into segments.
Block 20 performs the spectral transformation of said segment. Said spectral transformation may be, for example, the "Discrete Fourier Transform", the "Discrete Sine Transform" and/or the "Fan-Chirp Transform", among other popular choices.
Block 30 carries out the estimation of the power spectrum of the noise according to known ad-hoc techniques. It is assumed that this block has memory facilities such that the spectra of the previous segments are stored therein. Therefore, if required, the estimation of the noise power spectrum can be performed by statistical methods over spectral data stretching over a reasonably long time span.
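As one simple example of such a statistical method over stored spectra, the spectral floor can be tracked by taking the per-frequency minimum of the power spectra of recent segments. The Python sketch below assumes NumPy; the function name `noise_floor` and the `history` parameter are hypothetical, and the estimator is deliberately crude: the minimum underestimates the mean noise power, a bias that practical minimum-statistics methods compensate for.

```python
import numpy as np

def noise_floor(power_frames, history=50):
    """Per-frequency minimum over the last `history` stored power spectra.

    power_frames: array-like of shape (num_segments, num_frequencies).
    """
    power_frames = np.asarray(power_frames, dtype=float)
    recent = power_frames[-history:]      # memory of previous segments
    return recent.min(axis=0)

# Hypothetical stored spectra of three segments with two frequency bins each.
frames = [[1.0, 5.0], [2.0, 3.0], [4.0, 4.0]]
floor_all = noise_floor(frames)
floor_last_two = noise_floor(frames, history=2)
```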
Block 40 carries out the estimation of the clean speech model from the spectrum of the segment and the estimation of the noise power spectrum. The estimation of the clean speech model is based on the numerical implementation of the minimization problem (3), which represents the core method of the present invention.
Block 50 computes numerically the signal-to-noise ratio for each frequency (spectral signal-to-noise ratio) from the estimated clean speech model and noise model.
Block 60 enhances the spectrum of the input signal by means of state-of-the-art techniques that require the signal-to-noise ratio for each frequency. Among these techniques, we can cite the Wiener filter and its variants, e.g. the square root of the Wiener filter, and the minimum-mean-square-error (MMSE)-based enhancement (see Y. Ephraim and D. Malah, loc.cit.) and its variants, e.g. the log-MMSE, etc. (see P. J. Wolfe and S. J. Godsill, loc.cit.).
Block 70 performs the inverse spectral transformation to block 20. The output of block 70 is the enhanced segment of the audio signal.
Although all processor blocks of the apparatus 100 operate with time-discrete and frequency-discrete samples, for the sake of clarity the mathematical description of the invention has been given in continuous frequency. One skilled in the art will immediately recognize that this choice does not affect the essence of the present invention.
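In discrete form, the chain of blocks 50 to 70 for a single segment could look like the following Python sketch. It assumes NumPy, the DFT as spectral transformation, and the standard Wiener gain SNR/(SNR + 1); the function names and the arguments `clean_pow` (the sampled |H(ω)|² delivered by block 40) and `noise_pow` (the sampled noise power spectrum from block 30) are hypothetical.

```python
import numpy as np

def wiener_gain(snr):
    """Standard Wiener gain for a linear spectral SNR."""
    snr = np.asarray(snr, dtype=float)
    return snr / (snr + 1.0)

def enhance_segment(x, clean_pow, noise_pow):
    """Enhance one real-valued segment given model and noise power spectra."""
    X = np.fft.fft(x)                    # block 20: spectral transformation
    snr = clean_pow / noise_pow          # block 50: spectral SNR
    Y = wiener_gain(snr) * X             # block 60: enhancement
    return np.fft.ifft(Y).real           # block 70: inverse transformation

# Hypothetical sanity check: with a very high SNR the segment passes through
# almost unchanged.
x = np.sin(0.3 * np.arange(16))
y = enhance_segment(x, np.full(16, 1e12), np.ones(16))
```

Both power spectra must be sampled on the same frequency grid as the DFT of the segment and be symmetric for the output to remain real-valued.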

Claims

1. A method for estimating the clean spectrum of a signal degraded by additive noise, in particular a speech signal, by determining the coefficients of a predictive model of said clean spectrum, comprising:
computing the spectrum of said signal;
estimating the power spectrum of said noise; and
determining said coefficients by minimizing the cost function $\int_{2 π} \frac{| X (ω) |^{2}}{{|H (ω)|}^{2} + S_{V} (ω)} - \log \frac{| X (ω) |^{2}}{{|H (ω)|}^{2} + S_{V} (ω)} dω$
with respect to said coefficients, with
X(ω) being the spectrum of said signal,
S_v (ω) being the power spectrum of said noise, and
H(ω) being the transfer function of said model based on said coefficients.
2. The method of claim 1, wherein said cost function is minimized by solving the equation $\int_{- π}^{π} M (ω) E (ω) A (ω) e^{jωl} dω = 0$
with A(ω) being the predictive model based on its coefficients a_m according to $A (ω) = \sum_{m = 0}^{M} a_{m} e^{- jωm},$
E(ω) being the prediction error according to $E (ω) = | X (ω) |^{2} - {|H (ω)|}^{2} - S_{V} (ω),$
and M(ω) being a spectral mask defined as $M (ω) = {(\frac{SNR (ω)}{SNR (ω) + 1})}^{2}$ with $SNR (ω) = \frac{{|H (ω)|}^{2}}{S_{V} (ω)}.$
3. The method of claim 2, wherein said equation is solved by holding E(ω) and M(ω) constant, solving the remaining linear problem, using the solution to re-evaluate the previous constant terms, and proceeding further iteratively.
4. The method of any of the claims 1 to 3, wherein said predictive model is a parametric all-pole filter model, an autoregressive coefficients filter (ARC) model, a reflection coefficients filter (RC) model, and/or a line spectral frequencies (LSF) model.
5. The method of any of the claims 1 to 4, further for enhancing the signal, comprising the further steps of
calculating a spectral signal-to-noise ratio on the basis of the clean spectrum and the noise spectrum, and
using the spectral signal-to-noise ratio to enhance the signal.
6. The method of claim 5, wherein the signal is enhanced by means of a Wiener filter, a MMSE-based enhancement, or variants thereof, using said spectral signal-to-noise ratio.