WO2016050725A1

WO2016050725A1 - Method and apparatus for speech enhancement based on source separation

Info

Publication number: WO2016050725A1
Application number: PCT/EP2015/072344
Authority: WO
Inventors: Dalia ELBADAWY; Alexey Ozerov; Quang Khanh Ngoc DUONG
Original assignee: Thomson Licensing
Priority date: 2014-09-30
Filing date: 2015-09-29
Publication date: 2016-04-07
Also published as: TW201614641A

Abstract

The present embodiments provide speech enhancement based on source separation techniques. Specifically, we use a universal spectral model for speech, and train the spectral model for noise and activations for speech/noise based on the universal spectral model for speech and input noisy speech. We formulate the optimization problem using a cost function that includes a divergence function and a sparsity penalty function, wherein the penalty function is based on the notion of relative group sparsity. The sparsity penalty function includes two parts: a sparsity-promoting part for the groups (activations for some groups become zero) and an anti-sparsity-promoting part for the whole activation matrix corresponding to the speech model (i.e., the activations for speech as a whole does not become zero). Based on the universal spectral model for speech, the spectral model for noise, and activations for speech/noise, we can estimate the speech/noise included in the input noisy speech.

Description

Method and Apparatus for Speech Enhancement Based on Source

Separation

TECHNICAL FIELD [1] This invention relates to a method and an apparatus for speech enhancement, and more particularly, to a method and an apparatus for speech enhancement based on audio source separation technique.

BACKGROUND

[2] Speech enhancement, or speech denoising, plays a key role in many applications such as telephone communication, robotics, and automatic speech recognition systems.

Numerous speech enhancement techniques have been developed such as those based on beamforming approaches or noise suppression algorithms. There also exist work in applying source separation for speech enhancement.

SUMMARY [3] The present principles provide a method for processing an audio signal, comprising: accessing a universal spectral model for speech; determining a spectral model for noise included in the audio signal, responsive to the audio signal and the universal spectral model for speech; determining a first set of time activations corresponding to the spectral model for speech, responsive to the audio signal and the universal spectral model for speech; estimating a speech included in the audio signal responsive to the universal spectral model for speech and the first set of time activations; and providing the estimated speech as output. The present principles also provide an apparatus for performing these steps.

[4] The present principles provide a method for processing an audio signal, comprising: accessing a universal spectral model for speech; determining a spectral model for noise included in the audio signal, responsive to the audio signal and the universal spectral model for speech; determining a first set of time activations corresponding to the spectral model for speech, responsive to the audio signal and the universal spectral model for speech; estimating a speech included in the audio signal responsive to the universal spectral model for speech and the first set of time activations; determining a second set of time activations

corresponding to the spectral model for noise, responsive to the audio signal and the universal spectral model for speech; estimating the noise included in the audio signal responsive to the spectral model for noise and the second set of time activations; and providing the noise and the estimated speech as output. The present principles also provide an apparatus for performing these steps.

[5] The present principles also provide a method for processing an audio signal, comprising: accessing a universal spectral model for speech; determining a spectral model for noise included in the audio signal, responsive to the audio signal and the universal spectral model for speech; determining a first set of time activations corresponding to the spectral model for speech, responsive to the audio signal and the universal spectral model for speech, wherein at least one of the determining a spectral model for noise and the determining a first set of time activations is responsive to a cost function, wherein the cost function includes a sparsity penalty on the first set of time activations, and wherein the sparsity penalty is responsive to a ratio between a norm of a subset of the first set of time activations and a norm of the first set of time activations; estimating a speech included in the audio signal responsive to the universal spectral model for speech and the first set of time activations; and providing the estimated speech as output. The present principles also provide an apparatus for performing these steps. [6] The present principles also provide a computer readable storage medium having stored thereon instructions for processing an audio signal according to the methods described above.

BRIEF DESCRIPTION OF THE DRAWINGS [7] FIG. 1 is a pictorial example illustrating an exemplary overview of speech

enhancement according to an embodiment of the present principles.

[8] FIG. 2 is a flow diagram depicting an exemplary method for speech enhancement based on source separation, according to an embodiment of the present principles.

[9] FIG. 3 is a pictorial example illustrating an example where a spectrogram. V is decomposed into two matrices W and //.

[10] FIG. 4 is a pictorial example illustrating that the activations corresponding to the universal speech model part entirely converge to zero while noise spectral model is updated, when using a prior art optimization function.

[11] FIG. 5 is a pictorial example illustrating one example of decomposing the

spectrogram based on block sparsity, according to an embodiment of the present principles.

[12] FIG. 6 is a pictorial example illustrating an estimated activation matrix obtained by an optimization scheme based on component sparsity, according to an embodiment of the present principles.

[13] FIG. 7 is a pictorial example illustrating one example of decomposing the

spectrogram based on both block sparsity and component sparsity, according to an embodiment of the present principles.

[14] FIG. 8 is a block diagram depicting an exemplary speech system, in accordance with an embodiment of the present principles. DETAILED DESCRIPTION

[15] When applying source separation for speech enhancement or denoising, in order to separate speech from noise, relevant training data is usually needed to first learn the spectral characteristics of the speech and/or of the particular noise. Such a class of supervised audio source separation algorithms is mostly based on non-negative matrix factorization (NMF), or its probabilistic formulation known as probabilistic latent component analysis (PLC A). In case of NMF model, the input spectrogram (or magnitude) matrix V (a time-frequency representation of the input mixture signal) is factorized into two matrices as V = WH, where W and // can be interpreted as the latent spectral features and the activations of those features in the signal, respectively. When the input is a mixture of two sources, we may write matri W = [W_{l r} W₂], where the matri W contains spectral components of, for example, source 1 - speech (M^) and source 2 - noise (W₂), and H = [ /j ; H₂], where //j and H₂ are matrices representing time activations, which indicate whether a spectral component is active or not at each time index and can be considered as weighting the contribution of spectral components in the universal speech model to the speech spectrogram, corresponding to W_x and W₂, respectively. Once the decomposition is obtained, the spectral power of source 1 is estimated as V₁ = IV, //₃ , and the spectral power of source 2 as V₂ = W₂H₂. [16] In an article by D. L. Sun and G. J. Mysore, entitled "Universal speech models for speaker independent single channel source separation," IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), May 2013 (hereinafter "Sun"), a universal speech spectral model was employed as well as a pre-learned noise spectral model. However, using a pre-learned noise model requires training data for noise, and the model may not be representative for other types of noise which are not included in the training data.

[17] In an article by N. Mohammadiha, P. Smaragdis, and A. Leijon, entitled "Supervised and unsupervised speech enhancement using nonnegative matrix factorization," IEEE Transactions on Audio, Speech, and Language Processing, 2013, a method that allows to learn the noise model online is used. However, the method uses hidden Markov models (HMM) in combination with Bayesian formulation of NMF, which may be sensitive to the parameter initialization and is different from the present embodiments that use a universal speech model.

[18] A commonly owned EP application (EP14305712.3, Attorney Docket No.

PF140127, hereinafter "Duong"), entitled "Method and system of on-the-fly audio source separation" by the inventors of the present application, the teachings of which are specifically incorporated herein by reference, discloses a method and apparatus for a combined text-and- example based approach for audio source separation, wherein a universal spectral model for each source is learned in advance. The universal noise model is learned through user guidance. Specifically, the noise type is determined by a user and then a corresponding universal spectral model is learned in advance by retrieved noise examples. While in the present embodiments, the noise model (non-universal) is estimated directly from noisy signals.

[19] The present principles are directed to speech enhancement based on source separation technique, which decomposes an audio mixture into constituent sound sources. In one embodiment, we use a universal spectral model for speech, and learn the spectral model for noise from the input signal. In general, the speech enhancement would improve the perceptual quality of the speech.

[20] FIG. 1 illustrates an exemplary overview of speech enhancement according to an embodiment of the present principles. We employ a universal spectral model for speech, trained from n clean speech examples. A universal speech model contains an

overcomplete dictionary of spectral characteristics of speech learned from different speakers. To train the universal speech model from the clean speech examples, clean speech example i is used to learn spectral model W . Then the universal speech model is constructed by concatenating the learned models: W_speech = \W_X W₂ ... W_n \ . Amplitude no m al i zat i o n can be applied to ensure that different speech examples have similar energy level.

[21] As illustrated in FIG. 1, given the universal speech model, for an input noisy speech (also referred to as "audio mixture" as the noisy speech is a mixture of noise and speech), the noise spectral model (W_noise) is learned automatically in the source separation algorithm, as well as the activations for speech (H_speech) and noise (H_noise). Sparsity constraints on the activation matrix H_speech is used to enforce the selection of only few representative spectral components learned from all training examples. Based on the spectral models and activations, the speech contained in the input noisy speech may be estimated, for example, using the estimated speech magnitude spectrogram (W_speechH_speech), and the noise contained in the in ut noisy speech may be estimated, for example, using the estimated noise magnitude spectrogram (W_noiseH_noise). Using Wiener filtering, the estimated speech/noise Short Time Fourier Transform (STFT) coefficients can be obtained, then the estimated time domain signals for speech and noise can be obtained via inverse Short Time Fourier Transform (IS TFT). Because the noise can be removed, the output largely contains speech only and thus enhances the perceptual quality over the input noisy speech.

[22] FIG. 2 illustrates an exemplary method 200 for speech enhancement based on source separation according to an embodiment of the present principles. Method 200 can be used for source separation as described in FIG. 1. Method 200 starts at initialization step 210. At the initialization, the audio mixture is input, and it may also accept some parameter values used in the universal model training and/or source separation process from a user. In addition, it may train a universal speech model based on training examples, or it may accept a universal speech model as input. At step 220, the audio mixture is transformed via Short- time Fourier Transform (STFT) into a time-frequency representation known as the spectrogram (denoted as matrix V). Note that V can be, for example, power (square magnitude) or magnitude of the STFT coefficients.

[23] Using the universal speech model, the spectrogram is used to estimate the noise spectral model and activations for speech and noise at step 230, wherein the speech spectral model is used to guide the estimation (i.e., the speech part of the spectral model W is known and does not change during the estimation process). Once the noise model and activations are estimated, the STFT coeffiences of the speech signal, and optionally of the noise, can be reconstructed by Wiener filtering at step 240. Inverse STFT is performed to obtain the time- domain signal of the estimated speech and/or noise. [24] In the following, the step of estimating activations and the noise spectral model (230) is described in further detail.

[25] Estimating activations and noise spectral model

[26] The non-negative spectrogram matrix V of dimension FxN is to be decomposed into two non-negative matrices, W (the spectral model of dimension FxK) and H (time activations of dimension KxN), such that V « V = WH. In this formulation, F denotes the total number of frequency bins, N denotes the number of time frames, and K denotes the number of spectral components, wherein a spectral component corresponds to a column in the matrix W and represents a latent spectral characteristic. FIG. 3 provides an example where a spectrogram V is decomposed into two matrices W and // . [27] In our context, W includes two parts: W= [W_speech, W_noise], where W_speech is the universal speech model, and W_noise is the noise model which is unknown in advance.

Similarly, the activation matrix / also includes two parts: H=[H_speech; H_noise] , where

^speech corresponds to speech and H_noise corresponds to noise, [28] In one embodiment, we consider sparsity constraints on the speech activations

H speech - Mathematically, the activation matrix is estimated by solving the following optimization problem that includes a divergence function and a sparsity penalty function:

min D (V\ WH) + A (H_speech) (1) where D (V \ WH ) = ∑ = i . d (i _n | (vv/i) .„ ), /^" indexes the frequency bin, n indexes the the time frame, Vj_n indicates an element in the /-th row and n-th column of the

spectrogram., d(.\.) is a divergence function, and A is a weighting factor for the penalty function Ψ(, ) and controls how much we want to emphasize sparsity of H_speech during optimization. Possible divergences include, for example, the Itakura-Saito divergence (IS divergence), Euclidean distance, and Kullback-Leibler divergence, [29] Using a penalty function in the optimization problem is motivated by the fact that if some of the speech examples used to train the universal speech model are more representative of the speech contained in the audio mixture more than others, then it may be better to use only these more representative ("good") examples. Also, some spectral components in the universal speech model may be more representative for spectral characteristics of the speech in the audio mixture, and it may be better to use only these more representative ("good") spectral components. The purpose of the penalty function is to enforce the activation of "good" examples or components, and force the activations corresponding to other examples and/or components to zero. [30] Consequently, the penalty function results in a sparse matrix H_speech where some groups in H_speech are set to zero. In the present application, we use a group to generalize the subset of elements in the speech model which are affected by the sparsity constraint. For example, when the sparsity constraint is applied on a block basis, a group corresponds to a block (a consecutive number of rows) in the matri H_speech which in tu n corresponds to activations of one clean speech example used to train the universal speech model. When the sparsity constraint is applied on a spectral component basis, a group corresponds to a row in the matrix H_speech which in turn corresponds to the activation of one spectral component (a column in W) in the universal speech model. In another embodiment, a group can be a column in H_speech which corresponds to the activation of one frame (audio window) in the input spectrogram.

[31] An iterative algorithm with multiplicative updates may be used to solve the optimization problem.. Table 1 illustrates an exemplary algorithm (Algorithm 1) to solve the optimization problem, where H(_5fcn) represents the group (sub-matrix ) of H such that matrix element h_kn £ H_{(gkn )} , 0 denotes the element- wise Hadamard product, K_speech is the number of rows in H_speech, and e, p and q are constants. In Algorithm 1, H and

W noise ^are initialized randomly. In other embodiments, they can be initialized in other manners. Note that the speech spectral model W_speech is fixed while W_noise is updated.

Algorithm 1 NMF with relative group sparsity (IS divergence is used) Input: V, W_speech, λ

OutputrH, W_noise

Initialize H randomly

Initialize W_noise randomly

until convergence

[32] In Algorithm 1,

w^rv-¹ + λΡ J

W 'noi^■se " ' noise I ,- _ ,„ 7· I

V ^V "noise /

where P and Q are matrices of the same size as // and are used to enforce the penalty on H speech- While we have a model for speech (W_speech), we need to leam a model for noise. In one embodiment, we randomly initialize W_noise and set W =[W_speech W_noise] .

[33] In our previous work, as described in the Duong reference, the log// , norm, is used as a penalty function. For one exemplary audio mixture, applying the log/ _x norm (i.e.,

Ψ(// ) =∑ -i log + W^) and other configurations in the Duong reference to the optimization problem (1), the activation corresponding to the universal, speech model part entirely converges to zero, as shown in FIG. 4, due to the sparsity constraint in the cost function to be minimized. With H_speech = 0, no result for estimated speech can be obtained.

[34] In general, we observe that the performance of the penalty function depends on the choice of the A value. If A is small, H_speech usually does not become zero but may include some "bad" groups to represent the audio mixture, which affects the final separation quality. However, if A gets larger, the penalty function cannot guarantee that H_speech will not become zero. In order to obtain a good separation quality, the choice of A may need to be adaptive to the i nput mixture. For example, the longer the duration of the input (large N), the bigger A may need to be to .result in a sparse // since // is now correspondingly large (size KxN ).

[35 J In one embodiment, we may use A = FNK1₀, where A_Q is a constant (for example, 10^~7 or 10^~8). Here, since we use a universal speech model, F and K are fixed, and only N is a variable. In this case, since A is not fixed, we may end up with a value that is large enough to make H_speech zero if using the sparsity penalty function of the Sun or Duong reference.

[36] To prevent H_speech from converging to zero regardless of the choice of the A value, we introduce alternative optimization problem formulations. Specifically, we provide alternative sparsity penalty functions, providing different ways of exploiting the spectral characteristics of the universal speech model while making sure that H_speech does not degenerate to zero.

[37] We introduce the notion of relative group sparsity where sparsity of the groups takes into account the energy of H_speech. In one embodiment, a penalty function based on relative relative group sparsity includes two parts: a sparsity-promoting part for the groups

(activations for some groups become zero) and an ami- sparsity -promoting part for the whole activation matrix corresponding to the speech model (i.e., _speech as a whole does not become zero). This ensures that at least one group in H_speech remains active and thus the penalty is not as sensitive to λ as the one provided by the Sun or Duong reference. In the following, we describe different optimization schemes with different penalty functions in further detail.

[38] Optimization scheme 1

[39] In one embodiment, we propose a block sparsity approach, where a block represents activations corresponding to one clean speech example used to train the universal speech model. This may efficiently select the best speech examples to represent the speech in the audio mixture. Mathematically, the penalty function may be written as:

where G denotes the number of blocks (i.e., corresponding to the number of clean speech examples used for training the universal model), e is a small value greater than zero to avoid having log(0), H^_g) is part of the activation matrix H_speech corresponding to g-th training example, p and q determine the norm or pseudo-norm to be used (for example, p=q=l), and and γ is a constant (for example, 1 or 1/G). The

norm is calculated over all the the elements in H_speech as (∑_¾)7l |¾_¾,η Ι^ρ) · If y = 0, the pen lty function Ψ₁(.) is similar to the penalty functions used in the Sun or Duong reference.

[40] This scheme forces H_speech to contain few blocks of activations, which correspond to to speech training examples with similar spectral characteristics as the speech in the noisy signal. FIG. 5 illustrates one example of decomposing the spectrogram., where only two blocks of H_speech are activated.

[41] Optimization scheme 2

[42] We also propose a component sparsity approach to allow more flexibility and choose the best spectral components. Mathematically, the penalty function may be written as:

K

H speech) =

where h_g is g-th row in H_speech, and K_speech is the number of rows in H_speech . Note that each row in // represents the activation coefficients for the corresponding column (the spectral component) in IV. For example, if the first row of // is zero, then the first column of W is not used to represent V (where V = Wli ). FIG. 6 illustrates one example of the estimated H after convergence, where several components of H_speech are activated.

[43] Optimization scheme 3

[44] We can also combine optimization schemes 1 and 2 and use a mix of block and component sparsity. Mathematically, the penalty function may be written as: speech) ⁼

ί,Η speech ) where a and β are weights determining the contribution of each penalty. FIG. 7 illustrates illustrates one example of decomposing the spectrogram., where blocks, or parts (components) of a block of H_speech are activated.

[45] Optimization scheme 4

[46] The penalty function Ψι (Hs eech) ^{can ta}^^e another form., for example, we can propose another relative group sparsity approach to choose the best spectral characteristics:

where is g-th block in H_speech. Similarly, penalty functions ^Vz- ifl speech) ^an<i Ψ₃ (Hspeec/i) can also be adjusted.

[47] In the above, we discussed several different penalty functions. Each of these penalty functions can be used to replace the penalty function ^ (β speech) ^m the optimization problem (1). The multiplicative update may also be adjusted for different penalty functions. Other functions, rather than "logQ," can also be used in the penalty functions.

[48] Using 11 Hspeech ||_p in the denominator, if H_speech approaches zero, the cost function will increase and not decrease. Thus, all the previous optimization schemes avoid the situation where H_speech becomes zero because the denominator ||¾_peecft ||^ favors that some activations remain in H_speech even for a very high value of λ (see Eq. ( 1)). By contrast, for the penalty functions used in the Sun or Duong reference, a high value of A will force H_speech to be zero in order for the cost function to be minimized. Other penalty functions that favor some activations remain in H_speech even for a very high value of λ can also be used.

[49] Advantageously, the speech enhancement techniques according to the present principles learn the noise model automatically during the denoising process directly from the input noisy speech, and thus no training data for noise is required. This makes our methods more efficient as opposed to other techniques requiring pre-learned and fixed noise models. In addition, because clean speech examples are easily accessible in practice, we can generally have a good universal speech model to guide the speech enhancement process. [50] The different formulations of penalty functions and optimization schemes can also be applied, for example, to our previous on-the-fly source separation as in the Duong reference, where one or more keyword specifying audio source is missing so that the corresponding source spectral models should be learned. More generally, the present principles can be applied to separate any audio sources from a mixture (not only speech and noise), where universal spectral models for some of the sources can be learned from corresponding examples, and some cannot. For those sources where universal speech models are not available, their models can be learned during the iterations of the algorithm starting from a random (or another type of) initialization similar to how we learn the noise part in Algorithm 1,.

[51] The present principles can be used in a speech enhancement module that denoises an audio mixture to enhance the quality of the reproduction of speech, and the speech enhancement module can be used as a pre-processor (for example, for a speech recognition system) or post-processor for other speech systems. FIG. 8 depicts a block diagram of an exemplary system 800 where a speech enhancement module can be used according to an embodiment of the present principles. Based on clean speech examples, Universal speech model training module 820 learns a universal speech spectral model. The clean speech examples can come from different sources, for example, but not limited to, a microphone recording in a studio, a speech database and an automatic speech synthesizer. The universal speech model can be learned from any available clean speech, thus, the present principles mainly provide non- supervised solutions. When the target speakers are known, the clean speech examples may be learned from the target speakers only and the present principles also provide semi- supervised solutions.

[52] Microphone 810 records a noisy speech that needs to be processed. The microphone may record speech from one or more speakers. The noisy speech may also be pre-recorded and stored in a storage medium. Given the universal speech spectral model and the noisy speech, Speech enhancement module 830 may obtain noise spectral model and time activations for speech and noise, for example, using method 200, and reconstruct an enhanced speech corresponding to the noisy speech. The reconstructed speech may then be played by Speaker 840. Speech enhancement module 830 may also estimate noise included in the noisy speech. The output speech/noise may also be saved in a storage medium, or provided as input to another module, for example, a speech recognition module.

[53] Different modules shown in FIG. 8 may be implemented in one device, or distributed over several devices. For example, all modules may be included in a tablet or mobile phone. In another example, Speech enhancement module 830 may be located separately from other modules, in a computer or in the cloud. In yet another embodiment, Universal speech model training module 820 as well as Microphone 810 can be a standalone module from Speech enhancement module 830. [54] The implementations described herein may be implemented in, for example, a method or a process, an apparatus, a software program, a data stream, or a signal. Even if only discussed in the context of a single form of implementation (for example, discussed only as a method), the implementation of features discussed may also be implemented in other forms (for example, an apparatus or program). An apparatus may be implemented in, for example, appropriate hardware, software, and firmware. The methods may be implemented in, for example, an apparatus such as, for example, a processor, which refers to processing devices in general, including, for example, a computer, a microprocessor, an integrated circuit, or a programmable logic device. Processors also include communication devices, such as, for example, computers, cell phones, portable/personal digital assistants ("PDAs"), and other devices that facilitate communication of information between end-users.

[55] Reference to "one embodiment" or "an embodiment" or "one implementation" or "an implementation" of the present principles, as well as other variations thereof, mean that a particular feature, structure, characteristic, and so forth described in connection with the embodiment is included in at least one embodiment of the present principles. Thus, the appearances of the phrase "in one embodiment" or "in an embodiment" or "in one

implementation" or "in an implementation", as well any other variations, appearing in various places throughout the specification are not necessarily all referring to the same embodiment.

[56] Additionally, this application or its claims may refer to "determining" various pieces of information. Determining the information may include one or more of, for example, estimating the information, calculating the information, predicting the information, or retrieving the information from memory.

[57] Further, this application or its claims may refer to "accessing" various pieces of information. Accessing the information may include one or more of, for example, receiving the information, retrieving the information (for example, from memory), storing the information, processing the information, transmitting the information, moving the

information, copying the information, erasing the information, calculating the information, determining the information, predicting the information, or estimating the information.

[58] Additionally, this application or its claims may refer to "receiving" various pieces of information. Receiving is, as with "accessing", intended to be a broad term. Receiving the information may include one or more of, for example, accessing the information, or retrieving the information (for example, from memory). Further, "receiving" is typically involved, in one way or another, during operations such as, for example, storing the information, processing the information, transmitting the information, moving the information, copying the information, erasing the information, calculating the information, determining the information, predicting the information, or estimating the information.

[59] As will be evident to one of skill in the art, implementations may produce a variety of signals formatted to carry information that may be, for example, stored or transmitted. The information may include, for example, instructions for performing a method, or data produced by one of the described implementations. For example, a signal may be formatted to carry the bitstream of a described embodiment. Such a signal may be formatted, for example, as an electromagnetic wave (for example, using a radio frequency portion of spectrum) or as a baseband signal. The formatting may include, for example, encoding a data stream and modulating a carrier with the encoded data stream. The information that the signal carries may be, for example, analog or digital information. The signal may be transmitted over a variety of different wired or wireless links, as is known. The signal may be stored on a processor-readable medium.

Claims

CLAIMS:

1. A method for processing an audio signal, comprising:

accessing a universal spectral model for speech;

determining (230) a spectral model for noise included in the audio signal, responsive to the audio signal and the universal spectral model for speech;

determining (230) a first set of time activations corresponding to the spectral model for speech, responsive to the audio signal and the universal spectral model for speech;

estimating (240) a speech included in the audio signal responsive to the universal spectral model for speech and the first set of time activations; and

providing the estimated speech as output.

2. The method of claim 1, wherein at least one of the determining a spectral model for noise and the determining a first set of time activations is responsive to a cost function, wherein the cost function includes a sparsity penalty on the first set of time activations.

3. The method of claim 2, wherein the sparsity penalty increases when the first set of time activations approaches zero.

4. The method of claim 2, wherein the sparsity penalty forces a plurality of elements in the first set of time activations to zero.

5. The method of claim 2, wherein the sparsity penalty is responsive to a norm of the first set of time activations.

6. The method of claim 2, wherein the sparsity penalty is responsive to a ratio between a norm of a subset of the first set of time activations and a norm of the first set of time activations.

7. The method of claim 6, wherein the subset of the first set of time activations corresponds to at least one of a speech example used to train the universal spectral model for speech and a spectral component of the universal spectral model.

8. An apparatus for processing an audio signal, comprising:

a universal speech model training module (820) configured to access a universal spectral model for speech; and

a speech enhancement module (830) configured to

determine a spectral model for noise included in the audio signal, responsive to the audio signal and the universal spectral model for speech,

determine a first set of time activations corresponding to the spectral model for speech, responsive to the audio signal and the universal spectral model for speech, estimate a speech included in the audio signal responsive to the universal spectral model for speech and the first set of time activations, and

provide the estimated speech as output.

9. The apparatus of claim 8, wherein the speech enhancement module is configured to determine at least one of the spectral model for noise and the first set of time activations responsive to a cost function, wherein the cost function includes a sparsity penalty on the first set of time activations.

10. The apparatus of claim 9, wherein the sparsity penalty increases when the first set of time activations approaches zero.

11. The apparatus of claim 9, wherein the sparsity penalty forces a plurality of elements in the first set of time activations to zero.

12. The apparatus of claim 9, wherein the sparsity penalty is responsive to a norm of the first set of time activations.

13. The apparatus of claim 9, wherein the sparsity penalty is responsive to a ratio between a norm of a subset of the first set of time activations and a norm of the first set of time activations.

14. The apparatus of claim 13, wherein the subset of the first set of time activations corresponds to at least one of a speech example used to train the universal spectral model for speech and a spectral component of the universal spectral model.

15. A computer readable storage medium having stored thereon instructions for processing an audio signal, according to any one of claims 1-7.