WO2016050725A1 - Method and apparatus for speech enhancement based on source separation - Google Patents

Method and apparatus for speech enhancement based on source separation

Info

Publication number
WO2016050725A1
WO2016050725A1 (PCT/EP2015/072344)
Authority
WO
WIPO (PCT)
Prior art keywords
speech
universal
noise
spectral model
activations
Prior art date
Application number
PCT/EP2015/072344
Other languages
French (fr)
Inventor
Dalia ELBADAWY
Alexey Ozerov
Quang Khanh Ngoc DUONG
Original Assignee
Thomson Licensing
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Thomson Licensing filed Critical Thomson Licensing
Publication of WO2016050725A1 publication Critical patent/WO2016050725A1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02163 Only one microphone


Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present embodiments provide speech enhancement based on source separation techniques. Specifically, we use a universal spectral model for speech, and train the spectral model for noise and the activations for speech/noise based on the universal spectral model for speech and the input noisy speech. We formulate the optimization problem using a cost function that includes a divergence function and a sparsity penalty function, wherein the penalty function is based on the notion of relative group sparsity. The sparsity penalty function includes two parts: a sparsity-promoting part for the groups (activations for some groups become zero) and an anti-sparsity-promoting part for the whole activation matrix corresponding to the speech model (i.e., the activations for speech as a whole do not become zero). Based on the universal spectral model for speech, the spectral model for noise, and the activations for speech/noise, we can estimate the speech/noise included in the input noisy speech.

Description

Method and Apparatus for Speech Enhancement Based on Source
Separation
TECHNICAL FIELD [1] This invention relates to a method and an apparatus for speech enhancement, and more particularly, to a method and an apparatus for speech enhancement based on an audio source separation technique.
BACKGROUND
[2] Speech enhancement, or speech denoising, plays a key role in many applications such as telephone communication, robotics, and automatic speech recognition systems.
Numerous speech enhancement techniques have been developed, such as those based on beamforming approaches or noise suppression algorithms. Work also exists on applying source separation to speech enhancement.
SUMMARY [3] The present principles provide a method for processing an audio signal, comprising: accessing a universal spectral model for speech; determining a spectral model for noise included in the audio signal, responsive to the audio signal and the universal spectral model for speech; determining a first set of time activations corresponding to the spectral model for speech, responsive to the audio signal and the universal spectral model for speech; estimating a speech included in the audio signal responsive to the universal spectral model for speech and the first set of time activations; and providing the estimated speech as output. The present principles also provide an apparatus for performing these steps.
[4] The present principles provide a method for processing an audio signal, comprising: accessing a universal spectral model for speech; determining a spectral model for noise included in the audio signal, responsive to the audio signal and the universal spectral model for speech; determining a first set of time activations corresponding to the spectral model for speech, responsive to the audio signal and the universal spectral model for speech; estimating a speech included in the audio signal responsive to the universal spectral model for speech and the first set of time activations; determining a second set of time activations
corresponding to the spectral model for noise, responsive to the audio signal and the universal spectral model for speech; estimating the noise included in the audio signal responsive to the spectral model for noise and the second set of time activations; and providing the noise and the estimated speech as output. The present principles also provide an apparatus for performing these steps.
[5] The present principles also provide a method for processing an audio signal, comprising: accessing a universal spectral model for speech; determining a spectral model for noise included in the audio signal, responsive to the audio signal and the universal spectral model for speech; determining a first set of time activations corresponding to the spectral model for speech, responsive to the audio signal and the universal spectral model for speech, wherein at least one of the determining a spectral model for noise and the determining a first set of time activations is responsive to a cost function, wherein the cost function includes a sparsity penalty on the first set of time activations, and wherein the sparsity penalty is responsive to a ratio between a norm of a subset of the first set of time activations and a norm of the first set of time activations; estimating a speech included in the audio signal responsive to the universal spectral model for speech and the first set of time activations; and providing the estimated speech as output. The present principles also provide an apparatus for performing these steps. [6] The present principles also provide a computer readable storage medium having stored thereon instructions for processing an audio signal according to the methods described above.
BRIEF DESCRIPTION OF THE DRAWINGS [7] FIG. 1 is a pictorial example illustrating an exemplary overview of speech
enhancement according to an embodiment of the present principles.
[8] FIG. 2 is a flow diagram depicting an exemplary method for speech enhancement based on source separation, according to an embodiment of the present principles.
[9] FIG. 3 is a pictorial example illustrating an example where a spectrogram V is decomposed into two matrices W and H.
[10] FIG. 4 is a pictorial example illustrating that the activations corresponding to the universal speech model part entirely converge to zero while the noise spectral model is updated, when using a prior art optimization function.
[11] FIG. 5 is a pictorial example illustrating one example of decomposing the
spectrogram based on block sparsity, according to an embodiment of the present principles.
[12] FIG. 6 is a pictorial example illustrating an estimated activation matrix obtained by an optimization scheme based on component sparsity, according to an embodiment of the present principles.
[13] FIG. 7 is a pictorial example illustrating one example of decomposing the
spectrogram based on both block sparsity and component sparsity, according to an embodiment of the present principles.
[14] FIG. 8 is a block diagram depicting an exemplary speech system, in accordance with an embodiment of the present principles. DETAILED DESCRIPTION
[15] When applying source separation for speech enhancement or denoising, in order to separate speech from noise, relevant training data is usually needed to first learn the spectral characteristics of the speech and/or of the particular noise. Such a class of supervised audio source separation algorithms is mostly based on non-negative matrix factorization (NMF), or its probabilistic formulation known as probabilistic latent component analysis (PLCA). In the case of the NMF model, the input spectrogram (or magnitude) matrix V (a time-frequency representation of the input mixture signal) is factorized into two matrices as V = WH, where W and H can be interpreted as the latent spectral features and the activations of those features in the signal, respectively. When the input is a mixture of two sources, we may write the matrix W = [W1, W2], where W contains spectral components of, for example, source 1 (speech, W1) and source 2 (noise, W2), and H = [H1; H2], where H1 and H2 are matrices representing time activations corresponding to W1 and W2, respectively; the time activations indicate whether a spectral component is active or not at each time index and can be considered as weighting the contribution of the spectral components to the corresponding source spectrogram. Once the decomposition is obtained, the spectral power of source 1 is estimated as V̂1 = W1H1, and the spectral power of source 2 as V̂2 = W2H2.
[16] In an article by D. L. Sun and G. J. Mysore, entitled "Universal speech models for speaker independent single channel source separation," IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), May 2013 (hereinafter "Sun"), a universal speech spectral model was employed as well as a pre-learned noise spectral model. However, using a pre-learned noise model requires training data for noise, and the model may not be representative for other types of noise which are not included in the training data.
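For concreteness, a minimal Python/NumPy sketch of such a two-source factorization, using the standard multiplicative updates for a Euclidean cost; the matrix sizes, the random stand-in data and the choice of the Euclidean cost are illustrative assumptions, not taken from the patent text.

    import numpy as np

    rng = np.random.default_rng(0)
    F, N, K1, K2 = 64, 100, 8, 4                       # freq bins, frames, components per source (illustrative)
    V = np.abs(rng.standard_normal((F, N)))            # stand-in for a magnitude spectrogram

    W = np.abs(rng.standard_normal((F, K1 + K2)))      # W = [W1 W2]
    H = np.abs(rng.standard_normal((K1 + K2, N)))      # H = [H1; H2]

    eps = 1e-12
    for _ in range(200):
        # multiplicative updates minimizing ||V - WH||_F^2
        H *= (W.T @ V) / (W.T @ (W @ H) + eps)
        W *= (V @ H.T) / ((W @ H) @ H.T + eps)

    V1_hat = W[:, :K1] @ H[:K1, :]   # estimated spectrogram of source 1 (speech part)
    V2_hat = W[:, K1:] @ H[K1:, :]   # estimated spectrogram of source 2 (noise part)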
[17] In an article by N. Mohammadiha, P. Smaragdis, and A. Leijon, entitled "Supervised and unsupervised speech enhancement using nonnegative matrix factorization," IEEE Transactions on Audio, Speech, and Language Processing, 2013, a method that learns the noise model online is used. However, the method uses hidden Markov models (HMM) in combination with a Bayesian formulation of NMF, which may be sensitive to the parameter initialization and is different from the present embodiments that use a universal speech model.
[18] A commonly owned EP application (EP14305712.3, Attorney Docket No.
PF140127, hereinafter "Duong"), entitled "Method and system of on-the-fly audio source separation" by the inventors of the present application, the teachings of which are specifically incorporated herein by reference, discloses a method and apparatus for a combined text-and-example based approach for audio source separation, wherein a universal spectral model for each source is learned in advance. The universal noise model is learned through user guidance. Specifically, the noise type is determined by a user, and then a corresponding universal spectral model is learned in advance from retrieved noise examples. In the present embodiments, by contrast, the noise model (which is not universal) is estimated directly from the noisy signal.
[19] The present principles are directed to speech enhancement based on a source separation technique, which decomposes an audio mixture into constituent sound sources. In one embodiment, we use a universal spectral model for speech, and learn the spectral model for noise from the input signal. In general, the speech enhancement would improve the perceptual quality of the speech.
[20] FIG. 1 illustrates an exemplary overview of speech enhancement according to an embodiment of the present principles. We employ a universal spectral model for speech, trained from n clean speech examples. A universal speech model contains an overcomplete dictionary of spectral characteristics of speech learned from different speakers. To train the universal speech model from the clean speech examples, clean speech example i is used to learn a spectral model Wi. Then the universal speech model is constructed by concatenating the learned models: Wspeech = [W1 W2 ... Wn]. Amplitude normalization can be applied to ensure that different speech examples have similar energy levels.
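A minimal sketch of how such a universal speech model could be assembled, assuming NumPy, a simple per-example NMF trainer with Euclidean-cost multiplicative updates, and unit-norm columns as the amplitude normalization; the trainer, cost and normalization choices are assumptions made only for illustration.

    import numpy as np

    def train_spectral_model(V_i, K=10, n_iter=200, eps=1e-12, seed=0):
        """Learn one spectral model W_i (F x K) from a clean speech spectrogram V_i (F x N)."""
        rng = np.random.default_rng(seed)
        F, N = V_i.shape
        W = np.abs(rng.standard_normal((F, K)))
        H = np.abs(rng.standard_normal((K, N)))
        for _ in range(n_iter):
            H *= (W.T @ V_i) / (W.T @ (W @ H) + eps)
            W *= (V_i @ H.T) / ((W @ H) @ H.T + eps)
        return W

    def build_universal_speech_model(clean_spectrograms, K=10):
        models = []
        for i, V_i in enumerate(clean_spectrograms):
            W_i = train_spectral_model(V_i, K=K, seed=i)
            # amplitude normalization (assumed here: unit-norm columns)
            W_i /= (np.linalg.norm(W_i, axis=0, keepdims=True) + 1e-12)
            models.append(W_i)
        return np.concatenate(models, axis=1)   # Wspeech = [W1 W2 ... Wn]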
[21] As illustrated in FIG. 1, given the universal speech model, for an input noisy speech (also referred to as "audio mixture," as the noisy speech is a mixture of noise and speech), the noise spectral model (Wnoise) is learned automatically in the source separation algorithm, as well as the activations for speech (Hspeech) and noise (Hnoise). Sparsity constraints on the activation matrix Hspeech are used to enforce the selection of only a few representative spectral components learned from all training examples. Based on the spectral models and activations, the speech contained in the input noisy speech may be estimated, for example, using the estimated speech magnitude spectrogram (WspeechHspeech), and the noise contained in the input noisy speech may be estimated, for example, using the estimated noise magnitude spectrogram (WnoiseHnoise). Using Wiener filtering, the estimated speech/noise Short Time Fourier Transform (STFT) coefficients can be obtained; the estimated time-domain signals for speech and noise can then be obtained via the inverse Short Time Fourier Transform (ISTFT). Because the noise can be removed, the output largely contains speech only and thus enhances the perceptual quality over the input noisy speech.
[22] FIG. 2 illustrates an exemplary method 200 for speech enhancement based on source separation according to an embodiment of the present principles. Method 200 can be used for source separation as described in FIG. 1. Method 200 starts at initialization step 210. At initialization, the audio mixture is input; the method may also accept from a user some parameter values used in the universal model training and/or source separation process. In addition, it may train a universal speech model based on training examples, or it may accept a universal speech model as input. At step 220, the audio mixture is transformed via the Short-Time Fourier Transform (STFT) into a time-frequency representation known as the spectrogram (denoted as matrix V). Note that V can be, for example, the power (square magnitude) or magnitude of the STFT coefficients.
[23] Using the universal speech model, the spectrogram is used to estimate the noise spectral model and the activations for speech and noise at step 230, wherein the speech spectral model is used to guide the estimation (i.e., the speech part of the spectral model W is known and does not change during the estimation process). Once the noise model and activations are estimated, the STFT coefficients of the speech signal, and optionally of the noise, can be reconstructed by Wiener filtering at step 240. Inverse STFT is performed to obtain the time-domain signal of the estimated speech and/or noise. [24] In the following, the step of estimating activations and the noise spectral model (230) is described in further detail.
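The overall flow of method 200 might be sketched as follows, assuming SciPy's STFT/ISTFT, a power spectrogram, and a separate estimate_models routine standing in for step 230; the STFT parameters and the exact Wiener-mask construction are illustrative assumptions.

    import numpy as np
    from scipy.signal import stft, istft

    def enhance(mixture, fs, W_speech, estimate_models, nperseg=1024):
        # Step 220: STFT and (power) spectrogram
        _, _, X = stft(mixture, fs=fs, nperseg=nperseg)
        V = np.abs(X) ** 2

        # Step 230: estimate W_noise, H_speech, H_noise with W_speech held fixed
        W_noise, H_speech, H_noise = estimate_models(V, W_speech)

        # Step 240: Wiener filtering of the mixture STFT, then inverse STFT
        V_speech = W_speech @ H_speech
        V_noise = W_noise @ H_noise
        mask = V_speech / (V_speech + V_noise + 1e-12)
        _, speech = istft(mask * X, fs=fs, nperseg=nperseg)
        _, noise = istft((1.0 - mask) * X, fs=fs, nperseg=nperseg)
        return speech, noise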
[25] Estimating activations and noise spectral model
[26] The non-negative spectrogram matrix V of dimension F×N is to be decomposed into two non-negative matrices, W (the spectral model, of dimension F×K) and H (the time activations, of dimension K×N), such that V ≈ V̂ = WH. In this formulation, F denotes the total number of frequency bins, N denotes the number of time frames, and K denotes the number of spectral components, wherein a spectral component corresponds to a column in the matrix W and represents a latent spectral characteristic. FIG. 3 provides an example where a spectrogram V is decomposed into two matrices W and H.
[27] In our context, W includes two parts: W = [Wspeech, Wnoise], where Wspeech is the universal speech model, and Wnoise is the noise model, which is unknown in advance. Similarly, the activation matrix H also includes two parts: H = [Hspeech; Hnoise], where Hspeech corresponds to speech and Hnoise corresponds to noise.
[28] In one embodiment, we consider sparsity constraints on the speech activations Hspeech. Mathematically, the activation matrix is estimated by solving the following optimization problem, which includes a divergence function and a sparsity penalty function:

    min_{H, Wnoise} D(V | WH) + λ Ψ(Hspeech)        (1)

where D(V | WH) = Σ_{f=1}^{F} Σ_{n=1}^{N} d(v_fn | (WH)_fn), f indexes the frequency bin, n indexes the time frame, v_fn denotes the element in the f-th row and n-th column of the spectrogram, d(·|·) is a divergence function, and λ is a weighting factor for the penalty function Ψ(·) that controls how much we want to emphasize the sparsity of Hspeech during optimization. Possible divergences include, for example, the Itakura-Saito divergence (IS divergence), the Euclidean distance, and the Kullback-Leibler divergence.
[29] Using a penalty function in the optimization problem is motivated by the fact that some of the speech examples used to train the universal speech model may be more representative of the speech contained in the audio mixture than others, and it may then be better to use only these more representative ("good") examples. Also, some spectral components in the universal speech model may be more representative of the spectral characteristics of the speech in the audio mixture, and it may be better to use only these more representative ("good") spectral components. The purpose of the penalty function is to enforce the activation of "good" examples or components, and to force the activations corresponding to other examples and/or components to zero.
[30] Consequently, the penalty function results in a sparse matrix Hspeech where some groups in Hspeech are set to zero. In the present application, we use the term "group" to generalize the subset of elements in the speech model which are affected by the sparsity constraint. For example, when the sparsity constraint is applied on a block basis, a group corresponds to a block (a consecutive set of rows) in the matrix Hspeech, which in turn corresponds to the activations of one clean speech example used to train the universal speech model. When the sparsity constraint is applied on a spectral-component basis, a group corresponds to a row in the matrix Hspeech, which in turn corresponds to the activation of one spectral component (a column in W) in the universal speech model. In another embodiment, a group can be a column in Hspeech, which corresponds to the activation of one frame (audio window) in the input spectrogram.
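As one concrete reading of problem (1), a short sketch that evaluates the cost with the IS divergence and a pluggable penalty on Hspeech; the penalty is left abstract here, and the function names are illustrative.

    import numpy as np

    def itakura_saito(V, V_hat, eps=1e-12):
        # d_IS(x | y) = x/y - log(x/y) - 1, summed over all time-frequency bins
        R = (V + eps) / (V_hat + eps)
        return np.sum(R - np.log(R) - 1.0)

    def objective(V, W_speech, W_noise, H_speech, H_noise, lam, penalty):
        W = np.concatenate([W_speech, W_noise], axis=1)
        H = np.concatenate([H_speech, H_noise], axis=0)
        return itakura_saito(V, W @ H) + lam * penalty(H_speech)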
[31] An iterative algorithm with multiplicative updates may be used to solve the optimization problem. Table 1 illustrates an exemplary algorithm (Algorithm 1) to solve the optimization problem, where H(g_kn) represents the group (sub-matrix) of H such that matrix element h_kn ∈ H(g_kn), ⊙ denotes the element-wise Hadamard product, Kspeech is the number of rows in Hspeech, and ε, p and q are constants. In Algorithm 1, H and Wnoise are initialized randomly. In other embodiments, they can be initialized in other manners. Note that the speech spectral model Wspeech is fixed while Wnoise is updated.

Table 1. Algorithm 1: NMF with relative group sparsity (IS divergence is used)
    Input: V, Wspeech, λ
    Output: H, Wnoise
    Initialize H randomly
    Initialize Wnoise randomly
    repeat
        update H and Wnoise (multiplicative update equations; image not reproduced in this text)
    until convergence
[32] In Algorithm 1, H and Wnoise are updated with multiplicative rules [update equations not reproduced in this text; the denominator of the H update has the form WᵀV̂⁻¹ + λP], where P and Q are matrices of the same size as H and are used to enforce the penalty on Hspeech. While we have a model for speech (Wspeech), we need to learn a model for noise. In one embodiment, we randomly initialize Wnoise and set W = [Wspeech Wnoise].
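Since the exact update equations are not legible in this text, the following sketch uses the standard multiplicative updates for IS-divergence NMF, with the penalty handled through non-negative matrices P and Q added to the denominator and numerator of the Hspeech update; the update form and the penalty_grads interface are assumptions made for illustration, not the patent's verbatim algorithm.

    import numpy as np

    def separate(V, W_speech, lam, K_noise=10, n_iter=200, penalty_grads=None, eps=1e-12, seed=0):
        """Estimate H = [H_speech; H_noise] and W_noise with W_speech held fixed (sketch)."""
        rng = np.random.default_rng(seed)
        F, N = V.shape
        K_speech = W_speech.shape[1]
        W_noise = np.abs(rng.standard_normal((F, K_noise)))
        H = np.abs(rng.standard_normal((K_speech + K_noise, N)))

        for _ in range(n_iter):
            W = np.concatenate([W_speech, W_noise], axis=1)
            V_hat = W @ H + eps
            num = W.T @ (V * V_hat ** -2)      # IS-divergence update, numerator term
            den = W.T @ (V_hat ** -1)          # and denominator term
            # P, Q >= 0: assumed split of the penalty gradient w.r.t. H_speech
            P, Q = penalty_grads(H[:K_speech]) if penalty_grads else (0.0, 0.0)
            H[:K_speech] *= (num[:K_speech] + lam * Q) / (den[:K_speech] + lam * P + eps)
            H[K_speech:] *= num[K_speech:] / (den[K_speech:] + eps)

            V_hat = np.concatenate([W_speech, W_noise], axis=1) @ H + eps
            H_noise = H[K_speech:]
            W_noise *= ((V * V_hat ** -2) @ H_noise.T) / ((V_hat ** -1) @ H_noise.T + eps)

        return W_noise, H[:K_speech], H[K_speech:]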
[33] In our previous work, as described in the Duong reference, the log/ℓ1 norm is used as a penalty function. For one exemplary audio mixture, applying the log/ℓ1 norm (i.e., Ψ(H) = Σ_g log(ε + ||H(g)||1)) and other configurations in the Duong reference to the optimization problem (1), the activations corresponding to the universal speech model part entirely converge to zero, as shown in FIG. 4, due to the sparsity constraint in the cost function to be minimized. With Hspeech = 0, no result for estimated speech can be obtained.
[34] In general, we observe that the performance of the penalty function depends on the choice of the λ value. If λ is small, Hspeech usually does not become zero but may include some "bad" groups to represent the audio mixture, which affects the final separation quality. However, if λ gets larger, the penalty function cannot guarantee that Hspeech will not become zero. In order to obtain a good separation quality, the choice of λ may need to be adaptive to the input mixture. For example, the longer the duration of the input (large N), the bigger λ may need to be to result in a sparse H, since H is now correspondingly large (size K×N).
[35] In one embodiment, we may use λ = F·N·K·λ0, where λ0 is a constant (for example, 10^-7 or 10^-8). Here, since we use a universal speech model, F and K are fixed, and only N is a variable. In this case, since λ is not fixed, we may end up with a value that is large enough to make Hspeech zero if using the sparsity penalty function of the Sun or Duong reference.
[36] To prevent Hspeech from converging to zero regardless of the choice of the λ value, we introduce alternative optimization problem formulations. Specifically, we provide alternative sparsity penalty functions, providing different ways of exploiting the spectral characteristics of the universal speech model while making sure that Hspeech does not degenerate to zero.
[37] We introduce the notion of relative group sparsity, where the sparsity of the groups takes into account the energy of Hspeech. In one embodiment, a penalty function based on relative group sparsity includes two parts: a sparsity-promoting part for the groups (activations for some groups become zero) and an anti-sparsity-promoting part for the whole activation matrix corresponding to the speech model (i.e., speech as a whole does not become zero). This ensures that at least one group in Hspeech remains active, and thus the penalty is not as sensitive to λ as the one provided by the Sun or Duong reference. In the following, we describe different optimization schemes with different penalty functions in further detail.
[38] Optimization scheme 1
[39] In one embodiment, we propose a block sparsity approach, where a block represents activations corresponding to one clean speech example used to train the universal speech model. This may efficiently select the best speech examples to represent the speech in the audio mixture. Mathematically, the penalty function may be written as:
    Ψ1(Hspeech) = [equation image not reproduced in this text]

where G denotes the number of blocks (i.e., corresponding to the number of clean speech examples used for training the universal model), ε is a small value greater than zero to avoid having log(0), H(g) is the part of the activation matrix Hspeech corresponding to the g-th training example, p and q determine the norm or pseudo-norm to be used (for example, p = q = 1), and γ is a constant (for example, 1 or 1/G). The ||·||_p norm is calculated over all the elements in Hspeech as (Σ_{k,n} |h_kn|^p)^(1/p). If γ = 0, the penalty function Ψ1(·) is similar to the penalty functions used in the Sun or Duong reference.
[40] This scheme forces Hspeech to contain few blocks of activations, which correspond to speech training examples with spectral characteristics similar to the speech in the noisy signal. FIG. 5 illustrates one example of decomposing the spectrogram, where only two blocks of Hspeech are activated.
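The exact expression of Ψ1 is not legible here; the sketch below implements one plausible relative block-sparsity penalty consistent with the surrounding description, namely Σ_g log((ε + ||H(g)||_p^q) / ||Hspeech||_p^(γq)), which reduces to the Sun/Duong-style penalty when γ = 0 and keeps ||Hspeech||_p in the denominator. This precise form is an assumption, not the patent's formula.

    import numpy as np

    def relative_block_sparsity(H_speech, block_sizes, gamma=1.0, p=1, q=1, eps=1e-6):
        """Assumed form: sum_g log((eps + ||H(g)||_p^q) / ||H_speech||_p^(gamma*q))."""
        total = np.sum(np.abs(H_speech) ** p) ** (1.0 / p)   # ||H_speech||_p over all elements
        val, start = 0.0, 0
        for size in block_sizes:                              # one block per clean speech example
            block = H_speech[start:start + size]
            block_norm = np.sum(np.abs(block) ** p) ** (1.0 / p)
            val += np.log(eps + block_norm ** q) - gamma * q * np.log(total + 1e-300)
            start += size
        return val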
[41] Optimization scheme 2
[42] We also propose a component sparsity approach to allow more flexibility and choose the best spectral components. Mathematically, the penalty function may be written as:
    Ψ2(Hspeech) = [equation image not reproduced in this text]

where h_g is the g-th row in Hspeech, and Kspeech is the number of rows in Hspeech. Note that each row in H represents the activation coefficients for the corresponding column (the spectral component) in W. For example, if the first row of H is zero, then the first column of W is not used to represent V (where V̂ = WH). FIG. 6 illustrates one example of the estimated H after convergence, where several components of Hspeech are activated.
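Under the same assumed form, a component-sparsity variant can reuse the sketch given after paragraph [40], with each row of Hspeech treated as its own group:

    def relative_component_sparsity(H_speech, gamma=1.0, p=1, q=1, eps=1e-6):
        # each of the K_speech rows is its own group
        return relative_block_sparsity(H_speech, block_sizes=[1] * H_speech.shape[0],
                                       gamma=gamma, p=p, q=q, eps=eps)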
[43] Optimization scheme 3
[44] We can also combine optimization schemes 1 and 2 and use a mix of block and component sparsity. Mathematically, the penalty function may be written as:

    Ψ3(Hspeech) = α Ψ1(Hspeech) + β Ψ2(Hspeech)

where α and β are weights determining the contribution of each penalty. FIG. 7 illustrates one example of decomposing the spectrogram, where blocks, or parts (components) of a block, of Hspeech are activated.
[45] Optimization scheme 4
[46] The penalty function Ψ1(Hspeech) can take another form; for example, we can propose another relative group sparsity approach to choose the best spectral characteristics:

    [equation image not reproduced in this text]

where H(g) is the g-th block in Hspeech. Similarly, the penalty functions Ψ2(Hspeech) and Ψ3(Hspeech) can also be adjusted.
[47] In the above, we discussed several different penalty functions. Each of these penalty functions can be used to replace the penalty function Ψ(Hspeech) in the optimization problem (1). The multiplicative update may also be adjusted for different penalty functions. Other functions, rather than "log(·)," can also be used in the penalty functions.
[48] Using ||Hspeech||_p in the denominator, if Hspeech approaches zero, the cost function will increase and not decrease. Thus, all the previous optimization schemes avoid the situation where Hspeech becomes zero, because the denominator ||Hspeech||_p favors that some activations remain in Hspeech even for a very high value of λ (see Eq. (1)). By contrast, for the penalty functions used in the Sun or Duong reference, a high value of λ will force Hspeech to be zero in order for the cost function to be minimized. Other penalty functions that favor keeping some activations in Hspeech even for a very high value of λ can also be used.
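A toy numeric check of this behavior, reusing the assumed relative_block_sparsity sketch from after paragraph [40]: as Hspeech is scaled toward zero, the γ = 0 (Sun/Duong-style) penalty keeps decreasing, whereas the relative penalty (γ = 1) stops decreasing and eventually grows, because the ε-floored numerator is divided by the vanishing ||Hspeech||_p.

    import numpy as np

    rng = np.random.default_rng(0)
    H = np.abs(rng.standard_normal((20, 50)))   # toy H_speech with two blocks of 10 rows
    blocks = [10, 10]
    for scale in (1.0, 1e-4, 1e-8, 1e-12):
        plain = relative_block_sparsity(scale * H, blocks, gamma=0.0)      # Sun/Duong-style
        relative = relative_block_sparsity(scale * H, blocks, gamma=1.0)   # relative group sparsity
        print(f"scale={scale:g}  plain={plain:.2f}  relative={relative:.2f}")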
[49] Advantageously, the speech enhancement techniques according to the present principles learn the noise model automatically during the denoising process, directly from the input noisy speech, and thus no training data for noise is required. This makes our methods more efficient than techniques that require pre-learned and fixed noise models. In addition, because clean speech examples are easily accessible in practice, we can generally have a good universal speech model to guide the speech enhancement process. [50] The different formulations of penalty functions and optimization schemes can also be applied, for example, to our previous on-the-fly source separation as in the Duong reference, where one or more keywords specifying an audio source are missing, so that the corresponding source spectral models must be learned. More generally, the present principles can be applied to separate any audio sources from a mixture (not only speech and noise), where universal spectral models for some of the sources can be learned from corresponding examples, and some cannot. For those sources where universal spectral models are not available, their models can be learned during the iterations of the algorithm, starting from a random (or another type of) initialization, similar to how we learn the noise part in Algorithm 1.
[51] The present principles can be used in a speech enhancement module that denoises an audio mixture to enhance the quality of the reproduction of speech, and the speech enhancement module can be used as a pre-processor (for example, for a speech recognition system) or post-processor for other speech systems. FIG. 8 depicts a block diagram of an exemplary system 800 where a speech enhancement module can be used according to an embodiment of the present principles. Based on clean speech examples, Universal speech model training module 820 learns a universal speech spectral model. The clean speech examples can come from different sources, for example, but not limited to, a microphone recording in a studio, a speech database and an automatic speech synthesizer. The universal speech model can be learned from any available clean speech; thus, the present principles mainly provide non-supervised solutions. When the target speakers are known, the clean speech examples may be obtained from the target speakers only, and the present principles then also provide semi-supervised solutions.
[52] Microphone 810 records a noisy speech that needs to be processed. The microphone may record speech from one or more speakers. The noisy speech may also be pre-recorded and stored in a storage medium. Given the universal speech spectral model and the noisy speech, Speech enhancement module 830 may obtain the noise spectral model and the time activations for speech and noise, for example, using method 200, and reconstruct an enhanced speech corresponding to the noisy speech. The reconstructed speech may then be played by Speaker 840. Speech enhancement module 830 may also estimate the noise included in the noisy speech. The output speech/noise may also be saved in a storage medium, or provided as input to another module, for example, a speech recognition module.
[53] Different modules shown in FIG. 8 may be implemented in one device, or distributed over several devices. For example, all modules may be included in a tablet or mobile phone. In another example, Speech enhancement module 830 may be located separately from other modules, in a computer or in the cloud. In yet another embodiment, Universal speech model training module 820 as well as Microphone 810 can be a standalone module from Speech enhancement module 830. [54] The implementations described herein may be implemented in, for example, a method or a process, an apparatus, a software program, a data stream, or a signal. Even if only discussed in the context of a single form of implementation (for example, discussed only as a method), the implementation of features discussed may also be implemented in other forms (for example, an apparatus or program). An apparatus may be implemented in, for example, appropriate hardware, software, and firmware. The methods may be implemented in, for example, an apparatus such as, for example, a processor, which refers to processing devices in general, including, for example, a computer, a microprocessor, an integrated circuit, or a programmable logic device. Processors also include communication devices, such as, for example, computers, cell phones, portable/personal digital assistants ("PDAs"), and other devices that facilitate communication of information between end-users.
[55] Reference to "one embodiment" or "an embodiment" or "one implementation" or "an implementation" of the present principles, as well as other variations thereof, mean that a particular feature, structure, characteristic, and so forth described in connection with the embodiment is included in at least one embodiment of the present principles. Thus, the appearances of the phrase "in one embodiment" or "in an embodiment" or "in one
implementation" or "in an implementation", as well any other variations, appearing in various places throughout the specification are not necessarily all referring to the same embodiment.
[56] Additionally, this application or its claims may refer to "determining" various pieces of information. Determining the information may include one or more of, for example, estimating the information, calculating the information, predicting the information, or retrieving the information from memory.
[57] Further, this application or its claims may refer to "accessing" various pieces of information. Accessing the information may include one or more of, for example, receiving the information, retrieving the information (for example, from memory), storing the information, processing the information, transmitting the information, moving the
information, copying the information, erasing the information, calculating the information, determining the information, predicting the information, or estimating the information.
[58] Additionally, this application or its claims may refer to "receiving" various pieces of information. Receiving is, as with "accessing", intended to be a broad term. Receiving the information may include one or more of, for example, accessing the information, or retrieving the information (for example, from memory). Further, "receiving" is typically involved, in one way or another, during operations such as, for example, storing the information, processing the information, transmitting the information, moving the information, copying the information, erasing the information, calculating the information, determining the information, predicting the information, or estimating the information.
[59] As will be evident to one of skill in the art, implementations may produce a variety of signals formatted to carry information that may be, for example, stored or transmitted. The information may include, for example, instructions for performing a method, or data produced by one of the described implementations. For example, a signal may be formatted to carry the bitstream of a described embodiment. Such a signal may be formatted, for example, as an electromagnetic wave (for example, using a radio frequency portion of spectrum) or as a baseband signal. The formatting may include, for example, encoding a data stream and modulating a carrier with the encoded data stream. The information that the signal carries may be, for example, analog or digital information. The signal may be transmitted over a variety of different wired or wireless links, as is known. The signal may be stored on a processor-readable medium.

Claims

CLAIMS:
1. A method for processing an audio signal, comprising:
accessing a universal spectral model for speech;
determining (230) a spectral model for noise included in the audio signal, responsive to the audio signal and the universal spectral model for speech;
determining (230) a first set of time activations corresponding to the spectral model for speech, responsive to the audio signal and the universal spectral model for speech;
estimating (240) a speech included in the audio signal responsive to the universal spectral model for speech and the first set of time activations; and
providing the estimated speech as output.
2. The method of claim 1, wherein at least one of the determining a spectral model for noise and the determining a first set of time activations is responsive to a cost function, wherein the cost function includes a sparsity penalty on the first set of time activations.
3. The method of claim 2, wherein the sparsity penalty increases when the first set of time activations approaches zero.
4. The method of claim 2, wherein the sparsity penalty forces a plurality of elements in the first set of time activations to zero.
5. The method of claim 2, wherein the sparsity penalty is responsive to a norm of the first set of time activations.
6. The method of claim 2, wherein the sparsity penalty is responsive to a ratio between a norm of a subset of the first set of time activations and a norm of the first set of time activations.
7. The method of claim 6, wherein the subset of the first set of time activations corresponds to at least one of a speech example used to train the universal spectral model for speech and a spectral component of the universal spectral model.
8. An apparatus for processing an audio signal, comprising:
a universal speech model training module (820) configured to access a universal spectral model for speech; and
a speech enhancement module (830) configured to
determine a spectral model for noise included in the audio signal, responsive to the audio signal and the universal spectral model for speech,
determine a first set of time activations corresponding to the spectral model for speech, responsive to the audio signal and the universal spectral model for speech, estimate a speech included in the audio signal responsive to the universal spectral model for speech and the first set of time activations, and
provide the estimated speech as output.
9. The apparatus of claim 8, wherein the speech enhancement module is configured to determine at least one of the spectral model for noise and the first set of time activations responsive to a cost function, wherein the cost function includes a sparsity penalty on the first set of time activations.
10. The apparatus of claim 9, wherein the sparsity penalty increases when the first set of time activations approaches zero.
11. The apparatus of claim 9, wherein the sparsity penalty forces a plurality of elements in the first set of time activations to zero.
12. The apparatus of claim 9, wherein the sparsity penalty is responsive to a norm of the first set of time activations.
13. The apparatus of claim 9, wherein the sparsity penalty is responsive to a ratio between a norm of a subset of the first set of time activations and a norm of the first set of time activations.
14. The apparatus of claim 13, wherein the subset of the first set of time activations corresponds to at least one of a speech example used to train the universal spectral model for speech and a spectral component of the universal spectral model.
15. A computer readable storage medium having stored thereon instructions for processing an audio signal according to the method of any one of claims 1-7.
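By way of illustration only, the following is a minimal NumPy sketch of the kind of processing recited in claims 1, 2 and 5: a pre-trained universal speech dictionary is held fixed, a noise dictionary and both sets of time activations are estimated from the noisy mixture by non-negative matrix factorization with multiplicative updates, an L1 (norm-based) sparsity penalty is applied to the speech activations, and the speech estimate is recovered with a Wiener-style mask. Every name in the sketch (enhance_speech, n_noise_components, lam, and so on) is an assumption introduced for illustration and does not come from the application; the sketch is not the claimed implementation, and the short-time Fourier transform and its inverse are omitted.

import numpy as np

# Illustrative sketch (not the claimed implementation): NMF-based speech
# enhancement with a fixed universal speech dictionary and an L1 sparsity
# penalty on the speech activations.
def enhance_speech(X, W_speech, n_noise_components=10, lam=0.1,
                   n_iter=100, eps=1e-12, seed=0):
    """X: complex STFT of the noisy mixture (freq x frames).
    W_speech: pre-trained universal speech dictionary (freq x K_s), kept fixed.
    Returns the complex STFT of the estimated speech."""
    rng = np.random.default_rng(seed)
    V = np.abs(X) + eps                  # magnitude spectrogram of the mixture
    F, N = V.shape
    K_s = W_speech.shape[1]

    # The noise dictionary and both activation matrices are estimated from the mixture.
    W_n = rng.random((F, n_noise_components)) + eps
    H_s = rng.random((K_s, N)) + eps     # speech activations ("first set of time activations")
    H_n = rng.random((n_noise_components, N)) + eps
    ones = np.ones_like(V)

    for _ in range(n_iter):
        # Multiplicative updates for the generalized Kullback-Leibler divergence.
        V_hat = W_speech @ H_s + W_n @ H_n + eps
        # The "+ lam" term in the denominator realizes the L1 sparsity penalty on the
        # speech activations; the universal dictionary W_speech is never updated.
        H_s *= (W_speech.T @ (V / V_hat)) / (W_speech.T @ ones + lam)

        V_hat = W_speech @ H_s + W_n @ H_n + eps
        H_n *= (W_n.T @ (V / V_hat)) / (W_n.T @ ones)

        V_hat = W_speech @ H_s + W_n @ H_n + eps
        W_n *= ((V / V_hat) @ H_n.T) / (ones @ H_n.T)

    # Wiener-style mask: keep the part of the mixture explained by the speech model.
    S = W_speech @ H_s
    mask = S / (S + W_n @ H_n + eps)
    return mask * X

# Typical use (STFT and inverse STFT omitted):
#   S_hat = enhance_speech(stft_of_noisy_mixture, W_universal)

The group-sparsity and norm-ratio penalties of claims 6 and 7 can be approximated within the same multiplicative update by replacing the constant lam with a per-group weight, for example lam / (eps + sum of the activations of group g), recomputed at each iteration, in the spirit of the universal speech model approach of Sun and Mysore listed among the non-patent citations below.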
PCT/EP2015/072344 2014-09-30 2015-09-29 Method and apparatus for speech enhancement based on source separation WO2016050725A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP14306540 2014-09-30
EP14306540.7 2014-09-30

Publications (1)

Publication Number Publication Date
WO2016050725A1 (en) 2016-04-07

Family

ID=51730467

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2015/072344 WO2016050725A1 (en) 2014-09-30 2015-09-29 Method and apparatus for speech enhancement based on source separation

Country Status (2)

Country Link
TW (1) TW201614641A (en)
WO (1) WO2016050725A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113656747A (en) * 2021-08-13 2021-11-16 南京理工大学 Array self-adaptive beam forming method under multiple expected signals based on branch and bound

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
D. L. SUN; G. J. MYSORE: "Universal speech models for speaker independent single channel source separation", IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP), May 2013 (2013-05-01)
HURMALAINEN ANTTI ET AL: "Modelling non-stationary noise with spectral factorisation in automatic speech recognition", COMPUTER SPEECH AND LANGUAGE, ELSEVIER, LONDON, GB, vol. 27, no. 3, 27 July 2012 (2012-07-27), pages 763 - 779, XP028969074, ISSN: 0885-2308, DOI: 10.1016/J.CSL.2012.07.008 *
N. MOHAMMADIHA; P. SMARAGDIS; A. LEIJON: "Supervised and unsupervised speech enhancement using nonnegative matrix factorization", IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, 2013
SUN DENNIS L ET AL: "Universal speech models for speaker independent single channel source separation", 2013 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP); VANCOUVER, BC; 26-31 MAY 2013, INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS, PISCATAWAY, NJ, US, 26 May 2013 (2013-05-26), pages 141 - 145, XP032508548, ISSN: 1520-6149, [retrieved on 20131018], DOI: 10.1109/ICASSP.2013.6637625 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108076238A (en) * 2016-11-16 2018-05-25 艾丽西亚(天津)文化交流有限公司 A kind of science and technology service packet audio mixing communicator
CN108573698A (en) * 2017-03-09 2018-09-25 中国科学院声学研究所 A kind of voice de-noising method based on gender fuse information
CN108573698B (en) * 2017-03-09 2021-06-08 中国科学院声学研究所 Voice noise reduction method based on gender fusion information
CN109346097A (en) * 2018-03-30 2019-02-15 上海大学 A kind of sound enhancement method based on Kullback-Leibler difference
CN109346097B (en) * 2018-03-30 2023-07-14 上海大学 Speech enhancement method based on Kullback-Leibler difference
US11227621B2 (en) 2018-09-17 2022-01-18 Dolby International Ab Separating desired audio content from undesired content
CN111710343A (en) * 2020-06-03 2020-09-25 中国科学技术大学 Single-channel voice separation method on double transform domains
CN111710343B (en) * 2020-06-03 2022-09-30 中国科学技术大学 Single-channel voice separation method on double transform domains
CN113823316A (en) * 2021-09-26 2021-12-21 南京大学 Voice signal separation method for sound source close to position
CN113823316B (en) * 2021-09-26 2023-09-12 南京大学 Voice signal separation method for sound source close to position

Also Published As

Publication number Publication date
TW201614641A (en) 2016-04-16

Similar Documents

Publication Publication Date Title
WO2016050725A1 (en) Method and apparatus for speech enhancement based on source separation
Qian et al. Speech Enhancement Using Bayesian Wavenet.
Han et al. Learning spectral mapping for speech dereverberation
JP7387634B2 (en) Perceptual loss function for speech encoding and decoding based on machine learning
WO2019204547A1 (en) Systems and methods for automatic speech recognition using domain adaptation techniques
US9215539B2 (en) Sound data identification
US9607627B2 (en) Sound enhancement through deverberation
CN110223708B (en) Speech enhancement method based on speech processing and related equipment
Venkataramani et al. Adaptive front-ends for end-to-end source separation
KR20160125984A (en) Systems and methods for speaker dictionary based speech modeling
US20230162758A1 (en) Systems and methods for speech enhancement using attention masking and end to end neural networks
CN111201569A (en) Electronic device and control method thereof
Richter et al. Speech Enhancement with Stochastic Temporal Convolutional Networks.
KR102026226B1 (en) Method for extracting signal unit features using variational inference model based deep learning and system thereof
Cui et al. Multi-objective based multi-channel speech enhancement with BiLSTM network
US20180358025A1 (en) Method and apparatus for audio object coding based on informed source separation
Ashraf et al. Underwater ambient-noise removing GAN based on magnitude and phase spectra
Jukić et al. Multi-channel linear prediction-based speech dereverberation with low-rank power spectrogram approximation
Şimşekli et al. Non-negative tensor factorization models for Bayesian audio processing
Chen et al. A dual-stream deep attractor network with multi-domain learning for speech dereverberation and separation
Zhou et al. Speech Enhancement via Residual Dense Generative Adversarial Network.
Badiezadegan et al. A wavelet-based thresholding approach to reconstructing unreliable spectrogram components
WO2020250220A1 (en) Sound analysis for determination of sound sources and sound isolation
Zhu et al. Maximum likelihood sub-band adaptation for robust speech recognition
Jukić et al. Speech dereverberation with convolutive transfer function approximation using MAP and variational deconvolution approaches

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15770908

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15770908

Country of ref document: EP

Kind code of ref document: A1