GB2510631A - Sound source separation based on a Binary Activation model - Google Patents

Sound source separation based on a Binary Activation model

Info

Publication number
GB2510631A
GB2510631A
Authority
GB
United Kingdom
Prior art keywords
signals
signal
combination
parameters
time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB201302398A
Other versions
GB201302398D0 (en)
Inventor
Gérald Kergourlay
Joachim Thiemann
Emmanuel Vincent
Nancy Bertin
Frederic Bimbot
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Centre National de la Recherche Scientifique CNRS
Institut National de Recherche en Informatique et en Automatique INRIA
Canon Inc
Original Assignee
Centre National de la Recherche Scientifique CNRS
Institut National de Recherche en Informatique et en Automatique INRIA
Canon Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Centre National de la Recherche Scientifique CNRS, Institut National de Recherche en Informatique et en Automatique INRIA, Canon Inc filed Critical Centre National de la Recherche Scientifique CNRS
Priority to GB201302398A priority Critical patent/GB2510631A/en
Priority to GB1304774.1A priority patent/GB2510650B/en
Publication of GB201302398D0 publication Critical patent/GB201302398D0/en
Publication of GB2510631A publication Critical patent/GB2510631A/en
Withdrawn legal-status Critical Current

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating
    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03H IMPEDANCE NETWORKS, e.g. RESONANT CIRCUITS; RESONATORS
    • H03H21/00 Adaptive networks

Abstract

A sound source (101, 102, fig. 1) separation method assumes that a single source predominates in each time-frequency bin, thus yielding a less computationally complex binary activation model for a microphone array in the presence of background noise. A Binary Activation Expectation Maximization (BAEM) algorithm takes source model parameters for each bin, inverts the source spatial covariance matrices and calculates their determinants 402, then computes the source posterior probabilities 403 (equations 25 & 26), then updates the model parameters 404 (equations 27-29) in order to maximize the expectation, yielding estimated Short Time Fourier Transform (STFT) coefficients by soft masking (equation 30). Compared to Sub-Source Expectation Maximization (SSEM), the inversion of R_{j,f} need only be calculated once for each source and frequency interval, rather than once for each time-frequency bin.

Description

METHOD AND APPARATUS FOR SOUND SOURCE SEPARATION BASED
ON A BINARY ACTIVATION MODEL
The present invention concerns a method and a device for sound source separation. More particularly it concerns a method for determining parameters defining sub-signals forming a combination of signals according to a given model. The model deals with signal sources recorded in reverberant and noisy environments.
Sound source separation aims at isolating the signals of sound sources of interest from a combination of signals from a plurality of sound sources, sometimes called a sound source mixture. It is a core function in sound source analysis, with applications ranging from audio-video surveillance and conferencing to sound enhancement, de-noising and pattern recognition.
A typical sound source separation system is illustrated in figure 1. Consider two audio sources 101 and 102, emitting audio signals 106 and 107 respectively. The sound source separation system comprises a microphone array 103, 104 and 105, composed of at least two microphones. These microphones 103, 104 and 105 are used to capture a multichannel audio combination of signals. This audio combination of signals is composed of the set of individual channel combinations of the signals 106 and 107 captured at each microphone of the array. The microphone array splits the combination of signals emitted by the sound sources into a plurality of channels, each channel consisting of its own representation of the combination of signals. The system also comprises a computing device 108 for executing a sound source separation algorithm. This algorithm is used to estimate the spatial images of the sound sources 109 from the combination of captured audio signals. It is worth noting that the outputs of the sound source separation algorithm are not the source signals themselves but the estimated spatial images of the sources, which are the contributions of each source to the recorded combination of signals.
The microphone array records the multichannel combination of signals from sound sources that predominate at different times, are located at different positions and exhibit different spectral contents. The sound source separation algorithm exploits those differences. A first difference is spatial diversity, which stands for the time differences of arrival of a given source at two different microphones. A second difference is spectral diversity. A third difference is temporal diversity, meaning the predominance of some sources at different moments.
Numerous sound source separation techniques have been proposed, ranging from early multichannel spatial filtering (beamforming) and single-channel spectral modeling techniques to recent techniques jointly exploiting spatial and spectral cues. The general Gaussian model-based framework, which underlies the Flexible Audio Source Separation Toolbox (FASST), is currently one of the most advanced frameworks: it enables the modeling of reverberant sources in noisy conditions by means of full-rank spatial covariance matrices, and the enforcement of spectral constraints by means of multilevel nonnegative matrix factorization (NMF). This Flexible Audio Source Separation Toolbox is described, for example, in A. Ozerov, E. Vincent, and F. Bimbot, "A general flexible framework for the handling of prior information in audio source separation", IEEE Transactions on Audio, Speech, and Language Processing 20(4), pp. 1118-1133 (2012).
This general Gaussian model-based framework considers probabilistic models of the spatial images of the sources, namely the local Gaussian model, in the time-frequency domain, whose parameters, typically spatial covariance matrices and power spectra, are estimated using the maximum likelihood criterion and an Expectation-Maximization (EM) algorithm. Computing the spatial covariance matrices and the power spectra in order to estimate the spatial images of the sources corresponds to the so-called variance-modeling paradigm.
Figure 2 illustrates a typical variance-modeling sound source separation algorithm. The microphone array comprises a plurality of microphones 103, 104 and 105. While the figure features three microphones, it may be understood that any number of microphones greater than or equal to two may be used. A digital sound capture device 209 captures and digitizes the combination of signals obtained from the microphones 103, 104 and 105. The produced signals are the signals $x_m(n)$, where m is the index of the microphone and n a time index.
These temporal signals enter the sound source separation algorithm module 108. Each signal from each microphone undergoes a Fourier transform 201, 202 and 203, typically a Short Time Fourier Transform or STFT, to get a set of mixture coefficients 204 representing the input signals in the frequency domain. Time is divided into periods called time-slots. In the frequency domain, frequencies are divided into frequency intervals. A given time-slot and frequency interval define a time-frequency bin. The mixture coefficients may be expressed in time-slot n and frequency interval f as:

$$\mathbf{x}_{fn} = [x_{1,fn}, \ldots, x_{M,fn}]^T \qquad (1)$$

where $x_{m,fn}$ is the short time Fourier transform coefficient of the captured sound $x_m(n)$, indexed with the microphone index $m$ corresponding to the channel, $f$ the frequency index and $n$ the time-slot index.
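As an illustration of this transform stage, the following minimal sketch (assuming NumPy and SciPy; the function name and array layout are chosen here for exposition and are not part of the described system) computes the mixture coefficients of equation (1) from the M captured channels:

```python
import numpy as np
from scipy.signal import stft

def mixture_stft(x, fs, nperseg=1024):
    """Mixture STFT coefficients x_{fn} of eq. (1).

    x : array of shape (M, n_samples), one row per microphone.
    Returns X of shape (F, N, M): for each time-frequency bin (f, n),
    X[f, n] is the M x 1 vector of mixture coefficients.
    """
    # scipy applies the STFT along the last axis, so each channel is
    # transformed independently into an (F, N) grid of bins.
    _, _, Z = stft(x, fs=fs, nperseg=nperseg)  # Z has shape (M, F, N)
    return np.transpose(Z, (1, 2, 0))          # reorder to (F, N, M)
```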
$$\mathbf{x}_{fn} = \sum_{j=1}^{J} \mathbf{y}_{j,fn} \qquad (2)$$

with $\mathbf{y}_{j,fn}$ the spatial image of the source or component $j$ in the time-frequency domain that one wants to estimate from the combination of signals, $J$ being the total number of components, which can include both actual sources and additional noise components. These spatial images $\mathbf{y}_{j,fn}$ are the actual data to be determined in a sound source separation system.
In order to determine an estimate of the spatial images of the sources, a model is used. This model aims at representing the spatial images. A local Gaussian model is typically used. According to this model, the components $\mathbf{y}_{j,fn}$ are modeled as Gaussian with covariance matrices:

$$\boldsymbol{\Sigma}_{y,j,fn} = v_{j,fn}\,\mathbf{R}_{j,f} \qquad (3)$$

where $v_{j,fn}$ is a parameter representing the power spectrum of the signal, and $\mathbf{R}_{j,f}$ is a spatial covariance matrix of size $M \times M$. The model is therefore defined by $J \times F$ covariance matrix parameters and $J \times N \times F$ power spectrum scalar variables. The sound source separation algorithm aims at estimating these model parameters.
The objective is to maximize the likelihood of the combination of signals:

$$P\left(\mathbf{x} \mid \{v_{j,fn}, \mathbf{R}_{j,f}\}\right) = \prod_{f,n} \frac{1}{\det\left(\pi \boldsymbol{\Sigma}_{x,fn}\right)}\, e^{-\mathbf{x}_{fn}^{H}\,\boldsymbol{\Sigma}_{x,fn}^{-1}\,\mathbf{x}_{fn}} \qquad (4)$$

where $\boldsymbol{\Sigma}_{x,fn} = \sum_{j=1}^{J} \boldsymbol{\Sigma}_{y,j,fn}$ is the covariance matrix of $\mathbf{x}_{fn}$, named the mixture covariance matrix, assuming the spatial image components are uncorrelated.
The sound source separation algorithm comprises two main steps. In a first step 205 a joint estimation of all model parameters is made using the so-called Expectation-Maximization algorithm, a non-linear optimization aiming at maximizing the likelihood. This first step takes as input some initial values 208 for the model parameters. These initial values may be random or chosen according to some prior knowledge if available. In a second step 206 a Maximum A Posteriori (MAP) or Minimum Mean Squared Error (MMSE) estimation of the short time Fourier transform coefficients of the mixture components is undertaken using multichannel Wiener filtering. This may be done according to the following equation:

$$\hat{\mathbf{y}}_{j,fn} = \boldsymbol{\Sigma}_{y,j,fn}\,\boldsymbol{\Sigma}_{x,fn}^{-1}\,\mathbf{x}_{fn} \qquad (5)$$

with $\boldsymbol{\Sigma}_{y,j,fn} = v_{j,fn}\mathbf{R}_{j,f}$ the estimated source component covariance matrix computed from the estimated model parameters $v_{j,fn}$ and $\mathbf{R}_{j,f}$ obtained in the first step, and with:

$$\boldsymbol{\Sigma}_{x,fn} = \sum_{j=1}^{J} \boldsymbol{\Sigma}_{y,j,fn} \qquad (6)$$

The actual estimation of the spatial images of the sources is then obtained by undertaking an inverse Fourier transform 207 to come back to the temporal domain. The resulting estimates $\hat{y}_m(t)$ 109 of the spatial source images are obtained.
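The Wiener filtering of equations (5) and (6) may be sketched as follows (a non-authoritative illustration assuming NumPy and the array shapes introduced above; in a complete system each estimated image would then be passed to an inverse STFT such as scipy.signal.istft, step 207):

```python
import numpy as np

def wiener_estimate(X, v, R):
    """MMSE estimates of the source spatial images, eqs. (5)-(6).

    X : (F, N, M) mixture STFT coefficients.
    v : (J, F, N) power spectra v_{j,fn}.
    R : (J, F, M, M) spatial covariance matrices R_{j,f}.
    Returns Y : (J, F, N, M) estimated spatial image coefficients.
    """
    # Sigma_y[j,f,n] = v[j,f,n] * R[j,f]; Sigma_x is the sum over sources.
    Sigma_y = v[..., None, None] * R[:, :, None, :, :]  # (J,F,N,M,M)
    Sigma_x = Sigma_y.sum(axis=0)                       # (F,N,M,M)
    # Solve Sigma_x w = x in every bin instead of forming the inverse.
    w = np.linalg.solve(Sigma_x, X[..., None])          # (F,N,M,1)
    return (Sigma_y @ w)[..., 0]                        # (J,F,N,M)
```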
The expectation maximization algorithm is a powerful method for obtaining maximum-likelihood parameter estimates in the presence of missing/hidden data. For historical reasons, expectation maximization algorithms have been assessed mainly in two-channel scenarios. Additional microphones can improve the separation performance by increasing the spatial resolution of the array, but they imply a greater computational cost. The increase of the computational cost as a function of the number of channels, being cubic, is especially dramatic for Gaussian expectation maximization-based techniques which consider the source signals as hidden data. Furthermore, experiments show that the increase in the size of the hidden data space requires more iteration steps until convergence. Altogether, this results in a major increase of the computational cost, which quickly becomes intractable.
The present invention has been devised to address one or more of the foregoing concerns. The proposal is to change the conventional observation model into a binary activation model, in which it is assumed that a single source predominates in each time-frequency bin. A significant computational gain is obtained, improving the speed of the expectation maximization algorithm with little or no degradation of the quality of the separated data.
According to a first aspect of the invention there is provided a method for determining parameters defining a plurality of signals forming a combination of signals, said parameters including the spatial covariance matrices of the signals, the combination of signals being transformed for a chosen frequency range divided into frequency intervals and over a period of time divided into time-slots, in order to obtain transformed coefficients, the method comprising:
- obtaining initial values of the parameters to be used as current values of the parameters; and
- determining new values of the parameters by running iteratively:
  o an expectation step to compute intermediate parameters based on the transformed coefficients;
  o a maximization step to update values of the parameters based on the obtained current values of the parameters and the computed intermediate parameters;
wherein the method further comprises:
- obtaining a prior probability for each signal to be predominant in the combination of signals;
and wherein the expectation step comprises:
- for every signal and every frequency interval, inverting the spatial covariance matrix; and
- for every time-slot, computing an intermediate parameter which is a posterior probability for each signal to be predominant in the combination of signals, based on the said prior probabilities, the inverted spatial covariance matrices and the transformed coefficients.
Accordingly, a better efficiency of the expectation maximization algorithm is achieved. A smaller number of iterations is needed before convergence due to the fact that the space of hidden variables has a smaller dimension, and the complexity of each iteration step is reduced.
In an embodiment the prior probability for each signal to be predominant in the combination of signals is fixed.
In an embodiment the expectation step further comprises: for every signal and every frequency interval, computing the determinant of the spatial covariance matrix of the signal, the posterior probability for a signal to be predominant in the combination of signals being further based on the determinant of the spatial covariance matrix of the signal.
In an embodiment the posterior probability for each signal to be predominant in the combination of signals is computed according to the equation:

$$\gamma_{j,fn} \propto \pi_{j,f}\,\frac{e^{-\mathrm{tr}\left(\boldsymbol{\Sigma}_{y,j,fn}^{-1}\widehat{\mathbf{R}}_{x,fn}\right)}}{\det\left(\pi\boldsymbol{\Sigma}_{y,j,fn}\right)}$$

where: $\boldsymbol{\Sigma}_{y,j,fn} = v_{j,fn}\mathbf{R}_{j,f}$; $v_{j,fn}$ is a parameter representing the power spectrum of the signal; $\mathbf{R}_{j,f}$ is a spatial covariance matrix; $\widehat{\mathbf{R}}_{x,fn}$ is the combination of signals covariance matrix for a given frequency interval and a given time-slot; $\pi_{j,f}$ is the prior probability of a signal of index j to be predominant in the combination of signals; n is the index of the time-slot and f the frequency interval.
In an embodiment the values of the parameters are updated according to:

$$\mathbf{R}_{j,f} = \frac{\sum_n \gamma_{j,fn}\,\widehat{\mathbf{R}}_{x,fn}/v_{j,fn}}{\sum_n \gamma_{j,fn}}$$

where: $v_{j,fn}$ is a parameter representing the power spectrum of the signal; $\widehat{\mathbf{R}}_{x,fn}$ is the combination of signals covariance matrix for a given frequency interval and a given time-slot; $\gamma_{j,fn}$ is the posterior probability of a signal of index j to be predominant in the combination of signals; n is the index of the time-slot and f the frequency interval.
In an embodiment the method further comprises:
- using the obtained new values of the parameters as second initial values; and
- determining second new values of the parameters by running iteratively a second expectation step and a second maximization step;
wherein the second expectation step comprises:
  o for every time-slot and frequency interval, inverting the combination of signals covariance matrix; and
  o for every signal, computing another intermediate parameter which is the posterior second order raw moment of the components of the source signals taken as hidden variables.
According to a further aspect of the invention there is provided a method for separating sound source signals defined by parameters determined by using a method according to the invention.
According to a further aspect of the invention there is provided a device for determining parameters defining a plurality of signals forming a combination of signals, said parameters including the spatial covariance matrices of the signals, the combination of signals being transformed for a chosen frequency range divided into frequency intervals and over a period of time divided into time-slots, in order to obtain transformed coefficients, the device comprising:
- an expectation module to compute intermediate parameters based on the transformed coefficients;
- a maximization module to update values of the parameters based on the obtained current values of the parameters and the computed intermediate parameters;
wherein the expectation module is configured for:
- for every signal and every frequency interval, inverting the spatial covariance matrix; and
- for every time-slot, computing an intermediate parameter which is a posterior probability for each signal to be predominant in the combination of signals, based on given prior probabilities for each signal to be predominant in the combination of signals, the inverted spatial covariance matrices and the transformed coefficients.
In an embodiment the expectation module is configured for computing the posterior probability for each signal to be predominant in the combination of signals according to the equation:

$$\gamma_{j,fn} \propto \pi_{j,f}\,\frac{e^{-\mathrm{tr}\left(\boldsymbol{\Sigma}_{y,j,fn}^{-1}\widehat{\mathbf{R}}_{x,fn}\right)}}{\det\left(\pi\boldsymbol{\Sigma}_{y,j,fn}\right)}$$

where: $\boldsymbol{\Sigma}_{y,j,fn} = v_{j,fn}\mathbf{R}_{j,f}$; $v_{j,fn}$ is a parameter representing the power spectrum of the signal; $\mathbf{R}_{j,f}$ is a spatial covariance matrix; $\widehat{\mathbf{R}}_{x,fn}$ is the combination of signals covariance matrix for a given frequency interval and a given time-slot; $\pi_{j,f}$ is the prior probability of a signal of index j to be predominant in the combination of signals; n is the index of the time-slot and f the frequency interval.
In an embodiment the expectation module is configured for updating the values of the parameters according to:

$$\mathbf{R}_{j,f} = \frac{\sum_n \gamma_{j,fn}\,\widehat{\mathbf{R}}_{x,fn}/v_{j,fn}}{\sum_n \gamma_{j,fn}}$$

where: $v_{j,fn}$ is a parameter representing the power spectrum of the signal; $\widehat{\mathbf{R}}_{x,fn}$ is the combination of signals covariance matrix for a given frequency interval and a given time-slot; $\gamma_{j,fn}$ is the posterior probability of a signal of index j to be predominant in the combination of signals; n is the index of the time-slot and f the frequency interval.
In an embodiment the device further comprises:
- a second expectation module and a second maximization module;
and wherein the second expectation module is configured for:
  o for every time-slot and frequency interval, inverting the combination of signals covariance matrix; and
  o for every signal, computing another intermediate parameter which is the posterior second order raw moment of the source signals taken as hidden variables.
According to a further aspect of the invention there is provided a device for separating sound source signals comprising a device according to the invention.
According to a further aspect of the invention there is provided a computer program product for a programmable apparatus, the computer program product comprising a sequence of instructions for implementing a method according to the invention, when loaded into and executed by the programmable apparatus.
According to a further aspect of the invention there is provided a computer-readable storage medium storing instructions of a computer program for implementing a method according to the invention.
At least parts of the methods according to the invention may be computer implemented. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a "circuit", "module" or "system". Furthermore, the present invention may take the form of a computer program product embodied in any tangible medium of expression having computer usable program code embodied in the medium.
Since the present invention can be implemented in software, the present invention can be embodied as computer readable code for provision to a programmable apparatus on any suitable carrier medium. A tangible carrier medium may comprise a storage medium such as a floppy disk, a CD-ROM, a hard disk drive, a magnetic tape device or a solid state memory device and the like. A transient carrier medium may include a signal such as an electrical signal, an electronic signal, an optical signal, an acoustic signal, a magnetic signal or an electromagnetic signal, e.g. a microwave or RF signal.
Embodiments of the invention will now be described, by way of example only, and with reference to the following drawings in which:
Figure 1 illustrates a typical sound source separation system;
Figure 2 illustrates a typical variance-modeling sound source separation algorithm;
Figure 3 illustrates an example of sub-sources expectation maximization algorithm;
Figure 4 illustrates an example of binary activation expectation maximization algorithm;
Figure 5 illustrates an alternative embodiment of an expectation maximization algorithm;
Figure 6 is a schematic block diagram of a computing device for implementation of one or more embodiments of the invention.
In the conventional observation model, the combination of signals is modeled as the sum of the contributions of all sources, each of them being modeled by a local Gaussian model.
Using the Short-Time Fourier Transform (STFT), the M x 1 vector of mixture short time Fourier transform coefficients in time-slot n and frequency interval f can be expressed as:

$$\mathbf{x}_{fn} = \sum_{j=1}^{J} \mathbf{y}_{j,fn} + \mathbf{b}_{fn} \qquad (7)$$

where $\mathbf{y}_{j,fn}$ is the so-called spatial image of the $j$-th source and $\mathbf{b}_{fn}$ is a small Gaussian noise with covariance $\boldsymbol{\Sigma}_{b,fn}$. The noise is added, relative to the equation defining the model, in order to improve convergence of the expectation maximization algorithm.
According to the general Gaussian model-based framework, the spatial images of all sources are modeled as Gaussian random vectors:

$$\mathbf{y}_{j,fn} \sim \mathcal{N}\left(\mathbf{0},\, v_{j,fn}\mathbf{R}_{j,f}\right) \qquad (8)$$

where $v_{j,fn}$ denotes the short-term power spectrum of the $j$-th source and $\mathbf{R}_{j,f}$ its spatial covariance matrix. The short-term power spectra $\mathbf{V}_j = [v_{j,fn}]_{fn}$ of each source are further assumed to factor in a multilevel nonnegative matrix factorization fashion as

$$\mathbf{V}_j = \left(\mathbf{W}_j^{ex}\mathbf{U}_j^{ex}\mathbf{G}_j^{ex}\mathbf{H}_j^{ex}\right) \odot \left(\mathbf{W}_j^{ft}\mathbf{U}_j^{ft}\mathbf{G}_j^{ft}\mathbf{H}_j^{ft}\right) \qquad (9)$$

where the nonnegative matrices $\mathbf{W}_j^{ex}, \mathbf{U}_j^{ex}, \mathbf{G}_j^{ex}, \mathbf{H}_j^{ex}$ encode the fine spectral structure, the spectral envelope, the temporal envelope and the temporal fine structure of the source excitation signal, $\mathbf{W}_j^{ft}, \mathbf{U}_j^{ft}, \mathbf{G}_j^{ft}, \mathbf{H}_j^{ft}$ represent the same quantities for the spectral resonance filter, and $\odot$ denotes entrywise matrix multiplication. In the following, we assume that the sources are reverberated or diffuse, so that $\mathbf{R}_{j,f}$ is full-rank.
Coming to the implementation of the sound source separation algorithm based on this model, the classical approach for maximum likelihood (ML) inference in Gaussian model-based source separation is to employ the expectation maximization algorithm, where the source time-frequency coefficients themselves are considered as hidden data. The algorithm iteratively computes the expectation of the log-likelihood of the complete data conditioned on the previous parameter estimates (the expectation step, or E-step) and maximizes this expectation with respect to the new parameter estimates (the maximization step, or M-step). This log-likelihood can be expressed in terms of the empirical mixture covariance matrix

$$\widehat{\mathbf{R}}_{x,fn} = \frac{\sum_{f',n'} \rho_{f,n}(f',n')\,\mathbf{x}_{f'n'}\mathbf{x}_{f'n'}^{H}}{\sum_{f',n'} \rho_{f,n}(f',n')}$$

averaged around each time-frequency bin $(f,n)$, with $\rho_{f,n}$ a weighting function defined in the time-frequency domain such that $\rho_{f,n}(f',n') = 0$ for $(f' < f-k)$ or $(f' > f+k)$, and for $(n' < n-1)$ or $(n' > n+1)$.
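A minimal sketch of this local averaging (assuming NumPy, a uniform weighting ρ equal to one inside the window, and the (F, N, M) coefficient layout used above) is:

```python
import numpy as np

def empirical_covariance(X, k=1):
    """Empirical mixture covariances \\hat{R}_{x,fn}: outer products
    x x^H averaged over a (2k+1) x 3 window around each bin.

    X : (F, N, M) mixture STFT coefficients.
    Returns Rx : (F, N, M, M).
    """
    F, N, M = X.shape
    outer = X[..., :, None] * X[..., None, :].conj()   # (F,N,M,M)
    Rx = np.zeros_like(outer)
    w = np.zeros((F, N, 1, 1))
    for df in range(-k, k + 1):        # frequency neighbours f' = f + df
        for dn in (-1, 0, 1):          # time neighbours n' = n + dn
            fsrc = slice(max(df, 0), F + min(df, 0))
            fdst = slice(max(-df, 0), F + min(-df, 0))
            nsrc = slice(max(dn, 0), N + min(dn, 0))
            ndst = slice(max(-dn, 0), N + min(-dn, 0))
            Rx[fdst, ndst] += outer[fsrc, nsrc]
            w[fdst, ndst] += 1.0
    return Rx / w                      # normalize by the weight sum
```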
This approach was applied to the general Gaussian model-based framework by defining vectors of sub-source coefficients $\mathbf{s}_{j,fn} = [s_{j1,fn}, \ldots, s_{jR,fn}]^T$ such that $\mathbf{y}_{j,fn} = \mathbf{A}_{j,f}\,\mathbf{s}_{j,fn}$ with $\mathbf{A}_{j,f}\mathbf{A}_{j,f}^{H} = \mathbf{R}_{j,f}$, and by considering them as hidden data. The sub-source concept generalizes that of "source" to the case when $\mathbf{R}_{j,f}$ is full-rank. The sub-sources and the matrices $\mathbf{A}_{j,f}$ are not unique but must verify $\mathbf{A}_{j,f}\mathbf{A}_{j,f}^{H} = \mathbf{R}_{j,f}$. The sub-sources are then the components of the source signals taken as hidden variables.
In the following models, the rank R of the spatial matrices $\mathbf{A}_{j,f}$, which encode the spatial properties of the sources, is greater than 1 and is taken to be R = M, M being the number of microphones. This enriched model takes better into account the reverberant and diffuse conditions, since it models to a certain extent the spatial spread of the sources and circumvents the narrowband approximation.
The resulting sub-source expectation maximization (SSEM) updates are summarized below. The expectation step relies on the computation of the Wiener filter $\boldsymbol{\Omega}_{s,fn}$, while the maximization step involves a closed-form update for $\mathbf{A}_f = [\mathbf{A}_{1,f}, \ldots, \mathbf{A}_{J,f}]$ and multiplicative updates for the multilevel nonnegative matrix factorization parameters.
Figure 3 illustrates an example of expectation maximization algorithm based on the general Gaussian model-based framework, called sub-sources expectation maximization. It takes as input the initial source model parameters 301, $v_{j,fn}$ and $\mathbf{R}_{j,f}$, and of course the mixture coefficients.
The expectation step 306 comprises a first step 302 undertaken for every time frequency bin and consisting in inverting the mixture covariance matrix and computing its determinant. Next, a second step 303 is undertaken for every source and consists in computing the posterior mean and covariance of the sub-sources.
This expectation step 306 consists in the computation of the full data statistics according to the following equations:

$$\widehat{\mathbf{R}}_{xs,fn} = \widehat{\mathbf{R}}_{x,fn}\,\boldsymbol{\Omega}_{s,fn}^{H} \qquad (10)$$

which is the posterior cross-covariance of the combination of signals and the sub-sources, and

$$\widehat{\mathbf{R}}_{s,fn} = \boldsymbol{\Omega}_{s,fn}\,\widehat{\mathbf{R}}_{x,fn}\,\boldsymbol{\Omega}_{s,fn}^{H} + \left(\mathbf{I} - \boldsymbol{\Omega}_{s,fn}\mathbf{A}_f\right)\boldsymbol{\Sigma}_{s,fn} \qquad (11)$$

which is the posterior second order raw moment of the sub-sources.
More precisely, the first term $\boldsymbol{\Omega}_{s,fn}\widehat{\mathbf{R}}_{x,fn}\boldsymbol{\Omega}_{s,fn}^{H}$ corresponds to the outer product of the posterior mean of the sub-sources with itself and the second term is the posterior covariance of the sub-sources, where

$$\boldsymbol{\Omega}_{s,fn} = \boldsymbol{\Sigma}_{s,fn}\,\mathbf{A}_f^{H}\,\boldsymbol{\Sigma}_{x,fn}^{-1} \qquad (12)$$

is the Wiener filter and implies the inversion of the mixture covariance matrix, with

$$\boldsymbol{\Sigma}_{x,fn} = \mathbf{A}_f\,\boldsymbol{\Sigma}_{s,fn}\,\mathbf{A}_f^{H} + \boldsymbol{\Sigma}_{b,fn} \qquad (13)$$

the prior covariance of the combination of signals, directly obtained from the relation $\mathbf{x}_{fn} = \mathbf{A}_f\mathbf{s}_{fn} + \mathbf{b}_{fn}$, and $\boldsymbol{\Sigma}_{s,fn} = \mathrm{diag}(v_{1,fn} \ldots v_{J,fn})$, with each $v_{j,fn}$ repeated M times, the prior covariance of the sub-sources.
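For one time-frequency bin, equations (10) to (13) may be sketched as follows (assuming NumPy and R = M sub-sources per source, so that A_f is M x JM; this is an illustrative reading of the updates, not the reference implementation):

```python
import numpy as np

def ssem_e_step(Rx, A, v, Sigma_b):
    """SSEM expectation step for one bin, eqs. (10)-(13).

    Rx      : (M, M) empirical mixture covariance \\hat{R}_{x,fn}.
    A       : (M, J*M) stacked mixing matrices [A_{1,f} ... A_{J,f}].
    v       : (J,) power spectra v_{j,fn} for this bin.
    Sigma_b : (M, M) additive noise covariance.
    """
    M = Rx.shape[0]
    # Prior covariance of the sub-sources: each v_j repeated M times.
    Sigma_s = np.diag(np.repeat(v, M))                  # (JM, JM)
    Sigma_x = A @ Sigma_s @ A.conj().T + Sigma_b        # eq. (13)
    # Wiener filter, eq. (12): one M x M inversion per bin.
    Omega = Sigma_s @ A.conj().T @ np.linalg.inv(Sigma_x)
    Rxs = Rx @ Omega.conj().T                           # eq. (10)
    Rs = (Omega @ Rx @ Omega.conj().T
          + (np.eye(M * len(v)) - Omega @ A) @ Sigma_s) # eq. (11)
    return Rxs, Rs
```

The M x M inversion of Sigma_x in every bin is exactly the cost that the binary activation model removes, as discussed further below.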
Next the maximization step 307 is undertaken and consists in updating the model parameters in a step 304. This maximization, which solves the optimization problem:

$$\begin{aligned}\mathcal{L}(\theta, \eta \mid \mathbf{x}, \mathbf{s}) = &-\sum_{fn} \mathrm{tr}\left[\boldsymbol{\Sigma}_{b,fn}^{-1}\left(\widehat{\mathbf{R}}_{x,fn} - \mathbf{A}_f\widehat{\mathbf{R}}_{xs,fn}^{H} - \widehat{\mathbf{R}}_{xs,fn}\mathbf{A}_f^{H} + \mathbf{A}_f\widehat{\mathbf{R}}_{s,fn}\mathbf{A}_f^{H}\right)\right] \\ &- \sum_{fn} \log\left|\pi\boldsymbol{\Sigma}_{b,fn}\right| - \sum_{j,fn} d_{IS}\left(\hat{\xi}_{j,fn} \mid v_{j,fn}\right) + \sum_{j,k} \log p\left(\theta_{j,k} \mid \eta_{j,k}\right) + \text{constant} \qquad (14)\end{aligned}$$

with parameters $\theta = (\mathbf{R}_{j,f}, \mathbf{W}_j^{ex}, \mathbf{U}_j^{ex}, \mathbf{G}_j^{ex}, \mathbf{H}_j^{ex}, \mathbf{W}_j^{ft}, \mathbf{U}_j^{ft}, \mathbf{G}_j^{ft}, \mathbf{H}_j^{ft})_{j,f}$, is conducted by updating the model parameters using the following equations.
$$\mathbf{A}_f = \left(\sum_n \widehat{\mathbf{R}}_{xs,fn}\right)\left(\sum_n \widehat{\mathbf{R}}_{s,fn}\right)^{-1} \qquad (15)$$

which corresponds to a closed-form update, and

$$\mathbf{C}_j \leftarrow \mathbf{C}_j \odot \frac{\mathbf{B}_j^{T}\left[\widehat{\boldsymbol{\Xi}}_j \odot \mathbf{E}_j^{\odot-1} \odot \left(\mathbf{B}_j\mathbf{C}_j\mathbf{D}_j\right)^{\odot-2}\right]\mathbf{D}_j^{T}}{\mathbf{B}_j^{T}\left[\left(\mathbf{B}_j\mathbf{C}_j\mathbf{D}_j\right)^{\odot-1}\right]\mathbf{D}_j^{T}} \qquad (16)$$

which corresponds to multiplicative updates, where $\widehat{\boldsymbol{\Xi}}_j = [\hat{\xi}_{j,fn}]_{fn}$ with

$$\hat{\xi}_{j,fn} = \frac{1}{R}\sum_{r=(j-1)R+1}^{jR} \left(\widehat{\mathbf{R}}_{s,fn}\right)_{rr} \qquad (17)$$

The eight matrices $\mathbf{W}_j^{ex}, \mathbf{U}_j^{ex}, \mathbf{G}_j^{ex}, \mathbf{H}_j^{ex}, \mathbf{W}_j^{ft}, \mathbf{U}_j^{ft}, \mathbf{G}_j^{ft}, \mathbf{H}_j^{ft}$ are updated in turn. Denoting by $\mathbf{C}_j$ the matrix to be updated, the factorization $\mathbf{V}_j = (\mathbf{W}_j^{ex}\mathbf{U}_j^{ex}\mathbf{G}_j^{ex}\mathbf{H}_j^{ex}) \odot (\mathbf{W}_j^{ft}\mathbf{U}_j^{ft}\mathbf{G}_j^{ft}\mathbf{H}_j^{ft})$ can always be rewritten as $\mathbf{V}_j = (\mathbf{B}_j\mathbf{C}_j\mathbf{D}_j) \odot \mathbf{E}_j$, where the nonnegative matrices $\mathbf{B}_j, \mathbf{D}_j, \mathbf{E}_j$ are assumed to be fixed while $\mathbf{C}_j$ is updated. For example if $\mathbf{C}_j = \mathbf{H}_j^{ft}$, then $\mathbf{B}_j = \mathbf{W}_j^{ft}\mathbf{U}_j^{ft}\mathbf{G}_j^{ft}$, $\mathbf{D}_j = \mathbf{I}$ and $\mathbf{E}_j = \mathbf{W}_j^{ex}\mathbf{U}_j^{ex}\mathbf{G}_j^{ex}\mathbf{H}_j^{ex}$.
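The multiplicative update of equation (16), and its weighted counterpart of equation (28) below, may be sketched as follows, assuming NumPy and real nonnegative factors; the small constant eps is an implementation choice guarding against division by zero, not part of the described update:

```python
import numpy as np

def multiplicative_update(C, B, D, E, Xi, Gamma=None):
    """One multiplicative update of an NMF factor C_j.

    V_j = (B C D) entrywise-times E, with B, D, E held fixed. Pass
    Gamma (the posterior weights) for the weighted update of eq. (28);
    leave it as None for the unweighted update of eq. (16).
    """
    eps = 1e-12
    G = 1.0 if Gamma is None else Gamma
    V = (B @ C @ D) * E + eps                # current model V_j
    num = B.T @ (G * Xi * E / V**2) @ D.T    # B^T [G.Xi.E.V^(-2)] D^T
    den = B.T @ (G * E / V) @ D.T + eps      # B^T [G.E.V^(-1)] D^T
    return C * num / den
```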
It is worth noting that this maximization step only requires matrix addition, multiplication and scalar division.
The expectation step and the maximization step are iterated to obtain the final source model parameters 305.
Once the parameters $\theta = (\mathbf{R}_{j,f}, \mathbf{W}_j^{ex}, \mathbf{U}_j^{ex}, \mathbf{G}_j^{ex}, \mathbf{H}_j^{ex}, \mathbf{W}_j^{ft}, \mathbf{U}_j^{ft}, \mathbf{G}_j^{ft}, \mathbf{H}_j^{ft})_{j,f}$ have been estimated, the source short time Fourier transform coefficients are obtained via the Wiener filter:

$$\hat{\mathbf{y}}_{j,fn} = v_{j,fn}\mathbf{R}_{j,f}\left(\sum_{j'=1}^{J} v_{j',fn}\mathbf{R}_{j',f}\right)^{-1}\mathbf{x}_{fn} \qquad (18)$$

corresponding to step 206 in figure 2.
According to one embodiment of the invention, the model used is changed to use the binary activation model. A reference to a similar model may be found in "T. Nakatani, S. Araki, T. Yoshioka, and M. Fujimoto, "Joint unsupervised learning of hidden Markov source models and source location models for multichannel source separation", in Proc. 2011 IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pp. 237-240, 2011".
Here the estimation is derived within the general Gaussian model-based framework which allows modeling a wide range of situations by retaining the desirable features of full-rank Gaussian models like handling of diffuse or reverberated sources and multilevel Non-Negative Matrix Factorization.
The binary activation model is based on the general Gaussian model with the following changes. Denoting by $l_{fn}$ the index of the predominant source in time-frequency bin $(f,n)$, we replace the conventional observation model of the foregoing by:

$$\mathbf{x}_{fn} = \mathbf{y}_{l_{fn},fn} \quad \text{with} \quad l_{fn} \sim \mathrm{Cat}\left(\pi_{1,f}, \ldots, \pi_{J,f}\right) \qquad (19)$$

where the Cat notation stands for the categorical distribution, the source spatial image coefficients $\mathbf{y}_{l_{fn},fn}$ follow the same model as in the foregoing and $\pi_{j,f}$ is the prior probability that the $j$-th source be predominant in this time-frequency bin. This prior probability may for instance be set to zero in certain time-slots for those sources which are known to be inactive and uniformly shared among the other sources, as in the sketch below.
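A minimal sketch of such a prior assignment (the boolean mask and function name are hypothetical bookkeeping chosen for illustration; the text only requires fixed priors, zero for known-inactive sources and uniform elsewhere):

```python
import numpy as np

def activation_priors(inactive):
    """Priors: zero where a source is known inactive, uniformly
    shared among the remaining sources in each time-slot.

    inactive : (J, N) boolean mask, True where source j is known to
               be silent in time-slot n (at least one source must
               remain active in every slot).
    """
    pi = np.where(inactive, 0.0, 1.0)            # (J, N)
    return pi / pi.sum(axis=0, keepdims=True)    # normalize over sources
```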
A new expectation-maximization algorithm may be derived using this binary activation model. The indexes of the predominant source are considered as hidden data. The resulting expectation-maximization algorithm updates are summarized below. The expectation step consists in computing the posterior probability of $l_{fn}$:

$$\gamma_{j,fn} = P\left(l_{fn} = j \mid \mathbf{x}, \theta\right) \qquad (20)$$

The maximization step solves the optimization problem:

$$\hat{\theta} = \underset{\theta}{\arg\max} \sum_{j,fn} \gamma_{j,fn}\left[-\mathrm{tr}\left(\boldsymbol{\Sigma}_{y,j,fn}^{-1}\widehat{\mathbf{R}}_{x,fn}\right) - \log\det\left(\pi\boldsymbol{\Sigma}_{y,j,fn}\right)\right] \qquad (21)$$

For the source spatial covariances $\mathbf{R}_{j,f}$, the update is obtained in closed form. For the multilevel nonnegative matrix factorization parameters, this optimization problem is equivalent to the weighted multilevel nonnegative matrix factorization problem:

$$\underset{\theta}{\arg\min} \sum_{j,fn} \gamma_{j,fn}\, d_{IS}\left(\hat{\xi}_{j,fn} \mid v_{j,fn}\right) \qquad (22)$$

where

$$d_{IS}(x \mid y) = \frac{x}{y} - \log\frac{x}{y} - 1 \qquad (23)$$

is the Itakura-Saito divergence and $\hat{\xi}_{j,fn}$ is defined by:

$$\hat{\xi}_{j,fn} = \frac{1}{M}\,\mathrm{tr}\left(\mathbf{R}_{j,f}^{-1}\widehat{\mathbf{R}}_{x,fn}\right) \qquad (24)$$

Figure 4 illustrates an example of expectation maximization algorithm based on the binary activation model, called here the binary activation expectation maximization (BAEM). It takes as input the same initial source model parameters 401, $v_{j,fn}$ and $\mathbf{R}_{j,f}$, plus a set of new fixed parameters consisting of the prior probabilities of source activation $\pi_{j,f}$. It also takes as input the mixture coefficients.
The expectation step 406 comprises a first step 402 undertaken for every source and every frequency interval and consisting in inverting the source spatial covariance matrix and computing its determinant. Next, a second step 403 is undertaken for every time-slot and consists in computing the source posterior probability from the inverted spatial covariance matrices and the short time Fourier transform coefficients.
It is worth noting that the term "expectation step" is somewhat of a misnomer here, as this step is no longer the computation of an expectation as it is in SSEM. The name is kept for clarity purposes. The same applies to the term "maximization step" in the context of the binary activation model.
This expectation step 406 comprises computing the posterior probability of $l_{fn}$ according to the following equations:

$$\gamma_{j,fn} \propto \pi_{j,f}\,\frac{e^{-\mathrm{tr}\left(\boldsymbol{\Sigma}_{y,j,fn}^{-1}\widehat{\mathbf{R}}_{x,fn}\right)}}{\det\left(\pi\boldsymbol{\Sigma}_{y,j,fn}\right)} \qquad (25)$$

where

$$\boldsymbol{\Sigma}_{y,j,fn} = v_{j,fn}\,\mathbf{R}_{j,f} \qquad (26)$$

Next the maximization step 407 is undertaken and consists in updating the model parameters $\theta = (\mathbf{R}_{j,f}, \mathbf{W}_j^{ex}, \mathbf{U}_j^{ex}, \mathbf{G}_j^{ex}, \mathbf{H}_j^{ex}, \mathbf{W}_j^{ft}, \mathbf{U}_j^{ft}, \mathbf{G}_j^{ft}, \mathbf{H}_j^{ft})_{j,f}$ of the optimization problem expressed in equation (21) in a step 404. This maximization is conducted by updating the model parameters using the following equations:

$$\mathbf{R}_{j,f} = \frac{\sum_n \gamma_{j,fn}\,\widehat{\mathbf{R}}_{x,fn}/v_{j,fn}}{\sum_n \gamma_{j,fn}} \qquad (27)$$

which corresponds to a closed-form update, and

$$\mathbf{C}_j \leftarrow \mathbf{C}_j \odot \frac{\mathbf{B}_j^{T}\left[\boldsymbol{\Gamma}_j \odot \widehat{\boldsymbol{\Xi}}_j \odot \mathbf{E}_j^{\odot-1} \odot \left(\mathbf{B}_j\mathbf{C}_j\mathbf{D}_j\right)^{\odot-2}\right]\mathbf{D}_j^{T}}{\mathbf{B}_j^{T}\left[\boldsymbol{\Gamma}_j \odot \left(\mathbf{B}_j\mathbf{C}_j\mathbf{D}_j\right)^{\odot-1}\right]\mathbf{D}_j^{T}} \qquad (28)$$

which corresponds to multiplicative updates, where $\boldsymbol{\Gamma}_j = [\gamma_{j,fn}]_{fn}$ and $\widehat{\boldsymbol{\Xi}}_j = [\hat{\xi}_{j,fn}]_{fn}$ with

$$\hat{\xi}_{j,fn} = \frac{1}{M}\,\mathrm{tr}\left(\mathbf{R}_{j,f}^{-1}\widehat{\mathbf{R}}_{x,fn}\right) \qquad (29)$$

Once the parameters $\theta$ have been estimated, the source short time Fourier transform coefficients are now obtained via soft masking:

$$\hat{\mathbf{y}}_{j,fn} = \gamma_{j,fn}\,\mathbf{x}_{fn} \qquad (30)$$

corresponding to step 206 in figure 2.
Since the same parameters have been estimated as for SSEM, it remains also possible to evaluate the source short time Fourier transform coefficients via the Wiener filter:

$$\hat{\mathbf{y}}_{j,fn} = v_{j,fn}\mathbf{R}_{j,f}\left(\sum_{j'=1}^{J} v_{j',fn}\mathbf{R}_{j',f}\right)^{-1}\mathbf{x}_{fn} \qquad (31)$$

In the sub-sources expectation maximization algorithm, when the number of channels M increases, the computational cost of each iteration step becomes dominated by that of the matrix inversion in the computation of the Wiener filter in the expectation step

$$\boldsymbol{\Omega}_{s,fn} = \boldsymbol{\Sigma}_{s,fn}\,\mathbf{A}_f^{H}\,\boldsymbol{\Sigma}_{x,fn}^{-1} \qquad (32)$$

which grows cubically with M. On the contrary, binary activation expectation maximization does not require the inversion of $\boldsymbol{\Sigma}_{x,fn}$ in each time-frequency bin anymore. Instead, by rewriting the first equation of the expectation step of binary activation expectation maximization

$$\gamma_{j,fn} \propto \pi_{j,f}\,\frac{e^{-\mathrm{tr}\left(\boldsymbol{\Sigma}_{y,j,fn}^{-1}\widehat{\mathbf{R}}_{x,fn}\right)}}{\det\left(\pi\boldsymbol{\Sigma}_{y,j,fn}\right)} \qquad (33)$$

as

$$\gamma_{j,fn} \propto \frac{\pi_{j,f}}{\left(\pi v_{j,fn}\right)^{M}\det\mathbf{R}_{j,f}}\, e^{-\mathrm{tr}\left(\mathbf{R}_{j,f}^{-1}\widehat{\mathbf{R}}_{x,fn}\right)/v_{j,fn}} \qquad (34)$$

it appears that the inverse and the determinant of $\mathbf{R}_{j,f}$ must be computed only once for each source in each frequency interval at each iteration step. The computational cost incurred by matrix inverses and determinants is therefore reduced by a factor of N/J, where N is the number of time-slots and J the number of sources. For typical values of N, this cost becomes negligible and the cost of each iteration step becomes dominated by the other operations, whose complexity grows linearly with the number of channels M. Additionally, it has been found by monitoring the log-likelihood in preliminary experiments that the number of iteration steps needed to reach convergence with binary activation expectation maximization is one order of magnitude smaller than for sub-sources expectation maximization, which is a natural consequence of the drastically reduced size of the hidden data space (from M x J x F x N complex-valued variables to F x N discrete variables).
One potential disadvantage of binary activation expectation maximization compared to sub-sources expectation maximization and to variants using MAP/MMSE estimation is that soft masking induces a smaller theoretical upper bound on the separation performance than multichannel Wiener filtering, because it does not combine the input channels together. This is not always a problem in practice: because the algorithm includes fewer hidden data, as illustrated before, it often converges to better solutions which are closer to the upper bound.
According to an alternative embodiment aiming at improving the separation performance it may be possible to mix the two expectation maximization algorithms. This embodiment is illustrated in figure 5.
In a first step 501, the parameters of the binary activation model are initialized. In a second step 502 a number of iteration steps of the binary activation expectation maximization algorithm are run in order to quickly converge to an approximate solution. These steps are run according to the foregoing. Next, the estimated parameter values, or quantities computed from these values, are used to initialize the sub-source model parameters in a step 503; as described in the foregoing, the variable parameters, namely $v_{j,fn}$ and $\mathbf{R}_{j,f}$, are common to both algorithms. In a step 504, some iteration steps of the sub-source expectation maximization algorithm are run in order to refine the parameter estimates 505. As an alternative to the sub-source expectation maximization, variants using MAP/MMSE estimation may be used. The parameters obtained are then used, as illustrated in step 206 of figure 2, to separate the sources by multichannel Wiener filtering.
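A sketch of this two-stage scheme follows (baem_iteration and ssem_iteration are hypothetical helpers standing for one full EM iteration of each algorithm; the iteration counts are illustrative, the text does not prescribe them):

```python
def hybrid_separation(X, theta0, n_baem=20, n_ssem=5):
    """Figure 5: fast BAEM convergence, then SSEM refinement."""
    theta = theta0                   # step 501: initialize the model
    for _ in range(n_baem):          # step 502: quick approximate solution
        theta = baem_iteration(X, theta)
    # Step 503: v_{j,fn} and R_{j,f} are common to both models and
    # carry over directly as the sub-source initialization.
    for _ in range(n_ssem):          # step 504: refine the estimates
        theta = ssem_iteration(X, theta)
    return theta                     # then Wiener filtering, step 206
```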
The binary activation expectation maximization algorithm is more efficient than conventional sound source separation algorithms by virtue of the increased speed of the expectation maximization algorithm.
A smaller number of iteration steps is needed to reach convergence because the space of hidden variables has a smaller dimension. The space of hidden variables may be up to 20 times smaller for 2 speech and 3 noise sources with 1 spectrum per noise source, when compared to sub-sources expectation maximization. The complexity of each iteration step is theoretically reduced by a factor of up to the number of time-slots divided by the number of sources, for a large number of channels. The inversion of the mixture covariance matrix in each time-frequency bin at each iteration step is not required anymore.
Instead, the inverse and the determinant of the spatial covariance matrix of each source need to be computed once in each frequency interval at each iteration step. This new expectation maximization algorithm retains the desirable features of full-rank Gaussian models and multilevel nonnegative matrix factorization. It brings little or no degradation of the signal to interference ratio.
SSEM and BAEM are used here in the field of sound source separation, but it can be envisaged to use them for other applications, such as multispectral imagery.
Figure 6 is a schematic block diagram of a computing device 600 for implementation of one or more embodiments of the invention. The computing device 600 may be a device such as a micro-computer, a workstation or a light portable device. The computing device 600 comprises a communication bus connected to:
- a central processing unit 601, such as a microprocessor, denoted CPU;
- a random access memory 602, denoted RAM, for storing the executable code of the method of embodiments of the invention as well as the registers adapted to record variables and parameters necessary for implementing the method according to embodiments of the invention, the memory capacity thereof being expandable by an optional RAM connected to an expansion port for example;
- a read only memory 603, denoted ROM, for storing computer programs for implementing embodiments of the invention;
- a network interface 604, typically connected to a communication network over which digital data to be processed are transmitted or received. The network interface 604 can be a single network interface, or composed of a set of different network interfaces (for instance wired and wireless interfaces, or different kinds of wired or wireless interfaces). Data packets are written to the network interface for transmission or are read from the network interface for reception under the control of the software application running in the CPU 601;
- a user interface 605 which may be used for receiving inputs from a user or to display information to a user;
- a hard disk 606, denoted HD, which may be provided as a mass storage device;
- an I/O module 607 which may be used for receiving/sending data from/to external devices such as a video source or display.
The executable code may be stored either in read only memory 603, on the hard disk 606 or on a removable digital medium such as for example a disk.
According to a variant, the executable code of the programs can be received by means of a communication network, via the network interface 604, in order to be stored in one of the storage means of the communication device 600, such as the hard disk 606, before being executed.
The central processing unit 601 is adapted to control and direct the execution of the instructions or portions of software code of the program or programs according to embodiments of the invention, which instructions are stored in one of the aforementioned storage means. After powering on, the CPU 601 is capable of executing instructions from main RAM memory 602 relating to a software application after those instructions have been loaded from the program ROM 603 or the hard-disc (HD) 606 for example. Such a software application, when executed by the CPU 601, causes the steps of the flowcharts shown in Figures 3 to 5 to be performed.
Any step of the algorithms shown in Figures 3 to 5 may be implemented in software by execution of a set of instructions or program by a programmable computing machine, such as a PC ("Personal Computer"), a DSP ("Digital Signal Processor") or a microcontroller; or else implemented in hardware by a machine or a dedicated component, such as an FPGA ("Field-Programmable Gate Array") or an ASIC ("Application-Specific Integrated Circuit").
Although the present invention has been described hereinabove with reference to specific embodiments, the present invention is not limited to the specific embodiments, and modifications which lie within the scope of the present invention will be apparent to a person skilled in the art.
Many further modifications and variations will suggest themselves to those versed in the art upon making reference to the foregoing illustrative embodiments, which are given by way of example only and which are not intended to limit the scope of the invention, that being determined solely by the appended claims. In particular the different features from different embodiments may be interchanged, where appropriate.
In the claims, the word "comprising" does not exclude other elements or steps, and the indefinite article "a" or "an" does not exclude a plurality. The mere fact that different features are recited in mutually different dependent claims does not indicate that a combination of these features cannot be advantageously used.

Claims (15)

  1. A method for determining parameters defining a plurality of signals forming a combination of signals, said parameters including the spatial covariance matrices of the signals, the combination of signals being transformed for a chosen frequency range divided into frequency intervals and over a period of time divided into time-slots, in order to obtain transformed coefficients, the method comprising:
- obtaining initial values of the parameters to be used as current values of the parameters; and
- determining new values of the parameters by running iteratively:
  o an expectation step to compute intermediate parameters based on the transformed coefficients;
  o a maximization step to update values of the parameters based on the obtained current values of the parameters and the computed intermediate parameters;
wherein the method further comprises:
- obtaining prior probability for each signal to be predominant in the combination of signals;
and wherein the expectation step comprises:
- for every signal and every frequency interval, inverting the spatial covariance matrix; and
- for every time-slot, computing an intermediate parameter which is a posterior probability for each signal to be predominant in the combination of signals, based on the said prior probabilities, the inverted spatial covariance matrices and the transformed coefficients.
  2. A method according to claim 1, wherein the prior probability for each signal to be predominant in the combination of signals is fixed.
  3. A method according to claim 1 or 2, wherein the expectation step further comprises: for every signal and every frequency interval, computing the determinant of the spatial covariance matrix of the signal, the posterior probability for a signal to be predominant in the combination of signals being further based on the determinant of the spatial covariance matrix of the signal.
  4. A method according to any one of claims 1 to 3, wherein the posterior probability for each signal to be predominant in the combination of signals is computed according to the equation:

$$\gamma_{j,fn} \propto \pi_{j,f}\,\frac{e^{-\mathrm{tr}\left(\boldsymbol{\Sigma}_{y,j,fn}^{-1}\widehat{\mathbf{R}}_{x,fn}\right)}}{\det\left(\pi\boldsymbol{\Sigma}_{y,j,fn}\right)}$$

where: $\boldsymbol{\Sigma}_{y,j,fn} = v_{j,fn}\mathbf{R}_{j,f}$; $v_{j,fn}$ is a parameter representing the power spectrum of the signal; $\mathbf{R}_{j,f}$ is a spatial covariance matrix; $\widehat{\mathbf{R}}_{x,fn}$ is the combination of signals covariance matrix for a given frequency interval and a given time-slot; $\pi_{j,f}$ is the prior probability of a signal of index j to be predominant in the combination of signals; n is the index of the time-slot and f the frequency interval.
  5. A method according to any one of claims 1 to 4, wherein the values of the parameters are updated according to:

$$\mathbf{R}_{j,f} = \frac{\sum_n \gamma_{j,fn}\,\widehat{\mathbf{R}}_{x,fn}/v_{j,fn}}{\sum_n \gamma_{j,fn}}$$

where: $v_{j,fn}$ is a parameter representing the power spectrum of the signal; $\widehat{\mathbf{R}}_{x,fn}$ is the combination of signals covariance matrix for a given frequency interval and a given time-slot; $\gamma_{j,fn}$ is the posterior probability of a signal of index j to be predominant in the combination of signals; n is the index of the time-slot and f the frequency interval.
  6. A method according to any one of claims 1 to 5, further comprising:
- using the obtained new values of the parameters as second initial values; and
- determining second new values of the parameters by running iteratively a second expectation step and a second maximization step;
wherein the second expectation step comprises:
  o for every time-slot and frequency interval, inverting the combination of signals covariance matrix; and
  o for every signal, computing another intermediate parameter which is the posterior second order raw moment of the components of the source signals taken as hidden variables.
  7. A method for separating sound source signals defined by parameters determined by using a method according to any one of claims 1 to 6.
  8. A device for determining parameters defining a plurality of signals forming a combination of signals, said parameters including the spatial covariance matrices of the signals, the combination of signals being transformed for a chosen frequency range divided into frequency intervals and over a period of time divided into time-slots, in order to obtain transformed coefficients, the device comprising:
- an expectation module to compute intermediate parameters based on the transformed coefficients;
- a maximization module to update values of the parameters based on the obtained current values of the parameters and the computed intermediate parameters;
wherein the expectation module is configured for:
- for every signal and every frequency interval, inverting the spatial covariance matrix; and
- for every time-slot, computing an intermediate parameter which is a posterior probability for each signal to be predominant in the combination of signals, based on given prior probabilities for each signal to be predominant in the combination of signals, the inverted spatial covariance matrices and the transformed coefficients.
  9. A device according to claim 8, wherein the expectation module is configured for computing the posterior probability for each signal to be predominant in the combination of signals according to the equation:

$$\gamma_{j,fn} \propto \pi_{j,f}\,\frac{e^{-\mathrm{tr}\left(\boldsymbol{\Sigma}_{y,j,fn}^{-1}\widehat{\mathbf{R}}_{x,fn}\right)}}{\det\left(\pi\boldsymbol{\Sigma}_{y,j,fn}\right)}$$

where: $\boldsymbol{\Sigma}_{y,j,fn} = v_{j,fn}\mathbf{R}_{j,f}$; $v_{j,fn}$ is a parameter representing the power spectrum of the signal; $\mathbf{R}_{j,f}$ is a spatial covariance matrix; $\widehat{\mathbf{R}}_{x,fn}$ is the combination of signals covariance matrix for a given frequency interval and a given time-slot; $\pi_{j,f}$ is the prior probability of a signal of index j to be predominant in the combination of signals; n is the index of the time-slot and f the frequency interval.
  10. A device according to claim 8 or 9, wherein the expectation module is configured for updating the values of the parameters according to:

$$\mathbf{R}_{j,f} = \frac{\sum_n \gamma_{j,fn}\,\widehat{\mathbf{R}}_{x,fn}/v_{j,fn}}{\sum_n \gamma_{j,fn}}$$

where: $v_{j,fn}$ is a parameter representing the power spectrum of the signal; $\widehat{\mathbf{R}}_{x,fn}$ is the combination of signals covariance matrix for a given frequency interval and a given time-slot; $\gamma_{j,fn}$ is the posterior probability of a signal of index j to be predominant in the combination of signals; n is the index of the time-slot and f the frequency interval.
  11. A device according to any one of claims 8 to 10, wherein the device further comprises:
- a second expectation module and a second maximization module;
and wherein the second expectation module is configured for:
  o for every time-slot and frequency interval, inverting the combination of signals covariance matrix; and
  o for every signal, computing another intermediate parameter which is the posterior second order raw moment of the source signals taken as hidden variables.
  12. A device for separating sound source signals comprising a device according to any one of claims 8 to 11.
  13. A computer program product for a programmable apparatus, the computer program product comprising a sequence of instructions for implementing a method according to any one of claims 1 to 7, when loaded into and executed by the programmable apparatus.
  14. A computer-readable storage medium storing instructions of a computer program for implementing a method according to any one of claims 1 to 7.
  15. A method for determining parameters defining a model of signals substantially as hereinbefore described with reference to, and as shown in, Figures 2, 4 and 5.
GB201302398A 2013-02-11 2013-02-11 Sound source separation based on a Binary Activation model Withdrawn GB2510631A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
GB201302398A GB2510631A (en) 2013-02-11 2013-02-11 Sound source separation based on a Binary Activation model
GB1304774.1A GB2510650B (en) 2013-02-11 2013-03-15 Method and apparatus for sound source separation based on a binary activation model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB201302398A GB2510631A (en) 2013-02-11 2013-02-11 Sound source separation based on a Binary Activation model

Publications (2)

Publication Number Publication Date
GB201302398D0 GB201302398D0 (en) 2013-03-27
GB2510631A true GB2510631A (en) 2014-08-13

Family

ID=47998948

Family Applications (2)

Application Number Title Priority Date Filing Date
GB201302398A Withdrawn GB2510631A (en) 2013-02-11 2013-02-11 Sound source separation based on a Binary Activation model
GB1304774.1A Active GB2510650B (en) 2013-02-11 2013-03-15 Method and apparatus for sound source separation based on a binary activation model

Family Applications After (1)

Application Number Title Priority Date Filing Date
GB1304774.1A Active GB2510650B (en) 2013-02-11 2013-03-15 Method and apparatus for sound source separation based on a binary activation model

Country Status (1)

Country Link
GB (2) GB2510631A (en)


Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11152014B2 (en) 2016-04-08 2021-10-19 Dolby Laboratories Licensing Corporation Audio source parameterization
US20190200777A1 (en) * 2017-12-28 2019-07-04 Sleep Number Corporation Bed having sensors features for determining snore and breathing parameters of two sleepers
CN109684603B (en) * 2019-01-09 2019-09-03 四川大学 A kind of Efficient Solution large scale matrix determinant can verify that outsourcing calculation method, client and cloud computing system

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080052074A1 (en) * 2006-08-25 2008-02-28 Ramesh Ambat Gopinath System and method for speech separation and multi-talker speech recognition

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080052074A1 (en) * 2006-08-25 2008-02-28 Ramesh Ambat Gopinath System and method for speech separation and multi-talker speech recognition

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"A Two-Stage Frequency-Domain Blind Source Separation Method for Underdetermined Convolutive Mixtures" (SAWADA et al.) - 2007 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 21st October 2007, p.139-143. *
"Joint unsupervised learning of hidden Markov source models and source location models for multichannel source separation" (NAKATANI et al.) 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 22nd May 2011, p.237-240. *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017176968A1 (en) * 2016-04-08 2017-10-12 Dolby Laboratories Licensing Corporation Audio source separation
US10410641B2 (en) 2016-04-08 2019-09-10 Dolby Laboratories Licensing Corporation Audio source separation
US10818302B2 (en) 2016-04-08 2020-10-27 Dolby Laboratories Licensing Corporation Audio source separation

Also Published As

Publication number Publication date
GB201302398D0 (en) 2013-03-27
GB2510650A (en) 2014-08-13
GB2510650B (en) 2015-07-01
GB201304774D0 (en) 2013-05-01

Similar Documents

Publication Publication Date Title
US10446171B2 (en) Online dereverberation algorithm based on weighted prediction error for noisy time-varying environments
Pedersen et al. Convolutive blind source separation methods
KR100486736B1 (en) Method and apparatus for blind source separation using two sensors
US20170251301A1 (en) Selective audio source enhancement
US9536538B2 (en) Method and device for reconstructing a target signal from a noisy input signal
GB2510631A (en) Sound source separation based on a Binary Activation model
US9520138B2 (en) Adaptive modulation filtering for spectral feature enhancement
Cord-Landwehr et al. Monaural source separation: From anechoic to reverberant environments
Wang et al. Mask weighted STFT ratios for relative transfer function estimation and its application to robust ASR
JP2022529912A (en) Methods and equipment for determining deep filters
Oudre Interpolation of missing samples in sound signals based on autoregressive modeling
Ono et al. User-guided independent vector analysis with source activity tuning
EP2774147B1 (en) Audio signal noise attenuation
US11902757B2 (en) Techniques for unified acoustic echo suppression using a recurrent neural network
Du et al. Semi-supervised multichannel speech separation based on a phone-and speaker-aware deep generative model of speech spectrograms
Kim et al. Sound source separation algorithm using phase difference and angle distribution modeling near the target.
EP3557576B1 (en) Target sound emphasis device, noise estimation parameter learning device, method for emphasizing target sound, method for learning noise estimation parameter, and program
US11790929B2 (en) WPE-based dereverberation apparatus using virtual acoustic channel expansion based on deep neural network
Delcroix et al. Multichannel speech enhancement approaches to DNN-based far-field speech recognition
Wang et al. Low-latency real-time independent vector analysis using convolutive transfer function
Li et al. Low complex accurate multi-source RTF estimation
Jukić et al. Speech dereverberation with convolutive transfer function approximation using MAP and variational deconvolution approaches
Pérez-López et al. Blind reverberation time estimation from ambisonic recordings
CN114220453B (en) Multi-channel non-negative matrix decomposition method and system based on frequency domain convolution transfer function
US20240085935A1 (en) Neuralecho: a self-attentive recurrent neural network for unified acoustic echo suppression, speaker aware speech enhancement and automatic gain control

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)