WO2014047025A1 - Source separation using a circular model - Google Patents
- Publication number
- WO2014047025A1 (PCT/US2013/060044)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- phase
- source
- frequency
- sensor
- sources
- Prior art date
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
- H04R1/00—Details of transducers, loudspeakers or microphones
- H04R1/08—Mouthpieces; Microphones; Attachments therefor
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
- G10L21/0308—Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
- H04R3/00—Circuits for transducers, loudspeakers or microphones
- H04R3/04—Circuits for transducers, loudspeakers or microphones for correcting frequency response
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L2021/02161—Number of inputs available containing the signal or the noise to be suppressed
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
- G10L21/028—Voice signal separating using properties of sound source
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
- H04R2430/00—Signal processing covered by H04R, not provided for in its groups
Definitions
- a "soft" mask may be used, with mask values given by the posterior probabilities of the desired source
- k is the index of the desired source. Note that in the distributional form, as the parameter k increases, the hardness of the soft mask is increased by concentrating the distribution near the line corresponding to each source.
- An alternative embodiment relaxes the assumption that the phase difference between microphones is proportional to frequency, or equivalently that the (x_i, y_i) points for a source lie on a straight line in the wrapped space.
- a variety of factors can affect such deviation from a straight line, although one should understand that these factors may not be present in all cases and that other factors may affect the shape of the relationship.
- One factor is that the multiple microphones may have somewhat different phase response as a function of frequency. Therefore, the difference in the phase responses will manifest as deviation from a straight line.
- Another factor is reverberation, which may also manifest as deviation from an ideal straight line.
- One approach to relaxing the straight-line assumption is to use a spline approximation, for example, using a cubic spline with a fixed number of knots at variable frequencies.
- One way to introduce the spline approximations into the procedure is to first follow the procedure described above to determine the straight-line parameters a_k for the K sources k = 1, …, K.
- each data pair (x_i, y_i) is fractionally associated with a source k and wrap index l according to a set of weights w_ikl
- the weights w_ikl are coupled to the parameters of the spline functions f_k(x), which is a reason that the estimation of the spline parameters is performed in this iterative manner.
- the fractionally weighted data pairs are used to update the spline parameters according to conventional techniques.
- the parameters θ_1, …, θ_K represent the parameters of the K spline fits.
- soft mask values at a frequency y_i with an observed phase x_i at that frequency may be computed using a posterior probability approach similar to that described previously.
- the mask m is formed in block 136 using any one of the approaches described above. This mask is then passed to a source estimation block 138, which modifies each STFT received from a Fourier Transform block 132 for one of the microphones (e.g., Microphone 1) prior to reconstruction of a time signal, for example, using a conventional overlap-add technique. For example, windowed 1024-point STFTs are computed with a window hop size of 256.
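The analysis/mask/overlap-add pipeline described in that passage can be sketched as follows. This is a simplified NumPy illustration, not the patent's implementation: it uses the stated 1024-point window and hop of 256, but an all-ones placeholder mask (a real mask would come from the per-source posteriors), and omits the numbered blocks.

```python
import numpy as np

# Window and hop sizes from the text: 1024-point windowed STFTs, hop 256.
n_fft, hop = 1024, 256
win = np.hanning(n_fft)
sig = np.random.default_rng(2).standard_normal(8 * n_fft)

# Analysis: windowed FFT frames.
starts = range(0, len(sig) - n_fft + 1, hop)
frames = np.array([np.fft.rfft(win * sig[s:s + n_fft]) for s in starts])

# Placeholder mask: all ones keeps the signal unchanged; in the described
# system this would be the hard or soft mask for the desired source.
mask = np.ones_like(frames, dtype=float)

# Synthesis: inverse FFT of the masked frames, overlap-added with
# squared-window normalization.
out = np.zeros(len(sig))
norm = np.zeros(len(sig))
for frame, s in zip(np.fft.irfft(mask * frames, n_fft), starts):
    out[s:s + n_fft] += win * frame
    norm[s:s + n_fft] += win ** 2
ok = norm > 1e-8
out[ok] /= norm[ok]
```

With the all-ones mask, interior samples are reconstructed exactly, which is a useful sanity check before substituting a real mask.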
- the approach can be applied to multiple microphones, defining a (or θ), x, and y as vectors (e.g., of dimension 2 for three microphones).
- Various forms of distribution may be used, for example, assuming the dimensions are independent and using a product of the densities over the dimensions.
- each data sample associates a frequency with a tuple of relative phases.
- the slopes of the phase vs. frequency lines are related according to the coordinates of the microphones. Therefore, the procedure described above for the two-microphone case can be extended by defining an "inlier" to depend on all the relative phases observed. For example, the relative phases must be sufficiently near the estimated line for all the relative phases measured, or the product of the probabilities (e.g., the sum of the exponent terms k cos(x_i − a y_i)) must be above a threshold.
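A minimal sketch of such a combined multi-pair inlier test (my own function names, and an assumed threshold choice of 0.9 times the maximum score, which is not specified in the text):

```python
import numpy as np

def multi_inlier(x_vec, y, a_vec, kappa=4.0, c0=0.9):
    """Inlier test over several microphone pairs: threshold the summed
    exponent terms kappa*cos(x_j - a_j*y), which is equivalent to
    thresholding the product of the per-pair densities."""
    x_vec = np.asarray(x_vec, dtype=float)
    a_vec = np.asarray(a_vec, dtype=float)
    score = np.sum(kappa * np.cos(x_vec - a_vec * y))
    # Assumed threshold: a fixed fraction c0 of the maximum possible score.
    return score >= c0 * kappa * len(x_vec)

a_vec = [0.1, 0.2]  # hypothetical per-pair slopes for one source
y = 10.0
x_on = (np.array(a_vec) * y) % (2 * np.pi)  # phases exactly on both lines
```

A point on all the lines passes the test; shifting every phase by 1 radian fails it.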
- Prior information regarding probability of source given frequency may also be included, for example, in addition or instead of the prior information based on tracking over time.
- in the discussion above, an assumption that is made is that the prior probability for each source, and more particularly the prior probability for each source at each frequency, is fixed, and in particular that these priors are equal.
- In other situations, other information is available for separating the sources, such that Pr(source k), or Pr(source k | frequency), is not uniform.
- An example of such a source of information includes a tracking (recognition) of the spectral characteristics of each source, for example, according to a speech production model, such that past spectral characteristics for a source provide information about the presence of that source's signal in each of the frequency bins at a current window where the source time signal is being reconstructed.
- Another source of prior information relates to locations of the sources.
- a prior probability distribution for source locations can be combined with the conditional probabilities of the frequency/phase samples (e.g., a mixture distribution form introduced above) given the locations to yield a Bayesian estimate (e.g., a posterior distribution) of the source locations.
- source locations may be tracked by including a model of movement of sources (e.g., random walks) for prediction and the frequency/phase samples for updating of the source locations, for example, using Kalman filtering or a similar approach.
- Embodiments of the approaches described above may be implemented in software, in hardware, or a combination of software and hardware.
- Software can include instructions (e.g., machine instructions, higher level language instructions, etc.) stored on a tangible computer readable medium for causing a processor to perform the functions described above.
Abstract
An approach to separating multiple sources exploits the observation that each source is associated with a linear-circular phase characteristic in which the relative phase between pairs of microphones follows a linear (modulo 2π) pattern. In some examples, a modified RANSAC (Random Sample Consensus) approach is used to identify the frequency/phase samples that are attributed to each source. In some examples, either in combination with the modified RANSAC approach or using other approaches, a wrapped variable representation is used to represent a probability density of phase, thereby avoiding a need to "unwrap" phase in applying probabilistic techniques to estimating delay between sources.
Description
SOURCE SEPARATION USING A CIRCULAR MODEL
Cross-Reference to Related Applications
[001] This application claims the benefit of U.S. Provisional Application No.
61/702,993 filed on September 19, 2012, the entire contents of which is incorporated herein by reference.
Background
[002] This invention relates to separating source signals.
[003] Multiple sound sources may be present in an environment in which audio signals are received by multiple microphones. Localizing, separating, and/or tracking the sources can be useful in a number of applications. For example, in a multiple-microphone hearing aid, one of multiple sources may be selected as the desired source whose signal is provided to the user of the hearing aid. The better the desired source is identified in the microphone signals, the better the user's perception of the desired signal, hopefully providing higher intelligibility, lower fatigue, etc.
[004] Interaural phase differences (IPD) have been used for source separation since the 1990s. It was shown by Rickard and Yilmaz that blind source separation is possible using just IPDs and interaural level differences (ILD) with the Degenerate Unmixing Estimation Technique (DUET). DUET relies on the condition that the sources to be separated exhibit W-disjoint orthogonality: the energy in each time-frequency bin of the mixture's Short-Time Fourier Transform (STFT) is dominated by a single source. If this is true, the mixture STFT can be partitioned into disjoint sets such that only the bins assigned to the jth source are used to reconstruct it. The bin assignments are known as binary masks. In theory, as long as the sources are W-disjoint orthogonal, perfect separation can be achieved. Good separation can be achieved in practice even though speech signals are only approximately orthogonal.
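As a minimal illustration of W-disjoint orthogonality and binary masking (a standard DUET-style sketch with synthetic signals, not the patent's method), two sources occupying disjoint frequency bins can be recovered from a single mixture:

```python
import numpy as np

# Two synthetic sources whose energy lies in disjoint frequency bins.
n = 1024
t = np.arange(n)
s1 = np.sin(2 * np.pi * 50 * t / n)   # source 1: energy at bin 50
s2 = np.sin(2 * np.pi * 200 * t / n)  # source 2: energy at bin 200
mix = s1 + s2

X = np.fft.rfft(mix)
# Binary mask: assign each bin to the source that dominates it (here the
# dominance is computed from the known sources, purely for illustration).
mask1 = np.zeros_like(X, dtype=float)
mask1[np.abs(np.fft.rfft(s1)) > np.abs(np.fft.rfft(s2))] = 1.0
s1_hat = np.fft.irfft(mask1 * X, n)
```

Because the sources here are exactly W-disjoint orthogonal, the masked reconstruction `s1_hat` matches `s1` to numerical precision; real speech mixtures satisfy this only approximately.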
Summary
[005] In one aspect, in general, an approach to separating multiple sources exploits the observation that each source is associated with a linear-circular phase characteristic in which the relative phase between pairs of microphones follows a linear (modulo 2π ) pattern. In some examples, a modified RANSAC (Random Sample Consensus) approach
is used to identify the frequency/phase samples that are attributed to each source. In some examples, either in combination with the modified RANSAC approach or using other approaches, a wrapped variable representation is used to represent a probability density of phase, thereby avoiding a need to "unwrap" phase in applying probabilistic techniques to estimating delay between sources.
[006] In examples in which modified RANSAC (Random Sample Consensus) is applied to fit multiple wrapped lines to circular-linear data, the approach can have an advantage of avoiding issues with local maxima where optimization strategies (e.g., EM, gradient descent) fail: there may be many (50+%) outliers present in the data, and lines may cross over each other.
[007] In some examples, the modified RANSAC approach is applied to perform source separation by treating the phase differences (IPD) between two or more microphones as wrapped variables. Once wrapped lines are fit to the IPD data, the signals are separated by constructing a probabilistic (soft) mask or a binary mask from the data and the lines. Since the lines correspond to directions of arrival (DOA) of the source signals in physical space, they can be validated to ensure that the model fit by RANSAC doesn't violate the laws of wave propagation. This is done by forcing the model estimates to lie on the manifold of physically possible inter-microphone delays. In this way, RANSAC can be applied to source separation as well as source localization in 2D and 3D with an arbitrary number of microphones.
[008] In another aspect, in general, a method for separating source signals from a plurality of sources uses a plurality of sensors. A first signal is accepted at each of the sensors. The first signal includes a combination of multiple of the source signals and each sensor provides a corresponding first sensor signal representing the first signal. For each of a set of pairs of sensors, phase values are determined for a plurality of frequencies of the pair of the first sensor signals provided by the pair of sensors, and a parametric relationship between phase and frequency for each of a plurality of signal sources included in the sensor signals is estimated. The parametric relationship characterizes a periodic distribution of phase at each frequency for each source. A second signal is accepted at each of the sensors, each sensor providing a corresponding second sensor signal representing the second signal. For each of a set of pairs of sensors, phase values for a plurality of frequencies of the pair of the second sensor signals accepted at the pair of sensors are determined. A frequency mask is formed corresponding to a desired source of the plurality of sources from the determined phase values of the second sensor signals
and the periodic distribution of phase characterized by the parametric relationships estimated from the first signals.
[009] Aspects may include one or more of the following features.
[010] The method further includes combining at least one of the second sensor signals and the frequency mask to determine an estimate of a source signal received at the sensors from the selected one of the sources.
[011] The sources comprise acoustic signal sources and the sensors comprise microphones and the first sensor signals and the second sensor signals each includes a representation of an acoustic signal received from the selected source at the microphones.
[012] Estimating the parametric relationship between phase and frequency includes applying an iteration. Each iteration includes generating a set of candidate parameters, and selecting a best parameter from the candidate parameters according to a degree to which a parametric relationship with said parameter accounts for the determined phase values.
[013] Applying the iteration includes, at each of at least some of the iterations, selecting the best parameter according to a degree to which a parametric relationship with said parameter accounts for determined phase values not accounted for according to parameters of prior iterations.
[014] In some examples, estimating the parametric relationship between phase and frequency includes estimating a linear relationship. In some examples, estimating the parametric relationship between phase and frequency includes estimating a parametric curve relationship. For instance, estimating a parametric curve relationship includes estimating a spline relationship.
[015] Forming the frequency mask includes forming a binary frequency mask.
[016] Estimating the parametric relationships comprises applying a RANSAC (Random Sample Consensus) procedure.
[017] Other features and advantages of the invention are apparent from the following description, and from the claims.
Description of Drawings
[018] FIG. 1 is a diagram of a source localization and estimation system.
[019] FIG. 2 shows a relationship of relative phase and frequency with multiple sources.
Description
[020] Referring to FIG. 1, in one example implementation, three audio sources 110 are distributed in a space in which a receiver 120 receives signals from the sources at two microphones 122 (i.e., audio sensors). Each of the microphone signals is transformed into the frequency domain, for example, using a Short Time Fourier Transform (STFT) implemented in a Fast Fourier Transform (FFT) block 132. The complex frequency components of the transformed signals are divided, yielding a relative frequency domain complex signal X(ω). In the discussion below, x(ω) = ∠X(ω), the phase of the frequency domain signal at frequency ω, where x(ω) ∈ [0, 2π).
[021] If there is only a single source, say source 1, and the difference in signal propagation delay from the source to microphone 1 and from the source to microphone 2 is τ, then the phase x(ω) is concentrated on a wrapped line x = τω mod 2π, where τ is in seconds and ω is in radians per second. The phase is not exactly on a line due to factors including noise in the microphone signals and differences in the transfer function from the source to the two microphones not purely due to delay. In a discrete domain, each STFT yields a set of data points (x_i, y_i), where the y_i are scaled versions of the corresponding frequencies ω_i. Combining the data points from multiple STFTs yields a sample distribution in phase which is concentrated near the line x_i = a y_i mod 2π, where a is a multiple of the delay τ.
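The relative phase and its wrapped-line structure can be reproduced in a few lines of NumPy. This sketch (my own variable names; a circular delay is used so the line relationship is exact) computes the phase of the ratio of the two microphone spectra for a pure delay of `d` samples:

```python
import numpy as np

n, d = 512, 3  # signal length and inter-microphone delay in samples
s = np.random.default_rng(1).standard_normal(n)
m1 = s
m2 = np.roll(s, d)  # microphone 2 receives the source delayed by d samples

X1, X2 = np.fft.rfft(m1), np.fft.rfft(m2)
x = np.angle(X1 / X2) % (2 * np.pi)  # relative phase x(w) in [0, 2*pi)

# For a pure circular delay, the phase follows the wrapped line
# x = (2*pi*d/n)*k mod 2*pi, where k indexes the frequency bins.
k = np.arange(len(x))
expected = (2 * np.pi * d * k / n) % (2 * np.pi)
```

Comparing `x` and `expected` circularly (modulo 2π) shows the samples lie on the wrapped line; with real microphones, noise and reverberation scatter the points around it, as in FIG. 2.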
[022] In some discussion below, rather than referring to the delay variable a, an equivalent direction of arrival θ = sin⁻¹(a / (π m)) is used, where θ ∈ [−π, π) and m is suitably chosen so that −1 ≤ a / (π m) < 1. However, it should be understood that because of the 1-1 relationship between the two variables, either can be used, and in the derivations and examples below, setting or determining one of the two variables should be understood to correspond to setting or determining the other of the two variables as well.
[023] Referring to FIG. 2, an example of the sample distribution in a simulation for two audio sources in a reverberant environment is shown, where the phase (x) axis is illustrated in the range x ∈ [−π, π] and labeled "IPD", and the frequency axis is in units of frequency bins of a Discrete Fourier Transform. Lines characterizing the relative delays for the two sources show that the data samples are indeed somewhat concentrated near the lines.
[024] A probabilistic model is used to characterize the data in FIG. 2. In particular, at any frequency y, and a particular source i with delay variable a_i, the probability density of the phase is assumed to take the form p(x | y, a_i) ∝ exp(k cos(x − a_i y)). Note that due to the periodic nature of cos(), the term a_i y can be replaced, for example for numerical reasons, with a_i y mod 2π or (a_i y + π) mod 2π − π, without changing p(x | y, a_i). Note that exp(k cos(x − a_i y)) is unimodal with a peak of exp(k) at x = a_i y. The integral of exp(k cos(x − a_i y)) over any interval of 2π in x is 2π I_0(k), where I_0(k) is the zeroth order modified Bessel function of the first kind. With N equally likely sources the distribution can be considered to be a mixture distribution such that

p(x | y) = (1/N) Σ_{i=1}^{N} exp(k cos(x − a_i y)) / (2π I_0(k)).
[025] Note that other functions p(x | y, a_i) ∝ G(x − a_i y), where G(x) has period 2π with a unimodal peak at x = 0, can equivalently be used.
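A sketch of this wrapped (von Mises-type) density and the equal-weight mixture, with my own function names and the text's concentration parameter k written as `kappa` to avoid a clash with the source index:

```python
import numpy as np

def wrapped_density(x, y, a, kappa=4.0):
    """p(x | y, a) = exp(kappa*cos(x - a*y)) / (2*pi*I0(kappa))."""
    return np.exp(kappa * np.cos(x - a * y)) / (2 * np.pi * np.i0(kappa))

def mixture_density(x, y, slopes, kappa=4.0):
    # N equally likely sources: average of the per-source wrapped densities.
    return np.mean([wrapped_density(x, y, a, kappa) for a in slopes], axis=0)

# Numerical check that the density integrates to 1 over one 2*pi period
# (trapezoid rule on a fine grid):
xs = np.linspace(0.0, 2.0 * np.pi, 10001)
p = wrapped_density(xs, 1.0, 0.7)
integral = float(np.sum(0.5 * (p[1:] + p[:-1]) * np.diff(xs)))
q = mixture_density(xs, 1.0, [0.7, 1.3])
mix_integral = float(np.sum(0.5 * (q[1:] + q[:-1]) * np.diff(xs)))
```

The 2π I_0(k) normalizer is exactly what makes both the single-source density and the mixture integrate to 1 over any 2π interval.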
[026] A number of procedures are combined in order to form a desired signal that approximates the signal received from the desired source. The processes include the following:
• Estimation of the parameters a_k for sources k = 1, …, K.
• Forming of a frequency mask based on a selected source and the estimated source parameters
• Reconstruction of the estimate of the desired source signal.
[027] One approach to estimation of the parameters a_k for sources k = 1, …, K, which characterize the directions of arrival of the sources, makes use of an iterative approach in which points (x_i, y_i) are assigned to sources as follows. For a given line x = a y, points (x_i, y_i) are "inliers" to that line if they are in proximity to the line, defined in one of the following ways:
• p(x_i | y_i, a) ≥ p_0 for some threshold p_0
• cos(x_i − a y_i) ≥ c_0 for some threshold c_0 (e.g., p_0 = exp(k c_0))
• |((x_i − a y_i + π) mod 2π) − π| < z_0 for some threshold z_0 (e.g., cos(z_0) = c_0)
[028] In some examples, the inliers may be defined by making p_0 a fixed fraction (e.g., ½) of the maximum value of the density. In some examples, a phase range specifies the inlier range, for example, as z_0 = π/16.
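The three inlier criteria above can be expressed directly in code. The following is a minimal Python sketch with the thresholds linked as in the text (z_0 = π/16, c_0 = cos(z_0), p_0 = exp(κ c_0)); the value κ = 2.0 is an assumption:

```python
import math

KAPPA = 2.0  # assumed concentration parameter

def inlier_by_density(x, y, a, p0, kappa=KAPPA):
    """Unnormalized density exp(kappa*cos(x - a*y)) at least p0."""
    return math.exp(kappa * math.cos(x - a * y)) >= p0

def inlier_by_cosine(x, y, a, c0):
    """cos(x - a*y) at least c0."""
    return math.cos(x - a * y) >= c0

def inlier_by_phase(x, y, a, z0):
    """Wrapped phase residual |((x - a*y + pi) mod 2*pi) - pi| below z0."""
    return abs((x - a * y + math.pi) % (2 * math.pi) - math.pi) < z0
```

With the thresholds tied together this way, the three tests accept the same points (up to boundary cases), since exp(κ·) is monotone in the cosine and the cosine is monotone in the wrapped residual magnitude.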
[029] Generally, a quality of a match of a line to the sample data can be measured by the fraction (or number) of inlier points to the line. A higher quality line accounts for a larger fraction of the sample data.
[030] One procedure for identifying the delays (i.e., slopes of lines) represented in a data set D = {⟨x_i, y_i⟩} of phase/frequency pairs identifies K sources as follows:
For k = 1, ..., K:
Select M random samples from D;
For m = 1, ..., M:
Choose θ_{k,m} corresponding to the slope a = x/y for that mth random sample;
Over the full data set D, count the number of inliers;
Set θ_k to be the θ_{k,m} with the highest inlier count;
Remove the inlier data from D.
The result of this procedure is a set of source parameters (i.e., directions of arrival) θ_1, ..., θ_K.
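The procedure above can be sketched as follows. This is a minimal Python illustration (not the patented implementation), using the wrapped-phase inlier test with an assumed threshold z_0 = π/16:

```python
import math
import random

def count_inliers(data, a, z0=math.pi / 16):
    """Count points whose wrapped phase residual to the line x = a*y is below z0."""
    return sum(1 for x, y in data
               if abs((x - a * y + math.pi) % (2 * math.pi) - math.pi) < z0)

def estimate_slopes(data, num_sources, num_trials=50, z0=math.pi / 16, seed=0):
    """Greedily fit num_sources lines x = a*y by sampling candidate slopes."""
    rng = random.Random(seed)
    data = list(data)
    slopes = []
    for _ in range(num_sources):
        best_a, best_count = None, -1
        for _ in range(num_trials):
            x, y = rng.choice(data)  # candidate slope from one random sample
            if y == 0:
                continue
            a = x / y
            count = count_inliers(data, a, z0)
            if count > best_count:
                best_a, best_count = a, count
        slopes.append(best_a)
        # remove the inliers of the selected line before fitting the next source
        data = [(x, y) for x, y in data
                if abs((x - best_a * y + math.pi) % (2 * math.pi) - math.pi) >= z0]
    return slopes
```

On noiseless synthetic data with two well-separated slopes this recovers both; with real, noisy data a larger number of trials and the density-based inlier test may be preferable.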
[031] Given the estimated source parameters, an approach to source separation involves determining a mask that identifies frequencies at which a desired source is present. Note that, as described above, given the source parameters, the probability of a phase/frequency pair (x_i, y_i) conditioned on the source can be used to yield the posterior probability that the phase/frequency pair comes from that source as follows:
Pr(source k | x_i, y_i) = p(x_i | y_i, θ_k) Pr(source k) / p(x_i | y_i).
Under certain assumptions (e.g., that all sources are equally likely to be present at each frequency a priori), this permits computing the probability that a data point at frequency y_i, with phase x_i, comes from the nth source as
Pr(source n | frequency y_i) = exp(κ cos(x_i − θ_n y_i)) / Σ_{k=1}^{N} exp(κ cos(x_i − θ_k y_i)).
[032] One of two masking approaches can be used. A "hard" mask may be chosen such that
m_i = 1 if k = arg max_n Pr(source n | frequency i), and m_i = 0 otherwise.
Alternatively, a "soft" mask may be used such that
m_i = Pr(source k | frequency i),
where k is the index of the desired source. Note that in the distributional form, as the parameter κ is increased, the hardness of the soft mask is increased by concentrating the distribution near the line corresponding to each source.
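The hard and soft masks can be sketched as follows, assuming the equal-prior posterior of paragraph [031]; κ = 2.0 is an assumed value:

```python
import math

def posteriors(x, y, slopes, kappa=2.0):
    """Pr(source n | phase x at frequency y) under equal priors."""
    scores = [math.exp(kappa * math.cos(x - a * y)) for a in slopes]
    total = sum(scores)
    return [s / total for s in scores]

def soft_mask_value(x, y, slopes, k, kappa=2.0):
    """Soft mask: posterior probability of the desired source k."""
    return posteriors(x, y, slopes, kappa)[k]

def hard_mask_value(x, y, slopes, k, kappa=2.0):
    """Hard mask: 1 if source k has the largest posterior, else 0."""
    p = posteriors(x, y, slopes, kappa)
    return 1.0 if p[k] == max(p) else 0.0
```

Increasing κ pushes the soft mask toward the hard mask, matching the "hardness" remark above.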
[033] An alternative embodiment relaxes the assumption that the phase difference between microphones is proportional to frequency, or equivalently that the ⟨x_i, y_i⟩ points for a source lie on a straight line in the wrapped space. A variety of factors can cause deviation from a straight line, although it should be understood that these factors may not be present in all cases and that other factors may affect the shape of the relationship. One factor is that the multiple microphones may have somewhat different phase responses as a function of frequency; the difference in the phase responses will manifest as deviation from a straight line. Another factor is reverberation, which may also manifest as deviation from an ideal straight line.
[034] One approach to relaxing the straight-line assumption is to use a spline approximation, for example, a cubic spline with a fixed number of knots at variable frequencies. One way to introduce the spline approximations into the procedure is to first follow the procedure described above to determine the straight-line parameters a_k for the K sources k = 1, ..., K. These straight-line parameters are then used to initialize the unknown parameters of the splines. Each spline is assumed to have M knots, and therefore has M − 1 cubic sections, each with four unknown parameters of the cubic polynomial. Constraints at the interior M − 2 knots guarantee continuity of value and of first and second derivatives at the knots. An iterative procedure is then used to update the spline parameters to better match the data.
[035] One iterative approach makes use of an Expectation-Maximization (EM) algorithm. Specifically, for a particular source k, the parameterized spline y = f_k(x) defines the mode of the phase distribution (note that in this formulation x denotes frequency and y denotes phase). The distribution is modeled using a wrapped Gaussian defined as
p(y | x, k) ∝ Σ_l exp(−(y − f_k(x) + 2πl)² / (2σ_k²)), −π ≤ y ≤ π.
[036] In the iterative procedure, in each "E" step, each data pair ⟨x_i, y_i⟩ is fractionally associated with a source k and wrap index l according to
w_{ikl} ∝ exp(−(y_i − f_k(x_i) + 2πl)² / (2σ_k²)),
normalized so that the weights sum to one over k and l. Note that these weights w_{ikl} are coupled to the parameters of the spline functions f_k(x), which is a reason that the estimation of the spline parameters is performed in this iterative manner.
[037] In the "M" step, the fractionally weighted data pairs are used to update the spline parameters according to conventional techniques. In some examples, the variances are fixed at unity (σ_k = 1.0) or at some other fixed values. The parameters θ_1, ..., θ_K represent the parameters of the K spline fits.
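The E-step association described above can be sketched as follows, assuming the wrapped-Gaussian form w_{ikl} ∝ exp(−(y_i − f_k(x_i) + 2πl)² / (2σ_k²)) and a small set of wrap indices; the spline functions here are stand-ins, and the M-step spline refitting is omitted:

```python
import math

def e_step_weights(x, y, splines, sigmas, wraps=(-1, 0, 1)):
    """Fractional association of a data pair (frequency x, phase y) with each
    source k and wrap index l, normalized to sum to one over (k, l)."""
    w = {}
    for k, f in enumerate(splines):
        for l in wraps:
            r = y - f(x) + 2 * math.pi * l
            w[(k, l)] = math.exp(-r * r / (2 * sigmas[k] ** 2))
    total = sum(w.values())
    return {kl: v / total for kl, v in w.items()}
```

In a full implementation the resulting weights would be fed to a conventional weighted cubic-spline fit for each source in the M step.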
[038] At the end of the iteration, soft mask values at a frequency x_i with an observed phase y_i at that frequency may be computed using a posterior probability approach similar to that described previously, as
Pr(source n | frequency x_i) = N_w(y_i; f_n(x_i), σ_n²) / Σ_{k=1}^{K} N_w(y_i; f_k(x_i), σ_k²),
where N_w(y; μ, σ²) denotes the wrapped Gaussian density.
[039] Referring again to FIG. 1, after determining the source parameters θ_1, ..., θ_K in block 134 as described above, and selecting a desired source k, for example, as k = 1, which corresponds to the source that accounts for the greatest number of points, or the source k that accounts for the greatest energy, or applying another probabilistic or heuristic selection for the source, the mask m is formed in block 136 using any one of the approaches described above. This mask is then passed to a source estimation block 138, which modifies each STFT received from a Fourier Transform block 132 for one of the microphones (e.g., Microphone 1) prior to reconstruction of a time signal, for example, using a conventional overlap-add technique. For example, windowed 1024-point STFTs are computed with a window hop size of 256.
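The masked overlap-add reconstruction described above can be sketched as follows. This is a minimal NumPy illustration using the 1024-point window and 256-sample hop from the text; the Hann window and the squared-window normalization are assumptions, not the patented implementation:

```python
import numpy as np

N_FFT, HOP = 1024, 256  # 1024-point STFT with a window hop of 256, as in the text

def stft(x, n_fft=N_FFT, hop=HOP):
    """Windowed short-time Fourier transform (Hann window assumed)."""
    win = np.hanning(n_fft)
    return np.array([np.fft.rfft(win * x[i:i + n_fft])
                     for i in range(0, len(x) - n_fft + 1, hop)])

def istft(frames, n_fft=N_FFT, hop=HOP):
    """Overlap-add reconstruction, normalized by the accumulated squared window."""
    win = np.hanning(n_fft)
    out = np.zeros(hop * (len(frames) - 1) + n_fft)
    norm = np.zeros_like(out)
    for i, frame in enumerate(frames):
        s = i * hop
        out[s:s + n_fft] += win * np.fft.irfft(frame, n_fft)
        norm[s:s + n_fft] += win ** 2
    return out / np.maximum(norm, 1e-8)

def apply_mask(frames, mask):
    """Scale each frequency bin of every frame by a per-bin mask m."""
    return frames * mask
```

Applying an all-ones mask and reconstructing recovers the interior of the input signal, which is a useful sanity check before applying an estimated source mask.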
[040] It should be understood that the approach described above can be extended to more than two microphones, thereby allowing localization in three dimensions or enhanced localization in two dimensions.
[041] The approach can be applied to multiple microphones, defining a (or θ), x, and y as vectors (e.g., of dimension 2 for three microphones). Various forms of distribution may be used, for example, assuming the dimensions are independent and using a product of the densities over the dimensions.
[042] For localization in two dimensions using more than two microphones arranged along a line, each data sample associates a frequency with a tuple of relative phases. For each source, the slopes of the phase vs. frequency lines are related according to the coordinates of the microphones. Therefore, the procedure described above for the two-microphone case can be extended by defining an "inlier" to depend on all the relative phases observed. For example, the relative phases must be sufficiently near the estimated line for all the relative phases measured, or the product of the probabilities (e.g., the sum of the exponent terms κ cos(x_j − a_j y)) must be above a threshold. In forming the masks, and in particular in forming the soft masks, a combination (e.g., a product) of the probabilities determined for each of the relative phase measurements is used.
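The combined inlier test above can be sketched as follows; summing the exponent terms κ cos(x_j − a_j y) over microphone pairs and thresholding the sum is equivalent to thresholding the product of the per-pair (unnormalized) densities. The value κ = 2.0 is again an assumption:

```python
import math

def joint_score(phases, slopes, y, kappa=2.0):
    """Sum of exponent terms kappa*cos(x_j - a_j*y) over microphone pairs j."""
    return sum(kappa * math.cos(x - a * y) for x, a in zip(phases, slopes))

def joint_inlier(phases, slopes, y, threshold, kappa=2.0):
    """A point is a joint inlier if the summed score clears the threshold."""
    return joint_score(phases, slopes, y, kappa) >= threshold
```

Because exp(Σ_j κ cos(x_j − a_j y)) = Π_j exp(κ cos(x_j − a_j y)), the summed score and the product of densities induce the same inlier set for matched thresholds.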
[043] When the three or more microphones are not arranged along a line, localization in more than two dimensions can be performed. The procedure described above is again modified, but each line for the relative phase between a pair of microphones now depends on two direction parameters rather than one.
[044] Other prior information regarding the probability of a source given frequency may also be included, for example, in addition to or instead of the prior information based on tracking over time. In the approach described above for forming a soft mask for isolating the desired source, an assumption that is made is that the prior probabilities of the sources, and more particularly the prior probability of each source at each frequency, are fixed, and in particular are equal. In other situations, other information is available for separating the sources, such that Pr(source k), or Pr(source k | frequency i), is known. These sources of information can be combined with the phase-based quantities in determining the masks. An example of such a source of information is tracking (recognition) of the spectral characteristics of each source, for example, according to a speech production model, such that past spectral characteristics for a source provide information about the presence of that source's signal in each of the frequency bins at the current window where the source time signal is being reconstructed.
[045] Another source of prior information relates to locations of the sources. For example, at any time, a prior probability distribution for source locations can be combined with the conditional probabilities of the frequency/phase samples given the locations (e.g., the mixture distribution form introduced above) to yield a Bayesian estimate (e.g., a posterior distribution) of the source locations. Similarly, source locations may be tracked by including a model of movement of sources (e.g., random walks) for prediction and the frequency/phase samples for updating of the source locations, for example, using a Kalman filter or similar approach.
[046] Applications of the approaches described above are not restricted to those described (e.g., for hearing aids). For example, multiple microphone audio input systems for automated audio processing and/or transmission may similarly use the approach. An example of such an application is a tablet computer, smartphone, or other portable device that has multiple microphones, for example, at four corners of the body of the device. One (or more) source can be selected for processing (e.g., speech recognition) or transmission (e.g., for audio conferencing) from the device using the approaches described above. Other examples arise in fixed configurations, for example, for a microphone array in a conference room. In some such examples, prior knowledge of locations of desirable sources (e.g., speakers seated around a conference table) can be incorporated into the estimation procedure.
[047] Embodiments of the approaches described above may be implemented in software, in hardware, or a combination of software and hardware. Software can include instructions (e.g., machine instructions, higher level language instructions, etc.) stored on a tangible computer readable medium for causing a processor to perform the functions described above.
[048] It is to be understood that the foregoing description is intended to illustrate and not to limit the scope of the invention, which is defined by the scope of the appended claims. Other embodiments are within the scope of the following claims.
Claims
1. A method for separating source signals from a plurality of sources using a plurality of sensors, the method comprising:
accepting a first signal at each of the sensors, the first signal including a
combination of multiple of the source signals, each sensor providing a corresponding first sensor signal representing the first signal; for each of a set of pairs of sensors,
determining phase values for a plurality of frequencies of the pair of the first sensor signals provided by the pair of sensors, and
estimating a parametric relationship between phase and frequency for each of a plurality of signal sources included in the sensor signals, the parametric relationship characterizing a periodic distribution of phase at each frequency for each source;
accepting a second signal at each of the sensors, each sensor providing a
corresponding second sensor signal representing the second signal;
for each of a set of pairs of sensors,
determining phase values for a plurality of frequencies of the pair of the second sensor signals accepted at the pair of sensors; and forming a frequency mask corresponding to a desired source of the plurality of sources from the determined phase values of the second sensor signals and the periodic distribution of phase characterized by the parametric relationships estimated from the first signals.
2. The method of claim 1 further comprising combining at least one of the second sensor signals and the frequency mask to determine an estimate of a source signal received at the sensors from the selected one of the sources.
3. The method of claim 1 wherein the sources comprise acoustic signal sources and the sensors comprise microphones.
4. The method of claim 3 wherein the first sensor signals and the second sensor signals each includes a representation of an acoustic signal received from the selected source at the microphones.
5. The method of claim 1 wherein estimating the parametric relationship between phase and frequency includes:
applying an iteration, each iteration including generating a set of candidate
parameters, and selecting a best parameter from the candidate parameters according to a degree to which a parametric relationship with said parameter accounts for the determined phase values.
6. The method of claim 5 wherein applying the iteration includes, at each of at least some of the iterations, selecting the best parameter according to a degree to which a parametric relationship with said parameter accounts for determined phase values not accounted for according to parameters of prior iterations.
7. The method of claim 1 wherein estimating the parametric relationship between phase and frequency includes estimating a linear relationship.
8. The method of claim 1 wherein estimating the parametric relationship between phase and frequency includes estimating a parametric curve relationship.
9. The method of claim 8 wherein estimating a parametric curve relationship includes estimating a spline relationship.
10. The method of claim 1 wherein forming the frequency mask includes forming a binary frequency mask.
11. The method of claim 1 wherein estimating the parametric relationships comprises applying a RANSAC (Random Sample Consensus) procedure.
12. A signal processing system comprising:
a plurality of sensor inputs, each for coupling to a corresponding one of a
plurality of sensors and accepting a corresponding sensor signal;
a computer-implemented processing module configured to, for each of a set of pairs of sensor signals,
determine phase values for a plurality of frequencies of the pair of first sensor signals accepted at the sensor inputs, and
estimate a parametric relationship between phase and frequency for each of a plurality of signal sources represented in the first sensor signals, the parametric relationship characterizing a periodic distribution of phase at each frequency for each source, determine phase values for a plurality of frequencies of the pair of second sensor signals accepted at the sensor inputs; and
wherein the processing module is further configured to form and store a frequency mask corresponding to a desired source of the plurality of sources from the determined phase values of the second sensor signals and the periodic distribution of phase characterized by the parametric relationships estimated from the first signals.
13. The system of claim 12 wherein the processing module is further configured to combine at least one of the second sensor signals and the frequency mask to determine an estimate of a source signal received at the sensors from the selected one of the sources.
14. Software stored on a non-transitory machine-readable medium comprising instructions for causing a signal processor to:
accept sensor signals at a plurality of sensor inputs;
for each of a set of pairs of sensor signals,
determine phase values for a plurality of frequencies of the pair of first sensor signals accepted at the sensor inputs, and
estimate a parametric relationship between phase and frequency for each of a plurality of signal sources represented in the first sensor signals, the parametric relationship characterizing a periodic distribution of phase at each frequency for each source, determine phase values for a plurality of frequencies of the pair of second sensor signals accepted at the sensor inputs; and
to form and store a frequency mask corresponding to a desired source of the
plurality of sources from the determined phase values of the second sensor signals and the periodic distribution of phase characterized by the parametric relationships estimated from the first signals.
15. The software of claim 14 wherein the instructions are further for causing the signal processor to combine at least one of the second sensor signals and the frequency mask to determine an estimate of a source signal received at the sensors from the selected one of the sources.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US14/440,211 US20150312663A1 (en) | 2012-09-19 | 2013-09-17 | Source separation using a circular model |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201261702993P | 2012-09-19 | 2012-09-19 | |
US61/702,993 | 2012-09-19 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2014047025A1 true WO2014047025A1 (en) | 2014-03-27 |
Family
ID=49253446
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2013/060044 WO2014047025A1 (en) | 2012-09-19 | 2013-09-17 | Source separation using a circular model |
Country Status (2)
Country | Link |
---|---|
US (1) | US20150312663A1 (en) |
WO (1) | WO2014047025A1 (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2015048070A1 (en) | 2013-09-24 | 2015-04-02 | Analog Devices, Inc. | Time-frequency directional processing of audio signals |
CN107636756A (en) * | 2015-04-10 | 2018-01-26 | 汤姆逊许可公司 | For the method and apparatus of the method and apparatus and the mixing for decoding multiple audio signals using improved separation that encode multiple audio signals |
WO2021252823A1 (en) * | 2020-06-11 | 2021-12-16 | Dolby Laboratories Licensing Corporation | Methods, apparatus, and systems for detection and extraction of spatially-identifiable subband audio sources |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9438454B1 (en) * | 2013-03-03 | 2016-09-06 | The Government Of The United States, As Represented By The Secretary Of The Army | Alignment of multiple editions of a signal collected from multiple sensors |
EP3704872B1 (en) | 2017-10-31 | 2023-05-10 | Widex A/S | Method of operating a hearing aid system and a hearing aid system |
US10860900B2 (en) * | 2018-10-30 | 2020-12-08 | International Business Machines Corporation | Transforming source distribution to target distribution using Sobolev Descent |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070076902A1 (en) * | 2005-09-30 | 2007-04-05 | Aaron Master | Method and Apparatus for Removing or Isolating Voice or Instruments on Stereo Recordings |
EP1887831A2 (en) * | 2006-08-09 | 2008-02-13 | Fujitsu Limited | Method, apparatus and program for estimating the direction of a sound source |
JP2011164467A (en) * | 2010-02-12 | 2011-08-25 | Nippon Telegr & Teleph Corp <Ntt> | Model estimation device, sound source separation device, and method and program therefor |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6597818B2 (en) * | 1997-05-09 | 2003-07-22 | Sarnoff Corporation | Method and apparatus for performing geo-spatial registration of imagery |
JP2012078422A (en) * | 2010-09-30 | 2012-04-19 | Roland Corp | Sound signal processing device |
US9131295B2 (en) * | 2012-08-07 | 2015-09-08 | Microsoft Technology Licensing, Llc | Multi-microphone audio source separation based on combined statistical angle distributions |
- 2013-09-17: WO PCT/US2013/060044 patent/WO2014047025A1, active Application Filing
- 2013-09-17: US 14/440,211 patent/US20150312663A1, not active (Abandoned)
Also Published As
Publication number | Publication date |
---|---|
US20150312663A1 (en) | 2015-10-29 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 13767221 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
WWE | Wipo information: entry into national phase |
Ref document number: 14440211 Country of ref document: US |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 13767221 Country of ref document: EP Kind code of ref document: A1 |