EP3050056B1 - Time-frequency directional processing of audio signals - Google Patents
Time-frequency directional processing of audio signals
- Publication number
- EP3050056B1 (application EP14780737.4A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- time
- signal
- signals
- acquired signals
- computed
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
Definitions
- This invention relates to time-frequency directional processing of audio signals.
- One broad approach to separating a signal from a source of interest using multiple microphone signals is beamforming, which uses multiple microphones separated by distances on the order of a wavelength or more to provide directional sensitivity to the microphone system.
- beamforming approaches may be limited, for example, by inadequate separation of the microphones.
- NMF: Non-Negative Matrix Factorization
- An approach used for speech processing makes use of some processing capacity at a user's device along with transmission of the result of such processing to a server computer, where further processing is performed.
- An example of such an approach is described, for instance, in U.S. Pat. 8,666,963, "Method and Apparatus for Processing Spoken Search Queries."
- Aoki M et al., "Sound source segregation based on estimating incident angle of each frequency component of input signals acquired by multiple microphones", Acoustical Science and Technology, Acoustical Society of Japan, Tokyo, JP, vol. 22, no. 2, 1 March 2001 (2001-03-01), pages 149-157, describes a method of segregating desired speech from concurrent sounds received by two microphones.
- the present disclosure describes an approach to processing of acoustic signals acquired at a user's device that includes one or both of: acquisition of parallel signals from a set of closely spaced microphones, and use of a multi-tier computing approach in which some processing is performed at the user's device and further processing is performed at one or more server computers in communication with the user's device.
- the acquired signals are processed using time versus frequency estimates of both energy content as well as direction of arrival.
- a non-negative matrix or tensor factorization approach is used to identify multiple sources each associated with a corresponding direction of arrival of a signal from that source.
- data characterizing direction of arrival information is passed from the user's device to a server computer where direction-based processing is performed.
- a method for processing a plurality of signals acquired using a corresponding plurality of acoustic sensors at a user device is defined in claim 1.
- the method may include one or more of the following features in any combination, recognizing that, unless indicated otherwise, none of these features are essential to any particular embodiment.
- Each component of the plurality of components of the time-dependent spectral characteristics computed from the acquired signals is associated with a time frame of a plurality of successive time frames.
- each component of the plurality of components of the time-dependent spectral characteristics computed from the acquired signals is associated with a frequency range, whereby the computed components form a time-frequency characterization of the acquired signals.
- each component represents energy (e.g., via a monotonic function, such as square root) at a corresponding range of time and frequency.
- Computing the direction estimates of a component comprises computing data representing a direction of arrival of the component in the acquired signals.
- computing the data representing the direction of arrival comprises at least one of (a) computing data representing one direction of arrival, and (b) computing data representing an exclusion of at least one direction of arrival.
- computing the data representing the direction of arrival comprises determining an optimized direction associated with the component using at least one of (a) phases, and (b) times of arrivals of the acquired signals.
- the determining of the optimized direction may comprise performing at least one of (a) a pseudo-inverse calculation, and (b) a least-squared-error estimation.
- Computing the data representing the direction of arrival may comprise computing at least one of (a) an angle representation of the direction of arrival, (b) a direction vector representation of the direction of arrival, and (c) a quantized representation of the direction of arrival.
- Combining the computed spectral characteristics and the computed direction estimates to form a data structure representing a distribution indexed by time, frequency, and direction may comprise performing a non-negative matrix or tensor factorization using the formed data structure.
- forming the data structure comprises forming a data structure representing a sparse data structure in which a majority of the entries of the distribution are absent.
- At least part of forming the approximation, performing the plurality of iterations, and computing the mask function is performed at a server computing system in data communication with the user device.
- the method further comprises communicating from the user device to the server computing system at least one of (a) the direction estimates, (b) a result of performing the plurality of iterations, and (c) a signal formed as an estimate of a part of at least one signal of the plurality of acquired signals corresponding to the selected source.
- the present disclosure includes description of, in general, a signal processing system which comprises a processor and an acoustic sensor having multiple sensor elements, and which is configured to perform all the steps of any one of the methods set forth above.
- a signal processing system is defined in claim 12.
- the present disclosure includes description of a computer program product comprising instructions embodied on a non-transitory machine-readable medium, execution of said instructions on one or more processors of a data processing system causing said system to perform all the steps of any one of the methods set forth above.
- a computer program product is defined in claim 15.
- One or more aspects address a technical problem of providing accurate processing of acquired acoustic signals within the limits of computation capacity of a user's device.
- An approach of performing a direction-based processing of the acquired acoustic signals at the user's device permits reduction of the amount of data that needs to be transmitted to a server computer for further processing.
- Use of the server computer for the further processing, which often involves speech recognition, permits use of greater computation resources (e.g., processor speed, runtime and permanent storage capacity, etc.) that may be available at the server computer.
- embodiments described herein are directed to a problem of acquiring a set of audio signals, which typically represent a combination of signals from multiple sources, and processing the signals to separate out a signal of a particular source of interest from other undesired signals. At least some of the embodiments are directed to the problem of separating out the signal of interest for the purpose of automated speech recognition when the acquired signals include a speech utterance of interest as well as interfering speech and/or non-speech signals. Other embodiments are directed to the problem of enhancement of the audio signal for presentation to a human listener. Yet other embodiments are directed to other forms of automated speech processing, for example, speaker verification or voice-based search queries.
- Embodiments also include one or both of (a) acquisition of directional information during acquisition of the audio signals, and (b) processing the audio signals in a multi-tier architecture in which different parts of the processing may be performed on different computing devices, for example, in a client-server arrangement. It should be understood that these two features are independent and that some embodiments may use directional information on a single computing device, and that other embodiments may not use directional information, but may nevertheless use a multi-tier architecture. Finally, at least some embodiments may neither use directional information nor multi-tier architectures, for example, using only time-frequency factorization approaches described below.
- the smartphone includes a processor 212, which is coupled to an Analog-to-Digital Converter (ADC), which provides digitized audio signals acquired at the microphone(s) 110.
- the processor includes a storage 140, which is used in part for data representing the acquired acoustic signals, and a CPU 120 which implements various procedures described below.
- the smartphone 210 is coupled to a server 220 over a data link (e.g., over a cellular data connection).
- the server includes a CPU 122 and associated storage 142.
- data passes between the smartphone and the server during and/or immediately following the processing of the audio signals acquired at the smartphone.
- partially processed audio signals are passed from the smartphone to the server, and results of further processing (e.g., results of automated speech recognition) are passed back from the server to the smartphone.
- the server 220 may provide data to the smartphone, e.g. estimated directionality information or spectral prototypes for the sources, which is used at the smartphone to fully or partially process audio signals acquired at the smartphone.
- a smartphone application is only one of a variety of examples of user devices.
- Another example, shown in FIG. 2, is one in which a multi-element microphone is integrated into a vehicle 250; at least some of the processing of the audio signals acquired from a speaker 205 is performed using a computing device at the vehicle, and that computing device may optionally communicate with a server to perform at least some of the processing of the acquired signal.
- the multiple element microphone 110 acquires multiple parallel audio signals.
- the microphone acquires four parallel audio signals from closely spaced elements 112 (e.g., spaced less than 2 mm apart) and passes these as analog signals (e.g., electric or optical signals on separate wires or fibers, or multiplexed on a common wire or fiber) x 1 ( t ),..., x 4 ( t ) to the ADC 132.
- processing of the acquired audio signals includes performing a time frequency analysis that generates positive real quantities X ( f,n ), where f is an index over frequency bins and n is an index over time intervals (i.e., frames).
- Short-Time Fourier Transform (STFT) analysis is performed on the time signals in each of a series of time windows ("frames") shifted 30 ms per increment with 1024 frequency bins, yielding 1024 complex quantities per frame for each input signal.
- one of the input signals is chosen as a representative, and the quantity X ( f,n ) represents the magnitude (or alternatively the squared magnitude, or a compressive transformation of the magnitude, such as a square root) derived from the STFT analysis of that time signal, with the angle of the complex quantities being retained for later reconstruction of a separated time signal.
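- For illustration, a minimal NumPy sketch of this analysis step is shown below; the sample rate, frame length, hop, and window are assumptions of the example (the embodiment's 1024 frequency bins and 30 ms shift are only approximated), and the helper name `stft` is hypothetical.

```python
import numpy as np

def stft(x, frame_len=1024, hop=480, window=None):
    """Complex short-time spectra of a 1-D signal; returns an (n_frames, n_bins) array."""
    if window is None:
        window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

x_ref = np.random.randn(16000)   # placeholder for the representative input channel
Z = stft(x_ref)                  # complex quantities indexed by (frame n, frequency f)
X = np.abs(Z).T                  # magnitude X(f, n); could also square or take a square root
phase = np.angle(Z).T            # retained for later reconstruction of a separated signal
```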
- in other examples, a combination of the acquired signals (e.g., a weighted average, or the output of a linear beamformer based on previous direction estimates) is used for forming X ( f,n ) and the associated phase quantities.
- direction-of-arrival (DOA) information is computed from the time signals, also indexed by frequency and frame.
- continuous incidence angle estimates D ( f,n ), which may be represented as a scalar or a multi-dimensional vector, are derived from the phase differences of the STFT.
- An example of a particular direction of arrival calculation approach is as follows.
- A is a K × 4 matrix ( K is the number of microphones) that depends on the positions of the microphones,
- x represents the direction of arrival (a 4-dimensional vector having d augmented with a unit element), and
- b is a vector that represents the observed K phases.
- the pseudoinverse P of A can be computed once (e.g., as a property of the physical arrangement of ports on the microphone) and hardcoded into computation modules that implement an estimation of direction of arrival x as P b .
- the direction D is then available directly from the vector direction x .
- the magnitude of the direction vector x, which should be consistent with (e.g., equal to) the speed of sound, is used to determine a confidence score for the direction, for example, representing low confidence if the magnitude is inconsistent with the speed of sound.
- the direction of arrival is quantized (i.e., binned) using a fixed set of directions (e.g., 20 bins), or using an adapted set of directions consistent with the long-term distribution of observed directions of arrival.
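- The snippet below sketches one plausible reading of this least-squares direction estimate for a single ( f,n ) bin, using a slowness-vector parameterization so that the reciprocal of the fitted vector's norm should be close to the speed of sound; the microphone layout, the confidence formula, and the azimuth binning are illustrative assumptions rather than the patent's exact construction.

```python
import numpy as np

C = 343.0                                            # speed of sound (m/s)
mic_pos = np.array([[0.000, 0.000, 0.0],             # hypothetical closely spaced layout (m)
                    [0.002, 0.000, 0.0],
                    [0.000, 0.002, 0.0],
                    [0.002, 0.002, 0.0]])
A = np.hstack([mic_pos, np.ones((len(mic_pos), 1))])  # K x 4: [positions | unit element]
P = np.linalg.pinv(A)                                 # precomputed once for the fixed geometry

def doa_estimate(phases, freq_hz):
    """Estimate a direction of arrival from the K observed phases at one frequency bin."""
    delays = -phases / (2.0 * np.pi * freq_hz)        # phases converted to arrival delays (s)
    x = P @ delays                                    # least-squares fit: [slowness; offset]
    slowness = x[:3]
    implied_speed = 1.0 / (np.linalg.norm(slowness) + 1e-12)
    confidence = np.exp(-abs(implied_speed - C) / C)  # low if inconsistent with the speed of sound
    direction = slowness / (np.linalg.norm(slowness) + 1e-12)
    return direction, confidence

def quantize_azimuth(direction, n_bins=20):
    """Quantize the azimuth of a unit direction vector into one of n_bins discrete bins."""
    az = np.arctan2(direction[1], direction[0])
    return int(((az + np.pi) / (2.0 * np.pi)) * n_bins) % n_bins
```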
- the use of the pseudo-inverse approach to estimating direction information is only one example, which is suited to the situation in which the microphone elements are closely spaced, thereby reducing the effects of phase "wrapping."
- at least some pairs of microphone elements may be more widely spaced, for example, in a rectangular arrangement with 36 mm and 63 mm spacing.
- a phase unwrapping approach is applied in combination with a pseudo-inverse approach as described above, for example, using an unwrapping approach to yield approximate delay estimates, followed by application of a pseudo-inverse approach.
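- As a rough sketch of this wider-spacing variant, the snippet below estimates approximate per-microphone delays from the slope of the unwrapped cross-spectrum phase against frequency (relative to a reference channel); those delays could then feed the same pseudo-inverse step. The framing and the simple line fit are assumptions of the example.

```python
import numpy as np

def pairwise_delays(Z_frame, fs, frame_len, ref=0):
    """Approximate per-channel delays (s) relative to channel `ref` for one STFT frame.

    Z_frame: (K, F) complex spectra of one frame for K microphones."""
    K, F = Z_frame.shape
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)[:F]
    delays = np.zeros(K)
    for k in range(K):
        cross = Z_frame[k] * np.conj(Z_frame[ref])
        phi = np.unwrap(np.angle(cross))              # unwrap the phase along frequency
        slope = np.polyfit(freqs[1:], phi[1:], 1)[0]  # phi ~ -2*pi*f*tau
        delays[k] = -slope / (2.0 * np.pi)
    return delays
```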
- By a direction estimate we mean either a single direction, or at least some representation of direction that excludes certain directions or renders certain directions substantially unlikely.
- Various embodiments make use of the time-frequency analysis including the magnitude and the direction information as a function of frequency and time, and form a time-frequency mask M ( f , n ) indexed on the same frequency and time indices that is used to separate the signal of interest in the acquired audio signals.
- a batch approach is used in which a user 205 speaks an utterance and the utterance is acquired as the parallel audio signals x 1 ( t ),..., x 4 ( t ) with the microphone 110. These signals are processed as a unit, for example, computing the entire mask for the duration of the utterance.
- a number of alternative multi-tier processing approaches are used in different embodiments, including for example:
- the user's device does not wait until the completion of the utterance to pass the separated signal or the mask information. For example, sequential segments or a sliding segment of the input utterance is processed and the information is passed to the server as it is computed.
- a spectral estimation and direction estimation stage 310 produces the magnitude and direction information X ( f,n ) and D ( f,n ) described above. In at least some embodiments, this information is used in a signal separation stage 320 to produce a separated time signal x̂ ( t ), and this separated signal is passed to a speech recognition stage 330.
- the speech recognition stage 330 produces a transcription.
- the separated signal is determined at the user's device and passed to a server computer where the speech recognition stage 330 is performed, with the transcription being passed back from the server computer to the user's device.
- the transcription is further processed, for example, forming a query (e.g., a Web search) with the results of the query being passed back to the user's device or otherwise processed.
- an implementation of the signal separation stage 320 involves first performing a frequency domain mask stage 322, which produces a mask M ( f,n ). This mask is then used to perform signal separation in the frequency domain, producing X̂ ( f,n ), which then passes to a spectral inversion stage 326 in which the time signal x̂ ( t ) is determined, for example using an inverse transform. Note that in FIG. 3, the flow of the phase information (i.e., the angle of complex quantities indexed by frequency f and time frame n ) associated with X ( f,n ) and X̂ ( f,n ) is not shown.
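- A minimal sketch of the frequency-domain separation and spectral inversion stages is shown below, reusing the (n, f) layout of the earlier STFT sketch; the overlap-add synthesis window and hop are assumptions of this example.

```python
import numpy as np

def istft(Z, frame_len=1024, hop=480, window=None):
    """Overlap-add inverse of the simple STFT sketch above."""
    if window is None:
        window = np.hanning(frame_len)
    frames = np.fft.irfft(Z, n=frame_len, axis=1) * window
    out = np.zeros(frame_len + hop * (len(frames) - 1))
    norm = np.zeros_like(out)
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + frame_len] += frame
        norm[i * hop:i * hop + frame_len] += window ** 2
    return out / np.maximum(norm, 1e-8)

def separate(Z_ref, mask_fn):
    """Apply a mask M(f, n) to the representative complex spectra and invert to a time signal."""
    Z_sep = Z_ref * mask_fn.T      # mask given in (f, n) layout, spectra in (n, f)
    return istft(Z_sep)
```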
- a joint distribution is formed as p(f,n,d) = p(f,n) p(d | f,n), where p(f,n) = X(f,n) / Σ_{f′,n′} X(f′,n′) and p(d | f,n) represents the distribution of direction estimates for that time-frequency component.
- the distribution p ( f,n,d ) can be thought of as a probability distribution in that the quantities are all in the range 0.0 to 1.0 and the sum over all the index values is 1.0.
- the values of p(d | f,n) are not necessarily 0 or 1, and in some implementations may be represented as a distribution with non-zero values for multiple discrete direction values d.
- the distribution may be discrete (e.g., using fixed or adaptive direction "bins") or may be represented as a continuous distribution (e.g., a parameterized distribution) over a one-dimensional or multi-dimensional representation of direction.
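- The snippet below sketches how such a distribution might be assembled from the magnitudes X ( f,n ) and quantized direction indices D ( f,n ); a dense array is used for clarity even though, as noted above, most entries are zero, so a sparse representation would normally be preferred.

```python
import numpy as np

def build_distribution(X, D_idx, n_dirs=20):
    """Form p(f, n, d) from magnitudes X[f, n] and quantized direction bins D_idx[f, n]."""
    F, N = X.shape
    p = np.zeros((F, N, n_dirs))
    # each (f, n) cell contributes its energy to the single direction bin estimated for it
    p[np.arange(F)[:, None], np.arange(N)[None, :], D_idx] = X
    return p / (p.sum() + 1e-12)   # normalize so the entries sum to 1.0
```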
- a number of implementations of the signal separation approach are based on forming an approximation q ( f,n,d ) of p ( f,n,d ), where the distribution q ( f,n,d ) has a hidden multiple-source structure.
- in some implementations a non-negative matrix factorization (NMF) approach is used; in other implementations a non-negative tensor (i.e., three- or more-dimensional) factorization approach is used.
- spectral prototypes q(f | z,s) 410 provide relative magnitudes of the various frequency bins, which are indexed by f.
- the time-varying contributions of the different prototypes for a given source s are represented by terms q(n,z | s), so that q(f,n | s) = Σ_z q(f | z,s) q(n,z | s).
- Direction information in this model is treated, for any particular source, as independent of time and frequency or of the magnitude at such times and frequencies. Therefore a distribution q(d | s) characterizes each source; in some implementations the joint quantity q(d,s) = q(d | s) q(s) is used without separating it into the two separate terms.
- other factorizations of the distribution may be used, for example q(f,n | s) = Σ_z q(f,z | s) q(n | z,s).
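- For concreteness, the factored model can be assembled from its terms as in the sketch below; the array names and shapes are illustrative (S sources, Z prototypes per source, D direction bins).

```python
import numpy as np

def model_q(q_s, q_d_s, q_f_zs, q_nz_s):
    """q(f, n, d) = sum_s q(s) q(d|s) sum_z q(f|z, s) q(n, z|s).

    Shapes: q_s (S,), q_d_s (D, S), q_f_zs (F, Z, S), q_nz_s (N, Z, S)."""
    q_fn_s = np.einsum('fzs,nzs->fns', q_f_zs, q_nz_s)     # q(f, n | s)
    return np.einsum('s,ds,fns->fnd', q_s, q_d_s, q_fn_s)  # q(f, n, d)
```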
- operation of the signal separation phase finds the components of the model that best match the distribution determined from the observed signals. This is expressed as an optimization to minimize a distance between the distribution p(·), determined from the actually observed signals, and q(·), formed from the structured components, the distance function being represented as D(p(f,n,d) ‖ q(f,n,d)).
- in some implementations the distance is the Kullback-Leibler (KL) divergence, and the minimization is carried out with an iterative Minorization-Maximization (MM) procedure.
- the iteration is repeated a fixed number of times (e.g., 10 times).
- Alternative stopping criteria may be used, for example, based on the change in the distance function, change in the estimated values, etc.
- the computations identified above may be implemented efficiently as matrix computations (e.g., using matrix multiplications), and by computing intermediate quantities appropriately.
- Steps 2-4 of the iterative procedure outlined above can then be expressed compactly in terms of such matrix computations.
- the mask is computed as M(f,n) = Σ_{d,z} q(f,n,d,s*,z) / Σ_{d,s,z} q(f,n,d,s,z), where s* is the index of the desired source.
- the index of the desired source is determined by the estimated direction distribution q(d | s).
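- The sketch below shows one such re-estimation step and the mask computation, assuming the Kullback-Leibler distance with the usual multiplicative (EM-style) updates for this kind of model; building the full joint tensor is done only for clarity and would be replaced by the matrix-product formulation mentioned above for realistic sizes.

```python
import numpy as np

def mm_iteration(p, q_s, q_d_s, q_f_zs, q_nz_s, eps=1e-12):
    """One EM/MM-style update of the factored model against p(f, n, d).

    Shapes: p (F, N, D), q_s (S,), q_d_s (D, S), q_f_zs (F, Z, S), q_nz_s (N, Z, S)."""
    joint = np.einsum('s,ds,fzs,nzs->fndsz', q_s, q_d_s, q_f_zs, q_nz_s)
    w = joint * (p / (joint.sum(axis=(3, 4)) + eps))[..., None, None]   # expected counts
    q_s = np.einsum('fndsz->s', w)
    q_s /= q_s.sum() + eps
    q_d_s = np.einsum('fndsz->ds', w)
    q_d_s /= q_d_s.sum(axis=0, keepdims=True) + eps
    q_f_zs = np.einsum('fndsz->fzs', w)
    q_f_zs /= q_f_zs.sum(axis=0, keepdims=True) + eps
    q_nz_s = np.einsum('fndsz->nzs', w)
    q_nz_s /= q_nz_s.sum(axis=(0, 1), keepdims=True) + eps
    return q_s, q_d_s, q_f_zs, q_nz_s

def mask_for_source(q_s, q_d_s, q_f_zs, q_nz_s, s_star, eps=1e-12):
    """M(f, n) = sum_{d,z} q(f, n, d, s*, z) / sum_{d,s,z} q(f, n, d, s, z)."""
    joint = np.einsum('s,ds,fzs,nzs->fndsz', q_s, q_d_s, q_f_zs, q_nz_s)
    return joint[:, :, :, s_star, :].sum(axis=(2, 3)) / (joint.sum(axis=(2, 3, 4)) + eps)
```

- In use, the factors would be initialized to random non-negative, normalized values and the update applied a fixed number of times (e.g., 10), consistent with the stopping criteria described above.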
- This latter approach is somewhat analogous to using a time-varying Wiener filter in the case of X ( f,n ) representing the spectral energy (e.g., squared magnitude of the STFT).
- separating a desired signal from the acquired signals may be based on the estimated decomposition. For example, rather than identifying a particular desired signal, one or more undesirable signals may be identified and their contribution to X ( f,n ) "subtracted" to form an enhanced representation of the desired signal.
- the mask information may be used in directly estimating spectrally-based speech recognition feature vectors, such as cepstra, using a "missing data” approach (see, e.g., Kuhne et al., “Time-Frequency Masking: Linking Blind Source Separation and Robust Speech Recognition,” in Speech Recognition, Technologies and Applications (2008 )).
- such approaches treat time-frequency bins in which the source separation approach indicates the desired signal is absent as "missing” in determining the speech recognition feature vectors.
- the estimates may be made independently for different utterances and/or without any prior information.
- various sources of information may be used to improve the estimates.
- Prior information about the direction of a source may be used.
- the prior distribution of a speaker relative to a smartphone, or of a driver relative to a vehicle-mounted microphone, may be incorporated into the re-estimation of the direction information (e.g., the q(d | s) terms).
- tracking of a hand-held phone's orientation (e.g., using inertial sensors) may also be used to inform such prior direction information.
- prior information about a desired source's direction may be provided by the user, for example, via a graphical user interface, or may be inherent in the typical use of the user's device, for example, with a speaker being typically in a relatively consistent position relative to the face of a smartphone.
- Information about a source's spectral prototypes may be available from a variety of sources.
- One source may be a set of "standard" speech-like prototypes.
- Another source may be the prototypes identified in a previous utterance.
- Information about a source may also be based on characterization of expected interfering signals, for example, wind noise, windshield wiper noise, etc. This prior information may be used in a statistical prior model framework, or may be used as an initialization of the iterative optimization procedures described above.
- the server provides feedback to the user device that aids the separation of the desired signal.
- the user's device may provide the spectral information X ( f,n ) to the server, and the server, through the speech recognition process, may determine appropriate spectral prototypes q(f | z,s) for the desired source.
- the user's device may then use these as fixed values, as prior estimates, or as initializations for iterative re-estimation.
- ICA: Independent Components Analysis
- the acquired acoustic signals are processed by computing a time versus frequency distribution P ( f,n ) based on one or more of the acquired signals, for example, over a time window.
- the values of this distribution are non-negative, and in this example, the distribution is over a discrete set of frequency values f ∈ [1, F] and time values n ∈ [1, N].
- the value of P(f, n₀) is determined using a Short Time Fourier Transform at a discrete frequency f in the vicinity of time t₀ of the input signal, corresponding to the n₀-th analysis window (frame) of the STFT.
- the processing of the acquired signals also includes determining directional characteristics at each time frame for each of multiple components of the signals.
- One example of components of the signals across which directional characteristics are computed are separate spectral components, although it should be understood that other decompositions may be used.
- direction information is determined for each ( f,n ) pair, and the direction of arrival estimates D ( f,n ), indexed on the same indices, are determined as discretized (e.g., quantized) values, for example d ∈ [1, D] for D (e.g., 20) discrete (i.e., "binned") directions of arrival.
- for each time frame n, a directional histogram is formed representing the directions from which the different frequency components at that time frame originated.
- the processing of the acquired signals provides a continuous-valued (or finely quantized) direction estimate D ( f,n ) or a parametric or non-parametric distribution P(d | f,n).
- the resulting directional histogram can be interpreted as a measure of the strength of signal from each direction at each time frame.
- these histograms can change over time as some sources turn on and off (for example, when a person stops speaking, little to no energy would be coming from his general direction, unless there is another noise source behind him, a case we will not treat).
- Peaks in the resulting aggregated histogram then correspond to sources. These can be detected with a peak-finding algorithm, and boundaries between sources can be delineated, for example, by taking the mid-points between peaks.
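- The following sketch illustrates forming per-frame directional histograms from the quantized direction estimates, aggregating them, and picking local peaks; the circular neighbor test used as a peak finder is a simplification for the example.

```python
import numpy as np

def direction_histograms(X, D_idx, n_dirs=20):
    """H[n, d]: energy at frame n attributed to direction bin d."""
    F, N = X.shape
    H = np.zeros((N, n_dirs))
    for d in range(n_dirs):
        H[:, d] = np.where(D_idx == d, X, 0.0).sum(axis=0)
    return H

def find_source_directions(H):
    """Aggregate the histograms over time and return direction bins that are local peaks."""
    agg = H.sum(axis=0)
    n = len(agg)
    peaks = [d for d in range(n)
             if agg[d] >= agg[(d - 1) % n] and agg[d] >= agg[(d + 1) % n]]
    # boundaries between sources could then be taken at mid-points between adjacent peaks
    return agg, peaks
```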
- Another approach is to consider the collection of all directional histograms over time and analyze which directions tend to increase or decrease in weight together.
- One way to do this is to compute the sample covariance or correlation matrix of these histograms.
- the correlation or covariance of the distributions of direction estimates is used to identify separate distributions associated with different sources.
- a variety of analyses can be performed on the covariance matrix Q or on a correlation matrix.
- the principal components of Q (i.e., the eigenvectors associated with the largest eigenvalues) can be used to identify groups of directions whose energy rises and falls together, with each group associated with a source.
- Another way of using the correlation or covariance matrix is to form a pairwise "similarity" between pairs of directions d 1 and d 2 .
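- A sketch of this covariance/correlation analysis of the per-frame histograms is shown below; the greedy thresholded grouping is only one illustrative way of turning the pairwise similarity into direction groups.

```python
import numpy as np

def direction_similarity(H):
    """Sample covariance Q and correlation R of the per-frame directional histograms H[n, d]."""
    Hc = H - H.mean(axis=0, keepdims=True)
    Q = Hc.T @ Hc / max(len(H) - 1, 1)
    std = np.sqrt(np.diag(Q)) + 1e-12
    R = Q / np.outer(std, std)        # pairwise "similarity" between directions d1 and d2
    return Q, R

def group_directions(R, threshold=0.5):
    """Greedily group direction bins whose histograms co-vary strongly."""
    remaining, groups = set(range(R.shape[0])), []
    while remaining:
        seed = remaining.pop()
        group = {seed} | {d for d in remaining if R[seed, d] > threshold}
        remaining -= group
        groups.append(sorted(group))
    return groups
```

- The principal components of Q (e.g., obtained with np.linalg.eigh) offer another view of direction bins whose weights move together, as noted above.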
- the mask processing described below makes use of input mask values over a set of time-frequency locations that are determined by one or more of the approaches described above.
- These mask values may have local errors or biases. Such errors or biases can result in the output signal constructed from the masked signal having undesirable characteristics, such as audio artifacts.
- the determined mask information may be "smoothed."
- one general class of approaches to “smoothing” or otherwise processing the mask values makes use of a binary Markov Random Field treating the input mask values effectively as "noisy" observations of the true but not known (i.e., the actually desired) output mask values.
- a number of techniques described below address the case of binary masks; however, it should be understood that the techniques are directly applicable, or may be adapted, to the case of non-binary (e.g., continuous or multi-valued) masks. In many situations, sequential updating using the Gibbs algorithm or related approaches may be computationally prohibitive.
- Parallel updating procedures may not be available because the neighborhood structure of the Markov Random Field does not permit partitioning of the locations in such a way as to enable exact parallel update procedures. For example, a model that conditions each value on the eight neighbors in the time-frequency grid is not amenable to a partition into subsets of locations for exact parallel updating.
- a procedure presented herein therefore repeats in a sequence of update cycles.
- in each update cycle, a subset of locations (i.e., time-frequency components of the mask) is selected at random (e.g., selecting a random fraction, such as one half) or according to a deterministic pattern (e.g., a checkerboard pattern).
- When updating in parallel in the situation in which the underlying MRF is homogeneous, a location-invariant convolution according to a fixed kernel is used to compute values at all locations, and then the subset of values at the locations being updated is used in a conventional Gibbs update (e.g., drawing a random value and, in at least some examples, comparing it at each update location).
- the convolution is implemented in a transform domain (e.g., Fourier Transform domain).
- Use of the transform domain and/or the fixed convolution approach is also applicable in the exact situation where a suitable pattern (e.g., checkerboard pattern) of updates is chosen, for example, because the computational regularity provides a benefit that outweighs the computation of values that are ultimately not used.
- multiple signals are acquired at multiple sensors (e.g., microphones) (step 612).
- relative phase information at successive analysis frames ( n ) and frequencies ( f ) is determined in an analysis step (step 614).
- a value between -1.0 (i.e., a numerical quantity representing "probably off") and +1.0 (i.e., a numerical quantity representing "probably on") is determined for each time-frequency location as the raw (or input) mask M ( f,n ) (step 616).
- the input mask is determined in other ways than according to phase or direction of arrival information.
- An output of this procedure is to determine a smoothed mask S ( f,n ), which is initialized to be equal to the raw mask (step 618).
- a sequence of iterations of further steps is performed, for example terminating after a predetermined number of iterations (e.g., 50 iterations).
- Each iteration begins with a convolution of the current smoothed mask with a local kernel to form a filtered mask (step 622).
- this kernel extends plus and minus one sample in time and frequency, with weights arranged in a 3×3 grid (the 0.0 at the center corresponds to the location being updated):
  0.25 0.5 0.25
  1.0  0.0 1.0
  0.25 0.5 0.25
- a subset of a fraction h of the ( f,n ) locations, for example h = 0.5, is selected at random or alternatively according to a deterministic pattern (step 626).
- the smoothed mask S at these random locations is updated probabilistically such that a location ( f,n ) selected to be updated is set to +1.0 with a probability F ( f,n ) and -1.0 with a probability (1 - F ( f,n )) (step 628).
- An end of iteration test (step 632) allows the iteration of steps 622-628 to continue, for example for a predetermined number of iterations.
- a further computation (not illustrated in the flowchart of FIG. 5 ) is optionally performed to determine a smoothed filtered mask SF ( f,n ).
- This mask is computed as the sigmoid function applied to the average of the filtered mask computed over a trailing range of the iterations, for example, with the average computed over the last 40 of 50 iterations, to yield a mask with quantities in the range 0.0 to 1.0.
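- A self-contained sketch of this smoothing procedure is given below. The kernel and the random-subset Gibbs-style update follow the steps above, but the exact mapping from the filtered mask to the update probability F ( f,n ) is not fully specified in this summary, so the sigmoid combination used here (and the parameter `alpha`) is an assumption of the example.

```python
import numpy as np

KERNEL = np.array([[0.25, 0.5, 0.25],
                   [1.00, 0.0, 1.00],
                   [0.25, 0.5, 0.25]])

def convolve_same(M, kernel):
    """'Same'-size 2-D convolution with zero padding (the kernel here is symmetric)."""
    F, N = M.shape
    kf, kn = kernel.shape
    padded = np.pad(M, ((kf // 2, kf // 2), (kn // 2, kn // 2)), mode='constant')
    out = np.zeros_like(M)
    for i in range(kf):
        for j in range(kn):
            out += kernel[i, j] * padded[i:i + F, j:j + N]
    return out

def smooth_mask(M_raw, n_iter=50, n_burn=10, frac=0.5, alpha=1.0, seed=0):
    """Smooth a raw mask with values in [-1, +1]; returns (S, SF).

    S  : the final sampled mask (+/-1 at updated locations)
    SF : sigmoid of the filtered mask averaged over the trailing iterations, in [0, 1]."""
    rng = np.random.default_rng(seed)
    S = M_raw.astype(float).copy()
    trailing = []
    for it in range(n_iter):
        filtered = convolve_same(S, KERNEL)
        F_prob = 1.0 / (1.0 + np.exp(-alpha * (filtered + M_raw)))  # assumed combination rule
        sel = rng.random(S.shape) < frac                            # random subset of locations
        draws = np.where(rng.random(S.shape) < F_prob, 1.0, -1.0)   # Gibbs-like +/-1 draws
        S[sel] = draws[sel]
        if it >= n_burn:
            trailing.append(filtered)
    SF = 1.0 / (1.0 + np.exp(-np.mean(trailing, axis=0)))
    return S, SF
```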
- Implementations of the approaches described above may be implemented in software, in hardware, or in a combination of hardware and software.
- processing of the acquired acoustic signals may be performed in a general-purpose processor, in a special purpose processor (e.g., a signal processor, or a processor coupled to or embedded in a microphone unit), or may be implemented using special purpose circuitry (e.g., an Application Specific Integrated Circuit, ASIC).
- Software may include instructions stored on a non-transitory medium (e.g., a semiconductor storage device) or transferred to a user's device over a data network and at least temporarily stored in the data network.
- server implementations include one or more processors, and non-transitory machine-readable storage for instructions for implementing server-side procedures described above.
Applications Claiming Priority (6)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201361881709P | 2013-09-24 | 2013-09-24 | |
US201361881678P | 2013-09-24 | 2013-09-24 | |
US201361919851P | 2013-12-23 | 2013-12-23 | |
US14/138,587 US9460732B2 (en) | 2013-02-13 | 2013-12-23 | Signal source separation |
US201461978707P | 2014-04-11 | 2014-04-11 | |
PCT/US2014/057122 WO2015048070A1 (en) | 2013-09-24 | 2014-09-24 | Time-frequency directional processing of audio signals |
Publications (2)
Publication Number | Publication Date |
---|---|
EP3050056A1 (en) | 2016-08-03 |
EP3050056B1 (en) | 2018-09-05 |
Family
ID=52744399
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP14780737.4A Active EP3050056B1 (en) | 2013-09-24 | 2014-09-24 | Time-frequency directional processing of audio signals |
Country Status (3)
Country | Link |
---|---|
EP (1) | EP3050056B1 (zh) |
CN (1) | CN105580074B (zh) |
WO (1) | WO2015048070A1 (zh) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9460732B2 (en) | 2013-02-13 | 2016-10-04 | Analog Devices, Inc. | Signal source separation |
US9420368B2 (en) | 2013-09-24 | 2016-08-16 | Analog Devices, Inc. | Time-frequency directional processing of audio signals |
US10699727B2 (en) * | 2018-07-03 | 2020-06-30 | International Business Machines Corporation | Signal adaptive noise filter |
US11574628B1 (en) * | 2018-09-27 | 2023-02-07 | Amazon Technologies, Inc. | Deep multi-channel acoustic modeling using multiple microphone array geometries |
CN109859769B (zh) * | 2019-01-30 | 2021-09-17 | 西安讯飞超脑信息科技有限公司 | Mask estimation method and apparatus |
CN111739551A (zh) * | 2020-06-24 | 2020-10-02 | 广东工业大学 | Multi-channel cardiopulmonary sound denoising system based on low-rank and sparse tensor decomposition |
Family Cites Families (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1923866B1 (en) * | 2005-08-11 | 2014-01-01 | Asahi Kasei Kabushiki Kaisha | Sound source separating device, speech recognizing device, portable telephone, sound source separating method, and program |
KR101456866B1 (ko) * | 2007-10-12 | 2014-11-03 | 삼성전자주식회사 | Method and apparatus for extracting a target sound source signal from mixed sound |
US8144896B2 (en) * | 2008-02-22 | 2012-03-27 | Microsoft Corporation | Speech separation with microphone arrays |
JP5195652B2 (ja) * | 2008-06-11 | 2013-05-08 | ソニー株式会社 | Signal processing device, signal processing method, and program |
CN102138176B (zh) * | 2008-07-11 | 2013-11-06 | 日本电气株式会社 | Signal analysis device, signal control device, and methods therefor |
JP5229053B2 (ja) * | 2009-03-30 | 2013-07-03 | ソニー株式会社 | Signal processing device, signal processing method, and program |
KR101670313B1 (ko) * | 2010-01-28 | 2016-10-28 | 삼성전자주식회사 | Signal separation system and method for automatically selecting a threshold for sound source separation |
US8239366B2 (en) | 2010-09-08 | 2012-08-07 | Nuance Communications, Inc. | Method and apparatus for processing spoken search queries |
US9111526B2 (en) * | 2010-10-25 | 2015-08-18 | Qualcomm Incorporated | Systems, method, apparatus, and computer-readable media for decomposition of a multichannel music signal |
CN103024629B (zh) * | 2011-09-30 | 2017-04-12 | 斯凯普公司 | Processing signals |
US20150312663A1 (en) | 2012-09-19 | 2015-10-29 | Analog Devices, Inc. | Source separation using a circular model |
US9460732B2 (en) | 2013-02-13 | 2016-10-04 | Analog Devices, Inc. | Signal source separation |
2014
- 2014-09-24 EP EP14780737.4A patent/EP3050056B1/en active Active
- 2014-09-24 CN CN201480052202.9A patent/CN105580074B/zh active Active
- 2014-09-24 WO PCT/US2014/057122 patent/WO2015048070A1/en active Application Filing
Also Published As
Publication number | Publication date |
---|---|
EP3050056A1 (en) | 2016-08-03 |
WO2015048070A1 (en) | 2015-04-02 |
CN105580074B (zh) | 2019-10-18 |
CN105580074A (zh) | 2016-05-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9420368B2 (en) | Time-frequency directional processing of audio signals | |
EP3050056B1 (en) | Time-frequency directional processing of audio signals | |
Nugraha et al. | Multichannel audio source separation with deep neural networks | |
US10901063B2 (en) | Localization algorithm for sound sources with known statistics | |
US9668066B1 (en) | Blind source separation systems | |
US20160071526A1 (en) | Acoustic source tracking and selection | |
KR101688354B1 (ko) | 신호 소스 분리 | |
US20170178664A1 (en) | Apparatus, systems and methods for providing cloud based blind source separation services | |
CN111899756B (zh) | 一种单通道语音分离方法和装置 | |
CN106847301A (zh) | 一种基于压缩感知和空间方位信息的双耳语音分离方法 | |
JP6538624B2 (ja) | 信号処理装置、信号処理方法および信号処理プログラム | |
Lee et al. | Beamspace-domain multichannel nonnegative matrix factorization for audio source separation | |
JP5911101B2 (ja) | 音響信号解析装置、方法、及びプログラム | |
Nesta et al. | Robust Automatic Speech Recognition through On-line Semi Blind Signal Extraction | |
Hoffmann et al. | Using information theoretic distance measures for solving the permutation problem of blind source separation of speech signals | |
Liu et al. | A time domain algorithm for blind separation of convolutive sound mixtures and L1 constrainted minimization of cross correlations | |
Zhang et al. | Modulation domain blind speech separation in noisy environments | |
Fontaine et al. | Scalable source localization with multichannel α-stable distributions | |
Wu et al. | Blind separation of speech signals based on wavelet transform and independent component analysis | |
Adiloğlu et al. | A general variational Bayesian framework for robust feature extraction in multisource recordings | |
Chen et al. | Acoustic vector sensor based speech source separation with mixed Gaussian-Laplacian distributions | |
Maazaoui et al. | From binaural to multimicrophone blind source separation using fixed beamforming with HRTFs | |
CN111863017B (zh) | 一种基于双麦克风阵列的车内定向拾音方法及相关装置 | |
Mizuno et al. | Effective frame selection for blind source separation based on frequency domain independent component analysis | |
Iikawaa et al. | Blind Source Separation Based on Rotation of Joint Distribution Without Inversion of Positive and Negative Sign |
Legal Events
Code | Title | Description |
---|---|---|
PUAI | Public reference made under article 153(3) EPC to a published international application that has entered the European phase | Free format text: ORIGINAL CODE: 0009012 |
17P | Request for examination filed | Effective date: 20160303 |
AK | Designated contracting states | Kind code of ref document: A1; Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR |
AX | Request for extension of the European patent | Extension state: BA ME |
DAX | Request for extension of the European patent (deleted) | |
GRAP | Despatch of communication of intention to grant a patent | Free format text: ORIGINAL CODE: EPIDOSNIGR1 |
STAA | Information on the status of an EP patent application or granted EP patent | Free format text: STATUS: GRANT OF PATENT IS INTENDED |
INTG | Intention to grant announced | Effective date: 20180328 |
GRAS | Grant fee paid | Free format text: ORIGINAL CODE: EPIDOSNIGR3 |
GRAA | (expected) grant | Free format text: ORIGINAL CODE: 0009210 |
STAA | Information on the status of an EP patent application or granted EP patent | Free format text: STATUS: THE PATENT HAS BEEN GRANTED |
AK | Designated contracting states | Kind code of ref document: B1; Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR |
REG | Reference to a national code | Ref country code: GB; Ref legal event code: FG4D |
REG | Reference to a national code | Ref country code: CH; Ref legal event code: EP |
REG | Reference to a national code | Ref country code: AT; Ref legal event code: REF; Ref document number: 1038782; Kind code of ref document: T; Effective date: 20180915 |
REG | Reference to a national code | Ref country code: IE; Ref legal event code: FG4D |
REG | Reference to a national code | Ref country code: DE; Ref legal event code: R096; Ref document number: 602014031866 |
REG | Reference to a national code | Ref country code: FR; Ref legal event code: PLFP; Year of fee payment: 5 |
REG | Reference to a national code | Ref country code: NL; Ref legal event code: MP; Effective date: 20180905 |
REG | Reference to a national code | Ref country code: LT; Ref legal event code: MG4D |
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to EPO] | Lapse because of failure to submit a translation of the description or to pay the fee within the prescribed time-limit: FI, RS, SE, LT (effective 20180905); NO, BG (20181205); GR (20181206) |
REG | Reference to a national code | Ref country code: AT; Ref legal event code: MK05; Ref document number: 1038782; Kind code of ref document: T; Effective date: 20180905 |
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to EPO] | Lapse because of failure to submit a translation of the description or to pay the fee within the prescribed time-limit: AL, LV, HR (effective 20180905) |
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to EPO] | Lapse because of failure to submit a translation of the description or to pay the fee within the prescribed time-limit: ES, CZ, PL, RO, IT, AT, NL, EE (effective 20180905); IS (20190105) |
REG | Reference to a national code | Ref country code: CH; Ref legal event code: PL |
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to EPO] | Lapse because of failure to submit a translation of the description or to pay the fee within the prescribed time-limit: SM, SK (effective 20180905); PT (20190105) |
REG | Reference to a national code | Ref country code: DE; Ref legal event code: R097; Ref document number: 602014031866 |
REG | Reference to a national code | Ref country code: BE; Ref legal event code: MM; Effective date: 20180930 |
REG | Reference to a national code | Ref country code: IE; Ref legal event code: MM4A |
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to EPO] | Lapse because of non-payment of due fees: LU (effective 20180924) |
PLBE | No opposition filed within time limit | Free format text: ORIGINAL CODE: 0009261 |
STAA | Information on the status of an EP patent application or granted EP patent | Free format text: STATUS: NO OPPOSITION FILED WITHIN TIME LIMIT |
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to EPO] | Lapse because of failure to submit a translation of the description or to pay the fee within the prescribed time-limit: MC, DK (effective 20180905); lapse because of non-payment of due fees: IE (20180924) |
26N | No opposition filed | Effective date: 20190606 |
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to EPO] | Lapse because of non-payment of due fees: CH, BE, LI (effective 20180930); lapse because of failure to submit a translation of the description or to pay the fee within the prescribed time-limit: SI (20180905) |
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to EPO] | Lapse because of non-payment of due fees: MT (effective 20180924) |
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to EPO] | Lapse because of failure to submit a translation of the description or to pay the fee within the prescribed time-limit: TR (effective 20180905) |
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to EPO] | Lapse because of failure to submit a translation of the description or to pay the fee within the prescribed time-limit: CY (effective 20180905); HU (invalid ab initio, effective 20140924); lapse because of non-payment of due fees: MK (20180905) |
PGFP | Annual fee paid to national office [announced via postgrant information from national office to EPO] | Ref country code: DE; Payment date: 20240820; Year of fee payment: 11 |
PGFP | Annual fee paid to national office [announced via postgrant information from national office to EPO] | Ref country code: GB; Payment date: 20240822; Year of fee payment: 11 |
PGFP | Annual fee paid to national office [announced via postgrant information from national office to EPO] | Ref country code: FR; Payment date: 20240820; Year of fee payment: 11 |