US20150071461A1 - Single-channel suppression of interfering sources - Google Patents
Single-channel suppression of interfering sources
- Publication number
- US20150071461A1 (application US14/540,778)
- Authority
- US
- United States
- Prior art keywords
- audio signal
- noise
- type
- source
- interfering
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
- H04R3/00—Circuits for transducers, loudspeakers or microphones
- H04R3/002—Damping circuit arrangements for transducers, e.g. motional feedback circuits
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L2015/025—Phonemes, fenemes or fenones being the recognition units
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
- H04R3/00—Circuits for transducers, loudspeakers or microphones
- H04R3/005—Circuits for transducers, loudspeakers or microphones for combining the signals of two or more microphones
Definitions
- the present invention generally relates to systems and methods that process audio signals, such as speech signals, to remove components of one or more interfering sources therefrom.
- noise suppression generally describes a type of signal processing that attempts to attenuate or remove an undesired noise component from an input audio signal. Noise suppression may be applied to almost any type of audio signal that may include an undesired noise component. Conventionally, noise suppression functionality is often implemented in telecommunications devices, such as telephones, Bluetooth® headsets, or the like, to attenuate or remove an undesired additive background noise component from an input speech signal.
- An input speech signal may be viewed as comprising both a desired speech signal (sometimes referred to as “clean speech”) and an additive noise signal.
- the additive noise signal may comprise stationary noise, non-stationary noise, echo, residual echo, etc.
- Many conventional noise suppression techniques are unable to effectively differentiate between, model, and suppress these different types of interfering sources, thereby resulting in a non-optimal noise-suppressed audio signal.
- FIG. 1 is a block diagram of a communication device, according to an example embodiment.
- FIG. 2 is a block diagram of an example system that includes multi-microphone configurations, frequency domain acoustic echo cancellation, source tracking, switched super-directive beamforming, adaptive blocking matrices, adaptive noise cancellation, and single-channel suppression, according to example embodiments.
- FIG. 3A depicts an example graph that illustrates a 3-mixture 2-dimensional Gaussian mixture model trained on features that comprise adaptive noise canceller to blocking matrix ratios or signal-to-noise ratios, according to an example embodiment.
- FIG. 3B depicts an example graph that illustrates a 3-mixture 2-dimensional Gaussian mixture model trained on features that comprise adaptive noise canceller to blocking matrix ratios or signal-to-noise ratios, according to another example embodiment.
- FIG. 3C is a block diagram of a back-end single-channel suppression component, according to an example embodiment.
- FIG. 3D depicts example diagnostic plots of 1-dimensional 2-mixture Gaussian mixture model parameters during online parameter estimation of a signal-to-noise feature vector, according to an example embodiment.
- FIG. 3E depicts example plots associated with an input signal that includes speech and car noise, according to an example embodiment.
- FIG. 3F depicts example diagnostic plots of 1-dimensional 2-mixture Gaussian mixture model parameters during online parameter estimation of an adaptive noise canceller to blocking matrix ratio, according to an example embodiment.
- FIG. 3G depicts example plots associated with an input signal that includes speech and car noise, according to another example embodiment.
- FIG. 3H depicts an example graph that plots example masking functions for different windowing functions, according to an example embodiment.
- FIG. 3I depicts example diagnostic plots associated with an input signal that includes speech and babble noise, according to an example embodiment.
- FIG. 3J depicts example diagnostic plots associated with an input signal that includes speech and babble noise, according to another example embodiment.
- FIG. 4 depicts a flowchart of a method for determining a noise suppression gain, according to an example embodiment.
- FIG. 5 depicts a flowchart of a method for applying a determined gain to an audio signal, according to an example embodiment.
- FIG. 6 depicts a flowchart of a method for setting, based on a rate at which an energy contour associated with an audio signal changes over time, a value of a first parameter that specifies a degree of balance between a distortion of a desired source included in the audio signal and a distortion of a residual amount of a first type of interfering source present in the audio signal, and a value of a second parameter that specifies the corresponding balance for a second type of interfering source, according to an example embodiment.
- FIG. 7 is a block diagram of a back-end single-channel suppression component that is configured to suppress multiple types of non-stationary noise and/or other types of interfering sources that may be present in an audio signal, according to an example embodiment.
- FIG. 8 is a block diagram of a generalized back-end single-channel suppression component, according to an example embodiment.
- FIG. 9 is a block diagram of a processor that may be configured to perform techniques disclosed herein.
- references in the specification to “one embodiment,” “an embodiment,” “an example embodiment,” etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.
- “Coupled” and “connected” may be used synonymously herein, and may refer to physical, operative, electrical, communicative and/or other connections between components described herein, as would be understood by a person of skill in the relevant art(s) having the benefit of this disclosure.
- Back-end single-channel suppression may refer to the suppression of interfering source(s) in a single-channel audio signal during the back-end processing of the single-channel audio signal.
- the single-channel audio signal may be generated from a single microphone, or may be based on an audio signal in which noise has been suppressed during the front-end processing of the audio signal using multiple microphones (e.g., by applying a multi-microphone noise reduction technique).
- the back-end single-channel suppression techniques may suppress type(s) of additive noise using one or more suppression branches (e.g., a non-spatial (or stationary noise) branch, a spatial (or non-stationary noise) branch, a residual echo suppression branch, etc.).
- the non-spatial branch may be configured to suppress stationary noise from the single-channel audio signal
- the spatial branch may be configured to suppress non-stationary noise from the single-channel audio signal
- the residual echo suppression branch may be configured to suppress residual echo from the single-channel audio signal.
- the spatial branch may be disabled based on an operational mode (e.g., single-user speakerphone mode or a conference speakerphone mode) of the communication device or based on a determination that spatial information (e.g., information that is used to distinguish a desired source from non-stationary noise present in the single-channel audio signal) is ambiguous.
- the example techniques and embodiments described herein may be adapted to various types of communication devices, communications systems, computing systems, electronic devices, and/or the like, which perform back-end single-channel suppression in an uplink path in such devices and/or systems.
- back-end single-channel suppression may be implemented in devices and systems according to the techniques and embodiments herein.
- additional structural and operational embodiments, including modifications and/or alterations, will become apparent to persons skilled in the relevant art(s) from the teachings herein.
- a method for suppressing multiple types of interfering sources included in an audio signal.
- an audio signal that comprises at least a desired source component and at least one interfering source type is received.
- a noise suppression gain is determined based on a statistical modeling of at least one feature associated with the audio signal using a mixture model comprising a plurality of model mixtures.
- Each of the plurality of model mixtures is associated with either the desired source component or an interfering source type of the at least one interfering source type.
- a method for determining and applying suppression of interfering sources to an audio signal is further described herein.
- one or more first characteristics associated with a first type of interfering source included in an audio signal are determined
- One or more second characteristics associated with a second type of interfering source included in the audio signal are also determined
- a gain is determined based on the one or more first characteristics and the one or more second characteristics. The determined gain is applied to the audio signal.
- a system for determining and applying suppression of interfering sources to an audio signal includes a signal-to-stationary noise ratio feature statistical modeling component configured to determine one or more first characteristics associated with a first type of interfering source included in the audio signal.
- the system also includes a spatial feature statistical modeling component configured to determine one or more second characteristics associated with a second type of interfering source included in the audio signal.
- the system further includes a multi-noise source gain component configured to determine a gain based on the one or more first characteristics and the one or more second characteristics, and a gain application component configured to apply the determined gain to the audio signal.
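As a rough illustration of the multi-noise source gain and gain application components described above, the following Python sketch combines per-frequency-bin gains from two suppression branches and applies the result to a magnitude spectrum. The product combination rule and all names here are illustrative assumptions, not the patent's specified method.

```python
def apply_multi_source_gain(spectrum_mag, stationary_gain, nonstationary_gain):
    """Combine per-bin suppression gains from two branches and apply them
    to a magnitude spectrum. The product rule used here is an illustrative
    choice; the multi-noise source gain component may combine differently."""
    return [mag * g1 * g2
            for mag, g1, g2 in zip(spectrum_mag, stationary_gain, nonstationary_gain)]
```

Each branch contributes a gain in [0, 1] per bin, so the combined gain only attenuates, never amplifies, the input.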
- Systems and devices may be configured in various ways to perform back-end single-channel suppression of interfering source(s) included in an audio signal. Techniques and embodiments are also provided for implementing devices and systems with back-end single-channel suppression.
- FIG. 1 shows an example communication device 100 for implementing back-end single-channel suppression in accordance with an example embodiment.
- Communication device 100 may include an input interface 102 , an optional display interface 104 , a plurality of microphones 106 1 - 106 N , a loudspeaker 108 , and a communication interface 110 .
- communication device 100 may include one or more instances of a frequency domain acoustic echo cancellation (FDAEC) component 112 , a multi-microphone noise reduction (MMNR) component 114 , and/or a single-channel suppression (SCS) component 116 .
- communication device 100 may include one or more processor circuits (not shown) such as processor circuit 1200 of FIG. 12 described below.
- input interface 102 and optional display interface 104 may be combined into a single, multi-purpose input-output interface, such as a touchscreen, or may be any other form and/or combination of known user interfaces as would be understood by a person of skill in the relevant art(s) having the benefit of this disclosure.
- loudspeaker 108 may be any standard electronic device loudspeaker that is configurable to operate in a speakerphone or conference phone type mode (e.g., not in a handset mode).
- loudspeaker 108 may comprise an electro-mechanical transducer that operates in a well-known manner to convert electrical signals into sound waves for perception by a user.
- communication interface 110 may comprise wired and/or wireless communication circuitry and/or connections to enable voice and/or data communications between communication device 100 and other devices such as, but not limited to, computer networks, telecommunication networks, other electronic devices, the Internet, and/or the like.
- plurality of microphones 106 1 - 106 N may include two or more microphones, in embodiments. Each of these microphones may comprise an acoustic-to-electric transducer that operates in a well-known manner to convert sound waves into an electrical signal. Accordingly, plurality of microphones 106 1 - 106 N may be said to comprise a microphone array that may be used by communication device 100 to perform one or more of the techniques described herein. For instance, in embodiments, plurality of microphones 106 1 - 106 N may include 2, 3, 4, . . . , to N microphones located at various locations of communication device 100 .
- any number of microphones may be configured in communication device 100 embodiments.
- embodiments that include more microphones in plurality of microphones 106 1 - 106 N provide for finer spatial resolution of beamformers for suppressing interfering sources and for better tracking of sources.
- back-end SCS 116 can be used by itself without MMNR 114 .
- FDAEC component 112 is configured to provide a scalable algorithm and/or circuitry for two to many microphone inputs.
- MMNR component 114 is configured to include a plurality of subcomponents for determining and/or estimating spatial parameters associated with audio sources, for directing a beamformer, for online modeling of acoustic scenes, for performing source tracking, and for performing adaptive noise reduction, suppression, and/or cancellation.
- SCS component 116 is configurable to perform single-channel suppression of interfering source(s) using non-spatial information, using spatial information, and/or using downlink signal information. Further details and embodiments of FDAEC component 112 , MMNR component 114 , and SCS component 116 are provided below.
- FIG. 1 is shown in the context of a communication device, the described embodiments may be applied to a variety of products that employ multi-microphone noise suppression for speech signals.
- Embodiments may be applied to portable products, such as smart phones, tablets, laptops, gaming systems, etc., to stationary products, such as desktop computers, office phones, conference phones, gaming systems, etc., and to car entertainment/navigation systems, as well as being applied to further types of mobile and stationary devices.
- Embodiments may be used for MMNR and/or suppression for speech communication, for enhancing speech signals as a pre-processing step for automated speech processing applications, such as automatic speech recognition (ASR), and in further types of applications.
- System 200 may be a further embodiment of a portion of communication device 100 of FIG. 1 .
- system 200 may be included, in whole or in part, in communication device 100 .
- system 200 includes plurality of microphones 106 1 - 106 N , FDAEC component 112 , MMNR component 114 , and SCS component 116 .
- System 200 also includes an acoustic echo cancellation (AEC) component 204 , a microphone mismatch compensation component 208 , a microphone mismatch estimation component 210 , and an automatic mode detector 222 .
- FDAEC component 112 may be included in AEC component 204 as shown, and references to AEC component 204 herein may inherently include a reference to FDAEC component 112 unless specifically stated otherwise.
- MMNR component 114 includes a steered null error phase transform (SNE-PHAT) time delay of arrival (TDOA) estimation component 212 , an on-line Gaussian mixture model (GMM) modeling component 214 , an adaptive blocking matrix (ABM) component 216 , a switched super-directive beamformer (SSDB) 218 , and an adaptive noise canceller (ANC) 220 .
- automatic mode detector 222 may be structurally and/or logically included in MMNR component 114 . It is noted that component 112 may use acoustic echo cancellation schemes other than FDAEC.
- MMNR component 114 may be considered to be the front-end processing portion of system 200 (e.g., the “front end”), and SCS component 116 may be considered to be the back-end processing portion of system 200 (e.g., the “back end”).
- AEC component 204 , FDAEC component 112 , microphone mismatch compensation component 208 , and microphone mismatch estimation component 210 may be included in references to the front end.
- plurality of microphones 106 1 - 106 N provides N microphone inputs 206 to AEC 204 and its instances of FDAEC 112 .
- AEC 204 also receives a downlink signal 202 (a signal received from a far-end device) as an input, which may include one or more downlink signals “L” in embodiments.
- AEC 204 provides echo-cancelled outputs 224 to microphone mismatch compensation component 208 , provides residual echo information 238 to SCS component 116 , and/or provides downlink-uplink coherence information 246 (i.e., an estimate of the coherence between the downlink and uplink signals as a measure of residual echo presence) to SNE-PHAT TDOA estimation component 212 and/or on-line GMM modeling component 214 .
- Microphone mismatch estimation component 210 provides estimated microphone mismatch values 248 to microphone mismatch compensation component 208 .
- Microphone mismatch compensation component 208 provides compensated microphone outputs 226 (e.g., normalized microphone outputs) to microphone mismatch estimation component 210 (and in some embodiments, not shown, microphone mismatch estimation component 210 may also receive echo-cancelled outputs 224 directly), to SNE-PHAT TDOA estimation component 212 , to adaptive blocking matrix component 216 , and to SSDB 218 .
- SNE-PHAT TDOA estimation component 212 provides spatial information 228 to on-line GMM modeling component 214
- on-line GMM modeling component 214 provides statistics, mixtures, and probabilities 230 based on acoustic scene modeling to automatic mode detector 222 , to adaptive blocking matrix component 216 , and to SSDB 218 .
- SSDB 218 provides a desired source single output selected signal 232 to ANC 220
- ABM component 216 provides non-desired source signals 234 to ANC 220 , as well as to SCS component 116 .
- Automatic mode detector 222 provides a mode enable signal 236 to MMNR component 114 and to SCS component 116
- ANC 220 provides a noise-cancelled (or enhanced) source signal 240 to SCS component 116
- SCS component 116 provides a suppressed signal 244 as an output for subsequent processing and/or uplink transmission.
- SCS component 116 also provides a soft-disable control signal 242 to MMNR component 114 .
- SCS component 116 is configured to perform single-channel suppression of interfering source(s) on enhanced source signal 240 .
- SCS component 116 is configured to perform single-channel suppression using non-spatial information, using spatial information, and/or using downlink signal information.
- SCS component 116 is also configured to determine spatial ambiguity in the acoustic scene, and to provide a soft-disable control signal 242 that causes MMNR 114 (or portions thereof) to be disabled when SCS component 116 is in a spatially ambiguous state.
- one or more of the components and/or sub-components of system 200 may be configured to be dynamically disabled based upon enable/disable outputs received from the back end, such as soft-disable control signal 242 .
- the specific system connections and logic associated therewith are not shown for the sake of brevity and illustrative clarity in FIG. 2 , but would be understood by persons of skill in the relevant art(s) having the benefit of this disclosure.
- back-end single-channel suppression of one or more types of interfering sources (e.g., additive noise)
- back-end single-channel suppression is performed based on a statistical modeling of acoustic source(s). Examples of such sources include desired speaker(s), interfering speaker(s), stationary noise (e.g., diffuse or point-source noise), non-stationary noise, residual echo, reverberation, etc.
- subsection IV.A describes how acoustic sources are statistically modeled
- subsection IV.B describes a system that implements the statistical modeling of acoustic sources to suppress multiple types of interfering sources from an audio signal.
- Statistical modeling may comprise two steps, namely adaptation and inference.
- models are adapted to current observations to capture the generally non-stationary states of the underlying processes.
- inference is performed to classify subpopulations of the data, and extract information regarding the current acoustic scene.
- the goal of back-end modeling is to provide the system with time- and frequency-specific probabilistic information regarding the activity of various sources, which can then be leveraged during the calculation of the back-end noise suppression gain (e.g., calculated by multi-noise source gain component 332 , as described below with reference to FIG. 3C ).
- Mixture models (MMs) are hierarchical probabilistic models that can be used to represent statistical distributions of arbitrary shape.
- MMs are useful when modeling the marginal distribution of data in the presence of subpopulations.
- mixture models correspond to a linear mixing of individual distributions, where mixing weights are used to control the effect of each.
- the Gaussian mixture model serves as an efficient tool for estimating data distributions, particularly of a dimension greater than one, due to various attractive mathematical properties.
- the maximum likelihood (ML) estimates of the mean vector and covariance matrix are obtainable in closed form.
- The GMM distribution of a random variable x_n of dimension D is given by Equation 1, which is shown below:
- p(x_n) = Σ_{m=1}^{M} w_m · N(x_n; μ_m, C_m) (Equation 1)
- μ_m represent Gaussian means
- C_m represent Gaussian covariance matrices
- w_m represent mixing weights
- M denotes the number of mixtures (i.e., model mixtures) in the GMM.
- evaluating the probability distribution function (pdf) of a trained GMM involves the calculation of the above equation for a given data point x n .
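As a concrete (hypothetical) sketch of this evaluation, the following Python computes a GMM density at a data point. For brevity it assumes diagonal covariances; full covariance matrices would additionally require a matrix inverse and determinant. The function names are illustrative, not from the patent.

```python
import math

def gaussian_pdf_diag(x, mu, var):
    """Evaluate a D-dimensional Gaussian N(x; mu, diag(var)) with diagonal
    covariance, accumulated in the log domain for numerical stability."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mu, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return math.exp(log_p)

def gmm_pdf(x, weights, means, variances):
    """Mixture density: p(x) = sum_m w_m * N(x; mu_m, C_m), C_m diagonal."""
    return sum(w * gaussian_pdf_diag(x, mu, var)
               for w, mu, var in zip(weights, means, variances))
```

Evaluating the pdf therefore costs one Gaussian evaluation per mixture, summed with the mixing weights.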
- the adaptation step of back-end statistical modeling performs parameter estimation to obtain a trained model based on a set of training data, i.e., adapting the parameter set Φ.
- Parameter estimation optimizes model parameters by maximizing some cost function. Examples of common cost functions include the ML and maximum a posteriori (MAP) cost functions.
- An example of the ML cost for the training process of a GMM for batch processing is shown below as Equation 2. Let {x_1, x_2, . . . , x_N} be a set of N data samples of dimension D:
- Φ_ML = argmax_Φ Σ_{n=1}^{N} log Σ_{m=1}^{M} w_m · N(x_n; μ_m, C_m) (Equation 2)
- N(x_n; μ_m, C_m) denotes the evaluation of a Gaussian distribution with parameters μ_m and C_m at x_n.
- The use of GMMs allows freedom in designing the feature vector x_n.
- the feature vector should be constructed to include elements which may provide discriminative information for the inference step of back-end statistical modeling.
- elements which provide complementary information may be included in the feature vector.
- feature elements should be conditioned to better fit the Gaussian assumption implied by the use of this model. For example, features which occur naturally in the form of ratios can be used in the log domain because this avoids the non-negative, highly-skewed nature of ratios.
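The log-domain conditioning of ratio features described above might look like the following sketch; the function name, the dB scaling, and the power floor are assumptions for illustration.

```python
import math

def log_ratio_feature(num_power, den_power, floor=1e-10):
    """Condition a power ratio for Gaussian modeling by mapping it to the
    log domain (here in dB). Raw ratios are non-negative and highly skewed;
    log-ratios are roughly symmetric about zero. The floor guards log(0)."""
    return 10.0 * math.log10(max(num_power, floor) / max(den_power, floor))
```

Note that swapping numerator and denominator simply negates the feature, which is the symmetry the Gaussian assumption benefits from.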
- The notation x_n(k) is introduced to represent the k-th element of a full-band feature vector corresponding to time index n.
- Similarly, the notation x_n,m(k) represents the k-th element of a feature vector corresponding to time index n and frequency channel m.
- the GMM parameter estimation in subsection IV.A.1 assumes the availability of all training samples. However, such batch processing is not realistic for communication systems, wherein successive (training) samples are observed in time and delay to buffer future samples is not practical. Instead, an online method to adapt the GMM parameters as new samples arrive (e.g., during a communication session) is desirable. In online GMM parameter estimation, it is assumed that the GMM has previously been trained on a set of N past samples. The system then observes K new samples, and the GMM is updated based on these new samples.
- One method by which to perform online parameter estimation is to use the MAP cost function. This involves defining the a priori distribution of ⁇ conditioned on the original N data samples.
- the resulting update of the mixture priors takes the form: π_m = ( N·π′_m + Σ_{n=N+1}^{N+K} P′(m | x_n) ) / ( N + K ), where π′_m denotes the prior estimated from the original N samples and P′(m | x_n) denotes the posterior probability of mixture m given new sample x_n.
- A simple heuristic method by which to emphasize recent samples is to calculate π_m in an alternative manner, as shown below in Equation 12:
- π_m = ( min(N, N_max)·π′_m + Σ_{n=N+1}^{N+K} P′(m | x_n) ) / ( min(N, N_max) + K )
- N max corresponds to some constant.
- calculating π_m in this manner avoids the influence of new samples converging to zero as the total number of observed data samples N grows very large.
- minimum constraints can be placed on mixture priors. That is, after an iteration of data-driven parameter estimation, mixture priors are floored at a threshold. This generally requires all mixture priors to be altered, due to the constraint that mixture weights must sum to unity. Application of minimum constraints on mixture priors maintains the presence of acoustic source mixtures, even during extended periods of source inactivity. Additionally, it allows GMM modeling to rapidly recapture the inactive source when it eventually becomes active.
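The online prior update with the N_max cap and the prior flooring described above can be sketched as follows. This is a minimal NumPy illustration; the function names, the default floor value, and the simple renormalization step are assumptions, not the patent's exact procedure:

```python
import numpy as np

def update_priors(old_priors, responsibilities, n_past, n_max=None):
    """Online update of GMM mixture priors after K new samples.

    responsibilities: (K, M) array of P'(m | x_n) for the new samples.
    n_past is the number of previously seen samples; capping it at n_max
    (Equation 12) keeps the adaptation rate from vanishing as the total
    sample count grows.
    """
    k = responsibilities.shape[0]
    n_eff = n_past if n_max is None else min(n_past, n_max)
    new_priors = (n_eff * old_priors + responsibilities.sum(axis=0)) / (n_eff + k)
    return new_priors / new_priors.sum()  # guard against rounding drift

def floor_priors(priors, floor=0.01):
    """Floor mixture priors at a threshold, then rescale to sum to unity.

    Keeps mixtures for currently inactive sources alive so the model can
    rapidly recapture a source when it becomes active again.
    """
    floored = np.maximum(priors, floor)
    return floored / floored.sum()
```

Note that the renormalization in floor_priors can push a just-floored prior slightly below the threshold again; an exact redistribution scheme would resolve this, but the simple form conveys the idea.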
- the inference step in back-end statistical modeling involves classifying the underlying acoustic source types corresponding to each GMM mixture, and then extracting probabilistic information regarding the activity of each source.
- Stationary SNR: The time- and frequency-localized stationary log-domain SNRs can be used to differentiate between stationary noise sources and non-stationary acoustic sources. Mixtures representing stationary noise sources are expected to include highly negative mean values of this element. Mixtures corresponding to desired sources can be expected to show a particularly high stationary SNR mean.
- Adaptive noise canceller to blocking matrix ratio: The time- and frequency-localized non-stationary log-domain adaptive noise canceller (e.g., ANC 220 , as shown in FIG. 2 ) to blocking matrix (e.g., ABM 216 , as shown in FIG. 2 ) ratios can be used to differentiate between non-stationary noise sources and desired sources. Mixtures representing non-stationary noise sources are expected to include highly negative mean values of this element. Mixtures corresponding to desired sources can again be expected to show particularly high mean values of this element.
- Signal to reverberation ratio (SRR)
- Echo return loss enhancement (ERLE): The log-domain ERLE can be used to differentiate between acoustic sources originating in the present environment, and those originating from the device speaker. Mixtures representing residual echo are expected to show high ERLE mean values, whereas other sources are expected to show small ERLE mean values. In this particular case, ERLE refers to a short-term or instantaneous ratio of down-link to up-link power, possibly as a function of frequency.
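The instantaneous ERLE feature described above can be sketched as a log-domain ratio of down-link to up-link power. This is an illustrative helper; the function name and the eps guard against zero power are assumptions:

```python
import math

def short_term_erle(downlink_power, uplink_power, eps=1e-12):
    """Instantaneous ERLE feature: log-domain ratio of down-link to
    up-link power; may also be computed per frequency band.
    High values suggest residual echo; small values suggest other sources."""
    return 10.0 * math.log10((downlink_power + eps) / (uplink_power + eps))
```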
- FIG. 3A illustrates an example graph of a 3-mixture 2-dimensional GMM trained on features comprising adaptive noise canceller to blocking matrix ratios and SNRs. Mixtures are shown by contours of a constant pdf. As shown in FIG. 3A , the acoustic sources present are desired source 335 , stationary noise 337 , and non-stationary noise 339 . The parameters of each mixture are consistent with the expected statistical behavior of each source type, as outlined above.
- An objective of statistical modeling in back-end single-channel suppression is to provide probabilistic information regarding the present activity of various sources, which can be used during calculation of the back-end multi-noise source gain rule.
- the feature vector x n is designed to include information which may improve separation of acoustic sources in feature space. However, in some cases there exists supplemental information which may be advantageous to use in statistical analysis of acoustic sources, but may not be appropriate for inclusion in the model feature vector.
- an example of supplemental full-band information is a voice activity detection (VAD) decision.
- another example of supplemental full-band information is the posterior probability of a target speaker provided by a speaker identification (SID) system. This information would be leveraged analogously to Equation 15.
- feature elements are chosen to provide separation between acoustic source types during back-end statistical modeling.
- the intended discriminative power of the feature may become insufficient for reliable GMM inference.
- An example of this is when two or more acoustic sources are physically located relative to the device microphones of a communication device (e.g., communication device 100 , as shown in FIG. 1 ) such that their time differences of arrival (TDOAs) become very similar, and any feature designed to exploit spatial diversity becomes ambiguous. It is then advantageous to recognize the lack of separation provided by this dimension of the GMM, and disable inference related to it.
- FIG. 3B illustrates an example graph of a 3-mixture 2-dimensional GMM trained on features comprising adaptive noise canceller to blocking matrix ratios and SNRs, similar to FIG. 3A . Again, mixtures are shown by contours of a constant pdf, and the acoustic sources present are desired source 335 , stationary noise 337 , and non-stationary noise 339 . As opposed to the example shown in FIG. 3A , the adaptive noise canceller to blocking matrix ratio feature, which is intended to capture spatial diversity of sources, has become ambiguous due to, e.g., the physical locations of the acoustic sources.
- the separation between the mixtures representing them is taken into account.
- the symmetrized Kullback-Leibler (KL) distance is used to quantify this separation.
- the symmetrized KL distance between mixtures i and j is given by:
- logistic regression, an example of which is shown below with reference to Equation 18, is appealing since it naturally outputs predictions within the range [0,1]:
- Reliability(i, j) = 1 / ( 1 + exp( −α · ( d_i,j^KL − β ) ) ),   Equation 18
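Equation 18, together with the symmetrized KL distance it operates on, can be sketched for 1-dimensional mixtures as follows. The closed-form Gaussian KL divergence is standard; the default alpha and beta values are placeholders, not the patent's constants:

```python
import math

def kl_gauss(mu_p, var_p, mu_q, var_q):
    """KL divergence KL(p || q) between two 1-D Gaussians."""
    return 0.5 * (math.log(var_q / var_p)
                  + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)

def reliability(mu_i, var_i, mu_j, var_j, alpha=1.0, beta=2.0):
    """Equation 18: logistic mapping of the symmetrized KL distance
    between mixtures i and j to a reliability value in [0, 1]."""
    d_kl = (kl_gauss(mu_i, var_i, mu_j, var_j)
            + kl_gauss(mu_j, var_j, mu_i, var_i))
    return 1.0 / (1.0 + math.exp(-alpha * (d_kl - beta)))
```

Well-separated mixtures yield a reliability near 1; overlapping mixtures yield a value near 0, signaling that inference on that feature dimension should be de-weighted or disabled.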
- back-end statistical modeling may use a single unifying model for all acoustic sources. This allows all statistical correlation between sources to be exploited during the process.
- large mixture-number GMM modeling is performed with smaller parallel GMMs.
- FIG. 3C is a block diagram of a back-end single-channel suppression (SCS) component 300 that performs noise suppression of multiple types of interfering sources using statistical modeling that has been decoupled into separate parallel branches in accordance with an embodiment.
- the benefit of multivariate modeling is the ability to capture statistical correlation between features. Therefore, the branches may be configured to cluster features with high inter-feature correlation.
- the motivation for such a system is that each of the previously mentioned acoustic sources is expected to display specific correlation patterns, thereby improving separation relative to 1-dimensional modeling.
- Back-end SCS component 300 is configured to suppress multiple types of interfering sources (e.g., stationary noise, non-stationary noise, residual echo, etc.) present in a first signal 340 .
- Back-end SCS component 300 may be configured to receive first signal 340 and a second signal 334 and provide a suppressed signal 344 .
- suppressed signal 344 may correspond to suppressed signal 244 , as shown in FIG. 2 .
- First signal 340 may be a suppressed signal provided by a multi-microphone noise reduction (MMNR) component (e.g., MMNR component 114 ), and second signal 334 may be a noise estimate provided by the MMNR component that is used to obtain first signal 340 .
- Back-end SCS component 300 may comprise an implementation of SCS component 116 , as described above in reference to FIGS. 1 and 2 .
- first signal 340 may correspond to enhanced source signal 240 provided by ANC 220 (as shown in FIG. 2 )
- second signal 334 may correspond to non-desired source signals 234 provided by ABM 216 (as shown in FIG. 2 ).
- back-end SCS component 300 includes stationary noise estimation component 304 , signal-to-stationary noise ratio (SSNR) estimation component 306 , SSNR feature extraction component 308 , SSNR feature statistical modeling component 310 , spatial feature extraction component 312 , spatial feature statistical modeling component 314 , signal-to-non-stationary noise ratio (SNSNR) estimation component 316 , speaker identification (SID) feature extraction component 318 , SID speaker model update component 320 , uplink (UL) correlation feature extraction component 322 , signal-to-residual echo ratio (SRER) estimation component 326 , fullband modulation feature extraction component 328 , fullband modulation statistical modeling component 330 , multi-noise source gain component 332 and gain application component 346 .
- Stationary noise estimation component 304 may assist in obtaining characteristics associated with stationary noise included in first signal 340 , and therefore, may be referred to as being included in a non-spatial (or stationary noise) branch of SCS component 300 .
- Spatial feature extraction component 312 , spatial feature statistical modeling component 314 , SID feature extraction component 318 , SID speaker model update component 320 and SNSNR estimation component 316 may assist in obtaining characteristics associated with non-stationary noise included in first signal 340 , and therefore, may be referred to as being included in a spatial (or non-stationary noise) branch of SCS component 300 .
- UL correlation feature extraction component 322 , spatial feature statistical modeling component 314 and SRER estimation component 326 may assist in obtaining characteristics associated with residual echo included in first signal 340 , and therefore, may be referred to as being included in a residual echo branch of SCS component 300 .
- Stationary noise estimation component 304 may be configured to receive first signal 340 and provide a stationary noise estimate 301 (e.g., an estimate of magnitude, power, signal level, etc.) of stationary noise present in first signal 340 on a per-frame basis and/or per-frequency bin basis.
- stationary noise estimation component 304 may determine stationary noise estimate 301 by estimating statistics of an additive noise signal included in first signal 340 during non-desired source segments.
- stationary noise estimation component 304 may include functionality that is capable of classifying segments of first signal 340 as desired source segments or non-desired source segments.
- stationary noise estimation component 304 may be connected to another entity that is capable of performing such a function. Of course, numerous other methods may be used to determine stationary noise estimate 301 .
- Stationary noise estimate 301 is provided to SSNR estimation component 306 and SSNR feature extraction component 308 .
- SSNR estimation component 306 may be configured to receive first signal 340 and stationary noise estimate 301 and determine a ratio between first signal 340 and stationary noise estimate 301 to provide an SSNR estimate 303 on a per-frame basis and/or per-frequency bin basis.
- SSNR estimate 303 may be equal to a measured characteristic (e.g., magnitude, power, signal level, etc.) of first signal 340 divided by stationary noise estimate 301 .
- SSNR estimate 303 is provided to SSNR feature extraction component 308 and multi-noise source gain component 332 . As will be described below, SSNR estimate 303 may be used to determine an optimal gain 325 that is used to suppress noise from first signal 340 .
- SSNR feature extraction component 308 may be configured to extract one or more SNR feature(s) from first signal 340 based on stationary noise estimate 301 on a per-frame basis and/or per-frequency bin basis to obtain an SNR feature vector 305 .
- a preliminary (rough) estimate of the desired source power spectral density may be obtained.
- the estimate of the desired source power spectral density may be obtained through conventional methods or according to the methods described in aforementioned U.S. patent application Ser. No. 12/897,548, the entirety of which has been incorporated by reference as if fully set forth herein.
- the estimate of the SNR feature(s) is equivalent to the a priori SNR that is estimated simply as the a posteriori SNR minus one (assuming statistical independence between interfering and desired sources).
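A minimal sketch of this rough a priori SNR estimate, with flooring and conversion to the log domain as suggested by the feature-conditioning discussion in subsection IV.A.2 (the floor value and function name are illustrative assumptions):

```python
import math

def a_priori_snr(signal_power, noise_power, floor_db=-20.0):
    """Rough a priori SNR per frame/bin: a posteriori SNR minus one,
    assuming statistical independence of desired and interfering sources.
    Floored and expressed in the log domain, which better fits the
    Gaussian assumption of the back-end GMM feature modeling."""
    post_snr = signal_power / noise_power          # a posteriori SNR
    prio_snr = max(post_snr - 1.0, 10.0 ** (floor_db / 10.0))
    return 10.0 * math.log10(prio_snr)             # log-domain feature
```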
- the various SNR feature forms could include various degrees of smoothing the power across frequency prior to forming the SNR feature(s).
- SSNR feature extraction component 308 may be configured to apply preliminary single-channel noise suppression to first signal 340 .
- SSNR feature extraction component 308 may suppress single-channel noise from first signal 340 based on SSNR estimate 303 .
- SSNR feature extraction component 308 may also be configured to down-sample the preliminary noise-suppressed first signal and/or stationary noise estimate 301 to reduce the sample sizes thereof, thereby reducing computational complexity.
- SNR feature vector 305 is provided to SSNR feature statistical modeling component 310 .
- SSNR feature statistical modeling component 310 may be configured to model feature vector 305 on a per-frame basis and/or per-frequency bin basis.
- SSNR feature statistical modeling component 310 models SNR feature vector 305 using GMM modeling.
- using GMM modeling, a probability 307 that a particular frame of first signal 340 is from a desired source (e.g., speech) and/or a probability that the particular frame of first signal 340 is from a non-desired source (e.g., an interfering source, such as stationary background noise) may be determined for each frame and/or frequency bin.
- stationary noise can be separated from the desired source by exploiting the time and frequency separation of the sources.
- the restriction to stationary sources arises from the fact that the interfering component is estimated during desired source absence and then assumed stationary, and hence maintaining its power spectral density during desired source presence.
- This allows for estimation of the (stationary) interfering source power spectral density from which the SNR feature(s) can then be formed. It reflects the way traditional single channel noise suppression works, and the interfering source power spectral density can be estimated with such traditional methods.
- the (stationary) interfering source presence can then be modelled with GMM-based SNR feature vector 305 , which comprises various forms of SNRs.
- two Gaussian mixtures are used to model SNR feature vector 305 (i.e., a 2-mixture GMM), and the Gaussian mixture with the lowest (average in case of multiple SNR features) mean parameter (lowest SNR) corresponds to the interfering (stationary) source, and the Gaussian mixture with the highest (average) mean parameter corresponds to the desired source.
- with the inference in place, i.e., the association of Gaussian mixtures with sources, it is possible to calculate the probability of desired source and the probability of interfering (stationary) source in accordance with Equations 13, 14 and/or 15, as described above in subsections IV.A.5.2 and IV.A.5.3.
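The 2-mixture inference and probability calculation can be sketched as follows. This is a 1-dimensional illustration with hypothetical function names; per the inference rule above, the mixture with the higher mean is associated with the desired source:

```python
import math

def gauss(x, mu, var):
    """Evaluate a 1-D Gaussian pdf at x."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def p_desired(x, priors, means, variances):
    """Posterior probability that feature x (e.g., a log-domain SNR)
    comes from the desired source, for a 2-mixture 1-D GMM. The
    higher-mean mixture is taken as the desired source; the lower-mean
    mixture as the interfering (stationary) source."""
    ds = 0 if means[0] > means[1] else 1   # inference by mean ordering
    likes = [p * gauss(x, m, v) for p, m, v in zip(priors, means, variances)]
    return likes[ds] / sum(likes)
```

The complementary probability of the interfering source is simply one minus this value.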
- FIG. 3D shows example diagnostic plots of 1-dimensional 2-mixture GMM parameters during online parameter estimation of GMM modeling of SNR feature vector 305 , for initial segments of a signal (e.g., first signal 340 ) containing pub noise and speech. The left column corresponds to the interfering source mixture (the pub noise), and the right column corresponds to the desired source mixture (the speech).
- Plots 335 , 337 and 339 show mixture priors, means, and variances, respectively, associated with the interfering source mixture
- plots 341 , 343 and 345 show the mixture priors, means, and variances, respectively, associated with the desired source mixture.
- the SNR feature does not require multiple microphones (or channels), and it applies equally to single microphone (channel) or multi-microphone (multi-channel) applications.
- k is the frequency index
- m is the frame index
- N fft is the FFT size, e.g. 256.
- Equation 21 represents a rectangular window, but, in certain embodiments, an alternate window may be used instead.
- the SNR forms the single feature (i.e., SNR feature vector 305 ) that is modelled independently for every frequency index k in order to estimate the probability of desired source, P DS,m (k) (i.e., probability 307 ), versus the probability of interfering (stationary) source, P IS, m (k), for every frequency index.
- plot 347 represents a time domain input waveform representing first signal 340 (which includes both speech and car noise)
- plot 349 represents a time-frequency plot of first signal 340
- plot 351 represents SNR feature vector 305 , which is being modelled using GMM modeling
- plot 353 represents a probability of desired source (i.e., probability 307 ) with respect to car noise obtained using GMM modeling.
- first signal 340 is down-sampled by SSNR feature extraction component 308
- SSNR feature statistical modeling component 310 up-samples probability 307 .
- Probability 307 is provided to multi-noise source gain component 332 .
- probability 307 may be used to determine optimal gain 325 , which is used to suppress stationary noise (and/or other types of interfering sources) present in first signal 340 on a per-frame basis and/or per-frequency bin basis.
- Spatial feature extraction component 312 may be configured to extract spatial feature(s) from first signal 340 and second signal 334 on a per-frame basis and/or per-frequency bin basis.
- the feature(s) may be a ratio 309 between first signal 340 and second signal 334 .
- ratio 309 corresponds to a ratio between enhanced source signal 240 provided by ANC 220 and non-desired source signals 234 provided by ABM 216 .
- ratio 309 separates non-stationary interfering sources from a desired source. Hence, it is used for non-stationary noise suppression.
- Ratio 309 can be calculated on a frequency bin or range basis in order to provide frequency resolution, and smoothing to a varying degree can be carried out in order to achieve a multi-dimensional feature vector that captures both local strong events as well as broader weaker events. Ratio 309 is greater for desired source presence and smaller for interfering source presence.
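The per-bin, multi-width smoothing of ratio 309 can be sketched as follows. This is a NumPy illustration under stated assumptions: magnitude spectra of the ANC and ABM outputs are available per frame, and the window widths and power-domain smoothing are illustrative choices, not the patent's exact values:

```python
import numpy as np

def anc2abm_ratio(anc_mag, abm_mag, widths=(1, 9), eps=1e-10):
    """Log-domain ANC-to-blocking-matrix ratio per frequency bin,
    smoothed across frequency at several widths so the resulting
    feature vector captures both local strong events (narrow window)
    and broader weaker events (wide window).

    anc_mag, abm_mag: magnitude spectra for one frame.
    Returns one smoothed ratio track per width: (len(widths), n_bins).
    """
    tracks = []
    for w in widths:
        kernel = np.ones(w) / w
        num = np.convolve(anc_mag ** 2, kernel, mode="same")
        den = np.convolve(abm_mag ** 2, kernel, mode="same")
        tracks.append(10.0 * np.log10((num + eps) / (den + eps)))
    return np.stack(tracks)
```

High values indicate desired source presence; low values indicate interfering source presence, consistent with the behavior of ratio 309 described above.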
- ratio 309 may require at least two microphones and the presence of a generalized sidelobe canceller (GSC)-like front-end spatial processing stage.
- GSC generalized sidelobe canceller
- a similar “spatial” ratio can be formed with the use of many other front-ends, and in some applications a front-end is not even necessary.
- An example of that is the case where the position of the desired source relative to the two microphones provides a significant level (possibly frequency dependent) difference on the two microphones while all interfering sources can be assumed to be far-field, and hence provide approximately similar level on the two microphones.
- for example, for a communication device (e.g., communication device 100 , as shown in FIG. 1 ) held such that the desired source (e.g., speech of the user) is significantly closer to one microphone, ratio 309 can be formed directly from the two microphone signals.
- before obtaining ratio 309 , spatial feature extraction component 312 applies preliminary single-channel noise suppression to first signal 340 .
- spatial feature extraction component 312 may suppress single-channel noise present in first signal 340 based on SSNR estimate 303 . This suppression should not be too strong, as strong suppression would render this modeling very similar to the stationary SNR modeling described above in subsection IV.B.1. However, a mild suppression will aid the convergence of the parameters of the online GMM modeling (as described below), preventing divergence of the modeling by guiding it in a proper direction.
- An example value of preliminary target suppression is 6 dB.
- Spatial feature extraction component 312 may also be configured to down-sample the preliminary noise-suppressed first signal and/or second signal 334 to reduce the sample sizes thereof, thereby reducing computational complexity.
- Ratio 309 is provided to spatial feature statistical modeling component 314 .
- the Anc2AbmR (i.e., ratio 309 ) associated with a frequency index is then calculated as:
- Equation 24 represents a rectangular window, but similar to subsection IV.B.1, in certain embodiments, an alternate window may be used instead.
- the Anc2AbmR may form the single feature that is modelled independently for every frequency index k in order to estimate the probability of desired source, P DS,m (k), versus the probability of interfering (spatial) source, P IS,m (k), for every frequency index (as described below with reference to spatial feature statistical modeling component 314 ).
- SID feature extraction component 318 may be configured to extract features from first signal 340 and provide a classification 311 (e.g., a soft or hard classification) of first signal 340 based on the extracted features on a per-frame basis and/or per-frequency bin basis.
- Such features may include, for example, reflection coefficients (RCs), log-area ratios (LARs), arcsin of RCs, line spectrum pair (LSP) frequencies, and the linear prediction (LP) cepstrum.
- Classification 311 may indicate whether a particular frame and/or frequency bin of first signal 340 is associated with a target speaker.
- classification 311 may be a probability as to whether a particular frame and/or frequency bin is associated with a target speaker or a non-desired source (i.e., the supplemental full-band information described above in subsection IV.A.5.3), where the higher the probability, the more likely that the particular frame and/or frequency bin is associated with a target speaker.
- Back-end SCS component 300 may include a speaker identification component (or may be coupled to a speaker identification component) that assists in determining whether a particular frame and/or frequency bin of first signal 340 is associated with a target speaker.
- the speaker identification component may include GMM-based speaker models.
- the feature(s) extracted from first signal 340 may be compared to these speaker models to determine classification 311 . Further details concerning SID-assisted audio processing algorithm(s) may be found in commonly-owned, co-pending U.S. patent application Ser. No. 13/965,661, entitled “Speaker-Identification-Assisted Speech Processing Systems and Methods” and filed on Aug. 13, 2013, U.S. patent application Ser. No. 14/041,464, entitled “Speaker-Identification-Assisted Downlink Speech Processing Systems and Methods” and filed on Sep. 30, 2013, and U.S. patent application Ser. No. 14/069,124, entitled “Speaker-Identification-Assisted Uplink Speech Processing Systems and Methods” and filed on Oct. 31, 2013, the entireties of which are incorporated by reference as if fully set forth herein. Classification 311 is provided to spatial feature statistical modeling component 314 .
- Spatial feature statistical modeling component 314 may be configured to determine and provide a probability 313 that a particular feature of a particular frame and/or frequency bin of first signal 340 is from a desired source and a probability 315 that a particular feature of a particular frame and/or frequency bin of first signal 340 is from a non-desired source (e.g., non-stationary noise).
- Probabilities 313 and 315 may be based on ratio 309 .
- Probability 313 and/or probability 315 may be also be based on classification 311 .
- Ratio 309 may be modelled using a GMM.
- the Gaussian distributions of the GMM can be associated with interfering non-stationary sources and the desired source according to the GMM mean parameters based on inference, thereby allowing calculation of probability 315 and probability 313 from ratio 309 and the parameters of respective GMMs associated with interfering non-stationary sources and the desired source.
- At least one mixture of the GMM may correspond to a distribution of a particular type of a non-desired source (e.g., non-stationary noise), and at least one other mixture of the GMM may correspond to a distribution of a desired source. It is noted that the GMM may also include other mixtures that correspond to other types of interfering, non-desired sources.
- spatial feature statistical modeling component 314 may monitor the mean associated with each mixture.
- the mixture having a relatively higher mean equates to the mixture corresponding to a desired source, and the mixture having a relatively lower mean equates to the mixture corresponding to a non-desired source.
- FIG. 3F shows example diagnostic plots of 1-dimensional 2-mixture GMM parameters during online parameter estimation of the GMM modeling of the Anc2AbmR (i.e., ratio 309 ), for initial segments of a signal (e.g., first signal 340 ) containing pub noise and a desired source. The left column corresponds to the interfering source mixture (the pub noise), and the right column corresponds to the desired source mixture.
- Plots 355 , 357 and 359 show mixture priors, means, and variances, respectively, associated with the interfering source mixture
- plots 361 , 363 and 365 show the mixture priors, means, and variances, respectively, associated with the desired source mixture.
- probabilities 313 and 315 may be based on a ratio between the mixture associated with the desired source and the mixture associated with the non-desired source. For example, probability 313 may indicate that a particular feature of a particular frame and/or frequency bin of first signal 340 is from a desired source if the ratio is relatively high, and probability 315 may indicate that a particular feature of a particular frame and/or frequency bin of first signal 340 is from a non-desired source if the ratio is relatively low.
- the ratios may be determined for a plurality of ranges for smoothing across frequency. For example, a wideband smoothed ratio and a narrowband smoothed ratio may be determined.
- probabilities 313 and 315 are based on a combination of these ratios. Probabilities 313 and 315 are provided to SNSNR estimation component 316 .
- An example of a waveform of an input signal (e.g., first signal 340 ) that includes speech and non-stationary noise (e.g., babble noise), time-frequency plots of the input signal, the Anc2AbmR feature (i.e., ratio 309 ), and the resulting P DS,m (k) (i.e., probability 313 ) for speech in an environment that includes non-stationary noise, are shown in FIG. 3G .
- This is a type of interfering source where SNR feature vector 305 of subsection IV.B.1 traditionally may not provide good separation.
- plot 367 represents a time domain input waveform representing first signal 340
- plot 369 represents a time-frequency plot of first signal 340
- plot 371 represents an output of ABM 216 (i.e., second signal 334 )
- plot 373 represents the Anc2AbmR (i.e., ratio 309 ) being modelled using GMM modeling
- plot 375 represents a probability of desired source (i.e., probability 313 ) with respect to babble noise obtained using GMM modeling.
- given the Anc2AbmR feature (i.e., ratio 309 ), it could appear that SNR feature vector 305 of subsection IV.B.1 is obsolete.
- however, in some cases the modeling of the Anc2AbmR is ambiguous. This can be due to slower convergence of the Anc2AbmR modeling or to the microphone signals of the acoustic scene not providing sufficient spatial separation.
- the SNR feature vector and Anc2AbmR features complement each other, although there is also some overlap.
- Spatial feature statistical modeling component 314 may also be configured to determine and provide a measure of spatial ambiguity 331 on a per-frame basis and/or a per-frequency bin basis. Measure of spatial ambiguity 331 may be indicative of how well spatial feature statistical modeling component 314 is able to distinguish a desired source from non-stationary noise in the acoustic scene. Measure of spatial ambiguity 331 may be determined based on the means for each of the mixtures of the GMM modelled by spatial feature statistical modeling component 314 .
- the value of measure of spatial ambiguity 331 may be set such that it is indicative of spatial feature statistical modeling component 314 being in a spatially ambiguous state.
- the value of measure of spatial ambiguity 331 may be set such that it is indicative of spatial feature statistical modeling component 314 being in a spatially unambiguous state, i.e., in a spatially confident state.
- measure of spatial ambiguity 331 is determined in accordance with Equation 25, which is shown below:
- d corresponds to the distance between the mean of the mixture associated with the desired source and the mean of the mixture associated with the non-desired source, and α and β are user-defined constants which control the distance-to-spatial-ambiguity mapping.
- non-stationary noise suppression may be soft-disabled.
- spatial feature statistical modeling component 314 in response to determining that spatial feature statistical modeling component 314 is in a spatially ambiguous state, spatial feature statistical modeling component 314 provides a soft-disable output 342 , which is provided to MMNR component 114 (as shown in FIG. 2 ).
- Soft-disable output 342 may cause one or more components and/or sub-components of MMNR component 114 to be disabled.
- soft-disable output 342 may correspond to soft-disable control signal 242 , as shown in FIG. 2 .
- Spatial feature statistical modeling component 314 may further provide probability 313 to SID speaker model update component 320 .
- SID speaker model update component 320 may be configured to update the GMM-based speaker model(s) based on probability 313 and provide updated GMM-based speaker model(s) 333 to SID feature extraction component 318 .
- SID feature extraction component 318 may compare feature(s) extracted from subsequent frame(s) of first signal 340 to updated GMM-based speaker model(s) 333 to provide classification 311 for the subsequent frame(s).
- SID speaker model update component 320 updates the GMM-based speaker model(s) based on probability 313 when back-end SCS component 300 operates in handset mode.
- updates to the GMM-based speaker model(s) may be controlled by information available from the acoustic scene analysis in the front end.
- back-end SCS component 300 receives a mode enable signal 336 from a mode detector (e.g., automatic mode detector 222 , as shown in FIG. 2 ) that causes back-end SCS component 300 to switch between single-user and conference speakerphone modes.
- mode enable signal 336 may correspond to mode enable signal 236 , as shown in FIG. 2 .
- SNSNR estimate 317 is provided to multi-noise source gain component 332 . As will be described below, SNSNR estimate 317 may be used to determine optimal gain 325 , which is used to suppress non-stationary noise (and/or other types of interfering sources) present in first signal 340 .
- Residual echo suppression is used to suppress any acoustic echo remaining after linear acoustic echo cancellation. This need is typically greatest when a device is operated in speakerphone mode, i.e., when the device is not handheld in a typical telephony handset use mode of operation.
- the far-end signal (also referred to as the downlink signal)
- the far-end signal is played back on a loudspeaker (e.g., loudspeaker 108 , as shown in FIG. 1 ) on a device (e.g., communication device 100 , as shown in FIG. 1 ) at a level that, seen from the perspective of the microphone(s) (e.g., microphones 106 1-N , as shown in FIG. 1 ), is significant.
- the near-end signal (also referred to as the uplink signal)
- this is carried out by means of estimating the ERL (Echo Return Loss) of the acoustic channel from the downlink to the uplink, and the ERLE (Echo Return Loss Enhancement) of the linear acoustic echo canceller.
- non-linear residual echo is identified by measuring the normalized correlation in the uplink signal after linear echo cancellation at the pitch period of the downlink signal. Moreover, this can be measured as a function of frequency in order to exploit spectral separation between the residual echo and the desired source.
- the normalized correlation of the uplink signal at the pitch period of the downlink signal may be able to identify residual echo components that are harmonics of the downlink pitch periods, and may not be able to identify any unvoiced residual echo components.
- This is, however, acceptable, as non-linear residual echo typically consists of non-linear components triggered by the high-energy components of the downlink signal (i.e., voiced speech).
- strong residual echo is often a result of strong non-linearities being excited by voiced components, and typically manifests itself as pitch harmonics of the downlink signal being repeated up through the spectrum, producing pitch harmonics where the downlink signal had no or only weak harmonics.
- UL correlation feature extraction component 322 may be configured to determine an uplink correlation at a downlink pitch period. For example, UL correlation feature extraction component 322 may determine a measure of correlation 319 in an FDAEC output signal (e.g., FDAEC output signal 224 , as shown in FIG. 2 ) at the pitch period of a downlink signal (e.g., downlink signal 202 , as shown in FIG. 2 ) as a function of frequency, where a relatively high correlation is an indication of residual echo presence in first signal 340 and a relatively low correlation is an indication of no residual echo presence in first signal 340 .
- Equation 30 represents a rectangular window, but, in certain embodiments, any alternate suitable window can be used.
- averaging over a window trades off against the frequency resolution of C N,UL (k, L DL ) (i.e., measure of correlation 319 ).
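As an illustration (not the patented implementation), the normalized uplink correlation at the downlink pitch period can be sketched as follows; clipping negative values to zero is one plausible way to realize the bounded [0, 1] measure mentioned later in the text.

```python
import numpy as np

def normalized_correlation_at_lag(uplink_frame, lag):
    """Normalized correlation of an uplink frame with itself at a
    given lag (e.g., the downlink pitch period). A value near 1
    suggests periodicity locked to the downlink pitch (residual
    echo); a value near 0 suggests none."""
    x = np.asarray(uplink_frame, dtype=float)
    if lag <= 0 or lag >= len(x):
        return 0.0
    a, b = x[lag:], x[:-lag]
    denom = np.sqrt(np.dot(a, a) * np.dot(b, b))
    if denom == 0.0:
        return 0.0
    # Clip negative correlation to 0 so the measure lies in [0, 1].
    return float(max(np.dot(a, b) / denom, 0.0))
```

A frame that is exactly periodic at the lag yields a value of 1; an uncorrelated frame yields a value near 0.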
- a generalized version of the previously described normalized uplink correlation at the downlink pitch period can be derived to exploit information contained in the autocorrelation function of the uplink signal, at multiples of the downlink pitch period. This measure can be expressed as:
- w(n) represents some smoothing window, which can be used to control the weighting of various downlink pitch period multiples.
- d(n) is a series of delta functions at pitch period multiples, as defined below:
- M denotes the number of pitch multiples contained within the sampled autocorrelation function and is dependent on L DL and N fft .
- the generalized measure can be expressed in terms of a convolution of functions:
- the generalized measure can be expressed in the frequency domain as:
- G(k), W(k), and D(k) are the Fourier transforms of g(n), w(n), and d(n), respectively. Whereas W(k) depends on the unspecified windowing function w(n), D(k) can be explicitly expressed by applying the Fourier transform to d(n), as shown below:
- The approximation in Equation 37 is a result of the fact that downlink pitch periods are generally not perfect factors of the FFT length. However, the expression serves as a relatively close approximation, particularly for large M, and the approximation is exact when the downlink pitch period is a factor of the FFT length.
- the generalized normalized uplink correlation at the downlink pitch period is obtained as the summed element-wise product of the uplink spectrum and a masking function.
- the masking function is constructed as the convolution of a series of deltas located at multiples of the fundamental frequency of the downlink signal, and a smoothing window which spreads the effect of the masking function beyond exact multiples of the fundamental frequency.
- This relationship can be observed in FIG. 3H , where example masking functions are plotted for three different windowing functions, w(n). As further shown in FIG. 3H , the downlink pitch period L DL is 10, and the FFT length N FFT is 160.
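The masking-function construction just described can be sketched directly: place deltas at multiples of the downlink fundamental-frequency bin and smear each with a small smoothing window. The window values below are arbitrary illustrations, not the windows plotted in FIG. 3H.

```python
import numpy as np

def masking_function(n_fft=160, pitch_period=10, win=(0.25, 0.5, 1.0, 0.5, 0.25)):
    """Frequency-domain mask: deltas at multiples of the fundamental
    frequency of the downlink signal (n_fft / pitch_period bins apart),
    each spread by a smoothing window centered on the harmonic bin."""
    win = np.asarray(win, dtype=float)
    half = len(win) // 2
    f0_bin = n_fft / pitch_period        # fundamental frequency in bins
    mask = np.zeros(n_fft)
    k = 0.0
    while k < n_fft:                     # deltas at harmonic bins
        center = int(round(k)) % n_fft
        for j, w in enumerate(win):      # spread each delta with the window
            mask[(center + j - half) % n_fft] += w
        k += f0_bin
    return mask / mask.max()
```

With the FIG. 3H parameters (L DL = 10, N FFT = 160) the harmonics land 16 bins apart, with the window spreading the mask to neighboring bins.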
- UL correlation feature extraction component 322 may receive residual echo information 338 from the front end that includes measure of correlation 319 , in which case UL correlation feature extraction component 322 extracts measure of correlation 319 from residual echo information 338 .
- residual echo information 338 may include the FDAEC output signal and the downlink signal (or the pitch period thereof), and UL correlation feature extraction component 322 determines the measure of correlation in the FDAEC output signal at the pitch period of the downlink signal as a function of frequency.
- the correlation at the downlink pitch period of the FDAEC output signal may be calculated as a normalized correlation of the FDAEC output signal at a lag corresponding to the downlink pitch period, providing a measure of correlation that is bounded between 0 and 1.
- UL correlation feature extraction component 322 provides measure of correlation 319 to spatial feature statistical modeling component 314 .
- residual echo information 338 corresponds to residual echo information 238 .
- Spatial feature statistical modeling component 314 may be configured to determine and provide a probability 321 that a particular frame is from a non-desired source (e.g., residual echo) on a per-frame basis and/or per-frequency bin basis based on measure of correlation 319 .
- the GMM being modelled by spatial feature statistical modeling component 314 may also include a mixture that corresponds to residual echo. The mixture may be adapted based on measure of correlation 319 .
- Probability 321 may be relatively higher if measure of correlation 319 indicates that the FDAEC output signal has high correlation at the pitch period of the downlink signal, and probability 321 may be relatively lower if measure of correlation 319 indicates that the FDAEC output signal has low correlation at the pitch period of the downlink signal.
- Probability 321 is provided to SRER estimation component 326 .
- SRER estimation component 326 may be configured to determine an SRER estimate 323 based on probability 321 and 313 on a per-frame basis and/or per-frequency bin basis.
- SRER estimate 323 may be determined in accordance with Equation 26 provided above, where x IS corresponds to non-stationary noise or residual echo included in x.
- SRER estimate 323 is provided to multi-noise source gain component 332 .
- SRER estimate 323 may be used to determine optimal gain 325 , which is used to suppress residual echo (and/or other types of interfering sources) present in first signal 340 .
- The SRER estimate (based on downlink and traditional ERL and ERLE estimates, and not on measure of correlation 319 as described above) and measure of correlation 319 are complementary.
- the modeling can be carried out on a frequency basis in order to exploit frequency separation between desired source and residual echo.
- a power or magnitude spectrum ratio feature is formed between a microphone far from the loudspeaker and the microphone close to the loudspeaker. This naturally occurs on a cellular handset in speakerphone mode where the loudspeaker is at the bottom of the phone, one microphone is at the bottom of the phone, and a second microphone is at the top of the phone.
- the ratio can be formed down-stream of acoustic echo cancellation so that only the presence of residual echo is captured by the feature.
- For example, the ratio may be formed between the output of ANC 220 (i.e., first signal 340 ) and the output of ABM 216 (i.e., second signal 334 ).
- the power or magnitude spectrum ratio may be modeled by using an additional mixture in the GMM modeling.
- the desired source will generally have a relatively high Anc2AbmR
- acoustic environmental noise will generally have relatively lower Anc2AbmR
- residual echo will have a much lower Anc2AbmR compared to the acoustic environment noise. It may be suitable to use three mixtures in each frequency band/bin: one for desired source, one for non-stationary/spatial noise, one for residual echo.
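The three-mixture idea can be sketched as follows. The ratio feature (here called Anc2AbmR, matching the text) and per-mixture posteriors are illustrative; the mixture means, variances, and weights below are placeholders, whereas in the described system they would be adapted online.

```python
import numpy as np

def anc2abm_ratio_db(anc_spec, abm_spec, eps=1e-12):
    """Per-bin log power ratio between the ANC output (desired-source
    path) and the ABM output (noise-reference path)."""
    return 10.0 * np.log10((np.abs(anc_spec) ** 2 + eps)
                           / (np.abs(abm_spec) ** 2 + eps))

def mixture_posteriors(x, means, variances, weights):
    """Posterior probability of each 1-D Gaussian mixture given an
    observation x; mixtures ordered (desired source, non-stationary
    noise, residual echo), matching the expected ordering of the
    Anc2AbmR feature from high to low."""
    x = np.atleast_1d(np.asarray(x, dtype=float))[..., None]
    m = np.asarray(means, dtype=float)
    v = np.asarray(variances, dtype=float)
    w = np.asarray(weights, dtype=float)
    lik = w * np.exp(-0.5 * (x - m) ** 2 / v) / np.sqrt(2.0 * np.pi * v)
    return lik / lik.sum(axis=-1, keepdims=True)
```

A high ratio observation is attributed mostly to the desired-source mixture, and a very low ratio mostly to the residual-echo mixture.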
- If each microphone path has acoustic echo cancellation (AEC) prior to the spatial front-end with ANC 220 and ABM 214 , then this particular modeling would indeed capture residual echo (assuming AEC provides similar ERLE on the two microphone paths).
- Multi-noise source gain component 332 may be configured to determine an optimal gain 325 that is used to suppress multiple types of interfering sources (e.g., stationary noise, non-stationary noise, residual echo, etc.) present in first signal 340 on a per-frame basis and/or per-frequency bin basis.
- An observed signal that includes multiple types of interfering sources may be represented in accordance with Equation 38, Y = X + Σ k N k , where:
- Y corresponds to the observed signal (e.g., first signal 340 )
- X corresponds to the underlying clean speech in observed signal Y
- N k corresponds to the kth interfering source (e.g., stationary noise, non-stationary noise, or residual echo).
- a global cost function may be formulated that minimizes the distortion of the desired source and that also achieves satisfactory noise suppression.
- Such a global cost function may be a composite of more than one branch cost function.
- the global cost function may be based on a cost function for minimizing the distortion of the desired source and a respective branch cost function for minimizing the distortion of each of the k interfering sources (i.e., the unnaturalness of the residual of an interfering source, as it is referred to in the aforementioned U.S. patent application Ser. No. 12/897,548, the entirety of which has been incorporated by reference as if fully set forth herein).
- These different cost functions may be further weighted to obtain a degree of balance between distortion of the desired source and the distortion of the k interfering sources.
- a global cost function is shown in Equation 39:
- the optimal gain, G, may be determined by taking the derivative of the global cost function with respect to the optimal gain and setting the derivative to zero. This is shown in Equation 40:
- In Equation 40, the second moments (i.e., variances) of each of the k interfering noise sources (i.e., σ N k 2 ) and of the desired source (i.e., σ X 2 ) that naturally occur from the expectations used in Equation 39 are introduced.
- the second moment of the desired source divided by the second moment of a particular kth interfering noise source is equivalent to the SNR for that particular kth interfering noise source. This is shown in Equation 41:
- ξ k corresponds to the SNR for the kth interfering noise source.
- Optimal gain, G, may be determined by simplifying Equation 41 to Equation 42, as shown below:
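The equations themselves did not survive extraction; the following is a hedged reconstruction consistent with the surrounding description (composite cost, derivative set to zero, SNR substitution, and the resulting gain that reduces to a single-source rule when K = 1). The exact published forms may differ in detail.

```latex
% Assumed composite cost (Eq. 39): desired-source distortion plus
% weighted residual-distortion terms for the k interfering sources.
J(G) = \mathbb{E}\!\left[\big((G-1)X\big)^2\right]
     + \sum_{k} \alpha_k\, \mathbb{E}\!\left[\big((G-H_k)N_k\big)^2\right]

% Setting dJ/dG = 0 introduces the second moments (Eq. 40):
(G-1)\,\sigma_X^2 + \sum_{k} \alpha_k (G-H_k)\,\sigma_{N_k}^2 = 0

% Dividing by \sigma_X^2 and substituting \xi_k = \sigma_X^2/\sigma_{N_k}^2 (Eq. 41):
(G-1) + \sum_{k} \frac{\alpha_k}{\xi_k}\,(G-H_k) = 0

% Solving for G (Eq. 42):
G = \frac{1 + \sum_k \alpha_k H_k / \xi_k}{1 + \sum_k \alpha_k / \xi_k}

% With a single interfering source (K = 1) this degenerates to the
% single-source rule (Eq. 43):
G = \frac{\xi + \alpha H}{\xi + \alpha}
```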
- Equation 43:
- Equation 43 represents the gain rule derived in aforementioned U.S. patent application Ser. No. 12/897,548, the entirety of which has been incorporated by reference as if fully set forth herein.
- the generalized multi-source gain rule degenerates to the gain rule derived in aforementioned U.S. patent application Ser. No. 12/897,548 in the case of a single interfering source.
- Multi-noise source gain component 332 may be configured to determine optimal gain 325 , which is used to suppress multiple types of interfering sources from first signal 340 , in accordance with Equation 42.
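A per-bin implementation of a multi-source gain rule of this kind might look like the following sketch, assuming Equation 42 has the weighted-SNR form implied by the derivation just described (derivative set to zero, SNR substitution); the exact published expression may differ.

```python
import numpy as np

def multi_source_gain(xi, H, alpha):
    """Suppression gain per frequency bin for K interfering sources.

    xi:    array of shape (K, bins) -- SNR estimate per source per bin
    H:     length-K target suppression levels per source
    alpha: length-K intra-branch tradeoff weights
    """
    xi = np.asarray(xi, dtype=float)
    H = np.asarray(H, dtype=float).reshape(-1, 1)
    alpha = np.asarray(alpha, dtype=float).reshape(-1, 1)
    num = 1.0 + np.sum(alpha * H / xi, axis=0)
    den = 1.0 + np.sum(alpha / xi, axis=0)
    return num / den
```

At very high SNR the gain approaches unity (no suppression); at very low SNR with a single interfering source it approaches the target suppression H.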
- SSNR estimation component 306 may provide SSNR estimate 303
- SNSNR estimation component 316 may provide SNSNR estimate 317
- SRER estimation component 326 may provide SRER estimate 323 .
- Each of these estimates may correspond to an SNR (i.e., ξ k ) for a kth interfering noise source.
- each of these estimates may be provided on a per-frame basis and/or per-frequency bin basis.
- the value of the target suppression parameter H for each of the k interfering noise sources comprises a fixed aspect of back-end SCS component 300 that is determined during a design or tuning phase associated with that component.
- the value of the target suppression parameter H for each of the k interfering noise sources may be determined in response to some form of user input (e.g., responsive to user control of settings of a device that includes back-end SCS component 300 ).
- the value of the target suppression parameter H for each of the k interfering noise sources may be adaptively determined based at least in part on characteristics of first signal 340 .
- the values for each of the target suppression parameter(s) H k may be constant across all frequencies, or alternatively, the values of the target suppression parameter(s) H k may vary per frequency bin.
- the value for each intra-branch tradeoff α for a particular kth interfering noise source may be based on a probability that a particular frame of first signal 340 is from a desired source (e.g., speech) with respect to the particular interfering noise source.
- Examples include the intra-branch tradeoff associated with the stationary noise branch (e.g., α 1 ), the intra-branch tradeoff associated with the non-stationary noise branch (e.g., α 2 ), and the intra-branch tradeoff associated with the residual echo branch (e.g., α 3 ).
- the value of the intra-branch tradeoff parameter α associated with each of the k interfering noise sources comprises a fixed aspect of back-end SCS component 300 that is determined during a design or tuning phase associated with that component.
- the value of the intra-branch tradeoff parameter α associated with each of the k interfering noise sources may be determined in response to some form of user input (e.g., responsive to user control of settings of a device that includes back-end SCS component 300 ).
- the value of the intra-branch tradeoff parameter α associated with each of the k interfering noise sources is adaptively determined.
- the value of α associated with a particular kth interfering noise source may be adaptively determined based at least in part on the probability that a particular frame and/or frequency bin of first signal 340 is from a desired source with respect to the particular kth interfering noise source. For instance, if the probability that a particular frame and/or frequency bin of first signal 340 is a desired source with respect to a particular kth interfering noise source is high, the value of α k may be set such that an increased emphasis is placed on minimizing the distortion of the desired source.
- Otherwise, the value of α k may be set such that an increased emphasis is placed on minimizing the distortion of the residual kth interfering noise source.
- each intra-branch tradeoff, α, may be determined in accordance with Equation 44, which is shown below:
- α N corresponds to a tradeoff intended for a particular interfering noise source included in first signal 340
- α S corresponds to a tradeoff intended for a desired source included in first signal 340
- P DS corresponds to a probability that a particular frame and/or frequency bin of first signal 340 is from a desired source with respect to a particular interfering noise source (e.g., probability 307 , probability 313 , or probability 321 ).
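Equation 44 is garbled in this excerpt; one plausible form, interpolating between a desired-source tradeoff α S and a noise tradeoff α N by the desired-source probability P DS, is sketched below. The interpolation itself is a guess, not taken from the text.

```python
def intra_branch_tradeoff(p_ds, alpha_s, alpha_n):
    """Blend the tradeoff intended for the desired source (alpha_s)
    and the tradeoff intended for the interfering noise source
    (alpha_n) by the probability p_ds that the frame/bin is from
    the desired source. Guessed form of Equation 44."""
    return p_ds * alpha_s + (1.0 - p_ds) * alpha_n
```

High p_ds pushes the tradeoff toward α S (emphasis on low desired-source distortion); low p_ds pushes it toward α N (emphasis on suppressing the interferer's residual).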
- the value of α may be adaptively determined based on modulation information associated with first signal 340 .
- fullband modulation feature extraction component 328 may extract features 327 of an energy contour associated with first signal 340 over time. Features 327 are provided to fullband modulation statistical modeling component 330 .
- Fullband modulation statistical modeling component 330 may be configured to model features 327 on a per-frame basis and/or per-frequency bin basis.
- Fullband modulation statistical modeling component 330 models features 327 using GMM modeling.
- Using GMM modeling, a probability 329 that a particular frame and/or frequency bin of first signal 340 is from a desired source (e.g., speech) may be determined. For example, it has been observed that an energy contour associated with a signal that changes relatively fast over time indicates that the signal includes a desired source, whereas an energy contour that changes relatively slowly over time indicates that the signal includes an interfering source.
- probability 329 may be relatively high, thereby causing the value of α k to be set such that an increased emphasis is placed on minimizing the distortion of the desired source during frames including the desired source.
- probability 329 may be relatively low, thereby causing the value of α k to be set such that an increased emphasis is placed on minimizing the distortion of the residual kth interfering noise signal.
- Still other adaptive schemes for setting the value of α k may be used.
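A crude stand-in for the modulation feature described above is the frame-to-frame change of the fullband log-energy contour; the frame sizes and the dB metric below are illustrative choices, not details from the text.

```python
import numpy as np

def energy_contour_modulation(signal, frame_len=160, hop=80, eps=1e-12):
    """Mean absolute frame-to-frame change (dB) of the fullband
    energy contour. Fast-changing contours suggest speech (desired
    source); slow-changing contours suggest a steadier interferer."""
    x = np.asarray(signal, dtype=float)
    n_frames = 1 + (len(x) - frame_len) // hop
    energies = np.array([
        np.dot(x[i * hop:i * hop + frame_len],
               x[i * hop:i * hop + frame_len])
        for i in range(n_frames)])
    contour_db = 10.0 * np.log10(energies + eps)
    return float(np.mean(np.abs(np.diff(contour_db))))
```

A bursty (speech-like) signal yields a much larger value than a steady tone or stationary noise.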
- the value of inter-branch tradeoff parameter, β, for each of the k interfering noise sources may be based on measure of spatial ambiguity 331 .
- If measure of spatial ambiguity 331 is indicative of spatial feature statistical modeling component 314 being in a spatially ambiguous state, the value of β associated with the non-stationary noise branch (e.g., β 2 ) may be decreased, and the values of β associated with the stationary noise branch and the residual echo branch (e.g., β 1 and β 3 ) may be adjusted accordingly, such that the non-stationary noise branch is effectively disabled (i.e., soft-disabled).
- the non-stationary noise branch may be re-enabled (i.e., soft-enabled) in the event that measure of spatial ambiguity 331 is indicative of spatial feature statistical modeling component 314 being in a spatially confident state by increasing the value of β 2 and adjusting the values of β 1 and β 3 (such that the sum of all the inter-branch tradeoff parameters is equal to one) accordingly.
- multi-noise source gain component 332 is configured to determine optimal gain 325 on a per-frequency bin basis, multi-noise source gain component 332 provides a respective optimal gain value for each frequency bin.
- Gain application component 346 may be configured to suppress noise (e.g., stationary noise, non-stationary noise and/or residual echo) present in first signal 340 by applying optimal gain 325 to provide noise-suppressed signal 344 .
- gain application component 346 is configured to suppress noise present in first signal 340 on a frequency bin by frequency bin basis using the respective optimal gain values obtained for each frequency bin, as described above.
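Applying the per-bin gains to a noisy STFT frame is then an element-wise multiply; the gain floor below is an assumed safeguard against musical-noise artifacts, not a detail from the text.

```python
import numpy as np

def apply_gain(noisy_frame_spectrum, gains, gain_floor=0.05):
    """Suppress interference in one STFT frame by applying the
    per-frequency-bin gains (clamped to a minimum gain floor)."""
    g = np.maximum(np.asarray(gains, dtype=float), gain_floor)
    return np.asarray(noisy_frame_spectrum) * g
```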
- back-end SCS component 300 is configured to operate in a single-user speakerphone mode of a device in which SCS component 300 is implemented or a conference speakerphone mode of such a device.
- back-end SCS component 300 receives a mode enable signal 336 from a mode detector (e.g., activity mode detector 222 , as shown in FIG. 2 ) that causes back-end SCS component 300 to switch between single-user speakerphone mode or conference speakerphone mode.
- mode enable signal 336 may correspond to mode enable signal 236 , as shown in FIG. 2 .
- mode enable signal 336 When operating in conference speakerphone mode, mode enable signal 336 may cause the non-stationary branch to be disabled (e.g., ⁇ 2 is set to a relatively low value, for example, zero). Accordingly, gain application component 346 may be configured to suppress stationary noise and/or residual echo present in first signal 340 (and not non-stationary noise). When operating in single-user speakerphone mode, mode enable signal 336 may cause the non-stationary noise suppression branch to be enabled. Accordingly, gain application component 346 may be configured to suppress stationary noise, non-stationary noise, and/or residual echo present in first signal 340 .
- FIG. 3I shows example diagnostic plots of a segment of an input signal (e.g., first signal 340 ) that includes speech (i.e., a desired source) and babble noise (i.e., an interfering source) as processed by back-end SCS component 300 .
- Plot 377 shows first signal 340 as received from a primary microphone (i.e., microphone 106 1 , as shown in FIG. 1 ).
- Plot 379 shows the SSNR estimate (i.e., SSNR estimate 303 ) and plot 381 shows the probability of desired source (i.e., probability 307 ) inferred from statistical modeling of the SNR features by SSNR feature statistical modeling component 310 .
- Plot 383 shows the estimated spatial ambiguity (e.g., measure of spatial ambiguity 331 obtained by spatial feature statistical modeling component 314 ), which is constant at unity due to the spatial diversity present in this segment.
- Plot 385 shows the posterior probability of target speaker (i.e., classification 311 provided by SID feature extraction component 318 ).
- Plot 387 shows the SNSNR estimate (i.e., SNSNR estimate 317 ) and plot 389 shows the probability of desired source (i.e., probability 313 ) inferred from statistical modeling of the Anc2AbmR feature (i.e., ratio 309 ) by spatial feature statistical modeling component 314 .
- Plot 391 illustrates the final gain (i.e., optimal gain 325 ) obtained by the multi-noise source gain component 332 .
- FIG. 3J shows an analogous plot for a segment of an input signal (e.g., first signal 340 ) that includes speech and babble noise, but captured in a spatially ambiguous configuration.
- The spatial ambiguity measure (i.e., measure of spatial ambiguity 331 ) shown in plot 383 ′ converges to zero (indicating spatial ambiguity).
- the final gain shown in plot 391 ′ follows the SSNR estimate and probability of desired source inferred from statistical modeling of the SNR feature shown in plots 379 ′ and 381 ′, respectively.
- system 300 may operate in various ways to determine a noise suppression gain used to suppress multiple types of interfering sources present in an audio signal.
- FIG. 4 depicts a flowchart 400 of an example method for determining a noise suppression gain in accordance with an example embodiment. The method of flowchart 400 will now be described with continued reference to system 300 of FIG. 3C , although the method is not limited to that implementation. Other structural and operational embodiments will be apparent to persons skilled in the relevant art(s) based on the discussion regarding flowchart 400 and system 300 .
- the method of flowchart 400 begins at step 402 , where an audio signal is received that comprises at least a desired source component and at least one interfering source type.
- an audio signal is received that comprises at least a desired source component and at least one interfering source type.
- back-end SCS component receives first signal 340 .
- the one or more interfering source types include stationary noise and non-stationary noise.
- a noise suppression gain is determined based on a statistical modeling of at least one feature associated with the audio signal using a mixture model comprising a plurality of model mixtures, each of the plurality of model mixtures being associated with one of the desired source component or an interfering source type of the at least one interfering source type.
- multi-noise source gain component 332 determines a noise suppression gain (i.e., optimal gain 325 ).
- SSNR feature statistical modeling component 310 and/or spatial feature statistical modeling component 314 may statistically model at least one feature associated with the audio signal using a mixture model (e.g., a Gaussian mixture model) that comprises a plurality of model mixtures.
- SSNR feature statistical modeling component 310 and/or spatial feature statistical modeling component 314 may associate each of the plurality of model mixtures with one of the desired source component or an interfering source type of the at least one interfering source type.
- the statistical modeling is adaptive based on at least one feature associated with each frame of the audio signal being received.
- the determination of the noise suppression gain includes determining one or more contributions that are derived from the at least one feature and determining the noise suppression gain based on the one or more contributions.
- Each of the one or more contributions may be determined in accordance with the composite cost function described above with reference to Equation 39 (i.e., each of the one or more contributions may be based on a branch cost function for minimizing the distortion of the residual of a respective kth interfering source included in the audio signal plus the cost function for minimizing the distortion of the desired source component included in the audio signal).
- the one or more contributions are weighted based on a measure of ambiguity between two or more of the plurality of model mixtures.
- the one or more contributions may be weighted based on measure of spatial ambiguity 331 .
- a respective model mixture of the plurality of model mixtures is associated with one of the desired source component or an interfering source type of the at least one interfering source type based on one or more properties (e.g., the mean, variance, etc.) of the respective model mixture and one or more expected characteristics (e.g., the SNR, Anc2AbmR, etc.) of a respective interfering source type of the at least one interfering source type.
- the noise suppression gain is determined for each of a plurality of frequency bins of the audio signal.
- optimal gain 325 is determined for each of a plurality of frequency bins of first signal 340 .
- FIG. 5 depicts a flowchart 500 of an example method for determining and applying a gain to an audio signal in accordance with an example embodiment.
- the method of flowchart 500 will now be described with continued reference to system 300 of FIG. 3C , although the method is not limited to that implementation.
- Other structural and operational embodiments will be apparent to persons skilled in the relevant art(s) based on the discussion regarding flowchart 500 and system 300 .
- the method of flowchart 500 begins at step 502 , where one or more first characteristics associated with a first type of interfering source in an audio signal are determined.
- the first type of interfering source is stationary noise.
- the first characteristic(s) include an SNR regarding the stationary noise with respect to the audio signal and a first measure of probability indicative of a probability that the audio signal is from a desired source with respect to the stationary noise.
- multi-noise source gain component 332 receives first characteristic(s) associated with stationary noise included in first signal 340 .
- the first characteristic(s) may include SSNR estimate 303 and probability 307 that indicates a probability that a particular frame of first signal 340 is from a desired source with respect to the stationary noise.
- one or more second characteristics associated with a second type of interfering source in an audio signal are determined.
- the second type of interfering source is non-stationary noise.
- the second characteristic(s) include an SNR regarding the non-stationary noise with respect to the audio signal and a second measure of probability indicative of a probability that the audio signal is from a desired source with respect to the non-stationary noise.
- multi-noise source gain component 332 receives the second characteristic(s) associated with non-stationary noise included in first signal 340 .
- the second characteristic(s) may include SNSNR estimate 317 and probability 313 that indicates a probability that a particular frame of first signal 340 is from a desired source with respect to the non-stationary noise.
- a gain based on the first characteristic(s) and the second characteristic(s) is determined.
- multi-noise source gain component 332 determines optimal gain 325 based on the first characteristic(s) and the second characteristic(s).
- multi-noise source gain component 332 determines optimal gain 325 in accordance with Equation 42 described above.
- a gain (i.e., optimal gain 325 ) is determined for each of a plurality of frequency bins of the audio signal (i.e., first signal 340 ) based on the first characteristic(s) and the second characteristic(s).
- the determined gain is applied to the audio signal.
- gain application component 346 applies optimal gain 325 to first signal 340 .
- each of the determined gains are applied to a corresponding frequency bin of the audio signal.
- the determined gain is applied in a manner that is controlled by a tradeoff parameter associated with a measure of spatial ambiguity.
- multi-noise source gain component 332 may set the value of the inter-branch tradeoff parameter(s) (i.e., β k ) based on measure of spatial ambiguity 331 .
- the determined gain is applied in a manner that is controlled by a first parameter that specifies a degree of balance between a distortion of a desired source included in the audio signal and a distortion of a residual amount of the first type of interfering source included in a noise-suppressed signal that is obtained from applying the determined gain to the audio signal, and a second parameter that specifies a degree of balance between the distortion of the desired source included in the audio signal and a distortion of a residual amount of the second type of interfering source included in the noise-suppressed signal.
- multi-noise source gain component 332 may determine the value of the first parameter (i.e., α 1 ) that specifies a degree of balance between the distortion of the desired source included in first signal 340 and the distortion of a residual amount of the first type of interfering source included in noise-suppressed signal 344 and may also determine the value of the second parameter (i.e., α 2 ) that specifies a degree of balance between the distortion of the desired source included in first signal 340 and the distortion of a residual amount of the second type of interfering source included in noise-suppressed signal 344 .
- the value of the first parameter is set based on the probability that the audio signal is from a desired source with respect to the first type of interfering source
- the value of the second parameter is set based on the probability that the audio signal includes a desired source with respect to the second type of interfering source included in the audio signal
- the value of the first parameter may be set based on probability 307 that indicates a probability that a particular frame of first signal 340 is from a desired source with respect to the first type of interfering source (e.g., stationary noise) included in first signal 340
- the value of the second parameter may be set based on probability 313 that indicates a probability that a particular frame of first signal 340 is from a desired source with respect to the second type of interfering source (e.g., non-stationary noise) included in first signal 340 .
- FIG. 6 depicts a flowchart 600 of an example method for setting a value of a first parameter and a second parameter based on a rate at which an energy contour associated with an audio signal changes in accordance with an embodiment.
- the method of flowchart 600 will now be described with continued reference to system 300 of FIG. 3C , although the method is not limited to that implementation.
- Other structural and operational embodiments will be apparent to persons skilled in the relevant art(s) based on the discussion regarding flowchart 600 and system 300 .
- the method of flowchart 600 begins at step 602 , where a rate at which an energy contour associated with the audio signal changes is determined.
- fullband modulation statistical modeling component 330 may determine the rate at which the energy contour associated with first signal 340 changes.
- Fullband modulation statistical modeling component 330 provides probability 329 , which indicates a probability that a particular frame of first signal 340 is a desired source (e.g., speech), based on the determination. For example, it has been observed that an energy contour that changes relatively quickly over time indicates that the signal includes a desired source, whereas an energy contour that changes relatively slowly over time indicates that the signal includes an interfering source.
- In response to determining that the rate at which the energy contour associated with first signal 340 changes is relatively fast, probability 329 may be relatively high. In response to determining that the rate at which the energy contour associated with first signal 340 changes is relatively slow, probability 329 may be relatively low.
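As an illustrative sketch of this step, the rate of change of a log-energy contour can be measured as the mean absolute frame-to-frame difference and then mapped to a probability; the sigmoid mapping and its constants below are assumptions for the sketch, not taken from the embodiments.

```python
import numpy as np

def energy_contour_rate(frames):
    """Rate of change of the log-energy contour across frames.
    `frames` has shape (num_frames, frame_len).  Speech tends to have
    a rapidly varying contour (syllabic modulation), while stationary
    noise has a slowly varying one."""
    log_e = np.log(np.sum(frames**2, axis=1) + 1e-12)
    return np.mean(np.abs(np.diff(log_e)))

def speech_probability(rate, pivot=0.5, slope=8.0):
    """Illustrative sigmoid mapping from contour-change rate to a
    desired-source probability; pivot and slope are assumptions."""
    return 1.0 / (1.0 + np.exp(-slope * (rate - pivot)))

rng = np.random.default_rng(0)
n = rng.standard_normal((50, 160))
steady = n                                            # noise-like: constant level
bursty = n * (1 + 5 * (np.arange(50) % 2))[:, None]   # fast level changes

assert energy_contour_rate(bursty) > energy_contour_rate(steady)
assert speech_probability(energy_contour_rate(bursty)) > \
       speech_probability(energy_contour_rate(steady))
```

A fast-changing contour maps to a probability near one, mirroring the behavior described for probability 329.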
- the value of the first parameter and the value of the second parameter are set such that an increased emphasis is placed on minimizing the distortion of the desired source included in the audio signal in response to determining that the rate at which the energy contour changes is relatively fast.
- multi-noise source gain component 332 may set the value of the first parameter (i.e., ⁇ 1 ) and the second parameter (i.e., ⁇ 2 ) such that an increased emphasis is placed on minimizing the distortion of the desired source included in the first signal 340 if probability 329 is relatively high.
- the value of the first parameter is set such that an increased emphasis is placed on minimizing the distortion of the residual amount of the first type of interfering source included in the noise-suppressed signal
- the value of the second parameter is set such that an increased emphasis is placed on minimizing the distortion of the residual amount of the second type of interfering source included in the noise-suppressed signal in response to determining that the rate at which the energy contour changes is relatively slow.
- multi-noise source gain component 332 may set the value of the first parameter (i.e., ⁇ 1 ) such that an increased emphasis is placed on minimizing the distortion of the residual amount of the first type of interfering source (e.g., stationary noise) included in noise-suppressed signal 344 and may set the value of the second parameter (i.e., ⁇ 2 ) such that an increased emphasis is placed on minimizing the distortion of the residual amount of the second type of interfering source (e.g., non-stationary noise) included in noise-suppressed signal 344 if probability 329 is relatively low.
- While FIG. 3C depicts a system for suppressing stationary noise, non-stationary noise, and residual echo from an observed audio signal (e.g., first signal 340 ), it is noted that the foregoing embodiments may also be used to suppress multiple types of non-stationary noise (e.g., wind noise, traffic noise, etc.) and/or other types of interfering sources (e.g., reverberation).
- FIG. 7 is a block diagram of a back-end SCS component 700 that is configured to suppress multiple types of non-stationary noise and/or other types of interfering sources in accordance with an embodiment.
- Back-end SCS component 700 may be an example of back-end SCS component 116 or back-end SCS component 300 . As shown in FIG. 7 , back-end SCS component 700 includes stationary noise estimation component 304 , SSNR estimation component 306 , SSNR feature extraction component 308 , SSNR feature statistical modeling component 310 , spatial feature extraction component 712 , spatial feature statistical modeling component 714 , SNSNR estimation component 716 , multi-noise source gain component 332 and gain application component 346 .
- Stationary noise estimation component 304 , SSNR estimation component 306 , SSNR feature extraction component 308 and SSNR feature statistical modeling component 310 operate in a similar manner as described above with reference to FIG. 3C to obtain SSNR estimate 303 and probability 307 , which are used by multi-noise source gain component 332 to obtain an optimal gain 325 .
- Spatial feature extraction component 712 operates in a similar manner as spatial feature extraction component 312 as described above with reference to FIG. 3C to extract features from first signal 340 and second signal 334 .
- spatial feature extraction component 712 is further configured to extract features 709 1-k associated with multiple types of non-stationary noise and/or other interfering sources.
- features 709 1 may correspond to features associated with a first type of non-stationary noise or other type of interfering source
- features 709 2 may correspond to features associated with a second type of non-stationary noise or other type of interfering source
- features 709 k may correspond to features associated with a kth type of non-stationary noise or other type of interfering source.
- reverberation and wind noise are examples of additional types of non-stationary noise and/or other types of interfering sources that may be suppressed from an observed audio signal.
- An example of extracting features associated with reverberation and wind noise is described below.
- Reverberation can be considered an additive noise, where all multi-path receptions of the desired source less the direct-path are considered interfering sources.
- the direct-path reception of the desired source by the microphone(s) e.g., microphones 106 1-N , as shown in FIG. 1
- the multi-path receptions of the desired source are generally filtered versions of the desired source that include a delay and attenuation compared to the direct-path reception, due to the longer distance the reflected sound wave travels and the sound absorption of the material of the reflecting surfaces.
- reverberation will manifest itself as a smearing or added tail to the direct-path desired source, and it will effectively reduce the modulation bandwidth compared to the source due to somewhat filling in the gaps of the time evolution of the magnitude spectrum between syllables (due to the smearing), see, for example, “The Linear Prediction Inverse Modulation Transfer Function (LP-IMTF) Filter for Spectral Enhancement, with Applications to Speaker Recognition” by Bengt J. Borgstrom and Alan McCree, ICASSP 2012, pp. 4065-4068, which is incorporated by reference herein.
- the modulation information pertinent to reverberation may be modelled (e.g., as a function of frequency).
- the modulation information is modelled by lowpass filtering the magnitude spectrum in order to estimate the reverberation magnitude spectrum and using this estimate to calculate the SRR, which can be modelled (e.g., by spatial feature statistical modeling component 714 , as described below) in a way similar to SNR feature vector 305 .
- the statistical modeling of the SRR can then provide a probability of desired source, P DS,m (k), and a probability of interfering source, P IS,m (k), with respect to reverberation.
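A minimal sketch of the described SRR estimate follows, assuming a first-order IIR lowpass as the smoothing filter and treating the smoothed magnitude trajectory as the reverberation estimate; the actual filter design used in the embodiments may differ.

```python
import numpy as np

def estimate_srr(mag_spec, alpha=0.9, eps=1e-12):
    """Estimate a per-frame, per-bin signal-to-reverberation ratio.
    `mag_spec` has shape (num_frames, num_bins).  The reverberant
    magnitude in each bin is approximated by lowpass filtering
    (first-order IIR with assumed coefficient `alpha`) the magnitude
    trajectory over time, modeling the smearing/tail that
    reverberation adds between syllables."""
    reverb = np.empty_like(mag_spec)
    reverb[0] = mag_spec[0]
    for m in range(1, mag_spec.shape[0]):
        reverb[m] = alpha * reverb[m - 1] + (1 - alpha) * mag_spec[m]
    return mag_spec / (reverb + eps)

# A bin with an abrupt onset: the smoothed (reverberation) estimate
# lags behind, so the SRR at the onset frame is high.
mags = np.ones((20, 1)) * 0.1
mags[10] = 5.0
srr = estimate_srr(mags)
assert srr[10, 0] > srr[9, 0]
```

The per-frame, per-bin SRR values produced this way could then be fed to statistical modeling in the same manner as SNR feature vector 305.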
- the SRR feature will not only capture reverberation, but also stationary noise in general, and hence there is an overlap with the modeling of SNR feature vector 305 , similar to how there is an overlap between the modeling of the Anc2AbmR feature (i.e., ratio 309 ) and SNR feature vector 305 .
- This overlap can be mitigated by applying a conventional stationary noise suppression (of a suitable degree) to first signal 340 prior to estimating the SRR feature, similar to how a preliminary stationary noise suppression is performed for first signal 340 prior to calculating the Anc2AbmR feature (i.e., ratio 309 ). Similar to the Anc2AbmR feature, the degree of a preliminary stationary noise suppression should not be exaggerated, as that will tend to impose the properties of that particular suppression algorithm onto the SRR feature, and result in the SRR feature essentially mirroring SSNR estimate 303 or stationary noise estimate 301 obtained within the stationary noise branch instead of reflecting the reverberation.
- Wind noise is typically not an acoustic noise, but a noise generated by the wind moving the microphone membrane (as opposed to a sound pressure wave moving the membrane). It propagates at a speed corresponding to the wind speed, which is typically much smaller than the speed of sound in air (approximately 340 meters/second). As a result, there is no correlation between the wind noise picked up on two microphones in typical dual-microphone configurations. Hence, an indicator of wind noise can be constructed by measuring the normalized correlation between two microphone signals. This can be extended to measuring the magnitude of the normalized coherence between the two microphone signals in the frequency domain as a function of frequency.
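The coherence-based wind-noise indicator described above can be sketched as follows; the frame length, window choice, and averaging scheme are illustrative assumptions.

```python
import numpy as np

def coherence_magnitude(x1, x2, frame_len=256):
    """Magnitude of the normalized coherence between two microphone
    signals as a function of frequency, averaged over overlapping
    frames.  Values near 1 indicate a correlated (acoustic) source;
    values near 0 indicate uncorrelated pickup such as wind noise."""
    num = np.zeros(frame_len // 2 + 1, dtype=complex)
    p1 = np.zeros(frame_len // 2 + 1)
    p2 = np.zeros(frame_len // 2 + 1)
    win = np.hanning(frame_len)
    for start in range(0, len(x1) - frame_len + 1, frame_len // 2):
        f1 = np.fft.rfft(win * x1[start:start + frame_len])
        f2 = np.fft.rfft(win * x2[start:start + frame_len])
        num += f1 * np.conj(f2)
        p1 += np.abs(f1)**2
        p2 += np.abs(f2)**2
    return np.abs(num) / np.sqrt(p1 * p2 + 1e-12)

rng = np.random.default_rng(1)
acoustic = rng.standard_normal(8192)   # common acoustic source on both mics
wind1 = rng.standard_normal(8192)      # wind noise on mic 1
wind2 = rng.standard_normal(8192)      # independent wind noise on mic 2

coh_acoustic = coherence_magnitude(acoustic, acoustic)
coh_wind = coherence_magnitude(acoustic + wind1, acoustic + wind2)
assert np.mean(coh_acoustic) > np.mean(coh_wind)
```

Because the simulated wind components are independent across the two microphones, the averaged coherence drops well below that of the purely acoustic case.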
- A probability of desired source, P DS,m (k), and a probability of interfering source, P IS,m (k), with respect to wind noise obtained by GMM modeling of the normalized correlation between two microphone signals only indicates the probability of wind noise presence on one of the two microphones. However, if the feature vector is augmented with an additional parameter corresponding to the power ratio between the two microphone signals (in the same frequency bin/range as the correlation/coherence feature), then the joint GMM modeling should be able to facilitate calculation of: (1) the probability of wind noise on a first microphone of a communication device, (2) the probability of desired source on the first microphone of the communication device, (3) the probability of wind noise on a second microphone of the communication device, and (4) the probability of desired source on the second microphone of the communication device, as a function of frequency. This information can be useful in attempts to rebuild the desired source on a microphone polluted by wind noise.
- Spatial feature statistical modeling component 714 operates in a similar manner as spatial feature statistical modeling component 314 as described above with reference to FIG. 3C to model features received thereby.
- spatial feature statistical modeling component 714 is further configured to model features associated with multiple types of non-stationary noise and/or other types of interfering sources (i.e., features 709 1-k ) to provide a probability for each of the multiple types of non-stationary noise and/or other types of interfering sources (e.g., probabilities 715 1-k ) that a particular frame of input signal 340 is from a particular type of non-stationary noise and/or other type of interfering source. For example, as shown in FIG. 7 :
- probability 715 1 corresponds to a probability that a particular frame of input signal 340 is from a first type of non-stationary noise or other type of interfering source
- probability 715 2 corresponds to a probability that a particular frame of input signal 340 is from a second type of non-stationary noise or other type of interfering source
- probability 715 k corresponds to a probability that a particular frame of input signal 340 is from a kth type of non-stationary noise or other type of interfering source.
- Spatial feature statistical modeling component 714 also provides probability (i.e., probability 313 ) that a particular frame of input signal 340 is from a desired source as described above with reference to FIG. 3C .
- SNSNR estimation component 716 may operate in a similar manner as SNSNR estimation component 316 as described above with reference to FIG. 3C to determine an SNSNR estimate for input signal 340 .
- SNSNR estimation component 716 is further configured to provide SNSNR estimates (e.g., 717 1-k ) for multiple types of non-stationary noise and/or SNR estimates for other types of interfering sources. For example, as shown in FIG. 7 :
- SNSNR estimate 717 1 corresponds to an SNSNR estimate for a first type of non-stationary noise or other type of interfering source
- SNSNR estimate 717 2 corresponds to an SNSNR estimate for a second type of non-stationary noise or other type of interfering source
- SNSNR estimate 717 k corresponds to an SNSNR estimate for a kth type of non-stationary noise or other type of interfering source.
- SNSNR estimate 717 1 may be based at least on probability 313 and probability 715 1
- SNSNR estimate 717 2 may be based at least on probability 313 and probability 715 2
- SNSNR estimate 717 k may be based at least on probability 313 and probability 715 k .
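As an illustrative sketch (not the embodiments' actual estimator), per-type noise power estimates can be adapted in proportion to the per-type probabilities, yielding one SNR estimate per interfering-source type:

```python
def update_snr_estimates(frame_power, noise_powers, p_noise, alpha=0.95):
    """Update per-type noise power estimates and return per-type SNR
    estimates.  `noise_powers` and `p_noise` hold one entry per
    interfering-source type; each noise estimate adapts faster when
    the probability that the frame contains that type is high.  The
    smoothing scheme here is an assumption of this sketch."""
    snrs = []
    for i, p in enumerate(p_noise):
        a = alpha + (1 - alpha) * (1 - p)   # effective smoothing factor
        noise_powers[i] = a * noise_powers[i] + (1 - a) * frame_power
        snrs.append(frame_power / (noise_powers[i] + 1e-12))
    return snrs

noise_powers = [1.0, 1.0]
# A loud frame judged to be mostly noise type 2: that estimate adapts more.
snrs = update_snr_estimates(10.0, noise_powers, p_noise=[0.1, 0.9])
assert noise_powers[1] > noise_powers[0]
assert snrs[0] > snrs[1]
```

The per-type SNR estimates returned this way play the role of SNSNR estimates 717 1-k feeding the multi-noise source gain computation.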
- Multi-noise source gain component 332 may be configured to obtain optimal gain 325 in accordance with Equation 42 as described above.
- Gain application component 346 may be configured to suppress stationary noise, multiple types of non-stationary noise, residual echo, and/or other types of interfering sources based on optimal gain 325 .
- FIG. 8 shows a block diagram of a generalized back-end SCS component 800 in accordance with an example embodiment.
- Back-end SCS component 800 may be an example of back-end SCS component 116 , back-end SCS component 300 or back-end SCS component 700 .
- generalized back-end SCS component 800 includes feature extraction components 802 1-k , statistical modeling components 804 1-k , SNR estimation components 808 1-k and a multi-noise source gain component 810 .
- Back-end SCS component 800 may be coupled to a plurality of microphone inputs 806 1-n .
- plurality of microphone inputs 806 1-n correspond to plurality of microphone inputs 106 1-n .
- Each of feature extraction components 802 1-k may be configured to extract features 801 1-k pertaining to a particular interfering noise source (e.g., stationary noise, a particular type of non-stationary noise, residual echo, reverberation, etc.) from one or more input signals 812 derived from the plurality of microphone inputs 806 1-n .
- input signal(s) 812 may correspond to microphone inputs that have been processed by the front end and/or have been condensed into an m number of signals, where m is an integer value less than n.
- input signal(s) 812 may correspond to enhanced source signal 240 , non-desired source signals 234 , FDAEC output signal 224 , and/or residual echo information 238 .
- Each of features 801 1-k may be provided to a respective statistical modeling component 804 1-k .
- Each of statistical modeling components 804 1-k may be configured to model the respective features received to determine respective probabilities 803 1-k that each indicate a probability that a particular frame of input signal(s) 812 comprises a particular type of interfering noise source.
- probability 803 1 may correspond to a probability that a particular frame of input signal(s) 812 comprises a first type of interfering noise source
- probability 803 2 may correspond to a probability that a particular frame of input signal(s) 812 comprises a second type of interfering noise source
- probability 803 3 may correspond to a probability that a particular frame of input signal(s) 812 comprises a third type of interfering noise source
- probability 803 k may correspond to a probability that a particular frame of input signal(s) 812 comprises a kth type of interfering noise source.
- One or more of statistical modeling components 804 1-k may also determine a probability 805 that a particular frame of input signal(s) comprises a desired source.
- Each of probabilities 803 1-k and 805 may be provided to a respective SNR estimation component 808 1-k .
- Each of SNR estimation components 808 1-k may be configured to determine a respective SNR estimate 807 1-k pertaining to a particular interfering noise source included in input signal(s) 812 based on the received probabilities.
- SNR estimation component 808 1 may determine SNR estimate 807 1 , which pertains to a first type of interfering noise source included in input signal(s) 812 , based on probability 803 1 and/or probability 805
- SNR estimation component 808 2 may determine SNR estimate 807 2 , which pertains to a second type of interfering noise source included in input signal(s) 812 , based on probability 803 2 and/or probability 805
- SNR estimation component 808 3 may determine SNR estimate 807 3 , which pertains to a third type of interfering noise source included in input signal(s) 812 , based on probability 803 3 and/or probability 805
- SNR estimation component 808 k may determine SNR estimate 807 k , which pertains to a kth type of interfering noise source included in input signal(s) 812 , based on probability 803 k and/or probability 805 .
- Multi-noise source gain component 810 may be configured to determine an optimal gain 811 based at least on probability 805 and/or SNR estimates 807 1-k in accordance with Equation 42 as described above.
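Equation 42 itself is not reproduced in this excerpt. As a rough, hedged illustration of a multi-noise-source gain, each interfering type can contribute an inverse-SNR term weighted by its trade-off parameter; the specific form and the gain floor below are assumptions of this sketch, not the patent's Equation 42.

```python
def multi_source_gain(snr_estimates, betas, g_min=0.05):
    """Illustrative multi-noise-source suppression gain: each
    interfering type contributes an inverse-SNR term weighted by its
    trade-off parameter beta_i, and the gain is floored at g_min to
    limit musical-noise artifacts.  This is a generic parametric
    Wiener-like form, assumed for illustration only."""
    denom = 1.0 + sum(b / max(s, 1e-12)
                      for s, b in zip(snr_estimates, betas))
    return max(1.0 / denom, g_min)

# Three interfering types with differing SNR estimates and trade-offs.
g = multi_source_gain([10.0, 5.0, 2.0], [1.0, 1.0, 2.0])
assert 0.05 <= g <= 1.0
# Raising any beta lowers the gain (stronger suppression of that type).
assert multi_source_gain([10.0, 5.0, 2.0], [1.0, 1.0, 4.0]) < g
```

In this form, the k per-type SNR estimates (e.g., 807 1-k) and k trade-off parameters jointly determine a single gain per frame or frequency bin.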
- Optimal gain 811 may then be provided to a gain application component (e.g., gain application component 346 , as shown in FIG. 3C ) for application to input signal(s) 812 .
- FIG. 9 depicts a block diagram of a processor circuit 900 in which portions of communication device 100 , as shown in FIG. 1 , system 200 (and the components and/or sub-components described therein), as shown in FIG. 2 , back-end SCS component 300 (and the components and/or sub-components described therein), as shown in FIG. 3C , back-end SCS component 700 (and the components and/or sub-components described therein), as shown in FIG. 7 , back-end SCS component 800 (and the components and/or sub-components described therein), as shown in FIG. 8 , flowcharts 400 - 600 , as respectively shown in FIGS. 4-6 , as well as any methods, algorithms, and functions described herein, may be implemented.
- Processor circuit 900 is a physical hardware processing circuit and may include central processing unit (CPU) 902 , an I/O controller 904 , a program memory 906 , and a data memory 908 .
- CPU 902 may be configured to perform the main computation and data processing function of processor circuit 900 .
- I/O controller 904 may be configured to control communication to external devices via one or more serial ports and/or one or more link ports.
- I/O controller 904 may be configured to provide data read from data memory 908 to one or more external devices and/or store data received from external device(s) into data memory 908 .
- Program memory 906 may be configured to store program instructions used to process data.
- Data memory 908 may be configured to store the data to be processed.
- Processor circuit 900 further includes one or more data registers 910 , a multiplier 912 , and/or an arithmetic logic unit (ALU) 914 .
- Data register(s) 910 may be configured to store data for intermediate calculations, prepare data to be processed by CPU 902 , serve as a buffer for data transfer, hold flags for program control, etc.
- Multiplier 912 may be configured to receive data stored in data register(s) 910 , multiply the data, and store the result into data register(s) 910 and/or data memory 908 .
- ALU 914 may be configured to perform addition, subtraction, absolute value operations, logical operations (AND, OR, XOR, NOT, etc.), shifting operations, conversion between fixed and floating point formats, and/or the like.
- CPU 902 further includes a program sequencer 916 , a program memory (PM) data address generator 918 and a data memory (DM) data address generator 920 .
- Program sequencer 916 may be configured to manage program structure and program flow by generating an address of an instruction to be fetched from program memory 906 .
- Program sequencer 916 may also be configured to fetch instruction(s) from instruction cache 922 , which may store an N number of recently-executed instructions, where N is a positive integer.
- PM data address generator 918 may be configured to supply one or more addresses to program memory 906 , which specify where the data is to be read from or written to in program memory 906 .
- DM data address generator 920 may be configured to supply address(es) to data memory 908 , which specify where the data is to be read from or written to in data memory 908 .
- Techniques, including methods, and embodiments described herein may be implemented by hardware (digital and/or analog) or a combination of hardware with one or both of software and/or firmware. Techniques described herein may be implemented by one or more components. Embodiments may comprise computer program products comprising logic (e.g., in the form of program code or software as well as firmware) stored on any computer useable medium, which may be integrated in or separate from other components. Such program code, when executed by one or more processor circuits, causes a device to operate as described herein. Devices in which embodiments may be implemented may include storage, such as storage drives, memory devices, and further types of physical hardware computer-readable storage media.
- Examples of such computer-readable storage media include a hard disk, a removable magnetic disk, a removable optical disk, flash memory cards, digital video disks, random access memories (RAMs), read only memories (ROMs), and other types of physical hardware storage media.
- examples of such computer-readable storage media include, but are not limited to, a hard disk associated with a hard disk drive, a removable magnetic disk, a removable optical disk (e.g., CDROMs, DVDs, etc.), zip disks, tapes, magnetic storage devices, MEMS (micro-electromechanical systems) storage, nanotechnology-based storage devices, flash memory cards, digital video discs, RAM devices, ROM devices, and further types of physical hardware storage media.
- Such computer-readable storage media may, for example, store computer program logic, e.g., program modules, comprising computer executable instructions that, when executed by one or more processor circuits, provide and/or maintain one or more aspects of functionality described herein with reference to the figures, as well as any and all components, steps and functions therein and/or further embodiments described herein.
- Such computer-readable storage media are distinguished from and non-overlapping with communication media (do not include communication media).
- Communication media embodies computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave.
- modulated data signal means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal.
- communication media includes wireless media such as acoustic, RF, infrared and other wireless media, as well as signals transmitted over wires. Embodiments are also directed to such communication media.
- Embodiments described herein may be implemented as, or in, various types of devices. For instance, embodiments may be included in mobile devices such as laptop computers, handheld devices such as mobile phones (e.g., cellular and smart phones), handheld computers, and further types of mobile devices, stationary devices such as conference phones, office phones, gaming consoles, and desktop computers, as well as car entertainment/navigation systems.
- A device, as defined herein, is a machine or manufacture as defined by 35 U.S.C. § 101. Devices may include digital circuits, analog circuits, or a combination thereof. Devices may include one or more processor circuits (e.g., processor circuit 900 of FIG. 9 ), such as central processing units (CPUs) and digital signal processors (DSPs).
- Such devices may use the same or alternative configurations other than the configuration illustrated in embodiments presented herein.
Description
- This application is a continuation-in-part of U.S. patent application Ser. No. 14/216,769, entitled “Multi-Microphone Source Tracking and Noise Suppression,” filed Mar. 17, 2014, which claims the benefit of U.S. Provisional Patent Application No. 61/799,154, entitled “Multi-Microphone Speakerphone Mode Algorithm,” filed Mar. 15, 2013. This application also claims priority to U.S. Provisional Application Ser. No. 62/025,847, filed Jul. 17, 2014. Each of these applications is incorporated by reference herein.
- This application is related to U.S. patent application Ser. No. 12/897,548, entitled “Noise Suppression System and Method,” filed Oct. 4, 2010, which is incorporated in its entirety by reference herein.
- I. Technical Field
- The present invention generally relates to systems and methods that process audio signals, such as speech signals, to remove components of one or more interfering sources therefrom.
- II. Background Art
- The term noise suppression generally describes a type of signal processing that attempts to attenuate or remove an undesired noise component from an input audio signal. Noise suppression may be applied to almost any type of audio signal that may include an undesired noise component. Conventionally, noise suppression functionality is often implemented in telecommunications devices, such as telephones, Bluetooth® headsets, or the like, to attenuate or remove an undesired additive background noise component from an input speech signal.
- An input speech signal may be viewed as comprising both a desired speech signal (sometimes referred to as “clean speech”) and an additive noise signal. The additive noise signal may comprise stationary noise, non-stationary noise, echo, residual echo, etc. Many conventional noise suppression techniques are unable to effectively differentiate between, model, and suppress these different types of interfering sources, thereby resulting in a non-optimal noise-suppressed audio signal.
- Methods, systems, and apparatuses are described for single-channel suppression of interfering source(s) in an audio signal, substantially as shown in and/or described herein in connection with at least one of the figures, as set forth more completely in the claims.
- The accompanying drawings, which are incorporated herein and form a part of the specification, illustrate embodiments and, together with the description, further serve to explain the principles of the embodiments and to enable a person skilled in the pertinent art to make and use the embodiments.
- FIG. 1 is a block diagram of a communication device, according to an example embodiment.
- FIG. 2 is a block diagram of an example system that includes multi-microphone configurations, frequency domain acoustic echo cancellation, source tracking, switched super-directive beamforming, adaptive blocking matrices, adaptive noise cancellation, and single-channel suppression, according to example embodiments.
- FIG. 3A depicts an example graph that illustrates a 3-mixture 2-dimensional Gaussian mixture model trained on features that comprise adaptive noise canceller to blocking matrix ratios or signal-to-noise ratios, according to an example embodiment.
- FIG. 3B depicts an example graph that illustrates a 3-mixture 2-dimensional Gaussian mixture model trained on features that comprise adaptive noise canceller to blocking matrix ratios or signal-to-noise ratios, according to another example embodiment.
- FIG. 3C is a block diagram of a back-end single-channel suppression component, according to an example embodiment.
- FIG. 3D depicts example diagnostic plots of 1-dimensional 2-mixture Gaussian mixture model parameters during online parameter estimation of a signal-to-noise feature vector, according to an example embodiment.
- FIG. 3E depicts example plots associated with an input signal that includes speech and car noise, according to an example embodiment.
- FIG. 3F depicts example diagnostic plots of 1-dimensional 2-mixture Gaussian mixture model parameters during online parameter estimation of an adaptive noise canceller to blocking matrix ratio, according to an example embodiment.
- FIG. 3G depicts example plots associated with an input signal that includes speech and car noise, according to another example embodiment.
- FIG. 3H depicts an example graph that plots example masking functions for different windowing functions, according to an example embodiment.
- FIG. 3I depicts example diagnostic plots associated with an input signal that includes speech and babble noise, according to an example embodiment.
- FIG. 3J depicts example diagnostic plots associated with an input signal that includes speech and babble noise, according to another example embodiment.
- FIG. 4 depicts a flowchart of a method for determining a noise suppression gain, according to an example embodiment.
- FIG. 5 depicts a flowchart of a method for applying a determined gain to an audio signal, according to an example embodiment.
- FIG. 6 depicts a flowchart of a method for setting a value of a first parameter that specifies a degree of balance between a distortion of a desired source included in an audio signal and a distortion of a residual amount of a first type of interfering source present in the audio signal and a second parameter that specifies a degree of balance between a distortion of a desired source included in an audio signal and a distortion of a residual amount of a second type of interfering source present in the audio signal, based on a rate at which an energy contour associated with an audio signal changes over time, according to an example embodiment.
- FIG. 7 is a block diagram of a back-end single-channel suppression component that is configured to suppress multiple types of non-stationary noise and/or other types of interfering sources that may be present in an audio signal, according to an example embodiment.
- FIG. 8 is a block diagram of a generalized back-end single-channel suppression component, according to an example embodiment.
- FIG. 9 is a block diagram of a processor that may be configured to perform techniques disclosed herein.
- Embodiments will now be described with reference to the accompanying drawings. In the drawings, like reference numbers indicate identical or functionally similar elements. Additionally, the left-most digit(s) of a reference number identifies the drawing in which the reference number first appears.
- The present specification discloses numerous example embodiments. The scope of the present patent application is not limited to the disclosed embodiments, but also encompasses combinations of the disclosed embodiments, as well as modifications to the disclosed embodiments.
- References in the specification to “one embodiment,” “an embodiment,” “an example embodiment,” etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.
- Further, descriptive terms used herein such as “about,” “approximately,” and “substantially” have equivalent meanings and may be used interchangeably.
- Still further, the terms “coupled” and “connected” may be used synonymously herein, and may refer to physical, operative, electrical, communicative and/or other connections between components described herein, as would be understood by a person of skill in the relevant art(s) having the benefit of this disclosure.
- Numerous exemplary embodiments are now described. Any section/subsection headings provided herein are not intended to be limiting. Embodiments are described throughout this document, and any type of embodiment may be included under any section/subsection. Furthermore, it is contemplated that the disclosed embodiments may be combined with each other in any manner.
- Techniques described herein are directed to performing back-end single-channel suppression of one or more types of interfering sources (e.g., additive noise) in an uplink path of a communication device. Back-end single-channel suppression may refer to the suppression of interfering source(s) in a single-channel audio signal during the back-end processing of the single-channel audio signal. The single-channel audio signal may be generated from a single microphone, or may be based on an audio signal in which noise has been suppressed during the front-end processing of the audio signal using multiple microphones (e.g., by applying a multi-microphone noise reduction technique).
- The back-end single-channel suppression techniques may suppress type(s) of additive noise using one or more suppression branches (e.g., a non-spatial (or stationary noise) branch, a spatial (or non-stationary noise) branch, a residual echo suppression branch, etc.). The non-spatial branch may be configured to suppress stationary noise from the single-channel audio signal, the spatial branch may be configured to suppress non-stationary noise from the single-channel audio signal, and the residual echo suppression branch may be configured to suppress residual echo from the single-channel audio signal.
- In embodiments, the spatial branch may be disabled based on an operational mode (e.g., single-user speakerphone mode or a conference speakerphone mode) of the communication device or based on a determination that spatial information (e.g., information that is used to distinguish a desired source from non-stationary noise present in the single-channel audio signal) is ambiguous.
- The example techniques and embodiments described herein may be adapted to various types of communication devices, communications systems, computing systems, electronic devices, and/or the like, which perform back-end single-channel suppression in an uplink path in such devices and/or systems. For example, back-end single-channel suppression may be implemented in devices and systems according to the techniques and embodiments herein. Furthermore, additional structural and operational embodiments, including modifications and/or alterations, will become apparent to persons skilled in the relevant art(s) from the teachings herein.
- For instance, methods, systems, and apparatuses are provided for suppressing multiple types of interfering sources included in an audio signal. In an example aspect, a method is disclosed. In accordance with the method, an audio signal that comprises at least a desired source component and at least one interfering source type is received. A noise suppression gain is determined based on a statistical modeling of at least one feature associated with the audio signal using a mixture model comprising a plurality of model mixtures. Each of the plurality of model mixtures is associated with one of the desired source component or an interfering source type of the at least one interfering source type.
- A method for determining and applying suppression of interfering sources to an audio signal is further described herein. In accordance with the method, one or more first characteristics associated with a first type of interfering source included in an audio signal are determined. One or more second characteristics associated with a second type of interfering source included in the audio signal are also determined. A gain is determined based on the one or more first characteristics and the one or more second characteristics. The determined gain is applied to the audio signal.
- A system for determining and applying suppression of interfering sources to an audio signal is also described herein. The system includes a signal-to-stationary noise ratio feature statistical modeling component configured to determine one or more first characteristics associated with a first type of interfering source included in the audio signal. The system also includes a spatial feature statistical modeling component configured to determine one or more second characteristics associated with a second type of interfering source included in the audio signal. The system further includes a multi-noise source gain component configured to determine a gain based on the one or more first characteristics and the one or more second characteristics, and a gain application component configured to apply the determined gain to the audio signal.
- Various example embodiments are described in the following subsections. In particular, example device and system embodiments are described. This is followed by example single-channel suppression embodiments, followed by further example embodiments. An example processor circuit implementation is also described. Finally, some concluding remarks are provided. It is noted that the division of the following description generally into subsections is provided for ease of illustration, and it is to be understood that any type of embodiment may be described in any subsection.
- Systems and devices may be configured in various ways to perform back-end single-channel suppression of interfering source(s) included in an audio signal. Techniques and embodiments are also provided for implementing devices and systems with back-end single-channel suppression.
- For instance,
FIG. 1 shows an example communication device 100 for implementing back-end single-channel suppression in accordance with an example embodiment. Communication device 100 may include an input interface 102, an optional display interface 104, a plurality of microphones 106 1-106 N, a loudspeaker 108, and a communication interface 110. In embodiments, as described in further detail below, communication device 100 may include one or more instances of a frequency domain acoustic echo cancellation (FDAEC) component 112, a multi-microphone noise reduction (MMNR) component 114, and/or a single-channel suppression (SCS) component 116. In embodiments, communication device 100 may include one or more processor circuits (not shown) such as processor circuit 1200 of FIG. 12 described below. - In embodiments,
input interface 102 and optional display interface 104 may be combined into a single, multi-purpose input-output interface, such as a touchscreen, or may be any other form and/or combination of known user interfaces as would be understood by a person of skill in the relevant art(s) having the benefit of this disclosure. - Furthermore,
loudspeaker 108 may be any standard electronic device loudspeaker that is configurable to operate in a speakerphone or conference phone type mode (e.g., not in a handset mode). For example, loudspeaker 108 may comprise an electro-mechanical transducer that operates in a well-known manner to convert electrical signals into sound waves for perception by a user. In embodiments, communication interface 110 may comprise wired and/or wireless communication circuitry and/or connections to enable voice and/or data communications between communication device 100 and other devices such as, but not limited to, computer networks, telecommunication networks, other electronic devices, the Internet, and/or the like. - While only two microphones are illustrated for the sake of brevity and illustrative clarity, plurality of microphones 106 1-106 N may include two or more microphones, in embodiments. Each of these microphones may comprise an acoustic-to-electric transducer that operates in a well-known manner to convert sound waves into an electrical signal. Accordingly, plurality of microphones 106 1-106 N may be said to comprise a microphone array that may be used by
communication device 100 to perform one or more of the techniques described herein. For instance, in embodiments, plurality of microphones 106 1-106 N may include 2, 3, 4, . . . , to N microphones located at various locations of communication device 100. Indeed, any number of microphones (greater than one) may be configured in communication device 100 embodiments. As described herein, embodiments that include more microphones in plurality of microphones 106 1-106 N provide for finer spatial resolution of beamformers for suppressing interfering sources and for better tracking of sources. In certain single-microphone embodiments, back-end SCS 116 can be used by itself without MMNR 114. - In embodiments,
FDAEC component 112 is configured to provide a scalable algorithm and/or circuitry for two to many microphone inputs. MMNR component 114 is configured to include a plurality of subcomponents for determining and/or estimating spatial parameters associated with audio sources, for directing a beamformer, for online modeling of acoustic scenes, for performing source tracking, and for performing adaptive noise reduction, suppression, and/or cancellation. In embodiments, SCS component 116 is configurable to perform single-channel suppression of interfering source(s) using non-spatial information, using spatial information, and/or using downlink signal information. Further details and embodiments of FDAEC component 112, MMNR component 114, and SCS component 116 are provided below. - While
FIG. 1 is shown in the context of a communication device, the described embodiments may be applied to a variety of products that employ multi-microphone noise suppression for speech signals. Embodiments may be applied to portable products, such as smart phones, tablets, laptops, gaming systems, etc., to stationary products, such as desktop computers, office phones, conference phones, gaming systems, etc., and to car entertainment/navigation systems, as well as being applied to further types of mobile and stationary devices. Embodiments may be used for MMNR and/or suppression for speech communication, for enhancing speech signals as a pre-processing step for automated speech processing applications, such as automatic speech recognition (ASR), and in further types of applications. - Turning now to
FIG. 2, a system 200 is shown in accordance with an example embodiment. System 200 may be a further embodiment of a portion of communication device 100 of FIG. 1. For example, in embodiments, system 200 may be included, in whole or in part, in communication device 100. As shown, system 200 includes plurality of microphones 106 1-106 N, FDAEC component 112, MMNR component 114, and SCS component 116. System 200 also includes an acoustic echo cancellation (AEC) component 204, a microphone mismatch compensation component 208, a microphone mismatch estimation component 210, and an automatic mode detector 222. In embodiments, FDAEC component 112 may be included in AEC component 204 as shown, and references to AEC component 204 herein may inherently include a reference to FDAEC component 112 unless specifically stated otherwise. MMNR component 114 includes a steered null error phase transform (SNE-PHAT) time delay of arrival (TDOA) estimation component 212, an on-line Gaussian mixture model (GMM) modeling component 214, an adaptive blocking matrix (ABM) component 216, a switched super-directive beamformer (SSDB) 218, and an adaptive noise canceller (ANC) 220. In some embodiments, automatic mode detector 222 may be structurally and/or logically included in MMNR component 114. It is noted that component 112 may use acoustic echo cancellation schemes other than FDAEC, that estimation component 212 may use source tracking schemes other than SNE-PHAT, and that the usage of the terms FDAEC and SNE-PHAT is purely exemplary. - In embodiments,
MMNR component 114 may be considered to be the front-end processing portion of system 200 (e.g., the “front end”), and SCS component 116 may be considered to be the back-end processing portion of system 200 (e.g., the “back end”). For the sake of simplicity when referring to embodiments herein, AEC component 204, FDAEC component 112, microphone mismatch compensation component 208, and microphone mismatch estimation component 210 may be included in references to the front end. - As shown in
FIG. 2, plurality of microphones 106 1-106 N provides N microphone inputs 206 to AEC 204 and its instances of FDAEC 112. AEC 204 also receives a downlink signal 202 (a signal received from a far-end device) as an input, which may include one or more downlink signals “L” in embodiments. AEC 204 provides echo-cancelled outputs 224 to microphone mismatch compensation component 208, provides residual echo information 238 to SCS component 116, and/or provides downlink-uplink coherence information 246 (i.e., an estimate of the coherence between the downlink and uplink signals as a measure of residual echo presence) to SNE-PHAT TDOA estimation component 212 and/or on-line GMM modeling component 214. Microphone mismatch estimation component 210 provides estimated microphone mismatch values 248 to microphone mismatch compensation component 208. Microphone mismatch compensation component 208 provides compensated microphone outputs 226 (e.g., normalized microphone outputs) to microphone mismatch estimation component 210 (and in some embodiments, not shown, microphone mismatch estimation component 210 may also receive echo-cancelled outputs 224 directly), to SNE-PHAT TDOA estimation component 212, to adaptive blocking matrix component 216, and to SSDB 218.
SNE-PHAT TDOA estimation component 212 provides spatial information 228 to on-line GMM modeling component 214, and on-line GMM modeling component 214 provides statistics, mixtures, and probabilities 230 based on acoustic scene modeling to automatic mode detector 222, to adaptive blocking matrix component 216, and to SSDB 218. SSDB 218 provides a desired source single output selected signal 232 to ANC 220, and ABM component 216 provides non-desired source signals 234 to ANC 220, as well as to SCS component 116. Automatic mode detector 222 provides a mode enable signal 236 to MMNR component 114 and to SCS component 116, ANC 220 provides a noise-cancelled (or enhanced) source signal 240 to SCS component 116, and SCS component 116 provides a suppressed signal 244 as an output for subsequent processing and/or uplink transmission. SCS component 116 also provides a soft-disable control signal 242 to MMNR component 114. - Additional details regarding plurality of microphones 106 1-106 N,
FDAEC component 112, MMNR component 114, AEC component 204, microphone mismatch compensation component 208, microphone mismatch estimation component 210, automatic mode detector 222, SNE-PHAT TDOA estimation component 212, on-line GMM modeling component 214, ABM component 216, SSDB 218, and ANC 220 are provided in commonly-owned, co-pending U.S. patent application Ser. No. 14/216,769, the entirety of which has been incorporated by reference as if fully set forth herein. -
SCS component 116 is configured to perform single-channel suppression of interfering source(s) on enhanced source signal 240. SCS component 116 is configured to perform single-channel suppression using non-spatial information, using spatial information, and/or using downlink signal information. SCS component 116 is also configured to determine spatial ambiguity in the acoustic scene, and to provide a soft-disable control signal 242 that causes MMNR 114 (or portions thereof) to be disabled when SCS component 116 is in a spatially ambiguous state. As noted above, in embodiments, one or more of the components and/or sub-components of system 200 may be configured to be dynamically disabled based upon enable/disable outputs received from the back end, such as soft-disable control signal 242. The specific system connections and logic associated therewith are not shown for the sake of brevity and illustrative clarity in FIG. 2, but would be understood by persons of skill in the relevant art(s) having the benefit of this disclosure. - Techniques described herein are directed to performing back-end single-channel suppression of one or more types of interfering sources (e.g., additive noise) in an uplink path of a communication device. In accordance with an embodiment, back-end single-channel suppression is performed based on a statistical modeling of acoustic source(s). Examples of such sources include desired speaker(s), interfering speaker(s), stationary noise (e.g., diffuse or point-source noise), non-stationary noise, residual echo, reverberation, etc.
- Various example embodiments are described in the following subsections. In particular, subsection IV.A describes how acoustic sources are statistically modelled, and subsection IV.B describes a system that implements the statistical modeling of acoustic sources to suppress multiple types of interfering sources from an audio signal.
- A. Statistical Modeling of Acoustic Sources
- Statistical modeling may be comprised of two steps, namely adaptation and inference. First, models are adapted to current observations to capture the generally non-stationary states of the underlying processes. Second, inference is performed to classify subpopulations of the data, and extract information regarding the current acoustic scene. Ultimately, the goal of back-end modeling is to provide the system with time- and frequency-specific probabilistic information regarding the activity of various sources, which can then be leveraged during the calculation of the back-end noise suppression gain (e.g., calculated by multi-noise
source gain component 332, as described below with reference to FIG. 3C). - In this subsection, an illustrative example of a unified statistical model for back-end single-channel suppression (e.g., as performed by back-end SCS component 300, as described below with reference to FIG. 3C) is presented. That is, one model is constructed to capture all present acoustic sources. This allows back-end single-channel suppression to fully exploit any statistical correlation between acoustic sources. However, in many cases the back-end modeling can be achieved with lower complexity by constructing several parallel branches, each using a model of lower dimensionality. Further details on the use of multiple branches will be provided below in subsection IV.B. However, the theory derived in this subsection in the context of a unified statistical model is easily applied to smaller models as well. - 1. Gaussian Mixture Modeling (GMM) -
- Mixture models (MMs) are hierarchical probabilistic models which can be used to represent statistical distributions of arbitrary shape. In particular, MMs are useful when modeling the marginal distribution of data in the presence of subpopulations. Formally, mixture models correspond to a linear mixing of individual distributions, where mixing weights are used to control the effect of each.
- Specifically, the Gaussian mixture model (GMM) serves as an efficient tool for estimating data distributions, particularly of a dimension greater than one, due to various attractive mathematical properties. For example, given a set of training data, the maximum likelihood (ML) estimates of the mean vector and covariance matrix are obtainable in closed form.
- The GMM distribution of a random variable xn of dimension D is given by Equation 1, which is shown below:

p(x_n; \phi) = \sum_{m=1}^{M} w_m N(x_n; \mu_m, C_m)   (Equation 1)
- where φ={μ1, . . . , μM, C1, . . . , CM, w1, . . . , wM} is the set of parameters which defines the GMM, μm represent Gaussian means, Cm represent Gaussian covariance matrices, wm represent mixing weights, and M denotes the number of mixtures (i.e., model mixtures) in the GMM.
- Thus, evaluating the probability distribution function (pdf) of a trained GMM involves the calculation of the above equation for a given data point xn.
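As a concrete illustration, the pdf evaluation of Equation 1 can be sketched in code. The following is an illustrative sketch only (function names and the use of NumPy are assumptions, not part of the described embodiments), evaluating a full-covariance GMM at a single feature vector:

```python
import numpy as np

def gmm_pdf(x, weights, means, covs):
    """Evaluate the GMM pdf of Equation 1 at a single data point x.

    weights: (M,) mixing weights w_m (summing to unity)
    means:   (M, D) Gaussian means mu_m
    covs:    (M, D, D) Gaussian covariance matrices C_m
    """
    D = x.shape[0]
    total = 0.0
    for w, mu, C in zip(weights, means, covs):
        diff = x - mu
        # Gaussian normalization constant and Mahalanobis exponent
        norm = 1.0 / np.sqrt(((2.0 * np.pi) ** D) * np.linalg.det(C))
        expo = -0.5 * diff @ np.linalg.solve(C, diff)
        total += w * norm * np.exp(expo)
    return total
```

For a single standard Gaussian (M=1, D=1) this reduces to the familiar value 1/sqrt(2π) ≈ 0.399 at the mean.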
- The adaptation step of back-end statistical modeling performs parameter estimation to obtain a trained model based on a set of training data, i.e., adapting the set φ. Parameter estimation optimizes model parameters by maximizing some cost function. Examples of common cost functions include the ML and maximum a posteriori (MAP) cost functions. Here, the training process of a GMM for batch processing is described, where all training data is accessible at once. In subsection IV.A.3, this process is extended to online training, in which training samples are observed successively, and parameter estimation is performed iteratively to adapt to changing environments.
- An example of the ML cost for the training process of a GMM for batch processing is shown below as Equation 2. Let the set {x1, x2, . . . , xN} be a set of N data samples of dimension D:

\phi_{ML} = \arg\max_{\phi} \sum_{n=1}^{N} \log \sum_{m=1}^{M} w_m N(x_n; \mu_m, C_m)   (Equation 2)

- where the function N(xn; μm, Cm) denotes the evaluation of a Gaussian distribution with parameters μm and Cm at xn.
- Parameter estimation for a mixture model is not possible in closed-form due to the ambiguity associated with mixture membership of data samples. However, several methods exist to estimate mixture model parameters iteratively. One such technique is the expectation-maximization (EM) algorithm, which assumes data mixture membership to be hidden random processes. The solution to EM parameter estimation reduces to a two-step iterative process, in which minimum mean-square error (MMSE) point estimates of data mixture membership are first obtained, and ML or MAP estimates of Gaussian parameters are then obtained conditioned on mixture membership estimates. Mathematically, for the (i+1)th iteration, this is expressed as:
\mu_m^{(i+1)} = \frac{\sum_{n=1}^{N} \gamma_{n,m}^{(i)} x_n}{\sum_{n=1}^{N} \gamma_{n,m}^{(i)}}, \qquad C_m^{(i+1)} = \frac{\sum_{n=1}^{N} \gamma_{n,m}^{(i)} \left(x_n - \mu_m^{(i+1)}\right)\left(x_n - \mu_m^{(i+1)}\right)^T}{\sum_{n=1}^{N} \gamma_{n,m}^{(i)}}, \qquad w_m^{(i+1)} = \frac{1}{N} \sum_{n=1}^{N} \gamma_{n,m}^{(i)}

- where:

\gamma_{n,m}^{(i)} = \frac{w_m^{(i)} N(x_n; \mu_m^{(i)}, C_m^{(i)})}{\sum_{m'=1}^{M} w_{m'}^{(i)} N(x_n; \mu_{m'}^{(i)}, C_{m'}^{(i)})}
- The above steps can be performed iteratively until convergence of the parameters.
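The two-step iteration above can be sketched as a generic batch EM update for a GMM. This is a standard textbook formulation offered for illustration (not necessarily the exact update of the embodiments): the E-step computes MMSE mixture-membership estimates, and the M-step re-estimates the Gaussian parameters conditioned on them.

```python
import numpy as np

def em_step(X, weights, means, covs, eps=1e-12):
    """One EM iteration for a GMM.

    X: (N, D) data samples; weights: (M,); means: (M, D); covs: (M, D, D)
    Returns updated (weights, means, covs).
    """
    N, D = X.shape
    M = weights.shape[0]
    # E-step: responsibilities (MMSE estimates of mixture membership)
    resp = np.zeros((N, M))
    for m in range(M):
        diff = X - means[m]
        inv = np.linalg.inv(covs[m])
        norm = 1.0 / np.sqrt(((2.0 * np.pi) ** D) * np.linalg.det(covs[m]))
        expo = -0.5 * np.sum((diff @ inv) * diff, axis=1)
        resp[:, m] = weights[m] * norm * np.exp(expo)
    resp /= resp.sum(axis=1, keepdims=True) + eps
    # M-step: ML estimates conditioned on the membership estimates
    Nm = resp.sum(axis=0)
    new_weights = Nm / N
    new_means = (resp.T @ X) / Nm[:, None]
    new_covs = np.empty_like(covs)
    for m in range(M):
        diff = X - new_means[m]
        new_covs[m] = (resp[:, m, None] * diff).T @ diff / Nm[m]
    return new_weights, new_means, new_covs
```

Iterating em_step until the parameters stop changing yields the batch fit described above.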
- 2. Feature Vector
- The use of GMMs allows freedom in designing the feature vector, xn. Generally, the feature vector should be constructed to include elements which may provide discriminative information for the inference step of back-end statistical modeling. Furthermore, it is advantageous to include elements which provide complementary information. Finally, when using GMMs, feature elements should be conditioned to better fit the Gaussian assumption implied by the use of this model. For example, features which occur naturally in the form of ratios can be used in the log domain because this avoids the non-negative, highly-skewed nature of ratios.
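For instance, a ratio-valued feature such as a signal-to-noise power ratio can be conditioned for GMM modeling by mapping it to the log domain. A minimal sketch (the dB scaling and the floor constant are illustrative choices, not specified by the text):

```python
import math

def log_ratio_feature(numerator_power, denominator_power, floor=1e-12):
    """Map a raw power ratio to the log domain (in dB) before using it as
    a feature element; raw ratios are non-negative and highly skewed,
    which fits the Gaussian assumption poorly."""
    ratio = max(numerator_power, floor) / max(denominator_power, floor)
    return 10.0 * math.log10(ratio)
```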
- Examples of features that can make up the feature vector are discussed below in subsection IV.B. However, the notation xn(k) is introduced here to represent the kth element of a full-band feature vector corresponding to time index n. In the case of frequency-dependent feature vectors, the notation xn,m(k) represents the kth element of a feature vector corresponding to time index n and frequency channel m.
- 3. Online/Adaptive Update of GMM Parameters
- The GMM parameter estimation in subsection IV.A.1 assumes the availability of all training samples. However, such batch processing is not realistic for communication systems, wherein successive (training) samples are observed in time and delay to buffer future samples is not practical. Instead, an online method to adapt the GMM parameters as new samples arrive (e.g., during a communication session) is desirable. In online GMM parameter estimation, it is assumed that the GMM has previously been trained on a set of N past samples. The system then observes K new samples, and the GMM is updated based on these new samples. One method by which to perform online parameter estimation is to use the MAP cost function. This involves defining the a priori distribution of φ conditioned on the original N data samples.
- Assume the initial N samples were used for parameter estimation to obtain initial parameter estimates φ′={μ′1, . . . , μ′M, C′1, . . . , C′M, w′1, . . . , w′M}. The EM approach can then be applied to the MAP cost function, similar to the case of the ML cost function in subsection IV.A.1, to obtain the new parameter estimates based on the next K samples. By making a few assumptions regarding the a priori distribution of φ, the EM solution to online parameter estimation can be expressed as:
\hat{w}_m = \frac{N w'_m + \sum_{k=1}^{K} \gamma_{k,m}}{N + K}, \qquad \hat{\mu}_m = (1 - \alpha_m)\,\mu'_m + \alpha_m\,\bar{x}_m, \qquad \hat{C}_m = (1 - \alpha_m)\,C'_m + \alpha_m\,\frac{\sum_{k=1}^{K} \gamma_{k,m} \left(x_k - \hat{\mu}_m\right)\left(x_k - \hat{\mu}_m\right)^T}{\sum_{k=1}^{K} \gamma_{k,m}}

- where:

\gamma_{k,m} = \frac{w'_m N(x_k; \mu'_m, C'_m)}{\sum_{m'=1}^{M} w'_{m'} N(x_k; \mu'_{m'}, C'_{m'})}, \qquad \bar{x}_m = \frac{\sum_{k=1}^{K} \gamma_{k,m} x_k}{\sum_{k=1}^{K} \gamma_{k,m}}

- and:

\alpha_m = \frac{\sum_{k=1}^{K} \gamma_{k,m}}{N w'_m + \sum_{k=1}^{K} \gamma_{k,m}}
- The above solution places equal weight on each of the (N+K) data samples during parameter estimation. When modeling non-stationary processes, however, it may be advantageous to place emphasis on recent samples because they can provide a better representation of the current state of the underlying random processes. A simple heuristic method by which to emphasize recent samples is to calculate αm in an alternative manner, as shown below in Equation 12:
\alpha_m = \frac{\sum_{k=1}^{K} \gamma_{k,m}}{\min(N, N_{max})\, w'_m + \sum_{k=1}^{K} \gamma_{k,m}}   (Equation 12)

- where Nmax corresponds to some constant. Thus, αm avoids convergence to zero as the total number of observed data samples N grows very large.
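A per-sample online update in this spirit can be sketched as follows. This is a heuristic illustration only (the exact recursion of the embodiments may differ): the per-mixture learning rate is computed against a capped effective history min(N, Nmax), so adaptation does not stall as the number of observed samples grows.

```python
import numpy as np

def online_update(x, weights, means, covs, n_seen, n_max=1000.0):
    """Adapt GMM parameters to one new sample x of shape (D,),
    emphasizing recent samples by capping the effective history."""
    M, D = means.shape
    # Posterior membership of the new sample (Bayes' rule)
    gamma = np.zeros(M)
    for m in range(M):
        diff = x - means[m]
        norm = 1.0 / np.sqrt(((2.0 * np.pi) ** D) * np.linalg.det(covs[m]))
        gamma[m] = weights[m] * norm * np.exp(-0.5 * diff @ np.linalg.solve(covs[m], diff))
    total = gamma.sum()
    gamma = gamma / total if total > 0.0 else np.full(M, 1.0 / M)
    n_eff = min(n_seen, n_max)
    for m in range(M):
        # Per-mixture learning rate; min(n_seen, n_max) keeps it from
        # converging to zero as the sample count grows very large
        alpha = gamma[m] / (n_eff * weights[m] + gamma[m])
        means[m] = (1.0 - alpha) * means[m] + alpha * x
        diff = x - means[m]
        covs[m] = (1.0 - alpha) * covs[m] + alpha * np.outer(diff, diff)
    weights = (n_eff * weights + gamma) / (n_eff + 1.0)
    return weights / weights.sum(), means, covs
```

Feeding samples drawn near one mixture pulls that mixture's mean and prior toward the data while leaving distant mixtures essentially untouched.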
- 4. Knowledge-driven Parameter Constraints
- In the previous sections, parameter estimation for GMMs was described from a purely data-driven view. However, as will be discussed below in subsection IV.A.5, the inference phase of this two-step statistical analysis framework makes the assumption that each acoustic source is represented by at least one mixture. If parameter estimation is performed in an unsupervised manner, the adapted back-end GMM will generally not be consistent with this assumption. For example, if a certain acoustic source is inactive for a given duration, the corresponding mixture may be absorbed by a statistically similar source, and the particular acoustic source will no longer be modelled. Additionally, if a certain acoustic source exhibits features with non-Gaussian behavior, unsupervised parameter estimation may look to model the particular source with multiple mixtures. In order to maintain the validity of the assumption that each acoustic source is represented by a single GMM mixture, knowledge-driven constraints are placed on parameters during parameter estimation. These knowledge-driven constraints are applied after each iteration of data-driven parameter estimation.
- 4.1 Minimum Constraints on Mixture Priors
- In order to avoid mixtures corresponding to temporarily inactive sources from being absorbed by statistically similar active sources, minimum constraints can be placed on mixture priors. That is, after an iteration of data-driven parameter estimation, mixture priors are floored at a threshold. This generally requires all mixture priors to be altered, due to the constraint that mixture weights must sum to unity. Application of minimum constraints on mixture priors maintains the presence of acoustic source mixtures, even during extended periods of source inactivity. Additionally, it allows GMM modeling to rapidly recapture the inactive source when it eventually becomes active.
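The flooring step can be sketched as floor-then-renormalize (the floor value below is illustrative; note that renormalization can pull a weight marginally below the floor again, which a stricter iterative scheme could tighten):

```python
def floor_mixture_priors(priors, floor=0.02):
    """Floor mixture priors at a minimum value, then renormalize so the
    mixture weights again sum to unity."""
    floored = [max(p, floor) for p in priors]
    total = sum(floored)
    return [p / total for p in floored]
```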
- 4.2 Minimum and Maximum Constraints on Mixture Means
- Using intuition regarding the design of feature elements of xn, mixture means corresponding to various sources can often be expected to inhabit specific ranges in feature space. Thus, knowledge-driven mean constraints can be applied to the back-end GMM to ensure that mixture means representing various acoustic sources remain in these ranges. Minimum and maximum mean constraints can avoid scenarios during data-driven parameter estimation wherein multiple mixtures converge to represent a single acoustic source.
- 4.3 Minimum and Maximum Constraints on Covariance Values
- Elements of mixture covariance matrices play an important role in the behavior of a GMM during statistical modeling. If mixture covariances become too broad, mixture memberships of sample data may be ambiguous, and the adaptation rate of data-driven parameter estimation may become slow or inaccurate. Conversely, if mixture covariances become too narrow, those mixtures may become effectively marginalized during data-driven parameter estimation. To avoid these issues, intuitive constraints can be applied to diagonal elements of the covariance matrices. Constraining diagonal elements of the covariance matrix will generally require careful handling of off-diagonal elements in order to avoid singular covariance matrices.
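One way to constrain the diagonal elements while handling the off-diagonal elements carefully is to clip the variances and rescale the off-diagonal terms so that the implied correlation coefficients are preserved. A sketch (assuming the input covariance matrix is itself valid):

```python
import numpy as np

def constrain_covariance(C, var_min, var_max):
    """Clip the diagonal variances of a covariance matrix into
    [var_min, var_max], rescaling off-diagonal elements so the
    correlation coefficients (and hence validity of the matrix for a
    valid input) are preserved."""
    d = np.sqrt(np.diag(C))            # original standard deviations
    corr = C / np.outer(d, d)          # correlation matrix
    s = np.sqrt(np.clip(np.diag(C), var_min, var_max))
    return corr * np.outer(s, s)
```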
- 5. Inference of Statistical Models
- The inference step in back-end statistical modeling involves classifying the underlying acoustic source types corresponding to each GMM mixture, and then extracting probabilistic information regarding the activity of each source.
- 5.1 Classification of Data Subpopulations
- Classification of GMM mixtures requires prior knowledge of the statistical behavior expected for specific acoustic source types in terms of the feature vector elements. Final decisions regarding source classification are made by applying knowledge-based rules to the updated GMM parameters.
- Below are examples of feature elements that can be used during back-end modeling, along with the expected statistical behavior of source types with respect to those elements. Further details on the design of feature elements are provided in subsection IV.B and subsection V:
- Stationary SNR: The time- and frequency-localized stationary log-domain SNRs can be used to differentiate between stationary noise sources, and non-stationary acoustic sources. Mixtures representing stationary noise sources are expected to include highly negative mean values of this element. Mixtures corresponding to desired sources can be expected to show particularly high stationary SNR mean.
- Adaptive noise canceller to blocking matrix ratio: The time- and frequency-localized non-stationary log-domain adaptive noise canceller (e.g.,
ANC 220, as shown in FIG. 2) to blocking matrix (e.g., ABM 216, as shown in FIG. 2) ratios can be used to differentiate between non-stationary noise sources and desired sources. Mixtures representing non-stationary noise sources are expected to include highly negative mean values of this element. Mixtures corresponding to desired sources can again be expected to show particularly high stationary SNR mean.
- Echo return loss enhancement (ERLE): The log-domain ERLE can be used to differentiate between acoustic sources originating in the present environment, and those originating from the device speaker. Mixtures representing residual echo are expected to show high ERLE mean values, whereas other sources are expected to show small ERLE mean values. In this particular case, ERLE refers to a short-term or instantaneous ratio of down-link to up-link power, possibly as a function of frequency.
-
FIG. 3A illustrates an example graph of a 3-mixture 2-dimensional GMM trained on features comprised of adaptive noise canceller to blocking matrix ratios and stationary SNRs. Mixtures are shown by contours of a constant pdf. As shown in FIG. 3A, the acoustic sources present are desired source 335, stationary noise 337, and non-stationary noise 339. The parameters of each mixture are consistent with the expected statistical behavior of each source type, as outlined above.
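The knowledge-based classification rules can be sketched as simple threshold tests on the updated mixture means. In the sketch below, the feature ordering (element 0 = stationary SNR in dB, element 1 = adaptive noise canceller to blocking matrix ratio in dB) and the thresholds are illustrative assumptions:

```python
def classify_mixtures(means):
    """Label each GMM mixture from its mean vector using knowledge-based
    rules: a very low stationary SNR mean indicates stationary noise, a
    very low ANC-to-blocking-matrix ratio mean indicates non-stationary
    noise, and the remainder are treated as the desired source."""
    labels = []
    for snr_mean, anc_bm_mean in means:
        if snr_mean < -5.0:
            labels.append("stationary noise")
        elif anc_bm_mean < -5.0:
            labels.append("non-stationary noise")
        else:
            labels.append("desired source")
    return labels
```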
- An objective of statistical modeling in back-end single-channel suppression is to provide probabilistic information regarding the present activity of various sources, which can be used during calculation of the back-end multi-noise source gain rule. Once classification of data subpopulations has been performed, the posterior probabilities of individual source activity, conditioned on the current feature vector, can be estimated by means of Bayes' rule. For example, assume that the GMM mixture m′ is classified as representing a particular source of interest. The posterior probability of activity for the source represented by m′ is then given by Equation 13, which is shown below:
- $P(m' \mid x_n) = \dfrac{c_{m'}\,\mathcal{N}(x_n;\,\mu_{m'},\,\Sigma_{m'})}{\sum_{m=1}^{M} c_m\,\mathcal{N}(x_n;\,\mu_m,\,\Sigma_m)}$ (Equation 13), where $c_m$, $\mu_m$ and $\Sigma_m$ denote the weight, mean and covariance of mixture m, and M is the number of mixtures.
- In certain cases it may be desired to obtain the posterior probability of source inactivity, which is given by
Equation 14, which is shown below: - $P(\bar{m}' \mid x_n) = 1 - P(m' \mid x_n) = \dfrac{\sum_{m \neq m'} c_m\,\mathcal{N}(x_n;\,\mu_m,\,\Sigma_m)}{\sum_{m=1}^{M} c_m\,\mathcal{N}(x_n;\,\mu_m,\,\Sigma_m)}$ (Equation 14)
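The Bayes'-rule posterior of Equations 13 and 14 can be sketched as follows for a 1-dimensional GMM. The function names and the example weights, means and variances are illustrative assumptions, not values from the application:

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Univariate Gaussian likelihood N(x; mean, var)."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def mixture_posteriors(x, weights, means, variances):
    """Posterior probability of each GMM mixture given feature x (Bayes' rule)."""
    likelihoods = weights * gaussian_pdf(x, means, variances)
    return likelihoods / likelihoods.sum()

# Hypothetical 2-mixture model: mixture 0 = interfering source, mixture 1 = desired source.
weights = np.array([0.6, 0.4])
means = np.array([-10.0, 15.0])       # log-domain SNR means
variances = np.array([25.0, 25.0])

post = mixture_posteriors(5.0, weights, means, variances)
p_active = post[1]            # posterior of activity for the mixture of interest
p_inactive = 1.0 - p_active   # posterior of source inactivity
```

A feature value closer to the desired-source mean yields a posterior of activity above 0.5, as expected.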
- 5.3 Refining Source Activity Probabilities with Supplemental Information
- The feature vector xn is designed to include information which may improve separation of acoustic sources in feature space. However, in some cases there exists supplemental information which may be advantageous to use in statistical analysis of acoustic sources, but may not be appropriate for inclusion in the model feature vector.
- For example, full-band voice activity detection (VAD) decisions provide valuable information regarding the activity of desired or interfering speakers. Probabilistic VAD outputs can seamlessly be used to refine the source activity probabilities of subsection IV.A.5.2, by assuming statistical independence between xn and the features used for VAD, and by applying Bayes' rule. Let Pvad denote the posterior probability of active speech obtained from a separate VAD system. Further, assume mixture m′ represents a source which corresponds to speech (e.g., desired source, interfering speaker, etc.), and let the set θ contain all such mixtures. The refined posterior of m′ then becomes:
- $\tilde{P}(m' \mid x_n) = \dfrac{P_{vad}\,P(m' \mid x_n)}{P_{vad}\sum_{m \in \theta} P(m \mid x_n) + (1 - P_{vad})\sum_{m \notin \theta} P(m \mid x_n)}$ (Equation 15)
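One plausible realization of this VAD-based refinement, under the stated independence assumption, is sketched below. The exact normalization used in Equation 15 may differ; the function name and example numbers are illustrative only:

```python
import numpy as np

def refine_with_vad(posteriors, speech_mixtures, p_vad):
    """Reweight GMM mixture posteriors with an external VAD probability.

    posteriors: array of P(m | x_n) from the GMM (sums to 1)
    speech_mixtures: indices of mixtures in the set theta (speech-like sources)
    p_vad: posterior probability of active speech from the VAD system
    """
    scaled = np.asarray(posteriors, dtype=float).copy()
    speech_mask = np.zeros(len(scaled), dtype=bool)
    speech_mask[list(speech_mixtures)] = True
    scaled[speech_mask] *= p_vad          # speech-like mixtures weighted by Pvad
    scaled[~speech_mask] *= (1.0 - p_vad) # non-speech mixtures by 1 - Pvad
    return scaled / scaled.sum()          # renormalize (Bayes' rule)

post = np.array([0.5, 0.3, 0.2])  # e.g., desired speech, interfering speaker, noise
refined = refine_with_vad(post, speech_mixtures=[0, 1], p_vad=0.9)
```

With a confident VAD (Pvad near 1), posteriors of non-speech mixtures are pushed down and speech mixtures up, which matches the intent of the refinement.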
- Another example of supplemental full-band information is the posterior probability of a target speaker provided by a speaker identification (SID) system. This information would be leveraged analogously to
Equation 15. - 6. Estimating the Reliability of GMM Modeling
- As described above, feature elements are chosen to provide separation between acoustic source types during back-end statistical modeling. However, there exist scenarios during which the intended discriminative power of the feature may become insufficient for reliable GMM inference. An example of this is when two or more acoustic sources are physically located relative to the device microphones of a communication device (e.g.,
communication device 100, as shown in FIG. 1 ) such that their time differences of arrival (TDOAs) become very similar, and any feature designed to exploit spatial diversity becomes ambiguous. It is then advantageous to recognize the lack of separation provided by this dimension of the GMM, and to disable inference related to it. - FIG. 3B illustrates an example graph of a 3-mixture 2-dimensional GMM trained on features comprised of adaptive noise canceller to blocking matrix ratios or SNRs, similar to FIG. 3A . Again, mixtures are shown by contours of a constant pdf, and the acoustic sources present are desired
source 335, stationary noise 337, and non-stationary noise 339. As opposed to the example shown in FIG. 3A , the adaptive noise canceller to blocking matrix ratio feature, which is intended to capture spatial diversity of sources, has become ambiguous due to, e.g., the physical locations of the acoustic sources. - To estimate the reliability of the GMM in discriminating between specific acoustic sources, the separation between the mixtures representing them is taken into account. Motivated by its well-known interpretation as the expected discrimination information over two hypotheses corresponding to two Gaussian likelihood distributions, the symmetrized Kullback-Leibler (KL) distance is used to quantify this separation. The symmetrized KL distance between mixtures i and j is given by:
- $D(i, j) = \tfrac{1}{2}\,\mathrm{tr}\!\left(\Sigma_j^{-1}\Sigma_i + \Sigma_i^{-1}\Sigma_j - 2I\right) + \tfrac{1}{2}\,(\mu_i - \mu_j)^T\!\left(\Sigma_i^{-1} + \Sigma_j^{-1}\right)\!(\mu_i - \mu_j)$ (Equation 16)
- If the covariance matrices of mixtures i and j are assumed to be similar, a reduced complexity approximation becomes:
- $D(i, j) \approx (\mu_i - \mu_j)^T\,\Sigma^{-1}\,(\mu_i - \mu_j)$ (Equation 17), where $\Sigma$ denotes the shared covariance.
- Having quantified the discriminative power of a GMM with respect to two mixtures, various types of regression may be used to predict GMM reliability. As an example, logistic regression, an example of which is shown below with reference to
Equation 18, is appealing since it naturally outputs predictions within the range [0,1]: - $r_{ij} = \left(1 + e^{-\alpha\,(D(i,j) - \beta)}\right)^{-1}$ (Equation 18)
- where α and β are constants.
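A sketch of the symmetrized KL distance between two Gaussian mixtures and the logistic reliability mapping follows. The α and β values are illustrative placeholders for the tuned constants, and the function names are assumptions:

```python
import numpy as np

def symmetrized_kl(mu_i, cov_i, mu_j, cov_j):
    """Symmetrized KL distance KL(i||j) + KL(j||i) between two Gaussians.

    The log-determinant terms of the two directed divergences cancel,
    leaving a trace term and a Mahalanobis-like mean term.
    """
    d = len(mu_i)
    inv_i, inv_j = np.linalg.inv(cov_i), np.linalg.inv(cov_j)
    diff = np.asarray(mu_i, dtype=float) - np.asarray(mu_j, dtype=float)
    trace_term = 0.5 * (np.trace(inv_j @ cov_i) + np.trace(inv_i @ cov_j) - 2 * d)
    mean_term = 0.5 * diff @ (inv_i + inv_j) @ diff
    return trace_term + mean_term

def gmm_reliability(distance, alpha=1.0, beta=4.0):
    """Logistic mapping of mixture separation to a [0, 1] reliability score."""
    return 1.0 / (1.0 + np.exp(-alpha * (distance - beta)))
```

Identical mixtures give zero distance and hence low reliability; well-separated mixtures give a score near 1, at which point inference over that dimension can be trusted.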
- B. Statistical Modeling of Acoustic Sources in a Back-End Single-Channel Suppression System
- As mentioned above in subsection IV.A, back-end statistical modeling may use a single unifying model for all acoustic sources. This allows all statistical correlation between sources to be exploited during the process. However, in certain embodiments, in order to reduce the complexity required by high-dimension, large mixture-number GMM modeling, modeling is performed with smaller parallel GMMs.
-
FIG. 3C is a block diagram of a back-end single-channel suppression (SCS) component 300 that performs noise suppression of multiple types of interfering sources using statistical modeling that has been decoupled into separate parallel branches in accordance with an embodiment. The benefit of multivariate modeling is the ability to capture statistical correlation between features. Therefore, the branches may be configured to cluster features with high inter-feature correlation. The motivation for such a system is that each of the previously mentioned acoustic sources is expected to display specific correlation patterns, thereby improving separation relative to 1-dimensional modeling. - Back-
end SCS component 300 is configured to suppress multiple types of interfering sources (e.g., stationary noise, non-stationary noise, residual echo, etc.) present in a first signal 340. Back-end SCS component 300 may be configured to receive first signal 340 and a second signal 334 and provide a suppressed signal 344. In accordance with the embodiments described herein, suppressed signal 344 may correspond to suppressed signal 244, as shown in FIG. 2 . First signal 340 may be a suppressed signal provided by a multi-microphone noise reduction (MMNR) component (e.g., MMNR component 114), and second signal 334 may be a noise estimate provided by the MMNR component that is used to obtain first signal 340. Back-end SCS component 300 may comprise an implementation of SCS component 116, as described above in reference to FIGS. 1 and 2 . In accordance with such an embodiment, first signal 340 may correspond to enhanced source signal 240 provided by ANC 220 (as shown in FIG. 2 ), and second signal 334 may correspond to non-desired source signals 234 provided by ABM 216 (as shown in FIG. 2 ). As shown in FIG. 3C , back-end SCS component 300 includes stationary noise estimation component 304, signal-to-stationary noise ratio (SSNR) estimation component 306, SSNR feature extraction component 308, SSNR feature statistical modeling component 310, spatial feature extraction component 312, spatial feature statistical modeling component 314, signal-to-non-stationary noise ratio (SNSNR) estimation component 316, speaker identification (SID) feature extraction component 318, SID speaker model update component 320, uplink (UL) correlation feature extraction component 322, signal-to-residual echo ratio (SRER) estimation component 326, fullband modulation feature extraction component 328, fullband modulation statistical modeling component 330, multi-noise source gain component 332 and gain application component 346. - Stationary
noise estimation component 304, SSNR estimation component 306, SSNR feature extraction component 308 and SSNR feature statistical modeling component 310 may assist in obtaining characteristics associated with stationary noise included in first signal 340, and therefore, may be referred to as being included in a non-spatial (or stationary noise) branch of SCS component 300. Spatial feature extraction component 312, spatial feature statistical modeling component 314, SID feature extraction component 318, SID speaker model update component 320 and SNSNR estimation component 316 may assist in obtaining characteristics associated with non-stationary noise included in first signal 340, and therefore, may be referred to as being included in a spatial (or non-stationary noise) branch of SCS component 300. UL correlation feature extraction component 322, spatial feature statistical modeling component 314 and SRER estimation component 326 may assist in obtaining characteristics associated with residual echo included in first signal 340, and therefore, may be referred to as being included in a residual echo branch of SCS component 300.
- Stationary
noise estimation component 304 may be configured to receive first signal 340 and provide a stationary noise estimate 301 (e.g., an estimate of magnitude, power, signal level, etc.) of stationary noise present in first signal 340 on a per-frame basis and/or per-frequency bin basis. In accordance with an embodiment, stationary noise estimation component 304 may determine stationary noise estimate 301 by estimating statistics of an additive noise signal included in first signal 340 during non-desired source segments. In accordance with such an embodiment, stationary noise estimation component 304 may include functionality that is capable of classifying segments of first signal 340 as desired source segments or non-desired source segments. Alternatively, stationary noise estimation component 304 may be connected to another entity that is capable of performing such a function. Of course, numerous other methods may be used to determine stationary noise estimate 301. Stationary noise estimate 301 is provided to SSNR estimation component 306 and SSNR feature extraction component 308.
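As an illustration of estimating noise statistics during non-desired source segments, a minimal recursive tracker is sketched below. The smoothing constant, function name and hold-during-speech behavior are assumptions for illustration, not details taken from the application:

```python
import numpy as np

def update_noise_psd(noise_psd, frame_psd, desired_active, alpha=0.95):
    """Recursively track the stationary noise PSD per frequency bin.

    The estimate is only adapted on frames classified as non-desired-source
    segments; during desired-source activity it is held (the noise is
    assumed stationary over such segments).
    """
    if desired_active:
        return noise_psd
    return alpha * noise_psd + (1.0 - alpha) * frame_psd
```

Feeding a run of noise-only frames lets the estimate converge to the noise level, while a desired-source frame leaves it untouched.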
SSNR estimation component 306 may be configured to receivefirst signal 340 andstationary noise estimate 301 and determine a ratio betweenfirst signal 340 andstationary noise estimate 301 to provide anSSNR estimate 303 on a per-frame basis and/or per-frequency bin basis. In accordance with an embodiment,SSNR estimate 303 may be equal to a measured characteristic (e.g., magnitude, power, signal level, etc.) offirst signal 340 divided bystationary noise estimate 301.SSNR estimate 303 is provided to SSNRfeature extraction component 308 and multi-noisesource gain component 332. As will be described below,SSNR estimate 303 may be used to determine anoptimal gain 325 that is used to suppress noise fromfirst signal 340. - SSNR
feature extraction component 308 may be configured to extract one or more SNR feature(s) from first signal 340 based on stationary noise estimate 301 on a per-frame basis and/or per-frequency bin basis to obtain an SNR feature vector 305. In accordance with an embodiment, to form SNR feature(s), a preliminary (rough) estimate of the desired source power spectral density may be obtained. The estimate of the desired source power spectral density may be obtained through conventional methods or according to the methods described in the aforementioned U.S. patent application Ser. No. 12/897,548, the entirety of which has been incorporated by reference as if fully set forth herein. In accordance with another embodiment, the estimate of the SNR feature(s) is equivalent to the a priori SNR that is estimated simply as the a posteriori SNR minus one (assuming statistical independence between interfering and desired sources). In accordance with yet another embodiment, the various SNR feature forms could include various degrees of smoothing of the power across frequency prior to forming the SNR feature(s). - In accordance with an embodiment, before extracting features from
first signal 340, SSNR feature extraction component 308 may be configured to apply preliminary single-channel noise suppression to first signal 340. For example, SSNR feature extraction component 308 may suppress single-channel noise from first signal 340 based on SSNR estimate 303. SSNR feature extraction component 308 may also be configured to down-sample the preliminary noise-suppressed first signal and/or stationary noise estimate 301 to reduce the sample sizes thereof, thereby reducing computational complexity. SNR feature vector 305 is provided to SSNR feature statistical modeling component 310. - SSNR feature
statistical modeling component 310 may be configured to model feature vector 305 on a per-frame basis and/or per-frequency bin basis. In accordance with an embodiment, SSNR feature statistical modeling component 310 models SNR feature vector 305 using GMM modeling. By using GMM modeling, a probability 307 that a particular frame of first signal 340 is from a desired source (e.g., speech) and/or a probability that the particular frame of first signal 340 is from a non-desired source (e.g., an interfering source, such as stationary background noise) may be determined for each frame and/or frequency bin. - For example, stationary noise can be separated from the desired source by exploiting the time and frequency separation of the sources. The restriction to stationary sources arises from the fact that the interfering component is estimated during desired source absence and then assumed stationary, hence maintaining its power spectral density during desired source presence. This allows for estimation of the (stationary) interfering source power spectral density from which the SNR feature(s) can then be formed. It reflects the way traditional single-channel noise suppression works, and the interfering source power spectral density can be estimated with such traditional methods. The (stationary) interfering source presence can then be modelled with the GMM-based
SNR feature vector 305, which comprises various forms of SNRs. - In accordance with an embodiment, two Gaussian mixtures are used to model SNR feature vector 305 (i.e., a 2-mixture GMM), and the Gaussian mixture with the lowest (average in case of multiple SNR features) mean parameter (lowest SNR) corresponds to the interfering (stationary) source, and the Gaussian mixture with the highest (average) mean parameter corresponds to the desired source. With the inference in place, i.e., the association of Gaussian mixtures with sources, it is possible to calculate the probabilities of desired source and probability of interfering (stationary) source in
accordance with Equations 13, 14 and/or 15, as described above in subsections IV.A.5.2 and IV.A.5.3. -
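The 2-mixture modeling and mean-based inference described above can be sketched as follows. Batch EM is used here purely for brevity, whereas the application describes online parameter estimation; all names, initializations and synthetic data are illustrative assumptions:

```python
import numpy as np

def fit_two_mixture_gmm(x, iters=50):
    """Fit a 1-D 2-mixture GMM with plain batch EM."""
    means = np.array([x.min(), x.max()], dtype=float)
    variances = np.array([x.var(), x.var()], dtype=float) + 1e-6
    weights = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: responsibilities via Bayes' rule
        lik = weights * np.exp(-0.5 * (x[:, None] - means) ** 2 / variances) \
              / np.sqrt(2.0 * np.pi * variances)
        resp = lik / lik.sum(axis=1, keepdims=True)
        # M-step: update weights, means and variances
        nk = resp.sum(axis=0)
        weights = nk / len(x)
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk + 1e-6
    return weights, means, variances

# Synthetic log-SNR feature: low-mean interfering cluster, high-mean desired cluster.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-8.0, 2.0, 500), rng.normal(12.0, 2.0, 500)])
w, mu, var = fit_two_mixture_gmm(x)
desired_ix = int(np.argmax(mu))  # inference: highest mean -> desired source
```

The inference step at the end is the association described in the text: the mixture with the lowest mean is taken as the interfering (stationary) source and the one with the highest mean as the desired source.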
FIG. 3D shows example diagnostic plots of 1-dimensional 2-mixture GMM parameters during online parameter estimation of GMM modeling of the SNR feature vector 305. In FIG. 3D , initial segments of a signal (e.g., first signal 340) that includes speech and pub noise are depicted, during which parameters are converging to the acoustic environment. The left column corresponds to the interfering source mixture corresponding to the pub noise, whereas the right column corresponds to the desired source mixture corresponding to the speech. - Unlike subsection IV.B.2 (which is described below), the SNR feature does not require multiple microphones (or channels), and it applies equally to single-microphone (channel) or multi-microphone (multi-channel) applications.
- As an example, only a single feature is used (per frequency bin in the frequency domain), with a mild smoothing. Let the preliminary estimate of desired source power spectral density after pre-noise suppression be:
-
- and the interfering source power spectral density be:
-
- where k is the frequency index, m is the frame index, and Nfft is the FFT size, e.g., 256. Denoting the desired source and interfering source power spectral densities above by $S_{DS}(k,m)$ and $S_{IS}(k,m)$, the SNR associated with a frequency index is then calculated as:
- $\mathrm{SNR}_m(k) = \dfrac{\sum_{j=k-K}^{k+K} S_{DS}(j,m)}{\sum_{j=k-K}^{k+K} S_{IS}(j,m)}$ (Equation 21)
- where K determines the smoothing range, e.g., 2. Equation 21 represents a rectangular window, but, in certain embodiments, an alternate window may be used instead. The SNR forms the single feature (i.e., SNR feature vector 305) that is modelled independently for every frequency index k in order to estimate the probability of desired source, PDS,m(k) (i.e., probability 307), versus the probability of interfering (stationary) source, PIS,m(k), for every frequency index.
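The rectangularly smoothed per-bin SNR of Equation 21 can be sketched as follows. Clipping the window at the spectrum edges and the small denominator floor are implementation assumptions:

```python
import numpy as np

def smoothed_snr(psd_ds, psd_is, K=2):
    """Per-bin SNR feature: ratio of PSDs summed over a rectangular
    window of 2K+1 neighbouring frequency bins (edges are clipped)."""
    nbins = len(psd_ds)
    snr = np.empty(nbins)
    for k in range(nbins):
        lo, hi = max(0, k - K), min(nbins, k + K + 1)
        snr[k] = psd_ds[lo:hi].sum() / max(psd_is[lo:hi].sum(), 1e-12)
    return snr
```

With Nfft = 256 there would be 129 bins, each producing one smoothed SNR value to be modelled independently.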
- An example of a waveform of an input signal that includes speech and car noise (e.g., first signal 340), time-frequency plots of the input signal, the SNR feature (i.e., SNR feature vector 305), and the resulting PDS,m(k) (i.e., probability 307) are shown in FIG. 3E . For example, as shown in
FIG. 3E , plot 347 represents a time domain input waveform representing first signal 340 (which includes both speech and car noise), plot 349 represents a time-frequency plot of first signal 340, plot 351 represents SNR feature vector 305, which is being modelled using GMM modeling, and plot 353 represents a probability of desired source (i.e., probability 307) with respect to car noise obtained using GMM modeling. - In an embodiment where
first signal 340 is down-sampled by SSNR feature extraction component 308, SSNR feature statistical modeling component 310 up-samples probability 307. Probability 307 is provided to multi-noise source gain component 332. As will be described below, probability 307 may be used to determine optimal gain 325, which is used to suppress stationary noise (and/or other types of interfering sources) present in first signal 340 on a per-frame basis and/or per-frequency bin basis.
- Spatial
feature extraction component 312 may be configured to extract spatial feature(s) from first signal 340 and second signal 334 on a per-frame basis and/or per-frequency bin basis. The feature(s) may be a ratio 309 between first signal 340 and second signal 334. In accordance with an embodiment where back-end SCS component 300 comprises an implementation of SCS component 116, ratio 309 corresponds to a ratio between enhanced source signal 240 provided by ANC 220 and non-desired source signals 234 provided by ABM 216. By forming a ratio between the output of ANC 220 (i.e., enhanced source signal 240) and the output of ABM 216 (i.e., non-desired source signals 234), both obtained by means of the linear spatial processing of the front-end, a feature indicating the presence of desired source vs. interfering source (from a spatial perspective) is obtained (i.e., an ANC 220 to ABM 216 ratio, or simply Anc2AbmR). - Unlike
SNR feature vector 305 of subsection IV.B.1, ratio 309 separates non-stationary interfering sources from a desired source. Hence, it is used for non-stationary noise suppression. Ratio 309 can be calculated on a frequency bin or range basis in order to provide frequency resolution, and smoothing to a varying degree can be carried out in order to achieve a multi-dimensional feature vector that captures both local strong events as well as broader weaker events. Ratio 309 is greater for desired source presence and smaller for interfering source presence. - The formation of
ratio 309 may require at least two microphones and the presence of a generalized sidelobe canceller (GSC)-like front-end spatial processing stage. However, a similar "spatial" ratio can be formed with the use of many other front-ends, and in some applications a front-end is not even necessary. An example of that is the case where the position of the desired source relative to the two microphones provides a significant level (possibly frequency dependent) difference on the two microphones, while all interfering sources can be assumed to be far-field, and hence provide approximately similar levels on the two microphones. Such a scenario is present when a communication device 100 as shown in FIG. 1 is handheld next to the face as in conventional telephony use, with one microphone at the bottom of communication device 100 (i.e., microphone 106 1) near the mouth, and another microphone at the upper back part of communication device 100 (i.e., microphone 106 N). While interfering sources of environmental ambient acoustic noise will have approximately similar levels on the two microphones, the desired source (e.g., speech of the user) will be on the order of 10 dB higher at the bottom microphone than at the upper back microphone. In this case, ratio 309 can be formed directly from the two microphone signals. - In accordance with an embodiment, before obtaining
ratio 309, spatial feature extraction component 312 applies preliminary single-channel noise suppression to first signal 340. For example, spatial feature extraction component 312 may suppress single-channel noise present in first signal 340 based on SSNR estimate 303. This suppression should not be too strong, as it would then render this modeling very similar to the stationary SNR modeling described above in subsection IV.B.1. However, a mild suppression will aid the convergence of the parameters of the online GMM modeling (as described below), preventing divergence of the modeling by guiding it in a proper direction. An example value of preliminary target suppression is 6 dB. - Spatial
feature extraction component 312 may also be configured to down-sample the preliminary noise-suppressed first signal and/or second signal 334 to reduce the sample sizes thereof, thereby reducing computational complexity. Ratio 309 is provided to spatial feature statistical modeling component 314. - An example of obtaining
ratio 309 is described with respect to Equations 22-24 below. Let the power spectral density of the preliminary noise suppressed output of ANC 220 (i.e., first signal 340) be: -
- and the power spectral density of the output of ABM 216 (i.e., second signal 334) be
-
- where k is the frequency index, m is the frame index, and Nfft is the FFT size, e.g. 256. The Anc2AbmR (i.e., ratio 309) associated with a frequency index is then calculated as:
-
- where K determines the smoothing range, e.g. 2. Equation 24 represents a rectangular window, but similar to subsection IV.B.1, in certain embodiments, an alternate window may be used instead. The Anc2AbmR may form the single feature that is modelled independently for every frequency index k in order to estimate the probability of desired source, PDS,m(k), versus the probability of interfering (spatial) source, PIS,m(k), for every frequency index (as described below with reference to spatial feature statistical modeling component 314).
- SID
feature extraction component 318 may be configured to extract features from first signal 340 and provide a classification 311 (e.g., a soft or hard classification) of first signal 340 based on the extracted features on a per-frame basis and/or per-frequency bin basis. Such features may include, for example, reflection coefficients (RCs), log-area ratios (LARs), arcsin of RCs, line spectrum pair (LSP) frequencies, and the linear prediction (LP) cepstrum. -
Classification 311 may indicate whether a particular frame and/or frequency bin of first signal 340 is associated with a target speaker. For example, classification 311 may be a probability as to whether a particular frame and/or frequency bin is associated with a target speaker or a non-desired source (i.e., the supplemental full-band information described above in subsection IV.A.5.3), where the higher the probability, the more likely that the particular frame and/or frequency bin is associated with a target speaker. Back-end SCS component 300 may include a speaker identification component (or may be coupled to a speaker identification component) that assists in determining whether a particular frame and/or frequency bin of first signal 340 is associated with a target speaker. For example, the speaker identification component may include GMM-based speaker models. The feature(s) extracted from first signal 340 may be compared to these speaker models to determine classification 311. Further details concerning SID-assisted audio processing algorithm(s) may be found in commonly-owned, co-pending U.S. patent application Ser. No. 13/965,661, entitled "Speaker-Identification-Assisted Speech Processing Systems and Methods" and filed on Aug. 13, 2013, U.S. patent application Ser. No. 14/041,464, entitled "Speaker-Identification-Assisted Downlink Speech Processing Systems and Methods" and filed on Sep. 30, 2013, and U.S. patent application Ser. No. 14/069,124, entitled "Speaker-Identification-Assisted Uplink Speech Processing Systems and Methods" and filed on Oct. 31, 2013, the entireties of which are incorporated by reference as if fully set forth herein. Classification 311 is provided to spatial feature statistical modeling component 314. - Spatial feature
statistical modeling component 314 may be configured to determine and provide a probability 313 that a particular feature of a particular frame and/or frequency bin of first signal 340 is from a desired source and a probability 315 that a particular feature of a particular frame and/or frequency bin of first signal 340 is from a non-desired source (e.g., non-stationary noise). Probabilities 313 and 315 may be determined based on ratio 309. Probability 313 and/or probability 315 may also be based on classification 311. Ratio 309 may be modelled using a GMM. The Gaussian distributions of the GMM can be associated with interfering non-stationary sources and the desired source according to the GMM mean parameters based on inference, thereby allowing calculation of probability 315 and probability 313 from ratio 309 and the parameters of respective GMMs associated with interfering non-stationary sources and the desired source.
- To determine which mixture corresponds to the desired source and which mixture corresponds to the non-desired source, spatial features
statistical modeling component 314 may monitor the mean associated with each mixture. The mixture having a relatively higher mean equates to the mixture corresponding to a desired source, and the mixture having a relatively lower mean equates to the mixture corresponding to a non-desired source. -
FIG. 3F shows example diagnostic plots of 1-dimensional 2-mixture GMM parameters during online parameter estimation of the GMM modeling of the Anc2AbmR (i.e., ratio 309). In FIG. 3F , initial segments of a signal (e.g., first signal 340) that includes speech and pub noise are depicted, during which parameters are converging to the acoustic environment. The left column corresponds to the interfering source mixture corresponding to the pub noise, whereas the right column corresponds to the desired source mixture. - In accordance with an embodiment,
probabilities 313 and 315 may be determined based on ratio 309. For example, probability 313 may indicate that a particular feature of a particular frame and/or frequency bin of first signal 340 is from a desired source if the ratio is relatively high, and probability 315 may indicate that a particular feature of a particular frame and/or frequency bin of first signal 340 is from a non-desired source if the ratio is relatively low. In accordance with an embodiment, the ratios may be determined for a plurality of ranges for smoothing across frequency. For example, a wideband smoothed ratio and a narrowband smoothed ratio may be determined. In accordance with such an embodiment, probabilities 313 and 315 may be based on both smoothed ratios. Probabilities 313 and 315 are provided to SNSNR estimation component 316.
FIG. 3G . This is a type of interfering source whereSNR feature vector 305 of subsection IV.B.1 traditionally may not provide good separation. - As shown in
FIG. 3G , plot 367 represents a time domain input waveform representing first signal 340, plot 369 represents a time-frequency plot of first signal 340, plot 371 represents an output of ABM 216 (i.e., second signal 334), plot 373 represents the Anc2AbmR (i.e., ratio 309) being modelled using GMM modeling, and plot 375 represents a probability of desired source (i.e., probability 313) with respect to babble noise obtained using GMM modeling. As can be seen from FIG. 3G , the Anc2AbmR feature (i.e., ratio 309) provides excellent separation despite the interfering source being non-stationary. - It could be speculated that
SNR feature vector 305 of subsection IV.B.1 may be obsolete given the Anc2AbmR feature. However, in practice, there are cases where the modeling of the Anc2AbmR is ambiguous. This can be due to slower convergence of the Anc2AbmR modeling or due to the microphone signals of the acoustic scene not providing sufficient spatial separation. Hence, the SNR feature vector and Anc2AbmR features complement each other, although there is also some overlap. - Spatial feature
statistical modeling component 314 may also be configured to determine and provide a measure of spatial ambiguity 331 on a per-frame basis and/or a per-frequency bin basis. Measure of spatial ambiguity 331 may be indicative of how well spatial feature statistical modeling component 314 is able to distinguish a desired source from non-stationary noise in the acoustic scene. Measure of spatial ambiguity 331 may be determined based on the means for each of the mixtures of the GMM modelled by spatial feature statistical modeling component 314. In accordance with such an embodiment, if the mixtures of the GMM are not easily separable (i.e., the means of the mixtures are relatively close to one another such that a particular mixture cannot be associated with a desired source or a non-desired source (e.g., non-stationary noise)), the value of measure of spatial ambiguity 331 may be set such that it is indicative of spatial feature statistical modeling component 314 being in a spatially ambiguous state. In contrast, if the mixtures of the GMM are easily separable (i.e., the mean of one mixture is relatively high, and the mean of the other mixture is relatively low), the value of measure of spatial ambiguity 331 may be set such that it is indicative of spatial feature statistical modeling component 314 being in a spatially unambiguous state, i.e., in a spatially confident state. - In accordance with an embodiment, measure of
spatial ambiguity 331 is determined in accordance with Equation 25, which is shown below: -
$\text{Measure of Spatial Ambiguity} = \left(1 + e^{\alpha(d-\beta)}\right)^{-1}$, (Equation 25)
- As will be described below, in response to determining that spatial feature
statistical modeling component 314 is in a spatially ambiguous state, non-stationary noise suppression may be soft-disabled. - In accordance with an embodiment, in response to determining that spatial feature
statistical modeling component 314 is in a spatially ambiguous state, spatial feature statistical modeling component 314 provides a soft-disable output 342, which is provided to MMNR component 114 (as shown in FIG. 2 ). Soft-disable output 342 may cause one or more components and/or sub-components of MMNR component 114 to be disabled. In accordance with such an embodiment, soft-disable output 342 may correspond to soft-disable control signal 242, as shown in FIG. 2 . - Spatial feature
statistical modeling component 314 may further provide probability 313 to SID speaker model update component 320. SID speaker model update component 320 may be configured to update the GMM-based speaker model(s) based on probability 313 and provide updated GMM-based speaker model(s) 333 to SID feature extraction component 318. SID feature extraction component 318 may compare feature(s) extracted from subsequent frame(s) of first signal 340 to updated GMM-based speaker model(s) 333 to provide classification 311 for the subsequent frame(s). - In accordance with an embodiment, SID speaker
model update component 320 updates the GMM-based speaker model(s) based on probability 313 when back-end SCS component 300 operates in handset mode. When operating in speakerphone mode, updates to the GMM-based speaker model(s) may be controlled by information available from the acoustic scene analysis in the front end. In accordance with such an embodiment, back-end SCS component 300 receives a mode enable signal 336 from a mode detector (e.g., automatic mode detector 222, as shown in FIG. 2) that causes SCS system 300 to switch between single-user and conference speakerphone modes. Accordingly, mode enable signal 336 may correspond to mode enable signal 236, as shown in FIG. 2. -
SNSNR estimation component 316 may determine an SNSNR estimate 317 based on probability 313 and probability 315 on a per-frame basis and/or per-frequency bin basis. For example, assuming that x=xDS+xIS, where x corresponds to first signal 340, xDS corresponds to the underlying desired source in x, and xIS corresponds to an interfering source (e.g., non-stationary noise) in x, SNSNR estimate 317 may be determined in accordance with Equation 26: -
- where y is a particular extracted feature and P(y|HDS) corresponds to probability 313 (i.e., the likelihood of feature y given the desired source hypothesis) and P(y|HIS) corresponds to probability 315 (i.e., the likelihood of feature y given the interfering source hypothesis).
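The two hypothesis likelihoods that feed this estimate can be sketched as follows (a minimal illustration assuming one-dimensional Gaussian mixtures and equal hypothesis priors; the mixture parameters and the combination rule shown are assumptions, not specified by the patent):

```python
import math

def gauss_pdf(y, mean, var):
    """Density of a one-dimensional Gaussian at y."""
    return math.exp(-(y - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def mixture_likelihood(y, mixtures):
    """Likelihood of feature y under a GMM given as (weight, mean, variance) triples."""
    return sum(w * gauss_pdf(y, m, v) for w, m, v in mixtures)

def p_desired_source(y, desired_gmm, interferer_gmm):
    """Probability-like score that feature y stems from the desired source,
    combining P(y|H_DS) and P(y|H_IS). Equal hypothesis priors are assumed
    here, which is an illustrative choice."""
    p_ds = mixture_likelihood(y, desired_gmm)
    p_is = mixture_likelihood(y, interferer_gmm)
    return p_ds / (p_ds + p_is)
```

With a desired-source mixture centered at a high feature value and an interferer mixture centered near zero, a high-valued feature yields a score near 1 and a low-valued feature a score near 0.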
SNSNR estimate 317 is provided to multi-noise source gain component 332. As will be described below, SNSNR estimate 317 may be used to determine optimal gain 325, which is used to suppress non-stationary noise (and/or other types of interfering sources) present in first signal 340. - 3. Residual Echo Suppression Branch
- Residual echo suppression is used to suppress any acoustic echo remaining after linear acoustic echo cancellation. This need is typically greatest when a device is operated in speakerphone mode, i.e., when the device is not handheld in a typical telephony handset mode of operation. In speakerphone mode, the far-end signal (also referred to as the downlink signal) is played back on a loudspeaker (e.g., loudspeaker 108, as shown in FIG. 1) on a device (e.g., communication device 100, as shown in FIG. 1) at a level that, seen from the perspective of the microphone(s) (e.g., microphones 106 1-N, as shown in FIG. 1), may be louder than the near-end signal (also referred to as the uplink signal), including the desired source. This makes acoustic echo cancellation a difficult problem, often with significant residual echo that must be suppressed. Traditionally, this is carried out by means of estimating the ERL (Echo Return Loss) of the acoustic channel from the downlink to the uplink, and the ERLE (Echo Return Loss Enhancement) of the linear acoustic echo canceller. With knowledge of the downlink signal, the ERL, and the ERLE, an estimate of the residual echo level can be calculated. Such an estimate can be used to estimate an SRER feature much like SNR feature vector 305 is estimated in subsection IV.B.1. In accordance with an embodiment, non-linear residual echo is identified by measuring the normalized correlation in the uplink signal, after linear echo cancellation, at the pitch period of the downlink signal. Moreover, this can be measured as a function of frequency in order to exploit spectral separation between the residual echo and the desired source. - The normalized correlation of the uplink signal at the pitch period of the downlink signal may be able to identify residual echo components that are harmonics of the downlink pitch period, and may not be able to identify any unvoiced residual echo components. This is, however, acceptable, as non-linear residual echo typically consists of non-linear components triggered by the high-energy components of the downlink signal (i.e., voiced speech).
Moreover, strong residual echo is often a result of strong non-linearities being excited by voiced components, and typically manifests itself as pitch harmonics of the downlink signal being repeated up through the spectrum, producing pitch harmonics where the downlink signal had no or only weak harmonics.
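The traditional ERL/ERLE-based residual echo level estimate mentioned above reduces to simple level arithmetic when the quantities are expressed in dB; a minimal sketch (the function name and dB convention are illustrative, not from the patent):

```python
def residual_echo_level_db(downlink_level_db, erl_db, erle_db):
    """Estimate the residual echo level after linear echo cancellation.

    The ERL is the loss over the acoustic path from loudspeaker (downlink)
    to microphone (uplink), and the ERLE is the additional loss provided
    by the linear acoustic echo canceller, so the echo level that survives
    is the downlink level minus both losses (all values in dB).
    """
    return downlink_level_db - erl_db - erle_db
```

For example, a 0 dB downlink level with 10 dB ERL and 25 dB ERLE leaves residual echo at roughly −35 dB, which can then be compared against the near-end level to form an SRER-like feature.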
- Accordingly, in embodiments, UL correlation
feature extraction component 322 may be configured to determine an uplink correlation at a downlink pitch period. For example, UL correlation feature extraction component 322 may determine a measure of correlation 319 in an FDAEC output signal (e.g., FDAEC output signal 224, as shown in FIG. 2) at the pitch period of a downlink signal (e.g., downlink signal 202, as shown in FIG. 2) as a function of frequency, where a relatively high correlation is an indication of residual echo presence in first signal 340 and a relatively low correlation is an indication of no residual echo presence in first signal 340. - The following outlines and provides an example of the feature calculation and modeling of the normalized uplink correlation at the downlink pitch period (i.e., measure of correlation 319). Let the (full-band) downlink pitch period be denoted LDL, and let the frequency-domain output of the linear acoustic echo cancellation be:
Y_AEC,m(k) = Σ_{n=0}^{Nfft−1} y_AEC,m(n)·e^(−j2πkn/Nfft), k = 0, …, Nfft−1,  Equation 27
- where k is the frequency index, m is the frame index, and Nfft is the FFT size, e.g. 256. The inverse Fourier transform of the power spectrum is the autocorrelation, and hence the correlation at a given lag, L, can be found as the inverse Fourier transform of |Y_AEC,m(k)|^2 at lag L:
R_m(L) = (1/Nfft)·Σ_{k=0}^{Nfft−1} |Y_AEC,m(k)|^2·e^(j2πkL/Nfft),  Equation 28
- From here the normalized correlation at the downlink pitch period is calculated as:
C_N,UL(LDL) = R_m(LDL) / R_m(0),  Equation 29
- This is a full-band measure of the normalized correlation, and as outlined above it is desirable to characterize the presence of residual echo as a function of frequency. Hence, the normalized full-band correlation is generalized in the spirit of the above formula to provide frequency resolution, and the frequency dependent normalized uplink correlation at the downlink pitch period is calculated as:
C_N,UL(k, LDL) = [Σ_{l=k−K}^{k+K} Re{|Y_AEC,m(l)|^2·e^(j2πl·LDL/Nfft)}] / [Σ_{l=k−K}^{k+K} |Y_AEC,m(l)|^2],  Equation 30
- where K determines a window for averaging, e.g. 10. Equation 30 represents a rectangular window, but, in certain embodiments, any alternate suitable window can be used. The expression is simplified by considering only the lower half of the symmetric power spectrum. The imaginary contributions of the lower and upper halves of the full sum cancel, and hence only the real part is summed when only the lower half is considered. It is noted that for K=0 the frequency dependent normalized correlation becomes trivial:
C_N,UL(k, LDL) = cos(2πk·LDL/Nfft),  Equation 31
- and hence some averaging, K≠0, is necessary.
- The averaging over a window is a tradeoff against the frequency resolution of C_N,UL(k, LDL) (i.e., measure of correlation 319). A good compromise can be K=10, as mentioned above, but K can also be made dependent on frequency, e.g., larger for higher frequencies and smaller for lower frequencies.
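The full-band normalized correlation at a given lag described above can be sketched directly with an FFT (a minimal illustration; the frame length, FFT size, and synthetic test signal are illustrative):

```python
import numpy as np

def normalized_corr_at_lag(frame, lag, nfft=256):
    """Full-band normalized correlation of an echo-cancelled uplink frame
    at a given lag: the inverse FFT of the power spectrum |Y(k)|^2 is the
    (circular) autocorrelation, which is then normalized by its lag-0 value."""
    spec = np.fft.fft(frame, n=nfft)
    autocorr = np.fft.ifft(np.abs(spec) ** 2).real
    return autocorr[lag] / autocorr[0]
```

A frame that is periodic at the downlink pitch period (e.g., a sinusoid with period 16 samples, evaluated at lag 16) yields a value near 1, flagging likely residual echo, whereas an uncorrelated frame yields values near 0.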
- A generalized version of the previously described normalized uplink correlation at the downlink pitch period can be derived to exploit information contained in the autocorrelation function of the uplink signal, at multiples of the downlink pitch period. This measure can be expressed as:
-
- where g(n) can itself be expressed as the element-wise product of functions:
-
g(n) = w(n)·d(n),  Equation 33 - Here, w(n) represents some smoothing window, which can be used to control the weighting of the various downlink pitch period multiples. d(n) is a series of delta functions at pitch period multiples, as defined below:
-
d(n) = Σ_{m=1}^{M} δ(n − m·LDL),  Equation 34 - and M denotes the number of pitch multiples contained within the sampled autocorrelation function and is dependent on LDL and Nfft. Note that the generalized measure can be expressed in terms of a convolution of functions:
-
- Then, using the convolution theorem associated with the Fourier transform, the generalized measure can be expressed in the frequency domain as:
-
- where G(k), W(k), and D(k) are the Fourier transforms of g(n), w(n), and d(n), respectively. Whereas W(k) depends on the unspecified windowing function w(n), D(k) can be explicitly expressed by applying the Fourier transform to d(n), as shown below:
-
- where K denotes the number of fundamental frequency multiples contained within Nfft. The approximation in Equation 37 is a result of the fact that downlink pitch periods are generally not perfect factors of the FFT length. However, the expression serves as a relatively close approximation, particularly for large M, and the approximation is exact when the downlink pitch period is a factor of the FFT length.
- From Equation 37, it can be observed that the generalized normalized uplink correlation at the downlink pitch period is obtained as the summed element-wise product of the uplink spectrum and a masking function. The masking function is constructed as the convolution of a series of deltas located at multiples of the fundamental frequency of the downlink signal, and a smoothing window which spreads the effect of the masking function beyond exact multiples of the fundamental frequency.
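The construction described above — a delta train at multiples of the downlink fundamental frequency, spread by a smoothing window — can be sketched as follows (the use of a frequency-domain window argument is an illustrative choice; the patent leaves w(n) unspecified):

```python
import numpy as np

def masking_function(pitch_period, nfft, freq_window):
    """Build a frequency-domain masking function: deltas at multiples of
    the downlink fundamental frequency (nfft / pitch_period bins apart),
    spread by circular convolution with a smoothing window supplied in
    the frequency domain."""
    deltas = np.zeros(nfft)
    f0_bins = nfft / pitch_period           # fundamental frequency in bins
    m = 1
    while m * f0_bins < nfft:
        deltas[int(round(m * f0_bins))] = 1.0
        m += 1
    # circular convolution implemented via the FFT convolution theorem
    return np.fft.ifft(np.fft.fft(deltas) * np.fft.fft(freq_window, nfft)).real
```

With LDL = 10 and Nfft = 160, as in FIG. 3H, the deltas land every 16 bins; convolving with a wider window spreads each peak over neighbouring bins.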
- This relationship can be observed in
FIG. 3H, where example masking functions are plotted for three different windowing functions, w(n). As further shown in FIG. 3H, the downlink pitch period LDL is 10, and the FFT length Nfft is 160. - In accordance with an embodiment, UL correlation
feature extraction component 322 may receive residual echo information 338 from the front end that includes measure of correlation 319, in which case UL correlation feature extraction component 322 extracts measure of correlation 319 from residual echo information 338. In accordance with another embodiment, residual echo information 338 may include the FDAEC output signal and the downlink signal (or the pitch period thereof), and UL correlation feature extraction component 322 determines the measure of correlation in the FDAEC output signal at the pitch period of the downlink signal as a function of frequency. The correlation at the downlink pitch period of the FDAEC output signal may be calculated as a normalized correlation of the FDAEC output signal at a lag corresponding to the downlink pitch period, providing a measure of correlation that is bounded between 0 and 1. In accordance with either embodiment, UL correlation feature extraction component 322 provides measure of correlation 319 to spatial feature statistical modeling component 314. - In an embodiment where back-end SCS component 300 comprises an implementation of SCS component 116, residual echo information 338 corresponds to residual echo information 238. - Spatial feature
statistical modeling component 314 may be configured to determine and provide a probability 321 that a particular frame is from a non-desired source (e.g., residual echo) on a per-frame basis and/or per-frequency bin basis based on measure of correlation 319. For example, the GMM being modelled by spatial feature statistical modeling component 314 may also include a mixture that corresponds to residual echo. The mixture may be adapted based on measure of correlation 319. Probability 321 may be relatively higher if measure of correlation 319 indicates that the FDAEC output signal has high correlation at the pitch period of the downlink signal, and probability 321 may be relatively lower if measure of correlation 319 indicates that the FDAEC output signal has low correlation at the pitch period of the downlink signal. Probability 321 is provided to SRER estimation component 326. -
SRER estimation component 326 may be configured to determine an SRER estimate 323 based on probability 313 and probability 321. SRER estimate 323 may be determined in accordance with Equation 26 provided above, where xIS corresponds to non-stationary noise or residual echo included in x, P(y|HDS) corresponds to probability 313 (i.e., the likelihood of feature y given the desired source hypothesis), and P(y|HIS) corresponds to probability 321 (i.e., the likelihood of feature y given the non-stationary noise or residual echo hypothesis). SRER estimate 323 is provided to multi-noise source gain component 332. As will be described below, SRER estimate 323 may be used to determine optimal gain 325, which is used to suppress residual echo (and/or other types of interfering sources) present in first signal 340. - The two measures, the SRER estimate (based on the downlink and traditional ERL and ERLE estimates, and not on measure of correlation 319 as described above) and measure of correlation 319, are complementary. Thus, in accordance with an embodiment, it may be advantageous to use a multi-variate GMM with a feature vector including both measures. While measure of correlation 319 will capture non-linear residual echo well, the SRER estimate will capture linear residual echo. Additionally, as also described above, the modeling can be carried out on a frequency basis in order to exploit frequency separation between desired source and residual echo. - In accordance with an embodiment in a multi-microphone system, where the loudspeaker in speakerphone mode is in near proximity to one microphone, a power or magnitude spectrum ratio feature is formed between a microphone far from the loudspeaker and the microphone close to the loudspeaker. This naturally occurs on a cellular handset in speakerphone mode where the loudspeaker is at the bottom of the phone, one microphone is at the bottom of the phone, and a second microphone is at the top of the phone. The ratio can be formed downstream of acoustic echo cancellation so that only the presence of residual echo is captured by the feature. This can be combined and modelled jointly with the Anc2AbmR (i.e., ratio 309) because the output of ABM 216 (i.e., second signal 334) originates from the microphone relatively close to the loudspeaker, less the desired source, and the output of ANC 220 (i.e., first signal 340) originates from the microphone relatively far from the loudspeaker, less spatial interfering sources.
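The per-bin spectrum ratio between the far and close microphones can be sketched as follows (a minimal illustration; the function name and the small regularization constant are assumptions, not from the patent):

```python
import numpy as np

def spectrum_ratio(far_mic_frame, close_mic_frame, nfft=256, eps=1e-12):
    """Per-frequency-bin magnitude spectrum ratio between a microphone far
    from the loudspeaker and one close to it, computed downstream of echo
    cancellation. Residual echo pushes the ratio low, since it is strongest
    in the close microphone; the desired talker pushes it higher."""
    far = np.abs(np.fft.rfft(far_mic_frame, n=nfft))
    close = np.abs(np.fft.rfft(close_mic_frame, n=nfft))
    return far / (close + eps)
```

A frame dominated by residual echo (close microphone much louder than the far microphone) yields ratios well below 1 across bins.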
- In accordance with an embodiment, the power or magnitude spectrum ratio is modelled by using an additional mixture in the GMM modeling. In accordance with such an embodiment, the desired source will generally have a relatively high Anc2AbmR, acoustic environmental noise will generally have a relatively lower Anc2AbmR, and residual echo will have a much lower Anc2AbmR compared to the acoustic environmental noise. It may be suitable to use three mixtures in each frequency band/bin: one for the desired source, one for non-stationary/spatial noise, and one for residual echo. It is noted that if each microphone path has acoustic echo cancellation (AEC) prior to the spatial front-end with ANC 220 and ABM 214, then this particular modeling would indeed capture residual echo (assuming AEC provides similar ERLE on the two microphone paths). - 4. Multi-Noise Source Gain Rule
- Multi-noise
source gain component 332 may be configured to determine an optimal gain 325 that is used to suppress multiple types of interfering sources (e.g., stationary noise, non-stationary noise, residual echo, etc.) present in first signal 340 on a per-frame basis and/or per-frequency bin basis. An observed signal (e.g., first signal 340) that includes multiple types of interfering sources may be represented in accordance with Equation 38: -
Y = X + Σ_{k=1}^{K} N_k,  Equation 38 - where Y corresponds to the observed signal (e.g., first signal 340), X corresponds to the underlying clean speech in observed signal Y, and N_k corresponds to the kth interfering source (e.g., stationary noise, non-stationary noise, or residual echo). For simplicity, a value of 1 for k corresponds to stationary noise, a value of 2 for k corresponds to non-stationary noise, and a value of 3 for k corresponds to residual echo.
- A global cost function may be formulated that minimizes the distortion of the desired source and that also achieves satisfactory noise suppression. Such a global cost function may be a composite of more than one branch cost function. For example, the global cost function may be based on a cost function for minimizing the distortion of the desired source and a respective branch cost function for minimizing the distortion of each of the k interfering sources (i.e., the unnaturalness of the residual of an interfering source, as it is referred to in the aforementioned U.S. patent application Ser. No. 12/897,548, the entirety of which has been incorporated by reference as if fully set forth herein). These different cost functions may be further weighted to obtain a degree of balance between distortion of the desired source and the distortion of the k interfering sources. A global cost function is shown in Equation 39:
-
C = Σ_{k=1}^{K} λ_k[α_k·E{(1−G)^2·X^2} + (1−α_k)·E{(H_k−G)^2·N_k^2}],  Equation 39 - where
- E{(1−G)^2·X^2} corresponds to the cost function for minimizing the distortion of the desired source included in observed signal Y,
- E{(H_k−G)^2·N_k^2} corresponds to the branch cost function for minimizing the distortion of the residual of the kth interfering source included in observed signal Y,
- G corresponds to the optimal gain (i.e., the gain that optimizes (or minimizes) the corresponding cost function),
- H_k corresponds to an amount of desired attenuation to be applied to the kth interfering source included in observed signal Y,
- α_k corresponds to an intra-branch tradeoff that specifies a degree of balance between distortion of the desired source included in observed signal Y and distortion of the residual kth interfering source included in the noise-suppressed signal (e.g., noise-suppressed signal 344), where 0 ≤ α_k ≤ 1, and
- λ_k corresponds to an inter-branch tradeoff that weights each of the k composite cost functions.
- Once the global cost function is formulated, the optimal gain, G, may be determined by taking the derivative of the global cost function with respect to the optimal gain and setting the derivative to zero. This is shown in Equation 40:
-
∂C/∂G = −2·Σ_{k}{λ_k·α_k·(1−G)·σ_X^2 + λ_k·(1−α_k)·(H_k−G)·σ_Nk^2} = 0,  Equation 40 - As shown in Equation 40, the second moments (i.e., variances) of each of the k interfering noise sources (i.e., σ_Nk^2) and of the desired source (i.e., σ_X^2), which naturally occur from the expectations used in Equation 39, are introduced. The second moment of the desired source divided by the second moment of a particular kth interfering noise source is equivalent to the SNR for that particular kth interfering noise source. This is shown in Equation 41:
ξ_k = σ_X^2 / σ_Nk^2,  Equation 41
- Optimal gain, G, may be determined by simplifying Equation 41 to Equation 42, as shown below:
-
G = [Σ_{k=1}^{K} λ_k·(α_k + (1−α_k)·H_k/ξ_k)] / [Σ_{k=1}^{K} λ_k·(α_k + (1−α_k)/ξ_k)],  Equation 42
-
G = (α·ξ + (1−α)·H) / (α·ξ + (1−α)),  Equation 43
- Multi-noise
source gain component 332 may be configured to determine optimal gain 325, which is used to suppress multiple types of interfering sources from input signal 340, in accordance with Equation 42. For example, as described above, SSNR estimation component 306 may provide SSNR estimate 303, SNSNR estimation component 316 may provide SNSNR estimate 317, and SRER estimation component 326 may provide SRER estimate 323. Each of these estimates may correspond to an SNR (i.e., ξ_k) for a kth interfering noise source. In addition, each of these estimates may be provided on a per-frame basis and/or per-frequency bin basis. - In accordance with an embodiment, the value of the target suppression parameter H for each of the k interfering noise sources comprises a fixed aspect of back-end SCS component 300 that is determined during a design or tuning phase associated with that component. Alternatively, the value of the target suppression parameter H for each of the k interfering noise sources may be determined in response to some form of user input (e.g., responsive to user control of settings of a device that includes back-end SCS component 300). In a still further embodiment, the value of the target suppression parameter H for each of the k interfering noise sources may be adaptively determined based at least in part on characteristics of first signal 340. In accordance with any of these embodiments, the values of the target suppression parameter(s) H_k may be constant across all frequencies, or alternatively, the values of the target suppression parameter(s) H_k may vary per frequency bin. - The value of each intra-branch tradeoff α for a particular kth interfering noise source may be based on a probability that a particular frame of
first signal 340 is from a desired source (e.g., speech) with respect to the particular interfering noise source. For example, the intra-branch tradeoff associated with the stationary noise branch (e.g., α_1) may be based on probability 307, the intra-branch tradeoff associated with the non-stationary noise branch (e.g., α_2) may be based on probability 313, and the intra-branch tradeoff associated with the residual echo branch (e.g., α_3) may be based on probability 321. - In one embodiment, the value of the intra-branch tradeoff parameter α associated with each of the k interfering noise sources comprises a fixed aspect of back-end SCS component 300 that is determined during a design or tuning phase associated with that component. Alternatively, the value of the intra-branch tradeoff parameter α associated with each of the k interfering noise sources may be determined in response to some form of user input (e.g., responsive to user control of settings of a device that includes back-end SCS component 300). - In a still further embodiment, the value of the intra-branch tradeoff parameter α associated with each of the k interfering noise sources is adaptively determined. For example, the value of α associated with a particular kth interfering noise source may be adaptively determined based at least in part on the probability that a particular frame and/or frequency bin of
first signal 340 is from a desired source with respect to the particular kth interfering noise source. For instance, if the probability that a particular frame and/or frequency bin of first signal 340 is from a desired source with respect to a particular kth interfering noise source is high, the value of α_k may be set such that an increased emphasis is placed on minimizing the distortion of the desired source. If the probability that a particular frame and/or frequency bin of first signal 340 is from a desired source with respect to the particular kth interfering noise source is low, the value of α_k may be set such that an increased emphasis is placed on minimizing the distortion of the residual kth interfering noise source. - In accordance with such an embodiment, each intra-branch tradeoff, α, may be determined in accordance with Equation 44, which is shown below:
-
α = α_N + P_DS·α_S,  Equation 44 - where α_N corresponds to a tradeoff intended for a particular interfering noise source included in first signal 340, α_S + α_N corresponds to a tradeoff intended for a desired source included in first signal 340, and P_DS corresponds to a probability that a particular frame and/or frequency bin of first signal 340 is from a desired source with respect to a particular interfering noise source (e.g., probability 307, probability 313, or probability 321).
first signal 340 is from a desired source with respect to a particular interfering noise source, the value of α may be adaptively determined based on modulation information associated withfirst signal 340. For example, as shown inFIG. 3C , fullband modulationfeature extraction component 328 may extractfeatures 327 of an energy contour associated withfirst signal 340 over time.Features 327 are provided to fullband modulationstatistical modeling component 330. - Fullband modulation
statistical modeling component 330 may be configured to model features 327 on a per-frame basis and/or per-frequency bin basis. In accordance with an embodiment, fullband modulation statistical modeling component 330 models features 327 using GMM modeling. By using GMM modeling, a probability 329 that a particular frame and/or frequency bin of first signal 340 is from a desired source (e.g., speech) may be determined. For example, it has been observed that an energy contour that changes relatively quickly over time indicates that the signal includes a desired source, whereas an energy contour that changes relatively slowly over time indicates that the signal includes an interfering source. Accordingly, in response to determining that the rate at which the energy contour associated with first signal 340 changes is relatively fast, probability 329 may be relatively high, thereby causing the value of α_k to be set such that an increased emphasis is placed on minimizing the distortion of the desired source during frames including the desired source. In response to determining that the rate at which the energy contour associated with first signal 340 changes is relatively slow, probability 329 may be relatively low, thereby causing the value of α_k to be set such that an increased emphasis is placed on minimizing the distortion of the residual kth interfering noise source. Still other adaptive schemes for setting the value of α_k may be used. - The value of inter-branch tradeoff parameter, λ, for each of the k interfering noise sources may be based on measure of
spatial ambiguity 331. For example, if measure of spatial ambiguity 331 is indicative of spatial feature statistical modeling component 314 being in a spatially ambiguous state, then the value of λ associated with the non-stationary noise branch (e.g., λ_2) is set to a relatively low value, and the values of λ associated with the stationary noise branch and the residual echo branch (e.g., λ_1 and λ_3) are set to relatively higher values. By doing so, the non-stationary noise branch is effectively disabled (i.e., soft-disabled). The non-stationary noise branch may be re-enabled (i.e., soft-enabled) in the event that measure of spatial ambiguity 331 is indicative of spatial feature statistical modeling component 314 being in a spatially confident state, by increasing the value of λ_2 and adjusting the values of λ_1 and λ_3 accordingly (such that the sum of all the inter-branch tradeoff parameters is equal to one). - In accordance with an embodiment where multi-noise
source gain component 332 is configured to determine optimal gain 325 on a per-frequency bin basis, multi-noise source gain component 332 provides a respective optimal gain value for each frequency bin. -
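The multi-source gain rule and the adaptive intra-branch tradeoff of Equation 44 can be sketched as follows (a minimal illustration with illustrative tradeoff constants; the gain expression is obtained by solving Equation 40 for G and substituting the per-branch SNRs of Equation 41):

```python
def intra_branch_tradeoff(p_desired, alpha_noise=0.2, alpha_speech=0.7):
    """Equation 44: alpha = alpha_N + P_DS * alpha_S. With P_DS near 0 the
    tradeoff favours suppressing the residual interferer; with P_DS near 1
    it rises toward alpha_N + alpha_S, favouring low distortion of the
    desired source. The constants are illustrative."""
    return alpha_noise + p_desired * alpha_speech

def multi_source_gain(snr, target_atten, alpha, lam):
    """Per-frame/per-bin suppression gain for K interfering sources.

    snr[k]          -- SNR estimate xi_k for branch k (SSNR, SNSNR, SRER)
    target_atten[k] -- desired attenuation H_k for branch k
    alpha[k]        -- intra-branch tradeoff alpha_k (Equation 44)
    lam[k]          -- inter-branch tradeoff lambda_k (soft-disable weight)
    """
    num = sum(l * (a + (1.0 - a) * h / xi)
              for l, a, h, xi in zip(lam, alpha, target_atten, snr))
    den = sum(l * (a + (1.0 - a) / xi)
              for l, a, xi in zip(lam, alpha, snr))
    return num / den
```

With a single branch (λ = 1) this reduces to the single-source rule: at high SNR the gain approaches 1, and with α = 0 it approaches the target attenuation H.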
Gain application component 346 may be configured to suppress noise (e.g., stationary noise, non-stationary noise, and/or residual echo) present in first signal 340 by applying optimal gain 325 to provide noise-suppressed signal 344. In accordance with an embodiment, gain application component 346 is configured to suppress noise present in first signal 340 on a frequency-bin-by-frequency-bin basis using the respective optimal gain values obtained for each frequency bin, as described above. - It is noted that in accordance with an embodiment, back-end SCS component 300 is configured to operate in a single-user speakerphone mode of a device in which SCS component 300 is implemented or a conference speakerphone mode of such a device. In accordance with such an embodiment, back-end SCS component 300 receives a mode enable signal 336 from a mode detector (e.g., automatic mode detector 222, as shown in FIG. 2) that causes back-end SCS component 300 to switch between single-user speakerphone mode and conference speakerphone mode. Accordingly, mode enable signal 336 may correspond to mode enable signal 236, as shown in FIG. 2. When operating in conference speakerphone mode, mode enable signal 336 may cause the non-stationary noise branch to be disabled (e.g., λ_2 is set to a relatively low value, for example, zero). Accordingly, gain application component 346 may be configured to suppress stationary noise and/or residual echo present in first signal 340 (and not non-stationary noise). When operating in single-user speakerphone mode, mode enable signal 336 may cause the non-stationary noise suppression branch to be enabled. Accordingly, gain application component 346 may be configured to suppress stationary noise, non-stationary noise, and/or residual echo present in first signal 340. -
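Applying the per-bin gains amounts to scaling each frequency bin of a frame and transforming back; a minimal sketch (the windowing and overlap-add framing a real implementation would use are omitted):

```python
import numpy as np

def apply_suppression_gains(frame, gains, nfft=256):
    """Apply per-frequency-bin suppression gains to a time-domain frame:
    transform, scale each bin by its gain, and inverse-transform.
    `gains` has nfft//2 + 1 entries, one per rfft bin."""
    spectrum = np.fft.rfft(frame, n=nfft)
    return np.fft.irfft(spectrum * gains, n=nfft)
```

Unity gains return the frame unchanged; a zero gain in a bin removes that spectral component entirely.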
FIG. 3I shows example diagnostic plots for a segment of an input signal (e.g., first signal 340) that includes speech (i.e., a desired source) and babble noise (i.e., an interfering source), as processed by back-end SCS system 300. Plot 377 shows first signal 340 as received from a primary microphone (i.e., microphone 106 1, as shown in FIG. 1). Plot 379 shows the SSNR estimate (i.e., SSNR estimate 303) and panel 381 shows the probability of desired source (i.e., probability 307) inferred from statistical modeling of the SNR features by SSNR feature statistical modeling component 310. Plot 383 shows the estimated spatial ambiguity (e.g., measure of spatial ambiguity 331 obtained by spatial feature statistical modeling component 314), which is constant at unity due to the spatial diversity present in this segment. Plot 385 shows the posterior probability of the target speaker (i.e., classification 311 provided by SID feature extraction component 318). Plot 387 shows the SNSNR estimate (i.e., SNSNR estimate 317) and plot 389 shows the probability of desired source (i.e., probability 313) inferred from statistical modeling of the Anc2AbmR feature (i.e., ratio 309) by spatial feature statistical modeling component 314. Plot 391 illustrates the final gain (i.e., optimal gain 325) obtained by multi-noise source gain component 332. -
FIG. 3J shows an analogous plot for a segment of an input signal (e.g., first signal 340) that includes speech and babble noise, but captured in a spatially ambiguous configuration. Note that the spatial ambiguity measure (i.e., measure of spatial ambiguity 331) shown in plot 383′ converges to zero (indicating spatial ambiguity), and the final gain shown in panel 391′ follows the SSNR estimate and the probability of desired source inferred from statistical modeling of the SNR feature shown in panels 379′ and 381′, respectively. - Accordingly, in embodiments,
system 300 may operate in various ways to determine a noise suppression gain used to suppress multiple types of interfering sources present in an audio signal. For example, FIG. 4 depicts a flowchart 400 of an example method for determining a noise suppression gain in accordance with an example embodiment. The method of flowchart 400 will now be described with continued reference to system 300 of FIG. 3C, although the method is not limited to that implementation. Other structural and operational embodiments will be apparent to persons skilled in the relevant art(s) based on the discussion regarding flowchart 400 and system 300. -
FIG. 4, the method of flowchart 400 begins at step 402, where an audio signal is received that comprises at least a desired source component and at least one interfering source type. For example, with reference to FIG. 3C, back-end SCS component 300 receives first signal 340. -
- At
step 404, a noise suppression gain is determined based on a statistical modeling of at least one feature associated with the audio signal using a mixture model comprising a plurality of model mixtures, each of the plurality of model mixtures being associated with one of the desired source component or an interfering source type of the at least one interfering source type. -
FIG. 3C, multi-noise source gain component 332 determines a noise suppression gain (i.e., optimal gain 325). SSNR feature statistical modeling component 310 and/or spatial feature statistical modeling component 314 may statistically model at least one feature associated with the audio signal using a mixture model (e.g., a Gaussian mixture model) that comprises a plurality of model mixtures. SSNR feature statistical modeling component 310 and/or spatial feature statistical modeling component 314 may associate each of the plurality of model mixtures with one of the desired source component or an interfering source type of the at least one interfering source type. - In accordance with an embodiment, the statistical modeling is adaptive based on at least one feature associated with each frame of the audio signal being received.
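The mixture-model classification described above can be illustrated with a short sketch. The following Python snippet is not part of the specification; the mixture parameters are hypothetical, and the rule that the highest-mean mixture corresponds to the desired source is merely one plausible way of associating mixtures with sources based on their properties.

```python
import numpy as np

def mixture_posteriors(feature, means, variances, weights):
    """Posterior probability of each model mixture for a scalar feature
    (e.g., an instantaneous SNR estimate in dB).

    Each mixture is a 1-D Gaussian; as an illustrative association rule,
    the mixture with the largest mean is treated as the desired source
    and the remaining mixtures as interfering source types.
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Weighted Gaussian likelihood of the feature under each mixture.
    lik = weights * np.exp(-0.5 * (feature - means) ** 2 / variances) \
          / np.sqrt(2.0 * np.pi * variances)
    post = lik / np.sum(lik)             # normalize to posteriors
    desired_idx = int(np.argmax(means))  # high-SNR mixture -> desired source
    return post, post[desired_idx]

# Example with three hypothetical mixtures
# (stationary noise, non-stationary noise, desired speech):
post, p_ds = mixture_posteriors(
    feature=12.0,
    means=[-3.0, 4.0, 15.0],
    variances=[4.0, 9.0, 25.0],
    weights=[0.4, 0.3, 0.3])
```

A high-SNR feature value thus yields a high posterior for the desired-source mixture, which is the quantity the subsequent gain computation consumes.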
- In accordance with an embodiment, the determination of the noise suppression gain includes determining one or more contributions that are derived from the at least one feature and determining the noise suppression gain based on the one or more contributions. Each of the one or more contributions may be determined in accordance with the composite cost function described above with reference to Equation 39 (i.e., each of the one or more contributions may be based on a branch cost function for minimizing the distortion of the residual of a respective kth interfering source included in the audio signal plus the cost function for minimizing the distortion of the desired source component included in the audio signal).
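Equations 39 and 42 are not reproduced in this excerpt, so the sketch below is only a stand-in that combines the same ingredients: one Wiener-style branch gain per interfering source type, a per-branch distortion tradeoff parameter αk, and inter-branch weights λk. It is illustrative, not the cost-minimizing gain of the specification.

```python
import numpy as np

def composite_gain(snrs_linear, alphas, lambdas):
    """Illustrative composite suppression gain over K interfering-source
    branches (a hedged stand-in, not Equation 42).

    Each branch contributes a parametric Wiener-style gain
    snr_k / (snr_k + alpha_k); a larger alpha_k shifts the balance toward
    suppressing the k-th residual at the cost of more desired-source
    distortion. The branch gains are combined with normalized
    inter-branch weights lambda_k.
    """
    snrs = np.asarray(snrs_linear, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    lam = np.asarray(lambdas, dtype=float)
    lam = lam / np.sum(lam)                 # normalize inter-branch weights
    branch_gains = snrs / (snrs + alphas)   # one contribution per branch
    return float(np.sum(lam * branch_gains))
```

For instance, with two branches at a linear SNR of 10 and unit tradeoffs, the composite gain is simply the common branch gain 10/11; raising one branch's alpha lowers the gain, i.e., suppresses that residual more aggressively.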
- In accordance with an embodiment, the one or more contributions are weighted based on a measure of ambiguity between two or more of the plurality of model mixtures. For example, with reference to
FIG. 3C, the one or more contributions may be weighted based on measure of spatial ambiguity 331. - In accordance with an embodiment, a respective model mixture of the plurality of model mixtures is associated with one of the desired source component or an interfering source type of the at least one interfering source type based on one or more properties (e.g., the mean, variance, etc.) of the respective model mixture and one or more expected characteristics (e.g., the SNR, Anc2AbmR, etc.) of a respective interfering source type of the at least one interfering source type.
- In accordance with an embodiment, the noise suppression gain is determined for each of a plurality of frequency bins of the audio signal. For example, with reference to
FIG. 3C, optimal gain 325 is determined for each of a plurality of frequency bins of first signal 340. -
FIG. 5 depicts a flowchart 500 of an example method for determining and applying a gain to an audio signal in accordance with an example embodiment. The method of flowchart 500 will now be described with continued reference to system 300 of FIG. 3C, although the method is not limited to that implementation. Other structural and operational embodiments will be apparent to persons skilled in the relevant art(s) based on the discussion regarding flowchart 500 and system 300. - As shown in
FIG. 5, the method of flowchart 500 begins at step 502, where one or more first characteristics associated with a first type of interfering source in an audio signal are determined. In accordance with an embodiment, the first type of interfering source is stationary noise. In accordance with such an embodiment, the first characteristic(s) include an SNR regarding the stationary noise with respect to the audio signal and a first measure of probability indicative of a probability that the audio signal is from a desired source with respect to the stationary noise. - For example, with reference to
FIG. 3C, multi-noise source gain component 332 receives first characteristic(s) associated with stationary noise included in first signal 340. For instance, the first characteristic(s) may include SSNR estimate 303 and probability 307 that indicates a probability that a particular frame of first signal 340 is from a desired source with respect to the stationary noise. - At
step 504, one or more second characteristics associated with a second type of interfering source in the audio signal are determined. In accordance with an embodiment, the second type of interfering source is non-stationary noise. In accordance with such an embodiment, the second characteristic(s) include an SNR regarding the non-stationary noise with respect to the audio signal and a second measure of probability indicative of a probability that the audio signal is from a desired source with respect to the non-stationary noise. - For example, with reference to
FIG. 3C, multi-noise source gain component 332 receives the second characteristic(s) associated with non-stationary noise included in first signal 340. For instance, the second characteristic(s) may include SNSNR estimate 317 and probability 313 that indicates a probability that a particular frame of first signal 340 is from a desired source with respect to the non-stationary noise. - At
step 506, a gain based on the first characteristic(s) and the second characteristic(s) is determined. For example, with reference to FIG. 3C, multi-noise source gain component 332 determines optimal gain 325 based on the first characteristic(s) and the second characteristic(s). In accordance with an embodiment, multi-noise source gain component 332 determines optimal gain 325 in accordance with Equation 42 described above. In accordance with another embodiment, a gain (i.e., optimal gain 325) is determined for each of a plurality of frequency bins of the audio signal (i.e., first signal 340) based on the first characteristic(s) and the second characteristic(s). - At
step 508, the determined gain is applied to the audio signal. For example, with reference to FIG. 3C, gain application component 346 applies optimal gain 325 to first signal 340. In accordance with an embodiment in which a gain is determined for each of a plurality of frequency bins of the audio signal, each of the determined gains is applied to a corresponding frequency bin of the audio signal. - In accordance with an embodiment, the determined gain is applied in a manner that is controlled by a tradeoff parameter associated with a measure of spatial ambiguity.
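Applying a per-frequency-bin gain, as in step 508, can be sketched as follows. This is a minimal illustration assuming a bare FFT analysis/synthesis; a practical implementation would also use windowing and overlap-add, which are omitted here.

```python
import numpy as np

def apply_bin_gains(frame, gains):
    """Apply a per-frequency-bin suppression gain to one time-domain frame.

    `gains` has one entry per bin of the one-sided spectrum
    (len(frame) // 2 + 1 entries). The rectangular-window analysis is
    illustrative only.
    """
    spectrum = np.fft.rfft(frame)                  # analysis
    suppressed = gains * spectrum                  # per-bin gain application
    return np.fft.irfft(suppressed, n=len(frame))  # synthesis

# Example: attenuate every bin of a 256-sample frame by 6 dB (gain 0.5).
frame = np.random.default_rng(0).standard_normal(256)
out = apply_bin_gains(frame, np.full(129, 0.5))
```

Because the transform is linear, a uniform gain of 0.5 across all bins simply scales the frame by 0.5; frequency-dependent gains suppress only the bins dominated by interfering sources.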
- For example, with reference to
FIG. 3C, multi-noise source gain component 332 may set the value of the inter-branch tradeoff parameter(s) (i.e., λk) based on measure of spatial ambiguity 331. - In accordance with another embodiment, the determined gain is applied in a manner that is controlled by a first parameter that specifies a degree of balance between a distortion of a desired source included in the audio signal and a distortion of a residual amount of the first type of interfering source included in a noise-suppressed signal that is obtained from applying the determined gain to the audio signal, and by a second parameter that specifies a degree of balance between the distortion of the desired source included in the audio signal and a distortion of a residual amount of the second type of interfering source included in the noise-suppressed signal.
- For example, with reference to
FIG. 3C, multi-noise source gain component 332 may determine the value of the first parameter (i.e., α1) that specifies a degree of balance between the distortion of the desired source included in first signal 340 and the distortion of a residual amount of the first type of interfering source included in noise-suppressed signal 344, and may also determine the value of the second parameter (i.e., α2) that specifies a degree of balance between the distortion of the desired source included in first signal 340 and the distortion of a residual amount of the second type of interfering source included in noise-suppressed signal 344. - In accordance with an embodiment, the value of the first parameter is set based on the probability that the audio signal is from a desired source with respect to the first type of interfering source, and the value of the second parameter is set based on the probability that the audio signal is from a desired source with respect to the second type of interfering source included in the audio signal.
- For example, with reference to
FIG. 3C, the value of the first parameter may be set based on probability 307 that indicates a probability that a particular frame of first signal 340 is from a desired source with respect to the first type of interfering source (e.g., stationary noise) included in first signal 340, and the value of the second parameter may be set based on probability 313 that indicates a probability that a particular frame of first signal 340 is from a desired source with respect to the second type of interfering source (e.g., non-stationary noise) included in first signal 340. - In accordance with another embodiment, the value of the first parameter and the value of the second parameter are based, at least in part, on a rate at which an energy contour associated with the audio signal changes.
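A minimal sketch of the rate measurement that such an embodiment relies on is given below. The leaky-averaging constant is a hypothetical choice, and the specification's fullband modulation statistical modeling is not reproduced here; only the basic idea, fast energy-contour modulation suggests a desired (speech) source, slow modulation suggests an interfering source, is shown.

```python
import numpy as np

def energy_contour_rate(frame_energies_db, smooth=0.9):
    """Smoothed mean absolute frame-to-frame change of a log-energy contour.

    Larger values indicate fast modulation, which the text associates
    with a desired source; near-zero values indicate a slowly varying
    contour typical of interfering sources.
    """
    diffs = np.abs(np.diff(frame_energies_db))
    rate = 0.0
    for d in diffs:                       # recursive (leaky) averaging
        rate = smooth * rate + (1.0 - smooth) * d
    return rate

# A speech-like contour modulates quickly; a steady noise contour does not.
speech_like = 10.0 * np.sin(np.arange(50) * 0.8)  # fast modulation (dB)
noise_like = np.full(50, -20.0)                   # flat contour (dB)
```

The resulting rate can then drive the parameter setting of steps 604 and 606: a high rate favors minimizing desired-source distortion, while a low rate favors minimizing the residual-noise distortion.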
FIG. 6 depicts a flowchart 600 of an example method for setting a value of a first parameter and a second parameter based on a rate at which an energy contour associated with an audio signal changes in accordance with an embodiment. The method of flowchart 600 will now be described with continued reference to system 300 of FIG. 3C, although the method is not limited to that implementation. Other structural and operational embodiments will be apparent to persons skilled in the relevant art(s) based on the discussion regarding flowchart 600 and system 300. - As shown in
FIG. 6, the method of flowchart 600 begins at step 602, where a rate at which an energy contour associated with the audio signal changes is determined. For example, with reference to FIG. 3C, fullband modulation statistical modeling component 330 may determine the rate at which the energy contour associated with first signal 340 changes. Fullband modulation statistical modeling component 330 provides probability 329 that indicates a probability that a particular frame of first signal 340 is a desired source (e.g., speech) based on the determination. For example, it has been observed that an energy contour that changes relatively quickly over time indicates that the signal includes a desired source, whereas an energy contour that changes relatively slowly over time indicates that the signal includes an interfering source. Accordingly, in response to determining that the rate at which the energy contour associated with first signal 340 changes is relatively fast, probability 329 may be relatively high. In response to determining that the rate at which the energy contour associated with first signal 340 changes is relatively slow, probability 329 may be relatively low. - At
step 604, the value of the first parameter and the value of the second parameter are set such that an increased emphasis is placed on minimizing the distortion of the desired source included in the audio signal in response to determining that the rate at which the energy contour changes is relatively fast. For example, with reference to FIG. 3C, multi-noise source gain component 332 may set the value of the first parameter (i.e., α1) and the second parameter (i.e., α2) such that an increased emphasis is placed on minimizing the distortion of the desired source included in first signal 340 if probability 329 is relatively high. - At
step 606, the value of the first parameter is set such that an increased emphasis is placed on minimizing the distortion of the residual amount of the first type of interfering source included in the noise-suppressed signal, and the value of the second parameter is set such that an increased emphasis is placed on minimizing the distortion of the residual amount of the second type of interfering source included in the noise-suppressed signal, in response to determining that the rate at which the energy contour changes is relatively slow. For example, with reference to FIG. 3C, multi-noise source gain component 332 may set the value of the first parameter (i.e., α1) such that an increased emphasis is placed on minimizing the distortion of the residual amount of the first type of interfering source (e.g., stationary noise) included in noise-suppressed signal 344 and may set the value of the second parameter (i.e., α2) such that an increased emphasis is placed on minimizing the distortion of the residual amount of the second type of interfering source (e.g., non-stationary noise) included in noise-suppressed signal 344 if probability 329 is relatively low. - While
FIG. 3C depicts a system for suppressing stationary noise, non-stationary noise, and residual echo from an observed audio signal (e.g., first signal 340), it is noted that the foregoing embodiments may also be used to suppress multiple types of non-stationary noise (e.g., wind noise, traffic noise, etc.) and/or other types of interfering sources (e.g., reverberation). For example, FIG. 7 is a block diagram of a back-end SCS component 700 that is configured to suppress multiple types of non-stationary noise and/or other types of interfering sources in accordance with an embodiment. Back-end SCS component 700 may be an example of back-end SCS component 116 or back-end SCS component 300. As shown in FIG. 7, back-end SCS component 700 includes stationary noise estimation component 304, SSNR estimation component 306, SSNR feature extraction component 308, SSNR feature statistical modeling component 310, spatial feature extraction component 712, spatial feature statistical modeling component 714, SNSNR estimation component 716, multi-noise source gain component 332, and gain application component 346. - Stationary
noise estimation component 304, SSNR estimation component 306, SSNR feature extraction component 308 and SSNR feature statistical modeling component 310 operate in a similar manner as described above with reference to FIG. 3C to obtain SSNR estimate 303 and probability 307, which are used by multi-noise source gain component 332 to obtain an optimal gain 325. - Spatial
feature extraction component 712 operates in a similar manner as spatial feature extraction component 312 as described above with reference to FIG. 3C to extract features from first signal 340 and second signal 334. However, spatial feature extraction component 712 is further configured to extract features 709 1-k associated with multiple types of non-stationary noise and/or other interfering sources. For example, features 709 1 may correspond to features associated with a first type of non-stationary noise or other type of interfering source, features 709 2 may correspond to features associated with a second type of non-stationary noise or other type of interfering source, and features 709 k may correspond to features associated with a kth type of non-stationary noise or other type of interfering source. - As described above, reverberation and wind noise are examples of additional types of non-stationary noise and/or other types of interfering sources that may be suppressed from an observed audio signal. An example of extracting features associated with reverberation and wind noise is described below.
- Reverberation can be considered an additive noise, where all multi-path receptions of the desired source less the direct-path are considered interfering sources. The direct-path reception of the desired source by the microphone(s) (e.g., microphones 106 1-N, as shown in
FIG. 1 ) is considered the ultimate desired source. The multi-path receptions of the desired source are generally filtered versions of the desired source that include a delay and attenuation compared to the direct-path due to the longer distance the reflected sound wave travels and the sound absorption of the material of the reflecting surfaces. Hence, reverberation will manifest itself as a smearing or added tail to the direct-path desired source, and it will effectively reduce the modulation bandwidth compared to the source due to somewhat filling in the gaps of the time evolution of the magnitude spectrum between syllables (due to the smearing); see, for example, "The Linear Prediction Inverse Modulation Transfer Function (LP-IMTF) Filter for Spectral Enhancement, with Applications to Speaker Recognition" by Bengt J. Borgstrom and Alan McCree, ICASSP 2012, pp. 4065-4068, which is incorporated by reference herein. - However, instead of bandpass filtering the magnitude spectrum in time to suppress the reverberation, as described by Borgstrom and McCree, the modulation information pertinent to reverberation may be modelled (e.g., as a function of frequency). In accordance with an embodiment, the modulation information is modelled by lowpass filtering the magnitude spectrum in order to estimate the reverberation magnitude spectrum and using this estimate to calculate the SRR, which can be modelled (e.g., by spatial feature
statistical modeling component 714, as described below) in a way similar to SNR feature vector 305. The statistical modeling of the SRR can then provide a probability of desired source, PDS,m(k), and a probability of interfering source, PIS,m(k), with respect to reverberation. It should be noted that the SRR feature will not only capture reverberation, but also stationary noise in general, and hence there is an overlap with the modeling of SNR feature vector 305, similar to how there is an overlap between the modeling of the Anc2AbmR feature (i.e., ratio 309) and SNR feature vector 305. This overlap can be mitigated by applying a conventional stationary noise suppression (of a suitable degree) to first signal 340 prior to estimating the SRR feature, similar to how a preliminary stationary noise suppression is performed for first signal 340 prior to calculating the Anc2AbmR feature (i.e., ratio 309). Similar to the Anc2AbmR feature, the degree of a preliminary stationary noise suppression should not be exaggerated, as that will tend to impose the properties of that particular suppression algorithm onto the SRR feature, and result in the SRR feature essentially mirroring SSNR estimate 303 or stationary noise estimate 301 obtained within the stationary noise branch instead of reflecting the reverberation. - Wind noise is typically not an acoustic noise, but a noise generated by the wind moving the microphone membrane (as opposed to the sound pressure wave moving the membrane). It propagates with a speed corresponding to the wind speed, which is typically much smaller than the speed of sound in air (i.e., 340 meters/second). As an effect, there is no correlation between wind noise picked up on two microphones in typical dual-microphone configurations. Hence, an indicator of wind noise can be constructed by measuring the normalized correlation between two microphone signals.
This can be extended to measuring the magnitude of the normalized coherence between the two microphone signals in the frequency domain as a function of frequency. This is beneficial since wind noise typically extends from low frequencies towards higher frequencies with a cut-off that increases with the degree of wind noise, and often only part of the spectrum is polluted by wind noise. A probability of desired source, PDS,m(k), and a probability of interfering source, PIS,m(k), with respect to wind noise obtained by GMM modeling of the normalized correlation between two microphone signals only indicates the probability of wind noise presence on one of the two microphones, but if the feature vector is augmented with an additional parameter corresponding to the power ratio between the two microphone signals (in the same frequency bin/range as the correlation/coherence feature), then the joint GMM modeling should be able to facilitate calculation of: (1) the probability of wind noise on a first microphone of a communication device, (2) the probability of desired source on the first microphone of the communication device, (3) the probability of wind noise on a second microphone of the communication device, and (4) the probability of desired source on the second microphone of the communication device, as a function of frequency. This information can be useful in attempts to rebuild the desired source on a microphone polluted by wind noise from one that is not polluted by wind noise.
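The coherence-plus-power-ratio feature described above can be sketched as follows. This is a frame-averaged estimate under simplifying assumptions (rectangular windows, no overlap); the regularization constant is illustrative.

```python
import numpy as np

def wind_feature(x1_frames, x2_frames):
    """Per-bin wind-noise feature for a dual-microphone configuration.

    Returns (a) the magnitude of the normalized coherence between the two
    microphone signals as a function of frequency, averaged over the
    supplied frames, and (b) the per-bin power ratio that augments the
    feature vector. Low coherence in a bin suggests wind noise, since
    wind noise is uncorrelated across the two microphones.
    """
    X1 = np.fft.rfft(x1_frames, axis=-1)
    X2 = np.fft.rfft(x2_frames, axis=-1)
    cross = np.mean(X1 * np.conj(X2), axis=0)       # cross-spectrum
    p1 = np.mean(np.abs(X1) ** 2, axis=0)           # auto-spectra
    p2 = np.mean(np.abs(X2) ** 2, axis=0)
    coherence = np.abs(cross) / np.sqrt(p1 * p2 + 1e-12)
    power_ratio = p1 / (p2 + 1e-12)
    return coherence, power_ratio
```

For an acoustic desired source the coherence approaches one across the polluted-free bins, while bins dominated by wind noise on either microphone show coherence near zero; the power ratio then helps attribute the wind noise to a specific microphone.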
- Spatial feature
statistical modeling component 714 operates in a similar manner as spatial feature statistical modeling component 314 as described above with reference to FIG. 3C to model features received thereby. However, spatial feature statistical modeling component 714 is further configured to model features associated with multiple types of non-stationary noise and/or other types of interfering sources (i.e., features 709 1-k) to provide a probability for each of the multiple types of non-stationary noise and/or other types of interfering sources (e.g., probabilities 715 1-k) that a particular frame of input signal 340 is from a particular type of non-stationary noise and/or other type of interfering source. For example, as shown in FIG. 7, probability 715 1 corresponds to a probability that a particular frame of input signal 340 is from a first type of non-stationary noise or other type of interfering source, probability 715 2 corresponds to a probability that a particular frame of input signal 340 is from a second type of non-stationary noise or other type of interfering source, and probability 715 k corresponds to a probability that a particular frame of input signal 340 is from a kth type of non-stationary noise or other type of interfering source. Spatial feature statistical modeling component 714 also provides a probability (i.e., probability 313) that a particular frame of input signal 340 is from a desired source as described above with reference to FIG. 3C. -
SNSNR estimation component 716 may operate in a similar manner as SNSNR estimation component 316 as described above with reference to FIG. 3C to determine an SNSNR estimate for input signal 340. However, SNSNR estimation component 716 is further configured to provide SNSNR estimates (e.g., 717 1-k) for multiple types of non-stationary noise and/or SNR estimates for other types of interfering sources. For example, as shown in FIG. 7, SNSNR estimate 717 1 corresponds to an SNSNR estimate for a first type of non-stationary noise or other type of interfering source, SNSNR estimate 717 2 corresponds to an SNSNR estimate for a second type of non-stationary noise or other type of interfering source, and SNSNR estimate 717 k corresponds to an SNSNR estimate for a kth type of non-stationary noise or other type of interfering source. SNSNR estimate 717 1 may be based at least on probability 313 and probability 715 1, SNSNR estimate 717 2 may be based at least on probability 313 and probability 715 2, and SNSNR estimate 717 k may be based at least on probability 313 and probability 715 k. - Multi-noise
source gain component 332 may be configured to obtain optimal gain 325 in accordance with Equation 42 as described above. Gain application component 346 may be configured to suppress stationary noise, multiple types of non-stationary noise, residual echo, and/or other types of interfering sources based on optimal gain 325. - Embodiments described herein may be generalized in accordance with
FIG. 8. FIG. 8 shows a block diagram of a generalized back-end SCS component 800 in accordance with an example embodiment. Back-end SCS component 800 may be an example of back-end SCS component 116, back-end SCS component 300 or back-end SCS component 700. As shown in FIG. 8, generalized back-end SCS component 800 includes feature extraction components 802 1-k, statistical modeling components 804 1-k, SNR estimation components 808 1-k and a multi-noise source gain component 810. - Back-
end SCS component 800 may be coupled to a plurality of microphone inputs 806 1-n. In an embodiment where back-end SCS component 800 comprises an implementation of back-end SCS component 116, plurality of microphone inputs 806 1-n correspond to plurality of microphone inputs 106 1-n. Each of feature extraction components 802 1-k may be configured to extract features 801 1-k pertaining to a particular interfering noise source (e.g., stationary noise, a particular type of non-stationary noise, residual echo, reverberation, etc.) from one or more input signals 812 derived from the plurality of microphone inputs 806 1-n. For example, input signal(s) 812 may correspond to microphone inputs that have been processed by the front end and/or have been condensed into an m number of signals, where m is an integer value less than n. For example, with reference to FIG. 2, input signal(s) 812 may correspond to enhanced source signal 240, non-desired source signals 234, FDAEC output signal 224, and/or residual echo information 238. - Each of features 801 1-k may be provided to a respective statistical modeling component 804 1-k. Each of statistical modeling components 804 1-k may be configured to model the respective features received to determine respective probabilities 803 1-k that each indicate a probability that a particular frame of input signal(s) 812 comprises a particular type of interfering noise source.
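One way probabilities such as these can drive per-source SNR estimation is a probability-weighted noise-power update. The sketch below is an assumption for illustration, not the specification's estimator: the per-bin noise power of one interfering source type is adapted toward the frame power in proportion to the probability that the frame is from that source type, and the corresponding SNR estimate follows.

```python
import numpy as np

def update_noise_estimate(noise_power, frame_power, p_interferer, rate=0.1):
    """Probability-driven noise-power update for one interfering source type.

    `noise_power` and `frame_power` are per-bin arrays; `p_interferer` is
    the probability that the current frame is from this interfering
    source type. The adaptation rate is a hypothetical constant.
    """
    noise_power = noise_power + rate * p_interferer * (frame_power - noise_power)
    snr_db = 10.0 * np.log10(frame_power / np.maximum(noise_power, 1e-12))
    return noise_power, snr_db
```

When the frame is confidently an interfering source (probability near one) the noise estimate tracks the frame power; when the frame is confidently a desired source (probability near zero) the estimate is frozen, so the SNR estimate reflects only frames attributed to that source type.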
For example, probability 803 1 may correspond to a probability that a particular frame of input signal(s) 812 comprises a first type of interfering noise source, probability 803 2 may correspond to a probability that a particular frame of input signal(s) 812 comprises a second type of interfering noise source, probability 803 3 may correspond to a probability that a particular frame of input signal(s) 812 comprises a third type of interfering noise source and probability 803 k may correspond to a probability that a particular frame of input signal(s) 812 comprises a kth type of interfering noise source. One or more of statistical modeling components 804 1-k may also determine a
probability 805 that a particular frame of input signal(s) 812 comprises a desired source. - Each of
probabilities 803 1-k and 805 may be provided to a respective SNR estimation component 808 1-k. Each of SNR estimation components 808 1-k may be configured to determine a respective SNR estimate 807 1-k pertaining to a particular interfering noise source included in input signal(s) 812 based on the received probabilities. For example, SNR estimation component 808 1 may determine SNR estimate 807 1, which pertains to a first type of interfering noise source included in input signal(s) 812, based on probability 803 1 and/or probability 805. SNR estimation component 808 2 may determine SNR estimate 807 2, which pertains to a second type of interfering noise source included in input signal(s) 812, based on probability 803 2 and/or probability 805. SNR estimation component 808 3 may determine SNR estimate 807 3, which pertains to a third type of interfering noise source included in input signal(s) 812, based on probability 803 3 and/or probability 805, and SNR estimation component 808 k may determine SNR estimate 807 k, which pertains to a kth type of interfering noise source included in input signal(s) 812, based on probability 803 k and/or probability 805. - Multi-noise
source gain component 810 may be configured to determine an optimal gain 811 based at least on probability 805 and/or SNR estimates 807 1-k in accordance with Equation 42 as described above. A gain application component (e.g., gain application component 346, as shown in FIG. 3C ) may be configured to suppress the different types of interfering sources (e.g., stationary noise, multiple types of non-stationary noise, residual echo, and/or other types of interfering sources) based on optimal gain 811. -
FIG. 9 depicts a block diagram of a processor circuit 900 in which portions of communication device 100, as shown in FIG. 1, system 200 (and the components and/or sub-components described therein), as shown in FIG. 2, back-end SCS component 300 (and the components and/or sub-components described therein), as shown in FIG. 3C, back-end SCS component 700 (and the components and/or sub-components described therein), as shown in FIG. 7, back-end SCS component 800 (and the components and/or sub-components described therein), as shown in FIG. 8, flowcharts 400-600, as respectively shown in FIGS. 4-6, as well as any methods, algorithms, and functions described herein, may be implemented. Processor circuit 900 is a physical hardware processing circuit and may include a central processing unit (CPU) 902, an I/O controller 904, a program memory 906, and a data memory 908. CPU 902 may be configured to perform the main computation and data processing function of processor circuit 900. I/O controller 904 may be configured to control communication to external devices via one or more serial ports and/or one or more link ports. For example, I/O controller 904 may be configured to provide data read from data memory 908 to one or more external devices and/or store data received from external device(s) into data memory 908. Program memory 906 may be configured to store program instructions used to process data. Data memory 908 may be configured to store the data to be processed. -
Processor circuit 900 further includes one or more data registers 910, a multiplier 912, and/or an arithmetic logic unit (ALU) 914. Data register(s) 910 may be configured to store data for intermediate calculations, prepare data to be processed by CPU 902, serve as a buffer for data transfer, hold flags for program control, etc. Multiplier 912 may be configured to receive data stored in data register(s) 910, multiply the data, and store the result into data register(s) 910 and/or data memory 908. ALU 914 may be configured to perform addition, subtraction, absolute value operations, logical operations (AND, OR, XOR, NOT, etc.), shifting operations, conversion between fixed and floating point formats, and/or the like. -
CPU 902 further includes a program sequencer 916, a program memory (PM) data address generator 918 and a data memory (DM) data address generator 920. Program sequencer 916 may be configured to manage program structure and program flow by generating an address of an instruction to be fetched from program memory 906. Program sequencer 916 may also be configured to fetch instruction(s) from instruction cache 922, which may store an N number of recently-executed instructions, where N is a positive integer. PM data address generator 918 may be configured to supply one or more addresses to program memory 906, which specify where the data is to be read from or written to in program memory 906. DM data address generator 920 may be configured to supply address(es) to data memory 908, which specify where the data is to be read from or written to in data memory 908. - Techniques, including methods, and embodiments described herein may be implemented by hardware (digital and/or analog) or a combination of hardware with one or both of software and/or firmware. Techniques described herein may be implemented by one or more components. Embodiments may comprise computer program products comprising logic (e.g., in the form of program code or software as well as firmware) stored on any computer useable medium, which may be integrated in or separate from other components. Such program code, when executed by one or more processor circuits, causes a device to operate as described herein. Devices in which embodiments may be implemented may include storage, such as storage drives, memory devices, and further types of physical hardware computer-readable storage media. Examples of such computer-readable storage media include a hard disk, a removable magnetic disk, a removable optical disk, flash memory cards, digital video disks, random access memories (RAMs), read only memories (ROM), and other types of physical hardware storage media.
In greater detail, examples of such computer-readable storage media include, but are not limited to, a hard disk associated with a hard disk drive, a removable magnetic disk, a removable optical disk (e.g., CDROMs, DVDs, etc.), zip disks, tapes, magnetic storage devices, MEMS (micro-electromechanical systems) storage, nanotechnology-based storage devices, flash memory cards, digital video discs, RAM devices, ROM devices, and further types of physical hardware storage media. Such computer-readable storage media may, for example, store computer program logic, e.g., program modules, comprising computer executable instructions that, when executed by one or more processor circuits, provide and/or maintain one or more aspects of functionality described herein with reference to the figures, as well as any and all components, steps and functions therein and/or further embodiments described herein.
- Such computer-readable storage media are distinguished from and non-overlapping with communication media (do not include communication media). Communication media embodies computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wireless media such as acoustic, RF, infrared and other wireless media, as well as signals transmitted over wires. Embodiments are also directed to such communication media.
- The techniques and embodiments described herein may be implemented as, or in, various types of devices. For instance, embodiments may be included in mobile devices such as laptop computers, handheld devices such as mobile phones (e.g., cellular and smart phones), handheld computers, and further types of mobile devices, stationary devices such as conference phones, office phones, gaming consoles, and desktop computers, as well as car entertainment/navigation systems. A device, as defined herein, is a machine or manufacture as defined by 35 U.S.C. §101. Devices may include digital circuits, analog circuits, or a combination thereof. Devices may include one or more processor circuits (e.g., processor circuit 1200 of
FIG. 12, central processing units (CPUs), microprocessors, digital signal processors (DSPs), and further types of physical hardware processor circuits) and/or may be implemented with any semiconductor technology in a semiconductor material, including one or more of a Bipolar Junction Transistor (BJT), a heterojunction bipolar transistor (HBT), a metal oxide semiconductor field effect transistor (MOSFET) device, a metal semiconductor field effect transistor (MESFET) or other transconductor or transistor technology device. Such devices may use the same or alternative configurations other than the configuration illustrated in embodiments presented herein.
- While various embodiments have been described above, it should be understood that they have been presented by way of example only, and not limitation. It will be apparent to persons skilled in the relevant art that various changes in form and detail can be made therein without departing from the spirit and scope of the embodiments. Thus, the breadth and scope of the embodiments should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.
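For illustration only, the behavior attributed above to instruction cache 922 (retaining the N most recently executed instructions so the program sequencer can re-fetch them without a trip to program memory) can be sketched as a small least-recently-used store. This is a minimal model, not part of the disclosed embodiments; the class name, method names, and addresses below are all hypothetical.

```python
from collections import OrderedDict

class InstructionCache:
    """Toy model of an N-entry recently-executed-instruction cache.

    Hypothetical illustration: holds at most `capacity` (address -> instruction)
    entries and evicts the least recently used entry when full.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()  # address -> instruction word

    def fetch(self, address):
        """Return the cached instruction for `address`, or None on a miss."""
        if address in self._entries:
            self._entries.move_to_end(address)  # mark as most recently used
            return self._entries[address]
        return None  # miss: sequencer would fall back to program memory

    def record(self, address, instruction):
        """Record an executed instruction, evicting the oldest if over capacity."""
        self._entries[address] = instruction
        self._entries.move_to_end(address)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # drop least recently used entry
```

For example, with `capacity=2`, recording instructions at three addresses evicts the first one, so a subsequent `fetch` of that address misses while the two most recent addresses still hit.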
Claims (20)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US14/540,778 US9570087B2 (en) | 2013-03-15 | 2014-11-13 | Single channel suppression of interfering sources |
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201361799154P | 2013-03-15 | 2013-03-15 | |
US14/216,769 US9338551B2 (en) | 2013-03-15 | 2014-03-17 | Multi-microphone source tracking and noise suppression |
US201462025847P | 2014-07-17 | 2014-07-17 | |
US14/540,778 US9570087B2 (en) | 2013-03-15 | 2014-11-13 | Single channel suppression of interfering sources |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US14/216,769 Continuation-In-Part US9338551B2 (en) | 2013-03-15 | 2014-03-17 | Multi-microphone source tracking and noise suppression |
Publications (2)
Publication Number | Publication Date |
---|---|
US20150071461A1 true US20150071461A1 (en) | 2015-03-12 |
US9570087B2 US9570087B2 (en) | 2017-02-14 |
Family
ID=52625649
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US14/540,778 Active 2034-03-22 US9570087B2 (en) | 2013-03-15 | 2014-11-13 | Single channel suppression of interfering sources |
Country Status (1)
Country | Link |
---|---|
US (1) | US9570087B2 (en) |
Cited By (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140286497A1 (en) * | 2013-03-15 | 2014-09-25 | Broadcom Corporation | Multi-microphone source tracking and noise suppression |
US20150071461A1 (en) * | 2013-03-15 | 2015-03-12 | Broadcom Corporation | Single-channel suppression of intefering sources |
US20160005419A1 (en) * | 2014-07-01 | 2016-01-07 | Industry-University Cooperation Foundation Hanyang University | Nonlinear acoustic echo signal suppression system and method using volterra filter |
US20160029121A1 (en) * | 2014-07-24 | 2016-01-28 | Conexant Systems, Inc. | System and method for multichannel on-line unsupervised bayesian spectral filtering of real-world acoustic noise |
US20170103771A1 (en) * | 2014-06-09 | 2017-04-13 | Dolby Laboratories Licensing Corporation | Noise Level Estimation |
CN107817506A (en) * | 2016-09-13 | 2018-03-20 | 法国国家太空研究中心 | The multipaths restraint based on cepstrum of spread spectrum radiocommunication signal |
WO2019108828A1 (en) | 2017-11-29 | 2019-06-06 | Nuance Communications, Inc. | System and method for speech enhancement in multisource environments |
US10395667B2 (en) * | 2017-05-12 | 2019-08-27 | Cirrus Logic, Inc. | Correlation-based near-field detector |
US10431240B2 (en) * | 2015-01-23 | 2019-10-01 | Samsung Electronics Co., Ltd | Speech enhancement method and system |
CN111031609A (en) * | 2018-10-10 | 2020-04-17 | 鹤壁天海电子信息系统有限公司 | Channel selection method and device |
CN112017682A (en) * | 2020-09-18 | 2020-12-01 | 中科极限元(杭州)智能科技股份有限公司 | Single-channel voice simultaneous noise reduction and reverberation removal system |
US20210081821A1 (en) * | 2018-03-16 | 2021-03-18 | Nippon Telegraph And Telephone Corporation | Information processing device and information processing method |
CN112542177A (en) * | 2020-11-04 | 2021-03-23 | 北京百度网讯科技有限公司 | Signal enhancement method, device and storage medium |
US11025324B1 (en) * | 2020-04-15 | 2021-06-01 | Cirrus Logic, Inc. | Initialization of adaptive blocking matrix filters in a beamforming array using a priori information |
US11074917B2 (en) * | 2017-10-30 | 2021-07-27 | Cirrus Logic, Inc. | Speaker identification |
CN113221062A (en) * | 2021-04-07 | 2021-08-06 | 北京理工大学 | High-frequency motion error compensation algorithm of small unmanned aerial vehicle-mounted BiSAR system |
US20220058503A1 (en) * | 2017-12-11 | 2022-02-24 | Adobe Inc. | Accurate and interpretable rules for user segmentation |
US20220392478A1 (en) * | 2021-06-07 | 2022-12-08 | Cisco Technology, Inc. | Speech enhancement techniques that maintain speech of near-field speakers |
US20230026735A1 (en) * | 2021-07-21 | 2023-01-26 | Qualcomm Incorporated | Noise suppression using tandem networks |
US20230116052A1 (en) * | 2021-10-05 | 2023-04-13 | Microsoft Technology Licensing, Llc | Array geometry agnostic multi-channel personalized speech enhancement |
US11683634B1 (en) * | 2020-11-20 | 2023-06-20 | Meta Platforms Technologies, Llc | Joint suppression of interferences in audio signal |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106971740B (en) * | 2017-03-28 | 2019-11-15 | 吉林大学 | Sound enhancement method based on voice existing probability and phase estimation |
CN107393523B (en) * | 2017-07-28 | 2020-11-13 | 深圳市盛路物联通讯技术有限公司 | Noise monitoring method and system |
Citations (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6369758B1 (en) * | 2000-11-01 | 2002-04-09 | Unique Broadband Systems, Inc. | Adaptive antenna array for mobile communication |
US20040138882A1 (en) * | 2002-10-31 | 2004-07-15 | Seiko Epson Corporation | Acoustic model creating method, speech recognition apparatus, and vehicle having the speech recognition apparatus |
US20050238238A1 (en) * | 2002-07-19 | 2005-10-27 | Li-Qun Xu | Method and system for classification of semantic content of audio/video data |
US7072834B2 (en) * | 2002-04-05 | 2006-07-04 | Intel Corporation | Adapting to adverse acoustic environment in speech processing using playback training data |
US20060178874A1 (en) * | 2003-03-27 | 2006-08-10 | Taoufik En-Najjary | Method for analyzing fundamental frequency information and voice conversion method and system implementing said analysis method |
US20070055508A1 (en) * | 2005-09-03 | 2007-03-08 | Gn Resound A/S | Method and apparatus for improved estimation of non-stationary noise for speech enhancement |
US20090048824A1 (en) * | 2007-08-16 | 2009-02-19 | Kabushiki Kaisha Toshiba | Acoustic signal processing method and apparatus |
US20090136052A1 (en) * | 2007-11-27 | 2009-05-28 | David Clark Company Incorporated | Active Noise Cancellation Using a Predictive Approach |
US20090228272A1 (en) * | 2007-11-12 | 2009-09-10 | Tobias Herbig | System for distinguishing desired audio signals from noise |
US20090265168A1 (en) * | 2008-04-22 | 2009-10-22 | Electronics And Telecommunications Research Institute | Noise cancellation system and method |
US20100042563A1 (en) * | 2008-08-14 | 2010-02-18 | Gov't of USA represented by the Secretary of the Navy, Chief of Naval Research Office of Counsel co | Systems and methods of discovering mixtures of models within data and probabilistic classification of data according to the model mixture |
US20100057453A1 (en) * | 2006-11-16 | 2010-03-04 | International Business Machines Corporation | Voice activity detection system and method |
US7930178B2 (en) * | 2005-12-23 | 2011-04-19 | Microsoft Corporation | Speech modeling and enhancement based on magnitude-normalized spectra |
US20110096942A1 (en) * | 2009-10-23 | 2011-04-28 | Broadcom Corporation | Noise suppression system and method |
US20110123019A1 (en) * | 2009-11-20 | 2011-05-26 | Texas Instruments Incorporated | Method and apparatus for cross-talk resistant adaptive noise canceller |
US20110178798A1 (en) * | 2010-01-20 | 2011-07-21 | Microsoft Corporation | Adaptive ambient sound suppression and speech tracking |
US20110216089A1 (en) * | 2010-03-08 | 2011-09-08 | Henry Leung | Alignment of objects in augmented reality |
US20120093341A1 (en) * | 2010-10-19 | 2012-04-19 | Electronics And Telecommunications Research Institute | Apparatus and method for separating sound source |
US20120128168A1 (en) * | 2010-11-18 | 2012-05-24 | Texas Instruments Incorporated | Method and apparatus for noise and echo cancellation for two microphone system subject to cross-talk |
US20130121497A1 (en) * | 2009-11-20 | 2013-05-16 | Paris Smaragdis | System and Method for Acoustic Echo Cancellation Using Spectral Decomposition |
US20130132077A1 (en) * | 2011-05-27 | 2013-05-23 | Gautham J. Mysore | Semi-Supervised Source Separation Using Non-Negative Techniques |
US20140254816A1 (en) * | 2013-03-06 | 2014-09-11 | Qualcomm Incorporated | Content based noise suppression |
US20140286497A1 (en) * | 2013-03-15 | 2014-09-25 | Broadcom Corporation | Multi-microphone source tracking and noise suppression |
US20150071461A1 (en) * | 2013-03-15 | 2015-03-12 | Broadcom Corporation | Single-channel suppression of intefering sources |
US9008329B1 (en) * | 2010-01-26 | 2015-04-14 | Audience, Inc. | Noise reduction using multi-feature cluster tracker |
US9036826B2 (en) * | 2012-02-22 | 2015-05-19 | Broadcom Corporation | Echo cancellation using closed-form solutions |
Family Cites Families (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6041106A (en) | 1996-07-29 | 2000-03-21 | Elite Entry Phone Corp. | Access control apparatus for use with buildings, gated properties and the like |
GB2367730B (en) | 2000-10-06 | 2005-04-27 | Mitel Corp | Method and apparatus for minimizing far-end speech effects in hands-free telephony systems using acoustic beamforming |
JP3574123B2 (en) | 2001-03-28 | 2004-10-06 | 三菱電機株式会社 | Noise suppression device |
US7577262B2 (en) | 2002-11-18 | 2009-08-18 | Panasonic Corporation | Microphone device and audio player |
CA2561556A1 (en) | 2004-04-04 | 2005-10-13 | Ben Gurion University Of The Negev Research And Development Authority | Apparatus and method for the detection of one lung intubation by monitoring sounds |
PT1875463T (en) | 2005-04-22 | 2019-01-24 | Qualcomm Inc | Systems, methods, and apparatus for gain factor smoothing |
JP4670483B2 (en) | 2005-05-31 | 2011-04-13 | 日本電気株式会社 | Method and apparatus for noise suppression |
DE102005047047A1 (en) | 2005-09-30 | 2007-04-12 | Siemens Audiologische Technik Gmbh | Microphone calibration on a RGSC beamformer |
US9185487B2 (en) | 2006-01-30 | 2015-11-10 | Audience, Inc. | System and method for providing noise suppression utilizing null processing noise subtraction |
SG144752A1 (en) | 2007-01-12 | 2008-08-28 | Sony Corp | Audio enhancement method and system |
US8005238B2 (en) | 2007-03-22 | 2011-08-23 | Microsoft Corporation | Robust adaptive beamforming with enhanced noise suppression |
JP5086442B2 (en) | 2007-12-20 | 2012-11-28 | テレフオンアクチーボラゲット エル エム エリクソン(パブル) | Noise suppression method and apparatus |
US8503669B2 (en) | 2008-04-07 | 2013-08-06 | Sony Computer Entertainment Inc. | Integrated latency detection and echo cancellation |
US8170226B2 (en) | 2008-06-20 | 2012-05-01 | Microsoft Corporation | Acoustic echo cancellation and adaptive filters |
US8565446B1 (en) | 2010-01-12 | 2013-10-22 | Acoustic Technologies, Inc. | Estimating direction of arrival from plural microphones |
WO2012072637A1 (en) | 2010-12-01 | 2012-06-07 | Ibbt | Method and device for correlation channel estimation |
US8824692B2 (en) | 2011-04-20 | 2014-09-02 | Vocollect, Inc. | Self calibrating multi-element dipole microphone |
US9002027B2 (en) | 2011-06-27 | 2015-04-07 | Gentex Corporation | Space-time noise reduction system for use in a vehicle and method of forming same |
US20130163781A1 (en) | 2011-12-22 | 2013-06-27 | Broadcom Corporation | Breathing noise suppression for audio signals |
US8989755B2 (en) | 2013-02-26 | 2015-03-24 | Blackberry Limited | Methods of inter-cell resource sharing |
- 2014-11-13: US application US14/540,778 filed; granted as patent US9570087B2 (status: Active)
Patent Citations (27)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6369758B1 (en) * | 2000-11-01 | 2002-04-09 | Unique Broadband Systems, Inc. | Adaptive antenna array for mobile communication |
US7072834B2 (en) * | 2002-04-05 | 2006-07-04 | Intel Corporation | Adapting to adverse acoustic environment in speech processing using playback training data |
US20050238238A1 (en) * | 2002-07-19 | 2005-10-27 | Li-Qun Xu | Method and system for classification of semantic content of audio/video data |
US20040138882A1 (en) * | 2002-10-31 | 2004-07-15 | Seiko Epson Corporation | Acoustic model creating method, speech recognition apparatus, and vehicle having the speech recognition apparatus |
US20060178874A1 (en) * | 2003-03-27 | 2006-08-10 | Taoufik En-Najjary | Method for analyzing fundamental frequency information and voice conversion method and system implementing said analysis method |
US20070055508A1 (en) * | 2005-09-03 | 2007-03-08 | Gn Resound A/S | Method and apparatus for improved estimation of non-stationary noise for speech enhancement |
US7930178B2 (en) * | 2005-12-23 | 2011-04-19 | Microsoft Corporation | Speech modeling and enhancement based on magnitude-normalized spectra |
US20100057453A1 (en) * | 2006-11-16 | 2010-03-04 | International Business Machines Corporation | Voice activity detection system and method |
US20090048824A1 (en) * | 2007-08-16 | 2009-02-19 | Kabushiki Kaisha Toshiba | Acoustic signal processing method and apparatus |
US20090228272A1 (en) * | 2007-11-12 | 2009-09-10 | Tobias Herbig | System for distinguishing desired audio signals from noise |
US20090136052A1 (en) * | 2007-11-27 | 2009-05-28 | David Clark Company Incorporated | Active Noise Cancellation Using a Predictive Approach |
US20090265168A1 (en) * | 2008-04-22 | 2009-10-22 | Electronics And Telecommunications Research Institute | Noise cancellation system and method |
US20100042563A1 (en) * | 2008-08-14 | 2010-02-18 | Gov't of USA represented by the Secretary of the Navy, Chief of Naval Research Office of Counsel co | Systems and methods of discovering mixtures of models within data and probabilistic classification of data according to the model mixture |
US20110096942A1 (en) * | 2009-10-23 | 2011-04-28 | Broadcom Corporation | Noise suppression system and method |
US20130121497A1 (en) * | 2009-11-20 | 2013-05-16 | Paris Smaragdis | System and Method for Acoustic Echo Cancellation Using Spectral Decomposition |
US20110123019A1 (en) * | 2009-11-20 | 2011-05-26 | Texas Instruments Incorporated | Method and apparatus for cross-talk resistant adaptive noise canceller |
US20110178798A1 (en) * | 2010-01-20 | 2011-07-21 | Microsoft Corporation | Adaptive ambient sound suppression and speech tracking |
US9008329B1 (en) * | 2010-01-26 | 2015-04-14 | Audience, Inc. | Noise reduction using multi-feature cluster tracker |
US20110216089A1 (en) * | 2010-03-08 | 2011-09-08 | Henry Leung | Alignment of objects in augmented reality |
US20120093341A1 (en) * | 2010-10-19 | 2012-04-19 | Electronics And Telecommunications Research Institute | Apparatus and method for separating sound source |
US20120128168A1 (en) * | 2010-11-18 | 2012-05-24 | Texas Instruments Incorporated | Method and apparatus for noise and echo cancellation for two microphone system subject to cross-talk |
US20130132077A1 (en) * | 2011-05-27 | 2013-05-23 | Gautham J. Mysore | Semi-Supervised Source Separation Using Non-Negative Techniques |
US9036826B2 (en) * | 2012-02-22 | 2015-05-19 | Broadcom Corporation | Echo cancellation using closed-form solutions |
US9065895B2 (en) * | 2012-02-22 | 2015-06-23 | Broadcom Corporation | Non-linear echo cancellation |
US20140254816A1 (en) * | 2013-03-06 | 2014-09-11 | Qualcomm Incorporated | Content based noise suppression |
US20140286497A1 (en) * | 2013-03-15 | 2014-09-25 | Broadcom Corporation | Multi-microphone source tracking and noise suppression |
US20150071461A1 (en) * | 2013-03-15 | 2015-03-12 | Broadcom Corporation | Single-channel suppression of intefering sources |
Cited By (32)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9570087B2 (en) * | 2013-03-15 | 2017-02-14 | Broadcom Corporation | Single channel suppression of interfering sources |
US20150071461A1 (en) * | 2013-03-15 | 2015-03-12 | Broadcom Corporation | Single-channel suppression of intefering sources |
US20140286497A1 (en) * | 2013-03-15 | 2014-09-25 | Broadcom Corporation | Multi-microphone source tracking and noise suppression |
US9338551B2 (en) * | 2013-03-15 | 2016-05-10 | Broadcom Corporation | Multi-microphone source tracking and noise suppression |
US20160241955A1 (en) * | 2013-03-15 | 2016-08-18 | Broadcom Corporation | Multi-microphone source tracking and noise suppression |
US10141003B2 (en) * | 2014-06-09 | 2018-11-27 | Dolby Laboratories Licensing Corporation | Noise level estimation |
US20170103771A1 (en) * | 2014-06-09 | 2017-04-13 | Dolby Laboratories Licensing Corporation | Noise Level Estimation |
US9536539B2 (en) * | 2014-07-01 | 2017-01-03 | Industry-University Cooperation Foundation Hanyang University | Nonlinear acoustic echo signal suppression system and method using volterra filter |
US20160005419A1 (en) * | 2014-07-01 | 2016-01-07 | Industry-University Cooperation Foundation Hanyang University | Nonlinear acoustic echo signal suppression system and method using volterra filter |
US9564144B2 (en) * | 2014-07-24 | 2017-02-07 | Conexant Systems, Inc. | System and method for multichannel on-line unsupervised bayesian spectral filtering of real-world acoustic noise |
US20160029121A1 (en) * | 2014-07-24 | 2016-01-28 | Conexant Systems, Inc. | System and method for multichannel on-line unsupervised bayesian spectral filtering of real-world acoustic noise |
US10431240B2 (en) * | 2015-01-23 | 2019-10-01 | Samsung Electronics Co., Ltd | Speech enhancement method and system |
CN107817506A (en) * | 2016-09-13 | 2018-03-20 | 法国国家太空研究中心 | The multipaths restraint based on cepstrum of spread spectrum radiocommunication signal |
US10395667B2 (en) * | 2017-05-12 | 2019-08-27 | Cirrus Logic, Inc. | Correlation-based near-field detector |
US11074917B2 (en) * | 2017-10-30 | 2021-07-27 | Cirrus Logic, Inc. | Speaker identification |
EP3718106A4 (en) * | 2017-11-29 | 2021-12-01 | Nuance Communications, Inc. | System and method for speech enhancement in multisource environments |
US10482878B2 (en) | 2017-11-29 | 2019-11-19 | Nuance Communications, Inc. | System and method for speech enhancement in multisource environments |
WO2019108828A1 (en) | 2017-11-29 | 2019-06-06 | Nuance Communications, Inc. | System and method for speech enhancement in multisource environments |
US20220058503A1 (en) * | 2017-12-11 | 2022-02-24 | Adobe Inc. | Accurate and interpretable rules for user segmentation |
US20210081821A1 (en) * | 2018-03-16 | 2021-03-18 | Nippon Telegraph And Telephone Corporation | Information processing device and information processing method |
CN111031609A (en) * | 2018-10-10 | 2020-04-17 | 鹤壁天海电子信息系统有限公司 | Channel selection method and device |
US11025324B1 (en) * | 2020-04-15 | 2021-06-01 | Cirrus Logic, Inc. | Initialization of adaptive blocking matrix filters in a beamforming array using a priori information |
GB2594154A (en) * | 2020-04-15 | 2021-10-20 | Cirrus Logic Int Semiconductor Ltd | Initialization of adaptive blocking matrix filters in a beamforming array using a priori information |
GB2594154B (en) * | 2020-04-15 | 2022-08-03 | Cirrus Logic Int Semiconductor Ltd | Initialization of adaptive blocking matrix filters in a beamforming array using a priori information |
CN112017682A (en) * | 2020-09-18 | 2020-12-01 | 中科极限元(杭州)智能科技股份有限公司 | Single-channel voice simultaneous noise reduction and reverberation removal system |
CN112542177A (en) * | 2020-11-04 | 2021-03-23 | 北京百度网讯科技有限公司 | Signal enhancement method, device and storage medium |
US11683634B1 (en) * | 2020-11-20 | 2023-06-20 | Meta Platforms Technologies, Llc | Joint suppression of interferences in audio signal |
CN113221062A (en) * | 2021-04-07 | 2021-08-06 | 北京理工大学 | High-frequency motion error compensation algorithm of small unmanned aerial vehicle-mounted BiSAR system |
US20220392478A1 (en) * | 2021-06-07 | 2022-12-08 | Cisco Technology, Inc. | Speech enhancement techniques that maintain speech of near-field speakers |
US20230026735A1 (en) * | 2021-07-21 | 2023-01-26 | Qualcomm Incorporated | Noise suppression using tandem networks |
US11805360B2 (en) * | 2021-07-21 | 2023-10-31 | Qualcomm Incorporated | Noise suppression using tandem networks |
US20230116052A1 (en) * | 2021-10-05 | 2023-04-13 | Microsoft Technology Licensing, Llc | Array geometry agnostic multi-channel personalized speech enhancement |
Also Published As
Publication number | Publication date |
---|---|
US9570087B2 (en) | 2017-02-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9570087B2 (en) | Single channel suppression of interfering sources | |
CN111418010B (en) | Multi-microphone noise reduction method and device and terminal equipment | |
US10504539B2 (en) | Voice activity detection systems and methods | |
US10123113B2 (en) | Selective audio source enhancement | |
Parchami et al. | Recent developments in speech enhancement in the short-time Fourier transform domain | |
US8724829B2 (en) | Systems, methods, apparatus, and computer-readable media for coherence detection | |
US10049678B2 (en) | System and method for suppressing transient noise in a multichannel system | |
CN102938254B (en) | Voice signal enhancement system and method | |
US8239196B1 (en) | System and method for multi-channel multi-feature speech/noise classification for noise suppression | |
US7366662B2 (en) | Separation of target acoustic signals in a multi-transducer arrangement | |
US8880396B1 (en) | Spectrum reconstruction for automatic speech recognition | |
US20120245927A1 (en) | System and method for monaural audio processing based preserving speech information | |
US9564144B2 (en) | System and method for multichannel on-line unsupervised bayesian spectral filtering of real-world acoustic noise | |
US20190355373A1 (en) | 360-degree multi-source location detection, tracking and enhancement | |
US9520138B2 (en) | Adaptive modulation filtering for spectral feature enhancement | |
Martín-Doñas et al. | Dual-channel DNN-based speech enhancement for smartphones | |
López-Espejo et al. | Dual-channel spectral weighting for robust speech recognition in mobile devices | |
Wang et al. | A Semi-Blind Source Separation Approach for Speech Dereverberation. | |
Jeong et al. | Adaptive noise power spectrum estimation for compact dual channel speech enhancement | |
Li et al. | Speech separation based on reliable binaural cues with two-stage neural network in noisy-reverberant environments | |
US9936295B2 (en) | Electronic device, method and computer program | |
US20240212701A1 (en) | Estimating an optimized mask for processing acquired sound data | |
Prasad | Speech enhancement for multi microphone using kepstrum approach | |
Yoshioka et al. | A microphone array system integrating beamforming, feature enhancement, and spectral mask-based noise estimation | |
Li et al. | Microphone array speech enhancement based on optimized IMCRA |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: BROADCOM CORPORATION, CALIFORNIA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:THYSSEN, JES;BORGSTROM, BENGT J.;SIGNING DATES FROM 20150317 TO 20150324;REEL/FRAME:035240/0939 |
|
AS | Assignment |
Owner name: BANK OF AMERICA, N.A., AS COLLATERAL AGENT, NORTH CAROLINA Free format text: PATENT SECURITY AGREEMENT;ASSIGNOR:BROADCOM CORPORATION;REEL/FRAME:037806/0001 Effective date: 20160201 |
|
STCF | Information on status: patent grant |
Free format text: PATENTED CASE |
|
AS | Assignment |
Owner name: AVAGO TECHNOLOGIES GENERAL IP (SINGAPORE) PTE. LTD., SINGAPORE Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:BROADCOM CORPORATION;REEL/FRAME:041706/0001 Effective date: 20170120 |
|
AS | Assignment |
Owner name: BROADCOM CORPORATION, CALIFORNIA Free format text: TERMINATION AND RELEASE OF SECURITY INTEREST IN PATENTS;ASSIGNOR:BANK OF AMERICA, N.A., AS COLLATERAL AGENT;REEL/FRAME:041712/0001 Effective date: 20170119 |
|
AS | Assignment |
Owner name: AVAGO TECHNOLOGIES INTERNATIONAL SALES PTE. LIMITED Free format text: MERGER;ASSIGNOR:AVAGO TECHNOLOGIES GENERAL IP (SINGAPORE) PTE. LTD.;REEL/FRAME:047422/0464 Effective date: 20180509 |
|
AS | Assignment |
Owner name: AVAGO TECHNOLOGIES INTERNATIONAL SALES PTE. LIMITED Free format text: CORRECTIVE ASSIGNMENT TO CORRECT THE EXECUTION DATE PREVIOUSLY RECORDED AT REEL: 047422 FRAME: 0464. ASSIGNOR(S) HEREBY CONFIRMS THE MERGER;ASSIGNOR:AVAGO TECHNOLOGIES GENERAL IP (SINGAPORE) PTE. LTD.;REEL/FRAME:048883/0702 Effective date: 20180905 |
|
MAFP | Maintenance fee payment |
Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY Year of fee payment: 4 |