WO2012158156A1 - Noise suppression method and apparatus using multiple feature modeling for speech/noise likelihood - Google Patents

Noise suppression method and apparatus using multiple feature modeling for speech/noise likelihood

Info

Publication number
WO2012158156A1
Authority
WO
WIPO (PCT)
Prior art keywords
noise
feature
speech
frame
frames
Prior art date
Application number
PCT/US2011/036637
Other languages
French (fr)
Inventor
Marco Paniconi
Original Assignee
Google Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Google Inc. filed Critical Google Inc.
Priority to CN201180072331.0A priority Critical patent/CN103650040B/en
Priority to PCT/US2011/036637 priority patent/WO2012158156A1/en
Publication of WO2012158156A1 publication Critical patent/WO2012158156A1/en

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78Detection of presence or absence of voice signals
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering

Definitions

  • Noise suppression aims to remove or reduce surrounding background noise to enhance the clarity of the intended audio, thereby enhancing the comfort of the listener. In at least some embodiments of the present disclosure, noise suppression occurs in the frequency domain, where both noise estimation and noise filtering processes are performed.
  • A process of updating and adapting a speech/noise probability measure, for each input frame and frequency, that incorporates multiple speech/noise classification features (e.g., "signal classification features" or "noise-estimation features," as also referred to herein) into a feature-based probability provides a more accurate and robust estimation of speech/noise presence in the frame.
  • The terms "speech/noise classification features," "signal classification features," and "noise-estimation features" are interchangeable and refer to features that may be used (e.g., measured) to classify an input signal, for each frame and frequency, into a state of either speech or noise.
  • Systems and methods described herein provide noise suppression based on an estimation of the noise spectrum, together with a Wiener-type filter to suppress the estimated noise.
  • The noise spectrum may be estimated based on a model that classifies each time/frame and frequency component of a received signal as speech or noise using a speech/noise likelihood (e.g., probability) function.
  • The speech/noise probability function and its use in estimating the noise spectrum will be described in greater detail below.
  • a noise suppression module may be configured to perform various speech probability modeling processes as described herein. For example, for each input frame of speech received, the noise suppression module may perform the following processes on the frame: signal analysis, including buffering, windowing, and Fourier transformation; noise estimation and filtering, including determining an initial noise estimation, computing a speech/noise likelihood function, updating the initial noise estimation based on the speech/noise likelihood function, and suppressing the estimated noise using a Wiener type filter; and signal synthesis, including inverse Fourier transformation, scaling, and window synthesis. Additionally, the noise suppression module may be further configured to generate, as output of the above processes, an estimated speech frame.
  • FIG. 1 and the following discussion provide a brief, general description of a representative embodiment in which aspects of the present disclosure may be implemented.
  • a noise suppression module 40 may be located at the near-end environment of a signal transmission path, along with a capture device 5 also at the near-end and a render device 30 located at the far-end environment.
  • noise suppression module 40 may be one component in a larger system for audio (e.g., voice) communications.
  • the noise suppression module 40 may be an independent component in such a larger system or may be a subcomponent within an independent component (not shown) of the system.
  • In the example embodiment illustrated in FIG. 1, noise suppression module 40 is arranged to receive and process input from capture device 5 and generate output to, e.g., one or more other audio processing components (not shown).
  • these other audio processing components may be acoustic echo control (AEC), automatic gain control (AGC), and/or other voice quality improvement components.
  • these other processing components may receive input from capture device 5 prior to noise suppression module 40 receiving such input.
  • Capture device 5 may be any of a variety of audio input devices, such as one or more microphones configured to capture sound and generate input signals.
  • Render device 30 may be any of a variety of audio output devices, including a loudspeaker or group of loudspeakers configured to output sound of one or more channels.
  • capture device 5 and render device 30 may be hardware devices internal to a computer system, or external peripheral devices connected to a computer system via wired and/or wireless connections.
  • capture device 5 and render device 30 may be components of a single device, such as a speakerphone, telephone handset, etc.
  • one or both of capture device 5 and render device 30 may include analog-to-digital and/or digital-to-analog transformation functionalities.
  • noise suppression module 40 includes a controller 50 for coordinating various processes and timing considerations.
  • Noise suppression module 40 may also include a signal analysis unit 10, a noise estimation unit 15, a Wiener filter 20, and a signal synthesis unit 25. Each of these units may be in communication with controller 50 such that controller 50 can facilitate some of the processes described herein.
  • Various details of the signal analysis unit 10, noise estimation unit 15, Wiener filter 20, and signal synthesis unit 25 will be further described below.
  • Other units may be included as part of noise suppression module 40, in addition to or instead of those illustrated in FIG. 1.
  • the names used to identify the units included as part of noise suppression module 40 are exemplary in nature, and are not intended to limit the scope of the disclosure.
  • FIG. 2 is a flow diagram illustrating an example embodiment of an overall noise suppression system and method of the present disclosure.
  • the noise suppression system shown in FIG. 2 includes three main processes: signal analysis 270, noise estimating and filtering 275, and signal synthesis 280.
  • the signal analysis process 270 may include various pre-processing that must be performed on input frame 200 to allow noise suppression to proceed in the frequency domain.
  • signal analysis 270 may include the preprocessing steps of buffering 205, windowing 210, and the Discrete Fourier Transform (DFT) 215.
  • The noise estimation and filtering process 275 includes: initial noise estimation 220; decision-directed (DD) update of post and prior SNRs 225; speech/noise likelihood determination 230, which is based on a likelihood ratio (LR) factor determined using the post and prior SNRs with a speech probability density function (PDF) model 235 (e.g., Gaussian, Laplacian, Gamma, Super-Gaussian, etc.) and a probability term determined from feature modeling 240; noise estimate update 245; and application of Wiener gain filter 250.
  • The signal synthesis process 280, which is needed to convert input frame 200 back to the time domain, includes inverse Discrete Fourier Transform 255, scaling 260, and window synthesis 265 steps.
  • the result of signal synthesis process 280 is output frame 290, which is an estimated speech frame.
  • This model assumes the (unknown) speech signal $x(t)$ is corrupted with additive noise $n(t)$ that is uncorrelated with the speech signal: $y(t) = x(t) + n(t)$.
  • In the frequency domain, the above model equation takes the following form: $Y_k(m) = X_k(m) + N_k(m)$, where $k$ denotes the frequency and $m$ represents the frame index (e.g., the frame number used in short-time window DFT 215, described in greater detail below).
  • Signal analysis 270 may include various pre-processing steps so as to allow noise suppression to be performed in the frequency domain, rather than in the time-domain.
  • The analysis buffer also contains previous data (e.g., a portion of the previous frame, such as previous data 330 from frame 305 shown in FIG. 3), the details of which will be further described below.
  • the noise suppression system shown in FIG. 2 is a real-time system that operates on a frame basis, where data is buffered and analyzed when a frame (e.g., input frame 200) is received.
  • The frame size of input frame 200 is 10 milliseconds (ms). For a sampling rate of 8 kHz this is equivalent to 80 samples, and for a sampling rate of 16 kHz, to 160 samples.
  • The noise suppression system described herein and illustrated in FIG. 2 may alternatively and/or additionally support other input frame sizes, such as 15 ms, 20 ms, and 30 ms. For clarity purposes, the following description is based on input frame 200 having a frame size of 10 ms.
  • FIG. 3 is a schematic diagram showing examples of buffering 205 and windowing 210 steps as described herein.
  • FIG. 3 shows how data is buffered and windowed when the sampling rate is 8kHz and only one single frame is being analyzed.
  • A new frame of data 305 has a frame size of 80 samples and is added to buffer 320, which has a size of 128 samples.
  • The windowing function 310 is displayed below the expanded buffers.
  • The analyzing buffers (e.g., buffer 320 shown in FIG. 3) are larger than the frames (e.g., frame 305 shown in FIG. 3), so each buffer also contains previous data 330, which in the example illustrated includes the previous forty-eight samples from frame 305.
  • The overlap also places constraints on the synthesis. For example, when overlapping buffer sections are added, such as frame 305, the signals must be windowed to avoid abrupt changes.
  • In general, any overlap between analyzing buffers requires windowing.
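  • The following minimal sketch (Python/NumPy) illustrates the buffering and windowing arrangement of FIG. 3 for the 8 kHz case: a 128-sample analysis buffer built from an 80-sample frame plus the 48-sample tail of the previous frame. The Hann window and the function names are illustrative assumptions; the patent does not specify a particular window here.

```python
import numpy as np

FRAME = 80                 # 10 ms at 8 kHz
BUFFER = 128               # analysis buffer size
OVERLAP = BUFFER - FRAME   # 48 samples carried over from the previous frame

window = np.hanning(BUFFER)   # assumed smooth taper; avoids abrupt edges

def analyze(frame, prev_tail):
    """Form the 128-sample analysis buffer from the last 48 samples of the
    previous frame plus the new 80-sample frame, then window it."""
    buf = np.concatenate([prev_tail, frame])
    assert buf.size == BUFFER
    return buf * window, buf[-OVERLAP:]   # windowed buffer, tail for next frame

# usage: stream 10 ms frames through the analyzer
tail = np.zeros(OVERLAP)
windowed, tail = analyze(np.random.randn(FRAME), tail)
mag = np.abs(np.fft.rfft(windowed))       # magnitude spectrum for DFT step 215
```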
  • Noise estimation and suppression processes are performed in the frequency domain. Transformation of input frame 200 to the frequency domain is accomplished in DFT step 215 of signal analysis process 270 by taking the DFT of the windowed data: $Y_k(m) = \sum_{n=0}^{N-1} w(n)\, y(n,m)\, e^{-j 2\pi k n / N}$, where $w(n)$ is the windowing function and $N$ is the buffer length.
  • the frequency bin index (sub-band) is given by k.
  • The process described herein is only concerned with the magnitude of the frequency response, $|Y_k(m)|$.
  • the noise estimation and filtering process 275 of the system shown in FIG. 2 classifies each input frame 200 of a received signal as either speech or noise using a speech probability model that incorporates multiple features of the signal.
  • This speech/noise classification is defined for every time/frame and frequency, and is realized through a speech/noise probability function further described below. Given the speech/noise classification, an initial estimation of the noise spectrum is updated more heavily during pause (noise) regions in the signal, resulting in a smoother-sounding residual noise (e.g., less musical noise) and a more accurate and robust measure of the noise spectrum for non-stationary noise sources.
  • Noise estimation and filtering process 275 includes the following steps: initial noise estimation 220, decision-directed (DD) update of post and prior SNRs 225, speech/noise likelihood determination 230, which is based on a likelihood ratio (LR) factor determined using the post and prior SNRs with a speech probability density function (PDF) model 235 (e.g., Gaussian) and a probability term determined from feature modeling 240, noise estimate update 245, and applying a Wiener gain filter 250.
  • Initial noise estimation 220 is based on a quantile noise estimation.
  • The noise estimate is controlled by the quantile parameter, which is denoted as $q$.
  • The noise estimate determined from initial noise estimation step 220 is only used as an initial condition for subsequent processing that improves the noise update/estimation.
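  • As a sketch of how such a quantile-controlled estimate might be maintained, the following uses a common stochastic-approximation update in the log-magnitude domain. The patent specifies only that the estimate is controlled by the quantile parameter q; the step size and log-domain tracking are assumptions.

```python
import numpy as np

def quantile_noise_update(log_noise, mag, q=0.25, step=0.05):
    """One stochastic-approximation update of a per-bin quantile estimate,
    tracked in the log-magnitude domain: bins whose observed log magnitude
    exceeds the estimate push it up in proportion to q, the rest push it
    down in proportion to (1 - q), so over many frames the estimate
    converges toward the q-th quantile of log|Y_k(m)|."""
    log_mag = np.log(np.maximum(mag, 1e-10))
    return log_noise + step * np.where(log_mag > log_noise, q, -(1.0 - q))

# the initial noise spectrum estimate for the frame is then np.exp(log_noise)
```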
  • Filters for noise suppression processing may generally be expressed in terms of an a priori SNR (prior SNR) and an a posteriori SNR (post SNR). Accordingly, prior and post SNR quantities need to be estimated before any actual suppression is performed. As will be further described below, prior and post SNR quantities are also needed for the speech/noise likelihood determination step 230 of the noise estimation and filtering process 275.
  • The post SNR may be defined as the observed noisy signal power relative to the noise power, $\sigma_k(m) = |Y_k(m)|^2 / \hat{N}_k(m)^2$, and the prior SNR as the expectation value of the clean (unknown) signal power spectrum relative to the noise power spectrum: $\rho_k(m) = E\big[|X_k(m)|^2\big] / \hat{N}_k(m)^2$, where $X_k(m)$ denotes the spectral coefficients of the unknown clean speech signal.
  • the noise power spectrum in each of the post and prior SNRs expressed above may be obtained from the initial estimated noise spectrum determined in initial noise estimation step 220, which was based on a quantile estimation.
  • The post and prior SNR may be expressed using magnitude quantities in place of the squared magnitudes shown in the above computations: $\sigma_k(m) = |Y_k(m)| / \hat{N}_k(m)$ and $\rho_k(m) = E\big[|X_k(m)|\big] / \hat{N}_k(m)$.
  • The natural estimate for the prior SNR is the average of the estimated prior SNR at the previous frame (e.g., the input frame processed through the system shown in FIG. 2 immediately prior to input frame 200) and the instantaneous SNR, $\sigma_k(m) - 1$: $\hat{\rho}_k(m) = \gamma_{dd}\,\hat{\rho}_k(m-1) + (1 - \gamma_{dd})\,\max\big(\sigma_k(m) - 1,\ 0\big)$.
  • The above expression may be taken as the decision-directed (DD) update of the prior SNR 225 step of the noise estimation and filtering process 275, with a temporal smoothing parameter $\gamma_{dd}$.
  • The prior SNR is a smooth version of the post SNR, with some amount of time-lag. A larger $\gamma_{dd}$ increases the smoothing but also increases the lag.
  • In some embodiments, the value used for the smoothing parameter is $\gamma_{dd} = 0.98$.
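  • A minimal sketch of the DD update, using the smoothing value given above; flooring the instantaneous SNR at zero is a standard safeguard and an assumption here.

```python
import numpy as np

GAMMA_DD = 0.98   # temporal smoothing parameter from the text

def dd_update_prior_snr(prior_prev, post):
    """Decision-directed update of the prior SNR: a smoothed combination of
    the previous frame's prior SNR estimate and the instantaneous SNR
    (post SNR minus one, floored at zero)."""
    instantaneous = np.maximum(post - 1.0, 0.0)
    return GAMMA_DD * prior_prev + (1.0 - GAMMA_DD) * instantaneous
```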
  • the prior and post SNRs described and defined above are elements of speech/noise likelihood determination step 230 of noise estimation and filtering process 275.
  • the speech/noise likelihood involves two factors: (1) LR (likelihood ratio) factor, determined from the prior and post SNRs, and (2) a probability term based on feature modeling, which will be described in greater detail below.
  • the speech and noise states are defined for every frame m and frequency bin k.
  • The probability of the speech/noise state can be expressed as $P\big(H \mid Y_k(m), \{F\}\big)$, where $H = H_1$ denotes the speech state and $H = H_0$ the noise state.
  • the probability of speech/noise is conditioned on the observed noise input spectral coefficient, Yk(m), and some feature data of the signal (e.g., signal classification features) being processed, which in the present example is denoted as ⁇ F ⁇ .
  • the above expression for the speech/noise likelihood is also referred to herein as the "speech probability function.”
  • the feature data may be any functional of the noisy input spectrum, past spectrum data, model data, off-line data, etc.
  • feature data ⁇ F ⁇ may include spectral flatness measures, harmonic peak pitch, LPC residual, template matching, and the like.
  • The above quantity, $q_{k,m}(H_1 \mid \{F\})$, is also referred to as the "feature-based speech probability." Ignoring the prior probability based on $\{F\}$, denoting $q_{k,m}(H_1 \mid \{F\})$ as $q$ for notational simplicity, and assuming a Gaussian PDF for the complex coefficients $\{X_k, N_k\}$, we have for the quantities $P\big(Y_k(m) \mid H, \{F\}\big)$ the following: $P\big(Y_k(m) \mid H_0\big) = \frac{1}{\pi \lambda_N}\, e^{-|Y_k(m)|^2 / \lambda_N}$ and $P\big(Y_k(m) \mid H_1\big) = \frac{1}{\pi (\lambda_N + \lambda_X)}\, e^{-|Y_k(m)|^2 / (\lambda_N + \lambda_X)}$, where $\lambda_N$ and $\lambda_X$ denote the noise and speech spectral variances.
  • Because the probability may be fully determined from the linear model and the Gaussian PDF assumption, the feature dependency may be removed from the above expression.
  • The likelihood ratio then becomes: $\Delta_k(m) = \frac{P\big(Y_k(m) \mid H_1\big)}{P\big(Y_k(m) \mid H_0\big)} = \frac{\exp\!\big(\sigma_k(m)\,\rho_k(m) / (1 + \rho_k(m))\big)}{1 + \rho_k(m)}$, where $\rho_k(m)$ is the SNR of the unknown signal (e.g., prior SNR) and $\sigma_k(m)$ is the a posteriori signal SNR (e.g., post SNR or instantaneous SNR) for frequency $k$ and frame $m$.
  • In some embodiments, both the prior SNR and post SNR used in the above expression are approximated by the magnitude definitions reproduced above: $\sigma_k(m) = |Y_k(m)| / \hat{N}_k(m)$ and $\rho_k(m) = E\big[|X_k(m)|\big] / \hat{N}_k(m)$.
  • The speech/noise state probability may be obtained from the likelihood ratio $\Delta_k(m)$, which is determined from frequency-dependent post and prior SNRs, and the quantity $q_{k,m}(H_1 \mid \{F\}) \equiv q$, which is a feature-based or model-based probability that will be described in greater detail below. Accordingly, the speech/noise state probability may be expressed as: $P\big(H_1 \mid Y_k(m), \{F\}\big) = \frac{q\,\Delta_k(m)}{1 - q + q\,\Delta_k(m)}$.
  • The geometric average (over all frequencies) of the time-smoothened LR factor may be used as a reliable measure of frame-based speech/noise classification: $\tilde{\Delta}(m) = \exp\!\Big(\frac{1}{N}\sum_{k} \log \tilde{\Delta}_k(m)\Big)$, where $\tilde{\Delta}_k(m)$ denotes the time-smoothened LR factor for bin $k$.
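  • The sketch below collects the likelihood-ratio pieces above: the Gaussian-model LR factor, its combination with the feature-based prior q, and the frame-level geometric average. The overflow clip and floor constants are assumptions added for numerical safety.

```python
import numpy as np

def likelihood_ratio(prior, post):
    """Per-bin LR factor under the Gaussian PDF model:
    LR = exp(post * prior / (1 + prior)) / (1 + prior)."""
    expo = np.clip(post * prior / (1.0 + prior), 0.0, 50.0)  # avoid overflow
    return np.exp(expo) / (1.0 + prior)

def speech_probability(lr, q):
    """Combine the LR factor with the feature-based prior q:
    P(H1 | Y, {F}) = q * LR / (1 - q + q * LR)."""
    return q * lr / (1.0 - q + q * lr)

def average_lrt(lr_smoothed):
    """Geometric average over frequency of the time-smoothened LR factor,
    used as the frame-level speech/noise indicator."""
    return np.exp(np.mean(np.log(np.maximum(lr_smoothed, 1e-10))))
```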
  • the LR may be derived in speech/noise likelihood determination step 230 using, for example, the Gaussian assumption as speech PDF model 235.
  • other models of speech PDF may be used as the basis for measuring the LR, such as Laplacian, Gamma, and/or Super-Gaussian.
  • While the Gaussian assumption may be reasonable to use with noise, the assumption is not always true for speech, especially on small time frames (e.g., ~10 ms).
  • another model of speech PDF may be used; however, most likely at the cost of increased complexity.
  • determining the speech/noise likelihood (or probability) 230 during the noise estimation and filtering process 275 is driven not only by local SNR (e.g., prior and instantaneous SNRs), but also incorporates speech model/knowledge derived from feature modeling 240. Incorporating speech model knowledge into the speech/noise probability determination allows the noise suppression processing described herein to better handle and/or differentiate cases of high non-stationary noise levels, where relying only on local SNRs may incorrectly bias the likelihood.
  • The system uses a process of updating and adapting the feature-based probability $q_{k,m}(H_1 \mid \{F\})$ for each frame and frequency that incorporates local SNR and speech feature/model data.
  • For simplicity, the notation $q_m$ is used: because the process as described herein only models and updates the quantity on a frame basis, the $k$ variable is suppressed.
  • An update of the feature-based probability may be modeled as: $q_m = \gamma_p\, q_{m-1} + (1 - \gamma_p)\, M(z)$, where $\gamma_p$ is a smoothing constant and $M(z)$ is the map function (e.g., between 0 and 1) of the measured feature $z$ for the given time and frequency.
  • the parameter w characterizes the shape/width of the map function.
  • the map function biases the time-frequency bin to either speech (M close to 1) or noise (M close to 0), based on the measured feature and the threshold and width parameter.
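  • A sketch of the map and update, assuming a logistic sigmoid for M(z) and a weighted sum over features; the smoothing value gamma_p and the weighted-combination form are assumptions consistent with the weighting terms described below.

```python
import numpy as np

def sigmoid_map(z, threshold, width):
    """Map a measured feature value z to (0, 1): values past the threshold
    bias toward speech (M near 1), the rest toward noise (M near 0). The
    width controls the sharpness; a negative width flips the orientation
    for features where smaller values indicate speech."""
    return 1.0 / (1.0 + np.exp(-(z - threshold) / width))

def update_feature_prob(q_prev, features, thresholds, widths, weights,
                        gamma_p=0.9):
    """Recursive-average update of the frame-level feature-based speech
    probability q_m, combining the mapped features with their weights."""
    m = sum(tau * sigmoid_map(z, t, w)
            for z, t, w, tau in zip(features, thresholds, widths, weights))
    return gamma_p * q_prev + (1.0 - gamma_p) * m
```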
  • Noise estimation and filtering process 275 considers the following features of a speech signal in performing feature modeling 240 for speech/noise likelihood 230 determination: (1) average LRT, which may be based on local SNR, (2) spectral flatness, which may be based on a harmonic model of speech, and (3) spectral-template difference measure. These three features will be described in greater detail below. It should be understood that numerous other features of the speech signal may also be used in addition to or instead of the three example features described below.
  • The first feature, the average LRT, is the geometric average over frequency of the time-smoothened likelihood ratio (LR), as defined above.
  • spectral flatness For purposes of the spectral flatness feature, it is assumed that speech is likely to have more harmonic behavior than noise. Whereas the speech spectrum typically shows peaks at the fundamental frequency (pitch) and harmonics, the noise spectrum tends to be relatively flat in comparison. Accordingly, in at least some arrangements, measures of local spectral flatness may collectively be used as a good indicator/classifier of speech and noise.
  • N represents the number of frequency bins and B represents the number of bands.
  • the index for a frequency bin is k and the index for a band is j.
  • Each band will contain a number of bins.
  • the frequency spectrum of 128 bins can be divided into 4 bands (e.g., low band, low-middle band, high-middle band, and high band) each containing 32 bins. In another example, only one band containing all the frequencies is used.
  • The spectral flatness may be computed as the ratio of the geometric mean to the arithmetic mean of the input magnitude spectrum: $F_2 = \frac{\big(\prod_{k=0}^{N-1} |Y_k(m)|\big)^{1/N}}{\frac{1}{N}\sum_{k=0}^{N-1} |Y_k(m)|}$, where $N$ represents the number of frequencies in the band.
  • The computed quantity $F_2$ will tend to be larger and constant for noise, and smaller and more variable for speech.
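  • A sketch of the computation, including the optional split into bands described above; the function name and numerical floor are illustrative.

```python
import numpy as np

def spectral_flatness(mag, n_bands=1):
    """Per-band ratio of the geometric mean to the arithmetic mean of the
    input magnitude spectrum: near 1 for flat (noise-like) bands, smaller
    for peaky (harmonic, speech-like) bands."""
    mag = np.maximum(mag, 1e-10)
    return np.array([np.exp(np.mean(np.log(band))) / np.mean(band)
                     for band in np.array_split(mag, n_bands)])

# e.g., a 128-bin spectrum split into 4 bands of 32 bins each:
# spectral_flatness(mag, n_bands=4)
```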
  • The map function $M(z)$ for the update to the feature-based prior probability is a sigmoid-type function of the measured feature, e.g., $M(z) = \frac{1}{1 + e^{-w\,(z - T)}}$, with threshold parameter $T$ and width parameter $w$.
  • This third feature may be determined by comparing the input spectrum with a template learned noise spectrum.
  • The template spectrum is determined by updating the spectrum, which is initially set to zero, over segments that have a strong likelihood of being noise or pause in speech. A result of the comparison is a conservative noise estimate, where the noise is only updated for segments where the speech probability is determined to be below a threshold (e.g., $P\big(H_1 \mid Y_k(m), \{F\}\big) < \chi$).
  • the template spectrum may also be input to the algorithm or selected from a table of shapes corresponding to different noises.
  • The spectral template difference feature may be obtained by initially defining the spectral difference measure as $J = \sum_k \big(|Y_k(m)| - \alpha\,\hat{N}^{tmpl}_k - \mu\big)^2$, where $(\alpha, \mu)$ are shape parameters, such as linear shift and amplitude parameters, obtained by minimizing $J$. Parameters $(\alpha, \mu)$ are obtained from a linear equation, and therefore are easily extracted for each frame. In some examples, the parameters account for any simple shift/scale changes of the input spectrum (e.g., if the volume increases). The feature is then the normalized measure, e.g., $F_3 = J / \sum_k |Y_k(m)|^2$.
  • the spectral template difference feature measures the difference/deviation of the template or learned noise spectrum from the input spectrum.
  • This spectral template difference feature may be used to modify the speech/noise feature-based probability. If $F_3$ is small, then the input frame spectrum is taken as being "close to" the template spectrum, and the frame is considered to be more likely noise.
  • If the spectral template difference feature is large, the input frame (e.g., input frame 200) spectrum is very different from the noise template spectrum, and the frame is considered to be speech.
  • The template spectrum may be input to the speech/noise probability algorithm or instead measured and updated on-line.
  • Mapping the spectral template difference feature value to a probability weight may be done using the same sigmoid function described above. It is important to note that the spectral template difference feature measure is more general than the spectral flatness feature measure. In the case of a template with a constant (e.g., nearly perfectly flat) spectrum, the spectral template difference feature reduces to a measure of the spectral flatness.
  • A weighting term $w_k$ may be added to the spectral template difference measure to emphasize certain bands in the spectrum: $J = \sum_k w_k\,\big(|Y_k(m)| - \alpha\,\hat{N}^{tmpl}_k - \mu\big)^2$.
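  • A sketch of the full measure: a weighted least-squares fit of the template over the shape parameters, with the residual normalized by the input energy (the normalization choice is an assumption).

```python
import numpy as np

def spectral_template_difference(mag, template, band_weights=None):
    """Fit the template noise spectrum to the input magnitude spectrum over
    the shape parameters (amplitude alpha, shift mu) by weighted least
    squares, then return the residual J normalized by the input energy."""
    w = np.ones_like(mag) if band_weights is None else band_weights
    # Normal equations of the weighted linear fit mag ~ alpha*template + mu.
    A = np.stack([template, np.ones_like(template)], axis=1)
    alpha, mu = np.linalg.solve(A.T @ (w[:, None] * A), A.T @ (w * mag))
    J = np.sum(w * (mag - (alpha * template + mu)) ** 2)
    return J / np.maximum(np.sum(mag ** 2), 1e-10)  # normalized measure F3
```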
  • The different features, which arise from different cues (e.g., the different information conveyed by the different features, such as the energy measurement or local SNR conveyed by the first feature, the spectral flatness of the noise conveyed by the second feature, and the stationarity and general shape of the noise conveyed by the third feature), may complement each other to provide a more robust and adaptive update of the speech/noise probability.
  • The update model of the speech/noise probability shown above includes various weighting terms $\{\tau_i\}$, threshold parameters $\{T_i\}$, and width parameters $\{w_i\}$ for the map functions.
  • noise estimate update 245 (e.g., a soft-decision recursive noise update) is performed.
  • Noise estimate update 245 may then follow as: $\hat{N}_k(m+1) = \gamma_n\,\hat{N}_k(m) + (1 - \gamma_n)\Big[ P\big(H_1 \mid Y_k(m), \{F\}\big)\,\hat{N}_k(m) + \Big(1 - P\big(H_1 \mid Y_k(m), \{F\}\big)\Big)\,|Y_k(m)| \Big]$.
  • The parameter $\gamma_n$ controls the smoothing of the noise update, and the second term updates the noise with both the input spectrum and the previous noise estimation, weighted according to the probability of speech/noise which, as described above, is given by $P\big(H_1 \mid Y_k(m), \{F\}\big)$.
  • the noise estimation model above updates the noise at every frame and frequency bin where the noise likelihood is large (e.g., where the speech likelihood is small). Where the noise likelihood is not found to be large, the noise estimate is taken as the estimate obtained from the previous frame in the signal.
  • This noise estimate update process is controlled by the speech/noise likelihood and the smoothing parameter $\gamma_n$, set, for example, to 0.85.
  • The smoothing parameter may be increased to $\gamma_n \approx 0.99$ for regions where the speech probability is found to be above a threshold parameter $\chi$, to prevent the noise level from increasing too much at speech onsets.
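  • A sketch of the soft-decision update with the smoothing values given above; the speech-probability threshold value chi is an assumption.

```python
import numpy as np

def update_noise_estimate(noise_prev, mag, p_speech,
                          gamma_n=0.85, gamma_onset=0.99, chi=0.8):
    """Soft-decision recursive noise update: each bin blends the previous
    noise estimate and the observed magnitude according to the speech
    probability, with heavier smoothing where speech is likely so the
    noise level does not grow at speech onsets (chi is assumed)."""
    gamma = np.where(p_speech > chi, gamma_onset, gamma_n)
    blended = p_speech * noise_prev + (1.0 - p_speech) * mag
    return gamma * noise_prev + (1.0 - gamma) * blended
```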
  • the noise estimation and filtering process 275 applies a Wiener gain filter 250 to reduce or remove the estimated amount of noise from input frame 200.
  • The standard Wiener filter is given as: $H_k(m) = \frac{|X_k(m)|^2}{|X_k(m)|^2 + |\hat{N}_k(m)|^2} = \frac{|Y_k(m)|^2 - |\hat{N}_k(m)|^2}{|Y_k(m)|^2}$, where $\hat{N}_k(m)$ is the estimated noise spectral coefficient, $Y_k(m)$ is the observed noisy spectral coefficient, and $X_k(m)$ is the clean speech spectrum, at frame $m$ and frequency $k$.
  • The squared magnitude may then be replaced by the magnitude, and the Wiener filter becomes: $H_k(m) = \frac{|Y_k(m)| - \hat{N}_k(m)}{|Y_k(m)|}$.
  • the Wiener filter is expressed in terms of the prior SNR and a decision-directed (DD) update is used to time-average the prior SNR.
  • The Wiener filter can be expressed in terms of the prior SNR as: $H_k(m) = \frac{\rho_k(m)}{1 + \rho_k(m)}$, where $\rho_k(m)$ represents the prior SNR as defined above, with the noise spectrum replaced by the estimated noise spectrum: $\rho_k(m) = E\big[|X_k(m)|\big] / \hat{N}_k(m)$.
  • A flooring parameter of the Wiener filter is defined based on the aggressiveness (e.g., the mode) of the noise suppressor (e.g., noise suppression module 40 shown in FIG. 1) implemented within the noise suppression system.
  • the Wiener filter is applied to the input magnitude spectrum to obtain a suppressed signal (e.g., an estimate of the underlying speech signal).
  • Application of the Wiener filter 250 in noise estimation and filtering process 275 yields the estimated speech magnitude spectrum: $|\hat{X}_k(m)| = H_k(m)\,|Y_k(m)|$.
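  • A sketch of the gain and its application; the flooring value is an assumption standing in for the mode-dependent parameter described above.

```python
import numpy as np

def wiener_gain(prior, floor=0.05):
    """Wiener gain in terms of the prior SNR, H = p / (1 + p), with an
    assumed flooring term that caps the attenuation per the suppression
    mode."""
    return np.maximum(prior / (1.0 + prior), floor)

# applying the filter to the noisy magnitude spectrum:
# suppressed_mag = wiener_gain(prior) * mag
```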
  • Signal synthesis 280 includes various post-noise suppression processing to generate output frame 290, which includes clean speech.
  • inverse DFT 255 is used to convert the frame back to the time-domain.
  • Conversion back to the time domain is performed as: $\hat{x}(n,m) = \frac{1}{N}\sum_{k=0}^{N-1} \hat{X}_k(m)\, e^{j 2\pi k n / N}$, where $\hat{X}_k(m)$ is the estimated speech after suppression with the Wiener filter, and $\hat{x}(n,m)$ is the corresponding time-domain signal, for time index $n$ and frame index $m$.
  • energy scaling 260 is performed on the noise- suppressed signal as part of the signal synthesis process 280.
  • Energy scaling may be used to help rebuild speech frames in a manner that increases the power of the speech after suppression. For example, scaling may be performed on the basis that only speech frames are to be amplified to a certain extent, and noise frames are to be left alone. Because noise suppression may reduce the speech signal level, some amplification of speech segments during scaling 260 is beneficial.
  • Scaling 260 is performed on a speech frame based on the energy lost in the frame due to the noise estimation and filtering process 275. The gain may be determined by a ratio of the energy in the frame before and after noise suppression processing, e.g., $K(m) = \frac{\sum_n y^2(n,m)}{\sum_n \hat{x}^2(n,m)}$.
  • A scale may be extracted according to the following model: $\mathrm{scale}(m) = A(K)\,P(H_1 \mid m) + B(K)\,\big(1 - P(H_1 \mid m)\big)$.
  • Here, $P(H_1 \mid m)$ is the probability of speech for frame $m$, obtained by averaging the speech probability function $P\big(H_1 \mid Y_k(m), \{F\}\big)$ over frequency.
  • The first term in the above scale equation will be large if the probability $P(H_1 \mid m)$ is large (e.g., the frame is likely speech).
  • The parameters $A(K)$, $B(K)$ control the scaling for the input frame (e.g., input frame 200).
  • $A(K)$ may be set larger than 1.0, so likely-speech frames are amplified, while the parameter $B(K) = 1.0$, so the frame is not scaled for noise regions.
  • the scale for these regions may be determined by a flooring term in the Wiener filter.
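  • A sketch of the scaling step; the specific form of A(K) below (partially restoring lost speech energy, capped) is an assumption, while B(K) = 1.0 follows the text.

```python
import numpy as np

def energy_scale(clean_td, noisy_td, p_speech_frame):
    """Frame-level energy scaling after suppression: blend a speech gain
    A(K) and a noise gain B(K) according to the frame speech probability,
    where K is the frame energy ratio before/after suppression."""
    K = np.sum(noisy_td ** 2) / (np.sum(clean_td ** 2) + 1e-10)
    A = min(np.sqrt(K), 2.0)   # assumed: amplify speech toward input energy
    B = 1.0                    # noise frames are not scaled
    return clean_td * (A * p_speech_frame + B * (1.0 - p_speech_frame))
```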
  • Signal synthesis 280 also includes window synthesis operation 265, which provides the final output frame 290 of estimated speech.
  • In at least one embodiment, window synthesis 265 applies a synthesis window to the filtered time-domain data and overlap-adds it with the stored tail of the previous frame to produce the final output samples.
  • The map function also contains a width parameter $\{w_i\}$ to control the shape of the map function, as in the sigmoid expression given above.
  • Table 1 presents example parameter settings according to various embodiments of the disclosure. Table 1 identifies each parameter and provides a brief description and an example default value for each parameter. It should be understood that various other parameter settings and/or default values may also be used in addition to or instead of those presented in Table 1 below.
  • $T_1$: threshold for the LR feature. Initial value: 0.5; modified on-line.
  • feature threshold and weighting parameters for feature measurements are dynamically updated after a set interval.
  • alternative update intervals may be used including various frame counts or set intervals of time.
  • FIG. 4 illustrates an example update process for feature threshold and weighting parameters for the feature measurements (e.g., average LRT feature ($F_1$), spectral flatness feature ($F_2$), and spectral template difference feature ($F_3$)).
  • In step 400, initial values are set for the feature threshold and weighting parameters (e.g., $T_1, T_2, T_3$ and $\tau_1, \tau_2, \tau_3$).
  • In step 405, histograms of the features may be computed over the $W$ frames of the relevant (e.g., current or present) parameter estimation window.
  • In the first pass, step 405 involves the first $W$ frames of the sequence, during which the threshold and weighting parameters are fixed to their initial values set in step 400.
  • In subsequent passes, the threshold and weighting parameters are fixed to the values derived from the previous $W$ frames.
  • In step 410, new threshold and weighting parameters for the features are extracted from quantities derived from the histograms computed in step 405.
  • the threshold and weighting parameters for the features are derived from histogram quantities such as the peak positions of the histograms, the height of the histograms, the average of each feature over some range of the feature's respective histogram, and the fluctuation of each feature over some range of the feature's respective histogram. Numerous other quantities may also be derived from the histograms computed in step 405 to use in extracting new feature threshold and weighting parameters in step 410, in addition to or instead of those described above.
  • the quantities derived from the histograms in step 410 are compared with some internal parameters to determine the corresponding prior model threshold and weighting parameters.
  • Internal parameters may include the following sets: (1) a scale parameter applied to either the dominant peak value, or the sum of the two peak values, of the measured histogram, to obtain the feature threshold; (2) a parameter that merges the two histogram peaks if they are too close; (3) a parameter to reject the feature if the average height of the peaks is too small; (4) a parameter to reject the feature if the average peak position is too small; (5) a parameter to reject some feature(s) if the fluctuation of the LRT feature over the histogram range is too low; and (6) maximum and minimum limits on the thresholds for each feature.
  • In step 415, the threshold and weighting parameters extracted in step 410 are fixed or set as the feature threshold and weighting parameters for the next $W$ frames of the speech sequence. If the end of the speech sequence is reached in step 420, then the process ends. However, if the end of the speech sequence is not reached in step 420, then the process returns to step 405 and repeats through step 420 using the next $W$ frames of the sequence and the threshold and weighting parameters fixed in step 415.
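  • A sketch of one plausible histogram-driven threshold update for a single feature; the peak-scale factor and limits stand in for the internal parameters listed above and are assumptions.

```python
import numpy as np

def threshold_from_histogram(feature_values, bins=100,
                             peak_scale=1.2, t_min=0.1, t_max=2.0):
    """Derive a new feature threshold from the histogram of a feature over
    the last W frames: take the dominant histogram peak as the noise mode
    of the feature and place the threshold a scale factor above it,
    clipped to fixed limits (scale and limits are assumed values)."""
    hist, edges = np.histogram(feature_values, bins=bins)
    i = int(np.argmax(hist))
    peak_pos = 0.5 * (edges[i] + edges[i + 1])
    return float(np.clip(peak_scale * peak_pos, t_min, t_max))

# usage over successive non-overlapping windows of W = 500 frames:
# for start in range(0, len(f1_seq), 500):
#     T1 = threshold_from_histogram(f1_seq[start:start + 500])
```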
  • the initial feature threshold and weighting parameters set in step 400 of FIG. 4 may be used for an entire speech sequence, without the values of these parameters being updated at all.
  • the threshold and weighting parameters may be updated once following the first window of W frames of the sequence (e.g., the threshold and weighting parameters are updated one time from their initial values).
  • The feature threshold and weighting parameters update process illustrated in FIG. 4 may use overlapping windows of the sequence where, for example, $W_1$ includes frames 1-500, $W_2$ includes frames 250-750, $W_3$ includes frames 500-1000, and so on. This is one alternative to using non-overlapping windows, where $W_1$ includes frames 1-500, $W_2$ includes frames 500-1000, $W_3$ includes frames 1000-1500, etc. Additionally, while some arrangements use fixed windows (e.g., each $W_i$ includes 500 frames of the sequence), other arrangements may use variable, or changing, windows. For example, $W_1$ may include 500 frames, $W_2$ 250 frames, and $W_3$ 750 frames.
  • Such variable or changing windows may be overlapping or non-overlapping, such as $W_1$ including frames 1-500 (500 frames), $W_2$ including frames 500-750 (250 frames, non-overlapping), and $W_3$ including frames 500-1250 (750 frames, overlapping).
  • the threshold and weighting parameters may be updated according to a variety of other window configurations involving numerous other characteristics of a given sequence.
  • The feature threshold and weighting parameters extraction in step 410 may lead to one or more of the features (e.g., average LRT feature ($F_1$), spectral flatness feature ($F_2$), and/or spectral template difference feature ($F_3$)) not being used in computing the update model of the speech/noise probability.
  • In such cases, the weighting parameter for each feature that will not be included in the update model is set to 0.
  • FIG. 5 is a block diagram illustrating an example computing device 500 that is arranged for multipath routing in accordance with one or more embodiments of the present disclosure.
  • computing device 500 typically includes one or more processors 510 and system memory 520.
  • a memory bus 530 may be used for communicating between the processor 510 and the system memory 520.
  • Processor 510 can be of any type including but not limited to a microprocessor (μP), a microcontroller (μC), a digital signal processor (DSP), or any combination thereof.
  • Processor 510 may include one or more levels of caching, such as a level one cache 511 and a level two cache 512, a processor core 513, and registers 514.
  • the processor core 513 may include an arithmetic logic unit (ALU), a floating point unit (FPU), a digital signal processing core (DSP Core), or any combination thereof.
  • a memory controller 515 can also be used with the processor 510, or in some embodiments the memory controller 515 can be an internal part of the processor 510.
  • system memory 520 can be of any type including but not limited to volatile memory (e.g., RAM), non-volatile memory (e.g., ROM, flash memory, etc.) or any combination thereof.
  • System memory 520 typically includes an operating system 521, one or more applications 522, and program data 524.
  • application 522 includes a multipath processing algorithm 523 that is configured to pass a noisy input signal to a noise suppression component.
  • the multipath processing algorithm is further arranged to pass a noise-suppressed output from the noise suppression component to other components in the signal processing pathway.
  • Program Data 524 may include multipath routing data 525 that is useful for passing a noisy input signal along multiple signal pathways to, for example, a noise suppression component such that the component receives the noisy signal before the signal has been manipulated or altered by other audio processing.
  • Computing device 500 can have additional features and/or functionality, and additional interfaces to facilitate communications between the basic configuration 501 and any required devices and interfaces.
  • a bus/interface controller 540 can be used to facilitate communications between the basic configuration 501 and one or more data storage devices 550 via a storage interface bus 541.
  • The data storage devices 550 can be removable storage devices 551, non-removable storage devices 552, or any combination thereof.
  • Examples of removable storage and non-removable storage devices include magnetic disk devices such as flexible disk drives and hard-disk drives (HDD), optical disk drives such as compact disk (CD) drives or digital versatile disk (DVD) drives, solid state drives (SSD), tape drives and the like.
  • Example computer storage media can include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information, such as computer readable instructions, data structures, program modules, and/or other data.
  • System memory 520, removable storage 551 and non-removable storage 552 are all examples of computer storage media.
  • Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computing device 500. Any such computer storage media can be part of computing device 500.
  • Computing device 500 can also include an interface bus 542 for facilitating communication from various interface devices (e.g., output interfaces, peripheral interfaces, communication interfaces, etc.) to the basic configuration 501 via the bus/interface controller 540.
  • Example output devices 560 include a graphics processing unit 561 and an audio processing unit 562, either or both of which can be configured to communicate to various external devices such as a display or speakers via one or more A/V ports 563.
  • Example peripheral interfaces 570 include a serial interface controller 571 or a parallel interface controller 572, which can be configured to communicate with external devices such as input devices (e.g., keyboard, mouse, pen, voice input device, touch input device, etc.) or other peripheral devices (e.g., printer, scanner, etc.) via one or more I/O ports 573.
  • An example communication device 580 includes a network controller 581, which can be arranged to facilitate communications with one or more other computing devices 590 over a network communication (not shown) via one or more communication ports 582.
  • The communication connection is one example of communication media.
  • Communication media may typically be embodied by computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism, and includes any information delivery media.
  • a "modulated data signal" can be a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal.
  • communication media can include wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, radio frequency (RF), infrared (IR) and other wireless media.
  • RF radio frequency
  • IR infrared
  • the term computer readable media as used herein can include both storage media and communication media.
  • Computing device 500 can be implemented as a portion of a small-form-factor portable (or mobile) electronic device such as a cell phone, a personal data assistant (PDA), a personal media player device, a wireless web-watch device, a personal headset device, an application-specific device, or a hybrid device that includes any of the above functions.
  • Computing device 500 can also be implemented as a personal computer including both laptop computer and non-laptop computer configurations.
  • Some aspects of the embodiments described herein may be implemented in Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs), digital signal processors (DSPs), or other integrated formats.
  • some aspects of the embodiments described herein, in whole or in part, can be equivalently implemented in integrated circuits, as one or more computer programs running on one or more computers (e.g., as one or more programs running on one or more computer systems), as one or more programs running on one or more processors (e.g., as one or more programs running on one or more microprocessors), as firmware, or as virtually any combination thereof.
  • Designing the circuitry and/or writing the code for the software and/or firmware would be well within the skill of one skilled in the art in light of the present disclosure.
  • Examples of a signal-bearing medium include, but are not limited to, the following: a recordable-type medium such as a floppy disk, a hard disk drive, a Compact Disc (CD), a Digital Video Disk (DVD), a digital tape, a computer memory, etc.; and a transmission-type medium such as a digital and/or an analog communication medium (e.g., a fiber optic cable, a waveguide, a wired communications link, a wireless communication link, etc.).
  • a typical data processing system generally includes one or more of a system unit housing, a video display device, a memory such as volatile and non-volatile memory, processors such as microprocessors and digital signal processors, computational entities such as operating systems, drivers, graphical user interfaces, and applications programs, one or more interaction devices, such as a touch pad or screen, and/or control systems including feedback loops and control motors (e.g., feedback for sensing position and/or velocity; control motors for moving and/or adjusting components and/or quantities).
  • a typical data processing system may be implemented utilizing any suitable commercially available components, such as those typically found in data computing/communication and/or network computing/communication systems.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Noise Elimination (AREA)

Abstract

Systems and methods of noise suppression based on an estimation of the noise spectrum, and a Wiener type filter to suppress the estimated noise. The noise spectrum may be estimated based on a model that classifies each time/frame and frequency component of a received signal as speech or noise by using a speech/noise likelihood (e.g., probability) function. The speech/noise likelihood function is updated and adapted, for each input frame and frequency, by incorporating multiple speech/noise classification features into a model for a feature-based probability function.

Description

NOISE SUPPRESSION METHOD AND APPARATUS USING MULTIPLE FEATURE MODELING FOR SPEECH/NOISE LIKELIHOOD
FIELD OF THE INVENTION
[0001] The present disclosure generally relates to systems and methods for transmission of audio signals such as voice communications. More specifically, aspects of the present disclosure relate to estimating and filtering noise using speech probability modeling.
BACKGROUND
[0002] In voice communications, excessive amounts of surrounding and/or background noise can make one or both participants difficult to understand, sometimes rendering a conversation useless. Surrounding noise includes noise introduced from a number of sources, some of the more common of which include computers, fans, microphones, and office equipment.
SUMMARY
[0003] This Summary introduces a selection of concepts in a simplified form in order to provide a basic understanding of some aspects of the present disclosure. This Summary is not an extensive overview of the disclosure, and is not intended to identify key or critical elements of the disclosure or to delineate the scope of the disclosure. This Summary merely presents some of the concepts of the disclosure as a prelude to the Detailed Description provided below.
[0004] One embodiment of the present disclosure relates to a method for noise estimation and filtering by a noise suppression module. The method comprises defining, for each of a plurality of successive frames of an input signal received at the noise suppression module, a speech probability function based on an initial noise estimation for the frame; measuring a plurality of signal classification features for each of the plurality of frames; computing a feature-based speech probability for each of the plurality of frames using the measured signal classification features of the frame; applying one or more dynamic weighting factors to the computed feature-based speech probability for each of the plurality of frames; modifying the speech probability function for each of the plurality of frames based on the computed feature-based speech probability of the frame; and updating the initial noise estimation for each of the plurality of frames using the modified speech probability function for the frame.
[0005] In another embodiment of the disclosure, the method for noise estimation and filtering further comprises filtering noise from each of the plurality of frames using the updated initial noise estimation for each frame.
[0006] In another embodiment of the disclosure, the one or more dynamic weighting factors includes weight and threshold parameters for each of the plurality of signal classification features.
[0007] In another embodiment of the disclosure, the initial noise estimation is based on quantile noise estimation for each of the plurality of successive frames.
[0008] In another embodiment of the disclosure, the method for noise estimation and filtering further comprises applying the one or more dynamic weighting factors to each of the measured signal classification features of the frame; and updating the feature-based speech probability for the frame with the one or more dynamic weighting factors applied.
[0009] In another embodiment of the disclosure, the method for noise estimation and filtering further comprises combining the one or more dynamic weighting factors and the measured signal classification features into a feature-based speech probability function.
[0010] In another embodiment of the disclosure, the method for noise estimation and filtering further comprises updating, for each of the plurality of frames, the feature-based speech probability function; and updating, for each of the plurality of frames, the speech probability function based on the updated feature-based speech probability function.
[0011] In another embodiment of the disclosure, the plurality of signal classification features is used to classify the input signal into a class state of speech or noise.
[0012] In another embodiment of the disclosure, the feature-based speech probability function is updated with a recursive average.
[0013] In another embodiment of the disclosure, the feature-based speech probability function is obtained by mapping each of the plurality of signal classification features to a probability value using a map function.
[0014] In another embodiment of the disclosure, the map function is defined on a value of the signal classification feature and includes one or more threshold and width parameters.
[0015] In another embodiment of the disclosure, the speech probability function is further based on a likelihood ratio factor for the frame.
[0016] In another embodiment of the disclosure, the plurality of signal classification features includes at least: average likelihood ratio over time, spectral flatness measure, and spectral template difference measure.
[0017] In another embodiment of the disclosure, the one or more dynamic weighting factors selects as the plurality of signal classification features at least one of: average likelihood ratio over time, spectral flatness measure, and spectral template difference measure.
[0018] In another embodiment of the disclosure, the spectral template difference measure is based on a comparison of a spectrum of the input signal with a template noise spectrum.
[0019] In another embodiment of the disclosure, the template noise spectrum is estimated based on an updated noise estimation using an updated speech probability function and a set of estimated shape parameters.
[0020] In another embodiment of the disclosure, the estimated shape parameters are one or more of a shift, amplitude, and normalization parameter.
[0021] In another embodiment of the disclosure, the method for noise estimation and filtering further comprises, in response to filtering noise from each of the plurality of frames, energy scaling each of the plurality of frames based on the modified speech probability function of the frame.

[0022] In another embodiment of the disclosure, the method for noise estimation and filtering further comprises setting initial values for the weight and threshold parameters applied to each of the plurality of signal classification features; and updating the initial values for the weight and threshold parameters after a first interval of the input signal.
[0023] In still another embodiment of the disclosure, the method for noise estimation and filtering further comprises computing histograms for each of the plurality of signal classification features over the first interval; determining new values for the weight and threshold parameters from one or more quantities derived from the histograms; and using the new values for the weight and threshold parameters for a second interval of the input signal.
[0024] In another embodiment of the disclosure, the first and second intervals are sequences of frames of the input signal.
[0025] In yet another embodiment of the disclosure, the method for noise estimation and filtering further comprises comparing the one or more quantities derived from the histograms with one or more internal parameters to determine corresponding weight and threshold parameters of the feature-based speech probability of the input signal.
[0026] Further scope of applicability of the present invention will become apparent from the Detailed Description given below. However, it should be understood that the Detailed Description and specific examples, while indicating preferred embodiments of the invention, are given by way of illustration only, since various changes and modifications within the spirit and scope of the invention will become apparent to those skilled in the art from this Detailed Description.
BRIEF DESCRIPTION OF DRAWINGS
[0027] These and other objects, features and characteristics of the present disclosure will become more apparent to those skilled in the art from a study of the following Detailed Description in conjunction with the appended claims and drawings, all of which form a part of this specification. In the drawings:

[0028] Figure 1 provides a general description of a representative embodiment in which one or more aspects described herein may be implemented.
[0029] Figure 2 is a block diagram illustrating exemplary components of a noise suppression system according to one or more embodiments described herein.
[0030] Figure 3 is a schematic diagram illustrating example buffering and windowing processes according to one or more embodiments described herein.
[0031] Figure 4 is a flowchart illustrating an example update process for feature threshold and weighting parameters according to one or more embodiments described herein.
[0032] Figure 5 is a block diagram illustrating an example computing device arranged for multipath routing and processing of audio input signals according to one or more embodiments described herein.
[0033] The headings provided herein are for convenience only and do not necessarily affect the scope or meaning of the claimed invention.
[0034] In the drawings, the same reference numerals and any acronyms identify elements or acts with the same or similar structure or functionality for ease of understanding and convenience. The drawings will be described in detail in the course of the following Detailed Description.
DETAILED DESCRIPTION
[0035] Various examples of the invention will now be described. The following description provides specific details for a thorough understanding and enabling description of these examples. One skilled in the relevant art will understand, however, that the invention may be practiced without many of these details. Likewise, one skilled in the relevant art will also understand that the invention can include many other obvious features not described in detail herein. Additionally, some well-known structures or functions may not be shown or described in detail below, so as to avoid unnecessarily obscuring the relevant description.

[0036] Noise suppression aims to remove or reduce surrounding background noise to enhance the clarity of the intended audio, thereby enhancing the comfort of the listener. In at least some embodiments of the present disclosure, noise suppression occurs in the frequency domain, where both noise estimation and noise filtering processes are performed. In situations involving high non-stationary noise levels, relying only on local speech-to-noise ratios (SNRs) to drive noise suppression often incorrectly biases a likelihood determination of speech and noise presence. As will be described in greater detail herein, a process of updating and adapting a speech/noise probability measure, for each input frame and frequency, that incorporates multiple speech/noise classification features (e.g., "signal classification features" or "noise-estimation features" as also referred to herein) for a feature-based probability, provides a more accurate and robust estimation of speech/noise presence in the frame. In the following description, "speech/noise classification features," "signal classification features," and "noise-estimation features" are interchangeable and refer to features that may be used (e.g., measured) to classify an input signal, for each frame and frequency, into a state of either speech or noise.
[0037] Aspects of the present disclosure relate to noise suppression based on an estimation of the noise spectrum, and a Wiener type filter to suppress the estimated noise. The noise spectrum may be estimated based on a model that classifies each time/frame and frequency component of a received signal as speech or noise by using a speech/noise likelihood (e.g., probability) function. The speech/noise probability function and its use in estimating the noise spectrum will be described in greater detail below.
[0038] In at least some arrangements, a noise suppression module may be configured to perform various speech probability modeling processes as described herein. For example, for each input frame of speech received, the noise suppression module may perform the following processes on the frame: signal analysis, including buffering, windowing, and Fourier transformation; noise estimation and filtering, including determining an initial noise estimation, computing a speech/noise likelihood function, updating the initial noise estimation based on the speech/noise likelihood function, and suppressing the estimated noise using a Wiener type filter; and signal synthesis, including inverse Fourier transformation, scaling, and window synthesis. Additionally, the noise suppression module may be further configured to generate, as output of the above processes, an estimated speech frame.
[0039] FIG. 1 and the following discussion provide a brief, general description of a representative embodiment in which aspects of the present disclosure may be implemented. As shown in FIG. 1, a noise suppression module 40 may be located at the near-end environment of a signal transmission path, along with a capture device 5 also at the near-end and a render device 30 located at the far-end environment. In some arrangements, noise suppression module 40 may be one component in a larger system for audio (e.g., voice) communications. The noise suppression module 40 may be an independent component in such a larger system or may be a subcomponent within an independent component (not shown) of the system. In the example embodiment illustrated in FIG. 1, noise suppression module 40 is arranged to receive and process input from capture device 5 and generate output to, e.g., one or more other audio processing components (not shown). These other audio processing components may be acoustic echo control (AEC), automatic gain control (AGC), and/or other voice quality improvement components. In some embodiments, these other processing components may receive input from capture device 5 prior to noise suppression module 40 receiving such input.
[0040] Capture device 5 may be any of a variety of audio input devices, such as one or more microphones configured to capture sound and generate input signals. Render device 30 may be any of a variety of audio output devices, including a loudspeaker or group of loudspeakers configured to output sound of one or more channels. For example, capture device 5 and render device 30 may be hardware devices internal to a computer system, or external peripheral devices connected to a computer system via wired and/or wireless connections. In some arrangements, capture device 5 and render device 30 may be components of a single device, such as a speakerphone, telephone handset, etc. Additionally, one or both of capture device 5 and render device 30 may include analog-to-digital and/or digital-to-analog transformation functionalities.
[0041] In at least the embodiment shown in FIG. 1, noise suppression module 40 includes a controller 50 for coordinating various processes and timing considerations. Noise suppression module 40 may also include a signal analysis unit 10, a noise estimation unit 15, a Wiener filter 20, and a signal synthesis unit 25. Each of these units may be in communication with controller 50 such that controller 50 can facilitate some of the processes described herein. Various details of the signal analysis unit 10, noise estimation unit 15, Wiener filter 20, and signal synthesis unit 25 will be further described below.
[0042] In some embodiments of the present disclosure, one or more other components, modules, units, etc., may be included as part of noise suppression module 40, in addition to or instead of those illustrated in FIG. 1. The names used to identify the units included as part of noise suppression module 40 (e.g., signal analysis unit, noise estimation unit, etc.) are exemplary in nature, and are not intended to limit the scope of the disclosure.
[0043] FIG. 2 is a flow diagram illustrating an example embodiment of an overall noise suppression system and method of the present disclosure. The noise suppression system shown in FIG. 2 includes three main processes: signal analysis 270, noise estimating and filtering 275, and signal synthesis 280. The signal analysis process 270 may include various pre-processing that must be performed on input frame 200 to allow noise suppression to proceed in the frequency domain. For example, signal analysis 270 may include the preprocessing steps of buffering 205, windowing 210, and the Discrete Fourier Transform (DFT) 215. The noise estimation and filtering process 275 shown in FIG. 2 includes as steps or sub-processes: initial noise estimation 220, decision-directed (DD) update of post and prior SNRs 225, speech/noise likelihood determination 230, which is based on a likelihood ratio (LR) factor determined using the post and prior SNRs with a speech probability density function (PDF) model 235 (e.g., Gaussian, Laplacian, Gamma, Super-Gaussian, etc.) and a probability term determined from feature modeling 240, noise estimate update 245, and applying Wiener gain filter 250. Additionally, the signal synthesis process 280, which is needed to convert input frame 200 back to the time-domain, includes inverse Discrete Fourier Transform 255, scaling 260, and window synthesis 265 steps. The result of signal synthesis process 280 is output frame 290, which is an estimated speech frame. Each of the above processes and sub-processes of the noise suppression system illustrated in FIG. 2 will now be described in greater detail below.
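For illustration only, the per-frame flow just described might be organized as in the following sketch. It is a hypothetical NumPy rendering of the FIG. 2 pipeline, not the actual implementation; the simple minimum-tracking noise update and clipped gain are placeholders standing in for the full estimation steps described later.

```python
import numpy as np

FRAME, BUF = 80, 128          # 10 ms frame at 8 kHz; power-of-2 analysis buffer

class NSState:
    def __init__(self):
        ov = BUF - FRAME                                  # 48 overlapping samples
        ramp = (np.arange(ov) + 0.5) / ov * (np.pi / 2)
        self.window = np.ones(BUF)                        # power-preserving window:
        self.window[:ov] = np.sin(ramp)                   # w^2(n) + w^2(n + M) = 1
        self.window[FRAME:] = np.cos(ramp)
        self.buffer = np.zeros(BUF)
        self.noise = np.full(BUF // 2 + 1, 1e-3)          # initial noise estimate
        self.overlap = np.zeros(ov)

def process_frame(frame, st):
    # signal analysis: buffering, windowing, DFT
    st.buffer = np.concatenate([st.buffer[FRAME:], frame])
    spec = np.fft.rfft(st.window * st.buffer)
    magn = np.abs(spec)
    # noise estimation and filtering (placeholders for the steps detailed below)
    st.noise = 0.9 * st.noise + 0.1 * np.minimum(magn, 2.0 * st.noise)
    gain = np.clip(1.0 - st.noise / np.maximum(magn, 1e-10), 0.05, 1.0)
    # signal synthesis: inverse DFT, synthesis window, overlap-add
    clean = st.window * np.fft.irfft(gain * spec, n=BUF)
    out = clean[:FRAME].copy()
    out[: BUF - FRAME] += st.overlap
    st.overlap = clean[FRAME:]
    return out
```

A caller would instantiate the state once and feed 10 ms frames in sequence, e.g. `st = NSState()` followed by `out = process_frame(frame, st)` per frame.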
[0044] The noise suppression methods and systems for reducing or removing noise from speech signals described herein proceed on the following model equation (presented in time-domain form):

y(t) = x(t) + N(t)

where x(t) is the clean speech signal, y(t) is the observed noisy signal, and N(t) is the noise. For at least the following description of the various processes and steps illustrated in FIG. 2, this model assumes the (unknown) speech signal is corrupted with additive noise, with the noisy signal y(t) uncorrelated to the speech signal x(t). In the frequency-domain, the above model equation takes the following form:

Y_k(m) = X_k(m) + N_k(m)

where k denotes the frequency and m represents the frame index (e.g., the frame number used in short-time window DFT 215, described in greater detail below).
Signal Analysis
[0045] Signal analysis 270 may include various pre-processing steps so as to allow noise suppression to be performed in the frequency domain, rather than in the time-domain. First, input frame 200 is passed through a buffering step 205, where input frame 200 is expanded with previous data (e.g., a portion of the previous frame, such as previous data 330 from frame 305 shown in FIG. 3, the details of which will be further described below) to a buffer length of a power of 2.

[0046] In at least some arrangements, the noise suppression system shown in FIG. 2 is a real-time system that operates on a frame basis, where data is buffered and analyzed when a frame (e.g., input frame 200) is received. In one example, the frame size of input frame 200 is 10 milliseconds (ms). For a sampling rate of 8kHz, this is equivalent to 80 samples, and for a sampling rate of 16kHz, is equivalent to 160 samples. In one or more other arrangements, the noise suppression system described herein and illustrated in FIG. 2 may alternatively and/or additionally support other input frame sizes, such as 15ms, 20ms, and 30ms. For clarity purposes, the following description is based on input frame 200 having a frame size of 10ms.
[0047] After buffering 205, input frame 200 proceeds to windowing 210 and DFT 215 to map input frame 200 to the frequency domain. Because DFT 215 is optimized for data lengths that are powers of 2, in at least some arrangements the available analyzing buffer lengths for the input frame are 128 samples and 256 samples. FIG. 3 is a schematic diagram showing examples of the buffering 205 and windowing 210 steps as described herein. FIG. 3 shows how data is buffered and windowed when the sampling rate is 8kHz and only one single frame is being analyzed. In the example illustrated, new frame of data 305 has a frame size of 80 samples and is added to buffer 320, which has a size of 128 samples. Additionally, windowing function 310 is displayed below the expanded buffers.
[0048] Because the analyzing buffers (e.g., buffer 320 shown in FIG. 3) are larger than the frames (e.g., frame 305 shown in FIG. 3), there is overlap between consecutive buffers as indicated by previous data 330, which in the example illustrated includes the previous forty-eight samples from frame 305. Although such overlap will generally result in smoother noise reduction due to the dependencies between analyzing buffers 320, the overlap also places constraints on the synthesis. For example, when overlapping buffer sections are added, such as frame 305, the signals must be windowed to avoid abrupt change.
[0049] As described above, any overlap between analyzing buffers (e.g., buffers 320 shown in FIG. 3) may require windowing. In at least one arrangement, the same window may be used before and after noise processing in the frequency domain. More specifically, with reference to FIG. 2, the same window may be used in windowing step 210 of signal analysis process 270 and window synthesis step 265 of signal synthesis process 280. Accordingly, in such an arrangement the window function must be power preserving, e.g., the squares of the window in the overlapping buffer sections must sum up to one as follows:

w²(n) + w²(n + M) = 1, for 0 ≤ n < N − M

where N is the buffer length and M is the frame length. Defining y(n, m) as the noisy audio signal at intra-buffer time index n and frame m, the windowed signal is:

y_w(n, m) = w(n) · y(n, m)
[0050] In some arrangements of the present disclosure, noise estimation and suppression processes are performed in the frequency-domain. Transformation of input frame 200 to the frequency-domain is accomplished in DFT step 215 of signal analysis process 270 using the DFT of the windowed data:

Y_k(m) = Σ_{n=0}^{N−1} y_w(n, m) · e^{−i2πnk/N}

The frequency bin index (sub-band) is given by k. For noise estimation, the process described herein is only concerned with the magnitude of the frequency response, |Y_k(m)|, since a Wiener type of filter is used for noise suppression, as will be described in greater detail below.
Noise Estimation and Filtering
[0051] The noise estimation and filtering process 275 of the system shown in FIG. 2 classifies each input frame 200 of a received signal as either speech or noise using a speech probability model that incorporates multiple features of the signal. This speech/noise classification is defined for every time/frame and frequency, and is realized through a speech/noise probability function further described below. Given the speech/noise classification, an initial estimation of the noise spectrum is updated more heavily during pause (noise) regions in the signal, resulting in a smoother sounding residual noise (e.g., less musical noise) and a more accurate and robust measure of the noise spectrum for non- stationary noise sources. As shown in example system of FIG. 2, noise estimation and filtering process 275 includes the following steps: initial noise estimation 220, decision- directed (DD) update of post and prior SNRs 225, speech/noise likelihood determination 230, which is based on a likelihood ratio (LR) factor determined using the post and prior SNRs with a speech probability density function (PDF) model 235 (e.g., Gaussian) and a probability term determined from feature modeling 240, noise estimate update 245, and applying a Wiener gain filter 250. Each of the steps comprising noise estimation and filtering process 275 is further described below.
[0052] In one or more arrangements, initial noise estimation 220 is based on a quantile noise estimation. The noise estimate is controlled by the quantile parameter, which is denoted as q. The noise estimate determined from initial noise estimation step 220 is only used as an initial condition for subsequent processing for improved noise update/estimation.
[0053] Filters for noise suppression processing may generally be expressed in terms of a prior SNR and a posteriori SNR (post SNR). Accordingly, prior and post SNR quantities need to be estimated before any actual suppression is performed. As will be further described below, prior and post SNR quantities are also needed for the speech/noise likelihood determination step 230 of the noise estimation and filtering process 275.
[0054] In one example, post SNR may refer to the instantaneous SNR based on the observed input power spectrum relative to the noise power spectrum, which may be defined as:

σ_k(m) = |Y_k(m)|² / |N_k(m)|²

where Y_k(m) is the input noisy spectrum and N_k(m) is the noise spectrum, at time/frame m and frequency k. In this example, the prior SNR may be the expectation value of the clean (unknown) signal power spectrum, relative to the noise power spectrum, expressed as:

ρ_k(m) = E[|X_k(m)|²] / |N_k(m)|²

where X_k(m) is the spectral coefficient of the unknown clean speech signal. The noise power spectrum in each of the post and prior SNRs expressed above may be obtained from the initial estimated noise spectrum determined in initial noise estimation step 220, which was based on a quantile estimation. In at least one embodiment, the post and prior SNRs may be expressed using magnitude quantities in place of the squared magnitudes shown in the above computations:

σ_k(m) = |Y_k(m)| / |N_k(m)|,  ρ_k(m) = E[|X_k(m)|] / |N_k(m)|
[0055] Because the clean signal is not known, the natural estimate for the prior SNR is the average of the estimated prior SNR at the previous frame (e.g., the input frame processed through the system shown in FIG. 2 immediately prior to input frame 200) and the instantaneous SNR σ_k(m):

ρ_k(m) = γ_dd · [H(k, m−1) · |Y_k(m−1)|] / |N_k(m−1)| + (1 − γ_dd) · max(σ_k(m) − 1, 0)

where H(k, m−1) is the gain filter (e.g., Wiener gain filter 250 of the noise estimation and filtering process 275) for the previous processed frame, and |Y_k(m−1)| is the observed magnitude spectrum of the noisy speech for the previous frame. In the above expression, the first term is the prior SNR at the previous time frame, and the second term is an instantaneous estimate of the prior SNR. In at least this example, the above expression may be taken as the decision-directed (DD) update of the prior SNR 225 step of the noise estimation and filtering process 275, with a temporal smoothing parameter γ_dd. The prior SNR is a smooth version of the post SNR, with some amount of time-lag. A larger γ_dd increases the smoothing but also increases the lag. In one or more arrangements, the value used for the smoothing parameter is γ_dd = 0.98.
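A minimal sketch of the magnitude-based post SNR and DD prior-SNR update might look as follows; the small floor constants guarding against division by zero are assumptions, not part of the description above.

```python
import numpy as np

def dd_update(magn, prev_magn, noise, prev_noise, prev_gain, gamma_dd=0.98):
    """Decision-directed update of the prior SNR (magnitude-based variant).

    magn, noise:           |Y_k(m)|, |N_k(m)| for the current frame
    prev_magn, prev_noise: the same quantities for frame m - 1
    prev_gain:             gain filter H(k, m - 1) of the previous frame
    """
    post = magn / np.maximum(noise, 1e-10)                # sigma_k(m), post SNR
    prev_prior = prev_gain * prev_magn / np.maximum(prev_noise, 1e-10)
    prior = gamma_dd * prev_prior + (1.0 - gamma_dd) * np.maximum(post - 1.0, 0.0)
    return prior, post
```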
[0056] According to aspects of the present disclosure, the prior and post SNRs described and defined above are elements of speech/noise likelihood determination step 230 of noise estimation and filtering process 275. In at least one example, the speech/noise likelihood involves two factors: (1) LR (likelihood ratio) factor, determined from the prior and post SNRs, and (2) a probability term based on feature modeling, which will be described in greater detail below.
[0057] In defining and deriving the model for the speech/noise likelihood, the state of speech is denoted as H1(k, m), and the state of noise is denoted as H0(k, m). The speech and noise states are defined for every frame m and frequency bin k. The probability of the speech/noise state can be expressed as:

P(H(k, m) | Y_k(m), {F})

The probability of speech/noise is conditioned on the observed noisy input spectral coefficient, Y_k(m), and some feature data of the signal (e.g., signal classification features) being processed, which in the present example is denoted as {F}. The above expression for the speech/noise likelihood is also referred to herein as the "speech probability function." In at least one arrangement, the feature data may be any functional of the noisy input spectrum, past spectrum data, model data, off-line data, etc. For example, feature data {F} may include spectral flatness measures, harmonic peak pitch, LPC residual, template matching, and the like.

[0058] In the following expressions, the (k, m) dependency of the speech/noise state is suppressed and H(k, m) is written as H in order to simplify the notation. Accordingly, the speech/noise probability may be expressed, using Bayes rule, as:

P(H | Y_k(m), {F}) ∝ P(Y_k(m) | H, {F}) · q_{k,m}(H | {F}) · p({F})

where p({F}) is a prior probability based on feature data of the signal, which is set to a constant in one or more of the following expressions. In this example, the quantity q_{k,m}(H | {F}) is the speech/noise probability given the feature data {F}, which will be further described below. In describing various aspects of the present disclosure, the above quantity, q_{k,m}(H | {F}), is also referred to as the "feature-based speech probability." Ignoring the prior probability based on {F}, and denoting for notational simplicity q_{k,m}(H1 | {F}) = q and q_{k,m}(H0 | {F}) = 1 − q, the normalized speech probability may be written as:

P(H1 | Y_k(m), {F}) = q · Δ_k / (1 − q + q · Δ_k)

where the likelihood ratio (LR) Δ_k is:

Δ_k = P(Y_k(m) | H1, {F}) / P(Y_k(m) | H0, {F})
[0059] In the above expression, the quantities P(Y_k(m) | H, {F}) for H = H1, H0 are determined, in at least one arrangement of the model described herein, from the linear state model and the Gaussian probability density function (PDF) assumption for the speech and noise spectral coefficients. More specifically, the linear model for the noisy input may be expressed as Y_k(m) = X_k(m) + N_k(m) for the state of speech, H = H1; and as Y_k(m) = N_k(m) for the state of noise, H = H0. Assuming Gaussian PDFs for the complex coefficients {X_k, N_k}, with speech and noise spectral variances denoted λ_X(k) and λ_N(k), we have for the quantities P(Y_k(m) | H, {F}) the following:

P(Y_k(m) | H1, {F}) = 1/(π(λ_X(k) + λ_N(k))) · exp(−|Y_k(m)|² / (λ_X(k) + λ_N(k)))

P(Y_k(m) | H0, {F}) = 1/(πλ_N(k)) · exp(−|Y_k(m)|² / λ_N(k))

[0060] Because the probability may be fully determined from the linear model and Gaussian PDF assumption, the feature dependency may be removed from the above expression. The likelihood ratio then becomes:

Δ_k(m) = exp( ρ_k(m)σ_k(m) / (1 + ρ_k(m)) ) / (1 + ρ_k(m))
where ρ_k(m) is the SNR of the unknown signal (e.g., prior SNR) and σ_k(m) is the a posteriori signal SNR (e.g., post SNR or instantaneous SNR) for frequency k and frame m. In one implementation, both the prior SNR and post SNR used in the above expression are approximated by the magnitude definitions, reproduced as:

σ_k(m) = |Y_k(m)| / |N_k(m)|,  ρ_k(m) = E[|X_k(m)|] / |N_k(m)|
[0061] From the above expressions and description, in at least one arrangement, the speech/noise state probability may be obtained from the likelihood ratio (Δ_k), which is determined from frequency-dependent post and prior SNRs, and the quantity q_{k,m}(H1 | {F}) = q, which is a feature-based or model-based probability that will be described in greater detail below. Accordingly, the speech/noise state probability may be expressed as:

P(H1 | Y_k(m), {F}) = q · Δ_k(m) / (1 − q + q · Δ_k(m))
Because the frequency-dependent LR factor (Δ_k) may sometimes have high fluctuation from frame to frame, in at least one arrangement of the noise suppression system described herein, a time-smoothened LR factor Δ'_k may be used:

log(Δ'_k(m)) = γ_lr · log(Δ'_k(m−1)) + (1 − γ_lr) · log(Δ_k(m))
[0062] Furthermore, the geometric average (over all frequencies) of the time-smoothened LR factor may be used as a reliable measure of frame-based speech/noise classification:

F1 = log[ ∏_k Δ'_k(m) ]^{1/N} = (1/N) · Σ_k log(Δ'_k(m))
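Under the Gaussian assumption, the LR factor, its time smoothing, and the frame-level average-LRT measure could be computed as in the sketch below. This is an illustrative rendering, not the actual implementation; the clipping of the log LR before exponentiation is an added numerical safeguard.

```python
import numpy as np

def speech_likelihood(prior, post, q, prev_log_lrt, gamma_lr=0.9):
    """Per-bin speech probability P(H1 | Y_k(m), {F}) from the smoothed LR
    factor and the feature-based probability q; also returns the frame-level
    average-LRT feature F1 (a geometric mean, computed in the log domain)."""
    log_lrt = prior * post / (1.0 + prior) - np.log1p(prior)   # log Delta_k(m)
    log_lrt = gamma_lr * prev_log_lrt + (1.0 - gamma_lr) * log_lrt
    lrt = np.exp(np.clip(log_lrt, -50.0, 50.0))                # smoothed Delta'_k(m)
    p_speech = q * lrt / (1.0 - q + q * lrt)
    f1 = float(log_lrt.mean())                                 # average over frequency
    return p_speech, log_lrt, f1
```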
[0063] As described above, the LR may be derived in speech/noise likelihood determination step 230 using, for example, the Gaussian assumption as speech PDF model 235. In one or more additional arrangements, other models of speech PDF may be used as the basis for measuring the LR, such as Laplacian, Gamma, and/or Super-Gaussian. For example, while the Gaussian assumption may be reasonable to use with noise, the assumption is not always true for speech, especially on small time frames (e.g., ~10 ms). In such instances, another model of speech PDF may be used; however, most likely at the cost of increased complexity.
[0064] As shown in FIG. 2, determining the speech/noise likelihood (or probability) 230 during the noise estimation and filtering process 275 is driven not only by local SNR (e.g., prior and instantaneous SNRs), but also incorporates speech model/knowledge derived from feature modeling 240. Incorporating speech model knowledge into the speech/noise probability determination allows the noise suppression processing described herein to better handle and/or differentiate cases of high non-stationary noise levels, where relying only on local SNRs may incorrectly bias the likelihood. In at least one arrangement, the system uses a process of updating and adapting the feature-based probability q_{k,m}(H1 | {F}) for each frame and frequency that incorporates local SNR and speech feature/model data. In the following description of various aspects of this updating and adapting process, the notation q_{k,m}(H1 | {F}) = q_{k,m} is used. Because the process as described herein only models and updates the quantity on a frame basis, the k variable is suppressed, and the quantity is written as q_m.
[0065] According to one or more aspects of the disclosure, an update of the feature-based probability may be modeled as:

q_m = γ_p · q_{m−1} + (1 − γ_p) · M(z)

where γ_p is a smoothing constant and M(z) is the map function (e.g., between 0 and 1) for the given time and frequency. The z variable in the map function is z = F − T, where F is the measured feature and T is a threshold. The parameter w characterizes the shape/width of the map function. The map function biases the time-frequency bin to either speech (M close to 1) or noise (M close to 0), based on the measured feature and the threshold and width parameters.
[0066] In one arrangement, noise estimation and filtering process 275 considers the following features of a speech signal in performing feature modeling 240 for speech/noise likelihood 230 determination: (1) average LRT, which may be based on local SNR, (2) spectral flatness, which may be based on a harmonic model of speech, and (3) spectral-template difference measure. These three features will be described in greater detail below. It should be understood that numerous other features of the speech signal may also be used in addition to or instead of the three example features described below.
1. Average LRT Feature
[0067] As described above, the geometric average of a time-smoothened likelihood ratio (LR) factor is a reliable indicator of speech/noise state:

F1 = log[ ∏_k Δ'_k(m) ]^{1/N} = (1/N) · Σ_k log(Δ'_k(m))

where the time-smoothened LR factor Δ'_k(m) is provided in the earlier expression above. Given the average LRT feature, one example of the map function M(z) may be a sigmoid-type function such as:

M(z1) = 0.5 · (tanh(w1 · z1) + 1),  z1 = F1 − T1

where F1 is the feature and w1 is a transition/width parameter to control the smoothness of the map from 0 to 1. The threshold parameter, T1, needs to be defined according to the parameter settings, which are described in greater detail herein.
2. Spectral Flatness Feature
[0068] For purposes of the spectral flatness feature, it is assumed that speech is likely to have more harmonic behavior than noise. Whereas the speech spectrum typically shows peaks at the fundamental frequency (pitch) and harmonics, the noise spectrum tends to be relatively flat in comparison. Accordingly, in at least some arrangements, measures of local spectral flatness may collectively be used as a good indicator/classifier of speech and noise.
[0069] In computing spectral flatness, N represents the number of frequency bins and B represents the number of bands. The index for a frequency bin is k and the index for a band is j. Each band will contain a number of bins. For example, the frequency spectrum of 128 bins can be divided into 4 bands (e.g., low band, low-middle band, high-middle band, and high band), each containing 32 bins. In another example, only one band containing all the frequencies is used. The spectral flatness may be computed as the ratio of the geometric mean to the arithmetic mean of the input magnitude spectrum:

F2 = [ ∏_{k∈band j} |Y_k(m)| ]^{1/N} / [ (1/N) · Σ_{k∈band j} |Y_k(m)| ]

where N represents the number of frequencies in the band. The computed quantity F2 will tend to be larger and constant for noise, and smaller and more variable for speech. Again, one example of the map function M(z) for the update to the feature-based prior probability is a sigmoid-type function:

M(z2) = 0.5 · (tanh(w2 · z2) + 1),  z2 = T2 − F2
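A single-band spectral flatness computation might be sketched as below; the epsilon guard for empty or zero-valued bins is an assumption added for numerical safety.

```python
import numpy as np

def spectral_flatness(magn, eps=1e-10):
    """F2: geometric mean over arithmetic mean of the band's magnitude spectrum.
    Near 1 for flat, noise-like spectra; smaller and more variable for speech."""
    geometric = np.exp(np.mean(np.log(magn + eps)))   # geometric mean in log domain
    arithmetic = np.mean(magn) + eps
    return float(geometric / arithmetic)
```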
3. Spectral Template Difference Feature
[0070] In addition to the assumptions about noise described above for the spectral flatness feature, yet another assumption that can be made about the noise spectrum is that it is more stationary than the speech spectrum. Therefore, it can be assumed that the overall shape of the noise spectrum will tend to be the same during any given session. Proceeding under this assumption, a third feature can be incorporated into the speech/noise probability determination of the present example. This additional feature measures the deviation of the input spectrum from the shape of the noise spectrum.
[0071] This third feature may be determined by comparing the input spectrum with a template learned noise spectrum. In at least some arrangements, the template spectrum is determined by updating the spectrum, which is initially set to zero, over segments that have a strong likelihood of being noise or pause in speech. A result of the comparison is a conservative noise estimate, where the noise is only updated for segments where the speech probability is determined to be below a threshold (e.g., P(H1 | Y_k(m), {F}) < λ). In other arrangements, the template spectrum may also be input to the algorithm or selected from a table of shapes corresponding to different noises. Given the input spectrum, Y_k(m), and the template spectrum, which may be denoted as α_k(m), the spectral template difference feature may be obtained by initially defining the spectral difference measure as:

J(m) = Σ_k ( |Y_k(m)| − a · α_k(m) − u )²

where (a, u) are shape parameters, such as linear shift and amplitude parameters, obtained by minimizing J. Parameters (a, u) are obtained from a linear equation, and therefore are easily extracted for each frame. In some examples, the parameters account for any simple shift/scale changes of the input spectrum (e.g., if the volume increases). The feature is then the normalized measure,

F3 = J(m) / Norm

where the normalization is the average input spectrum over all frequencies and over some time window of previous frames:

Norm = (1/(N·W)) · Σ_{m'=m−W+1}^{m} Σ_k |Y_k(m')|

where W denotes the length of the time window, in frames.
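Because the shape parameters (a, u) minimize a quadratic cost, they follow from an ordinary least-squares fit, as in this sketch; the template spectrum and the normalization value are assumed to be supplied by the caller from the steps described above.

```python
import numpy as np

def spectral_template_diff(magn, template, norm):
    """F3: residual of the best linear fit a * template + u to the input
    magnitude spectrum, normalized by the average input spectrum."""
    A = np.column_stack([template, np.ones_like(template)])
    (a, u), *_ = np.linalg.lstsq(A, magn, rcond=None)     # minimize J over (a, u)
    residual = magn - (a * template + u)
    return float(np.sum(residual ** 2) / max(norm, 1e-10))
```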
[0072] As described above, the spectral template difference feature measures the difference/deviation of the template or learned noise spectrum from the input spectrum. In at least some arrangements, this spectral template difference feature may be used to modify the speech/noise feature-based probability, q_m. If F3 is small, then the input frame spectrum is taken as being "close to" the template spectrum, and the frame is considered to be more likely noise. On the other hand, where the spectral template difference feature is large, the input frame (e.g., input frame 200) spectrum is very different from the noise template spectrum, and the frame is considered to be speech. In one or more variations, the template spectrum may be input to the speech/noise probability algorithm or instead digitally measured and utilized as an online resource.
[0073] Similar to the average LRT feature and spectral flatness feature, mapping the spectral template difference feature value to a probability weight may be done using the same sigmoid function described above. It is important to note that the spectral template difference feature measure is more general than the spectral flatness feature measure. In the case of a template with a constant (e.g., near perfectly flat) spectrum, the spectral template difference feature reduces to a measure of the spectral flatness.
[0074] In at least one arrangement, a weighting term W_k may be added to the spectral template difference measure to emphasize certain bands in the spectrum:

J(m) = Σ_k W_k · ( |Y_k(m)| − a · α_k(m) − u )²

In one example, the weighting term may be kept at W_k = 1 for all frequencies.
[0075] The multiple features described above (e.g., average LRT, spectral flatness, and spectral template difference) may be combined in the update model of the speech/noise probability as follows:

q_m = γ_p · q_{m−1} + (1 − γ_p) · [ τ1 · M(F1 − T1) + τ2 · M(F2 − T2) + τ3 · M(F3 − T3) ]

The different features, which arise from different cues (e.g., the different information conveyed by the different features, such as the energy measurement or local SNR conveyed by the first feature, the spectral flatness of the noise conveyed by the second feature, and the stationarity and general shape of the noise from the third feature), may complement each other to provide a more robust and adaptive update of the speech/noise probability. The update model of the speech/noise probability shown above includes various weighting terms {τ_i}, threshold parameters {T_i}, and width parameters for the map function. For example, if the spectral flatness feature (F2) is not reliable for a given input, such as where the noise spectrum is not very flat, then the second weighting term τ2 may be set to zero, τ2 = 0, so as to avoid bringing an unreliable measure into the update model. Aspects related to the setting of these weighting terms and threshold parameters will be described in greater detail below.
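Combining the three mapped features with their weights might look like the following sketch. The sign conventions inside each map follow the per-feature discussion above, and the weights are expected to sum to one over the features in use; this is an illustration under those assumptions, not the actual implementation.

```python
import numpy as np

def update_speech_prior(q_prev, f, T, tau, w, gamma_p=0.9):
    """q_m from the three features f = (F1, F2, F3), thresholds T = (T1, T2, T3),
    weights tau = (tau1, tau2, tau3), and map widths w = (w1, w2, w3)."""
    m1 = 0.5 * (np.tanh(w[0] * (f[0] - T[0])) + 1.0)   # high average LRT  -> speech
    m2 = 0.5 * (np.tanh(w[1] * (T[1] - f[1])) + 1.0)   # low flatness      -> speech
    m3 = 0.5 * (np.tanh(w[2] * (f[2] - T[2])) + 1.0)   # large template diff -> speech
    indicator = tau[0] * m1 + tau[1] * m2 + tau[2] * m3
    return gamma_p * q_prev + (1.0 - gamma_p) * indicator
```

Setting a weight tau[i] to zero simply removes the corresponding map term, which matches the reliability gating described in the paragraph above.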
[0076] Following determination of speech/noise likelihood 230 in the noise estimation and filtering process 275 of the system shown in FIG. 2, a noise estimate update 245 (e.g., a soft-decision recursive noise update) is performed. For example, noise estimate update 245 may follow as:

|N_k(m+1)| = γ_n · |N_k(m)| + (1 − γ_n) · [ P(H1 | Y_k(m), {F}) · |N_k(m)| + (1 − P(H1 | Y_k(m), {F})) · |Y_k(m)| ]

where |N_k(m)| is the estimate of the magnitude of the noise spectrum, for frame/time m and frequency bin k. The parameter γ_n controls the smoothing of the noise update, and the second term updates the noise with both the input spectrum and the previous noise estimation, weighted according to the probability of speech/noise which, as described above, may be given as:

P(H1 | Y_k(m), {F}) = q_m · Δ_k(m) / (1 − q_m + q_m · Δ_k(m))

where the LR factor Δ_k(m) is:

Δ_k(m) = exp( ρ_k(m)σ_k(m) / (1 + ρ_k(m)) ) / (1 + ρ_k(m))

and the quantity q_m is the model-based or feature-based speech/noise probability obtained from the update model with the combined multiple features described above. The noise estimation model above updates the noise at every frame and frequency bin where the noise likelihood is large (e.g., where the speech likelihood is small). Where the noise likelihood is not found to be large, the noise estimate is taken as the estimate obtained from the previous frame in the signal.
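The soft-decision noise update can be sketched directly from the expression above; raising gamma_n near speech onsets, as discussed in the next paragraph, is left to the caller in this illustration.

```python
import numpy as np

def update_noise(noise, magn, p_speech, gamma_n=0.9):
    """Soft-decision recursive noise update: the observed magnitude flows into
    the estimate only in proportion to the probability that the bin is noise."""
    soft = p_speech * noise + (1.0 - p_speech) * magn
    return gamma_n * noise + (1.0 - gamma_n) * soft
```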
[0077] In at least one arrangement, this noise estimate update process is controlled by the speech/noise likelihood and the smoothing parameter γ_n, which may be set, for example, to 0.85. In a different example, the smoothing parameter may be increased to γ_n ≈ 0.99 for regions where the speech probability is found to be above a threshold parameter λ, to prevent the noise level from increasing too much at speech onsets. As will be further described below, in one or more arrangements the threshold parameter is set as λ = 0.2 or 0.25.
[0078] When noise estimate update 245 is completed, the noise estimation and filtering process 275 applies a Wiener gain filter 250 to reduce or remove the estimated amount of noise from input frame 200. The standard Wiener filter is given as:

H(k, m) = |X_k(m)|² / |Y_k(m)|² = ( |Y_k(m)|² − |N_k(m)|² ) / |Y_k(m)|²

where N_k(m) is the estimated noise spectral coefficient, Y_k(m) is the observed noisy spectral coefficient, and X_k(m) is the clean speech spectrum, at frame m and frequency k. The squared magnitude may then be replaced by the magnitude and the Wiener filter becomes:

H(k, m) = 1 − |N_k(m)| / |Y_k(m)|
[0079] One or more conventional approaches apply time-averaging directly to the filter to reduce any frame-to-frame fluctuation. In accordance with aspects of the present disclosure, the Wiener filter is expressed in terms of the prior SNR and a decision-directed (DD) update is used to time-average the prior SNR. The Wiener filter can be expressed in terms of the prior SNR as:

H(k, m) = ρ_k(m) / (1 + ρ_k(m))

where ρ_k(m) represents the prior SNR as defined above, with the noise spectrum replaced with the estimated noise spectrum. The prior SNR is estimated according to the DD update, as described above. This gain filter, with flooring and an over-subtraction type parameter β, is thus obtained as:

H_dd(k, m) = max( ρ_k(m) / (β + ρ_k(m)), H_min )

where H_min denotes the flooring term.
In this and other arrangements, no external time-averaging is used on this gain filter since the DD update explicitly time-averages the prior SNR. The parameter β is defined based on the aggressiveness (e.g., the mode) of the noise suppressor (e.g., noise suppression module 40 shown in FIG. 1) implemented within the noise suppression system.

[0080] The Wiener filter is applied to the input magnitude spectrum to obtain a suppressed signal (e.g., an estimate of the underlying speech signal). Application of the Wiener filter 250 in noise estimation and filtering process 275 yields:

X_k(m) = H_dd(k, m) · Y_k(m)
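A sketch of the DD-based gain and its application follows; the values of beta and the floor h_min depend on the suppression mode and are placeholders here.

```python
import numpy as np

def wiener_gain(prior, beta=1.0, h_min=0.05):
    """H_dd(k, m): prior-SNR Wiener gain with over-subtraction and flooring."""
    return np.maximum(prior / (beta + prior), h_min)

# Application to the noisy spectrum: X_k(m) = H_dd(k, m) * Y_k(m)
# suppressed_spec = wiener_gain(prior) * spec
```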
Signal Synthesis
[0081] Signal synthesis 280 includes various post-noise-suppression processing to generate output frame 290, which includes clean speech. Following application of the Wiener filter, inverse DFT 255 is used to convert the frame back to the time-domain. In one or more arrangements, conversion back to the time-domain is performed as:

x(n, m) = (1/N) · Σ_{k=0}^{N−1} X_k(m) · e^{i2πnk/N}

where X_k(m) is the estimated speech after suppression with the Wiener filter, and x(n, m) is the corresponding time-domain signal, for time index n and frame index m.
[0082] Following inverse DFT 255, energy scaling 260 is performed on the noise-suppressed signal as part of the signal synthesis process 280. Energy scaling may be used to help rebuild speech frames in a manner that increases the power of the speech after suppression. For example, scaling may be performed on the basis that only speech frames are to be amplified to a certain extent, and noise frames are to be left alone. Because noise suppression may reduce the speech signal level, some amplification of speech segments during scaling 260 is beneficial. In one arrangement, scaling 260 is performed on a speech frame based on the energy lost in the frame due to the noise estimation and filtering process 275. The gain may be determined by the ratio of the energy in the frame after noise suppression processing to the energy before:

K = energy_after / energy_before

In the present example, a scale may be extracted according to the following model:

Scale = A(K) · P(H1 | m) + B(K) · (1 − P(H1 | m))

where P(H1 | m) is the probability of speech for frame m, obtained by averaging the speech probability function, P(H1 | Y_k(m), {F}), over all frequencies:

P(H1 | m) = (1/N) · Σ_k P(H1 | Y_k(m), {F})

The first term in the above scale equation will be large if the probability P(H1 | m) is close to 1 (e.g., if the frame is likely speech), and the second term will be large if the frame is likely noise.
[0083] In the above scale equation, parameters A(K), B(K) control the scaling for the input frame (e.g., input frame 200). For example, in one arrangement, A(K), B(K) may control the scaling as follows: A(K) = 1.0 + 1.3·(K − 0.5) if K > 0.5, with the maximum clipped at 1/K. For K ≤ 0.5, A(K) = 1.0. The parameter B(K) = 1.0, so the frame is not scaled for noise regions. The scale for these regions may be determined by a flooring term in the Wiener filter.
[0084] Signal synthesis 280 also includes window synthesis operation 265, which provides the final output frame 290 of estimated speech. In one example, window synthesis 265 is:
x_out(n, m) = Scale(m−1) · w(n + M) · x(n + M, m−1) + Scale(m) · w(n) · x(n, m),  0 ≤ n < M

where the Scale parameter is determined using the above Scale equation for each frame.
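Paragraphs [0082]-[0084] might be rendered as the following sketch; the per-frame speech probability, the suppressed buffer, and the synthesis window are assumed to come from the earlier stages, and the stored overlap tail is kept already scaled and windowed so the overlap-add matches the equation above.

```python
import numpy as np

def frame_scale(energy_before, energy_after, p_speech_frame):
    """Scale = A(K) * P(H1|m) + B(K) * (1 - P(H1|m)), with B(K) = 1."""
    k = energy_after / max(energy_before, 1e-10)
    a = min(1.0 + 1.3 * (k - 0.5), 1.0 / max(k, 1e-10)) if k > 0.5 else 1.0
    return a * p_speech_frame + 1.0 * (1.0 - p_speech_frame)

def window_synthesis(clean_buf, window, overlap, scale, frame_len=80):
    """Scaled, windowed overlap-add of consecutive suppressed buffers."""
    clean = scale * window * clean_buf          # Scale(m) * w(n) * x(n, m)
    out = clean[:frame_len].copy()
    out[: len(overlap)] += overlap              # Scale(m-1) * w(n+M) * x(n+M, m-1)
    return out, clean[frame_len:]               # output frame and new overlap tail
```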
Parameter Estimation

[0085] The update model of the feature-based speech/noise probability function, which is reproduced below, includes various feature weighting {τ_i} and threshold {T_i} parameters applied to the feature measurements:

q_m(H1 | F1, F2, F3) = q_m = γ_p · q_{m−1} + (1 − γ_p) · [ τ1 · M(F1 − T1) + τ2 · M(F2 − T2) + τ3 · M(F3 − T3) ]

These weighting {τ_i} and threshold {T_i} parameters are used to prevent unreliable feature measurements from coming into the update model. The map function also contains a width parameter {w_i} to control the shape of the map function:

M_i = M(F_i − T_i; w_i)

For example, if the average LRT feature (F1) is not reliable for a given input, such as where there is an error in the initial noise estimate, then the first weighting parameter may be set to zero, τ1 = 0, so as to avoid bringing an unreliable average LRT measurement into the update model.
[0086] In at least one embodiment, the initial setting for the feature weighting and threshold parameters may be that only the average LRT feature (F1) is used, and therefore τ2 = τ3 = 0, and the initial threshold for the feature is T1 = 0.5. Table 1 presents example parameter settings according to various embodiments of the disclosure. Table 1 identifies each parameter and provides a brief description and an example default value for each parameter. It should be understood that various other parameter settings and/or default values may also be used in addition to or instead of those presented in Table 1 below. The width parameters for the map function corresponding to each feature are set to the same value, w = 4, in Table 1.
TABLE 1
Example Parameter Settings

Parameter          Description                         Default Value
q                  quantile: initial noise estimate    0.25
γ_dd               smoothing for DD update             0.98
γ_lr               smoothing for LR factor             0.9
T1                 threshold for LR feature            Initial: 0.5; modified on-line
T2                 threshold for spectral flatness     Determined on-line
T3                 threshold for spectral-template     Determined on-line
τ_i (i = 1, 2, 3)  weight values of features           (1, 0, 0); updated on-line
w                  width of sigmoid map for prior      4.0
γ_p                update parameter for prior prob.    0.9
γ_n                update parameter for noise est.     0.9
λ                  threshold for noise state           0.2
H_min              flooring term in gain filter        Determined from mode
β                  over-subtraction in gain filter     Determined from mode
converged          start-up time                       200 frames
block_size         max frame size                      320
analy_size         max analysis block                  512
[0087] In one or more embodiments, the feature threshold and weighting parameters for feature measurements (e.g., T1, T2, T3 and τ1, τ2, τ3, presented in the update model of the speech/noise probability and also contained above in Table 1) are dynamically updated after a set interval. In one example, the feature threshold and weighting parameters may be updated every window, W, where W = 500 frames of the signal. In other examples, alternative update intervals may be used including various frame counts or set intervals of time. In these and other embodiments of the disclosure, the process of updating the feature threshold and weighting parameters for the feature measurements may be performed as illustrated in FIG. 4.
[0088] FIG. 4 illustrates an example update process for feature threshold and weighting parameters for feature measurements (e.g., average LRT feature (F1), spectral flatness feature (F2), and spectral template difference feature (F3)). The process begins in step 400 with feature threshold and weighting parameters (e.g., T1, T2, T3 and τ1, τ2, τ3) being set to initial values for the first W frames (e.g., 500 frames) of a speech sequence. For example, the initial values for the threshold and weighting parameters may be {T1 = 0.5} and {τ1 = 1.0, τ2 = 0, τ3 = 0}.
[0089] In step 405, histograms of the features may be computed over the W frames of the relevant (e.g., current or present) parameter estimation window. For the initial window of the speech sequence, step 405 involves the first W frames of the sequence, during which the threshold and weighting parameters are fixed to their initial values set in step 400. For later windows of the speech sequence (e.g., windows of the sequence other than the initial window), the threshold and weighting parameters are fixed to the values derived from the previous W frames.
[0090] The process continues to step 410 where, after the processing of the W frames, new threshold and weighting parameters for the features are extracted from quantities derived from the histograms computed in step 405. In one example, the threshold and weighting parameters for the features are derived from histogram quantities such as the peak positions of the histograms, the height of the histograms, the average of each feature over some range of the feature's respective histogram, and the fluctuation of each feature over some range of the feature's respective histogram. Numerous other quantities may also be derived from the histograms computed in step 405 to use in extracting new feature threshold and weighting parameters in step 410, in addition to or instead of those described above.
[0091] In at least one arrangement, the quantities derived from the histograms in step 410 are compared with some internal parameters to determine the corresponding prior model threshold and weighting parameters. Examples of such internal parameters may include the following sets: (1) a scale parameter applied to either the dominant peak value, or the sum of the two peak values, of the measured histogram, to obtain the feature threshold; (2) a parameter that merges the two histogram peaks if they are too close; (3) a parameter to reject the feature if the average height of the peaks is too small; (4) a parameter to reject the feature if the average peak position is too small; (5) a parameter to reject some feature(s) if the fluctuation of the LRT feature over the histogram range is too low; and (6) maximum and minimum limits on the thresholds for each feature. Various other parameters may also be used as the internal parameters to which the quantities derived in step 410 are compared, in addition to or instead of the example parameters described above.

[0092] In step 415, the threshold and weighting parameters extracted in step 410 are fixed or set as the feature threshold and weighting parameters for the next W frames of the speech sequence. If the end of the speech sequence is reached in step 420, then the process ends. However, if the end of the speech sequence is not reached in step 420, then the process returns to step 405 and repeats through step 420 using the next W frames of the sequence and the threshold and weighting parameters fixed in step 415.
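One plausible rendering of steps 405-410 for the LRT feature is sketched below. The bin count, the scale factor applied to the dominant peak, and the clamping limits stand in for the internal parameters discussed above and are purely illustrative.

```python
import numpy as np

def lrt_threshold(f1_history, bins=100, peak_scale=1.2, lo=0.1, hi=1.0):
    """Derive a new threshold T1 from the histogram of the average-LRT feature
    values accumulated over a window of W frames."""
    hist, edges = np.histogram(f1_history, bins=bins)   # step 405: feature histogram
    i = int(hist.argmax())                              # dominant peak bin
    peak = 0.5 * (edges[i] + edges[i + 1])              # its position
    return float(np.clip(peak_scale * peak, lo, hi))    # step 410: scale + limits
```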
[0093] In some embodiments of the disclosure, the initial feature threshold and weighting parameters set in step 400 of FIG. 4 may be used for an entire speech sequence, without the values of these parameters being updated at all. In other embodiments, the threshold and weighting parameters may be updated once following the first window of W frames of the sequence (e.g., the threshold and weighting parameters are updated one time from their initial values).
[0094] In still further embodiments of the disclosure, the feature threshold and weighting parameters update process illustrated in FIG. 4 may use overlapping windows of the sequence where, for example, W1 includes frames 1-500, W2 includes frames 250-750, W3 includes frames 500-1000, and so on. This is one alternative to using non-overlapping windows where W1 includes frames 1-500, W2 includes frames 500-1000, W3 includes frames 1000-1500, etc. Additionally, while some arrangements use fixed windows, e.g., each W includes 500 frames of the sequence, other arrangements may use variable, or changing, windows. For example, W1 may include 500 frames, W2 include 250 frames, and W3 include 750 frames. Furthermore, in one or more arrangements, these variable or changing windows may be overlapping or non-overlapping, such as W1 including frames 1-500 (500 frames), W2 including frames 500-750 (250 frames, non-overlapping), and W3 including frames 500-1250 (750 frames, overlapping). It should be understood that the threshold and weighting parameters may be updated according to a variety of other window configurations involving numerous other characteristics of a given sequence.

[0095] Referring still to the update process illustrated in FIG. 4, in some situations the feature threshold and weighting parameters extraction in step 410 may lead to one or more of the features (e.g., average LRT feature (F1), spectral flatness feature (F2), and/or spectral template difference feature (F3)) not being used in computing the update model of the speech/noise probability. In such situations, the weighting parameter for each feature that will not be included in the update model is set to 0.
[0096] In a scenario where three features are used in computing the update model of the speech/noise probability, the following may result from the feature threshold and weighting parameters extraction step of the parameter update process (e.g., step 410 shown in FIG. 4): (1) all three features are used {τ1 = 1/3, τ2 = 1/3, τ3 = 1/3}; (2) two features, e.g., features 1 and 3, are used {τ1 = 1/2, τ2 = 0, τ3 = 1/2}; or (3) only one feature, e.g., feature 1, is used {τ1 = 1.0, τ2 = 0, τ3 = 0}.
[0097] FIG. 5 is a block diagram illustrating an example computing device 500 that is arranged for multipath routing in accordance with one or more embodiments of the present disclosure. In a very basic configuration 501, computing device 500 typically includes one or more processors 510 and system memory 520. A memory bus 530 may be used for communicating between the processor 510 and the system memory 520.
[0098] Depending on the desired configuration, processor 510 can be of any type including but not limited to a microprocessor (µP), a microcontroller (µC), a digital signal processor (DSP), or any combination thereof. Processor 510 may include one or more levels of caching, such as a level one cache 511 and a level two cache 512, a processor core 513, and registers 514. The processor core 513 may include an arithmetic logic unit (ALU), a floating point unit (FPU), a digital signal processing core (DSP Core), or any combination thereof. A memory controller 515 can also be used with the processor 510, or in some embodiments the memory controller 515 can be an internal part of the processor 510.
[0099] Depending on the desired configuration, the system memory 520 can be of any type including but not limited to volatile memory (e.g., RAM), non-volatile memory (e.g., ROM, flash memory, etc.) or any combination thereof. System memory 520 typically includes an operating system 521, one or more applications 522, and program data 524. In at least some embodiments, application 522 includes a multipath processing algorithm 523 that is configured to pass a noisy input signal to a noise suppression component. The multipath processing algorithm is further arranged to pass a noise-suppressed output from the noise suppression component to other components in the signal processing pathway. Program Data 524 may include multipath routing data 525 that is useful for passing a noisy input signal along multiple signal pathways to, for example, a noise suppression component such that the component receives the noisy signal before the signal has been manipulated or altered by other audio processing.
[0100] Computing device 500 can have additional features and/or functionality, and additional interfaces to facilitate communications between the basic configuration 501 and any required devices and interfaces. For example, a bus/interface controller 540 can be used to facilitate communications between the basic configuration 501 and one or more data storage devices 550 via a storage interface bus 541. The data storage devices 550 can be removable storage devices 551, non-removable storage devices 552, or any combination thereof. Examples of removable storage and non-removable storage devices include magnetic disk devices such as flexible disk drives and hard-disk drives (HDD), optical disk drives such as compact disk (CD) drives or digital versatile disk (DVD) drives, solid state drives (SSD), tape drives and the like. Example computer storage media can include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information, such as computer readable instructions, data structures, program modules, and/or other data.
[0101] System memory 520, removable storage 551 and non-removable storage 552 are all examples of computer storage media. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computing device 500. Any such computer storage media can be part of computing device 500.
[0102] Computing device 500 can also include an interface bus 542 for facilitating communication from various interface devices (e.g., output interfaces, peripheral interfaces, communication interfaces, etc.) to the basic configuration 501 via the bus/interface controller 540. Example output devices 560 include a graphics processing unit 561 and an audio processing unit 562, either or both of which can be configured to communicate to various external devices such as a display or speakers via one or more A/V ports 563. Example peripheral interfaces 570 include a serial interface controller 571 or a parallel interface controller 572, which can be configured to communicate with external devices such as input devices (e.g., keyboard, mouse, pen, voice input device, touch input device, etc.) or other peripheral devices (e.g., printer, scanner, etc.) via one or more I/O ports 573. An example communication device 580 includes a network controller 581, which can be arranged to facilitate communications with one or more other computing devices 590 over a network communication (not shown) via one or more communication ports 582. The communication connection is one example of a communication media. Communication media may typically be embodied by computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism, and includes any information delivery media. A "modulated data signal" can be a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media can include wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, radio frequency (RF), infrared (IR) and other wireless media. The term computer readable media as used herein can include both storage media and communication media.
[0103] Computing device 500 can be implemented as a portion of a small-form factor portable (or mobile) electronic device such as a cell phone, a personal data assistant (PDA), a personal media player device, a wireless web-watch device, a personal headset device, an application-specific device, or a hybrid device that includes any of the above functions. Computing device 500 can also be implemented as a personal computer including both laptop computer and non-laptop computer configurations.
[0104] There is little distinction left between hardware and software implementations of aspects of systems; the use of hardware or software is generally (but not always, in that in certain contexts the choice between hardware and software can become significant) a design choice representing cost versus efficiency tradeoffs. There are various vehicles by which processes and/or systems and/or other technologies described herein can be effected (e.g., hardware, software, and/or firmware), and the preferred vehicle will vary with the context in which the processes and/or systems and/or other technologies are deployed. For example, if an implementer determines that speed and accuracy are paramount, the implementer may opt for a mainly hardware and/or firmware vehicle; if flexibility is paramount, the implementer may opt for a mainly software implementation. In one or more other scenarios, the implementer may opt for some combination of hardware, software, and/or firmware.
[0105] The foregoing detailed description has set forth various embodiments of the devices and/or processes via the use of block diagrams, flowcharts, and/or examples. Insofar as such block diagrams, flowcharts, and/or examples contain one or more functions and/or operations, it will be understood by those within the art that each function and/or operation within such block diagrams, flowcharts, or examples can be implemented, individually and/or collectively, by a wide range of hardware, software, firmware, or virtually any combination thereof.
[0106] In one or more embodiments, several portions of the subject matter described herein may be implemented via Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs), digital signal processors (DSPs), or other integrated formats. However, those skilled in the art will recognize that some aspects of the embodiments described herein, in whole or in part, can be equivalently implemented in integrated circuits, as one or more computer programs running on one or more computers (e.g., as one or more programs running on one or more computer systems), as one or more programs running on one or more processors (e.g., as one or more programs running on one or more microprocessors), as firmware, or as virtually any combination thereof. Those skilled in the art will further recognize that designing the circuitry and/or writing the code for the software and/or firmware would be well within the skill of one skilled in the art in light of the present disclosure.
[0107] Additionally, those skilled in the art will appreciate that the mechanisms of the subject matter described herein are capable of being distributed as a program product in a variety of forms, and that an illustrative embodiment of the subject matter described herein applies regardless of the particular type of signal-bearing medium used to actually carry out the distribution. Examples of a signal-bearing medium include, but are not limited to, the following: a recordable-type medium such as a floppy disk, a hard disk drive, a Compact Disc (CD), a Digital Video Disk (DVD), a digital tape, a computer memory, etc.; and a transmission-type medium such as a digital and/or an analog communication medium (e.g., a fiber optic cable, a waveguide, a wired communications link, a wireless communication link, etc.).
[0108] Those skilled in the art will also recognize that it is common within the art to describe devices and/or processes in the fashion set forth herein, and thereafter use engineering practices to integrate such described devices and/or processes into data processing systems. That is, at least a portion of the devices and/or processes described herein can be integrated into a data processing system via a reasonable amount of experimentation. Those having skill in the art will recognize that a typical data processing system generally includes one or more of a system unit housing, a video display device, a memory such as volatile and non-volatile memory, processors such as microprocessors and digital signal processors, computational entities such as operating systems, drivers, graphical user interfaces, and applications programs, one or more interaction devices, such as a touch pad or screen, and/or control systems including feedback loops and control motors (e.g., feedback for sensing position and/or velocity; control motors for moving and/or adjusting components and/or quantities). A typical data processing system may be implemented utilizing any suitable commercially available components, such as those typically found in data computing/communication and/or network computing/communication systems.
[0109] With respect to the use of substantially any plural and/or singular terms herein, those having skill in the art can translate from the plural to the singular and/or from the singular to the plural as is appropriate to the context and/or application. The various singular/plural permutations may be expressly set forth herein for sake of clarity.
[0110] While various aspects and embodiments have been disclosed herein, other aspects and embodiments will be apparent to those skilled in the art. The various aspects and embodiments disclosed herein are for purposes of illustration and are not intended to be limiting, with the true scope and spirit being indicated by the following claims.

Claims

We claim:
1. A method for noise estimation and filtering by a noise suppression module, the method comprising: defining by the noise suppression module, for each of a plurality of successive frames of an input signal received at the noise suppression module, a speech probability function based on an initial noise estimation for the frame; measuring a plurality of signal classification features for each of the plurality of frames; computing a feature-based speech probability for each of the plurality of frames using the measured signal classification features of the frame; applying one or more dynamic weighting factors to the computed feature-based speech probability for each of the plurality of frames; modifying the speech probability function for each of the plurality of frames based on the computed feature-based speech probability of the frame; and updating the initial noise estimation for each of the plurality of frames using the modified speech probability function for the frame.
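For illustration only, the per-frame loop recited in claim 1 can be pictured with a minimal Python sketch. This is not the claimed implementation: the SNR-based prior, the multiplicative combination of the two probabilities, and the soft noise update are all assumptions, and every function name is hypothetical.

```python
import numpy as np

def lr_speech_probability(spectrum, noise, eps=1e-12):
    # Hypothetical stand-in for the likelihood-ratio prior: higher
    # posterior SNR in a bin implies higher speech probability there.
    snr = spectrum ** 2 / np.maximum(noise ** 2, eps)
    return snr / (1.0 + snr)

def process_frame(spectrum, noise, p_feature):
    # Prior speech probability from the current noise estimate ...
    p_speech = lr_speech_probability(spectrum, noise)
    # ... modified by the feature-based probability (combination rule assumed)
    p_modified = p_speech * p_feature
    # Soft noise update: track the frame spectrum where speech is unlikely
    noise_updated = p_modified * noise + (1.0 - p_modified) * spectrum
    return p_modified, noise_updated

# Example with one 129-bin magnitude-spectrum frame
rng = np.random.default_rng(0)
frame = np.abs(rng.normal(size=129))
noise0 = np.full(129, 0.5)          # initial noise estimation
p_mod, noise1 = process_frame(frame, noise0, p_feature=0.3)
```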
2. The method of claim 1, further comprising: filtering noise from each of the plurality of frames using the updated initial noise estimation for each frame.
3. The method of claim 1, wherein the one or more dynamic weighting factors includes weight and threshold parameters for each of the plurality of signal classification features.
4. The method of claim 1, wherein the initial noise estimation is based on quantile noise estimation for each of the plurality of successive frames.
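The quantile-based initial estimate of claim 4 can be pictured as taking a low quantile of recent magnitude spectra per frequency bin; noise occupies the spectral floor, while speech adds transient peaks that a low quantile largely ignores. The window length and quantile value below are assumptions, and `quantile_noise_estimate` is a hypothetical helper, not the patent's formula.

```python
import numpy as np

def quantile_noise_estimate(recent_spectra, q=0.25):
    # Low quantile over a window of frames, taken per frequency bin
    return np.quantile(np.asarray(recent_spectra), q, axis=0)

# Example: estimate from a 50-frame history of 129-bin spectra
rng = np.random.default_rng(1)
history = np.abs(rng.normal(size=(50, 129)))
noise_est = quantile_noise_estimate(history)    # shape (129,)
```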
5. The method of claim 1, wherein the applying the one or more dynamic weighting factors to the computed feature-based speech probability includes: applying the one or more dynamic weighting factors to each of the measured signal classification features of the frame; and updating the feature-based speech probability for the frame with the one or more dynamic weighting factors applied.
6. The method of claim 5, wherein the applying the one or more dynamic weighting factors to each of the measured signal classification features of the frame includes combining the one or more dynamic weighting factors and the measured signal classification features into a feature-based speech probability function.
7. The method of claim 6, further comprising: updating, for each of the plurality of frames, the feature-based speech probability function; and updating, for each of the plurality of frames, the speech probability function based on the updated feature-based speech probability function.
8. The method of claim 1, wherein the plurality of signal classification features is used to classify the input signal into a class state of speech or noise.
9. The method of claim 7, wherein the feature-based speech probability function is updated with a recursive average.
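The recursive average of claim 9 is ordinary exponential smoothing across frames; a one-line sketch, with the smoothing constant `gamma` as an assumed parameter:

```python
def recursive_average(p_prev, p_new, gamma=0.9):
    # Exponential smoothing: gamma sets how quickly older frames
    # are forgotten in the feature-based speech probability.
    return gamma * p_prev + (1.0 - gamma) * p_new
```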
10. The method of claim 6, wherein the feature-based speech probability function is obtained by mapping each of the plurality of signal classification features to a probability value using a map function.
11. The method of claim 10, wherein the map function is defined on a value of the signal classification feature and includes one or more threshold and width parameters.
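A sigmoid is one plausible realization of the map function of claims 10 and 11, though the claims leave the exact shape open; `threshold` and `width` play the roles of the claimed threshold and width parameters. Whether high feature values map toward speech or toward noise depends on the feature, so the orientation here is arbitrary.

```python
import numpy as np

def map_feature(value, threshold, width):
    # Smooth map from a raw feature value to a probability in (0, 1),
    # centered at `threshold` with sharpness set by `width`.
    return 1.0 / (1.0 + np.exp(-(value - threshold) / width))

p = map_feature(0.2, threshold=0.5, width=0.1)  # ~0.047 for this input
```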
12. The method of claim 1, wherein the speech probability function is further based on a likelihood ratio factor for the frame.
13. The method of claim 1, wherein the plurality of signal classification features includes at least: average likelihood ratio over time, spectral flatness measure, and spectral template difference measure.
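Of the three features named in claim 13, the spectral flatness measure has a standard closed form: the geometric mean of the power spectrum divided by its arithmetic mean, near 1 for flat (noise-like) spectra and near 0 for peaky, harmonic (speech-like) spectra. A sketch:

```python
import numpy as np

def spectral_flatness(magnitude_spectrum, eps=1e-12):
    # Geometric mean / arithmetic mean of the power spectrum
    power = np.maximum(np.asarray(magnitude_spectrum) ** 2, eps)
    return float(np.exp(np.mean(np.log(power))) / np.mean(power))
```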
14. The method of claim 1, wherein the one or more dynamic weighting factors select, as the plurality of signal classification features, at least one of: average likelihood ratio over time, spectral flatness measure, and spectral template difference measure.
15. The method of claim 13, wherein the spectral template difference measure is based on a comparison of a spectrum of the input signal with a template noise spectrum.
16. The method of claim 15, wherein the template noise spectrum is estimated based on an updated noise estimation using an updated speech probability function and a set of estimated shape parameters.
17. The method of claim 16, wherein the estimated shape parameters are one or more of a shift, amplitude, and normalization parameter.
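Claims 15 through 17 can be pictured as fitting the noise template to the input spectrum with a small set of shape parameters and measuring the residual. The least-squares fit of an amplitude and a shift below is an illustrative stand-in for the claimed shape-parameter estimation, and normalizing by the input energy is an assumption.

```python
import numpy as np

def template_difference(spectrum, template, eps=1e-12):
    # Fit spectrum ~ amplitude * template + shift by least squares; the
    # energy-normalized residual serves as the difference measure.
    A = np.column_stack([template, np.ones_like(template)])
    (amplitude, shift), *_ = np.linalg.lstsq(A, spectrum, rcond=None)
    residual = spectrum - (amplitude * template + shift)
    return float(np.sum(residual ** 2) / (np.sum(spectrum ** 2) + eps))
```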
18. The method of claim 2, further comprising: in response to filtering noise from each of the plurality of frames, energy scaling each of the plurality of frames based on the modified speech probability function of the frame.
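The energy scaling of claim 18 can be sketched as a probability-weighted gain applied after filtering: frames judged likely speech have some of the energy removed by the filter restored, while likely-noise frames keep the suppression. The linear blend and the floor gain are assumptions, not the patent's formula.

```python
import numpy as np

def energy_scale(filtered_frame, noisy_frame, p_speech,
                 floor_gain=0.5, eps=1e-12):
    # Gain that would restore the frame's pre-filter energy
    restore = np.sqrt((np.sum(noisy_frame ** 2) + eps) /
                      (np.sum(filtered_frame ** 2) + eps))
    # Probability-weighted blend between restoration and the floor gain
    gain = p_speech * restore + (1.0 - p_speech) * floor_gain
    return filtered_frame * gain
```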
19. The method of claim 3, further comprising: setting initial values for the weight and threshold parameters applied to each of the plurality of signal classification features; and updating the initial values for the weight and threshold parameters after a first interval of the input signal.
20. The method of claim 19, wherein updating the initial values for the weight and threshold parameters includes: computing histograms for each of the plurality of signal classification features over the first interval; determining new values for the weight and threshold parameters from one or more quantities derived from the histograms; and using the new values for the weight and threshold parameters for a second interval of the input signal.
21. The method of claim 20, wherein the first and second intervals are sequences of frames of the input signal.
22. The method of claim 20, further comprising: comparing the one or more quantities derived from the histograms with one or more internal parameters to determine corresponding weight and threshold parameters of the feature-based speech probability of the input signal.
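Finally, the adaptive update of claims 19 through 22 can be pictured as recomputing each feature's threshold (and, analogously, its weight) from a histogram gathered over the first interval, then applying the new values during the second interval. Picking the dominant histogram mode as the noise reference and the 1.2 margin are illustrative assumptions, as is the helper name.

```python
import numpy as np

def update_threshold(feature_history, bins=100, margin=1.2):
    # Histogram of the feature over the first interval of frames
    hist, edges = np.histogram(np.asarray(feature_history), bins=bins)
    k = int(np.argmax(hist))                      # bin of the dominant mode
    peak_value = 0.5 * (edges[k] + edges[k + 1])  # bin center
    return margin * peak_value                    # threshold above the mode

# Example: new threshold from 500 frames of spectral-flatness values
rng = np.random.default_rng(2)
flatness_values = rng.beta(2.0, 5.0, size=500)
new_threshold = update_threshold(flatness_values)
```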
PCT/US2011/036637 2011-05-16 2011-05-16 Noise supression method and apparatus using multiple feature modeling for speech/noise likelihood WO2012158156A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201180072331.0A CN103650040B (en) 2011-05-16 2011-05-16 Use the noise suppressing method and device of multiple features modeling analysis speech/noise possibility
PCT/US2011/036637 WO2012158156A1 (en) 2011-05-16 2011-05-16 Noise supression method and apparatus using multiple feature modeling for speech/noise likelihood

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2011/036637 WO2012158156A1 (en) 2011-05-16 2011-05-16 Noise supression method and apparatus using multiple feature modeling for speech/noise likelihood

Publications (1)

Publication Number Publication Date
WO2012158156A1 (en)

Family

ID: 44279729

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2011/036637 WO2012158156A1 (en) 2011-05-16 2011-05-16 Noise supression method and apparatus using multiple feature modeling for speech/noise likelihood

Country Status (2)

Country Link
CN (1) CN103650040B (en)
WO (1) WO2012158156A1 (en)

Families Citing this family (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105989843A (en) * 2015-01-28 2016-10-05 中兴通讯股份有限公司 Method and device of realizing missing feature reconstruction
US9330684B1 (en) * 2015-03-27 2016-05-03 Continental Automotive Systems, Inc. Real-time wind buffet noise detection
CN104900237B (en) * 2015-04-24 2019-07-05 上海聚力传媒技术有限公司 A kind of methods, devices and systems for audio-frequency information progress noise reduction process
CN104886981B (en) * 2015-04-29 2017-05-17 成都陌云科技有限公司 Active noise reduction bed
US9966073B2 (en) * 2015-05-27 2018-05-08 Google Llc Context-sensitive dynamic update of voice to text model in a voice-enabled electronic device
GB2536742B (en) * 2015-08-27 2017-08-09 Imagination Tech Ltd Nearend speech detector
CN106571150B (en) * 2015-10-12 2021-04-16 阿里巴巴集团控股有限公司 Method and system for recognizing human voice in music
CN105355199B (en) * 2015-10-20 2019-03-12 河海大学 A kind of model combination audio recognition method based on the estimation of GMM noise
CN107564512B (en) * 2016-06-30 2020-12-25 展讯通信(上海)有限公司 Voice activity detection method and device
CN106384597B (en) * 2016-08-31 2020-01-21 广州市网星信息技术有限公司 Audio data processing method and device
GB201617016D0 (en) * 2016-09-09 2016-11-23 Continental automotive systems inc Robust noise estimation for speech enhancement in variable noise conditions
CN107123419A (en) * 2017-05-18 2017-09-01 北京大生在线科技有限公司 The optimization method of background noise reduction in the identification of Sphinx word speeds
CN108022591B (en) * 2017-12-30 2021-03-16 北京百度网讯科技有限公司 Processing method and device for voice recognition in-vehicle environment and electronic equipment
CN109643554B (en) * 2018-11-28 2023-07-21 深圳市汇顶科技股份有限公司 Adaptive voice enhancement method and electronic equipment
CN110164467B (en) * 2018-12-18 2022-11-25 腾讯科技(深圳)有限公司 Method and apparatus for speech noise reduction, computing device and computer readable storage medium
CN109979478A (en) * 2019-04-08 2019-07-05 网易(杭州)网络有限公司 Voice de-noising method and device, storage medium and electronic equipment
CN110265064B (en) * 2019-06-12 2021-10-08 腾讯音乐娱乐科技(深圳)有限公司 Audio frequency crackle detection method, device and storage medium
WO2021007841A1 (en) * 2019-07-18 2021-01-21 深圳市汇顶科技股份有限公司 Noise estimation method, noise estimation apparatus, speech processing chip and electronic device
CN110648680B (en) * 2019-09-23 2024-05-14 腾讯科技(深圳)有限公司 Voice data processing method and device, electronic equipment and readable storage medium
CN110739005B (en) * 2019-10-28 2022-02-01 南京工程学院 Real-time voice enhancement method for transient noise suppression
CN111429929B (en) * 2020-03-03 2023-01-03 厦门快商通科技股份有限公司 Voice denoising method, voice recognition method and computer readable storage medium
CN113470674B (en) * 2020-03-31 2023-06-16 珠海格力电器股份有限公司 Voice noise reduction method and device, storage medium and computer equipment
CN113539300A (en) * 2020-04-10 2021-10-22 宇龙计算机通信科技(深圳)有限公司 Voice detection method and device based on noise suppression, storage medium and terminal
CN112002339B (en) * 2020-07-22 2024-01-26 海尔优家智能科技(北京)有限公司 Speech noise reduction method and device, computer-readable storage medium and electronic device
CN111986691B (en) * 2020-09-04 2024-02-02 腾讯科技(深圳)有限公司 Audio processing method, device, computer equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1770264A (en) * 2000-12-28 2006-05-10 日本电气株式会社 Noise removing method and device

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1662481A2 (en) * 2004-11-25 2006-05-31 LG Electronics Inc. Speech detection method
EP2058797A1 (en) * 2007-11-12 2009-05-13 Harman Becker Automotive Systems GmbH Discrimination between foreground speech and background noise

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
COHEN I ET AL: "Speech enhancement for non-stationary noise environments", SIGNAL PROCESSING, ELSEVIER SCIENCE PUBLISHERS B.V. AMSTERDAM, NL, vol. 81, no. 11, 1 November 2001 (2001-11-01), pages 2403 - 2418, XP004308517, ISSN: 0165-1684, DOI: 10.1016/S0165-1684(01)00128-1 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI557722B (en) * 2012-11-15 2016-11-11 緯創資通股份有限公司 Method to filter out speech interference, system using the same, and computer readable recording medium
WO2016135741A1 (en) * 2015-02-26 2016-09-01 Indian Institute Of Technology Bombay A method and system for suppressing noise in speech signals in hearing aids and speech communication devices
US10032462B2 (en) 2015-02-26 2018-07-24 Indian Institute Of Technology Bombay Method and system for suppressing noise in speech signals in hearing aids and speech communication devices
CN111261183A (en) * 2018-12-03 2020-06-09 珠海格力电器股份有限公司 Method and device for denoising voice
CN111261183B (en) * 2018-12-03 2022-11-22 珠海格力电器股份有限公司 Method and device for denoising voice
CN112017676A (en) * 2019-05-31 2020-12-01 京东数字科技控股有限公司 Audio processing method, apparatus and computer readable storage medium
CN111477243A (en) * 2020-04-16 2020-07-31 维沃移动通信有限公司 Audio signal processing method and electronic equipment

Also Published As

Publication number Publication date
CN103650040B (en) 2017-08-25
CN103650040A (en) 2014-03-19

Similar Documents

Publication Publication Date Title
WO2012158156A1 (en) Noise supression method and apparatus using multiple feature modeling for speech/noise likelihood
US8428946B1 (en) System and method for multi-channel multi-feature speech/noise classification for noise suppression
CN111418010B (en) Multi-microphone noise reduction method and device and terminal equipment
US10504539B2 (en) Voice activity detection systems and methods
US9305567B2 (en) Systems and methods for audio signal processing
US9165567B2 (en) Systems, methods, and apparatus for speech feature detection
EP2633519B1 (en) Method and apparatus for voice activity detection
US8712074B2 (en) Noise spectrum tracking in noisy acoustical signals
CN107113521B (en) Keyboard transient noise detection and suppression in audio streams with auxiliary keybed microphones
Cohen Speech enhancement using super-Gaussian speech models and noncausal a priori SNR estimation
EP2710590B1 (en) Super-wideband noise supression
US9520138B2 (en) Adaptive modulation filtering for spectral feature enhancement
EP3757993B1 (en) Pre-processing for automatic speech recognition
CN112951259A (en) Audio noise reduction method and device, electronic equipment and computer readable storage medium
US10249322B2 (en) Audio processing devices and audio processing methods
CN112309417B (en) Method, device, system and readable medium for processing audio signal with wind noise suppression
Upadhyay et al. An improved multi-band spectral subtraction algorithm for enhancing speech in various noise environments
Martín-Doñas et al. Dual-channel DNN-based speech enhancement for smartphones
JP6190373B2 (en) Audio signal noise attenuation
US20150162014A1 (en) Systems and methods for enhancing an audio signal
WO2017128910A1 (en) Method, apparatus and electronic device for determining speech presence probability
Dionelis On single-channel speech enhancement and on non-linear modulation-domain Kalman filtering
Sunitha et al. NOISE ROBUST SPEECH RECOGNITION UNDER NOISY ENVIRONMENTS.
WO2022068440A1 (en) Howling suppression method and apparatus, computer device, and storage medium
Mao et al. An improved iterative wiener filtering algorithm for speech enhancement

Legal Events

Code 121 (EP): The EPO has been informed by WIPO that EP was designated in this application. Ref document number: 11721212; country of ref document: EP; kind code of ref document: A1.

Code NENP: Non-entry into the national phase. Ref country code: DE.

Code 122 (EP): PCT application non-entry in the European phase. Ref document number: 11721212; country of ref document: EP; kind code of ref document: A1.