WO2017143334A1 - Method and system for multi-talker babble noise reduction using q-factor based signal decomposition - Google Patents

Method and system for multi-talker babble noise reduction using q-factor based signal decomposition

Info

Publication number
WO2017143334A1
Authority
WO
WIPO (PCT)
Prior art keywords
component
noise
audio signal
signal
speech
Prior art date
Application number
PCT/US2017/018696
Other languages
French (fr)
Inventor
Roozbeh SOLEYMANI
Ivan W. SELESNICK
David M. LANDSBERGER
Original Assignee
New York University
Priority date
Filing date
Publication date
Application filed by New York University
Publication of WO2017143334A1
Priority to US15/703,721 (US10319390B2)

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R1/00 Details of transducers, loudspeakers or microphones
    • H04R1/10 Earpieces; Attachments therefor; Earphones; Monophonic headphones
    • H04R1/1083 Reduction of ambient noise
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R25/00 Deaf-aid sets, i.e. electro-acoustic or electro-mechanical hearing aids; Electric tinnitus maskers providing an auditory perception
    • H04R25/50 Customised settings for obtaining desired overall acoustical characteristics
    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61N ELECTROTHERAPY; MAGNETOTHERAPY; RADIATION THERAPY; ULTRASOUND THERAPY
    • A61N1/00 Electrotherapy; Circuits therefor
    • A61N1/18 Applying electric currents by contact electrodes
    • A61N1/32 Applying electric currents by contact electrodes alternating or intermittent currents
    • A61N1/36 Applying electric currents by contact electrodes alternating or intermittent currents for stimulation
    • A61N1/36036 Applying electric currents by contact electrodes alternating or intermittent currents for stimulation of the outer, middle or inner ear
    • A61N1/36038 Cochlear stimulation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L2021/02087 Noise filtering the noise being separate speech, e.g. cocktail party
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R2225/00 Details of deaf aids covered by H04R25/00, not provided for in any of its subgroups
    • H04R2225/43 Signal processing in hearing aids to enhance the speech intelligibility
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R2499/00 Aspects covered by H04R or H04S not otherwise provided for in their subgroups
    • H04R2499/10 General applications
    • H04R2499/11 Transducers incorporated or for use in hand-held devices, e.g. mobile phones, PDA's, camera's
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R3/00 Circuits for transducers, loudspeakers or microphones

Definitions

  • the present invention relates generally to a method and a system for noise reduction, such as, for example, in a cochlear implant, a telephone, an electronic communication, etc.
  • Cochlear implants may restore the ability to hear to deaf or partially deaf individuals by providing electrical stimulation to the auditory nerve via a series of electrodes placed in the cochlea.
  • CIs may successfully enable almost all post-lingually deaf users (i.e., those who lost their hearing after learning speech and language) to gain an auditory understanding of an environment and/or to restore hearing to a level suitable for understanding speech without the aid of lipreading.
  • One of the key challenges for CI users is to be able to clearly and/or intelligibly understand speech in the context of background noise.
  • Conventional CI devices have been able to aid patients to hear and ascertain speech in a quiet environment, but the performance of such devices quickly degrades in noisy environments.
  • There have been a number of attempts to isolate speech from background noise, e.g., single-channel noise reduction algorithms.
  • Typical single-channel noise reduction algorithms have included applying a gain to the noisy envelopes, pause detection and spectral subtraction, feature extraction and splitting the spectrogram into noise and speech dominated tiles.
  • speech understanding in the presence of competing talkers (i.e., speech babble noise) remains particularly difficult for such algorithms, and additional artifacts are often introduced.
  • one embodiment of the present invention provides systems and methods for reducing noise and/or improving intelligibility of an audio signal.
  • a method for reducing noise comprises a first step for receiving an input audio signal comprising a speech signal and a noise.
  • the noise may comprise a multi-talker babble noise.
  • the method also comprises a step for decomposing the input audio signal into at least two components, the at least two components comprising a first component having a low or no sustained oscillatory pattern, and a second component having a high oscillatory pattern.
  • the decomposing step comprises de-noising the first and second components, and the first component is more aggressively de-noised than the second component.
  • the decomposing step may include determining a first Tunable Q-Factor Wavelet Transform (TQWT) for the first component and a second TQWT for the second component.
  • the method also comprises a step for de-noising the second component based on data generated from the first component to obtain a modified second component.
  • the de-noising step comprises further modifying the second component to obtain a modified second component having a temporal and spectral pattern (TSP) corresponding to a TSP of the first component.
  • the method further comprises a step for outputting an audio signal having reduced noise, the output audio signal comprising the first component in combination with the modified second component.
  • the outputted audio signal may more closely correspond to the speech signal than the input audio signal.
  • a method for improving intelligibility of speech comprises a first step for obtaining, from a receiving arrangement, an input audio signal comprising a speech signal and a noise, and then a step for estimating a noise level of the input audio signal.
  • the estimating step comprises determining or estimating a signal-to-noise ratio (SNR) for the input audio signal.
  • the method also includes a step for decomposing the input audio signal into at least two components when the estimated noise level of the input audio signal is above a predetermined threshold, the at least two components comprising a first component having a low or no sustained oscillatory pattern, and a second component having a high oscillatory pattern.
  • the method also includes a step for de-noising the second component based on data generated from the first component to obtain a modified second component.
  • the method further includes a step for outputting an audio signal having reduced noise to an output arrangement, the output audio signal comprising the first component in combination with the modified second component.
  • a non-transitory computer readable medium storing a computer program that is executable by at least one processing unit.
  • the computer program comprises sets of instructions for: receiving an input audio signal comprising a speech signal and a noise; decomposing the input audio signal into at least two components, the at least two components comprising a first component having a low or no sustained oscillatory pattern, and a second component having a high oscillatory pattern; de-noising the second component based on data generated from the first component to obtain a modified second component; and outputting an audio signal having reduced noise, the output audio signal comprising the first component in combination with the modified second component.
  • a system for improving intelligibility for a user may comprise a receiving arrangement configured to receive an input audio signal comprising a speech signal and a noise.
  • the system may also include a processing arrangement configured to receive the input audio signal from the cochlear implant, decompose the input audio signal into at least two components, the at least two components comprising a first component having a low or no sustained oscillatory pattern, and a second component having a high oscillatory pattern, de-noise the second component based on data generated from the first component to obtain a modified second component, and output an audio signal having reduced noise to the cochlear implant, the output audio signal comprising the first component in combination with the modified second component.
  • the system may further comprise a cochlear implant, wherein the cochlear implant includes the receiving arrangement, and the cochlear implant is configured to generate an electrical stimulation to the user, the electrical stimulation corresponds to the output audio signal.
  • the system may further comprise a mobile computing device, wherein the mobile computing device includes the receiving arrangement, and the mobile computing device is configured to generate an audible sound corresponding to the output audio signal.
  • Fig. 1a shows an exemplary method for noise reduction, in particular, multi-talker babble noise reduction in a cochlear implant.
  • Fig. 1b shows an alternative exemplary method for noise reduction, in particular, multi-talker babble noise reduction in a cochlear implant.
  • FIG. 2 shows an exemplary computer system for performing method for noise reduction.
  • FIG. 3 shows an exemplary embodiment of a user interface for a MUSHRA (Multiple Stimuli with Hidden Reference and Anchor) evaluation.
  • Fig. 4a shows data corresponding to percentages of words correct in normal patients for input signals that are unprocessed and processed using the exemplary method of Fig. 1a.
  • Fig. 4b shows data corresponding to MUSHRA scores in normal patients for input signals that are unprocessed and processed using the exemplary method of Fig. 1a.
  • Fig. 5 shows data corresponding to percentages of words correct in CI patients for input signals that are unprocessed and processed using the exemplary method of Fig. 1a.
  • Fig. 6 shows data corresponding to MUSHRA scores in CI patients for input signals that are unprocessed and processed using the exemplary method of Fig. 1a.
  • Fig. 7a shows an average of the data corresponding to percentages of words correct in CI patients of Fig. 5.
  • Fig. 7b shows an average of the data corresponding to MUSHRA scores in CI patients of Fig. 6.
  • Fig. 8 shows a Gaussian Mixture Model of data corresponding to noisy speech samples with SNRs ranging from -10 dB to 20 dB processed using the exemplary method of Fig. 1b.
  • Fig. 9 shows data corresponding to variation of accuracy metric F as a function of
  • Fig. 10 shows data corresponding to frequency response and sub-band wavelets of a
  • Fig. 11 shows data corresponding to low frequency Gap Binary Patterns for clean/noisy speech samples processed using the exemplary method of Fig. 1b.
  • Fig. 12 shows the effect of each of initial de-noising and spectral cleaning on the weighted normalized Manhattan distance M_w, measured on noisy speech samples corrupted with various randomly created multi-talker babbles processed according to the exemplary method of Fig. 1b.
  • the present invention is directed to a method and system for multi-talker babble noise reduction.
  • the system may be used with an audio processing device, a cochlear implant, a mobile computing device, a smart phone, a computing tablet, or a computing device to improve the intelligibility of input audio signals, particularly that of speech.
  • the system may be used in a cochlear implant to improve recognition and intelligibility of speech to patients in need of hearing assistance.
  • the method and system for multi-talker babble noise reduction may utilize Q-factor based signal decomposition, which is further described below.
  • Cochlear implants may restore the ability to hear to deaf or partially deaf individuals.
  • conventional cochlear implants are often ineffective in noisy environments, because it is difficult for a user to intelligibly understand speech in the context of background noise.
  • original signals having a background of multi-talker babble noise are particularly difficult to filter and/or process to improve intelligibility to the user, because they often include background noise that does not adhere to any predictable prior pattern.
  • multi-talker babble noise tends to reflect the spontaneous speech patterns of having multiple speakers within one room, and it is therefore difficult for the user to intelligibly understand the desired speech while it is competing with simultaneous multi-talker babble noise.
  • modulation based methods may differentiate speech from noise based on temporal characteristics, including modulation depth and/or frequency, and may subsequently apply a gain reduction to the noisy signals or portions of signals, e.g., noisy envelopes.
  • spectral subtraction based methods may estimate a noise spectrum using a predetermined pattern, which may be generated based on prior knowledge (e.g., detection of prior speech patterns) or speech pause detection, and may subsequently subtract the estimated noise spectrum from a noisy speech spectrum.
  • sub-space noise reduction methods may be based on a noisy speech vector, which may be projected onto different sub-spaces for analysis, e.g., a signal sub-space and a noise sub-space.
  • the clean signal may be estimated by a sub-space noise reduction method by retaining only the components in the signal sub-space, and nullifying the components in the noise sub-space.
  • An additional example may include an envelope subtraction algorithm, which is based on the principle that the clean (noise-free) envelope may be estimated by subtracting the noise envelope, which may be separately estimated, from the noisy envelope.
  • Another example may include a method that utilizes S-shaped compression functions in place of the conventional logarithmic compression functions for noise suppression.
  • a binary masking algorithm may utilize features extracted from training data and categorize each time-frequency region of a spectrogram as speech-dominant or noise-dominant.
  • a wavelet-based noise reduction method may provide de-noising in a wavelet domain by utilizing shrinking and/or thresholding operations.
  • the exemplary embodiments described herein provide a method and system for noise reduction, particularly multi-talker babble noise reduction, that is believed to bypass the conundrum of choosing a single optimal operating point between aggressive and mild noise removal, by applying both aggressive and mild noise removal methods at the same time to benefit from the advantages, and avoid the disadvantages, of both approaches.
  • the exemplary method comprises a first step for decomposing a noisy signal into two components, which may also perform a preliminary de-noising of the signal at the same time.
  • This first step for decomposing the noisy signal into two components may utilize any suitable signal processing methods.
  • this first step may utilize one, two, or more wavelet or wavelet-like transforms and a signal decomposition method, e.g., a sparsity based signal decomposition method, optionally coupled with a de-noising optimization method.
  • this first step may utilize two Tunable Q-Factor Wavelet Transforms (TQWTs) and a sparsity based signal decomposition method coupled with applying a Basis Pursuit De-noising optimization method.
  • Wavelets, sparsity based decomposition methods and de-noising optimization methods may be highly tunable. Therefore, their parameters may be adjusted to obtain desired features in output components.
  • the output components of this first step may include two main products and a byproduct.
  • the two main products may include a Low Q-factor (LQF) component and a High Q-factor (HQF) component.
  • the byproduct may include a separated residual noise, wherein the Q-factor may be a ratio of a pulse's center frequency to its bandwidth, which is discussed further below.
  • this first step for decomposing the noisy signal may not remove all of the noise. Therefore, the method may include a second step for de-noising using information from the products obtained from the first step.
  • a method for noise reduction may comprise three different stages: (1) noise level classification, (2) signal decomposition and initial de-noising, and (3) spectral cleaning and reconstitution.
  • the first stage classifies the noise level of the noisy speech.
  • the second stage decomposes the noisy speech into two components and performs a preliminary denoising of the signal. This is achieved using two Tunable Q-factor Wavelet Transforms (TQWTs) and a sparsity-based signal decomposition algorithm, Basis Pursuit De-noising (BPD).
  • the wavelet parameters in the second stage will be set based on the results of the classification stage.
  • the output of the second stage will consist of three components.
  • the third stage further denoises the HQF and LQF components and then recombines them to produce the final de-noised output.
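  • By way of illustration, the following Python sketch mirrors this three-stage flow. It is a minimal sketch only: the noise classifier is reduced to a single SNR threshold, and a Butterworth low/high split stands in for the TQWT/BPD decomposition and for the LQF-guided spectral cleaning, which are detailed in the remainder of this disclosure.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def decompose_stub(y, fs):
    """Stand-in for stage 2: a simple low/high split used purely for structure;
    the disclosed method uses two TQWTs plus basis pursuit de-noising instead."""
    sos = butter(4, 1000.0, btype="low", fs=fs, output="sos")
    lqf = sosfilt(sos, y)
    return lqf, y - lqf

def denoise_frame(y, fs, snr_db):
    if snr_db >= 12.0:                  # stage 1: clean enough, leave untouched
        return y
    lqf, hqf = decompose_stub(y, fs)    # stage 2: decomposition (+ initial de-noising)
    hqf = 0.5 * hqf                     # stage 3 placeholder: HQF cleaned using the LQF
    return lqf + hqf                    # recombine to form the final output

y = np.random.default_rng(0).standard_normal(16000)
out = denoise_frame(y, fs=16000.0, snr_db=4.0)
```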
  • Fig. 1a illustrates an exemplary method 100 for noise reduction, in particular, multi-talker babble noise reduction in a cochlear implant.
  • the method may be used to improve recognition and intelligibility of speech to patients in need of hearing assistance.
  • Any suitable cochlear implant may be used with exemplary method 100.
  • the cochlear implant may detect an audio signal and restore a deaf or partially deaf individual's ability to hear by providing an electrical stimulation to the auditory nerve corresponding to the audio signal.
  • the input audio signal may be noisy such that it cannot be recognized or discerned by the user. Therefore, the input signal may be further processed, e.g., filtered, to improve clarity and/or intelligibility of speech to the patient.
  • a rough determination of the noise level in the input signal may be made before starting a de-noising process. In addition, the estimated level of noise present may be utilized to set wavelet and optimization parameters for subsequent de-noising of the input signal.
  • the input audio signal may be a continuous audio signal and may be broken down into predetermined segments and/or frames for processing by the exemplary method 100.
  • the input signal may include non-steady noise where the level of noise, e.g., signal to noise ratio, may change over time.
  • the signal may be separated into a plurality of frames of input signal, where each frame may be individually analyzed and/or de-noised, such as, for example, processing each individual frame using the exemplary method 100.
  • the input signal may be divided into the plurality of frames by any suitable means.
  • the exemplary method 100 may be continuously applied to each successive frame of the input signal for analysis and/or de-noising.
  • the input audio signal may be obtained and each frame of the input audio signal may be processed by the exemplary method 100 in real-time or substantially real-time, meaning within a time frame that is negligible or imperceptible to a user, for example, within less than 3 seconds, less than 1 second, or less than 0.5 seconds.
  • an input signal or a frame of an input signal may be obtained and analyzed to determine and/or estimate a level of noise present in the signal. Based on a level or an estimated level of noise present, the input signal or frame of input signal may be categorized into one of three categories: (I) the signal is either not noisy or has a negligible amount of noise 104; (II) the signal is mildly noisy 106; or (III) the signal is highly noisy 108.
  • Step 102 may estimate the noise level in an input signal or a frame of an input signal using any suitable methods, such as, for example, methods for determining and/or estimating a signal to noise ratio (SNR), which may be adjusted to estimate the noise level in a variety of noise conditions.
  • SNR signal to noise ratio
  • Any suitable SNR method may be used and may include, for example, those methods described in Hmam, H., "Approximating the SNR Value in Detection Problems," IEEE Trans. on Aerospace and Electronic Systems, vol. 39, no. 4 (2003); and Xu, H., Wei, G., & Zhu, J., "A Novel SNR Estimation Algorithm for OFDM," Vehicular Technology Conference.
  • the noise level of an input signal or a frame of an input signal may be estimated by measuring a frequency and depth of modulations in the signal, or by analyzing a portion of the input signal in silent segments in speech gaps. It is noted that step 102 may determine an SNR for an input signal or a frame of an input signal, but may alternatively provide merely an estimate, even a rough estimate, of its SNR.
  • the SNR or estimated SNR may be used to categorize the input signal or a frame of the input signal into the three different categories 104, 106, and 108. For example, Category I is for a signal that is either not noisy or includes negligible amounts of noise 104.
  • this first category 104 may include, for example, those input signals or frames of input signals that have or are estimated to have an SNR that is greater than 12 dB (SNR > 12 dB), or greater than or equal to 12 dB (SNR ≥ 12 dB).
  • the second category 106 may include, for example, those input signals or frames of input signals that have or are estimated to have an SNR that is greater than 5 dB and less than 12 dB (5 dB < SNR < 12 dB), or greater than or equal to 5 dB and less than or equal to 12 dB (5 dB ≤ SNR ≤ 12 dB).
  • the third category 108 may include, for example, those input signals or frames of input signals that have or are estimated to have an SNR that is less than 5 dB (SNR < 5 dB), or less than or equal to 5 dB (SNR ≤ 5 dB).
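  • A minimal sketch of this three-way categorization, using the thresholds quoted above (the boundary handling at exactly 5 dB and 12 dB is a design choice left open by the text):

```python
def categorize_snr(snr_db: float) -> str:
    """Map an (estimated) SNR in dB to the three categories of step 102."""
    if snr_db >= 12.0:
        return "I: not noisy / negligible noise"   # no de-noising applied
    if snr_db >= 5.0:
        return "II: mildly noisy"                  # mild de-noising
    return "III: highly noisy"                     # aggressive de-noising

print(categorize_snr(14.2))   # -> "I: not noisy / negligible noise"
```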
  • This first step 102 does not depend highly on the accuracy of the noise level estimation, e.g., the SNR estimation provided. Rather, for input signals having SNR values on or near the threshold values of 5 dB and 12 dB, categorization of such an input signal in either of the bordering categories is not expected to significantly alter the outcome of the exemplary de-noising method 100 of Fig. 1a. Therefore, estimated SNR values may be sufficient for the first step 102. In certain exemplary embodiments, estimated SNR values may be determined using a more efficient process, e.g., a method that requires less computational resources and/or time, such as by a process that requires fewer iterative steps.
  • the ratio r(s, τ(s)) may be defined as:

$$ r(s, \tau(s)) = \frac{\mathrm{rms}\big(HT(s, \tau(s))\big)}{\mathrm{rms}(s)}, \qquad \mathrm{rms}(s) = \sqrt{\tfrac{1}{N}\big(s_1^2 + s_2^2 + \cdots + s_N^2\big)} $$

  • the term HT(s, τ(s)) refers to the signal s after hard thresholding with respect to τ(s).
  • the term τ(s) may be defined such that, for speech samples that are mixed with multi-talker babble, the value of r(s, τ(s)) varies little from signal to signal for samples having a constant signal-to-noise ratio (SNR).
  • An input signal s with an unknown SNR may then be categorized into one of the three different categories 104, 106, and 108 by comparing r(s, τ(s)) against predetermined thresholds.
  • this exemplary SNR estimation method in the first step 102 need not provide accurate estimates of SNR. Rather, it serves to categorize the input signals or frames of input signals into various starting categories prior to further analysis and/or de-noising of the input signals or frames of input signals.
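  • A minimal sketch of the ratio computation defined above; the choice of the threshold τ(s) as a multiple of the signal rms is an illustrative assumption:

```python
import numpy as np

def rms(s):
    return np.sqrt(np.mean(np.square(s)))

def hard_threshold(s, tau):
    """HT(s, tau): zero every sample whose magnitude is below tau."""
    out = s.copy()
    out[np.abs(out) < tau] = 0.0
    return out

def r_ratio(s, tau):
    """r(s, tau) = rms(HT(s, tau)) / rms(s)."""
    return rms(hard_threshold(s, tau)) / rms(s)

# Example: for a fixed thresholding rule, noisier mixtures tend to shift r.
rng = np.random.default_rng(0)
s = rng.standard_normal(16000)
print(r_ratio(s, tau=0.5 * rms(s)))
```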
  • This pre-processing categorization in step 102 is particularly beneficial for input signals or frames of input signals containing multi-talker babble.
  • this first step 102 may utilize any suitable method to categorize the input signals or frames of input signals into a plurality of categories, each having a different noise level. More particularly, the first step 102 may encompass any fast and efficient method for categorizing the input signals or frames of input signals into a plurality of categories having different noise levels.
  • input signals or frames of input signals that fall within the first category 104 do not contain substantial amounts of noise. Therefore, these input signals or frames of input signals are sufficiently clean that de-noising is unnecessary.
  • the intelligibility of input signals in this first category 104 may be relatively high, therefore further de-noising of the signal may introduce distortion and/or lead to no significant intelligibility improvement. Accordingly, if the input signal or frame of input signal is determined to fall within the first category 104, the method 100 terminates without modification to the input signal or the frame of the input signal.
  • Input signals or frames of input signals that fall within the second category 106 may be de-noised in a less aggressive manner as compared to noisier signals. For input signals or frames of input signals in the second category 106, the priority is to avoid de-noising distortion rather than to remove as much noise as possible.
  • Input signals or frames of input signals that fall within the third category 108 may not be very intelligible to a CI user, and may not be intelligible at all to an average CI user. For input signals or frames of input signals in the third category 108, distortion is less of a concern compared to intelligibility. Therefore, a more aggressive de-noising of the input signal or frame of input signal may be performed on input signals of the third category 108 to increase the amount of noise removed while gaining improvements in signal intelligibility to the CI user.
  • input signals or frames of input signals that fall within either the second category 106 or the third category 108 may be further processed in step 110.
  • the input signals or frames of input signals may be decomposed into at least two components: (I) a first component 112 that exhibits no or low amounts of sustained oscillatory behavior; and (II) a second component 114 that exhibits high sustained oscillatory behavior.
  • Step 110 may optionally decompose the input signals or frames of input signals to include a third component: (III) a residual component 116 that does not fall within either component 112 or 114.
  • Step 110 may decompose the input signals or frames of input signals using any suitable methods, such as, for example, separating the signals into components having different Q-factors.
  • the Q-factor of a pulse may be defined as the ratio of its center frequency to its bandwidth, as shown in the formula below:

$$ Q = \frac{f_c}{\mathrm{BW}} $$
  • the first component 112 may correspond to a low Q-factor component and the second component 114 may correspond to a high Q-factor component.
  • the second component 114, which corresponds to a high Q-factor component, may exhibit more sustained oscillatory behavior than the first component 112, which corresponds to a low Q-factor component.
  • Suitable methods for decomposing the input signals or frames of input signals may include a sparse optimization wavelet method.
  • the sparse optimization wavelet method may decompose the input signals or frames of input signals and may also provide preliminary de- noising of the input signals or frames of input signals.
  • the sparse optimization wavelet method may utilize any suitable wavelet transform to provide a sparse representation of the input signals or frames of input signals.
  • One exemplary wavelet transform that may be utilized with a sparse optimization wavelet for decomposing the input signals or frames of input signals in step 110 may include a Tunable Q-Factor Wavelet Transform (TQWT).
  • TQWT Tunable Q-Factor Wavelet Transform
  • the TQWT may be determined based on a Q-factor, a redundancy rate and a number of stages (or levels) utilized in the sparse optimization wavelet method, each of which may be independently adjustable within the method.
  • the Q-factor may be adjusted such that the oscillatory behavior of the TQWT wavelet matches that of the input signals or frames of input signals.
  • Redundancy rate in a wavelet transform, e.g., a TQWT, may refer to the total over-sampling rate of the transform. The redundancy rate must always be greater than 1. Because the TQWT is an over-sampled wavelet transform, a given signal does not correspond to a unique set of wavelet coefficients. In other words, an inverse TQWT applied to two different sets of wavelet coefficients may yield the same signal.
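  • For orientation, the standard TQWT parameter relations from Selesnick's TQWT paper connect the Q-factor Q, redundancy r, and signal length N to the internal scaling factors and the feasible number of levels; the sketch below uses those published relations, not parameter values from this disclosure:

```python
import math

def tqwt_params(Q: float, r: float, N: int, fs: float):
    """Scaling factors, maximum level count, and approximate subband center
    frequencies for a TQWT with Q-factor Q and redundancy r (r > 1)."""
    beta = 2.0 / (Q + 1.0)             # high-pass frequency scaling factor
    alpha = 1.0 - beta / r             # low-pass frequency scaling factor
    j_max = int(math.floor(math.log(beta * N / 8.0) / math.log(1.0 / alpha)))
    fc = [alpha**j * (2.0 - beta) / (4.0 * alpha) * fs for j in range(1, j_max + 1)]
    return alpha, beta, j_max, fc

alpha, beta, j_max, fc = tqwt_params(Q=4.0, r=3.0, N=2**15, fs=16000.0)
print(j_max, [round(f) for f in fc[:3]])
```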
  • Step 110 may also provide preliminary de-noising of the input signals or frames of input signals.
  • the preliminary de-noising may be performed by a sparsity-based de-noising method, such as, for example, a sparse optimization wavelet method.
  • a sparse representation of the input signals or frames of input signals may be provided by any suitable wavelet, in particular a TQWT.
  • an optimal sparse representation of the input signals or frames of input signals may be obtained.
  • Such an optimal sparse representation may provide improved performance for related sparsity-based methods such as signal decomposition and/or de-noising.
  • a Basis Pursuit De-noising (BPD) method may be used.
  • each input signal or frame of input signal may be represented using two different components having two different Q-factors.
  • Suitable methods for decomposing the input signals or frames of input signals in step 110 may also include, for example, a Morphological Component Analysis (MCA) method.
  • MCA Morphological Component Analysis
  • the input signal or frame of input signal y may be decomposed into three components: (I) a first component 112 having a low Q-factor, $y_L$, which does not exhibit sustained oscillatory behavior; (II) a second component 114 having a high Q-factor, $y_H$, which exhibits sustained oscillatory behavior; and (III) a residual component 116, n, which includes noise and stochastic unstructured signals that cannot be sparsely represented by either of the two wavelet transforms of the first and second components 112 and 114.
  • the input signal y may be represented as follows:

$$ y = y_L + y_H + n $$
  • the decomposition of the input signal y may be a nonlinear decomposition, which cannot be achieved by any linear decomposition method in the time or frequency domain. Therefore, an MCA method may be used to obtain a sparse representation of both the first and second components 112, 114, where the optimal wavelet coefficients may be obtained using a constrained optimization method using the following formula:

$$ \{w_1^*, w_2^*\} = \arg\min_{w_1, w_2} \; \|y - \Phi_1 w_1 - \Phi_2 w_2\|_2^2 + \lambda_1 \|w_1\|_1 + \lambda_2 \|w_2\|_1 $$

where $\Phi_1$ and $\Phi_2$ denote the low and high Q-factor TQWTs, respectively.
  • $w_1$ and $w_2$ are the wavelet coefficients, across the different subbands, for the two transforms.
  • the first and second components 112 and 114, represented by $y_L$ and $y_H$, may be obtained as follows (see the solver sketch below):

$$ y_L = \Phi_1 w_1^*, \qquad y_H = \Phi_2 w_2^* $$
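  • The sketch below solves the two-dictionary BPD objective above with plain ISTA (iterative soft thresholding). As stand-ins for the two TQWTs it uses the identity (spiky, "low Q") and an orthonormal DCT (oscillatory, "high Q"), so it illustrates the optimization, not the patent's exact transforms:

```python
import numpy as np
from scipy.fft import dct, idct

def soft(w, t):
    """Soft thresholding: the proximal operator of the l1 penalty."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def bpd_two_dicts(y, lam1, lam2, n_iter=200):
    """ISTA for min ||y - Phi1 w1 - Phi2 w2||_2^2 + lam1||w1||_1 + lam2||w2||_1,
    with Phi1 = identity and Phi2 = orthonormal inverse DCT."""
    w1 = np.zeros_like(y)
    w2 = np.zeros_like(y)
    mu = 0.25                       # 1/L for two orthonormal dictionaries
    for _ in range(n_iter):
        resid = y - w1 - idct(w2, norm="ortho")    # y - Phi1 w1 - Phi2 w2
        w1 = soft(w1 + 2 * mu * resid, mu * lam1)
        w2 = soft(w2 + 2 * mu * dct(resid, norm="ortho"), mu * lam2)
    return w1, idct(w2, norm="ortho")              # y_L-like, y_H-like components

rng = np.random.default_rng(1)
n = 512
y = np.cos(2 * np.pi * 40 * np.arange(n) / n)      # sustained oscillation
y[100] += 3.0                                      # transient (low-Q) event
y += 0.1 * rng.standard_normal(n)                  # additive noise
y_l, y_h = bpd_two_dicts(y, lam1=0.5, lam2=0.2)
```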
  • the wavelet and optimization parameters may also be selected such that the first and second components 112, 114 are also preliminarily de-noised using a BPD method. In particular, the wavelet and optimization parameters may be selected such that the following conditions are met:
  • the first component 112, which is the Low Q-factor (LQF) component, has significantly lower energy than the second component 114, which is the High Q-factor (HQF) component;
  • the LQF component is de-noised more aggressively, and consequently may be more distorted; and
  • the HQF would be de-noised more mildly to reduce the amount of distortion.
  • the input signal or frame of input signal may be decomposed based on the Q-factors of the different components; input signals or frames of input signals that share similar frequency content may nonetheless correspond to different Q-factors.
  • the second component 114 may be further de-noised using the first component 112 or data generated based on the first component 112. As explained further below, the TSP of the first component 112 is expected to more closely resemble that of a clean speech signal, as compared to the second component 114. Therefore, the first component 112 may be used to further de-noise the second component 114, particularly using the TSP of the first component.
  • a clean audio signal that is not noisy may be represented by x.
  • for a clean signal, BPD is not necessary for de-noising. Therefore, decomposition of a clean input signal x may correspond to a sparse representation of two components, where $x_L$ and $x_H$ may be obtained using a constrained optimization method using the following formula:

$$ \{w_1^*, w_2^*\} = \arg\min_{w_1, w_2} \; \lambda_1 \|w_1\|_1 + \lambda_2 \|w_2\|_1 \quad \text{subject to} \quad x = \Phi_1 w_1 + \Phi_2 w_2 $$
  • Both the noisy input signal or frame of input signal y and the clean input signal x may be decomposed into LQF and HQF components as follows:

$$ y = y_L + y_H + n, \qquad x = x_L + x_H $$
  • the TSP of the LQF component $y_L$ is expected to be similar to the TSP of the LQF component $x_L$ of the clean speech signal. This similarity is particularly notable at lower frequencies, where speech fundamental frequencies are often located. Therefore, the concentrations of energy in both spectrograms are expected to follow a similar shared pattern. Gaps and speech pauses are also expected to be located in the same areas of the spectrograms and time domain graphs in both cases.
  • 'gaps' refers to empty or low energy areas in the low frequency parts of the spectrograms, or very low amplitude pauses in time domain graphs.
  • the HQF component $y_H$, which is de-noised less aggressively in step 110, is expected to be noisier and, therefore, less similar to the HQF component $x_H$ of the clean speech. Contrary to the LQF components $y_L$ and $x_L$ discussed above, where gaps can be seen in both the noisy and clean spectrograms, all low frequency gaps which can be identified in the clean signal's HQF component may be filled, typically completely filled, by noise in the HQF component $y_H$ of the input signal or frame of input signal. Although it may include more noise, the HQF component $y_H$ is expected to be less distorted, which is particularly crucial for good intelligibility to a patient.
  • because the LQF and HQF components of the clean speech are also expected to have roughly similar TSPs (at least the gaps at low frequencies in their spectrograms are roughly in the same areas), it is expected that the TSP of the HQF component $x_H$ of the clean speech also bears some similarity to the TSP of the LQF component $y_L$ obtained from the noisy input signal. This resemblance may be more pronounced in time domain graphs. The low frequency gaps in the time domain graphs may also be similar, at least compared to the noisy HQF component $y_H$.
  • In step 118, the input signal or frame of input signal y should be de-noised such that it becomes as similar as possible to the clean speech x without causing too much distortion.
  • the LQF components of clean speech and noisy speech are already similar, and therefore, only the HQF component of the noisy input signal needs to be further modified (e.g., de-noised) so that it more closely resembles the HQF component of the clean speech ($x_H$).
  • the second component 114 may be further de-noised and may be represented by $\hat{y}_H$, which corresponds to a modified version of $y_H$ having a TSP that is similar to the TSP of $x_H$, which may be represented as follows:

$$ \mathrm{TSP}(\hat{y}_H) \approx \mathrm{TSP}(x_H) $$
  • the first component 112 may correspond to $y_L$ and the second component 114 may correspond to $y_H$ in the formula shown above. Because $\mathrm{TSP}(y_L)$ is expected to be similar to $\mathrm{TSP}(x_L)$, and in the absence of a priori knowledge of $x_H$, the TSP of $y_H$ may be modified so that the modified version $\hat{y}_H$ satisfies the following condition:

$$ \mathrm{TSP}(\hat{y}_H) \approx \mathrm{TSP}(y_L) $$
  • the further de-noised $\hat{y}_H$ may then be determined on this basis.
  • step 118 may include a method which modifies the spectrogram of the second component 114, $y_H$, into a modified version of the second component, $\hat{y}_H$.
  • the method may preferably introduce the least possible amount of distortion to the resulting output, and/or may provide processing of input signals in real-time or substantially real-time as to be useful in applications such as cochlear implant devices.
  • the method for modifying the spectrogram of the second component 114, $y_H$, into the modified version $\hat{y}_H$ may include point-wise multiplication in the Fourier transform domain of non-overlapping frames of the input signal.
  • each frame of the input signal may be represented as $Y_t \in \mathbb{R}^N$, wherein N corresponds to the length of the frame.
  • the t-th non-overlapping frames of the components $y_L$ and $y_H$ may be denoted $Y_t^L$ and $Y_t^H$, respectively.
  • a Discrete Fourier Transform (DFT) may be determined for each of the above components as follows:

$$ F_t^L = \mathrm{DFT}(Y_t^L), \qquad F_t^H = \mathrm{DFT}(Y_t^H) $$
  • Each point in $F_t^L$ and $F_t^H$ may be categorized as one of the following:
  • $C_{VH}$, $C_H$, $C_L$, and $C_{VL}$ represent four different categories corresponding to: very high energy, high energy, low energy, and very low energy.
  • the above categorization may be performed using a threshold-based quantification method.
  • the TSP of $y_H$ is expected to be similar to the TSP of $y_L$ after removing the noise. Therefore, if a point demonstrates high or very high energy in $F_t^H$ but demonstrates low or very low energy in $F_t^L$, its energy in $F_t^H$ is believed to most likely be coming from a noise source and must then be attenuated.
  • each point in $F_t^H$ may be compared with its counterpart in $F_t^L$, and a reduction gain $g_r$ may be applied to the high or very high energy points in $F_t^H$ having low or very low energy counterparts in $F_t^L$, which may be represented in the following formula:

$$ \hat{F}_t^H(k) = g_r \cdot F_t^H(k) \quad \text{for each such point } k $$
  • a reduction gain may likewise be applied to low or very low energy points in $F_t^H$.
  • an inverse Discrete Fourier Transform may be applied to obtain the modified version $\hat{y}_H$ of the second component of the input signal, as follows:

$$ \hat{Y}_t^H = \mathrm{IDFT}(\hat{F}_t^H) $$
  • the first component 112 and the further filtered second component, where the second component 114 is filtered using the first component 112, may be combined to generate a filtered signal that may be outputted for use in a cochlear implant.
  • the further filtered second component $\hat{y}_H$ may be combined with the first component to create an output signal $y_o$, as follows:

$$ y_o = y_L + \hat{y}_H $$
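  • A compact sketch of this frame-wise spectral modification follows. The four energy categories are formed here with quartile thresholds and a single illustrative reduction gain, since the text leaves the exact quantification thresholds and gain values open:

```python
import numpy as np

def categorize(mag, q1, q2, q3):
    """0 = very low, 1 = low, 2 = high, 3 = very high energy."""
    return np.digitize(mag, [q1, q2, q3])

def clean_hqf(y_l, y_h, frame_len=512, gain=0.2):
    """Attenuate DFT points of the HQF component that are energetic where the
    LQF component is not (step 118), frame by non-overlapping frame."""
    y_h_mod = np.empty_like(y_h)
    tail = (len(y_h) // frame_len) * frame_len
    for start in range(0, tail, frame_len):
        Fl = np.fft.rfft(y_l[start:start + frame_len])
        Fh = np.fft.rfft(y_h[start:start + frame_len])
        ml, mh = np.abs(Fl), np.abs(Fh)
        cl = categorize(ml, *np.quantile(ml, [0.25, 0.5, 0.75]))
        ch = categorize(mh, *np.quantile(mh, [0.25, 0.5, 0.75]))
        mask = (ch >= 2) & (cl <= 1)      # energetic in HQF, quiet in LQF: likely noise
        Fh[mask] *= gain                  # apply the reduction gain
        y_h_mod[start:start + frame_len] = np.fft.irfft(Fh, n=frame_len)
    y_h_mod[tail:] = y_h[tail:]           # pass any leftover samples through
    return y_h_mod

rng = np.random.default_rng(2)
y_l, y_h = rng.standard_normal(2048), rng.standard_normal(2048)
y_o = y_l + clean_hqf(y_l, y_h)           # output: y_o = y_L + modified y_H
```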
  • Fig. 1b provides an alternative exemplary embodiment of a method 150 for noise reduction, in particular, multi-talker babble noise reduction in a cochlear implant.
  • the alternative exemplary embodiment of method 150 shown in Fig. 1b is substantially similar to the method 100 described with respect to Fig. 1a as discussed above. Differences between the two exemplary methods 100 and 150 are further detailed below.
  • an input signal or a frame of an input signal may be obtained and analyzed to determine and/or estimate a level of noise present in the signal. Based on a level or an estimated level of noise present, the input signal or frame of input signal may be categorized into one of two categories: (I) the signal is mildly noisy 154; or (II) the signal is highly noisy 156.
  • Step 152 may estimate the noise level in an input signal or a frame of an input signal using any suitable methods, such as those described above in reference to step 102 (e.g., methods for determining and/or estimating SNR).
  • the SNR or estimated SNR may be used to categorize the input signal or a frame of the input signal into two, instead of three, different categories 154 and 156.
  • Category I for a signal that is mildly noisy 154.
  • this first category 154 may include, for example, those input signals or frames of input signals that have or are estimated to have an SNR that is greater than 3.5 dB (SNR > 3.5 dB), or greater than or equal to 3.5 dB (SNR ≥ 3.5 dB).
  • the second category 156 may include, for example, those input signals or frames of input signals that have or are estimated to have an SNR that is less than 3.5 dB (SNR < 3.5 dB), or less than or equal to 3.5 dB (SNR ≤ 3.5 dB).
  • the SNR may be estimated using the exemplary SNR detection method described above in reference to step 102.
  • the SNR may be estimated using a different exemplary method. This method may provide a computationally efficient and relatively accurate way to classify the noise level of speech corrupted by multi-talker babble. To keep track of the background noise variation, longer signals may be segmented into shorter frames and each frame may be classified and de-noised separately. The length of each frame should be at least one second to ensure a high classification/de-noising performance.
  • step 152 uses two features which are sensitive to changes of the noise level in speech, easy to extract, and relatively robust across various babble noise conditions (e.g., different numbers of talkers).
  • the first feature is the envelope mean-crossing rate which is defined as the number of times that the envelope crosses its mean over a certain period of time (e.g. , one second).
  • step 152 first needs to extract the envelope of the noisy speech.
  • the envelope can be obtained, for example, as a windowed RMS:

$$ E_m = \sqrt{\tfrac{1}{l_w} \sum_{i=1}^{l_w} \big(w_i \, y_{(m-1)l_h + i}\big)^2} $$

where $l_w$ is the length of the window w and $l_h$ is the hop size.
  • the envelope mean-crossing rate of a noisy signal frame is calculated as follows:

$$ f_1 = \frac{f_s}{N} \sum_{i=1}^{l_E - 1} \frac{\big| \mathrm{sgn}(E_i - M) - \mathrm{sgn}(E_{i+1} - M) \big|}{2} $$

where E, $l_E$, and M are the envelope, its length, and its mean, respectively; N is the length of the frame; $f_s$ is the sampling rate; and $\mathrm{sgn}(x)$ is the sign function defined as:

$$ \mathrm{sgn}(x) = \begin{cases} 1, & x \geq 0 \\ -1, & x < 0 \end{cases} $$
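  • A sketch of both computations, using the windowed-RMS envelope of the previous bullet (the window and hop lengths are illustrative):

```python
import numpy as np

def envelope(y, l_w=256, l_h=128):
    """Windowed-RMS envelope with window length l_w and hop size l_h."""
    w = np.hanning(l_w)
    starts = range(0, len(y) - l_w + 1, l_h)
    return np.array([np.sqrt(np.mean((w * y[s:s + l_w])**2)) for s in starts])

def mean_crossing_rate(y, fs, l_w=256, l_h=128):
    """f1: envelope mean-crossings, normalized to crossings per second."""
    E = envelope(y, l_w, l_h)
    sgn = np.where(E >= E.mean(), 1, -1)           # sign of (E_i - M)
    crossings = np.sum(np.abs(np.diff(sgn)) // 2)  # count sign changes
    return crossings * fs / len(y)

rng = np.random.default_rng(3)
y = rng.standard_normal(16000)
print(mean_crossing_rate(y, fs=16000))
```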
  • the discriminative quality of a feature may be quantified by its Fisher score:

$$ F = \frac{\sum_k n_k (\mu_k - \mu)^2}{\sum_k n_k \sigma_k^2} $$

where $\mu_k$ is the mean of the feature values of the frames in class k, $\mu$ is the overall mean of the feature values, $\sigma_k^2$ is the variance of the feature values in class k, and $n_k$ is the total number of frames in class k.
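  • A direct transcription of that score (the standard Fisher discriminant ratio) for a scalar feature:

```python
import numpy as np

def fisher_score(values, labels):
    """F = sum_k n_k (mu_k - mu)^2 / sum_k n_k sigma_k^2 for a scalar feature."""
    values, labels = np.asarray(values, float), np.asarray(labels)
    mu = values.mean()
    num = den = 0.0
    for k in np.unique(labels):
        v_k = values[labels == k]
        num += len(v_k) * (v_k.mean() - mu) ** 2   # between-class scatter
        den += len(v_k) * v_k.var()                # within-class scatter
    return num / den

# two well-separated classes give a high score
print(fisher_score([0.1, 0.2, 0.9, 1.0], [0, 0, 1, 1]))
```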
  • this feature's Fisher score may be calculated for 10,000 labeled noisy speech frames corrupted with randomly created multi-talker babble.
  • the second feature is post-thresholding to pre-thresholding RMS ratio.
  • fc £
  • Post-thresholding to pre-thresholding RMS ratio is calculated as follows:
  • variable which determines the quality of this feature is K and this feature may be optimized by finding the value of which maximizes the Fischer score for this feature: argtu x sem
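  • A sketch of the feature and the κ grid search; it reuses fisher_score() from the sketch above, and scaling the threshold by the frame rms is an assumption:

```python
import numpy as np

def f2(y, kappa):
    """Post- to pre-thresholding RMS ratio with threshold kappa * rms(y)."""
    r = np.sqrt(np.mean(y**2))
    kept = np.where(np.abs(y) >= kappa * r, y, 0.0)   # hard thresholding HT(y, .)
    return np.sqrt(np.mean(kept**2)) / r

def best_kappa(frames, labels, grid=np.linspace(0.1, 3.0, 30)):
    """kappa* = argmax_kappa F(kappa) over a labeled training set."""
    scores = [fisher_score([f2(y, k) for y in frames], labels) for k in grid]
    return grid[int(np.argmax(scores))]
```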
  • the distribution of the resulting feature vectors F may be modeled using a Gaussian Mixture Model (GMM):

$$ p(F \mid \lambda) = \sum_{i=1}^{M} \alpha_i \, \mathcal{N}(F; \mu_i, \Sigma_i), \qquad \sum_{i=1}^{M} \alpha_i = 1 $$

where $\alpha_i$ is the weight factor, $\mu_i$ is the mean, and $\Sigma_i$ is the covariance of the i-th Gaussian distribution.
  • a Gaussian distribution $\mathcal{N}(F; \mu, \Sigma)$ can be written as:

$$ \mathcal{N}(F; \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\!\Big(-\tfrac{1}{2}(F - \mu)^{T} \Sigma^{-1} (F - \mu)\Big) $$

where d is the dimension of the feature vector F.
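  • Evaluating the mixture density is a one-liner with scipy; the weights, means, and covariances below are placeholders:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_pdf(F, weights, means, covs):
    """p(F | lambda) = sum_i alpha_i N(F; mu_i, Sigma_i), sum alpha_i = 1."""
    return sum(a * multivariate_normal.pdf(F, mean=m, cov=c)
               for a, m, c in zip(weights, means, covs))

F = np.array([0.3, 0.8])                  # feature vector [f1, f2]
weights = [0.6, 0.4]
means = [np.zeros(2), np.ones(2)]
covs = [np.eye(2), 0.5 * np.eye(2)]
print(gmm_pdf(F, weights, means, covs))
```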
  • step 152 also does not depend highly on the accuracy of the noise level estimation, e.g., the SNR estimation provided. Rather, for input signals having SNR values on or near the threshold value of 3.5 dB, categorization of such an input signal in either of the categories is not expected to significantly alter the outcome of the exemplary de-noising method 150 of Fig. 1b. Therefore, estimated SNR values may also be sufficient for step 152. In certain exemplary embodiments, estimated SNR values may be determined using a more efficient process, e.g., a method that requires less computational resources and/or time, such as by a process that requires fewer iterative steps.
  • input signals or frames of input signals that fall within the first category 154 may be de-noised in a less aggressive manner as compared to noisier signals.
  • the priority is to avoid de-noising distortion rather than to remove as much noise as possible.
  • the data samples in each of the two categories may be divided into two clusters, and each cluster may be modeled by a Gaussian model. In order to train the model, the Expectation-Maximization (EM) algorithm may be used.
  • $\alpha_1, \alpha_2, \mu_1, \mu_2, \Sigma_1, \Sigma_2$ are the GMM parameters of class 1, and $\alpha_3, \alpha_4, \mu_3, \mu_4, \Sigma_3, \Sigma_4$ are the GMM parameters of class 2.
  • the method 150 has already obtained the values of these parameters from the EM method.
  • using maximum a posteriori (MAP) classification, for each noisy speech sample with feature vector F, the two class probabilities may be obtained and the noisy sample may be classified into the class with the higher probability.
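  • A sketch of this train-then-classify flow using scikit-learn's EM-based GaussianMixture, one two-component mixture per class; the synthetic feature data and the equal class priors are assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Illustrative stand-in data: 2-D feature vectors [f1, f2] per frame.
rng = np.random.default_rng(4)
mild = rng.normal([1.0, 0.8], 0.1, size=(200, 2))   # class 1: mildly noisy
high = rng.normal([0.4, 0.3], 0.1, size=(200, 2))   # class 2: highly noisy

# One two-component GMM per class, trained with EM as described above.
gmm_mild = GaussianMixture(n_components=2, random_state=0).fit(mild)
gmm_high = GaussianMixture(n_components=2, random_state=0).fit(high)

def classify(F):
    """MAP rule with equal priors: pick the class whose GMM scores F higher."""
    return "mild" if gmm_mild.score([F]) > gmm_high.score([F]) else "high"

print(classify([0.9, 0.7]))
```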
  • in step 160, input signals or frames of input signals that fall within either the first category 154 or the second category 156 may be further processed in a similar manner as step 110 described above.
  • the input signals or frames of input signals may be decomposed into at least two components: (I) a first component 162 that exhibits no or low amounts of sustained oscillatory behavior; and (II) a second component 164 that exhibits high sustained oscillatory behavior.
  • Step 160 may optionally decompose the input signals or frames of input signals to include a third component: (III) a residual component 166 that does not fall within either components 162 or 164.
  • Step 160 may decompose the input signals or frames of input signals using any suitable methods, such as, for example, separating the signals into components having different Q-factors.
  • Step 160 may similarly provide preliminary de-noising of the input signals or frames of input signals.
  • the preliminary de-noising may be performed by a sparsity-based de-noising method, such as, for example, a sparse optimization wavelet method.
  • a sparse representation of the input signals or frames of input signals may be provided by any suitable wavelet, in particular a TQWT.
  • an optimal sparse representation of the input signals or frames of input signals may be obtained.
  • Such an optimal sparse representation may provide improved performance for related sparsity-based methods such as signal decomposition and/or de-noising.
  • a Basis Pursuit De-noising (BPD) method may be used.
  • the different HQF and LQF components may be further de-noised (e.g., by spectral cleaning) and subsequently recombined to produce the final de-noised output 170.
  • this further de-noising step 168 may include parameter optimization followed by subsequent spectral cleaning. For example, assuming that the clean speech sample X and its noisy version Y are available, they may each be decomposed into HQF and LQF components.
  • Low and high Q-factors ($Q_1$ and $Q_2$): these two parameters should be selected to match the oscillatory behavior of the speech in order to attain high sparsity and efficient subsequent de-noising.
  • $Q_1$ and $Q_2$ denote the low and high Q-factors, respectively.
  • $Q_2$ must be sufficiently larger than $Q_1$.
  • Choosing close values for $Q_1$ and $Q_2$ will lead to very similar LQF and HQF components and poor sparsification.
  • the regularization parameters $\lambda_1$ and $\lambda_2$ may be adjusted. These two parameters directly influence the effectiveness of de-noising. A larger value for either of them will lead to more aggressive de-noising of its corresponding component. More aggressive de-noising will potentially lead to more noise removal, but usually at the expense of increasing the distortion of the de-noised speech. Choosing suitable values for $\lambda_1$ and $\lambda_2$, which ensure the maximum noise removal with minimum distortion, is crucial for this stage.
  • $\lambda_1$ and $\lambda_2$ may be selected to maximize the similarity between the spectrograms of the clean speech components ($X_L$ and $X_H$) and their de-noised versions ($Y_L$ and $Y_H$).
  • the similarity may be measured with the normalized Manhattan distance applied to the magnitudes of the spectrograms, e.g., here with non-overlapping time frames $2^{16}$ samples long.
  • $M_L$ and $M_H$ may be defined as metrics to measure the STFT similarity between the low and high Q-factor components of the clean and noisy speech samples, respectively, as follows:

$$ M_L = \frac{\sum_{i,j} \big|\, |S_{X_L}(i,j)| - |S_{Y_L}(i,j)| \,\big|}{\sum_{i,j} \big| S_{X_L}(i,j) \big|}, \qquad M_H = \frac{\sum_{i,j} \big|\, |S_{X_H}(i,j)| - |S_{Y_H}(i,j)| \,\big|}{\sum_{i,j} \big| S_{X_H}(i,j) \big|} $$
  • the STFT matrix is denoted by S, and the component it belongs to is indicated by its subscript.
  • the weighted normalized Manhattan distance may be defined as follows:

$$ M_w = \alpha M_L + \beta M_H $$
  • the weighting factors α and β are selected based on the $\ell_2$-norms of their corresponding components, for example as follows:

$$ \alpha = \frac{\|X_L\|_2}{\|X_L\|_2 + \|X_H\|_2}, \qquad \beta = \frac{\|X_H\|_2}{\|X_L\|_2 + \|X_H\|_2} $$
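  • A sketch of both metrics on precomputed STFT magnitude matrices; the normalization and the energy-based weights follow the reconstruction above:

```python
import numpy as np

def norm_manhattan(S_clean, S_denoised):
    """Normalized Manhattan distance between spectrogram magnitudes."""
    A, B = np.abs(S_clean), np.abs(S_denoised)
    return np.sum(np.abs(A - B)) / np.sum(A)

def weighted_distance(Sxl, Syl, Sxh, Syh):
    """M_w = alpha * M_L + beta * M_H with energy-based weights."""
    ml = norm_manhattan(Sxl, Syl)
    mh = norm_manhattan(Sxh, Syh)
    el = np.linalg.norm(np.abs(Sxl))     # l2 energy of the clean LQF spectrogram
    eh = np.linalg.norm(np.abs(Sxh))     # l2 energy of the clean HQF spectrogram
    alpha, beta = el / (el + eh), eh / (el + eh)
    return alpha * ml + beta * mh
```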
  • de-noised LQF and HQF components may be obtained. Nevertheless, the spectrograms of these components exhibit some noise still remaining in the optimally de-noised components $Y_L$ and $Y_H$.
  • Low magnitude 'gaps' in the spectrograms of the clean speech components $X_L$ and $X_H$ may be completely filled with noise in their de-noised versions (i.e., $Y_L$ and $Y_H$). Here, 'gaps' refers to low magnitude pockets surrounded by high magnitude areas. These low magnitude gaps are more distinctly visible at lower frequencies (i.e., frequencies between 0 and 2000 Hz), where most of the speech signal's energy exists.
  • a Gap Binary Pattern (GBP) may be defined as an $N_{Fb} \times N_{tf}$ binary matrix, where $N_{Fb}$ is the number of frequency bins and $N_{tf}$ is the number of time frames, in which a value of 1 marks a low magnitude (gap) tile and a value of 0 marks a high magnitude tile.
  • step 168 can potentially remove significant residual noise from $Y_L$ and $Y_H$. If a low amplitude tile in the clean speech components $X_L$ and $X_H$ is categorized as high amplitude in the de-noised components $Y_L$ and $Y_H$, step 168 can conclude that this extra boost in the tile's energy is likely to have originated from the noise and can be attenuated by a reduction gain. Because in reality the clean speech components $X_L$ and $X_H$ are not readily available, the goal is to find aggressively de-noised low and high Q-factor components (denoted by $Y'_L$ and $Y'_H$) with gap locations (at lower frequencies) similar to those of the clean speech components $X_L$ and $X_H$.
  • the similarity of gap patterns may be measured with Sorenson's metric, which is designed to measure the similarity between binary matrices with emphasis on ones (i.e., gaps) rather than zeros. Sorenson's metric for two binary matrices $M_1$ and $M_2$ is defined as:

$$ SM(M_1, M_2) = \frac{2C}{N_1 + N_2} $$
  • C is the number of 1-1 matches (positions where both values are 1);
  • $N_1$ is the total number of 1s in the matrix $M_1$;
  • $N_2$ is the total number of 1s in the matrix $M_2$.
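  • This is the Sørensen-Dice coefficient restricted to ones; a minimal implementation:

```python
import numpy as np

def sorenson(M1, M2):
    """SM = 2C / (N1 + N2): similarity of two binary (gap) matrices,
    emphasizing matching ones."""
    M1, M2 = np.asarray(M1, bool), np.asarray(M2, bool)
    C = np.sum(M1 & M2)                  # 1-1 matches
    return 2.0 * C / (M1.sum() + M2.sum())

print(sorenson([[1, 0], [1, 1]], [[1, 0], [0, 1]]))  # -> 0.8
```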
  • $\lambda'_1$ and $Q'_1$, found by maximizing $SM(G_{X_L}, G_{Y'_L})$, are used to generate the aggressively de-noised component $Y'_L$ with gap locations similar to those of $X_L$; $\lambda'_2$ and $Q'_2$, found by maximizing $SM(G_{X_H}, G_{Y'_H})$, are used to find the aggressively de-noised component $Y'_H$ with gap locations similar to those of $X_H$ (here G denotes the GBP of the component in its subscript).
  • because $Y'_L$ and $Y'_H$ have gap patterns more similar to those of $X_L$ and $X_H$ than $Y_L$ and $Y_H$ do, respectively, they can be used as templates to further clean up the optimally de-noised $Y_L$ and $Y_H$.
  • spectral cleaning may be performed on $Y_L$ and $Y_H$ based on the GBPs of the aggressively de-noised $Y'_L$ and $Y'_H$.
  • reduction gains $r_L$ and $r_H$ may be applied to high magnitude tiles in $Y_L$ and $Y_H$ with low magnitude counterparts in $Y'_L$ and $Y'_H$.
  • the spectral cleaning is only performed at lower frequencies (i.e., frequencies between 0 and 2000 Hz).
  • $T_f$ and $T'_f$ are time/frequency tiles in $S_{Y_L}$ and $S_{Y'_L}$ respectively, and the resulting enhanced STFT matrix and its time/frequency tiles are denoted by $\hat{S}_{Y_L}$ and $\hat{T}_f$; likewise, $T_f$ and $T'_f$ are time/frequency tiles in $S_{Y_H}$ and $S_{Y'_H}$ respectively, and the resulting enhanced STFT matrix and its time/frequency tiles are denoted by $\hat{S}_{Y_H}$ and $\hat{T}_f$.
  • the reduction gains are chosen to decrease the normalized average magnitude of the tiles in $S_{Y_L}$ and $S_{Y_H}$ to the level of the normalized average magnitude of the corresponding tiles in $S_{Y'_L}$ and $S_{Y'_H}$.
  • the gaps which were filled by noise in optimally de-noised components may be visible after spectral cleaning.
  • the enhanced low and high Q-factor components $\hat{X}_L$ and $\hat{X}_H$ can be obtained by the inverse short-time Fourier transform of $\hat{S}_{Y_L}$ and $\hat{S}_{Y_H}$, and eventually $\hat{X}$, which is the de-noised estimate of the clean speech X, can be created by re-composition of $\hat{X}_L$ and $\hat{X}_H$ as:

$$ \hat{X} = \hat{X}_L + \hat{X}_H $$
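  • A sketch of the spectral cleaning applied to one component's STFT, using the aggressively de-noised version as the gap template; the quantile thresholds and the gain rule are illustrative readings of the description above:

```python
import numpy as np

def spectral_clean(S, S_aggr, fs, nfft, f_max=2000.0):
    """Attenuate low-frequency tiles of the optimally de-noised STFT S that are
    high-magnitude where the aggressively de-noised template S_aggr has a gap."""
    S = S.copy()
    k_max = int(f_max * nfft / fs)          # spectral cleaning only below 2000 Hz
    A, B = np.abs(S[:k_max]), np.abs(S_aggr[:k_max])
    gaps = B < np.quantile(B, 0.25)         # GBP of the template (1 marks a gap)
    loud = A > np.quantile(A, 0.75)         # high-magnitude tiles in S
    mask = gaps & loud                      # filled gaps: likely residual noise
    if mask.any():
        # reduction gain: bring the masked tiles' normalized average magnitude
        # down to the template's normalized average magnitude
        gain = (B[mask].mean() / B.mean()) / (A[mask].mean() / A.mean())
        S[:k_max][mask] *= np.clip(gain, 0.0, 1.0)
    return S

# usage sketch on random "STFT" magnitudes (rows = frequency bins)
rng = np.random.default_rng(5)
S = rng.rayleigh(1.0, (257, 100))
S2 = rng.rayleigh(0.5, (257, 100))
S_hat = spectral_clean(S, S2, fs=16000, nfft=512)
```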
  • the exemplary embodiments described herein may be implemented in any number of manners, including as a separate software module, as a combination of hardware and software, etc.
  • the exemplary analysis methods may be embodied in one or more programs stored in a non-transitory storage medium and containing lines of code that, when compiled, may be executed by at least one of the plurality of processor cores or a separate processor.
  • a system comprising a plurality of processor cores and a set of instructions executing on the plurality of processor cores may be provided. The set of instructions may be operable to perform the exemplary methods discussed above.
  • the at least one of the plurality of processor cores or a separate processor may be incorporated in or may communicate with any suitable electronic device for receiving audio input signal and/or outputting a modified audio signal, including, for example, an audio processing device, a cochlear implant, a mobile computing device, a smart phone, a computing tablet, a computing device, etc.
  • the exemplary analysis methods described above are discussed in reference to a cochlear implant. It is contemplated that the exemplary analysis methods may be incorporated into any suitable electronic device that may require or benefit from improved audio processing, particularly noise reduction.
  • the exemplary analysis methods may be embodied in an exemplary system 200 as shown in Fig. 2.
  • an exemplary method described herein may be performed entirely or in part, by a processing arrangement 210.
  • Such processing/computing arrangement 210 may be, e.g., entirely or a part of, or include, but not be limited to, a computer/processor that can include, e.g., one or more microprocessors, and use instructions stored on a computer-accessible medium (e.g., RAM, ROM, hard drive, or other storage device).
  • a computer-accessible medium 220 (e.g., as described herein, a storage device such as a hard disk, floppy disk, memory stick, CD-ROM, RAM, ROM, etc., or a collection thereof) may be provided. The computer-accessible medium 220 may be a non-transitory computer-readable medium that can contain executable instructions 230 thereon.
  • a storage arrangement 240 can be provided separately from the computer-accessible medium 220, which can provide the instructions to the processing arrangement 210 so as to configure the processing arrangement to execute certain exemplary procedures, processes and methods, as described herein, for example.
  • System 200 may also include a receiving arrangement for receiving an input audio signal, e.g., an audio receiver or a microphone, and an outputting arrangement for outputting a de-noised audio signal, e.g., a speaker, a telephone, or a smart phone.
  • the input audio signal may be a pre-recorded signal that is subsequently transmitted to the system 200 for processing.
  • an audio signal may be pre-recorded, e.g., a recording having a noisy background, particularly a multi-talker babble background, that may be processed by the system 200 post-hoc.
  • the receiving arrangement and outputting arrangement may be part of the same device, e.g., a cochlear implant, headphones, etc., or separate devices.
  • the system may include a display or output device, an input device such as a keyboard, mouse, touch screen or other input device, and may be connected to additional systems via a logical network.
  • the system 200 may include a smart phone with a receiving arrangement, e.g., a microphone, for detecting speech, such as a conversation from a user.
  • the conversation from the user may be obtained from a noisy environment, particularly where there is multi-talker babble, such as in a crowded area with many others speaking in the background, e.g., in a crowded bar.
  • the input audio signal received by the smart phone may be processed using the exemplary methods described above and a modified signal, e.g., a cleaned audio signal, where a noise portion may be reduced and/or a speech signal may be enhanced, may be transmitted via the smart phone over a communications network to a recipient.
  • the modified signal may provide more intelligible audio such that a smart phone user in a noisy environment may be more easily understood by the recipient, as compared to an unmodified signal.
  • the input audio signal may be received by the smart phone and transmitted to an external processing unit, such as a centralized processing arrangement in a communications network.
  • the centralized processing arrangement may process the input audio signal transmitted by the smart phone using the exemplary methods described above and forward the modified signal to the intended recipient, thereby providing a centralized processing unit for de-noising telephone calls.
  • the input audio signal may be a pre-recorded audio signal received by the system 200 and the input audio signal may be processed using the exemplary methods described above.
  • the system 200 may include a computing device, e.g., a mobile communications device, that includes instructions for processing pre-recorded input audio signals before outputting them to a user.
  • the input audio signal may be received by the system 200 (e.g., a smart phone or other mobile communications device), in real-time, or substantially in real-time from a communications network (e.g., an input audio call from a third party received by a smart phone) and the input audio signal may be processed using the exemplary methods described above.
  • a user of the system 200 may receive a noisy input audio signal from another party, e.g., conversation from the other party, where the other party may be in a noisy environment, particularly where there is multi-talker babble, such as in a crowded area with many others speaking in the background, e.g., in a crowded bar.
  • the input audio signal received via the communications network by the smart phone may be processed using the exemplary methods described above and a modified signal, e.g., a cleaned audio signal, where a noise portion may be reduced and/or a speech signal may be enhanced, may be outputted to the user, for example, as an audible sound, e.g., outputted through a speaker or any other suitable audio output device or component.
  • Logical connections may include a local area network (LAN) and a wide area network (WAN) that are presented here by way of example and not limitation.
  • Such networking environments are commonplace in office-wide or enterprise-wide computer networks, intranets and the Internet and may use a wide variety of different communication protocols.
  • Those skilled in the art can appreciate that such network computing environments can typically encompass many types of computer system configurations, including personal computers, hand-held devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, and the like.
  • Embodiments of the invention may also be practiced in distributed computing environments where tasks are performed by local and remote processing devices that are linked (either by hardwired links, wireless links, or by a combination of hardwired or wireless links) through a communications network.
  • the tasks may be performed by an external device such as a cell-phone for de-noising an input signal and then sending a modified signal from the external device to a CI device via any suitable communications network such as, for example, Bluetooth.
  • program modules may be located in both local and remote memory storage devices.
  • The exemplary embodiment of Fig. 1a, as described above, may be evaluated by measuring a subject's understanding of IEEE standard sentences with and without processing by the exemplary method 100.
  • Sentences may be presented against a background of 6-talker babble using four different signal to noise ratios (0, 3, 6, or 9 dB).
  • IEEE standard sentences (also known as the "1965 Revised List of Phonetically Balanced Sentences," or Harvard Sentences)
  • To test speech intelligibility in noise, two randomly selected sentence sets (20 sentences) may be presented for each of 8 conditions (processed and unprocessed signals at each of the four SNRs).
  • Each intelligibility test in Example I may include 180 sentences in total. Before processing of any audio signals, 18 sets of sentences spoken by a male speaker may be arbitrarily selected from the IEEE standard sentences. In Example I, the selected sentence sets include: 11, 16, 22, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 65, 71 and 72. Before each test, two sentence sets may be selected at random for each condition and two other sentence sets may be selected for the speech-in-quiet test and practice session. Then a list including these 180 sentences in a completely random order may be created. Prior to the test, a practice session with ten sentences, presented at all SNRs, may be used to familiarize the subject with the test.
  • the practice session with the subject may last for 5 to 30 minutes. After the practice session, the subjects may be tested on the various conditions. Sentences may be presented to CI subjects in free field via a single loudspeaker positioned in front of the listener at 65 dBA. Subjects may be tested using their clinically assigned speech processor. Subjects may then be asked to use their normal, everyday volume/sensitivity settings. Performance may be assessed in terms of the percent of correctly identified words in sentences as a function of SNR for each subject. Each sentence may include five keywords and a number of non-keywords. Keywords may be scored 1 and non-keywords may be scored 0.5.
  • MUSHRA (Multiple Stimuli with Hidden Reference and Anchor)
  • Participants may complete a total of 5 MUSHRA evaluations, one for each randomly selected sentence. Trials may be randomized among participants.
  • participants may be presented with a labeled reference (Clean Speech) and ten versions of the same sentence presented in random order. These versions may include a "hidden reference” (i.e., identical to the labeled reference), eight different conditions (two processing conditions in 4 SNRs) and an anchor (Pure 6-talker babble).
  • Participants may be able to listen to each of these versions without limit by pressing a "Play" button or trigger within a user interface. Participants may then be instructed to listen to each stimulus at least once and provide a sound quality rating for each of the ten sentences using a 100-point scale.
  • Participants may move an adjustable slider between 0 and 100, an example of which is shown in Fig. 3.
  • the rating scale may be divided into five equal intervals, which may be delineated by the adjectives very poor (0-20), poor (21-40), fair (41-60), good (61-80), and excellent (81-100). Participants may be requested to rate at least one stimulus in the set a score of "100" (i.e., identical sound quality to the labeled reference).
  • In Example I, as a pilot test, preliminary results were collected with 5 normal-hearing (NH) subjects using eight-channel noise-vocoded signals. As shown in Fig. 4a, the percentage of words correct for each unprocessed signal is shown with an open triangle symbol, and the percentage of words correct for each signal processed using the exemplary method 100 of Fig. 1a is shown with a filled-in circle symbol. Similarly, as shown in Fig. 4b, the MUSHRA score for each unprocessed signal is shown with an open triangle symbol, and the MUSHRA score for each signal processed using the exemplary method 100 of Fig. 1a is shown with a filled-in circle symbol. As shown in Figs. 4a and 4b, for all NH subjects, intelligibility and quality improved.
  • In Example I, for the main test, 7 post-lingually deafened CI subjects, as indicated below in Table 1, were tested. For all subjects, intelligibility in quiet was measured as a reference; its average was 80.81 percent.
  • the exemplary method 100 of Fig. la may provide significant speech understanding improvements in the presence of multi-talker babble noise in the CI listeners.
  • the exemplary method 100 performed notably better for higher signal-to-noise ratios (6 and 9 dB). This could be because of the distortion introduced to the signal by the more aggressive de-noising strategy for lower SNRs (0 and 3 dB).
  • In Example I, subjects with higher performance in quiet also generally performed better. For the subjects with lower performance in quiet (CI05 and CI07), a floor effect may be seen. However, a ceiling effect was not observed in Example I for the subjects with higher performance in quiet.
  • Example II
  • The exemplary embodiment of Fig. 1b, as described above, may be evaluated by measuring a subject's understanding of IEEE standard sentences with and without processing by the exemplary method 150. All babble samples in Example II are randomly created by mixing sentences randomly taken from a pool of standard sentences which contains a total of 2,100 sentences (including IEEE standard sentences with male and female speakers, HINT sentences and SPIN sentences). For each babble sample, the number of talkers was randomized between 5 and 10, and the gender ratio of the talkers was also randomly selected (all female, all male, or a random combination of both); a sketch of this babble-creation and SNR-mixing procedure appears after this list.
  • Fig. 8 shows a Gaussian Mixture Model, trained with the EM method on 100,000 randomly created noisy speech samples with SNRs ranging from -10 dB to 20 dB, as the different speech samples would be classified under step 152.
  • a first set of curves, to the right, represents Gaussian distributions belonging to the class (SNR < 3.5) and a second set of curves, to the left, represents Gaussian distributions belonging to the class (SNR > 3.5).
  • a modified version of a two-fold cross-validation method may be used. First, half of the sentences in the database were used for training and the second half were used to test the classifier. Then, the sentences used for testing and training were switched (the second half of the sentences in the database for training and the first half for testing the classifier). For the classifier, the F accuracy metric is defined as follows:
  • Figure 11 shows that using the selected aggressive de-noising regularization parameters will lead to finding much more accurate gap patterns of the clean speech components.
  • Figure 12 shows the effect of each of initial de-noising and spectral cleaning on the weighted normalized Manhattan distance M, measured on 1000 noisy speech samples corrupted with various randomly created multi-talker babbles. As can be seen, the effect of spectral cleaning decreases with increasing SNR.
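The babble creation and SNR mixing referenced in the Example I and Example II items above might be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code used in the experiments: the `sentence_pool` layout (lists of mono recordings keyed by speaker gender), the per-talker level equalization, and the looping of short sentences with `np.resize` are all assumptions introduced for the sketch.

```python
import random
import numpy as np

def make_babble(sentence_pool, n_samples, rng=random.Random(0)):
    """Create one multi-talker babble sample as described above: 5-10
    talkers, gender mix chosen at random (all female, all male, or mixed)."""
    n_talkers = rng.randint(5, 10)                        # inclusive bounds
    mode = rng.choice(["female", "male", "mixed"])
    babble = np.zeros(n_samples)
    for _ in range(n_talkers):
        gender = rng.choice(["male", "female"]) if mode == "mixed" else mode
        s = np.asarray(rng.choice(sentence_pool[gender]), dtype=float)
        s = np.resize(s, n_samples)                       # loop/trim to length
        babble += s / (np.sqrt(np.mean(s ** 2)) + 1e-12)  # equalize talker levels
    return babble / n_talkers

def mix_at_snr(speech, babble, snr_db):
    """Scale the babble so the mixture has the requested SNR
    (e.g., the 0, 3, 6, and 9 dB conditions of Example I)."""
    p_speech = np.mean(speech ** 2)
    p_babble = np.mean(babble ** 2)
    gain = np.sqrt(p_speech / (p_babble * 10 ** (snr_db / 10.0)))
    return speech + gain * babble
```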

Landscapes

  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Quality & Reliability (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Neurosurgery (AREA)
  • Otolaryngology (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

A system and method for improving intelligibility of speech is provided. The system and method may include obtaining an input audio signal, decomposing the audio signal into a first component having a low or no sustained oscillatory pattern and a second component having a high oscillatory pattern, further de-noising the second component based on data generated from the first component to obtain a modified second component, and outputting an audio signal having reduced noise, the output audio signal comprising the first component in combination with the modified second component.

Description

METHOD AND SYSTEM FOR MULTI-TALKER BABBLE NOISE REDUCTION USING Q-FACTOR BASED SIGNAL DECOMPOSITION
PRIORITY CLAIM
[0001] This application claims priority to U.S. Provisional Patent Application Serial No. 62/297,536 filed February 19, 2016, the entire contents of which is hereby incorporated by reference herein.
GOVERNMENT FUNDING
[0002] This invention was made with U.S. Government support under NIH Grant No. R01-DC12152. The U.S. government has certain rights in the invention.
FIELD OF INVENTION
[0003] The present invention relates generally to a method and a system for noise reduction, such as, for example, in a cochlear implant, a telephone, an electronic communication, etc.
BACKGROUND
[0004] Cochlear implants ("CIs") may restore the ability to hear to deaf or partially deaf individuals by providing electrical stimulation to the auditory nerve via a series of electrodes placed in the cochlea. CIs may successfully enable almost all post-lingually deaf users (i.e., those who lost their hearing after learning speech and language) to gain an auditory understanding of an environment and/or restore hearing to a level suitable for an individual to understand speech without the aid of lipreading.
[0005] One of the key challenges for CI users is to be able to clearly and/or intelligibly understand speech in the context of background noise. Conventional CI devices have been able to aid patients to hear and ascertain speech in a quiet environment, but the performance of such devices quickly degrades in noisy environments. There have been a number of attempts to isolate speech from background noise, e.g., single-channel noise reduction algorithms. Typical single-channel noise reduction algorithms have included applying a gain to the noisy envelopes, pause detection and spectral subtraction, feature extraction and splitting the spectrogram into noise and speech dominated tiles. However, even with these algorithms, speech understanding in the presence of competing talkers (i.e., speech babble noise) remains difficult and additional artifacts are often introduced. Furthermore, mobile communications have created an ever-rising need to be able to clearly and/or intelligibly understand speech while one user may be in a noisy environment. In particular, there is a need for improving speech understanding in telephonic communications, even in the presence of competing talkers (i.e., background speech babble noise).
[0006] Despite good progress in improving speech quality and listening ease, little progress has been made in designing algorithms that can improve speech intelligibility. Conventional methods that have been found to perform well in steady background noise generally do not perform well in non-stationary noise (e.g., multi-talker babble). For example, it is often difficult to accurately estimate the background noise spectrum. Moreover, applying noise removal methods to already noisy signals usually introduces distortion and artifacts (e.g., musical noise) to the original signal, which in many cases leads to almost no significant intelligibility improvement. All these reasons make the improvement of speech intelligibility in the presence of competing talkers a difficult problem.
SUMMARY OF THE INVENTION
[0007] In accordance with the foregoing objectives and others, one embodiment of the present invention provides systems and methods for reducing noise and/or improving intelligibility of an audio signal.
[0008] In one aspect, a method for reducing noise is provided. The method comprises a first step for receiving an input audio signal comprising a speech signal and a noise. In some embodiments, the noise may comprise a multi-talker babble noise. The method also comprises a step for decomposing the input audio signal into at least two components, the at least two components comprising a first component having a low or no sustained oscillatory pattern, and a second component having a high oscillatory pattern. In certain embodiments, the decomposing step comprises de-noising the first and second components, and the first component is more aggressively de-noised than the second component. In some embodiments, the decomposing step may include determining a first Tunable Q-Factor Wavelet Transform (TQWT) for the first component and a second TQWT for the second component. The method also comprises a step for de-noising the second component based on data generated from the first component to obtain a modified second component. In some embodiments, the de-noising step comprises further modifying the second component to obtain a modified second component having a temporal and spectral pattern (TSP) corresponding to a TSP of the first component. The method further comprises a step for outputting an audio signal having reduced noise, the output audio signal comprising the first component in combination with the modified second component. The outputted audio signal may more closely correspond to the speech signal than the input audio signal.
[0009] In another aspect, a method for improving intelligibility of speech is provided. The method comprises a first step for obtaining, from a receiving arrangement, an input audio signal comprising a speech signal and a noise, and then a step for estimating a noise level of the input audio signal. In some embodiments, the estimating step comprises determining or estimating a signal-to-noise ratio (SNR) for the input audio signal. The method also includes a step for decomposing the input audio signal into at least two components when the estimated noise level of the input audio signal is above a predetermined threshold, the at least two components comprising a first component having a low or no sustained oscillatory pattern, and a second component having a high oscillatory pattern. The method also includes a step for de-noising the second component based on data generated from the first component to obtain a modified second component. The method further includes a step for outputting an audio signal having reduced noise to an output arrangement, the output audio signal comprising the first component in combination with the modified second component.
[0010] In another aspect, a non-transitory computer readable medium storing a computer program that is executable by at least one processing unit is provided. The computer program comprises sets of instructions for: receiving an input audio signal comprising a speech signal and a noise; decomposing the input audio signal into at least two components, the at least two components comprising a first component having a low or no sustained oscillatory pattern, and a second component having a high oscillatory pattern; de-noising the second component based on data generated from the first component to obtain a modified second component; and outputting an audio signal having reduced noise, the output audio signal comprising the first component in combination with the modified second component.
[0011] In a further aspect, a system for improving intelligibility for a user is provided. The system may comprise a receiving arrangement configured to receive an input audio signal comprising a speech signal and a noise. The system may also include a processing arrangement configured to receive the input audio signal from the cochlear implant, decompose the input audio signal into at least two components, the at least two components comprising a first component having a low or no sustained oscillatory pattern, and a second component having a high oscillatory pattern, de-noise the second component based on data generated from the first component to obtain a modified second component, and output an audio signal having reduced noise to the cochlear implant, the output audio signal comprising the first component in combination with the modified second component. The system may further comprise a cochlear implant, wherein the cochlear implant includes the receiving arrangement, and the cochlear implant is configured to generate an electrical stimulation to the user, the electrical stimulation corresponding to the output audio signal. Alternatively, the system may further comprise a mobile computing device, wherein the mobile computing device includes the receiving arrangement, and the mobile computing device is configured to generate an audible sound corresponding to the output audio signal.
[0012] These and other aspects of the invention will become apparent to those skilled in the art after a reading of the following detailed description of the invention, including the figures and appended claims.
BRIEF DESCRIPTION OF THE FIGURES
[0013] Fig. 1a shows an exemplary method for noise reduction, in particular, multi-talker babble noise reduction in a cochlear implant.
[0014] Fig. 1b shows an alternative exemplary method for noise reduction, in particular, multi-talker babble noise reduction in a cochlear implant.
[0015] Fig. 2 shows an exemplary computer system for performing a method for noise reduction.
[0016] Fig. 3 shows an exemplary embodiment of a user interface for a MUSHRA (Multiple Stimuli with Hidden Reference and Anchor) evaluation.
[0017] Fig. 4a shows data corresponding to percentages of words correct in normal patients for input signals that are unprocessed and processed using the exemplary method of Fig. 1a.
[0018] Fig. 4b shows data corresponding to MUSHRA scores in normal patients for input signals that are unprocessed and processed using the exemplary method of Fig. 1a.
[0019] Fig. 5 shows data corresponding to percentages of words correct in CI patients for input signals that are unprocessed and processed using the exemplary method of Fig. 1a.
[0020] Fig. 6 shows data corresponding to MUSHRA scores in CI patients for input signals that are unprocessed and processed using the exemplary method of Fig. 1a.
[0021] Fig. 7a shows an average of the data corresponding to percentages of words correct in CI patients of Fig. 5.
[0022] Fig. 7b shows an average of the data corresponding to MUSHRA scores in CI patients of Fig. 6.
[0023] Fig. 8 shows a Gaussian Mixture model of data corresponding to noisy speech samples with SNRs ranging from -10 dB to 20 dB processed using the exemplary method of Fig. 1b.
[0024] Fig. 9 shows data corresponding to variation of the accuracy metric F as a function of SNR for three different multi-talker babble noises according to the exemplary method of Fig. 1b.
[0025] Fig. 10 shows data corresponding to the frequency response and sub-band wavelets of a TQWT according to the exemplary method of Fig. 1b.
[0026] Fig. 11 shows data corresponding to Low frequency Gap Binary Patterns for clean/noisy speech samples processed using the exemplary method of Fig. 1b.
[0027] Fig. 12 shows the effect of each of initial de-noising and spectral cleaning on the weighted normalized Manhattan distance M measured on noisy speech samples corrupted with various randomly created multi-talker babbles, processed according to the exemplary method of Fig. 1b.
DETAILED DESCRIPTION
[0028] The present invention is directed to a method and system for multi-talker babble noise reduction. The system may be used with an audio processing device, a cochlear implant, a mobile computing device, a smart phone, a computing tablet, or other computing device to improve intelligibility of input audio signals, particularly that of speech. For example, the system may be used in a cochlear implant to improve recognition and intelligibility of speech for patients in need of hearing assistance. In one particular embodiment, the method and system for multi-talker babble noise reduction may utilize Q-factor based signal decomposition, which is further described below.
[0029] Cochlear implants (CIs) may restore the ability to hear to deaf or partially deaf individuals. However, conventional cochlear implants are often ineffective in noisy environments, because it is difficult for a user to intelligibly understand speech in the context of background noise. Specifically, an original signal having a background of multi-talker babble noise is particularly difficult to filter and/or process to improve intelligibility to the user, because it often includes background noise that does not adhere to any predictable prior pattern. Rather, multi-talker babble noise tends to reflect the spontaneous speech patterns of multiple speakers within one room, and it is therefore difficult for the user to intelligibly understand the desired speech while it is competing with simultaneous multi-talker babble noise.
[0030] There are a number of different approaches to filtering and/or reducing noise in a noisy audio signal to a cochlear implant. For example, modulation-based methods may differentiate speech from noise based on temporal characteristics, including modulations of depth and/or frequency, and may subsequently apply a gain reduction to the noisy signals or portions of signals, e.g., noisy envelopes. In another example, spectral subtraction based methods may estimate a noise spectrum using a predetermined pattern, which may be generated based on prior knowledge (e.g., detection of prior speech patterns) or speech pause detection, and may subsequently subtract the estimated noise spectrum from a noisy speech spectrum. As a further example, sub-space noise reduction methods may be based on a noisy speech vector, which may be projected onto different sub-spaces for analysis, e.g., a signal sub-space and a noise sub-space. The clean signal may be estimated by a sub-space noise reduction method by retaining only the components in the signal sub-space and nullifying the components in the noise sub-space. An additional example may include an envelope subtraction algorithm, which is based on the principle that the clean (noise-free) envelope may be estimated by subtracting the noise envelope, which may be separately estimated, from the noisy envelope. Another example may include a method that utilizes S-shaped compression functions in place of the conventional logarithmic compression functions for noise suppression. In an alternative example, a binary masking algorithm may utilize features extracted from training data and categorize each time-frequency region of a spectrogram as speech-dominant or noise-dominant. In another example, a wavelet-based noise reduction method may provide de-noising in a wavelet domain by utilizing shrinking and/or thresholding operations.
[0031] Although there have been many approaches to filtering and/or reducing noise in a noisy audio signal to a cochlear implant, a dilemma remains in designing a noise reduction system and/or method: there is a tradeoff between the amount of noise reduction that can be provided and the signal distortion and/or speech distortion that may be introduced as a side-effect of the filtering and/or noise reduction processes. In particular, a more aggressive noise removal process may introduce more distortion, and therefore possibly less intelligibility in the resulting signal. Conversely, a mild approach to removing noise may result in less distortion, but the signal may retain more noise. Finding the optimal point, where both the distortion and the noise are minimized, requires careful balancing of the two factors and can be difficult. In particular, this optimal point may differ from person to person, in both normal hearing people and in CI users.
[0032] The exemplary embodiments described herein provide a method and system for noise reduction, particularly multi-talker babble noise reduction, that is believed to bypass this optimal point conundrum by applying both aggressive and mild noise removal methods at the same time, thereby benefiting from the advantages and avoiding the disadvantages of both approaches. In particular, the exemplary method comprises a first step for decomposing a noisy signal into two components, which may also perform a preliminary de-noising of the signal at the same time. This first step for decomposing the noisy signal into two components may utilize any suitable signal processing methods. For example, this first step may utilize one, two or more wavelet or wavelet-like transforms and a signal decomposition method, e.g., a sparsity based signal decomposition method, optionally coupled with a de-noising optimization method. In particular, this first step may utilize two Tunable Q-Factor Wavelet Transforms (TQWTs) and a sparsity based signal decomposition method coupled with applying a Basis Pursuit De-noising optimization method. Wavelets, sparsity based decomposition methods and de-noising optimization methods may be highly tunable. Therefore, their parameters may be adjusted to obtain desired features in the output components. The output components of this first step may include two main products and a byproduct. The two main products may include a Low Q-factor (LQF) component and a High Q-factor (HQF) component, and the byproduct may include a separated residual noise, wherein the Q-factor may be a ratio of a pulse's center frequency to its bandwidth, which is discussed further below. In the case of complex non-stationary noise, this first step for decomposing the noisy signal may not remove all of the noise. Therefore, the method may include a second step for de-noising using information from the products obtained from the first step.
[0033] Generally, a method for noise reduction, particularly multi-talker babble noise reduction, e.g., a Speech Enhancement using Decomposition Approach, iterative version (SEDA_i), may comprise three different stages: (1) noise level classification, (2) signal decomposition and initial de-noising, and (3) spectral cleaning and reconstitution. The first stage classifies the noise level of the noisy speech. The second stage decomposes the noisy speech into two components and performs a preliminary de-noising of the signal. This is achieved using two Tunable Q-factor Wavelet Transforms (TQWTs) and a sparsity-based signal decomposition algorithm, Basis Pursuit De-noising (BPD). The wavelet parameters in the second stage are set based on the results of the classification stage. The output of the second stage consists of three components: the low Q-factor (LQF) component, the high Q-factor (HQF) component and the residual. The third stage further de-noises the HQF and LQF components and then recombines them to produce the final de-noised output.
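As a minimal orientation sketch (not the disclosed implementation), the three stages might be composed as below. The stage functions are placeholder stubs standing in for the procedures detailed in the remainder of this description, so the names and their trivial bodies are assumptions for illustration only.

```python
def classify_noise_level(frame):
    """Stage 1 placeholder: returns 'clean', 'mild', or 'high'
    (the real classifiers are described below)."""
    return "mild"

def decompose_and_denoise(frame, category):
    """Stage 2 placeholder: TQWT + BPD decomposition into LQF, HQF and
    residual components (here a trivial split, for illustration only)."""
    return 0.9 * frame, 0.1 * frame, 0.0 * frame

def spectral_cleaning(lqf, hqf):
    """Stage 3 placeholder: attenuate noise-dominated points of the HQF
    component using the TSP of the LQF component."""
    return hqf

def seda_i_frame(frame):
    category = classify_noise_level(frame)
    if category == "clean":
        return frame                                 # too clean to de-noise
    lqf, hqf, _ = decompose_and_denoise(frame, category)
    return lqf + spectral_cleaning(lqf, hqf)         # reconstitution
```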
[0034] Fig. 1a illustrates an exemplary method 100 for noise reduction, in particular, multi-talker babble noise reduction in a cochlear implant. Specifically, the method may be used to improve recognition and intelligibility of speech to patients in need of hearing assistance. Any suitable cochlear implant may be used with exemplary method 100. In particular, the cochlear implant may detect an audio signal and restore a deaf or partially deaf individual's ability to hear by providing an electrical stimulation to the auditory nerve corresponding to the audio signal. However, the input audio signal may often be noisy and may not be recognized or discerned by the user. Therefore, the input signal may be further processed, e.g., filtered, to improve clarity and/or intelligibility of speech to the patient. In an exemplary embodiment, a rough determination of the noise level in the input signal may be made before starting a de-noising process. In addition, the estimated level of noise present may be utilized to set the wavelet and optimization parameters for subsequent de-noising of the input signal.
[0035] The input audio signal may be a continuous audio signal and may be broken down into predetermined segments and/or frames for processing by the exemplary method 100. In particular, in a real-time application, such as an application for improving hearing for a CI user or for improving intelligibility of audio communications on a communications device (such as a mobile communications device, a telephone, a smart phone, etc.), the input signal may include non-steady noise where the level of noise, e.g., the signal to noise ratio, may change over time. To adapt to the changing levels of noise intensity in an input signal, the signal may be separated into a plurality of frames, where each frame may be individually analyzed and/or de-noised, for example, by processing each individual frame using the exemplary method 100. The input signal may be divided into the plurality of frames by any suitable means. The exemplary method 100 may be continuously applied to each successive frame of the input signal for analysis and/or de-noising. In some embodiments, the input audio signal may be obtained and each frame of the input audio signal may be processed by the exemplary method 100 in real-time or substantially real-time, meaning within a time frame that is negligible or imperceptible to a user, for example, within less than 3 seconds, less than 1 second, or less than 0.5 seconds.
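A per-frame processing loop of the kind described above might look like the following sketch. The one-second frame length is taken from the classification discussion later in this description, and the function names are illustrative assumptions.

```python
import numpy as np

def split_into_frames(signal, fs, frame_seconds=1.0):
    """Split a long input signal into consecutive frames of about one
    second so each frame can be classified and de-noised separately."""
    n = max(1, int(frame_seconds * fs))
    return [signal[i:i + n] for i in range(0, len(signal), n)]

def process_stream(signal, fs, process_frame):
    """Apply a per-frame de-noising function and re-concatenate the result."""
    return np.concatenate([process_frame(f) for f in split_into_frames(signal, fs)])
```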
[0036] In a first step 102, an input signal or a frame of an input signal may be obtained and analyzed to determine and/or estimate a level of noise present in the signal. Based on a level or an estimated level of noise present, the input signal or frame of input signal may be categorized into one of three categories: (I) the signal is either not noisy or has negligible amount of noise 104; (II) the signal is mildly noisy 106; or (III) the signal is highly noisy 108.
[0037] Step 102 may estimate the noise level in an input signal or a frame of an input signal using any suitable methods, such as, for example, methods for determining and/or estimating a signal to noise ratio (SNR), which may be adjusted to estimate the noise level in a variety of noise conditions. Any suitable SNR method may be used and may include, for example, those methods described in Hmam, H., "Approximating the SNR Value in Detection Problems," IEEE Trans. on Aerospace and Electronic Systems, Vol. 39, No. 4 (2003); Xu, PL, Wei, G., & Zhu, J., "A Novel SNR Estimation Algorithm for OFDM," Vehicular Technology Conference, vol. 5, 3068-3071 (2005); Mian, G., & Howell, T., "Determining a signal to noise ratio for an arbitrary data sequence by a time domain analysis," IEEE Trans. Magn., Vol. 29, No. 6 (1993); Liu, X., Jia, J., & Cai, L., "SNR estimation for clipped audio based on amplitude distribution," ICNC, 1434-1438 (2013), all of which are incorporated by reference herein. However, existing SNR estimation methods do not specifically accommodate non-stationary noise and, therefore, typically suffer from some degree of error and computational cost. Alternatively, the noise level of an input signal or a frame of an input signal may be estimated by measuring the frequency and depth of modulations in the signal, or by analyzing a portion of the input signal in silent segments in speech gaps. It is noted that step 102 may determine an SNR for an input signal or a frame of an input signal, but may alternatively provide merely an estimate, even a rough estimate, of its SNR.
[0038] The SNR or estimated SNR may be used to categorize the input signal or a frame of the input signal into the three different categories 104, 106, and 108. For example, Category I is for a signal that is either not noisy or includes negligible amounts of noise 104. In particular, this first category 104 may include, for example, those input signals or frames of input signals that have or are estimated to have an SNR that is greater than 12 dB (SNR > 12 dB), or greater than or equal to 12 dB (SNR ≥ 12 dB). The second category 106 may include, for example, those input signals or frames of input signals that have or are estimated to have an SNR that is greater than 5 dB and less than 12 dB (5 dB < SNR < 12 dB), or greater than or equal to 5 dB and less than or equal to 12 dB (5 dB ≤ SNR ≤ 12 dB). The third category 108 may include, for example, those input signals or frames of input signals that have or are estimated to have an SNR that is less than 5 dB (SNR < 5 dB), or less than or equal to 5 dB (SNR ≤ 5 dB).
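Expressed as code, this categorization reduces to two threshold comparisons, as in the sketch below. The boundary handling (whether exactly 5 dB and 12 dB fall in the lower or upper category) is left open in the text, so the choice here is an assumption.

```python
def categorize_snr(snr_db):
    """Map an (estimated) SNR in dB to the three categories of step 102."""
    if snr_db >= 12:
        return 104   # not noisy / negligible noise: left unmodified
    if snr_db >= 5:
        return 106   # mildly noisy: de-noised less aggressively
    return 108       # highly noisy: de-noised more aggressively
```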
[0039] This first step 102 does not depend highly on the accuracy of the noise level estimation, e.g., the SNR estimation provided. Rather, for input signals having SNR values on or near the threshold values of 5 dB and 12 dB, categorization of such an input signal in either of the bordering categories is not expected to significantly alter the outcome of the exemplary de-noising method 100 of Fig. 1a. Therefore, estimated SNR values may be sufficient for the first step 102. In certain exemplary embodiments, estimated SNR values may be determined using a more efficient process, e.g., a method that requires less computational resources and/or time, such as a process that requires fewer iterative steps.
[0040] In one particular embodiment, the SNR may be estimated using an exemplary SNR detection method for an arbitrary signal $s$, where $s$ may be defined as $s = \{s_1, s_2, \ldots, s_n\}$. A ratio of the signal's root mean square ($s_{rms}$) after and before a thresholding with respect to $\tau(s)$ (which may be defined as $\tau(s) = \frac{3}{n}\sum_{i=1}^{n}|s_i|$) may be represented by the term $r(s, \tau(s))$. The ratio $r(s, \tau(s))$ may be defined as:

$$r(s, \tau(s)) = \frac{h(s, \tau(s))_{rms}}{s_{rms}}, \qquad s_{rms} = \sqrt{\frac{1}{n}\left(s_1^2 + s_2^2 + \cdots + s_n^2\right)}$$

and $h(s, \tau(s)) = \{h_1, h_2, \ldots, h_n\}$, where:

$$h_i = \begin{cases} s_i, & |s_i| > \tau(s) \\ 0, & |s_i| \le \tau(s) \end{cases}$$

[0041] The term $h(s, \tau(s))$ refers to the signal $s$ after hard thresholding with respect to $\tau(s)$. The term $\tau(s)$ is defined such that, for speech samples that are mixed with multi-talker babble, the value of $r(s, \tau(s))$ varies little from signal to signal for samples having a constant signal to noise ratio (SNR).

[0042] The values of $r(x_1, \tau(x_1)), r(x_2, \tau(x_2)), \ldots, r(x_N, \tau(x_N))$ for a sufficiently large number of samples, for example but not limited to $N = 200$, may be subsequently determined and averaged to obtain a threshold $R_5$:

$$R_5 = \frac{1}{N}\sum_{i=1}^{N} r(x_i, \tau(x_i))$$

wherein $x_1, x_2, \ldots, x_N$ correspond to mixtures of various speech samples taken from IEEE standard sentences (IEEE Subcommittee, 1969) and multi-talker babble with SNR = 5.

[0043] The values of $r(y_1, \tau(y_1)), r(y_2, \tau(y_2)), \ldots, r(y_N, \tau(y_N))$ may be subsequently determined and averaged in the same manner to obtain a threshold $R_{12}$, wherein $y_1, y_2, \ldots, y_N$ correspond to mixtures of various speech samples taken from IEEE standard sentences (IEEE Subcommittee, 1969) and multi-talker babble with SNR = 12.

[0044] An input signal $s$ with an unknown SNR may be categorized into one of the three different categories 104, 106, and 108 as follows:

$$C(s) \in \begin{cases} 104\ (\text{SNR} > 12), & R_{12} < r(s, \tau(s)) \\ 106\ (5 < \text{SNR} < 12), & R_5 < r(s, \tau(s)) \le R_{12} \\ 108\ (\text{SNR} < 5), & r(s, \tau(s)) \le R_5 \end{cases}$$

where $C(s)$ is the category of the signal $s$ based on its SNR.
[0046] In the exemplary embodiment shown in Fig, l a, input signals or frames of input signals that fall within the first category 104 do not contain substantial amounts of noise. Therefore, these input signals or frames of input signals are too clean to be de-noised. The intelligibility of input signals in this first category 104 may be relatively high, therefore further de-noising of the signal may introduce distortion and/or lead to no significant intelligibility improvement. Accordingly, if the input signal or frame of input signal is determined to fall within the first category 104, the method 100 terminates without modification to the input signal or the frame of the input signal.
[0047] Input signals or frames of input signals that fall within the second category 106 may be de-noised in a less aggressive manner as compared to noisier signals. For input signals or frames of input signals in the second category 106, the priority is to avoid de -noising distortion rather than to remove as much no se as possible. [0048] Input signals or frames of input signals that fall within the third category 108 may not be very intelligible to a CI user, and may not be intelligible at all to an average CI user, For input signals or frames of input signals in the third category 108, distortion is less of a concern compared to intelligibility. Therefore, a more aggressive de-noising of the input signal or frame of input signal may be performed on input signals of the third category 108 to increase the amount of noise removed while gaining improvements in signal intelligibility to the CI user.
[0049] input signals or frames of input signals that fall within either the second category 106 or the third category 108 may be further processed in step 110. In step 1 10, the input signals or frames of input signals may be decomposed into at least two components: (I) a first component 112 that exhibits no or low amounts of sustained oscillatory behavior; and (II) a second component 114 that exhibits high sustained oscillator}'' behavior. Step 1 10 may optionally decompose the input signals or frames of input signals to include a third component: (III) a residual component 116 that does not fall within either component 112 or 114. Step 110 may decompose the input signals or frames of input signals using any suitable methods, such as, for example, separating the signals into components having different Q-factors. The Q-factor of a pulse may be defined as a ratio of its center frequency to its bandwidth, as shown in the formula below:
0 = -^- .
¾ BW
[0050] For example, the first component 112 may correspond to a low Q-factor component and the second component 114 may correspond to a high Q-factor component. The second component 114, which corresponds to a high Q-factor component, may exhibit more sustained oscillatory behavior than the first component 1.12, which corresponds to a low Q-factor component.
[0051] Suitable methods for decomposing the input signals or frames of input signals may include a sparse optimization wavelet method. The sparse optimization wavelet method may decompose the input signals or frames of input signals and may also provide preliminary de- noising of the input signals or frames of input signals. The sparse optimization wavelet method may utilize any suitable wavelet transform to provide a sparse representation of the input signals or frames of input signals. One exemplary wavelet transform that may be utilized with a sparse optimization wavelet for decomposing the input signals or frames of input signals in step 100 may include a Tunable Q-Factor Wavelet Transform (TQWT). In particular, the TQWT may be determined based on a Q-factor, a redundancy rate and a number of stages (or levels) utilized in the sparse optimization wavelet method, each of which may be independently adjustable within the method. By adjusting the Q-factor, the oscillatory behavior of the TQWT may be modified. In particular, the Q-factor may be adjusted such that the oscillator 7 behavior of the TQWT wavelet matches that of the input signals or frames of input signals. Redundancy rate in a wavelet transform, e.g., a TQWT, may refer to a total over-sampling rate of the transform. The redundancy rate must be always greater than 1. Because the TQWT is an over-sampled wavelet transform, any given signal would not correspond to a unique set of wavelet coefficients. In other words, an inverse TQWT applied to two different sets of wavelet coefficients, may correspond to the same signal.
[0052] Step 110 may also provide preliminary de-noising of the input signals or frames of input signals. The preliminary de-noising may be performed by a sparsity-based de-noising method, such as, for example, a sparse optimization wavelet method. As discussed above, of the input signals or frames of input signals may be represented by any suitable wavelet, in particular TQWT. By adjusting the Q-factor, an optimal sparse representation of the input signals or frames of input signals may be obtained. Such an optimal sparse representation may provide improved performance for related sparsity-based methods such as signal decomposition and/or de-noising. To select a spare representation of the input signals or frames of input signals, a Basis Pursuit (BP) method may be used. In particular, if the input signals or frames of input signals are considered to be noisy, e.g., those falling within the third category 109, a Basis Pursuit De-noising (BPD) method may be used.
[0053] Human speech may exhibit mixture of oscillatory and non-oscillatory behaviors. These two components usually cannot be sparsely represented using only one TQWT. Therefore in step 110, each input signal or frame of input signal may be represented using two different components having two different Q-factors. Suitable methods for decomposing the input signals or frames of input signals in step 110 may also include, for example, a Morphological Component Analysis (MCA) method.
[0054] in one particular exemplary embodiment, the input signal or frame of input signal y may be decomposed into three components: (I) a first component 112 having a low Q-factor xi , which does not exhibit sustained oscillator}7 behavior; (II) a second component 114 having a High Q-factor component xs , which exhibits sustained oscillatory behavior; and (III) a residual component 116 represented by , which includes noise and stochastic unstructured signals that cannot be sparsely represented by either of the two wavelet transforms of the first and second components 112 and 114, The input signal 7 may be represented as follows:
y = Xi - x2 + n .
[0055] The decomposition of the input signal >' , as shown above, may be a nonlinear decomposition, which cannot be achieved by any linear decomposition methods in time or frequency domain. Therefore, a MCA method may be used to obtain a sparse representation of both the first and second components 112, 114, where ¾ and ¾"s may be obtained using a constrained optimization method using the following formula: argminWi ^ \ \y - 4> 1w1 - *2 1M¾ II2 + ^ l| i,j | li + ^ *½, j i 1 w2 j \ 11
j=i i= i
such that: y ' (w, ) P2 1 (w1 ) + « wherein Φ1 and Φ2 are TQWT with low and high Q-factors respectively, and A2j are subband-dependent regularizations and should be selected based on the intensity of the noise, / is the subband index and Φ^1 and ·, ' are the inverse of the first and second tunable wavelet transforms.
[0056] The above formula may be solved to obtain w j and wiJ , which are the wavelet coefficients in different subbands. Using the wavelet coefficients, wi and ws , the first and second components 112 and 114, as represented by i and xs , may be obtained as follows:
Figure imgf000017_0001
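The constrained formulation above is typically handled with an iterative shrinkage algorithm. The sketch below is a generic ISTA-style loop for the two-transform objective, not the implementation disclosed here: the `t1`/`t2` objects stand in for the low-Q and high-Q TQWTs and are a hypothetical interface (`forward` returning a list of subband coefficient arrays, `inverse` resynthesizing a signal), and treating `forward` as the adjoint of `inverse` assumes the transforms form a tight frame.

```python
import numpy as np

def soft(w, t):
    """Soft thresholding (the proximal operator of the l1 penalty)."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def dual_transform_bpd(y, t1, t2, lam1, lam2, n_iter=100, step=0.5):
    """ISTA-style sketch of the sparse decomposition above. lam1/lam2 are
    subband-dependent regularization parameters (one per subband)."""
    w1 = t1.forward(np.zeros_like(y))
    w2 = t2.forward(np.zeros_like(y))
    for _ in range(n_iter):
        resid = y - t1.inverse(w1) - t2.inverse(w2)
        g1, g2 = t1.forward(resid), t2.forward(resid)   # gradient via adjoint
        w1 = [soft(w + step * g, step * l) for w, g, l in zip(w1, g1, lam1)]
        w2 = [soft(w + step * g, step * l) for w, g, l in zip(w2, g2, lam2)]
    x1 = t1.inverse(w1)            # LQF component (first component 112)
    x2 = t2.inverse(w2)            # HQF component (second component 114)
    return x1, x2, y - x1 - x2     # residual (component 116)
```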
[0057] In one particular exemplary embodiment, the wavelet and optimization parameters may also be selected such that the first and second components 112, 114 are preliminarily de-noised using a BPD method. In particular, the wavelet and optimization parameters may be selected such that the following conditions are met:

(1) the first component 112, which is the Low Q-factor (LQF) component, has significantly lower energy than the second component 114, which is the High Q-factor (HQF) component; and

(2) the LQF component is de-noised more aggressively, and consequently may be more distorted.

[0058] Because the LQF component may be de-noised more aggressively, the HQF component would be de-noised more mildly to reduce the amount of distortion. The two conditions above allow for identification of the HQF and LQF components, which typically have relatively similar Temporal and Spectral Patterns (TSPs) when the signal is not noisy. In other words, the concentrations of energy in their spectrograms and time domain graphs are expected to be roughly in the same areas. The input signal or frame of input signal may be decomposed based on the Q-factors of the different components, and portions of the input signal that share similar frequency content may correspond to different Q-factors.
[0059] In step 118, the second component 114 may be further de-noised using the first component 112 or data generated based on the first component 112. As explained further below, the TSP of the first component 112 is expected to more closely resemble that of a clean speech signal, as compared to the second component 114. Therefore, the first component 112 may be used to further de-noise the second component 114, particularly using the TSP of the first component.

[0060] A clean audio signal that is not noisy may be represented by $X$. For a clean input signal, BPD is not necessary for de-noising the signal. Therefore, decomposition of a clean input signal $x$ may correspond to a sparse representation of two components, where $x_1$ and $x_2$ may be obtained using a constrained optimization method using the following formula:

$$\arg\min_{w_1, w_2} \; \sum_{j=1}^{J_1} \lambda_{1,j} \left\| w_{1,j} \right\|_1 + \sum_{j=1}^{J_2} \lambda_{2,j} \left\| w_{2,j} \right\|_1$$

such that: $x = \Phi_1^{-1}(w_1) + \Phi_2^{-1}(w_2)$

and: $x_1 = \Phi_1^{-1}(w_1)$, $x_2 = \Phi_2^{-1}(w_2)$

where: $x = x_1 + x_2$.
[0061] Both the noisy input signal or frame of input signal $Y$ and the clean input signal $X$ may be decomposed into HQF and LQF components as follows:

$$Y = X + N$$

wherein $X = X_L + X_H$, and

wherein $Y = Y_L + Y_H + Y_R$.

[0062] Each of the above variables is defined as follows:

$Y$: noisy speech signal
$X$: clean speech signal before adding noise
$N$: added noise
$X_L$: LQF component of the original speech signal
$X_H$: HQF component of the original speech signal
$Y_L$: LQF component of the noisy speech signal
$Y_H$: HQF component of the noisy speech signal
$Y_R$: residual component of the decomposition using BPD
[0063] Because the LQF component $Y_L$ is expected to include less noise than the HQF component $Y_H$, due to the more aggressive noise removal in step 110, the TSP of the LQF component $Y_L$ is expected to be more similar to the TSP of the LQF component $X_L$ of the clean speech signal. This similarity is particularly notable in lower frequencies, where speech fundamental frequencies are often located. Therefore, the concentrations of energy in both of their spectrograms are expected to follow a similar shared pattern. Gaps and speech pauses are also expected to be located in the same areas of the spectrograms and time domain graphs in both cases. The term gaps, as used herein, refers to empty or low energy areas in low frequency parts of the spectrograms or very low amplitude pauses in time domain graphs.
[0064] In contrast, the HQF component $Y_H$, which is de-noised less aggressively in step 110, is expected to be noisier and, therefore, less similar to the HQF component $X_H$ of the clean speech. Contrary to the LQF components $Y_L$ and $X_L$ discussed above, where gaps can be seen in both the noisy and the clean spectrograms, the low frequency gaps which can be identified in the clean signal's HQF component $X_H$ may be filled, typically completely filled, by noise in the HQF component $Y_H$ of the input signal or frame of input signal. Although it may include more noise, the HQF component $Y_H$ is expected to be less distorted, which is particularly crucial for good intelligibility to a patient. Because the LQF and HQF components of the clean speech, $X_L$ and $X_H$, are also expected to have roughly similar TSPs (at least the gaps in low frequencies in their spectrograms are roughly in the same areas), it is expected that the TSP of the HQF component $X_H$ of the clean speech also bears some similarities to the TSP of the LQF component $Y_L$ obtained from the noisy input signal. This resemblance may be more pronounced in time domain graphs. The low frequency gaps in the time domain graphs may also be similar, at least compared to the noisy HQF component $Y_H$. [0065] In step 118, the input signal or frame of input signal $Y$ should be de-noised such that it becomes as similar as possible to the clean speech $X$ without causing too much distortion. As discussed above, the LQF components of the clean speech and the noisy speech are already similar, and therefore, only the HQF component of the noisy input signal needs to be further modified (e.g., de-noised) so that it more closely resembles the HQF component of the clean speech ($X_H$).
[0066] The second component 114 may be further de-noised and may be represented by $\hat{Y}_H$, which corresponds to a modified version of $Y_H$ having a TSP that is similar to the TSP of $X_H$, which may be represented as follows:

$$P(\hat{Y}_H) \approx P(X_H)$$

where $P(\cdot)$ denotes the TSP of a signal. Specifically, the first component 112 may correspond to $Y_L$ and the second component 114 may correspond to $Y_H$ in the formula shown above. Because $P(Y_L)$ is expected to be similar to $P(X_L)$, and in the absence of a priori knowledge of $X_H$, the TSP of $Y_H$ may be modified and a modified version $\hat{Y}_H$ may be obtained to satisfy the following condition:

$$P(\hat{Y}_H) \approx P(Y_L)$$

[0067] Therefore, the further de-noised $\hat{Y}_H$ may be determined based on the following relation:

$$P(Y_L) \approx P(X_L),\;\; P(\hat{Y}_H) \approx P(X_H) \;\Rightarrow\; P(Y_L + \hat{Y}_H) \approx P(X_L + X_H) \;\Rightarrow\; P(Y_L + \hat{Y}_H) \approx P(X)$$
[0068] In another exemplary embodiment, step 118 may include a method which modifies the spectrogram of the second component 114, e.g., $Y_H$, to a modified version of the second component, e.g., $\hat{Y}_H$. In particular, the method may preferably introduce the least possible amount of distortion to the resulting output, and/or may provide processing of input signals in real-time or substantially real-time so as to be useful in applications such as cochlear implant devices. In particular, the method for modifying the spectrogram of the second component 114, e.g., $Y_H$, to a modified version of the second component, e.g., $\hat{Y}_H$, may include point-wise multiplication in the Fourier transform domain of non-overlapping frames of an input signal. In particular, each frame of the input signal may be represented as $Y_t \in \mathbb{R}^N$, wherein $N$ corresponds to the length of the frame. Each frame of the input signal may correspond to the following:

$$Y_t = Y_L + Y_H + Y_R$$
[0069] A Discrete Fourier Transform (DFT) may be determined for each of the above components as follows:

$$Y_t^f = \mathrm{DFT}(Y_t), \quad Y_L^f = \mathrm{DFT}(Y_L), \quad Y_H^f = \mathrm{DFT}(Y_H)$$

[0070] Each point $i$ in $Y_L^f$ and $Y_H^f$ may be categorized as one of the following:

$$Y^f(i) \in \{C_{VH}, C_H, C_L, C_{VL}\}$$

where $C_{VH}$, $C_H$, $C_L$, $C_{VL}$ represent four different categories corresponding to very high energy, high energy, low energy and very low energy, respectively.

[0071] The above categorization may be performed using a threshold-based quantification method. The TSP of $\hat{Y}_H$ is expected to be similar to the TSP of $Y_L$ after removing the noise. Therefore, if a point demonstrates a high or very high energy in $Y_H^f$ but demonstrates low or very low energy in $Y_L^f$, its energy in $Y_H^f$ is believed to most likely be coming from a noise source and must then be attenuated.

[0072] To estimate $\hat{Y}_H^f$, each point in $Y_H^f$ may be compared with its counterpart in $Y_L^f$, and different reduction gains $g_r$ may be applied to the high or very high energy points in $Y_H^f$ having low or very low energy counterparts in $Y_L^f$, which may be represented by the following formula:

$$\hat{Y}_H^f(i) = g_r \cdot Y_H^f(i), \qquad 0 < g_{r1} < g_{r2} \le g_{r3} < g_{r4} \le 1$$

where the particular gain $g_{r1}, \ldots, g_{r4}$ applied to a point depends on the energy categories of that point in $Y_H^f$ and $Y_L^f$. In some embodiments, a reduction gain may also be applied to low or very low energy points in $Y_H^f$. After an estimate for $\hat{Y}_H^f$ is obtained, an inverse Discrete Fourier Transform may be applied to obtain the modified version of the second component, e.g., $\hat{Y}_H$, of the input signal, as follows:

$$\hat{Y}_H = \mathrm{IDFT}(\hat{Y}_H^f)$$
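A minimal sketch of this point-wise spectral cleaning for one non-overlapping frame is given below. The quantile-based split into the four energy categories and the specific gain values are illustrative assumptions; the text only requires a threshold-based quantification and gains satisfying $0 < g_{r1} < g_{r2} \le g_{r3} < g_{r4} \le 1$.

```python
import numpy as np

def spectral_clean_frame(y_l, y_h, gains=(0.1, 0.3, 0.6, 1.0)):
    """Attenuate points that are energetic in Y_H^f but weak in Y_L^f,
    then return the modified HQF component via the inverse DFT."""
    YL, YH = np.fft.fft(y_l), np.fft.fft(y_h)

    def categories(mag):
        # threshold-based quantification: 0 = very low ... 3 = very high
        return np.digitize(mag, np.quantile(mag, [0.25, 0.5, 0.75]))

    c_l = categories(np.abs(YL))
    c_h = categories(np.abs(YH))
    g = np.ones(len(YH))
    strong_in_h = c_h >= 2                     # high / very high in Y_H^f
    g[strong_in_h & (c_l == 1)] = gains[2]     # low-energy counterpart
    g[strong_in_h & (c_l == 0)] = gains[0]     # very-low-energy counterpart
    return np.real(np.fft.ifft(g * YH))        # modified component
```

Recombining with the LQF component, as in step 120 below, is then simply `y_out = y_l + spectral_clean_frame(y_l, y_h)`.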
[0073] In step 120, the first component 112 and the further filtered second component, where the second component 114 has been filtered using the first component 112, may be combined to generate a filtered signal that may be outputted for use in a cochlear implant. In particular, the first component 112, e.g., $Y_L$, and the further filtered second component, e.g., $\hat{Y}_H$, may be combined to create an output signal, represented by $Y_{out}$, as follows:
Y0ut = YL + ¥H , which is expected to demonstrate a TSP that is similar to the TSP of clean speech,
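For concreteness, the frame-wise cleaning of paragraphs [0068]-[0073] may be sketched in Python as follows. This is a minimal illustration rather than the disclosed implementation: the threshold values, the reduction gain values, and the peak-relative categorization rule are assumptions introduced here.

```python
import numpy as np

def clean_high_q_frame(y_l, y_h, thresholds=(0.05, 0.2, 0.5),
                       gains=(0.1, 0.25, 0.5, 0.75)):
    """Attenuate DFT points of the high-Q frame whose low-Q counterparts
    carry little energy, then return the modified time-domain frame."""
    Yl = np.fft.rfft(y_l)
    Yh = np.fft.rfft(y_h)

    def categorize(Yf):
        # 0 = very low, 1 = low, 2 = high, 3 = very high energy,
        # judged against the frame's peak DFT magnitude (an assumption).
        m = np.abs(Yf) / (np.abs(Yf).max() + 1e-12)
        return np.digitize(m, thresholds)

    cat_l, cat_h = categorize(Yl), categorize(Yh)
    g = np.ones(Yh.shape)
    # (Very) energetic in Y_H but (very) quiet in Y_L -> noise-dominated:
    # stronger mismatch gets a smaller gain, 0 < g_r1 < ... < g_r4 <= 1.
    for ch, cl, gain in [(3, 0, gains[0]), (3, 1, gains[1]),
                         (2, 0, gains[2]), (2, 1, gains[3])]:
        g[(cat_h == ch) & (cat_l == cl)] = gain
    return np.fft.irfft(g * Yh, n=len(y_h))  # inverse DFT of the estimate

# Recombination as in step 120: y_out = y_l + clean_high_q_frame(y_l, y_h)
```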
[0074] Fig. 1b provides an alternative exemplary embodiment of a method 150 for noise reduction, in particular, multi-talker babble noise reduction in a cochlear implant. The alternative exemplary embodiment of method 150 shown in Fig. 1b is substantially similar to the method 100 described with respect to Fig. 1a as discussed above. Differences between the two exemplary methods 100 and 150 are further detailed below.
[0075] Similar to step 102, in a first step 152, an input signal or a frame of an input signal may be obtained and analyzed to determine and/or estimate a level of noise present in the signal. Based on a level or an estimated level of noise present, the input signal or frame of input signal may be categorized into one of two categories: (I) the signal is mildly noisy 154; or (II) the signal is highly noisy 156. Step 152 may estimate the noise level in an input signal or a frame of an input signal using any suitable methods, such as those described above in reference to step 102 (e.g., methods for determining and/or estimating SNR).
[0076] In step 152, the SNR or estimated SNR may be used to categorize the input signal or a frame of the input signal into two instead of three different categories 154 and 156. For example, Category I is for a signal that is mildly noisy 154. In particular, this first category 154 may include, for example, those input signals or frames of input signals that have or are estimated to have an SNR that is greater than 3.5 dB (SNR > 3.5 dB), or greater than or equal to 3.5 dB (SNR ≥ 3.5 dB). The second category 156 may include, for example, those input signals or frames of input signals that have or are estimated to have an SNR that is less than 3.5 dB (SNR < 3.5 dB), or less than or equal to 3.5 dB (SNR ≤ 3.5 dB).
[0077] In one embodiment, the SNR may be estimated using the exemplary SNR detection method described above in reference to step 102. In another embodiment, the SNR may be estimated using a different exemplary method. This method may provide a computationally efficient and relatively accurate way to classify the noise level of speech corrupted by multi-talker babble. To keep track of the background noise variation, longer signals may be segmented into shorter frames and each frame may be classified and de-noised separately. The length of each frame should be at least one second to ensure a high classification/de-noising performance. In this embodiment, step 152 uses two features which are sensitive to changes of the noise level in speech, easy to extract, and relatively robust under various babble noise conditions (i.e., different numbers of talkers, etc.).
[0078] The first feature is the envelope mean-crossing rate, which is defined as the number of times that the envelope crosses its mean over a certain period of time (e.g., one second). To compute this feature, step 152 first needs to extract the envelope of the noisy speech. For a noisy speech frame Y, the envelope can be obtained as follows:
E(i) = sqrt( (1/l) · Σ_{n=1}^{l} Y((i-1)·l_h + n)² )
where l is the length of the window (w) and l_h is the hop size. The envelope mean-crossing rate of a noisy signal frame is calculated as follows:
f_1 = (f_s / N) · Σ_{i=1}^{l_E - 1} |S(E(i+1) - M) - S(E(i) - M)| / 2
where E, l_E and M are the envelope, its length and its mean, respectively, N is the length of the frame, f_s is the sampling rate and S(x) is the sign function defined as:
S(x) = 1 if x ≥ 0; S(x) = -1 if x < 0.
[0079] Note that for this feature we have used rectangular windows, hence l_h = 1. [0080] The main parameter that affects this feature is the length of the window (l). This feature may be optimized by finding the value of l ∈ N which maximizes the feature's Fisher score:
l* = argmax_{l ∈ N} [ Σ_{k=1}^{C_n} n_k(μ_k - μ)² / Σ_{k=1}^{C_n} n_k σ_k² ]
where C_n = 2 is the number of classes, μ_k is the mean of the f_1 values of frames in class k, μ is the overall mean of the f_1 values, σ_k² is the variance of the f_1 values in class k and n_k is the total number of frames in class k.
[0081] To numerically solve the above, this feature's Fisher score may be calculated for 10,000 labeled noisy speech frames corrupted with randomly created multi-talker babble. The duration of each noisy speech frame may be randomized between 2 and 5 seconds with a sampling rate of f_s = 16000 samples/second. The average Fisher score for this feature may be maximized with l = 50.
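The envelope mean-crossing rate feature f_1 may be sketched as follows, assuming the RMS-envelope formulation reconstructed above (rectangular window of length l = 50, hop size l_h = 1); the function name and the exact normalization are illustrative assumptions.

```python
import numpy as np

def envelope_mean_crossing_rate(y, fs, win_len=50):
    """f1: crossings of the RMS envelope through its own mean,
    normalized to a per-second rate (rectangular window, hop size 1)."""
    csum = np.concatenate(([0.0], np.cumsum(y ** 2)))
    env = np.sqrt((csum[win_len:] - csum[:-win_len]) / win_len)
    s = np.where(env >= env.mean(), 1, -1)      # sign of E(i) - M
    crossings = np.count_nonzero(np.diff(s))    # = sum |S(..) - S(..)| / 2
    return crossings * fs / len(y)
```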
[0082] The second feature is the post-thresholding to pre-thresholding RMS ratio. First we denote the hard threshold of a noisy speech frame Y = {y_1, . . ., y_N} with threshold τ by H(Y, τ) = {h_1, . . ., h_N}, where:
h_i = y_i if |y_i| ≥ τ; h_i = 0 otherwise.
[0083] The post-thresholding to pre-thresholding RMS ratio is calculated as follows:
f_2 = RMS(H(Y, τ)) / RMS(Y)
where the threshold τ is set in proportion to the parameter K.
[0084] The variable which determines the quality of this feature is K, and this feature may be optimized by finding the value of K which maximizes the Fisher score for this feature:
K* = argmax_K [ Σ_{k=1}^{C_n} n_k(μ_k - μ)² / Σ_{k=1}^{C_n} n_k σ_k² ]
where the means and variances are now computed over the f_2 values.
[0085] Numerical maximization of the Fisher score with K = 0.1·a, where a ∈ N and 1 ≤ a ≤ 100, shows the best value for K is K = 3.
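The second feature f_2 may be sketched as follows; the choice of τ = K·RMS(Y) is an assumption of this sketch, since the text above only states that K controls the threshold.

```python
import numpy as np

def thresholding_rms_ratio(y, K=3.0):
    """f2: RMS after hard thresholding divided by RMS before it.
    tau = K * RMS(y) is an assumption of this sketch."""
    rms = np.sqrt(np.mean(y ** 2))
    h = np.where(np.abs(y) >= K * rms, y, 0.0)  # hard threshold H(Y, tau)
    return np.sqrt(np.mean(h ** 2)) / (rms + 1e-12)
```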
[0086] For training the classifier, the Gaussian Mixture Model (GMM) may be used. A GMM is the weighted sum of several Gaussian distributions:
p(F | μ_i, Σ_i, α_i) = Σ_{i=1}^{C} α_i · N(F | μ_i, Σ_i), such that Σ_{i=1}^{C} α_i = 1
where F is a d-dimensional feature vector (in this classification problem we have only two dimensions, or d = 2), α_i is the weight factor, μ_i is the mean and Σ_i is the covariance of the i-th Gaussian distribution. A Gaussian distribution N(F | μ_i, Σ_i) can be written as:
N(F | μ_i, Σ_i) = (1 / ((2π)^{d/2} |Σ_i|^{1/2})) · exp( -(1/2) (F - μ_i)^T Σ_i^{-1} (F - μ_i) )
[0087] Similar to step 102, step 152 also does not depend highly on the accuracy of the noise level estimation, e.g., the SNR estimation provided. Rather, for input signals having SNR values on or near the threshold value of 3.5 dB, categorization of such an input signal in either of the categories is not expected to significantly alter the outcome of the exemplary de-noising method 150 of Fig. 1b. Therefore, estimated SNR values may also be sufficient for step 152. In certain exemplary embodiments, estimated SNR values may be determined using a more efficient process, e.g., a method that requires less computational resources and/or time, such as a process that requires fewer iterative steps.
[0088] In the exemplary embodiment shown in Fig. 1b, input signals or frames of input signals that fall within the first category 154 may be de-noised in a less aggressive manner as compared to noisier signals. For input signals or frames of input signals in the second category 156, the priority is to avoid de-noising distortion rather than to remove as much noise as possible. Within each of the two categories, the data samples may be divided into two clusters, and each cluster may be modeled by a Gaussian model. In order to train the model, the Expectation-Maximization (EM) algorithm may be used.
[0089] After training the classifier as above, the method 150 may classify each noisy test speech frame Y with feature set F = {f_1, f_2} using MAP (maximum a posteriori) estimation as follows:
F ∈ Class 1 (SNR < 3.5), if P(F | Class 1)·P(Class 1) > P(F | Class 2)·P(Class 2)
F ∈ Class 2 (SNR > 3.5), if P(F | Class 2)·P(Class 2) > P(F | Class 1)·P(Class 1)
where:
P(F | Class 1) = α_1·N(F | μ_1, Σ_1) + α_2·N(F | μ_2, Σ_2)
P(F | Class 2) = α_3·N(F | μ_3, Σ_3) + α_4·N(F | μ_4, Σ_4)
and α_1, α_2, μ_1, μ_2, Σ_1, Σ_2 are the GMM parameters of class 1, and α_3, α_4, μ_3, μ_4, Σ_3, Σ_4 are the GMM parameters of class 2. Here both classes may be assumed to have equal overall probability (i.e., P(class_1) = P(class_2) = 0.5). Note that for each Gaussian model, the method 150 has already obtained the weights, means and covariances from the EM method. Using MAP, for each noisy speech sample with feature vector F, two probabilities may be obtained and the noisy sample may be classified into the class with the higher probability.
[0090] Input signals or frames of input signals that fall within either the first category 154 or the second category 156 may be further processed in step 160 in a similar manner as step 110 described above. In step 160, the input signals or frames of input signals may be decomposed into at least two components: (I) a first component 162 that exhibits no or low amounts of sustained oscillatory behavior; and (II) a second component 164 that exhibits high sustained oscillatory behavior. Step 160 may optionally decompose the input signals or frames of input signals to include a third component: (III) a residual component 166 that does not fall within either component 162 or 164. Step 160 may decompose the input signals or frames of input signals using any suitable methods, such as, for example, separating the signals into components having different Q-factors.
[0091] Step 160 may similarly provide preliminary de-noising of the input signals or frames of input signals. The preliminary de-noising may be performed by a sparsity-based de-noising method, such as, for example, a sparse optimization wavelet method. As discussed above, the input signals or frames of input signals may be represented by any suitable wavelet transform, in particular the TQWT. By adjusting the Q-factor, an optimal sparse representation of the input signals or frames of input signals may be obtained. Such an optimal sparse representation may provide improved performance for related sparsity-based methods such as signal decomposition and/or de-noising. To select a sparse representation of the input signals or frames of input signals, a Basis Pursuit (BP) method may be used. In particular, if the input signals or frames of input signals are considered to be noisy, e.g., those falling within the third category 109 described above, a Basis Pursuit De-noising (BPD) method may be used.
[0092] In step 168, the different HQF and LQF components may be further de-noised (e.g., by spectral cleaning) and subsequently recombined to produce the final de-noised output 170. In particular, this further de-noising step 168 may include parameter optimization followed by subsequent spectral cleaning. For example, assuming that the clean speech sample X and its noisy version Y are available, they may each be decomposed into HQF and LQF components.
There are a total of eight parameters associated with the optimization problem discussed above in steps 110 and 160. In order to maximize the de-noising performance in this stage, each of these eight parameters is optimized to ensure maximal noise attenuation with minimal signal distortion.
[0093] Low and high Q-factors (Q_1 and Q_2): These two parameters should be selected to match the oscillatory behavior of the speech in order to attain high sparsity and efficient subsequent de-noising. Q_1 and Q_2 denote the low and high Q-factors, respectively. Hence Q_2 must be sufficiently larger than Q_1. Choosing close values for Q_1 and Q_2 will lead to very similar LQF and HQF components and poor sparsification. Conversely, setting Q_2 to be too much greater than Q_1 also leads to poor results due to the concentration of most of the signal's energy in one component. With Q_1 = 1, any value between 5 and 7 is a reasonable choice for Q_2. In one exemplary embodiment, Q_1 = 1 and Q_2 = 5.
[0094] Oversampling rates (r_1 and r_2): a sufficient oversampling rate (redundancy) is required for an optimal sparsification. Nevertheless, selecting large oversampling values will increase the computational cost of the algorithm. For this method, any number between 2 and 4 can be suitable for r_1 and r_2. In one exemplary embodiment, r_1 = r_2 = 3.
[0095] Number of levels (J_1 and J_2): Once the previous four parameters are chosen, J_1 and J_2 should be selected to ensure the distribution of wavelet coefficients over a sufficiently large number of sub-bands. In one exemplary embodiment, J_1 = 10 and J_2 = 37.
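Collecting the wavelet parameters of paragraphs [0093]-[0095] into a configuration, a decomposition call might look as follows; tqwt_decompose is a hypothetical routine standing in for whatever TQWT implementation is available and is not part of the disclosure.

```python
# Hypothetical decomposition call; tqwt_decompose is an assumed helper.
LOW_Q  = dict(Q=1, r=3, J=10)   # Q1, r1, J1: low Q-factor branch
HIGH_Q = dict(Q=5, r=3, J=37)   # Q2, r2, J2: high Q-factor branch

# y_l, y_h, y_res = tqwt_decompose(y, LOW_Q, HIGH_Q)
```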
[0096] After selecting suitable values for the wavelet parameters, the regularization parameters λ_1 and λ_2 may be adjusted. These two parameters directly influence the effectiveness of de-noising. A larger value for either of them will lead to a more aggressive de-noising of its corresponding component. A more aggressive de-noising will potentially lead to more noise removal, but usually at the expense of increasing the distortion of the de-noised speech. Choosing suitable values for λ_1 and λ_2 which ensure maximum noise removal with minimum distortion is crucial for this stage.
[0097] Assuming the clean speech sample is available, λ_1 and λ_2 may be selected to maximize the similarity between the spectrograms of the clean speech components (X_L and X_H) and their de-noised versions (Y_L and Y_H). To measure the similarity between the spectrograms of the clean and de-noised signals, the normalized Manhattan distance applied to the magnitudes of the spectrograms (e.g., here with non-overlapping, 256-sample-long time frames) may be used, which may be defined as:
M = Σ_{i,j} | |S_c(i,j)| - |S_d(i,j)| | / Σ_{i,j} |S_c(i,j)|
where S_c is the Short Time Fourier Transform (STFT) of the clean speech and S_d is the STFT of its de-noised version. Using the above, M_L and M_H may be defined as metrics to measure the STFT similarity between the low and high Q-factor components of the clean and noisy speech samples, respectively, as follows:
M_L = Σ_{i,j} | |S_{X_L}(i,j)| - |S_{Y_L}(i,j)| | / Σ_{i,j} |S_{X_L}(i,j)|
M_H = Σ_{i,j} | |S_{X_H}(i,j)| - |S_{Y_H}(i,j)| | / Σ_{i,j} |S_{X_H}(i,j)|
where the STFT matrix is denoted by S and its corresponding component by its subscript. To maximize the similarity of S_{X_L} and S_{Y_L} as well as the similarity of S_{X_H} and S_{Y_H} simultaneously, while taking the relative energy of each component into account, the weighted normalized Manhattan distance may be defined as follows:
M_LH = αM_L + βM_H, where α + β = 1
[0098] The weighting factors α and β are selected based on the ℓ2-norms of their corresponding components as follows:
α = ||X_L||_2 / (||X_L||_2 + ||X_H||_2), β = ||X_H||_2 / (||X_L||_2 + ||X_H||_2)
Therefore:
M_LH = (||X_L||_2 · M_L + ||X_H||_2 · M_H) / (||X_L||_2 + ||X_H||_2)
[0099] The values of λ_1 and λ_2 which minimize M_LH can be used to optimize the de-noising stage, or:
(λ_1*, λ_2*) = argmin_{λ_1, λ_2} M_LH
[00100] To numerically solve the above, the average M_LH may be calculated over many speech samples (n = 1000) corrupted with randomly generated multi-talker babble noise with various signal-to-noise ratios. For each noisy sample, all combinations of λ_1 and λ_2 from 0.01 to 0.1 with 0.01 intervals may be used (100 possible combinations in total) and M_LH may be obtained. Two sets of values for λ_1 and λ_2 may be selected, where each set minimizes the average M_LH for noisy signals belonging to one of the classes discussed in the previous stage.
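The weighted distance M_LH may be sketched as follows, using non-overlapping 256-sample frames; the framing details and the exactly equal-length inputs are assumptions consistent with the description above.

```python
import numpy as np

def stft_mag(x, frame=256):
    """Magnitude spectrogram over non-overlapping 256-sample frames."""
    n = len(x) // frame
    return np.abs(np.fft.rfft(x[:n * frame].reshape(n, frame), axis=1))

def manhattan(Sc, Sd):
    """Normalized Manhattan distance between magnitude spectrograms."""
    return np.sum(np.abs(Sc - Sd)) / np.sum(Sc)

def m_lh(x_l, x_h, y_l, y_h):
    """M_LH = alpha*M_L + beta*M_H with l2-norm weights alpha, beta."""
    ml = manhattan(stft_mag(x_l), stft_mag(y_l))
    mh = manhattan(stft_mag(x_h), stft_mag(y_h))
    nl, nh = np.linalg.norm(x_l), np.linalg.norm(x_h)
    return (nl * ml + nh * mh) / (nl + nh)

# Grid search: evaluate the average m_lh over a training corpus for each
# (lambda1, lambda2) in {0.01, ..., 0.1}^2 and keep the minimizing pair.
```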
[00101] Using the optimized parameters discussed in the previous section, de-noised LQF and HQF components may be obtained. Nevertheless, the spectrograms of these components exhibit some remaining noise still existing in the optimally de-noised components Y_L and Y_H. Low magnitude 'gaps' in the spectrograms of the clean speech components X_L and X_H may be completely filled with noise in their de-noised versions (i.e., Y_L and Y_H). Here, 'gaps' refers to low magnitude pockets surrounded by high magnitude areas. These low magnitude gaps are more distinctly visible in lower frequencies (i.e., frequencies between 0 and 2000 Hz) where most of the speech signal's energy exists. By implementing a more aggressive de-noising (i.e., choosing larger values for λ_1 or λ_2 or both), more noise will be removed and some of these gaps will appear again in the de-noised components. Nevertheless, this is only achieved at the expense of inflicting more distortion on the de-noised signal (i.e., larger M_LH values). Hence, even though more aggressively de-noised LQF and HQF components may have "gap patterns" more similar to those of the original clean speech components X_L and X_H, they are not directly usable due to the high degree of distortion. However, they potentially contain usable information about the location of the gaps in the spectrograms of X_L and X_H, which may help de-noise Y_L and Y_H one step further. [00102] In order to quantify and measure the similarity between the locations of gaps in two spectrograms, the "Gap Binary Pattern" (GBP) matrix may be defined. To create the GBP of a signal, the spectrogram of the signal is divided into non-overlapping time/frequency tiles and each tile is categorized as either a low magnitude or a high magnitude tile. Hence the GBP of a spectrogram is an N_fb × N_tf binary matrix, where N_fb is the number of frequency bins and N_tf is the number of time frames. Assuming S_X is the STFT matrix of the signal X, and T_{i,j} is a time/frequency tile of S_X which covers the area on the spectrogram containing all the frequencies between (i-1)Δf and iΔf on the frequency axis and (j-1)Δt and jΔt on the time axis, the GBP of X is defined as:
GBP_X(i,j) = 1, if mean |T_{i,j}| < α · mean |S_X|
GBP_X(i,j) = 0, if mean |T_{i,j}| ≥ α · mean |S_X|
[00103] In one particular embodiment, the following may be selected: f_s = 16000 Hz, Δt·f_s = 256, N_fb = 128, α = 0.5, Δf = 62.5 Hz.
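A sketch of the GBP computation under the parameter choices of paragraph [00103]; with Δf = 62.5 Hz and 256-sample non-overlapping frames, each time/frequency tile reduces to a single STFT cell, which this sketch assumes.

```python
import numpy as np

def gap_binary_pattern(x, frame=256, n_fb=128, alpha=0.5):
    """GBP: 1 marks a low-magnitude 'gap' tile, 0 a high-magnitude tile.
    With delta_f = fs/frame = 62.5 Hz at fs = 16 kHz, each tile is a
    single cell of a non-overlapping 256-sample STFT."""
    n = len(x) // frame
    S = np.abs(np.fft.rfft(x[:n * frame].reshape(n, frame), axis=1)).T
    S = S[:n_fb]                       # keep N_fb = 128 frequency bins
    return (S < alpha * S.mean()).astype(np.uint8)
```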
[00104] By estimating the locations of the gaps in the clean speech components, step 168 can potentially remove significant residual noise from Y_L and Y_H. If a low amplitude tile in the clean speech components X_L and X_H is categorized as high amplitude in the de-noised components Y_L and Y_H, then step 168 can conclude that this extra boost in the tile's energy is likely to originate from the noise and can be attenuated by a reduction gain. Because in reality the clean speech components X_L and X_H are not readily available, the goal is to find aggressively de-noised low and high Q-factor components (denoted by Y′_L and Y′_H) with gap locations (in lower frequencies) similar to those of the clean speech components X_L and X_H.
[00105] To find these aggressively de-noised components, we should find parameter settings that maximize the similarity between the GBPs of the de-noised and clean speech components in lower frequencies. The best metric to measure the similarity of two GBPs is Sorenson's metric, which is designed to measure the similarity between binary matrices with emphasis on ones (i.e., gaps) rather than zeros. Sorenson's metric for two binary matrices M_1 and M_2 is defined as:
SM(M_1, M_2) = 2C / (N_1 + N_2)
where C is the number of 1-1 matches (both values are 1), N_1 is the total number of 1s in the matrix M_1 and N_2 is the total number of 1s in the matrix M_2.
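Sorenson's metric reduces to a few lines over binary matrices, a minimal sketch following the definition above:

```python
import numpy as np

def sorenson(m1, m2):
    """SM(M1, M2) = 2C / (N1 + N2), emphasizing matching ones (gaps)."""
    c = np.count_nonzero(m1 & m2)            # 1-1 matches
    n1, n2 = np.count_nonzero(m1), np.count_nonzero(m2)
    return 2.0 * c / (n1 + n2) if (n1 + n2) else 1.0
```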
[00106] In this stage, two new sets of regularization parameters may be identified; one should maximize SM(G_{X_L}, G_{Y′_L}) and the other should maximize SM(G_{X_H}, G_{Y′_H}).
[00107] Two sets of regularization parameters may be numerically found which maximize the Sorenson's metrics by measuring SM(G_{X_L}, G_{Y′_L}) and SM(G_{X_H}, G_{Y′_H}) for a sufficiently large number of speech samples (n = 1000) corrupted with randomly generated multi-talker babble noise with various signal-to-noise ratios. There may be three sets of regularization parameters, as follows: λ_1 and λ_2, found by minimizing M_LH, are used to generate the optimally de-noised components Y_L and Y_H; λ′_1 and λ′_2, found by maximizing SM(G_{X_L}, G_{Y′_L}), are used to generate the aggressively de-noised component Y′_L with gap locations similar to those of X_L; and λ″_1 and λ″_2, found by maximizing SM(G_{X_H}, G_{Y′_H}), are used to find the aggressively de-noised component Y′_H with gap locations similar to those of X_H.
[00108] Because Y′_L and Y′_H have gap patterns more similar to those of X_L and X_H than Y_L and Y_H do, respectively, they can be used as templates to further clean up the optimally de-noised Y_L and Y_H. To achieve this, spectral cleaning may be performed on Y_L and Y_H based on the GBPs of the aggressively de-noised Y′_L and Y′_H. Using the time/frequency tiling, reduction gains r_L and r_H may be applied to high magnitude tiles in Y_L and Y_H with low magnitude counterparts T′_{i,j} in Y′_L and Y′_H. In some embodiments, the spectral cleaning is only performed in lower frequencies (i.e., frequencies between 0 and 2000 Hz). The reduction gains may be chosen as:
r_L(i,j) = (mean |T′_{i,j}| / mean |S_{Y′_L}|) · (mean |S_{Y_L}| / mean |T_{i,j}|)
where T_{i,j} and T′_{i,j} are time/frequency tiles in S_{Y_L} and S_{Y′_L}, respectively, and the resulting enhanced STFT matrix and its time/frequency tiles are denoted by Ŝ_{Y_L} and T̂_{i,j};
r_H(i,j) = (mean |T′_{i,j}| / mean |S_{Y′_H}|) · (mean |S_{Y_H}| / mean |T_{i,j}|)
where T_{i,j} and T′_{i,j} are time/frequency tiles in S_{Y_H} and S_{Y′_H}, respectively, and the resulting enhanced STFT matrix and its time/frequency tiles are denoted by Ŝ_{Y_H} and T̂_{i,j}.
[00109] Note that the reduction gains are chosen to decrease the normalized average magnitude of the tiles in S_{Y_L} and S_{Y_H} to the level of the normalized average magnitude of the tiles in S_{Y′_L} and S_{Y′_H}. The gaps which were filled by noise in the optimally de-noised components may be visible again after spectral cleaning. [00110] In step 170, after spectral cleaning, the enhanced low and high Q-factor components X̂_L and X̂_H can be obtained by the inverse short time Fourier transform of Ŝ_{Y_L} and Ŝ_{Y_H}, and eventually X̂, which is an estimate of the clean speech X, can be created by re-composition of X̂_L and X̂_H as:
X̂ = X̂_L + X̂_H
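The tile-wise spectral cleaning of paragraph [00108] may be sketched as follows, again assuming one-cell tiles and a 2 kHz cutoff at bin 32 (2000/62.5); S_YL, S_YpL and istft in the usage comments are hypothetical names for the optimally de-noised STFT, the aggressively de-noised STFT, and an inverse-STFT routine.

```python
import numpy as np

def spectral_clean(S, S_agg, alpha=0.5, f_max_bin=32):
    """Attenuate low-frequency tiles of the optimally de-noised STFT S that
    are high-magnitude where the aggressively de-noised STFT S_agg shows a
    gap; bin 32 ~ 2000 Hz at delta_f = 62.5 Hz, one-cell tiles assumed."""
    S = S.copy()
    mean_s, mean_a = np.abs(S).mean(), np.abs(S_agg).mean()
    gap  = np.abs(S_agg[:f_max_bin]) < alpha * mean_a    # gaps in S_agg
    high = np.abs(S[:f_max_bin]) >= alpha * mean_s       # filled in S
    mask = gap & high
    # Gain brings the tile's normalized magnitude in S down to its
    # normalized magnitude in S_agg, as in paragraph [00109].
    gain = (np.abs(S_agg[:f_max_bin]) / mean_a) / \
           (np.abs(S[:f_max_bin]) / mean_s + 1e-12)
    S[:f_max_bin][mask] *= gain[mask]
    return S

# X_hat_L = istft(spectral_clean(S_YL, S_YpL))   # istft: hypothetical
# X_hat   = X_hat_L + X_hat_H
```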
[00111] Those skilled in the art will understand that the exemplary embodiments described herein may be implemented in any number of manners, including as a separate software module, as a combination of hardware and software, etc. For example, the exemplary analysis methods may be embodied in one or more programs stored in a non-transitory storage medium and containing lines of code that, when compiled, may be executed by at least one of a plurality of processor cores or a separate processor. In some embodiments, a system comprising a plurality of processor cores and a set of instructions executing on the plurality of processor cores may be provided. The set of instructions may be operable to perform the exemplary methods discussed above. The at least one of the plurality of processor cores or a separate processor may be incorporated in or may communicate with any suitable electronic device for receiving an audio input signal and/or outputting a modified audio signal, including, for example, an audio processing device, a cochlear implant, a mobile computing device, a smart phone, a computing tablet, a computing device, etc.
[00112] Although the exemplary analysis methods described above are discussed in reference to a cochlear implant, it is contemplated that the exemplary analysis methods may be incorporated into any suitable electronic device that may require or benefit from improved audio processing, particularly noise reduction. For example, the exemplary analysis methods may be embodied in an exemplary system 200 as shown in Fig. 2. For example, an exemplary method described herein may be performed entirely or in part by a processing arrangement 210. Such a processing/computing arrangement 210 may be, e.g., entirely or a part of, or include, but not be limited to, a computer/processor that can include, e.g., one or more microprocessors, and use instructions stored on a computer-accessible medium (e.g., RAM, ROM, hard drive, or other storage device). As shown in Fig. 2, e.g., a computer-accessible medium 220 (e.g., as described herein, a storage device such as a hard disk, floppy disk, memory stick, CD-ROM, RAM, ROM, etc., or a collection thereof) can be provided (e.g., in communication with the processing arrangement 210). The computer-accessible medium 220 may be a non-transitory computer-accessible medium. The computer-accessible medium 220 can contain executable instructions 230 thereon. In addition or alternatively, a storage arrangement 240 can be provided separately from the computer-accessible medium 220, which can provide the instructions to the processing arrangement 210 so as to configure the processing arrangement to execute certain exemplary procedures, processes and methods, as described herein, for example.
[00113] System 200 may also include a receiving arrangement for receiving an input audio signal, e.g., an audio receiver or a microphone, and an outputting arrangement for outputting a de-noised audio signal, e.g., a speaker, a telephone, or a smart phone. Alternatively, the input audio signal may be a pre-recorded signal that is subsequently transmitted to the system 200 for processing. For example, an audio signal may be pre-recorded, e.g., a recording having a noisy background, particularly a multi-talker babble noisy background, that may be processed by the system 200 post-hoc. The receiving arrangement and outputting arrangement may be part of the same device, e.g., a cochlear implant, headphones, etc., or separate devices. Alternatively, the system may include a display or output device, an input device such as a keyboard, mouse, touch screen or other input device, and may be connected to additional systems via a logical network.
[00114] In one particular embodiment, the system 200 may include a smart phone with a receiving arrangement, e.g., a microphone, for detecting speech, such as a conversation from a user. The conversation from the user may be obtained from a noisy environment, particularly where there is multi-talker babble, such as in a crowded area with many others speaking in the background, e.g., in a crowded bar. The input audio signal received by the smart phone may be processed using the exemplary methods described above and a modified signal, e.g., a cleaned audio signal, where a noise portion may be reduced and/or a speech signal may be enhanced, may be transmitted via the smart phone over a communications network to a recipient. The modified signal may provide for more intelligible audio such that a smart phone user in a noisy environment may be more easily understood by the recipient, as compared to an unmodified signal. Alternatively, the input audio signal may be received by the smart phone and transmitted to an external processing unit, such as a centralized processing arrangement in a communications network. The centralized processing arrangement may process the input audio signal transmitted by the smart phone using the exemplary methods described above and forward the modified signal to the intended recipient, thereby providing a centralized processing unit for de-noising telephone calls. In some embodiments, the input audio signal may be a pre-recorded audio signal received by the system 200, and the input audio signal may be processed using the exemplary methods described above. For example, the system 200 may include a computing device, e.g., a mobile communications device, that includes instructions for processing pre-recorded input audio signals before outputting them to a user. In a further embodiment, the input audio signal may be received by the system 200 (e.g., a smart phone or other mobile communications device), in real-time or substantially in real-time, from a communications network (e.g., an input audio call from a third party received by a smart phone) and the input audio signal may be processed using the exemplary methods described above. For example, a user of the system 200, e.g., a smart phone, may receive a noisy input audio signal from another party, e.g., conversation from the other party, where the other party may be in a noisy environment, particularly where there is multi-talker babble, such as in a crowded area with many others speaking in the background, e.g., in a crowded bar. The input audio signal received via the communications network by the smart phone may be processed using the exemplary methods described above and a modified signal, e.g., a cleaned audio signal, where a noise portion may be reduced and/or a speech signal may be enhanced, may be outputted to the user, for example, as an audible sound, e.g., outputted through a speaker or any other suitable audio output device or component.
[00115] Many of the embodiments described herein may be practiced in a networked environment using logical connections to one or more remote computers having processors. Logical connections may include a local area network (LAN) and a wide area network (WAN) that are presented here by way of example and not limitation. Such networking environments are commonplace in office-wide or enterprise-wide computer networks, intranets and the Internet and may use a wide variety of different communication protocols. Those skilled in the art can appreciate that such network computing environments can typically encompass many types of computer system configurations, including personal computers, hand-held devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, and the like. Embodiments of the invention may also be practiced in distributed computing environments where tasks are performed by local and remote processing devices that are linked (either by hardwired links, wireless links, or by a combination of hardwired or wireless links) through a communications network. For example, the tasks may be performed by an external device such as a cell-phone for de-noising an input signal and then sending a modified signal from the external device to a CI device via any suitable communications network such as, for example, Bluetooth. In a distributed computing environment, program modules may be located in both local and remote memory storage devices.
EXAMPLES
Example I
[00116] The exemplary embodiment of Fig. 1a, as described above, may be evaluated by measuring a subject's understanding of IEEE standard sentences with and without processing by the exemplary method 100. Sentences may be presented against a background of 6-talker babble using four different signal-to-noise ratios (0, 3, 6, or 9 dB). In the IEEE standard sentences (also known as the "1965 Revised List of Phonetically Balanced Sentences," or Harvard Sentences) there are 72 lists of 10 sentences. To test speech intelligibility in noise, two randomly selected sentence sets (20 sentences) may be presented for each of the following 8 conditions:
1- Speech and 6 Talker Babble (SNR = 0 dB) - Unprocessed
2- Speech and 6 Talker Babble (SNR = 0 dB) - Processed
3- Speech and 6 Talker Babble (SNR = 3 dB) - Unprocessed
4- Speech and 6 Talker Babble (SNR = 3 dB) - Processed
5- Speech and 6 Talker Babble (SNR = 6 dB) - Unprocessed
6- Speech and 6 Talker Babble (SNR = 6 dB) - Processed
7- Speech and 6 Talker Babble (SNR = 9 dB) - Unprocessed
8- Speech and 6 Talker Babble (SNR = 9 dB) - Processed
[00117] In addition to the above mentioned conditions, another two sentence sets (20 sentences) may be selected for the following two additional conditions:
[00118] 9-Speech in quiet (10 Sentences)
[00119] 10-Practice with all SNRs (10 Sentences)
[00120] Each intelligibility test in Example I may include 180 sentences in total. Before processing of any audio signals, 18 sets of sentences spoken by a male speaker may be arbitrarily selected from the IEEE standard sentences. In Example I, the selected sentence sets include: 11, 16, 22, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 65, 71 and 72. Before each test, two sentence sets may be selected at random for each condition and two other sentence sets may be selected for the speech-in-quiet test and the practice session. Then a list including these 180 sentences in a completely random order may be created. Prior to the test, a practice session with ten sentences, presented at all SNRs, may be used to familiarize the subject with the test. The practice session with the subject may last for 5 to 30 minutes. After the practice session, the subjects may be tested on the various conditions. Sentences may be presented to CI subjects in free field via a single loudspeaker positioned in front of the listener at 65 dBA. Subjects may be tested using their clinically assigned speech processor. Subjects may then be asked to use their normal, everyday volume/sensitivity settings. Performance may be assessed in terms of the percentage of correctly identified words-in-sentences as a function of SNR for each subject. Each sentence may include five keywords and a number of non-keywords. Keywords may be scored 1 and non-keywords may be scored 0.5.
[00121] After completing a speech understanding test, subjects may be asked to evaluate the sound quality of the sentences using a MUSHRA (Multiple Stimuli with Hidden Reference and Anchor) scaling test. Participants may complete a total of 5 MUSHRA evaluations, one for each randomly selected sentence. Trials may be randomized among participants. Within each MUSHRA evaluation, participants may be presented with a labeled reference (clean speech) and ten versions of the same sentence presented in random order. These versions may include a "hidden reference" (i.e., identical to the labeled reference), eight different conditions (two processing conditions at 4 SNRs) and an anchor (pure 6-talker babble). Participants may be able to listen to each of these versions without limit by pressing a "Play" button or trigger within a user interface. Participants may then be instructed to listen to each stimulus at least once and provide a sound quality rating for each of the ten sentences using a 100-point scale. To rate a stimulus, participants may move an adjustable slider between 0 and 100, an example of which is shown in Fig. 3. The rating scale may be divided into five equal intervals, which may be delineated by the adjectives very poor (0-20), poor (21-40), fair (41-60), good (61-80), and excellent (81-100). Participants may be requested to rate at least one stimulus in the set a score of "100" (i.e., identical sound quality to the labeled reference). Once participants are satisfied with their ratings, they may press a "Save and proceed" button or trigger within a user interface to move to the next trial. [00122] In Example I, as a pilot test, preliminary results were collected with 5 normal hearing (NH) subjects using eight-channel noise-vocoded signals. As shown in Fig. 4a, the percentage of words correct for each unprocessed signal is shown with an open triangle symbol, and the percentage of words correct for each signal processed using the exemplary method 100 of Fig. 1a is shown with a filled-in circle symbol. Similarly, as shown in Fig. 4b, the MUSHRA score for each unprocessed signal is shown with an open triangle symbol, and the MUSHRA score for each signal processed using the exemplary method 100 of Fig. 1a is shown with a filled-in circle symbol. As can be seen in Figs. 4a and 4b, for all NH subjects, intelligibility and quality improved.
[00123] In Example I, for the main test, 7 post-lingually deafened CI subjects, as indicated below in Table 1, were tested. For all subjects, intelligibility in quiet was measured as a reference; its average was 80.81 percent.
Table 1.
*Note: For the MUSHRA test, oral data was collected from subject CI18 due to her severe visual impairment.
[00124] Fig. 5 shows word-in-sentence intelligibility in the presence of a 6-talker babble background as a function of the SNR for individual subjects. Data for each unprocessed signal are shown with an open triangle symbol, whereas data for each signal processed using the exemplary method 100 of Fig. 1a are shown with a filled-in circle symbol. Fig. 7a shows an average result for all subjects. Mean intelligibility scores, averaged across all subjects and all SNRs, increased by 17.94 percentage points. Two-way ANOVA tests revealed significant main effects of processing [F(1,6) = 128.953, p < 0.001] and noise levels [F(3,18) = 40.128, p < 0.001]. It also revealed a relatively large interaction between noise levels and algorithms [F(3,18) = 8.117, p = 0.001].
[00125] Fig. 6 shows speech quality in the presence of a 6-talker babble background as a function of the SNR for individual subjects. Data for each unprocessed signal are shown with an open triangle symbol, whereas data for each signal processed using the exemplary method 100 of Fig. 1a are shown with a filled-in circle symbol. Fig. 7b shows average results for all subjects. Mean quality scores, averaged across all subjects and all SNRs, increased by 21.18 percentage points. Two-way ANOVA tests revealed significant main effects of processing [F(1,6) = 72.676, p < 0.001] and noise levels [F(3,18) = 42.896, p < 0.001]. It also revealed no significant interaction between noise levels and algorithms [F(3,18) = 1.914, p = 0.163].
[00126] As can be seen above, the exemplary method 100 of Fig. 1a may provide significant speech understanding improvements in the presence of multi-talker babble noise for CI listeners. The exemplary method 100 performed notably better for higher signal-to-noise ratios (6 and 9 dB). This could be because of the distortion introduced to the signal by the more aggressive de-noising strategy for lower SNRs (0 and 3 dB). In Example I, subjects with higher performance in quiet also performed generally better. For the subjects with lower performance in quiet (CI05 and CI07), a floor effect may be seen. However, a ceiling effect was not observed in Example I for the subjects with higher performance in quiet.

Example II
[00127] The exemplary embodiment of Fig. 1b, as described above, may be evaluated by measuring a subject's understanding of IEEE standard sentences with and without processing by the exemplary method 150. All babble samples in Example II are randomly created by mixing sentences randomly taken from a pool of standard sentences which contains a total of 2,100 sentences (including IEEE standard sentences with male and female speakers, HINT sentences and SPIN sentences). For each babble sample, the number of talkers was randomized between 5 and 10, and the gender ratio of the talkers was also randomly selected (all female, all male, or a random combination of both).
[00128] Fig. 8 shows a Gaussian Mixture Model trained with the EM method using 100,000 randomly created noisy speech samples with SNRs ranging from -10 dB to 20 dB, as the different speech samples would be classified under step 152. A first set of curves to the right represents Gaussian distributions belonging to the class (SNR < 3.5) and a second set of curves to the left represents Gaussian distributions belonging to the class (SNR > 3.5).
[00129] To evaluate the performance of method 150, a modified version of a two-fold cross validation method may be used. First, half of the sentences in the database were used for training and the second half were used to test the classifier. Then, the sentences used for testing and training were switched (the second half of the sentences in the database for training and the first half for testing the classifier). For the classifier, the F accuracy metric is defined as follows:
F = 2C / (2C + f⁺ + f⁻)
where C, f⁺ and f⁻ are the numbers of correct, false positive and false negative detections, respectively.
[00130] The average values of the F accuracy metric were measured for three types of multi-talker babble at different SNRs. The average value of F changed only slightly with the number and the gender ratio of the talkers. The average value of F was 1 for SNRs outside the neighborhood of the border SNR between the two classes (i.e., 3.5 dB). In the vicinity of SNR = 3.5 dB some decline in the accuracy was observed. Fig. 9 shows the variation of the accuracy metric F as a function of SNR for three different multi-talker babble noises. 1,000 randomly created noisy samples were tested for each SNR.
[00131] Fig. 10 shows the frequency response and sub-band wavelets of a TQWT, e.g., as used in step 160 described above. Specifically, Fig. 10 shows the frequency response (left) and sub-band wavelets (right) of a TQWT with Q = 2, r = 3, J = 13. [00132] Table 2 shows the specific selected values for λ_1 and λ_2 in Example II as well as the other parameters for each class.
Table 2.
[00133] To validate the optimization results with other distance metrics, the normalized Manhattan distance between the spectrogram magnitudes of the clean speech and of the sum of the two de-noised components was minimized, as well as the Euclidean (ℓ2) counterpart of the distance applied to the de-noised and clean components.
[00134] The same results for λ_1 and λ_2 were achieved.
[00135] In this example, two sets of regularization parameters were found which maximize the Sorenson's metrics by measuring SM(G_{X_L}, G_{Y′_L}) and SM(G_{X_H}, G_{Y′_H}) for a sufficiently large number of speech samples (n = 1000) corrupted with randomly generated multi-talker babble noise with various signal-to-noise ratios. Three sets of regularization parameters were also identified, as follows: λ_1 and λ_2, found by minimizing M_LH, are used to generate the optimally de-noised components Y_L and Y_H; λ′_1 and λ′_2, found by maximizing SM(G_{X_L}, G_{Y′_L}), are used to generate the aggressively de-noised component Y′_L with gap locations similar to those of X_L; and λ″_1 and λ″_2, found by maximizing SM(G_{X_H}, G_{Y′_H}), are used to find the aggressively de-noised component Y′_H with gap locations similar to those of X_H. Table 3 shows the selected values for these regularization parameters for both classes.

Table 3.
[00136] Fig. 11 shows that using the selected aggressive de-noising regularization parameters leads to finding much more accurate gap patterns of the clean speech components. In particular, Fig. 11 shows the low frequency Gap Binary Patterns of X_L, X_H, Y_L, Y_H, Y′_L and Y′_H for clean/noisy speech samples. It can be seen that gaps (shown with g_1, g_2, g_3, g_4) which are filled with noise in Y_L and Y_H are visible in Y′_L and Y′_H. The corresponding Sorenson's metric values for each pair of GBPs (e.g., 0.79, 0.54 and 0.57) are reported in Fig. 11.
[00137] Fig. 12 shows the effect of each of the initial de-noising and the spectral cleaning on the weighted normalized Manhattan distance M_LH, measured on 1000 noisy speech samples corrupted with various randomly created multi-talker babbles. As can be seen, the effect of the spectral cleaning decreases with increasing SNR.
[00138] The invention described and claimed herein is not to be limited in scope by the specific embodiments herein disclosed since these embodiments are intended as illustrations of several aspects of this invention. Any equivalent embodiments are intended to be within the scope of this invention. Indeed, various modifications of the invention in addition to those shown and described herein will become apparent to those skilled in the art from the foregoing description. Such modifications are also intended to fall within the scope of the appended claims. All publications cited herein are incorporated by reference in their entirety.

Claims

What is claimed is:

1. A method for reducing noise, comprising:
receiving an input audio signal comprising a speech signal and a noise;
decomposing the input audio signal into at least two components, the at least two components comprising a first component having a low or no sustained oscillatory pattern, and a second component having a high oscillatory pattern;
de-noising the second component based on data generated from the first component to obtain a modified second component; and
outputting an audio signal having reduced noise, the output audio signal comprising the first component in combination with the modified second component.

2. The method of claim 1, wherein the outputted audio signal more closely corresponds to the speech signal than the input audio signal.
3. The method of claim 1, wherein the noise comprises a multi-talker babble noise.

4. The method of claim 1, wherein the decomposing step comprises de-noising the first and second components, and the first component is more aggressively de-noised than the second component.
5. The method of claim 1, wherein the decomposing step comprises de-noising the first and second components, the second component being more distorted than the first component.
6. The method of claim 1, wherein the decomposing step comprises a nonlinear decomposition method.

7. The method of claim 1, wherein the decomposing step comprises a morphological component analysis (MCA) method.
8. The method of claim 1, wherein the decomposing step comprises a sparse optimization wavelet method.

9. The method of claim 8, wherein the decomposing step includes determining a first
Tunable Q-Factor Wavelet Transform (TQWT) for the first component and a second TQWT for the second component.
10. The method of claim 9, wherein the first component has a low value for a Q-factor, and the second component has a high value for the Q-factor, wherein the Q-factor corresponds to a ratio of a center frequency to a bandwidth of each component.

11. The method of claim 9, wherein the decomposing step further includes a basis pursuit de-noising (BPD) method.

12. The method of claim 10, wherein the decomposing step decomposes the input audio signal into the first component, the second component, and further a residual component.

13. The method of claim 1, wherein the de-noising step comprises further modifying the second component to obtain a modified second component having a temporal and spectral pattern (TSP) corresponding to a TSP of the first component.
14. A method for improving intelligibility of speech, comprising:
obtaining, from a receiving arrangement, an input audio signal comprising a speech signal and a noise;
estimating a noise level of the input audio signal;
decomposing the input audio signal into at least two components when the estimated noise level of the input audio signal is above a predetermined threshold, the at least two components comprising a first component having a low or no sustained oscillatory pattern, and a second component having a high oscillatory pattern; de-noising the second component based on data generated from the first component to obtain a modified second component; and
outputting an audio signal having reduced noise to an output arrangement, the output audio signal comprising the first component in combination with the modified second component.
15. The method of claim 14, wherein the noise comprises a multi-talker babble noise.
16. The method of claim 14, wherein the estimating step comprises determining or estimating a signal-to-noise ratio (SNR) for the input audio signal.

17. The method of claim 14, wherein the decomposing step comprises de-noising the first and second components, and the first component is more aggressively de-noised than the second component.
18. The method of claim 14, wherein the de-noising step comprises further modifying the second component to obtain a modified second component having a temporal and spectral pattern (TSP) corresponding to a TSP of the first component.

19. A non-transitory computer readable medium storing a computer program that is executable by at least one processing unit, the computer program comprising sets of instructions for:
receiving an input audio signal comprising a speech signal and a noise;
decomposing the input audio signal into at least two components, the at least two components comprising a first component having a low or no sustained oscillatory pattern, and a second component having a high oscillatory pattern;
de-noising the second component based on data generated from the first component to obtain a modified second component; and
outputting an audio signal having reduced noise, the output audio signal comprising the first component in combination with the modified second component.

20. A system for improving intelligibility for a user, comprising:
a receiving arrangement configured to receive an input audio signal comprising a speech signal and a noise;
a processing arrangement configured to receive the input audio signal from the receiving arrangement, decompose the input audio signal into at least two components, the at least two components comprising a first component having a low or no sustained oscillatory pattern, and a second component having a high oscillatory pattern, de-noise the second component based on data generated from the first component to obtain a modified second component, and output an audio signal having reduced noise, the output audio signal comprising the first component in combination with the modified second component.
21. The system of claim 20, further comprising a cochlear implant, wherein the cochlear implant includes the receiving arrangement, and the cochlear implant is configured to generate an electrical stimulation to the user, the electrical stimulation corresponding to the output audio signal.
22. The system of claim 20, further comprising a mobile computing device, wherein the mobile computing device includes the receiving arrangement, and the mobile computing device is configured to generate an audible sound corresponding to the output audio signal.
PCT/US2017/018696 2016-02-19 2017-02-21 Method and system for multi-talker babble noise reduction using q-factor based signal decomposition WO2017143334A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US15/703,721 US10319390B2 (en) 2016-02-19 2017-09-13 Method and system for multi-talker babble noise reduction

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201662297536P 2016-02-19 2016-02-19
US62/297,536 2016-02-19

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US15/703,721 Continuation-In-Part US10319390B2 (en) 2016-02-19 2017-09-13 Method and system for multi-talker babble noise reduction

Publications (1)

Publication Number Publication Date
WO2017143334A1 true WO2017143334A1 (en) 2017-08-24

Family

ID=59625426

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2017/018696 WO2017143334A1 (en) 2016-02-19 2017-02-21 Method and system for multi-talker babble noise reduction using q-factor based signal decomposition

Country Status (1)

Country Link
WO (1) WO2017143334A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108765322A (en) * 2018-05-16 2018-11-06 上饶师范学院 Image de-noising method and device
CN113488074A (en) * 2021-08-20 2021-10-08 四川大学 Long-time variable Q time-frequency conversion algorithm of audio signal and application thereof

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5133013A (en) * 1988-01-18 1992-07-21 British Telecommunications Public Limited Company Noise reduction by using spectral decomposition and non-linear transformation
US20030023430A1 (en) * 2000-08-31 2003-01-30 Youhua Wang Speech processing device and speech processing method
US20040260540A1 (en) * 2003-06-20 2004-12-23 Tong Zhang System and method for spectrogram analysis of an audio signal
US20050049857A1 (en) * 2003-08-25 2005-03-03 Microsoft Corporation Method and apparatus using harmonic-model-based front end for robust speech recognition
US20050111683A1 (en) * 1994-07-08 2005-05-26 Brigham Young University, An Educational Institution Corporation Of Utah Hearing compensation system incorporating signal processing techniques
US20130336541A1 (en) * 2012-06-14 2013-12-19 Peter Adrian Spencer Elkington Geological log data processing methods and apparatuses
US20140321763A1 (en) * 2009-10-21 2014-10-30 Futurewei Technologies, Inc. Communication System with Compressive Sensing
US20150124560A1 (en) * 2013-11-01 2015-05-07 Conocophillips Company Compressive sensing
US20150230032A1 (en) * 2014-02-12 2015-08-13 Oticon A/S Hearing device with low-energy warning

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5133013A (en) * 1988-01-18 1992-07-21 British Telecommunications Public Limited Company Noise reduction by using spectral decomposition and non-linear transformation
US20050111683A1 (en) * 1994-07-08 2005-05-26 Brigham Young University, An Educational Institution Corporation Of Utah Hearing compensation system incorporating signal processing techniques
US20030023430A1 (en) * 2000-08-31 2003-01-30 Youhua Wang Speech processing device and speech processing method
US20040260540A1 (en) * 2003-06-20 2004-12-23 Tong Zhang System and method for spectrogram analysis of an audio signal
US20050049857A1 (en) * 2003-08-25 2005-03-03 Microsoft Corporation Method and apparatus using harmonic-model-based front end for robust speech recognition
US20140321763A1 (en) * 2009-10-21 2014-10-30 Futurewei Technologies, Inc. Communication System with Compressive Sensing
US20130336541A1 (en) * 2012-06-14 2013-12-19 Peter Adrian Spencer Elkington Geological log data processing methods and apparatuses
US20150124560A1 (en) * 2013-11-01 2015-05-07 Conocophillips Company Compressive sensing
US20150230032A1 (en) * 2014-02-12 2015-08-13 Oticon A/S Hearing device with low-energy warning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
IVAN W. SELESNICK: "Wavelet Transform with Tunable Q-Factor", IEEE , TRANSACTIONS ON SIGNAL PROCESSING, August 2011 (2011-08-01), XP011370222 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108765322A (en) * 2018-05-16 2018-11-06 上饶师范学院 Image de-noising method and device
CN108765322B (en) * 2018-05-16 2021-04-27 上饶师范学院 Image denoising method and device
CN113488074A (en) * 2021-08-20 2021-10-08 四川大学 Long-time variable Q time-frequency conversion algorithm of audio signal and application thereof
CN113488074B (en) * 2021-08-20 2023-06-23 四川大学 Two-dimensional time-frequency characteristic generation method for detecting synthesized voice

Similar Documents

Publication Publication Date Title
US10319390B2 (en) Method and system for multi-talker babble noise reduction
Das et al. Fundamentals, present and future perspectives of speech enhancement
Kim et al. An algorithm that improves speech intelligibility in noise for normal-hearing listeners
Wang Time-frequency masking for speech separation and its potential for hearing aid design
Healy et al. An algorithm to improve speech recognition in noise for hearing-impaired listeners
Vincent et al. Performance measurement in blind audio source separation
Kates et al. The hearing-aid speech perception index (HASPI) version 2
Stern et al. Hearing is believing: Biologically inspired methods for robust automatic speech recognition
EP1580730B1 (en) Isolating speech signals utilizing neural networks
Shivakumar et al. Perception optimized deep denoising autoencoders for speech enhancement.
Lai et al. Multi-objective learning based speech enhancement method to increase speech quality and intelligibility for hearing aid device users
Gopalakrishna et al. Real-time automatic tuning of noise suppression algorithms for cochlear implant applications
Monaghan et al. Auditory inspired machine learning techniques can improve speech intelligibility and quality for hearing-impaired listeners
Hummersone A psychoacoustic engineering approach to machine sound source separation in reverberant environments
Soleymani et al. SEDA: A tunable Q-factor wavelet-based noise reduction algorithm for multi-talker babble
Diehl et al. Restoring speech intelligibility for hearing aid users with deep learning
Edraki et al. Spectro-temporal modulation glimpsing for speech intelligibility prediction
JP4496378B2 (en) Restoration method of target speech based on speech segment detection under stationary noise
Patil et al. Marathi speech intelligibility enhancement using i-ams based neuro-fuzzy classifier approach for hearing aid users
WO2017143334A1 (en) Method and system for multi-talker babble noise reduction using q-factor based signal decomposition
Hossain et al. On the feasibility of using a bispectral measure as a nonintrusive predictor of speech intelligibility
Mesgarani et al. Denoising in the domain of spectrotemporal modulations
Ma et al. A modified Wiener filtering method combined with wavelet thresholding multitaper spectrum for speech enhancement
CN116312561A (en) Method, system and device for voice print recognition, authentication, noise reduction and voice enhancement of personnel in power dispatching system
Lobdell et al. Intelligibility predictors and neural representation of speech

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17754030

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17754030

Country of ref document: EP

Kind code of ref document: A1