WO2017143334A1 - Method and system for multi-talker babble noise reduction using q-factor based signal decomposition - Google Patents

Method and system for multi-talker babble noise reduction using q-factor based signal decomposition

Info

Publication number
WO2017143334A1
Authority
WO
WIPO (PCT)
Prior art keywords
component
noise
audio signal
signal
speech
Prior art date
Application number
PCT/US2017/018696
Other languages
French (fr)
Inventor
Roozbeh SOLEYMANI
Ivan W. SELESNICK
David M. LANDSBERGER
Original Assignee
New York University
Priority date
Filing date
Publication date
Application filed by New York University
Publication of WO2017143334A1
Priority to US15/703,721 (US10319390B2)

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R1/00 Details of transducers, loudspeakers or microphones
    • H04R1/10 Earpieces; Attachments therefor; Earphones; Monophonic headphones
    • H04R1/1083 Reduction of ambient noise
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R25/00 Deaf-aid sets, i.e. electro-acoustic or electro-mechanical hearing aids; Electric tinnitus maskers providing an auditory perception
    • H04R25/50 Customised settings for obtaining desired overall acoustical characteristics
    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61N ELECTROTHERAPY; MAGNETOTHERAPY; RADIATION THERAPY; ULTRASOUND THERAPY
    • A61N1/00 Electrotherapy; Circuits therefor
    • A61N1/18 Applying electric currents by contact electrodes
    • A61N1/32 Applying electric currents by contact electrodes alternating or intermittent currents
    • A61N1/36 Applying electric currents by contact electrodes alternating or intermittent currents for stimulation
    • A61N1/36036 Applying electric currents by contact electrodes alternating or intermittent currents for stimulation of the outer, middle or inner ear
    • A61N1/36038 Cochlear stimulation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L2021/02087 Noise filtering the noise being separate speech, e.g. cocktail party
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R2225/00 Details of deaf aids covered by H04R25/00, not provided for in any of its subgroups
    • H04R2225/43 Signal processing in hearing aids to enhance the speech intelligibility
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R2499/00 Aspects covered by H04R or H04S not otherwise provided for in their subgroups
    • H04R2499/10 General applications
    • H04R2499/11 Transducers incorporated or for use in hand-held devices, e.g. mobile phones, PDA's, camera's
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R3/00 Circuits for transducers, loudspeakers or microphones

Definitions

  • the present invention relates generally to a method and a system for noise reduction, such as, for example, in a cochlear implant, a telephone, an electronic communication, etc.
  • Cochlear implants may restore the ability to hear to deaf or partially deaf individuals by providing electrical stimulation to the auditory nerve via a series of electrodes placed in the cochlea.
  • CIs may successfully enable almost all post-lingually deaf users (i.e., those who lost their hearing after learning speech and language) to gain an auditory understanding of an environment and/or to restore hearing to a level suitable for understanding speech without the aid of lipreading.
  • One of the key challenges for CI users is to be able to clearly and/or intelligibly understand speech in the context of background noise.
  • Conventional CI devices have been able to aid patients to hear and ascertain speech in a quiet environment, but the performance of such devices quickly degrades in noisy environments.
  • There have been a number of attempts to isolate speech from background noise, e.g., single-channel noise reduction algorithms.
  • Typical single-channel noise reduction algorithms have included applying a gain to the noisy envelopes, pause detection and spectral subtraction, feature extraction and splitting the spectrogram into noise and speech dominated tiles.
  • speech understanding in the presence of competing talkers (i.e., speech babble noise) remains particularly difficult for such algorithms, and additional artifacts are often introduced.
  • one embodiment of the present invention provides systems and methods for reducing noise and/or improving intelligibility of an audio signal.
  • a method for reducing noise comprises a first step for receiving an input audio signal comprising a speech signal and a noise.
  • the noise may comprise a multi-talker babble noise.
  • the method also comprises a step for decomposing the input audio signal into at least two components, the at least two components comprising a first component having a low or no sustained oscillatory pattern, and a second component having a high oscillatory pattern.
  • the decomposing step comprises de-noising the first and second components, and the first component is more aggressively de-noised than the second component.
  • the decomposing step may include determining a first Tunable Q-Factor Wavelet Transform (TQWT) for the first component and a second TQWT for the second component.
  • the method also comprises a step for de-noising the second component based on data generated from the first component to obtain a modified second component.
  • the de-noising step comprises further modifying the second component to obtain a modified second component having a temporal and spectral pattern (TSP) corresponding to a TSP of the first component.
  • the method further comprises a step for outputting an audio signal having reduced noise, the output audio signal comprising the first component in combination with the modified second component.
  • the outputted audio signal may more closely correspond to the speech signal than the input audio signal.
  • a method for improving intelligibility of speech comprises a first step for obtaining, from a receiving arrangement, an input audio signal comprising a speech signal and a noise, and then a step for estimating a noise level of the input audio signal.
  • the estimating step comprises determining or estimating a signal-to-noise ratio (SNR) for the input audio signal.
  • the method also includes a step for decomposing the input audio signal into at least two components when the estimated noise level of the input audio signal is above a predetermined threshold, the at least two components comprising a first component having a low or no sustained oscillatory pattern, and a second component having a high oscillatory pattern.
  • the method also includes a step for de-noising the second component based on data generated from the first component to obtain a modified second component.
  • the method further includes a step for outputting an audio signal having reduced noise to an output arrangement, the output audio signal comprising the first component in combination with the modified second component.
  • a non-transitory computer readable medium storing a computer program that is executable by at least one processing unit.
  • the computer program comprises sets of instructions for: receiving an input audio signal comprising a speech signal and a noise; decomposing the input audio signal into at least two components, the at least two components comprising a first component having a low or no sustained oscillatory pattern, and a second component having a high oscillatory pattern; de-noising the second component based on data generated from the first component to obtain a modified second component; and outputting an audio signal having reduced noise, the output audio signal comprising the first component in combination with the modified second component.
  • a system for improving intelligibility for a user may comprise a receiving arrangement configured to receive an input audio signal comprising a speech signal and a noise.
  • the system may also include a processing arrangement configured to receive the input audio signal from the cochlear implant, decompose the input audio signal into at least two components, the at least two components comprising a first component having a low or no sustained oscillatory pattern, and a second component having a high oscillatory pattern, de-noise the second component based on data generated from the first component to obtain a modified second component, and output an audio signal having reduced noise to the cochlear implant, the output audio signal comprising the first component in combination with the modified second component.
  • the system may further comprise a cochlear implant, wherein the cochlear implant includes the receiving arrangement, and the cochlear implant is configured to generate an electrical stimulation to the user, the electrical stimulation corresponds to the output audio signal.
  • the system may further comprise a mobile computing device, wherein the mobile computing device includes the receiving arrangement, and the mobile computing device is configured to generate an audible sound corresponding to the output audio signal.
  • Fig. 1a shows an exemplary method for noise reduction, in particular, multi-talker babble noise reduction in a cochlear implant.
  • Fig. 1b shows an alternative exemplary method for noise reduction, in particular, multi-talker babble noise reduction in a cochlear implant.
  • FIG. 2 shows an exemplary computer system for performing method for noise reduction.
  • FIG. 3 shows an exemplary embodiment of a user interface for a MUSHRA (Multiple Stimuli with Hidden Reference and Anchor) evaluation.
  • Fig. 4a shows data corresponding to percentages of words correct in normal patients for input signals that are unprocessed and processed using the exemplary method of Fig. 1a.
  • Fig. 4b shows data corresponding to MUSHRA scores in normal patients for input signals that are unprocessed and processed using the exemplary method of Fig. 1a.
  • Fig. 5 shows data corresponding to percentages of words correct in CI patients for input signals that are unprocessed and processed using the exemplary method of Fig. 1a.
  • Fig. 6 shows data corresponding to MUSHRA scores in CI patients for input signals that are unprocessed and processed using the exemplary method of Fig. 1a.
  • Fig. 7a shows an average of the data corresponding to percentages of words correct in CI patients of Fig. 5.
  • Fig. 7b shows an average of the data corresponding to MUSHRA scores in CI patients of Fig. 6.
  • Fig. 8 shows a Gaussian Mixture Model of data corresponding to noisy speech samples with SNRs ranging from -10 dB to 20 dB processed using the exemplary method of Fig. 1b.
  • Fig. 9 shows data corresponding to variation of accuracy metric F as a function of
  • Fig. 10 shows data corresponding to frequency response and sub-band wavelets of a
  • Fig. 11 shows data corresponding to low frequency Gap Binary Patterns for clean/noisy speech samples processed using the exemplary method of Fig. 1b.
  • Fig. 12 shows the effect of each of initial de-noising and spectral cleaning on the weighted normalized Manhattan distance M_w, measured on noisy speech samples corrupted with various randomly created multi-talker babbles processed according to the exemplary method of Fig. 1b.
  • the present invention is directed to a method and system for multi-talker babble noise reduction.
  • the system may be used with an audio processing device, a cochlear implant, a mobile computing device, a smart phone, a computing tablet, or a computing device to improve the intelligibility of input audio signals, particularly that of speech.
  • the system may be used in a cochlear implant to improve recognition and intelligibility of speech to patients in need of hearing assistance.
  • the method and system for multi-talker babble noise reduction may utilize Q-factor based signal decomposition, which is further described below.
  • Cochlear implants may restore the ability to hear to deaf or partially deaf individuals.
  • conventional cochlear implants are often ineffective in noisy environments, because it is difficult for a user to intelligibly understand speech in the context of background noise.
  • original signals having a background of multi-talker babble noise are particularly difficult to filter and/or process to improve intelligibility to the user, because they often include background noise that does not adhere to any predictable prior pattern.
  • multi-talker babble noise tends to reflect the spontaneous speech patterns of having multiple speakers within one room, and it is therefore difficult for the user to intelligibly understand the desired speech while it is competing with simultaneous multi-talker babble noise.
  • modulation based methods may differentiate speech from noise based on temporal characteristics, including modulation depth and/or frequency, and may subsequently apply a gain reduction to the noisy signals or portions of signals, e.g., noisy envelopes.
  • spectral subtraction based methods may estimate a noise spectrum using a predetermined pattern, which may be generated based on prior knowledge (e.g., detection of prior speech patterns) or speech pause detection, and may subsequently subtract the estimated noise spectrum from a noisy speech spectrum.
  • sub-space noise reduction methods may be based on a noisy speech vector, which may be projected onto different sub-spaces for analysis, e.g., a signal sub-space and a noise sub-space.
  • the clean signal may be estimated by a sub-space noise reduction method by retaining only the components in the signal sub-space, and nullifying the components in the noise sub-space.
  • An additional example may include an envelope subtraction algorithm, which is based on the principle that the clean (noise-free) envelope may be estimated by subtracting the noise envelope, which may be separately estimated, from the noisy envelope.
  • Another example may include a method that utilizes S-shaped compression functions in place of the conventional logarithmic compression functions for noise suppression.
  • a binary masking algorithm may utilize features extracted from training data and categorize each time-frequency region of a spectrogram as speech-dominant or noise-dominant.
  • a wavelet-based noise reduction method may provide de-noising in a wavelet domain by utilizing shrinking and/or thresholding operations.
  • the exemplary embodiments described herein provide a method and system for noise reduction, particularly multi-talker babble noise reduction, that is believed to bypass the conundrum of choosing a single optimal operating point between aggressive and mild noise removal, by applying both aggressive and mild noise removal methods at the same time to benefit from the advantages, and avoid the disadvantages, of both approaches.
  • the exemplary method comprises a first step for decomposing a noisy signal into two components, which may also perform a preliminary de-noising of the signal at the same time.
  • This first step for decomposing the noisy signal into two components may utilize any suitable signal processing methods.
  • this first step may utilize one, two, or more wavelet or wavelet-like transforms and a signal decomposition method, e.g., a sparsity based signal decomposition method, optionally coupled with a de-noising optimization method.
  • this first step may utilize two Tunable Q-Factor Wavelet Transforms (TQWTs) and a sparsity based signal decomposition method coupled with applying a Basis Pursuit De-noising optimization method.
  • Wavelets, sparsity based decomposition methods and de-noising optimization methods may be highly tunable. Therefore, their parameters may be adjusted to obtain desired features in output components.
  • the output components of this first step may include two main products and a byproduct.
  • the two main products may include a Low Q-factor (LQF) component and a High Q-factor (HQF) component.
  • the byproduct may include a separated residual noise, wherein the Q-factor may be a ratio of a pulse's center frequency to its bandwidth, which is discussed further below.
  • this first step for decomposing the noisy signal may not remove all of the noise. Therefore, the method may include a second step for de-noising using information from the products obtained from the first step.
  • a method for noise reduction may comprise three different stages: (1) noise level classification, (2) signal decomposition and initial de-noising, and (3) spectral cleaning and reconstitution.
  • the first stage classifies the noise level of the noisy speech.
  • the second stage decomposes the noisy speech into two components and performs a preliminary denoising of the signal. This is achieved using two Tunable Q-factor Wavelet Transforms (TQWTs) and a sparsity-based signal decomposition algorithm, Basis Pursuit De-noising (BPD).
  • the wavelet parameters in the second stage will be set based on the results of the classification stage.
  • the output of the second stage will consist of three components.
  • the third stage further denoises the HQF and LQF components and then recombines them to produce the final de-noised output.
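  • By way of illustration, the following Python sketch mirrors this three-stage flow. It is a minimal sketch only: the noise classifier is reduced to a single SNR threshold, and a Butterworth low/high split stands in for the TQWT/BPD decomposition and for the LQF-guided spectral cleaning, which are detailed in the remainder of this disclosure.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def decompose_stub(y, fs):
    """Stand-in for stage 2: a simple low/high split used purely for structure;
    the disclosed method uses two TQWTs plus basis pursuit de-noising instead."""
    sos = butter(4, 1000.0, btype="low", fs=fs, output="sos")
    lqf = sosfilt(sos, y)
    return lqf, y - lqf

def denoise_frame(y, fs, snr_db):
    if snr_db >= 12.0:                  # stage 1: clean enough, leave untouched
        return y
    lqf, hqf = decompose_stub(y, fs)    # stage 2: decomposition (+ initial de-noising)
    hqf = 0.5 * hqf                     # stage 3 placeholder: HQF cleaned using the LQF
    return lqf + hqf                    # recombine to form the final output

y = np.random.default_rng(0).standard_normal(16000)
out = denoise_frame(y, fs=16000.0, snr_db=4.0)
```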
  • Fig. 1a illustrates an exemplary method 100 for noise reduction, in particular, multi-talker babble noise reduction in a cochlear implant.
  • the method may be used to improve recognition and intelligibility of speech to patients in need of hearing assistance.
  • Any suitable cochlear implant may be used with exemplary method 100.
  • the cochlear implant may detect an audio signal and restore a deaf or partially deaf individual's ability to hear by providing an electrical stimulation to the auditory nerve corresponding to the audio signal.
  • the input audio signal may be noisy such that it cannot be recognized or discerned by the user. Therefore, the input signal may be further processed, e.g., filtered, to improve clarity and/or intelligibility of speech to the patient.
  • a rough determination of the noise level in the input signal may be made before starting a de-noising process. In addition, the estimated level of noise present may be utilized to set wavelet and optimization parameters for subsequent de-noising of the input signal.
  • the input audio signal may be a continuous audio signal and may be broken down into predetermined segments and/or frames for processing by the exemplary method 100.
  • the input signal may include non-steady noise where the level of noise, e.g., signal to noise ratio, may change over time.
  • the signal may be separated into a plurality of frames of input signal, where each frame may be individually analyzed and/or de-noised, such as, for example, processing each individual frame using the exemplary method 100.
  • the input signal may be divided into the plurality of frames by any suitable means.
  • the exemplary method 100 may be continuously applied to each successive frame of the input signal for analysis and/or de-noising.
  • the input audio signal may be obtained and each frame of the input audio signal may be processed by the exemplary method 100 in real-time or substantially real-time, meaning within a time frame that is negligible or imperceptible to a user, for example, within less than 3 seconds, less than 1 second, or less than 0.5 seconds.
  • an input signal or a frame of an input signal may be obtained and analyzed to determine and/or estimate a level of noise present in the signal. Based on a level or an estimated level of noise present, the input signal or frame of input signal may be categorized into one of three categories: (I) the signal is either not noisy or has a negligible amount of noise 104; (II) the signal is mildly noisy 106; or (III) the signal is highly noisy 108.
  • Step 102 may estimate the noise level in an input signal or a frame of an input signal using any suitable methods, such as, for example, methods for determining and/or estimating a signal to noise ratio (SNR), which may be adjusted to estimate the noise level in a variety of noise conditions.
  • SNR signal to noise ratio
  • Any suitable SNR method may be used and may include, for example, those methods described in Hmam, H., "Approximating the SNR Value in Detection Problems," IEEE Trans. on Aerospace and Electronic Systems, vol. 39, no. 4 (2003); and Xu, H., Wei, G., & Zhu, J., "A Novel SNR Estimation Algorithm for OFDM," Vehicular Technology Conference.
  • the noise level of an input signal or a frame of an input signal may be estimated by measuring a frequency and depth of modulations in the signal, or by analyzing a portion of the input signal in silent segments in speech gaps. It is noted that step 102 may determine an SNR for an input signal or a frame of an input signal, but may alternatively provide merely an estimate, even a rough estimate, of its SNR.
  • the SNR or estimated SNR may be used to categorize the input signal or a frame of the input signal into the three different categories 104, 106, and 108. For example, Category I is for a signal that is either not noisy or includes negligible amounts of noise 104.
  • this first category 104 may include, for example, those input signals or frames of input signals that have or are estimated to have an SNR that is greater than 12 dB (SNR > 12 dB), or greater than or equal to 12 dB (SNR ≥ 12 dB).
  • the second category 106 may include, for example, those input signals or frames of input signals that have or are estimated to have an SNR that is greater than 5 dB and less than 12 dB (5 dB < SNR < 12 dB), or greater than or equal to 5 dB and less than or equal to 12 dB (5 dB ≤ SNR ≤ 12 dB).
  • the third category 108 may include, for example, those input signals or frames of input signals that have or are estimated to have an SNR that is less than 5 dB (SNR < 5 dB), or less than or equal to 5 dB (SNR ≤ 5 dB).
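  • A minimal sketch of this three-way categorization, using the thresholds quoted above (the boundary handling at exactly 5 dB and 12 dB is a design choice left open by the text):

```python
def categorize_snr(snr_db: float) -> str:
    """Map an (estimated) SNR in dB to the three categories of step 102."""
    if snr_db >= 12.0:
        return "I: not noisy / negligible noise"   # no de-noising applied
    if snr_db >= 5.0:
        return "II: mildly noisy"                  # mild de-noising
    return "III: highly noisy"                     # aggressive de-noising

print(categorize_snr(14.2))   # -> "I: not noisy / negligible noise"
```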
  • This first step 102 does not depend highly on the accuracy of the noise level estimation, e.g., the SNR estimation provided. Rather, for input signals having SNR values on or near the threshold values of 5 dB and 12 dB, categorization of such an input signal in either of the bordering categories is not expected to significantly alter the outcome of the exemplary de-noising method 100 of Fig. 1a. Therefore, estimated SNR values may be sufficient for the first step 102. In certain exemplary embodiments, estimated SNR values may be determined using a more efficient process, e.g., a method that requires less computational resources and/or time, such as by a process that requires fewer iterative steps.
  • the ratio r(s, τ(s)) may be defined as:

$$ r(s, \tau(s)) = \frac{\mathrm{rms}\big(HT(s, \tau(s))\big)}{\mathrm{rms}(s)}, \qquad \mathrm{rms}(s) = \sqrt{\tfrac{1}{N}\big(s_1^2 + s_2^2 + \cdots + s_N^2\big)} $$

  • the term HT(s, τ(s)) refers to the signal s after hard thresholding with respect to τ(s).
  • the term τ(s) may be defined such that, for speech samples that are mixed with multi-talker babble, the value of r(s, τ(s)) varies little from signal to signal for samples having a constant signal-to-noise ratio (SNR).
  • An input signal s with an unknown SNR may then be categorized into one of the three different categories 104, 106, and 108 by comparing r(s, τ(s)) against predetermined thresholds.
  • this exemplary SNR estimation method in the first step 102 need not provide accurate estimates of SNR. Rather, it serves to categorize the input signals or frames of input signals into various starting categories prior to further analysis and/or de-noising of the input signals or frames of input signals.
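  • A minimal sketch of the ratio computation defined above; the choice of the threshold τ(s) as a multiple of the signal rms is an illustrative assumption:

```python
import numpy as np

def rms(s):
    return np.sqrt(np.mean(np.square(s)))

def hard_threshold(s, tau):
    """HT(s, tau): zero every sample whose magnitude is below tau."""
    out = s.copy()
    out[np.abs(out) < tau] = 0.0
    return out

def r_ratio(s, tau):
    """r(s, tau) = rms(HT(s, tau)) / rms(s)."""
    return rms(hard_threshold(s, tau)) / rms(s)

# Example: for a fixed thresholding rule, noisier mixtures tend to shift r.
rng = np.random.default_rng(0)
s = rng.standard_normal(16000)
print(r_ratio(s, tau=0.5 * rms(s)))
```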
  • This pre-processing categorization in step 102 is particularly beneficial for input signals or frames of input signals containing multi-talker babble.
  • this first step 102 may utilize any suitable method to categorize the input signals or frames of input signals into a plurality of categories, each having a different noise level. More particularly, the first step 102 may encompass any fast and efficient method for categorizing the input signals or frames of input signals into a plurality of categories having different noise levels.
  • input signals or frames of input signals that fall within the first category 104 do not contain substantial amounts of noise. Therefore, these input signals or frames of input signals are sufficiently clean that de-noising is unnecessary.
  • the intelligibility of input signals in this first category 104 may be relatively high, therefore further de-noising of the signal may introduce distortion and/or lead to no significant intelligibility improvement. Accordingly, if the input signal or frame of input signal is determined to fall within the first category 104, the method 100 terminates without modification to the input signal or the frame of the input signal.
  • Input signals or frames of input signals that fall within the second category 106 may be de-noised in a less aggressive manner as compared to noisier signals. For input signals or frames of input signals in the second category 106, the priority is to avoid de-noising distortion rather than to remove as much noise as possible.
  • Input signals or frames of input signals that fall within the third category 108 may not be very intelligible to a CI user, and may not be intelligible at all to an average CI user. For input signals or frames of input signals in the third category 108, distortion is less of a concern compared to intelligibility. Therefore, a more aggressive de-noising of the input signal or frame of input signal may be performed on input signals of the third category 108 to increase the amount of noise removed while gaining improvements in signal intelligibility to the CI user.
  • input signals or frames of input signals that fall within either the second category 106 or the third category 108 may be further processed in step 110.
  • the input signals or frames of input signals may be decomposed into at least two components: (I) a first component 112 that exhibits no or low amounts of sustained oscillatory behavior; and (II) a second component 114 that exhibits high sustained oscillatory behavior.
  • Step 110 may optionally decompose the input signals or frames of input signals to include a third component: (III) a residual component 116 that does not fall within either component 112 or 114.
  • Step 110 may decompose the input signals or frames of input signals using any suitable methods, such as, for example, separating the signals into components having different Q-factors.
  • the Q-factor of a pulse may be defined as the ratio of its center frequency to its bandwidth, as shown in the formula below:

$$ Q = \frac{f_c}{\mathrm{BW}} $$
  • the first component 112 may correspond to a low Q-factor component and the second component 114 may correspond to a high Q-factor component.
  • the second component 114, which corresponds to a high Q-factor component, may exhibit more sustained oscillatory behavior than the first component 112, which corresponds to a low Q-factor component.
  • Suitable methods for decomposing the input signals or frames of input signals may include a sparse optimization wavelet method.
  • the sparse optimization wavelet method may decompose the input signals or frames of input signals and may also provide preliminary de- noising of the input signals or frames of input signals.
  • the sparse optimization wavelet method may utilize any suitable wavelet transform to provide a sparse representation of the input signals or frames of input signals.
  • One exemplary wavelet transform that may be utilized with a sparse optimization wavelet for decomposing the input signals or frames of input signals in step 110 may include a Tunable Q-Factor Wavelet Transform (TQWT).
  • TQWT Tunable Q-Factor Wavelet Transform
  • the TQWT may be determined based on a Q-factor, a redundancy rate and a number of stages (or levels) utilized in the sparse optimization wavelet method, each of which may be independently adjustable within the method.
  • the Q-factor may be adjusted such that the oscillatory behavior of the TQWT wavelet matches that of the input signals or frames of input signals.
  • Redundancy rate in a wavelet transform, e.g., a TQWT, may refer to the total over-sampling rate of the transform. The redundancy rate must always be greater than 1. Because the TQWT is an over-sampled wavelet transform, a given signal does not correspond to a unique set of wavelet coefficients. In other words, an inverse TQWT applied to two different sets of wavelet coefficients may yield the same signal.
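  • For orientation, the standard TQWT parameter relations from Selesnick's TQWT paper connect the Q-factor Q, redundancy r, and signal length N to the internal scaling factors and the feasible number of levels; the sketch below uses those published relations, not parameter values from this disclosure:

```python
import math

def tqwt_params(Q: float, r: float, N: int, fs: float):
    """Scaling factors, maximum level count, and approximate subband center
    frequencies for a TQWT with Q-factor Q and redundancy r (r > 1)."""
    beta = 2.0 / (Q + 1.0)             # high-pass frequency scaling factor
    alpha = 1.0 - beta / r             # low-pass frequency scaling factor
    j_max = int(math.floor(math.log(beta * N / 8.0) / math.log(1.0 / alpha)))
    fc = [alpha**j * (2.0 - beta) / (4.0 * alpha) * fs for j in range(1, j_max + 1)]
    return alpha, beta, j_max, fc

alpha, beta, j_max, fc = tqwt_params(Q=4.0, r=3.0, N=2**15, fs=16000.0)
print(j_max, [round(f) for f in fc[:3]])
```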
  • Step 110 may also provide preliminary de-noising of the input signals or frames of input signals.
  • the preliminary de-noising may be performed by a sparsity-based de-noising method, such as, for example, a sparse optimization wavelet method.
  • a sparse representation of the input signals or frames of input signals may be provided by any suitable wavelet, in particular a TQWT.
  • an optimal sparse representation of the input signals or frames of input signals may be obtained.
  • Such an optimal sparse representation may provide improved performance for related sparsity-based methods such as signal decomposition and/or de-noising.
  • a Basis Pursuit De-noising (BPD) method may be used.
  • each input signal or frame of input signal may be represented using two different components having two different Q-factors.
  • Suitable methods for decomposing the input signals or frames of input signals in step 110 may also include, for example, a Morphological Component Analysis (MCA) method.
  • MCA Morphological Component Analysis
  • the input signal or frame of input signal y may be decomposed into three components: (I) a first component 112 having a low Q-factor, $y_L$, which does not exhibit sustained oscillatory behavior; (II) a second component 114 having a high Q-factor, $y_H$, which exhibits sustained oscillatory behavior; and (III) a residual component 116, n, which includes noise and stochastic unstructured signals that cannot be sparsely represented by either of the two wavelet transforms of the first and second components 112 and 114.
  • the input signal y may be represented as follows:

$$ y = y_L + y_H + n $$
  • the decomposition of the input signal y may be a nonlinear decomposition, which cannot be achieved by any linear decomposition method in the time or frequency domain. Therefore, an MCA method may be used to obtain a sparse representation of both the first and second components 112, 114, where the optimal wavelet coefficients may be obtained using a constrained optimization method using the following formula:

$$ \{w_1^*, w_2^*\} = \arg\min_{w_1, w_2} \; \|y - \Phi_1 w_1 - \Phi_2 w_2\|_2^2 + \lambda_1 \|w_1\|_1 + \lambda_2 \|w_2\|_1 $$

where $\Phi_1$ and $\Phi_2$ denote the low and high Q-factor TQWTs, respectively.
  • $w_1$ and $w_2$ are the wavelet coefficients, across the different subbands, for the two transforms.
  • the first and second components 112 and 114, represented by $y_L$ and $y_H$, may be obtained as follows (see the solver sketch below):

$$ y_L = \Phi_1 w_1^*, \qquad y_H = \Phi_2 w_2^* $$
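  • The sketch below solves the two-dictionary BPD objective above with plain ISTA (iterative soft thresholding). As stand-ins for the two TQWTs it uses the identity (spiky, "low Q") and an orthonormal DCT (oscillatory, "high Q"), so it illustrates the optimization, not the patent's exact transforms:

```python
import numpy as np
from scipy.fft import dct, idct

def soft(w, t):
    """Soft thresholding: the proximal operator of the l1 penalty."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def bpd_two_dicts(y, lam1, lam2, n_iter=200):
    """ISTA for min ||y - Phi1 w1 - Phi2 w2||_2^2 + lam1||w1||_1 + lam2||w2||_1,
    with Phi1 = identity and Phi2 = orthonormal inverse DCT."""
    w1 = np.zeros_like(y)
    w2 = np.zeros_like(y)
    mu = 0.25                       # 1/L for two orthonormal dictionaries
    for _ in range(n_iter):
        resid = y - w1 - idct(w2, norm="ortho")    # y - Phi1 w1 - Phi2 w2
        w1 = soft(w1 + 2 * mu * resid, mu * lam1)
        w2 = soft(w2 + 2 * mu * dct(resid, norm="ortho"), mu * lam2)
    return w1, idct(w2, norm="ortho")              # y_L-like, y_H-like components

rng = np.random.default_rng(1)
n = 512
y = np.cos(2 * np.pi * 40 * np.arange(n) / n)      # sustained oscillation
y[100] += 3.0                                      # transient (low-Q) event
y += 0.1 * rng.standard_normal(n)                  # additive noise
y_l, y_h = bpd_two_dicts(y, lam1=0.5, lam2=0.2)
```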
  • the wavelet and optimization parameters may also be selected such that the first and second components 112, 114 are also preliminarily de-noised using a BPD method. In particular, the wavelet and optimization parameters may be selected such that the following conditions are met:
  • the first component 112, which is the Low Q-factor (LQF) component, has significantly lower energy than the second component 114, which is the High Q-factor (HQF) component;
  • the LQF component is de-noised more aggressively, and consequently may be more distorted; and
  • the HQF would be de-noised more mildly to reduce the amount of distortion.
  • the input signal or frame of input signal may be decomposed based on the Q-factors of the different components; input signals or frames of input signals that share similar frequency content may nonetheless correspond to different Q-factors.
  • the second component 114 may be further de-noised using the first component 112 or data generated based on the first component 112. As explained further below, the TSP of the first component 112 is expected to more closely resemble that of a clean speech signal, as compared to the second component 114. Therefore, the first component 112 may be used to further de-noise the second component 114, particularly using the TSP of the first component.
  • a clean audio signal that is not noisy may be represented by x.
  • for a clean signal, BPD is not necessary for de-noising. Therefore, decomposition of a clean input signal x may correspond to a sparse representation of two components, where $x_L$ and $x_H$ may be obtained using a constrained optimization method using the following formula:

$$ \{w_1^*, w_2^*\} = \arg\min_{w_1, w_2} \; \lambda_1 \|w_1\|_1 + \lambda_2 \|w_2\|_1 \quad \text{subject to} \quad x = \Phi_1 w_1 + \Phi_2 w_2 $$
  • Both the noisy input signal or frame of input signal y and the clean input signal x may be decomposed into LQF and HQF components as follows:

$$ y = y_L + y_H + n, \qquad x = x_L + x_H $$
  • the TSP of the LQF component $y_L$ is expected to be similar to the TSP of the LQF component $x_L$ of the clean speech signal. This similarity is particularly notable at lower frequencies, where speech fundamental frequencies are often located. Therefore, the concentrations of energy in both spectrograms are expected to follow a similar shared pattern. Gaps and speech pauses are also expected to be located in the same areas of the spectrograms and time domain graphs in both cases.
  • 'gaps' refers to empty or low energy areas in the low frequency parts of the spectrograms, or very low amplitude pauses in time domain graphs.
  • the HQF component $y_H$, which is de-noised less aggressively in step 110, is expected to be noisier and, therefore, less similar to the HQF component $x_H$ of the clean speech. Contrary to the LQF components $y_L$ and $x_L$ discussed above, where gaps can be seen in both the noisy and clean spectrograms, all low frequency gaps which can be identified in the clean signal's HQF component may be filled, typically completely filled, by noise in the HQF component $y_H$ of the input signal or frame of input signal. Although it may include more noise, the HQF component $y_H$ is expected to be less distorted, which is particularly crucial for good intelligibility to a patient.
  • because the LQF and HQF components of the clean speech are also expected to have roughly similar TSPs (at least the gaps at low frequencies in their spectrograms are roughly in the same areas), it is expected that the TSP of the HQF component $x_H$ of the clean speech also bears some similarity to the TSP of the LQF component $y_L$ obtained from the noisy input signal. This resemblance may be more pronounced in time domain graphs. The low frequency gaps in the time domain graphs may also be similar, at least compared to the noisy HQF component $y_H$.
  • In step 118, the input signal or frame of input signal y should be de-noised such that it becomes as similar as possible to the clean speech x without causing too much distortion.
  • the LQF components of clean speech and noisy speech are already similar, and therefore, only the HQF component of the noisy input signal needs to be further modified (e.g., de-noised) so that it more closely resembles the HQF component of the clean speech ($x_H$).
  • the second component 114 may be further de-noised and may be represented by $\hat{y}_H$, which corresponds to a modified version of $y_H$ having a TSP that is similar to the TSP of $x_H$, which may be represented as follows:

$$ \mathrm{TSP}(\hat{y}_H) \approx \mathrm{TSP}(x_H) $$
  • the first component 112 may correspond to $y_L$ and the second component 114 may correspond to $y_H$ in the formula shown above. Because $\mathrm{TSP}(y_L)$ is expected to be similar to $\mathrm{TSP}(x_L)$, and in the absence of a priori knowledge of $x_H$, the TSP of $y_H$ may be modified so that the modified version $\hat{y}_H$ satisfies the following condition:

$$ \mathrm{TSP}(\hat{y}_H) \approx \mathrm{TSP}(y_L) $$
  • the further de-noised $\hat{y}_H$ may then be determined on this basis.
  • step 118 may include a method which modifies the spectrogram of the second component 114, $y_H$, into a modified version of the second component, $\hat{y}_H$.
  • the method may preferably introduce the least possible amount of distortion to the resulting output, and/or may provide processing of input signals in real-time or substantially real-time as to be useful in applications such as cochlear implant devices.
  • the method for modifying the spectrogram of the second component 114, $y_H$, into the modified version $\hat{y}_H$ may include point-wise multiplication in the Fourier transform domain of non-overlapping frames of the input signal.
  • each frame of the input signal may be represented as $Y_t \in \mathbb{R}^N$, wherein N corresponds to the length of the frame.
  • the t-th non-overlapping frames of the components $y_L$ and $y_H$ may be denoted $Y_t^L$ and $Y_t^H$, respectively.
  • a Discrete Fourier Transform (DFT) may be determined for each of the above components as follows:

$$ F_t^L = \mathrm{DFT}(Y_t^L), \qquad F_t^H = \mathrm{DFT}(Y_t^H) $$
  • Each point in $F_t^L$ and $F_t^H$ may be categorized as one of the following:
  • $C_{VH}$, $C_H$, $C_L$, and $C_{VL}$ represent four different categories corresponding to: very high energy, high energy, low energy, and very low energy.
  • the above categorization may be performed using a threshold-based quantification method.
  • the TSP of $y_H$ is expected to be similar to the TSP of $y_L$ after removing the noise. Therefore, if a point demonstrates high or very high energy in $F_t^H$ but demonstrates low or very low energy in $F_t^L$, its energy in $F_t^H$ is believed to most likely be coming from a noise source and must then be attenuated.
  • each point in $F_t^H$ may be compared with its counterpart in $F_t^L$, and a reduction gain $g_r$ may be applied to the high or very high energy points in $F_t^H$ having low or very low energy counterparts in $F_t^L$, which may be represented in the following formula:

$$ \hat{F}_t^H(k) = g_r \cdot F_t^H(k) \quad \text{for each such point } k $$
  • a reduction gain may likewise be applied to low or very low energy points in $F_t^H$.
  • an inverse Discrete Fourier Transform may be applied to obtain the modified version $\hat{y}_H$ of the second component of the input signal, as follows:

$$ \hat{Y}_t^H = \mathrm{IDFT}(\hat{F}_t^H) $$
  • the first component 112 and the further filtered second component, where the second component 114 is filtered using the first component 112, may be combined to generate a filtered signal that may be outputted for use in a cochlear implant.
  • the further filtered second component $\hat{y}_H$ may be combined with the first component to create an output signal $y_o$, as follows:

$$ y_o = y_L + \hat{y}_H $$
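  • A compact sketch of this frame-wise spectral modification follows. The four energy categories are formed here with quartile thresholds and a single illustrative reduction gain, since the text leaves the exact quantification thresholds and gain values open:

```python
import numpy as np

def categorize(mag, q1, q2, q3):
    """0 = very low, 1 = low, 2 = high, 3 = very high energy."""
    return np.digitize(mag, [q1, q2, q3])

def clean_hqf(y_l, y_h, frame_len=512, gain=0.2):
    """Attenuate DFT points of the HQF component that are energetic where the
    LQF component is not (step 118), frame by non-overlapping frame."""
    y_h_mod = np.empty_like(y_h)
    tail = (len(y_h) // frame_len) * frame_len
    for start in range(0, tail, frame_len):
        Fl = np.fft.rfft(y_l[start:start + frame_len])
        Fh = np.fft.rfft(y_h[start:start + frame_len])
        ml, mh = np.abs(Fl), np.abs(Fh)
        cl = categorize(ml, *np.quantile(ml, [0.25, 0.5, 0.75]))
        ch = categorize(mh, *np.quantile(mh, [0.25, 0.5, 0.75]))
        mask = (ch >= 2) & (cl <= 1)      # energetic in HQF, quiet in LQF: likely noise
        Fh[mask] *= gain                  # apply the reduction gain
        y_h_mod[start:start + frame_len] = np.fft.irfft(Fh, n=frame_len)
    y_h_mod[tail:] = y_h[tail:]           # pass any leftover samples through
    return y_h_mod

rng = np.random.default_rng(2)
y_l, y_h = rng.standard_normal(2048), rng.standard_normal(2048)
y_o = y_l + clean_hqf(y_l, y_h)           # output: y_o = y_L + modified y_H
```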
  • Fig. 1b provides an alternative exemplary embodiment of a method 150 for noise reduction, in particular, multi-talker babble noise reduction in a cochlear implant.
  • the alternative exemplary embodiment of method 150 shown in Fig. 1b is substantially similar to the method 100 described with respect to Fig. 1a as discussed above. Differences between the two exemplary methods 100 and 150 are further detailed below.
  • an input signal or a frame of an input signal may be obtained and analyzed to determine and/or estimate a level of noise present in the signal. Based on a level or an estimated level of noise present, the input signal or frame of input signal may be categorized into one of two categories: (I) the signal is mildly noisy 154; or (II) the signal is highly noisy 156.
  • Step 152 may estimate the noise level in an input signal or a frame of an input signal using any suitable methods, such as those described above in reference to step 102 (e.g., methods for determining and/or estimating SNR).
  • the SNR or estimated SNR may be used to categorize the input signal or a frame of the input signal into two, instead of three, different categories 154 and 156.
  • Category I for a signal that is mildly noisy 154.
  • this first category 154 may include, for example, those input signals or frames of input signals that have or are estimated to have an SNR that is greater than 3.5 dB (SNR > 3.5 dB), or greater than or equal to 3.5 dB (SNR ≥ 3.5 dB).
  • the second category 156 may include, for example, those input signals or frames of input signals that have or are estimated to have an SNR that is less than 3.5 dB (SNR < 3.5 dB), or less than or equal to 3.5 dB (SNR ≤ 3.5 dB).
  • the SNR may be estimated using the exemplary SNR detection method described above in reference to step 102.
  • the SNR may be estimated using a different exemplary method. This method may provide a computationally efficient and relatively accurate way to classify the noise level of speech corrupted by multi-talker babble. To keep track of the background noise variation, longer signals may be segmented into shorter frames and each frame may be classified and de-noised separately. The length of each frame should be at least one second to ensure a high classification/de-noising performance.
  • step 152 uses two features which are sensitive to changes of the noise level in speech, easy to extract, and relatively robust across various babble noise conditions (e.g., different numbers of talkers).
  • the first feature is the envelope mean-crossing rate which is defined as the number of times that the envelope crosses its mean over a certain period of time (e.g. , one second).
  • step 152 first needs to extract the envelope of the noisy speech.
  • the envelope can be obtained, for example, as a windowed RMS:

$$ E_m = \sqrt{\tfrac{1}{l_w} \sum_{i=1}^{l_w} \big(w_i \, y_{(m-1)l_h + i}\big)^2} $$

where $l_w$ is the length of the window w and $l_h$ is the hop size.
  • the envelope mean-crossing rate of a noisy signal frame is calculated as follows:

$$ f_1 = \frac{f_s}{N} \sum_{i=1}^{l_E - 1} \frac{\big| \mathrm{sgn}(E_i - M) - \mathrm{sgn}(E_{i+1} - M) \big|}{2} $$

where E, $l_E$, and M are the envelope, its length, and its mean, respectively; N is the length of the frame; $f_s$ is the sampling rate; and $\mathrm{sgn}(x)$ is the sign function defined as:

$$ \mathrm{sgn}(x) = \begin{cases} 1, & x \geq 0 \\ -1, & x < 0 \end{cases} $$
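  • A sketch of both computations, using the windowed-RMS envelope of the previous bullet (the window and hop lengths are illustrative):

```python
import numpy as np

def envelope(y, l_w=256, l_h=128):
    """Windowed-RMS envelope with window length l_w and hop size l_h."""
    w = np.hanning(l_w)
    starts = range(0, len(y) - l_w + 1, l_h)
    return np.array([np.sqrt(np.mean((w * y[s:s + l_w])**2)) for s in starts])

def mean_crossing_rate(y, fs, l_w=256, l_h=128):
    """f1: envelope mean-crossings, normalized to crossings per second."""
    E = envelope(y, l_w, l_h)
    sgn = np.where(E >= E.mean(), 1, -1)           # sign of (E_i - M)
    crossings = np.sum(np.abs(np.diff(sgn)) // 2)  # count sign changes
    return crossings * fs / len(y)

rng = np.random.default_rng(3)
y = rng.standard_normal(16000)
print(mean_crossing_rate(y, fs=16000))
```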
  • the discriminative quality of a feature may be quantified by its Fisher score:

$$ F = \frac{\sum_k n_k (\mu_k - \mu)^2}{\sum_k n_k \sigma_k^2} $$

where $\mu_k$ is the mean of the feature values of the frames in class k, $\mu$ is the overall mean of the feature values, $\sigma_k^2$ is the variance of the feature values in class k, and $n_k$ is the total number of frames in class k.
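  • A direct transcription of that score (the standard Fisher discriminant ratio) for a scalar feature:

```python
import numpy as np

def fisher_score(values, labels):
    """F = sum_k n_k (mu_k - mu)^2 / sum_k n_k sigma_k^2 for a scalar feature."""
    values, labels = np.asarray(values, float), np.asarray(labels)
    mu = values.mean()
    num = den = 0.0
    for k in np.unique(labels):
        v_k = values[labels == k]
        num += len(v_k) * (v_k.mean() - mu) ** 2   # between-class scatter
        den += len(v_k) * v_k.var()                # within-class scatter
    return num / den

# two well-separated classes give a high score
print(fisher_score([0.1, 0.2, 0.9, 1.0], [0, 0, 1, 1]))
```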
  • this feature's Fisher score may be calculated for 10,000 labeled noisy speech frames corrupted with randomly created multi-talker babble.
  • the second feature is post-thresholding to pre-thresholding RMS ratio.
  • fc £
  • Post-thresholding to pre-thresholding RMS ratio is calculated as follows:
  • variable which determines the quality of this feature is K and this feature may be optimized by finding the value of which maximizes the Fischer score for this feature: argtu x sem
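  • A sketch of the feature and the κ grid search; it reuses fisher_score() from the sketch above, and scaling the threshold by the frame rms is an assumption:

```python
import numpy as np

def f2(y, kappa):
    """Post- to pre-thresholding RMS ratio with threshold kappa * rms(y)."""
    r = np.sqrt(np.mean(y**2))
    kept = np.where(np.abs(y) >= kappa * r, y, 0.0)   # hard thresholding HT(y, .)
    return np.sqrt(np.mean(kept**2)) / r

def best_kappa(frames, labels, grid=np.linspace(0.1, 3.0, 30)):
    """kappa* = argmax_kappa F(kappa) over a labeled training set."""
    scores = [fisher_score([f2(y, k) for y in frames], labels) for k in grid]
    return grid[int(np.argmax(scores))]
```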
  • the distribution of the resulting feature vectors F may be modeled using a Gaussian Mixture Model (GMM):

$$ p(F \mid \lambda) = \sum_{i=1}^{M} \alpha_i \, \mathcal{N}(F; \mu_i, \Sigma_i), \qquad \sum_{i=1}^{M} \alpha_i = 1 $$

where $\alpha_i$ is the weight factor, $\mu_i$ is the mean, and $\Sigma_i$ is the covariance of the i-th Gaussian distribution.
  • a Gaussian distribution $\mathcal{N}(F; \mu, \Sigma)$ can be written as:

$$ \mathcal{N}(F; \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\!\Big(-\tfrac{1}{2}(F - \mu)^{T} \Sigma^{-1} (F - \mu)\Big) $$

where d is the dimension of the feature vector F.
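  • Evaluating the mixture density is a one-liner with scipy; the weights, means, and covariances below are placeholders:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_pdf(F, weights, means, covs):
    """p(F | lambda) = sum_i alpha_i N(F; mu_i, Sigma_i), sum alpha_i = 1."""
    return sum(a * multivariate_normal.pdf(F, mean=m, cov=c)
               for a, m, c in zip(weights, means, covs))

F = np.array([0.3, 0.8])                  # feature vector [f1, f2]
weights = [0.6, 0.4]
means = [np.zeros(2), np.ones(2)]
covs = [np.eye(2), 0.5 * np.eye(2)]
print(gmm_pdf(F, weights, means, covs))
```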
  • step 152 also does not depend highly on the accuracy of the noise level estimation, e.g., the SNR estimation provided. Rather, for input signals having SNR values on or near the threshold value of 3.5 dB, categorization of such an input signal in either of the categories is not expected to significantly alter the outcome of the exemplary de-noising method 150 of Fig. 1b. Therefore, estimated SNR values may also be sufficient for step 152. In certain exemplary embodiments, estimated SNR values may be determined using a more efficient process, e.g., a method that requires less computational resources and/or time, such as by a process that requires fewer iterative steps.
  • input signals or frames of input signals that fall within the first category 154 may be de-noised in a less aggressive manner as compared to noisier signals.
  • the priority is to avoid de-noising distortion rather than to remove as much noise as possible.
  • the data samples in each of the two categories may be divided into two clusters, and each cluster may be modeled by a Gaussian model. In order to train the model, the Expectation-Maximization (EM) algorithm may be used.
  • $\alpha_1, \alpha_2, \mu_1, \mu_2, \Sigma_1, \Sigma_2$ are the GMM parameters of class 1, and $\alpha_3, \alpha_4, \mu_3, \mu_4, \Sigma_3, \Sigma_4$ are the GMM parameters of class 2.
  • the method 150 has already obtained the values of these parameters from the EM method.
  • using maximum a posteriori (MAP) classification, for each noisy speech sample with feature vector F, the two class probabilities may be obtained and the noisy sample may be classified into the class with the higher probability.
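  • A sketch of this train-then-classify flow using scikit-learn's EM-based GaussianMixture, one two-component mixture per class; the synthetic feature data and the equal class priors are assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Illustrative stand-in data: 2-D feature vectors [f1, f2] per frame.
rng = np.random.default_rng(4)
mild = rng.normal([1.0, 0.8], 0.1, size=(200, 2))   # class 1: mildly noisy
high = rng.normal([0.4, 0.3], 0.1, size=(200, 2))   # class 2: highly noisy

# One two-component GMM per class, trained with EM as described above.
gmm_mild = GaussianMixture(n_components=2, random_state=0).fit(mild)
gmm_high = GaussianMixture(n_components=2, random_state=0).fit(high)

def classify(F):
    """MAP rule with equal priors: pick the class whose GMM scores F higher."""
    return "mild" if gmm_mild.score([F]) > gmm_high.score([F]) else "high"

print(classify([0.9, 0.7]))
```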
  • in step 160, input signals or frames of input signals that fall within either the first category 154 or the second category 156 may be further processed in a similar manner as step 110 described above.
  • the input signals or frames of input signals may be decomposed into at least two components: (I) a first component 162 that exhibits no or low amounts of sustained oscillatory behavior; and (II) a second component 164 that exhibits high sustained oscillatory behavior.
  • Step 160 may optionally decompose the input signals or frames of input signals to include a third component: (III) a residual component 166 that does not fall within either components 162 or 164.
  • Step 160 may decompose the input signals or frames of input signals using any suitable methods, such as, for example, separating the signals into components having different Q-factors.
  • Step 160 may similarly provide preliminary de-noising of the input signals or frames of input signals.
  • the preliminary de-noising may be performed by a sparsity-based de-noising method, such as, for example, a sparse optimization wavelet method.
  • a sparse representation of the input signals or frames of input signals may be provided by any suitable wavelet, in particular a TQWT.
  • an optimal sparse representation of the input signals or frames of input signals may be obtained.
  • Such an optimal sparse representation may provide improved performance for related sparsity-based methods such as signal decomposition and/or de-noising.
  • a Basis Pursuit De-noising (BPD) method may be used.
  • the different HQF and LQF components may be further de-noised (e.g., by spectral cleaning) and subsequently recombined to produce the final de-noised output 170.
  • this further de-noising step 168 may include parameter optimization followed by subsequent spectral cleaning. For example, assuming that the clean speech sample X and its noisy version Y are available, they may each be decomposed into HQF and LQF components.
  • Low and high Q-factors ($Q_1$ and $Q_2$): these two parameters should be selected to match the oscillatory behavior of the speech in order to attain high sparsity and efficient subsequent de-noising.
  • $Q_1$ and $Q_2$ denote the low and high Q-factors, respectively.
  • $Q_2$ must be sufficiently larger than $Q_1$.
  • Choosing close values for $Q_1$ and $Q_2$ will lead to very similar LQF and HQF components and poor sparsification.
  • the regularization parameters $\lambda_1$ and $\lambda_2$ may be adjusted. These two parameters directly influence the effectiveness of de-noising. A larger value for either of them will lead to more aggressive de-noising of its corresponding component. More aggressive de-noising will potentially lead to more noise removal, but usually at the expense of increasing the distortion of the de-noised speech. Choosing suitable values for $\lambda_1$ and $\lambda_2$, which ensure the maximum noise removal with minimum distortion, is crucial for this stage.
  • $\lambda_1$ and $\lambda_2$ may be selected to maximize the similarity between the spectrograms of the clean speech components ($X_L$ and $X_H$) and their de-noised versions ($Y_L$ and $Y_H$).
  • the similarity may be measured with the normalized Manhattan distance applied to the magnitudes of the spectrograms, e.g., here with non-overlapping time frames $2^{16}$ samples long.
  • $M_L$ and $M_H$ may be defined as metrics to measure the STFT similarity between the low and high Q-factor components of the clean and noisy speech samples, respectively, as follows:

$$ M_L = \frac{\sum_{i,j} \big|\, |S_{X_L}(i,j)| - |S_{Y_L}(i,j)| \,\big|}{\sum_{i,j} \big| S_{X_L}(i,j) \big|}, \qquad M_H = \frac{\sum_{i,j} \big|\, |S_{X_H}(i,j)| - |S_{Y_H}(i,j)| \,\big|}{\sum_{i,j} \big| S_{X_H}(i,j) \big|} $$
  • the STFT matrix is denoted by S, and the component it belongs to is indicated by its subscript.
  • the weighted normalized Manhattan distance may be defined as follows:

$$ M_w = \alpha M_L + \beta M_H $$
  • the weighting factors α and β are selected based on the $\ell_2$-norms of their corresponding components, for example as follows:

$$ \alpha = \frac{\|X_L\|_2}{\|X_L\|_2 + \|X_H\|_2}, \qquad \beta = \frac{\|X_H\|_2}{\|X_L\|_2 + \|X_H\|_2} $$
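  • A sketch of both metrics on precomputed STFT magnitude matrices; the normalization and the energy-based weights follow the reconstruction above:

```python
import numpy as np

def norm_manhattan(S_clean, S_denoised):
    """Normalized Manhattan distance between spectrogram magnitudes."""
    A, B = np.abs(S_clean), np.abs(S_denoised)
    return np.sum(np.abs(A - B)) / np.sum(A)

def weighted_distance(Sxl, Syl, Sxh, Syh):
    """M_w = alpha * M_L + beta * M_H with energy-based weights."""
    ml = norm_manhattan(Sxl, Syl)
    mh = norm_manhattan(Sxh, Syh)
    el = np.linalg.norm(np.abs(Sxl))     # l2 energy of the clean LQF spectrogram
    eh = np.linalg.norm(np.abs(Sxh))     # l2 energy of the clean HQF spectrogram
    alpha, beta = el / (el + eh), eh / (el + eh)
    return alpha * ml + beta * mh
```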
  • de-noised LQF and HQF components may be obtained. Nevertheless, the spectrograms of these components exhibit some noise still remaining in the optimally de-noised components $Y_L$ and $Y_H$.
  • Low magnitude 'gaps' in the spectrograms of the clean speech components $X_L$ and $X_H$ may be completely filled with noise in their de-noised versions (i.e., $Y_L$ and $Y_H$). Here, 'gaps' refers to low magnitude pockets surrounded by high magnitude areas. These low magnitude gaps are more distinctly visible at lower frequencies (i.e., frequencies between 0 and 2000 Hz), where most of the speech signal's energy exists.
  • a Gap Binary Pattern (GBP) may be defined as an $N_{Fb} \times N_{tf}$ binary matrix, where $N_{Fb}$ is the number of frequency bins and $N_{tf}$ is the number of time frames, in which a value of 1 marks a low magnitude (gap) tile and a value of 0 marks a high magnitude tile.
  • step 168 can potentially remove significant residual noise from $Y_L$ and $Y_H$. If a low amplitude tile in the clean speech components $X_L$ and $X_H$ is categorized as high amplitude in the de-noised components $Y_L$ and $Y_H$, step 168 can conclude that this extra boost in the tile's energy is likely to have originated from the noise and can be attenuated by a reduction gain. Because in reality the clean speech components $X_L$ and $X_H$ are not readily available, the goal is to find aggressively de-noised low and high Q-factor components (denoted by $Y'_L$ and $Y'_H$) with gap locations (at lower frequencies) similar to those of the clean speech components $X_L$ and $X_H$.
  • the similarity of gap patterns may be measured with Sorenson's metric, which is designed to measure the similarity between binary matrices with emphasis on ones (i.e., gaps) rather than zeros. Sorenson's metric for two binary matrices $M_1$ and $M_2$ is defined as:

$$ SM(M_1, M_2) = \frac{2C}{N_1 + N_2} $$
  • C is the number of 1-1 matches (positions where both values are 1);
  • $N_1$ is the total number of 1s in the matrix $M_1$;
  • $N_2$ is the total number of 1s in the matrix $M_2$.
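  • This is the Sørensen-Dice coefficient restricted to ones; a minimal implementation:

```python
import numpy as np

def sorenson(M1, M2):
    """SM = 2C / (N1 + N2): similarity of two binary (gap) matrices,
    emphasizing matching ones."""
    M1, M2 = np.asarray(M1, bool), np.asarray(M2, bool)
    C = np.sum(M1 & M2)                  # 1-1 matches
    return 2.0 * C / (M1.sum() + M2.sum())

print(sorenson([[1, 0], [1, 1]], [[1, 0], [0, 1]]))  # -> 0.8
```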
  • $\lambda'_1$ and $Q'_1$, found by maximizing $SM(G_{X_L}, G_{Y'_L})$, are used to generate the aggressively de-noised component $Y'_L$ with gap locations similar to those of $X_L$; $\lambda'_2$ and $Q'_2$, found by maximizing $SM(G_{X_H}, G_{Y'_H})$, are used to find the aggressively de-noised component $Y'_H$ with gap locations similar to those of $X_H$ (here G denotes the GBP of the component in its subscript).
  • because $Y'_L$ and $Y'_H$ have gap patterns more similar to those of $X_L$ and $X_H$ than $Y_L$ and $Y_H$ do, respectively, they can be used as templates to further clean up the optimally de-noised $Y_L$ and $Y_H$.
  • spectral cleaning may be performed on $Y_L$ and $Y_H$ based on the GBPs of the aggressively de-noised $Y'_L$ and $Y'_H$.
  • reduction gains $r_L$ and $r_H$ may be applied to high magnitude tiles in $Y_L$ and $Y_H$ with low magnitude counterparts in $Y'_L$ and $Y'_H$.
  • the spectral cleaning is only performed at lower frequencies (i.e., frequencies between 0 and 2000 Hz).
  • $T_f$ and $T'_f$ are time/frequency tiles in $S_{Y_L}$ and $S_{Y'_L}$ respectively, and the resulting enhanced STFT matrix and its time/frequency tiles are denoted by $\hat{S}_{Y_L}$ and $\hat{T}_f$; likewise, $T_f$ and $T'_f$ are time/frequency tiles in $S_{Y_H}$ and $S_{Y'_H}$ respectively, and the resulting enhanced STFT matrix and its time/frequency tiles are denoted by $\hat{S}_{Y_H}$ and $\hat{T}_f$.
  • the reduction gains are chosen to decrease the normalized average magnitude of the tiles in $S_{Y_L}$ and $S_{Y_H}$ to the level of the normalized average magnitude of the corresponding tiles in $S_{Y'_L}$ and $S_{Y'_H}$.
  • the gaps which were filled by noise in optimally de-noised components may be visible after spectral cleaning.
  • the enhanced low and high Q-factor components $\hat{X}_L$ and $\hat{X}_H$ can be obtained by the inverse short-time Fourier transform of $\hat{S}_{Y_L}$ and $\hat{S}_{Y_H}$, and eventually $\hat{X}$, which is the de-noised estimate of the clean speech X, can be created by re-composition of $\hat{X}_L$ and $\hat{X}_H$ as:

$$ \hat{X} = \hat{X}_L + \hat{X}_H $$
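  • A sketch of the spectral cleaning applied to one component's STFT, using the aggressively de-noised version as the gap template; the quantile thresholds and the gain rule are illustrative readings of the description above:

```python
import numpy as np

def spectral_clean(S, S_aggr, fs, nfft, f_max=2000.0):
    """Attenuate low-frequency tiles of the optimally de-noised STFT S that are
    high-magnitude where the aggressively de-noised template S_aggr has a gap."""
    S = S.copy()
    k_max = int(f_max * nfft / fs)          # spectral cleaning only below 2000 Hz
    A, B = np.abs(S[:k_max]), np.abs(S_aggr[:k_max])
    gaps = B < np.quantile(B, 0.25)         # GBP of the template (1 marks a gap)
    loud = A > np.quantile(A, 0.75)         # high-magnitude tiles in S
    mask = gaps & loud                      # filled gaps: likely residual noise
    if mask.any():
        # reduction gain: bring the masked tiles' normalized average magnitude
        # down to the template's normalized average magnitude
        gain = (B[mask].mean() / B.mean()) / (A[mask].mean() / A.mean())
        S[:k_max][mask] *= np.clip(gain, 0.0, 1.0)
    return S

# usage sketch on random "STFT" magnitudes (rows = frequency bins)
rng = np.random.default_rng(5)
S = rng.rayleigh(1.0, (257, 100))
S2 = rng.rayleigh(0.5, (257, 100))
S_hat = spectral_clean(S, S2, fs=16000, nfft=512)
```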
  • the exemplary embodiments described herein may be implemented in any number of manners, including as a separate software module, as a combination of hardware and software, etc.
  • the exemplary analysis methods may be embodied in one or more programs stored in a non-transitory storage medium and containing lines of code that, when compiled, may be executed by at least one of the plurality of processor cores or a separate processor.
  • a system comprising a plurality of processor cores and a set of instructions executing on the plurality of processor cores may be provided. The set of instructions may be operable to perform the exemplary methods discussed above.
  • the at least one of the plurality of processor cores or a separate processor may be incorporated in or may communicate with any suitable electronic device for receiving audio input signal and/or outputting a modified audio signal, including, for example, an audio processing device, a cochlear implant, a mobile computing device, a smart phone, a computing tablet, a computing device, etc.
  • the exemplary analysis methods described above are discussed in reference to a cochlear implant. It is contemplated that the exemplary analysis methods may be incorporated into any suitable electronic device that may require or benefit from improved audio processing, particularly noise reduction.
  • the exemplary analysis methods may be embodied in an exemplary system 200 as shown in Fig. 2.
  • an exemplary method described herein may be performed entirely or in part, by a processing arrangement 210.
  • Such processing/computing arrangement 210 may be, e.g., entirely or a part of, or include, but not be limited to, a computer/processor that can include, e.g., one or more microprocessors, and use instructions stored on a computer-accessible medium (e.g., RAM, ROM, hard drive, or other storage device).
  • a computer-accessible medium 220 (e.g., as described herein, a storage device such as a hard disk, floppy disk, memory stick, CD-ROM, RAM, ROM, etc., or a collection thereof) may be provided. The computer-accessible medium 220 may be a non-transitory computer-readable medium that can contain executable instructions 230 thereon.
  • a storage arrangement 240 can be provided separately from the computer-accessible medium 220, which can provide the instructions to the processing arrangement 210 so as to configure the processing arrangement to execute certain exemplary procedures, processes and methods, as described herein, for example.
  • System 200 may also include a receiving arrangement for receiving an input audio signal, e.g., an audio receiver or a microphone, and an outputting arrangement for outputting a de-noised audio signal, e.g., a speaker, a telephone, or a smart phone.
  • the input audio signal may be a pre-recorded signal that is subsequently transmitted to the system 200 for processing.
  • an audio signal may be pre-recorded, e.g., a recording having a noisy background, particularly a multi-talker babble background, that may be processed by the system 200 post-hoc.
  • the receiving arrangement and outputting arrangement may be part of the same device, e.g., a cochlear implant, headphones, etc., or separate devices.
  • the system may include a display or output device, an input device such as a keyboard, mouse, touch screen or other input device, and may be connected to additional systems via a logical network.
  • the system 200 may include a smart phone with a receiving arrangement, e.g., a microphone, for detecting speech, such as a conversation from a user.
  • the conversation from the user may be obtained from a noisy environment, particularly where there is multi-talker babble, such as in a crowded area with many others speaking in the background, e.g., in a crowded bar.
  • the input audio signal received by the smart phone may be processed using the exemplary methods described above and a modified signal, e.g., a cleaned audio signal, where a noise portion may be reduced and/or a speech signal may be enhanced, may be transmitted via the smart phone over a communications network to a recipient.
  • the modified signal may provide more intelligible audio such that a smart phone user in a noisy environment may be more easily understood by the recipient, as compared to an unmodified signal.
  • the input audio signal may be received by the smart phone and transmitted to an external processing unit, such as a centralized processing arrangement in a communications network.
  • the centralized processing arrangement may process the input audio signal transmitted by the smart phone using the exemplary methods described above and forward the modified signal to the intended recipient, thereby providing a centralized processing unit for de-noising telephone calls.
  • the input audio signal may be a pre-recorded audio signal received by the system 200 and the input audio signal may be processed using the exemplary methods described above.
  • the system 200 may include a computing device, e.g., a mobile communications device, that includes instructions for processing pre-recorded input audio signals before outputting them to a user.
  • the input audio signal may be received by the system 200 (e.g., a smart phone or other mobile communications device), in real-time, or substantially in real-time from a communications network (e.g., an input audio call from a third party received by a smart phone) and the input audio signal may be processed using the exemplary methods described above.
  • a user of the system 200 may receive a noisy input audio signal from another party, e.g., conversation from the other party, where the other party may be in a noisy environment, particularly where there is multi-talker babble, such as in a crowded area with many others speaking in the background, e.g., in a crowded bar.
  • the input audio signal received via the communications network by the smart phone may be processed using the exemplary methods described above and a modified signal, e.g., a cleaned audio signal, where a noise portion may be reduced and/or a speech signal may be enhanced, may be outputted to the user, for example, as an audible sound, e.g., outputted through a speaker or any other suitable audio output device or component.
  • Logical connections may include a local area network (LAN) and a wide area network (WAN) that are presented here by way of example and not limitation.
  • Such networking environments are commonplace in office-wide or enterprise-wide computer networks, intranets and the Internet and may use a wide variety of different communication protocols.
  • Those skilled in the art can appreciate that such network computing environments can typically encompass many types of computer system configurations, including personal computers, hand-held devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, and the like.
  • Embodiments of the invention may also be practiced in distributed computing environments where tasks are performed by local and remote processing devices that are linked (either by hardwired links, wireless links, or by a combination of hardwired or wireless links) through a communications network.
  • the tasks may be performed by an external device such as a cell-phone for de-noising an input signal and then sending a modified signal from the external device to a CI device via any suitable communications network such as, for example, Bluetooth.
  • program modules may be located in both local and remote memory storage devices.
  • The exemplary embodiment of Fig. 1a, as described above, may be evaluated by measuring a subject's understanding of IEEE standard sentences with and without processing by the exemplary method 100.
  • Sentences may be presented against a background of 6-talker babble using four different signal to noise ratios (0, 3, 6, or 9 dB).
  • IEEE standard sentences (also known as the "1965 Revised List of Phonetically Balanced Sentences," or Harvard Sentences)
  • To test speech intelligibility in noise, two randomly selected sentence sets (20 sentences) may be presented for each of 8 conditions (processed and unprocessed signals at each of the four SNRs).
  • Each intelligibility test in Example I may include 180 sentences in total. Before processing of any audio signals, 18 sets of sentences spoken by a male speaker may be arbitrarily selected from the IEEE standard sentences. In Example I, the selected sentence sets include: 11, 16, 22, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 65, 71 and 72. Before each test, two sentence sets may be selected at random for each condition and two other sentence sets may be selected for the speech-in-quiet test and practice session. Then a list including these 180 sentences in a completely random order may be created. Prior to the test, a practice session with ten sentences, presented at all SNRs, may be used to familiarize the subject with the test.
  • the practice session with the subject may last for 5 to 30 minutes. After the practice session, the subjects may be tested on the various conditions. Sentences may be presented to CI subjects in free field via a single loudspeaker positioned in front of the listener at 65 dBA. Subjects may be tested using their clinically assigned speech processor. Subjects may then be asked to use their normal, everyday volume/sensitivity settings. Performance may be assessed in terms of the percent of correctly identified words in sentences as a function of SNR for each subject. Each sentence may include five keywords and a number of non-keywords. Keywords may be scored 1 and non-keywords may be scored 0.5.
  • MUSHRA (Multiple Stimuli with Hidden Reference and Anchor)
  • Participants may complete a total of 5 MUSHRA evaluations, one for each randomly selected sentence. Trials may be randomized among participants.
  • participants may be presented with a labeled reference (Clean Speech) and ten versions of the same sentence presented in random order. These versions may include a "hidden reference” (i.e., identical to the labeled reference), eight different conditions (two processing conditions in 4 SNRs) and an anchor (Pure 6-talker babble).
  • Participants may be able to listen to each of these versions without limit by pressing a "Play" button or trigger within a user interface. Participants may then be instructed to listen to each stimulus at least once and provide a sound quality rating for each of the ten sentences using a 100-point scale.
  • Participants may move an adjustable slider between 0 and 100, an example of which is shown in Fig. 3.
  • the rating scale may be divided into five equal intervals, which may be delineated by the adjectives very poor (0-20), poor (21-40), fair (41-60), good (61-80), and excellent (81-100). Participants may be requested to rate at least one stimulus in the set a score of "100" (i.e., identical sound quality to the labeled reference).
  • In Example I, as a pilot test, preliminary results were collected with 5 normal-hearing (NH) subjects using eight-channel noise-vocoded signals. As shown in Fig. 4a, the percentage of words correct for each unprocessed signal is shown with an open triangle symbol, and the percentage of words correct for each signal processed using the exemplary method 100 of Fig. 1a is shown with a filled-in circle symbol. Similarly, as shown in Fig. 4b, the MUSHRA score for each unprocessed signal is shown with an open triangle symbol, and the MUSHRA score for each signal processed using the exemplary method 100 of Fig. 1a is shown with a filled-in circle symbol. As shown in Figs. 4a and 4b, for all NH subjects, intelligibility and quality improved.
  • In Example I, for the main test, 7 post-lingually deafened CI subjects, as indicated below in Table 1, were tested. For all subjects, intelligibility in quiet was measured as a reference; its average was 80.81 percent.
  • the exemplary method 100 of Fig. la may provide significant speech understanding improvements in the presence of multi-talker babble noise in the CI listeners.
  • the exemplary method 100 performed notably better for higher signal-to-noise ratios (6 and 9 dB). This could be because of the distortion introduced to the signal by the more aggressive de-noising strategy for lower SNRs (0 and 3 dB).
  • In Example I, subjects with higher performance in quiet also generally performed better. For the subjects with lower performance in quiet (CI05 and CI07), a floor effect may be seen. However, a ceiling effect was not observed in Example I for the subjects with higher performance in quiet.
  • Example II
  • The exemplary embodiment of Fig. 1b, as described above, may be evaluated by measuring a subject's understanding of IEEE standard sentences with and without processing by the exemplary method 150. All babble samples in Example II are randomly created by mixing sentences randomly taken from a pool of standard sentences which contains a total of 2,100 sentences (including IEEE standard sentences with male and female speakers, HINT sentences and SPIN sentences). For each babble sample, the number of talkers was randomized between 5 and 10, and the gender ratio of the talkers was also randomly selected (all female, all male, or a random combination of both); a sketch of this babble-creation and SNR-mixing procedure appears after this list.
  • Fig. 8 shows a Gaussian Mixture Model, trained with the EM method on 100,000 randomly created noisy speech samples with SNRs ranging from -10 dB to 20 dB, as the different speech samples would be classified under step 152.
  • a first set of curves, to the right, represents Gaussian distributions belonging to the class (SNR < 3.5) and a second set of curves, to the left, represents Gaussian distributions belonging to the class (SNR > 3.5).
  • a modified version of a two-fold cross-validation method may be used. First, half of the sentences in the database were used for training and the second half were used to test the classifier. Then, the sentences used for testing and training were switched (the second half of the sentences in the database for training and the first half for testing the classifier). For the classifier, the F accuracy metric is defined as follows:
  • Figure 11 shows that using the selected aggressive de-noising regularization parameters will lead to finding much more accurate gap patterns of the clean speech components.
  • Figure 12 shows the effect of each of initial de-noising and spectral cleaning on the weighted normalized Manhattan distance M, measured on 1000 noisy speech samples corrupted with various randomly created multi-talker babbles. As can be seen, the effect of spectral cleaning decreases with increasing SNR.
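The babble creation and SNR mixing referenced in the Example I and Example II items above might be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code used in the experiments: the `sentence_pool` layout (lists of mono recordings keyed by speaker gender), the per-talker level equalization, and the looping of short sentences with `np.resize` are all assumptions introduced for the sketch.

```python
import random
import numpy as np

def make_babble(sentence_pool, n_samples, rng=random.Random(0)):
    """Create one multi-talker babble sample as described above: 5-10
    talkers, gender mix chosen at random (all female, all male, or mixed)."""
    n_talkers = rng.randint(5, 10)                        # inclusive bounds
    mode = rng.choice(["female", "male", "mixed"])
    babble = np.zeros(n_samples)
    for _ in range(n_talkers):
        gender = rng.choice(["male", "female"]) if mode == "mixed" else mode
        s = np.asarray(rng.choice(sentence_pool[gender]), dtype=float)
        s = np.resize(s, n_samples)                       # loop/trim to length
        babble += s / (np.sqrt(np.mean(s ** 2)) + 1e-12)  # equalize talker levels
    return babble / n_talkers

def mix_at_snr(speech, babble, snr_db):
    """Scale the babble so the mixture has the requested SNR
    (e.g., the 0, 3, 6, and 9 dB conditions of Example I)."""
    p_speech = np.mean(speech ** 2)
    p_babble = np.mean(babble ** 2)
    gain = np.sqrt(p_speech / (p_babble * 10 ** (snr_db / 10.0)))
    return speech + gain * babble
```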

Landscapes

  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Quality & Reliability (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Neurosurgery (AREA)
  • Otolaryngology (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

A system and method for improving intelligibility of speech is provided. The system and method may include obtaining an input audio signal, decomposing the audio signal into a first component having a low or no sustained oscillatory pattern and a second component having a high oscillatory pattern, further de-noising the second component based on data generated from the first component to obtain a modified second component, and outputting an audio signal having reduced noise, the output audio signal comprising the first component in combination with the modified second component.

Description

METHOD AND SYSTEM FOR MULTI-TALKER BABBLE NOISE REDUCTION USING Q-FACTOR BASED SIGNAL DECOMPOSITION
PRIORITY CLAIM
[0001] This application claims priority to U.S. Provisional Patent Application Serial No. 62/297,536 filed February 19, 2016, the entire contents of which is hereby incorporated by reference herein.
GOVERNMENT FUNDING
[0002] This invention was made with U.S. Government support under NIH Grant No. R01-DC12152. The U.S. government has certain rights in the invention.
FIELD OF INVENTION
[0003] The present invention relates generally to a method and a system for noise reduction, such as, for example, in a cochlear implant, a telephone, an electronic communication, etc.
BACKGROUND
[0004] Cochlear implants ("CIs") may restore the ability to hear to deaf or partially deaf individuals by providing electrical stimulation to the auditory nerve via a series of electrodes placed in the cochlea. CIs may successfully enable almost all post-lingually deaf users (i.e., those who lost their hearing after learning speech and language) to gain an auditory understanding of an environment and/or restore hearing to a level suitable for an individual to understand speech without the aid of lipreading.
[0005] One of the key challenges for CI users is to be able to clearly and/or intelligibly understand speech in the context of background noise. Conventional CI devices have been able to aid patients to hear and ascertain speech in a quiet environment, but the performance of such devices quickly degrades in noisy environments. There have been a number of attempts to isolate speech from background noise, e.g., single-channel noise reduction algorithms. Typical single-channel noise reduction algorithms have included applying a gain to the noisy envelopes, pause detection and spectral subtraction, feature extraction and splitting the spectrogram into noise and speech dominated tiles. However, even with these algorithms, speech understanding in the presence of competing talkers (i.e., speech babble noise) remains difficult and additional artifacts are often introduced. Furthermore, mobile communications have created an ever-rising need to be able to clearly and/or intelligibly understand speech while one user may be in a noisy environment. In particular, there is a need for improving speech understanding in telephonic communications, even in the presence of competing talkers (i.e., background speech babble noise).
[0006] Despite good progress in improving speech quality and listening ease, little progress has been made in designing algorithms that can improve speech intelligibility. Conventional methods that have been found to perform well in steady background noise generally do not perform well in non-stationary noise (e.g., multi-talker babble). For example, it is often difficult to accurately estimate the background noise spectrum. Moreover, applying noise removal methods to already noisy signals usually introduces distortion and artifacts (e.g., musical noise) to the original signal, which in many cases leads to almost no significant intelligibility improvement. All these reasons make the improvement of speech intelligibility in the presence of competing talkers a difficult problem.
SUMMARY OF THE INVENTION
[0007] In accordance with the foregoing objectives and others, one embodiment of the present invention provides systems and methods for reducing noise and/or improving intelligibility of an audio signal.
[0008] In one aspect, a method for reducing noise is provided. The method comprises a first step for receiving an input audio signal comprising a speech signal and a noise. In some embodiments, the noise may comprise a multi-talker babble noise. The method also comprises a step for decomposing the input audio signal into at least two components, the at least two components comprising a first component having a low or no sustained oscillatory pattern, and a second component having a high oscillatory pattern. In certain embodiments, the decomposing step comprises de-noising the first and second components, and the first component is more aggressively de-noised than the second component. In some embodiments, the decomposing step may include determining a first Tunable Q-Factor Wavelet Transform (TQWT) for the first component and a second TQWT for the second component. The method also comprises a step for de-noising the second component based on data generated from the first component to obtain a modified second component. In some embodiments, the de-noising step comprises further modifying the second component to obtain a modified second component having a temporal and spectral pattern (TSP) corresponding to a TSP of the first component. The method further comprises a step for outputting an audio signal having reduced noise, the output audio signal comprising the first component in combination with the modified second component. The outputted audio signal may more closely correspond to the speech signal than the input audio signal.
[0009] In another aspect, a method for improving intelligibility of speech is provided. The method comprises a first step for obtaining, from a receiving arrangement, an input audio signal comprising a speech signal and a noise, and then a step for estimating a noise level of the input audio signal. In some embodiments, the estimating step comprises determining or estimating a signal-to-noise ratio (SNR) for the input audio signal. The method also includes a step for decomposing the input audio signal into at least two components when the estimated noise level of the input audio signal is above a predetermined threshold, the at least two components comprising a first component having a low or no sustained oscillatory pattern, and a second component having a high oscillatory pattern. The method also includes a step for de-noising the second component based on data generated from the first component to obtain a modified second component. The method further includes a step for outputting an audio signal having reduced noise to an output arrangement, the output audio signal comprising the first component in combination with the modified second component.
[0010] In another aspect, a non-transitory computer readable medium storing a computer program that is executable by at least one processing unit is provided. The computer program comprises sets of instructions for: receiving an input audio signal comprising a speech signal and a noise; decomposing the input audio signal into at least two components, the at least two components comprising a first component having a low or no sustained oscillatory pattern, and a second component having a high oscillatory pattern; de-noising the second component based on data generated from the first component to obtain a modified second component; and outputting an audio signal having reduced noise, the output audio signal comprising the first component in combination with the modified second component.
[0011] In a further aspect, a system for improving intelligibility for a user is provided. The system may comprise a receiving arrangement configured to receive an input audio signal comprising a speech signal and a noise. The system may also include a processing arrangement configured to receive the input audio signal from the cochlear implant, decompose the input audio signal into at least two components, the at least two components comprising a first component having a low or no sustained oscillatory pattern, and a second component having a high oscillatory pattern, de-noise the second component based on data generated from the first component to obtain a modified second component, and output an audio signal having reduced noise to the cochlear implant, the output audio signal comprising the first component in combination with the modified second component. The system may further comprise a cochlear implant, wherein the cochlear implant includes the receiving arrangement, and the cochlear implant is configured to generate an electrical stimulation to the user, the electrical stimulation corresponding to the output audio signal. Alternatively, the system may further comprise a mobile computing device, wherein the mobile computing device includes the receiving arrangement, and the mobile computing device is configured to generate an audible sound corresponding to the output audio signal.
[0012] These and other aspects of the invention will become apparent to those skilled in the art after a reading of the following detailed description of the invention, including the figures and appended claims.
BRIEF DESCRIPTION OF THE FIGURES
[0013] Fig. 1a shows an exemplary method for noise reduction, in particular, multi-talker babble noise reduction in a cochlear implant.
[0014] Fig. 1b shows an alternative exemplary method for noise reduction, in particular, multi-talker babble noise reduction in a cochlear implant.
[0015] Fig. 2 shows an exemplary computer system for performing a method for noise reduction.
[0016] Fig. 3 shows an exemplary embodiment of a user interface for a MUSHRA (Multiple Stimuli with Hidden Reference and Anchor) evaluation.
[0017] Fig. 4a shows data corresponding to percentages of words correct in normal patients for input signals that are unprocessed and processed using the exemplary method of Fig. 1a.
[0018] Fig. 4b shows data corresponding to MUSHRA scores in normal patients for input signals that are unprocessed and processed using the exemplary method of Fig. 1a.
[0019] Fig. 5 shows data corresponding to percentages of words correct in CI patients for input signals that are unprocessed and processed using the exemplary method of Fig. 1a.
[0020] Fig. 6 shows data corresponding to MUSHRA scores in CI patients for input signals that are unprocessed and processed using the exemplary method of Fig. 1a.
[0021] Fig. 7a shows an average of the data corresponding to percentages of words correct in CI patients of Fig. 5.
[0022] Fig. 7b shows an average of the data corresponding to MUSHRA scores in CI patients of Fig. 6.
[0023] Fig. 8 shows a Gaussian Mixture model of data corresponding to noisy speech samples with SNRs ranging from -10 dB to 20 dB processed using the exemplary method of Fig. 1b.
[0024] Fig. 9 shows data corresponding to variation of the accuracy metric F as a function of SNR for three different multi-talker babble noises according to the exemplary method of Fig. 1b.
[0025] Fig. 10 shows data corresponding to the frequency response and sub-band wavelets of a TQWT according to the exemplary method of Fig. 1b.
[0026] Fig. 11 shows data corresponding to Low frequency Gap Binary Patterns for clean/noisy speech samples processed using the exemplary method of Fig. 1b.
[0027] Fig. 12 shows the effect of each of initial de-noising and spectral cleaning on the weighted normalized Manhattan distance M measured on noisy speech samples corrupted with various randomly created multi-talker babbles, processed according to the exemplary method of Fig. 1b.
DETAILED DESCRIPTION
[0028] The present invention is directed to a method and system for multi-talker babble noise reduction. The system may be used with an audio processing device, a cochlear implant, a mobile computing device, a smart phone, a computing tablet, or other computing device to improve intelligibility of input audio signals, particularly that of speech. For example, the system may be used in a cochlear implant to improve recognition and intelligibility of speech for patients in need of hearing assistance. In one particular embodiment, the method and system for multi-talker babble noise reduction may utilize Q-factor based signal decomposition, which is further described below.
[0029] Cochlear implants (CIs) may restore the ability to hear to deaf or partially deaf individuals. However, conventional cochlear implants are often ineffective in noisy environments, because it is difficult for a user to intelligibly understand speech in the context of background noise. Specifically, an original signal having a background of multi-talker babble noise is particularly difficult to filter and/or process to improve intelligibility to the user, because it often includes background noise that does not adhere to any predictable prior pattern. Rather, multi-talker babble noise tends to reflect the spontaneous speech patterns of multiple speakers within one room, and it is therefore difficult for the user to intelligibly understand the desired speech while it is competing with simultaneous multi-talker babble noise.
[0030] There are a number of different approaches to filtering and/or reducing noise in a noisy audio signal to a cochlear implant. For example, modulation-based methods may differentiate speech from noise based on temporal characteristics, including modulations of depth and/or frequency, and may subsequently apply a gain reduction to the noisy signals or portions of signals, e.g., noisy envelopes. In another example, spectral subtraction based methods may estimate a noise spectrum using a predetermined pattern, which may be generated based on prior knowledge (e.g., detection of prior speech patterns) or speech pause detection, and may subsequently subtract the estimated noise spectrum from a noisy speech spectrum. As a further example, sub-space noise reduction methods may be based on a noisy speech vector, which may be projected onto different sub-spaces for analysis, e.g., a signal sub-space and a noise sub-space. The clean signal may be estimated by a sub-space noise reduction method by retaining only the components in the signal sub-space and nullifying the components in the noise sub-space. An additional example may include an envelope subtraction algorithm, which is based on the principle that the clean (noise-free) envelope may be estimated by subtracting the noise envelope, which may be separately estimated, from the noisy envelope. Another example may include a method that utilizes S-shaped compression functions in place of the conventional logarithmic compression functions for noise suppression. In an alternative example, a binary masking algorithm may utilize features extracted from training data and categorize each time-frequency region of a spectrogram as speech-dominant or noise-dominant. In another example, a wavelet-based noise reduction method may provide de-noising in a wavelet domain by utilizing shrinking and/or thresholding operations.
[0031] Although there have been many approaches to filtering and/or reducing noise in a noisy audio signal to a cochlear implant, a dilemma remains in designing a noise reduction system and/or method: there is a tradeoff between the amount of noise reduction that can be provided and the signal distortion and/or speech distortion that may be introduced as a side-effect of the filtering and/or noise reduction processes. In particular, a more aggressive noise removal process may introduce more distortion, and therefore possibly less intelligibility in the resulting signal. Conversely, a mild approach to removing noise may result in less distortion, but the signal may retain more noise. Finding the optimal point, where both the distortion and the noise are minimized, requires careful balancing of the two factors and can be difficult. In particular, this optimal point may differ from person to person, in both normal hearing people and in CI users.
[0032] The exemplary embodiments described herein provide a method and system for noise reduction, particularly multi-talker babble noise reduction, that is believed to bypass this optimal point conundrum by applying both aggressive and mild noise removal methods at the same time, thereby benefiting from the advantages and avoiding the disadvantages of both approaches. In particular, the exemplary method comprises a first step for decomposing a noisy signal into two components, which may also perform a preliminary de-noising of the signal at the same time. This first step for decomposing the noisy signal into two components may utilize any suitable signal processing methods. For example, this first step may utilize one, two or more wavelet or wavelet-like transforms and a signal decomposition method, e.g., a sparsity based signal decomposition method, optionally coupled with a de-noising optimization method. In particular, this first step may utilize two Tunable Q-Factor Wavelet Transforms (TQWTs) and a sparsity based signal decomposition method coupled with applying a Basis Pursuit De-noising optimization method. Wavelets, sparsity based decomposition methods and de-noising optimization methods may be highly tunable. Therefore, their parameters may be adjusted to obtain desired features in the output components. The output components of this first step may include two main products and a byproduct. The two main products may include a Low Q-factor (LQF) component and a High Q-factor (HQF) component, and the byproduct may include a separated residual noise, wherein the Q-factor may be a ratio of a pulse's center frequency to its bandwidth, which is discussed further below. In the case of complex non-stationary noise, this first step for decomposing the noisy signal may not remove all of the noise. Therefore, the method may include a second step for de-noising using information from the products obtained from the first step.
[0033] Generally, a method for noise reduction, particularly multi-talker babble noise reduction, e.g., a Speech Enhancement using Decomposition Approach, iterative version (SEDA_i), may comprise three different stages: (1) noise level classification, (2) signal decomposition and initial de-noising, and (3) spectral cleaning and reconstitution. The first stage classifies the noise level of the noisy speech. The second stage decomposes the noisy speech into two components and performs a preliminary de-noising of the signal. This is achieved using two Tunable Q-factor Wavelet Transforms (TQWTs) and a sparsity-based signal decomposition algorithm, Basis Pursuit De-noising (BPD). The wavelet parameters in the second stage are set based on the results of the classification stage. The output of the second stage consists of three components: the low Q-factor (LQF) component, the high Q-factor (HQF) component and the residual. The third stage further de-noises the HQF and LQF components and then recombines them to produce the final de-noised output.
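As a minimal orientation sketch (not the disclosed implementation), the three stages might be composed as below. The stage functions are placeholder stubs standing in for the procedures detailed in the remainder of this description, so the names and their trivial bodies are assumptions for illustration only.

```python
def classify_noise_level(frame):
    """Stage 1 placeholder: returns 'clean', 'mild', or 'high'
    (the real classifiers are described below)."""
    return "mild"

def decompose_and_denoise(frame, category):
    """Stage 2 placeholder: TQWT + BPD decomposition into LQF, HQF and
    residual components (here a trivial split, for illustration only)."""
    return 0.9 * frame, 0.1 * frame, 0.0 * frame

def spectral_cleaning(lqf, hqf):
    """Stage 3 placeholder: attenuate noise-dominated points of the HQF
    component using the TSP of the LQF component."""
    return hqf

def seda_i_frame(frame):
    category = classify_noise_level(frame)
    if category == "clean":
        return frame                                 # too clean to de-noise
    lqf, hqf, _ = decompose_and_denoise(frame, category)
    return lqf + spectral_cleaning(lqf, hqf)         # reconstitution
```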
[0034] Fig. 1a illustrates an exemplary method 100 for noise reduction, in particular, multi-talker babble noise reduction in a cochlear implant. Specifically, the method may be used to improve recognition and intelligibility of speech to patients in need of hearing assistance. Any suitable cochlear implant may be used with exemplary method 100. In particular, the cochlear implant may detect an audio signal and restore a deaf or partially deaf individual's ability to hear by providing an electrical stimulation to the auditory nerve corresponding to the audio signal. However, the input audio signal may often be noisy and may not be recognized or discerned by the user. Therefore, the input signal may be further processed, e.g., filtered, to improve clarity and/or intelligibility of speech to the patient. In an exemplary embodiment, a rough determination of the noise level in the input signal may be made before starting a de-noising process. In addition, the estimated level of noise present may be utilized to set the wavelet and optimization parameters for subsequent de-noising of the input signal.
[0035] The input audio signal may be a continuous audio signal and may be broken down into predetermined segments and/or frames for processing by the exemplary method 100. In particular, in a real-time application, such as an application for improving hearing for a CI user or for improving intelligibility of audio communications on a communications device (such as a mobile communications device, a telephone, a smart phone, etc.), the input signal may include non-steady noise where the level of noise, e.g., the signal to noise ratio, may change over time. To adapt to the changing levels of noise intensity in an input signal, the signal may be separated into a plurality of frames, where each frame may be individually analyzed and/or de-noised, for example, by processing each individual frame using the exemplary method 100. The input signal may be divided into the plurality of frames by any suitable means. The exemplary method 100 may be continuously applied to each successive frame of the input signal for analysis and/or de-noising. In some embodiments, the input audio signal may be obtained and each frame of the input audio signal may be processed by the exemplary method 100 in real-time or substantially real-time, meaning within a time frame that is negligible or imperceptible to a user, for example, within less than 3 seconds, less than 1 second, or less than 0.5 seconds.
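A per-frame processing loop of the kind described above might look like the following sketch. The one-second frame length is taken from the classification discussion later in this description, and the function names are illustrative assumptions.

```python
import numpy as np

def split_into_frames(signal, fs, frame_seconds=1.0):
    """Split a long input signal into consecutive frames of about one
    second so each frame can be classified and de-noised separately."""
    n = max(1, int(frame_seconds * fs))
    return [signal[i:i + n] for i in range(0, len(signal), n)]

def process_stream(signal, fs, process_frame):
    """Apply a per-frame de-noising function and re-concatenate the result."""
    return np.concatenate([process_frame(f) for f in split_into_frames(signal, fs)])
```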
[0036] In a first step 102, an input signal or a frame of an input signal may be obtained and analyzed to determine and/or estimate a level of noise present in the signal. Based on a level or an estimated level of noise present, the input signal or frame of input signal may be categorized into one of three categories: (I) the signal is either not noisy or has negligible amount of noise 104; (II) the signal is mildly noisy 106; or (III) the signal is highly noisy 108.
[0037] Step 102 may estimate the noise level in an input signal or a frame of an input signal using any suitable methods, such as, for example, methods for determining and/or estimating a signal to noise ratio (SNR), which may be adjusted to estimate the noise level in a variety of noise conditions. Any suitable SNR method may be used and may include, for example, those methods described in Hmam, H., "Approximating the SNR Value in Detection Problems," IEEE Trans. on Aerospace and Electronic Systems, Vol. 39, No. 4 (2003); Xu, PL, Wei, G., & Zhu, J., "A Novel SNR Estimation Algorithm for OFDM," Vehicular Technology Conference, vol. 5, 3068-3071 (2005); Mian, G., & Howell, T., "Determining a signal to noise ratio for an arbitrary data sequence by a time domain analysis," IEEE Trans. Magn., Vol. 29, No. 6 (1993); Liu, X., Jia, J., & Cai, L., "SNR estimation for clipped audio based on amplitude distribution," ICNC, 1434-1438 (2013), all of which are incorporated by reference herein. However, existing SNR estimation methods do not specifically accommodate non-stationary noise and, therefore, typically suffer from some degree of error and computational cost. Alternatively, the noise level of an input signal or a frame of an input signal may be estimated by measuring the frequency and depth of modulations in the signal, or by analyzing a portion of the input signal in silent segments in speech gaps. It is noted that step 102 may determine an SNR for an input signal or a frame of an input signal, but may alternatively provide merely an estimate, even a rough estimate, of its SNR.
[0038] The SNR or estimated SNR may be used to categorize the input signal or a frame of the input signal into the three different categories 104, 106, and 108. For example, Category I is for a signal that is either not noisy or includes negligible amounts of noise 104. In particular, this first category 104 may include, for example, those input signals or frames of input signals that have or are estimated to have an SNR that is greater than 12 dB (SNR > 12 dB), or greater than or equal to 12 dB (SNR ≥ 12 dB). The second category 106 may include, for example, those input signals or frames of input signals that have or are estimated to have an SNR that is greater than 5 dB and less than 12 dB (5 dB < SNR < 12 dB), or greater than or equal to 5 dB and less than or equal to 12 dB (5 dB ≤ SNR ≤ 12 dB). The third category 108 may include, for example, those input signals or frames of input signals that have or are estimated to have an SNR that is less than 5 dB (SNR < 5 dB), or less than or equal to 5 dB (SNR ≤ 5 dB).
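Expressed as code, this categorization reduces to two threshold comparisons, as in the sketch below. The boundary handling (whether exactly 5 dB and 12 dB fall in the lower or upper category) is left open in the text, so the choice here is an assumption.

```python
def categorize_snr(snr_db):
    """Map an (estimated) SNR in dB to the three categories of step 102."""
    if snr_db >= 12:
        return 104   # not noisy / negligible noise: left unmodified
    if snr_db >= 5:
        return 106   # mildly noisy: de-noised less aggressively
    return 108       # highly noisy: de-noised more aggressively
```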
[0039] This first step 102 does not depend highly on the accuracy of the noise level estimation, e.g., the SNR estimation provided. Rather, for input signals having SNR values on or near the threshold values of 5 dB and 12 dB, categorization of such an input signal in either of the bordering categories is not expected to significantly alter the outcome of the exemplary de-noising method 100 of Fig. 1a. Therefore, estimated SNR values may be sufficient for the first step 102. In certain exemplary embodiments, estimated SNR values may be determined using a more efficient process, e.g., a method that requires less computational resources and/or time, such as a process that requires fewer iterative steps.
[0040] In one particular embodiment, the SNR may be estimated using an exemplary SNR detection method for an arbitrary signal $s$, where $s$ may be defined as $s = \{s_1, s_2, \ldots, s_n\}$. A ratio of the signal's root mean square ($s_{rms}$) after and before a thresholding with respect to $\tau(s)$ (which may be defined as $\tau(s) = \frac{3}{n}\sum_{i=1}^{n}|s_i|$) may be represented by the term $r(s, \tau(s))$. The ratio $r(s, \tau(s))$ may be defined as:

$$r(s, \tau(s)) = \frac{h(s, \tau(s))_{rms}}{s_{rms}}, \qquad s_{rms} = \sqrt{\frac{1}{n}\left(s_1^2 + s_2^2 + \cdots + s_n^2\right)}$$

and $h(s, \tau(s)) = \{h_1, h_2, \ldots, h_n\}$, where:

$$h_i = \begin{cases} s_i, & |s_i| > \tau(s) \\ 0, & |s_i| \le \tau(s) \end{cases}$$

[0041] The term $h(s, \tau(s))$ refers to the signal $s$ after hard thresholding with respect to $\tau(s)$. The term $\tau(s)$ is defined such that, for speech samples that are mixed with multi-talker babble, the value of $r(s, \tau(s))$ varies little from signal to signal for samples having a constant signal to noise ratio (SNR).

[0042] The values of $r(x_1, \tau(x_1)), r(x_2, \tau(x_2)), \ldots, r(x_N, \tau(x_N))$ for a sufficiently large number of samples, for example but not limited to $N = 200$, may be subsequently determined and averaged to obtain a threshold $R_5$:

$$R_5 = \frac{1}{N}\sum_{i=1}^{N} r(x_i, \tau(x_i))$$

wherein $x_1, x_2, \ldots, x_N$ correspond to mixtures of various speech samples taken from IEEE standard sentences (IEEE Subcommittee, 1969) and multi-talker babble with SNR = 5.

[0043] The values of $r(y_1, \tau(y_1)), r(y_2, \tau(y_2)), \ldots, r(y_N, \tau(y_N))$ may be subsequently determined and averaged in the same manner to obtain a threshold $R_{12}$, wherein $y_1, y_2, \ldots, y_N$ correspond to mixtures of various speech samples taken from IEEE standard sentences (IEEE Subcommittee, 1969) and multi-talker babble with SNR = 12.

[0044] An input signal $s$ with an unknown SNR may be categorized into one of the three different categories 104, 106, and 108 as follows:

$$C(s) \in \begin{cases} 104\ (\text{SNR} > 12), & R_{12} < r(s, \tau(s)) \\ 106\ (5 < \text{SNR} < 12), & R_5 < r(s, \tau(s)) \le R_{12} \\ 108\ (\text{SNR} < 5), & r(s, \tau(s)) \le R_5 \end{cases}$$

where $C(s)$ is the category of the signal $s$ based on its SNR.
[0046] In the exemplary embodiment shown in Fig, l a, input signals or frames of input signals that fall within the first category 104 do not contain substantial amounts of noise. Therefore, these input signals or frames of input signals are too clean to be de-noised. The intelligibility of input signals in this first category 104 may be relatively high, therefore further de-noising of the signal may introduce distortion and/or lead to no significant intelligibility improvement. Accordingly, if the input signal or frame of input signal is determined to fall within the first category 104, the method 100 terminates without modification to the input signal or the frame of the input signal.
[0047] Input signals or frames of input signals that fall within the second category 106 may be de-noised in a less aggressive manner as compared to noisier signals. For input signals or frames of input signals in the second category 106, the priority is to avoid de -noising distortion rather than to remove as much no se as possible. [0048] Input signals or frames of input signals that fall within the third category 108 may not be very intelligible to a CI user, and may not be intelligible at all to an average CI user, For input signals or frames of input signals in the third category 108, distortion is less of a concern compared to intelligibility. Therefore, a more aggressive de-noising of the input signal or frame of input signal may be performed on input signals of the third category 108 to increase the amount of noise removed while gaining improvements in signal intelligibility to the CI user.
[0049] input signals or frames of input signals that fall within either the second category 106 or the third category 108 may be further processed in step 110. In step 1 10, the input signals or frames of input signals may be decomposed into at least two components: (I) a first component 112 that exhibits no or low amounts of sustained oscillatory behavior; and (II) a second component 114 that exhibits high sustained oscillator}'' behavior. Step 1 10 may optionally decompose the input signals or frames of input signals to include a third component: (III) a residual component 116 that does not fall within either component 112 or 114. Step 110 may decompose the input signals or frames of input signals using any suitable methods, such as, for example, separating the signals into components having different Q-factors. The Q-factor of a pulse may be defined as a ratio of its center frequency to its bandwidth, as shown in the formula below:
0 = -^- .
¾ BW
[0050] For example, the first component 112 may correspond to a low Q-factor component and the second component 114 may correspond to a high Q-factor component. The second component 114, which corresponds to a high Q-factor component, may exhibit more sustained oscillatory behavior than the first component 1.12, which corresponds to a low Q-factor component.
[0051] Suitable methods for decomposing the input signals or frames of input signals may include a sparse optimization wavelet method. The sparse optimization wavelet method may decompose the input signals or frames of input signals and may also provide preliminary de- noising of the input signals or frames of input signals. The sparse optimization wavelet method may utilize any suitable wavelet transform to provide a sparse representation of the input signals or frames of input signals. One exemplary wavelet transform that may be utilized with a sparse optimization wavelet for decomposing the input signals or frames of input signals in step 100 may include a Tunable Q-Factor Wavelet Transform (TQWT). In particular, the TQWT may be determined based on a Q-factor, a redundancy rate and a number of stages (or levels) utilized in the sparse optimization wavelet method, each of which may be independently adjustable within the method. By adjusting the Q-factor, the oscillatory behavior of the TQWT may be modified. In particular, the Q-factor may be adjusted such that the oscillator 7 behavior of the TQWT wavelet matches that of the input signals or frames of input signals. Redundancy rate in a wavelet transform, e.g., a TQWT, may refer to a total over-sampling rate of the transform. The redundancy rate must be always greater than 1. Because the TQWT is an over-sampled wavelet transform, any given signal would not correspond to a unique set of wavelet coefficients. In other words, an inverse TQWT applied to two different sets of wavelet coefficients, may correspond to the same signal.
[0052] Step 110 may also provide preliminary de-noising of the input signals or frames of input signals. The preliminary de-noising may be performed by a sparsity-based de-noising method, such as, for example, a sparse optimization wavelet method. As discussed above, of the input signals or frames of input signals may be represented by any suitable wavelet, in particular TQWT. By adjusting the Q-factor, an optimal sparse representation of the input signals or frames of input signals may be obtained. Such an optimal sparse representation may provide improved performance for related sparsity-based methods such as signal decomposition and/or de-noising. To select a spare representation of the input signals or frames of input signals, a Basis Pursuit (BP) method may be used. In particular, if the input signals or frames of input signals are considered to be noisy, e.g., those falling within the third category 109, a Basis Pursuit De-noising (BPD) method may be used.
[0053] Human speech may exhibit mixture of oscillatory and non-oscillatory behaviors. These two components usually cannot be sparsely represented using only one TQWT. Therefore in step 110, each input signal or frame of input signal may be represented using two different components having two different Q-factors. Suitable methods for decomposing the input signals or frames of input signals in step 110 may also include, for example, a Morphological Component Analysis (MCA) method.
[0054] in one particular exemplary embodiment, the input signal or frame of input signal y may be decomposed into three components: (I) a first component 112 having a low Q-factor xi , which does not exhibit sustained oscillator}7 behavior; (II) a second component 114 having a High Q-factor component xs , which exhibits sustained oscillatory behavior; and (III) a residual component 116 represented by , which includes noise and stochastic unstructured signals that cannot be sparsely represented by either of the two wavelet transforms of the first and second components 112 and 114, The input signal 7 may be represented as follows:
y = Xi - x2 + n .
[0055] The decomposition of the input signal >' , as shown above, may be a nonlinear decomposition, which cannot be achieved by any linear decomposition methods in time or frequency domain. Therefore, a MCA method may be used to obtain a sparse representation of both the first and second components 112, 114, where ¾ and ¾"s may be obtained using a constrained optimization method using the following formula: argminWi ^ \ \y - 4> 1w1 - *2 1M¾ II2 + ^ l| i,j | li + ^ *½, j i 1 w2 j \ 11
j=i i= i
such that: y ' (w, ) P2 1 (w1 ) + « wherein Φ1 and Φ2 are TQWT with low and high Q-factors respectively, and A2j are subband-dependent regularizations and should be selected based on the intensity of the noise, / is the subband index and Φ^1 and ·, ' are the inverse of the first and second tunable wavelet transforms.
[0056] The above formula may be solved to obtain w j and wiJ , which are the wavelet coefficients in different subbands. Using the wavelet coefficients, wi and ws , the first and second components 112 and 114, as represented by i and xs , may be obtained as follows:
Figure imgf000017_0001
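The constrained formulation above is typically handled with an iterative shrinkage algorithm. The sketch below is a generic ISTA-style loop for the two-transform objective, not the implementation disclosed here: the `t1`/`t2` objects stand in for the low-Q and high-Q TQWTs and are a hypothetical interface (`forward` returning a list of subband coefficient arrays, `inverse` resynthesizing a signal), and treating `forward` as the adjoint of `inverse` assumes the transforms form a tight frame.

```python
import numpy as np

def soft(w, t):
    """Soft thresholding (the proximal operator of the l1 penalty)."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def dual_transform_bpd(y, t1, t2, lam1, lam2, n_iter=100, step=0.5):
    """ISTA-style sketch of the sparse decomposition above. lam1/lam2 are
    subband-dependent regularization parameters (one per subband)."""
    w1 = t1.forward(np.zeros_like(y))
    w2 = t2.forward(np.zeros_like(y))
    for _ in range(n_iter):
        resid = y - t1.inverse(w1) - t2.inverse(w2)
        g1, g2 = t1.forward(resid), t2.forward(resid)   # gradient via adjoint
        w1 = [soft(w + step * g, step * l) for w, g, l in zip(w1, g1, lam1)]
        w2 = [soft(w + step * g, step * l) for w, g, l in zip(w2, g2, lam2)]
    x1 = t1.inverse(w1)            # LQF component (first component 112)
    x2 = t2.inverse(w2)            # HQF component (second component 114)
    return x1, x2, y - x1 - x2     # residual (component 116)
```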
[0057] In one particular exemplary embodiment, the wavelet and optimization parameters may also be selected such that the first and second components 112, 114 are preliminarily de-noised using a BPD method. In particular, the wavelet and optimization parameters may be selected such that the following conditions are met:

(1) the first component 112, which is the Low Q-factor (LQF) component, has significantly lower energy than the second component 114, which is the High Q-factor (HQF) component; and

(2) the LQF component is de-noised more aggressively, and consequently may be more distorted.

[0058] Because the LQF component may be de-noised more aggressively, the HQF component would be de-noised more mildly to reduce the amount of distortion. The two conditions above allow for identification of the HQF and LQF components, which typically have relatively similar Temporal and Spectral Patterns (TSPs) when the signal is not noisy. In other words, the concentrations of energy in their spectrograms and time domain graphs are expected to be roughly in the same areas. The input signal or frame of input signal may be decomposed based on the Q-factors of the different components, and portions of the input signal that share similar frequency content may correspond to different Q-factors.
[0059] In step 118, the second component 114 may be further de-noised using the first component 112 or data generated based on the first component 112. As explained further below, the TSP of the first component 112 is expected to more closely resemble that of a clean speech signal, as compared to the second component 114. Therefore, the first component 112 may be used to further de-noise the second component 114, particularly using the TSP of the first component.

[0060] A clean audio signal that is not noisy may be represented by $X$. For a clean input signal, BPD is not necessary for de-noising the signal. Therefore, decomposition of a clean input signal $x$ may correspond to a sparse representation of two components, where $x_1$ and $x_2$ may be obtained using a constrained optimization method using the following formula:

$$\arg\min_{w_1, w_2} \; \sum_{j=1}^{J_1} \lambda_{1,j} \left\| w_{1,j} \right\|_1 + \sum_{j=1}^{J_2} \lambda_{2,j} \left\| w_{2,j} \right\|_1$$

such that: $x = \Phi_1^{-1}(w_1) + \Phi_2^{-1}(w_2)$

and: $x_1 = \Phi_1^{-1}(w_1)$, $x_2 = \Phi_2^{-1}(w_2)$

where: $x = x_1 + x_2$.
[0061] Both the noisy input signal or frame of input signal $Y$ and the clean input signal $X$ may be decomposed into HQF and LQF components as follows:

$$Y = X + N$$

wherein $X = X_L + X_H$, and

wherein $Y = Y_L + Y_H + Y_R$.

[0062] Each of the above variables is defined as follows:

$Y$: noisy speech signal
$X$: clean speech signal before adding noise
$N$: added noise
$X_L$: LQF component of the original speech signal
$X_H$: HQF component of the original speech signal
$Y_L$: LQF component of the noisy speech signal
$Y_H$: HQF component of the noisy speech signal
$Y_R$: residual component of the decomposition using BPD
[0063] Because the LQF component $Y_L$ is expected to include less noise than the HQF component $Y_H$, due to the more aggressive noise removal in step 110, the TSP of the LQF component $Y_L$ is expected to be more similar to the TSP of the LQF component $X_L$ of the clean speech signal. This similarity is particularly notable in lower frequencies, where speech fundamental frequencies are often located. Therefore, the concentrations of energy in both of their spectrograms are expected to follow a similar shared pattern. Gaps and speech pauses are also expected to be located in the same areas of the spectrograms and time domain graphs in both cases. The term gaps, as used herein, refers to empty or low energy areas in low frequency parts of the spectrograms or very low amplitude pauses in time domain graphs.
[0064] In contrast, the HQF component $Y_H$, which is de-noised less aggressively in step 110, is expected to be noisier and, therefore, less similar to the HQF component $X_H$ of the clean speech. Contrary to the LQF components $Y_L$ and $X_L$ discussed above, where gaps can be seen in both the noisy and the clean spectrograms, the low frequency gaps which can be identified in the clean signal's HQF component $X_H$ may be filled, typically completely filled, by noise in the HQF component $Y_H$ of the input signal or frame of input signal. Although it may include more noise, the HQF component $Y_H$ is expected to be less distorted, which is particularly crucial for good intelligibility to a patient. Because the LQF and HQF components of the clean speech, $X_L$ and $X_H$, are also expected to have roughly similar TSPs (at least the gaps in low frequencies in their spectrograms are roughly in the same areas), it is expected that the TSP of the HQF component $X_H$ of the clean speech also bears some similarities to the TSP of the LQF component $Y_L$ obtained from the noisy input signal. This resemblance may be more pronounced in time domain graphs. The low frequency gaps in the time domain graphs may also be similar, at least compared to the noisy HQF component $Y_H$. [0065] In step 118, the input signal or frame of input signal $Y$ should be de-noised such that it becomes as similar as possible to the clean speech $X$ without causing too much distortion. As discussed above, the LQF components of the clean speech and the noisy speech are already similar, and therefore, only the HQF component of the noisy input signal needs to be further modified (e.g., de-noised) so that it more closely resembles the HQF component of the clean speech ($X_H$).
[0066] The second component 114 may be further de-noised and may be represented by $\hat{Y}_H$, which corresponds to a modified version of $Y_H$ having a TSP that is similar to the TSP of $X_H$, which may be represented as follows:

$$P(\hat{Y}_H) \approx P(X_H)$$

where $P(\cdot)$ denotes the TSP of a signal. Specifically, the first component 112 may correspond to $Y_L$ and the second component 114 may correspond to $Y_H$ in the formula shown above. Because $P(Y_L)$ is expected to be similar to $P(X_L)$, and in the absence of a priori knowledge of $X_H$, the TSP of $Y_H$ may be modified and a modified version $\hat{Y}_H$ may be obtained to satisfy the following condition:

$$P(\hat{Y}_H) \approx P(Y_L)$$

[0067] Therefore, the further de-noised $\hat{Y}_H$ may be determined based on the following relation:

$$P(Y_L) \approx P(X_L),\;\; P(\hat{Y}_H) \approx P(X_H) \;\Rightarrow\; P(Y_L + \hat{Y}_H) \approx P(X_L + X_H) \;\Rightarrow\; P(Y_L + \hat{Y}_H) \approx P(X)$$
[0068] In another exemplary embodiment, step 118 may include a method which modifies the spectrogram of the second component 114, e.g., $Y_H$, to a modified version of the second component, e.g., $\hat{Y}_H$. In particular, the method may preferably introduce the least possible amount of distortion to the resulting output, and/or may provide processing of input signals in real-time or substantially real-time so as to be useful in applications such as cochlear implant devices. In particular, the method for modifying the spectrogram of the second component 114, e.g., $Y_H$, to a modified version of the second component, e.g., $\hat{Y}_H$, may include point-wise multiplication in the Fourier transform domain of non-overlapping frames of an input signal. In particular, each frame of the input signal may be represented as $Y_t \in \mathbb{R}^N$, wherein $N$ corresponds to the length of the frame. Each frame of the input signal may correspond to the following:

$$Y_t = Y_L + Y_H + Y_R$$
[0069] A Discrete Fourier Transform (DFT) may be determined for each of the above components as follows:

$$Y_t^f = \mathrm{DFT}(Y_t), \quad Y_L^f = \mathrm{DFT}(Y_L), \quad Y_H^f = \mathrm{DFT}(Y_H)$$

[0070] Each point $i$ in $Y_L^f$ and $Y_H^f$ may be categorized as one of the following:

$$Y^f(i) \in \{C_{VH}, C_H, C_L, C_{VL}\}$$

where $C_{VH}$, $C_H$, $C_L$, $C_{VL}$ represent four different categories corresponding to very high energy, high energy, low energy and very low energy, respectively.

[0071] The above categorization may be performed using a threshold-based quantification method. The TSP of $\hat{Y}_H$ is expected to be similar to the TSP of $Y_L$ after removing the noise. Therefore, if a point demonstrates a high or very high energy in $Y_H^f$ but demonstrates low or very low energy in $Y_L^f$, its energy in $Y_H^f$ is believed to most likely be coming from a noise source and must then be attenuated.

[0072] To estimate $\hat{Y}_H^f$, each point in $Y_H^f$ may be compared with its counterpart in $Y_L^f$, and different reduction gains $g_r$ may be applied to the high or very high energy points in $Y_H^f$ having low or very low energy counterparts in $Y_L^f$, which may be represented by the following formula:

$$\hat{Y}_H^f(i) = g_r \cdot Y_H^f(i), \qquad 0 < g_{r1} < g_{r2} \le g_{r3} < g_{r4} \le 1$$

where the particular gain $g_{r1}, \ldots, g_{r4}$ applied to a point depends on the energy categories of that point in $Y_H^f$ and $Y_L^f$. In some embodiments, a reduction gain may also be applied to low or very low energy points in $Y_H^f$. After an estimate for $\hat{Y}_H^f$ is obtained, an inverse Discrete Fourier Transform may be applied to obtain the modified version of the second component, e.g., $\hat{Y}_H$, of the input signal, as follows:

$$\hat{Y}_H = \mathrm{IDFT}(\hat{Y}_H^f)$$
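A minimal sketch of this point-wise spectral cleaning for one non-overlapping frame is given below. The quantile-based split into the four energy categories and the specific gain values are illustrative assumptions; the text only requires a threshold-based quantification and gains satisfying $0 < g_{r1} < g_{r2} \le g_{r3} < g_{r4} \le 1$.

```python
import numpy as np

def spectral_clean_frame(y_l, y_h, gains=(0.1, 0.3, 0.6, 1.0)):
    """Attenuate points that are energetic in Y_H^f but weak in Y_L^f,
    then return the modified HQF component via the inverse DFT."""
    YL, YH = np.fft.fft(y_l), np.fft.fft(y_h)

    def categories(mag):
        # threshold-based quantification: 0 = very low ... 3 = very high
        return np.digitize(mag, np.quantile(mag, [0.25, 0.5, 0.75]))

    c_l = categories(np.abs(YL))
    c_h = categories(np.abs(YH))
    g = np.ones(len(YH))
    strong_in_h = c_h >= 2                     # high / very high in Y_H^f
    g[strong_in_h & (c_l == 1)] = gains[2]     # low-energy counterpart
    g[strong_in_h & (c_l == 0)] = gains[0]     # very-low-energy counterpart
    return np.real(np.fft.ifft(g * YH))        # modified component
```

Recombining with the LQF component, as in step 120 below, is then simply `y_out = y_l + spectral_clean_frame(y_l, y_h)`.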
[0073] In step 120, the first component 112 and the further filtered second component, where the second component 114 has been filtered using the first component 112, may be combined to generate a filtered signal that may be outputted for use in a cochlear implant. In particular, the first component 112, e.g., $Y_L$, and the further filtered second component, e.g., $\hat{Y}_H$, may be combined to create an output signal, represented by $Y_{out}$, as follows:
Y0ut = YL + ¥H , which is expected to demonstrate a TSP that is similar to the TSP of clean speech,
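For concreteness, the frame-wise cleaning of paragraphs [0068]-[0073] may be sketched in Python as follows. This is a minimal illustration rather than the disclosed implementation: the threshold values, the reduction gain values, and the peak-relative categorization rule are assumptions introduced here.

```python
import numpy as np

def clean_high_q_frame(y_l, y_h, thresholds=(0.05, 0.2, 0.5),
                       gains=(0.1, 0.25, 0.5, 0.75)):
    """Attenuate DFT points of the high-Q frame whose low-Q counterparts
    carry little energy, then return the modified time-domain frame."""
    Yl = np.fft.rfft(y_l)
    Yh = np.fft.rfft(y_h)

    def categorize(Yf):
        # 0 = very low, 1 = low, 2 = high, 3 = very high energy,
        # judged against the frame's peak DFT magnitude (an assumption).
        m = np.abs(Yf) / (np.abs(Yf).max() + 1e-12)
        return np.digitize(m, thresholds)

    cat_l, cat_h = categorize(Yl), categorize(Yh)
    g = np.ones(Yh.shape)
    # (Very) energetic in Y_H but (very) quiet in Y_L -> noise-dominated:
    # stronger mismatch gets a smaller gain, 0 < g_r1 < ... < g_r4 <= 1.
    for ch, cl, gain in [(3, 0, gains[0]), (3, 1, gains[1]),
                         (2, 0, gains[2]), (2, 1, gains[3])]:
        g[(cat_h == ch) & (cat_l == cl)] = gain
    return np.fft.irfft(g * Yh, n=len(y_h))  # inverse DFT of the estimate

# Recombination as in step 120: y_out = y_l + clean_high_q_frame(y_l, y_h)
```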
[0074] Fig. 1b provides an alternative exemplary embodiment of a method 150 for noise reduction, in particular, multi-talker babble noise reduction in a cochlear implant. The alternative exemplary embodiment of method 150 shown in Fig. 1b is substantially similar to the method 100 described with respect to Fig. 1a as discussed above. Differences between the two exemplary methods 100 and 150 are further detailed below.
[0075] Similar to step 102, in a first step 152, an input signal or a frame of an input signal may be obtained and analyzed to determine and/or estimate a level of noise present in the signal. Based on a level or an estimated level of noise present, the input signal or frame of input signal may be categorized into one of two categories: (I) the signal is mildly noisy 154; or (II) the signal is highly noisy 156. Step 152 may estimate the noise level in an input signal or a frame of an input signal using any suitable methods, such as those described above in reference to step 102 (e.g., methods for determining and/or estimating SNR).
[0076] In step 152, the SNR or estimated SNR may be used to categorize the input signal or a frame of the input signal into two instead of three different categories 154 and 156. For example, Category I is for a signal that is mildly noisy 154. In particular, this first category 154 may include, for example, those input signals or frames of input signals that have or are estimated to have an SNR that is greater than 3.5 dB (SNR > 3.5 dB), or greater than or equal to 3.5 dB (SNR ≥ 3.5 dB). The second category 156 may include, for example, those input signals or frames of input signals that have or are estimated to have an SNR that is less than 3.5 dB (SNR < 3.5 dB), or less than or equal to 3.5 dB (SNR ≤ 3.5 dB).
[0077] In one embodiment, the SNR may be estimated using the exemplary SNR detection method described above in reference to step 102. In another embodiment, the SNR may be estimated using a different exemplary method. This method may provide a computationally efficient and relatively accurate way to classify the noise level of speech corrupted by multi-talker babble. To keep track of the background noise variation, longer signals may be segmented into shorter frames and each frame may be classified and de-noised separately. The length of each frame should be at least one second to ensure a high classification/de-noising performance. In this embodiment, step 152 uses two features which are sensitive to changes of the noise level in speech, easy to extract, and relatively robust under various babble noise conditions (i.e., different numbers of talkers, etc.).
[0078] The first feature is the envelope mean-crossing rate, which is defined as the number of times that the envelope crosses its mean over a certain period of time (e.g., one second). To compute this feature, step 152 first needs to extract the envelope of the noisy speech. For a noisy speech frame Y, the envelope can be obtained as follows:
E(i) = sqrt( (1/l) · Σ_{n=1}^{l} Y((i-1)·l_h + n)² )
where l is the length of the window (w) and l_h is the hop size. The envelope mean-crossing rate of a noisy signal frame is calculated as follows:
f_1 = (f_s / N) · Σ_{i=1}^{l_E - 1} |S(E(i+1) - M) - S(E(i) - M)| / 2
where E, l_E and M are the envelope, its length and its mean, respectively, N is the length of the frame, f_s is the sampling rate and S(x) is the sign function defined as:
S(x) = 1 if x ≥ 0; S(x) = -1 if x < 0.
[0079] Note that for this feature we have used rectangular windows, hence l_h = 1. [0080] The main parameter that affects this feature is the length of the window (l). This feature may be optimized by finding the value of l ∈ N which maximizes the feature's Fisher score:
l* = argmax_{l ∈ N} [ Σ_{k=1}^{C_n} n_k(μ_k - μ)² / Σ_{k=1}^{C_n} n_k σ_k² ]
where C_n = 2 is the number of classes, μ_k is the mean of the f_1 values of frames in class k, μ is the overall mean of the f_1 values, σ_k² is the variance of the f_1 values in class k and n_k is the total number of frames in class k.
[0081] To numerically solve the above, this feature's Fisher score may be calculated for 10,000 labeled noisy speech frames corrupted with randomly created multi-talker babble. The duration of each noisy speech frame may be randomized between 2 and 5 seconds with a sampling rate of f_s = 16000 samples/second. The average Fisher score for this feature may be maximized with l = 50.
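The envelope mean-crossing rate feature f_1 may be sketched as follows, assuming the RMS-envelope formulation reconstructed above (rectangular window of length l = 50, hop size l_h = 1); the function name and the exact normalization are illustrative assumptions.

```python
import numpy as np

def envelope_mean_crossing_rate(y, fs, win_len=50):
    """f1: crossings of the RMS envelope through its own mean,
    normalized to a per-second rate (rectangular window, hop size 1)."""
    csum = np.concatenate(([0.0], np.cumsum(y ** 2)))
    env = np.sqrt((csum[win_len:] - csum[:-win_len]) / win_len)
    s = np.where(env >= env.mean(), 1, -1)      # sign of E(i) - M
    crossings = np.count_nonzero(np.diff(s))    # = sum |S(..) - S(..)| / 2
    return crossings * fs / len(y)
```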
[0082] The second feature is the post-thresholding to pre-thresholding RMS ratio. First we denote the hard threshold of a noisy speech frame Y = {y_1, . . ., y_N} with threshold τ by H(Y, τ) = {h_1, . . ., h_N}, where:
h_i = y_i if |y_i| ≥ τ; h_i = 0 otherwise.
[0083] The post-thresholding to pre-thresholding RMS ratio is calculated as follows:
f_2 = RMS(H(Y, τ)) / RMS(Y)
where the threshold τ is set in proportion to the parameter K.
[0084] The variable which determines the quality of this feature is K, and this feature may be optimized by finding the value of K which maximizes the Fisher score for this feature:
K* = argmax_K [ Σ_{k=1}^{C_n} n_k(μ_k - μ)² / Σ_{k=1}^{C_n} n_k σ_k² ]
where the means and variances are now computed over the f_2 values.
[0085] Numerical maximization of the Fisher score with K = 0.1·a, where a ∈ N and 1 ≤ a ≤ 100, shows the best value for K is K = 3.
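The second feature f_2 may be sketched as follows; the choice of τ = K·RMS(Y) is an assumption of this sketch, since the text above only states that K controls the threshold.

```python
import numpy as np

def thresholding_rms_ratio(y, K=3.0):
    """f2: RMS after hard thresholding divided by RMS before it.
    tau = K * RMS(y) is an assumption of this sketch."""
    rms = np.sqrt(np.mean(y ** 2))
    h = np.where(np.abs(y) >= K * rms, y, 0.0)  # hard threshold H(Y, tau)
    return np.sqrt(np.mean(h ** 2)) / (rms + 1e-12)
```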
[0086] For training the classifier, the Gaussian Mixture Model (GMM) may be used. A GMM is the weighted sum of several Gaussian distributions:
p(F | μ_i, Σ_i, α_i) = Σ_{i=1}^{C} α_i · N(F | μ_i, Σ_i), such that Σ_{i=1}^{C} α_i = 1
where F is a d-dimensional feature vector (in this classification problem we have only two dimensions, or d = 2), α_i is the weight factor, μ_i is the mean and Σ_i is the covariance of the i-th Gaussian distribution. A Gaussian distribution N(F | μ_i, Σ_i) can be written as:
N(F | μ_i, Σ_i) = (1 / ((2π)^{d/2} |Σ_i|^{1/2})) · exp( -(1/2) (F - μ_i)^T Σ_i^{-1} (F - μ_i) )
[0087] Similar to step 102, step 152 also does not depend highly on the accuracy of the noise level estimation, e.g., the SNR estimation provided. Rather, for input signals having SNR values on or near the threshold value of 3.5 dB, categorization of such an input signal in either of the categories is not expected to significantly alter the outcome of the exemplary de-noising method 150 of Fig. 1b. Therefore, estimated SNR values may also be sufficient for step 152. In certain exemplary embodiments, estimated SNR values may be determined using a more efficient process, e.g., a method that requires less computational resources and/or time, such as a process that requires fewer iterative steps.
[0088] In the exemplary embodiment shown in Fig. 1b, input signals or frames of input signals that fall within the first category 154 may be de-noised in a less aggressive manner as compared to noisier signals. For input signals or frames of input signals in the second category 156, the priority is to avoid de-noising distortion rather than to remove as much noise as possible. Within each of the two categories, the data samples may be divided into two clusters, and each cluster may be modeled by a Gaussian model. In order to train the model, the Expectation-Maximization (EM) algorithm may be used.
[0089] After training the classifier as above, the method 150 may classify each noisy test speech frame Y with feature set F = {f_1, f_2} using MAP (maximum a posteriori) estimation as follows:
F ∈ Class 1 (SNR < 3.5), if P(F | Class 1)·P(Class 1) > P(F | Class 2)·P(Class 2)
F ∈ Class 2 (SNR > 3.5), if P(F | Class 2)·P(Class 2) > P(F | Class 1)·P(Class 1)
where:
P(F | Class 1) = α_1·N(F | μ_1, Σ_1) + α_2·N(F | μ_2, Σ_2)
P(F | Class 2) = α_3·N(F | μ_3, Σ_3) + α_4·N(F | μ_4, Σ_4)
and α_1, α_2, μ_1, μ_2, Σ_1, Σ_2 are the GMM parameters of class 1, and α_3, α_4, μ_3, μ_4, Σ_3, Σ_4 are the GMM parameters of class 2. Here both classes may be assumed to have equal overall probability (i.e., P(class_1) = P(class_2) = 0.5). Note that for each Gaussian model, the method 150 has already obtained the weights, means and covariances from the EM method. Using MAP, for each noisy speech sample with feature vector F, two probabilities may be obtained and the noisy sample may be classified into the class with the higher probability.
[0090] Input signals or frames of input signals that fall within either the first category 154 or the second category 156 may be further processed in step 160 in a similar manner as step 110 described above. In step 160, the input signals or frames of input signals may be decomposed into at least two components: (I) a first component 162 that exhibits no or low amounts of sustained oscillatory behavior; and (II) a second component 164 that exhibits high sustained oscillatory behavior. Step 160 may optionally decompose the input signals or frames of input signals to include a third component: (III) a residual component 166 that does not fall within either component 162 or 164. Step 160 may decompose the input signals or frames of input signals using any suitable methods, such as, for example, separating the signals into components having different Q-factors.
[0091] Step 160 may similarly provide preliminary de-noising of the input signals or frames of input signals. The preliminary de-noising may be performed by a sparsity-based de-noising method, such as, for example, a sparse optimization wavelet method. As discussed above, the input signals or frames of input signals may be represented by any suitable wavelet transform, in particular the TQWT. By adjusting the Q-factor, an optimal sparse representation of the input signals or frames of input signals may be obtained. Such an optimal sparse representation may provide improved performance for related sparsity-based methods such as signal decomposition and/or de-noising. To select a sparse representation of the input signals or frames of input signals, a Basis Pursuit (BP) method may be used. In particular, if the input signals or frames of input signals are considered to be noisy, e.g., those falling within the third category 109 described above, a Basis Pursuit De-noising (BPD) method may be used.
[0092] In step 168, the different HQF and LQF components may be further de-noised (e.g., by spectral cleaning) and subsequently recombined to produce the final de-noised output 170. In particular, this further de-noising step 168 may include parameter optimization followed by subsequent spectral cleaning. For example, assuming that the clean speech sample X and its noisy version Y are available, they may each be decomposed into HQF and LQF components.
There are a total of eight parameters associated with the optimization problem discussed above in steps 110 and 160. In order to maximize the de-noising performance in this stage, each of these eight parameters is optimized to ensure maximal noise attenuation with minimal signal distortion.
[0093] Low and high Q-factors (Q_1 and Q_2): These two parameters should be selected to match the oscillatory behavior of the speech in order to attain high sparsity and efficient subsequent de-noising. Q_1 and Q_2 denote the low and high Q-factors, respectively. Hence Q_2 must be sufficiently larger than Q_1. Choosing close values for Q_1 and Q_2 will lead to very similar LQF and HQF components and poor sparsification. Conversely, setting Q_2 to be too much greater than Q_1 also leads to poor results due to the concentration of most of the signal's energy in one component. With Q_1 = 1, any value between 5 and 7 is a reasonable choice for Q_2. In one exemplary embodiment, Q_1 = 1 and Q_2 = 5.
[0094] Oversampling rates (r_1 and r_2): a sufficient oversampling rate (redundancy) is required for an optimal sparsification. Nevertheless, selecting large oversampling values will increase the computational cost of the algorithm. For this method, any number between 2 and 4 can be suitable for r_1 and r_2. In one exemplary embodiment, r_1 = r_2 = 3.
[0095] Number of levels (J_1 and J_2): Once the previous four parameters are chosen, J_1 and J_2 should be selected to ensure the distribution of wavelet coefficients over a sufficiently large number of sub-bands. In one exemplary embodiment, J_1 = 10 and J_2 = 37.
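Collecting the wavelet parameters of paragraphs [0093]-[0095] into a configuration, a decomposition call might look as follows; tqwt_decompose is a hypothetical routine standing in for whatever TQWT implementation is available and is not part of the disclosure.

```python
# Hypothetical decomposition call; tqwt_decompose is an assumed helper.
LOW_Q  = dict(Q=1, r=3, J=10)   # Q1, r1, J1: low Q-factor branch
HIGH_Q = dict(Q=5, r=3, J=37)   # Q2, r2, J2: high Q-factor branch

# y_l, y_h, y_res = tqwt_decompose(y, LOW_Q, HIGH_Q)
```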
[0096] After selecting suitable values for the wavelet parameters, the regularization parameters λ_1 and λ_2 may be adjusted. These two parameters directly influence the effectiveness of de-noising. A larger value for either of them will lead to a more aggressive de-noising of its corresponding component. A more aggressive de-noising will potentially lead to more noise removal, but usually at the expense of increasing the distortion of the de-noised speech. Choosing suitable values for λ_1 and λ_2 which ensure maximum noise removal with minimum distortion is crucial for this stage.
[0097] Assuming the clean speech sample is available, λ_1 and λ_2 may be selected to maximize the similarity between the spectrograms of the clean speech components (X_L and X_H) and their de-noised versions (Y_L and Y_H). To measure the similarity between the spectrograms of the clean and de-noised signals, the normalized Manhattan distance applied to the magnitudes of the spectrograms (e.g., here with non-overlapping, 256-sample-long time frames) may be used, which may be defined as:
M = Σ_{i,j} | |S_c(i,j)| - |S_d(i,j)| | / Σ_{i,j} |S_c(i,j)|
where S_c is the Short Time Fourier Transform (STFT) of the clean speech and S_d is the STFT of its de-noised version. Using the above, M_L and M_H may be defined as metrics to measure the STFT similarity between the low and high Q-factor components of the clean and noisy speech samples, respectively, as follows:
M_L = Σ_{i,j} | |S_{X_L}(i,j)| - |S_{Y_L}(i,j)| | / Σ_{i,j} |S_{X_L}(i,j)|
M_H = Σ_{i,j} | |S_{X_H}(i,j)| - |S_{Y_H}(i,j)| | / Σ_{i,j} |S_{X_H}(i,j)|
where the STFT matrix is denoted by S and its corresponding component by its subscript. To maximize the similarity of S_{X_L} and S_{Y_L} as well as the similarity of S_{X_H} and S_{Y_H} simultaneously, while taking the relative energy of each component into account, the weighted normalized Manhattan distance may be defined as follows:
M_LH = αM_L + βM_H, where α + β = 1
[0098] The weighting factors α and β are selected based on the ℓ2-norms of their corresponding components as follows:
α = ||X_L||_2 / (||X_L||_2 + ||X_H||_2), β = ||X_H||_2 / (||X_L||_2 + ||X_H||_2)
Therefore:
M_LH = (||X_L||_2 · M_L + ||X_H||_2 · M_H) / (||X_L||_2 + ||X_H||_2)
[0099] The values of λ_1 and λ_2 which minimize M_LH can be used to optimize the de-noising stage, or:
(λ_1*, λ_2*) = argmin_{λ_1, λ_2} M_LH
[00100] To numerically solve the above, the average M_LH may be calculated over many speech samples (n = 1000) corrupted with randomly generated multi-talker babble noise with various signal-to-noise ratios. For each noisy sample, all combinations of λ_1 and λ_2 from 0.01 to 0.1 with 0.01 intervals may be used (100 possible combinations in total) and M_LH may be obtained. Two sets of values for λ_1 and λ_2 may be selected, where each set minimizes the average M_LH for noisy signals belonging to one of the classes discussed in the previous stage.
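The weighted distance M_LH may be sketched as follows, using non-overlapping 256-sample frames; the framing details and the exactly equal-length inputs are assumptions consistent with the description above.

```python
import numpy as np

def stft_mag(x, frame=256):
    """Magnitude spectrogram over non-overlapping 256-sample frames."""
    n = len(x) // frame
    return np.abs(np.fft.rfft(x[:n * frame].reshape(n, frame), axis=1))

def manhattan(Sc, Sd):
    """Normalized Manhattan distance between magnitude spectrograms."""
    return np.sum(np.abs(Sc - Sd)) / np.sum(Sc)

def m_lh(x_l, x_h, y_l, y_h):
    """M_LH = alpha*M_L + beta*M_H with l2-norm weights alpha, beta."""
    ml = manhattan(stft_mag(x_l), stft_mag(y_l))
    mh = manhattan(stft_mag(x_h), stft_mag(y_h))
    nl, nh = np.linalg.norm(x_l), np.linalg.norm(x_h)
    return (nl * ml + nh * mh) / (nl + nh)

# Grid search: evaluate the average m_lh over a training corpus for each
# (lambda1, lambda2) in {0.01, ..., 0.1}^2 and keep the minimizing pair.
```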
[00101] Using the optimized parameters discussed in the previous section, de-noised LQF and HQF components may be obtained. Nevertheless, the spectrograms of these components exhibit some remaining noise still existing in the optimally de-noised components Y_L and Y_H. Low magnitude 'gaps' in the spectrograms of the clean speech components X_L and X_H may be completely filled with noise in their de-noised versions (i.e., Y_L and Y_H). Here, 'gaps' refers to low magnitude pockets surrounded by high magnitude areas. These low magnitude gaps are more distinctly visible in lower frequencies (i.e., frequencies between 0 and 2000 Hz) where most of the speech signal's energy exists. By implementing a more aggressive de-noising (i.e., choosing larger values for λ_1 or λ_2 or both), more noise will be removed and some of these gaps will appear again in the de-noised components. Nevertheless, this is only achieved at the expense of inflicting more distortion on the de-noised signal (i.e., larger M_LH values). Hence, even though more aggressively de-noised LQF and HQF components may have "gap patterns" more similar to those of the original clean speech components X_L and X_H, they are not directly usable due to the high degree of distortion. However, they potentially contain usable information about the location of the gaps in the spectrograms of X_L and X_H, which may help de-noise Y_L and Y_H one step further. [00102] In order to quantify and measure the similarity between the locations of gaps in two spectrograms, the "Gap Binary Pattern" (GBP) matrix may be defined. To create the GBP of a signal, the spectrogram of the signal is divided into non-overlapping time/frequency tiles and each tile is categorized as either a low magnitude or a high magnitude tile. Hence the GBP of a spectrogram is an N_fb × N_tf binary matrix, where N_fb is the number of frequency bins and N_tf is the number of time frames. Assuming S_X is the STFT matrix of the signal X, and T_{i,j} is a time/frequency tile of S_X which covers the area on the spectrogram containing all the frequencies between (i-1)Δf and iΔf on the frequency axis and (j-1)Δt and jΔt on the time axis, the GBP of X is defined as:
GBP_X(i,j) = 1, if mean |T_{i,j}| < α · mean |S_X|
GBP_X(i,j) = 0, if mean |T_{i,j}| ≥ α · mean |S_X|
[00103] In one particular embodiment, the following may be selected: f_s = 16000 Hz, Δt·f_s = 256, N_fb = 128, α = 0.5, Δf = 62.5 Hz.
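A sketch of the GBP computation under the parameter choices of paragraph [00103]; with Δf = 62.5 Hz and 256-sample non-overlapping frames, each time/frequency tile reduces to a single STFT cell, which this sketch assumes.

```python
import numpy as np

def gap_binary_pattern(x, frame=256, n_fb=128, alpha=0.5):
    """GBP: 1 marks a low-magnitude 'gap' tile, 0 a high-magnitude tile.
    With delta_f = fs/frame = 62.5 Hz at fs = 16 kHz, each tile is a
    single cell of a non-overlapping 256-sample STFT."""
    n = len(x) // frame
    S = np.abs(np.fft.rfft(x[:n * frame].reshape(n, frame), axis=1)).T
    S = S[:n_fb]                       # keep N_fb = 128 frequency bins
    return (S < alpha * S.mean()).astype(np.uint8)
```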
[00104] By estimating the locations of the gaps in the clean speech components, step 168 can potentially remove significant residual noise from Y_L and Y_H. If a low amplitude tile in the clean speech components X_L and X_H is categorized as high amplitude in the de-noised components Y_L and Y_H, then step 168 can conclude that this extra boost in the tile's energy is likely to originate from the noise and can be attenuated by a reduction gain. Because in reality the clean speech components X_L and X_H are not readily available, the goal is to find aggressively de-noised low and high Q-factor components (denoted by Y′_L and Y′_H) with gap locations (in lower frequencies) similar to those of the clean speech components X_L and X_H.
[00105] To find these aggressively de-noised components, we should find parameter settings that maximize the similarity between the GBPs of the de-noised and clean speech components in lower frequencies. The best metric to measure the similarity of two GBPs is Sorenson's metric, which is designed to measure the similarity between binary matrices with emphasis on ones (i.e., gaps) rather than zeros. Sorenson's metric for two binary matrices M_1 and M_2 is defined as:
SM(M_1, M_2) = 2C / (N_1 + N_2)
where C is the number of 1-1 matches (both values are 1), N_1 is the total number of 1s in the matrix M_1 and N_2 is the total number of 1s in the matrix M_2.
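Sorenson's metric reduces to a few lines over binary matrices, a minimal sketch following the definition above:

```python
import numpy as np

def sorenson(m1, m2):
    """SM(M1, M2) = 2C / (N1 + N2), emphasizing matching ones (gaps)."""
    c = np.count_nonzero(m1 & m2)            # 1-1 matches
    n1, n2 = np.count_nonzero(m1), np.count_nonzero(m2)
    return 2.0 * c / (n1 + n2) if (n1 + n2) else 1.0
```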
[00106] In this stage, two new sets of regularization parameters may be identified; one should maximize SM(G_{X_L}, G_{Y′_L}) and the other should maximize SM(G_{X_H}, G_{Y′_H}).
[00107] Two sets of regularization parameters may be numerically found which maximize the Sorenson's metrics by measuring SM(G_{X_L}, G_{Y′_L}) and SM(G_{X_H}, G_{Y′_H}) for a sufficiently large number of speech samples (n = 1000) corrupted with randomly generated multi-talker babble noise with various signal-to-noise ratios. There may be three sets of regularization parameters, as follows: λ_1 and λ_2, found by minimizing M_LH, are used to generate the optimally de-noised components Y_L and Y_H; λ′_1 and λ′_2, found by maximizing SM(G_{X_L}, G_{Y′_L}), are used to generate the aggressively de-noised component Y′_L with gap locations similar to those of X_L; and λ″_1 and λ″_2, found by maximizing SM(G_{X_H}, G_{Y′_H}), are used to find the aggressively de-noised component Y′_H with gap locations similar to those of X_H.
[00108] Because Y′_L and Y′_H have gap patterns more similar to those of X_L and X_H than Y_L and Y_H do, respectively, they can be used as templates to further clean up the optimally de-noised Y_L and Y_H. To achieve this, spectral cleaning may be performed on Y_L and Y_H based on the GBPs of the aggressively de-noised Y′_L and Y′_H. Using the time/frequency tiling, reduction gains r_L and r_H may be applied to high magnitude tiles in Y_L and Y_H with low magnitude counterparts T′_{i,j} in Y′_L and Y′_H. In some embodiments, the spectral cleaning is only performed in lower frequencies (i.e., frequencies between 0 and 2000 Hz). The reduction gains may be chosen as:
r_L(i,j) = (mean |T′_{i,j}| / mean |S_{Y′_L}|) · (mean |S_{Y_L}| / mean |T_{i,j}|)
where T_{i,j} and T′_{i,j} are time/frequency tiles in S_{Y_L} and S_{Y′_L}, respectively, and the resulting enhanced STFT matrix and its time/frequency tiles are denoted by Ŝ_{Y_L} and T̂_{i,j};
r_H(i,j) = (mean |T′_{i,j}| / mean |S_{Y′_H}|) · (mean |S_{Y_H}| / mean |T_{i,j}|)
where T_{i,j} and T′_{i,j} are time/frequency tiles in S_{Y_H} and S_{Y′_H}, respectively, and the resulting enhanced STFT matrix and its time/frequency tiles are denoted by Ŝ_{Y_H} and T̂_{i,j}.
[00109] Note that the reduction gains are chosen to decrease the normalized average magnitude of the tiles in S_{Y_L} and S_{Y_H} to the level of the normalized average magnitude of the tiles in S_{Y′_L} and S_{Y′_H}. The gaps which were filled by noise in the optimally de-noised components may be visible again after spectral cleaning. [00110] In step 170, after spectral cleaning, the enhanced low and high Q-factor components X̂_L and X̂_H can be obtained by the inverse short time Fourier transform of Ŝ_{Y_L} and Ŝ_{Y_H}, and eventually X̂, which is an estimate of the clean speech X, can be created by re-composition of X̂_L and X̂_H as:
X̂ = X̂_L + X̂_H
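The tile-wise spectral cleaning of paragraph [00108] may be sketched as follows, again assuming one-cell tiles and a 2 kHz cutoff at bin 32 (2000/62.5); S_YL, S_YpL and istft in the usage comments are hypothetical names for the optimally de-noised STFT, the aggressively de-noised STFT, and an inverse-STFT routine.

```python
import numpy as np

def spectral_clean(S, S_agg, alpha=0.5, f_max_bin=32):
    """Attenuate low-frequency tiles of the optimally de-noised STFT S that
    are high-magnitude where the aggressively de-noised STFT S_agg shows a
    gap; bin 32 ~ 2000 Hz at delta_f = 62.5 Hz, one-cell tiles assumed."""
    S = S.copy()
    mean_s, mean_a = np.abs(S).mean(), np.abs(S_agg).mean()
    gap  = np.abs(S_agg[:f_max_bin]) < alpha * mean_a    # gaps in S_agg
    high = np.abs(S[:f_max_bin]) >= alpha * mean_s       # filled in S
    mask = gap & high
    # Gain brings the tile's normalized magnitude in S down to its
    # normalized magnitude in S_agg, as in paragraph [00109].
    gain = (np.abs(S_agg[:f_max_bin]) / mean_a) / \
           (np.abs(S[:f_max_bin]) / mean_s + 1e-12)
    S[:f_max_bin][mask] *= gain[mask]
    return S

# X_hat_L = istft(spectral_clean(S_YL, S_YpL))   # istft: hypothetical
# X_hat   = X_hat_L + X_hat_H
```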
[00111] Those skilled in the art will understand that the exemplary embodiments described herein may be implemented in any number of manners, including as a separate software module, as a combination of hardware and software, etc. For example, the exemplary analysis methods may be embodied in one or more programs stored in a non-transitory storage medium and containing lines of code that, when compiled, may be executed by at least one of a plurality of processor cores or a separate processor. In some embodiments, a system comprising a plurality of processor cores and a set of instructions executing on the plurality of processor cores may be provided. The set of instructions may be operable to perform the exemplary methods discussed above. The at least one of the plurality of processor cores or a separate processor may be incorporated in or may communicate with any suitable electronic device for receiving an audio input signal and/or outputting a modified audio signal, including, for example, an audio processing device, a cochlear implant, a mobile computing device, a smart phone, a computing tablet, a computing device, etc.
[00112] Although the exemplary analysis methods described above are discussed in reference to a cochlear implant, it is contemplated that the exemplary analysis methods may be incorporated into any suitable electronic device that may require or benefit from improved audio processing, particularly noise reduction. For example, the exemplary analysis methods may be embodied in an exemplary system 200 as shown in Fig. 2. For example, an exemplary method described herein may be performed entirely or in part by a processing arrangement 210. Such a processing/computing arrangement 210 may be, e.g., entirely or a part of, or include, but not be limited to, a computer/processor that can include, e.g., one or more microprocessors, and use instructions stored on a computer-accessible medium (e.g., RAM, ROM, hard drive, or other storage device). As shown in Fig. 2, e.g., a computer-accessible medium 220 (e.g., as described herein, a storage device such as a hard disk, floppy disk, memory stick, CD-ROM, RAM, ROM, etc., or a collection thereof) can be provided (e.g., in communication with the processing arrangement 210). The computer-accessible medium 220 may be a non-transitory computer-accessible medium. The computer-accessible medium 220 can contain executable instructions 230 thereon. In addition or alternatively, a storage arrangement 240 can be provided separately from the computer-accessible medium 220, which can provide the instructions to the processing arrangement 210 so as to configure the processing arrangement to execute certain exemplary procedures, processes and methods, as described herein, for example.
[00113] System 200 may also include a receiving arrangement for receiving an input audio signal, e.g., an audio receiver or a microphone, and an outputting arrangement for outputting a de-noised audio signal, e.g., a speaker, a telephone, or a smart phone. Alternatively, the input audio signal may be a pre-recorded signal that is subsequently transmitted to the system 200 for processing. For example, an audio signal may be pre-recorded, e.g., a recording having a noisy background, particularly a multi-talker babble noisy background, that may be processed by the system 200 post-hoc. The receiving arrangement and outputting arrangement may be part of the same device, e.g., a cochlear implant, headphones, etc., or separate devices. Alternatively, the system may include a display or output device, an input device such as a keyboard, mouse, touch screen or other input device, and may be connected to additional systems via a logical network.
[00114] In one particular embodiment, the system 200 may include a smart phone with a receiving arrangement, e.g., a microphone, for detecting speech, such as a conversation from a user. The conversation from the user may be obtained from a noisy environment, particularly where there is multi-talker babble, such as in a crowded area with many others speaking in the background, e.g., in a crowded bar. The input audio signal received by the smart phone may be processed using the exemplary methods described above and a modified signal, e.g., a cleaned audio signal, where a noise portion may be reduced and/or a speech signal may be enhanced, may be transmitted via the smart phone over a communications network to a recipient. The modified signal may provide for more intelligible audio such that a smart phone user in a noisy environment may be more easily understood by the recipient, as compared to an unmodified signal. Alternatively, the input audio signal may be received by the smart phone and transmitted to an external processing unit, such as a centralized processing arrangement in a communications network. The centralized processing arrangement may process the input audio signal transmitted by the smart phone using the exemplary methods described above and forward the modified signal to the intended recipient, thereby providing a centralized processing unit for de-noising telephone calls. In some embodiments, the input audio signal may be a pre-recorded audio signal received by the system 200, and the input audio signal may be processed using the exemplary methods described above. For example, the system 200 may include a computing device, e.g., a mobile communications device, that includes instructions for processing pre-recorded input audio signals before outputting them to a user. In a further embodiment, the input audio signal may be received by the system 200 (e.g., a smart phone or other mobile communications device), in real-time or substantially in real-time, from a communications network (e.g., an input audio call from a third party received by a smart phone) and the input audio signal may be processed using the exemplary methods described above. For example, a user of the system 200, e.g., a smart phone, may receive a noisy input audio signal from another party, e.g., conversation from the other party, where the other party may be in a noisy environment, particularly where there is multi-talker babble, such as in a crowded area with many others speaking in the background, e.g., in a crowded bar. The input audio signal received via the communications network by the smart phone may be processed using the exemplary methods described above and a modified signal, e.g., a cleaned audio signal, where a noise portion may be reduced and/or a speech signal may be enhanced, may be outputted to the user, for example, as an audible sound, e.g., outputted through a speaker or any other suitable audio output device or component.
[00115] Many of the embodiments described herein may be practiced in a networked environment using logical connections to one or more remote computers having processors. Logical connections may include a local area network (LAN) and a wide area network (WAN) that are presented here by way of example and not limitation. Such networking environments are commonplace in office-wide or enterprise-wide computer networks, intranets and the Internet and may use a wide variety of different communication protocols. Those skilled in the art can appreciate that such network computing environments can typically encompass many types of computer system configurations, including personal computers, hand-held devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, and the like. Embodiments of the invention may also be practiced in distributed computing environments where tasks are performed by local and remote processing devices that are linked (either by hardwired links, wireless links, or by a combination of hardwired or wireless links) through a communications network. For example, the tasks may be performed by an external device such as a cell-phone for de-noising an input signal and then sending a modified signal from the external device to a CI device via any suitable communications network such as, for example, Bluetooth. In a distributed computing environment, program modules may be located in both local and remote memory storage devices.
EXAMPLES
Example I
[00116] The exemplary embodiment of Fig. 1a, as described above, may be evaluated by measuring a subject's understanding of IEEE standard sentences with and without processing by the exemplary method 100. Sentences may be presented against a background of 6-talker babble using four different signal-to-noise ratios (0, 3, 6, or 9 dB). In the IEEE standard sentences (also known as the "1965 Revised List of Phonetically Balanced Sentences," or Harvard Sentences) there are 72 lists of 10 sentences. To test speech intelligibility in noise, two randomly selected sentence sets (20 sentences) may be presented for each of the following 8 conditions:
1- Speech and 6 Talker Babble (SNR = 0 dB) - Unprocessed
2- Speech and 6 Talker Babble (SNR = 0 dB) - Processed
3- Speech and 6 Talker Babble (SNR = 3 dB) - Unprocessed
4- Speech and 6 Talker Babble (SNR = 3 dB) - Processed
5- Speech and 6 Talker Babble (SNR = 6 dB) - Unprocessed
6- Speech and 6 Talker Babble (SNR = 6 dB) - Processed
7- Speech and 6 Talker Babble (SNR = 9 dB) - Unprocessed
8- Speech and 6 Talker Babble (SNR = 9 dB) - Processed
[00117] In addition to the above mentioned conditions, another two sentence sets (20 sentences) may be selected for the following two additional conditions:
[00118] 9-Speech in quiet (10 Sentences)
[00119] 10-Practice with all SNRs (10 Sentences)
[00120] Each intelligibility test in Example I may include 180 sentences in total. Before processing of any audio signals, 18 sets of sentences spoken by a male speaker may be arbitrarily selected from the IEEE standard sentences. In Example I, the selected sentence sets include: 11, 16, 22, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 65, 71 and 72. Before each test, two sentence sets may be selected at random for each condition and two other sentence sets may be selected for the speech-in-quiet test and the practice session. Then a list including these 180 sentences in a completely random order may be created. Prior to the test, a practice session with ten sentences, presented at all SNRs, may be used to familiarize the subject with the test. The practice session with the subject may last for 5 to 30 minutes. After the practice session, the subjects may be tested on the various conditions. Sentences may be presented to CI subjects in free field via a single loudspeaker positioned in front of the listener at 65 dBA. Subjects may be tested using their clinically assigned speech processor. Subjects may then be asked to use their normal, everyday volume/sensitivity settings. Performance may be assessed in terms of the percentage of correctly identified words-in-sentences as a function of SNR for each subject. Each sentence may include five keywords and a number of non-keywords. Keywords may be scored 1 and non-keywords may be scored 0.5.
[00121] After completing a speech understanding test, subjects may be asked to evaluate the sound quality of the sentences using a MUSHRA (Multiple Stimuli with Hidden Reference and Anchor) scaling test. Participants may complete a total of 5 MUSHRA evaluations, one for each randomly selected sentence. Trials may be randomized among participants. Within each MUSHRA evaluation, participants may be presented with a labeled reference (clean speech) and ten versions of the same sentence presented in random order. These versions may include a "hidden reference" (i.e., identical to the labeled reference), eight different conditions (two processing conditions at 4 SNRs) and an anchor (pure 6-talker babble). Participants may be able to listen to each of these versions without limit by pressing a "Play" button or trigger within a user interface. Participants may then be instructed to listen to each stimulus at least once and provide a sound quality rating for each of the ten sentences using a 100-point scale. To rate a stimulus, participants may move an adjustable slider between 0 and 100, an example of which is shown in Fig. 3. The rating scale may be divided into five equal intervals, which may be delineated by the adjectives very poor (0-20), poor (21-40), fair (41-60), good (61-80), and excellent (81-100). Participants may be requested to rate at least one stimulus in the set a score of "100" (i.e., identical sound quality to the labeled reference). Once participants are satisfied with their ratings, they may press a "Save and proceed" button or trigger within a user interface to move to the next trial. [00122] In Example I, as a pilot test, preliminary results were collected with 5 normal hearing (NH) subjects using eight-channel noise-vocoded signals. As shown in Fig. 4a, the percentage of words correct for each unprocessed signal is shown with an open triangle symbol, and the percentage of words correct for each signal processed using the exemplary method 100 of Fig. 1a is shown with a filled-in circle symbol. Similarly, as shown in Fig. 4b, the MUSHRA score for each unprocessed signal is shown with an open triangle symbol, and the MUSHRA score for each signal processed using the exemplary method 100 of Fig. 1a is shown with a filled-in circle symbol. As can be seen in Figs. 4a and 4b, for all NH subjects, intelligibility and quality improved.
[00123] In Example I, for the main test, 7 post-lingually deafened CI subjects, as indicated below in Table 1, were tested. For all subjects, intelligibility in quiet was measured as a reference; its average was 80.81 percent.
Table 1.
*Note: For the MUSHRA test, oral data was collected from subject CI18 due to her severe visual impairment.
[00124] Fig. 5 shows word-in-sentence intelligibility in the presence of a 6-talker babble background as a function of the SNR for individual subjects. Data for each unprocessed signal are shown with an open triangle symbol, whereas data for each signal processed using the exemplary method 100 of Fig. 1a are shown with a filled-in circle symbol. Fig. 7a shows an average result for all subjects. Mean intelligibility scores, averaged across all subjects and all SNRs, increased by 17.94 percentage points. Two-way ANOVA tests revealed significant main effects of processing [F(1,6) = 128.953, p < 0.001] and noise levels [F(3,18) = 40.128, p < 0.001]. It also revealed a relatively large interaction between noise levels and algorithms [F(3,18) = 8.117, p = 0.001].
[00125] Fig. 6 shows speech quality in the presence of a 6-talker babble background as a function of the SNR for individual subjects. Data for each unprocessed signal are shown with an open triangle symbol, whereas data for each signal processed using the exemplary method 100 of Fig. 1a are shown with a filled-in circle symbol. Fig. 7b shows average results for all subjects. Mean quality scores, averaged across all subjects and all SNRs, increased by 21.18 percentage points. Two-way ANOVA tests revealed significant main effects of processing [F(1,6) = 72.676, p < 0.001] and noise levels [F(3,18) = 42.896, p < 0.001]. It also revealed no significant interaction between noise levels and algorithms [F(3,18) = 1.914, p = 0.163].
[00126] As can be seen above, the exemplary method 100 of Fig. 1a may provide significant speech understanding improvements in the presence of multi-talker babble noise for CI listeners. The exemplary method 100 performed notably better for higher signal-to-noise ratios (6 and 9 dB). This could be because of the distortion introduced to the signal by the more aggressive de-noising strategy for lower SNRs (0 and 3 dB). In Example I, subjects with higher performance in quiet also performed generally better. For the subjects with lower performance in quiet (CI05 and CI07), a floor effect may be seen. However, a ceiling effect was not observed in Example I for the subjects with higher performance in quiet.

Example II
[00127] The exemplary embodiment of Fig. 1b, as described above, may be evaluated by measuring a subject's understanding of IEEE standard sentences with and without processing by the exemplary method 150. All babble samples in Example II are randomly created by mixing sentences randomly taken from a pool of standard sentences which contains a total of 2,100 sentences (including IEEE standard sentences with male and female speakers, HINT sentences and SPIN sentences). For each babble sample, the number of talkers was randomized between 5 and 10, and the gender ratio of the talkers was also randomly selected (all female, all male, or a random combination of both).
[00128] Fig. 8 shows a Gaussian Mixture Model trained with the EM method using 100,000 randomly created noisy speech samples with SNRs ranging from -10 dB to 20 dB, as the different speech samples would be classified under step 152. A first set of curves to the right represents Gaussian distributions belonging to the class (SNR < 3.5) and a second set of curves to the left represents Gaussian distributions belonging to the class (SNR > 3.5).
[00129] To evaluate the performance of method 150, a modified version of a two-fold cross validation method may be used. First, half of the sentences in the database were used for training and the second half were used to test the classifier. Then, the sentences used for testing and training were switched (the second half of the sentences in the database for training and the first half for testing the classifier). For the classifier, the F accuracy metric is defined as follows:
F = 2C / (2C + f⁺ + f⁻)
where C, f⁺ and f⁻ are the numbers of correct, false positive and false negative detections, respectively.
[00130] The average values of the F accuracy metric were measured for three types of multi-talker babble at different SNRs. The average value of F changed only slightly with the number and the gender ratio of the talkers. The average value of F was 1 for SNRs outside the neighborhood of the border SNR between the two classes (i.e., 3.5 dB). In the vicinity of SNR = 3.5 dB some decline in the accuracy was observed. Fig. 9 shows the variation of the accuracy metric F as a function of SNR for three different multi-talker babble noises. 1,000 randomly created noisy samples were tested for each SNR.
[00131] Fig. 10 shows the frequency response and sub-band wavelets of a TQWT, e.g., as used in step 160 described above. Specifically, Fig. 10 shows the frequency response (left) and sub-band wavelets (right) of a TQWT with Q = 2, r = 3, J = 13. [00132] Table 2 shows the specific selected values for λ_1 and λ_2 in Example II as well as the other parameters for each class.
Table 2.
[00133] To validate the optimization results with other distance metrics, the normalized Manhattan distance between the spectrogram magnitudes of the clean speech and of the sum of the two de-noised components was minimized, as well as the Euclidean (ℓ2) counterpart of the distance applied to the de-noised and clean components.
[00134] The same results for λ_1 and λ_2 were achieved.
[00135] In this example, two sets of regularization parameters were found which maximize the Sorenson's metrics by measuring SM(G_{X_L}, G_{Y′_L}) and SM(G_{X_H}, G_{Y′_H}) for a sufficiently large number of speech samples (n = 1000) corrupted with randomly generated multi-talker babble noise with various signal-to-noise ratios. Three sets of regularization parameters were also identified, as follows: λ_1 and λ_2, found by minimizing M_LH, are used to generate the optimally de-noised components Y_L and Y_H; λ′_1 and λ′_2, found by maximizing SM(G_{X_L}, G_{Y′_L}), are used to generate the aggressively de-noised component Y′_L with gap locations similar to those of X_L; and λ″_1 and λ″_2, found by maximizing SM(G_{X_H}, G_{Y′_H}), are used to find the aggressively de-noised component Y′_H with gap locations similar to those of X_H. Table 3 shows the selected values for these regularization parameters for both classes.

Table 3.
[00136] Fig. 11 shows that using the selected aggressive de-noising regularization parameters leads to finding much more accurate gap patterns of the clean speech components. In particular, Fig. 11 shows the low frequency Gap Binary Patterns of X_L, X_H, Y_L, Y_H, Y′_L and Y′_H for clean/noisy speech samples. It can be seen that gaps (shown with g_1, g_2, g_3, g_4) which are filled with noise in Y_L and Y_H are visible in Y′_L and Y′_H. The corresponding Sorenson's metric values for each pair of GBPs (e.g., 0.79, 0.54 and 0.57) are reported in Fig. 11.
[00137] Fig. 12 shows the effect of each of the initial de-noising and the spectral cleaning on the weighted normalized Manhattan distance M_LH, measured on 1000 noisy speech samples corrupted with various randomly created multi-talker babbles. As can be seen, the effect of the spectral cleaning decreases with increasing SNR.
[00138] The invention described and claimed herein is not to be limited in scope by the specific embodiments herein disclosed since these embodiments are intended as illustrations of several aspects of this invention. Any equivalent embodiments are intended to be within the scope of this invention. Indeed, various modifications of the invention in addition to those shown and described herein will become apparent to those skilled in the art from the foregoing description. Such modifications are also intended to fall within the scope of the appended claims. All publications cited herein are incorporated by reference in their entirety.

Claims

What is claimed is:

1. A method for reducing noise, comprising:
receiving an input audio signal comprising a speech signal and a noise;
decomposing the input audio signal into at least two components, the at least two components comprising a first component having a low or no sustained oscillatory pattern, and a second component having a high oscillatory pattern;
de-noising the second component based on data generated from the first component to obtain a modified second component; and
outputting an audio signal having reduced noise, the output audio signal comprising the first component in combination with the modified second component.

2. The method of claim 1, wherein the outputted audio signal more closely corresponds to the speech signal than the input audio signal.
3. The method of claim 1, wherein the noise comprises a multi-talker babble noise.

4. The method of claim 1, wherein the decomposing step comprises de-noising the first and second components, and the first component is more aggressively de-noised than the second component.
5. The method of claim 1, wherein the decomposing step comprises de-noising the first and second components, the second component being more distorted than the first component.
6. The method of claim 1, wherein the decomposing step comprises a nonlinear decomposition method.

7. The method of claim 1, wherein the decomposing step comprises a morphological component analysis (MCA) method.
8. The method of claim 1, wherein the decomposing step comprises a sparse optimization wavelet method.

9. The method of claim 8, wherein the decomposing step includes determining a first
Tunable Q-Factor Wavelet Transform (TQWT) for the first component and a second TQWT for the second component.
10. The method of claim 9, wherein the first component has a low value for a Q-factor, and the second component has a high value for the Q-factor, wherein the Q-factor corresponds to a ratio of a center frequency to a bandwidth of each component.

11. The method of claim 9, wherein the decomposing step further includes a basis pursuit de-noising (BPD) method.

12. The method of claim 10, wherein the decomposing step decomposes the input audio signal into the first component, the second component, and further a residual component.

13. The method of claim 1, wherein the de-noising step comprises further modifying the second component to obtain a modified second component having a temporal and spectral pattern (TSP) corresponding to a TSP of the first component.
14. A method for improving intelligibility of speech, comprising:
obtaining, from a receiving arrangement, an input audio signal comprising a speech signal and a noise;
estimating a noise level of the input audio signal;
decomposing the input audio signal into at least two components when the estimated noise level of the input audio signal is above a predetermined threshold, the at least two components comprising a first component having a low or no sustained oscillatory pattern, and a second component having a high oscillatory pattern; de-noising the second component based on data generated from the first component to obtain a modified second component; and
outputting an audio signal having reduced noise to an output arrangement, the output audio signal comprising the first component in combination with the modified second component.
15. The method of claim 14, wherein the noise comprises a multi-talker babble noise.
16. The method of claim 14, wherein the estimating step comprises determining or estimating a signal-to-noise ratio (SNR) for the input audio signal.

17. The method of claim 14, wherein the decomposing step comprises de-noising the first and second components, and the first component is more aggressively de-noised than the second component.
18. The method of claim 14, wherein the de-noising step comprises further modifying the second component to obtain a modified second component having a temporal and spectral pattern (TSP) corresponding to a TSP of the first component.

19. A non-transitory computer readable medium storing a computer program that is executable by at least one processing unit, the computer program comprising sets of instructions for:
receiving an input audio signal comprising a speech signal and a noise;
decomposing the input audio signal into at least two components, the at least two components comprising a first component having a low or no sustained oscillatory pattern, and a second component having a high oscillatory pattern;
de-noising the second component based on data generated from the first component to obtain a modified second component; and
outputting an audio signal having reduced noise, the output audio signal comprising the first component in combination with the modified second component.

20. A system for improving intelligibility for a user, comprising:
a receiving arrangement configured to receive an input audio signal comprising a speech signal and a noise;
a processing arrangement configured to receive the input audio signal from the receiving arrangement, decompose the input audio signal into at least two components, the at least two components comprising a first component having a low or no sustained oscillatory pattern, and a second component having a high oscillatory pattern, de-noise the second component based on data generated from the first component to obtain a modified second component, and output an audio signal having reduced noise, the output audio signal comprising the first component in combination with the modified second component.
21. The system of claim 20, further comprising a cochlear implant, wherein the cochlear implant includes the receiving arrangement, and the cochlear implant is configured to generate an electrical stimulation to the user, the electrical stimulation corresponding to the output audio signal.
22. The system of claim 20, further comprising a mobile computing device, wherein the mobile computing device includes the receiving arrangement, and the mobile computing device is configured to generate an audible sound corresponding to the output audio signal.
PCT/US2017/018696 2016-02-19 2017-02-21 Method and system for multi-talker babble noise reduction using q-factor based signal decomposition WO2017143334A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US15/703,721 US10319390B2 (en) 2016-02-19 2017-09-13 Method and system for multi-talker babble noise reduction

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201662297536P 2016-02-19 2016-02-19
US62/297,536 2016-02-19

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US15/703,721 Continuation-In-Part US10319390B2 (en) 2016-02-19 2017-09-13 Method and system for multi-talker babble noise reduction

Publications (1)

Publication Number Publication Date
WO2017143334A1 true WO2017143334A1 (en) 2017-08-24

Family

ID=59625426

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2017/018696 WO2017143334A1 (en) 2016-02-19 2017-02-21 Method and system for multi-talker babble noise reduction using q-factor based signal decomposition

Country Status (1)

Country Link
WO (1) WO2017143334A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108765322A (en) * 2018-05-16 2018-11-06 上饶师范学院 Image de-noising method and device
CN113488074A (en) * 2021-08-20 2021-10-08 四川大学 Long-time variable Q time-frequency conversion algorithm of audio signal and application thereof

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5133013A (en) * 1988-01-18 1992-07-21 British Telecommunications Public Limited Company Noise reduction by using spectral decomposition and non-linear transformation
US20030023430A1 (en) * 2000-08-31 2003-01-30 Youhua Wang Speech processing device and speech processing method
US20040260540A1 (en) * 2003-06-20 2004-12-23 Tong Zhang System and method for spectrogram analysis of an audio signal
US20050049857A1 (en) * 2003-08-25 2005-03-03 Microsoft Corporation Method and apparatus using harmonic-model-based front end for robust speech recognition
US20050111683A1 (en) * 1994-07-08 2005-05-26 Brigham Young University, An Educational Institution Corporation Of Utah Hearing compensation system incorporating signal processing techniques
US20130336541A1 (en) * 2012-06-14 2013-12-19 Peter Adrian Spencer Elkington Geological log data processing methods and apparatuses
US20140321763A1 (en) * 2009-10-21 2014-10-30 Futurewei Technologies, Inc. Communication System with Compressive Sensing
US20150124560A1 (en) * 2013-11-01 2015-05-07 Conocophillips Company Compressive sensing
US20150230032A1 (en) * 2014-02-12 2015-08-13 Oticon A/S Hearing device with low-energy warning

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5133013A (en) * 1988-01-18 1992-07-21 British Telecommunications Public Limited Company Noise reduction by using spectral decomposition and non-linear transformation
US20050111683A1 (en) * 1994-07-08 2005-05-26 Brigham Young University, An Educational Institution Corporation Of Utah Hearing compensation system incorporating signal processing techniques
US20030023430A1 (en) * 2000-08-31 2003-01-30 Youhua Wang Speech processing device and speech processing method
US20040260540A1 (en) * 2003-06-20 2004-12-23 Tong Zhang System and method for spectrogram analysis of an audio signal
US20050049857A1 (en) * 2003-08-25 2005-03-03 Microsoft Corporation Method and apparatus using harmonic-model-based front end for robust speech recognition
US20140321763A1 (en) * 2009-10-21 2014-10-30 Futurewei Technologies, Inc. Communication System with Compressive Sensing
US20130336541A1 (en) * 2012-06-14 2013-12-19 Peter Adrian Spencer Elkington Geological log data processing methods and apparatuses
US20150124560A1 (en) * 2013-11-01 2015-05-07 Conocophillips Company Compressive sensing
US20150230032A1 (en) * 2014-02-12 2015-08-13 Oticon A/S Hearing device with low-energy warning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
IVAN W. SELESNICK: "Wavelet Transform with Tunable Q-Factor", IEEE , TRANSACTIONS ON SIGNAL PROCESSING, August 2011 (2011-08-01), XP011370222 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108765322A (en) * 2018-05-16 2018-11-06 上饶师范学院 Image de-noising method and device
CN108765322B (en) * 2018-05-16 2021-04-27 上饶师范学院 Image denoising method and device
CN113488074A (en) * 2021-08-20 2021-10-08 四川大学 Long-time variable Q time-frequency conversion algorithm of audio signal and application thereof
CN113488074B (en) * 2021-08-20 2023-06-23 四川大学 Two-dimensional time-frequency characteristic generation method for detecting synthesized voice

Similar Documents

Publication Publication Date Title
US10319390B2 (en) Method and system for multi-talker babble noise reduction
Das et al. Fundamentals, present and future perspectives of speech enhancement
Kim et al. An algorithm that improves speech intelligibility in noise for normal-hearing listeners
Wang Time-frequency masking for speech separation and its potential for hearing aid design
Healy et al. An algorithm to improve speech recognition in noise for hearing-impaired listeners
Vincent et al. Performance measurement in blind audio source separation
Kates et al. The hearing-aid speech perception index (HASPI) version 2
Stern et al. Hearing is believing: Biologically inspired methods for robust automatic speech recognition
EP1580730B1 (en) Isolating speech signals utilizing neural networks
Shivakumar et al. Perception optimized deep denoising autoencoders for speech enhancement.
Lai et al. Multi-objective learning based speech enhancement method to increase speech quality and intelligibility for hearing aid device users
Gopalakrishna et al. Real-time automatic tuning of noise suppression algorithms for cochlear implant applications
Monaghan et al. Auditory inspired machine learning techniques can improve speech intelligibility and quality for hearing-impaired listeners
Hummersone A psychoacoustic engineering approach to machine sound source separation in reverberant environments
Soleymani et al. SEDA: A tunable Q-factor wavelet-based noise reduction algorithm for multi-talker babble
Diehl et al. Restoring speech intelligibility for hearing aid users with deep learning
Edraki et al. Spectro-temporal modulation glimpsing for speech intelligibility prediction
JP4496378B2 (en) Restoration method of target speech based on speech segment detection under stationary noise
Patil et al. Marathi speech intelligibility enhancement using i-ams based neuro-fuzzy classifier approach for hearing aid users
WO2017143334A1 (en) Method and system for multi-talker babble noise reduction using q-factor based signal decomposition
Hossain et al. On the feasibility of using a bispectral measure as a nonintrusive predictor of speech intelligibility
Mesgarani et al. Denoising in the domain of spectrotemporal modulations
Ma et al. A modified Wiener filtering method combined with wavelet thresholding multitaper spectrum for speech enhancement
CN116312561A (en) Method, system and device for voice print recognition, authentication, noise reduction and voice enhancement of personnel in power dispatching system
Lobdell et al. Intelligibility predictors and neural representation of speech

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17754030

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17754030

Country of ref document: EP

Kind code of ref document: A1