WO2010091339A1

WO2010091339A1 - Method and system for noise reduction for speech enhancement in hearing aid

Info

Publication number: WO2010091339A1
Application number: PCT/US2010/023463
Authority: WO
Inventors: Miodrag Bolic; Martin Bouchard; Frédéric MUSTIÈRE
Original assignee: University Of Ottawa
Priority date: 2009-02-06
Filing date: 2010-02-08
Publication date: 2010-08-12

Abstract

A method and system for reducing noise in an audio signal composed of speech and noise components. The method includes decomposing an audio signal into a plurality of subbands, wherein the audio signal is pre-enhanced by processing with a noise reduction algorithm before or after decomposing to provide an enhanced audio signal having audio components comprising enhanced speech components and residual noise components; processing each of the subbands of the enhanced audio signal by scaling the audio components via a scaling factor for each subband to provide a processed subband audio signal with reduced residual noise components; and reconstituting the processed subband audio signal into an output audio signal having enhanced speech components and reduced residual noise components.

Description

METHOD AND SYSTEM FOR NOISE REDUCTION FOR SPEECH ENHANCEMENT IN HEARING AID

This application claims priority to provisional application No. 61/150354, filed February 6, 2009 which is incorporated by reference herein in its entirety

FIELD OF THE INVENTION

The invention relates to improvements in noise reduction systems and methods for sound reproducing systems, such as hearing aids.

BACKGROUND OF THE INVENTION

Hearing devices are wearable hearing apparatus used to provide assistance those with impaired hearing. To meet the numerous individual requirements different designs of hearing device are provided, such as behind-the-ear hearing devices, with an external earpiece and in- the-ear hearing devices e.g. also Concha or in-canal hearing devices. The typical configurations of hearing device are worn on the outer ear or in the auditory canal. Above and beyond these designs however there are also bone conduction hearing aids, implantable or vibro-tactile hearing aids available on the market. In such hearing aids the damaged hearing is simulated either mechanically or electrically.

Hearing devices principally have as their main components an input converter, an amplifier and an output converter. The input converter is as a rule a sound receiver, e.g. a microphone, and/or an electromagnetic receiver, e.g. an induction coil. The output converter is mostly implemented as an electro acoustic converter, e.g. a miniature loudspeaker or as an electromechanical converter, e.g. bone conduction earpiece. The amplifier is usually integrated into a signal processing unit. This basic structure is shown in FIG. 4, using a behind-the-ear hearing device as an example. One or more microphones 2 for recording the sound from the surroundings are built into a hearing device housing 1 worn behind the ear. A signal processing unit 3, which is also integrated into the hearing device housing 1, processes the microphone signals and amplifies them. The output signal of the signal processing unit 3 is transmitted to a loudspeaker or earpiece 4 which outputs an acoustic signal. The sound is transmitted, if necessary via a sound tube, which is fixed with an otoplastic in the auditory canal, to the hearing device wearer's eardrum. The power is supplied to the hearing device and especially to the signal processing unit 3 by a battery 5 also integrated into the hearing device housing 1.

One of the biggest challenges in speech enhancement is the tradeoff between the amount of noise reduction and the intelligibility of the resulting speech signal. While aggressive enhancement algorithms may be able to remove a large amount of background noise and increase significantly some objective scores, it is common that the output speech is eventually found to be less intelligible than the original noisy speech, which is a strong penalty for sensitive applications such as hearing aid devices.

Hearing aid manufacturers have implemented various technologies to address noise. For example, some hearing aids may attempt to boost gain in frequency subbands with low noise while reducing gain in frequency subbands with high noise. One problem with this frequency- gain approach is that desired signals may be attenuated along with noise signals. Another problem with many frequency-gain approaches to dealing with noise is the inaccuracy of traditional algorithms for detecting which frequency subbands contain noise. In other words, many traditional algorithms may be somewhat ineffective in distinguishing between noise signals and desired signals.

Thus, there is a need for improved hearing aids as well as improved techniques for implementing noise reduction in hearing aids.

SUMMARY OF THE INVENTION

The present invention provide a noise reduction system and method, which leads to improved speech intelligibility. Embodiments may be directed to methods for reducing noise. Some embodiments may also be directed to hearing aid devices configured to reduce noise.

In at least one embodiment, a computer-implemented method for reducing noise in an audio signal composed of speech and noise components may comprise (a) decomposing an audio signal into a plurality of subbands, wherein the audio signal is pre-enhanced by processing with a noise reduction algorithm before or after decomposing to provide an enhanced audio signal having audio components comprising enhanced speech components and residual noise components; (b) processing each of the subbands of the enhanced audio signal by scaling the audio components via a scaling factor for each subband to provide a processed subband audio signal with reduced residual noise components; and (c) reconstituting the processed subband audio signal into an output audio signal having enhanced speech components and reduced residual noise components.

In a further embodiment, the audio signal is received from an input device of a hearing aid. Moreover, the scaling is performed on a frame-by -frame basis for the subbands depending on an assumed level of residual noise wherein the assumed level of residual noise is based on an estimate of the Signal-to-Residual Noise-Ratio (SRNR).

In still further embodiments, the scaling comprises, for an expected subband speech level ^a , scaling of low amplitude audio components on a relative basis. For example, at low instantaneous Signal-to- Noise-Ratio (SNR), scaling is more severe towards low-amplitude audio components and conversely, at high SNR, scaling is avoided for low amplitude audio components. A discrimination rule for scaling may be applied such that below a certain subband speech level in a particular subband and if an input instantaneous fullband SNR is low, the audio components are scaled down.

According to certain further embodiments, a hearing aid may comprise a signal processing unit adapted to receive an input signal and apply a hearing aid gain to the input signal to produce an output signal, wherein the signal processing unit comprises code devices for decomposing an audio signal into a plurality of subbands, wherein the audio signal is pre- enhanced by processing with a noise reduction algorithm before or after decomposing to provide an enhanced audio signal having audio components comprising enhanced speech components and residual noise components; processing each of the subbands of the enhanced audio signal by scaling the audio components via a scaling factor for each subband to provide a processed subband audio signal with reduced residual noise components; and reconstituting the processed subband audio signal into an output audio signal having enhanced speech components and reduced residual noise components.

According to certain further embodiments, a computer program product fixed on a tangible medium and/or executed on a hearing device for a noise reduction, includes a computer program that implements the methods herein.

Further specific variations of the invention are defined herein. Other aspects and advantages of the present invention will become more apparent from the following detailed description taken in conjunction with the accompanying drawings which illustrate, by way of example, the principles of the invention. BRIEF DESCRIPTION OF THE DRAWINGS

The invention is explained in the following description in view of the drawings that show:

FIG. 1 shows a post-processing scheme, if the pre-enhancement does not take place in the subband domain

FIG. 2 shows a post-processing scheme, if the pre-enhancement takes place in the subband domain

FIG. 3 shows an effect of the post-processor wherein a top graph shows the initial noisy speech, the second graph is a clean signal, next is the pre-enhanced signal, and at the bottom the post-processed signal. The noise reduction can be clearly seen, and the speech parts with lower amplitude are not affected either.

FIG. 4 shows a basic structure of a hearing device in which the method may be implemented.

DETAILED DESCRIPTION OF THE INVENTION

The present invention to provide an adaptive noise cancelling system, which leads to improved speech intelligibility. The invention further provides a method and a system for reducing noise, as well as a computer program product.

A hearing aid comprises at least one microphone, a signal processing unit and an output transducer. The signal processing unit is adapted to receive an input signal from the microphone. The signal processing means is adapted to apply a hearing aid gain to the input signal to produce an output signal to be output by the output transducer, and the signal processing means comprises means for adjusting the hearing aid gain calculated for the hearing aid.

The method and system herein provides a technique for the reduction of background residual noise as a post-processor for non-aggressive speech enhancement algorithms. The method keeps the beneficial characteristics of such algorithms, and then uses both the noisy and pre-enhanced signals to remove the remaining noise in such a way that the speech is the least possibly affected. The proposed method comprises first decomposing a pre-enhanced signal into frequency bands, and then operating on the downsampled subband time series by softly scaling down their low-energy segments, provided they occur at low estimated SNR. In simple terms, the method comprises scaling, on a frame-by-frame basis, the subband pre-enhanced signals depending on an assumed level of residual noise. The method is tested herein with three types of enhancement algorithms: a spectral subtractive method, a Minimum Mean Squared Error log-spectral amplitude estimator, and a Kalman Filter-based scheme. In various real-world noise environments, the post-processor is found to consistently reduce background noise, with no apparent loss of intelligibility between the pre-enhanced and the final output speech signals, as reported by several objective measure scores and informal listening experiments.

One of the central issues in speech enhancement is the tradeoff between noise reduction and intelligibility, and it is in fact rare for a method to effectively improve intelligibility in several experimentation conditions. Rather than trying to improve intelligibility, practitioners usually set the more reasonable goal of at least not affecting it in the noise removal process. In sensitive applications where intelligibility and naturalness are important, non-aggressive setups for speech enhancement algorithms are thus privileged.

The post-processing technique herein addresses the following objectives: (1) Removing additional background noise while retaining the positive features of (pre)enhanced speech (i.e. intelligibility, low distortion, naturalness, etc), (2) providing a simple and efficient implementation.

Both objectives are treated here with equal importance. Indeed, if the second objective is not respected, one might as well rework and upgrade the pre-enhancement scheme. On the other hand, if the first objective can be attained with very small additions, then the appeal is more significant for real-world applications already employing certain well-established algorithms. Herein, the objective of the post-processor is not enhancement per se, but rather noticeable background noise removal.

In an embodiment, the method comprises "turning down the volume" when too much noise is present. Practically speaking, the above principle understates that there exists a reliable rule to discriminate speech and residual noise components. Note first that even in ideal conditions, it is not desirable to apply such volume-scaling in a fullband setup, as it would perceivably modulate the amplitude of the signal in a disturbing manner, and possibly affect some unvoiced parts of the speech with small energy. Thus, the method is chosen to be applied in the subband domain. In summary, the goal is to determine, for a given pre-enhanced frame an appropriate scaling factor so as to satisfy the problem requirements.

Suppose that the average expected level ^a for speech components in a given subband is known. The speech/noise discrimination rule is then chosen to follow two easily measurable quantities: the signal's amplitude within particular subbands and the global, instantaneous fullband SNR. In very simple terms, the entire scheme can be summarized as follows: below a certain level, and especially if the input SNR is low, the observed components are likely to be noise-like and must be scaled down.

Slightly more rigorously, the following two rules can be written: (/) Relatively to the expected subband speech level ^a , low amplitudes should be scaled down, but (U) Assuming that the amount of residual noise directly depends on the input SNR, at low instantaneous SNR the scaling should be more severe towards low-amplitude components and conversely, at high SNR low amplitude components should be spared.

The fullband SNR is chosen as reference rather than individual subband SNRs for two reasons: first for simplicity, and secondly because in many situations the "local" subband SNR is found to be a poor indicator of the global SNR and thus some low-amplitude speech components that are still important for intelligibility are more at risk of being scaled down (this tendency was confirmed in practical tests as well).

Thus, the method is chosen to be applied in the subband domain, and is formally described below using the accompanying Figures 1 and 2. First of all, one of the main assumptions is that the pre-enhancement algorithm is "nonaggressive", in the sense that the speech signal is left as intact as possible, while the noise is still present but has been decreased to a smaller energy than the speech.

The above scheme assumes the availability of an expected subband speech level as further described herein.

The assumption that we have direct access to a form of SNR estimate is not far-fetched, since this is a building block for many enhancement algorithms - if not available, it is possible to obtain an estimate by comparing the noisy and pre-enhanced spectra for example. The procedure described above is now formally presented in three stages, and then further explanations are given showing the direct links with rules /^' and ii above.

The method then follows the three stages described below. Stages 1 and 3 are classical subband decomposition/decimation/reconstitution.

Stage 1 proceeds as follows: Following Figure 1, from the noisy speech signal at time k, z(k), the pre-enhancement algorithm, symbolized by/(.), produces the pre-enhanced signal y(k) =f(z(k)). Next, y(k) is decomposed into overlapping frames of length N , with the / ^Λ frame denoted by y(n,i) and n = 1 ... N. Then, the frames are decomposed into M subbands, with the /n^A corresponding "subframe" denoted by y_m(l,i) and / =1...N/M . Note that all of the above might very well already be part of the pre-enhanced algorithm - that is, the post-processing may directly have access Io y_m(l,i), as in Figure 2.

Stage 2 proceeds as follows: Let ^SNR<^f) be an estimated signal-to-noise ratio for the ' * frame. With ^ ^ denoting the pre-enhanced, decimated speech vector at subband ^m ,

being the current energy in the pre-enhanced subband segment and ⁰^ being a constant band-dependent threshold (the choice of which is discussed below), the following rule is applied to y*"^>'' to obtain the post-processed enhanced series ^x»⁽'''⁾;

Stage 3 proceeds as follows: reconstruct the post-processed estimated clean speech ,from ,

Implementation of Stages 1 and 3 may include any classical subband decomposition/ decimation/ reconstitution techniques as known in the art.

Before detailing the choice of O_n, let us briefly explain the rationale behind the two steps

(a) and (b) presented above in stage 2. Beginning with b) and first assuming that

, this step basically involves scaling down the subband frame if its energy is found to be lower than a certain value O_n, .

The scaling is linear, and clearly the lower the energy, the lower the scaling factor

The introduction of the other frame-dependent constant & at step (a) is in direct relationship with rule //, and is important for the cases where the input speech is of low amplitude to begin with, but still high comparatively to the noise - which can occur for example at speech onsets or for quiet speakers: in such cases the effective threshold must be appropriately lowered so as not to risk damaging the speech. Regarding speech onsets, the fact that the signal is scaled based on the energy of an entire frame and not on a sample-by-sample basis is also meant to minimize the potential damage inflicted to the clean speech.

In our experience, ^a"> depends on the type of subband decomposition used, on the amount of bands (and obviously on the subband frame size ^N IM _sj_nce \χ j_s compared to the quantity ^EΛ0 ) F₀₁- linearly spaced bands, and more specifically a near-perfect pseudo-QMF decomposition, we find that for an input noisy signal with maximum amplitude of 1 and Λ/ = 16_} good performance is obtained by letting"™ be proportional to ^{m 2}, i.e.,

(i.e., the expected energy in each subband decreases as the square of the subband index, which appears to be a reasonable assumption considering long-term spectral averages of speech in 16 equally- spaced bands, as shown for example in Figure 2), thereby making K be the only required value to "tune" to select the aggressiveness of the post-processor.

In further embodiments, this type of subband-signal scaling method can be applied as part of a "full" subband speech enhancement algorithm (as opposed to a mere "post-processor" as in this section), where subband scaling factors are applied to the incoming noisy speech, and are determined from a VAD-based estimation of the a posteriori SNR. It may be assumed that the scaling factors are to be applied to pre-enhanced subband speech signals (and thus that an estimate of the SNR is also accessible). In addition, each subband domain signal (i.e., each of the decimated signals at the outputs of the filters of the filterbank) may be real-valued and can locally be viewed as time-domain signals.

Essentially, with

denoting the pre-enhanced decimated speech vector at subband ^m at the 7^th frame, we propose to perform the following scaling of

based on an estimate of the Signal-to-Residual Noise-Ratio, denoted here by ^^wW _{? to} obtain the post-processed enhanced series ^mV ' ' as follows:

Obviously, the above requires the knowledge of

_wni_cn is difficult to accurately estimate as it strongly depends on the method/algorithm used and on the noise conditions. Nevertheless, a practical solution comprises estimating it from

(assumed to be known from the pre-enhancement stage) - the two are indeed strongly correlated. For this purpose, several methods can be envisioned: For example, using various training data obtained specifically using the chosen pre-enhancement algorithm; some mathematical relationship (e.g. linear regression) between the two sets of subband SNRs could be obtained. Results can be obtained by using the simple following rule:

that is, the practical value used to represent the residual noise ratio in each subband is simply taken as the maximum between the fullband estimated SNR and the current subband estimated SNR. The rationale for incorporating the fullband SNR was initially based on the observation that in many situations the "local" subband SNR is found to be in discordance with the fullband SNR and thus some low-amplitude speech components that are still important for intelligibility are more at risk of being filtered out. Note also that from the equation above we necessarily have

_wni_ch does not contradict the expected effect of the pre- enhancement scheme. In practice, to further account for the effect of pre-enhancement, the introduction of a constraint P ^~ is also beneficial, so as to obtain the final rule:

( )

In this implementations, P is set to * ^■ ^ . The use of the equation above allows for a very low-cost post-processing.

The method herein takes advantage of the available pre-enhanced signal for which speech and noise have already been discriminated to a certain extent. Based on our initial assumption that the goal of the pre-enhancement algorithm is to try not to degrade intelligibility, the overall approach to noise reduction is much less aggressive.

With the pre-enhancement handling the speech/noise discrimination, our post-processing method leads overall to a more robust enhancement scheme, in that it can perform thresholding with less risks. This is especially the case since we are taking the estimated SNR into account. The method applies uniform scaling to overlapping frames, which is less prone to perceivable "sudden volume changes" artifacts that a sample-by -sample volume scaling. The method above may be implemented as a module to existing schemes, and is not resorting to wavelet packet transformations.

In terms of computational complexity, it can be readily seen that the proposed method is extremely low-cost, especially if the pre-enhancement scheme is already frame-based and employing sub bands, in which case only one extra equation per band must be applied. The subband method resorting to "fully discretized" noise PSD lends itself very well to psychoacoustic treatment Hereafter, a way to include perceptual constraints as part of this method is described. The idea is similar to that shown in the KF case; the differences are mainly related to the fact that we are only applying the constraints under certain risk-related conditions to avoid damage to the speech in complex noise conditions.

The central tool is the estimated masking threshold of the clean speech. The masking threshold of a signal represents, in the frequency domain, the level/curve below which nothing is audible in the presence of the particular signal being studied. In the ISO MPEG-I layer 1 psychoacoustic model 1, a technique to compute such an estimate of a signal's masking threshold is elaborated. In the context of MPEG coding, this is useful to determine how much quantization noise can be introduced while remaining imperceptible.

Note that an estimated clean signal is to begin with used to compute the threshold; in practice a rough clean speech estimate (obtained via spectral subtraction for example) can provide results almost as good as when the true clean speech is available. In addition, the distinct estimate used can be further used to improve the overall quality by combining it with the state- space algorithm's final estimate.

In the algorithm, the masking threshold is used as follows. In a given frame, once the

P P noise power * has been estimated and the speech power and the masking threshold ^x and

T have been calculated (based on the prior spectral subtractive estimate), first in each band the average level of each of the above quantities is calculated (yielding

and the following two rules are applied: then the current data frame is left unprocessed

) then the enhancement is made more aggressive by purposely overestimating the corresponding observation variance in the state-space model. The first rule is based on the assumption that if the noise component in band ^m is to begin with masked by the speech, then there is no need to perform any noise removal. Next, in the second rule, if the speech component is inaudible but some noticeable noise is present in band ^m , the enhancement takes place in a more aggressive manner.

Note that the above technique can naturally be followed by the one given above, this yields very good results. Using such conservative rules allows for a less risky solution - and in turn for a more robust solution in nonstationary noise, as compared to where the scaling is done based only on the masking threshold level

Following are examples testing the procedures of the invention. These examples should not be construed as limiting. Simulation results obtained from the application of the above method follow:

1. The decomposition into 4 bands. In the following, an algorithm "X" employing such a decomposition are coded as "4B-X".

2. The decomposition into 32 bands presented. In the following, algorithms "X" employing such a decomposition are coded as "32B-X".

3. The standalone (internal PSD estimation) particle filtering solution shown is tested on the RBPF algorithm, and denoted by 32B-RBPF(_Standaione).

4. The post-processing method. Algorithm "X" used with this method is denoted by "X- Post".

5. The application of psychoacoustic constraints. For an algorithm "X" resorting to this technique, the code "Ψ-32B-X" is used.

In addition, a combination of psychoacoustic constraining and post-processing is tested as well, with code "Ψ-32B-X-Post". The algorithms used are the DKF, the KEM_Bur_g, the RBPF, the DEKF4, the DUKFl, the KEM and the NPF. For the first three, all the above flavours are tested (except the standalone solution for the non-RBPF ones), and for the last four only the "Ψ- 32B-X-Post" results are published, for the reasons discussed below.

In Tables A and B, we assess the average benefits of using each of the techniques across several algorithms. This is done by showing the average difference of scores obtained across the first three algorithms for all types of colored noise in VL/L and M/H conditions (respectively) are given. With reference to algorithm "X", the differences shown are those between each of the scores obtained by "4B-X", "32B-X", "32B-X-PoSt", "T-32B-X", and "Ψ-32B-X-Post" and the scores obtained from the fullband application of algorithm "X".

In Tables C and D, the 7 individual algorithms are compared in the context of a "Ψ-32B- X-Post" setup, by averaging the scores obtained for all types of colored noise in VL/L and M/H conditions (respectively).

Table A: Estimation of the average benefits obtained by using the subband-based techniques presented in this chapter, in the context of VL/L colored noise conditions. "X" is a generic letter to designate an algorithm to which the techniques are applied - the averages were obtained with 3 algorithms.

Table B: Estimation of the average benefits obtained by using the subband-based techniques presented in this chapter, in the context of M/H colored noise conditions. "X" is a generic letter to designate an algorithm to which the techniques are applied - the averages were obtained with 3 algorithms.

From the results shown in Tables A and B, several conclusions can be made.

First, in VL/L conditions, there are rather clear advantages in using subband methods as opposed to fullband ones, especially so for the 32 bands case. This is all the more obvious when considering the fact that psychoacoustic constraining and post-processing can be readily applied and also provide non-negligible improvements. Next, it is interesting to note from the bottom rows of Tables A and B that, while the SNR and ASNR scores are lower when internal noise estimation is used, the rest of the measures are not far from those obtained with dedicated, external noise PSD estimation. From informal listening tests, we find that the subband methods are unambiguously better in terms of background noise reduction especially. In particular, it is also noticeable that the "Ψ-32B-X", and "Ψ-32B-X-Post" methods yield a higher signal quality and a better intelligibility. Moreover, we find that the "standalone" method performing internal noise estimation achieves less noise reduction but still preserves well the speech naturalness. Still, this method remains interesting in terms of complexity since the internal noise estimation only adds a marginal amount of computations per particle.

Regarding medium to high SNR conditions, the results are relatively more contrasted, in the sense that the 4 bands solution actually yields slightly worse results, in that it marginally penalizes each objective measures. However, recall that there are still advantages in terms of computational requirements, and thus the 4 bands treatment is still an appealing alternative when compared to fullband processing. On the other hand, the 32 bands case again provides significant advantages when coupled with psychoacoustic constraining and post-processing. In fact, even without any additional scheme, with 32 bands the WPESQ score is improved on average by 0.14 units. Careful listening of the enhanced signals yields observations that are in accordance with the above findings. For instance, it is difficult to differentiate the fullband and the 4 bands case, but improvements become more noticeable with 32 bands, especially with the reduction of background noise.

Finally, in Tables C and D the average scores obtained by each individual algorithm in a "Ψ-32B-X-Post" configuration are shown.

Table C: Comparison between the average scores obtained from using the 7 listed algorithms in VL/L colored noise situations and a "Ψ-32B-X-Post" setup.

Table D: Comparison between the average scores obtained from using the 7 listed algorithms in VL/L colored noise situations and a "Ψ-32B-X-Post" setup.

In the VL/L case, two "groups" of algorithms can be formed: First the DKF, NPF, KEMβ_urg, and RBPF; and secondly the DEKF, DUKF, and KEM, all with markedly lower scores than the algorithms from the first group. Quite interestingly, in this setup it turns out that the very simple DKF algorithm yields the best CSII, WPESQ, Csig (ex-aequo with the NPF), and Cbak scores - and second-best ASNR, Covl scores. The NPF, KEM_Burg, and RBPF still obtain very close (and some better) results (for example the Covl score for the NPF). Still, according to the objective scores the "Ψ-32B-DKF-Post" algorithm may very well be the best subband option in VL/L conditions.

Informal listening tests result in remarks that are in accordance with the above findings. However, while we are able to confirm that the first "group" of algorithms perform significantly better than the second group, we also find that the DKF, NPF, KEM_Bur_g, and RBPF are relatively difficult to tell apart. Nevertheless, while the DKF is able to remove a slightly larger amount of noise, the NPF overall tends to sound more natural.

In M/H conditions, the same algorithms can be separated into two groups. This time however, the RBPF and NPF both stand out - although the DKF is not far behind. Our subjective impressions, from listening to the enhanced speech files, agree with the above, but we also find that RBPF and NPF are this time more distinguishable from the rest, with crisper and higher quality speech.

Following are additional examples testing the procedures of the invention. These examples should not be construed as limiting.

In order to assess the benefits of using the proposed postprocessor, it was appended to three different algorithms and measured objectively the differences obtained in quality, while also reporting on the results of informal listening tests. The three algorithms all resort to frame- based background noise spectrum estimation, and are the following:

(I) A multiband spectral subtraction scheme, referred to as MSSUB below,

(2) An subband implementation of the Minimum Mean Squared Error log-spectral amplitude estimator, (LMMSE), and

(3) A subband Kalman Filter-based scheme using an EM algorithm to determine the clean AR coefficients, and approximating the noise to be white in each band (i.e., the noise spectrum is discretized in each band to a single value), which will be referred to as KEM.

For each of the algorithms, the output of the background noise estimator is slightly modified so as to provide an underestimate for the noise level, thereby making each pre- enhancement less aggressive and helping to preserve the speech intelligibility. For the postprocessor, we use a pseudo-QMF filterbank withM =16, frames of length N= 512 with 50% overlaps, and K = 0.015.

In our implementation, we found that such a choice for K yields the most noise reduction with the least effect on the speech signal, and so for various speakers and speech levels. The clean speech signal, sampled at 20 kHz, is obtained by concatenating multiple speakers (male and female) taken from the ΗMIT database and inserting silences so as to obtain a 60% activity rate, so as to reach a length of approximately 30 seconds. The noise data was obtained online from the following page: http://spib.rice.edu/spib/select noise.html, containing examples from the NOISEX-92 database: namely the babble, factory, military vehicle and car interior noises were used. In each case, the noisy speech signals were created by adding these noises to the clean speech, and scaling them with 3 different scales so as to obtain various conditions, from low to high input SNR. Thus, in total 12 different conditions were tested for 3 different algorithms.

The objective quality measures used are the Average segmental Signal-to-Noise Ratio (referred to ASNR hereafter) and the Coherence Speech Intelligibility Index (CSII). The choice of these objective measures is based on the following considerations: first, the ASNR is mostly correlated with the level of background noise intrusiveness, and thus it is consistent with our objective of reducing the residual noise. Next, the CSII, which can range from 0 to 1, is an index that is found to be an accurate predictor of speech intelligibility, (again an important criteria for our work).

The results are now shown in Tables 1, 2, and 3, each corresponding to one of the three algorithms.

Table 1 shows results obtained using the multi band spectral subtraction method. The scores reported are ASNR/CSII.

Table 1.

Table 2 shows results obtained using the LMMSE method. The scores reported are ASNR/CSII.

Table !

Table 3 shows results obtained using the KEM method. The scores reported are ASNR/CSII

Table 3.

The noise reduction can be clearly seen, and the speech parts with lower amplitude are not affected either. Figure 3 shows an example of the waveforms obtained with the LMMSE algorithm under babble noise conditions, for which the effect of the post-processor can be clearly viewed: the parts where speech is very present are only minimally affected, but as soon as noisy parts are present the scaling process is effective. Notice particularly that the parts with low speech amplitude are still kept intact. The results in Tables 1, 2, and 3 are now commented. First of all, it is clear from the ASNR reading, from simply observing waveforms such as the one shown in Figure 3, and from informal listening tests that the proposed post-processor is able to remove a significant amount of background noise.

This is particularly noticeable when no speech is present, but it can also be heard during speech utterances, especially when the original noise contains high frequencies. Next, observe that the CSII scores are almost identical before and after the processing, with a few isolated cases where post-processing negligibly improves it or decreases it (by ± 0.01). The objective of not damaging the intelligibility of the input speech is therefore achieved, which is also what we find in the informal listening tests. As an additional remark, note how the actual speech intelligibility is in fact still moderately affected by the enhancement algorithms for the higher input SNR conditions, whether or not post-processing is used. This is not a surprising observation: when there is, to begin with, not enough noise to impede the intelligibility at all in the noisy speech, then any processing can jeopardize the output intelligibility.

It was observed that the KEM algorithm performs better than the other two in babble noise, whereas for the car interior noise, the clear winner is the multiband spectral subtraction scheme. For the other two types of noise, the better performances are distributed among the three algorithms, depending on the input SNR.

In sum, the invention provides a very simple and low-complexity add-on to speech enhancement algorithms, which can reduce the excess of residual noise in the enhanced speech without affecting intelligibility. The method is particularly advantageous when the enhancement algorithm used operates in subbands, in which case the additional complexity is minimal.

The noise reduction system and method according to the invention can be utilized in a hearing aid or in a cochlear implant, which comprises a digital signal processor (DSP). In this way the system and method can be integrated or is feasible in hearing aids and cochlear implants without increasing the size of the instrument.

The invention may be implemented in hardware or software, or a combination of both {e.g., programmable logic arrays). Unless otherwise specified, the processes included as part of the invention are not inherently related to any particular computer or other apparatus. In particular, various general-purpose machines may be used with programs written in accordance with the teachings herein, or it may be more convenient to construct more specialized apparatus (e.g., integrated circuits) to perform the required method steps. Thus, the invention may be implemented in one or more computer programs executing on one or more programmable computer systems each comprising at least one processor, at least one data storage system (including volatile and non-volatile memory and/or storage elements), at least one input device or port, and at least one output device or port. Program code is applied to input data to perform the functions described herein and generate output information. The output information is applied to one or more output devices, in known fashion. Each such program may be implemented in any desired computer language.

Computer program code for carrying out operations of the invention described above may be written in a high-level programming language, such as C or C++, for development convenience. In addition, computer program code for carrying out operations of embodiments of the present invention may also be written in other programming languages, such as, but not limited to, interpreted languages. Some modules or routines may be written in assembly language or even micro-code to enhance performance and/or memory usage. It will be further appreciated that the functionality of any or all of the program modules may also be implemented using discrete hardware components, one or more application specific integrated circuits (ASICs), or a programmed digital signal processor or microcontroller. A code in which a program of the present invention is described can be included as a firmware in a RAM, a ROM and a flash memory. Otherwise, the code can be stored in a tangible computer-readable storage medium such as a magnetic tape, a flexible disc, a hard disc, a compact disc, a photo-magnetic disc, a digital versatile disc (DVD).

While various embodiments of the present invention have been shown and described herein, it will be obvious that such embodiments are provided by way of example only. Numerous variations, changes and substitutions may be made without departing from the invention herein. Accordingly, it is intended that the invention be limited only by the spirit and scope of the appended claims.

Claims

CLAIMSThe invention claimed is:

1. A computer-implemented method for reducing noise in an audio signal composed of speech and noise components, the computer-implemented method comprising: decomposing an audio signal into a plurality of subbands, wherein the audio signal is pre-enhanced by processing with a noise reduction algorithm before or after decomposing to provide an enhanced audio signal having audio components comprising enhanced speech components and residual noise components; processing each of the subbands of the enhanced audio signal by scaling the audio components via a scaling factor for each subband to provide a processed subband audio signal with reduced residual noise components; and reconstituting the processed subband audio signal into an output audio signal having enhanced speech components and reduced residual noise components.

2. The method of claim 1 wherein the audio signal is received from an input device of a hearing aid.

3. The method of claim 1 wherein the scaling is performed on a frame-by-frame basis for the subbands depending on an assumed level of residual noise.

4. The method of claim 3 wherein the assumed level of residual noise is based on an estimate of the Signal-to-Residual Noise-Ratio (SRNR).

5. The method of claim 1 wherein the scaling comprises, for an expected subband speech level ^a , scaling of low amplitude audio components on a relative basis.

6. The method of claim 1 wherein, at low instantaneous Signal-to- Noise-Ratio (SNR), scaling is more severe towards low-amplitude audio components and conversely, at high SNR, scaling is avoided for low amplitude audio components.

7. The method of claim 1 wherein a discrimination rule for scaling is applied such that below a certain sub band speech level in a particular subband and if an input instantaneous fullband SNR is low, the audio components are scaled down.

8. The method of claim 1 wherein the noise reduction algorithm comprises a non- aggressive algorithm in which the speech component is left substantially intact while the noise component is present but decreased to a smaller energy than the speech component.

9. The method of claim 1 wherein processing each of the subbands of the enhanced audio signal comprises: letting be an estimated signal-to-noise ratio for the ' * frame, with y

denoting a pre-enhanced , decimated speech vector at subband ^m ,

with

being a current energy in the pre-enhanced subband segment, and with ^α<" being a constant band-dependent threshold; obtaining a post-processed enhanced series ^x*"'¹' by applying a rule to ^^»"'') as follows:

10. A hearing device, comprising: a signal processing unit adapted to receive an input signal and apply a hearing aid gain to the input signal to produce an output signal, wherein the signal processing unit comprises code devices for decomposing an audio signal into a plurality of subbands, wherein the audio signal is pre-enhanced by processing with a noise reduction algorithm before or after decomposing to provide an enhanced audio signal having audio components comprising enhanced speech components and residual noise components; processing each of the subbands of the enhanced audio signal by scaling the audio components via a scaling factor for each subband to provide a processed subband audio signal with reduced residual noise components; and reconstituting the processed subband audio signal into an output audio signal having enhanced speech components and reduced residual noise components.

11. The device of claim 10 wherein the audio signal is received from an input device of a hearing aid.

12. The device of claim 10 wherein the scaling is performed on a frame-by -frame basis for the subbands depending on an assumed level of residual noise.

13. The device of claim 12 wherein the assumed level of residual noise is based on an estimate of the Signal-to-Residual Noise-Ratio (SRNR).

14. The device of claim 10 wherein the scaling comprises, for an expected subband speech level ^a , scaling of low amplitude audio components on a relative basis.

15. The device of claim 10 wherein, at low instantaneous Signal-to- Noise-Ratio (SNR), scaling is more severe towards low-amplitude audio components and conversely, at high SNR, scaling is avoided for low amplitude audio components.

16. The device of claim 10 wherein a discrimination rule for scaling is applied such that below a certain subband speech level in a particular subband and if an input instantaneous fullband SNR is low, the audio components are scaled down.

17. The device of claim 10 wherein the noise reduction algorithm comprises a non- aggressive algorithm in which the speech component is left substantially intact while the noise component is present but decreased to a smaller energy than the speech component.

18. The device of claim 10 wherein processing each of the subbands of the enhanced audio signal comprises: letting

fo an estimated signal-to-noise ratio for the ' ^Λ frame, with

denoting a pre-enhanced , decimated speech vector at subband ^m ,

with

being a current energy in the pre-enhanced subband segment, and with ^a>n being a constant band-dependent threshold; obtaining a post-processed enhanced series ^x"^' by applying a two-part rule to as follows:

19. A computer program product executed on a hearing device for a noise reduction, comprising: a computer program that: decomposes an audio signal into a plurality of subbands, wherein the audio signal is pre-enhanced by processing with a noise reduction algorithm before or after decomposing to provide an enhanced audio signal having audio components comprising enhanced speech components and residual noise components; processes each of the subbands of the enhanced audio signal by scaling the audio components via a scaling factor for each subband to provide a processed subband audio signal with reduced residual noise components; and reconstitutes the processed subband audio signal into an output audio signal having enhanced speech components and reduced residual noise components.

20. The computer program product of claim 19 wherein processing each of the subbands of the enhanced audio signal comprises: letting ^SNR(') be an estimated signal-to-noise ratio for the ' * frame, with

denoting a pre-enhanced , decimated speech vector at subband ^m ,

with

being a current energy in the pre-enhanced subband segment, and with ^a"> being a constant band-dependent threshold; obtaining a post-processed enhanced series

by applying a two-part rule to

as follows: