WO2010091339A1 - Method and system for noise reduction for speech enhancement in hearing aid - Google Patents

Method and system for noise reduction for speech enhancement in hearing aid Download PDF

Info

Publication number
WO2010091339A1
Authority
WO
WIPO (PCT)
Prior art keywords
enhanced
components
audio signal
noise
speech
Prior art date
Application number
PCT/US2010/023463
Other languages
French (fr)
Inventor
Miodrag Bolic
Martin Bouchard
Frédéric MUSTIÈRE
Original Assignee
University Of Ottawa
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University Of Ottawa filed Critical University Of Ottawa
Publication of WO2010091339A1 publication Critical patent/WO2010091339A1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0316 Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude
    • G10L21/0364 Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude for improving intelligibility
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L19/0204 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders using subband decomposition
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06 Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L2021/065 Aids for the handicapped in understanding
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 Processing in the frequency domain
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0264 Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques

Definitions

  • the invention relates to improvements in noise reduction systems and methods for sound reproducing systems, such as hearing aids.
  • Hearing devices are wearable hearing apparatus used to provide assistance to those with impaired hearing.
  • different designs of hearing device are provided to meet the numerous individual requirements, such as behind-the-ear hearing devices with an external earpiece, and in-the-ear hearing devices, e.g. concha or in-canal hearing devices.
  • the typical configurations of hearing device are worn on the outer ear or in the auditory canal.
  • bone conduction hearing aids and implantable or vibro-tactile hearing aids are also available on the market. In such hearing aids the damaged hearing is stimulated either mechanically or electrically.
  • Hearing devices principally have as their main components an input converter, an amplifier and an output converter.
  • the input converter is as a rule a sound receiver, e.g. a microphone, and/or an electromagnetic receiver, e.g. an induction coil.
  • the output converter is mostly implemented as an electroacoustic converter, e.g. a miniature loudspeaker, or as an electromechanical converter, e.g. a bone conduction earpiece.
  • the amplifier is usually integrated into a signal processing unit. This basic structure is shown in FIG. 4, using a behind-the-ear hearing device as an example.
  • One or more microphones 2 for recording the sound from the surroundings are built into a hearing device housing 1 worn behind the ear.
  • a signal processing unit 3 which is also integrated into the hearing device housing 1, processes the microphone signals and amplifies them.
  • the output signal of the signal processing unit 3 is transmitted to a loudspeaker or earpiece 4 which outputs an acoustic signal.
  • the sound is transmitted, if necessary via a sound tube, which is fixed with an otoplastic in the auditory canal, to the hearing device wearer's eardrum.
  • the power is supplied to the hearing device and especially to the signal processing unit 3 by a battery 5 also integrated into the hearing device housing 1.
  • Hearing aid manufacturers have implemented various technologies to address noise. For example, some hearing aids may attempt to boost gain in frequency subbands with low noise while reducing gain in frequency subbands with high noise.
  • One problem with this frequency-gain approach is that desired signals may be attenuated along with noise signals.
  • Another problem with many frequency-gain approaches to dealing with noise is the inaccuracy of traditional algorithms for detecting which frequency subbands contain noise. In other words, many traditional algorithms may be somewhat ineffective in distinguishing between noise signals and desired signals.
  • Embodiments provide a noise reduction system and method, which leads to improved speech intelligibility.
  • Embodiments may be directed to methods for reducing noise. Some embodiments may also be directed to hearing aid devices configured to reduce noise.
  • a computer-implemented method for reducing noise in an audio signal composed of speech and noise components may comprise (a) decomposing an audio signal into a plurality of subbands, wherein the audio signal is pre-enhanced by processing with a noise reduction algorithm before or after decomposing to provide an enhanced audio signal having audio components comprising enhanced speech components and residual noise components; (b) processing each of the subbands of the enhanced audio signal by scaling the audio components via a scaling factor for each subband to provide a processed subband audio signal with reduced residual noise components; and (c) reconstituting the processed subband audio signal into an output audio signal having enhanced speech components and reduced residual noise components.
  • the audio signal is received from an input device of a hearing aid.
  • the scaling is performed on a frame-by-frame basis for the subbands depending on an assumed level of residual noise, wherein the assumed level of residual noise is based on an estimate of the Signal-to-Residual-Noise Ratio (SRNR).
  • the scaling comprises, for an expected subband speech level α, scaling of low-amplitude audio components on a relative basis. For example, at low instantaneous Signal-to-Noise Ratio (SNR), scaling is more severe towards low-amplitude audio components and, conversely, at high SNR, scaling is avoided for low-amplitude audio components.
  • a discrimination rule for scaling may be applied such that below a certain subband speech level in a particular subband and if an input instantaneous fullband SNR is low, the audio components are scaled down.
  • a hearing aid may comprise a signal processing unit adapted to receive an input signal and apply a hearing aid gain to the input signal to produce an output signal.
  • the signal processing unit comprises code devices for decomposing an audio signal into a plurality of subbands, wherein the audio signal is pre-enhanced by processing with a noise reduction algorithm before or after decomposing to provide an enhanced audio signal having audio components comprising enhanced speech components and residual noise components; processing each of the subbands of the enhanced audio signal by scaling the audio components via a scaling factor for each subband to provide a processed subband audio signal with reduced residual noise components; and reconstituting the processed subband audio signal into an output audio signal having enhanced speech components and reduced residual noise components.
  • a computer program product fixed on a tangible medium and/or executed on a hearing device for noise reduction includes a computer program that implements the methods herein.
  • FIG. 1 shows a post-processing scheme where the pre-enhancement does not take place in the subband domain.
  • FIG. 2 shows a post-processing scheme where the pre-enhancement takes place in the subband domain.
  • FIG. 3 shows an effect of the post-processor wherein a top graph shows the initial noisy speech, the second graph is a clean signal, next is the pre-enhanced signal, and at the bottom the post-processed signal.
  • the noise reduction can be clearly seen, and the speech parts with lower amplitude are not affected either.
  • FIG. 4 shows a basic structure of a hearing device in which the method may be implemented.
  • the present invention provides an adaptive noise cancelling system, which leads to improved speech intelligibility.
  • the invention further provides a method and a system for reducing noise, as well as a computer program product.
  • a hearing aid comprises at least one microphone, a signal processing unit and an output transducer.
  • the signal processing unit is adapted to receive an input signal from the microphone.
  • the signal processing means is adapted to apply a hearing aid gain to the input signal to produce an output signal to be output by the output transducer, and the signal processing means comprises means for adjusting the hearing aid gain calculated for the hearing aid.
  • the method and system herein provides a technique for the reduction of background residual noise as a post-processor for non-aggressive speech enhancement algorithms.
  • the method keeps the beneficial characteristics of such algorithms, and then uses both the noisy and pre-enhanced signals to remove the remaining noise in such a way that the speech is affected as little as possible.
  • the proposed method comprises first decomposing a pre-enhanced signal into frequency bands, and then operating on the downsampled subband time series by softly scaling down their low-energy segments, provided they occur at low estimated SNR.
  • the method comprises scaling, on a frame-by-frame basis, the subband pre-enhanced signals depending on an assumed level of residual noise.
  • the method is tested herein with three types of enhancement algorithms: a spectral subtractive method, a Minimum Mean Squared Error log-spectral amplitude estimator, and a Kalman Filter-based scheme.
  • in various real-world noise environments, the post-processor is found to consistently reduce background noise, with no apparent loss of intelligibility between the pre-enhanced and the final output speech signals, as reported by several objective measure scores and informal listening experiments.
  • the post-processing technique herein addresses the following objectives: (1) Removing additional background noise while retaining the positive features of (pre)enhanced speech (i.e. intelligibility, low distortion, naturalness, etc), (2) providing a simple and efficient implementation.
  • the method comprises "turning down the volume" when too much noise is present.
  • the above principle presupposes that there exists a reliable rule to discriminate speech and residual noise components. Note first that even in ideal conditions, it is not desirable to apply such volume-scaling in a fullband setup, as it would perceivably modulate the amplitude of the signal in a disturbing manner, and possibly affect some unvoiced parts of the speech with small energy.
  • the method is chosen to be applied in the subband domain.
  • the goal is to determine, for a given pre-enhanced frame an appropriate scaling factor so as to satisfy the problem requirements.
  • the average expected level α for speech components in a given subband is known.
  • the speech/noise discrimination rule is then chosen to follow two easily measurable quantities: the signal's amplitude within particular subbands and the global, instantaneous fullband SNR.
  • the entire scheme can be summarized as follows: below a certain level, and especially if the input SNR is low, the observed components are likely to be noise-like and must be scaled down.
  • the fullband SNR is chosen as reference rather than individual subband SNRs for two reasons: first for simplicity, and secondly because in many situations the "local" subband SNR is found to be a poor indicator of the global SNR and thus some low-amplitude speech components that are still important for intelligibility are more at risk of being scaled down (this tendency was confirmed in practical tests as well).
  • the method is chosen to be applied in the subband domain, and is formally described below using the accompanying Figures 1 and 2.
  • the pre-enhancement algorithm is "nonaggressive", in the sense that the speech signal is left as intact as possible, while the noise is still present but has been decreased to a smaller energy than the speech.
  • Stages 1 and 3 are classical subband decomposition/decimation/reconstitution.
  • Stage 2 proceeds as follows: let SNR(i) be an estimated signal-to-noise ratio for the i-th frame. With y_m(i) denoting the pre-enhanced, decimated speech vector at subband m, E_m(i) being the current energy in the pre-enhanced subband segment, and θ_m being a constant band-dependent threshold (the choice of which is discussed below), the following rule is applied to y_m(i) to obtain the post-processed enhanced series x_m(i): the frame is softly scaled down when E_m(i) falls below the effective threshold and SNR(i) is low, and left untouched otherwise.
  • Stage 3 proceeds as follows: the post-processed estimated clean speech signal is reconstructed from the processed subband series.
  • Implementation of Stages 1 and 3 may include any classical subband decomposition/ decimation/ reconstitution techniques as known in the art.
  • this step basically involves scaling down the subband frame if its energy is found to be lower than a certain value θ_m.
  • the introduction of the other, frame-dependent constant at step (a) is in direct relationship with the discrimination rule, and is important for the cases where the input speech is of low amplitude to begin with, but still high relative to the noise, which can occur for example at speech onsets or for quiet speakers: in such cases the effective threshold must be appropriately lowered so as not to risk damaging the speech.
  • the fact that the signal is scaled based on the energy of an entire frame and not on a sample-by-sample basis is also meant to minimize the potential damage inflicted on the clean speech.
  • this type of subband-signal scaling method can be applied as part of a "full" subband speech enhancement algorithm (as opposed to a mere "post-processor” as in this section), where subband scaling factors are applied to the incoming noisy speech, and are determined from a VAD-based estimation of the a posteriori SNR. It may be assumed that the scaling factors are to be applied to pre-enhanced subband speech signals (and thus that an estimate of the SNR is also accessible).
  • each subband-domain signal, i.e., each of the decimated signals at the outputs of the filters of the filterbank, is scaled in this manner.
  • the method herein takes advantage of the available pre-enhanced signal for which speech and noise have already been discriminated to a certain extent. Based on our initial assumption that the goal of the pre-enhancement algorithm is to try not to degrade intelligibility, the overall approach to noise reduction is much less aggressive.
  • our post-processing method leads overall to a more robust enhancement scheme, in that it can perform thresholding with less risk. This is especially the case since we are taking the estimated SNR into account.
  • the method applies uniform scaling to overlapping frames, which is less prone to perceivable "sudden volume change" artifacts than a sample-by-sample volume scaling.
  • the method above may be implemented as a module added to existing schemes, and does not resort to wavelet packet transformations.
  • the proposed method is extremely low-cost, especially if the pre-enhancement scheme is already frame-based and employs subbands, in which case only one extra equation per band must be applied.
  • the subband method, resorting to a "fully discretized" noise PSD, lends itself very well to psychoacoustic treatment.
  • a way to include perceptual constraints as part of this method is described. The idea is similar to that shown in the KF case; the differences are mainly related to the fact that we are only applying the constraints under certain risk-related conditions to avoid damage to the speech in complex noise conditions.
  • the central tool is the estimated masking threshold of the clean speech.
  • the masking threshold of a signal represents, in the frequency domain, the level/curve below which nothing is audible in the presence of the particular signal being studied.
  • a technique to compute such an estimate of a signal's masking threshold is elaborated. In the context of MPEG coding, this is useful to determine how much quantization noise can be introduced while remaining imperceptible.
  • an estimate of the clean signal is used to begin with to compute the threshold; in practice a rough clean speech estimate (obtained via spectral subtraction, for example) can provide results almost as good as when the true clean speech is available.
  • this distinct estimate can further be used to improve the overall quality by combining it with the state-space algorithm's final estimate.
  • the masking threshold is used as follows. In a given frame, once the masking thresholds have been calculated (based on the prior spectral subtractive estimate), the average level of each of the above quantities is first calculated in each band, and the following two rules are applied: if in band m the noise is masked by the speech, the current data frame is left unprocessed; if the speech component is inaudible but noticeable noise is present, the enhancement is made more aggressive by purposely overestimating the corresponding observation variance in the state-space model.
  • the first rule is based on the assumption that if the noise component in band m is to begin with masked by the speech, then there is no need to perform any noise removal.
  • under the second rule, if the speech component is inaudible but some noticeable noise is present in band m, the enhancement takes place in a more aggressive manner.
  • Table A Estimation of the average benefits obtained by using the subband-based techniques presented herein, in the context of VL/L colored noise conditions.
  • "X" is a generic letter to designate an algorithm to which the techniques are applied - the averages were obtained with 3 algorithms.
  • Table B Estimation of the average benefits obtained by using the subband-based techniques presented herein, in the context of M/H colored noise conditions.
  • "X" is a generic letter to designate an algorithm to which the techniques are applied - the averages were obtained with 3 algorithms.
  • the results are more mixed, in the sense that the 4 bands solution actually yields slightly worse results, marginally penalizing each objective measure.
  • the 4 bands treatment is still an appealing alternative when compared to fullband processing.
  • the 32 bands case again provides significant advantages when coupled with psychoacoustic constraining and post-processing.
  • the WPESQ score is improved on average by 0.14 units.
  • Careful listening to the enhanced signals yields observations that are in accordance with the above findings. For instance, it is difficult to differentiate the fullband and the 4 bands case, but improvements become more noticeable with 32 bands, especially with the reduction of background noise.
  • Table C Comparison between the average scores obtained from using the 7 listed algorithms in VL/L colored noise situations and a "32B-X-Post" setup.
  • Table D Comparison between the average scores obtained from using the 7 listed algorithms in VL/L colored noise situations and a "32B-X-Post" setup.
  • MSSUB multiband spectral subtraction scheme
  • KEM subband Kalman Filter-based scheme using an EM algorithm to determine the clean AR coefficients, and approximating the noise to be white in each band (i.e., the noise spectrum is discretized in each band to a single value), which will be referred to as KEM.
  • the output of the background noise estimator is slightly modified so as to provide an underestimate for the noise level, thereby making each pre-enhancement less aggressive and helping to preserve the speech intelligibility.
  • the clean speech signal, sampled at 20 kHz, is obtained by concatenating multiple speakers (male and female) taken from the TIMIT database and inserting silences so as to obtain a 60% activity rate and a length of approximately 30 seconds.
  • the noise data was obtained online from the following page: http://spib.rice.edu/spib/select_noise.html, containing examples from the NOISEX-92 database: namely the babble, factory, military vehicle and car interior noises were used.
  • the noisy speech signals were created by adding these noises to the clean speech and scaling them at 3 different levels so as to obtain various conditions, from low to high input SNR. Thus, in total 12 different conditions were tested for 3 different algorithms.
  • the objective quality measures used are the Average segmental Signal-to-Noise Ratio (referred to ASNR hereafter) and the Coherence Speech Intelligibility Index (CSII).
  • Table 1 shows results obtained using the multiband spectral subtraction method. The scores reported are ASNR/CSII.
  • Table 2 shows results obtained using the LMMSE method. The scores reported are ASNR/CSII.
  • Table 3 shows results obtained using the KEM method. The scores reported are ASNR/CSII.
  • Figure 3 shows an example of the waveforms obtained with the LMMSE algorithm under babble noise conditions, for which the effect of the post-processor can be clearly viewed: the parts where speech is very present are only minimally affected, but as soon as noisy parts are present the scaling process is effective. Notice particularly that the parts with low speech amplitude are still kept intact.
  • Tables 1, 2, and 3 are now discussed.
  • the invention provides a very simple and low-complexity add-on to speech enhancement algorithms, which can reduce the excess of residual noise in the enhanced speech without affecting intelligibility.
  • the method is particularly advantageous when the enhancement algorithm used operates in subbands, in which case the additional complexity is minimal.
  • the noise reduction system and method according to the invention can be utilized in a hearing aid or in a cochlear implant, which comprises a digital signal processor (DSP).
  • the invention may be implemented in hardware or software, or a combination of both (e.g., programmable logic arrays). Unless otherwise specified, the processes included as part of the invention are not inherently related to any particular computer or other apparatus. In particular, various general-purpose machines may be used with programs written in accordance with the teachings herein, or it may be more convenient to construct more specialized apparatus (e.g., integrated circuits) to perform the required method steps. Thus, the invention may be implemented in one or more computer programs executing on one or more programmable computer systems each comprising at least one processor, at least one data storage system (including volatile and non-volatile memory and/or storage elements), at least one input device or port, and at least one output device or port. Program code is applied to input data to perform the functions described herein and generate output information. The output information is applied to one or more output devices, in known fashion. Each such program may be implemented in any desired computer language.
  • Computer program code for carrying out operations of the invention described above may be written in a high-level programming language, such as C or C++, for development convenience.
  • computer program code for carrying out operations of embodiments of the present invention may also be written in other programming languages, such as, but not limited to, interpreted languages.
  • Some modules or routines may be written in assembly language or even micro-code to enhance performance and/or memory usage.
  • the functionality of any or all of the program modules may also be implemented using discrete hardware components, one or more application specific integrated circuits (ASICs), or a programmed digital signal processor or microcontroller.
  • code implementing a program of the present invention can be included as firmware in a RAM, a ROM or a flash memory. Otherwise, the code can be stored on a tangible computer-readable storage medium such as a magnetic tape, a flexible disc, a hard disc, a compact disc, a magneto-optical disc, or a digital versatile disc (DVD).
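As a rough illustration of the Stage 2 scaling rule described above, the following Python sketch softly scales down a pre-enhanced subband frame when its energy falls below a band-dependent threshold and the estimated fullband SNR is low. The threshold values, the soft gain floor and the SNR cutoff are illustrative assumptions, not values fixed by the disclosure.

```python
import numpy as np

def postprocess_subbands(frames, thetas, snr_db,
                         snr_low_db=5.0, floor=0.3):
    """Scale down low-energy subband frames at low fullband SNR.

    frames : list of 1-D arrays; frames[m] is the pre-enhanced,
             decimated signal frame in subband m
    thetas : per-band energy thresholds (assumed tuned offline)
    snr_db : estimated instantaneous fullband SNR for this frame
    """
    out = []
    for m, y in enumerate(frames):
        energy = float(np.sum(y ** 2))
        if energy < thetas[m] and snr_db < snr_low_db:
            # soft scaling: never fully mute the band, to limit
            # potential damage to low-amplitude speech
            gain = max(floor, energy / thetas[m])
            out.append(gain * y)
        else:
            out.append(y.copy())
    return out
```

A real implementation would derive the thresholds from the expected subband speech levels and apply the gains to overlapping frames, as described above, to avoid audible volume steps.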
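The Average segmental Signal-to-Noise Ratio (ASNR) used as an objective measure above can be sketched as follows; the frame length and the dB clamping range are common choices but are assumptions here, since the disclosure does not specify them.

```python
import numpy as np

def segmental_snr_db(clean, processed, frame_len=160,
                     lo=-10.0, hi=35.0):
    """Average of per-frame SNRs between clean and processed signals,
    with each frame's SNR clamped to [lo, hi] dB as is customary."""
    snrs = []
    for start in range(0, len(clean) - frame_len + 1, frame_len):
        s = clean[start:start + frame_len]
        e = s - processed[start:start + frame_len]
        num = np.sum(s ** 2)
        den = np.sum(e ** 2) + 1e-12  # avoid division by zero
        snrs.append(np.clip(10.0 * np.log10(num / den + 1e-12), lo, hi))
    return float(np.mean(snrs))
```

Averaging clamped per-frame SNRs, rather than one global SNR, weights quiet speech segments more fairly, which is why segmental SNR is preferred for enhancement evaluation.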

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

A method and system for reducing noise in an audio signal composed of speech and noise components. The method includes decomposing an audio signal into a plurality of subbands, wherein the audio signal is pre-enhanced by processing with a noise reduction algorithm before or after decomposing to provide an enhanced audio signal having audio components comprising enhanced speech components and residual noise components; processing each of the subbands of the enhanced audio signal by scaling the audio components via a scaling factor for each subband to provide a processed subband audio signal with reduced residual noise components; and reconstituting the processed subband audio signal into an output audio signal having enhanced speech components and reduced residual noise components.

Description

METHOD AND SYSTEM FOR NOISE REDUCTION FOR SPEECH ENHANCEMENT IN HEARING AID
This application claims priority to provisional application No. 61/150354, filed February 6, 2009, which is incorporated by reference herein in its entirety.
FIELD OF THE INVENTION
The invention relates to improvements in noise reduction systems and methods for sound reproducing systems, such as hearing aids.
BACKGROUND OF THE INVENTION
Hearing devices are wearable hearing apparatus used to provide assistance to those with impaired hearing. To meet the numerous individual requirements, different designs of hearing device are provided, such as behind-the-ear hearing devices with an external earpiece, and in-the-ear hearing devices, e.g. concha or in-canal hearing devices. The typical configurations of hearing device are worn on the outer ear or in the auditory canal. Above and beyond these designs, however, there are also bone conduction hearing aids and implantable or vibro-tactile hearing aids available on the market. In such hearing aids the damaged hearing is stimulated either mechanically or electrically.
Hearing devices principally have as their main components an input converter, an amplifier and an output converter. The input converter is as a rule a sound receiver, e.g. a microphone, and/or an electromagnetic receiver, e.g. an induction coil. The output converter is mostly implemented as an electroacoustic converter, e.g. a miniature loudspeaker, or as an electromechanical converter, e.g. a bone conduction earpiece. The amplifier is usually integrated into a signal processing unit. This basic structure is shown in FIG. 4, using a behind-the-ear hearing device as an example. One or more microphones 2 for recording the sound from the surroundings are built into a hearing device housing 1 worn behind the ear. A signal processing unit 3, which is also integrated into the hearing device housing 1, processes the microphone signals and amplifies them. The output signal of the signal processing unit 3 is transmitted to a loudspeaker or earpiece 4 which outputs an acoustic signal. The sound is transmitted, if necessary via a sound tube fixed with an otoplastic in the auditory canal, to the hearing device wearer's eardrum. The power is supplied to the hearing device, and especially to the signal processing unit 3, by a battery 5 also integrated into the hearing device housing 1.
One of the biggest challenges in speech enhancement is the tradeoff between the amount of noise reduction and the intelligibility of the resulting speech signal. While aggressive enhancement algorithms may be able to remove a large amount of background noise and significantly increase some objective scores, it is common that the output speech is eventually found to be less intelligible than the original noisy speech, which is a strong penalty for sensitive applications such as hearing aid devices.
Hearing aid manufacturers have implemented various technologies to address noise. For example, some hearing aids may attempt to boost gain in frequency subbands with low noise while reducing gain in frequency subbands with high noise. One problem with this frequency-gain approach is that desired signals may be attenuated along with noise signals. Another problem with many frequency-gain approaches to dealing with noise is the inaccuracy of traditional algorithms for detecting which frequency subbands contain noise. In other words, many traditional algorithms may be somewhat ineffective in distinguishing between noise signals and desired signals.
Thus, there is a need for improved hearing aids as well as improved techniques for implementing noise reduction in hearing aids.
SUMMARY OF THE INVENTION
The present invention provides a noise reduction system and method, which leads to improved speech intelligibility. Embodiments may be directed to methods for reducing noise. Some embodiments may also be directed to hearing aid devices configured to reduce noise.
In at least one embodiment, a computer-implemented method for reducing noise in an audio signal composed of speech and noise components may comprise (a) decomposing an audio signal into a plurality of subbands, wherein the audio signal is pre-enhanced by processing with a noise reduction algorithm before or after decomposing to provide an enhanced audio signal having audio components comprising enhanced speech components and residual noise components; (b) processing each of the subbands of the enhanced audio signal by scaling the audio components via a scaling factor for each subband to provide a processed subband audio signal with reduced residual noise components; and (c) reconstituting the processed subband audio signal into an output audio signal having enhanced speech components and reduced residual noise components.
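The flow of steps (a)-(c) above can be sketched in code. The FFT-based subband split, the frame length, the band count, and the fixed energy threshold below are all illustrative assumptions standing in for the filterbank and SNR-driven thresholds described later in this document; this is a sketch of the processing flow, not the claimed implementation.

```python
import numpy as np

def postprocess(enhanced, n_bands=4, frame_len=64, threshold=0.5):
    """Sketch: (a) decompose frames into subbands, (b) scale low-energy
    subbands down, (c) reconstitute the output signal."""
    out = np.zeros_like(enhanced, dtype=float)
    for start in range(0, len(enhanced) - frame_len + 1, frame_len):
        frame = enhanced[start:start + frame_len]
        spec = np.fft.rfft(frame)                      # (a) subband decomposition
        edges = np.linspace(0, len(spec), n_bands + 1, dtype=int)
        for b in range(n_bands):
            band = spec[edges[b]:edges[b + 1]]
            # (b) linear scaling: unity gain above the threshold, reduced below
            energy = np.sum(np.abs(band) ** 2) / frame_len
            band *= min(1.0, energy / threshold)
        out[start:start + frame_len] = np.fft.irfft(spec, n=frame_len)  # (c)
    return out
```

In this toy version, a low-level noise-only input is strongly attenuated, while a strong tonal component passes essentially unchanged, which is the qualitative behavior the method aims for.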
In a further embodiment, the audio signal is received from an input device of a hearing aid. Moreover, the scaling is performed on a frame-by-frame basis for the subbands depending on an assumed level of residual noise, wherein the assumed level of residual noise is based on an estimate of the Signal-to-Residual-Noise-Ratio (SRNR).
In still further embodiments, the scaling comprises, for an expected subband speech level α, scaling low-amplitude audio components on a relative basis. For example, at low instantaneous Signal-to-Noise-Ratio (SNR), scaling is more severe towards low-amplitude audio components and, conversely, at high SNR, scaling is avoided for low-amplitude audio components. A discrimination rule for scaling may be applied such that, below a certain subband speech level in a particular subband and if the input instantaneous fullband SNR is low, the audio components are scaled down.
According to certain further embodiments, a hearing aid may comprise a signal processing unit adapted to receive an input signal and apply a hearing aid gain to the input signal to produce an output signal, wherein the signal processing unit comprises code devices for decomposing an audio signal into a plurality of subbands, wherein the audio signal is pre-enhanced by processing with a noise reduction algorithm before or after decomposing to provide an enhanced audio signal having audio components comprising enhanced speech components and residual noise components; processing each of the subbands of the enhanced audio signal by scaling the audio components via a scaling factor for each subband to provide a processed subband audio signal with reduced residual noise components; and reconstituting the processed subband audio signal into an output audio signal having enhanced speech components and reduced residual noise components.
According to certain further embodiments, a computer program product, fixed on a tangible medium and/or executed on a hearing device for noise reduction, includes a computer program that implements the methods herein.
Further specific variations of the invention are defined herein. Other aspects and advantages of the present invention will become more apparent from the following detailed description taken in conjunction with the accompanying drawings which illustrate, by way of example, the principles of the invention.
BRIEF DESCRIPTION OF THE DRAWINGS
The invention is explained in the following description in view of the drawings that show:
FIG. 1 shows a post-processing scheme used when the pre-enhancement does not take place in the subband domain.
FIG. 2 shows a post-processing scheme used when the pre-enhancement takes place in the subband domain.
FIG. 3 shows an effect of the post-processor: the top graph shows the initial noisy speech, the second graph the clean signal, the third the pre-enhanced signal, and the bottom graph the post-processed signal. The noise reduction can be clearly seen, and the speech parts with lower amplitude are left largely unaffected.
FIG. 4 shows a basic structure of a hearing device in which the method may be implemented.
DETAILED DESCRIPTION OF THE INVENTION
The present invention provides an adaptive noise cancelling system, which leads to improved speech intelligibility. The invention further provides a method and a system for reducing noise, as well as a computer program product.
A hearing aid comprises at least one microphone, a signal processing unit and an output transducer. The signal processing unit is adapted to receive an input signal from the microphone. The signal processing means is adapted to apply a hearing aid gain to the input signal to produce an output signal to be output by the output transducer, and the signal processing means comprises means for adjusting the hearing aid gain calculated for the hearing aid.
The method and system herein provide a technique for the reduction of background residual noise as a post-processor for non-aggressive speech enhancement algorithms. The method keeps the beneficial characteristics of such algorithms, and then uses both the noisy and pre-enhanced signals to remove the remaining noise in such a way that the speech is affected as little as possible. The proposed method comprises first decomposing a pre-enhanced signal into frequency bands, and then operating on the downsampled subband time series by softly scaling down their low-energy segments, provided they occur at low estimated SNR. In simple terms, the method comprises scaling, on a frame-by-frame basis, the subband pre-enhanced signals depending on an assumed level of residual noise. The method is tested herein with three types of enhancement algorithms: a spectral subtractive method, a Minimum Mean Squared Error log-spectral amplitude estimator, and a Kalman Filter-based scheme. In various real-world noise environments, the post-processor is found to consistently reduce background noise, with no apparent loss of intelligibility between the pre-enhanced and the final output speech signals, as reported by several objective measure scores and informal listening experiments.
One of the central issues in speech enhancement is the tradeoff between noise reduction and intelligibility, and it is in fact rare for a method to effectively improve intelligibility across several experimental conditions. Rather than trying to improve intelligibility, practitioners usually set the more reasonable goal of at least not affecting it in the noise removal process. In sensitive applications where intelligibility and naturalness are important, non-aggressive setups for speech enhancement algorithms are thus privileged.
The post-processing technique herein addresses the following objectives: (1) Removing additional background noise while retaining the positive features of (pre)enhanced speech (i.e. intelligibility, low distortion, naturalness, etc), (2) providing a simple and efficient implementation.
Both objectives are treated here with equal importance. Indeed, if the second objective is not respected, one might as well rework and upgrade the pre-enhancement scheme. On the other hand, if the first objective can be attained with very small additions, then the appeal is more significant for real-world applications already employing certain well-established algorithms. Herein, the objective of the post-processor is not enhancement per se, but rather noticeable background noise removal.
In an embodiment, the method comprises "turning down the volume" when too much noise is present. Practically speaking, the above principle presupposes that there exists a reliable rule to discriminate speech and residual noise components. Note first that even in ideal conditions, it is not desirable to apply such volume-scaling in a fullband setup, as it would perceivably modulate the amplitude of the signal in a disturbing manner, and possibly affect some unvoiced parts of the speech with small energy. Thus, the method is chosen to be applied in the subband domain. In summary, the goal is to determine, for a given pre-enhanced frame, an appropriate scaling factor so as to satisfy the problem requirements.
Suppose that the average expected level a for speech components in a given subband is known. The speech/noise discrimination rule is then chosen to follow two easily measurable quantities: the signal's amplitude within particular subbands and the global, instantaneous fullband SNR. In very simple terms, the entire scheme can be summarized as follows: below a certain level, and especially if the input SNR is low, the observed components are likely to be noise-like and must be scaled down.
Slightly more rigorously, the following two rules can be written: (i) relative to the expected subband speech level α, low amplitudes should be scaled down, but (ii) assuming that the amount of residual noise directly depends on the input SNR, at low instantaneous SNR the scaling should be more severe towards low-amplitude components and, conversely, at high SNR low-amplitude components should be spared.
The fullband SNR is chosen as reference rather than individual subband SNRs for two reasons: first for simplicity, and secondly because in many situations the "local" subband SNR is found to be a poor indicator of the global SNR and thus some low-amplitude speech components that are still important for intelligibility are more at risk of being scaled down (this tendency was confirmed in practical tests as well).
Thus, the method is chosen to be applied in the subband domain, and is formally described below using the accompanying Figures 1 and 2. First of all, one of the main assumptions is that the pre-enhancement algorithm is "nonaggressive", in the sense that the speech signal is left as intact as possible, while the noise is still present but has been decreased to a smaller energy than the speech.
The above scheme assumes the availability of an expected subband speech level as further described herein.
The assumption that we have direct access to a form of SNR estimate is not far-fetched, since this is a building block for many enhancement algorithms - if not available, it is possible to obtain an estimate by comparing the noisy and pre-enhanced spectra, for example. The procedure described above is now formally presented in three stages, and then further explanations are given showing the direct links with rules i and ii above.
The method then follows the three stages described below. Stages 1 and 3 are classical subband decomposition/decimation/reconstitution.
Stage 1 proceeds as follows: Following Figure 1, from the noisy speech signal at time k, z(k), the pre-enhancement algorithm, symbolized by f(.), produces the pre-enhanced signal y(k) = f(z(k)). Next, y(k) is decomposed into overlapping frames of length N, with the i-th frame denoted by y(n,i) and n = 1 ... N. Then, the frames are decomposed into M subbands, with the m-th corresponding "subframe" denoted by ym(l,i) and l = 1 ... N/M. Note that all of the above might very well already be part of the pre-enhancement algorithm - that is, the post-processing may directly have access to ym(l,i), as in Figure 2.

Stage 2 proceeds as follows: Let SNR(i) be an estimated signal-to-noise ratio for the i-th frame. With ym(i) = [ym(1,i), ..., ym(N/M,i)] denoting the pre-enhanced, decimated speech vector at subband m,

Em(i) = ||ym(i)||^2 = sum over l = 1 ... N/M of ym(l,i)^2

being the current energy in the pre-enhanced subband segment, and αm being a constant band-dependent threshold (the choice of which is discussed below), the following two-step rule is applied to ym(l,i) to obtain the post-processed enhanced series xm(l,i):

(a) an effective, frame-dependent threshold θm(i) = β(i) αm is computed, where the factor β(i) depends on SNR(i) and decreases as the estimated SNR increases;

(b) xm(l,i) = ym(l,i) · min(1, Em(i)/θm(i)).

Stage 3 proceeds as follows: reconstruct the post-processed estimated clean speech from the subband series xm(l,i), through the synthesis stage of the filterbank (upsampling, filtering and summation across the M subbands).
Implementation of Stages 1 and 3 may include any classical subband decomposition/decimation/reconstitution techniques as known in the art.
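As a concrete illustration, Stage 2 can be sketched for a single subband frame as follows. The energy-based scaling follows the rule xm(l,i) = ym(l,i) · min(1, Em(i)/θm(i)); the exact mapping from the estimated SNR to the frame-dependent factor β(i) is not given in closed form in this text, so the linear ramp and its 10 dB reference below are assumptions made for illustration.

```python
import numpy as np

def stage2_scale(y_m, snr_db, alpha_m, snr_ref_db=10.0):
    """Sketch of the Stage 2 rule for one pre-enhanced subband frame y_m(i)."""
    # (a) frame-dependent factor beta(i): near 1 at low SNR (aggressive
    # thresholding), small at high SNR (low-amplitude speech is spared).
    # This particular mapping is an illustrative assumption.
    beta = float(np.clip(1.0 - snr_db / snr_ref_db, 0.1, 1.0))
    theta_m = beta * alpha_m              # effective threshold theta_m(i)
    # (b) linear scaling of the whole subband frame by its energy E_m(i)
    y_m = np.asarray(y_m, dtype=float)
    energy = float(np.sum(y_m ** 2))
    return y_m * min(1.0, energy / theta_m)
```

As intended by rule (ii), the same low-amplitude frame is attenuated more at low SNR than at high SNR, while frames with energy above the effective threshold pass unchanged.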
Before detailing the choice of αm, let us briefly explain the rationale behind the two steps (a) and (b) presented above in Stage 2. Beginning with (b), and first assuming that β(i) = 1, this step basically involves scaling down the subband frame if its energy is found to be lower than a certain value θm(i). The scaling is linear, and clearly the lower the energy, the lower the scaling factor Em(i)/θm(i).

The introduction of the other frame-dependent constant β(i) at step (a) is in direct relationship with rule ii, and is important for the cases where the input speech is of low amplitude to begin with, but still high comparatively to the noise - which can occur for example at speech onsets or for quiet speakers: in such cases the effective threshold must be appropriately lowered so as not to risk damaging the speech. Regarding speech onsets, the fact that the signal is scaled based on the energy of an entire frame and not on a sample-by-sample basis is also meant to minimize the potential damage inflicted on the clean speech.
In our experience, αm depends on the type of subband decomposition used, on the number of bands (and obviously on the subband frame size N/M, since it is compared to the quantity Em(i)). For linearly spaced bands, and more specifically a near-perfect pseudo-QMF decomposition, we find that for an input noisy signal with maximum amplitude of 1 and M = 16, good performance is obtained by letting αm be inversely proportional to m^2, i.e.,

αm = K / m^2

(i.e., the expected energy in each subband decreases as the square of the subband index, which appears to be a reasonable assumption considering long-term spectral averages of speech in 16 equally-spaced bands, as shown for example in Figure 2), thereby making K the only required value to "tune" to select the aggressiveness of the post-processor.
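For the 16-band pseudo-QMF case described above, the band thresholds reduce to a one-liner (the default K = 0.015 here is the value reported later in the examples):

```python
import numpy as np

def band_thresholds(K=0.015, M=16):
    """Band-dependent thresholds alpha_m = K / m^2 for m = 1..M."""
    # Expected subband speech energy is assumed to fall off as the square
    # of the band index, so K alone tunes the post-processor aggressiveness.
    m = np.arange(1, M + 1)
    return K / m ** 2
```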
In further embodiments, this type of subband-signal scaling method can be applied as part of a "full" subband speech enhancement algorithm (as opposed to a mere "post-processor" as in this section), where subband scaling factors are applied to the incoming noisy speech, and are determined from a VAD-based estimation of the a posteriori SNR. It may be assumed that the scaling factors are to be applied to pre-enhanced subband speech signals (and thus that an estimate of the SNR is also accessible). In addition, each subband domain signal (i.e., each of the decimated signals at the outputs of the filters of the filterbank) may be real-valued and can locally be viewed as time-domain signals.
Essentially, with ym(i) denoting the pre-enhanced decimated speech vector at subband m at the i-th frame, we propose to perform the following scaling of ym(l,i), based on an estimate of the Signal-to-Residual-Noise-Ratio, denoted here by SRNRm(i), to obtain the post-processed enhanced series xm(l,i):

[scaling equation not reproduced in the text]
Obviously, the above requires the knowledge of SRNRm(i), which is difficult to accurately estimate as it strongly depends on the method/algorithm used and on the noise conditions. Nevertheless, a practical solution comprises estimating it from the input subband SNR estimates SNRm(i) (assumed to be known from the pre-enhancement stage) - the two are indeed strongly correlated. For this purpose, several methods can be envisioned: for example, using various training data obtained specifically with the chosen pre-enhancement algorithm, some mathematical relationship (e.g. linear regression) between the two sets of subband SNRs could be obtained. Results can however be obtained by using the following simple rule:

SRNRm(i) = max( SNR(i), SNRm(i) )
that is, the practical value used to represent the residual noise ratio in each subband is simply taken as the maximum between the fullband estimated SNR and the current subband estimated SNR. The rationale for incorporating the fullband SNR was initially based on the observation that in many situations the "local" subband SNR is found to be in discordance with the fullband SNR, and thus some low-amplitude speech components that are still important for intelligibility are more at risk of being filtered out. Note also that from the equation above we necessarily have SRNRm(i) ≥ SNRm(i), which does not contradict the expected effect of the pre-enhancement scheme. In practice, to further account for the effect of pre-enhancement, the introduction of a constant lower bound P is also beneficial, so as to obtain the final rule:

SRNRm(i) = max( P, SNR(i), SNRm(i) )

In this implementation, P is set to a fixed constant. The use of the equation above allows for a very low-cost post-processing.
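A minimal sketch of the practical SRNR rule follows, interpreting the constraint P as a lower bound inside the maximum. The original value of P is not reproduced in this text, so both that interpretation and the 15 dB default below are assumptions made for illustration.

```python
def estimate_srnr_db(snr_fullband_db, snr_subband_db, P_db=15.0):
    """Residual-noise-ratio proxy for one subband frame (all values in dB)."""
    # Take the max of the fullband and subband input SNR estimates, floored
    # at P to account for the noise already removed by the pre-enhancement.
    return max(P_db, snr_fullband_db, snr_subband_db)
```

By construction the estimate never falls below the subband input SNR, matching the stated property SRNRm(i) ≥ SNRm(i).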
The method herein takes advantage of the available pre-enhanced signal for which speech and noise have already been discriminated to a certain extent. Based on our initial assumption that the goal of the pre-enhancement algorithm is to try not to degrade intelligibility, the overall approach to noise reduction is much less aggressive.
With the pre-enhancement handling the speech/noise discrimination, our post-processing method leads overall to a more robust enhancement scheme, in that it can perform thresholding with less risk. This is especially the case since we are taking the estimated SNR into account. The method applies uniform scaling to overlapping frames, which is less prone to perceivable "sudden volume change" artifacts than a sample-by-sample volume scaling. The method above may be implemented as a module added to existing schemes, and does not resort to wavelet packet transformations.
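Why uniform scaling of overlapping frames avoids "sudden volume change" artifacts can be seen in a small sketch: with 50%-overlapping triangular windows, per-frame gains cross-fade linearly over half a frame instead of switching instantaneously. The window shape and frame size below are illustrative choices, not taken from the described implementation.

```python
import numpy as np

def apply_frame_gains(x, gains, frame_len=64):
    """Apply one gain per 50%-overlapping frame via windowed overlap-add."""
    hop = frame_len // 2
    n = np.arange(frame_len)
    # Triangular window with win[n] + win[n + hop] = 1 (constant overlap-add),
    # so uniform gains reconstruct the signal exactly away from the edges.
    win = np.where(n < hop, n / hop, (frame_len - n) / hop)
    out = np.zeros(len(x))
    for i, g in enumerate(gains):
        s = i * hop
        out[s:s + frame_len] += g * win * x[s:s + frame_len]
    return out
```

With equal gains the interior of the signal is reproduced exactly, and a hard gain switch between frames turns into a linear ramp over half a frame rather than a per-sample jump.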
In terms of computational complexity, it can be readily seen that the proposed method is extremely low-cost, especially if the pre-enhancement scheme is already frame-based and employing subbands, in which case only one extra equation per band must be applied. The subband method resorting to a "fully discretized" noise PSD lends itself very well to psychoacoustic treatment. Hereafter, a way to include perceptual constraints as part of this method is described. The idea is similar to that shown in the Kalman Filter case; the differences are mainly related to the fact that we are only applying the constraints under certain risk-related conditions, to avoid damage to the speech in complex noise conditions.
The central tool is the estimated masking threshold of the clean speech. The masking threshold of a signal represents, in the frequency domain, the level/curve below which nothing is audible in the presence of the particular signal being studied. The ISO MPEG-1 Layer I psychoacoustic model 1 elaborates a technique to compute such an estimate of a signal's masking threshold. In the context of MPEG coding, this is useful to determine how much quantization noise can be introduced while remaining imperceptible.
Note that, to begin with, an estimated clean signal is used to compute the threshold; in practice a rough clean speech estimate (obtained via spectral subtraction, for example) can provide results almost as good as when the true clean speech is available. In addition, this distinct estimate can be further used to improve the overall quality by combining it with the state-space algorithm's final estimate.
In the algorithm, the masking threshold is used as follows. In a given frame, once the noise power Pn has been estimated and the speech power Px and the masking threshold T have been calculated (based on the prior spectral subtractive estimate), first in each band m the average level of each of the above quantities is calculated (yielding Pn(m), Px(m) and T(m)), and the following two rules are applied: (1) if Pn(m) < T(m), then the current data frame is left unprocessed in band m; (2) if Px(m) < T(m) and Pn(m) > T(m), then the enhancement is made more aggressive by purposely overestimating the corresponding observation variance in the state-space model. The first rule is based on the assumption that if the noise component in band m is to begin with masked by the speech, then there is no need to perform any noise removal. Next, in the second rule, if the speech component is inaudible but some noticeable noise is present in band m, the enhancement takes place in a more aggressive manner.
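The two band-wise decisions can be sketched as a small rule function. The string labels are only placeholders for "skip noise removal in this band", "overestimate the observation variance", and "enhance normally"; the per-band average levels are assumed to be computed beforehand.

```python
def perceptual_rule(noise_level, speech_level, mask_level):
    """Band-wise decision from average noise/speech powers and masking level."""
    # Rule 1: the noise in this band is already masked by the speech,
    # so no noise removal is needed.
    if noise_level < mask_level:
        return "leave"
    # Rule 2: the speech is inaudible but the noise is audible, so the
    # enhancement can safely be made more aggressive.
    if speech_level < mask_level and noise_level > mask_level:
        return "aggressive"
    return "normal"
```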
Note that the above technique can naturally be followed by the post-processing method given above; this yields very good results. Using such conservative rules allows for a less risky solution - and in turn a more robust solution in nonstationary noise - as compared to the case where the scaling is done based only on the masking threshold level.
Following are examples testing the procedures of the invention. These examples should not be construed as limiting. Simulation results obtained from the application of the above method follow:
1. The decomposition into 4 bands. In the following, an algorithm "X" employing such a decomposition is coded as "4B-X".
2. The decomposition into 32 bands. In the following, algorithms "X" employing such a decomposition are coded as "32B-X".
3. The standalone (internal PSD estimation) particle filtering solution, tested on the RBPF algorithm and denoted by 32B-RBPF(Standalone).
4. The post-processing method. Algorithm "X" used with this method is denoted by "X-Post".
5. The application of psychoacoustic constraints. For an algorithm "X" resorting to this technique, the code "Ψ-32B-X" is used.
In addition, a combination of psychoacoustic constraining and post-processing is tested as well, with code "Ψ-32B-X-Post". The algorithms used are the DKF, the KEMBurg, the RBPF, the DEKF4, the DUKF1, the KEM and the NPF. For the first three, all the above variants are tested (except the standalone solution for the non-RBPF ones), and for the last four only the "Ψ-32B-X-Post" results are published, for the reasons discussed below.
In Tables A and B, we assess the average benefits of using each of the techniques across several algorithms. This is done by showing the average difference of scores obtained across the first three algorithms for all types of colored noise in VL/L and M/H conditions, respectively. With reference to algorithm "X", the differences shown are those between each of the scores obtained by "4B-X", "32B-X", "32B-X-Post", "Ψ-32B-X" and "Ψ-32B-X-Post" and the scores obtained from the fullband application of algorithm "X".
In Tables C and D, the 7 individual algorithms are compared in the context of a "Ψ-32B- X-Post" setup, by averaging the scores obtained for all types of colored noise in VL/L and M/H conditions (respectively).
Table A: Estimation of the average benefits obtained by using the subband-based techniques presented in this chapter, in the context of VL/L colored noise conditions. "X" is a generic letter to designate an algorithm to which the techniques are applied - the averages were obtained with 3 algorithms.
Table B: Estimation of the average benefits obtained by using the subband-based techniques presented in this chapter, in the context of M/H colored noise conditions. "X" is a generic letter to designate an algorithm to which the techniques are applied - the averages were obtained with 3 algorithms.
From the results shown in Tables A and B, several conclusions can be made.
First, in VL/L conditions, there are rather clear advantages in using subband methods as opposed to fullband ones, especially for the 32-band case. This is all the more obvious when considering the fact that psychoacoustic constraining and post-processing can be readily applied and also provide non-negligible improvements. Next, it is interesting to note from the bottom rows of Tables A and B that, while the SNR and ASNR scores are lower when internal noise estimation is used, the rest of the measures are not far from those obtained with dedicated, external noise PSD estimation. From informal listening tests, we find that the subband methods are unambiguously better, especially in terms of background noise reduction. In particular, it is also noticeable that the "Ψ-32B-X" and "Ψ-32B-X-Post" methods yield a higher signal quality and a better intelligibility. Moreover, we find that the "standalone" method performing internal noise estimation achieves less noise reduction but still preserves the speech naturalness well. Still, this method remains interesting in terms of complexity since the internal noise estimation only adds a marginal amount of computations per particle.
Regarding medium to high SNR conditions, the results are relatively more contrasted, in the sense that the 4-band solution actually yields slightly worse results, in that it marginally penalizes each objective measure. However, recall that there are still advantages in terms of computational requirements, and thus the 4-band treatment is still an appealing alternative when compared to fullband processing. On the other hand, the 32-band case again provides significant advantages when coupled with psychoacoustic constraining and post-processing. In fact, even without any additional scheme, with 32 bands the WPESQ score is improved on average by 0.14 units. Careful listening to the enhanced signals yields observations that are in accordance with the above findings. For instance, it is difficult to differentiate the fullband and 4-band cases, but improvements become more noticeable with 32 bands, especially with the reduction of background noise.
Finally, in Tables C and D the average scores obtained by each individual algorithm in a "Ψ-32B-X-Post" configuration are shown.
Table C: Comparison between the average scores obtained from using the 7 listed algorithms in VL/L colored noise situations and a "Ψ-32B-X-Post" setup.
Table D: Comparison between the average scores obtained from using the 7 listed algorithms in M/H colored noise situations and a "Ψ-32B-X-Post" setup.
In the VL/L case, two "groups" of algorithms can be formed: first the DKF, NPF, KEMBurg, and RBPF; and secondly the DEKF4, DUKF1, and KEM, all with markedly lower scores than the algorithms from the first group. Quite interestingly, in this setup it turns out that the very simple DKF algorithm yields the best CSII, WPESQ, Csig (ex aequo with the NPF), and Cbak scores - and the second-best ASNR and Covl scores. The NPF, KEMBurg, and RBPF still obtain very close (and sometimes better) results (for example the Covl score for the NPF). Still, according to the objective scores the "Ψ-32B-DKF-Post" algorithm may very well be the best subband option in VL/L conditions.
Informal listening tests result in remarks that are in accordance with the above findings. However, while we are able to confirm that the first "group" of algorithms performs significantly better than the second group, we also find that the DKF, NPF, KEMBurg, and RBPF are relatively difficult to tell apart. Nevertheless, while the DKF is able to remove a slightly larger amount of noise, the NPF overall tends to sound more natural.
In M/H conditions, the same algorithms can be separated into two groups. This time however, the RBPF and NPF both stand out - although the DKF is not far behind. Our subjective impressions, from listening to the enhanced speech files, agree with the above, but we also find that RBPF and NPF are this time more distinguishable from the rest, with crisper and higher quality speech.
Following are additional examples testing the procedures of the invention. These examples should not be construed as limiting.
In order to assess the benefits of using the proposed post-processor, it was appended to three different algorithms, and the differences obtained in quality were measured objectively, while also reporting on the results of informal listening tests. The three algorithms all resort to frame-based background noise spectrum estimation, and are the following:
(1) A multiband spectral subtraction scheme, referred to as MSSUB below,
(2) A subband implementation of the Minimum Mean Squared Error log-spectral amplitude estimator (LMMSE), and
(3) A subband Kalman Filter-based scheme using an EM algorithm to determine the clean AR coefficients, and approximating the noise to be white in each band (i.e., the noise spectrum is discretized in each band to a single value), which will be referred to as KEM.
For each of the algorithms, the output of the background noise estimator is slightly modified so as to provide an underestimate for the noise level, thereby making each pre-enhancement less aggressive and helping to preserve the speech intelligibility. For the post-processor, we use a pseudo-QMF filterbank with M = 16, frames of length N = 512 with 50% overlap, and K = 0.015.
In our implementation, we found that such a choice for K yields the most noise reduction with the least effect on the speech signal, and this across various speakers and speech levels. The clean speech signal, sampled at 20 kHz, is obtained by concatenating multiple speakers (male and female) taken from the TIMIT database and inserting silences so as to obtain a 60% activity rate and a length of approximately 30 seconds. The noise data was obtained online from the following page: http://spib.rice.edu/spib/select noise.html, containing examples from the NOISEX-92 database: namely the babble, factory, military vehicle and car interior noises were used. In each case, the noisy speech signals were created by adding these noises to the clean speech, scaled with 3 different factors so as to obtain various conditions, from low to high input SNR. Thus, in total 12 different conditions were tested for 3 different algorithms.
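The construction of noisy test signals of this kind can be sketched as follows. The original scales each noise with three unspecified factors, so targeting an explicit global SNR directly is an assumption made here for illustration.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that clean + noise has the requested global SNR (dB)."""
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise
```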
The objective quality measures used are the Average segmental Signal-to-Noise Ratio (referred to as ASNR hereafter) and the Coherence Speech Intelligibility Index (CSII). The choice of these objective measures is based on the following considerations: first, the ASNR is mostly correlated with the level of background noise intrusiveness, and thus it is consistent with our objective of reducing the residual noise. Next, the CSII, which can range from 0 to 1, is an index that is found to be an accurate predictor of speech intelligibility (again an important criterion for our work).
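For reference, the ASNR measure can be sketched as below. The frame length and the conventional per-frame clamping range are assumptions, since the exact configuration behind the reported scores is not specified in this text.

```python
import numpy as np

def average_segmental_snr(clean, processed, frame_len=256, lo=-10.0, hi=35.0):
    """Mean of per-frame SNRs (dB) between a clean reference and a
    processed signal, each frame clamped to [lo, hi] dB."""
    snrs = []
    for s in range(0, len(clean) - frame_len + 1, frame_len):
        c = clean[s:s + frame_len]
        e = c - processed[s:s + frame_len]
        snr = 10.0 * np.log10((np.sum(c ** 2) + 1e-12) / (np.sum(e ** 2) + 1e-12))
        snrs.append(float(np.clip(snr, lo, hi)))
    return float(np.mean(snrs))
```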
The results are now shown in Tables 1, 2, and 3, each corresponding to one of the three algorithms.
Table 1 shows results obtained using the multiband spectral subtraction method. The scores reported are ASNR/CSII.
Table 1.
Table 2 shows results obtained using the LMMSE method. The scores reported are ASNR/CSII.
Table 2.
Table 3 shows results obtained using the KEM method. The scores reported are ASNR/CSII.
Table 3.
The noise reduction can be clearly seen, and the speech parts with lower amplitude are not affected either. Figure 3 shows an example of the waveforms obtained with the LMMSE algorithm under babble noise conditions, for which the effect of the post-processor can be clearly viewed: the parts where speech is very present are only minimally affected, but as soon as noisy parts are present the scaling process is effective. Notice particularly that the parts with low speech amplitude are still kept intact. The results in Tables 1, 2, and 3 are now commented. First of all, it is clear from the ASNR reading, from simply observing waveforms such as the one shown in Figure 3, and from informal listening tests that the proposed post-processor is able to remove a significant amount of background noise.
This is particularly noticeable when no speech is present, but it can also be heard during speech utterances, especially when the original noise contains high frequencies. Next, observe that the CSII scores are almost identical before and after the processing, with a few isolated cases where post-processing negligibly improves or degrades them (by ±0.01). The objective of not damaging the intelligibility of the input speech is therefore achieved, which is also what we find in the informal listening tests. As an additional remark, note that the actual speech intelligibility is in fact still moderately affected by the enhancement algorithms in the higher input SNR conditions, whether or not post-processing is used. This is not a surprising observation: when the noisy speech contains too little noise to impede intelligibility in the first place, any processing can only jeopardize the output intelligibility.
It was observed that the KEM algorithm performs better than the other two in babble noise, whereas for the car interior noise, the clear winner is the multiband spectral subtraction scheme. For the other two types of noise, the better performances are distributed among the three algorithms, depending on the input SNR.
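The discrimination behavior described in the passages above (scale down subband frames whose energy falls below a band-dependent threshold when the instantaneous fullband SNR is low, and otherwise leave the frames intact) can be sketched as follows. This is a simplified illustration, not the patented rule itself: the exact two-part rule and thresholds appear only as equation images in the original, so the attenuation gain of 0.1 and the 5 dB SNR threshold used here are purely illustrative assumptions.

```python
import numpy as np

def postprocess_subbands(subbands, fullband_snr_db, alpha, gain=0.1,
                         snr_threshold_db=5.0):
    """Sketch of a subband scaling post-processor. For each subband frame,
    if the frame energy is below the band-dependent threshold alpha[m] AND
    the instantaneous fullband SNR is low, the frame is assumed to hold
    mostly residual noise and is scaled down; otherwise it is kept intact.
    `gain` and `snr_threshold_db` are illustrative values only."""
    out = []
    for m, frame in enumerate(subbands):
        frame = np.asarray(frame, dtype=float)
        energy = float(np.sum(frame ** 2))  # current subband frame energy
        if energy < alpha[m] and fullband_snr_db < snr_threshold_db:
            out.append(frame * gain)   # low-energy frame at low SNR: attenuate
        else:
            out.append(frame)          # speech-dominant frame: leave intact
    return out
```

Note how this structure matches the observations above: frames with strong speech exceed the energy threshold and pass through untouched, while at high input SNR no scaling is applied at all, so low-amplitude speech segments are preserved.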
In sum, the invention provides a very simple and low-complexity add-on to speech enhancement algorithms, which can reduce the excess residual noise in the enhanced speech without affecting intelligibility. The method is particularly advantageous when the enhancement algorithm used operates in subbands, in which case the additional complexity is minimal.
The noise reduction system and method according to the invention can be utilized in a hearing aid or in a cochlear implant comprising a digital signal processor (DSP). In this way the system and method can be integrated into hearing aids and cochlear implants without increasing the size of the instrument.
The invention may be implemented in hardware or software, or a combination of both (e.g., programmable logic arrays). Unless otherwise specified, the processes included as part of the invention are not inherently related to any particular computer or other apparatus. In particular, various general-purpose machines may be used with programs written in accordance with the teachings herein, or it may be more convenient to construct more specialized apparatus (e.g., integrated circuits) to perform the required method steps. Thus, the invention may be implemented in one or more computer programs executing on one or more programmable computer systems each comprising at least one processor, at least one data storage system (including volatile and non-volatile memory and/or storage elements), at least one input device or port, and at least one output device or port. Program code is applied to input data to perform the functions described herein and generate output information. The output information is applied to one or more output devices, in known fashion. Each such program may be implemented in any desired computer language.
Computer program code for carrying out operations of the invention described above may be written in a high-level programming language, such as C or C++, for development convenience. In addition, computer program code for carrying out operations of embodiments of the present invention may also be written in other programming languages, such as, but not limited to, interpreted languages. Some modules or routines may be written in assembly language or even micro-code to enhance performance and/or memory usage. It will be further appreciated that the functionality of any or all of the program modules may also be implemented using discrete hardware components, one or more application specific integrated circuits (ASICs), or a programmed digital signal processor or microcontroller. Code embodying a program of the present invention can be included as firmware in a RAM, a ROM or a flash memory. Alternatively, the code can be stored in a tangible computer-readable storage medium such as a magnetic tape, a flexible disc, a hard disc, a compact disc, a magneto-optical disc, or a digital versatile disc (DVD).
While various embodiments of the present invention have been shown and described herein, it will be obvious that such embodiments are provided by way of example only. Numerous variations, changes and substitutions may be made without departing from the invention herein. Accordingly, it is intended that the invention be limited only by the spirit and scope of the appended claims.

Claims

The invention claimed is:
1. A computer-implemented method for reducing noise in an audio signal composed of speech and noise components, the computer-implemented method comprising: decomposing an audio signal into a plurality of subbands, wherein the audio signal is pre-enhanced by processing with a noise reduction algorithm before or after decomposing to provide an enhanced audio signal having audio components comprising enhanced speech components and residual noise components; processing each of the subbands of the enhanced audio signal by scaling the audio components via a scaling factor for each subband to provide a processed subband audio signal with reduced residual noise components; and reconstituting the processed subband audio signal into an output audio signal having enhanced speech components and reduced residual noise components.
2. The method of claim 1 wherein the audio signal is received from an input device of a hearing aid.
3. The method of claim 1 wherein the scaling is performed on a frame-by-frame basis for the subbands depending on an assumed level of residual noise.
4. The method of claim 3 wherein the assumed level of residual noise is based on an estimate of the Signal-to-Residual Noise-Ratio (SRNR).
5. The method of claim 1 wherein the scaling comprises, for an expected subband speech level a, scaling of low amplitude audio components on a relative basis.
6. The method of claim 1 wherein, at low instantaneous Signal-to-Noise Ratio (SNR), scaling is more severe towards low-amplitude audio components and, conversely, at high SNR, scaling is avoided for low amplitude audio components.
7. The method of claim 1 wherein a discrimination rule for scaling is applied such that, below a certain subband speech level in a particular subband and if an input instantaneous fullband SNR is low, the audio components are scaled down.
8. The method of claim 1 wherein the noise reduction algorithm comprises a non-aggressive algorithm in which the speech component is left substantially intact while the noise component is present but decreased to a smaller energy than the speech component.
9. The method of claim 1 wherein processing each of the subbands of the enhanced audio signal comprises: letting SNR(i) be an estimated signal-to-noise ratio for the i-th frame, with x_m(i) denoting a pre-enhanced, decimated speech vector at subband m, with E_m(i) being a current energy in the pre-enhanced subband segment, and with α(m) being a constant band-dependent threshold; and obtaining a post-processed enhanced series x̃_m(i) by applying a rule to x_m(i) as follows: [the defining expressions and the rule are reproduced as images imgf000022_0001 to imgf000022_0003 in the original publication].
10. A hearing device, comprising: a signal processing unit adapted to receive an input signal and apply a hearing aid gain to the input signal to produce an output signal, wherein the signal processing unit comprises code devices for decomposing an audio signal into a plurality of subbands, wherein the audio signal is pre-enhanced by processing with a noise reduction algorithm before or after decomposing to provide an enhanced audio signal having audio components comprising enhanced speech components and residual noise components; processing each of the subbands of the enhanced audio signal by scaling the audio components via a scaling factor for each subband to provide a processed subband audio signal with reduced residual noise components; and reconstituting the processed subband audio signal into an output audio signal having enhanced speech components and reduced residual noise components.
11. The device of claim 10 wherein the audio signal is received from an input device of a hearing aid.
12. The device of claim 10 wherein the scaling is performed on a frame-by-frame basis for the subbands depending on an assumed level of residual noise.
13. The device of claim 12 wherein the assumed level of residual noise is based on an estimate of the Signal-to-Residual Noise-Ratio (SRNR).
14. The device of claim 10 wherein the scaling comprises, for an expected subband speech level a, scaling of low amplitude audio components on a relative basis.
15. The device of claim 10 wherein, at low instantaneous Signal-to-Noise Ratio (SNR), scaling is more severe towards low-amplitude audio components and, conversely, at high SNR, scaling is avoided for low amplitude audio components.
16. The device of claim 10 wherein a discrimination rule for scaling is applied such that, below a certain subband speech level in a particular subband and if an input instantaneous fullband SNR is low, the audio components are scaled down.
17. The device of claim 10 wherein the noise reduction algorithm comprises a non-aggressive algorithm in which the speech component is left substantially intact while the noise component is present but decreased to a smaller energy than the speech component.
18. The device of claim 10 wherein processing each of the subbands of the enhanced audio signal comprises: letting SNR(i) be an estimated signal-to-noise ratio for the i-th frame, with x_m(i) denoting a pre-enhanced, decimated speech vector at subband m, with E_m(i) being a current energy in the pre-enhanced subband segment, and with α(m) being a constant band-dependent threshold; and obtaining a post-processed enhanced series x̃_m(i) by applying a two-part rule to x_m(i) as follows: [the defining expressions and the two-part rule are reproduced as images imgf000024_0001 to imgf000024_0005 in the original publication].
19. A computer program product executed on a hearing device for a noise reduction, comprising: a computer program that: decomposes an audio signal into a plurality of subbands, wherein the audio signal is pre-enhanced by processing with a noise reduction algorithm before or after decomposing to provide an enhanced audio signal having audio components comprising enhanced speech components and residual noise components; processes each of the subbands of the enhanced audio signal by scaling the audio components via a scaling factor for each subband to provide a processed subband audio signal with reduced residual noise components; and reconstitutes the processed subband audio signal into an output audio signal having enhanced speech components and reduced residual noise components.
20. The computer program product of claim 19 wherein processing each of the subbands of the enhanced audio signal comprises: letting SNR(i) be an estimated signal-to-noise ratio for the i-th frame, with x_m(i) denoting a pre-enhanced, decimated speech vector at subband m, with E_m(i) being a current energy in the pre-enhanced subband segment, and with α(m) being a constant band-dependent threshold; and obtaining a post-processed enhanced series x̃_m(i) by applying a two-part rule to x_m(i) as follows: [the defining expressions and the two-part rule are reproduced as images imgf000025_0001 to imgf000025_0005 in the original publication].
PCT/US2010/023463 2009-02-06 2010-02-08 Method and system for noise reduction for speech enhancement in hearing aid WO2010091339A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US15035409P 2009-02-06 2009-02-06
US61/150,354 2009-02-06

Publications (1)

Publication Number Publication Date
WO2010091339A1 true WO2010091339A1 (en) 2010-08-12

Family

ID=42111468

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2010/023463 WO2010091339A1 (en) 2009-02-06 2010-02-08 Method and system for noise reduction for speech enhancement in hearing aid

Country Status (1)

Country Link
WO (1) WO2010091339A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1376539A1 (en) * 2001-03-28 2004-01-02 Mitsubishi Denki Kabushiki Kaisha Noise suppressor
EP1931169A1 (en) * 2005-09-02 2008-06-11 Japan Advanced Institute of Science and Technology Post filter for microphone array


Non-Patent Citations (1)

Title
CLAUDE MARRO ET AL: "Analysis of Noise Reduction and Dereverberation Techniques Based on Microphone Arrays with Postfiltering", IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, IEEE SERVICE CENTER, NEW YORK, NY, US, vol. 6, no. 3, 1 May 1998 (1998-05-01), XP011054308, ISSN: 1063-6676 *

Cited By (10)

Publication number Priority date Publication date Assignee Title
WO2014086400A1 (en) 2012-12-05 2014-06-12 Advanced Bionics Ag Method and system for electrical stimulation of a patient's cochlear
US9713714B2 (en) 2012-12-05 2017-07-25 Advanced Bionics Ag Method and system for electrical stimulation of a patient's cochlea
US10304478B2 (en) 2014-03-12 2019-05-28 Huawei Technologies Co., Ltd. Method for detecting audio signal and apparatus
US10818313B2 (en) 2014-03-12 2020-10-27 Huawei Technologies Co., Ltd. Method for detecting audio signal and apparatus
US11417353B2 (en) 2014-03-12 2022-08-16 Huawei Technologies Co., Ltd. Method for detecting audio signal and apparatus
EP3148213A1 (en) * 2015-09-25 2017-03-29 Giri, Ritwik Dynamic relative transfer function estimation using structured sparse bayesian learning
CN109416914A (en) * 2016-06-24 2019-03-01 三星电子株式会社 Signal processing method and device suitable for noise circumstance and the terminal installation using it
EP3457402A4 (en) * 2016-06-24 2019-05-22 Samsung Electronics Co., Ltd. Signal processing method and device adaptive to noise environment and terminal device employing same
CN109416914B (en) * 2016-06-24 2023-09-26 三星电子株式会社 Signal processing method and device suitable for noise environment and terminal device using same
WO2018083570A1 (en) * 2016-11-02 2018-05-11 Chears Technology Company Limited Intelligent hearing aid

Similar Documents

Publication Publication Date Title
EP3701525B1 (en) Electronic device using a compound metric for sound enhancement
US9343056B1 (en) Wind noise detection and suppression
AU771444B2 (en) Noise reduction apparatus and method
US9438992B2 (en) Multi-microphone robust noise suppression
US10614788B2 (en) Two channel headset-based own voice enhancement
US10034102B2 (en) Methods and apparatus for reducing ambient noise based on annoyance perception and modeling for hearing-impaired listeners
US9854368B2 (en) Method of operating a hearing aid system and a hearing aid system
US10154353B2 (en) Monaural speech intelligibility predictor unit, a hearing aid and a binaural hearing system
KR101744464B1 (en) Method of signal processing in a hearing aid system and a hearing aid system
EP2395506A1 (en) Method and acoustic signal processing system for interference and noise suppression in binaural microphone configurations
US9245538B1 (en) Bandwidth enhancement of speech signals assisted by noise reduction
CN106331969B (en) Method and system for enhancing noisy speech and hearing aid
WO2010091339A1 (en) Method and system for noise reduction for speech enhancement in hearing aid
Wang et al. Improving the intelligibility of speech for simulated electric and acoustic stimulation using fully convolutional neural networks
Upadhyay et al. The spectral subtractive-type algorithms for enhancing speech in noisy environments
US20230169987A1 (en) Reduced-bandwidth speech enhancement with bandwidth extension
Parikh et al. Blind source separation with perceptual post processing
Whitmal et al. Denoising speech signals for digital hearing aids: a wavelet based approach
EP3837621B1 (en) Dual-microphone methods for reverberation mitigation
Pandey et al. Adaptive gain processing to improve feedback cancellation in digital hearing aids
Madhavi et al. A Thorough Investigation on Designs of Digital Hearing Aid.
Lezzoum et al. NOISE REDUCTION OF SPEECH SIGNAL USING TIME-VARYING AND MULTI-BAND ADAPTIVE GAIN CONTROL
Parikh et al. Perceptual artifacts in speech noise suppression
Huang Efficient acoustic noise suppression for audio signals

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 10704459

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 10704459

Country of ref document: EP

Kind code of ref document: A1