EP3411876B1 - Babble noise suppression - Google Patents
- Publication number
- EP3411876B1 (application EP16808870.6A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- noise
- speech
- floor
- audio signal
- weighting coefficients
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208: Noise filtering
- G10L21/0216: Noise filtering characterised by the method used for estimating noise
- G10L21/0232: Processing in the frequency domain
- G10L2021/02087: Noise filtering, the noise being separate speech, e.g. cocktail party
- G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L25/21: Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being power information
- G10L25/84: Detection of presence or absence of voice signals for discriminating voice from noise
- G10L25/93: Discriminating between voiced and unvoiced parts of speech signals
(All under G10L: speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding.)
Definitions
- Speech-controlled applications and devices supporting human speech communication are becoming more and more popular.
- Human-computer interfaces based on speech recognition allow users to dictate text and control devices using spoken commands comprising speech utterances.
- A speech detector may be employed for speech recognition to determine a beginning and an end of such speech utterances. A technique for speech enhancement using a noise overestimation factor and a noise floor is known from the publication Gemello R. et al., "A modified Ephraim-Malah noise suppression rule for automatic speech recognition," Proceedings ICASSP 2004, 17-21 May 2004, pages 957-960. A similar technique is known from the patent application EP1286334A2.
- Mobile speech applications employed by devices, such as smartphones, may be utilized in crowded surroundings. As a user speaks to the device, background noise (also referred to interchangeably herein as "noise" or "environmental noise") may be present.
- The background noise may include speech from one or more interfering speakers that may be talking in the background while the user speaks to the device.
- The speech from the one or more interfering speakers may be referred to interchangeably herein as "babble," "babble noise," "babble speech," or "interfering speech" and may be present in a crowded environment in which multiple persons are speaking.
- The crowded environment may be a public space, such as a restaurant or cafe, or any other suitable space in which multiple persons are speaking.
- A speech application that includes a speech processing method may be employed on a device that may be used by a user speaking to the device in the crowded environment, and, thus, the speech application may experience a crowded environment noise condition.
- For example, the speech application may receive as input an electronic representation of the user's voice that may be superposed with interfering voices.
- In contrast to other noise conditions, such as an automotive noise condition that may include engine or wiper noise, the crowded environment noise condition may include babble noise that contains portions of interfering speech from the interfering voices.
- The crowded environment noise condition that includes the babble noise may be referred to herein as a babble noise scenario and poses a challenge for speech processing methods. Assumptions, such as stationarity of the noise or a good Signal to Noise Ratio (SNR), may not be valid for babble noise scenarios. According to embodiments disclosed herein, other distinctive properties may be considered for distinguishing the babble noise from the user's speech.
- Since the background noise contains speech portions, it is a particularly challenging scenario for many speech processing methods (Nitish Krishnamurthy and John H. L. Hansen, "Babble Noise: Modeling, Analysis, and Applications," IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 7, pp. 1394-1407, Sept. 2009), such as speech detection methods or noise reduction methods. Speech detectors that distinguish between desired foreground speech (also referred to interchangeably herein as "foreground speech components," "user speech," or "desired speech") and the background noise may be triggered, falsely, by the interfering speech, that is, the babble noise, that may be present in the background noise. Therefore, an elaborated speech detector that maintains reliable results under such an adverse condition, that is, the presence of babble noise, may be useful.
- The background noise may include stationary and non-stationary noise components. Changes in a power spectrum of the audio signal over a time interval may be less for the stationary noise components than for the non-stationary noise components. Standard noise reduction approaches primarily target reducing stationary background noise components in an audio signal. Non-stationary components of babble noise are, therefore, not sufficiently suppressed. As a result, the non-stationary components may remain as annoying artifacts in the processed audio signal. Embodiments disclosed herein are motivated by a desire to deal with babble noise while not affecting the desired foreground speech, that is, the speech from the user speaking to the device.
- Characteristics of babble noise are disclosed herein, and distinctive features are summarized that enable distinguishing the desired foreground speech from the babble noise.
- In particular, according to embodiments disclosed herein, kurtosis of a signal is identified as a good measure to use to detect the presence of the desired foreground speech even in the presence of babble noise. It should be understood that detection of the presence may be in the form of a hard result, such as a boolean true/false type indicator, or in the form of a soft result, that is, a probability value that indicates the probability of the presence of the desired foreground speech.
- According to embodiments disclosed herein, a babble noise suppression system is introduced that distinguishes between the desired foreground speech and the babble noise.
- In an example embodiment, an aggressiveness (i.e., an attenuation amount, strength of attenuation, or attenuation level) of the noise suppression may be controlled by a kurtosis-based speech detector. Strong attenuation may be applied during absence of speech, whereas the aggressiveness may be reduced, dynamically, in response to speech being detected.
- In addition, according to embodiments disclosed herein, annoying fluctuations in the background noise may be reduced to achieve a more stationary background in a processed audio signal.
- According to embodiments disclosed herein, strong suppression of the babble noise is desired from the babble noise suppression system. In addition, any remaining noise after processing the audio signal should be perceived as pleasant by human listeners.
- To evaluate improvements achieved by an example embodiment of a babble noise suppression system disclosed herein, results from a subjective listening test are presented, as disclosed further below in the Results section. Further, an acceptability of the remaining background noise after processing is assessed, as disclosed in the Results section.
- FIG. 1 is a block diagram 100 of an example embodiment of a system 102 configured to perform noise suppression of an input audio signal 101.
- the input audio signal 101 may include foreground speech components 103 and background noise 105.
- the system 102 may comprise an input interface 117, that may be a hardware input interface or any other suitable interface, configured to transform the input audio signal 101 into an electronic representation of the input audio signal 101' for input to a soft speech detector 104.
- the soft speech detector 104 is configured to determine, dynamically, a speech detection result 112.
- the speech detection result 112 may indicate a likelihood ( i . e ., a probability value) of a presence of the foreground speech components 103 in the input audio signal 101.
- the system 102 comprises a noise suppressor 114 communicatively coupled to the soft speech detector 104 to receive the speech detection result 112 determined and the electronic representation of the input audio signal 101'.
- the noise suppressor 114 is configured to compute, dynamically, spectral weighting coefficients 109 based on the speech detection result 112 determined and to apply the spectral weighting coefficients 109 computed to the electronic representation of the input audio signal 101' to suppress the background noise 105 in a dynamic manner.
- the system 102 may comprise a memory 111 and the noise suppressor 114 may be configured to store the spectral weighting coefficients 109 computed in the memory 111.
- the noise suppressor 114 may be configured to retrieve the spectral weighting coefficients 109 computed from the memory 111 to apply the spectral weighting coefficients 109 computed to the electronic representation of the input audio signal 101'.
- the soft speech detector 104 is referred to as a "soft" speech detector because the soft speech detector 104 determines the speech detection result 112 that may represent a likelihood ( i . e ., probability value) of a presence of the foreground speech components 103 as opposed to a "hard” result that represents a definitive true/false boolean type of result for indicating whether the foreground speech components 103 are present.
- a "soft" speech detector may be a speech detector that produces a "soft" speech detection result that represents a likelihood ( i . e ., probability) of a presence of speech components in an audio signal.
- the foreground speech components 103 may correspond to speech from a user's voice, such as speech from a voice of a user 107a that may be speaking to the system 102.
- the background noise 105 may include the babble noise 113.
- the babble noise 113 may include a composition of multiple background speech components from one or more other speakers, such as the user 107b and the user 107c.
- the system 102 may be referred to interchangeably herein as a babble noise suppression system 102.
- the soft speech detector 104 may be further configured to determine the speech detection result 112 as a function of a combination of feature values determined in the time domain, frequency domain, or a combination thereof.
- the combination of feature values includes kurtosis and at least one other feature value as disclosed below with reference to FIG. 2 .
- The other feature value is a cepstral maximum feature value; however, in other implementations not covered by the claimed invention, any other suitable feature value that reflects the presence of speech may be employed for combination with the kurtosis feature value.
- power, signal-to-noise power ratio, harmonicity, or pitch-based features may be employed, or any other suitable feature that reflects the presence of speech.
- the system 102 employs the soft speech detector 104 to control, dynamically, an aggressiveness (i . e ., an attenuation strength) of noise suppression of the electronic representation of the input audio signal 101' by the noise suppressor 114.
- the babble noise suppression system 102 produces an electronic representation of an output audio signal 115' with the background noise 105 suppressed.
- the electronic representation of the output audio signal 115' may be output as the output audio signal 115 by an output interface 119, that may be a hardware interface configured to produce the output audio signal 115 (also referred to interchangeably herein as a processed audio signal 115) in an audible form.
- any remaining noise in the processed audio signal 115 may be perceived by human listeners as more pleasant than the background noise 105 of the audio signal 101.
- Further embodiments of the babble noise suppression system 102 are disclosed below with regard to FIG. 2 .
- FIG. 2 is a block diagram 200 of another embodiment of a system 202 configured to perform noise suppression of an electronic representation of an input audio signal 201'.
- The electronic representation of the input audio signal 201', that is, x(n), where n is the sample index, is an electronic representation of a pre-emphasized input audio signal.
- In other embodiments, x(n) may not be pre-emphasized.
- the system 202 may also be referred to interchangeably herein as a babble noise suppression system 202.
- the babble noise suppression system 202 comprises two main parts, a soft speech detector 204 and a noise suppressor 214.
- the noise suppressor 214 is configured to determine spectral weighting coefficients in order to suppress the babble noise, as disclosed further below in the Noise Suppression section.
- the soft speech detector 204 is configured to determine a speech detection result 212 that indicates a likelihood of presence of desired speech in an input audio signal, such as the likelihood of the presence of the foreground speech components 103 of the input audio signal 101 of FIG. 1 , disclosed above.
- the soft speech detector 204 may determine the speech detection result 212 as a function of a combination of feature values determined in the time domain, frequency domain, or a combination thereof.
- the speech detection result 212 may be referred to interchangeably herein as a combined speech detection result 212.
- the soft speech detector 204 includes a kurtosis feature module 208, a cepstral maximum feature module 210, and a combiner 206.
- The combiner 206 is configured to combine a kurtosis feature 232, produced by the kurtosis feature module 208, with a cepstral maximum feature 234, produced by the cepstral maximum feature module 210, to produce the speech detection result 212, as disclosed further below with regard to Equations 2-8.
- the speech detection result 212 is employed to control the noise suppressor 214.
- the noise suppressor 214 is designed to attenuate the background noise, such as the background noise 105 of FIG. 1 , more aggressively during speech pauses (not shown) of the electronic representation of the input audio signal 201', and such speech pauses may be identified as a function of the speech detection result 212 produced by the soft speech detector 204.
- spectral weighting coefficients such as the Wiener filter spectral weighting coefficients 241, modified spectral weighting coefficients 243, and final spectral weighting coefficients 245, are determined by the noise suppressor 214 to enable the babble noise suppression system 202 to apply stronger attenuation during the speech pauses based on overestimating the noise.
- the noise suppressor 214 includes a noise shaper 220 and post-processor 222 to achieve a more stationary electronic representation of the output audio signal 215' by applying noise shaping and post-processing, respectively, as disclosed further below in the Noise Suppression section.
- Embodiments of the soft speech detectors 104 and 204 of FIGS. 1 and 2 , respectively, are disclosed below in the Speech Detection section.
- Embodiments of the noise suppressors 114 and 214 of FIGS. 1 and 2 are disclosed further below in the Noise Suppression section.
- The normalized kurtosis of a random variable χ reflects the peakiness of its probability density function (PDF) (Guoping Li and Mark E. Lutman, "Sparseness and speech perception in noise," in Proc. of Statistical and Perceptual Audition (SAPA), Pittsburgh, PA, USA, 2006). For a zero-mean variable, E{χ} = 0, the normalized kurtosis may be written as kurt(χ) = E{χ⁴} / E{χ²}² − 3 (Equation 1).
- Positive values of the kurtosis indicate a sharp peak of the distribution, whereas the kurtosis vanishes for Gaussian distributed variables. Clean speech is, therefore, characterized by high values of the kurtosis.
- the kurtosis decreases when multiple speech samples are mixed since the result approaches a Gaussian distributed random variable.
- Kurtosis-based speech detection is discussed in multiple publications, such as (Elias Nemer, Rafik Goubran, and Samy Mahmoud, "Robust voice activity detection using higher-order statistics in the LPC residual domain," IEEE Transactions on Speech and Audio Processing, vol. 9, no. 3, pp. 217-231, 2001) and (David Cournapeau, Tatsuya Kawahara, Kenji Mase, and Tomoji Toriyama, "Voice activity detector based on enhanced cumulant of LPC residual and on-line EM algorithm," in Proc.).
- In such approaches, the kurtosis is applied to the residual signal after linear predictive coding (LPC) analysis.
- In contrast, embodiments disclosed herein, such as an embodiment of the kurtosis feature module 208, apply the kurtosis feature directly to the electronic representation of the input audio signal 201', x(n), that is, a pre-emphasized input audio signal, to detect the presence of desired speech.
- According to embodiments disclosed herein, a time-dependent estimate of the kurtosis feature is determined, such as by the kurtosis feature module 208, as kurt(l) = m4(n) / m2(n)² − 3, evaluated at n = l · R (Equation 2), where downsampling by the factor R is applied to align the feature with the l-th frame used for block processing.
- Moments of second and fourth order, that is, m2(n) and m4(n), respectively, are estimated by the kurtosis feature module 208 using recursive smoothing with a smoothing constant βk, for example, m2(n) = βk · m2(n − 1) + (1 − βk) · x²(n) (Equation 3).
- Analogously, m4(n) is computed by smoothing x⁴(n).
- According to embodiments disclosed herein, the kurtosis is smoothed again by the kurtosis feature module 208 using a moving average with a look-ahead Lf to capture speech onsets (Equation 4).
- The kurtosis feature module 208 is configured to produce the kurtosis 232 according to Equations 2, 3, and 4, disclosed above.
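The following Python sketch illustrates one plausible implementation of the kurtosis feature described above: recursive smoothing of the second- and fourth-order moments, the normalized kurtosis m4/m2² − 3, downsampling by R, and a moving average with look-ahead. Since Equations 2-4 are not reproduced in this text, the exact forms, as well as the parameter values beta_k, R, and L_f, are assumptions for illustration only.

```python
import numpy as np

def kurtosis_feature(x, beta_k=0.999, R=256, L_f=4):
    """Hedged sketch of the kurtosis feature (Equations 2-4, forms assumed)."""
    kurt = np.empty(len(x))
    m2, m4 = 1e-10, 1e-10  # recursively smoothed 2nd/4th-order moments
    for n, xn in enumerate(x):
        m2 = beta_k * m2 + (1.0 - beta_k) * xn ** 2   # Equation 3 (assumed form)
        m4 = beta_k * m4 + (1.0 - beta_k) * xn ** 4   # smoothing of x^4(n)
        kurt[n] = m4 / max(m2 ** 2, 1e-20) - 3.0      # normalized kurtosis
    # Downsample by R to align the feature with the l-th processing frame.
    kurt_l = kurt[::R]
    # Moving average with look-ahead L_f to capture speech onsets (Equation 4).
    kernel = np.ones(L_f + 1) / (L_f + 1)
    return np.convolve(kurt_l, kernel, mode='full')[L_f:L_f + len(kurt_l)]
```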
- FIG. 3A is a graph 350 of an illustration of frequency components over time of a speech spectrum of an audio signal including a Texas Instruments/Massachusetts Institute of Technology (TIMIT) utterance.
- the graph 350 includes frequency components ⁇ [kHz] 352 for the audio signal over time t [s] 354.
- the TIMIT utterance is: "She had your dark suit in greasy wash water all year.”
- the TIMIT utterance is indicative of the desired speech, such as the foreground speech components 103 of FIG. 1 , disclosed above.
- The simulated feature values 356 of FIG. 3B include simulated kurtosis feature values 362 that are based on the kurtosis feature alone and resulted in most of the desired speech 361 of FIG. 3A being detected, as shown in FIG. 3B. However, in a few cases, low-frequency vowels, such as "/u/," were missed.
- To capture such cases, embodiments disclosed herein may employ an additional feature that may reflect, explicitly, the voiced excitation of human speech, such as the cepstral maximum feature 234 produced by the cepstral maximum feature module 210 of FIG. 2.
- The simulated feature values 356 include simulated cepstral maximum feature values 364 as well as simulated combination feature values 366 that are based on a combination of the kurtosis and cepstral maximum features. As shown in FIG. 3B, the simulated combination feature values 366 yielded detection of the highest amount of the desired speech 361 of FIG. 3A.
- According to embodiments disclosed herein, a speech detection feature based on kurtosis is capable of distinguishing between desired speech and babble noise. Further improvement may be achieved by smoothing the kurtosis and/or combining the kurtosis, or smoothed kurtosis, with a complementing feature, that is, the cepstral maximum feature, disclosed in more detail below.
- To determine the cepstral maximum feature 234, cepstral coefficients cepst(q, l) are determined, where q denotes the cepstral coefficient index and l denotes the l-th frame.
- The cepstral coefficients cepst(q, l) are determined by the cepstral maximum feature module 210 of FIG. 2 based on the noisy speech spectrum Φxx(k, l), which may be referred to interchangeably herein as the power spectral density.
- The noisy speech spectrum Φxx(k, l) is estimated by smoothing, temporally, the magnitude-squared Short-Term Fourier Transform (STFT) coefficients |X(k, l)|² (Equation 5).
- The system 202 of FIG. 2 comprises an STFT module 224 and a spectrum estimator 228.
- The STFT module 224 is configured to transform blocks of the electronic representation of the input audio signal 201', x(n), in the time domain into a representation in the frequency domain, that is, the STFT coefficients X(k, l) 226.
- The STFT coefficients X(k, l) 226 are input to the spectrum estimator 228, which is configured to estimate the power spectral density Φxx(k, l) by smoothing, temporally, the magnitude-squared STFT coefficients 226.
- That is, the spectrum estimator 228 is configured to estimate a power spectrum of the electronic representation of the input audio signal 201', x(n), based on a transformation of the electronic representation of the input audio signal 201' from the time domain to the frequency domain.
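A minimal sketch of the spectrum estimator 228, assuming first-order recursive smoothing of the magnitude-squared STFT coefficients with a hypothetical smoothing constant alpha (the exact rule of Equation 5 is not reproduced in this text):

```python
import numpy as np

def estimate_psd(X, alpha=0.9):
    """Temporal smoothing of |X(k, l)|^2; X has shape (bins, frames)."""
    phi_xx = np.empty(X.shape)
    phi_xx[:, 0] = np.abs(X[:, 0]) ** 2
    for l in range(1, X.shape[1]):
        phi_xx[:, l] = alpha * phi_xx[:, l - 1] + (1.0 - alpha) * np.abs(X[:, l]) ** 2
    return phi_xx
```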
- The STFT coefficients X(k, l) 226 determined by the STFT module 224 are further input to a noise shaper 220 and a multiplicative element 253, as disclosed further below.
- The power spectral density Φxx(k, l) is input as the power spectral density Φxx(k, l) 230a to the cepstral maximum feature module 210 for determining the cepstral coefficients cepst(q, l) (not shown), disclosed above.
- The power spectral density Φxx(k, l) is further input as the power spectral density Φxx(k, l) 230b to a noise estimator 236 that is configured to estimate the power spectral density of the noise, that is, the estimated noise spectrum Φbb(k, l) 238, based on the estimated power spectral density Φxx(k, l) 230b of the noisy input speech signal as estimated by the spectrum estimator 228.
- The power spectral density Φxx(k, l) 230b is further input to a Wiener filter 221, as disclosed further below in the Noise Suppression section.
- The power spectral density Φxx(k, l) 230a and the power spectral density Φxx(k, l) 230b are the same power spectral density Φxx(k, l) determined by the spectrum estimator 228.
- The estimated noise spectrum Φbb(k, l) 238 is not shown as an input to the cepstral maximum feature module 210 for simplicity; however, the estimated noise spectrum Φbb(k, l) 238 may be used by the cepstral maximum feature module 210 for determining the cepstral coefficients cepst(q, l).
- The cepstral maximum feature module 210 is configured to accumulate neighboring bins of the cepstrum (Equation 6) before the maximum in the relevant region between 60 Hz and 300 Hz is searched. An offset is removed by the cepstral maximum feature module 210 by subtracting an offset parameter voicing_offset, according to Equation 6, disclosed above.
- According to embodiments disclosed herein, the cepstral maximum feature module 210 is configured to smooth the maximum value, temporally, according to Equation 7, to produce the cepstral maximum feature 234.
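The sketch below outlines the cepstral maximum feature: a real cepstrum of the log power spectral density, accumulation of neighboring bins, a maximum search in the 60-300 Hz pitch region, offset subtraction, and temporal smoothing. Since Equations 5-7 are not reproduced here, the 3-bin accumulation, the recursive smoothing, and the parameters voicing_offset and beta_c are stand-ins, not the patent's exact forms.

```python
import numpy as np

def cepstral_max_feature(phi_xx, fs=16000, voicing_offset=0.1, beta_c=0.7):
    """Hedged sketch of the cepstral maximum feature 234."""
    num_frames = phi_xx.shape[1]
    feature = np.zeros(num_frames)
    q_lo, q_hi = fs // 300, fs // 60   # quefrencies for pitch in 60..300 Hz
    prev = 0.0
    for l in range(num_frames):
        log_spec = np.log(np.maximum(phi_xx[:, l], 1e-20))
        full = np.concatenate([log_spec, log_spec[-2:0:-1]])  # mirror one-sided PSD
        cepst = np.real(np.fft.ifft(full))
        # Accumulate neighboring cepstral bins (stand-in for Equation 6).
        acc = cepst[q_lo:q_hi] + cepst[q_lo - 1:q_hi - 1] + cepst[q_lo + 1:q_hi + 1]
        val = np.max(acc) - voicing_offset            # remove the voicing offset
        prev = beta_c * prev + (1.0 - beta_c) * val   # temporal smoothing (Eq. 7)
        feature[l] = prev
    return feature
```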
- The cepstral maximum feature 234 of FIG. 2 is shown as the simulated cepstral maximum feature values 364 of FIG. 3B, disclosed above.
- For the simulation, the STFT module 224 was employed with a Hann window of length 512 samples to determine the STFT coefficients X(k, l) 226 of FIG. 2.
- For the simulation, the noise estimate Φbb(k, l) 238 was not updated.
- According to embodiments disclosed herein, the kurtosis feature 232 and the cepstral maximum feature 234 are combined by a combiner 206 to produce the speech detection result 212.
- Only positive values of the kurtosis feature 232 and the cepstral maximum feature 234 may be considered, to prevent negative contributions.
- The combination of both features is shown as the simulated combination feature values 366 that may represent the speech detection result 212 of FIG. 2.
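A hedged sketch of the combiner 206: the combination rule of Equation 8 is not reproduced in this text, so a weighted sum of the half-wave rectified features (only positive values contribute, per the description) is shown as one plausible choice; the weights w_k and w_c are assumptions.

```python
import numpy as np

def combine_features(kurt, cepst_max, w_k=1.0, w_c=1.0):
    """Combine kurtosis and cepstral maximum features into comb(l)."""
    # Half-wave rectification keeps only positive contributions, per the text.
    return w_k * np.maximum(kurt, 0.0) + w_c * np.maximum(cepst_max, 0.0)
```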
- a good detection performance for the kurtosis feature is observable from the kurtosis feature Receiver Operating Characteristic (ROC) curve 404 shown in FIG. 4 , disclosed below.
- FIG. 4 is a block diagram 400 of a graph 402 with Receiver Operating Characteristic (ROC) curves.
- the cepstral maximum feature ROC curve 410 that may be computed according to Equation 7, disclosed above, shows a lower performance than the kurtosis feature ROC curve 404 that may be computed according to Equation 2, disclosed above, since the cepstrum only detects voiced speech portions. Furthermore, harmonic components in the babble noise sometimes falsely trigger the cepstral maximum feature.
- performance improvements can be achieved by combining both features, as shown by the combined feature ROC curve 416 and performance can be further improved by combining both features and applying a smoothing to the kurtosis, as shown by the combination feature with kurtosis smoothing ROC curve 420.
- smoothing the kurtosis according to Equation 4, disclosed above, yields the smoothed kurtosis feature ROC curve 418.
- The noise suppressor 214 of FIG. 2 is configured to perform noise suppression in which a strong attenuation is applied during speech pauses by overestimating the noise, dynamically.
- The noise suppressor 214 includes an overestimator 250 that is configured to produce an overestimation factor βoe(l) 216 to control the aggressiveness of the noise suppression.
- The noise suppressor 214 is configured to compute, dynamically, the dynamic noise overestimation factor βoe(l) 216 based on the speech detection result 212 determined.
- The noise suppressor 214 may be further configured to determine periods of speech pauses and periods of speech activity in the electronic representation of the input audio signal 201' as a function of the speech detection result 212 determined.
- The overestimation factor βoe(l) 216 is computed by the overestimator 250 based on the combined speech detection result 212 according to Equation 11, disclosed further below.
- The noise suppressor 214 includes the Wiener filter 221.
- The Wiener filter 221 is configured to produce spectral weighting coefficients Hwf(k, l) 241 according to Hwf(k, l) = 1 − (βoe(l) · Φbb(k, l)) / Φxx(k, l) (Equation 9), where the power spectral density Φxx(k, l) 230b from the spectrum estimator 228 is represented in the denominator.
- The noise suppressor 214 includes a first multiplicative element 251 configured to apply the overestimation factor βoe(l) 216, that is a scalar, to the estimated power spectral density (PSD) of the noise, that is, the estimated noise spectrum Φbb(k, l) 238, to produce the numerator of Equation 9, disclosed above, that is, the overestimated power spectral density of the noise 239 of FIG. 2.
- The overestimation factor βoe(l) 216 is computed by the overestimator 250 based on the combined speech detection result 212 and applied to the estimated noise power spectral density Φbb(k, l) 238 by employing the first multiplicative element 251.
- In Equation 9, disclosed above, all variables are scalars; note that the overestimation factor βoe(l) is not frequency dependent, as it depends only on the frame index l.
- The power spectral density (PSD) Φxx(k, l), in contrast, depends on both frequency (k) and frame (l). As such, for a frame l, the same overestimation factor βoe(l) is applied to all frequency bins (k) of the PSD Φxx(k, l).
- The spectral weighting coefficients Hwf(k, l) 241 are computed according to the Wiener filter (Equation 9) based on the estimated power spectral density of the noisy speech signal Φxx(k, l) 230b and the overestimated power spectral density of the noise 239. According to embodiments disclosed herein, increasing the value of the dynamic noise overestimation factor βoe(l) 216 enables the spectral weighting coefficients computed, that is, Hwf(k, l) 241, to increase suppression of the background noise relative to an amount of suppression of the background noise for the periods of speech activity determined.
- The Wiener filter 221 corresponds to a classical Wiener characteristic.
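The following sketch computes the spectral weighting coefficients of Equation 9 as reconstructed above: the classical Wiener characteristic with the noise PSD scaled, per frame, by the overestimation factor. The lower clipping at zero is an implementation convenience; the actual floor is applied later by the noise shaper 220.

```python
import numpy as np

def wiener_coefficients(phi_xx, phi_bb, beta_oe):
    """Wiener weights with dynamic noise overestimation (Equation 9).

    phi_xx, phi_bb: (bins, frames) PSDs of noisy input and noise.
    beta_oe: per-frame overestimation factors, shape (frames,).
    """
    h_wf = 1.0 - (beta_oe[np.newaxis, :] * phi_bb) / np.maximum(phi_xx, 1e-20)
    return np.maximum(h_wf, 0.0)  # floor handling is done by the noise shaper
```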
- The spectral weighting coefficients Hwf(k, l) 241 are input to the noise shaper 220, disclosed above.
- The noise shaper 220 is configured to apply a dynamic floor to the spectral weighting coefficients Hwf(k, l) 241 that reduces the non-stationary noise in the noisy speech signal.
- According to embodiments disclosed herein, a dynamic floor shapes the residual noise (Vasudev Kandade Rajan, Christin Baasch, Mohamed Krini, and Gerhard Schmidt, "Improvement in Listener Comfort Through Noise Shaping Using a Modified Wiener Filter Approach," in Proc. of 11. ITG Symposium on Speech Communication, Erlangen, Germany, 2014) to achieve a more stationary output signal.
- The overestimation factor βoe(l) 216 is computed based on the speech detection result 212, that is, the combined speech detection feature comb(l), according to Equation 11, and applied to the estimated noise spectrum Φbb(k, l) 238 according to Equation 9, disclosed above.
- Absence of speech is indicated by low values of the combined speech detection result 212 (e.g., comb(l) ≈ 0), resulting in a strong overestimation.
- High values of the feature, which indicate presence of speech, result in a reduced overestimation factor βoe(l) → 1. This kurtosis-based control prevents speech distortions caused by too aggressive attenuation during presence of speech.
- According to Equations 9 and 11, speech detection features are employed to control the aggressiveness of the noise suppression and to protect the desired speech by reducing, dynamically, the aggressiveness, while more aggressive attenuation is applied to the non-stationary noise components.
- According to Equations 10 and 12, disclosed above, combination with reduction of the non-stationary components by selectively lowering the maximal attenuation Hfloor(k, l) yields a more stationary output in addition to the more aggressive attenuation.
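Since Equation 11 is not reproduced in this text, the sketch below merely satisfies the stated behavior: comb(l) ≈ 0 (absence of speech) yields a large overestimation factor, while large comb(l) (presence of speech) drives the factor toward 1. The constants beta_max and gamma are hypothetical tuning parameters.

```python
import numpy as np

def overestimation_factor(comb, beta_max=4.0, gamma=2.0):
    """Map comb(l) to beta_oe(l) (Equation 11, assumed form)."""
    # Large detection values reduce the factor toward 1 (less aggressive).
    return np.maximum(1.0, beta_max - gamma * np.asarray(comb))
```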
- FIG. 5A is an example showing a spectrogram 502a for an unprocessed noisy input signal.
- FIG. 5B, FIG. 5C, and FIG. 5D show spectrograms 502b, 502c, and 502d, respectively, illustrating improvements for babble noise suppression of the unprocessed noisy input signal according to embodiments disclosed herein.
- FIG. 5B shows a spectrogram 502b obtained by processing the unprocessed noisy input signal of FIG. 5A with a Wiener filter without overestimation and with a fixed floor.
- The spectrogram 502b shows that stationary noise is reduced; however, strong non-stationary artifacts, that is, non-stationary components, remain. These artifacts are reduced by the dynamic overestimation, as shown in the spectrogram 502c of FIG. 5C, disclosed below.
- FIG. 5C shows a spectrogram 502c showing further improvement of the babble noise suppression that is achieved by processing the unprocessed noisy input signal of FIG. 5A with a Wiener filter as in FIG. 5B and with dynamic noise overestimation, as disclosed above.
- FIG. 5D shows further improvement to the quality of the unprocessed noisy input signal of FIG. 5A , as disclosed below.
- According to embodiments disclosed herein, the desired shape of the residual noise is computed based on the estimated noise spectrum averaged over time (Equation 12).
- The noise suppressor 214 is further configured to compute, dynamically, a dynamic noise floor, and to selectively lower the dynamic noise floor at frequencies corresponding to the non-stationary noise components.
- The noise shaper 220 is configured to apply the dynamic noise floor to the spectral weighting coefficients Hwf(k, l) 241, which may reduce the non-stationary noise.
- The spectral weighting coefficients are computed further based on the dynamic noise floor computed and selectively lowered, to produce the modified spectral weighting coefficients after noise shaping 243 that are input to the post-processor 222.
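A sketch of the noise shaping performed by the noise shaper 220, under the assumption that the dynamic floor follows the time-averaged noise spectrum (Equation 12) and is selectively lowered in bins where the instantaneous noise estimate exceeds that average, that is, at non-stationary components. The fixed floor value and the averaging constant eta are assumptions.

```python
import numpy as np

def apply_dynamic_floor(h_wf, phi_bb, floor_fixed=0.1, eta=0.95):
    """Noise shaping with a dynamic floor (Equations 10 and 12, forms assumed)."""
    h = np.copy(h_wf)
    phi_avg = np.copy(phi_bb[:, 0])   # noise spectrum averaged over time
    for l in range(h.shape[1]):
        phi_avg = eta * phi_avg + (1.0 - eta) * phi_bb[:, l]
        # Lower the floor where the noise is momentarily stronger than average.
        shape = np.minimum(phi_avg / np.maximum(phi_bb[:, l], 1e-20), 1.0)
        h[:, l] = np.maximum(h[:, l], floor_fixed * shape)
    return h
```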
- residual non-stationary noise components may be identified relying on contextual information from neighboring coefficients.
- Spectral weighting coefficients corresponding to the residual non-stationary noise components may be set to a lower value, such as a fixed noise floor or a dynamic noise floor.
- The post-processor 222 is configured to modify the spectral weighting coefficients after noise shaping 243 according to Equation 13, disclosed below, to produce the final spectral weighting coefficients H(k, l) 245. According to embodiments disclosed herein, sporadically occurring musical tones may, finally, be removed by the post-processor 222. According to Equation 13, spectral weighting coefficients that exceed the fixed noise floor are set to the dynamic floor in the event that a majority of coefficients in neighboring frequencies attenuates the spectrum to the dynamic floor.
- That is, a majority of neighboring coefficients must attenuate more strongly than the fixed floor to overrule the original weighting coefficient, that is, a given one of the modified spectral weighting coefficients 243.
- For this purpose, a first number n−(k, l) of coefficients next to k that attenuate more strongly than the fixed floor is compared to a second number n+(k, l) of coefficients that exceed the fixed floor.
- the noise suppressor 214 is further configured to identify one or more spectral weighting coefficients from the spectral weighting coefficients computed based on contextual information from neighboring spectral weighting coefficients.
- Neighboring spectral coefficients are spectral coefficients in frequency bins k or frames l that are close to a current spectral coefficient, such as within a given range of bins or frames from the bin or frame associated with the current spectral coefficient.
- the noise suppressor 214 is further configured to post-process the spectral weighting coefficients computed by setting first values computed for the one or more spectral weighting coefficients identified to second values, the second values enabling a stronger attenuation of the background noise than the first values.
- post-processing of the modified spectral weighting coefficients 243 employs contextual information from neighboring frequency bins to produce the final spectral weighting coefficients H ( k, l) 245 that are considered as corrected spectral weighting coefficients.
- the final spectral weighting coefficients H ( k, l) 245 include spectral weighting coefficients associated with a particular kth bin and lth frame that may be set to the maximal attenuation when the majority of neighboring bins are set to the maximal attenuation.
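The majority rule of Equation 13 might be sketched as follows: for each coefficient above the fixed floor, the neighbors within ±width bins that attenuate more strongly than the fixed floor (n−) are counted against those that exceed it (n+), and the coefficient is overruled, that is, set to the dynamic floor, when n− forms the majority. The neighborhood width is an assumed parameter.

```python
import numpy as np

def postprocess(h, h_floor, floor_fixed=0.1, width=2):
    """Majority-based correction of weighting coefficients (Equation 13, form assumed)."""
    out = np.copy(h)
    bins = h.shape[0]
    for l in range(h.shape[1]):
        below = h[:, l] <= floor_fixed          # coefficients at/below the fixed floor
        for k in range(bins):
            if h[k, l] > floor_fixed:
                lo, hi = max(0, k - width), min(bins, k + width + 1)
                n_minus = np.count_nonzero(below[lo:hi])   # neighbors below the floor
                n_plus = (hi - lo - 1) - n_minus           # neighbors above, excluding k
                if n_minus > n_plus:
                    out[k, l] = h_floor[k, l]   # overrule: set to the dynamic floor
    return out
```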
- the system 202 includes a second multiplicative element 253 that applies the final spectral weighting coefficients H ( k, l) 245 after noise shaping by the noise shaper 220 and post-processing by the post-processor 222 to the noisy STFT coefficients X ( k, l) 226 and generates the enhanced STFT coefficients Y ( k, l) 247.
- the second multiplicative element 253 may perform an element-wise multiplication of two vectors that represent the final spectral weighting coefficients H ( k, l) 245 and the noisy STFT coefficients X ( k, l) 226 to generate the enhanced STFT coefficients Y ( k, l ) 247.
- the system 202 includes the overlap add module 249 that may be configured to convert ( i.e., transform) the enhanced STFT coefficients Y (k, l) 247 into the time-domain electronic representation of the output audio signal 215'.
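A sketch of the final multiplication and overlap-add synthesis: the enhanced STFT coefficients Y(k, l) = H(k, l) · X(k, l) are transformed back to the time domain frame by frame and overlap-added. A Hann window of length 512 with 50 % overlap would match the simulation settings mentioned above; window normalization is omitted for brevity.

```python
import numpy as np

def apply_and_synthesize(X, H, win, hop):
    """Apply final weights to the noisy STFT and overlap-add (sketch)."""
    Y = H * X                                   # enhanced STFT coefficients
    nfft = (X.shape[0] - 1) * 2
    y = np.zeros(hop * (X.shape[1] - 1) + nfft)
    for l in range(X.shape[1]):
        frame = np.fft.irfft(Y[:, l], n=nfft) * win   # synthesis window
        y[l * hop:l * hop + nfft] += frame            # overlap-add
    return y
```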
- FIG. 5D shows a spectrogram 502d showing further improvement of the babble noise suppression that is achieved by processing the unprocessed noisy input signal of FIG. 5A with a Wiener filter with dynamic noise overestimation as in FIG. 5C , and by applying the noise shaping and post-processing disclosed above.
- the spectrogram 502d includes all of the processing features of a babble noise suppression system according to embodiments disclosed herein.
- During presence of speech, the babble noise suppression acts less aggressively to prevent speech distortions.
- a more stationary background is achieved which is expected to be more comfortable for human listeners.
- a subjective listening test was conducted. In the following, the test results are summarized.
- FIG. 6A is a graph 602 including results of the subjective listening test.
- a noisy speech signal was presented as the reference that was repeated as a hidden reference shown in (a) 606.
- results for variants of the signal that were presented are shown.
- The variants of the signal included: (i) the signal processed with a standard Wiener filter (b) 608, (ii) the signal processed with a Wiener filter with dynamic noise overestimation (c) 610, and (iii) the signal processed with the Wiener filter with dynamic noise overestimation, noise shaping, and post-processing (d) 612.
- The test was repeated for 10 different signals, including artificially mixed signals as well as real recordings.
- The acceptability of the remaining background noise was rated on a scale from −10 (less pleasant) to +10 (more pleasant), as shown by the rating 614, which is a rating relative to the reference (a) 606.
- the noise suppression may be designed to act less aggressively in presence of speech.
- small speech distortions may be unavoidable when the signal is processed.
- an objective measure may be employed.
- the distortions between the clean speech signal and the processed speech component may be determined.
- FIG. 6B is a graph 604 with such an objective measure.
- the graph 604 shows an illustration of distortion-to-speech power ratios 616 for different signal processing variants (b) 618, (c) 620, and (d) 622, that correspond to the signal processing variants (b) 608, (c) 610, and (d) 612 of the graph 602 of FIG. 6A , respectively.
- Distortion-to-speech power ratio results 626, 628, and 630 are shown for each signal variant (b) 618, (c) 620, and (d) 622, respectively.
- a babble noise suppression system includes a soft speech detector that may be employed to distinguish between babble noise and desired speech.
- Simulation results disclosed herein show that a kurtosis measure achieves good detection results and that further improvements may be achieved by smoothing the kurtosis and combining the smoothed kurtosis with at least one other feature.
- noise suppression is controlled in a system based on a combined speech detection result.
- Noise may be suppressed more aggressively when no speech is detected.
- Noise shaping is applied to achieve a more stationary background in the output signal.
- remaining musical tones are reduced by modifying the spectral weighting coefficients.
- the systems 102 and 202 further comprise a pre-processing unit (not shown).
- the pre-processing unit is configured to pre-process the electronic representation of the input audio signal 101' or 201' to pre-emphasize spectral characteristics of the electronic representation of the input audio signal 101' or 201'.
- The soft speech detector 104 or 204 and the noise suppressor 114 or 214, respectively, are further configured to determine the speech detection result and compute the spectral weighting coefficients, respectively, for a given time interval of the pre-processed electronic representation of the input audio signal 101' or 201'.
- the noise suppressor 114 or 214 is further configured to apply the spectral weighting coefficients computed to the pre-processed audio signal in the given time interval.
- FIG. 7 is a flow diagram 700 of the main steps of a method of performing noise suppression of an audio signal.
- the audio signal may include foreground speech components and background noise, such as the foreground speech components 103 and background noise 105 of FIG. 1 , disclosed above.
- the method determines, dynamically, a speech detection result indicating a likelihood of a presence of the foreground speech components in the audio signal (704).
- The method computes, dynamically, spectral weighting coefficients based on the speech detection result determined (706) and applies the spectral weighting coefficients computed to the audio signal to suppress the background noise in a dynamic manner (708).
- the method thereafter ends (710), in the example embodiment.
- FIG. 8 is a flow diagram 800 of another embodiment of a method of performing noise suppression of an audio signal.
- the method begins (802) and pre-processes the input signal (804).
- the method may estimate the spectrum (806) and compute speech detection features (808).
- According to embodiments disclosed herein, speech detection features are employed to control the aggressiveness, that is, the strength of attenuation, of the noise suppression. Protection of desired speech may be achieved by reducing the aggressiveness. More aggressive attenuation of the non-stationary noise components is applied via the overestimation factor of Equation 11 as applied in Equation 9, as disclosed above.
- A speech detection feature based on kurtosis is used to distinguish between desired speech and babble noise. Further, embodiments disclosed herein provide further improvement by smoothing the kurtosis feature and combining it with a cepstral maximum feature, according to Equations 2-8, disclosed above.
- The method computes a dynamic noise overestimation factor (810) and determines spectral weighting coefficients (812), according to Equations 9 and 10, disclosed above. The method then determines a dynamic maximum attenuation and applies post-processing (814).
- Post-processing of spectral weighting coefficients employs contextual information from neighboring frequency bins to correct spectral weighting coefficients, as disclosed above.
- Embodiments disclosed herein set a frequency bin to the maximal attenuation in an event the majority of neighboring bins is set to the maximal attenuation.
- the method applies the spectral weighting coefficients (816) and the method thereafter ends (818) in the embodiment.
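Tying the sketches above together, a hypothetical end-to-end flow mirroring FIG. 8 might look as follows. The crude minimum-tracking noise estimate is only a stand-in for the dedicated noise estimator 236, all parameters are placeholders, and the post-processing step (the postprocess sketch above) is omitted for brevity.

```python
import numpy as np

def suppress_babble(x, fs=16000, nfft=512, hop=256):
    """Hypothetical pipeline reusing the sketches defined above."""
    win = np.hanning(nfft)
    frames = 1 + (len(x) - nfft) // hop
    X = np.stack([np.fft.rfft(win * x[l * hop:l * hop + nfft])
                  for l in range(frames)], axis=1)
    phi_xx = estimate_psd(X)                          # estimate spectrum (806)
    phi_bb = np.minimum.accumulate(phi_xx, axis=1)    # crude noise-estimate stand-in
    kurt = kurtosis_feature(x, R=hop)[:frames]        # detection features (808)
    cep = cepstral_max_feature(phi_xx, fs=fs)
    comb = combine_features(kurt, cep)
    beta = overestimation_factor(comb)                # overestimation factor (810)
    h = wiener_coefficients(phi_xx, phi_bb, beta)     # weighting coefficients (812)
    h = apply_dynamic_floor(h, phi_bb)                # dynamic max. attenuation (814)
    return apply_and_synthesize(X, h, win, hop)       # apply coefficients (816)
```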
- babble noise may be a severe problem in speech enhancement applications.
- This type of noise may include a composition of multiple background speech components and exhibit properties similar to the desired foreground speech.
- Embodiments disclosed herein improve suppression of non-stationary noise components, such as babble noise, whereas other noise reduction approaches primarily suppress the stationary background noise.
- a noise suppression system is introduced for a more aggressive attenuation of babble noise.
- the noise suppression system disclosed herein includes a speech detector for foreground speech which is robust against background speech.
- Embodiments disclosed herein distinguish between desired speech and interfering babble noise and introduce a babble noise suppression system that provides effective suppression of babble noise by employing speech detection information to control the noise suppression.
- Automatic speech recognition can benefit from the improved speech enhancement or from meta-information on the presence of speech, such as the combined speech detection result.
- The babble noise suppression described herein comprises four stages: speech detection, computation of spectral weighting coefficients with dynamic noise overestimation, noise shaping with a dynamic floor, and post-processing of the spectral weighting coefficients.
- FIG. 9 is a block diagram of an example of the internal structure of a computer 900 in which various embodiments of the present disclosure may be implemented.
- the computer 900 contains a system bus 902, where a bus is a set of hardware lines used for data transfer among the components of a computer or processing system.
- The system bus 902 is essentially a shared conduit that connects different elements of a computer system (e.g., processor, disk storage, memory, input/output ports, network ports, etc.) and enables the transfer of information between the elements.
- Coupled to the system bus 902 is an I/O device interface 904 for connecting various input and output devices (e.g., keyboard, mouse, displays, printers, speakers, etc.) to the computer 900.
- a network interface 906 allows the computer 900 to connect to various other devices attached to a network.
- Memory 908 provides volatile storage for computer software instructions 910 and data 912 that may be used to implement embodiments of the present disclosure.
- Disk storage 914 provides nonvolatile storage for computer software instructions 910 and data 912 that may be used to implement embodiments of the present disclosure.
- a central processor unit 918 is also coupled to the system bus 902 and provides for the execution of computer instructions.
- Further example embodiments disclosed herein, such as the soft speech detector 104 or 204 of FIGS. 1 and 2, respectively, as well as the noise suppressor 114 or 214 of FIGS. 1 and 2, respectively, and elements thereof, may be implemented in software or hardware, such as via one or more arrangements of circuitry of FIG. 9, disclosed above, or equivalents thereof, firmware, a combination thereof, or other similar implementation determined in the future.
- the elements of the block and flow diagrams described herein may be combined or divided in any manner in software, hardware, or firmware.
- the software may be written in any language that can support the example embodiments disclosed herein.
- the software may be stored in any form of non-transitory computer readable medium, such as random access memory (RAM), read only memory (ROM), compact disk read-only memory (CD-ROM), and so forth.
- a general purpose or application-specific processor or processing core loads and executes software in a manner well understood in the art.
- block and flow diagrams may include more or fewer elements, be arranged or oriented differently, or be represented differently. It should be understood that implementation may dictate the block, flow, and/or network diagrams and the number of block and flow diagrams illustrating the execution of embodiments disclosed herein.
Description
- This application claims the benefit of U.S. Provisional Application No. 62/291,791, filed on February 5, 2016.
- The present invention is disclosed in the appended set of claims.
- It should be understood that embodiments disclosed herein can be implemented in the form of a method, apparatus, system, or computer readable medium with program codes embodied thereon.
- The foregoing will be apparent from the following more particular description of example embodiments of the invention, as illustrated in the accompanying drawings in which like reference characters refer to the same parts throughout the different views. The drawings are not necessarily to scale, emphasis instead being placed upon illustrating embodiments of the present invention.
- FIG. 1 is a block diagram of an embodiment of a system configured to perform noise suppression of an audio signal.
- FIG. 2 is a block diagram of another embodiment of a system configured to perform noise suppression of an audio signal.
- FIG. 3A is a graph of frequency components over time of a spectrum of an audio signal including a Texas Instruments/Massachusetts Institute of Technology (TIMIT) utterance.
- FIG. 3B is a graph of simulated kurtosis feature values over time for the same TIMIT utterance used for generating the graph of FIG. 3A.
- FIG. 4 is a block diagram illustrating a graph with Receiver Operating Characteristic (ROC) curves.
- FIG. 5A is an illustration of a spectrogram for an unprocessed noisy input signal.
- FIGS. 5B-D are illustrations of spectrograms showing improvements for babble noise suppression according to example embodiments disclosed herein.
- FIG. 6A is a graph including results of a subjective listening test.
- FIG. 6B is a graph with an objective measure.
- FIG. 7 is a flow diagram of an embodiment of a method of performing noise suppression of an audio signal.
- FIG. 8 is a flow diagram of another embodiment of a method of performing noise suppression of an audio signal.
- FIG. 9 is a block diagram of an example internal structure of a computer optionally within an embodiment disclosed herein.
- A speech application that includes a speech processing method may be employed on a device that may be used by a user speaking to the device in the crowded environment, and, thus, the speech application may experience a crowded environment noise condition. For example, the speech application may receive as input an electronic representation of the user's voice that may be superposed with interfering voices. In contrast to other noise conditions, such as an automotive noise condition that may include engine or wiper noise, the crowded environment noise condition may include babble noise that contains portions of interfering speech from the interfering voices. The crowded environment noise condition that includes the babble noise may be referred to herein as a babble noise scenario and poses a challenge for speech processing methods. Assumptions, such as stationarity of the noise or a good Signal to Noise Ratio (SNR), may not be valid for babble noise scenarios. According to embodiments disclosed herein, other distinctive properties may be considered for distinguishing the babble noise from the user's speech.
- Since the background noise contains speech portions, it is a particularly challenging scenario for many speech processing methods (Nitish Krishnamurthy and John H. L. Hansen, "Babble Noise: Modeling, Analysis, and Applications," IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 7, pp. 1394―1407, Sept. 2009), such as speech detection methods or noise reduction methods. Speech detection methods or speech detectors that distinguish between desired foreground speech (also referred to interchangeably herein as "foreground speech components," "user speech," or "desired speech") and the background noise, may be triggered, falsely, by the interfering speech, that is, the babble noise, that may be present in the background noise. Therefore, an elaborated speech detector that maintains reliable results under such an adverse condition, that is, the presence of babble noise, may be useful.
- The background noise may include stationary and non-stationary noise components. Changes in a power spectrum of the audio signal over a time interval may be less for the stationary noise components than for the non-stationary noise components. Standard noise reduction approaches primarily target on reducing stationary background noise components in an audio signal. Non-stationary components of babble noise are, therefore, not sufficiently suppressed. As a result, the non-stationary components may remain as annoying artifacts in the processed audio signal. Embodiments disclosed herein are motivated by a desire to deal with babble noise while not affecting the desired foreground speech, that is, the speech from the user speaking to the device.
- Characteristics of babble noise are disclosed herein, and distinctive features are summarized that enable the desired foreground speech to be distinguished from the babble noise. In particular, according to embodiments disclosed herein, kurtosis of a signal is identified as a good measure to use to detect the presence of the desired foreground speech even in the presence of babble noise. It should be understood that detection of the presence may be in the form of a hard result, such as a Boolean true/false type indicator, or in the form of a soft result, that is, a probability value that indicates the probability of the presence of the desired foreground speech.
- According to embodiments disclosed herein, a babble noise suppression system is introduced that distinguishes between the desired foreground speech and the babble noise. In an example embodiment, an aggressiveness (i.e., an attenuation amount, strength of attenuation, or attenuation level) of the noise suppression may be controlled by a kurtosis-based speech detector. Strong attenuation may be applied during absence of speech whereas the aggressiveness may be reduced, dynamically, in response to speech being detected. In addition, according to embodiments disclosed herein, annoying fluctuations in the background noise may be reduced to achieve a more stationary background in a processed audio signal.
- According to embodiments disclosed herein, strong suppression of the babble noise is desired from the babble noise suppression system. In addition, any remaining noise after processing the audio signal should be perceived as pleasant by human listeners. To evaluate improvements achieved by an example embodiment of a babble noise suppression system disclosed herein, results from a subjective listening test are presented, as disclosed further below in the Results section. Further, an acceptability of the remaining background noise after processing is assessed, as disclosed in the Results section.
-
FIG. 1 is a block diagram 100 of an example embodiment of a system 102 configured to perform noise suppression of an input audio signal 101. The input audio signal 101 may include foreground speech components 103 and background noise 105. The system 102 may comprise an input interface 117, that may be a hardware input interface or any other suitable interface, configured to transform the input audio signal 101 into an electronic representation of the input audio signal 101' for input to a soft speech detector 104. The soft speech detector 104 is configured to determine, dynamically, a speech detection result 112. The speech detection result 112 may indicate a likelihood (i.e., a probability value) of a presence of the foreground speech components 103 in the input audio signal 101. The system 102 comprises a noise suppressor 114 communicatively coupled to the soft speech detector 104 to receive the speech detection result 112 determined and the electronic representation of the input audio signal 101'. The noise suppressor 114 is configured to compute, dynamically, spectral weighting coefficients 109 based on the speech detection result 112 determined and to apply the spectral weighting coefficients 109 computed to the electronic representation of the input audio signal 101' to suppress the background noise 105 in a dynamic manner. - The
system 102 may comprise a memory 111, and the noise suppressor 114 may be configured to store the spectral weighting coefficients 109 computed in the memory 111. The noise suppressor 114 may be configured to retrieve the spectral weighting coefficients 109 computed from the memory 111 to apply the spectral weighting coefficients 109 computed to the electronic representation of the input audio signal 101'. - The
soft speech detector 104 is referred to as a "soft" speech detector because the soft speech detector 104 determines the speech detection result 112 that may represent a likelihood (i.e., probability value) of a presence of the foreground speech components 103, as opposed to a "hard" result that represents a definitive true/false Boolean type of result for indicating whether the foreground speech components 103 are present. As such, a "soft" speech detector may be a speech detector that produces a "soft" speech detection result that represents a likelihood (i.e., probability) of a presence of speech components in an audio signal. - The
foreground speech components 103 may correspond to speech from a user's voice, such as speech from a voice of a user 107a that may be speaking to the system 102. The background noise 105 may include the babble noise 113. The babble noise 113 may include a composition of multiple background speech components from one or more other speakers, such as the user 107b and the user 107c. The system 102 may be referred to interchangeably herein as a babble noise suppression system 102. - The
soft speech detector 104 may be further configured to determine the speech detection result 112 as a function of a combination of feature values determined in the time domain, frequency domain, or a combination thereof. The combination of feature values includes kurtosis and at least one other feature value, as disclosed below with reference to FIG. 2 . It should be understood that in the embodiment of FIG. 2 the other feature value is a cepstral maximum feature value; however, in other implementations not covered by the claimed invention, any other suitable feature value that reflects the presence of speech may be employed for combination with the kurtosis feature value. For example, power, signal-to-noise power ratio, harmonicity, or pitch-based features may be employed, or any other suitable feature that reflects the presence of speech. - According to embodiments disclosed herein, the
system 102 employs the soft speech detector 104 to control, dynamically, an aggressiveness (i.e., an attenuation strength) of noise suppression of the electronic representation of the input audio signal 101' by the noise suppressor 114. As such, the babble noise suppression system 102 produces an electronic representation of an output audio signal 115' with the background noise 105 suppressed. The electronic representation of the output audio signal 115' may be output as the output audio signal 115 by an output interface 119, that may be a hardware interface configured to produce the output audio signal 115 (also referred to interchangeably herein as a processed audio signal 115) in an audible form. As disclosed further below in the Results section, any remaining noise in the processed audio signal 115 may be perceived by human listeners as more pleasant than the background noise 105 of the audio signal 101. Further embodiments of the babble noise suppression system 102 are disclosed below with regard to FIG. 2 . -
FIG. 2 is a block diagram 200 of another embodiment of a system 202 configured to perform noise suppression of an electronic representation of an input audio signal 201'. According to this embodiment, the electronic representation of the input audio signal 201', that is, x(n), where n is the sample index, is an electronic representation of a pre-emphasized input audio signal. In alternative implementations not covered by the claimed invention, x(n) may not be pre-emphasized. The system 202 may also be referred to interchangeably herein as a babble noise suppression system 202. The babble noise suppression system 202 comprises two main parts, a soft speech detector 204 and a noise suppressor 214. The noise suppressor 214 is configured to determine spectral weighting coefficients in order to suppress the babble noise, as disclosed further below in the Noise Suppression section. - The
soft speech detector 204 is configured to determine a speech detection result 212 that indicates a likelihood of presence of desired speech in an input audio signal, such as the likelihood of the presence of the foreground speech components 103 of the input audio signal 101 of FIG. 1 , disclosed above. The soft speech detector 204 may determine the speech detection result 212 as a function of a combination of feature values determined in the time domain, frequency domain, or a combination thereof. As such, the speech detection result 212 may be referred to interchangeably herein as a combined speech detection result 212. - According to this embodiment, the
soft speech detector 204 includes a kurtosis feature module 208, a cepstral maximum feature module 210, and a combiner 206. The combiner 206 is configured to combine a kurtosis feature 232, produced by the kurtosis feature module 208, with a cepstral maximum feature 234, produced by the cepstral maximum feature module 210, to produce the speech detection result 212, as disclosed further below with regard to Equations 2-8. The speech detection result 212 is employed to control the noise suppressor 214. - For example, the
noise suppressor 214 is designed to attenuate the background noise, such as the background noise 105 of FIG. 1 , more aggressively during speech pauses (not shown) of the electronic representation of the input audio signal 201', and such speech pauses may be identified as a function of the speech detection result 212 produced by the soft speech detector 204. According to embodiments disclosed herein, spectral weighting coefficients, such as the Wiener filter spectral weighting coefficients 241, modified spectral weighting coefficients 243, and final spectral weighting coefficients 245, are determined by the noise suppressor 214 to enable the babble noise suppression system 202 to apply stronger attenuation during the speech pauses based on overestimating the noise. - The
noise suppressor 214 includes a noise shaper 220 and a post-processor 222 to achieve a more stationary electronic representation of the output audio signal 215' by applying noise shaping and post-processing, respectively, as disclosed further below in the Noise Suppression section. Embodiments of the soft speech detectors 104 and 204 of FIGS. 1 and 2 , respectively, are disclosed below in the Speech Detection section. Embodiments of the noise suppressors 114 and 214 of FIGS. 1 and 2 , respectively, are disclosed further below in the Noise Suppression section. - Detecting presence of desired speech in a noisy signal has been subject to research for several decades (Simon Graf, Tobias Herbig, Markus Buck, and Gerhard Schmidt, "Features for voice activity detection: a comparative analysis," EURASIP Journal on Advances in Signal Processing, vol. 2015, no. 91, Nov. 2015). The overlapping characteristics of the desired speech and the babble noise complicate detection of the desired speech. Embodiments disclosed herein include features that enable a robust distinction between the babble noise and the desired speech.
- Clean speech (i.e., speech in the absence of noise) sample values exhibit a sparse characteristic. Values close to zero dominate, which implies a peak of the probability density function (PDF) around zero. No or only little reverberation can be expected. To find distinctive properties of babble noise, two effects appear relevant (Nitish Krishnamurthy and John H. L. Hansen, "Babble Noise: Modeling, Analysis, and Applications," IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 7, pp. 1394-1407, Sept. 2009):
- independent speech components from multiple distant talkers superpose, and
- reverberation smears, temporally, the speech component of each distant talker.
- Both effects result in a mixture of multiple samples of speech signals. The distribution of this mixture differs from the distribution of clean speech. Since multiple independent values are summed, the resulting distribution approaches a Gaussian probability distribution function (PDF), in accordance with the central limit theorem. Embodiments disclosed herein exploit this property by evaluating the kurtosis.
- The normalized kurtosis of a random variable χ is given by: kurt(χ) = E{χ⁴}/E²{χ²} − 3 (Equation 1), where E{·} denotes the expectation. The normalization by E²{χ²} makes the measure independent of the signal power, and subtracting 3 yields a value of zero for Gaussian distributed signals. Sparse signals, such as clean speech, exhibit high positive kurtosis values, whereas the Gaussian-like mixture of babble noise yields values close to zero.
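- As a brief numeric illustration of this property, the following sketch estimates the normalized kurtosis of synthetic samples; the Laplacian distribution is assumed here as a stand-in for sparse clean speech and the Gaussian distribution as a stand-in for a babble-like mixture:

```python
import numpy as np

def normalized_kurtosis(x):
    """Normalized (excess) kurtosis: E{x^4} / E^2{x^2} - 3."""
    m2 = np.mean(x ** 2)
    m4 = np.mean(x ** 4)
    return m4 / m2 ** 2 - 3.0

rng = np.random.default_rng(0)
speech_like = rng.laplace(scale=1.0, size=100_000)  # sparse, speech-like samples
babble_like = rng.normal(scale=1.0, size=100_000)   # Gaussian, babble-like mixture

print(f"speech-like kurtosis: {normalized_kurtosis(speech_like):+.2f}")  # about +3
print(f"babble-like kurtosis: {normalized_kurtosis(babble_like):+.2f}")  # about  0
```

The sparse distribution yields a clearly positive value, whereas the Gaussian mixture stays near zero, which is the separation that the speech detection disclosed herein exploits.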
- This beneficial property of the kurtosis has been employed in different applications: The human speech recognition score in babble noise was predicted in (Guoping Li and Mark E. Lutman, "Sparseness and speech perception in noise," in Proc. of Statistical and Perceptual Audition (SAPA), Pittsburgh PA, USA, 2006). A high correlation between the value of kurtosis and the score was observed when increasing the number of talkers. Dereverberation of speech signals was performed in (Bradford W. Gillespie, Henrique S. Malvar, and Dinei AF Florêncio, "Speech dereverberation via maximum-kurtosis subband adaptive filtering," in Proc. of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Salt Lake City, UT, USA, 2001) by maximizing the kurtosis value. In (Kohei Hayashida, Makoto Nakayama, Takanobu Nishiura, Yukihiko Yamashita, T. K. Horiuchi, and Toshihiko Kato, "Close/distant talker discrimination based on kurtosis of linear prediction residual signals," in Proc. of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Florence, Italy, 2014), the kurtosis was employed to distinguish between close and distant talkers. Kurtosis-based speech detection is discussed in multiple publications, such as (Elias Nemer, Rafik Goubran, and Samy Mahmoud, "Robust voice activity detection using higher-order statistics in the LPC residual domain," IEEE Transactions on Speech and Audio Processing, vol. 9, no. 3, pp. 217―231, 2001), (David Cournapeau, Tatsuya Kawahara, Kenji Mase, and Tomoji Toriyama, "Voice activity detector based on enhanced cumulant of LPC residual and on-line EM algorithm," in Proc. of INTERSPEECH, Pittsburgh, PA, USA, 2006), and (David Cournapeau and Tatsuya Kawahara, "Evaluation of real-time voice activity detection based on high order statistics," in Proc. of INTERSPEECH, Antwerp, Belgium, 2007).
- Typically, the kurtosis is applied to the residual signal after linear predictive coding (LPC) analysis. In contrast, turning back to
FIG. 2 , embodiments disclosed herein, such as an embodiment of the kurtosis feature module 208, apply the kurtosis feature directly to the electronic representation of the input audio signal 201' x(n), that is, a pre-emphasized input audio signal, to detect presence of desired speech. - According to embodiments disclosed herein, a time-dependent estimate of the kurtosis feature is determined, such as by the
kurtosis feature module 208, by: kurt(n) = m 4(n)/m 2(n)² − 3 (Equation 2), where the second-order moment m 2(n) is estimated by the kurtosis feature module 208 using recursive smoothing with a smoothing constant α k: m 2(n) = α k · m 2(n−1) + (1 − α k) · x²(n) (Equation 3). - Analogously, m 4(n) is computed by smoothing x⁴(n). To capture speech onsets, the frame-wise kurtosis estimate kurt(ℓ) may further be smoothed, temporally, over a window that extends L p frames backwards and L f frames ahead of the current frame ℓ (Equation 4).
-
FIG. 3A is a graph 350 of an illustration of frequency components over time of a speech spectrum of an audio signal including a Texas Instruments/Massachusetts Institute of Technology (TIMIT) utterance. The graph 350 includes frequency components ƒ[kHz] 352 for the audio signal over time t[s] 354. In the example embodiment, the TIMIT utterance is: "She had your dark suit in greasy wash water all year." The TIMIT utterance is indicative of the desired speech, such as the foreground speech components 103 of FIG. 1 , disclosed above. In the example embodiment of FIG. 3A , a sampling rate of 16 kHz and a downsampling factor of R = 256 were utilized to produce the frequency components ƒ[kHz] 352 of FIG. 3A . -
FIG. 3B is a graph 360 of simulated feature values 356 over time t[s] 358 for the same TIMIT utterance used for generating the graph 350 of FIG. 3A , with smoothing constants of α k = 0.9986 ≙ −100 dB/s and L p = L f = 10 frames ≙ 160 ms. The simulated feature values 356 include simulated kurtosis feature values 362 that are based on the kurtosis feature alone and resulted in most of the desired speech 361 of FIG. 3A being detected, as shown in FIG. 3B . However, in a few cases, low frequency vowels, such as "/u/," were missed. As such, embodiments disclosed herein may employ an additional feature that may reflect, explicitly, the voiced excitation of human speech, such as the cepstral maximum feature 234 produced by the cepstral maximum feature module 210 of FIG. 2 . The simulated feature values 356 include simulated cepstral maximum feature values 364, as well as simulated combination feature values 366 that are based on a combination of the kurtosis and cepstral maximum features. As shown in FIG. 3B , the simulated combination feature values 366 yielded detection of the largest amount of the desired speech 361 of FIG. 3A . As such, according to embodiments disclosed herein, a speech detection feature based on kurtosis is capable of distinguishing between desired speech and babble noise, and further improvement may be achieved by smoothing the kurtosis and/or combining the kurtosis or smoothed kurtosis with a complementary feature, the cepstral maximum feature, disclosed in more detail below. - As disclosed above, voiced speech components by multiple speakers overlap in the babble noise. Compared to clean speech, less distinct harmonic structures are, therefore, observable in a mixture of clean speech and babble noise. To complement the kurtosis, embodiments disclosed herein exploit a cepstrum that reflects voiced components. According to embodiments disclosed herein, cepstral coefficients cepst(τ,ℓ) are determined, where τ denotes the cepstral coefficient index and ℓ denotes the ℓ-th frame. The cepstral coefficients cepst(τ,ℓ) are determined by the cepstral
maximum feature module 210 of FIG. 2 , for example, by applying an Inverse Discrete Fourier Transform (IDFT) to log(Φ xx (k,ℓ)/Φ̂ bb (k,ℓ)), where k addresses the frequency bin of the ℓ-th frame of the noisy speech spectrum Φ xx (k, ℓ), and Φ̂ bb (k, ℓ) is the noise spectrum estimated from the noisy speech spectrum Φ xx (k, ℓ). The noisy speech spectrum Φ xx (k, ℓ) may be referred to interchangeably herein as the power spectral density. According to embodiments disclosed herein, the noisy speech spectrum Φ xx (k, ℓ) is estimated by smoothing, temporally, the magnitude squared Short-Term Fourier Transform (STFT) coefficients |X(k, ℓ)|².
Normalization with the estimated noise spectrum Φ̂ bb (k, ℓ) may emphasize the harmonic structure. - For example, the
system 202 of FIG. 2 comprises an STFT module 224 and a spectrum estimator 228. The STFT module 224 is configured to transform blocks of the electronic representation of the input audio signal 201' x(n) in a time domain into a representation in a frequency domain, that is, the STFT coefficients X(k, ℓ) 226. - The STFT coefficients X(k, ℓ) 226 are input to the
spectrum estimator 228 that is configured to estimate the power spectral density Φ xx (k, ℓ) by smoothing, temporally, the magnitude squared STFT coefficients 226. As such, the spectrum estimator 228 is configured to estimate a power spectrum of the electronic representation of the input audio signal 201' x(n) based on a transformation of the electronic representation of the input audio signal 201' from the time domain to the frequency domain. The STFT coefficients X(k, ℓ) 226 determined by the STFT module 224 are further input to a noise shaper 220 and a multiplicative element 253, as disclosed further below. - The power spectral density Φ xx (k, ℓ) is input as the power spectral density Φ xx (k, ℓ) 230a to the cepstral
maximum feature module 210 for determining the cepstral coefficients cepst(τ,ℓ) (not shown), disclosed above. The power spectral density Φ xx (k, ℓ) is further input as the power spectral density Φ xx (k, ℓ) 230b to a noise estimator 236 that is configured to estimate the power spectral density of the noise, that is, the estimated noise spectrum Φ̂ bb (k, ℓ) 238, based on the estimated power spectral density Φ xx (k, ℓ) 230b of the noisy input speech signal as estimated by the spectrum estimator 228. The power spectral density Φ xx (k, ℓ) 230b is further input to a Wiener filter 221, as disclosed further below in the Noise Suppression section. It should be understood that the power spectral density Φ xx (k, ℓ) 230a and the power spectral density Φ xx (k, ℓ) 230b are the same power spectral density Φ xx (k, ℓ) determined by the spectrum estimator 228. - In
FIG. 2 , the estimated noise spectrum Φ̂ bb (k, ℓ) 238 is not shown as an input to the cepstral maximum feature module 210 for simplicity; however, the estimated noise spectrum Φ̂ bb (k, ℓ) 238 may be used by the cepstral maximum feature module 210 for determining the cepstral coefficients cepst(τ,ℓ). According to embodiments disclosed herein, the cepstral maximum feature module 210 is configured to accumulate neighboring bins of the cepstrum (Equation 5). A maximum of the accumulated cepstrum is searched over values of τ that correspond to pitch frequencies between 60 Hz and 300 Hz and is normalized by the cepstral maximum feature module 210 by subtracting an offset parameter voicing offset, according to Equation 6. Finally, to produce the cepstral maximum feature 234, the cepstral maximum feature module 210 is configured to smooth the maximum value, temporally, by: voicing(ℓ) = α v · voicing(ℓ−1) + (1 − α v) · (max τ cepst acc(τ,ℓ) − voicing offset) (Equation 7). - An illustration of the cepstral
maximum feature 234 of FIG. 2 is shown as the simulated cepstral maximum feature values 364 of FIG. 3B , disclosed above. In the illustration of FIG. 3B , the STFT module 224 was employed with a Hann window of length 512 samples to determine the STFT coefficients X(k, ℓ) 226 of FIG. 2 . The estimated noise spectrum Φ̂ bb (k, ℓ) 238 was computed by smoothing Φ xx (k, ℓ) 230a = Φ xx (k, ℓ) 230b when the soft speech detector 204 produced a speech detection result 212 indicating a high likelihood of absence of speech, that is, a high probability that speech is not present. In the presence of speech, that is, when the soft speech detector 204 produced the speech detection result 212 indicating a high probability of the presence of speech, the noise estimate Φ̂ bb (k, ℓ) 238 was not updated. The offset parameter, voicing offset, disclosed above, was set to 1/6 to achieve positive values only for distinct harmonic structures. Smoothing was performed with α v = 0.9 ≙ −30 dB/s. - In the embodiment of
FIG. 2 , the kurtosis feature 232 and the cepstral maximum feature 234 are combined by a combiner 206 to produce the speech detection result 212. According to embodiments disclosed herein, the combiner 206 is configured to combine the kurtosis feature 232 and the cepstral maximum feature 234 using a weighted sum: comb(ℓ) = w k · kurt(ℓ) + w v · voicing(ℓ) (Equation 8), with weights w k = 1 and w v = 96. Only positive values of the kurtosis feature 232 and cepstral maximum feature 234 may be considered to prevent negative contributions.
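- The feature path of Equations 5-8 can be sketched as follows; the smoothing constant for Φ xx (k, ℓ), the three-bin accumulation standing in for Equation 5, and the clamping of negative feature values are assumptions, while the 60 Hz to 300 Hz pitch range, α v = 0.9, the offset of 1/6, and the weights w k = 1 and w v = 96 follow the values given herein:

```python
import numpy as np

def smooth_psd(X, alpha=0.9):
    """Phi_xx(k,l): temporal smoothing of |X(k,l)|^2 (alpha assumed)."""
    phi = np.empty(X.shape)
    phi[0] = np.abs(X[0]) ** 2
    for l in range(1, X.shape[0]):
        phi[l] = alpha * phi[l - 1] + (1 - alpha) * np.abs(X[l]) ** 2
    return phi

def cepstral_max_feature(phi_xx, phi_bb, fs=16000, n_fft=512,
                         voicing_offset=1.0 / 6.0, alpha_v=0.9):
    """Voicing feature of Equations 5-7: cepstrum of the noise-normalized
    log spectrum, maximum over quefrencies for pitch between 60 Hz and
    300 Hz, offset subtraction, and recursive smoothing.

    phi_xx, phi_bb have shape (frames, bins).
    """
    eps = 1e-12
    log_spec = np.log(np.maximum(phi_xx, eps) / np.maximum(phi_bb, eps))
    cepst = np.fft.irfft(log_spec, n=n_fft, axis=1)     # IDFT per frame
    # Equation 5: accumulate neighboring cepstral bins (3-bin sum assumed)
    acc = cepst + np.roll(cepst, 1, axis=1) + np.roll(cepst, -1, axis=1)
    tau_lo, tau_hi = fs // 300, fs // 60        # quefrency range for pitch
    max_val = acc[:, tau_lo:tau_hi + 1].max(axis=1) - voicing_offset
    voicing = np.zeros_like(max_val)
    for l in range(1, len(max_val)):            # Equation 7
        voicing[l] = alpha_v * voicing[l - 1] + (1 - alpha_v) * max_val[l]
    return voicing

def combine_features(kurt_feature, voicing, w_k=1.0, w_v=96.0):
    """Equation 8: weighted sum; negative contributions are clamped (assumed)."""
    return w_k * np.maximum(kurt_feature, 0.0) + w_v * np.maximum(voicing, 0.0)
```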
Turning again to FIG. 3B , the combination of both features is shown as the simulated combination feature values 366 that may represent the speech detection result 212 of FIG. 2 . - Analyses with artificially mixed data from the TIMIT (John S. Garofolo, Lori F. Lamel, William M. Fisher, Jonathan G. Fiscus, David S. Pallet, and Nancy L. Dahlgren, "DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus CD-ROM," 1993) and NOISEX-92 (Andrew Varga and Herman J. M. Steeneken, "Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems," Speech Communication, vol. 12, no. 3, pp. 247-251, 1993) databases were performed. Signal-to-noise power ratios (SNRs) in a range between 0 and 10 dB were chosen. A good detection performance for the kurtosis feature is observable from the kurtosis feature Receiver Operating Characteristic (ROC) curve 404 shown in
FIG. 4 , disclosed below. For example, high detection rates P d 406 can be achieved with relatively low false-alarm rates P fa 408, as shown.
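- The detection rates P d and false-alarm rates P fa behind ROC curves such as those of FIG. 4 , disclosed below, may be computed by sweeping a threshold over a frame-wise detection feature; a minimal sketch, assuming binary per-frame ground-truth speech labels, follows:

```python
import numpy as np

def roc_points(feature, labels, n_thresholds=100):
    """P_d and P_fa for thresholds swept over a frame-wise feature.

    feature: detection feature per frame; labels: 1 where desired
    speech is present, 0 otherwise (assumed ground truth).
    """
    thresholds = np.linspace(feature.min(), feature.max(), n_thresholds)
    p_d, p_fa = [], []
    for t in thresholds:
        detected = feature > t
        p_d.append(np.mean(detected[labels == 1]))   # detection rate
        p_fa.append(np.mean(detected[labels == 0]))  # false-alarm rate
    return np.array(p_fa), np.array(p_d)
```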
- FIG. 4 is a block diagram 400 of a graph 402 with Receiver Operating Characteristic (ROC) curves. The cepstral maximum feature ROC curve 410, that may be computed according to Equation 7, disclosed above, shows a lower performance than the kurtosis feature ROC curve 404, that may be computed according to Equation 2, disclosed above, since the cepstrum only detects voiced speech portions. Furthermore, harmonic components in the babble noise sometimes falsely trigger the cepstral maximum feature. - According to embodiments disclosed herein, performance improvements can be achieved by combining both features, as shown by the combined feature ROC curve 416, and performance can be further improved by combining both features and applying a smoothing to the kurtosis, as shown by the combination feature with kurtosis smoothing ROC curve 420. For example, smoothing the kurtosis, according to Equation 4, disclosed above, yields the smoothed kurtosis feature ROC curve 418. Combining both features and applying a smoothing to the kurtosis according to Equation 8, disclosed above, yields the combination feature with kurtosis smoothing ROC curve 420 that results in the highest detection rates P d 406 with the lowest false-alarm rates P fa 408. For comparison, the operating point 414 of the established speech detector ETSI-AFE (ETSI, "ETSI standard 202 050 v1.1.5: Advanced front-end feature extraction algorithm," 2007) is plotted. With the database, this detector is almost never triggered, resulting in both P d ≈ P fa ≈ 0, as shown by the data point 414. - Turning back to
FIG. 2 , according to embodiments disclosed herein, the noise suppressor 214 of FIG. 2 is configured to perform noise suppression, and the noise suppression provides a strong attenuation that is applied during speech pauses by overestimating the noise, dynamically. The noise suppressor 214 includes an overestimator 250 that is configured to produce an overestimation factor β oe(ℓ) 216 to control the aggressiveness of noise suppression. As such, the noise suppressor 214 is configured to compute, dynamically, the dynamic noise overestimation factor β oe(ℓ) 216 based on the speech detection result 212 determined. - For example, the
noise suppressor 214 may be further configured to determine periods of speech pauses and periods of speech activity in the electronic representation of the input audio signal 201' as a function of the speech detection result 212 determined. The overestimation factor β oe(ℓ) 216 is computed by the overestimator 250 based on the combined speech detection result 212 according to Equation 11, disclosed further below. - As disclosed above, the
noise suppressor 214 includes the Wiener filter 221. The Wiener filter 221 is configured to produce spectral weighting coefficients H wf(k, ℓ) 241 according to: H wf(k, ℓ) = 1 − (β oe(ℓ) · Φ̂ bb (k, ℓ)) / Φ xx (k, ℓ) (Equation 9), where the power spectral density Φ xx (k, ℓ) estimated by the spectrum estimator 228 is represented in the denominator. The noise suppressor 214 includes a first multiplicative element 251 configured to apply the overestimation factor β oe(ℓ) 216, that is a scalar, to the estimated power spectral density (PSD) of the noise, that is, the estimated noise spectrum Φ̂ bb (k, ℓ) 238, to produce the numerator of Equation 9, disclosed above, that is, the overestimated power spectral density of the noise 239 of FIG. 2 . As such, the overestimation factor β oe(ℓ) 216 is computed by the overestimator 250 based on the combined speech detection result 212 and applied to the estimated noise power spectral density Φ̂ bb (k, ℓ) 238 by employing the first multiplicative element 251. In Equation 9, disclosed above, all variables are scalars; the overestimation factor β oe(ℓ), however, is not frequency-dependent, as it depends only on ℓ. In contrast, the power spectral density (PSD) Φ xx (k, ℓ) depends on both frequency (k) and frame (ℓ). As such, for a frame ℓ, the same overestimation factor β oe(ℓ) is applied to all frequency bins (k) of the PSD Φ xx (k, ℓ). - The spectral weighting coefficients H wf(k, ℓ) 241 are computed according to the Wiener filter (Equation 9) based on the estimated power spectral density of the noisy speech signal Φ xx (k, ℓ) 230b and the overestimated power spectral density of the
noise 239. According to embodiments disclosed herein, increasing the value of the dynamic noise overestimation factor β oe(ℓ) 216 enables the spectral weighting coefficients computed, that is, H wf(k, ℓ) 241, to increase suppression of the background noise relative to the amount of suppression of the background noise for the periods of speech activity determined. Without overestimation (β oe(ℓ) = 1), the Wiener filter 221 corresponds to a classical Wiener characteristic. According to embodiments disclosed herein, the spectral weighting coefficients H wf(k, ℓ) 241 are input to the noise shaper 220, disclosed above. The noise shaper 220 is configured to apply a dynamic floor to the spectral weighting coefficients H wf(k, ℓ) 241 that reduces the non-stationary noise in the noisy speech signal. - A dynamic floor H floor(k, ℓ) (Equation 10) shapes the residual noise (Vasudev Kandade Rajan, Christin Baasch, Mohamed Krini, and Gerhard Schmidt, "Improvement in Listener Comfort Through Noise Shaping Using a Modified Wiener Filter Approach," in Proc. of 11. ITG Symposium on Speech Communication, Erlangen, Germany, 2014).
- According to embodiments disclosed herein, the overestimation factor β oe(ℓ) 216 is computed based on the
speech detection result 212, that is, the combined speech detection feature comb(ℓ), according to Equation 11, and applied in Equation 9, disclosed above. During speech pauses indicated by the combined speech detection result 212 (e.g., comb(ℓ) ≈ 0), high noise overestimation is applied with a maximal factor β max = 21. High values of the feature, indicating the presence of speech, result in a reduced overestimation factor β oe(ℓ) ≈ 1. This kurtosis-based control prevents speech distortions caused by too aggressive attenuation during the presence of speech. - As such, according to Equations 9 and 11, disclosed above, speech detection features are employed to control the aggressiveness of the noise suppression and to protect the desired speech by reducing the aggressiveness, dynamically, while more aggressive attenuation is applied to the non-stationary noise components. Further, according to Equations 10 and 12, disclosed above, combination with reduction of non-stationary components by selectively lowering the maximal attenuation H floor(k, ℓ) yields a more stationary output in addition to the more aggressive attenuation.
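- The control path of Equations 9 and 11 can be sketched as follows; the linear decay from β max toward 1 and the adjustment factor ε are assumptions, since only the endpoint behavior of β oe(ℓ) is reproduced above:

```python
import numpy as np

def overestimation_factor(comb, beta_max=21.0, epsilon=1.0):
    """Equation 11 (assumed linear form): beta_max for comb ~ 0,
    decaying toward 1 as the combined feature indicates speech."""
    return 1.0 + (beta_max - 1.0) * np.maximum(0.0, 1.0 - comb / epsilon)

def wiener_coefficients(phi_xx, phi_bb, beta_oe):
    """Equation 9: H_wf(k,l) = 1 - beta_oe(l) * Phi_bb(k,l) / Phi_xx(k,l).

    beta_oe has one value per frame and is broadcast over all bins;
    negative results are handled later by the noise floor of the
    noise-shaping stage.
    """
    eps = 1e-12
    return 1.0 - beta_oe[:, None] * phi_bb / np.maximum(phi_xx, eps)
```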
FIG. 5A is an example showing a spectrogram 502a for an unprocessed noisy input signal. FIG. 5B, FIG. 5C, and FIG. 5D show spectrograms 502b, 502c, and 502d, respectively, illustrating improvements for babble noise suppression of the unprocessed noisy input signal according to embodiments disclosed herein. -
FIG. 5B shows a spectrogram 502b showing improvement over the unprocessed noisy input signal of FIG. 5A , obtained by processing the unprocessed noisy input signal of FIG. 5A with a Wiener filter without overestimation and with a fixed floor. The spectrogram 502b may show that stationary noise is reduced; however, the spectrogram 502b shows strong non-stationary artifacts, that is, non-stationary components remain. These artifacts are reduced by the dynamic overestimation, as shown in the spectrogram 502c of FIG. 5C , disclosed below. -
FIG. 5C shows a spectrogram 502c showing further improvement of the babble noise suppression that is achieved by processing the unprocessed noisy input signal of FIG. 5A with a Wiener filter as in FIG. 5B and with dynamic noise overestimation, as disclosed above. FIG. 5D shows further improvement to the quality of the unprocessed noisy input signal of FIG. 5A , as disclosed below.
- Remaining non-stationary components are further reduced by selectively lowering the noise floor H floor(k, ℓ) for frequency bins in which the estimated noise spectrum Φ̂ bb (k, ℓ) exceeds its average Φ̂ bb, average (k, ℓ), that is, for bins that contain non-stationary interferences (Equation 12).
- According to embodiments disclosed herein, the noise suppressor 214 is further configured to compute, dynamically, a dynamic noise floor, and selectively lower the dynamic noise floor for frequencies corresponding to the non-stationary noise components. The noise shaper 220 is configured to apply the dynamic noise floor to the spectral weighting coefficients H wf(k, ℓ) 241, which may reduce the non-stationary noise. The spectral weighting coefficients are computed further based on the dynamic noise floor computed and selectively lowered to produce the modified spectral weighting coefficients after noise shaping 243 that are input to the post-processor 222.
- Alternatively, according to a modification not covered by the claimed invention, residual non-stationary noise components may be identified relying on contextual information from neighboring coefficients. Spectral weighting coefficients corresponding to the residual non-stationary noise components may be set to a lower value, such as a fixed noise floor or a dynamic noise floor.
- The post-processor 222 is configured to modify the spectral weighting coefficients after noise shaping 243 according to Equation 13, disclosed below. According to embodiments disclosed herein, sporadically occurring musical tones may be finally removed by the post-processor 222. According to Equation 13, the post-processor 222 modifies the spectral weighting coefficients after noise shaping 243 to produce the final spectral weighting coefficients H(k, ℓ) 245. Spectral weighting coefficients that exceed the fixed noise floor are set to the dynamic floor in an event that a majority of their neighboring coefficients in the modified spectral weighting coefficients 243 attenuate stronger than the fixed floor: H(k, ℓ) = H floor(k, ℓ) if n−(k,ℓ) > n+(k,ℓ), and H(k, ℓ) = H NS(k, ℓ) otherwise (Equation 13). For this, a first number n−(k,ℓ) of coefficients next to k that attenuate stronger than the fixed floor is compared to a second number of coefficients n+(k,ℓ) that exceed the fixed floor. As such, the noise suppressor 214 is further configured to identify one or more spectral weighting coefficients from the spectral weighting coefficients computed based on contextual information from neighboring spectral weighting coefficients. Neighboring spectral coefficients are spectral coefficients in frequency bins k or frames ℓ that are close to a current spectral coefficient, such as in a given range of bins or frames from an associated bin or frame of the current spectral coefficient. The noise suppressor 214 is further configured to post-process the spectral weighting coefficients computed by setting first values computed for the one or more spectral weighting coefficients identified to second values, the second values enabling a stronger attenuation of the background noise than the first values.
- As such, post-processing of the modified spectral weighting coefficients 243 employs contextual information from neighboring frequency bins to produce the final spectral weighting coefficients H(k, ℓ) 245 that are considered corrected spectral weighting coefficients. The final spectral weighting coefficients H(k, ℓ) 245 include spectral weighting coefficients associated with a particular kth bin and ℓth frame that may be set to the maximal attenuation when the majority of neighboring bins are set to the maximal attenuation.
- The system 202 includes a second multiplicative element 253 that applies the final spectral weighting coefficients H(k, ℓ) 245, after noise shaping by the noise shaper 220 and post-processing by the post-processor 222, to the noisy STFT coefficients X(k, ℓ) 226 and generates the enhanced STFT coefficients Y(k, ℓ) 247. The second multiplicative element 253 may perform an element-wise multiplication of two vectors that represent the final spectral weighting coefficients H(k, ℓ) 245 and the noisy STFT coefficients X(k, ℓ) 226 to generate the enhanced STFT coefficients Y(k, ℓ) 247. The system 202 includes the overlap add module 249 that may be configured to convert (i.e., transform) the enhanced STFT coefficients Y(k, ℓ) 247 into the time-domain electronic representation of the output audio signal 215'.
FIG. 5D shows a spectrogram 502d showing further improvement of the babble noise suppression that is achieved by processing the unprocessed noisy input signal of FIG. 5A with a Wiener filter with dynamic noise overestimation as in FIG. 5C , and by applying the noise shaping and post-processing disclosed above. - The spectrogram 502d includes all of the processing features of a babble noise suppression system according to embodiments disclosed herein. According to embodiments disclosed herein, in the presence of the desired speech (i.e., 1.5 s to 4 s in the example embodiment), the babble noise suppression acts less aggressively to prevent speech distortions. A more stationary background is achieved, which is expected to be more comfortable for human listeners. In order to evaluate the improvements that are achieved by the babble noise suppression system according to embodiments disclosed herein, a subjective listening test was conducted. In the following, the test results are summarized.
- To evaluate the methods disclosed herein, a subjective listening test and an objective measure were employed.
- To evaluate the quality of the remaining noise as perceived by human listeners, a subjective listening test was conducted. In a multi-stimulus test, similar to MUSHRA (ITU, "Recommendation ITU-R BS.1534-3: Method for the subjective assessment of intermediate quality level of audio systems," 2015), 21 test subjects rated an acceptability of background noise for four processing variants of a signal. The subjects were asked to decide whether they perceived the noise in the variants as more or less pleasant than the noise in a reference.
-
FIG. 6A is a graph 602 including results of the subjective listening test. In the subjective listening test, a noisy speech signal was presented as the reference and was repeated as a hidden reference, shown in (a) 606. In FIG. 6A , results for variants of the signal that were presented are shown. The variants of the signal included: (i) the signal processed with a standard Wiener filter (b) 608, (ii) the signal processed with a Wiener filter with dynamic noise overestimation (c) 610, and (iii) the signal processed with the Wiener filter with dynamic noise overestimation, noise shaping, and post-processing (d) 613. The test was repeated for 10 different signals, including artificially mixed signals, as well as real recordings. The acceptability of the remaining background noise was rated on a scale from −10 (less pleasant) to 10 (more pleasant), as shown by the rating 614 that is a rating relative to the reference (a) 606.
boxplots - According to embodiments disclosed herein, the noise suppression may be designed to act less aggressively in presence of speech. However, small speech distortions may be unavoidable when the signal is processed. In order to evaluate the speech distortions that are introduced by the system, an objective measure may be employed. For artificially mixed signals, the distortions between the clean speech signal and the processed speech component may be determined.
-
FIG. 6B is a graph 604 with such an objective measure. The graph 604 shows an illustration of distortion-to-speech power ratios 616 for different signal processing variants (b) 618, (c) 620, and (d) 622, that correspond to the signal processing variants (b) 608, (c) 610, and (d) 612 of the graph 602 of FIG. 6A , respectively. Distortion-to-speech power ratio results 626, 628, and 630 are shown for each signal variant (b) 618, (c) 620, and (d) 622, respectively. A worst case reference (e) 624 is also shown that depicts the distortion-to-speech power ratio results 632 for a fixed noise overestimation β oe(ℓ) = β max. - As shown in the graph 604 of FIG. 6B , the Wiener filter without noise overestimation (b) 618 introduces small distortions in the range of −23 dB. Only 2 dB of additional speech distortions are introduced by applying the dynamic noise overestimation with noise shaping and post-processing according to embodiments disclosed herein. These processing features, however, significantly improved the acceptability of the background noise, as confirmed by the subjective listening tests, disclosed above. For comparison, a fixed overestimation by β max = 21 introduces high speech distortions in the range of −5 dB.
- According to embodiments disclosed herein, noise suppression is controlled in a system based on a combined speech detection result. Noise may be suppressed more aggressively when no speech is detected. Noise shaping is applied to achieve a more stationary background in the output signal. In addition, remaining musical tones are reduced by modifying the spectral weighting coefficients.
- Subjective listening tests confirmed that a system, according to embodiments disclosed herein, reduces the babble noise, effectively. The background noise in the processed signals was mostly perceived as more pleasant compared to an unprocessed reference. Further, the system introduces only little speech distortions as verified by an objective measure, as disclosed above.
- Turning back to
FIGS. 1 and2 , thesystems soft speech detector noise suppressor noise suppressor -
FIG. 7 is a flow diagram 700 of the main steps of a method of performing noise suppression of an audio signal. The audio signal may include foreground speech components and background noise, such as the foreground speech components 103 and background noise 105 of FIG. 1 , disclosed above. The method determines, dynamically, a speech detection result indicating a likelihood of a presence of the foreground speech components in the audio signal (704). The method computes, dynamically, spectral weighting coefficients based on the speech detection result determined (706) and applies the spectral weighting coefficients computed to the audio signal to suppress the background noise in a dynamic manner (708). The method thereafter ends (710), in the example embodiment. -
FIG. 8 is a flow diagram 800 of another embodiment of a method of performing noise suppression of an audio signal. The method begins (802) and pre-processes the input signal (804). The method may estimate the spectrum (806) and compute speech detection features (808). According to embodiments disclosed herein, speech detection features are employed to control aggressiveness, that is, a strength of attenuation, of noise suppression. Protection of desired speech may be achieved by reducing the aggressiveness. More aggressive attenuation of the non-stationary noise components is applied via the overestimation factor of Equation 11 as applied in Equation 9, as disclosed above. - According to embodiments disclosed herein, a speech detection feature based on kurtosis is used to distinguish between desired speech and babble noise. Further, embodiments disclosed herein provide further improvement by smoothing and combining the kurtosis feature with a cepstral maximum feature according to Equations 2-8 above. The method computes, according to Equations 9 and 11 above, a dynamic noise overestimation factor (810) and determines spectral weighting coefficients (812). The method determines dynamic maximum attenuation and applies post-processing (814). - Combination with reduction of non-stationary components by selectively lowering the maximal attenuation achieves a more stationary output in addition to the more aggressive attenuation, as disclosed above, with reference to
Equations 10 and 12. Post-processing of spectral weighting coefficients employs contextual information from neighboring frequency bins to correct spectral weighting coefficients, as disclosed above. Embodiments disclosed herein set a frequency bin to the maximal attenuation in an event that the majority of neighboring bins is set to the maximal attenuation. The method applies the spectral weighting coefficients (816), and the method thereafter ends (818) in the embodiment. - As disclosed above, babble noise may be a severe problem in speech enhancement applications. This type of noise may include a composition of multiple background speech components and exhibit properties similar to the desired foreground speech. Embodiments disclosed herein improve suppression of non-stationary noise components, such as babble noise, whereas other noise reduction approaches primarily suppress the stationary background noise.
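- Tying the steps of FIG. 8 together, a high-level sketch of the chain could look as follows; it composes the hypothetical helper functions sketched in the earlier sections, and the noise estimator shown is a simplified placeholder for the speech-absence-gated estimate described above:

```python
import numpy as np
from scipy.signal import stft

def estimate_noise(phi_xx, alpha_b=0.95):
    """Hypothetical placeholder: track a slow, minimum-like noise PSD
    (the embodiments update the estimate only in speech absence)."""
    phi_bb = np.empty_like(phi_xx)
    phi_bb[0] = phi_xx[0]
    for l in range(1, phi_xx.shape[0]):
        phi_bb[l] = np.minimum(
            phi_xx[l],
            alpha_b * phi_bb[l - 1] + (1 - alpha_b) * phi_xx[l])
    return phi_bb

def suppress_babble_noise(x, fs=16000, n_fft=512):
    """End-to-end sketch of FIG. 8: spectrum estimation, speech detection
    features, dynamic overestimation, spectral weighting, noise shaping,
    post-processing, and synthesis. The 256-sample STFT hop matches the
    downsampling factor R of the kurtosis feature."""
    _, _, X = stft(x, fs=fs, window='hann', nperseg=n_fft)
    X = X.T                                   # (frames, bins)
    phi_xx = smooth_psd(X)                    # estimate the spectrum (806)
    phi_bb = estimate_noise(phi_xx)
    kurt = kurtosis_feature(x)                # detection features (808)
    voicing = cepstral_max_feature(phi_xx, phi_bb, fs, n_fft)
    n = min(len(kurt), len(voicing), X.shape[0])
    comb = combine_features(kurt[:n], voicing[:n])
    beta = overestimation_factor(comb)        # dynamic overestimation (810)
    h_wf = wiener_coefficients(phi_xx[:n], phi_bb[:n], beta)   # (812)
    h_floor = dynamic_floor(phi_bb[:n])       # dynamic max. attenuation (814)
    h = shape_and_postprocess(h_wf, h_floor)
    return synthesize(X[:n], h, fs, n_fft)    # apply coefficients (816)
```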
- According to embodiments disclosed herein, a noise suppression system is introduced for a more aggressive attenuation of babble noise. The noise suppression system disclosed herein includes a speech detector for foreground speech which is robust against background speech. Embodiments disclosed herein distinguish between desired speech and interfering babble noise and introduce a babble noise suppression system that provides effective suppression of babble noise by employing speech detection information to control the noise suppression.
- Since mobile devices, such as smartphones, are employed even in crowded environments, a strong need may be present for embodiments disclosed herein that may reduce babble noise in a cost effective manner. Automatic speech recognition can benefit from the improved speech enhancement or from meta-information on the presence of speech, such as the combined speech detection result.
- According to some embodiments, the babble noise suppression described herein comprises four stages:
- Feature extraction and combination to distinguish between desired speech and interfering babble noise. As disclosed above, in order to detect desired foreground speech in the presence of babble noise, two features are evaluated. Kurtosis reflects the sparseness of foreground speech by considering the distribution of sample values. The distribution of foreground speech is sparser than the distribution of babble noise in the background. Higher values of kurtosis, therefore, indicate the presence of desired speech. As disclosed above, the kurtosis feature is applied directly to a pre-emphasized noisy input signal. In addition to kurtosis, the cepstral maximum, is evaluated, to capture harmonic speech components. According to embodiments disclosed herein, the combination of both features allows for a more accurate detection of desired speech compared to the single features.
- Noise overestimation for stronger attenuation of noise in speech pauses. According to embodiments disclosed herein, using the combined value from both features, noise overestimation is controlled. Accordingly, during speech pauses, the noise spectrum is overestimated resulting in a stronger attenuation of noise. The maximum attenuation, however, is limited by a floor.
- Noise floor modifications for more aggressive attenuation of non-stationary noise components. A fixed floor may result in the same attenuation for stationary and non-stationary noise components. As such, the non-stationary components still stick out of the stationary background in this case. According to embodiments disclosed herein, stationarization of non-stationary components is achieved by applying a more aggressive attenuation to non-stationary components. For this, the noise floor is selectively lowered for frequency bins that contain non-stationary interferences, as disclosed above.
- Post processing of the spectral weighting coefficients to reduce the amount of noise in the result. According to embodiments disclosed herein, post processing is applied to the spectral weighting coefficients that were determined with noise overestimation and noise floor modification. Open bins that are neighbored by attenuating bins are set to the noise floor to reduce remaining musical noise components. The decision as to which bins are affected by this post processing relies on a local majority vote, as disclosed above.
-
FIG. 9 is a block diagram of an example of the internal structure of a computer 900 in which various embodiments of the present disclosure may be implemented. The computer 900 contains a system bus 902, where a bus is a set of hardware lines used for data transfer among the components of a computer or processing system. The system bus 902 is essentially a shared conduit that connects different elements of a computer system (e.g., processor, disk storage, memory, input/output ports, network ports, etc.) that enables the transfer of information between the elements. Coupled to the system bus 902 is an I/O device interface 904 for connecting various input and output devices (e.g., keyboard, mouse, displays, printers, speakers, etc.) to the computer 900. A network interface 906 allows the computer 900 to connect to various other devices attached to a network. Memory 908 provides volatile storage for computer software instructions 910 and data 912 that may be used to implement embodiments of the present disclosure. Disk storage 914 provides nonvolatile storage for computer software instructions 910 and data 912 that may be used to implement embodiments of the present disclosure. A central processor unit 918 is also coupled to the system bus 902 and provides for the execution of computer instructions. - Further example embodiments disclosed herein may be configured using a computer program product; for example, controls may be programmed in software for implementing example embodiments. Further embodiments include a non-transitory computer-readable medium containing instructions that may be executed by a processor, and, when loaded and executed, cause the processor to complete methods described herein. It should be understood that elements of the block and flow diagrams may be implemented in software or hardware, such as via one or more arrangements of circuitry of
FIG. 9 , disclosed above, or equivalents thereof, firmware, a combination thereof, or other similar implementation determined in the future. For example, the soft speech detectors 104 and 204 of FIGS. 1 and 2 , respectively, as well as the noise suppressors 114 and 214 of FIGS. 1 and 2 , respectively, and elements thereof, may be implemented in software or hardware, such as via one or more arrangements of circuitry of FIG. 9 , disclosed above, or equivalents thereof, firmware, a combination thereof, or other similar implementation determined in the future. In addition, the elements of the block and flow diagrams described herein may be combined or divided in any manner in software, hardware, or firmware. If implemented in software, the software may be written in any language that can support the example embodiments disclosed herein. The software may be stored in any form of non-transitory computer readable medium, such as random access memory (RAM), read only memory (ROM), compact disk read-only memory (CD-ROM), and so forth. In operation, a general purpose or application-specific processor or processing core loads and executes software in a manner well understood in the art. It should be understood further that the block and flow diagrams may include more or fewer elements, be arranged or oriented differently, or be represented differently. It should be understood that implementation may dictate the block, flow, and/or network diagrams and the number of block and flow diagrams illustrating the execution of embodiments disclosed herein.
Claims (9)
- A method of performing noise suppression of an audio signal to attenuate background noise more aggressively during speech pauses, the audio signal including speech components and background noise, the method is characterised by comprising:determining, dynamically, by a soft speech detector, a speech detection result as a function of a combination of feature values estimated from the audio signal, the speech detection result indicating a presence of the speech components in the audio signal containing the speech components and babble noise, the combination of feature values including a kurtosis value and a maximum cepstral value that reflects the speech presence;computing, dynamically, by a noise suppressor, spectral weighting coefficients based at least in part on a noise overestimation factor determined using the combination of feature values; andapplying the spectral weighting coefficients computed to the audio signal to suppress the background noise in a dynamic manner,wherein the maximum is searched over values of τ that correspond to frequencies of between 60 Hz and 300 Hz,wherein αv is a smoothing constant having a value of 0.9, wherein wk = 1, wherein wv = 96, wherein τ is a cepstral coefficient index, wherein voicing offset is an arbitrary offset, wherein Lp is the number frames to look backwards, wherein Lƒ is the number of frames to look ahead to capture speech onsets, and wherein kurt(l) is an estimate of kurtosis at frame l:wherein the kurtosis is applied to the pre-emphasized input signal x(n),wherein α k is the smoothing constant,wherein the moment of the 4th order, m 4(n), analogously is computed by smoothing x 4(n) andwherein computing the spectral weighting coefficients dynamically comprises:at a first multiplicative element (251) of a noise suppressor (214), multiplying an estimated noise spectrum, Φ̂ bb (k, l) (238), by β oe(ℓ) (216) to provide an overestimated power spectral density (239) of Φ̂ bb (k, l), said Φ̂ bb (k, l) having been received from a noise estimator (236) and said β oe(ℓ) having been provided to said first multiplicative element by an over-estimator (250),at a Wiener filter (221) of said noise suppressor, receiving said overestimated power spectral density and generating therefrom the spectral weighting coefficients HWF (k, l), wherein the spectral weighting coefficients are given by:wherein Φ̂ xx (k, l) is estimated by a spectrum estimator (228) by smoothing, temporally, the magnitude squared of X(k, l) (226), which are provided by a short-term Fourier Transform module (224) with a Hanning window of length 512 samples, the X(k, l) having been obtained by the short-term Fourier Transform module by transforming blocks of an electronic representation of the pre-emphasized input audio signal x(n) (201') in the time domain into a representation in the frequency domain in terms of the X(k, /),wherein Φ̂ bb (k, l) is determined by the noise estimator based on Φ̂ xx (k, l) and is calculated by smoothing Φ̂ xx (k, l) in the absence of speech,wherein said overestimated power spectral density is given by β oe(l)·Φ̂ bb (k, l), wherein βmax is a maximal factor having a value of 21,wherein ∈ is an adjustment factor,at a noise shaper (220), noise shaping the computed spectral weighting coefficients by computing a dynamic noise floor and selectively lowering the dynamic noise floor for frequencies corresponding to non-stationary noise components of the background noise;wherein Φ̂ bb, average (k, l) is an average of the estimated noise spectrum,wherein k addresses 
the frequency bin of the l th frame, andwherein Hfloor,fixed is a fixed noise floor, and,at a post processor 222, modifying the noise-shaped spectral weighting coefficients computed to suppress the background noise in a dynamic manner by causing those spectral weighting coefficients that exceed the fixed noise floor to be set to the dynamic floor as defined by:wherein a first number n-(k, l) of coefficients next to frequency bin k that attenuate stronger than the fixed floor are compared to a second number of coefficients n+(k, l) that exceed the fixed floor, wherein HNS (k, l) are spectral weighting coefficients given bywherein applying the noise-shaped spectral weighting coefficients to the audio signal comprisesat a second multiplicative element (245), multiplying the spectral weighting coefficients H(k, l) (245) from said post-processor (222) with the short-term Fourier transform coefficients X(k, l) (226) received from said short-term Fourier transform module (224) to produce enhanced short-term Fourier transform coefficients Y(k, l) (247) and,at an overlap add module (249), converting said enhanced short-term Fourier transform coefficients Y(k, l) (247) into a time-domain electronic representation of an output audio signal (215').
- The method of claim 1, further comprising: determining periods of speech pauses and periods of speech activity in the audio signal as a function of the speech detection result and increasing a value of the dynamic noise overestimation factor for those periods that were determined to be periods of speech pauses relative to the value of the dynamic noise overestimation factor for those periods that were determined to be periods of speech activity, wherein increasing the value of the dynamic noise overestimation factor enables the spectral weighting coefficients computed to increase suppression of the background noise relative to an amount of suppression of the background noise for those periods that were determined to be periods of speech activity.
- The method of claim 1, wherein the background noise includes stationary and non-stationary noise components and, optionally, wherein:
changes in a power spectrum of the audio signal over a time interval are less for the stationary noise components than for the non-stationary noise components. - The method of claim 1, wherein the speech components are foreground speech components that correspond to speech from a user speaking into an audio receiving device and wherein the background noise includes babble noise, the babble noise including a composition of multiple background speech components from other speakers.
- A system configured to perform noise suppression of an audio signal to attenuate background noise more aggressively during speech pauses, the audio signal including speech components and background noise, the system being characterised by comprising:
a soft speech detector (204) configured to determine, dynamically, a speech detection result (212) as a function of a combination of feature values estimated from the audio signal, the speech detection result indicating a presence of the speech components in the audio signal in the presence of babble noise in the audio signal, the combination of feature values including a kurtosis value and a maximum cepstral value that reflects the speech presence; and
a noise suppressor communicatively coupled to the soft speech detector to receive the speech detection result determined and configured to compute, dynamically, spectral weighting coefficients based at least in part on a noise overestimation factor determined using the combination of feature values and to apply the spectral weighting coefficients computed to the audio signal to suppress the background noise in a dynamic manner,
wherein the soft speech detector comprises a combiner (206) configured to provide the combination of feature values,
wherein the maximum is searched over values of τ that correspond to frequencies between 60 Hz and 300 Hz,
wherein α_v is a smoothing constant having a value of 0.9, wherein w_k = 1, wherein w_v = 96, wherein τ is a cepstral coefficient index, wherein the voicing offset is an arbitrary offset, wherein L_p is the number of frames to look backwards, wherein L_f is the number of frames to look ahead to capture speech onsets, and wherein kurt(l) is an estimate of the kurtosis at frame l,
wherein the kurtosis is applied to the pre-emphasized input signal x(n), wherein α_k is the smoothing constant, and wherein the moment of the 4th order, m_4(n), is computed analogously by smoothing x^4(n),
the system further comprising a noise suppressor (214) communicatively coupled to the soft speech detector to receive the speech detection result determined, wherein said noise suppressor comprises a first multiplicative element (251), a Wiener filter (221), an over-estimator (250), a noise shaper (220), and a post-processor (222),
wherein said first multiplicative element is configured to provide Φ̂_bb(k, l)·β_oe(l) to said Wiener filter, said Φ̂_bb(k, l) having been received from a noise estimator (236) and said β_oe(l) having been provided to said first multiplicative element by said over-estimator,
wherein said Wiener filter outputs spectral weighting coefficients (equation not reproduced here),
wherein Φ̂_xx(k, l) is estimated by a spectrum estimator (228) by temporally smoothing the magnitude squared of the X(k, l) (226), which are provided by a short-term Fourier transform module (224) with a Hanning window of length 512 samples, said X(k, l) having been obtained by said short-term Fourier transform module by transforming blocks of an electronic representation of the pre-emphasized input audio signal x(n) (201') in the time domain into a representation in the frequency domain in terms of the X(k, l),
wherein Φ̂_bb(k, l) is determined by the noise estimator based on Φ̂_xx(k, l) and is calculated by smoothing Φ̂_xx(k, l) in the absence of speech,
wherein β_max is a maximal factor having a value of 21, and wherein ε is an adjustment factor,
wherein said noise shaper is configured to carry out noise shaping of said computed spectral weighting coefficients by computing a dynamic noise floor and selectively lowering the dynamic noise floor for frequencies corresponding to non-stationary noise components of the background noise, said noise floor being defined by an equation not reproduced here,
wherein Φ̂_bb,average(k, l) is an average of an estimated noise spectrum, wherein the X(k, l) are short-term Fourier transform coefficients, wherein k addresses the frequency bin of the l-th frame, and wherein H_floor,fixed is a fixed noise floor,
wherein said post-processor (222) is configured to modify said noise-shaped spectral weighting coefficients to suppress the background noise in a dynamic manner by causing those spectral weighting coefficients that exceed the fixed noise floor to be set to the dynamic floor as defined therein (equation not reproduced here),
wherein a first number n−(k, l) of coefficients next to frequency bin k that attenuate more strongly than the fixed floor is compared to a second number n+(k, l) of coefficients that exceed the fixed floor, and wherein H_NS(k, l) are spectral weighting coefficients (equation not reproduced here),
wherein a second multiplicative element (245) is configured to multiply the spectral weighting coefficients H(k, l) from said post-processor (222) with the short-term Fourier transform coefficients X(k, l) (226) received from said short-term Fourier transform module (224) to produce enhanced short-term Fourier transform coefficients Y(k, l) (247), and wherein an overlap-add module (249) converts said enhanced short-term Fourier transform coefficients Y(k, l) (247) into a time-domain electronic representation of an output audio signal (215').
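The suppression chain of claims 1 and 5 applies an overestimated Wiener characteristic and then a floor that is lowered selectively at non-stationary components; the governing equations are referenced but not reproduced in this record. The sketch below therefore assumes the textbook form H_WF = 1 − β_oe·Φ̂_bb/Φ̂_xx limited from below, and a simple rule that lowers the fixed floor where the noise estimate exceeds its spectral average. Both are assumptions standing in for the granted formulas, as is the floor_drop parameter.

```python
import numpy as np

def wiener_with_dynamic_floor(phi_xx, phi_bb, beta_oe,
                              h_floor_fixed=0.1, floor_drop=0.5):
    """Overestimated Wiener weights with a selectively lowered floor.

    phi_xx, phi_bb : per-bin PSD estimates of input and noise (one frame)
    beta_oe        : scalar noise overestimation factor for this frame
    """
    h_wf = 1.0 - beta_oe * phi_bb / np.maximum(phi_xx, 1e-12)
    # Dynamic floor: lower the fixed floor in bins where the noise
    # estimate exceeds its spectral average, i.e. at non-stationary
    # components, so they are attenuated more strongly.
    non_stationary = phi_bb > np.mean(phi_bb)
    h_floor = np.where(non_stationary,
                       floor_drop * h_floor_fixed, h_floor_fixed)
    return np.maximum(h_wf, h_floor)
```

Applied per frame, these weights would then be multiplied with X(k, l) and overlap-added back, as in the final steps of the claims.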
- The system of claim 5, wherein the noise suppressor is further configured to: determine periods of speech pauses and periods of speech activity in the audio signal as a function of the speech detection result and to increase the dynamic noise overestimation factor for the periods of speech pauses relative to the value of the dynamic noise overestimation factor for the periods of speech activity, wherein increasing the value of the dynamic noise overestimation factor enables the spectral weighting coefficients computed to increase suppression of the background noise relative to an amount of suppression of the background noise for the periods of speech activity.
- The system of claim 5, wherein the background noise includes stationary and non-stationary noise components, wherein changes in a power spectrum of the audio signal over a time interval are less for the stationary noise components than for the non-stationary noise components.
- The system of claim 5, wherein the speech components are foreground speech components that correspond to speech from a user speaking into an audio receiving device and further wherein the background noise includes babble noise, the babble noise including a composition of multiple background speech components from other speakers.
- A non-transitory computer-readable medium having encoded thereon a sequence of instructions which, when loaded and executed by a processor, causes the processor to perform noise suppression of an audio signal including speech components and background noise by:
determining, dynamically, by a soft speech detector, a speech detection result as a function of a combination of feature values estimated from the audio signal, the speech detection result indicating a presence of the speech components in the audio signal containing the speech components and babble noise, the combination of feature values including a kurtosis value and a maximum cepstral value that reflects the speech presence;
computing, dynamically, by a noise suppressor, spectral weighting coefficients based at least in part on a noise overestimation factor determined using the combination of feature values; and
applying the spectral weighting coefficients computed to the audio signal to suppress the background noise in a dynamic manner,
wherein the maximum is searched over values of τ that correspond to frequencies between 60 Hz and 300 Hz,
wherein α_v is a smoothing constant having a value of 0.9, wherein w_k = 1, wherein w_v = 96, wherein τ is a cepstral coefficient index, wherein the voicing offset is an arbitrary offset, wherein L_p is the number of frames to look backwards, wherein L_f is the number of frames to look ahead to capture speech onsets, and wherein kurt(l) is an estimate of the kurtosis at frame l,
wherein the kurtosis is applied to the pre-emphasized input signal x(n), wherein α_k is the smoothing constant, and wherein the moment of the 4th order, m_4(n), is computed analogously by smoothing x^4(n), and
wherein computing the spectral weighting coefficients dynamically comprises:
at a first multiplicative element (251) of a noise suppressor (214), multiplying an estimated noise spectrum Φ̂_bb(k, l) (238) by β_oe(l) (216) to provide an overestimated power spectral density (239) of Φ̂_bb(k, l), said Φ̂_bb(k, l) having been received from a noise estimator (236) and said β_oe(l) having been provided to said first multiplicative element by an over-estimator (250);
at a Wiener filter (221) of said noise suppressor, receiving said overestimated power spectral density and generating therefrom the spectral weighting coefficients H_WF(k, l) (equation not reproduced here),
wherein Φ̂_xx(k, l) is estimated by a spectrum estimator (228) by temporally smoothing the magnitude squared of the X(k, l) (226), which are provided by a short-term Fourier transform module (224) with a Hanning window of length 512 samples, the X(k, l) having been obtained by the short-term Fourier transform module by transforming blocks of an electronic representation of the pre-emphasized input audio signal x(n) (201') in the time domain into a representation in the frequency domain in terms of the X(k, l),
wherein Φ̂_bb(k, l) is determined by the noise estimator based on Φ̂_xx(k, l) and is calculated by smoothing Φ̂_xx(k, l) in the absence of speech,
wherein said overestimated power spectral density is given by β_oe(l)·Φ̂_bb(k, l), wherein β_max is a maximal factor having a value of 21, and wherein ε is an adjustment factor;
at a noise shaper (220), noise shaping the computed spectral weighting coefficients by computing a dynamic noise floor and selectively lowering the dynamic noise floor for frequencies corresponding to non-stationary noise components of the background noise,
wherein Φ̂_bb,average(k, l) is an average of the estimated noise spectrum, wherein k addresses the frequency bin of the l-th frame, and wherein H_floor,fixed is a fixed noise floor; and
at a post-processor (222), modifying the noise-shaped spectral weighting coefficients to suppress the background noise in a dynamic manner by causing those spectral weighting coefficients that exceed the fixed noise floor to be set to the dynamic floor as defined therein (equation not reproduced here),
wherein a first number n−(k, l) of coefficients next to frequency bin k that attenuate more strongly than the fixed floor is compared to a second number n+(k, l) of coefficients that exceed the fixed floor, and wherein H_NS(k, l) are spectral weighting coefficients (equation not reproduced here),
wherein applying the noise-shaped spectral weighting coefficients to the audio signal comprises:
at a second multiplicative element (245), multiplying the spectral weighting coefficients H(k, l) from said post-processor (222) with the short-term Fourier transform coefficients X(k, l) (226) received from said short-term Fourier transform module (224) to produce enhanced short-term Fourier transform coefficients Y(k, l) (247); and
at an overlap-add module (249), converting said enhanced short-term Fourier transform coefficients Y(k, l) (247) into a time-domain electronic representation of an output audio signal (215').
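All three independent claims share the same analysis/synthesis framing: a 512-sample Hanning-windowed short-term Fourier transform, per-bin multiplication of H(k, l) with X(k, l), and overlap-add reconstruction. A minimal sketch follows; the 50 % hop (256 samples) is an assumption, since the claims fix only the window length.

```python
import numpy as np

def stft_process(x, weight_fn, win_len=512, hop=256):
    """Analysis STFT, per-frame spectral weighting, overlap-add synthesis.

    weight_fn(X) must return the weights H(k, l) for one frame's STFT
    coefficients X(k, l); Y(k, l) = H(k, l) * X(k, l) is transformed
    back and overlap-added into the output signal.
    """
    win = np.hanning(win_len)  # Hann at 50% overlap sums to ~1 (COLA)
    y = np.zeros(len(x))
    for start in range(0, len(x) - win_len + 1, hop):
        X = np.fft.rfft(x[start:start + win_len] * win)   # X(k, l)
        Y = weight_fn(X) * X                              # Y(k, l)
        y[start:start + win_len] += np.fft.irfft(Y)
    return y
```

With weight_fn returning all-ones, the routine reproduces its input up to edge effects, which is a convenient sanity check before plugging in the Wiener weights.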
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201662291791P | 2016-02-05 | 2016-02-05 | |
PCT/US2016/062908 WO2017136018A1 (en) | 2016-02-05 | 2016-11-18 | Babble noise suppression |
Publications (2)
Publication Number | Publication Date |
---|---|
EP3411876A1 (en) | 2018-12-12 |
EP3411876B1 (en) | 2021-12-15 |
Family
ID=57517998
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP16808870.6A (EP3411876B1, Active) | Babble noise suppression | 2016-02-05 | 2016-11-18 |
Country Status (3)
Country | Link |
---|---|
US (1) | US10783899B2 (en) |
EP (1) | EP3411876B1 (en) |
WO (1) | WO2017136018A1 (en) |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2017136018A1 (en) | 2016-02-05 | 2017-08-10 | Nuance Communications, Inc. | Babble noise suppression |
US10607386B2 (en) | 2016-06-12 | 2020-03-31 | Apple Inc. | Customized avatars and associated framework |
WO2017217978A1 (en) * | 2016-06-15 | 2017-12-21 | Nuance Communications, Inc. | Techniques for wake-up word recognition and related systems and methods |
EP3324407A1 (en) * | 2016-11-17 | 2018-05-23 | Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. | Apparatus and method for decomposing an audio signal using a ratio as a separation characteristic |
EP3324406A1 (en) | 2016-11-17 | 2018-05-23 | Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. | Apparatus and method for decomposing an audio signal using a variable threshold |
US10861210B2 (en) * | 2017-05-16 | 2020-12-08 | Apple Inc. | Techniques for providing audio and video effects |
CN108109619B (en) * | 2017-11-15 | 2021-07-06 | 中国科学院自动化研究所 | Auditory selection method and device based on memory and attention model |
CN110487546B (en) * | 2018-05-10 | 2021-12-14 | 上汽通用汽车有限公司 | Gearbox knocking noise testing method, testing device and evaluation method |
DE102019102415A1 (en) * | 2019-01-31 | 2020-08-06 | Harman Becker Automotive Systems Gmbh | REDUCTION OF DISTORTIONS IN VOICE SIGNALS |
EP4128225A1 (en) * | 2020-03-30 | 2023-02-08 | Harman Becker Automotive Systems GmbH | Noise supression for speech enhancement |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
FR2808917B1 (en) * | 2000-05-09 | 2003-12-12 | Thomson Csf | METHOD AND DEVICE FOR VOICE RECOGNITION IN FLUCTUATING NOISE LEVEL ENVIRONMENTS |
DE10137348A1 (en) * | 2001-07-31 | 2003-02-20 | Alcatel Sa | Noise filtering method in voice communication apparatus, involves controlling overestimation factor and background noise variable in transfer function of wiener filter based on ratio of speech and noise signal |
JP4670483B2 (en) * | 2005-05-31 | 2011-04-13 | 日本電気株式会社 | Method and apparatus for noise suppression |
JP5293817B2 (en) * | 2009-06-19 | 2013-09-18 | 富士通株式会社 | Audio signal processing apparatus and audio signal processing method |
JP5687522B2 (en) * | 2011-02-28 | 2015-03-18 | 国立大学法人 奈良先端科学技術大学院大学 | Speech enhancement apparatus, method, and program |
US20120245927A1 (en) * | 2011-03-21 | 2012-09-27 | On Semiconductor Trading Ltd. | System and method for monaural audio processing based preserving speech information |
US9449615B2 (en) * | 2013-11-07 | 2016-09-20 | Continental Automotive Systems, Inc. | Externally estimated SNR based modifiers for internal MMSE calculators |
US9799330B2 (en) * | 2014-08-28 | 2017-10-24 | Knowles Electronics, Llc | Multi-sourced noise suppression |
WO2017136018A1 (en) | 2016-02-05 | 2017-08-10 | Nuance Communications, Inc. | Babble noise suppression |
- 2016
- 2016-11-18 WO PCT/US2016/062908 patent/WO2017136018A1/en active Application Filing
- 2016-11-18 US US16/073,740 patent/US10783899B2/en active Active
- 2016-11-18 EP EP16808870.6A patent/EP3411876B1/en active Active
Non-Patent Citations (2)
Title |
---|
ELIAS NEMER ET AL: "SPEECH ENHANCEMENT USING FOURTH-ORDER CUMULANTS AND TIME-DOMAIN OPTIMAL FILTERS", EUROSPEECH 1999, vol. 6, 5 September 1999 (1999-09-05), pages 2623, XP007001557 * |
SIMON GRAF ET AL: "Kurtosis-Controlled Babble Noise Suppression", ITG-FACHBERICHT 267: SPEECH COMMUNICATION, 5 July 2016 (2016-07-05), Paderborn, pages 307 - 311, XP055616272, ISBN: 978-3-8007-4275-2, Retrieved from the Internet <URL:https://ieeexplore.ieee.org/ielx7/7776136/7776137/07776197.pdf?tp=&arnumber=7776197&isnumber=7776137&ref=aHR0cHM6Ly9pZWVleHBsb3JlLmllZWUub3JnL2RvY3VtZW50Lzc3NzYxOTc=> [retrieved on 20190828] * |
Also Published As
Publication number | Publication date |
---|---|
EP3411876A1 (en) | 2018-12-12 |
WO2017136018A9 (en) | 2018-05-11 |
WO2017136018A1 (en) | 2017-08-10 |
US20190013036A1 (en) | 2019-01-10 |
US10783899B2 (en) | 2020-09-22 |
Similar Documents
Publication | Title |
---|---|
EP3411876B1 (en) | Babble noise suppression | |
Zhao et al. | Perceptually guided speech enhancement using deep neural networks | |
Graf et al. | Features for voice activity detection: a comparative analysis | |
EP1745468B1 (en) | Noise reduction for automatic speech recognition | |
Sadjadi et al. | Unsupervised speech activity detection using voicing measures and perceptual spectral flux | |
EP2151822B1 (en) | Apparatus and method for processing an audio signal for speech enhancement using a feature extraction | |
EP2058797B1 (en) | Discrimination between foreground speech and background noise | |
EP1083541B1 (en) | A method and apparatus for speech detection | |
Cohen et al. | Spectral enhancement methods | |
Verteletskaya et al. | Noise reduction based on modified spectral subtraction method | |
JPH0916194A (en) | Noise reduction for voice signal | |
Hansen et al. | Speech enhancement based on generalized minimum mean square error estimators and masking properties of the auditory system | |
Veisi et al. | Hidden-Markov-model-based voice activity detector with high speech detection rate for speech enhancement | |
US20140177853A1 (en) | Sound processing device, sound processing method, and program | |
CN112053702B (en) | Voice processing method and device and electronic equipment | |
Jin et al. | Speech enhancement using harmonic emphasis and adaptive comb filtering | |
Garg et al. | A comparative study of noise reduction techniques for automatic speech recognition systems | |
JP2797861B2 (en) | Voice detection method and voice detection device | |
Lu | Reduction of musical residual noise using block-and-directional-median filter adapted by harmonic properties | |
Martin et al. | Robust speech/non-speech detection based on LDA-derived parameter and voicing parameter for speech recognition in noisy environments | |
Deisher et al. | Speech enhancement using state-based estimation and sinusoidal modeling | |
Graf et al. | Kurtosis-Controlled Babble Noise Suppression | |
Krishnamoorthy et al. | Modified spectral subtraction method for enhancement of noisy speech | |
Verteletskaya et al. | Enhanced spectral subtraction method for noise reduction with minimal speech distortion | |
Son et al. | Improved speech absence probability estimation based on environmental noise classification |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: UNKNOWN |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE |
|
PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase |
Free format text: ORIGINAL CODE: 0009012 |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE |
|
17P | Request for examination filed |
Effective date: 20180905 |
|
AK | Designated contracting states |
Kind code of ref document: A1 Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR |
|
AX | Request for extension of the european patent |
Extension state: BA ME |
|
DAV | Request for validation of the european patent (deleted) | ||
DAX | Request for extension of the european patent (deleted) | ||
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: EXAMINATION IS IN PROGRESS |
|
17Q | First examination report despatched |
Effective date: 20190904 |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: EXAMINATION IS IN PROGRESS |
|
RAP1 | Party data changed (applicant data changed or rights of an application transferred) |
Owner name: CERENCE OPERATING COMPANY |
|
GRAP | Despatch of communication of intention to grant a patent |
Free format text: ORIGINAL CODE: EPIDOSNIGR1 |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: GRANT OF PATENT IS INTENDED |
|
INTG | Intention to grant announced |
Effective date: 20210517 |
|
GRAS | Grant fee paid |
Free format text: ORIGINAL CODE: EPIDOSNIGR3 |
|
GRAA | (expected) grant |
Free format text: ORIGINAL CODE: 0009210 |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: THE PATENT HAS BEEN GRANTED |
|
AK | Designated contracting states |
Kind code of ref document: B1 Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR |
|
REG | Reference to a national code |
Ref country code: GB Ref legal event code: FG4D Ref country code: CH Ref legal event code: EP |
|
REG | Reference to a national code |
Ref country code: IE Ref legal event code: FG4D Ref country code: DE Ref legal event code: R096 Ref document number: 602016067444 Country of ref document: DE |
|
REG | Reference to a national code |
Ref country code: AT Ref legal event code: REF Ref document number: 1456091 Country of ref document: AT Kind code of ref document: T Effective date: 20220115 |
|
REG | Reference to a national code |
Ref country code: LT Ref legal event code: MG9D |
|
REG | Reference to a national code |
Ref country code: NL Ref legal event code: MP Effective date: 20211215 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: RS Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20211215 Ref country code: LT Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20211215 Ref country code: FI Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20211215 Ref country code: BG Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20220315 |
|
REG | Reference to a national code |
Ref country code: AT Ref legal event code: MK05 Ref document number: 1456091 Country of ref document: AT Kind code of ref document: T Effective date: 20211215 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: SE Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20211215 Ref country code: NO Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20220315 Ref country code: LV Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20211215 Ref country code: HR Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20211215 Ref country code: GR Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20220316 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: NL Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20211215 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: SM Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20211215 Ref country code: SK Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20211215 Ref country code: RO Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20211215 Ref country code: PT Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20220418 Ref country code: ES Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20211215 Ref country code: EE Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20211215 Ref country code: CZ Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20211215 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: PL Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20211215 Ref country code: AT Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20211215 |
|
REG | Reference to a national code |
Ref country code: DE Ref legal event code: R097 Ref document number: 602016067444 Country of ref document: DE |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: IS Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20220415 |
|
PLBE | No opposition filed within time limit |
Free format text: ORIGINAL CODE: 0009261 |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: NO OPPOSITION FILED WITHIN TIME LIMIT |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: DK Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20211215 Ref country code: AL Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20211215 |
|
26N | No opposition filed |
Effective date: 20220916 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: SI Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20211215 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: IT Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20211215 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: MC Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20211215 |
|
REG | Reference to a national code |
Ref country code: CH Ref legal event code: PL |
|
GBPC | Gb: european patent ceased through non-payment of renewal fee |
Effective date: 20221118 |
|
REG | Reference to a national code |
Ref country code: BE Ref legal event code: MM Effective date: 20221130 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: LI Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES Effective date: 20221130 Ref country code: CH Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES Effective date: 20221130 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: LU Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES Effective date: 20221118 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: IE Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES Effective date: 20221118 Ref country code: GB Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES Effective date: 20221118 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: FR Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES Effective date: 20221130 Ref country code: BE Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES Effective date: 20221130 |
|
PGFP | Annual fee paid to national office [announced via postgrant information from national office to epo] |
Ref country code: DE Payment date: 20230926 Year of fee payment: 8 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: HU Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT; INVALID AB INITIO Effective date: 20161118 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: CY Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20211215 |