AT509570B1

AT509570B1 - METHOD AND APPARATUS FOR SINGLE-CHANNEL SPEECH ENHANCEMENT BASED ON A LATENCY-REDUCED AUDITORY MODEL

Info

Publication number: AT509570B1
Application number: AT0956707A
Authority: AT
Inventors: Martin Opitz; Robert Hoeldrich; Franz Zotter; Markus Noisternig
Original assignee: Akg Acoustics Gmbh
Priority date: 2007-10-02
Filing date: 2007-10-02
Publication date: 2011-12-15
Also published as: GB201004090D0; DE112007003674T5; GB2465910A; AT509570A5; WO2009043066A1; GB2465910B

Description

Austrian Patent Office AT 509 570 B1 2011-12-15


METHOD AND APPARATUS FOR SINGLE-CHANNEL SPEECH ENHANCEMENT BASED ON A LATENCY-REDUCED AUDITORY MODEL

FIELD OF THE INVENTION

[0001] The present invention relates to a method for enhancing a wideband audio signal corrupted by background noise, and in particular to a noise suppression system, a noise suppression method and a noise suppression program. More specifically, the present invention relates to latency-reduced single-channel noise suppression using subband processing based on the masking properties of the human auditory system.

BACKGROUND OF THE INVENTION

[0002] Additive background noise in speech communication systems degrades the subjective quality and intelligibility of the perceived speech. Speech processing systems therefore require noise reduction methods, i.e. methods that process a noisy signal to eliminate or attenuate the noise level and to improve the signal-to-noise ratio (SNR) without impairing the speech or its characteristics. Noise reduction is also commonly referred to as noise suppression or speech enhancement.

[0003] For example, mobile phones are often used in environments with high levels of background noise, such as public places. The use of mobile phones, voice-controlled devices and communication systems in cars has created a strong demand for hands-free installations that increase in-car safety and comfort. In many countries and regions, for example, the law prohibits hand-held phone use in the car. Noise reduction is important for these applications because they must operate in acoustically adverse environments, in particular at low SNR and with strongly time-varying noise characteristics such as the rolling noise of cars.

[0004] In (hands-free) teleconferencing applications such as videoconferencing, or in speech recognition and voice query systems, the background noise originates from the fans of computers, printers or fax machines and can be regarded as (long-term) stationary. Conversational noise from (telephone) conversations of colleagues sharing the room is often called babble noise; it consists of harmonic components and is therefore more difficult for a noise reduction unit to attenuate.

[0005] Applications in hearing aids and in-car voice communication systems, however, require noise suppression methods that can run in real time.

[0006] Nevertheless, the rapid development of the underlying hardware in terms of computing power and memory capacity supports progress in software implementations.

[0007] One of the most widely used noise suppression methods in practical applications is known as spectral subtraction (cf. S. F. Boll, "Suppression of Acoustic Noise in Speech using Spectral Subtraction," IEEE Trans. Acoust. Speech and Sig. Proc., vol. ASSP-27, pp. 113-120, Apr. 1979). In general, the spectral subtraction approach estimates the short-time spectral amplitude (STSA) of the clean speech from a noisy speech signal, i.e. the desired speech contaminated by noise, by subtracting an estimate of the noise signal. Based on the assumption that the human ear is insensitive to phase distortion, the estimated magnitude of the speech signal is combined with the phase of the noisy signal (cf. C. L. Wang et al., "The unimportance of phase in speech enhancement," IEEE Trans. Acoust. Speech and Sig. Proc., vol. ASSP-30, pp. 679-681, Aug. 1982). In practice, spectral subtraction is implemented by multiplying the input signal spectrum by a weighting function so as to suppress frequency components with low SNR. This SNR-based weighting function is formed from estimates of the noise spectrum; the noisy speech spectrum is assumed to be wide-sense stationary, and the speech and noise signals are assumed to be zero-mean, uncorrelated random signals. These conventional spectral subtraction methods provide significant noise suppression, with the main drawback of degraded signal quality, perceived as musical tones or musical noise. The musical tones stem from spectral estimation errors. In recent years, many improvements to the basic spectral subtraction approach have been developed.
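
The weighting-function view of spectral subtraction in [0007] can be illustrated with a minimal sketch. It assumes plain magnitude subtraction for a single frame; the function name and the choice of gain rule are illustrative and are not taken from the patent or from Boll's paper.

```python
def spectral_subtraction_frame(noisy_spectrum, noise_magnitude):
    """Magnitude spectral subtraction for one frame (illustrative sketch).

    noisy_spectrum:  complex DFT coefficients of the noisy frame
    noise_magnitude: estimated noise magnitude for each coefficient
    """
    enhanced = []
    for y, n_mag in zip(noisy_spectrum, noise_magnitude):
        y_mag = abs(y)
        # Subtracting the noise magnitude is equivalent to multiplying by
        # the weighting function G = max(1 - N/|Y|, 0).
        gain = max(1.0 - n_mag / y_mag, 0.0) if y_mag > 0.0 else 0.0
        # Scaling the complex coefficient by a real gain keeps the noisy phase.
        enhanced.append(gain * y)
    return enhanced
```

Bins whose estimated noise magnitude exceeds the observed magnitude are set to zero. Isolated bins that narrowly survive this clipping are one source of the musical tones described above.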

[0008] A frequently used method for reducing musical tones is to subtract an overestimate of the noise spectrum, which reduces the fluctuations in the DFT coefficients, and to prevent the spectral components from falling below a spectral floor (cf. M. Berouti et al., "Enhancement of speech corrupted by acoustic noise," in Proc. IEEE Int. Conf. on Acoust., Speech and Sig. Proc. (ICASSP'79), vol. 4, pp. 208-211, Washington D.C., Apr. 1979). This approach successfully reduces musical tones at low SNR and during noise-only periods. Its main drawback is distortion of the speech signal during speech activity. In practice, a compromise is struck between speech quality and the residual noise level. Other methods address this problem by introducing optimal and adaptive over-subtraction factors for low SNR conditions and propose under-subtraction of the noise spectrum for high SNR conditions (cf. W. M. Kushner et al., "The effects of subtractive-type speech enhancement/noise reduction algorithms on parameter estimation for improved recognition and coding in high noise environments," in Proc. IEEE Int. Conf. Acoustics, Speech and Sig. Proc. (ICASSP'89), vol. 1, pp. 211-214, 1989).
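
The over-subtraction with a spectral floor described in [0008] can be sketched as follows. This is a simplified power-domain version; the parameter names `alpha` (over-subtraction factor) and `beta` (floor factor) and their default values are assumptions chosen for illustration only.

```python
def oversubtract_power(noisy_power, noise_power, alpha=2.0, beta=0.01):
    """Power spectral subtraction with over-subtraction and a spectral floor.

    alpha > 1 subtracts an overestimate of the noise power;
    beta sets the spectral floor as a fraction of the noise power.
    Both defaults are illustrative, not values from the cited work.
    """
    enhanced = []
    for p_y, p_n in zip(noisy_power, noise_power):
        diff = p_y - alpha * p_n      # subtract the overestimated noise
        floor = beta * p_n            # lower limit for the result
        enhanced.append(max(diff, floor))
    return enhanced
```

A larger `alpha` removes more residual noise but distorts speech more, which is the compromise the paragraph refers to.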

[0009] Applying a soft-decision based modification to the spectral weighting function (cf. R. McAulay and M. Malpass, "Speech enhancement using a soft-decision noise suppression filter," in IEEE Trans. Acoust., Speech and Sig. Proc., vol. 28, no. 2, pp. 137-145, 1980) has been shown to improve the noise suppression behavior of the enhancement system with respect to the suppression of musical tones. These soft-decision approaches depend mainly on the a priori probability of speech absence in each spectral component of the noisy speech.

[0010] The minimum mean-square error short-time spectral amplitude estimator (MMSE-STSA, cf. Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error short-time amplitude estimator," IEEE Trans. Acoust. Speech and Sig. Proc., vol. 32, no. 6, pp. 1109-1121, 1984) and the minimum mean-square error log-spectral amplitude estimator (MMSE-LSA, Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error log spectral amplitude estimator," IEEE Trans. Acoust. Speech and Sig. Proc., vol. 33, no. 2, pp. 443-445, 1985) minimize the mean-square error of the estimated short-time spectral amplitude and log-spectral amplitude, respectively. It has been found that the nonlinear smoothing procedure of the MMSE-SP/LSA methods (the so-called decision-directed approach) yields a consistent estimate of the SNR, which achieves good noise suppression without annoying musical tones (cf. O. Cappé, "Elimination of the musical noise phenomenon with the Ephraim and Malah noise suppressor," IEEE Trans. Speech and Audio Proc., vol. 2, no. 2, pp. 345-349, 1994). Both Cappé and Malah (cf. E. Malah et al., "Tracking speech-presence uncertainty to improve speech enhancement in non-stationary noise environments," in Proc. IEEE Int. Conf. Acoust., Speech and Sig. Proc. (ICASSP'99), vol. 2, pp. 789-792, 1999) propose limiting the a priori SNR estimate to overcome the problem of perceptible low-level musical noise during speech pauses. The so-called a priori SNR represents the information about the unknown magnitude spectrum that is gathered from previous frames and evaluated in the decision-directed approach (DDA). Because the smoothing performed by the DDA is irregular, low-level musical noise may occur. A simple solution to this problem is to constrain the a priori SNR by a lower bound.
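
The decision-directed estimate with a lower bound mentioned in [0010] can be sketched in its textbook form. The function name, the smoothing weight `a` and the bound `xi_min` are illustrative assumptions; this is the conventional DDA, not the modified variant proposed by the invention.

```python
def decision_directed_snr(post_snr, prev_clean_power, noise_power,
                          a=0.98, xi_min=0.003):
    """Conventional decision-directed a priori SNR estimate for one bin.

    post_snr:         a posteriori SNR of the current frame (|Y|^2 / noise power)
    prev_clean_power: squared clean-speech amplitude estimated in the previous frame
    noise_power:      current noise power estimate (must be > 0)
    a, xi_min:        illustrative smoothing weight and lower bound
    """
    # Information from the current frame only, clipped to be non-negative.
    instantaneous = max(post_snr - 1.0, 0.0)
    # Weighted mix of the previous frame's estimate and the current frame.
    xi = a * prev_clean_power / noise_power + (1.0 - a) * instantaneous
    # The lower bound keeps the estimate from dropping to zero in pauses.
    return max(xi, xi_min)
```

With `a` close to 1 the estimate changes slowly between frames, which is the smoothing that suppresses musical tones.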

[0011] In single-channel spectral subtraction, the noise spectrum is usually estimated during speech pauses, which requires voice activity detection (VAD) methods (cf. R. McAulay and M. Malpass, "Speech enhancement using a soft-decision noise suppression filter," in IEEE Trans. Acoust., Speech and Sig. Proc., vol. 28, no. 2, pp. 137-145, 1980; and W. J. Hess, "A pitch-synchronous digital feature extraction system for phonemic recognition of speech," in IEEE Trans. Acoust., Speech and Sig. Proc., vol. 24, no. 1, pp. 14-25, 1976). This approach implies stationary noise characteristics during periods of speech. Arslan et al. developed a robust noise estimation method that requires no voice activity detection because it uses recursive averaging with level-dependent time constants in each subband (cf. L. Arslan et al., "New methods for adaptive noise suppression," in Proc. Int. Conf. on Acoustics, Speech and Sig. Proc. (ICASSP-95), Detroit, May 1995). Martin proposes a noise estimation method based on minimum statistics and optimal smoothing of the power spectral density (PSD, cf. R. Martin, "Noise power spectral density estimation based on optimal smoothing and minimum statistics," in IEEE Trans. Speech and Audio Proc., vol. 9, no. 5, pp. 512, July 2001). Furthermore, Ealey et al. present a method for estimating non-stationary noise during speech by exploiting the harmonic structure of the voiced speech spectrum, also known as harmonic tunnelling (cf. D. Ealey et al., "Harmonic tunnelling: tracking non-stationary noises during speech," in Proc. Eurospeech Aalborg, 2001). In addition, Sohn and Sung propose that, when soft-decision information is used, the noise spectrum be adapted continuously whether or not speech is present (cf. J. Sohn and W. Sung, "A voice activity detector employing soft decision based noise spectrum adaptation," in Proc. IEEE Int. Conf. Acoustics, Speech and Sig. Proc. (ICASSP'98), vol. 1, pp. 365-368, 1998).

[0012] Ephraim and Van Trees propose another important noise suppression method based on signal subspace decomposition (cf. Y. Ephraim and H. L. Van Trees, "A signal subspace approach for speech enhancement," in IEEE Trans. Speech and Audio Proc., vol. 3, pp. 251-266, July 1995). The noisy signal is decomposed into a signal-plus-noise subspace and a noise subspace, the two subspaces being orthogonal to each other. This makes it possible to estimate the clean speech signal from the noisy signal. The resulting linear estimator is a general Wiener filter with an adjustable noise level that sets the trade-off between signal distortion and residual noise, since the two cannot be minimized simultaneously.

[0013] Skoglund and Kleijn demonstrate the importance of temporal masking properties in connection with the excitation of voiced speech (cf. J. Skoglund and W. B. Kleijn, "On Time-Frequency Masking in Voiced Speech," in IEEE Trans. Speech and Audio Proc., vol. 8, no. 4, pp. 361-369, July 2000). They show that noise between two excitation pulses is perceived more strongly than noise close to the pulses, and that this holds in particular for low-pitched speech, whose excitation pulses are sparse in time. Temporal masking is not exploited by conventional noise reduction methods that use a frequency-domain estimator. WO 2006 114100 discloses a signal subspace approach that takes temporal masking properties into account.

OBJECT AND SUMMARY OF THE INVENTION

[0014] The object of the present invention is to provide a single-channel noise suppression method based on an auditory model, with latency-reduced processing of a wideband speech signal in the presence of background noise. In particular, the present invention is based on a spectral subtraction method using a modified decision-directed approach that comprises over-subtraction and an adjustable noise floor to avoid perceptible musical tones. Furthermore, the present invention uses subband processing with pre- and post-filtering to take the temporal and simultaneous masking of human perception into account, in particular to minimize perceptible signal distortion during speech periods.

[0015] In the proposed system, frequency-domain processing is carried out by means of a non-uniform gammatone filter bank (GTF) that is divided into critical bands, often also called Bark bands. This analysis filter bank splits the noisy signal into a plurality of overlapping narrowband signals, thereby taking the spectral (simultaneous) masking properties of human hearing into account.
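
To make the terms in [0015] concrete, the following sketch shows a commonly used approximation of the Bark scale (the Zwicker and Terhardt formula) and the textbook gammatone impulse response. Both are standard forms used here as assumptions; this passage does not state which formulas, filter order or bandwidths the invention actually uses.

```python
import math


def hz_to_bark(f_hz):
    """Approximate critical-band rate in Bark (Zwicker and Terhardt formula)."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)


def gammatone_ir(fc_hz, bandwidth_hz, fs_hz, n_samples, order=4):
    """Sampled gammatone impulse response t^(n-1) * exp(-2*pi*b*t) * cos(2*pi*fc*t).

    fc_hz, bandwidth_hz and order are free parameters of this sketch;
    order=4 is a common choice in the auditory modeling literature.
    """
    ir = []
    for n in range(n_samples):
        t = n / fs_hz
        envelope = t ** (order - 1) * math.exp(-2.0 * math.pi * bandwidth_hz * t)
        ir.append(envelope * math.cos(2.0 * math.pi * fc_hz * t))
    return ir
```

Placing center frequencies at equal steps on the Bark scale yields bands that are narrow at low frequencies and wide at high frequencies, which is what makes such a filter bank non-uniform.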

[0016] A preprocessing unit that models the transfer behavior of the human outer and middle ear is applied to the discrete-time noisy input signal (i.e. the desired speech contaminated by noise and interference).

[0017] In each subband, the level of the noisy signal is detected and smoothed. These narrowband level detectors are applied to a plurality of subbands in order to exploit the phase of the simple filter sections and to obtain the shortest possible signal processing delays.
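
The level detection and smoothing in [0017] can be pictured with a simple sketch: rectify the subband signal and smooth it with a first-order recursive filter. The rectifier, the one-pole smoother and the coefficient value are assumptions for illustration; the passage does not specify the detector used by the invention.

```python
def smoothed_level(subband, coeff=0.9):
    """Rectify a subband signal and smooth it with a one-pole recursive filter.

    coeff close to 1 gives heavier smoothing; the value is illustrative.
    """
    level = 0.0
    levels = []
    for x in subband:
        # Full-wave rectification followed by exponential averaging.
        level = coeff * level + (1.0 - coeff) * abs(x)
        levels.append(level)
    return levels
```

Because the filter works sample by sample, it adds no frame-sized delay of the kind introduced by block processing.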

[0018] From the smoothed envelope of the subband signals, the noise level is estimated for each subband using a heuristic approach based on recursive minimum statistics.
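
One possible reading of a recursive minimum tracker of the kind named in [0018] is sketched below. The update rule (follow decreases immediately, rise slowly by a fixed factor) and the parameters `rise` and `eps` are hypothetical; the heuristic actually used by the invention is not given in this passage.

```python
def track_noise_floor(levels, rise=1.01, eps=1e-12):
    """Track the noise floor of a smoothed subband envelope (hypothetical heuristic).

    The floor follows the envelope down immediately and creeps upward
    by the factor `rise` per sample while the envelope stays above it.
    """
    floor = levels[0]
    floors = []
    for lv in levels:
        if lv < floor:
            floor = lv                                 # new minimum found
        else:
            # Slow upward drift, never exceeding the current envelope;
            # eps prevents the floor from sticking at exactly zero.
            floor = min(max(floor, eps) * rise, lv)
        floors.append(floor)
    return floors
```

During a speech burst the envelope rises far faster than the floor can follow, so the estimate stays close to the level observed in the preceding pause.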

[0019] The instantaneous signal-to-noise ratio (SNR) is estimated for each subband from the envelope of the noisy signal and the noise level.

[0020] The a priori SNR is estimated from the instantaneous SNR using the Ephraim-Malah spectral subtraction rule (EMSR). To minimize the influence of estimation errors, an improved decision-directed approach (DDA) is proposed that introduces an underestimation parameter and a noise floor parameter.

[0021] Temporal masking in human hearing is taken into account by appropriate filtering of the subband signals. These nonlinear auditory post-masking filters apply recursive averaging to the falling edges of the signal level detected in each subband, with the following effects: (a) the variance of impulsive noise is overestimated, (b) the noise suppression algorithm has no effect on signals below the temporal masking threshold, and (c) no additional signal delay is introduced for transient signals, which are important for speech perception.
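
The post-masking behavior described in [0021], recursive averaging applied only to falling edges, can be sketched as follows. The single `release` coefficient and its value are assumptions; the invention's filter parameters are not given in this passage.

```python
def post_masking_filter(levels, release=0.9):
    """Apply recursive averaging to falling edges of a subband level only.

    Rising edges pass through unchanged, so onsets are not delayed;
    after a drop, the output decays gradually toward the new level.
    """
    state = 0.0
    filtered = []
    for lv in levels:
        if lv >= state:
            state = lv                                       # rising edge: follow at once
        else:
            state = release * state + (1.0 - release) * lv   # falling edge: smooth
        filtered.append(state)
    return filtered
```

For example, the level sequence 0, 1, 0, 0 with `release=0.5` becomes 0, 1, 0.5, 0.25: the onset is unchanged, while the decay is stretched in the way a masker continues to mask briefly after it ends.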

[0022] A nonlinear weighting function for each subband is estimated from the a priori SNR and includes an over-subtraction of the estimated noise signal.

[0023] The noisy signal in each subband is multiplied by the corresponding weighting factor to suppress the noise components.
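
Paragraphs [0022] and [0023] can be illustrated with a parametric Wiener-type gain, which is only one of several possible weighting functions. The gain formula, the over-subtraction factor `alpha` and the gain floor `g_min` are assumptions of this sketch and are not claimed to be the weighting function of the invention.

```python
def subband_gain(xi, alpha=2.0, g_min=0.1):
    """Wiener-type gain from the a priori SNR xi with over-subtraction.

    alpha > 1 weights the noise more heavily (over-subtraction);
    g_min limits the maximum attenuation. Both values are illustrative.
    """
    return max(xi / (xi + alpha), g_min)


def apply_gains(subband_samples, gains):
    """Multiply each subband sample by its weighting factor."""
    return [g * x for g, x in zip(gains, subband_samples)]
```

The gain approaches 1 for high SNR and falls to `g_min` for low SNR, so noise-dominated subbands are attenuated while speech-dominated subbands pass almost unchanged.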

[0024] An optimized, near-perfect reconstruction filter bank uses a decision criterion for signed summation to reconstruct the enhanced full-band speech signal.
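
The signed summation in [0024] can be pictured as a sum of subband signals in which each band carries a sign of +1 or -1. The decision criterion that selects the signs is not described in this passage, so the sketch simply takes the signs as an input; the function name is hypothetical.

```python
def synthesize(subbands, signs):
    """Reconstruct a full-band signal by signed summation of subband signals.

    subbands: list of equal-length sample lists, one per subband
    signs:    +1 or -1 per subband, chosen by some external criterion
    """
    n_samples = len(subbands[0])
    return [
        sum(sign * band[i] for sign, band in zip(signs, subbands))
        for i in range(n_samples)
    ]
```

Flipping the sign of a band corresponds to a phase shift of 180 degrees, which can prevent adjacent overlapping bands from cancelling each other in the sum.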

[0025] Finally, a post-filter is applied to the enhanced full-band signal to compensate for the effect of the pre-filter.

[0026] Remarks: The noise suppression methods cited above operate in the frequency domain and use the discrete-time Fourier transform (DTFT), which is based on block processing of the discrete-time input signal. This block processing adds a signal delay that depends on the frame size.

[0027] Single-channel speech enhancement systems of the subtractive type are efficient at reducing background noise; however, they leave perceptible, annoying residual noise. To overcome this problem, the properties of the auditory system are incorporated into the enhancement process. The masking phenomenon is modeled by computing the noise masking threshold in the frequency domain, below which all components are inaudible (cf. N. Virag, "Single Channel Speech Enhancement Based on Masking Properties of the Human Auditory System," IEEE Trans. on Speech and Audio Proc., vol. 7, no. 2, pp. 126-137, March 1999).
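
The frame-size-dependent delay of block processing noted in [0026] can be made concrete with simple arithmetic: a full frame must be buffered before it can be transformed. The frame length and sample rate in the usage note are illustrative values, not figures from the patent.

```python
def block_delay_ms(frame_length, sample_rate_hz):
    """Buffering delay in milliseconds for collecting one frame of samples."""
    return 1000.0 * frame_length / sample_rate_hz
```

For instance, a frame of 512 samples at a sample rate of 16 kHz gives `block_delay_ms(512, 16000)`, i.e. 32 ms of buffering delay before any processing can start.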

[0028] Filter bank implementations are particularly attractive for modeling auditory masking in subtractive-type speech enhancement systems, because they can be adapted to the spectral and temporal resolution of the human ear. The authors propose a noise suppression method based on spectral subtraction combined with a decomposition into critical bands by gammatone filter banks (GTF). The concept of critical bands, which describes the resolution of the human auditory system, leads to a nonlinear frequency scale, the so-called Bark scale (cf. J. O. Smith III and J. S. Abel, "Bark and ERB Bilinear Transforms", IEEE Trans. on Speech and Audio Proc., vol. 7, no. 6, pp. 697-708, Nov. 1999).

[0029] The use of the gammatone filter bank surpasses the DTFT-based approaches in terms of computational complexity and overall system delay. However, the GTF approaches allow low-latency implementations and analysis-synthesis schemes with low computational complexity and nearly perfect reconstruction. The proposed synthesis filter produces the broadband output signal by a simple summation of the subband signals, introducing a criterion for whether the sign must be changed before the summation. This approach outperforms the channel-vocoder-based approaches proposed by McAulay and Malpass (cf. R. J. McAulay and M. L. Malpass, "Speech Enhancement Using a Soft-Decision Noise Suppression Filter", IEEE Trans. on Acoust., Speech and Sig. Proc., vol. ASSP-28, no. 2, pp. 137-145, April 1980). In that approach, the full-band output signal is reconstructed by summing subband signals of alternating phase without regard to the actual phase relationship between the subbands, which introduces large distortions into the output signal.

[0030] Important remark: Subband signals without downsampling, as often used in hearing aid systems, do not require a synthesis filter bank. This approach is therefore applicable to low-latency speech enhancement systems, but it is computationally highly inefficient. The method proposed by the authors allows the output signal to be computed from the subband signals by simple summation while taking the phase differences into account.

[0031] It is worth mentioning that in many applications, such as hearing aids or hands-free car kits, computational complexity and signal delay are of utmost importance.

[0032] The main advantages of the present invention over conventional noise suppression methods are the significant improvements in overall signal delay and computational efficiency.

[0033] The invention is not limited by the following embodiment, which merely serves to explain the inventive concept and to illustrate one possible application.

[0034] According to the invention, the method for low-latency single-channel noise suppression and reduction based on an auditory model operates as an independent module and is intended for installation in digital signal processing chains, in which an algorithm specified in software is implemented on a commercially available digital signal processor (DSP), in particular a DSP for audio applications.

[0035] Remarks: The amplitude of the clean speech signal is estimated with the Ephraim-Malah spectral subtraction rule (EMSR) from the given amplitude of the noisy signal and the estimated noise variance. To avoid artifacts such as musical noise, a modified decision-directed approach (DDA) and over-subtraction (underestimation) of the noise variance with a lower noise level parameter are introduced.

[0036] Compared with the cited prior art, both document by document and taken together, the solution is new and inventive. The essential difference is that the described auditory gammatone analysis filter bank, which decomposes the input signal into subbands, applies an additional phase shift to the subbands. This phase shift yields an improved reconstruction of the output signal by simple and computationally efficient summation of the downsampled subband signals in the time domain.

[0037] Furthermore, the following difference from the prior art can be identified: the described method for noise estimation using tracked threshold values is particularly efficient with regard to the memory requirements of the recursive implementation. Recursive methods are usually highly robust and stable, but these advantages are always associated with high memory requirements and long computation times. The combination of the interlocking processing stages (averaging at low signal levels, holding the estimates at high but short-lived signal levels, overestimating at persistently high signal levels) is not found in this form in the prior art. Owing to the recursive structure of the signal processing algorithms, long-term storage of signal level values can be avoided and the required computational effort minimized.
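The three interlocking stages named in [0037] (average, hold, raise) can be illustrated with a small Python sketch. It is only an illustration under assumptions: the class name, the two smoothing constants, the level ratio and the frame counter limit are hypothetical, and the patent's own estimator (Fig. 14, three time constants and a counter threshold) may be structured differently. The point it shows is that the state per band consists of one estimate and one counter, so no history of level values has to be stored.

```python
class RecursiveNoiseEstimator:
    """Noise level tracker that stores only an estimate and a counter.

    All parameter names and default values are illustrative assumptions.
    """

    def __init__(self, initial, alpha_avg=0.9, alpha_rise=0.99,
                 high_ratio=4.0, hold_frames=20):
        self.estimate = float(initial)
        self.alpha_avg = alpha_avg      # smoothing while the level is low
        self.alpha_rise = alpha_rise    # slow rise during sustained high levels
        self.high_ratio = high_ratio    # "high" means level > high_ratio * estimate
        self.hold_frames = hold_frames  # counter threshold before the estimate may rise
        self.high_count = 0

    def update(self, level):
        """Feed one level value (e.g. subband power) and return the new estimate."""
        if level <= self.high_ratio * self.estimate:
            # Low level: recursive averaging of the noise estimate.
            self.high_count = 0
            self.estimate = (self.alpha_avg * self.estimate
                             + (1.0 - self.alpha_avg) * level)
        else:
            self.high_count += 1
            if self.high_count > self.hold_frames:
                # Sustained high level: let the estimate rise slowly.
                self.estimate = (self.alpha_rise * self.estimate
                                 + (1.0 - self.alpha_rise) * level)
            # Otherwise (short burst, e.g. speech): hold the estimate unchanged.
        return self.estimate
```

A short burst of high levels leaves the estimate untouched, whereas a high level that persists beyond `hold_frames` updates is gradually absorbed into the noise estimate.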

BRIEF DESCRIPTION OF THE DRAWING FIGURES

[0038] Fig. 1 is a schematic diagram of a single-channel subband speech enhancement unit of the present invention.

[0039] Fig. 2 is a schematic diagram of the nonlinear computation of the noise suppression gain factor, which is applied to each subband.

[0040] Figs. 3 and 4 show the roof-shaped MMSE-SP attenuation surface as a function of the a posteriori SNR ($\gamma_k$) and the a priori SNR ($\xi_k$). To cover all values $0 < \gamma_k < \infty$, the x-axis refers to $\gamma_k$ and not, as in the literature, to $(\gamma_k - 1)$. The dash-dotted line in Fig. 3 marks the transition between the regions of the surface; the dashed line shows the spectral power subtraction contour. The contours of the DDA estimate are drawn over the MMSE-SP attenuation surface in Fig. 4. The dashed lines in Fig. 4 show the average of the dynamic relationship between $\gamma_k$ and $\xi_k$; the solid lines show the static relationship.

[0041] Figs. 5 and 6 illustrate the estimation behavior of the combined (modified) DDA and MMSE-SP. The dashed lines in Fig. 5 show the average of the dynamic relationship between $\gamma_k$ and $\xi_k$; the solid lines show the static relationship. Two fictitious hysteresis loops in Fig. 6 agree with the observations from informal experiments.

[0042] Fig. 7 shows a block diagram of the complete system.

[0043] Fig. 8 shows the complete system, which comprises an auditory frequency analysis and resynthesis as input and output stages, with a special low-latency, low-complexity speech enhancement in between. Combining a sophisticated noise suppression rule with a model of human hearing enables high-quality performance.

[0044] Fig. 9 shows an outer-ear and middle-ear filter composed of three second-order sections (SOS).

[0045] Fig. 10 shows an example: a three-zero gammatone filter of order 3. The common zero at z = 1 is not shown in this figure.

[0046] Fig. 11 shows a known type of level detection. When the signal power is used, the square of the amplitude is detected.

[0047] Fig. 12 shows the low-latency FIR level detector.

[0048] Fig. 13 shows a nonlinear recursive post-masking auditory filter that responds to falling edges.

[0049] Fig. 14 shows a recursive noise level estimator that uses three time constants and a counter threshold.

DETAILED DESCRIPTION

[0050] This description presents new aspects of the Ephraim-Malah noise suppression rule (EMSR) and of the decision-directed approach (DDA) to a priori SNR estimation. After partitioning the domain of the amplitude estimator, it becomes clear that the combined DDA estimate follows an unconfigured hysteresis cycle. Introducing a hysteresis width parameter improves the shape of the hysteresis and reduces musical noise. Finally, we obtain a more flexible noise suppressor that depends less on the sampling rate of the system.

I. INTRODUCTION

[0051] The Ephraim-Malah amplitude estimator and the decision-directed Ephraim-Malah a priori SNR estimation (Y. Ephraim and D. Malah, "Speech Enhancement Using a Minimum Mean-Square Error Short-Time Spectral Amplitude Estimator", IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-32, no. 6, pp. 1109-1121, Dec. 1984, and Y. Ephraim and D. Malah, "Speech Enhancement Using a Minimum Mean-Square Error Log-Spectral Amplitude Estimator", IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-33, no. 2, pp. 443-445, Apr. 1985) are powerful tools for noise suppression in speech signal processing. There is a considerable body of recently published work on both topics: on the one hand, the combined algorithm is a powerful tool (O. Cappe, "Elimination of the Musical Noise Phenomenon with the Ephraim and Malah Noise Suppressor", IEEE Transactions on Speech and Audio Processing, vol. 2, no. 2, pp. 345-349, Apr. 1994); on the other hand, simplifications (P. J. Wolfe and S. J. Godsill, "Simple Alternatives to the Ephraim and Malah Suppression Rule for Speech Enhancement", Proc. 11th IEEE Signal Processing Workshop, pp. 496-499, 6-8 Aug. 2001) and further developments (I. Cohen and B. Berdugo, "Speech Enhancement for non-stationary noise environments", Signal Processing, no. 11, pp. 2403-2418, Elsevier, Nov. 2001; I. Cohen, "Speech Enhancement Using a Noncausal A Priori SNR Estimator", IEEE Signal Processing Letters, no. 9, pp. 725-728, Sep. 2004; I. Cohen, "Relaxed Statistical Model for Speech Enhancement and A Priori SNR Estimation", Center for Communication and Information Technologies, Israel Institute of Technology, Oct. 2003, CCIT Report no. 443; M. K. Hasan, S. Salahuddin, M. R. Khan, "A Modified A Priori SNR for Speech Enhancement Using Spectral Subtraction Rules", IEEE Signal Processing Letters, vol. 11, no. 4, pp. 450-453, April 2004) are desirable.

[0052] The amplitude estimation part of the algorithm uses a signal model in which a noisy signal y[n] consists of speech x[n] and additive noise d[n] at time index n. The signals x[n] and d[n] are assumed to be statistically independent Gaussian random variables. Owing to certain properties of the Fourier transform, the same statistical model can be assumed for the corresponding short-time spectral amplitudes $\underline{X}_k[m]$ and $\underline{D}_k[m]$ in each frequency bin k at analysis time m. (Underlined variables denote complex-valued quantities here; $\underline{X}_k[m]$ is therefore a complex variable in our notation. To simplify the notation, $X_k[m]$ denotes the magnitude $|\underline{X}_k[m]|$.) Given the speech and noise variances $\sigma^2_{x,k}$ and $\sigma^2_{d,k}$, the speech amplitude $X_k[m]$ can be estimated from the noisy speech $Y_k[m]$. A suitable estimator $\hat{X}_k[m]$ for the clean speech amplitude is described in Section I-A.

[0053] The unknown variances of the clean speech $\sigma^2_{x,k}$ are determined implicitly in the a priori SNR estimation part of the algorithm, whereas the noise variance $\sigma^2_{d,k}$ has to be determined beforehand, e.g. by using minimum statistics (P. J. Wolfe and S. J. Godsill, "Simple Alternatives to the Ephraim and Malah Suppression Rule for Speech Enhancement", Proc. 11th IEEE Signal Processing Workshop, pp. 496-499, 6-8 Aug. 2001), MCRA (I. Cohen and B. Berdugo, "Speech Enhancement for non-stationary noise environments", Signal Processing, no. 11, pp. 2403-2418, Elsevier, Nov. 2001) or harmonic tunneling (D. Ealey, H. Kelleher, D. Pearce, "Harmonic Tunneling: Tracking Non-Stationary Noises During Speech", Proc. Eurospeech, 2001).

[0054] The decision-directed estimation described in Section I-B determines the a priori SNR $\xi_k = \sigma^2_{x,k} / \sigma^2_{d,k}$ in each frequency bin k. In addition, the noise suppressor uses an instantaneous estimate, the so-called a posteriori SNR, which relates the squared magnitude of the current noisy signal to the noise variance: $\gamma_k[m] = Y_k^2[m] / \sigma^2_{d,k}$.

[0055] Section II gives an overview of the combined estimation and presents the shape of the hysteresis. Section III then shows how a small modification can reduce undesired estimation behavior and allows a smoother hysteresis.

A. THE EPHRAIM-MALAH SUPPRESSION RULE (EMSR)

[0056] As described above, the EMSR reconstructs the magnitude of the clean speech signal $X_k[m]$ from the noisy observation $Y_k[m]$. Because the magnitudes at different times m are assumed to be statistically independent, the time index m can be omitted to simplify the notation.

[0057] The MMSE-SA estimator of Ephraim and Malah (Y. Ephraim and D. Malah, "Speech Enhancement Using a Minimum Mean-Square Error Short-Time Spectral Amplitude Estimator", IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-32, no. 6, pp. 1109-1121, Dec. 1984) solves the Bayesian formula $\hat{X}_k = E\{X_k \mid Y_k\}$ to estimate the amplitude of the clean speech $X_k$. If different distortion measures are applied to the amplitude, other estimators are derived in a similar way, e.g. the MMSE-LSA (Y. Ephraim and D. Malah, "Speech Enhancement Using a Minimum Mean-Square Error Log-Spectral Amplitude Estimator", IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-33, no. 2, pp. 443-445, Apr. 1985), $\hat{X}_k = e^{E\{\ln X_k \mid Y_k\}}$, and the MMSE-SP of Wolfe and Godsill (P. J. Wolfe and S. J. Godsill, "Simple Alternatives to the Ephraim and Malah Suppression Rule for Speech Enhancement", Proc. 11th IEEE Signal Processing Workshop, pp. 496-499, 6-8 Aug. 2001), $\hat{X}_k = \sqrt{E\{X_k^2 \mid Y_k\}}$. For a more detailed description, see Cohen (I. Cohen, "Relaxed Statistical Model for Speech Enhancement and A Priori SNR Estimation", Center for Communication and Information Technologies, Israel Institute of Technology, Oct. 2003, CCIT Report no. 443).

[0058] According to Ephraim and Malah, the noisy phase is an optimal estimate of the clean phase. The reconstruction operator is therefore a real-valued spectral weight G[m]:

$$\hat{X}_k[m] = G[m]\, Y_k[m], \tag{1}$$

$$\underline{\hat{X}}_k[m] = G[m] \cdot \underline{Y}_k[m]. \tag{2}$$

[0059] Because of its simplicity, we have chosen the MMSE-SP of Wolfe and Godsill as the basis for our considerations. The corresponding gain rule can be stated as

$$G_{\mathrm{MMSE\text{-}SP}}[m] = \sqrt{G_W \left( \frac{1}{\gamma_k} + G_W \right)}, \tag{3}$$

using the Wiener filter equation

$$G_W = \frac{\xi_k}{1 + \xi_k}. \tag{4}$$

[0060] To simplify its application, we decompose the reconstruction operator into several regions:

• $(\gamma_k - 1) \ll 1/\xi_k$: $G_{\mathrm{MMSE\text{-}SP}} \approx \sqrt{G_W / \gamma_k}$
• $(\gamma_k - 1) \gg 1/\xi_k$: $G_{\mathrm{MMSE\text{-}SP}} \approx G_W$
• $(\gamma_k - 1) = 1/\xi_k$: $G_{\mathrm{MMSE\text{-}SP}} = \sqrt{G_W \cdot 2/\gamma_k}$

[0061] In addition, we can approximate the Wiener filter by

• $\xi_k \ll 1$: $G_W \approx \xi_k$
• $\xi_k \gg 1$: $G_W \approx 1$

[0062] Combining both, we can decompose the MMSE-SP surface on a logarithmic scale into flat parts (see also Fig. 3):

1) $(\gamma_k - 1) \ll 1/\xi_k$, $\xi_k \ll 1 \Rightarrow G_{\mathrm{MMSE\text{-}SP}} \approx \sqrt{\xi_k / \gamma_k}$
2) $(\gamma_k - 1) \ll 1/\xi_k$, $\xi_k \gg 1 \Rightarrow G_{\mathrm{MMSE\text{-}SP}} \approx \sqrt{1 / \gamma_k}$
3) $(\gamma_k - 1) \gg 1/\xi_k$, $\xi_k \ll 1 \Rightarrow G_{\mathrm{MMSE\text{-}SP}} \approx \xi_k$
4) $(\gamma_k - 1) \gg 1/\xi_k$, $\xi_k \gg 1 \Rightarrow G_{\mathrm{MMSE\text{-}SP}} \approx 1$

[0063] In the following sections we use the short form G when referring to $G_{\mathrm{MMSE\text{-}SP}}$.

B. THE DECISION-DIRECTED APPROACH (DDA)

[0064] The DDA combines two simple SNR estimators into a new estimator for the a priori SNR $\xi_k$.

[0065] The first estimator is the instantaneous SNR $(\gamma_k - 1)$. Restricting it to non-negative SNR values yields

$$\mathrm{SNR}_{\mathrm{inst}} = \max(\gamma_k - 1,\, 0), \tag{5}$$

[0066] which can be computed before noise suppression. This instantaneous SNR differs from the true SNR in the following cases:

[0067] • if the analysis time window is too short with respect to the stationarity of the signals x[n] and d[n],

[0068] • if a nonstationary noise cannot be identified in detail, or

[0069] • if noise and speech signal are strongly correlated.

[0070] The second estimator describes the reconstructed SNR, which is computed after noise suppression as

$$\mathrm{SNR}_{\mathrm{rec}} = \frac{\hat{X}_k^2}{\sigma^2_{d,k}} = \gamma_k \cdot G^2. \tag{6}$$

[0071] At poor SNR, e.g. $0 < \gamma_k < 2$, the a posteriori SNR $\gamma_k$ exhibits relative variations over time that are smaller than those of $(\gamma_k - 1)$. (Relative variations, e.g. $10 \log(\gamma_k[m]) - 10 \log(\gamma_k[m-1])$, are more significant than linear variations with regard to human hearing.) Ideally, the gain G provides a consistently high attenuation at poor SNR. The reconstructed $\mathrm{SNR}_{\mathrm{rec}}$ therefore yields more stable values than $\mathrm{SNR}_{\mathrm{inst}}$ at poor SNR.

[0072] Finally, the DDA combines $\mathrm{SNR}_{\mathrm{inst}}$ and $\mathrm{SNR}_{\mathrm{rec}}$ to estimate the a priori SNR:

$$\xi_k[m] = (1 - \alpha) \cdot \mathrm{SNR}_{\mathrm{inst}}[m] + \alpha \cdot \mathrm{SNR}_{\mathrm{rec}}[m-1]. \tag{7}$$

[0073] The specific properties of the estimator can be observed by inserting the suppression gain into the DDA.
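A minimal Python sketch of equations (5) to (7) may help to fix the order of operations. The function names are illustrative and do not come from the patent, and the gain of the previous frame is passed in as a plain number because the suppression rule itself is defined elsewhere in the description.

```python
def snr_inst(gamma):
    """Instantaneous SNR, eq. (5): a posteriori SNR minus one, floored at zero."""
    return max(gamma - 1.0, 0.0)


def snr_rec(gamma, gain):
    """Reconstructed SNR, eq. (6): a posteriori SNR times the squared gain."""
    return gamma * gain * gain


def dda_step(gamma_now, gamma_prev, gain_prev, alpha):
    """Decision-directed a priori SNR estimate, eq. (7).

    gamma_now  -- a posteriori SNR of the current frame m
    gamma_prev -- a posteriori SNR of the previous frame m-1
    gain_prev  -- suppression gain G applied in frame m-1
    alpha      -- averaging parameter, 0 <= alpha < 1
    """
    return (1.0 - alpha) * snr_inst(gamma_now) + alpha * snr_rec(gamma_prev, gain_prev)


# One frame of one sub-band, with made-up example values.
xi = dda_step(gamma_now=3.0, gamma_prev=2.0, gain_prev=0.5, alpha=0.9)
```

In a full suppressor this step would run once per frame and per sub-band, and the resulting `xi` would feed the gain rule for the current frame.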

[0074] II. Combination of DDA and EMSR

[0075] Inserting the parts of the Wolfe and Godsill reconstruction operator $G_{\mathrm{MMSE\text{-}SP}}$ from Section I-A into the DDA equation (7) of Ephraim and Malah yields the following regimes for the combined a priori SNR estimate:

1) $(\gamma_k - 1) \ll 1/\xi_k$, $\xi_k \ll 1$, $G \approx \sqrt{\xi_k/\gamma_k}$:

$$\xi_k[m] \approx (1-\alpha)\cdot\max(\gamma_k[m]-1,\ 0) + \alpha\cdot\xi_k[m-1]. \tag{8}$$

2) $(\gamma_k - 1) \ll 1/\xi_k$, $\xi_k \gg 1$, $G \approx \sqrt{1/\gamma_k}$:

$$\xi_k[m] \approx (1-\alpha)\cdot\max(\gamma_k[m]-1,\ 0) + \alpha \approx \alpha. \tag{9}$$

3) $(\gamma_k - 1) \gg 1/\xi_k$, $\xi_k \ll 1$, $G \approx \xi_k$:

$$\xi_k[m] \approx (1-\alpha)\cdot\max(\gamma_k[m]-1,\ 0) + \alpha\cdot\xi_k^2[m-1]\cdot\gamma_k[m-1] \approx (1-\alpha)\cdot(\gamma_k[m]-1). \tag{10}$$

4) $(\gamma_k - 1) \gg 1/\xi_k$, $\xi_k \gg 1$, $G \approx 1$:

$$\xi_k[m] \approx (1-\alpha)\cdot\max(\gamma_k[m]-1,\ 0) + \alpha\cdot\gamma_k[m-1] \approx \alpha\cdot\gamma_k[m-1]. \tag{11}$$

5) $(\gamma_k - 1) = 1/\xi_k$, $\xi_k \ll 1 \Rightarrow G = \sqrt{2\,\xi_k/\gamma_k}$:

$$\xi_k[m] \approx (1-\alpha)\cdot(\gamma_k[m]-1) + 2\alpha\cdot\xi_k[m-1]. \tag{12}$$
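The gain approximations used in the five parts can be checked numerically. The sketch below assumes the closed form of the Wolfe and Godsill MMSE spectral power gain, $G^2 = \frac{\xi}{1+\xi}\cdot\frac{1+\nu}{\gamma}$ with $\nu = \frac{\xi}{1+\xi}\gamma$; this form is quoted from memory rather than from this section, and the test points are arbitrary values chosen to lie deep inside each regime.

```python
import math


def mmse_sp_gain(xi, gamma):
    """Assumed MMSE spectral power gain (Wolfe and Godsill form)."""
    nu = xi * gamma / (1.0 + xi)
    return math.sqrt(xi / (1.0 + xi) * (1.0 + nu) / gamma)


# part -> (xi, gamma, approximate gain claimed for that part)
cases = {
    1: (1e-3, 2.0, math.sqrt(1e-3 / 2.0)),             # G ~ sqrt(xi / gamma)
    2: (1e2, 1e-4, math.sqrt(1.0 / 1e-4)),             # G ~ sqrt(1 / gamma)
    3: (1e-2, 1e5, 1e-2),                              # G ~ xi
    4: (1e3, 1e3, 1.0),                                # G ~ 1
    5: (1e-3, 1001.0, math.sqrt(2.0 * 1e-3 / 1001.0)),  # G = sqrt(2 xi / gamma)
}

# Relative deviation between the assumed exact gain and each approximation.
rel_err = {
    part: abs(mmse_sp_gain(xi, gamma) / approx - 1.0)
    for part, (xi, gamma, approx) in cases.items()
}
```

Under these assumptions every entry of `rel_err` stays at the percent level or below, which is consistent with treating the five parts as separate operating regimes of one and the same gain surface.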

[0076] The behavior of the combined approach is shown in Fig. 4. Given the amplitude of the speech signal and a constant noise level, i.e. a time-varying a posteriori SNR $\gamma_k$ as the input sequence, one can picture the estimate tracing a kind of hysteresis loop on the MMSE-SP surface. Apart from obvious discontinuities in this loop, further properties become apparent (O. Cappe, "Elimination of the Musical Noise Phenomenon with the Ephraim and Malah Noise Suppressor", IEEE Transactions on Speech and Audio Processing, vol. 2, no. 2, pp. 345-349, Apr. 1994).

A. RECURSIVE AVERAGING

[0077] 1) EXPECTATIONS OF RECURSIVE AVERAGING: The list above shows that the a priori SNR estimate in part 1 corresponds to the recursive average (8) of the instantaneous $\mathrm{SNR}_{\mathrm{inst}}$ (5). The averaging process can be generalized by introducing a time constant $T_{\mathrm{avg}}$ that determines the averaging parameter $\alpha = \exp[-1/(T_{\mathrm{avg}}\, f_s)]$. Here the sampling rate $f_s = 1/T$ denotes the number of time-frequency transforms per second.
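The relation between the time constant and the averaging parameter can be sketched in a few lines of Python. The function name and the example frame rate are assumptions made for illustration only.

```python
import math


def averaging_parameter(t_avg, frame_rate):
    """alpha = exp(-1 / (T_avg * f_s)); t_avg in seconds, frame_rate in frames per second."""
    return math.exp(-1.0 / (t_avg * frame_rate))


# Hypothetical setting: 100 ms time constant at 500 transforms per second.
t_avg = 0.1
frame_rate = 500.0
alpha = averaging_parameter(t_avg, frame_rate)

# After T_avg seconds (t_avg * frame_rate frames) an initial value has
# decayed to 1/e of itself, whatever the frame rate is.
decay_after_t_avg = alpha ** (t_avg * frame_rate)
```

Defining α this way keeps the averaging behavior of (8) tied to a physical time rather than to a number of frames.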

[0078] 2) THE CONSTANT-ξ EFFECT: If the a priori SNR $\xi_k$ takes a constant value in part 1, e.g. for large time constants $T_{\mathrm{avg}}$ or at the edges of the $\xi_k$ value range, the estimator may behave oddly. For small and constant $\xi_k$ the system holds the output at a constant level. This happens when the input is small enough, $(Y_k^2[m]/\sigma_{d,k}^2 - 1) \ll 1/\xi_k \Rightarrow Y_k^2[m] \ll \sigma_{d,k}^2/\xi_k$ (using (8) and its prerequisites), because $G \approx \sqrt{\xi_k/\gamma_k}$ then gives

$$(G\,Y_k[m])^2 \approx \frac{\xi_k}{\gamma_k[m]}\,Y_k^2[m] = \xi_k\,\sigma_{d,k}^2 = \mathrm{const}. \tag{13}$$

[0079] Under certain circumstances this can lead to annoying additional broadband noise, which can be worse than the constant output caused by limiting $\xi_k$ to a minimum ζ for $Y_k^2[m] < \sigma_{d,k}^2/\zeta$.

[0080] 3) UNSTABLE RECURSIVE AVERAGING: According to (12), part 5 can turn the a priori SNR estimate into an unstable recursive average of $\mathrm{SNR}_{\mathrm{inst}}$ if α > 1/2, because the feedback coefficient 2α then exceeds one; $\xi_k$ can, for example, rise abruptly in this part.
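The stability limit α = 1/2 can be illustrated by iterating the recursion of equation (12) with a constant instantaneous SNR. This is only a sketch: the function name and the numbers are made up, and in the real estimator the approximation (12) holds only while the operating point stays in part 5, so the growth does not continue without bound.

```python
def iterate_part5(alpha, snr_inst, xi_start, steps):
    """Iterate xi[m] = (1 - alpha) * snr_inst + 2 * alpha * xi[m-1], eq. (12)."""
    xi = xi_start
    for _ in range(steps):
        xi = (1.0 - alpha) * snr_inst + 2.0 * alpha * xi
    return xi


# Feedback coefficient 2 * 0.4 = 0.8 < 1: converges to (1 - alpha) / (1 - 2 * alpha) * snr_inst.
stable = iterate_part5(alpha=0.4, snr_inst=1.0, xi_start=0.1, steps=200)

# Feedback coefficient 2 * 0.6 = 1.2 > 1: the estimate grows geometrically.
unstable = iterate_part5(alpha=0.6, snr_inst=1.0, xi_start=0.1, steps=50)
```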

B. PARTS WITHOUT RECURSIVE AVERAGING

[0081] In parts 2, 3 and 4 the recursive-averaging interpretation does not apply. In (9) the a priori SNR estimate $\xi_k$ takes a constant value, and in (11) $\xi_k$ is determined by a simple delay. It seems odd that the SNR $\xi_k$ in (10) is a scaled-down version of $\mathrm{SNR}_{\mathrm{inst}}$.

C. SUMMARY OF PROPERTIES

[0082] In fact, every part except 1 and 4 (Eqs. (8) and (11)) shows unexpected behavior. Defining α by a time constant yields generalized averaging properties for (8), whereas the estimates defined by Eqs. (9)-(12) introduce sample-rate-dependent behavior. This sample-rate dependence rules out a single parameter set that suits different analysis time steps and transform sizes.

[0083] Unfavorable estimation behavior, e.g. the "constant-ξ effect", and the discontinuities in the hysteresis loop (Fig. 4) motivate a modification of the DDA and a re-examination of the time constants and the minimum a priori SNR values.

III. A MODIFIED, FAST-RESPONDING DDA

[0084] To minimize the influence of the unexpected estimator behavior, the decision-directed approach is modified:

$$\xi_k[m] = (1-\alpha)\cdot\left(p\cdot\mathrm{SNR}_{\mathrm{inst}}[m] + \zeta\right) + \alpha\cdot\mathrm{SNR}_{\mathrm{rec}}[m-1], \tag{14}$$

[0085] with ζ as the noise floor parameter (O. Cappe, "Elimination of the Musical Noise Phenomenon with the Ephraim and Malah Noise Suppressor", IEEE Transactions on Speech and Audio Processing, vol. 2, no. 2, pp. 345-349, Apr. 1994) and p as the underestimation parameter of the instantaneous SNR. As for the parts in Section II, one finds:

1) $(\gamma_k - 1) \ll 1/\xi_k$, $\xi_k \ll 1$, $G \approx \sqrt{\xi_k/\gamma_k}$:

$$\xi_k[m] \approx p\,(1-\alpha)\cdot\max(\gamma_k[m]-1,\ 0) + \alpha\cdot\xi_k[m-1]. \tag{15}$$

2) $(\gamma_k - 1) \ll 1/\xi_k$, $\xi_k \gg 1$, $G \approx \sqrt{1/\gamma_k}$:

$$\xi_k[m] \approx \alpha. \tag{16}$$

3) $(\gamma_k - 1) \gg 1/\xi_k$, $\xi_k \ll 1$, $G \approx \xi_k$:

$$\xi_k[m] \approx p\,(1-\alpha)\cdot(\gamma_k[m]-1). \tag{17}$$

4) $(\gamma_k - 1) \gg 1/\xi_k$, $\xi_k \gg 1$, $G \approx 1$:

$$\xi_k[m] \approx \alpha\cdot\gamma_k[m-1]. \tag{18}$$

[0086] The scheme of the overall estimator, with the parts of the new estimator, is shown in Fig. 5. Instead of time constants in the quasi-stationary range of speech, $T_{\mathrm{avg}} = 2$ ms is now used. With $p = 10^{-15/10}$, the scaling factor in (17), $p\,(1-\alpha) \approx p$, removes the discontinuities in the estimation hysteresis. The noise floor ζ = 10'2T° [sic] can be chosen so small that the maximum attenuation ζ lies at the lower end of the dynamic range of the frequency interval. These measures largely reduce the sample-rate dependence described in Section II-C and the "constant-ξ effect" from Section II-A.2.
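The modified estimator of equation (14) differs from the original DDA only in the scaled and offset instantaneous term, which the following Python sketch makes explicit. The function name is illustrative; the value of ζ and the frame rate are placeholders chosen for the example, not values taken from the patent, while $T_{\mathrm{avg}} = 2$ ms and $p = 10^{-15/10}$ follow the text.

```python
import math


def modified_dda_step(gamma_now, gamma_prev, gain_prev, alpha, p, zeta):
    """Modified decision-directed estimate, eq. (14).

    p    -- underestimation factor applied to the instantaneous SNR
    zeta -- noise floor added to the instantaneous term
    """
    snr_inst = max(gamma_now - 1.0, 0.0)       # eq. (5)
    snr_rec = gamma_prev * gain_prev ** 2      # eq. (6)
    return (1.0 - alpha) * (p * snr_inst + zeta) + alpha * snr_rec


t_avg = 0.002                # 2 ms, as stated in the text
frame_rate = 100.0           # assumed number of transforms per second
alpha = math.exp(-1.0 / (t_avg * frame_rate))
p = 10.0 ** (-15.0 / 10.0)   # as stated in the text
zeta = 1e-3                  # placeholder noise floor, not from the patent

xi = modified_dda_step(gamma_now=3.0, gamma_prev=2.0, gain_prev=0.5,
                       alpha=alpha, p=p, zeta=zeta)
```

With such a short time constant α is close to zero at this assumed frame rate, so $p\,(1-\alpha)$ is close to p, in line with the remark on (17). Setting `p = 1` and `zeta = 0` recovers the unmodified estimator (7).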

[0087] It becomes clear that rising instantaneous SNRs are attenuated more strongly in Fig. 5 than in Fig. 4. Strong attenuation can therefore be provided for musical tones, i.e. inconsistently high instantaneous SNR, while a signal with consistently high SNR can pass through the noise suppressor. The two curled loops in Fig. 6 give an example of an approximated hysteresis loop during system operation.

[0088] The parameter p directly controls the width of the suppression hysteresis and the suppression of musical noise. Our modifications allow the averaging time constant and the noise suppression to be controlled separately.

IV. CONCLUSION

[0089] We have found a comprehensible way to describe graphically the properties of the Wolfe and Godsill spectral amplitude estimator and of the Ephraim and Malah decision-directed a priori SNR estimate. The same description can be applied to other amplitude estimation rules and offers new insight into the Ephraim and Malah noise suppressor.

[0090] Until now, the suppression of musical noise has been a trade-off between musical-noise suppression and transient distortion. Small modifications of the decision-directed estimation rule allow musical-noise suppression to be handled more flexibly, while reducing the dependence on the analysis time step and the "constant-ξ effect". An informal listening test with the modified algorithm and adjustable analysis time/frequency resolution (filter bank approach) already showed useful improvements in the overall system.

[0091] Our future work will apply our descriptive methods to the more sophisticated estimation approaches of Cohen (I. Cohen, "Speech Enhancement Using a Noncausal A Priori SNR Estimator", IEEE Signal Processing Letters, no. 9, pp. 725-728, Sep. 2004) or Hasan (M. K. Hasan, S. Salahuddin, M. R. Khan, "A Modified A Priori SNR for Speech Enhancement Using Spectral Subtraction Rules", IEEE Signal Processing Letters, vol. 11, no. 4, pp. 450-453, April 2004).

APPARATUS FOR LATENCY-REDUCED SINGLE-CHANNEL SPEECH ENHANCEMENT

[0092] A preferred embodiment is described below; the invention is, however, not limited to this embodiment.

[0093] Reducing musical noise remains a key issue in noise suppression algorithms. Although the Ephraim-Malah suppression rule (EMSR) and the decision-directed approach (DDA) perform well, additional measures have to be applied. Moreover, the processing delay introduced by the signal analysis (fast Fourier transform, FFT) is a problem for real-time applications. Decisive improvements on both points can be achieved by implementing the signal analysis and the filtering with models of human auditory perception and with reduced latency.

V. INTRODUCTION

[0094] The main part of this description is devoted to the conditioning and analysis of the audio signal using efficient algorithms with short delays. Our system combines an auditory gammatone filter bank (R. F. Lyon, "The All-Pole Gammatone Filter and Auditory Models", Proc. Forum Acusticum, Antwerpen 1996; L. Lin, E. Ambikairajah, W. H. Holmes, "Auditory Filterbank Design Using Masking Curves", Proc. EUROSPEECH Scandinavia, 7th European Conference on Speech Communication and Technology, 2001; L. Lin, E. Ambikairajah, W. H. Holmes, "Perceptual Domain Based Speech and Audio Coder", Proc. of the third International Symposion DSPCS 2002, Sydney, Jan. 28-31, 2002) with the Ephraim-Malah noise suppression rule (Y. Ephraim and D. Malah, "Speech Enhancement Using a Minimum Mean-Square Error Short-Time Spectral Amplitude Estimator", IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-32, no. 6, pp. 1109-1121, Dec. 1984; Y. Ephraim and D. Malah, "Speech Enhancement Using a Minimum Mean-Square Error Log-Spectral Amplitude Estimator", IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-33, no. 2, pp. 443-445, Apr. 1985; P. J. Wolfe and S. J. Godsill, "Simple Alternatives to the Ephraim and Malah Suppression Rule for Speech Enhancement", Proc. 11th IEEE Signal Processing Workshop, pp. 496-499, 6-8 Aug. 2001). This combination was recently presented by the authors, whereas the combination of an auditory gammatone filter bank with a Wiener noise suppressor is known from L. Lin, E. Ambikairajah, "Speech Denoising Based on an Auditory Filterbank", 6th ICSP, International Conference on Signal Processing, (552-555), 26-30 Aug. 2002, and a frequency-domain solution is known from WO 00/30264 (International application No. PCT/SG99/00119). Furthermore, integrating a time-domain outer and middle ear filter and a nonlinear temporal post-masking filter (G. Stall, J. G. Beerends, R. Bitto, K. Brandenburg, C. Colomes, B. Feiten, M. Keyhl, C. Schmidmer, T. Sporer, T. Thiede, W. C. Treurniet, "PEAQ - der neue ITU-Standard zur objektiven Messung der wahrgenommenen Audioqualität", RTM - Rundfunktechnische Mitteilungen, die Fachzeitschrift für Hörfunk und Fernsehtechnik, 43. Jahrgang, ISSN 0035-9890 (81-120), Firma Mensing GmbH + Co. KG, Abteilung Verlag, Sept. 1999; L. Lin, E. Ambikairajah, W. H. Holmes, "Perceptual Domain Based Speech and Audio Coder", Proc. of the third International Symposion DSPCS 2002, Sydney, Jan. 28-31, 2002) into a noise suppression system is new. In addition, a narrow-band level detector with short latency, which exploits the phase of a simple first-order filter, is presented for the first time. Finally, we present a simple scheme for signal reconstruction that avoids signal cancellation at the band edges.

[0095] · Combination of an auditory gammatone filter bank and an EMSR noise suppressor in a time-domain approach

[0096] · Integration of an outer and middle ear filter into the suppression system in a time-domain approach

[0097] · Integration of a post-masking auditory filter

[0098] · Narrow-band level detector with short latency

[0099] · Low-cost signal reconstruction according to Wolfe and Godsill

[00100] · Upsampling with short latency

[00101] · Short-latency reconstruction in spite of obstructive destructive interference

[00102] The publications "Speech denoising based on an auditory filterbank" by Lin et al., 2002, "Nonlinear Adaptive Speech Enhancement Inspired by Early Auditory Processing" by Hussain et al., 2005, and WO 0205262 A2 each describe a method for noise suppression of audio signals in the time domain. The audio input signal is split into a plurality of frequency sub-bands, noise suppression is carried out in each of them, and the filtered sub-bands are then recombined into an output signal. Because they do not use a synthesis filter bank, such methods for reconstructing the output signal from frequency sub-bands do not permit subsampling of the sub-band signals, which leads to a comparatively high demand for computing power. These methods therefore have a significant drawback, particularly in modern telecommunication and mobile radio systems.

VI. SYSTEM OVERVIEW

[00103] The overall system is shown as a block diagram in Fig. 7 and can be implemented as an analog or digital effects processor or as part of a software algorithm. The overall system contains several subsystems (Fig. 8):

[00104] · an outer and middle ear filter ($H_{\mathrm{OME}}$),

[00105] · a gammatone filter bank analysis section (GFB),

[00106] · the level detector with short latency (LD),

[00107] · the auditory post-masking filter (PM),

[00108] · a recursive noise spectrum estimator (NE),

[00109] · the spectral subtraction weight (EMSR),

[00110] · upsampling with short latency (L↑),

[00111] · the vocoder state, and

[00112] · the inverse outer and middle ear filter ($H_{\mathrm{OME}}^{-1}$).

VII. OUTER AND MIDDLE EAR FILTER

[00113] An outer and middle ear filter comprises three second-order sections (SOS) that represent the physiological part of the human ear (E. Zwicker, H. Fastl, "Psychoacoustics, facts and models", Springer, Berlin Heidelberg, 1999; E. Terhardt, "Akustische Kommunikation", Springer, Berlin Heidelberg, 1998):

[00114] 1) the high-pass attenuation curve below 1 kHz, which models the 100-phon curve and represents the acoustic impedance of the outer ear and the mechanical impedance of the ossicles in the middle ear,

[00115] 2) the resonance of the ear canal, and

[00116] 3) the low-pass attenuation curve above 1 kHz, which models the threshold of hearing.

[00117] The last two filters are optional, whereas the high-pass component is mandatory; it reduces the influence of low-frequency noise on the noise suppressor.

[00118] A filter structure with an adequate magnitude response could look like the one in Fig. 9. All three filter sections must be second-order sections to ensure suitable slopes. The outer filter edges can be modeled as second-order low-pass and high-pass shelving filters, while the resonance can be modeled as a parametric peak (bell) filter (P. Dutilleux, U. Zölzer, "DAFX", Wiley & Sons, 2002).

[00119] Inverting the filter is straightforward. If, however, zeros are to lie at e.g. z = 1 in the z-domain, the inverse filter cannot realize them, because it would need a pole on the unit circle. A value of z = 0.99 may be a suitable starting point for inverting a zero at z = 1.
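The inversion can be sketched with a single first-order section whose zero has been moved from z = 1 to z = 0.99, so that the inverse filter has a stable pole. The coefficients and the helper function are illustrative assumptions, not the filter of Fig. 9; the inverse is obtained simply by swapping numerator and denominator.

```python
def iir_filter(b, a, x):
    """Direct-form IIR filter: a[0]*y[n] = sum(b[i]*x[n-i]) - sum(a[j]*y[n-j], j >= 1)."""
    y = []
    for n in range(len(x)):
        acc = sum(b[i] * x[n - i] for i in range(len(b)) if n - i >= 0)
        acc -= sum(a[j] * y[n - j] for j in range(1, len(a)) if n - j >= 0)
        y.append(acc / a[0])
    return y


# Hypothetical high-pass-like section: zero at z = 0.99 (instead of 1), pole at z = 0.5.
b = [1.0, -0.99]
a = [1.0, -0.5]

x = [1.0, 0.5, -0.25, 0.0, 2.0, -1.0, 0.0, 0.0]
y = iir_filter(b, a, x)        # forward filter
x_rec = iir_filter(a, b, y)    # inverse filter: numerator and denominator swapped

max_error = max(abs(u - v) for u, v in zip(x, x_rec))
```

With the zero exactly at z = 1 the swapped filter would have a pole on the unit circle and would integrate any offset without decay, which is why a slightly smaller radius is used here.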

VIII. CRITICAL BANDS / AUDITORY BANDWIDTHS

[00120] Critical-band grouping (frequency grouping) is an important effect in the human perception of loudness. The perceived loudness is composed of specific loudnesses for different frequency ranges. An auditory frequency scale can be used to model critical-band effects; its units can be regarded as the frequency resolution of human loudness perception (E. Zwicker, H. Fastl, "Psychoacoustics, facts and models", Springer, Berlin Heidelberg, 1999). We denote an arbitrary auditory frequency transformation by $\mathfrak{B}\{\cdot\}$ and the corresponding inverse frequency transformation by $\mathfrak{B}^{-1}\{\cdot\}$. A reasonable frequency scale uses a small number of critical bands according to Traunmüller's formula (E. Terhardt, "Akustische Kommunikation", Springer, Berlin Heidelberg, 1998):

$$\nu\,[\mathrm{Bark}] = \mathfrak{B}\{f\,[\mathrm{Hz}]\} = \frac{26.81\,f}{1960 + f} - 0.53. \tag{19}$$

Accordingly, the inverse transformation $\mathfrak{B}^{-1}\{\cdot\}$ is

$$f\,[\mathrm{Hz}] = \mathfrak{B}^{-1}\{\nu\,[\mathrm{Bark}]\} = 1960\cdot\frac{\nu + 0.53}{26.28 - \nu}. \tag{20}$$

[00121] The center frequencies $f_k$ of the auditory filter bank can be computed by applying the inverse transformation $f_k = \mathfrak{B}^{-1}\{\nu_k\}$ to an equidistant scale $\nu_k$ (with spacing $d\nu$, e.g. $d\nu = 1$ Bark) in the Bark domain. Similarly, the bandwidths $B_k$ can be computed from $B_k = \mathfrak{B}^{-1}\{\nu_k + d\nu/2\} - \mathfrak{B}^{-1}\{\nu_k - d\nu/2\}$. Other Bark scales (e.g. E. Zwicker, H. Fastl, "Psychoacoustics, facts and models", Springer, Berlin Heidelberg, 1999) use smaller bandwidths and yield auditory filters with a larger group delay; the above spacing is therefore preferred.
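The mapping between Hz and Bark and the resulting filter bank layout can be sketched as follows. The constants 26.81, 1960, 0.53 and 26.28 are the commonly cited coefficients of Traunmüller's formula and are assumed here, since the printed equations are only partly legible; the function names, the starting point of 1 Bark and the band count are illustrative.

```python
def hz_to_bark(f_hz):
    """Traunmueller's Bark transformation, eq. (19) (coefficients assumed)."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53


def bark_to_hz(v_bark):
    """Inverse transformation, eq. (20); valid for v_bark < 26.28."""
    return 1960.0 * (v_bark + 0.53) / (26.28 - v_bark)


def filter_bank_layout(v_first, n_bands, dv=1.0):
    """Return (center frequency, bandwidth) in Hz for bands spaced dv Bark apart."""
    layout = []
    for k in range(n_bands):
        v_k = v_first + k * dv
        f_k = bark_to_hz(v_k)
        b_k = bark_to_hz(v_k + dv / 2.0) - bark_to_hz(v_k - dv / 2.0)
        layout.append((f_k, b_k))
    return layout


# Hypothetical bank: 20 bands, 1 Bark apart, starting at 1 Bark.
layout = filter_bank_layout(v_first=1.0, n_bands=20)
```

Because the inverse transformation grows faster than linearly towards high Bark values, both the center frequencies and the bandwidths increase from band to band.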

[00122] To avoid confusion with the variable $z$ of the z-domain, $\nu$ is used instead of $z$ for Bark frequencies.
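The Bark mapping and the resulting filter-bank layout can be sketched in a few lines of Python. This is a minimal illustration that assumes Traunmüller's formula in the form of (19) and (20); the function names, the number of bands and the first grid point `v_first` are hypothetical choices that do not come from the text.

```python
def hz_to_bark(f_hz):
    """Traunmueller's approximation of the Bark scale, as in (19)."""
    return 26.81 / (1.0 + 1960.0 / f_hz) - 0.53


def bark_to_hz(v_bark):
    """Inverse transformation, as in (20)."""
    return 1960.0 / (26.81 / (v_bark + 0.53) - 1.0)


def filterbank_layout(num_bands=20, dv=1.0, v_first=1.0):
    """Center frequencies and bandwidths on an equidistant Bark grid.

    num_bands and v_first are illustrative assumptions.
    """
    centers, bandwidths = [], []
    for k in range(num_bands):
        v_k = v_first + k * dv
        centers.append(bark_to_hz(v_k))
        # Bandwidth: distance between the band edges at v_k -/+ dv/2.
        bandwidths.append(bark_to_hz(v_k + dv / 2) - bark_to_hz(v_k - dv / 2))
    return centers, bandwidths
```

Calling `filterbank_layout()` returns two lists of 20 values in Hz, with center frequencies and bandwidths both growing toward higher bands.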

IX. AUDITORY GAMMATONE FILTERS

[00123] Auditory gammatone filters (R. F. Lyon, "The All-Pole Gammatone Filter and Auditory Models", Proc. Forum Acusticum, Antwerp, 1996) can be implemented efficiently in the time domain and allow a broadband audio signal to be separated into auditory band signals. The magnitude response of the gammatone filter corresponds to the simultaneous masking properties of the human ear. Plotted over the auditory frequency scale, the shape of this magnitude response stays the same regardless of the center frequency the filter was designed for. The general form below represents a family of gammatone filters of order $m$, where $k$ is the filter-bank channel index and $*\mathrm{GF}$ stands for any gammatone filter type (e.g. GF, APGF, OZGF, TZGF). Its z-transform is

$$H_{*\mathrm{GF},k}(z) = g_{*\mathrm{GF}} \cdot H_{\mathrm{num},k}(z) \cdot \prod_{i=1}^{m} \frac{1}{1 - 2 r_k \cos(\theta_k)\, z^{-1} + r_k^2\, z^{-2}}. \tag{21}$$

[00124] The digital center frequencies $\theta_k$ and pole radii $r_k$ are determined from the continuous-time quantities center frequency $f_k$, bandwidth $B_k$ and band-edge attenuation $C_{\mathrm{dB}}$ (e.g. $C_{\mathrm{dB}} = -5$ dB), together with the sampling rate $f_s$:

$$\theta_k = 2\pi \cdot \frac{f_k}{f_s}, \qquad r_k = 1 - 2\pi \cdot \frac{[\ldots]}{f_s}, \tag{22}$$

where the numerator of the second fraction, which depends on $B_k$ and $C_{\mathrm{dB}}$, is illegible in the source.

[00125] An auditory gammatone filter bank is a set of overlapping gammatone filters that divides the auditory frequency scale into equidistant frequency bands. The order $m = 4$ is common in the literature, whereas the order $m = 3$ has been proposed to minimize the computational load. The gain term $g_{*\mathrm{GF}}$ is to be adjusted such that unity gain is reached at the center frequency $f_k$. For a specific form of gammatone filter, the system $H_{\mathrm{num},k}(z)$ has to be adapted accordingly, as shown in the following subsections.

A. SIMPLE GAMMATONE FILTER

[00126] The simple gammatone filter (GF; R. F. Lyon, "The All-Pole Gammatone Filter and Auditory Models", Proc. Forum Acusticum, Antwerp, 1996) has to be derived from the continuous-time impulse response using the Laplace transform and the impulse invariance transformation (A. V. Oppenheim, R. W. Schafer, J. R. Buck, "Discrete-Time Signal Processing", Prentice Hall, 1999):

$$h(t) = t^{m-1}\, e^{-B_k t} \cos(2\pi f_k t), \tag{23}$$

[00127] which determines the unknown polynomial $H_{\mathrm{num},k}(z)$ in (21). Because of its shape and its computational cost, its use is not recommended.

B. ALL-POLE GAMMATONE FILTER

[00128] An all-pole gammatone filter (APGF) is obtained when the numerator polynomial in (21) reduces to unity, $H_{\mathrm{num},k}(z) = 1$. It is the most efficient gammatone filter (R. F. Lyon, "The All-Pole Gammatone Filter and Auditory Models", Proc. Forum Acusticum, Antwerp, 1996).
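As a rough illustration, an all-pole gammatone channel can be written as $m$ identical two-pole sections in cascade, with the gain chosen so that the magnitude at the center frequency is one. The sketch below assumes the pole radius `r` and the digital center frequency `theta` are already known; the function names and the example value `r = 0.95` are hypothetical and not taken from the text.

```python
import cmath
import math


def apgf_response(theta, r, m, omega):
    """Frequency response of m cascaded two-pole sections (H_num = 1, gain 1)."""
    z_inv = cmath.exp(-1j * omega)
    section = 1.0 / (1.0 - 2.0 * r * math.cos(theta) * z_inv + r * r * z_inv ** 2)
    return section ** m


def apgf_gain(theta, r, m):
    """Gain that yields unity magnitude at the center frequency theta."""
    return 1.0 / abs(apgf_response(theta, r, m, theta))


def apgf_filter(x, theta, r, m):
    """Filter the sequence x through the gain-normalized all-pole cascade."""
    a1 = -2.0 * r * math.cos(theta)
    a2 = r * r
    y = list(x)
    for _ in range(m):
        out, y1, y2 = [], 0.0, 0.0
        for sample in y:
            # Two-pole recursion: y[n] = x[n] - a1*y[n-1] - a2*y[n-2]
            cur = sample - a1 * y1 - a2 * y2
            out.append(cur)
            y2, y1 = y1, cur
        y = out
    g = apgf_gain(theta, r, m)
    return [g * v for v in y]
```

For a channel at 1 kHz and $f_s = 16$ kHz one would call, for example, `apgf_filter(x, 2 * math.pi * 1000 / 16000, 0.95, 3)`.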

C. ONE-ZERO GAMMATONE FILTER

[00129] Setting $H_{\mathrm{num},k}(z) = (1 - z^{-1})$ in (21) leads to the so-called one-zero gammatone filter (R. F. Lyon, "The All-Pole Gammatone Filter and Auditory Models", Proc. Forum Acusticum, Antwerp, 1996). The one-zero gammatone filter (OZGF) can be built efficiently by applying a single one-zero section, common to all channels $k$, before the signal is split into the $k$ all-pole gammatone filters.

D. THREE-ZERO GAMMATONE FILTER

[00130] If a pair of complex-conjugate zeros $z = r_z e^{\pm j\theta_{z,k}}$, with the digital frequency $\theta_{z,k}$ located 1 Bark above the center frequency $\theta_k$ and a radius $r_z \approx 0.98$, and an additional zero at $z = 1$ are added, one obtains

$$H_{\mathrm{num},k}(z) = \left(1 - 2 r_z \cos(\theta_{z,k})\, z^{-1} + r_z^2\, z^{-2}\right) \cdot \left(1 - z^{-1}\right)$$

for the three-zero gammatone filter (TZGF), which has an improved shape (L. Lin, E. Ambikairajah, W. H. Holmes, "Auditory Filterbank Design Using Masking Curves", Proc. EUROSPEECH Scandinavia, 7th European Conference on Speech Communication and Technology, 2001). The computational cost of a one-zero gammatone filter of order $m + 1$ equals that of a three-zero gammatone filter of order $m$, provided that a single one-zero section is again shared by all channels $k$. Suitable transformations and the computation of the digital frequencies $\theta_{z,k}$ follow from (19), (20) and (22).
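The numerator of the three-zero filter can be multiplied out once per channel. The following sketch assumes Traunmüller's Bark formula for (19) and (20) and the relation $\theta = 2\pi f / f_s$ from (22); the helper names are hypothetical.

```python
import math


def hz_to_bark(f_hz):
    return 26.81 / (1.0 + 1960.0 / f_hz) - 0.53


def bark_to_hz(v_bark):
    return 1960.0 / (26.81 / (v_bark + 0.53) - 1.0)


def tzgf_numerator(f_k, f_s, r_z=0.98):
    """Return ([b0, b1, b2, b3], theta_z) for H_num,k(z) in powers of z^-1."""
    f_zero = bark_to_hz(hz_to_bark(f_k) + 1.0)   # zero pair 1 Bark above f_k
    theta_z = 2.0 * math.pi * f_zero / f_s       # digital frequency of the zero pair
    c = 2.0 * r_z * math.cos(theta_z)
    # (1 - c z^-1 + r_z^2 z^-2) * (1 - z^-1), multiplied out
    coeffs = [1.0, -(1.0 + c), c + r_z ** 2, -(r_z ** 2)]
    return coeffs, theta_z


def evaluate(coeffs, z):
    """Evaluate the polynomial in z^-1 at a (possibly complex) point z."""
    return sum(b * z ** (-i) for i, b in enumerate(coeffs))
```

Evaluating the result at $z = 1$ or at $z = r_z e^{j\theta_{z,k}}$ gives zero up to rounding, which is a quick check that the zeros sit where intended.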

X. RECOMPOSITION

[00131] The recomposition of a broadband signal from the auditory band signals can be implemented as the sum of all band signals. Unfortunately, this can cause destructive cancellation in the overlap regions of adjacent channels. We therefore derive a simple criterion that shows the need for a sign change in every second channel before summation:

$$[\ldots] \tag{24}$$

[00132] With this formula, the frequency response of the superposition of all signals lies between $C_{\mathrm{dB}} + 3$ dB and 0 dB. Omitting a necessary sign change can cause destructive cancellation at the band edges of adjacent filters.
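A recomposition with alternating signs might look as follows. Criterion (24) itself is not reproduced in the text, so this sketch simply assumes that it calls for inverting every second channel; which channels are inverted (here the odd-indexed ones) is an illustrative choice, and the function name is hypothetical.

```python
def recompose(band_signals):
    """Sum the sub-band signals, inverting every second channel (assumed rule)."""
    length = len(band_signals[0])
    out = [0.0] * length
    for k, band in enumerate(band_signals):
        sign = -1.0 if k % 2 else 1.0   # channels 1, 3, 5, ... are inverted
        for n in range(length):
            out[n] += sign * band[n]
    return out
```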

XI. (DELAY-REDUCED) LEVEL DETECTION

[00133] Masking effects modeled by the auditory filter bank cannot be exploited as long as the amplitude in each filter-bank channel has not been determined. Suitable methods of level detection are proposed in the following subsections.

[00134] We propose the first, simple approach for the high-frequency channels and the delay-reduced approach for the low-frequency bands.

A. SIMPLE LEVEL DETECTION WITH PRE-MASKING

[00135] Usually, nonlinearities such as the absolute value, squaring or half-wave rectification are used to shift the signal amplitude to baseband around 0 Hz. A smoothing filter then removes the higher-frequency components, leaving the desired amplitude signal. Fig. 11 shows an example that also takes the form factor $F$ into account.

[00136] Commonly used approaches to amplitude detection are computationally efficient, but their smoothing filters introduce group delay into the signal path, which has to be compensated. We recommend describing the recursive smoothing parameter $\alpha$ by a time constant $T_{\mathrm{avg}}$ in seconds:

$$\alpha = e^{-\frac{1}{T_{\mathrm{avg}} \cdot f_s}}. \tag{25}$$

[00137] Suitable time constants match the auditory pre-masking time constant, which is approximately $T_{\mathrm{avg}} \approx 2$ ms (G. Stoll, J. G. Beerends, R. Bitto, K. Brandenburg, C. Colomes, B. Feiten, M. Keyhl, C. Schmidmer, T. Sporer, T. Thiede, W. C. Treurniet, "PEAQ - der neue ITU-Standard zur objektiven Messung der wahrgenommenen Audioqualität", RTM - Rundfunktechnische Mitteilungen, vol. 43, ISSN 0035-9890, pp. 81-120, Mensing GmbH + Co. KG, Sept. 1999).
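The simple detector can be sketched as a rectifier followed by a one-pole recursive smoother. The sketch assumes that (25) reads $\alpha = e^{-1/(T_{\mathrm{avg}} f_s)}$ and uses the absolute value as the nonlinearity; the form factor $F$ of Fig. 11 is left out, and the function names are hypothetical.

```python
import math


def smoothing_coefficient(t_avg, f_s):
    """Recursive smoothing parameter for a time constant t_avg in seconds."""
    return math.exp(-1.0 / (t_avg * f_s))


def detect_level(x, t_avg=0.002, f_s=16000.0):
    """Rectify the band signal and smooth it with a one-pole recursive filter."""
    alpha = smoothing_coefficient(t_avg, f_s)
    level, out = 0.0, []
    for sample in x:
        level = alpha * level + (1.0 - alpha) * abs(sample)
        out.append(level)
    return out
```

With $T_{\mathrm{avg}} = 2$ ms and $f_s = 16$ kHz the time constant spans 32 samples, so the output settles within a few hundred samples after a level change.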

B. DELAY-REDUCED LEVEL DETECTION

[00138] Our new method exploits the phase of a simple filter section. This method of level detection can also be applied in other technical fields and is not restricted to noise suppression.

[00139] With the Hilbert transform, a broadband signal can be phase-shifted consistently by 90°. Summing the squares of the original and the shifted signal leaves the squared amplitude (i.e. the signal power), while the sinusoidal components cancel. However, a causal implementation of the Hilbert transform does not exist.

[00140] Unlike the ideal Hilbert transformer, we need the 90° phase shift only within the frequency interval of interest, i.e. in the respective critical band.

[00141] We propose the following filter types for a 90° phase shift at a frequency $\theta_k$:

[00142] · a simple first-order FIR section,

[00143] · a simple first-order IIR all-pass (AP), and

[00144] · a simple delay line with a λ/4 delay at $\theta_k$.

[00145] Each of the above methods provides a 90° phase shift at virtually any frequency $\theta_k$ and is therefore suitable.

[00146] One can choose between the following properties:

[00147] · FIR: numerically unstable at $\theta_k = [0, \pi/2, \pi]$; offers the widest band with a 90° phase shift.

[00148] · AP: numerically unstable at $\theta_k = [0, \pi/2, \pi]$; the band with a 90° phase shift is narrower and the computational cost is higher.

[00149] · λ/4 delay: numerically stable; the narrowest band with a 90° phase shift; low computational cost, but much memory required.

[00150] Fig. 12 shows an example of the FIR level detection method. A suitable parameter can be found from the phase equation of the respective system (see e.g. A. V. Oppenheim, R. W. Schafer, J. R. Buck, "Discrete-Time Signal Processing", Prentice Hall, 1999).
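Two of the listed variants can be sketched for a pure tone at the digital frequency $\theta$. The structure of Fig. 12 is not available here, so the FIR variant below is only one possible first-order section, $1 + b\,z^{-1}$ with $b = -1/\cos\theta$, obtained by setting the real part of its frequency response to zero at $\theta$; the function names are hypothetical.

```python
import math


def envelope_delay(x, theta):
    """Squared envelope from x[n] and a copy delayed by a quarter period."""
    delay = round(math.pi / (2.0 * theta))   # lambda/4 delay in samples
    return [x[n] ** 2 + x[n - delay] ** 2 for n in range(delay, len(x))]


def envelope_fir(x, theta):
    """Squared envelope using a first-order FIR section as 90-degree shifter."""
    b = -1.0 / math.cos(theta)       # cancels the in-phase part at theta
    scale = 1.0 / math.tan(theta)    # the section has magnitude |tan(theta)| at theta
    out = []
    for n in range(1, len(x)):
        quad = scale * (x[n] + b * x[n - 1])   # quadrature signal
        out.append(x[n] ** 2 + quad ** 2)
    return out
```

For a sinusoid of amplitude $A$ at exactly $\theta$, both functions return the constant $A^2$ without any smoothing filter. The factors $1/\cos\theta$ and $1/\tan\theta$ grow without bound near $\theta = \pi/2$ and $\theta = 0, \pi$ respectively, which is consistent with the numerical instability noted for the FIR variant.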

XII. AUDITORY POST-MASKING

[00151] Using nonlinear post-masking filters (e.g. recursive averaging that acts on falling edges) has several advantages:

[00152] · The variance of impulsive noise is slightly overestimated because of post-masking (over-subtraction).

[00153] · Noise suppression algorithms cannot attenuate signals until the post-masking time has elapsed.

[00154] · Aliasing effects after downsampling and ripple in the amplitude signal are reduced by the smoothing effect of post-masking.

[00155] · Smoothing takes place, yet the amplitudes of the important transient signals incur no additional group delay.

[00156] We propose a structure that operates on the signal power in each channel (cf. Fig. 13; L. Lin, E. Ambikairajah, W. H. Holmes, "Perceptual Domain Based Speech and Audio Coder", Proc. of the Third International Symposium DSPCS 2002, Sydney, Jan. 28-31, 2002).

[00157] The averaging parameter $\alpha_k$ in channel $k$ has to correspond to the human post-masking time constant at the respective frequency $f_k$. We therefore use the following equation to derive the averaging parameter $\alpha_k$:

$$\alpha_k = e^{-\frac{1}{G \cdot \tau_k \cdot f_s}}. \tag{26}$$

[00158] The parameter $G$ can be used to scale the post-masking time constants.

[00159] The time constant at 1 Bark is approximately $\tau \approx 40$ ms, and at 20 Bark approximately $\tau \approx 4$ ms (G. Stoll, J. G. Beerends, R. Bitto, K. Brandenburg, C. Colomes, B. Feiten, M. Keyhl, C. Schmidmer, T. Sporer, T. Thiede, W. C. Treurniet, "PEAQ - der neue ITU-Standard zur objektiven Messung der wahrgenommenen Audioqualität", RTM - Rundfunktechnische Mitteilungen, vol. 43, ISSN 0035-9890, pp. 81-120, Mensing GmbH + Co. KG, Sept. 1999). The following equation can be used to derive $\tau_k$:

$$\tau_k / [\mathrm{ms}] = [\ldots] \tag{27}$$

[00160] Alternatively, the equation given in the cited reference can be used, but our formula provides a suitable interpolation with longer time constants.
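A post-masking filter of this kind can be sketched as a follower that tracks rising power at once and lets falling power decay with the channel's time constant. The structure of Fig. 13 is not available here, so this is only one plausible reading; it assumes that (26) reads $\alpha_k = e^{-1/(G \tau_k f_s)}$, takes $\tau_k$ as an input because (27) is not legible, and uses hypothetical names.

```python
import math


def post_masking(power, tau, f_s, g=1.0):
    """Pass rising edges unchanged, smooth falling edges recursively."""
    alpha = math.exp(-1.0 / (g * tau * f_s))
    state, out = 0.0, []
    for p in power:
        if p >= state:
            state = p                                   # rising edge: follow at once
        else:
            state = alpha * state + (1.0 - alpha) * p   # falling edge: slow decay
        out.append(state)
    return out
```

Because rising edges are copied directly, onsets are not delayed; only the decay after a burst is stretched.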

XIII. RECURSIVE MINIMUM STATISTICS

[00161] We can use the structure in Fig. 14 to estimate the noise level in each frequency band. Similar approaches can be found in R. Martin, "Noise Power Spectral Density Estimation Based on Optimal Smoothing and Minimum Statistics", IEEE Transactions on Speech and Audio Processing, vol. 9, no. 5, pp. 504-512, Jul. 2001, or in WO 00/30264 (International application No. PCT/SG99/00119).

[00162] This method essentially uses three time constants for averaging the signal levels. Falling edges are averaged lightly, whereas during rising input edges the output is held constant for a period of $N_w$ sampling intervals (infinite time constant). Once $N_w$ sampling intervals have elapsed, the rising edge is averaged with a third time constant. As in (25) and (26), the time constants can be converted into recursive averaging parameters.

[00163] A suitable counter limit $N_w$ can be computed from a continuous time interval $T_w$:

$$N_w = \mathrm{round}(T_w \cdot f_s). \tag{28}$$

[00164] For utterances or words of human speech, this time interval can be chosen appropriately, e.g. $T_w \approx 1.5$ s. The time constant for falling edges can be a scaled version of the post-masking time constant or, for example, a constant 200 ms.

[00165] The time constant $\beta$ that governs rising edges can be approximately 700 ms, which corresponds to a slew rate of about 6 dB/s. Unlike all other time constants, it is proposed to be the same for all channels $k$.
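The three-time-constant tracker can be sketched as below. Fig. 14 is not available here, so this is only one reading of the description: falling inputs are averaged with a short time constant and reset a hold counter, non-falling inputs are ignored until the counter reaches $N_w$, and are averaged with the slow constant afterwards. Function names, default values and the initialization with the first input value are assumptions.

```python
import math


def coefficient(time_constant, f_s):
    """Recursive averaging parameter for a time constant in seconds."""
    return math.exp(-1.0 / (time_constant * f_s))


def track_noise(levels, f_s, t_fall=0.2, t_rise=0.7, t_w=1.5):
    """Noise-level estimate per sample for one channel."""
    a_fall = coefficient(t_fall, f_s)
    a_rise = coefficient(t_rise, f_s)
    n_w = round(t_w * f_s)               # counter limit N_w = round(T_w * f_s)
    noise, counter, out = levels[0], 0, []
    for p in levels:
        if p < noise:
            noise = a_fall * noise + (1.0 - a_fall) * p   # falling edge
            counter = 0
        elif counter < n_w:
            counter += 1                                   # hold the estimate
        else:
            noise = a_rise * noise + (1.0 - a_rise) * p   # slow rise
        out.append(noise)
    return out
```

Short speech bursts therefore leave the estimate untouched, while a level that stays high for longer than $T_w$ is gradually accepted as noise.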

[00166] The saturation function in Fig. 14 can be stated as follows:

$$f(x) = \begin{cases} 1 & \text{if } x > 0, \\ 0 & \text{otherwise.} \end{cases} \tag{29}$$

XIV. EPHRAIM-MALAH SUPPRESSION RULE (EMSR)

[00167] With the EMSR (Y. Ephraim and D. Malah, "Speech Enhancement Using a Minimum Mean-Square Error Short-Time Spectral Amplitude Estimator", IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-32, no. 6, pp. 1109-1121, Dec. 1984; Y. Ephraim and D. Malah, "Speech Enhancement Using a Minimum Mean-Square Error Log-Spectral Amplitude Estimator", IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-33, no. 2, pp. 443-445, Apr. 1985) we can estimate the clean speech amplitude from the given noisy speech amplitude and the noise variance. For example, we can use the definition of the spectral weights by Wolfe and Godsill (P. J. Wolfe and S. J. Godsill, "Simple Alternatives to the Ephraim and Malah Suppression Rule for Speech Enhancement", Proc. 11th IEEE Signal Processing Workshop, pp. 496-499, 6-8 Aug. 2001) and a modified decision-directed approach (F. Zotter, M. Noisternig, R. Höldrich, "Speech Enhancement Using the Ephraim and Malah Suppression Rule and Decision Directed Approach: A Hysteretic Process", to appear in IEEE Signal Processing Letters, 2005; first manuscript submitted Jan. 24, 2005):

$$[\ldots] \tag{30}$$

[00168] The following relations enter into the above equation:

$$\xi_k[m] = \alpha \cdot \min(\gamma_k[m] - 1,\, 0) + [\ldots] \tag{31}$$

$$[\ldots] \tag{32}$$

$$[\ldots]\; p\,(1 - \alpha) \cdot \gamma_k[m-1] \cdot g_k^2[m-1] + \zeta \tag{33}$$

$$m = L\,n \tag{35}$$

[00169] The noise variance $\sigma_{d,k}^2[m]$ is provided by the noise estimation algorithm; $m$ and $n$ are time indices, $f_s$ is the system sampling rate and $L$ is a downsampling factor.

[00170] Following Y. Ephraim and D. Malah, "Speech Enhancement Using a Minimum Mean-Square Error Short-Time Spectral Amplitude Estimator", IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-32, no. 6, pp. 1109-1121, Dec. 1984, $\gamma_k[m]$ is the a posteriori SNR and $\xi_k[m]$ the a priori SNR. $G_{w,k}[m]$ is the spectral weight of the Wiener filter, and $\alpha$ is the averaging parameter, defined by an averaging time constant $T_{\mathrm{snr},k}$ that is either approximately 2 ms (F. Zotter, M. Noisternig, R. Höldrich, "Speech Enhancement Using the Ephraim and Malah Suppression Rule and Decision Directed Approach: A Hysteretic Process", to appear in IEEE Signal Processing Letters, 2005; first manuscript submitted Jan. 24, 2005) or derived from the auditory masking time constants.

[00171] The "over-subtraction factor" $p$ (cf. Zotter et al.) can be chosen as $p = 10^{-15/10}$ and the lower noise-floor parameter $\zeta$ as $\zeta = 10^{-40/10}$.
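For orientation only, the sketch below shows the textbook decision-directed estimate of the a priori SNR combined with a plain Wiener weight $\xi/(1+\xi)$. It is not the modified rule of equations (30) to (35), which cannot be fully recovered from the text, and it does not use the Wolfe-Godsill weights or the over-subtraction factor $p$. The function name, the value of `alpha` and the use of a floor of $10^{-4}$ on $\xi$ are illustrative assumptions.

```python
def wiener_gains(noisy_power, noise_var, alpha=0.98, floor=1e-4):
    """Per-frame Wiener weights with a decision-directed a priori SNR.

    Textbook formulation, swapped in for illustration; alpha and floor
    are assumed values.
    """
    gains, prev_gain, prev_gamma = [], 1.0, 1.0
    for y2, d2 in zip(noisy_power, noise_var):
        gamma = y2 / d2                                    # a posteriori SNR
        xi = (alpha * prev_gain ** 2 * prev_gamma
              + (1.0 - alpha) * max(gamma - 1.0, 0.0))     # a priori SNR
        xi = max(xi, floor)                                # lower limit
        gain = xi / (1.0 + xi)                             # Wiener weight
        gains.append(gain)
        prev_gain, prev_gamma = gain, gamma
    return gains
```

In a sub-band system, `noisy_power` would be the detected power in one channel at the downsampled rate and `noise_var` the output of the noise estimator for the same channel.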

XV. DELAY-REDUCED UPSAMPLING

[00172] Ordinary upsampling incurs either a processing delay or a group delay because of the interpolation involved. With an upsampling factor $L$, such delays are approximately $L$ samples long.

[00173] We propose a special upsampling method that introduces no additional delay. This is achieved by splitting the signal into buffers (preferably of the buffer size used by the ADC and DAC).

[00174] If the last sample of the preceding block is available in each signal block, the subsequent samples can be interpolated linearly. The last sample in each block must therefore coincide with a sampling instant of the lower sampling rate.
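The block-wise interpolation can be sketched as follows, assuming one low-rate value per block of $L$ output samples and that this value belongs to the last sample of the block. The function names and the initial value before the first block are hypothetical.

```python
def upsample_block(prev_value, new_value, factor):
    """Interpolate linearly from the previous block's last sample to new_value.

    The last returned sample equals new_value, so it coincides with the
    low-rate sampling instant.
    """
    step = (new_value - prev_value) / factor
    return [prev_value + (i + 1) * step for i in range(factor)]


def upsample_stream(values, factor, initial=0.0):
    """Apply upsample_block to a sequence of low-rate values, block by block."""
    out, prev = [], initial
    for v in values:
        out.extend(upsample_block(prev, v, factor))
        prev = v
    return out
```

Each block needs only the current low-rate value and the one remembered from the previous block, so no look-ahead beyond the block boundary is required.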

XVI. CONCLUSIONS

[00175] Frequency-domain solutions that use equivalent auditory models require delays on the order of 10 milliseconds. The implementation of our system with 20 frequency bands and a third-order TZGF has an average latency of 3.5 to 4 milliseconds. The required computational load is about 8.9 MIPS at $f_s = 16$ kHz, slightly more than DFT-based solutions require (7 MIPS). We have also applied a slightly modified Ephraim-Malah suppression rule (EMSR) with the simplified Wolfe-Godsill formula and the modified decision-directed approach.

[00176] The disclosure of all cited publications is incorporated in its entirety into this description.

Claims

1. A noise suppression method for an input audio signal (y[n]) having a desired signal (x[n]) and a noise signal component, the method comprising the steps of:
- splitting the input audio signal (y[n]) into a plurality of subbands (y_k[n]) by a band-splitting analysis based on a gammatone filter bank (GFB), preferably a non-uniform gammatone filter bank,
- noise suppression in each subband (y_k[n]) by a plurality of noise reduction processors,
- composition of the plurality of subbands (y_k[n]) into an output signal (x[n]) by a synthesis filter,
wherein all steps are performed in the time domain, characterized in that the gammatone filter bank (GFB) performs a phase shift on the subbands.

2. The method according to claim 1, characterized in that a preprocessor (H_OME) and a post-processor (H_IOME) perform a non-linear filtering of the input audio signal (y[n]), wherein:
a. a preprocessing filter, which emulates the transfer behavior of the human outer and middle ear, is applied to the discrete-time noisy input audio signal (y[n]), and
b. a post-processing filter is applied to the noise-suppressed, enhanced full-band signal to compensate for the effect of the preprocessing filter.

3. The method according to claim 1 or 2, characterized in that the noise reduction processors each comprise a signal level detection (LD), a noise estimator (NE), an auditory masking filter (PM) and a subtraction processor.

4. The method according to claim 3, characterized in that the signal level detection (LD) uses the phase of the low-order subband to generate a quadrature signal, evaluates an in-phase signal from the subband (y_k[n]), and sums the squared amplitudes of these signals to obtain the squared amplitude envelope.

5. The method according to claim 3, characterized in that the noise estimator (NE) generates a subband noise value by smoothing based on minimum statistics, in particular using a weighted average of the previous noise value and the current input value with three different time constants.

6. The method according to claim 3, characterized in that the auditory masking filter (PM) uses the detected signal power in each subband to generate a temporal masking behavior based on human hearing, in particular in that a non-linear weighted average of the previous subband input value and the current input value is applied only to falling edges, as a function of the detected level in each subband.

7. The method according to any one of claims 1 to 6, characterized in that the noise estimator (NE) depends on the current input value compared to time-dependent, level-dependent thresholds.

8. The method according to any one of claims 1 to 7, characterized in that the noise suppression in each subband is performed by the Ephraim-Malah noise suppression rule (EMSR).

9. The method according to any one of claims 1 to 7, characterized in that the noise suppression in each subband is performed by a decision-directed approach (DDA).

10. An apparatus for noise suppression for an input audio signal (y[n]) comprising a desired signal (x[n]) and a noise signal component, the apparatus comprising:
- a band-splitting analyzer for splitting the input audio signal (y[n]) into a plurality of subbands (y_k[n]), based on a gammatone filter bank (GFB), preferably a non-uniform gammatone filter bank,
- a plurality of noise reduction processors for noise suppression in each subband (y_k[n]),
- a synthesis filter for composing the plurality of subbands (y_k[n]) into an output signal (x[n]),
wherein all components work in the time domain, characterized in that a preprocessor (H_OME) and a post-processor (H_IOME) perform a non-linear filtering of the input audio signal, wherein:
a. a preprocessing filter, which emulates the transfer behavior of the human outer and middle ear, is applied to the discrete-time noisy input audio signal, and
b. a post-processing filter is applied to the enhanced full-band signal to compensate for the effect of the preprocessing filter.

11. The apparatus according to claim 10, characterized in that the noise reduction processors each comprise a signal level detector (LD), a noise estimator (NE), an auditory masking filter (PM) and a subtraction processor.

12. The apparatus according to claim 11, characterized in that the signal level detector (LD) uses the phase of the low-order filter section to generate a quadrature signal, evaluates an in-phase signal from the subband (y_k[n]), and sums the squared amplitudes of these signals.

13. The apparatus according to claim 12, characterized in that the quadrature signal is generated by a first-order FIR section provided in the signal level detector (LD).

14. The apparatus according to claim 12, characterized in that the quadrature signal is generated by a first-order FIR all-pass (AP) provided in the signal level detector (LD).

15. The apparatus according to claim 12, characterized in that the quadrature signal is generated by a delay line that produces a λ/4 delay at a digital frequency (θ_k).

8 sheets of drawings attached