DE19907900A1

DE19907900A1 - Determining signal-to-noise ratios of distorted speech signals involves determining probability of characteristic speech signal component with characteristic speech signal parameter(s)

Info

Publication number: DE19907900A1
Application number: DE1999107900
Authority: DE
Inventors: Luis Arevalo; Andreas Korthauer
Original assignee: Robert Bosch GmbH
Current assignee: Robert Bosch GmbH
Priority date: 1999-02-24
Filing date: 1999-02-24
Publication date: 2000-12-28
Anticipated expiration: 2019-02-25
Also published as: DE19907900B4

Abstract

The method involves determining (7) the probability of occurrence of a characteristic speech signal component using at least one characteristic speech signal parameter, whereby the distorted speech signal is filtered (3) to reduce the noise component and a frequency distribution is produced depending on the filtered values. When performing averaging (9,10) to derive the signal-to-noise ratio the parameter(s) to be averaged is(are) assessed using the probability of occurrence for the speech components. Independent claims are also included for an arrangement for implementing the method and for a use of the method for assessing and validating speech databases, esp. for automatic speech recognition systems.

Description

Prior art

The invention is based on a method for determining the signal-to-noise ratio of disturbed speech signals. The signal-to-noise ratio (SNR) is, for example, an important quantity for evaluating databases during the development of applications for automatic speech recognition. In the literature, the so-called segment SNR [1] is frequently used to state the SNR of speech signals. This method requires an undisturbed reference signal that must contain an exact image of the speech components of the disturbed signal. The disturbance is calculated from the difference between the reference signal and the disturbed signal. The short-time powers of the reference signal and of the disturbance are determined in signal segments of about 10 ms duration and used to calculate a "short-time SNR". For signals of finite length, these SNR values can be averaged, which yields the mean segment SNR for the speech sample. Speech pauses, in which only the noise is present, must however be excluded from the averaging. Speech pause detection must therefore be carried out for all signal segments, for example by comparing the short-time power of the reference signal with a constant power threshold. For measured signals, e.g. in a motor vehicle, the requirement for a reference signal usually cannot be met, so the segment SNR can only be used meaningfully when the speech signal and the disturbance are available separately, i.e. in the case of simulated disturbances.
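The segment SNR procedure described above can be sketched as follows. This is a minimal illustration, not the formulation of [1]: the function name, the non-overlapping segmentation, and the threshold parameter are assumptions made for the example.

```python
import math

def segment_snr_db(reference, disturbed, seg_len, pause_threshold):
    """Mean segment SNR in dB, averaged over segments classified as speech.

    reference: clean reference samples; disturbed: reference plus disturbance.
    seg_len: samples per segment (e.g. 80 samples = 10 ms at 8 kHz, assumed).
    pause_threshold: constant power threshold for speech pause detection.
    """
    snr_values = []
    for start in range(0, len(reference) - seg_len + 1, seg_len):
        ref = reference[start:start + seg_len]
        dis = disturbed[start:start + seg_len]
        # The disturbance is the difference of disturbed and reference signal.
        noise = [d - r for d, r in zip(dis, ref)]
        p_ref = sum(r * r for r in ref) / seg_len
        p_noise = sum(n * n for n in noise) / seg_len
        # Speech pause detection on the reference; pauses are excluded.
        if p_ref <= pause_threshold or p_noise == 0.0:
            continue
        snr_values.append(10.0 * math.log10(p_ref / p_noise))
    if not snr_values:
        raise ValueError("no speech segments found")
    return sum(snr_values) / len(snr_values)
```

The sketch makes the limitation visible: without `reference`, neither the disturbance nor the pause decision can be computed.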

In [2] a method for SNR measurement is presented that dispenses with an undisturbed reference signal. It introduces the "mean SNR", which relies on a method for speech pause detection on the disturbed signal. To detect speech pauses, a histogram is created from the logarithmic values of the short-time power of the disturbed signal. The histogram is approximated by the superposition of two Gaussian functions, and the power threshold for speech pause detection is determined from the intersection of the Gaussian functions. If the short-time power of the signal falls below this power threshold, a speech pause is detected. Based on the speech pause detection, mean powers of the disturbed signal can be calculated separately for the signal sections with speech activity and those with speech pause. The mean SNR is determined from the difference of the logarithmic values of these powers.
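As a rough illustration of the threshold step, the sketch below locates the intersection of two already fitted Gaussian functions by bisection. The fitting itself is not shown, and the function names and the (mean, std) parameterization are assumptions for this example rather than details taken from [2].

```python
import math

def normal_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def intersection_threshold(pause, speech, tol=1e-9):
    """Point between the two means where both densities are equal.

    pause, speech: (mean, std) pairs in dB, with pause mean < speech mean.
    """
    lo, hi = pause[0], speech[0]

    def diff(x):
        return normal_pdf(x, *pause) - normal_pdf(x, *speech)

    if lo >= hi or diff(lo) <= 0.0 or diff(hi) >= 0.0:
        raise ValueError("densities do not cross between the two means")
    # Bisection: diff is positive on the pause side, negative on the speech side.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if diff(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def is_pause(log_power_db, threshold_db):
    """Binary pause decision: power below the threshold counts as pause."""
    return log_power_db < threshold_db
```

For two Gaussians with equal standard deviation the intersection lies midway between the means; with unequal widths it shifts toward the narrower distribution.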

The mean SNR deviates from the known definition of the signal-to-noise ratio, on which the segment SNR is also based, because it relates the power of the disturbed signal, i.e. signal plus noise, to the power of the noise. On a logarithmic scale, the mean SNR therefore approaches the value 0 dB for heavily disturbed signals and cannot take negative values. Because of this saturation, the mean SNR does not permit quantitative statements about different disturbances when signals are heavily disturbed. The speech pause detection described above can likewise only be used for signals with relatively little disturbance, since otherwise the histogram no longer permits a clear separation into high and low power values.
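The saturation can be made concrete with a small numerical sketch. It assumes that speech and noise are uncorrelated, so that their powers add; the function name is chosen for this example only.

```python
import math

def mean_snr_from_true_snr(true_snr_db):
    """Mean SNR in dB when (signal + noise) power is related to noise power.

    Assumes uncorrelated signal and noise, so the powers simply add.
    """
    ratio = 10.0 ** (true_snr_db / 10.0)     # S/N as a linear power ratio
    return 10.0 * math.log10(1.0 + ratio)    # (S + N)/N in dB

# Heavily disturbed signals all end up close to 0 dB.
for snr_db in (-20, -10, 0, 10, 20):
    print(snr_db, round(mean_snr_from_true_snr(snr_db), 3))
```

Under this assumption a true SNR of -20 dB and one of -10 dB differ by less than half a decibel in mean SNR, which is why the measure says little about strong disturbances.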

Advantages of the invention

With the measures according to the features of claim 1, it is possible to determine the signal-to-noise ratio of disturbed signals without an undisturbed reference signal being necessary. Because the disturbed speech signal is filtered for the determination of the probability of occurrence of speech components as opposed to speech pauses, speech pause detection, or detection of speech components, is reliably possible even for heavily disturbed speech signals. In contrast to the aforementioned prior art, the averaging used to obtain the signal-to-noise ratio does not rely on a binary decision between speech components and speech pauses based on a constant power threshold. Instead it uses a continuous quantity, the so-called speech probability, i.e. the probability of occurrence of speech components in the disturbed speech signal. With this speech probability, a characteristic speech signal quantity of the disturbed speech signal, e.g. the short-time power, is weighted during the averaging used to obtain the signal-to-noise ratio. Erroneous decisions that could falsify the value of the mean SNR (mean value of the signal-to-noise ratio) are thereby avoided.

In a further development according to claim 2, a modified signal-to-noise ratio (mean SNR) is formed by a nonlinear transformation. This extends the possible range of values of the mean SNR on a logarithmic scale to negative numbers and avoids the saturation of the mean SNR.

The further claims set out advantageous developments of the method according to the invention, in particular of the determination of the speech probability, as well as an arrangement for carrying out the method and its use.

Drawings

Exemplary embodiments of the invention are explained in more detail with reference to the drawings, in which:

Fig. 1 shows a block diagram for determining the signal-to-noise ratio of disturbed speech signals,

Fig. 2 shows the approximation of a histogram of the short-time power by the superposition of two normal distributions,

Fig. 3 shows the extension of the speech pause detection to multiple criteria.

Description of exemplary embodiments

The block diagram of Fig. 1 is divided into the units speech pause detection 1 and SNR measurement 2.

For the speech pause detection 2, the input signal y is filtered so that the influence of the disturbance in the signal is reduced. For this purpose, e.g. the known method of spectral subtraction or a fixed-design filter 3 can be used. The filtering reduces typical noise; in a motor vehicle this is above all the low-frequency engine noise and the high-frequency wind and driving noise. The filtering is carried out in the frequency domain by applying a spectral weighting W(Ω_ν) to the estimated power density Φ_yy(Ω_ν, k).

The selected weighting function has the form

W(Ω_ν) = 1 - cos² Ω_ν.
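To illustrate the shape of this weighting, the sketch below evaluates W on a grid of frequency bins. The bin layout is an assumption for the example: Ω_ν is taken as a normalized angular frequency running from 0 to π (half the sampling rate) in equal steps, and `apply_weighting` is a hypothetical helper name.

```python
import math

def weighting(omega):
    """W(Omega) = 1 - cos^2(Omega), evaluated at normalized frequency omega."""
    return 1.0 - math.cos(omega) ** 2

def apply_weighting(power_density):
    """Weight one segment's power density estimate bin by bin.

    Assumption: the bins are equally spaced from Omega = 0 to Omega = pi.
    """
    n = len(power_density)
    if n < 2:
        raise ValueError("need at least two frequency bins")
    return [weighting(math.pi * nu / (n - 1)) * phi
            for nu, phi in enumerate(power_density)]
```

With that layout the weighting vanishes at both ends of the band and peaks in the middle, which matches the stated aim of attenuating low-frequency engine noise and high-frequency wind and driving noise.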

From the denoised signal, i.e. the signal with reduced disturbance component, the short-time power is determined in block 4, and a histogram (block 5) is created from the logarithmic values of this short-time power. By means of a suitable iterative method, the frequency distribution of the histogram is approximated by the superposition of two functions p(·|pause) and p(·|speech) (block 6). In block 7 the speech probability is determined, as explained in more detail below.
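Blocks 4 and 5 might look roughly like the following sketch. The non-overlapping segmentation, the dB scaling, the power floor, and the bin layout are assumptions for the example; the iterative approximation of block 6 is not shown.

```python
import math

def log_short_time_powers(signal, seg_len, floor=1e-12):
    """Logarithmic short-time power in dB for each non-overlapping segment."""
    powers = []
    for start in range(0, len(signal) - seg_len + 1, seg_len):
        seg = signal[start:start + seg_len]
        p = sum(s * s for s in seg) / seg_len
        # The floor avoids log10(0) for all-zero segments.
        powers.append(10.0 * math.log10(max(p, floor)))
    return powers

def histogram(values, lo, hi, n_bins):
    """Relative frequencies of values in n_bins equal-width bins over [lo, hi)."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        if lo <= v < hi:
            index = min(int((v - lo) / width), n_bins - 1)
            counts[index] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * n_bins
```

For a recording with clear speech and pause sections, the resulting histogram would be expected to show two humps, one for each event.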

The functions p(·|pause) and p(·|speech) represent frequency distributions of the short-time powers for the events "pause" and "speech" and can be normal distributions, as shown for example in Fig. 2. Any other functions are also conceivable, however, as long as they satisfy the normalization condition for frequency distributions or probability densities, i.e. each integrates to one.

For the superposition of p(·|pause) and p(·|speech), it is assumed that the events "pause" and "speech" are a priori equally probable. The frequency distributions are therefore each weighted by the factor ½ in the superposition.

From the frequency distributions and the logarithmic value of the short-time power of segment k, Bayes' theorem yields, for every segment k of the denoised signal, a probability p_k(speech) that the segment contains speech; it is the probability of the event "speech" conditional on that segment's logarithmic short-time power.
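A minimal sketch of this Bayes step, assuming normal distributions for both events and the equal priors of ½ stated above. The function names and the (mean, std) parameterization are illustrative choices, not taken from the patent.

```python
import math

def normal_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def speech_probability(log_power, pause, speech):
    """Posterior probability of "speech" given a segment's log short-time power.

    pause, speech: (mean, std) of the fitted normal distributions (assumed).
    Both events carry the prior 1/2.
    """
    joint_pause = 0.5 * normal_pdf(log_power, *pause)
    joint_speech = 0.5 * normal_pdf(log_power, *speech)
    total = joint_pause + joint_speech
    # Far from both distributions the densities underflow; stay undecided.
    return joint_speech / total if total > 0.0 else 0.5
```

Unlike a threshold decision, the result varies smoothly between 0 and 1, so segments near the overlap of the two distributions contribute to both averages in proportion to their ambiguity.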

The SNR measurement is performed on the unfiltered, disturbed signal y. The mean signal powers for speech and pause (block 9) are determined from the short-time power of each segment k of the disturbed signal (block 8) and the speech probability p_k(speech), which weights each segment's contribution to the two averages.

The mean SNR (block 10) is the difference of these power values:

SNR_Mean = E_speech - E_pause
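The exact averaging rule is not reproduced in this text, so the following is only one plausible reading, stated as an assumption: linear short-time powers are averaged with weights p_k(speech) for speech and 1 - p_k(speech) for pause, and the two averages are then converted to dB to give E_speech and E_pause.

```python
import math

def mean_snr_db(powers, speech_probs):
    """Mean SNR in dB from probability-weighted mean powers (assumed rule).

    powers: linear short-time powers of the unfiltered disturbed signal.
    speech_probs: p_k(speech) for the same segments, each in [0, 1].
    """
    w_speech = sum(speech_probs)
    w_pause = sum(1.0 - p for p in speech_probs)
    if w_speech <= 0.0 or w_pause <= 0.0:
        raise ValueError("need some weight on both speech and pause")
    p_speech = sum(p * x for p, x in zip(speech_probs, powers)) / w_speech
    p_pause = sum((1.0 - p) * x for p, x in zip(speech_probs, powers)) / w_pause
    e_speech = 10.0 * math.log10(p_speech)   # E_speech in dB
    e_pause = 10.0 * math.log10(p_pause)     # E_pause in dB
    return e_speech - e_pause
```

With probabilities of exactly 0 and 1 this reduces to the binary pause decision of the prior art; intermediate values spread each segment over both averages.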

A nonlinear transformation (block 11) then forms a modified mean SNR from the mean SNR.

Compared with the prior art, the modified mean SNR improves the measured values in the sense of the known definition of the signal-to-noise ratio, in particular for heavily disturbed signals, because the saturation mentioned above, with its approach to the value 0 dB, does not occur.
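The transformation formula itself is not reproduced in this text. One transformation with the described properties, given here purely as an assumption, removes the noise share from the ratio (S + N)/N so that S/N remains; it presumes that signal and noise powers add.

```python
import math

def modified_mean_snr_db(mean_snr_db):
    """Map a mean SNR, (S + N)/N in dB, to S/N in dB (assumed transformation).

    Only defined for mean_snr_db > 0, where (S + N)/N exceeds 1.
    """
    ratio = 10.0 ** (mean_snr_db / 10.0)   # (S + N)/N as a linear ratio
    if ratio <= 1.0:
        raise ValueError("mean SNR must be greater than 0 dB")
    return 10.0 * math.log10(ratio - 1.0)  # S/N in dB
```

Under this assumption a mean SNR of about 3 dB maps to 0 dB, and values closer to 0 dB map to increasingly negative results, so heavily disturbed signals remain distinguishable.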

As an alternative to the structure shown in Fig. 1, the speech pause detection can be extended so that several criteria are taken into account, e.g. in addition to the short-time power also the correlation, which is widely used in speech signal processing. For this purpose, separate probabilities are calculated for each criterion and then suitably combined into a speech probability P_k(speech). Fig. 3 shows the structure of the speech pause detection for two criteria K1 and K2; an extension to more than two criteria is indicated. The combining device is denoted by reference numeral 12. The remaining blocks bear the same reference numerals as in Fig. 1, supplemented only by a second digit: a 1 for the first criterion and a 2 for the second criterion.
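The text leaves open how the per-criterion probabilities are combined in device 12. As one possible choice, and only as an assumption, the sketch below treats the criteria as conditionally independent with equal priors, which gives a product rule; `combine_probabilities` is a hypothetical name.

```python
def combine_probabilities(probs):
    """Combine per-criterion speech probabilities into a single value.

    Assumption: criteria are conditionally independent and each probability
    was computed with equal priors for "speech" and "pause".
    """
    speech = 1.0
    pause = 1.0
    for p in probs:
        speech *= p
        pause *= 1.0 - p
    total = speech + pause
    # Contradictory certainties (e.g. 0 and 1) give no evidence either way.
    return speech / total if total > 0.0 else 0.5
```

A criterion that is undecided (probability 0.5) leaves the result of the other criteria unchanged, while two criteria that agree reinforce each other.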

In the development of applications for speech signal processing in motor vehicles (e.g. automatic speech recognition for controlling driver information systems), the method according to the invention can be used as a measurement method for evaluating the quality of speech data. This enables quality control and fast error localization for the very costly speech data collections.

It is also known that the recognition performance of an automatic speech recognizer depends strongly on the degree of disturbance in the speech signal. It therefore makes sense to integrate the SNR measurement into the speech recognition process itself. For example, the models used by the speech recognizer can be adapted to different disturbances, i.e. slightly disturbed signals are classified with different models than heavily disturbed signals. This requires an SNR measurement on the input signal of the speech recognizer in order to select the right models. Since only the disturbed speech signal is available in this case, the method according to the invention can be used to advantage here.

Literature

[1] NOLL, P.: Adaptive Quantizing in Speech Coding Systems. In: Proceedings of the International Zürich Seminar on Digital Communications, 1974, pp. B3(1)-B3(6).
[2] SMOLDERS, J.; CLAES, T.; SABLON, G.; VAN COMPERNOLLE, D.: On the Importance of the Microphone Position for Speech Recognition in the Car. In: Proceedings of the International Conference on Acoustics, Speech & Signal Processing (ICASSP), 1, 1994, pp. 429-432.

Claims

1. Method for determining the signal-to-noise ratio of a disturbed speech signal, with the following steps:

- the probability of occurrence of speech components is determined (7) on the basis of at least one characteristic speech signal quantity by subjecting the disturbed speech signal to filtering (3), which reduces the disturbance component, and creating a frequency distribution from the filtered values,
- during the averaging (9, 10) to obtain the signal-to-noise ratio, the speech signal quantity or quantities to be averaged is/are weighted with the probability of occurrence of speech components.

2. The method according to claim 1, characterized in that a modified signal-to-noise ratio is determined by a nonlinear transformation (11), which does not lead to saturation on a logarithmic scale even for heavily disturbed speech signals.

3. The method according to claim 1 or 2, characterized in that a first criterion, in particular the short-time power of the disturbed speech signal, is used as the characteristic speech signal quantity.

4. The method according to any one of claims 1 to 3, characterized in that a further criterion, in particular the correlation of the disturbed speech signal, is used as the characteristic speech signal quantity.

5. The method according to claim 3 or 4, characterized in that, to determine the mean signal powers for speech components and speech pauses, the short-time power of the disturbed speech signal is combined with the probability of occurrence of the speech components and of the speech pauses, and the signal-to-noise ratio is determined by forming the difference (10) of these two combined values.

6. The method according to any one of claims 1 to 5, characterized in that, to determine the probability of occurrence, the short-time power and/or the correlation is formed from the speech signal with reduced disturbance component and a histogram is created, and that the frequency distribution of this histogram is approximated by superimposing the frequency densities of the short-time power and/or correlation for the speech pauses on the one hand and the speech components on the other.

7. The method according to claim 6, characterized in that the histogram is created from the logarithmic values of the short-time power and/or the values of the correlation.

8. The method according to claim 6 or 7, characterized in that, from the frequency distributions for speech components and speech pauses and from the logarithmic value, in particular of the short-time power, and/or the value of the correlation, a probability is determined for each segment of the speech signal with reduced disturbance component as to whether the segment in question contains speech components or not.

9. The method according to claim 8, characterized in that the averaging to obtain the signal-to-noise ratio is performed only over those segments in which speech activity, i.e. speech components, was detected with the determined probability of occurrence.

10. The method according to any one of claims 1 to 9, characterized in that several characteristic speech signal quantities are taken into account simultaneously to determine the probability of occurrence of the speech components.

11. Arrangement, in particular for carrying out the method according to one of claims 1 to 10, with the following features:

- a first device (1) for forming the mean value of a characteristic speech signal quantity of a disturbed speech signal,
- a second device (2) for determining the probability of occurrence of speech components, the second device (2) being connected to the first device (1) in such a way that the characteristic speech signal quantity is weighted with the probability of occurrence of speech components during the averaging (9, 10) to obtain the signal-to-noise ratio.

12. The arrangement according to claim 11, characterized in that the second device (2) has, on the input side, a filter device (3) for reducing the disturbance component of the disturbed speech signal that can be fed to it.

13. The arrangement according to claim 11 or 12, characterized in that the first device (1) has, on the output side, a nonlinear transformation device (11) which is designed such that no saturation can occur in the determination of the signal-to-noise ratio.

14. Use of the method according to one of claims 1 to 10 or of the arrangement according to one of claims 11 to 13 for the evaluation and validation of speech databases, in particular for automatic speech recognition systems.