EP0751495B1

EP0751495B1 - Method and device for classifying speech

Info

Publication number: EP0751495B1
Application number: EP96104213A
Authority: EP
Inventors: Joachim Dipl.-Ing. Stegmann
Original assignee: Deutsche Telekom AG
Current assignee: Deutsche Telekom AG
Priority date: 1995-06-30
Filing date: 1996-03-16
Publication date: 2001-10-10
Anticipated expiration: 2016-03-16
Also published as: NO961636D0; NO309831B1; ES2165933T3; NO961636L; EP0751495A2; ATE206841T1; EP0751495A3

Abstract

The method classifies speech signals for the signal-adapted control of a speech coding process, either to reduce the bit rate at unchanged speech quality or to increase the quality at unchanged bit rate. After the speech signal has been segmented, a wavelet transformation is calculated for each frame. Using adaptive thresholds, a set of parameters is derived from this transformation, and these parameters control a state model. The state model divides each speech frame into subframes and assigns each subframe to one of several classes typical for speech coding. The speech signal may be divided into segments of constant length. To reduce edge effects in the wavelet transformation, either the segment is mirrored at its boundaries, or the wavelet transformation is calculated on a smaller interval and the frames are shifted such that the segments overlap, or the segment is filled at its edges with previous or predicted sample values.

Description

The invention relates to a method for classifying speech signals according to the preamble of claim 1, and to a circuit arrangement for carrying out the method.

Speech coding methods and associated circuit arrangements for classifying speech signals for bit rates below 8 kbit/s are becoming increasingly important.

The main applications include multiplex transmission in existing fixed networks and third-generation mobile radio systems. Speech coding methods in this data-rate range are also needed to provide services such as videophony.

Most currently known high-quality speech coding methods for data rates between 4 kbit/s and 8 kbit/s work on the principle of Code-Excited Linear Prediction (CELP), first described in Schroeder, M.R., Atal, B.S.: Code-Excited Linear Prediction (CELP): High Quality Speech at Very Low Bit Rates, in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 1985. In this method the speech signal is synthesized by linear filtering of excitation vectors taken from one or more codebooks. In a first step, the coefficients of the short-term synthesis filter are determined from the input speech vector by LPC analysis and then quantized. The excitation codebooks are then searched, using the perceptually weighted error between the original and the synthesized speech vector as the optimization criterion (analysis by synthesis). Finally, only the indices of the optimal vectors are transmitted, from which the decoder can regenerate the synthesized speech vector.
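The analysis-by-synthesis search described above can be sketched as follows. This is a minimal illustration and not the coder of any cited publication: the function names, the three-entry codebook and the first-order synthesis filter are assumptions, and the perceptual weighting of the error is omitted for brevity.

```python
def synthesize(excitation, lpc):
    """All-pole synthesis filter: y[k] = x[k] - sum_i lpc[i] * y[k-1-i]."""
    out = []
    for k, x in enumerate(excitation):
        acc = x
        for i, a in enumerate(lpc):
            if k - 1 - i >= 0:
                acc -= a * out[k - 1 - i]
        out.append(acc)
    return out


def search_codebook(target, codebook, lpc):
    """Return (index, gain) of the codevector whose filtered, gain-scaled
    version has the smallest squared error against the target."""
    best_index, best_gain, best_err = None, 0.0, float("inf")
    for idx, vec in enumerate(codebook):
        y = synthesize(vec, lpc)
        energy = sum(v * v for v in y)
        if energy == 0.0:
            continue
        # Optimal gain for this codevector in the least-squares sense.
        gain = sum(t * v for t, v in zip(target, y)) / energy
        err = sum((t - gain * v) ** 2 for t, v in zip(target, y))
        if err < best_err:
            best_index, best_gain, best_err = idx, gain, err
    return best_index, best_gain


# Hypothetical toy data: the target is codevector 1, filtered and doubled.
codebook = [[1, 0, 0, 0], [0, 1, 0, 0], [1, -1, 1, -1]]
lpc = [-0.5]  # y[k] = x[k] + 0.5 * y[k-1]
target = [2.0 * v for v in synthesize(codebook[1], lpc)]
index, gain = search_codebook(target, codebook, lpc)
```

Only `index` (and a quantized gain) would need to be transmitted, since the decoder holds the same codebook and filter coefficients.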

Many of these coding methods, such as the new 8 kbit/s speech coder from ITU-T described in Study Group 15 Contribution - Q. 12/15: Draft Recommendation G.729 - Coding Of Speech at 8 kbit/s using Conjugate-Structure-Algebraic-Code-Excited-Linear-Predictive (CS-ACELP) Coding, 1995, work with a fixed combination of codebooks. This rigid arrangement does not take into account the strong variation over time of the properties of the speech signal and on average requires more bits for coding than necessary. For example, the adaptive codebook, which is needed only for coding periodic speech segments, remains switched on even during clearly non-periodic segments.

To arrive at lower data rates in the range around 4 kbit/s with as little loss of quality as possible, other publications, for example Wang, S., Gersho, A.: Phonetically-Based Vector Excitation Coding of Speech at 3.6 kbit/s, Proceedings of IEEE International Conference On Acoustics, Speech and Signal Processing, 1989, have therefore proposed sorting the speech signal into different typical classes before coding. In the proposal for the GSM half-rate system, the signal is divided frame by frame (every 20 ms) into voiced and unvoiced segments on the basis of the long-term prediction gain, each with adapted codebooks, which lowers the data rate for the excitation while the quality remains largely the same as that of the full-rate system. In a more general investigation, the signal was divided into the classes voiced, unvoiced and onset. The decision was obtained frame by frame (here 11.25 ms) by linear discrimination on the basis of parameters such as zero-crossing rate, reflection coefficients and energy; see for example Campbell, J., Tremain, T.: Voiced/Unvoiced Classification of Speech with Application to the U.S. Government LPC-10e Algorithm, Proceedings of IEEE International Conference On Acoustics, Speech and Signal Processing, 1986. Each class is in turn assigned a particular combination of codebooks, so that the data rate can be lowered to 3.6 kbit/s at medium quality.

Another example of such a method can be found in Meyer et al., "Variable rate speech coding using perceptive thresholds and adaptive VUS detection", EUROSPEECH 91, pp. 809-812.

All of these known methods derive their classification result from parameters obtained by calculating time averages over a window of constant length. The time resolution is therefore fixed by the choice of this window length. If the window length is reduced, the accuracy of the averages also decreases. If the window length is increased, on the other hand, the time course of the averages can no longer follow the course of the non-stationary speech signal. This applies in particular to strongly non-stationary transitions (onsets) from unvoiced to voiced speech segments. Yet it is precisely the temporally correct reproduction of the position of the first significant pulses of voiced segments that matters for the subjective assessment of a coding method. Further disadvantages of conventional classification methods are often high complexity or a strong dependence on background noise, which is always present in practice.

The object of the invention is to provide a method and a classifier of speech signals for the signal-adapted control of speech coding methods, for lowering the bit rate at unchanged speech quality or for increasing the quality at unchanged bit rate, which classify the speech signal for each time period with the aid of the wavelet transformation, whereby a high resolution is to be achieved both in the time domain and in the frequency domain.

The solution for the method according to the invention is characterized in the characterizing part of claim 1, and that for the classifier in the characterizing part of claim 5.

Further solutions and refinements of the invention follow from the characterizing parts of claims 2 to 4.

A method and an arrangement are described here which classify the speech signal for each time frame on the basis of the wavelet transformation. In this way, in accordance with the requirements of the speech signal, a high resolution can be achieved both in the time domain (localization of pulses) and in the frequency domain (good averages). The classification is therefore particularly suitable for controlling or selecting codebooks in a low-rate speech coder. The method and the arrangement are highly insensitive to background noise and have low complexity. Like the Fourier transformation, the wavelet transformation is a mathematical method for forming a model of a signal or system. In contrast to the Fourier transformation, however, the resolution in the time domain and in the frequency or scaling domain can be adapted flexibly to the requirements. The basis functions of the wavelet transformation are generated by scaling and shifting a so-called mother wavelet and have bandpass character. The wavelet transformation is therefore uniquely defined only once the associated mother wavelet has been specified. Background and details of the mathematical theory are given, for example, in Rioul, O., Vetterli, M.: Wavelets and Signal Processing, IEEE Signal Processing Magazine, Oct. 1991.

Because of its properties, the wavelet transformation is well suited to the analysis of non-stationary signals. A further advantage is the existence of fast algorithms with which the wavelet transformation can be calculated efficiently. Successful applications in signal processing can be found, among others, in image coding, in broadband correlation methods (for example for radar) and in pitch estimation, as shown for example in the following references: Mallat, S., Zhong, S.: Characterization of Signals from Multiscale Edges, IEEE Transactions on Pattern Analysis and Machine Intelligence, July 1992, and Kadambe, S., Boudreaux-Bartels, G.F.: Applications of the Wavelet Transform for Pitch Detection of Speech Signals, IEEE Transactions on Information Theory, March 1992.

The invention is described in more detail below with reference to an exemplary embodiment. The basic structure of a classifier according to Fig. 1 is used to describe the method. First, the speech signal is segmented. The speech signal is divided into segments of constant length, the length of the segments being between 5 ms and 40 ms. To avoid edge effects in the subsequent transformation, one of the following three techniques can be applied:

The segment is mirrored at its boundaries.
The wavelet transformation is calculated on the smaller interval (L/2, N-L/2) and the frame is shifted only by the constant offset L/2, so that the segments overlap. Here L is the length of a wavelet centered on the time origin, and the condition N>L must hold.
At the edges of the segment, the preceding or future sample values are used as padding.
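The segmentation and edge-handling options listed above can be sketched as follows. This is a minimal illustration under assumptions: the function names are hypothetical, the mirroring does not repeat the border sample, the padding uses zeros where the signal itself ends, and for the second technique only the inner interval is computed, not the frame shift.

```python
def split_frames(signal, n):
    """Cut the signal into consecutive frames of constant length n.
    An incomplete frame at the end is dropped."""
    return [signal[i:i + n] for i in range(0, len(signal) - n + 1, n)]


def mirror_extend(frame, pad):
    """Technique 1: reflect the frame at both borders by `pad` samples.
    Requires pad < len(frame)."""
    left = frame[pad:0:-1]
    right = frame[-2:-pad - 2:-1]
    return left + frame + right


def inner_interval(n, wavelet_len):
    """Technique 2: time indices at which a wavelet of length L, centred on
    the index, fits entirely inside a frame of length N (requires N > L)."""
    if n <= wavelet_len:
        raise ValueError("frame length N must exceed wavelet length L")
    half = wavelet_len // 2
    return range(half, n - half)


def neighbor_extend(signal, start, n, pad):
    """Technique 3: extend the frame starting at `start` with `pad` real
    preceding and following samples (zeros beyond the signal ends)."""
    return [signal[k] if 0 <= k < len(signal) else 0
            for k in range(start - pad, start + n + pad)]


samples = list(range(10))
frames = split_frames(samples, 4)
mirrored = mirror_extend(frames[0], 2)
padded = neighbor_extend(samples, 4, 4, 2)
```

Using future samples as padding implies additional look-ahead delay in a real-time coder, which is one reason a coder might prefer mirroring.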

Next, a discrete wavelet transformation is carried out. For such a segment s(k), a discrete-time wavelet transformation (DWT) S_h(m,n) with respect to a wavelet h(k) is calculated, with the integer parameters scaling m and time shift n. This transformation is defined by

$$S_h(m,n) = a_0^{-m/2} \sum_{k=N_u}^{N_o} s(k)\, h\!\left(a_0^{-m} k - n\right)$$

where N_u and N_o are the lower and upper limits of the time index k given by the chosen segmentation. The transformation needs to be calculated only for the scaling range 0<m<M and the time range in the interval (0,N), where the constant M must be chosen, depending on a_0, large enough that the lowest signal frequencies are still represented sufficiently well in the transform domain.
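A direct-sum evaluation of such a transform can be sketched as follows, assuming the standard discrete wavelet form S_h(m,n) = a_0^(-m/2) * sum_k s(k) h(a_0^(-m) k - n). The Mexican-hat wavelet, the function names and the choice to keep only shifts whose centre n * a_0^m lies inside the frame are assumptions for illustration, not details taken from the text.

```python
import math


def mexican_hat(t):
    """Assumed example wavelet with few oscillations (not from the source)."""
    return (1.0 - t * t) * math.exp(-0.5 * t * t)


def dwt_direct(s, h, num_scales, a0=2.0):
    """Evaluate S_h(m, n) = a0**(-m/2) * sum_k s(k) * h(a0**(-m) * k - n)
    by direct summation for m = 1..num_scales.

    For each scale, n runs over the shifts whose centre n * a0**m lies
    inside the frame. Returns a dict keyed by (m, n).
    """
    coeffs = {}
    for m in range(1, num_scales + 1):
        scale = a0 ** (-m)
        norm = a0 ** (-m / 2.0)
        num_shifts = int(math.ceil(len(s) * scale))
        for n in range(num_shifts):
            acc = 0.0
            for k, sample in enumerate(s):
                acc += sample * h(scale * k - n)
            coeffs[(m, n)] = norm * acc
    return coeffs


# A unit impulse at k = 4 in a frame of 8 samples.
impulse = [0.0] * 8
impulse[4] = 1.0
coeffs = dwt_direct(impulse, mexican_hat, num_scales=2)
```

The direct sum costs on the order of N operations per coefficient; it is shown only to make the definition concrete, not as an efficient implementation.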

For the classification of speech signals it is generally sufficient to consider the signal at dyadic scalings (a_0=2). If the wavelet h(k) can be represented by an iterated filter bank through a so-called "multiresolution analysis" according to Rioul, Vetterli, then the efficient recursive algorithms given in the literature can be used to calculate the dyadic wavelet transformation. In this case (a_0=2), a decomposition up to a maximum of M=6 is sufficient. Wavelets with few significant oscillation cycles but nevertheless as smooth a shape as possible are particularly suitable for the classification. For example, cubic spline wavelets or short orthogonal Daubechies wavelets can be used.
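An iterated filter bank for the dyadic case can be sketched as follows. This is a minimal illustration using the four-tap orthogonal Daubechies filter pair with periodic extension and downsampling by two at each stage; the function names, the periodic boundary handling and the decimated structure are assumptions, and the source does not state which variant of the recursive algorithm it uses.

```python
import math

SQ3 = math.sqrt(3.0)
# Four-tap Daubechies low-pass filter, normalized to unit energy.
LOW = [c / (4.0 * math.sqrt(2.0))
       for c in (1 + SQ3, 3 + SQ3, 3 - SQ3, 1 - SQ3)]
# Matching high-pass filter (quadrature mirror of LOW).
HIGH = [(-1) ** k * LOW[3 - k] for k in range(4)]


def analysis_step(x):
    """One filter-bank stage: filter with LOW and HIGH, keep every second
    output sample. The signal is extended periodically at the frame end."""
    n = len(x)
    approx, detail = [], []
    for i in range(0, n, 2):
        approx.append(sum(LOW[j] * x[(i + j) % n] for j in range(4)))
        detail.append(sum(HIGH[j] * x[(i + j) % n] for j in range(4)))
    return approx, detail


def dyadic_dwt(x, levels):
    """Iterate the stage on the low-pass branch.
    len(x) must be divisible by 2**levels.
    Returns (details, approx); details[0] is the finest scale."""
    details = []
    approx = list(x)
    for _ in range(levels):
        approx, d = analysis_step(approx)
        details.append(d)
    return details, approx


signal = [math.sin(0.3 * k) + 0.1 * k for k in range(64)]
details, approx = dyadic_dwt(signal, 3)
```

Because the filter pair is orthogonal, the total energy of the coefficients equals the energy of the input frame, which makes per-scale energies directly comparable.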

The division into classes follows. The speech segment is divided into classes on the basis of the transformation coefficients. To achieve a sufficiently fine time resolution, the segment is further divided into P subframes, so that a classification result is output for each subframe. For use in low-rate speech coding methods, the following classes were distinguished:

(1) background noise / unvoiced,

(2) signal transitions / "voicing onsets",

(3) periodic / voiced.

For use in certain coding methods it can be useful to subdivide the periodic class further, for example into segments with predominantly low-frequency energy and segments with more evenly distributed energy. Optionally, therefore, more than three classes can be distinguished.

The parameters are then calculated in a corresponding processor. First, a set of parameters is determined from the transformation coefficients S_h(m,n), with the aid of which the final classification can then be made. The parameters scaling difference measure (P₁), temporal difference measure (P₂) and periodicity measure (P₃) proved particularly favorable, since they relate directly to the defined classes (1) to (3).

For P₁, the variance of the energy of the DWT transformation coefficients across all scaling ranges is calculated. On the basis of this parameter it can be determined frame by frame, that is, on a relatively coarse time grid, whether the speech signal is unvoiced or only background noise is present.
To determine P₂, the mean energy difference of the transformation coefficients between the current and the previous frame is calculated first. Then, for transformation coefficients of a fine scaling level (m small), the energy differences between neighboring subframes are determined and compared with the energy difference for the frame as a whole. This yields a measure of the probability of a signal transition (for example unvoiced to voiced) for each subframe, that is, on a fine time grid.
For P₃, the local maxima of transformation coefficients of a coarse scaling level (m close to M) are determined frame by frame and checked for whether they occur at regular intervals. Local maxima here are the peaks that exceed a certain percentage T of the global maximum of the frame.
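The raw quantities behind the three parameters can be sketched as follows. This is a minimal illustration under assumptions: the function names, the regularity tolerance and the data layout (one list of coefficients per scale) are hypothetical, and the mapping of these quantities to values between 0 and 1 and the comparison against the frame-level energy difference are left out because the text does not specify them.

```python
def energy(values):
    """Sum of squared coefficients."""
    return sum(v * v for v in values)


def scale_energy_variance(coeffs_by_scale):
    """Basis for P1: variance of the per-scale energies across all scales."""
    e = [energy(c) for c in coeffs_by_scale]
    mean = sum(e) / len(e)
    return sum((x - mean) ** 2 for x in e) / len(e)


def subframe_energy_steps(fine_coeffs, num_subframes):
    """Basis for P2: energy differences between neighboring subframes,
    computed on coefficients of a fine scale (small m)."""
    size = len(fine_coeffs) // num_subframes
    e = [energy(fine_coeffs[i * size:(i + 1) * size])
         for i in range(num_subframes)]
    return [e[i] - e[i - 1] for i in range(1, num_subframes)]


def peak_positions(coarse_coeffs, fraction):
    """Basis for P3: indices of local maxima that exceed `fraction` of the
    global maximum, on coefficients of a coarse scale (m close to M)."""
    top = max(coarse_coeffs)
    return [i for i in range(1, len(coarse_coeffs) - 1)
            if coarse_coeffs[i] > fraction * top
            and coarse_coeffs[i] >= coarse_coeffs[i - 1]
            and coarse_coeffs[i] > coarse_coeffs[i + 1]]


def is_regular(positions, tolerance=1):
    """True if successive peak distances differ by at most `tolerance`."""
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return len(gaps) >= 2 and max(gaps) - min(gaps) <= tolerance


steps = subframe_energy_steps([0, 0, 1, 1, 2, 2, 0, 0], 4)
peaks = peak_positions([0, 1, 0, 0, 1, 0, 0, 1, 0], 0.5)
periodic = is_regular(peaks)
```

A flat energy distribution across scales gives a small variance, which is consistent with noise-like input, while evenly spaced coarse-scale peaks are consistent with pitch pulses in voiced speech.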

The thresholds required for these parameter calculations are controlled adaptively as a function of the current level of the background noise, which increases the robustness of the method in noisy environments.
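One possible way to make a threshold follow the background-noise level is sketched below. The text does not describe the adaptation rule, so everything here is an assumption: the class name, the recursive smoothing of the noise estimate, the linear dependence of the threshold on that estimate, and the idea of updating the estimate only on frames already judged to be background.

```python
class NoiseAdaptiveThreshold:
    """Hypothetical threshold of the form base + factor * noise_level."""

    def __init__(self, base, factor, smoothing=0.9):
        self.base = base
        self.factor = factor
        self.smoothing = smoothing
        self.noise_level = 0.0

    def update(self, frame_energy, is_background):
        """Update the noise estimate on background frames only and
        return the current threshold."""
        if is_background:
            a = self.smoothing
            self.noise_level = a * self.noise_level + (1.0 - a) * frame_energy
        return self.value()

    def value(self):
        return self.base + self.factor * self.noise_level


threshold = NoiseAdaptiveThreshold(base=1.0, factor=2.0, smoothing=0.5)
after_noise = threshold.update(4.0, is_background=True)
after_speech = threshold.update(100.0, is_background=False)
```

Freezing the estimate during speech keeps loud voiced frames from inflating the threshold.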

The evaluation is then carried out. The three parameters are fed to the evaluation unit in the form of "probabilities" (quantities mapped onto the value range (0,1)). The evaluation unit itself makes the final classification decision for each subframe on the basis of a state model. In this way the memory of the decisions made for preceding subframes is taken into account. In addition, transitions that make no sense, such as a direct jump from "unvoiced" to "voiced", are prohibited. As the result, a vector with P components is finally output per frame, containing the classification result for the P subframes.
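A state model of this kind can be sketched as follows. This is a minimal illustration under assumptions: only the ban on a direct jump from unvoiced to voiced comes from the text, while the remaining transition table, the per-class score triple for each subframe and the rule of picking the highest-scoring allowed class are hypothetical.

```python
UNVOICED, ONSET, VOICED = 0, 1, 2

# Allowed successor classes for each current class. Only the missing
# UNVOICED -> VOICED entry is taken from the text; the rest is assumed.
ALLOWED = {
    UNVOICED: (UNVOICED, ONSET),
    ONSET: (UNVOICED, ONSET, VOICED),
    VOICED: (UNVOICED, ONSET, VOICED),
}


def classify_frame(scores_per_subframe, previous=UNVOICED):
    """Return one class label per subframe.

    scores_per_subframe: sequence of (unvoiced, onset, voiced) scores in
    the range 0..1. The decision for each subframe depends on the class
    chosen for the preceding subframe, so the model has memory.
    """
    labels = []
    state = previous
    for scores in scores_per_subframe:
        state = max(ALLOWED[state], key=lambda c: scores[c])
        labels.append(state)
    return labels


frame_scores = [
    (0.1, 0.2, 0.9),  # voiced scores highest, but the jump 0 -> 2 is banned
    (0.1, 0.2, 0.9),
    (0.8, 0.1, 0.1),
    (0.2, 0.1, 0.7),
]
labels = classify_frame(frame_scores)
```

With P = 4 subframes the result is a vector of four labels per frame; passing the last label as `previous` for the next frame carries the memory across frame borders.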

Figs. 2a and 2b show, by way of example, the classification results for the speech segment "...parcel, I'd like..." spoken by a female English speaker. The speech frames of length 20 ms were divided into four equidistant subframes of 5 ms each. The DWT was determined only for dyadic scaling steps and implemented on the basis of cubic spline wavelets with the aid of a recursive filter bank. The three signal classes are denoted 0, 1, 2 in the same order as above. For Fig. 2a, telephone-band speech (200 Hz to 3400 Hz) without interference was used, while for Fig. 2b vehicle noise with an average signal-to-noise ratio of 10 dB was additionally superimposed. A comparison of the two figures shows that the classification result is almost independent of the noise level. Apart from minor differences that are irrelevant for speech coding applications, the perceptually important periodic segments and their start and end points are well localized in both cases. Evaluation of a wide variety of speech material showed that the classification error is well below 5% for signal-to-noise ratios above 10 dB.

The classifier was additionally tested for the following typical application: a CELP coding method operates with a frame length of 20 ms and divides this frame into four subframes of 5 ms each for efficient excitation coding. For each subframe, a combination of codebooks matched to the three signal classes mentioned above is to be used on the basis of the classifier. A typical codebook with 9 bits per subframe was used for each class to code the excitation, resulting in a bit rate of only 1800 bit/s for the excitation coding (without gain). A Gaussian codebook was used for the unvoiced class, a two-pulse codebook for the onset class, and an adaptive codebook for the periodic class. Even this simple constellation of codebooks, working with fixed subframe lengths, yielded readily intelligible speech quality, although still with a rough sound in periodic sections. For comparison, in ITU-T, Study Group 15 Contribution - Q. 12/15: Draft Recommendation G.729 - Coding of Speech at 8 kbit/s using Conjugate-Structure-Algebraic-Code-Excited-Linear-Predictive (CS-ACELP) Coding, 1995, 4800 bit/s are required for coding the excitation (without gain) in order to achieve toll quality. Even in Gerson, I. et al., Speech and Channel Coding for the Half-Rate GSM Channel, ITG-Fachbericht "Codierung für Quelle, Kanal und Übertragung", 1994, 2800 bit/s are still used for this purpose to ensure mobile-radio quality.
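The excitation bit rate quoted for this test case follows directly from the frame layout: 9 bits for each of four 5 ms subframes in a 20 ms frame. The following minimal sketch reproduces that arithmetic and the class-to-codebook assignment described above. The function names, the class labels, and the codebook identifiers are hypothetical and chosen only for illustration; they do not appear in the patent.

```python
# Illustrative sketch only: names and labels are assumptions, not from the patent.
FRAME_MS = 20             # CELP frame length in milliseconds
SUBFRAMES_PER_FRAME = 4   # four subframes of 5 ms each
BITS_PER_SUBFRAME = 9     # codebook index size used for every class

# One codebook per signal class, as in the described test case.
CODEBOOK_FOR_CLASS = {
    "unvoiced": "gaussian",
    "onset": "two_pulse",
    "periodic": "adaptive",
}


def excitation_bit_rate(bits_per_subframe=BITS_PER_SUBFRAME,
                        subframes_per_frame=SUBFRAMES_PER_FRAME,
                        frame_ms=FRAME_MS):
    """Return the excitation bit rate in bit/s (gain bits not included)."""
    bits_per_frame = bits_per_subframe * subframes_per_frame
    return bits_per_frame * 1000 // frame_ms


def select_codebooks(subframe_classes):
    """Map the classifier result of each subframe to its codebook."""
    return [CODEBOOK_FOR_CLASS[c] for c in subframe_classes]


if __name__ == "__main__":
    print(excitation_bit_rate())
    print(select_codebooks(["unvoiced", "onset", "periodic", "periodic"]))
```

Because every class uses a codebook of the same size in this configuration, the excitation bit rate stays constant regardless of which class the classifier selects for a subframe.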

Claims

Process for the classification of speech, particularly of speech signals for the signal-adapted control of speech-coding processes for lowering the bit rate with unchanged speech quality or for increasing the quality with an unchanged bit rate,
wherein, after the segmentation of a speech signal, a wavelet transformation is calculated for each frame formed, said wavelet transformation being used to determine, with the aid of adaptive thresholds, a set of parameters (P₁-P₃) which control a state model which divides the speech frame into subframes and subdivides each of said subframes into one of a plurality of classes typical of speech coding.
Process according to claim 1,
wherein the speech signal is divided into segments of constant length and wherein, in order to prevent edge effects in the subsequent wavelet transformation, either the segment is mirrored at the borders or the wavelet transformation is calculated at a smaller interval (L/2, N-L/2) and the frame is displaced only by the constant offset L/2, with the result that the segments overlap or the edges of the segment are filled with the preceding or future sampling values.
Process according to claim 1 or 2,
wherein, for a segment s(k), a discrete-time wavelet transformation (DWT) S_h(mn) is calculated with regard to a wavelet h(k) with the integer parameters of scaling m and time displacement n and the segment is divided into classes on the basis of the transformation coefficients, wherein, particularly in order to obtain a fine resolution with respect to time, there is additional division into P subframes and a classification result is calculated and output for each subframe.
Process according to any one of claims 1 to 3,
wherein a set of parameters, particularly the scaling difference dimension (P₁), time difference dimension (P₂) and periodicity dimension (P₃), is determined from the transformation coefficients S_h(mn), with the aid of which the final classification is then performed, the threshold values required for such parameter calculations being adaptively controlled as a function of the instantaneous level of the background noise.
Arrangement, particularly a classifier, for implementing the process according to any one of claims 1 - 4, wherein
the speech signal is supplied to a segmenting device and wherein, after the segmentation of the speech signal, a discrete wavelet transformation is calculated for each frame formed or for each segment formed, said wavelet transformation being used to determine, with the aid of adaptive thresholds, a set of parameters (P₁-P₃) which are supplied as input variables to a state model which, in turn, divides the speech frame into subframes and subdivides each of said subframes into one of a plurality of classes typical of speech coding.
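As an illustration of the first border-handling option named in claim 2 (segments of constant length, mirrored at the borders before the wavelet transformation), a minimal sketch could look as follows. The function names, the padding width `pad`, and the choice not to repeat the edge sample when mirroring are assumptions made for this example; the claims do not fix these details.

```python
# Illustrative sketch only: helper names and mirroring convention are assumptions.
def segment(signal, n):
    """Split a list of samples into consecutive segments of constant length n.

    A trailing remainder shorter than n is dropped.
    """
    return [signal[i:i + n] for i in range(0, len(signal) - n + 1, n)]


def mirror_extend(seg, pad):
    """Extend a segment by mirroring `pad` samples at each border.

    The edge sample itself is not repeated, so [1, 2, 3, 4] with pad=2
    becomes [3, 2, 1, 2, 3, 4, 3, 2].
    """
    if not 0 <= pad < len(seg):
        raise ValueError("pad must satisfy 0 <= pad < len(seg)")
    left = seg[pad:0:-1]          # seg[pad], ..., seg[1]
    right = seg[-2:-pad - 2:-1]   # seg[-2], ..., seg[-pad-1]
    return left + seg + right


if __name__ == "__main__":
    frames = segment(list(range(10)), 4)
    print(frames)
    print(mirror_extend(frames[0], 2))
```

The extended segment would then be passed to the wavelet transformation, and only the coefficients belonging to the original segment would be kept.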