EP0076233B1

EP0076233B1 - Method and apparatus for redundancy-reducing digital speech processing

Info

Publication number: EP0076233B1
Application number: EP82810390A
Authority: EP
Inventors: Stephan Dr. Horvath; Yung-Shain Wu
Original assignee: Gretag AG
Current assignee: Omnisec AG Te Regensdorf Zwitserland
Priority date: 1981-09-24
Filing date: 1982-09-20
Publication date: 1985-09-11
Also published as: EP0076233A1; US4589131A; ATE15563T1; DE3266204D1; CA1184657A; JPS5870299A

Abstract

A speech section is classified as voiced or unvoiced by a sequence of one-sided decisions. A first test decides "unvoiced" if the normalized energy Es is below a threshold and otherwise leaves the section "ambiguous"; a second test then decides "unvoiced" if the number of zero crossings ZC is above a threshold and otherwise again leaves it ambiguous. Up to six criteria may be tested in this way with an ambiguous outcome before a "voiced" decision is made.

Description

The invention relates to a redundancy-reducing digital speech processing method operating on the principle of linear prediction, and to a corresponding apparatus, according to the preambles of claim 1 and claim 33 respectively.

Such speech processing systems, so-called LPC vocoders, permit a considerable reduction of redundancy in the digital transmission of speech signals. They are becoming increasingly important and are the subject of numerous publications and patents, of which only a few representative ones are cited here purely by way of example:

B. S. Atal and S. L. Hanauer, Journal Acoust. Soc. Am., 50, pp. 637-655, 1971
R. W. Schafer and L. R. Rabiner, Proc. IEEE Vol. 63, No. 4, pp. 662-677, 1975
L. R. Rabiner et al., IEEE Trans. Acoustics, Speech and Signal Proc., Vol. 24, No. 5, pp. 399-418, 1976
B. Gold, Proc. IEEE Vol. 65, No. 12, pp. 1636-1658, 1977
A. Kurematsu et al., Proc. IEEE, ICASSP, Washington 1979, pp. 69-72
S. Horvath, "LPC-Vocoder, Entwicklungsstand und Perspektiven", in the collected colloquium lectures "Krieg im Äther", Series XVII, Bern, 1978
U.S. Patent 3,624,302
U.S. Patent 3,631,520
U.S. Patent 3,909,533

The known and available LPC vocoders do not yet work fully satisfactorily. Although the speech resynthesized after analysis is usually still relatively intelligible, it is distorted and sounds artificial. A main cause lies above all in the difficulty of deciding with sufficient certainty whether a voiced or an unvoiced speech section is present. Further causes are inadequate determination of the pitch period and insufficiently accurate determination of the sound-forming filter parameters.

The present invention is concerned primarily with the first of these difficulties. Its aim is to improve a digital speech processing method or system of the type defined at the outset so that it makes more accurate and more reliable voiced/unvoiced decisions and thereby improves the quality of the synthesized speech.

The method and the apparatus according to the invention are described in claims 1 and 33. Preferred embodiments are set out in the dependent claims.

A number of decision criteria are known for voiced/unvoiced classification, which are applied individually or, in some cases, in combination. Common criteria are, for example, the energy of the speech signal, the number of its zero crossings within a given time interval, the normalized residual error energy, i.e. the ratio of the energy of the prediction error signal to that of the speech signal, and the height of the second maximum of the autocorrelation function of the speech signal or of the prediction error signal. It is also customary to make a cross-comparison with one or more neighboring speech sections. A clear comparative survey of the most important classification criteria and methods can be found, for example, in the publication by L. R. Rabiner et al. cited above.

A common feature of all these known methods and criteria is that two-sided decisions are always made: the speech section is definitively assigned to one or the other of the two possibilities, depending on whether the criterion or criteria in question are met. With a suitable selection and, where appropriate, combination of decision criteria, a relatively high hit rate can be achieved in this way; as practice shows, however, wrong decisions still occur relatively often and considerably impair the quality of the synthesized speech. A main reason is that speech signals, despite all their redundancy, are generally non-stationary, so that it is simply not possible to place the decision thresholds used for the respective criteria such that a reliable statement can be made in both directions. A certain uncertainty always remains and has to be accepted.

Recognizing this, the invention departs from the principle of two-sided decisions, which has been used exclusively until now, and instead uses a strategy in which only one-sided, but practically absolutely certain, decisions are made. In other words, a speech section is unambiguously classified as voiced or unvoiced only if a certain criterion is met. If the criterion is not met, the section is not yet definitively judged to be unvoiced or voiced, respectively, but is subjected to a further classification criterion. There, again, a certain decision is made in one direction only if the criterion in question is met; otherwise the decision procedure continues analogously. This goes on until a certain classification is possible. Extensive investigations have shown that, with a suitable selection and order of the criteria, a maximum of about six to seven decision steps is generally required.
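The one-sided strategy described above can be sketched as a chain of tests, each of which either returns a definite class or abstains. The two tests follow the abstract (low energy means unvoiced, many zero crossings means unvoiced, "voiced" only after all tests abstain). The function names, the feature dictionary and the threshold values are hypothetical and chosen purely for illustration; they are not the patent's actual criteria or thresholds.

```python
from typing import Callable, Optional, Sequence

# A test returns "voiced" or "unvoiced" when it is certain,
# or None when the section remains ambiguous for this criterion.
Test = Callable[[dict], Optional[str]]


def energy_test(features: dict, threshold: float = 0.001) -> Optional[str]:
    # One-sided: very low relative energy is taken as certainly unvoiced.
    # The threshold value is an illustrative assumption.
    return "unvoiced" if features["energy"] < threshold else None


def zero_crossing_test(features: dict, threshold: int = 100) -> Optional[str]:
    # One-sided: a very high zero-crossing count is taken as certainly unvoiced.
    # The threshold value is an illustrative assumption.
    return "unvoiced" if features["zero_crossings"] > threshold else None


def classify(features: dict, tests: Sequence[Test], fallback: str = "voiced") -> str:
    """Run the tests in order and stop at the first definite verdict."""
    for test in tests:
        verdict = test(features)
        if verdict is not None:
            return verdict
    # Every test abstained: fall back to the remaining class.
    return fallback


cascade = [energy_test, zero_crossing_test]
print(classify({"energy": 0.0001, "zero_crossings": 10}, cascade))
print(classify({"energy": 0.5, "zero_crossings": 20}, cascade))
```

Moving a threshold further toward the extreme makes a test abstain more often, which is why more selective criteria tend to lengthen the chain.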

The degree of certainty of the individual decisions is governed by the positions of the respective decision thresholds. The more extreme these thresholds are, the more selective the criteria and the more certain the decisions. With increasing selectivity of the individual criteria, however, the maximum number of decision operations required also rises. In practice it is nevertheless readily possible to set the thresholds so that practically absolute (one-sided) decision certainty is achieved without the total number of criteria or decision operations rising above the figure given above.

The invention is explained in more detail below with reference to the drawing, in which:

Fig. 1 shows a greatly simplified block diagram of a speech digitizing apparatus according to the invention,
Fig. 2 shows a block diagram of a corresponding multi-processor system, and
Figs. 3 and 4 show flow diagrams of two different procedures for the voiced/unvoiced decision.

For analysis, the analog speech signal originating from any source, e.g. a microphone 1, is band-limited in a filter 2 and then sampled and digitized in an A/D converter 3. The sampling rate is about 6 to 16 kHz, preferably about 8 kHz. The resolution is about 8 to 12 bits. The passband of filter 2 usually extends from about 80 Hz to about 3.1-3.4 kHz for so-called wideband speech, and from about 300 Hz to 3.1-3.4 kHz for telephone speech.

For the actual analysis, i.e. the redundancy-reducing processing, that now begins, the digital speech signal s_n is divided into successive, preferably overlapping speech sections, so-called frames. The speech section length can be about 10 to 30 msec, preferably about 20 msec. The frame rate, i.e. the number of frames per second, is about 30 to 100, preferably about 45 to 70. In the interest of high resolution and hence speech quality in synthesis, the shortest possible sections and correspondingly high frame rates are desirable; opposed to this are, on the one hand, the limited performance of the computer used for real-time processing and, on the other hand, the requirement for the lowest possible bit rates in transmission.
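As a rough illustration of the framing step, the sketch below cuts a sample sequence into overlapping frames. The defaults (8 kHz sampling, 20 msec frames, 62.5 frames per second) are one combination picked from the preferred ranges stated above; the function name and the exact choice of 62.5 are assumptions made for this example.

```python
def split_into_frames(samples, sample_rate_hz=8000, frame_ms=20.0, frame_rate_hz=62.5):
    """Split samples into fixed-length frames that start every 1/frame_rate seconds."""
    frame_len = int(round(sample_rate_hz * frame_ms / 1000.0))  # 160 samples by default
    hop = int(round(sample_rate_hz / frame_rate_hz))            # 128 samples by default
    frames = []
    # Only complete frames are kept; a trailing partial frame is dropped.
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(samples[start:start + frame_len])
    return frames


one_second = list(range(8000))
frames = split_into_frames(one_second)
print(len(frames), len(frames[0]))
```

With these defaults, consecutive frames share 32 samples (4 msec), since the hop of 128 samples is shorter than the frame length of 160 samples.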

For each of these speech sections, the speech signal is now analyzed according to the principles of linear prediction, as described, for example, in the publications mentioned at the outset. Linear prediction is based on a parametric model of speech production. A discrete-time all-pole digital filter models the shaping of sound by the throat and mouth (vocal tract). For voiced sounds, the excitation of this filter is a periodic pulse train whose frequency, the so-called pitch frequency, idealizes the periodic excitation by the vocal cords. For unvoiced sounds, the excitation is white noise, an idealization of the air turbulence in the throat when the vocal cords are not excited. Finally, a gain factor controls the loudness. On the basis of this model, the speech signal is thus completely determined by the following parameters:

1. the information as to whether the sound to be synthesized is voiced or unvoiced,
2. the pitch period (or pitch frequency) for voiced sounds (for unvoiced sounds the pitch period is by definition 0),
3. the coefficients of the underlying all-pole digital filter (vocal tract model), and
4. the gain factor.

The analysis accordingly divides essentially into two main procedures: first, the calculation of the gain factor (loudness parameter) and of the coefficients (filter parameters) of the underlying vocal tract model filter, and second, the voiced/unvoiced decision and, in the voiced case, the determination of the pitch period.

The filter coefficients are obtained in a parameter calculator 4 by solving the system of equations that results when the energy of the prediction error, i.e. the energy of the difference between the actual samples and the samples estimated on the basis of the model assumption, is minimized over the interval under consideration (speech section) as a function of the coefficients. The system of equations is preferably solved by the autocorrelation method using Durbin's algorithm (cf. e.g. L. R. Rabiner and R. W. Schafer, "Digital Processing of Speech Signals", Prentice-Hall Inc., Englewood Cliffs, N.J., 1978, pp. 411-413). In addition to the filter coefficients or parameters (a_j), this also yields the so-called reflection coefficients (k_j), which are transforms of the filter coefficients (a_j) that are less sensitive to quantization. For stable filters the reflection coefficients always have a magnitude smaller than 1, and their magnitude moreover decreases with increasing order index. Because of these advantages, the reflection coefficients (k_j) are preferably transmitted instead of the filter coefficients (a_j). The loudness parameter G is obtained from the algorithm as a by-product.
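The following is a minimal sketch of the autocorrelation method with the Durbin recursion in its common textbook form, not code taken from the patent. It assumes the predictor convention s[n] ≈ Σ a_j · s[n-j]; other texts use the opposite sign for a_j or k_j. The function names are hypothetical. The final prediction error energy is returned as well; taking a gain from it (for instance its square root) is a common choice but an assumption here, since the text only says that G falls out as a by-product.

```python
def autocorrelation(frame, max_lag):
    """Short-time autocorrelation r[0..max_lag] of one frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the autocorrelation normal equations recursively.

    Returns (a, k, err): predictor coefficients a_1..a_p, reflection
    coefficients k_1..k_p, and the final prediction error energy.
    """
    a = [0.0] * (order + 1)   # a[0] is unused
    k = [0.0] * (order + 1)   # k[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # degenerate (e.g. all-zero) frame: leave the rest at zero
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k_i = acc / err
        new_a = a[:]
        new_a[i] = k_i
        for j in range(1, i):
            new_a[j] = a[j] - k_i * a[i - j]
        a = new_a
        k[i] = k_i
        err *= (1.0 - k_i * k_i)
    return a[1:], k[1:], err


# Autocorrelation values of an ideal first-order process with coefficient 0.5.
a, k, err = levinson_durbin([1.0, 0.5, 0.25], order=2)
print(a, k, err)
```

For the first-order example the recursion recovers a_1 = 0.5 and a zero second coefficient, and both reflection coefficients stay below 1 in magnitude, in line with the stability property mentioned above.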

To find the pitch period p (the period of the vocal cord fundamental frequency), the digital speech signal s_n is first held in a buffer 5 until the filter parameters (a_j) have been calculated. The signal then passes through an inverse filter 6 set with the parameters (a_j), whose transfer function is the inverse of the transfer function of the vocal tract model filter. The result of this inverse filtering is a prediction error signal e_n, which resembles the excitation signal x_n multiplied by the gain factor G. This prediction error signal e_n is fed, directly in the case of telephone speech or via a low-pass filter 7 in the case of wideband speech, to an autocorrelation stage 8, which forms from it the autocorrelation function AKF normalized to the zero-order autocorrelation maximum. From this function, a pitch extraction stage 9 determines the pitch period p in the known manner as the distance of the second autocorrelation maximum RXX from the first (zero-order) maximum, preferably using an adaptive search method.
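The chain of inverse filtering, normalized autocorrelation and peak search can be sketched as follows. This is an illustration under stated assumptions: the predictor convention e[n] = s[n] - Σ a_j · s[n-j] is assumed, the function names are hypothetical, the low-pass filter 7 is omitted, and a plain maximum search over a fixed lag range stands in for the adaptive search method, which the text does not specify.

```python
def inverse_filter(samples, a):
    """Prediction error e[n] = s[n] - sum_j a[j-1] * s[n-j] (zero initial state)."""
    p = len(a)
    out = []
    for n in range(len(samples)):
        pred = sum(a[j] * samples[n - 1 - j] for j in range(min(p, n)))
        out.append(samples[n] - pred)
    return out


def normalized_autocorrelation(e, max_lag):
    """Autocorrelation of e divided by its zero-lag value."""
    r0 = sum(x * x for x in e)
    if r0 == 0.0:
        return [0.0] * (max_lag + 1)
    return [sum(e[i] * e[i + lag] for i in range(len(e) - lag)) / r0
            for lag in range(max_lag + 1)]


def pitch_period(acf, min_lag, max_lag):
    """Lag and height of the largest autocorrelation value in [min_lag, max_lag]."""
    best = max(range(min_lag, max_lag + 1), key=lambda lag: acf[lag])
    return best, acf[best]


# Idealized residual: one pulse every 50 samples (160 Hz at 8 kHz sampling).
residual = [1.0 if n % 50 == 0 else 0.0 for n in range(400)]
acf = normalized_autocorrelation(residual, 160)
print(pitch_period(acf, 20, 160))
```

The height of the peak returned alongside the lag is the same quantity the description lists among the common voicing criteria (height of the second autocorrelation maximum).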

The significance of the low-pass filter 7 is explained further below. Here it need only be mentioned that it can be bypassed by means of a switch 10 for telephone speech, and that it could also be arranged ahead of the inverse filter 6.

The classification of the speech section under consideration as voiced or unvoiced takes place in a decision stage 11, according to the decision procedure of the invention explained in more detail later; the decision stage is supported by an energy determination stage 12 and a zero-crossing determination stage 13. In the unvoiced case, the pitch parameter p is set to zero.

The parameter calculator described above determines one set of filter parameters per speech section (frame). The filter parameters could of course also be determined differently, for example continuously by means of adaptive inverse filtering or another known method, in which case the filter parameters are readjusted at every sampling clock but are made available for further processing or transmission only at the instants fixed by the frame rate. The invention is in no way restricted in this respect. All that matters is that a set of filter parameters is available for each speech section.

The parameters (k_j), G and p, now all available, are then fed to a coding stage 14, where they are brought into a form suitable for transmission and made available.

The speech signal is recovered, i.e. synthesized, from the parameters in a known manner: the parameters, first decoded in a decoder 15, are fed to a pulse/noise generator 16, an amplifier 17 and a vocal tract model filter 18, and the output signal of the model filter 18 is converted to analog form by a D/A converter 19 and then, after the usual filtering 20, made audible by a reproduction device, e.g. a loudspeaker 21. The pulse/noise generator 16 produces the excitation x_n of the vocal tract model filter 18, which is amplified by the amplifier 17: white noise in the unvoiced case (p = 0) and, in the voiced case (p ≠ 0), a periodic pulse train of the frequency determined by the pitch period p. The loudness parameter G controls the gain of the amplifier 17, and the filter parameters (k_j) define the transfer function of the sound-forming (vocal tract) model filter 18.
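A rough sketch of the synthesis side for a single frame is given below. It is a simplification under several assumptions: the all-pole filter is run in direct form with predictor coefficients a_j (the text transmits reflection coefficients k_j, which would first have to be converted or used in a lattice structure), filter state is not carried over between frames, unit-variance Gaussian noise and unit pulses are used as excitation, and all names are hypothetical.

```python
import random


def excitation(n_samples, pitch_period, rng):
    """White noise if pitch_period == 0 (unvoiced), else a periodic unit pulse train."""
    if pitch_period == 0:
        return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    return [1.0 if n % pitch_period == 0 else 0.0 for n in range(n_samples)]


def all_pole_filter(x, a, gain):
    """y[n] = gain * x[n] + sum_j a[j-1] * y[n-j], starting from zero state."""
    y = []
    for n, xn in enumerate(x):
        feedback = sum(a[j] * y[n - 1 - j] for j in range(min(len(a), n)))
        y.append(gain * xn + feedback)
    return y


def synthesize_frame(n_samples, pitch_period, a, gain, rng):
    return all_pole_filter(excitation(n_samples, pitch_period, rng), a, gain)


rng = random.Random(0)
voiced = synthesize_frame(160, 50, [0.5], 2.0, rng)
unvoiced = synthesize_frame(160, 0, [0.5], 2.0, rng)
print(voiced[:3], len(unvoiced))
```

A practical synthesizer would also keep the pulse phase and the filter memory continuous from one frame to the next, which this single-frame sketch deliberately ignores.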

For ease of understanding, the general structure and function of the speech processing apparatus according to the invention have been explained above in terms of discrete functional stages. It is self-evident to the person skilled in the art, however, that all functions or functional stages between the analysis-side A/D converter 3 and the synthesis-side D/A converter 19, i.e. those in which digital signals are processed, are in practice preferably implemented by a suitably programmed computer, microprocessor or the like. The software realization of the individual functional stages, such as the parameter calculator, the various digital filters, the autocorrelation, etc., is routine for the person skilled in data processing and is described in the literature (see e.g. IEEE Digital Signal Processing Committee, "Programs for Digital Signal Processing", IEEE Press, 1980).

Real-time applications require extremely powerful computers, particularly at high sampling rates and with short speech sections, because of the large number of operations that must then be handled in a very short time. For such purposes, multi-processor systems with a suitable division of tasks are best used. An example of such a system is shown as a block diagram in Fig. 2.

The multi-processor system shown essentially comprises four functional blocks: a main processor 50, two secondary processors 60 and 70, and an input/output unit 80. It implements both analysis and synthesis.

The input/output unit 80 contains the stages, designated 81, for analog signal processing, such as amplifiers, filters and automatic gain control, together with the A/D converter and the D/A converter.

The main processor 50 performs the actual speech analysis and synthesis. On the analysis side this comprises the determination of the filter parameters and the loudness parameter (parameter calculator 4), the determination of the energy and zero crossings of the speech signal (stages 12 and 13), the voiced/unvoiced decision (stage 11) and the determination of the pitch period (stage 9); on the synthesis side it comprises the generation of the output signal (stage 16), its loudness variation (stage 17) and its filtering in the speech model filter (filter 18).

The main processor 50 is supported by the secondary processor 60, which carries out the intermediate storage (buffer 5), the inverse filtering (stage 6), where applicable the low-pass filtering (stage 7), and the autocorrelation (stage 8).

Finally, the secondary processor 70 deals exclusively with the coding and decoding of the speech parameters and with the data traffic with, for example, a modem 90 or the like via an interface designated 71.

The voiced-unvoiced decision procedure is explained in more detail below. It should first be mentioned that the voiced-unvoiced decision and the determination of the pitch period are preferably based on a longer analysis interval than the determination of the filter coefficients. For the latter, the analysis interval is equal to the speech section under consideration. For pitch extraction, by contrast, the analysis interval extends on both sides of the speech section into the respective adjacent speech section, for example up to about half of it. In this way a more reliable and less erratic pitch extraction can be carried out. It should also be made clear that wherever the energy of a signal is mentioned below, this always means the relative energy of the signal in the analysis interval, that is, the energy normalized to the dynamic range of the A/D converter 3.
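The extended analysis interval described above can be sketched as follows. This is a minimal illustration, assuming the speech sections are available as a list of equal-length sample lists; the function name `analysis_interval` and the `overlap` parameter are hypothetical and do not appear in the patent.

```python
def analysis_interval(sections, k, overlap=0.5):
    """Return section k extended into both neighbours.

    `overlap` is the fraction of each adjacent section that is included
    (the text suggests roughly one half). At the start or end of the
    signal the missing neighbour is simply omitted (an assumption).
    """
    current = sections[k]

    # Tail of the preceding section, if there is one.
    prev_tail = []
    if k > 0:
        prev = sections[k - 1]
        n = int(len(prev) * overlap)
        prev_tail = prev[len(prev) - n:]

    # Head of the following section, if there is one.
    next_head = []
    if k + 1 < len(sections):
        nxt = sections[k + 1]
        next_head = nxt[:int(len(nxt) * overlap)]

    return prev_tail + current + next_head
```

With an overlap of one half, the interval used for the voiced-unvoiced decision and for pitch extraction is about twice as long as the section used for the filter coefficients.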

The basic principle of the voiced-unvoiced decision according to the invention is, as already explained above, that only reliable decisions are made. A decision is regarded as reliable if it has an accuracy of at least 97%, preferably considerably higher and in particular even absolute, or a correspondingly low statistical error rate.

Figs. 3 and 4 show the flow diagrams of two particularly expedient decision sequences according to the invention: Fig. 3 shows a variant for broadband speech and Fig. 4 one for telephone speech.

According to Fig. 3, an energy test is carried out as the first decision criterion. The (relative, normalized) energy E_s of the speech signal s_n is compared with a minimum energy threshold EL, which is set so low that the speech section can be classified as unvoiced with certainty if the energy E_s does not exceed this threshold. Practical values for this minimum energy threshold EL are 1.1 × 10⁻⁴ to 1.4 × 10⁻⁴, preferably about 1.2 × 10⁻⁴. These values apply when all digital samples are represented in unit format (range ±1). For other signal formats the values must be multiplied by corresponding factors.
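A minimal sketch of this first energy test is given below. The patent does not spell out the energy formula, so the mean squared amplitude of the unit-format samples is assumed here; the function names are hypothetical, and only the threshold value is taken from the text.

```python
E_L = 1.2e-4  # preferred minimum energy threshold for broadband speech (from the text)


def relative_energy(samples):
    """Mean squared amplitude of samples in unit format (range -1..+1).

    Assumed definition of the relative energy; the patent only states
    that the energy is normalized to the A/D converter's dynamic range.
    """
    if not samples:
        return 0.0
    return sum(x * x for x in samples) / len(samples)


def first_energy_test(samples, threshold=E_L):
    """Return 'unvoiced' if the energy does not exceed the threshold.

    Returns None when the test is inconclusive, so that the next
    criterion can be consulted.
    """
    if relative_energy(samples) <= threshold:
        return "unvoiced"
    return None
```

The test is one-sided: a low energy yields a firm "unvoiced" verdict, while a high energy yields no verdict at all.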

If the energy E_s of the speech signal lies above this threshold, no unambiguous statement can be made, and a zero-crossing test follows as the next criterion. The number of zero crossings of the digital speech signal in the analysis interval is determined and compared with a maximum number ZCU. If the number exceeds this maximum, the speech section is unambiguously evaluated as unvoiced; otherwise a further decision criterion is applied. For a decision that is sufficiently reliable in practice, the maximum number ZCU is about 105 to 120, preferably about 110 zero crossings for an analysis interval length of 256 samples.
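The zero-crossing test can be sketched as follows. Treating a sample of exactly zero as non-negative is an assumption made here for illustration, and the function names are hypothetical; the threshold is the preferred value for 256 samples given in the text.

```python
ZCU = 110  # preferred maximum number of zero crossings per 256 samples (from the text)


def count_zero_crossings(samples):
    """Count sign changes between consecutive samples.

    A sample equal to zero is treated as non-negative (an assumption).
    """
    crossings = 0
    for a, b in zip(samples, samples[1:]):
        if (a >= 0) != (b >= 0):
            crossings += 1
    return crossings


def first_zero_crossing_test(samples, max_count=ZCU):
    """Return 'unvoiced' if the count exceeds max_count, else None."""
    if count_zero_crossings(samples) > max_count:
        return "unvoiced"
    return None
```

A noise-like section changes sign far more often than a voiced one, which is why only a high count leads to a verdict here.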

The stated order of energy test and zero-crossing test has proven itself in practice. It could, however, also be reversed, in which case the decision thresholds would have to be modified.

The next decision criterion is the normalized autocorrelation function AKF of the low-pass filtered prediction error signal e_n. Specifically, the normalized autocorrelation maximum RXX, which lies at a distance from the zero-order maximum given by the index IP, is compared with a threshold value RU, and the section is evaluated as voiced if this threshold is exceeded. Otherwise the procedure moves on to the next criterion. Practically favorable values for the threshold are 0.55 to 0.75, preferably about 0.6.
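The following sketch shows one way to obtain RXX and IP from a prediction error signal. The lag search range (`min_lag`, `max_lag`) is not specified in the text and is a hypothetical parameter here, as are the function names; the inverse filtering and low-pass filtering steps are assumed to have been done already.

```python
R_U = 0.6  # preferred threshold for broadband speech (from the text)


def normalized_autocorr_peak(e, min_lag, max_lag):
    """Return (RXX, IP) for the residual signal e.

    RXX is the largest autocorrelation value in the lag range
    min_lag..max_lag, divided by the zero-lag value; IP is the lag at
    which it occurs. The lag range is an assumption of this sketch.
    """
    r0 = sum(x * x for x in e)
    if r0 == 0.0:
        return 0.0, 0

    best = None
    for lag in range(min_lag, min(max_lag, len(e) - 1) + 1):
        r = sum(e[i] * e[i + lag] for i in range(len(e) - lag)) / r0
        if best is None or r > best[0]:
            best = (r, lag)
    return best if best is not None else (0.0, 0)


def first_autocorr_test(rxx, threshold=R_U):
    """Return 'voiced' if RXX exceeds the threshold, else None."""
    return "voiced" if rxx > threshold else None
```

For a strictly periodic residual, the peak sits at the period and its height approaches 1, so the index IP is also a natural candidate for the pitch period.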

Next, the energy of the low-pass filtered prediction error signal e_n is examined, or more precisely its ratio V_o to the energy E_s of the speech signal. If this energy ratio V_o is smaller than a first, lower ratio threshold VL, the speech section is evaluated as voiced. Otherwise a further comparison is made with a second, higher ratio threshold VU, and the section is decided to be unvoiced if the energy ratio V_o lies above this higher threshold VU. This second comparison may also be omitted.

Suitable values for the two ratio thresholds VL and VU are 0.05 to 0.15 and 0.6 to 0.75 respectively, preferably about 0.1 and 0.7 respectively.
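The two-sided residual energy test can be sketched as below. The function name and the convention of passing `v_high=None` to skip the optional second comparison are assumptions of this sketch; the thresholds are the preferred broadband values from the text.

```python
V_L = 0.1  # preferred lower ratio threshold (from the text)
V_U = 0.7  # preferred upper ratio threshold (from the text)


def residual_energy_test(residual_energy, speech_energy, v_low=V_L, v_high=V_U):
    """Classify by the ratio V_o = residual energy / speech energy.

    Returns 'voiced' if V_o is below v_low, 'unvoiced' if V_o is above
    v_high, and None otherwise. Pass v_high=None to omit the optional
    second comparison.
    """
    if speech_energy <= 0.0:
        return None  # ratio undefined; leave the decision to other tests
    v_o = residual_energy / speech_energy
    if v_o < v_low:
        return "voiced"
    if v_high is not None and v_o > v_high:
        return "unvoiced"
    return None
```

A small ratio means the linear predictor explains most of the signal, which is typical of voiced sections; ratios between the two thresholds remain undecided.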

If this examination of the residual error energy has also failed to produce an unambiguous result, a further zero-crossing test is carried out with a lower decision threshold, or maximum number, ZCL, and the section is decided to be unvoiced if this maximum number is exceeded. Suitable values for this lower maximum number ZCL are 70 to 90, preferably about 80, per 256 samples.

If doubt remains, a further energy test is carried out as the next decision criterion. The energy E_s of the speech signal is compared with a second, higher minimum energy threshold EU, and this time the section is decided to be voiced if the energy E_s of the speech signal exceeds this threshold EU. Practical values for this higher minimum energy threshold EU are 1.3 × 10⁻³ to 1.8 × 10⁻³, preferably about 1.5 × 10⁻³.

If there is still no unambiguous decision, the autocorrelation maximum RXX is first compared with a second, lower threshold value RM. If this threshold is exceeded, the section is decided to be voiced. Otherwise, as the last criterion, a cross comparison is made with the two (or, where applicable, only one) immediately preceding speech sections. The speech section is evaluated as unvoiced only if both preceding speech sections (or the one preceding section) were also unvoiced. Otherwise the section is finally decided to be voiced. Suitable values for the threshold RM are 0.35 to 0.45, preferably about 0.42.
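Taken together, the broadband decision sequence of Fig. 3 can be sketched as a chain of one-sided tests. The sketch assumes the four features have already been computed for the analysis interval; the `Features` class, the function name and the handling of fewer than two preceding sections are assumptions made for illustration, while the thresholds are the preferred values from the text.

```python
from dataclasses import dataclass


@dataclass
class Features:
    es: float   # relative energy E_s of the speech signal
    zc: int     # zero crossings per 256 samples
    rxx: float  # normalized autocorrelation maximum RXX of the residual
    vo: float   # energy ratio V_o (residual energy / speech energy)


def decide_broadband(f, previous):
    """Return 'voiced' or 'unvoiced' for one speech section.

    `previous` holds the decisions for earlier sections, most recent last.
    """
    if f.es <= 1.2e-4:   # first energy test (EL)
        return "unvoiced"
    if f.zc > 110:       # first zero-crossing test (ZCU)
        return "unvoiced"
    if f.rxx > 0.6:      # first autocorrelation test (RU)
        return "voiced"
    if f.vo < 0.1:       # residual energy test, lower threshold (VL)
        return "voiced"
    if f.vo > 0.7:       # residual energy test, upper threshold (VU)
        return "unvoiced"
    if f.zc > 80:        # second zero-crossing test (ZCL)
        return "unvoiced"
    if f.es > 1.5e-3:    # second energy test (EU)
        return "voiced"
    if f.rxx > 0.42:     # second autocorrelation test (RM)
        return "voiced"
    # Cross comparison: unvoiced only if the preceding sections were unvoiced.
    # With no history at all, the sketch falls back to 'voiced' (assumption).
    last_two = previous[-2:]
    if last_two and all(d == "unvoiced" for d in last_two):
        return "unvoiced"
    return "voiced"
```

Each test either returns a firm verdict or falls through, so a section reaches the cross comparison only after all eight threshold tests have been inconclusive.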

As already mentioned above, the prediction error signal e_n is low-pass filtered in the case of broadband speech. This low-pass filtering separates the frequency distributions of the autocorrelation maxima of unvoiced and voiced speech sections and thereby makes it easier to set the decision threshold while at the same time reducing the error rate. It also enables better pitch extraction, i.e. determination of the pitch period. An essential condition, however, is that the low-pass filtering is carried out with an extremely steep slope of approximately 150 to 180 dB/octave. The (digital) filter used should have an elliptic characteristic, and the cutoff frequency should lie in the range of 700 to 1200 Hz, preferably 800 to 900 Hz.

In the case of telephone speech, which, unlike broadband speech, lacks the frequency range below 300 Hz, this low-pass filtering brings no advantages and is in fact rather detrimental. It is therefore omitted for telephone speech. This can be accomplished simply by closing the switch 10 or in software (by not executing the corresponding program part).

The decision sequence for telephone speech shown in Fig. 4 largely corresponds to that for broadband speech. Only the order of the second energy test and the second zero-crossing test is interchanged (this is not mandatory), and the second test of the autocorrelation maximum RXX is omitted, since it would yield nothing for telephone speech. Some of the individual decision thresholds are set differently, in accordance with the differences between telephone speech and broadband speech. Values that are favorable in practice are shown in the table below.
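The telephone-speech sequence can also be written in a table-driven form, which makes the interchanged test order easy to see. The sketch below uses the preferred telephone-speech values stated in claims 25 to 31; the dictionary keys, the function names and the handling of fewer than two preceding sections are assumptions made for illustration.

```python
# Preferred telephone-speech thresholds as stated in claims 25 to 31.
TELEPHONE = {
    "EL": 1.5e-5, "ZCU": 130, "RU": 0.25,
    "VL": 0.1, "VU": 0.6, "EU": 1.5e-3, "ZCL": 110,
}


def telephone_rules(t):
    """Ordered (test, verdict) pairs following the sequence of Fig. 4."""
    return [
        (lambda f: f["es"] <= t["EL"], "unvoiced"),  # first energy test
        (lambda f: f["zc"] > t["ZCU"], "unvoiced"),  # first zero-crossing test
        (lambda f: f["rxx"] > t["RU"], "voiced"),    # autocorrelation test
        (lambda f: f["vo"] < t["VL"], "voiced"),     # residual energy, lower
        (lambda f: f["vo"] > t["VU"], "unvoiced"),   # residual energy, upper
        (lambda f: f["es"] > t["EU"], "voiced"),     # second energy test
        (lambda f: f["zc"] > t["ZCL"], "unvoiced"),  # second zero-crossing test
    ]


def decide(features, rules, previous):
    """Apply the rules in order; fall back to the cross comparison."""
    for test, verdict in rules:
        if test(features):
            return verdict
    # Unvoiced only if the preceding sections were unvoiced; with no
    # history the sketch falls back to 'voiced' (assumption).
    last_two = previous[-2:]
    if last_two and all(d == "unvoiced" for d in last_two):
        return "unvoiced"
    return "voiced"
```

Because the rules are plain data, reordering the tests or swapping in another threshold set only requires a different list, with each rule still giving a verdict in one direction only.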

With the two decision sequences described above, a voiced-unvoiced decision with extremely low error rates was achieved. It goes without saying that the order of the criteria, and the criteria themselves, could in principle also be different. The only essential point is that each criterion is used to make reliable decisions only.

Claims

1. Redundancy-reducing speech processing process by the method of linear prediction, wherein the digital speech signal obtained by the scanning of an optionally band limited analog speech signal is divided into sections and for each speech section the parameters of a speech model filter are calculated and a voiced-unvoiced decision made, and in the voiced case the period of the voice band base frequency (pitch period) determined, characterized in that for the voiced-unvoiced decision the speech signal or the signal derived therefrom is analyzed initially according to a first threshold value criterion, with the threshold value being chosen so that upon the satisfaction of the criterion a decision is reached that is at least 97%, preferably 100%, secure, that if the first criterion is not satisfied the speech signal or a signal derived therefrom is analyzed according to a second, different threshold value criterion, wherein the threshold value is chosen so that upon the satisfaction of the criterion a decision is reached that is at least 97%, preferably 100%, secure, and that if the second criterion is not satisfied, the speech signal or the signal derived therefrom is subjected to at least one further different decision criterion.

2. Process according to Claim 1, characterized in that the first criterion is an energy test, whereby the relative energy (E_s) of the speech signal is determined and the speech section is evaluated as unvoiced if the energy (E_s) does not exceed a minimum energy threshold (EL).

3. Process according to Claim 1, characterized in that the first criterion is a zero crossing test, wherein the number (ZC) of zero crossings of the speech signal is determined and the speech section is evaluated as unvoiced if the number (ZC) exceeds a maximum number (ZCU).

4. Process according to Claim 2, characterized in that the second criterion is a zero crossing test according to Claim 3.

5. Process according to one of the preceding claims, characterized in that a further criterion is a threshold value test of a standardized autocorrelation function (AKF), obtained by the autocorrelation of the prediction error signal formed from the digitized speech signal by means of an inverse filter with a transfer function inverse to the speech model filter, whereby the section is evaluated as voiced if the second maximum (RXX) of the standardized autocorrelation function (AKF) exceeds a threshold value (RU).

6. Process according to one of the preceding claims, characterized in that a further criterion is a residual error energy test, wherein the prediction error signal is formed by means of an inverse filter with a transfer function inverse to the speech model filter and the energy of said prediction error signal and the energy (E_s) of the speech signal determined and whereby further the ratio (V_o) of the energy of the prediction error signal to the energy (E_s) of the speech signal is formed and compared with a lower ratio threshold (VL), and the speech section is evaluated as voiced if said ratio (V_o) is smaller than the threshold (VL).

7. Process according to Claim 6, characterized in that the energy ratio (V_o) is additionally compared with an upper ratio threshold (VU) and the speech section is evaluated as unvoiced if the said ratio (V_o) is larger than the said upper threshold (VU).

8. Process according to Claim 2 or 4 and one of Claims 5 to 7, characterized in that a further decision criterion is a second energy test, whereby the energy (E_s) of the speech signal is compared with a second, higher minimum energy threshold (EU) and the speech section is evaluated as voiced if the energy (E_s) exceeds the said minimum energy threshold (EU).

9. Process according to Claim 3 or 4 and one of Claims 5 to 8, characterized in that a further decision criterion is a second zero crossing test, whereby the number (ZC) of zero crossings of the speech signal is compared with a second, lower maximum number (ZCL) and the speech section is evaluated as unvoiced if the number (ZC) exceeds the said lower maximum number (ZCL).

10. Process according to Claim 5 and one of Claims 6 and 7, characterized in that a further decision criterion consists of a second threshold value test of the standardized autocorrelation function (AKF), whereby the section is evaluated as voiced if the second maximum (RXX) of the standardized autocorrelation function (AKF) is higher than a second, lower threshold value (RM).

11. Process according to one of the preceding claims, characterized in that a further decision criterion is a cross comparison with preferably two to three speech sections immediately preceding the speech section under consideration, whereby the speech section is evaluated as unvoiced only if all of said preceding speech sections were also unvoiced.

12. Process according to Claim 5 and one of Claims 6 to 11, characterized in that the speech signal passed to the inverse filter to form the prediction error signal or the prediction error signal is low pass filtered prior to the autocorrelation.

13. Process according to Claims 4 to 12, characterized in that the voiced-unvoiced decision is effected by means of the decision criteria of the first energy test, first zero crossing test, first threshold value test of the autocorrelation function, residual error energy test or tests, second zero crossing test, second energy test, second threshold value test of the autocorrelation function and cross comparison.

14. Process according to Claims 4 to 9 and 11, characterized in that the voiced-unvoiced decision is effected by means of the decision criteria of the first energy test, first zero crossing test, first threshold value test of the autocorrelation function, residual error energy test or tests, second energy test, second zero crossing test and cross comparison.

15. Process according to Claim 12, characterized in that the low pass filtering of the prediction error signal is effected with a cutoff frequency of 700 to 1200 Hz, preferably 800 to 900 Hz.

16. Process according to Claims 12 or 15, characterized in that the low pass filtering is effected by means of a steep flanked digital filter (7) with an elliptical characteristic and a flank slope of at least 150 to 180 dB/octave.

17. Process according to Claim 5, characterized in that in the case of wide band speech the threshold value (RU) is chosen within a range of 0.55 to 0.75, preferably approximately 0.6, with respect to the autocorrelation maximum of zero order.

18. Process according to Claim 10, characterized in that in case of wide band speech the lower threshold value (RM) is chosen within a range of 0.35 to 0.45, preferably approximately 0.42, with respect to the autocorrelation maximum of zero order.

19. Process according to Claim 2, characterized in that in case of wide band speech the minimum energy threshold (EL) is chosen within a range of 1.1 × 10⁻⁴ to 1.4 × 10⁻⁴, preferably approximately 1.2 × 10⁻⁴.

20. Process according to Claim 8, characterized in that in case of wide band speech the higher minimum energy threshold (EU) is chosen within a range of 1.3 × 10⁻³ to 1.8 × 10⁻³, preferably 1.5 × 10⁻³.

21. Process according to Claim 3, characterized in that in case of wide band speech the maximum number (ZCU) is chosen within a range of 105 to 120, preferably approximately 110, with respect to a speech section length of 256 scanning values.

22. Process according to Claim 9, characterized in that in case of wide band speech the lower maximum number (ZCL) is chosen within a range of 70 to 90, preferably approximately 80, with respect to a speech section length of 256 scanning values.

23. Process according to Claim 6, characterized in that in case of wide band speech the upper ratio threshold (VU) is chosen within a range of 0.6 to 0.75, preferably approximately 0.7.

24. Process according to Claim 7, characterized in that in case of wide band speech the lower ratio threshold (VL) is chosen within a range of 0.05 to 0.15, preferably approximately 0.1.

25. Process according to Claim 5, characterized in that in telephone speech the threshold value (RU) is chosen within a range of 0.2 to 0.4, preferably approximately 0.25, with respect to the autocorrelation maximum of zero order.

26. Process according to Claim 2, characterized in that in telephone speech the minimum energy threshold (EL) is chosen within a range of 1.4 × 10⁻⁵ to 1.6 × 10⁻⁵, preferably approximately 1.5 × 10⁻⁵.

27. Process according to Claim 8, characterized in that in telephone speech the higher minimum energy threshold (EU) is chosen within a range of 1.3 × 10⁻³ to 1.8 × 10⁻³, preferably approximately 1.5 × 10⁻³.

28. Process according to Claim 3, characterized in that in telephone speech the maximum number (ZCU) is chosen within a range of 120 to 140, preferably approximately 130, with respect to a speech section length of 256 scanning values.

29. Process according to Claim 9, characterized in that in telephone speech the lower maximum number (ZCL) is chosen within a range of 100 to 120, preferably approximately 110, with respect to a speech section length of 256 scanning values.

30. Process according to Claim 6, characterized in that in telephone speech the upper ratio threshold (VU) is chosen within a range of 0.5 to 0.7, preferably approximately 0.6.

31. Process according to Claim 7, characterized in that in telephone speech the lower ratio threshold (VL) is chosen within a range of 0.05 to 0.15, preferably approximately 0.1.

32. Process according to one of the preceding claims, characterized in that for the voiced-unvoiced decision a decision speech section is analyzed, which is composed of the speech section for which the decision is to be rendered, and at least a part of the two speech sections adjacent to the speech section under consideration.

33. Apparatus for the realization of the process according to one of the preceding claims, with a signal processing part which cyclically scans the analog speech signal and digitizes the scanning values obtained, and with an analysis part which analyzes the digitized speech signal in sections and comprises a parameter computer, a pitch decision stage and a pitch calculation stage, characterized in that the analysis part is a multiprocessor system with a principal processor (50) and two secondary processors (60, 70), wherein one secondary processor (60) intermediately stores the speech signal, produces the prediction error signal from the intermediately stored speech signal by means of inverse filtering and from this, optionally after low pass filtering, forms the standardized autocorrelation function, while the principal processor (50) performs the analysis of the speech signal itself, and the other secondary processor (70) is responsible for coding of the speech parameters determined by the principal processor in combination with the first secondary processor.