EP1224531B1 - Method for detecting the time sequences of a fundamental frequency of an audio-response unit to be synthesised - Google Patents

Method for detecting the time sequences of a fundamental frequency of an audio-response unit to be synthesised

Info

Publication number
EP1224531B1
EP1224531B1 (application EP00984858A)
Authority
EP
European Patent Office
Prior art keywords
fundamental
frequency
macrosegment
fundamental frequency
sequences
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Lifetime
Application number
EP00984858A
Other languages
German (de)
French (fr)
Other versions
EP1224531A2 (en)
Inventor
Martin Holzapfel
Caglayan Erdem
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Siemens AG
Original Assignee
Siemens AG
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Siemens AG filed Critical Siemens AG
Publication of EP1224531A2 publication Critical patent/EP1224531A2/en
Application granted granted Critical
Publication of EP1224531B1 publication Critical patent/EP1224531B1/en
Anticipated expiration legal-status Critical
Expired - Lifetime legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/90 Pitch determination of speech signals
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L2025/783 Detection of presence or absence of voice signals based on threshold decision
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Definitions

  • The invention relates to a method for determining the temporal course of a fundamental frequency of a speech output to be synthesized.
  • The invention is therefore based on the object of creating a method for determining the temporal course of a fundamental frequency of a speech output to be synthesized that gives the speech output a natural sound very similar to a human voice.
  • The present invention is based on the finding that determining the course of a fundamental frequency by means of a neural network produces a macrostructure of the temporal course that is very similar to the course of the fundamental frequency of natural speech, and that fundamental frequency sequences stored in a database reproduce the microstructure of the fundamental frequency of natural speech very closely.
  • An optimal determination of the course of the fundamental frequency is thus achieved that, in both its macrostructure and its microstructure, is much more similar to natural speech than a fundamental frequency generated with previously known methods. A considerable approximation of the synthetic speech output to natural speech is thereby achieved.
  • The synthetic speech created in this way is very similar to natural speech and can hardly be distinguished from it.
  • The deviation between the replica macro segment and the default macro segment is preferably determined by means of a cost function that is weighted such that small deviations from the fundamental frequency of the default macro segment yield only a small value, whereas once predetermined limit frequency differences are exceeded, the determined deviations rise steeply until a saturation value is reached.
  • Deviations are weighted less the closer they are to the edge of a syllable.
  • The default macro segment is preferably reproduced by generating several fundamental frequency sequences for each microprosodic unit, with combinations of fundamental frequency sequences being rated both with respect to their deviation from the default macro segment and with respect to their pairwise matching. Depending on the outcome of these two ratings (deviation from the default macro segment, matching between adjacent fundamental frequency sequences), an appropriate combination of fundamental frequency sequences is then selected.
  • The pairwise matching evaluates in particular the transitions between adjacent fundamental frequency sequences; larger jumps should be avoided here.
  • The syllable nucleus is decisive for the auditory impression.
  • Fig. 6 shows, in a flow chart, a method for synthesizing speech in which a text is converted into a sequence of acoustic signals.
  • This method is implemented in the form of a computer program that is started in step S1.
  • In step S2 a text is entered, which is present in the form of an electronically readable text file.
  • A sequence of phonemes, i.e. a sound sequence, is created from the individual graphemes of the text, a grapheme being one or more letters to each of which a phoneme is assigned.
  • The phonemes assigned to the individual graphemes are then determined, which defines the phoneme sequence.
  • In step S4 a stress structure is determined, i.e. it is determined how strongly the individual phonemes are to be stressed.
  • The stress structure is represented in Fig. 1a by means of a timeline for the word "stop". Accordingly, the grapheme "st" has been assigned stress level 1, the grapheme "o" stress level 0.3 and the grapheme "p" stress level 0.5.
  • The duration of the individual phonemes is then determined (S5).
  • In step S6 the temporal course of the fundamental frequency is determined, which is explained in more detail below.
  • The wave file is converted into acoustic signals by means of an acoustic output unit and a loudspeaker (S8), which ends the speech output (S9).
  • The temporal course of the fundamental frequency of the speech output to be synthesized is generated by means of a neural network in combination with fundamental frequency sequences stored in a database.
  • The method corresponding to step S6 in Fig. 6 is shown in more detail in a flow chart in Fig. 5.
  • This method for determining the temporal course of the fundamental frequency is a subroutine of the program shown in Fig. 6.
  • The subroutine is started with step S10.
  • In step S11 a default macro segment of the fundamental frequency is determined by means of a neural network, which is shown in schematically simplified form in Fig. 4.
  • At an input layer I, the neural network has nodes for entering a phonetic linguistic unit PE of the text to be synthesized and a context Kl, Kr to the left and right of the phonetic linguistic unit.
  • The phonetic linguistic unit consists, for example, of a phrase, a word or a syllable of the text to be synthesized, for which the default macro segment of the fundamental frequency is to be determined.
  • The left context Kl and the right context Kr each represent a section of text to the left and right of the phonetic linguistic unit PE. The data entered with the phonetic unit comprise the corresponding phoneme sequence, the stress structure and the duration of each phoneme.
  • The information entered with the left or right context comprises at least the phoneme sequence; it can also be useful to enter the stress structure and/or the sound durations.
  • The length of the left and right context can correspond to the length of the phonetic linguistic unit PE, i.e. again be a phrase, a word or a syllable. However, it can also be useful to provide a longer context of, for example, two or three words as the left or right context.
  • These inputs Kl, PE and Kr are processed in a hidden layer VS and output at an output layer O as the default macro segment VG of the fundamental frequency.
  • Fig. 1b shows such a default macro segment for the word "stop".
  • This default macro segment has a typical triangular course, which initially begins with a rise and ends with a somewhat shorter fall.
  • In step S12, fundamental frequency sequences assigned to graphemes are read out from a database, usually with a large number of fundamental frequency sequences being available for each grapheme. Fig. 1c schematically shows such fundamental frequency sequences for the graphemes "st", "o" and "p", only a small number of fundamental frequency sequences being shown in order to simplify the drawing.
  • A cost factor Kf is calculated for each combination of fundamental frequency sequences by means of a cost function.
  • The cost function has two terms, a local cost function lok(k_ij) and a link cost function Ver(k_ij, k_{n,j+1}).
  • The local cost function evaluates the deviation of the i-th fundamental frequency sequence of the j-th phoneme from the default macro segment.
  • The link cost function evaluates the matching between the i-th fundamental frequency sequence of the j-th phoneme and the n-th fundamental frequency sequence of the (j+1)-th phoneme.
  • The local cost function has, for example, the following form: an integral, over the time range from the beginning t_a of a phoneme to its end t_e, of the square of the difference between the fundamental frequency f_v specified by the default macro segment and the i-th fundamental frequency sequence of the j-th phoneme.
  • This local cost function thus determines a positive value for the deviation between the respective fundamental frequency sequence and the fundamental frequency of the default macro segment. Moreover, this cost function is very easy to implement and, due to its parabolic characteristic, produces a rating that resembles human hearing, since small deviations around the default sequence f_v are rated low, whereas larger deviations are rated progressively more heavily.
  • The local cost function is provided with a weighting term which leads to the function curve shown in Fig. 2.
  • The diagram in Fig. 2 shows the value of the local cost function lok(f_ij) as a function of the logarithm of the frequency f_ij of the i-th fundamental frequency sequence of the j-th phoneme. The diagram shows that deviations from the default frequency f_v within certain limit frequencies GF1, GF2 are rated only slightly, while a further deviation causes a steep rise up to a threshold value SW. Such a weighting corresponds to human hearing, which hardly perceives small frequency deviations but, beyond certain frequency differences, registers them as a clear difference.
  • The link cost function evaluates how well two successive fundamental frequency sequences are matched to each other. In particular, the frequency difference at the junction of the two fundamental frequency sequences is rated: the greater the difference between the frequency at the end of the preceding fundamental frequency sequence and the frequency at the beginning of the subsequent fundamental frequency sequence, the larger the output value of the link cost function.
  • Further parameters can also be taken into account, e.g. the continuity of the transition or the like.
  • The output value of the link cost function is weighted less the closer the respective junction of two neighboring fundamental frequency sequences is to the edge of a syllable. This corresponds to human hearing, which analyzes acoustic signals at the edge of a syllable less intensely than in the middle of the syllable. Such a weighting is also referred to as perceptually dominant.
  • According to the cost function Kf, the values of the local cost function and the link cost function of all fundamental frequency sequences are determined and summed for each combination of fundamental frequency sequences of the phonemes of a linguistic unit for which a default macro segment has been determined. From the set of combinations of fundamental frequency sequences, the combination for which the cost function Kf yields the smallest value is selected, since this combination of fundamental frequency sequences forms a fundamental frequency course for the corresponding linguistic unit, referred to as the replica macro segment, that is very similar to the default macro segment.
  • Fundamental frequency courses adapted to the default macro segments generated by the neural network are thus generated by means of individual fundamental frequency sequences stored in a database. This ensures a very natural macrostructure that also has the detailed microstructure of the fundamental frequency sequences.
  • After the selection of the combinations of fundamental frequency sequences for reproducing the default macro segment has been completed, it is checked in step S14 whether a further temporal course of the fundamental frequency must be generated for another phonetic linguistic unit. If this query in step S14 yields "yes", the program flow jumps back to step S11; otherwise the program flow branches to step S15, in which the individual replica macro segments of the fundamental frequency are assembled.
  • The junctions of the individual replica macro segments are matched to one another, as shown in Fig. 3.
  • The frequencies f_l to the left and f_r to the right of the junctions V are matched to one another, the end regions of the replica macro segments preferably being changed such that the frequencies f_l and f_r have the same value.
  • The transition can also be smoothed and/or made continuous in the area of the junction.
  • A course of a fundamental frequency can thus be generated that is very similar to the fundamental frequency of natural speech, since larger context areas can easily be captured and evaluated by means of the neural network (macrostructure) and, at the same time, the finest structures of the fundamental frequency curve corresponding to natural speech can be generated by means of the fundamental frequency sequences stored in the database (microstructure). This enables speech output with a much more natural sound than with previously known methods.
  • The order in which the fundamental frequency sequences are read from the database and the neural network creates the default macro segments can be varied. For example, it is also possible first to generate default macro segments for all phonetic linguistic units and only then to read out, combine, evaluate and select the individual fundamental frequency sequences. Within the scope of the invention, a wide variety of cost functions can also be used, as long as they take into account a deviation between a default macro segment of the fundamental frequency and microsegments of the fundamental frequency. For numerical reasons, the integral of the local cost function described above can also be represented as a sum.

Description

The invention relates to a method for determining the temporal course of a fundamental frequency of a speech output to be synthesized.

At the ICASSP 97 conference in Munich, a method for synthesizing speech from a text was presented under the title "Recent Improvements on Microsoft's Trainable Text-to-Speech System - Whistler", X. Huang et al.; it is completely trainable and assembles and generates the prosody of a text from prosody patterns stored in a database. The prosody of a text is essentially determined by the fundamental frequency, which is why this known method can also be regarded as a method for generating a fundamental frequency on the basis of corresponding patterns stored in a database. To achieve speech that is as natural as possible, elaborate correction procedures are provided that interpolate, smooth and correct the contour of the fundamental frequency.

At ICASSP 98 in Seattle, a further method for generating synthetic speech output from a text was presented under the title "Optimization of a Neural Network for Speaker and Task Dependent F0-Generation", Ralf Haury et al. Instead of a database of patterns, this known method uses a neural network to generate the fundamental frequency, with which the temporal course of the fundamental frequency for the speech output is determined.

The methods described above are intended to provide speech output that does not have the metallic, mechanical and unnatural sound known from conventional speech synthesis systems. These methods represent a significant improvement over conventional speech synthesis systems. Nevertheless, there are considerable differences in sound between speech output based on these methods and a human voice.

In particular, speech synthesis in which the fundamental frequency is assembled from individual fundamental frequency patterns still produces a metallic, mechanical sound that can clearly be distinguished from a natural voice. If, on the other hand, the fundamental frequency is determined with a neural network, the voice sounds more natural but somewhat dull.

The invention is therefore based on the object of creating a method for determining the temporal course of a fundamental frequency of a speech output to be synthesized that gives the speech output a natural sound very similar to a human voice.

This object is achieved by a method having the features of claim 1. Advantageous embodiments are specified in the subclaims.

The method according to the invention for determining the temporal course of a fundamental frequency of a speech output to be synthesized comprises the following steps:

  • determining default macro segments of the fundamental frequency by means of a neural network, and
  • determining microsegments by means of fundamental frequency sequences stored in a database, the fundamental frequency sequences being selected from the database in such a way that the respective default macro segment is reproduced with as little deviation as possible by the successive fundamental frequency sequences.
    The present invention is based on the finding that determining the course of a fundamental frequency by means of a neural network produces a macrostructure of the temporal course that is very similar to the course of the fundamental frequency of natural speech, and that the fundamental frequency sequences stored in a database reproduce the microstructure of the fundamental frequency of natural speech very closely. The combination according to the invention therefore yields an optimal determination of the course of the fundamental frequency that, in both its macrostructure and its microstructure, is much more similar to natural speech than a fundamental frequency generated with previously known methods. A considerable approximation of the synthetic speech output to natural speech is thereby achieved. The synthetic speech created in this way is very similar to natural speech and can hardly be distinguished from it.

    The deviation between the replica macro segment and the default macro segment is preferably determined by means of a cost function that is weighted such that small deviations from the fundamental frequency of the default macro segment yield only a small value, whereas once predetermined limit frequency differences are exceeded, the determined deviations rise steeply until a saturation value is reached. This means that all fundamental frequency sequences lying within the range of the limit frequencies represent a sensible choice for reproducing the default macro segment, while fundamental frequency sequences lying outside the range of the limit frequency differences are rated as substantially less suitable for reproducing the default macro segment. This non-linearity models the non-linear behavior of human hearing.

    According to a further preferred embodiment of the invention, deviations are weighted less the closer they are to the edge of a syllable.

    The default macro segment is preferably reproduced by generating several fundamental frequency sequences for each microprosodic unit, with combinations of fundamental frequency sequences being rated both with respect to their deviation from the default macro segment and with respect to their pairwise matching. Depending on the outcome of these two ratings (deviation from the default macro segment, matching between adjacent fundamental frequency sequences), an appropriate combination of fundamental frequency sequences is then selected.

    This pairwise matching evaluates in particular the transitions between adjacent fundamental frequency sequences, larger jumps being avoided here. According to a preferred embodiment of the invention, these pairwise matchings of the fundamental frequency sequences are weighted more strongly within a syllable than at the edge of the syllable. In German, the syllable nucleus is decisive for the auditory impression.

    The method according to the invention is explained in more detail below with reference to an exemplary embodiment shown in the drawings. The drawings schematically show:

    Fig. 1a to 1d   the construction and assembly of the temporal course of a fundamental frequency in four steps,
    Fig. 2          a function for weighting a cost function for determining the deviation between a replica macro segment and a default macro segment,
    Fig. 3          the course of a fundamental frequency consisting of several macro segments,
    Fig. 4          a schematically simplified structure of a neural network,
    Fig. 5          the method according to the invention in a flow chart, and
    Fig. 6          a method for synthesizing speech that is based on the method according to the invention.

    Fig. 6 shows, in a flow chart, a method for synthesizing speech in which a text is converted into a sequence of acoustic signals.

    This method is implemented in the form of a computer program that is started in step S1.

    In step S2, a text is entered, which is present in the form of an electronically readable text file.

    In the following step S3, a sequence of phonemes, i.e. a sound sequence, is created. The individual graphemes of the text are determined, a grapheme being one or more letters to each of which a phoneme is assigned. The phonemes assigned to the individual graphemes are then determined, which defines the phoneme sequence.
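
    The grapheme-to-phoneme step can be pictured as a simple lookup from graphemes to phoneme symbols. The following Python sketch is purely illustrative; the grapheme inventory and the phoneme symbols are assumptions, not taken from the patent.

```python
# Minimal sketch of step S3: assigning a phoneme to each grapheme.
# The mapping below is a hypothetical toy inventory for the word "stop".
GRAPHEME_TO_PHONEME = {"st": "S-t", "o": "O", "p": "p"}

def graphemes_to_phonemes(graphemes):
    """Return the phoneme sequence for a list of graphemes."""
    return [GRAPHEME_TO_PHONEME[g] for g in graphemes]

print(graphemes_to_phonemes(["st", "o", "p"]))  # ['S-t', 'O', 'p']
```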

    In step S4, a stress structure is determined, i.e. it is determined how strongly the individual phonemes are to be stressed.

    The stress structure is represented in Fig. 1a by means of a timeline for the word "stop". Accordingly, the grapheme "st" has been assigned stress level 1, the grapheme "o" stress level 0.3 and the grapheme "p" stress level 0.5.

    The duration of the individual phonemes is then determined (S5).

    In step S6, the temporal course of the fundamental frequency is determined, which is explained in more detail below.

    After the phoneme sequence and the fundamental frequency have been determined, a wave file can be generated on the basis of the phonemes and the fundamental frequency (S7).

    The wave file is converted into acoustic signals by means of an acoustic output unit and a loudspeaker (S8), which ends the speech output (S9).

    According to the invention, the temporal course of the fundamental frequency of the speech output to be synthesized is generated by means of a neural network in combination with fundamental frequency sequences stored in a database.

    The method corresponding to step S6 in Fig. 6 is shown in more detail in a flow chart in Fig. 5.

    This method for determining the temporal course of the fundamental frequency is a subroutine of the program shown in Fig. 6. The subroutine is started with step S10.

    In step S11, a default macro segment of the fundamental frequency is determined by means of a neural network. Such a neural network is shown in schematically simplified form in Fig. 4. At an input layer I, the neural network has nodes for entering a phonetic linguistic unit PE of the text to be synthesized and a context Kl, Kr to the left and right of the phonetic linguistic unit. The phonetic linguistic unit consists, for example, of a phrase, a word or a syllable of the text to be synthesized, for which the default macro segment of the fundamental frequency is to be determined. The left context Kl and the right context Kr each represent a section of text to the left and right of the phonetic linguistic unit PE. The data entered with the phonetic unit comprise the corresponding phoneme sequence, the stress structure and the duration of the individual phonemes. The information entered with the left or right context comprises at least the phoneme sequence; it can also be useful to enter the stress structure and/or the sound durations. The length of the left and right context can correspond to the length of the phonetic linguistic unit PE, i.e. again be a phrase, a word or a syllable. However, it can also be useful to provide a longer context of, for example, two or three words as the left or right context. These inputs Kl, PE and Kr are processed in a hidden layer VS and output at an output layer O as the default macro segment VG of the fundamental frequency.
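
    As a rough illustration of the topology just described (nodes for the unit PE and the contexts Kl and Kr at the input layer I, one hidden layer VS, an output layer O delivering the default macro segment VG), the following Python/NumPy sketch builds a small feed-forward network. The feature encoding, the layer sizes and the fixed number of output F0 samples are assumptions made only for this example; the patent does not prescribe them, and the weights would of course have to be trained on natural speech.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes: Kl, PE and Kr are each encoded as a fixed-length feature vector
# (phoneme identities, stress levels, durations); 32 output samples describe the
# default macro segment VG (F0 contour in Hz) of the unit PE.
N_FEAT, N_HIDDEN, N_OUT = 64, 128, 32

W1 = rng.normal(0.0, 0.1, (3 * N_FEAT, N_HIDDEN))  # input layer I -> hidden layer VS
b1 = np.zeros(N_HIDDEN)
W2 = rng.normal(0.0, 0.1, (N_HIDDEN, N_OUT))       # hidden layer VS -> output layer O
b2 = np.zeros(N_OUT)

def default_macro_segment(kl, pe, kr):
    """Map left context Kl, phonetic linguistic unit PE and right context Kr
    to a default macro segment VG of the fundamental frequency."""
    x = np.concatenate([kl, pe, kr])   # inputs Kl, PE, Kr
    h = np.tanh(x @ W1 + b1)           # hidden layer VS
    return h @ W2 + b2                 # output layer O: 32 F0 values

# Example call with random (untrained) feature vectors:
kl = rng.normal(size=N_FEAT); pe = rng.normal(size=N_FEAT); kr = rng.normal(size=N_FEAT)
vg = default_macro_segment(kl, pe, kr)  # shape (32,)
```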

    Fig. 1b shows such a default macro segment for the word "stop". This default macro segment has a typical triangular course, which initially begins with a rise and ends with a somewhat shorter fall.

    After a default macro segment of the fundamental frequency has been determined, the microsegments corresponding to the default macro segment are determined in steps S12 and S13.

    In step S12, fundamental frequency sequences assigned to graphemes are read out from a database, usually with a large number of fundamental frequency sequences being available for each grapheme. Fig. 1c schematically shows such fundamental frequency sequences for the graphemes "st", "o" and "p", only a small number of fundamental frequency sequences being shown in order to simplify the drawing.
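
    One straightforward way to represent such a database is a mapping from each grapheme to its stored fundamental frequency sequences. The sketch below uses short NumPy arrays of F0 values in Hz; the concrete graphemes, lengths and values are purely illustrative assumptions.

```python
import numpy as np

# Hypothetical database: several candidate F0 sequences (in Hz) per grapheme.
F0_DATABASE = {
    "st": [np.array([110.0, 115.0, 120.0]), np.array([105.0, 112.0, 118.0])],
    "o":  [np.array([125.0, 140.0, 135.0, 128.0]), np.array([120.0, 132.0, 130.0, 122.0])],
    "p":  [np.array([118.0, 110.0, 100.0]), np.array([122.0, 112.0, 104.0])],
}

def read_candidates(grapheme):
    """Step S12: read out all fundamental frequency sequences stored for one grapheme."""
    return F0_DATABASE[grapheme]
```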

    These fundamental frequency sequences can in principle be combined with one another arbitrarily. The possible combinations of these fundamental frequency sequences are rated by means of a cost function. This method step is carried out using the Viterbi algorithm.

    For each combination of fundamental frequency sequences, which contains one fundamental frequency sequence for each phoneme, a cost factor Kf is calculated using the following cost function:

        Kf = \sum_{j=1}^{l} \big[ \mathrm{lok}(k_{ij}) + \mathrm{Ver}(k_{ij}, k_{n,j+1}) \big]

    The cost function is a sum over j = 1 to l, where j is the index of the phonemes and l is the total number of phonemes. The cost function has two terms, a local cost function lok(k_ij) and a link cost function Ver(k_ij, k_{n,j+1}). The local cost function evaluates the deviation of the i-th fundamental frequency sequence of the j-th phoneme from the default macro segment. The link cost function evaluates the matching between the i-th fundamental frequency sequence of the j-th phoneme and the n-th fundamental frequency sequence of the (j+1)-th phoneme.
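
    The structure of Kf can be sketched directly in code: for one candidate combination (one fundamental frequency sequence per phoneme), the local costs and the link costs between neighbouring sequences are summed. The concrete `lok` and `ver` functions are passed in as parameters; simple realizations of both are sketched after the following paragraphs.

```python
def total_cost(combination, targets, lok, ver):
    """Kf = sum over j of [ lok(k_ij) + Ver(k_ij, k_{n,j+1}) ].
    `combination` holds one F0 sequence per phoneme, `targets` the matching
    slices of the default macro segment; `lok` and `ver` are the local and
    link cost functions."""
    kf = 0.0
    n = len(combination)
    for j in range(n):
        kf += lok(combination[j], targets[j])              # deviation from the default macro segment
        if j + 1 < n:
            kf += ver(combination[j], combination[j + 1])  # matching with the next sequence
    return kf
```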

    The local cost function has, for example, the following form:

        \mathrm{lok}(k_{ij}) = \int_{t_a}^{t_e} \big( f_v(t) - f_{ij}(t) \big)^2 \, dt

    The local cost function is thus an integral, over the time range from the beginning t_a of a phoneme to its end t_e, of the square of the difference between the fundamental frequency f_v specified by the default macro segment and the i-th fundamental frequency sequence f_ij of the j-th phoneme.
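
    As noted at the end of the description, the integral can be represented as a sum for numerical reasons. A discretized version, assuming that the candidate sequence f_ij and the default macro segment f_v are sampled at the same instants with a sampling period dt (an assumed parameter), could look as follows.

```python
import numpy as np

def local_cost(f_ij, f_v, dt=0.005):
    """Discrete form of lok(k_ij): sum of the squared difference between the
    default macro segment f_v and the candidate sequence f_ij over the phoneme,
    i.e. the integral from t_a to t_e approximated with step dt."""
    f_ij = np.asarray(f_ij, dtype=float)
    f_v = np.asarray(f_v, dtype=float)
    return float(np.sum((f_v - f_ij) ** 2) * dt)
```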

    This local cost function thus determines a positive value for the deviation between the respective fundamental frequency sequence and the fundamental frequency of the default macro segment. Moreover, this cost function is very easy to implement and, due to its parabolic characteristic, produces a rating that resembles human hearing, since small deviations around the default sequence f_v are rated low, whereas larger deviations are rated progressively more heavily.

    According to a preferred embodiment, the local cost function is provided with a weighting term which leads to the function curve shown in Fig. 2. The diagram in Fig. 2 shows the value of the local cost function lok(f_ij) as a function of the logarithm of the frequency f_ij of the i-th fundamental frequency sequence of the j-th phoneme. The diagram shows that deviations from the default frequency f_v within certain limit frequencies GF1, GF2 are rated only slightly, while a further deviation causes a steep rise up to a threshold value SW. Such a weighting corresponds to human hearing, which hardly perceives small frequency deviations but, beyond certain frequency differences, registers them as a clear difference.
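
    The weighting of Fig. 2 can be approximated by a function of the logarithmic deviation from f_v that stays small inside the band between GF1 and GF2 and then rises steeply until it saturates near the threshold SW. The concrete shape below (a gentle quadratic inside the band, a clipped steep quadratic outside) is only one plausible realization chosen for illustration; the patent describes the qualitative behaviour, not this formula.

```python
import numpy as np

def perceptual_weight(f_ij, f_v, gf=0.1, sw=1.0, slope=50.0, small=0.1):
    """Rate the deviation |log f_ij - log f_v|: almost flat within +/- gf
    (the band between GF1 and GF2), then a steep rise saturating near SW.
    f_v is the default frequency (scalar), f_ij a scalar or array of candidates."""
    d = np.abs(np.log(np.asarray(f_ij, dtype=float)) - np.log(float(f_v)))
    flat = small * d ** 2
    steep = small * gf ** 2 + np.minimum(sw, slope * (d - gf) ** 2)
    return np.where(d <= gf, flat, steep)
```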

    The link cost function evaluates how well two successive fundamental frequency sequences are matched to each other. In particular, the frequency difference at the junction of the two fundamental frequency sequences is rated: the greater the difference between the frequency at the end of the preceding fundamental frequency sequence and the frequency at the beginning of the subsequent fundamental frequency sequence, the larger the output value of the link cost function. However, further parameters can also be taken into account, e.g. the continuity of the transition or the like.
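
    A minimal link cost that follows this description penalizes the F0 jump at the junction of two successive sequences; a second, optional term comparing the slopes on both sides indicates how the continuity of the transition could additionally be taken into account. Both weights are illustrative assumptions.

```python
import numpy as np

def link_cost(prev_seq, next_seq, w_jump=1.0, w_slope=0.1):
    """Ver(k_ij, k_{n,j+1}): the larger the F0 difference at the junction of the
    two sequences, the larger the cost; the optional second term compares the
    slopes on either side of the junction (continuity of the transition)."""
    prev_seq = np.asarray(prev_seq, dtype=float)
    next_seq = np.asarray(next_seq, dtype=float)
    jump = (prev_seq[-1] - next_seq[0]) ** 2
    slope_prev = prev_seq[-1] - prev_seq[-2] if len(prev_seq) > 1 else 0.0
    slope_next = next_seq[1] - next_seq[0] if len(next_seq) > 1 else 0.0
    return float(w_jump * jump + w_slope * (slope_prev - slope_next) ** 2)
```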

    In a preferred embodiment of the invention, the output value of the link cost function is weighted less the closer the respective junction of two neighboring fundamental frequency sequences is to the edge of a syllable. This corresponds to human hearing, which analyzes acoustic signals at the edge of a syllable less intensely than in the middle of the syllable. Such a weighting is also referred to as perceptually dominant.

    According to the above cost function Kf, the values of the local cost function and the link cost function of all fundamental frequency sequences are determined and summed for each combination of fundamental frequency sequences of the phonemes of a linguistic unit for which a default macro segment has been determined. From the set of combinations of fundamental frequency sequences, the combination for which the cost function Kf yields the smallest value is selected, since this combination of fundamental frequency sequences forms a fundamental frequency course for the corresponding linguistic unit, referred to as the replica macro segment, that is very similar to the default macro segment.
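
    Because the link cost couples only neighbouring phonemes, the lowest-cost combination does not have to be found by enumerating every combination explicitly; a Viterbi-style dynamic programme over the candidate sequences of each phoneme, as mentioned above, finds the minimum of Kf. The sketch below assumes cost functions with the signatures of the `local_cost` and `link_cost` examples given earlier.

```python
def select_combination(candidates, targets, local_cost, link_cost):
    """Viterbi-style search: `candidates[j]` is the list of stored F0 sequences
    for phoneme j, `targets[j]` the matching slice of the default macro segment.
    Returns the combination that minimizes Kf = sum_j [lok + Ver]."""
    n = len(candidates)
    # best[j][i]: minimal cost of any path ending in candidate i of phoneme j
    best = [[local_cost(c, targets[0]) for c in candidates[0]]]
    back = [[None] * len(candidates[0])]
    for j in range(1, n):
        row, ptr = [], []
        for cur in candidates[j]:
            costs = [best[j - 1][p] + link_cost(prev, cur)
                     for p, prev in enumerate(candidates[j - 1])]
            p_best = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[p_best] + local_cost(cur, targets[j]))
            ptr.append(p_best)
        best.append(row)
        back.append(ptr)
    # Backtrack the cheapest path through the candidate lattice.
    i = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = [i]
    for j in range(n - 1, 0, -1):
        i = back[j][i]
        path.append(i)
    path.reverse()
    return [candidates[j][path[j]] for j in range(n)]
```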

    With the method according to the invention, fundamental frequency courses adapted to the default macro segments generated by the neural network are thus generated by means of individual fundamental frequency sequences stored in a database. This ensures a very natural macrostructure that also has the detailed microstructure of the fundamental frequency sequences.

    Such a replica macro segment for the word "stop" is shown in Fig. 1d.

    After the selection of the combinations of fundamental frequency sequences for reproducing the default macro segment has been completed in step S13, it is checked in step S14 whether a further temporal course of the fundamental frequency must be generated for another phonetic linguistic unit. If this query in step S14 yields "yes", the program flow jumps back to step S11; otherwise the program flow branches to step S15, in which the individual replica macro segments of the fundamental frequency are assembled.

    In step S16, the junctions of the individual replica macro segments are matched to one another, as shown in Fig. 3. The frequencies f_l to the left and f_r to the right of the junctions V are matched to one another, the end regions of the replica macro segments preferably being changed such that the frequencies f_l and f_r have the same value. Preferably, the transition can also be smoothed and/or made continuous in the area of the junction.
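
    Matching the junctions of adjacent replica macro segments (step S16) can be done, for example, by moving both end frequencies f_l and f_r to a common value (here: their mean) and fading the segment ends toward that value over a few samples. The choice of the mean and of a linear ramp is an assumption for illustration; the patent only requires that f_l and f_r take the same value and that the transition may additionally be smoothed.

```python
import numpy as np

def match_junction(left_seg, right_seg, ramp=5):
    """Adjust the end regions of two adjacent replica macro segments so that the
    frequency f_l at the end of the left segment and f_r at the start of the
    right segment take the same value at the junction V."""
    left = np.asarray(left_seg, dtype=float).copy()
    right = np.asarray(right_seg, dtype=float).copy()
    target = 0.5 * (left[-1] + right[0])          # common frequency at the junction
    nl, nr = min(ramp, len(left)), min(ramp, len(right))
    ramp_l = np.linspace(0.0, 1.0, nl + 1)[1:]    # reaches exactly 1 at the junction
    ramp_r = np.linspace(1.0, 0.0, nr + 1)[:-1]   # starts exactly at 1 at the junction
    left[-nl:] += ramp_l * (target - left[-1])
    right[:nr] += ramp_r * (target - right[0])
    return left, right
```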

    After the replica macro segments of the fundamental frequency have been created and assembled for all linguistic phonetic units of the text, the subroutine is ended and the program flow returns to the main program (S17).

    With the method according to the invention, a course of a fundamental frequency can thus be generated that is very similar to the fundamental frequency of natural speech, since larger context areas can easily be captured and evaluated by means of the neural network (macrostructure) and, at the same time, the finest structures of the fundamental frequency curve corresponding to natural speech can be generated by means of the fundamental frequency sequences stored in the database (microstructure). This enables speech output with a much more natural sound than with previously known methods.

    The invention has been explained in more detail above using an exemplary embodiment. However, the invention is not limited to this specific embodiment; a wide variety of modifications are possible within the scope of the invention. For example, the order in which the fundamental frequency sequences are read from the database and the neural network creates the default macro segments can be varied. It is also possible, for example, first to generate default macro segments for all phonetic linguistic units and only then to read out, combine, evaluate and select the individual fundamental frequency sequences. Within the scope of the invention, a wide variety of cost functions can also be used, as long as they take into account a deviation between a default macro segment of the fundamental frequency and microsegments of the fundamental frequency. For numerical reasons, the integral of the local cost function described above can also be represented as a sum.

    Claims (13)

    1. Method for determining the time characteristic of a fundamental frequency of a voice response to be synthesized, comprising the following steps:
      determining predefined macrosegments of the fundamental frequency of a phonetic linguistic unit of a text (S11) to be synthesized by means of a neural network, and
      determining microsegments (S12, S13) which correspond to the respective predefined macrosegment by means of fundamental-frequency sequences stored in a database, the fundamental-frequency sequences being selected from the database in such a manner that the respective predefined macrosegment is reproduced with the least possible deviation by the successive fundamental-frequency sequences.
    2. Method according to Claim 1, characterized in that the predefined macrosegments cover a time range which corresponds to a phonetic linguistic unit of the voice such as, e.g. a phrase, a word or a syllable.
    3. Method according to Claim 1 or 2, characterized in that the fundamental-frequency sequences of the microsegments represent the fundamental frequencies of in each case one phoneme.
    4. Method according to one of Claims 1 to 3, characterized in that the fundamental-frequency sequences of the microsegments which are located within a time range of one of the predefined macrosegments are assembled to form one reproduced macrosegment, the deviation of the reproduced macrosegment from the respective predefined macrosegment being determined and the fundamental-frequency sequences being optimized in such a manner that the deviation is as small as possible.
    5. Method according to Claim 4, characterized in that in each case a number of fundamental-frequency sequences can be selected for the individual microsegments, where the combinations of fundamental-frequency sequences resulting in the least deviation between the respective reproduced macrosegment and the respective predefined macrosegment are selected.
    6. Method according to Claim 4 or 5, characterized in that the deviation between the reproduced macrosegment and the predefined macrosegment is determined by means of a cost function which is weighted in such a manner that in the case of small deviations from the fundamental frequency of the predefined macrosegment, only a small deviation is determined and when predetermined limit frequency differences are exceeded, the deviations determined rise steeply until a saturation value is reached.
    7. Method according to one of Claims 4 to 6, characterized in that the deviation between the reproduced macrosegment and the predefined macrosegment is determined by means of a cost function by means of which a multiplicity of deviations arranged distributed over the macrosegments are weighted, and the closer the deviations are to the edge of the syllable, the less weighting is applied to them.
    8. Method according to one of Claims 4 to 7, characterized in that, during the selection of the fundamental-frequency sequences, the individual fundamental-frequency sequences are matched to the respectively following or preceding fundamental-frequency sequences in accordance with predetermined criteria, and only combinations of fundamental-frequency sequences meeting the criteria are permitted to be assembled to form a reproduced macrosegment.
    9. Method according to Claim 8, characterized in that adjacent fundamental-frequency sequences are assessed by means of a cost function which generates an output value, to be minimized, for the junction of adjacent fundamental-frequency sequences, this output value being the greater, the greater the difference between the frequency at the end of the preceding fundamental-frequency sequence and the frequency at the beginning of the subsequent fundamental-frequency sequence.
    10. Method according to Claim 9, characterized in that the closer the respective junction is to the edge of a syllable, the less weighting is applied to the output value.
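    Claims 8 to 10 can be pictured with a similar sketch: a junction cost that grows with the frequency jump between the end of one fundamental-frequency sequence and the beginning of the next, and that is weighted less the closer the junction lies to the edge of a syllable. The linear edge weighting and all names are assumptions made only for illustration.

        def junction_cost(prev_seq, next_seq, junction_pos, syllable_len):
            """Illustrative concatenation cost for two adjacent f0 sequences.

            prev_seq, next_seq -- f0 arrays of the preceding / following microsegment
            junction_pos       -- frame index of the junction within the syllable
            syllable_len       -- length of the syllable in frames
            """
            jump = abs(prev_seq[-1] - next_seq[0])   # frequency difference at the junction
            # Full weight in the middle of the syllable, falling towards zero at its
            # edges (an assumed linear weighting; the claims only require some de-emphasis).
            centre = syllable_len / 2.0
            weight = 1.0 - abs(junction_pos - centre) / centre
            return weight * jump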
    11. Method according to one of Claims 1 to 10, characterized in that the individual macrosegments are concatenated with one another and the fundamental frequencies are matched to one another at the junctions of the macrosegments.
    12. Method according to one of Claims 1 to 11, characterized in that the neural network determines the predefined macrosegments for a predetermined section of a text on the basis of this text section and of a text section preceding and/or following this text section.
    13. Method for synthesizing speech in which a text is converted into a sequence of acoustic signals, comprising the following steps:
      converting the text into a sequence of phonemes,
      generating a stressing structure,
      determining the duration of the individual phonemes,
      determining the time characteristic of a fundamental frequency according to the method according to one of Claims 1 to 12,
      generating the acoustic signals representing the speech on the basis of the sequence of phonemes determined and of the fundamental frequency determined.
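    The synthesis chain of Claim 13 can be summarised by the following skeleton, in which every helper function is a placeholder for a component the claim leaves open (grapheme-to-phoneme conversion, stress prediction, duration modelling, the fundamental-frequency procedure of Claims 1 to 12 and the acoustic back end).

        def synthesize(text):
            """Skeleton of the claimed synthesis chain; all helpers are placeholders."""
            phonemes   = text_to_phonemes(text)                # convert the text into phonemes
            stresses   = generate_stress_structure(phonemes)   # generate a stressing structure
            durations  = determine_phoneme_durations(phonemes, stresses)
            f0_contour = determine_f0_contour(phonemes, stresses, durations)  # Claims 1 to 12
            return generate_waveform(phonemes, durations, f0_contour)         # acoustic signals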
    EP00984858A 1999-10-28 2000-10-24 Method for detecting the time sequences of a fundamental frequency of an audio-response unit to be synthesised Expired - Lifetime EP1224531B1 (en)

    Applications Claiming Priority (3)

    Application Number Priority Date Filing Date Title
    DE19952051 1999-10-28
    DE19952051 1999-10-28
    PCT/DE2000/003753 WO2001031434A2 (en) 1999-10-28 2000-10-24 Method for detecting the time sequences of a fundamental frequency of an audio-response unit to be synthesised

    Publications (2)

    Publication Number Publication Date
    EP1224531A2 EP1224531A2 (en) 2002-07-24
    EP1224531B1 true EP1224531B1 (en) 2004-12-15

    Family

    ID=7927243

    Family Applications (1)

    Application Number Title Priority Date Filing Date
    EP00984858A Expired - Lifetime EP1224531B1 (en) 1999-10-28 2000-10-24 Method for detecting the time sequences of a fundamental frequency of an audio-response unit to be synthesised

    Country Status (5)

    Country Link
    US (1) US7219061B1 (en)
    EP (1) EP1224531B1 (en)
    JP (1) JP4005360B2 (en)
    DE (1) DE50008976D1 (en)
    WO (1) WO2001031434A2 (en)

    Families Citing this family (10)

    * Cited by examiner, † Cited by third party
    Publication number Priority date Publication date Assignee Title
    AT6920U1 (en) 2002-02-14 2004-05-25 Sail Labs Technology Ag METHOD FOR GENERATING NATURAL LANGUAGE IN COMPUTER DIALOG SYSTEMS
    DE10230884B4 (en) * 2002-07-09 2006-01-12 Siemens Ag Combination of prosody generation and building block selection in speech synthesis
    JP4264030B2 (en) * 2003-06-04 2009-05-13 株式会社ケンウッド Audio data selection device, audio data selection method, and program
    JP2005018036A (en) * 2003-06-05 2005-01-20 Kenwood Corp Device and method for speech synthesis and program
    WO2005119650A1 (en) * 2004-06-04 2005-12-15 Matsushita Electric Industrial Co., Ltd. Audio synthesis device
    US10453479B2 (en) * 2011-09-23 2019-10-22 Lessac Technologies, Inc. Methods for aligning expressive speech utterances with text and systems therefor
    US10109014B1 (en) 2013-03-15 2018-10-23 Allstate Insurance Company Pre-calculated insurance premiums with wildcarding
    CN105357613B (en) * 2015-11-03 2018-06-29 广东欧珀移动通信有限公司 The method of adjustment and device of audio output apparatus play parameter
    CN106653056B (en) * 2016-11-16 2020-04-24 中国科学院自动化研究所 Fundamental frequency extraction model and training method based on LSTM recurrent neural network
    CN108630190B (en) * 2018-05-18 2019-12-10 百度在线网络技术(北京)有限公司 Method and apparatus for generating speech synthesis model

    Family Cites Families (9)

    * Cited by examiner, † Cited by third party
    Publication number Priority date Publication date Assignee Title
    JPH08512150A (en) 1994-04-28 1996-12-17 モトローラ・インコーポレイテッド Method and apparatus for converting text into audible signals using neural networks
    US5787387A (en) * 1994-07-11 1998-07-28 Voxware, Inc. Harmonic adaptive speech coding method and system
    JPH10153998A (en) * 1996-09-24 1998-06-09 Nippon Telegr & Teleph Corp <Ntt> Auxiliary information utilizing type voice synthesizing method, recording medium recording procedure performing this method, and device performing this method
    BE1011892A3 (en) 1997-05-22 2000-02-01 Motorola Inc Method, device and system for generating voice synthesis parameters from information including express representation of intonation.
    US5913194A (en) * 1997-07-14 1999-06-15 Motorola, Inc. Method, device and system for using statistical information to reduce computation and memory requirements of a neural network based speech synthesis system
    US6064960A (en) * 1997-12-18 2000-05-16 Apple Computer, Inc. Method and apparatus for improved duration modeling of phonemes
    US6078885A (en) * 1998-05-08 2000-06-20 At&T Corp Verbal, fully automatic dictionary updates by end-users of speech synthesis and recognition systems
    JP2002530703A (en) * 1998-11-13 2002-09-17 ルノー・アンド・オスピー・スピーチ・プロダクツ・ナームローゼ・ベンノートシャープ Speech synthesis using concatenation of speech waveforms
    US7222075B2 (en) * 1999-08-31 2007-05-22 Accenture Llp Detecting emotions using voice signal analysis

    Also Published As

    Publication number Publication date
    JP4005360B2 (en) 2007-11-07
    EP1224531A2 (en) 2002-07-24
    WO2001031434A2 (en) 2001-05-03
    US7219061B1 (en) 2007-05-15
    WO2001031434A3 (en) 2002-02-14
    JP2003513311A (en) 2003-04-08
    DE50008976D1 (en) 2005-01-20

    Similar Documents

    Publication Publication Date Title
    DE2115258C3 (en) Method and arrangement for speech synthesis from representations of individually spoken words
    AT400646B (en) VOICE SEGMENT ENCODING AND TOTAL LAYER CONTROL METHOD FOR VOICE SYNTHESIS SYSTEMS AND SYNTHESIS DEVICE
    DE602005002706T2 (en) Method and system for the implementation of text-to-speech
    DE60118874T2 (en) Prosody pattern comparison for text-to-speech systems
    DE60112512T2 (en) Coding of expression in speech synthesis
    DE60004420T2 (en) Recognition of areas of overlapping elements for a concatenative speech synthesis system
    DE69627865T2 (en) VOICE SYNTHESIZER WITH A DATABASE FOR ACOUSTIC ELEMENTS
    EP1224531B1 (en) Method for detecting the time sequences of a fundamental frequency of an audio-response unit to be synthesised
    DE19942178C1 (en) Method of preparing database for automatic speech processing enables very simple generation of database contg. grapheme-phoneme association
    DE2920298A1 (en) BINARY INTERPOLATOR CIRCUIT FOR AN ELECTRONIC MUSICAL INSTRUMENT
    EP1282897B1 (en) Method for creating a speech database for a target vocabulary in order to train a speech recognition system
    DE19861167A1 (en) Method and device for concatenation of audio segments in accordance with co-articulation and devices for providing audio data concatenated in accordance with co-articulation
    EP1159733B1 (en) Method and array for determining a representative phoneme
    DE69816049T2 (en) DEVICE AND METHOD FOR GENERATING PROSODY IN VISUAL SYNTHESIS
    DE4138016A1 (en) DEVICE FOR GENERATING AN ANNOUNCEMENT INFORMATION
    DE60305944T2 (en) METHOD FOR SYNTHESIS OF A STATIONARY SOUND SIGNAL
    WO2000016310A1 (en) Device and method for digital voice processing
    DE60131521T2 (en) Method and device for controlling the operation of a device or a system, and system having such a device and computer program for carrying out the method
    EP1170723B1 (en) Method for the computation of phone duration statistics and method for the determination of the duration of single phones for speech synthesis
    DE10230884B4 (en) Combination of prosody generation and building block selection in speech synthesis
    WO2002050815A1 (en) Device and method for differentiated speech output
    DE69721539T2 (en) SYNTHESIS PROCEDURE FOR VOICELESS CONSONANTS
    DE19837661C2 (en) Method and device for co-articulating concatenation of audio segments
    EP0505709A2 (en) Method for vocabulary extension for speaker-independent speech recognition
    DE1922170B2 (en) Speech synthesis system

    Legal Events

    Date Code Title Description
    PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

    Free format text: ORIGINAL CODE: 0009012

    17P Request for examination filed

    Effective date: 20020404

    AK Designated contracting states

    Kind code of ref document: A2

    Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE

    AX Request for extension of the european patent

    Free format text: AL;LT;LV;MK;RO;SI

    RBV Designated contracting states (corrected)

    Designated state(s): DE FR GB IT

    GRAP Despatch of communication of intention to grant a patent

    Free format text: ORIGINAL CODE: EPIDOSNIGR1

    RIC1 Information provided on ipc code assigned before grant

    Ipc: 7G 10L 11/04 A

    Ipc: 7G 06F 3/16 B

    RIC1 Information provided on ipc code assigned before grant

    Ipc: 7G 06F 3/16 B

    Ipc: 7G 10L 13/08 B

    Ipc: 7G 10L 11/04 A

    GRAS Grant fee paid

    Free format text: ORIGINAL CODE: EPIDOSNIGR3

    GRAA (expected) grant

    Free format text: ORIGINAL CODE: 0009210

    AK Designated contracting states

    Kind code of ref document: B1

    Designated state(s): DE FR GB IT

    REG Reference to a national code

    Ref country code: GB

    Ref legal event code: FG4D

    Free format text: NOT ENGLISH

    REG Reference to a national code

    Ref country code: IE

    Ref legal event code: FG4D

    Free format text: GERMAN

    REF Corresponds to:

    Ref document number: 50008976

    Country of ref document: DE

    Date of ref document: 20050120

    Kind code of ref document: P

    GBT Gb: translation of ep patent filed (gb section 77(6)(a)/1977)

    Effective date: 20050211

    PLBE No opposition filed within time limit

    Free format text: ORIGINAL CODE: 0009261

    STAA Information on the status of an ep patent application or granted ep patent

    Free format text: STATUS: NO OPPOSITION FILED WITHIN TIME LIMIT

    ET Fr: translation filed
    26N No opposition filed

    Effective date: 20050916

    PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

    Ref country code: DE

    Payment date: 20131219

    Year of fee payment: 14

    PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

    Ref country code: FR

    Payment date: 20141017

    Year of fee payment: 15

    Ref country code: GB

    Payment date: 20141013

    Year of fee payment: 15

    PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

    Ref country code: IT

    Payment date: 20141029

    Year of fee payment: 15

    REG Reference to a national code

    Ref country code: DE

    Ref legal event code: R119

    Ref document number: 50008976

    Country of ref document: DE

    PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

    Ref country code: DE

    Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

    Effective date: 20150501

    GBPC Gb: european patent ceased through non-payment of renewal fee

    Effective date: 20151024

    PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

    Ref country code: IT

    Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

    Effective date: 20151024

    Ref country code: GB

    Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

    Effective date: 20151024

    REG Reference to a national code

    Ref country code: FR

    Ref legal event code: ST

    Effective date: 20160630

    PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

    Ref country code: FR

    Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

    Effective date: 20151102