EP1170723A2

EP1170723A2 - Method for the computation of phoneme duration statistics and method for the determination of the duration of isolated phonemes for speech synthesis

Info

Publication number: EP1170723A2
Application number: EP01114696A
Authority: EP
Inventors: Martin Dr. Holzapfel
Original assignee: Siemens AG
Current assignee: Unify GmbH and Co KG
Priority date: 2000-07-07
Filing date: 2001-06-19
Publication date: 2002-01-09
Anticipated expiration: 2021-06-19
Also published as: EP1170723A3; EP1170723B1; DE10033104A1; US20020016709A1; US6934680B2; DE10033104C2; DE50115685D1

Abstract

The present invention relates to a method for generating statistics of phone durations and to a method for determining the duration of individual phones for speech synthesis. According to the invention, a primary statistic is provided that is based, for example, on primary clusters (e.g. triphones), together with a secondary statistic that is based on secondary clusters (e.g. the phonemes of whole words). Both statistics contain mean phone durations and, for example, the standard deviation of the mean phone durations. When the phone durations are determined, an attempt is first made to determine them from the secondary statistic, which is more language-specific. If this is not possible, the primary statistic, which is always applicable, is used as a fallback. This two-stage method yields phone durations that correspond to natural speech considerably better than was possible with the known single-stage method.

Description

Method for generating statistics of phone durations and method for determining the duration of individual phones for speech synthesis

The present invention relates to a method for generating statistics of phone durations and to a method for determining the duration of individual phones for speech synthesis.

For the purposes of the present application, a phoneme is the smallest linguistic unit that distinguishes meaning but does not itself carry meaning (e.g. the b in the German word "Bein" (leg) as opposed to the p in "Pein" (pain)). A phone, by contrast, is the spoken sound of a phoneme.

Methods for generating statistics of phone durations, on the basis of which the phone durations can be controlled in synthetic speech generation, are known. In such methods, a text spoken by a speaker is recorded and the recording is segmented into individual phones. The duration of each phone is determined. These phone durations are collected in a statistic that comprises a list of triphones. A triphone is a cluster of one or more phonemes together with the respective left and right context.

In the known methods, a mean phone length, or sound duration, is assigned to one phoneme of each triphone in its left-right context. This phone duration is determined from all phones of the spoken text that occur in the same context as in the respective triphone, that is, whose neighbouring phones correspond to the neighbouring phonemes in the triphone.

In the known methods for determining the duration of individual phones for speech synthesis, each phoneme of the text to be synthesized is assigned the mean duration of that phoneme in the statistic whose triphone context corresponds to the context of the phoneme in the text to be synthesized. If, for example, the phone duration of the phoneme "b" in the German word "aber" is to be determined, the known method assigns to the phoneme "b" the phone duration that the statistic associates with the phoneme "b" in the triphone "abe". The contexts in the triphone and in the text to be synthesized are identical here.

The object of the invention is to provide a method for generating statistics of phone durations, on the basis of which the phone durations can be controlled in synthetic speech generation, and a method for determining the duration of individual phones for speech synthesis, so that speech synthesis with a more natural pronunciation than in known methods is achieved.

This object is achieved by a method for generating statistics of phone durations having the features of claim 1 and by a method for determining the duration of individual phones having the features of claim 11. Advantageous embodiments of the invention are specified in the subclaims.

The method according to the invention for generating statistics of phone durations, on the basis of which the phone durations can be controlled in synthetic speech generation, comprises the following steps:

Assigning phones of a spoken and recorded text that has been segmented into phones to phonemes of predetermined primary clusters, each composed of several phonemes, wherein a phone is assigned to a phoneme of a primary cluster if it occurs in the spoken text in a context identical or similar to the context of that phoneme in the primary cluster,
Creating a primary statistic that comprises at least the mean phone duration of all phones assigned to the respective phoneme of a primary cluster,
Assigning phones of the spoken and recorded text to phonemes of predetermined secondary clusters composed of phonemes, wherein the number of phonemes of at least some secondary clusters differs from the number of phonemes of the primary clusters, and wherein a phone is assigned to a phoneme of a secondary cluster if it occurs in the spoken text in a context identical to the context of that phoneme in the secondary cluster,
Creating a secondary statistic that comprises at least the mean phone duration of all phones assigned to the respective phoneme of a secondary cluster.

The statistic generated by the method according to the invention thus consists of a primary statistic and a secondary statistic. The primary statistic can be based on primary clusters of, for example, three phonemes each, so that it corresponds to the triphone-based statistic explained at the outset. The secondary statistic is a further statistic based on secondary clusters whose number of phonemes differs at least in part from that of the primary clusters. This yields more language-specific statistics of phone duration.

For example, the primary clusters can comprise three phonemes and the secondary clusters four phonemes, so that a larger context (four phonemes rather than three) is taken into account when the mean phone durations are determined, which yields a considerably more language-specific evaluation.

According to a preferred embodiment of the invention, the primary clusters have a constant number of phonemes, whereas the number of phonemes of the secondary clusters is variable. For example, the primary clusters can each comprise three phonemes and the secondary clusters can each comprise all the phonemes of a word. These secondary clusters then yield a word-specific evaluation of the phone durations that is considerably more precise than the evaluation based on triphones.

According to a preferred embodiment of the invention, the secondary statistic records only those secondary clusters whose frequency in the text is greater than or equal to a predetermined minimum frequency. This ensures that statistically insignificant frequencies are not taken into account. It is thus expedient to disregard words that occur only once or twice in the text on which the statistic is based.

The method according to the invention for determining the duration of individual phones for speech synthesis is based on such statistics of phone durations, comprising a primary statistic and a secondary statistic. This method comprises the following steps:

Determining whether the phoneme to be converted into speech, for which the phone duration is to be determined, is part of a secondary cluster,
Assigning the mean phone duration (d) that the secondary statistic associates with the corresponding phoneme in the respective secondary cluster, if the phoneme is part of a secondary cluster, and
Assigning the mean phone duration (d) that the primary statistic associates with the corresponding phoneme in the respective primary cluster, if the phoneme is not part of a secondary cluster.
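The two-stage determination can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the patented implementation: the function name `lookup_mean_duration`, the dictionary layout of the two statistics, the word-boundary symbol "#", the simplified letter-like phoneme symbols and all duration values are invented for illustration.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical layout: the secondary statistic maps a whole-word phoneme tuple
# to one mean duration per phoneme; the primary statistic maps a triphone
# string to the mean duration of its middle phoneme.
SecondaryStats = Dict[Tuple[str, ...], List[float]]
PrimaryStats = Dict[str, float]


def lookup_mean_duration(
    word: Sequence[str],
    index: int,
    secondary: SecondaryStats,
    primary: PrimaryStats,
    pad: str = "#",  # assumed word-boundary symbol, not from the source
) -> float:
    """Return the mean duration for the phoneme at `index` within `word`."""
    key = tuple(word)
    # Stage 1: use the word-level (secondary) statistic if the word is stored.
    if key in secondary:
        return secondary[key][index]
    # Stage 2: fall back to the triphone (primary) statistic.
    padded = (pad,) + key + (pad,)
    triphone = "".join(padded[index:index + 3])
    return primary[triphone]


# Invented example data.
secondary = {("a", "b", "e", "r"): [80.0, 55.0, 65.0, 45.0]}
primary = {"abe": 52.0, "#un": 70.0, "und": 50.0, "nd#": 60.0}

from_word = lookup_mean_duration(("a", "b", "e", "r"), 1, secondary, primary)
from_triphone = lookup_mean_duration(("u", "n", "d"), 1, secondary, primary)
```

In this sketch a triphone that is missing from the primary statistic raises a `KeyError`; the grouping of similar contexts, which makes the primary statistic always applicable, is left out.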

In this method, the more language-specific secondary statistic is evaluated in preference when the phone durations are determined. It should be noted that, when the secondary statistic is generated, only identical contexts between the secondary cluster and the corresponding section of the spoken and recorded text on which the statistics are based are taken into account, whereas for the primary statistic similar clusters are also taken into account if there is no identical match. This is a further reason why an attempt is first made to evaluate the secondary statistic before falling back on the primary statistic.

According to a preferred development of the method for determining the duration of individual phones, the standard deviation of each mean phone duration is taken into account. This brings the synthesis still closer to natural pronunciation.

The invention is explained in more detail below by way of example with reference to the accompanying drawings, which show schematically:

Fig. 1: a general overview, as a flow chart, of the sequence for generating statistics of phone durations,
Fig. 2: the method steps for the statistical evaluation of a speech recording in order to generate statistics of phone durations,
Fig. 3: a method for determining the duration of individual phones for speech synthesis, as a flow chart, and
Fig. 4: a computer system for carrying out the methods according to the invention, as a block diagram.

Fig. 1 shows the basic sequence of a method for generating statistics of phone durations, on the basis of which the phone duration can be controlled in synthetic speech generation.

The method begins with step S1. In step S2, a predetermined training text is spoken by a speaker and recorded. The recording is made by means of a microphone, which converts the acoustic speech signals into corresponding electrical speech signals.

In step S3, the recorded speech signal is segmented into individual phones. This segmentation is often carried out manually by a speech expert. Fully and partially automatic methods are also known; these are generally based on an HMM (hidden Markov model) algorithm.

In step S4, the individual phones are evaluated statistically, their durations being determined. The durations of phones assigned to the same phoneme in the same or a similar context are evaluated statistically by calculating their mean values and standard deviations.

This method ends in step S5.

The method steps to be carried out according to the invention in the statistical evaluation (S4) are shown as a flow chart in Fig. 2. The statistical evaluation begins with step S6. First, the individual phones of the training text are assigned to a primary cluster. In the present exemplary embodiment, the primary cluster is a triphone consisting of three phonemes. A phone of the training text is assigned to the triphone whose middle phoneme corresponds to that phone and which has the same context as the section of the training text in which the phone occurs. This means that the phonemes adjacent to the middle phoneme of the triphone correspond to the phones adjacent to the phone being assigned in the training text. If, for example, the phone of the phoneme "f" in the German word "Anfang" (beginning) is to be assigned to such a primary cluster, it is assigned to the phoneme "f" in the triphone "nfa", since the two adjacent phonemes "n" (left) and "a" (right) correspond to the phones of "n" and "a" in the training text.
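The assignment of a phone to its triphone can be illustrated with a short Python sketch. The function name `triphone_contexts`, the word-boundary symbol "#" and the simplified letter-like phoneme symbols for "Anfang" are assumptions made for illustration; a real system would work on a phonetic transcription.

```python
from typing import List, Sequence, Tuple


def triphone_contexts(phonemes: Sequence[str], pad: str = "#") -> List[Tuple[str, str]]:
    """Return (middle phoneme, triphone key) for every position in the sequence.

    `pad` is an assumed boundary symbol used where a neighbour is missing.
    """
    padded = [pad] + list(phonemes) + [pad]
    result = []
    for i in range(1, len(padded) - 1):
        # Left neighbour + middle phoneme + right neighbour form the triphone.
        key = padded[i - 1] + padded[i] + padded[i + 1]
        result.append((padded[i], key))
    return result


# Simplified symbols for the word "Anfang" (an assumption).
contexts = triphone_contexts(["a", "n", "f", "a", "ng"])
f_context = contexts[2]  # the entry for the phoneme "f"
```

Here `f_context` is the pair `("f", "nfa")`, matching the example in the text.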

The primary clusters are stored in a predefined list. If the primary clusters are triphones, such a list typically comprises 1500 to 2000 triphones. The list contains the most frequently occurring permutations of three consecutive phonemes. Rare and similar-sounding permutations are combined into one cluster. For example, the triphones "ter" and "der" can be combined into one cluster.
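One way to represent such a list is a table that maps each triphone to the identifier of its cluster, so that similar-sounding triphones share an entry. The following Python sketch is only an illustration: the table `CLUSTER_OF`, the function `primary_cluster` and the identifier strings are hypothetical, and how the clusters are chosen is not specified here.

```python
from typing import Dict, Optional

# Hypothetical table: each listed triphone maps to the identifier of the
# cluster it belongs to. "ter" and "der" share one cluster, as in the text;
# the identifier strings themselves are invented.
CLUSTER_OF: Dict[str, str] = {
    "nfa": "nfa",
    "ter": "ter/der",
    "der": "ter/der",
}


def primary_cluster(triphone: str, table: Dict[str, str] = CLUSTER_OF) -> Optional[str]:
    """Return the cluster identifier for a triphone, or None if it is not listed."""
    return table.get(triphone)
```

With this table, phones observed in the contexts "ter" and "der" contribute to the same duration statistic.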

In the assignment according to step S7, the phones are thus assigned to the respective phonemes in the same or a similar context.

At the end of this assignment process, all phones of the training text have been assigned to the list of primary clusters; that is, there is a list in which the corresponding phones of the training text are stored for each primary cluster.

In step S8, the mean phone duration d' and the standard deviation G are calculated for the middle phoneme of each primary cluster consisting of three phonemes. To this end, the durations of the individual phones assigned to a primary cluster are averaged and stored as the mean duration, and the corresponding standard deviation G is calculated.
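The calculation in step S8 amounts to grouping the measured durations by primary cluster and computing a mean and a standard deviation per group. The Python sketch below illustrates this under assumptions: the function name `primary_statistics`, the millisecond unit and the sample values are invented, and the population standard deviation (`statistics.pstdev`) is used because the text does not say which estimator is meant.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, Iterable, List, Tuple


def primary_statistics(
    observations: Iterable[Tuple[str, float]]
) -> Dict[str, Tuple[float, float]]:
    """Map each cluster key to (mean duration d', standard deviation G).

    `observations` holds (cluster key, duration) pairs, one per phone.
    """
    grouped: Dict[str, List[float]] = defaultdict(list)
    for key, duration in observations:
        grouped[key].append(duration)
    # pstdev is an assumed choice of estimator; it yields 0.0 for one sample.
    return {key: (mean(values), pstdev(values)) for key, values in grouped.items()}


# Invented durations in milliseconds (the unit is an assumption).
stats = primary_statistics([("nfa", 80.0), ("nfa", 100.0), ("abe", 60.0)])
```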

Step S8 thus generates a primary statistic that essentially corresponds to the prior-art statistic discussed at the outset.

In step S9, the individual phones are assigned to secondary clusters. In the present exemplary embodiment, each secondary cluster comprises all the phonemes of a word. The length of the secondary clusters is therefore variable. When the phones are assigned to the secondary clusters, the words of the training text are determined and the individual phones of these words are assigned to the corresponding phonemes of the respective secondary clusters. An essential difference from step S7 is that here not just one phone is assigned to a cluster; rather, all the phones of a word are assigned to the corresponding phonemes of the secondary cluster, that is, each phoneme of the secondary cluster is assigned one phone. In step S10, it is checked whether at least three phones of the training text have been assigned to each phoneme of the secondary clusters. If this is not the case, the corresponding word occurs fewer than three times in the training text and is therefore not statistically significant. Secondary clusters to which fewer than three words of the training text have been assigned are deleted.
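Steps S9 and S10 can be sketched in Python as follows. This is an illustration under assumptions, not the patented implementation: the function name `secondary_statistics`, the data layout, the simplified phoneme symbols and all duration values are invented, and `statistics.pstdev` is an assumed choice of estimator. Only the threshold of three occurrences is taken from the text.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, Iterable, List, Sequence, Tuple

MIN_COUNT = 3  # minimum number of occurrences, as in the embodiment


def secondary_statistics(
    word_tokens: Iterable[Tuple[Sequence[str], Sequence[float]]],
    min_count: int = MIN_COUNT,
) -> Dict[Tuple[str, ...], List[Tuple[float, float]]]:
    """Build word-level statistics: one (mean, std) pair per phoneme of a word.

    Each token is (phonemes of one spoken word, duration of each of its phones).
    """
    grouped: Dict[Tuple[str, ...], List[List[float]]] = defaultdict(list)
    for phonemes, durations in word_tokens:
        if len(phonemes) != len(durations):
            raise ValueError("one duration is required per phoneme")
        grouped[tuple(phonemes)].append(list(durations))

    stats = {}
    for word, occurrences in grouped.items():
        if len(occurrences) < min_count:
            continue  # too rare to be significant: cluster is dropped
        # zip(*occurrences) collects the durations position by position.
        stats[word] = [(mean(col), pstdev(col)) for col in zip(*occurrences)]
    return stats


# Invented training data with simplified phoneme symbols.
aber = ("a", "b", "e", "r")
und = ("u", "n", "d")
tokens = [
    (aber, [70.0, 50.0, 60.0, 40.0]),
    (und, [60.0, 50.0, 55.0]),
    (aber, [80.0, 55.0, 65.0, 45.0]),
    (aber, [90.0, 60.0, 70.0, 50.0]),
    (und, [65.0, 45.0, 50.0]),
]
stats = secondary_statistics(tokens)
```

In this invented data the word "aber" occurs three times and is kept, while "und" occurs only twice and is dropped.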

In the present exemplary embodiment, the frequency required for significance is three. To achieve greater statistical certainty, it may be expedient to use a correspondingly higher value.

In step S11, the mean phone duration d' and the standard deviation G are calculated and stored for each phoneme of the secondary clusters. The result of step S11 is a secondary statistic based on the secondary clusters.

The evaluation method ends in step S12.

The exemplary embodiment shown in Fig. 2 yields statistics that are considerably more language-specific, since the individual phone durations depend very strongly on the context, and a considerably more precise context, namely that of an entire word, is taken into account wherever this is statistically possible. If the durations for speech synthesis are determined on the basis of such two-stage statistics, a considerably more natural synthesis of the speech becomes possible.

Other primary clusters and secondary clusters can also be used within the scope of the invention. In particular, it is possible, for example, to use secondary clusters with a constant length of, say, four phonemes. For certain applications, however, it could also be expedient to use considerably longer secondary clusters, which may comprise, for example, a complete phrase, a complete sentence or an entire paragraph. The longer the secondary clusters are chosen to be, the more specialized the field of application of the speech synthesis should be. A typical example of a very specialized field of application is a navigation system for motor vehicles, in which very similar sentences and sentence structures are generated repeatedly.

Fig. 3 shows schematically, as a flow chart, a method for determining the duration of individual phones for speech synthesis.

The starting point of the method is that a phoneme of a text to be synthesized is to be converted into a phone, and the duration of this phone is to be determined.

The method begins with step S13. In step S14 the context of the phoneme in the source text is determined. The extent of the context is expediently chosen so that it corresponds to the length of the secondary cluster. In the present exemplary embodiment, the context is determined to the extent of one word.

In step S15 it is checked whether the context determined in step S14 is stored as a secondary cluster in the secondary statistic. If this is the case, the program flow passes to step S16, in which the mean phone duration d' and the standard deviation are read out that are assigned to the phoneme of the secondary cluster corresponding to the phoneme of the source text. The program flow then passes to step S17, in which the phone duration d actually to be applied is calculated from the mean phone duration d' and the standard deviation G according to the formula d = d' + G · s, where s is a speed scaling factor calculated according to the formula s = R_rel - 1, and R_rel is the ratio of the speaking rate to be produced to the speaking rate at which the text on which the statistic is based was spoken.
By taking the standard deviation into account, phones that the speaker of the training text pronounced with strongly differing lengths are varied correspondingly strongly in speech synthesis. For example, plosive sounds such as "k" are varied very little, which is why they have a very small standard deviation; they are accordingly varied little in speech synthesis. Vowels such as "a" are varied strongly, which is why they have a correspondingly large standard deviation. With the above formulas it should be noted that the speed scaling factor s can also assume negative values, in which case the phone duration is shortened accordingly relative to the mean phone duration.
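As a minimal sketch of the calculation in step S17, the two formulas can be written out directly. The function name, the parameter names and the use of milliseconds are assumptions made for illustration only; the source specifies only d = d' + G · s and s = R_rel - 1.

```python
def scaled_phone_duration(mean_duration: float,
                          std_dev: float,
                          rate_ratio: float) -> float:
    """Return the phone duration d = d' + G * s with s = R_rel - 1.

    mean_duration -- d', mean phone duration read from the statistic
    std_dev       -- G, standard deviation read from the statistic
    rate_ratio    -- R_rel, the speaking-rate ratio as defined in the text
    The unit (here thought of as milliseconds) is an assumption;
    any consistent unit works.
    """
    s = rate_ratio - 1.0          # speed scaling factor, may be negative
    return mean_duration + std_dev * s


# A vowel with a large standard deviation is varied strongly ...
vowel = scaled_phone_duration(mean_duration=80.0, std_dev=20.0, rate_ratio=0.5)
# ... while a plosive with a small standard deviation barely changes.
plosive = scaled_phone_duration(mean_duration=40.0, std_dev=2.0, rate_ratio=0.5)
```

With R_rel = 1 the scaling factor is zero and the mean phone duration is returned unchanged; the numeric values above are invented for the example and are not taken from the source.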

If, on the other hand, the query in step S15 shows that the context determined in step S14 is not contained in the secondary statistic, the method passes to step S18. In step S18 it is checked whether the section of the context in the region of the phoneme to be converted is identical to a primary cluster of the primary statistic. If this is the case, the method passes to step S19. In step S19 the mean phone duration and the standard deviation of the middle phoneme of the corresponding primary cluster are read out. The method then passes to step S17, in which the phone duration actually to be applied is calculated in the manner explained above.

If the query in step S18 shows that the primary statistic contains no primary cluster identical to the context of the source text, the method passes to step S20, in which a primary cluster is determined that is as similar as possible in sound to the context.

In the subsequent step S21, the mean phone duration and the standard deviation of the middle phoneme of this primary cluster are read out. The method then passes to step S17.

After step S17 has been executed, the method for determining the duration of a phone of a phoneme of a source text is ended in step S18.
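The flow of steps S15 to S21 can be sketched as a two-stage lookup. This is an illustrative sketch under stated assumptions, not the implementation of the invention: the dictionary layouts, the function names, the boundary symbol "#" for word edges and, in particular, the position-matching similarity measure used for step S20 are all hypothetical, since the source does not specify how acoustic similarity is measured.

```python
from typing import Dict, List, Sequence, Tuple

Stats = Tuple[float, float]      # (mean phone duration d', standard deviation G)
Cluster = Tuple[str, ...]

BOUNDARY = "#"                   # hypothetical padding symbol for word edges


def triphone_at(word: Sequence[str], index: int) -> Cluster:
    """Return the triphone (left, centre, right) around word[index]."""
    left = word[index - 1] if index > 0 else BOUNDARY
    right = word[index + 1] if index + 1 < len(word) else BOUNDARY
    return (left, word[index], right)


def similarity(a: Cluster, b: Cluster) -> int:
    """Placeholder similarity: matching positions, centre counted twice."""
    return sum((2 if i == 1 else 1)
               for i, (x, y) in enumerate(zip(a, b)) if x == y)


def lookup_stats(word: Sequence[str], index: int,
                 secondary: Dict[Cluster, List[Stats]],
                 primary: Dict[Cluster, Stats]) -> Stats:
    """Return (d', G) for the phoneme word[index]; primary must be non-empty."""
    key = tuple(word)
    if key in secondary:                          # S15: word is a secondary cluster
        return secondary[key][index]              # S16
    tri = triphone_at(word, index)
    if tri in primary:                            # S18: identical primary cluster
        return primary[tri]                       # S19
    best = max(primary, key=lambda c: similarity(c, tri))   # S20
    return primary[best]                          # S21


# Invented example data, not taken from the source.
secondary = {("h", "a", "t"): [(50.0, 5.0), (90.0, 20.0), (40.0, 3.0)]}
primary = {("#", "k", "a"): (45.0, 4.0), ("k", "a", "t"): (85.0, 18.0)}
```

The returned pair would then be fed into the calculation of step S17. Keeping the similarity measure as a separate function makes it easy to swap in a phonetically motivated measure.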

The method according to the invention for determining the phone durations for speech synthesis is thus a two-stage method. First, an attempt is made to determine a mean phone duration by means of the secondary statistic, which is based on a specific context (here: word length); this yields a sound duration that is considerably closer to natural speech than the phone duration determined on the basis of the primary statistic. If the phone duration cannot be determined by means of the secondary statistic, the primary statistic is used instead, which is in principle always applicable.

In particular, the combination of the method for generating the statistic and the method for determining the phone durations constitutes an essentially purely statistical method for determining the phone durations, which can be set up and applied essentially without expert knowledge. In the exemplary embodiment described above, expert knowledge is used, for example, only in the segmentation of the speech recording, and this step can also be automated by means of known methods.

The methods according to the invention are thus simple to implement and to train. Nevertheless, initial experiments with prototypes have shown that they bring about a substantial increase in speech quality in speech synthesis, since providing the secondary statistic allows the phone duration to be determined in a more speech-specific manner.

The methods described above can be implemented as computer programs that run autonomously on a computer in order to generate the statistic or to determine the phone durations. They thus constitute automatically executable methods.

The computer programs can also be stored on electrically readable data carriers and thus be transferred to other computer systems.

A computer system suitable for applying the method according to the invention is shown in Fig. 4. The computer system 1 has an internal bus 2, which is connected to a memory area 3, a central processor unit 4 and an interface 5. The interface 5 establishes a data connection to further computer systems via a data line 6. An acoustic output unit 7, a graphic output unit 8 and an input unit 9 are also connected to the internal bus 2. The acoustic output unit 7 is connected to a loudspeaker 10, the graphic output unit 8 to a screen 11 and the input unit 9 to a keyboard 12.
Speech recordings of a text can be transmitted to the computer system 1 via the data line 6 and the interface 5 and are stored in the memory area 3. The memory area 3 is divided into several areas, in which speech recordings, audio files, application programs for carrying out the methods according to the invention and further application and utility programs are stored. The speech files are analyzed with predetermined program packages and segmented into the individual phones. The method according to the invention for generating a statistic is then executed, with the primary statistic and the secondary statistic being available as the result.
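How a secondary statistic might be derived from the segmented phones can be sketched as follows. This is only an illustration under assumptions: the input layout (one pair of phoneme sequence and measured durations per spoken word), the function name and the use of the population standard deviation are hypothetical choices. The mean duration, the standard deviation and the minimum frequency of a secondary cluster are taken from the claims.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, Iterable, List, Sequence, Tuple

Stats = Tuple[float, float]      # (mean phone duration, standard deviation)
Cluster = Tuple[str, ...]


def build_secondary_statistic(
        occurrences: Iterable[Tuple[Sequence[str], Sequence[float]]],
        min_frequency: int = 3) -> Dict[Cluster, List[Stats]]:
    """Collect per-phoneme duration statistics for word-length clusters.

    occurrences   -- (phonemes of a word, measured phone durations) pairs,
                     one pair per occurrence of the word in the recording
    min_frequency -- clusters seen less often than this are not recorded
    """
    grouped: Dict[Cluster, List[Sequence[float]]] = defaultdict(list)
    for phonemes, durations in occurrences:
        if len(phonemes) != len(durations):
            raise ValueError("one duration is needed per phoneme")
        grouped[tuple(phonemes)].append(durations)

    statistic: Dict[Cluster, List[Stats]] = {}
    for cluster, samples in grouped.items():
        if len(samples) < min_frequency:
            continue                      # too rare to yield a reliable statistic
        # zip(*samples) regroups the durations by phoneme position.
        statistic[cluster] = [(mean(col), pstdev(col)) for col in zip(*samples)]
    return statistic


# Invented example data, not taken from the source.
recording = [
    (("h", "a", "t"), (50.0, 90.0, 40.0)),
    (("h", "a", "t"), (52.0, 94.0, 42.0)),
    (("h", "a", "t"), (48.0, 86.0, 38.0)),
    (("k", "a", "t"), (45.0, 85.0, 41.0)),   # occurs once, so it is skipped
]
stats = build_secondary_statistic(recording, min_frequency=3)
```

A primary statistic could be built in the same way by grouping on triphones instead of whole words and keeping only the statistics of the middle phoneme.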

A text stored in the memory area 3, for example via the data line 6 and the interface 5, can then be converted into an audio file, the phone durations being determined by means of the method according to the invention (Fig. 3) on the basis of the primary statistic and the secondary statistic.

An audio file generated in this way is transmitted via the internal bus 2 to the acoustic output unit 7, which outputs it as speech at the loudspeaker 10.

Claims

Method for generating a statistic of phone durations, on the basis of which the phone durations can be controlled in synthetic speech generation, comprising the following steps:

assigning phones of a spoken and recorded text, segmented into phones, to phonemes of predetermined primary clusters, each composed of a plurality of phonemes, wherein a phone is assigned to a phoneme of a primary cluster if it occurs in the spoken text in a context that is identical or similar to the context of the phoneme of the primary cluster,

creating a primary statistic which comprises at least the mean phone duration of all phones that are assigned to the respective phoneme of a primary cluster,

characterized by

assigning phones of the spoken and recorded text to phonemes of predetermined secondary clusters, which are composed of phonemes, wherein the number of phonemes of at least some secondary clusters differs from the number of phonemes of the primary clusters, and wherein a phone is assigned to a phoneme of a secondary cluster if it occurs in the spoken text in a context that is identical to the context of the phoneme of the secondary cluster,

creating a secondary statistic which comprises at least the mean phone duration of all phones that are assigned to the respective phoneme of a secondary cluster.

Method for generating statistics of phone durations according to claim 1,
characterized in that the number of phonemes of the primary clusters is constant and the number is, for example, 3.

Method for generating statistics according to claim 1 or 2,
characterized in that the number of phonemes of the secondary clusters is variable and the secondary clusters each comprise, for example, the phonemes of a word.

Method for generating statistics according to one of claims 1 to 3,
characterized in that the primary statistics and the secondary statistics each comprise the standard deviation of the respective phone duration.

Method for generating statistics according to one of claims 1 to 4,
characterized in that only secondary clusters whose frequency in the text is greater than or equal to a predetermined minimum frequency are recorded in the secondary statistics.

Method for generating statistics according to one of claims 1 to 5,
characterized in that the minimum frequency is at least 3 and is preferably in the range of 3 to 10.

Method for generating statistics according to one of claims 1 to 6,
characterized in that the assignment of the phones to phonemes of the primary clusters takes place by means of a predetermined list of phonemes grouped in primary clusters, the phones being assigned to the individual phonemes of the primary clusters of the list and the individual assignments being stored.

Method according to claim 7,
characterized in that, for the individual phonemes of the primary clusters of the list, the mean phone duration (d) and the standard deviation (G) of the average phone duration are calculated on the basis of the stored assignments.

Method according to one of claims 1 to 8,
characterized in that the assignment of the phones to the phonemes of the secondary clusters takes place using a predetermined list of phonemes grouped in secondary clusters, the phones being assigned to the individual phonemes of the secondary clusters of the list and the individual assignments being stored.

Method according to claim 9,
characterized in that, for the individual phonemes of the secondary clusters in the list, the average phone duration (d) and the standard deviation (G) of the average phone duration are calculated on the basis of the stored assignments.

Method for determining the duration of individual phones for speech synthesis by means of a statistic of phone durations which has a primary statistic and a secondary statistic, the primary statistic comprising phonemes grouped in primary clusters, with at least a mean phone duration being assigned to the individual phonemes of the primary clusters, and the secondary statistic comprising phonemes grouped in secondary clusters, with at least a mean phone duration being assigned to the individual phonemes of the secondary clusters, comprising the following steps:

determining whether the phoneme to be converted into speech, for which the phone duration is to be determined, is part of a secondary cluster,

assigning the mean phone duration (d) that is assigned in the secondary statistic to the corresponding phoneme in the respective secondary cluster, if the phoneme is part of a secondary cluster, and

assigning the mean phone duration (d) that is assigned in the primary statistic to the corresponding phoneme in the respective primary cluster, if the phoneme is not part of a secondary cluster.

Method for determining the duration of individual phones in speech synthesis by means of a statistic, wherein the statistic is generated by a method according to one of claims 1 to 10.

Method according to claim 11 or 12,
characterized in that the standard deviations (G) of the mean phone durations (d') stored in the statistic are taken into account in determining the duration (d) of the individual phones according to the formula d = d' + G · s, where s is a speed scaling factor calculated according to the formula s = R_rel - 1, and R_rel is the ratio of the speaking rate to be produced to the speaking rate at which the text on which the statistic is based was spoken.

Device for generating a statistic of phone durations, on the basis of which the phone durations can be controlled in synthetic speech generation, with
a computer system (1) which has a memory area (3) in which a program for executing a method according to one of claims 1 to 10 is stored.

Device for determining the duration of individual phones for speech synthesis with
a computer system (1) which has a memory area (3) in which a program for executing a method according to one of claims 11 to 13 is stored.