EP1224531B1 - Method for detecting the time sequences of a fundamental frequency of an audio-response unit to be synthesised - Google Patents

Method for detecting the time sequences of a fundamental frequency of an audio-response unit to be synthesised

Info

Publication number
EP1224531B1
EP1224531B1 (application EP00984858A)
Authority
EP
European Patent Office
Prior art keywords
fundamental
frequency
macrosegment
fundamental frequency
sequences
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Lifetime
Application number
EP00984858A
Other languages
German (de)
French (fr)
Other versions
EP1224531A2 (en)
Inventor
Martin Holzapfel
Caglayan Erdem
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Siemens AG
Original Assignee
Siemens AG
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Siemens AG filed Critical Siemens AG
Publication of EP1224531A2 publication Critical patent/EP1224531A2/en
Application granted granted Critical
Publication of EP1224531B1 publication Critical patent/EP1224531B1/en
Anticipated expiration legal-status Critical
Expired - Lifetime legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/90 Pitch determination of speech signals
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L2025/783 Detection of presence or absence of voice signals based on threshold decision
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Definitions

  • The invention relates to a method for determining the temporal course of a fundamental frequency of a speech output to be synthesized.
  • The invention is therefore based on the object of creating a method for determining the temporal course of a fundamental frequency of a speech output to be synthesized that gives the speech output a natural sound very similar to a human voice.
  • The present invention is based on the finding that determining the course of a fundamental frequency by means of a neural network produces a macrostructure of the temporal course that is very similar to the course of the fundamental frequency of natural speech, and that fundamental frequency sequences stored in a database reproduce the microstructure of the fundamental frequency of natural speech very closely.
  • An optimal determination of the course of the fundamental frequency is thus achieved that, in both its macrostructure and its microstructure, is much more similar to natural speech than a fundamental frequency generated with previously known methods. A considerable approximation of the synthetic speech output to natural speech is thereby achieved.
  • The synthetic speech created in this way is very similar to natural speech and can hardly be distinguished from it.
  • The deviation between the replica macro segment and the default macro segment is preferably determined by means of a cost function that is weighted such that small deviations from the fundamental frequency of the default macro segment yield only a small value, whereas once predetermined limit frequency differences are exceeded, the determined deviations rise steeply until a saturation value is reached.
  • Deviations are weighted less the closer they are to the edge of a syllable.
  • The default macro segment is preferably reproduced by generating several fundamental frequency sequences for each microprosodic unit, with combinations of fundamental frequency sequences being rated both with respect to their deviation from the default macro segment and with respect to their pairwise matching. Depending on the outcome of these two ratings (deviation from the default macro segment, matching between adjacent fundamental frequency sequences), an appropriate combination of fundamental frequency sequences is then selected.
  • The pairwise matching evaluates in particular the transitions between adjacent fundamental frequency sequences; larger jumps should be avoided here.
  • The syllable nucleus is decisive for the auditory impression.
  • Fig. 6 shows, in a flow chart, a method for synthesizing speech in which a text is converted into a sequence of acoustic signals.
  • This method is implemented in the form of a computer program that is started in step S1.
  • In step S2 a text is entered, which is present in the form of an electronically readable text file.
  • A sequence of phonemes, i.e. a sound sequence, is created from the individual graphemes of the text, a grapheme being one or more letters to each of which a phoneme is assigned.
  • The phonemes assigned to the individual graphemes are then determined, which defines the phoneme sequence.
  • In step S4 a stress structure is determined, i.e. it is determined how strongly the individual phonemes are to be stressed.
  • The stress structure is represented in Fig. 1a by means of a timeline for the word "stop". Accordingly, the grapheme "st" has been assigned stress level 1, the grapheme "o" stress level 0.3 and the grapheme "p" stress level 0.5.
  • The duration of the individual phonemes is then determined (S5).
  • In step S6 the temporal course of the fundamental frequency is determined, which is explained in more detail below.
  • The wave file is converted into acoustic signals by means of an acoustic output unit and a loudspeaker (S8), which ends the speech output (S9).
  • The temporal course of the fundamental frequency of the speech output to be synthesized is generated by means of a neural network in combination with fundamental frequency sequences stored in a database.
  • The method corresponding to step S6 in Fig. 6 is shown in more detail in a flow chart in Fig. 5.
  • This method for determining the temporal course of the fundamental frequency is a subroutine of the program shown in Fig. 6.
  • The subroutine is started with step S10.
  • In step S11 a default macro segment of the fundamental frequency is determined by means of a neural network, which is shown in schematically simplified form in Fig. 4.
  • At an input layer I, the neural network has nodes for entering a phonetic linguistic unit PE of the text to be synthesized and a context Kl, Kr to the left and right of the phonetic linguistic unit.
  • The phonetic linguistic unit consists, for example, of a phrase, a word or a syllable of the text to be synthesized, for which the default macro segment of the fundamental frequency is to be determined.
  • The left context Kl and the right context Kr each represent a section of text to the left and right of the phonetic linguistic unit PE. The data entered with the phonetic unit comprise the corresponding phoneme sequence, the stress structure and the duration of each phoneme.
  • The information entered with the left or right context comprises at least the phoneme sequence; it can also be useful to enter the stress structure and/or the sound durations.
  • The length of the left and right context can correspond to the length of the phonetic linguistic unit PE, i.e. again be a phrase, a word or a syllable. However, it can also be useful to provide a longer context of, for example, two or three words as the left or right context.
  • These inputs Kl, PE and Kr are processed in a hidden layer VS and output at an output layer O as the default macro segment VG of the fundamental frequency.
  • Fig. 1b shows such a default macro segment for the word "stop".
  • This default macro segment has a typical triangular course, which initially begins with a rise and ends with a somewhat shorter fall.
  • In step S12, fundamental frequency sequences assigned to graphemes are read out from a database, usually with a large number of fundamental frequency sequences being available for each grapheme. Fig. 1c schematically shows such fundamental frequency sequences for the graphemes "st", "o" and "p", only a small number of fundamental frequency sequences being shown in order to simplify the drawing.
  • A cost factor Kf is calculated for each combination of fundamental frequency sequences by means of a cost function.
  • The cost function has two terms, a local cost function lok(k_ij) and a link cost function Ver(k_ij, k_{n,j+1}).
  • The local cost function evaluates the deviation of the i-th fundamental frequency sequence of the j-th phoneme from the default macro segment.
  • The link cost function evaluates the matching between the i-th fundamental frequency sequence of the j-th phoneme and the n-th fundamental frequency sequence of the (j+1)-th phoneme.
  • The local cost function has, for example, the following form: an integral, over the time range from the beginning t_a of a phoneme to its end t_e, of the square of the difference between the fundamental frequency f_v specified by the default macro segment and the i-th fundamental frequency sequence of the j-th phoneme.
  • This local cost function thus determines a positive value for the deviation between the respective fundamental frequency sequence and the fundamental frequency of the default macro segment. Moreover, this cost function is very easy to implement and, due to its parabolic characteristic, produces a rating that resembles human hearing, since small deviations around the default sequence f_v are rated low, whereas larger deviations are rated progressively more heavily.
  • The local cost function is provided with a weighting term which leads to the function curve shown in Fig. 2.
  • The diagram in Fig. 2 shows the value of the local cost function lok(f_ij) as a function of the logarithm of the frequency f_ij of the i-th fundamental frequency sequence of the j-th phoneme. The diagram shows that deviations from the default frequency f_v within certain limit frequencies GF1, GF2 are rated only slightly, while a further deviation causes a steep rise up to a threshold value SW. Such a weighting corresponds to human hearing, which hardly perceives small frequency deviations but, beyond certain frequency differences, registers them as a clear difference.
  • The link cost function evaluates how well two successive fundamental frequency sequences are matched to each other. In particular, the frequency difference at the junction of the two fundamental frequency sequences is rated: the greater the difference between the frequency at the end of the preceding fundamental frequency sequence and the frequency at the beginning of the subsequent fundamental frequency sequence, the larger the output value of the link cost function.
  • Further parameters can also be taken into account, e.g. the continuity of the transition or the like.
  • The output value of the link cost function is weighted less the closer the respective junction of two neighboring fundamental frequency sequences is to the edge of a syllable. This corresponds to human hearing, which analyzes acoustic signals at the edge of a syllable less intensely than in the middle of the syllable. Such a weighting is also referred to as perceptually dominant.
  • According to the cost function Kf, the values of the local cost function and the link cost function of all fundamental frequency sequences are determined and summed for each combination of fundamental frequency sequences of the phonemes of a linguistic unit for which a default macro segment has been determined. From the set of combinations of fundamental frequency sequences, the combination for which the cost function Kf yields the smallest value is selected, since this combination of fundamental frequency sequences forms a fundamental frequency course for the corresponding linguistic unit, referred to as the replica macro segment, that is very similar to the default macro segment.
  • Fundamental frequency courses adapted to the default macro segments generated by the neural network are thus generated by means of individual fundamental frequency sequences stored in a database. This ensures a very natural macrostructure that also has the detailed microstructure of the fundamental frequency sequences.
  • After the selection of the combinations of fundamental frequency sequences for reproducing the default macro segment has been completed, it is checked in step S14 whether a further temporal course of the fundamental frequency must be generated for another phonetic linguistic unit. If this query in step S14 yields "yes", the program flow jumps back to step S11; otherwise the program flow branches to step S15, in which the individual replica macro segments of the fundamental frequency are assembled.
  • The junctions of the individual replica macro segments are matched to one another, as shown in Fig. 3.
  • The frequencies f_l to the left and f_r to the right of the junctions V are matched to one another, the end regions of the replica macro segments preferably being changed such that the frequencies f_l and f_r have the same value.
  • The transition can also be smoothed and/or made continuous in the area of the junction.
  • A course of a fundamental frequency can thus be generated that is very similar to the fundamental frequency of natural speech, since larger context areas can easily be captured and evaluated by means of the neural network (macrostructure) and, at the same time, the finest structures of the fundamental frequency curve corresponding to natural speech can be generated by means of the fundamental frequency sequences stored in the database (microstructure). This enables speech output with a much more natural sound than with previously known methods.
  • The order in which the fundamental frequency sequences are read from the database and the neural network creates the default macro segments can be varied. For example, it is also possible first to generate default macro segments for all phonetic linguistic units and only then to read out, combine, evaluate and select the individual fundamental frequency sequences. Within the scope of the invention, a wide variety of cost functions can also be used, as long as they take into account a deviation between a default macro segment of the fundamental frequency and microsegments of the fundamental frequency. For numerical reasons, the integral of the local cost function described above can also be represented as a sum.

Description

The invention relates to a method for determining the temporal course of a fundamental frequency of a speech output to be synthesized.

At the ICASSP 97 conference in Munich, a method for synthesizing speech from a text was presented under the title "Recent Improvements on Microsoft's Trainable Text-to-Speech System - Whistler", X. Huang et al.; it is completely trainable and assembles and generates the prosody of a text from prosody patterns stored in a database. The prosody of a text is essentially determined by the fundamental frequency, which is why this known method can also be regarded as a method for generating a fundamental frequency on the basis of corresponding patterns stored in a database. To achieve speech that is as natural as possible, elaborate correction procedures are provided that interpolate, smooth and correct the contour of the fundamental frequency.

At ICASSP 98 in Seattle, a further method for generating synthetic speech output from a text was presented under the title "Optimization of a Neural Network for Speaker and Task Dependent F0-Generation", Ralf Haury et al. Instead of a database of patterns, this known method uses a neural network to generate the fundamental frequency, with which the temporal course of the fundamental frequency for the speech output is determined.

The methods described above are intended to provide speech output that does not have the metallic, mechanical and unnatural sound known from conventional speech synthesis systems. These methods represent a significant improvement over conventional speech synthesis systems. Nevertheless, there are considerable differences in sound between speech output based on these methods and a human voice.

In particular, speech synthesis in which the fundamental frequency is assembled from individual fundamental frequency patterns still produces a metallic, mechanical sound that can clearly be distinguished from a natural voice. If, on the other hand, the fundamental frequency is determined with a neural network, the voice sounds more natural but somewhat dull.

The invention is therefore based on the object of creating a method for determining the temporal course of a fundamental frequency of a speech output to be synthesized that gives the speech output a natural sound very similar to a human voice.

This object is achieved by a method having the features of claim 1. Advantageous embodiments are specified in the subclaims.

The method according to the invention for determining the temporal course of a fundamental frequency of a speech output to be synthesized comprises the following steps:

  • determining default macro segments of the fundamental frequency by means of a neural network, and
  • determining microsegments by means of fundamental frequency sequences stored in a database, the fundamental frequency sequences being selected from the database in such a way that the respective default macro segment is reproduced with as little deviation as possible by the successive fundamental frequency sequences.
    The present invention is based on the finding that determining the course of a fundamental frequency by means of a neural network produces a macrostructure of the temporal course that is very similar to the course of the fundamental frequency of natural speech, and that the fundamental frequency sequences stored in a database reproduce the microstructure of the fundamental frequency of natural speech very closely. The combination according to the invention therefore yields an optimal determination of the course of the fundamental frequency that, in both its macrostructure and its microstructure, is much more similar to natural speech than a fundamental frequency generated with previously known methods. A considerable approximation of the synthetic speech output to natural speech is thereby achieved. The synthetic speech created in this way is very similar to natural speech and can hardly be distinguished from it.

    The deviation between the replica macro segment and the default macro segment is preferably determined by means of a cost function that is weighted such that small deviations from the fundamental frequency of the default macro segment yield only a small value, whereas once predetermined limit frequency differences are exceeded, the determined deviations rise steeply until a saturation value is reached. This means that all fundamental frequency sequences lying within the range of the limit frequencies represent a sensible choice for reproducing the default macro segment, while fundamental frequency sequences lying outside the range of the limit frequency differences are rated as substantially less suitable for reproducing the default macro segment. This non-linearity models the non-linear behavior of human hearing.

    According to a further preferred embodiment of the invention, deviations are weighted less the closer they are to the edge of a syllable.

    The default macro segment is preferably reproduced by generating several fundamental frequency sequences for each microprosodic unit, with combinations of fundamental frequency sequences being rated both with respect to their deviation from the default macro segment and with respect to their pairwise matching. Depending on the outcome of these two ratings (deviation from the default macro segment, matching between adjacent fundamental frequency sequences), an appropriate combination of fundamental frequency sequences is then selected.

    This pairwise matching evaluates in particular the transitions between adjacent fundamental frequency sequences, larger jumps being avoided here. According to a preferred embodiment of the invention, these pairwise matchings of the fundamental frequency sequences are weighted more strongly within a syllable than at the edge of the syllable. In German, the syllable nucleus is decisive for the auditory impression.

    The method according to the invention is explained in more detail below with reference to an exemplary embodiment shown in the drawings. The drawings schematically show:

    Fig. 1a to 1d   the construction and assembly of the temporal course of a fundamental frequency in four steps,
    Fig. 2          a function for weighting a cost function for determining the deviation between a replica macro segment and a default macro segment,
    Fig. 3          the course of a fundamental frequency consisting of several macro segments,
    Fig. 4          a schematically simplified structure of a neural network,
    Fig. 5          the method according to the invention in a flow chart, and
    Fig. 6          a method for synthesizing speech that is based on the method according to the invention.

    Fig. 6 shows, in a flow chart, a method for synthesizing speech in which a text is converted into a sequence of acoustic signals.

    This method is implemented in the form of a computer program that is started in step S1.

    In step S2, a text is entered, which is present in the form of an electronically readable text file.

    In the following step S3, a sequence of phonemes, i.e. a sound sequence, is created. The individual graphemes of the text are determined, a grapheme being one or more letters to each of which a phoneme is assigned. The phonemes assigned to the individual graphemes are then determined, which defines the phoneme sequence.
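
    The grapheme-to-phoneme step can be pictured as a simple lookup from graphemes to phoneme symbols. The following Python sketch is purely illustrative; the grapheme inventory and the phoneme symbols are assumptions, not taken from the patent.

```python
# Minimal sketch of step S3: assigning a phoneme to each grapheme.
# The mapping below is a hypothetical toy inventory for the word "stop".
GRAPHEME_TO_PHONEME = {"st": "S-t", "o": "O", "p": "p"}

def graphemes_to_phonemes(graphemes):
    """Return the phoneme sequence for a list of graphemes."""
    return [GRAPHEME_TO_PHONEME[g] for g in graphemes]

print(graphemes_to_phonemes(["st", "o", "p"]))  # ['S-t', 'O', 'p']
```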

    In step S4, a stress structure is determined, i.e. it is determined how strongly the individual phonemes are to be stressed.

    The stress structure is represented in Fig. 1a by means of a timeline for the word "stop". Accordingly, the grapheme "st" has been assigned stress level 1, the grapheme "o" stress level 0.3 and the grapheme "p" stress level 0.5.

    The duration of the individual phonemes is then determined (S5).

    In step S6, the temporal course of the fundamental frequency is determined, which is explained in more detail below.

    After the phoneme sequence and the fundamental frequency have been determined, a wave file can be generated on the basis of the phonemes and the fundamental frequency (S7).

    The wave file is converted into acoustic signals by means of an acoustic output unit and a loudspeaker (S8), which ends the speech output (S9).

    According to the invention, the temporal course of the fundamental frequency of the speech output to be synthesized is generated by means of a neural network in combination with fundamental frequency sequences stored in a database.

    The method corresponding to step S6 in Fig. 6 is shown in more detail in a flow chart in Fig. 5.

    This method for determining the temporal course of the fundamental frequency is a subroutine of the program shown in Fig. 6. The subroutine is started with step S10.

    In step S11, a default macro segment of the fundamental frequency is determined by means of a neural network. Such a neural network is shown in schematically simplified form in Fig. 4. At an input layer I, the neural network has nodes for entering a phonetic linguistic unit PE of the text to be synthesized and a context Kl, Kr to the left and right of the phonetic linguistic unit. The phonetic linguistic unit consists, for example, of a phrase, a word or a syllable of the text to be synthesized, for which the default macro segment of the fundamental frequency is to be determined. The left context Kl and the right context Kr each represent a section of text to the left and right of the phonetic linguistic unit PE. The data entered with the phonetic unit comprise the corresponding phoneme sequence, the stress structure and the duration of the individual phonemes. The information entered with the left or right context comprises at least the phoneme sequence; it can also be useful to enter the stress structure and/or the sound durations. The length of the left and right context can correspond to the length of the phonetic linguistic unit PE, i.e. again be a phrase, a word or a syllable. However, it can also be useful to provide a longer context of, for example, two or three words as the left or right context. These inputs Kl, PE and Kr are processed in a hidden layer VS and output at an output layer O as the default macro segment VG of the fundamental frequency.
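
    As a rough illustration of the topology just described (nodes for the unit PE and the contexts Kl and Kr at the input layer I, one hidden layer VS, an output layer O delivering the default macro segment VG), the following Python/NumPy sketch builds a small feed-forward network. The feature encoding, the layer sizes and the fixed number of output F0 samples are assumptions made only for this example; the patent does not prescribe them, and the weights would of course have to be trained on natural speech.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes: Kl, PE and Kr are each encoded as a fixed-length feature vector
# (phoneme identities, stress levels, durations); 32 output samples describe the
# default macro segment VG (F0 contour in Hz) of the unit PE.
N_FEAT, N_HIDDEN, N_OUT = 64, 128, 32

W1 = rng.normal(0.0, 0.1, (3 * N_FEAT, N_HIDDEN))  # input layer I -> hidden layer VS
b1 = np.zeros(N_HIDDEN)
W2 = rng.normal(0.0, 0.1, (N_HIDDEN, N_OUT))       # hidden layer VS -> output layer O
b2 = np.zeros(N_OUT)

def default_macro_segment(kl, pe, kr):
    """Map left context Kl, phonetic linguistic unit PE and right context Kr
    to a default macro segment VG of the fundamental frequency."""
    x = np.concatenate([kl, pe, kr])   # inputs Kl, PE, Kr
    h = np.tanh(x @ W1 + b1)           # hidden layer VS
    return h @ W2 + b2                 # output layer O: 32 F0 values

# Example call with random (untrained) feature vectors:
kl = rng.normal(size=N_FEAT); pe = rng.normal(size=N_FEAT); kr = rng.normal(size=N_FEAT)
vg = default_macro_segment(kl, pe, kr)  # shape (32,)
```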

    Fig. 1b shows such a default macro segment for the word "stop". This default macro segment has a typical triangular course, which initially begins with a rise and ends with a somewhat shorter fall.

    After a default macro segment of the fundamental frequency has been determined, the microsegments corresponding to the default macro segment are determined in steps S12 and S13.

    In step S12, fundamental frequency sequences assigned to graphemes are read out from a database, usually with a large number of fundamental frequency sequences being available for each grapheme. Fig. 1c schematically shows such fundamental frequency sequences for the graphemes "st", "o" and "p", only a small number of fundamental frequency sequences being shown in order to simplify the drawing.
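
    One straightforward way to represent such a database is a mapping from each grapheme to its stored fundamental frequency sequences. The sketch below uses short NumPy arrays of F0 values in Hz; the concrete graphemes, lengths and values are purely illustrative assumptions.

```python
import numpy as np

# Hypothetical database: several candidate F0 sequences (in Hz) per grapheme.
F0_DATABASE = {
    "st": [np.array([110.0, 115.0, 120.0]), np.array([105.0, 112.0, 118.0])],
    "o":  [np.array([125.0, 140.0, 135.0, 128.0]), np.array([120.0, 132.0, 130.0, 122.0])],
    "p":  [np.array([118.0, 110.0, 100.0]), np.array([122.0, 112.0, 104.0])],
}

def read_candidates(grapheme):
    """Step S12: read out all fundamental frequency sequences stored for one grapheme."""
    return F0_DATABASE[grapheme]
```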

    These fundamental frequency sequences can in principle be combined with one another arbitrarily. The possible combinations of these fundamental frequency sequences are rated by means of a cost function. This method step is carried out using the Viterbi algorithm.

    For each combination of fundamental frequency sequences, which contains one fundamental frequency sequence for each phoneme, a cost factor Kf is calculated using the following cost function:

        Kf = \sum_{j=1}^{l} \big[ \mathrm{lok}(k_{ij}) + \mathrm{Ver}(k_{ij}, k_{n,j+1}) \big]

    The cost function is a sum over j = 1 to l, where j is the index of the phonemes and l is the total number of phonemes. The cost function has two terms, a local cost function lok(k_ij) and a link cost function Ver(k_ij, k_{n,j+1}). The local cost function evaluates the deviation of the i-th fundamental frequency sequence of the j-th phoneme from the default macro segment. The link cost function evaluates the matching between the i-th fundamental frequency sequence of the j-th phoneme and the n-th fundamental frequency sequence of the (j+1)-th phoneme.
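
    The structure of Kf can be sketched directly in code: for one candidate combination (one fundamental frequency sequence per phoneme), the local costs and the link costs between neighbouring sequences are summed. The concrete `lok` and `ver` functions are passed in as parameters; simple realizations of both are sketched after the following paragraphs.

```python
def total_cost(combination, targets, lok, ver):
    """Kf = sum over j of [ lok(k_ij) + Ver(k_ij, k_{n,j+1}) ].
    `combination` holds one F0 sequence per phoneme, `targets` the matching
    slices of the default macro segment; `lok` and `ver` are the local and
    link cost functions."""
    kf = 0.0
    n = len(combination)
    for j in range(n):
        kf += lok(combination[j], targets[j])              # deviation from the default macro segment
        if j + 1 < n:
            kf += ver(combination[j], combination[j + 1])  # matching with the next sequence
    return kf
```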

    The local cost function has, for example, the following form:

        \mathrm{lok}(k_{ij}) = \int_{t_a}^{t_e} \big( f_v(t) - f_{ij}(t) \big)^2 \, dt

    The local cost function is thus an integral, over the time range from the beginning t_a of a phoneme to its end t_e, of the square of the difference between the fundamental frequency f_v specified by the default macro segment and the i-th fundamental frequency sequence f_ij of the j-th phoneme.
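
    As noted at the end of the description, the integral can be represented as a sum for numerical reasons. A discretized version, assuming that the candidate sequence f_ij and the default macro segment f_v are sampled at the same instants with a sampling period dt (an assumed parameter), could look as follows.

```python
import numpy as np

def local_cost(f_ij, f_v, dt=0.005):
    """Discrete form of lok(k_ij): sum of the squared difference between the
    default macro segment f_v and the candidate sequence f_ij over the phoneme,
    i.e. the integral from t_a to t_e approximated with step dt."""
    f_ij = np.asarray(f_ij, dtype=float)
    f_v = np.asarray(f_v, dtype=float)
    return float(np.sum((f_v - f_ij) ** 2) * dt)
```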

    This local cost function thus determines a positive value for the deviation between the respective fundamental frequency sequence and the fundamental frequency of the default macro segment. Moreover, this cost function is very easy to implement and, due to its parabolic characteristic, produces a rating that resembles human hearing, since small deviations around the default sequence f_v are rated low, whereas larger deviations are rated progressively more heavily.

    According to a preferred embodiment, the local cost function is provided with a weighting term which leads to the function curve shown in Fig. 2. The diagram in Fig. 2 shows the value of the local cost function lok(f_ij) as a function of the logarithm of the frequency f_ij of the i-th fundamental frequency sequence of the j-th phoneme. The diagram shows that deviations from the default frequency f_v within certain limit frequencies GF1, GF2 are rated only slightly, while a further deviation causes a steep rise up to a threshold value SW. Such a weighting corresponds to human hearing, which hardly perceives small frequency deviations but, beyond certain frequency differences, registers them as a clear difference.
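
    The weighting of Fig. 2 can be approximated by a function of the logarithmic deviation from f_v that stays small inside the band between GF1 and GF2 and then rises steeply until it saturates near the threshold SW. The concrete shape below (a gentle quadratic inside the band, a clipped steep quadratic outside) is only one plausible realization chosen for illustration; the patent describes the qualitative behaviour, not this formula.

```python
import numpy as np

def perceptual_weight(f_ij, f_v, gf=0.1, sw=1.0, slope=50.0, small=0.1):
    """Rate the deviation |log f_ij - log f_v|: almost flat within +/- gf
    (the band between GF1 and GF2), then a steep rise saturating near SW.
    f_v is the default frequency (scalar), f_ij a scalar or array of candidates."""
    d = np.abs(np.log(np.asarray(f_ij, dtype=float)) - np.log(float(f_v)))
    flat = small * d ** 2
    steep = small * gf ** 2 + np.minimum(sw, slope * (d - gf) ** 2)
    return np.where(d <= gf, flat, steep)
```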

    The link cost function evaluates how well two successive fundamental frequency sequences are matched to each other. In particular, the frequency difference at the junction of the two fundamental frequency sequences is rated: the greater the difference between the frequency at the end of the preceding fundamental frequency sequence and the frequency at the beginning of the subsequent fundamental frequency sequence, the larger the output value of the link cost function. However, further parameters can also be taken into account, e.g. the continuity of the transition or the like.
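
    A minimal link cost that follows this description penalizes the F0 jump at the junction of two successive sequences; a second, optional term comparing the slopes on both sides indicates how the continuity of the transition could additionally be taken into account. Both weights are illustrative assumptions.

```python
import numpy as np

def link_cost(prev_seq, next_seq, w_jump=1.0, w_slope=0.1):
    """Ver(k_ij, k_{n,j+1}): the larger the F0 difference at the junction of the
    two sequences, the larger the cost; the optional second term compares the
    slopes on either side of the junction (continuity of the transition)."""
    prev_seq = np.asarray(prev_seq, dtype=float)
    next_seq = np.asarray(next_seq, dtype=float)
    jump = (prev_seq[-1] - next_seq[0]) ** 2
    slope_prev = prev_seq[-1] - prev_seq[-2] if len(prev_seq) > 1 else 0.0
    slope_next = next_seq[1] - next_seq[0] if len(next_seq) > 1 else 0.0
    return float(w_jump * jump + w_slope * (slope_prev - slope_next) ** 2)
```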

    In a preferred embodiment of the invention, the output value of the link cost function is weighted less the closer the respective junction of two neighboring fundamental frequency sequences is to the edge of a syllable. This corresponds to human hearing, which analyzes acoustic signals at the edge of a syllable less intensely than in the middle of the syllable. Such a weighting is also referred to as perceptually dominant.

    According to the above cost function Kf, the values of the local cost function and the link cost function of all fundamental frequency sequences are determined and summed for each combination of fundamental frequency sequences of the phonemes of a linguistic unit for which a default macro segment has been determined. From the set of combinations of fundamental frequency sequences, the combination for which the cost function Kf yields the smallest value is selected, since this combination of fundamental frequency sequences forms a fundamental frequency course for the corresponding linguistic unit, referred to as the replica macro segment, that is very similar to the default macro segment.
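
    Because the link cost couples only neighbouring phonemes, the lowest-cost combination does not have to be found by enumerating every combination explicitly; a Viterbi-style dynamic programme over the candidate sequences of each phoneme, as mentioned above, finds the minimum of Kf. The sketch below assumes cost functions with the signatures of the `local_cost` and `link_cost` examples given earlier.

```python
def select_combination(candidates, targets, local_cost, link_cost):
    """Viterbi-style search: `candidates[j]` is the list of stored F0 sequences
    for phoneme j, `targets[j]` the matching slice of the default macro segment.
    Returns the combination that minimizes Kf = sum_j [lok + Ver]."""
    n = len(candidates)
    # best[j][i]: minimal cost of any path ending in candidate i of phoneme j
    best = [[local_cost(c, targets[0]) for c in candidates[0]]]
    back = [[None] * len(candidates[0])]
    for j in range(1, n):
        row, ptr = [], []
        for cur in candidates[j]:
            costs = [best[j - 1][p] + link_cost(prev, cur)
                     for p, prev in enumerate(candidates[j - 1])]
            p_best = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[p_best] + local_cost(cur, targets[j]))
            ptr.append(p_best)
        best.append(row)
        back.append(ptr)
    # Backtrack the cheapest path through the candidate lattice.
    i = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = [i]
    for j in range(n - 1, 0, -1):
        i = back[j][i]
        path.append(i)
    path.reverse()
    return [candidates[j][path[j]] for j in range(n)]
```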

    With the method according to the invention, fundamental frequency courses adapted to the default macro segments generated by the neural network are thus generated by means of individual fundamental frequency sequences stored in a database. This ensures a very natural macrostructure that also has the detailed microstructure of the fundamental frequency sequences.

    Such a replica macro segment for the word "stop" is shown in Fig. 1d.

    After the selection of the combinations of fundamental frequency sequences for reproducing the default macro segment has been completed in step S13, it is checked in step S14 whether a further temporal course of the fundamental frequency must be generated for another phonetic linguistic unit. If this query in step S14 yields "yes", the program flow jumps back to step S11; otherwise the program flow branches to step S15, in which the individual replica macro segments of the fundamental frequency are assembled.

    In step S16, the junctions of the individual replica macro segments are matched to one another, as shown in Fig. 3. The frequencies f_l to the left and f_r to the right of the junctions V are matched to one another, the end regions of the replica macro segments preferably being changed such that the frequencies f_l and f_r have the same value. Preferably, the transition can also be smoothed and/or made continuous in the area of the junction.
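
    Matching the junctions of adjacent replica macro segments (step S16) can be done, for example, by moving both end frequencies f_l and f_r to a common value (here: their mean) and fading the segment ends toward that value over a few samples. The choice of the mean and of a linear ramp is an assumption for illustration; the patent only requires that f_l and f_r take the same value and that the transition may additionally be smoothed.

```python
import numpy as np

def match_junction(left_seg, right_seg, ramp=5):
    """Adjust the end regions of two adjacent replica macro segments so that the
    frequency f_l at the end of the left segment and f_r at the start of the
    right segment take the same value at the junction V."""
    left = np.asarray(left_seg, dtype=float).copy()
    right = np.asarray(right_seg, dtype=float).copy()
    target = 0.5 * (left[-1] + right[0])          # common frequency at the junction
    nl, nr = min(ramp, len(left)), min(ramp, len(right))
    ramp_l = np.linspace(0.0, 1.0, nl + 1)[1:]    # reaches exactly 1 at the junction
    ramp_r = np.linspace(1.0, 0.0, nr + 1)[:-1]   # starts exactly at 1 at the junction
    left[-nl:] += ramp_l * (target - left[-1])
    right[:nr] += ramp_r * (target - right[0])
    return left, right
```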

    After the replica macro segments of the fundamental frequency have been created and assembled for all linguistic phonetic units of the text, the subroutine is ended and the program flow returns to the main program (S17).

    With the method according to the invention, a course of a fundamental frequency can thus be generated that is very similar to the fundamental frequency of natural speech, since larger context areas can easily be captured and evaluated by means of the neural network (macrostructure) and, at the same time, the finest structures of the fundamental frequency curve corresponding to natural speech can be generated by means of the fundamental frequency sequences stored in the database (microstructure). This enables speech output with a much more natural sound than with previously known methods.

    The invention has been explained in more detail above using an exemplary embodiment. However, the invention is not limited to this specific embodiment; a wide variety of modifications are possible within the scope of the invention. For example, the order in which the fundamental frequency sequences are read from the database and the neural network creates the default macro segments can be varied. It is also possible, for example, first to generate default macro segments for all phonetic linguistic units and only then to read out, combine, evaluate and select the individual fundamental frequency sequences. Within the scope of the invention, a wide variety of cost functions can also be used, as long as they take into account a deviation between a default macro segment of the fundamental frequency and microsegments of the fundamental frequency. For numerical reasons, the integral of the local cost function described above can also be represented as a sum.

    Claims (13)

    1. Method for determining the time characteristic of a fundamental frequency of a voice response to be synthesized, comprising the following steps:
      determining predefined macrosegments of the fundamental frequency of a phonetic linguistic unit of a text (S11) to be synthesized by means of a neural network, and
      determining microsegments (S12, S13) which correspond to the respective predefined macrosegment by means of fundamental-frequency sequences stored in a database, the fundamental-frequency sequences being selected from the database in such a manner that the respective predefined macrosegment is reproduced with the least possible deviation by the successive fundamental-frequency sequences.
    2. Method according to Claim 1, characterized in that the predefined macrosegments cover a time range which corresponds to a phonetic linguistic unit of the voice such as, e.g. a phrase, a word or a syllable.
    3. Method according to Claim 1 or 2, characterized in that the fundamental-frequency sequences of the microsegments represent the fundamental frequencies of in each case one phoneme.
    4. Method according to one of Claims 1 to 3, characterized in that the fundamental-frequency sequences of the microsegments which are located within a time range of one of the predefined macrosegments are assembled to form one reproduced macrosegment, the deviation of the reproduced macrosegment from the respective predefined macrosegment being determined and the fundamental-frequency sequences being optimized in such a manner that the deviation is as small as possible.
    5. Method according to Claim 4, characterized in that in each case a number of fundamental-frequency sequences can be selected for the individual microsegments, where the combinations of fundamental-frequency sequences resulting in the least deviation between the respective reproduced macrosegment and the respective predefined macrosegment are selected.
    6. Method according to Claim 4 or 5, characterized in that the deviation between the reproduced macrosegment and the predefined macrosegment is determined by means of a cost function which is weighted in such a manner that in the case of small deviations from the fundamental frequency of the predefined macrosegment, only a small deviation is determined and when predetermined limit frequency differences are exceeded, the deviations determined rise steeply until a saturation value is reached.
    7. Method according to one of Claims 4 to 6, characterized in that the deviation between the reproduced macrosegment and the predefined macrosegment is determined by means of a cost function by means of which a multiplicity of deviations arranged distributed over the macrosegments are weighted, and the closer the deviations are to the edge of the syllable, the less weighting is applied to them.
    8. Method according to one of Claims 4 to 7, characterized in that, during the selection of the fundamental-frequency sequences, the individual fundamental-frequency sequences are matched to the respectively following or preceding fundamental-frequency sequences in accordance with predetermined criteria, and only combinations of fundamental-frequency sequences meeting the criteria are permitted to be assembled to form a reproduced macrosegment.
    9. Method according to Claim 8, characterized in that adjacent fundamental-frequency sequences are assessed by means of a cost function which generates an output value, to be minimized, for the junction of adjacent fundamental-frequency sequences, this output value being the greater, the greater the difference between the frequency at the end of the preceding fundamental-frequency sequence and the frequency at the beginning of the subsequent fundamental-frequency sequence.
    10. Method according to Claim 9, characterized in that the closer the respective junction is to the edge of a syllable, the less weighting is applied to the output value.
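    Claims 8 to 10 can be pictured with a similar sketch: a junction cost that grows with the frequency jump between the end of one fundamental-frequency sequence and the beginning of the next, and that is weighted less the closer the junction lies to the edge of a syllable. The linear edge weighting and all names are assumptions made only for illustration.

        def junction_cost(prev_seq, next_seq, junction_pos, syllable_len):
            """Illustrative concatenation cost for two adjacent f0 sequences.

            prev_seq, next_seq -- f0 arrays of the preceding / following microsegment
            junction_pos       -- frame index of the junction within the syllable
            syllable_len       -- length of the syllable in frames
            """
            jump = abs(prev_seq[-1] - next_seq[0])   # frequency difference at the junction
            # Full weight in the middle of the syllable, falling towards zero at its
            # edges (an assumed linear weighting; the claims only require some de-emphasis).
            centre = syllable_len / 2.0
            weight = 1.0 - abs(junction_pos - centre) / centre
            return weight * jump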
    11. Method according to one of Claims 1 to 10, characterized in that the individual macrosegments are concatenated with one another and the fundamental frequencies are matched to one another at the junctions of the macrosegments.
    12. Method according to one of Claims 1 to 11, characterized in that the neural network determines the predefined macrosegments for a predetermined section of a text on the basis of this text section and of a text section preceding and/or following this text section.
    13. Method for synthesizing speech in which a text is converted into a sequence of acoustic signals, comprising the following steps:
      converting the text into a sequence of phonemes,
      generating a stressing structure,
      determining the duration of the individual phonemes,
      determining the time characteristic of a fundamental frequency according to the method according to one of Claims 1 to 12,
      generating the acoustic signals representing the speech on the basis of the sequence of phonemes determined and of the fundamental frequency determined.
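    The synthesis chain of Claim 13 can be summarised by the following skeleton, in which every helper function is a placeholder for a component the claim leaves open (grapheme-to-phoneme conversion, stress prediction, duration modelling, the fundamental-frequency procedure of Claims 1 to 12 and the acoustic back end).

        def synthesize(text):
            """Skeleton of the claimed synthesis chain; all helpers are placeholders."""
            phonemes   = text_to_phonemes(text)                # convert the text into phonemes
            stresses   = generate_stress_structure(phonemes)   # generate a stressing structure
            durations  = determine_phoneme_durations(phonemes, stresses)
            f0_contour = determine_f0_contour(phonemes, stresses, durations)  # Claims 1 to 12
            return generate_waveform(phonemes, durations, f0_contour)         # acoustic signals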
    EP00984858A 1999-10-28 2000-10-24 Method for detecting the time sequences of a fundamental frequency of an audio-response unit to be synthesised Expired - Lifetime EP1224531B1 (en)

    Applications Claiming Priority (3)

    Application Number Priority Date Filing Date Title
    DE19952051 1999-10-28
    DE19952051 1999-10-28
    PCT/DE2000/003753 WO2001031434A2 (en) 1999-10-28 2000-10-24 Method for detecting the time sequences of a fundamental frequency of an audio-response unit to be synthesised

    Publications (2)

    Publication Number Publication Date
    EP1224531A2 EP1224531A2 (en) 2002-07-24
    EP1224531B1 true EP1224531B1 (en) 2004-12-15

    Family

    ID=7927243

    Family Applications (1)

    Application Number Title Priority Date Filing Date
    EP00984858A Expired - Lifetime EP1224531B1 (en) 1999-10-28 2000-10-24 Method for detecting the time sequences of a fundamental frequency of an audio-response unit to be synthesised

    Country Status (5)

    Country Link
    US (1) US7219061B1 (en)
    EP (1) EP1224531B1 (en)
    JP (1) JP4005360B2 (en)
    DE (1) DE50008976D1 (en)
    WO (1) WO2001031434A2 (en)

    Families Citing this family (10)

    * Cited by examiner, † Cited by third party
    Publication number Priority date Publication date Assignee Title
    AT6920U1 (en) 2002-02-14 2004-05-25 Sail Labs Technology Ag METHOD FOR GENERATING NATURAL LANGUAGE IN COMPUTER DIALOG SYSTEMS
    DE10230884B4 (en) * 2002-07-09 2006-01-12 Siemens Ag Combination of prosody generation and building block selection in speech synthesis
    JP4264030B2 (en) * 2003-06-04 2009-05-13 株式会社ケンウッド Audio data selection device, audio data selection method, and program
    JP2005018036A (en) * 2003-06-05 2005-01-20 Kenwood Corp Device and method for speech synthesis and program
    WO2005119650A1 (en) * 2004-06-04 2005-12-15 Matsushita Electric Industrial Co., Ltd. Audio synthesis device
    US10453479B2 (en) * 2011-09-23 2019-10-22 Lessac Technologies, Inc. Methods for aligning expressive speech utterances with text and systems therefor
    US10109014B1 (en) 2013-03-15 2018-10-23 Allstate Insurance Company Pre-calculated insurance premiums with wildcarding
    CN105357613B (en) * 2015-11-03 2018-06-29 广东欧珀移动通信有限公司 The method of adjustment and device of audio output apparatus play parameter
    CN106653056B (en) * 2016-11-16 2020-04-24 中国科学院自动化研究所 Fundamental frequency extraction model and training method based on LSTM recurrent neural network
    CN108630190B (en) * 2018-05-18 2019-12-10 百度在线网络技术(北京)有限公司 Method and apparatus for generating speech synthesis model

    Family Cites Families (9)

    * Cited by examiner, † Cited by third party
    Publication number Priority date Publication date Assignee Title
    JPH08512150A (en) 1994-04-28 1996-12-17 モトローラ・インコーポレイテッド Method and apparatus for converting text into audible signals using neural networks
    US5787387A (en) * 1994-07-11 1998-07-28 Voxware, Inc. Harmonic adaptive speech coding method and system
    JPH10153998A (en) * 1996-09-24 1998-06-09 Nippon Telegr & Teleph Corp <Ntt> Auxiliary information utilizing type voice synthesizing method, recording medium recording procedure performing this method, and device performing this method
    BE1011892A3 (en) 1997-05-22 2000-02-01 Motorola Inc Method, device and system for generating voice synthesis parameters from information including express representation of intonation.
    US5913194A (en) * 1997-07-14 1999-06-15 Motorola, Inc. Method, device and system for using statistical information to reduce computation and memory requirements of a neural network based speech synthesis system
    US6064960A (en) * 1997-12-18 2000-05-16 Apple Computer, Inc. Method and apparatus for improved duration modeling of phonemes
    US6078885A (en) * 1998-05-08 2000-06-20 At&T Corp Verbal, fully automatic dictionary updates by end-users of speech synthesis and recognition systems
    JP2002530703A (en) * 1998-11-13 2002-09-17 ルノー・アンド・オスピー・スピーチ・プロダクツ・ナームローゼ・ベンノートシャープ Speech synthesis using concatenation of speech waveforms
    US7222075B2 (en) * 1999-08-31 2007-05-22 Accenture Llp Detecting emotions using voice signal analysis

    Also Published As

    Publication number Publication date
    JP4005360B2 (en) 2007-11-07
    EP1224531A2 (en) 2002-07-24
    WO2001031434A2 (en) 2001-05-03
    US7219061B1 (en) 2007-05-15
    WO2001031434A3 (en) 2002-02-14
    JP2003513311A (en) 2003-04-08
    DE50008976D1 (en) 2005-01-20

    Similar Documents

    Publication Publication Date Title
    DE2115258C3 (en) Method and arrangement for speech synthesis from representations of individually spoken words
    AT400646B (en) VOICE SEGMENT ENCODING AND TOTAL LAYER CONTROL METHOD FOR VOICE SYNTHESIS SYSTEMS AND SYNTHESIS DEVICE
    DE602005002706T2 (en) Method and system for the implementation of text-to-speech
    DE60118874T2 (en) Prosody pattern comparison for text-to-speech systems
    DE60112512T2 (en) Coding of expression in speech synthesis
    DE60004420T2 (en) Recognition of areas of overlapping elements for a concatenative speech synthesis system
    DE69627865T2 (en) VOICE SYNTHESIZER WITH A DATABASE FOR ACOUSTIC ELEMENTS
    EP1224531B1 (en) Method for detecting the time sequences of a fundamental frequency of an audio-response unit to be synthesised
    DE19942178C1 (en) Method of preparing database for automatic speech processing enables very simple generation of database contg. grapheme-phoneme association
    DE2920298A1 (en) BINARY INTERPOLATOR CIRCUIT FOR AN ELECTRONIC MUSICAL INSTRUMENT
    EP1282897B1 (en) Method for creating a speech database for a target vocabulary in order to train a speech recognition system
    DE19861167A1 (en) Method and device for concatenation of audio segments in accordance with co-articulation and devices for providing audio data concatenated in accordance with co-articulation
    EP1159733B1 (en) Method and array for determining a representative phoneme
    DE69816049T2 (en) DEVICE AND METHOD FOR GENERATING PROSODY IN VISUAL SYNTHESIS
    DE4138016A1 (en) DEVICE FOR GENERATING AN ANNOUNCEMENT INFORMATION
    DE60305944T2 (en) METHOD FOR SYNTHESIS OF A STATIONARY SOUND SIGNAL
    WO2000016310A1 (en) Device and method for digital voice processing
    DE60131521T2 (en) Method and device for controlling the operation of a device or a system, and system having such a device and computer program for carrying out the method
    EP1170723B1 (en) Method for the computation of phone duration statistics and method for the determination of the duration of single phones for speech synthesis
    DE10230884B4 (en) Combination of prosody generation and building block selection in speech synthesis
    WO2002050815A1 (en) Device and method for differentiated speech output
    DE69721539T2 (en) SYNTHESIS PROCEDURE FOR VOICELESS CONSONANTS
    DE19837661C2 (en) Method and device for co-articulating concatenation of audio segments
    EP0505709A2 (en) Method for vocabulary extension for speaker-independent speech recognition
    DE1922170B2 (en) Speech synthesis system

    Legal Events

    Date Code Title Description
    PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

    Free format text: ORIGINAL CODE: 0009012

    17P Request for examination filed

    Effective date: 20020404

    AK Designated contracting states

    Kind code of ref document: A2

    Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE

    AX Request for extension of the european patent

    Free format text: AL;LT;LV;MK;RO;SI

    RBV Designated contracting states (corrected)

    Designated state(s): DE FR GB IT

    GRAP Despatch of communication of intention to grant a patent

    Free format text: ORIGINAL CODE: EPIDOSNIGR1

    RIC1 Information provided on ipc code assigned before grant

    Ipc: 7G 10L 11/04 A

    Ipc: 7G 06F 3/16 B

    RIC1 Information provided on ipc code assigned before grant

    Ipc: 7G 06F 3/16 B

    Ipc: 7G 10L 13/08 B

    Ipc: 7G 10L 11/04 A

    GRAS Grant fee paid

    Free format text: ORIGINAL CODE: EPIDOSNIGR3

    GRAA (expected) grant

    Free format text: ORIGINAL CODE: 0009210

    AK Designated contracting states

    Kind code of ref document: B1

    Designated state(s): DE FR GB IT

    REG Reference to a national code

    Ref country code: GB

    Ref legal event code: FG4D

    Free format text: NOT ENGLISH

    REG Reference to a national code

    Ref country code: IE

    Ref legal event code: FG4D

    Free format text: GERMAN

    REF Corresponds to:

    Ref document number: 50008976

    Country of ref document: DE

    Date of ref document: 20050120

    Kind code of ref document: P

    GBT Gb: translation of ep patent filed (gb section 77(6)(a)/1977)

    Effective date: 20050211

    PLBE No opposition filed within time limit

    Free format text: ORIGINAL CODE: 0009261

    STAA Information on the status of an ep patent application or granted ep patent

    Free format text: STATUS: NO OPPOSITION FILED WITHIN TIME LIMIT

    ET Fr: translation filed
    26N No opposition filed

    Effective date: 20050916

    PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

    Ref country code: DE

    Payment date: 20131219

    Year of fee payment: 14

    PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

    Ref country code: FR

    Payment date: 20141017

    Year of fee payment: 15

    Ref country code: GB

    Payment date: 20141013

    Year of fee payment: 15

    PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

    Ref country code: IT

    Payment date: 20141029

    Year of fee payment: 15

    REG Reference to a national code

    Ref country code: DE

    Ref legal event code: R119

    Ref document number: 50008976

    Country of ref document: DE

    PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

    Ref country code: DE

    Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

    Effective date: 20150501

    GBPC Gb: european patent ceased through non-payment of renewal fee

    Effective date: 20151024

    PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

    Ref country code: IT

    Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

    Effective date: 20151024

    Ref country code: GB

    Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

    Effective date: 20151024

    REG Reference to a national code

    Ref country code: FR

    Ref legal event code: ST

    Effective date: 20160630

    PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

    Ref country code: FR

    Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

    Effective date: 20151102