EP1105867B1 - Method and device for the concatenation of audio segments, taking into account coarticulation - Google Patents

Method and device for the concatenation of audio segments, taking into account coarticulation

Info

Publication number
EP1105867B1
EP1105867B1 (application EP99942891A)
Authority
EP
European Patent Office
Prior art keywords
band
audio segment
concatenation
audio
phone
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Lifetime
Application number
EP99942891A
Other languages
German (de)
French (fr)
Other versions
EP1105867A1 (en)
Inventor
Christoph Buskies
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BUSKIES, CHRISTOPH
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from DE1998137661 (DE19837661C2)
Application filed by Individual
Publication of EP1105867A1
Application granted
Publication of EP1105867B1
Anticipated expiration
Legal status: Expired - Lifetime

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/06 - Elementary speech units used in speech synthesisers; Concatenation rules
    • G10L13/07 - Concatenation rules

Definitions

  • The invention relates to a method and a device for the concatenation of audio segments for the generation of synthesized acoustic data, in particular synthesized speech.
  • The invention further relates to synthesized speech signals which were generated by the coarticulation-compatible concatenation of speech segments according to the invention, as well as to a data carrier containing a computer program for the generation of synthesized acoustic data, in particular synthesized speech, according to the invention.
  • The invention also relates to a data memory which contains audio segments suitable for the coarticulation-compatible concatenation according to the invention, and to a sound carrier which contains acoustic data synthesized according to the invention.
  • Data-based speech synthesis is increasingly carried out by selecting appropriate segments from a database comprising individual speech segments and linking (concatenating) them with each other.
  • The speech quality depends primarily on the number and type of the available speech segments, because only speech that is represented by speech segments in the database can be synthesized.
  • Various methods are known which perform the linking (concatenation) of the speech segments according to complex rules.
  • An inventory, i.e. a database comprising the speech audio segments, is used which is complete and manageable.
  • An inventory is complete if any sound sequence of the language to be synthesized can be generated with it, and it is manageable if the number and type of the inventory data can be processed in the desired way with the technically available means.
  • Such a method must ensure that the concatenation of the individual inventory elements generates synthesized speech that differs as little as possible from naturally spoken language.
  • For this, synthesized speech must be fluent and exhibit the same articulatory effects as natural speech.
  • Coarticulatory effects, i.e. the mutual influence of speech sounds, are of particular importance here.
  • The inventory elements should therefore be constructed so that they take into account the coarticulation of individual consecutive speech sounds. Furthermore, a method for concatenating the inventory elements should chain the elements taking into account the coarticulation of individual consecutive speech sounds as well as the superordinate coarticulation of several consecutive speech sounds, also across word and sentence boundaries.
  • WO 95/30193 describes a method and an apparatus for converting text into audible speech signals using a neural network.
  • The text to be converted into speech is converted into a sequence of phonemes, and additional information about the syntactic boundaries of the text and the emphasis of the individual syntactic components of the text is generated.
  • These are forwarded together with the phonemes to a unit which determines the duration of the pronunciation of the individual phonemes on the basis of rules.
  • From each individual phoneme, in conjunction with the corresponding syntactic and temporal information, a processor generates a suitable input for the neural network, this input also comprising the corresponding prosodic information for the entire phoneme sequence.
  • The neural network selects from the available audio segments those that best reproduce the entered phonemes, and chains these audio segments accordingly.
  • In this chaining, the individual audio segments are adapted in their duration, overall amplitude and frequency to upstream and downstream audio segments, taking into account the prosodic information of the speech to be synthesized, and are linked together in temporal succession. A change of individual areas of the audio segments is not described here.
  • To generate the audio segments required for this method, the neural network must first be trained by subdividing naturally spoken speech into phones or phone sequences and assigning to these phones or phone sequences corresponding phonemes or phoneme sequences in the form of audio segments. Since this method provides only for a change of individual audio segments, but not for a change of individual areas of an audio segment, the neural network must be trained with as many different phones or phone sequences as possible in order to convert arbitrary texts into synthesized, natural-sounding speech. Depending on the application, this can be very complex. On the other hand, an insufficient training process of the neural network can negatively affect the quality of the speech to be synthesized. Furthermore, with the method described here it is not possible to determine the concatenation moment of the individual audio segments as a function of upstream or downstream audio segments in order to carry out a coarticulation-compatible concatenation.
  • US 5,524,172 describes a device for generating synthesized speech which uses the so-called diphone method.
  • A text that is to be converted into synthesized speech is subdivided into phoneme sequences, with corresponding prosodic information assigned to each phoneme sequence.
  • From a database containing audio segments in the form of diphones, two diphones reproducing the phoneme are selected for each phoneme of the sequence and are concatenated taking into account the corresponding prosodic information.
  • The two diphones are each weighted using a suitable filter, and the duration and pitch of both diphones are changed so that, when the diphones are chained, a synthesized phone sequence is generated whose duration and pitch correspond to the duration and pitch of the desired phoneme sequence.
  • In the concatenation, the individual diphones are added in such a way that a temporally rear area of a first diphone and a temporally front area of a second diphone overlap, the concatenation moment generally lying within the stationary areas of the individual diphones (see Figure 2a). Since a variation of the concatenation moment taking into account the coarticulation of successive audio segments (diphones) is not provided here, the quality (naturalness and intelligibility) of speech synthesized in this way can be adversely affected.
  • In EP 0 813 184 A1, the database additionally provides audio segments which differ slightly but are suitable for synthesizing the same phoneme. In this way, the natural variation of speech is to be replicated in order to achieve a higher quality of the synthesized speech.
  • Both the use of the smoothing filter described in EP 0 813 184 A1 and the selection from a number of different audio segments for realizing a phoneme require high computing power of the system components used when this method is implemented. In addition, the size of the database increases because of the larger number of audio segments provided. Furthermore, in this method too, a coarticulation-dependent choice of the concatenation moment of individual audio segments is not provided, which can reduce the quality of the synthesized speech.
  • DE 693 18 209 T2 deals with formant synthesis.
  • According to this document, two polyphonic sounds are connected to one another using an interpolation mechanism which is applied to a last phoneme of an upstream sound and to a first phoneme of a downstream sound, the two phonemes of the two sounds being identical and being superimposed into one phoneme in the connected sounds.
  • In the superposition, the curves describing the two phonemes are each weighted with a weighting function.
  • The weighting function is applied to each phoneme over an area that begins immediately after the start of the phoneme and ends immediately before the end of the phoneme.
  • The phoneme areas that form the transition between the sounds therefore correspond essentially to the respective entire phonemes.
  • The concatenation moment of two sounds is determined such that the last phoneme of the upstream sound and the first phoneme of the downstream sound overlap completely.
  • DE 689 15 353 T2 aims to achieve an improvement in sound quality by specifying how the transition between two adjacent sample values is to be designed. This is relevant in particular at low sampling rates.
  • The speech synthesis described in this document uses waveforms that reproduce the sounds to be concatenated.
  • For waveforms of upstream sounds, a corresponding last sample value and an assigned zero crossing point are determined, while for waveforms of downstream sounds a first upper sample value and an assigned zero crossing point are determined.
  • Depending on these determined sample values and the assigned zero crossing points, the sounds are connected to each other in at most four different ways.
  • The number of connection types is reduced to two if the waveforms satisfy the Nyquist theorem.
  • DE 689 15 353 T2 describes that the area of the waveforms used extends between the last sample value of the upstream waveform and the first sample value of the downstream waveform. A variation of the duration of the areas used depending on the waveforms to be concatenated, as is the case with the invention, is not described in DE 689 15 353 T2.
  • A temporal end area of the temporally upstream speech segment and a temporal beginning area of the temporally downstream speech segment are edited or adapted to each other so that, taking into account the text to be synthesized, transitions that sound as natural as possible are generated. Information on how the beginning and end time areas are to be determined cannot be found in this document.
  • It is an object of the invention to provide a method and a corresponding device which eliminate the problems of the prior art and enable the generation of synthesized acoustic data, in particular synthesized speech data, which a listener cannot distinguish from corresponding natural acoustic data, in particular naturally spoken language.
  • The acoustic data synthesized by the invention, in particular synthesized speech data, should have an authentic acoustic quality, in particular an authentic speech quality.
  • To achieve this, the invention provides a method according to claim 1, a device according to claim 16, synthesized speech signals according to claim 47, a data carrier according to claim 33, and a sound carrier according to claim 58.
  • The invention makes it possible to generate synthesized acoustic data that reproduce a sequence of sounds, in that, when concatenating audio segment areas, the moment of concatenation of two audio segment areas is determined depending on properties of the audio segment areas to be linked, in particular the coarticulation effects relating to the two audio segment areas.
  • According to the present invention, the concatenation moment is preferably chosen in the vicinity of the boundaries of the solo articulation area. In this way, a speech quality is achieved that cannot be reached with the prior art.
  • The invention provides for a different selection of the audio segment areas and different types of coarticulation-compatible concatenation. A higher degree of naturalness of the synthesized acoustic data is achieved if a temporally downstream audio segment area whose beginning reproduces a static sound is connected to a temporally upstream audio segment area by means of a crossfade, or if a downstream audio segment area whose beginning reproduces a dynamic sound is connected to a temporally upstream audio segment area by means of a hardfade.
  • The invention makes it possible to reduce the number of audio segment areas necessary for the data synthesis by using audio segment areas that always begin with the reproduction of a dynamic sound, so that all concatenations of these audio segment areas can be carried out using a hardfade. For this, downstream audio segment areas whose beginnings each reproduce a dynamic sound are also connected to temporally upstream audio segment areas. In this way, acoustic data of high quality can be synthesized according to the invention even with low computing power (e.g. in answering machines or car control systems).
  • The invention also provides for the simulation of acoustic phenomena that result from the mutual influence of individual segments of corresponding natural acoustic data.
  • For this, individual audio segments or individual areas of the audio segments are edited with the aid of suitable functions.
  • In particular, the frequency, the duration, the amplitude or the spectrum of the audio segments can be changed.
  • To solve this task, prosodic information and/or superordinate coarticulation effects are preferably taken into account.
  • The signal curve of the synthesized acoustic data can additionally be improved if the concatenation moment is placed at those locations of the two audio segment areas to be linked at which the areas used match with regard to one or more suitable properties.
  • These properties can include, among others: zero crossing, amplitude value, slope, derivative of any degree, spectrum, pitch, amplitude value in a frequency range, volume, speech style, speech emotion, or other properties covered by the sound classification scheme.
  • The invention makes it possible to carry out the selection of the audio segment areas for the generation of the synthesized acoustic data and their concatenation more efficiently by using heuristic knowledge concerning the selection, editing, variation and concatenation of the audio segment areas.
  • Preferably, audio segment areas are used that reproduce sounds/phones or parts of sound sequences/phone sequences.
  • The invention allows the synthesized acoustic data generated to be used by converting this data into acoustic signals and/or speech signals and/or by storing it on a data carrier.
  • The invention can be used to generate synthesized speech signals which differ from known synthesized speech signals in that, in their naturalness and intelligibility, they cannot be distinguished from real speech.
  • For this, audio segment areas that reproduce parts of the sound sequence/phone sequence of the speech to be synthesized are concatenated in a coarticulation-compatible manner, the areas of the audio segments to be used as well as the moment of the concatenation of these areas being determined according to the invention as defined in claim 28.
  • An additional improvement of the synthesized speech can be achieved if a temporally downstream audio segment area whose beginning reproduces a static sound or a static phone is connected to a temporally upstream audio segment area by means of a crossfade, or if a temporally downstream audio segment area whose beginning reproduces a dynamic sound or a dynamic phone is connected to a temporally upstream audio segment area by means of a hardfade.
  • A fast and efficient procedure is particularly desirable when generating synthesized speech.
  • Coarticulation-compatible concatenations according to the invention can then always be carried out using hardfades, whereby only audio segment areas are used whose beginnings always reproduce a dynamic sound or a dynamic phone.
  • Such audio segment areas can be generated beforehand with the invention by coarticulation-compatible concatenation of corresponding audio segment areas.
  • The invention provides speech signals which have a natural flow of speech, speech melody and speech rhythm, in that the audio segment areas are edited before and/or after the concatenation, in their entirety or in individual areas, using suitable functions.
  • This variation is particularly advantageous additionally in the areas in which the corresponding concatenation moments lie, in order to change, among other things, the frequency, duration, amplitude or spectrum.
  • An additionally improved signal curve can be achieved if the concatenation moments are placed at locations of the audio segment areas to be linked at which these match in one or more suitable properties.
  • In order to allow the reproduction of the speech signals by known methods or devices, e.g. a CD player, it is particularly preferred that the speech signals can be converted into acoustic signals or stored on a data carrier.
  • A data carrier is provided which contains a computer program that allows the method according to the invention to be carried out or the device according to the invention and its various embodiments to be controlled. Furthermore, the data carrier according to the invention also allows speech signals to be generated that have coarticulation-compatible concatenations.
  • A data memory can be provided which contains audio segments that are suitable for being concatenated into acoustic data synthesized according to the invention.
  • A data carrier preferably contains audio segments which are suitable for carrying out the method according to the invention, for use in the device according to the invention, or for the data carrier according to the invention. Alternatively, the data carrier can also comprise speech signals according to the invention.
  • The invention makes it possible to reproduce acoustic data synthesized according to the invention, in particular synthesized speech data, with conventional known devices, for example a tape recorder, a CD player or a PC audio card.
  • For this, a sound carrier is provided whose data were generated, at least partially, with the method according to the invention, with the device according to the invention, or using the data carrier or the data memory according to the invention.
  • The sound carrier can also contain data which are speech signals concatenated in a coarticulation-compatible manner according to the invention.
  • The sounds/phones to be synthesized are entered via an input unit 101 of the device 1 for generating synthesized speech data and are stored in a first storage unit 103 (see Figure 1a).
  • By means of a selection device 105, the audio segment areas are selected from an inventory containing audio segments (elements), which is stored in a database 107, or from an upstream synthesis device 108 (which is not part of the invention); the selected areas reproduce the sounds or phones, or parts of sounds or phones, that correspond to the individual entered sound signs or phonemes or parts thereof, and they are stored in a second storage unit 109 in an order corresponding to the order of the entered sound signs or phonemes.
  • The selection device 105 preferably selects those audio segments that reproduce the largest parts of sound sequences or polyphones matching a sequence of sound signs or phonemes from the entered sound-sign string or phoneme sequence, so that a minimum number of audio segments is needed for the synthesis of the entered phoneme sequence.
  • If the database 107 or the upstream synthesis device 108 provides an inventory with audio segments of different types, the selection device 105 preferably selects the longest audio segment areas that reproduce parts of the sound sequence/phoneme sequence, in order to synthesize the entered sequence of sounds or phonemes and/or a sound sequence/phone sequence from a minimal number of audio segment areas.
  • For the reproduction of chained sounds/phones it is advantageous to use audio segment areas that reproduce a temporally upstream static sound/phone and a temporally downstream dynamic sound/phone. Because of this embedding of the dynamic sounds/phones, audio segments arise that always begin with a static sound/phone. This simplifies and unifies the procedure for concatenations of such audio segments, since only crossfades are needed for this.
  • To synthesize the end of the entered sound sequence/phoneme sequence, an audio segment area that reproduces an end of a sound sequence/phoneme sequence is to be chosen from the inventory and concatenated with a preceding audio segment area (see Figure 3e and step 8 in Figure 4).
  • The individual audio segments are stored in coded form in the database 107, whereby the coded form of the audio segments, in addition to the waveform of the respective audio segment, can specify which parts of sound sequences/phone sequences the respective audio segment reproduces, which type of concatenation (e.g. hardfade, linear or exponential crossfade) is to be used with which temporally following audio segment area, and at which moment the concatenation with which temporally following audio segment area takes place.
  • Preferably, the coded form of the audio segments also contains information regarding prosody, superordinate coarticulations and transition functions that are used to achieve an additional improvement of the speech quality.
  • When choosing the audio segment areas, the characteristics of the temporally upstream audio segment areas, e.g. concatenation type and concatenation moment, are taken into account.
  • The chaining of two consecutive audio segment areas takes place with the aid of the concatenation device 111 as follows (a minimal code sketch of this selection and chaining loop is given after this list).
  • The waveform, the type of concatenation, the concatenation moment and, if applicable, additional information of the first audio segment area and of the second audio segment area are loaded from the database or from the synthesis device (Figure 3b and steps 10 and 11).
  • Preferably, when selecting the audio segment areas mentioned above, such audio segment areas are chosen that fit each other with regard to their type of concatenation and their concatenation moment. In this case, loading the information regarding the type of concatenation and the concatenation moment of the second audio segment area is no longer necessary.
  • The waveform of the first audio segment area is then edited in a temporally rear area, and the waveform of the second audio segment area in a temporally front area, each with suitable transition functions, e.g. multiplied by a suitable weighting function (see Figure 3b, steps 12 and 13).
  • The lengths of the temporally rear area of the first audio segment area and of the temporally front area of the second audio segment area result from the type of concatenation and the temporal position of the concatenation moment; these lengths can also be stored in the coded form of the audio segments in the database.
  • If the two audio segment areas are to be linked with a crossfade, they are added in an overlapping manner according to the respective concatenation moment (see Figures 3bI, 3cI, 3dI and 3eI, step 15).
  • A linear symmetric crossfade is preferably used, but any other type of crossfade or any kind of transition functions can also be used.
  • In the case of a hardfade, the two audio segment areas are connected in series without overlapping (see Figures 3bII, 3cII, 3dII and 3eII, step 15).
  • The two audio segment areas are then arranged temporally immediately one behind the other.
  • In order to be able to further process the synthesized speech data generated in this way, they are preferably stored in a third memory unit 115.
  • For the further chaining with subsequent audio segment areas, the audio segment areas chained so far are treated as the first audio segment area (step 16), and the chaining process described above is repeated until the entire sound sequence/phoneme sequence has been synthesized.
  • The prosodic and additional information that can be entered in addition to the sound-sign sequence is preferably taken into account when concatenating the audio segment areas.
  • For this, the frequency, duration, amplitude and/or spectral properties of the audio segment areas are changed before and/or after their concatenation so that the synthesized speech data have a natural word and/or sentence melody (steps 14, 17 or 18). It is preferable here to choose concatenation moments at points of the audio segment areas at which these match in one or more suitable properties.
  • In addition, processing of the two audio segment areas with the aid of suitable functions in the area of the concatenation moment can be provided, e.g. in order to adapt the frequencies, durations, amplitudes and spectral properties.
  • It is further preferred to reproduce superordinate acoustic phenomena of real speech, such as overarching coarticulation effects or speech style (e.g. whispering, emphasis, singing voice, falsetto, emotional expression), in the synthesis of the sound sequence/phone sequences.
  • For this, information concerning such superordinate phenomena is additionally stored in coded form with the corresponding audio segments, so that when selecting the audio segment areas only those are chosen whose superordinate coarticulation properties correspond to those of the temporally upstream and/or downstream audio segment areas.
  • The synthesized speech data generated in this way preferably have a form that makes it possible, using an output unit 117, to convert the speech data into acoustic speech signals and to store the speech data and/or speech signals on acoustic, optical, magnetic or electrical data carriers (step 19).
  • The inventory elements are created from really spoken speech.
  • The ability of the speaker to control the speech to be recorded, e.g. to control the pitch of the speech or to speak exactly at one pitch, is important here.
  • In this way, the quality of the speech to be synthesized can be significantly improved.
  • Although this invention has been described using the example of speech synthesis, the invention is not limited to the area of synthesized speech, but can be used for the synthesis of any acoustic data or any sound events. Therefore, this invention is also suitable for the production and/or provision of synthesized speech data and/or speech signals for any languages or dialects, as well as for the synthesis of music.
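
The selection and chaining steps listed above can be summarized in a short, illustrative sketch. The following Python code is a minimal sketch under assumptions, not the patented implementation: the inventory element fields (phonemes, wave, join_kind, moment), the join_areas callback and the zero-crossing refinement are hypothetical names introduced only to illustrate the greedy longest-match selection, the chaining loop in which the result so far acts as the first audio segment area, and the preference for concatenation moments at points where the areas match in a suitable property such as a zero crossing.

    # Illustrative sketch only (assumed names, not the patented implementation).
    def select_segments(phonemes, inventory):
        """Greedy longest-match selection over the entered phoneme sequence."""
        selected, i = [], 0
        while i < len(phonemes):
            candidates = [e for e in inventory
                          if phonemes[i:i + len(e.phonemes)] == list(e.phonemes)]
            if not candidates:
                raise ValueError("no inventory element for " + phonemes[i])
            best = max(candidates, key=lambda e: len(e.phonemes))  # longest area first
            selected.append(best)
            i += len(best.phonemes)
        return selected

    def nearest_zero_crossing(signal, index, window=64):
        """Shift a nominal concatenation moment to the nearest zero crossing."""
        lo, hi = max(1, index - window), min(len(signal) - 1, index + window)
        crossings = [k for k in range(lo, hi + 1) if signal[k - 1] * signal[k] <= 0]
        return min(crossings, key=lambda k: abs(k - index)) if crossings else index

    def synthesize(phonemes, inventory, join_areas):
        """Chain the selected areas; the result so far acts as the first area (step 16)."""
        segments = select_segments(phonemes, inventory)
        result = segments[0].wave
        for nxt in segments[1:]:
            # concatenation type and moment come from the coded form in the database
            result = join_areas(result, nxt.wave, kind=nxt.join_kind, moment=nxt.moment)
        return result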

Abstract

The invention provides a method, a device and a computer program stored on a data carrier for generating synthesized acoustic data by concatenating audio segments of sounds in order to reproduce a sequence of concatenated sounds/phones. The invention uses an inventory of sounds, and each sound has three bands (FIG. 1b): an initial co-articulation band, a solo articulation band and a final co-articulation band. The invention selects audio segments that end or begin with a co-articulation band and a solo articulation band of one sound. The moment of concatenation is defined by the co-articulation band and the solo articulation band of that sound.

Description

The invention relates to a method and a device for the concatenation of audio segments for the generation of synthesized acoustic data, in particular synthesized speech. The invention further relates to synthesized speech signals which were generated by the coarticulation-compatible concatenation of speech segments according to the invention, as well as to a data carrier containing a computer program for the generation of synthesized acoustic data, in particular synthesized speech, according to the invention.

In addition, the invention relates to a data memory which contains audio segments that are suitable for the coarticulation-compatible concatenation according to the invention, and to a sound carrier which contains acoustic data synthesized according to the invention.

It should be emphasized that both the prior art presented below and the present invention concern the entire field of the synthesis of acoustic data by concatenation of individual audio segments obtained in any way. However, in order to simplify the discussion of the prior art and the description of the present invention, the following statements refer specifically to speech data synthesized by concatenation of individual speech segments.

In recent years, the data-based approach has prevailed over the rule-based approach in the field of speech synthesis and can be found in various methods and systems for speech synthesis. Although the rule-based approach in principle enables better speech synthesis, its implementation requires that all knowledge necessary for speech production be formulated explicitly, i.e. the speech to be synthesized must be modeled formally. Since the known speech models involve simplifications of the speech to be synthesized, the quality of the speech generated in this way is not sufficient.

Therefore, data-based speech synthesis is increasingly carried out, in which appropriate segments are selected from a database comprising individual speech segments and linked (concatenated) with each other. The speech quality here depends primarily on the number and type of the available speech segments, because only speech that is represented by speech segments in the database can be synthesized. In order to minimize the number of speech segments to be provided and nevertheless generate synthesized speech of high quality, various methods are known which perform the linking (concatenation) of the speech segments according to complex rules.

Using such methods or corresponding devices, an inventory, i.e. a database comprising the speech audio segments, can be used that is complete and manageable. An inventory is complete if any sound sequence of the language to be synthesized can be generated with it, and it is manageable if the number and type of the inventory data can be processed in a desired way with the technically available means. In addition, such a method must ensure that the concatenation of the individual inventory elements generates synthesized speech that differs as little as possible from naturally spoken language. For this, synthesized speech must be fluent and exhibit the same articulatory effects as natural speech. The so-called coarticulatory effects, i.e. the mutual influence of speech sounds, are of particular importance here. The inventory elements should therefore be constructed so that they take into account the coarticulation of individual consecutive speech sounds. Furthermore, a method for concatenating the inventory elements should chain the elements taking into account the coarticulation of individual consecutive speech sounds as well as the superordinate coarticulation of several consecutive speech sounds, also across word and sentence boundaries.

Before presenting the prior art, some terms from the field of speech synthesis that are necessary for a better understanding are explained below:

  • A sound is a class of arbitrary sound events (noises, sounds, tones, etc.). The sound events are divided into sound classes according to a classification scheme. A sound event belongs to a sound if, with regard to the parameters used for the classification (e.g. spectrum, pitch, volume, chest or head voice, coarticulation, resonance spaces, emotion, etc.), the values of the sound event lie within the value ranges defined for that sound.
    The classification scheme for sounds depends on the type of application. For speech sounds (= phones), the IPA classification is usually used. However, the definition of the term sound used here is not limited to this; any other parameters can be used. If, for example, in addition to the IPA classification, the pitch or the emotional expression is also included as a parameter in the classification, two 'a' sounds with different pitch or with different emotional expression become different sounds in the sense of this definition. Sounds can also be the tones of a musical instrument, such as a violin, at different pitches and in different playing styles (up-bow and down-bow, detaché, spiccato, marcato, pizzicato, col legno, etc.). Sounds can also be a dog barking or the squeaking of a car door. Sounds can be reproduced by audio segments which contain corresponding acoustic data. In the description of the invention following these definitions, the term phone can always be replaced by the term sound in the sense of the previous definition, and the term phoneme by the term sound sign. (The reverse also applies, since phones are sounds classified according to the IPA classification.)
  • A static sound has areas that are similar to previous or subsequent areas of the static sound. The similarity does not have to be an exact correspondence like that between the periods of a sine tone, but is analogous to the similarity that exists between the areas of the static phones defined below.
  • A dynamic sound has no areas that resemble previous or subsequent areas of the dynamic sound, for example the sound event of an explosion or a dynamic phone.
  • A phone is a sound generated by the speech organs (a speech sound). Phones are divided into static and dynamic phones.
  • Static phones include vowels, diphthongs, nasals, laterals, vibrants and fricatives.
  • Dynamic phones include plosives, affricates, glottal stops and struck (flapped) sounds.
  • A phoneme is the formal description of a phone, the formal description generally being given by phonetic characters.
  • Coarticulation describes the phenomenon that a sound, and thus also a phone, is influenced by upstream and downstream sounds or phones; coarticulation occurs between immediately adjacent sounds/phones, but can also extend as superordinate coarticulation over a sequence of several sounds/phones (for example in lip rounding).
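
As a small, purely illustrative aid (the class names are assumptions, not wording from the patent), the static/dynamic distinction listed above can be expressed as a simple lookup:

    # Illustrative lookup for the phone classes listed above (assumed names).
    STATIC_PHONE_CLASSES = {"vowel", "diphthong", "nasal", "lateral", "vibrant", "fricative"}
    DYNAMIC_PHONE_CLASSES = {"plosive", "affricate", "glottal_stop", "struck"}

    def is_static(phone_class):
        """True for phone classes treated as static, False for dynamic ones."""
        return phone_class in STATIC_PHONE_CLASSES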

A sound or phone can therefore be divided into three areas (see also Figure 1b):

  • The initial coarticulation area covers the area from the beginning of the sound/phone to the end of the coarticulation caused by an upstream sound/phone.
  • The solo articulation area is the area of the sound/phone that is not influenced by an upstream or downstream sound or phone.
  • The end coarticulation area covers the area from the beginning of the coarticulation caused by a downstream sound/phone to the end of the sound/phone.
  • The coarticulation area comprises an end coarticulation area and the adjacent initial coarticulation area of the adjacent sound/phone.
  • A polyphone is a sequence of phones.
  • The elements of an inventory are audio segments stored in coded form which reproduce sounds, parts of sounds, sound sequences or parts of sound sequences, or phones, parts of phones, polyphones or parts of polyphones. For a better understanding of the possible structure of an audio segment/inventory element, reference is made to Figure 2a, which shows a conventional audio segment, and to Figures 2b-2l, in which audio segments according to the invention are shown. In addition, it should be mentioned that audio segments can also be formed from smaller or larger audio segments contained in the inventory or in a database. Furthermore, audio segments can also be present in a transformed form (e.g. a Fourier-transformed form) in the inventory or in a database. Audio segments for the present method can also originate from an upstream synthesis step (which is not part of the method). Audio segments contain at least part of an initial coarticulation area, of a solo articulation area and/or of an end coarticulation area. Instead of audio segments, areas of audio segments can also be used.
  • Concatenation is the joining of two audio segments.
  • The concatenation moment is the point in time at which two audio segments are joined together.
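
To make the three areas defined above concrete, the following minimal sketch (the field names and the helper are assumptions for illustration, not claim language) stores the band boundaries of a sound/phone as sample indices; the boundaries of the solo articulation area are then natural candidates for the concatenation moment, which according to the invention is preferably placed in their vicinity:

    # Hypothetical representation of the three bands of a sound/phone (assumed names).
    from dataclasses import dataclass

    @dataclass
    class PhoneBands:
        start: int       # beginning of the sound/phone
        solo_start: int  # end of the initial coarticulation area
        solo_end: int    # beginning of the end coarticulation area
        end: int         # end of the sound/phone

    def candidate_moments(bands):
        """Boundaries of the solo articulation area as candidate concatenation moments."""
        return bands.solo_start, bands.solo_end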

The concatenation can take place in different ways, e.g. with a crossfade or a hardfade (see also Figures 3a-3e):

  • In a crossfade, a temporally rear area of a first audio segment area and a temporally front area of a second audio segment area are processed with suitable transition functions, and these two areas are then added in an overlapping manner such that, at most, the temporally shorter of the two areas is completely overlapped by the temporally longer of the two areas.
  • In a hardfade, a temporally rear area of a first audio segment and a temporally front area of a second audio segment are processed with suitable transition functions, these two audio segments being joined together in such a way that the rear area of the first audio segment and the front area of the second audio segment do not overlap. The coarticulation area is noticeable above all in that a concatenation within it is associated with discontinuities (e.g. spectral jumps). In addition, it should be mentioned that, strictly speaking, a hardfade represents a borderline case of a crossfade in which the overlap of a temporally rear area of a first audio segment and a temporally front area of a second audio segment has a length of zero. This allows a crossfade to be replaced by a hardfade in certain, e.g. extremely time-critical, applications, although such an approach must be weighed carefully, since it leads to clear quality losses in the concatenation of audio segments that should actually be concatenated with a crossfade.
  • Prosody refers to the changes in speech frequency and speech rhythm that occur in spoken words or sentences. Consideration of such prosodic information is necessary in speech synthesis in order to generate a natural word or sentence melody.
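
The two concatenation types defined above can be illustrated with a short sketch. This is a minimal example under assumptions (linear transition functions, NumPy arrays as waveforms), not the implementation of the patent; it only shows the overlapping addition of the weighted areas for a crossfade and the hardfade as the borderline case with overlap length zero:

    # Illustrative crossfade/hardfade sketch (assumed linear transition functions).
    import numpy as np

    def crossfade(first, second, overlap):
        """Weight the rear area of `first` and the front area of `second`, then add them overlapping."""
        if overlap == 0:                            # borderline case: a hardfade
            return np.concatenate([first, second])
        fade_out = np.linspace(1.0, 0.0, overlap)   # transition function for the rear area
        fade_in = fade_out[::-1]                    # transition function for the front area
        middle = first[-overlap:] * fade_out + second[:overlap] * fade_in
        return np.concatenate([first[:-overlap], middle, second[overlap:]])

    def hardfade(first, second):
        """Join the two areas in series without overlap."""
        return crossfade(first, second, overlap=0)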

WO 95/30193 discloses a method and an apparatus for converting text into audible speech signals using a neural network. For this, the text to be converted into speech is converted by a conversion unit into a sequence of phonemes, and additional information about the syntactic boundaries of the text and the emphasis of the individual syntactic components of the text is generated. These are forwarded, together with the phonemes, to a unit which determines the duration of the pronunciation of the individual phonemes on the basis of rules. From each individual phoneme, in conjunction with the corresponding syntactic and temporal information, a processor generates a suitable input for the neural network, this input also comprising the corresponding prosodic information for the entire phoneme sequence. The neural network then selects, from the available audio segments, those that best reproduce the entered phonemes and chains these audio segments accordingly. In this chaining, the individual audio segments are adapted in their duration, overall amplitude and frequency to upstream and downstream audio segments, taking into account the prosodic information of the speech to be synthesized, and are linked together in temporal succession. A change of individual areas of the audio segments is not described here.

To generate the audio segments required for this method, the neural network must first be trained by subdividing naturally spoken speech into phones or phone sequences and assigning to these phones or phone sequences corresponding phonemes or phoneme sequences in the form of audio segments. Since this method provides only for a change of individual audio segments, but not for a change of individual areas of an audio segment, the neural network must be trained with as many different phones or phone sequences as possible in order to convert arbitrary texts into synthesized, natural-sounding speech. Depending on the application, this can be very complex. On the other hand, an insufficient training process of the neural network can negatively affect the quality of the speech to be synthesized. Furthermore, with the method described here it is not possible to determine the concatenation moment of the individual audio segments as a function of upstream or downstream audio segments in order to carry out a coarticulation-compatible concatenation.

US 5,524,172 describes a device for generating synthesized speech which uses the so-called diphone method. Here, a text that is to be converted into synthesized speech is subdivided into phoneme sequences, with corresponding prosodic information assigned to each phoneme sequence. From a database containing audio segments in the form of diphones, two diphones reproducing the phoneme are selected for each phoneme of the sequence and are concatenated taking into account the corresponding prosodic information. In the concatenation, the two diphones are each weighted with the aid of a suitable filter, and the duration and pitch of both diphones are changed such that, when the diphones are chained, a synthesized phone sequence is generated whose duration and pitch correspond to the duration and pitch of the desired phoneme sequence. In the concatenation, the individual diphones are added in such a way that a temporally rear area of a first diphone and a temporally front area of a second diphone overlap, the concatenation moment generally lying within the stationary areas of the individual diphones (see Figure 2a). Since a variation of the concatenation moment taking into account the coarticulation of successive audio segments (diphones) is not provided here, the quality (naturalness and intelligibility) of speech synthesized in this way can be adversely affected.

A further development of the method discussed above can be found in EP-0,813,184 A1. Here, too, a text to be converted into synthesized speech is divided into individual phonemes or phoneme sequences, and corresponding audio segments are selected from a database and concatenated. In order to improve the synthesized speech, this method implements two approaches that differ from the prior art discussed so far. Using a smoothing filter that takes into account the lower-frequency harmonic components of a preceding and a following audio segment, the transition from the preceding audio segment to the following audio segment is to be optimized by matching a temporally rear region of the preceding audio segment and a temporally front region of the following audio segment to each other in the frequency domain. Furthermore, the database provides audio segments that differ slightly from one another but are suitable for synthesizing the same phoneme. In this way, the natural variation of speech is to be reproduced in order to achieve a higher quality of the synthesized speech. Both the use of the smoothing filter and the selection from a set of different audio segments for realizing a phoneme require a high computing power of the system components used when this method is implemented. In addition, the size of the database grows because of the increased number of audio segments provided. Furthermore, this method, too, does not provide for a coarticulation-dependent choice of the concatenation moment of individual audio segments, which can reduce the quality of the synthesized speech.

DE 693 18 209 T2 deals with formant synthesis. According to this document, two polyphonic sounds are joined using an interpolation mechanism that is applied to a last phoneme of a preceding sound and to a first phoneme of a following sound, the two phonemes of the two sounds being identical and, in the joined sounds, being superimposed to form a single phoneme. During the superposition, the curves describing the two phonemes are each weighted with a weighting function. The weighting function is applied, for each phoneme, in a region that begins immediately after the start of the phoneme and ends immediately before the end of the phoneme. Thus, in the concatenation of sounds described there, the regions of the phonemes that form the transition between the sounds essentially correspond to the entire respective phonemes. This means that the parts of the phonemes used for concatenation always comprise all three regions, namely the respective initial coarticulation region, the solo articulation region and the final coarticulation region. DE 693 18 209 T2 therefore teaches a procedure for smoothing the transitions between two sounds.

Furthermore, according to this document, the moment of concatenation of two sounds is determined such that the last phoneme in the preceding sound and the first phoneme in the following sound overlap completely.

Fundamentally, it should be noted that DE 689 15 353 T2 seeks to achieve an improvement in sound quality by specifying a procedure for how the transition between two adjacent samples is to be shaped. This is particularly relevant at low sampling rates.

In the speech synthesis described in this document, waveforms are used that reproduce the sounds to be concatenated. For waveforms of preceding sounds, a corresponding last sample and an associated zero-crossing point are determined in each case, while for waveforms of following sounds a first upper sample and an associated zero-crossing point are determined in each case. Depending on these determined samples and the associated zero-crossing points, sounds are joined to one another in at most four different ways. The number of joining methods is reduced to two if the waveforms are generated using the Nyquist theorem. DE 689 15 353 T2 describes that the region of the waveforms used extends between the last sample of the preceding waveform and the first sample of the following waveform. A variation of the duration of the regions used as a function of the waveforms to be concatenated, as is the case with the invention, is not described in DE 689 15 353 T2.
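The zero-crossing-based joining summarized above can be pictured with a short sketch. This is not a reproduction of the exact procedure of DE 689 15 353 T2; it is a minimal, hypothetical illustration of one joining variant, namely cutting each waveform at a zero crossing near its boundary sample and abutting the pieces.

```python
import numpy as np

def last_zero_crossing(x: np.ndarray) -> int:
    """Index of the last sign change in x (fallback: last sample)."""
    crossings = np.where(np.diff(np.sign(x)) != 0)[0]
    return int(crossings[-1]) if crossings.size else len(x) - 1

def first_zero_crossing(x: np.ndarray) -> int:
    """Index just after the first sign change in x (fallback: first sample)."""
    crossings = np.where(np.diff(np.sign(x)) != 0)[0]
    return int(crossings[0]) + 1 if crossings.size else 0

def join_at_zero_crossings(prev_wave: np.ndarray, next_wave: np.ndarray) -> np.ndarray:
    """Abut the preceding waveform, cut after its last zero crossing,
    and the following waveform, cut from its first zero crossing."""
    cut_prev = last_zero_crossing(prev_wave) + 1
    cut_next = first_zero_crossing(next_wave)
    return np.concatenate([prev_wave[:cut_prev], next_wave[cut_next:]])
```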

The article "A TtS system for the Greek language based on concatenation of formant coded segments" by N. Yiourgalis and G. Kokkinakis in the journal Speech Communication 19 (1996), pages 21-38, describes a method of formant synthesis for the Greek language. Here, speech segments from a so-called TtS thesaurus are assigned to a text that is to be converted into synthesized speech. In addition, the text is analyzed with respect to its grammar and syntax in order to obtain information that is required to generate natural-sounding synthesized speech output. In particular, this information is used in the concatenation of successive speech segments in order to choose the type of concatenation that produces transitions between individual speech segments that sound as natural as possible when they are played back.

To concatenate successive speech segments, they are arranged one immediately after the other in time. A temporal end region of the temporally preceding speech segment and a temporal start region of the temporally following speech segment are processed and adapted to one another such that, taking the text to be synthesized into account, transitions are generated that sound as natural as possible. Information on how the temporal start and end regions are determined cannot be found in this document.

In summary, it can be said that although the prior art allows arbitrary phoneme sequences to be synthesized, the phoneme sequences synthesized in this way do not have an authentic speech quality. A synthesized phoneme sequence has an authentic speech quality if a listener cannot distinguish it from the same phoneme sequence spoken by a real speaker.

Methods are also known that use an inventory containing complete words and/or sentences in authentic speech quality as inventory elements. For speech synthesis, these elements are placed one after the other in a desired order, the possibilities for different speech sequences being limited to a large extent by the scope of such an inventory. The synthesis of arbitrary phoneme sequences is not possible with these methods.

It is therefore an object of the present invention to provide a method and a corresponding device that eliminate the problems of the prior art and enable the generation of synthesized acoustic data, in particular synthesized speech data, that a listener cannot distinguish from corresponding natural acoustic data, in particular naturally spoken speech. The acoustic data synthesized with the invention, in particular synthesized speech data, are to have an authentic acoustic quality, in particular an authentic speech quality.

To achieve this object, the invention provides a method according to claim 1, a device according to claim 16, synthesized speech signals according to claim 47, a data carrier according to claim 33, and a sound carrier according to claim 58. The invention thus makes it possible to generate synthesized acoustic data that reproduce a sequence of sounds by determining, during the concatenation of audio segment regions, the moment of concatenation of two audio segment regions as a function of properties of the audio segment regions to be joined, in particular of the coarticulation effects affecting the two audio segment regions. According to the present invention, the concatenation moment is preferably chosen in the vicinity of the boundaries of the solo articulation region. In this way, a speech quality is achieved that cannot be attained with the prior art, while the required computing power is no higher than in the prior art.

In order to reproduce, during the synthesis of acoustic data, the variations found in corresponding natural acoustic data, the invention provides for a differentiated selection of the audio segment regions as well as different types of co-articulation-compatible concatenation. A higher degree of naturalness of the synthesized acoustic data is achieved if a temporally following audio segment region whose beginning reproduces a static sound is joined to a temporally preceding audio segment region by means of a crossfade, and if a temporally following audio segment region whose beginning reproduces a dynamic sound is joined to a temporally preceding audio segment region by means of a hardfade. It is furthermore advantageous to generate the beginning of the synthesized acoustic data to be produced using an audio segment region that reproduces the beginning of a sound sequence, and the end of the synthesized acoustic data to be produced using an audio segment region that reproduces the end of a sound sequence.
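As a reading aid, the selection rule just described can be written down as a small decision function. This is a minimal sketch under the assumption that the static/dynamic property of the first sound of the following region is already known; the position labels for utterance boundaries are illustrative and do not reproduce claim wording.

```python
def plan_join(next_region_starts_static: bool) -> str:
    """Fade type for joining a temporally following audio segment region to a
    temporally preceding one: static beginning -> crossfade,
    dynamic beginning -> hardfade (rule described above)."""
    return "crossfade" if next_region_starts_static else "hardfade"

def pick_boundary_variant(position: str) -> str:
    """Which inventory variant to prefer at the utterance boundaries
    (illustrative labels only)."""
    if position == "utterance_start":
        return "region reproducing the beginning of a sound sequence"
    if position == "utterance_end":
        return "region reproducing the end of a sound sequence"
    return "region from the middle of a sound sequence"
```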

In order to generate the synthesized acoustic data more simply and more quickly, the invention makes it possible to reduce the number of audio segment regions needed for data synthesis by using audio segment regions that always begin with the reproduction of a dynamic sound, whereby all concatenations of these audio segment regions can be carried out by means of a hardfade. For this purpose, temporally following audio segment regions are joined to temporally preceding audio segment regions whose beginnings each reproduce a dynamic sound. In this way, acoustic data of high quality synthesized according to the invention can also be generated with low computing power (e.g. in answering machines or car navigation systems).

The invention also provides for reproducing acoustic phenomena that result from a mutual influence of individual segments of corresponding natural acoustic data. In particular, it is provided here to process individual audio segments or individual regions of the audio segments with the aid of suitable functions. In this way, among other things, the frequency, the duration, the amplitude or the spectrum of the audio segments can be modified. If synthesized speech data are generated with the invention, prosodic information and/or higher-level coarticulation effects are preferably taken into account to achieve this.

The signal curve of synthesized acoustic data can additionally be improved if the concatenation moment is placed at points of the individual audio segment regions to be joined at which the two regions used agree with respect to one or more suitable properties. These properties can include, among others: zero crossing, amplitude value, slope, derivative of any degree, spectrum, pitch, amplitude value in a frequency range, loudness, speech style, speech emotion, or other properties considered in the sound classification scheme.
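One way to read this "matching properties" criterion is as a search for the offset at which two candidate join regions differ least in a chosen set of features. The sketch below is an assumed simplification that uses only amplitude and slope; the further properties listed above (spectrum, pitch, loudness, etc.) could be added to the cost in the same way.

```python
import numpy as np

def best_join_offset(rear: np.ndarray, front: np.ndarray, search: int = 200) -> int:
    """Pick the sample index (within the last `search` samples of `rear`)
    at which amplitude and slope of the two regions agree best."""
    front_val = front[0]
    front_slope = front[1] - front[0]
    best, best_cost = len(rear) - 1, float("inf")
    for i in range(max(1, len(rear) - search), len(rear)):
        cost = abs(rear[i] - front_val) + abs((rear[i] - rear[i - 1]) - front_slope)
        if cost < best_cost:
            best, best_cost = i, cost
    return best
```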

In addition, the invention makes it possible to improve the selection of the audio segment regions for generating the synthesized acoustic data and to make their concatenation more efficient by using heuristic knowledge concerning the selection, processing, variation and concatenation of the audio segment regions.

In order to generate synthesized acoustic data that are speech data which do not differ from corresponding natural speech data, audio segment regions are preferably used that reproduce sounds/phones or parts of sound sequences/phone sequences.

In addition, the invention allows the generated synthesized acoustic data to be used by making these data convertible into acoustic signals and/or speech signals and/or storable on a data carrier.

Furthermore, the invention can be used to provide synthesized speech signals that differ from known synthesized speech signals in that their naturalness and intelligibility cannot be distinguished from real speech. For this purpose, audio segment regions, each reproducing parts of the sound sequence/phone sequence of the speech to be synthesized, are concatenated in a co-articulation-compatible manner by determining the regions of the audio segments to be used as well as the moment of concatenation of these regions according to the invention, as defined in claim 28.

An additional improvement of the synthesized speech can be achieved if a temporally following audio segment region whose beginning reproduces a static sound or a static phone is joined to a temporally preceding audio segment region by means of a crossfade, and if a temporally following audio segment region whose beginning reproduces a dynamic sound or a dynamic phone is joined to a temporally preceding audio segment region by means of a hardfade. Here, static phones comprise vowels, diphthongs, liquids, fricatives, vibrants and nasals, while dynamic phones comprise plosives, affricates, glottal stops and flapped (struck) sounds.
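The grouping of phone classes just given can be captured in a small lookup. The class names below are those from the paragraph above; the example phone symbols in the mapping are assumptions added purely for illustration.

```python
# Static vs. dynamic phone classes, as listed above.
STATIC_CLASSES = {"vowel", "diphthong", "liquid", "fricative", "vibrant", "nasal"}
DYNAMIC_CLASSES = {"plosive", "affricate", "glottal_stop", "flap"}

# Illustrative (assumed) mapping of a few phone symbols to their class.
PHONE_CLASS = {
    "a": "vowel", "aI": "diphthong", "l": "liquid", "s": "fricative",
    "r": "vibrant", "n": "nasal", "t": "plosive", "ts": "affricate",
    "?": "glottal_stop",
}

def is_static_phone(phone: str) -> bool:
    """True if the phone belongs to one of the static classes."""
    return PHONE_CLASS.get(phone) in STATIC_CLASSES
```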

Since the initial and final stresses of sounds in natural speech differ from those of comparable but embedded sounds, it is preferable to use corresponding audio segment regions whose beginnings each reproduce the beginning, or whose ends each reproduce the end, of the speech to be synthesized.

Particularly when generating synthesized speech, a fast and efficient procedure is desirable. For this purpose, it is preferable to always carry out the co-articulation-compatible concatenations according to the invention by means of hardfades, using only audio segment regions whose beginnings always reproduce a dynamic sound or a dynamic phone. Such audio segment regions can be generated beforehand with the invention by co-articulation-compatible concatenation of corresponding audio segment regions.

Furthermore, the invention provides speech signals that have a natural speech flow, speech melody and speech rhythm, in that the audio segment regions are processed before and/or after the concatenation, in their entirety or in individual regions, with the aid of suitable functions. It is particularly advantageous to carry out this variation additionally in the regions in which the corresponding moments of the concatenations lie, in order, among other things, to modify the frequency, duration, amplitude or spectrum.
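A minimal sketch of such processing functions follows, assuming simple linear amplitude scaling and resampling-based duration change around a join. These are stand-ins for the "suitable functions" the text leaves open; a real system might use more careful prosody manipulation (e.g. pitch-synchronous methods).

```python
import numpy as np

def apply_amplitude_envelope(wave: np.ndarray, start_gain: float, end_gain: float) -> np.ndarray:
    """Scale a segment region with a linear gain ramp (e.g. around a concatenation moment)."""
    gains = np.linspace(start_gain, end_gain, len(wave))
    return wave * gains

def change_duration(wave: np.ndarray, factor: float) -> np.ndarray:
    """Stretch or compress a segment region by simple linear interpolation."""
    new_len = max(2, int(round(len(wave) * factor)))
    positions = np.linspace(0.0, len(wave) - 1, new_len)
    return np.interp(positions, np.arange(len(wave)), wave)
```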

An additionally improved signal curve can be achieved if the concatenation moments lie at points of the audio segment regions to be joined at which these regions agree in one or more suitable properties.

In order to allow simple use and/or further processing of the speech signals according to the invention by known methods or devices, e.g. a CD player, it is particularly preferred that the speech signals can be converted into acoustic signals or stored on a data carrier.

In order to apply the invention also to known devices, e.g. a personal computer or a computer-controlled musical instrument, a data carrier is provided that contains a computer program enabling the method according to the invention to be carried out, or the device according to the invention and its various embodiments to be controlled. Furthermore, the data carrier according to the invention also allows the generation of speech signals that exhibit co-articulation-compatible concatenations.

In order to provide an inventory comprising audio segments with which synthesized acoustic data, in particular synthesized speech data, can be generated that do not differ from corresponding natural acoustic data, a data memory can be provided that contains audio segments suitable for being concatenated according to the invention into synthesized acoustic data. Such a data memory preferably contains audio segments that are suitable for carrying out the method according to the invention, for use with the device according to the invention or with the data carrier according to the invention. Alternatively, the data memory can also comprise speech signals according to the invention.

In addition, the invention makes it possible to provide synthesized acoustic data according to the invention, in particular synthesized speech data, that can be used with conventional known devices, for example a tape recorder, a CD player or a PC audio card. For this purpose, a sound carrier is provided that contains data which were generated, at least in part, with the method according to the invention or with the device according to the invention, or using the data carrier according to the invention or the data memory according to the invention. The sound carrier can also contain data which are speech signals concatenated in a co-articulation-compatible manner according to the invention.

Further properties, features, advantages or modifications of the invention are explained with reference to the following description, in which:

  • Figure 1a: Schematic representation of a device according to the invention for generating synthesized acoustic data.
  • Figure 1b: Structure of a sound/phone.
  • Figure 2a: Structure of a conventional audio segment according to the prior art, consisting of parts of two sounds, i.e. a diphone for speech. It is essential that the solo articulation regions are each only partially contained in the conventional diphone audio segment.
  • Figure 2b: Structure of an audio segment according to the invention that reproduces parts of a sound/phone with following coarticulation regions (for speech, in effect a 'shifted' diphone).
  • Figure 2c: Structure of an audio segment according to the invention that reproduces parts of a sound/phone with preceding coarticulation regions.
  • Figure 2d: Structure of an audio segment according to the invention that reproduces parts of a sound/phone with following coarticulation regions and contains additional regions.
  • Figure 2e: Structure of an audio segment according to the invention that reproduces parts of a sound/phone with preceding coarticulation regions and contains additional regions.
  • Figure 2f: Structure of an audio segment according to the invention that reproduces parts of several sounds/phones (for speech: a polyphone), each with following coarticulation regions. The sounds/phones 2 to (n-1) are each completely contained in the audio segment.
  • Figure 2g: Structure of an audio segment according to the invention that reproduces parts of several sounds/phones (for speech: a polyphone), each with preceding coarticulation regions. The sounds/phones 2 to (n-1) are each completely contained in the audio segment.
  • Figure 2h: Structure of an audio segment according to the invention that reproduces parts of several sounds/phones (for speech: a polyphone), each with following coarticulation regions, and contains additional regions. The sounds/phones 2 to (n-1) are each completely contained in the audio segment.
  • Figure 2i: Structure of an audio segment according to the invention that reproduces parts of several sounds/phones (for speech: a polyphone), each with preceding coarticulation regions, and contains additional regions. The sounds/phones 2 to (n-1) are each completely contained in the audio segment.
  • Figure 2j: Structure of an audio segment according to the invention that reproduces part of a sound/phone from the beginning of a sound sequence/phone sequence.
  • Figure 2k: Structure of an audio segment according to the invention that reproduces parts of sounds/phones from the beginning of a sound sequence/phone sequence.
  • Figure 2l: Structure of an audio segment according to the invention that reproduces a sound/phone from the end of a sound sequence/phone sequence.
  • Figure 3a: Concatenation according to the prior art using the example of two conventional audio segments. The segments begin and end with parts of the solo articulation regions (usually half of each).
  • Figure 3aI: Concatenation according to the prior art. The solo articulation region of the middle phone originates from two different audio segments.
  • Figure 3b: Concatenation according to the method of the invention using the example of two audio segments, each containing a sound/phone with following coarticulation regions. Both sounds/phones originate from the middle of a sequence of sound units.
  • Figure 3bI: Concatenation of these audio segments by means of a crossfade. The solo articulation region originates from a single audio segment. The transition between the audio segments takes place between two regions and is therefore less sensitive to differences (in spectrum, frequency, amplitude, etc.). The audio segments can also be processed with additional transition functions before the concatenation.
  • Figure 3bII: Concatenation of these audio segments by means of a hardfade.
  • Figure 3c: Concatenation according to the method of the invention using the example of two audio segments according to the invention, each containing a sound/phone with following coarticulation regions, the first audio segment originating from the beginning of a sound sequence.
  • Figure 3cI: Concatenation of these audio segments by means of a crossfade.
  • Figure 3cII: Concatenation of these audio segments by means of a hardfade.
  • Figure 3d: Concatenation according to the method of the invention using the example of two audio segments according to the invention, each containing a sound/phone with preceding coarticulation regions. Both audio segments originate from the middle of a sound sequence.
  • Figure 3dI: Concatenation of these audio segments by means of a crossfade. The solo articulation region originates from a single audio segment.
  • Figure 3dII: Concatenation of these audio segments by means of a hardfade.
  • Figure 3e: Concatenation according to the method of the invention using the example of two audio segments according to the invention, each containing a sound/phone with following coarticulation regions, the last audio segment originating from the end of a sound sequence.
  • Figure 3eI: Concatenation of these audio segments by means of a crossfade.
  • Figure 3eII: Concatenation of these audio segments by means of a hardfade.
  • Figure 4: Schematic representation of the steps of a method according to the invention for generating synthesized acoustic data.
The reference numerals used in the following refer to Figure 1a, and the numbers used in the following for the various method steps refer to Figure 4.

In order to use the invention, for example, to convert a text into synthesized speech, it is necessary, in a preceding step, to divide this text into a sequence of phonetic symbols or phonemes using known methods or devices. Preferably, prosodic information corresponding to the text is also to be generated. The sound sequence or phone sequence, as well as the prosodic and additional information, serve as input variables for the method according to the invention or the device according to the invention.

The sounds/phones to be synthesized are fed to an input unit 101 of the device 1 for generating synthesized speech data and are stored in a first memory unit 103 (see Figure 1a). With the aid of a selection device 105, the audio segment regions that reproduce sounds or phones, or parts of sounds or phones, corresponding to the individual input phonetic symbols or phonemes or parts thereof are selected from an inventory containing audio segments (elements), which is stored in a database 107, or from an upstream synthesis device 108 (which is not part of the invention), and are stored in a second memory unit 109 in an order that corresponds to the order of the input phonetic symbols or phonemes. If the inventory contains audio segments reproducing parts of sound sequences or of polyphones, the selection device 105 preferably selects those audio segments that reproduce the largest parts of sound sequences or polyphones corresponding to a sequence of phonetic symbols or phonemes from the input string of phonetic symbols or input phoneme sequence, so that a minimum number of audio segments is required for synthesizing the input phoneme sequence.
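This selection step can be pictured as a greedy longest-match over the input phoneme sequence. The sketch below is an assumed simplification of what the selection device 105 does: the description also allows the choice to be influenced by concatenation type, prosody and heuristic knowledge, which is omitted here, and the toy inventory keys are purely illustrative.

```python
def select_segments(phonemes: list[str], inventory: dict[tuple[str, ...], str]) -> list[str]:
    """Cover the input phoneme sequence with as few inventory entries as possible.

    `inventory` maps phoneme tuples (possibly parts of polyphones) to a segment id;
    the longest entry matching at the current position is taken greedily.
    """
    chosen, i = [], 0
    while i < len(phonemes):
        for length in range(len(phonemes) - i, 0, -1):
            key = tuple(phonemes[i:i + length])
            if key in inventory:
                chosen.append(inventory[key])
                i += length
                break
        else:
            raise KeyError(f"no inventory entry covers phoneme {phonemes[i]!r}")
    return chosen

# Example with a toy inventory.
toy_inventory = {("h", "a"): "seg_ha", ("l", "o"): "seg_lo", ("h",): "seg_h"}
print(select_segments(["h", "a", "l", "o"], toy_inventory))  # ['seg_ha', 'seg_lo']
```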

If the database 107 or the upstream synthesis device 108 provides an inventory with audio segments of different kinds, the selection device 105 preferably selects the longest audio segment regions that reproduce parts of the sound sequence/phone sequence, in order to synthesize the input sound sequence or phone sequence and/or a sequence of sounds/phones from a minimum number of audio segment regions. Here it is advantageous to use audio segment regions reproducing chained sounds/phones that consist of a temporally preceding static sound/phone and a temporally following dynamic sound/phone. In this way, audio segments are created that, owing to the embedding of the dynamic sounds/phones, always begin with a static sound/phone. This simplifies and standardizes the procedure for concatenating such audio segments, since only crossfades are needed for this.

In order to achieve a co-articulation-compatible concatenation of the audio segment regions to be chained, the concatenation moments of two successive audio segment regions are determined as follows with the aid of a concatenation device 111 (a sketch of the resulting fade operations follows the list below):

• If an audio segment region is to be used for synthesizing the beginning of the input sound sequence/phone sequence (step 1), an audio segment region that reproduces the beginning of a sound sequence/phone sequence is to be selected from the inventory and chained with a temporally following audio segment region (see Figure 3c and step 3 in Figure 4).
• When concatenating a second audio segment region to a temporally preceding first audio segment region, a distinction must be made as to whether the second audio segment region begins with the reproduction of a static sound/phone or of a dynamic sound/phone, in order to choose the moment of the concatenation accordingly (step 6).
• If the second audio segment region begins with a static sound/phone, the concatenation is carried out in the form of a crossfade, the moment of the concatenation being placed in the temporally rear region of the first audio segment region and in the temporally front region of the second audio segment region, so that these two regions overlap during the concatenation or at least immediately adjoin one another (see Figures 3bI, 3cI, 3dI and 3eI, concatenation by means of a crossfade).
• If the second audio segment region begins with a dynamic sound/phone, the concatenation is carried out in the form of a hardfade, the moment of the concatenation being placed temporally immediately after the temporally rear region of the first audio segment region and temporally immediately before the temporally front region of the second audio segment region (see Figures 3bII, 3cII, 3dII and 3eII, concatenation by means of a hardfade).
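A compact sketch of the two join operations named in the list above follows, under the assumption that the crossfade uses complementary linear weights over the overlapping regions; the actual weighting functions and overlap lengths are implementation choices the description leaves open.

```python
import numpy as np

def crossfade(first: np.ndarray, second: np.ndarray, overlap: int) -> np.ndarray:
    """Overlap the rear region of `first` with the front region of `second`
    using complementary linear weights (used when `second` starts with a static sound)."""
    w = np.linspace(1.0, 0.0, overlap)
    mixed = first[-overlap:] * w + second[:overlap] * (1.0 - w)
    return np.concatenate([first[:-overlap], mixed, second[overlap:]])

def hardfade(first: np.ndarray, second: np.ndarray) -> np.ndarray:
    """Abut the two regions without overlap (used when `second` starts with a dynamic sound)."""
    return np.concatenate([first, second])
```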

In this way, new audio segments can be generated from these originally available audio segment regions that begin with the reproduction of a static sound/phone. This is achieved by chaining audio segment regions that begin with the reproduction of a dynamic sound/phone, as temporally following regions, to audio segment regions that begin with the reproduction of a static sound/phone. Although this increases the number of audio segments and the size of the inventory, it can represent a computational advantage when generating synthesized speech data, since fewer individual concatenations are required for generating a sound sequence/phoneme sequence and concatenations then only have to be carried out in the form of a crossfade. The new chained audio segments generated in this way are preferably added to the database 107 or to another memory unit 113.
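A sketch of how such a new inventory entry could be built and stored follows. It is a minimal illustration only: the join is a plain abutment (hardfade), since the following region begins with a dynamic sound, and a dictionary stands in for the database 107 or memory unit 113.

```python
import numpy as np

def build_static_start_segment(static_start_wave: np.ndarray,
                               dynamic_start_wave: np.ndarray,
                               store: dict, key: tuple) -> np.ndarray:
    """Pre-chain a dynamic-start region after a static-start region.

    Because the following region begins with a dynamic sound, the join is a
    hardfade; the resulting new segment begins with a static sound and can
    later be joined to its neighbours using crossfades only.
    """
    new_wave = np.concatenate([static_start_wave, dynamic_start_wave])
    store[key] = new_wave  # e.g. database 107 or another memory unit 113
    return new_wave
```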

A further advantage of this chaining of the original audio segment regions into new, longer audio segments arises if, for example, a sequence of sounds/phones is repeated frequently in the input sound sequence/phone sequence. One of the new, correspondingly chained audio segments can then be reused, and it is not necessary to carry out a renewed concatenation of the originally available audio segment regions each time this sequence of sounds/phones occurs. When storing such chained audio segments, overarching coarticulation effects are preferably also captured, or specific coarticulation effects are assigned to the stored chained audio segment in the form of additional data.

If an audio segment region is to be used for synthesizing the end of the input sound sequence/phone sequence, an audio segment region that reproduces the end of a sound sequence/phone sequence is to be selected from the inventory and chained with a temporally preceding audio segment region (see Figure 3e and step 8 in Figure 4).

    The individual audio segments are stored in the database 107 in coded form. In addition to the waveform of the respective audio segment, this coded form can specify which parts of sound sequences/phone sequences the audio segment reproduces, which type of concatenation (e.g. hardfade, linear or exponential crossfade) is to be carried out with which temporally following audio segment band, and at which moment the concatenation with which temporally following audio segment band takes place. Preferably the coded form of the audio segments also contains information on prosody, higher-level co-articulations and transition functions, which is used to achieve a further improvement of the speech quality.
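A minimal sketch of such a coded segment record is given below; the field names and types are assumptions chosen for illustration, not the patent's actual data format.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class CodedAudioSegment:
    waveform: np.ndarray   # samples of the audio segment
    phones: tuple          # parts of the sound/phone sequence it reproduces
    concat_type: str       # e.g. "hardfade", "linear crossfade", "exponential crossfade"
    concat_moment: int     # sample index at which the next band is joined
    prosody: dict = field(default_factory=dict)         # optional prosodic information
    coarticulation: dict = field(default_factory=dict)  # higher-level co-articulation data
    transition: str = "linear"                          # transition function toward the next band
```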

    When selecting the audio segment bands for synthesising the input sound sequence/phone sequence, those bands are chosen as temporally downstream audio segment bands which correspond to the properties of the respective temporally preceding audio segment bands, among them the concatenation type and the concatenation moment. Once the audio segment bands that each reproduce parts of the sound sequence/phone sequence have been selected from the database 107 or from the upstream synthesis device 108, two consecutive audio segment bands are concatenated by means of the concatenation device 111 as follows. The waveform, the concatenation type, the concatenation moment and any additional information of the first audio segment band and of the second audio segment band are loaded from the database or from the synthesis device (Figure 3b and steps 10 and 11). Preferably, in the above-mentioned selection of the audio segment bands, those bands are chosen which match each other with respect to their concatenation type and their concatenation moment. In this case it is no longer necessary to load the information on the concatenation type and the concatenation moment of the second audio segment band.
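This selection rule could look roughly like the following sketch, in which candidate bands are represented as plain dictionaries; the keys and the fallback behaviour are assumptions for illustration only.

```python
def pick_follow_up(candidates, prev):
    """Prefer a candidate whose stored concatenation type and moment match the
    preceding band, so its own join information need not be loaded again."""
    for cand in candidates:
        if (cand.get("concat_type") == prev.get("concat_type")
                and cand.get("concat_moment") == prev.get("concat_moment")):
            return cand
    # Otherwise fall back to the first candidate and load its join data as well.
    return candidates[0]
```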

    To concatenate the two audio segment bands, the waveform of the first audio segment band is processed in a temporally rear region and the waveform of the second audio segment band in a temporally front region, each with suitable transition functions, e.g. multiplied by a suitable weighting function (see Figure 3b, steps 12 and 13). The lengths of the rear region of the first audio segment band and of the front region of the second audio segment band follow from the concatenation type and the temporal position of the concatenation moment; these lengths can also be stored in the database as part of the coded form of the audio segments.
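Steps 12 and 13 can be pictured as in the following sketch, which multiplies the rear region of the first band and the front region of the second band by complementary linear weighting functions; the choice of linear ramps and the single `overlap` length are simplifying assumptions.

```python
import numpy as np

def weight_boundary_regions(first, second, overlap):
    """Multiply the rear region of `first` and the front region of `second`
    by complementary linear weighting (transition) functions."""
    ramp = np.linspace(0.0, 1.0, overlap)
    first = first.copy()
    second = second.copy()
    first[-overlap:] *= (1.0 - ramp)   # fade out the rear region (step 12)
    second[:overlap] *= ramp           # fade in the front region (step 13)
    return first, second
```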

    If the two audio segment bands are to be linked with a crossfade, they are added in an overlapping manner according to the respective concatenation moment (see Figures 3bI, 3cI, 3dI and 3eI, step 15). Preferably a linear, symmetric crossfade is used, but any other kind of crossfade or any kind of transition function may be employed instead. If the concatenation is to be performed as a hardfade, the two audio segment bands are joined one after the other without overlap (see Figures 3bII, 3cII, 3dII and 3eII, step 15). As can be seen in Figure 3bII, the two audio segment bands are in this case arranged immediately one after the other in time. So that the synthesised speech data generated in this way can be processed further, it is preferably stored in a third storage unit 115.
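Assuming the boundary regions have already been weighted as sketched above, step 15 then reduces to either an overlapping addition (crossfade) or a direct abutment (hardfade), roughly as follows; the function and parameter names are illustrative assumptions.

```python
import numpy as np

def join_bands(first, second, concat_type, overlap):
    """Join two waveforms whose boundary regions were already weighted."""
    if concat_type == "crossfade":
        # Overlapping addition around the concatenation moment.
        mixed = first[-overlap:] + second[:overlap]
        return np.concatenate([first[:-overlap], mixed, second[overlap:]])
    # Hardfade: arrange the two bands immediately one after the other in time.
    return np.concatenate([first, second])
```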

    For further concatenation with subsequent audio segment bands, the audio segment bands concatenated so far are treated as the first audio segment band (step 16), and the concatenation process described above is repeated until the entire sound sequence/phone sequence has been synthesised.
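The outer loop of step 16 can be sketched as follows, here under the simplifying assumption that every join is a linear symmetric crossfade with a fixed overlap length.

```python
import numpy as np

def synthesize(bands, overlap):
    """Chain a list of waveforms; the running result acts as the 'first' band
    for each subsequent join until the whole phone sequence is covered."""
    ramp = np.linspace(0.0, 1.0, overlap)
    result = bands[0]
    for nxt in bands[1:]:
        mixed = result[-overlap:] * (1.0 - ramp) + nxt[:overlap] * ramp
        result = np.concatenate([result[:-overlap], mixed, nxt[overlap:]])
    return result
```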

    To improve the quality of the synthesised speech data, the prosodic and additional information entered in addition to the sound sequence/phone sequence is preferably also taken into account when concatenating the audio segment bands. With the aid of known methods, the frequency, duration, amplitude and/or spectral properties of the audio segment bands can be modified before and/or after their concatenation so that the synthesised speech data exhibits a natural word and/or sentence melody (steps 14, 17 or 18). It is preferable here to place the concatenation moments at points of the audio segment bands at which they agree in one or more suitable properties.
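As a crude stand-in for the "known methods" mentioned above, the sketch below scales the amplitude of a band and stretches its duration by linear interpolation; production systems would use dedicated techniques such as PSOLA, so this function is only an illustration under that assumption.

```python
import numpy as np

def adjust_prosody(band, gain, duration_factor):
    """Scale amplitude by `gain` and change duration by `duration_factor`."""
    n_out = max(1, int(round(len(band) * duration_factor)))
    positions = np.linspace(0, len(band) - 1, n_out)
    stretched = np.interp(positions, np.arange(len(band)), band)
    return gain * stretched
```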

    To optimise the transitions between two consecutive audio segment bands, it is additionally provided that the two audio segment bands are processed with suitable functions in the region of the concatenation moment in order to adjust, among other things, their frequencies, durations, amplitudes and spectral properties. Furthermore, the invention also allows higher-level acoustic phenomena of real speech, such as higher-level co-articulation effects or speaking style (e.g. whispering, emphasis, singing voice, falsetto, emotional expression), to be taken into account when synthesising the sound sequence/phone sequences. For this purpose, information concerning such higher-level phenomena is additionally stored in coded form with the corresponding audio segments, so that when selecting the audio segment bands only those are chosen which correspond to the higher-level co-articulation properties of the temporally preceding and/or following audio segment bands.

    The synthesised speech data generated in this way preferably has a form which, using an output unit 117, allows the speech data to be converted into acoustic speech signals and the speech data and/or speech signals to be stored on an acoustic, optical, magnetic or electrical data carrier (step 19).

    In general, inventory elements are created by recording real spoken speech. Depending on the degree of training of the speaker building the inventory, i.e. on his ability to control the speech to be recorded (e.g. to control the pitch of the speech or to speak at exactly one pitch), it is possible to produce identical or similar inventory elements which have shifted boundaries between the solo articulation bands and the co-articulation bands. This yields considerably more possibilities for placing the concatenation points at different positions. As a result, the quality of the speech to be synthesised can be improved markedly.

    With this invention it is possible for the first time to generate synthesised speech signals by a co-articulation-appropriate concatenation of individual audio segment bands, since the moment of concatenation is chosen as a function of the audio segment bands to be concatenated in each case. In this way a synthesised speech can be produced that can no longer be distinguished from natural speech. In contrast to known methods or devices, the audio segments used here are not created by recording whole words in order to guarantee an authentic speech quality. It is therefore possible with this invention to produce synthesised speech of arbitrary content with the quality of real spoken speech.

    Although this invention has been described using speech synthesis as an example, it is not limited to the field of synthesised speech but can be used to synthesise arbitrary acoustic data or arbitrary sound events. This invention can therefore also be used for generating and/or providing synthesised speech data and/or speech signals for any language or dialect, as well as for the synthesis of music.

    Claims (59)

    1. A method for the co-articulation-specific concatenation of audio segments, in order to generate synthesised acoustical data which reproduces a sequence of concatenated sounds, comprising the following steps:
      selecting at least two audio segments which contain bands, each of which reproducing a portion of a sound or a portion of a sound sequence,
      characterised by the steps:
      establishing a band to be used of an earlier audio segment;
      establishing a band to be used of a later audio segment, which begins with the later audio segment and ends with the co-articulation band of the later audio segment which follows the initially used solo articulation band;
      with the duration and position of the bands to be used being determined as a function of the earlier and later audio segments; and
      concatenating the established band of the earlier audio segment with the established band of the later audio segment, in that the instance of concatenation, as a function of properties of the used band of the later audio segment, is set in a band which begins immediately before the used band of the later audio segment and ends with same.
    2. The method according to Claim 1, characterised in that
      the instance of concatenation is set in a band which lies in the vicinity of the boundaries of the initially to be used solo articulation band of the later audio segment, if the band of same to be used reproduces a static sound at the beginning; and
      a downstream portion of the band to be used of the earlier audio segment and an upstream portion of the band to be used of the later audio segment are processed by means of suitable transfer functions and added in an overlapping manner (cross fade), with the transfer functions and the length of an overlapping portion of the two bands being determined depending on the audio segments to be concatenated.
    3. The method according to Claim 1, characterised in that
      the instance of concatenation is set in a band which lies immediately before the band to be used of the later audio segment, if the used band of same reproduces a dynamic sound at the beginning; and
      a downstream portion of the band to be used of the earlier audio segment and an upstream portion of the band to be used of the later audio segment are processed by means of suitable transfer functions and joined in a non-overlapping manner (hard fade), with the transfer functions being determined depending on the acoustical data to be synthesised.
    4. The method according to one of the Claims 1 to 3 characterised in that
      for a sound or a portion of the sequence of concatenated sounds at the start of the concatenated sound sequence a band of an audio segment is selected so that the start of the band reproduces the properties of the start of the concatenated sound sequence.
    5. The method according to one of the Claims 1 to 4 characterised in that
      for a sound or a portion of the sequence of concatenated sounds at the end of the concatenated sound sequence a band of an audio segment is selected so that the end of the band reproduces the properties of the end of the concatenated sound sequence.
    6. The method according to one of the Claims 1 to 5 characterised in that
      the voice data to be synthesised is combined in groups, each of which being described by an individual audio segment.
    7. The method according to one of the Claims 1 to 6 characterised in that
      an audio segment band is selected for the later audio segment band, which reproduces the highest number of successive portions of the sounds of the sound sequence, in order to use the smallest number of audio segment bands in the generation of the synthesised acoustical data.
    8. The method according to one of the Claims 1 to 7 characterised in that
      a processing of the used bands of individual audio segments is carried out by means of suitable functions depending on properties of the concatenated sound sequence, with these properties involving a modification of the frequency, the duration, the amplitude, or the spectrum.
    9. The method according to one of the Claims 1 to 8 characterised in that
      a processing of the used bands of individual audio segments is carried out by means of suitable functions in a band, in which the instance of concatenation lies, with these functions involving a modification of the frequency, the duration, the amplitude, or the spectrum.
    10. The method according to one of the Claims 1 to 9 characterised in that the instance of concatenation is set in places of the bands to be used of the earlier and/or later audio segment, in which the two used bands are in agreement with respect to one or several suitable properties, with these properties including: zero point, amplitude values, gradients, derivatives of any degree, spectra, tone levels, amplitude values within a frequency band, volume, style of speech, emotion of speech, or other properties covered in the phone classification scheme.
    11. The method according to one of the Claims 1 to 10 characterised in that
      the selection of the used bands of individual audio segments, their processing, their variation, as well as their concatenation are additionally carried out with the application of heuristic knowledge which is obtained by an additionally carried out heuristic method.
    12. The method according to one of the Claims 1 to 11 characterised in that
      the acoustical data to be synthesised is voice data, and the sounds are phones.
    13. The method according to one of the Claims 2 to 12 characterised in that
      the static phones include vowels, diphthongs, liquids, vibrants, fricatives and nasals.
    14. The method according to one of the Claims 3 to 13 characterised in that
      the dynamic phones include plosives, affricates, glottal stops, and click sounds.
    15. The method according to one of the Claims 1 to 14 characterised in that
      a conversion of the synthesised acoustical data to acoustical signals and/or voice signals is carried out.
    16. A device for the co-articulation-specific concatenation of audio segments, in order to generate synthesised acoustical data which reproduces a sequence of phones, comprising:
      a database (107) in which audio segments are stored, each of which reproducing a portion of a phone or portions of a sequence of (concatenated) phones;
      and/or any upstream synthesis means (108) which supplies audio segments;
      a means (105) for the selection of at least two audio segments from the database (107) and/or the upstream synthesis means (108); and
      a means (111) for the concatenation of audio segments, characterised in that the concatenation means (111) is suited for
      defining a band to be used of an earlier audio segment;
      defining a band to be used of a later audio segment in a band which starts with the later audio segment and ends after a co-articulation band of the later audio segment, which follows after the initially used solo articulation band,
      determining the duration and position of the used bands depending on the earlier and later audio segments; and
      concatenating the used band of the earlier audio segment with the used band of the later audio segment by defining the instance of concatenation as a function of properties of the used band of the later audio segment in a band which starts immediately before the used band of the later audio segment and ends with same.
    17. The device according to Claim 16, characterised in that
      the concatenation means (111) comprises:
      means for the concatenation of the used band of the earlier audio segment with the used band of the later audio segment, whose used band reproduces a static phone at the beginning in the vicinity of the boundaries of the initially occurring solo articulation band of the used band of the later audio segment;
      means for processing a downstream portion of the used band of the earlier audio segment and an upstream portion of the used band of the later audio segment by suitable transfer functions, and
      means for the overlapping addition of the two bands in an overlapping portion (cross fade), which depends on the audio segments to be concatenated, with the transfer functions and the length of an overlapping portion of the two bands being determined depending on the acoustical data to be synthesised.
    18. The device according to Claim 16 or 17 characterised in that
      the concatenation means (111) comprises:
      means for the concatenation of the used band of the earlier audio segment with the used band of the later audio segment, whose used band reproduces a dynamic phone at the beginning, immediately before the used band of the later audio segment;
      means for processing a downstream portion of the used band of the earlier audio segment and an upstream portion of the used band of the later audio segment by suitable transfer functions, with the transfer functions being determined depending on the acoustical data to be synthesised; and
      means for the non-overlapping joining of the two audio segments.
    19. The device according to one of the Claims 16 to 18 characterised in that
      the database (107) includes audio segments or the upstream synthesis means (108) supplies audio segments which comprise bands which at the start reproduce a phone or a portion of the concatenated phone sequence at the start of the concatenated phone sequence.
    20. The device according to one of the Claims 16 to 19 characterised in that
      the database (107) includes audio segments or the upstream synthesis means (108) supplies audio segments which comprise bands, whose ends reproduce a phone or a portion of the concatenated phone sequence at the end of the concatenated phone sequence.
    21. The device according to one of the Claims 16 to 19 characterised in that
      the database (107) includes a group of audio segments or the upstream synthesis means (108) supplies audio segments which comprise bands, whose starts each reproduce only a static phone.
    22. The device according to one of the Claims 16 to 21 characterised in that the concatenation means (111) comprises:
      means for the generation of further audio segments by concatenation of audio segments, with the starts of the bands each reproducing a static phone, each with a band of a later audio segment whose used band reproduces a dynamic phone at the start, and
      a means which supplies the further audio segments to the database (107) or the selection means (105).
    23. The device according to one of the Claims 16 to 22 characterised in that, in the selection of the audio segment bands from the database (107) or the upstream synthesis means (108), the selection means (105) is suited to select the audio segments which reproduce the greatest number of successive portions of concatenated phones of the concatenated phone sequence.
    24. The device according to one of the Claims 16 to 23 characterised in that
      the concatenation means (111) comprises means for processing the used bands of individual audio segments with the aid of suitable functions, depending on properties of the concatenated phone sequence, with the functions involving a modification of the frequency, the duration, the amplitude, or the spectrum.
    25. The device according to one of the Claims 16 to 24 characterised in that
      the concatenation means (111) comprises means for processing the used bands of individual audio segments with the aid of suitable functions in a band including the instance of concatenation, with this function involving a modification of the frequency, the duration, the amplitude, or the spectrum.
    26. The device according to one of the Claims 16 to 25 characterised in that
      the concatenation means (111) comprises means for the selection of the instance of concatenation in a place in the used bands of the earlier and/or the later audio segment, in which the two used bands are in agreement with respect to one or several suitable properties, with these properties including: zero points, amplitude values, gradients, derivatives of any degree, spectra, tone levels, amplitude values in a frequency band, volume, style of speech, emotion of speech, or other properties covered in the phone classification scheme.
    27. The device according to one of the Claims 16 to 26 characterised in that
      the selection means (105) comprises means for the implementation of heuristic knowledge which relates to the selection of the used bands of the individual audio segments, their processing, their variation, as well as their concatenation.
    28. The device according to one of the Claims 16 to 27 characterised in that
      the database (107) includes audio segments or the upstream synthesis means (108) supplies audio segments which include bands, each of which reproducing at least a portion of a sound or phone, respectively, a sound or phone, respectively, portions of phone sequences or polyphones, respectively, or sound sequences or polyphones, respectively.
    29. The device according to one of the Claims 17 to 28 characterised in that
      the database (107) includes audio segments or the upstream synthesis means (108) supplies audio segments, with a static sound corresponding to a static phone and comprising vowels, diphthongs, liquids, vibrants, fricatives, and nasals.
    30. The device according to one of the Claims 18 to 29 characterised in that
      the database (107) includes audio segments or the upstream synthesis means (108) supplies audio segments, with a dynamic sound corresponding to a dynamic phone and comprising plosives, affricates, glottal stops, and click sounds.
    31. The device according to one of the Claims 16 to 30 characterised in that
      the concatenation means (111) is suitable to generate synthesised voice data by means of the concatenation of audio segments.
    32. The device according to one of the Claims 16 to 31 characterised in that
      means (117) are provided for the conversion of the synthesised acoustical data to acoustical signals and/or voice signals.
    33. A data carrier which includes a computer program for the co-articulation-specific concatenation of audio segments in order to generate synthesised acoustical data which reproduces a sequence of concatenated phones, comprising the following steps:
      selection of at least two audio segments which contain bands, each of which reproducing a portion of a sound or a portion of a sound sequence,
      characterised by the steps of:
      establishing a band to be used of an earlier audio segment;
      establishing a band to be used of a later audio segment, which begins with the later audio segment and ends with the co-articulation band of the later audio segment which follows the initially used solo articulation band,
      with the duration and position of the bands to be used being determined as a function of the earlier and later audio segments; and
      concatenating the established band of the earlier audio segment with the established band of the later audio segment, in that the instance of concatenation, as a function of properties of the used band of the later audio segment, is set in its established band which starts immediately before the band to be used of the later audio segment and ends with same.
    34. The data carrier according to Claim 33, characterised in that
      the computer program selects the instance of the concatenation of the used band of the second audio segment with the used band of the first audio segment in such a manner that
      the instance of concatenation is set in a band which lies in the vicinity of the boundaries of the initially used solo articulation band of the later audio segment, if its used band reproduces a static phone at the start;
      a downstream portion of the used band of the earlier audio segment and an upstream portion of the used band of the later audio segment are processed by suitable transfer functions and added in an overlapping manner (cross fade), with the transfer functions and the length of an overlapping portion of the two bands being determined depending on the audio segments to be concatenated.
    35. The data carrier according to Claim 33 or 34 characterised in that
      the computer program selects the instance of the concatenation of the used band of the second audio segment with the used band of the first audio segment in such a manner that
      the instance of concatenation is set in a band which lies immediately before the used band of the later audio segment, if its used band reproduces a dynamic phone at the start,
      a downstream portion of the used band of the earlier audio segment and an upstream portion of the used band of the later audio segment are processed by suitable transfer functions and added in a non-overlapping manner (hard fade), with the transfer functions being determined depending on the audio segments to be concatenated.
    36. The data carrier according to one of the Claims 33 to 35 characterised in that
      the computer program selects a band of an audio segment for a phone or a portion of the sequence of concatenated phones at the start of the concatenated phone sequence, the start of which reproduces the properties of the start of the concatenated sequence of phones.
    37. The data carrier according to one of the Claims 33 to 36 characterised in that
      the computer program selects a band of an audio segment for a phone or a portion of the sequence of concatenated phones at the end of the concatenated phone sequence, the end of which reproduces the properties of the end of the concatenated sequence of phones.
    38. The data carrier according to one of the Claims 33 to 37 characterised in that
      the computer program carries out a processing of the used bands of individual audio segments with the aid of suitable functions depending on properties of the phone sequence, with the functions involving i.a. modification of the frequency, the duration, the amplitude, or the spectrum.
    39. The data carrier according to one of the Claims 33 to 38 characterised in that
      the computer program selects an audio segment band for the later audio segment band which reproduces the highest number of successive portions of the concatenated phones in the phone sequence, in order to use the smallest number of audio segment bands in the generation of the synthesised acoustical data.
    40. The data carrier according to one of the Claims 33 to 39 characterised in that
      the computer program carries out a processing of the used bands of individual audio segments with the aid of suitable functions in a band in which the instance of concatenation lies, with these functions involving i.a. a modification of the frequency, the duration, the amplitude, or the spectrum.
    41. The data carrier according to one of the Claims 33 to 40 characterised in that
      the computer program establishes the instance of concatenation in a place of the used bands of the first and/or the second audio segment, in which the two used bands are in agreement with respect to one or several suitable properties, with these properties including i.a.: zero points, amplitude values, gradients, derivatives of any degree, spectra, tone levels, amplitude values in a frequency band, volume, style of speech, emotion of speech, or other properties covered in the phone classification scheme.
    42. The data carrier according to one of the Claims 33 to 41 characterised in that
      the computer program carries out an implementation of heuristic knowledge which relates to the selection of the used bands of the individual audio segments, their processing, their variation, as well as their concatenation.
    43. The data carrier according to one of the Claims 33 to 42 characterised in that
      the computer program is suited for the generation of synthesised voice data, with the sounds being phones.
    44. The data carrier according to one of the Claims 34 to 42 characterised in that
      the computer program is suited for the generation of static phones, with the static phones comprising vowels, diphthongs, liquids, vibrants, fricatives, and nasals.
    45. The data carrier according to one of the Claims 35 to 44 characterised in that
      the computer program is suited for the generation of dynamic phones, with the dynamic phones comprising plosives, affricates, glottal stops, and click sounds.
    46. The data carrier according to one of the Claims 33 to 45 characterised in that
      the computer program converts the synthesised acoustical data to acoustically convertible data and/or voice signals.
    47. Synthesised voice signals which consist of a sequence of sounds or phones, respectively, with the voice signals being generated in that:
      at least two audio segments are selected which reproduce the sounds or phones, respectively; and
      the audio segments are linked by a co-articulation-specific concatenation, with
      one band to be used of an earlier audio segment being established;
      one band to be used of a later audio segment being established which starts with the later audio segment and ends with the co-articulation band of the later audio segment, following the initially used solo articulation band;
      with the duration and position of the bands to be used being determined depending on the audio segments; and
      the used bands of the audio segments being concatenated in a co-articulation-specific manner, in that the instance of concatenation, as a function of properties of the used band of the later audio segment, is set in a band which starts immediately before the used band of the later audio segment and ends with same.
    48. The synthesised voice signals according to Claim 47, characterised in that the voice signals are generated in that
      the audio segments are concatenated in an instance which lies in the vicinity of the boundaries of the initially used solo articulation band of the later audio segment, if the start of this band reproduces a static sound or phone, respectively, with the static phone being a vowel, a diphthong, a liquid, a fricative, a vibrant, or a nasal; and
      a downstream portion of the used band of the earlier audio segment and an upstream portion of the used band of the later audio segment are processed by means of suitable transfer functions and both bands are added in an overlapping manner (cross fade), with the transfer functions and the length of an overlapping portion of the two bands being determined depending on the audio segments to be concatenated.
    49. The synthesised voice signals according to Claim 47 or 48 characterised in that the voice signals are generated in that
      the audio segments are concatenated in an instance which lies immediately before the used band of the later audio segment, if the start of this band reproduces a dynamic sound or phone, respectively, with the dynamic phone being a plosive, an affricate, a glottal stop, or a click sound; and
      a downstream portion of the used band of the earlier audio segment and an upstream portion of the used band of the later audio segment are processed by means of suitable transfer functions and both bands are joined in a non-overlapping manner (hard fade), with the transfer functions being determined depending on the audio segments to be concatenated.
    50. The synthesised voice signals according to one of the Claims 47 to 49 characterised in that
      the first sound or the first phone, respectively, or a portion of the first phone sequence or of the first polyphone, respectively, in the sequence is generated by an audio segment, whose used band at the start reproduces the properties of the start of the sequence.
    51. The synthesised voice signals according to one of the Claims 47 to 50 characterised in that
      the last sound or the last phone, respectively, or a portion of the last phone sequence or of the last polyphone, respectively, in the sequence is generated by an audio segment, whose used band at the end reproduces the properties of the end of the sequence.
    52. The synthesised voice signals according to one of the Claims 47 to 51 characterised in that
      the voice signals are generated in that later bands of audio segments, beginning with the reproduction of a dynamic sound or phone, respectively, are concatenated with earlier bands of audio segments, beginning with the reproduction of a static sound or phone, respectively.
    53. The synthesised voice signals according to one of the Claims 47 to 52 characterised in that
      such audio segments are selected which reproduce the highest number of portions of sounds or phones, respectively, of the sequence, in order to use the smallest number of audio segment bands in the generation of the voice signals.
    54. The synthesised voice signals according to one of the Claims 47 to 53 characterised in that
      the voice signals are generated by the concatenation of the used bands of audio segments which are processed with the aid of suitable functions depending on properties of the sound sequence or phone sequence, respectively, with the functions involving i.a. a modification of the frequency, the duration, the amplitude, or the spectrum.
    55. The synthesised voice signals according to one of the Claims 47 to 54 characterised in that
      the voice signals are generated by the concatenation of the used bands of audio segments which are processed with the aid of suitable functions depending on properties of the sound sequence or phone sequence, respectively, in an area in which the instance of concatenation lies, with these properties including i.a. a modification of the frequency, the duration, the amplitude, or the spectrum.
    56. The synthesised voice signals according to one of the Claims 47 to 55 characterised in that
      the instance of concatenation lies at a place in the used bands of the earlier and/or the later audio segment, in which the two used bands are in agreement with respect to one or several suitable properties, with these properties including i.a.: zero points, amplitude values, gradients, derivatives of any degree, spectra, tone levels, amplitude values in a frequency band, volume, style of speech, emotion of speech, or other properties covered in the phone classification scheme.
    57. The synthesised voice signals according to one of the Claims 47 to 56 characterised in that
      the voice signals are suited for a conversion to acoustic signals.
    58. Sound storage medium, which contains data, which is at least partially synthesised acoustical data,
      which has been generated
      utilising a method according to Claim 1; or
      utilising a means according to Claim 16; or
      utilising a data carrier according to Claim 33; or
      are voice signals according to Claim 47.
    59. Sound storage medium according to Claim 58 characterised in that
      the synthesised acoustical data is synthesised voice data.
    EP99942891A 1998-08-19 1999-08-19 Method and device for the concatenation of audiosegments, taking into account coarticulation Expired - Lifetime EP1105867B1 (en)

    Applications Claiming Priority (3)

    Application Number Priority Date Filing Date Title
    DE19837661 1998-08-19
    DE1998137661 DE19837661C2 (en) 1998-08-19 1998-08-19 Method and device for co-articulating concatenation of audio segments
    PCT/EP1999/006081 WO2000011647A1 (en) 1998-08-19 1999-08-19 Method and device for the concatenation of audiosegments, taking into account coarticulation

    Publications (2)

    Publication Number Publication Date
    EP1105867A1 EP1105867A1 (en) 2001-06-13
    EP1105867B1 true EP1105867B1 (en) 2003-06-25

    Family

    ID=7878051

    Family Applications (1)

    Application Number Title Priority Date Filing Date
    EP99942891A Expired - Lifetime EP1105867B1 (en) 1998-08-19 1999-08-19 Method and device for the concatenation of audiosegments, taking into account coarticulation

    Country Status (7)

    Country Link
    US (1) US7047194B1 (en)
    EP (1) EP1105867B1 (en)
    AT (1) ATE243876T1 (en)
    AU (1) AU5623199A (en)
    CA (1) CA2340073A1 (en)
    DE (2) DE19861167A1 (en)
    WO (1) WO2000011647A1 (en)

    Families Citing this family (17)

    * Cited by examiner, † Cited by third party
    Publication number Priority date Publication date Assignee Title
    US7369994B1 (en) 1999-04-30 2008-05-06 At&T Corp. Methods and apparatus for rapid acoustic unit selection from a large speech corpus
    US7941481B1 (en) 1999-10-22 2011-05-10 Tellme Networks, Inc. Updating an electronic phonebook over electronic communication networks
    US7308408B1 (en) * 2000-07-24 2007-12-11 Microsoft Corporation Providing services for an information processing system using an audio interface
    DE10042571C2 (en) * 2000-08-22 2003-02-06 Univ Dresden Tech Process for concatenative speech synthesis using graph-based building block selection with a variable evaluation function
    JP3901475B2 (en) * 2001-07-02 2007-04-04 株式会社ケンウッド Signal coupling device, signal coupling method and program
    US7379875B2 (en) * 2003-10-24 2008-05-27 Microsoft Corporation Systems and methods for generating audio thumbnails
    DE102004044649B3 (en) * 2004-09-15 2006-05-04 Siemens Ag Speech synthesis using database containing coded speech signal units from given text, with prosodic manipulation, characterizes speech signal units by periodic markings
    US20080154601A1 (en) * 2004-09-29 2008-06-26 Microsoft Corporation Method and system for providing menu and other services for an information processing system using a telephone or other audio interface
    US8510113B1 (en) * 2006-08-31 2013-08-13 At&T Intellectual Property Ii, L.P. Method and system for enhancing a speech database
    US8374868B2 (en) * 2009-08-21 2013-02-12 General Motors Llc Method of recognizing speech
    WO2011025532A1 (en) * 2009-08-24 2011-03-03 NovaSpeech, LLC System and method for speech synthesis using frequency splicing
    JP6047922B2 (en) * 2011-06-01 2016-12-21 ヤマハ株式会社 Speech synthesis apparatus and speech synthesis method
    US9368104B2 (en) * 2012-04-30 2016-06-14 Src, Inc. System and method for synthesizing human speech using multiple speakers and context
    CN106471569B (en) * 2014-07-02 2020-04-28 雅马哈株式会社 Speech synthesis apparatus, speech synthesis method, and storage medium therefor
    US10553230B2 (en) * 2015-11-09 2020-02-04 Sony Corporation Decoding apparatus, decoding method, and program
    CN111145723B (en) * 2019-12-31 2023-11-17 广州酷狗计算机科技有限公司 Method, device, equipment and storage medium for converting audio
    CN113066459B (en) * 2021-03-24 2023-05-30 平安科技(深圳)有限公司 Song information synthesis method, device, equipment and storage medium based on melody

    Family Cites Families (6)

    * Cited by examiner, † Cited by third party
    Publication number Priority date Publication date Assignee Title
    JPH0727397B2 (en) 1988-07-21 1995-03-29 シャープ株式会社 Speech synthesizer
    FR2636163B1 (en) * 1988-09-02 1991-07-05 Hamon Christian METHOD AND DEVICE FOR SYNTHESIZING SPEECH BY ADDING-COVERING WAVEFORMS
    SE9200817L (en) 1992-03-17 1993-07-26 Televerket PROCEDURE AND DEVICE FOR SYNTHESIS
    US5463715A (en) * 1992-12-30 1995-10-31 Innovation Technologies Method and apparatus for speech generation from phonetic codes
    EP0710378A4 (en) * 1994-04-28 1998-04-01 Motorola Inc A method and apparatus for converting text into audible signals using a neural network
    BE1010336A3 (en) 1996-06-10 1998-06-02 Faculte Polytechnique De Mons Synthesis method of its.

    Also Published As

    Publication number Publication date
    AU5623199A (en) 2000-03-14
    DE19861167A1 (en) 2000-06-15
    WO2000011647A1 (en) 2000-03-02
    DE59906115D1 (en) 2003-07-31
    ATE243876T1 (en) 2003-07-15
    EP1105867A1 (en) 2001-06-13
    CA2340073A1 (en) 2000-03-02
    US7047194B1 (en) 2006-05-16

    Similar Documents

    Publication Publication Date Title
    DE60112512T2 (en) Coding of expression in speech synthesis
    EP1105867B1 (en) Method and device for the concatenation of audiosegments, taking into account coarticulation
    AT400646B (en) VOICE SEGMENT ENCODING AND TOTAL LAYER CONTROL METHOD FOR VOICE SYNTHESIS SYSTEMS AND SYNTHESIS DEVICE
    DE60126575T2 (en) Apparatus and method for synthesizing a singing voice and program for realizing the method
    DE69909716T2 (en) Formant speech synthesizer using concatenation of half-syllables with independent cross-fading in the filter coefficient and source range
    DE19610019C2 (en) Digital speech synthesis process
    DE69821673T2 (en) Method and apparatus for editing synthetic voice messages, and storage means with the method
    DE60035001T2 (en) Speech synthesis with prosody patterns
    DE60216651T2 (en) Speech synthesis device
    DE60004420T2 (en) Recognition of areas of overlapping elements for a concatenative speech synthesis system
    DE2115258A1 (en) Speech synthesis by concatenating words encoded in formant form
    DD143970A1 (en) METHOD AND ARRANGEMENT FOR SYNTHESIS OF LANGUAGE
    DE112013005807T5 (en) Apparatus and method for generating real-time music accompaniment
    DE60202161T2 (en) Method, apparatus and program for analyzing and synthesizing speech
    DE60205421T2 (en) Method and apparatus for speech synthesis
    EP1110203B1 (en) Device and method for digital voice processing
    EP0058130B1 (en) Method for speech synthesizing with unlimited vocabulary, and arrangement for realizing the same
    JP3281266B2 (en) Speech synthesis method and apparatus
    DE60305944T2 (en) METHOD FOR SYNTHESIS OF A STATIONARY SOUND SIGNAL
    EP1344211B1 (en) Device and method for differentiated speech output
    DE4441906C2 (en) Arrangement and method for speech synthesis
    DE60303688T2 (en) LANGUAGE SYNTHESIS BY CHAINING LANGUAGE SIGNALING FORMS
    DE60316678T2 (en) PROCESS FOR SYNTHETIZING LANGUAGE
    DE19837661C2 (en) Method and device for co-articulating concatenation of audio segments
    DE60311482T2 (en) METHOD FOR CONTROLLING DURATION OF LANGUAGE SYNTHESIS

    Legal Events

    Date Code Title Description
    PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

    Free format text: ORIGINAL CODE: 0009012

    17P Request for examination filed

    Effective date: 20010319

    AK Designated contracting states

    Kind code of ref document: A1

    Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE

    17Q First examination report despatched

    Effective date: 20010928

    GRAG Despatch of communication of intention to grant

    Free format text: ORIGINAL CODE: EPIDOS AGRA

    GRAG Despatch of communication of intention to grant

    Free format text: ORIGINAL CODE: EPIDOS AGRA

    GRAH Despatch of communication of intention to grant a patent

    Free format text: ORIGINAL CODE: EPIDOS IGRA

    RAP1 Party data changed (applicant data changed or rights of an application transferred)

    Owner name: BUSKIES, CHRISTOPH

    RIN1 Information on inventor provided before grant (corrected)

    Inventor name: BUSKIES, CHRISTOPH

    GRAG Despatch of communication of intention to grant

    Free format text: ORIGINAL CODE: EPIDOS AGRA

    GRAH Despatch of communication of intention to grant a patent

    Free format text: ORIGINAL CODE: EPIDOS IGRA

    GRAH Despatch of communication of intention to grant a patent

    Free format text: ORIGINAL CODE: EPIDOS IGRA

    GRAA (expected) grant

    Free format text: ORIGINAL CODE: 0009210

    AK Designated contracting states

    Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE

    PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

    Ref country code: NL

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20030625

    Ref country code: IT

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT;WARNING: LAPSES OF ITALIAN PATENTS WITH EFFECTIVE DATE BEFORE 2007 MAY HAVE OCCURRED AT ANY TIME BEFORE 2007. THE CORRECT EFFECTIVE DATE MAY BE DIFFERENT FROM THE ONE RECORDED.

    Effective date: 20030625

    Ref country code: IE

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20030625

    Ref country code: GB

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20030625

    Ref country code: FR

    Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

    Effective date: 20030625

    Ref country code: FI

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20030625

    REG Reference to a national code

    Ref country code: GB

    Ref legal event code: FG4D

    Free format text: NOT ENGLISH

    REG Reference to a national code

    Ref country code: CH

    Ref legal event code: EP

    REG Reference to a national code

    Ref country code: IE

    Ref legal event code: FG4D

    Free format text: GERMAN

    REF Corresponds to:

    Ref document number: 59906115

    Country of ref document: DE

    Date of ref document: 20030731

    Kind code of ref document: P

    PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

    Ref country code: LU

    Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

    Effective date: 20030819

    Ref country code: CY

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20030819

    Ref country code: AT

    Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

    Effective date: 20030819

    PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

    Ref country code: NL

    Payment date: 20030829

    Year of fee payment: 5

    PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

    Ref country code: MC

    Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

    Effective date: 20030831

    Ref country code: LI

    Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

    Effective date: 20030831

    Ref country code: CH

    Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

    Effective date: 20030831

    Ref country code: BE

    Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

    Effective date: 20030831

    PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

    Ref country code: SE

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20030925

    Ref country code: PT

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20030925

    Ref country code: GR

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20030925

    Ref country code: DK

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20030925

    PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

    Ref country code: GB

    Payment date: 20031024

    Year of fee payment: 5

    NLV1 Nl: lapsed or annulled due to failure to fulfill the requirements of art. 29p and 29m of the patents act
    PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

    Ref country code: ES

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20031222

    REG Reference to a national code

    Ref country code: IE

    Ref legal event code: FD4D

    BERE Be: lapsed

    Owner name: *BUSKIES CHRISTOPH

    Effective date: 20030831

    REG Reference to a national code

    Ref country code: CH

    Ref legal event code: PL

    PLBE No opposition filed within time limit

    Free format text: ORIGINAL CODE: 0009261

    STAA Information on the status of an ep patent application or granted ep patent

    Free format text: STATUS: NO OPPOSITION FILED WITHIN TIME LIMIT

    26N No opposition filed

    Effective date: 20040326

    EN Fr: translation not filed
    REG Reference to a national code

    Ref country code: GB

    Ref legal event code: ERR

    Free format text: CORRECTION FOR CODE "EP GBV"

    PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

    Ref country code: DE

    Payment date: 20180831

    Year of fee payment: 20

    REG Reference to a national code

    Ref country code: DE

    Ref legal event code: R071

    Ref document number: 59906115

    Country of ref document: DE