WO2000031722A1 - Procede de commande de la duree en synthese vocale (Method for duration control in speech synthesis) - Google Patents

Procede de commande de la duree en synthese vocale (Method for duration control in speech synthesis)

Info

Publication number
WO2000031722A1
WO2000031722A1 (PCT/EP1999/008825)
Authority
WO
WIPO (PCT)
Prior art keywords
level
duration
syllable
phrase
rule
Prior art date
Application number
PCT/EP1999/008825
Other languages
German (de)
English (en)
Inventor
Oliver Jokisch
Diane Hirschfeld
Matthias Eichner
Rüdiger Hoffmann
Original Assignee
Deutsche Telekom Ag
Priority date
Filing date
Publication date
Application filed by Deutsche Telekom Ag filed Critical Deutsche Telekom Ag
Publication of WO2000031722A1

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/08Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10Prosody rules derived from text; Stress or intonation
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/08Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Definitions

  • The invention relates to a method for duration control in speech synthesis according to the preamble of claim 1.
  • Rule-based methods have previously been used to calculate the durations of the sounds.
  • A specific duration is calculated for each sound of the synthetic utterance; it results from the neutral sound duration through modification by various influencing factors.
  • The invention is therefore based on the object of providing a method for duration control in speech synthesis that eliminates the deficiencies described in the prosodic processing stages, significantly improves the naturalness of synthetic speech, and overcomes the disadvantages of conventional rule-based duration-control models by generating a natural speaking rhythm from correctly determined sound durations.
  • The present method clearly distinguishes itself from previous approaches, in which duration control was implemented at the sound level with the help of a rule set and speaker-independent duration statistics.
  • The naturalness of a speech signal is significantly improved by taking particular account of the temporal, speaker-specific structures, which strongly influence that naturalness. The present method for speech-rhythm control, i.e. duration control in speech synthesis, therefore uses duration statistics obtained from the data of the original speaker.
  • The method uses a multi-level model for duration control in speech synthesis. It models the speech processing that takes place in humans at different levels, namely a phrase level, a syllable level and a sound level, which alternatively use statistical/rule-based or learning methods based on the individual data of the original speaker.
  • The target durations are therefore calculated independently at the different levels, namely the phrase level, the syllable level and the sound level.
  • Rule-based or learning procedures are used at each level, and data exchange between them is possible, which allows rule-based and learning procedures to be combined.
  • The procedures at the individual levels use speaker-specific databases or are trained on the target speaker, in contrast to the general rules previously applied to all speakers.
  • At the phrase level, the duration of each prosodic phrase in a text is calculated as a function of the number of syllables in the phrase and of the phrase type. Either a rule-based calculation rule that works on speaker-specific statistics or a neural network trained on the speaker is used.
  • The syllable duration is calculated for each syllable within a prosodic phrase.
  • For this, a learning method or a rule-based approach is used.
  • The methods evaluate various phonetic characteristics, such as stressed/unstressed or the type of syllable nucleus, and use these to generate the syllable durations.
  • The syllable durations are then adapted to the phrase duration calculated at the phrase level.
  • At the sound level, the syllable duration is divided among the individual sounds.
  • The method used here can again be a rule-based or a learning method.
  • The multi-level hybrid structure, organized hierarchically into sound, syllable and phrase levels, combines a rule-based procedure with an artificial neural network.
  • The interfaces between the alternative approaches are defined so that data can be exchanged, and the end result is obtained by combining the partial results of the different processing stages. In this way, the advantages of the rule-based and the neural method can be exploited optimally, increasing overall quality.
  • Fig. 1 shows the basic multi-level model for duration control in speech synthesis.
  • The multi-stage hybrid structure shown in FIG. 1 is hierarchically divided into sound, syllable and phrase levels and combines rule-based methods with an artificial neural network.
  • The goal of a hybrid, data-driven/rule-based rhythm control is to combine proven knowledge components with the ability to vary the speaking rhythm and even to train speaker-specific features.
  • The strategy takes four aspects into account:
  • division of the segment-duration control into three representation levels, namely phrase level 1, syllable level 2 and sound level 3, each with its own data node for training and for generating the target durations;
  • in each level 1 to 3, either a neural or a rule-based algorithm 4 or 5 runs, using the database 7 to 9 corresponding to the respective level 1 to 3;
  • the extracted prosodic phrase, syllable and sound databases 7, 8, 8' and 9, including their statistical parameters, come from a (variable) speaker who must always be identical to the speaker of the diphone inventory in database 10;
  • the diphone inventory, that is to say the corresponding database 10 for acoustic synthesis, and the aforementioned prosodic phrase, syllable and sound databases 7, 8, 8' and 9 are based on a variable speaker.
  • The controllable switchover 6 in each level 1 to 3 between neural and rule-based methods or algorithms serves both to combine the two approaches and to use only one of them.
  • Level 1 receives the input data from the text analysis 11, both for the artificial neural networks 4 and for the rule-based method 5.
  • The acoustic synthesis 12 operates on the output data of level 3; it must be emphasized again that the diphone inventory in database 10 for acoustic synthesis is based on a speaker identical to the speaker of databases 7 to 9.
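The three-level pipeline with a per-level switchover between a rule-based formula and a neural network can be sketched as follows. This is a minimal illustration of the structure described above; the class and function names, and the dummy predictors, are invented for illustration and are not taken from the patent.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Level:
    """One processing level (phrase 1, syllable 2 or sound 3) of the hybrid model."""
    rule_based: Callable   # formula extracted from the speaker-specific database
    neural: Callable       # e.g. an Elman network trained on the same database
    use_neural: bool = False  # the controllable switchover 6

    def predict(self, features):
        # Select the neural or the rule-based algorithm for this level.
        return (self.neural if self.use_neural else self.rule_based)(features)


def duration_control(text_features, phrase, syllable, sound):
    """Hierarchical pass: phrase duration -> syllable durations -> sound durations.

    The interfaces between levels are plain data, so a rule-based level can
    feed a neural one and vice versa.
    """
    phrase_dur = phrase.predict(text_features)
    syll_durs = syllable.predict({**text_features, "phrase_dur": phrase_dur})
    return sound.predict({**text_features, "syll_durs": syll_durs})
```

The switchover is per level, so a configuration can, for example, use the neural network only at the phrase level while keeping rule-based processing at the syllable and sound levels.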
  • The neural algorithm used is of the well-known Elman type.
  • Basic values of prosodic contours, here relative segment durations, are learned during training and predicted in the recall (application) phase.
  • The input coding depends on the respective processing level, namely phrase-duration level 1, syllable-duration level 2 or sound level 3.
  • The rule-based or formula-based duration control uses a set of rules or formulas for each level 1 to 3. These rules are extracted from databases 7 to 10 by statistical analysis and model linguistic influencing factors at the processing levels.
  • Level 1 determines the phrase duration for a given prosodic phrase depending on the number of syllables and the type of prosodic phrase (see Formula 1).
  • The second level calculates the duration of each syllable as a linear function of its number of sounds.
  • Different phonetic properties, for example the accent or the nucleus type, influence the syllable duration in different ways: the nucleus type causes a lengthening by a factor, while an accented syllable is extended by adding a constant, as shown in the following formula.
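The formula itself is not reproduced in this text; a plausible sketch of the described dependence (linear in the number of sounds, scaled by a nucleus-type factor, plus an accent constant) follows. All coefficient values here are illustrative placeholders, not the patent's speaker-specific statistics.

```python
def syllable_duration(n_sounds, nucleus_factor, accented,
                      base=40.0, per_sound=55.0, accent_const=25.0):
    """Sketch of the level-2 syllable-duration rule (durations in ms).

    base, per_sound and accent_const are placeholder coefficients that would
    come from the speaker-specific statistics; nucleus_factor models the
    lengthening caused by the syllable-nucleus type.
    """
    # Linear dependence on the number of sounds, scaled by the nucleus factor.
    d = (base + per_sound * n_sounds) * nucleus_factor
    # An accented syllable is extended by adding a constant.
    if accented:
        d += accent_const
    return d
```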
  • The syllables are then adjusted by linear stretching or compression to fit within the duration frame calculated at phrase level 1. Finally, the duration of each sound has to be fitted into its duration frame.
  • For a given syllable duration, a stretch factor is calculated iteratively from the sound durations and their standard deviations.
  • The duration of a phrase is primarily determined by its number of syllables; the parameters for its calculation are derived from statistics on the data of the original speaker.
  • The type of a phrase also affects its duration: depending on the phrase type, the mean phrase duration is corrected using coefficients.
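A minimal sketch of this level-1 rule, assuming the simplest reading (mean syllable duration times syllable count, corrected by a phrase-type coefficient); the coefficient values and phrase-type names are invented for illustration:

```python
def phrase_duration(n_syllables, phrase_type, mean_syllable_dur=180.0, coeffs=None):
    """Sketch of the phrase-level duration rule (durations in ms).

    mean_syllable_dur is a placeholder for the speaker-specific mean obtained
    from the original speaker's data; coeffs holds hypothetical phrase-type
    correction coefficients.
    """
    if coeffs is None:
        coeffs = {"statement": 1.0, "question": 1.05, "phrase-final": 1.1}
    # Syllable count drives the duration; the phrase type corrects the mean.
    return n_syllables * mean_syllable_dur * coeffs.get(phrase_type, 1.0)
```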
  • Results of the statistical analysis of syllable duration as a function of the number of sounds, the accentuation, the information content, the type of the syllable nucleus and the position of the syllable in the phrase are used as the basis for calculating the syllable durations. These influencing factors on the syllable duration are expressed as linear dependencies.
  • The determined syllable durations are then summed for each phrase and adapted to the phrase duration determined at level 1 by linear expansion or compression of all syllable durations.
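The linear expansion or compression step amounts to scaling all syllable durations by one common factor so that their sum matches the level-1 phrase duration:

```python
def fit_to_phrase(syllable_durs, phrase_dur):
    """Linearly expand or compress all syllable durations so that their sum
    equals the phrase duration determined at level 1."""
    k = phrase_dur / sum(syllable_durs)  # common stretch/compression factor
    return [d * k for d in syllable_durs]
```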
  • The calculation of the actual sound durations is based on the calculated syllable duration.
  • The different elasticity of the individual sounds is taken into account: it is assumed that all sounds of a syllable are subjected to a constant stretch K.
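The iterative stretch-factor calculation mentioned above can be illustrated with a common elasticity formulation: every sound gets the same K, but sounds with a larger standard deviation absorb more of the change. This is a sketch of the idea, not the patent's exact procedure; the bisection bracket for K is an assumption.

```python
def distribute_syllable_duration(means, stds, syllable_dur, iters=40):
    """Distribute a syllable duration over its sounds with a common stretch K.

    Each sound duration is modelled as dur_i = mean_i + K * std_i, so more
    elastic sounds (larger std_i) are stretched or compressed more. K is
    found iteratively (bisection, assumed to lie in [-10, 10]) so that the
    sound durations sum to the given syllable duration.
    """
    lo, hi = -10.0, 10.0
    for _ in range(iters):
        k = (lo + hi) / 2.0
        if sum(m + k * s for m, s in zip(means, stds)) < syllable_dur:
            lo = k  # total too short: stretch more
        else:
            hi = k  # total too long: stretch less
    return [m + k * s for m, s in zip(means, stds)]
```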
  • At phrase level 1, the duration is calculated for each prosodic phrase in a text, depending on the number of syllables in the phrase and the phrase type. Either a rule-based calculation rule that works on speaker-specific statistics or a neural network trained on the speaker is used.
  • At syllable level 2, the syllable duration is calculated for each syllable within a prosodic phrase.
  • Either a learning method 4 or a rule-based approach 5 is used for this.
  • The methods evaluate various phonetic characteristics, such as stressed/unstressed or the type of syllable nucleus, and use these to generate the syllable durations. These durations are then adapted to the phrase duration calculated at phrase level 1.
  • At sound level 3, the syllable duration is divided among the individual sounds.
  • The method used here can again be a rule-based or a learning method.

List of reference numbers

Abstract

Speech synthesis systems and methods serve to convert a written text into an acoustic utterance. The invention relates to a method that enables a speaker-specific speech rhythm. A multi-level hybrid structure is divided hierarchically into three levels, namely a phoneme level (1), a syllable level (2) and a phrase level (3). In each of the levels (1 to 3), either a rule-based or a neural method (5 or 4) can be used.
PCT/EP1999/008825 (priority date 1998-11-25, filing date 1999-11-17): Procede de commande de la duree en synthese vocale, WO2000031722A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP98122316.7 1998-11-25
EP98122316 1998-11-25

Publications (1)

Publication Number Publication Date
WO2000031722A1 true WO2000031722A1 (fr) 2000-06-02

Family

ID=8233028

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP1999/008825 WO2000031722A1 (fr) 1998-11-25 1999-11-17 Procede de commande de la duree en synthese vocale

Country Status (1)

Country Link
WO (1) WO2000031722A1 (fr)

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CAMPBELL, W. N. et al.: "Duration, pitch and diphones in the CSTR TTS system", Proceedings of the International Conference on Spoken Language Processing (ICSLP), Tokyo, ASJ, 1990, pages 825-828, XP000506898 *
HIRSCHFELD, D. et al.: "Hybrid process uses neural control for data based speech rhythm control", Sprachkommunikation (Speech Communication), Dresden, Germany, 31 Aug.-2 Sept. 1998, ITG-Fachbericht no. 152, VDE-Verlag, 1998, pages 111-114, XP002132260, ISSN 0932-6022 *
VAN SANTEN, J. P. H.: "Assignment of segmental duration in text-to-speech synthesis", Computer Speech and Language, Academic Press, London, vol. 8, no. 2, 1 April 1994, pages 95-128, XP000501471, ISSN 0885-2308 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110264993A (zh) * 2019-06-27 2019-09-20 百度在线网络技术(北京)有限公司 语音合成方法、装置、设备及计算机可读存储介质
CN110264993B (zh) * 2019-06-27 2020-10-09 百度在线网络技术(北京)有限公司 语音合成方法、装置、设备及计算机可读存储介质
CN113129863A (zh) * 2019-12-31 2021-07-16 科大讯飞股份有限公司 语音时长预测方法、装置、设备及可读存储介质
CN113129863B (zh) * 2019-12-31 2024-05-31 科大讯飞股份有限公司 语音时长预测方法、装置、设备及可读存储介质

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): JP US

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE

WWE Wipo information: entry into national phase

Ref document number: 1999958057

Country of ref document: EP

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
121 Ep: the epo has been informed by wipo that ep was designated in this application
WA Withdrawal of international application
WWW Wipo information: withdrawn in national office

Ref document number: 1999958057

Country of ref document: EP