WO2000031722A1 - Procede de commande de la duree en synthese vocale - Google Patents
- Publication number
- WO2000031722A1 WO2000031722A1 PCT/EP1999/008825 EP9908825W WO0031722A1 WO 2000031722 A1 WO2000031722 A1 WO 2000031722A1 EP 9908825 W EP9908825 W EP 9908825W WO 0031722 A1 WO0031722 A1 WO 0031722A1
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- level
- duration
- syllable
- phrase
- rule
- Prior art date
Links
- 238000000034 method Methods 0.000 title claims abstract description 69
- 230000015572 biosynthetic process Effects 0.000 title claims abstract description 22
- 238000003786 synthesis reaction Methods 0.000 title claims abstract description 22
- 230000001537 neural effect Effects 0.000 claims abstract description 11
- 238000004422 calculation algorithm Methods 0.000 claims description 9
- 238000004364 calculation method Methods 0.000 claims description 8
- 238000013528 artificial neural network Methods 0.000 claims description 7
- 238000012549 training Methods 0.000 claims description 5
- 238000004458 analytical method Methods 0.000 claims description 3
- 230000033764 rhythmic process Effects 0.000 abstract description 9
- 238000012545 processing Methods 0.000 description 10
- 238000013459 approach Methods 0.000 description 6
- 230000007812 deficiency Effects 0.000 description 3
- 230000007935 neutral effect Effects 0.000 description 2
- 238000007619 statistical method Methods 0.000 description 2
- 230000001944 accentuation Effects 0.000 description 1
- 230000006978 adaptation Effects 0.000 description 1
- 230000006835 compression Effects 0.000 description 1
- 238000007906 compression Methods 0.000 description 1
- 230000008602 contraction Effects 0.000 description 1
- 230000007547 defect Effects 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 238000012552 review Methods 0.000 description 1
- 230000001020 rhythmical effect Effects 0.000 description 1
- 230000002123 temporal effect Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L13/10—Prosody rules derived from text; Stress or intonation
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
Definitions
- the invention relates to a method for duration control in speech synthesis according to the preamble of claim 1.
- rule-based methods have been used to calculate the duration of the sounds.
- a specific duration is calculated for each sound of the synthetic utterance, resulting from the neutral sound duration modified by various influencing factors.
- the invention is therefore based on the object of providing a method for duration control in speech synthesis which eliminates the described deficiencies in the prosodic processing stages, significantly improves the naturalness of synthetic speech, and removes the disadvantages of conventional rule-based duration-control models by generating a natural speaking rhythm through correctly determined sound durations.
- the present method clearly distinguishes itself from previous approaches, in which duration control was implemented at the sound level with the help of a rule set and speaker-independent duration statistics.
- the naturalness of a speech signal is significantly improved by specifically taking into account the temporal, speaker-specific structures that strongly influence it. The present method for speech-rhythm control, i.e. duration control in speech synthesis, therefore uses duration statistics obtained from the data of the original speaker.
- the method uses a multi-level model for duration control in speech synthesis. It models the language processing that takes place in humans at different levels: a phrase, a syllable and a sound level, which alternatively use statistical/rule-based or learning methods based on the individual data of the original speaker.
- the target durations are therefore calculated independently at different levels, namely the phrase level, the syllable level and the sound level.
- rule-based or learning procedures are used at each level, and data exchange between them is possible, enabling the combination of rule-based with learning procedures.
- the procedures at the individual levels use speaker-specific databases or are trained on the target speaker, in contrast to the general rules previously used for all speakers.
- the duration of each prosodic phrase in a text is calculated depending on the number of syllables in the phrase and the phrase type. Either a rule-based calculation rule that works on the basis of speaker-specific statistics or a neural network trained on the speaker is used.
- the syllable duration is calculated for each syllable within a prosodic phrase.
- a learning process or a rule-based approach is used.
- the methods evaluate various phonetic characteristics, for example stressed/unstressed or the type of syllable nucleus, and use them to generate the syllable duration.
- the syllable durations are then adapted to the phrase duration calculated in the phrase level.
- the syllable duration is divided into the individual sounds.
- the method used here can again be a rule-based or a learning method.
- the multi-level hybrid structure, which is hierarchically organized into sound, syllable and phrase levels, combines a rule-based procedure with an artificial neural network.
- the interfaces between the alternative approaches are defined so that data can be exchanged, and the end result emerges from a combination of the partial results of the different processing stages. In this way, the advantages of the rule-based and the neural method can be optimally exploited, increasing overall quality.
- Fig. 1 shows a basic multi-level model for duration control in speech synthesis.
- the multi-stage hybrid structure shown in FIG. 1 is hierarchically divided into phonetic, syllable and phrase levels and combines rule-based methods and an artificial neural network.
- the interfaces between the alternative approaches are defined so that data can be exchanged, and the end result emerges from a combination of the partial results of the different processing stages. In this way, the advantages of the rule-based and the neural method can be optimally used, increasing overall quality.
- the goal of a hybrid data-driven or rule-based rhythm control is the combination of proven knowledge components with the ability to vary the speaking rhythm and even to train speaker-specific features.
- the strategy takes four aspects into account:
- division of the segment duration control into three representation levels, namely phrase 1, syllable 2 and sound 3, each with its own data set for training and for generating the target durations;
- in each level 1 to 3, a neural or a rule-based algorithm 4 or 5 runs on the database 7 to 9 corresponding to the respective level;
- the extracted prosodic phrase, syllable and sound databases 7, 8, 8' and 9, including their statistical parameters, come from a (variable) speaker who must always be identical to the speaker of the diphone inventory in database 10;
- the diphone inventory, that is the corresponding database 10 for acoustic synthesis, and the aforementioned prosodic phrase, syllable and sound databases 7, 8, 8' and 9 are based on a variable speaker.
- the controllable switchover 6 in each level 1 to 3 between the neural and the rule-based method or algorithm serves both to combine the two and to use only one of them.
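The switchover described above can be pictured as a small per-level dispatcher. The sketch below is purely illustrative: the function names and the averaging rule used for the combined mode are assumptions, not taken from the patent.

```python
# Illustrative per-level switchover between a rule-based and a neural
# duration predictor. The "combine" mode averages the two partial
# results; the patent does not specify the combination rule.

def make_level(rule_fn, neural_fn, mode="rule"):
    """Return a predictor for one level with a controllable switchover."""
    def predict(features):
        if mode == "rule":
            return rule_fn(features)
        if mode == "neural":
            return neural_fn(features)
        # "combine": use both methods and merge their partial results
        return 0.5 * (rule_fn(features) + neural_fn(features))
    return predict
```

With this shape, each of the three levels can hold both algorithms and expose one prediction interface, so data exchange between levels does not depend on which method is active.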
- Level 1 receives the input data from the text analysis 11 both for the artificial neural networks 4 and for the rule-based method 5.
- the acoustic synthesis 12 operates on the output data of level 3; it must be emphasized again that the diphone inventory in database 10 for acoustic synthesis is based on a speaker identical to that of databases 7 to 9.
- the neural algorithm used is of the well-known Elman type.
- basic values of prosodic contours, here relative segment durations, are trained and then predicted in the recall phase.
- the input coding depends on the respective processing level, namely the phrase duration level 1, the syllable duration level 2 or the sound level 3.
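As a hedged illustration of an Elman-type predictor, the following forward pass feeds the previous hidden state back into the network through context units. Layer sizes, weights and the input coding are placeholders; the patent's actual network and training procedure are not reproduced here.

```python
# Minimal Elman (simple recurrent) network forward pass -- a sketch of
# the kind of recurrent predictor described above, NOT the patent's
# implementation. Sizes and weights are illustrative placeholders.
import math
import random

random.seed(0)
n_in, n_hid = 4, 6                     # hypothetical layer sizes

def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)]
            for _ in range(rows)]

W_xh = rand_matrix(n_hid, n_in)        # input -> hidden weights
W_hh = rand_matrix(n_hid, n_hid)       # context (recurrent) weights
W_hy = rand_matrix(1, n_hid)           # hidden -> output weights

def elman_forward(xs):
    """One relative-duration output per input vector (forward pass only)."""
    h = [0.0] * n_hid                  # context units start at zero
    ys = []
    for x in xs:
        h = [math.tanh(sum(W_xh[i][j] * x[j] for j in range(n_in)) +
                       sum(W_hh[i][j] * h[j] for j in range(n_hid)))
             for i in range(n_hid)]
        ys.append(sum(W_hy[0][j] * h[j] for j in range(n_hid)))
    return ys
```

The recurrence through `W_hh` is what lets such a network exploit the sequential context of phrases, syllables or sounds, which is why the input coding must differ per processing level.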
- the rule-based or formula-based duration control uses a set of rules or formulas for each level 1 to 3. These rules are extracted from databases 7 to 10 by statistical analysis and model the linguistic influencing factors at the respective processing level.
- Level 1 determines the phrase duration for a given prosodic phrase depending on the number of syllables and the type of prosodic phrase (see Formula 1).
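Formula 1 itself is not reproduced in this text. A minimal sketch of a Level-1 rule consistent with the description (duration linear in the syllable count, corrected by a phrase-type coefficient) could look as follows; every constant and phrase-type name is a hypothetical stand-in for the speaker-specific statistics.

```python
# Sketch of a rule-based phrase-duration estimate. Formula 1 is not
# given in the source text; this is one plausible form. All constants
# are hypothetical speaker-specific statistics.

MEAN_SYLLABLE_MS = 180.0   # assumed mean syllable duration of the speaker

PHRASE_TYPE_COEFF = {      # assumed correction factors per phrase type
    "declarative": 1.00,
    "interrogative": 1.05,
    "final": 1.10,         # e.g. phrase-final lengthening
}

def phrase_duration_ms(n_syllables: int, phrase_type: str) -> float:
    """Rule-based target duration for one prosodic phrase."""
    coeff = PHRASE_TYPE_COEFF.get(phrase_type, 1.0)
    return n_syllables * MEAN_SYLLABLE_MS * coeff
```

For example, a five-syllable declarative phrase yields 5 × 180 ms = 900 ms as the duration frame passed down to the syllable level.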
- the second level calculates the duration of each syllable as a linear function of its number of sounds.
- different phonetic properties, for example the accent or the nucleus type, influence the syllable duration in different ways: the nucleus type causes a lengthening by a factor, while an accented syllable is extended by adding a constant, as shown in the following formula.
- the syllables are then adjusted by linear stretching or compression to fit the duration frame calculated in phrase level 1. Finally, the duration of each sound has to be fitted into the sound-duration frame.
- a stretch factor is calculated iteratively from a given syllable duration and the standard deviations of the sound durations.
- the duration of a phrase is primarily determined by the number of syllables, the parameters for its calculation being determined using statistics from the data of the original speaker.
- the type of a phrase also affects its length. Depending on the phrase type, the mean phrase duration is corrected using coefficients.
- results of the statistical analysis of syllable duration, depending on the number of sounds, the accentuation, the information content, the type of the syllable nucleus and the position of the syllable in the phrase, are used as the basis for the calculation of the syllable durations. These influencing factors are expressed by linear dependencies.
- the determined syllable durations are then summed for each phrase and adapted to the phrase duration determined in level 1 by linear expansion or compression of all syllable durations.
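A hedged sketch of these two steps, with hypothetical coefficients standing in for the speaker-specific statistics: a linear syllable-duration model (lengthening by a factor for the nucleus type, an additive constant for accent) followed by linear expansion or compression to the phrase target.

```python
# Sketch: linear syllable-duration model plus adaptation to the phrase
# duration frame. All coefficients are hypothetical placeholders for
# the speaker-specific statistics described in the text.

def syllable_duration_ms(n_sounds: int, stressed: bool,
                         long_nucleus: bool) -> float:
    base = 60.0 + 40.0 * n_sounds      # linear in the number of sounds
    if long_nucleus:
        base *= 1.2                    # nucleus type lengthens by a factor
    if stressed:
        base += 30.0                   # accent adds a constant
    return base

def adapt_to_phrase(syllable_ms, phrase_target_ms):
    """Linearly stretch/compress all syllables to the phrase duration."""
    scale = phrase_target_ms / sum(syllable_ms)
    return [d * scale for d in syllable_ms]
```

For instance, two three-sound syllables of 180 ms each, adapted to a 720 ms phrase frame, are both stretched to 360 ms, so their sum matches the Level-1 target exactly.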
- the calculation of the actual duration of the sound is based on the calculated syllable duration.
- the different elasticity of the individual sounds is taken into account. It is assumed that all sounds of a syllable are subjected to a constant stretching K.
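Under the assumption of a linear elasticity model, in which each sound contributes its mean duration plus the common stretch K times its standard deviation, K even has a closed form; the sketch below uses that form, whereas a nonlinear (e.g. log-domain) variant would need the iterative calculation mentioned above. All numbers are illustrative.

```python
# Sketch of distributing a syllable duration over its sounds with one
# shared stretch K, where a sound's elasticity is modeled by the
# standard deviation of its duration. Linear form assumed; the patent's
# iterative calculation suggests a nonlinear variant in practice.

def sound_durations_ms(means, stddevs, syllable_target_ms):
    """Solve sum(m_i + K * s_i) = target for the common stretch K."""
    k = (syllable_target_ms - sum(means)) / sum(stddevs)
    return [m + k * s for m, s in zip(means, stddevs)]
```

Sounds with a larger standard deviation absorb more of the stretching or compression, which is exactly the "different elasticity of the individual sounds" the text refers to.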
- in phrase level 1, the duration is calculated for each prosodic phrase in a text, depending on the number of syllables in the phrase and the phrase type. Either a rule-based calculation rule that works on the basis of speaker-specific statistics or a neural network trained on the speaker is used.
- in syllable level 2, the syllable duration is calculated for each syllable within a prosodic phrase. Either a learning method 4 or a rule-based approach 5 is used for this.
- the methods evaluate various phonetic characteristics, such as stressed/unstressed or the type of syllable nucleus, and use these to generate the syllable durations. These durations are then adapted to the phrase duration calculated in phrase level 1.
- in sound level 3, the syllable duration is divided among the individual sounds.
- the method used here can again be a rule-based or a learning method.
List of reference numbers
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Electrically Operated Instructional Devices (AREA)
Abstract
Speech synthesis systems and methods serve to convert a written text into an acoustic utterance. The invention relates to a method that enables a speaker-specific speech rhythm. A multi-stage hybrid structure is hierarchically divided into three levels, namely a phoneme level (1), a syllable level (2) and a phrase level (3). In each of the aforementioned levels (1 to 3), either a rule-based or a neural method (5 or 4) can be used.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP98122316.7 | 1998-11-25 | ||
EP98122316 | 1998-11-25 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2000031722A1 true WO2000031722A1 (fr) | 2000-06-02 |
Family
ID=8233028
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/EP1999/008825 WO2000031722A1 (fr) | 1998-11-25 | 1999-11-17 | Procede de commande de la duree en synthese vocale |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2000031722A1 (fr) |
-
1999
- 1999-11-17 WO PCT/EP1999/008825 patent/WO2000031722A1/fr not_active Application Discontinuation
Non-Patent Citations (3)
Title |
---|
CAMPBELL W N ET AL: "DURATION PITCH AND DIPHONES IN THE CSTR TTS SYSTEM", PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING (ICSLP),JP,TOKYO, ASJ, 1990, pages 825 - 828, XP000506898 * |
HIRSCHFELD D ET AL: "Hybrid process uses neural control for data based speech rhythm control", SPRACHKOMMUNIKATION' (SPEECH COMMUNICATION), DRESDEN, GERMANY, 31 AUG.-2 SEPT. 1998, no. 152, ITG-Fachbericht, 1998, VDE-Verlag, Germany, pages 111 - 114, XP002132260, ISSN: 0932-6022 * |
SANTEN VAN J P H: "ASSIGNMENT OF SEGMENTAL DURATION IN TEXT-TO-SPEECH SYNTHESIS", COMPUTER SPEECH AND LANGUAGE,GB,ACADEMIC PRESS, LONDON, vol. 8, no. 2, 1 April 1994 (1994-04-01), pages 95 - 128, XP000501471, ISSN: 0885-2308 * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110264993A (zh) * | 2019-06-27 | 2019-09-20 | 百度在线网络技术(北京)有限公司 | 语音合成方法、装置、设备及计算机可读存储介质 |
CN110264993B (zh) * | 2019-06-27 | 2020-10-09 | 百度在线网络技术(北京)有限公司 | 语音合成方法、装置、设备及计算机可读存储介质 |
CN113129863A (zh) * | 2019-12-31 | 2021-07-16 | 科大讯飞股份有限公司 | 语音时长预测方法、装置、设备及可读存储介质 |
CN113129863B (zh) * | 2019-12-31 | 2024-05-31 | 科大讯飞股份有限公司 | 语音时长预测方法、装置、设备及可读存储介质 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states |
Kind code of ref document: A1 Designated state(s): JP US |
|
AL | Designated countries for regional patents |
Kind code of ref document: A1 Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE |
|
WWE | Wipo information: entry into national phase |
Ref document number: 1999958057 Country of ref document: EP |
|
DFPE | Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101) | ||
121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
WA | Withdrawal of international application | ||
WWW | Wipo information: withdrawn in national office |
Ref document number: 1999958057 Country of ref document: EP |