WO2000031722A1 - Method for controlling duration in speech synthesis

Info

Publication number: WO2000031722A1
Application number: PCT/EP1999/008825
Authority: WO
Inventors: Oliver Jokisch; Diane Hirschfeld; Matthias Eichner; Rüdiger Hoffmann
Original assignee: Deutsche Telekom AG
Priority date: 1998-11-25
Filing date: 1999-11-17
Publication date: 2000-06-02

Abstract

Speech synthesis systems and methods serve to convert a written text into an acoustic utterance. The invention relates to a method which permits a speaker-specific speech rhythm. According to this method a multistage hybrid structure is hierarchically divided into three levels, a phrase level (1), a syllable level (2) and a sound level (3). A rule-based or neural method (5 or 4) can be applied at each of the above levels (1 to 3).

Description

METHOD FOR DURATION CONTROL IN SPEECH SYNTHESIS

The invention relates to a method for duration control in speech synthesis according to the preamble of claim 1.

Methods and systems for speech synthesis are known in principle. They convert a written text into an acoustic utterance. These so-called text-to-speech systems achieve a high level of intelligibility, speak several languages and can synthesize almost any text. Nevertheless, speech synthesis remains a technical challenge. User acceptance of these systems is rather low because of their limited naturalness and the greater concentration that listening demands compared with natural speech. So far, this has stood in the way of widespread use. In addition, with the high segmental speech quality that such systems and methods now achieve, deficiencies in the prosodic processing levels become increasingly noticeable. The lack of rhythm, which in such a system is largely modeled by the sound duration control, is perceived as particularly disturbing.

So far, rule-based methods have been used to calculate the duration of the sounds. In such a method, a specific duration is calculated for each sound of the synthetic utterance, which results from the neutral sound duration through modification by various influencing factors.

A model for American English was introduced by Dennis H. Klatt in J. Acoust. Soc. Am., vol. 59, May 1976, pages 1208 to 1221. A further developed method was published by Dennis H. Klatt in J. Acoust. Soc. Am. 82 (3), September 1987, pages 737 to 797, under the title "Review of Text-to-Speech Conversion for English". These methods are based on general, speaker-independent, language-specific duration statistics and are able to generate correct segmental durations. With the improvement of segmental speech quality, deficiencies in the prosodic processing levels become increasingly audible. Among other things, "wrong" sound durations and a lack of speech rhythm are perceived as particularly disturbing.

The publication "Variability and Stability of Segmental Characteristics under the Aspect of Concatenative Speech Synthesis - Vowels -" by Diane Hirschfeld, TU Dresden, in the conference volume Electrical Speech Signal Processing, pages 94 to 101, and the publication "Neural Prosody Generation - Influence of the Training Data" by O. Jokisch and M. Peschek, TU Dresden, in Advances in Acoustics - DAGA Conference 98 in Zurich, pages 352 and 353, describe the problems of previous speech synthesis systems and specific approaches to solving them. They point out that training-based prosody models, for example using neural networks, represent a solution, especially since they allow flexible adaptation to application or user requirements. The current state of work on a prosody module on a neural basis is presented; a suitable training corpus is discussed, and the practical effects of varying the speaker, the speaking style and the amount of data are demonstrated.

As already shown, with the improvement of segmental speech quality in speech synthesis, deficiencies in the prosodic processing stages become increasingly audible. Among other things, wrong sound durations and a lack of rhythm are perceived as disturbing.

The model by Dennis H. Klatt mentioned above takes only the sound level into account and is therefore unable to realize a natural rhythmic structure. Such a structure, in turn, is important for the intelligibility and listening comfort of synthetic speech.

The invention is therefore based on the object of providing a method for duration control in speech synthesis which eliminates the described deficiencies in the prosodic processing stages, significantly improves the naturalness of synthetic speech, and overcomes the disadvantages of conventional rule-based duration control models by generating a natural speech rhythm from correctly determined sound durations.

The solution according to the invention is given in the characterizing part of patent claim 1.

Further solutions or refinements of the invention are characterized in the characterizing parts of patent claims 2 to 13.

The present method clearly differs from previous approaches, in which duration control was implemented at the sound level with the help of a rule set and speaker-independent duration statistics. In the present method, the naturalness of a speech signal is significantly improved by taking particular account of the temporal, speaker-specific structures that strongly influence it. The present method for speech rhythm control, or duration control, in speech synthesis therefore uses duration statistics obtained from the data of the original speaker. The method uses a multi-level model for duration control in speech synthesis. It models the speech processing that takes place in humans at different levels. These are a phrase level, a syllable level and a sound level, each of which alternatively uses statistical/rule-based or learning methods based on the individual data of the original speaker.

The target durations are therefore calculated independently at different levels, namely the phrase level, the syllable level and the phonetic level. Alternatively, rule-based or learning procedures are used at each level and data exchange is possible between the procedures, which enables the combination of rule-based with learning procedures. The procedures at the individual levels use speaker-specific databases or are trained on the target speaker, in contrast to the general rules previously used for all speakers.

At the phrase level, the duration for each prosodic phrase in a text is calculated depending on the number of syllables in the phrase and the phrase type. Either a rule-based calculation rule that works on the basis of speaker-specific statistics or a neural network that is trained on the speaker is used.

At the syllable level, the syllable duration is calculated for each syllable within a prosodic phrase. Just as at the phrase level, either a learning method or a rule-based approach is used. The methods evaluate various phonetic characteristics, for example stressed or unstressed, or the type of syllable nucleus, and use them to generate the syllable duration. The syllable durations are then adapted to the phrase duration calculated at the phrase level. At the sound level, the syllable duration is divided among the individual sounds. The method used here can again be a rule-based or a learning method.

This creates a new method that enables the modeling of a speaker-specific speech rhythm. The multi-level hybrid structure, which is divided hierarchically into sound, syllable and phrase levels, combines a rule-based method and an artificial neural network. The interfaces between the alternative approaches are defined so that data can be exchanged and the final result is obtained by combining the partial results from different processing stages. In this way, the advantages of the rule-based and the neural method can be optimally exploited, with an increase in overall quality.

Further advantages, features and possible uses of the present invention, in particular the method according to the invention, result from the following description in connection with the exemplary embodiment shown in the drawing.

The invention is described below with reference to an embodiment shown in the drawing. In the description, in the patent claims, in the abstract and in the drawing, the terms and associated reference symbols from the list of reference symbols given below are used. In the drawing:

Fig. 1 shows a basic multi-level model for duration control in speech synthesis.

The multi-stage hybrid structure shown in FIG. 1 is divided hierarchically into sound, syllable and phrase levels and combines rule-based methods and an artificial neural network. The interfaces between the alternative approaches are defined so that data can be exchanged and the final result is obtained by combining the partial results from different processing stages. In this way, the advantages of the rule-based and the neural method can be optimally used, with an increase in overall quality.

The goal of a hybrid data-driven or rule-based rhythm control is the combination of proven knowledge components with the ability to vary the speaking rhythm and even to train speaker-specific features. The strategy takes four aspects into account:

Division of the segment duration control into three representation levels, namely phrase 1, syllable 2 and sound 3, each with its own data node for training and for generating the target duration;

alternatively, each level 1 to 3 runs a neural or a rule-based algorithm 4 or 5 using the same database 7 to 9 corresponding to the respective level 1 to 3;

the extracted prosodic, syllable and sound databases 7, 8, 8' and 9, including the statistical parameters, come from a (variable) speaker who must always be identical to the speaker of the diphone inventory in database 10;

the diphone inventory, that is to say the corresponding database 10 for acoustic synthesis, and the aforementioned prosodic, syllable and sound databases 7, 8, 8' and 9 are based on a variable speaker. The controllable switchover 6 in each level 1 to 3 between neural and rule-based methods or algorithms serves both to combine the two possibilities and to use only one of them.

Level 1 receives its input data from the text analysis 11, both for the artificial neural networks 4 and for the rule-based method 5. The acoustic synthesis 12 takes place with the output data of level 3. It should be emphasized again that the diphone inventory in database 10 for acoustic synthesis is based on a speaker who is identical to that of databases 7 to 9.
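The data flow through the three levels can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `Level`, `check_speaker` and `run_levels`, the dictionary-based feature passing and the string speaker identifiers are all hypothetical. The sketch shows only the per-level selection between the two methods and the requirement that all databases stem from the speaker of the diphone inventory.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A predictor maps level-specific input features to durations in milliseconds.
Predictor = Callable[[dict], List[float]]


@dataclass
class Level:
    name: str                 # "phrase", "syllable" or "sound" (levels 1 to 3)
    database_speaker: str     # speaker behind this level's database (7, 8/8', 9)
    rule_based: Predictor     # rule-based method 5
    neural: Predictor         # learning (neural) method 4
    use_neural: bool = False  # position of the controllable switchover 6

    def predict(self, features: dict) -> List[float]:
        method = self.neural if self.use_neural else self.rule_based
        return method(features)


def check_speaker(levels: List[Level], diphone_speaker: str) -> None:
    """Raise if a level database does not stem from the diphone speaker."""
    for level in levels:
        if level.database_speaker != diphone_speaker:
            raise ValueError(
                f"{level.name} database speaker differs from diphone inventory"
            )


def run_levels(levels: List[Level], features: dict,
               diphone_speaker: str) -> Dict[str, List[float]]:
    """Run the levels in order; each level also sees the upstream results."""
    check_speaker(levels, diphone_speaker)
    results: Dict[str, List[float]] = {}
    for level in levels:
        results[level.name] = level.predict({**features, **results})
    return results
```

In this sketch the output of the last level would be handed to the acoustic synthesis; how the real system passes data between methods is not specified in the text above.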

The following description of the rule-based and neural algorithm used is only intended to provide a better understanding of how it works. The parameters used have to be adapted to the respective language or the respective speaker and are therefore not generalizable.

The neural algorithm used corresponds, for example, to the well-known ELMAN type. From binary-coded (linguistic-phonetic) input attributes, basic values of prosodic contours, here relative segment durations, are trained and then predicted in the recall phase. The input coding depends on the respective processing level, namely the phrase duration level 1, the syllable duration level 2 or the sound level 3. The rule-based or formula-based duration control uses a set of rules or formulas for each level 1 to 3. These rules are extracted from databases 7 to 10 by statistical analysis and model linguistic influencing factors at the processing levels. Level 1 determines the phrase duration for a given prosodic phrase depending on the number of syllables and the type of prosodic phrase (see Formula 1).

$$dur\_phr(n) = k_i \cdot n + c_i \qquad (1)$$

where dur_phr is the phrase duration, n the number of syllables, k_i a factor, c_i a constant and i the phrase type.
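As far as Formula 1 can be recovered, it is a straight line in the number of syllables, dur_phr(n) = k_i * n + c_i, with one factor and one constant per phrase type i. The sketch below evaluates it and also shows one plausible way to obtain the two parameters from speaker data. The least-squares fit is an assumption (the text only says "statistical analysis"), and the parameter values, the phrase type names and the names `PhraseParams`, `phrase_duration` and `fit_params` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class PhraseParams:
    k: float  # factor k_i: milliseconds added per syllable
    c: float  # constant c_i: offset in milliseconds


# Hypothetical values; real ones would come from the target speaker's data.
PARAMS: Dict[str, PhraseParams] = {
    "statement": PhraseParams(k=180.0, c=120.0),
    "question": PhraseParams(k=170.0, c=150.0),
}


def phrase_duration(n_syllables: int, phrase_type: str,
                    params: Dict[str, PhraseParams] = PARAMS) -> float:
    """Formula 1: dur_phr(n) = k_i * n + c_i for phrase type i."""
    if n_syllables < 1:
        raise ValueError("a phrase needs at least one syllable")
    p = params[phrase_type]
    return p.k * n_syllables + p.c


def fit_params(samples: List[Tuple[int, float]]) -> PhraseParams:
    """Least-squares line through (n_syllables, duration_ms) pairs.

    Assumption: all samples belong to one phrase type of one speaker and
    contain at least two different syllable counts.
    """
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    k = sxy / sxx
    return PhraseParams(k=k, c=mean_y - k * mean_x)
```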

The second level calculates the duration of each syllable as a linear function of the number of sounds. Different phonetic properties, for example the accent or the nucleus type, influence the duration of the syllable in different ways: for example, the nucleus type causes a lengthening by a factor, whereas an accented syllable is lengthened by adding a constant, as shown in the following formula.

$$dur\_syl = \begin{cases} dur\_syl \cdot k_{initial} & \text{initial syllable} \\ dur\_syl + c_{acc} & \text{accented syllable} \end{cases} \qquad (2)$$
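The two kinds of influence described for level 2, a multiplicative factor and an additive constant applied to a base duration that is linear in the number of sounds, can be illustrated as follows. The sketch rests on several assumptions: the symbol names `k_initial` and `c_acc` follow the legible parts of Formula 2, while all numeric values, the base-duration parameters `per_sound` and `base`, and the order in which the corrections are applied are hypothetical.

```python
def syllable_duration(n_sounds: int, accented: bool, initial: bool, *,
                      per_sound: float = 70.0, base: float = 40.0,
                      k_initial: float = 1.2, c_acc: float = 35.0) -> float:
    """Rule-based syllable duration in milliseconds (illustrative values)."""
    if n_sounds < 1:
        raise ValueError("a syllable needs at least one sound")
    # Base duration: linear in the number of sounds of the syllable.
    dur = per_sound * n_sounds + base
    # Multiplicative influence: lengthening by a factor.
    if initial:
        dur *= k_initial
    # Additive influence: an accented syllable gains a constant.
    if accented:
        dur += c_acc
    return dur
```

For a three-sound syllable the base duration in this sketch is 250 ms; the accent adds 35 ms, and the factor stretches the base by 20 percent.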

The syllable durations are then adjusted by linear stretching or compression to fit within the duration frame calculated in phrase level 1. Finally, the duration of each sound has to be fitted into the frame of the syllable duration. For this purpose, a stretch factor is calculated iteratively from the given syllable duration and the standard deviations of the sound durations. When a rule-based algorithm is used, the duration of a phrase is determined primarily by the number of syllables, and the parameters for its calculation are determined using statistics from the data of the original speaker. The type of a phrase also affects its length. Depending on the phrase type, the mean phrase duration is corrected using coefficients.

The results of the statistical analysis of syllable duration depending on the number of sounds, the accentuation, the information content, the type of the syllable nucleus, the position of the syllable and the position of the syllable in the phrase are used as the basis for the calculation of the syllable durations. These influencing factors on the syllable duration are expressed by linear dependencies. The determined syllable durations are then added up for each phrase and adapted to the phrase duration determined in level 1 by linear expansion or compression of all syllable durations.
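The adaptation step described here, summing the syllable durations of a phrase and stretching or compressing all of them linearly to match the phrase duration from level 1, can be sketched in a few lines. The function name `fit_to_phrase` is hypothetical.

```python
from typing import List


def fit_to_phrase(syllable_durations: List[float],
                  phrase_duration: float) -> List[float]:
    """Scale all syllable durations by one common factor so that
    their sum equals the phrase duration calculated in level 1."""
    total = sum(syllable_durations)
    if total <= 0:
        raise ValueError("syllable durations must sum to a positive value")
    scale = phrase_duration / total
    return [d * scale for d in syllable_durations]
```

Because a single factor is applied to every syllable, the relative proportions produced by level 2 are preserved while the total matches level 1.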

The calculation of the actual duration of the sound is based on the calculated syllable duration. The different elasticity of the individual sounds is taken into account. It is assumed that all sounds of a syllable are subjected to a constant stretching K.
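The text does not give the formula for this step. One common way to realize a constant stretch K with sound-specific elasticity is to let each sound deviate from its mean duration by K times its standard deviation and to choose K so that the sounds fill the syllable exactly. The sketch below assumes this linear model; the per-sound means and standard deviations stand in for statistics from the sound database, and the name `sound_durations` is hypothetical. In this linear form K has a closed-form solution. A non-linear variant, for which K would have to be found iteratively, is not shown.

```python
from typing import List, Tuple


def sound_durations(stats: List[Tuple[float, float]],
                    syllable_duration: float) -> List[float]:
    """Distribute a syllable duration over its sounds.

    stats holds one (mean_ms, std_ms) pair per sound. Every sound gets
    mean + K * std with the same K, so sounds with a larger standard
    deviation (more elastic sounds) absorb more of the stretch.
    """
    sum_mean = sum(mean for mean, _ in stats)
    sum_std = sum(std for _, std in stats)
    if sum_std <= 0:
        raise ValueError("at least one sound must have a positive std")
    # Solve sum(mean_j + K * std_j) = syllable_duration for K.
    k = (syllable_duration - sum_mean) / sum_std
    return [mean + k * std for mean, std in stats]
```

A practical system would probably also enforce a minimum duration per sound; that safeguard is omitted here.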

The present method is essentially characterized by the following features:

1. Independent calculation of the target duration at different levels 1 to 3,

2. Alternatively, rule-based or learning methods 5 or 4 are used at each level 1 to 3,

3. Data exchange is possible between the methods, which enables the combination of rule-based with learning methods 5 and 4 and

4. The procedures at the individual levels use speaker-specific databases 7 to 9 or are trained on the target speaker (in contrast to the generally applicable rules for all speakers used previously).

In phrase level 1, the duration dur_phr is calculated for each prosodic phrase in a text, depending on the number of syllables in the phrase and the phrase type. Either a rule-based calculation rule that works on the basis of speaker-specific statistics or a neural network that is trained on the speaker is used.
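For the neural alternative, the description names a network of the ELMAN type fed with binary-coded input attributes. The following sketch shows only the forward (recall) pass of a generic Elman network, in which the previous hidden state is fed back as context. It is an illustration under assumptions: layer sizes, activation functions and the random, untrained weights are hypothetical, biases and training are omitted, and the class name `ElmanNetwork` is not taken from the source.

```python
import math
import random
from typing import List


class ElmanNetwork:
    """Simple recurrent network: hidden state is copied to a context layer."""

    def __init__(self, n_in: int, n_hidden: int, n_out: int, seed: int = 0):
        rng = random.Random(seed)

        def matrix(rows: int, cols: int) -> List[List[float]]:
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                    for _ in range(rows)]

        self.w_in = matrix(n_hidden, n_in)       # input -> hidden
        self.w_ctx = matrix(n_hidden, n_hidden)  # context -> hidden
        self.w_out = matrix(n_out, n_hidden)     # hidden -> output
        self.context = [0.0] * n_hidden

    def reset(self) -> None:
        """Clear the context, for example at the start of a new phrase."""
        self.context = [0.0] * len(self.context)

    def step(self, x: List[float]) -> List[float]:
        """Process one binary-coded input vector and return outputs in (0, 1),
        which can be read as relative durations."""
        hidden = [
            math.tanh(sum(w * v for w, v in zip(row_in, x))
                      + sum(w * c for w, c in zip(row_ctx, self.context)))
            for row_in, row_ctx in zip(self.w_in, self.w_ctx)
        ]
        self.context = hidden  # fed back at the next step
        return [
            1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
            for row in self.w_out
        ]
```

Presenting the same input twice gives different outputs because the context changes between steps; this dependence on the preceding units is what makes the recurrent type attractive for modeling rhythm.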

At syllable level 2, the syllable duration is calculated for each syllable within a prosodic phrase. Just as on phrase level 1, either a learning method 4 or a rule-based approach 5 is used for this. The methods evaluate various phonetic characteristics, such as stressed or unstressed, or the type of syllable nucleus, and use these to generate the syllable durations. These durations are then adapted to the phrase duration calculated in phrase level 1. At sound level 3, the syllable duration is divided among the individual sounds. The method used here can again be a rule-based or a learning method.

List of reference numbers

1 level or phrase level

2 level or syllable level

3 level or sound level

4 learning or neural methods

5 rule-based procedures or algorithms

6 switchover

7 phrase database

8, 8' syllable database

9 phonetic database

10 diphone database

11 Text analysis

12 acoustic speech synthesis

Claims

PATENT CLAIMS

1. Method for duration control in speech synthesis to improve the speech quality of text-to-speech systems with the aid of rule-based and learning methods, characterized in

that the duration control, in particular the segment duration control, is divided into a phrase duration level (1), a syllable duration level (2) and a sound duration level (3), each with its own data node for training and for generating the target duration, and

that alternatively a neural and/or a rule-based method (4 or 5) can be selected and runs in each of the levels (1 to 3).

2. The method according to claim 1, characterized in

that the rule-based and the learning methods or algorithms (5, 4) can optionally be combined.

3. The method according to claim 1 or 2, characterized in

that the phrase duration level (1) has a controllable switchover (6) for the learning method (4) and the rule-based method (5) and its own phrase database (7), and that the output of the switchover (6) is passed to the syllable level (2), which has its own syllable database (8) and a further syllable database (8'),

that from an automatic switchover (6) the output signals of the syllable duration level (2) are passed to the inputs of the sound duration level (3) with its own sound database (9), and

that the output signals from the controllable switchover (6) of this level are passed to the acoustic speech synthesis (12) with a diphone database (10).

4. The method according to any one of claims 1 to 3, characterized in

that the phrase duration level (1) receives its input variables from a text analysis (11).

5. The method according to any one of claims 1 to 4, characterized in

that alternatively a neural or a rule-based algorithm runs in each level (1 to 3) using the same database for duration control in speech synthesis.

6. The method according to any one of claims 1 to 5, characterized in

that the extracted prosodic, syllable and sound databases (7 to 9), including the statistical parameters, are each generated from exactly one speaker.

7. The method according to any one of claims 1 to 6, characterized in

that the diphone database (10) for acoustic speech synthesis and the databases (7 to 9) of the levels (1 to 3) are based on an identical speaker.

8. The method according to any one of claims 1 to 7, characterized in

that the calculation of the target durations at the different levels (1 to 3) takes place independently of one another.

9. The method according to any one of claims 1 to 8, characterized in

that the combination of rule-based and learning procedures can be realized through the data exchange between the usable procedures.

10. The method according to any one of claims 1 to 9, characterized in

that on the phrase duration level (1) the duration is calculated for each prosodic phrase of a text depending on the number of syllables in the phrase and the phrase type.

11. The method according to claim 10, characterized in

that either a rule-based algorithm that works on the basis of speaker-specific statistics or a neural network that is trained on the speaker can be selected for further calculation.

12. The method according to any one of claims 1 to 11, characterized in

that the syllable duration is calculated on the syllable level (2) for each syllable within a prosodic phrase and that either a learning process or a rule-based process is used.

13. The method according to any one of claims 1 to 12, characterized in

that the learning or rule-based procedures on the individual levels use speaker-specific databases (7 to 9) or are trained on the target speaker.