EP1589524B1 - Verfahren und Vorrichtung zur Sprachsynthese - Google Patents
- Publication number
- EP1589524B1 EP1589524B1 EP20050447078 EP05447078A EP1589524B1 EP 1589524 B1 EP1589524 B1 EP 1589524B1 EP 20050447078 EP20050447078 EP 20050447078 EP 05447078 A EP05447078 A EP 05447078A EP 1589524 B1 EP1589524 B1 EP 1589524B1
- Authority
- EP
- European Patent Office
- Prior art keywords
- speech
- units
- linguistic
- features
- candidate
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 230000015572 biosynthetic process Effects 0.000 title claims description 26
- 238000000034 method Methods 0.000 title claims description 26
- 238000003786 synthesis reaction Methods 0.000 title claims description 26
- 238000004458 analytical method Methods 0.000 claims description 18
- 238000012549 training Methods 0.000 claims description 8
- 238000004364 calculation method Methods 0.000 claims description 4
- 230000033764 rhythmic process Effects 0.000 claims description 4
- 238000002372 labelling Methods 0.000 claims description 3
- 238000003860 storage Methods 0.000 claims description 2
- 238000012545 processing Methods 0.000 description 8
- 238000003058 natural language processing Methods 0.000 description 7
- 238000012986 modification Methods 0.000 description 4
- 230000004048 modification Effects 0.000 description 4
- 230000007935 neutral effect Effects 0.000 description 4
- 230000003595 spectral effect Effects 0.000 description 4
- 238000001308 synthesis method Methods 0.000 description 4
- 230000001755 vocal effect Effects 0.000 description 4
- 238000003066 decision tree Methods 0.000 description 3
- 230000008859 change Effects 0.000 description 2
- 230000002452 interceptive effect Effects 0.000 description 2
- 238000004519 manufacturing process Methods 0.000 description 2
- 230000011218 segmentation Effects 0.000 description 2
- 230000008901 benefit Effects 0.000 description 1
- 238000010835 comparative analysis Methods 0.000 description 1
- 230000001419 dependent effect Effects 0.000 description 1
- 238000013461 design Methods 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000018109 developmental process Effects 0.000 description 1
- 238000009792 diffusion process Methods 0.000 description 1
- 238000005315 distribution function Methods 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 238000000605 extraction Methods 0.000 description 1
- 230000006870 function Effects 0.000 description 1
- 238000009499 grossing Methods 0.000 description 1
- 230000006872 improvement Effects 0.000 description 1
- 238000003780 insertion Methods 0.000 description 1
- 230000037431 insertion Effects 0.000 description 1
- 238000012417 linear regression Methods 0.000 description 1
- 230000000877 morphologic effect Effects 0.000 description 1
- 238000007781 pre-processing Methods 0.000 description 1
- 230000008569 process Effects 0.000 description 1
- 230000000306 recurrent effect Effects 0.000 description 1
- 238000011160 research Methods 0.000 description 1
- 238000001228 spectrum Methods 0.000 description 1
- 238000013518 transcription Methods 0.000 description 1
- 230000035897 transcription Effects 0.000 description 1
- 230000007704 transition Effects 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/06—Elementary speech units used in speech synthesisers; Concatenation rules
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L13/10—Prosody rules derived from text; Stress or intonation
Definitions
- the present invention is related to a method and device for speech synthesis.
- Natural language processing aims at extracting information that allows reading the text aloud. This information can vary from one system to another but always comprises words, their nature and their phonetisation.
- Units selection aims at choosing speech units that correspond to the information extracted by natural language processing.
- digital signal processing concatenates the selected speech units and, if needed, changes their acoustic characteristics so that required speech signals are obtained.
- these units extracted from read-aloud sequences are diphones, i.e. pieces of speech starting from the middle of a phoneme and ending in the middle of the following phoneme (see Fig.2 ).
- This means that a diphone extends from the stable part of a phoneme till the stable part of the following phoneme and contains, in its middle part, the coarticulation phase characterising the transition from one phoneme to another, which is very difficult to model mathematically.
- diphones as speech units improves speech generation and makes it easier, because concatenation is performed on their stable parts.
- the first systems using vocal databases for synthesis employed only one sample of each diphone.
- the underlying idea was to get rid of acoustic variations present in the diphones and dependent on the elocution: accent, tone, fundamental frequency and duration.
- such diphones merely consist of acoustic parameters describing the vocal tract.
- Fundamental frequency, prosody and duration have to be regenerated during synthesis.
- Diphones may need to undergo some acoustic modifications in order to obtain the required prosodic features. This unfortunately leads to a loss of quality: the synthesised voice seems less natural.
- the prosody remains neutral and listless.
- Neutral speech units constitute an important drawback to overcome, therefore non-uniform units started to be investigated.
- by non-uniform is meant that the speech unit may vary in two ways: in length and in acoustic production.
- Length variation means that the unit is not exclusively a diphone, but may be either shorter or longer. Longer units imply less frequent concatenation problems. However, in some cases, the corpus constitution (an inconsistency or incompleteness) can impose the use of a smaller unit, like a phoneme or half-phoneme. Therefore a variation in terms of length may be considered in both directions.
- Variation in terms of acoustic production means that the same unit has to appear several times in the corpus: for the same unit, there may be several representations with different acoustic realisations. By doing so, units are not neutral anymore; they reflect the variations occurring during the elocution.
- the search for speech units corresponding to the units described by natural language analysis often yields several candidates for each target unit.
- the result of this search is a lattice of possible units, allocated to different positions in the speech signal. Each position corresponds to one unit to be searched for and covers potential candidates found in the corpus (see Fig.3 ). So the challenge is to determine the best sequence of units to be selected in order to generate the speech signal.
- to determine this best sequence, two cost measures are used: the target cost and the concatenation cost.
- the target cost gives the distance between a target unit and units coming from the corpus. It is computed from the features added to each speech unit.
- the concatenation cost estimates the acoustic distance between units to be concatenated.
- the different systems that have been set up determine the concatenation cost between adjacent units in terms of acoustic distance, based on several criteria such as the fundamental frequency, an intensity difference or the spectral distance. Note that said acoustic distance does not necessarily match the real acoustic distance perceived by a listener.
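The two costs can be sketched as weighted sums of feature distances. The following is a minimal illustration, not the patent's implementation; the feature names and the weight values are hypothetical:

```python
def target_cost(target_feats, cand_feats, weights):
    """Weighted mismatch between a target unit's symbolic features
    and a candidate's features (1 per differing feature)."""
    return sum(w * (target_feats[f] != cand_feats[f])
               for f, w in weights.items())

def concatenation_cost(left, right, w_f0=1.0, w_energy=0.5):
    """Weighted acoustic distance at the join between two units,
    measured at their boundaries (fundamental frequency, intensity)."""
    return (w_f0 * abs(left["f0_end"] - right["f0_start"])
            + w_energy * abs(left["energy_end"] - right["energy_start"]))

# Illustrative usage with made-up feature values
t = {"phoneme": "a", "syllables": 2}
c = {"phoneme": "a", "syllables": 3}
w = {"phoneme": 2.0, "syllables": 1.0}
print(target_cost(t, c, w))  # mismatch only on 'syllables'
```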
- the selection of a sequence of units for a particular sentence is expensive in terms of CPU time and memory if no efficient optimisation is used. So far, two kinds of optimisation have been investigated.
- the first optimisation manages the whole selection. A single unit sequence has to be selected from the lattice. This task corresponds to finding the best path in a graph. This is usually solved with dynamic programming by means of the well-known Viterbi algorithm.
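The lattice search can be sketched as a standard Viterbi recursion over the columns of candidates. This is an illustrative implementation under assumed interfaces, where `target_cost(t, j)` scores candidate `j` at position `t` and `concat_cost(a, b)` scores a join between two units:

```python
def best_unit_sequence(lattice, target_cost, concat_cost):
    """Dynamic programming over a unit lattice: lattice[t] is the list
    of candidate units for position t; returns the min-cost sequence."""
    # best[t][j] = (cumulative cost, backpointer into lattice[t-1])
    best = [[(target_cost(0, j), None) for j in range(len(lattice[0]))]]
    for t in range(1, len(lattice)):
        col = []
        for j, cand in enumerate(lattice[t]):
            prev = min(range(len(lattice[t - 1])),
                       key=lambda i: best[t - 1][i][0]
                       + concat_cost(lattice[t - 1][i], cand))
            cost = (best[t - 1][prev][0]
                    + concat_cost(lattice[t - 1][prev], cand)
                    + target_cost(t, j))
            col.append((cost, prev))
        best.append(col)
    # Backtrack from the cheapest candidate in the last column
    j = min(range(len(lattice[-1])), key=lambda k: best[-1][k][0])
    path = [j]
    for t in range(len(lattice) - 1, 0, -1):
        j = best[t][j][1]
        path.append(j)
    path.reverse()
    return [lattice[t][j] for t, j in enumerate(path)]
```

With toy numeric "units", the path with the smallest joint cost is recovered by backtracking through the stored backpointers.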
- the second optimisation method consists in assessing the importance of the different features used to determine the target or concatenation cost. Indeed, not all features may be considered equally important: some features affect the resulting quality more than others. Consequently, the ideal weighting for the selection process has been investigated.
- the proposed systems however apply a manually implemented weighting, which, as a consequence, is competence-based and depends on the operator's expertise rather than on statistical values.
- One possible weighting method suggests forming a network between all sounds of the corpus (see Prosody and Selection of Source Units for Concatenative Synthesis, Campbell and Black, pp. 279-292, Springer-Verlag, 1996). Once this network has been set up, a learning phase can start, aiming at improving the acoustic similarity between a reference sentence and the signal given by the system. This improvement can be achieved by tuning the feature weighting, by successive iterations or by linear regression.
- This method has two inherent drawbacks: on the one hand its computational load, still consuming resources even though performed off-line, and on the other hand the limited number of features the computation can weight. Most of the time, part of the weighting remains to be done manually. In order to reduce the computational load, one can carry out a clustering of sounds to keep only one representative sound, the centroid, on which the selection computation may be performed.
- Another weighting method relies on a corpus representation based on a phonetic and phonological tree (see e.g. 'Non-uniform unit selection and the similarity metric within BT's Laureate TTS system', Breen & Jackson, ESCA/COCOSDA 3rd Workshop on Speech Synthesis, pp. 201-206, Jenolan Caves, Australia, Nov. 26-29, 1998). During the selection, the system looks for candidate units with the same context as the target unit. However, the features used are not automatically weighted.
- Non-uniform units-based systems try to give synthesised speech a more natural character, closer to human speech than that generated by previous systems. This goal is achieved by using non-neutralised units of variable length.
- the performance of such speech synthesis systems is currently limited by the intrinsic weakness of their prosodic models, restricted to some acoustic and symbolic parameters. These models, corpus- or rule-based, are not sufficient as they do not allow a natural prosodic variation of the synthesised sentences. Yet, the quality of prosody depends directly on how listeners perceive synthesised speech.
- the use of such prosodic models shows a major advantage: the selection of acoustic units that are relatively neutral limits discontinuities between units to be concatenated further on. As a consequence, spectral smoothing at unit boundaries can be strongly restricted in order to keep the naturalness of the speech units.
- a speech synthesis system is known wherein a database of diphones derived from natural speech is used.
- a text is first converted into phonetic form and divided into phonemes.
- the converted text is rendered as a series of target diphones and for each of these a number of predetermined diphone features are identified.
- Diphone features may be one or more of phonetic, prosodic, linguistic and acoustic features.
- Potential matches from the database are identified and a target cost for each of these features is established.
- the target costs are modified before selecting a least-cost combination. The modification of the target costs may be done by a simple weighting or by means of distribution functions.
- the present invention aims to provide a speech synthesis method that does not need any prosodic model and that requires little digital signal processing. It also aims to provide a speech synthesis device, operating according to the disclosed synthesis method.
- the present invention relates to a method to synthesise speech, comprising the steps of: applying a linguistic analysis to a sentence to be converted into a speech signal, the analysis producing phonemes to be pronounced and a list of linguistic features associated to each phoneme; selecting candidate speech units exclusively on the basis of selected linguistic features; and forming the speech signal by concatenating speech units chosen among the candidate speech units.
- said selected linguistic features are determined in a training step preceding the above-mentioned steps.
- the step of selecting candidate speech units is performed using a database comprising information on phonemes and at least their linguistic features.
- the information on the linguistic features comprises a weighting coefficient for each linguistic feature.
- the weighting coefficients typically result from an automatic weighting procedure.
- the information is obtained from a step of labelling and segmenting a corpus.
- the speech units are diphonic units.
- a target cost is calculated for each candidate cluster.
- for each candidate speech unit, a target cost is calculated from the target costs for the candidate clusters.
- the concatenation of speech units is performed taking into account said target cost as well as a concatenation cost.
- the linguistic features comprise features from the group ⁇ surrounding phonemes, emphasis information, number of syllables, syllables, word location, number of words, rhythm group information ⁇ .
- the invention relates to a speech synthesis device comprising a linguistic analysis engine producing phonemes to be pronounced and, associated to each phoneme, a list of linguistic features; storage means for storing a database comprising information on phonemes and at least their linguistic features; speech unit selection means for selecting candidate speech units exclusively on the basis of selected linguistic features; and synthesis means for concatenating the speech units selected by the selection means.
- the speech synthesis device further comprises calculation means for computing automatically a weighting coefficient for each linguistic feature.
- Fig. 1 represents a Text-to-Speech Synthesiser system.
- Fig. 2 represents the segmentation into phonemes and diphones. "_" corresponds to silence.
- Fig. 3 represents a lattice network for the diphone sequence of the word 'speech'.
- Fig. 4 represents the steps of the method according to the present invention.
- the present invention discloses a speech units selection system freed from any prosodic model (either acoustic or symbolic) that allows more prosodic variations in synthesised sentences, thereby applying little signal processing at the units' boundaries.
- speech units selection in the method according to the present invention is exclusively based on a features set selected among linguistic information provided by language analysis.
- any prosodic model, either rule- or corpus-based, relies on a list of linguistic features that makes it possible to choose values for any acoustic or symbolic feature of the model.
- a prosodic model is just an acoustic and symbolic synthesis of linguistic features.
- the prosodic model is deterministic: from a finite list of linguistic features, this model always deduces the same prosodic features. Language however is not deterministic. Indeed, the same speaker could pronounce a given sentence with a single linguistic analysis, in different ways. Parameters having an influence on the pronunciation and prosody of this sentence can be affective or intellective.
- the synthesis method according to the invention is divided into a training and a run-time phase. In both phases, the same linguistic analysis engine is used for the linguistic features extraction, giving thus some homogeneity to the system.
- In the training phase it is necessary to list the relevant linguistic features for selecting the units. Once this list is obtained, the further training consists in a labelling and a segmentation of the corpus, as well as a weighting of the linguistic features. Note that in text-to-speech synthesis, a spoken language corpus is always paired with a written corpus that is its transcription. The written corpus helps in choosing labels and features for each unit of the spoken language corpus.
- the spoken language corpus may as well be called a speech units corpus or a speech units database.
- the run-time phase is carried out on a sentence applied to the synthesis system input. First the linguistic sentence is analysed. Then candidate speech units are selected based on selected linguistic features. Lastly, selected units are concatenated in order to form the speech signal corresponding to the sentence. Both phases are now presented in detail.
- the features selection is intrinsically linked to the linguistic analysis engine, the capabilities of which determine the amount of available linguistic information.
- the exclusive use of linguistic features for selection forces one to add supplementary, prosody-affecting information to the typically used features (such as the phonemes around the target, syllabification, the number of syllables in the word and the location of words in the sentence).
- Linguistic features other than the very common ones, such as the phonemes surrounding the target unit and the number of syllables in the word, are rarely used in state-of-the-art systems. Consequently, the analysis engine must be powerful enough to determine the required additional information.
- Said additional information comprises:
- each sentence of the written corpus is annotated as follows: the number of words and the position of the words in the sentence, the syllabification and phonetisation of the words, and a synthesis in terms of articulatory criteria of the phonemic contexts of each phoneme.
- the annotation elements are then discretised as integer values and stored into a linguistic units database wherein each phoneme is linked with its own linguistic features.
- the sentences of the spoken language corpus are segmented into phonemes and diphones. All phonemes occurring in the speech units corpus are then collected. For each phoneme the acoustic features useful for the concatenation cost are calculated and also added to the speech units corpus.
- these acoustic features are the fundamental frequency, the LPC (Linear Predictive Coding) coefficients and the intensity.
- this number is set at 7 clusters of acoustic representations of one phoneme, distributed according to their duration d:
  1. d ≤ M - 2D
  2. M - 2D ≤ d ≤ M - D
  3. M - D ≤ d ≤ M - D/2
  4. M - D/2 ≤ d ≤ M + D/2
  5. M + D/2 ≤ d ≤ M + D
  6. M + D ≤ d ≤ M + 2D
  7. d > M + 2D
  where M denotes the mean duration of all representations for one phoneme, and D the standard deviation of these durations.
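The seven duration intervals above can be applied directly; a small sketch, with `M` and `D` as defined in the text (upper bounds are taken as inclusive, matching the interval list):

```python
def duration_cluster(d, M, D):
    """Assign a phoneme representation of duration d to one of the
    seven clusters defined by the mean duration M and the standard
    deviation D of all representations of that phoneme."""
    bounds = [M - 2*D, M - D, M - D/2, M + D/2, M + D, M + 2*D]
    for k, b in enumerate(bounds, start=1):
        if d <= b:       # first interval whose upper bound covers d
            return k
    return 7             # d > M + 2D
```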
- the (fully automatic) linguistic features weighting may start.
- the objective is to determine to what extent each feature allows discriminating between the clusters, whereby each cluster is seen as a class to be selected or a decision to be taken.
- the most appropriate method to do this is by using a decision tree.
- Decision tree building relies on the concept of entropy. Computing the entropy for a list of features allows classifying them according to their intrinsic information: the more a feature i reduces the uncertainty about which cluster C to select, the more informative and relevant it is.
- the relevance of feature i is computed as the gain ratio GR(i, C), i.e. the ratio of the Information Gain IG(i, C) to the Split Information SI(C).
- the Split Information normalises the Information Gain of a given feature by taking into account the number of different values this feature can take.
- the Gain Ratio determines the ranking of the features across the decision tree levels, and also provides the weights of the features during the target cost calculation.
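A minimal sketch of this computation, following the standard Quinlan-style definition (here the Split Information is taken as the entropy of the feature's value distribution, which is an assumption about the exact formula used in the patent):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(feature_values, clusters):
    """GR = IG / SI: information gain of the feature about the cluster
    membership, normalised by the feature's split information."""
    n = len(clusters)
    # Information Gain: entropy reduction when splitting on the feature
    ig = entropy(clusters)
    for v in set(feature_values):
        subset = [c for f, c in zip(feature_values, clusters) if f == v]
        ig -= len(subset) / n * entropy(subset)
    si = entropy(feature_values)  # Split Information of the feature
    return ig / si if si > 0 else 0.0
```

A feature that perfectly predicts the cluster gets a gain ratio of 1; a feature carrying no information about the cluster gets 0.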
- the weighting coefficients are also stored in the database.
- At run-time, each time a sentence enters the system, the linguistic analysis generates the corresponding phonemes as well as a list of linguistic features associated to each of them. Every pair ⟨phoneme, features⟩ is defined as a target.
- the speech units selection occurs in three steps:
- diphonic units to be selected are only those that can be formed from adjacent phonemic candidates in the speech units corpus. However, if a target diphone does not have any candidate, candidates are created that contain the target phoneme on the left-hand or right-hand side only, according to the diphone needed.
- the units selection is performed in a traditional way, by solving the lattice with the Viterbi algorithm. In this way the path through the lattice of diphones that minimises the double cost ⟨target, concatenation⟩ is selected.
- the target cost was already pre-computed at the pre-selection stage, whereas the concatenation cost is determined when running through the lattice.
- the concatenation cost has been defined as the acoustic distance between the units to be concatenated. To calculate this distance, the system thus needs acoustic features, taken at the boundaries of the units to be concatenated: fundamental frequency, spectrum, energy and duration. The distance, and thus the cost, is obtained by adding up:
- Figure 4 shows a block scheme of a text-to-speech synthesis system that implements the method of the invention.
- the system is split into three blocks, each corresponding to one of the steps of the run-time phase as described above : the NLP (Natural Language Processing), the USP (Units Selection Processing) and the DSP (Digital Signal Processing).
- the input to the system is the text that is to be transformed into speech.
- the output of the system is a speech signal concatenated from non-uniform speech units.
- Each block uses databases.
- the NLP loads linguistic databases (DBA) for each task (pre-processing, morphological analysis, etc.).
- the DSP loads the Speech Units Database, from which speech units are selected and concatenated into a speech signal.
- the USP in between, loads a Linguistic Units Database, comprising a set of triplets ⁇ phoneme, linguistic features, position ⁇ .
- the first pair, ⁇ phoneme, linguistic features ⁇ describes a unit from the Speech Units Database.
- the last element, position, is the position in milliseconds of the unit in the Speech Units Database. Thus both databases describe and store candidate units, and are aligned thanks to the position feature.
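The triplet layout and the pre-selection it enables can be illustrated as follows; the field names and example values are hypothetical, not taken from the patent:

```python
from dataclasses import dataclass

@dataclass
class CandidateUnit:
    phoneme: str        # the unit's phoneme label
    features: dict      # discretised linguistic features (integer codes)
    position_ms: int    # offset of the unit in the Speech Units Database

# The Linguistic Units Database as a list of such triplets
linguistic_db = [
    CandidateUnit("s", {"n_words": 4, "word_pos": 1}, 120),
    CandidateUnit("p", {"n_words": 4, "word_pos": 1}, 185),
]

def preselect(db, target_phoneme):
    """Pre-selection: all candidate units matching the target phoneme.
    The position_ms field then locates each candidate's waveform in
    the Speech Units Database."""
    return [u for u in db if u.phoneme == target_phoneme]
```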
- the NLP block aims at analysing the input text in order to generate a list of target units (T 1 , T 2 ,..., T n ). Each target unit is a pair ⁇ phoneme, linguistic features ⁇ .
- the second block, USP works in three steps.
- First, the USP selects from the Linguistic Units Database a set of phonemic candidates for each target unit. A target cost computation is performed for each candidate. Candidate diphonic units are then determined together with their target cost and a lattice of weighted diphones is created, one diphone for each pair of adjacent phonemes. Next, it selects by dynamic programming the best path of diphones through the lattice.
- the DSP block takes selected diphones from the Speech Units Database. Then, it concatenates them acoustically, using a technique of the OverLap And Add type: pitch values are used to improve the concatenation. No signal processing is necessary other than the concatenation itself. Selected units are concatenated without any discontinuity. As a result, linguistic criteria used in the selection prove their relevance.
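The overlap-and-add join can be sketched as a simple cross-fade between the two waveforms. This is a minimal illustration only; a real system would, as noted above, align the overlap on pitch marks rather than fading at arbitrary samples:

```python
import numpy as np

def overlap_add(a, b, overlap):
    """Concatenate two speech-unit waveforms with a linear cross-fade
    over `overlap` samples (a bare-bones OverLap-And-Add sketch)."""
    fade = np.linspace(1.0, 0.0, overlap)   # fade-out for a, fade-in for b
    joined = np.concatenate([
        a[:-overlap],
        a[-overlap:] * fade + b[:overlap] * (1.0 - fade),
        b[overlap:],
    ])
    return joined
```

Because the fade-out and fade-in weights sum to one at every sample, joining two units that already match at the boundary introduces no discontinuity.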
- the technology can for example be used for advertisement diffusion (broadcasting) in shopping centres. Advertisements in shopping centres must change frequently, which creates a frequent and expensive need for professional speakers.
- the proposed synthesis method requires the services of a professional speaker only once, and subsequently allows any written text to be pronounced without additional cost.
- Another application could be directed to information for travellers in railway stations and airports and the like.
- the synthesis system according to the present invention can easily solve this problem.
- Speech synthesis can also generate fluent interactive dialogues. This is related to dialogue systems able to model a conversation and to automatically generate text in order to interact with the user.
- Two traditional examples are interactive terminals in stations, airports and shopping centres, as well as vocal servers that are accessible by phone.
- Systems currently used in this context are strongly limited: based on pieces of pre-recorded sentences, they are restricted to some basic syntactic structures. Moreover, the result obtained is less natural, because of prosodic discontinuities at word or word-group boundaries.
- the synthesis by non-uniform units selection using linguistic criteria is the ideal solution to get rid of these drawbacks, as it is not limited in terms of syntactic structures.
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Machine Translation (AREA)
Claims (13)
- A method for synthesising speech, comprising the following steps: applying a linguistic analysis to a sentence to be converted into a speech signal, the analysis producing phonemes to be pronounced and a list of linguistic features associated to each phoneme; selecting candidate speech units exclusively on the basis of selected linguistic features; forming the speech signal by concatenating speech units chosen from the candidate speech units.
- A method for synthesising speech as in claim 1, wherein the selected linguistic features are determined in a preceding training step.
- A method for synthesising speech as in claim 1 or 2, wherein the step of selecting candidate speech units is performed by means of a database comprising information on phonemes and at least their linguistic features.
- A method for synthesising speech as in claim 3, wherein the information on the linguistic features comprises a weighting coefficient for each linguistic feature and the weighting coefficients result from an automatic weighting procedure.
- A method for synthesising speech as in claim 3 or 4, wherein the information is obtained from a step of labelling and segmenting a corpus.
- A method for synthesising speech as in any of claims 1 to 5, wherein the step of selecting speech units comprises the substeps of: selecting candidate clusters of acoustic representations for each phoneme, and computing candidate speech units from the selected candidate clusters.
- A method as in any of the preceding claims, wherein the speech units are diphonic units.
- A method as in claim 6, wherein a target cost is calculated for each candidate cluster.
- A method as in claim 8, wherein for each candidate speech unit a target cost is calculated from the target costs for the candidate clusters.
- A method as in claim 8 or 9, wherein each concatenation of speech units is performed taking into account said target cost as well as a concatenation cost.
- A method for synthesising speech as in any of claims 1 to 10, wherein the linguistic features comprise features from the group {surrounding phonemes, emphasis information, number of syllables, syllables, word locations, number of words, rhythm group information}.
- A speech synthesis device comprising: a linguistic analysis engine for producing phonemes to be pronounced and a list of linguistic features associated to each phoneme; storage means for storing a database comprising information on phonemes and at least their linguistic features; speech unit selection means for selecting candidate speech units exclusively on the basis of selected linguistic features; and synthesis means for concatenating the speech units selected by the selection means.
- A speech synthesis device as in claim 12, further comprising calculation means for automatically computing a weighting coefficient for each linguistic feature.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP20050447078 EP1589524B1 (de) | 2004-04-15 | 2005-04-08 | Verfahren und Vorrichtung zur Sprachsynthese |
Applications Claiming Priority (5)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US56238204P | 2004-04-15 | 2004-04-15 | |
US562382P | 2004-04-15 | ||
EP04447212 | 2004-09-27 | ||
EP04447212A EP1640968A1 (de) | 2004-09-27 | 2004-09-27 | Verfahren und Vorrichtung zur Sprachsynthese |
EP20050447078 EP1589524B1 (de) | 2004-04-15 | 2005-04-08 | Verfahren und Vorrichtung zur Sprachsynthese |
Publications (2)
Publication Number | Publication Date |
---|---|
EP1589524A1 EP1589524A1 (de) | 2005-10-26 |
EP1589524B1 true EP1589524B1 (de) | 2008-03-12 |
Family
ID=34943276
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP20050447078 Active EP1589524B1 (de) | 2004-04-15 | 2005-04-08 | Verfahren und Vorrichtung zur Sprachsynthese |
Country Status (1)
Country | Link |
---|---|
EP (1) | EP1589524B1 (de) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018167522A1 (en) * | 2017-03-14 | 2018-09-20 | Google Llc | Speech synthesis unit selection |
IT201800005283A1 (it) * | 2018-05-11 | 2019-11-11 | Rimodulatore del timbro vocale |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB0112749D0 (en) * | 2001-05-25 | 2001-07-18 | Rhetorical Systems Ltd | Speech synthesis |
-
2005
- 2005-04-08 EP EP20050447078 patent/EP1589524B1/de active Active
Also Published As
Publication number | Publication date |
---|---|
EP1589524A1 (de) | 2005-10-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US7124083B2 (en) | Method and system for preselection of suitable units for concatenative speech | |
- JP4302788B2 (ja) | Prosodic database containing fundamental frequency templates for speech synthesis | |
US7565291B2 (en) | Synthesis-based pre-selection of suitable units for concatenative speech | |
EP1138038B1 (de) | Sprachsynthese durch verkettung von sprachwellenformen | |
US7869999B2 (en) | Systems and methods for selecting from multiple phonectic transcriptions for text-to-speech synthesis | |
US20200410981A1 (en) | Text-to-speech (tts) processing | |
US11763797B2 (en) | Text-to-speech (TTS) processing | |
EP1643486A1 (de) | Verfahren und Vorrichtung zur Verhinderung des Sprachverständnisses eines interaktiven Sprachantwortsystem | |
Latorre et al. | New approach to the polyglot speech generation by means of an HMM-based speaker adaptable synthesizer | |
- JP2007249212 (ja) | Method, computer program and processor for text-to-speech synthesis | |
US10699695B1 (en) | Text-to-speech (TTS) processing | |
CN101131818A (zh) | 语音合成装置与方法 (Speech synthesis apparatus and method) | |
Dutoit | A short introduction to text-to-speech synthesis | |
EP1589524B1 (de) | Verfahren und Vorrichtung zur Sprachsynthese (Method and apparatus for speech synthesis) | |
JPH08335096 (ja) | テキスト音声合成装置 (Text-to-speech synthesis apparatus) | |
EP1640968A1 (de) | Verfahren und Vorrichtung zur Sprachsynthese (Method and apparatus for speech synthesis) | |
Ronanki et al. | The CSTR entry to the Blizzard Challenge 2017 | |
Bruce et al. | On the analysis of prosody in interaction | |
Dong et al. | A Unit Selection-based Speech Synthesis Approach for Mandarin Chinese. | |
Latorre et al. | New approach to polyglot synthesis: How to speak any language with anyone's voice | |
Klabbers | Text-to-Speech Synthesis | |
Demenko et al. | The design of polish speech corpus for unit selection speech synthesis | |
Heggtveit et al. | Intonation Modelling with a Lexicon of Natural F0 Contours | |
EP1501075B1 (de) | Sprachsynthese mittels Verknüpfung von Sprachwellenformen (Speech synthesis by concatenation of speech waveforms) | |
Natvig et al. | Prosodic unit selection for text-to-speech synthesis |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase |
Free format text: ORIGINAL CODE: 0009012 |
|
AK | Designated contracting states |
Kind code of ref document: A1 | Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LI LT LU MC NL PL PT RO SE SI SK TR |
|
AX | Request for extension of the european patent |
Extension state: AL BA HR LV MK YU |
|
17P | Request for examination filed |
Effective date: 20060206 |
|
AKX | Designation fees paid |
Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LI LT LU MC NL PL PT RO SE SI SK TR |
|
GRAP | Despatch of communication of intention to grant a patent |
Free format text: ORIGINAL CODE: EPIDOSNIGR1 |
|
GRAS | Grant fee paid |
Free format text: ORIGINAL CODE: EPIDOSNIGR3 |
|
GRAA | (expected) grant |
Free format text: ORIGINAL CODE: 0009210 |
|
AK | Designated contracting states |
Kind code of ref document: B1 | Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LI LT LU MC NL PL PT RO SE SI SK TR |
|
REG | Reference to a national code |
Ref country code: GB Ref legal event code: FG4D |
|
REG | Reference to a national code |
Ref country code: CH Ref legal event code: EP |
|
REG | Reference to a national code |
Ref country code: IE Ref legal event code: FG4D |
|
REF | Corresponds to: |
Ref document number: 602005005241 | Country of ref document: DE | Date of ref document: 20080424 | Kind code of ref document: P |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: FI | Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT | Effective date: 20080312 |
Ref country code: LT | Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT | Effective date: 20080312 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: AT | Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT | Effective date: 20080312 |
|
NLV1 | Nl: lapsed or annulled due to failure to fulfill the requirements of art. 29p and 29m of the patents act |
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: PL | Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT | Effective date: 20080312 |
Ref country code: SI | Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT | Effective date: 20080312 |
|
ET | Fr: translation filed |
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: SK | Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT | Effective date: 20080312 |
Ref country code: PT | Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT | Effective date: 20080818 |
Ref country code: CZ | Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT | Effective date: 20080312 |
Ref country code: SE | Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT | Effective date: 20080612 |
Ref country code: ES | Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT | Effective date: 20080623 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: RO | Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT | Effective date: 20080312 |
Ref country code: NL | Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT | Effective date: 20080312 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: IS | Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT | Effective date: 20080712 |
|
PLBE | No opposition filed within time limit |
Free format text: ORIGINAL CODE: 0009261 |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: NO OPPOSITION FILED WITHIN TIME LIMIT |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: DE | Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT | Effective date: 20080613 |
Ref country code: EE | Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT | Effective date: 20080312 |
Ref country code: DK | Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT | Effective date: 20080312 |
|
26N | No opposition filed |
Effective date: 20081215 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: BG | Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT | Effective date: 20080612 |
Ref country code: IE | Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES | Effective date: 20080408 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: IT | Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT | Effective date: 20080312 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: CY | Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT | Effective date: 20080312 |
|
REG | Reference to a national code |
Ref country code: CH Ref legal event code: PL |
|
GBPC | Gb: european patent ceased through non-payment of renewal fee |
Effective date: 20090408 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: LI | Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES | Effective date: 20090430 |
Ref country code: CH | Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES | Effective date: 20090430 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: GB | Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES | Effective date: 20090408 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: HU | Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT | Effective date: 20080913 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: TR | Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT | Effective date: 20080312 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: GR | Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT | Effective date: 20080613 |
|
REG | Reference to a national code |
Ref country code: FR | Ref legal event code: PLFP | Year of fee payment: 12 |
|
REG | Reference to a national code |
Ref country code: FR | Ref legal event code: PLFP | Year of fee payment: 13 |
|
REG | Reference to a national code |
Ref country code: FR | Ref legal event code: PLFP | Year of fee payment: 14 |
|
PGFP | Annual fee paid to national office [announced via postgrant information from national office to epo] |
Ref country code: LU | Payment date: 20240320 | Year of fee payment: 20 |
|
PGFP | Annual fee paid to national office [announced via postgrant information from national office to epo] |
Ref country code: MC | Payment date: 20240325 | Year of fee payment: 20 |
|
PGFP | Annual fee paid to national office [announced via postgrant information from national office to epo] |
Ref country code: FR | Payment date: 20240320 | Year of fee payment: 20 |
Ref country code: BE | Payment date: 20240320 | Year of fee payment: 20 |