EP1589524B1 - Method and device for speech synthesis - Google Patents

Method and device for speech synthesis

Info

Publication number
EP1589524B1
Authority
EP
European Patent Office
Prior art keywords
speech
units
linguistic
features
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
EP20050447078
Other languages
German (de)
English (en)
Other versions
EP1589524A1 (fr)
Inventor
Vincent Colotte
Richard Beaufort
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Multitel ASBL
Original Assignee
Multitel ASBL
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from EP04447212A external-priority patent/EP1640968A1/fr
Application filed by Multitel ASBL filed Critical Multitel ASBL
Priority to EP20050447078 priority Critical patent/EP1589524B1/fr
Publication of EP1589524A1 publication Critical patent/EP1589524A1/fr
Application granted granted Critical
Publication of EP1589524B1 publication Critical patent/EP1589524B1/fr
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/06Elementary speech units used in speech synthesisers; Concatenation rules
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/08Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10Prosody rules derived from text; Stress or intonation

Definitions

  • the present invention is related to a method and device for speech synthesis.
  • Natural language processing aims at extracting information that allows reading the text aloud. This information can vary from one system to another but always comprises words, their nature and their phonetisation.
  • Units selection aims at choosing speech units that correspond to the information extracted by natural language processing.
  • digital signal processing concatenates the selected speech units and, if needed, changes their acoustic characteristics so that required speech signals are obtained.
  • these units extracted from read-aloud sequences are diphones, i.e. pieces of speech starting from the middle of a phoneme and ending in the middle of the following phoneme (see Fig. 2).
  • This means that a diphone extends from the stable part of a phoneme to the stable part of the following phoneme and contains, in its middle part, the coarticulation phase characterising the transition from one phoneme to another, which is very difficult to model mathematically.
  • the use of diphones as speech units improves speech generation and makes it easier, because concatenation is performed on their stable parts.
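For concreteness, a minimal sketch (in Python, with illustrative phoneme labels) of how a phoneme sequence maps onto diphone labels as just described:

```python
# Minimal sketch: mapping a phoneme sequence to diphone labels.
# "_" denotes silence, as in Fig. 2. Labels are illustrative.

def to_diphones(phonemes):
    """Return the diphone labels covering a phoneme sequence.

    Each diphone runs from the middle (stable part) of one phoneme
    to the middle of the next, so n phonemes yield n-1 diphones.
    """
    return [f"{a}-{b}" for a, b in zip(phonemes, phonemes[1:])]

# The word "speech" framed by silences:
print(to_diphones(["_", "s", "p", "i:", "tS", "_"]))
# ['_-s', 's-p', 'p-i:', 'i:-tS', 'tS-_']
```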
  • the first systems using vocal databases for synthesis employed only one sample of each diphone.
  • the underlying idea was to get rid of acoustic variations present in the diphones and dependent on the moment of elocution: accent, tone, fundamental frequency and duration.
  • in such systems, diphones are merely acoustic parameters describing the vocal tract.
  • Fundamental frequency, prosody and duration have to be regenerated during synthesis.
  • Diphones may need to undergo some acoustic modifications in order to obtain the required prosodic features. This unfortunately leads to a loss of quality: the synthesised voice seems less natural.
  • the prosody remains neutral and listless.
  • Neutral speech units constitute an important drawback to overcome, therefore non-uniform units started to be investigated.
  • by non-uniform is meant that the speech unit may change in two ways: length and acoustic production.
  • Length variation means that the unit is not exclusively a diphone, but may be either shorter or longer. Longer units imply less frequent concatenation problems. However, in some cases, the corpus constitution (an inconsistency or incompleteness) can impose the use of a smaller unit, like a phoneme or half-phoneme. Therefore a variation in terms of length may be considered in both directions.
  • Variation in terms of acoustic production means that the same unit has to appear several times in the corpus: for the same unit, there may be several representations with different acoustic realisations. By doing so, units are not neutral anymore; they reflect the variations occurring during the elocution.
  • the search for speech units corresponding to the units described by natural language analysis often yields several candidates for each target unit.
  • the result of this search is a lattice of possible units, allocated to different positions in the speech signal. Each position corresponds to one unit to be searched for and covers potential candidates found in the corpus (see Fig. 3). So the challenge is to determine the best sequence of units to be selected in order to generate the speech signal.
  • to determine this best sequence, the target cost and the concatenation cost should be used.
  • the target cost gives the distance between a target unit and units coming from the corpus. It is computed from the features added to each speech unit.
  • the concatenation cost estimates the acoustic distance between units to be concatenated.
  • the different systems that have been set up determine the concatenation cost between adjacent units in terms of acoustic distance, based on several criteria such as the fundamental frequency, an intensity difference or the spectral distance. Note that said acoustic distance does not necessarily correspond to the acoustic distance actually perceived by a listener.
  • the selection of a sequence of units for a particular sentence is expensive in terms of CPU time and memory if no efficient optimisation is used. So far, two kinds of optimisation have been investigated.
  • the first optimisation manages the whole selection. A single unit sequence has to be selected from the lattice. This task corresponds to finding the best path in a graph. This is usually solved with dynamic programming by means of the well-known Viterbi algorithm.
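By way of illustration, a minimal sketch of this dynamic-programming search over the candidate lattice; `target_cost` and `concat_cost` are assumed callables standing in for the costs described above, and the demo values are invented:

```python
# Illustrative sketch (not the patented implementation): selecting the
# lowest-cost path through a unit lattice with the Viterbi recursion.

def viterbi(lattice, target_cost, concat_cost):
    """lattice: list of positions, each a list of candidate units.
    Returns the candidate sequence minimising the summed costs."""
    # best[i][c] = (cumulative cost, back-pointer) for candidate c at position i
    best = [{c: (target_cost(0, c), None) for c in lattice[0]}]
    for i in range(1, len(lattice)):
        layer = {}
        for c in lattice[i]:
            prev, cost = min(
                ((p, best[i - 1][p][0] + concat_cost(p, c)) for p in lattice[i - 1]),
                key=lambda pc: pc[1],
            )
            layer[c] = (cost + target_cost(i, c), prev)
        best.append(layer)
    # Backtrack from the cheapest final candidate.
    c = min(best[-1], key=lambda k: best[-1][k][0])
    path = [c]
    for i in range(len(lattice) - 1, 0, -1):
        c = best[i][c][1]
        path.append(c)
    return list(reversed(path))

# Toy demo with invented candidates and costs:
lattice = [["s1", "s2"], ["p1"], ["i1", "i2"]]
tc = lambda i, c: 0.1 * int(c[-1])                 # toy target cost
cc = lambda a, b: 0.0 if a[-1] == b[-1] else 0.5   # toy concatenation cost
print(viterbi(lattice, tc, cc))  # ['s1', 'p1', 'i1']
```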
  • the second optimisation method consists in assessing the importance of the different features used to determine the target or concatenation cost. Indeed, not all features can be considered equally important: some affect the resulting quality more than others. Consequently, the ideal weighting for the selection process has been investigated.
  • the proposed systems however apply a manually implemented weighting, which, as a consequence, is competence-based and depends on the operator's expertise rather than on statistical values.
  • One possible weighting method suggests forming a network between all sounds of the corpus (see 'Prosody and Selection of Source Units for Concatenative Synthesis', Campbell and Black, pp. 279-292, Springer-Verlag, 1996). Once this network has been set up, a learning phase can start, aiming at improving the acoustic similarity between a reference sentence and the signal given by the system. This improvement can be achieved by tuning the feature weighting, by successive iterations or by linear regression.
  • This method has two inherent drawbacks: on the one hand its computational load, which consumes resources even though it is performed off-line, and on the other hand the limited number of features the computation can weight. Most of the time, part of the weighting remains to be done manually. In order to reduce the computational load, one can carry out a clustering of sounds so as to keep only one representative sound, the centroid, on which the selection computation may be performed.
  • Another weighting method relies on a corpus representation based on a phonetic and phonologic tree (see e.g. 'Non-uniform unit selection and the similarity metric within BT's Laureate TTS system', Breen & Jackson, ESCA/COCOSDA 3rd Workshop on Speech Synthesis, pp. 201-206, Jenolan Caves, Australia, Nov. 26-29, 1998). During the selection, they look for candidate units with the same context as the target unit. However, the features they use are not automatically weighted.
  • Non-uniform units-based systems try to give synthesised speech a more natural character, closer to human speech than that generated by previous systems. This goal is achieved by using non-neutralised units of variable length.
  • the performance of such speech synthesis systems is currently limited by the intrinsic weakness of their prosodic models, restricted to some acoustic and symbolic parameters. These models, corpus- or rule-based, are not sufficient as they do not allow a natural prosodic variation of the synthesised sentences. Yet, the quality of prosody depends directly on how listeners perceive synthesised speech.
  • the use of such prosodic models shows a major advantage: the selection of relatively neutral acoustic units limits discontinuities between the units to be concatenated further on. As a consequence, spectral smoothing at unit boundaries can be strongly restricted in order to keep the naturalness of the speech units.
  • a speech synthesis system is also known wherein a database of diphones derived from natural speech is used.
  • a text is first converted into phonetic form and divided into phonemes.
  • the converted text is rendered as a series of target diphones and for each of these a number of predetermined diphone features are identified.
  • Diphone features may be one or more of phonetic, prosodic, linguistic and acoustic features.
  • Potential matches from the database are identified and a target cost for each of these features is established.
  • the target costs are modified before selecting a least-cost combination. The modification of the target costs may be done by a simple weighting or by means of distribution functions.
  • the present invention aims to provide a speech synthesis method that does not need any prosodic model and that requires little digital signal processing. It also aims to provide a speech synthesis device, operating according to the disclosed synthesis method.
  • the present invention relates to a method to synthesise speech, comprising the steps of applying a linguistic analysis to a sentence to be transformed into a speech signal, said analysis generating phonemes to be pronounced and, associated with each phoneme, a list of linguistic features; selecting candidate speech units exclusively based on selected linguistic features; and forming said speech signal by concatenating the selected speech units.
  • said selected linguistic features are determined in a training step preceding the above-mentioned steps.
  • the step of selecting candidate speech units is performed using a database comprising information on phonemes and at least their linguistic features.
  • the information on the linguistic features comprises a weighting coefficient for each linguistic feature.
  • the weighting coefficients typically result from an automatic weighting procedure.
  • the information is obtained from a step of labelling and segmenting a corpus.
  • the speech units are diphonic units.
  • a target cost is calculated for each candidate cluster.
  • for each candidate speech unit, a target cost is calculated from the target costs for the candidate clusters.
  • the concatenation of speech units is performed taking into account said target cost as well as a concatenation cost.
  • the linguistic features comprise features from the group {surrounding phonemes, emphasis information, number of syllables, syllables, word location, number of words, rhythm group information}.
  • the invention relates to a speech synthesis device comprising a linguistic analysis engine producing phonemes to be pronounced and, associated with each phoneme, a list of linguistic features, storage means for a database comprising information on phonemes and at least their linguistic features, speech unit selection means for selecting candidate speech units exclusively based on selected linguistic features, and synthesising means for concatenating the selected speech units.
  • the speech synthesis device further comprises calculation means for computing automatically a weighting coefficient for each linguistic feature.
  • Fig. 1 represents a Text-to-Speech Synthesiser system.
  • Fig. 2 represents the segmentation into phonemes and diphones. "_" corresponds to silence.
  • Fig. 3 represents a lattice network for the diphone sequence of the word 'speech'.
  • Fig. 4 represents the steps of the method according to the present invention.
  • the present invention discloses a speech units selection system freed from any prosodic model (either acoustic or symbolic) that allows more prosodic variations in synthesised sentences, thereby applying little signal processing at the units' boundaries.
  • speech units selection in the method according to the present invention is exclusively based on a features set selected among linguistic information provided by language analysis.
  • any prosodic model, either rule- or corpus-based, relies on a list of linguistic features that allow choosing values for any acoustic or symbolic feature of the model.
  • a prosodic model is just an acoustic and symbolic synthesis of linguistic features.
  • the prosodic model is deterministic: from a finite list of linguistic features, this model always deduces the same prosodic features. Language however is not deterministic. Indeed, the same speaker could pronounce a given sentence, having a single linguistic analysis, in different ways. Parameters having an influence on the pronunciation and prosody of this sentence can be affective or intellective.
  • the synthesis method according to the invention is divided into a training and a run-time phase. In both phases, the same linguistic analysis engine is used for the linguistic features extraction, giving thus some homogeneity to the system.
  • in the training phase it is necessary to list the relevant linguistic features for selecting the units. Once this list is obtained, the further training consists in a labelling and a segmentation of the corpus as well as a weighting of the linguistic features. Note that in text-to-speech synthesis, a spoken language corpus is always paired with a written corpus that is its transcription. The written corpus helps in choosing labels and features for each unit of the spoken language corpus.
  • the spoken language corpus may as well be called a speech units corpus or a speech units database.
  • the run-time phase is carried out on a sentence applied to the synthesis system input. First the sentence is linguistically analysed. Then candidate speech units are selected based on the selected linguistic features. Lastly, the selected units are concatenated in order to form the speech signal corresponding to the sentence. Both phases are now presented in detail.
  • the features selection is intrinsically linked to the linguistic analysis engine, the capabilities of which determine the amount of available linguistic information.
  • the exclusive use of linguistic features for selection forces one to add supplementary, prosody-affecting information to the features typically used (like phonemes around the target, syllabification, number of syllables in the word, location of words in the sentence, ...).
  • State-of-the-art systems rarely use more than very common linguistic features like the phonemes surrounding the target unit and the number of syllables in the word. Consequently, the analysis engine must be powerful enough to determine the required additional information.
  • Said additional information comprises:
  • each sentence of the written corpus is annotated as follows: amount of words and place of the words in the sentence, syllabification and phonetisation of the words, synthesis in terms of articulatory criteria of phonemic contexts for each phoneme.
  • the annotation elements are then discretised as integer values and stored into a linguistic units database wherein each phoneme is linked with its own linguistic features.
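By way of illustration, a hypothetical shape of one such record; the field names and integer encodings are invented for the example, not taken from the patent:

```python
# Hypothetical record in the linguistic units database: every annotation
# element is discretised to an integer and attached to its phoneme.

record = {
    "phoneme": "i:",
    "features": {
        "left_phoneme": 42,     # discretised identity of the previous phoneme
        "right_phoneme": 17,    # discretised identity of the next phoneme
        "n_syllables_word": 1,  # syllables in the carrier word
        "word_position": 2,     # place of the word in the sentence
        "n_words_sentence": 5,  # amount of words in the sentence
        "emphasis": 0,          # emphasis information
    },
}
```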
  • the sentences of the spoken language corpus are segmented into phonemes and diphones. All phonemes occurring in the speech units corpus are then collected. For each phoneme the acoustic features useful for the concatenation cost are calculated and also added to the speech units corpus.
  • acoustic features are the fundamental frequency, LPC (Linear Predictive Coding) coefficients and the intensity.
  • this number is set at 7: the acoustic representations of one phoneme are distributed into clusters according to their duration d:
    1. d ≤ M − 2D
    2. M − 2D < d ≤ M − D
    3. M − D < d ≤ M − D/2
    4. M − D/2 < d ≤ M + D/2
    5. M + D/2 < d ≤ M + D
    6. M + D < d ≤ M + 2D
    7. d > M + 2D
    where M denotes the mean duration of all representations of one phoneme, and D the standard deviation of these durations.
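A small sketch of this 7-way clustering rule, with invented duration values:

```python
# Sketch of the duration clustering above: M is the mean duration of all
# representations of a phoneme, D their standard deviation.

import statistics

def duration_cluster(d, M, D):
    """Return the cluster index (1..7) for a duration d."""
    bounds = [M - 2*D, M - D, M - D/2, M + D/2, M + D, M + 2*D]
    for i, b in enumerate(bounds, start=1):
        if d <= b:
            return i
    return 7  # d > M + 2D

durations = [0.080, 0.095, 0.100, 0.105, 0.130]  # illustrative values, seconds
M, D = statistics.mean(durations), statistics.stdev(durations)
print([duration_cluster(d, M, D) for d in durations])
```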
  • the (fully automatic) linguistic features weighting may start.
  • the objective is to determine to what extent each feature allows discriminating between several clusters, whereby each cluster is seen as a class to be selected or a decision to be taken.
  • the most appropriate method to do this is by using a decision tree.
  • Decision tree building relies on the concept of entropy. Entropy computation for a list of features allows classifying them according to their intrinsic information. The more a feature i reduces the uncertainty about which cluster C to select, the more it is informative and relevant.
  • the relevance of feature i is computed as the gain ratio GR(i,C), i.e. the ratio of the Information Gain IG(i,C) to the Split Information SI(C): GR(i,C) = IG(i,C) / SI(C).
  • the Split Information normalises the Information Gain of a given feature by taking into account the number of different values this feature can take.
  • the Gain Ratio allows determining the ranking of the features across all decision tree levels, and also weights the features during the target cost calculation.
  • the weighting coefficients are also stored in the database.
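As an illustration, a hedged sketch of the gain-ratio computation (the criterion used by C4.5-style decision trees); the exact formulas and data layout used in the patent may differ in detail:

```python
# Entropy-based scoring of how well a linguistic feature discriminates
# between clusters. `samples` pairs a feature value with a cluster label.

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_ratio(samples):
    """samples: list of (feature_value, cluster) pairs."""
    clusters = [c for _, c in samples]
    base = entropy(clusters)
    n = len(samples)
    # Partition the cluster labels by feature value.
    parts = {}
    for v, c in samples:
        parts.setdefault(v, []).append(c)
    ig = base - sum(len(p) / n * entropy(p) for p in parts.values())
    si = entropy([v for v, _ in samples])  # split information
    return ig / si if si else 0.0

# Invented example: does word-final position predict the duration cluster?
data = [("final", 6), ("final", 7), ("medial", 4), ("medial", 3), ("final", 6)]
print(round(gain_ratio(data), 3))
```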
  • at run-time, each time a sentence enters the system, the linguistic analysis generates the corresponding phonemes as well as a list of linguistic features associated to each of them. Every pair {phoneme, features} is defined as a target.
  • the speech units selection occurs in three steps:
  • diphonic units to be selected are only those that can be formed from adjacent phonemic candidates in the speech units corpus. However, if a target diphone does not have any candidate, candidates are created that contain the target phoneme as their left or right part, according to the diphone needed.
  • the units selection is performed in a traditional way, by solving the lattice with the Viterbi algorithm. In this way the path through the lattice of diphones that minimises the double cost {target, concatenation} is selected.
  • the target cost was already pre-computed at the pre-selection stage, whereas the concatenation cost is determined when running through the lattice.
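For concreteness, one plausible form of such a pre-computed target cost, namely a weighted sum of linguistic feature mismatches using the training-phase weights; the feature names and weight values below are invented:

```python
# Assumed form of the target cost: each mismatching linguistic feature
# contributes its gain-ratio weight. This is a sketch, not the patented formula.

def target_cost(target_feats, cand_feats, weights):
    """Both feature arguments map feature name -> discretised value."""
    return sum(
        w for name, w in weights.items()
        if target_feats.get(name) != cand_feats.get(name)
    )

weights = {"left_phoneme": 0.9, "n_syllables": 0.4, "emphasis": 0.7}
t = {"left_phoneme": "s", "n_syllables": 2, "emphasis": 0}
c = {"left_phoneme": "s", "n_syllables": 3, "emphasis": 0}
print(target_cost(t, c, weights))  # 0.4: only the syllable count differs
```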
  • the concatenation cost has been defined as the acoustic distance between the units to be concatenated. To calculate this distance, the system thus needs acoustic features, taken at the boundaries of the units to be concatenated: fundamental frequency, spectrum, energy and duration. The distance, and thus the cost, is obtained by adding up the distances measured on each of these features.
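A sketch of such a cost, summing boundary distances on the four features named above; the feature representation and the relative scaling (here unweighted) are assumptions:

```python
# Illustrative concatenation cost: summed acoustic distances at the join,
# with the spectrum represented by LPC coefficients as mentioned earlier.

def concat_cost(left, right):
    """left/right: dicts of boundary features for the two units."""
    f0 = abs(left["f0"] - right["f0"])
    spec = sum((a - b) ** 2 for a, b in zip(left["lpc"], right["lpc"])) ** 0.5
    energy = abs(left["energy"] - right["energy"])
    dur = abs(left["duration"] - right["duration"])
    return f0 + spec + energy + dur  # a real system would scale each term

a = {"f0": 110.0, "lpc": [1.2, -0.4, 0.1], "energy": 0.8, "duration": 0.09}
b = {"f0": 118.0, "lpc": [1.1, -0.5, 0.2], "energy": 0.7, "duration": 0.11}
print(round(concat_cost(a, b), 3))
```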
  • Figure 4 shows a block scheme of a text-to-speech synthesis system that implements the method of the invention.
  • the system is split into three blocks, each corresponding to one of the steps of the run-time phase as described above: the NLP (Natural Language Processing), the USP (Units Selection Processing) and the DSP (Digital Signal Processing).
  • the input to the system is the text that is to be transformed into speech.
  • the output of the system is a speech signal concatenated from non-uniform speech units.
  • Each block uses databases.
  • the NLP loads linguistic databases (DBA) for each task (pre-processing, morphological analysis, ...).
  • the DSP loads the Speech Units Database, from which speech units are selected and concatenated into a speech signal.
  • the USP, in between, loads a Linguistic Units Database comprising a set of triplets {phoneme, linguistic features, position}.
  • the first pair, {phoneme, linguistic features}, describes a unit from the Speech Units Database.
  • the last element, position, is the position in milliseconds of the unit in the Speech Units Database. This means that both databases describe and store candidate units, and are aligned thanks to the position feature.
  • the NLP block aims at analysing the input text in order to generate a list of target units (T1, T2, ..., Tn). Each target unit is a pair {phoneme, linguistic features}.
  • the second block, the USP, works in three steps.
  • first, the USP selects from the Linguistic Units Database a set of phonemic candidates for each target unit. A target cost computation is performed for each candidate. Candidate diphonic units are then determined together with their target cost and a lattice of weighted diphones is created, one diphone for each pair of adjacent phonemes. Next, the best path of diphones through the lattice is selected by dynamic programming.
  • the DSP block takes selected diphones from the Speech Units Database. Then, it concatenates them acoustically, using a technique of the OverLap And Add type: pitch values are used to improve the concatenation. No signal processing is necessary other than the concatenation itself. Selected units are concatenated without any discontinuity. As a result, linguistic criteria used in the selection prove their relevance.
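By way of illustration, a toy overlap-and-add join of two unit waveforms; a real system of the OverLap And Add type aligns the overlap on pitch marks, which this plain cross-fade omits:

```python
# Minimal overlap-and-add concatenation sketch (no pitch-mark alignment).

import numpy as np

def ola_join(a, b, overlap):
    """Cross-fade the last `overlap` samples of a with the first of b."""
    fade = np.linspace(1.0, 0.0, overlap)
    mixed = a[-overlap:] * fade + b[:overlap] * (1.0 - fade)
    return np.concatenate([a[:-overlap], mixed, b[overlap:]])

sr = 16000
t = np.arange(sr // 10) / sr               # 100 ms of samples
u1 = np.sin(2 * np.pi * 220 * t)           # two illustrative "units"
u2 = np.sin(2 * np.pi * 220 * t)
signal = ola_join(u1, u2, overlap=160)     # 10 ms cross-fade
print(signal.shape)
```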
  • the technology can for example be used for advertisement diffusion (broadcasting) in shopping centres. Advertisements of shopping centres must change frequently, which creates a frequent and expensive need for professional speakers.
  • the proposed synthesis method requires the services of a professional speaker only once, and subsequently allows pronouncing any written text without additional cost.
  • Another application could be directed to information for travellers in railway stations and airports and the like.
  • the synthesis system according to the present invention can easily solve this problem.
  • Speech synthesis can also generate fluent interactive dialogues. This is related to dialogue systems able to model a conversation and to automatically generate text in order to interact with the user.
  • Two traditional examples are interactive terminals in stations, airports and shopping centres, as well as vocal servers that are accessible by phone.
  • Systems currently used in this context are strongly limited: based on pieces of pre-recorded sentences, they are limited to some basic syntactic structures. Moreover, the result obtained is less natural, because of prosodic discontinuities at word or word-group boundaries.
  • the synthesis by non-uniform units selection using linguistic criteria is the ideal solution to get rid of these drawbacks, as it is not limited in terms of syntactic structures.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Claims (13)

  1. Method for synthesising speech, comprising the steps of:
    - applying a linguistic analysis to a sentence to be transformed into a speech signal, said analysis generating phonemes to be pronounced and, associated with each phoneme, a list of linguistic features,
    - selecting candidate speech units, exclusively based on the selected linguistic features,
    - forming said speech signal by concatenating speech units selected from said candidate speech units.
  2. Method for synthesising speech according to claim 1, wherein said selected linguistic features are determined in a preceding training step.
  3. Method for synthesising speech according to claim 1 or 2, wherein the step of selecting candidate speech units is performed using a database comprising information on phonemes and at least their linguistic features.
  4. Method for synthesising speech according to claim 3, wherein said information on said linguistic features comprises a weighting coefficient for each linguistic feature, said weighting coefficients resulting from an automatic weighting procedure.
  5. Method for synthesising speech according to claim 3 or 4, wherein said information is obtained from a step of labelling and segmenting a corpus.
  6. Method for synthesising speech according to any one of claims 1 to 5, wherein the step of selecting candidate speech units comprises the substeps of
    - selecting candidate clusters of acoustic representations for each phoneme, and
    - computing candidate speech units from said selected candidate clusters.
  7. Method according to any one of the preceding claims, wherein said speech units are diphonic units.
  8. Method according to claim 6, wherein a target cost is calculated for each candidate cluster.
  9. Method according to claim 8, wherein for each candidate speech unit a target cost is calculated from said target costs for said candidate clusters.
  10. Method according to claim 8 or 9, wherein said concatenation of speech units is performed taking into account said target cost as well as a concatenation cost.
  11. Method for synthesising speech according to any one of claims 1 to 10, wherein said linguistic features comprise features from the group (surrounding phonemes, emphasis information, number of syllables, syllables, word location, number of words, rhythm group information).
  12. Speech synthesis device comprising
    - a linguistic analysis engine arranged to produce phonemes to be pronounced and, associated with each phoneme, a list of linguistic features,
    - storage means for storing a database comprising information on phonemes and at least their linguistic features,
    - speech unit selection means for selecting candidate speech units exclusively based on the selected linguistic features,
    - synthesising means for concatenating the speech units selected by said selection means.
  13. Speech synthesis device according to claim 12, further comprising calculation means for automatically computing a weighting coefficient for each linguistic feature.
EP20050447078 2004-04-15 2005-04-08 Method and device for speech synthesis Active EP1589524B1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP20050447078 EP1589524B1 (fr) 2004-04-15 2005-04-08 Method and device for speech synthesis

Applications Claiming Priority (5)

Application Number Priority Date Filing Date Title
US56238204P 2004-04-15 2004-04-15
US562382P 2004-04-15
EP04447212A EP1640968A1 (fr) 2004-09-27 2004-09-27 Method and device for speech synthesis
EP04447212 2004-09-27
EP20050447078 EP1589524B1 (fr) 2004-04-15 2005-04-08 Method and device for speech synthesis

Publications (2)

Publication Number Publication Date
EP1589524A1 EP1589524A1 (fr) 2005-10-26
EP1589524B1 true EP1589524B1 (fr) 2008-03-12

Family

ID=34943276

Family Applications (1)

Application Number Title Priority Date Filing Date
EP20050447078 Active EP1589524B1 (fr) 2004-04-15 2005-04-08 Procédé et dispositif pour la synthèse de la parole

Country Status (1)

Country Link
EP (1) EP1589524B1 (fr)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018167522A1 (fr) * 2017-03-14 2018-09-20 Google Llc Speech synthesis unit selection
IT201800005283A1 (it) * 2018-05-11 2019-11-11 Vocal timbre remodulator

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB0112749D0 (en) * 2001-05-25 2001-07-18 Rhetorical Systems Ltd Speech synthesis

Also Published As

Publication number Publication date
EP1589524A1 (fr) 2005-10-26

Similar Documents

Publication Publication Date Title
US7124083B2 (en) Method and system for preselection of suitable units for concatenative speech
JP4302788B2 (ja) Prosodic database containing fundamental frequency templates for speech synthesis
US7565291B2 (en) Synthesis-based pre-selection of suitable units for concatenative speech
EP1138038B1 Speech synthesis by concatenation of speech signals
US7869999B2 Systems and methods for selecting from multiple phonetic transcriptions for text-to-speech synthesis
US20200410981A1 (en) Text-to-speech (tts) processing
US11763797B2 (en) Text-to-speech (TTS) processing
EP1643486A1 Method and apparatus for preventing speech comprehension by an interactive voice response system
Latorre et al. New approach to the polyglot speech generation by means of an HMM-based speaker adaptable synthesizer
JP2007249212A (ja) Method, computer program and processor for text-to-speech synthesis
US10699695B1 (en) Text-to-speech (TTS) processing
CN101131818A (zh) Speech synthesis apparatus and method
Dutoit A short introduction to text-to-speech synthesis
EP1589524B1 (fr) Procédé et dispositif pour la synthèse de la parole
JPH08335096A (ja) テキスト音声合成装置
EP1640968A1 (fr) Procédé et dispositif pour la synthèse de la parole
Louw et al. The Speect text-to-speech entry for the Blizzard Challenge 2016
Bruce et al. On the analysis of prosody in interaction
Ronanki et al. The CSTR entry to the Blizzard Challenge 2017
Dong et al. A Unit Selection-based Speech Synthesis Approach for Mandarin Chinese.
Latorre et al. New approach to polyglot synthesis: How to speak any language with anyone's voice
Klabbers Text-to-Speech Synthesis
Demenko et al. The design of polish speech corpus for unit selection speech synthesis
Heggtveit et al. Intonation Modelling with a Lexicon of Natural F0 Contours
EP1501075B1 Speech synthesis by concatenation of speech waveforms

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LI LT LU MC NL PL PT RO SE SI SK TR

AX Request for extension of the european patent

Extension state: AL BA HR LV MK YU

17P Request for examination filed

Effective date: 20060206

AKX Designation fees paid

Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LI LT LU MC NL PL PT RO SE SI SK TR

GRAP Despatch of communication of intention to grant a patent

Free format text: ORIGINAL CODE: EPIDOSNIGR1

GRAS Grant fee paid

Free format text: ORIGINAL CODE: EPIDOSNIGR3

GRAA (expected) grant

Free format text: ORIGINAL CODE: 0009210

AK Designated contracting states

Kind code of ref document: B1

Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LI LT LU MC NL PL PT RO SE SI SK TR

REG Reference to a national code

Ref country code: GB

Ref legal event code: FG4D

REG Reference to a national code

Ref country code: CH

Ref legal event code: EP

REG Reference to a national code

Ref country code: IE

Ref legal event code: FG4D

REF Corresponds to:

Ref document number: 602005005241

Country of ref document: DE

Date of ref document: 20080424

Kind code of ref document: P

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: FI

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20080312

Ref country code: LT

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20080312

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: AT

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20080312

NLV1 Nl: lapsed or annulled due to failure to fulfill the requirements of art. 29p and 29m of the patents act
PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: PL

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20080312

Ref country code: SI

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20080312

ET Fr: translation filed
PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: SK

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20080312

Ref country code: PT

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20080818

Ref country code: CZ

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20080312

Ref country code: SE

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20080612

Ref country code: ES

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20080623

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: RO

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20080312

Ref country code: NL

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20080312

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: IS

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20080712

PLBE No opposition filed within time limit

Free format text: ORIGINAL CODE: 0009261

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: NO OPPOSITION FILED WITHIN TIME LIMIT

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: DE

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20080613

Ref country code: EE

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20080312

Ref country code: DK

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20080312

26N No opposition filed

Effective date: 20081215

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: BG

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20080612

Ref country code: IE

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20080408

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: IT

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20080312

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: CY

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20080312

REG Reference to a national code

Ref country code: CH

Ref legal event code: PL

GBPC Gb: european patent ceased through non-payment of renewal fee

Effective date: 20090408

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: LI

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20090430

Ref country code: CH

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20090430

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: GB

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20090408

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: HU

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20080913

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: TR

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20080312

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: GR

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20080613

REG Reference to a national code

Ref country code: FR

Ref legal event code: PLFP

Year of fee payment: 12

REG Reference to a national code

Ref country code: FR

Ref legal event code: PLFP

Year of fee payment: 13

REG Reference to a national code

Ref country code: FR

Ref legal event code: PLFP

Year of fee payment: 14

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: FR

Payment date: 20230330

Year of fee payment: 19

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: BE

Payment date: 20230330

Year of fee payment: 19

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: LU

Payment date: 20240320

Year of fee payment: 20

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: MC

Payment date: 20240325

Year of fee payment: 20