WO2006032744A1 - Method and device for selecting acoustic units and voice synthesis method and device - Google Patents
Method and device for selecting acoustic units and voice synthesis method and device
- Publication number
- WO2006032744A1 (PCT/FR2005/002166, FR2005002166W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- acoustic
- sequence
- models
- units
- substep
- Prior art date
Links
- 238000000034 method Methods 0.000 title claims abstract description 88
- 230000015572 biosynthetic process Effects 0.000 title claims abstract description 18
- 238000003786 synthesis reaction Methods 0.000 title claims abstract description 17
- 230000008569 process Effects 0.000 claims description 27
- 230000002123 temporal effect Effects 0.000 claims description 15
- 238000004422 calculation algorithm Methods 0.000 claims description 11
- 238000010187 selection method Methods 0.000 claims description 10
- 238000004590 computer program Methods 0.000 claims description 8
- 230000011218 segmentation Effects 0.000 claims description 8
- 238000003066 decision tree Methods 0.000 claims description 7
- 230000002194 synthesizing effect Effects 0.000 claims description 6
- 238000009499 grossing Methods 0.000 claims description 4
- 239000013598 vector Substances 0.000 description 11
- 238000001228 spectrum Methods 0.000 description 8
- 239000011159 matrix material Substances 0.000 description 7
- 230000007704 transition Effects 0.000 description 6
- 230000006870 function Effects 0.000 description 5
- 238000012986 modification Methods 0.000 description 4
- 230000004048 modification Effects 0.000 description 4
- 238000001308 synthesis method Methods 0.000 description 4
- 238000000354 decomposition reaction Methods 0.000 description 3
- 230000006978 adaptation Effects 0.000 description 2
- 238000004364 calculation method Methods 0.000 description 2
- 238000010276 construction Methods 0.000 description 2
- 230000003287 optical effect Effects 0.000 description 2
- 238000011084 recovery Methods 0.000 description 2
- 238000000926 separation method Methods 0.000 description 2
- 230000003595 spectral effect Effects 0.000 description 2
- 230000003068 static effect Effects 0.000 description 2
- 238000013459 approach Methods 0.000 description 1
- 238000007796 conventional method Methods 0.000 description 1
- 238000013461 design Methods 0.000 description 1
- 238000010586 diagram Methods 0.000 description 1
- 230000008570 general process Effects 0.000 description 1
- 230000014759 maintenance of location Effects 0.000 description 1
- 239000000463 material Substances 0.000 description 1
- 238000004377 microelectronic Methods 0.000 description 1
- 238000012913 prioritisation Methods 0.000 description 1
- 238000012545 processing Methods 0.000 description 1
- 230000000717 retained effect Effects 0.000 description 1
- 238000012163 sequencing technique Methods 0.000 description 1
- 238000012549 training Methods 0.000 description 1
- 238000013518 transcription Methods 0.000 description 1
- 230000035897 transcription Effects 0.000 description 1
- 230000017105 transposition Effects 0.000 description 1
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/06—Elementary speech units used in speech synthesisers; Concatenation rules
- G10L13/07—Concatenation rules
Definitions
- Such selection methods are used, for example, in the context of speech synthesis.
- Each symbolic unit may be associated with a subset of natural speech segments, or acoustic units, such as phones, diphones or the like, representing variations of pronunciation of the symbolic unit.
- a so-called corpus approach makes it possible to define, for the same symbolic unit, a corpus of acoustic units of variable size and parameters recorded in different linguistic contexts and according to different prosodic variants.
- Each acoustic unit comprises a plurality of symbolic parameters representing acoustic characteristics and allowing its representation in mathematical form.
- This type of method generally requires a preliminary phase of learning, or determination, of contextual acoustic models: first the determination of probabilistic models, for example of the type called hidden Markov models or HMMs, then their classification according to their symbolic parameters, possibly taking their phonetic context into account. Contextual acoustic models are thus determined in the form of mathematical laws. The classification is used to perform a preselection of acoustic units according to their symbolic parameters.
- the final selection generally involves cost functions based on a cost attributed to each concatenation between two acoustic units as well as a cost attributed to the use of each unit. However, the determination and prioritization of these costs are approximate and require the intervention of a human expert.
- The object of the present invention is to solve this problem by providing a high-performance acoustic unit selection method using a finite alphabet of contextual acoustic models.
- The subject of the present invention is a method for selecting acoustic units corresponding to acoustic realizations of symbolic units of a phonological nature, said acoustic units each containing a natural speech signal and symbolic parameters representing their acoustic characteristics, said method comprising:
- The method of the invention makes it possible to take spectrum, energy and duration information into account at the moment of selection, thus allowing a reliable and good-quality selection.
- the method comprises a preliminary step of determining contextual acoustic models, implemented from a given set of acoustic units;
- said step of determining contextual acoustic models comprises:
- said step of determining the contextual acoustic models further comprises a substep of determining probabilistic models adapted to the phonetic context whose parameters are used during said sub-step of classification;
- said sub-step of classification comprises a classification by decision trees, the parameters of said probabilistic models being modified by traversing said decision trees to form said contextual acoustic models;
- said step of determining at least one target sequence of symbolic units comprises: a sub-step of acquiring a symbolic representation of a text
- said step of determining a sequence of contextual acoustic models comprises:
- said step of determining an acoustic template comprises: a sub-step of determining the temporal importance of each contextual acoustic model
- said sub-step of determining the temporal importance of each contextual acoustic model comprises the prediction of its duration
- said step of selecting a sequence of acoustic units comprises: a substep of determining a reference sequence of symbolic units from said target sequence, each symbolic unit of the reference sequence being associated with a set of acoustic units;
- said segmentation sub-step comprises a decomposition of said acoustic template on a basis of time units; said template being segmented, each segment corresponds to a symbolic unit of the reference sequence, and said alignment sub-step comprises the alignment of each segment of the template with each of the acoustic units associated with the corresponding symbolic unit of the reference sequence; said alignment substep comprises the determination of an optimal alignment by an algorithm called "DTW";
- said selection step further comprises a substep of preselection making it possible to determine, for each symbolic unit of the reference sequence, candidate acoustic units, said substep of alignment forming a substep of final selection among these candidate units;
- said contextual acoustic models are probabilistic models with observable processes taking continuous values and non-observable processes taking discrete values forming the states of this process; and
- said contextual acoustic models are probabilistic models with unobservable processes taking continuous values.
- the invention also relates to a method for synthesizing a speech signal, characterized in that it comprises a selection method as described above, said target sequence corresponding to a text to be synthesized and the method further comprising a step of synthesizing a voice sequence from said selected acoustic unit sequence.
- said synthesis step comprises:
- The invention also relates to a device for selecting acoustic units corresponding to acoustic realizations of phonological symbolic units, this device comprising means adapted to the implementation of a selection method as defined above; and a device for synthesizing a speech signal, remarkable in that it includes means adapted to the implementation of such a selection method.
- The present invention also relates to a computer program on an information carrier, this program comprising instructions adapted to the implementation of a method of selecting acoustic units according to the invention, when the program is loaded and executed in a computer system.
- the advantages of these devices and computer program are identical to those mentioned above in connection with the method of selecting acoustic units of the invention.
- FIG. 1 represents a general flowchart of a speech synthesis method implementing a selection method according to the invention
- FIG. 1 represents a general process flow diagram of the invention implemented as part of a speech synthesis method.
- the steps of the method of selecting acoustic units according to the invention are determined by the instructions of a computer program used for example in a voice synthesis device.
- the method according to the invention is then implemented when the aforesaid program is loaded into computer means incorporated in the device in question, and whose operation is then controlled by the execu ⁇ tion of the program.
- By computer program is meant here one or more computer programs forming a set (software) whose purpose is the implementation of the invention when it is executed by an appropriate computer system. Accordingly, the invention also relates to such a computer program, in particular in the form of software stored on an information carrier.
- an information carrier may be constituted by any entity or device capable of storing a program according to the invention.
- the medium in question may comprise a hardware storage means, such as a ROM, for example a CD ROM or a microelectronic circuit ROM, or a magnetic recording means, for example a hard disk.
- the information carrier may be an integrated circuit in which the program is incorporated, the circuit being adapted to execute or to be used in the execution of the method in question.
- The information carrier can also be a transmissible intangible medium, such as an electrical or optical signal that can be conveyed via an electric or optical cable, by radio or by other means.
- A program according to the invention can in particular be downloaded from an Internet-type network.
- A computer program according to the invention can use any programming language and be in the form of source code, object code, or code intermediate between source code and object code (e.g., a partially compiled form), or in any other form desirable for implementing a method according to the invention.
- The selection method comprises first of all a preliminary step 2 of determining contextual acoustic models, implemented from a given set of acoustic units contained in a database 3.
- This determination step 2 is also called learning and makes it possible to define mathematical laws representing the acoustic units, which each contain a natural speech signal and symbolic parameters representing their acoustic characteristics.
- Following step 2 of determining contextual acoustic models, the method comprises a step 4 of determining at least one target sequence of symbolic units of a phonological nature. In the embodiment described, this target sequence is unique and corresponds to a text to be synthesized.
- the method then comprises a step 5 of determining a sequence of contextual acoustic models, as obtained from the preceding step 2, and corresponding to the target sequence.
- The method further comprises a step 6 of determining an acoustic template from said sequence of contextual acoustic models. This template corresponds to the most probable spectrum and energy parameters given the sequence of contextual acoustic models determined previously.
- Step 6 of determining an acoustic template is followed by a step 7 of selecting acoustic units according to this acoustic template applied to the target sequence of symbolic units.
- the acoustic units selected are derived from a set of acoustic units for speech synthesis, formed of a database 8 identical to or different from the database 3.
- The method comprises a step 9 of synthesizing a voice signal from the selected acoustic units and the database 8, so as to reconstruct a voice signal from each natural speech signal contained in the selected acoustic units.
- The method makes it possible, in particular by virtue of the determination and use of the acoustic template, to have optimum control of the acoustic parameters of the generated signal by reference to the template.
- Step 2 of determining the acoustic models is conventional. It is implemented from the database 3 containing a finite number of symbolic units of a phonological nature as well as the associated speech and phonetic transcriptions. The acoustic units are divided into sets, each comprising all the acoustic units corresponding to the different realizations of the same symbolic unit.
- Step 2 begins with a substep 22 of determining, for each symbolic unit, a probabilistic model which, in the embodiment described, is a hidden discrete-state Markov model, commonly referred to as an HMM (Hidden Markov Model).
- These models have three states and are defined, for each state, by a Gaussian law of mean μ and covariance Σ which models the distribution of observations, and by probabilities of remaining in the state and of transition to the other states of the model.
- the parameters constituting an HMM model are therefore the parameters of mean and covariance of the Gaussian laws of the different states and the transition matrix grouping the different transition probabilities between the states.
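As an illustration only (the patent prescribes no data layout; the names below are hypothetical), such a three-state model, with one Gaussian law per state and a transition matrix, might be held as:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class GaussianState:
    mean: np.ndarray   # mu: mean vector of the state's Gaussian law
    cov: np.ndarray    # Sigma: (diagonal) covariance of the Gaussian law

@dataclass
class HMM:
    states: list       # three GaussianState instances, per the described models
    trans: np.ndarray  # transition matrix grouping the transition probabilities

    def check(self):
        # each row of the transition matrix must be a probability distribution
        assert np.allclose(self.trans.sum(axis=1), 1.0)

# a toy 3-state model with a left-to-right transition structure
dim = 39
states = [GaussianState(np.zeros(dim), np.ones(dim)) for _ in range(3)]
trans = np.array([[0.6, 0.4, 0.0],
                  [0.0, 0.7, 0.3],
                  [0.0, 0.0, 1.0]])
model = HMM(states, trans)
model.check()
```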
- These probabilistic models are drawn from a finite alphabet of models comprising, for example, 36 different models which describe the probability of acoustic realization of symbolic units of a phonological nature.
- The discrete models each comprise an observable random process corresponding to the acoustic realization of symbolic units and an unobservable random process, designated Q, having the probabilistic properties known as "Markov properties", according to which the future state of a random process depends only on the present state of this process.
- each natural speech signal contained in an acoustic unit is analyzed asynchronously with, for example, a fixed step of 5 milliseconds and a window of 10 milliseconds.
- For each window centered on an analysis instant t, twelve cepstral coefficients or MFCC (Mel Frequency Cepstral Coefficient) coefficients and the energy, as well as their first and second derivatives, are obtained.
- Ct is a spectrum and energy vector comprising the cepstral coefficients as well as the energy values
- o t is a vector comprising Ct and its first and second derivatives.
- the vector o t is called the acoustic vector of the instant t and comprises the spectrum and energy information of the natural speech signal analyzed.
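The construction of such acoustic vectors can be sketched as follows, assuming simple numerical derivatives (the patent does not specify the delta-regression formula):

```python
import numpy as np

def acoustic_vectors(C):
    """Stack each static vector C_t with its first and second derivatives.

    C: array of shape (T, 13) -- 12 MFCC coefficients plus energy per frame.
    Returns o of shape (T, 39), the acoustic vectors o_t.
    """
    # first derivative: centred differences (one-sided at the edges)
    d1 = np.gradient(C, axis=0)
    # second derivative: derivative of the first derivative
    d2 = np.gradient(d1, axis=0)
    return np.concatenate([C, d1, d2], axis=1)

T = 50
C = np.random.default_rng(0).normal(size=(T, 13))
o = acoustic_vectors(C)
```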
- step 2 also comprises a substep 24 of determining probabilistic models adapted to the phonetic context. More precisely, this substep 24 corresponds to the learning of HMM models of the so-called triphone type.
- the phoneme represents in phonology the division of words into linguistic subunits.
- A phone refers to an acoustic realization of a phoneme.
- Acoustic realizations of phonemes are different according to the speech context. For example, depending on the phonetic context, phenomena of coarticulation are observed to a greater or lesser extent. Similarly, depending on the prosodic context, differences in acoustic realization can appear.
- a classical method of adaptation to the phonetic context takes into account the left and right contexts, which resulted in so-called triphone modeling.
- When learning the HMM models, for each triphone present in the database, the parameters of the Gaussian laws relating to each state are re-estimated from the representatives of this triphone.
- Step 2 then comprises a substep 26 of classification of the probabilistic models according to their symbolic parameters in order to regroup, within the same class, the models having acoustic similarities.
- Such a classification can be obtained for example by the construction of decision trees.
- A decision tree is constructed for each state of each HMM model. The construction is performed by repeated divisions of the natural speech segments of the acoustic units of the set concerned, these divisions being based on the symbolic parameters.
- A criterion relating to the symbolic parameters is applied to separate the different acoustic units corresponding to the acoustic realizations of the same phoneme.
- A calculation of the likelihood variation between the parent node and the child node is performed, this calculation being made from the parameters of the previously determined triphone models, in order to take the phonetic context into account.
- the criterion of separation leading to the maximum increase of the likelihood is retained and the separation is effectively accepted if this increase in likelihood exceeds a fixed threshold and if the number of representatives present in each of the child nodes is sufficient.
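The split-acceptance rule described above can be sketched as follows, assuming a pooled diagonal-Gaussian log-likelihood at each node (the exact likelihood computation from triphone parameters is not reproduced here; threshold and minimum-occupancy values are illustrative):

```python
import numpy as np

def node_loglik(X):
    """Gaussian log-likelihood of samples X under the node's ML diagonal Gaussian."""
    n, d = X.shape
    var = X.var(axis=0) + 1e-8  # ML variance, floored for numerical stability
    return -0.5 * n * (d * np.log(2 * np.pi * np.e) + np.log(var).sum())

def accept_split(parent, left, right, threshold=1.0, min_count=5):
    """Accept the split if the likelihood increase exceeds a fixed threshold
    and each child node keeps enough representatives."""
    if len(left) < min_count or len(right) < min_count:
        return False
    gain = node_loglik(left) + node_loglik(right) - node_loglik(parent)
    return gain > threshold

rng = np.random.default_rng(1)
left = rng.normal(-3.0, 1.0, size=(50, 2))    # well-separated clusters:
right = rng.normal(+3.0, 1.0, size=(50, 2))   # splitting them gains likelihood
parent = np.vstack([left, right])
ok = accept_split(parent, left, right)
```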
- This operation is repeated on each branch until a stopping criterion ends the classification, giving rise to the generation of a leaf of the tree, or class.
- A contextual acoustic model can therefore be defined for each HMM model by traversing, for each state of the HMM model, the associated decision tree in order to assign a class to this state and to modify the mean and covariance parameters of its Gaussian law for adaptation to the context.
- the different symbolic units corresponding to the different realizations of the same phoneme are thus represented by the same HMM model and by different contextual acoustic models.
- A contextual acoustic model is defined as being an HMM model whose non-observable process has the transition matrix of the phoneme model resulting from step 22 and in which, for each state, the mean and the covariance matrix of the observable process are the mean and covariance matrix of the class obtained by traversing the decision tree corresponding to this state of this phoneme.
- step 4 of determining a target sequence of symbolic units is carried out.
- This step 4 comprises, first of all, a substep 42 of acquiring a symbolic representation of a given text to be synthesized, such as a graphical or orthographic representation.
- this graphical representation is a text written using the Latin alphabet designated by the reference TXT in FIG.
- The method then comprises a substep 44 of determining a sequence of symbolic units of a phonological nature from the graphemic representation.
- This sequence of symbolic units identified by the reference UP in FIG. 3 is, for example, composed of phonemes extracted from a phonetic alphabet.
- This substep 44 is performed automatically by means of conventional techniques of the state of the art, such as phonetization.
- This substep 44 implements an automatic phonetization system using databases, making it possible to decompose any text on a finite symbolic alphabet.
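As a toy illustration of such a lexicon-driven decomposition onto a finite symbolic alphabet (the word list and pronunciations below are purely hypothetical):

```python
# hypothetical pronunciation lexicon mapping words to phoneme sequences
LEXICON = {
    "speech": ["s", "p", "i", "tʃ"],
    "synthesis": ["s", "ɪ", "n", "θ", "ə", "s", "ɪ", "s"],
}

def phonetize(text):
    """Decompose a text into a sequence of symbolic units (phonemes)
    drawn from a finite symbolic alphabet, via lexicon lookup."""
    units = []
    for word in text.lower().split():
        units.extend(LEXICON[word])  # real systems fall back to letter-to-sound rules
    return units

up = phonetize("speech synthesis")  # the target sequence UP of the description
```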
- the method comprises step 5 of determining a sequence of contextual acoustic models corresponding to the target sequence.
- This step firstly comprises a substep 52 of modeling the target sequence by its decomposition on the basis of probabilistic models, and more precisely of the probabilistic hidden Markov models, designated HMM, determined during step 2.
- The sequence of probabilistic models thus obtained is referenced H1..M, comprises the models H1 to HM selected from the 36 models of the finite alphabet, and corresponds to the target sequence UP.
- The method then comprises a sub-step 54 of forming contextual acoustic models by modifying the parameters of the models of the sequence H1..M to form a sequence λ1..M of contextual acoustic models.
- This formation is performed by traversing, for each state of each model of the H1..M sequence, the decision trees. Each state of each model is modified and takes the mean and covariance values of the leaf whose symbolic parameters correspond to those of the target.
- The sequence λ1..M of contextual acoustic models is therefore a sequence of hidden Markov models whose mean and covariance parameters have been adapted to the phonetic context.
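The tree traversal that adapts each state can be sketched as follows (the question functions and leaf contents are hypothetical):

```python
# Each decision-tree node asks a yes/no question about the symbolic
# parameters of the target; leaves carry a class's mean and covariance.
def adapt_state(tree, symbolic_params):
    """Walk one state's decision tree and return the leaf (mean, cov)."""
    node = tree
    while "leaf" not in node:
        branch = "yes" if node["question"](symbolic_params) else "no"
        node = node[branch]
    return node["leaf"]

# hypothetical tree for one HMM state: is the left phonetic context a vowel?
tree = {
    "question": lambda p: p["left_context"] in "aeiou",
    "yes": {"leaf": ("mean_vowel_left", "cov_vowel_left")},
    "no":  {"leaf": ("mean_other", "cov_other")},
}

leaf = adapt_state(tree, {"left_context": "a"})
```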
- The method then comprises step 6 of determining an acoustic template.
- This step 6 comprises a substep 62 of determining the temporal importance of each contextual acoustic model, by allocating to each contextual acoustic model a corresponding number of temporal units, a substep 64 of determining a temporal sequence of models, and a substep 66 of determining a corresponding sequence of acoustic frames forming the acoustic template.
- the sub-step 62 for determining the temporal importance of each contextual acoustic model includes predicting the duration of each state of the contextual acoustic models.
- This sub-step 62 receives as input the sequence λ1..M of acoustic models, comprising mean, covariance and Gaussian density information for each state and the transition matrices, as well as a duration value for each model state.
- For each contextual acoustic model, it is possible to take the average duration of each state of the model.
- an average duration is defined for each class and the classification of a state in a class results in the allocation of this average duration to this state.
- A duration prediction model as exists in the state of the art, in particular for assigning each phoneme a duration value, is used to assign a duration to the different states of the sequence λ1..M of contextual acoustic models.
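The allocation of temporal units and the duplication of models can be sketched as follows (the 5 ms frame period is taken from the analysis step described earlier; the rounding policy is an assumption):

```python
def allocate_frames(state_durations_ms, frame_ms=5):
    """Convert predicted state durations into frame counts; every state
    keeps at least one frame so no model drops out of the sequence."""
    return [max(1, round(d / frame_ms)) for d in state_durations_ms]

def temporal_sequence(models, frames_per_state):
    """Duplicate each contextual acoustic model according to its
    temporal importance, yielding one model per frame."""
    seq = []
    for m, n in zip(models, frames_per_state):
        seq.extend([m] * n)
    return seq

frames = allocate_frames([40, 25, 12])            # durations in milliseconds
lam = temporal_sequence(["s1", "s2", "s3"], frames)
N = len(lam)                                      # total number of frames
```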
- Let N be the total number of frames to be synthesized.
- Let Λ = [λ1, λ2, ..., λN] be the sequence of contextual acoustic models and Q the corresponding sequence of states.
- The sequence Λ is a temporal sequence of models, formed of the contextual acoustic models of the sequence λ1..M, each duplicated several times according to its temporal importance, as represented in FIG.
- the determination of the required template is carried out during the sub-step
- The observation sequence is completely defined by its static part Ct, formed of the spectrum and energy vector, the dynamic part being directly deduced from it.
- the observation sequence is also written in matrix form as follows:
- The acoustic template thus corresponds to the most probable sequence of spectrum and energy vectors given the sequence of contextual acoustic models.
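One well-known way to compute such a most probable static sequence from Gaussian means over statics and deltas is maximum-likelihood parameter generation; the sketch below works in one dimension with centred-difference deltas and diagonal precisions, and is offered as an illustration consistent with the description, not as the patent's exact algorithm:

```python
import numpy as np

def delta_matrix(T):
    """Linear operator mapping static coefficients to their deltas
    (centred differences inside, one-sided at the edges)."""
    D = np.zeros((T, T))
    for t in range(1, T - 1):
        D[t, t - 1], D[t, t + 1] = -0.5, 0.5
    D[0, 0], D[0, 1] = -1.0, 1.0
    D[T - 1, T - 2], D[T - 1, T - 1] = -1.0, 1.0
    return D

def ml_trajectory(mu, prec, T):
    """Most probable static sequence c given per-frame Gaussian means mu
    (statics stacked over deltas, length 2T) and diagonal precisions."""
    W = np.vstack([np.eye(T), delta_matrix(T)])   # observation o = W c
    A = W.T @ (prec[:, None] * W)                 # normal equations
    b = W.T @ (prec * mu)
    return np.linalg.solve(A, b)

T = 20
c_true = np.linspace(0.0, 1.0, T)                       # a smooth trajectory
mu = np.concatenate([c_true, delta_matrix(T) @ c_true])  # consistent means
prec = np.ones(2 * T)                                    # unit precisions
c_hat = ml_trajectory(mu, prec, T)
```

When the static and delta means are mutually consistent, the generated trajectory recovers the statics exactly; with real model means, the delta terms smooth the trajectory across state boundaries.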
- the method then goes to step 7 of selecting a sequence of acoustic units.
- Step 7 begins with a sub-step 72 for determining a reference sequence of symbolic units, denoted by U.
- This reference sequence U is formed from the target sequence UP and consists of the symbolic units used for synthesis, which may be different from those forming the target sequence UP.
- The reference sequence U is formed of phonemes, diphones or others.
- In the embodiment described, the target sequence is identical to the reference sequence U, so that substep 72 is not performed.
- Each symbolic unit of the reference sequence U is associated with a finite set of acoustic units corresponding to different acoustic embodiments.
- The method comprises a substep 74 of segmentation of the acoustic template as a function of the reference sequence U. Indeed, in order to be able to use the acoustic template, it is preferable to segment it according to the type of acoustic units to be selected.
- The method of the invention is applicable to any type of acoustic units, the substep 74 of segmentation making it possible to adapt the acoustic template to different types of units.
- This segmentation is a decomposition of the acoustic template on a basis of time units corresponding to the types of acoustic units used.
- This segmentation corresponds to the grouping of the frames of the acoustic template C into segments of a duration close to that of the units of the reference sequence U, which correspond to the acoustic units used for the synthesis. These segments are noted s in FIG.
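The grouping of template frames into per-unit segments can be sketched as follows (unit durations are expressed directly in frames, an assumption):

```python
import numpy as np

def segment_template(frames, unit_frame_counts):
    """Split the sequence of template frames into segments whose lengths
    match the durations of the units of the reference sequence."""
    assert sum(unit_frame_counts) == len(frames)
    cuts = np.cumsum(unit_frame_counts)[:-1]   # cut points between units
    return np.split(frames, cuts)

frames = np.arange(15).reshape(15, 1)          # 15 one-dimensional frames
segments = segment_template(frames, [8, 5, 2])  # one segment per unit
```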
- the selection step 7 comprises a preselection sub-step 76 making it possible to define, for each symbolic unit Uj of the reference sequence U, a subset Ej of candidate acoustic units, as represented in FIG.
- This preselection is carried out conventionally, for example according to the symbolic parameters of the acoustic units.
- The method further comprises a sub-step 78 of aligning the acoustic template with each possible sequence of acoustic units from the preselected candidate units in order to make the final selection.
- Each candidate acoustic unit is compared to the corresponding template segments by means of an alignment algorithm, such as, for example, the so-called DTW (Dynamic Time Warping) algorithm.
- This DTW algorithm performs an alignment of each acoustic unit with the corresponding template segment in order to calculate an overall distance between them, equal to the sum of the local distances along the alignment path, divided by the number of frames of the shorter segment.
- The overall distance thus defined makes it possible to take into account the relative difference in duration between the compared signals.
- the local distance used is the Euclidean distance between the spectrum and energy vectors comprising the MFCC coefficients and the energy information.
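A minimal DTW matching this description, with the Euclidean local distance and the normalization by the shorter sequence, might read as follows (a sketch, not the patent's implementation):

```python
import numpy as np

def dtw_distance(a, b):
    """Global DTW distance between two frame sequences: the sum of local
    Euclidean distances along the optimal alignment path, divided by the
    number of frames of the shorter sequence."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])   # local distance
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m] / min(n, m)

x = np.array([[0.0], [1.0], [2.0], [3.0]])
d_same = dtw_distance(x, x)        # identical sequences align perfectly
d_warp = dtw_distance(x, x[::2])   # a time-warped copy still aligns closely
```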
- The selection step 7 is followed by a synthesis step 9, which comprises a substep 92 of retrieving, for each selected acoustic unit, a signal from the database 8, a substep 94 of signal smoothing, and a substep 96 of concatenation of the different natural speech signals to output the final synthesized signal.
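The smoothing-and-concatenation chain can be sketched with a simple linear crossfade at each joint (the patent does not fix the smoothing method, and the TD-PSOLA-style prosodic modification is omitted here):

```python
import numpy as np

def concatenate(signals, overlap=32):
    """Concatenate natural speech signals with a linear crossfade of
    `overlap` samples at each joint to smooth the transitions."""
    out = signals[0].astype(float)
    fade = np.linspace(0.0, 1.0, overlap)
    for s in signals[1:]:
        s = s.astype(float)
        # mix the tail of the accumulated signal with the head of the next one
        mixed = out[-overlap:] * (1 - fade) + s[:overlap] * fade
        out = np.concatenate([out[:-overlap], mixed, s[overlap:]])
    return out

a = np.ones(100)    # stand-ins for two retrieved natural speech signals
b = np.zeros(100)
y = concatenate([a, b], overlap=10)
```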
- A prosodic modification algorithm, such as for example the algorithm known under the name TD-PSOLA, is used during the synthesis step, in a sub-step of prosodic modification.
- The hidden Markov models are models whose unobservable processes take discrete values.
- The method can also be realized with models whose unobservable processes take continuous values. It is also possible to use, for each graphical representation, several sequences of symbolic units, the taking into account of several symbolic sequences being known from the state of the art.
- This technique is based on the use of language models intended to weight the various hypotheses by their probability of appearing in the symbolic universe.
- The MFCC spectral parameters used in the example described can be replaced by other types of parameters, such as so-called Line Spectral Frequencies (LSF) parameters, Linear Prediction Coefficients (LPC) parameters, or parameters related to the formants.
- the method may also use other characteristic information of the voice signals, such as fundamental frequency information or voice quality information, especially during the steps of determining the contextual acoustic models, template determination and selection.
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP05798354A EP1789953B1 (fr) | 2004-09-16 | 2005-08-30 | Procede et dispositif de selection d'unites acoustiques et procede et dispositif de synthese vocale |
DE602005019070T DE602005019070D1 (de) | 2004-09-16 | 2005-08-30 | Her einheiten und sprachsynthesevorrichtung |
US11/662,652 US20070276666A1 (en) | 2004-09-16 | 2005-08-30 | Method and Device for Selecting Acoustic Units and a Voice Synthesis Method and Device |
AT05798354T ATE456125T1 (de) | 2004-09-16 | 2005-08-30 | Verfahren und vorrichtung für die auswahl akustischer einheiten und sprachsynthesevorrichtung |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
FR0409822 | 2004-09-16 | ||
FR0409822 | 2004-09-16 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2006032744A1 true WO2006032744A1 (fr) | 2006-03-30 |
Family
ID=34949650
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/FR2005/002166 WO2006032744A1 (fr) | 2004-09-16 | 2005-08-30 | Procede et dispositif de selection d'unites acoustiques et procede et dispositif de synthese vocale |
Country Status (5)
Country | Link |
---|---|
US (1) | US20070276666A1 (fr) |
EP (1) | EP1789953B1 (fr) |
AT (1) | ATE456125T1 (fr) |
DE (1) | DE602005019070D1 (fr) |
WO (1) | WO2006032744A1 (fr) |
Families Citing this family (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1953052B (zh) * | 2005-10-20 | 2010-09-08 | 株式会社东芝 | 训练时长预测模型、时长预测和语音合成的方法及装置 |
JP5238205B2 (ja) * | 2007-09-07 | 2013-07-17 | ニュアンス コミュニケーションズ,インコーポレイテッド | 音声合成システム、プログラム及び方法 |
JP4528839B2 (ja) * | 2008-02-29 | 2010-08-25 | 株式会社東芝 | 音素モデルクラスタリング装置、方法及びプログラム |
ATE449400T1 (de) * | 2008-09-03 | 2009-12-15 | Svox Ag | Sprachsynthese mit dynamischen einschränkungen |
US8315871B2 (en) * | 2009-06-04 | 2012-11-20 | Microsoft Corporation | Hidden Markov model based text to speech systems employing rope-jumping algorithm |
US8340965B2 (en) * | 2009-09-02 | 2012-12-25 | Microsoft Corporation | Rich context modeling for text-to-speech engines |
US8805687B2 (en) * | 2009-09-21 | 2014-08-12 | At&T Intellectual Property I, L.P. | System and method for generalized preselection for unit selection synthesis |
US8594993B2 (en) | 2011-04-04 | 2013-11-26 | Microsoft Corporation | Frame mapping approach for cross-lingual voice transformation |
CN102270449A (zh) * | 2011-08-10 | 2011-12-07 | 歌尔声学股份有限公司 | 参数语音合成方法和系统 |
US9570066B2 (en) * | 2012-07-16 | 2017-02-14 | General Motors Llc | Sender-responsive text-to-speech processing |
US9489965B2 (en) * | 2013-03-15 | 2016-11-08 | Sri International | Method and apparatus for acoustic signal characterization |
JP6342428B2 (ja) * | 2013-12-20 | 2018-06-13 | 株式会社東芝 | 音声合成装置、音声合成方法およびプログラム |
US10902841B2 (en) | 2019-02-15 | 2021-01-26 | International Business Machines Corporation | Personalized custom synthetic speech |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB2313530A (en) * | 1996-05-15 | 1997-11-26 | Atr Interpreting Telecommunica | Speech Synthesizer |
US5970453A (en) * | 1995-01-07 | 1999-10-19 | International Business Machines Corporation | Method and system for synthesizing speech |
US6163769A (en) * | 1997-10-02 | 2000-12-19 | Microsoft Corporation | Text-to-speech using clustered context-dependent phoneme-based units |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5950162A (en) * | 1996-10-30 | 1999-09-07 | Motorola, Inc. | Method, device and system for generating segment durations in a text-to-speech system |
US6505158B1 (en) * | 2000-07-05 | 2003-01-07 | At&T Corp. | Synthesis-based pre-selection of suitable units for concatenative speech |
2005
- 2005-08-30 WO PCT/FR2005/002166 patent/WO2006032744A1/fr active Application Filing
- 2005-08-30 AT AT05798354T patent/ATE456125T1/de not_active IP Right Cessation
- 2005-08-30 US US11/662,652 patent/US20070276666A1/en not_active Abandoned
- 2005-08-30 EP EP05798354A patent/EP1789953B1/fr active Active
- 2005-08-30 DE DE602005019070T patent/DE602005019070D1/de active Active
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5970453A (en) * | 1995-01-07 | 1999-10-19 | International Business Machines Corporation | Method and system for synthesizing speech |
GB2313530A (en) * | 1996-05-15 | 1997-11-26 | Atr Interpreting Telecommunica | Speech Synthesizer |
US6163769A (en) * | 1997-10-02 | 2000-12-19 | Microsoft Corporation | Text-to-speech using clustered context-dependent phoneme-based units |
Non-Patent Citations (2)
Title |
---|
CHRISTOPHE BLOUIN, PAUL C. BAGSHAW & OLIVIER ROSEC: "A method of unit pre-selection for speech synthesis based on acoustic clustering and decision trees", PROCEEDINGS OF INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP'03), vol. I, 6 April 2003 (2003-04-06), HONG KONG, CHINA, pages 692 - 695, XP002327084, ISBN: 0-7803-7663-3 * |
SOUFIANE ROUIBIA, OLIVIER ROSEC & THIERRY MOUDENC: "Unit Selection for Speech Synthesis Based on Acoustic Criteria", 8TH INTERNATIONAL CONFERENCE, TSD 2005, 12 September 2005 (2005-09-12) - 15 September 2005 (2005-09-15), Karlovy Vary, Czech Republic, pages 281 - 287, XP002361804 * |
Also Published As
Publication number | Publication date |
---|---|
DE602005019070D1 (de) | 2010-03-11 |
EP1789953A1 (fr) | 2007-05-30 |
EP1789953B1 (fr) | 2010-01-20 |
US20070276666A1 (en) | 2007-11-29 |
ATE456125T1 (de) | 2010-02-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP1789953B1 (fr) | Method and device for selecting acoustic units, and method and device for voice synthesis | |
O'shaughnessy | Interacting with computers by voice: automatic speech recognition and synthesis | |
EP3373293B1 (fr) | Method and apparatus for speech recognition | |
US7136816B1 (en) | System and method for predicting prosodic parameters | |
Ghai et al. | Literature review on automatic speech recognition | |
US20210035560A1 (en) | System and method for performing automatic speech recognition system parameter adjustment via machine learning | |
EP1453037A2 (fr) | Method for developing an optimally partitioned classified neural network, and method and device for automatic labeling using an optimally partitioned classified neural network | |
WO2018118442A1 (fr) | Acoustic-to-word neural network speech recognition device | |
US10497362B2 (en) | System and method for outlier identification to remove poor alignments in speech synthesis | |
Upadhyay et al. | Foreign English accent classification using deep belief networks | |
EP1526508B1 (fr) | Method for selecting synthesis units | |
EP1152399A1 (fr) | Sub-band processing of speech signals using neural networks | |
Talesara et al. | A novel Gaussian filter-based automatic labeling of speech data for TTS system in Gujarati language | |
EP1846918B1 (fr) | Method for estimating a voice conversion function | |
Furui | Generalization problem in ASR acoustic model training and adaptation | |
US11670292B2 (en) | Electronic device, method and computer program | |
Ma et al. | Language identification with deep bottleneck features | |
El Ouahabi et al. | Amazigh speech recognition using triphone modeling and clustering tree decision | |
Kim et al. | Improving end-to-end contextual speech recognition via a word-matching algorithm with backward search | |
Garnaik et al. | An approach for reducing pitch induced mismatches to detect keywords in children’s speech | |
EP0595950B1 (fr) | Method and device for real-time speech recognition | |
Geetha et al. | Phoneme Segmentation of Tamil Speech Signals Using Spectral Transition Measure | |
Frikha et al. | Hidden Markov models (HMMs) isolated word recognizer with the optimization of acoustical analysis and modeling techniques | |
Ratkevicius et al. | Advanced recognition of Lithuanian digit names using hybrid approach | |
Humayun et al. | A review of social background profiling of speakers from speech accents |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states |
Kind code of ref document: A1 Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KM KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NG NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SM SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW |
AL | Designated countries for regional patents |
Kind code of ref document: A1 Designated state(s): BW GH GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LT LU LV MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG |
121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
WWE | Wipo information: entry into national phase |
Ref document number: 2005798354 Country of ref document: EP |
WWE | Wipo information: entry into national phase |
Ref document number: 11662652 Country of ref document: US |
NENP | Non-entry into the national phase |
Ref country code: DE |
WWP | Wipo information: published in national office |
Ref document number: 2005798354 Country of ref document: EP |
WWP | Wipo information: published in national office |
Ref document number: 11662652 Country of ref document: US |