EP0515709A1 - Method and device for the representation of segmental units for text-to-speech conversion - Google Patents
Method and device for the representation of segmental units for text-to-speech conversion
- Publication number
- EP0515709A1 EP0515709A1 EP91108575A EP91108575A EP0515709A1 EP 0515709 A1 EP0515709 A1 EP 0515709A1 EP 91108575 A EP91108575 A EP 91108575A EP 91108575 A EP91108575 A EP 91108575A EP 0515709 A1 EP0515709 A1 EP 0515709A1
- Authority
- EP
- European Patent Office
- Prior art keywords
- aehmm
- speech
- segmental
- processor
- sequence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Links
- 230000015572 biosynthetic process Effects 0.000 title claims abstract description 56
- 238000003786 synthesis reaction Methods 0.000 title claims abstract description 56
- 238000000034 method Methods 0.000 title claims abstract description 52
- 239000013598 vector Substances 0.000 claims abstract description 39
- 230000003595 spectral effect Effects 0.000 claims abstract description 34
- 230000008569 process Effects 0.000 claims abstract description 18
- 238000012549 training Methods 0.000 claims abstract description 8
- 238000002372 labelling Methods 0.000 claims abstract description 3
- 238000013139 quantization Methods 0.000 claims abstract description 3
- 230000009466 transformation Effects 0.000 claims description 6
- 230000007704 transition Effects 0.000 description 15
- 230000001755 vocal effect Effects 0.000 description 7
- 238000013459 approach Methods 0.000 description 6
- 230000006870 function Effects 0.000 description 4
- 239000011159 matrix material Substances 0.000 description 4
- 238000012545 processing Methods 0.000 description 4
- 238000013518 transcription Methods 0.000 description 4
- 230000035897 transcription Effects 0.000 description 4
- 238000009826 distribution Methods 0.000 description 3
- 230000011218 segmentation Effects 0.000 description 3
- 238000001228 spectrum Methods 0.000 description 3
- 238000004458 analytical method Methods 0.000 description 2
- 238000005311 autocorrelation function Methods 0.000 description 2
- 230000006399 behavior Effects 0.000 description 2
- 230000008901 benefit Effects 0.000 description 2
- 238000013461 design Methods 0.000 description 2
- 238000010586 diagram Methods 0.000 description 2
- 230000005284 excitation Effects 0.000 description 2
- 238000000605 extraction Methods 0.000 description 2
- 230000009467 reduction Effects 0.000 description 2
- 230000009471 action Effects 0.000 description 1
- 238000010420 art technique Methods 0.000 description 1
- 230000015556 catabolic process Effects 0.000 description 1
- 238000006243 chemical reaction Methods 0.000 description 1
- 238000004891 communication Methods 0.000 description 1
- 230000001143 conditioned effect Effects 0.000 description 1
- 238000010276 construction Methods 0.000 description 1
- 238000006731 degradation reaction Methods 0.000 description 1
- 230000001419 dependent effect Effects 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000018109 developmental process Effects 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 238000009499 grossing Methods 0.000 description 1
- 238000004519 manufacturing process Methods 0.000 description 1
- 239000000463 material Substances 0.000 description 1
- 230000007246 mechanism Effects 0.000 description 1
- 230000008447 perception Effects 0.000 description 1
- 230000000750 progressive effect Effects 0.000 description 1
- 230000003362 replicative effect Effects 0.000 description 1
- 238000012552 review Methods 0.000 description 1
- 238000005309 stochastic process Methods 0.000 description 1
- 230000009897 systematic effect Effects 0.000 description 1
- 238000000844 transformation Methods 0.000 description 1
- 238000013519 translation Methods 0.000 description 1
- 238000011179 visual inspection Methods 0.000 description 1
- 210000001260 vocal cord Anatomy 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/06—Elementary speech units used in speech synthesisers; Concatenation rules
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
- G10L15/142—Hidden Markov Models [HMMs]
Definitions
- the present invention relates generally to the field of automatic text-to-speech synthesis and specifically to a method for compactly representing a large set of acoustic units for a concatenative text-to-speech synthesis system, and to an apparatus for speech synthesis employing such a method.
- the input text is initially processed in order to transform it into a sequence of phonetic symbols more suitable for speech synthesis purposes.
- a corresponding sequence of segmental units to be concatenated is produced.
- the sequence of segmental unit parameters is then retrieved to prepare the input sequence for the speech synthesizer.
- a proper transformation is generally applied in order to obtain a suitable set of coefficients for the speech synthesizer.
- the speech synthesizer is activated for each segmental unit, producing the synthetic speech signal.
- each segmental unit should be carefully chosen accounting for the desired synthetic speech quality and for memory occupancy of the resulting segmental unit set.
- a segmental unit as large as a syllable or more can be used if high-quality synthetic speech is required, but this requires a very large memory because of the very large number of possible syllables in a language.
- a decrease of the size of the segmental units generally results in a lower synthetic speech quality, because a number of coarticulation phenomena occurring in natural speech are not represented. In this case the memory requirements are much less than those of the previous case.
- the use of phoneme-sized segmental units allows a large memory occupancy saving, but the resulting quality of synthetic speech is very poor.
- diphonic segmental units are units that represent the coarticulation phenomena occurring between two phonetic units. It is assumed that in natural speech the occurrence of a phonetic event is influenced only by the previous and/or following phonemes, thus allowing the representation of coarticulation effects with units extending only over pairs of phonemes.
- the first disclosure of the diphonic unit as a proper unit for the speech synthesis process is found in "Terminal Analog Synthesis of Continuous Speech using the Diphone Method of Segment Assembly", by N. R. Dixon, H. D. Maxey, in IEEE Transactions on Audio and Electroacoustics, N.16, p. 40, 1968, and a number of speech synthesis systems have been developed using this technique. The trade-off between the number of segmental units and memory occupancy does not make this technique suitable for the development of high-quality, low-cost speech synthesis systems.
- the number N of diphone units varies with the language to be synthesized, ranging from about one thousand up to three thousand units.
- a number of coding techniques are adopted, mainly based on a spectral representation of speech.
- the speech signal is examined by an acoustic processor with respect to the main psycho-acoustic characteristics of the human perception of speech.
- Each interval is then represented by P coefficients a_{n,l,p}, 1 ≤ p ≤ P (usually 10 ≤ P ≤ 16), suitable for the synthesis purpose.
- the main approach relies on Linear Predictive Coding (LPC) of speech, because other approaches used in different fields of speech processing are not suitable for a direct use in speech synthesis.
- LPC Linear Predictive Coding
- the speech signal is then synthesized utilising the stored spectral representation to model phone-to-phone transitions, while steady-state phone sections are obtained by interpolating spectra between the end of a transition and the beginning of the next.
- Two possible solutions, given the 2-byte floating point representation of each LPC coefficient, are either the use of different coding schemes or a reduction of the number of segmental units.
- the main drawback of the first solution, i.e. different coding schemes, is a considerable lowering of the synthetic speech quality, mainly due to the local properties of the adopted coders.
- the drawback of the second solution is a reduction of the synthetic speech quality due to a poorer representation of coarticulation phenomena.
- the invention as claimed is intended to remedy the above drawbacks. It solves the problem of compactly representing the segmental unit set by the use of a spectral coding method based on Hidden Markov Model (HMM) techniques.
- HMM Hidden Markov Model
- the main advantage of the invention is that it drastically reduces the memory occupancy and allows a large set of diphones and triphones to be stored in a memory space smaller than that necessary to represent the same set of only diphones using prior art techniques. According to the present invention the same amount of memory can be used to store a larger set of segmental units, resulting in a better representation of coarticulation phenomena present in natural speech. Another advantage is that the use of HMM techniques in the synthetic speech reconstruction process allows the smoothing of spectral trajectories at the borders of linked segmental units, using a detailed model of the vocal apparatus dynamics.
- the present invention achieves the aim of compactly representing segmental units by using an AEHMM-based acoustic labelling of the segmental unit set, as described below.
- the set U of segmental units is determined using traditional approaches based on semiautomatic segmentation techniques as described in the article entitled "A Database for Diphone Units Extraction", by G. Ferri et al., in Proceedings of ESCA ETRW on Speech Synthesis, Autrans, France, Sep. 1990. Then, using the same speech material, an Acoustic Ergodic Hidden Markov Model (AEHMM) is trained in order to obtain the model of the spectral dynamics of the language.
- AEHMM Acoustic Ergodic Hidden Markov Model
- a suitable parametric representation is calculated, obtaining the sequence P_{n,i}, 1 ≤ n ≤ N, 1 ≤ i ≤ I_n. Then the most probable sequence of states q_{n,i} of the AEHMM coder is calculated using the sequence P_{n,i} as input and the Viterbi algorithm. The process is repeated for each segmental unit u_k in the set U.
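- For illustration, a minimal sketch of this labelling step is given below, assuming diagonal-covariance Gaussian observation densities with a common variance for the AEHMM states; the function name, the log-probability formulation and the NumPy usage are choices of this sketch, not specifics of the patent:

```python
import numpy as np

def viterbi_labels(frames, means, log_trans, log_prior, var=1.0):
    """Most probable AEHMM state (label) sequence for one segmental unit.

    frames   : (T, P) parametric vectors of the unit
    means    : (M, P) expected feature vector of each AEHMM state (source)
    log_trans: (M, M) log transition probabilities A
    log_prior: (M,)   log initial state probabilities
    var      : assumed common diagonal variance (illustrative simplification)
    """
    T, M = frames.shape[0], means.shape[0]
    # Gaussian observation log-densities (up to a constant) for every frame/state pair.
    log_obs = -0.5 * ((frames[:, None, :] - means[None, :, :]) ** 2).sum(-1) / var

    delta = log_prior + log_obs[0]            # best log score ending in each state
    backptr = np.zeros((T, M), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans   # indexed as (from_state, to_state)
        backptr[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_obs[t]

    labels = np.empty(T, dtype=int)
    labels[-1] = delta.argmax()
    for t in range(T - 1, 0, -1):             # backtrack the best path
        labels[t - 1] = backptr[t, labels[t]]
    return labels                             # acoustic label sequence for the unit
```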
- M = 256
- M × P × 2 = 7,160 bytes.
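- As an illustrative back-of-the-envelope comparison of the two representations (the inventory size and per-unit frame count below are assumptions chosen to match the example figures P = 14 and 80 ms / 8 ms framing given elsewhere in this description; with these assumed values the shared codebook comes to roughly 7 kB, of the same order as the figure quoted above):

```python
M, P = 256, 14          # AEHMM size and number of coefficients per state
BYTES_PER_COEFF = 2     # 2-byte representation of each coefficient
FRAMES_PER_UNIT = 10    # e.g. an 80 ms unit with 8 ms frame spacing
N_UNITS = 2000          # assumed mid-range diphone inventory size (illustrative)

codebook = M * P * BYTES_PER_COEFF                           # shared AEHMM codebook (~7 kB)
lpc_storage = N_UNITS * FRAMES_PER_UNIT * P * BYTES_PER_COEFF
label_storage = N_UNITS * FRAMES_PER_UNIT * 1 + codebook     # 1 byte per label since M = 256

print(f"plain LPC storage  : {lpc_storage} bytes")           # 560000 bytes
print(f"AEHMM label storage: {label_storage} bytes")         # 27168 bytes
```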
- the segmental unit set is usually designed by a phonetic expert.
- a set of natural carrier utterances is chosen and recorded in quiet conditions, in order to represent all significant sound co-occurrences in the given language.
- the acoustic signal is then converted into digital format using analog-to-digital conversion techniques.
- segmental units are extracted from the carrier utterances generally by visual inspection of spectral representations, locating by a manual procedure the segmental unit borders. Generally the segmental unit borders are located in the spectral steady-state regions of adjacent phonetic events, as described in the article entitled "A Systematic Approach to the Extraction of Diphone Elements from Natural Speech", by H. Kaeslin, in IEEE Transactions ASSP-34, N.2, Apr. 1986.
- a copy of each segmental unit u_n, 1 ≤ n ≤ N, is produced using some editing feature, and is stored in a suitable sampled data format in order to be easily retrieved.
- the Amplitude and Duration Database contains, for each phoneme in the language, the average duration and amplitude given the position in the syllable, taken from natural speech data. Details on this procedure can be found in the article entitled "Automatic Inference of a Syllabic Prosodic Model", by A. Falaschi, M. Giustiniani, P. Pierucci in Proceedings of ESCA ETRW on Speech Synthesis, Autrans, France, Sep.1990.
- the speech signal contains information about the vocal tract behaviour, i.e. the configuration of the vocal apparatus, and about the source signal, produced by the vocal cords and/or by constrictions in the vocal apparatus.
- the contribution of the source signal is often discarded from the representation of the segmental units, because it can be easily reconstructed during the synthesis phase from supra-segmental characteristics.
- the next step in building the segmental unit set is then the coding of the sampled data files using a suitable representation. Possible candidates are the Discrete Fourier Transform, Formant Tracks and Linear Predictive Coding (LPC). The last is the most widely used for text-to-speech synthesis based on segmental unit concatenation, mainly because LPC allows an automatic determination of the vocal tract representation.
- LPC Linear Predictive Coding
- simple transformations relate LPC coefficients to other coefficients used during the interpolation stage (Log Area Ratios) and to coefficients used in the synthesis process (Reflection Coefficients).
- the resulting set a_{n,l,p}, 1 ≤ l ≤ I_n, 1 ≤ p ≤ P, of LPC coefficients, with I_n = 10, is obtained for a segmental unit u_n with a length of 80 ms and a frame spacing of 8 ms.
- the invention teaches the use of a continuous spectral density Ergodic Hidden Markov Model (hereafter AEHMM, Acoustic Ergodic Hidden Markov Model).
- AEHMM Acoustic Ergodic Hidden Markov Model
- a full description of this particular kind of Markov model can be found in the European patent application N.90119789.7 entitled "A Phonetic Hidden Markov Model Speech Synthesizer".
- the AEHMM is characterized by a continuous observation probability distribution, giving the probability of observing a frame of speech, and a transition probability, giving the probability of passing from a state at time t-1 to every other state at time t, given an input sequence of parametric observations extracted from speech data.
- the observation probability functions, one for each of the M states, are representative of the local spectral characteristics of the speech signal, i.e. they represent a basic alphabet of sounds for the given language.
- the transition probabilities, M for each of the M states, represent the rules governing the spectral dynamics of the speech signal, i.e. the constraints that are present in the speech production mechanism.
- M is the size of the model, i.e. the number of the model's states
- Q is the set of states
- A is the state transition matrix
- F is a set of observation probability functions (a minimal data-structure sketch of these parameters is given below).
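- A minimal sketch, assuming Gaussian observation densities with diagonal covariances, of how the parameters M, Q, A and F listed above could be held in memory; the field names are illustrative and not taken from the patent:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class AEHMM:
    """Acoustic Ergodic Hidden Markov Model (illustrative container).

    Q, the set of states, is implicit: states are the indices 0..M-1.
    """
    means: np.ndarray      # (M, P) mean feature vector of each state's source  -> part of F
    variances: np.ndarray  # (M, P) diagonal covariances of each state's source -> part of F
    trans: np.ndarray      # (M, M) state transition matrix A (fully connected, i.e. ergodic)
    prior: np.ndarray      # (M,)   initial state probabilities

    @property
    def M(self) -> int:    # model size, i.e. number of states (labels)
        return self.means.shape[0]
```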
- a Hidden Markov Model represents two stochastic processes, one that is observable and one that is hidden.
- the observed process is the sequence of features extracted from speech
- the underlying hidden process is the sequence of local sources that most probably have generated the observed speech.
- the AEHMM associates the features computed from each speech signal frame with the state, or set of states, and therefore with the corresponding signal sources, that most probably have emitted that frame's features.
- Each source may be represented by a progressive number, named a label, where the number of labels is equal to the size of the AEHMM.
- the final result is that the AEHMM associates with each frame the label, or labels, of the sources that most probably have emitted the frame. This action will be referred to as acoustic labelling.
- the basic point in the use of the AEHMM in the present invention is that of generating, for a speech utterance, the sequence of sources, and hence of labels, that most probably have generated the observed speech utterance, where the probability is computed over the whole utterance and not only locally, as is done with standard vector quantizers.
- the AEHMM is initialised by any standard clustering algorithm applied to the same parametric representation of speech used in the AEHMM.
- the model is preferably initialised by a Vector Quantization clustering scheme (VQ in the following), having the same size as the AEHMM, and applied to the same set of speech utterances used subsequently for the AEHMM model re-estimation procedure, as described in the articles entitled "Design and Performance of Trellis Vector Quantizers for Speech Signals" by B. H.
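- A minimal sketch of this kind of clustering initialisation, here using plain k-means over the training frames; the iteration count and the random seeding are assumptions of this sketch rather than details of the cited article:

```python
import numpy as np

def init_codebook(features, M, iters=20, seed=0):
    """Initialise M AEHMM state means by k-means clustering of speech frames.

    features: (T, P) parametric vectors extracted from the training utterances
    Returns an (M, P) array of centroids used as initial observation density means.
    """
    rng = np.random.default_rng(seed)
    centroids = features[rng.choice(len(features), M, replace=False)]
    for _ in range(iters):
        # Assign every frame to its nearest centroid (a local, frame-by-frame decision).
        d = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        for m in range(M):
            if np.any(labels == m):
                centroids[m] = features[labels == m].mean(axis=0)
    return centroids
```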
- the signal is divided into slices u_{n,l} of the same length, named frames; the autocorrelation function and the LPC coefficients are computed for each frame, obtaining the sequences r_{n,l,p} and a_{n,l}, 1 ≤ n ≤ N, 1 ≤ l ≤ I_n. Then the most probable sequence of states q_{n,l} of the AEHMM coder is calculated using the sequences r_{n,l,p} and a_{n,l} as input and the Viterbi algorithm.
- a suitable value for P, the number of autocorrelation lags, is 14, but other values can be used as well.
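- A minimal sketch of the per-frame analysis (autocorrelation followed by LPC via the Levinson-Durbin recursion), assuming P = 14 lags and a Hamming analysis window; pre-emphasis and frame overlap details are omitted:

```python
import numpy as np

def autocorr(frame, P=14):
    """First P+1 autocorrelation lags r[0..P] of one speech frame."""
    frame = frame * np.hamming(len(frame))          # assumed analysis window
    return np.array([np.dot(frame[:len(frame) - k], frame[k:]) for k in range(P + 1)])

def levinson_durbin(r):
    """LPC coefficients a[1..P] and reflection coefficients k[1..P] from autocorrelation r."""
    P = len(r) - 1
    a = np.zeros(P + 1); a[0] = 1.0
    k = np.zeros(P)
    err = r[0]
    for i in range(1, P + 1):
        acc = r[i] + np.dot(a[1:i], r[i-1:0:-1])
        k[i-1] = -acc / err
        a[1:i+1] = a[1:i+1] + k[i-1] * a[i-1::-1]   # order-i prediction coefficients
        err *= (1.0 - k[i-1] ** 2)                  # normalised prediction error update
    return a[1:], k, err
```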
- For each segmental unit the Viterbi algorithm is activated, obtaining the corresponding acoustic label sequence.
- the segmental unit representation in terms of AEHMM labels is stored in the reference set. The process is repeated until all the segmental units of the set are considered.
- FIG.1 shows the block diagram of a text-to-speech synthesizer using the invention.
- the text-to-speech synthesizer of FIG.1 includes a Text Input module 100, a Text Processor 101, a Duration and Amplitude Processor 102, a Segmental Unit Processor 103, a Prosodic Processor 104, a Segmental Unit Linker 105, and a Synthesis Filter 106.
- the blocks labelled as 107, 108 are the Duration and Amplitude Database and the Segmental Unit Database respectively, built according to the teachings of the previous sections "Generation of the Segmental Unit Set" and "AEHMM Coding of Segmental Unit Set".
- the Text Input 100 receives a graphemic string of characters.
- the Text Processor 101 translates the input graphemic string into a phonetic string using a phonetic alphabet and a set of rules, in order to have a one-to-one correspondence between the output phonetic symbols and the set of acoustic units used for synthesis (letter-to-sound rules).
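- A minimal, hypothetical sketch of rule-based letter-to-sound conversion of the kind performed by the Text Processor 101; the toy rules and phonetic symbols below are illustrative assumptions and do not reproduce the rule set or the alphabet of FIG.2:

```python
# Ordered grapheme-to-phoneme rules; longer graphemes are listed before their prefixes.
# The rules and symbols are a made-up illustrative subset, not the rules of the patent.
RULES = [
    ("ch",  ["k"]),        # e.g. "che" -> /k e/
    ("gn",  ["J"]),        # palatal nasal, e.g. "gnocco"
    ("gli", ["L", "i"]),   # palatal lateral
    ("c",   ["k"]),
    ("a", ["a"]), ("e", ["e"]), ("i", ["i"]), ("o", ["o"]), ("u", ["u"]),
]

def letter_to_sound(word):
    """Translate a graphemic string into a phonetic symbol sequence by greedy rule matching."""
    phones, pos = [], 0
    while pos < len(word):
        for graph, ph in RULES:
            if word.startswith(graph, pos):
                phones.extend(ph)
                pos += len(graph)
                break
        else:                          # unknown grapheme: pass it through unchanged
            phones.append(word[pos])
            pos += 1
    return phones

print(letter_to_sound("gnocchi"))      # ['J', 'o', 'k', 'k', 'i'] with this toy rule set
```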
- the Text Processor 101 includes stress positioning rules, phonotactic rules, syllabification rules, morpho-syntactical analysis and phonetic transcription rules.
- the Text Processor 101 incorporates most of the linguistic knowledge required by the system and is language dependent in its structure. A possible set of phonetic symbols for the Italian language is reported in FIG.2.
- the Duration and Amplitude Processor 102 evaluates the proper duration and amplitude for each phonetic symbol to be synthesized.
- This module makes use of a syllable model and morpho-syntactical information in order to produce the desired output; it is based on the concept of the intrinsic duration of a phoneme: each phoneme is treated differently according to its position in the syllable and with respect to the lexical stress. Syllable models of this kind have been proposed in the literature. In particular, a collection of speech data has previously been examined to determine the correct amplitude and duration values given the syllabic position of the phonetic symbol in the word.
- the result of the processing is reported in FIG.6, where the phoneme sequence is depicted and each phoneme has an associated pair of indexes identifying its syllabic position and, from that, its intrinsic duration, and in FIG.7, where the sequence of words and the corresponding parts of speech (POS) are shown. Other procedures to compute energy may be used as well.
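- A minimal sketch of the intrinsic-duration and amplitude lookup described above; the table entries and the (phoneme, syllabic position, stress) key are made-up illustrations, not data from the Amplitude and Duration Database 107:

```python
# (phoneme, position-in-syllable, stressed) -> (duration in ms, relative amplitude)
# The numbers below are invented for illustration only.
INTRINSIC = {
    ("a", "nucleus", True):  (110, 1.00),
    ("a", "nucleus", False): (80,  0.85),
    ("t", "onset",   True):  (70,  0.60),
    ("t", "onset",   False): (60,  0.55),
}

def duration_and_amplitude(phoneme, position, stressed, default=(70, 0.7)):
    """Return the target duration and amplitude for one phonetic symbol."""
    return INTRINSIC.get((phoneme, position, stressed), default)
```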
- the output of the Duration and Amplitude Processor 102 is sent to the Prosodic Processor 104, the Segmental Unit Linker 105 and the Synthesis Filter 106.
- the Prosodic Processor 104 receives as input the phonetic string in order to create an intonative contour for the sentence to be synthesized. For each phoneme the period of the excitation function (pitch) is determined, accounting for the sentence level of the phrase (interrogative, affirmative, ...), the importance of the word the phoneme belongs to (noun, verb, adjective, ...), the stress position, the duration and the intonative contour continuity constraints. A sequence of pitch values to be used during the synthesis is obtained at the end of this phase. This sequence is sent to the Synthesis Filter 106.
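- A minimal, hypothetical sketch of how a per-phoneme pitch target could be derived from the factors listed above (a declination line, a boost on stressed phonemes, a final rise for questions); all constants are illustrative assumptions, not the prosodic rules of the patent:

```python
def pitch_contour(phonemes, durations_ms, base_hz=120.0, decl_hz_per_s=-10.0,
                  stress_boost_hz=25.0, question=False):
    """Assign one pitch target per phoneme.

    phonemes    : list of (symbol, stressed) pairs
    durations_ms: duration of each phoneme in milliseconds
    Returns a list of F0 values in Hz; the excitation period is 1/F0.
    """
    t, pitch = 0.0, []
    total = sum(durations_ms) / 1000.0
    for (symbol, stressed), dur in zip(phonemes, durations_ms):
        f0 = base_hz + decl_hz_per_s * t        # declination over the sentence
        if stressed:
            f0 += stress_boost_hz               # lexical stress raises the target
        if question and t > 0.8 * total:        # final rise on interrogatives
            f0 += 30.0
        pitch.append(f0)
        t += dur / 1000.0
    return pitch
```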
- the set up of spectral parameters is initiated by the Segmental Unit Linker 105.
- each segmental unit is stretched in order to reach the correct duration of the corresponding phonetic event and a suitable interpolation scheme is applied to prevent spectral and amplitude discontinuities at the segmental unit borders.
- the Segmental Unit Linker 105 is shown in two different implementations SU1 and SU2 in FIG.9 and FIG.10 respectively.
- the segmental units are first decoded by the Segmental Unit Decoding Processor SU11, the task of which is to back-transform the AEHMM segmental unit representation into a sequence of feature vectors.
- the Segmental Unit Decoding Processor SU11 associates with each label the corresponding source model of the AEHMM, as determined in the previous AEHMM training; this means that, in the present embodiment, each label resulting from the segmental unit coding procedure is associated with the vector of expected values of the source parameters. This is immediate using multivariate Gaussian distributions: in this case each label is associated with the mean of the Gaussian density itself. Then the segmental units are stretched in order to reach the proper duration, using a suitable interpolation scheme.
- the interpolation stage is a crucial issue in the design of the speech synthesizer; the segmental unit representation should be chosen in order to allow such spectral interpolation, or, alternatively, proper transformations should be applied to the adopted representation before and after the application of the interpolation scheme. It is of course desirable that the features be linear with respect to the interpolation scheme. If prediction coefficients are used, it is preferred to transform them into more linear features such as, for instance, Log Area Ratios.
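- A minimal sketch of this transform-interpolate-back-transform idea: reflection coefficients are mapped to Log Area Ratios, interpolated linearly across a unit border, and mapped back; the particular LAR sign convention is an assumption of this sketch:

```python
import numpy as np

def rc_to_lar(k):
    """Reflection coefficients -> Log Area Ratios (assumed convention)."""
    return np.log((1.0 + k) / (1.0 - k))

def lar_to_rc(g):
    """Log Area Ratios -> reflection coefficients (inverse of the above)."""
    return np.tanh(g / 2.0)

def interpolate_frames(k_start, k_end, n_frames):
    """Linearly interpolate two spectral frames in the LAR domain,
    e.g. across the border between two concatenated segmental units."""
    g0, g1 = rc_to_lar(np.asarray(k_start)), rc_to_lar(np.asarray(k_end))
    w = np.linspace(0.0, 1.0, n_frames)[:, None]
    return lar_to_rc((1.0 - w) * g0 + w * g1)   # (n_frames, P) reflection coefficients
```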
- the feature vector transformation of SU12 is indicated by τ(a_i), and gives a different set of feature vectors l_i: l_i = τ(a_i).
- reflection coefficients are used, as described in the article entitled "Speech Analysis and Synthesis by Linear Prediction of the Speech Wave", by B. S. Atal and S. L. Hanauer, in The Journal of the Acoustical Society of America, Vol. 50, N. 2, pp. 637-655, Apr. 1971.
- a sequence of spectral parameters to be sent to the Synthesis Filter 106 is obtained at the end of this phase.
- a sequence of labels to be sent to the AEHMM Segmental Unit Decoding module SU22 is obtained at the end of this phase.
- the structure and operation of the module SU22 is identical to that of module SU11.
- the following stage operates a transformation of the feature vectors into a representation domain that is more suitable for interpolation.
- the module SU23 contains a coefficient transformation procedure that should be regarded as identical to that illustrated for the module SU12.
- the AEHMM Interpolation Processor SU24 is then invoked. It performs the computation that generates the actual feature vectors to be used in the Synthesis Filter 106.
- a weighted mean of the transformed feature vectors of the AEHMM Codebook is computed.
- the output feature vector for each frame to synthesize is then computed by weighting each transformed feature vector of the codebook by its state probability at time t, u_t^av = Σ_{i=1..M} prob(q_i, t) · l_i, where prob(q_i, t) are the probabilities of each state as computed by the forward-backward algorithm, l_i are the associated feature vectors of the codebook of size M, and u_t^av is the resulting feature vector sent to the Synthesis Filter 106.
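- A minimal sketch of this weighted-mean computation, assuming the per-frame state probabilities have already been produced by the forward-backward algorithm:

```python
import numpy as np

def decode_weighted(state_probs, codebook_vectors):
    """Compute u_t^av = sum_i prob(q_i, t) * l_i for every frame t.

    state_probs     : (T, M) state probabilities from the forward-backward algorithm
    codebook_vectors: (M, P) transformed feature vectors l_i of the AEHMM codebook
    Returns the (T, P) feature vectors to be back-converted for the Synthesis Filter 106.
    """
    return state_probs @ codebook_vectors
```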
- the result is then back-converted into a spectral representation suitable for the synthesis process by the module SU25, whose structure is similar to the module SU14.
- the synthetic speech is generated by the Synthesis Filter 106.
- the amplitude, pitch, and spectral parameters are taken from the input.
- the speech synthesis algorithm is activated to obtain the segment of synthetic speech.
- the set of reflection coefficients k_1, ..., k_P feeds the boxes labelled 1 ... P
- the pitch parameter produced by the Prosodic Processor 104 feeds the voicing control
- the amplitude produced by the Duration and Amplitude Processor 102, multiplied by the normalised prediction error G, where k_i are the reflection coefficients, feeds the gain control.
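- A minimal sketch of an all-pole lattice synthesis filter of the kind described above, driven per frame by reflection coefficients, a voiced excitation and a gain value; the exact form assumed below for the normalised prediction error term is an illustration, not the patent's formula:

```python
import numpy as np

def lattice_synthesize(excitation, k, gain=1.0, state=None):
    """All-pole lattice synthesis filter driven by reflection coefficients k[0..P-1].

    excitation: excitation samples (impulse train for voiced frames, noise otherwise)
    k         : reflection coefficients of the current frame
    gain      : gain control value (amplitude x normalised prediction error)
    state     : delayed backward prediction errors, carried across frames
    """
    P = len(k)
    b = np.zeros(P) if state is None else state
    out = np.empty(len(excitation))
    for n, e in enumerate(excitation):
        f = gain * e
        for i in range(P - 1, -1, -1):
            f = f - k[i] * b[i]               # forward prediction error, stage i+1 -> i
            if i + 1 < P:
                b[i + 1] = b[i] + k[i] * f    # updated backward error for the next sample
        b[0] = f
        out[n] = f
    return out, b

def voiced_excitation(n_samples, pitch_period):
    """Impulse-train excitation at the pitch period for a voiced frame."""
    e = np.zeros(n_samples)
    e[::pitch_period] = 1.0
    return e

def gain_from_rc(amplitude, k):
    # Assumed form of the amplitude x normalised-prediction-error gain mentioned in the text.
    return amplitude * np.sqrt(np.prod(1.0 - np.asarray(k) ** 2))
```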
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP91108575A EP0515709A1 (fr) | 1991-05-27 | 1991-05-27 | Méthode et dispositif pour la représentation d'unités segmentaires pour la conversion texte-parole |
JP9553292A JPH05197398A (ja) | 1991-05-27 | 1992-04-15 | 音響単位の集合をコンパクトに表現する方法ならびに連鎖的テキスト−音声シンセサイザシステム |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP91108575A EP0515709A1 (fr) | 1991-05-27 | 1991-05-27 | Méthode et dispositif pour la représentation d'unités segmentaires pour la conversion texte-parole |
Publications (1)
Publication Number | Publication Date |
---|---|
EP0515709A1 true EP0515709A1 (fr) | 1992-12-02 |
Family
ID=8206774
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP91108575A Withdrawn EP0515709A1 (fr) | 1991-05-27 | 1991-05-27 | Méthode et dispositif pour la représentation d'unités segmentaires pour la conversion texte-parole |
Country Status (2)
Country | Link |
---|---|
EP (1) | EP0515709A1 (fr) |
JP (1) | JPH05197398A (fr) |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO1994017518A1 (fr) * | 1993-01-21 | 1994-08-04 | Apple Computer, Inc. | Systeme de synthese vocale a codage/decodage de signaux vocaux base sur la quantification vectorielle |
WO1994017517A1 (fr) * | 1993-01-21 | 1994-08-04 | Apple Computer, Inc. | Technique de melange de formes d'ondes pour systeme de conversion texte-voix |
EP0623915A1 (fr) * | 1993-05-07 | 1994-11-09 | Xerox Corporation | Dispositif de décodage pour images de documents sous utilisation de méthodes modifiées de ramification et de liaison |
EP0674307A2 (fr) * | 1994-03-22 | 1995-09-27 | Canon Kabushiki Kaisha | Procédé et dispositif de traitement d'informations de parole |
EP0689192A1 (fr) * | 1994-06-22 | 1995-12-27 | International Business Machines Corporation | Système de synthèse du langage |
EP0689193A1 (fr) * | 1994-06-20 | 1995-12-27 | International Business Machines Corporation | Reconnaissance du langage sous utilisation de caractéristiques dynamiques |
GB2296846A (en) * | 1995-01-07 | 1996-07-10 | Ibm | Synthesising speech from text |
WO1996022514A2 (fr) * | 1995-01-20 | 1996-07-25 | Sri International | Procede et appareil de reconnaissance vocale adaptee a un locuteur individuel |
WO1997032299A1 (fr) * | 1996-02-27 | 1997-09-04 | Philips Electronics N.V. | Procede et appareil pour la segmentation automatique de la parole en unites du type phoneme |
EP0833304A2 (fr) * | 1996-09-30 | 1998-04-01 | Microsoft Corporation | Bases de données prosodiques contenant des modèles de fréquences fondamentales pour la synthèse de la parole |
EP0848372A2 (fr) * | 1996-12-10 | 1998-06-17 | Matsushita Electric Industrial Co., Ltd. | Système de synthèse de la parole et base de données des formes d'ondes à redondance réduite |
US7977562B2 (en) | 2008-06-20 | 2011-07-12 | Microsoft Corporation | Synthesized singing voice waveform generator |
GB2501062A (en) * | 2012-03-14 | 2013-10-16 | Toshiba Res Europ Ltd | A Text to Speech System |
US9361722B2 (en) | 2013-08-08 | 2016-06-07 | Kabushiki Kaisha Toshiba | Synthetic audiovisual storyteller |
-
1991
- 1991-05-27 EP EP91108575A patent/EP0515709A1/fr not_active Withdrawn
-
1992
- 1992-04-15 JP JP9553292A patent/JPH05197398A/ja active Pending
Non-Patent Citations (2)
Title |
---|
AT & T TECHNICAL JOURNAL, vol. 65, no. 5, September-October 1986, pages 2-11, Short Hills, NJ, US; R.E. CROCHIERE et al.: "Speech processing: an evolving technology" * |
EUROSPEECH 89, EUROPEAN CONFERENCE ON SPEECH COMMUNICATION AND TECHNOLOGY, Paris, September 1989, vol. 2, pages 187-190; A. FALASCHI et al.: "A hidden Markov model approach to speech synthesis" * |
Cited By (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5526444A (en) * | 1991-12-10 | 1996-06-11 | Xerox Corporation | Document image decoding using modified branch-and-bound methods |
WO1994017517A1 (fr) * | 1993-01-21 | 1994-08-04 | Apple Computer, Inc. | Technique de melange de formes d'ondes pour systeme de conversion texte-voix |
WO1994017518A1 (fr) * | 1993-01-21 | 1994-08-04 | Apple Computer, Inc. | Systeme de synthese vocale a codage/decodage de signaux vocaux base sur la quantification vectorielle |
EP0623915A1 (fr) * | 1993-05-07 | 1994-11-09 | Xerox Corporation | Dispositif de décodage pour images de documents sous utilisation de méthodes modifiées de ramification et de liaison |
EP0674307A2 (fr) * | 1994-03-22 | 1995-09-27 | Canon Kabushiki Kaisha | Procédé et dispositif de traitement d'informations de parole |
US5845047A (en) * | 1994-03-22 | 1998-12-01 | Canon Kabushiki Kaisha | Method and apparatus for processing speech information using a phoneme environment |
EP0674307A3 (fr) * | 1994-03-22 | 1996-04-24 | Canon Kk | Procédé et dispositif de traitement d'informations de parole. |
US5615299A (en) * | 1994-06-20 | 1997-03-25 | International Business Machines Corporation | Speech recognition using dynamic features |
EP0689193A1 (fr) * | 1994-06-20 | 1995-12-27 | International Business Machines Corporation | Reconnaissance du langage sous utilisation de caractéristiques dynamiques |
GB2290684A (en) * | 1994-06-22 | 1996-01-03 | Ibm | Speech synthesis using hidden Markov model to determine speech unit durations |
US5682501A (en) * | 1994-06-22 | 1997-10-28 | International Business Machines Corporation | Speech synthesis system |
EP0689192A1 (fr) * | 1994-06-22 | 1995-12-27 | International Business Machines Corporation | Système de synthèse du langage |
GB2296846A (en) * | 1995-01-07 | 1996-07-10 | Ibm | Synthesising speech from text |
WO1996022514A2 (fr) * | 1995-01-20 | 1996-07-25 | Sri International | Procede et appareil de reconnaissance vocale adaptee a un locuteur individuel |
WO1996022514A3 (fr) * | 1995-01-20 | 1996-09-26 | Stanford Res Inst Int | Procede et appareil de reconnaissance vocale adaptee a un locuteur individuel |
WO1997032299A1 (fr) * | 1996-02-27 | 1997-09-04 | Philips Electronics N.V. | Procede et appareil pour la segmentation automatique de la parole en unites du type phoneme |
EP0833304A2 (fr) * | 1996-09-30 | 1998-04-01 | Microsoft Corporation | Bases de données prosodiques contenant des modèles de fréquences fondamentales pour la synthèse de la parole |
EP0833304A3 (fr) * | 1996-09-30 | 1999-03-24 | Microsoft Corporation | Bases de données prosodiques contenant des modèles de fréquences fondamentales pour la synthèse de la parole |
EP0848372A3 (fr) * | 1996-12-10 | 1999-02-17 | Matsushita Electric Industrial Co., Ltd. | Système de synthèse de la parole et base de données des formes d'ondes à redondance réduite |
EP0848372A2 (fr) * | 1996-12-10 | 1998-06-17 | Matsushita Electric Industrial Co., Ltd. | Système de synthèse de la parole et base de données des formes d'ondes à redondance réduite |
US6125346A (en) * | 1996-12-10 | 2000-09-26 | Matsushita Electric Industrial Co., Ltd | Speech synthesizing system and redundancy-reduced waveform database therefor |
US7977562B2 (en) | 2008-06-20 | 2011-07-12 | Microsoft Corporation | Synthesized singing voice waveform generator |
GB2501062A (en) * | 2012-03-14 | 2013-10-16 | Toshiba Res Europ Ltd | A Text to Speech System |
GB2501062B (en) * | 2012-03-14 | 2014-08-13 | Toshiba Res Europ Ltd | A text to speech method and system |
US9454963B2 (en) | 2012-03-14 | 2016-09-27 | Kabushiki Kaisha Toshiba | Text to speech method and system using voice characteristic dependent weighting |
US9361722B2 (en) | 2013-08-08 | 2016-06-07 | Kabushiki Kaisha Toshiba | Synthetic audiovisual storyteller |
Also Published As
Publication number | Publication date |
---|---|
JPH05197398A (ja) | 1993-08-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Yu et al. | DurIAN: Duration Informed Attention Network for Speech Synthesis. | |
US5230037A (en) | Phonetic hidden markov model speech synthesizer | |
US5682501A (en) | Speech synthesis system | |
JP4176169B2 (ja) | 言語合成のためのランタイムアコースティックユニット選択方法及び装置 | |
O'shaughnessy | Interacting with computers by voice: automatic speech recognition and synthesis | |
EP1704558B1 (fr) | Synthese de parole a partir d'un corpus, basee sur une recombinaison de segments | |
Chen et al. | Vector quantization of pitch information in Mandarin speech | |
US10692484B1 (en) | Text-to-speech (TTS) processing | |
EP0140777A1 (fr) | Procédé de codage de la parole et dispositif pour sa mise en oeuvre | |
EP0515709A1 (fr) | Méthode et dispositif pour la représentation d'unités segmentaires pour la conversion texte-parole | |
Cravero et al. | Definition and evaluation of phonetic units for speech recognition by hidden Markov models | |
Lee et al. | A segmental speech coder based on a concatenative TTS | |
Soong | A phonetically labeled acoustic segment (PLAS) approach to speech analysis-synthesis | |
JP5574344B2 (ja) | 1モデル音声認識合成に基づく音声合成装置、音声合成方法および音声合成プログラム | |
Yin | An overview of speech synthesis technology | |
Venkatagiri et al. | Digital speech synthesis: Tutorial | |
Chen et al. | A statistical model based fundamental frequency synthesizer for Mandarin speech | |
Ramasubramanian et al. | Ultra low bit-rate speech coding | |
Baudoin et al. | Advances in very low bit rate speech coding using recognition and synthesis techniques | |
Huckvale | 14 An Introduction to Phonetic Technology | |
Dong et al. | Pitch contour model for Chinese text-to-speech using CART and statistical model | |
Cai et al. | The DKU Speech Synthesis System for 2019 Blizzard Challenge | |
Chiang et al. | A New Model-Based Mandarin-Speech Coding System. | |
Lee et al. | Ultra low bit rate speech coding using an ergodic hidden Markov model | |
Pagarkar et al. | Language Independent Speech Compression using Devanagari Phonetics |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase |
Free format text: ORIGINAL CODE: 0009012 |
|
AK | Designated contracting states |
Kind code of ref document: A1 Designated state(s): DE FR GB IT |
|
17P | Request for examination filed |
Effective date: 19930331 |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: THE APPLICATION HAS BEEN WITHDRAWN |
|
18W | Application withdrawn |
Withdrawal date: 19940225 |
|
R18W | Application withdrawn (corrected) |
Effective date: 19940225 |