WO1998025260A2 - Speech synthesis using dual neural networks - Google Patents
- Publication number
- WO1998025260A2 (PCT/US1997/018815)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- speech
- principal
- speech parameters
- parameters
- supplementary
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/04—Details of speech synthesis systems, e.g. synthesiser structure or memory management
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Definitions
- The present invention relates to coder parameter generating systems used in speech synthesis, and more particularly to speech parameter feedback in such systems.
- Referring to FIG. 1, numeral 100: to convert text to speech, statistical systems (102) typically convert a phonetic representation of the text into a plurality of speech parameters that characterize the speech waveform. These systems perform this conversion using a statistical component that attempts to extract salient features from a database. It is desirable that the statistical system extract sufficient information from the database to allow the conversion of novel phonetic representations into satisfactory speech parameters.
- FIG. 1 is a schematic representation of a statistical system for synthesizing waveforms for speech as is known in the art.
- FIG. 2 is a schematic representation of a system for synthesizing waveforms for speech in accordance with the present invention.
- FIG. 3 is a schematic representation of frequency bin selection wherein frequency is plotted versus spectral magnitude.
- FIG. 4 is a schematic representation of one embodiment of energy bin boundary frequency selection in accordance with the present invention.
- FIG. 5 is a flow chart of one embodiment of steps in accordance with the method of the present invention.
- FIG. 6 is a flow chart of another embodiment of steps in accordance with the method of the present invention.
- The present invention provides a method and device for efficiently increasing the number of coder parameters, allowing a coder parameter generating system to extract more information from training examples and thus enabling more accurate speech synthesis.
- The coder parameter generating system is a neural network.
- A first neural network extracts domain-specific information by learning relations between the input and output data and feeds the learned information back to a second neural network, thus providing additional output parameters via a recurrent feedback mechanism.
- A supplementary set of speech parameters is added to the coder parameter generating system (201).
- The principal speech parameters may be, for example, the input parameters to a waveform synthesizer, whereas the supplementary set of speech parameters is not used directly in the generation of speech.
- The supplementary set of speech parameters would therefore be useless, except that a feedback path including a recurrent buffer (206) allows these parameters to be fed back as input to the coder parameter generating system (201), putting otherwise unused output information to work and increasing the performance of the coder parameter generating system (201).
- The parameter generating system (201) is thus trained to predict other parameters in addition to the coder parameters. These extra parameters are used internally by the parameter generating system (201) through the recurrent buffer (206).
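The recurrent feedback arrangement described above can be sketched in a few lines. This is a minimal illustration, not the patent's implementation: the layer sizes, the random weights, and the single two-layer network standing in for the generating system (201) are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

N_LINGUISTIC = 20  # size of the linguistic/phonetic input vector (assumed)
N_CODER = 10       # principal coder parameters sent to the synthesizer (assumed)
N_SUPP = 10        # supplementary parameters, used only via feedback (assumed)
N_HIDDEN = 32

# A single two-layer network stands in for the generating system (201).
W1 = rng.standard_normal((N_LINGUISTIC + N_SUPP, N_HIDDEN)) * 0.1
W2 = rng.standard_normal((N_HIDDEN, N_CODER + N_SUPP)) * 0.1

def step(linguistic_frame, recurrent_buffer):
    """One synthesis frame: the network sees the linguistic input plus the
    previous frame's supplementary parameters from the recurrent buffer."""
    x = np.concatenate([linguistic_frame, recurrent_buffer])
    hidden = np.tanh(x @ W1)
    out = hidden @ W2
    coder = out[:N_CODER]          # principal parameters, go to the synthesizer
    supplementary = out[N_CODER:]  # never synthesized directly, only fed back
    return coder, supplementary

buffer = np.zeros(N_SUPP)          # recurrent buffer (206), empty at start
for _ in range(5):                 # five frames of hypothetical input
    frame = rng.standard_normal(N_LINGUISTIC)
    coder, buffer = step(frame, buffer)
```

The key point of the sketch is that the supplementary outputs never reach the synthesizer; they only widen the training signal and return as inputs one frame later.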
- This coder parameter generating system may be broken into a principal system and a subsystem, as displayed in FIG. 2, numeral 200, where the principal system (202) generates the principal set of speech parameters and the subsystem (204) generates the supplementary set of modified speech parameters.
- The supplementary set of modified speech parameters may be chosen to represent the spectral energy in fixed frequency bins. These parameters are obtained by summing the magnitude over the appropriate frequency regions of a discrete Fourier transform (DFT) of the windowed speech waveform.
- The frequency regions may be chosen to fall between typical first formants (F1s) for a selected set of phonemes spoken by a selected speaker. Additional frequency regions may be chosen to fall between typical second formants (F2s) for the same phonemes and speaker.
- In FIG. 3, numeral 300, ten non-linear fixed frequency bins (304, 306, 308, 310, 312, 314, 316, 318, 320, 322) are selected to represent the spectral energy.
- The added parameters are obtained by summing the spectral magnitude over the selected frequency regions of a discrete Fourier transform of the windowed speech waveform (302).
- The frequency regions are selected (for example, as shown in FIG. 3) to provide energy measures carrying additional information that is useful in predicting the coder parameters.
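The energy-bin computation described above (summing DFT magnitudes over fixed frequency regions of a windowed frame) can be sketched as follows. The bin edges, frame length, and sample rate here are hypothetical; the patent's ten non-linear bins would be chosen per speaker.

```python
import numpy as np

def bin_energies(frame, sample_rate, edges_hz):
    """Sum DFT magnitudes of a windowed speech frame over fixed frequency
    bins; bin i covers [edges_hz[i], edges_hz[i+1])."""
    windowed = frame * np.hamming(len(frame))
    spectrum = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    idx = np.searchsorted(freqs, edges_hz)
    return np.array([spectrum[lo:hi].sum() for lo, hi in zip(idx, idx[1:])])

# Eleven hypothetical edges give ten non-linear bins, as in FIG. 3.
edges = [0, 250, 400, 550, 700, 900, 1200, 1600, 2200, 3000, 4000]
frame = np.sin(2 * np.pi * 440 * np.arange(256) / 8000)  # 440 Hz test tone
energies = bin_energies(frame, 8000, edges)  # energy peaks in the 400-550 Hz bin
```

Because a pure 440 Hz tone falls inside the third bin (400-550 Hz), that bin dominates the resulting energy vector, which is the property that makes such bins informative about formant positions.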
- An alternative choice for the supplementary set of speech parameters is a set of parameters derived from the principal set of speech parameters.
- The extra parameters obtained, i.e., the supplementary set of speech parameters, are fed back through the recurrent buffer (206).
- The recurrent buffer (206) is not used to train the energy bin subsystem (204); rather, the principal system (202) learns how to use the information provided through the recurrent buffer (206) once the energy bin subsystem (204) is fully trained.
- The subsystem (204) may include a second recurrent buffer that is used in training and testing, as may the principal system (202).
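The two-stage training order described above (subsystem first, without the buffer; principal system second, with the frozen subsystem's delayed output as extra input) can be illustrated with a linear least-squares fit standing in for full network training. All data and dimensions below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 20))        # hypothetical linguistic features
Y_coder = rng.standard_normal((200, 10))  # target coder parameters
Y_bins = rng.standard_normal((200, 10))   # target energy-bin parameters

def fit_linear(inputs, targets):
    """Least-squares fit as a stand-in for full network training."""
    w, *_ = np.linalg.lstsq(inputs, targets, rcond=None)
    return w

# Stage 1: the subsystem (204) is trained without the recurrent buffer.
W_sub = fit_linear(X, Y_bins)

# Stage 2: the principal system (202) is trained with the subsystem frozen;
# its input is the linguistic frame plus the previous frame's bin estimate,
# i.e., the contents of the recurrent buffer (206).
bins_prev = np.vstack([np.zeros((1, 10)), (X @ W_sub)[:-1]])  # one-frame delay
W_main = fit_linear(np.hstack([X, bins_prev]), Y_coder)
```

Freezing the subsystem before stage 2 mirrors the text: the buffer contents are treated as a fixed extra input feature, not as something the subsystem is trained through.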
- The frequencies of the energy bands are central to determining the usefulness of the selected parameters.
- A speech synthesizer that operates in accordance with the present invention is designed to imitate the speech of a single speaker. Analysis of formants from a single speaker provided the data for the graph shown in FIG. 4, which plots the frequencies of the first two formants of several vowels (iy, ux, ey, ih, uw, uh, ow, eh, ae, aa, ao, ax).
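One plausible way to place bin boundaries between typical formant values, as described above, is to take midpoints of adjacent measured formants. The F1 figures below are hypothetical round numbers for illustration, not measurements from the patent's FIG. 4.

```python
# Hypothetical average first-formant (F1) values in Hz for one speaker's
# vowels -- a real system would measure these from the speaker's recordings.
f1 = {"iy": 300, "uw": 350, "ih": 400, "uh": 450,
      "eh": 550, "ao": 600, "ae": 700, "aa": 750}

def boundaries_between_formants(formant_values):
    """Place a bin edge midway between each pair of adjacent typical
    formant values, so each bin tends to isolate one formant."""
    vals = sorted(set(formant_values))
    return [(a + b) / 2.0 for a, b in zip(vals, vals[1:])]

edges = boundaries_between_formants(f1.values())  # 7 edges between 8 formants
```

With boundaries placed between formants rather than on them, a vowel's formant energy falls squarely inside one bin, which keeps the energy measures discriminative across phonemes.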
- The principal system (202) and the subsystem (204) may each be a neural network, a decision tree, a genetic algorithm, or a combination of two or more of these.
- The standard method of using neural networks for speech synthesis has consisted of training a neural network to predict the coder parameters at a given time.
- In the training portion, the neural net is supplied with coder parameters; in the testing portion, the neural network generates these same parameters.
- The standard methods typically change the size or architecture of the network (including network transfer functions); modify the rates, number of iterations, or vector set used to train the network; add additional useful input information; or change the representation of the data.
- The speech spectra may be represented by linear predictive coding coefficients, cepstral coefficients, reflection coefficients, or various other parameters. Since these methods yield only marginal improvements, the method of the present invention was developed to provide training that uses more output parameters with a minimum of added complexity.
- The method of the present invention provides, in response to text or linguistic information, efficient generation of a parametric representation of speech.
- The method includes the steps of: A) using (502) a coder parameter generating system to provide a principal set and a supplementary set of speech parameters, the principal set of speech parameters being the parametric representation of speech; and B) providing (504) feedback to the coder parameter generating system using the supplementary set of speech parameters to modify the principal set of speech parameters.
- The method may further include providing (506) the modified principal set of speech parameters to a waveform synthesizer to synthesize speech.
- The coder parameter generating system is divided into a principal system (202) and a subsystem (204), wherein the principal system (202) generates the principal set of speech parameters and the subsystem (204) generates the supplementary set of speech parameters.
- The supplementary set of speech parameters consists of the energies in each of a predetermined set of frequency bands for speech in a selected time period. Boundaries of the predetermined set of frequency bands are typically determined by: determining typical values of the first and second formants of the speech of a selected speaker for a set of phonemes; selecting a set of frequencies that fall between those formant values; and selecting each frequency band of the predetermined set to consist of the frequencies between a selected two of the set of frequencies.
- The supplementary set of speech parameters generally includes a representation of speech derived from the principal set of speech parameters.
- The coder parameter generating system typically utilizes neural networks.
- The coder parameter generating system may instead utilize decision tree units or genetic algorithms.
- The method of the present invention provides, in response to linguistic information, efficient generation of a parametric representation of speech.
- The method includes the steps of: A) using (602) a principal system to generate a principal set of speech parameters; B) providing (604) the principal set of speech parameters and the linguistic information to a subsystem that modifies the principal set of speech parameters, and providing the modified set of speech parameters to the principal system; and C) generating (606), by the principal system, a parametric representation of speech.
- The modified principal set of speech parameters may be provided (608) to a waveform synthesizer to synthesize speech.
- The coder parameter generating system is typically divided into a principal system and a subsystem, wherein the principal system generates the principal set of speech parameters and the subsystem generates the supplementary set of speech parameters.
- The supplementary set of speech parameters may consist of the energies in each of a predetermined set of frequency bands for speech in a selected time period. Boundaries are typically determined as set forth above: determining typical values of the first and second formants of the speech of a selected speaker for a set of phonemes; selecting a set of frequencies that fall between those formant values; and selecting each frequency band to consist of the frequencies between a selected two of the set of frequencies.
- The supplementary set of speech parameters generally includes a representation of speech derived from the principal set of speech parameters.
- The coder parameter generating system typically utilizes neural networks.
- The coder parameter generating system may instead utilize decision tree units or genetic algorithms.
- The present invention may be implemented by a device for providing, in response to linguistic information, efficient generation of a parametric representation of speech.
- The device includes a coder parameter generating system, coupled to receive linguistic information and feedback, for providing a principal set and a supplementary set of speech parameters, the principal set of speech parameters being the parametric representation of speech; and a feedback path having a recurrent buffer, coupled to the coder parameter generating system, for providing the supplementary set of speech parameters as feedback to the coder parameter generating system to modify the principal set of speech parameters.
- The coder parameter generating system may further provide the modified principal set of speech parameters to a waveform synthesizer to synthesize speech.
- The coder parameter generating system is typically divided into a principal system and a subsystem.
- The principal system generates the principal set of speech parameters.
- The subsystem generates the supplementary set of speech parameters.
- The supplementary set of speech parameters generally consists of the energies in each of a predetermined set of frequency bands for speech in a selected time period. Again, the boundaries may be determined as set forth above.
- The supplementary set of speech parameters typically includes a representation of speech derived from the principal set of speech parameters.
- The device of the present invention is typically a microprocessor, a digital signal processor, an application-specific integrated circuit, or a combination of these.
- The coder parameter generating system may include neural networks, decision tree units, or units that use a genetic algorithm.
- The device of the present invention may be implemented in a text-to-speech system (203), a speech synthesis system (203), or a dialog system (203).
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
- Telephonic Communication Services (AREA)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP97946261A EP0932896A2 (en) | 1996-12-05 | 1997-10-15 | Method, device and system for supplementary speech parameter feedback for coder parameter generating systems used in speech synthesis |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US76162796A | 1996-12-05 | 1996-12-05 | |
US08/761,627 | 1996-12-05 |
Publications (2)
Publication Number | Publication Date |
---|---|
WO1998025260A2 true WO1998025260A2 (en) | 1998-06-11 |
WO1998025260A3 WO1998025260A3 (en) | 1998-08-06 |
Family
ID=25062802
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US1997/018815 WO1998025260A2 (en) | 1996-12-05 | 1997-10-15 | Speech synthesis using dual neural networks |
Country Status (2)
Country | Link |
---|---|
EP (1) | EP0932896A2 (en)
WO (1) | WO1998025260A2 (en)
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5165008A (en) * | 1991-09-18 | 1992-11-17 | U S West Advanced Technologies, Inc. | Speech synthesis using perceptual linear prediction parameters |
CN1057625C (zh) * | 1994-04-28 | 2000-10-18 | Motorola, Inc. | Method of converting text into audio signals using a neural network
-
1997
- 1997-10-15 WO PCT/US1997/018815 patent/WO1998025260A2/en not_active Application Discontinuation
- 1997-10-15 EP EP97946261A patent/EP0932896A2/en not_active Withdrawn
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB2326320A (en) * | 1997-06-13 | 1998-12-16 | Motorola Inc | Text to speech synthesis using neural network |
GB2326320B (en) * | 1997-06-13 | 1999-08-11 | Motorola Inc | Method,device and article of manufacture for neural-network based orthography-phonetics transformation |
Also Published As
Publication number | Publication date |
---|---|
WO1998025260A3 (en) | 1998-08-06 |
EP0932896A2 (en) | 1999-08-04 |
EP0932896A4 (en)
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Yoshimura et al. | Mixed excitation for HMM-based speech synthesis. | |
Tokuda et al. | An HMM-based speech synthesis system applied to English | |
US7124083B2 (en) | Method and system for preselection of suitable units for concatenative speech | |
US7565291B2 (en) | Synthesis-based pre-selection of suitable units for concatenative speech | |
US5913194A (en) | Method, device and system for using statistical information to reduce computation and memory requirements of a neural network based speech synthesis system | |
US5682501A (en) | Speech synthesis system | |
Bulyko et al. | Joint prosody prediction and unit selection for concatenative speech synthesis | |
JP2826215B2 (ja) | Synthetic speech generation method and text-to-speech synthesis apparatus | |
US6535852B2 (en) | Training of text-to-speech systems | |
US20050182629A1 (en) | Corpus-based speech synthesis based on segment recombination | |
Van Santen | Prosodic modelling in text-to-speech synthesis. | |
EP0458859A1 (en) | Text to speech synthesis system and method using context dependent vowell allophones. | |
Qian et al. | An HMM-based Mandarin Chinese text-to-speech system | |
Karaali et al. | Speech synthesis with neural networks | |
Bulyko et al. | Efficient integrated response generation from multiple targets using weighted finite state transducers | |
Koriyama et al. | Prosody generation using frame-based Gaussian process regression and classification for statistical parametric speech synthesis | |
RU61924U1 (ru) | Statistical speech model | |
Lin et al. | A novel prosodic-information synthesizer based on recurrent fuzzy neural network for the Chinese TTS system | |
WO1998025260A2 (en) | Speech synthesis using dual neural networks | |
Chen et al. | A Mandarin Text-to-Speech System | |
Ng | Survey of data-driven approaches to Speech Synthesis | |
Ho et al. | Voice conversion between UK and US accented English. | |
Yu et al. | A novel prosody adaptation method for mandarin concatenation-based text-to-speech system | |
Hyunsong | Modelling Duration In Text-to-Speech Systems | |
Barakat et al. | Investigating the effect of speech features and the number of HMM mixtures in the quality HMM-based synthesizers |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AL | Designated countries for regional patents |
Kind code of ref document: A2 Designated state(s): AT BE CH DE DK ES FI FR GB GR IE IT LU MC NL PT SE |
|
WWE | Wipo information: entry into national phase |
Ref document number: 1997946261 Country of ref document: EP |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
WWP | Wipo information: published in national office |
Ref document number: 1997946261 Country of ref document: EP |
|
WWW | Wipo information: withdrawn in national office |
Ref document number: 1997946261 Country of ref document: EP |