EP0932896A2 - Method, device and system for supplementary speech parameter feedback for coder parameter generating systems used in speech synthesis - Google Patents
Info
- Publication number
- EP0932896A2 (application EP97946261A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- speech
- principal
- speech parameters
- parameters
- supplementary
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Links
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/04—Details of speech synthesis systems, e.g. synthesiser structure or memory management
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Definitions
- the present invention relates to coder parameter generating systems used in speech synthesis, and more particularly to speech parameter feedback in coder parameter generating systems used in speech synthesis.
- as shown in FIG. 1, numeral 100, to convert text to speech, statistical systems (102) typically convert a phonetic representation of the text into a plurality of speech parameters that characterize the speech waveform. These systems perform this conversion using a statistical component that attempts to extract salient features from a database. It is desirable that the statistical system be able to extract sufficient information from the database to allow the conversion of novel phonetic representations into satisfactory speech parameters.
- FIG. 1 is a schematic representation of a statistical system for synthesizing waveforms for speech as is known in the art.
- FIG. 2 is a schematic representation of a system for synthesizing waveforms for speech in accordance with the present invention.
- FIG. 3 is a schematic representation of frequency bin selection wherein frequency is plotted versus spectral magnitude.
- FIG. 4 is a schematic representation of one embodiment of energy bin boundary frequency selection in accordance with the present invention.
- FIG. 5 is a flow chart of one embodiment of steps in accordance with the method of the present invention.
- FIG. 6 is a flow chart of another embodiment of steps in accordance with the method of the present invention.
- the present invention provides a method and device for efficiently increasing the number of coder parameters in order to allow a coder parameter generating system to extract more information from training examples, thus enabling more accurate speech synthesis.
- the coder parameter generating system is a neural network.
- a first neural network extracts domain-specific information by learning relations between the input and output data and feeds information learned back to a second neural network, thus providing additional output parameters via a recurrent feedback mechanism.
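The two-network feedback loop described above can be sketched in Python with numpy. This is a minimal illustration with untrained random weights; the layer sizes and the `synthesize_frames` helper are assumptions for the sketch, not details taken from the patent:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 20 linguistic inputs, 10 principal coder
# parameters, 4 supplementary parameters fed back each frame.
N_IN, N_PRINCIPAL, N_SUPP, N_HID = 20, 10, 4, 16

def make_net(n_in, n_out):
    """One-hidden-layer feedforward net with random (untrained) weights."""
    return (rng.standard_normal((n_in, N_HID)) * 0.1,
            rng.standard_normal((N_HID, n_out)) * 0.1)

def forward(net, x):
    w1, w2 = net
    return np.tanh(x @ w1) @ w2

principal_net = make_net(N_IN + N_SUPP, N_PRINCIPAL)   # sees fed-back params
subsystem_net = make_net(N_IN + N_PRINCIPAL, N_SUPP)   # predicts supplementary params

def synthesize_frames(linguistic_frames):
    """Run the two-network loop: the subsystem's output is stored in a
    recurrent buffer and appended to the principal net's input on the
    next frame, so the second network feeds information back to the first."""
    recurrent_buffer = np.zeros(N_SUPP)
    out = []
    for x in linguistic_frames:
        principal = forward(principal_net, np.concatenate([x, recurrent_buffer]))
        recurrent_buffer = forward(subsystem_net, np.concatenate([x, principal]))
        out.append(principal)
    return np.array(out)

frames = rng.standard_normal((5, N_IN))
params = synthesize_frames(frames)
print(params.shape)  # (5, 10): one principal parameter vector per frame
```

In training, both networks would be fit to target coder and supplementary parameters; here the point is only the dataflow of the recurrent buffer.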
- a supplementary set of speech parameters is added to the coder parameter generating system (201).
- the principal speech parameters may be, for example, the input parameters to a waveform synthesizer, whereas the supplementary set of speech parameters is not directly used in the generation of speech.
- the supplementary set of speech parameters would, by itself, be unused; a feedback path that includes a recurrent buffer (206), however, allows these parameters to be fed back as input to the coder parameter generating system (201), making the otherwise unused output information useful and increasing the performance of the coder parameter generating system (201).
- the parameter generating system (201) is now trained to predict other parameters in addition to the coder parameters. These extra parameters are used internally by the parameter generating system (201) through the recurrent buffer (206).
- This coder parameter generating system may be broken into a principal system and a subsystem, as displayed in FIG. 2, numeral 200, where the principal system (202) generates the principal set of speech parameters and the subsystem (204) generates the supplementary set of modified speech parameters.
- the supplementary set of modified speech parameters may be chosen to represent the spectral energy in fixed frequency bins. These parameters are obtained by summing the magnitude over the appropriate frequency regions of a discrete Fourier transform (DFT) of the windowed speech waveform.
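A minimal sketch of this computation in numpy, summing DFT magnitudes of a windowed frame over fixed bins. The bin edges in Hz below are illustrative placeholders; in the patent they are chosen from formant data:

```python
import numpy as np

def energy_bins(frame, sample_rate, boundaries_hz):
    """Sum the DFT magnitudes of a windowed speech frame over fixed
    frequency bins given by consecutive pairs of edge frequencies."""
    windowed = frame * np.hanning(len(frame))
    spectrum = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    return np.array([
        spectrum[(freqs >= lo) & (freqs < hi)].sum()
        for lo, hi in zip(boundaries_hz[:-1], boundaries_hz[1:])
    ])

# A 100 Hz sine sampled at 8 kHz: nearly all energy lands in the first bin.
sr = 8000
t = np.arange(256) / sr
frame = np.sin(2 * np.pi * 100 * t)
bins = energy_bins(frame, sr, [0, 300, 700, 1500, 4000])
print(bins.argmax())  # 0
```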
- the frequency regions may be chosen to fall between typical first formants (F1s) for a selected set of phonemes spoken by a selected speaker. Additional frequency regions may be chosen to fall between typical second formants (F2s) for a selected set of phonemes spoken by a selected speaker.
- as shown in FIG. 3, numeral 300, ten non-linear fixed frequency bins (304, 306, 308, 310, 312, 314, 316, 318, 320, 322) are selected to represent the spectral energy.
- the added parameters are obtained by summing the spectral magnitude over the selected frequency regions of a discrete Fourier transform of the windowed speech waveform (302).
- the frequency regions are selected (for example, as shown in FIG. 3) to provide energy measures with additional information that is useful in predicting the coder parameters.
- An alternative choice for the supplementary set of speech parameters is a set of parameters that may be derived from the principal set of speech parameters.
- the extra parameters obtained, i.e., the supplementary set of speech parameters, are fed back as input through the recurrent buffer (206).
- the recurrent buffer (206) is not used to train the energy bin subsystem (204), but the principal system (202) learns how to use the information generated by the recurrent buffer (206) when the energy bin subsystem (204) is fully trained.
- the subsystem (204) may include a second recurrent buffer that is used in training and testing, as may the principal system (202).
- FIG. 4 is a schematic representation of one embodiment of energy bin boundary frequency selection in accordance with the present invention.
- the frequencies of the energy bands are central to determining the usefulness of the selected parameters.
- a speech synthesizer that operates in accordance with the present invention is designed to imitate the speech from a single speaker. Analysis of formants from a single speaker provided data for the graph shown in FIG. 4, which plots the frequencies of the first two formants of several vowels (iy, ux, ey, ih, uw, uh, ow, eh, ae, aa, ao, ax).
- the principal system (202) and the subsystem (204) may each be a neural network, decision tree, genetic algorithm or combination of two or more of these.
- the method for using neural networks for speech synthesis has consisted of training a neural network to predict the coder parameters at a given time.
- in the training portion, the neural net is supplied with coder parameters, and in the testing portion, the neural network generates these same parameters.
- the standard methods typically change the size or architecture of the network (including network transfer functions), modify the rates, number of iterations or vector set used to train the network, add additional useful input information, or change the representation of the data.
- the speech spectra may be represented by linear predictive coding coefficients, cepstral coefficients, reflection coefficients, or various other parameters. Since these methods make only marginal improvements, the method of the present invention was developed to provide training that uses more output parameters with a minimum of added complexity.
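For concreteness, one of the representations named above, linear predictive coding, can be derived with the standard Levinson-Durbin recursion. This is textbook LPC sketched in numpy, not code from the patent:

```python
import numpy as np

def lpc(frame, order):
    """Estimate LPC coefficients from a frame via autocorrelation and
    the Levinson-Durbin recursion. Returns (a, residual_energy), with
    a[0] == 1 by convention."""
    # Autocorrelation up to the requested order.
    r = np.array([frame[:len(frame) - k] @ frame[k:] for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + a[1:i] @ r[1:i][::-1]   # prediction of r[i] by lower orders
        k = -acc / err                        # reflection coefficient
        a[1:i] = a[1:i] + k * a[1:i][::-1]    # update lower-order coefficients
        a[i] = k
        err *= (1.0 - k * k)                  # remaining prediction error
    return a, err

# A decaying exponential is an order-1 all-pole signal, so the first
# coefficient comes out at about -0.9.
frame = 0.9 ** np.arange(200)
a, err = lpc(frame, 1)
print(round(a[1], 2))  # -0.9
```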
- the method of the present invention provides, in response to text or linguistic information, efficient generation of a parametric representation of speech.
- the method includes the steps of: A) using (502) a coder parameter generating system to provide a principal set and a supplementary set of speech parameters, the principal set of speech parameters being the parametric representation of speech; and B) providing (504) feedback to the coder parameter generating system using the supplementary set of speech parameters to modify the principal set of speech parameters.
- the method may further include providing (506) the modified principal set of speech parameters to a waveform synthesizer to synthesize speech.
- the coder parameter generating system is divided into a principal system (202) and a subsystem (204), wherein the principal system (202) generates the principal set of speech parameters and the subsystem (204) generates the supplementary set of speech parameters.
- the supplementary set of speech parameters consists of energies in each of a predetermined set of frequency bands for speech in a selected time period. Boundaries of the predetermined set of frequency bands are typically determined by: determining typical values of first and second formants of speech of a selected speaker for a set of phonemes; selecting a set of frequencies that fall between said formant values; and selecting each frequency band of the predetermined set of frequency bands to consist of the frequencies between a selected two of the set of frequencies.
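The boundary-selection steps above amount to taking frequencies between neighbouring formant values, e.g. midpoints. A small sketch; the vowel F1 figures below are made-up placeholders, not measurements from the patent's FIG. 4:

```python
import numpy as np

def bin_boundaries(formant_values_hz):
    """Select energy-bin boundary frequencies that fall between typical
    formant values: sort the values and take the midpoint of each
    neighbouring pair."""
    f = np.sort(np.asarray(formant_values_hz, dtype=float))
    return (f[:-1] + f[1:]) / 2.0

# Hypothetical typical F1 values for four vowels of one speaker.
f1 = [300, 400, 550, 700]
edges = bin_boundaries(f1)
print(edges)  # [350. 475. 625.]
```

Each resulting band then runs between two consecutive boundary frequencies, so formant peaks tend to fall inside bands rather than on their edges.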
- the supplementary set of speech parameters generally includes a representation of speech derived from the principal set of speech parameters.
- the coder parameter generating system typically utilizes neural networks.
- the coder parameter generating system may utilize decision tree units or genetic algorithms.
- the method of the present invention provides, in response to linguistic information, efficient generation of a parametric representation of speech.
- the method includes the steps of: A) using (602) a principal system to generate a principal set of speech parameters; B) providing (604) the principal set of speech parameters and the linguistic information to a subsystem that modifies the principal set of speech parameters and providing a modified set of speech parameters to the principal system; and C) generating (606), by the principal system, a parametric representation of speech.
- the modified principal set of speech parameters may be provided (608) to a waveform synthesizer to synthesize speech.
- the coder parameter generating system is typically divided into a principal system and a subsystem, wherein the principal system generates the principal set of speech parameters and the subsystem generates the supplementary set of speech parameters.
- the supplementary set of speech parameters may consist of energies in each of a predetermined set of frequency bands for speech in a selected time period. Boundaries of the predetermined set of frequency bands are typically determined by: determining typical values of first and second formants of speech of a selected speaker for a set of phonemes; selecting a set of frequencies that fall between said formant values; and selecting each frequency band of the predetermined set of frequency bands to consist of the frequencies between a selected two of the set of frequencies.
- the supplementary set of speech parameters generally includes a representation of speech derived from the principal set of speech parameters.
- the coder parameter generating system typically utilizes neural networks.
- the coder parameter generating system may utilize decision tree units or genetic algorithms.
- the present invention may be implemented by a device for providing, in response to linguistic information, efficient generation of a parametric representation of speech.
- the device includes a coder parameter generating system, coupled to receive linguistic information and feedback, for providing a principal set and a supplementary set of speech parameters, the principal set of speech parameters being the parametric representation of speech; and a feedback path having a recurrent buffer, coupled to the coder parameter generating system, for providing the supplementary set of speech parameters as feedback to the coder parameter generating system to modify the principal set of speech parameters.
- the coder parameter generating system may further provide the modified principal set of speech parameters to a waveform synthesizer to synthesize speech.
- the coder parameter generating system is typically divided into a principal system and a subsystem.
- the principal system generates the principal set of speech parameters
- the subsystem generates the supplementary set of speech parameters.
- the supplementary set of speech parameters generally consists of energies in each of a predetermined set of frequency bands for speech in a selected time period. Again, boundaries may be determined as set forth above.
- the supplementary set of speech parameters typically includes a representation of speech derived from the principal set of speech parameters.
- the device of the present invention is typically a microprocessor, a digital signal processor, an application specific integrated circuit, or a combination of these.
- the coder parameter generating system may include neural networks, decision tree units, or units that use a genetic algorithm.
- the device of the present invention may be implemented in a text-to-speech system (203), a speech synthesis system (203), or a dialog system (203).
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
- Telephonic Communication Services (AREA)
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US76162796A | 1996-12-05 | 1996-12-05 | |
US761627 | 1996-12-05 | ||
PCT/US1997/018815 WO1998025260A2 (en) | 1996-12-05 | 1997-10-15 | Speech synthesis using dual neural networks |
Publications (2)
Publication Number | Publication Date |
---|---|
EP0932896A2 (en) | 1999-08-04 |
EP0932896A4 (en) | 1999-09-08 |
Family
ID=25062802
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP97946261A Withdrawn EP0932896A2 (en) | 1996-12-05 | 1997-10-15 | Method, device and system for supplementary speech parameter feedback for coder parameter generating systems used in speech synthesis |
Country Status (2)
Country | Link |
---|---|
EP (1) | EP0932896A2 (en) |
WO (1) | WO1998025260A2 (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5930754A (en) * | 1997-06-13 | 1999-07-27 | Motorola, Inc. | Method, device and article of manufacture for neural-network based orthography-phonetics transformation |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5165008A (en) * | 1991-09-18 | 1992-11-17 | U S West Advanced Technologies, Inc. | Speech synthesis using perceptual linear prediction parameters |
CN1057625C (zh) * | 1994-04-28 | 2000-10-18 | Motorola Inc. | Method for converting text into an audio signal using a neural network |
-
1997
- 1997-10-15 EP EP97946261A patent/EP0932896A2/en not_active Withdrawn
- 1997-10-15 WO PCT/US1997/018815 patent/WO1998025260A2/en not_active Application Discontinuation
Also Published As
Publication number | Publication date |
---|---|
WO1998025260A3 (en) | 1998-08-06 |
WO1998025260A2 (en) | 1998-06-11 |
EP0932896A4 (en) | 1999-09-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Yoshimura et al. | Mixed excitation for HMM-based speech synthesis. | |
Tokuda et al. | An HMM-based speech synthesis system applied to English | |
US5913194A (en) | Method, device and system for using statistical information to reduce computation and memory requirements of a neural network based speech synthesis system | |
US7124083B2 (en) | Method and system for preselection of suitable units for concatenative speech | |
US7565291B2 (en) | Synthesis-based pre-selection of suitable units for concatenative speech | |
US5682501A (en) | Speech synthesis system | |
Bulyko et al. | Joint prosody prediction and unit selection for concatenative speech synthesis | |
US6535852B2 (en) | Training of text-to-speech systems | |
JP2826215B2 (ja) | Synthetic speech generation method and text-to-speech synthesis apparatus | |
US20050182629A1 (en) | Corpus-based speech synthesis based on segment recombination | |
Van Santen | Prosodic modelling in text-to-speech synthesis. | |
EP0458859A1 (en) | Text to speech synthesis system and method using context dependent vowel allophones. | |
Qian et al. | An HMM-based Mandarin Chinese text-to-speech system | |
Karaali et al. | Speech synthesis with neural networks | |
Bulyko et al. | Efficient integrated response generation from multiple targets using weighted finite state transducers | |
Koriyama et al. | Prosody generation using frame-based Gaussian process regression and classification for statistical parametric speech synthesis | |
RU61924U1 (ru) | Statistical speech model | |
Lin et al. | A novel prosodic-information synthesizer based on recurrent fuzzy neural network for the Chinese TTS system | |
EP0932896A2 (en) | Method, device and system for supplementary speech parameter feedback for coder parameter generating systems used in speech synthesis | |
Ng | Survey of data-driven approaches to Speech Synthesis | |
Ho et al. | Voice conversion between UK and US accented English. | |
Yu et al. | A novel prosody adaptation method for mandarin concatenation-based text-to-speech system | |
Hyunsong | Modelling Duration In Text-to-Speech Systems | |
Barakat et al. | Investigating the effect of speech features and the number of HMM mixtures in the quality HMM-based synthesizers | |
Hwang et al. | An RNN-based spectral information generation for Mandarin text-to-speech. |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PUAI | Public reference made under article 153(3) EPC to a published international application that has entered the European phase | Free format text: ORIGINAL CODE: 0009012 |
| 17P | Request for examination filed | Effective date: 19990208 |
| AK | Designated contracting states | Kind code of ref document: A2; Designated state(s): BE DE FR GB |
| A4 | Supplementary search report drawn up and despatched | Effective date: 19990722 |
| AK | Designated contracting states | Kind code of ref document: A4; Designated state(s): BE DE FR GB |
| STAA | Information on the status of an EP patent application or granted EP patent | Free format text: STATUS: THE APPLICATION HAS BEEN WITHDRAWN |
| 18W | Application withdrawn | Withdrawal date: 20021028 |