Method, device and system for supplementary speech parameter feedback for coder parameter generating systems used in speech synthesis

Info

Publication number
EP0932896A2
EP0932896A2 EP97946261A EP97946261A EP0932896A2 EP 0932896 A2 EP0932896 A2 EP 0932896A2 EP 97946261 A EP97946261 A EP 97946261A EP 97946261 A EP97946261 A EP 97946261A EP 0932896 A2 EP0932896 A2 EP 0932896A2
Authority
EP
European Patent Office
Prior art keywords
speech
principal
speech parameters
parameters
supplementary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP97946261A
Other languages
German (de)
French (fr)
Other versions
EP0932896A4 (en)
Inventor
Orhan Karaali
Noel Massey
Gerald Corrigan
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Motorola Solutions Inc
Original Assignee
Motorola Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Motorola Inc
Publication of EP0932896A2
Publication of EP0932896A4

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Abstract

A method (500, 600), device (201 and 206) and system (203) provide, in response to text/linguistic information, efficient generation of a parametric representation of speech. A coder parameter generating system provides a principal set and a supplementary set of speech parameters, the principal set of speech parameters being the parametric representation of speech. Then feedback is provided to the coder parameter generating system using the supplementary set of speech parameters to modify the principal set of speech parameters.

Description

METHOD, DEVICE AND SYSTEM FOR SUPPLEMENTARY
SPEECH PARAMETER FEEDBACK FOR CODER PARAMETER
GENERATING SYSTEMS USED IN SPEECH SYNTHESIS
Field of the Invention
The present invention relates to coder parameter generating systems used in speech synthesis, and more particularly to speech parameter feedback in coder parameter generating systems used in speech synthesis.
Background of the Invention
As shown in FIG. 1, numeral 100, to convert text to speech, statistical systems (102) typically convert a phonetic representation of the text into a plurality of speech parameters which characterize the speech waveform. These systems perform this conversion using a statistical component which attempts to extract salient features from a database. It is desirable that the statistical system be able to extract sufficient information from the database which will allow the conversion of novel phonetic representations into satisfactory speech parameters.
One problem with statistical approaches is that the performance of the text-to-speech system is extremely dependent on the size and type of data in the database. Increasing the size of the database will typically increase the performance of the statistical system, but additional data is often difficult and expensive to obtain. It is known that, if the size of the database has to remain fixed, then improvements to the performance of the system may be obtained by changing the size or internal architecture of the statistical system, modifying the number of times or order in which the data is presented to the system, adding useful input information by extracting additional parameters from the data, or changing the representation of the data. The problem with these methods is that they provide only asymptotic improvements. After rough approximations are made in the parameters using these methods, many small changes may be made with only marginal improvements in the output. Thus, since the improvements are typically insignificant, further improvement is not sought by repeating the above modifications.
Hence, there is a need for a method and device for improving the performance of a text-to-speech system without increasing the size of the database used to create the system.
Brief Description of the Drawings
FIG. 1 is a schematic representation of a statistical system for synthesizing waveforms for speech as is known in the art.
FIG. 2 is a schematic representation of a system for synthesizing waveforms for speech in accordance with the present invention.
FIG. 3 is a schematic representation of frequency bin selection wherein frequency is plotted versus spectral magnitude.
FIG. 4 is a schematic representation of one embodiment of energy bin boundary frequency selection in accordance with the present invention.
FIG. 5 is a flow chart of one embodiment of steps in accordance with the method of the present invention.
FIG. 6 is a flow chart of another embodiment of steps in accordance with the method of the present invention.
Detailed Description of a Preferred Embodiment
The present invention provides a method and device for efficiently increasing the number of coder parameters in order to allow a coder parameter generating system to extract more information from training examples, thus enabling more accurate speech synthesis.
In a preferred embodiment, the coder parameter generating system is a neural network. A first neural network extracts domain-specific information by learning relations between the input and output data and feeds information learned back to a second neural network, thus providing additional output parameters via a recurrent feedback mechanism.
In addition to the principal set of speech parameters which are typically generated by a statistical component in a text-to-speech system, in the present invention, a supplementary set of speech parameters is added to the coder parameter generating system (201). The principal speech parameters may be, for example, the input parameters to a waveform synthesizer, whereas the supplementary set of speech parameters is not directly used in the generation of speech. The supplementary set of speech parameters would therefore be useless, except for a feedback path that includes a recurrent buffer (206) which allows these parameters to be fed back as input to the coder parameter generating system (201), making otherwise unused output information useful and increasing the performance of the coder parameter generating system (201). Thus, the same phonemic input information is used, but under the present invention, the parameter generating system (201) is now trained to predict other parameters in addition to the coder parameters. These extra parameters are used internally by the parameter generating system (201) through the recurrent buffer (206).
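The feedback path described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name run_with_feedback, the generator signature, the zero-initialised buffer and the one-frame buffer depth are all assumptions made for illustration. The generator returns both parameter sets for each frame; only the principal set is passed on, while the supplementary set is stored in the recurrent buffer and supplied as extra input on the next frame.

```python
from collections import deque
from typing import Callable, List, Sequence, Tuple

# Hypothetical signature: (linguistic features, fed-back supplementary values)
# -> (principal parameters, supplementary parameters).
Generator = Callable[[Sequence[float], Sequence[float]],
                     Tuple[List[float], List[float]]]


def run_with_feedback(generator: Generator,
                      frames: Sequence[Sequence[float]],
                      n_supplementary: int,
                      buffer_frames: int = 1) -> List[List[float]]:
    """Run a generator frame by frame with a recurrent buffer on its
    supplementary outputs; return only the principal parameters."""
    # Assumption: the buffer starts out filled with zeros.
    buffer = deque([[0.0] * n_supplementary for _ in range(buffer_frames)],
                   maxlen=buffer_frames)
    principal_out: List[List[float]] = []
    for features in frames:
        # Flatten the buffered supplementary parameters into one input vector.
        fed_back = [value for past in buffer for value in past]
        principal, supplementary = generator(features, fed_back)
        buffer.append(supplementary)     # fed back on the next frame
        principal_out.append(principal)  # only this set reaches the synthesizer
    return principal_out


def toy_generator(features, fed_back):
    """Stand-in for a trained system; the arithmetic is arbitrary."""
    supplementary = [sum(features)]
    principal = [features[0] + 0.5 * fed_back[0]]
    return principal, supplementary
```

For example, run_with_feedback(toy_generator, [[1.0, 2.0], [3.0, 4.0]], n_supplementary=1) yields a second-frame principal value that depends on the first frame's supplementary output, which is the effect the recurrent buffer (206) is meant to provide.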
This coder parameter generating system may be broken into a principal system and a subsystem, as displayed in FIG. 2, numeral 200, where the principal system (202) generates the principal set of speech parameters and the subsystem (204) generates the supplementary set of modified speech parameters.
The supplementary set of modified speech parameters may be chosen to represent the spectral energy in fixed frequency bins. These parameters are obtained by summing the magnitude over the appropriate frequency regions of a discrete Fourier transform (DFT) of the windowed speech waveform. The frequency regions may be chosen to fall between typical first formants (F1s) for a selected set of phonemes spoken by a selected speaker. Additional frequency regions may be chosen to fall between typical second formants (F2s) for a selected set of phonemes spoken by a selected speaker. In the example shown in FIG. 3, numeral 300, 10 non-linear fixed frequency bins (304, 306, 308, 310, 312, 314, 316, 318, 320, 322) are selected to represent the spectral energy. The added parameters are obtained by summing the spectral magnitude over the selected frequency regions of a discrete Fourier transform of the windowed speech waveform (302). The frequency regions are selected (for example, as shown in FIG. 3) to provide energy measures with additional information that is useful in predicting the coder parameters. An alternative choice for the supplementary set of speech parameters is a set of parameters that may be derived from the principal set of speech parameters.
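The energy-bin computation described above (summing spectral magnitude over selected frequency regions of a DFT of the windowed waveform) can be sketched as follows. This is an illustrative sketch only: the Hamming window, the direct DFT, the function names and the bin edges used in the example are assumptions, since the text does not give numerical bin boundaries or a window type.

```python
import cmath
import math
from typing import List, Sequence


def hamming(n: int) -> List[float]:
    """Hamming window of length n (assumed window; the source names none)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def dft_magnitudes(frame: Sequence[float]) -> List[float]:
    """Magnitudes of DFT bins 0..n/2, computed directly for clarity."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(frame)))
            for k in range(n // 2 + 1)]


def bin_energies(frame: Sequence[float], sample_rate: float,
                 edges_hz: Sequence[float]) -> List[float]:
    """Sum spectral magnitude inside each band [edges_hz[b], edges_hz[b+1])."""
    window = hamming(len(frame))
    mags = dft_magnitudes([x * w for x, w in zip(frame, window)])
    hz_per_bin = sample_rate / len(frame)
    energies = [0.0] * (len(edges_hz) - 1)
    for k, mag in enumerate(mags):
        freq = k * hz_per_bin
        for b in range(len(energies)):
            if edges_hz[b] <= freq < edges_hz[b + 1]:
                energies[b] += mag
                break
    return energies


# Example: a 1000 Hz tone sampled at 8000 Hz, with three hypothetical bands.
tone = [math.sin(2 * math.pi * 1000 * i / 8000) for i in range(64)]
energies = bin_energies(tone, 8000, [0, 500, 1500, 4000])
```

In this example most of the summed magnitude falls in the middle band, which contains the tone frequency. A practical system would use an FFT and the ten speaker-specific bins of FIG. 3 rather than these placeholder edges.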
When the trained coder parameter generating system (201) is used, the extra parameters obtained (i.e., the supplementary set of speech parameters) aid in determination of the coder parameters, which are then used in the synthesis of speech. It should be noted that the recurrent buffer (206) is not used to train the energy bin subsystem (204), but the principal system (202) learns how to use the information generated by the recurrent buffer (206) when the energy bin subsystem (204) is fully trained. Alternatively, the subsystem (204) may include a second recurrent buffer that is used in training and testing, as may the principal system (202).
FIG. 4, numeral 400, is a schematic representation of one embodiment of energy bin boundary frequency selection in accordance with the present invention. Clearly, the frequencies of the energy bands are central to determining the usefulness of the selected parameters. A speech synthesizer that operates in accordance with the present invention is designed to imitate the speech from a single speaker. Analysis of formants from a single speaker provided data for the graph shown in FIG. 4, which plots the frequencies of the first two formants of several vowels (iy, ux, ey, ih, uw, uh, ow, eh, ae, aa, ao, ax). In order to distinguish one vowel from another, it is desirable to separate the vowels in the F1/F2 plane shown, with the first formant F1 (402) and the second formant F2 (404) plotted in Hz. Three of the frequency bins (306, 308, 310) are selected to include typical values for F1, and three of the frequency bins (312, 314, 316) are selected to include typical values for F2, as is shown in FIG. 4. In addition to these bins based on vowel locations, one bin (304) was added to include a typical nasal formant, and three bins (318, 320, 322) were added to distinguish between other speech sounds like /s/ and /sh/.
Strategic positioning of the energy bins moves the formant frequencies of the synthesized vowels. By adjusting the bin frequencies, the neighboring vowels in the F1/F2 plane are separated, thus clarifying the perceptions of the vowel phonemes. The energy bins also help to clarify synthetic consonants, thus producing more intelligible synthetic speech. The principal system (202) and the subsystem (204) may each be a neural network, decision tree, genetic algorithm or combination of two or more of these.
Typically, as shown in FIG. 1, the method for using neural networks for speech synthesis has consisted of training a neural network to predict the coder parameters at a given time. In the training portion of this process, the neural net is supplied with coder parameters, and in the testing portion, the neural network generates these same parameters.
In order to improve the performance of the network, the standard methods typically change the size or architecture of the network (including network transfer functions), modify the rates, number of iterations or vector set used to train the network, add additional useful input information, or change the representation of the data. For example, the speech spectra may be represented by linear predictive coding coefficients, cepstral coefficients, reflection coefficients, or other various parameters. Since these methods make only marginal improvements, the method of the present invention was developed to provide training that uses more output parameters with a minimum of complexity added.
As shown in the steps set forth in FIG. 5, numeral 500, the method of the present invention provides, in response to text or linguistic information, efficient generation of a parametric representation of speech. The method includes the steps of: A) using (502) a coder parameter generating system to provide a principal set and a supplementary set of speech parameters, the principal set of speech parameters being the parametric representation of speech; and B) providing (504) feedback to the coder parameter generating system using the supplementary set of speech parameters to modify the principal set of speech parameters. Where selected, the method may further include providing (506) the modified principal set of speech parameters to a waveform synthesizer to synthesize speech.
In the present invention, the coder parameter generating system is divided into a principal system (202) and a subsystem (204), wherein the principal system (202) generates the principal set of speech parameters and the subsystem (204) generates the supplementary set of speech parameters. Generally, the supplementary set of speech parameters consists of energies in each of a predetermined set of frequency bands for speech in a selected time period. Boundaries of the predetermined set of frequency bands are typically determined by: determining typical values of principal and second formants of speech of a selected speaker for a set of phonemes; selecting a set of frequencies that fall between said formant values; and selecting each frequency band of the predetermined set of frequency bands to consist of the frequencies between a selected two of the set of frequencies. The supplementary set of speech parameters generally includes a representation of speech derived from the principal set of speech parameters.
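The boundary selection steps above can be sketched in Python. This is one possible reading, not the method as implemented by the inventors: placing each boundary at the midpoint between adjacent typical formant values is an assumption, the function names are hypothetical, and the formant values in the usage note are invented for illustration rather than taken from FIG. 4.

```python
from typing import List, Sequence, Tuple


def boundaries_between(formants_hz: Sequence[float]) -> List[float]:
    """Pick one frequency between each pair of adjacent typical formant
    values. Using the midpoint is an assumption of this sketch."""
    values = sorted(set(formants_hz))
    return [(low + high) / 2 for low, high in zip(values, values[1:])]


def bands_from(boundaries: Sequence[float]) -> List[Tuple[float, float]]:
    """Form each band from the frequencies between two adjacent boundaries."""
    ordered = sorted(boundaries)
    return list(zip(ordered, ordered[1:]))
```

With hypothetical typical formant values of 300, 500, 700 and 900 Hz, boundaries_between gives 400, 600 and 800 Hz, and bands_from gives the bands 400 to 600 Hz and 600 to 800 Hz, so that each band contains one typical formant value and neighboring formants fall in different bands.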
Software implementing the method may be embedded in a microprocessor or a digital signal processor. Alternatively, an application specific integrated circuit may implement the method, or a combination of any of these implementations may be used. In a preferred embodiment, the coder parameter generating system utilizes neural networks. Alternatively, the coder parameter generating system may utilize decision tree units or genetic algorithms.
In another embodiment, shown in FIG. 6, numeral 600, the method of the present invention provides, in response to linguistic information, efficient generation of a parametric representation of speech. The method includes the steps of: A) using (602) a principal system to generate a principal set of speech parameters; B) providing (604) the principal set of speech parameters and the linguistic information to a subsystem that modifies the principal set of speech parameters and providing a modified set of speech parameters to the principal system; and C) generating (606), by the principal system, a parametric representation of speech. Then, where selected, the modified principal set of speech parameters may be provided (608) to a waveform synthesizer to synthesize speech.
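The three steps of FIG. 6 can be sketched as a short Python function. The names synthesize_parameters, toy_principal and toy_subsystem, the call signatures, and the use of None to mark the first pass are assumptions of this sketch; the toy systems stand in for trained components and their arithmetic is arbitrary.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical signatures for the two components.
PrincipalSystem = Callable[[Sequence[float], Optional[Sequence[float]]], List[float]]
Subsystem = Callable[[Sequence[float], Sequence[float]], List[float]]


def synthesize_parameters(principal_system: PrincipalSystem,
                          subsystem: Subsystem,
                          linguistic_info: Sequence[float]) -> List[float]:
    # (602) The principal system generates a principal set of parameters.
    principal = principal_system(linguistic_info, None)
    # (604) The subsystem receives the principal set and the linguistic
    # information and returns a modified set to the principal system.
    modified = subsystem(principal, linguistic_info)
    # (606) The principal system generates the parametric representation.
    return principal_system(linguistic_info, modified)


def toy_principal(info, feedback):
    """Stand-in principal system: doubles the input, adds any feedback."""
    base = [2.0 * x for x in info]
    if feedback is None:
        return base
    return [b + f for b, f in zip(base, feedback)]


def toy_subsystem(principal, info):
    """Stand-in subsystem: returns half of each principal value."""
    return [0.5 * p for p in principal]
```

Calling synthesize_parameters(toy_principal, toy_subsystem, [1.0, 2.0]) returns a representation that reflects the subsystem's contribution; step (608) would then pass that result to a waveform synthesizer.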
As in the previous method, the coder parameter generating system is typically divided into a principal system and a subsystem, wherein the principal system generates the principal set of speech parameters and the subsystem generates the supplementary set of speech parameters.
The supplementary set of speech parameters may consist of energies in each of a predetermined set of frequency bands for speech in a selected time period. Boundaries of the predetermined set of frequency bands are typically determined by: determining typical values of principal and second formants of speech of a selected speaker for a set of phonemes; selecting a set of frequencies that fall between said formant values; and selecting each frequency band of the predetermined set of frequency bands to consist of the frequencies between a selected two of the set of frequencies.
The supplementary set of speech parameters generally includes a representation of speech derived from the principal set of speech parameters.
Software implementing the method may be embedded in a microprocessor or a digital signal processor. Alternatively, an application specific integrated circuit may implement the method, or a combination of any of these implementations may be used. In a preferred embodiment, the coder parameter generating system utilizes neural networks. Alternatively, the coder parameter generating system may utilize decision tree units or genetic algorithms.
The present invention may be implemented by a device for providing, in response to linguistic information, efficient generation of a parametric representation of speech. The device includes a coder parameter generating system, coupled to receive linguistic information and feedback, for providing a principal set and a supplementary set of speech parameters, the principal set of speech parameters being the parametric representation of speech; and a feedback path having a recurrent buffer, coupled to the coder parameter generating system, for providing the supplementary set of speech parameters as feedback to the coder parameter generating system to modify the principal set of speech parameters. The coder parameter generating system may further provide the modified principal set of speech parameters to a waveform synthesizer to synthesize speech.
The coder parameter generating system is typically divided into a principal system and a subsystem. The principal system generates the principal set of speech parameters, and the subsystem generates the supplementary set of speech parameters. The supplementary set of speech parameters generally consists of energies in each of a predetermined set of frequency bands for speech in a selected time period. Again, boundaries may be determined as set forth above.
The supplementary set of speech parameters typically includes a representation of speech derived from the principal set of speech parameters. The device of the present invention is typically a microprocessor, a digital signal processor, an application specific integrated circuit, or a combination of these.
Again, the coder parameter generating system may include neural networks, decision tree units, or units that use a genetic algorithm.
The device of the present invention may be implemented in a text-to-speech system (203), a speech synthesis system (203), or a dialog system (203).
The present invention may be embodied in other specific forms without departing from its spirit or essential characteristics. The described embodiments are to be considered in all respects only as illustrative and not restrictive. The scope of the invention is, therefore, indicated by the appended claims rather than by the foregoing description. All changes which come within the meaning and range of equivalency of the claims are to be embraced within their scope.
We claim:

Claims

1. A method for providing, in response to text/linguistic information, efficient generation of a parametric representation of speech, comprising the steps of:
1A) using a coder parameter generating system to provide a principal set of speech parameters and a supplementary set of speech parameters, the principal set of speech parameters being the parametric representation of speech; and
1B) providing feedback to the coder parameter generating system using the supplementary set of speech parameters to modify the principal set of speech parameters.
2. The method of claim 1 further including at least one of 2A-2F:
2A) providing the modified principal set of speech parameters to a waveform synthesizer to synthesize speech;
2B) wherein the coder parameter generating system is divided into a principal system and a subsystem, wherein the principal system generates the principal set of speech parameters and the subsystem generates the supplementary set of speech parameters;
2C) wherein the supplementary set of speech parameters consists of energies in each of a predetermined set of frequency bands for speech in a selected time period, and where selected, wherein boundaries of the predetermined set of frequency bands are determined by:
2C1) determining typical values of principal and second formants of speech of a selected speaker for a set of phonemes;
2C2) selecting a set of frequencies that fall between said typical values of the principal and second formants; and
2C3) selecting each frequency band of the predetermined set of frequency bands to consist of frequencies between a selected two of the set of frequencies; and where further selected, one of 2C4-2C5:
2C4) wherein the supplementary set of speech parameters includes a representation of speech derived from the principal set of speech parameters; and
2C5) wherein one of:
2C5a) software implementing the method is embedded in a microprocessor;
2C5b) software implementing the method is embedded in a digital signal processor;
2C5c) the method is implemented by an application specific integrated circuit; and
2C5d) the method is implemented by a combination of at least two of 2C5a-2C5c;
2D) wherein the coder parameter generating system is a neural network;
2E) wherein the coder parameter generating system is a decision tree unit; and
2F) wherein the coder parameter generating system is a unit that uses a genetic algorithm.
3. A method for providing, in response to linguistic information, efficient generation of parametric representation of speech, comprising the steps of:
3A) using a principal system to generate a principal set of speech parameters;
3B) providing the principal set of speech parameters and linguistic information to a subsystem that modifies the principal set of speech parameters and providing a supplementary set of modified speech parameters to the principal system; and
3C) generating, by the principal system, a parametric representation of speech.
4. The method of claim 3 further including at least one of 4A-4I:
4A) providing the modified principal set of speech parameters to a waveform synthesizer to synthesize speech;
4B) the principal system generates the principal set of speech parameters and the subsystem generates the supplementary set of modified speech parameters;
4C) the supplementary set of modified speech parameters consists of energies in each of a predetermined set of frequency bands for speech in a selected time period;
4D) boundaries of the predetermined set of frequency bands are determined by:
4D1) determining typical values of principal and second formants of speech of a selected speaker for a set of phonemes;
4D2) selecting a set of frequencies that fall between said typical values of the principal and second formants; and
4D3) selecting each frequency band of the predetermined set of frequency bands to consist of frequencies between a selected two of the set of frequencies;
4E) the supplementary set of modified speech parameters includes a representation of speech derived from the principal set of speech parameters;
4F) one of 4F1-4F4:
4F1) software implementing the method is embedded in a microprocessor;
4F2) software implementing the method is embedded in a digital signal processor;
4F3) the method is implemented by an application specific integrated circuit; and
4F4) the method is implemented by a combination of at least two of 4F1-4F3;
4G) at least one of the principal system and the subsystem is a neural network;
4H) at least one of the principal system and the subsystem is a decision tree unit; and
4I) at least one of the principal system and the subsystem is a unit that uses a genetic algorithm.
5. A device for providing, in response to linguistic information, efficient generation of a parametric representation of speech, comprising:
5A) a coder parameter generating system, coupled to receive linguistic information and feedback, for providing a principal set of speech parameters and a supplementary set of speech parameters, the principal set of speech parameters being the parametric representation of speech; and
5B) a feedback path having a recurrent buffer, coupled to the coder parameter generating system, for providing the supplementary set of speech parameters as feedback to the coder parameter generating system to modify the principal set of speech parameters.
6. The device of claim 5 wherein at least one of 6A-6H:
6A) the coder parameter generating system further provides a modified principal set of speech parameters to a waveform synthesizer to synthesize speech;
6B) the coder parameter generating system is divided into a principal system and a subsystem, wherein the principal system generates the principal set of speech parameters and the subsystem generates the supplementary set of speech parameters;
6C) the supplementary set of speech parameters consists of energies in each of a predetermined set of frequency bands for speech in a selected time period, and where selected, wherein boundaries of the predetermined set of frequency bands are determined by determining typical values of principal and second formants of speech of a selected speaker for a set of phonemes, selecting a set of frequencies that fall between said typical values of the principal and second formants, and selecting each frequency band of the predetermined set of frequency bands to consist of the frequencies between a selected two of the set of frequencies;
6D) the supplementary set of speech parameters includes a representation of speech derived from the principal set of speech parameters;
6E) the device is one of 6E1-6E4:
6E1) a microprocessor;
6E2) a digital signal processor;
6E3) an application specific integrated circuit; and
6E4) a combination of at least two of 6E1-6E3;
6F) the coder parameter generating system is a neural network;
6G) the coder parameter generating system is a decision tree unit; and
6H) the coder parameter generating system is a unit that uses a genetic algorithm.
7. A device for providing, in response to linguistic information, efficient generation of parametric representation of speech, comprising:
7A) a principal system, coupled to receive the linguistic information and feedback, for generating a supplementary set of speech parameters; and
7B) a subsystem, coupled to the principal system and to receive linguistic information, for providing a principal set of speech parameters which is a parametric representation of speech, via a recurrent buffer, to the principal system.
8. The device of claim 7 wherein at least one of 8A-8H:
8A) the principal system further provides a modified principal set of speech parameters to a waveform synthesizer to synthesize speech;
8B) the principal system generates the principal set of speech parameters and the subsystem generates the supplementary set of speech parameters;
8C) the supplementary set of speech parameters consists of energies in each of a predetermined set of frequency bands for speech in a selected time period, and where selected, boundaries of the predetermined set of frequency bands are determined by determining typical values of principal and second formants of speech of a selected speaker for a set of phonemes, selecting a set of frequencies that fall between said typical values of the principal and second formant values, and selecting each frequency band of the predetermined set of frequency bands to consist of frequencies between a selected two of the set of frequencies;
8D) the supplementary set of speech parameters includes a representation of speech derived from the principal set of speech parameters;
8E) the device is one of 8E1-8E4:
8E1) a microprocessor;
8E2) a digital signal processor;
8E3) an application specific integrated circuit; and
8E4) a combination of at least two of 8E1-8E3;
8F) at least one of: the principal system and the subsystem is a neural network;
8G) at least one of: the principal system and the subsystem is a decision tree unit; and
8H) at least one of: the principal system and the subsystem is a unit that uses a genetic algorithm.
9. A text-to-speech system/speech synthesis system/dialog system having a device for providing, in response to text/linguistic information, efficient generation of a parametric representation of speech, comprising:
9A) a coder parameter generating system, coupled to receive text/linguistic information and feedback, for providing a principal set of speech parameters and a supplementary set of speech parameters, the principal set of speech parameters being the parametric representation of speech; and
9B) a feedback path, coupled to the coder parameter generating system, for providing the supplementary set of speech parameters as feedback, via a recurrent buffer, to the coder parameter generating system to modify the principal set of speech parameters.
10. A text-to-speech system/speech synthesizer/dialog system having a device for providing, in response to text/linguistic information, efficient generation of parametric representation of speech, comprising:
10A) a principal system, coupled to receive the text/linguistic information, for generating a supplementary set of speech parameters; and
10B) a subsystem, coupled to the principal system and to receive linguistic information, for providing a principal set of speech parameters which is a parametric representation of speech, via a recurrent buffer, to the principal system.
EP97946261A 1996-12-05 1997-10-15 Method, device and system for supplementary speech parameter feedback for coder parameter generating systems used in speech synthesis Withdrawn EP0932896A2 (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US76162796A 1996-12-05 1996-12-05
US761627 1996-12-05
PCT/US1997/018815 WO1998025260A2 (en) 1996-12-05 1997-10-15 Speech synthesis using dual neural networks

Publications (2)

Publication Number Publication Date
EP0932896A2 (en) 1999-08-04
EP0932896A4 (en) 1999-09-08

Family

ID=25062802

Family Applications (1)

Application Number Title Priority Date Filing Date
EP97946261A Withdrawn EP0932896A2 (en) 1996-12-05 1997-10-15 Method, device and system for supplementary speech parameter feedback for coder parameter generating systems used in speech synthesis

Country Status (2)

Country Link
EP (1) EP0932896A2 (en)
WO (1) WO1998025260A2 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5930754A (en) * 1997-06-13 1999-07-27 Motorola, Inc. Method, device and article of manufacture for neural-network based orthography-phonetics transformation

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1995030193A1 (en) * 1994-04-28 1995-11-09 Motorola Inc. A method and apparatus for converting text into audible signals using a neural network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5165008A (en) * 1991-09-18 1992-11-17 U S West Advanced Technologies, Inc. Speech synthesis using perceptual linear prediction parameters

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of WO9825260A2 *

Also Published As

Publication number Publication date
EP0932896A4 (en) 1999-09-08
WO1998025260A2 (en) 1998-06-11
WO1998025260A3 (en) 1998-08-06

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 19990208

AK Designated contracting states

Kind code of ref document: A2

Designated state(s): BE DE FR GB

A4 Supplementary search report drawn up and despatched

Effective date: 19990722

AK Designated contracting states

Kind code of ref document: A4

Designated state(s): BE DE FR GB

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION HAS BEEN WITHDRAWN

18W Application withdrawn

Withdrawal date: 20021028