EP0932896A2

EP0932896A2 - Method, device and system for supplementary speech parameter feedback for coder parameter generating systems used in speech synthesis

Info

Publication number: EP0932896A2
Application number: EP97946261A
Authority: EP
Inventors: Orhan Karaali; Noel Massey; Gerald Corrigan
Original assignee: Motorola Inc
Current assignee: Motorola Solutions Inc
Priority date: 1996-12-05
Filing date: 1997-10-15
Publication date: 1999-08-04
Also published as: EP0932896A4; WO1998025260A2; WO1998025260A3

Abstract

A method (500, 600), device (201 and 206) and system (203) provide, in response to text/linguistic information, efficient generation of a parametric representation of speech. A coder parameter generating system provides a principal set and a supplementary set of speech parameters, the principal set of speech parameters being the parametric representation of speech. Then feedback is provided to the coder parameter generating system using the supplementary set of speech parameters to modify the principal set of speech parameters.

Description

METHOD, DEVICE AND SYSTEM FOR SUPPLEMENTARY SPEECH PARAMETER FEEDBACK FOR CODER PARAMETER GENERATING SYSTEMS USED IN SPEECH SYNTHESIS

Field of the Invention

The present invention relates to coder parameter generating systems used in speech synthesis, and more particularly to speech parameter feedback in coder parameter generating systems used in speech synthesis.

Background of the Invention

As shown in FIG. 1, numeral 100, to convert text to speech, statistical systems (102) typically convert a phonetic representation of the text into a plurality of speech parameters which characterize the speech waveform. These systems perform this conversion using a statistical component which attempts to extract salient features from a database. It is desirable that the statistical system be able to extract sufficient information from the database which will allow the conversion of novel phonetic representations into satisfactory speech parameters.

One problem with statistical approaches is that the performance of the text-to-speech system is extremely dependent on the size and type of data in the database. Increasing the size of the database will typically increase the performance of the statistical system, but additional data is often difficult and expensive to obtain. It is known that, if the size of the database has to remain fixed, then improvements to the performance of the system may be obtained by changing the size or internal architecture of the statistical system, modifying the number of times or order in which the data is presented to the system, adding useful input information by extracting additional parameters from the data, or changing the representation of the data. The problem with these methods is that they provide only asymptotic improvements. After rough approximations are made in the parameters using these methods, many small changes may be made with only marginal improvements in the output. Thus, since the improvements are typically insignificant, further improvement is not sought by repeating the above modifications.

Hence, there is a need for a method and device for improving the performance of a text-to-speech system without increasing the size of the database used to create the system.

Brief Description of the Drawings

FIG. 1 is a schematic representation of a statistical system for synthesizing waveforms for speech as is known in the art.

FIG. 2 is a schematic representation of a system for synthesizing waveforms for speech in accordance with the present invention.

FIG. 3 is a schematic representation of frequency bin selection wherein frequency is plotted versus spectral magnitude.

FIG. 4 is a schematic representation of one embodiment of energy bin boundary frequency selection in accordance with the present invention.

FIG. 5 is a flow chart of one embodiment of steps in accordance with the method of the present invention.

FIG. 6 is a flow chart of another embodiment of steps in accordance with the method of the present invention.

Detailed Description of a Preferred Embodiment

The present invention provides a method and device for efficiently increasing the number of coder parameters in order to allow a coder parameter generating system to extract more information from training examples, thus enabling more accurate speech synthesis.

In a preferred embodiment, the coder parameter generating system is a neural network. A first neural network extracts domain-specific information by learning relations between the input and output data and feeds information learned back to a second neural network, thus providing additional output parameters via a recurrent feedback mechanism.

In addition to the principal set of speech parameters which are typically generated by a statistical component in a text-to-speech system, in the present invention, a supplementary set of speech parameters is added to the coder parameter generating system (201). The principal speech parameters may be, for example, the input parameters to a waveform synthesizer, whereas the supplementary set of speech parameters is not directly used in the generation of speech. The supplementary set of speech parameters would therefore be useless, except for a feedback path that includes a recurrent buffer (206) which allows these parameters to be fed back as input to the coder parameter generating system (201), allowing for unused output information to be useful and increasing performance of the coder parameter generating system (201). Thus, the same phonemic input information is used, but under the present invention, the parameter generating system (201) is now trained to predict other parameters in addition to the coder parameters. These extra parameters are used internally by the parameter generating system (201) through the recurrent buffer (206).
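The feedback path described above may be illustrated by the following minimal Python sketch. It is not the trained system of the invention: the class `RecurrentBuffer`, the function `toy_generator`, its arithmetic, and the one-frame delay are hypothetical stand-ins chosen only to show how a supplementary set that never reaches the synthesizer can still influence the next frame's principal set.

```python
from collections import deque


class RecurrentBuffer:
    """Holds recent supplementary parameters so they can be fed back (cf. 206)."""

    def __init__(self, size, delay=1):
        # Assumption: zero-filled start so the first frame has a defined input.
        self._frames = deque([[0.0] * size for _ in range(delay)], maxlen=delay)

    def read(self):
        return list(self._frames[0])  # oldest stored frame

    def write(self, supplementary):
        self._frames.append(list(supplementary))  # drops the oldest frame


def toy_generator(linguistic, feedback):
    """Hypothetical stand-in for a trained generating system (cf. 201)."""
    context = sum(linguistic) + 0.5 * sum(feedback)
    principal = [context, 0.1 * context]          # would go to the synthesizer
    supplementary = [0.2 * context, 0.3 * context]  # only ever fed back
    return principal, supplementary


def run(frames, generator=toy_generator, supp_size=2):
    """Generate one principal set per frame, feeding supplementary sets back."""
    buffer = RecurrentBuffer(supp_size)
    outputs = []
    for linguistic in frames:
        principal, supplementary = generator(linguistic, buffer.read())
        buffer.write(supplementary)
        outputs.append(principal)
    return outputs
```

With two identical input frames, the second principal set differs from the first only because of the fed-back supplementary set.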

This coder parameter generating system may be broken into a principal system and a subsystem, as displayed in FIG. 2, numeral 200, where the principal system (202) generates the principal set of speech parameters and the subsystem (204) generates the supplementary set of modified speech parameters.

The supplementary set of modified speech parameters may be chosen to represent the spectral energy in fixed frequency bins. These parameters are obtained by summing the magnitude over the appropriate frequency regions of a discrete Fourier transform (DFT) of the windowed speech waveform. The frequency regions may be chosen to fall between typical first formants (F1s) for a selected set of phonemes spoken by a selected speaker. Additional frequency regions may be chosen to fall between typical second formants (F2s) for a selected set of phonemes spoken by a selected speaker. In the example shown in FIG. 3, numeral 300, 10 non-linear fixed frequency bins (304, 306, 308, 310, 312, 314, 316, 318, 320, 322) are selected to represent the spectral energy. The added parameters are obtained by summing the spectral magnitude over the selected frequency regions of a discrete Fourier transform of the windowed speech waveform (302). The frequency regions are selected (for example, as shown in FIG. 3) to provide energy measures with additional information that is useful in predicting the coder parameters. An alternative choice for the supplementary set of speech parameters is a set of parameters that may be derived from the principal set of speech parameters.
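The computation of the bin energies may be sketched as follows, under stated assumptions: the description specifies only a "windowed" waveform, so the Hamming window is an assumed choice, and the function names, the half-open band convention, and the naive DFT (used here to keep the sketch free of third-party libraries) are illustrative rather than taken from the invention.

```python
import cmath
import math


def hamming(n):
    """Hamming window of length n (the window type is an assumption)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def dft_magnitudes(frame):
    """Magnitudes of DFT bins 0..N/2 of a real frame (naive O(N^2) DFT)."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(frame))
        mags.append(abs(acc))
    return mags


def bin_energies(frame, sample_rate, edges_hz):
    """Sum spectral magnitude over each band [edges_hz[b], edges_hz[b + 1])."""
    n = len(frame)
    windowed = [x * w for x, w in zip(frame, hamming(n))]
    mags = dft_magnitudes(windowed)
    energies = [0.0] * (len(edges_hz) - 1)
    for k, mag in enumerate(mags):
        freq = k * sample_rate / n  # centre frequency of DFT bin k
        for b in range(len(energies)):
            if edges_hz[b] <= freq < edges_hz[b + 1]:
                energies[b] += mag
                break
    return energies
```

For a frame containing a single sinusoid, the band that contains the sinusoid's frequency receives the largest summed magnitude.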

When the trained coder parameter generating system (201) is used, the extra parameters obtained (i.e., the supplementary set of speech parameters) aid in determination of the coder parameters, which are then used in the synthesis of speech. It should be noted that the recurrent buffer (206) is not used to train the energy bin subsystem (204), but the principal system (202) learns how to use the information generated by the recurrent buffer (206) when the energy bin subsystem (204) is fully trained. Alternatively, the subsystem (204) may include a second recurrent buffer that is used in training and testing, as may the principal system (202).
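One reading of this training order may be sketched as below. The sketch substitutes simple least-squares fits for neural network training, which is a deliberate simplification and not the method of the invention; the function names, the scalar parameters, and the one-frame delay are hypothetical. It shows only the ordering: the subsystem is fitted first without any buffer, and the principal system is then fitted using the frozen subsystem's delayed outputs as an extra input.

```python
def fit_slope(xs, ys):
    """Least-squares slope for y ~ w * x (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def fit_two_inputs(xs, ds, ys):
    """Least-squares weights for y ~ w_in * x + w_fb * d (2x2 normal equations)."""
    a11 = sum(x * x for x in xs)
    a12 = sum(x * d for x, d in zip(xs, ds))
    a22 = sum(d * d for d in ds)
    b1 = sum(x * y for x, y in zip(xs, ys))
    b2 = sum(d * y for d, y in zip(ds, ys))
    det = a11 * a22 - a12 * a12
    return (b1 * a22 - b2 * a12) / det, (a11 * b2 - a12 * b1) / det


def train_two_stage(linguistic, energy_targets, coder_targets):
    # Stage 1: fit the subsystem (cf. 204) directly; no recurrent buffer involved.
    w_sub = fit_slope(linguistic, energy_targets)
    # Freeze the subsystem and pass its outputs through a one-frame delay,
    # standing in for the recurrent buffer (cf. 206).
    predicted = [w_sub * x for x in linguistic]
    delayed = [0.0] + predicted[:-1]
    # Stage 2: fit the principal system (cf. 202) on input plus fed-back values.
    w_in, w_fb = fit_two_inputs(linguistic, delayed, coder_targets)
    return w_sub, w_in, w_fb
```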

FIG. 4, numeral 400, is a schematic representation of one embodiment of energy bin boundary frequency selection in accordance with the present invention. Clearly, the frequencies of the energy bands are central to determining the usefulness of the selected parameters. A speech synthesizer that operates in accordance with the present invention is designed to imitate the speech from a single speaker. Analysis of formants from a single speaker provided data for the graph shown in FIG. 4, which plots the frequencies of the first two formants of several vowels (iy, ux, ey, ih, uw, uh, ow, eh, ae, aa, ao, ax). In order to distinguish one vowel from another, it is desirable to separate the vowels in the F1/F2 plane shown, with the first formant F1 (402) and the second formant F2 (404) plotted in Hz. Three of the frequency bins (306, 308, 310) are selected to include typical values for F1, and three of the frequency bins (312, 314, 316) are selected to include typical values for F2, as is shown in FIG. 4. In addition to these bins based on vowel locations, one bin (304) was added to include a typical nasal formant, and three bins (318, 320, 322) were added to distinguish between other speech sounds like /s/ and /sh/.
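The allocation of the ten bins may be summarized in a small lookup, sketched here in Python. The reference numerals and roles are those given in the description; the dictionary name, the role labels, and the helper function are illustrative only, and no boundary frequencies are shown because the description does not state them.

```python
# Reference numerals of the ten bins of FIG. 3, keyed to the role given in the text.
BIN_ROLES = {
    304: "nasal",
    306: "F1", 308: "F1", 310: "F1",
    312: "F2", 314: "F2", 316: "F2",
    318: "other", 320: "other", 322: "other",  # e.g. /s/ versus /sh/
}


def bins_for(role):
    """Return the reference numerals of the bins assigned to a given role."""
    return sorted(n for n, r in BIN_ROLES.items() if r == role)
```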

Strategic positioning of the energy bins moves the formant frequencies of the synthesized vowels. By adjusting the bin frequencies, the neighboring vowels in the F1/F2 plane are separated, thus clarifying the perceptions of the vowel phonemes. The energy bins also help to clarify synthetic consonants, thus producing more intelligible synthetic speech. The principal system (202) and the subsystem (204) may each be a neural network, decision tree, genetic algorithm or combination of two or more of these.

Typically, as shown in FIG. 1, the method for using neural networks for speech synthesis has consisted of training a neural network to predict the coder parameters at a given time. In the training portion of this process, the neural net is supplied with coder parameters, and in the testing portion, the neural network generates these same parameters.

In order to improve the performance of the network, the standard methods typically change the size or architecture of the network (including network transfer functions), modify the rates, number of iterations or vector set used to train the network, add additional useful input information, or change the representation of the data. For example, the speech spectra may be represented by linear predictive coding coefficients, cepstral coefficients, reflection coefficients, or other various parameters. Since these methods make only marginal improvements, the method of the present invention was developed to provide training that uses more output parameters with a minimum of complexity added.

As shown in the steps set forth in FIG. 5, numeral 500, the method of the present invention provides, in response to text or linguistic information, efficient generation of a parametric representation of speech. The method includes the steps of: A) using (502) a coder parameter generating system to provide a principal set and a supplementary set of speech parameters, the principal set of speech parameters being the parametric representation of speech; and B) providing (504) feedback to the coder parameter generating system using the supplementary set of speech parameters to modify the principal set of speech parameters. Where selected, the method may further include providing (506) the modified principal set of speech parameters to a waveform synthesizer to synthesize speech.

In the present invention, the coder parameter generating system is divided into a principal system (202) and a subsystem (204), wherein the principal system (202) generates the principal set of speech parameters and the subsystem (204) generates the supplementary set of speech parameters. Generally, the supplementary set of speech parameters consists of energies in each of a predetermined set of frequency bands for speech in a selected time period. Boundaries of the predetermined set of frequency bands are typically determined by: determining typical values of principal and second formants of speech of a selected speaker for a set of phonemes; selecting a set of frequencies that fall between said formant values; and selecting each frequency band of the predetermined set of frequency bands to consist of the frequencies between a selected two of the set of frequencies. The supplementary set of speech parameters generally includes a representation of speech derived from the principal set of speech parameters.
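The boundary selection steps may be sketched as follows. The description requires only that the selected frequencies fall between the typical formant values; placing each boundary at the midpoint of adjacent values is an assumption made here for illustration, as are the function names and the outer limits `low_hz` and `high_hz`, which are assumed to lie below and above all formant values.

```python
def band_edges(formants_hz, low_hz, high_hz):
    """Place one boundary between each pair of adjacent typical formant values.

    Assumption: the midpoint is used; any frequency between the two values
    would satisfy the description.
    """
    values = sorted(set(formants_hz))
    between = [(a + b) / 2.0 for a, b in zip(values, values[1:])]
    return [low_hz] + between + [high_hz]


def bands(edges):
    """Each band consists of the frequencies between two adjacent boundaries."""
    return list(zip(edges, edges[1:]))
```

For example, with hypothetical formant values of 300, 500 and 700 Hz and outer limits of 0 and 4000 Hz, the boundaries fall at 400 and 600 Hz, so that each band contains exactly one formant value.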

Software implementing the method may be embedded in a microprocessor or a digital signal processor. Alternatively, an application specific integrated circuit may implement the method, or a combination of any of these implementations may be used. In a preferred embodiment, the coder parameter generating system utilizes neural networks. Alternatively, the coder parameter generating system may utilize decision tree units or genetic algorithms.

In another embodiment, shown in FIG. 6, numeral 600, the method of the present invention provides, in response to linguistic information, efficient generation of a parametric representation of speech. The method includes the steps of: A) using (602) a principal system to generate a principal set of speech parameters; B) providing (604) the principal set of speech parameters and the linguistic information to a subsystem that modifies the principal set of speech parameters and providing a modified set of speech parameters to the principal system; and C) generating (606), by the principal system, a parametric representation of speech. Then, where selected, the modified principal set of speech parameters may be provided (608) to a waveform synthesizer to synthesize speech.
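The data flow of steps (602) through (606) may be sketched as below. Both functions are hypothetical stand-ins with arbitrary arithmetic, not trained systems; the sketch shows only the order in which the principal system and the subsystem exchange parameters.

```python
def principal_system(linguistic, feedback=None):
    """Hypothetical principal system (cf. 202); the arithmetic is arbitrary."""
    base = [2.0 * x for x in linguistic]
    if feedback is None:
        return base
    return [b + f for b, f in zip(base, feedback)]


def subsystem(principal, linguistic):
    """Hypothetical subsystem (cf. 204) that modifies the principal set."""
    return [0.1 * p + 0.01 * x for p, x in zip(principal, linguistic)]


def generate(linguistic):
    first = principal_system(linguistic)           # step A (602)
    modified = subsystem(first, linguistic)        # step B (604)
    return principal_system(linguistic, modified)  # step C (606)
```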

As in the previous method, the coder parameter generating system is typically divided into a principal system and a subsystem, wherein the principal system generates the principal set of speech parameters and the subsystem generates the supplementary set of speech parameters.

The supplementary set of speech parameters may consist of energies in each of a predetermined set of frequency bands for speech in a selected time period. Boundaries of the predetermined set of frequency bands are typically determined by: determining typical values of principal and second formants of speech of a selected speaker for a set of phonemes; selecting a set of frequencies that fall between said formant values; and selecting each frequency band of the predetermined set of frequency bands to consist of the frequencies between a selected two of the set of frequencies.

The supplementary set of speech parameters generally includes a representation of speech derived from the principal set of speech parameters.

The present invention may be implemented by a device for providing, in response to linguistic information, efficient generation of a parametric representation of speech. The device includes a coder parameter generating system, coupled to receive linguistic information and feedback, for providing a principal set and a supplementary set of speech parameters, the principal set of speech parameters being the parametric representation of speech; and a feedback path having a recurrent buffer, coupled to the coder parameter generating system, for providing the supplementary set of speech parameters as feedback to the coder parameter generating system to modify the principal set of speech parameters. The coder parameter generating system may further provide the modified principal set of speech parameters to a waveform synthesizer to synthesize speech.

The coder parameter generating system is typically divided into a principal system and a subsystem. The principal system generates the principal set of speech parameters, and the subsystem generates the supplementary set of speech parameters. The supplementary set of speech parameters generally consists of energies in each of a predetermined set of frequency bands for speech in a selected time period. Again, boundaries may be determined as set forth above.

The supplementary set of speech parameters typically includes a representation of speech derived from the principal set of speech parameters. The device of the present invention is typically a microprocessor, a digital signal processor, an application specific integrated circuit, or a combination of these.

Again, the coder parameter generating system may include neural networks, decision tree units, or units that use a genetic algorithm.

The device of the present invention may be implemented in a text-to-speech system (203), a speech synthesis system (203), or a dialog system (203).

The present invention may be embodied in other specific forms without departing from its spirit or essential characteristics. The described embodiments are to be considered in all respects only as illustrative and not restrictive. The scope of the invention is, therefore, indicated by the appended claims rather than by the foregoing description. All changes which come within the meaning and range of equivalency of the claims are to be embraced within their scope.

We claim:

Claims

1. A method for providing, in response to text/linguistic information, efficient generation of a parametric representation of speech, comprising the steps of:

1A) using a coder parameter generating system to provide a principal set of speech parameters and a supplementary set of speech parameters, the principal set of speech parameters being the parametric representation of speech; and

1B) providing feedback to the coder parameter generating system using the supplementary set of speech parameters to modify the principal set of speech parameters.

2. The method of claim 1 further including at least one of 2A-2F:

2A) providing the modified principal set of speech parameters to a waveform synthesizer to synthesize speech;

2B) wherein the coder parameter generating system is divided into a principal system and a subsystem, wherein the principal subsystem generates the principal set of speech parameters and the subsystem generates the supplementary set of speech parameters;

2C) wherein the supplementary set of speech parameters consists of energies in each of a predetermined set of frequency bands for speech in a selected time period, and where selected, wherein boundaries of the predetermined set of frequency bands are determined by:

2C1) determining typical values of principal and second formants of speech of a selected speaker for a set of phonemes;

2C2) selecting a set of frequencies that fall between said typical values of the principal and second formants; and

2C3) selecting each frequency band of the predetermined set of frequency bands to consist of frequencies between a selected two of the set of frequencies; and where further selected, one of 2C4-2C5:

2C4) wherein the supplementary set of speech parameters includes a representation of speech derived from the principal set of speech parameters; and

2C5) wherein one of:

2C5a) software implementing the method is embedded in a microprocessor;

2C5b) software implementing the method is embedded in a digital signal processor;

2C5c) the method is implemented by an application specific integrated circuit; and

2C5d) the method is implemented by a combination of at least two of 2C5a-2C5c;

2D) wherein the coder parameter generating system is a neural network;

2E) wherein the coder parameter generating system is a decision tree unit; and

2F) wherein the coder parameter generating system is a unit that uses a genetic algorithm.

3. A method for providing, in response to linguistic information, efficient generation of parametric representation of speech, comprising the steps of:

3A) using a principal system to generate a principal set of speech parameters;

3B) providing the principal set of speech parameters and linguistic information to a subsystem that modifies the principal set of speech parameters and providing a supplementary set of modified speech parameters to the principal system; and

3C) generating, by the principal system, a parametric representation of speech.

4. The method of claim 3 further including at least one of 4A-4I:

4A) providing the modified principal set of speech parameters to a waveform synthesizer to synthesize speech;

4B) the principal system generates the principal set of speech parameters and the subsystem generates the supplementary set of modified speech parameters;

4C) the supplementary set of modified speech parameters consists of energies in each of a predetermined set of frequency bands for speech in a selected time period;

4D) boundaries of the predetermined set of frequency bands are determined by:

4D1) determining typical values of principal and second formants of speech of a selected speaker for a set of phonemes;

4D2) selecting a set of frequencies that fall between said typical values of the principal and second formants; and

4D3) selecting each frequency band of the predetermined set of frequency bands to consist of frequencies between a selected two of the set of frequencies;

4E) the supplementary set of modified speech parameters includes a representation of speech derived from the principal set of speech parameters;

4F) one of 4F1-4F4:

4F1) software implementing the method is embedded in a microprocessor;

4F2) software implementing the method is embedded in a digital signal processor;

4F3) the method is implemented by an application specific integrated circuit; and

4F4) the method is implemented by a combination of at least two of 4F1-4F3;

4G) at least one of the principal system and the subsystem is a neural network;

4H) at least one of the principal system and the subsystem is a decision tree unit; and

4I) at least one of the principal system and the subsystem is a unit that uses a genetic algorithm.

5. A device for providing, in response to linguistic information, efficient generation of a parametric representation of speech, comprising:

5A) a coder parameter generating system, coupled to receive linguistic information and feedback, for providing a principal set of speech parameters and a supplementary set of speech parameters, the principal set of speech parameters being the parametric representation of speech; and

5B) a feedback path having a recurrent buffer, coupled to the coder parameter generating system, for providing the supplementary set of speech parameters as feedback to the coder parameter generating system to modify the principal set of speech parameters.

6. The device of claim 5 wherein at least one of 6A-6H:

6A) the coder parameter generating system further provides a modified principal set of speech parameters to a waveform synthesizer to synthesize speech;

6B) the coder parameter generating system is divided into a principal system and a subsystem, wherein the principal system generates the principal set of speech parameters and the subsystem generates the supplementary set of speech parameters;

6C) the supplementary set of speech parameters consists of energies in each of a predetermined set of frequency bands for speech in a selected time period, and where selected, wherein boundaries of the predetermined set of frequency bands are determined by determining typical values of principal and second formants of speech of a selected speaker for a set of phonemes, selecting a set of frequencies that fall between said typical values of the principal and second formants, and selecting each frequency band of the predetermined set of frequency bands to consist of the frequencies between a selected two of the set of frequencies;

6D) the supplementary set of speech parameters includes a representation of speech derived from the principal set of speech parameters;

6E) the device is one of 6E1-6E4:

6E1) a microprocessor;

6E2) a digital signal processor;

6E3) an application specific integrated circuit; and

6E4) a combination of at least two of 6E1-6E3;

6F) the coder parameter generating system is a neural network;

6G) the coder parameter generating system is a decision tree unit; and

6H) the coder parameter generating system is a unit that uses a genetic algorithm.

7. A device for providing, in response to linguistic information, efficient generation of parametric representation of speech, comprising:

7A) a principal system, coupled to receive the linguistic information and feedback, for generating a supplementary set of speech parameters; and

7B) a subsystem, coupled to the principal system and to receive linguistic information, for providing a principal set of speech parameters which is a parametric representation of speech, via a recurrent buffer, to the principal system.

8. The device of claim 7 wherein at least one of 8A-8H:

8A) the principal system further provides a modified principal set of speech parameters to a waveform synthesizer to synthesize speech;

8B) the principal system generates the principal set of speech parameters and the subsystem generates the supplementary set of speech parameters;

8C) the supplementary set of speech parameters consists of energies in each of a predetermined set of frequency bands for speech in a selected time period, and where selected, boundaries of the predetermined set of frequency bands are determined by determining typical values of principal and second formants of speech of a selected speaker for a set of phonemes, selecting a set of frequencies that fall between said typical values of the principal and second formant values, and selecting each frequency band of the predetermined set of frequency bands to consist of frequencies between a selected two of the set of frequencies;

8D) the supplementary set of speech parameters includes a representation of speech derived from the principal set of speech parameters;

8E) the device is one of 8E1-8E4:

8E1) a microprocessor;

8E2) a digital signal processor;

8E3) an application specific integrated circuit; and

8E4) a combination of at least two of 8E1-8E3;

8F) at least one of: the principal system and the subsystem is a neural network;

8G) at least one of: the principal system and the subsystem is a decision tree unit; and

8H) at least one of: the principal system and the subsystem is a unit that uses a genetic algorithm.

9. A text-to-speech system/speech synthesis system/dialog system having a device for providing, in response to text/linguistic information, efficient generation of a parametric representation of speech, comprising:

9A) a coder parameter generating system, coupled to receive text/linguistic information and feedback, for providing a principal set of speech parameters and a supplementary set of speech parameters, the principal set of speech parameters being the parametric representation of speech; and

9B) a feedback path, coupled to the coder parameter generating system, for providing the supplementary set of speech parameters as feedback, via a recurrent buffer, to the coder parameter generating system to modify the principal set of speech parameters.

10. A text-to-speech system/speech synthesizer/dialog system having a device for providing, in response to text/linguistic information, efficient generation of parametric representation of speech, comprising:

10A) a principal system, coupled to receive the text/linguistic information, for generating a supplementary set of speech parameters; and

10B) a subsystem, coupled to the principal system and to receive linguistic information, for providing a principal set of speech parameters which is a parametric representation of speech, via a recurrent buffer, to the principal system.