GB2325599A - Speech synthesis with prosody enhancement - Google Patents

Speech synthesis with prosody enhancement

Info

Publication number
GB2325599A
Authority
GB
United Kingdom
Prior art keywords
prosodic
information
speech
events
marked
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
GB9811008A
Other versions
GB9811008D0 (en)
GB2325599B (en)
Inventor
Gerald E Corrigan
Noel Massey
Orhan Karaali
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Motorola Solutions Inc
Original Assignee
Motorola Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Motorola Inc filed Critical Motorola Inc
Publication of GB9811008D0 publication Critical patent/GB9811008D0/en
Publication of GB2325599A publication Critical patent/GB2325599A/en
Application granted granted Critical
Publication of GB2325599B publication Critical patent/GB2325599B/en
Anticipated expiration legal-status Critical
Expired - Fee Related legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Abstract

The method includes determining prosodic information that describes the rhythm and intonation of speech to be generated, from at least one of: the style information and the focus information 528; and using a statistical system, e.g. a neural network 522, to convert information including the linguistic information and the prosodic information into speech parameters, which describe portions of speech waveforms.

Description

METHOD, DEVICE AND SYSTEM FOR GENERATING SPEECH SYNTHESIS PARAMETERS FROM INFORMATION INCLUDING AN EXPLICIT REPRESENTATION OF INTONATION

Field of the Invention

The present invention relates to parameter generating systems used in speech synthesis, and more particularly to intonation coding in coder parameter generating systems used in speech synthesis.
Background of the Invention

As shown in FIG. 1, numeral 100, to convert text to speech, statistical systems (102) typically convert linguistic (phonetic) representations of texts into parameters characterizing speech waveforms. The system illustrated in FIG. 1 uses two statistical systems (neural networks) (110 and 118). One neural network (110) converts linguistic descriptions of speech phones and their contexts into segment durations for the segments of speech waveforms associated with the phones. The second neural network (118) converts linguistic descriptions of speech frames (portions of speech waveforms occurring during short time periods) into acoustic descriptions (120) of the frames. These acoustic descriptions are then converted into speech waveforms (124) by a vocoder (122).
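For concreteness, the following is a minimal structural sketch of this prior-art two-network pipeline. The feature dimensions, the stand-in linear "networks", the 10 ms frame size and the 18 coder parameters are all assumptions made for illustration; the patent does not specify them.

```python
# Structural sketch of the FIG. 1 pipeline, with untrained linear stand-ins
# for the two neural networks (110, 118). Shapes and values are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

def duration_network(phone_features):
    """Stand-in for network 110: phone context -> segment duration (ms)."""
    w = rng.normal(size=phone_features.shape[-1])          # untrained weights
    return np.maximum(20.0, 100.0 + phone_features @ w)    # crude positive duration

def frame_network(frame_features):
    """Stand-in for network 118: frame description -> acoustic parameters."""
    w = rng.normal(size=(frame_features.shape[-1], 18))    # e.g. 18 coder parameters
    return frame_features @ w

# One phone described by a hypothetical 12-dimensional linguistic context vector.
phone_ctx = rng.normal(size=12)
dur_ms = duration_network(phone_ctx)

# Expand to 10 ms frames, attach a frame-position feature, and predict the
# acoustic descriptions that a vocoder (122) would turn into a waveform (124).
n_frames = int(dur_ms // 10)
frames = np.column_stack([np.tile(phone_ctx, (n_frames, 1)),
                          np.linspace(0.0, 1.0, n_frames)])
acoustic = frame_network(frames)
print(dur_ms, acoustic.shape)
```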
One problem with statistical approaches based on linguistic information extracted from text is that the text does not contain enough information to generate prosody (rhythm and intonation) correctly in speech waveforms. It is known that, for any given text, there exists a plurality of intonational contours and rhythm patterns that can be generated for that text. The conversions performed by statistical systems are controlled by statistical descriptions of training data typically consisting of a set of vectors comprised of possible inputs to the system and the outputs desired when the possible inputs are presented to the system.
In statistical systems used for text-to-speech, the training data typically is generated by analysis of natural speech.
Because the intonational contours and rhythm patterns cannot be predicted from linguistic information extracted from texts, statistical systems tend to produce prosody that averages the possible contours and rhythms. This averaging of the contours and rhythms can make the speech less understandable.
Hence, there is a need for a method and device for improving the performance of a text-to-speech system in generating prosody.
Brief Description of the Drawings

FIG. 1 is a schematic representation of a text-to-speech system incorporating two statistical systems for generating speech parameters, as is known in the art.
FIG. 2 is a schematic representation of a system for generating speech parameters in accordance with the present invention.
FIG. 3 is a schematic representation of the process of training a neural network in accordance with the existing technology.
FIG. 4 is a schematic representation of the process of training a neural network in accordance with the present invention.
FIG. 5 is a schematic representation of a text-to-speech system incorporating the present invention.
FIG. 6 is a flow chart of an embodiment of steps in accordance with the method of the present invention.
FIG. 7 is a schematic representation of a preferred embodiment of a device in accordance with the present invention.
Detailed Description of a Preferred Embodiment

The present invention provides a method, device and system for improving the prosody expressed by a speech parameter generating system by incorporating an explicit representation of the prosody as input to the speech parameter generating system. This improvement provides more realistic rhythm and intonation in speech, making the meaning of the speech more understandable than in prior art text-to-speech systems.
In a preferred embodiment, the speech parameter generating system of the present invention is a neural network producing a series of parameters or parameter vectors, each consisting of one or more parameters describing a predetermined aspect of the speech waveform. As shown in FIG. 2, numeral 200, these parameters may be the acoustic frame descriptions (220) or the segment durations (206). FIG. 2 illustrates a preferred embodiment of a device in a system in accordance with the present invention. The device provides, in response to predetermined phonetic/linguistic information (222) and predetermined prosodic information (224), efficient generation of acoustic frame descriptions (220), i.e., prosodically enhanced speech parameters. The device includes: A) a segment duration computation unit (202), coupled to receive the phonetic/linguistic information and coupled to a prosody generation unit (204), for converting the phonetic/linguistic information into segment durations (206); B) the prosody generation unit (204), coupled to receive the predetermined prosodic information, for using the predetermined prosodic information to generate prosodic output (208); C) a frame description generation unit (210), coupled to the segment duration computation unit and the prosody generation unit and coupled to receive the phonetic/linguistic information, for utilizing the segment durations (206), the prosodic output (208) and the phonetic/linguistic information (222) to generate linguistic/prosodic frame descriptions (212); and D) an acoustic description computation unit (214), coupled to the frame description generation unit (210), for computing acoustic frame descriptions (220) for the predetermined prosodic information (224) and the phonetic/linguistic information (222), to generate speech parameters that provide reliable prosodic performance. The parameters are typically suitable for use as input to a waveform synthesizer or generator (216), such as a vocoder, which generates a speech waveform (218).
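A sketch of this dataflow is given below, reusing the stand-in style of the earlier snippet. The point is only that the prosodic output (208) is joined with the phonetic/linguistic features before duration and frame computation; the feature values, vector sizes and the prosody_generation stand-in are assumptions, not the patent's actual encoding.

```python
# Sketch of the FIG. 2 dataflow: prosodic output (208) is concatenated with
# the phonetic/linguistic information (222) before the duration and frame
# computations. All shapes and values here are hypothetical.
import numpy as np

rng = np.random.default_rng(1)

def prosody_generation(prosodic_info):
    """Stand-in for unit 204: prosodic markup -> numeric prosodic output (208)."""
    return np.asarray(prosodic_info, dtype=float)

linguistic = rng.normal(size=12)                       # phonetic/linguistic info (222)
prosodic = prosody_generation([1.0, 0.0, 3.0])         # e.g. accent flag, boundary flag, break index

dur_input = np.concatenate([linguistic, prosodic])     # input to duration unit (202)
frame_input = np.concatenate([linguistic, prosodic,
                              [0.5]])                  # + frame position, for units 210/214
print(dur_input.shape, frame_input.shape)
```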
In a preferred embodiment, the speech parameter generating device is a neural network producing a series of parameter vectors consisting of one or more parameters describing some aspect of the speech waveform. In FIG. 1, the parameters could be the acoustic frame descriptions (120) or the segment durations (112). FIG. 7, numeral 700, is a schematic representation of a preferred embodiment of a device in accordance with the present invention. As in existing speech parameter generating systems, this invention uses a statistical system (702) to convert information including linguistic information (704) derived from text into parameters describing some aspect of the speech to be generated (706). However, unlike existing speech parameter generating systems, this invention includes a prosody determination unit (712), which converts information (710) about speaking style and focus into a prosodic description (708) of the speech to be generated. This prosodic description is also provided to the neural network to determine the speech parameters.
In a preferred embodiment, the device provides, in response to information including linguistic information and at least one of: style information and focus information, efficient generation of prosodically enhanced speech parameters. The device includes: A) a prosody determination unit (712), coupled to receive at least one of the style information and the focus information, for generating prosodic information describing the rhythm and intonation of the speech to be generated; and B) a statistical system (702), which in the preferred embodiment is a neural network, coupled to receive information including the linguistic and prosodic information, for providing parameters describing portions of speech waveforms.
The information (710) about speaking style and focus may be provided by a dialog model or by user input. Alternatively, the style and focus determination may be made in advance for different sentence types, such as statements, questions, and commands.
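As a hypothetical illustration of the "decided in advance per sentence type" option, the table-driven sketch below maps sentence types to style/focus information (710). The category names and field names are invented for illustration and are not taken from the patent.

```python
# Hypothetical advance style/focus decisions per sentence type; a dialog
# model or user input may override them. Field names are assumptions.
PRESET_STYLE_FOCUS = {
    "statement": {"style": "neutral",       "focus": "final_content_word"},
    "question":  {"style": "interrogative", "focus": "wh_word_or_final_rise"},
    "command":   {"style": "imperative",    "focus": "verb"},
}

def style_and_focus(sentence_type, user_override=None):
    """Return style/focus info (710), preferring user or dialog-model input."""
    return user_override or PRESET_STYLE_FOCUS.get(sentence_type,
                                                   PRESET_STYLE_FOCUS["statement"])

print(style_and_focus("question"))
```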
Typically, as shown in FIG. 3, numeral 300, the method for generating neural networks for speech synthesis has consisted of training a neural network to predict speech parameters for a particular portion of the speech waveform.
Speech (302) is first recorded and stored in an audio database (304). The speech is then phonetically and syntactically labeled (phonetic/syntactic labeling unit, 306), and the label information is stored in a phonetic database (308).
(Alternatively, the phonetic/syntactic labeling may be done manually.) The audio database (304) and/or the phonetic database (308) are then processed by the desired output computation unit (310) to extract parameters describing some portion of the recorded speech, used as training output (312).
For each training output value, the phonetic database (308) is processed by the input generation unit (314) to produce training input (316) for a neural network (318) that generates speech parameters (320). The neural network is then trained to generate a good approximation of the training output in response to the training input, using some criterion for goodness, such as the minimum mean squared difference between the neural network output and the training output.
The neural network in the preferred embodiment of the present invention is generated as shown in FIG. 4. Speech (402) is first recorded and stored in an audio database (404). The speech is phonetically and syntactically labeled (phonetic/syntactic labeling unit, 406), and the label information is stored in a phonetic database (408). The speech is then prosodically labeled (prosodic labeling unit, 410), indicating the actual prosody of the recorded speech, to create a prosodic database (412). (Alternatively, either the phonetic/syntactic labeling or the prosodic labeling, or both, may be performed manually.) The audio database (404) or the phonetic database (408), or both, are then processed (desired output computation unit, 414) to extract parameters describing some portion of the recorded speech, used as training output (416). For each training output value, the phonetic database (408) and the prosodic database (412) are processed (input generation unit, 418) to produce training input (420) for a neural network (422). The neural network is then trained to generate a good approximation of the training output in response to the training input, using a predetermined criterion for goodness, such as the minimum mean squared difference between the neural network output and the training output.
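A minimal sketch of such a training loop under the stated minimum mean-squared-error criterion follows. A single linear layer stands in for the neural network (422), and random arrays stand in for features drawn from the phonetic (408) and prosodic (412) databases; the sizes and learning rate are arbitrary assumptions.

```python
# Minimal MSE training sketch: a linear model trained by gradient descent,
# standing in for network 422. Real inputs would come from the phonetic and
# prosodic databases rather than random data.
import numpy as np

rng = np.random.default_rng(2)
X = np.column_stack([rng.normal(size=(500, 12)),      # phonetic-database features
                     rng.normal(size=(500, 4))])      # prosodic-database features
y = rng.normal(size=(500, 18))                        # desired outputs (416), e.g. coder params

W = np.zeros((16, 18))
for _ in range(200):                                  # plain gradient descent on MSE
    err = X @ W - y
    W -= 0.001 * (X.T @ err) / len(X)

print(float(np.mean((X @ W - y) ** 2)))               # training mean squared error
```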
This is similar to the process used for training in the existing technology, except that the speech is prosodically labeled to generate the prosodic database (412) which is used in input generation (input generation unit, 418).
FIG. 5, numeral 500, is a schematic representation of a text-to-speech system incorporating the present invention.
This is similar to the text-to-speech system of FIG. 1, except that predetermined prosodic information (528), which is used to determine the prosody of the generated speech, is input in addition to the text (504). Text (504) is input to a text-to-linguistics conversion unit (506), which provides a linguistic (phonetic) description (510) to a segment duration computation unit (514), which is typically a neural network.
The segment duration computation unit (514) utilizes the linguistic (phonetic) description (510) and the prosodic output (512) to output segment durations (516). The predetermined prosodic information (528) may come from a dialog model or user selections, or may simply be a set of arbitrary advance decisions concerning the way in which prosody will be generated. This information, along with information from the text, is used in prosody generation (prosody generation unit, 508) to produce the prosodic output (512) provided to the segment duration computation unit (514). This information is also provided to the frame description generation unit (518) to produce the linguistic/prosodic frame descriptions (520) provided to the acoustic description computation unit (522).
The segment duration computation unit (514) and the acoustic description computation unit (522) are examples of the present invention. The output of the acoustic description computation unit (522) is then input to a waveform generation unit (vocoder, 524), which generates a speech waveform (526).
The predetermined prosodic information (528) may be information about speaking style and focus, which is provided by a dialog model or by user input. Alternatively, the style and focus determination may be made in advance for different sentence types, such as statements, questions, and commands.
The speech parameter generating systems (514 and 522) may each be a neural network, decision tree, genetic algorithm or combination of two or more of these.
Existing statistical synthesis methods generate prosody of speech using only phonetic information and information that can be extracted from text. Since these methods tend to average the intonation contours and rhythm patterns that can occur in speech, producing unclear prosodic variation, the method of the present invention was developed to generate prosody using an explicit representation of prosody. Thus, the present invention provides more natural intonation and rhythm to the speech generated.
As shown in the steps set forth in FIG. 6, numeral 600, the method of the present invention provides, in response to information including linguistic information and at least one of: style information and focus information, efficient generation of prosodically enhanced speech parameters. In one embodiment, the method includes the steps of: A) determining (602) prosodic information that describes rhythm and intonation of speech to be generated, from at least one of: the style information and the focus information; and B) using (604) a statistical system to convert information including the linguistic information and the prosodic information into speech parameters, which describe portions of speech waveforms.
The speech parameters are typically parameters suitable for use as input to a waveform synthesizer/generator, which may be used for synthesizing speech. Alternatively, the speech parameters may be segment durations.
Software that implements the present invention may be embedded in a microprocessor or a digital signal processor (DSP). Alternatively, the method may be implemented by an application specific integrated circuit (ASIC). Also, the method may be implemented by a combination of microprocessor, DSP and ASIC.
The predetermined prosodic information generally includes at least one of: A) locations of word endings and a degree of disjuncture between words; B) locations of pitch accents and a form of the pitch accents; and C) locations of boundaries marked in a pitch contour and a form of the boundaries.
A format of the predetermined prosodic information is typically one of: A) information describing a proximity of marked prosodic events to defined frames surrounding a frame for which coder parameters are being generated; B) information describing a time separating marked prosodic events from a frame for which the coder parameters are being generated; C) information describing a time separating certain marked prosodic events from other marked prosodic events; D) information describing a number of prosodic events of one type in a time period separating a prosodic event of another type and a frame for which the coder parameters are being generated; E) information describing a number of prosodic events of one type occurring in a time period separating two prosodic events of another type; and F) at least two of formats A-E.
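The sketch below illustrates, in very simplified form, how features of formats B-E above might be computed for a frame: the time separating the frame from the nearest marked events, and the count of events of one type between two others. The event times, event types and the "nearest preceding/following" convention are assumptions for illustration only.

```python
# Hedged sketch of formats B-E: times to marked events and counts of one
# event type between two others, for a frame at time t. Event data invented.
boundaries    = [0.00, 1.20, 2.45]        # intonational boundary times (s)
pitch_accents = [0.35, 0.80, 1.60, 2.10]  # pitch-accent times (s)

def time_since(events, t):
    """Formats B/C: time separating the frame (or event) at t from the last preceding event."""
    past = [e for e in events if e <= t]
    return t - past[-1] if past else None

def count_between(events, start, end):
    """Formats D/E: number of events of one type inside the interval (start, end]."""
    return sum(1 for e in events if start < e <= end)

t = 1.95                                          # frame for which parameters are generated
last_boundary = t - time_since(boundaries, t)     # time of the preceding boundary
features = {
    "time_since_boundary": time_since(boundaries, t),
    "time_since_accent": time_since(pitch_accents, t),
    "accents_since_boundary": count_between(pitch_accents, last_boundary, t),
}
print(features)
```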
Where selected, the segment durations may be utilized as supplementary input to an acoustic frame parameter generation unit.
The speech parameters may be selected to be durations of speech segments associated with phones, and the method may be implemented utilizing software and/or hardware as described above. In this implementation, the predetermined prosodic information and its format are as described above.
Where selected, at least one of the segment duration computation unit and the acoustic description computation unit may be a neural network, a decision tree unit, or a unit that uses a genetic algorithm.
In the preferred embodiment of the invention, the prosodic information used is derived from the ToBI labeling system developed by Silverman et al. (Silverman, K. et al., "TOBI: A Standard for Labeling English Prosody", Proc. ICSLP 92, pp. 867-870, Banff, October 1992). The time at which each word ends is marked, along with a number (the break index) which indicates the degree of disjuncture between the marked word and the following word. Tones are also marked, using an inventory of symbols for pitch accents on syllables and intonational boundary marks.
A variety of representations are possible for the prosodic information provided as input to the statistical system. In a time-delay neural network representation, the input consists of a series of vectors, each vector representing the linguistic context during some sample time period near the portion of the waveform for which the system is generating speech parameters (the current portion of the speech waveform). In this case, different inputs may indicate the presence of a tone or break index mark in the sample time period, or the proximity of a mark to the sample time period. As an alternative to the time-delay representation, the input may indicate the distances between the current portion of the speech waveform and events that are marked in the prosodic information.
Additionally, distances between these events may be provided as input. The distances between two elements may be measured as the time period separating them, or as counts of the number of events of another type separating the two elements. For example, an input may represent the time between the current portion of the speech waveform and the preceding occurrence of an intonational boundary mark, while another may indicate the number of pitch accents with downstep (a narrowing of the pitch range) that occurred since the boundary. When the speech parameters are acoustic representations of speech frames suitable to be used as input to a waveform synthesizer, the preferred embodiment uses a combination of all these techniques. When the speech parameters are segment durations, the preferred embodiment does not use a time-delay representation to represent prosodic information.
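The contrast between the two representations can be sketched as follows, again with invented event times. In the time-delay style, each of several short sample windows around the current frame carries a presence flag per event type; in the distance style, the frame carries explicit separations from the nearest marks. The 50 ms window size, the number of windows and the particular features are assumptions.

```python
# Sketch of the two input styles discussed above: time-delay presence flags
# versus explicit distances to marked events. All constants are assumed.
def time_delay_inputs(events, t, window=0.05, n_side=3):
    """One presence flag per 50 ms window in a +/-150 ms context around time t."""
    flags = []
    for k in range(-n_side, n_side + 1):
        lo, hi = t + k * window, t + (k + 1) * window
        flags.append(1.0 if any(lo <= e < hi for e in events) else 0.0)
    return flags

boundaries    = [0.00, 1.20, 2.45]
pitch_accents = [0.35, 0.80, 1.60, 2.10]
t = 1.62                                                      # current frame time (s)

tdnn_vector = time_delay_inputs(pitch_accents, t) + time_delay_inputs(boundaries, t)
distance_vector = [t - max(e for e in pitch_accents if e <= t),  # time since last accent
                   min(e for e in boundaries if e > t) - t]      # time to next boundary
print(tdnn_vector, distance_vector)
```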
Software implementing the method may be embedded in a microprocessor or a digital signal processor. Alternatively, an application specific integrated circuit may implement the method, or a combination of any of these implementations may be used. In a preferred embodiment, the speech parameter generating system utilizes neural networks. Alternatively, the speech parameter generating system may utilize decision tree units or genetic algorithms.
The portions of speech waveforms described may be segments associated with phonetic elements such as phones, and the parameters generated may be the durations of these segments.
The portions of speech waveforms described may be frames of speech, and the parameters generated may be an acoustic representation of these frames.
Software implementing the method may be embedded in a microprocessor or a digital signal processor. Alternatively, an application specific integrated circuit may implement the method, or a combination of any of these implementations may be used. In a preferred embodiment, the speech parameter generating system utilizes neural networks. Alternatively, the speech parameter generating system may utilize decision tree units or genetic algorithms.
The present invention may be implemented by a device for providing, in response to information including phonetic and prosodic information, efficient generation of speech parameters. The device includes a statistical system, coupled to receive information including phonetic and prosodic information, for providing parameters describing portions of speech waveforms.
The portions of speech waveforms described may be segments associated with phonetic elements such as phones, and the parameters generated may be the durations of these segments.
The portions of speech waveforms described may be frames of speech, and the parameters generated may be an acoustic representation of these frames.
Again, the speech parameter generating system may include neural networks, decision tree units, or units that use a genetic algorithm.
The device of the present invention may be implemented in a text-to-speech system, a speech synthesis system, or a dialog system.
The present invention may be embodied in other specific forms without departing from its spirit or essential characteristics. The described embodiments are to be considered in all respects only as illustrative and not restrictive. The scope of the invention is, therefore, indicated by the appended claims rather than by the foregoing description. All changes which come within the meaning and range of equivalency of the claims are to be embraced within their scope.
We claim:

Claims (10)

1. A method for providing, in response to information including linguistic information and at least one of: style information and focus information, efficient generation of prosodically enhanced speech parameters comprising the steps of: 1A) determining prosodic information that describes rhythm and intonation of speech to be generated, from at least one of: the style information and the focus information; and
1B) using a statistical system to convert information including the linguistic information and the prosodic information into speech parameters, which describe portions of speech waveforms.
2. The method of claim 1 wherein the speech parameters are parameters suitable to use as input to a waveform synthesizer/generator, and, where selected, where at least one of 2A-2D: 2A) further including providing the speech parameters to a waveform synthesizer/generator to synthesize speech; 2B) wherein one of: 2B1) software implementing the method is embedded in a microprocessor; 2B2) software implementing the method is embedded in a digital signal processor; 2B3) the method is implemented by an application specific integrated circuit; and 2B4) the method is implemented by a combination of at least two of 2B1-2B3; and 2C) wherein the prosodic information includes at least one of: 2C1) the locations of word endings and the degree of disjuncture between words; 2C2) the locations of pitch accents and the form of the pitch accents; and 2C3) the locations of boundaries marked in the pitch contour and the form of the boundaries; 2D) and where selected in step 2C, wherein the format of the prosodic information is one of: 2D1) information describing the proximity of marked prosodic events to defined frames surrounding the frame for which the coder parameters are being generated; 2D2) information describing the time separating marked prosodic events from the frame for which the coder parameters are being generated; 2D3) information describing the time separating certain marked prosodic events from other marked prosodic events; 2D4) information describing the number of prosodic events of one type in the time period separating a prosodic event of another type and the frame for which the coder parameters are being generated; 2D5) information describing the number of prosodic events of one type occurring in the time period separating two prosodic events of another type; and 2D6) at least two of 2D1-2D5.
3. The method of claim 1 wherein the speech parameters are durations of speech segments associated with phones, and where selected, at least one of 3A-3C: 3A) further including providing the segment durations as supplementary input to an acoustic frame parameter generation method; 3B) wherein one of 3B1-3B4: 3B1) software implementing the method is embedded in a microprocessor; 3B2) software implementing the method is embedded in a digital signal processor; 3B3) the method is implemented by an application specific integrated circuit; and 3B4) the method is implemented by a combination of at least two of 3B1-3B3; and 3C) wherein the prosodic information includes at least one of 3C1-3C3: 3C1) the locations of word endings and the degree of disjuncture between words; 3C2) the locations of pitch accents and the form of the pitch accents; and 3C3) the locations of boundaries marked in the pitch contour and the form of the boundaries, and where selected, wherein the format of the prosodic information is one of 3C3a-3C3f: 3C3a) information describing the proximity of marked prosodic events to defined segments surrounding the segment for which the duration is being computed; 3C3b) information describing the number of segments separating marked prosodic events from the segment for which the duration is being computed; 3C3c) information describing the number of segments separating marked prosodic events from other marked prosodic events; 3C3d) information describing the number of prosodic events of some type in the segments separating a prosodic event of another type and the segment for which the duration is being computed; 3C3e) information describing the number of prosodic events of one type occurring in the segments separating two prosodic events of another type; and 3C3f) at least two of 3C3a-3C3e.
4. The method of claim 1 wherein at least one of 4A-4C: 4A) the statistical system is a neural network; 4B) the statistical system is a decision tree unit; and 4C) the statistical system is a unit that uses a genetic algorithm.
5. A device for providing, in response to information including linguistic information and at least one of: style information and focus information, efficient generation of prosodically enhanced speech parameters comprising: 5A) a prosody determination unit, coupled to receive at least one of the style information and the focus information, generating prosodic information, describing the rhythm and intonation of speech to be generated; and 5B) a statistical system, coupled to receive information including linguistic and prosodic information, for providing parameters describing portions of speech waveforms.
6. The device of claim 5 wherein at least one of 6A-6C: 6A) the speech parameters are parameters suitable to use as input to a waveform synthesizer/generator, and where selected, one of 6A1-6A2: 6A1) wherein the device is one of: 6A1a) a microprocessor; 6A1b) a digital signal processor; 6A1c) an application specific integrated circuit; and 6A1d) a combination of at least two of 6A1a-6A1c; 6A2) wherein the prosodic information includes at least one of 6A2a-6A2c: 6A2a) the locations of word endings and the degree of disjuncture between words; 6A2b) the locations of pitch accents and the form of the pitch accents; and 6A2c) the locations of boundaries marked in the pitch contour and the form of the boundaries, and where selected for step 6A2, wherein the format of the prosodic information is one of 6A2d-6A2i: 6A2d) information describing the proximity of marked prosodic events to defined segments surrounding the segment for which the duration is being computed; 6A2e) information describing the number of segments separating marked prosodic events from the segment for which the duration is being computed; 6A2f) information describing the number of segments separating marked prosodic events from other marked prosodic events; 6A2g) information describing the number of prosodic events of some type in the segments separating a prosodic event of another type and the segment for which the duration is being computed; 6A2h) information describing the number of prosodic events of one type occurring in the segments separating two prosodic events of another type; and 6A2i) at least two of 6A2d-6A2h; 6B) wherein the prosodic information includes at least one of 6B1-6B3: 6B1) the locations of word endings and the degree of disjuncture between words; 6B2) the locations of pitch accents and the form of the pitch accents; and 6B3) the locations of boundaries marked in the pitch contour and the form of the boundaries, and wherein, where selected, a format of the prosodic information is one of 6B4-6B9: 6B4) information describing the proximity of marked prosodic events to defined frames surrounding the frame for which the coder parameters are being generated; 6B5) information describing the time separating marked prosodic events from the frame for which the coder parameters are being generated; 6B6) information describing the time separating marked prosodic events from other marked prosodic events; 6B7) information describing the number of prosodic events of some type in the time period separating a prosodic event of another type and the frame for which the coder parameters are being generated; 6B8) information describing the number of prosodic events of one type occurring in the time period separating two prosodic events of another type; and 6B9) at least two of 6B4-6B8; and 6C) wherein the speech parameters are durations of speech segments associated with phones, and where selected, further including an acoustic frame parameter generation unit coupled to receive the durations of speech segments.
7. The device of claim 5 wherein at least one of 7A-7E: 7A) the speech parameter generating system further provides the speech parameters to a waveform synthesizer/generator for synthesizing speech; 7B) wherein the device is one of 7B1-7B4: 7B1) a microprocessor; 7B2) a digital signal processor; 7B3) an application specific integrated circuit; and 7B4) a combination of at least two of 7B1-7B3; 7C) wherein the statistical system is a neural network; 7D) wherein the statistical system is a decision tree unit; and 7E) wherein the statistical system is a unit that uses a genetic algorithm.
8. A text-to-speech system/speech synthesis system/dialog system having at least one device for providing, in response to information including linguistic information and at least one of: style information and focus information, efficient generation of speech parameters each device comprising: 8A) a prosody determination unit, coupled to receive at least one of: the style information and the focus information, that generates prosodic information that describes the rhythm and intonation of speech to be generated; and 8B) a statistical system, coupled to receive information including the linguistic information and the prosodic information, for providing speech parameters that describe portions of speech waveforms.
9. The text-to-speech system/speech synthesis system/dialog system of claim 8 wherein at least one of 9A-9G: 9A) the device provides speech parameters that are parameters suitable to use as input to a waveform coder, and where selected, wherein the device that produces speech parameters that are parameters suitable to use as input to a waveform coder further provides the speech parameters to a waveform synthesizer to synthesize speech; 9B) wherein the device is one of: 9B1) a microprocessor; 9B2) a digital signal processor; 9B3) an application specific integrated circuit; and 9B4) a combination of at least two of 9B1-9B3; 9C) wherein the prosodic information includes at least one of 9C1-9C3: 9C1) the locations of word endings and the degree of disjuncture between words; 9C2) the locations of pitch accents and the form of the pitch accents; and 9C3) the locations of boundaries marked in the pitch contour and the form of the boundaries, and where selected, for step 9C, wherein the format of the prosodic information is one of 9C3a-9C3f: 9C3a) information describing the proximity of marked prosodic events to defined frames surrounding the frame for which the coder parameters are being generated; 9C3b) information describing the time separating marked prosodic events from the frame for which the coder parameters are being generated; 9C3c) information describing the time separating marked prosodic events from other marked prosodic events; 9C3d) information describing the number of prosodic events of some type in the time period separating a prosodic event of another type and the frame for which the coder parameters are being generated; 9C3e) information describing the number of prosodic events of one type occurring in the time period separating two prosodic events of another type; and 9C3f) at least two of 9C3a-9C3e; 9D) wherein the device is one of 9D1-9D4: 9D1) a microprocessor; 9D2) a digital signal processor; 9D3) an application specific integrated circuit; and 9D4) a combination of at least two of 9D1-9D3; 9E) wherein at least one of the devices for providing efficient generation of prosodically enhanced speech parameters is a neural network; 9F) wherein at least one of the devices for providing efficient generation of prosodically enhanced speech parameters is a decision tree unit; and 9G) wherein at least one of the devices for providing efficient generation of prosodically enhanced speech parameters is a unit that uses a genetic algorithm.
10. The text-to-speech system/speech synthesis system/dialog system of claim 8 wherein at least one of the devices generates speech parameters that are durations of speech segments associated with phones, and where selected, at least one of 10A-10B: 10A) the device that generates segment durations further provides the segment durations as supplementary input to an acoustic frame parameter generation device; and 10B) the prosodic information includes at least one of: 10B1) the locations of word endings and the degree of disjuncture between words; 10B2) the locations of pitch accents and the form of the pitch accents; and 10B3) the locations of boundaries marked in the pitch contour and the form of the boundaries, and where further selected in step 10B, wherein the format of the prosodic information is one of 10B3a-10B3f: 10B3a) information describing the proximity of marked prosodic events to defined segments surrounding the segment for which the duration is being computed;
10B3b) information describing the number of segments separating marked prosodic events from the segment for which the duration is being computed; 10B3c) information describing the number of segments separating marked prosodic events from other marked prosodic events; 10B3d) information describing the number of prosodic events of one type in the segments separating a prosodic event of another type and the segment for which the duration is being computed; 10B3e) information describing the number of prosodic events of one type occurring in the segments separating two prosodic events of another type; and 10B3f) at least two of 10B3a-10B3e.
GB9811008A 1997-05-22 1998-05-21 Method device and system for generating speech synthesis parameters from information including an explicit representation of intonation Expired - Fee Related GB2325599B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US86175197A 1997-05-22 1997-05-22

Publications (3)

Publication Number Publication Date
GB9811008D0 (en) 1998-07-22
GB2325599A (en) 1998-11-25
GB2325599B GB2325599B (en) 2000-01-26

Family

ID=25336655

Family Applications (1)

Application Number Title Priority Date Filing Date
GB9811008A Expired - Fee Related GB2325599B (en) 1997-05-22 1998-05-21 Method device and system for generating speech synthesis parameters from information including an explicit representation of intonation

Country Status (2)

Country Link
BE (1) BE1011892A3 (en)
GB (1) GB2325599B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001031434A2 (en) * 1999-10-28 2001-05-03 Siemens Aktiengesellschaft Method for detecting the time sequences of a fundamental frequency of an audio-response unit to be synthesised
WO2001057851A1 (en) * 2000-02-02 2001-08-09 Famoice Technology Pty Ltd Speech system
WO2001078063A1 (en) * 2000-04-12 2001-10-18 Siemens Aktiengesellschaft Method and device for the determination of prosodic markers
GB2590509A (en) * 2019-12-20 2021-06-30 Sonantic Ltd A text-to-speech synthesis method and system, a method of training a text-to-speech synthesis system, and a method of calculating an expressivity score

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1994007238A1 (en) * 1992-09-23 1994-03-31 Emerson & Stern Associates, Inc. Method and apparatus for speech synthesis
US5652828A (en) * 1993-03-19 1997-07-29 Nynex Science & Technology, Inc. Automated voice synthesis employing enhanced prosodic treatment of text, spelling of text and rate of annunciation
US5727120A (en) * 1995-01-26 1998-03-10 Lernout & Hauspie Speech Products N.V. Apparatus for electronically generating a spoken message
EP0831460A2 (en) * 1996-09-24 1998-03-25 Nippon Telegraph And Telephone Corporation Speech synthesis method utilizing auxiliary information
EP0833304A2 (en) * 1996-09-30 1998-04-01 Microsoft Corporation Prosodic databases holding fundamental frequency templates for use in speech synthesis
WO1998019297A1 (en) * 1996-10-30 1998-05-07 Motorola Inc. Method, device and system for generating segment durations in a text-to-speech system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU675389B2 (en) * 1994-04-28 1997-01-30 Motorola, Inc. A method and apparatus for converting text into audible signals using a neural network

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1994007238A1 (en) * 1992-09-23 1994-03-31 Emerson & Stern Associates, Inc. Method and apparatus for speech synthesis
US5652828A (en) * 1993-03-19 1997-07-29 Nynex Science & Technology, Inc. Automated voice synthesis employing enhanced prosodic treatment of text, spelling of text and rate of annunciation
US5727120A (en) * 1995-01-26 1998-03-10 Lernout & Hauspie Speech Products N.V. Apparatus for electronically generating a spoken message
EP0831460A2 (en) * 1996-09-24 1998-03-25 Nippon Telegraph And Telephone Corporation Speech synthesis method utilizing auxiliary information
EP0833304A2 (en) * 1996-09-30 1998-04-01 Microsoft Corporation Prosodic databases holding fundamental frequency templates for use in speech synthesis
WO1998019297A1 (en) * 1996-10-30 1998-05-07 Motorola Inc. Method, device and system for generating segment durations in a text-to-speech system

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001031434A2 (en) * 1999-10-28 2001-05-03 Siemens Aktiengesellschaft Method for detecting the time sequences of a fundamental frequency of an audio-response unit to be synthesised
WO2001031434A3 (en) * 1999-10-28 2002-02-14 Siemens Ag Method for detecting the time sequences of a fundamental frequency of an audio-response unit to be synthesised
US7219061B1 (en) 1999-10-28 2007-05-15 Siemens Aktiengesellschaft Method for detecting the time sequences of a fundamental frequency of an audio response unit to be synthesized
WO2001057851A1 (en) * 2000-02-02 2001-08-09 Famoice Technology Pty Ltd Speech system
WO2001078063A1 (en) * 2000-04-12 2001-10-18 Siemens Aktiengesellschaft Method and device for the determination of prosodic markers
US7409340B2 (en) 2000-04-12 2008-08-05 Siemens Aktiengesellschaft Method and device for determining prosodic markers by neural autoassociators
GB2590509A (en) * 2019-12-20 2021-06-30 Sonantic Ltd A text-to-speech synthesis method and system, a method of training a text-to-speech synthesis system, and a method of calculating an expressivity score
GB2590509B (en) * 2019-12-20 2022-06-15 Sonantic Ltd A text-to-speech synthesis method and system, and a method of training a text-to-speech synthesis system

Also Published As

Publication number Publication date
GB9811008D0 (en) 1998-07-22
BE1011892A3 (en) 2000-02-01
GB2325599B (en) 2000-01-26

Similar Documents

Publication Publication Date Title
Black et al. Generating F0 contours from ToBI labels using linear regression
Moberg Contributions to Multilingual Low-Footprint TTS System for Hand-Held Devices
US7233901B2 (en) Synthesis-based pre-selection of suitable units for concatenative speech
US7580839B2 (en) Apparatus and method for voice conversion using attribute information
US5913194A (en) Method, device and system for using statistical information to reduce computation and memory requirements of a neural network based speech synthesis system
EP0833304B1 (en) Prosodic databases holding fundamental frequency templates for use in speech synthesis
US6163769A (en) Text-to-speech using clustered context-dependent phoneme-based units
EP0689192A1 (en) A speech synthesis system
US20050119890A1 (en) Speech synthesis apparatus and speech synthesis method
Kuligowska et al. Speech synthesis systems: disadvantages and limitations
Bellegarda et al. Statistical prosodic modeling: from corpus design to parameter estimation
US5950162A (en) Method, device and system for generating segment durations in a text-to-speech system
Karaali et al. Speech synthesis with neural networks
Indumathi et al. Survey on speech synthesis
Conkie et al. Prosody recognition from speech utterances using acoustic and linguistic based models of prosodic events
US6178402B1 (en) Method, apparatus and system for generating acoustic parameters in a text-to-speech system using a neural network
Karaali et al. Text-to-speech conversion with neural networks: A recurrent TDNN approach
O'Shaughnessy Modern methods of speech synthesis
GB2325599A (en) Speech synthesis with prosody enhancement
Karaali et al. A high quality text-to-speech system composed of multiple neural networks
Kim et al. Unit Generation Based on Phrase Break Strength and Pruning for Corpus‐Based Text‐to‐Speech
Furtado et al. Synthesis of unlimited speech in Indian languages using formant-based rules
JP3060276B2 (en) Speech synthesizer
JPH10254471A (en) Voice synthesizer
JP3571925B2 (en) Voice information processing device

Legal Events

Date Code Title Description
PCNP Patent ceased through non-payment of renewal fee

Effective date: 20060521