GB2325599A - Speech synthesis with prosody enhancement - Google Patents
- Publication number
- GB2325599A GB9811008A
- Authority
- GB
- United Kingdom
- Prior art keywords
- prosodic
- information
- speech
- events
- marked
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/04—Details of speech synthesis systems, e.g. synthesiser structure or memory management
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Abstract
The method includes determining prosodic information, which describes the rhythm and intonation of the speech to be generated, from at least one of the style information and the focus information (528); and using a statistical system, e.g. a neural network (522), to convert information including the linguistic information and the prosodic information into speech parameters, which describe portions of speech waveforms.
Description
METHOD, DEVICE AND SYSTEM FOR GENERATING SPEECH
SYNTHESIS PARAMETERS FROM INFORMATION INCLUDING
AN EXPLICIT REPRESENTATION OF INTONATION
Field of the Invention
The present invention relates to parameter generating systems used in speech synthesis, and more particularly to intonation coding in coder parameter generating systems used in speech synthesis.
Background of the Invention
As shown in FIG. 1, numeral 100, to convert text to speech, statistical systems (102) typically convert linguistic (phonetic) representations of texts into parameters characterizing speech waveforms. The system illustrated in FIG. 1 uses two statistical systems (neural networks) (110 and 118). One neural network (110) converts linguistic descriptions of speech phones and their contexts into segment durations for segments of speech waveforms associated with the phones. The second neural network (118) converts linguistic descriptions of speech frames (portions of speech waveforms occurring during short time periods) into acoustic descriptions (120) of the frames. These are then converted into speech waveforms (124) using a vocoder (122).
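The two-stage pipeline described above can be sketched as follows. This is an illustrative toy, not the patent's implementation: the function names, the duration rule, and the F0 values are hypothetical stand-ins for the trained networks (110) and (118).

```python
# Hypothetical sketch of the two-stage statistical pipeline: network 1 maps
# phone contexts to segment durations, network 2 maps frame descriptions to
# acoustic parameters, which a vocoder would then turn into a waveform.

def duration_network(phone_context):
    """Stand-in for neural network (110): phone context -> duration (ms)."""
    # Toy rule: vowels get longer segments than consonants.
    return 120 if phone_context["phone"] in "aeiou" else 60

def acoustic_network(frame_description):
    """Stand-in for neural network (118): frame description -> acoustic params."""
    # Returns a toy (f0, energy) description for the frame.
    f0 = 120.0 + 10.0 * frame_description["position_in_segment"]
    return {"f0": f0, "energy": 0.5}

def synthesize(phones, frame_ms=10):
    frames = []
    for phone in phones:
        n_frames = duration_network({"phone": phone}) // frame_ms
        for i in range(n_frames):
            frames.append(acoustic_network({"position_in_segment": i / n_frames}))
    return frames  # a vocoder (122) would convert these to a waveform

frames = synthesize(["k", "a", "t"])
```

A real system would replace both stand-in functions with trained statistical models; the point here is only the data flow from phones to frame-level acoustic descriptions.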
One problem with statistical approaches based on linguistic information extracted from text is that the text does not contain enough information to generate prosody (rhythm and intonation) correctly in speech waveforms. It is known that, for any given text, there exists a plurality of intonational contours and rhythm patterns that can be generated for that text. The conversions performed by statistical systems are controlled by statistical descriptions of training data, typically consisting of a set of vectors comprising possible inputs to the system and the outputs desired when the possible inputs are presented to the system.
In statistical systems used for text-to-speech, the training data typically is generated by analysis of natural speech.
Because the intonational contours and rhythm patterns cannot be predicted from linguistic information extracted from texts, statistical systems tend to produce prosody that averages the possible contours and rhythms. This averaging of the contours and rhythms can make the speech less understandable.
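The averaging effect can be illustrated numerically. In this hypothetical example, the same text admits both a rising (question) and a falling (statement) F0 contour; the prediction that minimizes mean squared error over the ambiguous training data is their pointwise mean, which matches neither natural contour.

```python
# Illustration (not from the patent) of why MSE training averages prosody.

contour_question = [110, 130, 160, 200]   # rising intonation (Hz)
contour_statement = [160, 140, 120, 100]  # falling intonation (Hz)

def mse(pred, targets):
    """Mean squared error of one predicted contour against several targets."""
    errs = [(p - t) ** 2 for contour in targets for p, t in zip(pred, contour)]
    return sum(errs) / len(errs)

# The pointwise mean of the two training targets.
averaged = [(q + s) / 2 for q, s in zip(contour_question, contour_statement)]

# The averaged contour scores better under MSE against the ambiguous data
# than either genuine contour does -- yet it is neither a natural question
# nor a natural statement intonation.
assert mse(averaged, [contour_question, contour_statement]) < \
       mse(contour_question, [contour_question, contour_statement])
```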
Hence, there is a need for a method and device for improving the performance of a text-to-speech system in generating prosody.
Brief Description of the Drawings
FIG. 1 is a schematic representation of a text-to-speech system incorporating two statistical systems for generating speech parameters as is known in the art.
FIG. 2 is a schematic representation of a system for generating speech parameters in accordance with the present invention.
FIG. 3 is a schematic representation of the process of training a neural network in accordance with the existing technology.
FIG. 4 is a schematic representation of the process of training a neural network in accordance with the present invention.
FIG. 5 is a schematic representation of a text-to-speech system incorporating the present invention.
FIG. 6 is a flow chart of an embodiment of steps in accordance with the method of the present invention.
FIG. 7 is a schematic representation of a preferred embodiment of a device in accordance with the present invention.
Detailed Description of a Preferred Embodiment
The present invention provides a method, device and system for improving the prosody expressed by a speech parameter generating system by incorporating an explicit representation of the prosody as input to the speech parameter generating system. This improvement provides more realistic rhythm and intonation in speech, making the meaning of the speech more understandable than speech in prior art text-to-speech systems.
In a preferred embodiment, the speech parameter generating system of the present invention is a neural network producing a series of parameters or parameter vectors consisting of one or more parameters describing a predetermined aspect of the speech waveform. As shown in FIG.
2, numeral 200, certain parameters may be the acoustic frame descriptions (220) or the segment durations (206). FIG. 2 illustrates a preferred embodiment of a device in a system in accordance with the present invention. The device provides, in response to predetermined phonetic/linguistic information (222) and predetermined prosodic information (224), efficient generation of acoustic frame descriptions (220), i.e., prosodically enhanced speech parameters. The device includes: A) a segment duration computation unit (202), coupled to receive the phonetic/linguistic information and to a prosody generation unit (204), for converting the phonetic/linguistic information into segment durations (206);
B) the prosody generation unit (204), coupled to receive the predetermined prosodic information, for using the predetermined prosodic information to generate prosodic output (208); C) a frame description generation unit (210), coupled to the segment duration computation unit and the prosody generation unit and to receive the phonetic/linguistic information, for utilizing the segment durations (206), the prosodic output (208) and the phonetic/linguistic information (222) to generate linguistic/prosodic frame descriptions (212); and D) an acoustic description computation unit (214), coupled to the frame description generation unit (210), for computing acoustic frame descriptions (220) for the predetermined prosodic information (224) and the phonetic/linguistic information (222) to generate speech parameters that provide reliable prosodic performance. The parameters are typically suitable for use as input to a waveform synthesizer or generator (216) such as a vocoder, which generates a speech waveform (218).
In a preferred embodiment, the speech parameter generating device is a neural network producing a series of parameter vectors consisting of one or more parameters describing some aspect of the speech waveform. In FIG. 1, the parameters could be the acoustic frame descriptions (120) or the segment durations (112). FIG. 7, numeral 700, is a schematic representation of a preferred embodiment of a device in accordance with the present invention. As in existing speech parameter generating systems, this invention uses a statistical system (702) to convert information including linguistic information (704) derived from text into some parameters describing some aspect of the speech to be generated (706). However, unlike existing speech parameter generating systems, this invention includes a prosody determination unit (712), which converts information (710) about speaking style and focus into a prosodic description (708) of the speech to be generated. This prosodic description is also provided to the neural network to determine speech parameters.
In a preferred embodiment, the device provides, in response to information including linguistic information and at least one of: style information and focus information, efficient generation of prosodically enhanced speech parameters. The device includes: A) a prosody determination unit (712), coupled to receive at least one of the style information and the focus information, generating prosodic information, describing the rhythm and intonation of speech to be generated; and B) a statistical system (702), which in the preferred embodiment is a neural network, coupled to receive information including linguistic and prosodic information, for providing parameters describing portions of speech waveforms.
The information (710) about speaking style and focus may be provided by a dialog model or by user input. Alternatively, the style and focus determination may be made in advance for different sentence types, such as statements, questions, and commands.
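Such advance decisions per sentence type might be represented as a simple lookup table, sketched below; the style tags, field names, and values are illustrative assumptions, not taken from the patent.

```python
# Hypothetical advance style/focus decisions for different sentence types,
# with optional overrides from a dialog model or user input.

STYLE_BY_SENTENCE_TYPE = {
    "statement": {"style": "neutral", "final_boundary": "falling"},
    "question":  {"style": "neutral", "final_boundary": "rising"},
    "command":   {"style": "emphatic", "final_boundary": "falling"},
}

def style_and_focus(sentence_type, override=None):
    """Return style/focus info from the fixed table, then apply any override
    supplied by a dialog model or the user."""
    info = dict(STYLE_BY_SENTENCE_TYPE.get(sentence_type,
                                           STYLE_BY_SENTENCE_TYPE["statement"]))
    if override:
        info.update(override)
    return info
```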
Typically, as shown in FIG. 3, numeral 300, the method for generating neural networks for speech synthesis has consisted of training a neural network to predict speech parameters for a particular portion of the speech waveform.
Speech (302) is first recorded and stored in an audio database (304). The speech is first phonetically and syntactically labeled (phonetic/syntactic labeling unit, 306) and the label information stored in a phonetic database (308).
(Alternatively, the phonetic/syntactic labeling may be done manually.) The audio database (304) and/or the phonetic database (308) are then processed by the desired output computation unit (310) to extract parameters describing some portion of the recorded speech, used as training output (312).
For each training output value, the phonetic database (308) is processed by the input generation unit (314) to produce training input (316) for a neural network (318) that generates speech parameters (320). The neural network is then trained to generate a good approximation of the training output in response to the training input, using some criterion for goodness, such as the minimum mean squared difference between the neural network output and the training output.
The neural network in the preferred embodiment of the present invention is generated as shown in FIG. 4. Speech (402) is first recorded and stored in an audio database (404). The speech is first phonetically and syntactically labeled (phonetic/syntactic labeling unit, 406) and the label information stored in a phonetic database (408). The speech is then prosodically labeled (prosodic labeling unit, 410), indicating the actual prosody of the recorded speech, to create a prosodic database (412). (Alternatively, either the phonetic/syntactic labeling or the prosodic labeling or both may be performed manually.) The audio database (404) or the phonetic database (408) or both are then processed (desired output computation unit, 414) to extract parameters describing some portion of the recorded speech, used as training output (416). For each training output value, the phonetic database (408) and the prosodic database (412) are processed (input generation unit, 418) to produce training input (420) for a neural network (422). The neural network is then trained to generate a good approximation of the training output in response to the training input, using a predetermined criterion for goodness, such as the minimum mean squared difference between the neural network output and the training output.
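The training criterion described above (minimum mean squared difference between network output and training output) can be sketched with a one-layer linear model standing in for the neural network (422); the toy input features and duration targets are hypothetical.

```python
# Sketch of MSE training: stochastic gradient descent on a linear model.
# Each input vector would, in the invention, concatenate phonetic features
# with features from the prosodic database; here both are toy binary flags.

def train(inputs, targets, lr=0.01, epochs=2000):
    n = len(inputs[0])
    w = [0.0] * n
    b = 0.0
    for _ in range(epochs):
        for x, t in zip(inputs, targets):
            y = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = y - t  # gradient of 0.5 * (y - t)**2 with respect to y
            for i in range(n):
                w[i] -= lr * err * x[i]
            b -= lr * err
    return w, b

def predict(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

# Toy data: [is_vowel, pitch_accented] -> segment duration (ms).
inputs = [[1, 0], [1, 1], [0, 0], [0, 1]]
targets = [100.0, 140.0, 60.0, 80.0]
w, b = train(inputs, targets)
```

After training, the model approximates the targets in the mean-squared sense; an accented vowel is predicted longer than an unaccented one, mirroring how prosodic labels influence the learned durations.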
This is similar to the process used for training in the existing technology, except that the speech is prosodically labeled to generate the prosodic database (412) which is used in input generation (input generation unit, 418).
FIG. 5, numeral 500, is a schematic representation of a text-to-speech system incorporating the present invention.
This is similar to the text-to-speech system of FIG. 1, except that predetermined prosodic information (528), which is used to determine the prosody of the generated speech, is input in addition to the text (504). Text (504) is input to a text-to-linguistics conversion unit (506), which provides a linguistic (phonetic) description (510) to a segment duration computation unit (514), which is typically a neural network.
The segment duration computation unit (514) utilizes the linguistic (phonetic) description (510) and the prosodic output (512) to output segment durations (516). The predetermined prosodic information (528) may come from a dialog model or user selections, or simply be a set of arbitrary advance decisions concerning the way in which prosody will be generated. This information, along with information from the text, is used in prosody generation (prosody generation unit, 508) to produce the prosodic output (512) provided to the segment duration computation unit (514). This information is also provided to the frame description generation unit (518) to produce the linguistic/prosodic frame descriptions (520) provided to the acoustic description computation unit (522).
The segment duration computation unit (514) and the acoustic description computation unit (522) are examples of the current invention. The output of the acoustic description computation unit (522) is then input to a waveform generation unit (vocoder, 524), that generates a speech waveform (526).
The predetermined prosodic information (528) may be information about speaking style and focus, which is provided by a dialog model or by user input. Alternatively, the style and focus determination may be made in advance for different sentence types, such as statements, questions, and commands.
The speech parameter generating systems (514 and 522) may each be a neural network, decision tree, genetic algorithm or combination of two or more of these.
Existing statistical synthesis methods generate prosody of speech using only phonetic information and information that can be extracted from text. Since these methods tend to average the intonation contours and rhythm patterns that can occur in speech, producing unclear prosodic variation, the method of the present invention was developed to generate prosody using an explicit representation of prosody. Thus, the present invention provides more natural intonation and rhythm to the speech generated.
As shown in the steps set forth in FIG. 6, numeral 600, the method of the present invention provides, in response to information including linguistic information and at least one of: style information and focus information, efficient generation of prosodically enhanced speech parameters. In one embodiment, the method includes the steps of: A) determining (602) prosodic information that describes rhythm and intonation of speech to be generated, from at least one of: the style information and the focus information; and B) using (604) a statistical system to convert information including the linguistic information and the prosodic information into speech parameters, which describe portions of speech waveforms.
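Steps (602) and (604) can be sketched as follows. The feature encodings, rates, and F0 values are hypothetical illustrations: `determine_prosody` stands in for the prosody determination step (602) and `generate_parameters` for the statistical system of step (604).

```python
# Hypothetical sketch of the two method steps.

def determine_prosody(style=None, focus=None):
    """Step (602): style/focus information -> explicit rhythm and intonation
    description. Toy encoding: a speaking rate and accented positions."""
    prosody = {"rate": 1.0, "pitch_accents": []}
    if style == "emphatic":
        prosody["rate"] = 0.9  # slower, more deliberate rhythm
    if focus is not None:
        prosody["pitch_accents"].append(focus)  # accent the focused position
    return prosody

def generate_parameters(linguistic_info, prosody):
    """Step (604): linguistic + prosodic information -> speech parameters."""
    params = []
    for i, phone in enumerate(linguistic_info):
        duration = 80 * (1.0 / prosody["rate"])  # slower rate -> longer segments
        f0 = 140.0 if i in prosody["pitch_accents"] else 110.0
        params.append({"phone": phone, "duration_ms": duration, "f0": f0})
    return params

prosody = determine_prosody(style="emphatic", focus=1)
params = generate_parameters(["h", "e", "l"], prosody)
```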
The speech parameters are typically parameters suitable for use as input to a waveform synthesizer/generator, which may be used for synthesizing speech. Alternatively, the speech parameters may be segment durations.
Software that implements the present invention may be embedded in a microprocessor or a digital signal processor (DSP). Alternatively, the method may be implemented by an application specific integrated circuit (ASIC). Also, the method may be implemented by a combination of microprocessor, DSP and ASIC.
The predetermined prosodic information generally includes at least one of: A) locations of word endings and a degree of disjuncture between words; B) locations of pitch accents and a form of the pitch accents; and C) locations of boundaries marked in a pitch contour and a form of the boundaries. A format of the predetermined prosodic information is typically one of: A) information describing a proximity of marked prosodic events to defined frames surrounding a frame for which coder parameters are being generated; B) information describing a time separating marked prosodic events from a frame for which the coder parameters are being generated; C) information describing a time separating certain marked prosodic events from other marked prosodic events; D) information describing a number of prosodic events of one type in a time period separating a prosodic event of another type and a frame for which the coder parameters are being generated; E) information describing a number of prosodic events of one type occurring in a time period separating two prosodic events of another type; and F) at least two of formats A-E.
Where selected, the segment durations may be utilized as supplementary input to an acoustic frame parameter generation unit.
The speech parameters may be selected to be durations of speech segments associated with phones, and the method may be implemented utilizing software and/or hardware as described above. In this implementation, the predetermined prosodic information and the format of the predetermined prosodic information are as described above.
Where selected, at least one of the segment duration computation unit and the acoustic description computation unit may be a neural network, a decision tree unit, or a unit that uses a genetic algorithm.
In the preferred embodiment of the invention, the prosodic information used is derived from the TOBI labeling system developed by Silverman et al. (Silverman, K. et al.
"TOBI: A Standard for Labeling English Prosody", Proc. ICSLP 92, pp. 867-870, Banff, October 1992.) The time at which each word ends is marked along with a number (the break index) which indicates the degree of disjuncture between the marked word and the following word. Tones are also marked with an inventory of symbols for pitch accents on syllables and intonational boundary marks.
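A ToBI-style labeling of this kind might be stored as in the following sketch; the times, break indices, and tone labels are invented for illustration, though the label inventory (break indices, pitch accents such as H*, boundary tones such as L-L%) follows the cited standard.

```python
# Hypothetical storage for ToBI-style prosodic labels: word-end times with
# break indices, plus tone marks (pitch accents and boundary tones) at times.

word_boundaries = [
    # (end_time_s, break_index) -- the break index indicates the degree of
    # disjuncture between the marked word and the following word.
    (0.31, 1),   # "the"   -- weak juncture
    (0.74, 1),   # "quick"
    (1.20, 4),   # "fox."  -- full intonation-phrase boundary
]

tones = [
    # (time_s, label)
    (0.55, "H*"),    # pitch accent on "quick"
    (1.20, "L-L%"),  # phrase accent + boundary tone at the phrase end
]

def events_in_window(events, start, end):
    """Return the prosodic events whose time falls inside [start, end)."""
    return [e for e in events if start <= e[0] < end]
```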
A variety of representations are possible for the prosodic information provided as input to the statistical system. In a time-delay neural network representation, the input consists of a series of vectors, each vector representing the linguistic context during some sample time period near the portion of the waveform for which the system is generating speech parameters (the current portion of the speech waveform). In this case, different inputs may indicate the presence of a tone or break index mark in the sample time period, or the proximity of a mark to the sample time period. As an alternative to the time-delay representation, the input may indicate the distances between the current portion of the speech waveform and events that are marked in the prosodic information.
Additionally, distances between these events may be provided as input. The distances between two elements may be measured as the time period separating them, or as counts of the number of events of another type separating the two elements. For example, an input may represent the time between the current portion of the speech waveform and the preceding occurrence of an intonational boundary mark, while another may indicate the number of pitch accents with downstep (a narrowing of the pitch range) that occurred since the boundary. When the speech parameters are acoustic representations of speech frames suitable to be used as input to a waveform synthesizer, the preferred embodiment uses a combination of all these techniques. When the speech parameters are segment durations, the preferred embodiment does not use a time-delay representation to represent prosodic information.
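The distance and count inputs just described can be sketched for a single frame: the time since the preceding intonational boundary mark, and the number of downstepped pitch accents (marked with "!" in ToBI notation) since that boundary. The event times below are illustrative.

```python
# Hypothetical event lists for one utterance.
boundaries = [0.0, 1.2, 2.6]  # times of intonational boundary marks (s)
pitch_accents = [(0.4, "H*"), (1.6, "!H*"), (2.1, "!H*")]  # "!" = downstep

def prosodic_inputs(frame_time):
    """Compute two of the prosodic input features described in the text for
    the frame at frame_time."""
    prev_boundary = max(t for t in boundaries if t <= frame_time)
    downsteps = sum(
        1 for t, label in pitch_accents
        if prev_boundary <= t <= frame_time and label.startswith("!")
    )
    return {"time_since_boundary": frame_time - prev_boundary,
            "downsteps_since_boundary": downsteps}

feats = prosodic_inputs(2.3)
```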
Software implementing the method may be embedded in a microprocessor or a digital signal processor. Alternatively, an application specific integrated circuit may implement the method, or a combination of any of these implementations may be used. In a preferred embodiment, the speech parameter generating system utilizes neural networks. Alternatively, the speech parameter generating system may utilize decision tree units or genetic algorithms.
The portions of speech waveforms described may be segments associated with phonetic elements such as phones, and the parameters generated may be the durations of these segments.
The portions of speech waveforms described may be frames of speech, and the parameters generated may be an acoustic representation of these frames.
The present invention may be implemented by a device for providing, in response to information including phonetic and prosodic information, efficient generation of speech parameters. The device includes a statistical system, coupled to receive information including phonetic and prosodic information, for providing parameters describing portions of speech waveforms.
The portions of speech waveforms described may be segments associated with phonetic elements such as phones, and the parameters generated may be the durations of these segments.
The portions of speech waveforms described may be frames of speech, and the parameters generated may be an acoustic representation of these frames.
Again, the speech parameter generating system may include neural networks, decision tree units, or units that use a genetic algorithm.
The device of the present invention may be implemented in a text-to-speech system, a speech synthesis system, or a dialog system.
The present invention may be embodied in other specific forms without departing from its spirit or essential characteristics. The described embodiments are to be considered in all respects only as illustrative and not restrictive. The scope of the invention is, therefore, indicated by the appended claims rather than by the foregoing description. All changes which come within the meaning and range of equivalency of the claims are to be embraced within their scope.
We claim:
Claims (10)
1. A method for providing, in response to information including linguistic information and at least one of: style information and focus information, efficient generation of prosodically enhanced speech parameters, the method comprising the steps of:
A) determining prosodic information that describes rhythm and intonation of speech to be generated, from at least one of: the style information and the focus information; and
B) using a statistical system to convert information including the linguistic information and the prosodic information into speech parameters, which describe portions of speech waveforms.
2. The method of claim 1 wherein the speech parameters are parameters suitable to use as input to a waveform synthesizer/generator, and, where selected, at least one of 2A-2D:
2A) further including providing the speech parameters to a waveform synthesizer/generator to synthesize speech;
2B) wherein one of:
2B1) software implementing the method is embedded in a microprocessor;
2B2) software implementing the method is embedded in a digital signal processor;
2B3) the method is implemented by an application specific integrated circuit; and
2B4) the method is implemented by a combination of at least two of 2B1-2B3; and
2C) wherein the prosodic information includes at least one of:
2C1) the locations of word endings and the degree of disjuncture between words;
2C2) the locations of pitch accents and the form of the pitch accents; and
2C3) the locations of boundaries marked in the pitch contour and the form of the boundaries;
2D) and where selected in step 2C, wherein the format of the prosodic information is one of:
2D1) information describing the proximity of marked prosodic events to defined frames surrounding the frame for which the coder parameters are being generated;
2D2) information describing the time separating marked prosodic events from the frame for which the coder parameters are being generated;
2D3) information describing the time separating certain marked prosodic events from other marked prosodic events;
2D4) information describing the number of prosodic events of one type in the time period separating a prosodic event of another type and the frame for which the coder parameters are being generated;
2D5) information describing the number of prosodic events of one type occurring in the time period separating two prosodic events of another type; and
2D6) at least two of 2D1-2D5.
3. The method of claim 1 wherein the speech parameters are durations of speech segments associated with phones, and where selected, at least one of 3A-3C:
3A) further including providing the segment durations as supplementary input to an acoustic frame parameter generation method;
3B) wherein one of 3B1-3B4:
3B1) software implementing the method is embedded in a microprocessor;
3B2) software implementing the method is embedded in a digital signal processor;
3B3) the method is implemented by an application specific integrated circuit; and
3B4) the method is implemented by a combination of at least two of 3B1-3B3; and
3C) wherein the prosodic information includes at least one of 3C1-3C3:
3C1) the locations of word endings and the degree of disjuncture between words;
3C2) the locations of pitch accents and the form of the pitch accents; and
3C3) the locations of boundaries marked in the pitch contour and the form of the boundaries, and where selected, wherein the format of the prosodic information is one of 3C3a-3C3f:
3C3a) information describing the proximity of marked prosodic events to defined segments surrounding the segment for which the duration is being computed;
3C3b) information describing the number of segments separating marked prosodic events from the segment for which the duration is being computed;
3C3c) information describing the number of segments separating marked prosodic events from other marked prosodic events;
3C3d) information describing the number of prosodic events of some type in the segments separating a prosodic event of another type and the segment for which the duration is being computed;
3C3e) information describing the number of prosodic events of one type occurring in the segments separating two prosodic events of another type; and
3C3f) at least two of 3C3a-3C3e.
4. The method of claim 1 wherein at least one of 4A-4C:
4A) the statistical system is a neural network;
4B) the statistical system is a decision tree unit; and
4C) the statistical system is a unit that uses a genetic algorithm.
5. A device for providing, in response to information including linguistic information and at least one of: style information and focus information, efficient generation of prosodically enhanced speech parameters comprising: 5A) a prosody determination unit, coupled to receive at least one of the style information and the focus information, generating prosodic information, describing the rhythm and intonation of speech to be generated; and 5B) a statistical system, coupled to receive information including linguistic and prosodic information, for providing parameters describing portions of speech waveforms.
6. The device of claim 5 wherein at least one of 6A-6C:
6A) the speech parameters are parameters suitable to use as input to a waveform synthesizer/generator, and where selected, one of 6A1-6A2:
6A1) wherein the device is one of:
6A1a) a microprocessor;
6A1b) a digital signal processor;
6A1c) an application specific integrated circuit; and
6A1d) a combination of at least two of 6A1a-6A1c;
6A2) wherein the prosodic information includes at least one of 6A2a-6A2c:
6A2a) the locations of word endings and the degree of disjuncture between words;
6A2b) the locations of pitch accents and the form of the pitch accents; and
6A2c) the locations of boundaries marked in the pitch contour and the form of the boundaries, and where selected for step 6A2, wherein the format of the prosodic information is one of 6A2d-6A2i:
6A2d) information describing the proximity of marked prosodic events to defined segments surrounding the segment for which the duration is being computed;
6A2e) information describing the number of segments separating marked prosodic events from the segment for which the duration is being computed;
6A2f) information describing the number of segments separating marked prosodic events from other marked prosodic events;
6A2g) information describing the number of prosodic events of some type in the segments separating a prosodic event of another type and the segment for which the duration is being computed;
6A2h) information describing the number of prosodic events of one type occurring in the segments separating two prosodic events of another type; and
6A2i) at least two of 6A2d-6A2h;
6B) wherein the prosodic information includes at least one of 6B1-6B3:
6B1) the locations of word endings and the degree of disjuncture between words;
6B2) the locations of pitch accents and the form of the pitch accents; and
6B3) the locations of boundaries marked in the pitch contour and the form of the boundaries, and wherein, where selected, a format of the prosodic information is one of 6B4-6B9:
6B4) information describing the proximity of marked prosodic events to defined frames surrounding the frame for which the coder parameters are being generated;
6B5) information describing the time separating marked prosodic events from the frame for which the coder parameters are being generated;
6B6) information describing the time separating marked prosodic events from other marked prosodic events;
6B7) information describing the number of prosodic events of some type in the time period separating a prosodic event of another type and the frame for which the coder parameters are being generated;
6B8) information describing the number of prosodic events of one type occurring in the time period separating two prosodic events of another type; and
6B9) at least two of 6B4-6B8; and
6C) wherein the speech parameters are durations of speech segments associated with phones, and where selected, further including an acoustic frame parameter generation unit coupled to receive the durations of speech segments.
7. The device of claim 5 wherein at least one of 7A-7E:
7A) the speech parameter generating system further provides the speech parameters to a waveform synthesizer/generator for synthesizing speech;
7B) wherein the device is one of 7B1-7B4:
7B1) a microprocessor;
7B2) a digital signal processor;
7B3) an application specific integrated circuit; and
7B4) a combination of at least two of 7B1-7B3;
7C) wherein the statistical system is a neural network;
7D) wherein the statistical system is a decision tree unit; and
7E) wherein the statistical system is a unit that uses a genetic algorithm.
8. A text-to-speech system/speech synthesis system/dialog system having at least one device for providing, in response to information including linguistic information and at least one of: style information and focus information, efficient generation of speech parameters each device comprising:
8A) a prosody determination unit, coupled to receive at least one of: the style information and the focus information, that generates prosodic information that describes the rhythm and intonation of speech to be generated; and
8B) a statistical system, coupled to receive information including the linguistic information and the prosodic information, for providing speech parameters that describe portions of speech waveforms.
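The claim-8 data flow, a prosody determination unit (8A) feeding a statistical system (8B), can be sketched as follows. Both classes are hypothetical stand-ins: per the other claims the statistical system is a neural network, decision tree, or genetic-algorithm unit, but here a fixed rule keeps the example runnable:

```python
class ProsodyUnit:
    """Stand-in for the claim-8A prosody determination unit: maps style
    and focus inputs to a toy prosodic description (pitch-accent
    locations and a final boundary)."""

    def determine(self, style: str, focus_words: set, words: list) -> dict:
        # style is accepted but unused in this toy sketch; a real unit
        # would vary the prosodic description with it.
        return {
            "accents": [i for i, w in enumerate(words) if w in focus_words],
            "boundary": len(words) - 1,  # boundary on the final word
        }


class StatisticalSystem:
    """Stand-in for the claim-8B statistical system; a fixed rule is
    used so the data flow is visible, not a trained model."""

    def parameters(self, linguistic: list, prosodic: dict) -> list:
        base = 200.0  # placeholder base duration in milliseconds
        out = []
        for i, _ in enumerate(linguistic):
            d = base
            if i in prosodic["accents"]:
                d *= 1.25  # lengthen accented words
            if i == prosodic["boundary"]:
                d *= 1.5   # pre-boundary lengthening
            out.append(d)
        return out
```

The returned durations would then feed a waveform synthesizer/generator (claim 7A) or an acoustic frame parameter generation unit (claim 10A).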
9. The text-to-speech system/speech synthesis system/dialog system of claim 8 wherein at least one of 9A-9G:
9A) the devices provides speech parameters that are parameters suitable to use as input to a waveform coder, and where selected, wherein the device that produces speech parameters that are parameters suitable to use as input to a waveform coder further provides the speech parameters to a waveform synthesizer to synthesize speech;
9B) wherein the device is one of:
9B1) a microprocessor;
9B2) a digital signal processor;
9B3) an application specific integrated circuit; and
9B4) a combination of at least two of 9B1-9B3;
9C) wherein the prosodic information includes at least one of 9C1-9C3:
9C1) the locations of word endings and the degree of disjuncture between words;
9C2) the locations of pitch accents and the form of the pitch accents; and
9C3) the locations of boundaries marked in the pitch contour and the form of the boundaries, and where selected, for step 9C, wherein the format of the prosodic information is one of 9C3a-9C3f:
9C3a) information describing the proximity of marked prosodic events to defined frames surrounding the frame for which the coder parameters are being generated;
9C3b) information describing the time separating marked prosodic events from the frame for which the coder parameters are being generated;
9C3c) information describing the time separating marked prosodic events from other marked prosodic events;
9C3d) information describing the number of prosodic events of some type in the time period separating a prosodic event of another type and the frame for which the coder parameters are being generated;
9C3e) information describing the number of prosodic events of one type occurring in the time period separating two prosodic events of another type; and
9C3f) at least two of 9C3a-9C3e;
9D) wherein the device is one of 9D1-9D4:
9D1) a microprocessor;
9D2) a digital signal processor;
9D3) an application specific integrated circuit; and
9D4) a combination of at least two of 9D1-9D3;
9E) wherein at least one of the devices for providing efficient generation of prosodically enhanced speech parameters is a neural network;
9F) wherein at least one of the devices for providing efficient generation of prosodically enhanced speech parameters is a decision tree unit; and
9G) wherein at least one of the devices for providing efficient generation of prosodically enhanced speech parameters is a unit that uses a genetic algorithm.
10. The text-to-speech system/speech synthesis system/dialog system of claim 8 wherein at least one of the devices generates speech parameters that are durations of speech segments associated with phones, and where selected, at least one of 10A-10B:
10A) the device that generates segment durations further provides the segment durations as supplementary input to an acoustic frame parameter generation device; and
10B) the prosodic information includes at least one of:
10B1) the locations of word endings and the degree of disjuncture between words;
10B2) the locations of pitch accents and the form of the pitch accents; and
10B3) the locations of boundaries marked in the pitch contour and the form of the boundaries, and where further selected in step 10B, wherein the format of the prosodic information is one of 10B3a-10B3f:
10B3a) information describing the proximity of marked prosodic events to defined segments surrounding the segment for which the duration is being computed;
10B3b) information describing the number of segments separating marked prosodic events from the segment for which the duration is being computed;
10B3c) information describing the number of segments separating marked prosodic events from other marked prosodic events;
10B3d) information describing the number of prosodic events of one type in the segments separating a prosodic event of another type and the segment for which the duration is being computed;
10B3e) information describing the number of prosodic events of one type occurring in the segments separating two prosodic events of another type; and
10B3f) at least two of 10B3a-10B3e.
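Claim 10B3 restates the claim-6 feature formats with distances measured in segments rather than in time. The following is a hypothetical sketch of the 10B3b- and 10B3d-style counts, assuming events arrive as (segment_index, kind) pairs; the function and feature names are invented for illustration:

```python
def duration_features(seg_idx: int, events: list) -> dict:
    """Segment-count analogues of the 10B3b and 10B3d formats.
    events: list of (segment_index, kind) pairs; distances are
    counted in segments, not seconds."""
    feats = {}
    kinds = sorted({k for _, k in events})
    # 10B3b-style: number of segments separating the nearest marked
    # event of each type from the segment whose duration is computed.
    for k in kinds:
        idxs = [i for i, kk in events if kk == k]
        feats[f"segdist_{k}"] = min(abs(i - seg_idx) for i in idxs)
    # 10B3d-style: number of events of one type in the segments
    # separating the target segment from the nearest event of
    # another type.
    for k_ref in kinds:
        idxs = [i for i, kk in events if kk == k_ref]
        nearest = min(idxs, key=lambda i: abs(i - seg_idx))
        lo, hi = sorted((nearest, seg_idx))
        for k in kinds:
            if k == k_ref:
                continue
            feats[f"n_{k}_to_{k_ref}"] = sum(
                1 for i, kk in events if kk == k and lo < i < hi
            )
    return feats
```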
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US86175197A | 1997-05-22 | 1997-05-22 |
Publications (3)
Publication Number | Publication Date |
---|---|
GB9811008D0 GB9811008D0 (en) | 1998-07-22 |
GB2325599A true GB2325599A (en) | 1998-11-25 |
GB2325599B GB2325599B (en) | 2000-01-26 |
Family
ID=25336655
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
GB9811008A Expired - Fee Related GB2325599B (en) | 1997-05-22 | 1998-05-21 | Method device and system for generating speech synthesis parameters from information including an explicit representation of intonation |
Country Status (2)
Country | Link |
---|---|
BE (1) | BE1011892A3 (en) |
GB (1) | GB2325599B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO1994007238A1 (en) * | 1992-09-23 | 1994-03-31 | Emerson & Stern Associates, Inc. | Method and apparatus for speech synthesis |
US5652828A (en) * | 1993-03-19 | 1997-07-29 | Nynex Science & Technology, Inc. | Automated voice synthesis employing enhanced prosodic treatment of text, spelling of text and rate of annunciation |
US5727120A (en) * | 1995-01-26 | 1998-03-10 | Lernout & Hauspie Speech Products N.V. | Apparatus for electronically generating a spoken message |
EP0831460A2 (en) * | 1996-09-24 | 1998-03-25 | Nippon Telegraph And Telephone Corporation | Speech synthesis method utilizing auxiliary information |
EP0833304A2 (en) * | 1996-09-30 | 1998-04-01 | Microsoft Corporation | Prosodic databases holding fundamental frequency templates for use in speech synthesis |
WO1998019297A1 (en) * | 1996-10-30 | 1998-05-07 | Motorola Inc. | Method, device and system for generating segment durations in a text-to-speech system |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
AU675389B2 (en) * | 1994-04-28 | 1997-01-30 | Motorola, Inc. | A method and apparatus for converting text into audible signals using a neural network |
1998
- 1998-04-27 BE BE9800314A patent/BE1011892A3/en not_active IP Right Cessation
- 1998-05-21 GB GB9811008A patent/GB2325599B/en not_active Expired - Fee Related
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2001031434A2 (en) * | 1999-10-28 | 2001-05-03 | Siemens Aktiengesellschaft | Method for detecting the time sequences of a fundamental frequency of an audio-response unit to be synthesised |
WO2001031434A3 (en) * | 1999-10-28 | 2002-02-14 | Siemens Ag | Method for detecting the time sequences of a fundamental frequency of an audio-response unit to be synthesised |
US7219061B1 (en) | 1999-10-28 | 2007-05-15 | Siemens Aktiengesellschaft | Method for detecting the time sequences of a fundamental frequency of an audio response unit to be synthesized |
WO2001057851A1 (en) * | 2000-02-02 | 2001-08-09 | Famoice Technology Pty Ltd | Speech system |
WO2001078063A1 (en) * | 2000-04-12 | 2001-10-18 | Siemens Aktiengesellschaft | Method and device for the determination of prosodic markers |
US7409340B2 (en) | 2000-04-12 | 2008-08-05 | Siemens Aktiengesellschaft | Method and device for determining prosodic markers by neural autoassociators |
GB2590509A (en) * | 2019-12-20 | 2021-06-30 | Sonantic Ltd | A text-to-speech synthesis method and system, a method of training a text-to-speech synthesis system, and a method of calculating an expressivity score |
GB2590509B (en) * | 2019-12-20 | 2022-06-15 | Sonantic Ltd | A text-to-speech synthesis method and system, and a method of training a text-to-speech synthesis system |
Also Published As
Publication number | Publication date |
---|---|
GB9811008D0 (en) | 1998-07-22 |
BE1011892A3 (en) | 2000-02-01 |
GB2325599B (en) | 2000-01-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Black et al. | Generating F0 contours from ToBI labels using linear regression | |
Moberg | Contributions to Multilingual Low-Footprint TTS System for Hand-Held Devices | |
US7233901B2 (en) | Synthesis-based pre-selection of suitable units for concatenative speech | |
US7580839B2 (en) | Apparatus and method for voice conversion using attribute information | |
US5913194A (en) | Method, device and system for using statistical information to reduce computation and memory requirements of a neural network based speech synthesis system | |
EP0833304B1 (en) | Prosodic databases holding fundamental frequency templates for use in speech synthesis | |
US6163769A (en) | Text-to-speech using clustered context-dependent phoneme-based units | |
EP0689192A1 (en) | A speech synthesis system | |
US20050119890A1 (en) | Speech synthesis apparatus and speech synthesis method | |
Kuligowska et al. | Speech synthesis systems: disadvantages and limitations | |
Bellegarda et al. | Statistical prosodic modeling: from corpus design to parameter estimation | |
US5950162A (en) | Method, device and system for generating segment durations in a text-to-speech system | |
Karaali et al. | Speech synthesis with neural networks | |
Indumathi et al. | Survey on speech synthesis | |
Conkie et al. | Prosody recognition from speech utterances using acoustic and linguistic based models of prosodic events | |
US6178402B1 (en) | Method, apparatus and system for generating acoustic parameters in a text-to-speech system using a neural network | |
Karaali et al. | Text-to-speech conversion with neural networks: A recurrent TDNN approach | |
O'Shaughnessy | Modern methods of speech synthesis | |
GB2325599A (en) | Speech synthesis with prosody enhancement | |
Karaali et al. | A high quality text-to-speech system composed of multiple neural networks | |
Kim et al. | Unit Generation Based on Phrase Break Strength and Pruning for Corpus‐Based Text‐to‐Speech | |
Furtado et al. | Synthesis of unlimited speech in Indian languages using formant-based rules | |
JP3060276B2 (en) | Speech synthesizer | |
JPH10254471A (en) | Voice synthesizer | |
JP3571925B2 (en) | Voice information processing device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PCNP | Patent ceased through non-payment of renewal fee |
Effective date: 20060521 |