EP0107945B1 - Speech synthesizing apparatus - Google Patents


Info

Publication number
EP0107945B1
Authority
EP
European Patent Office
Prior art keywords
data
vowel
parameter data
consonant
speech
Prior art date
Legal status
Expired
Application number
EP83306228A
Other languages
German (de)
French (fr)
Other versions
EP0107945A1 (en)
Inventor
Tsuneo Nitta
Norimasa Nomura
Kazuo Sumita
Current Assignee
Toshiba Corp
Original Assignee
Toshiba Corp
Priority date
Filing date
Publication date
Application filed by Toshiba Corp
Publication of EP0107945A1
Application granted
Publication of EP0107945B1


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Definitions

  • the control circuit 16 serves to supply the consonant segment address data and vowel segment address data to the consonant segment file 10 and the vowel segment file 12, respectively, in accordance with the phoneme string data from the phoneme converting circuit 14. At the same time, the control circuit 16 writes the time data corresponding to the time duration of a vowel to be generated and the accent data from the phoneme converting circuit 14 into a random access memory (RAM) 16A.
  • the segment address data are determined in accordance with not only the phoneme data indicative of the monosyllable, but also the phoneme data representing a succeeding monosyllable from the phoneme converting circuit 14, for example.
  • the speech characteristic parameter data from the consonant segment file 10 is supplied to a first input port of an interpolation circuit 18, while the speech characteristic parameter data from the vowel segment file 12 is supplied to a second input port of the interpolation circuit 18 and to a repetition circuit 20.
  • the interpolation circuit 18 calculates a predetermined number of speech characteristic parameter data on the basis of the speech characteristic parameter data indicative of the consonant segment which is constituted by the power spectrum of three frames from the consonant segment file 10 and the speech characteristic parameter data indicative of the vowel segment of the power spectrum of one frame from the vowel segment file 12.
  • the calculated speech parameter data respectively represent a corresponding number of vowel segments each having the spectrum of one frame and interpolated between the input consonant and vowel segments.
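The operation of the interpolation circuit 18 can be sketched as linear interpolation between the final consonant parameter vector and the vowel parameter vector. The Python sketch below is illustrative only: the patent does not specify the interpolation formula, and representing a one-frame spectrum as a plain list of coefficients is an assumption.

```python
def interpolate_frames(last_consonant, vowel, n_interp=4):
    # Linearly interpolate n_interp parameter frames between the final
    # consonant frame and the steady-state vowel frame, endpoints excluded.
    frames = []
    for k in range(1, n_interp + 1):
        t = k / (n_interp + 1)
        frames.append([(1 - t) * c + t * v for c, v in zip(last_consonant, vowel)])
    return frames

# Toy two-coefficient parameter vectors standing in for one-frame spectra.
transition = interpolate_frames([1.0, 0.2], [0.2, 1.0])
print(len(transition))  # 4 interpolated frames
```

With the default of four interpolated frames, this corresponds to the fifth to eighth parameter data computed between the three consonant frames and the single vowel frame.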
  • the repetition circuit 20 repeatedly fetches from the vowel segment file 12 the speech characteristic parameter data by the number of frames corresponding to the vowel time duration data stored in the RAM 16A.
  • the speech characteristic parameter data from the interpolation circuit 18 and repetition circuit 20 are supplied through a switch 24 to a buffer register 22 in this order.
  • the speech characteristic parameter data from this buffer register 22 is supplied to an interpolation circuit 26.
  • This interpolation circuit 26 interpolates a predetermined number of speech characteristic parameter data between these two speech characteristic parameter data on the basis of the speech characteristic parameter data of the successive two frames from the buffer register 22.
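The operation of the interpolation circuit 26, which inserts interpolated frames between every pair of successive frames taken from the buffer register, might be sketched as follows. The linear interpolation and the number of inserted frames are assumptions; the patent only states that a predetermined number of parameter data are interpolated.

```python
def smooth_series(frames, n_between):
    # Insert n_between linearly interpolated frames between every pair of
    # successive frames in the buffered parameter series.
    out = []
    for a, b in zip(frames, frames[1:]):
        out.append(a)
        for k in range(1, n_between + 1):
            t = k / (n_between + 1)
            out.append([(1 - t) * x + t * y for x, y in zip(a, b)])
    out.append(frames[-1])
    return out

smoothed = smooth_series([[0.0], [3.0], [6.0]], n_between=2)
print(len(smoothed))  # 7 frames: 3 originals plus 2 between each pair
```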
  • the speech characteristic parameter data from this interpolation circuit 26 are sequentially supplied to a speech synthesizer 28.
  • This speech synthesizer 28 sequentially filter-processes the speech characteristic parameter data from the interpolation circuit 26 according to the pitch period data generated from a pitch generation circuit 30 in accordance with the accent data of the RAM 16A, and then generates a speech signal.
  • the phoneme converting circuit 14 supplies the phoneme string data and accent data to the control circuit 16 in accordance with the input character code series.
  • This control circuit 16 writes the time length data representing the time duration of a vowel to be generated and the pitch data regarding a speech generating pitch in the RAM 16A on the basis of the phoneme data and accent data from the phoneme converting circuit 14, respectively.
  • the control circuit 16 supplies the consonant segment address data and vowel segment address data corresponding to the phoneme string data from the phoneme converting circuit 14 to the consonant segment file 10 and the vowel segment file 12, respectively.
  • the control circuit 16 simultaneously generates the switch control signal to set the switch 24 into the first switching position.
  • the control circuit 16 supplies the consonant and vowel segment address data corresponding to consonant segment [g] and vowel segment [o] to the consonant and vowel segment files 10 and 12, respectively, on the basis of the phoneme data corresponding to the two successive monosyllables of [goma] generated from the phoneme converting circuit 14. Due to this, the first to third speech characteristic parameter data corresponding to the power spectra of three frames indicative of consonant segment [g] in Fig. 9 are read out from the consonant segment file 10.
  • the fourth speech characteristic parameter data corresponding to the power spectrum of one frame indicative of vowel [o] is read out from vowel segment file 12.
  • the interpolation circuit 18 calculates the fifth to eighth speech characteristic parameter data indicative of the power spectrum of a predetermined number of frames, in this example, four frames between consonant segment [g] and vowel segment [o] shown in Fig. 9, on the basis of the third speech characteristic parameter data read out from the consonant segment file 10 and the fourth speech characteristic parameter data read out from the vowel segment file 12.
  • this interpolation circuit 18 supplies the 1st to 3rd speech characteristic parameter data from the consonant segment file 10, the 5th to 8th speech characteristic parameter data thus calculated, and the 4th speech characteristic parameter data from the vowel segment file 12 to the buffer register 22 through the switch 24 in this order in response to the interpolation control signal from the control circuit 16.
  • the switch 24 is set into the second switching position by the switching control signal from the control circuit 16.
  • the control circuit 16 then supplies the control pulses of the number corresponding to the vowel time duration data stored in the RAM 16A to the repetition circuit 20 and through an OR gate 32 to the buffer register 22.
  • the repetition circuit 20 fetches the speech characteristic parameter data from the vowel segment file 12 a corresponding number of times in response to the control pulses from the control circuit 16, and sequentially supplies them to the buffer register 22.
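The series assembled in the buffer register 22 for a monosyllable such as [go] can be sketched end to end: the consonant frames, the interpolated transition frames, the vowel frame, and then the vowel frame repeated for the requested duration. The function, the linear interpolation, and the toy parameter vectors below are illustrative assumptions, not taken from the patent.

```python
def build_parameter_series(consonant_frames, vowel_frame, n_interp, n_repeat):
    # Consonant frames, then interpolated transition frames, then the vowel
    # frame, then the vowel frame repeated for the requested duration
    # (the role of the repetition circuit 20 in the embodiment).
    series = list(consonant_frames)
    last = consonant_frames[-1]
    for k in range(1, n_interp + 1):
        t = k / (n_interp + 1)
        series.append([(1 - t) * c + t * v for c, v in zip(last, vowel_frame)])
    series.append(list(vowel_frame))
    series.extend([list(vowel_frame)] * n_repeat)
    return series

# [g] as three toy frames and [o] as one frame, loosely mirroring Fig. 9.
g_frames = [[0.9, 0.1], [0.7, 0.3], [0.5, 0.4]]
o_frame = [0.1, 1.0]
series = build_parameter_series(g_frames, o_frame, n_interp=4, n_repeat=5)
print(len(series))  # 3 + 4 + 1 + 5 = 13 frames
```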
  • As shown in Fig. 9, the speech characteristic parameter data representing power spectra similar to those shown in Fig. 7B are stored in the buffer register 22.
  • the power spectra shown by the solid lines indicate the power spectra corresponding to the speech characteristic parameter data read out from the consonant and vowel segment files 10 and 12, and the power spectra shown by the broken lines represent the power spectra calculated by the interpolation circuit 18 and the power spectra generated from the repetition circuit 20.
  • the control circuit 16 supplies the interpolation control signal through the OR gate 32 to the buffer register 22 and also supplies the interpolation control signal to the interpolation circuit 26, thereby allowing the speech characteristic parameter data in the buffer register 22 to be sequentially sent to the interpolation circuit 26.
  • the interpolation circuit 26 then creates a predetermined number of interpolated speech characteristic parameter data on the basis of the speech characteristic parameter data of the successive two frames sent from the buffer register 22 and sequentially supplies them to the speech synthesizer 28.
  • the control circuit 16 simultaneously reads out the accent data stored in the RAM 16A and supplies it to the pitch generation circuit 30, thereby allowing this pitch generation circuit 30 to generate the pitch period data.
  • the speech synthesizer 28 synthesizes the speech signal including the pitch information in accordance with the speech characteristic parameter data from the interpolation circuit 26 and the pitch period data from the pitch generation circuit 30 and then generates the synthesized speech signal.
  • the repetition circuit 20 is constituted in such a manner that it fetches the vowel characteristic parameter data from the vowel segment file 12 in response to the control pulses from the control circuit 16.
  • this repetition circuit 20 may be modified such that a high-level signal is generated from the control circuit 16 over the period of time corresponding to the time length data, and that the repetition circuit 20 fetches the vowel characteristic parameter data at a fixed interval from the vowel segment file 12 in response to this high-level signal.
  • Although in the above embodiment the vowel characteristic parameter data each of which represents a one-frame power spectrum have been stored in the vowel segment file 12, vowel characteristic parameter data each of which represents a plurality of power spectra can also be stored in this vowel segment file.

Description

  • This invention relates to a speech synthesizing apparatus for synthesizing speech in accordance with input character strings.
  • Recently, various speech synthesizing apparatuses for synthesizing speech on the basis of sentence data applied as character strings have become known. For example, in an apparatus for synthesizing speech by rule, speech segments of predetermined units are registered in advance as acoustic parameter data in a speech segment file, and the corresponding acoustic parameter data is selectively read out from this speech segment file in accordance with the input phoneme data string. The speech data is then synthesized on the basis of the acoustic parameter data thus read out, in accordance with a predetermined synthesizing rule. Because the speech is synthesized in accordance with a predetermined synthesizing rule, such a speech synthesizing apparatus can generate a desired sentence at a desired speaking speed.
  • This apparatus for synthesizing speech by rule is mainly divided, for example, into a V-C-V synthesizing apparatus using a chain consisting of vowel, consonant and vowel as a speech segment of one unit, and a C-V synthesizing apparatus using a monosyllable consisting of consonant and vowel as a speech segment of one unit in dependence upon the format of the speech segment to be registered in the speech segment file. Reference characters V and C used herein represent a vowel segment and a consonant segment, respectively.
  • Fig. 1 is a schematic block diagram of a conventional speech synthesizing apparatus. This speech synthesizing apparatus includes a phoneme converting circuit 2 for converting input character code string into phoneme data string including accent information in accordance with predetermined phoneme conversion rule and accent rule, a speech segment file 4 in which a plurality of speech segments in the form of monosyllable have been stored, an interpolating circuit 6 which sequentially reads out the speech characteristic parameter data of the corresponding speech segment from the speech segment file 4 in accordance with the phoneme data string from the phoneme converting circuit 2 and then interpolates these speech characteristic parameter data, and a speech synthesizer circuit 8 for generating speech data by filter-processing the parameter data from this interpolating circuit 6.
  • In the apparatus for synthesizing speech data by rules of this kind, phonemes must of course be converted with high accuracy to obtain more natural speech with high quality, but it is also required to obtain speech characteristic parameters which represent, with high fidelity, the characteristics of the speech generated by a human being. For example, when speech is continuously generated, there may be a case where a certain monosyllable in this speech is coarticulated by the monosyllables before and after it. When a monosyllable formed of a consonant-vowel (C1-V1) syllable is independently generated, the acoustic energy pattern (speech characteristic parameter) of the speech segment of this monosyllable exhibits the inherent characteristics of the consonant C1 and vowel V1 with high fidelity, as schematically shown in Fig. 2. However, in the case where this monosyllable is successively generated together with other monosyllables, the acoustic energy pattern (speech characteristic parameter) of the speech segment of the C1-V1 monosyllable will be changed as shown in Figs. 3A and 3B in dependence upon, for example, whether the subsequent monosyllable is a C2-V2 syllable or a C3-V3 syllable. In other words, this monosyllable is coarticulated by the subsequent C2-V2 monosyllable and is changed to a C11-V11 monosyllable, or it is coarticulated by the subsequent C3-V3 monosyllable and is changed to a C12-V12 monosyllable. Therefore, in order to generate speech which is more natural, has high quality and is as similar as possible to the speech actually generated by a human being, it is required to generate the speech in consideration of the coarticulation between successive speech segments. However, with a conventional speech synthesizing apparatus, only unnatural speech is obtained, because it generates speech by simply coupling the phonemes regardless of the influence of coarticulation.
  • EP-A-58130 discloses discrete sound elements corresponding to consonant portions, steady-state vowel portions and transition elements. However, in this prior art, transition elements are composed of a combination of a consonant portion and a coarticulated vowel, and it is thus necessary to prepare a large number of such transition elements in order to synthesize natural speech.
  • It is an object of the present invention to provide a speech synthesizing apparatus for synthesizing clear and natural speech.
  • According to the invention, there is provided a speech synthesizing apparatus comprising a data generation circuit for generating phoneme string data; memory means in which consonant and vowel characteristic parameter data representative of consonant and vowel segments are stored and which has a consonant segment file in which a plurality of consonant characteristic parameter data representative of a plurality of consonant segments, each of which has a consonant portion and a transient segment to a vowel segment, are stored, and a vowel segment file in which a plurality of vowel characteristic parameter data representative of a plurality of steady-state vowel segments are stored; control means for allowing the corresponding consonant and vowel characteristic parameter data to be generated from said memory means in accordance with said phoneme string data; and synthesizing means for synthesizing a speech signal on the basis of said consonant and vowel characteristic parameter data from said memory means; and including a parameter data series generation circuit for generating a series of consonant and vowel characteristic parameter data on the basis of the consonant and vowel characteristic parameter data from said consonant and vowel segment files, and a synthesis circuit for synthesizing the speech signal on the basis of the parameter data series from said parameter data series generation circuit, characterized in that said vowel segment file further stores a plurality of vowel characteristic parameter data representative of a plurality of coarticulated vowel segments, each of said steady-state and coarticulated vowel segments being formed of one frame parameter data, said control means generates time length data indicative of a vowel duration length in accordance with the phoneme string data from said data generation circuit, and said parameter data series generation circuit includes a repetition circuit which derives the vowel characteristic parameter data from said vowel segment file the number of times corresponding to said time length data.
  • In the described embodiment, each consonant characteristic parameter data stored in the consonant segment file represents the consonant segment including a consonant portion and a transient segment to the vowel segment; therefore, it is possible to easily obtain the interpolated characteristic parameter data between this consonant characteristic parameter data and the succeeding vowel characteristic parameter data read out from the vowel segment file, thereby making it possible to clearly and naturally synthesize a speech even for a coarticulated monosyllable.
  • An embodiment of the invention will now be described, by way of example, with reference to the accompanying drawings, in which:
    • Fig. 1 is a schematic block diagram of a conventional speech synthesizing apparatus;
    • Fig. 2 shows the schematic acoustic energy pattern of a monosyllable independently generated;
    • Figs. 3A and 3B show the schematic acoustic energy pattern of coarticulated monosyllables;
    • Fig. 4 shows the schematic acoustic energy pattern of consonant and vowel segments registered in consonant and vowel segment files used in this invention;
    • Figs. 5A and 5B show waveforms of [a]-sound included in different speeches;
    • Figs. 6A and 6B show power spectra of selected frames in the [a]-sounds shown in Figs. 5A and 5B;
    • Figs. 7A to 7C show a speech signal, power spectra and power sequence of a monosyllable "go";
    • Fig. 7D shows similarity between the power spectrum having the maximum power in the power sequence of Fig. 7C and other power spectra;
    • Fig. 8 is a block diagram of a speech synthesizing apparatus according to one embodiment of this invention;
    • Fig. 9 shows power spectra obtained in the speech synthesizing apparatus of Fig. 8; and
    • Fig. 10 is a flowchart illustrating the operation of the speech synthesizing apparatus shown in Fig. 8.
  • As shown in Fig. 4, consonant segments each including a consonant portion and a transient segment which changes from this consonant portion to a vowel segment are registered as a consonant segment C in the consonant segment file, and vowel segments including steady-state and coarticulated vowel segments are registered as a vowel segment V in the vowel segment file.
  • Figs. 5A and 5B show waveforms of the second [a]-sound of the speech [hakata] and the [a]-sound of the speech [kiai]. Fig. 6A shows a power spectrum in the frame A of the [a]-sound shown in Fig. 5A. Fig. 6B shows a power spectrum in the frame B of the [a]-sound shown in Fig. 5B. As is obvious from Figs. 5A, 5B, 6A and 6B, the power spectrum of the [a]-sound of [kiai], which is strongly affected by coarticulation, is different from the power spectrum of the second [a]-sound of [hakata], which is not so affected. As described above, the speech characteristic parameters representative of the power spectra of different kinds of [a]-sounds are registered in the vowel segment file in dependence upon the degree of the influence of coarticulation.
  • Figs. 7A to 7C show the speech signal, power spectra and power sequence of a monosyllable "go" when it was generated. Fig. 7D indicates the similarity between the power spectrum having the maximum power in the power sequence shown in Fig. 7C and the other power spectra. In Fig. 7D, time point t1 is determined as the boundary point between consonant and vowel: the similarity between the power spectrum having the maximum power and the power spectra appearing successively in the direction of the consonant onset is calculated in sequence, and t1 is the time point at which this similarity first becomes smaller than a predetermined value. The speech characteristic parameter data representing the power spectra generated during the period from the onset of the consonant to the time point t1, in this example the power spectra of three frames, is registered as consonant segment data in the consonant segment file. In addition, the speech characteristic parameter data representing the power spectrum of one frame generated a predetermined number of frames after the time point t1, preferably the power spectrum having the maximum power, is registered as vowel segment data in the vowel segment file.
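The boundary search above can be sketched in Python. The patent does not specify the similarity measure, so cosine similarity between power spectra is used here as an assumption, and the toy spectra and threshold are invented for illustration.

```python
import math

def cosine(a, b):
    # Cosine similarity between two power spectra (lists of floats).
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def find_boundary(spectra, powers, threshold=0.9):
    # Reference: the power spectrum of the maximum-power frame (vowel centre).
    peak = max(range(len(powers)), key=lambda i: powers[i])
    ref = spectra[peak]
    # Scan back toward the consonant onset; the boundary t1 lies just after
    # the first frame whose similarity to the reference drops below threshold.
    for i in range(peak, -1, -1):
        if cosine(spectra[i], ref) < threshold:
            return i + 1
    return 0

# Toy frames: two consonant-like spectra followed by three vowel-like ones.
spectra = [[1.0, 0.0], [0.9, 0.1], [0.1, 1.0], [0.0, 1.0], [0.1, 0.9]]
powers = [0.2, 0.4, 0.8, 1.0, 0.9]
print(find_boundary(spectra, powers))  # 2: frames 0-1 form the consonant segment
```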
  • The formats of the speech characteristic parameters to be registered in the consonant and vowel segment files are determined in accordance with the speech synthesizing apparatus to be used. For example, in a formant synthesizing apparatus, the speech characteristic parameter is determined by the formant frequencies, their bandwidths and voiced/unvoiced information. In a linear prediction synthesizing apparatus, on the other hand, the speech characteristic parameter is determined by the linear prediction coefficients and voiced/unvoiced information.
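As a rough illustration of the two parameter formats, the per-frame records might look like the following; the field names and types are hypothetical, since the patent fixes only which quantities each format carries.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class FormantFrame:
    """One frame for a formant synthesizer: formant frequencies (Hz)
    paired with their bandwidths (Hz), plus a voiced/unvoiced flag."""
    formants: List[Tuple[float, float]]  # (frequency, bandwidth) pairs
    voiced: bool

@dataclass
class LpcFrame:
    """One frame for a linear prediction synthesizer: the linear
    prediction coefficients plus a voiced/unvoiced flag."""
    coefficients: List[float]
    voiced: bool
```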
  • Fig. 8 shows a block diagram of a speech synthesizing apparatus for synthesizing speech by rule as one embodiment according to the present invention. This speech synthesizing apparatus includes a consonant segment file 10, a vowel segment file 12, a phoneme converting circuit 14, and a control circuit 16 for generating output data such as consonant segment address data, vowel segment address data, pitch data, etc. in response to the output data from the phoneme converting circuit 14. As already described with reference to Fig. 4, a plurality of speech characteristic parameter data respectively representing a plurality of consonant segments, each of which has a consonant portion and a transient segment, are stored in the consonant segment file 10. A plurality of speech characteristic parameter data respectively representing a plurality of steady-state vowels and coarticulated vowels are stored in the vowel segment file 12. The phoneme converting circuit 14 reads out the corresponding phoneme string data and accent data from a phoneme dictionary and an accent dictionary (not shown) on the basis of the character code string corresponding to a word, clause or sentence, and then supplies them to the control circuit 16. Such a phoneme converting circuit 14 is described in, for example, "Letter-to-Sound Rules for Automatic Translation of English Text to Phonetics" by Honey S. Elovitz et al. of the Naval Research Lab. (ASSP-24, No. 6, Dec. 76, p. 446).
  • The control circuit 16 serves to supply the consonant segment address data and vowel segment address data to the consonant segment file 10 and the vowel segment file 12, respectively, in accordance with the phoneme string data from the phoneme converting circuit 14. At the same time, the control circuit 16 writes the time data corresponding to the time duration of a vowel to be generated and the accent data from the phoneme converting circuit 14 into a random access memory (RAM) 16A. When the control circuit 16 generates the consonant and vowel segment address data corresponding to the consonant and vowel included in a monosyllable supplied from the phoneme converting circuit 14, the segment address data are determined in accordance with not only the phoneme data indicative of that monosyllable, but also the phoneme data representing a succeeding monosyllable from the phoneme converting circuit 14, for example.
  • The speech characteristic parameter data from the consonant segment file 10 is supplied to a first input port of an interpolation circuit 18, while the speech characteristic parameter data from the vowel segment file 12 is supplied to a second input port of the interpolation circuit 18 and to a repetition circuit 20. The interpolation circuit 18 calculates a predetermined number of speech characteristic parameter data on the basis of the speech characteristic parameter data from the consonant segment file 10, indicative of the consonant segment constituted by the power spectra of three frames, and the speech characteristic parameter data from the vowel segment file 12, indicative of the vowel segment of the power spectrum of one frame. The calculated speech parameter data respectively represent a corresponding number of vowel segments, each having the spectrum of one frame, interpolated between the input consonant and vowel segments. The repetition circuit 20 repeatedly fetches the speech characteristic parameter data from the vowel segment file 12 for the number of frames corresponding to the vowel time duration data stored in the RAM 16A.
  • The speech characteristic parameter data from the interpolation circuit 18 and the repetition circuit 20 are supplied in this order through a switch 24 to a buffer register 22. The speech characteristic parameter data from this buffer register 22 is supplied to an interpolation circuit 26. On the basis of the speech characteristic parameter data of two successive frames from the buffer register 22, this interpolation circuit 26 interpolates a predetermined number of speech characteristic parameter data between them. The speech characteristic parameter data from this interpolation circuit 26 are sequentially supplied to a speech synthesizer 28. This speech synthesizer 28 sequentially filter-processes the speech characteristic parameter data from the interpolation circuit 26 according to the pitch period data generated from a pitch generation circuit 30 in accordance with the accent data in the RAM 16A, and then generates a speech signal.
  • The operation of the speech synthesizing apparatus shown in Fig. 8 will be described with reference to a power spectrum shown in Fig. 9, and a flowchart shown in Fig. 10.
  • The phoneme converting circuit 14 supplies the phoneme string data and accent data to the control circuit 16 in accordance with the input character code series. The control circuit 16 writes into the RAM 16A the time length data representing the time duration of a vowel to be generated and the pitch data regarding the speech generating pitch, on the basis of the phoneme data and accent data from the phoneme converting circuit 14, respectively. Furthermore, the control circuit 16 supplies the consonant segment address data and vowel segment address data corresponding to the phoneme string data from the phoneme converting circuit 14 to the consonant segment file 10 and the vowel segment file 12, respectively. In this case, the control circuit 16 simultaneously generates a switch control signal to set the switch 24 into the first switching position.
  • It is now assumed, for example, that an input character code series including the character codes representative of the two successive monosyllables of [goma] is supplied to the phoneme converting circuit 14. In this case, the control circuit 16 supplies the consonant and vowel segment address data corresponding to consonant segment [g] and vowel segment [o] to the consonant and vowel segment files 10 and 12, respectively, on the basis of the phoneme data corresponding to the two successive monosyllables of [goma] generated from the phoneme converting circuit 14. Due to this, the first to third speech characteristic parameter data corresponding to the power spectra of the three frames indicative of consonant segment [g] in Fig. 9 are read out from the consonant segment file 10. The fourth speech characteristic parameter data corresponding to the power spectrum of one frame indicative of vowel [o] is read out from the vowel segment file 12. The interpolation circuit 18 calculates the fifth to eighth speech characteristic parameter data, indicative of the power spectra of a predetermined number of frames, in this example four frames, between consonant segment [g] and vowel segment [o] shown in Fig. 9, on the basis of the third speech characteristic parameter data read out from the consonant segment file 10 and the fourth speech characteristic parameter data read out from the vowel segment file 12. Next, this interpolation circuit 18 supplies the 1st to 3rd speech characteristic parameter data from the consonant segment file 10, the 5th to 8th speech characteristic parameter data thus calculated, and the 4th speech characteristic parameter data from the vowel segment file 12 to the buffer register 22 through the switch 24, in this order, in response to the interpolation control signal from the control circuit 16.
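The calculation performed by the interpolation circuit 18 in the [goma] example can be sketched as follows. Linear interpolation between the endpoint frames is an assumption; the patent states only that a predetermined number of parameter data are calculated from the last consonant frame and the vowel frame.

```python
import numpy as np

def interpolate_cv(consonant_frames, vowel_frame, n_interp=4):
    """Return the time-ordered frame series: the consonant frames,
    n_interp frames linearly interpolated between the last consonant
    frame and the vowel frame, then the vowel frame itself."""
    last = consonant_frames[-1]
    series = list(consonant_frames)
    for k in range(1, n_interp + 1):
        w = k / (n_interp + 1)                 # interpolation weight
        series.append(last + w * (vowel_frame - last))
    series.append(vowel_frame)
    return series
```

With three consonant frames for [g], four interpolated frames and one vowel frame for [o], the series contains eight frames, corresponding to the 1st to 8th parameter data of the example.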
  • Thereafter, the switch 24 is set into the second switching position by the switching control signal from the control circuit 16. The control circuit 16 then supplies a number of control pulses corresponding to the vowel time duration data stored in the RAM 16A to the repetition circuit 20, and through an OR gate 32 to the buffer register 22. Thus, the repetition circuit 20 fetches the speech characteristic parameter data from the vowel segment file 12 a corresponding number of times in response to the control pulses from the control circuit 16, and sequentially supplies them to the buffer register 22. In this way, as shown in Fig. 9, speech characteristic parameter data representing power spectra similar to those shown in Fig. 7B are stored in the buffer register 22. In Fig. 9, the power spectra shown by the solid lines indicate the power spectra corresponding to the speech characteristic parameter data read out from the consonant and vowel segment files 10 and 12, while the power spectra shown by the broken lines represent the power spectra calculated by the interpolation circuit 18 and the power spectra generated from the repetition circuit 20.
  • Next, the control circuit 16 supplies the interpolation control signal through the OR gate 32 to the buffer register 22 and also to the interpolation circuit 26, thereby allowing the speech characteristic parameter data in the buffer register 22 to be sequentially sent to the interpolation circuit 26. The interpolation circuit 26 then creates a predetermined number of interpolated speech characteristic parameter data on the basis of the speech characteristic parameter data of the two successive frames sent from the buffer register 22, and sequentially supplies them to the speech synthesizer 28. In this case, the control circuit 16 simultaneously reads out the accent data stored in the RAM 16A and supplies it to the pitch generation circuit 30, thereby allowing this pitch generation circuit 30 to generate the pitch period data. The speech synthesizer 28 synthesizes the speech signal, including the pitch information, in accordance with the speech characteristic parameter data from the interpolation circuit 26 and the pitch period data from the pitch generation circuit 30, and then generates the synthesized speech signal.
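The frame-to-frame smoothing performed by the interpolation circuit 26 can be sketched as follows; the number of interpolated frames and the linear rule are assumptions, the patent specifying only "a predetermined number" of interpolated parameter data between two successive frames.

```python
def smooth_frames(frames, n_between=2):
    """Insert n_between linearly interpolated frames between every
    pair of successive parameter frames, keeping the originals.
    Each frame is a list of parameter values."""
    out = []
    for a, b in zip(frames, frames[1:]):
        out.append(a)
        for k in range(1, n_between + 1):
            w = k / (n_between + 1)
            out.append([x + w * (y - x) for x, y in zip(a, b)])
    out.append(frames[-1])
    return out
```

The smoothed series would then be passed frame by frame to the synthesizer together with the pitch period data.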
  • Although the present invention has been described above with respect to one embodiment, this invention is not limited to only this embodiment. For example, the repetition circuit 20 is constituted in such a manner that it fetches the vowel characteristic parameter data from the vowel segment file 12 in response to the control pulses from the control circuit 16. However, the repetition circuit 20 may be modified such that a high-level signal is generated from the control circuit 16 over the period of time corresponding to the time length data, and the repetition circuit 20 fetches the vowel characteristic parameter data at a fixed interval from the vowel segment file 12 in response to this high-level signal. In addition, although a plurality of vowel characteristic parameter data, each of which represents a one-frame power spectrum, are stored in the vowel segment file 12, vowel characteristic parameter data each of which represents a plurality of power spectra can instead be stored in this vowel segment file.

Claims (6)

1. A speech synthesizing apparatus comprising: a data generation circuit (14) for generating phoneme string data; memory means (10 and 12) in which consonant and vowel characteristic parameter data representative of consonant and vowel segments are stored and which has a consonant segment file (10), in which a plurality of consonant characteristic parameter data representative of a plurality of consonant segments, each of which has a consonant portion and a transient segment to a vowel segment, are stored, and a vowel segment file (12), in which a plurality of vowel characteristic parameter data representative of a plurality of steady-state vowel segments are stored; control means (16) for allowing the corresponding consonant and vowel characteristic parameter data to be generated from said memory means (10 and 12) in accordance with said phoneme string data; and synthesizing means (18, 20, 22, 24, 26, 28 and 30) for synthesizing a speech signal on the basis of said consonant and vowel characteristic parameter data from said memory means (10 and 12), and including a parameter data series generation circuit (18, 20 and 24) for generating a series of consonant and vowel characteristic parameter data on the basis of the consonant and vowel characteristic parameter data from said consonant and vowel segment files (10 and 12), and a synthesis circuit (22, 26, 28 and 30) for synthesizing the speech signal on the basis of the parameter data series from said parameter data series generation circuit (18, 20 and 24), characterized in that said vowel segment file (12) further stores a plurality of vowel characteristic parameter data representative of a plurality of coarticulated vowel segments, each of said steady-state and coarticulated vowel segments being formed of one frame of parameter data, said control means (16) generates time length data indicative of a vowel duration in accordance with the phoneme string data from said data generation circuit (14), and said parameter data series generation circuit (18, 20 and 24) includes a repetition circuit (20) which derives the vowel characteristic parameter data from said vowel segment file (12) a number of times corresponding to said time length data.
2. A speech synthesizing apparatus according to claim 1, characterized in that said parameter data series generation circuit further includes: an interpolation circuit (18) for calculating a predetermined number of interpolated characteristic parameter data on the basis of the consonant and vowel characteristic parameter data from said consonant and vowel segment files (10 and 12); and a data selection circuit (24) for sequentially and selectively supplying the characteristic parameter data from said interpolation circuit (18) and said repetition circuit (20) to said synthesis circuit (22, 26, 28 and 30).
3. A speech synthesizing apparatus according to claim 2, characterized in that said data selection circuit is a switching circuit (24) whose switching position is controlled in response to a switching control signal from said control means (16).
4. A speech synthesizing apparatus according to claim 2, characterized in that said data generation circuit (14) generates accent data together with said phoneme string data and said control means (16) generates pitch data in accordance with said accent data, and that said synthesis circuit (22, 26, 28 and 30) synthesizes the speech signal on the basis of the parameter data series from said parameter data series generation circuit (18, 20 and 24) and the pitch data from said control means (16).
5. A speech synthesizing apparatus according to claim 2, characterized in that said synthesis circuit comprises: an interpolator (26) which receives the parameter data series from said parameter data series generation circuit (18, 20 and 24) and calculates a predetermined number of interpolated parameter data on the basis of two successive parameter data; and a synthesizing unit (28) for synthesizing the speech signal on the basis of the parameter data from said interpolator (26).
6. A speech synthesizing apparatus according to claim 5, characterized in that said data generation circuit (14) generates accent data together with said phoneme string data and said control means (16) generates pitch data in accordance with said accent data, and that said synthesis circuit (22, 26, 28 and 30) synthesizes the speech signal on the basis of the parameter data series from said parameter data series generation circuit (18, 20 and 24) and the pitch data from said control means (16).
EP83306228A 1982-10-19 1983-10-14 Speech synthesizing apparatus Expired EP0107945B1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP183410/82 1982-10-19
JP57183410A JPS5972494A (en) 1982-10-19 1982-10-19 Rule snthesization system

Publications (2)

Publication Number Publication Date
EP0107945A1 EP0107945A1 (en) 1984-05-09
EP0107945B1 true EP0107945B1 (en) 1987-03-18

Family

ID=16135290

Family Applications (1)

Application Number Title Priority Date Filing Date
EP83306228A Expired EP0107945B1 (en) 1982-10-19 1983-10-14 Speech synthesizing apparatus

Country Status (3)

Country Link
EP (1) EP0107945B1 (en)
JP (1) JPS5972494A (en)
DE (1) DE3370390D1 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0642158B2 (en) * 1983-11-01 1994-06-01 日本電気株式会社 Speech synthesizer
JPH0756598B2 (en) * 1984-07-25 1995-06-14 株式会社日立製作所 Speech synthesis method of speech synthesizer
JPH0833744B2 (en) * 1986-01-09 1996-03-29 株式会社東芝 Speech synthesizer
JP2577372B2 (en) * 1987-02-24 1997-01-29 株式会社東芝 Speech synthesis apparatus and method
DK46493D0 (en) * 1993-04-22 1993-04-22 Frank Uldall Leonhard METHOD OF SIGNAL TREATMENT FOR DETERMINING TRANSIT CONDITIONS IN AUDITIVE SIGNALS
AU699837B2 (en) * 1995-03-07 1998-12-17 British Telecommunications Public Limited Company Speech synthesis

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3975587A (en) * 1974-09-13 1976-08-17 International Telephone And Telegraph Corporation Digital vocoder
DE2531006A1 (en) * 1975-07-11 1977-01-27 Deutsche Bundespost Speech synthesis system from diphthongs and phonemes - uses time limit for stored diphthongs and their double application
DE3105518A1 (en) * 1981-02-11 1982-08-19 Heinrich-Hertz-Institut für Nachrichtentechnik Berlin GmbH, 1000 Berlin METHOD FOR SYNTHESIS OF LANGUAGE WITH UNLIMITED VOCUS, AND CIRCUIT ARRANGEMENT FOR IMPLEMENTING THE METHOD

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ASSP-24, No. 6, Dec. 76, p. 446 *

Also Published As

Publication number Publication date
JPS5972494A (en) 1984-04-24
EP0107945A1 (en) 1984-05-09
DE3370390D1 (en) 1987-04-23

Similar Documents

Publication Publication Date Title
US4862504A (en) Speech synthesis system of rule-synthesis type
US4692941A (en) Real-time text-to-speech conversion system
EP0886853B1 (en) Microsegment-based speech-synthesis process
US4685135A (en) Text-to-speech synthesis system
US4398059A (en) Speech producing system
EP0059880A2 (en) Text-to-speech synthesis system
US5633984A (en) Method and apparatus for speech processing
EP0239394B1 (en) Speech synthesis system
US5463715A (en) Method and apparatus for speech generation from phonetic codes
EP0107945B1 (en) Speech synthesizing apparatus
US6970819B1 (en) Speech synthesis device
EP0144731B1 (en) Speech synthesizer
US6829577B1 (en) Generating non-stationary additive noise for addition to synthesized speech
van Rijnsoever A multilingual text-to-speech system
JP3771565B2 (en) Fundamental frequency pattern generation device, fundamental frequency pattern generation method, and program recording medium
JP2703253B2 (en) Speech synthesizer
KR100202539B1 (en) Voice synthetic method
JPH0594199A (en) Residual driving type speech synthesizing device
JPS62284398A (en) Sentence-voice conversion system
JP2573586B2 (en) Rule-based speech synthesizer
JP2573585B2 (en) Speech spectrum pattern generator
JP2573587B2 (en) Pitch pattern generator
JPS58168096A (en) Multi-language voice synthesizer
JPS63174100A (en) Voice rule synthesization system
JPH055116B2 (en)

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 19831024

AK Designated contracting states

Designated state(s): DE FR GB NL

RAP1 Party data changed (applicant data changed or rights of an application transferred)

Owner name: KABUSHIKI KAISHA TOSHIBA

GRAA (expected) grant

Free format text: ORIGINAL CODE: 0009210

AK Designated contracting states

Kind code of ref document: B1

Designated state(s): DE FR GB NL

REF Corresponds to:

Ref document number: 3370390

Country of ref document: DE

Date of ref document: 19870423

ET Fr: translation filed
PLBE No opposition filed within time limit

Free format text: ORIGINAL CODE: 0009261

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: NO OPPOSITION FILED WITHIN TIME LIMIT

26N No opposition filed
REG Reference to a national code

Ref country code: GB

Ref legal event code: 746

Effective date: 19980909

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: FR

Payment date: 19981009

Year of fee payment: 16

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: GB

Payment date: 19981016

Year of fee payment: 16

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: DE

Payment date: 19981023

Year of fee payment: 16

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: NL

Payment date: 19981028

Year of fee payment: 16

REG Reference to a national code

Ref country code: FR

Ref legal event code: D6

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: GB

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 19991014

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: NL

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20000501

GBPC Gb: european patent ceased through non-payment of renewal fee

Effective date: 19991014

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: FR

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20000630

NLV4 Nl: lapsed or anulled due to non-payment of the annual fee

Effective date: 20000501

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: DE

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20000801

REG Reference to a national code

Ref country code: FR

Ref legal event code: ST