EP0107945B1 - Speech synthesizing apparatus - Google Patents


Info

Publication number
EP0107945B1
Authority
EP
European Patent Office
Prior art keywords
data
vowel
parameter data
consonant
speech
Prior art date
Legal status
Expired
Application number
EP83306228A
Other languages
German (de)
French (fr)
Other versions
EP0107945A1 (en)
Inventor
Tsuneo Nitta
Norimasa Nomura
Kazuo Sumita
Current Assignee
Toshiba Corp
Original Assignee
Toshiba Corp
Priority date
Filing date
Publication date
Application filed by Toshiba Corp
Publication of EP0107945A1
Application granted
Publication of EP0107945B1


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Definitions

  • the control circuit 16 serves to supply the consonant segment address data and vowel segment address data to the consonant segment file 10 and the vowel segment file 12, respectively, in accordance with the phoneme string data from the phoneme converting circuit 14. At the same time, the control circuit 16 writes the time data corresponding to the time duration of a vowel to be generated and the accent data from the phoneme converting circuit 14 into a random access memory (RAM) 16A.
  • the segment address data are determined in accordance with not only the phoneme data indicative of the monosyllable, but also the phoneme data representing a succeeding monosyllable from the phoneme converting circuit 14, for example.
  • the speech characteristic parameter data from the consonant segment file 10 is supplied to a first input port of an interpolation circuit 18, while the speech characteristic parameter data from the vowel segment file 12 is supplied to a second input port of the interpolation circuit 18 and to a repetition circuit 20.
  • the interpolation circuit 18 calculates a predetermined number of speech characteristic parameter data on the basis of the speech characteristic parameter data indicative of the consonant segment which is constituted by the power spectrum of three frames from the consonant segment file 10 and the speech characteristic parameter data indicative of the vowel segment of the power spectrum of one frame from the vowel segment file 12.
  • the calculated speech parameter data respectively represent a corresponding number of vowel segments each having the spectrum of one frame and interpolated between the input consonant and vowel segments.
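The operation of the interpolation circuit 18 can be sketched as linear interpolation between the final consonant parameter vector and the vowel parameter vector. The Python sketch below is illustrative only: the patent does not specify the interpolation formula, and representing a one-frame spectrum as a plain list of coefficients is an assumption.

```python
def interpolate_frames(last_consonant, vowel, n_interp=4):
    # Linearly interpolate n_interp parameter frames between the final
    # consonant frame and the steady-state vowel frame, endpoints excluded.
    frames = []
    for k in range(1, n_interp + 1):
        t = k / (n_interp + 1)
        frames.append([(1 - t) * c + t * v for c, v in zip(last_consonant, vowel)])
    return frames

# Toy two-coefficient parameter vectors standing in for one-frame spectra.
transition = interpolate_frames([1.0, 0.2], [0.2, 1.0])
print(len(transition))  # 4 interpolated frames
```

With the default of four interpolated frames, this corresponds to the fifth to eighth parameter data computed between the three consonant frames and the single vowel frame.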
  • the repetition circuit 20 repeatedly fetches from the vowel segment file 12 the speech characteristic parameter data by the number of frames corresponding to the vowel time duration data stored in the RAM 16A.
  • the speech characteristic parameter data from the interpolation circuit 18 and repetition circuit 20 are supplied through a switch 24 to a buffer register 22 in this order.
  • the speech characteristic parameter data from this buffer register 22 is supplied to an interpolation circuit 26.
  • This interpolation circuit 26 interpolates a predetermined number of speech characteristic parameter data between these two speech characteristic parameter data on the basis of the speech characteristic parameter data of the successive two frames from the buffer register 22.
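The operation of the interpolation circuit 26, which inserts interpolated frames between every pair of successive frames taken from the buffer register, might be sketched as follows. The linear interpolation and the number of inserted frames are assumptions; the patent only states that a predetermined number of parameter data are interpolated.

```python
def smooth_series(frames, n_between):
    # Insert n_between linearly interpolated frames between every pair of
    # successive frames in the buffered parameter series.
    out = []
    for a, b in zip(frames, frames[1:]):
        out.append(a)
        for k in range(1, n_between + 1):
            t = k / (n_between + 1)
            out.append([(1 - t) * x + t * y for x, y in zip(a, b)])
    out.append(frames[-1])
    return out

smoothed = smooth_series([[0.0], [3.0], [6.0]], n_between=2)
print(len(smoothed))  # 7 frames: 3 originals plus 2 between each pair
```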
  • the speech characteristic parameter data from this interpolation circuit 26 are sequentially supplied to a speech synthesizer 28.
  • This speech synthesizer 28 sequentially filter-processes the speech characteristic parameter data from the interpolation circuit 26 according to the pitch period data generated from a pitch generation circuit 30 in accordance with the accent data of the RAM 16A, and then generates a speech signal.
  • the phoneme converting circuit 14 supplies the phoneme string data and accent data to the control circuit 16 in accordance with the input character code series.
  • This control circuit 16 writes the time length data representing the time duration of a vowel to be generated and the pitch data regarding a speech generating pitch in the RAM 16A on the basis of the phoneme data and accent data from the phoneme converting circuit 14, respectively.
  • the control circuit 16 supplies the consonant segment address data and vowel segment address data corresponding to the phoneme string data from the phoneme converting circuit 14 to the consonant segment file 10 and the vowel segment file 12, respectively.
  • the control circuit 16 simultaneously generates the switch control signal to set the switch 24 into the first switching position.
  • the control circuit 16 supplies the consonant and vowel segment address data corresponding to consonant segment [g] and vowel segment [o] to the consonant and vowel segment files 10 and 12, respectively, on the basis of the phoneme data corresponding to the two successive monosyllables of [goma] generated from the phoneme converting circuit 14. Due to this, the first to third speech characteristic parameter data corresponding to the power spectra of three frames indicative of consonant segment [g] in Fig. 9 are read out from the consonant segment file 10.
  • the fourth speech characteristic parameter data corresponding to the power spectrum of one frame indicative of vowel [o] is read out from vowel segment file 12.
  • the interpolation circuit 18 calculates the fifth to eighth speech characteristic parameter data indicative of the power spectrum of a predetermined number of frames, in this example, four frames between consonant segment [g] and vowel segment [o] shown in Fig. 9, on the basis of the third speech characteristic parameter data read out from the consonant segment file 10 and the fourth speech characteristic parameter data read out from the vowel segment file 12.
  • this interpolation circuit 18 supplies the 1st to 3rd speech characteristic parameter data from the consonant segment file 10, the 5th to 8th speech characteristic parameter data thus calculated, and the 4th speech characteristic parameter data from the vowel segment file 12 to the buffer register 22 through the switch 24 in this order in response to the interpolation control signal from the control circuit 16.
  • the switch 24 is set into the second switching position by the switching control signal from the control circuit 16.
  • the control circuit 16 then supplies the control pulses of the number corresponding to the vowel time duration data stored in the RAM 16A to the repetition circuit 20 and through an OR gate 32 to the buffer register 22.
  • the repetition circuit 20 fetches the speech characteristic parameter data from the vowel segment file 12 a corresponding number of times in response to the control pulses from the control circuit 16, and sequentially supplies them to the buffer register 22.
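The series assembled in the buffer register 22 for a monosyllable such as [go] can be sketched end to end: the consonant frames, the interpolated transition frames, the vowel frame, and then the vowel frame repeated for the requested duration. The function, the linear interpolation, and the toy parameter vectors below are illustrative assumptions, not taken from the patent.

```python
def build_parameter_series(consonant_frames, vowel_frame, n_interp, n_repeat):
    # Consonant frames, then interpolated transition frames, then the vowel
    # frame, then the vowel frame repeated for the requested duration
    # (the role of the repetition circuit 20 in the embodiment).
    series = list(consonant_frames)
    last = consonant_frames[-1]
    for k in range(1, n_interp + 1):
        t = k / (n_interp + 1)
        series.append([(1 - t) * c + t * v for c, v in zip(last, vowel_frame)])
    series.append(list(vowel_frame))
    series.extend([list(vowel_frame)] * n_repeat)
    return series

# [g] as three toy frames and [o] as one frame, loosely mirroring Fig. 9.
g_frames = [[0.9, 0.1], [0.7, 0.3], [0.5, 0.4]]
o_frame = [0.1, 1.0]
series = build_parameter_series(g_frames, o_frame, n_interp=4, n_repeat=5)
print(len(series))  # 3 + 4 + 1 + 5 = 13 frames
```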
  • As shown in Fig. 9, the speech characteristic parameter data representing power spectra similar to those shown in Fig. 7B are stored in the buffer register 22.
  • the power spectra shown by the solid lines indicate the power spectra corresponding to the speech characteristic parameter data read out from the consonant and vowel segment files 10 and 12, and the power spectra shown by the broken lines represent the power spectra calculated by the interpolation circuit 18 and the power spectra generated from the repetition circuit 20.
  • the control circuit 16 supplies the interpolation control signal through the OR gate 32 to the buffer register 22 and also supplies the interpolation control signal to the interpolation circuit 26, thereby allowing the speech characteristic parameter data in the buffer register 22 to be sequentially sent to the interpolation circuit 26.
  • the interpolation circuit 26 then creates a predetermined number of interpolated speech characteristic parameter data on the basis of the speech characteristic parameter data of the successive two frames sent from the buffer register 22 and sequentially supplies them to the speech synthesizer 28.
  • the control circuit 16 simultaneously reads out the accent data stored in the RAM 16A and supplies it to the pitch generation circuit 30, thereby allowing this pitch generation circuit 30 to generate the pitch period data.
  • the speech synthesizer 28 synthesizes the speech signal including the pitch information in accordance with the speech characteristic parameter data from the interpolation circuit 26 and the pitch period data from the pitch generation circuit 30 and then generates the synthesized speech signal.
  • the repetition circuit 20 is constituted in such a manner that it fetches the vowel characteristic parameter data from the vowel segment file 12 in response to the control pulses from the control circuit 16.
  • this repetition circuit 20 may be modified such that a high-level signal is generated from the control circuit 16 over the period of time corresponding to the time length data, and that the repetition circuit 20 fetches the vowel characteristic parameter data at a fixed interval from the vowel segment file 12 in response to this high-level signal.
  • Although in the above embodiment the vowel characteristic parameter data each of which represents a one-frame power spectrum have been stored in the vowel segment file 12, vowel characteristic parameter data each of which represents a plurality of power spectra can also be stored in this vowel segment file.

Description

  • This invention relates to a speech synthesizing apparatus for synthesizing speech in accordance with input character strings.
  • Recently, various speech synthesizing apparatuses for synthesizing speech on the basis of sentence data applied as character strings have become known. For example, in an apparatus for synthesizing speech by rule, speech segments of predetermined units are registered in advance as acoustic parameter data in a speech segment file, and the corresponding acoustic parameter data is selectively read out from this speech segment file in accordance with the input phoneme data string. The speech data is then synthesized on the basis of the acoustic parameter data thus read out, in accordance with a predetermined synthesizing rule. Because the speech is synthesized in accordance with a predetermined synthesizing rule, such a speech synthesizing apparatus can generate a desired sentence at a desired speaking speed.
  • This apparatus for synthesizing speech by rule is mainly divided, for example, into a V-C-V synthesizing apparatus using a chain consisting of vowel, consonant and vowel as a speech segment of one unit, and a C-V synthesizing apparatus using a monosyllable consisting of consonant and vowel as a speech segment of one unit in dependence upon the format of the speech segment to be registered in the speech segment file. Reference characters V and C used herein represent a vowel segment and a consonant segment, respectively.
  • Fig. 1 is a schematic block diagram of a conventional speech synthesizing apparatus. This speech synthesizing apparatus includes a phoneme converting circuit 2 for converting input character code string into phoneme data string including accent information in accordance with predetermined phoneme conversion rule and accent rule, a speech segment file 4 in which a plurality of speech segments in the form of monosyllable have been stored, an interpolating circuit 6 which sequentially reads out the speech characteristic parameter data of the corresponding speech segment from the speech segment file 4 in accordance with the phoneme data string from the phoneme converting circuit 2 and then interpolates these speech characteristic parameter data, and a speech synthesizer circuit 8 for generating speech data by filter-processing the parameter data from this interpolating circuit 6.
  • In the apparatus for synthesizing speech data by rules of this kind, phonemes must of course be converted with high accuracy to obtain more natural speech with high quality, but it is also required to obtain speech characteristic parameters which represent, with high fidelity, the characteristics of the speech generated by a human being. For example, when speech is continuously generated, there may be a case where a certain monosyllable in this speech is coarticulated by the monosyllables before and after it. When a monosyllable formed of a consonant-vowel (C1-V1) syllable is independently generated, the acoustic energy pattern (speech characteristic parameter) of the speech segment of this monosyllable exhibits the inherent characteristics of the consonant C1 and vowel V1 with high fidelity, as schematically shown in Fig. 2. However, in the case where this monosyllable is successively generated together with other monosyllables, the acoustic energy pattern (speech characteristic parameter) of the speech segment of the C1-V1 monosyllable will be changed as shown in Figs. 3A and 3B in dependence upon, for example, whether the subsequent monosyllable is a C2-V2 syllable or a C3-V3 syllable. In other words, this monosyllable is coarticulated by the subsequent C2-V2 monosyllable and is changed to a C11-V11 monosyllable, or it is coarticulated by the subsequent C3-V3 monosyllable and is changed to a C12-V12 monosyllable. Therefore, in order to generate speech which is more natural, has high quality and is as similar as possible to the speech actually generated by a human being, it is required to generate the speech in consideration of the coarticulation between successive speech segments. However, with a conventional speech synthesizing apparatus, only unnatural speech is obtained, because it generates speech by simply coupling the phonemes regardless of the influence of coarticulation.
  • EP-A-58130 discloses discrete sound elements corresponding to consonant portions, steady-state vowel portions and transition elements. However, in this prior art, transition elements are composed of a combination of a consonant portion and a coarticulated vowel, and it is thus necessary to prepare a large number of such transition elements in order to synthesize natural speech.
  • It is an object of the present invention to provide a speech synthesizing apparatus for synthesizing clear and natural speech.
  • According to the invention, there is provided a speech synthesizing apparatus comprising a data generation circuit for generating phoneme string data; memory means in which consonant and vowel characteristic parameter data representative of consonant and vowel segments are stored and which has a consonant segment file in which a plurality of consonant characteristic parameter data representative of a plurality of consonant segments, each of which has a consonant portion and a transient segment to a vowel segment, are stored, and a vowel segment file in which a plurality of vowel characteristic parameter data representative of a plurality of steady-state vowel segments are stored; control means for allowing the corresponding consonant and vowel characteristic parameter data to be generated from said memory means in accordance with said phoneme string data; and synthesizing means for synthesizing a speech signal on the basis of said consonant and vowel characteristic parameter data from said memory means; and including a parameter data series generation circuit for generating a series of consonant and vowel characteristic parameter data on the basis of the consonant and vowel characteristic parameter data from said consonant and vowel segment files, and a synthesis circuit for synthesizing the speech signal on the basis of the parameter data series from said parameter data series generation circuit, characterized in that said vowel segment file further stores a plurality of vowel characteristic parameter data representative of a plurality of coarticulated vowel segments, each of said steady-state and coarticulated vowel segments being formed of one frame parameter data, said control means generates time length data indicative of a vowel duration length in accordance with the phoneme string data from said data generation circuit, and said parameter data series generation circuit includes a repetition circuit which derives the vowel characteristic parameter data from said vowel segment file the number of times corresponding to said time length data.
  • In the described embodiment, each consonant characteristic parameter data stored in the consonant segment file represents the consonant segment including a consonant portion and a transient segment to the vowel segment; therefore, it is possible to easily obtain the interpolated characteristic parameter data between this consonant characteristic parameter data and the succeeding vowel characteristic parameter data read out from the vowel segment file, thereby making it possible to clearly and naturally synthesize a speech even for a coarticulated monosyllable.
  • An embodiment of the invention will now be described, by way of example, with reference to the accompanying drawings, in which:
    • Fig. 1 is a schematic block diagram of a conventional speech synthesizing apparatus;
    • Fig. 2 shows the schematic acoustic energy pattern of a monosyllable independently generated;
    • Figs. 3A and 3B show the schematic acoustic energy pattern of coarticulated monosyllables;
    • Fig. 4 shows the schematic acoustic energy pattern of consonant and vowel segments registered in consonant and vowel segment files used in this invention;
    • Figs. 5A and 5B show waveforms of [a]-sound included in different speeches;
    • Figs. 6A and 6B show power spectra of selected frames in the [a]-sounds shown in Figs. 5A and 5B;
    • Figs. 7A to 7C show a speech signal, power spectra and power sequence of a monosyllable "go";
    • Fig. 7D shows similarity between the power spectrum having the maximum power in the power sequence of Fig. 7C and other power spectra;
    • Fig. 8 is a block diagram of a speech synthesizing apparatus according to one embodiment of this invention;
    • Fig. 9 shows power spectra obtained in the speech synthesizing apparatus of Fig. 8; and
    • Fig. 10 is a flowchart illustrating the operation of the speech synthesizing apparatus shown in Fig. 8.
  • As shown in Fig. 4, consonant segments each including a consonant portion and a transient segment which changes from this consonant portion to a vowel segment are registered as a consonant segment C in the consonant segment file, and vowel segments including steady-state and coarticulated vowel segments are registered as a vowel segment V in the vowel segment file.
  • Figs. 5A and 5B show waveforms of the second [a]-sound of the speech [hakata] and the [a]-sound of the speech [kiai]. Fig. 6A shows a power spectrum in the frame A of the [a]-sound shown in Fig. 5A. Fig. 6B shows a power spectrum in the frame B of the [a]-sound shown in Fig. 5B. As is obvious from Figs. 5A, 5B, 6A and 6B, the power spectrum of the [a]-sound of [kiai], which is strongly affected by coarticulation, is different from the power spectrum of the second [a]-sound of [hakata], which is not so affected. As described above, the speech characteristic parameters representative of the power spectra of different kinds of [a]-sounds are registered in the vowel segment file in dependence upon the degree of the influence of coarticulation.
  • Figs. 7A to 7C show the speech signal, power spectra and power sequence of a monosyllable "go" when it was generated. Fig. 7D indicates the similarity between the power spectrum having the maximum power in the power sequence shown in Fig. 7C and the other power spectra. In Fig. 7D, time point t1 is determined as the boundary point between consonant and vowel: the similarity between the power spectrum having the maximum power and the power spectra appearing successively in the direction of the consonant onset is calculated in sequence, and t1 is the time point at which this similarity first becomes smaller than a predetermined value. The speech characteristic parameter data representing the power spectra generated during the period from the onset of the consonant to the time point t1, in this example the power spectra of three frames, is registered as consonant segment data in the consonant segment file. In addition, the speech characteristic parameter data representing the power spectrum of one frame generated a predetermined number of frames after the time point t1, preferably the power spectrum having the maximum power, is registered as vowel segment data in the vowel segment file.
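The boundary search above can be sketched in Python. The patent does not specify the similarity measure, so cosine similarity between power spectra is used here as an assumption, and the toy spectra and threshold are invented for illustration.

```python
import math

def cosine(a, b):
    # Cosine similarity between two power spectra (lists of floats).
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def find_boundary(spectra, powers, threshold=0.9):
    # Reference: the power spectrum of the maximum-power frame (vowel centre).
    peak = max(range(len(powers)), key=lambda i: powers[i])
    ref = spectra[peak]
    # Scan back toward the consonant onset; the boundary t1 lies just after
    # the first frame whose similarity to the reference drops below threshold.
    for i in range(peak, -1, -1):
        if cosine(spectra[i], ref) < threshold:
            return i + 1
    return 0

# Toy frames: two consonant-like spectra followed by three vowel-like ones.
spectra = [[1.0, 0.0], [0.9, 0.1], [0.1, 1.0], [0.0, 1.0], [0.1, 0.9]]
powers = [0.2, 0.4, 0.8, 1.0, 0.9]
print(find_boundary(spectra, powers))  # 2: frames 0-1 form the consonant segment
```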
  • The formats of the speech characteristic parameters to be registered in the consonant and vowel segment files are determined in accordance with the speech synthesizing apparatus to be used. For example, in a formant synthesizing apparatus, the speech characteristic parameter is determined by the formant frequencies, their bandwidths and voiced/unvoiced information. In a linear prediction synthesizing apparatus, on the other hand, the speech characteristic parameter is determined by the linear prediction coefficients and voiced/unvoiced information.
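As a rough illustration of the two parameter formats, the per-frame records might look like the following; the field names and types are hypothetical, since the patent fixes only which quantities each format carries.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class FormantFrame:
    """One frame for a formant synthesizer: formant frequencies (Hz)
    paired with their bandwidths (Hz), plus a voiced/unvoiced flag."""
    formants: List[Tuple[float, float]]  # (frequency, bandwidth) pairs
    voiced: bool

@dataclass
class LpcFrame:
    """One frame for a linear prediction synthesizer: the linear
    prediction coefficients plus a voiced/unvoiced flag."""
    coefficients: List[float]
    voiced: bool
```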
  • Fig. 8 shows a block diagram of a speech synthesizing apparatus for synthesizing speech by rule as one embodiment according to the present invention. This speech synthesizing apparatus includes a consonant segment file 10, a vowel segment file 12, a phoneme converting circuit 14, and a control circuit 16 for generating output data such as consonant segment address data, vowel segment address data, pitch data, etc. in response to the output data from the phoneme converting circuit 14. As already described with reference to Fig. 4, a plurality of speech characteristic parameter data respectively representing a plurality of consonant segments, each of which has a consonant portion and a transient segment, are stored in the consonant segment file 10. A plurality of speech characteristic parameter data respectively representing a plurality of steady-state vowels and coarticulated vowels are stored in the vowel segment file 12. The phoneme converting circuit 14 reads out the corresponding phoneme string data and accent data from a phoneme dictionary and an accent dictionary (not shown) on the basis of the character code string corresponding to a word, clause or sentence, and then supplies them to the control circuit 16. Such a phoneme converting circuit 14 is described in, for example, "Letter-to-Sound Rules for Automatic Translation of English Text to Phonetics" by Honey S. Elovitz et al. of the Naval Research Lab. (ASSP-24, No. 6, Dec. 76, p. 446).
  • The control circuit 16 serves to supply the consonant segment address data and vowel segment address data to the consonant segment file 10 and the vowel segment file 12, respectively, in accordance with the phoneme string data from the phoneme converting circuit 14. At the same time, the control circuit 16 writes the time data corresponding to the time duration of a vowel to be generated and the accent data from the phoneme converting circuit 14 into a random access memory (RAM) 16A. When the control circuit 16 generates the consonant and vowel segment address data corresponding to the consonant and vowel included in a monosyllable supplied from the phoneme converting circuit 14, the segment address data are determined in accordance with not only the phoneme data indicative of that monosyllable, but also the phoneme data representing a succeeding monosyllable from the phoneme converting circuit 14, for example.
  • The speech characteristic parameter data from the consonant segment file 10 is supplied to a first input port of an interpolation circuit 18, while the speech characteristic parameter data from the vowel segment file 12 is supplied to a second input port of the interpolation circuit 18 and to a repetition circuit 20. The interpolation circuit 18 calculates a predetermined number of speech characteristic parameter data on the basis of the speech characteristic parameter data from the consonant segment file 10, indicative of the consonant segment constituted by the power spectra of three frames, and the speech characteristic parameter data from the vowel segment file 12, indicative of the vowel segment of the power spectrum of one frame. The calculated speech parameter data respectively represent a corresponding number of vowel segments, each having the spectrum of one frame, interpolated between the input consonant and vowel segments. The repetition circuit 20 repeatedly fetches the speech characteristic parameter data from the vowel segment file 12 for the number of frames corresponding to the vowel time duration data stored in the RAM 16A.
  • The speech characteristic parameter data from the interpolation circuit 18 and the repetition circuit 20 are supplied in this order through a switch 24 to a buffer register 22. The speech characteristic parameter data from this buffer register 22 is supplied to an interpolation circuit 26. On the basis of the speech characteristic parameter data of two successive frames from the buffer register 22, this interpolation circuit 26 interpolates a predetermined number of speech characteristic parameter data between them. The speech characteristic parameter data from this interpolation circuit 26 are sequentially supplied to a speech synthesizer 28. This speech synthesizer 28 sequentially filter-processes the speech characteristic parameter data from the interpolation circuit 26 according to the pitch period data generated from a pitch generation circuit 30 in accordance with the accent data in the RAM 16A, and then generates a speech signal.
  • The operation of the speech synthesizing apparatus shown in Fig. 8 will be described with reference to a power spectrum shown in Fig. 9, and a flowchart shown in Fig. 10.
  • The phoneme converting circuit 14 supplies the phoneme string data and accent data to the control circuit 16 in accordance with the input character code series. The control circuit 16 writes into the RAM 16A the time length data representing the time duration of a vowel to be generated and the pitch data regarding the speech generating pitch, on the basis of the phoneme data and accent data from the phoneme converting circuit 14, respectively. Furthermore, the control circuit 16 supplies the consonant segment address data and vowel segment address data corresponding to the phoneme string data from the phoneme converting circuit 14 to the consonant segment file 10 and the vowel segment file 12, respectively. In this case, the control circuit 16 simultaneously generates a switch control signal to set the switch 24 into the first switching position.
  • It is now assumed, for example, that an input character code series including the character codes representative of the two successive monosyllables of [goma] is supplied to the phoneme converting circuit 14. In this case, the control circuit 16 supplies the consonant and vowel segment address data corresponding to consonant segment [g] and vowel segment [o] to the consonant and vowel segment files 10 and 12, respectively, on the basis of the phoneme data corresponding to the two successive monosyllables of [goma] generated from the phoneme converting circuit 14. Due to this, the first to third speech characteristic parameter data corresponding to the power spectra of the three frames indicative of consonant segment [g] in Fig. 9 are read out from the consonant segment file 10. The fourth speech characteristic parameter data corresponding to the power spectrum of one frame indicative of vowel [o] is read out from the vowel segment file 12. The interpolation circuit 18 calculates the fifth to eighth speech characteristic parameter data, indicative of the power spectra of a predetermined number of frames, in this example four frames, between consonant segment [g] and vowel segment [o] shown in Fig. 9, on the basis of the third speech characteristic parameter data read out from the consonant segment file 10 and the fourth speech characteristic parameter data read out from the vowel segment file 12. Next, this interpolation circuit 18 supplies the 1st to 3rd speech characteristic parameter data from the consonant segment file 10, the 5th to 8th speech characteristic parameter data thus calculated, and the 4th speech characteristic parameter data from the vowel segment file 12 to the buffer register 22 through the switch 24, in this order, in response to the interpolation control signal from the control circuit 16.
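The calculation performed by the interpolation circuit 18 in the [goma] example can be sketched as follows. Linear interpolation between the endpoint frames is an assumption; the patent states only that a predetermined number of parameter data are calculated from the last consonant frame and the vowel frame.

```python
import numpy as np

def interpolate_cv(consonant_frames, vowel_frame, n_interp=4):
    """Return the time-ordered frame series: the consonant frames,
    n_interp frames linearly interpolated between the last consonant
    frame and the vowel frame, then the vowel frame itself."""
    last = consonant_frames[-1]
    series = list(consonant_frames)
    for k in range(1, n_interp + 1):
        w = k / (n_interp + 1)                 # interpolation weight
        series.append(last + w * (vowel_frame - last))
    series.append(vowel_frame)
    return series
```

With three consonant frames for [g], four interpolated frames and one vowel frame for [o], the series contains eight frames, corresponding to the 1st to 8th parameter data of the example.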
  • Thereafter, the switch 24 is set into the second switching position by the switching control signal from the control circuit 16. The control circuit 16 then supplies a number of control pulses corresponding to the vowel time duration data stored in the RAM 16A to the repetition circuit 20, and through an OR gate 32 to the buffer register 22. Thus, the repetition circuit 20 fetches the speech characteristic parameter data from the vowel segment file 12 a corresponding number of times in response to the control pulses from the control circuit 16, and sequentially supplies them to the buffer register 22. In this way, as shown in Fig. 9, speech characteristic parameter data representing power spectra similar to those shown in Fig. 7B are stored in the buffer register 22. In Fig. 9, the power spectra shown by the solid lines indicate the power spectra corresponding to the speech characteristic parameter data read out from the consonant and vowel segment files 10 and 12, while the power spectra shown by the broken lines represent the power spectra calculated by the interpolation circuit 18 and the power spectra generated from the repetition circuit 20.
  • Next, the control circuit 16 supplies the interpolation control signal through the OR gate 32 to the buffer register 22 and also to the interpolation circuit 26, thereby allowing the speech characteristic parameter data in the buffer register 22 to be sequentially sent to the interpolation circuit 26. The interpolation circuit 26 then creates a predetermined number of interpolated speech characteristic parameter data on the basis of the speech characteristic parameter data of the two successive frames sent from the buffer register 22, and sequentially supplies them to the speech synthesizer 28. In this case, the control circuit 16 simultaneously reads out the accent data stored in the RAM 16A and supplies it to the pitch generation circuit 30, thereby allowing this pitch generation circuit 30 to generate the pitch period data. The speech synthesizer 28 synthesizes the speech signal, including the pitch information, in accordance with the speech characteristic parameter data from the interpolation circuit 26 and the pitch period data from the pitch generation circuit 30, and then generates the synthesized speech signal.
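The frame-to-frame smoothing performed by the interpolation circuit 26 can be sketched as follows; the number of interpolated frames and the linear rule are assumptions, the patent specifying only "a predetermined number" of interpolated parameter data between two successive frames.

```python
def smooth_frames(frames, n_between=2):
    """Insert n_between linearly interpolated frames between every
    pair of successive parameter frames, keeping the originals.
    Each frame is a list of parameter values."""
    out = []
    for a, b in zip(frames, frames[1:]):
        out.append(a)
        for k in range(1, n_between + 1):
            w = k / (n_between + 1)
            out.append([x + w * (y - x) for x, y in zip(a, b)])
    out.append(frames[-1])
    return out
```

The smoothed series would then be passed frame by frame to the synthesizer together with the pitch period data.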
  • Although the present invention has been described above with respect to one embodiment, this invention is not limited to only this embodiment. For example, the repetition circuit 20 is constituted in such a manner that it fetches the vowel characteristic parameter data from the vowel segment file 12 in response to the control pulses from the control circuit 16. However, the repetition circuit 20 may be modified such that a high-level signal is generated from the control circuit 16 over the period of time corresponding to the time length data, and the repetition circuit 20 fetches the vowel characteristic parameter data at a fixed interval from the vowel segment file 12 in response to this high-level signal. In addition, although a plurality of vowel characteristic parameter data, each of which represents a one-frame power spectrum, are stored in the vowel segment file 12, vowel characteristic parameter data each of which represents a plurality of power spectra can instead be stored in this vowel segment file.

Claims (6)

1. A speech synthesizing apparatus comprising: a data generation circuit (14) for generating phoneme string data; memory means (10 and 12) in which consonant and vowel characteristic parameter data representative of consonant and vowel segments are stored and which has a consonant segment file (10), in which a plurality of consonant characteristic parameter data representative of a plurality of consonant segments, each of which has a consonant portion and a transient segment to a vowel segment, are stored, and a vowel segment file (12), in which a plurality of vowel characteristic parameter data representative of a plurality of steady-state vowel segments are stored; control means (16) for allowing the corresponding consonant and vowel characteristic parameter data to be generated from said memory means (10 and 12) in accordance with said phoneme string data; and synthesizing means (18, 20, 22, 24, 26, 28 and 30) for synthesizing a speech signal on the basis of said consonant and vowel characteristic parameter data from said memory means (10 and 12), and including a parameter data series generation circuit (18, 20 and 24) for generating a series of consonant and vowel characteristic parameter data on the basis of the consonant and vowel characteristic parameter data from said consonant and vowel segment files (10 and 12), and a synthesis circuit (22, 26, 28 and 30) for synthesizing the speech signal on the basis of the parameter data series from said parameter data series generation circuit (18, 20 and 24), characterized in that said vowel segment file (12) further stores a plurality of vowel characteristic parameter data representative of a plurality of coarticulated vowel segments, each of said steady-state and coarticulated vowel segments being formed of one frame of parameter data, said control means (16) generates time length data indicative of a vowel duration in accordance with the phoneme string data from said data generation circuit (14), and said parameter data series generation circuit (18, 20 and 24) includes a repetition circuit (20) which derives the vowel characteristic parameter data from said vowel segment file (12) a number of times corresponding to said time length data.
2. A speech synthesizing apparatus according to claim 1, characterized in that said parameter data series generation circuit further includes: an interpolation circuit (18) for calculating a predetermined number of interpolated characteristic parameter data on the basis of the consonant and vowel characteristic parameter data from said consonant and vowel segment files (10 and 12); and a data selection circuit (24) for sequentially and selectively supplying the characteristic parameter data from said interpolation circuit (18) and said repetition circuit (20) to said synthesis circuit (22, 26, 28 and 30).
3. A speech synthesizing apparatus according to claim 2, characterized in that said data selection circuit is a switching circuit (24) whose switching position is controlled in response to a switching control signal from said control means (16).
4. A speech synthesizing apparatus according to claim 2, characterized in that said data generation circuit (14) generates accent data together with said phoneme string data and said control means (16) generates pitch data in accordance with said accent data, and that said synthesis circuit (22, 26, 28 and 30) synthesizes the speech signal on the basis of the parameter data series from said parameter data series generation circuit (18, 20 and 24) and the pitch data from said control means (16).
5. A speech synthesizing apparatus according to claim 2, characterized in that said synthesis circuit comprises: an interpolator (26) which receives the parameter data series from said parameter data series generation circuit (18, 20 and 24) and calculates a predetermined number of interpolated parameter data on the basis of two successive parameter data; and a synthesizing unit (28) for synthesizing the speech signal on the basis of the parameter data from said interpolator (26).
6. A speech synthesizing apparatus according to claim 5, characterized in that said data generation circuit (14) generates accent data together with said phoneme string data and said control means (16) generates pitch data in accordance with said accent data, and that said synthesis circuit (22, 26, 28 and 30) synthesizes the speech signal on the basis of the parameter data series from said parameter data series generation circuit (18, 20 and 24) and the pitch data from said control means (16).
EP83306228A 1982-10-19 1983-10-14 Speech synthesizing apparatus Expired EP0107945B1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP183410/82 1982-10-19
JP57183410A JPS5972494A (en) 1982-10-19 1982-10-19 Rule snthesization system

Publications (2)

Publication Number Publication Date
EP0107945A1 EP0107945A1 (en) 1984-05-09
EP0107945B1 true EP0107945B1 (en) 1987-03-18

Family

ID=16135290

Family Applications (1)

Application Number Title Priority Date Filing Date
EP83306228A Expired EP0107945B1 (en) 1982-10-19 1983-10-14 Speech synthesizing apparatus

Country Status (3)

Country Link
EP (1) EP0107945B1 (en)
JP (1) JPS5972494A (en)
DE (1) DE3370390D1 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0642158B2 (en) * 1983-11-01 1994-06-01 日本電気株式会社 Speech synthesizer
JPH0756598B2 (en) * 1984-07-25 1995-06-14 株式会社日立製作所 Speech synthesis method of speech synthesizer
JPH0833744B2 (en) * 1986-01-09 1996-03-29 株式会社東芝 Speech synthesizer
JP2577372B2 (en) * 1987-02-24 1997-01-29 株式会社東芝 Speech synthesis apparatus and method
DK46493D0 (en) * 1993-04-22 1993-04-22 Frank Uldall Leonhard METHOD OF SIGNAL TREATMENT FOR DETERMINING TRANSIT CONDITIONS IN AUDITIVE SIGNALS
AU699837B2 (en) * 1995-03-07 1998-12-17 British Telecommunications Public Limited Company Speech synthesis

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3975587A (en) * 1974-09-13 1976-08-17 International Telephone And Telegraph Corporation Digital vocoder
DE2531006A1 (en) * 1975-07-11 1977-01-27 Deutsche Bundespost Speech synthesis system from diphthongs and phonemes - uses time limit for stored diphthongs and their double application
DE3105518A1 (en) * 1981-02-11 1982-08-19 Heinrich-Hertz-Institut für Nachrichtentechnik Berlin GmbH, 1000 Berlin METHOD FOR SYNTHESIS OF LANGUAGE WITH UNLIMITED VOCUS, AND CIRCUIT ARRANGEMENT FOR IMPLEMENTING THE METHOD

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ASSP-24, No. 6, Dec. 76, p. 446 *

Also Published As

Publication number Publication date
JPS5972494A (en) 1984-04-24
EP0107945A1 (en) 1984-05-09
DE3370390D1 (en) 1987-04-23

Similar Documents

Publication Publication Date Title
US4862504A (en) Speech synthesis system of rule-synthesis type
US4692941A (en) Real-time text-to-speech conversion system
EP0886853B1 (en) Microsegment-based speech-synthesis process
US4685135A (en) Text-to-speech synthesis system
US4398059A (en) Speech producing system
EP0059880A2 (en) Text-to-speech synthesis system
US5633984A (en) Method and apparatus for speech processing
EP0239394B1 (en) Speech synthesis system
US5463715A (en) Method and apparatus for speech generation from phonetic codes
EP0107945B1 (en) Speech synthesizing apparatus
US6970819B1 (en) Speech synthesis device
EP0144731B1 (en) Speech synthesizer
US6829577B1 (en) Generating non-stationary additive noise for addition to synthesized speech
van Rijnsoever A multilingual text-to-speech system
JP3771565B2 (en) Fundamental frequency pattern generation device, fundamental frequency pattern generation method, and program recording medium
JP2703253B2 (en) Speech synthesizer
KR100202539B1 (en) Voice synthetic method
JPH0594199A (en) Residual driving type speech synthesizing device
JPS62284398A (en) Sentence-voice conversion system
JP2573586B2 (en) Rule-based speech synthesizer
JP2573585B2 (en) Speech spectrum pattern generator
JP2573587B2 (en) Pitch pattern generator
JPS58168096A (en) Multi-language voice synthesizer
JPS63174100A (en) Voice rule synthesization system
JPH055116B2 (en)

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 19831024

AK Designated contracting states

Designated state(s): DE FR GB NL

RAP1 Party data changed (applicant data changed or rights of an application transferred)

Owner name: KABUSHIKI KAISHA TOSHIBA

GRAA (expected) grant

Free format text: ORIGINAL CODE: 0009210

AK Designated contracting states

Kind code of ref document: B1

Designated state(s): DE FR GB NL

REF Corresponds to:

Ref document number: 3370390

Country of ref document: DE

Date of ref document: 19870423

ET Fr: translation filed
PLBE No opposition filed within time limit

Free format text: ORIGINAL CODE: 0009261

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: NO OPPOSITION FILED WITHIN TIME LIMIT

26N No opposition filed
REG Reference to a national code

Ref country code: GB

Ref legal event code: 746

Effective date: 19980909

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: FR

Payment date: 19981009

Year of fee payment: 16

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: GB

Payment date: 19981016

Year of fee payment: 16

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: DE

Payment date: 19981023

Year of fee payment: 16

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: NL

Payment date: 19981028

Year of fee payment: 16

REG Reference to a national code

Ref country code: FR

Ref legal event code: D6

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: GB

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 19991014

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: NL

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20000501

GBPC Gb: european patent ceased through non-payment of renewal fee

Effective date: 19991014

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: FR

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20000630

NLV4 Nl: lapsed or anulled due to non-payment of the annual fee

Effective date: 20000501

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: DE

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20000801

REG Reference to a national code

Ref country code: FR

Ref legal event code: ST