EP0144731B1

EP0144731B1 - Speech synthesizer

Info

Publication number: EP0144731B1
Application number: EP84113186A
Authority: EP
Inventors: Katsunobu Fushikida
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1983-11-01
Filing date: 1984-11-02
Publication date: 1988-09-07
Also published as: DE3473956D1; JPS6097396A; JPH0642158B2; EP0144731A2; EP0144731A3

Abstract

The invention relates to a speech synthesizer comprising a converting means (20, 30) for converting the input sequence of characters to a sequence of articulation symbols corresponding to a unit speech waveform which is obtained by dividing a diphone, a memory (80) for storing said unit speech waveform corresponding to the predetermined articulation symbols, and a synthesizing means (60, 70) for reading said unit speech waveforms corresponding to said articulation symbols of the converted sequence of articulation symbols from said memory (80) and synthesizing them. The speech synthesizer requires a comparatively small memory capacity and provides synthesized speech of high quality.

Description

This invention relates to a speech synthesizer.
In conventional speech synthesis, appropriate syllable waveforms represented by combinations of vowel-consonant-vowel (VCV) are prepared in advance and connected together. However, since the number of VCV combinations is very large, an enormous memory capacity is required to store them. On the other hand, a method has been proposed in which the waveforms corresponding to the combinations of consonant-vowel (CV) or vowel-consonant (VC), namely demisyllables or diphones, which have a time length of about half that of a single syllable, are prepared in advance, and the waveforms corresponding to the CV or VC required for the synthesized speech are selected and connected together (compiled and synthesized). This method requires less memory capacity than preparing VCV waveforms, but a relatively large memory capacity is still required because of the large quantity of speech waveform information corresponding to CV and VC.
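The memory argument above can be made concrete with a rough count. The following is only an illustrative sketch: the inventory sizes (20 consonants, 5 vowels) are hypothetical and not taken from this document, and the count assumes that every combination actually occurs.

```python
def unit_counts(num_consonants: int, num_vowels: int) -> dict:
    """Count the waveforms to be stored per unit type, assuming all combinations occur."""
    return {
        "VCV": num_vowels * num_consonants * num_vowels,
        "CV+VC": 2 * num_consonants * num_vowels,
    }

# Hypothetical inventory sizes, chosen only for illustration.
counts = unit_counts(num_consonants=20, num_vowels=5)
print(counts)
```

Under these assumptions the CV/VC inventory is smaller than the VCV inventory, yet it still grows with the product of the consonant and vowel counts.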
The document "Proceedings of the Seminar on Pattern Recognition, Vol. 1, Nov. 1977, 4.4.1 to 4.4.6" and the EP-A-0 058 130 disclose the transcription of alphanumeric characters to synthesize speech from elements stored in a memory. The first-mentioned document describes also semi-diphones. A unit speech waveform obtained by division of a diphone is, however, not specified in these documents.
Accordingly, it is an object of the invention to provide a speech synthesizer which requires a comparatively small memory capacity in respect of speech data such as speech waveforms to be prepared in advance.
It is another object of the invention to provide a speech synthesizer which has the above advantage and by which synthesized speech of high quality can be obtained.
According to the present invention, there is provided a speech synthesizer comprising a converting means for converting the input sequence of characters to a sequence of articulation symbols corresponding to a unit speech waveform which is obtained by dividing a diphone, a memory for storing the unit speech waveform corresponding to the predetermined articulation symbols and a synthesizing means for reading the unit speech waveforms corresponding to the articulation symbols of the converted sequence of articulation symbols from the memory and synthesizing them.
This speech synthesizer is characterized by an interpolation method determining means for determining an interpolation method on the basis of the speech part of input characters corresponding to the output of said converting means; and an interpolating means for interpolating the unit speech waveform read from said memory on the basis of the determined interpolation method; furthermore, said interpolation method determining means directly connects the two read unit speech waveforms when said input speech part of the input characters is unvoiced (as well as silence), and determines a predetermined first interpolation method when said input speech part of the input characters is voiced. The same unit waveform is used both for a voiced and its corresponding unvoiced phoneme.
These and other objects and features of the present invention will become clear from the following description of a preferred embodiment of the present invention with reference to the accompanying drawings.

Brief description of the drawings

Fig. 1 is a block diagram showing the structure of an embodiment of a speech synthesizer according to the invention;
Fig. 2 is a table of information of the synthesizer shown in Fig. 1 which is stored in a memory 32 of a phoneme symbol/articulation symbol converting part 30;
Fig. 3 illustrates the concept of the articulatory organs of the human body for explaining the principle of the invention;
Figs. 4A and 4B show examples of articulatory segments for explaining the principle of the invention;
Fig. 5 shows waveforms interpolated by a synchronous pitch method used in the present invention; and
Fig. 6 is a waveform of synthesized speech formed by compiling and synthesizing waveforms of articulation element pieces.

Description of the preferred embodiment

Referring to Fig. 1, the speech to be synthesized is first specified with a keyboard 10. The keyboard 10 generates a sequence of character signals, a stress strength signal (three-levelled in this embodiment) and a boundary signal between speeches. Hereinunder, the structure and performance of the speech synthesizer shown in Fig. 1 will be described on the assumption that the speech to be synthesized is "kite".
Now, the alphabetical character sequence signals representing "kite" are generated by pushing keys "K", "I", "T" and "E". The boundary signal B indicating a boundary such as the beginning, ending and pause of the word "kite" and the stress strength signal S_T are also supplied to a character/phoneme symbol converting circuit 20 together with the character sequence signal. The stress strength is determined based on the pitch and strength of each syllable; for example, a high stress strength indicates a high pitch frequency. The converting circuit 20 has a processing part 21 and a memory 22. The memory 22 stores the phoneme symbols, prepared in advance, corresponding to the speech. For example, the phoneme symbol /kait/ is stored in correspondence with "kite". The processing part 21 supplies address information to the memory 22 in response to the input character sequence signal. The phoneme symbol signal /kait/ is then read from the memory 22 and supplied to a phoneme symbol/articulation symbol converting circuit 30. The converting circuit 30 has, like the converting circuit 20, a processing part 31 and a memory 32. The memory 32 stores articulation symbols (each determined by the phonemes located therebefore and thereafter) which are determined in advance corresponding to the phoneme symbols by the method peculiar to the present invention, which will be described in the following.
The articulatory organs of a human being include vocal cords, a tongue, lips, a velum palatinum, etc., as shown in Fig. 3, and various speech is generated by controlling these articulatory organs in accordance with nerve pulse signals. Therefore, if two articulations of the articulatory organs are similar, two similar speech waveforms are generated. Further, it is apparent that if the articulation parameter values representing the movements of these articulatory organs are close to each other, the generated speech waveforms are analogous. As described above, in the conventional synthesizing method of the CV, VC waveform connecting type, many speech waveforms corresponding to CV and VC are prepared, but from the viewpoint of the movement of the articulation parameters, considerably redundant waveforms are included therein. For example, in the CV, VC waveform connecting type method, the speech waveform corresponding to a phoneme /ka/ and that corresponding to a phoneme /ga/ are prepared separately. However, the movement of the articulatory organs for /ka/ and that for /ga/ are very similar. The relationship between the tongue, palate, etc. is almost the same, and the main difference is in whether the vocal cords are vibrating or not (voiced or unvoiced) in the consonant parts. Therefore, in the voiced section after the unvoiced section of the consonant part /k/ in /ka/ (the section shifting to the normal part of the vowel /a/, which corresponds to C in Fig. 4A), the articulation parameter is almost the same as that of /ga/ (C in Fig. 4B), whose waveform can take the place of the partial waveform of /ka/ in that section with a fairly good approximation. It is clear that in the pairs /kV/-/gV/, /tV/-/dV/ and /pV/-/bV/ (V represents a vowel) also, the waveforms in the part shifting to the vowels can be shared.
In Figs. 4A and 4B, part A is the silent part at the beginning of /ka/ or /ga/ (represented as ^*), part B the waveform of "k" in /ka/ or "g" in /ga/, B' the waveform of the part affected by the phoneme following "g" in /ga/, and C and D are, as described above, the speech waveforms of the vowels "a" following the consonant of /ka/ and /ga/.
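The sharing of consonant-to-vowel transients between the pairs /kV/-/gV/, /tV/-/dV/ and /pV/-/bV/ can be sketched as a simple symbol mapping. This is a minimal illustration, not the stored table of the embodiment: the function name and the five-vowel set are assumptions, and only the "(g)a" style of notation is taken from the text.

```python
# Unvoiced explosive -> voiced counterpart whose transient waveform is reused.
VOICED_COUNTERPART = {"k": "g", "t": "d", "p": "b"}

def cv_transient_symbol(consonant: str, vowel: str) -> str:
    """Return the symbol of the consonant-to-vowel transient, shared within each pair."""
    base = VOICED_COUNTERPART.get(consonant, consonant)
    return f"({base}){vowel}"

stops = ["p", "t", "k", "b", "d", "g"]
vowels = ["a", "i", "u", "e", "o"]  # hypothetical vowel set

# Transients needed if every consonant-vowel pair kept its own waveform.
separate = len(stops) * len(vowels)
# Transients needed when each unvoiced stop borrows from its voiced counterpart.
shared = len({cv_transient_symbol(c, v) for c in stops for v in vowels})
print(separate, shared)
```

With these assumed inventories, sharing halves the number of consonant-to-vowel transients that have to be stored for the explosives.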
Here, a time section which is determined in consideration of the manner of articulation, is shorter than a CV or VC waveform, and can be substituted by a speech waveform based on a different phoneme series, as shown in Figs. 4A and 4B, is called an articulation segment, and the speech waveform in an articulation segment is called an articulation element piece waveform. That is, the syllables /ka/ and /ga/ are divided into the time sections B and C so that the transient parts of those syllables can be reused in synthesizing other speech.
As described above, articulation segments whose manner of articulation is the same are represented by the same articulation symbol, and the articulation element piece waveform corresponding to this articulation symbol is stored in the memory 32 in advance. In this way, in the memory 32, the articulation symbols corresponding to a sequence of phoneme symbols are stored in advance. Fig. 2 shows the classified articulation symbols, in which ^* represents the silent part which is placed at the beginning of speech or immediately before an explosive, "p", "t", "k" represent explosive parts, and (b)a, (d)a, (g)a represent transient parts of the vowel "a" which follow the consonants "b", "d", "g". On the other hand, i(b), i(d), i(g) represent the transient parts of the vowel "i" which precede the consonants "b", "d", "g", and ai, au, ao represent the transient parts where the vowel "a" is followed by the vowels "i", "u" and "o".
Now returning to Fig. 1, in response to an address corresponding to the phoneme signal /kait/ sent from the processing part 21, a sequence of the articulation symbols "^*", "k", "(g)a", "ai", "i(d)", "^*", "t" corresponding to the phoneme signal /kait/ is read as an articulation signal from the memory 32 in the phoneme symbol/articulation symbol converting circuit 30. Here, "^*" represents a silent part described above (#1 in Fig. 2), "k" and "t" explosive parts of /k/ and /t/ (#2, #6 in Fig. 2), "(g)a" a transient part shifting from the consonant to the vowel of /ga/ (#3), "ai" a transient part of the vowel link /ai/ (#4) and "i(d)" a transient part shifting from the vowel to the consonant of /id/ (#5), respectively. In this example, /ka/ in the phoneme symbol /kait/ is substituted by a silent explosive "k" and "(g)a" representing the transient part shifting from the consonant to the vowel of the phoneme symbol /ga/ resembling /ka/. The phoneme symbol /it/ is substituted by a transient part "i(d)" shifting from the vowel to the consonant of the phoneme symbol /id/ resembling /it/, and a silent part "^*" is placed immediately before the silent explosive "t".
As described above, synthesizing speech by using, in place of /ka/ and /it/, waveforms taken from /ga/ and /id/, whose phoneme sequences differ from but whose articulation is similar to those of /ka/ and /it/, dispenses with the need to store the transient parts of /ka/ and /it/ in advance and enables a reduction in the memory capacity. These articulation element piece waveforms can be easily obtained from, for example, waveforms of uttered speech.
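The two conversion stages for the "kite" example can be sketched as follows. In the embodiment both stages are table lookups in the memories 22 and 32; the rule-based second stage below is a hypothetical stand-in that merely reproduces the substitutions described for /kait/, and the vowel set, the function name and the restriction to vowels and unvoiced explosives are assumptions.

```python
WORD_TO_PHONEMES = {"kite": "kait"}  # stands in for the contents of memory 22
VOWELS = set("aiueo")                # hypothetical vowel set
VOICED_COUNTERPART = {"k": "g", "t": "d", "p": "b"}
SILENCE = "^*"

def to_articulation_symbols(phonemes: str) -> list:
    """Convert a phoneme string to articulation symbols (illustrative rules only)."""
    symbols = []
    for pos, cur in enumerate(phonemes):
        nxt = phonemes[pos + 1] if pos + 1 < len(phonemes) else None
        if cur in VOICED_COUNTERPART:
            # Silent part immediately before the explosive, then the explosive itself.
            symbols += [SILENCE, cur]
            if nxt in VOWELS:
                # Transient into the vowel, borrowed from the voiced counterpart.
                symbols.append(f"({VOICED_COUNTERPART[cur]}){nxt}")
        elif cur in VOWELS:
            if nxt in VOWELS:
                symbols.append(cur + nxt)  # vowel link such as "ai"
            elif nxt in VOICED_COUNTERPART:
                symbols.append(f"{cur}({VOICED_COUNTERPART[nxt]})")
        else:
            raise ValueError(f"phoneme {cur!r} is outside this sketch")
    return symbols

kite_symbols = to_articulation_symbols(WORD_TO_PHONEMES["kite"])
print(kite_symbols)
```

The sketch stores no transient for /ka/ or /it/: both are expressed through the "(g)a" and "i(d)" symbols, which is where the memory saving comes from.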
The articulation signal thus obtained is supplied to a waveform address generation circuit 50. The waveform address generation circuit 50 reads the articulation element piece waveform corresponding to each articulation symbol contained in the articulation signal and to the stress signal S_T from an articulation waveform memory which is selected by the stress signal S_T from among memories 80a, 80b and 80c included in an articulation waveform memory 80. In other words, the articulation element piece waveform is read from the memory 80 on the basis of the address corresponding to each articulation symbol. The stress signal S_T from the processing part 21 is detected in a stress strength detection circuit 40, and the articulation element piece waveform corresponding to the detected stress strength is read from the memory 80. The articulation waveform memory 80 stores the articulation element piece waveforms corresponding to the articulation symbols shown in Fig. 2.
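The selection of a waveform by articulation symbol and stress level can be pictured as a two-level lookup. The sketch below is hypothetical throughout: the sample values are placeholders rather than real speech data, and the assignment of stress levels 1 to 3 to the memories 80a, 80b and 80c is an assumption.

```python
# Placeholder sample values only; real entries would hold recorded waveform samples.
ARTICULATION_WAVEFORM_MEMORY = {
    1: {"^*": [0, 0, 0, 0], "k": [3, -2, 1, 0]},  # stands in for memory 80a
    2: {"^*": [0, 0, 0, 0], "k": [6, -4, 2, 0]},  # stands in for memory 80b
    3: {"^*": [0, 0, 0, 0], "k": [9, -6, 3, 0]},  # stands in for memory 80c
}

def read_waveform(symbol: str, stress_level: int) -> list:
    """Select the bank by stress level, then read the waveform for the symbol."""
    bank = ARTICULATION_WAVEFORM_MEMORY[stress_level]
    return bank[symbol]

print(read_waveform("k", 2))
```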
An interpolation method selection circuit 60 judges whether the articulation symbol (two continuous waveforms) from the phoneme symbol/articulation symbol converting circuit 30 is voiced or unvoiced. The interpolation circuit 70 is controlled by the result of this judgment to perform the following interpolation: when the articulation symbol is unvoiced (as well as silence), the two continuous articulation element piece waveforms read from the memory 80 are directly connected, and when the articulation symbol is voiced, these waveforms are interpolated, for example, synchronously with a pitch.
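The decision made at each junction between two consecutive waveforms can be sketched as below. The text does not state how a junction between one voiced and one unvoiced symbol is treated; this sketch assumes that interpolation is used only when both neighbours are voiced. The set of unvoiced symbols and the function name are likewise assumptions, and the symbol sequence is the one described for /kait/.

```python
# Symbols treated as unvoiced or silent in this sketch (an assumption).
UNVOICED_OR_SILENT = {"^*", "k", "t", "p"}

def connection_methods(symbols: list) -> list:
    """Choose 'direct' or 'interpolate' for each pair of consecutive symbols."""
    methods = []
    for left, right in zip(symbols, symbols[1:]):
        if left in UNVOICED_OR_SILENT or right in UNVOICED_OR_SILENT:
            methods.append("direct")       # unvoiced or silent: connect directly
        else:
            methods.append("interpolate")  # voiced on both sides: interpolate
    return methods

sequence = ["^*", "k", "(g)a", "ai", "i(d)", "^*", "t"]
methods = connection_methods(sequence)
print(methods)
```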
Generally, direct connection of the articulation waveforms makes the synthesis unnatural because of the discontinuous change of pitch or spectrum. To eliminate this drawback, in this invention, any spoken word is synthesized by connecting articulation waveforms having several levels of pitch through pitch-synchronous interpolation between the waveforms. For example, as shown in Fig. 5, if one pitch period of waveform (element piece waveform) at the connected ending part of a temporally preceding unit speech waveform is f(n), its time length (pitch period) is N_f, the element piece waveform at the connected beginning part of a succeeding unit speech waveform is g(n), its time length (pitch period) is N_g, and the element piece waveform in the i-th section of the interpolation waveform of k pitch sections is h_i(n), then h_i(n) is generated on the basis of the following formulae:

N_i, namely the time length of h_i(n), is assumed to be the value obtained by interpolating N_f and N_g. In this case, when N_f and N_g are shorter than N_i, the final sample value of the waveform may be repeated, and when N_f and N_g are longer than N_i, the surplus waveform may be discarded. In this way, a continuous articulation element piece waveform (in this example, a digital waveform) corresponding to the input sequence of characters is supplied to a D/A converter 90, where the interpolated synthesized articulation waveform is converted to an analogue waveform and output as synthesized speech. The waveform of the synthesized speech obtained in this way is shown in Fig. 6.
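A pitch-synchronous interpolation of this kind can be sketched as follows. The weighting formulae are not reproduced in this text, so the sketch assumes a plain linear weighting w_i = i/(k+1) for both the sample values and the period length; this is an assumption for illustration, not the formula of the embodiment. Only the length adjustment (repeating the final sample, discarding the surplus) follows the description. All function names are hypothetical.

```python
def fit_length(period: list, length: int) -> list:
    """Repeat the final sample if the period is too short, discard the surplus if too long."""
    if len(period) >= length:
        return period[:length]
    return period + [period[-1]] * (length - len(period))

def interpolate_periods(f: list, g: list, k: int) -> list:
    """Build k intermediate pitch periods between f (preceding) and g (succeeding)."""
    n_f, n_g = len(f), len(g)
    result = []
    for i in range(1, k + 1):
        w = i / (k + 1)                        # assumed linear weight
        n_i = round((1 - w) * n_f + w * n_g)   # interpolated period length N_i
        f_i = fit_length(f, n_i)
        g_i = fit_length(g, n_i)
        result.append([(1 - w) * a + w * b for a, b in zip(f_i, g_i)])
    return result

f = [0.0, 4.0, 0.0, -4.0]            # one pitch period, N_f = 4 (placeholder samples)
g = [0.0, 2.0, 4.0, 2.0, 0.0, -2.0]  # one pitch period, N_g = 6 (placeholder samples)
middle = interpolate_periods(f, g, k=1)
print(middle)
```

With k = 1 the single intermediate period has length 5, halfway between N_f and N_g, and each sample is the mean of the length-adjusted f and g.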
As described above, this invention uses a unit of speech which is shorter in time than a unit speech waveform such as the CV and VC waveforms of the CV, VC waveform compiling type synthesizing method. It therefore not only requires a small waveform memory capacity but also reflects exactly the articulation of the articulatory organs, so that synthesized speech of high quality is obtained.
In the embodiment described above, an articulation element piece waveform corresponding to an articulation symbol is compiled and synthesized, but it is clear that the reduction in memory capacity is also possible when this invention is applied to a synthesizing method using what is called a "characteristic parameter" such as a formant parameter.

Claims

1. A speech synthesizer comprising:

a) a converting means (20, 30) for converting the input sequence of characters to a sequence of articulation symbols corresponding to a unit speech waveform which is obtained by dividing a diphone;

b) a memory (80) for storing said unit speech waveform corresponding to the predetermined articulation symbols; and

c) a synthesizing means (40, 50, 60, 70, 80) for reading said unit speech waveforms corresponding to said articulation symbols of the converted sequence of articulation symbols from said memory (80) and synthesizing them, characterized in that

d) an interpolation method determining means (60) determines an interpolation method on the basis of the speech part of input characters corresponding to the output of said converting means (30),

e) an interpolating means (70) interpolates the unit speech waveform read from said memory (80) on the basis of the determined interpolation method,

f) said interpolation method determining means (60) directly connects the two read unit speech waveforms when said input speech part of the input characters is unvoiced (as well as silence), and determines a predetermined first interpolation method when said input speech part of the input characters is voiced, and that

g) the same unit waveform is used both for a voiced (g, d, b) and its corresponding unvoiced (k, t, p) phoneme.

2. A speech synthesizer according to claim 1, wherein said unit speech waveform of the vowel part which follows a consonant and is influenced by said consonant is stored in said memory (80).

3. A speech synthesizer according to claim 1 or 2, wherein said unit speech waveform of the vowel part preceding a consonant is stored in said memory (80).

4. A speech synthesizer according to any of claims 1 to 3, further comprising an input means (10) for outputting said input sequence of characters.

5. A speech synthesizer according to claim 4, wherein said input means is a keyboard (10).

6. A speech synthesizer according to any of claims 1 to 5, wherein said converting means (20, 30) includes: a first converting circuit (20) for converting said input sequence of characters to a sequence of phonemes; and a second converting circuit (30) for converting the converted sequence of phonemes to said sequence of articulation symbols.

7. A speech synthesizer according to any of claims 4 to 6, wherein said input means also generates a stress signal (S_T), which represents the stress strength of the unit speech waveform corresponding to said articulation symbols.

8. A speech synthesizer according to claim 7, wherein said unit speech waveform is stored for each unit of said stress strength in said memory (80), and said synthesizing means (50) reads said unit speech waveform corresponding to said stress signal from said memory (80) and synthesizes them.

9. A speech synthesizer according to any of claims 1 to 8, wherein said first interpolation method executed by said interpolation means (70) determines interpolation waveform h_i(n) by the following formulae on the basis of one pitch period of waveform at the connected ending part of a temporally preceding unit speech waveform f(n), its time length N_f, the element piece waveform at the connected beginning part of a succeeding unit speech waveform g(n), and its time length N_g:

and determines the time length of said interpolation waveform N_i by interpolating N_f and N_g.

10. A speech synthesizer according to any of claims 1 to 9, further comprising a D/A converting means (90) which is connected to the output of said synthesizing means (40, 50, 60, 70, 80).