CA1235814A - Voice synthesizing system

Voice synthesizing system

Info

Publication number
CA1235814A
CA1235814A
Authority
CA
Canada
Prior art keywords
voice
parameter
pitch
output
parameters
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired
Application number
CA000489787A
Other languages
French (fr)
Inventor
Fujio Nakagawa
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NEC Corp
Original Assignee
NEC Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NEC Corp filed Critical NEC Corp
Application granted granted Critical
Publication of CA1235814A publication Critical patent/CA1235814A/en
Expired legal-status Critical Current

Landscapes

  • Electrophonic Musical Instruments (AREA)

Abstract

ABSTRACT OF THE DISCLOSURE
A voice synthesizing system for synthesizing a voice from voice parameters containing acoustic parameters which express vocal tract transmission characteristics and voice source parameters which express the intensity and type of a sound source and from pitch information is described. The synthesizer includes a voice parameter input for supplying the voice parameters for predetermined frames and a non-voice parameter generator for generating parameters to synthesize a non-voice sound. An internal status detector receives connecting data containing the number of non-voice frames, the number of voice parameter frames and the number of parameter interpolation frames and counts each of the numbers to detect the internal status of the synthesizer. A parameter selector selects either an output from the voice parameter input or an output from the non-voice parameter generator in response to an output from the internal status detector. A parameter interpolator interpolates an output from the parameter selector in response to the output of the internal status detector. A pitch interpolator performs the interpolation of a pitch on the basis of pitch information having a pitch value at an arbitrary point and its interpolation period. A voice is synthesized in response to the output of the parameter interpolator and pitch interpolator.

Description


VOICE SYNTHESIZING SYSTEM

BACKGROUND OF THE INVENTION

Field of the Invention

This invention relates to a voice synthesizing system using as a synthesis unit a voice element smaller than a word, or a unit proximal thereto, such as a consonant-vowel (CV) syllable, a vowel-consonant-vowel (VCV) syllable, etc.

Prior Art

Conventionally, there have been known a method of synthesizing an arbitrary word or a sentence from the most basic element of voice, or phoneme (an element corresponding to a phonetic sign such as a, i, p, k, etc.), and a method of performing such synthesis on the basis of a composite element of phonemes such as a diphone, which is a group of two concatenated phonemes (vowel-consonant and consonant-vowel), or such as a VCV, which is a group of three concatenated phonemes (vowel-consonant-vowel). Among these methods, it has been found that the method of synthesizing from a composite element of phonemes is easier to put in practice than the method of synthesizing from a phoneme. US Patent No. 3,632,887, for instance, discloses a voice synthesizer of such a diphone method.

In the voice synthesizer of this type, voice parameters having acoustic parameters such as the PARCOR (partial auto-correlation) coefficient, the LSP (line spectrum pair) coefficient, etc., which represent the characteristics of vocal tract transmission, and source parameters, which represent the type and intensity of a sound source, are stored in the unit of CV syllables or VCV syllables. For synthesizing a voice, designated voice parameters are read out and processed for interpolation between the voice parameters or for insertion of non-voice parameters, so that the connection of the parameters may be done. Pitch information, on the other hand, is fed depending on the sentence structure to be synthesized and separately from the voice parameters.
However, such a conventional voice synthesizer needs inputs of voice parameters and pitch information at every frame cycle. For this reason, if the voice synthesizer of this type is used for voice synthesis of an arbitrary vocabulary, a complicated controller is indispensable to the control of all the operations for generation of a non-voice parameter, connection of voice parameters by interpolation and computation of pitch information for every frame cycle. Those operations should also be done on a real-time basis.
An object of this invention is, therefore, to provide a voice synthesizing system for eliminating loads imposed on the controller, by performing processes such as generation of non-voice parameters, interpolation of voice parameters and interpolation of pitch information on the side of the voice synthesizer.
Another object of this invention is to provide a voice synthesizing system adaptable to plural connecting methods.

BRIEF SUMMARY OF THE INVENTION
According to one aspect of the invention, there is provided a voice synthesizing system for synthesizing a voice from voice parameters containing acoustic parameters which express vocal tract transmission characteristics and voice source parameters which express the intensity and type of a sound source and from pitch information.
The synthesizer comprises: voice parameter input means for supplying said voice parameters for predetermined frames; non-voice parameter generating means for generating parameters to synthesize a non-voice sound; internal status detecting means for receiving connecting data containing the number of non-voice frames, the number of voice parameter frames and the number of parameter interpolation frames and for counting each of said numbers to detect the internal status of said synthesizer; parameter selecting means for selecting either an output from said voice parameter input means or an output from said non-voice parameter generating means in response to an output from said internal status detecting means; parameter interpolating means for interpolating an output from said parameter selecting means in response to the output of said internal status detecting means; pitch interpolating means for interpolating on the basis of pitch information having a pitch value at an arbitrary point and its interpolation period; and means for producing a synthesis voice in response to said outputs of said parameter interpolating means and said pitch interpolating means.

BRIEF DESCRIPTION OF THE DRAWINGS
The present invention will be described in conjunction with the accompanying drawings, in which:
Fig. 1 is a block diagram of an embodiment of the present invention;
Fig. 2 is a detailed block diagram of a part of Fig. 1;
Fig. 3 is a diagram of the structure of a voice parameter for voice synthesis for one frame;
Fig. 4 illustrates the structure of a connecting unit;
Fig. 5 is a diagram of the structure of connecting data;
Fig. 6 is a diagram of the structure of pitch information;
Fig. 7 is an operation flow chart of a part of Fig. 2;
Figs. 8 to 10 illustrate the connecting data respectively when this invention is applied to the diphone method, CVC method and CV method; and Figs. 11 and 12 are diagrams of the structure of input data for voice synthesis by each of the methods of Figs. 8 to 10.
In the drawings, the same reference numerals denote the same structural elements.

DETAILED DESCRIPTION OF THE EMBODIMENT
Referring now to Fig. 1, an embodiment of the invention includes a voice synthesizer 10, a processor (CPU) 20, a voice parameter memory 30, and a keyboard 40.
These structural elements are respectively connected to an address bus 50 and a data bus 60. The voice synthesizer 10 has registers 11 and 12 for storing voice parameters and connecting data, respectively, and voice synthesis means 13. The means 13 is made up of parameter connecting means 130, a pitch interpolator 140 and a voice synthesis filter 150.
Referring to Fig. 2, the parameter connecting means 130 includes a voice parameter input register 131 for receiving voice parameters from the register 11 (Fig. 1) via a terminal T1, a non-voice parameter generator 132 for generating a parameter corresponding to the voice parameter for synthesizing a non-voice sound, and an internal status detector 133 for inputting connecting data from the register 12 (Fig. 1) via a terminal T2. This data contains the number of non-voice frames for indicating the non-voice time length, the number of voice parameter frames for indicating the time length where the voice parameters exist, and the number of parameter interpolation frames for indicating the time length for interpolating voice parameters over frames; the detector 133 counts each of said numbers to detect the internal status of the connecting means 130. The means 130 further includes a parameter selector 134 for selecting either the output of the register 131 or the output of the generator 132 under the control of the detector 133, and a parameter interpolator 135 for interpolating the output of the selector 134 based upon the output of the detector 133.
The operation of an embodiment of this invention will now be described referring to Figs. 1 through 7 below.
Referring to Fig. 3, a voice parameter for voice synthesis of one frame comprises the type and intensity of a sound source and an acoustic parameter. A synthesis unit is composed of voice parameters of several frames to several tens of frames. Many synthesis units are previously stored in the memory 30. Referring to Fig. 4, a connecting unit is composed of a non-voice frames part 41, a voice parameter frames part 42, and a parameter interpolation frames part 43. The CPU 20 (Fig. 1) computes connecting data (Fig. 5), which expresses the numbers of frames M, F and I of each of the parts 41, 42 and 43 for every connecting unit, based upon text information inputted from the keyboard 40. The CPU 20 also computes the value of pitch and the number of frames for pitch interpolation as the data on the pitch of the voice (Fig. 6) from the text information. The CPU 20 gives address data to the memory 30 via the address bus 50 to read out corresponding voice parameters. The parameters read from the memory 30 are fed to the register 11 via the data bus 60. The CPU 20 further gives the connecting data and the pitch information to the register 12. In this embodiment, since the connecting data is necessary for each connecting unit, the data is set in the register 12 from the CPU 20 once at the initial stage for the convenience of data transmission. The voice parameters in the necessary amount are transferred from the memory 30 to the register 11. When such transfer has been completed, the next connecting data is set at the register 12 from the CPU 20. The pitch information is inputted to the register 12 in synchronism with the voice parameters.
In this embodiment, the pitch information is set in the register 12 at the same timing as the connecting data and is not set when not necessary.
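For illustration only, the per-frame voice parameter of Fig. 3, the connecting data of Fig. 5 and the pitch information of Fig. 6 can be modeled roughly as follows. This is a minimal Python sketch; the class and field names are illustrative assumptions, not the patent's notation.

from dataclasses import dataclass, field
from typing import List

@dataclass
class VoiceParameter:          # one frame, as in Fig. 3
    source_type: int           # type of sound source: voiced or non-voice
    source_intensity: float    # intensity of the sound source
    acoustic: List[float] = field(default_factory=list)  # e.g. PARCOR or LSP coefficients

@dataclass
class ConnectingData:          # one connecting unit, as in Fig. 5
    m: int                     # number of non-voice frames (part 41)
    f: int                     # number of voice parameter frames (part 42)
    i: int                     # number of parameter interpolation frames (part 43)

@dataclass
class PitchInfo:               # as in Fig. 6
    pitch: float               # pitch value at an arbitrary point
    interp_frames: int         # number of frames for pitch interpolation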
Referring again to Fig. 2, the connecting data is fed to the detector 133 via the terminal T2. The detector 133 detects the status of each of said parts 41, 42 and 43 shown in Fig. 4 from the connecting data, which contains the number of non-voice frames M, the number of voice parameter frames F and the number of parameter interpolation frames I. The detector 133 gives directions on parameter selection and parameter interpolation corresponding to each part to the selector 134 and the interpolator 135.
The operation of the connecting means 130 will be described in more detail with reference to Figs. 2 and 7.
The connecting data M, F and I are inputted to the detector 133 from the terminal T2 (Step 10). The detector 133 checks whether or not the number of non-voice frames M is zero (Step 11). If M is not zero, the detector 133 judges that it is a non-voice frame part and directs the selector 134 to select the generator 132 for obtaining a non-voice parameter (Step 12). Moreover, the detector 133 directs the interpolator 135 to interpolate within one frame (Step 13) and subtracts 1 from M (Step 14).
The detector 133 waits until the parameter interpolations are completed for the number of times it previously directed the interpolator 135 (Step 15), and returns to the checking operation on M (Step 11). The operation is repeated until M equals zero. When M becomes zero, the detector 133 checks F (Step 21). If F is not zero, the detector 133 judges that it is a voice parameter frame part, and directs the selector 134 to select the register 131 (Step 22). The detector 133 directs the interpolator 135 to interpolate in one frame (Step 23) and subtracts 1 from F (Step 24). Then, like the operation on M, the detector 133 waits until the parameter interpolations are completed in the number of times designated to the interpolator 135 (Step 15), and returns to the checking operation on M (Step 11). This operation is repeated until F equals zero. When F becomes zero, the detector 133 proceeds to check I (Step 31). If I is not zero, the detector 133 judges that it is a parameter interpolating section, and directs the selector 134 to select the register 131 (Step 32) for input of the voice parameter of the next connecting unit. The detector 133 directs the interpolator 135 to do interpolation over I frames (Step 33).
When I becomes zero (Step 34), the detector 133 waits until the parameter interpolation is completed over the I frames directed to the interpolator 135 (Step 15) and returns to the checking operation on M. When all of M, F and I become zero by the above operation, the detector returns to the input of connecting data (Step 10).
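The flow of Fig. 7 thus amounts to a small state machine that counts M, F and I down to zero. The following Python sketch restates Steps 10 through 34; the select_generator, select_register and interpolate hooks are hypothetical stand-ins for the selector 134 and the interpolator 135, so this is a reading of the flow chart rather than the patented circuit.

def run_connecting_unit(cd, selector, interpolator):
    # Step 10: connecting data M, F and I received via terminal T2
    m, f, i = cd.m, cd.f, cd.i
    while m > 0:                            # Steps 11-15: non-voice frame part 41
        selector.select_generator()         # non-voice parameter from generator 132 (Step 12)
        interpolator.interpolate(frames=1)  # interpolate within one frame (Step 13)
        m -= 1                              # Step 14; Step 15 waits for completion
    while f > 0:                            # Steps 21-24: voice parameter frame part 42
        selector.select_register()          # voice parameter from register 131 (Step 22)
        interpolator.interpolate(frames=1)  # Step 23
        f -= 1                              # Step 24
    if i > 0:                               # Steps 31-34: parameter interpolation part 43
        selector.select_register()          # target: first parameter of the next unit (Step 32)
        interpolator.interpolate(frames=i)  # interpolate over I frames (Step 33)
    # M, F and I all zero: return for the next connecting data (Step 10)

Step 15's wait for completion is implicit here: each interpolate call is taken to block until the directed number of frames has been produced.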
The selector 134 selects either the voice parameter inputted in the register 131 or the non-voice parameter retained in the generator 132 and sends it to the interpolator 135.
The interpolator 135 sets the value of the parameter inputted from the selector 134 as the target value, does interpolation on the parameters for the number of times directed by the detector 133 and sends the resultant parameters to the voice synthesis filter 150. However, the interpolation is conducted only on the source power and acoustic parameters out of the voice parameters, and not on the type-of-source parameter which expresses voice type or non-voice type.
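A natural reading of this interpolation is a per-frame linear ramp from the current parameter values toward the newly selected target, with the type-of-source field copied rather than interpolated. The sketch below, assuming the VoiceParameter class given earlier, makes that assumption explicit; the patent does not fix the interpolation formula.

def interpolate_frames(current, target, n_frames):
    """Yield one VoiceParameter per frame, moving linearly from
    current toward target over n_frames frames."""
    for k in range(1, n_frames + 1):
        t = k / n_frames
        yield VoiceParameter(
            # the voice/non-voice distinction is discrete, so it is copied, not interpolated
            source_type=target.source_type,
            source_intensity=(1 - t) * current.source_intensity + t * target.source_intensity,
            acoustic=[(1 - t) * c + t * g
                      for c, g in zip(current.acoustic, target.acoustic)],
        )

The same per-frame ramp, applied to the scalar pitch value over the interpolation period of Fig. 6, would serve for the pitch interpolator 140 described next.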
Referring to Fig. 2, the pitch information inputted from the CPU 20 to the register 12 is sent to the interpolator 140 via a terminal T3. The interpolator 140 interpolates the pitch on the basis of the pitch information shown in Fig. 6, which expresses the pitch value and the number of pitch interpolation frames for each frame. The result of the pitch interpolation is sent to the filter 150.
The filter 150 produces a synthesis voice in response to the outputs of the connecting means 130 and the interpolator 140. The filter 150 includes a voice source generator 151 and a filter 152. The generator 151 generates pulses based on the type of source of the voice parameter, which indicates voice or non-voice type.
If the type of source indicates voice type, the generator 151 generates pulses having an amplitude equal to the intensity of source of the voice parameter at every pitch cycle inputted from the interpolator 140. If the type of source indicates non-voice type, the generator 151 generates pulses having a pseudo-random pattern where the intensity of source is the maximum amplitude, irrespective of the pitch cycle. The filter 152 is driven by the pulses inputted from the generator 151 to perform a filter operation using the acoustic parameters of the voice parameters as coefficients, and outputs the synthesis result to a terminal OUT.
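In modern terms the filter 150 is a source-filter synthesizer: a pulse train at the pitch period excites the filter for voiced frames, and pseudo-random noise excites it for non-voice frames, with the acoustic parameters serving as filter coefficients. The sketch below assumes a direct-form all-pole filter and a simple encoding of the type of source; a real PARCOR implementation would instead use a lattice structure, so this is illustrative only.

import random

VOICED = 1  # assumed encoding of the voiced type of source

def excitation(frame, pitch_period, n_samples):
    """Excitation for one frame: pulses at every pitch cycle if voiced,
    pseudo-random noise at the source intensity if non-voice."""
    if frame.source_type == VOICED:
        # pulses of amplitude equal to the intensity of source, one per pitch cycle
        return [frame.source_intensity if n % pitch_period == 0 else 0.0
                for n in range(n_samples)]
    # non-voice: pseudo-random pattern, maximum amplitude = intensity of source
    return [frame.source_intensity * random.uniform(-1.0, 1.0)
            for _ in range(n_samples)]

def synthesis_filter(source, coeffs, memory):
    """Direct-form all-pole filter: y[n] = x[n] + sum_k a_k * y[n-k]."""
    out = []
    for x in source:
        y = x + sum(a * m for a, m in zip(coeffs, memory))
        memory = [y] + memory[:-1]
        out.append(y)
    return out, memory

Passing the filter memory in and out lets the state carry over from frame to frame, which a frame-based synthesizer of this kind requires.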
As described in the foregoing, voice synthesis is accomplished by inputting the voice parameters, the connecting data and the pitch data to the voice synthesizer 10 (Fig. 1), connecting voice parameters based on the connecting data and computing the pitch for each frame based on the pitch information. The method according to this invention is therefore applicable to any of the diphone method, CVC method and CV method, which have been widely used in the prior art.
Referring to Fig. 8, which shows the basic structure of a connecting unit in the diphone method, two connecting data are used. The first connecting data contains the number of non-voice frames M, the number of voice parameter frames Fcv which expresses a transition from a consonant to a vowel, and the number of interpolating frames I. The second connecting data comprises the number of voice parameter frames Fvc which expresses a transition from a vowel to a consonant, where M and I are equal to zero.
Referring to Fig. 9, the connecting data in the CVC method comprises the number of non-voice frames M and the number of voice parameter frames Fcvc which expresses a phonemic concatenation of consonant-vowel-consonant. Since this method does not need a parameter interpolating section, I is always zero.
Fig. 10 shows the basic structure of a connecting unit in the CV method. In this case, the first connecting data has the number of non-voice frames M, the number of voice parameter frames Fcv which expresses a transition from a consonant to a vowel, and the number of interpolating frames Ivt. The second connecting data has the number of voice parameter frames Fvt for the vowel constant section and the number of interpolating frames Ivc from a vowel to the next consonant, where the number of non-voice frames M is zero.

One example of data input to perform voice synthesis with the above methods will be described referring to Figs. 11 and 12. The hatched portion in Fig. 11 represents the voice parameter part for a connecting unit, where a voice parameter is transferred once for each frame. The pitches P1 through P4 (not shown) and the interpolation frame numbers PIL1 through PIL4 are supplied by the pitch information. The non-voice frame number M3, the voice parameter frame numbers F1 through F5 and the interpolating section frame numbers I1, I4 and I5 are supplied by the connecting data. CD1 through CD5 express the data which combine these connecting data with the pitch information.
In Fig. 11, one synthesis unit is first given by CD1 and CD2 according to the diphone method. Referring now to Fig. 12, the CD1 contains connecting data M, F and I of 0, F1 and I1, respectively. In this case, pitch information is not supplied. The CD2 comprises connecting data M (= 0), F (= F2), I (= 0), and the pitch information of P2 and PIL2. According to the CVC method, one synthesis unit is given by CD3. The CD3 contains connecting data M (= M3), F (= F3), I (= 0), and the pitch information of P3 and PIL3. Similarly, according to the CV method, one synthesis unit is given by CD4 and CD5. The CD4 contains connecting data M (= 0), F (= F4) and I (= I4), and pitch information with 0 supplied. The CD5 contains connecting data M (= 0), F (= F5), I (= I5), and the pitch information of P4 and PIL4. The voice synthesizing system of the invention can thus be simply applied to plural connecting methods and, moreover, the connecting method can be modified easily simply by changing the table for the connecting unit.
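Put together, the CD1 through CD5 input of Figs. 11 and 12 could be written out with the ConnectingData sketch given earlier. The frame counts and pitch values below are placeholders standing in for the symbols of Fig. 11, chosen only to make the example concrete.

# Placeholder frame counts and pitch values for the symbols of Fig. 11
F1, F2, F3, F4, F5 = 10, 8, 24, 9, 12
I1, I4, I5 = 4, 3, 3
M3 = 6
P2, P3, P4 = 120.0, 110.0, 100.0
PIL2, PIL3, PIL4 = 8, 24, 12

input_data = [
    # diphone method: one synthesis unit = CD1 + CD2
    (ConnectingData(m=0,  f=F1, i=I1), None),        # CD1, no pitch information supplied
    (ConnectingData(m=0,  f=F2, i=0),  (P2, PIL2)),  # CD2
    # CVC method: one synthesis unit = CD3 alone; I is always zero
    (ConnectingData(m=M3, f=F3, i=0),  (P3, PIL3)),  # CD3
    # CV method: one synthesis unit = CD4 + CD5
    (ConnectingData(m=0,  f=F4, i=I4), (0.0, 0)),    # CD4, pitch information with 0 supplied
    (ConnectingData(m=0,  f=F5, i=I5), (P4, PIL4)),  # CD5
]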

Claims

What we claim is:
1. A voice synthesizing system for synthesizing a voice from voice parameters containing acoustic parameters which express vocal tract transmission characteristics and voice source parameters which express the intensity and type of a sound source and from pitch information, said synthesizer comprising:
voice parameter input means for supplying said voice parameters for predetermined frames;
non-voice parameter generating means for generating parameters to synthesize a non-voice sound;
internal status detecting means for receiving connecting data containing the number of non-voice frames, the number of voice parameter frames and the number of parameter interpolation frames and for counting each of said numbers to detect the internal status of said synthesizer;
parameter selecting means for selecting either an output from said voice parameter input means or an output from said non-voice parameter generating means in response to an output from said internal status detecting means;
parameter interpolating means for interpolating an output from said parameter selecting means in response to the output of said internal status detecting means;

pitch interpolating means for performing the interpolation of a pitch on the basis of pitch information having a pitch value at an arbitrary point and its interpolation period; and means for producing a synthesis voice in response to said output of said parameter interpolating means and pitch interpolating means.
CA000489787A 1984-08-31 1985-08-30 Voice synthesizing system Expired CA1235814A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP182122/1984 1984-08-31
JP59182122A JPS6159500A (en) 1984-08-31 1984-08-31 Voice synthesizer

Publications (1)

Publication Number Publication Date
CA1235814A true CA1235814A (en) 1988-04-26

Family

ID=16112712

Family Applications (1)

Application Number Title Priority Date Filing Date
CA000489787A Expired CA1235814A (en) 1984-08-31 1985-08-30 Voice synthesizing system

Country Status (2)

Country Link
JP (1) JPS6159500A (en)
CA (1) CA1235814A (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS62232696A (en) * 1986-04-03 1987-10-13 Matsushita Electric Industrial Co Ltd Syllabic nasal connection

Also Published As

Publication number Publication date
JPS6159500A (en) 1986-03-26
JPH051956B2 (en) 1993-01-11

Similar Documents

Publication Publication Date Title
US6785652B2 (en) Method and apparatus for improved duration modeling of phonemes
EP0427485B1 (en) Speech synthesis apparatus and method
US5475796A (en) Pitch pattern generation apparatus
US4754485A (en) Digital processor for use in a text to speech system
US4862504A (en) Speech synthesis system of rule-synthesis type
JPH1083277A (en) Connected read-aloud system and method for converting text into voice
CA1235814A (en) Voice synthesizing system
JPH06318094A (en) Speech rule synthesizing device
JPH037999A (en) Voice output device
JP2573586B2 (en) Rule-based speech synthesizer
JP2910587B2 (en) Speech synthesizer
JPH0588690A (en) Speech fundamental frequency pattern generation device
JP2573587B2 (en) Pitch pattern generator
JPS58168096A (en) Multi-language voice synthesizer
JP3023957B2 (en) Speech synthesizer
JPS6346497A (en) Voice synthesization system
JP2000172286A (en) Simultaneous articulation processor for chinese voice synthesis
JP2001282274A (en) Voice synthesizer and its control method, and storage medium
JPH0679228B2 (en) Japanese sentence / speech converter
JPH0679231B2 (en) Speech synthesizer
JPH11305787A (en) Voice synthesizing device
JPH04214600A (en) Sound synthesizing method
Minowa Rhythm control based on CV‐syllable positioning for Japanese synthetic speech
JPH055119B2 (en)
Yamashita et al. Quantitative evaluation of the perceptual significance of control parameters in synthesis by rule

Legal Events

Date Code Title Description
MKEX Expiry