JPS5816297A

JPS5816297A - Voice synthesizing system

Info

Publication number: JPS5816297A
Application number: JP11483481A
Authority: JP
Inventors: 雄三布施
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 1981-07-22
Filing date: 1981-07-22
Publication date: 1983-01-29

Abstract

(57) [Abstract] This publication contains application data filed before electronic filing was introduced, so no abstract data is recorded.

Description

DETAILED DESCRIPTION OF THE INVENTION The present invention relates to a speech synthesis method, and in particular to a speech synthesis method based on linear predictive coding (hereinafter abbreviated as LPC).

Conventional LPC speech synthesis uses a simple pulse as the sound source signal for synthesis. This permits substantial information compression, but the fine detail of the sound source information is lost and the quality of the resulting synthesized speech is degraded.

To eliminate this drawback, the present inventor previously proposed a speech synthesis method, shown in FIGS. 1 to 7, that uses a plurality of sound source signals for synthesis, selects one of them for each frame, and performs LPC speech synthesis for that frame.

FIG. 1 schematically shows its configuration. In the figure, (1) is an input terminal to which the original speech is applied; (2) is a speech analyzer that cuts out the waveform portion to be analyzed from the input original speech and extracts from it feature parameters relating to the vocal tract transfer characteristics, such as the LPC parameters, and feature parameters relating to the sound source, such as voiced/unvoiced, pitch frequency, and amplitude; (3) is a transmission path; and (4) is a speech synthesizer that resynthesizes speech from the feature parameters.

In this method, as noted above, transmitting the prediction residual signal itself to the speech synthesizer (4) would require an enormous amount of information. Instead, the prediction residual signal in each frame is frequency-analyzed, and information on the spectral envelope obtained by smoothing its fine components is transmitted to the speech synthesizer (4).

This operation is explained below with reference to the flowchart of FIG. 2.

In step (11), the original speech, for example as shown in FIG. 3 (which shows the time waveform of the sound "a" spoken by a female voice), is applied. LPC analysis is performed in step (12), and the LPC parameters, that is, the linear prediction coefficients a_k mentioned above, are extracted in step (13). In general, the parameters required for speech synthesis are voiced/unvoiced, pitch frequency, and amplitude for the sound source. The parameters for the vocal tract transfer characteristics (spectral envelope) differ depending on the method; in this LPC speech synthesis method, the LPC parameters fill that role.

The LPC analysis also yields a prediction residual signal in step (14). The amplitude, one of the sound source parameters, is extracted from this prediction residual signal in step (15), and a prediction residual signal as shown in FIG. 4 is obtained. FIG. 4 shows the prediction residual signal corresponding to the original speech of FIG. 3. Pitch analysis is performed on this prediction residual signal in step (16), and the pitch frequency (pitch period) is obtained in step (17). The pitch period is the fundamental period of vibration of the vocal cord sound source and is an important parameter characterizing voiced sounds. The period at which these parameters are determined (the frame period) is usually about 10 to 20 msec, and the waveform extraction window is usually about 15 to 30 msec long.
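The prediction residual referred to above can be illustrated with a short sketch. It assumes the common convention that the residual is the difference between each sample and its linear prediction from past samples, e[n] = s[n] - Σ a_k·s[n-k]; the function name `prediction_residual` and the sign convention are assumptions for illustration, not taken from the patent.

```python
def prediction_residual(signal, coeffs):
    """Return e[n] = s[n] - sum_k a_k * s[n-k]; samples before n = 0 count as zero."""
    residual = []
    for n, s_n in enumerate(signal):
        predicted = 0.0
        for k, a_k in enumerate(coeffs, start=1):
            if n - k >= 0:
                predicted += a_k * signal[n - k]
        residual.append(s_n - predicted)
    return residual


# A signal obeying s[n] = 0.5 * s[n-1] is predicted perfectly by a_1 = 0.5,
# so only the first sample remains in the residual.
signal = [1.0, 0.5, 0.25, 0.125]
residual = prediction_residual(signal, [0.5])
```

In a real analyzer the coefficients would come from the LPC analysis of each windowed frame; here they are supplied by hand to keep the sketch short.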

Steps (11) to (17) are performed on the speech analyzer (2) side and are conventional.

Next, in step (18), the prediction residual signal as shown in FIG. 4 is Fourier-transformed using, for example, 256 sample points, performing a time-to-frequency conversion. This yields the frequency spectrum of the prediction residual signal, shown as waveform S in FIG. 5. The phase information is removed from this frequency spectrum, so it is in effect a power spectrum. In step (19), this frequency spectrum is spectrally smoothed, for example by the cepstrum method, to obtain a spectral envelope as shown by envelope E in FIG. 5. FIG. 5 represents approximately one frame.
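One possible reading of the smoothing step is sketched below: take the power spectrum of a frame, convert it to a cepstrum, keep only the low-quefrency terms, and transform back. The patent names the cepstrum method only as an example, so the function names, the `n_keep` cutoff, and the small floor inside the logarithm are all assumptions. A naive DFT is used so the block needs only the standard library; the demonstration uses 16 points rather than the 256 mentioned in the text.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform; the inverse includes the 1/N factor."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
        out.append(acc / n if inverse else acc)
    return out


def cepstral_envelope(frame, n_keep):
    """Smooth the power spectrum of `frame` by low-quefrency liftering of its cepstrum."""
    n = len(frame)
    power = [abs(v) ** 2 for v in dft(frame)]           # phase is discarded here
    log_power = [math.log(p + 1e-12) for p in power]    # small floor avoids log(0)
    cepstrum = [c.real for c in dft(log_power, inverse=True)]
    # Keep quefrencies 0..n_keep and their mirror images; zero the fine structure.
    liftered = [c if (q <= n_keep or q >= n - n_keep) else 0.0
                for q, c in enumerate(cepstrum)]
    smoothed_log = [v.real for v in dft(liftered)]
    return [math.exp(v) for v in smoothed_log]


# A single unit impulse has a flat power spectrum, so its envelope is flat as well.
envelope = cepstral_envelope([1.0] + [0.0] * 15, n_keep=4)
```

A smaller `n_keep` gives a smoother envelope; how much smoothing the original system applied is not stated.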

This spectral smoothing is performed for each frame, giving a plurality of spectral envelopes in step (20). Because these spectral envelopes differ from frame to frame, transmitting them frame by frame would still require a considerable amount of information. In step (21), therefore, the spectral envelopes are classified by the distance between them: envelopes of similar shape are gathered into one group, and one spectral envelope in that group is used as its representative pattern. The same operation is then applied to the remaining spectral envelopes, and representative patterns are extracted one after another. Repeating this operation yields, in step (22), spectral envelopes #1 to #16 representing a plurality of representative patterns, for example 16, and these can represent the prediction residual signal information of all frames. Since 2^4 = 16, these 16 spectral envelopes #1 to #16 need 4 bits per frame. If a bit code is assigned to each spectral envelope in advance, for example [0001] to spectral envelope #1 and [0010] to spectral envelope #2, any spectral envelope can be selected when needed by its bit code.

Next, in step (23), an inverse Fourier transform is performed under suitable phase conditions, giving in step (24) the time waveforms shown in FIG. 6, that is, impulses #1 to #16 corresponding to spectral envelopes #1 to #16. Phase information is needed here because it was removed by the Fourier transform in step (18), so some phase information must be supplied for the inverse Fourier transform. However, the quality of the synthesized speech is thought not to depend much on the waveform of the sound source signal, so the phase information for the inverse Fourier transform in step (23) may be chosen to suit the subsequent signal processing. Depending on how the phase information is chosen, the impulses #1 to #16 obtained in step (24) take various waveforms, for example those shown in FIGS. 7A, 7B, and 7C. Of these, for a fixed impulse duration, the waveform of FIG. 7C has the smallest signal level at the very end of the impulse waveform. An impulse waveform like that of FIG. 7C is therefore expected to give the smallest sound source waveform connection error in the LPC speech synthesis of step (25). For the inverse Fourier transform in step (23), it is thus preferable to choose the phase condition so that the impulses #1 to #16 obtained in step (24) have substantially the waveform shown in FIG. 7C. The impulse of FIG. 7C is the impulse response of a minimum-phase system, and the phase condition for obtaining it can be readily derived.
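The grouping of similar envelopes and the assignment of 4-bit codes can be sketched as follows. The text does not specify the distance measure or how a group is delimited, so the Euclidean distance, the `threshold` parameter, the greedy "first ungrouped envelope becomes the representative" rule, and the mapping of pattern #16 to code 0000 are all assumptions made for illustration.

```python
import math


def envelope_distance(a, b):
    """Euclidean distance between two envelopes (an assumed distance measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pick_representatives(envelopes, threshold, max_patterns=16):
    """Greedy grouping: the first ungrouped envelope becomes a representative and
    absorbs every remaining envelope closer than `threshold`; repeat on the rest."""
    representatives = []
    remaining = list(range(len(envelopes)))
    while remaining and len(representatives) < max_patterns:
        leader = remaining[0]
        representatives.append(envelopes[leader])
        remaining = [i for i in remaining
                     if i != leader
                     and envelope_distance(envelopes[i], envelopes[leader]) >= threshold]
    return representatives


def encode_frame(envelope, representatives):
    """Return the 4-bit code of the nearest representative pattern."""
    index = min(range(len(representatives)),
                key=lambda i: envelope_distance(envelope, representatives[i]))
    # Pattern #1 -> "0001", #2 -> "0010"; #16 wraps to "0000" (assumption).
    return format((index + 1) % 16, "04b")


envelopes = [[1.0, 1.0], [1.1, 0.9], [5.0, 5.0], [5.2, 4.9]]
representatives = pick_representatives(envelopes, threshold=1.0)
codes = [encode_frame(e, representatives) for e in envelopes]
```

With these toy envelopes, the first two frames share one representative and the last two share another, so each frame is described by a single 4-bit code instead of a full envelope.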

The spectral envelopes #1 to #16, which are the representative patterns of the prediction residual signal information, are thus inverse-Fourier-transformed in step (23) into the corresponding time waveforms, and these are used as the sound source signals of the speech synthesizer (4) (FIG. 1). Performing this inverse Fourier transform in the speech synthesizer (4) would require a large amount of hardware. When real-time operation is not required, it is therefore better to perform the inverse Fourier transform in software on the speech analyzer (2) (FIG. 1) side, convert the spectral envelopes into time waveforms, and transmit those to the speech synthesizer (4).
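The text says only that the phase condition for a minimum-phase impulse can be readily derived. One standard way to do it, shown here as an assumption rather than as the patent's own procedure, is the homomorphic (cepstrum-folding) method: take the cepstrum of the log magnitude, fold it onto non-negative quefrencies, exponentiate the resulting spectrum, and inverse-transform. The function names are hypothetical, and a naive DFT keeps the block dependent on the standard library only.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform; the inverse includes the 1/N factor."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
        out.append(acc / n if inverse else acc)
    return out


def minimum_phase_impulse(magnitude):
    """Build a minimum-phase waveform whose DFT magnitude equals `magnitude`.

    `magnitude` must be positive, of even length N, and symmetric
    (magnitude[k] == magnitude[N - k]).
    """
    n = len(magnitude)
    cepstrum = [c.real for c in dft([math.log(v) for v in magnitude], inverse=True)]
    # Fold the even cepstrum onto non-negative quefrencies.
    folded = [0.0] * n
    folded[0] = cepstrum[0]
    folded[n // 2] = cepstrum[n // 2]
    for q in range(1, n // 2):
        folded[q] = 2.0 * cepstrum[q]
    spectrum = [cmath.exp(v) for v in dft(folded)]
    return [v.real for v in dft(spectrum, inverse=True)]


# Check on a known minimum-phase waveform: h = [1, 0.5, 0, ...] has its zero inside
# the unit circle, so it should be recovered from its magnitude spectrum alone.
true_h = [1.0, 0.5] + [0.0] * 30
envelope = [abs(v) for v in dft(true_h)]
impulse = minimum_phase_impulse(envelope)
```

The energy of such a waveform is concentrated at its start, which matches the property attributed to the waveform of FIG. 7C: a small signal level at the tail.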

In the LPC speech synthesis of step (25), performed on the speech synthesizer (4) side, about 4 bits of information are allotted to each frame and, as described above, one of the 16 impulses #1 to #16 of step (24) is selected as the synthesis sound source for that frame. In other words, the selected impulse corresponds in substance to the voiced-sound part of the sound source information needed to synthesize that frame.

In step (25), the amplitude information of step (15) and the pitch information of step (17) are also supplied as sound source information, and the LPC parameters of step (13) are supplied as information on the vocal tract transfer characteristics. As a result, the synthesized speech is obtained in step (26).
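How these pieces might combine in the synthesis step can be sketched as follows: the selected impulse waveform is repeated at every pitch period and scaled by the amplitude to form the excitation, which then drives an all-pole filter built from the LPC coefficients. The function names, the filter sign convention s[n] = x[n] + Σ a_k·s[n-k], and the handling of waveform tails at the frame end are assumptions for illustration.

```python
def build_excitation(impulse, pitch_period, frame_length, amplitude):
    """Place a scaled copy of `impulse` at every pitch period within one frame.

    Overlapping copies are summed; samples past the end of the frame are dropped.
    """
    excitation = [0.0] * frame_length
    for start in range(0, frame_length, pitch_period):
        for i, v in enumerate(impulse):
            if start + i < frame_length:
                excitation[start + i] += amplitude * v
    return excitation


def lpc_synthesize(excitation, coeffs):
    """All-pole synthesis filter: s[n] = x[n] + sum_k a_k * s[n-k]."""
    out = []
    for n, x_n in enumerate(excitation):
        s_n = x_n
        for k, a_k in enumerate(coeffs, start=1):
            if n - k >= 0:
                s_n += a_k * out[n - k]
        out.append(s_n)
    return out


excitation = build_excitation([1.0, 0.5], pitch_period=4, frame_length=8, amplitude=2.0)
speech = lpc_synthesize(excitation, [0.5])
```

The sketch treats one frame in isolation, which is exactly the situation that produces discontinuities at frame boundaries when the impulse waveform changes from one frame to the next.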

With this speech synthesis method, the frequency spectrum of the synthesized speech in each frame is closer to that of the original speech, and the quality of the synthesized speech is improved.

In the speech synthesis method described above, however, the synthesis sound source waveform within one frame consists of a single sound source waveform repeated at every pitch period. The waveform and signal level of the synthesized speech therefore tend to be discontinuous at the boundaries between frames, and the synthesized speech does not sound smooth.

In view of this, the present invention provides a speech synthesis method that reduces the frame-to-frame discontinuity of the synthesized speech described above and makes its sound quality smooth.

In the present invention, when two voiced frames (or two unvoiced frames) follow one another, the corresponding sample values of the sound source signal waveforms of the two frames are interpolated between the frames, so that the sound source signal waveform changes smoothly and gradually from frame to frame.

An embodiment of the present invention is described in detail below with reference to FIGS. 8 to 10.

FIG. 8 shows the configuration of this embodiment. In the figure, (31) is a clock generator, (32) is an address counter, (33) is a frame counter, and (34) is an interpolation clock counter. Each counter counts the clock from the clock generator (31), producing three kinds of timing signals.

(35) is a sound source signal waveform memory. As explained in connection with FIG. 2, a plurality of spectral envelopes, which are the representative patterns of the prediction residual signal information, are inverse-Fourier-transformed under suitable phase conditions into time waveforms (impulse responses), and these are stored in the memory (35) in advance as sound source signal waveform data for each frame so that they can be used as sound sources for LPC speech synthesis. From the sound source signal waveform data stored in the memory (35), the sound source signal waveform of one frame is selected by the output of the address counter (32).

(36) is a buffer memory that temporarily holds the sound source signal waveform of the frame immediately preceding the current frame; its contents are updated every frame by the output of the frame counter (33). (37) and (38) are both coefficient units. Coefficient unit (37) applies a coefficient, described later, to the sound source signal waveform of the current frame output from the memory (35), while coefficient unit (38) applies a different coefficient to the output of the buffer memory (36), that is, to the sound source signal waveform of the frame immediately preceding the current frame. The coefficients applied by coefficient units (37) and (38) are updated each time an interpolation clock is supplied to them from the interpolation clock counter (34). The frequency of the interpolation clock depends on the number of equal parts into which one frame is divided for interpolation; when one frame is divided into four equal parts, for example, it is four times the frame frequency.

(39) is an adder for adding the outputs of coefficient units (37) and (38), and (40) is an output terminal from which the interpolated sound source signal is taken out.

The operation of this embodiment is described next. Among the sound source signal waveforms stored in the memory (35) for the respective frames, let the sound source signal waveform of frame #n be $e_n(m)$ and that of the following frame #n+1 be $e_{n+1}(m)$. Here m is the sample point number of the sound source signal waveform stored in the memory (35), with m = 1, 2, ..., ℓ. For example, with a sampling frequency of 10 kHz (sampling period 100 μs) and ℓ = 30, the length of the sound source signal waveform is 0.1 ms × 30 = 3 ms.

As shown in FIG. 9, consider dividing one frame interval into several parts, for example four equal parts, and number the four sub-intervals around each frame boundary J = 1, 2, 3, 4.

The sound source signal waveform $e_{n,J}(m)$ in each sub-interval is then determined by linear interpolation as follows:

$$e_{n,J}(m) = \frac{4-J+1}{4}\,e_n(m) + \frac{J-1}{4}\,e_{n+1}(m) \tag{1}$$

In equation (1), the sub-interval number J takes the values J = 1, 2, 3, 4, and the sample point number m takes the values m = 1, 2, ..., ℓ.

From equation (1), the sound source signal waveform $e_{n,1}(m)$ in the sub-interval J = 1 is

$$e_{n,1}(m) = e_n(m) \tag{2}$$

which coincides with the sound source signal waveform of frame #n before interpolation.

Likewise, from equation (1), the sound source signal waveform $e_{n,2}(m)$ in the sub-interval J = 2 is

$$e_{n,2}(m) = \frac{3}{4}\,e_n(m) + \frac{1}{4}\,e_{n+1}(m) \tag{3}$$

FIG. 10 shows this interpolation for J = 2 carried out with actual numerical values.

That is, from equation (3), if the levels of $e_n(m)$ and $e_{n+1}(m)$ at m = 1 are 1.0 and 0.9 respectively, the level of $e_{n,2}(m)$ after interpolation is 0.975. Similarly, if the levels at m = 2 are −0.8 and −0.7, the interpolated level is −0.775; if the levels at m = 3 are 0.5 and 0.7, it is 0.55; and if the levels at m = 4 are −0.2 and −0.3, it is −0.225. As a result, the sound source signal waveform $e_n(m)$ shown by the solid line in FIG. 10A and the sound source signal waveform $e_{n+1}(m)$ shown by the solid line in FIG. 10B give the interpolated sound source signal waveform $e_{n,2}(m)$ shown by the broken line in FIG. 10A.
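The worked values above can be reproduced with a few lines of code. The sketch assumes that sub-interval J of k weights the current frame by (k − J + 1)/k and the next frame by (J − 1)/k, which for k = 4 and J = 2 gives the weights 3/4 and 1/4 implied by the numbers in the text; the function name `interpolate_waveform` is hypothetical.

```python
def interpolate_waveform(e_n, e_next, j, k=4):
    """Blend two frame waveforms sample by sample for sub-interval J of k.

    Frame n is weighted by (k - J + 1) / k and frame n+1 by (J - 1) / k.
    """
    w_cur = (k - j + 1) / k
    w_next = (j - 1) / k
    return [w_cur * a + w_next * b for a, b in zip(e_n, e_next)]


# Sample levels quoted for m = 1..4 in the FIG. 10 example.
e_n = [1.0, -0.8, 0.5, -0.2]
e_next = [0.9, -0.7, 0.7, -0.3]
e_n2 = interpolate_waveform(e_n, e_next, j=2)
```

Up to floating-point rounding, `e_n2` holds the values 0.975, −0.775, 0.55, and −0.225 given in the text, and calling the function with `j=1` returns the waveform of frame #n unchanged.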

Interpolating in the same way, as J advances 2 → 3 → 4, the sound source signal waveform $e_{n,J}(m)$ gradually approaches the sound source signal waveform $e_{n+1}(m)$ of the next frame #n+1 before interpolation.

The above is the case where one frame interval is divided into four equal parts. In general, when one frame interval is divided into k equal parts, the sound source signal waveform $e_{n,J}(m)$ in each sub-interval is given by

$$e_{n,J}(m) = \frac{k-J+1}{k}\,e_n(m) + \frac{J-1}{k}\,e_{n+1}(m) \tag{4}$$

According to equation (4), the interpolated sound source signal is a linear combination of the corresponding sample values of the sound source signal waveforms of the two successive frames, so it can be obtained using the coefficient units (37) and (38) and the adder (39).

In equation (4), J = 1, 2, ..., k and m = 1, 2, ..., ℓ. Choosing the number of divisions k as a power of 2, such as 2, 4, 8, ..., is convenient because the interpolation calculation of equation (4) can then be carried out easily by bit-shifting binary data.
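The remark about powers of 2 can be made concrete with integer sample values: when k = 2^shift, the division by k becomes a right shift. This is a sketch under the assumption that the weights are (k − J + 1)/k and (J − 1)/k; the function name and the integer sample format are illustrative choices, not details from the patent.

```python
def interpolate_fixed_point(a, b, j, shift):
    """Integer blend of samples a (frame n) and b (frame n+1) for k = 2**shift.

    The division by k is replaced by an arithmetic right shift, which rounds
    toward negative infinity when the sum is not a multiple of k.
    """
    k = 1 << shift
    return ((k - j + 1) * a + (j - 1) * b) >> shift


# Samples scaled by 1000: 1.0 and 0.9 become 1000 and 900, with k = 4 (shift = 2).
blended = interpolate_fixed_point(1000, 900, j=2, shift=2)
blended_negative = interpolate_fixed_point(-800, -700, j=2, shift=2)
```

In hardware the two multiplications by small integers can themselves be built from shifts and adds, which is presumably the convenience the text refers to.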

This interpolation is carried out with the circuit of FIG. 8 as follows. First, the clock from the clock generator (31) is counted by the address counter (32), and the resulting address information selects the waveform data of the corresponding frame in the memory (35), for example the sound source signal waveform $e_n(m)$ of frame #n. This sound source signal waveform $e_n(m)$ of frame #n is stored in the buffer memory (36) under control of the output of the frame counter (33).

Next, the sound source signal waveform $e_{n+1}(m)$ of frame #n+1, which follows frame #n, is likewise selected from the memory (35) by the address information of the address counter (32) and stored in the buffer memory (36) under control of the output of the frame counter (33). At that moment, the sound source signal waveform $e_n(m)$ of frame #n that had been held in the buffer memory (36) is supplied to coefficient unit (38). In other words, the contents of the buffer memory (36) are updated every frame by the output of the frame counter (33).

When the sound source signal waveform $e_{n+1}(m)$ of frame #n+1 is supplied from the memory (35) to the buffer memory (36), it is also supplied to coefficient unit (37). When the output of the interpolation clock counter (34) is supplied to coefficient units (37) and (38), each unit applies its coefficient. That is, from equation (4), coefficient unit (38) applies the coefficient $(k-J+1)/k$ to the sound source signal waveform $e_n(m)$ and outputs the signal $e_n(m)\cdot(k-J+1)/k$, while coefficient unit (37) applies the coefficient $(J-1)/k$ to the sound source signal waveform $e_{n+1}(m)$ and outputs the signal $e_{n+1}(m)\cdot(J-1)/k$. These signals are supplied to the adder (39) and added, and the interpolated sound source signal waveform $e_{n,J}(m)$ expressed by equation (4) is output at the output terminal (40).
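The data path just described can be mimicked in software. The following sketch is a loose model of FIG. 8, not a description of the actual circuit: a variable stands in for the buffer memory, two scaled copies stand in for the coefficient units, and their sum stands in for the adder. The generator name `interpolated_source` is hypothetical, and checks for voiced/unvoiced transitions are omitted.

```python
def interpolated_source(frames, k=4):
    """Yield (n, J, waveform) for each sub-interval between consecutive frames."""
    buffer_memory = None                     # previous frame's waveform, as in (36)
    for n, current in enumerate(frames):     # one read-out of memory (35) per frame
        if buffer_memory is not None:
            for j in range(1, k + 1):        # k interpolation clock ticks per frame
                prev_scaled = [(k - j + 1) / k * v for v in buffer_memory]  # unit (38)
                cur_scaled = [(j - 1) / k * v for v in current]             # unit (37)
                yield n - 1, j, [p + c for p, c in zip(prev_scaled, cur_scaled)]  # adder (39)
        buffer_memory = current              # buffer is refreshed once per frame


frames = [[0.0, 4.0], [4.0, 0.0]]
steps = list(interpolated_source(frames))
```

For these two toy frames the output moves from `[0.0, 4.0]` through `[1.0, 3.0]`, `[2.0, 2.0]`, and `[3.0, 1.0]`, approaching the second frame in equal steps. The last frame in the list has no successor, so the sketch emits nothing for it.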

As described above, according to the present invention, when two voiced frames (or two unvoiced frames) follow one another, the corresponding sample values of the sound source signal waveforms of the two frames are interpolated between the frames so that the sound source signal waveform changes smoothly and gradually from frame to frame. This removes the discontinuities in the waveform and signal level of the synthesized speech at the frame boundaries, giving synthesized speech of smooth sound quality and high quality.

[Brief explanation of drawings]

FIG. 1 is a block diagram schematically showing an example of the prior art to the present invention; FIGS. 2 to 7 are diagrams for explaining the operation of FIG. 1; FIG. 8 is a configuration diagram showing an embodiment of the present invention; and FIGS. 9 and 10 are diagrams for explaining the operation of FIG. 8. (31) is a clock generator, (32) is an address counter, (33) is a frame counter, (34) is an interpolation clock counter, (35) is a sound source signal waveform memory, (36) is a buffer memory, (37) and (38) are coefficient units, and (39) is an adder.

Claims

[Claims]

In a speech synthesis method in which the sound source signal waveform is changed for each frame, a speech synthesis method characterized in that the corresponding sample values of the sound source signal waveforms of successive frames are interpolated between those frames so that the sound source signal waveform changes smoothly from frame to frame.