JPH01219899A - Speech synthesizing device

Info

Publication number: JPH01219899A
Application number: JP4653688A
Authority: JP
Inventors: Norio Suda; 典雄須田; Yoshimichi Okuno; 義道奥野
Original assignee: Meidensha Corp; Meidensha Electric Manufacturing Co Ltd
Current assignee: Meidensha Corp; Meidensha Electric Manufacturing Co Ltd
Priority date: 1988-02-29
Filing date: 1988-02-29
Publication date: 1989-09-01

Abstract

PURPOSE: To obtain smooth speech by specifying phoneme parameters, such as the cross-sectional areas of the acoustic tubes, for each time zone, interpolating those parameters, superimposing the energy of the sound source on a pattern defined by an exponential function, and assigning a value lower than the pattern's function value to any vowel that is to be devoiced. CONSTITUTION: The human vocal tract is regarded as a group of acoustic tubes and mapped onto a group of circuit elements with surge impedance components, and speech is simulated from the current wave at the output terminal. Parameters such as the pitch of the current source and the cross-sectional areas of the acoustic tubes are specified for each phoneme, and the parameters are interpolated across the connections between phonemes and between the time zones into which each phoneme is divided. For the energy, an energy pattern corresponding to the accent, defined in advance for each word by an exponential function, is assigned, and the interpolated energy values are superimposed on that pattern. In addition, a value lower than the function value of the energy pattern is assigned to a vowel that is to be devoiced, so that the vowel is not voiced and speech closer to the human voice is obtained.

Description

DETAILED DESCRIPTION OF THE INVENTION

A. Field of Industrial Application

The present invention relates to a speech synthesis device that uses an acoustic tube model.

B. Summary of the Invention

The present invention concerns a device that regards the human vocal tract as a group of acoustic tubes, maps this group onto a group of circuit elements with surge impedance components, and simulates speech from the current wave at the output end of the circuit element group. For each phoneme making up a syllable, the utterance time of the phoneme is divided into a plurality of time zones, and phoneme parameters such as the cross-sectional areas of the acoustic tubes and the energy of the sound source wave are specified for each time zone and then interpolated. The interpolated values of the sound source energy are superimposed on a pattern, defined for example by an exponential function, that corresponds to the accent of the word, and a vowel that is to be devoiced is assigned a value lower than the function value of that pattern. The result is smooth speech that approximates the human voice.

C. Prior Art

Electronic devices that artificially synthesize and output sound, such as speech synthesizers and music synthesizers (electronic musical instruments), have recently become available at low cost as speech recognition and speech synthesis LSIs of one to a few chips, thanks to speech information processing and large-scale semiconductor integration technology, and various methods have been proposed according to the purpose of use and the constraints. Speech synthesis methods include the recording-and-editing method, in which natural speech uttered by a person is recorded and then suitably concatenated and edited into sentences, and the parameter method, which does not use the human voice directly but extracts only its parameters and controls those parameters during synthesis to create a speech signal artificially.

In the parameter method, the speech waveform is sampled at a fixed period, the value of the speech signal at each sampling point is converted from analog to digital, and that value is represented by codes of 0 and 1. To record the analog signal faithfully, the number of bits must be increased, which requires a large memory capacity.
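The link between bit depth and memory can be made concrete with a little arithmetic. The sketch below is only an illustration: the sampling rate and bit depths in the usage note are assumed values that do not appear in the patent, and `pcm_bytes` is a hypothetical helper name.

```python
def pcm_bytes(sample_rate_hz: int, bits_per_sample: int, seconds: float) -> float:
    """Storage needed for uncompressed sampled speech, in bytes."""
    # samples per second * bits per sample * duration, converted from bits to bytes
    return sample_rate_hz * bits_per_sample * seconds / 8
```

Under these assumptions, one second of speech at 8000 samples per second needs 8000 bytes at 8 bits per sample, and eight times less at 1 bit per sample.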

Various high-efficiency coding methods have therefore been researched and developed to reduce this amount of information as far as possible.

One such method is delta modulation, which assigns a minimum of one bit to each item of speech signal information. The single bit indicates whether the next speech signal value is higher or lower than the current value: the code "1" is given if it is higher and the code "0" if it is lower. In an actual system, a fixed amplitude step (delta) is set, and, so that errors do not accumulate, the coding is applied to the residual between the speech value reconstructed from the codes so far and the incoming speech signal.
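The one-bit decision described above can be sketched as follows. This is a minimal illustration rather than anything taken from the patent: the function names, the fixed step `delta`, and the starting estimate `start` are assumptions chosen for the example.

```python
from typing import List


def delta_encode(samples: List[float], delta: float, start: float = 0.0) -> List[int]:
    """Emit 1 when the input is above the running estimate, otherwise 0."""
    bits = []
    estimate = start
    for x in samples:
        bit = 1 if x > estimate else 0
        # The encoder tracks the value the decoder will rebuild,
        # so the comparison is always against the reconstructed signal.
        estimate += delta if bit else -delta
        bits.append(bit)
    return bits


def delta_decode(bits: List[int], delta: float, start: float = 0.0) -> List[float]:
    """Rebuild the staircase approximation from the bit stream."""
    out = []
    estimate = start
    for bit in bits:
        estimate += delta if bit else -delta
        out.append(estimate)
    return out
```

Because each decision compares the input with the reconstructed value and not with the previous input, a wrong step is corrected by later bits instead of accumulating.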

Such a configuration is called predictive coding. It includes the linear prediction method, which predicts from several preceding sample values, and the PARCOR method, which replaces the prediction coefficients of the linear prediction method with partial autocorrelation coefficients known as PARCOR coefficients.

D. Problems to Be Solved by the Invention

Among conventional speech synthesis methods, the recording-and-editing method has the problem that the vocabulary and the kinds of sentences that can be synthesized are limited.

In methods that use predictive coding, coarticulation, which corresponds to the joints between sounds, is difficult to handle, and no method of concatenating synthesis units has been established. For example, in an utterance that goes from a vowel through a consonant to another vowel, that is, from the steady state of a vowel through a transition to the consonant and then through a vowel transition to the steady state of the next vowel, the sound breaks off at the joint between the vowels. The result lacks smoothness and does not sound natural to a human listener.

An object of the present invention is to provide a speech synthesis device that can synthesize arbitrary vocabulary and sentences and that produces smooth sound close to actual human speech, giving the listener a natural impression.

E. Means for Solving the Problems and Operation

(1) Basic concept

Radiating speech from the mouth requires a sound source, and this source is produced by the vocal cords. The vocal cords intermittently interrupt the exhaled breath by opening and closing two folds, and this interruption generates pulses of airflow called puffs. When the vocal cords are tensed, tension is applied to the folds, the frequency at which they open and close rises, and higher-frequency puffs are generated. Increasing the exhaled airflow makes the sound louder.

When this sound source wave passes through a cylindrical acoustic tube such as the vocal tract, resonance emphasizes some components of the sound wave leaving the open end and attenuates others, producing the complex waveform of a vowel.

Even when the sound source waves have the same waveform, the speech emitted from the mouth is affected by the shape of the vocal tract through which the wave passes before it is radiated from the lips. In other words, the sounds produced by a person are determined by the length and cross-sectional area of the vocal tract from the vocal cords to the lips, by the way the vocal cords vibrate, and so on.

The present invention was made with these points in mind. It regards the vocal tract as a group of acoustic tubes of variable cross-sectional area, and its starting point is to realize the traveling wave phenomenon that represents the transmission of sound waves in the acoustic tubes by means of an equivalent circuit. When the vocal tract is regarded as a set of acoustic tubes, the propagation of sound in each tube can be divided into a forward wave and a backward wave and treated as repeated reflection and transmission at the boundaries between tubes. The reflection and transmission are determined quantitatively by the degree of mismatch of the acoustic characteristic impedances at the boundary, that is, by the ratio of the cross-sectional areas of the adjacent tubes.
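As a worked illustration of how the area ratio sets the degree of mismatch, the sketch below uses the textbook relation that the characteristic impedance of a tube is inversely proportional to its cross-sectional area (Z = ρc/A). With that relation the pressure-wave reflection coefficient at a junction, (Z_2 - Z_1)/(Z_2 + Z_1), reduces to (A_1 - A_2)/(A_1 + A_2). The function names and the pressure-wave sign convention are assumptions for this example; the patent itself works with current waves and its own coefficients.

```python
def reflection_coefficient(area_from: float, area_to: float) -> float:
    """Pressure-wave reflection at a junction, assuming Z is proportional to 1/A."""
    if area_from <= 0 or area_to <= 0:
        raise ValueError("cross-sectional areas must be positive")
    return (area_from - area_to) / (area_from + area_to)


def transmission_coefficient(area_from: float, area_to: float) -> float:
    """Pressure-wave transmission; pressure continuity gives t = 1 + r."""
    return 1.0 + reflection_coefficient(area_from, area_to)
```

Equal areas give no reflection, and the reflection grows as the two areas become more unequal, which is the sense in which the area ratio quantifies the mismatch.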

These reflection and transmission phenomena are the same as the transient phenomena that occur when an impulse current is passed through lines of different impedance in an electric circuit.

(2) Equivalent circuit

Accordingly, when an acoustic tube model consisting of n acoustic tubes S_1 to S_n is drawn as in FIG. 1(a), the model can be represented as an electric circuit in which a group of circuit elements T_1 to T_n, each a lossless surge impedance component without resistance, are connected in series as shown in FIG. 1(b). A_1 to A_n are the cross-sectional areas of the acoustic tubes S_1 to S_n, respectively. The present invention basically applies this electric circuit: varying the impulse current supplied to it and the surge impedances of the circuit elements T_1 to T_n corresponds to varying the sound source wave and the cross-sectional areas of the tubes in the acoustic tube model, and supplying the current output from the final-stage circuit element T_n to a sound-producing unit such as a speaker simulates the speech that the acoustic tube model would produce.

Specifically, a circuit equivalent to the above electric circuit is assumed, as shown in FIG. 1(c). The current of the current source in this equivalent circuit is varied with time and, because the cross-sectional area ratios of the acoustic tubes enter the calculation formulas as described later, the cross-sectional areas A_1 to A_n are also varied with time; the current value at each part is then obtained by calculation. In the figure, P is the current source, Z_0 is the impedance of the current source, Z_1 to Z_n are the surge impedances of the circuit elements T_1 to T_n, respectively, and Z_L is the radiation impedance. i_{0A} to i_{(n-1)A}, i_{1B} to i_{nB}, a_{0A} to a_{(n-1)A}, and a_{1B} to a_{nB} are the currents in the current paths marked with those symbols; W_{0A} to W_{(n-1)A} and W_{1B} to W_{nB} are current sources; I_{0A} to I_{(n-1)A} are backward wave currents; and I_{1B} to I_{nB} are forward wave currents.

In this equivalent circuit, taking the junction between circuit elements T_1 and T_2 as an example, one current source is assumed for the current I_{1B} flowing in T_1 toward T_2 and another for the current I_{1A} flowing in T_2 toward T_1. This is an equivalent representation of the current I_{1B} dividing, at the boundary between T_1 and T_2, into a reflected wave current i_{1B} reflected back into T_1 and a transmitted wave current a_{1A} transmitted into T_2, and of the current I_{1A} dividing at the same boundary into a reflected wave current i_{1A} reflected back into T_2 and a transmitted wave current a_{1B} transmitted into T_1. FIG. 1 (1) is a schematic diagram of this behavior.

(3) Calculation

First, the first-stage block containing the current source P in FIG. 1(c) can be regarded as the superposition of two circuits, as shown in FIG. 2. Letting v be the voltage of the current source P, the currents a_1 and a_2 in that figure are given by equations (1) and (2), respectively, so the current a_{0A} is given by equation (3).

a_1 = v / (Z_0 + Z_1)    (1)
a_2 = Z_0 I_{0A} / (Z_0 + Z_1)    (2)
a_{0A} = a_1 + a_2 = (v + Z_0 I_{0A}) / (Z_0 + Z_1)    (3)

When current is supplied to the equivalent circuit for the first time, a_{0A} is obtained by setting I_{0A} to zero.

The calculations are then carried out in sequence starting from this value.

Taking the first-stage block at the left end of the figure and the second-stage block as examples, the current values are calculated by equations (4) to (12).

a_{0A}' = (v' + Z_0 I_{0A}) / (Z_0 + Z_1)    (4)
i_{0A}' = a_{0A}' - I_{0A}    (5)
I_{0A}' = i_{1B}' + a_{1B}'    (6)
a_{1B}' = S_{1B} (I_{1B} + I_{1A})    (7)
i_{1B}' = a_{1B}' - I_{1B}    (8)
I_{1B}' = i_{0A}' + a_{0A}'    (9)
a_{1A}' = S_{1A} (I_{1B} + I_{1A})    (10)
i_{1A}' = a_{1A}' - I_{1A}    (11)
I_{1A}' = i_{2B}' + a_{2B}'    (12)

Continuing in this way, the equations for the final-stage block are (13) and (14).

a_{nB}' = Z_L I_{nB} / (Z_n + Z_L)    (13)
i_{nB}' = a_{nB}' - I_{nB}
I_{nB}' = i_{(n-1)A}' + a_{(n-1)A}'    (14)

The current i_{nB} corresponding to the sound wave emitted from the final-stage acoustic tube S_n is thus obtained.

Here S_{1B} and S_{1A} are coefficients expressed in terms of the cross-sectional area ratio of the mutually adjacent acoustic tubes and are given by equations (15) and (16), respectively (these two equations are not reproduced in this text).

The series of calculations of the current values for the blocks from the first stage to the final stage is executed instantaneously, and these calculations are repeated one after another at a predetermined timing. In equations (4) to (14), a primed value is the value calculated at time t, and an unprimed value is the value obtained by the calculation one step before the calculation at time t. The digital value i_{nB} obtained in this way is converted from digital to analog to produce an analog current, and this current is supplied to a speaker or the like to produce speech. The timing of the calculations is determined with the speed of sound taken into account. For example, by taking the propagation time through one acoustic tube as the time interval between calculations, a state is created that is equivalent to the backward wave currents I_{0A} to I_{(n-1)A} and the forward wave currents I_{1B} to I_{nB} flowing through the circuit elements T_1 to T_n at the speed of sound, which matches the acoustic tube model to the electric circuit model.
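One time step of the stage-by-stage calculation in equations (4) to (14) can be sketched as below. Several points are assumptions and not statements from the patent: the surge impedance is taken as Z_k = 1/A_k, the coefficients of equations (15) and (16), which are missing from this text, are stood in for by A_k/(A_k + A_{k+1}) and A_{k+1}/(A_k + A_{k+1}), the middle junctions are assumed to repeat the pattern of equations (6) to (11), and the backward wave at the last stage is assumed to follow the pattern of equation (6). The function names are hypothetical.

```python
from typing import List, Tuple


def step(i_a: List[float], i_b: List[float], v: float,
         areas: List[float], z0: float, zl: float
         ) -> Tuple[List[float], List[float], float]:
    """Advance the tube chain by one time step.

    i_a[k] holds the backward-wave current I_kA     (k = 0 .. n-1).
    i_b[k] holds the forward-wave current  I_(k+1)B (k = 0 .. n-1).
    Returns the updated (i_a, i_b) and the output current i_nB.
    """
    n = len(areas)
    z = [1.0 / a for a in areas]        # ASSUMPTION: Z_k = 1 / A_k
    new_a = [0.0] * n
    new_b = [0.0] * n

    # Source block, equations (4), (5) and (9).
    a0 = (v + z0 * i_a[0]) / (z0 + z[0])
    i0 = a0 - i_a[0]
    new_b[0] = i0 + a0

    # Junction between T_k and T_(k+1), k = 1 .. n-1.
    for k in range(1, n):
        fwd, bwd = i_b[k - 1], i_a[k]   # I_kB and I_kA from the previous step
        total = areas[k - 1] + areas[k]
        s_b = areas[k - 1] / total      # ASSUMPTION: stands in for S_kB
        s_a = areas[k] / total          # ASSUMPTION: stands in for S_kA
        a_b = s_b * (fwd + bwd)         # (7)
        i_b_k = a_b - fwd               # (8)
        new_a[k - 1] = i_b_k + a_b      # (6)
        a_a = s_a * (fwd + bwd)         # (10)
        i_a_k = a_a - bwd               # (11)
        new_b[k] = i_a_k + a_a          # same pattern as (9)

    # Final block, equation (13) and the unnumbered line after it.
    a_n = zl / (z[-1] + zl) * i_b[-1]
    i_n = a_n - i_b[-1]
    new_a[-1] = i_n + a_n               # ASSUMPTION: same pattern as (6)
    return new_a, new_b, i_n


def simulate(inputs: List[float], areas: List[float],
             z0: float, zl: float) -> List[float]:
    """Feed a sequence of source voltages and collect i_nB at each step."""
    n = len(areas)
    i_a, i_b = [0.0] * n, [0.0] * n
    out = []
    for v in inputs:
        i_a, i_b, i_n = step(i_a, i_b, v, areas, z0, zl)
        out.append(i_n)
    return out
```

With equal areas and matched source and radiation impedances, a single input impulse travels one tube per step and reaches the output n steps later with no echoes, which is a quick sanity check that each step models one tube-length of propagation.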

The present invention is based on realizing the equivalent model and calculations described above. Specifically, it comprises the following units.

A phoneme parameter storage unit divides the utterance time of each phoneme making up a syllable into one or more time zones and stores, for each time zone, the initial values of the pitch (the repetition frequency of the sound source wave), the energy of the sound source wave, and the cross-sectional areas of the acoustic tubes, together with constants that define how each initial value X_0 of that time zone changes to the corresponding initial value X_r of the next time zone, and a sound source wave pattern. A parameter interpolation processing unit interpolates the pitch, energy, and cross-sectional areas corresponding to the input phoneme data. A dictionary unit classifies the energy patterns of the sound source wave that correspond to word accents into several types and registers each word with the type of its energy pattern and with a devoicing code attached to any vowel in the word that is to be devoiced. A pattern processing unit defines each type of energy pattern by a function, refers to the dictionary unit to select the energy pattern for the type of the word to be uttered, and assigns to a vowel carrying the devoicing code a value lower than the function value of the energy pattern. A calculation unit calculates the current value output from the output end of the circuit element group on the basis of the parameters interpolated by the parameter interpolation processing unit; for the energy, it uses the value obtained by superimposing the function for the pattern selected by the pattern processing unit on the group of interpolated values. A sound-producing unit generates speech from the result of the calculation unit.

The parameter interpolation processing unit performs the interpolation calculation many times during each time zone using the initial value X_0, the value X_r that serves as the target value, and the constant, and for the energy the interpolation is carried out on the basis of an exponential function for interpolation.

F. Embodiment

FIG. 1 is a block diagram of an embodiment of the present invention. Reference numeral 1 denotes a Japanese language processing unit, which divides an input Japanese sentence into phrases and converts it to its kana reading with reference to a dictionary unit 1_1. Reference numeral 2 denotes a sentence processing unit, which adds intonation to the sentence. Reference numeral 3 denotes a syllable processing unit, which gives the syllables making up the sentence accents that match the intonation. For example, the sentence "sakura ga saita" is broken down into the syllables "SA", "KU", "RA", and so on, and an accent is attached to each syllable. Because the intonation of a sound is determined by the repetition frequency of the sound source wave (described later), its energy, and time, attaching an accent means determining coefficients for these parameters. The coefficient for the energy, in particular, is determined by a pattern processing unit 3_1 within the syllable processing unit 3.

The pattern processing unit 3_1 classifies the energy patterns of the sound source wave that correspond to word accents into, for example, three types (a head-high pattern, a tail-high pattern, and a middle-high pattern), stores them, and selects from these three the pattern that corresponds to the word to be uttered. FIGS. 4(a) to 4(c) show the results of analyzing speech actually uttered by a person for the words 動作した ("operated"), 異常は ("the abnormality"), and 遮断器は ("the circuit breaker"), respectively. The solid lines show the change in energy of each syllable, and the dotted lines show the energy pattern corresponding to the accent of the word, obtained by connecting the energy peaks of the syllables. In this embodiment, the patterns shown by the dotted lines in FIGS. 4(a) to 4(c) are treated as the head-high, tail-high, and middle-high patterns, respectively, according to where the high part of the energy peaks lies. The three patterns are defined by exponential functions as shown in FIGS. 5(a) to 5(c) and stored in advance in a storage section within the pattern processing unit 3_1.

Each stored pattern assigns, for example, exponential functions defined by different expressions to its rising part and its falling part. To retrieve an energy pattern, a code indicating the type of energy pattern is attached in advance to each word registered in the dictionary unit 1_1, and the pattern processing unit 3_1 selects the energy pattern that corresponds to this code.
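A pattern with separate exponential expressions for its rising and falling parts, selected by a type code, might look like the sketch below. The patent does not give the actual expressions or constants, so everything numeric here is hypothetical: the peak positions, the rise and fall rates, the normalization of time to the range 0 to 1 over the word, and the names `PATTERNS` and `pattern_value`.

```python
import math

# HYPOTHETICAL shape parameters; time t is normalised to 0..1 over the word.
PATTERNS = {
    "head_high":   {"t_peak": 0.2, "rise_rate": 8.0, "fall_rate": 3.0},
    "middle_high": {"t_peak": 0.5, "rise_rate": 8.0, "fall_rate": 3.0},
    "tail_high":   {"t_peak": 0.8, "rise_rate": 8.0, "fall_rate": 3.0},
}


def pattern_value(kind: str, t: float) -> float:
    """Energy-pattern value at normalised time t for the given pattern type."""
    p = PATTERNS[kind]
    if t <= p["t_peak"]:
        # Rising part: exponential approach, scaled to reach 1.0 at the peak.
        rise = 1.0 - math.exp(-p["rise_rate"] * t)
        full = 1.0 - math.exp(-p["rise_rate"] * p["t_peak"])
        return rise / full
    # Falling part: exponential decay away from the peak.
    return math.exp(-p["fall_rate"] * (t - p["t_peak"]))
```

A dictionary entry would then only need to carry the key, for example `"head_high"`, and the pattern processing step would evaluate `pattern_value` over the utterance time of the word.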

A word may also contain devoiced vowels whose energy level is considerably lower than the level given by the dotted-line energy pattern, such as the "I" of "SI" in FIG. 4(a) and the "N" and the "I" of "KI" in FIG. 4(c). For this reason, a devoicing code is attached in the dictionary unit 1_1 to each vowel of a word that is to be devoiced, and the pattern processing unit 3_1 assigns to a vowel carrying this code a value lower than the function value of the energy pattern.

Reference numeral 4 denotes a phoneme processing unit and 4_1 a syllable parameter storage unit. The phoneme processing unit 4 breaks input syllable data such as "SA" down into phonemes by referring to the data in the syllable parameter storage unit 4_1, which defines the correspondence between syllables and phonemes, the units of vowels and consonants. For the syllable "SA", for example, it extracts the phonemes "S" and "A".
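The syllable-to-phoneme lookup can be pictured as a simple table. The sketch below is illustrative only: the table contents beyond the "SA" example, and the names `SYLLABLE_TO_PHONEMES` and `to_phonemes`, are assumptions and do not describe the actual contents of the syllable parameter storage unit.

```python
from typing import Dict, List, Tuple

# HYPOTHETICAL table; only the "SA" -> "S", "A" entry comes from the text.
SYLLABLE_TO_PHONEMES: Dict[str, Tuple[str, ...]] = {
    "SA": ("S", "A"),
    "KU": ("K", "U"),
    "RA": ("R", "A"),
    "GA": ("G", "A"),
    "I": ("I",),
    "TA": ("T", "A"),
}


def to_phonemes(syllables: List[str]) -> List[str]:
    """Expand each syllable into its phonemes, preserving order."""
    phonemes: List[str] = []
    for syllable in syllables:
        phonemes.extend(SYLLABLE_TO_PHONEMES[syllable])
    return phonemes
```

For instance, `to_phonemes(["SA", "KU", "RA"])` yields the phoneme sequence S, A, K, U, R, A.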

Reference numeral 5 denotes a parameter interpolation processing unit, 5_1 a phoneme parameter storage unit, and 5_2 a sound source parameter storage unit.

As shown in FIG. 6, the phoneme parameter storage unit 5_1 divides the utterance time of each phoneme into a plurality of time zones, for example three time zones O_1 to O_3, and stores for each time zone its duration, the initial values of the pitch (the repetition frequency of the sound source wave), the energy of the sound source wave, and the cross-sectional areas of the acoustic tubes, the time constants that define how each initial value of that time zone changes to the corresponding initial value of the next time zone, and a sound source wave pattern. In this embodiment, the human vocal tract (about 17 cm in a man) is modeled as 17 connected acoustic tubes, each 1 cm long, so 17 cross-sectional area values (A_1 to A_17) are defined for each time zone. The sound source parameter storage unit 5_2 stores the waveform components of, for example, three types of sound source wave pattern G_1 to G_3 as 50 sample data, as shown in FIG. 7.

The parameter interpolation processing unit 5 interpolates the pitch, energy, and cross-sectional areas in each time zone (O_1 to O_3). Let X_0 be the initial value of a parameter (pitch, energy, or cross-sectional area) in the current time zone, X_r its initial value in the next time zone, X(n) the n-th interpolated value, and D the time constant for that parameter. The processing then evaluates the recurrence of equation (17) n times during the time zone, with the initial value X(0) equal to X_0.

X(n) = D (X_r - X(n-1)) + X(n-1)    (17)

For example, for the interpolation of the energy in time zone O_1, X_0 corresponds to E_1 and X_r to E_2, so the calculation follows equation (18).

X(n) = D_{E1} (E_2 - X(n-1)) + X(n-1)    (18)

Equation (17) is the recurrence form of equation (19).

X = X_r - e^{-Dt}    (19)

Differentiating equation (19) gives equation (20), from which equation (21) follows.

dX/dt = D e^{-Dt}    (20)
ΔX = X(n+1) - X(n) = Δt D e^{-Dt} = Δt D (X_r - X(n))    (21)

Hence equation (22) is obtained.

X(n+1) = Δt D (X_r - X(n)) + X(n)    (22)

Because the time interval of the interpolation calculation is constant, the product Δt D can be replaced as a whole by the time constant D, which gives equation (17).
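The recurrence of equation (17) is short enough to run directly. In the sketch below the function names and the numeric values in the usage note are assumptions for illustration. The closed form used for comparison, X(n) = X_r - (X_r - X_0)(1 - D)^n, is a standard property of this kind of first-order recurrence and is not stated in the patent.

```python
from typing import List


def interpolate(x0: float, xr: float, d: float, steps: int) -> List[float]:
    """Apply X(n) = D * (X_r - X(n-1)) + X(n-1), starting from X(0) = x0."""
    values = [x0]
    for _ in range(steps):
        prev = values[-1]
        values.append(d * (xr - prev) + prev)
    return values


def closed_form(x0: float, xr: float, d: float, n: int) -> float:
    """Value reached after n applications of the recurrence."""
    return xr - (xr - x0) * (1.0 - d) ** n
```

With a hypothetical initial value of 1.0, a target of 2.0, and D = 0.5, each step halves the remaining distance to the target, so the interpolated values approach X_r smoothly without overshooting as long as D lies between 0 and 1.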

Reference numeral 6 denotes a calculation unit which, on the basis of the parameters calculated by the parameter interpolation processing unit 5, obtains the digital value of the current i_{nB} shown in FIG. 1(c) at the same timing as the interpolation calculation, for example at intervals of 100 μs. Reference numeral 7 denotes a digital-to-analog (D/A) converter, which produces a current wave (analog current) from the digital value obtained by the calculation unit 6. Reference numeral 8 denotes a sound-producing unit such as a speaker, which generates speech from the analog current.

The operation of the above embodiment will now be described.

A Japanese sentence entered from a word processor or the like passes through the Japanese language processing unit 1, the sentence processing unit 2, and the syllable processing unit 3, where intonation and similar attributes are attached and the sentence is divided into syllables. The phoneme processing unit 4 then decomposes each syllable into phonemes. Next, the parameter interpolation processing unit 5 retrieves the pitch, energy, and cross-sectional area of each phoneme from the phoneme parameter storage unit 5₁ and interpolates these parameters for each time period (O₁ to O₃). This interpolation is performed according to equation (17); for example, the energy in time period O₁ is interpolated according to equation (18). FIG. 8 illustrates this: the interpolated energy values E(1), E(2), ..., E(n) obtained by the interpolation calculation lie along the curve expressed by the following equation (23).

E = E_t · e^{-Dt}   ... (23)

In addition, sample data of the sound source wave pattern defined for each of the time periods O₁ to O₃ is read from the sound source parameter storage unit 5₂, and this sample data is supplied to the calculation unit 6 together with the interpolated values of the pitch and the other parameters. The calculation unit 6 then executes the calculation detailed in section E, item (3) "Calculation" above. In this calculation, the coefficient or function corresponding to the accent assigned to each syllable by the syllable processing unit 3 is multiplied by each parameter obtained by the parameter interpolation processing unit 5, so that the intonation of the sentence is expressed. For energy in particular, the pattern processing unit 3₁ selects the energy pattern corresponding to the accent of the word (see FIG. 5) based on the code attached to the word retrieved from the dictionary unit 1₁. So that this energy pattern is traced over the utterance time of the word, the exponential function values that define the pattern (the values on the vertical axis of FIG. 5) are read out, each read value is multiplied by the interpolated energy value, and the product is used as an element of the calculation. For a vowel carrying a devoicing code, the read value is first multiplied by a coefficient of, for example, about 1/8, and the result is then multiplied by the interpolated energy value.
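The energy handling described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: equation (23) is read as a plain decaying exponential sampled every 100 μs, and the function names, the decay constant, and the pattern values are hypothetical rather than taken from the patent.

```python
import math

SAMPLE_INTERVAL_S = 100e-6  # 100 microsecond calculation interval given in the description
DEVOICING_FACTOR = 1 / 8    # "about 1/8" coefficient applied to devoiced vowels


def interpolated_energy(e_start, decay, n_samples, dt=SAMPLE_INTERVAL_S):
    """Sample E = e_start * exp(-decay * t) at t = 0, dt, 2*dt, ...

    Assumed reading of equation (23); e_start and decay are hypothetical inputs.
    """
    return [e_start * math.exp(-decay * k * dt) for k in range(n_samples)]


def superimpose(pattern_values, energies, devoiced=False):
    """Multiply each accent-pattern value by the matching interpolated energy.

    For a devoiced vowel the pattern value is scaled down before the product.
    """
    scale = DEVOICING_FACTOR if devoiced else 1.0
    return [p * scale * e for p, e in zip(pattern_values, energies)]


energies = interpolated_energy(e_start=1.0, decay=200.0, n_samples=4)
pattern = [1.0, 0.9, 0.8, 0.7]  # hypothetical values read from an accent pattern
voiced = superimpose(pattern, energies)
devoiced = superimpose(pattern, energies, devoiced=True)
```

With these inputs every element of `devoiced` is one eighth of the corresponding element of `voiced`, which is the "value lower than the function value" that suppresses the vowel.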

In this way, the digital value of the current wave corresponding to the sound wave emitted from the final-stage acoustic tube is obtained. The D/A converter 7 generates a current wave from this value, and the voice generating unit 8 emits the corresponding sound.

G. Effects of the Invention According to the present invention, the propagation of sound waves in the acoustic tube model is replaced by the flow of current in an equivalent circuit, parameters such as the pitch of the current source and the cross-sectional areas of the acoustic tubes are defined for each phoneme, and parameter interpolation is performed at the joints between phonemes and at the joints between the divided time periods within a phoneme. Smooth speech is therefore obtained, which sounds natural to the listener.

Because energy is interpolated on the basis of an exponential function, the sequence of interpolated values is close to that of actual speech. Moreover, each word is assigned an energy pattern corresponding to its accent, defined in advance by an exponential function, the interpolated energy values are superimposed on this energy pattern, and a vowel to be devoiced is assigned a value lower than the function value of the energy pattern so that it is devoiced. As a result, speech even closer to the human voice is obtained. Furthermore, rather than storing in memory all parameter values for the regions corresponding to the joints between phonemes, it suffices to store data per phoneme or per time period, so the required memory capacity is small.
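The memory argument can be illustrated with a small sketch. The record layout below is an assumption made for illustration: the class names, field names, number of tubes, and samples per time period are hypothetical, and only the idea of storing initial values and a constant per time period comes from the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TimePeriod:
    pitch: float        # initial pitch (repetition frequency of the source wave)
    energy: float       # initial energy of the source wave
    areas: List[float]  # initial cross-sectional area of each acoustic tube
    constant: float     # constant governing the change toward the next value


@dataclass
class Phoneme:
    name: str
    periods: List[TimePeriod]


def stored_values(phoneme: Phoneme) -> int:
    """Count values kept when only per-period data is stored."""
    return sum(3 + len(p.areas) for p in phoneme.periods)


def dense_values(phoneme: Phoneme, samples_per_period: int) -> int:
    """Count values needed if every interpolated sample were stored instead."""
    return sum((2 + len(p.areas)) * samples_per_period for p in phoneme.periods)


# Hypothetical phoneme: 3 time periods, 4 tubes, 500 samples per period
# (500 samples corresponds to 50 ms at the 100 microsecond interval).
period = TimePeriod(pitch=120.0, energy=1.0, areas=[1.0, 2.0, 3.0, 2.5], constant=0.1)
phoneme = Phoneme(name="a", periods=[period, period, period])
compact = stored_values(phoneme)
dense = dense_values(phoneme, samples_per_period=500)
```

Under these assumed sizes the per-period representation holds 21 values where a fully tabulated one would hold 9000, which is the kind of saving the paragraph above refers to.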

[Brief Description of the Drawings]

FIG. 1 is an explanatory diagram showing an equivalent model of an acoustic tube, FIG. 2 is an equivalent circuit diagram showing a block including a current source, FIG. 3 is a block diagram showing an embodiment of the present invention, and FIGS. 4 and 5 are explanatory diagrams each showing energy patterns. FIG. 6 is a data diagram of phoneme parameters, FIG. 7 is an explanatory diagram showing sound source wave patterns, and FIG. 8 is an explanatory diagram showing the parameter interpolation processing.

1₁ ... dictionary unit, 3₁ ... pattern processing unit, 4 ... phoneme processing unit, 4₁ ... syllable parameter storage unit, 5 ... parameter interpolation processing unit, 5₁ ... phoneme parameter storage unit, 5₂ ... sound source wave pattern storage unit, 6 ... calculation unit, 7 ... digital/analog converter, 8 ... voice generating unit.

[Figures 1 to 8. Legible panel labels: FIG. 5 includes the middle-high and tail-high accent patterns.]

Claims

(1) A speech synthesis device in which the human vocal tract is regarded as a plurality of acoustic tubes connected in cascade, the acoustic tube group is made to correspond to a group of circuit elements having surge impedance components, and the sound source is made to correspond to a current source, so that the sound wave emitted from the acoustic tubes is simulated on the basis of the current wave at the output end of the circuit element group, the device comprising:

a phoneme parameter storage unit that, for each phoneme constituting a syllable, divides the utterance time of the phoneme into one or more time periods and stores, for each time period, the initial values X_o of the pitch (the repetition frequency of the sound source wave), the energy of the sound source wave, and the cross-sectional areas of the acoustic tubes, a constant defining how each value changes from its initial value X_o toward the next initial value X_r, and a sound source wave pattern;

a parameter interpolation processing unit that interpolates the pitch, energy, and cross-sectional area corresponding to input phoneme data;

a dictionary unit in which the energy patterns of the sound source wave corresponding to word accents are classified into a plurality of types, and in which each word is registered with its energy pattern type attached and with a devoicing code attached to any vowel in the word that is to be devoiced;

a pattern processing unit that defines the energy pattern of each type by a function, selects, with reference to the dictionary unit, the energy pattern corresponding to the type of the word to be uttered, and assigns to a vowel carrying a devoicing code a value lower than the function value of the energy pattern;

a calculation unit that calculates the current value output from the output end of the circuit element group on the basis of the parameters interpolated by the parameter interpolation processing unit, using as the energy value the group of interpolated values on which the function corresponding to the pattern selected by the pattern processing unit has been superimposed; and

a voice generating unit that produces sound on the basis of the calculation result of the calculation unit,

wherein the parameter interpolation processing unit performs, within each time period, a plurality of interpolation calculations using the initial value X_o, the value X_r corresponding to the target value, and the constant, and interpolates the energy on the basis of an exponential function.