JPH01171000A

JPH01171000A - Voice synthesis system

Info

Publication number: JPH01171000A
Application number: JP62331283A
Authority: JP
Inventors: Norio Suda; 典雄須田; Yoshimasa Sawada; 沢田　喜正
Original assignee: Meidensha Corp; Meidensha Electric Manufacturing Co Ltd
Current assignee: Meidensha Corp; Meidensha Electric Manufacturing Co Ltd
Priority date: 1987-12-25
Filing date: 1987-12-25
Publication date: 1989-07-06

Abstract

PURPOSE: To generate phoneme data efficiently by preparing a correlation characteristic curve between pitch and energy when the phoneme data are generated, correcting the energy value according to the pitch using that curve, and thereby normalizing the phoneme data. CONSTITUTION: When speech is synthesized using phonemes as basic units, each phoneme is divided into a rising section, a steady section, and a falling section, and a phoneme time constant, duration, pitch, pitch time constant, energy, energy time constant, and sound source are specified for each section to generate phoneme data. In doing so, a correlation characteristic curve between pitch P and energy E is prepared in advance, and the energy value E is corrected according to the pitch P on the basis of this curve to normalize the phoneme data (a). The phoneme data can consequently be generated efficiently.

Description

DETAILED DESCRIPTION OF THE INVENTION

A. Field of Industrial Application

The present invention relates to a rule-based speech synthesis method that synthesizes speech using phoneme data.

B. Summary of the Invention

The present invention relates to a speech synthesis method that synthesizes speech using phonemes as basic units. Each phoneme is divided into rising, steady, and falling sections. For each section, the duration, the pitch and energy that determine the height and loudness of the sound, the sound source, and the time constants of the phoneme, pitch, and energy are specified to create phoneme parameters. In doing so, the energy value is corrected according to the pitch so that the phoneme data are normalized.

C. Prior Art

Electronic devices that synthesize and output speech artificially have recently become available at low cost, because LSIs for speech recognition and speech synthesis on one or a few chips have been realized through speech information processing and large-scale semiconductor integration technology. Various methods have been proposed depending on the intended use and the constraints. Speech synthesis methods include the recording-and-editing method, in which raw human speech is recorded and suitably concatenated into sentences, and the parametric method, which does not use the human voice directly but extracts only the parameters of human speech and controls those parameters during synthesis to create a speech signal artificially.

Among parametric methods, the PARCOR method is widely used because it yields high-quality synthesized speech.

When speech is handled by a computer, the speech waveform is sampled at a fixed period, and the value of the speech signal at each sampling point is converted from analog to digital and represented as a code of 0s and 1s. Recording that is faithful to the analog signal requires more bits per sample, so a speech signal requires a very large amount of memory.

Various high-efficiency coding methods have therefore been researched and developed to reduce this amount of information as far as possible.

One such method is delta modulation, which spends the minimum of one bit on each sample of the speech signal. The single bit records whether the next speech signal value is higher or lower than the current value: code "1" if it is higher, code "0" if it is lower. In an actual system, a fixed amplitude step (delta) is defined, and, so that errors do not accumulate, the coding is applied to the residual between the speech value reconstructed from the codes emitted so far and the incoming speech signal.
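The delta modulation scheme described above can be sketched in a few lines. This is a minimal illustration rather than anything taken from the specification: the function names `delta_encode` and `delta_decode`, the starting estimate of 0, and the test signal are all assumptions made for the example.

```python
import math


def delta_encode(samples, step):
    """Emit one bit per sample: 1 if the input is above the running estimate, else 0."""
    bits = []
    estimate = 0.0  # value reconstructed from the bits emitted so far (assumed start)
    for x in samples:
        bit = 1 if x > estimate else 0
        # Update the estimate exactly as the decoder will, so errors do not accumulate.
        estimate += step if bit else -step
        bits.append(bit)
    return bits


def delta_decode(bits, step):
    """Rebuild the staircase approximation from the bit stream."""
    out = []
    estimate = 0.0
    for bit in bits:
        estimate += step if bit else -step
        out.append(estimate)
    return out


# Hypothetical test signal: two periods of a slow sine wave.
signal = [math.sin(2 * math.pi * k / 64) for k in range(128)]
bits = delta_encode(signal, step=0.1)
decoded = delta_decode(bits, step=0.1)
max_error = max(abs(a - b) for a, b in zip(signal, decoded))
```

Because the encoder compares each input with the value the decoder will reconstruct, the comparison is effectively made on the residual. With this signal the per-sample change stays below the step size, so the reconstruction never drifts more than one step from the input.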

This kind of configuration is called predictive coding. It includes the linear prediction method, which predicts each sample from several preceding sample values, and the PARCOR method, which replaces the prediction coefficients of linear prediction with partial autocorrelation coefficients, known as PARCOR coefficients.
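The relation between prediction coefficients and PARCOR coefficients can be illustrated with the Levinson-Durbin recursion, a standard textbook procedure that is not described in this specification. The sketch below assumes the predictor convention x̂[n] = Σ a_j·x[n−j]; sign conventions for PARCOR coefficients differ between references, and the function names are hypothetical.

```python
def autocorrelation(x, max_lag):
    """Plain (unnormalized) autocorrelation r[0..max_lag] of a sample list."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag)) for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Return (prediction coefficients a[1..order], PARCOR coefficients k[1..order])."""
    a = [0.0] * (order + 1)  # a[0] is unused
    parcor = []
    err = r[0]  # prediction error energy of the order-0 predictor
    for m in range(1, order + 1):
        acc = r[m] - sum(a[j] * r[m - j] for j in range(1, m))
        k = acc / err
        new_a = a[:]
        new_a[m] = k
        for j in range(1, m):
            new_a[j] = a[j] - k * a[m - j]
        a = new_a
        err *= 1.0 - k * k
        parcor.append(k)
    return a[1:], parcor


# Autocorrelation of an idealized first-order process: r[lag] = 0.5 ** lag.
coeffs, parcor = levinson_durbin([1.0, 0.5, 0.25], order=2)
```

For this input the second PARCOR coefficient is zero, which reflects the fact that a first-order predictor already captures all of the correlation.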

D. Problems to Be Solved by the Invention

As described above, methods based on predictive coding have difficulty with coarticulation, that is, with the joins between one sound and the next.

For example, in an utterance that goes from a vowel through a consonant to another vowel, the sound passes from the steady vowel through a transition to the consonant, and then through another transition to the steady sound of the next vowel. In this process the sound breaks off at the join between the vowels, and the result does not sound natural to a human listener.

In the synthesis of musical instrument sounds, the joins between notes of the scale are important, but because the synthesis technique differs from the principle by which a real instrument produces sound, the result again lacks naturalness, and this is especially noticeable in reverberant sounds. In both cases, approaching a natural sound requires many electronic components such as memories and arithmetic units, which makes the device expensive.

E. Means for Solving the Problems, and Operation

Human speech and the musical tones of instruments are produced by changes in shape, such as the length and cross-sectional area, of the human oral cavity or of an acoustic tube. The inventors therefore analyzed the traveling-wave phenomenon that describes sound propagation in such acoustic tubes by means of an acoustic-tube equivalent circuit. Noting that the cross-sectional area of an acoustic tube is inversely proportional to its surge impedance, they devised a speech synthesis method in which the cross-sectional area is varied in simulation by varying the surge impedance. Varying the surge impedance continuously allows coarticulation to proceed smoothly, so that sounds like human utterances are synthesized easily and the naturalness of the speech is improved. A patent application for that method was filed earlier (Japanese Patent Application No. 62-91705, hereinafter "the earlier application").

When speech synthesis data are created by extracting parameters from natural speech on the basis of the invention of the earlier application, the pitch (height) and energy (intensity) of the natural sounds vary, so the intensity must be normalized until the sounds are heard at the same intensity by ear. Normalization has been done either by extracting the actual waveform of each natural utterance and comparing heights and pitches, or by playing the sound, listening to it, and adjusting the pitch and energy until the levels match. This work takes a great deal of time and skill. As a result of various experiments, the present invention notes that energy and pitch are correlated, with large energy where the pitch is large. A pitch-energy correlation characteristic curve is created, and the energy value is corrected according to the pitch from this curve to normalize the speech data, which makes the work more efficient.

F. Embodiments

First, the invention of the earlier application, on which the present application is based, is described in detail.

The cross-sectional area of the vocal tract changes during speech. To utter "a", for example, the back of the throat is narrow and the lips are open. The vocal cords open and close intermittently against the breath pushed out of the lungs, and the sound waves that emerge after repeated reflection within the vocal tract (the acoustic tube) form the speech waveform of "a". For "i", the throat side is wide and the lip end is narrow, which yields the speech waveform of "i".

The frequency is thus determined by the shape of the mouth, and simulating that shape produces "a" or "i".

The shape of the mouth can be simulated by the cross-sectional areas of an acoustic tube, and changes in those areas can in turn be simulated by changes in surge admittance. The shape of the mouth can therefore be simulated by varying the surge admittance. Because surge admittance is very easy to vary in an electric circuit, a wide variety of speech sounds can be synthesized from electrical signals.

Fig. 2(a) shows a vocal tract simulated by connecting acoustic tubes with different cross-sectional areas A1, A2, ..., An. Fig. 2(b) replaces their acoustic impedances with LC circuits: each acoustic tube becomes one LC line, and the whole becomes an electric circuit of n−1 lumped lines. Fig. 2(c) is the equivalent traveling-wave model. The acoustic impedances Z1, Z2, ..., Zn of the tubes are inversely proportional to the tube cross-sectional area (the acoustic admittances are proportional to it) and proportional to the speed of sound c and the air density ρ, so that $Z_i = \rho c / A_i$. In the figure, Zg denotes the sound-source impedance and ZL the radiation impedance, and the arrows between blocks represent the forward and backward traveling waves.
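The proportionality just described can be written out as a short sketch. The constants for air density and sound speed are assumed round values, the areas are hypothetical, and the reflection coefficient at each junction is the standard transmission-line expression, which the specification itself does not state.

```python
RHO = 1.2    # air density in kg/m^3 (assumed value)
C = 340.0    # speed of sound in m/s (assumed value)


def tube_impedances(areas_m2):
    """Acoustic impedance of each section: proportional to rho*c, inverse to area."""
    return [RHO * C / a for a in areas_m2]


def tube_admittances(areas_m2):
    """Acoustic admittance of each section: directly proportional to area."""
    return [a / (RHO * C) for a in areas_m2]


def reflection_coefficients(impedances):
    """Standard junction reflection coefficient for a wave going from section i to i+1."""
    return [(z2 - z1) / (z2 + z1) for z1, z2 in zip(impedances, impedances[1:])]


# Hypothetical two-section tube: 1 cm^2 followed by 2 cm^2.
areas = [1.0e-4, 2.0e-4]
z = tube_impedances(areas)
y = tube_admittances(areas)
r = reflection_coefficients(z)
```

Doubling the area halves the impedance, and the junction then reflects one third of the incident wave with inverted sign, which is one way to picture the forward and backward arrows between the blocks of the traveling-wave model.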

To utter "a", the mouth shape of "a" is set at the cross-sectional area of the acoustic tube corresponding to the lip end, and an impulse P is applied intermittently, which produces the sound "a". To go from "a" to "i", the cross-sectional area of the tube section corresponding to the lip end is narrowed to give the mouth shape of "i", which produces "i".

When the impulse P is applied continuously at intervals and the cross-sectional areas as a whole are changed to the mouth shape of "i", each of the n acoustic tubes in Fig. 2 that simulate the vocal tract has its area moved away from its "a" value, so that the mouth shape changes continuously from "a" to "i". Changing the cross-sectional area of an acoustic tube is done by gradually changing its surge impedance.

Because the cross-sectional areas can be varied continuously, the steady-state sounds "a" and "i" are of course obtained. Because the impedance is continuously variable, the intermediate sounds that lie between one sound and the next are obtained as well. Coarticulation therefore proceeds smoothly, without breaks in the sound, in a way close to human pronunciation.
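A continuous change of mouth shape can be pictured as a sequence of intermediate area sets between two target shapes. The sketch below assumes simple linear interpolation, which the text does not specify (it says only that the change is gradual), and the area values for "a" and "i" are invented for illustration.

```python
def interpolate_areas(start, target, steps):
    """Return steps+1 area sets moving linearly from start to target (assumed law)."""
    frames = []
    for s in range(steps + 1):
        w = s / steps  # 0.0 at the first frame, 1.0 at the last
        frames.append([(1 - w) * a + w * b for a, b in zip(start, target)])
    return frames


# Hypothetical 4-section area sets in cm^2, throat first, lips last.
AREAS_A = [1.0, 2.0, 4.0, 6.0]   # narrow throat, open lips
AREAS_I = [4.0, 3.0, 1.0, 0.5]   # wide throat, narrow lips

frames = interpolate_areas(AREAS_A, AREAS_I, steps=2)
```

Each frame would then be converted to surge impedances and fed to the tube model, so that the sound between "a" and "i" is generated from the intermediate shapes rather than skipped.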

Turning to the propagation speed of the sound wave, the situation resembles the transient that occurs when an impulse is applied to an electrical line of length ℓ having inductance L and capacitance C.

That is, a line having L and C as shown in Fig. 3 can be represented equivalently as in Fig. 4. The surge impedances $Z_{01}$ and $Z_{02}$ seen from the two ends are

$$Z_{01} = \sqrt{L/C}, \qquad Z_{02} = \sqrt{L/C}.$$

Treating the traveling wave arriving from the far end as an equivalent current source gives

$$I_1 = i_2(t-\tau) + \frac{V_2(t-\tau)}{Z_{02}}, \qquad I_2 = i_1(t-\tau) + \frac{V_1(t-\tau)}{Z_{01}}.$$

If there are n delay-circuit blocks Z in between, the current is output after n time steps. In other words, what was generated in the left-hand circuit reaches the right-hand side after the time τ. $I_2$ is thus the sending-end current $i_1(t-\tau)$ plus $V_1(t-\tau)/Z_{01}$. Because voltages and currents are discretized in digital computation, $V_1$ and $V_2$ denote the voltages at measurement time t, and τ denotes the elapsed time.
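The two relations above translate directly into code. This is only a sketch of the arithmetic, with hypothetical function names and made-up numeric values; how the delayed values are stored and exchanged between sections is not shown.

```python
import math


def surge_impedance(inductance, capacitance):
    """Surge impedance of an LC line: sqrt(L / C)."""
    return math.sqrt(inductance / capacitance)


def history_current(i_far_past, v_far_past, z0):
    """Equivalent current source seen at one end of the line.

    i_far_past and v_far_past are the current and voltage at the other end,
    taken one travel time (tau) earlier.
    """
    return i_far_past + v_far_past / z0


z01 = surge_impedance(4.0, 1.0)            # hypothetical L and C values
i2_source = history_current(1.0, 4.0, z01)  # uses i1(t - tau) = 1.0, V1(t - tau) = 4.0
```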

In Fig. 4, an impulse applied to the LC circuit appears at the output end after the time τ. Equivalently, what was sent a time τ earlier has also reached the other end. Setting the line length ℓ to 1 normalizes the delay-block count n to 1, which simplifies the computation. When ℓ is taken as 3 cm, n is set to 3 delay blocks.

The human vocal tract is about 17 cm long in an adult male. If the tract of Fig. 2(a) is simulated by 17 acoustic tubes in 1 cm steps, and a half cycle of current is divided into 10 parts with Δt = 10 μs, a waveform entering at A1 takes 170 μs to emerge at the An end.
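The 170 μs figure follows from 17 delay blocks at 10 μs each. A minimal delay-line sketch makes the counting explicit; the `deque`-based representation and the function name are assumptions for illustration, and no reflection or loss is modeled.

```python
from collections import deque


def arrival_tick(n_blocks):
    """Tick at which a unit impulse fed in at tick 0 leaves a chain of delay blocks."""
    line = deque([0.0] * n_blocks)  # one slot per delay block, input side on the left
    tick = 0
    while True:
        out = line.pop()  # sample leaving the output end at this tick
        if out != 0.0:
            return tick
        line.appendleft(1.0 if tick == 0 else 0.0)  # impulse only at tick 0
        tick += 1


DT = 10e-6  # time step of 10 microseconds, as in the text
ticks = arrival_tick(17)
delay_seconds = ticks * DT
```

With 17 blocks the impulse leaves the line at tick 17, that is, after 170 μs.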

Accordingly, a processing unit performs the computation that corresponds to the changes in the cross-sectional areas of the acoustic tubes A1 to An. It is provided with a memory that holds, as a table, the values of the impedances Z1 to Zn corresponding to A1 to An, which are needed to calculate the current in each part of the equivalent circuit of each tube; computing means that calculates the current in each part of a given equivalent circuit; and computing means that calculates current values using the current values of the adjacent equivalent circuits. Performing this computation yields a speech signal, which is D/A converted and sent to a speaker, where it is output as sound.

Next, an embodiment is described in which speech is synthesized by rule from a text input signal using the acoustic tube model described above.

Fig. 5 is a block diagram illustrating an embodiment of the present invention. Reference numeral 1 denotes a Japanese-language processing unit, which receives as input a sentence written in mixed kanji and kana, looks it up in dictionary 2, and performs natural-language analysis consisting of segmentation into phrases, clauses, and sentences and morpheme classification. It then applies accent processing, converts the result to phonetic notation, and adds intonation to produce the sentence-processing data. Reference numeral 3 denotes a syllable processing unit, which holds syllable parameters and performs syllable processing on the sentence-processed data.

The syllable parameters give the pitch (height of the sound), energy (intensity of the sound), and duration for each of 110 to 140 syllables (about 110 are enough for ordinary speech). For the word "sakura" (桜, cherry blossom), for example, the pitch P, energy E, and time T are normalized for each of the syllables SA, KU, and RA, as shown in Fig. 6.

Reference numeral 4 denotes a phoneme processing unit, which holds phoneme parameters and has a parameter interpolation function. For the phoneme parameters, each phoneme is divided into a rising section O1, a steady section O2, and a falling section O3. For each section, the phoneme (cross-sectional area) time constant, duration, pitch, pitch time constant, energy, energy time constant, and sound source are normalized, forming one block of data per section. Taking "sakura" again, as shown in Fig. 7, each of the phonemes "S", "A", "K", "U", "R", "A" has, for its rising section O1 for instance, a data unit consisting of DO1, T1, P1, DP1, E1, DE1, and G1. These data units are provided for each of the cross-sectional areas A1 to An of the acoustic tube model of Fig. 2. Thus, if the acoustic tube model has 17 cross-sectional areas, 6 × 17 = 102 data units are prepared for each syllable.
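The layout of one data unit and the 6 × 17 count can be sketched as a small data structure. The field names are hypothetical, and reading the factor 6 as two phonemes per syllable times three sections is an interpretation of the text, not something it states outright.

```python
from dataclasses import dataclass

SECTIONS = ("rising", "steady", "falling")


@dataclass
class SectionParams:
    """One data unit: the seven values specified for one section of one phoneme."""
    area_time_constant: float    # phoneme (cross-sectional area) time constant
    duration: float              # T
    pitch: float                 # P
    pitch_time_constant: float   # DP
    energy: float                # E
    energy_time_constant: float  # DE
    source: str                  # G, the sound source


def data_units_per_syllable(phonemes_per_syllable, n_tube_sections):
    """Number of data units when every tube section carries its own unit."""
    return phonemes_per_syllable * len(SECTIONS) * n_tube_sections


# Hypothetical values for the rising section of one phoneme.
unit = SectionParams(0.01, 0.03, 120.0, 0.02, 1.0, 0.02, "pulse")
count = data_units_per_syllable(phonemes_per_syllable=2, n_tube_sections=17)
```

For a consonant-plus-vowel syllable such as SA and a 17-section tube, this gives the 102 data units mentioned in the text.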

Each of the time constants specifies how the value moves from the final value of the preceding section to the corresponding target value of the current section. The time T is the duration, and each of these operations is carried out within it. The sound sources G1, G2, and G3 vary with the time T from section to section in consonant portions, but the source is generally given as a pulse train of about 300 pulses there, and of about 50 pulses in vowel portions.
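One plausible reading of "how the value moves toward the target" is a first-order exponential approach governed by the time constant. The specification does not give the law, so the sketch below is an assumption throughout, including the function names and the numbers used.

```python
import math


def approach(start, target, time_constant, t):
    """Value at time t when moving from start toward target with a first-order lag."""
    return target + (start - target) * math.exp(-t / time_constant)


def section_trajectory(start, target, time_constant, duration, dt):
    """Sampled trajectory over one section of the given duration."""
    steps = int(round(duration / dt))
    return [approach(start, target, time_constant, k * dt) for k in range(steps + 1)]


# Hypothetical pitch transition: previous section ended at 100, this one targets 120.
traj = section_trajectory(start=100.0, target=120.0, time_constant=0.01,
                          duration=0.05, dt=0.01)
```

A small time constant reaches the target early in the section, while a large one leaves the value still in transition when the section ends, which is how the time constant shapes the join between sections under this assumption.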

Reference numeral 6 denotes an acoustic tube model unit, which performs the control that simulates changes in the cross-sectional area of the acoustic tube and passes its output to a speech synthesis waveform unit 7. The speech synthesis waveform unit 7 converts the digital signal to an analog signal, which is output as sound from a speaker 8.

The phoneme data described above are created by extracting parameters from natural speech. That is, parameters such as energy, pitch, duration, and acoustic-tube cross-sectional area are derived from basic speech data. The parameter values obtained from the basic speech data by inverse-filter analysis or similar means are not used as they are. They are normalized and revised many times, while listening to how well the sounds coarticulate, before the phoneme parameters are complete.

If the sound intensity is to be normalized on the basis of the energy value, energy is generally proportional to the waveform height, so a stronger sound is obtained by raising the waveform height. The energy value, however, is affected not only by the size of the waveform but also by the pitch. That is, for waveforms of the same height, the energy value must be made smaller when the pitch width is narrow and larger when it is wide.

It is therefore difficult to normalize sound intensity on the basis of the energy value alone.

The present invention therefore starts from the observation, drawn from speech data, that a vowel's energy is large where its pitch is large. Energy-versus-pitch data a are collected in advance from natural speech, as shown by the dotted line in Fig. 1(a), and a correlation characteristic curve b through the average values of that scatter is drawn as in Fig. 1(a). From then on, the energy value is determined mechanically from the pitch by reading it off this curve.

That is, for a given sound, when the pitch is P1 the energy is set to E1. Even when the waveform extracted for another sound gives an energy of E2, if its pitch is P2 the energy is corrected to E2′. Because pitch is easy to measure from the speech waveform, this correction is simple to carry out. The curve does not depend on the type of vowel, but it does differ from speaker to speaker.
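The correction procedure can be sketched as two steps: build a curve of average energy against pitch from measured pairs, then replace each measured energy with the value read off the curve at the measured pitch. The binning scheme, the linear interpolation between bin centers, the function names, and the sample numbers are all assumptions made for this illustration; the specification says only that the curve follows the average of the scatter.

```python
from bisect import bisect_left
from collections import defaultdict


def build_curve(samples, bin_width):
    """Average the energy in each pitch bin; return sorted (bin_center, mean_energy)."""
    bins = defaultdict(list)
    for pitch, energy in samples:
        bins[int(pitch // bin_width)].append(energy)
    return [((idx + 0.5) * bin_width, sum(v) / len(v)) for idx, v in sorted(bins.items())]


def energy_from_pitch(curve, pitch):
    """Read the corrected energy off the curve, interpolating linearly between points."""
    xs = [p for p, _ in curve]
    if pitch <= xs[0]:
        return curve[0][1]   # clamp below the measured range
    if pitch >= xs[-1]:
        return curve[-1][1]  # clamp above the measured range
    i = bisect_left(xs, pitch)
    (x0, y0), (x1, y1) = curve[i - 1], curve[i]
    return y0 + (y1 - y0) * (pitch - x0) / (x1 - x0)


# Hypothetical (pitch, energy) pairs measured from one speaker's natural speech.
measured = [(100, 1.0), (105, 3.0), (120, 4.0), (125, 6.0)]
curve = build_curve(measured, bin_width=20)
corrected = energy_from_pitch(curve, 120)
```

Because the curve differs from speaker to speaker, a separate curve would be built for each speaker's data; the same curve is then applied to every vowel of that speaker.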

The embodiment above describes the creation of phoneme data for the acoustic-tube cross-sectional area model, but the same approach can also be used to create phoneme data for the conventional PARCOR method. In that case, PARCOR coefficients are used for the phoneme time constant.

H. Effects of the Invention

As described above, when phoneme data are created, the present invention creates a pitch-energy correlation characteristic curve and corrects the energy value according to the pitch from this curve to normalize the phoneme data. It is therefore unnecessary, as it was in the past, to normalize the energy values and other parameters obtained from the basic speech data by inverse-filter analysis or similar means and to revise them repeatedly while listening to how well the sounds coarticulate. Because the energy value can be normalized from the pitch by way of the correlation characteristic curve, phoneme data can be normalized efficiently and without special skill.

[Brief Description of the Drawings]

Fig. 1 is a pitch-versus-energy correlation characteristic curve for explaining the present invention; Fig. 2 is an equivalent electric-circuit model of the acoustic tube; Fig. 3 is an electric circuit diagram that electrically simulates sound propagation; Fig. 4 is an equivalent circuit diagram of Fig. 3; Fig. 5 is a block diagram of speech synthesis from a text input signal, for explaining the present invention; and Figs. 6 and 7 are parameter explanatory diagrams.

1: Japanese-language processing unit; 3: syllable processing unit; 4: phoneme processing unit; 5: parameter interpolation unit; 6: acoustic tube model unit; 7: speech synthesis waveform unit; 8: speaker.

Claims

A speech synthesis method for synthesizing speech using phonemes as basic units, in which each phoneme is divided into rising, steady, and falling sections and phoneme data are created by specifying, for each section, a phoneme time constant, duration, pitch, pitch time constant, energy, energy time constant, and sound source, characterized in that a correlation characteristic curve between pitch and energy is created and the energy value is corrected according to the pitch on the basis of this characteristic curve, thereby normalizing the phoneme data.