JPS63174100A

JPS63174100A - Voice rule synthesization system

Info

Publication number: JPS63174100A
Application number: JP481587A
Authority: JP
Inventors: 武田　昌一; 市川　熹; 淺川　吉章
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1987-01-14
Filing date: 1987-01-14
Publication date: 1988-07-18

Abstract

(57) [Abstract] This publication contains application data from before electronic filing, so no abstract data is recorded.

Description

DETAILED DESCRIPTION OF THE INVENTION [Industrial Application Field] The present invention relates to a rule-based synthesis method for sentence speech, and more particularly to improving the sound quality of rule-synthesized speech.

[Conventional technology]

A method of synthesizing speech from the text of an arbitrary sentence or word is called "speech synthesis by rule" or simply "rule synthesis."

Rule-synthesized speech generally differs from natural speech because features such as the connections between phonemes, durations, and changes in pitch (voice height) are imposed externally by rules. Its sound quality is therefore worse than that of so-called "analysis-synthesis" speech, which preserves these features of natural speech as they are. Causes of the degraded sound quality of rule-synthesized speech include (1) reduced phoneme intelligibility and (2) unnatural sentence intonation.

Regarding the rules that govern sentence intonation, that is, prosodic rules, there are already known examples of rules that generate the intonation of Japanese declarative, interrogative, and imperative sentences, of emphasis, and of sentences with various expressions (for example, Ichikawa et al., "Experimental Considerations on the Naturalness of Synthesized Speech," Proceedings of the Acoustical Society of Japan 1-3-8 (Showa 42)).

However, the model used there only provides point pitch information per syllable, so it is insufficient to express the differences among interrogative, imperative, and wish sentences. As a result, the intonation of speech synthesized with such a pitch pattern sounds unnatural.

To express the intonation differences among various sentences adequately, the relationship between the fundamental frequency (pitch frequency) within a syllable and time must be made explicit.

As a model that can describe such a pitch pattern within a syllable and also define its temporal structure clearly, the "pitch control mechanism model," which is described by critically damped second-order linear systems, has conventionally been used (see H. Fujisaki et al., "Analysis of voice fundamental frequency contours for declarative sentences of Japanese," J. Acoust. Soc. Jpn. (E) 5, 4 (1984)). The pitch control mechanism model is the model described below.

The pitch control mechanism model assumes that the fundamental frequency, which carries information on voice pitch, is generated by the following process. The frequency of vocal cord vibration, that is, the fundamental frequency, is controlled by (1) impulse commands issued by the brain at each change of phrase and (2) step commands issued at each rise and fall of accent. Owing to the delay characteristics of the physiological mechanism, the impulse commands of (1) produce a curve that falls gently from the beginning to the end of the sentence (the phrase component), while the step commands of (2) produce a curve with sharp local rises and falls (the accent component). These two components are each modeled as the response of a critically damped second-order linear system to the respective commands, and the time pattern of the logarithmic fundamental frequency is expressed as the sum of the two components. FIG. 2 shows the pitch control mechanism model. The model fundamental frequency $F_0(t)$ ($t$ is time) is formulated as in the following equation.

Here, $F_{\min}$ is the minimum frequency, $I$ is the number of phrase commands, $A_{pi}$ is the magnitude of the $i$-th phrase command, $T_{0i}$ is the time of the $i$-th phrase command, $J$ is the number of accent commands, $A_{aj}$ is the magnitude of the $j$-th accent command, and $T_{1j}$ and $T_{2j}$ are the start time and end time of the $j$-th accent command, respectively. $G_{pi}(t)$ and $G_{aj}(t)$ are the impulse response function of the phrase control mechanism and the step response function of the accent control mechanism, respectively, and are given by the following equations.
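The formula referred to as equation (1) is missing from this text. Judging from the symbol definitions above and from the standard form of the cited Fujisaki model, it presumably reads as follows. This reconstruction is an assumption, not a transcription of the original.

```latex
\ln F_0(t) = \ln F_{\min}
  + \sum_{i=1}^{I} A_{pi}\, G_{pi}(t - T_{0i})
  + \sum_{j=1}^{J} A_{aj} \bigl[ G_{aj}(t - T_{1j}) - G_{aj}(t - T_{2j}) \bigr]
  \tag{1}
```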

$$G_{pi}(t) = \alpha_i^{2}\, t \exp(-\alpha_i t)\, u(t) \tag{2}$$

$$G_{aj}(t) = \min\bigl[\{1 - (1 + \beta_j t)\exp(-\beta_j t)\}\, u(t),\ \theta_j\bigr] \tag{3}$$

Here, $\alpha_i$ is the natural angular frequency of the phrase control mechanism for the $i$-th phrase command, $\beta_j$ is the natural angular frequency of the accent control mechanism for the $j$-th accent command, and $u(t)$ is the unit step function.

Also, $\theta_j$ is the upper limit of the accent component and is chosen to be, for example, 0.9.
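The conventional model described above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the function names are invented here, and the summation in `model_f0` assumes the standard Fujisaki form of equation (1), which is not legible in this text.

```python
import math

def phrase_response(t, alpha):
    """Impulse response of the phrase control mechanism, eq. (2)."""
    if t < 0.0:
        return 0.0  # unit step u(t)
    return alpha * alpha * t * math.exp(-alpha * t)

def accent_response(t, beta, theta=0.9):
    """Step response of the accent control mechanism, eq. (3), capped at theta."""
    if t < 0.0:
        return 0.0  # unit step u(t)
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), theta)

def model_f0(t, f_min, phrases, accents):
    """Model F0 in Hz at time t (s), assuming the standard form of eq. (1).

    phrases: list of (A_p, T_0, alpha)
    accents: list of (A_a, T_1, T_2, beta)
    """
    log_f0 = math.log(f_min)
    for a_p, t0, alpha in phrases:
        log_f0 += a_p * phrase_response(t - t0, alpha)
    for a_a, t1, t2, beta in accents:
        log_f0 += a_a * (accent_response(t - t1, beta) - accent_response(t - t2, beta))
    return math.exp(log_f0)
```

For example, `model_f0(0.4, 80.0, [(0.5, 0.0, 3.0)], [(0.4, 0.2, 0.6, 20.0)])` evaluates the contour 0.4 s after a single phrase command, during one accent command; the parameter values are illustrative only.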

Here, the units of the fundamental frequency (pitch frequency) and of the pitch control parameters ($A_{pi}$, $A_{aj}$, $T_{0i}$, $T_{1j}$, $T_{2j}$, $\alpha_i$, $\beta_j$, $F_{\min}$) are defined as follows: $F_0(t)$ and $F_{\min}$ are in [Hz]; $T_{0i}$, $T_{1j}$, and $T_{2j}$ are in [s]; $\alpha_i$ and $\beta_j$ are in [s$^{-1}$].

The values of $A_{pi}$ and $A_{aj}$ are those obtained when the units of the fundamental frequency and the pitch control parameters are defined as above.

An optimization method is used for the analysis. That is, the best approximation of the pitch pattern is estimated by finding the pitch control parameters that minimize the error between the pitch pattern generated by the pitch control mechanism model and the measured values obtained by analyzing and extracting the original speech.
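The estimation step can be illustrated with a toy fit. The text only says "an optimization method"; the exhaustive grid search below is a stand-in chosen for brevity, and the single-phrase-command setup, the squared error on log F0, and all names and values are assumptions made for this sketch.

```python
import math

def phrase_response(t, alpha):
    """Impulse response of the phrase control mechanism, eq. (2)."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0.0 else 0.0

def log_f0(t, f_min, a_p, alpha):
    # Simplifying assumption: one phrase command at t = 0 and no accent commands.
    return math.log(f_min) + a_p * phrase_response(t, alpha)

def squared_error(measured, f_min, a_p, alpha):
    """Sum of squared log-F0 differences between measurement and model."""
    return sum((math.log(f) - log_f0(t, f_min, a_p, alpha)) ** 2 for t, f in measured)

def fit_phrase(measured, f_min, a_p_grid, alpha_grid):
    """Exhaustive search for the (A_p, alpha) pair with the smallest error."""
    return min(
        ((a_p, alpha) for a_p in a_p_grid for alpha in alpha_grid),
        key=lambda p: squared_error(measured, f_min, p[0], p[1]),
    )

# Synthetic "measured" contour generated from known parameters.
f_min = 80.0
measured = [(0.05 * k, math.exp(log_f0(0.05 * k, f_min, 0.5, 3.0))) for k in range(1, 40)]
a_p_grid = [0.1 * k for k in range(1, 11)]
alpha_grid = [0.5 * k for k in range(2, 11)]
best = fit_phrase(measured, f_min, a_p_grid, alpha_grid)
```

Because the synthetic contour was generated with parameters that lie on the grid, the search recovers them. A real analysis would fit all commands jointly with a continuous optimizer.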

Known applications of the above pitch control mechanism model include Japanese Patent Application No. 57-190861, which applies it to word speech synthesis, and Japanese Patent Application No. 60-278915 and others, which apply it to speech synthesis of interrogative, imperative, wish, and other sentences; a considerable improvement in sound quality has been recognized. However, the synthesized speech still sounds mechanical, and phoneme intelligibility is still insufficient. Possible causes of reduced phoneme intelligibility include the method of creating (analyzing) the synthesis units and the unit concatenation processing, but local changes of the pitch pattern within phonemes are also a cause.

[Problem that the invention seeks to solve]

The conventional pitch control mechanism model described above cannot express local fluctuations at the phoneme level, which are effective for improving phoneme intelligibility. It also has difficulty expressing the rising tone that appears at the end of interrogative sentences and the subtle changes in fundamental frequency peculiar to various emotions and expressions, as in imperative and wish sentences. Thus, the pitch patterns actually analyzed and extracted from human speech contain various fluctuations peculiar to humans.

An object of the present invention is to provide a new pitch control mechanism model that can faithfully express such fluctuations, and further to provide a method of synthesizing speech with natural, human-like intonation using this model.

[Means for solving problems]

FIG. 1 shows a new pitch control mechanism model that is effective for solving the above problems.

The new model is characterized in that three further mechanisms are added to the conventional model, which consists only of (1) a phrase control mechanism and (2) an accent control mechanism: (3) a phoneme control mechanism, (4) a sentence-form designation control mechanism, and (5) an emphasis control mechanism. By introducing the three new control mechanisms (3) to (5), various fluctuation components can be added to the pitch pattern.

As equations that describe the new pitch control mechanism model simply, equations (i) to (vffl), for example, may be used. The units of the parameters in equations (i) to (vM) are defined in accordance with those of the conventional pitch control mechanism model.

Of course, the concrete formulas are not limited to equations (i) to (vi) above. When not all components are needed, as in affirmative sentences or sentences without emphasis, unnecessary terms in equation (vn) or (4) may be omitted as appropriate.

As with the conventional pitch control mechanism model, the model parameters can be estimated (analyzed) by an optimization method.

[Effect]

The phoneme control mechanism (3) generates the local fundamental-frequency fluctuation component of each phoneme. For example, it can express the local lowering of the fundamental frequency in voiced consonants such as /d/, /m/, /n/, /r/, and /w/, and the fall from a high fundamental frequency that is often seen at the transition into the following vowel after voiceless plosives such as /t/ and /k/. The sentence-form designation control mechanism (4) generates a component that expresses the rise in fundamental frequency at the end of an interrogative sentence. The emphasis control mechanism (5) generates components that express various emotions and expressions, as in imperative and wish sentences.
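The way the five mechanisms combine can be sketched as a sum of component responses in the log-frequency domain. The patent's own equations for the new model are not reproduced in this text, so everything below is an assumption made for illustration: the rectangular command shape for the accent, phoneme, sentence-form, and emphasis components, the use of critically damped second-order responses for all of them, and every name and value.

```python
import math

def impulse_response(t, omega):
    """Impulse response of a critically damped second-order system."""
    return omega * omega * t * math.exp(-omega * t) if t >= 0.0 else 0.0

def step_response(t, omega):
    """Step response of a critically damped second-order system."""
    return 1.0 - (1.0 + omega * t) * math.exp(-omega * t) if t >= 0.0 else 0.0

def pulse(t, amp, t_on, t_off, omega):
    """Response to a rectangular command of height amp between t_on and t_off."""
    return amp * (step_response(t - t_on, omega) - step_response(t - t_off, omega))

def log_f0(t, f_min, phrase, accent, phoneme, sentence_form, emphasis):
    """Log F0 as the sum of five components (assumed form).

    phrase: list of (A, T0, omega); the other four: lists of (A, T_on, T_off, omega).
    """
    total = math.log(f_min)
    total += sum(a * impulse_response(t - t0, w) for a, t0, w in phrase)
    for commands in (accent, phoneme, sentence_form, emphasis):
        total += sum(pulse(t, a, t1, t2, w) for a, t1, t2, w in commands)
    return total
```

In this sketch a negative phoneme command models the local dip at a voiced consonant, and a positive sentence-form command near the end of the utterance models the final rise of a question.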

〔Example〕

Embodiments of the present invention are described below with reference to FIGS. 3 to 5.

FIG. 3 shows the overall configuration of the arbitrary-sentence synthesis system.

In this system, when text in mixed kanji and kana is given as input data, the corresponding synthesized speech is obtained as output. The processing procedure is as follows.

First, the morphological analysis means of the Japanese language analysis section 1 breaks the input text down into words and determines the part of speech and the reading of each word. Based on this result, the speech language processing section 2 then determines the accent type of each word or phrase. This syntax-level processing yields syllable information, accent information, and so on. Phrase and sentence boundaries are determined from delimiters such as punctuation marks in the input text. Pause lengths within and between sentences can be specified by the number of spaces after a comma or period. The sentence type (interrogative, imperative, wish, and so on) can in some cases be determined from the conjugation of the sentence ending; it can also be specified by ending the sentence with a terminal symbol such as "？" or "！" instead of a period. For example, even for the same phoneme string "川を渡る" (cross the river), "川を渡る。" is a declarative sentence and "川を渡る？" is an interrogative sentence.

The above (1) syllable information, (2) accent information, (3) pause information, (4) phrase and sentence boundary information, and (5) grammatical information (if necessary, for example part-of-speech names) are expressed as a series of numbers called "syllable codes."

The syllable codes are the input information to the control parameter generation section 3.

The control parameter generation section 3 determines accent, intonation, and phoneme duration by rule and generates the pitch pattern and the time series of phoneme parameters accordingly. The accent type is known from the accent information. Specifically, accent information is given by inserting a syllable code number that indicates the accent immediately after the phoneme carrying the accent nucleus (the phoneme immediately before the pitch falls). If this syllable code is absent, the accent is of the flat type. Intonation is basically determined from the sentence type information, but modifications that depend on the phoneme sequence at the sentence ending are also applied. For example, the wish sentences "川を渡りたい！" (I want to cross the river!) and "川を渡りたいなあ！" have different intonation patterns.
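The accent encoding described above (a marker code placed right after the accent nucleus, with no marker meaning the flat type) can be decoded as in the following sketch. The marker value `ACCENT_CODE` and the integer syllable codes are hypothetical; the text does not give the actual code numbers.

```python
ACCENT_CODE = 900  # hypothetical syllable code number marking the accent nucleus

def accent_nucleus(codes):
    """Split a code sequence into syllables and the accent nucleus position.

    Returns (syllables, nucleus_index); nucleus_index is None for the flat type.
    """
    syllables, nucleus = [], None
    for code in codes:
        if code == ACCENT_CODE:
            nucleus = len(syllables) - 1  # the syllable just before the marker
        else:
            syllables.append(code)
    return syllables, nucleus
```

For instance, `accent_nucleus([11, 900, 12, 13])` reports the nucleus on the first syllable, while a sequence without the marker is treated as flat.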

The final pitch pattern is generated from both the accent type and the intonation. Because consonants are little affected by the surrounding context, the duration of a consonant is determined as an intrinsic length for each consonant type.

Vowels, in contrast, are modified in various ways by the surrounding context. Their duration is therefore determined from the accent type, the number of syllables, the position within the word, the type of the preceding consonant, the type of the vowel itself, and so on. Once the phoneme durations have been determined, the phoneme parameters (spectral envelope parameters and source parameters) registered in a file in CV (consonant-vowel) units are retrieved according to the syllable codes and arranged in sequence. A unit that is too long is cut to fit within its duration. The CV units are then connected by interpolation so as to fill the cut portions or gaps (linear interpolation for the spectral envelope parameters, repetition of the same value for the source parameters).
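The cut-and-connect step for CV units can be sketched as follows. This is an illustration under stated assumptions: each unit is a list of parameter frames, the gap length is given in frames, and the "same value" repeated for source parameters is taken to be the last frame of the preceding unit, which the text does not specify. All function names are invented here.

```python
def fit_to_duration(frames, n_frames):
    """Cut a CV unit's parameter frames if it is longer than its duration."""
    return frames[:n_frames]

def linear_fill(left, right, n_gap):
    """n_gap frames linearly interpolated between two spectral-envelope frames."""
    return [
        [a + (b - a) * (k + 1) / (n_gap + 1) for a, b in zip(left, right)]
        for k in range(n_gap)
    ]

def hold_fill(left, n_gap):
    """n_gap copies of the preceding source-parameter frame (assumed choice)."""
    return [list(left) for _ in range(n_gap)]

def connect_spectral(unit_a, unit_b, n_gap):
    """Join two units' spectral-envelope frames with linear interpolation."""
    return unit_a + linear_fill(unit_a[-1], unit_b[0], n_gap) + unit_b

def connect_source(unit_a, unit_b, n_gap):
    """Join two units' source-parameter frames by repeating the same value."""
    return unit_a + hold_fill(unit_a[-1], n_gap) + unit_b
```

With a two-frame gap, `connect_spectral([[0.0, 0.0]], [[3.0, 6.0]], 2)` inserts the frames `[1.0, 2.0]` and `[2.0, 4.0]` between the two units.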

Finally, the fundamental frequency and phoneme parameters generated by the above processing are sent in sequence to the speech synthesis section 4, which outputs the speech waveform. The residual compression method, for example, may be used as the speech synthesis method.

In this case, the excitation pulses are basically generated by extracting one pitch period's worth of residual pulses (the representative residual) for each frame and arranging the representative residuals at intervals of the externally supplied pitch period. If the supplied pitch period is shorter than the representative residual, the tail of the representative residual is truncated by the difference in length; if it is longer, the interval not covered by the representative residual is filled with zeros.
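The truncate-or-pad rule for the excitation signal can be written directly. This is a minimal sketch assuming pitch periods are given in samples and that one representative residual is reused for every period; the function names are not from the source.

```python
def fit_residual(residual, pitch_period):
    """Truncate or zero-pad one representative residual to pitch_period samples."""
    if pitch_period <= len(residual):
        return residual[:pitch_period]  # drop the tail
    return residual + [0.0] * (pitch_period - len(residual))  # fill with zeros

def excitation(residual, pitch_periods):
    """Place the representative residual once per supplied pitch period."""
    signal = []
    for period in pitch_periods:
        signal.extend(fit_residual(residual, period))
    return signal
```

For a three-sample residual and pitch periods of 2 and 4 samples, the first copy loses its last sample and the second gains one trailing zero.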

All of the above processing, except for the intonation generation rules described below, can be implemented by known means.

Of the arbitrary-sentence synthesis system described above, the intonation generation rules in the control parameter generation section 3 are the most important part of the present invention. An embodiment of these rules is described below with reference to FIGS. 4 and 5.

First, the syllable code string obtained from the speech language processing section 2 is input to the sentence type determining means 5. In the first stage, the sentence type is determined by matching the ending forms registered in the ending dictionary of the sentence type information dictionary 6 against the sentence-final form of the syllable code string. The terminal form in FIG. 4 is defined by well-known rules of Japanese grammar: in modern Japanese, for example, a verb ending in the u-row or an adjective ending in "i". The imperative form is likewise identified, in modern Japanese, by a conjugated ending in the e-row. These sentence type decisions become more reliable if grammatical information such as part-of-speech information is available. If the ending is judged to be a terminal form, the sentence is not necessarily declarative. In that case, as the second stage, the terminal symbol of the sentence (the sentence-final symbol) is examined, and the sentence type is determined by the kind of symbol. FIG. 5 shows an example of the processing of the sentence type determining means 5.
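The two-stage decision can be sketched as below. The dictionary entries, the e-row test, and the symbol table are hypothetical stand-ins: the contents of the sentence type information dictionary 6 are not listed in this text, and the real system matches syllable codes rather than kana strings.

```python
# Hypothetical ending dictionary: (suffix, sentence type), longest suffix first.
ENDING_DICTIONARY = [
    ("たいなあ", "wish"),
    ("たい", "wish"),
]

# e-row kana, used here as an assumed marker of the imperative form.
E_ROW = set("えけせてねへめれ")

# Hypothetical terminal-symbol table, consulted only for terminal-form endings.
TERMINAL_SYMBOLS = {"。": "declarative", "？": "interrogative"}

def sentence_type(sentence):
    """Classify a sentence whose last character is its terminal symbol."""
    body, symbol = sentence[:-1], sentence[-1]
    # Stage 1: match the sentence ending against the ending dictionary.
    for suffix, kind in ENDING_DICTIONARY:
        if body.endswith(suffix):
            return kind
    if body and body[-1] in E_ROW:
        return "imperative"
    # Stage 2: a terminal-form ending is ambiguous, so use the terminal symbol.
    return TERMINAL_SYMBOLS.get(symbol, "declarative")
```

With these tables, "川を渡る。" and "川を渡る？" share the same ending and are separated only in the second stage, as in the example given earlier in the description.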

Returning to FIG. 4, the sentence type determining means 5 not only outputs the sentence type information described above but also separates the syllable information from the other information and outputs a syllable information string. The syllable information is used (1) to determine phoneme boundaries and (2) to generate the phoneme component of the pitch pattern. For (1), the phoneme duration rule section 9 determines the phoneme duration of each syllable from the syllable information (by the known method mentioned above), and the phoneme boundary determining means 7 determines the phoneme boundary times by arranging these durations in sequence. The phoneme boundary times are also used to generate phoneme parameters such as LSP parameters. For (2), the syllable information is used in the sentence pitch control parameter generation section 11 to determine the parameter values of the phoneme control mechanism.

The sentence type information is input to the intonation rule section 8, and the standard intonation (for example, that of a declarative sentence) is modified according to the sentence type. There are two kinds of modification: modification of timing and modification of amplitude (command magnitude). The former acts on the phoneme boundary determining means 7 and changes the phoneme boundary times. The latter acts on the sentence pitch control parameter generation section 11, where command magnitudes are changed or new sentence-form designation commands or emphasis commands are added.

The control parameters of the standard intonation are supplied from the accent rule section 10. To keep the pitch control parameters time-aligned with the phoneme information, the sentence pitch control parameter generation section 11 obtains the reference phoneme boundary times (timing reference information) from the phoneme boundary determining means 7. The intonation rules described above can be implemented by providing a rule table in the intonation rule section 8 and referring to it. The phoneme control parameters (command magnitude, natural angular frequency, time relative to the boundary, upper limit, and so on) are obtained in advance by analysis for each phoneme and stored in the phoneme rule section 13 as a table indexed by syllable information. From this table, the phoneme control parameter sequence is sent to the sentence pitch control parameter generation section 11 in the order of the syllable information string. There, the phoneme start and end times (relative times) are converted to absolute times on the basis of the timing reference information. The pitch control parameters thus created in the sentence pitch control parameter generation section 11 are sent to the pitch pattern generation section 12, where the sentence pitch pattern is generated by the new pitch control mechanism model (equations (i) to (4)).
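The table lookup and the conversion from relative to absolute times can be sketched as follows. The table contents are invented for illustration, times are kept in integer milliseconds for simplicity (the description uses seconds), and the upper-limit parameter is omitted.

```python
# Hypothetical phoneme rule table: phoneme -> (command magnitude,
# natural angular frequency [1/s], start and end relative to the boundary [ms]).
PHONEME_RULES = {
    "m": (-0.05, 60.0, 0, 40),
    "t": (0.08, 80.0, 0, 30),
}

def phoneme_commands(phonemes, boundary_times_ms):
    """Build phoneme commands with absolute start and end times.

    boundary_times_ms plays the role of the timing reference information.
    Returns a list of (magnitude, t_start_ms, t_end_ms, omega).
    """
    commands = []
    for phoneme, boundary in zip(phonemes, boundary_times_ms):
        rule = PHONEME_RULES.get(phoneme)
        if rule is None:
            continue  # no phoneme component registered for this phoneme
        magnitude, omega, rel_start, rel_end = rule
        commands.append((magnitude, boundary + rel_start, boundary + rel_end, omega))
    return commands
```

For the sequence `["m", "a", "t"]` with boundaries at 100, 160, and 300 ms, two commands result, one starting at 100 ms and one at 300 ms; the vowel has no entry in this illustrative table.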

This embodiment has the effect of giving natural, human-like expression to speech synthesized from text in mixed kanji and kana.

〔Effect of the invention〕

FIG. 6(a) is an example of the result of estimating, with the conventional pitch control mechanism model, the measured pitch pattern of part of a sentence. FIG. 6(b) is the result of estimating the pitch pattern of the same sentence with the new pitch control mechanism model. Comparing the two figures makes clear how important a role the phoneme control mechanism plays in faithfully expressing subtle changes in the fundamental frequency.

FIG. 6(c) is an example of estimating the pitch pattern of an interrogative sentence; it shows that a component expressing the rise at the end of the sentence is needed. The sentence-form designation control mechanism fulfills this role.

FIG. 6(d) is an example of estimating the pitch pattern of an imperative sentence. A good approximation to the measured values cannot be obtained unless a component other than the conventional phrase and accent components is added. In this figure, the approximation to the measured values is improved by adding an emphasis component.

These improvements in pitch pattern estimation accuracy have been confirmed not only for the above examples but for a variety of sentences.

When the average error of the generated pitch patterns relative to the measured values was calculated for the conventional pitch control mechanism model and for the new model, it was about ±4% for the conventional model and ±1% or less for the new model, which confirms a large improvement quantitatively as well.
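The text does not define how the average error was computed. One plausible reading, shown here purely as an assumption, is the mean relative deviation of the modeled F0 from the measured F0 over voiced frames, expressed in percent.

```python
def mean_relative_error_percent(measured, modeled):
    """Mean of |modeled - measured| / measured over voiced frames, in percent.

    Frames with measured F0 <= 0 are treated as unvoiced and skipped
    (an assumed convention, not stated in the source).
    """
    pairs = [(m, g) for m, g in zip(measured, modeled) if m > 0.0]
    return 100.0 * sum(abs(g - m) / m for m, g in pairs) / len(pairs)
```

For measured values of 100 Hz and 200 Hz modeled as 104 Hz and 192 Hz, each frame deviates by 4% and so does the mean.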

As described above, the present invention makes it possible to generate intonation patterns that approximate the original speech very closely. Not only can the subtle fluctuations peculiar to humans, which could not be expressed before, be reproduced faithfully, but the differences among interrogative, imperative, wish, and other sentences can also be expressed clearly. The result is not only very natural, human-like intonation but also improved phoneme intelligibility, so the improvement in sound quality is remarkable.

[Brief explanation of the drawing]

FIG. 1 shows the basic part of the present invention, FIG. 2 shows the basic part of the conventional system, FIGS. 3 to 5 show embodiments of the present invention, and FIG. 6 is a diagram for explaining the present invention.

3: control parameter generation section; 8: intonation rule section; 10: accent rule section; 11: sentence pitch control parameter generation section; 12: pitch pattern generation section.

Claims

1. A speech rule synthesis method having means for generating a time-varying pattern of the fundamental frequency (hereinafter simply referred to as a "pitch pattern"), characterized in that the pitch pattern generation means has five control mechanisms: (1) a phrase control mechanism, (2) an accent control mechanism, (3) a phoneme control mechanism, (4) a sentence-form designation control mechanism, and (5) an emphasis control mechanism.

2. The speech rule synthesis method according to claim 1, characterized in that the pitch pattern generation means has (1) a phrase control mechanism and (2) an accent control mechanism, and further has at least one of the following three control mechanisms: (3) a phoneme control mechanism, (4) a sentence-form designation control mechanism, and (5) an emphasis control mechanism.

3. The speech rule synthesis method according to claims 1 and 2, further comprising means for setting the values of the input commands corresponding to each of the control mechanisms.

4. The speech rule synthesis method according to claims 1 to 3, wherein each of the control mechanisms is described by a critically damped second-order linear system.

5. The speech rule synthesis method according to claims 1 to 4, characterized in that the pitch pattern generation means has a phoneme control mechanism, and the phoneme control mechanism outputs, in response to an input command, either the response function of a critically damped second-order linear system or an exponential function, depending on the type of phoneme.