JPH0519780A

JPH0519780A - Device and method for voice rule synthesis

Info

Publication number: JPH0519780A
Application number: JP3172180A
Authority: JP
Inventors: Shoichi Takeda; 昌一武田; Hiroshi Ichikawa; 熹市川
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1991-07-12
Filing date: 1991-07-12
Publication date: 1993-01-29

Abstract

PURPOSE: To improve the ability to express prominence, that is, the local emphasis of speech, by assigning prosody control parameters according to a classification of prominence in Japanese. CONSTITUTION: When the sentence unit to which prominence is added is a whole phrase, the variation ΔDAa is set to a positive value. ΔDAa is the difference between the 'accent command increment' of the accent component concerned (the increment in accent component magnitude relative to the case in which no prominence is added) and the 'accent command increment' of an adjacent accent component. When the sentence unit to which prominence is added is part of a word and the accent type is not the accent-modified type, ΔDAa is set to a positive value and a pause is inserted immediately before or after the sentence unit; when the accent type is the accent-modified type, ΔDAa is set to a positive value. Further, when the sentence unit to which prominence is added is one beam of a subject, ΔDAa is set to a positive value and a pause is inserted immediately before and after the sentence unit.

Description

Detailed Description of the Invention

[0001]

[Industrial Field of Application] The present invention relates to an apparatus and method for rule-based synthesis of sentence speech, and more particularly to improving the quality of rule-synthesized speech.

[0002]

[Prior Art] The following documents are known as art related to the present invention.

[0003]
1. Akira Ichikawa (市川熹), et al.; Experimental study on the naturalness of synthetic speech, Proceedings of the Acoustical Society of Japan 1-3-8 (Showa 42, 1967)
2. Tsuyoshi Nakayama, et al.; Expression of questions and emphasis by controlling the sound source characteristics of synthetic speech, IECE National Convention 64 (Showa 43, 1968)
3. JP-A-59-081697 (uses the Fujisaki model for rule synthesis of words)
4. JP-A-60-074224 (resets the tone of utterance for each paragraph and additionally applies natural fluctuation)
5. JP-A-62-138898 (generates the intonation of interrogative, imperative, optative, and other sentences with the Fujisaki model)
6. H. Fujisaki et al., "Analysis of voice fundamental frequency contours for declarative sentences of Japanese," J. Acoust. Soc. Jpn. (E) 5, 4 (1984)
7. Toshio Sato; On differences in the time elements of voiced and voiceless plosives, Journal of the Acoustical Society of Japan, Vol. 14, No. 2 (1958)
8. Kazuo Ochiai; Auditory study of pitch frequency change in voiceless plosives, Proceedings of the Acoustical Society of Japan 2-3-12 (Showa 43-11, November 1968)
9. JP-A-63-174100 (a model in which a phoneme control mechanism, a sentence-form designation control mechanism, and an emphasis control mechanism are added to the Fujisaki model)
10. Keikichi Hirose, Hiroya Fujisaki, and two others; Synthesis of sentence speech based on a model of the fundamental frequency pattern generation process, IEICE Transactions A, J72-A, 1, pp. 32-40 (1989-1)
11. Hisashi Kawai, Keikichi Hirose, Hiroya Fujisaki; Synthesis rules for prosodic features in the synthesis of Japanese speech, IEICE Technical Report (Speech), SP88-129 (1989-1)
12. Japanese Patent Application No. Hei 1-214799 (basic patent application on prominence generation rules)
13. Hiroya Fujisaki, Keikichi Hirose, and two others; Realization of accent components in continuous speech, Speech Research Group materials, S84-36 (1984-7)
14. Shoichi Takeda, Akira Ichikawa (市川熹); Estimation of pitch control mechanism model parameters for four-mora words, Proceedings of the Acoustical Society of Japan 1-5-13 (Showa 57-3, March 1982)

The prior art is described below with reference to these documents.

[0004] The technique of synthesizing, from the text of an arbitrary sentence or word, the speech corresponding to it is called "speech synthesis by rule" or simply "rule synthesis". In rule-synthesized speech, features such as the concatenation of phonemes, durations, and changes in pitch (voice height) are generally imposed from outside by rules, and so differ from those of natural speech. Consequently, rule-synthesized speech has poorer sound quality than so-called "analysis-synthesis" speech, which preserves the features of natural speech as they are. The causes of quality degradation in rule-synthesized speech include those attributable to reduced phoneme clarity and those attributable to unnatural sentence intonation.

[0005] As for the rules governing sentence intonation, that is, prosodic rules, rules are already known for generating the intonation of Japanese declarative, interrogative, and imperative sentences, of emphasis, and of sentences carrying various expressions (see References 1 and 2 above). However, the model used there gives only point pitch information per syllable, and is therefore insufficient to express the differences among interrogative, imperative, and optative sentences. As a result, the intonation of speech synthesized with such a pitch pattern (the time-variation pattern of the fundamental frequency) sounds unnatural.

[0006] To express the intonation differences among various sentences adequately, the relationship between the fundamental frequency (pitch frequency) within a syllable and time must be made explicit. As a model that can describe the pitch pattern within a syllable and also define the time structure clearly, the "pitch control mechanism model", described by a critically damped second-order linear system, has been used.

[0007] Applications of this pitch control mechanism model include word speech synthesis (Reference 3) and sentence speech synthesis of interrogative, imperative, optative, and other sentences (Reference 5), and a considerable improvement in sound quality has been confirmed.

[0008] Reference 9 further adds a component expressing local phoneme-level fluctuations (References 7 and 8), which is effective for improving phoneme clarity. It also adds a component (Reference 5) expressing the subtle fundamental-frequency changes peculiar to various emotions and expressions, such as the final rise that appears in interrogative sentences and the patterns of imperative and optative sentences. Reference 9 provides a method of synthesizing speech with natural, human-like intonation using a modified pitch control mechanism model that generates these components.

[0009]

[Problems to Be Solved by the Invention] Among the various pitch control mechanism models described above, the introduction of the phoneme control mechanism has improved the phoneme clarity of synthesized speech. However, for ordinary sentences that carry no emotion or special expression, the monotony and mechanical quality of the utterance have not been removed. This monotony and mechanical quality place a heavy burden on, and cause fatigue in, those who use a speech synthesis system for long periods. Unless they are removed, the technology cannot be applied to systems intended for prolonged use, such as read-aloud checking in newspaper proofreading.

[0010] On the other hand, one reason why listening to natural human speech for a long time causes little fatigue is that speakers vary their utterances by locally strengthening some parts and weakening others. That is, where speakers wish to emphasize, they raise the pitch relatively, speak louder, and speak slowly. Conversely, unimportant parts are spoken in a low, quiet voice, quickly and indistinctly. In other words, spoken language also uses emphatic expression corresponding to quotation brackets or boldface in written language. Because of this strengthening and weakening, the listener does not have to attend closely to the utterance at all times, and the burden is reduced.

[0011] The present invention provides an apparatus and method for realizing, in rule-synthesized speech, the strengthening and weakening found in natural speech.

[0012]

[Means for Solving the Problems] The strengthening and weakening in sentence speech described above are realized through strength relative to other parts of the sentence. Strengthening that makes one part stand out relative to the others is called "prominence" or "contrastive emphasis".

[0013] Following a classification of prominence from a linguistic standpoint, the present invention introduces measures for quantitatively expressing the prosodic features of these prominences. That is, a prominence generation rule is used that stores, for each class of prominence, prosody control parameters obtained from the analysis of natural speech, and the prosody control parameters applied when prominence is added are controlled in accordance with that prominence generation rule.
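The rule lookup described above can be sketched as a small dispatch function. This is only an illustration under stated assumptions: the three cases and their pause placement follow the Abstract, but the function name, the string labels for the classes (including `one_beat` for the unit the Abstract words as "one beam of a subject"), and the dictionary layout are hypothetical. No numeric magnitudes are given because the text here specifies only that ΔDAa is positive.

```python
def prominence_controls(unit, accent_modified=False):
    """Return the prosody controls for one prominence class.

    unit: hypothetical label, one of "whole_phrase", "part_of_word", "one_beat".
    accent_modified: True when the accent type is the accent-modified type.
    """
    if unit == "whole_phrase":
        # Whole phrase: only the accent command increment difference is raised.
        return {"delta_daa_positive": True, "pause": None}
    if unit == "part_of_word":
        if accent_modified:
            # Accent-modified type: no pause is inserted.
            return {"delta_daa_positive": True, "pause": None}
        # Otherwise a pause goes immediately before or after the unit.
        return {"delta_daa_positive": True, "pause": "before_or_after"}
    if unit == "one_beat":
        # Third class in the Abstract: pauses on both sides of the unit.
        return {"delta_daa_positive": True, "pause": "before_and_after"}
    raise ValueError(f"unknown prominence unit: {unit!r}")
```

A real rule table would also hold the measured parameter values for each class; those are not reproduced here.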

[0014] In terms of speech information processing, these prominences are realized by increasing or decreasing (1) the fundamental frequency, (2) the speech waveform amplitude (power), and (3) the duration (of phonemes or of pauses). In particular, the present invention realizes prominence by controlling the magnitude of the accent command. In addition, as necessary, the duration is controlled by inserting pauses, or the magnitude of the power is controlled.

[0015] Power correlates strongly with the fundamental frequency: when prominence raises the fundamental frequency, the power increases accordingly.

[0016]

[Operation] Because the prosody control based on the prominence generation rule of the present invention was derived from a quantitative analysis of natural speech, it can give speech synthesized from an input document (text) natural, human-like strengthening and weakening. According to the present invention, strengthening and weakening can be realized for almost all cases that can occur in actual sentence speech. Users can therefore understand the utterance content easily without paying special attention, and the burden on users can be reduced significantly. In particular, the reduction of fatigue during prolonged work such as newspaper proofreading is marked, and improved work efficiency can be expected.

[0017]

[Embodiments] First, the "pitch control mechanism model" used in the embodiments of the present invention is described. The pitch control mechanism model is the model described below.

[0018] The pitch control mechanism model assumes that the fundamental frequency, which conveys voice pitch, is generated by the following process. The frequency of vocal cord vibration, that is, the fundamental frequency, is controlled by impulse commands issued from the brain at each change of phrase and by step commands issued at each rise and fall of accent. Owing to the delay characteristics of the physiological mechanism, the impulse commands produce a gently declining curve from the beginning to the end of the sentence (the phrase component), and the step commands produce a curve with sharp local rises and falls (the accent component). These two components are modeled as the responses of critically damped second-order linear systems to the respective commands, and the time-variation pattern of the logarithmic fundamental frequency is expressed as the sum of the two components. FIG. 2 shows the pitch control mechanism model. The model fundamental frequency F₀(t) (t is time) is formulated as follows.

[0019]

[Equation 1]

[0020] Here, Fmin is the minimum frequency, I is the number of phrase commands, Ap(i) is the magnitude of the i-th phrase command, T₀(i) is the time of the i-th phrase command, J is the number of accent commands, Aa(j) is the magnitude of the j-th accent command, and T₁(j) and T₂(j) are the start time and end time of the j-th accent command, respectively. Gp(i, t) and Ga(j, t) are, respectively, the impulse response function of the phrase control mechanism and the step response function of the accent control mechanism, and are given by the following equations.

[0021]

[Equation 2] Gp(i,t) = α(i)t exp(-α(i)t) u(t) … (Equation 2)

[0022]

[Equation 3] Ga(j,t) = Min[1-(1+β(j)t) exp(-β(j)t) u(t), θ(j)] … (Equation 3)

Here, α(i) is the natural angular frequency of the phrase control mechanism for the i-th phrase command, β(j) is the natural angular frequency of the accent control mechanism for the j-th accent command, and u(t) is the unit step function. θ(j) is the upper limit of the accent component and is chosen as, for example, 0.9.

[0023] The units of the fundamental frequency (pitch frequency) and of the pitch control parameters (Ap(i), Aa(j), T₀(i), T₁(j), T₂(j), α(i), β(j), Fmin) are defined as follows. F₀(t) and Fmin are in [Hz]; T₀(i), T₁(j), and T₂(j) are in [s]; α(i) and β(j) are in [1/s]. The values of Ap(i) and Aa(j) are those obtained when the units of the fundamental frequency and of the pitch control parameters are defined as above.
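The model can be sketched in a few lines of Python. Equation 1 is not reproduced in this text, so the sketch assumes the usual form implied by the definitions above, ln F₀(t) = ln Fmin + Σ Ap(i) Gp(i, t - T₀(i)) + Σ Aa(j) [Ga(j, t - T₁(j)) - Ga(j, t - T₂(j))]. It also assumes that the accent response is zero before the command starts. The function names and the tuple layout of the command lists are hypothetical.

```python
import math

def gp(alpha, t):
    """Phrase impulse response of Equation 2: alpha * t * exp(-alpha * t) for t >= 0."""
    return alpha * t * math.exp(-alpha * t) if t >= 0 else 0.0

def ga(beta, t, theta=0.9):
    """Accent step response of Equation 3, capped at theta.

    Assumption: the response is taken as 0 for t < 0.
    """
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), theta)

def f0(t, fmin, phrases, accents):
    """Model fundamental frequency in Hz at time t (seconds).

    phrases: list of (Ap, T0, alpha); accents: list of (Aa, T1, T2, beta).
    Uses the assumed form of Equation 1 stated in the lead-in.
    """
    log_f0 = math.log(fmin)
    for ap, t0, alpha in phrases:
        log_f0 += ap * gp(alpha, t - t0)
    for aa, t1, t2, beta in accents:
        log_f0 += aa * (ga(beta, t - t1) - ga(beta, t - t2))
    return math.exp(log_f0)
```

For example, `f0(0.4, 80.0, [(0.5, 0.0, 3.0)], [(0.4, 0.2, 0.6, 20.0)])` evaluates the contour inside one accent command; the parameter values are illustrative and do not come from the source.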

[0024] An optimization method is used for the analysis. That is, the best approximation of the pitch pattern is estimated by finding the pitch control parameters that minimize the error between the pitch pattern generated by the pitch control mechanism model and the measured values obtained by analysis and extraction from the original speech (Reference 6).
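As a much reduced illustration of this error minimization, the sketch below fits only the magnitude Aa of a single accent command, with all timing parameters and the rest of the contour held fixed. In that restricted case the log-domain contour is linear in Aa, so the least-squares estimate has a closed form. This is an assumption-laden simplification, not the procedure of Reference 6, which estimates the parameters jointly. All names are hypothetical.

```python
import math

def accent_shape(t, t1, t2, beta, theta=0.9):
    """Unit-magnitude accent component Ga(t - T1) - Ga(t - T2), following Equation 3."""
    def ga(x):
        if x < 0:
            return 0.0  # assumption: no response before the command starts
        return min(1.0 - (1.0 + beta * x) * math.exp(-beta * x), theta)
    return ga(t - t1) - ga(t - t2)

def fit_accent_amplitude(times, f0_measured, baseline_log_f0, t1, t2, beta):
    """Least-squares estimate of Aa in ln F0(t) = baseline(t) + Aa * shape(t)."""
    num = 0.0
    den = 0.0
    for t, f0_value, base in zip(times, f0_measured, baseline_log_f0):
        s = accent_shape(t, t1, t2, beta)
        num += s * (math.log(f0_value) - base)
        den += s * s
    if den == 0.0:
        raise ValueError("no sample overlaps the accent command")
    return num / den
```

Here `baseline_log_f0` stands for ln Fmin plus the phrase components at each sample time. Timing parameters and α, β enter nonlinearly and would need an iterative search.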

[0025] Next, the "modified pitch control mechanism model" is described. FIG. 3(a) shows the modified pitch control mechanism model.

[0026] The feature of this modified model is that three control mechanisms, namely a phoneme control mechanism, a sentence-form designation control mechanism, and an emphasis control mechanism, are added to the model consisting of the phrase control mechanism and the accent control mechanism. Introducing these three additional control mechanisms makes it possible to add various fluctuation components to the pitch pattern.

[0027] That is, the phoneme control mechanism generates the component of local fundamental-frequency fluctuation for each phoneme. It can express, for example, the local lowering of the fundamental frequency at voiced consonants such as /d/, /m/, /n/, /r/, and /w/, and the fall from a high fundamental frequency that is often seen at the transition from voiceless plosives such as /t/ and /k/ into the following vowel. The sentence-form designation control mechanism generates the component expressing the sentence-final rise in fundamental frequency of interrogative sentences. The emphasis control mechanism is intended to generate components expressing various emotions and expressions, as in imperative and optative sentences.

[0028] As equations that describe the modified pitch control mechanism model simply, the above-mentioned Equations 17 to 24 may be used, for example. The units of the parameters in Equations 17 to 24 are defined in conformity with the conventional pitch control mechanism. Of course, concrete implementations are not limited to Equations 17 to 24. Depending on the nature of the sentence speech and the control method chosen, a pitch pattern can be generated with any combination of the control mechanisms of Equations 17 to 22. For example, if strengthening is expressed with the emphasis component, the accent command and the emphasis command are superimposed, as in (1) of FIG. 3(b). However, the same pitch pattern as that obtained with these commands can also be obtained with accent commands alone, as in (2) of FIG. 3(b). A stepwise change to another command value at the end of one accent command, as in this case, is called "accent modification". "An emphasis component superimposed on an accent component" and "accent modification" can be converted into each other through the following relations.

[0029]

[Equation 4] Aa₂ = Aa₁ + As … (Equation 4)

[0030]

[Equation 5] T₁₂ = T₇₁ … (Equation 5)

[0031]

[Equation 6] T₂₂ = T₈₁ … (Equation 6)
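Equations 4 to 6 amount to a simple change of representation, sketched below in both directions. The function names are hypothetical. Reading T₇₁ and T₈₁ as the start and end times of the superimposed emphasis command and As as its magnitude is an assumption drawn from how the equations pair the symbols, since FIG. 3(b) is not reproduced here.

```python
def to_accent_modification(aa1, a_s, t71, t81):
    """Superimposed form -> accent-modified form.

    Returns (Aa2, T12, T22) of the second accent command per Equations 4 to 6.
    """
    return aa1 + a_s, t71, t81

def to_superimposed_emphasis(aa1, aa2, t12, t22):
    """Accent-modified form -> superimposed form.

    Returns (As, T71, T81) of the emphasis command by inverting Equations 4 to 6.
    """
    return aa2 - aa1, t12, t22
```

Because the two forms yield the same command level over the interval concerned, a synthesizer that implements only accent commands can still reproduce a pattern specified with an emphasis component.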

[0032] As with the conventional pitch control mechanism model, the model parameters can be estimated (analyzed) by an optimization method (Reference 6).

[0033] According to Japanese Patent Application No. Hei 1-214799 (Reference 12), No. Hei 2-183947, and No. Hei 2-250172, previously proposed by the present inventors, in a sentence that contains strengthening, compared with one that does not, the prominent part shows an increase in (1) the magnitude of the accent command, (2) the power, or (3) the phoneme duration, and in some cases a pause occurs. Conversely, (1) the magnitude of the accent command or (2) the power may decrease, as in the sentence-final weakening of declarative sentences. Strengthening or weakening by prominence is therefore realized by increasing or decreasing the values of (1) to (3), which are collectively called "prosody". The prosodic elements (1) to (3) may increase or decrease individually or in combination. Naturally, the prominence effect is greater when they are increased or decreased in combination.

[0034] The present invention introduces measures for quantitatively expressing the prosodic features of prominence. That is, the following quantities are defined as measures of the position and degree of strengthening in a prominence-containing sentence (target speech), taking a sentence without strengthening (reference speech) as the baseline.

(1) F₀ ratio (F0R): the ratio of the fundamental frequency F₀x of the target speech to the fundamental frequency F₀r of the reference speech, defined by the following equation. The fundamental frequency values used were those estimated with the Fujisaki model.

[0035]

[Equation 7] F0R = 20log(F₀x/F₀r) (dB) … (Equation 7)

(2) Accent command increment (DAa): the increment of the accent command magnitude Aax of the target speech over the accent command magnitude Aar of the reference speech, defined by the following equation.

[0036]

[Equation 8] DAa = Aax - Aar … (Equation 8)

(3) Power ratio (POWR): the ratio of the power Px of the target speech to the power Pr of the reference speech, defined by the following equation.

[0037]

[Equation 9] POWR = 10log(Px/Pr) (dB) … (Equation 9)

(4) Time change rate (TIMEWARP): expresses the degree of temporal expansion or contraction of the target speech relative to the reference speech. Letting Tr(i) and Tx(i) be the durations of the corresponding phonemes in the reference speech and the target speech (i denotes the i-th phoneme), the time change rate TW(i) of the i-th phoneme is defined by the following equation.

[0038]

[Equation 10] TW(i) = (Tx(i)-Tr(i))/Tr(i)×100 (%) … (Equation 10)

Using the above measures of prosodic features, the results of the quantitative analysis of prominence are summarized as follows (see Japanese Patent Application Nos. Hei 2-183947 and Hei 2-250172).
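The four measures of Equations 7 to 10 are simple enough to compute directly, as in the following sketch. The function names are hypothetical, and the logarithms are assumed to be base 10, which is the usual convention for values quoted in dB but is not stated explicitly in the text.

```python
import math

def f0_ratio_db(f0x, f0r):
    """Equation 7: F0 ratio of target to reference speech, in dB (base-10 log assumed)."""
    return 20.0 * math.log10(f0x / f0r)

def accent_command_increment(aax, aar):
    """Equation 8: DAa, target accent command magnitude minus reference magnitude."""
    return aax - aar

def power_ratio_db(px, pr):
    """Equation 9: power ratio of target to reference speech, in dB (base-10 log assumed)."""
    return 10.0 * math.log10(px / pr)

def time_change_rate_percent(tx, tr):
    """Equation 10: relative change in phoneme duration, in percent."""
    return (tx - tr) / tr * 100.0
```

Under these definitions, a target phoneme lasting 150 ms against a 100 ms reference gives a time change rate of 50 %, and doubling the fundamental frequency gives an F₀ ratio of about 6 dB.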

[0039] [1] Fundamental frequency

The fundamental-frequency characteristics of prominence are examined in terms of the accent command magnitude Aa, the start time T₁, the end time T₂, and the accent-modification start time T₁₂. The accent start, accent end, and accent-modification start times are obtained as the values ΔT₁, ΔT₂, and ΔT₁₂, measured relative to, respectively, the syllable boundary at which the accent rises from low to high, the syllable boundary at which it falls from high to low, and the syllable boundary at which it changes from one high level to another. However, for a head-high accent or prominence on the initial syllable, the start time of the initial syllable is taken as the reference time for measuring ΔT₁, and for a flat or tail-high accent or prominence on the final syllable, the end time of the final syllable is taken as the reference time for measuring ΔT₂.

[0040] For intentional prominence, the accent command increment DAa was defined above as a measure of the degree of strengthening by the fundamental frequency. However, prominence is sometimes realized not by increasing the target accent command itself but by making the accent commands before and after it relatively smaller. In that case DAa does not take a large value. The prominence effect due to the fundamental frequency is therefore evaluated with the difference of accent command increments defined by the following equation.

[0041]

[Equation 11] ΔDAa = DAap - DAan … (Equation 11)

Here, DAap is the command increment of the accent component on which prominence is placed, and DAan is the smaller of the command increments of the accent components adjacent to that accent component.
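Equation 11 can be sketched as follows, taking a list of accent command increments DAa, one per accent component in sentence order. The function name and the list representation are hypothetical. Handling a component at the sentence edge by using its single neighbour is an assumption, since the text does not address that case.

```python
def delta_daa(daa, p):
    """Equation 11: DAa of component p minus the smaller DAa of its neighbours.

    daa: accent command increments, one per accent component, in sentence order.
    p: index of the accent component on which prominence is placed.
    """
    neighbours = [daa[k] for k in (p - 1, p + 1) if 0 <= k < len(daa)]
    if not neighbours:
        raise ValueError("the prominent component has no adjacent accent component")
    return daa[p] - min(neighbours)
```

With increments `[-0.25, 0.0, -0.5]` and prominence on the middle component, the result is 0.5 even though the component's own DAa is zero, which is the case the measure was introduced to capture.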

[0042] The following tendencies were observed for the fundamental frequency.

[0043] (1) For accent components other than those of the sentence-initial and sentence-final phrases (bunsetsu), the magnitude of the accent command tended to depend on the accent type (accent-modified type or not). No effect of the presence or absence of a pause was observed, however (FIG. 5(a)). For the accent component of the sentence-final phrase, on the other hand, the magnitude of the accent command varied widely. Furthermore, for the accent component of the sentence-initial phrase, the data show almost the same distribution as the data for accent-modified accent components of phrases other than the initial and final ones, regardless of speaker or accent type. This is thought to be because the default prominence described below makes the degree of intentional strengthening relatively small (FIG. 6(a)).

[0044] (2) In interrogative sentences, the magnitude of the accent command showed a tendency toward sentence-final strengthening (FIG. 5(b)).

[0045] (3) Regarding the magnitude of the accent command, the accent of the initial phrase of a declarative sentence has a default prominence (FIG. 6(b)). Here, the difference between the accent command magnitude Aa₁ of the initial phrase (prosodic word) and the accent command magnitude Aa₂ of the second phrase (prosodic word) was used as the measure of the default prominence.

[0046] (4) No marked effect of prominence was observed on the start and end times of the accent command.

[0047] (5) Regarding the start time of the modification command in accent-modified (usually increasing) prominence, speaker A tended to lead the reference phoneme boundary, whereas speaker B showed neither lead nor lag (FIG. 7). From this result, it is considered that the start time of the prominence modification command can be set to coincide with the reference phoneme boundary time.

[0048] [2] Power

As shown in FIG. 8, a high correlation is found between the power P and the fundamental frequency (accent command magnitude Aa), with a correlation coefficient ρ ≈ 0.9. The regression line is given by the following equation.

[0049]

[Equation 12] P = 11Aa (dB) … (Equation 12)

Therefore, using Equation 12, the increase in power accompanying prominence can be determined uniquely from Aa.

【００５０】あるいは、若干の変動を許容してAlternatively, allowing a slight variation

【００５１】[0051]

【数１３】Ｐ=11Aa±4 (dB) …（数１３) の範囲内で値を定めても良い。[Equation 13] the value may be set within the range P = 11Aa ± 4 (dB) ... (Equation 13).
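As an illustration only, the relation between the accent command magnitude Aa and the power increase in Equations 12 and 13 can be sketched as follows. The function names and the default tolerance argument are hypothetical and are not part of the original description; only the coefficient 11 and the ±4 dB range come from the text.

```python
def power_increase_db(aa: float) -> float:
    """Regression line of Equation 12: P = 11 * Aa (dB)."""
    return 11.0 * aa


def power_range_db(aa: float, tolerance_db: float = 4.0) -> tuple:
    """Allowed range of Equation 13: P = 11 * Aa +/- 4 (dB)."""
    center = power_increase_db(aa)
    return (center - tolerance_db, center + tolerance_db)


def is_acceptable_power(aa: float, p_db: float, tolerance_db: float = 4.0) -> bool:
    """True if a chosen power increase lies inside the Equation 13 range."""
    low, high = power_range_db(aa, tolerance_db)
    return low <= p_db <= high
```

For example, under these assumptions an accent command magnitude of 0.5 corresponds to a nominal increase of 5.5 dB, with any value between 1.5 dB and 9.5 dB falling inside the tolerated range.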

【００５２】なお、この値は、基本周波数の増加による
パワーの自然増加の値にほぼ等しいので、単に、音源信
号（例えば予測残差）の振幅値を基本周波数によらず一
定値として合成器に送り込むのみの簡易な処理でも良
い。これにより、合成音声波形のパワーは、基本周波数
に依存して自然に上昇する。Since this value is almost equal to the natural increase in power caused by an increase in the fundamental frequency, a simple process may suffice: the amplitude of the sound source signal (for example, the prediction residual) is simply sent to the synthesizer as a constant value regardless of the fundamental frequency. As a result, the power of the synthesized speech waveform rises naturally as a function of the fundamental frequency.

【００５３】〔３〕時間構造時間構造の主要因は、音素持続時間およびポーズ持続時
間であり、プロミネンスは、これらの持続時間の伸長に
より表現されうる（ポーズの発生は、ポーズ持続時間が
０から正数値に増加する特別な場合）。ここでは、(1)
ポーズ持続時間が音素持続時間に与える影響、および
(2)疑問文末尾における音素持続時間の伸長という観点
から調べてみた。[3] Temporal structure. The main factors of temporal structure are phoneme duration and pause duration, and prominence can be expressed by lengthening these durations (the occurrence of a pause is the special case in which the pause duration increases from 0 to a positive value). Here, we examined (1) the effect of pause duration on phoneme duration and (2) the lengthening of phoneme duration at the end of interrogative sentences.

【００５４】（１）音素持続時間は、直後にポーズを発
生した場合に伸長する傾向が見られた。この傾向は、話
者ＡＢに共通している (図９)。(1) Phoneme duration tended to lengthen when a pause occurred immediately after the phoneme. This tendency is common to speakers A and B (FIG. 9).

【００５５】（２）疑問文文末母音においても伸長する
傾向があることがわかった。この傾向は、話者ＡＢに共
通している (図１０)。(2) The sentence-final vowel of an interrogative sentence also tended to lengthen. This tendency is common to speakers A and B (FIG. 10).

【００５６】図１１は、図４の各分類に対応したプロミ
ネンスを生成するための韻律の各要素の値（強めあるい
は弱め）を自然音声を対象とした上記定量的解析結果に
基づき求めたものである。但し、図１１の数値例は、プ
ロミネンスの付加されていない場合の各制御値に対する
増分、あるいは増加率で表している。図１１に従い制御
規則を作成すれば、自然なプロミネンスを合成音声に付
与することが出来る。図１１において、"±"の記号より
左側の数値は、その制御パラメータの代表値であり、"
±"の記号で数値の変動範囲（ほぼ１σに相当）を表し
ている。すなわち、この変動範囲内で数値を設定するか
ぎり、自然なプロミネンスを生成することが出来ること
を示している。なお、図１１中のパラメータで、プロミ
ネンスの付加されてない部分を相対的に弱めることによ
っても同様の効果を得ることができる。この場合は、従
来の公知の韻律制御規則（例えば公知例12）に従い、制
御パラメータ値を求め、上記数７〜数１３の定義式を用
いて、プロミネンス付加時の韻律制御パラメータ値を求
めればよい。FIG. 11 shows the values (strengthening or weakening) of each prosodic element for generating the prominence corresponding to each classification of FIG. 4, obtained from the above quantitative analysis of natural speech. The numerical examples in FIG. 11 are expressed as increments, or rates of increase, relative to each control value when no prominence is added. If control rules are created according to FIG. 11, natural prominence can be given to synthesized speech. In FIG. 11, the number to the left of the "±" symbol is the representative value of the control parameter, and the number after "±" gives its range of variation (roughly 1σ). That is, natural prominence can be generated as long as the value is set within this range. The same effect can also be obtained with the parameters in FIG. 11 by relatively weakening the portions to which prominence is not added. In that case, the control parameter values are first obtained according to a conventionally known prosody control rule (for example, known example 12), and the prosody control parameter values with prominence added are then obtained using the defining expressions of Equations 7 to 13.

【００５７】以上プロミネンスの韻律的特徴の解析と規
則の有効性の検証実験に基づけば、種々のプロミネンス
の実現形態は、図１２に示すように要約される。すなわ
ち、プロミネンスは次のいずれかの方法により実現され
ている。Based on the above analysis of the prosodic features of prominence and the experiments verifying the validity of the rules, the various forms of realizing prominence can be summarized as shown in FIG. 12. That is, prominence is realized by one of the following methods.

【００５８】（１）基本周波数（アクセント指令の大き
さ）の増大とこれに伴うパワーの増大（２）ポーズの挿入更にプロミネンス対象の直前に
ポーズ挿入プロミネンス対象の直後にポーズ挿入プロミネンス対象の直前および直後にポーズ挿入（３）発話速度の低減（４）上記（１）（２）（３）の組合せところで、図１１により合成されるプロミネンス含有文
音声は、自然音声と同等のプロミネンス表現力を持つ
が、必ずしも最高水準のプロミネンス表現力とは言えな
い。より確実な意図伝達を実現するためには、更にプロ
ミネンス表現力が高いことが望まれる。そこで、聴取評
価実験により、聴覚的に最適なプロミネンス表現力を実
現するための韻律制御パラメータを決定する。すなわ
ち、図１２に示すような種々のプロミネンスの実現形態
のうち、各分類のプロミネンスについて、どの形態がプ
ロミネンス表現力という点で優れているか、規則合成音
声の聴取実験による評価を行う。更に、基本周波数（Ｆ
₀）（＋パワー）増大によるプロミネンスについては、
アクセント指令の大きさとプロミネンス表現力との関係
を明らかにし、自然性とのトレードオフにおいて、どの
ような値に選ぶのが最適であるかについて検討する。
(1) An increase in the fundamental frequency (accent command magnitude) and the accompanying increase in power.
(2) Insertion of a pause: immediately before the prominence target, immediately after it, or both immediately before and after it.
(3) A reduction in speech rate.
(4) A combination of (1), (2), and (3) above.
The prominence-containing sentence speech synthesized according to FIG. 11 has prominence expressiveness equivalent to that of natural speech, but this is not necessarily the highest attainable level. To convey intent more reliably, still higher prominence expressiveness is desirable. Therefore, the prosodic control parameters that realize perceptually optimal prominence expressiveness are determined through listening evaluation experiments. That is, among the various forms of realizing prominence shown in FIG. 12, listening experiments on rule-synthesized speech are used to evaluate which form gives the best prominence expressiveness for each classification of prominence. Furthermore, for prominence produced by an increase in the fundamental frequency (F₀) (+ power), the relationship between the accent command magnitude and prominence expressiveness is clarified, and the value that is optimal in the trade-off with naturalness is examined.

【００５９】評価は、プロミネンス生成規則により合成
した各種プロミネンス含有文音声が、意図したとおりの
プロミネンスが付加されているように聞こえるか否かに
より行う。The evaluation is based on whether the various prominence-containing sentences synthesized by the prominence generation rules sound as if prominence has been added as intended.

【００６０】以下に試験方法について述べる。The test method will be described below.

【００６１】（１）意図的なプロミネンスのうち、各分
類に属するプロミネンスの種々の実現形態を含有する図
１３〜図１５のリストに示す66種類の文を規則により合
成した音声を用いる。なおここで、Ｆ₀（＋パワー）増
大によるプロミネンスに関しては、アクセント指令の大
きさとプロミネンス表現力あるいは自然性との関係を調
べるために、４段階のアクセント指令の増分値による合
成を行っている。表中「タイプ１」とは、プロミネンス
を置かない場合に、プロミネンスの対象部分直前から対
象部分までの間に低アクセントの部分が存在するタイプ
である。すなわち、２韻律語がアクセント変形を起さず
もともとのアクセント型を保っている後部韻律語（例え
ば「兄の雨具は」の「雨具は」）、２韻律語が低アクセ
ントの変形を起こして１韻律語になった後半部分（例え
ば「中川主任」の「主任」）、あるいは１韻律語中の低
アクセント部の１拍（例えば「くわばらさん」の
「ば」）等がプロミネンスの対象になる場合がこれに該
当する。また「タイプ２」とは、逆に、プロミネンスを
置かない場合に、プロミネンスの対象部分直前から対象
部分にかけて高アクセントが持続するタイプである。す
なわち、２韻律語が高アクセントの変形を起こして１韻
律語になった後半部分（例えば「姉の雨具は」の「雨具
は」や「田中主任」の「主任」）、あるいは１韻律語中
の高アクセント部の１拍（例えば「なかださん」の
「だ」）等がプロミネンスの対象になる場合がこれに該
当する。タイプ１とタイプ２では、プロミネンス対象部
とその直前部との相対的なアクセントの高低関係が異な
るため、基本周波数を増大させた場合のプロミネンスの
聴覚的な効果も異なると予想される。(1) Among intentional prominences, speech synthesized by rule from the 66 sentences listed in FIGS. 13 to 15, which contain the various forms of realizing the prominence of each classification, is used. For prominence produced by an increase in F₀ (+ power), synthesis was performed with four levels of accent command increment in order to examine the relationship between the accent command magnitude and prominence expressiveness or naturalness. In the tables, "Type 1" is the type in which, when no prominence is placed, a low-accent portion exists between the point immediately before the prominence target and the target itself. This applies when the prominence target is: the latter of two prosodic words that keep their original accent types without accent modification (for example, 「雨具は」 in 「兄の雨具は」); the latter half of two prosodic words that have merged into one prosodic word through low-accent modification (for example, 「主任」 in 「中川主任」); or one mora in the low-accent part of a single prosodic word (for example, 「ば」 in 「くわばらさん」). Conversely, "Type 2" is the type in which, when no prominence is placed, a high accent continues from immediately before the prominence target into the target. This applies when the prominence target is: the latter half of two prosodic words that have merged into one prosodic word through high-accent modification (for example, 「雨具は」 in 「姉の雨具は」 or 「主任」 in 「田中主任」); or one mora in the high-accent part of a single prosodic word (for example, 「だ」 in 「なかださん」). Because the relative accent height between the prominence target and the portion immediately preceding it differs between Type 1 and Type 2, the perceptual effect of prominence when the fundamental frequency is increased is also expected to differ.

【００６２】（２）上記66種類の文の規則合成音声を乱
順に配列して順次提示し、聴取により、プロミネンスの
有無、プロミネンスの位置等についての解答を解答用紙
に記入させる。但し、各文音声の提示回数は１種類の文
につき２回で、延べ132文を提示する。音声間には、解
答のために、３秒間のポーズを入れる。被験者は東京式
アクセント（第二種）を有する成人男女10名である。解
答は、強調されている部分があると聞こえた場合は、解
答用紙に書かれている文の対応する部分を〇で囲むこと
により行わせる。もしどこも強調されていないと聞こえ
た場合は、「強調なし」と書かれている部分を〇で囲
み、更に不自然に聞こえた場合は、「不自然」と書かれ
ている部分を〇で囲むよう指示した。(2) The rule-synthesized speech for the above 66 sentences is arranged in random order and presented one item at a time, and listeners record on an answer sheet whether prominence is present, where it is located, and so on. Each sentence is presented twice, for a total of 132 presentations. A 3-second pause is inserted between utterances for answering. The subjects were 10 adult men and women with a Tokyo-type accent (type 2). When a subject heard an emphasized part, the subject answered by circling the corresponding part of the sentence printed on the answer sheet. Subjects were instructed to circle "no emphasis" if nothing sounded emphasized, and additionally to circle "unnatural" if the utterance sounded unnatural.
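The presentation schedule described in (2) can be sketched as below, purely as an illustration. The function name, the integer sentence identifiers, and the seed argument are assumptions made for this sketch; the text specifies only 66 sentences, two presentations each, and random ordering.

```python
import random


def build_presentation_order(n_sentences=66, repetitions=2, seed=None):
    """Return sentence IDs (1..n_sentences), each repeated, in random order."""
    order = [
        sentence_id
        for sentence_id in range(1, n_sentences + 1)
        for _ in range(repetitions)
    ]
    # A dedicated Random instance keeps the shuffle reproducible when seeded.
    random.Random(seed).shuffle(order)
    return order
```

With the default arguments the returned list has 132 entries, matching the total number of presentations stated above.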

【００６３】（３）プロミネンス生成規則の制御パラメ
ータの具体的な数値は、特に指定がない場合は次のとお
りである。DAaあるいはAaは図１１の番号順に、1:0.1、
2:-0.2、 3:0.0、 4:0.2、 5:0.6 (但し文末の場合は
0.7)、 6:0.3、 7:0.2、 8:0.4(直前ポーズ挿入時)、
0.7 (ポーズなしのとき)。また、ポーズ直前の音素持続
時間のTWの値は66%、その他のパラメータは、図１１の
値あるいは平均値 (±を省いた値) とした。(3) Unless otherwise specified, the concrete values of the control parameters of the prominence generation rules are as follows. DAa or Aa, in the order of the numbers in FIG. 11, is 1: 0.1, 2: -0.2, 3: 0.0, 4: 0.2, 5: 0.6 (0.7 at the end of a sentence), 6: 0.3, 7: 0.2, 8: 0.4 (when a pause is inserted immediately before) or 0.7 (when no pause is inserted). The value of TW for the phoneme duration immediately before a pause was 66%, and the other parameters were set to the values in FIG. 11 or to their mean values (the values with the ± term omitted).
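The parameter values listed in (3) could be held in a small lookup table such as the following sketch. The dictionary and function names, and the two boolean flags, are hypothetical; the numbers are those given in the text, indexed by the FIG. 11 numbers.

```python
# DAa (or Aa) values for FIG. 11 numbers that have a single value.
DAA_BY_NUMBER = {1: 0.1, 2: -0.2, 3: 0.0, 4: 0.2, 6: 0.3, 7: 0.2}


def accent_increment(number, sentence_final=False, pause_before=False):
    """Return the DAa (or Aa) value used in the listening test."""
    if number == 5:
        # 0.6 normally, 0.7 at the end of a sentence.
        return 0.7 if sentence_final else 0.6
    if number == 8:
        # 0.4 when a pause is inserted immediately before, else 0.7.
        return 0.4 if pause_before else 0.7
    return DAA_BY_NUMBER[number]
```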

【００６４】（４）ポーズ挿入によるプロミネンスにお
いては、通常ポーズ挿入に伴うフレーズの立て直しが起
こる。また場合によっては本来のアクセント型と異なっ
たアクセント型に変形する。本実験では、これらの基本
周波数の変化による音質上の影響を極力少なくするため
に、次のような例外規則を採用した。なおこれらの例外
規則は、プロミネンス含有文の解析結果を参考にすると
ともに、合成音声を聴取することにより作成した。(4) In prominence realized by pause insertion, the pause is normally accompanied by a reset (rebuilding) of the phrase. In some cases the accent type also changes to one different from the original. In this experiment, the following exception rules were adopted to minimize the effect of these fundamental frequency changes on sound quality. These exception rules were created by referring to the analysis results of prominence-containing sentences and by listening to the synthesized speech.

【００６５】ａ．プロミネンスの対象が複合単語の一部
の例「○○主任改め○○課長」において、「主任」の直
後にポーズを挿入する場合は、フレーズの立て直しの影
響により「改め」の基本周波数が増大する。この影響を
弱めるために、「改め」のアクセント指令の大きさを０
値に抑制した。a. In the example 「○○主任改め○○課長」, where the prominence target is part of a compound word, inserting a pause immediately after 「主任」 raises the fundamental frequency of 「改め」 because of the phrase rebuilding. To weaken this effect, the accent command magnitude of 「改め」 was suppressed to 0.

【００６６】ｂ．プロミネンスの対象が話題の一拍（低
アクセント）の例「くわばらさんとも、くわはらさんと
もいいます」において、「ば」あるいは「は」の直前あ
るいは直後にポーズを挿入すると、本来の韻律語固有の
アクセントが崩壊し、通常のアクセント規則が適用出来
ない。この場合次のような例外的な基本周波数パターン
生成規則を適用した。b. In the example 「くわばらさんとも、くわはらさんともいいます」, where the prominence target is one mora (low accent) of the topic, inserting a pause immediately before or after 「ば」 or 「は」 destroys the accent inherent to the prosodic word, so the normal accent rules cannot be applied. In this case, the following exceptional fundamental frequency pattern generation rules were applied.

【００６７】「ば」あるいは「は」の直後にポーズを
挿入する場合、上記ａと同じ理由で、続く「らさんと
も」のアクセント指令の大きさを０値に抑制した。When a pause is inserted immediately after 「ば」 or 「は」, the accent command magnitude of the following 「らさんとも」 was suppressed to 0, for the same reason as in a. above.

【００６８】同様に、「ば」あるいは「は」の直前に
ポーズを挿入する場合、続く「ば（は）らさんとも」の
アクセント指令の大きさを０値に抑制した。Similarly, when a pause is inserted immediately before 「ば」 or 「は」, the accent command magnitude of the following 「ば（は）らさんとも」 was suppressed to 0.

【００６９】「ば」あるいは「は」の直前直後ともに
ポーズを挿入する場合、直前ポーズに対応するフレーズ
の立て直しは行わず、「ば」あるいは「は」のアクセント
レベルは「低」に設定した。When pauses are inserted both immediately before and immediately after 「ば」 or 「は」, the phrase is not rebuilt for the preceding pause, and the accent level of 「ば」 or 「は」 was set to "low".

【００７０】ｃ．プロミネンスの対象が話題の一拍（高
アクセント）の例「なかださんか、なかたさんか」にお
いて、「だ」あるいは「た」の直前あるいは直後にポー
ズを挿入する場合も、本来の韻律語固有のアクセントが
崩壊し、通常のアクセント規則が適用出来ない。この場
合も次のような例外的な基本周波数パターン生成規則を
適用した。c. In the example 「なかださんか、なかたさんか」, where the prominence target is one mora (high accent) of the topic, inserting a pause immediately before or after 「だ」 or 「た」 likewise destroys the accent inherent to the prosodic word, so the normal accent rules cannot be applied. In this case as well, the following exceptional fundamental frequency pattern generation rules were applied.

【００７１】「だ」あるいは「た」の直後にポーズを
挿入する場合、上記ａと同じ理由で、続く「さんか」の
アクセント指令の大きさを０値に抑制した。When a pause is inserted immediately after 「だ」 or 「た」, the accent command magnitude of the following 「さんか」 was suppressed to 0, for the same reason as in a. above.

【００７２】「だ」あるいは「た」の直前にポーズを
挿入する場合、続く「だ（た)さんか」のアクセントは
高アクセント（平板型）とした。When a pause is inserted immediately before 「だ」 or 「た」, the accent of the following 「だ（た）さんか」 was set to a high accent (flat type).

【００７３】「だ」あるいは「た」の直前直後ともに
ポーズを挿入する場合、直前ポーズに対応するフレーズ
の立て直しは行わず、「だ」あるいは「た」のアクセント
レベルは「高」に設定した。When pauses are inserted both immediately before and immediately after 「だ」 or 「た」, the phrase is not rebuilt for the preceding pause, and the accent level of 「だ」 or 「た」 was set to "high".

【００７４】（５）数１４で定義する正答率（プロミネ
ンス正聴率）ｒにより、プロミネンス表現力を評価す
る。(5) The prominence expressing ability is evaluated by the correct answer rate (prominence correct listening rate) r defined by the equation (14).

【００７５】[0075]

【数１４】ｒ= Ｒ／Ｎ …（数１４）ここで、Ｎは当該文章の全被験者に対する延べ提示回
数、ＲはＮ提示音声中のプロミネンス正答数（意図通り
のプロミネンスが付与されていると聞こえた音声提示
数）である。[Equation 14] r = R / N ... (Equation 14) Here, N is the total number of presentations of the sentence over all subjects, and R is the number of correct prominence responses among the N presentations (the number of presentations heard as carrying the intended prominence).

【００７６】また、数１５で定義する不自然率ｕによ
り、プロミネンス付与の表現形式の自然性を評価する。Also, the naturalness of the expression format of prominence is evaluated by the unnatural rate u defined by the equation (15).

【００７７】[0077]

【数１５】ｕ= Ｕ／Ｎ …（数１５）ここで、ＵはＮ提示音声中で表現が不自然に聞こえた音
声の提示数である。[Equation 15] u = U / N ... (Equation 15) Here, U is the number of presentations, among the N presentations, whose expression sounded unnatural.
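The two scores of Equations 14 and 15 can be computed from the collected answer sheets roughly as follows. This is a minimal sketch: the `Response` record and its field names are hypothetical, and rejecting an empty response list with an error is an assumption not discussed in the text.

```python
from dataclasses import dataclass


@dataclass
class Response:
    marked_intended_target: bool  # listener circled the intended prominence
    marked_unnatural: bool        # listener circled "unnatural"


def correct_rate(responses):
    """Equation 14: r = R / N."""
    if not responses:
        raise ValueError("at least one presentation is required")
    correct = sum(1 for resp in responses if resp.marked_intended_target)
    return correct / len(responses)


def unnatural_rate(responses):
    """Equation 15: u = U / N."""
    if not responses:
        raise ValueError("at least one presentation is required")
    unnatural = sum(1 for resp in responses if resp.marked_unnatural)
    return unnatural / len(responses)
```

With 10 subjects and two presentations per sentence, N is 20 for each sentence, so each correct or unnatural judgment changes r or u by 0.05.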

【００７８】以下に試験結果について述べる。The test results will be described below.

【００７９】（１）Ｆ₀増大＋パワー増大の効果評価結果を図１６に示す。図より、いずれの分類に属す
るプロミネンスも、アクセント指令の大きさの増大とと
もに、正答率ｒは増大している。但し、タイプ２（同図
(b)、(d)、(f)）の方が、タイプ１（同図(a)、(c)、
(e)）の場合より低いアクセント指令の大きさ増分ΔDAa
の値でｒの値に急峻な増大傾向が見られる。(1) Effect of F₀ increase + power increase. The evaluation results are shown in FIG. 16. The figure shows that, for prominence of every classification, the correct answer rate r increases as the accent command magnitude increases. However, in Type 2 (FIG. 16(b), (d), (f)), r begins to rise steeply at a lower value of the accent command magnitude increment ΔDAa than in Type 1 (FIG. 16(a), (c), (e)).

【００８０】更に、自然性を大きく損うことなくアクセ
ント指令の大きさを十分増大して正答率ｒを100%に近づ
けられるのは、プロミネンスの対象が「文節全体」（同
図(a)、(b)）の場合である。これに対して他の場合で
は、高い正答率を得るために自然性を代償にする傾向が
大きくなる（同図(c)、(d)、(f))。Furthermore, it is when the prominence target is the "entire phrase" (FIG. 16(a), (b)) that the accent command magnitude can be increased enough to bring the correct answer rate r close to 100% without greatly impairing naturalness. In the other cases, a high correct answer rate tends to come increasingly at the cost of naturalness (FIG. 16(c), (d), (f)).

【００８１】（２）ポーズ挿入の効果評価結果を図１７に示す。但し、図中横軸の「なし」と
は、ポーズを挿入することなく、Ｆ₀増大＋パワー増大
だけでプロミネンスを実現した場合の最大の正答率を与
える場合の結果を参考データとして示したものである。
図より、ポーズ挿入がプロミネンス表現力の向上に最も
有効なのは、プロミネンスの対象が話題の一拍（同図
(e)、(f)）の場合である。ポーズ位置に関しては、ポー
ズをプロミネンスの対象の直前に置くのが良いか、直後
に置くのが良いかは、ケースバイケースである。しか
し、直前直後ともに置く場合（図中の「前後」）は、片
方のみに置く場合に比して、安定して良好な結果が得ら
れている。(2) Effect of pause insertion. The evaluation results are shown in FIG. 17. "None" on the horizontal axis is included as reference data: it is the result that gives the maximum correct answer rate when prominence is realized by F₀ increase + power increase alone, without inserting a pause. The figure shows that pause insertion is most effective at improving prominence expressiveness when the prominence target is one mora of the topic (FIG. 17(e), (f)). As for the pause position, whether it is better to place the pause immediately before or immediately after the prominence target depends on the case. However, placing pauses both immediately before and after ("before and after" in the figure) gives stably good results compared with placing a pause on only one side.

【００８２】（３）Ｆ₀増大＋パワー増大とポーズ挿入
の組合せ効果評価結果を図１８に示す。なお図中横軸の「Ｆ₀」ある
いは「Ｆ₀＋ポーズ」とは、それぞれ「Ｆ₀増大＋パワー
増大」、「Ｆ₀増大＋パワー増大＋直前直後ポーズ挿
入」を略して記したものである。この二つの場合につい
ては、同一条件でポーズ挿入の影響を見るために、アク
セント指令の大きさ増分ΔDAaを同じ文では同じ値とし
た。更に図中横軸に「ポーズ」と記してあるのは、Ｆ₀
およびパワーの増大を伴わず、プロミネンス対象部の直
前および直後にポーズを挿入することのみでプロミネン
スを実現しようとした場合である。図の例では、Ｆ₀増
大＋パワー増大によるプロミネンスとしては、強調の度
合いが比較的弱い場合を参照している。そのため、いず
れの分類に属するプロミネンスにおいても、ポーズ挿入
による方がプロミネンス表現力が大きい。そしてポーズ
挿入効果は、プロミネンス対象の文単位の大きさが、文
節全体→複合単語の一部→話題の一拍、と小さくなるに
従い、大きくなる。更に、「Ｆ₀増大＋パワー増大」と
「直前直後ポーズ挿入」を組み合わせるとプロミネンス
表現力は一層大きくなる。(3) Combined effect of F₀ increase + power increase and pause insertion. The evaluation results are shown in FIG. 18. On the horizontal axis, "F₀" and "F₀ + pause" abbreviate "F₀ increase + power increase" and "F₀ increase + power increase + pause insertion immediately before and after", respectively. For these two conditions, the accent command magnitude increment ΔDAa was set to the same value within the same sentence so that the effect of pause insertion could be observed under identical conditions. "Pause" on the horizontal axis denotes the case in which prominence is realized solely by inserting pauses immediately before and after the prominence target, with no increase in F₀ or power. In the examples in the figure, the prominence by F₀ increase + power increase refers to cases where the degree of emphasis is relatively weak. Consequently, for prominence of every classification, pause insertion gives greater prominence expressiveness. The effect of pause insertion grows as the sentence unit targeted for prominence becomes smaller, from the entire phrase to part of a compound word to one mora of the topic. Furthermore, combining "F₀ increase + power increase" with "pause insertion immediately before and after" increases prominence expressiveness still further.

【００８３】以上の結果は、Ｆ₀の増大＋パワーの増大
のみでは効果が少ない場合でも、ポーズ挿入と組合せる
ことにより、より効果的なプロミネンスが実現できるこ
とを示していると言える。また、この組合せ効果により
プロミネンス表現力が改善された代償として自然性が大
幅に低下する例は、プロミネンスの対象が複合単語の一
部の場合の一例に見られるだけで（図１８(d))、他には
見られない（同図(d)以外)。These results show that even when an increase in F₀ plus an increase in power alone has little effect, more effective prominence can be realized by combining it with pause insertion. Moreover, a large loss of naturalness in exchange for the improved prominence expressiveness of this combination is seen in only one example, where the prominence target is part of a compound word (FIG. 18(d)), and not elsewhere (the panels other than (d)).

【００８４】次に、本発明による音声規則合成装置及び
方法の実施例を図１および図１９〜図２８により説明す
る。Next, an embodiment of the speech rule synthesizing apparatus and method according to the present invention will be described with reference to FIGS. 1 and 19 to 28.

【００８５】図１９は任意文章の音声合成に適用できる
音声規則合成装置の一実施例の全体構成を示す。本実施
例では、漢字仮名混じり文のテキストを入力データとし
て与えれば、それに対応する合成音声を出力として得る
ことができる。処理手順は以下の通りである。FIG. 19 shows the overall construction of an embodiment of a speech rule synthesizing apparatus applicable to speech synthesis of an arbitrary sentence. In this embodiment, if the text of a kanji / kana mixed sentence is given as input data, the corresponding synthetic speech can be obtained as an output. The processing procedure is as follows.

【００８６】まず入力テキストは、日本語解析部１（特
開昭59-98236号公報参照）の形態素解析手段により、各
単語に分解され、品詞が決定され、さらに読みが決定さ
れる。次にこの結果に基づき、音声言語処理部２（特公
昭59-13040号公報、特開昭59-081697号公報、特開昭61-
6693号公報参照）において、各単語あるいは文節のアク
セント型が決定される。First, the input text is decomposed into words by the morphological analysis means of the Japanese analysis section 1 (see Japanese Patent Laid-Open No. 59-98236), and the part of speech and then the reading of each word are determined. Next, based on this result, the accent type of each word or phrase is determined in the speech language processing unit 2 (see Japanese Patent Publication No. 59-13040 and Japanese Patent Laid-Open Nos. 59-081697 and 61-6693).

【００８７】以上のような構文レベルの処理結果とし
て、音節情報、アクセント情報、プロミネンス情報など
が得られる。なお句や文章の区切りは、入力テキスト中
の句読点等区切り記号に基づいて決定される。文章中や
文章間のポーズ長は、読点や句点の後のスペースの数で
指定できる。また疑問文、命令文、願望文等文のタイプ
は、語尾の活用によって判定することができる場合もあ
るし、あるいは文章の終止に句点の代わりにそれぞれ
「？」、「！！」および「！」などの終止記号を使うこ
とにより指定することもできる。例えば同じ音韻列「川
を渡る」であっても「川を渡る。」は平叙文であり、
「川を渡る？」は疑問文である。The syntax-level processing described above yields syllable information, accent information, prominence information, and so on. Phrase and sentence boundaries are determined from delimiters in the input text such as punctuation marks. The pause length within or between sentences can be specified by the number of spaces following a comma or period. The sentence type, such as interrogative, imperative, or wish sentence, can in some cases be determined from the inflection of the sentence ending, or it can be specified by ending the sentence with a terminal symbol such as "?", "!!", or "!", respectively, in place of the period. For example, even for the same phoneme sequence 「川を渡る」 ("cross the river"), 「川を渡る。」 is a declarative sentence and 「川を渡る？」 is an interrogative sentence.

【００８８】以上の音節情報、アクセント情報、
ポーズ情報、句・文章区切り情報、（必要ならば例え
ば品詞名等の）文法情報、およびプロミネンス情報
は、「音節コード」と呼ばれる一連の数字によって表現
される。音節コードは制御パラメータ生成部３の入力情
報である。The above syllable information, accent information, pause information, phrase/sentence boundary information, grammatical information (for example, part-of-speech names, if necessary), and prominence information are expressed by a series of numbers called "syllable codes". The syllable codes are the input information to the control parameter generation unit 3.

【００８９】制御パラメータ生成部３では、アクセン
ト、イントネーション、音韻持続時間、および音源パワ
ー（振幅）修正値が規則により決定され、それに従って
ピッチパターンと音韻パラメータ時系列が生成される。
ここで、音源パワー修正値とは、強めの有無により、標
準的な音源パワーの値を増減するための係数である。こ
の音源パワー修正値は、強めの無い場合に対する倍率で
与えても良いし、絶対数値で与えても良い。また、アク
セント型は、アクセント情報により知ることができる。
アクセント情報は、具体的にはアクセント核のある音韻
（アクセントが下降する直前の音韻）の直後にアクセン
トを示す音節コード番号を挿入することによって与えて
いる。ただし、この音節コードがない場合は、平板型ア
クセントであることを示している。またイントネーショ
ンは、基本的には文章タイプ情報およびプロミネンス情
報より定められる。ただし、語尾の音韻の並びの違いに
よる変形も加えられる。例えば、願望文「川を渡りたい
！」と「川を渡りたいなあ！」とではイントネーション
・パターンが異なる。最終的なピッチパターンは、アク
セント型とイントネーションの両者に基づいて生成され
る。ただし、後に述べるプロミネンスを含有する文章に
ついては、アクセント変形を伴うこともある。音韻持続
時間は、子音の場合は周囲条件の影響が少ないので、子
音の種類ごとに固有長として決定される。それに対し
て、母音の場合は周囲条件によって様々な変形を受け
る。そのため、アクセント型、音節数、単語内の位置、
直前の子音の種類、その母音の種類などから持続時間を
決定している（特開昭59-081697号公報参照）。このよ
うにして音韻持続時間が決定されたら、ＣＶ（子音−母
音連鎖）単位でファイルに登録されている音韻パラメー
タ（生成源方式の場合はスペクトル包絡パラメータと音
源パラメータ、波形合成方式の場合は音声素片）を音節
コードに対応させて抽出し、配列する。この際、長すぎ
れば持続時間内に収まるように切断する。しかる後に、
切断部あるいは隙間部を埋めるようにＣＶ単位間を補間
（生成源方式：スペクトル包絡パラメータは直線補間、
音源パラメータは同一値の繰り返し、波形合成方式：素
片切り出し窓の最大値の補間）により接続する（詳細は
図２６参照）。最後に、以上の処理によって生成された
基本周波数と音韻パラメータは、順次音声合成部４に送
られ、音声波形が出力される。ここで、音声合成方式と
しては、例えば残差圧縮法（公開昭60-150100号公報、
特開昭61-296390号公報参照）を用いればよい。この場
合、音源パルスは基本的には、フレームごとに１ピッチ
分の残差パルス（代表残差）を抽出し、その代表残差を
外から与えるピッチ周期の間隔で並べることによって生
成している。このとき外から与えるピッチ周期が代表残
差の長さより短ければ、その長さの差だけ代表残差の末
尾を切り捨て、逆に長ければ、代表残差の不足している
区間だけ０を埋めている。図１９には音声合成部４に残
差圧縮法を用いた例を示しているが、勿論、音声合成方
式は残差圧縮法に限定されない。例えば、波形合成方
式、特に素片編集方式を用いても良い。In the control parameter generation unit 3, the accent, intonation, phoneme duration, and sound source power (amplitude) correction value are determined by rule, and the pitch pattern and the phoneme parameter time series are generated accordingly. Here, the sound source power correction value is a coefficient for increasing or decreasing the standard sound source power according to whether strengthening is present. It may be given as a ratio relative to the unstrengthened case or as an absolute value. The accent type is obtained from the accent information. Specifically, the accent information is given by inserting a syllable code number indicating the accent immediately after the phoneme carrying the accent nucleus (the phoneme immediately before the accent falls). The absence of this syllable code indicates a flat-type accent. The intonation is basically determined from the sentence type information and the prominence information, but modifications due to differences in the phoneme sequence of the sentence ending are also applied. For example, the wish sentences 「川を渡りたい！」 and 「川を渡りたいなあ！」 have different intonation patterns. The final pitch pattern is generated from both the accent type and the intonation, although sentences containing prominence, described later, may also involve accent modification. For consonants, the phoneme duration is determined as an inherent length for each type of consonant, because consonants are little affected by surrounding conditions. Vowels, by contrast, undergo various modifications depending on surrounding conditions, so their durations are determined from the accent type, the number of syllables, the position within the word, the type of the preceding consonant, the type of the vowel, and so on (see Japanese Patent Laid-Open No. 59-081697). Once the phoneme durations have been determined, the phoneme parameters registered in a file in CV (consonant-vowel sequence) units (spectral envelope parameters and sound source parameters for the generation source method, speech segments for the waveform synthesis method) are extracted according to the syllable codes and arranged. A unit that is too long is cut to fit within its duration. The CV units are then connected by interpolation so as to fill the cuts or gaps (generation source method: linear interpolation for the spectral envelope parameters and repetition of the same value for the sound source parameters; waveform synthesis method: interpolation of the maximum value of the segment extraction window) (see FIG. 26 for details). Finally, the fundamental frequency and phoneme parameters generated by the above processing are sent in sequence to the speech synthesis unit 4, which outputs the speech waveform. As the speech synthesis method, the residual compression method (see Japanese Patent Laid-Open Nos. 60-150100 and 61-296390), for example, may be used. In this case, the sound source pulses are basically generated by extracting one pitch period of residual pulse (the representative residual) for each frame and arranging the representative residuals at intervals of the externally supplied pitch period. If the externally supplied pitch period is shorter than the representative residual, the tail of the representative residual is truncated by the difference in length; if it is longer, the interval not covered by the representative residual is filled with zeros. FIG. 19 shows an example in which the residual compression method is used in the speech synthesis unit 4, but the speech synthesis method is of course not limited to it. For example, a waveform synthesis method, in particular a segment editing method, may be used.
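The truncation and zero-filling of the representative residual described above can be illustrated with the following sketch. It assumes that the residual is a plain list of samples and that the pitch period is given in samples; the function names are hypothetical, and this is not the implementation of the cited residual compression method.

```python
def fit_residual_to_pitch(representative_residual, pitch_period):
    """Truncate or zero-pad one representative residual to one pitch period."""
    if pitch_period <= 0:
        raise ValueError("pitch_period must be a positive number of samples")
    n = len(representative_residual)
    if pitch_period <= n:
        # Pitch period shorter than the residual: drop the tail.
        return list(representative_residual[:pitch_period])
    # Pitch period longer than the residual: fill the remainder with zeros.
    return list(representative_residual) + [0.0] * (pitch_period - n)


def build_excitation(representative_residual, pitch_period, n_pulses):
    """Arrange the fitted residual at intervals of the given pitch period."""
    pulse = fit_residual_to_pitch(representative_residual, pitch_period)
    excitation = []
    for _ in range(n_pulses):
        excitation.extend(pulse)
    return excitation
```

Because the residual amplitude is left untouched here, shortening the pitch period packs more pulses into the same time span, which is consistent with the earlier remark that power rises naturally with the fundamental frequency when the source amplitude is held constant.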

【００９０】以上の処理は、以下に述べるプロミネンス
生成規則を除いて、すべて公知の手段により構成するこ
とができる。The above-mentioned processing can be configured by any known means except for the prominence generation rule described below.

【００９１】以下では、本発明の最も重要な部分であ
る、制御パラメータ生成部３におけるプロミネンス生成
規則の説明を中心に図１および図２０〜図２８を引用し
て示す。Below, the most important part of the present invention, that is, the description of the prominence generation rule in the control parameter generation unit 3, will be mainly described with reference to FIGS. 1 and 20 to 28.

【００９２】プロミネンス情報は以下の（１）〜（５）
の情報から抽出可能である。The prominence information can be extracted from the following information (1) to (5).

【００９３】（１）平叙文／疑問文等の文のタイプより
（文形固有の卓立）（２）構文情報（文献１０参照）。(1) The sentence type, such as declarative or interrogative (prominence inherent to the sentence form). (2) Syntactic information (see Reference 10).

【００９４】（３）旧情報／新情報（文献１１参照）、
慣用的な口調。(3) Old information / new information (see Reference 11), and idiomatic tone.

【００９５】（４）テキスト情報より（カギ括弧、太
字、アンダーライン等）。(4) From text information (square brackets, bold letters, underlines, etc.)

【００９６】（５）意味情報（例：先行疑問文に対する
答えの部分を強め）。(5) Semantic information (eg, strengthening the answer part to the preceding question sentence).

【００９７】上記（１）では、文章タイプ情報よりプロ
ミネンスを実現するパラメータを生成することができる
のに対し、（２）〜（５）では、音声言語処理部２等
で、プロミネンス情報（音節コード表現）を生成しなけ
ればならない。例えば上記（４）におけるカギ括弧から
プロミネンス情報を取得する場合、カギ括弧開きが検出
されたら、アクセント指令の開始時点と大きさ情報（あ
るいはプロミネンスの分類情報(例えば図４のような情
報)）を含有する音節コードを発行し、カギ括弧閉じが
検出されたら、アクセント指令の終了時点の情報を含有
する音節コードを発行すれば良い。また、（５）の場合
は、意味解析手段が必要となる。もし意味解析手段を用
いないならば、（４）で代用することになる。すなわ
ち、人間が強めたいところを上記のカギ括弧等によりテ
キスト内で指定すれば良い。In (1) above, the parameters that realize prominence can be generated from the sentence type information, whereas in (2) to (5) the prominence information (in syllable code form) must be generated by the speech language processing unit 2 or the like. For example, when prominence information is obtained from brackets as in (4) above, a syllable code containing the start time and magnitude information of the accent command (or prominence classification information, for example information such as that in FIG. 4) is issued when an opening bracket is detected, and a syllable code containing the end time of the accent command is issued when the closing bracket is detected. Case (5) requires a semantic analysis means. If no semantic analysis means is used, (4) is used instead; that is, a person marks the part to be emphasized in the text with the above brackets or the like.
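The bracket-based marking in (4) might be prototyped as in the sketch below. It is only an illustration: instead of issuing syllable codes, it returns the bracket-free text together with character spans of the marked parts, and the function name, the default bracket characters, and the rejection of nested brackets are all assumptions.

```python
def mark_prominence(text, open_bracket="「", close_bracket="」"):
    """Strip prominence brackets and return (plain_text, [(start, end), ...])."""
    plain = []
    spans = []
    start = None
    for ch in text:
        if ch == open_bracket:
            if start is not None:
                raise ValueError("nested opening bracket")
            start = len(plain)  # the prominence target starts here
        elif ch == close_bracket:
            if start is None:
                raise ValueError("closing bracket without opening bracket")
            spans.append((start, len(plain)))  # the target ends here
            start = None
        else:
            plain.append(ch)
    if start is not None:
        raise ValueError("unclosed opening bracket")
    return "".join(plain), spans
```

In a fuller system, each span start would correspond to the point where the code carrying the accent command start time and magnitude is issued, and each span end to the point where the code carrying the end time is issued.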

【００９８】はじめに、文形固有の卓立を実現する規則
の実施例を示す。まず、図２０において、音声言語処理
部２から得られた音節コード列は、文章タイプ決定手段
５に入力される。ここでは第一段階として、文章タイプ
情報辞書６中の語尾辞書に登録されている語尾形と音節
コード列の文末の形とを照合することにより、該当する
文章タイプを決定する。なお図２０における終止形は、
現代文の場合は動詞なら「ウ」行で終わる語尾、形容詞
なら「イ」でおわる語尾等、国文法の規則に基いて定め
られる。命令形の場合も同様に、現代文なら活用語尾が
「エ」行であることから定められる。以上の文章タイプ
の判定は、品詞情報などの文法情報があれば、さらに確
実となる。ここでもし語尾の活用が終止形と判定された
場合は、この文章は必ずしも平叙文とは限らない。そこ
で第二段階として、この場合は文章の終始記号（文末記
号）を見に行き、この記号の種類によって文章タイプを
決定する（例えば、「。」あるいは「.」なら平叙文、「?」
なら疑問文、「!!」なら命令文、「!」なら願望文、等）。
以上の文章タイプ決定手段５の処理の一例を図２１に示
す。First, an embodiment of the rule that realizes sentence-pattern-specific prominence is described. In FIG. 20, the syllable code string obtained from the speech language processing unit 2 is input to the sentence type determining means 5. As a first step, the sentence type is determined by matching the ending forms registered in the ending dictionary of the sentence type information dictionary 6 against the end of the syllable code string. The terminal form in FIG. 20 is defined by the rules of Japanese grammar: in modern Japanese, a verb ends in the "u" row and an adjective ends in "i". The imperative form is identified likewise, in modern Japanese, by an inflectional ending in the "e" row. This determination becomes more reliable if grammatical information such as part of speech is available. If the ending is judged to be the terminal form, the sentence is not necessarily declarative. Therefore, as a second step, the sentence-final symbol is examined and the sentence type is determined from it (for example, "。" or "." indicates a declarative sentence, "?" an interrogative sentence, "!!" an imperative sentence, and "!" a desiderative sentence).
FIG. 21 shows an example of the processing of the sentence type determining means 5.
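The two-step decision can be sketched in Python as follows. This is only an illustration under stated assumptions: the first step (matching against the ending dictionary) is assumed to have already produced a label such as "terminal" or "imperative", and the function name, the label strings, and the fallback to "declarative" for an unlisted symbol are hypothetical.

```python
# Sentence-final symbols from the text, with the longer "!!" checked before "!".
TERMINAL_SYMBOL_TYPES = [
    ("!!", "imperative"),
    ("!", "desiderative"),
    ("?", "interrogative"),
    ("。", "declarative"),
    (".", "declarative"),
]

def sentence_type(ending_form, text):
    """Second step: refine a terminal-form ending using the final symbol.

    Assumption: `ending_form` is the result of the first step; any value
    other than "terminal" is taken as the final sentence type.
    """
    if ending_form != "terminal":
        return ending_form
    stripped = text.rstrip()
    for symbol, kind in TERMINAL_SYMBOL_TYPES:
        if stripped.endswith(symbol):
            return kind
    return "declarative"  # assumed default when no listed symbol is present
```

Checking "!!" before "!" matters because a plain suffix test for "!" would also match a sentence ending in "!!".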

【００９９】図２０に戻り、文章タイプ決定手段５で
は、上で述べた文章タイプ情報のみが選択的に出力され
る。Returning to FIG. 20, the sentence type determining means 5 selectively outputs only the sentence type information described above.

【０１００】音節コード列から音節情報抽出手段１６に
より抽出された音節情報（例えば、「あ」、「い」、「う」等
の音節の種類を数字で表したもの）は、音韻境界を決
定するため、およびピッチパターンにおける音素成分
生成のために用いられる。すなわち、については、音
節情報をもとに、音韻持続時間規則部９において各音節
の音韻持続時間が決定され、これらを配列した形で音韻
境界時刻が音韻境界決定手段７により決定される。音韻
境界時刻は、一方ではＬＳＰパラメータ等の音韻パラメ
ータを生成するために用いられる。またについては、
文章ピッチ制御パラメータ生成部１１において、音素制
御機構パラメータ値を決定するために用いられる。The syllable information extracted from the syllable code string by the syllable information extracting means 16 (for example, numbers representing syllable types such as "a", "i", and "u") is used for two purposes: determining phoneme boundaries, and generating the phoneme components of the pitch pattern. For the first purpose, the phoneme duration rule unit 9 determines the duration of each syllable from the syllable information, and the phoneme boundary determining means 7 arranges these durations to determine the phoneme boundary times. The phoneme boundary times are used, on the one hand, to generate phoneme parameters such as LSP parameters. For the second purpose, the syllable information is used in the sentence pitch control parameter generation unit 11 to determine the phoneme control mechanism parameter values.

【０１０１】先の文章タイプ情報は、イントネーション
規則部８および音源パワー（振幅）修正値計算手段１５
に入力され、文章のタイプに従い、標準イントネーショ
ン（例えば平叙文）からの変形が加えられる。変形には
時間の変形と、ピッチ振幅（指令の大きさ）の変形、お
よび音源パワーあるいは振幅の変形の３種類がある。時
間の変形は、音韻境界決定手段７に作用し、音韻境界時
刻に変更が加えられる。他方指令の大きさの変形は、文
章ピッチ制御パラメータ生成部１１に作用し、指令の大
きさが変更されるか、あるいは新たな文形指定指令や強
調指令が追加される。この際標準イントネーションの制
御パラメータはアクセント規則部１０より供給される。
なお文章ピッチ制御パラメータ生成部１１では音韻情報
との時間的整合をとるため、基準となる音韻境界時刻
（タイミング基準情報）を音韻境界決定手段７より得
る。また音源パワーの変形は、音源パワー（振幅）修正
値計算手段１５に作用し、音源パワー値の修正値が計算
され、音源生成部に送られる。なお音源パワー値の修正
値は、数１２、数１３を用いて計算することができる
が、基本周波数増大によるパワーの自然増を利用するの
であるならば、修正処理を省略してもよい。The sentence type information is input to the intonation rule unit 8 and the sound source power (amplitude) correction value calculation means 15, and the standard intonation (for example, that of a declarative sentence) is modified according to the sentence type. There are three kinds of modification: of timing, of pitch amplitude (command magnitude), and of sound source power or amplitude. The timing modification acts on the phoneme boundary determining means 7 and changes the phoneme boundary times. The command magnitude modification acts on the sentence pitch control parameter generation unit 11, where the command magnitude is changed or a new sentence-pattern designation command or emphasis command is added. The control parameters of the standard intonation are supplied from the accent rule unit 10. The sentence pitch control parameter generation unit 11 obtains reference phoneme boundary times (timing reference information) from the phoneme boundary determining means 7 in order to stay time-aligned with the phoneme information. The sound source power modification acts on the sound source power (amplitude) correction value calculation means 15, which calculates a correction value for the sound source power and sends it to the sound source generation unit. The correction value can be calculated with Equations 12 and 13, but the correction may be omitted if the natural increase in power that accompanies a rise in fundamental frequency is relied on instead.

【０１０２】以上のイントネーションの規則は、規則テ
ーブル（文献５参照）をイントネーション規則部８に設
けておき参照することにより達成できる。かくして、プ
ロミネンスのうち、文形固有の卓立は、上記手段により
実現される。The above intonation rule can be achieved by providing a rule table (see Document 5) in the intonation rule section 8 and referring to it. Thus, among the prominences, the prominence peculiar to the sentence pattern is realized by the above means.

【０１０３】他方、意図的な卓立（上記（４)、（５)）
やその他のデフォルトの卓立（上記（２)、（３)等）に
対するプロミネンス情報は、音節コード列中からプロミ
ネンス情報抽出手段１４により、プロミネンス情報のコ
ードを抽出し、このコードから得られる。プロミネンス
情報は、イントネーション規則部８と音源パワー（振
幅）修正値計算手段１５に作用する。On the other hand, prominence information for intentional prominence ((4) and (5) above) and for other default prominence ((2), (3), etc. above) is obtained by having the prominence information extracting means 14 extract the prominence information codes from the syllable code string. The prominence information acts on the intonation rule unit 8 and the sound source power (amplitude) correction value calculation means 15.

【０１０４】ここで、音節コード列より、文章タイプ
情報、音節情報、プロミネンス情報をそれぞれ抽出
する方法の一具体例を示す。例えば、音節コードの番号
に応じ、図２２、図２３に示すように情報内容を定義し
ておけば、文章タイプ決定手段５、音節情報抽出手段１
６、プロミネンス情報抽出手段１４のそれぞれに数値大
小判定機能を持たせることにより、該当情報か否か判定
できる。すなわち音節コードが１〜４００であるならば
音節情報と判定、９００４〜９０２０であるならば文章
タイプを与える情報であるので、前述の方法により文章
タイプ情報を決定することが出来る。また、音節コード
が９０３０〜９０３９であるならばプロミネンス情報と
判定、例えば下１桁の数字にアクセント指令値情報を割
り当てれば良い。一例を挙げれば、音節コード下１桁の
数字をＩで表したとき、プロミネンスの付加されてない
場合のアクセント指令の大きさに対する、プロミネンス
によるアクセント指令増分値DAaは次式により与えるこ
とができる。A specific example of a method for extracting the sentence type information, the syllable information, and the prominence information from the syllable code string is now given. If the information content is defined by syllable code number as shown in FIGS. 22 and 23, the sentence type determining means 5, the syllable information extracting means 16, and the prominence information extracting means 14 can each decide whether a code is relevant by being given a numerical range test. That is, a syllable code of 1 to 400 is judged to be syllable information; a code of 9004 to 9020 is information that gives the sentence type, so the sentence type information can be determined by the method described above; and a code of 9030 to 9039 is judged to be prominence information, in which case, for example, accent command value information can be assigned to the last digit. As an example, letting I denote the last digit of the syllable code, the accent command increment DAa due to prominence, relative to the accent command magnitude without prominence, can be given by the following equation.

【０１０５】[0105]

【数１６】 DAa=0.1I …（数１６）
数１６を用いれば、音節コードにより、アクセント指令
の大きさを0.0から0.9の範囲内で0.1ステップで増大さ
せることができる。もちろんより小きざみなステップで
アクセント指令の大きさを変化させたい場合には、音節
コードを他の値の範囲に割当て（例えば９１００〜９１
９９）、下２桁にアクセント指令値情報を割り当てれば
良い。また、プロミネンスによるアクセント指令の増大
・減少をさせるタイミングは、例えば次のようにして決
定することができる。まず、アクセント指令開始時点を
決定する音節境界の指定は、上記プロミネンス情報をも
つ音節コード（例えば９０３０〜９０３９）を境界直前
の音節に対応する音節コードと境界直後の音節に対応す
る音節コードの間に挿入することにより達成できる。次
に、アクセント指令終了時点を決定する音節境界の指定
は、プロミネンス終了を意味するコードとして例えば９
０３０を同様に境界直前の音節に対応する音節コードと
境界直後の音節に対応する音節コードの間に挿入するこ
とにより達成できる。また、プロミネンスの開始あるい
は終了が高アクセントの領域で起きる場合、すなわちア
クセント変形型の場合は、アクセント変形を起こす音節
境界の指定は、同様に境界直前直後の音節に対応する音
節コードの間にプロミネンスの開始あるいは終了のコー
ドを挿入することにより達成できる。かくしてプロミネ
ンスによるアクセント指令開始・終了時点設定のタイミ
ング基準時刻が定まれば、実際の開始・終了時点はこの
基準時刻からのずれ量としてタイミングテーブルから検
索することにより求めることができる。図２４に実例を
示す。[Equation 16]
$$DA_a = 0.1\,I \quad \text{(Equation 16)}$$
With Equation 16, the syllable code can raise the accent command magnitude by 0.0 to 0.9 in steps of 0.1. To vary the accent command magnitude in finer steps, the syllable codes can be assigned to another range of values (for example, 9100 to 9199), with the accent command value information assigned to the last two digits. The timing at which prominence raises or lowers the accent command can be determined, for example, as follows. First, the syllable boundary that determines the accent command start time is designated by inserting a syllable code carrying the prominence information (for example, 9030 to 9039) between the syllable code of the syllable immediately before the boundary and that of the syllable immediately after it. Next, the syllable boundary that determines the accent command end time is designated by inserting a code that signifies the end of prominence, for example 9030, in the same way between the syllable codes immediately before and after the boundary. When prominence starts or ends within a high-accent region, that is, in the accent-modified type, the syllable boundary at which the accent is modified is likewise designated by inserting the prominence start or end code between the syllable codes immediately before and after the boundary. Once the timing reference times for the start and end of the prominence accent command are fixed in this way, the actual start and end times are obtained by looking up their offsets from the reference times in a timing table. FIG. 24 shows an actual example.
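The code-range test, Equation 16, and the start/end markers can be put together in a short Python sketch. The code ranges and the use of 9030 as the end marker come from the text; the function names, the "other" label, and the representation of a boundary as a count of preceding syllables are assumptions made for illustration.

```python
def classify_code(code):
    """Numerical range test on a syllable code (ranges as given in the text)."""
    if 1 <= code <= 400:
        return "syllable"
    if 9004 <= code <= 9020:
        return "sentence_type"
    if 9030 <= code <= 9039:
        return "prominence"
    return "other"  # assumed label for codes the text does not describe

def accent_increment(code):
    """Equation 16: DAa = 0.1 * I, where I is the last digit of the code."""
    if classify_code(code) != "prominence":
        raise ValueError("not a prominence code")
    return 0.1 * (code % 10)

def prominence_spans(codes):
    """Return (start_boundary, end_boundary, DAa) for each marked span.

    A boundary is expressed as the number of syllables that precede it.
    Code 9030 is treated as the end marker, as in the text's example.
    """
    spans = []
    syllables_seen = 0
    open_span = None
    for code in codes:
        kind = classify_code(code)
        if kind == "syllable":
            syllables_seen += 1
        elif kind == "prominence":
            if code == 9030:
                if open_span is not None:
                    start, daa = open_span
                    spans.append((start, syllables_seen, daa))
                    open_span = None
            else:
                open_span = (syllables_seen, accent_increment(code))
    return spans
```

In this reading, a start code such as 9035 placed after the second syllable and an end code 9030 placed after the fourth marks syllables three and four with an increment of 0.5. The actual command times would still be offset from these boundaries by the timing table.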

【０１０６】次に、パワーを制御しポーズを生成する方
法の具体例を示す。図２０では、音声合成部に生成源方
式（例えば残差圧縮法＋LSP合成器）を用いた例を示し
ているが、これから示す具体的な生成源方式に限定され
ない。勿論波形合成方式でもまったく同じ考え方で波形
振幅のパワーを制御することが出来る。Next, a specific example of a method for controlling power and generating pauses is described. FIG. 20 shows an example in which the speech synthesis unit uses a source-generation method (for example, residual compression with an LSP synthesizer), but the invention is not limited to the specific source-generation method described below. A waveform synthesis method can of course control the power of the waveform amplitude in exactly the same way.

【０１０７】図２５は、音声合成部４に残差圧縮法を用
いた場合の例を示している。スペクトル包絡パラメータ
は、LSPパラメータ、PARCOR係数等、任意のパラメータ
を利用出来る。ちなみに、図中の接続補間処理は、例え
ば図２６のような処理により実現できる。音源パワー
（振幅）修正値計算手段１５（図２０）で得られたパワ
ー値の平方根（振幅値で与えられるならばそのままの
値）が有声音源生成部あるいは無声音源生成部に与えら
え、残差（音源）振幅が修正される。修正値は、実際の
値で与える場合は、例えば時間不連続を防ぐために、フ
レームごとに、パワー実測値（例えば特願平1-214799
号、特願平2-183947号、特願平2-250172号参照）の平方
根に近似した振幅包絡曲線（例えば、図２７）の値とし
て与えれば良い。もし修正値を倍率で与える場合は、合
成単位が本来持っている自然音声の振幅包絡形を活用出
来るので、強調部に対応するフレーム間のみで、合成単
位の音源振幅値に指定した倍率を乗ずれば良い。また所
定持続時間のポーズを生成する場合は、その時間の間だ
け無音生成指令を発行して、無音（０値）を出力すれば
良い。FIG. 25 shows an example in which the speech synthesis unit 4 uses the residual compression method. Any parameters, such as LSP parameters or PARCOR coefficients, can be used as the spectral envelope parameters. The connection interpolation processing in the figure can be realized, for example, by processing such as that of FIG. 26. The square root of the power value obtained by the sound source power (amplitude) correction value calculation means 15 (FIG. 20), or the value itself if it is already given as an amplitude, is supplied to the voiced sound source generation unit or the unvoiced sound source generation unit, and the residual (sound source) amplitude is corrected. When the correction is given as an actual value, it may be given frame by frame, for example to prevent temporal discontinuity, as the value of an amplitude envelope curve (for example, FIG. 27) approximating the square root of measured power values (see, for example, Japanese Patent Application Nos. 1-214799, 2-183947, and 2-250172). When the correction is given as a scale factor, the amplitude envelope of natural speech that the synthesis unit already carries can be exploited, so the sound source amplitude of the synthesis unit is simply multiplied by the specified factor over only the frames corresponding to the emphasized portion. To generate a pause of a given duration, a silence generation command is issued for that duration and silence (zero values) is output.
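The three operations in this paragraph (power to amplitude, scaling the emphasized frames, and pause generation) can be sketched in Python. The function names, the list-of-floats frame representation, and the sample-based pause are assumptions for illustration, not the structure of FIG. 25.

```python
import math

def power_to_amplitude(power):
    """Amplitude correction is the square root of a linear power value."""
    return math.sqrt(power)

def emphasize(frame_amplitudes, start, end, factor):
    """Multiply the source amplitude by `factor` in frames start..end-1 only.

    Frames outside the emphasized portion keep the amplitude envelope
    that the synthesis unit already carries.
    """
    return [
        amp * factor if start <= i < end else amp
        for i, amp in enumerate(frame_amplitudes)
    ]

def pause(duration_s, sample_rate_hz):
    """Silence of the given duration, as zero-valued samples."""
    return [0.0] * round(duration_s * sample_rate_hz)
```

For example, `emphasize([1.0, 1.0, 1.0, 1.0], 1, 3, 1.5)` scales only the two middle frames, which corresponds to the scale-factor form of the correction described above.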

【０１０８】図２８は、音声合成部４に波形合成方式を
用いた場合の例を示している。この場合は、図２０の音
源パワー（振幅）修正値計算手段１５は、波形パワー
（振幅）修正値計算手段と置き換えられるが、処理内容
は、音源の場合と全く同様である。違いは、単に実現値
が異なるだけである。波形パワー（振幅）修正値計算手
段で得られたパワー値の平方根（振幅値で与えられるな
らばそのままの値）が素片窓生成部に与えられ、素片編
集時に素片振幅が修正される。修正値の時間変化パター
ンは、上記残差圧縮法の場合と全く同様の考え方で与え
られる。また、ポーズの生成方法も残差圧縮法の場合と
同様、所定時間長の０振幅波形を出力すれば実現出来
る。FIG. 28 shows an example in which the speech synthesis unit 4 uses a waveform synthesis method. In this case, the sound source power (amplitude) correction value calculation means 15 of FIG. 20 is replaced by a waveform power (amplitude) correction value calculation means, but the processing is exactly the same as for the sound source; only the actual values differ. The square root of the power value obtained by the waveform power (amplitude) correction value calculation means (or the value itself if it is given as an amplitude) is supplied to the segment window generation unit, and the segment amplitude is corrected when the segments are edited. The time-varying pattern of the correction value is given in exactly the same way as for the residual compression method. A pause is likewise generated, as in the residual compression method, by outputting a zero-amplitude waveform of the given duration.

【０１０９】他の合成方式の場合も、各波形振幅制御手
段に応じて、全く同様の方法でパワー（振幅）制御が実
現できる。Also in the case of the other synthesizing methods, the power (amplitude) control can be realized by a completely similar method according to each waveform amplitude control means.

【０１１０】プロミネンスを具体的にどの様なパラメー
タ値により実現するかを定めた韻律（ピッチ、パワー、
時間長）の制御方法の一例を示したのが図１１である。
なお、図１１におけるプロミネンスを含有しない場合の
基準値は、例えば、アクセント指令の大きさおよび開始
・終了時点については、公知のアクセント成分生成規則
（文献3、14参照）により決定すれば良い。あるいはよ
り簡便な方法としては、アクセント指令の大きさの基準
値Aa=0.3、アクセント指令開始・終了時点の基準音節境
界からの相対値ΔT₁=ΔT₂=ΔT₁₂=0としても実用上音質
にほとんど支障は無い。図１１は、自然音声の定量的解
析結果（図５〜図１０）に基づき求めたものであるの
で、図１１のプロミネンス生成規則に従い、音声を合成
すれば、自然な強調感をもった合成音声が得られる。勿
論、図１１はパラメータ実現値の一例であり、これらの
数値に限定されない。実際には、様々な強めの変形があ
りうるので、それに対応した数値の変形の可能性は無数
に存在する。そのような数値の変形の中で、意図伝達と
いう点で優れた性能（すなわち優れたプロミネンス表現
力）を有するようにパラメータ値を選ぶことができる。
以下、そのようなパラメータ選定の一実施例を示す。FIG. 11 shows an example of a prosody (pitch, power, duration) control method that specifies which parameter values are used to realize prominence. The reference values in FIG. 11 for the case without prominence may be determined, for example for the accent command magnitude and its start and end times, by known accent component generation rules (see Documents 3 and 14). As a simpler alternative, setting the accent command magnitude reference value to Aa = 0.3 and the relative offsets of the accent command start and end times from the reference syllable boundaries to ΔT₁ = ΔT₂ = ΔT₁₂ = 0 causes almost no practical loss in sound quality. Because FIG. 11 was derived from quantitative analysis of natural speech (FIGS. 5 to 10), synthesizing speech according to the prominence generation rules of FIG. 11 yields synthetic speech with natural-sounding emphasis. FIG. 11 is of course only one example of parameter values, and the invention is not limited to these numbers. In practice emphasis can take many forms, so the corresponding numerical variations are countless. Among these variations, the parameter values can be chosen to give high performance in conveying intent (that is, high prominence expressiveness).
An embodiment of such parameter selection is described below.

【０１１１】先に詳述した聴取評価実験の結果に基づ
き、プロミネンスの対象の単位による分類ごとに、図１
２に示した種々のプロミネンスの実現形態のうちのどの
形態がプロミネンス表現力という点で優れているかを示
したのが図１である。更に基本周波数については、どの
ようなアクセント指令の大きさの増分値ΔDAaを与える
べきかも記してある。Based on the results of the listening evaluation experiments detailed above, FIG. 1 shows, for each classification of prominence by target unit, which of the various realizations of prominence shown in FIG. 12 is superior in prominence expressiveness. For the fundamental frequency, it also states what accent command increment difference ΔDAa should be given.

【０１１２】電話等の会話では、正確な意図の伝達が最
も重要であるという立場に立つならば、多少自然性を犠
牲にしても、プロミネンス表現力の高い実現形態を採用
することは実用上意味がある。図１中の特殊な例につい
ては、このような観点に立って最適なプロミネンスの実
現形態を選定している。If accurate conveyance of intent is taken to be the most important consideration in conversation such as telephone calls, it is practically worthwhile to adopt a realization with high prominence expressiveness even at some cost in naturalness. For the special cases in FIG. 1, the optimal realization of prominence was selected from this standpoint.

【０１１３】同表中のｒの値は、上記意味で最適なプロ
ミネンスの実現形態を採用した場合のプロミネンス正聴
率（正答率）の値、すなわちｒの最大値であり、一般的
に使われる例では平均98%、特殊な例でも平均96%という
結果が得られている。これらの結果は、実現形態および
韻律制御パラメータの最適化を行う前（約77%）に比べ
て約20%のプロミネンス表現力の改善が得られたことを
意味している。なお、発話速度低減を利用するプロミネ
ンス表現形式は、プロミネンス表現力を高める方法とし
てあまり有効でない。The values of r in the table are the prominence recognition rates (correct answer rates) obtained when the optimal realization in the above sense is adopted, that is, the maximum values of r: 98% on average for the commonly used cases and 96% on average even for the special cases. Compared with the rate before the realization and the prosody control parameters were optimized (about 77%), this is an improvement of about 20 percentage points in prominence expressiveness. The realization that relies on a reduced speaking rate is not very effective as a way of raising prominence expressiveness.

【０１１４】以上のように、図１は本発明の中枢をなす
ものであり、プロミネンス表現力を著しく改善する手段
を提供するものである。As described above, FIG. 1 forms the core of the present invention and provides a means of markedly improving prominence expressiveness.

【０１１５】実際に図１および図１１による韻律制御を
実現する具体例を図２４に示す。FIG. 24 shows a specific example that actually implements the prosody control of FIGS. 1 and 11.

【０１１６】本実施例では、プロミネンスのピッチによ
る強めあるいは弱めをアクセント指令の増減により行う
例を示したが、勿論、前述のように、強調成分を用いて
行っても良い。この場合、例えば数４〜数６によりパラ
メータ値を変換しても良いし、新たにパラメータテーブ
ルを作り直しても良い。This embodiment realizes pitch-based strengthening or weakening for prominence by raising or lowering the accent command, but the emphasis component may of course be used instead, as noted earlier. In that case, the parameter values may be converted with, for example, Equations 4 to 6, or a new parameter table may be built.

【０１１７】他方、音素制御パラメータは、音素ごとに
指令の大きさ、固有角周波数、境界からの相対時刻、底
の値等を予め解析して求めておき、音節情報に対応する
テーブルとして音素規則部１３に設けておけば良い。こ
こから音節情報列の順に従って、音素制御パラメータ列
が文章ピッチ制御パラメータ部１１に送られる。ここで
音素開始あるいは終了時点（相対時刻）は、タイミング
基準情報に基いて絶対時刻に変換される。かくして文章
ピッチ制御パラメータ生成部１１で作成されたピッチ制
御パラメータはピッチパターン生成部１２に送られ、こ
こで新ピッチ制御機構モデル（下記の数１７〜数２４）
により文章ピッチパターンが生成される。The phoneme control parameters, on the other hand, are obtained by analyzing in advance, for each phoneme, the command magnitude, natural angular frequency, relative time from the boundary, floor value, and so on, and are stored in the phoneme rule unit 13 as a table indexed by syllable information. From this table, the phoneme control parameter sequence is sent to the sentence pitch control parameter generation unit 11 in the order of the syllable information sequence. There, the phoneme start and end times (relative times) are converted to absolute times on the basis of the timing reference information. The pitch control parameters created by the sentence pitch control parameter generation unit 11 are then sent to the pitch pattern generation unit 12, which generates the sentence pitch pattern using the new pitch control mechanism model (Equations 17 to 24 below).

【０１１８】フレーズ制御機構：Phrase control mechanism:

【０１１９】[0119]

【数１７】 Gp(i,t)=α(i)t exp(-α(i)t)u(t) …（数１７) t ：時刻 α(i) ：ｉ番目の固有角周波数 u(t)：単位ステップ関数
[Equation 17]
$$G_p(i,t) = \alpha(i)\,t\,\exp(-\alpha(i)\,t)\,u(t) \quad \text{(Equation 17)}$$
where t is time, α(i) is the i-th natural angular frequency, and u(t) is the unit step function.

アクセント制御機構：Accent control mechanism:

【０１２０】[0120]

【数１８】 Ga(j,t)=Min[1-(1+β(j)t) exp(-β(j)t)u(t),θ(j)] …（数１８) β(j) ：ｊ番目の固有角周波数 θ(j) ：ｊ番目の上限値
[Equation 18]
$$G_a(j,t) = \min\bigl[\,1-(1+\beta(j)\,t)\exp(-\beta(j)\,t)\,u(t),\ \theta(j)\,\bigr] \quad \text{(Equation 18)}$$
where β(j) is the j-th natural angular frequency and θ(j) is the j-th upper limit value.

音素制御機構：Phoneme control mechanism:

【０１２１】[0121]

【数１９】 Gf(k,t)=-Min[1-(1+γ(k)t) exp(-γ(k)t)u(t),φ(k)] …（数１９)
[Equation 19]
$$G_f(k,t) = -\min\bigl[\,1-(1+\gamma(k)\,t)\exp(-\gamma(k)\,t)\,u(t),\ \varphi(k)\,\bigr] \quad \text{(Equation 19)}$$
あるいはOr

【０１２２】[0122]

【数２０】 Gf(k,t)=exp(-γ(k)t)u(t) …（数２０) γ(k) ：ｋ番目の固有角周波数 φ(k) ：ｋ番目の底の値
[Equation 20]
$$G_f(k,t) = \exp(-\gamma(k)\,t)\,u(t) \quad \text{(Equation 20)}$$
where γ(k) is the k-th natural angular frequency and φ(k) is the k-th floor value.

文形指定制御機構：Sentence-pattern designation control mechanism:

【０１２３】[0123]

【数２１】 Gt(l,t)=Min[1-(1+ζ(l)t) exp(-ζ(l)t)u(t),θt(l)] …（数２１) ζ(l) ：ｌ番目の固有角周波数 θt(l)：ｌ番目の上限値
[Equation 21]
$$G_t(l,t) = \min\bigl[\,1-(1+\zeta(l)\,t)\exp(-\zeta(l)\,t)\,u(t),\ \theta_t(l)\,\bigr] \quad \text{(Equation 21)}$$
where ζ(l) is the l-th natural angular frequency and θt(l) is the l-th upper limit value.

強調制御機構：Emphasis control mechanism:

【０１２４】[0124]

【数２２】 Gs(m,t)=Min[1-(1+η(m)t) exp(-η(m)t)u(t),θs(m)] …（数２２) η(m) ：ｍ番目の固有角周波数 θs(m)：ｍ番目の上限値
[Equation 22]
$$G_s(m,t) = \min\bigl[\,1-(1+\eta(m)\,t)\exp(-\eta(m)\,t)\,u(t),\ \theta_s(m)\,\bigr] \quad \text{(Equation 22)}$$
where η(m) is the m-th natural angular frequency and θs(m) is the m-th upper limit value.

ピッチパターン：Pitch pattern:

【０１２５】[0125]

【数２３】 [Equation 23]

【０１２６】あるいはOr

【０１２７】[0127]

【数２４】 [Equation 24]

【０１２８】ここで、Fminは最低周波数、Iはフレーズ
指令の数、Ap(i)はi番目のフレーズ指令の大きさ、T
₀(i)はi番目のフレーズ指令の時点、Jはアクセント指令
の数、Aa(j)はj番目のアクセント指令の大きさ、T
₁(j)、T₂(j)はそれぞれj番目のアクセント指令の開始時
点と終了時点、Kは音素指令の数、Af(k)はk番目の音素
指令の大きさ、T₃(k)、T₄(k)はそれぞれk番目の音素指
令の開始時点と終了時点、Lは文形指定指令の数、At(l)
はl番目の文形指定指令の大きさ、T₅(l)、T₆(l)はそれ
ぞれl番目の文形指定指令の開始時点と終了時点、Mは強
調指令の数、As(m)はm番目の強調指令の大きさ、T
₇(m)、T₈(m)はそれぞれm番目の強調指令の開始時点と終
了時点である。Here, Fmin is the minimum frequency; I is the number of phrase commands, Ap(i) the magnitude of the i-th phrase command, and T₀(i) its onset time; J is the number of accent commands, Aa(j) the magnitude of the j-th accent command, and T₁(j) and T₂(j) its start and end times; K is the number of phoneme commands, Af(k) the magnitude of the k-th phoneme command, and T₃(k) and T₄(k) its start and end times; L is the number of sentence-pattern designation commands, At(l) the magnitude of the l-th such command, and T₅(l) and T₆(l) its start and end times; and M is the number of emphasis commands, As(m) the magnitude of the m-th emphasis command, and T₇(m) and T₈(m) its start and end times.
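The response functions of Equations 17, 18, 21, and 22 can be sketched in Python as below. Two points are assumptions rather than statements of the patent: the unit step u(t) is taken to gate the whole response (zero before the command time), and the function `log_f0` uses the standard Fujisaki-style superposition of phrase and accent components on ln Fmin, because Equations 23 and 24 are not reproduced in this text. All function and parameter names are hypothetical.

```python
import math

def phrase_response(alpha, t):
    """Equation 17 as printed: alpha * t * exp(-alpha * t) * u(t)."""
    if t < 0:
        return 0.0
    return alpha * t * math.exp(-alpha * t)

def capped_response(omega, ceiling, t):
    """Shared shape of Equations 18, 21 and 22.

    min[1 - (1 + omega*t) * exp(-omega*t), ceiling] for t >= 0, else 0
    (assumed reading of where u(t) applies).
    """
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + omega * t) * math.exp(-omega * t), ceiling)

def log_f0(t, fmin, phrase_cmds, accent_cmds):
    """ASSUMED superposition in the log-frequency domain, not Equation 23 itself.

    phrase_cmds: iterable of (Ap, T0, alpha)
    accent_cmds: iterable of (Aa, T1, T2, beta, theta)
    """
    value = math.log(fmin)
    for ap, t0, alpha in phrase_cmds:
        value += ap * phrase_response(alpha, t - t0)
    for aa, t1, t2, beta, theta in accent_cmds:
        onset = capped_response(beta, theta, t - t1)
        offset = capped_response(beta, theta, t - t2)
        value += aa * (onset - offset)
    return value
```

Under these assumptions, adding a prominence increment to Aa raises the contour only between that accent command's start and end times, which is the effect the accent command increment is meant to produce.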

【０１２９】本実施例におけるプロミネンス生成規則に
よる韻律制御（図１）は、聴取評価実験結果の最高スコ
アとして求められたものなので、このプロミネンス生成
規則により韻律の制御を行えば、漢字仮名混じり文テキ
ストから合成される音声に、極めて有効な強調効果をも
たらすことができる。Because the prosody control under the prominence generation rules of this embodiment (FIG. 1) was derived as the highest-scoring condition in the listening evaluation experiments, controlling prosody by these rules gives a highly effective emphasis to speech synthesized from text written in mixed kanji and kana.

【０１３０】以上、実施例では、プロミネンスのピッチ
による強めあるいは弱めをピッチ制御機構モデルあるい
は修正型ピッチ制御機構モデルにより実現する方法を示
したが、勿論プロミネンス実現方法は、これらのモデル
のみに限定されない。どの様なモデルを用いても良い。
例えば、点ピッチ（折線近似ピッチパターン）でも実現
可能であるし、あるいは階段状のピッチパターンを用い
ても何ら支障は無い。The embodiments above realize pitch-based strengthening or weakening for prominence with the pitch control mechanism model or the modified pitch control mechanism model, but the realization is of course not limited to these models; any model may be used.
For example, a point-pitch pattern (a pitch pattern approximated by line segments) can be used, and a stepwise pitch pattern presents no problem either.

【０１３１】[0131]

【発明の効果】以上示したように、本発明は、人間の発
声する自然な文章音声に含まれる強めや弱めを規則合成
において実現する手段及び方法を提供するものである。
本発明によれば、現実の文章音声に起こりうるほとんど
全ての場合の強め、弱めを実現することができる。その
ため、利用者が特別の注意を払うことなく発話内容を容
易に理解することができるので、利用者の負担を著しく
軽減することが可能となる。特に、例えば新聞校閲のよ
うな長時間作業時の疲労軽減効果は著しく、作業効率向
上により得られる利益は大きい。[Effects of the Invention] As shown above, the present invention provides means and a method for realizing, in rule-based synthesis, the strengthening and weakening found in natural human sentence speech.
The invention can realize strengthening and weakening for almost every case that can arise in real sentence speech. Users can therefore follow the spoken content easily without paying special attention, which greatly reduces their burden. The reduction in fatigue is especially marked in prolonged work such as newspaper proofreading, and the gain from the resulting improvement in work efficiency is large.

[Brief description of drawings]

【図１】本発明の基本部分を示す図表である。FIG. 1 is a diagram showing a basic part of the present invention.

【図２】本発明を実現する構成例を示す図表である。FIG. 2 is a chart showing a configuration example for realizing the present invention.

【図３】本発明を実現する構成例を示す図表である。FIG. 3 is a chart showing a configuration example for realizing the present invention.

【図４】本発明の基本部分を補足する図表である。FIG. 4 is a chart supplementing the basic part of the present invention.

【図５】本発明の考え方を例示する図表である。FIG. 5 is a chart illustrating the concept of the present invention.

【図６】本発明の考え方を例示する図表である。FIG. 6 is a chart illustrating the concept of the present invention.

【図７】本発明の考え方を例示する図表である。FIG. 7 is a diagram illustrating the concept of the present invention.

【図８】本発明の考え方を例示する図表である。FIG. 8 is a chart illustrating the concept of the present invention.

【図９】本発明の考え方を例示する図表である。FIG. 9 is a diagram illustrating the concept of the present invention.

【図１０】本発明の考え方を例示する図表である。FIG. 10 is a diagram illustrating the concept of the present invention.

【図１１】本発明の考え方を例示する図表である。FIG. 11 is a chart illustrating the concept of the present invention.

【図１２】本発明の考え方を例示する図表である。FIG. 12 is a diagram illustrating the concept of the present invention.

【図１３】本発明の考え方を例示する図表である。FIG. 13 is a chart illustrating the concept of the present invention.

【図１４】本発明の考え方を例示する図表である。FIG. 14 is a chart illustrating the concept of the present invention.

【図１５】本発明の考え方を例示する図表である。FIG. 15 is a diagram illustrating the concept of the present invention.

【図１６】本発明の考え方を例示する図表である。FIG. 16 is a chart illustrating the concept of the present invention.

【図１７】本発明の考え方を例示する図表である。FIG. 17 is a diagram illustrating the concept of the present invention.

【図１８】本発明の考え方を例示する図表である。FIG. 18 is a diagram illustrating the concept of the present invention.

【図１９】本発明の実施例を示す図表である。FIG. 19 is a chart showing an example of the present invention.

【図２０】本発明の実施例を示す図表である。FIG. 20 is a chart showing an example of the present invention.

【図２１】本発明の実施例を示す図表である。FIG. 21 is a chart showing an example of the present invention.

【図２２】本発明の実施例を示す図表である。FIG. 22 is a chart showing an example of the present invention.

【図２３】本発明の実施例を示す図表である。FIG. 23 is a chart showing an example of the present invention.

【図２４】本発明の実施例を示す図表である。FIG. 24 is a chart showing an example of the present invention.

【図２５】本発明の実施例を示す図表である。FIG. 25 is a chart showing an example of the present invention.

【図２６】本発明の実施例を示す図表である。FIG. 26 is a chart showing an example of the present invention.

【図２７】本発明の実施例を示す図表である。FIG. 27 is a chart showing an example of the present invention.

【図２８】本発明の実施例を示す図である。FIG. 28 is a diagram showing an example of the present invention.

[Explanation of symbols]

３…制御パラメータ生成部、８…イントネーション規則
部、１０…アクセント規則部、１１…文章ピッチ制御パ
ラメータ生成部、１２…ピッチパターン生成部、１４…
プロミネンス情報抽出手段、１５…音源パワー（振幅）
修正値計算手段。3: control parameter generation unit; 8: intonation rule unit; 10: accent rule unit; 11: sentence pitch control parameter generation unit; 12: pitch pattern generation unit; 14: prominence information extracting means; 15: sound source power (amplitude) correction value calculation means.

Claims

[Claims]

1. A speech rule synthesizing apparatus comprising: language processing means for morphologically analyzing an input sentence; first prosody control means which has a control parameter generation unit for determining the type of the input sentence based on the output of the language processing means and generating control parameters according to the type, and a pitch pattern generation unit for generating a temporal change pattern of the fundamental frequency (hereinafter abbreviated as a pitch pattern) according to the control parameters, and which determines the classification of prominence based on the output of the language processing means and controls the control parameters according to change amounts of the control parameters obtained in advance, for each classification of prominence, from analysis of natural speech; and speech synthesizing means for generating a phoneme parameter sequence corresponding to the input sentence based on the output of the language processing means and successively synthesizing speech from the phoneme parameter sequence and the pitch pattern generated by the first prosody control means; wherein the pitch pattern generation unit has at least an accent control mechanism for controlling the magnitude of an accent component and its start and end times, the change amount of the control parameter being a change in the accent component; when the sentence unit to which prominence is added is an entire phrase, a change amount ΔDAa, which is the difference between the "accent command increment", defined as the increment relative to the magnitude of the accent component without prominence, and the "accent command increment" of the adjacent accent component, is set to a positive value; when the sentence unit to which prominence is added is part of a word and the accent type is not the accent-modified type, ΔDAa is set to a positive value and a pause is inserted immediately before and immediately after the sentence unit; when the sentence unit to which prominence is added is part of a word and the accent type is the accent-modified type, ΔDAa is set to a positive value; and when the sentence unit to which prominence is added is one beat (syllable) of the topic, ΔDAa is set to a positive value and a pause is inserted immediately before and immediately after the sentence unit.

2. The voice rule synthesizing apparatus as described, wherein: when the sentence unit to which prominence is added is an entire phrase and the accent type is not the accent-deformed type, the change amount ΔDAa of the control parameter, which is the difference between the "accent command increment", defined as the increment relative to the magnitude of the accent component when no prominence is added, and the "accent command increment" of the adjacent accent component, is set to a value of 0.8; when the sentence unit to which prominence is added is an entire phrase and the accent type is the accent-deformed type, ΔDAa is set to a value within the range of 0.7 ± 0.1; when the sentence unit to which prominence is added is a part of a word and the accent type is not the accent-deformed type, ΔDAa is set to a value of 0.1 and a pause is inserted immediately before and immediately after the sentence unit; when the sentence unit to which prominence is added is a part of a word and the accent type is the accent-deformed type, ΔDAa is set to a value of 0.6; and when the sentence unit to which prominence is added is one beat (syllable) of the topic, ΔDAa is set to a value of 0.3 and a pause is inserted immediately before and immediately after the sentence unit.
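
For illustration only, the values recited in claim 2 can be read as a lookup from the prominence classification to ΔDAa and a pause flag. The Python sketch below is not part of the patent text: the names `Unit`, `ProminenceRule`, and `prominence_rule` are hypothetical, and representing a single recited value as a range of zero width is an assumption made here for uniformity.

```python
from dataclasses import dataclass
from enum import Enum


class Unit(Enum):
    """Sentence unit to which prominence is added (hypothetical names)."""
    WHOLE_PHRASE = "entire phrase"
    PART_OF_WORD = "part of a word"
    TOPIC_BEAT = "one beat (syllable) of the topic"


@dataclass(frozen=True)
class ProminenceRule:
    delta_daa: float    # change amount ΔDAa recited in claim 2
    tolerance: float    # half-width of the recited range; 0.0 if a single value is given
    pause_around: bool  # insert a pause immediately before and after the unit


def prominence_rule(unit: Unit, accent_deformed: bool = False) -> ProminenceRule:
    """Return the claim 2 values for one prominence classification."""
    if unit is Unit.WHOLE_PHRASE:
        if accent_deformed:
            return ProminenceRule(0.7, 0.1, False)
        return ProminenceRule(0.8, 0.0, False)
    if unit is Unit.PART_OF_WORD:
        if accent_deformed:
            return ProminenceRule(0.6, 0.0, False)
        return ProminenceRule(0.1, 0.0, True)
    if unit is Unit.TOPIC_BEAT:
        # Claim 2 does not distinguish accent types for this case,
        # so accent_deformed is ignored here.
        return ProminenceRule(0.3, 0.0, True)
    raise ValueError(f"unknown sentence unit: {unit!r}")
```

A caller would, under these assumptions, add `delta_daa` to the accent command increment of the emphasized unit and insert pauses when `pause_around` is true.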

3. The voice rule synthesizing apparatus according to claim 1, further comprising second prosody control means for controlling the power of the voice synthesized by said voice synthesizing means.

4. The voice rule synthesizing apparatus according to claim 3, wherein the second prosody control means sets the value of the power P, defined in decibels (dB), to a value obtained from the magnitude Aa of the accent component command by the equation P = 11Aa ± 4 (dB).
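
As a worked reading of the relation P = 11Aa ± 4 (dB), the sketch below computes the band of power values that the equation allows for a given accent command magnitude. It is illustrative only: the function names are hypothetical, and interpreting "± 4" as an admissible band around 11Aa is an assumption.

```python
from typing import Tuple


def power_range_db(aa: float) -> Tuple[float, float]:
    """Return (low, high) of P = 11*Aa ± 4 in dB for accent command magnitude Aa."""
    center = 11.0 * aa
    return (center - 4.0, center + 4.0)


def is_power_in_range(power_db: float, aa: float) -> bool:
    """Check whether a power value lies inside the band given by the equation."""
    low, high = power_range_db(aa)
    return low <= power_db <= high


print(power_range_db(1.0))  # (7.0, 15.0)
```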

5. The speech rule synthesizing apparatus according to claim 3, wherein the second prosody control means uses a change in power caused by a change in pitch pattern by the first prosody control means.

6. The voice rule synthesizing apparatus according to claim 1, further comprising third prosody control means for controlling the time length of the voice synthesized by said voice synthesizing means.

7. The speech rule synthesizing apparatus according to claim 6, wherein said third prosody control means comprises means for controlling the duration of a phoneme corresponding to said phoneme parameter sequence.

8. The speech rule synthesizing apparatus according to claim 7, wherein the third prosody control means, when a pause immediately follows the sentence unit to which prominence is added, extends the duration of the vowel at the end of that sentence unit relative to the duration of the same vowel when it is not emphasized, and, when the sentence pattern is an interrogative sentence, extends the duration of the vowel at the end of the sentence by a value within the range of 78 ± 22% of the duration of that vowel in a plain sentence.

9. A speech rule synthesizing method comprising: a step of morphologically analyzing an input sentence and expressing it as a syllable code string; a step of determining the type of the input sentence based on the syllable code string, generating a control parameter according to the type, and generating a pitch pattern according to the control parameter; a step of determining the classification of prominence based on the syllable code string and controlling the control parameter according to a change amount of the control parameter obtained in advance, from analysis results of natural speech, for each classification of prominence; and a step of generating a phoneme parameter sequence corresponding to the input sentence based on the syllable code string and sequentially synthesizing speech from the phoneme parameter sequence and the generated pitch pattern; the method being characterized in that the control parameter includes at least parameters that control the magnitude of the accent component and its start and end times, and the change amount of the accent component is used as the change amount of the control parameter, wherein: when the sentence unit to which prominence is added is an entire phrase, the change amount ΔDAa, which is the difference between the "accent command increment", defined as the increment relative to the magnitude of the accent component when no prominence is added, and the "accent command increment" of the adjacent accent component, is set to a positive value; when the sentence unit to which prominence is added is a part of a word and the accent type is not the accent-deformed type, ΔDAa is set to a positive value and a pause is inserted immediately before and immediately after the sentence unit; when the sentence unit to which prominence is added is a part of a word and the accent type is the accent-deformed type, ΔDAa is set to a positive value; and when the sentence unit to which prominence is added is one beat (syllable) of the topic, ΔDAa is set to a positive value and a pause is inserted immediately before and immediately after the sentence unit.

10. The voice rule synthesizing method as described, wherein: when the sentence unit to which prominence is added is an entire phrase and the accent type is not the accent-deformed type, the change amount ΔDAa of the control parameter, which is the difference between the "accent command increment", defined as the increment relative to the magnitude of the accent component when no prominence is added, and the "accent command increment" of the adjacent accent component, is set to a value of 0.8; when the sentence unit to which prominence is added is an entire phrase and the accent type is the accent-deformed type, ΔDAa is set to a value within the range of 0.7 ± 0.1; when the sentence unit to which prominence is added is a part of a word and the accent type is not the accent-deformed type, ΔDAa is set to a value of 0.1 and a pause is inserted immediately before and immediately after the sentence unit; when the sentence unit to which prominence is added is a part of a word and the accent type is the accent-deformed type, ΔDAa is set to a value of 0.6; and when the sentence unit to which prominence is added is one beat (syllable) of the topic, ΔDAa is set to a value of 0.3 and a pause is inserted immediately before and immediately after the sentence unit.

11. The voice rule synthesizing method according to claim 9, wherein the power of the synthesized voice is controlled.

12. The voice rule synthesizing method according to claim 11, wherein the power control is performed by setting the value of the power P, defined in decibels (dB), to a value obtained from the magnitude Aa of the accent component command by the equation P = 11Aa ± 4 (dB).

13. The voice rule synthesizing method according to claim 11, wherein the power control utilizes a change in power associated with a change in the pitch pattern.

14. The voice rule synthesizing method according to claim 9, wherein the time length of the synthesized voice is controlled.

15. The speech rule synthesizing method according to claim 14, wherein the control of the time length of the speech is performed by controlling the duration of a phoneme corresponding to the phoneme parameter sequence.

16. The speech rule synthesizing method according to claim 15, wherein the control of the time length of the speech is such that, when a pause immediately follows the sentence unit to which prominence is added, the duration of the vowel at the end of that sentence unit is extended by a value within the range of 66 ± 33% of the duration of the vowel when it is not emphasized, and, when the sentence pattern is a question sentence, the duration of the vowel at the end of the sentence is extended by a value within the range of 78 ± 22% of the duration of the vowel in a normal sentence.
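
The two duration rules of claim 16 can be sketched as follows. This is an illustrative reading, not the patented implementation: the function names are hypothetical, durations are assumed to be in milliseconds, and "extended by x%" is assumed to mean that x% of the baseline duration is added to it.

```python
def _extend(base_ms: float, percent: float, center: float, half_width: float) -> float:
    """Add `percent` of base_ms to base_ms, after checking the recited range."""
    low, high = center - half_width, center + half_width
    if not low <= percent <= high:
        raise ValueError(f"percent {percent} outside {center} ± {half_width}")
    return base_ms + base_ms * percent / 100.0


def pre_pause_vowel_ms(base_ms: float, percent: float = 66.0) -> float:
    """Final vowel of a prominent unit followed by a pause: 66 ± 33% extension.

    base_ms is the duration of the same vowel when it is not emphasized.
    """
    return _extend(base_ms, percent, 66.0, 33.0)


def question_final_vowel_ms(base_ms: float, percent: float = 78.0) -> float:
    """Sentence-final vowel of a question sentence: 78 ± 22% extension.

    base_ms is the duration of the vowel in a normal sentence.
    """
    return _extend(base_ms, percent, 78.0, 22.0)
```

Under these assumptions, a 100 ms vowel before a pause becomes 166 ms at the central value, and a 100 ms sentence-final vowel in a question becomes 178 ms.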