JPH0836398A

JPH0836398A - Rhythm control system

Info

Publication number: JPH0836398A
Application number: JP6172990A
Authority: JP
Inventors: Tsugumine Yanagida; 従峰柳田; Yoshimasa Sawada; 喜正沢田; Kiyoshi Ishida; 清石田; Shigeru Kashiwagi; 繁柏木
Original assignee: Meidensha Corp; Meidensha Electric Manufacturing Co Ltd
Current assignee: Meidensha Corp; Meidensha Electric Manufacturing Co Ltd
Priority date: 1994-07-26
Filing date: 1994-07-26
Publication date: 1996-02-06

Abstract

PURPOSE: To provide a prosody control method that enables high-quality speech synthesis close to the human voice, even with a small number of qualitative variables, in rule-based speech synthesis driven by qualitative variables such as the phoneme environment. CONSTITUTION: The pitch pattern of a mora is finely controlled by applying quantification type I of multivariate analysis to the fundamental frequency control of that pitch pattern. When an analysis section 2 computes the category quantities of the quantification equation in advance, it uses items (qualitative variables) with high partial correlation coefficients, such as the accent pattern described by the mora in question, the preceding mora, and the succeeding mora. When a synthesis section 3 performs prosody control, the fundamental frequency of each pitch point in each mora of the phoneme symbol string to be synthesized is computed successively. A high-quality fundamental frequency pattern close to the human voice is thus expressed with a small number of items and a small number of categories (classifications within each item).

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a prosody control method, specifically a method of controlling the fundamental frequency of phoneme data supplied from a phoneme database or the like to speech synthesis means, in rule-based synthesis of arbitrary Japanese speech.

[0002]

[Prior Art] In a typical conventional speech synthesizer, input text in mixed kanji and kana, supplied for example from a host computer, is converted by Japanese language processing into a phoneme symbol string organized into units such as accent phrases. Prosody control is then performed on this phoneme symbol string by referring to a database that stores phoneme data such as the fundamental frequency (pitch frequency), energy, and phoneme duration of each mora, and synthesized speech is obtained through an articulation filter. This prosody control technique was a model-based technique.

[0003] As described above, the fundamental frequency, phoneme duration, and energy are the phoneme data mainly used for speech synthesis. Of these, the fundamental frequency in particular is the parameter that governs the accent of the synthesized speech, and it requires fine control to bring the result close to human speech.

[0004] One known fundamental frequency control technique that allows such fine control uses quantification type I of multivariate analysis. This technique computes a quantitative variable (the fundamental frequency) from qualitative variables (the phoneme environment) and is formulated by the following equations. Let j denote a factor item of the qualitative variables of the i-th datum, k the category to which the datum belongs within that item (a category being one classification of an item), and ajk the category quantity (the numerical value, or coefficient, assigned to that category). Then,

[0005]

[Equation 1]

[0006] Here, δ is a variable defined as follows.

[0007]

[Equation 2]

[0008] To predict the values {Yi} of the quantitative variable by the least-squares method,

[0009]

[Equation 3]

[0010] the category quantities {ajk} are determined so as to satisfy this condition.
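The images for Equations 1 to 3 are not present in this text. As a hedged reconstruction, based only on the definitions in paragraphs [0004] to [0010] and on the standard form of quantification type I, they can be written as below. The symbols y_i (observed value, following the usage in paragraph [0033]), n (number of data), and Q (residual sum of squares) are notational assumptions, and the equations are left unnumbered because the original numbering cannot be confirmed.

```latex
\begin{align*}
Y_i &= \sum_{j}\sum_{k} a_{jk}\,\delta_i(jk) \\[4pt]
\delta_i(jk) &=
  \begin{cases}
    1 & \text{if the $i$-th datum falls in category $k$ of item $j$} \\
    0 & \text{otherwise}
  \end{cases} \\[4pt]
Q &= \sum_{i=1}^{n} \bigl( y_i - Y_i \bigr)^2 \;\longrightarrow\; \min
\end{align*}
```

Under this reading, each datum selects exactly one category per item, so the estimate Y_i is simply the sum of one category quantity per item.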

[0011] Quantification type I is described in detail in references such as 「パソコン統計解析ハンドブックII 多変量解析編」 (Handbook of Statistical Analysis on Personal Computers II: Multivariate Analysis; Yutaka Tanaka, Tomoyuki Tarumi, Kazumasa Wakimoto; Kyoritsu Shuppan).

[0012]

[Problems to Be Solved by the Invention] However, when quantification type I of multivariate analysis is applied to prosody control for speech synthesis as described above, the choice of qualitative variables has a very large influence. For example, even an item with a low partial correlation coefficient (the degree to which that variable fits the quantitative variable) can show a large change in its partial correlation coefficient depending on the data actually supplied. It is therefore impossible to determine uniquely which items are effective for control. In addition, the coefficients (ajk) of the linear function that is finally output (the quantification equation) cannot be obtained reliably unless a reasonably large amount of input data is supplied for every category of each item. Consequently, the conventional prosody control technique based on quantification type I required a large number of items and categories, and its partial correlation coefficients were unstable, leaving room for improvement with respect to high-quality speech synthesis.

[0013] The present invention was made to solve the above problems. Its object is to provide a prosody control method that enables high-quality, human-like speech synthesis even with few qualitative variables when prosody control in rule-based speech synthesis is performed on the basis of qualitative variables such as the phoneme environment.

[0014]

[Means for Solving the Problems] To achieve the above object, the prosody control method of the first example of the present invention uses analysis and learning means to which a quantification technique of multivariate analysis is applied. The learning input data has as its items the accent pattern described by the preceding mora, the mora in question, and the succeeding mora, together with one or more of the following: the mora position from the beginning of the word, the mora position from the end of the word, the pitch point position within the mora, a long-vowel flag, and a geminate (sokuon) flag. From this data the means calculates, for each category of each item, a category quantity, which is a coefficient of the quantification equation, so as to quantify the fundamental frequency of each pitch point of a mora with the items and their categories as variables. Calculation means that determines the categories of those items for each mora of the phoneme symbol string to be synthesized, and successively calculates the fundamental frequency of each pitch point of that mora from the quantification equation using the category quantities calculated for those categories, is used as the fundamental frequency estimation means, and the calculated fundamental frequency is used for speech synthesis.

[0015] The prosody control method of the second example of the present invention likewise uses analysis and learning means to which a quantification technique of multivariate analysis is applied. The items of the learning input data are: whether the accent is rising or falling; the position counted from the mora at which the accent changed; the order of the mora counted from the front, with the first mora of the input as number one; the order of the mora counted from the back, with the last mora of the input as number one; the attribute obtained by looking only at the vowel part of the phoneme signal; and the attribute obtained by looking at the consonant part of the phoneme symbol. From this data the means calculates, for each category of each item, a category quantity, which is a coefficient of the quantification equation, so as to quantify the fundamental frequency of each pitch point of a mora with the items and their categories as variables. Calculation means that determines the categories of those items for each mora of the phoneme symbol string to be synthesized, and successively calculates the fundamental frequency of each pitch point of that mora from the quantification equation using the category quantities calculated for those categories, is used as the fundamental frequency estimation means, and the calculated fundamental frequency is used for speech synthesis.

[0016] In the prosody control method of the third example of the present invention, a fuzzy neural network consisting of an antecedent part and a consequent part is used as learning means to learn coefficient data from learning input data. The fuzzy neural network with the learned coefficient data set in it is then used as pitch frequency estimation means, and the estimated pitch frequency is used for speech synthesis.

[0017] In the prosody control method of the third example, it is preferable, for further reducing the estimation error, to restructure the learning input data of the learning means so that it corresponds to the membership functions of the antecedent part of the fuzzy neural network, process it with quantification type I to calculate the coefficients of the input/output function, and set those coefficients as the initial values of the coefficients of the linear functions of the consequent part.

[0018]

[Operation] In the prosody control method of the first example of the present invention, when the pitch pattern of a mora is determined in rule-based speech synthesis, a quantification technique of multivariate analysis is applied to the fundamental frequency control so that the pitch pattern of that mora is finely controlled. In doing so, items (qualitative variables) with high partial correlation coefficients, such as the accent pattern described by the mora in question together with its preceding and succeeding moras, are used to quantify successively the fundamental frequency of each pitch point in each mora of the phoneme symbol string to be synthesized. A high-quality, human-like fundamental frequency pattern can thus be expressed with a small number of items and categories (classifications within each item).

[0019] In the prosody control method of the second example of the present invention, the input format of the above prosody control method based on a quantification technique of multivariate analysis is changed to narrow down the items. This improves the accuracy of fundamental frequency estimation in prosody control and enables high-quality speech synthesis even closer to human speech.

[0020] Furthermore, in the prosody control method of the third example of the present invention, a fuzzy neural network is used for learning and estimating the fundamental frequency in prosody control. This allows the fundamental frequency to be estimated with the relationship between the output and the combination of all categories in one datum taken into account, improves the estimation accuracy, and enables high-quality speech synthesis even closer to human speech.

[0021]

[Embodiments] Embodiments of the present invention are described in detail below with reference to the drawings.

[0022] FIG. 1 is a block diagram showing the configuration of the first embodiment of the present invention. In the figure, 1 is a phoneme database, 2 is an analysis unit, and 3 is a synthesis unit. The phoneme database 1 stores the following:

(a) pitch value (learning data);
(b) accent pattern (eight patterns, 1: LHL, 2: LHH, 3: LLH, 4: LLL, 5: HLH, 6: HLL, 7: HHL, 8: HHH, where H is high and L is low);
(c) mora position from the beginning of the word (1, 2, ...);
(d) mora position from the end of the word (1, 2, ...);
(e) pitch point position within the mora (seven points per mora);
(f) name of the mora in question or its phoneme type (voiceless plosive, voiced fricative, etc.);
(g) long-vowel flag (whether a long vowel is present);
(h) geminate flag (whether a geminate consonant (sokuon) is present).

Of these, (b) to (h) are the items, and the text in parentheses after each item name indicates its categories. The accent pattern categories classify the combination of high/low (H/L) accent of the preceding mora, the mora in question, and the succeeding mora. The feature of this embodiment is that the input data is described with the items and categories of (b) to (h) and quantification type I of multivariate analysis is applied to it.
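As an illustration of the record layout listed above, the following is a minimal Python sketch of one learning record with entries (a) to (h) and of the accent pattern numbering. The class name, field names, and helper function are hypothetical; only the eight pattern numbers come from the text. The text does not say how a missing preceding or succeeding mora is encoded, so the helper simply rejects anything other than H or L.

```python
from dataclasses import dataclass

# Accent pattern categories as numbered in the text:
# (preceding mora, mora in question, succeeding mora), H = high, L = low.
ACCENT_PATTERNS = {
    "LHL": 1, "LHH": 2, "LLH": 3, "LLL": 4,
    "HLH": 5, "HLL": 6, "HHL": 7, "HHH": 8,
}


@dataclass(frozen=True)
class PitchSample:
    """One learning record; all field names are hypothetical."""
    pitch_hz: float        # (a) pitch value, the quantitative variable
    accent_pattern: int    # (b) 1..8, see ACCENT_PATTERNS
    mora_from_start: int   # (c) mora position from the beginning of the word
    mora_from_end: int     # (d) mora position from the end of the word
    pitch_point: int       # (e) pitch point position within the mora, 1..7
    phoneme_type: str      # (f) mora name or phoneme type
    long_vowel: bool       # (g) long-vowel flag
    geminate: bool         # (h) geminate (sokuon) flag


def accent_category(preceding: str, current: str, succeeding: str) -> int:
    """Return the accent pattern number for an H/L triple."""
    key = (preceding + current + succeeding).upper()
    if key not in ACCENT_PATTERNS:
        raise ValueError(f"not an H/L accent triple: {key!r}")
    return ACCENT_PATTERNS[key]
```

For example, `accent_category("L", "H", "L")` gives category 1, matching the first pattern in the list.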

[0023] In rule-based speech synthesis, determining the pitch pattern of a mora requires taking into account its preceding mora, succeeding mora, accent environment, phonemes, and so on. In addition, the pitch pattern of the mora itself requires fine control. When quantification type I is used for this control, it is desirable that the fundamental frequency pattern of speech be expressible with few items and few categories.

[0024] The procedure from input of a phoneme symbol string to determination of the pitch pattern of a mora is described below.

[0025] First, the analysis unit 2 executes quantification type I of multivariate analysis in advance, taking items (b) to (h) of the phoneme database 1 (the qualitative variables) as input and learning the pitch values of (a) (the quantitative variable) to obtain the category quantities (the coefficients ajk of the input/output linear function). FIG. 2 shows an example block configuration of the analysis unit 2 for this purpose. Reference numeral 21 denotes an input data generation unit that generates input data from the learning data, and 22 denotes a quantification type I calculation unit that obtains quantity data (category quantities) from the input data and stores it in a database or the like. FIG. 3 shows the control flow of the analysis unit 2. For the fundamental frequency at a given position within a mora, equation (1) shown in the prior art section is generated when the pitch value of (a) (the quantitative variable) is estimated from the qualitative variables. The pitch value at the next position is then learned. When learning has finished for all the data, equation (2), also shown in the prior art section, is generated, and the category quantities can be obtained by the least-squares method.
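The learning step described above can be sketched as an ordinary least-squares fit over dummy (0/1) variables, one per item and category. The Python below is a minimal, dependency-free sketch and not the patent's implementation: the function names are hypothetical, each datum is assumed to be a tuple holding one category per item, and a tiny ridge term is added (an assumption of this sketch) because the dummy columns of each item sum to one, which makes the plain normal equations singular.

```python
from typing import Dict, Hashable, List, Sequence, Tuple

Key = Tuple[int, Hashable]  # (item index j, category k)


def fit_category_quantities(
    rows: Sequence[Sequence[Hashable]],
    targets: Sequence[float],
    ridge: float = 1e-8,
) -> Dict[Key, float]:
    """Least-squares category quantities a_jk for Y_i = sum_j a_{j, k_i(j)}."""
    # Assign one dummy-variable column to each (item, category) pair seen.
    index: Dict[Key, int] = {}
    for row in rows:
        for j, cat in enumerate(row):
            index.setdefault((j, cat), len(index))
    p = len(index)

    # Accumulate the normal equations (X^T X + ridge * I) a = X^T y.
    A: List[List[float]] = [
        [ridge if r == c else 0.0 for c in range(p)] for r in range(p)
    ]
    b: List[float] = [0.0] * p
    for row, y in zip(rows, targets):
        cols = [index[(j, cat)] for j, cat in enumerate(row)]
        for r in cols:
            b[r] += y
            for c in cols:
                A[r][c] += 1.0

    # Gaussian elimination with partial pivoting, then back substitution.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, p):
            factor = A[r][col] / A[col][col]
            for c in range(col, p):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]
    a = [0.0] * p
    for r in range(p - 1, -1, -1):
        tail = sum(A[r][c] * a[c] for c in range(r + 1, p))
        a[r] = (b[r] - tail) / A[r][r]
    return {key: a[i] for key, i in index.items()}


def predict(quantities: Dict[Key, float], row: Sequence[Hashable]) -> float:
    """Evaluate the quantification equation for one datum (a plain sum)."""
    return sum(quantities[(j, cat)] for j, cat in enumerate(row))
```

Because of the redundancy among dummy columns, the individual quantities are not unique, but the fitted values are; a category that never occurs in the learning data has no quantity at all, which mirrors the data requirement noted in paragraph [0012].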

[0026] Next, speech synthesis in the synthesis unit 3 is described. FIG. 4 shows an example block configuration of the synthesis unit 3. Reference numeral 31 denotes an input data generation unit that generates input data from test data, 32 denotes a pitch frequency estimation unit that obtains the pitch frequency from the quantity data (category quantities) in the database, and 33 denotes a speech synthesis unit that synthesizes speech using the pitch frequency obtained. FIG. 5 shows the control flow in the synthesis unit 3. At synthesis time, the fundamental frequency (pitch frequency) of each pitch point of each mora can be estimated by evaluating the quantification equation built from the category quantities, which involves only multiply-accumulate operations. Specifically, the previously computed category quantities are read in, the data (2) to (8) of the phoneme database 1 are computed from the phoneme symbol string of the target utterance in the input data, and equation (1) is evaluated with the corresponding category quantities (coefficients ajk) to obtain the fundamental frequencies at all pitch point positions in succession. The resulting pitch pattern is used for prosody control, and speech is synthesized.
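The successive calculation at synthesis time amounts to a table lookup and a sum per pitch point. The following Python sketch shows that loop under stated assumptions: the function name, the dictionary layout keyed by (item name, category), and the item name `pitch_point` are all hypothetical, and the category quantities are taken as already computed by the analysis step.

```python
from typing import Dict, Hashable, List, Mapping, Sequence, Tuple

Quantities = Dict[Tuple[str, Hashable], float]  # (item name, category) -> a_jk


def estimate_pitch_contour(
    quantities: Quantities,
    moras: Sequence[Mapping[str, Hashable]],
    points_per_mora: int = 7,
) -> List[float]:
    """Estimate the pitch frequency at every pitch point of every mora.

    Each element of `moras` maps item names to the category of that mora.
    The pitch point position is added as one more item for each point, as
    in item (e) of the first embodiment.
    """
    contour: List[float] = []
    for mora_items in moras:
        for point in range(1, points_per_mora + 1):
            items = dict(mora_items, pitch_point=point)
            # Quantification equation: sum the quantity of the one
            # category selected in each item.
            contour.append(
                sum(quantities[(name, cat)] for name, cat in items.items())
            )
    return contour
```

A category that was never seen during learning raises `KeyError` here; how the original system handles that case is not stated in the text.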

[0027] FIG. 4 shows the experimental results of this embodiment. It plots the correlation between the learned values of the fundamental frequency at the pitch points, obtained by inputting 2741 data points taken from natural speech and described with the six items and their categories shown in the table below, and the estimated values of the fundamental frequency at the pitch points at synthesis time. The partial correlation coefficient of each item is as shown in the table below; the accent pattern, the number of moras from the beginning of the word, and the number of moras from the end of the word have particularly high partial correlation coefficients. The multiple correlation coefficient over all items is 0.525, which indicates that using these items and categories is effective.

[0028]

[Table 1]

[0029] As described above, with the technique of this embodiment, the preceding mora, succeeding mora, accent environment, phonemes, and so on are taken into account when the pitch pattern of a mora is determined in rule-based speech synthesis. Because quantification type I of multivariate analysis is applied to the fundamental frequency control, the pitch pattern of the mora is controlled more finely than with the conventional model-based technique. Furthermore, this embodiment uses fewer items and categories than conventional applications of quantification type I to fundamental frequency control, and yet the partial correlation coefficients are high. A high-quality fundamental frequency pattern close to human speech can therefore be expressed.

[0030] Next, a second embodiment of the present invention is described.

[0031] This embodiment modifies the input format and the preparation of learning data of the first embodiment in order to reduce the error between estimated and measured values in the pitch frequency control technique of the first embodiment, which uses quantification type I in rule-based synthesis of arbitrary Japanese speech. In other words, this embodiment is also basically a technique that uses quantification type I for pitch frequency control in rule-based synthesis of arbitrary Japanese speech: the coefficients (quantities) of the input/output function are calculated in advance from phoneme environment data by a statistical method, and at synthesis time those coefficients are used to calculate the pitch frequency for any mora of any Japanese input.

[0032] FIG. 7 shows results for the first embodiment. As learning data for obtaining the category quantities (coefficients of the input/output function), 36965 data points contained in 490 words uttered by one female speaker were used, and estimation was performed on test words using 9214 data points in 100 words. The vertical axis is the estimated value and the horizontal axis is the measured value. The multiple correlation coefficient was 0.76 and the mean relative error was 15.40%.

[0033] Here, the multiple correlation coefficient is a value that evaluates how well the output is predicted when a given set of category quantities is used. It equals the correlation coefficient between the observed values {y_i: i = 1, ..., n} and the estimated values {Y_i: i = 1, ..., n} and is calculated by the following equation.

[0034]

[Equation 4]

[0035] The mean relative error J is the value calculated by the following equation.

[0036]

[Equation 5]
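The images for Equations 4 and 5 are not present in this text. The Python sketch below shows one way the two evaluation measures could be computed. The multiple correlation coefficient follows the definition in paragraph [0033] (the ordinary correlation coefficient between observed and estimated values). The form of the mean relative error, the average of |y_i - Y_i| / y_i expressed as a percentage, is an assumption, since the original equation is unavailable; the function names are hypothetical.

```python
import math
from typing import Sequence


def multiple_correlation(observed: Sequence[float], estimated: Sequence[float]) -> float:
    """Correlation coefficient between observed y_i and estimated Y_i."""
    n = len(observed)
    mean_obs = sum(observed) / n
    mean_est = sum(estimated) / n
    cov = sum((o - mean_obs) * (e - mean_est) for o, e in zip(observed, estimated))
    var_obs = sum((o - mean_obs) ** 2 for o in observed)
    var_est = sum((e - mean_est) ** 2 for e in estimated)
    return cov / math.sqrt(var_obs * var_est)


def mean_relative_error(observed: Sequence[float], estimated: Sequence[float]) -> float:
    """Assumed form of J: mean of |y_i - Y_i| / y_i, as a percentage."""
    n = len(observed)
    return 100.0 * sum(abs(o - e) / o for o, e in zip(observed, estimated)) / n
```

Both functions expect sequences of equal length with positive observed pitch values; a constant sequence makes the correlation undefined (division by zero).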

[0037] As FIG. 7 shows, some estimates differ from the measured values by 200 Hz or more. The likely causes are as follows.

[0038] First, the accent type of (b) considers only three moras: the preceding mora, the mora in question, and the succeeding mora. Moreover, although information on whether the mora in question is high or low is expected to be very important, no input indicates it directly. As for the intra-mora pitch point position of (e), treating it as just one of the inputs does not allow the coefficients of the input/output function to express the complex pitch frequency behavior at each point position within the mora. As for the phonemes and long vowels of (f) and (g), classifying phonemes into at most six patterns means that a large amount of data falls into each pattern, and the variation (variance) of the pitch frequency within a pattern becomes correspondingly large.

[0039] The second embodiment therefore uses the input format (items and categories) shown below.

[0040] (1) Whether the accent is rising or falling: 2 for rising, 1 for falling.

[0041] (2) The number of moras counted from the mora at which the accent changed. The accent change point is counted as number one. (1 to 11)

(3) The position of the mora counted from the front, with the first mora of the input as number one. (1 to 13)

(4) The position of the mora counted from the back, with the last mora of the input as number one. (1 to 13)

(5) Which of /A/I/U/E/O/A-/I-/U-/E-/O-/ the vowel part of the phoneme signal is, looking at the vowel part only. The values run from 1 to 11 in order from the front.

[0042] (6) Which of the following groups the consonant part of the phoneme symbol belongs to: /B, D, G/ /BY, DY, GY/ /Z, J/ /N, M/ /NY, MY/ /R/ /RY/ /W/ /WY/ /P, T, K, TS, CH/ /PY, KY/ /F, H, S, SH/ /FY, HY/ /no consonant/. The values run from 1 to 14 in order from the top.

[0043] In this input format, (1) directly gives the rise or fall of the accent and is a very important input.

[0044] (2) is the position from the accent change point and indicates how many moras the rise (or fall) of the accent has continued.

[0045] (3) and (4) together indicate the position of the mora within the input as a whole.

[0046] (5) and (6) classify the phoneme symbol finely into its vowel part and consonant part. This reduces the amount of data belonging to any one input pattern and, to some extent, reduces the variation of the pitch frequency.
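The six-item input format of the second embodiment can be sketched as a small encoder. The Python below is illustrative only: the function names and the tuple layout are hypothetical, and the empty string is used here to stand for "no consonant". The consonant groups are numbered 1 to 14 in the order the text lists them. For the vowel item the text states a range of 1 to 11 but lists ten symbols, so this sketch numbers the ten listed symbols 1 to 10 and leaves the eleventh value undefined.

```python
from typing import List, Tuple

# Item (6): consonant groups in the order given in the text, numbered from 1.
CONSONANT_GROUPS: List[Tuple[str, ...]] = [
    ("B", "D", "G"), ("BY", "DY", "GY"), ("Z", "J"), ("N", "M"),
    ("NY", "MY"), ("R",), ("RY",), ("W",), ("WY",),
    ("P", "T", "K", "TS", "CH"), ("PY", "KY"),
    ("F", "H", "S", "SH"), ("FY", "HY"),
    ("",),  # no consonant (assumed spelling: empty string)
]

# Item (5): vowel symbols in the order given in the text; "-" marks a long vowel.
VOWELS: List[str] = ["A", "I", "U", "E", "O", "A-", "I-", "U-", "E-", "O-"]


def consonant_category(consonant: str) -> int:
    """Return the 1-based group number of the consonant part."""
    symbol = consonant.upper()
    for number, members in enumerate(CONSONANT_GROUPS, start=1):
        if symbol in members:
            return number
    raise ValueError(f"unknown consonant symbol: {consonant!r}")


def vowel_category(vowel: str) -> int:
    """Return the 1-based position of the vowel part in the listed order."""
    return VOWELS.index(vowel.upper()) + 1


def encode_mora(
    rising: bool, since_change: int, from_start: int, from_end: int,
    vowel: str, consonant: str,
) -> Tuple[int, int, int, int, int, int]:
    """Encode one mora as the categories of items (1) to (6)."""
    return (
        2 if rising else 1,             # (1) rising = 2, falling = 1
        since_change,                   # (2) moras since the accent change
        from_start,                     # (3) position from the first mora
        from_end,                       # (4) position from the last mora
        vowel_category(vowel),          # (5) vowel part
        consonant_category(consonant),  # (6) consonant part
    )
```

For instance, the mora /KA/ at the start of a three-mora input with a rising accent would be encoded by `encode_mora(True, 1, 1, 3, "A", "K")`.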

[0047] Ideally, exactly one pitch frequency should be learned for each input pattern. To further reduce the variation of the pitch frequency for each input pattern in the learning data, the mean and standard deviation of the pitch frequency were computed for every input pattern, and data whose pitch frequency deviated from the mean by more than the standard deviation was deleted; the remainder was used as the learning data. In this way, a reasonably well-defined pitch frequency is learned for each input pattern.
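The data reduction step can be sketched as follows in Python. This is an interpretation rather than the patent's procedure: the function name is hypothetical, the statistics are assumed to be computed separately for each input pattern, the population standard deviation is assumed, and a datum is kept when it lies within one standard deviation of its pattern's mean.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, Hashable, List, Tuple

Sample = Tuple[Hashable, float]  # (input pattern, pitch frequency in Hz)


def drop_outliers(samples: List[Sample]) -> List[Sample]:
    """Keep samples within one standard deviation of their pattern's mean."""
    by_pattern: Dict[Hashable, List[float]] = defaultdict(list)
    for pattern, pitch_hz in samples:
        by_pattern[pattern].append(pitch_hz)

    # Mean and population standard deviation of the pitch for each pattern.
    stats = {p: (mean(v), pstdev(v)) for p, v in by_pattern.items()}

    return [
        (pattern, pitch_hz)
        for pattern, pitch_hz in samples
        if abs(pitch_hz - stats[pattern][0]) <= stats[pattern][1]
    ]
```

A pattern with a single sample has a standard deviation of zero and is kept unchanged under this rule.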

[0048] In addition, so that the coefficients of the input/output function can express the pitch frequency at each point position within the mora in as much variety as possible, a separate data set is prepared in advance for each intra-mora pitch point position, and learning and estimation are performed on each.

[0049] FIG. 8 is a block diagram showing the flow of the learning process in this embodiment. In the figure, 23 is an input data generation unit that generates input data from the learning data, 24 is a data reduction unit that deletes from the input data the pitch frequency data that deviates from the mean by more than the standard deviation, and 25 is a quantification type I calculation unit that obtains quantity data (category quantities) from the reduced input data and stores it in a database or the like. The loop 26 from the quantification type I calculation unit 25 back to the input data generation unit 23 repeats the process until quantity data has been obtained for every intra-mora pitch point position. Speech synthesis is performed in the same way as in the first embodiment, except that the input format differs as described above.
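The learning flow of FIG. 8, with its loop over intra-mora pitch point positions, can be sketched as a small driver that splits the data by position and runs reduction and fitting once per position. The Python below is a sketch under assumptions: the function name and tuple layout are hypothetical, and the reduction and fitting steps are passed in as callables standing in for units 24 and 25 rather than implemented here.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, List, Sequence, Tuple, TypeVar

Datum = Tuple[Sequence[Hashable], float]  # (item categories, pitch in Hz)
Model = TypeVar("Model")


def train_per_pitch_point(
    samples: Sequence[Tuple[int, Sequence[Hashable], float]],
    reduce_data: Callable[[List[Datum]], List[Datum]],
    fit: Callable[[List[Datum]], Model],
) -> Dict[int, Model]:
    """Fit one model per intra-mora pitch point position.

    `samples` holds (pitch point position, item categories, pitch in Hz).
    """
    # Input data generation: one data set per pitch point position.
    by_point: Dict[int, List[Datum]] = defaultdict(list)
    for point, categories, pitch_hz in samples:
        by_point[point].append((categories, pitch_hz))

    # Loop over positions: data reduction, then quantity calculation.
    return {
        point: fit(reduce_data(data))
        for point, data in sorted(by_point.items())
    }
```

At synthesis time the model stored under a given position would be used to estimate the pitch frequency at that position only.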

[0050] FIG. 9 shows the estimation results for the test data in the second embodiment. The learning data is taken from the same 490 words uttered by one female speaker as in FIG. 7: 4242 data points extracted for a single intra-mora point position and pruned by the standard deviation of the pitch frequency. The test data is 1476 data points extracted for a single intra-mora point position from 100 words. The multiple correlation coefficient improved to 0.90 and the mean relative error to 11.6%.

[0051] Next, a third embodiment of the present invention is described.

[0052] FIG. 10 shows another estimation result obtained with the input format of the second embodiment. Here the quantities (coefficients of the input/output function) were learned from 4242 data points contained in 490 words uttered by one female speaker, and estimation was performed with those quantities on 1048 test data points contained in 100 words uttered by the same speaker. In this case, to examine the performance when one value is estimated for one input pattern, the test data was subjected to the same deletion process as the learning data. The vertical axis of the figure is the measured value and the horizontal axis is the estimated value. The multiple correlation coefficient was 0.93 and the mean relative error was 9.8%.

【００５３】上記第２の実施例では、全カテゴリ１つ１
つに数量を割り当てるため、推定時の入力のモーラ数が
制限される。数量化Ｉ類は、要因アイテム中の１つのカ
テゴリに対して１つの数量を最小２乗法によって決定す
る重回帰分析の一種であるが、カテゴリ数量を計算する
際、カテゴリの組み合わせと出力の関係は考慮されてい
ない。実際に考慮されるのは、ある１つのカテゴリと他
の１つのカテゴリの組み合わせに対する出力であり、３
つ以上のカテゴリの組み合わせのみに出力が大きく依存
している場合、推定精度が悪くなる。数量化Ｉ類の結果
から人間の先験的知識、すなわち、入力の大小による組
み合わせと出力との関係に対応した入出力関数を得るの
が難しく、この意味において結果の解釈が困難である。In the second embodiment, a quantity is assigned to every single category, so the number of moras that can be input at estimation time is limited. Quantification type I is a kind of multiple regression analysis that determines one quantity for each category of a factor item by the least squares method, but the relationship between combinations of categories and the output is not taken into account when the category quantities are calculated. What is actually taken into account is the output for the combination of one category with one other category, so the estimation accuracy deteriorates when the output depends strongly only on combinations of three or more categories. It is also difficult to obtain from the results of quantification type I an input/output function that corresponds to human a priori knowledge, that is, to the relationship between the output and combinations of large and small input values; in this sense the results are hard to interpret.

【００５４】そこで、本実施例において提案するピッチ
周波数制御手法は、あらかじめ音韻環境データを基に、
入出力関係をファジィニューラルネットワーク（以下、
ＦＮＮと記す）の結合係数としてバックプロパゲーショ
ン（ＢＰ）学習によって計算しておき、合成時にはその
係数を用いて任意の日本語の任意のモーラに対するピッ
チ周波数を算出するものである。Therefore, the pitch frequency control method proposed in this embodiment computes the input/output relationship in advance from phoneme environment data, as the coupling coefficients of a fuzzy neural network (hereinafter, FNN) obtained by backpropagation (BP) learning. At synthesis time, these coefficients are used to calculate the pitch frequency for any mora of any Japanese utterance.

【００５５】本実施例における学習の処理の流れを図１
１のブロック図に、推定の処理の流れを図１２のブロッ
ク図に示す。なお、ＦＮＮの構成および学習方法につい
ては論文「ファジィニューラルネットワークの構成法と
学習法」（堀川慎一、古橋武、内藤嘉樹；日本ファジィ
学会誌、Ｖｏｌ．４，Ｎｏ．５，ｐｐ９０６−９２８
（１９８２））等に詳しく述べられている。The flow of the learning process in this embodiment is shown in the block diagram of FIG. 11, and the flow of the estimation process in the block diagram of FIG. 12. The configuration and learning method of the FNN are described in detail in, for example, the paper "Construction method and learning method of fuzzy neural network" (Shinichi Horikawa, Takeshi Furuhashi, Yoshiki Naito; Journal of Japan Fuzzy Society, Vol. 4, No. 5, pp. 906-928 (1982)).

【００５６】図１１において、２７は学習データから入
力データを生成する入力データ生成部、２８は入力デー
タから係数データを求めてデータベースへ格納するＦＮ
Ｎ学習部である。In FIG. 11, 27 is an input data generation unit that generates input data from the learning data, and 28 is an FNN learning unit that obtains coefficient data from the input data and stores them in a database.

【００５７】また、図１２において、３４はテストデー
タから入力データを生成する入力データ生成部、３５は
データベースの上記係数データによりピッチ周波数を求
めるＦＮＮピッチ周波数推定部、３６はその得られたピ
ッチ周波数により音声合成をする音声合成部である。In FIG. 12, 34 is an input data generation unit that generates input data from the test data, 35 is an FNN pitch frequency estimation unit that obtains the pitch frequency using the coefficient data in the database, and 36 is a speech synthesis unit that synthesizes speech using the pitch frequency thus obtained.

【００５８】図１３に本実施例で採用するＦＮＮの構成
を示す。本実施例のＦＮＮは、（Ａ）層〜（Ｅ）層から
なるファジィルールの前件部と、（Ｆ）層〜（Ｌ）層か
らなるファジィルールの後件部とで構成される。これは
前述の参考文献中のタイプIIのＦＮＮにおいて前件部変
数と後件部変数が異なる場合に相当する。ここでタイプ
IIとは、ファジィルールの後件部が入力変数の一次関数
で表わされるものを言う。（Ｂ），（Ｊ），（Ｌ）層の
Σは入力の和を、（Ｋ）層のΠは入力の積を出力する。
（Ｅ）層のΠ＾は入力の積を（Ｅ）層への入力の総和で
割った値を出力する。（Ｃ）層のｆはシグモイド関数か
らの出力をする。□１は１を出力するニューロンであ
る。このＦＮＮが実現する機能は、ファジィルールの後
件部が一次関数で表されるファジィ推論である。学習
は、バックプロパゲイション（以下、ＢＰ）法を用いて
結合係数Ｗ_c，Ｗ_g，Ｗ_aを変化させることによって行
う。ここでＷ_c，Ｗ_gは前件部メンバシップ関数の係数、
Ｗ_aは後件部一次関数の係数に対応する。FIG. 13 shows the configuration of the FNN used in this embodiment. The FNN consists of the antecedent part of the fuzzy rules, formed by layers (A) to (E), and the consequent part, formed by layers (F) to (L). This corresponds to the Type II FNN of the reference cited above in the case where the antecedent variables and the consequent variables differ. Type II here means that the consequent part of each fuzzy rule is expressed as a linear function of the input variables. Σ in layers (B), (J), and (L) outputs the sum of its inputs, and Π in layer (K) outputs the product of its inputs. Π^ in layer (E) outputs the product of its inputs divided by the sum of the inputs to layer (E). f in layer (C) outputs the value of a sigmoid function. □1 is a neuron that outputs 1. The function realized by this FNN is fuzzy inference in which the consequent part of each fuzzy rule is a linear function. Learning is performed by changing the coupling coefficients W_c, W_g, and W_a by the backpropagation (hereinafter, BP) method. Here W_c and W_g correspond to the coefficients of the antecedent membership functions, and W_a to the coefficients of the consequent linear functions.

【００５９】ファジィ推論は、人間の判断のようなあい
まいさを含むアルゴリズムをｉｆ−ｔｈｅｎ型のファジ
ィルールにより言語的に記述できるという特徴を持つ。
しかし、ファジィ制御器の設計例に見られるように、フ
ァジィ推論におけるファジィルールやメンバーシップ関
数等の同定もしくは調整には、一般に多大な労力を必要
とする。一方、人間の知識を扱うのに適したもう一つの
手法としてニューラルネットワーク（以下、ＮＮと記
す）がある。ＮＮの特徴は、学習機能により任意の入出
力関係を同定できることにあるが、その入出力関係に関
する知識はネットワーク内の結合荷重に分散して記憶さ
れるため、この知識を定性的に把握することは難しい。
そこで、ファジィ推論とＮＮを融合し、両者の特徴を兼
ね備えたものとして提案されたものがＦＮＮである。Ｆ
ＮＮは、ファジィ推論に適した構造を持ち、ＮＮの学習
法であるバックプロパゲーション（ＢＰ）法の適用を可
能としながら、知識を言語的に記述できるというファジ
ィ推論の特徴と知識を自動的に学習できるというＮＮの
特徴を併せ持つ。Fuzzy inference has the feature that an algorithm involving ambiguity, such as human judgment, can be described linguistically by if-then fuzzy rules. However, as seen in design examples of fuzzy controllers, identifying or tuning the fuzzy rules and membership functions used in fuzzy inference generally requires a great deal of effort. Another technique suited to handling human knowledge is the neural network (hereinafter, NN). An NN can identify an arbitrary input/output relationship through its learning function, but the knowledge about that relationship is stored in distributed form in the connection weights of the network, so it is difficult to grasp this knowledge qualitatively. The FNN was proposed as a fusion of fuzzy inference and the NN that has the features of both. The FNN has a structure suited to fuzzy inference and allows the backpropagation (BP) method, the learning method of the NN, to be applied. It thus combines the feature of fuzzy inference that knowledge can be described linguistically with the feature of the NN that knowledge can be learned automatically.

【００６０】このようなＦＮＮの構成は、ファジィ推論
による推論値の計算過程をＮＮ（ＢＰモデル）の構造で
実現し、ファジィ推論において同定または調整すべきパ
ラメータをＮＮの結合荷重に対応づけたものである。こ
れらの結合荷重をＢＰ法を用いて更新することによりフ
ァジィルールの同定およびメンバシップ関数の調整を自
動的に行うことができ、また、学習結果をファジィルー
ルとして把握することも可能である。Such an FNN realizes the process of calculating an inference value by fuzzy inference with the structure of an NN (BP model), and maps the parameters to be identified or tuned in fuzzy inference onto the connection weights of the NN. By updating these connection weights with the BP method, the fuzzy rules can be identified and the membership functions tuned automatically, and the learning result can also be read out as fuzzy rules.

【００６１】一般にファジィ推論による推論値は、与え
られた入力値から、
（ア）前件部メンバシップ関数値
（イ）各ファジィルールの前件部適合度
（ウ）各ファジィルールの後件部推定値
を順次求め、（イ），（ウ）の結果を統合することにより求められる。In general, the inference value of fuzzy inference is obtained from the given input values by computing, in order, (a) the membership function values of the antecedent part, (b) the degree of fitness of the antecedent part of each fuzzy rule, and (c) the estimated value of the consequent part of each fuzzy rule, and then integrating the results of (b) and (c).
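
Steps (a) to (c) and their integration can be sketched as follows for fuzzy rules whose consequent part is a linear function of the inputs. This is an illustrative sketch under stated assumptions, not the network of FIG. 13: the sigmoid membership functions parameterized by a center and a gradient, the product-based fitness, the weighted-average integration, and all numeric values are assumptions chosen for the example, and the same inputs are used for the antecedent and consequent parts for simplicity.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def memberships(x, center, gradient):
    """(a) Grades of the labels Small and Big for one input; they sum to 1."""
    big = sigmoid(gradient * (x - center))
    return {"Small": 1.0 - big, "Big": big}


def infer(inputs, mf_params, rules):
    """Fuzzy inference with linear consequents (hypothetical helper)."""
    grades = [memberships(x, c, g) for x, (c, g) in zip(inputs, mf_params)]

    # (b) Fitness of each rule: product of the grades named in its antecedent.
    fitness = {}
    for labels in rules:
        w = 1.0
        for grade, label in zip(grades, labels):
            w *= grade[label]
        fitness[labels] = w
    total = sum(fitness.values())

    # (c) Consequent estimate of each rule, integrated with the normalized fitness.
    output = 0.0
    for labels, (bias, slopes) in rules.items():
        estimate = bias + sum(s * x for s, x in zip(slopes, inputs))
        output += (fitness[labels] / total) * estimate
    return output


# Hypothetical membership parameters (center, gradient) for two inputs.
mf_params = [(5.0, 1.0), (5.0, 1.0)]
# Hypothetical rules: antecedent labels -> (constant term, slope per input).
rules = {
    ("Small", "Small"): (100.0, (2.0, 0.0)),
    ("Small", "Big"): (120.0, (0.0, 0.0)),
    ("Big", "Small"): (150.0, (0.0, 0.0)),
    ("Big", "Big"): (170.0, (0.0, 0.0)),
}

pitch = infer([5.0, 5.0], mf_params, rules)
print(pitch)
```

In a sketch of this kind, the membership parameters and the consequent coefficients are the quantities that BP learning would adjust.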

【００６２】本実施例では、ファジィ推論の入出力を実
現する後件部が一次関数のＦＮＮにおいて、ファジィ推
論の前件部メンバシップ関数係数と後件部一次関数係数
をバックプロパゲーション（ＢＰ）学習で決定する手法
である。この手法は、ＢＰ学習をする際、後件部一次関
数の初期値を０にして学習を行う。ここで入力（１）〜
入力（６）は、前述の第２の実施例における数量化Ｉ類
の入力形式（１）〜（６）に対応するが、ただし、音韻
記号の入力（５）と入力（６）に関しては、１１＋１４
＝２５個の０または１の値の入力とした。前件部メンバ
シップ関数は、入力変数に対して２種類とし、それらの
ラベルをＳｍａｌｌ、Ｂｉｇとした。This embodiment determines the antecedent membership function coefficients and the consequent linear function coefficients of fuzzy inference by backpropagation (BP) learning, in an FNN that realizes the input/output of fuzzy inference and whose consequent part is a linear function. In BP learning, the initial values of the consequent linear function coefficients are set to 0. Inputs (1) to (6) correspond to input forms (1) to (6) of quantification type I in the second embodiment, except that inputs (5) and (6), the phonological symbols, are given as 11 + 14 = 25 inputs each taking the value 0 or 1. Two antecedent membership functions are used for each input variable, labeled Small and Big.

【００６３】図１４にＦＮＮによるテストデータの推定
結果を示す。学習率Ｗ_c，Ｗ_g，Ｗ_a（定数項），Ｗ_a（変
数項）の値はそれぞれ０．０００００５，０．００００
１，０．１，０．０５に定めて学習を行った。使用した
データは、学習データ、テストデータ共、数量化Ｉ類で
の推定で使用したものと同じデータである。すなわち、
女性話者１名の発声した４９０単語に含まれる４２４２
点のデータで学習を行い、ＦＮＮの各係数を求め、その
結果を用いて、同話者の発声による１００単語中に含ま
れる１０４８点のテスト用データで推定を行ったもので
ある。学習は、学習用データを順次ネットワークに入力
することによって行い、データの一巡をもって学習回数
１回とした。結果は、８６５０回学習させた時のもの
で、このとき重相関係数は０．９７、平均相対誤差は
８．４１％に向上した。FIG. 14 shows the estimation results for the test data obtained with the FNN. The learning rates for W_c, W_g, W_a (constant term), and W_a (variable terms) were set to 0.000005, 0.00001, 0.1, and 0.05, respectively. Both the learning data and the test data are the same as those used for estimation with quantification type I. That is, the FNN coefficients were learned from 4242 data points contained in 490 words uttered by one female speaker, and the result was used to estimate 1048 test data points contained in 100 words uttered by the same speaker. The learning data were fed to the network sequentially, and one pass through the data was counted as one learning iteration. The results shown were obtained after 8650 iterations; the multiple correlation coefficient improved to 0.97 and the mean relative error to 8.41%.

【００６４】これは、１入力データ中の全てのカテゴリ
の組み合わせと出力との関係を、ＦＮＮの係数として実
現させるために、ＢＰ学習を行ったことが理由である。
すなわち、ＢＰ学習によって非線形な入出力の学習がな
されたわけである。This improvement is because BP learning was used so that the relationship between the output and the combination of all categories in one input datum is realized in the FNN coefficients. In other words, BP learning learned a nonlinear input/output relationship.

【００６５】また、学習によって形成された、後件部一
次関数は、表２に示したように、入力の大小によって分
類され、人間の先験的知識に対応した入出力関数となっ
ている。ここで、ｘ₁〜ｘ₄はそれぞれ入力（１）〜入力
（４）に対応しており、表２の後件部に対する入力ｘ₁
〜ｘ₄の値は、図１３のＷ_sをそれぞれ乗じたもので、４
つの入力に対応するＷ_s値はそれぞれ、０．５０，０．
０９，０．０８，０．０８である。As shown in Table 2, the consequent linear functions formed by learning are classified according to whether the inputs are large or small, and constitute input/output functions that correspond to human a priori knowledge. Here x₁ to x₄ correspond to inputs (1) to (4), and the values of x₁ to x₄ fed to the consequent part in Table 2 are the inputs multiplied by the respective W_s of FIG. 13. The W_s values for the four inputs are 0.50, 0.09, 0.08, and 0.08.

【００６６】また、入力のモーラ数は、前件部メンバシ
ップ関数の入力であり、数量化Ｉ類のようにその値が制
限されることはない。The input number of moras is an input of the antecedent part membership function, and its value is not limited unlike the quantification type I.

【００６７】[0067]

【表２】 [Table 2]

【００６８】次に、本発明の第４の実施例を説明する。Next, a fourth embodiment of the present invention will be described.

【００６９】上記第３の実施例では、ＦＮＮ後件部一次
関数係数の初期値を０にして学習をするため、学習デー
タに対しては誤差が小さい値に収束しても、未学習デー
タ（テストデータ）に対しては全く関係の無いローカル
ミニマムである可能性がある。すなわち、未学習データ
に大きく貢献するはずの係数に対して全く学習されない
か、又は、他の係数の学習に引きずられて、でたらめな
値になってしまうことがある。そのため、図１４中に
も、実測値と推定値の値が大きく離れている点がいくつ
か見られる。In the third embodiment, learning starts with the initial values of the FNN consequent linear function coefficients set to 0. Even if the error on the learning data converges to a small value, the solution may therefore be a local minimum that is entirely unrelated to the unlearned data (test data). That is, coefficients that should contribute strongly to unlearned data may not be learned at all, or may be dragged along by the learning of other coefficients and end up with arbitrary values. For this reason, FIG. 14 also contains several points at which the measured and estimated values differ greatly.

【００７０】そこで、本実施例においては、未学習デー
タに対しても推定の精度を向上させるために、ＦＮＮ後
件部一次関数係数の初期値を、数量化Ｉ類によって決定
する手法を提案する。この手法は、（１）学習用入力デ
ータをＦＮＮメンバシップ関数に対応させて構築しなお
し、（２）その再構築したデータを数量化Ｉ類で処理
し、入出力関数の係数を計算して、（３）その係数をＦ
ＮＮ後件部一次関数係数の初期値にする。Therefore, to improve estimation accuracy for unlearned data as well, this embodiment proposes determining the initial values of the FNN consequent linear function coefficients by quantification type I. The method (1) reconstructs the learning input data so that they correspond to the FNN membership functions, (2) processes the reconstructed data by quantification type I to calculate the coefficients of the input/output function, and (3) uses those coefficients as the initial values of the FNN consequent linear function coefficients.

【００７１】というものである。These three steps constitute the proposed method.

【００７２】簡単な例として、ＦＮＮの入力が前件部、
後件部とも２入力で、１つの入力に対するメンバシップ
関数が２個の場合を述べる。二つの入力Ａ，Ｂが、例え
ば１〜１０の値をとるとすると、各々の入力の値をそれ
ぞれ２つのメンバシップ関数に対応させるために、ある
しきい値（例えば５など）で区切って、５より小さいな
ら１に、５以上なら２の値に設定しなおす。出力はその
ままの値を用いる。そして作りなおしたデータを数量化
Ｉ類で処理し、入出力関数の係数（数量）を得る。得ら
れた係数は、入力Ａの１の値に対する数量化Ｉ類の係数
では、ＦＮＮで入力Ａの値がＳｍａｌｌのときのＡの係
数の初期値とし、入力Ａの２の値に対する数量化Ｉ類の
係数では、ＦＮＮで入力Ａの値がＢｉｇのときのＡの係
数の初期値とする。入力Ｂに対しても同様である。As a simple example, consider an FNN with two inputs to both the antecedent and consequent parts and two membership functions per input. Suppose the two inputs A and B take values from, say, 1 to 10. To make each input value correspond to one of the two membership functions, the values are split at a threshold (for example, 5): a value below 5 is reset to 1 and a value of 5 or more to 2. The output values are used as they are. The recoded data are then processed by quantification type I to obtain the coefficients (quantities) of the input/output function. The quantification type I coefficient for value 1 of input A becomes the initial value of the FNN coefficient of A when input A is Small, and the coefficient for value 2 of input A becomes the initial value of the FNN coefficient of A when input A is Big. Input B is handled in the same way.
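
The two-input example above can be sketched in code as follows. This is a minimal illustration under stated assumptions, not the implementation used in the embodiment: quantification type I is realized here as ordinary least squares on dummy-coded categories with category 1 of each item taken as the reference (quantity 0), the small linear solver is written out only to keep the sketch self-contained, and the sample data and all function names are hypothetical.

```python
def recode(value, threshold=5):
    """Map a raw input to category 1 (below the threshold) or 2 (at or above it)."""
    return 1 if value < threshold else 2


def solve(matrix, rhs):
    """Solve a small linear system by Gauss-Jordan elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * p for a, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def category_quantities(samples):
    """Least-squares fit of y = const + qA[category of A] + qB[category of B]."""
    rows = [[1.0, float(recode(a) == 2), float(recode(b) == 2)]
            for a, b, _ in samples]
    ys = [y for _, _, y in samples]
    # Normal equations (X^T X) q = X^T y for the dummy-coded design matrix.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    const, a2, b2 = solve(xtx, xty)
    return {"const": const, "A": {1: 0.0, 2: a2}, "B": {1: 0.0, 2: b2}}


# Hypothetical learning data: (input A, input B, output), inputs in the range 1 to 10.
samples = [(2, 3, 100.0), (2, 8, 120.0), (7, 3, 150.0), (7, 8, 170.0)]
q = category_quantities(samples)

# Category 1 seeds the coefficient used when the input is Small, category 2 when Big.
initial = {
    ("A", "Small"): q["A"][1], ("A", "Big"): q["A"][2],
    ("B", "Small"): q["B"][1], ("B", "Big"): q["B"][2],
}
print(q["const"], initial)
```

Taking category 1 as the reference is only one possible normalization of the category quantities; a different convention shifts the constant term but leaves the fitted outputs unchanged.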

【００７３】以上の処理を行い、先のピッチ周波数制御
に適用したときの推定結果を図１５に示す。ただしこの
場合、音韻記号に対する入力（５），（６）は、前件部
に入力しないので、メンバシップ関数に合わせて２値化
する必要はない。従ってそのままの値で数量化Ｉ類の処
理を行い、それぞれの音韻に関する数量化Ｉ類の係数
を、それぞれ対応するＦＮＮ後件部一次関数の係数とす
ればよい。ここで用いたデータは、学習データ、テスト
データとも、前述で用いたものと同じである。推定結果
の図を比較しても分かるとおり、誤差の大きかった点が
無くなって、未学習データに関して推定精度が向上した
ことが分かる。このとき、テストデータに対する平均相
対誤差は６．８４％に向上した。FIG. 15 shows the estimation results when the above processing is applied to the pitch frequency control described earlier. In this case, inputs (5) and (6) for the phonological symbols are not fed to the antecedent part, so they need not be binarized to match the membership functions. They are therefore processed by quantification type I with their values unchanged, and the quantification type I coefficient for each phoneme is used as the coefficient of the corresponding FNN consequent linear function. The learning data and test data are the same as those used above. As a comparison of the estimation result figures shows, the points with large errors have disappeared and the estimation accuracy for unlearned data has improved. The mean relative error on the test data improved to 6.84%.

【００７４】[0074]

【発明の効果】以上の説明で明らかなように、本発明の
数量化Ｉ類を適用した第１例の韻律制御方式によれば、（１）逐次的な手法により１モーラ内のある位置の基本
周波数を求めることができ、当該モーラが持っているピ
ッチパターンを細かく制御することができる。As is clear from the above description, the prosody control method of the first example of the present invention, which applies quantification type I, provides the following effects. (1) The fundamental frequency at a given position within one mora can be obtained by a sequential procedure, so the pitch pattern of that mora can be finely controlled.

【００７５】（２）少ないアイテム（音韻環境）、カテ
ゴリ（各アイテムの分類）による多変量解析の数量化Ｉ
類が実行でき、偏相関係数が高く安定した高品質な音声
合成韻律制御が実現できる。(2) Quantification type I of multivariate analysis can be carried out with a small number of items (phoneme environment) and categories (classifications of each item), and stable, high-quality prosody control for speech synthesis with high partial correlation coefficients can be realized.

【００７６】（３）入力データが当該音韻、先行音韻、
後続音韻で記述できるため従来のアクセント句などの音
韻記号列では実現できない音韻の影響を考慮した人間に
近い制御が実現できる。(3) Because the input data can be described by the phoneme concerned, the preceding phoneme, and the succeeding phoneme, control close to that of a human speaker can be realized, taking into account phoneme influences that cannot be captured by conventional phoneme symbol strings such as accent phrases.

【００７７】また、本発明の数量化Ｉ類の入力形式を変
えた第２例の韻律制御方式によれば、次のような効果が
得られる。Further, according to the prosody control method of the second example in which the input format of the quantification type I of the present invention is changed, the following effects can be obtained.

【００７８】（１）入力形式を定めて学習させたことに
より予測精度が向上できる。(1) Prediction accuracy can be improved by determining and learning the input format.

【００７９】（２）１つのモーラ内点位置に１セットの
データを作成し、学習させたことにより予測精度が向上
できる。(2) The prediction accuracy can be improved by creating and learning one set of data at one mora inner point position.

【００８０】（３）学習データの１入力パターンに対し
てある程度決まった値のピッチ周波数を学習させたこと
により予測精度が向上できる。(3) The prediction accuracy can be improved by learning the pitch frequency of a certain value for one input pattern of the learning data.

【００８１】また、本発明のファジィニューラルネット
ワーク（ＦＮＮ）を用いた第３例の韻律制御方式によれ
ば、次のような効果が得られる。According to the prosody control system of the third example using the fuzzy neural network (FNN) of the present invention, the following effects can be obtained.

【００８２】（１）推定時のモーラ数はＦＮＮメンバシ
ップ関数への入力であるため、制限されない。(1) The number of moras at the time of estimation is an input to the FNN membership function and is not limited.

【００８３】（２）非線形な入出力の学習をＢＰ法で行
えるため、１入力データ中の全てのカテゴリの組み合わ
せと出力との関係を考慮した入出力システムを構成でき
る。従って推定精度が向上する。(2) Since the nonlinear input / output learning can be performed by the BP method, an input / output system can be constructed in consideration of the relationship between the combinations of all the categories in one input data and the output. Therefore, the estimation accuracy is improved.

【００８４】（３）人間の先験的知識に対応した入出力
関数を得ることができる。(3) It is possible to obtain an input / output function corresponding to human a priori knowledge.

【００８５】さらに、上記のＦＮＮを用いた韻律制御方
式において、ＦＮＮ後件部一次関数の初期値を、数量化
Ｉ類の手法を用いて決定した場合には、より一層推定精
度の向上が図られる。Furthermore, in the prosody control method using the FNN described above, the estimation accuracy is improved still further when the initial values of the FNN consequent linear function are determined by the method of quantification type I.

[Brief description of drawings]

【図１】本発明の第１の実施例を実現するブロック構成
図FIG. 1 is a block configuration diagram for realizing a first embodiment of the present invention.

【図２】上記第１の実施例における分析部のブロック構
成図FIG. 2 is a block configuration diagram of an analysis unit in the first embodiment.

【図３】本発明の第１の実施例を示す分析部の制御フロ
ーを示す図FIG. 3 is a diagram showing a control flow of an analysis unit showing the first embodiment of the present invention.

【図４】上記第１の実施例における合成部ブロック構成
図FIG. 4 is a block diagram of a combining unit in the first embodiment.

【図５】上記第１の実施例における合成部の制御フロー
を示す図FIG. 5 is a diagram showing a control flow of a combining unit in the first embodiment.

【図６】上記第１の実施例による実験結果を示す図FIG. 6 is a diagram showing an experimental result according to the first embodiment.

【図７】上記第１の実施例による別の実験結果を示す図FIG. 7 is a diagram showing another experiment result according to the first embodiment.

【図８】本発明の第２の実施例の処理の流れを示すブロ
ック図FIG. 8 is a block diagram showing the flow of processing according to the second embodiment of the present invention.

【図９】上記第２の実施例による実験結果を示す図FIG. 9 is a diagram showing experimental results according to the second embodiment.

【図１０】上記第２の実施例による別の実験結果を示す
図FIG. 10 is a diagram showing another experimental result according to the second embodiment.

【図１１】本発明の第３の実施例の学習の処理の流れを
示すブロック図FIG. 11 is a block diagram showing the flow of learning processing according to the third embodiment of this invention.

【図１２】上記第３の実施例の推定の処理の流れを示す
ブロック図FIG. 12 is a block diagram showing the flow of an estimation process according to the third embodiment.

【図１３】上記第３の実施例で用いるＦＮＮの構成図FIG. 13 is a block diagram of an FNN used in the third embodiment.

【図１４】上記第３の実施例による実験結果を示す図FIG. 14 is a diagram showing an experimental result according to the third embodiment.

【図１５】本発明の第４の実施例による実験結果を示す
図FIG. 15 is a diagram showing experimental results according to the fourth embodiment of the present invention.

[Explanation of symbols]

１…音韻データベース
２…分析部
２１，２３，２７…入力データ生成部
２２，２５…数量化Ｉ類計算部
２４…データ削減部
２８…ＦＮＮ学習部
３…合成部
３１，３４…入力データ生成部
３２…ピッチ周波数推定部
３３，３６…音声合成部
３５…ＦＮＮピッチ周波数推定部

1 ... Phonological database
2 ... Analysis unit
21, 23, 27 ... Input data generation unit
22, 25 ... Quantification type I calculation unit
24 ... Data reduction unit
28 ... FNN learning unit
3 ... Synthesis unit
31, 34 ... Input data generation unit
32 ... Pitch frequency estimation unit
33, 36 ... Speech synthesis unit
35 ... FNN pitch frequency estimation unit

フロントページの続き (72)発明者 柏木繁 東京都品川区大崎２丁目１番17号 株式会社明電舎内 Continuation of front page (72) Inventor: Shigeru Kashiwagi, c/o Meidensha Corporation, 2-1-17 Osaki, Shinagawa-ku, Tokyo

Claims

[Claims]

1. A prosody control method characterized by: using, as analysis learning means, means to which quantification type I of multivariate analysis is applied; taking as items one or more of the accent pattern described by the preceding mora, the mora concerned, and the succeeding mora, the mora position counted from the beginning, the mora position counted from the end, the pitch point position within the mora concerned, a long-sound flag, and a consonant flag; calculating from learning input data, for each category of the items, the category quantity that is a coefficient of the quantification equation which quantifies the fundamental frequency of each pitch point of a mora with the items and their categories as variables; using, as fundamental frequency estimation means, calculation means that finds the category of each item for each mora of the phoneme symbol string to be synthesized and sequentially calculates the fundamental frequency of each pitch point of that mora from the quantification equation using the category quantities calculated for those categories; and using the calculated fundamental frequency for speech synthesis.

2. A prosody control method characterized by: using, as analysis learning means, means to which quantification type I of multivariate analysis is applied; taking as items the position relative to the mora position at which the accent rises or falls, that is, where the accent changes, the order of the mora counted from the front with the first mora of the input as the first, the order of the mora counted from the back with the last mora of the input as the first, the attribute of the phonological symbol when only its vowel part is considered, and the attribute of the phonological symbol when its consonant part is considered; calculating from learning input data, for each category of the items, the category quantity that is a coefficient of the quantification equation which quantifies the fundamental frequency of each pitch point of a mora with the items and their categories as variables; using, as fundamental frequency estimation means, calculation means that finds the category of each item for each mora of the phonological symbol string to be synthesized and sequentially calculates the fundamental frequency of each pitch point of that mora; and using the calculated fundamental frequency for speech synthesis.

3. A prosody control method characterized in that a fuzzy neural network having an antecedent part and a consequent part is used as learning means to obtain coefficient data by learning from learning input data, the fuzzy neural network in which the coefficient data obtained by the learning are set is used as pitch frequency estimation means, and the estimated pitch frequency is used for speech synthesis.

4. The prosody control method according to claim 3, wherein the learning means reconstructs the learning input data in correspondence with the membership functions of the antecedent part of the fuzzy neural network, processes the reconstructed data by quantification type I to calculate the coefficients of an input/output function, and sets those coefficients as the initial values of the coefficients of the linear function of the consequent part.