JPH03125200A - Voice synthesizing method

Info

Publication number: JPH03125200A
Application number: JP1263619A
Authority: JP
Inventors: Tatsuo Matsuoka; 達雄松岡; Ryohei Nakatsu; 良平中津; Shinya Nakajima; 信弥中嶌
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1989-10-09
Filing date: 1989-10-09
Publication date: 1991-05-28

Abstract

PURPOSE: To synthesize speech with fluctuations close to those of natural speech by converting a phoneme symbol sequence into a parameter matrix and then converting that parameter matrix into a parameter matrix of speech with fluctuations close to those of natural speech. CONSTITUTION: The conversion into the parameter matrix is performed by a rule synthesis method that uses, as synthesis units, parameter matrices averaged for each phoneme environment. The time axis of the converted parameter matrix 1 is matched, phoneme by phoneme, to that of the parameter matrix 2 of natural training speech corresponding to a predetermined phoneme sequence, by linear expansion or contraction along the time axis, to give the input signal 3 to a neural network. The outputs of the units in the input layer 4 are multiplied by their coupling coefficients 5 and summed over all units of the input layer 4 to give the input to each unit of the intermediate layer 6. The outputs of the units in the intermediate layer 6 are multiplied by their coupling coefficients 7 and summed to give the input to the output layer 8. Consequently, speech with fluctuations close to those of natural speech can be synthesized.

Description

DETAILED DESCRIPTION OF THE INVENTION [Industrial Application Field] The present invention relates to a speech synthesis method that turns synthesized speech obtained by a rule synthesis method into highly natural synthesized speech.

[Prior Art] In rule-based speech synthesis, it is important to reflect articulatory effects that arise from the phonetic environment and similar factors. Proposed approaches include methods that extract synthesis units from a relatively wide phonetic environment, such as CVC units or three-phoneme sequences; methods that transform synthesis units when they are concatenated, such as transformation rules and the spectrum activation control method; and methods that generate synthesis units reflecting coarticulation effects through clustering based on the phonetic environment.

The methods that extract synthesis units from a relatively wide phonetic environment, such as CVC units or three-phoneme sequences, and the methods that transform synthesis units at concatenation both determine the range of coarticulation and the phonetic environment from a priori knowledge. Creating the synthesis units therefore takes a great deal of labor, and the system lacks flexibility when, for example, the speaker or the number of synthesis units is changed.

In contrast, the clustering method based on the phonetic environment generates synthesis units automatically, by statistical means, from phoneme-labeled training data alone. It needs no a priori knowledge, and the labor of creating synthesis units is reduced. However, because a parameter matrix averaged for each phonetic environment is used as the synthesis unit, the fluctuations of speech (subtle variations in intensity and pitch within each phoneme) disappear, and naturalness is insufficient.

Specifically, in this clustering method, the set of parameter matrices (parameter time series) that carry the same phoneme label in the training data is first taken as an initial cluster. The cluster is then split according to the preceding and following phonemes of each parameter matrix in it, and splitting continues until an evaluation value (such as the mean within-cluster variance) falls to or below a fixed threshold. The centroid matrix of each resulting cluster is registered as a synthesis unit. To obtain the centroid and variance of a cluster made up of parameter matrices of different lengths, a fixed-dimension method is used: each parameter matrix is linearly stretched or compressed along the time axis to a fixed dimension, and the centroid is computed on the result. Each centroid matrix is therefore a parameter matrix averaged over a phonetic environment, and it has the average duration of its cluster. As a result, in portions such as the steady-state part of a vowel, the sound has the fluctuation-free, vocoder-like quality characteristic of synthesized speech and lacks naturalness. Phoneme duration and prosody also needed to be made more natural.
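The fixed-dimension centroid computation described above can be sketched as follows. This is a minimal illustration, not the procedure of the cited prior art: the function names `stretch_linear` and `cluster_centroid`, the list-of-frames representation, and the default of 10 fixed frames are all assumptions introduced here.

```python
def stretch_linear(matrix, target_frames):
    """Linearly stretch or compress a parameter matrix along the time axis.

    matrix is a list of frames; each frame is a list of parameter values.
    """
    n = len(matrix)
    if n == 1 or target_frames == 1:
        return [list(matrix[0]) for _ in range(target_frames)]
    out = []
    for t in range(target_frames):
        pos = t * (n - 1) / (target_frames - 1)  # fractional source frame index
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append([(1 - frac) * a + frac * b
                    for a, b in zip(matrix[lo], matrix[hi])])
    return out


def cluster_centroid(matrices, fixed_frames=10):
    """Average variable-length matrices after mapping them to a fixed length.

    The centroid is returned at the cluster's mean duration, as the text
    says each centroid matrix had the average duration of its cluster.
    """
    stretched = [stretch_linear(m, fixed_frames) for m in matrices]
    count = len(stretched)
    order = len(stretched[0][0])
    centroid = [[sum(m[t][d] for m in stretched) / count for d in range(order)]
                for t in range(fixed_frames)]
    mean_frames = round(sum(len(m) for m in matrices) / count)
    return stretch_linear(centroid, mean_frames)
```

Because every member of the cluster is averaged frame by frame, any frame-to-frame variation that differs between members cancels out, which is the loss of fluctuation the text attributes to this method.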

[Means for Solving the Problems] In this invention, a neural network learns the mapping from parameter matrices averaged for each phonetic environment to natural speech. A phonetic symbol string input for synthesis is converted into a parameter matrix averaged for each phonetic environment, that parameter matrix is input to the trained neural network, and a speech parameter matrix with fluctuations close to those of natural speech is obtained as the output of the neural network. The invention differs from conventional methods in that the synthesis units are not uniformly averaged but carry fluctuations, which improves naturalness.

[Embodiment] FIG. 1 is an explanatory diagram of the present invention. It shows the case where the phonetic symbol string /b, a, k, u, o, n/ is input as the predetermined phonetic symbol string.

First, this predetermined phonetic symbol string is converted into an LSP parameter matrix by a rule synthesis method that uses, as synthesis units, parameter matrices averaged for each phonetic environment. The converted LSP parameter matrix 1 is then linearly stretched or compressed along the time axis so that, phoneme by phoneme, its time axis matches that of the LSP parameter matrix 2 of the natural training speech corresponding to the same predetermined phonetic symbol string. The result is the input signal 3 to the neural network. The LSP analysis order is 16 and the analysis window length is 20 ms.
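The phoneme-by-phoneme time alignment can be sketched as below. The data layout is an assumption made for illustration: the synthetic matrix is given as one list of frames per phoneme, and the natural speech is represented only by its per-phoneme frame counts. The names `stretch_linear` and `align_to_natural` are hypothetical.

```python
def stretch_linear(segment, target_frames):
    """Linear interpolation of a list of frames to target_frames frames."""
    n = len(segment)
    if n == 1 or target_frames == 1:
        return [list(segment[0]) for _ in range(target_frames)]
    out = []
    for t in range(target_frames):
        pos = t * (n - 1) / (target_frames - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append([(1 - frac) * a + frac * b
                    for a, b in zip(segment[lo], segment[hi])])
    return out


def align_to_natural(synthetic_segments, natural_frame_counts):
    """Stretch each synthetic phoneme segment to the natural duration.

    synthetic_segments: one list of frames per phoneme (matrix 1).
    natural_frame_counts: frames per phoneme in the natural speech (matrix 2).
    Returns the concatenated, time-aligned matrix (input signal 3).
    """
    if len(synthetic_segments) != len(natural_frame_counts):
        raise ValueError("phoneme counts differ")
    aligned = []
    for segment, target in zip(synthetic_segments, natural_frame_counts):
        aligned.extend(stretch_linear(segment, target))
    return aligned
```

After this step the aligned matrix and the natural matrix have the same number of frames, and frame t of one corresponds to the same phoneme as frame t of the other.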

The frame shift is 5 ms. The input layer 4 of the neural network has 480 units, corresponding to 30 frames of 16th-order parameters. Parameters are fed to it while the 30-frame window is slid by a suitable number of frames at a time. The teacher signal for the output layer 8 of the neural network is the parameter matrix 2 of the natural training speech, time-synchronized with the input parameter matrix (that is, corresponding to the same phonemes), and it is slid in the same way as the input signal. The output layer 8 therefore has the same number of units as the input layer 4, namely 480.
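A minimal sketch of how the 480-value input and teacher vectors could be cut from the two time-aligned matrices is shown below. The function names and the default slide step of 5 frames are assumptions; the text leaves the step to preliminary study.

```python
FRAMES_PER_WINDOW = 30   # frames seen by the network at once
LSP_ORDER = 16           # parameters per frame; 30 * 16 = 480 units


def sliding_windows(matrix, step):
    """Flatten each 30-frame window of a frame list into one 480-value vector."""
    windows = []
    last_start = len(matrix) - FRAMES_PER_WINDOW
    for start in range(0, last_start + 1, step):
        block = matrix[start:start + FRAMES_PER_WINDOW]
        windows.append([value for frame in block for value in frame])
    return windows


def training_pairs(aligned_input, natural, step=5):
    """Pair each input window with the teacher window at the same position."""
    if len(aligned_input) != len(natural):
        raise ValueError("matrices must be time-aligned to the same length")
    return list(zip(sliding_windows(aligned_input, step),
                    sliding_windows(natural, step)))
```

With a 5 ms frame shift and a 20 ms analysis window, one 30-frame window spans (30 - 1) x 5 + 20 = 165 ms of speech, assuming frames are spaced exactly one shift apart.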

The units of the input layer 4 are simply input terminals: they perform no transformation and pass each input parameter value through as their output value. The output of each unit of the input layer 4 is multiplied by its coupling coefficient 5, and the sum over all units of the input layer 4 becomes the input to each unit of the intermediate layer 6. That is, the input value $\mathrm{net}_j$ of unit j of the intermediate layer 6 is given by Equation 1:

$$\mathrm{net}_j = \sum_{i} W_{ji}\, u_i \qquad \text{(Equation 1)}$$

where $W_{ji}$ is the coupling coefficient from input-layer unit i to intermediate-layer unit j, and $u_i$ is the output value of input-layer unit i.

In each unit of the intermediate layer 6, the input is transformed by a nonlinear function such as the sigmoid function to give the output. That is, the output value $u_j$ of unit j of the intermediate layer 6 is given by Equation 2:

$$u_j = \frac{1}{1 + \exp\bigl(-(\mathrm{net}_j + \theta_j)\bigr)} \qquad \text{(Equation 2)}$$

where $\theta_j$ is the threshold of unit j. (The sign joining $\mathrm{net}_j$ and $\theta_j$ is not legible in the source text; the additive form is assumed here.)

As with the input layer 4, the output of each unit of the intermediate layer 6 is multiplied by its coupling coefficient 7, and the sum becomes the input to the output layer 8. That is, the input value $\mathrm{net}_k$ of unit k of the output layer 8 is given by Equation 3:

$$\mathrm{net}_k = \sum_{j=1}^{N} W_{kj}\, u_j \qquad \text{(Equation 3)}$$

where N is the number of units in the intermediate layer 6.

The output $u_k$ of unit k of the output layer 8 is given by Equation 4:

$$u_k = \frac{1}{1 + \exp\bigl(-(\mathrm{net}_k + \theta_k)\bigr)} \qquad \text{(Equation 4)}$$

where $\theta_k$ is the threshold of unit k. (The sign before $\theta_k$ is not legible in the source text; the additive form is assumed here.)
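The forward computation of Equations 1 to 4 can be sketched as a small function. The weight layout (one row of coefficients per receiving unit) and the function names are assumptions made here, and the threshold is added inside the sigmoid, which is also an assumption because its sign is not legible in the source text.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(inputs, w_ji, theta_j, w_kj, theta_k):
    """Compute intermediate-layer and output-layer values for one input vector.

    w_ji[j][i]: coupling coefficient from input unit i to intermediate unit j.
    w_kj[k][j]: coupling coefficient from intermediate unit j to output unit k.
    theta_j, theta_k: thresholds (sign convention assumed, see lead-in).
    """
    # Input-layer units pass their values through unchanged.
    u_i = list(inputs)
    # Equation 1 (weighted sum) followed by Equation 2 (sigmoid).
    u_j = [sigmoid(sum(w * u for w, u in zip(row, u_i)) + th)
           for row, th in zip(w_ji, theta_j)]
    # Equation 3 (weighted sum) followed by Equation 4 (sigmoid).
    u_k = [sigmoid(sum(w * u for w, u in zip(row, u_j)) + th)
           for row, th in zip(w_kj, theta_k)]
    return u_j, u_k
```

In the embodiment `inputs` would hold 480 values and `w_kj` would have 480 rows; the sketch works for any sizes as long as the row lengths match the layer feeding them.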

Here, letting $t_k$ denote the LSP parameter value of natural speech that serves as the teacher signal for unit k of the output layer 8, the error evaluation value E of the network is given by the following equation (the equation itself is not reproduced in this text).

The coupling coefficients $W_{kj}$ and $W_{ji}$ are updated by the steepest descent method so as to minimize E. This procedure is repeated each time the input signal and the teacher signal are slid by a few frames, and training continues until E becomes sufficiently small. As a result, the output signal comes very close to the teacher signal (the parameter matrix of the natural training speech), which means that fluctuations close to those of natural speech have been imparted to the input signal 1. The number of frames by which the input and teacher signals are slid is chosen as appropriate, for example from preliminary experiments, and so is the number of units in the intermediate layer 6.
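The training loop can be illustrated with a small steepest-descent sketch. Several points are assumptions rather than details from the patent: the error is taken to be the usual squared error E = 1/2 * sum_k (t_k - u_k)^2 (the patent's own equation for E is missing from this text), the thresholds are held at zero for brevity, only the coupling coefficients are updated, and the learning rate, initial weight range, and stopping tolerance are arbitrary illustrative values.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train_step(x, t, w_ji, w_kj, rate):
    """One steepest-descent update on a single (input, teacher) pair."""
    u_j = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_ji]
    u_k = [sigmoid(sum(w * uj for w, uj in zip(row, u_j))) for row in w_kj]
    error = 0.5 * sum((tk - uk) ** 2 for tk, uk in zip(t, u_k))  # assumed E
    # dE/dnet_k for the output layer.
    d_k = [(uk - tk) * uk * (1.0 - uk) for uk, tk in zip(u_k, t)]
    # dE/dnet_j for the intermediate layer, using the weights before update.
    d_j = [uj * (1.0 - uj) * sum(d_k[k] * w_kj[k][j] for k in range(len(d_k)))
           for j, uj in enumerate(u_j)]
    for k, dk in enumerate(d_k):
        for j, uj in enumerate(u_j):
            w_kj[k][j] -= rate * dk * uj
    for j, dj in enumerate(d_j):
        for i, xi in enumerate(x):
            w_ji[j][i] -= rate * dj * xi
    return error


def train(pairs, n_hidden, rate=0.5, tolerance=1e-4, max_epochs=1000, seed=0):
    """Repeat updates over all window pairs until the summed error is small."""
    rng = random.Random(seed)
    n_in, n_out = len(pairs[0][0]), len(pairs[0][1])
    w_ji = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    w_kj = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_out)]
    history = []
    for _ in range(max_epochs):
        total = sum(train_step(x, t, w_ji, w_kj, rate) for x, t in pairs)
        history.append(total)
        if total < tolerance:
            break
    return w_ji, w_kj, history
```

Here `pairs` plays the role of the slid input and teacher windows. Because a sigmoid output lies strictly between 0 and 1, the teacher values in this sketch would need to be scaled into that range; the patent does not describe such scaling, so this is a property of the sketch only.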

To synthesize speech with the neural network trained by the above procedure, the given phonetic symbol string is first converted into an LSP parameter matrix by the rule synthesis method that uses, as synthesis units, parameter matrices averaged for each phonetic environment.

Next, this LSP parameter matrix is input to the trained neural network 30 frames at a time, and the computation of Equations 1 to 4 yields the output 30 frames at a time. Concatenating these outputs gives an LSP parameter matrix, of a length corresponding to the given phonetic symbol string, for speech with fluctuations close to those of natural speech. Although the phonetic symbol string is converted into an LSP parameter matrix in the description above, it may instead be converted into another kind of parameter matrix, such as a PARCOR parameter matrix.
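The block-wise synthesis step can be sketched as follows. The `network` argument stands for any callable that maps a 480-value vector to a 480-value vector, such as the trained forward computation. How a final block shorter than 30 frames is handled is not stated in the text; padding it by repeating the last frame and then discarding the padded outputs is an assumption of this sketch, as are the function and parameter names.

```python
FRAMES_PER_BLOCK = 30


def synthesize(matrix, network, order=16):
    """Pass a parameter matrix through the network 30 frames at a time.

    matrix: list of frames, each a list of `order` parameter values.
    network: callable taking and returning a flat list of 30 * order values.
    Returns a matrix with the same number of frames as the input.
    """
    output = []
    for start in range(0, len(matrix), FRAMES_PER_BLOCK):
        block = matrix[start:start + FRAMES_PER_BLOCK]
        valid = len(block)
        # Assumption: pad a short final block by repeating its last frame.
        block = block + [block[-1]] * (FRAMES_PER_BLOCK - valid)
        flat = [value for frame in block for value in frame]
        result = network(flat)
        frames = [list(result[i * order:(i + 1) * order])
                  for i in range(FRAMES_PER_BLOCK)]
        output.extend(frames[:valid])  # drop outputs for padded frames
    return output
```

The blocks here do not overlap, which matches the text's "30 frames at a time"; any smoothing at block boundaries would be a further design choice that the patent does not discuss.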

[Effects of the Invention] As described above, according to this invention, a phonetic symbol string is converted into a parameter matrix by a rule synthesis method that uses, as synthesis units, parameter matrices averaged for each phonetic environment, and the conversion of that parameter matrix into a parameter matrix of speech with fluctuations close to those of natural speech is performed automatically by a trained neural network. Because no conversion rules (functions) need to be written from a priori knowledge, the invention has the advantage that speech with fluctuations close to those of natural speech can be synthesized easily.

[Brief explanation of the drawing]

FIG. 1 is an explanatory diagram of the present invention. 1 is the parameter matrix obtained by the conventional rule synthesis method; 2 is the parameter matrix of natural training speech; 3 is the parameter matrix obtained by linearly stretching or compressing 1 so that the duration of each phoneme matches that of 2; 4 is the input layer of the neural network; 5 and 7 are the connections whose coupling coefficients are modified by training; 6 is the intermediate layer; 8 is the output layer; and 9 is the parameter matrix, obtained as the output of the neural network, of speech with fluctuations close to those of natural speech.

Claims

[Claims]

(1) A speech synthesis method characterized in that: a predetermined phonetic symbol string is converted into a speech feature parameter matrix by a rule synthesis method that generates synthesis units reflecting coarticulation effects through clustering based on the phonetic environment; the duration of the feature parameter matrix obtained by the rule synthesis method is matched, phoneme by phoneme, to that of the feature parameter matrix of natural training speech corresponding to the phonetic symbol string; a neural network is trained with the duration-matched feature parameter matrix as input and with the feature parameter matrix of the natural training speech, time-synchronized with the input, as the teacher signal; an input phonetic symbol string is converted into a speech feature parameter matrix by the rule synthesis method; the converted feature parameter matrix is input to the trained neural network; and the output of the neural network is used as the synthesized speech.