JP3423276B2

JP3423276B2 - Voice synthesis method

Info

Publication number: JP3423276B2
Application number: JP2000242068A
Authority: JP
Inventors: 啓之平井
Original assignee: Sanyo Electric Co Ltd
Current assignee: Sanyo Electric Co Ltd
Priority date: 2000-08-10
Filing date: 2000-08-10
Publication date: 2003-07-07
Anticipated expiration: 2020-08-10
Also published as: JP2002055693A

Description

Detailed Description of the Invention

[0001]

[Technical field of the invention] The present invention relates to a speech synthesis method capable of reading out arbitrary text information as synthesized speech.

[0002]

[Description of the related art] FIG. 1 shows a schematic configuration of a speech synthesizer.

[0003] Input text containing a mixture of Japanese kana and kanji is subjected to morphological analysis and dependency analysis in the language processing unit 1 and is converted into phoneme symbols, accent symbols, and the like.

[0004] The prosody pattern generation unit 2 uses the phoneme symbols, the accent symbol string, and the part-of-speech information of the input text obtained from the morphological analysis to estimate the phoneme duration (length of the voice, DUR^T), the fundamental frequency (pitch of the voice, F0^T), the power at the vowel center (loudness of the voice, POW^T), and so on.

[0005] The phoneme unit selection unit 3 uses DP (dynamic programming) to select the combination of phoneme units (phoneme pieces) stored in the waveform dictionary 5 that is closest to the estimated phoneme duration DUR^T, fundamental frequency F0^T, and vowel-center power POW^T, and that gives the smallest distortion when the pieces are connected.

[0006] The speech waveform generation unit 4 generates speech by connecting the phoneme pieces according to the selected combination while converting their pitch.

[0007] FIG. 2 shows the contents of the waveform dictionary 5. The waveform dictionary 5 comprises a phoneme piece storage unit 51, which stores a plurality of phoneme pieces, and an auxiliary information storage unit 52, which stores auxiliary information about each phoneme piece in the phoneme piece storage unit 51. The auxiliary information includes the power (POW^Dic), fundamental frequency (F0^Dic), duration (DUR^Dic), and so on of each phoneme piece.
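
As an illustration of the layout described in [0007], the sketch below models the phoneme piece storage unit 51 and the auxiliary information storage unit 52 as two parallel lists. The class and field names (`AuxInfo`, `WaveformDictionary`, `pow_dic`, and so on) are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass, field


@dataclass
class AuxInfo:
    """Auxiliary information for one phoneme piece (hypothetical field names)."""
    phoneme: str      # phoneme label this piece realizes
    pow_dic: float    # POW^Dic: power
    f0_dic: float     # F0^Dic: fundamental frequency
    dur_dic: float    # DUR^Dic: duration


@dataclass
class WaveformDictionary:
    """Waveform dictionary 5: pieces (unit 51) plus auxiliary info (unit 52)."""
    pieces: list = field(default_factory=list)  # waveforms, here lists of samples
    aux: list = field(default_factory=list)     # AuxInfo, same index as pieces

    def add(self, waveform, info):
        """Store a waveform and its auxiliary information under one index."""
        self.pieces.append(waveform)
        self.aux.append(info)
        return len(self.pieces) - 1

    def candidates(self, phoneme):
        """Indices of all stored pieces that realize the given phoneme."""
        return [k for k, a in enumerate(self.aux) if a.phoneme == phoneme]


dic = WaveformDictionary()
dic.add([0.0, 0.1, 0.0], AuxInfo("a", pow_dic=60.0, f0_dic=120.0, dur_dic=80.0))
dic.add([0.0, 0.2, 0.1], AuxInfo("a", pow_dic=62.0, f0_dic=130.0, dur_dic=90.0))
dic.add([0.1, 0.0, 0.0], AuxInfo("k", pow_dic=40.0, f0_dic=0.0, dur_dic=30.0))
print(dic.candidates("a"))
```

Keeping the auxiliary information separate from the waveforms means the selection step only has to read the small per-piece records, not the audio itself.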

[0008] The phoneme unit selection unit 3 selects, from among the combinations of phoneme pieces stored in the waveform dictionary 5, a combination with low distortion. This distortion has the following components.

[0009] As shown in FIG. 3, let u_{i-1}, u_i, and u_{i+1} be phoneme pieces extracted from the waveform dictionary 5, and let t_{i-1}, t_i, and t_{i+1} be the environments in which they are actually used (the targets). The distortion for u_i then consists of C_i^t and C_i^c.

[0010] Here, C_i^t is the distortion between the phoneme piece (u_i) extracted from the dictionary for the i-th phoneme and the environment in which it is actually used (target t_i). C_i^c is the distortion that arises when the i-th phoneme piece (u_i) is connected to the (i-1)-th phoneme piece (u_{i-1}). The phoneme unit selection unit 3 connects phoneme pieces using dynamic programming (the DP method) and selects the combination of pieces that minimizes C^all, the sum of C_i^t and C_i^c over all input phonemes.
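
The selection described in [0010] can be sketched as a Viterbi-style dynamic program. This is a minimal illustration, not the patent's implementation: the function name `select_units`, the callables `target_cost` and `concat_cost`, and the choice to apply no concatenation cost to the first phoneme are all assumptions. In the toy example each candidate is represented by a single F0 value.

```python
def select_units(candidates, target_cost, concat_cost):
    """Viterbi-style DP over candidate lists.

    candidates[i] is the list of candidate pieces for the i-th phoneme.
    target_cost(i, u) plays the role of C_i^t, concat_cost(prev, u) of C_i^c.
    Returns (minimum total cost, chosen piece per phoneme).
    """
    # best[j] = (cost of the cheapest path ending in candidate j, that path)
    best = [(target_cost(0, u), [u]) for u in candidates[0]]
    for i in range(1, len(candidates)):
        new_best = []
        for u in candidates[i]:
            # Cheapest way to reach u from any candidate of the previous phoneme.
            cost, path = min(
                ((c + concat_cost(p[-1], u), p) for c, p in best),
                key=lambda cp: cp[0],
            )
            new_best.append((cost + target_cost(i, u), path + [u]))
        best = new_best
    return min(best, key=lambda cp: cp[0])


# Toy example: pieces and targets are bare F0 values.
targets = [100, 110]
candidates = [[90, 105], [100, 120]]
total, chosen = select_units(
    candidates,
    target_cost=lambda i, u: (u - targets[i]) ** 2,
    concat_cost=lambda prev, u: (u - prev) ** 2,
)
print(total, chosen)
```

The program keeps only the cheapest path into each candidate, so the work grows with the number of candidate pairs per adjacent phoneme rather than with the number of full combinations.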

[0011] C_i^t is expressed by the following Equation 1.

[0012]

[Equation 1]
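
The equation itself does not appear in this text. Assuming a plain weighted sum of the three squared distances defined in [0014] to [0019], which is the form the weighting coefficients suggest rather than a confirmed transcription, Equation 1 would read:

```latex
C_i^t = w_{POW}^t \, D_{POW}^t(t_i, u_i)
      + w_{F0}^t  \, D_{F0}^t(t_i, u_i)
      + w_{DUR}^t \, D_{DUR}^t(t_i, u_i)
```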

[0013] In Equation 1, the variables are defined as follows.

[0014] D_POW^t(t_i, u_i) is the square of the distance between the power POW^Dic(i) of the phoneme piece (u_i) extracted from the dictionary for the i-th phoneme and the power POW^T(i) of the environment in which it is actually used (target t_i), that is, {POW^Dic(i) − POW^T(i)}².

[0015] w_POW^t is the weighting coefficient for D_POW^t(t_i, u_i).

[0016] D_F0^t(t_i, u_i) is the square of the distance between the fundamental frequency F0^Dic(i) of the phoneme piece (u_i) extracted from the dictionary for the i-th phoneme and the fundamental frequency F0^T(i) of the environment in which it is actually used (target t_i), that is, {F0^Dic(i) − F0^T(i)}².

[0017] w_F0^t is the weighting coefficient for D_F0^t(t_i, u_i).

[0018] D_DUR^t(t_i, u_i) is the square of the distance between the duration DUR^Dic(i) of the phoneme piece (u_i) extracted from the dictionary for the i-th phoneme and the duration DUR^T(i) of the environment in which it is actually used (target t_i), that is, {DUR^Dic(i) − DUR^T(i)}².

[0019] w_DUR^t is the weighting coefficient for D_DUR^t(t_i, u_i).
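
The definitions in [0014] to [0019] can be sketched as a small function. It assumes that Equation 1 is a plain weighted sum of the three squared distances; the dictionary keys and the weight values are hypothetical and chosen only for the example.

```python
def target_cost(target, piece, weights):
    """Weighted sum of squared power, F0 and duration differences (C_i^t)."""
    d_pow = (piece["pow"] - target["pow"]) ** 2   # D_POW^t(t_i, u_i)
    d_f0 = (piece["f0"] - target["f0"]) ** 2      # D_F0^t(t_i, u_i)
    d_dur = (piece["dur"] - target["dur"]) ** 2   # D_DUR^t(t_i, u_i)
    return weights["pow"] * d_pow + weights["f0"] * d_f0 + weights["dur"] * d_dur


target = {"pow": 60.0, "f0": 120.0, "dur": 80.0}   # POW^T, F0^T, DUR^T
piece = {"pow": 62.0, "f0": 125.0, "dur": 70.0}    # POW^Dic, F0^Dic, DUR^Dic
weights = {"pow": 1.0, "f0": 0.5, "dur": 0.25}     # w_POW^t, w_F0^t, w_DUR^t
cost = target_cost(target, piece, weights)
print(cost)
```

A piece whose stored values equal the target values gets a cost of zero, and the weights decide which kind of mismatch is tolerated least.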

[0020] C_i^c is expressed by the following Equation 2.

[0021]

[Equation 2]
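
The equation itself does not appear in this text. Assuming the same weighted-sum form, built from the three squared distances defined in [0023] to [0028], Equation 2 would read:

```latex
C_i^c = w_{POW}^c \, D_{POW}^c(u_i, u_{i-1})
      + w_{F0}^c  \, D_{F0}^c(u_i, u_{i-1})
      + w_{SPC}^c \, D_{SPC}^c(u_i, u_{i-1})
```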

[0022] In Equation 2, the variables are defined as follows.

[0023] D_POW^c(u_i, u_{i-1}) is the square of the distance between the power POW^DicS(i) at the start of the i-th phoneme piece (u_i) and the power POW^DicE(i-1) at the end of the (i-1)-th phoneme piece (u_{i-1}), that is, {POW^DicS(i) − POW^DicE(i-1)}².

[0024] w_POW^c is the weighting coefficient for D_POW^c(u_i, u_{i-1}).

[0025] D_F0^c(u_i, u_{i-1}) is the square of the distance between the fundamental frequency F0^DicS(i) at the start of the i-th phoneme piece (u_i) and the fundamental frequency F0^DicE(i-1) at the end of the (i-1)-th phoneme piece (u_{i-1}), that is, {F0^DicS(i) − F0^DicE(i-1)}².

[0026] w_F0^c is the weighting coefficient for D_F0^c(u_i, u_{i-1}).

[0027] D_SPC^c(u_i, u_{i-1}) is the square of the distance between the spectrum SPC^DicS(i, j), j = 1 to 16, at the start of the i-th phoneme piece (u_i) and the spectrum SPC^DicE(i-1, j), j = 1 to 16, at the end of the (i-1)-th phoneme piece (u_{i-1}), that is, {SPC^DicS(i, j) − SPC^DicE(i-1, j)}².

[0028] w_SPC^c is the weighting coefficient for D_SPC^c(u_i, u_{i-1}).
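
The concatenation distortion of [0023] to [0028] can be sketched in the same way. Two points are assumptions: that Equation 2 is a plain weighted sum, and that the spectral distance sums the squared differences over the 16 coefficients j. The dictionary keys and weight values are hypothetical.

```python
def concat_cost(prev_piece, piece, weights):
    """Weighted sum of squared boundary mismatches between adjacent pieces (C_i^c)."""
    d_pow = (piece["pow_start"] - prev_piece["pow_end"]) ** 2   # D_POW^c
    d_f0 = (piece["f0_start"] - prev_piece["f0_end"]) ** 2      # D_F0^c
    # D_SPC^c: squared differences summed over the spectral coefficients j
    # (the summation over j is an assumption).
    d_spc = sum(
        (s - e) ** 2 for s, e in zip(piece["spc_start"], prev_piece["spc_end"])
    )
    return weights["pow"] * d_pow + weights["f0"] * d_f0 + weights["spc"] * d_spc


prev_piece = {"pow_end": 58.0, "f0_end": 118.0, "spc_end": [0.5] * 16}
piece = {"pow_start": 60.0, "f0_start": 120.0, "spc_start": [1.0] * 16}
weights = {"pow": 1.0, "f0": 1.0, "spc": 2.0}   # w_POW^c, w_F0^c, w_SPC^c
cost = concat_cost(prev_piece, piece, weights)
print(cost)
```

Only the end of the previous piece and the start of the current piece enter the calculation, which is why the auxiliary information needs separate start and end values.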

[0029] C^all, the sum of C_i^t and C_i^c over all input phonemes, is expressed by the following Equation 3.

[0030]

[Equation 3]
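
The equation itself does not appear in this text. From the description in [0029] it is a sum over all input phonemes; writing the number of phonemes as N (a symbol introduced here for illustration), it would read:

```latex
C^{all} = \sum_{i=1}^{N} \left( C_i^t + C_i^c \right)
```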

[0031]

[Problem to be solved by the invention] The speech synthesis method described above can produce high-quality synthesized speech, that is, synthesized speech close to natural utterance. However, phoneme pieces created from natural utterances are likely to include pieces, such as those with lazy articulation ("namake") or hesitation ("iiyodomi"), that degrade sound quality if they are actually selected. It is preferable to create the waveform dictionary 5 so that it contains no such phoneme pieces, but in practice it is difficult to create the waveform dictionary 5 with every quality-degrading phoneme piece removed.

[0032] It is also conceivable to delete the quality-degrading phoneme pieces after the waveform dictionary 5 has been created, but doing so requires a large-scale modification of the waveform dictionary 5.

[0033] An object of the present invention is to provide a speech synthesis method that makes poor-quality phoneme pieces, which degrade sound quality, less likely to be selected as the optimal phoneme piece, without a large-scale modification of the waveform dictionary.

[0034]

[Means for solving the problem] A first speech synthesis method according to the present invention is a phoneme-unit-selection speech synthesis method in which a plurality of speech units and, for each phoneme unit, auxiliary information used to calculate the distortion from a target are stored in a waveform dictionary, and in which the combination with the smallest distortion from the target is selected from among the combinations of phoneme units stored in the waveform dictionary. The method is characterized by comprising: a step of adding penalty information to the auxiliary information of each phoneme unit; a step of having the user, on listening to the speech synthesis result and finding its quality poor, input the poor-quality portion of the synthesized speech; and a step of, when a poor-quality portion of the synthesized speech has been input by the user, setting the penalty information in the auxiliary information of the phoneme piece corresponding to that portion to a value that forcibly increases the calculated distortion from the target when that phoneme piece is selected as a candidate.

[0035] A second speech synthesis method according to the present invention is a phoneme-unit-selection speech synthesis method in which a plurality of speech units and, for each phoneme unit, auxiliary information used to calculate the fitness for a target are stored in a waveform dictionary, and in which the combination with the highest fitness for the target is selected from among the combinations of phoneme units stored in the waveform dictionary. The method is characterized by comprising: a step of adding priority information to the auxiliary information of each phoneme unit; a step of having the user, on listening to the speech synthesis result and finding its quality poor, input the poor-quality portion of the synthesized speech; and a step of, when a poor-quality portion of the synthesized speech has been input by the user, setting the priority information in the auxiliary information of the phoneme piece corresponding to that portion to a value that forcibly decreases the calculated fitness for the target when that phoneme piece is selected as a candidate.

[0036]

[Embodiments of the invention] Embodiments of the present invention are described below.

[0037] [1] Description of the first embodiment. The overall configuration of the speech synthesizer is the same as in FIG. 1.

[0038] In the first embodiment, the following points (1), (2), and (3) differ from the conventional method.

[0039] (1) As shown in FIG. 4, penalty information D_pri^t(u_i) is added to the auxiliary information of each phoneme piece. The initial value of the penalty information D_pri^t(u_i) is 0.

[0040] (2) The penalty information D_pri^t(u_i) is added as a parameter to C_i^t, which the phoneme unit selection unit 3 uses to calculate the distortion C^all.

[0041] That is, C_i^t is expressed by the following Equation 4.

[0042]

[Equation 4]
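
The equation itself does not appear in this text. Paragraph [0045] states that a penalized piece's distortion grows by exactly α, which suggests that the penalty enters as an unweighted additive term. Under that assumption, and assuming Equation 1 is a plain weighted sum, Equation 4 would read:

```latex
C_i^t = w_{POW}^t \, D_{POW}^t(t_i, u_i)
      + w_{F0}^t  \, D_{F0}^t(t_i, u_i)
      + w_{DUR}^t \, D_{DUR}^t(t_i, u_i)
      + D_{pri}^t(u_i)
```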

[0043] (3) The user listens to the speech synthesis result and, if its quality is poor, inputs the poor-quality portion of the synthesized speech into the speech synthesizer. When a poor-quality portion has been input by the user, the speech synthesizer sets the value of the penalty information D_pri^t(u_i) in the auxiliary information of the phoneme piece corresponding to that portion to a predetermined value α.
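
Step (3) can be sketched as an update to the auxiliary information only. How the user's input is mapped to a phoneme piece is not specified in the text; the sketch assumes the synthesizer remembers which piece it used at each phoneme position and that the user flags positions. The piece ids, the value of `ALPHA`, and the function name `mark_poor_quality` are hypothetical.

```python
ALPHA = 1.0e6  # predetermined value alpha (placeholder; [0044] describes how it is chosen)

# Auxiliary information per piece id; "penalty" is D_pri^t(u_i), initially 0.
aux_info = {
    101: {"pow": 60.0, "f0": 120.0, "dur": 80.0, "penalty": 0.0},
    102: {"pow": 61.0, "f0": 122.0, "dur": 85.0, "penalty": 0.0},
    205: {"pow": 40.0, "f0": 0.0, "dur": 30.0, "penalty": 0.0},
}


def mark_poor_quality(aux_info, selected_ids, flagged_positions, alpha=ALPHA):
    """Set the penalty of every piece used at a position the user flagged."""
    for pos in flagged_positions:
        aux_info[selected_ids[pos]]["penalty"] = alpha


# Pieces chosen for the last synthesized utterance, in phoneme order.
selected_ids = [205, 101]
# The user reports that the second phoneme (position 1) sounded bad.
mark_poor_quality(aux_info, selected_ids, flagged_positions=[1])
print(aux_info[101]["penalty"], aux_info[102]["penalty"])
```

Nothing is deleted: the waveform of piece 101 stays in the dictionary, and only one number in its auxiliary record changes.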

[0044] As the predetermined value α, a value of about 100 times the expected maximum of C_i^t in Equation 1 is used, for example. Specifically, the maximum value of Equation 1 when an arbitrary number of sentences are input is determined in advance by experiment, and 100 times that maximum is set as the predetermined value α.
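
The procedure in [0044] amounts to logging the Equation 1 values observed while synthesizing test sentences and scaling the largest one. The sketch below assumes Equation 1 is a plain weighted sum; the logged target and piece values, the weights, and the function names are made up for illustration.

```python
def target_cost(target, piece, weights):
    """Equation 1 under the weighted-sum assumption (no penalty term)."""
    return sum(weights[k] * (piece[k] - target[k]) ** 2 for k in ("pow", "f0", "dur"))


def estimate_alpha(observed_costs, factor=100.0):
    """alpha = factor times the largest Equation 1 value seen in the experiment."""
    return factor * max(observed_costs)


weights = {"pow": 1.0, "f0": 1.0, "dur": 1.0}
# (target, piece) pairs that would be logged while synthesizing test sentences.
logged = [
    ({"pow": 60.0, "f0": 120.0, "dur": 80.0}, {"pow": 62.0, "f0": 121.0, "dur": 80.0}),
    ({"pow": 55.0, "f0": 110.0, "dur": 60.0}, {"pow": 55.0, "f0": 113.0, "dur": 64.0}),
]
costs = [target_cost(t, u, weights) for t, u in logged]
alpha = estimate_alpha(costs)
print(costs, alpha)
```

Choosing α far above any naturally occurring cost means a penalized piece loses to every unpenalized candidate on target cost alone, whatever its acoustic match.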

[0045] With changes (1), (2), and (3) above, when a poor-quality phoneme piece (u_i) whose penalty information D_pri^t(u_i) is set to α is chosen as a candidate, the distortion C_i^t between that phoneme piece and the target is larger by α than in the conventional method, so the phoneme piece (u_i) is less likely to be selected as the optimal phoneme piece.

[0046] According to the above embodiment, when a poor-quality phoneme piece exists in the waveform dictionary, it can be made less likely to be selected through a small-scale modification, namely adding the penalty information D_pri^t(u_i) to the auxiliary information of the phoneme pieces, without a large-scale modification of the dictionary such as deleting the piece.

[0047] In a high-quality speech synthesizer, the phoneme piece storage unit in the waveform dictionary stores about 60,000 phoneme pieces, so its capacity reaches several tens of MB, whereas the capacity of the auxiliary information storage unit in the waveform dictionary is only a few MB, one tenth or less of that of the phoneme piece storage unit. It is therefore easier to modify only the auxiliary information storage unit, as in the above embodiment. In addition, the conventional method of improving quality by deleting phoneme pieces requires replacing the entire waveform dictionary, whereas the method of the above embodiment only adds the penalty information D_pri^t(u_i) to the auxiliary information, so the modification can be made by changing only part of the waveform dictionary.

[0048] It is also conceivable to let the user freely delete poor-quality phoneme pieces from the waveform dictionary to improve the quality of the synthesized speech, but for some phonemes this risks deleting every phoneme piece corresponding to that phoneme. Synthesized speech could then no longer be generated for text containing that phoneme.

[0049] In contrast, with the method of the above embodiment, even if the penalty information D_pri^t(u_i) of every phoneme piece corresponding to a certain phoneme is set to the predetermined value α, the best of the phoneme pieces corresponding to that phoneme is still selected when the phoneme is synthesized, so synthesized speech can still be generated for it.
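
The behavior described in [0049] can be checked on a toy case. The sketch reduces the target distortion to a single F0 term plus the penalty and ignores the concatenation distortion, so it is a simplified stand-in rather than the full calculation; the piece ids and values are hypothetical.

```python
ALPHA = 1000.0


def penalized_target_cost(target_f0, piece):
    """Squared F0 mismatch plus the piece's penalty (one-feature stand-in)."""
    return (piece["f0"] - target_f0) ** 2 + piece["penalty"]


def best(pieces, target_f0=120.0):
    """Id of the candidate with the smallest penalized cost."""
    return min(pieces, key=lambda u: penalized_target_cost(target_f0, u))["id"]


pieces = [
    {"id": "a1", "f0": 118.0, "penalty": 0.0},
    {"id": "a2", "f0": 131.0, "penalty": 0.0},
]

first = best(pieces)            # no penalties: the closer piece wins
pieces[0]["penalty"] = ALPHA    # user flags a1
second = best(pieces)           # the unpenalized piece wins
pieces[1]["penalty"] = ALPHA    # user flags a2 as well
third = best(pieces)            # every candidate penalized: one is still chosen
print(first, second, third)
```

Because the same α is added to every candidate, the ranking among fully penalized pieces falls back to the acoustic match, and the phoneme can still be synthesized.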

[0050] [2] Description of the second embodiment. In the first embodiment, the phoneme unit selection unit 3 selects, from among the combinations of phoneme pieces stored in the waveform dictionary, a combination with low distortion. Phoneme unit selection units are also known that instead select, from among the combinations of phoneme pieces stored in the waveform dictionary, a combination with high fitness.

[0051] The fitness S^all is generally expressed by the following Equation 5.

[0052]

[Equation 5]

[0053] In Equation 5, S_i^t denotes the similarity between the phoneme piece (u_i) extracted from the dictionary for the i-th phoneme and the environment in which it is actually used (target t_i), and is expressed by the following Equation 6. The variables in Equation 6 are the same as those in Equation 1.

[0054]

[Equation 6]

[0055] Also in Equation 5, S_i^c denotes the similarity between the start of the phoneme piece (u_i) selected from the dictionary for the i-th phoneme and the end of the phoneme piece (u_{i-1}) selected from the dictionary for the (i-1)-th phoneme, and is expressed by the following Equation 7. The variables in Equation 7 are the same as those in Equation 2.

[0056]

[Equation 7]

[0057] In the second embodiment, the following points (1), (2), and (3) differ from the conventional example that selects phoneme units using fitness.

[0058] (1) Priority information E_pri^t(u_i) is added to the auxiliary information of each phoneme piece. The initial value of the priority information E_pri^t(u_i) is a predetermined value.

[0059] (2) The priority information E_pri^t(u_i) is added as a parameter to S_i^t, which the phoneme unit selection unit 3 uses to calculate the fitness S^all.

[0060] That is, S_i^t is expressed by the following Equation 8.

[0061]

[Equation 8]

[0062] (3) The user listens to the speech synthesis result and, if its quality is poor, inputs the poor-quality portion of the synthesized speech into the speech synthesizer. When a poor-quality portion has been input by the user, the speech synthesizer sets the value of the priority information E_pri^t(u_i) in the auxiliary information of the phoneme piece corresponding to that portion to a value smaller than the initial value.
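
The effect of lowering the priority can be illustrated on a toy case. Equations 5 to 8 do not appear in this text, so the form of the similarity is an assumption: the sketch uses the negated squared F0 mismatch with the priority added as one term. The priority values and piece ids are hypothetical.

```python
INITIAL_PRIORITY = 0.0       # hypothetical initial value of E_pri^t(u_i)
LOWERED_PRIORITY = -1000.0   # hypothetical value smaller than the initial one


def similarity(target_f0, piece):
    """Stand-in for S_i^t: higher is better; the priority is added as one term."""
    return -((piece["f0"] - target_f0) ** 2) + piece["priority"]


def best(pieces, target_f0=120.0):
    """Id of the candidate with the highest similarity."""
    return max(pieces, key=lambda u: similarity(target_f0, u))["id"]


pieces = [
    {"id": "a1", "f0": 118.0, "priority": INITIAL_PRIORITY},
    {"id": "a2", "f0": 131.0, "priority": INITIAL_PRIORITY},
]

before = best(pieces)
pieces[0]["priority"] = LOWERED_PRIORITY   # user flagged the speech made from a1
after = best(pieces)
print(before, after)
```

This mirrors the first embodiment with the sign reversed: selection maximizes fitness, so a flagged piece is pushed down rather than up.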

[0063]

[Effect of the invention] According to the present invention, poor-quality phoneme pieces that degrade sound quality can be made less likely to be selected as the optimal phoneme piece, without a large-scale modification of the waveform dictionary.

[Brief description of drawings]

FIG. 1 is a block diagram showing the overall configuration of a speech synthesizer.

FIG. 2 is a schematic diagram showing the contents of the waveform dictionary 5.

FIG. 3 is a schematic diagram for explaining the two types of distortion, C_i^t and C_i^c, used in the phoneme unit selection unit 3 to select a combination of phoneme pieces.

FIG. 4 is a schematic diagram showing penalty information D_pri^t added to the auxiliary information of the phoneme piece corresponding to a poor-quality portion of synthesized speech.

Continuation of front page

(58) Fields surveyed (Int. Cl.⁷, DB name): G10L 13/06

Claims

(57) [Claims]

1. A phoneme-unit-selection speech synthesis method in which a plurality of speech units and, for each phoneme unit, auxiliary information used to calculate the distortion from a target are stored in a waveform dictionary, and in which the combination with the smallest distortion from the target is selected from among the combinations of phoneme units stored in the waveform dictionary, the method comprising: a step of adding penalty information to the auxiliary information of each phoneme unit; a step of having the user, on listening to the speech synthesis result and finding its quality poor, input the poor-quality portion of the synthesized speech; and a step of, when a poor-quality portion of the synthesized speech has been input by the user, setting the penalty information in the auxiliary information of the phoneme piece corresponding to that portion to a value that forcibly increases the calculated distortion from the target when that phoneme piece is selected as a candidate.

2. A phoneme-unit-selection speech synthesis method in which a plurality of speech units and, for each phoneme unit, auxiliary information used to calculate the fitness for a target are stored in a waveform dictionary, and in which the combination with the highest fitness for the target is selected from among the combinations of phoneme units stored in the waveform dictionary, the method comprising: a step of adding priority information to the auxiliary information of each phoneme unit; a step of having the user, on listening to the speech synthesis result and finding its quality poor, input the poor-quality portion of the synthesized speech; and a step of, when a poor-quality portion of the synthesized speech has been input by the user, setting the priority information in the auxiliary information of the phoneme piece corresponding to that portion to a value that forcibly decreases the calculated fitness for the target when that phoneme piece is selected as a candidate.