JP2004062002A - Speech synthesizing method

Info

Publication number: JP2004062002A
Application number: JP2002222511A
Authority: JP
Inventors: Hiroyuki Hirai; 平井　啓之
Original assignee: Sanyo Electric Co Ltd
Current assignee: Sanyo Electric Co Ltd
Priority date: 2002-07-31
Filing date: 2002-07-31
Publication date: 2004-02-26
Anticipated expiration: 2022-07-31
Also published as: JP4056319B2

Abstract

PROBLEM TO BE SOLVED: To provide a speech synthesizing method capable of reducing the connection distortion of phoneme units caused by the position of pitch marks.

SOLUTION: In a speech synthesizing method that selects, from among combinations of phoneme units, the combination with the smallest sum of distortion from the target and connection distortion between phoneme units, and that generates a synthesized speech waveform by waveform superposition according to the selected combination, phase information on the pitch waveform at the start of each phoneme unit and phase information on the pitch waveform at the end of each phoneme unit are added to the auxiliary information of the phoneme unit. For two phoneme units to be connected, the distance between the phase of the pitch waveform at the end of the preceding phoneme unit and the phase of the pitch waveform at the start of the following phoneme unit is added as a parameter used to calculate the connection distortion.

COPYRIGHT: (C)2004, JPO

Description

[0001]
[Technical Field of the Invention]
The present invention relates to a speech synthesis method capable of reading out arbitrary text information with synthesized speech.
[0002]
[Prior art]
FIG. 1 shows a schematic configuration of the speech synthesizer.
[0003]
The input text, written in a mixture of Japanese kana and kanji, is subjected to morphological analysis and dependency analysis by the language processing unit 1 and converted into phoneme symbols, accent symbols, and the like.
[0004]
The prosody pattern generation unit 2 uses the phoneme symbols, the accent symbol string, and the part-of-speech information of the input text obtained from the morphological analysis result to estimate the phoneme duration (length of the voice, DUR^T), the fundamental frequency (pitch of the voice, F0^T), the power at the vowel center (loudness of the voice, POW^T), and the like.
[0005]
The phoneme unit selection unit 3 uses DP (dynamic programming) to select the combination of phoneme units (phoneme segments) stored in the waveform dictionary 5 that is closest to the estimated phoneme duration DUR^T, fundamental frequency F0^T, and vowel-center power POW^T, and that gives the smallest distortion when the segments are connected.
[0006]
The speech waveform generation unit 4 generates speech by connecting the phoneme segments while converting their pitch, in accordance with the selected combination of phoneme segments.
[0007]
FIG. 2 shows the contents of the waveform dictionary 5.
The waveform dictionary 5 includes a phoneme segment storage unit 51, in which a plurality of phoneme segments are stored, and an auxiliary information storage unit 52, in which auxiliary information on each phoneme segment in the phoneme segment storage unit 51 is stored. The auxiliary information includes the power (POW^Dic), the fundamental frequency (F0^Dic), and the duration (DUR^Dic) of each phoneme segment, among others.
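As a rough illustration, the dictionary layout of FIG. 2 could be modeled along the following lines. This is only a sketch: the class names `SegmentEntry` and `WaveformDictionary`, the field names, and the units (Hz, ms) are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SegmentEntry:
    """One phoneme segment plus its auxiliary information (hypothetical layout)."""
    phoneme: str                         # phoneme label, e.g. "a"
    pitch_waveforms: List[List[float]]   # waveform data (storage unit 51)
    pow_dic: float                       # POW^Dic: power
    f0_dic: float                        # F0^Dic: fundamental frequency (Hz assumed)
    dur_dic: float                       # DUR^Dic: duration (ms assumed)


@dataclass
class WaveformDictionary:
    """Maps each phoneme label to its candidate segments."""
    entries: Dict[str, List[SegmentEntry]] = field(default_factory=dict)

    def add(self, entry: SegmentEntry) -> None:
        self.entries.setdefault(entry.phoneme, []).append(entry)

    def candidates(self, phoneme: str) -> List[SegmentEntry]:
        # An unknown phoneme simply has no candidates.
        return self.entries.get(phoneme, [])


dictionary = WaveformDictionary()
dictionary.add(SegmentEntry("a", [[0.0, 1.0, 0.0]], 60.0, 120.0, 80.0))
dictionary.add(SegmentEntry("a", [[0.0, 0.8, 0.0]], 58.0, 110.0, 95.0))
print(len(dictionary.candidates("a")))
```

Keeping several candidates per phoneme is what gives the selection step something to choose between.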
[0008]
The phoneme unit selection unit 3 selects, from among the combinations of phoneme segments stored in the waveform dictionary 5, a combination with small distortion. This distortion consists of the following components.
[0009]
As shown in FIG. 3, let u_{i-1}, u_i, and u_{i+1} be phoneme segments extracted from the waveform dictionary 5, and let t_{i-1}, t_i, and t_{i+1} be the environments in which they are actually used (the targets). The distortion for u_i then consists of C^t_i and C^c_i.
[0010]
Here, C^t_i is the distortion between the phoneme segment (u_i) extracted from the dictionary for the i-th phoneme and the environment in which it is actually used (target t_i). C^c_i is the distortion produced when the i-th phoneme segment (u_i) is connected to the (i-1)-th segment (u_{i-1}). The phoneme unit selection unit 3 connects phoneme segments using dynamic programming (the DP method) and selects the combination of segments that minimizes C^all, the sum of C^t_i and C^c_i over all input phonemes.
[0011]
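C^t_i is expressed by the following equation (1).
[0012]
$$C^{t}_{i} = w^{t}_{\mathrm{POW}}\, D^{t}_{\mathrm{POW}}(t_i, u_i) + w^{t}_{\mathrm{F0}}\, D^{t}_{\mathrm{F0}}(t_i, u_i) + w^{t}_{\mathrm{DUR}}\, D^{t}_{\mathrm{DUR}}(t_i, u_i) \tag{1}$$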

[0013]
In the above equation (1), each variable is defined as follows.
[0014]
D^t_POW(t_i, u_i) is, for the i-th phoneme, the square of the distance between the power POW^Dic(i) of the phoneme segment (u_i) extracted from the dictionary and the power POW^T(i) of the environment in which it is actually used (target t_i), namely {POW^Dic(i) - POW^T(i)}^2.
[0015]
w^t_POW is the weighting factor for D^t_POW(t_i, u_i).
[0016]
D^t_F0(t_i, u_i) is, for the i-th phoneme, the square of the distance between the fundamental frequency F0^Dic(i) of the phoneme segment (u_i) extracted from the dictionary and the fundamental frequency F0^T(i) of the environment in which it is actually used (target t_i), namely {F0^Dic(i) - F0^T(i)}^2.
[0017]
w^t_F0 is the weighting factor for D^t_F0(t_i, u_i).
[0018]
D^t_DUR(t_i, u_i) is, for the i-th phoneme, the square of the distance between the duration DUR^Dic(i) of the phoneme segment (u_i) extracted from the dictionary and the duration DUR^T(i) of the environment in which it is actually used (target t_i), namely {DUR^Dic(i) - DUR^T(i)}^2.
[0019]
w^t_DUR is the weighting factor for D^t_DUR(t_i, u_i).
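The definitions above suggest that equation (1) is a weighted sum of the three squared distances. Under that assumption, the target distortion could be computed as in the following minimal sketch. The function name, the dictionary keys, and the default weights of 1.0 are illustrative choices, not values from the patent.

```python
def target_cost(target, segment, w_pow=1.0, w_f0=1.0, w_dur=1.0):
    """C^t_i as a weighted sum of squared differences (assumed form of eq. (1)).

    `target` and `segment` are dicts with the keys "pow", "f0" and "dur".
    """
    d_pow = (segment["pow"] - target["pow"]) ** 2   # D^t_POW
    d_f0 = (segment["f0"] - target["f0"]) ** 2      # D^t_F0
    d_dur = (segment["dur"] - target["dur"]) ** 2   # D^t_DUR
    return w_pow * d_pow + w_f0 * d_f0 + w_dur * d_dur


target = {"pow": 60.0, "f0": 120.0, "dur": 80.0}
segment = {"pow": 62.0, "f0": 118.0, "dur": 83.0}
print(target_cost(target, segment))  # 4 + 4 + 9 = 17.0
```

In practice the three quantities have different units and ranges, so the weights would presumably be tuned to balance them.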
[0020]
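C^c_i is expressed by the following equation (2).
[0021]
$$C^{c}_{i} = w^{c}_{\mathrm{POW}}\, D^{c}_{\mathrm{POW}}(u_i, u_{i-1}) + w^{c}_{\mathrm{F0}}\, D^{c}_{\mathrm{F0}}(u_i, u_{i-1}) + w^{c}_{\mathrm{SPC}}\, D^{c}_{\mathrm{SPC}}(u_i, u_{i-1}) \tag{2}$$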

[0022]
In the above equation (2), each variable is defined as follows.
[0023]
D^c_POW(u_i, u_{i-1}) is the square of the distance between the power POW^DicS(i) at the start of the i-th phoneme segment (u_i) and the power POW^DicE(i-1) at the end of the (i-1)-th phoneme segment (u_{i-1}), namely {POW^DicS(i) - POW^DicE(i-1)}^2.
[0024]
w^c_POW is the weighting factor for D^c_POW(u_i, u_{i-1}).
[0025]
D^c_F0(u_i, u_{i-1}) is the square of the distance between the fundamental frequency F0^DicS(i) at the start of the i-th phoneme segment (u_i) and the fundamental frequency F0^DicE(i-1) at the end of the (i-1)-th phoneme segment (u_{i-1}), namely {F0^DicS(i) - F0^DicE(i-1)}^2.
[0026]
w^c_F0 is the weighting factor for D^c_F0(u_i, u_{i-1}).
[0027]
D^c_SPC(u_i, u_{i-1}) is the square of the distance between the spectrum at the start of the i-th phoneme segment (u_i), SPC^DicS(i, j), j = 1 to 16, and the spectrum at the end of the (i-1)-th phoneme segment (u_{i-1}), SPC^DicE(i-1, j), j = 1 to 16, namely {SPC^DicS(i, j) - SPC^DicE(i-1, j)}^2.
[0028]
w^c_SPC is the weighting factor for D^c_SPC(u_i, u_{i-1}).
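The conventional connection distortion described above can be sketched in the same way. Two points here are assumptions rather than statements from the patent: that equation (2) is a weighted sum of the three terms, and that the spectral term sums the squared differences over the 16 coefficients j = 1 to 16. The key names are hypothetical.

```python
def concat_cost(prev_seg, seg, w_pow=1.0, w_f0=1.0, w_spc=1.0):
    """C^c_i between the end of prev_seg (u_{i-1}) and the start of seg (u_i)."""
    d_pow = (seg["pow_start"] - prev_seg["pow_end"]) ** 2   # D^c_POW
    d_f0 = (seg["f0_start"] - prev_seg["f0_end"]) ** 2      # D^c_F0
    # D^c_SPC: summed over the spectral coefficients (assumed aggregation).
    d_spc = sum((s - e) ** 2
                for s, e in zip(seg["spc_start"], prev_seg["spc_end"]))
    return w_pow * d_pow + w_f0 * d_f0 + w_spc * d_spc


prev_seg = {"pow_end": 60.0, "f0_end": 120.0, "spc_end": [0.0] * 16}
seg = {"pow_start": 61.0, "f0_start": 122.0, "spc_start": [0.5] * 16}
print(concat_cost(prev_seg, seg))  # 1 + 4 + 16 * 0.25 = 9.0
```

Note that none of these three terms depends on where the pitch marks were placed, which is the gap the invention addresses.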
[0029]
The sum C^all of C^t_i and C^c_i over all input phonemes is expressed by the following equation (3).
[0030]
$$C^{all} = \sum_{i} \left( C^{t}_{i} + C^{c}_{i} \right) \tag{3}$$
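The patent states only that dynamic programming is used to minimize C^all. One common way to do this is a Viterbi-style search over the candidate segments, sketched below. The function `select_units`, its arguments, and the choice of a zero connection cost for the first segment are assumptions made for illustration.

```python
def select_units(targets, candidates, target_cost, concat_cost):
    """Return (total_cost, chosen_indices) minimizing the sum of C^t_i + C^c_i.

    candidates[i] is the list of dictionary segments for the i-th phoneme.
    The first segment has no predecessor, so its connection cost is taken
    as 0 (an assumption; the boundary case is not spelled out in the text).
    """
    # best[i][k]: minimum cost of any path ending in candidate k of phoneme i
    best = [[target_cost(targets[0], c) for c in candidates[0]]]
    back = [[-1] * len(candidates[0])]
    for i in range(1, len(targets)):
        row, ptr = [], []
        for c in candidates[i]:
            costs = [best[i - 1][j] + concat_cost(p, c)
                     for j, p in enumerate(candidates[i - 1])]
            j_min = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[j_min] + target_cost(targets[i], c))
            ptr.append(j_min)
        best.append(row)
        back.append(ptr)
    # Trace back from the cheapest final candidate.
    k = min(range(len(best[-1])), key=best[-1].__getitem__)
    total = best[-1][k]
    path = [k]
    for i in range(len(targets) - 1, 0, -1):
        k = back[i][k]
        path.append(k)
    path.reverse()
    return total, path


# Toy example: segments and targets are plain numbers, costs are squared gaps.
sq = lambda a, b: (b - a) ** 2
result = select_units([0, 10], [[0, 4], [10, 5]], sq, sq)
print(result)  # (42, [1, 1])
```

In the toy example, picking the segment closest to each target in isolation (0, then 10) would cost 100 because of the jump between them, whereas the search settles on 4 and 5 for a total of 42. This is the trade-off between target distortion and connection distortion that the selection step has to resolve.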

[0031]
Here, the speech waveform generation unit 4 synthesizes speech using the waveform superposition method. The waveform superposition method is one of the methods that deform the selected phoneme segments so that they match the target fundamental frequency F0^T and duration DUR^T. That is, as shown in FIG. 4, pitch waveforms (x1, x2, x3, ...) are extracted by multiplying the original waveform from which a phoneme segment is generated by windows (w1, w2, w3, ...) that are two pitch periods wide and synchronized with the pitch of the original waveform.
[0032]
The group of pitch waveforms extracted from the original waveform in this way is registered in the waveform dictionary 5 as one phoneme segment corresponding to the original waveform. The desired waveform is obtained by rearranging these pitch waveforms at intervals corresponding to the target fundamental frequency F0^T, repeating or thinning out the same waveform as needed to match the duration DUR^T, and adding them together. The window is positioned so that its center falls on a position called a pitch mark, which is set for every pitch period.
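The extraction and re-placement steps can be sketched as follows. This is a simplified illustration, not the patent's implementation: it assumes a constant pitch period in samples, a Hann window (the window shape is not specified in the text), and a nearest-index rule for repeating or thinning out pitch waveforms. All function names are hypothetical.

```python
import math


def hann(n):
    """Symmetric Hann window with n samples (n >= 2)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * k / (n - 1)) for k in range(n)]


def extract_pitch_waveforms(signal, pitch_marks, period):
    """Cut one two-period-wide windowed pitch waveform around each pitch mark."""
    win = hann(2 * period + 1)          # window center lands on the pitch mark
    waves = []
    for m in pitch_marks:
        wave = []
        for k, w in enumerate(win):
            idx = m - period + k
            sample = signal[idx] if 0 <= idx < len(signal) else 0.0
            wave.append(sample * w)
        waves.append(wave)
    return waves


def overlap_add(waves, target_period, n_marks):
    """Place pitch waveforms at target_period spacing and add them together."""
    out = [0.0] * (target_period * (n_marks - 1) + len(waves[0]))
    for j in range(n_marks):
        # Map output mark j to a source waveform; this repeats or skips
        # waveforms when n_marks differs from len(waves).
        src = min(len(waves) - 1,
                  round(j * (len(waves) - 1) / max(1, n_marks - 1)))
        start = j * target_period
        for k, v in enumerate(waves[src]):
            out[start + k] += v
    return out


period = 20
signal = [math.sin(2.0 * math.pi * n / period) for n in range(200)]
marks = list(range(period, 180, period))              # 20, 40, ..., 160
waves = extract_pitch_waveforms(signal, marks, period)
same = overlap_add(waves, period, len(marks))         # pitch unchanged
higher = overlap_add(waves, 16, len(marks))           # closer spacing, higher pitch
print(len(same), len(higher))
```

With the spacing left unchanged, overlapping Hann windows sum to one, so the interior of `same` reproduces the input. Changing `target_period` shifts the pitch, and changing `n_marks` changes the duration.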
[0033]
[Problems to be solved by the invention]
If the fundamental frequency, power, and frequency envelope of the original waveforms from which phoneme segments are generated are equal, waveforms (phoneme segments) of the same shape should be obtained. However, if the pitch marks are placed differently, the extracted pitch waveforms (phoneme segments) take different shapes, which can cause distortion when the phoneme segments are connected. Placing pitch marks is a very difficult task, and it is impossible to place all of them at relatively the same position.
[0034]
The position of a pitch mark is determined as follows. A pitch mark is set manually at a position such that the pitch mark interval is close to the pitch period of the original waveform, the pitch mark interval does not change abruptly, and the position is a rising zero crossing immediately before the largest peak within one pitch period of the original waveform.
[0035]
Because the shape of a speech waveform changes with the words uttered, it is impossible to set pitch marks so that all of these conditions are satisfied. The pitch marks are therefore set while compromising on the individual conditions as appropriate. Moreover, which condition is compromised is decided on the basis of the sound quality of the synthesized speech, so the compromised condition may differ from phoneme to phoneme. As a result, even for the same phoneme, the position of the pitch mark may differ depending on the type of phoneme that precedes it in the speech waveform from which its original waveform was extracted.
[0036]
An object of the present invention is to provide a speech synthesis method capable of reducing the connection distortion of phoneme segments caused by the position of pitch marks.
[0037]
[Means for Solving the Problems]
In the speech synthesis method according to the present invention, a plurality of phoneme units, together with auxiliary information used for each phoneme unit to calculate the distortion from a target and the connection distortion between phoneme units, are stored in a waveform dictionary. Among the combinations of phoneme units stored in the waveform dictionary, the combination that minimizes the sum of the distortion from the target and the connection distortion between phoneme units is selected, and a synthesized speech waveform is generated by the waveform superposition method based on the selected combination. The method is characterized in that phase information of the pitch waveform at the start of each phoneme unit and phase information of the pitch waveform at the end of each phoneme unit are added to the auxiliary information of that phoneme unit, and in that the distance between the phase of the pitch waveform at the end of the preceding one of two phoneme units to be connected and the phase of the pitch waveform at the start of the following phoneme unit is added as a parameter for calculating the connection distortion between phoneme units.
[0038]
As the phase information of a pitch waveform, for example, information is used on the temporal distance between a feature point, which represents a shape characteristic within one pitch period of the original waveform, and the pitch mark, which is the center position of the window, when the pitch waveforms forming a phoneme unit are extracted by multiplying the original waveform of the phoneme unit by windows synchronized with the pitch of the original waveform.
[0039]
[Embodiments of the Invention]
Hereinafter, embodiments of the present invention will be described.
[0040]
The overall configuration of the speech synthesizer is the same as in FIG. 1.
[0041]
This embodiment differs from the conventional method in the following points (1) and (2).
[0042]
(1) As shown in FIG. 5, the phase PHAS^Dic(u_i) of the pitch waveforms of a phoneme segment is added to the auxiliary information of each phoneme segment. PHAS^Dic(u_i) comprises the phase PHAS^DicS(u_i) of the pitch waveform at the start of the phoneme segment and the phase PHAS^DicE(u_i) of the pitch waveform at the end of the phoneme segment.
[0043]
Here, as shown in FIG. 6, the phase of a pitch waveform is a numerical value indicating the temporal distance between the center position of the window used to extract the pitch waveform from the original waveform (the pitch mark position) and a position representing a shape characteristic within one pitch period (the feature point). As the feature point, for example, the position of the zero crossing immediately before the maximum value within one pitch period is used. In FIG. 6, pitch waveform 1 is the pitch waveform obtained when window 1 is used, and pitch waveform 2 is the pitch waveform obtained when window 2 is used.
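A minimal sketch of how such a phase value might be computed from sampled data follows. Several details are assumptions: the distance is measured in samples, it is kept signed (the text only calls it a temporal distance), and the zero crossing is located as the first non-negative sample after a negative one. The function names are hypothetical.

```python
def feature_point(samples, start, end):
    """Index of the zero crossing just before the maximum in samples[start:end].

    Returns the index of the first non-negative sample after the crossing,
    or `start` if no crossing is found before the peak.
    """
    peak = max(range(start, end), key=lambda n: samples[n])
    for n in range(peak, start, -1):
        if samples[n - 1] < 0.0 <= samples[n]:
            return n
    return start


def pitch_phase(samples, pitch_mark, start, end):
    """Signed distance in samples from the feature point to the pitch mark."""
    return pitch_mark - feature_point(samples, start, end)


# One pitch period; the maximum is at index 4, the crossing just before it at 3.
one_period = [-1.0, -2.0, -1.0, 1.0, 3.0, 1.0, -1.0, -2.0]
print(pitch_phase(one_period, 3, 0, 8))  # 0: mark placed on the feature point
print(pitch_phase(one_period, 5, 0, 8))  # 2: mark placed two samples later
```

Two segments whose pitch marks sit at different offsets from this feature point yield differently shaped pitch waveforms even if the underlying speech is the same, which is what the phase value is meant to capture.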
[0044]
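(2) A phase distortion D^c_phas(u_i, u_{i-1}) is added as a parameter to the connection distortion C^c_i.
[0045]
That is, the connection distortion C^c_i is expressed by the following equation (4).
[0046]
$$C^{c}_{i} = w^{c}_{\mathrm{POW}}\, D^{c}_{\mathrm{POW}}(u_i, u_{i-1}) + w^{c}_{\mathrm{F0}}\, D^{c}_{\mathrm{F0}}(u_i, u_{i-1}) + w^{c}_{\mathrm{SPC}}\, D^{c}_{\mathrm{SPC}}(u_i, u_{i-1}) + W^{c}_{\mathrm{phas}}\, D^{c}_{\mathrm{phas}}(u_i, u_{i-1}) \tag{4}$$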

[0047]
In equation (4) above, W^c_phas is the weighting factor for D^c_phas(u_i, u_{i-1}). D^c_phas(u_i, u_{i-1}) is the square of the distance between the phase PHAS^DicS(u_i) of the pitch waveform at the start of the i-th phoneme segment u_i and the phase PHAS^DicE(u_{i-1}) of the pitch waveform at the end of the (i-1)-th phoneme segment u_{i-1}, and is expressed by the following equation (5).
[0048]
$$D^{c}_{\mathrm{phas}}(u_i, u_{i-1}) = \left\{ \mathrm{PHAS}^{DicS}(u_i) - \mathrm{PHAS}^{DicE}(u_{i-1}) \right\}^{2} \tag{5}$$

[0049]
That is, in this embodiment the phase information of the pitch waveforms of the phoneme segments is also added as a parameter of the connection distortion C^c_i, and phoneme segments are selected so that the distortion due to this parameter becomes small. This makes it possible to reduce the audible distortion that occurs when phoneme segments are connected.
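The added term can be sketched as a small extension of an existing connection cost. The sketch assumes the phase term enters as one more weighted summand, as the description of equation (4) suggests. The key names `phas_start` and `phas_end`, the function names, and the weight value are hypothetical.

```python
def phase_distortion(seg, prev_seg):
    """D^c_phas: squared gap between the start phase of seg (u_i) and the
    end phase of prev_seg (u_{i-1})."""
    return (seg["phas_start"] - prev_seg["phas_end"]) ** 2


def concat_cost_with_phase(base_cost, seg, prev_seg, w_phas=1.0):
    """Conventional connection cost plus the weighted phase term.

    `base_cost` stands for the power, F0 and spectrum terms computed elsewhere.
    """
    return base_cost + w_phas * phase_distortion(seg, prev_seg)


prev_seg = {"phas_end": 3}
seg = {"phas_start": 5}
print(phase_distortion(seg, prev_seg))                          # (5 - 3)^2 = 4
print(concat_cost_with_phase(9.0, seg, prev_seg, w_phas=0.5))   # 9 + 0.5 * 4 = 11.0
```

Because the phase values are stored in the auxiliary information ahead of time, the extra term adds only one subtraction and one multiplication per candidate pair during selection.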
[0050]
For example, as shown in FIG. 7, suppose that the pitch waveform at the end of the (i-1)-th phoneme segment u_{i-1} is X0, and that waveforms X11, X12, and X13 are candidates for the i-th phoneme segment u_i.
[0051]
Waveform X13 has a spectrum different from that of the (i-1)-th phoneme segment u_{i-1}. Waveform X12 has a spectrum close to that of the (i-1)-th phoneme segment u_{i-1}, but the phase of its pitch waveform is different. Waveform X11 has a spectrum close to that of the (i-1)-th phoneme segment u_{i-1} and a pitch waveform phase that is almost the same. In this case, therefore, waveform X11, which gives the smallest connection distortion, is selected as the i-th phoneme segment u_i.
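The selection in FIG. 7 can be mimicked with made-up numbers. The values below are invented purely for illustration and do not come from the patent. The power and F0 terms are left out for brevity, the spectrum is reduced to two coefficients, and both weights are set to 1.0.

```python
def connection_cost(prev_end, cand, w_spc=1.0, w_phas=1.0):
    """Spectral term plus phase term of the connection distortion."""
    d_spc = sum((s - e) ** 2
                for s, e in zip(cand["spc_start"], prev_end["spc_end"]))
    d_phas = (cand["phas_start"] - prev_end["phas_end"]) ** 2
    return w_spc * d_spc + w_phas * d_phas


x0 = {"spc_end": [10, 20], "phas_end": 2}
candidates = {
    "X11": {"spc_start": [10, 21], "phas_start": 2},   # close spectrum, same phase
    "X12": {"spc_start": [10, 21], "phas_start": 8},   # close spectrum, other phase
    "X13": {"spc_start": [30, 5], "phas_start": 2},    # different spectrum
}

costs = {name: connection_cost(x0, c) for name, c in candidates.items()}
best = min(costs, key=costs.get)
print(costs)  # {'X11': 1.0, 'X12': 37.0, 'X13': 625.0}
print(best)   # X11

# Without the phase term, X11 and X12 are indistinguishable.
no_phase = {name: connection_cost(x0, c, w_phas=0.0)
            for name, c in candidates.items()}
print(no_phase["X11"] == no_phase["X12"])  # True
```

The last line shows the situation the invention targets: with spectral distance alone, X12 would be rated as good a match as X11 despite its mismatched pitch mark position.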
[0052]
[Effects of the Invention]
According to the present invention, the connection distortion of phoneme segments caused by the position of pitch marks can be reduced.
[Brief Description of the Drawings]
FIG. 1 is a block diagram showing the overall configuration of a speech synthesizer.
FIG. 2 is a schematic diagram showing the contents of the waveform dictionary 5.
FIG. 3 is a schematic diagram for explaining the two types of distortion, C^t_i and C^c_i, used by the phoneme unit selection unit 3 to select a combination of phoneme segments.
FIG. 4 is a schematic diagram for explaining the waveform superposition method.
FIG. 5 is a schematic diagram showing how the phase PHAS^DicS(u_i) of the pitch waveform at the start of a phoneme segment and the phase PHAS^DicE(u_i) of the pitch waveform at the end of the phoneme segment have been added to the auxiliary information of the phoneme segment.
FIG. 6 is a schematic diagram for explaining the phase of a pitch waveform.
FIG. 7 is a schematic diagram showing an example of phoneme segment selection.

Claims

A speech synthesis method in which a plurality of phoneme units, together with auxiliary information used for each phoneme unit to calculate the distortion from a target and the connection distortion between phoneme units, are stored in a waveform dictionary; the combination that minimizes the sum of the distortion from the target and the connection distortion between phoneme units is selected from among the combinations of phoneme units stored in the waveform dictionary; and a synthesized speech waveform is generated by the waveform superposition method based on the selected combination of phoneme units,
wherein phase information of the pitch waveform at the start of each phoneme unit and phase information of the pitch waveform at the end of each phoneme unit are added to the auxiliary information of that phoneme unit, and the distance between the phase of the pitch waveform at the end of the preceding one of two phoneme units to be connected and the phase of the pitch waveform at the start of the following phoneme unit is added as a parameter for calculating the connection distortion between phoneme units.

The speech synthesis method according to claim 1, wherein the phase information of a pitch waveform is information on the temporal distance between a feature point, which represents a shape characteristic within one pitch period of the original waveform, and the pitch mark, which is the center position of the window, when the pitch waveforms forming a phoneme unit are extracted by multiplying the original waveform of the phoneme unit by windows synchronized with the pitch of the original waveform.