JP4639532B2

JP4639532B2 - Node extractor for natural speech

Info

Publication number: JP4639532B2
Application number: JP2001169140A
Authority: JP
Inventors: 一史芹生
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2001-06-05
Filing date: 2001-06-05
Publication date: 2011-02-23
Anticipated expiration: 2021-06-05
Also published as: JP2002366177A

Description

[0001]
[Technical field of the invention]
The present invention relates to a node extraction device for a speech synthesizer, and more particularly to a natural speech node extraction device that extracts nodes necessary for generating a pitch pattern by spline approximation.
[0002]
[Prior art]
Recent speech synthesizers synthesize speech by a rule-based synthesis method. In this method, nodes are supplied as parameters to the rule-based synthesis engine, which generates a pitch pattern, that is, the pattern of change over time of the fundamental frequency of the speech. The pitch pattern is then used as one of the rules for synthesizing speech.
[0003]
The speech synthesizer stores in advance the nodes that the node extraction device has extracted from natural speech, performs spline approximation based on those nodes, and generates a pitch pattern. Spline approximation is a process that connects discrete points, called nodes, in order and uses a spline function to approximate them by a curve that is smooth as a whole.
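The idea of generating a pitch pattern from nodes can be sketched as follows. This is only an illustration: the text does not say which spline function the synthesizer uses, so a natural cubic spline (zero second derivative at both ends) is assumed here, and the function name natural_cubic_spline and the (time, frequency) tuple layout are hypothetical.

```python
from bisect import bisect_right

def natural_cubic_spline(nodes):
    """Return f(t) interpolating (time, frequency) nodes with a natural cubic spline.

    Assumes at least two nodes with strictly increasing times.
    """
    xs = [t for t, _ in nodes]
    ys = [f for _, f in nodes]
    n = len(xs) - 1                       # number of intervals
    h = [xs[i + 1] - xs[i] for i in range(n)]

    # Second derivatives m[i] at the nodes; natural boundary: m[0] = m[n] = 0.
    m = [0.0] * (n + 1)
    if n >= 2:
        diag = [2.0 * (h[j] + h[j + 1]) for j in range(n - 1)]
        rhs = [6.0 * ((ys[j + 2] - ys[j + 1]) / h[j + 1]
                      - (ys[j + 1] - ys[j]) / h[j]) for j in range(n - 1)]
        # Thomas algorithm for the tridiagonal system: forward elimination ...
        for j in range(1, n - 1):
            w = h[j] / diag[j - 1]
            diag[j] -= w * h[j]
            rhs[j] -= w * rhs[j - 1]
        # ... followed by back substitution.
        m[n - 1] = rhs[n - 2] / diag[n - 2]
        for j in range(n - 3, -1, -1):
            m[j + 1] = (rhs[j] - h[j + 1] * m[j + 2]) / diag[j]

    def f(t):
        # Locate the interval [xs[i], xs[i+1]] that contains t (clamped at the ends).
        i = min(max(bisect_right(xs, t) - 1, 0), n - 1)
        a, b = xs[i + 1] - t, t - xs[i]
        return ((m[i] * a ** 3 + m[i + 1] * b ** 3) / (6.0 * h[i])
                + (ys[i] / h[i] - m[i] * h[i] / 6.0) * a
                + (ys[i + 1] / h[i] - m[i + 1] * h[i] / 6.0) * b)

    return f

# Example: four nodes (seconds, Hz) turned into a smooth pitch curve.
pitch = natural_cubic_spline([(0.0, 180.0), (0.1, 240.0), (0.3, 200.0), (0.5, 150.0)])
```

The returned curve passes exactly through every node, so the quality of the generated pitch pattern depends on where the nodes are placed, which is the problem the node extraction device addresses.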
[0004]
A rule-based speech synthesizer uses the generated pitch pattern as one of its speech synthesis rules and directly synthesizes continuous speech of arbitrary vocabulary from separately input phonetic symbols or characters. It is therefore important that the node extraction device can extract nodes without being affected by conditions such as the speaker's gender and speech rate, and several proposals have been made.
[0005]
IEICE Technical Report SP2000-29 describes a node extraction method used in a natural speech node extraction device (Technical Report of the Institute of Electronics, Information and Communication Engineers, published July 2000, page 20; authors: Hiroyoshi Morikawa, Naohiro Tsuboi, and Yuichiro Yanagi; title: "Modeling and Analysis of Speech Pitch Patterns by Smoothing Spline Function"). In the node extraction method (node selection method) performed by this device, the fundamental frequency is extracted from natural speech and plotted against time to obtain a plurality of data points, and nodes are selected from those data points using the two methods described below.
[0006]
FIG. 6 is a flowchart of the first node selection method. The start point and end point of the data points are taken as two nodes, the span from the start point to the end point is divided equally at a time interval dt, and the data point nearest to each division point is extracted as a node candidate (step S81). The slope between adjacent node candidates is obtained, and a candidate is deleted from the node candidates if the magnitude of the slope is smaller than the threshold TH2 (step S82).
[0007]
Based on the nodes and the node candidates, a smoothing spline function is obtained and its error from the fundamental frequency pattern curve is calculated (step S83). Here, the fundamental frequency pattern curve means the set of data points treated as a curve. If the error in step S83 is smaller than the threshold TH3, the node candidate with the smallest slope magnitude is deleted and processing is repeated from step S83 (step S86). If the error in step S83 is larger than the threshold TH3, the node candidates that finally remain are determined to be nodes (step S85).
[0008]
FIG. 7 is a flowchart of the second node selection method. The start point and end point of the data points are taken as two nodes, the two nodes are connected by a straight line, and the data point farthest from that line is taken as a node candidate (step S91). Based on the nodes and the node candidates, a smoothing spline function is obtained and its error from the fundamental frequency pattern curve is calculated (step S92).
[0009]
If the error in step S92 is larger than the threshold TH3, the data point farthest from the spline function is added as a new node candidate and processing is repeated from step S92 (step S95). If the error in step S92 is smaller than the threshold TH3, the node candidates that finally remain are determined to be nodes (step S94).
[0010]
[Problems to be solved by the invention]
In the conventional natural speech node extraction device described above, the time interval dt used for division in the first node selection method depends on the speech rate, so dt must be determined empirically. In addition, the threshold TH2 or TH3 used in the comparisons must be changed according to the speech rate and determined empirically, so nodes cannot be obtained stably. The second node selection method, like the first, uses the threshold TH2 or TH3, so it also depends on the speech rate and requires empirical tuning.
[0011]
In general, if nodes are extracted from the fundamental frequency pattern curve of natural speech without considering its shape, the pitch pattern generated by spline approximation from those nodes shows artifacts such as undulation, and a shape or pattern different from the fundamental frequency pattern of the natural speech may be generated.
[0012]
In the conventional natural speech node extraction device described above, nodes are selected by checking whether the error based on the nodes and node candidates is within a threshold. As a result, a pitch pattern different from the fundamental frequency pattern of the natural speech may be generated, and the shape of the fundamental frequency pattern curve cannot be said to be sufficiently taken into account.
[0013]
The present invention has been made to solve the above problems of the prior art, and provides a natural speech node extraction device that stably and efficiently extracts the nodes needed to generate a pitch pattern by spline approximation.
[0014]
[Means for Solving the Problems]
To achieve the above object, the natural speech node extraction device of the present invention is characterized by comprising: a pattern extraction unit that extracts a fundamental frequency pattern of each of a plurality of natural speech utterances; an input unit that inputs, for each natural speech utterance, language information including the start time and end time of an accent phrase; a pattern division unit that divides each fundamental frequency pattern into accent phrases based on the accent phrase start time and end time included in the language information input by the input unit; a differential operation unit that obtains a first derivative curve and a second derivative curve of the divided fundamental frequency pattern; and a node extraction unit that extracts, as nodes of the fundamental frequency pattern, the start point and end point of the divided fundamental frequency pattern, the zero crossing point of the first derivative curve, and the highest point and lowest point of the second derivative curve.
[0015]
The natural speech node extraction device of the present invention extracts, as nodes of the fundamental frequency pattern curve, the start point and end point of the fundamental frequency pattern curve obtained by dividing natural speech into accent phrases and extracting the fundamental frequency, the zero crossing point of its first derivative curve, and the highest point and lowest point of its second derivative curve. As a result, nodes are reliably extracted at the change points and similar features that characterize the shape of the fundamental frequency pattern curve. Because these nodes are extracted independently of the speech rate, node extraction is stable and efficient.
[0017]
In the natural speech node extraction device of the present invention, the language information about the natural speech preferably includes information on whether the utterance is a question sentence and information on an accent position time indicating whether an accent position is included. In this case, when the language information includes information indicating a question sentence, the node extraction unit can extract, as a node, the lowest frequency point immediately before the end point of the divided fundamental frequency pattern; and when the language information includes accent time information indicating that an accent position is included, it can extract, as nodes, the highest point and the lowest point of the second derivative curve after the zero crossing point of the first derivative curve. Because language information such as the accent position allows the shape of the fundamental frequency pattern curve to be taken fully into account, nodes are extracted more reliably at the change points and similar features that characterize that shape.
[0018]
It is also a preferable aspect of the present invention that the node extraction unit newly extracts, as a node, the midpoint between two adjacent nodes among the nodes extracted as nodes of the fundamental frequency pattern. In this case, the pitch pattern can be generated from the nodes more reliably.
[0019]
[Embodiments of the invention]
The natural speech node extraction device of the present invention is described below with reference to the drawings, based on an embodiment. FIG. 1 is a block diagram of a natural speech node extraction device according to an embodiment of the present invention. The device comprises a pattern extraction unit 1, an input unit 2, a pattern division unit 3, an unvoiced control unit 4, a differential operation unit 5, a node extraction unit 6, and a node information output unit 7.
[0020]
The pattern extraction unit 1 extracts a fundamental frequency pattern from the input natural speech 102 and inputs it to the pattern division unit 3. The fundamental frequency pattern is a set of data points obtained by extracting the fundamental frequency at extraction times spaced at short intervals. Each data point consists of an extraction time and a fundamental frequency.
[0021]
The input unit 2 passes the input language information 101 to the pattern division unit 3. The language information 101 includes the start time and end time of each accent phrase, the accent position time, the start and end times of the consonants and vowels in each accent phrase, a sentence type indicating whether the utterance is a question sentence or a declarative sentence, and so on. The pattern division unit 3 divides the fundamental frequency pattern into accent phrases based on the language information 101 and inputs the result to the unvoiced control unit 4.
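As a rough illustration of the data handled by the pattern extraction unit and the pattern division unit, the sketch below represents each data point as a (time, frequency) tuple and groups the points by accent phrase. The class name AccentPhrase, the function name split_by_accent_phrase, and the use of seconds and Hz are assumptions made for this example, not details given in the text.

```python
from dataclasses import dataclass

@dataclass
class AccentPhrase:
    start: float  # accent phrase start time in seconds (assumed unit)
    end: float    # accent phrase end time in seconds (assumed unit)

def split_by_accent_phrase(f0_points, phrases):
    """Group (time, f0) data points by the accent phrase whose span contains them."""
    return [[(t, f) for t, f in f0_points if p.start <= t <= p.end]
            for p in phrases]

# Example: five data points divided into two accent phrases.
points = [(0.00, 200.0), (0.05, 220.0), (0.10, 210.0), (0.15, 190.0), (0.20, 230.0)]
segments = split_by_accent_phrase(
    points, [AccentPhrase(0.0, 0.10), AccentPhrase(0.15, 0.20)])
```

Each resulting segment is then processed on its own, which is what lets the later steps refer to a single start point, peak, and end point per accent phrase.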
[0022]
FIG. 2 shows information on the natural speech utterance "yoroshii desu ka" ("Is that all right?"). The frequencies produced at each time are plotted as points. The black shaded area in the figure shows the frequency characteristics of the natural speech (spectrum display). As shown in FIG. 2(a), the fundamental frequency of the natural speech appears as a white outlined line (*) within the black shaded area between 200 Hz and 400 Hz.
[0023]
The unvoiced control unit 4 checks for unvoiced sections in the fundamental frequency pattern curve based on the language information 101. If the fundamental frequency pattern curve contains an unvoiced section, errors are likely when extracting the nodes needed for spline approximation, so the curve is interpolated and corrected into a smooth accent phrase pattern curve.
[0024]
As shown in FIG. 2(b), when there is an unvoiced section containing a consonant (“sh”), the curve is extended from the neighboring voiced sections (“o” or “i−”) and interpolated with a straight line or a curve. When the start point or end point of an accent phrase is unvoiced, a value a few Hz lower than the value of the nearby voiced section is used as the interpolated start point or end point. The unvoiced control unit 4 makes the accent phrase pattern curve continuous and smooth so that it passes through the nodes (B1, P1, E1, E2) shown as white circles, and inputs it to the node extraction unit 6.
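The interpolation of unvoiced sections might look like the following sketch. It assumes frames sampled at a fixed interval, with unvoiced frames marked as None, and it uses straight-line interpolation only. The function name fill_unvoiced and the 5 Hz default for "a few Hz" are illustrative assumptions.

```python
def fill_unvoiced(f0, edge_drop_hz=5.0):
    """Fill None (unvoiced) frames in a list of f0 values sampled at a fixed interval."""
    voiced = [i for i, v in enumerate(f0) if v is not None]
    if not voiced:
        raise ValueError("no voiced frame to interpolate from")
    out = list(f0)
    first, last = voiced[0], voiced[-1]
    # Unvoiced start or end point: use a value a few Hz below the nearest voiced frame.
    if first > 0:
        out[0] = f0[first] - edge_drop_hz
    if last < len(f0) - 1:
        out[-1] = f0[last] - edge_drop_hz
    # Straight-line interpolation across every remaining gap.
    anchors = [i for i, v in enumerate(out) if v is not None]
    for a, b in zip(anchors, anchors[1:]):
        for i in range(a + 1, b):
            w = (i - a) / (b - a)
            out[i] = out[a] + w * (out[b] - out[a])
    return out

# Example: unvoiced start, an unvoiced gap in the middle, and an unvoiced end.
curve = fill_unvoiced([None, 200.0, None, None, 230.0, None])
```

A curve-based interpolation, which the text also allows, would replace only the inner loop; the handling of the two end points would stay the same.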
[0025]
FIG. 3 is a flowchart of the node extraction method performed by the natural speech node extraction device of FIG. 1. The differential operation unit 5 obtains the first derivative curve and the second derivative curve of the accent phrase pattern curve and inputs them to the node extraction unit 6. The node extraction unit 6 extracts nodes based on the first derivative curve, the second derivative curve, and the language information 101.
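The text does not state how the derivative curves are computed numerically. One simple possibility, assumed here purely for illustration, is a central finite difference over the sampled curve, applied twice to obtain the second derivative. The function name derivative is hypothetical.

```python
def derivative(times, values):
    """Central-difference derivative, one-sided at the two ends.

    Assumes at least two samples with strictly increasing times.
    """
    n = len(values)
    d = [0.0] * n
    for i in range(n):
        lo, hi = max(i - 1, 0), min(i + 1, n - 1)
        d[i] = (values[hi] - values[lo]) / (times[hi] - times[lo])
    return d

# Example: first and second derivative curves of a sampled curve.
times = [0.0, 1.0, 2.0, 3.0, 4.0]
curve = [0.0, 1.0, 4.0, 9.0, 16.0]
d1 = derivative(times, curve)   # first derivative curve
d2 = derivative(times, d1)      # second derivative curve
```

Finite differences amplify noise, so a real implementation would probably smooth the accent phrase pattern curve first; the interpolation performed by the unvoiced control unit already goes some way toward that.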
[0026]
FIGS. 4(a), (b), and (c) show the accent phrase pattern curve of a declarative sentence, its first derivative curve, and its second derivative curve, respectively. The accent phrase pattern curve has a start point B1 and an end point E1. The first derivative curve has a zero crossing point P1 at which its sign changes from positive to negative. The second derivative curve has a highest point A1 before the zero crossing point P1 and a highest point C2 after it, as well as a lowest point A2 before the zero crossing point P1 and a lowest point C1 after it.
[0027]
The data point B1 at the start of the accent phrase pattern curve is extracted as node B1, and the data point E1 at the end of the accent phrase pattern curve is extracted as node E1 (step S11). The accent phrase pattern curve is differentiated once to obtain the first derivative curve (step S12).
[0028]
The zero crossing point P1 at which the sign of the first derivative curve changes from positive to negative is found, and node P1, the data point P1 on the accent phrase pattern curve corresponding to that zero crossing point, is extracted. If there are several zero crossing points, the one closest to the highest-frequency point of the accent phrase pattern curve is taken as the zero crossing point P1 (step S13). If the sentence type in the language information 101 is not a question sentence (step S14), processing proceeds to step S16.
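Step S13 can be sketched as below for a sampled curve. The function names and the rule for choosing between the two samples that straddle a sign change are assumptions for illustration; the text only specifies that the positive-to-negative crossing nearest the highest-frequency point is used.

```python
def positive_to_negative_crossings(d1):
    """Sample indices where the first derivative goes from > 0 to <= 0."""
    hits = []
    for i in range(len(d1) - 1):
        if d1[i] > 0 and d1[i + 1] <= 0:
            # Keep whichever neighbour is closer to the true zero (ties go to i).
            hits.append(i if abs(d1[i]) <= abs(d1[i + 1]) else i + 1)
    return hits

def select_p1(f0, d1):
    """Step S13 sketch: the crossing nearest the highest-frequency sample, or None."""
    hits = positive_to_negative_crossings(d1)
    if not hits:
        return None
    top = max(range(len(f0)), key=f0.__getitem__)
    return min(hits, key=lambda i: abs(i - top))

# Example: a curve with two peaks; the crossing nearer the higher peak is chosen.
f0 = [100.0, 130.0, 110.0, 120.0, 160.0, 140.0]
d1 = [30.0, 5.0, -5.0, 25.0, 10.0, -20.0]   # first derivative samples of f0
p1 = select_p1(f0, d1)
```

Because the choice depends only on the sign of the first derivative and the location of the maximum, it involves no time interval or threshold that would need retuning for a different speech rate.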
[0029]
FIGS. 5(a), (b), and (c) show the accent phrase pattern curve of a question sentence, its first derivative curve, and its second derivative curve, respectively. The accent phrase pattern curve has a start point B1, a lowest frequency point E1, and an end point E2. The first derivative curve has a zero crossing point P1. The second derivative curve has a highest point A1 before the zero crossing point P1, a highest point C2 after it, and a lowest point C1 after it.
[0030]
If the sentence is a question sentence in step S14, the data point E1 at the lowest frequency point of the accent phrase pattern curve is extracted as node E1, and the data point E2 at the end of the accent phrase pattern curve is extracted as node E2 (step S15). Alternatively, the zero crossing point E1 at which the sign of the first derivative curve changes from negative to positive may be found, and the data point E1 on the accent phrase pattern curve corresponding to that zero crossing point may be taken as the lowest frequency point E1.
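Both ways of locating E1 in step S15 can be sketched as follows. Restricting the search to samples after the peak P1 is an assumption made here so that the rising tail of a question is isolated; the function names are hypothetical.

```python
def select_question_nodes(f0, p1):
    """Step S15 sketch: return sample indices (E1, E2) for a question-type phrase."""
    e2 = len(f0) - 1                                   # end point of the phrase
    e1 = min(range(p1, e2 + 1), key=f0.__getitem__)    # lowest frequency after P1
    return e1, e2

def lowest_point_by_derivative(d1, p1):
    """Alternative: last place after P1 where d1 goes from < 0 to >= 0, or None."""
    for i in range(len(d1) - 2, p1 - 1, -1):
        if d1[i] < 0 and d1[i + 1] >= 0:
            # Keep whichever neighbour is closer to the true zero.
            return i + 1 if abs(d1[i + 1]) <= abs(d1[i]) else i
    return None

# Example: pitch peaks at index 2, falls, then rises again at the end of a question.
f0 = [150.0, 170.0, 190.0, 180.0, 160.0, 140.0, 135.0, 150.0, 175.0]
d1 = [20.0, 20.0, 5.0, -15.0, -20.0, -12.5, 5.0, 20.0, 25.0]
e1, e2 = select_question_nodes(f0, 2)
e1_alt = lowest_point_by_derivative(d1, 2)
```

In this example both approaches pick the same sample, the dip just before the final rise.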
[0031]
As shown in FIG. 4(c), the accent phrase pattern curve is differentiated twice to obtain the second derivative curve (step S16). The peaks of the second derivative curve in the section from node B1 to node P1 of the accent phrase pattern curve are examined, and node A1, the data point A1 on the accent phrase pattern curve corresponding to the highest point A1 before the zero crossing point P1, is extracted. Then, in the section from node A1 to node P1, node A2, the data point A2 on the accent phrase pattern curve corresponding to the lowest point A2 before the zero crossing point P1, is extracted (step S17).
[0032]
Next, the accent position time in the language information 101 is checked, and if no accent position is included (step S18), processing proceeds to step S20. The accent position indicates where the accent falls. For example, in "ankēto" ("questionnaire"), the pitch falls at the sound following "a", so the accent is on "a", and the end position of the "a" sound is the accent position.
[0033]
If an accent position is included, the peaks of the second derivative curve in the section from node P1 to node E1 of the accent phrase pattern curve are examined, and node C2, the data point C2 on the accent phrase pattern curve corresponding to the highest point C2 after the zero crossing point P1, is extracted. Then, in the section from node P1 to node C2, node C1, the data point C1 on the accent phrase pattern curve corresponding to the lowest point C1 after the zero crossing point P1, is extracted (step S19).
[0034]
However, if the second derivative curve has no highest point or lowest point in the specified section in step S17 or S19, the corresponding node on the accent phrase pattern curve is not extracted. FIG. 5(c) shows an example in which the second derivative curve has no lowest point A2 before the zero crossing point P1, so node A2 of the accent phrase pattern curve is not extracted.
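Steps S17 and S19, including the case where a node is skipped, can be sketched as follows. The sketch reads "highest point" and "lowest point" as interior local extrema of the sampled second derivative, which is an interpretation rather than something the text states outright, and it falls back to the whole section when the highest point is missing. All function names are hypothetical, and step S19 would only be run when the language information contains an accent position.

```python
def local_extrema(d2, lo, hi):
    """Interior local maxima and minima of d2 strictly between indices lo and hi."""
    maxima, minima = [], []
    for i in range(lo + 1, hi):
        if d2[i - 1] < d2[i] >= d2[i + 1]:
            maxima.append(i)
        elif d2[i - 1] > d2[i] <= d2[i + 1]:
            minima.append(i)
    return maxima, minima

def nodes_before_peak(d2, b1, p1):
    """Step S17 sketch: A1 = highest peak in (B1, P1); A2 = lowest valley in (A1, P1)."""
    maxima, _ = local_extrema(d2, b1, p1)
    a1 = max(maxima, key=d2.__getitem__) if maxima else None
    _, minima = local_extrema(d2, a1 if a1 is not None else b1, p1)
    a2 = min(minima, key=d2.__getitem__) if minima else None
    return a1, a2

def nodes_after_peak(d2, p1, e1):
    """Step S19 sketch: C2 = highest peak in (P1, E1); C1 = lowest valley in (P1, C2)."""
    maxima, _ = local_extrema(d2, p1, e1)
    c2 = max(maxima, key=d2.__getitem__) if maxima else None
    _, minima = local_extrema(d2, p1, c2 if c2 is not None else e1)
    c1 = min(minima, key=d2.__getitem__) if minima else None
    return c1, c2

# Example: second derivative samples with B1 = 0, P1 = 6, E1 = 11.
d2 = [0.0, 3.0, 6.0, 2.0, -1.0, 1.0, -5.0, -9.0, -4.0, 2.0, 4.0, 1.0]
a1, a2 = nodes_before_peak(d2, 0, 6)
c1, c2 = nodes_after_peak(d2, 6, 11)
```

A result of None means that no node is extracted for that position, which corresponds to the missing A2 in the question-sentence example.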
[0035]
When finding the peaks of the second derivative curve in a specified section, a third derivative curve may additionally be obtained, the zero crossing points at which its sign changes to positive or to negative may be found, and the peaks of the second derivative curve corresponding to those zero crossing points may be determined.
[0036]
If the nodes B1, A1, A2, P1, C1, C2, and E1 extracted on the accent phrase pattern curve are not sufficient for generating the pitch pattern, the midpoint between two adjacent, previously obtained nodes on the accent phrase pattern curve may be newly extracted as a node. The final nodes are input to the node information output unit 7, and processing ends (step S20).
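The optional midpoint step can be sketched as below, with nodes represented by their sample indices on the accent phrase pattern curve. The function name add_midpoints and the rule of skipping pairs that are already adjacent samples are assumptions for this example.

```python
def add_midpoints(node_indices):
    """Insert the sample halfway between every pair of adjacent node indices.

    Assumes a non-empty collection of sample indices.
    """
    nodes = sorted(set(node_indices))
    out = []
    for a, b in zip(nodes, nodes[1:]):
        out.append(a)
        mid = (a + b) // 2
        if mid != a:          # nothing to add when the two nodes are neighbours
            out.append(mid)
    out.append(nodes[-1])
    return out

# Example: four nodes gain a midpoint wherever there is room for one.
denser = add_midpoints([0, 4, 10, 11])
```

The added points lie on the measured curve itself, so they constrain the spline between the shape-defining nodes without introducing new parameters.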
[0037]
The node information output unit 7 outputs the final nodes received from the node extraction unit 6 to the outside as the nodes 103. The processing corresponding to step S20, in which midpoints are added to obtain the final nodes, may instead be performed by the node information output unit 7.
[0038]
According to the above embodiment, nodes are extracted based on the fundamental frequency pattern curve obtained by dividing natural speech into accent phrases and extracting the fundamental frequency, together with its first and second derivative curves. The change points and similar features that characterize the shape of the fundamental frequency pattern curve are thereby extracted as nodes, and because these nodes are extracted independently of the speech rate, node extraction is stable and efficient.
[0039]
The speech synthesizer generates a pitch pattern from the nodes, uses it as one of the speech synthesis rules (rule-based synthesis method), and synthesizes speech from separately input phonetic symbols or character strings. The pitch pattern is most closely related to accent and intonation; it not only gives speech a natural, easy-to-listen-to tone but also marks the grouping of words and phrases, making sentences easier to understand. If the generated pitch pattern faithfully reproduces the actual fundamental frequency pattern, the speech synthesizer can synthesize natural, easy-to-listen-to speech.
[0040]
In the node extraction device of the present invention, the change points and similar features that characterize the shape of the fundamental frequency pattern curve are extracted as nodes, so the generated pitch pattern can faithfully reproduce the actual fundamental frequency pattern. The device is therefore well suited not only to speech synthesizers but also to other apparatus that employ the rule-based synthesis method.
[0041]
The present invention has been described above based on its preferred embodiment. However, the natural speech node extraction device of the present invention is not limited to the configuration of that embodiment, and natural speech node extraction devices obtained by applying various modifications and changes to that configuration are also included in the scope of the present invention.
[0042]
[Effects of the invention]
As described above, the natural speech node extraction device of the present invention extracts nodes based on the fundamental frequency pattern curve obtained by dividing natural speech into accent phrases and extracting the fundamental frequency, together with its first and second derivative curves. The change points and similar features that characterize the shape of the fundamental frequency pattern curve are thereby extracted as nodes, and because these nodes are extracted independently of the speech rate, node extraction is stable and efficient.
[Brief description of the drawings]
FIG. 1 is a block diagram of a natural speech node extraction apparatus according to an embodiment of the present invention.
FIG. 2 shows information on the natural speech utterance "yoroshii desu ka" ("Is that all right?").
FIG. 3 is a flowchart of the node extraction method performed by the natural speech node extraction device of FIG. 1.
FIGS. 4(a), (b), and (c) show the accent phrase pattern curve of a declarative sentence, its first derivative curve, and its second derivative curve, respectively.
FIGS. 5(a), (b), and (c) show the accent phrase pattern curve of a question sentence, its first derivative curve, and its second derivative curve, respectively.
FIG. 6 is a flowchart of a first node selection method.
FIG. 7 is a flowchart of a second node selection method.
[Explanation of symbols]
1 Pattern extraction unit
2 Input unit
3 Pattern division unit
4 Unvoiced control unit
5 Differential operation unit
6 Node extraction unit
7 Node information output unit
101 Language information
102 Natural speech
103 Nodes

Claims

1. A natural speech node extraction device comprising:
a pattern extraction unit that extracts a fundamental frequency pattern of each of a plurality of natural speech utterances;
an input unit that inputs, for each natural speech utterance, language information including the start time and end time of an accent phrase;
a pattern division unit that divides each fundamental frequency pattern into accent phrases based on the accent phrase start time and end time included in the language information input by the input unit;
a differential operation unit that obtains a first derivative curve and a second derivative curve of the divided fundamental frequency pattern; and
a node extraction unit that extracts, as nodes of the fundamental frequency pattern, the start point and end point of the divided fundamental frequency pattern, the zero crossing point of the first derivative curve, and the highest point and lowest point of the second derivative curve.

2. The natural speech node extraction device according to claim 1, wherein the language information about the natural speech includes information on whether the utterance is a question sentence and information on an accent position time indicating whether an accent position is included.

3. The natural speech node extraction device according to claim 2, wherein, when the language information about the natural speech includes information indicating a question sentence, the node extraction unit extracts, as a node, the lowest frequency point immediately before the end point of the divided fundamental frequency pattern.

4. The natural speech node extraction device according to claim 2 or 3, wherein, when the language information about the natural speech includes accent time information indicating that an accent position is included, the node extraction unit extracts, as nodes, the highest point and the lowest point of the second derivative curve after the zero crossing point of the first derivative curve.

5. The natural speech node extraction device according to any one of claims 1 to 4, wherein the node extraction unit newly extracts, as a node, the midpoint between two adjacent nodes among the nodes extracted as nodes of the fundamental frequency pattern.