JP4218075B2

JP4218075B2 - Speech synthesizer and text analysis method thereof

Info

Publication number: JP4218075B2
Application number: JP04963598A
Authority: JP
Inventors: 英二小松
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1998-03-02
Filing date: 1998-03-02
Publication date: 2009-02-04
Anticipated expiration: 2018-03-02
Also published as: JPH11249678A

Abstract

PROBLEM TO BE SOLVED: To provide a speech synthesizer, and a text analysis method for it, that output the synthesized speech generated from supplied data as natural, easy-to-hear speech. SOLUTION: In the speech synthesizer 10, a division candidate generation unit 460 of a phonological processing unit 40 divides the output of a language analysis unit 30 according to a first rule, ICRLB (immediate constituent with recursively left-branching structure) division, and judges under a second rule whether each resulting prosodic phrase is at or below a prescribed number of beats. When a prosodic phrase is re-divided, each combination of prosodic phrases produced by the choice of division positions becomes a division candidate. A division candidate selection unit 462 calculates an evaluation value indicating division suitability from the speech-expression information contained in each candidate and from parameters supplied by a parameter storage unit 464, and then selects the candidate that satisfies a third rule, namely the one with the minimum evaluation value. The selected candidate gives prosodic phrases divided to an optimum length.

Description

【０００１】
【発明の属する技術分野】
本発明は、供給された文章に各種の解析を施し、この解析結果を韻律規則等に応じて生成される中間言語から人工的に合成音声を発生させる音声合成装置およびそのテキスト解析方法に関し、特に、供給される日本語の文章に形態素および構文の解析を施した後、この結果に対して施す音韻処理によって生成される中間言語を基に合成音声を生成する音声合成装置およびこの音声合成装置に適用して好適なテキスト解析方法に関するものである。
【０００２】
【従来の技術】
音声を人工的に合成するには、波形形成を行うために予め記録した音声を組み合わせて再生する記録再生方式、あるいは音声を純粋に人工的に合成する純合成方式があり、この記録再生方式における波形形成の制御にも予め用意した情報の組合せを利用する編集制御方式、あるいは純人工的に制御信号を生成する規則制御方式の諸手法がある。これらの方式の中で最近、たとえば電子メールの読み上げや天気予報の案内サービス、プロ野球の結果といったニュース等、種々の分野に規則制御方式を適用して高品質の合成音声を提供する音声合成装置が注目されている。
【０００３】
合成音声を高品質に得るためには、日本語における語義、統語構造、談話構造等の言語情報を的確に反映し、かつ自然な韻律を生成することのできる韻律規則を構築しなければならない。このような規則制御方式を適用して日本語の合成音声を生成する手法の一例が日本音響学会誌50巻 6号（「日本語文章音声の合成のための韻律規則」）に開示されている。ここに開示されている手法は、供給される文章に対して形態素解析、構文解析を行って韻律句を生成し、生成した韻律句が大きすぎる場合にはこの韻律句を均等に近く分割することにより、文の意味的な構造を考慮するとともに、滑らかな韻律で文が読まれるように中間言語を生成させている。
【０００４】
音声合成装置は、この生成された中間言語を基に制御パラメータを生成し、この制御パラメータに応じた音声を合成して出力している。
【０００５】
【発明が解決しようとする課題】
ところで、前述したような生成した韻律句が大きすぎる場合、規則を適用することによって、生成した韻律句を細かく分割し過ぎてしまったり、本来の韻律に反する読み方がなされるような韻律句の分割が行われることがある。このような韻律句の分割が行われることにより、出力される合成音声が不自然に聞こえてしまうことがあった。
【０００６】
本発明はこのような従来技術の欠点を解消し、供給されたデータによって生成される合成音声をより自然で聞き易い合成音声にして出力することができる音声合成装置およびそのテキスト解析方法を提供することを目的とする。
【０００７】
【課題を解決するための手段】
本発明は上述の課題を解決するために、情報として供給される文章に含まれる形態素やこの文章の構文を処理に用いる言語に基づいてこの言語レベルの特徴を解析する言語解析手段と、この言語解析手段の出力に対して音声言語レベルの特徴に基づく解析を行うとともに、得られた解析結果を基に音声合成用の指令となる中間言語を生成する音韻解析手段と、この音韻解析手段の出力に応じた制御パラメータを生成する制御パラメータ生成手段と、この制御パラメータ生成手段の出力を基に音声信号を合成する音声信号生成手段とを備え、情報を人工的に音声に合成する音声合成装置において、形態素の連鎖である韻律語素が複数まとめられた韻律句の修飾関係に応じて言語解析手段の出力を分割する第１の規則とともに、この分割によって得られた韻律句が予め設定した大きさ以下に小さいかを判断する第２の規則に応じて韻律句を再分割する際にこの韻律句を区切る位置により得られる韻律句の組合せを分割候補として生成する分割候補生成手段と、この分割候補生成手段で生成された分割候補が含む音声表現の情報を用いて分割妥当性を評価する評価値を算出するとともに、この評価値の最小な分割候補を選択する第３の規則により分割候補を選択する分割候補選択手段と、この分割候補選択手段での評価値の算出に用いるパラメータを格納するパラメータ格納手段とを音韻解析手段に含むことを特徴とする。
【０００８】
ここで、分割候補生成手段は、言語解析手段の出力に対して第１の規則を用いて分割する第１の規則分割手段と、この第１の規則処理手段の出力において分割された長さを第２の規則で判断する第２の規則処理手段と、この分割長判断手段の判断に応じて第１の規則分割手段から出力する韻律句を再分割して複数の分割候補を生成する再分割手段とを含むことが望ましい。
【０００９】
パラメータ格納手段は、分割候補における音声表現の情報に基づく第１の重み係数と、分割候補の分割数に伴う韻律語素の平均拍数に掛ける第２の重み係数とを格納するとよい。これにより、生成される韻律句の強調・抑圧指定等を的確に表現できる中間言語の生成が可能になる。
【００１０】
規則を満足する韻律句の区切り設定位置の実例を格納する分割実例格納手段と、この分割実例格納手段に格納された実例の中から供給された文章が最適に分割される実例を検索する分割実例検索手段とを含むことが好ましい。このような構成により、実例に則して最適な合成音声を出力させる中間言語を出力させることができる。
【００１１】
分割実例検索手段は、分割実例格納手段が格納する実例の検索に強調情報、韻律語素、韻律語素の拍数、この拍数の総数、韻律語素の係受けの種別、アクセント指令および／または品詞の種類を用いることが好ましい。実例の検索による実例のマッチングの確度の向上および検索時間の短縮を図ることができる。
【００１２】
言語解析手段は、強調情報の規則の設定を行うとともに、設定された強調情報を格納する強調情報設定手段を含むことが好ましい。
【００１３】
本発明の音声合成装置は、分割候補生成手段で言語解析手段の出力を第１の規則に基づいて分割し、この分割された韻律句を第２の規則で判断して韻律句を再分割する際にこの韻律句を区切る位置により生成される韻律句の組合せを分割候補とし、分割候補選択手段で得られた分割候補が含む音声表現の情報を用いて分割妥当性を評価する評価値をパラメータ格納手段から供給されるパラメータを用いて算出し、さらに、この評価値を用いて分割候補選択手段で第３の規則を満たす分割候補を選択することにより、選ばれた分割候補が最適な長さに区切られた韻律句になるので、音韻解析手段ではこの選ばれた分割候補に応じた中間言語を生成する音声合成装置における中間処理が可能になる。
【００１４】
また、本発明の音声合成装置のテキスト解析方法は、情報として供給される文章に含まれる形態素やこの文章の構文を処理に用いる言語に基づいてこの言語レベルの特徴を解析し、この解析結果に基づいて文章の韻律に対する音声言語レベルの特徴に基づく解析を行うとともに、得られた解析結果を基に音声合成用の指令となる中間言語を生成し、この生成された中間言語に応じた制御パラメータを生成した後、この制御パラメータに対応する音声を人工的に合成する音声合成装置のテキスト解析方法において、形態素の連鎖である韻律語素が複数まとめられた韻律句の修飾関係に応じて情報を分割する第１の規則を用いて、文章に対する解析結果を分割する規則分割工程と、韻律句の大きさが予め設定した大きさ以下に小さいかの判断を第２の規則とし、規則分割工程の結果をこの第２の規則で判断する分割長判断工程と、この分割長判断工程の判断結果に応じて情報を再分割して得られる韻律句の区切り設定位置の組合せを分割候補として生成する分割候補生成工程と、この分割候補生成工程で得られる分割候補が含む音声表現の情報に基づいて分割妥当性の評価に用いるパラメータを格納するパラメータ格納工程と、分割候補生成工程で生成された分割候補が含む音声表現の情報および前記パラメータから評価値を算出する評価値算出工程と、評価値算出工程により得られる評価値中で最小を示す分割候補の選択を第３の規則とし、この第３の規則に基づいて分割候補を選択する分割候補選択工程とを含み、供給された文章の解析を行うことを特徴とする。
【００１５】
ここで、分割候補は、韻律句に含まれる拍長が予め設定した再分割を指示する再分割拍長より大きいとき、韻律句を再分割して生成され、かつ分割候補の境界には音声言語レベルの特徴の一つである話調指令する記号を入れることが好ましい。分割候補をこのように定義することにより、長い韻律句に対する考慮、たとえば韻律句を細かく分割し過ぎること等を避けることができるようになる。
【００１６】
評価値は、韻律句を分割して得られる区間の拍長と韻律句全体の拍長を等分割した拍長との差の絶対値の総和を算出する誤差総和算出工程と、韻律句における音声言語レベルの特徴に含まれる音声表現の情報で表す特徴量の存在に応じてこの韻律句を分割して得られる区間の拍長の総和を算出する特徴量総和算出工程と、分割候補の前記韻律語素に対して含まれる音声表現の情報を基に重み係数を算出する重み算出工程と、特徴量総和算出工程の結果に重み算出工程で得られた重み係数を乗算するとともに、この乗算結果の総和を算出する重付き特徴量総和算出工程と、韻律句を分割した後の平均拍長と韻律句全体の拍長を等分割した拍長の積を算出する積算出工程と、誤差総和算出工程の結果と重付き特徴量総和算出工程の結果を加算し、この加算により得られた結果から積算出工程の結果を減算して対象となる分割候補の評価値を算出する評価値算出工程とを用いて算出されると有利である。
【００１７】
上述した音声表現の情報には、韻律語素の強調、抑圧、および両者を除く場合とに分類して値を設定されることが好ましい。この設定される値に基づいて特徴量が求められる。
【００１８】
分類する上での基準として、強調は分割候補で表される韻律節の中心に含まれる固有名詞、あるいは数字により分類し、抑圧は分割候補で表される韻律節の中心に含まれる形式名詞、動詞、あるいは文頭に位置し、かつ韻律節の最後に係助詞が位置することによりこの韻律節を分類し、強調と抑圧に対して予め設定した値を割り当てることが望ましい。
【００１９】
強調は分割候補の韻律節に固有名詞、あるいは数字を含むことにより分類し、抑圧は分割候補の韻律節に形式名詞、動詞、あるいは文頭に位置し、かつ韻律節の最後に係助詞を含むことによりこの韻律節を分類し、強調と抑圧に対して予め設定した値を割り当てることが望ましい。
【００２０】
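The division candidates for sentences to which this text analysis method is applied are preferably stored in advance with the supplied information and/or the information attached to appropriately divided examples as search keys. The division length judgment step preferably selects an output destination according to its judgment on the supplied prosodic phrase, and the method preferably includes an example search step that, at the selected output destination, searches for an example matching the division of the supplied prosodic phrase and, according to the search result, divides the prosodic phrase and attaches the various commands. This search allows the sentence to be analyzed accurately.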
【００２１】
実例格納工程は、予め各種の実例を学習的に記憶させることが望ましい。この記憶により、経験を積む期間の短縮化およびより一層幅広く対応させることができる。
【００２２】
また、パラメータは、統計解析あるいは多変量解析により求めることが好ましい。これにより、パラメータの変更を半自動的に行える。
【００２３】
このテキスト解析方法の適用する言語は、日本語であることが有利である。これにより、あいまいな表現や構文を有する日本語の文章解析に基づいた中間言語の生成が容易になる。
【００２４】
本発明の音声合成のテキスト解析方法は、文章に対する解析結果を規則分割工程の第１の規則、分割長判断工程の第２の規則と順に処理し、この結果に応じて情報を再分割した際に得られる組合せを分割候補とし、一方、この分割候補が含む音声表現の情報に基づいてパラメータ格納工程で得られているパラメータおよび分割候補が含む音声表現の情報に基づいて分割妥当性の評価値の算出し（評価値算出工程）、分割候補選択工程で第３の規則を満足する分割候補を選択することにより、供給される文章の韻律句をさらに細分化して音韻解析が行われても最適な分割候補を選択してこの分割に対応した中間言語の生成が可能になる。
【００２５】
【発明の実施の形態】
次に添付図面を参照して本発明による音声合成装置およびそのテキスト解析方法の実施例を詳細に説明する。
【００２６】
本発明の音声合成装置は、供給される日本語の文章に形態素および構文の解析を施した後、この結果に対して施す音韻処理によって生成される中間言語を基に制御パラメータを生成し、この制御パラメータに基づいて合成音声を生成して出力する装置である。この音声合成装置およびそのテキスト解析方法について図１〜図19を参照しながら説明する。図１に示すように、音声合成装置10は、データ入力部20、言語解析部30、音韻処理部40、制御パラメータ生成部50、および音声信号生成部60を有している。
【００２７】
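The data input unit 20 converts text in various forms into a data format that the speech synthesizer 10 can process, and inputs it. The text may take various forms, for example a handwritten manuscript, or a file recorded on a recording medium in a predetermined data format by document-creation software such as a word processor. For a handwritten manuscript, the data input unit 20 is provided with a device that optically reads the text written on the paper. For reading data stored on a floppy disk, the data input unit 20 has a disk drive built in as the data reading device.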
【００２８】
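As shown in FIG. 1, the language analysis unit 30 includes a morphological analysis unit 32, a syntax analysis unit 34, and an emphasis information setting unit 36. In the language analysis unit 30, the data supplied from the data input unit 20, that is, the digital data of the input sentence, is supplied to the morphological analysis unit 32. The morphological analysis unit 32 has a word segmentation analysis unit 32a and a word dictionary unit 32b, shown in FIG. 2. A morpheme is the smallest unit of form that carries meaning. The word segmentation analysis unit 32a stores a processing program for word segmentation and performs search processing that matches the words in the input sentence against the words in the word dictionary unit 32b. The morphological analysis unit 32 divides the input sentence into words while searching the word dictionary unit 32b according to the program stored in the word segmentation analysis unit 32a.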
【００２９】
ここで、言語解析の用語を説明する。文は、処理上の単位で韻律的単位、構文的単位、およびこれら両者の中間的単位の用語に分類して扱われる（図３(a) を参照）。韻律的単位には、韻律語、韻律句、韻律節、韻律文がある。韻律語は、一つのアクセント成分に対応し、かつ一定のアクセント型を示す音素（すなわち、音として弁別可能な最小単位）の連鎖である。韻律句は、一つのフレーズ成分に対応する韻律語の連鎖である。韻律節は、休止によって区切られた韻律句の連鎖であり、かつ韻律句の最後の部分でフレーズ成分のリセット（すなわち、負のフレーズ指令の生起）が行われていないものを示す。最後に、韻律文は、韻律句の連鎖であり、かつ韻律句の定義と同様に最後の部分でフレーズ成分のリセットが行われていないものである。たとえば、文「低気圧は伊豆半島の南にあり、東日本でにわか雨が降っています。」は、これらの定義で図３(b) に示すように分類される。
【００３０】
構文的単位には、大きさの順に文節、（句）、節、文がある。特に、節とは、他の語句を修飾していない述語（動詞述語、形容詞述語、名詞述語等）とその述語を直接的・間接的に修飾するすべての語句からなる連鎖と定義している。ただし、「付近の500 世帯の家庭に電気を供給している変電所です」という文中における「〜している」の記述は「変電所」を連体修飾しているので、この文の「〜している」の記述は、節に分類されない。
【００３１】
また、中間的単位には、韻律語素、ICRLB がある。韻律語素は、統語的条件、読みの強調・抑圧によって複数個の韻律語に分割されることのない形態素の連鎖である。ICRLB （Immediate Constituent with Recursively Left-Branching Structure の略）とは、右枝分れ境界で区切られ、かつ左枝分れ境界のみを含む韻律語の連鎖である。構文木において左枝分れ境界とは、修飾関係にある韻律語素間の境界のことで、右枝分れ境界とは、修飾関係にない韻律語素間の境界と定義され、ICRLB 境界と呼ばれている。ICRLB は、一つの単語でもよく、構文木を右枝分れ境界で分割すると、一つの文はICRLB 連鎖となる。後述する第１の規則は、ICRLB 分割を行う規則である。
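The boundary definitions above can be illustrated with a small Python sketch. It assumes that each prosodic word element carries the index of the element it modifies (a hypothetical `heads` list, a format the patent does not define), and it treats a boundary as left-branching only when the left element directly modifies its right neighbour. The example dependency structure is illustrative and is not taken from the figures.

```python
def split_icrlb(words, heads):
    """Group a non-empty word list into ICRLBs by cutting at right-branching boundaries.

    heads[i] is the index of the word that words[i] modifies, or None for
    the final predicate (hypothetical input format).
    """
    groups, current = [], [words[0]]
    for i in range(1, len(words)):
        if heads[i - 1] == i:
            # Left-branching boundary: the neighbours are in a modification relation.
            current.append(words[i])
        else:
            # Right-branching boundary: this is an ICRLB boundary.
            groups.append(current)
            current = [words[i]]
    groups.append(current)
    return groups


words = ["低気圧は", "伊豆半島の", "南に", "あり"]
heads = [3, 2, 3, None]  # assumed dependency structure, for illustration only
print(split_icrlb(words, heads))
```

A single word forms an ICRLB on its own, which matches the statement that an ICRLB may consist of one word.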
【００３２】
このような定義により、たとえば節の境界は、修飾関係があっても通常の修飾関係と区別できるため同時にICRLB 境界とする。したがって、境界の定義は、それぞれまとめると節境界が節同士の境界、ICRLB 境界はICRLB 同士の境界、列挙境界は、たとえば「司法、立法、行政」のように名詞が読点で並列的に並べた表現における読点後の単語境界となる。
【００３３】
このような定義を用いて解析処理を行う構文解析部34は、形態素解析部32からの形態素解析結果をデータとして入力し、入力文に対する構文木、または係受け構造の解析を行う機能部である。また、構文解析部34は、上述した機能による解析結果からICRLB 、節、および並列関係等を決定して入力文に対し言語レベルにおける特徴である構文解析を行っている。この構文解析の結果が構文木でなく、係受け構造の場合、前述において定義した修飾関係から、ICRLB 境界を定義したり、アクセント核を複数個含む複合語がある場合、その複合語の直前をICRLB 境界と定義している。
【００３４】
強調情報設定部36は、図２に示すように形態素解析部32の解析結果と構文解析部34の解析結果を基に単語または文節に強調・抑圧の強調情報の設定を行う情報設定部36a と、情報設定部36a に供給する設定の基準となる、強調情報の規則を格納する規格格納部36b とを備えている。情報設定部36a は、規格格納部36b から供給される規則と照合して強調情報を入力されるデータに設定している。
【００３５】
ここで、強調情報は、アクセントや一部のフレーズ指令の設定に用いる情報で、音声表現の情報の一つである。強調情報は、３種類の情報に分類し、情報の強調には+emph 、情報の抑圧には-emph 、そして強調・抑圧の両方に分類されない場合には0emph に分けている。強調情報の設定例として、たとえば固有名詞、数字等を中心とする文節は+emph とし、「こと」、「もの」等の形式名詞、「なる」、「いる」等の動詞を中心とする文節やたとえば文頭の「は」（係助詞）格の文節は-emph に分類する。
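The three-way classification described above can be sketched as a small rule table in Python. The part-of-speech tags, the field names, and the order in which the rules are tried are assumptions made for illustration; the text gives only the example categories (proper nouns and numerals, formal nouns such as こと and もの, verbs such as なる and いる, and a sentence-initial phrase marked by the particle は).

```python
from dataclasses import dataclass

# Verbs named in the text as suppressed; a real rule store would hold more entries.
SUPPRESSED_VERBS = {"なる", "いる"}


@dataclass
class Bunsetsu:
    head_pos: str                  # hypothetical tag, e.g. "proper_noun", "numeral", "formal_noun", "verb"
    head_lemma: str
    last_particle: str = ""
    sentence_initial: bool = False


def emphasis(b: Bunsetsu) -> str:
    """Return "+emph", "-emph" or "0emph" for one bunsetsu (rule order is an assumption)."""
    if b.head_pos in ("proper_noun", "numeral"):
        return "+emph"
    if b.head_pos == "formal_noun":
        return "-emph"
    if b.head_pos == "verb" and b.head_lemma in SUPPRESSED_VERBS:
        return "-emph"
    if b.sentence_initial and b.last_particle == "は":
        return "-emph"
    return "0emph"


print(emphasis(Bunsetsu("proper_noun", "望月", "が")))
print(emphasis(Bunsetsu("formal_noun", "こと", "が")))
print(emphasis(Bunsetsu("noun", "道具", "と")))
```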
【００３６】
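The language analysis unit 30 outputs to the phonological processing unit 40 the analysis results for the input sentence from the morphological analysis unit 32 and the syntax analysis unit 34, together with the emphasis information set by the emphasis information setting unit 36. The phonological processing unit 40 generates an intermediate language for speech synthesis from the output of the language analysis unit 30. As shown in FIG. 1, the phonological processing unit 40 includes a prosodic word generation unit 42, an accent command generation unit 44, and a pause/phrase command generation unit 46.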
【００３７】
韻律語生成部42は、文節に含まれる単語のアクセント結合を行って韻律語を生成する。韻律語生成部42は、必要に応じて文節同士のアクセント結合も行って韻律語を生成している。韻律語生成部42では、生成された韻律語に文節の強調情報も設定されている。
【００３８】
アクセント指令生成部44は、図４に示すように、言語解析部30の出力の中でICRLB の情報に基づいてアクセントのかけ方を指示するアクセント指令の生成を行うアクセント指令設定部44a と、アクセント指令設定部44a に供給する設定の基準となるアクセント指令の規則を格納する規格格納部44b とが備えられている。アクセント指令設定部44a は、規格格納部44b から供給されるアクセント指令の規則と照合してアクセント指令を言語解析の結果に設定している。
【００３９】
ここで、アクセント指令は、アクセントの大きさ、アクセントの立上げ、立下げの時点を示す命令である。このアクセント指令の生成において、処理対象の範囲は、アクセント変形の範囲である。アクセント変形（accent sandhi ）とは、たとえば平板的なアクセントの平板型、アクセントに起伏を伴う起伏型等といったアクセント型、文章を境界毎に区切る統語条件や談話条件に応じて、連続する複数の韻律語のアクセント成分が互いに影響を及ぼし合う韻律語素間の相互作用のことである。このような関係から判るように、アクセント変形の範囲はICRLB である。
【００４０】
ポーズ・フレーズ指令生成部46は、休止の長さ・フレーズ指令の生成を行う機能を有している。この機能を発揮するように、ポーズ・フレーズ指令生成部46には、図４に示すポーズ指令・フレーズ指令を生成するポーズ・フレーズ指令設定部46A と、言語解析部30の出力を最適な長さのICRLB に分割するICRLB 分割部46B とが備えられている。
【００４１】
ここで、ポーズ指令は、休止記号S₁, S₂, S₃で表される。休止記号S₁, S₂, S₃は、それぞれ文、節、ICRLB の区切りである。また、フレーズ指令は、フレーズ記号P₀, P₁, P₂, P₃で表される。フレーズ記号P₀, P₁, P₂, P₃は、それぞれ、フレーズ成分のリセット、立直し、節頭での追加、ICRLB 間およびICRLB 内での追加に用いる。
【００４２】
ポーズ・フレーズ指令設定部46A には、ポーズ指令・フレーズ指令を生成する指令設定部46a と、この指令設定部46a に供給する設定の基準となるポーズ指令・フレーズ指令の規則を格納する規格格納部46b とがある。指令設定部46a には、言語解析部30からの出力がそのまま供給されるのではなく、本実施例ではICRLB 分割部46B からの出力が供給されている。
【００４３】
さらに、ICRLB 分割部46B には、分割候補生成部460 、分割候補選択部462 、およびパラメータ格納部464 が備えられている。分割候補生成部460 には、言語解析部30からの出力が供給されている。分割候補生成部460 には、供給されるデータ（文節）をICRLB の境界での分割（すなわち、第１の規則）を行う規則分割部460aと、規則分割部460aの出力（ICRLB で分割された韻律句の長さ：以下、ICRLB 韻律句という）が予め設定した長さ以下に短いかを判断（すなわち、第２の規則）する分割長判断部460bと、分割長判断部460bの判断に応じて規則分割部460aで分割された韻律句を再分割して複数の分割候補を生成する再分割部460cとがある。
【００４４】
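Here, re-dividing an ICRLB prosodic phrase yields a combination of prosodic phrases determined by the positions at which the phrase is cut; such a combination is called a division candidate.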
【００４５】
分割長判断部460bは、最初に供給された場合に第２の規則の判断条件を満足するとき、出力を指令設定部46a に供給する。また、この最初の場合に、この判断条件を満足しないとき、指令設定部46a は出力先を再分割部460c側にする。ところで、２度目以降の判断時において、この判断条件に分割されたICRLB 韻律句に対して分割候補の組合せがまだあるかどうかという条件も加えて指令設定部46a は判断する。
【００４６】
再分割部460cは、第２の規則の判断を行うため分割したデータを再分割部460cに帰還させて複数種類の分割候補の組合せがなくなるまで再分割処理を繰り返す。再分割部460cは、分割された分割候補の中で第２の規則の判断条件を満足する、分割候補だけを分割候補選択部462 の演算部462aに出力する。
【００４７】
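However, when the information is always finely divided according to the ICRLB-boundary division rule before it is supplied to the ICRLB division unit 46B, the rule division unit 460a can be omitted. The division length judgment unit 460b supplies its output to the re-division unit 460c when a divided ICRLB prosodic phrase is longer than the predetermined length. Only otherwise, that is, when the input sentence satisfies the second rule on first being supplied, does it supply the output to the command setting unit 46a of the pause/phrase command setting unit 46A.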
【００４８】
したがって、分割候補生成部460 は、供給されたICRLB 韻律句が所定の長さより長いとき、内蔵する再分割部460cでこのICRLB 韻律句を再分割し、所定の長さ以下になるまで繰り返しながら、この出力を分割候補選択部462 に供給することになる。
【００４９】
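The division candidate selection unit 462 has a calculation unit 462a, which calculates a cost as the evaluation value indicating division suitability from the speech-expression information contained in the generated division candidates, and a selection unit 462b, which selects the division candidate with the minimum cost among those calculated by the calculation unit 462a (that is, the third rule). This cost corresponds to the total error of the conventional method and includes that total error as one part; the details are described later. The division candidate selection unit 462 supplies the division candidate chosen by the selection unit 462b to the command setting unit 46a of the pause/phrase command setting unit 46A. The parameter storage unit 464 has a memory that stores the parameters used for the cost calculation in the calculation unit 462a of the division candidate selection unit 462.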
【００５０】
また、パラメータ格納部464 には、図示しないが統計解析、あるいは多変量解析等の方法を適用してパラメータの変更を行う機能部もある。装置の簡略化を考慮する場合、パラメータの変更を装置の外部で行い、パラメータ格納部464 に変更するパラメータを単に供給するようにしてもよい。統計解析、あるいは多変量解析は、音声表現の情報を予め数値化する規則を設けておき、規則に基づいて情報に対応して得られる数値を用いて文の傾向を評価する。この際、評価に対応する値が再分割したことによる傾向を示すパラメータとなる。
【００５１】
ICRLB 分割部46B は、長いICRLB 韻律句に対して所定の長さ以下にした際の分割候補の中から最小コストを選択し、この分割候補をポーズ・フレーズ指令設定部46A の指令設定部46a に供給してポーズ指令・フレーズ指令を従来よりも的確なものにして出力する機能を備えている。これにより、ICRLB 韻律句を適切な範囲に分割することができる。音韻処理部40は、単語の読み、アクセント指令、ポーズ指令、フレーズ指令の決定後、図示しないが長音化、促音化等の処理を施して作成される中間言語のデータを制御パラメータ生成部50に送る。この音韻処理により、不自然な音声合成となる中間言語の生成を避けることができる。
【００５２】
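The control parameter generation unit 50 generates the control parameters used for speech synthesis from the intermediate-language data supplied by the phonological processing unit 40. The generated control parameters are supplied to the speech signal generation unit 60, which includes a speech waveform generation unit 62 and a speech output unit 64. The speech waveform generation unit 62 applies D/A conversion to the control parameters supplied from the control parameter generation unit 50 to generate a speech waveform, and outputs it to the speech output unit 64. The speech output unit 64 outputs the speech waveform, for example through a speaker, so that the input information (for example, text) is output as speech. Configured in this way, the speech synthesizer 10 generates accurate intermediate-language data and outputs synthesized speech based on that data.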
【００５３】
次に本実施例の音声合成装置10の制御およびその動作について図５〜図13のフローチャートや各種の例示に基づく表等を参照しながら説明する。図５のフローチャートは、音声合成装置10の制御およびその制御による主要な動作手順を説明している。この音声合成装置10に電源が投入されると、音声合成装置10の動作が開始して初期設定が行われた後、ステップS10 に進む。ステップS10 では、音声合成装置10のデータ入力部20を介して供給された文章をディジタル化したり、すでにディジタル化済みのデータをメモリに一旦格納する。このデータ入力の後、サブルーチンSUB1に進む。
【００５４】
サブルーチンSUB1では、供給されたデータについて言語解析部30で言語解析処理を施す。ここで行われる言語解析処理には、形態素解析処理、構文解析処理、強調情報設定処理等がある。この言語解析処理の結果はサブルーチンSUB2に送られる。
【００５５】
次にサブルーチンSUB2では、サブルーチンSUB1の結果を基に音韻処理部40で音韻解析処理を行う。この音韻解析処理には、韻律語生成処理、アクセント指令生成処理、ポーズ・フレーズ指令生成処理等がある。この音韻解析の最終解析結果（すなわち、中間言語のデータ）は、制御パラメータ生成部50に送られ、処理手順はステップS11 に進む。
【００５６】
このサブルーチンSUB2には、ポーズ・フレーズ指令を的確に行うように規則が用意されている。この規則は、予め設定した最短の拍数をL₁とし、韻律句に含まれる拍数限界をL₂とする場合、拍数限界L₂より長いICRLB 韻律句があれば、すべての韻律句が拍数限界L₂以下になるように韻律語素境界にフレーズ記号P₃を挿入する。ただし、直前のフレーズ記号P₁/P₂/P₃からの距離が最短拍数L₁以下の場合はフレーズ記号P₃の挿入を省略する。さらに、長いICRLB 韻律句の分割方法が複数ある場合は、分割できる韻律句について後述するコスト関数を適用して、コスト関数の値を最小とする分割を選択する。したがって、ポーズ・フレーズ指令は、この選択された韻律句に対して生成されることになる。
【００５７】
ステップS11 では、サブルーチンSUB2の音韻解析処理の結果に基づいて音声合成に必要な制御パラメータを制御パラメータ生成部50で生成する。ここで生成された制御パラメータは、ステップS12 に進む。
【００５８】
ステップS12 では、ステップS11 により得られた制御パラメータを基に音声信号生成部60で音声信号生成処理を行う。この処理により、最終的に供給された文章に対応する合成音声が生成されて音声出力される。この音声出力により音声合成装置10のこの一連の処理が終了する。
【００５９】
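Subroutines SUB1 and SUB2 are now described further. When the speech synthesizer 10 performs language analysis in subroutine SUB1 of FIG. 6, it first proceeds to substep SS10, where morphological analysis is performed. Using the word segmentation analysis unit 32a, the morphological analysis unit 32 recognizes sentence boundaries in the supplied data and divides the text into sentences. It then searches the word dictionary unit 32b of FIG. 2 for words that match the substrings making up the character string of each sentence. The morphological analysis unit 32 also checks grammatical connectability and divides each sentence into a word sequence. After this processing, the procedure proceeds to substep SS11.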
【００６０】
サブステップSS11では、サブステップSS10の処理結果（単語列）を用いて構文解析部34で構文解析する。この構文解析は、供給される単語列を文節にまとめ、かつこの文節間の修飾関係を解析する。この解析により、たとえば「低気圧は伊豆半島の南にあり、東日本でにわか雨が降っています。」をICRLB で区切ると、図７(a) に示す構文木、あるいは図７(b) に示す係り受け構造が生成される。ここで、図７中の記号「／」は、文のICRLB 境界を示している。この際に構文解析部34では、ICRLB 、節、列挙表現、並列関係等も同時に決定している。この処理後、サブステップSS12に進む。
【００６１】
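In substep SS12, the results of the morphological analysis and the syntax analysis are supplied to the emphasis information setting unit 36, which applies emphasis information setting processing to them. The rule storage unit 36b of the emphasis information setting unit 36 stores rules that classify each case as emphasis, suppression, or neither. The information setting unit 36a matches each supplied bunsetsu (phrasal unit) against the rules from the rule storage unit 36b and sets on the bunsetsu the emphasis information corresponding to the matching rule.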
【００６２】
次にサブステップSS13では、サブステップSS10, SS11, SS12で得られた処理結果を音韻処理部40に供給する。この供給の後、リターンに移行してサブルーチンSUB1を終了する。
【００６３】
このサブルーチンSUB1の終了後、直ちに図８のサブルーチンSUB2に移行する。サブルーチンSUB2では、前述したサブルーチンSUB1の言語解析処理の結果を基に音韻解析処理が行われる。音韻解析処理には、韻律語生成処理、アクセント指令生成処理、ポーズ・フレーズ指令生成処理等がある。まず、サブステップSS20に進む。
【００６４】
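In substep SS20, the data from the language analysis unit 30 is supplied to the prosodic word generation unit 42, which generates prosodic words. The prosodic word generation unit 42 performs accent combination of the words within each supplied bunsetsu and, where necessary, accent combination between bunsetsu, to generate prosodic words. Although not illustrated, the prosodic word generation unit 42 of FIG. 4 also has a function that sets the emphasis information of the bunsetsu on each generated prosodic word. After this processing, the procedure proceeds to substep SS21.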
【００６５】
サブステップSS21では、供給されるICRLB の情報に基づいて各韻律語のアクセントの大きさを決定する処理を行う。この処理はアクセント指令生成部44で行われる。アクセント指令生成部44では、規格格納部44b に予め格納されている規則、たとえばアクセント変形の規則等と供給されるICRLB 情報との照合をアクセント指令設定部44a で行われる。アクセント指令設定部44a は、ICRLB 情報に対して規則と一致するアクセント記号を割り当てて設定している。この設定の後、サブルーチンSUB3に移行する。
【００６６】
図９に示すサブルーチンSUB3では、韻律的な特徴を表すポーズ指令・フレーズ指令を供給されるデータに設定する（ポーズ・フレーズ指令生成処理）。このポーズ・フレーズ指令生成処理は、ポーズ指令・フレーズ指令をそれぞれ生成する処理を行うために、まず、サブルーチンSUB4でICRLB 分割処理を行っている。ICRLB 分割処理には、さらに後述する分割候補生成処理、パラメータ格納処理、および分割選択処理を行うためサブルーチンSUB5, SUB6, SUB7が含まれている。これらの処理を経た最適に分割されたICRLB 韻律句（すなわち、分割候補）を指令設定部46a に供給する。この後、サブステップSS30に進む。
【００６７】
サブステップSS30では、この供給された分割候補に対して、指令設定部46a は、ポーズ指令、フレーズ指令を設定して中間言語作成部（図示せず）に出力する。この処理の後に、リターンに進む。リターンでこのポーズ指令・フレーズ指令の設定処理を終了してサブステップSS22に移行する。
【００６８】
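In substep SS22, vowel lengthening, gemination, and similar processing are applied to the data obtained in substeps SS20 and SS21 and in subroutine SUB3, and the intermediate language is then created. The created intermediate-language data is supplied to the control parameter generation unit 50. After this output, the procedure proceeds to return, ending subroutine SUB2.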
【００６９】
前述したサブルーチンSUB3で行われるサブルーチンSUB4のICRLB 分割処理について図10のフローチャートで簡単に説明する。図９のサブルーチンSUB3に処理が移行してきたとき、ICRLB 分割処理を行うようにサブルーチンSUB4を開始して分割候補生成処理、パラメータ格納処理、および分割選択処理を順次に行う。分割候補生成処理はサブルーチンSUB5（図11を参照）で、パラメータ格納処理はサブルーチンSUB6（図12を参照）で、分割選択処理はサブルーチンSUB7（図13を参照）で行う。これら一連の処理の後、データをサブルーチンSUB3に渡す。このとき、サブルーチンSUB3の処理に対応して、図４の指令設定部46a にデータが供給されている。指令設定部46a は、前述した通り入力されたデータに対してポーズ指令・フレーズ指令を規格格納部46b の規則に応じて設定している。
【００７０】
次に分割候補生成処理を行うサブルーチンSUB5について図11を参照しながら説明する。分割候補生成処理は、サブルーチンSUB4に処理が移行してきたとき、直ちにサブルーチンSUB5の処理を開始してサブステップSS50に進む。
【００７１】
サブステップSS50では、言語解析部30から供給されるデータの種類に応じて処理を分ける。データがたとえば文節のとき（Yes ）、サブステップSS51に進む。また、すでにデータが適当なICRLB 境界で分割されているとき、この処理を行わずにサブステップSS52に進む。
【００７２】
サブステップSS51では、供給されるデータをICRLB に分割する。この処理は前述した第１の規則に従って規則分割部460aで行われる。
【００７３】
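In substep SS52, the length of the supplied data (an ICRLB prosodic phrase) is compared with a preset division length. The basic unit of the division length is one beat (mora), assigned according to the reading of the words, so the comparison is between the beat count of the division length and the beat count of the ICRLB prosodic phrase. The comparison condition is the second rule: whether the beat count of the ICRLB prosodic phrase is at or below that of the division length. Because the destination of the result depends on how many times the comparison has been performed, a flag F may be provided, for example. The comparison is first made on the data supplied from the language analysis unit 30 (flag F = 0). If the condition is satisfied (Yes), the division length judgment unit 460b of FIG. 4 supplies the ICRLB prosodic phrase to the command setting unit 46a. If it is not satisfied, that is, the beat count of the ICRLB prosodic phrase exceeds that of the division length (No), the ICRLB prosodic phrase is supplied to the re-division unit 460c, and the procedure proceeds to substep SS53.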
【００７４】
サブステップSS53では、供給されるICRLB 韻律句を再分割し、分割長判断部460bに分割されたICRLB 韻律句からなるデータを出力する。
【００７５】
次にサブステップSS54では、前述した比較条件を戻されたデータが満足するかを判断している。比較条件を満足するとき（Yes ）、サブステップSS55に進む。また、比較条件を満たさないとき（No）、サブステップSS53に処理を戻す。この判断は分割長判断部460bで行われる。
【００７６】
サブステップSS55では、分割候補の組合せとなり得る分割候補の有無を判断する。分割候補となる組合せがICRLB 韻律句にあるとき（Yes ）、比較条件が満たされた分割候補をサブルーチンSUB7に引き渡すとともに、この引渡し処理後、サブステップSS53に戻す。これにより、分割候補生成部460 は、再分割部460cから分割候補選択部462 にデータを供給する。また、分割候補となる組合せがICRLB 韻律句にないとき（No）、リターンに進む。
【００７７】
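Whether division candidates remain is determined by calculation. First, for dividing the ICRLB prosodic phrase so that each part is at or below the set beat count after ICRLB division, the minimum number of phrase commands produced by the division is obtained: it is the integer quotient of the total beat count divided by the set value, plus 1 when there is a remainder. This minimum is held in the variable PH_NUM. The maximum number of phrase commands, MAX_PH_NUM, obtained when the data is divided as finely as possible by ICRLB division, is also obtained. Each time division is judged necessary, the division length judgment unit 460b increments PH_NUM by 1. Because of this, re-divided data is actually handed to subroutine SUB7 only after the division into PH_NUM parts has been made, and candidate combinations continue to be generated until PH_NUM exceeds MAX_PH_NUM. For convenience, the flowchart of FIG. 11 shows the data handover as subroutine SUB7. The handover is not limited to this method; the candidates may instead be stored in memory and handed over together. These judgments are made by the division length judgment unit 460b. This series of steps generates the combinations of division candidates.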
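The counting and enumeration described above can be sketched in Python as follows. The function names are hypothetical. The sketch takes MAX_PH_NUM to be the number of prosodic word elements (the finest possible split), ignores the minimum beat count L₁, and does not filter candidates by the beat limit, so it is a simplification under those stated assumptions rather than the patented procedure itself.

```python
from itertools import combinations


def phrase_count_range(beats, limit):
    """Return (PH_NUM, MAX_PH_NUM) for one ICRLB prosodic phrase."""
    total = sum(beats)
    ph_num = -(-total // limit)   # integer quotient, plus 1 when there is a remainder
    max_ph_num = len(beats)       # assumption: one phrase per prosodic word element
    return ph_num, max_ph_num


def candidates(beats, limit):
    """Yield every contiguous split into m phrases, for m = PH_NUM .. MAX_PH_NUM."""
    lo, hi = phrase_count_range(beats, limit)
    n = len(beats)
    for m in range(lo, hi + 1):
        for cuts in combinations(range(1, n), m - 1):
            bounds = (0, *cuts, n)
            yield [sum(beats[a:b]) for a, b in zip(bounds, bounds[1:])]


beats = [6, 6, 7, 4, 6]                   # beat counts of the first worked example
print(phrase_count_range(beats, 15))      # (2, 5)
print(sum(1 for _ in candidates(beats, 15)))
```

With five elements and a limit of 15 beats the range is 2 to 5 phrases, which agrees with the minimum and maximum quoted for the first worked example.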
【００７８】
次に分割選択処理についてサブルーチンSUB7を説明する前に、この分割選択処理で用いられるパラメータについて図12を用いて簡単に説明する。パラメータはサブルーチンSUB6で設定している。ここで、パラメータには、後述する式(1) で用いる特徴量I に対する重み係数A と、分割後の平均拍数に乗算する係数B とがある。
【００７９】
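Here, the feature I is expressed, for example, by the emphasis information (+emph, -emph, 0emph) of the first prosodic word of a prosodic phrase. In substep SS60, the rule for the feature I, that is, the relation among the emphasis information values (+emph, -emph, 0emph), is stored.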
【００８０】
次にサブステップSS61では、特徴量I の数値に応じて重み係数A を格納する。本実施例では、特徴量I の数値-1, -2に対する重み係数A は、7 、-3を割り当てて数値の格納を行っている。
【００８１】
次にサブステップSS62では、前述した係数B の数値を格納する。ここで、前述の重み係数A および重み係数B は、図４に示すパラメータ格納部464 で算出してもよい。これらの重み係数A, Bは、たとえば音声合成における特徴量I の統計解析処理、あるいは多変量解析等の手法を駆使して算出するとよい。パラメータ格納部464 は、パラメータ値の算出を装置外部で予め行い、単に数値を格納させるだけでもよい。
【００８２】
このような手順で処理してリターンに進み、パラメータの格納を終了して図10に示すようにサブルーチンSUB7に移行する。このサブルーチンSUB7では、サブルーチンSUB5から供給される分割候補（ICRLB 韻律句）の各組合せに対して図13に示すようにコスト計算が行われる。さらに、サブルーチンSUB7は、図13の手順に従って得られたコストから最適な分割候補を選択する処理を行っている。ここで、コストとは、分割してできる句に関する評価を行った際に正確な音声表現の指標となる誤差のことである。
【００８３】
まず、図13のサブステップSS70では、供給される分割候補の各組合せに対してICRLB 境界で区切られる範囲毎の拍数を図４の演算部462aでカウントして記憶する。この記憶を行ってサブステップSS71に進む。サブステップSS71では、分割候補の選択に用いる変数MIN の値を設定する。この設定値は、たとえば9999とする。また、最小の分割候補のICRLB 境界の位置を記憶するメモリMIN_DIV[ ]およびコスト値を格納する変数VAL の内容をクリアする。この後、サブステップSS72に進む。
【００８４】
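In substep SS72, a cost is calculated for each division candidate at each division number m produced by cutting at ICRLB boundaries. The calculation unit 462a performs this calculation according to the cost function F(D), which is given by equation (1):
【００８５】
【数１】
$$F(D(m,n)) = \sum_{i=1}^{m} \left| L(m,n,i) - \frac{L}{m} \right| + \sum_{k=1}^{m} A_k \cdot \mathrm{count}(Ph(m,n,k)) - B \cdot \frac{L}{m} \qquad (1)$$
Here, D(m,n) denotes the n-th of the supplied candidate combinations, divided into m parts; L is the total beat count of the ICRLB; L(m,n,i) is the beat count of the i-th prosodic phrase when the n-th candidate is divided into m parts; Ph(m,n,k) is the k-th prosodic phrase of that division; count(Ph) is a function indicating whether the feature I is present in the k-th prosodic phrase; A_k is the weight coefficient of the feature I for the k-th prosodic phrase; and B is the weight coefficient for the average beat count after division. In equation (1), the first term is the total error described above, the second term is the cost of the feature I, and the third term is the cost of the average beat count after division. The sum of these terms is the cost of the n-th division candidate.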
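A minimal Python sketch of this cost follows. It assumes the three terms combine as the error sum, plus the weighted feature sum, minus B times the average beat count L/m, which is the form that reproduces the costs quoted in the worked examples of this description. The mapping of the weights 7 and -3 to suppression and emphasis is likewise inferred from those examples, and the data layout is hypothetical.

```python
# Weights quoted in the worked examples: A = 7 and -3, B = 0.2.
# Assumption: 7 applies to a suppressed (-emph) phrase head, -3 to an emphasized (+emph) one.
A = {"-emph": 7.0, "+emph": -3.0, "0emph": 0.0}
B = 0.2


def cost(candidate, a=A, b=B):
    """candidate: list of prosodic phrases, each a list of (beats, emphasis) pairs."""
    phrase_beats = [sum(beats for beats, _ in phrase) for phrase in candidate]
    even = sum(phrase_beats) / len(phrase_beats)             # L / m
    error = sum(abs(x - even) for x in phrase_beats)         # first term: total error
    feature = sum(a[phrase[0][1]] for phrase in candidate)   # second term: first element of each phrase
    return error + feature - b * even                        # third term is subtracted


# Beat counts (4),(5),(4),(5),(4) with the proper noun as the fourth element.
w = [(4, "0emph"), (5, "0emph"), (4, "0emph"), (5, "+emph"), (4, "0emph")]
print(round(cost([w[:3], w[3:]]), 2))   # -1.2
print(round(cost([w[:2], w[2:]]), 2))   # 1.8
```

The split that places the emphasized element at the head of the second phrase scores lower, which is the behaviour the second term is meant to produce.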
【００８６】
次にサブステップSS73では、計算されたコストを変数VAL に格納する。格納後、サブステップSS74に進む。
【００８７】
サブステップSS74では、変数VAL と変数MIN の値を比較する。この比較において変数MIN の値が変数VAL の値よりも大きいとき（ VAL＜MIN ）、サブステップSS75に進む。また、変数VAL の値が変数MIN の値以上のとき（ VAL≧MIN ）、サブステップSS76に進む。
【００８８】
サブステップSS75では、変数MIN の値を変数VAL の値で置換するとともに、この n番目の分割候補を分割した位置を示すICRLB 境界位置を示すデータをメモリMIN_DIV[ ]にこれまでの記憶データと置換させる。
【００８９】
サブステップSS76では、分割候補の組合せがまだあるかどうか判断している。まだ分割候補があるとき（Yes ）、サブステップSS77に進む。また、分割候補がなくなったとき（No）、サブステップSS78に進む。ここで、その判断には、前述した分割候補の有無で用いた変数MAX_PH_NUMと変数PH_NUMとの関係から判る。すなわち、変数MAX_PH_NUMと変数PH_NUMが等しくなると、分割候補がなくなったことになるからである。
【００９０】
サブステップSS77では、供給される新たなn+1 番目の分割候補の拍数をカウントする。このカウント処理の後、サブステップSS71に処理を戻す。また、サブステップSS78では、分割候補のコスト計算が終了し、この時点で最小な分割候補が選択されたことになるので、最小の確定したメモリMIN_DIV[ ]のICRLB 境界位置のデータを最適な分割候補として図４の指令設定部46a に供給する。この供給の後、リターンに進む。リターンを介して分割候補の選択処理を終了させる。
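The selection loop of substeps SS71 to SS78 can be sketched in Python as below. The cost function is passed in as a parameter so that the loop itself stays generic; the two cost functions shown are simplified stand-ins (no emphasis feature term), and the candidate list is a hand-picked subset of the splits of the (6),(6),(7),(4),(6) example, both chosen for illustration.

```python
def select_best(candidates, cost_fn):
    """Keep the candidate with the lowest cost, as in substeps SS71 to SS78."""
    min_cost = 9999        # initial value of MIN set in substep SS71
    min_div = None         # plays the role of the memory MIN_DIV[]
    for cand in candidates:
        val = cost_fn(cand)            # SS72/SS73: compute the cost and hold it in VAL
        if val < min_cost:             # SS74/SS75: replace only on a strictly lower cost
            min_cost, min_div = val, cand
    return min_div, min_cost


def error_only(phrase_beats):
    """First term only: total deviation from an equal split."""
    even = sum(phrase_beats) / len(phrase_beats)
    return sum(abs(x - even) for x in phrase_beats)


def with_length_term(phrase_beats, b=0.2):
    """First term minus b times the average beat count (no emphasis features)."""
    return error_only(phrase_beats) - b * sum(phrase_beats) / len(phrase_beats)


cands = [(12, 17), (12, 7, 10), (6, 6, 7, 10), (6, 6, 7, 4, 6)]
print(select_best(cands, error_only)[0])
print(select_best(cands, with_length_term)[0])
```

Because the comparison is strict, a later candidate with an equal cost does not displace the one already stored, matching the VAL ≧ MIN branch that skips the update.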
【００９１】
このように処理することにより、最適な分割が施された分割候補を選択しこの選択された分割候補にポーズ指令・フレーズ指令をそれらの指令の規則と照合して設定している。
【００９２】
次により具体的な例を用いて説明するとともに、従来の処理との比較も交えて説明する。前述した規則に基づいて入力文のテキスト解析を行う。ここで、第２の規則で用いる基本的なパラメータである最短拍数L₁=5、韻律句に含まれる拍数限界L₂=15 にしている。また、韻律語素の間に挿入されている記号「↓」は、アクセント指令、記号「，」は韻律語素境界、記号「P 」はフレーズ指令を示している。演算処理の準備として、パラメータ格納部464 は、格納されている特徴量I=-1, I=-2に関する重み係数A₁=7, A₂=-3 、および重み係数B=0.2 を演算部462aに供給している。
【００９３】
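When the input sentence 「私たちの生活から切り離せない道具となっています。」 is supplied to the phonological processing unit 40 through the data input unit 20 and the language analysis unit 30 of FIG. 1, the accent commands and the division into prosodic word elements turn it into 「ワタシ↓タチノ，セーカツカラ，キリハナセナイ，ドーグ↓ト，ナ↓ッテイマス」. Writing the beat count of each prosodic word element in parentheses gives (6),(6),(7),(4),(6); the maximum number of divisions is 5, and the total beat count of the ICRLB is 29. As shown in FIG. 14, the prosodic phrase of the input sentence is divided into between 2 phrase commands, the minimum, and 5, the maximum. The minimum follows from the total beat count of the ICRLB and the beat limit L₂ = 15. The table of FIG. 14 shows the division positions that give the minimum cost for each division number.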
【００９４】
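For the cost calculation, the calculation unit 462a computes the value of each term of the cost function F and sums them (see equation (1)). Having found the division positions with the minimum cost for each division number, the table of FIG. 14 shows that the cost is lowest with 2 divisions, split as (6+6, 7+4+6). When the command setting unit 46a sets the pause and phrase commands to match this division, the data generated for the input sentence is 「P₁ワタシ↓タチノ，セーカツカラ，P₃キリハナセナイ，ドーグ↓ト，ナ↓ッテイマスP₀」. This shows that taking the length of the prosodic phrases after division into account, through the third term of equation (1) in particular, yields data with natural prosody.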
【００９５】
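By contrast, when the division number was judged using only the first term of the cost function, the total error was lowest, at 3.6, for 5 divisions, that is, (6,6,7,4,6). Phrase commands are then added to the input sentence as 「P₁ワタシ↓タチノ，P₃セーカツカラ，P₃キリハナセナイ，P₃ドーグ↓ト，P₃ナ↓ッテイマスP₀」. Because the input sentence is divided too finely, however, the prosody of the synthesized speech obtained from this data was unnatural. Taking the length of the prosodic phrases after division into account in this way makes it possible to produce synthesized speech with appropriate prosody.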
【００９６】
また、別な入力文「開幕連勝を支えた望月が打たれた。」が音声合成装置10に供給された。この場合、韻律処理、アクセント指令により、「カイマク，レ↓ンショーヲ，ササエタ，モチ↓ズキガ，ウタ↓レタ」となる。また、各韻律語素の拍数を括弧内の数字で表すと、韻律語素の拍数は、それぞれ、(4),(5),(4),(5),(4) で、この場合の分割される最小個数は 2、最大個数は 3で、ICRLB 全体の拍数は(22)である。図15に示す表は、各分割数においてコストを最小にするフレーズ指令の位置関係を示している。
【００９７】
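In this case too, the cost function F is evaluated term by term and the values are summed. Among the minima for each division number, the cost is lowest, at -1.2, with 2 divisions split as (4+5+4, 5+4). When the command setting unit 46a sets the pause and phrase commands to match this division, the data generated for the input sentence is 「P₁カイマク，レ↓ンショーヲ，ササエタ，P₃モチ↓ズキガ，ウタ↓レタP₀」. This shows that taking into account the emphasis information of the prosodic words and the length of the prosodic phrases after division, through the second and third terms of equation (1), yields data with natural prosody.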
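How a selected split turns into the annotated string can be sketched in Python as follows. The function name is hypothetical, and the sketch covers only the simple pattern seen in this example: P₁ at the start, P₃ at each internal division, and P₀ at the end. Pause symbols and the P₂ command are left out.

```python
def annotate(phrases):
    """Join prosodic word elements and mark phrase starts with P₁ / P₃, ending with P₀."""
    words = []
    for k, phrase in enumerate(phrases):
        marker = "P₁" if k == 0 else "P₃"   # first phrase versus an internal re-division
        words.append(marker + phrase[0])
        words.extend(phrase[1:])
    return "，".join(words) + "P₀"


split = [["カイマク", "レ↓ンショーヲ", "ササエタ"], ["モチ↓ズキガ", "ウタ↓レタ"]]
print(annotate(split))
```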
【００９８】
ところで、前述の比較と同様にコスト関数の第１項だけを用いて分割数を判断していたとき、誤差の総和は、 2分割し、分割候補が(4+5,4+5+4) あるいは(4+5+4,5+4) と韻律句を区切った場合、最小値4 となった。この入力文は２通りの分割が得られる。(4+5,4+5+4) の分割候補で生成されたデータは、「P₁カイマク，レ↓ンショーヲ，P₃ササエタ，モチ↓ズキガ，ウタ↓レタP₀」となり、一方、(4+5+4,5+4) の分割候補で生成されたデータは、「P₁カイマク，レ↓ンショーヲ，ササエタ，P₃モチ↓ズキガ，ウタ↓レタP₀」となる。
【００９９】
ここで、(4+5,4+5+4) の分割候補の生成されたデータには、動詞「支えた」の前にフレーズ指令が設定され、固有名詞「望月」の前にフレーズ指令が付いていない。これは、起伏語の強調がアクセント指令を大きくすることにより行われるがフレーズ指令を前の単語に付けたため強調単語が強調されなくなってしまうことを示している。すなわち、強調単語が韻律句の先頭に来ないICRLB 分割を行うことに起因している。したがって、前者の分割候補を用いて合成音声を生成すると、不自然な韻律で発声されることになる。この入力文の場合、誤差の総和がともに4 と同じため、誤差の総和だけの判断では前者の分割候補が採用される虞れがあった。このような観点からも本発明を適用した音声合成装置10は、的確に韻律を自然に再現する分割候補を選択できるので、この不適切な分割候補の採用を回避することができた。
【０１００】
また、抑圧単語が韻律句の先頭に来るようにICRLB 分割する場合もある。このような場合の例文としては、たとえば「電話が繋がってしまう事があるかもしれません。」がある。この例では、韻律処理、アクセント指令により、「デンワガ，ツナガッテシマウ，コト］ガ，ア｝ルカモ，シレマセ｝ン」となる。また、各韻律語素の拍数を括弧内の数字で表すと、韻律語素の拍数は、それぞれ、(4),(8),(3),(4),(5) で、ICRLB 全体の拍数は(24)である。図16に示す表は、各分割数においてコストを最小にするフレーズ指令の位置関係を示している。
【０１０１】
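The cost function F is evaluated term by term and the values are summed. As a result, the cost for the input sentence was lowest, at 3.6, with 2 divisions split as (4+8+3, 4+5). When the command setting unit 46a sets the pause and phrase commands to match this division, the data generated for the input sentence is 「P₁デンワガ，ツナガッテシマウ，コト］ガ，P₃ア｝ルカモ，シレマセ｝ン」. Here, taking into account the emphasis information of the prosodic words and the length of the prosodic phrases after division, through the second and third terms of equation (1), yielded data with natural prosody.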
【０１０２】
一方、コスト関数F の第１項のみを用いて分割候補を評価すると、分割数2 で、かつ(4+8,3+4+5) と分割したとき、誤差が0 と最小になった。このとき、入力文に対して生成されるデータは、「P₁デンワガ，ツナガッテシマウ，P₃コト］ガ，ア｝ルカモ，シレマセ｝ン」となる。このデータが示すように、「コト」の直前にフレーズ指令が設定される。しかしながら、「コト」のような形式名詞は、通常、抑圧されて強く発音されないことが知られている。式(1) の第１項のみによる誤差の総和で最小値を探しても抑圧単語が韻律句の先頭に来るような分割候補を選択すると、結局、不自然な韻律で合成音声を生成してしまうことになる。したがって、コスト関数の全項で入力文のデータを評価することにより、的確な分割候補を選択できるようになる。
【０１０３】
音声合成装置10に、たとえば入力文「国有地の処分に適用されているのと同じ転売規則を設けるという基本方針を決めました。」が入力された場合、韻律語素の処理、アクセント指令の処理により、入力文は、「コクユ］ーチノ，ショ↓ブンニ，テキヨーサレ↓テイルノト，オナジ，テンバイキソ］クヲ，モーケ↓ルトイユー，キホンホ↓ーシンヲ，キメマ｝シタ」と処理される。各韻律語素の拍数は、(6),(4),(11),(3),(8),(7),(8),(5)で、全体の拍数は52であった。コスト関数F の第１項のみで評価すると、図17に示すように、分割数4 、かつ(6+4,11+3,8+7,8+5)の分割候補の場合に誤差の総和が6 と最小になることが判る。またコスト関数の全項の合算によるコストを各分割候補の組合せで比較した場合でも分割数4 、かつ(6+4,11+3,8+7,8+5)の場合にコストが3.4 となり、最小を示す。このように同じ結果が得られる場合もある。
【０１０４】
次に本発明の音声合成装置の変形例について図18および図19を参照しながら説明する。この音声合成装置10は、基本的に前述の実施例の構成と略々同じであるが、音韻処理部40のポーズ・フレーズ指令生成部46に対して構成の変更を施している。この変更が施された音韻処理部40の要部としてポーズ・フレーズ指令生成部46の構成を図18に示す。
【０１０５】
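As in the embodiment described above, the pause/phrase command generation unit 46 includes a pause/phrase command setting unit 46A and an ICRLB division unit 46B. In this case, the output of the language analysis unit 30 is supplied to the command setting unit 46a of the pause/phrase command setting unit 46A. The pause/phrase command setting unit 46A includes this command setting unit 46a and a rule storage unit 46b. The rule storage unit 46b stores the rules for attaching pause commands and phrase commands to the supplied data. The command setting unit 46a has a function equivalent to that of the rule division unit 460a of the embodiment above, and also a function for judging whether the supplied data satisfies, among these rules, the beat limit L₂ for a prosodic phrase. Based on this condition, the command setting unit 46a selects the output destination of the data. The condition is described in detail in the operation below.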
【０１０６】
ICRLB 分割部46B は、実例検索部466 と、分割実例記憶部468 とが備えられている。実例検索部466 には、指令設定部46a から供給されるデータを韻律語素に分解する韻律語素分割部466aと、韻律語素分割部466aの出力に対する検索キーの情報を抽出し、そして分割実例記憶部468 に記憶されている実例との照合を行う検索キー照合部466bと、検索キー照合部466bの照合結果に基づいて言語解析部30からの出力の分割を施すとともに、ポーズ指令、フレーズ指令をこの出力に対して付与する実例対応分割部466cとが含まれている。検索キー照合部466bは、分割実例記憶部468 が有する照合用検索キーと同じ情報で照合を行っている。
【０１０７】
分割実例記憶部468 は、メモリで、既に的確な分割と判断された実例のデータを照合用の検索キーとして記憶している。照合用の検索キーとしては、実際の入力文のデータ、強調情報、韻律語素、韻律語素の拍数、この拍数の総数、韻律語素の係受けの種別、および／またはアクセント指令等を用いる。
【０１０８】
このポーズ・フレーズ指令生成部46の動作について図19のサブルーチンSUB8を用いて簡単に説明する。サブルーチンSUB8は、図８に示したサブルーチンSUB3の処理に代わるポーズ・フレーズ指令生成処理ルーチンである。この処理には、分割実例記憶部468 に的確な分割がされた実例を予め複数格納しておくことが好ましい。この実例には、ポーズ指令・フレーズ指令も含まれている。
【０１０９】
図19に示すサブルーチンSUB8のサブステップSS80では、供給される韻律句の長さが規定されている拍数限界L₂を満足するかの判断を行う。この際、指令設定部46a には、規格格納部46b から格納されている拍数限界L₂の値が供給されている。指令設定部46a は、この数値に基づき韻律句の長さが（第２の）規則を満足するか判断している（分割長判断工程）。規則を満足するとき（Yes ）、サブステップSS81に進む。また、規則を満たさないとき（No）、サブステップSS82に進む。
【０１１０】
サブステップSS81では、規則を満たすように韻律句の長さが拍数限界L₂以下にあるので、韻律句の間に対応するポーズ記号・フレーズ記号を付加する。指令設定部46a は、規格格納部46b に格納されているポーズ指令・フレーズ指令の規則に対応してポーズ記号・フレーズ記号を付加している。この処理後、リターンに移行する。
【０１１１】
サブステップSS82では、供給されたデータをICRLB 分割部46B の実例検索部466 に送って韻律語素に分割する。この分割処理（すなわちICRLB 分割）は、実例検索部466 の韻律語素分割部466aで行われる。韻律語素分割部466aは、分割したデータを検索キー照合部466bに出力する。この処理後、サブステップSS83に進む。
【０１１２】
サブステップSS83では、分割されたデータに含まれている情報と一致する分割実例の検索および検索の一致度合いを判定する。実際に、この検索は、韻律語素分割部466aからの出力を検索キーとし分割実例記憶部468 が有する情報を照合用の検索キーとし、検索キー照合部466bで行われる。この検索キーの例は、後段に示す。検索において、完全に一致する実例が得られたとき（Yes ）、サブステップSS84に進む。検索キー照合部466bは、完全に一致しなかったとき（No）、サブステップSS85に進む。
【０１１３】
サブステップSS84では、分割実例記憶部468 から一致した実例を検索キー照合部466b、実例対応分割部466cを介して出力する。このとき、実例対応分割部466cは、単にこの実例の情報をスルーさせて規則を満たした際の指令設定部46a の出力先に供給する。この供給の後、リターンに移行する。
【０１１４】
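In substep SS85, the output of the language analysis unit 30 is divided according to the best-matching example held in the division example storage unit 468, and pause and phrase commands are added. The search key matching unit 466b supplies the closely matching example to the example-based division unit 466c. Since the input sentence itself may not match in such a case, the example-based division unit 466c divides the output of the language analysis unit 30 according to the information from the search key matching unit 466b and adds the information (pause commands, phrase commands, and so on), including the division positions. The example-based division unit 466c supplies its output to the same destination as the output of the command setting unit 46a in substep SS84. After this processing, the procedure moves to return. The processing of substeps SS84 and SS85 corresponds to the example search step.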
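The exact-match and closest-match branches of substeps SS83 to SS85 can be sketched in Python as below. The description does not say how the degree of match is measured, so the `distance` function here (beat differences plus a penalty of 10 per emphasis mismatch) is purely an assumption, as are the class and function names. The stored examples use the beat counts, emphasis labels, and division positions of the sample sentences listed later in this description.

```python
from dataclasses import dataclass


@dataclass
class Example:
    beats: tuple        # beat count of each prosodic word element
    emph: tuple         # emphasis label of each element
    split_after: int    # number of elements before the division position


EXAMPLES = [
    Example((4, 4, 3, 4, 4), ("0", "0", "0", "0", "0"), 2),
    Example((5, 3, 5, 7, 3), ("0", "0", "0", "+", "0"), 3),
    Example((4, 4, 3, 4, 4), ("0", "0", "-", "0", "0"), 3),
]


def distance(ex, beats, emph):
    """Hypothetical similarity measure; the source does not define one."""
    beat_gap = sum(abs(a - b) for a, b in zip(ex.beats, beats))
    emph_gap = sum(a != b for a, b in zip(ex.emph, emph))
    return beat_gap + 10 * emph_gap


def find_split(beats, emph, examples=EXAMPLES):
    """Return the division position of an exact match, else of the closest example."""
    same_len = [ex for ex in examples if len(ex.beats) == len(beats)]
    if not same_len:
        return None
    for ex in same_len:
        if ex.beats == tuple(beats) and ex.emph == tuple(emph):
            return ex.split_after                                    # SS84: exact match
    best = min(same_len, key=lambda ex: distance(ex, beats, emph))   # SS85: closest match
    return best.split_after


print(find_split((4, 4, 3, 4, 4), ("0", "0", "0", "0", "0")))
print(find_split((5, 3, 4, 7, 3), ("0", "0", "0", "+", "0")))
```

Restricting the comparison to examples with the same number of elements is a simplification; a fuller version would also use the other search keys the description lists, such as dependency type and accent commands.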
【０１１５】
リターンでは、実例に合った分割候補の選択を行うサブルーチンSUB8を終了してサブステップSS22に進む。これ以降の処理は、サブルーチンSUB2の処理を行った後、メインルーチンにより合成音声を生成している。この変形例のように構成すると、音声合成装置10を簡略化して構成することが可能になる。
【０１１６】
この検索に用いる情報の具体例を以下に示す。一つの入力文に対して分割実例記憶部468 には、たとえば韻律語素、分割位置（記号「／」で示す）、韻律語素数、全体の拍数、強調情報、各韻律語素の拍数、係受けの種別、アクセント指令の位置および／または品詞の種類等の情報が数値に置き換えられて格納される。
【０１１７】
For example, the input sentence 「遠くの海まで漁に出かけている漁師が」 gives the prosodic word elements and division position トークノ，ウミマデ，（／）リョーニ，デテイル，リョーシガ, with 5 prosodic word elements, a total of 19 beats, emphasis information 0emph,0emph,0emph,0emph,0emph for the elements, and beat counts (4),(4),(3),(4),(4). The P symbols are then （P₁）トークノ，ウミマデ，（P₃）リョーニ，デテイル，リョーシガ（P₀）. An example containing emphasis, 「国境に近い島である魚釣り島に着いた」, gives コッキョーニ，チカ↓イ，シマ↓デアル，（／）ウオツリジマニ，ツ↓イタ, with 5 prosodic word elements, a total of 23 beats, emphasis information 0emph,0emph,0emph,+emph,0emph, and beat counts (5),(3),(5),(7),(3). The P symbols are then （P₁）コッキョーニ，チカ↓イ，シマ↓デアル，（P₃）ウオツリジマニ，ツ↓イタ（P₀）.
【０１１８】
また、抑圧を含む文例「会長を務めたことも強みの一つです。」は韻律語素と分割位置の関係：カイチョーヲ，ツト↓メタ，コト↓モ，（／）ツヨミノ，ヒト↓ツデスとなり韻律語素数が 5、全体の拍数が19、各韻律句の強調情報は、0emph,0emph,-emph,0emph,0emph で、各韻律句の拍数は、(4),(4),(3),(4),(4) である。このときのP 記号は（P₁）カイチョーヲ，ツト↓メタ，コト↓モ，（P₃）ツヨミノ，ヒト↓ツデス（P₀）となる。
【０１１９】
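Finally, as an example that also uses the dependency type as information, the input sentence 「私たちの生活から欠かせない道具となっています。」 becomes ワタシ↓タチノ，セーカツカラ，（／）カカセナイ，ドーグ↓ト，ナ↓ッテイマス, with 5 prosodic word elements and a total of 27 beats. The dependency types, element by element, are adnominal, adverbial, adnominal, adverbial, and none; the emphasis information is 0emph,0emph,0emph,0emph,0emph; and the beat counts are (6),(6),(5),(4),(6). The P symbols are then （P₁）ワタシ↓タチノ，セーカツカラ，（P₃）カカセナイ，ドーグ↓ト，ナ↓ッテイマス（P₀）. Converting this information to numerical values, storing it in the division example storage unit 468, and using it as the matching search keys at search time makes it possible to search the examples, and to improve the accuracy of the search, without setting parameters.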
【０１２０】
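With the configuration described above, appropriate ICRLB division can be performed, so phonological processing that produces unnatural synthesized speech is avoided and more natural, easy-to-hear synthesized speech can be output. The speech synthesizer can also handle, semi-automatically, changes in parameter values that accompany changes to the rules. This makes the device easier to operate and reduces manual effort.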
【０１２１】
また、実例を検索し選択した実例を用いることにより、パラメータ設定等の処理および演算処理をなくすことができ、装置構成の簡略化によりコスト低減も図ることができる。
【０１２２】
なお、本発明は、前述した実施例に限定されるものでなく、たとえば前述の実施例とこの変形例を組み合わせて構成してもよい。音声合成装置10を稼働させはじめた初期では、情報の蓄積を図るため図４の構成により最適な分割候補を求め、中間言語を作成させるとともに、この求めた分割候補の情報を分割実例記憶部468 に供給して予め各種の実例を学習的に記憶させる。この処理を行って分割実例記憶部468 に情報を蓄積させた後、供給されるデータに対する処理を前述した変形例（の図18に示す構成）に切り換えて情報の検索を行ってポーズ指令・フレーズ指令を付加するようにしてもよい。これにより、最初に、ユーザの使用する確実な音声合成するためのデータが音声合成装置10に蓄積されるので無駄なデータの格納を避けることができ、処理を検索処理にした場合は、演算を行うことなく検索キーを組み合わせて検索することにより所望の分割候補およびそれに付加する情報を求めることが容易にできるようになる。
【０１２３】
また、前述の実施例では、分割位置を選択する式(1) に拍数を用いたが、文節の中心語、または文節間の係受け種別等を変数に用いて演算させることもできる。
【０１２４】
【発明の効果】
このように本発明の音声合成装置によれば、分割候補生成手段で言語解析手段の出力を第１の規則に基づいて分割し、この分割された韻律句を第２の規則で判断して韻律句を再分割する際にこの韻律句を区切る位置により生成される韻律句の組合せを分割候補とし、分割候補選択手段で得られた分割候補が含む音声表現の情報およびパラメータ格納手段から供給されるパラメータを用いて分割妥当性を示す評価値を算出し、さらに、この評価値を用いて分割候補選択手段で第３の規則を満たす分割候補を選択して選ばれた分割候補が最適な長さに区切られた韻律句になって音韻解析手段でこの分割に応じた中間言語を生成する音声合成装置における中間処理が行われるので、不自然な合成音声を出す音韻処理が避けられ、より自然で聞き易い合成音声を出力させることができ、規則の変更に伴うパラメータ値の変更も半自動的に対応することができる。これにより、さらにこの装置の操作性を容易化し、人手による労力を減少させることもできる。
【０１２５】
また、音声合成装置は、分割実例格納手段に実例を記憶させて分割実例検索手段で外部からの情報に一致する記憶させていた情報を検索して、この検索結果を用いて該当する分割候補に指令等を付加すると、パラメータに対する処理、分割候補のコスト計算等を行うことなく、簡単な構成で品質の高い合成音声を得ることができる。装置のコスト低減も図ることができる。
【０１２６】
本発明の音声合成装置のテキスト解析方法によれば、文章に対する解析結果を第１の規則、第２の規則に沿って順に処理や判断を行い、この結果に応じて情報を再分割した際に得られる組合せを分割候補とし、一方、格納されるパラメータおよび分割候補が含む音声表現の情報を用いて分割妥当性の評価値を算出し、第３の規則を満足する分割候補を選択する音韻解析を行ってこの分割候補に応じた中間言語を生成することにより、より自然で聞き易い合成音声の出力される確度を向上させることができる。
【０１２７】
また、分割候補は、供給される情報および／または適切に区分された実例に付加される情報を検索キーとして予め格納し、供給される韻律句に対する判断結果に応じて出力先を選択した後に、この選択された出力先で供給される韻律句の分割に該当する実例を検索し、この検索結果に応じて各種指令を付与することによっても、無駄な演算を大幅に削減しても容易な操作でより自然な品質の高い合成音声を生成させることができる。
【図面の簡単な説明】
【図１】本発明の音声合成装置の概略的な構成を示すブロック図である。
【図２】図１の音声合成装置における言語解析部の概略的な構成の一例を示すブロック図である。
【図３】図２の言語解析部で扱う言語の処理単位の関係とその分類例を示す図である。
【図４】図１の音声合成装置における音韻処理部の概略的な構成の一例を示すブロック図である。
【図５】図１の音声合成装置の動作を説明するメインフローチャートである。
【図６】図５のメインフローチャートに示した言語解析処理（サブルーチンSUB1）の動作を説明するフローチャートである。
【図７】図６に示した処理ルーチン内の構文解析処理として構文木および係り受け構造の一具体例を表す図である。
【図８】図５のメインフローチャートに示した音韻解析処理（サブルーチンSUB2）の動作を説明するフローチャートである。
【図９】図８のサブルーチンSUB2に示したポーズ・フレーズ指令設定（サブルーチンSUB3）の動作を説明するフローチャートである。
【図１０】図９のサブルーチンSUB3に示したICRLB 分割処理（サブルーチンSUB4）の動作を説明するフローチャートである。
【図１１】図10のサブルーチンSUB4に示した分割候補生成処理（サブルーチンSUB5）の動作を説明するフローチャートである。
【図１２】図10のサブルーチンSUB4に示したパラメータ格納処理（サブルーチンSUB6）の動作を説明するフローチャートである。
【図１３】図10のサブルーチンSUB4に示した分割選択処理（サブルーチンSUB7）の動作を説明するフローチャートである。
【図１４】図１の構成に供給された入力文を分割した分割候補に対して得られる各変数の数値および分割候補の選択を表す図である。
【図１５】図１の構成に供給された入力文を分割した分割候補に対して得られる各変数の数値および分割候補の選択を表す図である。
【図１６】図１の構成に供給された入力文を分割した分割候補に対して得られる各変数の数値および分割候補の選択を表す図である。
【図１７】図１の構成に供給された入力文を分割した分割候補に対して得られる各変数の数値および分割候補の選択を表す図である。
【図１８】図４の音韻処理部における他の構成の要部である概略的なポーズ・フレーズ指令生成部を示すブロック図である。
【図１９】図18のポーズ・フレーズ指令生成部の実例に則したポーズ・フレーズ指令処理（サブルーチンSUB8）の動作を説明するフローチャートである。
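[Explanation of symbols]
10 Speech synthesizer
20 Data input unit
30 Language analysis unit
40 Phoneme processing unit
42 Prosodic word generation unit
44 Accent command generation unit
46 Pause/phrase command generation unit
50 Control parameter generation unit
60 Speech signal generation unit
460 Division candidate generation unit
462 Division candidate selection unit
464 Parameter storage unit
[0001]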
[Technical field to which the invention belongs]
The present invention relates to a speech synthesizer that performs various analyses on a supplied sentence and artificially generates synthesized speech from an intermediate language generated from the analysis results in accordance with prosodic rules and the like, and to a text analysis method thereof. In particular, it relates to a speech synthesizer that performs morphological and syntactic analysis on a supplied Japanese sentence and generates synthesized speech based on an intermediate language generated by phonological processing applied to the result, and to a text analysis method suitable for application to this speech synthesizer.
[0002]
[Prior art]
Speech can be synthesized artificially either by a recording/playback method, which forms waveforms by combining and reproducing pre-recorded speech, or by a pure synthesis method, which synthesizes speech purely artificially. For controlling waveform formation in the recording/playback method there are likewise an edit control method, which uses combinations of information prepared in advance, and a rule control method, which generates control signals purely artificially. Among these, speech synthesizers that apply the rule control method to provide high-quality synthesized speech in various fields, for example reading e-mail aloud, weather forecast guidance services, and news such as professional baseball results, have recently been attracting attention.
[0003]
To obtain high-quality synthesized speech, it is necessary to construct prosodic rules that accurately reflect linguistic information in Japanese, such as semantics, syntactic structure, and discourse structure, and that generate natural prosody. An example of a method for generating Japanese synthesized speech using such a rule control method is disclosed in the Journal of the Acoustical Society of Japan, Vol. 50, No. 6 (“Prosodic rules for synthesizing Japanese sentence speech”). The method disclosed there generates prosodic phrases by performing morphological analysis and syntactic analysis on the supplied text and, if a generated prosodic phrase is too large, divides it almost equally. In this way the semantic structure of the sentence is taken into account, and an intermediate language is generated so that the sentence can be read with smooth prosody.
[0004]
The speech synthesizer generates a control parameter based on the generated intermediate language, and synthesizes and outputs a speech corresponding to the control parameter.
[0005]
[Problems to be solved by the invention]
However, when a generated prosodic phrase is too large as described above, applying the rule may divide the prosodic phrase too finely, or may divide it in a way that causes it to be read contrary to its original prosody. As a result of such division, the output synthesized speech may sound unnatural.
[0006]
An object of the present invention is to eliminate these drawbacks of the prior art and to provide a speech synthesizer and a text analysis method thereof that can output the synthesized speech generated from supplied data as more natural, easy-to-hear synthesized speech.
[0007]
[Means for Solving the Problems]
To solve the above problems, the present invention provides a speech synthesizer comprising: language analysis means for analyzing language-level features of a sentence supplied as information, based on the morphemes contained in the sentence and its syntax; phoneme analysis means for analyzing the output of the language analysis means based on spoken-language-level features and generating, from the analysis result, an intermediate language that serves as a command for speech synthesis; control parameter generation means for generating control parameters according to the output of the phoneme analysis means; and speech signal generation means for synthesizing a speech signal based on the output of the control parameter generation means. The phoneme analysis means includes: division candidate generation means that divides the output of the language analysis means according to a first rule, which divides according to the modification relationships within a prosodic phrase formed by combining a plurality of prosodic word elements (chains of morphemes), and that, when a divided prosodic phrase is re-divided according to a second rule, which judges whether the prosodic phrase is smaller than a preset size, generates as division candidates the combinations of prosodic phrases obtained from the positions at which the prosodic phrase is delimited; division candidate selection means that calculates an evaluation value for the validity of the division using the speech expression information contained in the division candidates generated by the division candidate generation means, and selects a division candidate according to a third rule, which selects the division candidate with the smallest evaluation value; and parameter storage means that stores the parameters used to calculate the evaluation value in the division candidate selection means.
[0008]
Here, the division candidate generation means includes rule division means for dividing the output of the language analysis means using the first rule, division length determination means for judging the divided lengths in the output of the rule division means based on the second rule, and re-division means for re-dividing the prosodic phrases output from the rule division means in accordance with the judgment of the division length determination means to generate a plurality of division candidates.
[0009]
The parameter storage means may store a first weighting factor based on the speech expression information in the division candidate and a second weighting factor to be multiplied by the average number of prosodic word elements associated with the number of division candidates. This makes it possible to generate an intermediate language that can accurately express the emphasis / suppression specification of the generated prosodic phrase.
[0010]
Division example storage means for storing examples of prosodic phrase delimiting positions that satisfy the rules, and division example search means for searching the examples stored in the division example storage means for an example in which the supplied sentence is optimally divided, may also be provided. With such a configuration, an intermediate language that yields optimum synthesized speech can be output in accordance with actual examples.
[0011]
In searching the examples stored in the division example storage means, the division example search means preferably uses emphasis information, prosodic word elements, the number of beats of each prosodic word element, the total number of beats, the type of prosodic word element, accent commands, and/or the type of part of speech. Searching on these keys improves the accuracy of example matching and shortens the search time.
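A keyed lookup of stored division examples might be sketched as follows. This is only an illustration under stated assumptions: the class `DivisionExampleStore`, its method names, and the choice of just two of the listed keys (emphasis labels and beats per prosodic word element) are hypothetical and are not taken from the source.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Hypothetical search key: (emphasis label per element, beats per element).
ExampleKey = Tuple[Tuple[str, ...], Tuple[int, ...]]


class DivisionExampleStore:
    """Toy stand-in for the division example storage and search means."""

    def __init__(self) -> None:
        self._examples: Dict[ExampleKey, List[int]] = {}

    def store(self, emphasis: Sequence[str], beats: Sequence[int],
              cut_positions: Sequence[int]) -> None:
        # Remember where a phrase with this key was delimited.
        self._examples[(tuple(emphasis), tuple(beats))] = list(cut_positions)

    def search(self, emphasis: Sequence[str],
               beats: Sequence[int]) -> Optional[List[int]]:
        # Exact-match lookup; returns None when no stored example matches.
        return self._examples.get((tuple(emphasis), tuple(beats)))


store = DivisionExampleStore()
store.store(["0emph", "+emph", "0emph"], [4, 5, 3], [1])
found = store.search(["0emph", "+emph", "0emph"], [4, 5, 3])
missing = store.search(["0emph", "0emph"], [4, 5])
```

An exact-match dictionary keeps the lookup cheap; matching on more of the listed keys would only lengthen the tuple used as the key.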
[0012]
The language analysis means preferably includes an emphasis information setting means for setting the emphasis information rule and storing the set emphasis information.
[0013]
In the speech synthesizer of the present invention, the division candidate generation means divides the output of the language analysis means based on the first rule, judges each divided prosodic phrase by the second rule, and, when a prosodic phrase is re-divided, takes the combinations of prosodic phrases produced by the positions at which it is delimited as division candidates. The division candidate selection means calculates an evaluation value for the validity of the division, using the speech expression information contained in the obtained division candidates and the parameters supplied from the parameter storage means, and uses this evaluation value to select the division candidate that satisfies the third rule. The selected division candidate thus consists of prosodic phrases of optimum length, and the phoneme analysis means can perform the intermediate processing of the speech synthesizer, generating an intermediate language corresponding to the selected division candidate.
[0014]
The text analysis method of the speech synthesizer of the present invention analyzes language-level features of a sentence supplied as information based on the morphemes contained in the sentence and its syntax, analyzes the result based on spoken-language-level features relating to the prosody of the sentence, generates from the analysis result an intermediate language that serves as a command for speech synthesis, generates control parameters according to the generated intermediate language, and artificially synthesizes speech corresponding to the control parameters. The method includes: a rule division step of dividing the analysis result for the sentence using a first rule, which divides information according to the modification relationships within a prosodic phrase formed by combining a plurality of prosodic word elements (chains of morphemes); a division length determination step of judging the result of the rule division step according to a second rule, which judges whether the size of a prosodic phrase is smaller than a preset size; a division candidate generation step of generating, as division candidates, the combinations of prosodic phrases obtained from the delimiting positions when the information is re-divided according to the judgment result of the division length determination step; a parameter storage step of storing parameters used to evaluate division validity based on the speech expression information contained in the division candidates obtained in the division candidate generation step; an evaluation value calculation step of calculating an evaluation value from the parameters and the speech expression information contained in the division candidates generated in the division candidate generation step; and a division candidate selection step of selecting a division candidate based on a third rule, which selects the division candidate with the smallest of the evaluation values obtained in the evaluation value calculation step. The supplied sentence is analyzed through these steps.
[0015]
Here, the division candidates are preferably generated by re-dividing a prosodic phrase when the beat length of the prosodic phrase exceeds a predetermined re-division beat length that triggers re-division, and preferably include, at their boundaries, a symbol for tone control, which is one of the spoken-language-level features. Defining the division candidates in this way takes long prosodic phrases into account and avoids, for example, excessive division of prosodic phrases.
[0016]
The evaluation value is advantageously calculated through the following steps: an error sum calculation step of calculating the sum of the absolute values of the differences between the beat length of each section obtained by dividing the prosodic phrase and the beat length obtained by equally dividing the entire prosodic phrase; a feature amount sum calculation step of calculating, according to the presence of the feature amounts represented by the speech expression information included in the spoken-language-level features of the prosodic phrase, the sum of the beat lengths of the sections obtained by dividing the prosodic phrase; a weight calculation step of calculating weighting factors based on the speech expression information contained for the prosodic word elements of the division candidate; a weighted feature sum calculation step of multiplying the result of the feature amount sum calculation step by the weighting factors obtained in the weight calculation step and calculating the sum; a product calculation step of calculating the product of the average beat length after division of the prosodic phrase and the beat length obtained by equally dividing the beat length of the entire prosodic phrase; and an evaluation value calculation step of calculating the evaluation value of the division candidate of interest by subtracting the result of the product calculation step from the sum of the results of the error sum calculation step and the weighted feature sum calculation step.
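The relationship among these steps can be written compactly as follows. The symbols are introduced here only for illustration and do not appear in the source: b_i is the beat length of the i-th section of a division candidate with n sections, \bar{b} is the beat length obtained by equally dividing the whole prosodic phrase, f_k is the sum of the beat lengths of the sections in which the k-th speech expression feature is present, w_k is the corresponding weighting factor, and m is the average beat length after division. The source does not state the divisor used for the equal division, so \bar{b} is left abstract.

```latex
\begin{align*}
E   &= \sum_{i=1}^{n} \left| b_i - \bar{b} \right| && \text{(error sum)} \\
F_w &= \sum_{k} w_k \, f_k                         && \text{(weighted feature sum)} \\
P   &= m \cdot \bar{b}                             && \text{(product)} \\
C   &= E + F_w - P                                 && \text{(evaluation value)}
\end{align*}
```

Under this reading, the candidate with the smallest C is the one chosen by the third rule.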
[0017]
Values are preferably set for the speech expression information described above by classifying each prosodic word element as emphasized, suppressed, or neither. The feature amounts are obtained from the values thus set.
[0018]
As classification criteria, it is desirable to classify a phrase of the division candidate as emphasized when its core is a proper noun or a number, to classify it as suppressed when its core is a formal noun or a verb, or when it is located at the beginning of the sentence and ends with a particle, and to assign preset values for emphasis and suppression.
[0019]
That is, a phrase of the division candidate that contains a proper noun or a number is classified as emphasized, a phrase that contains a formal noun or a verb, or that is located at the beginning of the sentence and has a particle at its end, is classified as suppressed, and it is desirable to assign preset values for emphasis and suppression to the phrases so classified.
[0020]
The text analysis method preferably includes an example storage step of storing in advance, as search keys, the supplied information and/or information attached to appropriately divided examples, and an example search step in which, after the division length determination step selects an output destination according to the judgment result for the supplied prosodic phrase, an example corresponding to the division of the supplied prosodic phrase is searched for at the selected output destination and various commands are assigned to the divided prosodic phrase according to the search result. This search allows the sentence to be analyzed accurately.
[0021]
In the example storage step, it is desirable to store various examples in advance in a learning manner. Storing them in this way shortens the period needed to accumulate experience and allows a wider range of input to be handled.
[0022]
The parameters are preferably obtained by statistical analysis or multivariate analysis. Thereby, the parameter can be changed semi-automatically.
[0023]
The language to which this text analysis method is applied is advantageously Japanese. This facilitates the generation of an intermediate language based on Japanese sentence analysis with ambiguous expressions and syntax.
[0024]
According to the text analysis method of the speech synthesizer of the present invention, the analysis result for a sentence is processed in order by the first rule in the rule division step and the second rule in the division length determination step, and the combinations obtained by re-dividing the information according to the result are taken as division candidates. An evaluation value of division validity is then calculated (evaluation value calculation step) from the parameters stored in the parameter storage step and the speech expression information contained in the division candidates, and a division candidate that satisfies the third rule is selected in the division candidate selection step. Thus, even when the prosodic phrases of the supplied sentence are further subdivided for phonological analysis, an optimum division candidate can be selected and an intermediate language corresponding to this division can be generated.
[0025]
DETAILED DESCRIPTION OF THE INVENTION
Next, embodiments of a speech synthesizer and a text analysis method thereof according to the present invention will be described in detail with reference to the accompanying drawings.
[0026]
The speech synthesizer of the present invention performs morpheme and syntax analysis on a supplied Japanese sentence, and then generates a control parameter based on an intermediate language generated by phonological processing performed on the result. It is a device that generates and outputs synthesized speech based on control parameters. This speech synthesizer and its text analysis method will be described with reference to FIGS. As shown in FIG. 1, the speech synthesizer 10 includes a data input unit 20, a language analysis unit 30, a phoneme processing unit 40, a control parameter generation unit 50, and a speech signal generation unit 60.
[0027]
The data input unit 20 converts text in various forms into a data format that the speech synthesizer 10 can process and inputs it. The text may take various forms, such as a handwritten manuscript or data recorded on a recording medium in a predetermined format by document creation software such as a word processor. For a handwritten manuscript, the data input unit 20 is provided with a device that optically reads the text written on the paper. For reading data stored on a floppy disk, the data input unit 20 is provided with a disk drive inside the apparatus as a data reading device.
[0028]
As shown in FIG. 1, the language analysis unit 30 includes a morphological analysis unit 32, a syntax analysis unit 34, and an emphasis information setting unit 36. In the language analysis unit 30, the data supplied from the data input unit 20, that is, the digital data of the input sentence, is supplied to the morphological analysis unit 32. The morphological analysis unit 32 includes the word division analysis unit 32a and the word dictionary unit 32b shown in FIG. 2. A morpheme is the smallest form that carries meaning. The word division analysis unit 32a stores a processing program for word division and performs a search that matches the words contained in the input sentence against the words in the word dictionary unit 32b. The morphological analysis unit 32 thus divides the input sentence into words while searching the word dictionary unit 32b according to the program stored in the word division analysis unit 32a.
[0029]
Here, the terms used in language analysis are explained. Sentences are classified and handled in terms of prosodic units, syntactic units, and intermediate units between them (see FIG. 3(a)). Prosodic units include prosodic words, prosodic phrases, prosodic clauses, and prosodic sentences. A prosodic word is a chain of phonemes (a phoneme being the smallest unit that can be distinguished as a sound) that corresponds to one accent component and exhibits a certain accent type. A prosodic phrase is a chain of prosodic words corresponding to one phrase component. A prosodic clause is a chain of prosodic phrases delimited by pauses, at the final part of which the phrase component is not reset (that is, no negative phrase command occurs). Finally, a prosodic sentence is a chain of prosodic phrases in which, as in the preceding definition, the phrase component is not reset at the final part. For example, the sentence “The cyclone is south of the Izu Peninsula and showers are raining in eastern Japan” is classified as shown in FIG. 3(b).
[0030]
Syntactic units are, in increasing order of size, bunsetsu (phrase units), (phrases), clauses, and sentences. In particular, a clause is defined as a chain consisting of a predicate that does not modify another phrase (a verb predicate, adjective predicate, noun predicate, or the like) and all the phrases that modify that predicate directly or indirectly. For example, in the sentence “It is a substation that supplies electricity to 500 households in the vicinity,” the predicate “supplies” modifies “substation”, and therefore the portion ending in “supplies” is not classified as a clause.
[0031]
Intermediate units include prosodic word elements and ICRLBs. A prosodic word element is a chain of morphemes that is not divided into multiple prosodic words by syntactic conditions or by emphasis or suppression in reading. An ICRLB (Immediate Constituent with Recursively Left-Branching Structure) is a chain of prosodic word elements that is delimited by right-branch boundaries and contains only left-branch boundaries. In the syntax tree, a left-branch boundary is a boundary between prosodic word elements that are in a modification relationship. A right-branch boundary is a boundary between prosodic word elements that are not in a modification relationship, and is called an ICRLB boundary. An ICRLB may consist of a single word. When the syntax tree is divided at its right-branch boundaries, a sentence becomes a chain of ICRLBs. The first rule described later is the rule for performing this ICRLB division.
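The ICRLB division just defined can be sketched in a few lines. This is a minimal illustration under an assumption not stated in the source: the dependency structure is given as a list `heads`, where `heads[i]` is the index of the prosodic word element that element `i` modifies (-1 for the final predicate), and a boundary counts as a left-branch boundary exactly when an element modifies its immediate right neighbour. The function name `split_into_icrlb` and the placeholder element names are hypothetical.

```python
from typing import List


def split_into_icrlb(elements: List[str], heads: List[int]) -> List[List[str]]:
    """Cut a chain of prosodic word elements at every right-branch boundary.

    heads[i] is the index of the element modified by element i (-1 for none).
    Assumption: the boundary after element i is a left-branch boundary only
    when heads[i] == i + 1; any other boundary is treated as an ICRLB boundary.
    """
    if len(elements) != len(heads):
        raise ValueError("elements and heads must have the same length")
    chunks: List[List[str]] = []
    current: List[str] = []
    for i, element in enumerate(elements):
        current.append(element)
        is_last = i == len(elements) - 1
        if is_last or heads[i] != i + 1:
            chunks.append(current)  # close the current ICRLB here
            current = []
    return chunks


# Placeholder elements: A modifies B, B and C both modify D.
icrlbs = split_into_icrlb(["A", "B", "C", "D"], [1, 3, 3, -1])
```

With these placeholder dependencies the chain is cut once, between B and C, because B modifies D rather than its neighbour C.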
[0032]
With this definition, a clause boundary, for example, can be distinguished from an ordinary modification relationship even when a modification relationship holds across it. The boundary definitions can therefore be summarized as follows: a clause boundary is a boundary between clauses; an ICRLB boundary is a boundary between ICRLBs; and an enumeration boundary is the word boundary after a comma in an expression that lists nouns in parallel, for example “judicial, legislative, administrative”.
[0033]
The syntax analysis unit 34, which performs analysis using these definitions, is a functional unit that receives the morphological analysis result from the morphological analysis unit 32 as data and analyzes the syntax tree or dependency structure of the input sentence. From this analysis result, the syntax analysis unit 34 also determines ICRLBs, clauses, parallel relations, and the like, thereby performing syntactic analysis, a language-level feature, on the input sentence. When the parsing result is a dependency structure rather than a syntax tree, ICRLB boundaries are defined from the modification relationships defined above, and where there is a compound word containing multiple accent kernels, that too is defined as an ICRLB boundary.
[0034]
As shown in FIG. 2, the emphasis information setting unit 36 includes an information setting unit 36a, which sets emphasis information (emphasis or suppression) for words or phrases based on the analysis results of the morphological analysis unit 32 and the syntax analysis unit 34, and a standard storage unit 36b, which stores the rules for emphasis information that serve as the reference for this setting and supplies them to the information setting unit 36a. The information setting unit 36a checks the input data against the rules supplied from the standard storage unit 36b and sets the emphasis information.
[0035]
Here, the emphasis information is information used for setting accents and some phrase commands, and is one kind of speech expression information. It is classified into three types: +emph for emphasized information, -emph for suppressed information, and 0emph for information classified as neither. As an example of setting emphasis information, a phrase centered on a proper noun, a number, or the like is set to +emph, while a phrase centered on a formal noun or a verb such as “being” or “is”, or a phrase at the beginning of a sentence marked with the particle “ha”, is classified as -emph.
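The three-way classification can be illustrated with a small function. This is a sketch under assumptions: the part-of-speech labels (`"proper_noun"`, `"number"`, `"formal_noun"`, `"verb"`), the two boolean flags, and the order in which the checks are applied are hypothetical and are not specified by the source.

```python
def classify_emphasis(core_pos: str,
                      sentence_initial: bool = False,
                      ends_with_particle: bool = False) -> str:
    """Return "+emph", "-emph", or "0emph" for one phrase.

    core_pos is a hypothetical label for the part of speech at the
    core of the phrase; the flag names are likewise illustrative.
    """
    if core_pos in {"proper_noun", "number"}:
        return "+emph"  # emphasized
    if core_pos in {"formal_noun", "verb"}:
        return "-emph"  # suppressed
    if sentence_initial and ends_with_particle:
        return "-emph"  # sentence-initial phrase marked by a particle
    return "0emph"      # neither emphasized nor suppressed


labels = [
    classify_emphasis("proper_noun"),
    classify_emphasis("verb"),
    classify_emphasis("common_noun", sentence_initial=True, ends_with_particle=True),
    classify_emphasis("common_noun"),
]
```

Keeping the rules in one function mirrors the idea of a rule store consulted by the setting unit; a table-driven version would serve equally well.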
[0036]
For the input sentence, the language analysis unit 30 outputs the analysis results from the morphological analysis unit 32 and the syntax analysis unit 34, together with the emphasis information set by the emphasis information setting unit 36, to the phoneme processing unit 40. The phoneme processing unit 40 generates an intermediate language for speech synthesis based on the output of the language analysis unit 30. As shown in FIG. 1, the phoneme processing unit 40 includes a prosodic word generation unit 42, an accent command generation unit 44, and a pause/phrase command generation unit 46.
[0037]
The prosodic word generation unit 42 generates a prosodic word by performing accent combination of words included in the phrase. The prosodic word generation unit 42 generates a prosodic word by performing accent connection between phrases as necessary. In the prosodic word generation unit 42, phrase emphasis information is also set in the generated prosodic word.
[0038]
As shown in FIG. 4, the accent command generation unit 44 includes an accent command setting unit 44a, which generates accent commands specifying how accents are applied based on the ICRLB information in the output of the language analysis unit 30, and a standard storage unit 44b, which stores the accent command rules that serve as the reference for this setting and supplies them to the accent command setting unit 44a. The accent command setting unit 44a checks the language analysis result against the accent command rules supplied from the standard storage unit 44b and sets the accent commands.
[0039]
Here, an accent command is a command indicating the magnitude of an accent and the times of its rise and fall. In generating accent commands, the range to be processed is the range of accent deformation. Accent deformation (accent sandhi) is an interaction between prosodic word elements in which the accent components of a series of prosodic words influence one another, depending on the accent type (such as the flat type, with a flat accent, or the undulating type, with an undulating accent) and on the syntactic and discourse conditions that divide the sentence at each boundary. As this relationship shows, the range of accent deformation is the ICRLB.
[0040]
The pause/phrase command generation unit 46 has the function of generating pause lengths and phrase commands. To perform this function, the pause/phrase command generation unit 46 includes, as shown in the drawing, a pause/phrase command setting unit 46A, which generates pause commands and phrase commands, and an ICRLB dividing unit 46B, which divides the ICRLB.
[0041]
Here, the pause command is represented by the pause symbols S₁, S₂, and S₃, which are delimiters for sentences, clauses, and ICRLBs, respectively. The phrase command is represented by the phrase symbols P₀, P₁, P₂, and P₃, which are used, respectively, for resetting the phrase component, for its recovery at the beginning of a phrase, for its addition between ICRLBs, and for its addition within an ICRLB.
[0042]
The pause/phrase command setting unit 46A includes a command setting unit 46a, which generates pause commands and phrase commands, and a standard storage unit 46b, which stores the pause command and phrase command rules that serve as the reference for this setting and supplies them to the command setting unit 46a. In this embodiment, the command setting unit 46a is supplied not with the output of the language analysis unit 30 as it is but with the output of the ICRLB dividing unit 46B.
[0043]
Further, the ICRLB dividing unit 46B includes a division candidate generation unit 460, a division candidate selection unit 462, and a parameter storage unit 464. The division candidate generation unit 460 is supplied with the output of the language analysis unit 30. The division candidate generation unit 460 includes a rule division unit 460a, which divides the supplied data (clause) at the ICRLB boundaries (that is, the first rule); a division length determination unit 460b, which judges whether the length of each prosodic phrase divided into ICRLBs by the rule division unit 460a (hereinafter, an ICRLB prosodic phrase) is shorter than a predetermined length (that is, the second rule); and a re-division unit 460c, which, according to the judgment of the division length determination unit 460b, re-divides the prosodic phrase divided by the rule division unit 460a to generate a plurality of division candidates.
[0044]
Here, re-dividing an ICRLB prosodic phrase yields combinations of prosodic phrases determined by the positions at which the phrase is delimited, and each such combination is called a division candidate.
[0045]
When data is supplied for the first time and the determination condition of the second rule is satisfied, the division length determination unit 460b supplies its output to the command setting unit 46a. When the condition is not satisfied on this first determination, the output destination is set to the re-division unit 460c. In the second and subsequent determinations, it is also determined whether combinations of division candidates still remain for the ICRLB prosodic phrase divided under this determination condition.
[0046]
The re-division unit 460c feeds the divided data back for the determination of the second rule and repeats the re-division process until no further combinations of division candidates remain. Of the division candidates produced, the re-division unit 460c outputs only those that satisfy the determination condition of the second rule to the calculation unit 462a of the division candidate selection unit 462.
[0047]
However, when the information is always finely divided at the ICRLB boundaries in accordance with the division rule before being supplied to the ICRLB dividing unit 46B, the rule division unit 460a can be omitted. The division length determination unit 460b supplies its output to the re-division unit 460c when the divided ICRLB prosodic phrase is longer than the predetermined length; otherwise, when the second rule is satisfied on the first supply of the input sentence, it supplies this output only to the command setting unit 46a of the pause/phrase command setting unit 46A.
[0048]
Therefore, when a supplied ICRLB prosodic phrase is longer than the predetermined length, the division candidate generation unit 460 re-divides it with the built-in re-division unit 460c, repeating until every length is at or below the predetermined length, and supplies the result to the division candidate selection unit 462.
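One way to picture the generation of division candidates is to enumerate every way of cutting an over-long phrase between its prosodic word elements and keep only the combinations whose sections all fit the length limit. This is a sketch, not the procedure of the embodiment: the exhaustive enumeration, the function name `division_candidates`, and the representation of a phrase as a list of beat counts are assumptions made for illustration.

```python
from itertools import combinations
from typing import List


def division_candidates(beats: List[int], max_beats: int) -> List[List[List[int]]]:
    """Enumerate re-divisions of a phrase given as beats per prosodic word element.

    Each candidate is a list of sections; a candidate is kept only if every
    section has at most max_beats beats (the second-rule check).
    """
    n = len(beats)
    candidates: List[List[List[int]]] = []
    for k in range(1, n):  # k = number of cut positions, at least one cut
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            sections = [beats[a:b] for a, b in zip(bounds, bounds[1:])]
            if all(sum(section) <= max_beats for section in sections):
                candidates.append(sections)
    return candidates


# A 12-beat phrase made of elements with 4, 5, and 3 beats, limit of 8 beats.
cands = division_candidates([4, 5, 3], max_beats=8)
```

For n elements there are 2^(n-1) - 1 possible cut combinations, so exhaustive enumeration is reasonable only because a single ICRLB prosodic phrase contains few elements.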
[0049]
The division candidate selection unit 462 includes a calculation unit 462a, which calculates a cost as the evaluation value indicating the validity of a division, based on the speech expression information contained in each generated division candidate, and a selection unit 462b, which selects the division candidate whose cost, as calculated by the calculation unit 462a, is the minimum (that is, the third rule). This cost corresponds to the sum of errors in the conventional method, and the sum of errors is included as part of it; the details are described later. The division candidate selection unit 462 supplies the division candidate selected by the selection unit 462b to the command setting unit 46a of the pause/phrase command setting unit 46A. The parameter storage unit 464 has a memory that stores the parameters used for the cost calculation in the calculation unit 462a of the division candidate selection unit 462.
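Selection by minimum cost can be sketched as below. The cost used here is only the error-sum term (deviation of each section from an equal division), standing in for the full evaluation value, which also involves weighted speech expression features and stored parameters. Dividing the total beat length by the number of sections to obtain the equal-division length is an assumption, and the names `error_sum` and `select_candidate` are hypothetical.

```python
from typing import Callable, List

Candidate = List[List[int]]  # sections, each a list of beats per element


def error_sum(candidate: Candidate) -> float:
    """Sum of |section beats - equal share| over the sections (partial cost only)."""
    section_beats = [sum(section) for section in candidate]
    equal_share = sum(section_beats) / len(section_beats)  # assumed divisor
    return sum(abs(b - equal_share) for b in section_beats)


def select_candidate(candidates: List[Candidate],
                     cost: Callable[[Candidate], float] = error_sum) -> Candidate:
    """Third-rule selection: return the candidate with the smallest cost."""
    if not candidates:
        raise ValueError("no division candidates to select from")
    return min(candidates, key=cost)


best = select_candidate([[[4], [5, 3]], [[4, 5], [3]]])
```

Passing the cost as a parameter keeps the selection step independent of how the evaluation value is computed, so a fuller cost function can be substituted without changing the selection logic.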
[0050]
The parameter storage unit 464 also has a functional unit that changes parameters by applying a method such as statistical analysis or multivariate analysis, although not shown. When considering simplification of the apparatus, the parameter may be changed outside the apparatus, and the parameter to be changed may be simply supplied to the parameter storage unit 464. In statistical analysis or multivariate analysis, a rule for digitizing speech expression information is provided in advance, and the tendency of a sentence is evaluated using a numerical value obtained corresponding to the information based on the rule. At this time, the value corresponding to the evaluation is a parameter indicating a tendency due to the re-division.
[0051]
The ICRLB dividing unit 46B selects, from among the division candidates obtained by re-dividing a long ICRLB prosodic phrase to the predetermined length or less, the candidate with the minimum cost, and supplies this division candidate to the command setting unit 46a of the pause / phrase command setting unit 46A, so that pause commands and phrase commands are output more accurately than before. As a result, the ICRLB prosodic phrase is divided into appropriate ranges. After determining the word readings, accent commands, pause commands, and phrase commands, the phoneme processing unit 40 sends intermediate language data, created by further processing (not shown) such as vowel lengthening and gemination, to the control parameter generation unit 50. This phonological processing avoids generating an intermediate language that would result in unnatural synthesized speech.
[0052]
The control parameter generation unit 50 generates the control parameters used for speech synthesis based on the intermediate language data supplied from the phoneme processing unit 40. The generated control parameters are supplied to the audio signal generation unit 60. The audio signal generation unit 60 includes an audio waveform generation unit 62 and an audio output unit 64. The audio waveform generation unit 62 performs D/A conversion on the control parameters supplied from the control parameter generation unit 50 to generate a speech waveform, and outputs the waveform to the audio output unit 64. The audio output unit 64 outputs the input information (for example, text), now in the form of a speech waveform, as speech via, for example, a speaker. The speech synthesizer 10 configured in this way generates accurate intermediate language data and outputs synthesized speech based on the generated data.
[0053]
Next, the control and operation of the speech synthesizer 10 of this embodiment will be described with reference to the flowcharts of FIGS. 5 to 13 and tables based on various examples. The flowchart in FIG. 5 explains the control of the speech synthesizer 10 and the main operation procedure by the control. When the speech synthesizer 10 is turned on, the operation of the speech synthesizer 10 is started and initial setting is performed, and then the process proceeds to step S10. In step S10, the text supplied via the data input unit 20 of the speech synthesizer 10 is digitized, or already digitized data is temporarily stored in the memory. After this data input, the process proceeds to subroutine SUB1.
[0054]
In the subroutine SUB1, the language analysis unit 30 performs language analysis processing on the supplied data. The language analysis processing performed here includes morpheme analysis processing, syntax analysis processing, emphasis information setting processing, and the like. The result of this language analysis processing is sent to the subroutine SUB2.
[0055]
Next, in the subroutine SUB2, the phoneme processing unit 40 performs phoneme analysis processing based on the result of the subroutine SUB1. This phoneme analysis processing includes prosodic word generation processing, accent command generation processing, pause / phrase command generation processing, and the like. The final analysis result (that is, intermediate language data) of this phonological analysis is sent to the control parameter generation unit 50, and the processing procedure proceeds to step S11.
[0056]
In this subroutine SUB2, a rule is prepared so that pause / phrase commands are set accurately. This rule takes a preset shortest number of beats L₁ and a beat limit for a prosodic phrase L₂. If there is an ICRLB prosodic phrase longer than the beat limit L₂, the phrase symbol P₃ is inserted at prosodic word boundaries so that every prosodic phrase is at or below the beat limit L₂. However, when the distance from the immediately preceding phrase symbol P₁ / P₂ / P₃ is at or below the shortest number of beats L₁, the insertion of the phrase symbol P₃ is omitted. Further, when there are several ways of dividing a long ICRLB prosodic phrase, a cost function described later is applied to each possible division, and the division that minimizes the value of the cost function is selected. A pause / phrase command is then generated for the selected prosodic phrases.
[0057]
In step S11, the control parameter generation unit 50 generates the control parameters necessary for speech synthesis based on the result of the phoneme analysis process of the subroutine SUB2. The control parameters generated here are passed to step S12.
[0058]
In step S12, the audio signal generation unit 60 performs audio signal generation processing based on the control parameters obtained in step S11. By this processing, synthesized speech corresponding to the supplied sentence is finally generated and output as speech. With this speech output, the series of processing of the speech synthesizer 10 is completed.
[0059]
The subroutines SUB1 and SUB2 will be further described. The speech synthesizer 10 first proceeds to sub-step SS10 when performing language analysis processing in subroutine SUB1 of FIG. In sub-step SS10, morphological analysis processing is performed. In the morpheme analysis unit 32, the word decomposition analysis unit 32a recognizes sentence boundaries in the supplied data and divides the data into sentences. The morpheme analysis unit 32 then searches the word dictionary 32b of FIG. 2 for words that match partial character strings of each divided sentence. The morpheme analysis unit 32 also checks whether adjacent words can be grammatically connected, and divides the sentence into a word string. After this process, the process proceeds to substep SS11.
[0060]
In substep SS11, the syntax analysis unit 34 performs syntax analysis using the processing result (word string) of substep SS10. In this parsing, the supplied word string is grouped into phrases, and the modification relationships between the phrases are analyzed. From this analysis, for an input such as “The cyclone is located south of the Izu Peninsula, and it is raining in eastern Japan”, a dependency structure delimited by ICRLB is generated. Here, the symbol “/” in FIG. 7 indicates an ICRLB boundary of the sentence. At this time, the syntax analysis unit 34 simultaneously determines ICRLBs, clauses, enumerated expressions, parallel relationships, and the like. After this processing, the process proceeds to substep SS12.
[0061]
In sub-step SS12, the results of the morphological analysis process and the syntax analysis process are supplied to the emphasis information setting unit 36, and the emphasis information setting process is performed on the supplied data. The setting rule storage unit 36b of the emphasis information setting unit 36 stores rules classified into three cases: emphasis, suppression, and neither. The information setting unit 36a compares each supplied clause with the rules from the setting rule storage unit 36b and sets the emphasis information corresponding to the matching rule.
[0062]
Next, in substep SS13, the processing results obtained in substeps SS10, SS11, and SS12 are supplied to the phoneme processing unit 40. After this supply, the process proceeds to return and the subroutine SUB1 is terminated.
[0063]
Immediately after the end of this subroutine SUB1, the process proceeds to subroutine SUB2 in FIG. In the subroutine SUB2, phonological analysis processing is performed based on the result of the language analysis processing of the subroutine SUB1 described above. The phoneme analysis processing includes prosodic word generation processing, accent command generation processing, pause / phrase command generation processing, and the like. First, the process proceeds to substep SS20.
[0064]
In sub-step SS20, the data from the language analysis unit 30 is supplied to the prosodic word generation unit 42 to generate prosodic words. The prosodic word generation unit 42 performs accent coupling of the words in the supplied data (sentence) and, as necessary, accent coupling between phrases, thereby generating prosodic words. Phrase emphasis information is set on the generated prosodic words by a setting function (not shown) in the prosodic word generation unit 42 of FIG. After this processing, the process proceeds to substep SS21.
[0065]
In sub-step SS21, a process of determining the accent size of each prosodic word based on the supplied ICRLB information is performed. This processing is performed by the accent command generator 44. In the accent command generating unit 44, the accent command setting unit 44a collates a rule stored in advance in the standard storage unit 44b, for example, an accent deformation rule and the supplied ICRLB information. The accent command setting unit 44a assigns and sets an accent symbol that matches the rule to the ICRLB information. After this setting, the process proceeds to subroutine SUB3.
[0066]
In the subroutine SUB3 shown in FIG. 9, a pause command / phrase command representing prosodic features is set in the supplied data (pause / phrase command generation processing). In this processing, ICRLB division processing is first performed in subroutine SUB4 in preparation for generating the pause commands and phrase commands. The ICRLB division processing further includes subroutines SUB5, SUB6, and SUB7, which perform the division candidate generation processing, parameter storage processing, and division selection processing described later. The ICRLB prosodic phrase (that is, the division candidate) that has been optimally divided through these processes is supplied to the command setting unit 46a. After this, the process proceeds to substep SS30.
[0067]
In sub-step SS30, the command setting unit 46a sets a pause command and a phrase command for the supplied division candidates and outputs them to an intermediate language creation unit (not shown). After this process, the process proceeds to return. Upon return, the setting process of the pause command / phrase command is terminated, and the process proceeds to sub-step SS22.
[0068]
In sub-step SS22, an intermediate language is created, after processing such as vowel lengthening and gemination, based on the data obtained in sub-steps SS20 and SS21 and in subroutine SUB3. The generated intermediate language data is supplied to the control parameter generation unit 50. After this output, the process proceeds to RETURN and ends the processing of the subroutine SUB2.
[0069]
The ICRLB division process of the subroutine SUB4 performed in the subroutine SUB3 will be briefly described with reference to the flowchart of FIG. When the processing shifts to the subroutine SUB3 in FIG. 9, the subroutine SUB4 is started so as to perform the ICRLB division processing, and the division candidate generation processing, parameter storage processing, and division selection processing are sequentially performed. The division candidate generation process is performed in subroutine SUB5 (see FIG. 11), the parameter storage process is performed in subroutine SUB6 (see FIG. 12), and the division selection process is performed in subroutine SUB7 (see FIG. 13). After these series of processing, the data is passed to the subroutine SUB3. At this time, data is supplied to the command setting unit 46a of FIG. 4 in correspondence with the processing of the subroutine SUB3. The command setting unit 46a sets a pause command / phrase command according to the rules of the standard storage unit 46b for the data input as described above.
[0070]
Next, the subroutine SUB5 for performing the division candidate generation process will be described with reference to FIG. 11. When the process proceeds to the subroutine SUB4, the subroutine SUB5 is started immediately and the process proceeds to substep SS50.
[0071]
In sub-step SS50, processing is divided according to the type of data supplied from the language analysis unit 30. For example, when the data is a clause (Yes), the process proceeds to substep SS51. If the data is already divided at an appropriate ICRLB boundary, the process proceeds to substep SS52 without performing this process.
[0072]
In sub-step SS51, the supplied data is divided into ICRLB. This processing is performed by the rule dividing unit 460a according to the first rule described above.
[0073]
In sub-step SS52, the length of the supplied data (ICRLB prosodic phrase) is compared with a preset division length. The basic unit of the division length is one beat, assigned according to the reading of the words. The comparison is therefore made between the number of beats of the division length and the number of beats of the ICRLB prosodic phrase. Under the second rule, the comparison condition is whether the number of beats of the ICRLB prosodic phrase is equal to or less than the number of beats of the division length. In the comparison process, a flag F may be provided, for example, to change the destination of the processing result depending on the number of times processing has been performed. Comparison processing is performed on the data supplied from the language analysis unit 30 (when the flag F = 0). If the condition is satisfied (Yes), the division length determination unit 460b in FIG. 4 supplies the ICRLB prosodic phrase to the command setting unit 46a. If the condition is not satisfied, that is, if the number of beats of the ICRLB prosodic phrase exceeds the number of beats of the division length (No), the ICRLB prosodic phrase is supplied to the re-dividing unit 460c. Thereafter, the process proceeds to sub-step SS53.
[0074]
In sub-step SS53, the supplied ICRLB prosodic phrase is subdivided by the re-dividing unit 460c, and the division length determination unit 460b outputs data composed of the divided ICRLB prosodic phrases.
[0075]
Next, in sub-step SS54, it is determined whether the returned data satisfies the above comparison condition. When the comparison condition is satisfied (Yes), the process proceeds to substep SS55. If the comparison condition is not satisfied (No), the process returns to sub-step SS53. This determination is performed by the division length determination unit 460b.
[0076]
In sub-step SS55, it is determined whether a further combination of division candidates exists. When such a combination exists in the ICRLB prosodic phrase (Yes), the division candidate that satisfies the comparison condition is delivered to the subroutine SUB7, and after this delivery the process returns to sub-step SS53. In this way, the division candidate generation unit 460 supplies data from the re-dividing unit 460c to the division candidate selection unit 462. If no further combination of division candidates exists in the ICRLB prosodic phrase (No), the process proceeds to return.
[0077]
Here, the presence or absence of a division candidate is obtained by calculation. In this calculation, first, the minimum number of phrase commands needed to bring the number of beats of every ICRLB prosodic phrase to the set value or less after ICRLB division is obtained. That is, the minimum number is the integer quotient of the total number of beats divided by the set value, plus 1 when there is a remainder. Let this minimum number be the variable PH_NUM. The maximum number of phrase commands MAX_PH_NUM, obtained when the data is divided most finely by ICRLB division, is also obtained. Each time it is determined that a further division is needed, the division length determination unit 460b increments the variable PH_NUM by 1. With this setting, the data actually re-divided into PH_NUM parts is delivered to the subroutine SUB7. Combinations of division candidates continue to be generated until the variable PH_NUM exceeds the maximum number of phrase commands MAX_PH_NUM. In the flowchart of FIG. 11, for convenience, data delivery is represented by the box for subroutine SUB7. Data delivery is not limited to this method; the candidates may instead be stored in a memory and delivered together as a set of combinations. These determinations are performed by the division length determination unit 460b. By this series of processing, combinations of division candidates are generated.
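The counting and enumeration described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation of the apparatus: the function names, the representation of a phrase as a list of per-word beat counts, and the choice to pass MAX_PH_NUM in as an argument (its derivation is not spelled out here) are all hypothetical. The beat counts are those of the first worked example given later in this description.

```python
from itertools import combinations


def min_phrase_count(total_beats, beat_limit):
    """Minimum number of phrases: integer quotient, plus 1 if there is a remainder."""
    quotient, remainder = divmod(total_beats, beat_limit)
    return quotient + 1 if remainder else quotient


def divisions_into(word_beats, m):
    """Yield every grouping of consecutive prosodic words into m phrases.

    Each candidate is a tuple of per-phrase beat totals.
    """
    n = len(word_beats)
    for cuts in combinations(range(1, n), m - 1):
        bounds = (0, *cuts, n)
        yield tuple(sum(word_beats[a:b]) for a, b in zip(bounds, bounds[1:]))


def generate_candidates(word_beats, beat_limit, max_ph_num):
    """Enumerate candidates while PH_NUM runs from the minimum up to MAX_PH_NUM.

    max_ph_num is supplied by the caller (an assumption of this sketch).
    """
    ph_num = min_phrase_count(sum(word_beats), beat_limit)
    while ph_num <= max_ph_num:
        yield from divisions_into(word_beats, ph_num)
        ph_num += 1


beats = [6, 6, 7, 4, 6]                    # 29 beats in total
print(min_phrase_count(sum(beats), 15))    # 2
print(list(divisions_into(beats, 2)))      # [(6, 23), (12, 17), (19, 10), (23, 6)]
```

Enumerating cut positions with `itertools.combinations` keeps the prosodic words in order, so every candidate is a division at prosodic word boundaries only.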
[0078]
Next, before describing the subroutine SUB7 for the division selection processing, the parameters used in the division selection processing will be briefly described with reference to FIG. 12. The parameters are set in subroutine SUB6. Here, the parameters are the weighting coefficient A for the feature quantity I used in equation (1) described later, and the coefficient B by which the average number of beats after division is multiplied.
[0079]
Here, the feature quantity I is indicated by, for example, the emphasis information (+emp, -emp, 0emp) of the prosodic word at the head of the prosodic phrase. In sub-step SS60, the rule of the feature quantity I, that is, the relationship of the emphasis information (+emp, -emp, 0emp), is stored.
[0080]
Next, in sub-step SS61, the weighting coefficient A is stored according to the numerical value of the feature quantity I. In this embodiment, the values 7 and -3 are stored as the weighting coefficient A for the feature quantity values I = -1 and I = -2, respectively.
[0081]
Next, in sub-step SS62, the numerical value of the coefficient B described above is stored. The weighting coefficients A and B may be calculated by the parameter storage unit 464 shown in FIG. These weighting coefficients may be calculated, for example, by statistical analysis or multivariate analysis of the feature quantity I in speech synthesis. Alternatively, the parameter values may be calculated in advance outside the apparatus, with the parameter storage unit 464 simply storing the numerical values.
[0082]
When processing has been performed in this procedure, the process proceeds to return, the parameter storage is terminated, and the process proceeds to the subroutine SUB7 as shown in FIG. In this subroutine SUB7, the cost is calculated as shown in FIG. 13 for each combination of division candidates (ICRLB prosodic phrases) supplied from subroutine SUB5. The subroutine SUB7 then selects the optimal division candidate from the costs obtained according to the procedure of FIG. Here, the cost is an error measure that serves as an index of how accurately a possible division of a phrase reflects speech expression.
[0083]
First, in sub-step SS70 of FIG. 13, the calculation unit 462a of FIG. 4 counts and stores, for each supplied combination of division candidates, the number of beats in each range delimited by the ICRLB boundaries. After this storage, the process proceeds to substep SS71. In substep SS71, the variable MIN used for selecting the division candidate is set to an initial value, for example 9999. The contents of the memory MIN_DIV[], which stores the ICRLB boundary positions of the minimum-cost division candidate, and of the variable VAL, which stores the cost value, are also cleared. Thereafter, the process proceeds to sub-step SS72.
[0084]
In sub-step SS72, the cost is calculated for each of the division candidates at each division number m divided at the ICRLB boundary. This calculation is performed in the calculation unit 462a according to the cost function F (D). The cost function F (D) is given by equation (1)
[0085]
[Expression 1]

Here, D(m, n) denotes the n-th combination among the supplied division candidates that divide the phrase into m parts, L is the number of beats of the entire ICRLB, L(m, n, i) is the number of beats in the i-th prosodic phrase when the n-th combination candidate is divided into m, Ph(m, n, k) is the k-th prosodic phrase when the n-th combination candidate is divided into m, count(Ph(m, n, k)) is a function indicating the presence or absence of the feature quantity I in the k-th prosodic phrase, A_k is the weighting coefficient of the feature quantity I in the k-th prosodic phrase, and B is the weighting coefficient for the average number of beats after division. In equation (1), expressed with these variables and weighting coefficients, the first term is the sum of errors described above, the second term is the cost of the feature quantity I, and the third term is the cost of the average number of beats after division. The sum of the three terms is the cost of the n-th division candidate.
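The expression itself is not reproduced in this text. As a reading aid, one form that matches the term-by-term description above and reproduces the costs quoted in the worked examples of this description (3.6, -1.2 and 3.4 with B = 0.2) can be written as follows. The absolute-deviation form of the first term and the minus sign on the third term are inferred from those figures; they are assumptions, not a transcription of the original expression.

```latex
F\bigl(D(m,n)\bigr)
  = \sum_{i=1}^{m} \left| L(m,n,i) - \frac{L}{m} \right|
  + \sum_{k=1}^{m} A_k \,\mathrm{count}\bigl(Ph(m,n,k)\bigr)
  - B \,\frac{L}{m}
```

Under this reading, L/m is the average number of beats per prosodic phrase after division into m, so the first term grows as the phrases become uneven and the third term lowers the cost of divisions with longer phrases.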
[0086]
Next, in substep SS73, the calculated cost is stored in variable VAL. After storing, the process proceeds to sub-step SS74.
[0087]
In substep SS74, the values of the variable VAL and the variable MIN are compared. When the value of the variable MIN is larger than the value of the variable VAL (VAL < MIN), the process proceeds to substep SS75. When the value of the variable VAL is equal to or greater than the value of the variable MIN (VAL ≧ MIN), the process proceeds to substep SS76.
[0088]
In sub-step SS75, the value of the variable MIN is replaced with the value of the variable VAL, and the data stored so far in the memory MIN_DIV[] is replaced with the data indicating the ICRLB boundary positions at which the n-th division candidate is divided.
[0089]
In sub-step SS76, it is determined whether any combinations of division candidates remain. When division candidates remain (Yes), the process proceeds to sub-step SS77. When no division candidates remain (No), the process proceeds to sub-step SS78. This determination is made from the relationship between the variable MAX_PH_NUM and the variable PH_NUM used above to decide the presence or absence of division candidates. That is, when the variable MAX_PH_NUM and the variable PH_NUM are equal, there are no more division candidates.
[0090]
In sub-step SS77, the number of beats of the newly supplied (n+1)-th division candidate is counted. After this counting process, the process returns to substep SS71. In sub-step SS78, the cost calculation of the division candidates is completed, and the division candidate with the smallest cost at this point is selected. The ICRLB boundary position data held in the memory MIN_DIV[] for this confirmed minimum is supplied to the command setting unit 46a of FIG. 4 as the optimally divided candidate. After this supply, the process proceeds to return. The division candidate selection process is terminated via return.
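The selection loop of sub-steps SS71 to SS78 can be sketched as below. This is an illustrative reading, not the apparatus itself: the function name, the representation of a candidate as a list of boundary positions, and the toy cost used in the usage lines are hypothetical. The sketch also assumes that MIN and MIN_DIV[] are initialized once before the first candidate and then carried across candidates, which is what a running minimum requires.

```python
def select_min_division(candidates, cost):
    """Return (MIN, MIN_DIV) for the cheapest candidate.

    candidates: iterable of boundary-position lists (hypothetical representation).
    cost: callable returning the cost of one candidate.
    """
    MIN = 9999        # SS71: initial value given as an example in the description
    MIN_DIV = []      # SS71: boundary positions of the best candidate so far
    for boundaries in candidates:
        VAL = cost(boundaries)          # SS72, SS73: compute and store the cost
        if VAL < MIN:                   # SS74: strict comparison
            MIN = VAL                   # SS75: keep the new minimum
            MIN_DIV = list(boundaries)  # SS75: and its boundary positions
        # SS76, SS77: move on to the next candidate, if any
    return MIN, MIN_DIV                 # SS78: the smallest candidate is confirmed


# Toy usage: the cost here is simply the number of boundaries (illustration only).
candidates = [[2], [1, 3], [1, 2, 3, 4]]
best_cost, best_boundaries = select_min_division(candidates, cost=len)
print(best_cost, best_boundaries)       # 1 [2]
```

Because the comparison is strict (VAL < MIN), a later candidate whose cost ties the current minimum does not replace it.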
[0091]
By performing processing in this way, a division candidate that has been optimally divided is selected, and a pause command / phrase command is set in the selected division candidate by checking with the rules of those commands.
[0092]
A more specific example is described below, in comparison with conventional processing. Text analysis of the input sentence is performed based on the rules described above. Here, the basic parameters used in the second rule are the shortest number of beats L₁ = 5 and the beat limit of a prosodic phrase L₂ = 15. The symbol “↓” inserted within a prosodic word element indicates an accent command, the symbol “,” indicates a prosodic word element boundary, and the symbol “P” indicates a phrase command. In preparation for the arithmetic processing, the parameter storage unit 464 supplies the stored weighting coefficients A₁ = 7 and A₂ = -3, for the feature quantity values I = -1 and I = -2 respectively, and the weighting coefficient B = 0.2 to the calculation unit 462a.
[0093]
When the input sentence “is a tool that cannot be separated from our lives”, supplied through the data input unit 20 and the language analysis unit 30 in FIG. 1, reaches the phoneme processing unit 40, the accent commands and the division into prosodic word elements turn it into “Watashi↓tachino, Sekatsuwo, Kirihanasenai, Dogu↓to, Na↓tteimasu”. Writing the number of beats of each prosodic word in parentheses, the beat counts are (6), (6), (7), (4), (6). The maximum number of divisions is 5, and the total number of beats in the ICRLB is (29). The prosodic phrase of the input sentence is divided with between the minimum number 2 and the maximum number 5 of phrase commands, as shown in FIG. The minimum number follows from the relationship between the total number of beats of the ICRLB and the beat limit L₂ = 15. The table in FIG. 14 shows, for each number of divisions, the division position where the cost is minimum.
[0094]
In the cost calculation, the calculation unit 462a calculates a value for each term of the cost function F and adds these values together (see equation (1)). The table in FIG. 14, which gives the division position with the smallest cost for each number of divisions, shows that the cost is minimized when the number of divisions is 2 and the phrase is divided as (6+6, 7+4+6). When the pause command / phrase command is set by the command setting unit 46a in accordance with this division, the data generated for the input sentence is “P₁ Watashi↓tachino, Sekatsuwo, P₃ Kirihanasenai, Dogu↓to, Na↓tteimasu P₀”. In particular, it was found that taking the length of the prosodic phrases after division into account in the third term of equation (1) makes it possible to generate data with natural prosody.
[0095]
By contrast, when the division was judged using only the first term of the cost function, the total sum of errors took its minimum value of 3.6 for the division into 5, that is, (6, 6, 7, 4, 6). As a result, phrase commands are attached as “P₁ Watashi↓tachino, P₃ Sekatsuwo, P₃ Kirihanasenai, P₃ Dogu↓to, P₃ Na↓tteimasu P₀”. However, because the input sentence is divided so finely, the prosody of the synthesized speech obtained from the generated data is unnatural. By taking the length of the divided prosodic phrases into account as described above, it is possible to generate synthesized speech with accurate prosody.
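The figure of 3.6 can be checked with a short calculation. The sketch below assumes that the sum of errors is the sum of absolute deviations of each phrase's beat count from the average beat count per phrase; this reading is an assumption, chosen because it reproduces the value quoted above. The function name is hypothetical.

```python
def sum_of_errors(phrase_beats):
    """Sum of absolute deviations from the mean phrase length (assumed first term)."""
    mean = sum(phrase_beats) / len(phrase_beats)
    return sum(abs(beats - mean) for beats in phrase_beats)


five_way = [6, 6, 7, 4, 6]          # every prosodic word is its own phrase
two_way = [6 + 6, 7 + 4 + 6]        # the division chosen by the full cost function

print(round(sum_of_errors(five_way), 1))   # 3.6
print(round(sum_of_errors(two_way), 1))    # 5.0
```

On this measure alone the five-way division scores lower than the two-way division, which is why a criterion based only on the first term ends up dividing the sentence at every prosodic word.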
[0096]
In addition, another input sentence, “Mochizuki who supported the opening victory was hit”, was supplied to the speech synthesizer 10. In this case, “Kaimaku, Renshowo, Sasaeta, Mochi↓zukiga, Uta↓reta” is obtained by prosodic word processing and accent command processing. Writing the number of beats of each prosodic word in parentheses, the beat counts are (4), (5), (4), (5), (4). In this case, the minimum number of divisions is 2, the maximum number is 3, and the total number of beats of the ICRLB is (22). The table shown in FIG. 15 shows the positions of the phrase commands that minimize the cost for each number of divisions.
[0097]
In this case as well, the cost function F is obtained by calculating a value for each term and adding these values. As a result, among the minimum values for each number of divisions, the cost takes its minimum of -1.2 when the number of divisions is 2 and the division is (4+5+4, 5+4). When the pause command / phrase command is set by the command setting unit 46a in accordance with this division, the data generated for the input sentence is “P₁ Kaimaku, Renshowo, Sasaeta, P₃ Mochi↓zukiga, Uta↓reta P₀”. This shows that data with natural prosody is generated, in particular by taking the emphasis information and the length of the prosodic phrases after division into account in the second and third terms of equation (1).
[0098]
By contrast, when the division is judged using only the first term of the cost function, as in the comparison above, the sum of errors takes its minimum value of 4 for the division into 2, with the division candidates (4+5, 4+5+4) and (4+5+4, 5+4). This input sentence can thus be divided in two ways. The data generated with the (4+5, 4+5+4) candidate is “P₁ Kaimaku, Ren↓showo, P₃ Sasaeta, Mochi↓zukiga, Uta↓reta P₀”. On the other hand, the data generated with the (4+5+4, 5+4) candidate is “P₁ Kaimaku, Renshowo, Sasaeta, P₃ Mochi↓zukiga, Uta↓reta P₀”.
[0099]
Here, in the data generated for (4+5, 4+5+4), a phrase command is set before the verb “supported”, and no phrase command is attached before the proper noun “Mochizuki”. This means that although the word is emphasized in pitch by enlarging its accent command, the emphasized word does not stand out, because the phrase command is attached to the preceding word. In other words, the ICRLB has been divided so that the emphasized word does not come at the beginning of a prosodic phrase. Synthesized speech generated from the former division candidate is therefore uttered with unnatural prosody. For this input sentence, the sum of errors is 4 for both candidates, so the former candidate may be adopted if only the sum of errors is judged. From this point of view as well, the speech synthesizer 10 to which the present invention is applied can select a division candidate that reproduces natural prosody and can thus avoid using the inappropriate candidate.
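The way the remaining terms break this tie can be illustrated numerically. The sketch below is built on assumptions rather than on the original expression: the first term is taken as the sum of absolute deviations from the average phrase length, the weight -3 is applied once for each prosodic phrase that begins with an emphasized word, and the third term is taken as minus B times the average phrase length. These choices reproduce the values 4 and -1.2 quoted above; the function and constant names are hypothetical.

```python
A_EMPHASIS = -3   # assumed weight when a phrase begins with an emphasized word
B = 0.2           # weighting coefficient for the average beat count, as given


def cost(phrase_beats, emphasized_heads):
    """Assumed three-term cost: errors + feature weight - B * mean phrase length."""
    mean = sum(phrase_beats) / len(phrase_beats)
    errors = sum(abs(beats - mean) for beats in phrase_beats)
    feature = A_EMPHASIS * emphasized_heads
    return errors + feature - B * mean


former = cost([4 + 5, 4 + 5 + 4], emphasized_heads=0)   # "Mochizuki" not at a phrase head
latter = cost([4 + 5 + 4, 5 + 4], emphasized_heads=1)   # "Mochizuki" starts phrase 2

print(round(former, 1), round(latter, 1))   # 1.8 -1.2
```

Both candidates have the same sum of errors, so the difference comes entirely from the feature term, which rewards placing the emphasized word at the head of a prosodic phrase.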
[0100]
There are also cases where the ICRLB is divided so that a suppressed word comes at the beginning of a prosodic phrase. An example sentence for such a case is “There may be a case where a telephone is connected.” In this example, “Denwaga, Tsunagatteshimau, Koto]ga, A}rukamo, Shiremasen” is obtained by prosodic word processing and accent command processing. Writing the number of beats of each prosodic word in parentheses, the beat counts are (4), (8), (3), (4), (5), and the total number of beats of the ICRLB is (24). The table shown in FIG. 16 shows the positions of the phrase commands that minimize the cost for each number of divisions.
[0101]
The cost function F is calculated by computing a value for each term and adding these values. As a result of this cost calculation, the cost for the input sentence took its minimum of 3.6 when the number of divisions was 2 and the division was (4+8+3, 4+5). When the pause command / phrase command is set by the command setting unit 46a in accordance with this division, the data generated for the input sentence is “P₁ Denwaga, Tsunagatteshimau, Koto]ga, P₃ A}rukamo, Shiremasen”. This again shows that data with natural prosody is generated by taking the emphasis information and the length of the prosodic phrases after division into account in the second and third terms of equation (1).
[0102]
On the other hand, when division candidates were evaluated using only the first term of the cost function F, the error took its minimum value of 0 for the division into 2 as (4+8, 3+4+5). The data generated for the input sentence is then “P₁ Denwaga, Tsunagatteshimau, P₃ Koto]ga, A}rukamo, Shiremasen”. As this data shows, a phrase command is set immediately before “Koto”. However, formal nouns such as “Koto” are known to be normally suppressed and not pronounced strongly. Even if the minimum of the sum of errors from the first term of equation (1) alone is found, selecting a division candidate that places a suppressed word at the beginning of a prosodic phrase results in synthesized speech with unnatural prosody. An appropriate division candidate can therefore be selected by evaluating the input sentence data with all terms of the cost function.
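A term-by-term breakdown makes the effect of the suppressed word visible. The sketch below is a hedged illustration: it assumes the first term is the sum of absolute deviations from the average phrase length, that the weight 7 applies when a prosodic phrase begins with a suppressed word, and that the third term is minus B times the average phrase length. These assumptions reproduce the values 0 and 3.6 quoted in this example; the helper name and the per-phrase weight list are hypothetical.

```python
def cost_terms(phrase_beats, head_weights, b=0.2):
    """Return (first, second, third) terms of the assumed cost separately.

    head_weights: one weight per phrase, 0 when the head word carries no
    emphasis or suppression (hypothetical representation).
    """
    mean = sum(phrase_beats) / len(phrase_beats)
    first = sum(abs(beats - mean) for beats in phrase_beats)
    second = sum(head_weights)
    third = -b * mean
    return first, second, third


# (4+8, 3+4+5): the second phrase starts with the suppressed word "Koto".
even_split = cost_terms([12, 12], head_weights=[0, 7])
# (4+8+3, 4+5): neither phrase starts with an emphasized or suppressed word.
chosen_split = cost_terms([15, 9], head_weights=[0, 0])

for name, terms in (("even", even_split), ("chosen", chosen_split)):
    print(name, [round(t, 1) for t in terms], round(sum(terms), 1))
```

The evenly balanced division wins on the first term alone, but the penalty for starting a phrase with a suppressed word outweighs that advantage once all three terms are summed.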
[0103]
As another example, when the input sentence “I have decided the basic policy to provide the same resale rules as are applied to the disposal of state-owned land” is input to the speech synthesizer 10, prosodic word processing and accent command processing yield “Kokuyu”-chino, Sho↓bunni, Tekiyosare↓teirunoto, Onaji, Tenbaikiso]kuwo, Mouke↓rutoiu, Kihonho↓shinwo, Kimema}shita”. The number of beats of each prosodic word was (6), (4), (11), (3), (8), (7), (8), (5), and the total number of beats was 52. When only the first term of the cost function F is evaluated, as shown in FIG. 17, the sum of errors takes its minimum value of 6 when the number of divisions is 4 and the division candidate is (6 + 4, 11 + 3, 8 + 7, 8 + 5). When the total cost over all terms of the cost function is compared for each candidate, the cost is likewise minimized, at 3.4, when the number of divisions is 4 and the candidate is (6 + 4, 11 + 3, 8 + 7, 8 + 5). Thus, evaluating only the first term and evaluating all terms may select the same candidate.
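The first-term evaluation in the two examples above can be reproduced with a short sketch. It assumes that the first term is the sum of absolute differences between the beat length of each prosodic phrase and the equal share of the total beat length, as worded in claim 8. The second and third terms are omitted because their parameter values are not given here, and the function names are hypothetical.

```python
from itertools import combinations

def contiguous_partitions(beats, k):
    """Yield every way to cut `beats` into k contiguous, non-empty groups."""
    n = len(beats)
    for cuts in combinations(range(1, n), k - 1):
        bounds = (0, *cuts, n)
        yield [beats[a:b] for a, b in zip(bounds, bounds[1:])]

def first_term_error(groups):
    """Sum over groups of |group beat length - equal share of total beats|."""
    total = sum(sum(g) for g in groups)
    share = total / len(groups)
    return sum(abs(sum(g) - share) for g in groups)

def best_by_first_term(beats, k):
    """Pick the k-way division with the smallest first-term error only."""
    return min(contiguous_partitions(beats, k), key=first_term_error)

telephone = [4, 8, 3, 4, 5]          # beats of the "telephone" sentence
policy = [6, 4, 11, 3, 8, 7, 8, 5]   # beats of the "basic policy" sentence

best_phone = best_by_first_term(telephone, 2)
best_policy = best_by_first_term(policy, 4)
print(best_phone, first_term_error(best_phone))    # [[4, 8], [3, 4, 5]] 0.0
print(best_policy, first_term_error(best_policy))  # [[6, 4], [11, 3], [8, 7], [8, 5]] 6.0
```

Under the same assumption, the division (4 + 8 + 3, 4 + 5) that the full cost function prefers for the telephone sentence has a first-term error of 6, so it can only win once the remaining terms are included.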
[0104]
Next, a modification of the speech synthesizer of the present invention will be described with reference to FIGS. The modified speech synthesizer 10 has basically the same configuration as the above-described embodiment, except that the configuration of the pause / phrase command generation unit 46 of the phoneme processing unit 40 is changed. FIG. 18 shows the configuration of the pause / phrase command generation unit 46, the main part of the phoneme processing unit 40 to which this change has been made.
[0105]
The pause / phrase command generation unit 46 includes a pause / phrase command setting unit 46A and an ICRLB division unit 46B, as in the above-described embodiment. In this modification, the output from the language analysis unit 30 is supplied to the command setting unit 46a of the pause / phrase command setting unit 46A. The pause / phrase command setting unit 46A includes this command setting unit 46a and a standard storage unit 46b. The standard storage unit 46b stores the standard for giving pause commands and phrase commands to the supplied data. The command setting unit 46a has a function equivalent to that of the standard dividing unit 460a of the above-described embodiment, and also has a function of judging whether a prosodic phrase in the supplied data is within the beat limit L₂ included in this standard. The command setting unit 46a selects the output destination of the data according to this judgment. The conditions are described in detail later, together with the operation.
[0106]
The ICRLB division unit 46B includes an example search unit 466 and a division example storage unit 468. The example search unit 466 includes a prosodic word element division unit 466a that decomposes the data supplied from the command setting unit 46a into prosodic word elements; a search key collating unit 466b that extracts search key information from the output of the prosodic word element division unit 466a and collates it with the examples stored in the division example storage unit 468; and an example correspondence division unit 466c that divides the output from the language analysis unit 30 based on the collation result of the search key collating unit 466b and gives pause commands and phrase commands to this output. The search key collating unit 466b performs the collation using the same kinds of information as the collation search keys held in the division example storage unit 468.
[0107]
The division example storage unit 468 stores in memory, as search keys for collation, example data that has already been determined to be appropriately divided. The search keys for collation use the actual input sentence data, the emphasis information, the prosodic word elements, the number of beats of each prosodic word element, the total number of beats, the type of prosodic word element, and/or the accent command, among others.
[0108]
The operation of the pause / phrase command generation unit 46 will be briefly described with reference to a subroutine SUB8 in FIG. Subroutine SUB8 is a pause / phrase command generation processing routine that replaces the processing of subroutine SUB3 shown in FIG. For this processing, it is preferable to store a plurality of accurately divided examples in the division example storage unit 468 in advance. Each stored example includes its pause commands and phrase commands.
[0109]
In sub-step SS80 of subroutine SUB8 shown in FIG. 19, it is judged whether the supplied prosodic phrase satisfies the beat limit L₂, which specifies the length of a prosodic phrase. At this time, the value of the beat limit L₂ stored in the standard storage unit 46b is supplied to the command setting unit 46a. Based on this value, the command setting unit 46a determines whether the length of the prosodic phrase satisfies the (second) rule (division length determination step). When the rule is satisfied (Yes), the process proceeds to sub-step SS81. When the rule is not satisfied (No), the process proceeds to sub-step SS82.
[0110]
In sub-step SS81, because the length of the prosodic phrase is at or below the beat limit L₂ and therefore satisfies the rule, the corresponding pause symbol / phrase symbol is added between prosodic phrases. The command setting unit 46a adds the pause symbol / phrase symbol in accordance with the rules for pause commands / phrase commands stored in the standard storage unit 46b. After this processing, the process proceeds to return.
[0111]
In sub-step SS82, the supplied data is sent to the example search unit 466 of the ICRLB division unit 46B and divided into prosodic word elements. This division processing (that is, ICRLB division) is performed by the prosodic word element division unit 466a of the example search unit 466. The prosodic word element division unit 466a outputs the divided data to the search key collating unit 466b. After this processing, the process proceeds to sub-step SS83.
[0112]
In sub-step SS83, the stored division examples are searched for one that matches the information included in the divided data, and the degree of match is determined. In practice, the search key collating unit 466b performs this search using the output from the prosodic word element division unit 466a as the search key and the information stored in the division example storage unit 468 as the collation search key. Examples of this search key are shown later. When a completely matching example is found (Yes), the process proceeds to sub-step SS84. When no completely matching example is found (No), the process proceeds to sub-step SS85.
[0113]
In sub-step SS84, the matched example is output from the division example storage unit 468 via the search key collating unit 466b and the example correspondence division unit 466c. At this time, the example correspondence division unit 466c simply passes the information of this example through and supplies it to the same output destination that the command setting unit 46a uses when the rule is satisfied. After this supply, the process proceeds to return.
[0114]
In sub-step SS85, the output from the language analysis unit 30 is divided, and pause commands / phrase commands are added, in accordance with the best-matching example among those stored in the division example storage unit 468. The search key collating unit 466b supplies the example with the highest degree of match to the example correspondence division unit 466c. Because the input sentences are regarded as not matching in this case, the example correspondence division unit 466c divides the output from the language analysis unit 30 according to the information from the search key collating unit 466b and adds the division position information (pause command, phrase command, and so on). The example correspondence division unit 466c supplies its output to the same output destination as in sub-step SS84. After this processing, the process proceeds to return. The processing of sub-steps SS84 and SS85 corresponds to the example search step.
[0115]
On return, subroutine SUB8, which selects a division suited to the stored examples, ends and the process proceeds to sub-step SS22. Thereafter, once the processing of subroutine SUB2 is complete, the main routine generates the synthesized speech. Configuring the speech synthesizer 10 as in this modification allows it to be simplified.
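The control flow of sub-steps SS80 to SS85 can be summarized in a minimal sketch. Several points are assumptions made only for illustration: the `Example` record, the beat-distance `similarity` score (the text does not define how the degree of match is computed), the beat limit value of 15, and the requirement that at least one example is stored.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Example:
    beats: tuple      # beats of each prosodic word element
    cut_after: tuple  # number of words preceding each phrase boundary

def similarity(beats, example):
    """Hypothetical degree of match: negative beat distance, higher is better."""
    if len(beats) != len(example.beats):
        return float("-inf")
    return -sum(abs(a - b) for a, b in zip(beats, example.beats))

def sub8(beats, beat_limit, examples):
    """Return (sub-step reached, boundaries); assumes `examples` is non-empty."""
    # SS80: does the prosodic phrase already satisfy the beat limit L2?
    if sum(beats) <= beat_limit:
        return "SS81", ()                    # SS81: no re-division needed
    # SS82 / SS83: decompose into word elements and collate with stored examples
    for ex in examples:
        if ex.beats == tuple(beats):
            return "SS84", ex.cut_after      # SS84: complete match, pass through
    best = max(examples, key=lambda ex: similarity(beats, ex))
    return "SS85", best.cut_after            # SS85: divide like the closest example

examples = [Example((4, 4, 3, 4, 4), (2,)), Example((5, 3, 5, 7, 3), (3,))]
short_phrase = sub8([4, 4, 3], 15, examples)
exact_match = sub8([4, 4, 3, 4, 4], 15, examples)
near_match = sub8([5, 3, 5, 6, 3], 15, examples)
print(short_phrase, exact_match, near_match)
```

The two stored examples reuse the beat counts and division positions of the example sentences given for the division example storage unit 468.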
[0116]
Specific examples of the information used for this search are shown below. For one input sentence, the division example storage unit 468 stores information such as the prosodic word elements, the division position (indicated by the symbol “/”), the number of prosodic word elements, the total number of beats, the emphasis information, the number of beats of each prosodic word element, the type of dependency, the position of the accent command, and/or the part of speech, each replaced with a numerical value.
[0117]
For example, for the input sentence “A fisherman who is fishing to a distant sea”, the prosodic word elements and the division position are Tokuno, Umimade, (/) Ryoni, Deteiru, Ryoshiga; the number of prosodic word elements is 5; the total number of beats is 19; the emphasis information of each prosodic word element is 0emph, 0emph, 0emph, 0emph, 0emph; and the number of beats of each prosodic word element is (4), (4), (3), (4), (4). The P symbols at this time are (P₁) Tokuno, Umimade, (P₃) Ryoni, Deteiru, Ryoshiga (P₀). For the example sentence with emphasis, “I arrived at the fishing island that is close to the border”, the prosodic word elements and the division position are Kokkyoni, Chika↓i, Shima↓dearu, (/) Uotsurijimani, Tsu↓ita; the number of prosodic word elements is 5; the total number of beats is 23; the emphasis information of each prosodic word element is 0emph, 0emph, 0emph, +emph, 0emph; and the number of beats of each prosodic word element is (5), (3), (5), (7), (3). The P symbols at this time are (P₁) Kokkyoni, Chika↓i, Shima↓dearu, (P₃) Uotsurijimani, Tsu↓ita (P₀).
[0118]
In addition, the sentence “I served as the chairman” is an example that includes suppression. The prosodic word elements and the division position are Kaichowo, Tsuto↓meta, Koto↓mo, (/) Tsuyomino, Hito↓tsudesu; the number of prosodic word elements is 5; the total number of beats is 19; the emphasis information of each prosodic word element is 0emph, 0emph, -emph, 0emph, 0emph; and the number of beats of each prosodic word element is (4), (4), (3), (4), (4). The P symbols at this time are (P₁) Kaichowo, Tsuto↓meta, Koto↓mo, (P₃) Tsuyomino, Hito↓tsudesu (P₀).
[0119]
Finally, as an example that uses the type of dependency as information, for the above-mentioned input sentence “It has become an indispensable tool from our lives.”, the prosodic word elements and the division position are Watashi↓tachino, Seikatsukara, (/) Kakasenai, Dogu↓to, Na↓tteimasu; the number of prosodic word elements is 5; the total number of beats is 27; the dependency type of each prosodic word element is continuous, continuous, continuous, continuous, none; the emphasis information of each prosodic word element is 0emph, 0emph, 0emph, 0emph, 0emph; and the number of beats of each prosodic word element is (6), (6), (5), (4), (6). The P symbols at this time are (P₁) Watashi↓tachino, Seikatsukara, (P₃) Kakasenai, Dogu↓to, Na↓tteimasu (P₀). By quantifying these pieces of information, storing them in the division example storage unit 468, and using them as search keys for collation at search time, the accuracy of the example search can be improved without setting parameters or the like.
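One possible in-memory layout for such a stored example is sketched below. The class name, the field names, the numeric encoding of emphasis (+1 for +emph, -1 for -emph, 0 for 0emph), and the composition of the search key are all assumptions; the text only states that the information is replaced with numerical values. The romanized spellings are approximate.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DivisionExample:
    words: tuple      # prosodic word elements (romanized)
    beats: tuple      # number of beats of each prosodic word element
    emphasis: tuple   # assumed encoding: +1 emphasized, -1 suppressed, 0 neither
    cut_after: tuple  # number of words preceding each "/" division position

    def __post_init__(self):
        if not (len(self.words) == len(self.beats) == len(self.emphasis)):
            raise ValueError("per-word fields must have equal length")

    @property
    def total_beats(self):
        return sum(self.beats)

    def search_key(self):
        """Numeric key: (word count, total beats, beats, emphasis)."""
        return (len(self.words), self.total_beats, self.beats, self.emphasis)

    def with_p_symbols(self):
        """Render P1 at the head, P3 at each division position, P0 at the end."""
        out = ["(P1)"]
        for i, word in enumerate(self.words):
            if i in self.cut_after:
                out.append("(P3)")
            out.append(word)
        out.append("(P0)")
        return " ".join(out)

fisherman = DivisionExample(
    words=("Tokuno", "Umimade", "Ryoni", "Deteiru", "Ryoshiga"),
    beats=(4, 4, 3, 4, 4),
    emphasis=(0, 0, 0, 0, 0),
    cut_after=(2,),
)
print(fisherman.search_key())      # (5, 19, (4, 4, 3, 4, 4), (0, 0, 0, 0, 0))
print(fisherman.with_p_symbols())  # (P1) Tokuno Umimade (P3) Ryoni Deteiru Ryoshiga (P0)
```

Keeping the key purely numeric means collation reduces to tuple comparison, which is one way to read the statement that no parameters need to be set for the search.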
[0120]
With the configuration described above, appropriate ICRLB division can be performed, so phonological processing that produces unnatural synthesized speech is avoided and synthesized speech that is more natural and easier to hear can be output. The speech synthesizer can also respond semi-automatically to parameter value changes that accompany rule changes. This improves the operability of the apparatus and reduces manual labor.
[0121]
In addition, by using an actual example searched and selected, it is possible to eliminate processing such as parameter setting and arithmetic processing, and it is possible to reduce costs by simplifying the device configuration.
[0122]
The present invention is not limited to the embodiment described above. For example, the above-described embodiment and this modification may be combined. In the initial stage after the speech synthesizer 10 starts operating, optimum division candidates are obtained with the configuration of FIG. 4 in order to accumulate information, the intermediate language is created, and the information on the obtained division candidates is stored in the division example storage unit 468 in a learning manner. Once this processing has been performed and information has been stored in the division example storage unit 468, the processing of the supplied data is switched to the above-described modification (the configuration shown in FIG. 18), and the pause command / phrase command may be added by searching this information. Because the data first stored in the speech synthesizer 10 is then reliable speech synthesis data that the user actually uses, storing useless data can be avoided. A desired division candidate and the information added to it can be obtained easily by a search that combines search keys, without performing the calculation.
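The two-phase operation described above can be pictured with a small sketch. Here `cost_based_divide` is a hypothetical stand-in for the cost-based selection of the FIG. 4 configuration, a dictionary stands in for the division example storage unit 468, and the lookup phase handles only complete matches (the closest-match processing of sub-step SS85 is left out for brevity).

```python
class TwoPhaseDivider:
    """Learning phase stores computed divisions; lookup phase reuses them."""

    def __init__(self, cost_based_divide):
        self.cost_based_divide = cost_based_divide  # stand-in for the FIG. 4 path
        self.store = {}                             # stand-in for storage unit 468
        self.learning = True

    def switch_to_lookup(self):
        """Switch from accumulating examples to searching them (FIG. 18 path)."""
        self.learning = False

    def divide(self, beats):
        key = tuple(beats)
        if self.learning:
            self.store[key] = self.cost_based_divide(beats)
            return self.store[key]
        # Lookup phase: complete matches only; None means no stored example.
        return self.store.get(key)

calls = []

def halve(beats):
    """Placeholder divider: one boundary in the middle (not the patent's cost)."""
    calls.append(tuple(beats))
    return (len(beats) // 2,)

divider = TwoPhaseDivider(halve)
learned = divider.divide([4, 4, 3, 4, 4])
divider.switch_to_lookup()
reused = divider.divide([4, 4, 3, 4, 4])
unseen = divider.divide([6, 6, 5])
print(learned, reused, unseen, len(calls))
```

After the switch, the stored result is returned without calling the divider again, which corresponds to obtaining a division by search alone.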
[0123]
In the above-described embodiment, the number of beats is used in equation (1) for selecting the division position. However, the central word of a phrase or the dependency type between phrases can also be used for the calculation.
[0124]
[Effect of the invention]
As described above, according to the speech synthesizer of the present invention, the division candidate generation means divides the output of the language analysis means based on the first rule, judges the divided prosodic phrases based on the second rule, and, when a prosodic phrase is subdivided, sets the combinations of prosodic phrases generated according to the division positions as division candidates. The division candidate selection means calculates an evaluation value indicating the validity of the division, using the speech expression information included in the division candidates and the parameters supplied from the parameter storage means, and uses this evaluation value to select the division candidate that satisfies the third rule. The selected division candidate gives prosodic phrases divided to an optimum length, and the phonological analysis means generates the intermediate language according to this division. As a result, phonological processing that produces unnatural synthesized speech can be avoided, synthesized speech that is more natural and easier to hear can be output, and changes in parameter values accompanying rule changes can be handled semi-automatically. The operability of the apparatus is thereby improved, and manual labor can be reduced.
[0125]
In addition, when the speech synthesizer stores examples in the division example storage means, has the division example search means search the stored information for information that matches the externally supplied information, and uses the search result to add commands and the like to the corresponding division candidate, high-quality synthesized speech can be obtained with a simple configuration, without parameter processing or cost calculation for the division candidates. The cost of the apparatus can also be reduced.
[0126]
According to the text analysis method for a speech synthesizer of the present invention, the analysis result for a sentence is processed and judged in order according to the first rule and the second rule, and the combinations obtained by subdividing the information according to that result are set as division candidates. An evaluation value for the validity of the division is then calculated using the stored parameters and the speech expression information included in the division candidates, the division candidate that satisfies the third rule is selected, and the intermediate language corresponding to that division candidate is generated by the phonological analysis. This improves the accuracy with which synthesized speech that is more natural and easier to hear is output.
[0127]
In addition, the information to be supplied and/or the information added to appropriately divided examples is stored in advance as search keys for the division candidates. After an output destination is selected according to the determination result for the supplied prosodic phrase, an example corresponding to the division of the prosodic phrase supplied to the selected output destination is searched for, and various commands are given according to the search result. This greatly reduces wasteful computation while still allowing synthesized speech of more natural quality to be generated by an easy operation.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a schematic configuration of a speech synthesizer according to the present invention.
FIG. 2 is a block diagram illustrating an example of a schematic configuration of a language analysis unit in the speech synthesizer of FIG. 1.
FIG. 3 is a diagram illustrating a relationship between processing units of languages handled by the language analysis unit in FIG. 2 and an example of classification thereof.
FIG. 4 is a block diagram illustrating an example of a schematic configuration of a phonological processing unit in the speech synthesizer of FIG. 1.
FIG. 5 is a main flowchart for explaining the operation of the speech synthesizer of FIG. 1.
FIG. 6 is a flowchart for explaining the operation of the language analysis process (subroutine SUB1) shown in the main flowchart of FIG. 5.
FIG. 7 is a diagram illustrating a specific example of a syntax tree and a dependency structure as syntax analysis processing in the processing routine illustrated in FIG. 6.
FIG. 8 is a flowchart for explaining the operation of the phoneme analysis process (subroutine SUB2) shown in the main flowchart of FIG.
FIG. 9 is a flowchart for explaining the operation of pause / phrase command setting (subroutine SUB3) shown in subroutine SUB2 of FIG. 8.
FIG. 10 is a flowchart for explaining the operation of ICRLB division processing (subroutine SUB4) shown in subroutine SUB3 in FIG. 9.
FIG. 11 is a flowchart for explaining the operation of a division candidate generation process (subroutine SUB5) shown in subroutine SUB4 in FIG. 10.
FIG. 12 is a flowchart for explaining the operation of parameter storage processing (subroutine SUB6) shown in subroutine SUB4 in FIG. 10.
FIG. 13 is a flowchart for explaining the operation of a division selection process (subroutine SUB7) shown in subroutine SUB4 in FIG.
FIG. 14 is a diagram illustrating the numerical values of variables and selection of division candidates obtained for the division candidates obtained by dividing the input sentence supplied to the configuration of FIG. 1.
FIG. 15 is a diagram showing the numerical values of each variable and selection of division candidates obtained for the division candidates obtained by dividing the input sentence supplied to the configuration of FIG. 1.
FIG. 16 is a diagram showing the numerical values of variables and selection of division candidates obtained for the division candidates obtained by dividing the input sentence supplied to the configuration of FIG.
FIG. 17 is a diagram illustrating a numerical value of each variable and selection of a division candidate obtained for a division candidate obtained by dividing the input sentence supplied to the configuration of FIG.
FIG. 18 is a block diagram showing the schematic configuration of a pause / phrase command generation unit, which is a main part of another configuration of the phoneme processing unit of FIG. 4.
FIG. 19 is a flowchart for explaining the operation of a pause / phrase command process (subroutine SUB8) in accordance with an example of the pause / phrase command generation unit of FIG. 18.
[Explanation of symbols]
10 Speech synthesizer
20 Data input section
30 Language analysis unit
40 Phonological processing part
42 Prosodic word generator
44 Accent command generator
46 Pause / phrase command generator
50 Control parameter generator
60 Audio signal generator
460 Division candidate generator
462 Division candidate selector
464 Parameter storage

Claims

1. A speech synthesizer that artificially synthesizes supplied information into speech, comprising: language analysis means for analyzing language-level features based on the morphemes contained in a sentence supplied as information and on the syntax of the sentence; phonological analysis means for analyzing spoken-language-level features of the output of the language analysis means and generating an intermediate language serving as a speech synthesis command based on the obtained analysis result; control parameter generation means for generating control parameters according to the output of the phonological analysis means; and speech signal generation means for synthesizing a speech signal based on the output of the control parameter generation means, wherein the phonological analysis means includes:
division candidate generation means for dividing the output of the language analysis means by a first rule that divides according to the modification relationships of prosodic phrases, each combining a plurality of prosodic word elements that are chains of the morphemes, and for generating, as division candidates, the combinations of prosodic phrases obtained when a prosodic phrase is subdivided according to a second rule that determines whether the prosodic phrase obtained by the division is smaller than a preset size;
division candidate selection means for calculating an evaluation value for evaluating the validity of division, using the speech expression information included in the division candidates generated by the division candidate generation means, and for selecting a division candidate according to a third rule that selects the division candidate with the smallest evaluation value; and
parameter storage means for storing the parameters used to calculate the evaluation value in the division candidate selection means,
and wherein the division candidate generation means includes: first rule processing means for dividing the output of the language analysis means using the first rule;
second rule processing means for determining, by the second rule, the divided length in the output of the first rule processing means; and
subdividing means for generating a plurality of division candidates by subdividing the prosodic phrase output from the first rule processing means in accordance with the judgment of the second rule processing means.

2. The speech synthesizer according to claim 1, wherein the parameter storage means stores a first weighting factor based on the speech expression information in the division candidates,
and a second weighting factor by which the average number of beats of the prosodic word elements, which depends on the number of divisions of the division candidate, is multiplied.

3. The speech synthesizer according to claim 1, wherein the speech expression information is emphasis information that specifies, for each prosodic word element, emphasis, suppression, or the case in which neither applies.

4. The speech synthesizer according to claim 1, wherein the phonological analysis means includes: division example storage means for storing examples of delimitation setting positions of prosodic phrases that satisfy the rules; and
division example search means for searching the examples stored in the division example storage means for an example in which a supplied sentence is optimally divided.

5. The speech synthesizer according to claim 4, wherein the division example search means uses, in searching the examples stored in the division example storage means, the emphasis information, the prosodic word elements, the number of beats of each prosodic word element, the total number of beats, the type of the prosodic word element, the type of accent command, and/or the part of speech.

6. The speech synthesizer according to claim 1, wherein the language analysis means includes emphasis information setting means that sets a rule for the emphasis information and stores the set emphasis information.

7. A text analysis method for a speech synthesizer that analyzes language-level features based on the morphemes contained in a sentence supplied as information and on the syntax of the sentence, analyzes the prosody of the sentence in terms of spoken-language-level features based on the analysis result, generates an intermediate language serving as a speech synthesis command based on the obtained analysis result, generates control parameters corresponding to the generated intermediate language, and then artificially synthesizes speech corresponding to the control parameters, the method comprising:
a rule dividing step of dividing the analysis result for the sentence using a first rule that divides the information according to the modification relationships of prosodic phrases, each combining a plurality of prosodic word elements that are chains of the morphemes;
a division length determining step of applying to the result of the rule dividing step a second rule that determines whether the size of a prosodic phrase is smaller than a predetermined size;
a division candidate generation step of generating, as division candidates, the combinations of prosodic phrase delimitation setting positions obtained by subdividing the information according to the determination result of the division length determining step;
a parameter storage step of storing parameters used for evaluating the validity of division based on the speech expression information included in the division candidates obtained in the division candidate generation step;
an evaluation value calculation step of calculating an evaluation value from the parameters and the speech expression information included in the division candidates generated in the division candidate generation step; and
a division candidate selection step of selecting a division candidate based on a third rule that selects the division candidate with the minimum evaluation value among those obtained in the evaluation value calculation step,
wherein the division candidates are generated by subdividing the prosodic phrase when the beat length of the prosodic phrase is larger than a predetermined re-division beat length that triggers re-division, and a symbol for a speech tone command, which is one of the spoken-language-level features, is inserted at each boundary of the division candidate.

8. The text analysis method for a speech synthesizer according to claim 7, wherein the evaluation value is calculated using: an error sum calculation step of calculating the sum of the absolute values of the differences between the beat length of each section obtained by dividing the prosodic phrase and the beat length obtained by equally dividing the beat length of the entire prosodic phrase;
a feature sum calculation step of calculating the sum of the beat lengths of the sections obtained by dividing the prosodic phrase, according to the presence in the prosodic phrase of the feature amount represented by the speech expression information included in the spoken-language-level features;
a weight calculation step of calculating a weighting coefficient based on the speech expression information included for the prosodic word elements of the division candidate;
a weighted feature sum calculation step of multiplying the result of the feature sum calculation step by the weighting coefficient obtained in the weight calculation step and summing the products; a product calculation step of calculating the product of the average beat length obtained by equally dividing the prosodic phrase after division and the beat length of the entire prosodic phrase; and
an evaluation value calculation step of adding the result of the error sum calculation step and the result of the weighted feature sum calculation step, and subtracting the result of the product calculation step from that sum, to obtain the evaluation value of the target division candidate.

9. The text analysis method for a speech synthesizer according to claim 7, wherein the speech expression information classifies each prosodic word element as emphasized, suppressed, or neither.

10. The text analysis method for a speech synthesizer according to claim 9, wherein a prosodic phrase of the division candidate is classified as emphasized when it contains a proper noun or a number, and as suppressed when it contains a formal noun, a verb, or an element located at the head of the sentence, or when it has a classifier at its end, and predetermined values are assigned to the emphasis and the suppression.

11. The text analysis method for a speech synthesizer according to claim 7, further comprising an example storing step of storing in advance, as search keys for the division candidates, the information to be supplied and/or the information added to appropriately divided examples,
wherein the division length determining step selects an output destination according to the determination result for the supplied prosodic phrase, and the method includes an example search step of searching for an example corresponding to the division of the prosodic phrase supplied to the selected output destination, dividing the prosodic phrase according to the search result, and giving various commands.

12. The text analysis method for a speech synthesizer according to claim 11, wherein the example storing step stores various examples in advance in a learning manner.

13. The text analysis method for a speech synthesizer according to claim 7, wherein the parameters are obtained by statistical analysis or multivariate analysis.

14. The text analysis method for a speech synthesizer according to claim 7, wherein the language is Japanese.