JP3655808B2 - Speech synthesis apparatus, speech synthesis method, portable terminal device, and program recording medium

Info

Publication number: JP3655808B2
Application number: JP2000151297A
Authority: JP
Inventors: 浩幸勘座
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 2000-05-23
Filing date: 2000-05-23
Publication date: 2005-06-02
Anticipated expiration: 2020-05-23
Also published as: JP2001331191A

Description

【０００１】
【発明の属する技術分野】
この発明は、文字情報から音声を合成する音声合成装置および音声合成方法、携帯端末器、並びに、プログラム記録媒体に関する。
【０００２】
【従来の技術】
従来、日本語解析時における解析誤りに起因する日本語文認識誤り等に対処する装置として、特開平４‐１６０６３０号公報に開示されているような規則音声合成装置がある。この規則音声合成装置においては、入力文字列に関する複数の日本語解析結果を確定順位と共に読上げ文バッファに保持しておき、これらの複数の日本語解析結果情報を利用するようにしている。この規則音声合成装置によれば、ある確定順位の日本語解析結果に基づく合成音声中に認識誤り(入力文字列の読み誤り)があるため、ユーザーが「違う」等と発声した場合に、確定順位が１つ低い日本語解析結果に基づいて合成音を生成することができ、入力文字列の読み誤り箇所を別の読み方で合成することができる。こうすることによって、上記読み誤りがあった場合に、繁雑な手間や作業を要せずに対話的に訂正作業ができるのである。
【０００３】
【発明が解決しようとする課題】
しかしながら、上記従来の規則音声合成装置においては、以下のような問題がある。すなわち、上記読上げ文バッファに保持されている日本語解析結果の単位で読み誤りを訂正するようにしている。したがって、上記入力文字列でなる入力文(読上げ文)の読み誤り箇所を特定して、その箇所から読み直して読上げ文の途中から再発声させることはできないのである。そのために、長文の読上げ文における最後の個所で再発声を要求した場合でも、読上げ文の最初から発声し直すことになり、目的の合成音声が得られるまでに時間が掛るという問題がある。
【０００４】
また、上記規則音声合成装置においては、入力文字列(読上げ文)の読み誤りには対処できるものの、ユーザが聞き取れなかった場合の聞き直しには、対処できないようになっている。特に、ユーザは文の後半のみが聞き取れなかった場合には、文の先頭から聞き直したいのではなく、聞き取れなかった後半の箇所のみを聞き直したいはずである。その場合に、読上げ文中の特定の位置から後のみを再発声させることは全くできないという問題がある。
【０００５】
そこで、この発明の目的は、音声出力文中の特定位置から再発声を行うことができる音声合成装置を提供することにある。
【０００６】
【課題を解決するための手段】
上記目的を達成するため、第１の発明は、入力された文字列情報を解析手段で解析し,この解析結果に基づいて音声合成手段によって音声を合成して出力する音声合成装置において、再発声を促すための指示を入力する指示入力手段と、上記指示入力手段からの指示を受けて,上記解析手段による解析結果に基づいて上記文字列情報中における再発声の開始位置を特定する位置特定手段と、上記音声合成手段に対して上記特定された開始位置からの音声合成を指示する制御手段を備えて、上記解析結果は文節間結合度を含み、上記位置特定手段は、上記文字列情報中における上記指示入力手段から指示を受けた時点に対応する位置より前の文節から、モーラ数と上記文節間結合度とに基づいて上記開始位置となる文節を特定するようになっていることを特徴としている。
【０００７】
上記構成によれば、合成音声での再発声を促すための指示が指示入力手段から入力されると、位置特定手段によって、解析手段による解析結果に基づいて再発声の開始位置が特定される。したがって、ユーザは、出力合成音声が聞き取れない場合には、その時点で指示入力手段から指示を行うだけで、出力合成音声文中の特定位置から再発声を聞くことができ、長文の後半のみが聞き取れなかった場合でも少ない時間で聞き直しができる。
【０００８】
さらに、再発声の開始位置を特定する指標としてモーラ数が用いられて、指示を受けた時点に近過ぎず且つ遠過ぎない文節が再発声の開始位置として選ばれる。さらに、再発声の開始位置を特定する指標として文節間結合度が用いられて、言語的に結合度合が弱く区切れ易い適切な文節が再発声の開始位置として選ばれる。
【０００９】
また、第２の発明は、入力された文字列情報を解析手段で解析し,この解析結果に基づいて音声合成手段によって音声を合成して出力する音声合成装置において、再発声を促すための指示を入力する指示入力手段と、上記指示入力手段からの指示を受けて,上記解析手段による解析結果に基づいて上記文字列情報中における再発声の開始位置を特定する位置特定手段と、上記音声合成手段に対して上記特定された開始位置からの音声合成を指示する制御手段を備えて、上記解析結果は尤度を含み、上記位置特定手段は、上記文字列情報中における上記指示入力手段から指示を受けた時点に対応する位置以前の文節に関して、各解析結果の候補を入れ換えながら尤度の合計値を算出し、得られた尤度の合計値に基づいて上記再発声の開始位置を特定するようになっていることを特徴としている。
【００１０】
上記構成によれば、各解析結果の候補を入れ換えながら算出された尤度の合計値に基づいて再発声の開始位置が特定される。したがって、入力された文字列情報「市場開放によってもたらされる」に基づいて合成音声「しじょーかいほうによっても、たらされる」が発声された場合に、上記尤度の合計値を次に高くする第２候補「よって」が選択されて、再発声の開始位置であるとして特定される。その結果、「よって、もたらされる」の合成音声が再発声される。
【００１１】
また、上記第１の発明の音声合成装置は、上記解析結果を,尤度を含むように成すと共に、上記文字列情報中における上記指示入力手段から指示を受けた時点に対応する位置以前の文節に関して,各解析結果の候補を入れ換えながら尤度の合計値を算出し,得られた尤度の合計値に基づいて再発声を行う際の解析結果を選択する結果選択手段を備えることが望ましい。
【００１２】
上記構成によれば、各解析結果の候補を入れ換えながら算出された尤度の合計値に基づいて再発声を行う際の解析結果が選択される。したがって、入力された文字列情報「市場開放によってもたらされる」に基づいて合成音声「いちば」が発声された場合に、発声の異なる第２候補「しじょー」が選択されて、「しじょーかいほうによってもたらされる」の合成音声が再発声される。
【００１３】
また、上記第２の発明の音声合成装置は、上記解析結果を,上記尤度に加えて,音韻明瞭度・単語出現確率および単語重要度の少なくとも一つを含むように成し、上記位置特定手段を,上記解析結果の尤度に加えて上記音韻明瞭度・単語出現確率及び単語重要度の少なくとも一つを用いて選出した上記再発声の開始位置候補に基づいて,上記再発声の開始位置を特定するように成すことが望ましい。
【００１４】
上記構成によれば、解析結果尤度に基づいて上記利用者による再発声指示の意図「読み誤り」が反映された位置が再発声の開始位置候補として選出される。さらに、音韻明瞭度または単語出現確率に基づいて上記意図「聞き取れない」が反映された位置が再発声の開始位置候補として選出される。あるいは、単語重要度に基づいて上記意図「重要個所の確認」が反映された位置が再発声の開始位置候補として選出される。
【００１５】
また、この発明の音声合成装置は、上記音声合成手段を,上記解析手段による解析結果に基づいて音声合成情報列を生成する音声合成情報生成手段を有して,上記音声合成情報列に基づいて音声を合成するように成すと共に、上記制御手段の指示に従って,上記音声合成情報生成手段によって生成された音声合成情報列のうち上記再発声の開始位置以降における所定の音声合成情報列を他の音声合成情報列に変換する音声合成情報変換手段を備えることが望ましい。
【００１６】
上記構成によれば、音声合成情報変換手段によって、上記再発声の開始位置以降における所定の音声合成情報列が他の音声合成情報列に変換される。したがって、聞き取れなかった個所の発声が、「発声速度を少し遅め」に、あるいは「発声ピッチを少し高め」に、あるいは「発声パワーを少し大きめ」に、あるいは「音韻長を少し長め」にして再発声を行うことが可能になる。
【００１７】
また、この発明の音声合成装置は、上記解析結果は読み系列情報を含むと共に、文字とこの文字に対応した対応文とを記憶した文字文記憶手段と、上記制御手段の指示に従って,上記解析手段による解析結果のうち上記再発声の開始位置以降における所定の解析結果の読み系列を上記文字文記憶手段に記憶された対応文の解析結果列に変換する文字文変換手段を備えることが望ましい。
【００１８】
上記構成によれば、文字文記憶手段に、文字「す」,「ず」及び「き」とこの文字に対応した対応文「すずめのす」,「すずめのすに濁点」および「切手のき」とを記憶しておけば、文字文変換手段によって、上記再発声の開始位置以降における所定の文「鈴木」の読み系列「すずき」が、対応文の列「すずめのす、すずめのすに濁点、切手のき」に変換されて再発声される。
【００１９】
また、この発明の音声合成装置は、上記指示入力手段を、音声入力された指示を認識する音声認識手段で構成することが望ましい。
【００２０】
上記構成によれば、ユーザの再発声要求意図が、ユーザの音声によって明示される。
【００２１】
また、この発明の音声合成装置は、再発声を促す語彙を記憶した音声認識辞書と、単語と上記再発声を促す語彙との関連性の有無を記憶した解析辞書を備えて、上記音声認識手段は,上記音声認識辞書を用いて再発声を促す語彙を認識し、上記解析手段は,上記解析辞書を用いて上記再発声を促す語彙との関連性の有無を表わす関連情報を含む上記解析結果を生成し、上記位置特定手段は,上記解析手段からの関連情報に基づいて上記再発声の開始位置を特定するようになっていることが望ましい。
【００２２】
上記構成によれば、上記音声認識辞書に「誰」,「いつ」,「どこで」等の再発声を促す語彙を登録しておくことによって、ユーザが知りたい情報が何かを問いかけるこれらの語彙が音声認識された場合には、これらの語彙との関連性が高い「鈴木さん」,「昨日」,「会社で」等の人名や時や場所を表す単語から再発声が行われる。
【００２３】
また、第３の発明は、入力された文字列情報を解析し,この解析結果に基づいて音声を合成して出力する音声合成方法において、再発声を促すための指示を入力するステップと、上記入力された指示を受けて,上記入力された文字列情報の解析結果に基づいて上記文字列情報中における再発声の開始位置を特定するステップと、上記特定された開始位置から音声合成を行うステップを備えて、上記解析結果は文節間結合度を含み、上記開始位置を特定するステップでは、上記文字列情報中における上記指示入力手段から指示を受けた時点に対応する位置より前の文節から、モーラ数と上記文節間結合度とに基づいて上記開始位置となる文節を特定することを特徴としている。
【００２４】
上記構成によれば、合成音声による再発声を促すための指示が入力されると、上記入力された文字列情報の解析結果に基づいて、合成音声による再発声の開始位置が特定される。したがって、ユーザは、出力合成音声が聞き取れない場合には、その時点で指示を行うだけで、出力合成音声文中の特定位置から再発声を聞くことができ、長文の後半のみが聞き取れなかった場合でも少ない時間で聞き直しができる。
【００２５】
さらに、再発声の開始位置を特定する指標としてモーラ数が用いられて、指示を受けた時点に近過ぎず且つ遠過ぎない文節が再発声の開始位置として選ばれる。さらに、再発声の開始位置を特定する指標として文節間結合度が用いられて、言語的に結合度合が弱く区切れ易い適切な文節が再発声の開始位置として選ばれる。
【００２６】
また、第４の発明の携帯端末器は、この発明の音声合成装置を備えたことを特徴としている。
【００２７】
上記構成によれば、文字情報の少ない携帯端末器によって、比較的長い文面の電子メールの内容を合成音声出力によって知る場合に、再発声個所や次候補の選択が自動的に行われるため、非常に簡単な操作で的確な再発声や発声変更が行われる。
【００２８】
また、第５の発明のプログラム記録媒体は、コンピュータを、入力された文字列情報を解析する解析手段と、上記解析手段による文節間結合度を含む解析結果に基づいて音声を合成する音声合成手段と、再発声を促すための指示を入力する指示入力手段と、上記指示入力手段からの指示を受けて,解析手段による解析結果に基づいて,上記文字列情報中における上記指示入力手段から指示を受けた時点に対応する位置より前の文節から,モーラ数と上記文節間結合度とに基づいて上記文字列情報中における再発声の開始位置となる文節を特定する位置特定手段と、上記音声合成手段に対して上記特定された開始位置からの音声合成を指示する制御手段として機能させる音声合成処理プログラムが記録されていることを特徴としている。
【００２９】
上記構成によれば、上記第１の発明の場合と同様に、ユーザは、出力合成音声が聞き取れない場合には、その時点で指示を行うだけで、出力合成音声文中の特定位置から再発声を聞くことができ、長文の後半のみが聞き取れなかった場合でも少ない時間で聞き直しができる。
【００３０】
さらに、再発声の開始位置を特定する指標としてモーラ数が用いられて、指示を受けた時点に近過ぎず且つ遠過ぎない文節が再発声の開始位置として選ばれる。さらに、再発声の開始位置を特定する指標として文節間結合度が用いられて、言語的に結合度合が弱く区切れ易い適切な文節が再発声の開始位置として選ばれる。
【００３１】
【発明の実施の形態】
以下、この発明を図示の実施の形態により詳細に説明する。
＜第１実施の形態＞
図１は、本実施の形態の音声合成装置における構成を示すブロック図である。テキスト解析部１は、入力されたテキストの言語を解析し、得られたテキスト解析結果を解析記憶部３に一時的に記憶する。解析辞書メモリ２は、テキスト解析部１がテキスト解析を行う際に必要な解析辞書を含む言語データ等を格納している。上記音声合成情報生成手段としての音韻処理部４は、解析記憶部３に記憶されているテキスト解析結果に基づいて、読み系列情報(音韻記号列),アクセント情報および文節区切り情報等の音声合成情報を生成する。そうすると、音声合成部５は、音韻処理部４によって生成された音声合成情報列に基づいて、音声を合成してスピーカ(図示せず)から出力する。
【００３２】
指示入力部６は、音声合成部５からの出力合成音声に対してユーザが再発声を促すための指示を入力する。この指示入力部６は、ボタンあるいはキーボード等で構成される。位置特定部７は、解析記憶部３に記憶されているテキスト解析結果に基づいて、上記テキスト中における再発声を開始する位置を特定する。制御部８は、指示入力部６からの指示を受けて位置特定部７に再発声開始位置を特定させ、特定された上記テキスト中の位置から以降のテキスト解析結果の読み出しを音韻処理部４に指示して再発声を開始させる。
【００３３】
図２は、上記構成を有する音声合成装置の各部によって実行される音声合成処理動作のフローチャートである。以下、図２のフローチャートに従って、テキスト解析部１に「鈴木さんから商談の件でメールが来ています」という漢字仮名混じり文が入力された場合を例に上げて、本音声合成装置の動作を説明する。
【００３４】
ここで、上記テキスト解析部１に対するテキストの入力は、キーボード,ペンあるいは音声認識装置等の文入力手段からの入力でもよいし、メールやＷＷＷ等のネットワークを経由した入力でも構わない。こうして、テキスト解析部１にテキストが入力されると音声合成処理動作がスタートする。
【００３５】
ステップＳ1で、上記テキスト解析部１によってテキスト解析が行われる。すなわち、入力文が解析辞書メモリ２に格納された解析辞書を参照して形態素に分割され、読みや品詞等の情報が付与される。上記入力文の場合は、「鈴木(すずき：固有名詞)」,「さん(さん：接尾語)」,「から(から：助詞)」,「商談(しょーだん：名詞)」,「の(の：助詞)」,「件(けん：名詞)」,「で(で：助詞)」,「メール(めーる：名詞)」,「が(が：助詞)」,「来(き：動詞)」,「て(て：助詞)」,「い(い：動詞)」,「ます(ます：助動詞)」の形態素に分割される。
【００３６】
ステップＳ2で、上記テキスト解析部１によってテキストの解析結果が解析記憶部３に記憶される。図３に解析記憶部３における格納状態の概念図を示す。図３において、解析結果尤度,アクセント情報および文節間結合度の欄には、後述するような数値情報が格納される。尚、上記ステップＳ1におけるテキスト解析処理においては、解析候補が複数得られる。そして、夫々の解析候補に尤度を付けた際に、尤度順位が最も高い解析候補をテキスト解析結果として解析記憶部３に記憶するのである。その場合、夫々の解析候補の尤度も解析記憶部３に記憶される。
【００３７】
さらに、上記各解析結果の連接から文節が生成され、文節の情報が解析記憶部３に記憶される。それと共に、先行文節と後続文節との結合性の度合を表わす文節間結合度が計算され、文節毎のアクセント情報がアクセント規則等に基づいて計算される。そして、得られた文節間結合度およびアクセント情報も合わせて記憶される。
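The analysis storage described in paragraphs 0036 and 0037 (Fig. 3) keeps, for each phrase, its candidate analyses with likelihoods, accent information, and the coupling degree to the following phrase. A minimal Python sketch of such a record is shown below. The class names, field names, and numeric values are assumptions made for illustration; the patent does not prescribe a concrete data layout.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """One analysis candidate for a phrase (hypothetical layout)."""
    words: List[str]      # morphemes that make up the phrase
    reading: str          # reading sequence, romanized here for brevity
    likelihood: float     # analysis-result likelihood


@dataclass
class PhraseEntry:
    """One row of the analysis storage (hypothetical layout)."""
    candidates: List[Candidate]   # ordered by likelihood, best first
    accent: int                   # assumed: accent-nucleus position
    coupling_to_next: float       # inter-phrase coupling degree B

    @property
    def best(self) -> Candidate:
        # The first embodiment synthesizes from the top-ranked candidate.
        return self.candidates[0]


entry = PhraseEntry(
    candidates=[Candidate(["suzuki", "san", "kara"], "suzukisankara", 0.9)],
    accent=0,
    coupling_to_next=0.7,
)
```

Keeping every candidate (not only the best one) is what later allows the second embodiment to swap candidates without re-running the text analysis.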
【００３８】
上記文節間結合度は韻律を制御するために使用されるものであり、文節間結合度によってポーズの長さやフレーズ指令の大きさが決定される。すなわち、文節間結合度が大きければ両文節の結合は弱いために明確なポーズが入れられて、夫々の文節が独立に句を構成するのである。尚、上記文節間結合度の計算は、一般的には先行文節と後続文節との品詞パターン毎に設定された結合値が記載されたテーブル(図示せず)を用いて行われる。
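Paragraph 0038 states that the inter-phrase coupling degree is generally looked up in a table keyed by the part-of-speech pattern of the preceding and following phrases, and that a larger value means a weaker link and a clearer pause. The sketch below illustrates that lookup. The part-of-speech labels, table values, default value, and the linear pause mapping are all assumptions; the patent's table is not shown in the text.

```python
# Hypothetical table: (POS pattern of preceding phrase, POS pattern of following phrase)
# -> coupling value. Larger value = weaker link = easier to break.
COUPLING_TABLE = {
    ("case-particle", "noun"): 0.7,
    ("adnominal-particle", "noun"): 0.3,
    ("te-form", "auxiliary-verb"): 0.1,
}
DEFAULT_COUPLING = 0.5  # assumed fallback for patterns missing from the table


def coupling_degree(prev_pos: str, next_pos: str) -> float:
    """Look up the coupling degree for a pair of adjacent phrases."""
    return COUPLING_TABLE.get((prev_pos, next_pos), DEFAULT_COUPLING)


def pause_ms(degree: float, max_pause_ms: int = 400) -> int:
    """Assumed linear mapping from coupling degree to pause length."""
    return round(degree * max_pause_ms)
```

For instance, under these assumed values the boundary after a case particle receives a longer pause than the boundary after an adnominal particle such as "no".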
【００３９】
ステップＳ3で、上記音韻処理部４によって、上記解析記憶部３に格納されたテキスト解析結果に基づいて、解析結果尤度が最も高い単語の並びに対して音韻処理が行われる。ステップＳ4で、音声合成部５によって、上記ステップＳ3における音韻処理によって得られた音声合成情報列に基づいて音声が合成される。そして、合成音がスピーカ等から出力される。ステップＳ5で、上記スピーカから出力された音声に対して、ユーザから、聞き取り難い音声がある旨の指示が指示入力部６からあるか否かが判別される。その結果、あればステップＳ6に進み、無ければステップＳ8に進む。
【００４０】
ステップＳ6で、上記制御部８および音声合成部５によって、その時点で合成音声の出力が中断される。ステップＳ7で、制御部８および位置特定部７によって、解析記憶部３の内容が参照されて、再発声開始文節が決定される。そして、決定された再発声開始文節以降のテキスト解析結果の読み出しが音韻処理部４に指示される。そうした後に、上記ステップＳ3に戻って、再発声開始文節以降の文節に関して、音韻処理及び音声合成が再度行われる。そして、上記ステップＳ5において、ユーザからの指示がないと判別されると、ステップＳ8に進む。尚、上記再発声開始文節の決定方法については後に詳述する。
【００４１】
ステップＳ8で、上記テキスト解析部１に入力された次の文があるか否かが判別される。その結果、あれば上記ステップＳ1に戻って、次の文の処理に移行する。一方、無ければ音声合成処理動作を終了する。
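The flow of steps S3 to S7 in Fig. 2 (synthesize, check for a re-utterance instruction, interrupt, pick a restart phrase, and resume) can be sketched as a small simulation. This is only an illustration under stated assumptions: phrases are plain strings, "speaking" a phrase means appending it to a list, and `choose_restart` stands in for the position specifying unit.

```python
from typing import Callable, List


def speak(phrases: List[str],
          interrupts: List[int],
          choose_restart: Callable[[int], int]) -> List[str]:
    """Simulate output with re-utterance.

    interrupts: phrase indices at which the user gives the instruction,
                consumed in order (each instruction fires once).
    choose_restart: maps the interrupted phrase index to the restart index.
    """
    spoken: List[str] = []
    pending = list(interrupts)
    start = 0
    while True:
        interrupted = False
        for idx in range(start, len(phrases)):
            spoken.append(phrases[idx])            # S3/S4: process and output
            if pending and pending[0] == idx:      # S5: instruction received?
                pending.pop(0)
                start = choose_restart(idx)        # S6/S7: stop, pick restart
                interrupted = True
                break
        if not interrupted:
            return spoken                          # S8: sentence finished


sentence = ["suzuki-san-kara", "shoudan-no", "ken-de", "meeru-ga", "kite", "imasu"]
# The user interrupts near "meeru-ga"; the restart phrase is assumed to be index 1.
result = speak(sentence, interrupts=[3], choose_restart=lambda idx: 1)
```

In this run the first four phrases are spoken, output is interrupted, and the sentence is then spoken again from "shoudan-no" to the end.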
【００４２】
以下、上述した音声合成処理動作における上記ステップＳ4以降の処理に付いて、上記入力文「鈴木さんから商談の件でメールが来ています」に従って更に詳細に説明する。上記ステップＳ4において、スピーカから「鈴木さんから商談の件でメールが来ています」と出力された際に、図４に示すように、ユーザが「商談」の単語が聞き取り難かったとする。そして、「メールが」近傍まで音声出力された時点で、ユーザが「再発声」の指示を行うと、一旦音声合成/出力が中断されて、どの文節から再発声を行うかを決定する処理に移行する。
【００４３】
再発声する文節位置を決める指標として、本実施の形態においては、その一例として、指示時点から先頭文節までのモーラ数と文節間結合度とを用いる。ここで、上記モーラ数を指標に入れる理由は、再発声開始文節の位置として指示時点に近過ぎず且つ遠過ぎない位置を選び易くするためである。再発声の指示時点に相当する文字から数えてモーラ数の少ない文節(すなわち接近している文節)から再発声を開始すると、本来聞き直したい箇所まで戻らない可能性がある。逆に、モーラ数が多すぎると、かなり遡った位置から再発声が開始されることになり、本来聞き直したい箇所まで到達するのに時間が掛り、再度聞き逃してしまう場合も発生する。そこで、図５に示すように、適切なモーラ数における評価関数値が最高の値になり、そのモーラ数から更にモーラ数が増えるに従って評価関数値が徐々に減るような評価関数ｆ(x)を定義するのである。
【００４４】
上の例文であれば、第４文節「メールが」の「が」の位置で再発声の指示が行われた場合、文節「メールが」までのモーラ数は４、文節「件で」までのモーラ数は７、文節「商談の」までのモーラ数は１２、文節「鈴木さんから」までのモーラ数は１９となる。そして、指示時点とのモーラ数差が「４」では少なすぎる一方、モーラ数差が「１９」では多すぎると見なし、「件で」や「商談の」のあたりが妥当であると判断されるのである。
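Paragraphs 0043 and 0044 describe an evaluation function f(x) over the mora count x that peaks at a suitable distance and falls off as the count grows (Fig. 5). A minimal sketch follows. The triangular shape, the peak at 10 morae, and the width are assumptions chosen only so that the example counts 4, 7, 12, and 19 behave as the text describes; the real curve is the one in Fig. 5.

```python
def mora_score(x: int, peak: int = 10, width: int = 10) -> float:
    """Hypothetical f(x): 1.0 at `peak` morae, falling linearly to 0."""
    return max(0.0, 1.0 - abs(x - peak) / width)


# Mora counts from the instruction point back to the head of each phrase
# (values taken from paragraph 0044; phrase names are romanized).
mora_back = {
    "meeru-ga": 4,
    "ken-de": 7,
    "shoudan-no": 12,
    "suzuki-san-kara": 19,
}

scores = {phrase: mora_score(m) for phrase, m in mora_back.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
```

With these assumed parameters the two middle phrases rank highest, which matches the judgment in the text that 4 morae is too close and 19 morae is too far.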
【００４５】
また、上記文節間結合度を用いる理由は、適切な文節の区切れ目から再発声を開始させるためである。上述の例文であれば、文節間結合度Ｂは以下のような関係になる。尚、文節間結合度Ｂは、その値が大きいほど結合度合は弱い(すなわち区切れ易い)ことを表わす。また、Ｂ(ａ,ｂ)における「ａ」は文節番号であり、「ｂ」は候補番号である。
Ｂ(３,１)＞Ｂ(１,１)＞Ｂ(４,１)＞Ｂ(２,１)＞Ｂ(５,１)
【００４６】
すなわち、第３文節「件で」と第４文節「メールが」との間の文節の結合度が最も弱く、第５文節「来て」と第６文節「います」との間の文節の結合度が最も強い。そして、第２文節「商談の」‐第３文節「件で」(Ｂ(２,１))と、第１文節「鈴木さんから」−第２文節「商談の」(Ｂ(１,１))とを比較した場合、後者の方が結合度は弱いので、文節「件で」から再発声を開始するよりも、文節「商談の」から再発声を開始する方がよいということになる。
【００４７】
以上で述べたように、上記指示時点から文節先頭までのモーラ数と文節間結合度とに基づいて再発声する文節位置を決めることによって、上記指示時点に近過ぎず且つ遠過ぎない位置で、言語的にも適切な位置で発声を再開できる。したがって、本実施の形態においては、以下のように、上記指示時点から先頭文節までのモーラ数「ｘ」と文節間結合度「ｙ」とをパラメータとする関数ｇ(ｘ,ｙ)を定義し、各文節毎に関数値を求めて比較することによって、再発声を開始する文節を最適に決めることができるのである。
ｇ(ｘ,ｙ)＝α・ｆ(ｘ)＋β・ｙ
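The combined score g(x, y) = α·f(x) + β·y can be evaluated per phrase and the maximum taken as the restart phrase. The sketch below does this for the example sentence. The shape of f, the weights α = β = 1, and the numeric coupling values are assumptions; the coupling values only respect the ordering B(3,1) > B(1,1) > B(2,1) given in paragraph 0045, and the value used for the sentence-initial phrase is a further assumption.

```python
def f(x: int, peak: int = 10, width: int = 10) -> float:
    """Hypothetical mora evaluation function (peaked, see Fig. 5)."""
    return max(0.0, 1.0 - abs(x - peak) / width)


def g(x: int, y: float, alpha: float = 1.0, beta: float = 1.0) -> float:
    """g(x, y) = alpha * f(x) + beta * y, as defined in paragraph 0047."""
    return alpha * f(x) + beta * y


# (phrase, morae back from the instruction point,
#  coupling degree of the boundary just before the phrase)
candidates = [
    ("suzuki-san-kara", 19, 1.0),  # sentence start: boundary value assumed
    ("shoudan-no", 12, 0.7),       # B(1,1)
    ("ken-de", 7, 0.3),            # B(2,1)
    ("meeru-ga", 4, 0.9),          # B(3,1)
]

best_phrase = max(candidates, key=lambda c: g(c[1], c[2]))[0]
```

Under these assumptions "shoudan-no" wins, which agrees with the restart position named in paragraph 0049. Different weights would shift the balance between distance and boundary strength.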
【００４８】
上述のように、本実施の形態においては、上記テキスト解析部１で得られたテキスト解析結果を記憶する解析記憶部３を設け、入力テキストの各文節毎に、構成単語,解析結果尤度,アクセント情報および文節間結合度を記憶しておく。そして、上記テキスト解析結果に基づいて出力された合成音声が聞き取り難い場合には、その時点で指示入力部６からその旨の指示がなされる。そうすると、制御部８の制御の下に、合成音声の出力が中断され、位置特定部７によって、上記指示時点から先頭文節までのモーラ数「ｘ」と文節間結合度「ｙ」とをパラメータとして定義された関数ｇ(ｘ,ｙ)に基づいて、合成音声の出力を再開始する文節の位置が決定される。そして、その文節位置から合成音声の出力を再開するようにしている。
【００４９】
したがって、ユーザは、出力合成音声が聞き取れない場合には、その時点で指示入力部６から指示を行うことによって、上記指示時点に近過ぎず且つ遠過ぎない位置で、言語的にも適切な位置から合成音声の出力を再開することができるのである。すなわち、テキスト解析部１に入力された漢字仮名混じり文「鈴木さんから商談の件でメールが来ています」に基づいて合成音声が出力された際に、ユーザが「商談」の単語が聞き取れなかったために「メールが」近傍で「再発声」の指示を行うと、位置特定部７によって第２文節「商談の」が再発声開始位置であると特定される。そして、「商談の件でメールが来ています」の合成音声が再発声されるのである。
【００５０】
すなわち、本実施の形態によれば、音声出力文中の特定位置から再発声を行うことができ、長文の後半のみが聞き取れなかった場合に少ない時間で聞き直しができるのである。
【００５１】
＜第２実施の形態＞
上記第１実施の形態によれば、テキスト中の特定位置から再発声を行うことができるのではあるが、テキスト中の特定位置から発声を変えて再発声を行うことはできない。本実施の形態は、このような場合に対処するものである。
【００５２】
図６は、本実施の形態の音声合成装置における構成を示すブロック図である。解析辞書メモリ１２,解析記憶部１３,音韻処理部１５,音声合成部１６および指示入力部１７は、上記第１実施の形態における解析辞書メモリ２,解析記憶部３,音韻処理部４,音声合成部５および指示入力部６と同じ構成を有して、同様に動作する。
【００５３】
テキスト解析部１１は、上記第１実施の形態で述べたようにテキスト解析処理の結果得られた複数の解析候補を、尤度順位の高い順に解析記憶部１３に記憶する。また、位置特定部１４は結果選択手段１８を有している。この結果選択手段１８は、解析記憶部１３に記憶されている複数の解析候補の中から、解析結果尤度の合計の高い順に一つの解析候補の組み合わせを選択して音韻処理部１５に送出する。
【００５４】
制御部１９は、上記指示入力部１７からの再発声指示を受けて、位置特定部１４に再発声開始位置の特定を指示する。位置特定部１４は、制御部１９からの指示を受けて、結果選択手段１８によって、解析記憶部１３に記憶されている複数の解析候補の中から、解析結果尤度の合計値が現在音声合成中の解析候補の組み合わせの次に大きな解析候補の組み合わせを選択し、この選択結果に基づいて再発声の開始位置を特定する。
【００５５】
図７は、上記構成を有する音声合成装置の各部によって実行される音声合成処理動作のフローチャートである。以下、図７に従って、音声合成装置の動作を説明する。
【００５６】
ステップＳ11およびステップＳ12で、上記第１実施の形態におけるステップＳ1およびステップＳ2と同様にして、テキスト解析およびテキスト解析結果の解析記憶部１３への記憶が行われる。但し、本実施の形態においては、図８に示すように、１つの単語に複数の解析候補が存在する場合には、各解析候補の尤度が大きい順に第１候補,第２候補,…の欄に格納するようにしている。
【００５７】
ステップＳ13で、上記制御部１９によって、現在選択されている解析候補の組み合わせに関する解析結果尤度の合計値の順位をカウントする変数ｉの値が「１」に初期化される。ステップＳ14で、指示入力部１７からの発声変更指示がない場合には、位置特定部１４の結果選択手段１８によって、解析結果尤度の合計値がｉ番目に大きい解析候補列が解析記憶部１３から選択される。そして、音韻処理部１５によって、上記選択された解析候補列に基づいて音韻処理が行われる。したがって、最初は、解析結果尤度が最も高い単語の並びに対して音韻処理が行われることになる。ステップＳ15で、音声合成部１６によって、上記ステップＳ14における音韻処理で得られた音声合成情報列に基づいて音声が合成されて、合成音がスピーカ等から出力される。ステップＳ16で、上記スピーカから出力された音声に対して、合成誤りがある旨の発声変更指示が指示入力部１７からあるか否かが判別される。その結果、あればステップＳ17に進み、無ければステップＳ22に進む。
【００５８】
ステップＳ17で、上記制御部１９及び音声合成部１６によって、その時点(ｊ_p：発声変更指示があった時点の文節番号)(１≦ｊ_p≦ｊ_n：ｊは文節番号,ｎは文節数)で合成音声の出力が中断される。ステップＳ18で、結果選択手段１８によって、解析記憶部１３に記憶されているｊ≦ｊ_pの全文節の結果候補を入れ換えながら、解析結果尤度Ｋの全文節に関する合計値ΣＫ(ｊ,ｐ(j))を「ｉ＋１」番目に大きくする(ｊ_a,ｐ(ｊ_a))(１≦ｊ_a≦ｊ_n)が求められる。但し、「ｊ_a」は、「ｉ＋１」番目に大きな解析結果尤度Ｋの合計値ΣＫが得られた際に交換された文節の番号であり、「ｐ(ｊ_a)」はその候補番号である。
【００５９】
ステップＳ19で、上記結果選択手段１８によって、上記ステップＳ18において得られた入れ換え解析候補の解析候補情報(ｊ_a,ｐ(ｊ_a))に基づいて、「ｊ_a」番目の文節における「ｐ(ｊ_a)」番目の解析候補(以下、解析候補(ｊ_a,ｐ(ｊ_a))と言う)と、「ｊ_a＋１」番目以降における直前の音声合成時に用いられた解析候補列「(ｊ_a＋１,ｐ(ｊ_a＋１))〜(ｊ_n,ｐ(ｊ_n))」とが選択されて、音韻処理部１５に送出される。ステップＳ20で、制御部１９によって、順位ｉの内容がインクリメントされる。ステップＳ21で、順位ｉが全解析候補の組み合わせ総数Ｎより大きいか否かが判別される。そして、「Ｎ」より大きければステップＳ22に進む。一方、「Ｎ」以下であれば上記ステップＳ14に戻る。以後、音韻処理部１５によって、直前の音声合成時に用いられた「ｊ_a」番目の文節以降の解析候補列中における「ｊ_a」番目の文節の解析候補が解析候補(ｊ_a,ｐ(ｊ_a))に置き換えられた解析候補列に基づいて合成音声が出力される。
【００６０】
こうして、上記ステップＳ16において発声変更指示がないと判別されるか、上記ステップＳ21において順位ｉが「Ｎ」より大きいと判別されるまで、上記ステップＳ17〜ステップＳ21,ステップＳ14〜ステップＳ16の処理が繰り返される。
【００６１】
ステップＳ22で、上記テキスト解析部１１に入力された次の文があるか否かが判別される。その結果、あれば上記ステップＳ11に戻って、次の文の処理に移行する。一方、無ければ音声合成処理動作を終了する。
【００６２】
以下、上述した音声合成処理動作における上記ステップＳ18以降の処理に付いて、入力文「市場開放によってもたらされる」に従って更に詳細に説明する。図８は、「市場開放によってもたらされる」と言うテキスト文に対する解析記憶部１３の格納状態を示す。単語「市場」に対して読みが「いちば」である解析候補と読みが「しじょー」である解析候補の２つの解析候補が存在する。また、文節連鎖「よってもたらされる」に対して、文節区切り位置の違いによって「よっても/たらされる」と「よって/もたらされる」との２つの解析候補列が存在する。
【００６３】
上記テキスト解析部１１によるテキスト解析の結果、第１文節の第１候補「いちば」の解析結果尤度Ｋ(１,１)と、第１文節の第２候補「しじょー」の解析結果尤度Ｋ(１,２)との間には、
Ｋ(１,１)＞Ｋ(１,２)
という関係が成立したものとする。これは、文字列「市場」は、「いちば」と読む可能性の方が高いと言うテキスト解析の誤解析の例である。
【００６４】
また、上記第３,第４文節の文節連鎖における第１候補「よっても/たらされる」の解析結果尤度Ｋ(３,１),Ｋ(４,１)と、第２候補「よって/もたらされる」の解析結果尤度Ｋ(３,２),Ｋ(４,２)との間には、
Ｋ(３,１)＋Ｋ(４,１)＞Ｋ(３,２)＋Ｋ(４,２)
という関係が成立したものとする。
【００６５】
今、図９に示すように、テキスト「市場開放によってもたらされる」が音声合成されて１回目の合成音声「いちばかいほーによって」が出力された際に、「よって」という発声が行われている時点(ｊ_p＝３)で利用者が発声変更指示を入力した場合は、第１文節〜第３文節の解析結果候補が順次入れ換えられて解析結果尤度Ｋの合計値ΣＫ(ｊ,ｐ(j))が算出される。その場合、第１〜第３文節の解析結果尤度Ｋの合計値は、「Ｋ(１,１)＋Ｋ(２,１)＋Ｋ(３,１)」が１番大きく、次に「Ｋ(１,２)＋Ｋ(２,１)＋Ｋ(３,１)」が２番目に大きいので、「ｉ＋１」番目に大きな、すなわち２番目に大きな解析結果尤度Ｋの合計値ΣＫを呈する解析候補情報(１,２)が得られる。その結果、結果選択手段１８によって、第１文節における「２」番目の解析候補「市場(しじょー)」が選択され、直前の音声合成時に用いられた第２文節以降の解析候補列と共に音韻処理部１５に送出される。こうして、合成音声「しじょーかいほーによってもたらされる」が出力されるのである。
【００６６】
次に、２回目の音声出力時に「たらされる」という発声が行われている時点(ｊ_p＝４)で利用者が発声変更指示を入力した場合は、第１〜第４文節の解析結果尤度Ｋの合計値は、「Ｋ(１,２)＋Ｋ(２,１)＋Ｋ(３,１)＋Ｋ(４,１)」が２番目に大きく、「Ｋ(１,２)＋Ｋ(２,１)＋Ｋ(３,２)＋Ｋ(４,２)」が３番目に大きい(上述の「しじょー」の選択によって第１文節の解析候補はＫ(１,２)に固定される)。したがって、解析候補情報列(３,２),(４,２)が選択されて、合成音声「よって、もたらされる」が出力されるのである。
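Steps S18 and S19 rank the candidate combinations of the phrases up to the instruction point by their total likelihood ΣK and switch to the combination ranked i+1. The sketch below reproduces the first change in the example (instruction at j_p = 3, result (1,2), i.e. "shijoo" replaces "ichiba"). The integer likelihood values are invented for illustration and only respect K(1,1) > K(1,2). The sketch treats each phrase's candidates as independent, which is a simplification: in the patent's example the candidates of phrases 3 and 4 change together because they come from one segmentation.

```python
from itertools import product
from typing import List, Optional, Tuple


def ranked_combinations(likelihoods: List[List[int]], j_p: int) -> List[Tuple[int, ...]]:
    """All candidate combinations for phrases 1..j_p, highest total first."""
    head = likelihoods[:j_p]
    combos = product(*(range(len(cands)) for cands in head))
    return sorted(
        combos,
        key=lambda combo: sum(head[j][p] for j, p in enumerate(combo)),
        reverse=True,
    )


def next_candidate(likelihoods: List[List[int]], j_p: int,
                   current: Tuple[int, ...], i: int) -> Optional[Tuple[int, int]]:
    """Return (j_a, p(j_a)), 1-based, for the combination ranked i+1."""
    target = ranked_combinations(likelihoods, j_p)[i]   # index i is rank i+1
    for j, (old, new) in enumerate(zip(current, target)):
        if old != new:
            return j + 1, new + 1
    return None


# Hypothetical likelihoods K(j, p): rows are phrases, columns are candidates.
K = [
    [60, 50],   # phrase 1: "ichiba", "shijoo"
    [90],       # phrase 2: "kaihoo-ni"
    [70, 40],   # phrase 3: "yottemo", "yotte"
    [65, 35],   # phrase 4: "tarasareru", "motarasareru"
]

swap = next_candidate(K, j_p=3, current=(0, 0, 0), i=1)
```

Here `swap` identifies phrase 1, candidate 2, so synthesis would restart from phrase 1 with the reading "shijoo".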
【００６７】
尚、上記発声変更指示を入力する時点が遅れた場合には、解析結果候補を入れ換える文節を誤る場合が生ずる。例えば、第１文節「いちば」の発声変更の指示を「たらされる」の発声が行われている時点で行ったために、第４文節「たらされる」の発声変更が行われてしまうような場合である。このような場合でも、再度発声変更を指示することによって次候補の発声が行われるため、所望の合成音声が出力されるまで発声変更の指示を繰り返せばよい。
【００６８】
上述のように、本実施の形態においては、上記位置特定部１４に、解析記憶部１３に記憶されたテキスト解析結果を選択する結果選択手段１８を設けている。そして、上記テキスト解析結果に基づいて出力された合成音声が間違っている場合には、その時点で指示入力部１７から音声変更指示がなされる。そうすると、制御部１９の制御の下に、合成音声の出力が中断され、位置特定部１４の結果選択手段１８によって、発声変更指示があった位置の文節よりも前の全文節の結果候補を入れ換えながら、解析結果尤度Ｋの合計値ΣＫを次に大きくする入れ換え解析候補(ｊ_a,ｐ(ｊ_a))が求められ、解析候補列(ｊ_a,ｐ(ｊ_a))〜(ｊ_n,ｐ(ｊ_n))が選択される。そして、音韻処理部１５によって、解析候補列(ｊ_a,ｐ(ｊ_a))〜（ｊ_n,ｐ(ｊ_n))に基づいて再度合成音声が出力されるようにしている。
【００６９】
したがって、ユーザは、出力合成音声に間違いを見つけた場合には、その時点で指示入力部１７から発声変更指示を行うことによって、次候補の合成音声の出力を行うことができるのである。すなわち、テキスト解析部１１に入力された漢字仮名混じり文「市場開放によってもたらされる」に基づいて合成音声が出力された際に、ユーザが「よっても、たらされる」の発声を変更したい場合に発声変更の指示を行うと、結果選択手段１８によって第３,第４文節の第１候補「よっても」,「たらされる」が第２候補「よって」,「もたらされる」に入れ換えられる。そして、入れ換えが発生した文節以降の文節列「よって、もたらされる」の合成音声が再発声されるのである。
【００７０】
すなわち、本実施の形態によれば、音声出力文中の特定位置から発声を変えて再発声を行うことができ、長文の後半のみが間違っている場合に少ない時間で修正ができるのである。
【００７１】
尚、後述するように、音声認識部を併用することによって、指示入力部１７に「違う」と音声入力することによって次候補を発声させ、「もう一度」と音声入力することによって同じ発声を繰り返させることが可能となる。このように、利用者が指示を明示的に行うことによって、利用者の意図とは異なる合成音声が出力されないようにすることが可能になる。
【００７２】
＜第３実施の形態＞
上記第１実施の形態によれば、音声出力文中の特定位置から再発声を行うことができるのではあるが、再発声の際に最初と全く同じ合成音を出力するとやはり聞き取れない可能性がある。本実施の形態は、そのような場合に対処するものであって、再発声に際しては、最初聞き取り難かった部分の発声速度等の音声合成情報列を変えるものである。
【００７３】
図１０は、本実施の形態の音声合成装置の構成を示すブロック図である。テキスト解析部２１,解析辞書メモリ２２,解析記憶部２３,音韻処理部２４,音声合成部２５,指示入力部２６および位置特定部２７は、上記第１実施の形態におけるテキスト解析部１,解析辞書メモリ２,解析記憶部３,音韻処理部４,音声合成部５,指示入力部６および位置特定部７と同じ構成を有して、同様に動作する。
【００７４】
音声合成情報変換部２９は、上記音韻処理部２４で生成された音声合成情報列を制御部２８からの指示に従って他の音声合成情報列に変換する。こうして、利用者が聞き取り難かった個所の音声合成情報列を変えることによって、発声の性質が違う合成音を生成することができ、以前聞き取れなかった個所を聞き取り易くするのである。
【００７５】
ここで、聞き取り易い発声に変換する例としては、下記のような変換方法がある。
(１) 発声速度を少し遅めにする。
(２) 発声ピッチを少し高めにする。
(３) 発声パワーを少し大きめにする。
(４) 音韻長を少し長めにする。
【００７６】
例えば、パワーを変更する場合は次のようにする。すなわち、母音パワーに影響を与える要因としては、当該母音の種類,隣接音韻の種類,位置,ピッチ等があり、これらの値からパワーの値を推定できる。したがって、上記要因の値から推定されたパワーの値よりやや高めの数値を設定することによって、最初とは異なる合成音声を生成することができる。また、発声速度を変更する場合には、予め適切な速度の値が決められており、再発声時にはその適性値よりも所定値だけ遅くなるように設定することによって、最初とは異なる合成音声を生成するのである。
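Paragraphs 0075 and 0076 list four ways of converting the synthesis information for the re-utterance: a slightly slower rate, a slightly higher pitch, slightly more power, or slightly longer phoneme durations. The sketch below shows one way to express such a conversion. The `Prosody` fields, their units, and the step sizes are assumptions; the patent only says the values are moved a little from the estimated or preset ones.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Prosody:
    """Hypothetical subset of the speech synthesis information."""
    rate: float            # relative speaking rate, 1.0 = preset value
    pitch_hz: float        # base pitch
    power_db: float        # estimated vowel power
    duration_scale: float  # multiplier on phoneme durations


def make_clearer(p: Prosody, mode: str, step: float = 0.1) -> Prosody:
    """Return a copy of `p` adjusted in one of the four listed ways."""
    if mode == "rate":
        return replace(p, rate=p.rate * (1 - step))              # (1) slower
    if mode == "pitch":
        return replace(p, pitch_hz=p.pitch_hz * (1 + step))      # (2) higher
    if mode == "power":
        return replace(p, power_db=p.power_db + 3.0)             # (3) louder
    if mode == "duration":
        return replace(p, duration_scale=p.duration_scale * (1 + step))  # (4) longer
    raise ValueError(f"unknown mode: {mode}")


base = Prosody(rate=1.0, pitch_hz=120.0, power_db=60.0, duration_scale=1.0)
slower = make_clearer(base, "rate")
```

Returning a modified copy keeps the original synthesis information intact, so the conversion can be limited to the phrases from the restart position onward.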
【００７７】
以上のように、本実施の形態によれば、上記音声合成情報における推定された値や予め設定されている値を少し変えることで、先の合成音声とは音声合成情報列が異なる合成音声を生成して出力することができる。したがって、再発声時における聞き取り易さの向上を図ることができるのである。
【００７８】
＜第４実施の形態＞
本実施の形態は、上記第３実施の形態の場合と同様に、再発声の際に最初と異なる合成音を出力するものであって、再発声に際しては、最初聞き取り難かった部分の音声を正確に伝達するものである。
【００７９】
図１１は、本実施の形態の音声合成装置の構成を示すブロック図である。テキスト解析部３１,解析辞書メモリ３２,解析記憶部３３,音韻処理部３４,音声合成部３５,指示入力部３６および位置特定部３７は、上記第１実施の形態におけるテキスト解析部１,解析辞書メモリ２,解析記憶部３,音韻処理部４,音声合成部５,指示入力部６および位置特定部７と同じ構成を有して、同様に動作する。
【００８０】
文字文記憶部３９は、１文字とそれに対応する文とを対にして記憶しているメモリであり、例えば和文通話に用いる文等を記憶する。尚、上記「和文通話」は、「あ」という文字に対して「朝日のあ」、「い」という文字に対して「いろはのい」のような文を対応させ、正確に音声を伝達する目的で使用されるものである。
【００８１】
文字文変換部４０は、制御部３８からの指示に従って、ユーザが聞き取り難かったと判断した個所(位置特定部３７で特定された再発声開始位置に基づいて制御部３８によって判断される)の解析結果における読み系列情報を、文字文記憶部３９に記憶されている対応する文に変換する。例えば、「すずき」という合成音声をユーザが聞き取り難かったと判断した場合は、「すずめのす、すずめのすに濁点、切手のき」という上記和文通話で用いる文に変換する。そして、この変換文のテキスト解析結果を生成して音韻処理部３４に送出するのである。
【００８２】
以後、上記音韻処理部３４によって、文字文変換部４０で生成されたテキスト解析結果に基づいて音声合成情報列を生成し、音声合成部３５によって、上記生成された音声合成情報列に基づいて合成音声が出力される。
【００８３】
以上のごとく、本実施の形態によれば、指示入力部３６からの指示に従って、例えば、合成音声「すずき」をユーザが聞き取り難かったと判断した場合は、制御部３８及び文字文変換部４０によって、読み系列情報「すずき」を上記和文通話で用いる文に変換し、この変換文のテキスト解析結果を生成するようにしている。したがって、読み系列情報「すずき」を再発声する際には、「すずめのす、すずめのすに濁点、切手のき」と再発声することができ、ユーザの聞き取り難い個所を更に明瞭にすることができるのである。
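The character-to-sentence conversion of the fourth embodiment replaces each character of a reading with its stored spelling phrase, so that "すずき" becomes "すずめのす、すずめのすに濁点、切手のき". A minimal sketch follows. The table holds only the three entries named in the text; the function name and the fallback for characters missing from the table are assumptions.

```python
# Character-to-sentence table (the three entries given in the text).
SPELLING_TABLE = {
    "す": "すずめのす",
    "ず": "すずめのすに濁点",
    "き": "切手のき",
}


def spell_out(reading: str) -> str:
    """Convert a reading sequence into its spelling phrases.

    Characters missing from the table are passed through unchanged
    (an assumption; the patent does not describe this case).
    """
    return "、".join(SPELLING_TABLE.get(ch, ch) for ch in reading)


spelled = spell_out("すずき")
```

The resulting string would then go through text analysis and phoneme processing like any other input sentence.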
【００８４】
＜第５実施の形態＞
上記第１実施の形態においては、文節間結合度とモーラ数とに基づいて再発声の開始位置を特定している。また、第２実施の形態においては、解析結果尤度に基づいて再発声の開始位置を特定している。本実施の形態においては、ユーザの再発声要求意図を反映した位置を再発声開始位置として設定するものである。
【００８５】
図１２は、本実施の形態の音声合成装置の構成を示すブロック図である。テキスト解析部４１,解析辞書メモリ４２,解析記憶部４３,音韻処理部４５,音声合成部４６,指示入力部４７および音声合成情報変換部５１は、上記第３実施の形態におけるテキスト解析部２１,解析辞書メモリ２２,解析記憶部２３,音韻処理部２４,音声合成部２５,指示入力部２６および音声合成情報変換部２９と同じ構成を有して、同様に動作する。また、結果選択部４４は、上記第２実施の形態における結果選択手段１８と同じ構成を有して、同様に動作する。
【００８６】
位置特定部４８は、開始位置候補選出手段４９を有して、ユーザが再発声を要求した意図を反映させて再発声の開始位置を特定する。以下、位置特定部４８の機能について説明する。
【００８７】
上記テキスト解析部４１に対する入力テキストの何れの個所から再発声を行うかを決めるためには、(１)聞き直したい要因、(２)聞き直したい項目、(３)再発声する個所としての妥当性を考慮する必要がある。さらに、聞き直したい要因としては、(ａ)読み間違い、(ｂ)音韻が不明瞭、(ｃ)出現頻度が低い単語の存在等が上げられる。また、聞き直したい項目としては、５Ｗ１Ｈのような文の意味を決定付ける単語がある。また、再発声する個所としての妥当性の尺度としては、(Ａ)指示があった時点で音声出力されている文節と各文節の間のモーラ数、(Ｂ)文節間の結合度合の強さが上げられる。
【００８８】
上記位置特定部４８は、これらを総合的に判断して、再発声する文節の順序付けを行うのである。その場合の順序付けは、各文節毎に以下の関数値を求めることによって行う。
ｆ(ｐ1,ｐ2,ｐ3,ｐ4,ｐ5,ｐ6)
ここで、ｐ1：解析結果パラメータ
ｐ2：音韻明瞭性パラメータ
ｐ3：単語出現確率パラメータ
ｐ4：単語重要度パラメータ
ｐ5：文節間結合度パラメータ
ｐ6：モーラ数パラメータ
【００８９】
複数の解析結果が存在する場合、それらの解析結果の尤度が小さければ小さいほど解析を誤っている可能性が高いと言える。そこで、解析結果パラメータｐ1の一例として、上記解析結果尤度が第１位の候補と第２位の候補との解析結果尤度差の逆数が考えられる。
【００９０】
また、不明瞭な音韻ほど聞き取れない可能性が高い。そこで、関数ｆのパラメータとして音韻明瞭性パラメータｐ2を用いるのである。すなわち、各音韻毎に明瞭性の度合を予め数値化(明瞭度)しておくのである。その場合、明瞭性の低い音韻の明瞭度の値を大きくするようにしておく。したがって、無声摩擦音等は高い明瞭度が与えられることになる。
【００９１】
また、普段あまり使われない単語ほど聞き取れない傾向が強い。そこで、関数ｆのパラメータとして単語出現確率パラメータｐ3を用いるのである。また、聞き取れない場合に文の意味自体が全く分らなくなるような重要な単語ほど聞き直される可能性が高い。そこで、関数ｆのパラメータとして単語重要度パラメータｐ4を用いるのである。尚、後述する「誰」「どこ」「いつ」等の５Ｗ１Ｈに関連する単語は重要度が大きく設定される。
【００９２】
上記文節間結合度パラメータｐ5は、既に第１実施の形態において述べたように、言語的に適切な文節の区切れ目から再発声を開始させるためのものである。文節間結合度が大きいほど文節間の結合度が弱く、再発声させる箇所としての適切さが増す。モーラ数パラメータｐ6についても上記第１実施の形態において述べた通りである。
【００９３】
更に、上記指示入力部４７から再発声の指示が入力された際に音声出力されている文字位置は、発声開始時間と指示発声時間との差と、指示発声時間と発声終了予定時間との差の比率から、おおよそ推定できる。
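Paragraph 0093 notes that the character position being spoken when the instruction arrives can be roughly estimated from the ratio of elapsed time to remaining time. The sketch below implements that proportional estimate. It assumes a constant speaking rate over the sentence, and the function and parameter names are hypothetical.

```python
def estimate_char_index(n_chars: int, t_start: float,
                        t_instr: float, t_end: float) -> int:
    """Estimate the 0-based character index being spoken at `t_instr`.

    t_start: time the utterance started
    t_instr: time the re-utterance instruction was received
    t_end:   scheduled end time of the utterance
    """
    elapsed = t_instr - t_start
    remaining = t_end - t_instr
    total = elapsed + remaining
    if total <= 0 or elapsed < 0 or remaining < 0:
        raise ValueError("instruction time must lie within the utterance")
    fraction = elapsed / total
    # Clamp so an instruction at the very end still maps to the last character.
    return min(n_chars - 1, int(n_chars * fraction))


index = estimate_char_index(n_chars=20, t_start=0.0, t_instr=3.0, t_end=4.0)
```

In this call three quarters of the utterance time has passed, so the estimate lands three quarters of the way through the 20 characters.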
【００９４】
上記関数ｆの決定方法は様々な方法が考えられるが、以下、観測データに基づく決定方法について述べる。尚、関数ｆの計算方法は，説明を簡単にするため、各パラメータ毎に重み係数を掛けてそれらの値を加算する方法とする。上記観測データは、図１３に示すような、解析記憶部４３に記憶されている情報と、指示入力部４７に再発声の指示が入力された際に合成音声が出力されていた文字位置と、利用者が期待した読み開始位置と読み方とを記したもので構成される。これらの観測データを大量に用意し、ユーザが期待した解の確率が最大になるように重み係数を推定するのである。尚、上記重み係数の推定は、重回帰分析等の多変量解析手法を用いるのが一般的である。
【００９５】
観測データの中に、解析結果が複数存在して各々の読み方が違うという指示が多ければ、解析結果パラメータｐ1の値が大きい場合には、読み誤りの可能性が高くなるように重みが学習されることになる。したがって、上記解析結果尤度が接近した次候補が存在する場合には、読み誤りと判断して、読みの次候補を再発声する確率が高くなるのである。
【００９６】
上記位置特定部４８の開始位置候補選出手段４９には、上述のようにして得られた重み係数を有して再発声文節の順序付けを行う関数ｆ(ｐ1,ｐ2,ｐ3,ｐ4,ｐ5,ｐ6)を搭載しておく。そして、指示入力部４７に再発声指示が入力された場合に、制御部５０からの指示に従って、解析記憶部４３に記憶されている各文節の解析結果尤度Ｋ,音韻明瞭度Ｃ,単語出現頻度Ｆ,単語重要度Ｓ,文節間結合度Ｂ及びモーラ数Ｍの値を代入して、各文節の関数ｆの値を算出する。こうして、再発声の開始位置候補が選出される。そして、位置特定部４８は、関数ｆの値が最大である文節を再発声開始位置であると特定して、特定位置を制御部５０に返すのである。
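The ordering function f(p1, ..., p6) is described as a weighted sum of six parameters, with the analysis-result parameter p1 given as the reciprocal of the likelihood gap between the first and second candidates, and the phrase with the largest value chosen as the restart position. A minimal sketch follows. The weights, the parameter values, and the phrase names are invented for illustration; in the patent the weights are estimated from observation data.

```python
from typing import Dict, Sequence


def analysis_parameter(k_first: float, k_second: float, eps: float = 1e-6) -> float:
    """p1: reciprocal of the likelihood gap between the top two candidates."""
    return 1.0 / max(k_first - k_second, eps)


def phrase_score(params: Sequence[float], weights: Sequence[float]) -> float:
    """f(p1..p6) as a weighted sum (the simplified form used in the text)."""
    return sum(w * p for w, p in zip(weights, params))


# Assumed weights for (p1 analysis, p2 clarity, p3 word probability,
# p4 importance, p5 coupling, p6 mora count).
WEIGHTS = (1.0, 1.0, 1.0, 1.0, 1.0, 1.0)

# Hypothetical per-phrase parameter values.
phrases: Dict[str, Sequence[float]] = {
    "phrase-A": (0.5, 0.2, 0.1, 0.3, 0.7, 0.8),
    "phrase-B": (analysis_parameter(0.9, 0.4), 0.2, 0.1, 0.3, 0.3, 0.7),
}

restart = max(phrases, key=lambda name: phrase_score(phrases[name], WEIGHTS))
```

Because the two candidates of "phrase-B" have close likelihoods, its p1 term is large and it is chosen, which mirrors the text's point that a near-tie suggests a misreading.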
【００９７】
そうすると、上記制御部５０は、上記再発声開始位置から所定数の文節に関して、結果選択部４４に対して、上記第２実施の形態の場合と同様にして、解析記憶部４３に記憶されている複数の解析候補の中から、解析結果尤度の合計値が現在合成音声出力中の解析候補の組み合わせの次に大きな解析候補の組み合わせを選択することを指示する。さらに、音声合成情報変換部５１に対して、上記第３実施の形態の場合と同様にして、音声合成情報列の変換を指示するのである。その結果、上記関数ｆの値が最大である再発声開始文節以降の所定文節に関して、最も大きな要因となっているパラメータが解析結果パラメータｐ1である場合には、結果選択部４４によって読み誤りが修正されることになる。さらに、音韻明瞭性パラメータｐ2あるいは単語出現確率パラメータｐ3である場合には、音声合成情報変換部５１によって、発声パワーを少し大きくしたり、発声速度を少し遅めにしたりして再発声されることになる。
【００９８】
このように、本実施の形態においては、解析記憶部４３には、各文節毎に解析結果尤度Ｋ,音韻明瞭度Ｃ,単語出現頻度Ｆ,単語重要度Ｓ,文節間結合度Ｂおよびモーラ数Ｍを格納するようにしている。また、位置特定部４８には、再発声文節の順序付けを行う関数ｆ(ｐ1,ｐ2,ｐ3,ｐ4,ｐ5,ｐ6)を搭載した開始位置候補選出手段４９を設けて、再発声指示があると、解析記憶部４３に記憶されている各文節の解析結果尤度Ｋ,音韻明瞭度Ｃ,単語出現頻度Ｆ,単語重要度Ｓ,文節間結合度Ｂおよびモーラ数Ｍの値を用いて各文節の関数ｆの値を演算し、演算結果に基づいて再発声開始位置を特定するようにしている。そして、制御部５０は、上記特定された再発声開始位置に関して、結果選択部４４に対する次解析候補の組み合わせの選択、および、音声合成情報変換部５１に対する音声合成情報列の変換を指示するようにしている。
【００９９】
したがって、本実施の形態によれば、「読み方が違う」,「区切り方が違う」,「聞きとり難い」等のユーザの再発声要求意図を反映して自動的に再発声位置を特定することができるのである。
【０１００】
＜第６実施の形態＞
上記第５実施の形態によれば、「読み方が違う」,「区切り方が違う」,「聞きとり難い」等をユーザーが明示的に指示しなくても、その何れであるかを音声合成装置が自動的に判断して再発声開始位置を設定することができる。しかしながら、これらの情報を明示的に指示することで、より確実に再発声位置と再発声方法とを判断することができる。本実施の形態は、ユーザの再発声要求意図を明示的に指示することに関するものである。
【０１０１】
図１４は、本実施の形態の音声合成装置における概略構成を示すブロック図である。テキスト解析部６１,解析記憶部６３,音韻処理部６４および音声合成部６５は、上記第１実施の形態におけるテキスト解析部１,解析記憶部３,音韻処理部４および音声合成部５と同じ構成を有して、同様に動作する。また、位置特定部６８は、上記第５実施の形態における位置特定部４８と同じ構成を有して、同様に動作する。
【０１０２】
音声認識部６６は、マイクを備えて、入力された音声を、音声認識辞書メモリ６７に格納された音声認識辞書を用いて認識する。音声認識辞書メモリ６７は、音声認識する語彙を音声認識辞書として記憶している。ここで、音声認識辞書メモリ６７には、「えっ」,「何」,「もう一度」等の再発声を促す語彙を登録することができる。また、「誰」,「いつ」,「どこで」等のユーザが知りたい情報が何かを問いかける語彙も登録することができる。
【０１０３】
また、解析辞書メモリ６２には、各単語毎に、この単語が音声認識辞書メモリ６７の登録語彙に関連するか否かの情報が付与されている。例えば、単語「鈴木」は、人名であり、登録語彙「誰」と関連する。あるいは、単語「６時」は、時間であり、登録語彙「いつ」と関連する等である。したがって、テキスト解析部６１は上述のような関連単語情報をも解析結果として得ることができ、解析記憶部６３に記憶されることになる。
【０１０４】
位置特定部６８は、再発声文節の順序付けを行う関数ｆ(ｐ1,ｐ2,ｐ3,ｐ4,ｐ5,ｐ6)を搭載した開始位置候補選出手段６９を有している。そして、制御部７０から再発声位置の特定が指示された場合には、解析記憶部６３に格納された各文節の解析結果尤度Ｋ,音韻明瞭度Ｃ,単語出現頻度Ｆ,単語重要度Ｓ,文節間結合度Ｂおよびモーラ数Ｍの値を用いて、関数ｆの値を求める。そして、最も大きな関数ｆの値を呈する文節を再発声位置と特定するのである。
【０１０５】
その場合、上記音声認識部６６によって、音声認識辞書メモリ６７の登録語彙が認識された場合には、解析記憶部６３に記憶されている解析結果の一つである上記関連単語情報を参照して上記認識語彙に関連する関連単語を探し出す。そして、上記関連単語に関する文節の関数ｆ(ｐ1,ｐ2,ｐ3,ｐ4,ｐ5,ｐ6)を算出する際に、単語重要度パラメータｐ4の値を通常よりも大きくする(例えば、単語重要度パラメータｐ4の重み係数や関連単語の単語重要度Ｓの値を整数倍する)のである。こうすることによって、上記認識語彙に関連する単語が入力テキスト文中にあれば、その関連単語を含む文節の関数ｆの値が大きくなり、当該文節から再発声が開始されることになるのである。
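In the sixth embodiment, a recognized question word such as "誰" (who) or "いつ" (when) raises the word-importance parameter p4 of phrases whose key word is related to it, so that re-utterance starts from that phrase. The sketch below illustrates the boost. The relation table, the romanized words, the integer factor of 3, and the way the remaining parameters are collapsed into a single `other_score` are all assumptions.

```python
from typing import List, Optional, Tuple

# Hypothetical relation info from the analysis dictionary:
# word -> related registered vocabulary item.
RELATED_VOCAB = {"suzuki": "dare", "6-ji": "itsu", "kaisha": "doko"}


def importance(word: str, base: float, recognized: Optional[str],
               factor: int = 3) -> float:
    """Multiply the word importance when the word relates to the recognized item."""
    if recognized is not None and RELATED_VOCAB.get(word) == recognized:
        return base * factor
    return base


def pick_phrase(phrases: List[Tuple[str, str, float, float]],
                recognized: Optional[str]) -> str:
    """phrases: (phrase, key word, base importance, score of the other parameters)."""
    return max(
        phrases,
        key=lambda p: p[3] + importance(p[1], p[2], recognized),
    )[0]


sentence = [
    ("suzuki-san-kara", "suzuki", 1.0, 0.5),
    ("6-ji-ni", "6-ji", 1.0, 1.0),
    ("kimasu", "kuru", 0.5, 0.8),
]

without_cue = pick_phrase(sentence, recognized=None)
with_who = pick_phrase(sentence, recognized="dare")
```

Without a cue the phrase with the best overall score is selected; when "dare" is recognized, the boosted importance moves the restart position to the phrase containing the person's name.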
【０１０６】
また、その場合、上記解析記憶部６３には、入力されたテキスト文の全文の解析結果を残しておくようにすることによって、現在合成音声出力中のテキスト文に限定されることなく、以前に出力されたテキスト文に遡って再発声することが可能になる。さらに、第１実施の形態における位置特定部７による再発声の開始位置を特定する場合と同様の方法で、文節間結合度Ｂとモーラ数Ｍとの値を用いて終了位置を特定し、開始位置の特定と併用することによって、上記マイクからの「誰」という問いかけに対して、「鈴木さんから」と言うように、人名に関する文節だけを再発声することが実現可能になる。
【０１０７】
同様に、上記音声認識部６６によって、「違う」等の読み変えを促す語彙が認識された場合には、関数ｆ(ｐ1,ｐ2,ｐ3,ｐ4,ｐ5,ｐ6)を算出する際に、解析結果パラメータｐ1の値を通常よりも大きくするのである。こうすることによって、上記第５実施の形態における結果選択部４４と同様の結果選択部を併用して、解析結果が複数ある文節において別の読み方を有する候補を選択して再発声するようにできる。
【０１０８】
上述のように、本実施の形態においては、上記位置特定部６８に、上記第５実施の形態における位置特定部４８と同様の機能を持たせている。そして、解析記憶部６３に格納された各文節の解析結果尤度Ｋ,音韻明瞭度Ｃ,単語出現頻度Ｆ,単語重要度Ｓ,文節間結合度Ｂおよびモーラ数Ｍの値を用いて、関数ｆ(ｐ1,ｐ2,ｐ3,ｐ4,ｐ5,ｐ6)の値を求め、最も大きな関数ｆの値を呈する文節を再発声位置と特定する。
【０１０９】
その際に、上記音声認識部６６によって音声認識辞書メモリ６７の登録語彙が認識された場合には、上記認識語彙「いつ」に関連する関連単語「６時」を含む文節の関数ｆを算出する際に、単語重要度パラメータｐ4の値を通常よりも大きくするようにしている。したがって、ユーザは、「いつ」と音声指示するだけで、所望する文節「６時…」から自動的に再発声を開始できるのである。
【０１１０】
また、上記音声認識部６６によって、読み変えを促す語彙「違う」が認識された場合には、関数ｆを算出する際に、解析結果パラメータｐ1の値を通常よりも大きくするようにしている。したがって、ユーザは、「違う」と音声指示するだけで自動的に別の読み方の候補を選択して再発声できるのである。
【０１１１】
上記各実施の形態における音声合成装置は、携帯電話や電子手帳等の比較的文字情報の少ない携帯端末器に搭載することが有効である。すなわち、このような文字情報の少ない携帯端末器によって、比較的長い文面の電子メールの内容を知る場合には合成音声によって聞き取ることになる。ところが、テキスト音声合成の正解率や明瞭度を完全なものにするのは困難であり、誤解析や不明瞭な場合のリカバリー手段が必要なのである。
【０１１２】
上記各実施の形態における音声合成装置によれば、再発声開始位置や次候補の選択を自動的にできるため、非常に簡単な操作で再発声や発声変更を行うことができ、携帯端末器用の音声合成装置として非常に有効なのである。
【０１１３】
ところで、上記各実施の形態におけるテキスト解析部,音声合成部,指示入力部,位置特定部および制御部による上記解析手段,音声合成手段,指示入力手段,位置特定手段および制御手段としての機能は、プログラム記録媒体に記録された音声合成処理プログラムによって実現される。上記各実施の形態における上記プログラム記録媒体は、ＲＯＭ(リード・オンリ・メモリ)でなるプログラムメディアである。あるいは、外部補助記憶装置に装着されて読み出されるプログラムメディアであってもよい。尚、何れの場合においても、上記プログラムメディアから音声合成処理プログラムを読み出すプログラム読み出し手段は、上記プログラムメディアに直接アクセスして読み出す構成を有していてもよいし、ＲＡＭ(ランダム・アクセス・メモリ)に設けられたプログラム記憶エリア(図示せず)にダウンロードし、上記プログラム記憶エリアに対してアクセスして読み出す構成を有していてもよい。尚、上記プログラムメディアからＲＡＭの上記プログラム記憶エリアにダウンロードするためのダウンロードプログラムは、予め本体装置に格納されているものとする。
【０１１４】
ここで、上記プログラムメディアとは、本体側と分離可能に構成され、磁気テープやカセットテープ等のテープ系、フロッピーディスク,ハードディスク等の磁気ディスクやＣＤ(コンパクトディスク)−ＲＯＭ,ＭＯ(光磁気)ディスク,ＭＤ(ミニディスク),ＤＶＤ(ディジタルビデオディスク)等の光ディスクのディスク系、ＩＣ(集積回路)カードや光カード等のカード系、マスクＲＯＭ,ＥＰＲＯＭ（紫外線消去型ＲＯＭ),ＥＥＰＲＯＭ(電気的消去型ＲＯＭ),フラッシュＲＯＭ等の半導体メモリ系を含めた、固定的にプログラムを坦持する媒体である。
【０１１５】
また、上記各実施の形態における音声合成装置は、インターネットを含む通信ネットワークと接続可能な構成を有して、上記プログラムメディアは、通信ネットワークからのダウンロード等によって流動的にプログラムを坦持する媒体であっても差し支えない。尚、その場合における上記通信ネットワークからダウンロードするためのダウンロードプログラムは、予め本体装置に格納されているものとする。あるいは、別の記録媒体からインストールされるものとする。
【０１１６】
尚、上記記録媒体に記録されるものはプログラムのみに限定されるものではなく、データも記録することが可能である。
【０１１７】
【発明の効果】
以上より明らかなように、第１の発明の音声合成装置は、指示入力手段から入力された再発声を促すための指示を受けて、位置特定手段によって、解析手段による解析結果に基づいて文字列情報中における再発声の開始位置を特定し、制御手段によって、音声合成手段に対して上記特定された開始位置からの音声合成を指示するので、ユーザは、出力合成音声が聞き取れない場合に、その時点で指示入力手段から指示を行うだけで、出力合成音声文中の特定位置から再発声を聞くことができ、長文の後半のみが聞き取れなかった場合でも少ない時間で聞き直しができる。
【０１１８】
さらに、上記解析結果を文節間結合度が含まれるように成し、上記位置特定手段を、上記文字列情報中における上記指示入力手段から指示を受けた時点に対応する位置より前の文節から、モーラ数と上記文節間結合度とに基づいて上記開始位置となる文節を特定するようにしたので、指示を受けた時点に近過ぎず且つ遠過ぎない文節であって、言語的に結合度合が弱く区切れ易い適切な文節を上記開始位置として選ぶことができる。
【０１１９】
また、第２の発明の音声合成装置は、上記解析結果を尤度が含まれるように成し、上記位置特定手段を、上記文字列情報中における上記指示入力手段から指示を受けた時点に対応する位置以前の文節に関して、各解析結果の候補を入れ換えながら尤度の合計値を算出し、得られた尤度の合計値に基づいて上記再発声の開始位置を特定するようにすれば、入力された文字列情報「市場開放によってもたらされる」に基づいて合成音声「しじょーかいほうによっても、たらされる」が発声された場合に、上記尤度の合計値を次に高くする第２候補「よって」を選択して、再発声の開始位置であると特定することができる。したがって、その場合には、文節「よって」の位置から合成音声「よって、もたらされる」を再発声することができる。
【０１２０】
また、上記第１の発明の音声合成装置は、上記解析結果を尤度が含まれるように成すと共に、結果選択手段によって、上記文字列情報中における上記指示入力手段から指示を受けた時点に対応する位置以前の文節に関して、各解析結果の候補を入れ換えながら尤度の合計値を算出し、得られた尤度の合計値に基づいて再発声を行う際の解析結果を選択すれば、入力された文字列情報「市場開放によってもたらされる」に基づいて合成音声「いちば」を発声した場合に、発声の異なる第２候補「しじょー」を選択して、「しじょーかいほうによってもたらされる」の合成音声を再発声することができる。
【０１２１】
また、上記第２の発明の音声合成装置は、上記解析結果を上記尤度に加えて音韻明瞭度,単語出現確率及び単語重要度の少なくとも一つが含まれるように成し、上記位置特定手段を、上記解析結果の尤度に加えて上記音韻明瞭度,単語出現確率および単語重要度の少なくとも一つを用いて選出した上記再発声の開始位置候補に基づいて上記再発声の開始位置を特定するように成せば、解析結果尤度に基づいて上記利用者による再発声指示の意図「読み誤り」が反映された位置を、再発声の開始位置候補として選出することができる。さらに、音韻明瞭度または単語出現確率に基づいて上記意図「聞き取れない」が反映された位置を、または、単語重要度に基づいて上記意図「重要個所の確認」が反映された位置を、再発声の開始位置候補として選出することができる。
【０１２２】
また、この発明の音声合成装置は、上記音声合成手段の音声合成情報生成手段によって、上記解析手段による解析結果に基づいて音声合成情報列を生成し、音声合成情報変換手段によって、上記音声合成情報生成手段で生成された音声合成情報列のうち上記再発声の開始位置以降における所定の音声合成情報列を他の音声合成情報列に変換すれば、聞き取れなかった個所の発声を、「発声速度を少し遅め」に、または「発声ピッチを少し高め」に、または「発声パワーを少し大きめ」に、または「音韻長を少し長め」に変更して再発声を行うことができる。
【０１２３】
また、この発明の音声合成装置は、上記解析結果を読み系列情報が含まれるように成すと共に、文字文変換手段によって、上記再発声の開始位置以降における所定の解析結果の読み系列を文字文記憶手段に記憶された対応文の解析結果列に変換すれば、例えば、上記文字文記憶手段に文字「す」,「ず」及び「き」と対応文「すずめのす」,「すずめのすに濁点」および「切手のき」とを記憶しておけば、上記再発声の開始位置以降における所定の文「鈴木」の読み系列「すずき」を、対応文の列「すずめのす、すずめのすに濁点、切手のき」に変換して再発声できる。
【０１２４】
また、この発明の音声合成装置は、上記指示入力手段を、音声入力された指示を認識する音声認識手段で構成すれば、ユーザの再発声要求意図をユーザの音声によって明示できる。
【０１２５】
また、この発明の音声合成装置は、上記音声認識手段によって音声認識辞書を用いて再発声を促す語彙を認識し、上記解析手段によって解析辞書を用いて上記再発声を促す語彙との関連性の有無を表わす関連情報を含む上記解析結果を生成し、上記位置特定手段によって上記解析手段からの関連情報に基づいて上記再発声の開始位置を特定するように成せば、上記音声認識辞書に「誰」,「いつ」,「どこで」等の再発声を促す語彙を登録しておくことによって、ユーザが知りたい情報が何かを問いかけるこれらの語彙が音声認識された場合には、これらの語彙との関連性が高い「鈴木さん」,「昨日」,「会社で」等の人名や時や場所を表す単語から再発声を行うことができる。
【０１２６】
また、第３の発明の音声合成方法は、再発声を促すための指示を行うステップと、入力された文字列情報の解析結果に基づいて上記文字列情報中における再発声の開始位置を特定するステップと、上記特定された開始位置から音声合成を行うステップを備えて、上記解析結果は文節間結合度を含み、上記開始位置を特定するステップでは、上記文字列情報中における上記指示入力手段から指示を受けた時点に対応する位置より前の文節から、モーラ数と上記文節間結合度とに基づいて上記開始位置となる文節を特定するようにしたので、ユーザは、出力合成音声が聞き取れない場合には、その時点で指示を行うだけで、出力合成音声文中の特定位置から再発声を聞くことができる。したがって、長文の後半のみが聞き取れなかった場合でも少ない時間で聞き直しを行うことができる。さらに、指示を受けた時点に近過ぎず且つ遠過ぎない文節であって、言語的に結合度合が弱く区切れ易い適切な文節を上記開始位置として選ぶことができる。
【０１２７】
また、第４の発明の携帯端末器は、この発明の音声合成装置を備えたので、文字情報の少ない携帯端末器によって、比較的長い文面の電子メールの内容を音声合成出力によって知る場合に、再発声個所や次候補の選択を自動的に行うことができる。したがって、非常に簡単な操作で的確な再発声や発声変更を行うことができる。
【０１２８】
また、第５の発明のプログラム記録媒体は、コンピュータを、上記第１の発明における解析手段,音声合成手段,指示入力手段,位置特定手段および制御手段として機能させる音声合成処理プログラムを記録しているので、上記第１の発明の場合と同様に、ユーザは、出力合成音声が聞き取れない場合にはその時点で指示を行うだけで、出力合成音声文中の特定位置から再発声を聞くことができる。したがって、長文の後半のみが聞き取れなかった場合でも少ない時間で聞き直しを行うことができる。さらに、指示を受けた時点に近過ぎず且つ遠過ぎない文節であって、言語的に結合度合が弱く区切れ易い適切な文節を上記開始位置として選ぶことができる。
【図面の簡単な説明】
【図１】この発明の音声合成装置におけるブロック図である。
【図２】図１に示す音声合成装置によって実行される音声合成処理動作のフローチャートである。
【図３】図１における解析記憶部の格納状態を示す概念図である。
【図４】「再発声」の指示を行った場合における合成音声出力例を示す図である。
【図５】モーラ数と評価関数との関係を示す図である。
【図６】図１とは異なる音声合成装置のブロック図である。
【図７】図６に示す音声合成装置によって実行される音声合成処理動作のフローチャートである。
【図８】図６における解析記憶部の格納状態を示す概念図である。
【図９】「発声変更」の指示を行った場合における合成音声出力例を示す図である。
【図１０】図１および図６とは異なる音声合成装置のブロック図である。
【図１１】図１,図６および図１０とは異なる音声合成装置のブロック図である。
【図１２】図１,図６,図１０および図１１とは異なる音声合成装置のブロック図である。
【図１３】図１２における解析記憶部の格納状態を示す概念図である。
【図１４】図１,図６,図１０〜図１２とは異なる音声合成装置のブロック図である。
【符号の説明】
１,１１,２１,３１,４１,６１…テキスト解析部、
２,１２,２２,３２,４２,６２…解析辞書メモリ、
３,１３,２３,３３,４３,６３…解析記憶部、
４,１５,２４,３４,４５,６４…音韻処理部、
５,１６,２５,３５,４６,６５…音声合成部、
６,１７,２６,３６,４７…指示入力部、
７,１４,２７,３７,４８,６８…位置特定部、
８,１９,２８,３８,５０,７０…制御部、
１８…結果選択手段、
２９,５１…音声合成情報変換部、
３９…文字文記憶部、
４０…文字文変換部、
４４…結果選択部、
４９,６９…開始位置候補選出手段、
６６…音声認識部、
６７…音声認識辞書メモリ。
[0001]
[Technical Field of the Invention]
  The present invention relates to a speech synthesizer and a speech synthesis method for synthesizing speech from character information, a portable terminal device, and a program recording medium.
[0002]
[Prior art]
  Conventionally, a rule-based speech synthesizer such as the one disclosed in Japanese Patent Laid-Open No. 4-160630 has been known as a device for dealing with misreadings of Japanese sentences caused by analysis errors during Japanese-language analysis. This rule-based speech synthesizer holds a plurality of Japanese analysis results for an input character string in a reading-sentence buffer together with their confirmation ranks, and makes use of these analysis results. When the synthesized speech based on the analysis result of a given rank contains a recognition error (a misreading of the input character string) and the user says "wrong" or the like, the synthesizer can generate synthesized speech based on the analysis result ranked one lower, so that the misread portion of the input character string is synthesized with a different reading. In this way, a misreading can be corrected interactively without complicated effort or work.
[0003]
[Problems to be solved by the invention]
  However, the conventional rule-based speech synthesizer has the following problem. Misreadings are corrected in units of the Japanese analysis results held in the reading-sentence buffer. It is therefore not possible to identify the misread portion of the input sentence (the sentence being read aloud), go back to that portion, and re-utter the sentence from the middle. Consequently, even when re-utterance is requested at the very end of a long sentence, the sentence is uttered again from its beginning, and it takes time until the desired synthesized speech is obtained.
[0004]
  In addition, although the rule-based speech synthesizer can cope with a misreading of the input character string (the sentence being read aloud), it cannot cope with a request to listen again when the user failed to catch the speech. In particular, when the user failed to catch only the latter half of a sentence, the user presumably wants to hear again only that latter half, not the sentence from its beginning. In that case, the synthesizer has no way at all of re-uttering only the part after a specific position in the sentence.
[0005]
  Accordingly, an object of the present invention is to provide a speech synthesizer capable of re-uttering from a specific position in an output sentence.
[0006]
[Means for Solving the Problems]
  To achieve the above object, the first invention is a speech synthesizer that analyzes input character string information with analysis means and synthesizes and outputs speech with speech synthesis means based on the analysis result, the speech synthesizer comprising: instruction input means for inputting an instruction prompting re-utterance; position specifying means for specifying, upon receiving an instruction from the instruction input means, a start position of the re-utterance in the character string information based on the analysis result of the analysis means; and control means for instructing the speech synthesis means to synthesize speech from the specified start position, wherein the analysis result includes an inter-phrase coupling degree, and the position specifying means specifies the phrase serving as the start position, from among the phrases preceding the position corresponding to the time at which the instruction was received from the instruction input means, based on the number of morae and the inter-phrase coupling degree.
[0007]
  According to the above configuration, when an instruction prompting re-utterance of the synthesized speech is input from the instruction input means, the position specifying means specifies the start position of the re-utterance based on the analysis result by the analysis means. Therefore, when the user cannot catch the output synthesized speech, the user only has to give an instruction from the instruction input means at that moment to hear the re-utterance from a specific position in the output sentence, and can listen again in a short time even when only the latter half of a long sentence was missed.
[0008]
  Further, the number of morae is used as an index for specifying the start position of the re-utterance, so that a phrase that is neither too close to nor too far from the time point at which the instruction was received is selected as the start position. In addition, the inter-phrase coupling degree is used as an index, so that a phrase at a boundary where the linguistic coupling is weak and the sentence divides easily is selected as the start position of the re-utterance.
[0009]
  The second invention is a speech synthesizer that analyzes input character string information by analysis means and synthesizes and outputs speech by speech synthesis means based on the analysis result, the speech synthesizer comprising: instruction input means for inputting an instruction prompting re-utterance; position specifying means for specifying, in response to the instruction from the instruction input means, a start position of the re-utterance in the character string information based on the analysis result by the analysis means; and control means for instructing the speech synthesis means to synthesize speech from the specified start position, wherein the analysis result includes a likelihood, and the position specifying means calculates a total likelihood value while replacing the analysis result candidates of the phrases preceding the position in the character string information that corresponds to the time point at which the instruction was received from the instruction input means, and specifies the start position of the re-utterance based on the obtained total likelihood value.
[0010]
  According to the above configuration, the start position of the re-utterance is specified based on the total likelihood value calculated while replacing the analysis result candidates. Therefore, when the synthesized speech “shijo kaiho ni yottemo tarasareru”, with a misplaced phrase break, is uttered based on the input character string information “brought about by market opening”, the second candidate “ni yotte”, which gives the next-highest total likelihood value, is selected and identified as the start position of the re-utterance. As a result, the synthesized speech “ni yotte motarasareru” is re-uttered.
[0011]
  In the speech synthesizer of the first invention, it is desirable that the analysis result includes a likelihood, and that the speech synthesizer includes result selection means for calculating a total likelihood value while replacing the analysis result candidates of the phrases preceding the position in the character string information that corresponds to the time point at which the instruction was received from the instruction input means, and for selecting, based on the obtained total likelihood value, the analysis result to be used for re-utterance.
[0012]
  According to the above configuration, the analysis result used for re-utterance is selected based on the total likelihood value calculated while replacing the analysis result candidates. Therefore, when the synthesized speech “ichiba” is uttered based on the input character string information “brought about by market opening”, the second candidate “shijo”, which has a different reading, is selected, and the synthesized speech beginning “shijo kaiho” is re-uttered.
[0013]
  Further, in the speech synthesizer of the second invention, it is desirable that the analysis result includes, in addition to the likelihood, at least one of phoneme intelligibility, word appearance probability, and word importance, and that the position specifying means specifies the start position of the re-utterance based on re-utterance start position candidates selected using, in addition to the likelihood of the analysis result, at least one of the phoneme intelligibility, word appearance probability, and word importance.
[0014]
  According to the above configuration, a position reflecting the user's intention of “reading error” behind the re-utterance instruction is selected as a re-utterance start position candidate based on the analysis result likelihood. Further, a position reflecting the intention of “could not hear” is selected as a start position candidate based on the phoneme intelligibility or the word appearance probability. Alternatively, a position reflecting the intention of “confirming an important part” is selected as a start position candidate based on the word importance.
[0015]
  It is desirable that the speech synthesizer according to the present invention further includes speech synthesis information generation means for generating a speech synthesis information sequence based on the analysis result of the analysis means, that the speech synthesis means synthesizes speech based on the speech synthesis information sequence, and that the speech synthesizer includes speech synthesis information conversion means for converting, in accordance with an instruction from the control means, a predetermined portion of the speech synthesis information sequence generated by the speech synthesis information generation means that follows the start position of the re-utterance into another speech synthesis information sequence.
[0016]
  According to the above configuration, the speech synthesis information conversion means converts the predetermined speech synthesis information sequence following the re-utterance start position into another speech synthesis information sequence. The part that the user could not catch can therefore be re-uttered with a changed utterance, for example with the utterance speed slightly slowed, the utterance pitch slightly raised, the utterance power slightly increased, or the phoneme lengths extended.
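  The conversion described above can be pictured with the following minimal Python sketch. It is illustrative only: the class `SynthesisInfo`, the function `convert_for_reutterance`, and the 10% step are assumptions introduced here, not names or values given in the specification, and the sketch applies all four changes at once although the text presents them as alternatives.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SynthesisInfo:
    """Hypothetical prosody settings for one stretch of speech (1.0 = normal)."""
    rate: float = 1.0            # utterance speed
    pitch: float = 1.0           # utterance pitch
    power: float = 1.0           # utterance power (loudness)
    phoneme_length: float = 1.0  # phoneme duration


def convert_for_reutterance(info: SynthesisInfo, step: float = 0.1) -> SynthesisInfo:
    """Return a copy that is slightly slower, higher, louder, and longer.

    The 10% step is an assumed value; the text only says "slightly".
    """
    return replace(
        info,
        rate=info.rate * (1.0 - step),
        pitch=info.pitch * (1.0 + step),
        power=info.power * (1.0 + step),
        phoneme_length=info.phoneme_length * (1.0 + step),
    )


normal = SynthesisInfo()
emphasized = convert_for_reutterance(normal)
```

  Only the portion of the sequence after the re-utterance start position would be passed through such a conversion; the original settings are left unchanged.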
[0017]
  Further, in the speech synthesizer of the present invention, it is desirable that the analysis result includes reading sequence information, and that the speech synthesizer includes character sentence storage means for storing characters and sentences corresponding to those characters, and character sentence conversion means for converting, in accordance with an instruction from the control means, the reading sequence of a predetermined analysis result following the start position of the re-utterance into the string of corresponding sentences stored in the character sentence storage means.
[0018]
  According to the above configuration, if the character sentence storage means stores the characters “su”, “zu”, and “ki” together with their corresponding sentences, for example “suzume no su” for “su”, a sentence indicating “su” with a dakuten (voicing mark) for “zu”, and a sentence based on the word for “stamp” for “ki”, the character sentence conversion means converts the reading sequence “suzuki” of the predetermined word “Suzuki” following the start position of the re-utterance into the string of those corresponding sentences, which is then re-uttered.
[0019]
  In the speech synthesizer according to the present invention, it is preferable that the instruction input means is constituted by a speech recognition means for recognizing an instruction inputted by speech.
[0020]
  According to the above configuration, the user's intention to request re-utterance is clearly indicated by the user's voice.
[0021]
  It is desirable that the speech synthesizer according to the present invention further includes a speech recognition dictionary storing vocabulary that prompts re-utterance and an analysis dictionary storing whether each word is related to that vocabulary, that the speech recognition means recognizes the vocabulary prompting re-utterance using the speech recognition dictionary, that the analysis means uses the analysis dictionary to obtain an analysis result including related information indicating the presence or absence of relevance to the vocabulary prompting re-utterance, and that the position specifying means specifies the start position of the re-utterance based on the related information from the analysis means.
[0022]
  According to the above configuration, by registering vocabulary that prompts re-utterance, such as “who”, “when”, and “where”, in the speech recognition dictionary, when one of these words, with which the user asks for the information he or she wants to know, is recognized, re-utterance starts from a word that is highly related to it and that represents a person's name, a time, or a place, such as “Mr. Suzuki”, “yesterday”, or “at work”.
[0023]
  The third invention is a speech synthesis method for analyzing input character string information and synthesizing and outputting speech based on the analysis result, the method comprising: a step of inputting an instruction prompting re-utterance; a step of specifying, in response to the input instruction, a start position of the re-utterance in the character string information based on the analysis result of the input character string information; and a step of performing speech synthesis from the specified start position, wherein the analysis result includes an inter-phrase coupling degree, and in the step of specifying the start position, the phrase serving as the start position is specified, from among the phrases preceding the position in the character string information that corresponds to the time point at which the instruction was received from the instruction input means, based on the number of morae and the inter-phrase coupling degree.
[0024]
  According to the above configuration, when an instruction prompting re-utterance of the synthesized speech is input, the start position of the re-utterance is specified based on the analysis result of the input character string information. Therefore, when the user cannot catch the output synthesized speech, the user only has to give an instruction at that moment to hear the re-utterance from a specific position in the output sentence, and can listen again in a short time even when only the latter half of a long sentence was missed.
[0025]
  Further, the number of morae is used as an index for specifying the start position of the re-utterance, so that a phrase that is neither too close to nor too far from the time point at which the instruction was received is selected as the start position. In addition, the inter-phrase coupling degree is used as an index, so that a phrase at a boundary where the linguistic coupling is weak and the sentence divides easily is selected as the start position of the re-utterance.
[0026]
  The portable terminal device of the fourth invention is characterized by comprising the speech synthesizer of the present invention.
[0027]
  According to the above configuration, when the content of a relatively long e-mail is checked through synthesized speech output on a portable terminal that can display little character information, the re-utterance position and the next candidate are selected automatically, so that accurate re-utterance and utterance changes can be performed with simple operations.
[0028]
  The program recording medium of the fifth invention is characterized by recording a speech synthesis processing program that causes a computer to function as: analysis means for analyzing input character string information; speech synthesis means for synthesizing speech based on an analysis result by the analysis means that includes an inter-phrase coupling degree; instruction input means for inputting an instruction prompting re-utterance; position specifying means for specifying, in response to the instruction from the instruction input means and based on the analysis result by the analysis means, the phrase serving as the start position of the re-utterance in the character string information, from among the phrases preceding the position corresponding to the time point at which the instruction was received from the instruction input means, based on the number of morae and the inter-phrase coupling degree; and control means for instructing the speech synthesis means to synthesize speech from the specified start position.
[0029]
  According to the above configuration, as in the case of the first invention, when the user cannot catch the output synthesized speech, the user only has to give an instruction at that moment to hear the re-utterance from a specific position in the output sentence, and can listen again in a short time even when only the latter half of a long sentence was missed.
[0030]
  Further, the number of morae is used as an index for specifying the start position of the re-utterance, so that a phrase that is neither too close to nor too far from the time point at which the instruction was received is selected as the start position. In addition, the inter-phrase coupling degree is used as an index, so that a phrase at a boundary where the linguistic coupling is weak and the sentence divides easily is selected as the start position of the re-utterance.
[0031]
DETAILED DESCRIPTION OF THE INVENTION
  Hereinafter, the present invention will be described in detail with reference to the illustrated embodiments.
<First embodiment>
  FIG. 1 is a block diagram showing the configuration of the speech synthesizer according to the present embodiment. The text analysis unit 1 performs language analysis on the input text and temporarily stores the obtained text analysis result in the analysis storage unit 3. The analysis dictionary memory 2 stores language data, including an analysis dictionary, that the text analysis unit 1 needs for text analysis. Based on the text analysis result stored in the analysis storage unit 3, the phoneme processing unit 4, which serves as the speech synthesis information generation means, generates speech synthesis information such as reading sequence information (a phoneme symbol string), accent information, and phrase break information. The speech synthesis unit 5 then synthesizes speech based on the speech synthesis information sequence generated by the phoneme processing unit 4 and outputs it from a speaker (not shown).
[0032]
  The instruction input unit 6 is used by the user to input an instruction prompting re-utterance of the synthesized speech output from the speech synthesis unit 5. The instruction input unit 6 consists of buttons or a keyboard. Based on the text analysis result stored in the analysis storage unit 3, the position specifying unit 7 specifies the position in the text at which re-utterance starts. Upon receiving an instruction from the instruction input unit 6, the control unit 8 causes the position specifying unit 7 to specify the re-utterance start position, instructs the phoneme processing unit 4 to read out the text analysis results from the specified position onward, and thereby starts the re-utterance.
[0033]
  FIG. 2 is a flowchart of the speech synthesis processing operation executed by the units of the speech synthesizer having the above configuration. The operation of the speech synthesizer is described below with reference to the flowchart of FIG. 2, taking as an example the case where the kanji-kana mixed sentence “An e-mail has come from Mr. Suzuki regarding the business negotiation” is input.
[0034]
  Here, the text input to the text analysis unit 1 may be input from a sentence input means such as a keyboard, a pen, or a voice recognition device, or may be input via a network such as mail or WWW. Thus, when a text is input to the text analysis unit 1, the speech synthesis processing operation starts.
[0035]
  In step S1, the text analysis unit 1 performs text analysis. That is, the input sentence is divided into morphemes with reference to the analysis dictionary stored in the analysis dictionary memory 2, and information such as the reading and part of speech is attached to each morpheme. The above input sentence is divided into the morphemes “Suzuki (suzuki: proper noun)”, “san (san: suffix)”, “kara (kara: particle)”, “business negotiation (shodan: noun)”, “no (no: particle)”, “matter (ken: noun)”, “de (de: particle)”, “mail (meeru: noun)”, “ga (ga: particle)”, “come (ki: verb)”, “te (te: verb)”, “i (i: verb)”, and “masu (masu: auxiliary verb)”.
[0036]
  In step S 2, the text analysis result is stored in the analysis storage unit 3 by the text analysis unit 1. FIG. 3 shows a conceptual diagram of the storage state in the analysis storage unit 3. In FIG. 3, numerical information as described later is stored in the columns of analysis result likelihood, accent information, and inter-phrase coupling degree. In the text analysis process in step S1, a plurality of analysis candidates are obtained. Then, when the likelihood is given to each analysis candidate, the analysis candidate having the highest likelihood ranking is stored in the analysis storage unit 3 as a text analysis result. In that case, the likelihood of each analysis candidate is also stored in the analysis storage unit 3.
[0037]
  Further, a clause is generated from the concatenation of the respective analysis results, and the phrase information is stored in the analysis storage unit 3. At the same time, an inter-phrase coupling degree indicating the degree of connectivity between the preceding clause and the subsequent clause is calculated, and accent information for each clause is calculated based on an accent rule or the like. The obtained inter-phrase coupling degree and accent information are also stored together.
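  One way to picture the contents of the analysis storage unit described above is the following Python sketch. It is a hypothetical data layout, not the format of FIG. 3: the names `AnalysisCandidate` and `PhraseEntry`, the integer accent field, and all numerical values are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AnalysisCandidate:
    words: List[str]    # constituent words of the phrase
    reading: str        # reading sequence
    likelihood: float   # analysis result likelihood (assumed scale)


@dataclass
class PhraseEntry:
    candidates: List[AnalysisCandidate]
    accent: int                       # accent information (assumed encoding)
    coupling_to_next: Optional[float] # inter-phrase coupling degree to the
                                      # following phrase; None for the last one

    def __post_init__(self) -> None:
        # Keep candidates in descending order of likelihood so that
        # index 0 is always the first (highest-ranked) candidate.
        self.candidates.sort(key=lambda c: c.likelihood, reverse=True)

    @property
    def best(self) -> AnalysisCandidate:
        return self.candidates[0]


# Illustrative entry: two readings of one word, with made-up likelihoods.
entry = PhraseEntry(
    candidates=[
        AnalysisCandidate(["market"], "shijo", 0.4),
        AnalysisCandidate(["market"], "ichiba", 0.6),
    ],
    accent=0,
    coupling_to_next=0.3,
)
```

  Storing every candidate with its likelihood, rather than only the winner, is what allows a later step to fall back to the next-ranked analysis.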
[0038]
  The inter-phrase coupling degree is used for prosody control: the pause length and the magnitude of the phrase command are determined from it. That is, when the value of the inter-phrase coupling degree is large, the coupling between the two phrases is weak, so a clear pause is inserted and each phrase forms a prosodic phrase on its own. The inter-phrase coupling degree is generally calculated using a table (not shown) that lists coupling values set for each part-of-speech pattern of the preceding and following phrases.
[0039]
  In step S3, the phoneme processing unit 4 performs phoneme processing on the word sequence with the highest analysis result likelihood, based on the text analysis result stored in the analysis storage unit 3. In step S4, the speech synthesis unit 5 synthesizes speech based on the speech synthesis information sequence obtained by the phoneme processing in step S3, and the synthesized speech is output from a speaker or the like. In step S5, it is determined whether the instruction input unit 6 has issued an instruction indicating that part of the speech output from the speaker was hard to hear. If so, the process proceeds to step S6; if not, the process proceeds to step S8.
[0040]
  In step S6, the output of the synthesized speech is interrupted by the control unit 8 and the speech synthesis unit 5 at that time. In step S7, the control unit 8 and the position specifying unit 7 refer to the contents of the analysis storage unit 3 to determine the recurrent voice start phrase. Then, the phoneme processing unit 4 is instructed to read the text analysis result after the determined recurrent voice start phrase. After that, the process returns to step S3, and phonological processing and speech synthesis are performed again for the clauses after the recurrent voice start clause. If it is determined in step S5 that there is no instruction from the user, the process proceeds to step S8. The method for determining the recurrent voice start phrase will be described in detail later.
[0041]
  In step S8, it is determined whether or not there is a next sentence input to the text analysis unit 1. As a result, if there is, the process returns to step S1 and proceeds to processing of the next sentence. On the other hand, if not, the speech synthesis processing operation is terminated.
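  The flow of steps S3 to S7 for a single sentence can be sketched in Python as follows. This is a simplified simulation under stated assumptions, not the device's implementation: phrases are plain strings, the user's instruction is modelled as a single phrase index `interrupt_at`, and `choose_restart` is a hypothetical stand-in for the position specifying unit.

```python
from typing import Callable, List, Optional


def read_aloud(
    phrases: List[str],
    interrupt_at: Optional[int],
    choose_restart: Callable[[int], int],
) -> List[str]:
    """Return the phrases in the order they would be spoken."""
    spoken: List[str] = []
    pos = 0
    pending = interrupt_at
    while pos < len(phrases):
        spoken.append(phrases[pos])          # S3/S4: process and output phrase
        if pending is not None and pos == pending:   # S5: instruction given?
            pending = None                   # S6: interrupt the output
            pos = choose_restart(pos)        # S7: pick the restart phrase
            continue                         # back to S3 from that phrase
        pos += 1
    return spoken


# Romanized phrases of the example sentence (0-based indices).
sentence = ["Suzuki-san kara", "shodan no", "ken de", "meeru ga", "kite", "imasu"]

# The user interrupts during phrase 3 and the restart phrase is phrase 1.
result = read_aloud(sentence, interrupt_at=3, choose_restart=lambda pos: 1)
```

  In this run the output restarts at “shodan no” and then continues to the end of the sentence without a second interruption.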
[0042]
  The processing from step S4 onward in the above speech synthesis processing operation is described below in more detail, using the input sentence “An e-mail has come from Mr. Suzuki regarding the business negotiation”. Assume that, as shown in FIG. 4, the user has difficulty catching the word “business negotiation” while the speaker is outputting “An e-mail has come from Mr. Suzuki regarding the business negotiation” in step S4. When the user gives a re-utterance instruction at about the time “e-mail” has been output, speech synthesis and output are interrupted, and processing moves on to determining the phrase from which re-utterance is to start.
[0043]
  In this embodiment, the number of morae from the instructed time point back to the head of each phrase and the inter-phrase coupling degree are used as example indices for determining the phrase from which to re-utter. The number of morae is included so that a position neither too close to nor too far from the instructed time point is readily selected as the re-utterance start phrase. If re-utterance starts from a phrase only a few morae before the character corresponding to the instructed time point (that is, a phrase that is too close), it may not go back far enough to reach the part the user wants to hear again. Conversely, if the number of morae is too large, re-utterance starts far back, it takes time to reach the part the user wants to hear again, and the user may miss it a second time. Therefore, as shown in FIG. 5, an evaluation function f(x) is defined that takes its highest value at an appropriate number of morae and gradually decreases as the number of morae moves away from that number.
[0044]
  In the above example, if re-utterance is instructed at the position of “ga” in the fourth phrase “meeru ga” (an e-mail), the number of morae back to the head of the phrase “meeru ga” is 4, to the phrase “ken de” (regarding the matter) is 7, to the phrase “shodan no” (of the business negotiation) is 12, and to the phrase “Suzuki-san kara” (from Mr. Suzuki) is 19. A difference of 4 morae from the instructed time point is too small, whereas a difference of 19 morae is too large, so the phrase “ken de” or the phrase “shodan no” is considered an appropriate starting point.
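  The mora counts in this example can be reproduced with the short Python sketch below. It assumes the phrases are available as kana readings and uses a simplified counting rule (every kana character is one mora except small kana such as “ょ”, which merge with the preceding character); the function names are hypothetical.

```python
# Small kana that combine with the preceding character into a single mora.
SMALL_KANA = set("ゃゅょぁぃぅぇぉゎャュョァィゥェォヮ")


def count_morae(kana: str) -> int:
    """Count morae in a kana string under the simplified rule above."""
    return sum(1 for ch in kana if ch not in SMALL_KANA)


def morae_back_to_phrase_heads(phrases_kana, instructed_phrase):
    """Map each phrase index to the mora count from its head up to the
    end of the phrase in which the instruction was given (0-based)."""
    counts = {}
    total = 0
    for idx in range(instructed_phrase, -1, -1):
        total += count_morae(phrases_kana[idx])
        counts[idx] = total
    return counts


# Kana readings of the first four phrases of the example sentence.
PHRASES = ["すずきさんから", "しょうだんの", "けんで", "めーるが"]

distances = morae_back_to_phrase_heads(PHRASES, instructed_phrase=3)
```

  With the instruction given at the final mora of the fourth phrase, the distances come out as 4, 7, 12, and 19 for the fourth, third, second, and first phrases respectively.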
[0045]
  The inter-phrase coupling degree is used so that re-utterance starts from an appropriate phrase break. In the example sentence above, the inter-phrase coupling degrees B satisfy the following relationship. A larger value of B indicates weaker coupling (that is, the sentence divides more easily at that point). In B(a, b), “a” is the phrase number and “b” is the candidate number.
        B (3,1)> B (1,1)> B (4,1)> B (2,1)> B (5,1)
[0046]
  In other words, the coupling between the third phrase “ken de” (regarding the matter) and the fourth phrase “meeru ga” (an e-mail) is the weakest, and the coupling between the fifth phrase “kite” (has come) and the sixth phrase “imasu” is the strongest. Comparing the boundary between the second phrase “shodan no” (of the business negotiation) and the third phrase “ken de” (B(2,1)) with the boundary between the first phrase “Suzuki-san kara” (from Mr. Suzuki) and the second phrase “shodan no” (B(1,1)), the latter is more weakly coupled, so it is better to start re-utterance from the phrase “shodan no” than from the phrase “ken de”.
[0047]
  As described above, by determining the re-utterance phrase based on the number of morae from the instructed time point back to the head of each phrase and on the inter-phrase coupling degree, speech can be resumed at a position that is neither too close to nor too far from the instructed time point and that is linguistically appropriate. In the present embodiment, therefore, a function g(x, y) is defined as follows, with the number of morae x from the instructed time point to the head of the phrase and the inter-phrase coupling degree y as parameters. By calculating and comparing the value of this function for each phrase, the phrase at which re-utterance starts can be determined optimally.
                g (x, y) = α · f (x) + β · y
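  A minimal Python sketch of this selection follows. Everything numerical in it is assumed for illustration: the specification gives neither the shape of f(x) nor values for α and β (treated here as weighting coefficients) nor the coupling degrees themselves, only the ordering B(3,1) > B(1,1) > B(4,1) > B(2,1) > B(5,1). The coupling value used for the sentence-initial phrase is likewise an assumption.

```python
def f(x: float, best: float = 10.0, width: float = 10.0) -> float:
    """Assumed triangular evaluation function peaking at `best` morae."""
    return max(0.0, 1.0 - abs(x - best) / width)


def g(x: float, y: float, alpha: float = 1.0, beta: float = 0.5) -> float:
    """g(x, y) = alpha * f(x) + beta * y, with assumed weights."""
    return alpha * f(x) + beta * y


# Assumed coupling degrees consistent with the stated ordering.
B = {1: 0.6, 2: 0.3, 3: 0.9, 4: 0.5, 5: 0.1}

# For each candidate start phrase k (1-based): the mora distance x and the
# coupling degree y of the boundary just before it, i.e. B[k - 1].
# Phrase 1 has no preceding boundary; 1.0 is an assumed value for it.
candidates = {
    1: (19, 1.0),
    2: (12, B[1]),
    3: (7, B[2]),
    4: (4, B[3]),
}


def choose_start_phrase(cands):
    """Return the phrase number with the highest g(x, y)."""
    return max(cands, key=lambda k: g(*cands[k]))


start = choose_start_phrase(candidates)
```

  Under these assumed values the second phrase scores highest, which matches the outcome described for the example sentence; different weights or a different f(x) could of course select another phrase.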
[0048]
  As described above, the present embodiment includes the analysis storage unit 3, which stores the text analysis result obtained by the text analysis unit 1, namely the constituent words, analysis result likelihood, accent information, and inter-phrase coupling degree. If synthesized speech output based on the text analysis result is hard to hear, the user gives an instruction to that effect from the instruction input unit 6 at that moment. The output of the synthesized speech is then interrupted under the control of the control unit 8, and the position specifying unit 7 determines the phrase at which output is to resume, based on the function g(x, y) defined with the number of morae x from the instructed time point to the head of the phrase and the inter-phrase coupling degree y as parameters. The output of the synthesized speech is then resumed from that phrase.
[0049]
  Therefore, if the user cannot catch the output synthesized speech, the user only has to give an instruction from the instruction input unit 6 at that moment, and output of the synthesized speech resumes from a linguistically appropriate position that is neither too close to nor too far from the instructed time point. For example, suppose synthesized speech is being output based on the kanji-kana mixed sentence “An e-mail has come from Mr. Suzuki regarding the business negotiation” input to the text analysis unit 1, and the user, having failed to catch the word “business negotiation”, gives a re-utterance instruction at about the time “e-mail” is output. The position specifying unit 7 then identifies the second phrase, “of the business negotiation”, as the re-utterance start position, and the synthesized speech “an e-mail has come regarding the business negotiation” is re-uttered.
[0050]
  That is, according to the present embodiment, re-utterance can start from a specific position in the speech output sentence, and when only the latter half of a long sentence was missed, it can be heard again in a short time.
[0051]
  <Second Embodiment>
  According to the first embodiment, re-utterance can be performed from a specific position in the text, but the utterance itself cannot be changed when re-uttering from that position. The present embodiment addresses such a case.
[0052]
  FIG. 6 is a block diagram showing the configuration of the speech synthesizer according to the present embodiment. The analysis dictionary memory 12, analysis storage unit 13, phoneme processing unit 15, speech synthesis unit 16, and instruction input unit 17 have the same configurations as the analysis dictionary memory 2, analysis storage unit 3, phoneme processing unit 4, speech synthesis unit 5, and instruction input unit 6 of the first embodiment, respectively, and operate in the same manner.
[0053]
  As described in the first embodiment, the text analysis unit 11 stores the plurality of analysis candidates obtained by the text analysis processing in the analysis storage unit 13 in descending order of likelihood. The position specifying unit 14 includes result selection means 18. From the plurality of analysis candidates stored in the analysis storage unit 13, the result selection means 18 selects one combination of analysis candidates at a time, in descending order of total analysis result likelihood, and sends it to the phoneme processing unit 15.
[0054]
  The control unit 19 receives the re-utterance instruction from the instruction input unit 17 and instructs the position specifying unit 14 to specify the re-utterance start position. In response, the position specifying unit 14 uses the result selection means 18 to select, from the plurality of analysis candidates stored in the analysis storage unit 13, the combination of analysis candidates whose total analysis result likelihood is the next largest after that of the currently selected combination, and specifies the start position of the re-utterance based on the selection result.
[0055]
  FIG. 7 is a flowchart of the speech synthesis processing operation executed by each unit of the speech synthesizer having the above configuration. The operation of the speech synthesizer will be described below with reference to FIG.
[0056]
  In steps S11 and S12, text analysis and storage of the text analysis result in the analysis storage unit 13 are performed in the same manner as in steps S1 and S2 of the first embodiment. In this embodiment, however, as shown in FIG. 8, when one word has a plurality of analysis candidates, they are stored in the first candidate, second candidate, ... columns.
[0057]
  In step S13, the control unit 19 initializes to “1” the variable i, which counts the rank of the total analysis result likelihood of the currently selected combination of analysis candidates. In step S14, if there has been no utterance change instruction from the instruction input unit 17, the result selection means 18 of the position specifying unit 14 selects from the analysis storage unit 13 the analysis candidate string with the i-th largest total analysis result likelihood. The phoneme processing unit 15 then performs phoneme processing based on the selected analysis candidate string. Accordingly, phoneme processing is first performed on the word sequence with the highest analysis result likelihood. In step S15, the speech synthesis unit 16 synthesizes speech based on the speech synthesis information sequence obtained by the phoneme processing in step S14 and outputs the synthesized speech from a speaker or the like. In step S16, it is determined whether the instruction input unit 17 has issued an utterance change instruction indicating that the speech output from the speaker contains a synthesis error. If so, the process proceeds to step S17; if not, the process proceeds to step S22.
[0058]
  In step S17, the control unit 19 and the speech synthesis unit 16 interrupt the output of the synthesized speech at that time point (j_p: the phrase number at the time of the utterance change instruction; 1 ≦ j_p ≦ j_n, where j is the phrase number and n is the number of phrases). In step S18, the result selection means 18 replaces the analysis candidates of all phrases with j ≦ j_p stored in the analysis storage unit 13 and obtains the replacement analysis candidate (j_a, p(j_a)) (1 ≦ j_a ≦ j_n) for which the total value ΣK(j, p(j)) of the analysis result likelihood K over all phrases is the (i + 1)-th largest. Here, “j_a” is the number of the phrase whose candidate was replaced when the (i + 1)-th largest total value ΣK of the analysis result likelihood K was obtained, and “p(j_a)” is its candidate number.
[0059]
  In step S19, based on the information (j_a, p(j_a)) on the replacement analysis candidate obtained in step S18, the result selection means 18 selects the “p(j_a)”-th analysis candidate of the “j_a”-th phrase (hereinafter, analysis candidate (j_a, p(j_a))) and the analysis candidate sequence “(j_a + 1, p(j_a + 1)) to (j_n, p(j_n))” for the phrases following the “j_a”-th phrase, and sends them to the phoneme processing unit 15. In step S20, the control unit 19 increments the rank i. In step S21, it is determined whether the rank i is greater than the total number N of combinations of all analysis candidates. If it is greater than “N”, the process proceeds to step S22; if it is “N” or less, the process returns to step S14. Thereafter, synthesized speech is output based on the analysis candidate sequence from the “j_a”-th phrase onward that the phoneme processing unit 15 used in the previous speech synthesis, with the analysis candidate of the “j_a”-th phrase replaced by analysis candidate (j_a, p(j_a)).
[0060]
  In this way, steps S17 to S21 and steps S14 to S16 are repeated until it is determined in step S16 that there is no utterance change instruction, or in step S21 that the rank i is greater than “N”.
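  The candidate replacement of steps S17 to S21 can be illustrated with a minimal Python sketch. It assumes, purely for simplicity, that the candidates of each phrase can be combined independently and that likelihoods are plain numbers. The function names `ranked_combinations` and `next_candidate`, the list layout, and the likelihood values are hypothetical and are not part of the apparatus described here.

```python
from itertools import product


def ranked_combinations(likelihoods, j_p):
    """Rank candidate combinations for phrases 1..j_p by total likelihood.

    likelihoods[j][p] holds K for candidate p of phrase j (0-based lists).
    Returns (total, combination) pairs, largest total first.
    """
    prefix = likelihoods[:j_p]
    scored = []
    for combo in product(*(range(len(cands)) for cands in prefix)):
        total = sum(prefix[j][p] for j, p in enumerate(combo))
        scored.append((total, combo))
    # Sort by descending total; ties are broken by candidate numbers.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return scored


def next_candidate(likelihoods, j_p, i):
    """Return the combination with the (i+1)-th largest total, or None.

    i is the 1-based rank of the combination just uttered, so the
    (i+1)-th largest entry sits at 0-based index i.
    """
    ranked = ranked_combinations(likelihoods, j_p)
    if i >= len(ranked):
        return None  # every combination has been tried (cf. step S21)
    return ranked[i][1]


# Hypothetical likelihood values for a three-phrase prefix.
K = [[6, 4], [10], [7, 3]]
print(next_candidate(K, j_p=3, i=1))
```

  With these assumed values, the second-ranked combination differs from the first only in the candidate of the first phrase, which mirrors the kind of single-phrase replacement described above.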
[0061]
  In step S22, it is determined whether or not there is a next sentence input to the text analysis unit 11. As a result, if there is, the process returns to step S11 to shift to the processing of the next sentence. On the other hand, if not, the speech synthesis processing operation is terminated.
[0062]
  Hereinafter, the processing from step S18 onward in the above speech synthesis processing operation will be described in more detail, taking the input sentence “provided by market opening” as an example. FIG. 8 shows the storage state of the analysis storage unit 13 for the text sentence “provided by market opening”. There are two analysis candidates for the word “market”: one whose reading is “Ichiba” and one whose reading is “Shijo”. In addition, for the phrase chain “being brought by”, there are two analysis candidate strings, “being brought by / being brought about” and “being brought by / being brought about”, which differ in the position of the phrase boundary.
[0063]
  As a result of the text analysis by the text analysis unit 11, the following relationship holds between the analysis result likelihood K(1,1) of the first candidate “Ichiba” of the first phrase and the analysis result likelihood K(1,2) of the second candidate “Shijo” of the first phrase:
                K(1,1) > K(1,2)
This is an example of a misanalysis in which the text analysis judges “Ichiba” to be the more likely reading of the character string “market”.
[0064]
  In addition, the following relationship holds between the analysis result likelihoods K(3,1), K(4,1) of the first candidate for the phrase chain of the third and fourth phrases and the analysis result likelihoods K(3,2), K(4,2) of the second candidate:
        K(3,1) + K(4,1) > K(3,2) + K(4,2)
[0065]
  Now, as shown in FIG. 9, suppose that the text “provided by market opening” is synthesized and that, during the first synthesized speech output “by the first person”, the user inputs an utterance change instruction at the time “That” is being uttered (j_p = 3). The analysis result candidates of the first to third phrases are then replaced in turn, and the total value ΣK(j, p(j)) of the analysis result likelihood K is calculated. In this case, among the total values for the first to third phrases, “K(1,1) + K(2,1) + K(3,1)” is the largest and “K(1,2) + K(2,1) + K(3,1)” is the second largest, so the candidate information (1,2), which gives the “i+1”-th largest, that is, the second largest, total value ΣK, is obtained. As a result, the result selection means 18 selects the second analysis candidate “Shijo” for “market” in the first phrase, together with the analysis candidate string from the second phrase onward that was used in the immediately preceding speech synthesis, and sends them to the phoneme processing unit 15. In this way, the synthesized speech “provided by Shijokaiho” is output.
[0066]
  Next, suppose that during the second sound output the user inputs an utterance change instruction at the time “it is done” is being uttered (j_p = 4). Among the total values of the analysis result likelihood K for the first to fourth phrases, “K(1,2) + K(2,1) + K(3,1) + K(4,1)” is the second largest and “K(1,2) + K(2,1) + K(3,2) + K(4,2)” is the third largest (because “Shijo” has already been selected, the analysis candidate of the first phrase is fixed to K(1,2)). Therefore, the analysis candidate information strings (3,2) and (4,2) are selected, and the synthesized speech “and thus” is output.
[0067]
  If the utterance change instruction is input late, the analysis result candidates of the wrong phrase may be replaced. For example, an instruction intended to change the utterance of the first phrase “Ichiba” may be given only when “to be done” is being uttered, so that the utterance of the fourth phrase “bred” is changed instead. Even in such a case, the next candidate is uttered each time the utterance change instruction is given again, so the user need only repeat the instruction until the desired synthesized speech is output.
[0068]
  As described above, in the present embodiment, the position specifying unit 14 is provided with the result selection means 18, which selects among the text analysis results stored in the analysis storage unit 13. If the synthesized speech output based on a text analysis result is incorrect, an utterance change instruction is given from the instruction input unit 17 at that time. Then, under the control of the control unit 19, the output of the synthesized speech is interrupted, and the result selection means 18 of the position specifying unit 14 replaces the result candidates of all phrases up to the phrase at which the utterance change instruction was given, obtains the replacement analysis candidate (j_a, p(j_a)), and selects the analysis candidate string (j_a, p(j_a)) to (j_n, p(j_n)). The phoneme processing unit 15 then performs phoneme processing on the analysis candidate string (j_a, p(j_a)) to (j_n, p(j_n)), and the synthesized speech is output again.
[0069]
  Therefore, when the user notices an error in the output synthesized speech, the user can have the synthesized speech of the next candidate output simply by giving an utterance change instruction from the instruction input unit 17 at that time. For example, while synthesized speech is being output based on the kanji-kana mixed sentence “provided by market opening” input to the text analysis unit 11, if the user gives an utterance change instruction at the time “It was done” is being uttered in order to change that utterance, the result selection means 18 replaces the first candidates “according” and “done” of the third and fourth phrases with the second candidates “according” and “provided”. The synthesized speech of the phrase string “provided by the phrase”, from the replaced phrase onward, is then re-uttered.
[0070]
  That is, according to the present embodiment, the utterance can be changed and re-uttered from a specific position in the voice output sentence, and when only the second half of a long sentence is wrong, it can be corrected in a short time.
[0071]
  As will be described later, when a speech recognition unit is used as the instruction input unit 17, the user can say “No” to move to the next candidate, or say “next” to have the same utterance repeated. When the user gives such explicit instructions, output of synthesized speech that differs from the user's intention can be prevented.
[0072]
  <Third Embodiment>
  According to the first embodiment, re-utterance can be performed from a specific position in the voice output sentence, but if the same synthesized sound as the first time is output at re-utterance, the user may again fail to catch it. The present embodiment deals with such a case by changing the speech synthesis information sequence, such as the utterance speed, of the portion that was difficult to hear when that portion is re-uttered.
[0073]
  FIG. 10 is a block diagram showing the configuration of the speech synthesizer according to the present embodiment. The text analysis unit 21, the analysis dictionary memory 22, the analysis storage unit 23, the phoneme processing unit 24, the speech synthesis unit 25, the instruction input unit 26, and the position specifying unit 27 have the same configurations as, and operate in the same manner as, the text analysis unit 1, the analysis dictionary memory 2, the analysis storage unit 3, the phoneme processing unit 4, the speech synthesis unit 5, the instruction input unit 6, and the position specifying unit 7 in the first embodiment.
[0074]
  The speech synthesis information conversion unit 29 converts the speech synthesis information sequence generated by the phoneme processing unit 24 into another speech synthesis information sequence in accordance with an instruction from the control unit 28. In this way, by changing the speech synthesis information sequence at a location that is difficult for the user to hear, a synthesized speech with a different utterance property can be generated, making it easier to hear the location that could not be heard before.
[0075]
  Here, examples of conversion into an utterance that is easier to hear include the following methods.
(1) Slightly slow down the utterance speed.
(2) Slightly increase the utterance pitch.
(3) Increase the utterance power a little.
(4) Make the phoneme length a little longer.
[0076]
  For example, the power is changed as follows. Factors affecting the power of a vowel include the type of the vowel, the types of the adjacent phonemes, the position, the pitch, and the like, and the power value can be estimated from these factors. Therefore, by setting a value slightly higher than the power value estimated from these factors, synthesized speech different from the original can be generated. Similarly, to change the utterance speed, an appropriate speed value is determined in advance, and at re-utterance the speed is set slower than this appropriate value by a predetermined amount, so that synthesized speech different from the initial one is generated.
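  The four conversions listed above can be sketched in Python as a single adjustment of a speech synthesis information record. This is only an illustration under stated assumptions: the `SynthesisInfo` fields, their units, the function name `convert_for_reutterance`, and the use of one common relative `step` are all hypothetical, since the description only says that each value is changed “slightly”.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SynthesisInfo:
    speed: float           # utterance speed (assumed unit: morae per second)
    pitch: float           # utterance pitch (assumed unit: Hz)
    power: float           # utterance power (assumed relative amplitude)
    phoneme_length: float  # phoneme length scale factor


def convert_for_reutterance(info, step=0.1):
    """Return a copy that is slightly slower, higher, louder, and longer.

    step is a hypothetical relative change applied to every value.
    """
    return replace(
        info,
        speed=info.speed * (1.0 - step),                    # (1) slower
        pitch=info.pitch * (1.0 + step),                    # (2) higher pitch
        power=info.power * (1.0 + step),                    # (3) more power
        phoneme_length=info.phoneme_length * (1.0 + step),  # (4) longer phonemes
    )


original = SynthesisInfo(speed=8.0, pitch=120.0, power=1.0, phoneme_length=1.0)
adjusted = convert_for_reutterance(original, step=0.25)
print(adjusted)
```

  Returning a new record rather than modifying the original keeps the first utterance's values available, so a later re-utterance can start again from the same baseline.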
[0077]
  As described above, according to the present embodiment, synthesized speech whose speech synthesis information sequence differs from that of the previous synthesized speech can be generated and output by slightly changing an estimated value or a preset value in the speech synthesis information. Therefore, the speech is easier to hear when it is re-uttered.
[0078]
  <Fourth embodiment>
  As in the third embodiment, this embodiment outputs, at re-utterance, a synthesized sound that differs from the first one in order to convey the content to the user.
[0079]
  FIG. 11 is a block diagram showing the configuration of the speech synthesizer according to the present embodiment. The text analysis unit 31, the analysis dictionary memory 32, the analysis storage unit 33, the phoneme processing unit 34, the speech synthesis unit 35, the instruction input unit 36, and the position specifying unit 37 have the same configurations as, and operate in the same manner as, the text analysis unit 1, the analysis dictionary memory 2, the analysis storage unit 3, the phoneme processing unit 4, the speech synthesis unit 5, the instruction input unit 6, and the position specifying unit 7 in the first embodiment.
[0080]
  The character sentence storage unit 39 is a memory that stores, for each single character, a sentence corresponding to that character; for example, it stores the sentences used in the Japanese spelling alphabet (“Japanese call”). The “Japanese call” associates a sentence with each character, such as “Asahi no A” for the character “A” and “Iroha no I” for the character “I”, and is used to convey speech accurately.
[0081]
  In accordance with an instruction from the control unit 38, the character sentence conversion unit 40 converts the analysis result of the portion judged to be difficult for the user to hear (determined by the control unit 38 based on the re-utterance start position specified by the position specifying unit 37) into the corresponding sentences stored in the character sentence storage unit 39. For example, when it is judged that the user had difficulty hearing the synthesized speech “Suzuki”, it is converted into the sentences used in the above Japanese call, “Suzume no Su, Suzume no Su with a voicing mark (dakuten), Kitte no Ki”. A text analysis result of the converted sentence is then generated and sent to the phoneme processing unit 34.
[0082]
  Thereafter, the phoneme processing unit 34 generates a speech synthesis information sequence based on the text analysis result generated by the character sentence conversion unit 40, and the speech synthesis unit 35 outputs synthesized speech based on the generated speech synthesis information sequence.
[0083]
  As described above, according to the present embodiment, when it is judged in accordance with an instruction from the instruction input unit 36 that the user had difficulty hearing, for example, the synthesized speech “Suzuki”, the control unit 38 and the character sentence conversion unit 40 convert the reading sequence information “Suzuki” into the sentences used in the above Japanese call and generate a text analysis result of the converted sentence. Therefore, when the reading sequence information “Suzuki” is read out again, it can be uttered as “Suzume no Su, Suzume no Su with a voicing mark (dakuten), Kitte no Ki”, which makes the portion that was difficult for the user to hear clearer.
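  The conversion performed by the character sentence conversion unit can be sketched as a table lookup in Python. The table below is a tiny illustrative subset keyed by romanized morae; the entries for “su” and “ki”, the voiced-to-unvoiced mapping, the wording “ni dakuten”, and the name `spell_out` are assumptions made for this sketch and do not reproduce the contents of the character sentence storage unit.

```python
# Tiny illustrative subset of a spelling table; keys are unvoiced morae.
SPELLING_TABLE = {
    "a": "Asahi no A",
    "i": "Iroha no I",
    "su": "Suzume no Su",
    "ki": "Kitte no Ki",
}

# Voiced morae mapped to their unvoiced base mora (hypothetical subset).
VOICED_TO_BASE = {"zu": "su", "gi": "ki"}


def spell_out(morae):
    """Convert a reading, given as a list of morae, into spelling phrases."""
    phrases = []
    for mora in morae:
        if mora in VOICED_TO_BASE:
            base_phrase = SPELLING_TABLE[VOICED_TO_BASE[mora]]
            # A voiced mora is spelled as its base phrase plus the voicing mark.
            phrases.append(base_phrase + " ni dakuten")
        elif mora in SPELLING_TABLE:
            phrases.append(SPELLING_TABLE[mora])
        else:
            phrases.append(mora)  # no table entry: keep the mora as it is
    return phrases


print(", ".join(spell_out(["su", "zu", "ki"])))
```

  In a real system the resulting phrase string would itself be passed through text analysis before phoneme processing, as the embodiment describes.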
[0084]
  <Fifth embodiment>
  In the first embodiment, the re-utterance start position is specified based on the inter-phrase coupling degree and the number of morae. In the second embodiment, the re-utterance start position is specified based on the analysis result likelihood. In the present embodiment, a position that reflects the user's intention in requesting re-utterance is set as the re-utterance start position.
[0085]
  FIG. 12 is a block diagram showing the configuration of the speech synthesizer according to the present embodiment. The text analysis unit 41, the analysis dictionary memory 42, the analysis storage unit 43, the phoneme processing unit 45, the speech synthesis unit 46, the instruction input unit 47, and the speech synthesis information conversion unit 51 have the same configurations as, and operate in the same manner as, the text analysis unit 21, the analysis dictionary memory 22, the analysis storage unit 23, the phoneme processing unit 24, the speech synthesis unit 25, the instruction input unit 26, and the speech synthesis information conversion unit 29 in the third embodiment. The result selection unit 44 has the same configuration as the result selection means 18 in the second embodiment and operates in the same manner.
[0086]
  The position specifying unit 48 includes start position candidate selection means 49, and specifies a re-utterance start position that reflects the intention of the user requesting re-utterance. The function of the position specifying unit 48 is described below.
[0087]
  To decide from which part of the text input to the text analysis unit 41 re-utterance should start, it is necessary to consider (1) the cause of the user wanting to hear it again, (2) the item the user wants to hear again, and (3) the appropriateness of the location as a re-utterance start position. The causes of wanting to hear again include (a) a reading error, (b) unclear phonemes, and (c) the presence of a word with a low appearance frequency. Items that the user wants to hear again include words that determine the meaning of the sentence, such as the 5W1H. As measures of the appropriateness of a location for starting re-utterance, (A) the number of morae between each phrase and the phrase being output when the instruction is given, and (B) the strength of the coupling between phrases can be cited.
[0088]
  The position specifying unit 48 judges these comprehensively and ranks the phrases as candidates for re-utterance. The ranking is performed by obtaining the value of the following function for each phrase.
        f (p1, p2, p3, p4, p5, p6)
      where p1: analysis result parameter
              p2: phoneme clarity parameter
              p3: word appearance probability parameter
              p4: word importance parameter
              p5: inter-phrase coupling degree parameter
              p6: mora number parameter
[0089]
  When there are a plurality of analysis results, the smaller the difference between their likelihoods, the higher the possibility that the analysis is erroneous. Therefore, one example of the analysis result parameter p1 is the reciprocal of the difference between the analysis result likelihood of the first-ranked candidate and that of the second-ranked candidate.
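  As a minimal Python sketch of this example of p1, the reciprocal of the likelihood gap can be computed as follows. The function name, the value returned when a phrase has only one candidate, and the small `eps` guard against division by zero are assumptions added for the sketch.

```python
def analysis_result_parameter(likelihoods, eps=1e-6):
    """Reciprocal of the gap between the two highest candidate likelihoods.

    likelihoods: analysis result likelihoods of one phrase's candidates.
    Returns 0.0 when there is no second candidate (assumed convention).
    eps is a hypothetical floor that avoids division by zero on ties.
    """
    if len(likelihoods) < 2:
        return 0.0
    top, second = sorted(likelihoods, reverse=True)[:2]
    return 1.0 / max(top - second, eps)


# A close pair of candidates yields a larger p1 than a clear winner.
print(analysis_result_parameter([0.75, 0.25]))
print(analysis_result_parameter([0.5, 0.375]))
```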
[0090]
  In addition, an unclear phoneme is likely to be missed by the listener. Therefore, the phoneme clarity parameter p2 is used as a parameter of the function f. That is, the degree of clarity is quantified in advance for each phoneme, and the parameter value is set larger for phonemes with lower clarity. Accordingly, a large value is given to unvoiced fricatives and the like.
[0091]
  In addition, words that are not often used are harder to catch. Therefore, the word appearance probability parameter p3 is used as a parameter of the function f. Further, the more important a word is, in the sense that the meaning of the sentence itself cannot be understood at all if the word is missed, the more likely the user is to want to hear it again. Therefore, the word importance parameter p4 is used as a parameter of the function f. Words related to the 5W1H, such as “who”, “where”, and “when”, which will be described later, are given a high importance.
[0092]
  The inter-phrase coupling degree parameter p5 is used to start re-utterance from a linguistically appropriate phrase boundary, as already described in the first embodiment. The larger the value of this parameter, the weaker the coupling between the phrases and the more appropriate the boundary is as a re-utterance start position. The mora number parameter p6 is also as described in the first embodiment.
[0093]
  Furthermore, the character position that was being output when the re-utterance instruction was input from the instruction input unit 47 can be roughly estimated from the ratio between the difference from the utterance start time to the instruction time and the difference from the instruction time to the scheduled utterance end time.
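  This rough estimate can be written as a short Python function. The sketch assumes that characters are uttered at an approximately uniform rate across the sentence, which is what a simple time ratio implies; the function name, the parameter names, and the clamping of the result are hypothetical.

```python
def estimate_char_position(t_start, t_instruction, t_end_scheduled, n_chars):
    """Estimate the 0-based index of the character being uttered.

    t_start:          utterance start time (seconds)
    t_instruction:    time at which the instruction was input (seconds)
    t_end_scheduled:  scheduled utterance end time (seconds)
    n_chars:          number of characters in the sentence being uttered
    """
    elapsed = t_instruction - t_start
    remaining = t_end_scheduled - t_instruction
    total = elapsed + remaining
    if total <= 0 or n_chars <= 0:
        raise ValueError("utterance must have positive duration and length")
    # Fraction of the utterance already output, clamped to [0, 1].
    ratio = min(max(elapsed / total, 0.0), 1.0)
    return min(int(ratio * n_chars), n_chars - 1)


# Instruction given 3 s into a 4 s utterance of a 20-character sentence.
print(estimate_char_position(0.0, 3.0, 4.0, 20))
```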
[0094]
  Various methods can be considered for determining the function f. A determination method based on observation data is described below. To simplify the description, the function f is assumed to be calculated by multiplying each parameter by a weighting coefficient and adding the results. The observation data consist of the information stored in the analysis storage unit 43 as shown in FIG. 13, the character position whose synthesized speech was being output when the re-utterance instruction was input to the instruction input unit 47, and the reading start position and the reading expected by the user. A large amount of such observation data is prepared, and the weighting coefficients are estimated so that the probability of obtaining the result expected by the user is maximized. The weighting coefficients are generally estimated using a multivariate analysis method such as multiple regression analysis.
[0095]
  If the observation data contain many instructions given in cases where there are multiple analysis results with different readings, the weights are learned so that a large value of the analysis result parameter p1 indicates a high possibility of a reading error. Therefore, when there is a next candidate with a close analysis result likelihood, it is judged that there is a reading error, and the probability that the reading of the next candidate is re-uttered increases.
[0096]
  The start position candidate selection means 49 of the position specifying unit 48 is equipped with the function f(p1, p2, p3, p4, p5, p6) for ranking re-utterance phrases, with the weighting coefficients obtained as described above. When a re-utterance instruction is input to the instruction input unit 47, the value of the function f is calculated for each phrase, in accordance with an instruction from the control unit 50, by substituting the values of the analysis result likelihood K, phoneme clarity C, word appearance frequency F, word importance S, inter-phrase coupling degree B, and number of morae M of each phrase stored in the analysis storage unit 43. In this way, candidates for the re-utterance start position are selected. The position specifying unit 48 then specifies the phrase with the maximum value of the function f as the re-utterance start position and returns the specified position to the control unit 50.
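  The ranking by the function f can be sketched in Python using the weighted-sum form that the description adopts for simplicity. The weights and parameter values below are made-up placeholders rather than learned coefficients, and the names `f` and `select_start_phrase` are hypothetical.

```python
def f(params, weights):
    """Weighted sum of the six parameters (p1..p6) of one phrase."""
    return sum(w * p for w, p in zip(weights, params))


def select_start_phrase(phrase_params, weights):
    """Return (index of the phrase with the largest f, list of all scores).

    phrase_params: one [p1, p2, p3, p4, p5, p6] list per phrase.
    On ties, the earliest phrase is chosen (an assumed convention).
    """
    scores = [f(params, weights) for params in phrase_params]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Placeholder weights and per-phrase parameter values for three phrases.
weights = [2, 1, 1, 1, 1, 1]
phrase_params = [
    [1, 0, 0, 0, 1, 1],
    [0, 1, 0, 2, 1, 0],
    [3, 0, 0, 0, 0, 0],
]
best, scores = select_start_phrase(phrase_params, weights)
print(best, scores)
```

  Keeping the per-phrase scores alongside the winning index makes it possible to inspect which parameter contributed most, which the control unit relies on when choosing between correcting a reading and changing the utterance properties.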
[0097]
  Then, in the same manner as in the second embodiment, the control unit 50 instructs the result selection unit 44 to select, for a predetermined number of phrases from the re-utterance start position and from among the plurality of analysis candidates stored in the analysis storage unit 43, the combination of analysis candidates with the next largest analysis result likelihood after the combination currently being output as synthesized speech. It also instructs the speech synthesis information conversion unit 51 to convert the speech synthesis information sequence in the same manner as in the third embodiment. As a result, for the predetermined phrases from the re-utterance start phrase at which the function f takes its maximum value, when the parameter contributing most is the analysis result parameter p1, the reading error is corrected by the result selection unit 44. When it is the phoneme clarity parameter p2 or the word appearance probability parameter p3, the speech synthesis information conversion unit 51 has the speech re-uttered with slightly higher utterance power or slightly lower utterance speed.
[0098]
  Thus, in the present embodiment, the analysis storage unit 43 stores the analysis result likelihood K, phoneme clarity C, word appearance frequency F, word importance S, inter-phrase coupling degree B, and number of morae M for each phrase. Further, the position specifying unit 48 is provided with the start position candidate selection means 49 equipped with the function f(p1, p2, p3, p4, p5, p6) for ranking re-utterance phrases. When there is a re-utterance instruction, the value of the function f is calculated for each phrase using the analysis result likelihood K, phoneme clarity C, word appearance frequency F, word importance S, inter-phrase coupling degree B, and number of morae M stored in the analysis storage unit 43, and the re-utterance start position is specified based on the calculation result. Then, for the specified re-utterance start position, the control unit 50 instructs the result selection unit 44 to select the next combination of analysis candidates and instructs the speech synthesis information conversion unit 51 to convert the speech synthesis information sequence.
[0099]
  Therefore, according to the present embodiment, the re-utterance position can be specified automatically in a way that reflects the user's intention in requesting re-utterance, such as “the reading is wrong”, “the phrase break is wrong”, or “it was difficult to hear”.
[0100]
  <Sixth embodiment>
  According to the fifth embodiment, the speech synthesizer can automatically determine the re-utterance start position even if the user does not explicitly indicate “the reading is wrong”, “the phrase break is wrong”, “it was difficult to hear”, or the like. However, if such information is given explicitly, the re-utterance position and the re-utterance method can be determined more reliably. The present embodiment relates to explicitly indicating the user's intention in requesting re-utterance.
[0101]
  FIG. 14 is a block diagram showing a schematic configuration of the speech synthesizer of the present embodiment. The text analysis unit 61, the analysis storage unit 63, the phoneme processing unit 64, and the speech synthesis unit 65 have the same configurations as, and operate in the same manner as, the text analysis unit 1, the analysis storage unit 3, the phoneme processing unit 4, and the speech synthesis unit 5 in the first embodiment. The position specifying unit 68 has the same configuration as the position specifying unit 48 in the fifth embodiment and operates in the same manner.
[0102]
  The speech recognition unit 66 includes a microphone and recognizes input speech using the speech recognition dictionary stored in the speech recognition dictionary memory 67. The speech recognition dictionary memory 67 stores the speech recognition vocabulary as a speech recognition dictionary. Words that prompt re-utterance, such as “E”, “what”, and “again”, can be registered in the speech recognition dictionary memory 67. Vocabulary that asks for the specific information the user wants to know, such as “who”, “when”, and “where”, can also be registered.
[0103]
  In addition, the analysis dictionary memory 62 holds, for each word, information on whether the word is related to the vocabulary registered in the speech recognition dictionary memory 67. For example, the word “Suzuki” is a personal name and is associated with the registered vocabulary “who”. Likewise, the word “6 o'clock” is a time and is associated with the registered vocabulary “when”. Therefore, the text analysis unit 61 can obtain such related word information as part of the analysis result, and this information is stored in the analysis storage unit 63.
[0104]
  The position specifying unit 68 has start position candidate selection means 69 equipped with the function f(p1, p2, p3, p4, p5, p6) for ranking re-utterance phrases. When the control unit 70 instructs it to specify the re-utterance position, the value of the function f is obtained using the values of the analysis result likelihood K, phoneme clarity C, word appearance frequency F, word importance S, inter-phrase coupling degree B, and number of morae M of each phrase stored in the analysis storage unit 63. The phrase with the largest value of the function f is then specified as the re-utterance position.
[0105]
  In this case, when the speech recognition unit 66 recognizes a vocabulary item registered in the speech recognition dictionary memory 67, the related word information, which is one of the analysis results stored in the analysis storage unit 63, is referred to, and words related to the recognized vocabulary are searched for. Then, when calculating the function f(p1, p2, p3, p4, p5, p6) for a phrase containing a related word, the value of the word importance parameter p4 is made larger than usual (for example, the value of the word importance parameter p4, or the value of the word importance S of the related word, is multiplied by an integer). In this way, if the input text sentence contains a word related to the recognized vocabulary, the value of the function f of the phrase containing that related word increases, and re-utterance starts from that phrase.
[0106]
  In that case, since the analysis storage unit 63 holds the analysis results for the entire input text, re-utterance is not limited to the text sentence currently being output as synthesized speech; it is also possible to go back to a text sentence that has already been output and re-utter it. Further, by specifying an end position using the values of the inter-phrase coupling degree B and the number of morae M, in the same manner as the position specifying unit 7 in the first embodiment specifies the re-utterance start position, and by combining this with the specification of the start position, it becomes possible to re-utter only the phrase related to a person's name, such as “From Mr. Suzuki”, in response to the question “who” input from the microphone.
[0107]
  Similarly, when the speech recognition unit 66 recognizes a vocabulary item that prompts re-reading, such as “different”, the value of the analysis result parameter p1 is made larger than usual when calculating the function f(p1, p2, p3, p4, p5, p6). If a result selection unit similar to the result selection unit 44 in the fifth embodiment is used in combination, a candidate with a different reading can then be selected for a phrase having a plurality of analysis results, and that candidate can be re-uttered.
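  The effect of a recognized word on the ranking can be sketched in Python as follows. The phrase records, the weights, the integer `boost` factor, and the treatment of “different” as a word that boosts p1 for every phrase are illustrative assumptions for this sketch; only the idea of enlarging p4 for related words and p1 for re-reading requests comes from the description.

```python
P1, P4 = 0, 3  # 0-based positions of p1 and p4 in the parameter list


def boosted_scores(phrases, weights, recognized=None, boost=3):
    """Compute f per phrase, enlarging p1 or p4 according to the recognized word.

    phrases: dicts with "params" ([p1..p6]) and "related" (set of
             registered vocabulary items the phrase's words relate to).
    """
    scores = []
    for phrase in phrases:
        params = list(phrase["params"])
        if recognized == "different":
            params[P1] *= boost  # request for another reading
        elif recognized in phrase["related"]:
            params[P4] *= boost  # phrase contains a related word
        scores.append(sum(w * p for w, p in zip(weights, params)))
    return scores


def start_phrase(phrases, weights, recognized=None):
    scores = boosted_scores(phrases, weights, recognized)
    return max(range(len(scores)), key=scores.__getitem__)


# Placeholder data: a person-name phrase and a time phrase.
phrases = [
    {"text": "Suzuki-san kara", "related": {"who"}, "params": [0, 0, 0, 2, 1, 1]},
    {"text": "6 o'clock", "related": {"when"}, "params": [0, 0, 0, 2, 1, 2]},
]
weights = [1, 1, 1, 1, 1, 1]
print(phrases[start_phrase(phrases, weights, "who")]["text"])
```

  With these placeholder values, saying “who” moves the start position to the person-name phrase, while without any recognized word the time phrase would score highest.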
[0108]
  As described above, in the present embodiment, the position specifying unit 68 has the same function as the position specifying unit 48 in the fifth embodiment. Using the analysis result likelihood K, phoneme clarity C, word appearance frequency F, word importance S, inter-phrase coupling degree B, and number of morae M stored in the analysis storage unit 63, it obtains the value of the function f(p1, p2, p3, p4, p5, p6) and specifies the phrase with the largest value of the function f as the re-utterance position.
[0109]
  At this time, when the speech recognition unit 66 recognizes vocabulary registered in the speech recognition dictionary memory 67, for example “when”, the value of the word importance parameter p4 is set larger than usual when calculating the function f of the phrase containing the related word “6 o'clock”. Therefore, simply by giving the voice instruction “when”, the user can have re-utterance start automatically from the desired phrase “6 o'clock ...”.
[0110]
  In addition, when the speech recognition unit 66 recognizes the vocabulary “different”, which prompts re-reading, the value of the analysis result parameter p1 is made larger than usual when the function f is calculated. Therefore, simply by giving the voice instruction “different”, the user can have another reading candidate selected automatically and re-uttered.
[0111]
  It is effective to install the speech synthesizer of each of the above embodiments in a portable terminal device that can display relatively little text information, such as a cellular phone or an electronic notebook. That is, when the content of an e-mail containing relatively long sentences is to be checked on such a portable terminal, it is listened to as synthesized speech. However, it is difficult to make the accuracy rate and clarity of text-to-speech synthesis perfect, so a means of recovery is needed in case of misanalysis or unclear output.
[0112]
  According to the speech synthesizer in each of the above embodiments, the re-utterance start position and the next candidate are selected automatically, so re-utterance and utterance change can be performed with a very simple operation. It is therefore very effective as a speech synthesizer for such portable terminal devices.
[0113]
  The functions of the analysis means, the speech synthesis means, the instruction input means, the position specifying means, and the control means, performed by the text analysis unit, the speech synthesis unit, the instruction input unit, the position specifying unit, and the control unit in the above embodiments, are realized by a speech synthesis processing program recorded on a program recording medium. The program recording medium in each of the above embodiments is a program medium composed of a ROM (read only memory). Alternatively, it may be a program medium that is loaded into an external auxiliary storage device and read out. In either case, the program reading means for reading the speech synthesis processing program from the program medium may be configured to access the program medium directly and read it, or to download the program to a program storage area (not shown) provided in a RAM (random access memory) and then access that program storage area to read it. The download program for downloading from the program medium to the program storage area of the RAM is assumed to be stored in the main unit in advance.
[0114]
  Here, the program medium is a medium configured to be separable from the main body that carries the program in a fixed manner, including tape systems such as magnetic tape and cassette tape; disk systems such as magnetic disks (floppy disks, hard disks) and optical discs (CD (compact disc)-ROM, MO (magneto-optical) discs, MD (mini disc), DVD (digital video disc)); card systems such as IC (integrated circuit) cards and optical cards; and semiconductor memory systems such as mask ROM, EPROM (ultraviolet erasable ROM), EEPROM (electrically erasable ROM), and flash ROM.
[0115]
  In addition, if the speech synthesizer in each of the above embodiments is configured to be connectable to a communication network including the Internet, the program medium may be a medium that carries the program in a fluid manner, for example by downloading it from the communication network. In this case, a download program for downloading from the communication network is assumed to be stored in the main unit in advance, or to be installed from another recording medium.
[0116]
  It should be noted that what is recorded on the recording medium is not limited to a program, and data can also be recorded.
[0117]
【Effects of the Invention】
  As is clear from the above, in the speech synthesizer of the first invention, when an instruction prompting re-utterance is input from the instruction input means, the position specifying means specifies the start position of the re-utterance in the character string information based on the analysis result of the analysis means, and the control means instructs the speech synthesis means to synthesize speech from the specified start position. Therefore, when the user cannot catch the output synthesized speech, simply giving an instruction from the instruction input means at that moment lets the user hear a re-utterance from a specific position in the output sentence. Even when only the latter half of a long sentence was missed, it can be heard again in a short time.
[0118]
  Further, the analysis result includes an inter-phrase coupling degree, and the position specifying means specifies the phrase that becomes the start position, from among the phrases preceding the position in the character string information that corresponds to the time at which the instruction was received from the instruction input means, based on the number of morae and the inter-phrase coupling degree. An appropriate phrase that is neither too close to nor too far from the point at which the instruction was received, and whose linguistic coupling is weak and easily separated, can therefore be selected as the start position.
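
The selection described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the source says that the number of morae and the inter-phrase coupling degree are both used, but the concrete evaluation function, the preferred distance of 10 morae, the 0-to-1 scale for coupling, and all names (`Phrase`, `distance_score`, `choose_start`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Phrase:
    text: str        # surface text of the phrase
    morae: int       # number of morae in the phrase
    coupling: float  # coupling with the preceding phrase: 0 (weak) to 1 (strong), assumed scale


def distance_score(morae_back: int, preferred: int = 10) -> float:
    """Hypothetical evaluation function: highest at `preferred` morae back, falling off linearly."""
    return max(0.0, 1.0 - abs(morae_back - preferred) / preferred)


def choose_start(phrases: list[Phrase], current: int) -> int:
    """Return the index of the phrase to restart from.

    `current` is the index of the last phrase that had begun when the
    instruction was received (an assumption about how the position is tracked).
    """
    best_index, best_score = current, -1.0
    morae_back = 0
    for i in range(current, -1, -1):
        morae_back += phrases[i].morae
        # Prefer a weak boundary in front of phrase i at a comfortable distance.
        score = distance_score(morae_back) * (1.0 - phrases[i].coupling)
        if score > best_score:
            best_index, best_score = i, score
    return best_index
```

Multiplying the two factors is one possible design choice; it rejects a phrase outright when either the distance is unsuitable or the boundary is tightly coupled. A weighted sum would be an equally plausible reading of the text.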
[0119]
  Further, in the speech synthesizer of the second invention, the analysis result includes a likelihood, and the position specifying means calculates, for the phrases preceding the position in the character string information that corresponds to the time at which the instruction was received from the instruction input means, the total likelihood while swapping the candidates of each analysis result, and specifies the start position of the re-utterance based on the resulting totals. Therefore, when the synthesized speech "I am also driven by the way" is uttered based on the character string information "provided by market opening", the second candidate "according", which gives the next-highest total likelihood, can be selected to specify the start position of the re-utterance. In that case, the synthesized speech "to be brought" can be re-uttered from the position of the phrase "to".
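
A minimal sketch of the total-likelihood ranking described above is given below. It assumes that each phrase carries a short list of `(reading, likelihood)` candidates and that the total is a plain sum; both are assumptions, as are the function names and the rule that the re-utterance starts at the first phrase whose reading changes.

```python
from itertools import product


def ranked_readings(phrases):
    """Rank every combination of candidates by total likelihood, highest first.

    `phrases` is a list of candidate lists; each candidate is (reading, likelihood).
    """
    combos = []
    for choice in product(*phrases):
        total = sum(likelihood for _, likelihood in choice)
        combos.append((total, [reading for reading, _ in choice]))
    combos.sort(key=lambda combo: combo[0], reverse=True)
    return combos


def next_candidate(phrases):
    """Return (start_index, readings) for the combination with the next-highest total.

    The start index is the first phrase whose reading differs from the
    top-ranked combination. Returns None when there is no alternative.
    """
    ranked = ranked_readings(phrases)
    if len(ranked) < 2:
        return None
    first, second = ranked[0][1], ranked[1][1]
    start = next((i for i, (a, b) in enumerate(zip(first, second)) if a != b), 0)
    return start, second
```

Enumerating every combination grows exponentially with the number of phrases, so a practical implementation would presumably limit the search to the few phrases just before the instruction point.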
[0120]
  Further, in the speech synthesizer of the first invention, the analysis result includes a likelihood, and the result selection means calculates, for the phrases preceding the position in the character string information that corresponds to the time at which the instruction was received from the instruction input means, the total likelihood while swapping the candidates of each analysis result, and selects the analysis result to be used for the re-utterance based on the resulting totals. Therefore, when the synthesized speech "Ichiba" is uttered based on the character string information "provided by market opening", the second candidate "Shijo", which has a different reading, can be selected and the text re-uttered as synthesized speech beginning "Shijo kaiho".
[0121]
  Further, in the speech synthesizer of the second invention, the analysis result includes, in addition to the likelihood, at least one of phonological clarity, word appearance probability, and word importance, and the position specifying means selects start position candidates for the re-utterance using, in addition to the likelihood of the analysis result, at least one of phonological clarity, word appearance probability, and word importance, and specifies the start position of the re-utterance based on the selected candidates. In that case, based on the likelihood of the analysis result, a position that reflects the intention "reading error" behind the user's re-utterance instruction can be selected as a start position candidate. Furthermore, a position that reflects the intention "could not hear" can be selected based on phonological clarity or word appearance probability, and a position that reflects the intention "confirm important part" can be selected based on word importance.
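
The mapping from measures to user intentions might be sketched as below. The thresholds, the dictionary keys, and the function name are all hypothetical; the source only states which measure is associated with which intention, and it requires only at least one of the three additional measures, whereas this sketch uses all of them for brevity.

```python
def start_candidates(phrases, max_likelihood=0.6, max_clarity=0.5,
                     max_probability=0.01, min_importance=0.8):
    """Return (index, reasons) for each phrase that qualifies as a start position candidate.

    Each phrase is a dict with the assumed keys "likelihood", "clarity",
    "probability", and "importance", all on an assumed 0-to-1 scale.
    """
    candidates = []
    for i, phrase in enumerate(phrases):
        reasons = []
        if phrase["likelihood"] <= max_likelihood:
            reasons.append("reading error")           # uncertain analysis result
        if phrase["clarity"] <= max_clarity or phrase["probability"] <= max_probability:
            reasons.append("could not hear")          # hard to catch or unexpected word
        if phrase["importance"] >= min_importance:
            reasons.append("confirm important part")  # worth repeating
        if reasons:
            candidates.append((i, reasons))
    return candidates
```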
[0122]
  Further, in the speech synthesizer of the present invention, if the speech synthesis information generation means of the speech synthesis means generates a speech synthesis information sequence based on the analysis result of the analysis means, and the speech synthesis information conversion means converts a predetermined speech synthesis information sequence after the start position of the re-utterance, within the generated sequence, into another speech synthesis information sequence, then the part that could not be heard can be re-uttered with a changed delivery, such as a slightly slower speech rate, a slightly higher pitch, slightly greater utterance power, or longer phoneme durations.
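
As an illustration of such a conversion, the sketch below rewrites the prosodic parameters of every unit from the start position onward. The `SynthUnit` fields, the scaling factors, and the choice to convert everything up to the end of the sequence are assumptions; the source does not specify how the speech synthesis information is represented.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SynthUnit:
    phoneme: str
    duration_ms: float  # phoneme length
    pitch_hz: float
    power: float        # relative utterance power


def emphasize_from(units, start, rate=0.9, pitch=1.05, power=1.1):
    """Return a new sequence in which units from `start` onward are slower, higher, and louder.

    `rate` below 1.0 slows the speech by lengthening each phoneme.
    """
    converted = []
    for i, unit in enumerate(units):
        if i >= start:
            unit = replace(
                unit,
                duration_ms=unit.duration_ms / rate,
                pitch_hz=unit.pitch_hz * pitch,
                power=unit.power * power,
            )
        converted.append(unit)
    return converted
```

The units before `start` are returned unchanged, so the same function can be applied to the full sequence and the result sliced at the start position for re-utterance.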
[0123]
  Further, in the speech synthesizer of the present invention, if the analysis result includes reading sequence information, and the character sentence conversion means converts the reading sequence of a predetermined analysis result after the start position of the re-utterance into the analysis result sequence of the corresponding sentences stored in the character sentence storage means, then, for example, by storing the characters "su", "zu", and "ki" in the character sentence storage means together with the corresponding sentences "suzume no su" (su as in sparrow), "suzume no su ni dakuten" (su as in sparrow with a voiced-sound mark), and "kitte no ki" (ki as in stamp), the reading sequence "Suzuki" of the predetermined analysis result "Suzuki" after the start position of the re-utterance can be converted into the corresponding sentence sequence "suzume no su, suzume no su ni dakuten, kitte no ki" and then re-uttered.
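
The character-to-sentence conversion amounts to a table lookup, sketched below. The table contents are illustrative romanized stand-ins, and the names `CHARACTER_SENTENCES` and `spell_out` are hypothetical; the source does not describe the storage format.

```python
# Hypothetical contents of the character sentence storage, romanized for readability.
CHARACTER_SENTENCES = {
    "su": "suzume no su",
    "zu": "suzume no su ni dakuten",
    "ki": "kitte no ki",
}


def spell_out(reading_units):
    """Replace each reading unit with its explanatory sentence, if one is stored.

    Units without a stored sentence are passed through unchanged.
    """
    return [CHARACTER_SENTENCES.get(unit, unit) for unit in reading_units]
```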
[0124]
  Further, in the speech synthesizer of the present invention, if the instruction input means consists of speech recognition means for recognizing an instruction input by voice, the user can clearly indicate by voice the intention behind the re-utterance request.
[0125]
  Further, in the speech synthesizer of the present invention, if the speech recognition means recognizes vocabulary that prompts re-utterance using a speech recognition dictionary, the analysis means uses an analysis dictionary to generate an analysis result that includes related information indicating whether each word is related to the vocabulary that prompts re-utterance, and the position specifying means specifies the start position of the re-utterance based on the related information from the analysis means, then, by registering in the speech recognition dictionary vocabulary that prompts re-utterance, such as "who", "when", and "where", the re-utterance can start from a word representing a person, time, or place, such as "Mr. Suzuki", "yesterday", or "at the company", when such vocabulary, which asks for the information the user wants to know, is recognized by voice.
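
One way to picture the use of the related information is the sketch below. It assumes that the analysis step has already tagged each phrase with a set of categories and that the search runs backward from the phrase being uttered; the category names, the `QUESTION_TO_CATEGORY` table, and the function name are hypothetical.

```python
# Hypothetical relation between recognized question words and phrase categories.
QUESTION_TO_CATEGORY = {"who": "person", "when": "time", "where": "place"}


def start_for_question(phrases, word, current):
    """Return the index of the nearest earlier phrase related to the recognized word.

    `phrases` is a list of (text, categories) pairs, where `categories` is a set
    produced by the analysis step. Returns None if the word is not registered
    or no related phrase is found.
    """
    category = QUESTION_TO_CATEGORY.get(word)
    if category is None:
        return None
    for i in range(current, -1, -1):
        if category in phrases[i][1]:
            return i
    return None
```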
[0126]
  Also, the speech synthesis method of the third invention includes a step of inputting an instruction prompting re-utterance, a step of specifying the start position of the re-utterance in the character string information based on the analysis result of the input character string information, and a step of performing speech synthesis from the specified start position. The analysis result includes an inter-phrase coupling degree, and in the step of specifying the start position, the phrase that becomes the start position is specified, from among the phrases preceding the position in the character string information that corresponds to the time at which the instruction was received from the instruction input means, based on the number of morae and the inter-phrase coupling degree. Therefore, when the user cannot catch the output synthesized speech, simply giving an instruction at that moment lets the user hear a re-utterance from a specific position in the output sentence. Even when only the latter half of a long sentence was missed, it can be heard again in a short time. Furthermore, an appropriate phrase that is neither too close to nor too far from the point at which the instruction was received, and whose linguistic coupling is weak and easily separated, can be selected as the start position.
[0127]
  Also, since the portable terminal device of the fourth invention is equipped with the speech synthesizer of the present invention, when the content of an e-mail containing relatively long sentences is conveyed by speech synthesis output on a portable terminal device that displays little character information, the next candidate can be selected automatically. Accurate re-utterance and utterance changes can therefore be performed with a very simple operation.
[0128]
  Also, since the program recording medium of the fifth invention records a speech synthesis processing program that causes a computer to function as the analysis means, speech synthesis means, instruction input means, position specifying means, and control means of the first invention, then, as in the first invention, when the user cannot catch the output synthesized speech, simply giving an instruction at that moment lets the user hear a re-utterance from a specific position in the output sentence. Even when only the latter half of a long sentence was missed, it can be heard again in a short time. Furthermore, an appropriate phrase that is neither too close to nor too far from the point at which the instruction was received, and whose linguistic coupling is weak and easily separated, can be selected as the start position.
[Brief description of the drawings]
FIG. 1 is a block diagram of a speech synthesizer according to the present invention.
FIG. 2 is a flowchart of a speech synthesis processing operation executed by the speech synthesizer shown in FIG. 1.
FIG. 3 is a conceptual diagram showing a storage state of the analysis storage unit in FIG. 1.
FIG. 4 is a diagram showing an example of synthesized speech output when a "re-utterance" instruction is given.
FIG. 5 is a diagram illustrating the relationship between the number of morae and an evaluation function.
FIG. 6 is a block diagram of a speech synthesizer different from that in FIG. 1.
FIG. 7 is a flowchart of a speech synthesis processing operation executed by the speech synthesizer shown in FIG. 6.
FIG. 8 is a conceptual diagram showing a storage state of the analysis storage unit in FIG. 6.
FIG. 9 is a diagram showing an example of synthesized speech output when a "change utterance" instruction is given.
FIG. 10 is a block diagram of a speech synthesizer different from those in FIGS. 1 and 6.
FIG. 11 is a block diagram of a speech synthesizer different from those in FIGS. 1, 6 and 10.
FIG. 12 is a block diagram of a speech synthesizer different from those in FIGS. 1, 6, 10 and 11.
FIG. 13 is a conceptual diagram showing a storage state of the analysis storage unit in FIG. 12.
FIG. 14 is a block diagram of a speech synthesizer different from those in FIGS. 1, 6, and 10 to 12.
[Explanation of symbols]
  1, 11, 21, 31, 41, 61 ... text analysis unit,
  2, 12, 22, 32, 42, 62 ... analysis dictionary memory,
  3, 13, 23, 33, 43, 63 ... analysis storage unit,
  4, 15, 24, 34, 45, 64 ... phonological processing unit,
  5, 16, 25, 35, 46, 65 ... speech synthesis unit,
  6, 17, 26, 36, 47 ... instruction input unit,
  7, 14, 27, 37, 48, 68 ... position specifying unit,
  8, 19, 28, 38, 50, 70 ... control unit,
  18 ... result selection means,
  29, 51 ... speech synthesis information conversion unit,
  39 ... character sentence storage unit,
  40 ... character sentence conversion unit,
  44 ... result selection unit,
  49, 69 ... start position candidate selection means,
  66 ... speech recognition unit,
  67 ... speech recognition dictionary memory.

Claims

A speech synthesizer that analyzes input character string information with analysis means and synthesizes and outputs speech with speech synthesis means based on the analysis result, the speech synthesizer comprising:
instruction input means for inputting an instruction prompting re-utterance;
position specifying means for specifying, in response to an instruction from the instruction input means and based on the analysis result of the analysis means, a start position of the re-utterance in the character string information; and
control means for instructing the speech synthesis means to synthesize speech from the specified start position,
wherein the analysis result includes an inter-phrase coupling degree, and
the position specifying means specifies the phrase that becomes the start position, from among the phrases preceding the position in the character string information that corresponds to the time at which the instruction was received from the instruction input means, based on the number of morae and the inter-phrase coupling degree.

A speech synthesizer that analyzes input character string information with analysis means and synthesizes and outputs speech with speech synthesis means based on the analysis result, the speech synthesizer comprising:
instruction input means for inputting an instruction prompting re-utterance;
position specifying means for specifying, in response to an instruction from the instruction input means and based on the analysis result of the analysis means, a start position of the re-utterance in the character string information; and
control means for instructing the speech synthesis means to synthesize speech from the specified start position,
wherein the analysis result includes a likelihood, and
the position specifying means calculates, for the phrases preceding the position in the character string information that corresponds to the time at which the instruction was received from the instruction input means, a total likelihood while swapping the candidates of each analysis result, and specifies the start position of the re-utterance based on the resulting total likelihood.

The speech synthesizer according to claim 1,
wherein the analysis result includes a likelihood, and
the speech synthesizer further comprises result selection means for calculating, for the phrases preceding the position in the character string information that corresponds to the time at which the instruction was received from the instruction input means, a total likelihood while swapping the candidates of each analysis result, and for selecting the analysis result to be used for the re-utterance based on the resulting total likelihood.

The speech synthesizer according to claim 2,
wherein the analysis result includes, in addition to the likelihood, at least one of phonological clarity, word appearance probability, and word importance, and
the position specifying means specifies the start position of the re-utterance based on start position candidates for the re-utterance that are selected using, in addition to the likelihood of the analysis result, at least one of phonological clarity, word appearance probability, and word importance.

The speech synthesizer according to any one of claims 1 to 4,
wherein the speech synthesis means has speech synthesis information generation means for generating a speech synthesis information sequence based on the analysis result of the analysis means, and synthesizes speech based on the speech synthesis information sequence, and
the speech synthesizer further comprises speech synthesis information conversion means for converting, in accordance with an instruction from the control means, a predetermined speech synthesis information sequence after the start position of the re-utterance, within the speech synthesis information sequence generated by the speech synthesis information generation means, into another speech synthesis information sequence.

The speech synthesizer according to any one of claims 1 to 4,
wherein the analysis result includes reading sequence information, and
the speech synthesizer further comprises: character sentence storage means for storing characters and corresponding sentences that correspond to the characters; and
character sentence conversion means for converting, in accordance with an instruction from the control means, the reading sequence of a predetermined analysis result after the start position of the re-utterance, within the analysis result of the analysis means, into an analysis result sequence of the corresponding sentences stored in the character sentence storage means.

The speech synthesizer according to any one of claims 1 to 6,
wherein the instruction input means comprises speech recognition means for recognizing an instruction input by voice.

The speech synthesizer according to claim 7, further comprising:
a speech recognition dictionary that stores vocabulary that prompts re-utterance; and
an analysis dictionary that stores the relevance between words and the vocabulary that prompts re-utterance,
wherein the speech recognition means recognizes the vocabulary that prompts re-utterance using the speech recognition dictionary,
the analysis means uses the analysis dictionary to generate the analysis result including related information that indicates the presence or absence of relevance to the vocabulary that prompts re-utterance, and
the position specifying means specifies the start position of the re-utterance based on the related information from the analysis means.

A speech synthesis method for analyzing input character string information and synthesizing and outputting speech based on the analysis result, the method comprising:
a step of inputting an instruction prompting re-utterance;
a step of specifying, in response to the input instruction and based on the analysis result of the input character string information, a start position of the re-utterance in the character string information; and
a step of performing speech synthesis from the specified start position,
wherein the analysis result includes an inter-phrase coupling degree, and
in the step of specifying the start position, the phrase that becomes the start position is specified, from among the phrases preceding the position in the character string information that corresponds to the time at which the instruction was received from the instruction input means, based on the number of morae and the inter-phrase coupling degree.

A portable terminal device comprising the speech synthesizer according to any one of claims 1 to 8.

A computer-readable program recording medium on which is recorded a speech synthesis processing program that causes a computer to function as:
analysis means for analyzing input character string information;
speech synthesis means for synthesizing speech based on an analysis result of the analysis means, the analysis result including an inter-phrase coupling degree;
instruction input means for inputting an instruction prompting re-utterance;
position specifying means for specifying, in response to an instruction from the instruction input means and based on the analysis result of the analysis means, the phrase that becomes the start position of the re-utterance in the character string information, from among the phrases preceding the position in the character string information that corresponds to the time at which the instruction was received from the instruction input means, based on the number of morae and the inter-phrase coupling degree; and
control means for instructing the speech synthesis means to synthesize speech from the specified start position.