JP3862478B2 - Speech synthesis apparatus and speech synthesis method

Info

Publication number: JP3862478B2
Application number: JP2000158908A
Authority: JP
Inventors: 賢一郎中川; 隆麻生
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2000-05-29
Filing date: 2000-05-29
Publication date: 2006-12-27
Anticipated expiration: 2020-05-29
Also published as: JP2001337690A

Abstract

PROBLEM TO BE SOLVED: To provide an information reproducing apparatus that makes synthesized speech intelligible to people riding in a vehicle by controlling parameters of the synthesized speech according to the state of the vehicle and its surroundings. SOLUTION: First, the utterance content selection unit 102 selects and fetches the text data to be uttered from the utterance content storage memory 101 (S201). Based on the fetched text, the prosodic parameter generation unit 103 creates default prosodic parameters (S202). The current own vehicle speed S (per hour) is acquired from the own vehicle speed measurement unit 109 (S203). The default prosodic parameters are updated based on the value of S (S204-S209). Synthesized speech is generated by applying the phoneme data, acquired from the phoneme segment dictionary 106 according to the text selected by the utterance content selection unit 102, to the updated prosodic parameters (S210). The synthesized speech is output from the speech output device 108 (S211).

Description

【０００１】
【発明の属する技術分野】
本発明は、車両に搭載し、テキストに基づいて音声合成を行う音声合成装置および音声合成方法に関するものである。
【０００２】
【従来の技術】
従来のカーナビゲーション装置は、自車内のディスプレイに映し出された地図上に自車位置を表示させるものである。また、走行前、または走行中に目的地を設定することにより、目的地までの経路を検索し、ディスプレイ上の経路表示や音声案内により目的地までの経路を運転手に誘導することができる。
【０００３】
カーナビゲーション装置の音声案内は、実際の人間の案内音声を収録したものを再生する方式や、規則音声合成技術を用いたものもある。また、自車が曲がるべき交差点に近づくにつれ、自車の速度、道路の混雑状況を考慮し、経路案内のタイミングを変化させ、運転手への経路誘導の確実性を向上させたものもある。
【０００４】
【発明が解決しようとする課題】
音声案内で規則音声合成を用いている場合、音声を録音するよりも記憶領域が少なくてすむメリットがあるが、現状では人間の肉声に比べて聞き取りやすさが劣っており、車速が上がり、周囲の雑音が大きくなったときには、なおさら聞き取りにくくなる。しかし従来の合成音声生成部は、自車の周囲状況まで考慮して合成波形を作成することはなかったため、周囲状況にあった合成音声を生成することができなかった。
【０００５】
本発明は上述の問題点に対して鑑みたものであり、車両や周囲の状況に応じて合成音声に関するパラメータを制御することで、音声合成結果を車両に乗車している人に聞き取り可能にすることを目的とする。
【０００６】
【課題を解決するための手段】
本発明の目的を達成するために、例えば本発明の音声合成装置は以下の構成を備える。すなわち、テキストに対応する合成音声を生成する音声合成手段と、
道路の混雑状況を取得する取得手段と、
前記道路の混雑状況が第１の混雑状況である場合は前記合成音声の話者を変更し、前記道路の混雑状況が第２の混雑状況である場合は前記合成音声の韻律に関するパラメータである韻律パラメータを変更する変更手段と
を備えることを特徴とする。
本発明の目的を達成するために、例えば本発明の音声合成方法は以下の構成を備える。すなわち、テキストに対応する合成音声を生成する音声合成工程と、
道路の混雑状況を取得する取得工程と、
前記道路の混雑状況が第１の混雑状況である場合は前記合成音声の話者を変更する第１の変更工程と、
前記道路の混雑状況が第２の混雑状況である場合は前記合成音声の韻律に関するパラメータである韻律パラメータを変更する第２の変更工程と
を有することを特徴とする。
【０００７】
【発明の実施の形態】
以下添付図面に従って、本発明を好適な実施形態に従って詳細に説明する。なお、以下の実施形態では情報再生装置としてカーナビゲーション装置を用いた場合について説明する。
【０００８】
［第１の実施形態］
図１に本実施形態におけるカーナビゲーション装置の概略構成を示す。
【０００９】
１０１はカーナビゲーション装置が発声する内容のテキスト文である発声内容テキストを格納している発声内容格納メモリである。カーナビゲーション装置がアナウンスをする場合は、発声内容選択部１０２がこの発声内容格納メモリ１０１を参照し、発声すべき内容に該当する発声内容テキストを選択する。場合により、この発声内容は、カーナビゲーション装置が無線や携帯電話を通して接続された情報発信サーバからダウンロードした運転手へのメールや、ニュースなどでもよい。
【００１０】
発声内容選択部１０２で選択された発声内容テキストは、カーナビゲーション装置内の韻律パラメータ生成部１０３に送られる。
【００１１】
韻律パラメータ生成部１０３では発声内容が書かれたテキストから、音パワー（声の大きさ）、ピッチ周波数（声の高さ）、音韻時間長（声の速度）といった韻律パラメータ（合成音声パラメータ）を生成する。本来、これらのパラメータはアクセントの位置や強さが付加された発声内容テキストから生成することが可能であるが、本実施形態では、走行状態情報取得部１０５を介して得られた後述の情報も反映する。走行状態情報取得部１０５は、音声合成装置１０７外の自車速度測定部１０９、自車位置測定部１１０、混雑状況取得部１１１と接続している。
【００１２】
韻律パラメータ生成部１０３において作成された韻律パラメータは音素片接続部１０４に送られ、音素片接続部１０４はその韻律パラメータ通りに規則合成音の元となる音素データを、この音素データを格納している音素片辞書１０６から獲得し、獲得した音素データを接続していくことで、合成音声を生成する。
【００１３】
また、走行状態情報取得部１０５は音素片接続部１０４も制御しており、用いる音素データなどを変更する場合もある。音素データが変わることで、合成音の話者が変わったようになる。
【００１４】
音声合成装置１０７で合成音声が作成されると、そのデータはスピーカなどの音声出力装置１０８に送られて、音声出力される。
【００１５】
図２は、本実施形態におけるカーナビゲーション装置が上述の処理を行う際のフローチャートである。なお、同図に示したフローチャートに従ったプログラムコードは、本実施形態におけるカーナビゲーション装置内の不図示のＲＯＭやＲＡＭなどのメモリ内に格納され、ＣＰＵにより読み出され、実行される。その結果、本実施形態のカーナビゲーション装置は後述する各処理を行うことができる。
【００１６】
なお、ここでは例として、自車速度により韻律パラメータを変え、自車速度が早くなるに従って、大きく、はっきり、ゆっくりと発声する合成音を作成する処理について説明する。又、本フローチャートに従った処理が実行される前に、カーナビゲーション装置が発声すべき内容は決定しているものとする。
【００１７】
まず発声内容格納メモリ１０１から、発声内容選択部１０２により発声すべき内容に該当するテキストデータを選択、取り込む（ステップＳ２０１）。取り込んだテキストに基づいて韻律パラメータ生成部１０３によりデフォルトの韻律パラメータを作成する（ステップＳ２０２）。ここでの処理は一般の規則音声合成処理と同様である。
【００１８】
次に、自車速度測定部１０９から現在の自車速度（時速）Ｓを取得する（ステップＳ２０３）。このＳの値をもとに、先ほど決定したデフォルトの韻律パラメータを更新する（ステップＳ２０４〜ステップＳ２０９）。同図の例では、自車速度Ｓが上がるに連れて、大きく、ゆっくりとした合成音を出力する処理になっている。また、この例では、自車速度Ｓが８０Ｋｍ、５０ｋｍ、２０ｋｍで階段状の制御を行っているが、
韻律パラメータ＝α×Ｓ＋β （α，βは定数）
のように自車速度Ｓを変数とする関数で算出してもよい。
【００１９】
次に、音素片接続部１０４は発声内容選択部１０２が選択したテキスト文に従って音素片辞書１０６から必要な音素データを獲得し、自車速度Ｓに基づいて更新された上述の韻律パラメータ上に獲得した音素データを貼り付け、合成音声を生成する（ステップＳ２１０）。そして生成された合成音声は音声出力装置１０８に出力され、音声出力装置１０８から合成音声を出力する（ステップＳ２１１）。
【００２０】
以上の説明により、本実施形態におけるカーナビゲーション装置は、自車速度に基づいて合成音声パラメータとしての韻律パラメータを制御することが可能である。その結果、自車速度の増加に起因する例えばエンジン音の音量の増加により、アナウンスが聞こえにくい場合にでも、例えば合成音声の音量を増加させることで、よりアナウンスの内容が聞き取りやすくなる。
【００２１】
［第２の実施形態］
第１の実施形態では自車速度を、合成音声パラメータを制御するパラメータとして用いたが、本実施形態では、自車の周囲の状況（道路の混雑状況）を韻律パラメータを制御するパラメータとして用いる場合を説明する。なお本実施形態で用いるカーナビゲーション装置の構成は第１の実施形態で用いたものと同じものとする。
【００２２】
図３は、本実施形態におけるカーナビゲーション装置が行う処理のフローチャートである。なお、同図に示したフローチャートに従ったプログラムコードは、本実施形態におけるカーナビゲーション装置内の不図示のＲＯＭやＲＡＭなどのメモリ内に格納され、ＣＰＵにより読み出され、実行される。その結果、本実施形態のカーナビゲーション装置は後述する各処理を行うことができる。
【００２３】
なお、ここでは例として、デフォルトで男性音声の合成音が、道路の混雑状況によって落ち着いた合成音（女性音声の合成音）になる例を示す。又、第１の実施形態と同様、本フローチャートに従った処理が実行される前に、カーナビゲーション装置が発声すべき内容は決定しているものとする。
【００２４】
まず発声内容格納メモリ１０１から、発声内容選択部１０２により発声すべき内容に該当するテキストデータを選択、取り込む（ステップＳ３０１）。また、デフォルトの音素片辞書を男性のものに設定し（ステップＳ３０２）、発声内容選択部１０２が取り込んだテキストを用いて、韻律パラメータ生成部１０３によりデフォルトの韻律パラメータを作成する（ステップＳ３０３）。
【００２５】
次に、走行状態情報取得部１０５は、混雑状況取得部１１１により測定された現在の自車がいる道路の混雑状況を取得する（ステップＳ３０４）。ここで道路が渋滞していると判断された場合（ステップＳ３０５）、運転手の気持ちを解きほぐすためにデフォルト話者が女性の音声になるように、韻律パラメータ生成部１０３は上述のデフォルトの韻律パラメータを更新する（ステップＳ３０７）。具体的には、ピッチ周波数をあげることで、より女性らしい音声にする。
【００２６】
一般に男性の音声のピッチ周波数の帯域は大まかには８０から１６０Ｈｚ程度で、女性の音声のピッチ周波数の帯域は大まかには１２０〜２５０Ｈｚ程度で、平均的に見て、女性の音声のピッチ周波数は男性のそれよりも高い。よってデフォルトでは男性の話者の音声データなので、デフォルトのピッチ周波数を例えば１２０Ｈｚと設定すると、ステップＳ３０７で、女性の音声にするためには、ピッチ周波数を２００Ｈｚにあげる処理を行う。
【００２７】
また、道路が渋滞とまではいかなくても混雑していると判断された場合（ステップＳ３０６）、デフォルトの韻律パラメータを操作し（ステップＳ３０８）、更新する。本フローチャートの例では、ピッチ周波数を下げることで低い音声にし、音韻時間長を長くすることでゆっくりとアナウンスを行うようにし、音パワーを下げることで音量を下げる。その結果、ゆったりと落ち着いた音声に更新している。
【００２８】
次に、音素片接続部１０４は発声内容選択部１０２が選択したテキスト文に従って音素片辞書１０６から必要な音素データを獲得し、上述の通り更新された韻律パラメータ上に獲得した音素データを貼り付け、合成音声を生成する（ステップＳ３０９）。そして生成された合成音声は音声出力装置１０８に出力され、音声出力装置１０８から合成音声を出力する（ステップＳ３１０）。
【００２９】
以上の説明により、本実施形態におけるカーナビゲーション装置は、自車の周囲の状況に応じて韻律パラメータを変更することができる。その結果、自車が混雑した道路を運転している場合には、落ち着いた音声を、渋滞した道路では女性の音声を聞かせることが可能となる。
【００３０】
［第３の実施形態］
本実施形態では、次の経路案内をしなければならない地点への到達時間を考慮し、その時間内にテキストの読み上げが終わるように、韻律パラメータを更新する場合について説明する。なお本実施形態で用いるカーナビゲーション装置は第１の実施形態で用いたものと同じものとする。
【００３１】
本実施形態のカーナビゲーション装置が動作する状況を図５に示す。５０１は自車で、同図の時点では速度ｖの速度で走っているものとする。５０２は上述の次の経路案内をしなければならない地点で、自車５０１は地点５０２に着くまでに、後述するテキスト文を読み終えなければいけない。なお同図で示した時点では自車５０１の位置から地点５０２まではＬの距離があるものとする。
【００３２】
図４は、本実施形態におけるカーナビゲーション装置が行う処理のフローチャートである。なお、同図に示したフローチャートに従ったプログラムコードは、本実施形態におけるカーナビゲーション装置内の不図示のＲＯＭやＲＡＭなどのメモリ内に格納され、ＣＰＵにより読み出され、実行される。その結果、本実施形態のカーナビゲーション装置は後述する各処理を行うことができる。
【００３３】
又本フローチャートに従った処理は、ある一つの経路案内文発声（例えば「その交差点を右に曲がってください」など）が終了した時点で呼び出される。まず、自車速度測定部１０９，自車位置測定部１１０により夫々自車速度、自車位置が測定され、走行状態情報取得部１０５によりこの自車速度、自車位置を取得する（ステップＳ４０１）。その結果、走行状態情報取得部１０５は次の経路案内を行う地点（例えば図５における地点５０２）に自車が到着するまでの時間Ｔを推定する（ステップＳ４０２）。時間Ｔは、次の式で推定可能である。
【００３４】
Ｔ＝次の経路案内をする地点までの距離／自車速度
さらに現在の自車速度ではなく、現在までの数分間の自車速度の平均値を用いることで、更に信頼度の高い時間Ｔの推定値となる。
【００３５】
ここで時間Ｔの値が十分に大きくなければ（例では時間Ｔが１０分以下であれば）、本処理を終了する（ステップＳ４０３）。一方、時間Ｔの値が１０分以上であれば、発声内容選択部１０２は時間Ｔ以内に読み上げ可能な未読の運転手へのメールやニュース（のテキスト文）を選択する（ステップＳ４０４）。つまり、カーナビゲーション装置（の発声内容格納メモリ１０１）と無線ネットワークでつながった情報発信サーバに格納した上述のメールやニュースのテキスト文を参照し、選択してもよいし、既にこの情報発信サーバにアクセスして、カーナビゲーション装置内の発声内容格納メモリ１０１にダウンロードした上述のメールやニュースのテキスト文を参照し、選択してもよい。また、時間Ｔ以内に読み上げ可能かどうかは、テキスト文を構成する音節数と各音節を発声する際に要する時間とを獲得し、そのテキスト文を読み上げる時間を（音節数×各音節を発声する際に要する時間）と演算し、その結果が時間Ｔ以内か否かを判定する必要がある。
【００３６】
テキスト全体の音節数については、上述の情報発信サーバ内で格納しているメールやニュースのテキスト文を構成している音節の数を予め測定しておき、測定した音節数をこのテキストに添付しておいて、カーナビゲーション装置でこのテキストを参照もしくはダウンロードする際に、この音節数を獲得する。その他にも、発声内容選択部１０２において、形態素解析などを用いてテキスト文の読み方を決定し、この読み方から音節数を数えてもよい。又、各音節を発声する際の要する時間は一定とする。
【００３７】
図６に、発声すべき内容を発声する際に要する時間の算出方法を示す。
【００３８】
６０１は発声すべき内容で、ここでは例として「あかさか」という言葉を用いる。６０１ａ〜６０１ｄは発声すべき内容７０１を構成する各音節である。各音節を発声する際に要する時間は夫々ｔ1，ｔ2，ｔ3，ｔ4であり、夫々の時間（ｔ1、ｔ2、ｔ3、ｔ4）は、予め測定された厳密な値であってもよいし、全て同じ平均的な値であってもよい。そしてその結果、発声すべき内容６０１を発声する際に要する（推定）時間は（ｔ1＋ｔ2＋ｔ3＋ｔ4）となる。
【００３９】
また、上述の参照したメールやニュースが未読か否かを判断するには、各メールやニュースに未読フラグを添付することで判断可能である。図７にメールを例として、各メールにこの未読フラグを添付したテーブルの構成を示す。７０１は各メールに添付された未読フラグの項目で、この未読フラグが０のメールはすでに読んだメールで、未読フラグが１のメールは未読であることを示す。７０２は各メールの内容が記載されている項目である。なおこのテーブルは、上述の情報発信サーバに格納しておき、カーナビゲーション装置が情報発信サーバ内のメールを参照する際にはこのテーブルを参照することになる。なお、この未読フラグはメールが情報発信サーバに到着したときに１に設定され、メールの内容と共にこのテーブルに付け加える。
【００４０】
以上のテーブルの構成はメールを例として説明したが、ニュースであってもそのテーブルの構成は同じで、項目７０２の部分をニュースの内容が記載された項目とすることでニュースを例とするテーブルとすることができる。なお、ステップＳ４０４において、選択した（メールの内容の、もしくはニュースの内容の）テキスト文に対応する未読フラグを０に設定する。
【００４１】
次に、読み上げようとするテキスト文（文章）がまだあるか否かを判断し（ステップＳ４０５）、読み上げるテキスト文がもう無ければ、本フローチャートに従った処理を終了する。一方、読み上げるテキスト文がまだある場合、韻律パラメータ生成部１０３は発声内容選択部１０１が上述の通り選択したテキスト文から１文を取り込み（ステップＳ４０６）、また再び自車速度測定部１０９，自車位置測定部１１０より夫々自車速度、自車位置を測定、走行状態情報取得部１０５で取得し（ステップＳ４０７）、走行状態情報取得部１０５において次の経路案内までの時間Ｔを再推定する（ステップＳ４０８）。ここでも自車速度のかわりに、現在まで数分間の自車速度の平均値を用いることで、Ｔの精度が向上する。また、１文を読み上げるごとにこの再推定を繰り返すことにより、文章を読み上げ始めてから車速が急に変化した場合にも対応が可能である。
【００４２】
韻律パラメータ生成部１０３は取り込んだ１文のテキストに、デフォルトの韻律パラメータを設定し（ステップＳ４０９）、設定した韻律パラメータを、Ｔの値に応じて更新する（ステップＳ４１０〜ステップＳ４１３）。同図の例では、音韻時間長を操作しているが、これは次の経路案内までの時間が少ないときに、なるべく早口になることに相当する。
【００４３】
音素片接続部１０４では韻律パラメータ生成部１０３において得られた韻律パラメータを用いて、第１，２の実施形態と同じようにして合成音声を生成し、この合成音声は音声出力装置１０８で出力される（ステップＳ４１４）。
【００４４】
以上の説明により、本実施形態におけるカーナビゲーション装置は、次の経路案内をしなければならない地点への到達時間を考慮して、その時間内にテキストを読み上げ終わるように合成音声パラメータとして韻律パラメータを更新する。その結果、テキストを読み上げる速度を上述の到着時間に応じて変更することができ、上述の到着時間内にこのテキストを読み上げることが可能である。
【００４５】
［他の実施形態］
なお、本発明は、複数の機器（例えばホストコンピュータ、インタフェイス機器、リーダ、プリンタなど）から構成されるシステムに適用しても、一つの機器からなる装置（例えば、複写機、ファクシミリ装置など）に適用してもよい。
【００４６】
また、本発明の目的は、前述した実施形態の機能を実現するソフトウェアのプログラムコードを記録した記憶媒体（または記録媒体）を、システムあるいは装置に供給し、そのシステムあるいは装置のコンピュータ（またはＣＰＵやＭＰＵ）が記憶媒体に格納されたプログラムコードを読み出し実行することによっても、達成されることは言うまでもない。この場合、記憶媒体から読み出されたプログラムコード自体が前述した実施形態の機能を実現することになり、そのプログラムコードを記憶した記憶媒体は本発明を構成することになる。また、コンピュータが読み出したプログラムコードを実行することにより、前述した実施形態の機能が実現されるだけでなく、そのプログラムコードの指示に基づき、コンピュータ上で稼働しているオペレーティングシステム（ＯＳ）などが実際の処理の一部または全部を行い、その処理によって前述した実施形態の機能が実現される場合も含まれることは言うまでもない。
【００４７】
さらに、記憶媒体から読み出されたプログラムコードが、コンピュータに挿入された機能拡張カードやコンピュータに接続された機能拡張ユニットに備わるメモリに書込まれた後、そのプログラムコードの指示に基づき、その機能拡張カードや機能拡張ユニットに備わるＣＰＵなどが実際の処理の一部または全部を行い、その処理によって前述した実施形態の機能が実現される場合も含まれることは言うまでもない。
【００４８】
本発明を上記記憶媒体に適用する場合、その記憶媒体には、先に説明した（図２又は３又は４に示す）フローチャートに対応するプログラムコードが格納されることになる。
【００４９】
【発明の効果】
車両や周囲の状況に応じて合成音声に関するパラメータを制御することで、音声合成結果を車両に乗車している人に聞き取り可能にする効果がある。
【図面の簡単な説明】
【図１】本発明の第１の実施形態におけるカーナビゲーション装置の概略構成を示す図である。
【図２】本発明の第１の実施形態におけるカーナビゲーション装置が行う処理のフローチャートである。
【図３】本発明の第２の実施形態におけるカーナビゲーション装置が行う処理のフローチャートである。
【図４】本発明の第３の実施形態におけるカーナビゲーション装置が行う処理のフローチャートである。
【図５】本発明の第３の実施形態におけるカーナビゲーション装置が動作する状況を示す図である。
【図６】発声すべき内容を発声する際に要する時間の算出方法を示す図である。
【図７】メールを例として、各メールに未読フラグを添付したテーブルの構成を示す図である。
[0001]
BACKGROUND OF THE INVENTION
The present invention relates to a speech synthesizer and a speech synthesis method that are mounted on a vehicle and perform speech synthesis based on text.
[0002]
[Prior art]
A conventional car navigation device displays the position of the vehicle on a map displayed on a display in the vehicle. Further, by setting the destination before or during traveling, the route to the destination can be searched, and the route to the destination can be guided to the driver by displaying the route on the display or by voice guidance.
[0003]
The voice guidance of a car navigation device may reproduce recordings of actual human guidance speech or may use rule-based speech synthesis. In addition, some devices change the timing of route guidance as the vehicle approaches the intersection at which it should turn, taking the vehicle speed and road congestion into account, thereby making route guidance to the driver more reliable.
[0004]
[Problems to be solved by the invention]
When rule-based speech synthesis is used for voice guidance, it has the advantage of requiring less storage than recorded speech. At present, however, it is less intelligible than a natural human voice, and it becomes even harder to hear as the vehicle speed rises and the ambient noise grows louder. Moreover, because conventional synthesized speech generation units did not take the surroundings of the vehicle into account when creating the synthesized waveform, they could not generate synthesized speech suited to those surroundings.
[0005]
The present invention has been made in view of the above problems, and its object is to make the speech synthesis result audible to people riding in a vehicle by controlling parameters related to the synthesized speech according to the state of the vehicle and its surroundings.
[0006]
[Means for Solving the Problems]
In order to achieve the object of the present invention, for example, a speech synthesis apparatus of the present invention has the following configuration. That is, it comprises:
speech synthesis means for generating synthesized speech corresponding to text;
acquisition means for acquiring a road congestion state; and
changing means for changing the speaker of the synthesized speech when the road congestion state is a first congestion state, and for changing a prosodic parameter, which is a parameter related to the prosody of the synthesized speech, when the road congestion state is a second congestion state.
In order to achieve the object of the present invention, for example, a speech synthesis method of the present invention has the following configuration. That is, it comprises:
a speech synthesis step of generating synthesized speech corresponding to text;
an acquisition step of acquiring a road congestion state;
a first changing step of changing the speaker of the synthesized speech when the road congestion state is a first congestion state; and
a second changing step of changing a prosodic parameter, which is a parameter related to the prosody of the synthesized speech, when the road congestion state is a second congestion state.
[0007]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, the present invention will be described in detail according to preferred embodiments with reference to the accompanying drawings. In the following embodiment, a case where a car navigation device is used as the information reproducing device will be described.
[0008]
[First Embodiment]
FIG. 1 shows a schematic configuration of a car navigation apparatus according to the present embodiment.
[0009]
An utterance content storage memory 101 stores an utterance content text which is a text sentence of the content uttered by the car navigation device. When the car navigation device makes an announcement, the utterance content selection unit 102 refers to the utterance content storage memory 101 and selects the utterance content text corresponding to the content to be uttered. Depending on circumstances, the content of the utterance may be an email to the driver downloaded from an information transmission server to which the car navigation device is connected through radio or a mobile phone, news, and the like.
[0010]
The utterance content text selected by the utterance content selection unit 102 is sent to the prosodic parameter generation unit 103 in the car navigation device.
[0011]
The prosodic parameter generation unit 103 generates prosodic parameters (synthesized speech parameters) such as sound power (loudness of the voice), pitch frequency (pitch of the voice), and phoneme duration (speed of the voice) from the text of the utterance content. Ordinarily, these parameters can be generated from utterance content text annotated with accent positions and strengths, but in this embodiment they also reflect the information, described later, obtained via the driving state information acquisition unit 105. The driving state information acquisition unit 105 is connected to the own vehicle speed measurement unit 109, the own vehicle position measurement unit 110, and the congestion state acquisition unit 111, all of which are outside the speech synthesizer 107.
[0012]
The prosodic parameters created by the prosodic parameter generation unit 103 are sent to the phoneme segment connection unit 104. In accordance with those prosodic parameters, the phoneme segment connection unit 104 acquires the phoneme data from which the rule-based synthesized sound is built from the phoneme segment dictionary 106, which stores that data, and generates synthesized speech by concatenating the acquired phoneme data.
[0013]
The driving state information acquisition unit 105 also controls the phoneme segment connection unit 104 and may change, for example, the phoneme data that is used. Changing the phoneme data makes it sound as if the speaker of the synthesized speech has changed.
[0014]
When the synthesized speech is created by the speech synthesizer 107, the data is sent to the speech output device 108 such as a speaker and outputted as speech.
[0015]
FIG. 2 is a flowchart when the car navigation apparatus according to this embodiment performs the above-described processing. The program code according to the flowchart shown in the figure is stored in a memory such as a ROM or a RAM (not shown) in the car navigation apparatus according to this embodiment, and is read and executed by the CPU. As a result, the car navigation device of the present embodiment can perform each process described below.
[0016]
Here, as an example, a process for changing a prosodic parameter according to the own vehicle speed and creating a synthesized sound that is uttered louder, clearly and slowly as the own vehicle speed increases will be described. Further, it is assumed that the content to be uttered by the car navigation device is determined before the processing according to this flowchart is executed.
[0017]
First, text data corresponding to the content to be uttered is selected and fetched by the utterance content selection unit 102 from the utterance content storage memory 101 (step S201). The prosodic parameter generation unit 103 creates default prosodic parameters based on the fetched text (step S202). The processing here is the same as in ordinary rule-based speech synthesis.
[0018]
Next, the current own vehicle speed S (per hour) is acquired from the own vehicle speed measurement unit 109 (step S203). Based on the value of S, the previously determined default prosodic parameters are updated (steps S204 to S209). In the example shown in the figure, the synthesized speech becomes louder and slower as the vehicle speed S increases. In this example, the control is stepwise, with thresholds at vehicle speeds S of 80 km/h, 50 km/h, and 20 km/h, but the prosodic parameter may instead be calculated as a function of the vehicle speed S, for example:
Prosodic parameter = α × S + β (α and β are constants)
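As an illustration only, the stepwise control and the linear alternative could be sketched in Python as follows. The `Prosody` class, the gain values in `STEPS`, and the use of `>=` at each threshold are assumptions made for this sketch; the text gives only the thresholds (80, 50, and 20 km/h) and the form α × S + β.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Prosody:
    """Hypothetical container for the three prosodic parameters named in the text."""
    power: float      # sound power (loudness), relative to the default 1.0
    pitch_hz: float   # pitch frequency in Hz
    duration: float   # phoneme duration scale; larger means slower speech


DEFAULT = Prosody(power=1.0, pitch_hz=120.0, duration=1.0)

# (speed threshold in km/h, power gain, duration gain), highest threshold first.
# The thresholds come from the text; the gains are illustrative assumptions.
STEPS = [(80.0, 1.6, 1.3), (50.0, 1.4, 1.2), (20.0, 1.2, 1.1)]


def stepwise_update(default: Prosody, speed_kmh: float) -> Prosody:
    """Return louder, slower prosody for higher speeds (steps S204 to S209)."""
    for threshold, power_gain, duration_gain in STEPS:
        if speed_kmh >= threshold:
            return replace(
                default,
                power=default.power * power_gain,
                duration=default.duration * duration_gain,
            )
    return default  # below the lowest threshold: keep the defaults


def linear_update(speed_kmh: float, alpha: float, beta: float) -> float:
    """Continuous alternative: parameter = alpha * S + beta."""
    return alpha * speed_kmh + beta
```

For example, `stepwise_update(DEFAULT, 60.0)` falls in the 50 km/h band, while `linear_update` would be called once per parameter with its own α and β.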
[0019]
Next, the phoneme segment connection unit 104 acquires the necessary phoneme data from the phoneme segment dictionary 106 according to the text selected by the utterance content selection unit 102, and applies the acquired phoneme data to the prosodic parameters updated on the basis of the vehicle speed S, thereby generating synthesized speech (step S210). The generated synthesized speech is sent to the speech output device 108, which outputs it (step S211).
[0020]
As described above, the car navigation apparatus according to the present embodiment can control the prosodic parameters as the synthesized speech parameters based on the own vehicle speed. As a result, even if the announcement is difficult to hear due to, for example, an increase in the volume of the engine sound due to an increase in the vehicle speed, the content of the announcement can be heard more easily by increasing the volume of the synthesized speech, for example.
[0021]
[Second Embodiment]
In the first embodiment, the own vehicle speed was used as the parameter for controlling the synthesized speech parameters. In this embodiment, a case is described in which the situation around the vehicle (the road congestion state) is used as the parameter for controlling the prosodic parameters. The configuration of the car navigation device used in this embodiment is the same as that used in the first embodiment.
[0022]
FIG. 3 is a flowchart of processing performed by the car navigation device according to the present embodiment. The program code according to the flowchart shown in the figure is stored in a memory such as a ROM or a RAM (not shown) in the car navigation apparatus according to this embodiment, and is read and executed by the CPU. As a result, the car navigation device of the present embodiment can perform each process described below.
[0023]
Here, as an example, the synthesized voice, which is male by default, is changed to a calmer synthesized voice (a female synthesized voice) depending on the road congestion state. As in the first embodiment, it is assumed that the content to be uttered by the car navigation device has been determined before the processing according to this flowchart is executed.
[0024]
First, text data corresponding to the content to be uttered is selected and fetched by the utterance content selection unit 102 from the utterance content storage memory 101 (step S301). The default phoneme segment dictionary is set to the male one (step S302), and the prosodic parameter generation unit 103 creates default prosodic parameters using the text fetched by the utterance content selection unit 102 (step S303).
[0025]
Next, the driving state information acquisition unit 105 acquires the current congestion state of the road the vehicle is on, as measured by the congestion state acquisition unit 111 (step S304). If it is determined that the road is in a traffic jam (step S305), the prosodic parameter generation unit 103 updates the default prosodic parameters described above so that the default speaker becomes a female voice, in order to put the driver at ease (step S307). Specifically, the pitch frequency is raised to make the voice sound more feminine.
[0026]
In general, the pitch frequency of a male voice lies roughly in the range of 80 to 160 Hz and that of a female voice roughly in the range of 120 to 250 Hz, so on average the pitch frequency of a female voice is higher than that of a male voice. Therefore, since the default is the voice data of a male speaker, if the default pitch frequency is set to 120 Hz, for example, step S307 raises the pitch frequency to 200 Hz to obtain a female voice.
[0027]
If it is determined that the road is congested, though not to the point of a traffic jam (step S306), the default prosodic parameters are manipulated and updated (step S308). In the example of this flowchart, the pitch frequency is lowered to give a lower voice, the phoneme duration is lengthened so that the announcement is spoken slowly, and the sound power is lowered to reduce the volume. As a result, the voice is updated to a relaxed, calm one.
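The branching of steps S305 to S308 can be sketched as below. This is a minimal sketch, not the patented implementation: the `Congestion` enum, the `Prosody` class, and the scaling factors in the merely congested branch are hypothetical; only the 120 Hz default and the 200 Hz target come from the text.

```python
from dataclasses import dataclass, replace
from enum import Enum


class Congestion(Enum):
    """Hypothetical encoding of the road state reported by unit 111."""
    CLEAR = 0
    CONGESTED = 1  # busy, but not a traffic jam (step S306)
    JAMMED = 2     # traffic jam (step S305)


@dataclass(frozen=True)
class Prosody:
    power: float      # sound power (loudness), relative to the default 1.0
    pitch_hz: float   # pitch frequency in Hz
    duration: float   # phoneme duration scale; larger means slower speech


FEMALE_PITCH_HZ = 200.0  # target pitch given in the text for step S307


def update_for_congestion(default: Prosody, state: Congestion) -> Prosody:
    """Apply the second embodiment's update to the default prosody."""
    if state is Congestion.JAMMED:
        # S307: raise the pitch so the voice sounds female.
        return replace(default, pitch_hz=FEMALE_PITCH_HZ)
    if state is Congestion.CONGESTED:
        # S308: lower pitch, slower speech, lower volume.
        # The factors 0.9, 1.2 and 0.8 are assumptions for illustration.
        return replace(
            default,
            pitch_hz=default.pitch_hz * 0.9,
            duration=default.duration * 1.2,
            power=default.power * 0.8,
        )
    return default
```

Keeping the two branches in one function mirrors the single changing means of the claims, which either changes the speaker or changes the prosodic parameters depending on the congestion state.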
[0028]
Next, the phoneme segment connection unit 104 acquires the necessary phoneme data from the phoneme segment dictionary 106 according to the text selected by the utterance content selection unit 102, and applies the acquired phoneme data to the prosodic parameters updated as described above, thereby generating synthesized speech (step S309). The generated synthesized speech is sent to the speech output device 108, which outputs it (step S310).
[0029]
As described above, the car navigation device according to this embodiment can change the prosodic parameters according to the situation around the vehicle. As a result, the driver can be given a calm voice when driving on a congested road and a female voice when in a traffic jam.
[0030]
[Third Embodiment]
This embodiment describes a case in which the prosodic parameters are updated, taking into account the time remaining until the vehicle reaches the point where the next route guidance must be given, so that the reading of a text finishes within that time. The car navigation device used in this embodiment is the same as that used in the first embodiment.
[0031]
FIG. 5 shows a situation in which the car navigation device of this embodiment operates. Reference numeral 501 denotes the own vehicle, which is assumed to be traveling at speed v at the moment shown in the figure. Reference numeral 502 denotes the point where the next route guidance must be given; the reading of the text described later must be finished before the own vehicle 501 reaches the point 502. At the moment shown in the figure, the distance from the position of the own vehicle 501 to the point 502 is assumed to be L.
[0032]
FIG. 4 is a flowchart of processing performed by the car navigation device according to this embodiment. The program code according to the flowchart shown in the figure is stored in a memory such as a ROM or a RAM (not shown) in the car navigation apparatus according to this embodiment, and is read and executed by the CPU. As a result, the car navigation device of the present embodiment can perform each process described below.
[0033]
The process according to this flowchart is invoked when the utterance of one route guidance sentence (for example, “Please turn right at that intersection”) has finished. First, the own vehicle speed measurement unit 109 and the own vehicle position measurement unit 110 measure the vehicle speed and the vehicle position, respectively, and the driving state information acquisition unit 105 acquires them (step S401). From these, the driving state information acquisition unit 105 estimates the time T until the vehicle arrives at the point where the next route guidance is given (for example, the point 502 in FIG. 5) (step S402). The time T can be estimated by the following equation.
[0034]
T = (distance to the point where the next route guidance is given) / (own vehicle speed)
Furthermore, using the average of the own vehicle speed over the last several minutes, instead of the current own vehicle speed, gives a more reliable estimate of the time T.
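A minimal Python sketch of this estimate, using the averaged speed, might look as follows. The function name, the units (metres and metres per second), and the choice to return `None` when the vehicle is effectively stationary are assumptions; the text does not say how a zero speed is handled.

```python
from statistics import fmean
from typing import Optional, Sequence


def estimate_time_to_guidance(
    distance_m: float, recent_speeds_mps: Sequence[float]
) -> Optional[float]:
    """Estimate T in seconds as distance divided by the average recent speed.

    recent_speeds_mps holds speed samples from the last few minutes; passing a
    single sample reproduces the plain T = distance / current speed estimate.
    """
    if not recent_speeds_mps:
        return None  # no measurement available
    average_speed = fmean(recent_speeds_mps)
    if average_speed <= 0.0:
        return None  # stationary: arrival time cannot be estimated this way
    return distance_m / average_speed
```

With 12 km to go and samples of 15 m/s and 25 m/s, the average is 20 m/s and the estimate is 600 seconds.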
[0035]
If the value of the time T is not sufficiently large (in this example, if T is 10 minutes or less), this process ends (step S403). On the other hand, if T is 10 minutes or more, the utterance content selection unit 102 selects unread e-mail addressed to the driver or unread news (the text thereof) that can be read aloud within the time T (step S404). The selection may be made by referring to the mail or news text stored on an information transmission server connected to the car navigation device (its utterance content storage memory 101) over a wireless network, or by referring to mail or news text that has already been downloaded from the information transmission server into the utterance content storage memory 101 in the car navigation device. To decide whether a text can be read aloud within the time T, the number of syllables in the text and the time required to utter each syllable are obtained, the time needed to read the text is calculated as (number of syllables × time required to utter each syllable), and the result is checked against the time T.
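The two checks in this paragraph, the 10-minute gate of step S403 and the reading-time test, can be sketched as follows. The function names are hypothetical, and treating exactly 10 minutes as sufficient is an arbitrary choice made here, because the two conditions in the text overlap at exactly 10 minutes.

```python
MIN_FREE_TIME_S = 10 * 60  # the 10-minute threshold used in the example


def reading_time_s(syllable_count: int, seconds_per_syllable: float) -> float:
    """Reading time = number of syllables x time per syllable."""
    return syllable_count * seconds_per_syllable


def can_read_within(
    syllable_count: int, seconds_per_syllable: float, time_t_s: float
) -> bool:
    """Return True if the text may be selected for reading within time T."""
    if time_t_s < MIN_FREE_TIME_S:
        return False  # S403: not enough time before the next guidance
    return reading_time_s(syllable_count, seconds_per_syllable) <= time_t_s
```

For instance, a 2000-syllable text at 0.25 s per syllable needs 500 s, so it qualifies when T is 600 s but not when T is 300 s.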
[0036]
The number of syllables in the whole text is obtained as follows: the number of syllables in each mail or news text stored on the information transmission server is counted in advance and attached to the text, and the car navigation device acquires this count when it refers to or downloads the text. Alternatively, the utterance content selection unit 102 may determine the reading of the text using morphological analysis or the like and count the syllables from that reading. The time required to utter each syllable is assumed to be constant.
[0037]
FIG. 6 shows a method for calculating the time required to utter the content to be uttered.
[0038]
Reference numeral 601 denotes the content to be uttered; here the word “akasaka” is used as an example. Reference numerals 601a to 601d denote the syllables that make up the content 601 to be uttered. The times required to utter the syllables are t1, t2, t3, and t4, respectively; each of these times may be an exact value measured in advance, or all may be set to the same average value. The (estimated) time required to utter the content 601 is then (t1 + t2 + t3 + t4).
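The summation for “akasaka” can be written out as a short sketch. The per-syllable durations in the usage note are invented for illustration, and the `default_s` fallback stands in for the "same average value" option; neither figure comes from the text.

```python
from typing import Mapping, Sequence


def utterance_time_s(
    syllables: Sequence[str],
    measured_s: Mapping[str, float],
    default_s: float = 0.15,
) -> float:
    """Sum the per-syllable times t1 + t2 + ... for the content to be uttered.

    Syllables with a measured duration use it; all others fall back to the
    shared average default_s (an assumed value).
    """
    return sum(measured_s.get(syllable, default_s) for syllable in syllables)
```

Calling `utterance_time_s(["a", "ka", "sa", "ka"], {"a": 0.25, "ka": 0.5, "sa": 0.25})` adds 0.25 + 0.5 + 0.25 + 0.5, and passing an empty mapping makes every syllable use the same average value.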
[0039]
In addition, in order to determine whether or not the above-described mail or news is unread, it can be determined by attaching an unread flag to each mail or news. FIG. 7 shows the structure of a table in which this unread flag is attached to each mail, taking mail as an example. Reference numeral 701 denotes an unread flag item attached to each mail. The mail with the unread flag set to 0 indicates that the mail has been read, and the mail with the unread flag set to 1 indicates that the mail has not been read. Reference numeral 702 denotes an item in which the contents of each mail are described. This table is stored in the information transmission server described above, and this table is referred to when the car navigation apparatus refers to the mail in the information transmission server. This unread flag is set to 1 when mail arrives at the information transmission server, and is added to this table together with the contents of the mail.
[0040]
The table structure above has been described using mail as an example, but the structure is the same for news: a table for news is obtained by making item 702 the item that holds the content of the news. In step S404, the unread flag corresponding to the selected text (the content of the mail or of the news) is set to 0.
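One way to picture the table of FIG. 7 and the flag handling of step S404 is the sketch below. The `Message` class, its `syllable_count` field, and the greedy first-fit selection of several messages are assumptions; the text states only that unread items readable within T are selected and that their flags are then set to 0.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Message:
    unread: int          # item 701: 1 = unread, 0 = already read
    text: str            # item 702: content of the mail or news
    syllable_count: int  # assumed to be attached by the server


def select_unread(
    table: List[Message], time_t_s: float, seconds_per_syllable: float
) -> List[Message]:
    """Pick unread messages that fit in the remaining time (step S404)."""
    selected = []
    remaining_s = time_t_s
    for message in table:
        needed_s = message.syllable_count * seconds_per_syllable
        if message.unread == 1 and needed_s <= remaining_s:
            message.unread = 0  # mark as read once selected
            selected.append(message)
            remaining_s -= needed_s
    return selected
```

Clearing the flag at selection time, rather than after playback, follows the wording of step S404; a real system might prefer to clear it only after the message has actually been read out.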
[0041]
Next, it is determined whether any text (sentences) remains to be read aloud (step S405). If none remains, the process according to this flowchart ends. If text remains, the prosodic parameter generation unit 103 takes in one sentence from the text selected as described above by the utterance content selection unit 102 (step S406); the own vehicle speed measurement unit 109 and the own vehicle position measurement unit 110 again measure the vehicle speed and position, which the driving state information acquisition unit 105 acquires (step S407); and the driving state information acquisition unit 105 re-estimates the time T until the next route guidance (step S408). Here too, using the average vehicle speed over the last several minutes instead of the current vehicle speed improves the accuracy of T. Repeating this re-estimation each time a sentence is read also makes it possible to cope with a sudden change in vehicle speed after reading has begun.
[0042]
The prosodic parameter generation unit 103 sets default prosodic parameters for the one sentence of text it has taken in (step S409) and updates them according to the value of T (steps S410 to S413). In the example in the figure, the phoneme duration is manipulated; this corresponds to speaking as quickly as possible when little time remains until the next route guidance.
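The thresholds used in steps S410 to S413 appear only in the figure, so the following sketch substitutes a simple continuous rule: shrink the phoneme duration just enough for the remaining text to fit in T, but never below a fastest allowed rate. The function name, the `fastest` limit of 0.6, and the rule itself are assumptions, not the flowchart's actual logic.

```python
def duration_scale(
    time_t_s: float, default_reading_s: float, fastest: float = 0.6
) -> float:
    """Return a multiplier for the default phoneme duration.

    1.0 keeps the default speaking rate; smaller values mean faster speech.
    default_reading_s is the time the remaining text needs at the default rate.
    """
    if default_reading_s <= 0.0:
        return 1.0  # nothing left to read
    if time_t_s <= 0.0:
        return fastest  # no time left: speak as fast as allowed
    return max(fastest, min(1.0, time_t_s / default_reading_s))
```

Because T is re-estimated before every sentence, a rule like this would be re-evaluated per sentence, so a sudden change in vehicle speed changes the speaking rate from the next sentence onward.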
[0043]
The phoneme segment connection unit 104 uses the prosodic parameters obtained by the prosodic parameter generation unit 103 to generate synthesized speech in the same way as in the first and second embodiments, and the speech output device 108 outputs this synthesized speech (step S414).
[0044]
As described above, the car navigation device according to this embodiment takes into account the time remaining until the vehicle reaches the point where the next route guidance must be given, and updates the prosodic parameters, as synthesized speech parameters, so that the text is read out within that time. As a result, the speed at which the text is read can be changed according to the arrival time, and the text can be read out before the vehicle arrives.
[0045]
[Other Embodiments]
Note that the present invention may be applied to a system comprising a plurality of devices (for example, a host computer, an interface device, a reader, and a printer) or to an apparatus consisting of a single device (for example, a copying machine or a facsimile machine).
[0046]
Needless to say, the object of the present invention can also be achieved by supplying a system or apparatus with a storage medium (or recording medium) on which the program code of software that realizes the functions of the above-described embodiments is recorded, and by having a computer (or CPU or MPU) of the system or apparatus read and execute the program code stored in the storage medium. In this case, the program code itself read from the storage medium realizes the functions of the above-described embodiments, and the storage medium storing the program code constitutes the present invention. It also goes without saying that the invention covers not only the case where the functions of the above-described embodiments are realized by the computer executing the read program code, but also the case where an operating system (OS) or the like running on the computer performs part or all of the actual processing based on the instructions of the program code, and the functions of the above-described embodiments are realized by that processing.
[0047]
Furthermore, it goes without saying that the invention also covers the case where, after the program code read from the storage medium is written into a memory provided in a function expansion card inserted into the computer or in a function expansion unit connected to the computer, a CPU or the like provided in the function expansion card or function expansion unit performs part or all of the actual processing based on the instructions of the program code, and the functions of the above-described embodiments are realized by that processing.
[0048]
When the present invention is applied to the above-described storage medium, the storage medium stores program code corresponding to the above-described flowcharts (shown in FIG. 2, 3, or 4).
[0049]
[Effect of the Invention]
Controlling the parameters of the synthesized speech according to the conditions of the vehicle and its surroundings has the effect of allowing a person riding in the vehicle to hear and understand the synthesized speech.
[Brief Description of the Drawings]
FIG. 1 is a diagram showing a schematic configuration of a car navigation device according to a first embodiment of the present invention.
FIG. 2 is a flowchart of processing performed by the car navigation device according to the first embodiment of the present invention.
FIG. 3 is a flowchart of processing performed by a car navigation device according to a second embodiment of the present invention.
FIG. 4 is a flowchart of processing performed by a car navigation device according to a third embodiment of the present invention.
FIG. 5 is a diagram illustrating a situation in which a car navigation device according to a third embodiment of the present invention operates.
FIG. 6 is a diagram illustrating a method of calculating the time required to utter the content to be uttered.
FIG. 7 is a diagram illustrating a configuration of a table in which an unread flag is attached to each mail, taking mail as an example.

Claims

1. A speech synthesis apparatus comprising:
speech synthesis means for generating synthesized speech corresponding to text;
acquisition means for acquiring a road congestion situation; and
changing means for changing the speaker of the synthesized speech when the road congestion situation is a first congestion situation, and for changing a prosodic parameter, which is a parameter relating to the prosody of the synthesized speech, when the road congestion situation is a second congestion situation.

2. The speech synthesis apparatus according to claim 1, wherein the prosodic parameter includes a parameter relating to voice pitch.

3. The speech synthesis apparatus according to claim 1, wherein the prosodic parameter includes a parameter relating to voice speed.

4. The speech synthesis apparatus according to any one of claims 1 to 3, wherein the prosodic parameter includes a parameter relating to voice volume.

5. The speech synthesis apparatus according to claim 1, wherein the text is downloaded from a predetermined server.

6. The speech synthesis apparatus according to any one of claims 1 to 5, wherein the changing means changes the speaker of the synthesized speech from a man to a woman when the road congestion situation is the first congestion situation.

7. The speech synthesis apparatus according to claim 1, wherein the changing means changes parameters relating to the volume, pitch, and speed of the synthesized speech when the road congestion situation is the second congestion situation.

8. A speech synthesis method comprising:
a speech synthesis step of generating synthesized speech corresponding to text;
an acquisition step of acquiring a road congestion situation;
a first changing step of changing the speaker of the synthesized speech when the road congestion situation is a first congestion situation; and
a second changing step of changing a prosodic parameter, which is a parameter relating to the prosody of the synthesized speech, when the road congestion situation is a second congestion situation.

9. The speech synthesis method according to claim 8, wherein the prosodic parameter includes a parameter relating to voice pitch.

10. The speech synthesis method according to claim 8 or 9, wherein the prosodic parameter includes a parameter relating to voice speed.

11. The speech synthesis method according to claim 8, wherein the prosodic parameter includes a parameter relating to voice volume.

12. The speech synthesis method according to any one of claims 8 to 11, wherein the text is downloaded from a predetermined server.

13. The speech synthesis method, wherein, in the changing step, the speaker of the synthesized speech is changed from a man to a woman when the road congestion situation is the first congestion situation.

14. The speech synthesis method according to any one of claims 8 to 13, wherein, in the changing step, parameters relating to the volume, pitch, and speed of the synthesized speech are changed when the road congestion situation is the second congestion situation.

15. A storage medium storing program code for realizing the speech synthesis method according to claim 8.