JP3595041B2

JP3595041B2 - Speech synthesis system and speech synthesis method

Info

Publication number: JP3595041B2
Application number: JP23583595A
Authority: JP
Inventors: 重宣瀬戸; 孝章新居
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1995-09-13
Filing date: 1995-09-13
Publication date: 2004-12-02
Anticipated expiration: 2015-09-13
Also published as: JPH0981174A

Abstract

PROBLEM TO BE SOLVED: To generate system output that lets the user easily grasp the user's situation and the state of the system, and to improve usability, by changing the phoneme/prosody control rules that are applied according to the user's situation, the user's environment, and the state of the system. SOLUTION: A communication state monitoring part 12 monitors the communication state within the computer on which the speech synthesis system runs and the communication state between the computer and the outside, and outputs the corresponding communication state information. In a speech synthesis part 11, a language analysis part 111 performs language analysis of the input text, such as morphological analysis and syntactic structure analysis, and a phoneme control part 112 and a prosody control part 113 apply rules at various levels to the analysis result. This performs the phoneme/prosody control that determines the quality of the synthesized speech in the conversion from language media to speech media. The rules applied in this phoneme/prosody control are changed according to the communication state information.

Description

【０００１】
【発明の属する技術分野】
本発明は、音声合成技術を利用するシステム一般に用いて好適な音声合成システムおよび音声合成方法に関する。
【０００２】
【従来の技術】
近年、音声合成技術の応用が拡大され、さらに計算機の処理能力が向上するにつれ、我々の身近における音声合成の利用がますます増え、テキスト音声変換は１つの応用アプリケーションとして気軽に利用可能になってきた。岩田他：“パソコン向けソフトウェア日本語テキスト音声合成，”日本音響学会講演論文集，２−８−１３，ｐｐ．２４５−２４６（１９９３年１０月）がその例である。
【０００３】
これらの音声合成システムは、入力されるテキストの言語解析、音韻制御、韻律制御、波形生成のいずれの処理においても原則的に一意の処理結果を得るように作られており、入力テキストが決まれば結果として得られる合成音声は常に同じものになっていた。
【０００４】
システムによっては、男声・女声、高い声・低い声など、合成音声の生成に先立ちユーザに選択させ、その選択に応じた合成音声を生成するものもあるが、選択項目が決まれば、入力テキストに対して生成される合成音声は一意に決まるという意味で、同じ枠組みであると言える。
【０００５】
【発明が解決しようとする課題】
しかしながら、入力テキストに対して同じ合成音声が生成されることは、単調で飽きがくるというだけでなく、必ずしも音声メディアの特徴を有効に利用しているとは言えない。
【０００６】
音声メディアは、テキストメディアと異なり韻律や声色の変化が加わることにより、言語表現による直接的なメッセージだけでなく、感情や意図、話者の状況やノリといった雰囲気など、付加的な情報を副次的に伝える特徴があることはよく知られている。藤崎他：“音声の韻律的特徴による発話意図の表現，”日本音響学会講演論文集，２−８−１６，ｐｐ．２２５−２２６（１９９３年３月）や、上床他：“音声の感情表現の分析とモデル化，”電子情報通信学会技術研究報告，ＳＰ９２−１３１，ｐｐ．６５−７２（１９９３年１月）などがその例である。
【０００７】
上述した現状の音声合成システムやその応用アプリケーションはいずれも、文字言語メディアの形態に表現される言語情報を単に音声メディアの形態の表現に変換するメディア変換（テキスト音声変換）としての機能は持っていても、副次的な情報をも伝えるという音声の特徴を積極的に利用しているとは言い難い。
【０００８】
さらに、音声合成が、単独の装置としてではなく、他のシステムとの連携を行い動作する場合や、１つの応用アプリケーションとしてパーソナルコンピュータやワークステーションなどの汎用的な計算機の上で他の応用アプリケーションとともに利用される場合においても、ともに動作しているシステムやアプリケーションなどの状況や、音声合成アプリケーションが動作しているシステムがどのような状態にあるかなどの状況によらず、与えられたテキストを単に忠実に音声へ変換しているに過ぎない。
【０００９】
本発明は上記事情を考慮してなされたものでその目的は、システムの動作状況（システム状況）、あるいはユーザ自身の状況（ユーザ状況）やユーザのいる場所の環境（ユーザ環境）に応じて音韻・韻律制御を動的に変えたり、さらには入力テキストに対応した本来の合成音に併せて別の音や合成音を出力することで、ユーザにとってシステムの動作状況が把握しやすいようなシステムの出力や、ユーザの置かれている状況に適したシステムの出力が生成でき、使い勝手を向上させることができる音声合成システムおよび音声合成方法を提供することにある。
【００１０】
【課題を解決するための手段】
上述した課題を解決するため、本発明の第１の観点に係る構成は、入力テキストの言語解析を行いその解析結果に対して規則を適用して音韻・韻律的な制御を行い合成音声を生成・出力する音声合成手段に加えて、計算機内の通信状態、および計算機と外部の間の通信状態の少なくとも一方を監視し通信状態情報を出力する通信状態監視手段を備え、上記音声合成手段にあっては、上記音韻・韻律的な制御において適用する規則を通信状態監視手段から出力される通信状態情報に応じて変更するようにしたことを特徴とする。
【００１１】
本発明の第２の観点に係る構成は、上記第１の観点に係る構成における音声合成手段に相当する音声合成手段に加えて、計算機ハードウエアの動作状態、および計算機ソフトウェアの動作状態の少なくとも一方を監視し動作状態情報を出力する動作状態監視手段を備え、上記音声合成手段にあっては、上記音韻・韻律的な制御において適用する規則を動作状態監視手段から出力される動作状態情報に応じて変更するようにしたことを特徴とする。ここで、上記音声合成手段における言語解析、音韻的な制御、韻律的な制御、および音声波形生成の少なくとも１つの処理を、上記動作状態情報の示す動作状態に応じて、通信可能な有線ネットワークもしくは無線ネットワークで結ばれる複数の計算機ハードウエアに分担させるようにするとよい。
【００１２】
本発明の第３の観点に係る構成は、上記第１の観点に係る構成における音声合成手段に相当する音声合成手段に加えて、ユーザのシステム利用状況、およびユーザのシステム利用環境の少なくとも一方を監視しユーザ状況情報を出力するユーザ状況監視手段を備え、上記音声合成手段にあっては、上記音韻・韻律的な制御において適用する規則をユーザ状況監視手段から出力されるユーザ状況情報に応じて変更するようにしたことを特徴とする。ここで、ユーザ状況情報をもとに人間の発声でないことを明示すべきか否かの判断結果を出力する非自然音声明示判断手段と、この非自然音声明示判断手段の判断結果に応じ、上記入力テキストの表現の一部の変更により人間の音声でないことを明示するテキスト変更手段、および当該判断結果に応じ、合成音声の出力に併せて人間の発声でないことを明示する音を出力する非自然音声明示音出力手段の少なくとも一方とをさらに備えることも可能である。
【００１３】
上記第１の観点に係る構成においては、音声合成手段内で、まず入力テキストに対して形態素解析や統語構造解析などの周知の言語解析が行われて、形態素の系列に分解されるとともに「読み」を表す記号列と形態素の品詞、活用、アクセント型、形態素間の係り受け関係の強さなどの情報が併せて出力される。
【００１４】
また、音声合成手段内では、上記の言語解析の結果の内容に対して、それぞれ様々なレベルの規則（音韻・韻律的規則）を適用することによって、言語メディアから音声メディアへの変換に伴う合成音声の品質を左右する制御、すなわち音韻・韻律的な制御が行われ、入力テキストに対応する音声波形が生成される。
【００１５】
一方、通信状態監視手段は、音声合成システムの稼働する計算機内の通信状態、および計算機と外部の間の通信状態の少なくとも一方を監視する。それぞれの通信状態監視結果は、音声合成手段に伝えられる。
【００１６】
音声合成手段内では、この通信状態監視結果に応じて、上記音韻・韻律的規則の適用内容が変更される。
次に、上記第２の観点に係る構成においても、音声合成手段内では、入力テキストに対する言語解析の結果の内容に対して、それぞれ様々なレベルの音韻・韻律的規則を適用することによって、言語メディアから音声メディアへの変換に伴う合成音声の品質を左右する音韻・韻律的な制御が行われ、入力テキストに対応する音声波形が生成される。
【００１７】
一方、動作状態監視手段は、音声合成システムの稼働する計算機ハードウエアの動作状態、および計算機ソフトウェアの動作状態の少なくとも一方を監視する。それぞれの動作状態監視結果は、音声合成手段に伝えられる。
【００１８】
音声合成手段内では、この動作状態監視結果に応じて、上記音韻・韻律的規則の適用内容が変更される。
また、音声合成手段における言語解析、音韻的な制御、韻律的な制御、および音声波形生成の少なくとも１つの処理が、動作状態監視結果に応じて、通信可能な有線ネットワークもしくは無線ネットワークで結ばれる複数の計算機ハードウエアに分担させられる。
【００１９】
次に、上記第３の観点に係る構成においても、音声合成手段内では、入力テキストに対する言語解析の結果の内容に対して、それぞれ様々なレベルの音韻・韻律的規則を適用することによって、言語メディアから音声メディアへの変換に伴う合成音声の品質を左右する音韻・韻律的な制御が行われ、入力テキストに対応する音声波形が生成される。
【００２０】
一方、ユーザ状況監視手段は、ユーザのシステム利用状況、およびユーザのシステム利用環境の少なくとも一方を監視する。それぞれのユーザ状況監視結果は、音声合成手段に伝えられる。
【００２１】
音声合成手段内では、このユーザ状況監視結果に応じて、上記音韻・韻律的規則の適用内容が変更される。
また、非自然音声明示判断手段と、テキスト変更手段および非自然音声明示音出力手段の少なくとも一方とをさらに備えた構成では、入力テキストの表現の一部の変更（例えば、入力テキストに対する定型表現の追加）によって人間の音声でないことを明示するテキスト変更、あるいは合成音声の出力に併せて人間の発声でないことを明示する音の出力がなされる。すなわち、本来の合成音に併せて別の音や合成音が出力される。
【００２２】
【発明の実施の形態】
以下、本発明の実施の形態につき図面を参照して説明する。
［第１の実施形態］
図１は本発明の音声合成システムの第１の実施形態を示すブロック構成図である。
【００２３】
図１のシステムの中心をなす音声合成部１１は、入力テキストの言語解析を行う言語解析部１１１と、その解析結果に対して規則を適用して、音韻的な制御を行う音韻制御部１１２と韻律的な制御を行う韻律制御部１１３と、音韻制御部１１２および韻律制御部１１３の制御に従い音声波形を生成する波形生成部１１４と、生成された波形を出力する波形出力部１１５とから構成されている。この音声合成部１１の構成の枠組みについては、既存のテキスト音声変換可能な音声合成システムの一般的な構成法がそのまま利用できる。テキスト音声合成システムの一般的な構成法としては、例えば佐藤他：“日本語テキストからの音声合成，”電気通信研究所研究実用化報告，Ｖｏｌ．３２，Ｎｏ．１１，ｐｐ．２２４３−２２５２（１９８３年１１月）などが利用できる。
【００２４】
音声合成部１１内の言語解析部１１１は、入力テキストに対して形態素解析や統語構造解析などの言語解析を行い、形態素の系列に分解するとともに「読み」を表す記号列と形態素の品詞、活用、アクセント型、形態素間の係り受け関係の強さなどの情報を併せて出力する。
【００２５】
音声合成部１１内の音韻制御部１１２および韻律制御部１１３は、これら言語解析部１１１での言語解析結果の内容に対して、それぞれ様々なレベルの規則を適用することによって、言語メディアから音声メディアへの変換に伴う合成音声の品質を左右する制御を行う。
【００２６】
具体的には、アクセント単位の認定（すなわち、未知語へのアクセント付与、アクセント結合、複合語のアクセント分割、付属語連鎖に対する副次アクセントの付与などが含まれる）、読みの認定（すなわち、未知語への読み付与、連濁処理、表記から読みへの変換などが含まれる）、１つの韻律的なまとまりとするためのいわゆる韻律語（アクセント句）連鎖のまとまりの認定（すなわち、韻律句境界の付与がこれに相当する）、韻律制御パラメータ値決定（すなわち、ピッチの時間変化パターンを生成するモデルのパラメータ値の決定、音韻・ポーズのタイミングの決定、パワーの決定）、読みに対応する蓄積パターンの検索および蓄積素片の選択（すなわち、蓄積単位への変換、検索条件への変換、複数の検索結果を得たときの選択などが含まれる）、蓄積素片の編集（すなわち、蓄積素片間の接続、補間加工などが含まれる）といった各段階の処理をそれぞれの規則によって行う。
【００２７】
このうち、音韻制御部１１２においては、読みの認定、読みに対応する蓄積パターンの検索および蓄積素片の選択、蓄積素片の編集を、韻律制御部１１３においては、アクセント単位の認定、韻律語（アクセント句）連鎖のまとまりの認定、韻律制御パラメータ値決定を担当する。これらの各段階の処理や規則は任意の分類が可能であり、システムの実装形態によって上記の分類とは異なる場合や省略される場合もあるが、基本的にはこれらの内容に沿った処理が行われる。また、システムの実装形態によっては、前段の言語解析部１１１や後段の波形生成部１１４との境界も様々であるが、ここでは、上記の音韻的な制御を行うものとして音韻制御部１１２を、同じく上記の韻律的な制御を行うものとして韻律制御部１１３を、それぞれ定義している。
【００２８】
韻律制御部１１３は、言語解析部１１１での形態素解析結果にこれらの規則を適用して、形態素系列の読みに対応する個々の音韻やポーズなどのタイミングを決め、形態素系列あるいは読みに対応する音韻の系列を韻律語（アクセント句）というアクセント付与のための韻律制御上の単位に分割するとともに、意味上の文構造上や生理的な制約による呼気段落上のまとまりを形成し、いわゆる韻律句と呼ばれる話調成分付与のための韻律制御上の単位へ韻律語系列を分割し、各韻律制御上の単位に対して、タイミングを考慮して、アクセントや話調の成分の大きさを与えるパラメータ値を決めピッチを決定する。韻律制御部１１３はさらに、形態素系列あるいは読みに対応する音韻の系列あるいはピッチなどをもとにパワー包絡を決定する。
【００２９】
一方、音韻制御部１１２は、読みに対応する音韻の部分系列に対して、音声波形、あるいは音声波形の分析パラメータ、あるいはその両方を対応させた蓄積素片を格納しておく蓄積データ格納部１１２１を有しており、この蓄積データ格納部１１２１に格納されている蓄積素片のバリエーションを考慮して、形態素系列の読みに当たる音韻の部分系列に対応する蓄積素片の系列を決定する。
【００３０】
本実施形態において、上記した音韻制御部１１２および韻律制御部１１３で適用される規則は、計算機内の通信状態や計算機外との間の通信状態に応じて切り替えられるようになっているが、これについては後述する。
【００３１】
音声合成部１１内の波形生成部１１４は、音韻制御部１１２の出力する蓄積素片系列を接続し、韻律制御部１１３の出力する制御情報、すなわち、タイミング、ピッチ、パワー包絡に従い、信号処理レベルでの韻律制御を行って、音声波形を生成する。
【００３２】
音声合成部１１内の波形出力部１１５は、音声合成部１１により生成された音声波形を例えばスピーカーやイヤホーン等から出力する。
さて、本実施形態において、音韻制御部１１２が持つ蓄積データ格納部１１２１に格納される蓄積素片、音韻制御部１１２で利用される規則、および韻律制御部１１３で利用される規則は、生成したい合成音声の調子に合う自然音声データを収集しておき、そのデータから予め作成しておいたものである。例えば、対話調の音声を合成したい場合は模擬対話音声を収集したり、ささやき声、早口の声、疲れた様子の音、元気の良い声、雑踏の中で（あるいは雑踏環境を模擬したところで）発声した声、落ち着いた声、様々な人の声をできるだけ大量に収集し、それぞれのピッチやパワー、時間長の分析結果から、それぞれの声に対応した規則や蓄積データを導出する。
【００３３】
様々な状況における音声が、それぞれ異なる傾向の音韻的・韻律的な特徴を有することは従来からの研究で指摘されており、様々な音声データから導かれた韻律の制御規則が異なる傾向を示すことは、平井他：“種々の音声コーパスから自動生成されたＦ_０制御規則の違いについて，”日本音響学会講演論文集，２−５−３，ｐｐ．２７１−２７２（１９９４年１０月〜１１月）においても実際のデータとともに示されている。
【００３４】
音声データからの規則の導出に関しては従来から研究例が多数ある。例えば、広瀬他：“音声合成とアクセント・イントネーション，”電子情報通信学会誌，Ｖｏｌ．７０，Ｎｏ．４，ｐｐ．３７８−３８５（１９８７年４月）、三村他：“統計的手法を用いた音声パワーの分析と制御，”日本音響学会誌，Ｖｏｌ．４９，Ｎｏ．２，ｐｐ．２５３−２５９（１９９３年１２月）、海木他：“発話速度による文音声のポーズ長変化の分析，”日本音響学会講演論文集，１−５−１６，ｐｐ．２４７−２４８（１９９２年１２月）などがあり、規則の抽出に利用できる。
【００３５】
それぞれの環境について抽出された制御規則および蓄積素片には、音声合成時に利用するための抽出環境に関する情報、即ち、対話調であるとか、ささやき声、早口の声、疲れた様子の声、元気の良い声、雑踏の中での声、落ち着いた声、などの音声データの収集状況の情報が付加される。
【００３６】
周知のように、既存の音声合成システムの音韻的・韻律的な制御規則や蓄積データは、本質的には、言語的な環境（例えば、形態素、品詞、活用など）および音韻的・韻律的な環境（例えば、音韻の並び、アクセント型とアクセント核、ピッチ、パワー包絡、タイミングなど）と制御内容（例えば、読み記号列、アクセント結合情報、韻律パラメータ値、蓄積素片の選択優先度など）や音声波形・分析パラメータとの対応として捉えることができる。
【００３７】
そこで本実施形態では、この対応関係に規則の抽出環境を加え、抽出した制御規則や蓄積データを、言語的な環境、音韻的・韻律的な環境および規則の抽出環境と、制御内容や音声波形・分析パラメータとの対応として記述している。
【００３８】
このように、複数の規則や蓄積データを備え、さらに、それらを音声合成部１１（内の言語解析部１１１および音韻制御部１１２）が適宜選択して使用することにより、合成音声の声の調子にバリエーションを与えることができる。
【００３９】
そこで本実施形態では、上述した音声合成部１１に加えて、当該音声合成部１１での規則選択の条件を決定するための情報を与える手段として、通信状態監視部１２が設けられている。この通信状態監視部１２は、音声合成システムが稼働する計算機内の通信状態を監視する計算機内通信状態監視部１２１と、当該計算機と外部の間の通信状態を監視する計算機外通信状態監視部１２２とを有している。
【００４０】
通信状態監視部１２内の計算機内通信状態監視部１２１は、同一計算機内で動作するソフトウェア間、ハードウェア間、あるいはソフトウェアとハードウェアの間の通信状況ないしは通信路の品質からなる通信状態を監視する。説明を簡単にするために、ここでは互いに通信を行うハードウェアあるいはソフトウェアをそれぞれ通信者Ａおよび通信者Ｂと簡略化して表現する。すなわち通信者Ａと通信者Ｂとの間で通信が行われているものとする。
【００４１】
計算機内通信状態監視部１２１は、これらの間で交わされる通信状態を知るために、この通信を媒介するソフトウェアないしはハードウェア（便宜的にここでは、通信媒体と呼ぶことにする）に問い合わせ、通信状況（例えば、情報の送り手、通信量や通信量の時間的な変化、通信の頻度、送る予定のデータ総量、既に送ったデータ量など）や通信路の品質（例えば、データ転送速度やエラー発生頻度など）を通知してもらう。これらの通知は、必ずしも問い合わせが必要なわけではなく、問い合わせがなくても通信媒体側から計算機内通信状態監視部１２１に適当なタイミングで通知するようにしても構わない。
【００４２】
このような通信媒体として、オペレーティングシステムやオペレーションシステム（以下、ＯＳと称する）の提供する既存の機能（例えば、メッセージング機能を実現できるＷｉｎｄｏｗｓのＤＤＥ＝ＤｉｎａｍｉｃＤａｔａＥｘｃｈａｎｇｅや、クリップボードを使ったデータの受け渡し）や、ウィンドウシステムの提供する既存の機能（例えば、ＸＷｉｎｄｏｗＳｙｓｔｅｍにおけるイベントやセレクションバッファ、Ｗｉｎｄｏｗｓのｍｅｓｓａｇｅなどが一例である）、あるいは、サーバ・クライアントモデルで実装された各種サービスが利用できる。もちろん、既存システムを利用するだけでなく、同様のメカニズムを持つように新たなシステムを組むことも可能である。
【００４３】
また、通信者Ａと通信者Ｂで交わされる通信状態を知るために、通信媒体を介さずに直接、通信者Ａと通信者Ｂに問い合わせる仕組みにしてもよい。この場合、通信者Ａおよび通信者Ｂがそれぞれ持っている、通信を行う機能を持つ部分（便宜的に、通信部と呼ぶことにする）に対して計算機内通信状態監視部１２１が問い合わせ、上記と同様に通信状態を通知してもらう。もちろん、上記と同様に、問い合わせがなくても適宜、通信者Ａおよび通信者Ｂがそれぞれ持っている通信部が計算機内通信状態監視部１２１に適当なタイミングで通知するようにしても構わない。
【００４４】
計算機内通信状態監視部１２１は、このようにして取得した通信状態に関する情報をもとに、例えば、通信量が大きい／小さい、送るべきデータ総量が多い／少ない、既に通信が済んだデータの割合が大きい／小さい、データ転送速度が速い／遅いといった情報を通信状態情報として音声合成部１１に送る。これらの情報は、取得した数値のまま通信状態情報としてもよいし、計算機内通信状態監視部１２１内で閾値と比較して離散的なレベルにまるめて通信状態情報としてもよい。
【００４５】
一方、通信状態監視部１２内の計算機外通信状態監視部１２２は、計算機外との通信状態を監視する。この計算機外通信状態監視部１２２においても、上記した計算機内通信状態監視部１２１と同様に、通信媒体を介して通信状態を取得する構成とすることができる。通信媒体としては、同じように、ＯＳやＯＳの提供する既存の機能（メッセージング機能）や、ウィンドウシステムの提供する既存の機能（例えば、イベント）、あるいは、サーバ・クライアントモデルで実装された各種サービス（例えば、ＮｅｔｗｏｒｋＦｉｌｅＳｙｓｔｅｍやプリンタのデーモン等）の他、モデムのように計算機外とのデータ通信が可能なデバイスやドライバが利用できる。もちろん、既存システムを利用するだけでなく、同様のメカニズムを持つように新たなシステムを組むことも、上記と同様に可能である。
【００４６】
音声合成部１１は（通信状態監視部１２内の）計算機内通信状態監視部１２１および計算機外通信状態監視部１２２からそれぞれ通信状態情報を受け取り、当該通信状態情報に応じて音韻制御部１１２および韻律制御部１１３においてそれぞれ適用する制御規則や蓄積データを選択する。
【００４７】
ここで、通信状態情報と選択する制御規則および蓄積データとの対応関係は、音韻制御部１１２および韻律制御部１１３に定めておく。例えば、通信量大あるいは通信の頻度が大きい場合は早口にしたり、非常に大きい場合には緊迫した声にしたり、逆に、通信量小あるいは通信の頻度が小さい場合は、ピッチのダイナミックレンジを大きく、落ち着いた声にしたり、ポーズを多めに挿入したり、ゆったりした声にする。通信残量が多い場合はのんびりした声に、残り少なくなってくるにつれ、ピッチを高めにしたり早口にしたりする。通信路の品質が悪い場合には、元気のない声やピッチに不規則な揺らぎを重畳させ声質を変える。転送速度が速ければ軽快な声を、遅ければ重苦しい声にするなどの対応関係が例として挙げられる。
【００４８】
このように、言語解析部１１１での解析結果に対して音声合成部１１（内の言語解析部１１１および音韻制御部１１２）において適用する制御規則や蓄積データを、通信状態監視部１２（内の計算機内通信状態監視部１２１または計算機外通信状態監視部１２２）から出力される通信状態情報（の示す通信状態）に応じて切り替えて（変更して）合成音声を出力することにより、ユーザは、合成音声の声の調子から、その時点における計算機内の通信状態、あるいは計算機外との間の通信状態を知ることができる。
【００４９】
なお、上述の対応関係はあくまで例であって、音声合成システムのユーザの好みに応じて変更可能にしても構わない。また、計算機内通信状態監視部１２１で監視される計算機内の通信状態と、計算機外通信状態監視部１２２で監視される計算機外の通信状態のそれぞれについて、独立に対応関係を設定しても構わない。
【００５０】
また、以上の実施形態では、通信状態監視部１２には、計算機内通信状態監視部１２１および計算機外通信状態監視部１２２の両方が設けられているものとしたが、いずれか一方だけが設けられているものであっても構わない。
［第２の実施形態］
図２は本発明の音声合成システムの第２の実施形態を示すブロック構成図である。なお、図１と同一部分には同一符号を付してある。
【００５１】
まず、図２の構成の特徴は、音声合成部１１に加えて、計算機ハードウェアの動作状態を監視するハードウェア状態監視部２２１と計算機ソフトウエアの動作状態を監視するソフ卜ウェア状態監視部２２２とを有する動作状態監視部２２が設けられている点である。これに伴い、図２における音声合成部１１内（の音韻制御部１１２および韻律制御部１１３）の機能も、以下に述べるように図１中の音声合成部１１（内の音韻制御部１１２および韻律制御部１１３）とは異なるが、便宜上同一符号を付してある。
【００５２】
動作状態監視部２２内のハードウェア状態監視部２２１は、音声合成システムの稼働する計算機ハードウェアの動作状態を示すパラメータを直接測定したり、あるいは、計算機ハードウェアもしくはそのソフトウェアドライバに動作状態を問い合わせたり、あるいは、計算機ハードウェアもしくはそのソフトウェアドライバ自体から適当なタイミングで動作状態を通知されることによって、計算機ハードウェアの動作状態を監視する。
【００５３】
例えば、システムを構成するハードウェアに供給される電源電圧の高さや安定性、カード、プリンタ、キーボード、マウス等のデバイス（周辺機器）やネットワークケーブル等、システムに接続されているハードウェアの接続状況（接続されているか否か、さらには利用可能な状態か否か）を監視する。
【００５４】
ハードウェア状態監視部２２１は、このようにして取得したハードウェア状態に関する監視結果をもとに、例えば、電源電圧が十分高い／高い／やや低い／低い／かなり低い、十分安定している／安定している／やや不安定／非常に不安定、などにランク分けされる電源品質に関する情報、あるいは、ハードウェアが利用可能な状態にある／待機状態にある／接続が切れているといった動作状態情報を音声合成部１１に送る。
【００５５】
なお、上記の分類は一例であり、必要に応じて任意の分類が可能である。また、適当な閾値を設定し、これと比較して離散的なレベルにまるめてもよいし、取得した数値のまま動作状態情報としてもよく、上記の分類に限定されるものではない。
【００５６】
音声合成部１１は（動作状態監視部２２内の）ハードウェア状態監視部２２１から動作状態情報を受け取り、当該動作状態情報に応じて音韻制御部１１２および韻律制御部１１３においてそれぞれ適用する制御規則や蓄積データを選択する。
【００５７】
ここで、動作状態情報と選択する制御規則および蓄積データとの対応関係は、前記第１の実施形態における通信状態情報と選択する制御規則および蓄積データとの対応関係と同様に、音韻制御部１１２および韻律制御部１１３に定めておく。この対応関係は、例えば、品質の高い電源電圧が十分安定して供給されている場合は通常の韻律制御や声色で合成音声を生成するが、電源電圧が下がり始めたり不安定な場合には、少し元気のない声に対応する蓄積データを選択するような規則を選択したり、ゆったりした口調になるような規則に切り替えたり、ピッチの上げ下げを弱めたりするような規則を選択したり、おとなしい声になるような規則を選択したりするなどの対応関係が例として挙げられる。もちろん、この対応関係はあくまで例であって、音声合成システムのユーザの好みに応じて変更可能にしても構わない。そして、これらの規則の対応関係の変更は、上記と逆の印象を与えるように選択であっても構わない。
【００５８】
音声合成部１１内の韻律制御部１１３および音韻制御部１１２では、このような対応関係に従って選択された規則を用いることで、生成・出力する合成音声の韻律的・音韻的な品質を制御する。これによりユーザは、合成音声の声の調子から、その時点における計算機ハードウェアの状態を知ることができる。
【００５９】
ところで、ＰＤＡ（ＰｅｒｓｏｎａｌＤｉｇｉｔａｌＡｓｓｉｓｔａｎｔｓ）に代表される可搬型システム（携帯機器）では、表示に利用できる面積が小さいことから、システムの動作状態情報を提示するために割り当てる面積を大きくとるのは非効率的であるが、過度に小さくすればユーザに注意を促すという本来の目的が満足できなくなる可能性がある。一般に、可搬型のシステムの場合、供給される電源の安定性は、整備された環境にある固定型のシステムに比較して低いのが普通である。そこで、可搬型システムにおいて、このような動作状態情報について、韻律や声質を制御することによって副次的に伝えることは有効である。電源電圧と同様、ハードウェアの接続状況も、一般のユーザが普段は比較的意識せず、見落としがちであるが、これも接続状況の変化に応じて韻律や声質に変化を与えれば、ユーザにそれとなく知らせることができる。
【００６０】
一方、動作状態監視部２２内のソフ卜ウェア状態監視部２２２は、音声合成システムの稼働する計算機のプロセッサ（ＣＰＵ）やメモリ、ハードディスク等の計算機資源をある（ターゲットとする）ソフトウェア（プロセス）がどれだけ占有しているか、あるいは、逆の観点から言えば、あるソフトウェアがどれだけ処理を待たされているかといった、限られた計算機資源の分配に起因するソフトウェアの動作状態を監視したり、あるソフトウェアが今どのような入力を受け付ける状態にあるか、例えば、入力デバイスの種類や入力内容の種類として何が有効であるか、また、あるソフトウェアが今どのような情報を提示しているか、例えば、提示情報の出所や提示内容の種類といった動作モード（場面）に対応するソフトウェアの動作状態を取得する。
【００６１】
このようなソフトウェアの動作状態は、ソフ卜ウェア状態監視部２２２が、当該ソフトウェアの動作しているＯＳに対して問い合わせ、通知してもらうことによって取得したり、当該ソフトウェア自体に、動作状態を直接問い合わせると通知する通知部（通知機能）を付加しておくことによって取得する。もちろん、問い合わせがなくても、ソフトウェア自身がその動作状態をソフ卜ウェア状態監視部２２２に適当なタイミングで通知する仕組みを用意することも可能である。
【００６２】
ここで、取得するソフトウェアの動作状態情報としては、例えばメモリ使用量やソフトウェア状態、ＣＰＵの占有率や占有時間累計、動作優先度等の情報が一例として挙げられる。これらの情報は、既存のＯＳのシステムコールやライブラリを利用して取得可能である。また、現在受け付ける入力の種類や提示している情報の種類を通知する通知部を備えたソフトウェアを新たに作成してもよい。
【００６３】
一般に、同一のアプリケーションでも動作モード（場面）に応じて受け付ける入力の種類は動的に変化する。例えば、メールの送受信を行うメールアプリケーションは、届いたメールのリストを表示する状態、そのうちの選択されている１つのメールの内容を表示する状態、送信したいメールの文面を編集する場面、編集したメールを送信する場面などがあって、それぞれの場面によって、同じキー入力が有効になるか無視されるか、有効であった場合にどういう動作をするかが変わってくる。また、音声認識入力を受け付けるソフトウェアの場合には、今どのような認識語彙が入力可能であるかといった情報が「受け付けられる入力の種類」に相当し、さらに、認識語彙だけでなくそれぞれの認識語彙に対応する動作もソフトウェアの動作モード（場面）に応じて動的に変化する。
【００６４】
一方、電子メールのアプリケーションでは、誰から送られたメールであるとか、極秘扱いの内容であるといった、情報の出所や内容を表す情報を文字列照合や言語解析によって取得し、これらの動作モードや提示情報の出所や提示内容の種類を動作情報としてソフトウェア動作状態監視部２２２に伝える。ここでは、メールのアプリケーションを例に挙げたが、電子ネット掲示板や電子ネット上の情報提供システムのように、複数の情報源からの情報をブラウジングする応用ソフトウェアにおいても全く同様のことが適用できる。
【００６５】
ソフトウェア状態監視部２２２は、このようにして取得したソフトウェアの動作状態に関する情報をもとに、例えば、メモリの占有が大きい／小さい、ＣＰＵの占有時間累計が長い／短い、認識語彙の組合せがどのセットであるか、どのような動作モードにあるか、情報の出所はどこか、情報の内容の種類が何であるかを示す情報を動作状態情報として音声合成部１１（内の音韻制御部１１２および韻律制御部１１３）に送る。
【００６６】
音声合成部１１では、ソフトウェア状態監視部２２２からの動作状態情報を受け取ると、音韻制御部１１２および韻律制御部１１３においてそれぞれ適用する規則や蓄積データを当該動作状態情報に応じて選択する。これにより、例えばメモリの占有が大きいとか、ＣＰＵの占有時間累計が長い場合には、元気のない声や申し訳なさそうな声を生成して、システムの状況をユーザにそれとなく伝えたり、逆に早口の口調とすることでユーザ自らの処理を促したりすることが可能となる。また、情報の出所に応じて、アクセントやフレーズを変えるための規則を選択し、地域色を音声に反映することで、情報の出所の違いをユーザに意識させることが可能となる。また、情報提供者の声の蓄積データがあれば、それを使うことで、情報提供者を簡単に判別できるようにすることも可能である。また、電話等でリモート操作する場合や、携帯機器で表示面積が小さい場合に、あるソフトウェアが現在どのような入力を受け付ける状態にあるか（その入力デバイスの種類と入力内容の種類）に応じて、韻律や声色に変化を与えることで、ユーザは次に何を入力すべきかや、現在の「場面」を、出力される合成音声の調子から知ることができる。
【００６７】
ここで、動作状態情報（ソフトウェアの動作状態情報）と選択する制御規則および蓄積データの対応関係は、上述した計算機ハードウェアの動作状態情報と選択する制御規則および蓄積データの対応関係の場合と同様に、音韻制御部１１２および韻律制御部１１３に定めておく。
【００６８】
このように、言語解析部１１１での解析結果に対して音声合成部１１（内の音韻制御部１１２および韻律制御部１１３）において適用する制御規則や蓄積データを、動作状態監視部２２（内のハードウェア状態監視部２２１またはソフ卜ウェア状態監視部２２２）から出力される動作状態情報に応じて切り替えて（変更して）合成音声を出力することにより、ユーザは、合成音声の声の調子から、その時点における計算機ハードウェアの状態、あるいは計算機ソフトウェアの状態を知ることができる。
【００６９】
さて、本実施形態における音声合成部１１では、言語解析部１１１、音韻制御部１１２、韻律制御部１１３、波形生成部１１４、および波形出力部１１５のそれぞれが単独で動作するようにモジュール化しておき、互いのデータの授受の形式がネットワークを通じたものであっても、同一の実行プロセス内でのデータの授受でも処理が可能になるようにしてある。また、上記各部の全体の処理手続きおよびその部分的な処理手続きが互いに別のプロセスとして分離可能にしておき、分離されたプロセスは、処理結果を元のプロセスに返すようにしておく。このようなシステムの実装は、マルチタスクＯＳ上ならば、子プロセスの生成と子プロセスとのソケット通信などのシステムコール、ライブラリを用いて容易に実装可能である。
【００７０】
音声合成部１１は、動作状態監視部２２から動作状態情報を受け取ると、メモリの残量やＣＰＵの占有時間や占有率から判断して、当該音声合成部１１を構成する言語解析部１１１から波形出力部１１５に至る音声合成処理を進めるに当たってメモリやＣＰＵ能力など十分な計算機資源が確保されているか否かをチェックする。そして音声合成部１１では、メモリが不足する可能性がある場合や、ＣＰＵの負荷状況から十分な計算機資源が確保できないと判断される場合には、現在までに処理が進んでいる段階よりも後の処理のうち適当なものを別の計算機ハードウェアに別プロセスとして分担させ、処理結果を受け取るようにする。
【００７１】
ここで、どの処理を分担させるかは、処理に必要なＣＰＵ能力やメモリ量から判断するが、これは音声合成方式の種類や蓄積データの規模によって変わるものである。例えば、分析パラメータ合成方式の場合は、波形生成部１１４における信号処理、次いで音韻制御部１１２における蓄積データの編集加工の処理に資源が多く使われ、波形素片編集型の音声合成方式では、蓄積データの検索がＣＰＵパワーを最も要する。分析パラメータ合成方式においても、蓄積データが持つ蓄積素片の種類が多いほど蓄積データの検索時間は多くかかる。したがって、どの処理を分担させるかは、合成方式や蓄積データの規模によって適当な優先順位をつけて、それに従って分担させる処理を決めればよい。
【００７２】
ところで、音声合成部１１の言語解析や音韻・韻律的な制御における規則の適用は、いずれも多くの規則適用の可能性を数え上げ、その規則を適用した場合の結果を評価することによって、処理が進められる。これらの処理を１つの計算機上で順次実行するのは必ずしも効率的でなく、規則の適用可能性と規則適用を仮定した場合の評価を同時並行して処理する方が効率的である。適用される規則が固定的であれば、ある程度チュ−ニングすることにより順次処理をさせることによって問題は比較的顕れにくくすることも可能であるが、本発明のように適用される規則が動的に変更される場合には、同時並行的な処理をする方が効率的である。
【００７３】
そこで本実施形態では、上述の規則の適用可能性と規則適用を仮定した場合の評価を同時並行して処理するようにしている。この並行処理は、計算機が接続されたネットワーク上の他の計算機にリモートプロセスとして実行させたり、同じ計算機上の副プロセッサに分担させることもできる。
【００７４】
このように本実施形態においては、音声合成部１１を構成する言語解析部１１１から波形出力部１１５に至る音声合成処理を進める上で、動作状態監視部２２からの動作状態情報により十分な計算機資源が確保できないと判断される場合には、現在までに処理が進んでいる段階よりも後の処理のうち適当なものを別の計算機ハードウェアに別プロセスとして分担させたり、言語解析や音韻・韻律的な制御における規則の適用可能性と規則適用を仮定した場合の評価を、計算機が接続されたネットワーク上の他の計算機や同じ計算機上の副プロセッサに分担させて同時並行して処理させることで、効率的な処理を実現し、ユーザの待ち時間を減らすようにしている。
【００７５】
なお、以上の実施形態では、動作状態監視部２２には、ハードウェア状態監視部２２１およびソフ卜ウェア状態監視部２２２の両方が設けられているものとしたが、いずれか一方だけが設けられているものであっても構わない。
［第３の実施形態］
図３は本発明の音声合成システムの第３の実施形態を示すブロック構成図である。なお、図１と同一部分には同一符号を付してある。
【００７６】
まず、図３の構成の特徴は、音声合成部１１に加えて、ユーザのシステム利用状況を監視する利用者状態監視部３２１と、ユーザのシステム利用環境を監視する利用者環境監視部３２２とを有するユーザ状況監視部３２が設けられている点である。これに伴い、図３における音声合成部１１内（の音韻制御部１１２および韻律制御部１１３）の機能も、以下に述べるように図１中の音声合成部１１（内の音韻制御部１１２および韻律制御部１１３）とは異なるが、便宜上同一符号を付してある。
【００７７】
ユーザ状況監視部３２内の利用者状態監視部３２１は、ユーザのシステムの利用状況（利用者状態）を得るための入力デバイスや時計、利用履歴の少なくとも１つからの情報を監視し、例えばユーザがどの程度集中してシステムを利用しているかといったシステム利用状況監視結果を取得する。入力デバイスとしては、例えば、カメラなどが利用できる。カメラの捉えたユーザの頭の向きを精度よく推定することは可能であり、ある一定時間中にどの程度長く安定してシステムの方向（正面）を向いているのか否か（他を向いているか）をもって、ユーザの集中度として評価する。また、マウスに代表されるポインティングデバイス、キーボード等、ユーザの入力操作のための入力デバイスについて、ユーザの操作状況（入力操作頻度、入力操作時間、ポインティングデバイス移動速度・距離など）を監視することも可能である。この他、時計、利用履歴については、同じ曜日、同じ時間帯にどのような利用状況にあったかを記録しておくことで、利用状況の推定精度を高めるのに用いられる。
【００７８】
利用者状態監視部３２１は、このようにして取得したユーザのシステム利用状況に関する情報をもとに、ユーザの集中度や、ユーザの操作状況を示す情報をユーザ状況情報として音声合成部１１（内の音韻制御部１１２および韻律制御部１１３）に送る。
【００７９】
音声合成部１１では、利用者状態監視部３２１からのユーザ状況情報を受け取ると、音韻制御部１１２および韻律制御部１１３においてそれぞれ適用する規則や蓄積データを当該ユーザ状況情報に応じて選択する。これにより、例えば集中度が予め定めた閾値以下の場合には、パワーを大きくしたり、文頭では発話速度が小さく（遅く）なるような規則を適用することで、ユーザに集中するように注意を促すことが可能となる。
【００８０】
一方、ユーザ状況監視部３２内の利用者環境監視部３２２は、ユーザがシステムを利用している場所の環境（利用者環境）を得るための入力デバイスや時計、利用履歴の少なくとも１つからの情報を監視し、例えばユーザがどのような音環境（周囲雑音環境）下に居るかとか、どの程度の明るさの場所に居るかとか、ユーザの物理的な居場所（位置）といったシステム利用環境監視結果を出力する。このような入力デバイスとして、例えば、周囲雑音を集音するマイクロフォンや、ＧＰＳなどの位置推定デバイス、さらには明るさセンサ、カメラ、ガスセンサ、水センサなどが挙げられる。この他、時計、利用履歴については、同じ曜日、同じ時間帯にどのような利用環境にあったかを記録しておくことで、利用環境の推定精度を高めるのに用いられる。
【００８１】
利用者環境監視部３２２は、このようにして取得したユーザのシステム利用環境に関する情報をもとに、周囲雑音のスペクトル特徴やレベル、明るさ、ユーザの居場所（位置）等を示す情報をユーザ状況情報として音声合成部１１（内の音韻制御部１１２および韻律制御部１１３）に送る。
【００８２】
音声合成部１１では、利用者環境監視部３２２からのユーザ状況情報を受け取ると、音韻制御部１１２および韻律制御部１１３においてそれぞれ適用する規則や蓄積データを当該ユーザ状況情報に応じて選択する。これにより、例えば高周波数成分に優勢な雑音がある場合には、はっきり聞こえるように高いピッチの声になるように韻律制御規則を適用したり高周波数成分の優勢な蓄積素片を選択するように音韻制御規則を適用するとか、雑音レベルが低い静かなところでは、静かな声あるいは落ち着いた声になるような規則を適用することができる。また、明るい場所で利用する際はピッチが高めで発話速度が早くなるように、暗い場所で利用する際には発話速度を遅く、ピッチのダイナミックレンジは広くなるように韻律規則を適用することで、明るい場所に比較して暗い場所では比較的落ちついた印象を与えることもできる。このような対応関係は、ユーザの好みに応じて変更可能としても構わない。
【００８３】
なお、以上の実施形態では、ユーザ状況監視部３２には、利用者状態監視部３２１および利用者環境監視部３２２の両方が設けられているものとしたが、いずれか一方だけが設けられているものであっても構わない。
［第４の実施形態］
図４は本発明の音声合成システムの第４の実施形態を示すブロック構成図である。なお、図３と同一部分には同一符号を付してある。
【００８４】
まず、図４の構成の特徴は、図３の構成（の音声合成部１１およびユーザ状況監視部３２）に加えて、非自然音声明示判断部４１と、テキスト変更部４２が設けられている点である。これに伴い、図４における音声合成部１１（内の言語解析部１１１等）の機能も、以下に述べるように図３中の音声合成部１１（内の言語解析部１１１等）とは異なるが、便宜上同一符号を付してある。
【００８５】
まず非自然音声明示判断部４１は、ユーザ状況監視部３２の出力するユーザ状況情報をもとに、人間の発声でないこと（非自然音声であること）を明示すべきか否かを判断し、その判断結果（非自然音声明示判断結果）を出力する。例えば、ユーザ状況情報においてユーザが集中していないことを示している場合や、これまであまり合成音声の出力をしたことのない時間帯や場所であることを示している場合には、人間の発声でないことを明示すべきであるという判断結果を出力する。
【００８６】
テキスト変更部４２は非自然音声明示判断部４１からの非自然音声明示判断結果を受け取り、当該判断結果が人間の発声でないことを明示すべきことを示している場合には、入力テキストに対応する合成音声の出力に先立ち（すなわち、言語解析部１１１での入力テキストに対する言語解析結果を音韻制御部１１２および韻律制御部１１３に出力して、対応する音声波形を生成・出力させるのに先立ち）、音声合成によるメッセージ出力が始まることを予告する「合成音です」「システムからのお知らせです」などの定型表現を前置する。音声合成部１１は、このテキスト変更部４２によって前置された語彙を含めて合成出力する。
【００８７】
このようにして、例えばユーザが集中していない場合や、これまであまり合成音声の出力をしたことのない時間帯や場所での利用の場合に、音声合成によるメッセージ出力が始まることを予告する（非自然音声であることを明示する）合成音を、入力テキストに対応する合成音声の出力に先立って出力することで、そのような状況をユーザに知らせることができる。特に、高品質で肉声に近い合成音声が出力される状況では、人の声がする利用者環境のもとでの利用の場合に、非自然音声であることを明示する合成音を前置することで、周囲の人の声と紛らわしくしないとか、非自然音声であることを明示しないことで、合成音であることを強調して注意を集めるのを避けることが可能である。
［第５の実施形態］
図５は本発明の音声合成システムの第５の実施形態を示すブロック構成図である。なお、図４と同一部分には同一符号を付してある。
【００８８】
まず、図５の構成の特徴は、図４で示したテキスト変更部４２に代えて、人間の発声でないことを明示する音（非自然音声明示音）を出力する非自然音声明示音出力部４３が設けられている点である。これに伴い、図４における音声合成部１１の機能も、例えば当該音声合成部１１内の波形出力部１１５が、波形生成部１１４により生成される合成音声と、非自然音声明示音出力部４３の生成する非自然音声明示音とを混合する機能を有しているというように、図４中の音声合成部１１（内の波形出力部１１５等）とは異なるが、便宜上同一符号を付してある。
【００８９】
まず、非自然音声明示音出力部４３は、非自然音声明示判断部４１から出力される非自然音声明示判断結果が人間の発声でないことを明示すべきことを示している場合には、入力テキストに対応する合成音声の出力に先立ち、例えば「ピ」といった信号音（非自然音声明示音）を出力する。この信号音は、音韻制御部１１２および韻律制御部１１３による音韻・韻律的な制御に従って波形生成部１１４により生成される合成音声の出力に先立ち、波形出力部１１５によって出力される。
【００９０】
このようにして、ユーザが集中していない場合や、これまであまり合成音声の出力をしたことのない時間帯や場所での利用の場合に、例えば「ピ」という非自然音声明示音を、入力テキストに対応する合成音声の出力に先立って出力することで、人間の発した声ではなく合成音声によるメッセージであることを明示してユーザに対して注意を促すことができる。
【００９１】
なお、図５の構成に図４中のテキスト変更部４２を加え、このテキスト変更部４２と非自然音声明示音出力部４３の両方を備えた構成とすることも可能である。
［第６の実施形態］
図６は本発明の音声合成システムの第６の実施形態を示すブロック構成図である。なお、図１と同一部分には同一符号を付してある。
【００９２】
まず、図６の構成の特徴は、図１の構成（の音声合成部１１および通信状態監視部１２）に加えて、図４に示したような入力テキストの変更を行うテキスト変更部４２が設けられている点である（但し、テキスト変更の内容が、図４の例とは異なる）。これに伴い、音声合成部１１内（の言語解析部１１１等）の機能も、以下に述べるように図１中の音声合成部１１（内の言語解析部１１１等）とは異なるが、便宜上同一符号を付してある。
【００９３】
図６の構成の音声合成システムにおいて、音声合成部１１内の言語解析部１１１は、通信状態監視部１２から通信状態情報を受け取ると、当該情報をテキスト変更部４２に渡して起動する。
【００９４】
するとテキスト変更部４２は、言語解析部１１１と連絡をとりながら、言語解析部１１１により言語解析されている入力テキストに通信状態情報に応じた定型表現の語彙を挿入して当該テキストを変更する。すなわちテキスト変更部４２は、音声合成部１１内の音韻制御部１１２および韻律制御部１１３の処理の先頭において、あるいは、韻律制御部１１３の処理の途中においてポーズ挿入位置を決めた段階において、文頭や文末、あるいはポーズ挿入位置に、通信状態情報によって決まる定型表現の語彙を挿入する。音声合成部１１は、このテキスト変更部４２によって挿入された語彙を含めて合成出力する。
【００９５】
以上のテキスト変更部４２での通信状態情報に応じたテキスト変更処理により、例えば、通信量大のとき（通信が混んでいるとき）には、「あ」「えーと」「えー」「はい」などの不要語を文頭や文末、あるいはポーズ挿入位置に挿入したり、「ちょっと待って」などのメッセージを文頭に前置することができる。このような決まった語彙（あらかじめ設定されている語彙）を挿入することによって、処理時間をかせぎ合成音声の処理による負荷を低減する効果がある。また、逆に通信量小のときには、上記と同様の不要語を挿入すれば、システムがアイドル状態であることをユーザにそれとなく示すという効果がある。
【００９６】
なお、図６の構成におけるテキスト変更部４２は、通信状態監視部１２からの通信状態情報を音声合成部１１を通して受け取るものとしているが、通信状態監視部１２から直接受け取るようにしても構わない。
［第７の実施形態］
図７は本発明の音声合成システムの第７の実施形態を示すブロック構成図である。なお、図２と同一部分には同一符号を付してある。
【００９７】
まず、図７の構成の特徴は、図２の構成（の音声合成部１１および動作状態監視部２２）に加えて、図６に示したようなテキスト変更部４２が設けられている点である。これに伴い、図７における音声合成部１１内（の言語解析部１１１等）の機能も、以下に述べるように図２中の音声合成部１１（内の言語解析部１１１等）とは異なるが、便宜上同一符号を付してある。
【００９８】
図７の構成の音声合成システムにおいて、音声合成部１１内の言語解析部１１１は、動作状態監視部２２からシステムの動作状態情報を受け取ると、当該情報をテキスト変更部４２に渡して起動する。
【００９９】
するとテキスト変更部４２は、言語解析部１１１と連絡をとりながら、言語解析部１１１により言語解析されている入力テキストに動作状態情報に応じた定型表現の語彙を挿入する。すなわちテキスト変更部４２は、音声合成部１１内の音韻制御部１１２および韻律制御部１１３の処理の先頭において、あるいは、韻律制御部１１３の処理の途中においてポーズ挿入位置を決めた段階において、文頭や文末、あるいはポーズ挿入位置に、動作状態情報によって決まる定型表現の語彙を挿入する。音声合成部１１は、このテキスト変更部４２によって挿入された語彙を含めて合成出力する。
【０１００】
以上のテキスト変更部４２での動作状態情報に応じたテキスト変更処理により、例えば、ＣＰＵが長時間占有されているときには、「あ」「えーと」「えー」「はい」などの不要語を文頭や文末、あるいはポーズ挿入位置に挿入することができる。このような決まった語彙を挿入することによって、処理時間をかせぎ合成音声の処理による負荷を低減する効果がある。
【０１０１】
なお、図７の構成におけるテキスト変更部４２は、動作状態監視部２２からの動作状態情報を音声合成部１１を通して受け取るものとしているが、動作状態監視部２２から直接受け取るようにしても構わない。
［第８の実施形態］
図８は本発明の音声合成システムの第８の実施形態を示すブロック構成図である。なお、図３と同一部分には同一符号を付してある。
【０１０２】
まず、図８の構成の特徴は、図３の構成（の音声合成部１１およびユーザ状況監視部３２）に加えて、図６に示したようなテキスト変更部４２が設けられている点である。これに伴い、図８における音声合成部１１内（の言語解析部１１１等）の機能も、以下に述べるように図３中の音声合成部１１（内の言語解析部１１１等）とは異なるが、便宜上同一符号を付してある。
【０１０３】
図８の構成の音声合成システムにおいて、音声合成部１１内の言語解析部１１１は、ユーザ状況監視部３２からユーザ状況情報を受け取ると、当該情報をテキスト変更部４２に渡して起動する。
【０１０４】
するとテキスト変更部４２は、言語解析部１１１と連絡をとりながら、言語解析部１１１により言語解析されている入力テキストにユーザ状況情報に応じた定型表現の語彙を挿入する。すなわちテキスト変更部４２は、音声合成部１１内の音韻制御部１１２および韻律制御部１１３の処理の先頭において、あるいは、韻律制御部１１３の処理の途中においてポーズ挿入位置を決めた段階において、文頭や文末、あるいはポーズ挿入位置に、ユーザ状況情報によって決まる定型表現の語彙を挿入する。音声合成部１１は、このテキスト変更部４２によって挿入された語彙を含めて合成出力する。
【０１０５】
以上のテキスト変更部４２でのユーザ状況情報に応じたテキスト変更処理により、例えば、ユーザが集中していないときには、「あの」などの人に声をかける語彙を文頭に設定することで、ユーザに注意を促すことができる。
【０１０６】
なお、図８の構成におけるテキスト変更部４２は、ユーザ状況監視部３２からのユーザ状況情報を音声合成部１１を通して受け取るものとしているが、ユーザ状況監視部３２から直接受け取るようにしても構わない。
【０１０７】
【発明の効果】
以上詳述したように本発明によれば、言語情報の持つメッセージとしての直接的な情報伝達だけでなく、音声合成機能を含む、システム全体の状況をそれとなく示す、音声メディアの持つ副次的な情報伝達機能を利用し、使い勝手のよいシステム構築が可能になる。また、ユーザの利用状況に応じた合成音声出力が可能となる。
【０１０８】
特に、計算機の出力メディアとして、システム内部の状態をユーザに伝えることはユーザインタフェースの観点からも重要である。言語メッセージ伝達としての主目的としての利用と同時に、システムがどのような動作状況にあるかをそれとなくユーザに伝えることは、音声メディアの利用形態として適切なものであるといえる。
【０１０９】
このような情報は画面表示部などの視覚的な出力と併用することでその効果を高めることも可能であるだけでなく、ＰＤＡに代表される携帯機器のように表示部の面積が小さい場合には、メッセージを、主に音声メディアによって伝えるようにすれば、メッセージ表示による画面の面積の占有を抑えることが可能になる。
【０１１０】
さらに、ユーザの利用状況を考慮して韻律や声色を制御することにより、より自然なシステム出力が可能になる。それは、状況を考慮せず単調な合成音声を出力しないようにするというだけでなく、高品質で肉声に近い合成音声が増えるような状況では、逆に合成音声であることを明らかにして、音声としては不自然さはあっても、機械とのコミュニケーションとしては自然なやりとりが可能になる。
【図面の簡単な説明】
【図１】本発明の音声合成システムの第１の実施形態を示すブロック構成図。
【図２】本発明の音声合成システムの第２の実施形態を示すブロック構成図。
【図３】本発明の音声合成システムの第３の実施形態を示すブロック構成図。
【図４】本発明の音声合成システムの第４の実施形態を示すブロック構成図。
【図５】本発明の音声合成システムの第５の実施形態を示すブロック構成図。
【図６】本発明の音声合成システムの第６の実施形態を示すブロック構成図。
【図７】本発明の音声合成システムの第７の実施形態を示すブロック構成図。
【図８】本発明の音声合成システムの第８の実施形態を示すブロック構成図。
【符号の説明】
１１…音声合成部、
１２…通信状態監視部、
２２…動作状態監視部、
３２…ユーザ状況監視部、
４１…非自然音声明示判断部、
４２…テキスト変更部、
４３…非自然音声明示音出力部、
１１１…言語解析部、
１１２…音韻制御部、
１１３…韻律制御部、
１１４…波形生成部、
１１５…波形出力部、
１２１…計算機内通信状態監視部、
１２２…計算機外通信状態監視部、
２２１…ハードウェア状態監視部、
２２２…ソフトウェア状態監視部、
３２１…利用者状態監視部、
３２２…利用者環境監視部、
１１２１…蓄積データ格納部。

[0001]
[Technical field of the invention]
The present invention relates to a speech synthesis system and a speech synthesis method suitable for use in systems in general that employ speech synthesis technology.
[0002]
[Prior art]
In recent years, as applications of speech synthesis technology have expanded and the processing power of computers has improved, speech synthesis has come into increasingly common everyday use, and text-to-speech conversion has become readily available as an ordinary application. One example is Iwata et al., "Software Japanese text-to-speech synthesis for personal computers," Proceedings of the Acoustical Society of Japan, 2-8-13, pp. 245-246 (October 1993).
[0003]
These speech synthesis systems are designed, in principle, to produce a unique result at every stage of processing: linguistic analysis of the input text, phonological control, prosodic control, and waveform generation. Once the input text is fixed, the resulting synthesized speech is therefore always the same.
[0004]
Some systems let the user choose, before synthesis, between options such as a male or female voice or a high or low voice, and generate synthesized speech according to that choice. Once the options are fixed, however, the synthesized speech generated for a given input text is uniquely determined, so these systems fall within the same framework.
[0005]
[Problems to be solved by the invention]
However, always generating the same synthesized speech for a given input text is not only monotonous and tiresome. It also fails to make effective use of the characteristics of speech as a medium.
[0006]
Unlike text, speech carries changes in prosody and voice quality. It is well known that speech therefore conveys, as a secondary channel, information beyond the direct message of the linguistic expression: emotion and intention, the speaker's situation, and the general mood. Examples include Fujisaki et al., "Expression of utterance intention using prosodic features of speech," Proceedings of the Acoustical Society of Japan, 2-8-16, pp. 225-226 (March 1993), and Uesoko et al., "Analysis and modeling of emotional expression of speech," IEICE Technical Report, SP92-131, pp. 65-72 (January 1993).
[0007]
The current speech synthesis systems and their applications described above all function as media conversion (text-to-speech conversion), simply converting linguistic information expressed in written form into spoken form. They can hardly be said, however, to make active use of the ability of speech to convey secondary information as well.
[0008]
Furthermore, whether speech synthesis operates in cooperation with other systems rather than as a stand-alone device, or is used as one application alongside other applications on a general-purpose computer such as a personal computer or workstation, it merely converts the given text faithfully into speech. It does so regardless of the status of the systems and applications running alongside it and regardless of the state of the system on which the speech synthesis application is running.
[0009]
The present invention has been made in view of the above circumstances. Its object is to provide a speech synthesis system and a speech synthesis method that improve usability by dynamically changing phonological and prosodic control according to the operating status of the system (system status), the user's own status (user status), or the environment of the place where the user is located (user environment), and further by outputting another sound or synthesized sound together with the original synthesized sound corresponding to the input text. This makes it possible to generate system output from which the user can easily grasp the operating status of the system, and system output suited to the situation in which the user is placed.
[0010]
[Means for Solving the Problems]
To solve the above problem, a configuration according to a first aspect of the present invention comprises speech synthesis means that performs linguistic analysis of an input text, applies rules to the analysis result to perform phonological and prosodic control, and generates and outputs synthesized speech. In addition, it comprises communication state monitoring means that monitors at least one of the communication state within the computer and the communication state between the computer and the outside, and outputs communication state information. The configuration is characterized in that the speech synthesis means changes the rules applied in the phonological and prosodic control according to the communication state information output from the communication state monitoring means.
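As an illustration only, the rule switching of the first aspect might be sketched as follows. The correspondences loosely follow the examples given later in the description (paragraphs [0044] and [0047]): faster speech for heavy traffic, calmer speech with a wider pitch range for light traffic, and irregular pitch fluctuation for a poor channel. The class `CommState`, the thresholds, the rule-set names, and the parameter values are all assumptions made for this sketch and do not appear in the specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommState:
    """Hypothetical communication state information."""
    traffic_bytes_per_s: float  # assumed measure of communication volume
    error_rate: float           # assumed measure of channel quality, 0.0 to 1.0


def quantize(value, low, high):
    """Round a raw measurement to one of three discrete levels."""
    if value < low:
        return "low"
    if value < high:
        return "medium"
    return "high"


# Hypothetical rule sets, named after the speaking style they aim for.
RULE_SETS = {
    "calm": {"rate": 0.9, "pitch_range": 1.2},
    "neutral": {"rate": 1.0, "pitch_range": 1.0},
    "hurried": {"rate": 1.3, "pitch_range": 1.0},
    "unsteady": {"rate": 1.0, "pitch_range": 0.8, "pitch_jitter": 0.05},
}


def select_rule_set(state):
    """Pick the name of the rule set to apply for a communication state."""
    # A poor channel takes priority over traffic volume (an assumed policy).
    if quantize(state.error_rate, 0.01, 0.05) == "high":
        return "unsteady"
    traffic = quantize(state.traffic_bytes_per_s, 1e4, 1e6)
    return {"low": "calm", "medium": "neutral", "high": "hurried"}[traffic]
```

The phonological and prosodic control stages would then look up `RULE_SETS[select_rule_set(state)]` before processing each input text, so that the same text is rendered differently as the communication state changes.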
[0011]
A configuration according to a second aspect of the present invention comprises, in addition to speech synthesis means corresponding to that of the first aspect, operating state monitoring means that monitors at least one of the operating state of the computer hardware and the operating state of the computer software, and outputs operating state information. The configuration is characterized in that the speech synthesis means changes the rules applied in the phonological and prosodic control according to the operating state information output from the operating state monitoring means. Here, at least one of the linguistic analysis, phonological control, prosodic control, and speech waveform generation in the speech synthesis means is preferably shared, according to the operating state indicated by the operating state information, among a plurality of pieces of computer hardware connected by a wired or wireless network over which they can communicate.
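The sharing of processing described above can be pictured with a small planning function. This is a minimal sketch under stated assumptions: the stage names, the per-stage memory costs, the CPU limit, and the policy of handing over the most expensive remaining stage first are all hypothetical. The specification says only that the stages to share are chosen by a priority that depends on the synthesis method and the size of the stored data.

```python
STAGES = [
    "language_analysis",
    "phoneme_control",
    "prosody_control",
    "waveform_generation",
]


def plan_offload(current_stage, cpu_load, free_memory_mb, cost_mb, cpu_limit=0.8):
    """Return the stages after current_stage to hand to another machine.

    cost_mb maps each stage name to an assumed memory cost in megabytes.
    """
    remaining = STAGES[STAGES.index(current_stage) + 1:]
    local_need = sum(cost_mb[s] for s in remaining)
    offload = []
    # Consider the most expensive stages first (assumed priority order).
    for stage in sorted(remaining, key=lambda s: cost_mb[s], reverse=True):
        cpu_short = cpu_load > cpu_limit and not offload
        mem_short = local_need > free_memory_mb
        if not (cpu_short or mem_short):
            break
        offload.append(stage)
        local_need -= cost_mb[stage]
    return offload
```

With ample resources the function returns an empty list and everything runs locally. When the CPU is overloaded it hands over at least one stage, and when memory is short it keeps handing over stages until the rest fits.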
[0012]
A configuration according to a third aspect of the present invention comprises, in addition to speech synthesis means corresponding to that of the first aspect, user status monitoring means that monitors at least one of the user's system usage status and the user's system usage environment, and outputs user status information. The configuration is characterized in that the speech synthesis means changes the rules applied in the phonological and prosodic control according to the user status information output from the user status monitoring means. It is also possible to further provide non-natural-speech indication determination means that outputs, based on the user status information, a determination of whether it should be made explicit that the speech is not a human utterance, together with at least one of the following: text changing means that, according to that determination, makes it explicit that the speech is not a human voice by changing part of the wording of the input text; and non-natural-speech indication sound output means that, according to that determination, outputs a sound indicating that the speech is not a human utterance along with the synthesized speech output.
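The optional determination and indication means can be sketched as below. The decision criteria mirror the examples given in the fourth and fifth embodiments (the user is not concentrating, or synthesized speech has rarely been output at this time or place), and the two indication methods correspond to prefixing a fixed phrase and to emitting a signal tone. The `UserStatus` fields, the 0.5 threshold, the English announcement phrase, and the output representation are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class UserStatus:
    """Hypothetical user status information."""
    concentration: float    # assumed scale: 0.0 (inattentive) to 1.0 (attentive)
    familiar_context: bool  # True if speech is often output at this time and place


def should_announce(status, threshold=0.5):
    """Decide whether to make explicit that the voice is not human."""
    return status.concentration < threshold or not status.familiar_context


def prepare_output(text, status, use_beep=False):
    """Return the ordered output items as (kind, payload) pairs."""
    if not should_announce(status):
        return [("speech", text)]
    if use_beep:
        # Signal tone first, then the unmodified synthesized speech.
        return [("beep", None), ("speech", text)]
    # Otherwise change the text by prefixing a fixed announcement phrase.
    return [("speech", "This is a synthesized voice. " + text)]
```

For an attentive user in a familiar context, `prepare_output` passes the text through unchanged. Otherwise it either precedes the speech with a tone or prefixes the announcement, depending on `use_beep`.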
[0013]
In the configuration according to the first aspect, the speech synthesis means first performs well-known linguistic analysis, such as morphological analysis and syntactic structure analysis, on the input text. The input text is decomposed into a morpheme sequence, and a symbol string representing the reading is output together with information such as the part of speech, inflection, and accent type of each morpheme and the strength of the dependency relations between morphemes.
[0014]
Further, the speech synthesis means applies rules at various levels (phonological / prosodic rules) to the results of the above linguistic analysis. This performs the control that affects the quality of the synthesized voice in the conversion from language media to speech media, that is, phonological / prosodic control, and a speech waveform corresponding to the input text is generated.
[0015]
On the other hand, the communication monitoring means monitors at least one of a communication state in the computer in which the speech synthesis system operates and a communication state between the computer and the outside. Each communication state monitoring result is transmitted to the voice synthesizing means.
[0016]
In the voice synthesizing means, the application contents of the phoneme / prosodic rules are changed according to the result of monitoring the communication state.
Next, in the configuration according to the second aspect as well, the speech synthesis means applies phonological / prosodic rules at various levels to the results of the linguistic analysis of the input text. This performs the phonological / prosodic control that affects the quality of the synthesized voice in the conversion from language media to speech media, and a speech waveform corresponding to the input text is generated.
[0017]
On the other hand, the operation state monitoring means monitors at least one of the operation state of the computer hardware on which the speech synthesis system operates and the operation state of the computer software. Each operation state monitoring result is transmitted to the voice synthesizing means.
[0018]
In the voice synthesizing means, the application contents of the phonological / prosodic rules are changed according to the operation state monitoring result.
In addition, at least one of the language analysis, phonological control, prosodic control, and speech waveform generation processes in the speech synthesis means is shared, according to the operation state monitoring result, among a plurality of computer hardware units that can communicate over a wired or wireless network.
[0019]
Next, in the configuration according to the third aspect as well, the speech synthesis means applies phonological / prosodic rules at various levels to the results of the linguistic analysis of the input text. This performs the phonological / prosodic control that affects the quality of the synthesized voice in the conversion from language media to speech media, and a speech waveform corresponding to the input text is generated.
[0020]
On the other hand, the user status monitoring means monitors at least one of the user's system usage status and the user's system usage environment. Each user situation monitoring result is transmitted to the voice synthesis means.
[0021]
In the voice synthesizing means, the application contents of the phonological / prosodic rules are changed according to the result of the user situation monitoring.
Further, in a configuration that additionally includes non-natural voice explicit determination means and at least one of text changing means and non-natural voice explicit sound output means, either part of the expression of the input text is changed (for example, by addition) so that the text itself makes clear that the voice is not a human voice, or a sound indicating that the voice is not a human voice is output together with the synthesized voice. That is, in the latter case another sound or synthesized sound is output in addition to the original synthesized voice.
[0022]
BEST MODE FOR CARRYING OUT THE INVENTION
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
[First Embodiment]
FIG. 1 is a block diagram showing a first embodiment of the speech synthesis system of the present invention.
[0023]
The speech synthesis unit 11, which is the center of the system in FIG. 1, is composed of a language analysis unit 111 that performs language analysis of the input text, a phoneme control unit 112 that performs phonological control by applying rules to the analysis result, a prosody control unit 113 that likewise performs prosody control, a waveform generation unit 114 that generates a speech waveform under the control of the phoneme control unit 112 and the prosody control unit 113, and a waveform output unit 115 that outputs the generated waveform. For the framework of the speech synthesis unit 11, the general configuration of an existing speech synthesis system capable of text-to-speech conversion can be used as it is. A general construction method for a text-to-speech synthesis system is described, for example, in Sato et al.: "Speech synthesis from Japanese text," Research and Application Report of the Institute of Electro-Communications, Vol. 32, No. 11, pp. 2243-2252 (November 1983).
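The data flow among the units described above can be pictured with a minimal Python sketch. The function names, the toy "analysis" (splitting on spaces), and the fixed pitch and duration values are all assumptions made for illustration only; they are not taken from the specification, and the waveform output unit 115 is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Analysis:
    """Toy language-analysis result: morphemes and their readings."""
    morphemes: list
    readings: list


def language_analysis(text):
    # Hypothetical stand-in for unit 111: split on spaces, lower-case as "reading".
    words = text.split()
    return Analysis(morphemes=words, readings=[w.lower() for w in words])


def phoneme_control(analysis):
    # Hypothetical stand-in for unit 112: one storage-unit label per reading.
    return ["unit:" + r for r in analysis.readings]


def prosody_control(analysis):
    # Hypothetical stand-in for unit 113: fixed pitch, duration from reading length.
    return [{"pitch_hz": 120.0, "duration_ms": 80 * len(r)} for r in analysis.readings]


def waveform_generation(units, prosody):
    # Hypothetical stand-in for unit 114: pair each unit with its prosody.
    return list(zip(units, prosody))


def synthesize(text):
    analysis = language_analysis(text)
    return waveform_generation(phoneme_control(analysis), prosody_control(analysis))
```

Calling `synthesize("Hello world")` returns one (storage unit, prosody) pair per word, mirroring how units 112 and 113 both consume the output of unit 111 before unit 114 combines their results.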
[0024]
The language analysis unit 111 in the speech synthesis unit 11 performs linguistic analysis such as morphological analysis and syntactic structure analysis on the input text, decomposes the input text into a sequence of morphemes, and outputs a symbol string representing the reading together with information such as the part of speech and accent type of each morpheme and the strength of the dependency relations between morphemes.
[0025]
The phoneme control unit 112 and the prosody control unit 113 in the speech synthesis unit 11 apply rules at various levels to the language analysis results of the language analysis unit 111, and thereby perform the control that affects the quality of the synthesized speech in the conversion from language media to speech media.
[0026]
Specifically, each of the following stages of processing is carried out by its own rules: recognition of accent units (including accent assignment to unknown words, accent combination, accent division of compound words, and addition of secondary accents to chains of ancillary words); recognition of readings (including assignment of readings to unknown words, rendaku processing, and conversion from notation to reading); recognition of the units in which so-called prosodic words (accent phrases) are chained to form one prosodic unit (the assignment of such units corresponds to this); determination of prosody control parameter values (that is, determination of the parameter values of a model that generates the time-varying pitch pattern, determination of phoneme and pause timing, and determination of power); search for and selection of the storage units corresponding to the reading (including conversion to storage units, conversion to search conditions, and selection when multiple search results are obtained); and editing of storage units (including connection of storage units and interpolation processing).
[0027]
Among these, the phoneme control unit 112 is responsible for recognizing readings, searching for and selecting the storage units corresponding to the reading, and editing storage units, while the prosody control unit 113 is responsible for recognizing accent units, recognizing the units of prosodic word (accent phrase) chains, and determining prosody control parameter values. The categorization of the processes and rules at each stage is arbitrary; depending on the implementation of the system, it may differ from the above, or some stages may be omitted. Likewise, the boundaries with the language analysis unit 111 in the preceding stage and with the waveform generation unit 114 in the subsequent stage vary with the implementation, but here the phoneme control unit 112 is defined as the unit that performs the phonological control described above, and the prosody control unit 113 as the unit that performs the prosody control described above.
[0028]
The prosody control unit 113 applies these rules to the result of the morphological analysis performed by the language analysis unit 111 and determines the timing of the individual phonemes and pauses corresponding to the reading of the morpheme sequence. It divides the morpheme sequence, or the phoneme sequence corresponding to the reading, into prosodic words (accent phrases), which are the units of prosody control to which accent components are added. It further divides the prosodic word sequence into the units of prosody control to which so-called phrase (tone) components are added; these units are formed over a breath group by the semantic structure of the sentence and by physiological constraints. Taking the timing into account, it then determines the parameter values that give the magnitudes of the accent and phrase components for each prosody control unit, and thereby determines the pitch. The prosody control unit 113 further determines the power envelope based on the morpheme sequence, the phoneme sequence corresponding to the reading, the pitch, and so on.
[0029]
On the other hand, the phoneme control unit 112 has a stored data storage unit 1121 that stores storage units, each of which associates a partial sequence of the phonemes corresponding to a reading with a speech waveform, with analysis parameters of the speech waveform, or with both. Taking into account the variations of the storage units held in the stored data storage unit 1121, the phoneme control unit 112 determines the sequence of storage units corresponding to the phoneme partial sequences of the reading of the morpheme sequence.
[0030]
In the present embodiment, the rules applied by the phoneme control unit 112 and the prosody control unit 113 are switched according to the communication state inside the computer and the communication state between the computer and the outside. This will be described later.
[0031]
The waveform generation unit 114 in the speech synthesis unit 11 connects the storage unit sequence output from the phoneme control unit 112 and performs signal-processing-level control according to the control information output from the prosody control unit 113, that is, the timing, pitch, and power envelope, to generate a speech waveform.
[0032]
The waveform output unit 115 in the speech synthesis unit 11 outputs the speech waveform generated by the waveform generation unit 114 from, for example, a speaker or an earphone.
In the present embodiment, the storage units held in the stored data storage unit 1121 of the phoneme control unit 112, the rules used by the phoneme control unit 112, and the rules used by the prosody control unit 113 are created in advance from collected natural speech data that matches the tone of the synthesized speech to be generated. For example, to synthesize dialogue-like speech, simulated dialogue speech is collected. In the same way, as many kinds of human voice as possible are collected, such as whispers, fast speech, tired voices, cheerful voices, voices recorded in a crowd (or in a crowded environment), and calm voices, and the rules and stored data corresponding to each kind of voice are derived from analysis of its pitch, power, and duration.
[0033]
Previous studies have pointed out that speech in different situations has phonological and prosodic characteristics with different tendencies, and that prosody control rules derived from different speech data also show different tendencies. For example, Hirai et al.: "Differences in F₀ control rules automatically generated from various speech corpora," Proceedings of the Acoustical Society of Japan, 2-5-3, pp. 271-272 (October to November 1994), shows this together with actual data.
[0034]
There have been many studies on deriving rules from speech data, for example Hirose et al.: "Speech Synthesis and Accent Intonation," IEICE Journal, Vol. 70, No. 4, pp. 378-385 (April 1987); Mimura et al.: "Analysis and Control of Speech Power Using Statistical Method," Journal of the Acoustical Society of Japan, Vol. 49, No. 2, pp. 253-259 (December 1993); and Miki et al.: "Analysis of Pause Length Change of Sentence Voice by Speech Rate," Proc. 247-248 (December 1992). These methods can be used to extract the rules.
[0035]
To the control rules and storage units extracted for each environment, information on the extraction environment is attached for use at the time of speech synthesis. This is information on the circumstances under which the speech data was collected, such as dialogue, whisper, fast speech, tired voice, cheerful voice, voice in a crowd, or calm voice.
[0036]
As is well known, the phonological and prosodic control rules and the stored data of existing speech synthesis systems can essentially be understood as a correspondence between, on the one hand, a linguistic environment (for example, morphemes, parts of speech, and inflections) and a phonological / prosodic environment (for example, the arrangement of phonemes, accent type and accent nucleus, pitch, power envelope, and timing) and, on the other hand, control contents (for example, reading symbol strings, accent combination information, prosodic parameter values, and selection priorities of storage units) and speech waveforms or analysis parameters.
[0037]
Therefore, in this embodiment, the rule extraction environment is added to this correspondence, and the extracted control rules and stored data are described as a correspondence between the linguistic environment, the phonological / prosodic environment, and the rule extraction environment on the one hand, and the control contents and speech waveforms or analysis parameters on the other.
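One way to picture this extended correspondence is as a record keyed by three environments. The following Python sketch is only an illustration under assumed names: the `Rule` fields, the example environment labels, and the control values are hypothetical and do not appear in the specification.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    linguistic_env: str   # e.g. part of speech (hypothetical label)
    prosodic_env: str     # e.g. accent type (hypothetical label)
    extraction_env: str   # circumstances of the source speech, e.g. "calm"
    control: dict         # control contents (hypothetical values)


RULES = [
    Rule("noun", "accent-type-0", "calm", {"rate": 0.9, "pitch_range": 1.2}),
    Rule("noun", "accent-type-0", "fast", {"rate": 1.3, "pitch_range": 0.9}),
]


def select_rules(rules, linguistic_env, prosodic_env, extraction_env):
    """Return the rules whose three environments all match the request."""
    key = (linguistic_env, prosodic_env, extraction_env)
    return [
        r for r in rules
        if (r.linguistic_env, r.prosodic_env, r.extraction_env) == key
    ]
```

With the extraction environment as part of the key, the same linguistic and prosodic context can map to different control contents, which is what allows the tone of the synthesized voice to be varied.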
[0038]
As described above, a plurality of sets of rules and stored data are provided, and the speech synthesis unit 11 (the language analysis unit 111 and the phoneme control unit 112 therein) appropriately selects and uses them, so that the tone of the synthesized voice can be varied.
[0039]
Therefore, in the present embodiment, in addition to the above-described speech synthesis unit 11, a communication state monitoring unit 12 is provided as means for supplying the information that determines the conditions for rule selection in the speech synthesis unit 11. The communication state monitoring unit 12 includes an in-computer communication state monitoring unit 121 that monitors the communication state inside the computer on which the speech synthesis system operates, and an external communication state monitoring unit 122 that monitors the communication state between the computer and the outside.
[0040]
The in-computer communication state monitoring unit 121 in the communication state monitoring unit 12 monitors the communication state, in terms of the communication status and the quality of the communication path, between pieces of software, between pieces of hardware, or between software and hardware operating in the same computer. For simplicity, the hardware or software entities that communicate with each other are referred to here simply as communicator A and communicator B. That is, communication is assumed to take place between communicator A and communicator B.
[0041]
To learn the state of the communication exchanged between them, the in-computer communication state monitoring unit 121 queries the software or hardware that mediates this communication (hereinafter referred to as a communication medium for convenience) and is notified of the communication status (for example, the sender of the information, the amount of communication or its change over time, the frequency of communication, the total amount of data to be sent, and the amount of data already sent) and of the quality of the communication path (for example, the data transfer speed and the error frequency). A query is not always required; the communication medium may notify the in-computer communication state monitoring unit 121 at an appropriate timing even without one.
[0042]
As such a communication medium, it is possible to use existing functions provided by an operating system (hereinafter referred to as an OS) (for example, a messaging function such as Windows DDE (Dynamic Data Exchange), or data transfer using the clipboard), existing functions provided by a window system (for example, events and selection buffers in the X Window System, or Windows messages), or various services implemented in a server-client model. Of course, it is possible not only to use an existing system but also to construct a new system having a similar mechanism.
[0043]
Further, to learn the state of the communication exchanged between communicator A and communicator B, a mechanism may be employed in which communicators A and B are queried directly, without going through a communication medium. In this case, the in-computer communication state monitoring unit 121 queries the parts of communicators A and B that have the function of performing communication (hereinafter referred to as communication units) and is notified of the communication state in the same way as above. Of course, as above, the communication units of communicators A and B may notify the in-computer communication state monitoring unit 121 at an appropriate timing even without a query.
[0044]
Using the information on the communication state obtained in this way, the in-computer communication state monitoring unit 121 transmits to the speech synthesis unit 11, as communication state information, information such as whether the traffic is large or small, whether the amount of data to be sent is large or small, and what proportion of the data has already been sent. These values may be used as communication state information as they are, or the in-computer communication state monitoring unit 121 may compare them with thresholds and round them to discrete levels before using them as communication state information.
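The rounding of measured values to discrete levels can be sketched as follows. This is a minimal Python illustration; the function names, the threshold values, and the level labels are assumptions chosen for the example and are not specified in the text.

```python
from bisect import bisect_right


def to_level(value, thresholds, labels):
    """Map a number to a discrete label; thresholds must be in ascending order."""
    if len(labels) != len(thresholds) + 1:
        raise ValueError("need exactly one more label than thresholds")
    return labels[bisect_right(thresholds, value)]


def communication_state(bytes_per_s, sent_bytes, total_bytes):
    """Build a communication-state record from raw measurements (hypothetical fields)."""
    # Treat an empty transfer as already complete to avoid dividing by zero.
    progress = sent_bytes / total_bytes if total_bytes else 1.0
    return {
        # Assumed thresholds: 10 kB/s and 1 MB/s.
        "traffic": to_level(bytes_per_s, [10_000, 1_000_000],
                            ["small", "medium", "large"]),
        "progress": round(progress, 2),
    }
```

Here the traffic is rounded to a label while the progress ratio is passed through as a number, reflecting that either form may serve as communication state information.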
[0045]
On the other hand, the external communication state monitoring unit 122 in the communication state monitoring unit 12 monitors the state of communication with the outside of the computer. Like the in-computer communication state monitoring unit 121, the external communication state monitoring unit 122 can be configured to acquire the communication state via a communication medium. As the communication medium, in addition to existing functions provided by the OS (messaging functions), existing functions provided by the window system (for example, events), and various services implemented in a server-client model (for example, Network File System or a printer daemon), a device such as a modem capable of data communication with equipment outside the computer, or its driver, can be used. Of course, as in the above case, it is possible not only to use an existing system but also to construct a new system having the same mechanism.
[0046]
The speech synthesis unit 11 receives communication state information from each of the in-computer communication state monitoring unit 121 and the external communication state monitoring unit 122 (in the communication state monitoring unit 12), and the phoneme control unit 112 and the prosody control unit 113 each select the control rules and stored data to be applied according to that communication state information.
[0047]
Here, the correspondence between the communication state information and the control rules and stored data to be selected is determined in the phoneme control unit 112 and the prosody control unit 113. Examples of such a correspondence are as follows. If the traffic is large or communication is frequent, the voice is made fast, and if it is very large, a tense voice is used; conversely, if the traffic is small or communication is infrequent, a calm voice with a wider pitch dynamic range is used, more pauses are inserted, or a relaxed voice is used. When the remaining amount to be communicated is large, a relaxed voice is used, and as the remaining amount decreases, the pitch is raised or the voice is made faster. When the quality of the communication path is poor, a weak voice is used, or the voice quality is changed by superimposing irregular fluctuation on the pitch. When the transfer speed is high, a light voice is used, and when it is low, a heavy voice is used.
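A correspondence of this kind could be held as a simple lookup from state to voice settings. The Python sketch below is an assumption-laden illustration: the state keys, the style fields, and every numeric value are hypothetical, and only some of the example correspondences from the text are encoded.

```python
def select_voice_style(state):
    """Return illustrative voice settings for a communication-state dict."""
    style = {"voice": "normal", "rate": 1.0, "pitch_range": 1.0,
             "extra_pauses": False}

    traffic = state.get("traffic")
    if traffic == "very large":
        style.update(voice="tense", rate=1.3)   # tense voice for very heavy traffic
    elif traffic == "large":
        style.update(rate=1.2)                  # faster speech for heavy traffic
    elif traffic == "small":
        # Calm voice, wider pitch dynamic range, more pauses for light traffic.
        style.update(voice="calm", pitch_range=1.3, extra_pauses=True)

    if state.get("remaining") == "small":
        style["rate"] = max(style["rate"], 1.2)  # speed up near the end

    if state.get("channel_quality") == "poor":
        style["voice"] = "weak"                  # weak voice on a poor channel

    return style
```

Because the text notes that the correspondence is only an example and may follow user preference, a table of this sort would naturally be user-configurable rather than fixed.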
[0048]
As described above, the control rules and stored data that the speech synthesis unit 11 (the language analysis unit 111 and the phoneme control unit 112 therein) applies to the analysis result of the language analysis unit 111 are switched (changed) according to the communication state indicated by the communication state information output from the communication state monitoring unit 12 (the in-computer communication state monitoring unit 121 or the external communication state monitoring unit 122 therein), and the synthesized voice is output accordingly. As a result, the user can learn, from the tone of the synthesized voice, the communication state inside the computer at that time or the communication state between the computer and the outside.
[0049]
It should be noted that the above-described correspondence is merely an example and may be changed according to the preference of the user of the speech synthesis system. Further, separate correspondences may be set for the communication state inside the computer monitored by the in-computer communication state monitoring unit 121 and for the communication state with the outside monitored by the external communication state monitoring unit 122.
[0050]
In the above embodiment, the communication state monitoring unit 12 includes both the in-computer communication state monitoring unit 121 and the external communication state monitoring unit 122. However, only one of them may be provided.
[Second embodiment]
FIG. 2 is a block diagram showing a second embodiment of the speech synthesis system according to the present invention. The same parts as those in FIG. 1 are denoted by the same reference numerals.
[0051]
The configuration of FIG. 2 is characterized in that, in addition to the speech synthesis unit 11, an operation state monitoring unit 22 is provided, which has a hardware state monitoring unit 221 for monitoring the operating state of computer hardware and a software state monitoring unit 222 for monitoring the operating state of computer software. Accordingly, the functions of the speech synthesis unit 11 (the phoneme control unit 112 and the prosody control unit 113 therein) in FIG. 2 differ, as described below, from those of the speech synthesis unit 11 (the phoneme control unit 112 and the prosody control unit 113) in FIG. 1, but the same reference numerals are used for convenience.
[0052]
The hardware state monitoring unit 221 in the operation state monitoring unit 22 directly measures a parameter indicating the operation state of the computer hardware on which the speech synthesis system operates, or inquires of the computer hardware or its software driver about the operation state. Alternatively, the operating state of the computer hardware is monitored by notifying the operating state at an appropriate timing from the computer hardware or its software driver itself.
[0053]
For example, it monitors the level and stability of the power supply voltage supplied to the hardware configuring the system, and the connection status (whether connected, and whether usable) of devices (peripheral devices) such as cards, printers, keyboards, and mice, and of cables such as network cables.
[0054]
Based on the monitoring results on the hardware state acquired in this way, the hardware state monitoring unit 221 transmits to the speech synthesis unit 11, as operation state information, for example power quality information that ranks the power supply voltage as "sufficiently high / high / slightly low / low / considerably low" and as "sufficiently stable / stable / slightly unstable / very unstable", or information such as "hardware available / standby / connection lost".
[0055]
Note that the above classification is an example, and any classification is possible as needed. In addition, an appropriate threshold value may be set and compared with the threshold value, the result may be rounded to a discrete level, or the acquired numerical value may be used as the operation state information, and the present invention is not limited to the above classification.
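The ranking of power quality by comparison with thresholds might look like the following Python sketch. The function name, the nominal voltage, the threshold ratios, and the reduced set of rank labels are all assumptions for illustration; the text leaves the classification open.

```python
from statistics import mean, pstdev


def rank_power(samples_v, nominal_v=5.0):
    """Rank supply-voltage level and stability from sampled readings (volts)."""
    level_ratio = mean(samples_v) / nominal_v      # average level vs. nominal
    ripple_ratio = pstdev(samples_v) / nominal_v   # spread vs. nominal

    if level_ratio >= 0.98:
        level = "high"
    elif level_ratio >= 0.94:
        level = "slightly low"
    else:
        level = "low"

    if ripple_ratio <= 0.01:
        stability = "stable"
    elif ripple_ratio <= 0.05:
        stability = "slightly unstable"
    else:
        stability = "very unstable"

    return level, stability
```

As the text allows, the raw `level_ratio` and `ripple_ratio` values could equally be passed on unrounded as operation state information.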
[0056]
The speech synthesis unit 11 receives operation state information from the hardware state monitoring unit 221 (in the operation state monitoring unit 22), and according to the operation state information, applies control rules to be applied to the phoneme control unit 112 and the prosody control unit 113, respectively. Select the stored data.
[0057]
Here, the correspondence between the operation state information and the control rules and stored data to be selected is determined in the phoneme control unit 112 and the prosody control unit 113, as with the correspondence between the communication state information and the selected control rules and stored data in the first embodiment. For example, when a high-quality power supply voltage is supplied sufficiently stably, synthesized voice is generated with normal prosody control and timbre; when the power supply voltage starts to drop or becomes unstable, a rule that selects stored data corresponding to a slightly unwell voice is chosen, the rules are switched to ones that produce a slack tone, a rule that reduces the rise and fall of the pitch is selected, or a rule that gives a soft voice is selected. Of course, this correspondence is merely an example and may be changed according to the preference of the user of the speech synthesis system. The correspondence may also be chosen so as to give the opposite impression to the above.
[0058]
The prosody control unit 113 and the phonology control unit 112 in the speech synthesis unit 11 control the prosodic and phonological quality of the synthesized speech to be generated and output by using the rules selected according to such correspondence. This allows the user to know the state of the computer hardware at that time from the tone of the synthesized voice.
[0059]
In a portable system (portable device) typified by a PDA (Personal Digital Assistant), the area available for display is small, so it is inefficient to allocate a large area to presenting the operation state information of the system; yet if the area is too small, the original purpose of calling the user's attention may not be met. In addition, the stability of the supplied power in a portable system is usually lower than in a fixed system in a maintained environment. In a portable system, therefore, it is effective to convey such operation state information by controlling prosody and voice quality. Like the power supply voltage, the connection status of hardware is something general users are usually not very conscious of and tend to overlook; if the prosody and voice quality are changed according to changes in the connection status, the user can be informed of it implicitly.
[0060]
On the other hand, the software state monitoring unit 222 in the operation state monitoring unit 22 acquires the operating state of software. This includes the operating state that results from the limited distribution of computer resources, such as how much of the computer resources (processor (CPU), memory, hard disk, and so on) of the computer on which the speech synthesis system operates is occupied by a given (target) piece of software (process) or, from the opposite point of view, how long a given piece of software has been waiting for processing. It also includes the operating state corresponding to the operation mode (scene) of the software, such as what kind of input a given piece of software currently accepts (for example, which input device types and input content types are valid) and what information it currently presents (for example, the source of the presented information and the type of the presented content).
[0061]
The software state monitoring unit 222 acquires such an operating state either by querying the OS on which the software is running and being notified by it, or by having a notification unit (notification function) added to the software itself that reports the operating state directly when queried. Needless to say, a mechanism may also be provided in which the software itself notifies the software state monitoring unit 222 of its operating state at an appropriate timing even without a query.
[0062]
Here, examples of the acquired operation state information of the software include, for example, information such as a memory usage amount, a software state, a CPU occupancy rate, a total occupation time, and an operation priority. These pieces of information can be obtained by using the existing OS system calls and libraries. Further, software having a notification unit for notifying the type of input currently accepted and the type of information being presented may be newly created.
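As an illustration of obtaining such values through existing OS interfaces, the Python sketch below reads a few figures for the current process from the standard library. It assumes a Unix-like OS (the `resource` module and `os.getpriority` are not available on Windows), and the function name and the returned field names are hypothetical.

```python
import os
import resource  # Unix-only standard-library module


def software_state():
    """Collect a few operation-state figures for the current process."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return {
        "pid": os.getpid(),
        # Accumulated CPU time (user + system), in seconds.
        "cpu_seconds": usage.ru_utime + usage.ru_stime,
        # Peak resident set size; the unit is platform dependent
        # (kilobytes on Linux, bytes on macOS).
        "max_rss": usage.ru_maxrss,
        # Scheduling priority (nice value) of this process.
        "priority": os.getpriority(os.PRIO_PROCESS, 0),
    }
```

A monitoring unit built this way would only see its own process; observing other software would need either OS facilities for inspecting other processes or the notification function described in the text.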
[0063]
In general, the types of input that are accepted change dynamically according to the operation mode (scene), even within the same application. For example, an e-mail application that sends and receives e-mail has scenes such as displaying a list of received mail, displaying the contents of one selected mail, editing the text of a mail to be sent, and sending the edited mail. Depending on the scene, the same key input may be valid or ignored and, when valid, may trigger a different action. In software that accepts speech recognition input, information such as which recognition vocabulary can currently be input corresponds to the "acceptable input types", and this too changes dynamically according to the operation mode (scene) of the software.
[0064]
On the other hand, an e-mail application acquires, by character string matching or linguistic analysis, information indicating the source and content of the information, such as who sent a mail or whether its contents are confidential, and transmits the source of the presented information and the type of the presented content to the software state monitoring unit 222 as operation state information. A mail application is used here as an example, but the same applies to application software for browsing information from a plurality of information sources, such as an electronic bulletin board or an information providing system on a network.
[0065]
Based on the information on the operating state of the software acquired in this way, the software state monitoring unit 222 transmits to the speech synthesis unit 11 (the phoneme control unit 112 and the prosody control unit 113 therein), as operation state information, information indicating for example whether the memory occupancy is large or small, whether the accumulated CPU occupation time is long or short, the set of recognition vocabulary, the operation mode, the source of the information, and the type of the information content.
[0066]
Upon receiving the operation state information from the software state monitoring unit 222, the speech synthesis unit 11 selects the rules and stored data to be applied in the phoneme control unit 112 and the prosody control unit 113 according to that information. Thus, for example, when the memory occupancy is large or the accumulated CPU occupation time is long, an unwell-sounding or apologetic voice can be generated to inform the user of the system state implicitly or, conversely, a tone can be used that encourages the user to carry out processing himself or herself. In addition, by selecting rules that change accents and phrasing according to the source of the information and so reflecting regional color in the voice, the user can be made aware of differences in the source of the information. If stored data of the information provider's own voice is available, using it makes the information provider easy to identify. Furthermore, when the system is operated remotely by telephone or the like, or when the display area is small as on a portable device, changing the prosody and timbre according to what kind of input a given piece of software currently accepts (the type of input device and the type of input content) lets the user learn what to input next and the current "scene" from the tone of the synthesized speech.
[0067]
Here, the correspondence between the operation state information of the software and the control rules and stored data to be selected is defined in advance in the phoneme control unit 112 and the prosody control unit 113, as in the case of the correspondence for the operation state information of the computer hardware described above.
[0068]
As described above, the control rules and stored data that the speech synthesis unit 11 (the phoneme control unit 112 and the prosody control unit 113 therein) applies to the analysis result of the language analysis unit 111 are switched according to the operation state information output from the operation state monitoring unit 22 (the hardware status monitoring unit 221 or the software state monitoring unit 222 therein), and the synthesized speech is output accordingly. The user can therefore learn the state of the computer hardware or the computer software at that time from the tone of the synthesized voice.
[0069]
Incidentally, in the speech synthesis unit 11 of the present embodiment, the language analysis unit 111, the phoneme control unit 112, the prosody control unit 113, the waveform generation unit 114, and the waveform output unit 115 are modularized so as to operate independently. Processing is possible whether these units exchange data with one another via a network or within the same execution process. In addition, the whole or a part of the processing procedure of each unit can be split off as a separate process, and the separated process returns its processing result to the original process. On a multitasking OS, such a system can be implemented easily by using system calls and libraries for creating a child process and for socket communication with the child process.
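A minimal sketch of splitting one stage off into a child process that returns its result over a socket might look as follows. It assumes a loopback TCP connection and a one-line JSON message format; `CHILD_SOURCE`, `run_stage_in_child`, and the stand-in "stage" that merely counts units are hypothetical and chosen only for illustration.

```python
import json
import socket
import subprocess
import sys

# Source of the child process: connect back to the parent, read one JSON
# request line, run a stand-in stage, and send one JSON reply line.
CHILD_SOURCE = r"""
import json, socket, sys
port = int(sys.argv[1])
with socket.create_connection(("127.0.0.1", port)) as conn, \
        conn.makefile("rw", encoding="utf-8", newline="\n") as stream:
    request = json.loads(stream.readline())
    # Stand-in for real work such as waveform generation.
    reply = {"stage": request["stage"], "n_units": len(request["units"])}
    stream.write(json.dumps(reply) + "\n")
    stream.flush()
"""


def run_stage_in_child(stage, units, timeout=10.0):
    """Delegate one stage to a child process and return its reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        server.listen(1)
        server.settimeout(timeout)
        port = server.getsockname()[1]
        child = subprocess.Popen(
            [sys.executable, "-c", CHILD_SOURCE, str(port)])
        try:
            conn, _ = server.accept()
            with conn, conn.makefile(
                    "rw", encoding="utf-8", newline="\n") as stream:
                stream.write(
                    json.dumps({"stage": stage, "units": units}) + "\n")
                stream.flush()
                reply = json.loads(stream.readline())
        finally:
            child.wait(timeout=timeout)
    return reply
```

The same message exchange would work unchanged if the child ran on another machine, which is one reason a socket rather than an in-process call is used in this sketch.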
[0070]
Upon receiving the operation state information from the operation state monitoring unit 22, the speech synthesis unit 11 determines the remaining amount of memory and the occupation time and occupancy rate of the CPU, and checks whether sufficient computer resources such as memory and CPU capacity are secured for proceeding with the speech synthesis processing from the language analysis unit 111 to the waveform output unit 115 constituting the speech synthesis unit 11. When memory may run short, or when it is judged from the CPU load that sufficient computer resources cannot be secured, the speech synthesis unit 11 assigns an appropriate process among those following the stage currently reached to other computer hardware as a separate process, and receives the processing result.
[0071]
Here, which process is to be shared is determined based on the CPU capacity and the amount of memory the process requires, and this depends on the type of speech synthesis method and the scale of the stored data. For example, in the analysis-parameter synthesis method, the most resources are used by the signal processing in the waveform generation unit 114, followed by the editing of stored data in the phoneme control unit 112. Depending on the method, retrieving the stored data can require the most CPU power. Even in the analysis-parameter synthesis method, the search time for the stored data becomes longer as the number of types of stored segments included in the stored data increases. Therefore, an appropriate priority may be assigned to each process according to the synthesis method and the scale of the stored data, and the process to be shared may be determined according to that priority.
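The resource check and priority-based choice described in the two preceding paragraphs can be sketched as follows. The stage names, the priority table, and the thresholds are assumptions for illustration; only the ordering for the analysis-parameter method (waveform generation first, then phoneme control) follows the text.

```python
# Processing stages in pipeline order (names are illustrative).
STAGES = [
    "language_analysis",
    "phoneme_control",
    "prosody_control",
    "waveform_generation",
    "waveform_output",
]

# Hypothetical priorities per synthesis method: larger means "offload first".
OFFLOAD_PRIORITY = {
    "analysis_parameter": {"waveform_generation": 2, "phoneme_control": 1},
}


def choose_stage_to_offload(method, current_stage, free_memory_mb, cpu_load,
                            min_memory_mb=64, max_cpu_load=0.9):
    """Return the stage to hand to other hardware, or None to stay local."""
    if free_memory_mb >= min_memory_mb and cpu_load <= max_cpu_load:
        return None  # enough resources: keep everything on this computer
    # Only stages after the one currently reached are candidates.
    remaining = STAGES[STAGES.index(current_stage) + 1:]
    priorities = OFFLOAD_PRIORITY.get(method, {})
    candidates = [s for s in remaining if priorities.get(s, 0) > 0]
    if not candidates:
        return None
    return max(candidates, key=lambda s: priorities[s])
```

With little free memory during language analysis, the sketch picks `"waveform_generation"` for the analysis-parameter method; once waveform generation has been reached, nothing with a priority remains and it returns `None`.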
[0072]
Incidentally, rule application in the language analysis and phonological/prosodic control of the speech synthesis unit 11 involves enumerating many rules that might apply and proceeding while evaluating the results of applying them. Executing these processes sequentially on one computer is not necessarily efficient; it is more efficient to process the applicability of the rules, and the evaluation under the assumption that a rule is applied, in parallel. If the rules to be applied are fixed, some tuning of the sequential processing can keep the problem from becoming very noticeable, but when the rules to be applied change dynamically, as in the present invention, parallel processing is more efficient.
[0073]
Therefore, in the present embodiment, the applicability of the above-described rule and the evaluation when the rule application is assumed are processed in parallel. This parallel processing can be executed as a remote process by another computer on the network to which the computer is connected, or can be shared by a sub-processor on the same computer.
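As a rough illustration of evaluating rule applicability in parallel, the sketch below checks each candidate rule concurrently and keeps the best-scoring applicable one. A local thread pool stands in for the remote processes or sub-processors mentioned above, and the rule representation (a dict with `name`, `applies`, and `score` entries) is an assumption made for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_rule(rule, analysis):
    """Return (score, name) if the rule applies to the analysis, else None."""
    if not rule["applies"](analysis):
        return None
    return (rule["score"](analysis), rule["name"])


def best_rule(rules, analysis, max_workers=4):
    """Evaluate all candidate rules concurrently; return the best rule name."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(lambda r: evaluate_rule(r, analysis), rules))
    applicable = [r for r in results if r is not None]
    if not applicable:
        return None
    return max(applicable)[1]  # highest score wins; ties broken by name


# Usage with two hypothetical prosody rules.
rules = [
    {"name": "question_rise",
     "applies": lambda a: a["is_question"],
     "score": lambda a: 0.9},
    {"name": "default_fall",
     "applies": lambda a: True,
     "score": lambda a: 0.5},
]
chosen = best_rule(rules, {"is_question": True})
```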
[0074]
As described above, in the present embodiment, when the speech synthesis processing proceeds from the language analysis unit 111 to the waveform output unit 115 of the speech synthesis unit 11 and it is judged from the operation state information from the operation state monitoring unit 22 that sufficient computer resources cannot be secured, an appropriate process among those following the stage reached so far is assigned to other computer hardware as a separate process. In addition, the applicability of rules in language analysis and phonological/prosodic control, and the evaluation under the assumption that a rule is applied, are shared among other computers on the network to which the computer is connected and sub-processors on the same computer, and are processed in parallel. Efficient processing is thus realized and the user's waiting time is reduced.
[0075]
In the above embodiment, both the hardware status monitoring unit 221 and the software state monitoring unit 222 are provided in the operation state monitoring unit 22, but only one of them may be provided.
[Third Embodiment]
FIG. 3 is a block diagram showing a third embodiment of the speech synthesis system according to the present invention. The same parts as those in FIG. 1 are denoted by the same reference numerals.
[0076]
First, the configuration of FIG. 3 is characterized in that, in addition to the speech synthesis unit 11, a user status monitoring unit 32 is provided that includes a user state monitoring unit 321, which monitors the user's usage status of the system, and a user environment monitoring unit 322, which monitors the environment in which the user uses the system. Accordingly, the functions of the speech synthesis unit 11 (the phoneme control unit 112 and the prosody control unit 113 therein) in FIG. 3 differ, as described below, from those of the speech synthesis unit 11 (the phoneme control unit 112 and the prosody control unit 113 therein) in FIG. 1, but the same reference numerals are used for convenience.
[0077]
The user state monitoring unit 321 in the user status monitoring unit 32 monitors information from at least one of an input device, a clock, and a usage history in order to obtain the user's usage status of the system (user status), and outputs a monitoring result on the system usage status, such as how intently the user is using the system. A camera, for example, can be used as the input device. The orientation of the user's head captured by the camera can be estimated accurately, and how long and how stably the user faces the system (that is, faces the front rather than elsewhere) during a given period is evaluated as the user's degree of concentration. For input devices that the user operates, such as a keyboard or a pointing device typified by a mouse, the user's operation status (input operation frequency, input operation time, the movement speed and distance of the pointing device, and so on) can be monitored. The clock and the usage history are used to improve the accuracy of estimating the usage status by recording the usage status for the same day of the week and the same time of day.
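One simple way to turn head-orientation samples into a degree of concentration is to take the fraction of samples in which the user faces the front. The sketch below assumes yaw angles in degrees (0 meaning straight at the system) and a hypothetical tolerance; it captures how long the user faces the front but not how stably, which the text also mentions.

```python
def concentration(yaw_degrees, front_tolerance=15.0):
    """Fraction of head-orientation samples that face the system.

    yaw_degrees: sequence of yaw angles sampled over a fixed period.
    front_tolerance: assumed angle within which the user counts as facing
    the front.
    """
    if not yaw_degrees:
        return 0.0
    facing = sum(1 for yaw in yaw_degrees if abs(yaw) <= front_tolerance)
    return facing / len(yaw_degrees)
```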
[0078]
Based on the information on the user's system usage status acquired in this manner, the user state monitoring unit 321 sends information indicating the user's degree of concentration and the user's operation status, as user status information, to the speech synthesis unit 11 (the phoneme control unit 112 and the prosody control unit 113 therein).
[0079]
Upon receiving the user status information from the user state monitoring unit 321, the speech synthesis unit 11 selects the rules and stored data to be applied in the phoneme control unit 112 and the prosody control unit 113 according to the user status information. Thus, for example, when the degree of concentration is equal to or less than a predetermined threshold, the user's attention can be drawn by applying rules that increase the power or slow the speech rate at the beginning of a sentence.
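The threshold rule in this paragraph can be sketched as below. The function name, the threshold, and the gain and rate values are placeholders chosen for illustration, not values from the specification.

```python
def sentence_initial_adjustment(concentration, threshold=0.5):
    """Return prosody adjustments for the beginning of a sentence.

    When concentration is at or below the threshold, raise the power and
    slow the speech rate to draw the user's attention.
    """
    if concentration <= threshold:
        return {"power_gain_db": 6.0, "rate_scale": 0.8}
    return {"power_gain_db": 0.0, "rate_scale": 1.0}
```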
[0080]
On the other hand, the user environment monitoring unit 322 in the user status monitoring unit 32 monitors information from at least one of an input device, a clock, and a usage history in order to obtain the environment of the place where the user uses the system (user environment), and outputs a monitoring result on the system usage environment, such as the sound environment (ambient noise environment) the user is in, how bright the place is, and the user's physical location (position). Examples of such input devices include a microphone that collects ambient noise, a position estimation device such as GPS, a brightness sensor, a camera, a gas sensor, and a water sensor. The clock and the usage history are used to improve the accuracy of estimating the usage environment by recording the usage environment for the same day of the week and the same time of day.
[0081]
Based on the information on the user's system usage environment acquired in this manner, the user environment monitoring unit 322 sends information indicating the spectral characteristics and level of the ambient noise, the brightness, the user's location (position), and the like, as user status information, to the speech synthesis unit 11 (the phoneme control unit 112 and the prosody control unit 113 therein).
[0082]
Upon receiving the user status information from the user environment monitoring unit 322, the speech synthesis unit 11 selects the rules and stored data to be applied in the phoneme control unit 112 and the prosody control unit 113 according to the user status information. Thus, for example, when noise is dominant in the high-frequency components, a prosody control rule that raises the pitch of the voice and a phoneme control rule that selects stored units rich in high-frequency components can be applied so that the voice is heard clearly, while in a quiet place with a low noise level a rule that yields a quiet or calm voice can be applied. Likewise, a prosodic rule can make the pitch higher and the speech rate faster in a bright place, and make the speech rate slower and widen the dynamic range of the pitch in a dark place, so that speech in a dark place gives a relatively calm impression compared with a bright place. Such correspondences may be made changeable according to the user's preference.
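The noise-dependent selection above might be sketched as a small decision function. The inputs (an overall noise level in dB and the share of noise energy in a high band), the thresholds, and the rule labels are all assumptions for illustration.

```python
def select_environment_rules(noise_level_db, high_band_ratio,
                             quiet_level_db=40.0, high_band_threshold=0.6):
    """Choose prosody and stored-unit settings from ambient noise features."""
    if noise_level_db <= quiet_level_db:
        # Quiet place: a quiet or calm voice is enough.
        return {"prosody": "calm", "unit_set": "standard"}
    if high_band_ratio >= high_band_threshold:
        # High-frequency noise dominates: raise the pitch and prefer
        # stored units rich in high-frequency components.
        return {"prosody": "raised_pitch", "unit_set": "high_frequency_rich"}
    return {"prosody": "default", "unit_set": "standard"}
```

A per-user override table could replace the fixed return values to reflect the statement that such correspondences may be changed according to the user's preference.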
[0083]
In the above embodiment, the user status monitoring unit 32 is provided with both the user state monitoring unit 321 and the user environment monitoring unit 322, but only one of them may be provided.
[Fourth embodiment]
FIG. 4 is a block diagram showing a fourth embodiment of the speech synthesis system according to the present invention. The same parts as those in FIG. 3 are denoted by the same reference numerals.
[0084]
First, the configuration of FIG. 4 is characterized in that, in addition to the configuration of FIG. 3 (the speech synthesis unit 11 and the user status monitoring unit 32), a non-natural voice explicit determination unit 41 and a text change unit 42 are provided. Accordingly, the function of the speech synthesis unit 11 (the language analysis unit 111 and the like) in FIG. 4 differs, as described below, from that of the speech synthesis unit 11 (the language analysis unit 111 and the like) in FIG. 3, but the same reference numerals are used for convenience.
[0085]
First, based on the user status information output from the user status monitoring unit 32, the non-natural voice explicit determination unit 41 determines whether it is necessary to make explicit that the speech is not a human utterance (that it is non-natural speech), and outputs the determination result (non-natural voice explicit determination result). For example, if the user status information indicates that the user is not concentrating, or indicates a time of day or a place in which synthesized speech has not been output before, the unit outputs a determination result indicating that it should be made explicit that the speech is not a human utterance.
[0086]
The text change unit 42 receives the non-natural voice explicit determination result from the non-natural voice explicit determination unit 41. If the determination result indicates that it should be made explicit that the speech is not a human utterance, then, prior to the output of the synthesized speech corresponding to the input text (that is, before the result of the language analysis of the input text by the language analysis unit 111 is passed to the phoneme control unit 112 and the prosody control unit 113 to generate and output the corresponding speech waveform), the text change unit 42 prefixes a fixed expression such as "Synthetic sound" or "Notification from the system" that announces the start of message output by speech synthesis. The speech synthesis unit 11 synthesizes and outputs the text including the vocabulary prefixed by the text change unit 42.
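The determination and the text change described in the two preceding paragraphs can be sketched together. The representation of the usage history as a set of `(hour, place)` pairs, the threshold, and the function names are assumptions for this example; the marker phrase is one of the fixed expressions quoted in the text.

```python
def needs_explicit_marker(concentration, hour, place, history, threshold=0.5):
    """Decide whether to announce that the speech is synthesized.

    history: set of (hour, place) pairs in which synthesized speech has
    been output before (a hypothetical form of the usage history).
    """
    not_concentrating = concentration <= threshold
    unfamiliar_context = (hour, place) not in history
    return not_concentrating or unfamiliar_context


def prepend_marker(text, explicit, marker="Notification from the system. "):
    """Prefix the fixed expression when the determination result asks for it."""
    return marker + text if explicit else text


# Usage: an attentive user in a time and place never used before.
history = {(9, "office")}
explicit = needs_explicit_marker(0.9, 22, "train", history)
spoken_text = prepend_marker("You have new mail.", explicit)
```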
[0087]
In this way, for example, when the user is not concentrating, or when synthesized speech is used at a time of day or in a place where it has not been output before, that situation can be conveyed to the user by outputting, before the synthesized speech corresponding to the input text, synthesized speech that announces the message output by speech synthesis (that is, speech that makes explicit that it is non-natural speech). In particular, where high-quality synthesized speech close to a real voice is output in a user environment in which human voices are also present, prefixing synthesized speech indicating that it is non-natural speech keeps it from being confused with the voices of people nearby. Conversely, by not making explicit that it is non-natural speech, drawing attention by emphasizing that the speech is synthesized can be avoided.
[Fifth Embodiment]
FIG. 5 is a block diagram showing a fifth embodiment of the speech synthesis system according to the present invention. The same parts as those in FIG. 4 are denoted by the same reference numerals.
[0088]
First, the configuration of FIG. 5 is characterized in that, instead of the text change unit 42 shown in FIG. 4, a non-natural voice explicit sound output unit 43 is provided that outputs a sound (non-natural voice explicit sound) making explicit that the speech is not a human utterance. Accordingly, the speech synthesis unit 11 in FIG. 5 differs from the speech synthesis unit 11 (the waveform output unit 115 and the like) in FIG. 4 in that, for example, the waveform output unit 115 has a function of mixing the synthesized speech generated by the waveform generation unit 114 with the non-natural voice explicit sound generated by the non-natural voice explicit sound output unit 43, but the same reference numerals are used for convenience.
[0089]
First, when the non-natural voice explicit determination result output from the non-natural voice explicit determination unit 41 indicates that it should be made explicit that the speech is not a human utterance, the non-natural voice explicit sound output unit 43 outputs a signal sound (non-natural voice explicit sound), for example a beep such as “Pi”, prior to the output of the synthesized speech corresponding to the input text. This signal sound is output by the waveform output unit 115 before the synthesized speech that the waveform generation unit 114 generates according to the phonological/prosodic control by the phoneme control unit 112 and the prosody control unit 113.
[0090]
In this way, when the user is not concentrating, or at a time of day or in a place where synthesized speech has not been output before, outputting a non-natural voice explicit sound such as “Pi” before the synthesized speech corresponding to the input text makes it explicit that the message is synthesized speech rather than a human voice, and thereby draws the user's attention.
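A minimal sketch of placing a signal sound ahead of the synthesized waveform is shown below. It assumes the speech is a list of floating-point samples and uses a plain sine tone as the signal sound; the duration, frequency, amplitude, and sample rate are placeholder values.

```python
import math


def beep(duration_s=0.2, freq_hz=1000.0, sample_rate=16000, amplitude=0.3):
    """Generate a sine-tone signal sound as a list of samples."""
    n = round(duration_s * sample_rate)
    return [amplitude * math.sin(2.0 * math.pi * freq_hz * i / sample_rate)
            for i in range(n)]


def with_explicit_sound(speech_samples, explicit, sample_rate=16000):
    """Place the signal sound before the synthesized speech when requested."""
    if not explicit:
        return list(speech_samples)
    return beep(sample_rate=sample_rate) + list(speech_samples)
```

With the default values the beep contributes 3200 samples, so a 10-sample speech buffer becomes 3210 samples when `explicit` is true.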
[0091]
In addition, it is also possible to add the text changing unit 42 in FIG. 4 to the configuration in FIG. 5 and to provide a configuration including both the text changing unit 42 and the non-natural voice explicit sound output unit 43.
[Sixth Embodiment]
FIG. 6 is a block diagram showing a sixth embodiment of the speech synthesis system according to the present invention. The same parts as those in FIG. 1 are denoted by the same reference numerals.
[0092]
First, the configuration of FIG. 6 is characterized in that, in addition to the configuration of FIG. 1 (the speech synthesis unit 11 and the communication state monitoring unit 12), a text change unit 42 for changing the input text, like that shown in FIG. 4, is provided (although the content of the text change differs from the example of FIG. 4). Accordingly, the functions of the speech synthesis unit 11 (the language analysis unit 111 and the like) differ, as described below, from those of the speech synthesis unit 11 (the language analysis unit 111 and the like) in FIG. 1, but the same reference numerals are used for convenience.
[0093]
In the speech synthesis system having the configuration shown in FIG. 6, when the language analysis unit 111 in the speech synthesis unit 11 receives communication state information from the communication state monitoring unit 12, it passes the information to the text change unit 42 and activates the text change unit 42.
[0094]
The text change unit 42 then changes the text, while communicating with the language analysis unit 111, by inserting fixed expressions chosen according to the communication state information into the input text that has been analyzed by the language analysis unit 111. That is, at the start of the processing of the phoneme control unit 112 and the prosody control unit 113 in the speech synthesis unit 11, or at the stage where the pause insertion positions have been determined during the processing of the prosody control unit 113, the text change unit 42 inserts a fixed expression determined by the communication state information at the beginning or end of a sentence or at a pause insertion position. The speech synthesis unit 11 synthesizes and outputs the text including the vocabulary inserted by the text change unit 42.
[0095]
Through this text change processing by the text change unit 42 according to the communication state information, when the communication volume is large (when communication is congested), for example, filler words such as "ah," "er," "um," and "yes" can be inserted at the beginning or end of a sentence or at a pause insertion position, or a message such as "Wait a minute" can be placed at the beginning of a sentence. Inserting such fixed vocabulary (vocabulary set in advance) has the effect of gaining processing time and reducing the load of the synthesis processing. Conversely, when the traffic is small, inserting filler words similar to the above has the effect of implicitly indicating to the user that the system is idle.
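The insertion of fixed vocabulary under congestion can be sketched as follows. The sketch assumes the text has already been split into phrases at the pause insertion positions and that traffic is given as a fraction between 0 and 1; the function name, threshold, and chosen filler are illustrative.

```python
def insert_fillers(phrases, traffic, busy_threshold=0.8,
                   opener="Wait a minute.", filler="um"):
    """Insert fixed expressions when communication is congested.

    phrases: text segments already split at the pause insertion positions.
    traffic: assumed measure of communication volume, 0.0 to 1.0.
    """
    if traffic < busy_threshold:
        return list(phrases)
    result = [opener]  # message placed at the beginning of the sentence
    for index, phrase in enumerate(phrases):
        result.append(phrase)
        if index < len(phrases) - 1:
            result.append(filler)  # filler at each pause insertion position
    return result
```

For instance, `insert_fillers(["The file", "has arrived"], 0.9)` yields `["Wait a minute.", "The file", "um", "has arrived"]`, while the same call with low traffic returns the phrases unchanged.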
[0096]
Although the text changing unit 42 in the configuration of FIG. 6 receives the communication status information from the communication status monitoring unit 12 through the voice synthesis unit 11, the text changing unit 42 may directly receive the communication status information from the communication status monitoring unit 12.
[Seventh embodiment]
FIG. 7 is a block diagram showing a seventh embodiment of the speech synthesis system according to the present invention. The same parts as those in FIG. 2 are denoted by the same reference numerals.
[0097]
First, the configuration of FIG. 7 is characterized in that a text change unit 42 like that shown in FIG. 6 is provided in addition to the configuration of FIG. 2 (the speech synthesis unit 11 and the operation state monitoring unit 22). Accordingly, the functions of the speech synthesis unit 11 (the language analysis unit 111 and the like) in FIG. 7 differ, as described below, from those of the speech synthesis unit 11 (the language analysis unit 111 and the like) in FIG. 2, but the same reference numerals are used for convenience.
[0098]
In the speech synthesis system having the configuration shown in FIG. 7, when the language analysis unit 111 in the speech synthesis unit 11 receives the operation state information of the system from the operation state monitoring unit 22, it passes the information to the text change unit 42 and activates the text change unit 42.
[0099]
The text change unit 42 then, while communicating with the language analysis unit 111, inserts fixed expressions chosen according to the operation state information into the input text that has been analyzed by the language analysis unit 111. That is, at the start of the processing of the phoneme control unit 112 and the prosody control unit 113 in the speech synthesis unit 11, or at the stage where the pause insertion positions have been determined during the processing of the prosody control unit 113, the text change unit 42 inserts a fixed expression determined by the operation state information at the beginning or end of a sentence or at a pause insertion position. The speech synthesis unit 11 synthesizes and outputs the text including the vocabulary inserted by the text change unit 42.
[0100]
Through this text change processing by the text change unit 42 according to the operation state information, when the CPU has been occupied for a long time, for example, filler words such as “A”, “Eto”, and “Eye” can be inserted at the beginning or end of a sentence or at a pause insertion position. Inserting such fixed vocabulary has the effect of gaining processing time and reducing the load of the synthesis processing.
[0101]
Although the text change unit 42 in the configuration of FIG. 7 receives the operation state information from the operation state monitoring unit 22 through the speech synthesis unit 11, the text change unit 42 may directly receive the operation state information from the operation state monitoring unit 22.
[Eighth Embodiment]
FIG. 8 is a block diagram showing an eighth embodiment of the speech synthesis system according to the present invention. The same parts as those in FIG. 3 are denoted by the same reference numerals.
[0102]
First, the configuration of FIG. 8 is characterized in that a text change unit 42 like that shown in FIG. 6 is provided in addition to the configuration of FIG. 3 (the speech synthesis unit 11 and the user status monitoring unit 32). Accordingly, the functions of the speech synthesis unit 11 (the language analysis unit 111 and the like) in FIG. 8 differ, as described below, from those of the speech synthesis unit 11 (the language analysis unit 111 and the like) in FIG. 3, but the same reference numerals are used for convenience.
[0103]
In the speech synthesis system having the configuration shown in FIG. 8, when the language analysis unit 111 in the speech synthesis unit 11 receives the user status information from the user status monitoring unit 32, it passes the information to the text change unit 42 and activates the text change unit 42.
[0104]
The text change unit 42 then, while communicating with the language analysis unit 111, inserts fixed expressions chosen according to the user status information into the input text that has been analyzed by the language analysis unit 111. That is, at the start of the processing of the phoneme control unit 112 and the prosody control unit 113 in the speech synthesis unit 11, or at the stage where the pause insertion positions have been determined during the processing of the prosody control unit 113, the text change unit 42 inserts a fixed expression determined by the user status information at the beginning or end of a sentence or at a pause insertion position. The speech synthesis unit 11 synthesizes and outputs the text including the vocabulary inserted by the text change unit 42.
[0105]
Through this text change processing by the text change unit 42 according to the user status information, when the user is not concentrating, for example, an expression used to call out to a person, such as "Excuse me", can be placed at the beginning of the sentence to draw the user's attention.
[0106]
Although the text changing unit 42 in the configuration of FIG. 8 receives the user status information from the user status monitoring unit 32 through the speech synthesizing unit 11, the text changing unit 42 may directly receive the user status information from the user status monitoring unit 32.
[0107]
[Effect of the Invention]
As described in detail above, according to the present invention, the speech output conveys not only linguistic information directly as a message but also, implicitly, the status of the entire system including the speech synthesis function, so that a user-friendly system can be constructed using this information transmission function. In addition, synthesized speech suited to the user's usage status can be output.
[0108]
In particular, informing the user of the internal state of the system through a computer output medium is important from the viewpoint of the user interface. It is therefore appropriate to use the audio medium to inform the user of the operating state of the system while also using it for its main purpose of conveying linguistic messages.
[0109]
Such information transmission not only becomes more effective when used together with visual output such as a screen display, but also, where the display area is small, as on a portable device typified by a PDA, keeps messages from occupying screen area if they are conveyed mainly through the audio medium.
[0110]
Furthermore, controlling the prosody and timbre in consideration of the user's usage status makes a more natural system output possible. Monotonous synthesized speech that ignores the situation is avoided, and, as high-quality synthesized voices close to a real voice become more common, making it explicit that the speech is synthesized allows communication that, although the speech itself is non-natural, is natural as communication with a machine.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a first embodiment of a speech synthesis system according to the present invention.
FIG. 2 is a block diagram showing a second embodiment of the speech synthesis system of the present invention.
FIG. 3 is a block diagram showing a third embodiment of the speech synthesis system according to the present invention.
FIG. 4 is a block diagram showing a fourth embodiment of the speech synthesis system according to the present invention.
FIG. 5 is a block diagram showing a speech synthesis system according to a fifth embodiment of the present invention.
FIG. 6 is a block diagram showing a sixth embodiment of the speech synthesis system of the present invention.
FIG. 7 is a block diagram illustrating a speech synthesis system according to a seventh embodiment of the present invention.
FIG. 8 is a block diagram showing an eighth embodiment of the speech synthesis system according to the present invention.
[Explanation of symbols]
11 ... speech synthesis unit
12 ... communication state monitoring unit
22 ... operation state monitoring unit
32 ... user status monitoring unit
41 ... non-natural voice explicit determination unit
42 ... text change unit
43 ... non-natural voice explicit sound output unit
111 ... language analysis unit
112 ... phoneme control unit
113 ... prosody control unit
114 ... waveform generation unit
115 ... waveform output unit
121 ... communication status monitoring unit in the computer
122 ... communication status monitoring unit outside the computer
221 ... hardware status monitoring unit
222 ... software state monitoring unit
321 ... user state monitoring unit
322 ... user environment monitoring unit
1121 ... stored data storage unit

Claims

Speech synthesis means for performing linguistic analysis of the input text, applying rules to the analysis results, performing phonological / prosodic control, and generating / outputting synthesized speech;
Communication state monitoring means for monitoring at least one of a communication state within a computer and a communication state between the computer and the outside, and outputting communication state information,
The speech synthesis system, wherein the speech synthesis means is configured to change the rules applied in the phonological/prosodic control in accordance with the communication state information output from the communication state monitoring means.

Speech synthesis means for performing linguistic analysis of the input text, applying rules to the analysis results, performing phonological / prosodic control, and generating / outputting synthesized speech;
Operating state monitoring means for monitoring at least one of the operating state of the computer hardware and the operating state of the computer software and outputting operating state information,
A speech synthesis system, wherein the speech synthesis means is configured to change the rules applied in the phonological/prosodic control according to the operation state information output from the operation state monitoring means, so that the tone of the synthesized voice corresponds to the operation state, in order to notify the user of the operation state by the tone of the synthesized voice.

The speech synthesis system according to claim 2, wherein at least one of the processes of language analysis, phonological control, prosodic control, and speech waveform generation in the speech synthesis means is assigned, in accordance with the operation state indicated by the operation state information, to a plurality of pieces of computer hardware that can communicate over a wired or wireless network.

Speech synthesis means for performing linguistic analysis of the input text, applying rules to the analysis results, performing phonological / prosodic control, and generating / outputting synthesized speech;
User status monitoring means for monitoring at least one of a user's concentration level, a user's operation status of the system, and a position and brightness of a place where the user uses the system as a user status,
A speech synthesis system, wherein the speech synthesis means is configured to change the rules applied in the phonological/prosodic control according to the user status monitored by the user status monitoring means, so that the tone of the synthesized voice corresponds to the user status.

The speech synthesis system according to claim 4, further comprising: non-natural voice explicit determination means for outputting, based on the user status information, a determination result as to whether to indicate explicitly that the synthesized speech is not a human utterance; and at least one of text change means for indicating explicitly that the synthesized speech is not a human utterance by changing part of the expression of the input text in accordance with the determination result of the non-natural voice explicit determination means, and non-natural voice explicit sound output means for outputting, in addition to the synthesized speech output and in accordance with the determination result of the non-natural voice explicit determination means, a sound indicating that the synthesized speech is not a human utterance.
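
The two alternatives in this claim (changing the text, or adding a separate sound) might be sketched as follows. The decision criterion, the notice wording, and the cue name are not given in the claim; `UserStatus`, `NOTICE_PREFIX`, `CUE_SOUND`, and the concentration threshold are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class UserStatus:
    """Hypothetical user status; only a concentration level in [0, 1]."""
    concentration: float


NOTICE_PREFIX = "This is a synthesized voice. "  # assumed wording
CUE_SOUND = "chime"  # assumed identifier of a non-speech cue


def should_mark_as_synthetic(status: UserStatus, threshold: float = 0.5) -> bool:
    """Assumed criterion: an inattentive user may mistake the voice for a person."""
    return status.concentration < threshold


def prepare_utterance(text: str, status: UserStatus,
                      use_text_change: bool = True) -> Tuple[str, Optional[str]]:
    """Return (text to synthesize, optional cue sound to play with it)."""
    if not should_mark_as_synthetic(status):
        return text, None
    if use_text_change:
        # Text change: alter part of the expression of the input text.
        return NOTICE_PREFIX + text, None
    # Sound output: leave the text alone and add a separate cue sound.
    return text, CUE_SOUND
```

In this sketch the two branches are mutually exclusive for simplicity, although the claim's "at least one of" wording would also allow a system to provide both.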

A speech synthesis method that performs linguistic analysis of an input text, applies rules to the analysis result, performs phonological / prosodic control, and generates and outputs synthesized speech, the method comprising:
monitoring at least one of a communication state within the computer and a communication state between the computer and the outside; and changing the rules applied in the phonological / prosodic control in accordance with the communication state obtained by the monitoring.

A speech synthesis method that performs linguistic analysis of an input text, applies rules to the analysis result, performs phonological / prosodic control, and generates and outputs synthesized speech, the method comprising:
monitoring at least one of the operating state of the computer hardware and the operating state of the computer software; and changing the rules applied in the phonological / prosodic control according to the result of the monitoring, so that the tone of the synthesized speech corresponds to the operating state and thereby notifies the user of the operating state.

A speech synthesis method that performs linguistic analysis of an input text, applies rules to the analysis result, performs phonological / prosodic control, and generates and outputs synthesized speech, the method comprising:
monitoring, as a user status, at least one of the user's concentration level, the user's operation status of the system, and the position and brightness of the place where the user uses the system; and
changing the rules applied in the phonological / prosodic control according to the result of the monitoring, so that the tone of the synthesized speech corresponds to the monitored user status.

The speech synthesis method according to any one of claims 6 to 8, further comprising outputting, in addition to the synthesized speech corresponding to the input text, at least one of another sound and another synthesized speech.