JP3806030B2 - Information processing apparatus and method

Info

Publication number: JP3806030B2
Application number: JP2001401424A
Authority: JP
Inventors: 洋史杉戸
Original assignee: Canon Electronics Inc
Current assignee: Canon Electronics Inc
Priority date: 2001-12-28
Filing date: 2001-12-28
Publication date: 2006-08-09
Anticipated expiration: 2021-12-28
Also published as: JP2003202885A

Description

【０００１】
【発明の属する技術分野】
本発明は、音声を用いてメッセージを処理することが可能な情報処理装置及び方法に関する。更に詳しくは、受信したメッセージを音声にて読み上げる、或いは送信すべきメッセージを音声認識を用いて入力することが可能な装置に関するものである。
【０００２】
【従来の技術】
電子メールの広がりに伴い、電子メールに含まれるメッセージを音声にて読み上げる機能を持つ機器が増えてきている。このような文書の読み上げにおいて、その内容を聞き手に理解し易くする方法として、特開２０００−１４８１７５には、発信者が、文章中に伝えたいニュアンスや表情を表現する表情記号を挿入し、読み上げ時には、その表情記号に応じた読み上げを行う方法が示されている。また、ヒューマンＩ／Ｆの向上を目的とし、コンピュータで人の感情を読み取る手法も多数提案されている。例えば、特公平６−８２３７６号では日本語文章から作者の感情を抽出する方法が示されている。また、特開平５−１２０２３号では、音声から音声認識を用いて感情を抽出する方法が示されている。
【０００３】
【発明が解決しようとする課題】
しかしながら、上記特開２０００−１４８１７５では、発信者側によって表情記号を挿入する作業を行わなければならないというような欠点がある。また、特公平６−８２３７６号と特開平５−１２０２３号では、感情の抽出手段が示されているだけで、その利用方法に及ぶ開示はない。
【０００４】
また、メッセージを音声にて読み上げる場合には、当該メッセージの発信者にかかわらず、ある特定の声質を用いて行われるのが一般的であった。
【０００５】
本発明は、受信されたメッセージを、発信者本人の声で読み上げることを可能とすることを目的とする。
また、本発明の他の目的は、発信者本人の声、かつ、感情のこもった読み上げを行うことにより、発信者の意図をより正確に伝えることを可能とすることにある。
【０００６】
【課題を解決するための手段】
上記の目的を達成するための本発明による情報処理装置は以下の構成を備える。すなわち、
個人の電話番号、メールアドレス、音声特徴データを対応付けて格納する格納手段と、
前記電話番号を指定して発信することで通話を開始して、音声による通話を行う通話手段と、
前記通話中に得られる通話相手の音声から音声特徴データを生成する生成手段と、
前記生成手段で生成された音声特徴データを用いて、前記格納手段に格納された、前記通話手段で指定した電話番号に対応する音声特徴データを更新する更新手段と、
テキストデータを含むメールを受信する受信手段と、
前記受信手段で受信したメールの送信者メールアドレスに対応する音声特徴データを前記格納手段より取得する取得手段と、
前記取得手段で取得した音声特徴データを用いて、前記メールに含まれるテキストデータに対する合成音声データを生成する合成手段とを備える。
また、上記の目的を達成するための本発明の他の態様による情報処理装置は以下の構成を備える。すなわち、
個人の電話番号、メールアドレス、音声特徴データを対応付けて格納する格納手段と、
発信相手の発信を着信することにより通話を開始して、音声による通話を行う通話手段と、
前記通話中に得られる通話相手の音声から音声特徴データを生成する生成手段と、
前記生成手段で生成された音声特徴データを用いて、前記格納手段に格納された、前記発信相手の電話番号通知により特定された電話番号に対応する音声特徴データを更新する更新手段と、
テキストデータを含むメールを受信する受信手段と、
前記受信手段で受信したメールの送信者メールアドレスに対応する音声特徴データを前記格納手段より取得する取得手段と、
前記取得手段で取得した音声特徴データを用いて、前記メールに含まれるテキストデータに対する合成音声データを生成する合成手段とを備える。
【００１１】
【発明の実施の形態】
以下、添付の図面を参照して本発明の好適な実施形態のいくつかについて詳細に説明する。
【００１２】
＜実施形態１＞
図１は実施形態１によるメッセージ読み上げ装置の構成を示すブロック図である。
【００１３】
１は主制御部であり、２は公衆回線と接続する通信制御部である。３は音声による通話を行う送受話器部である。以上の構成により、音声による通話が行われる。すなわち、送話においては、送受話器部３により、入力された音声信号をデジタル信号へ変換し、主制御部１を介して通信制御部２より公衆回線に送信する。受話は、公衆回線からの信号を通信制御部２により受信してこれをデジタル信号に変換し、主制御部１を介して送受話器部３で音声信号に変換する。これより音声の送受による通話が成立する。
【００１４】
４は、電子メール受信部であり通信制御部２より受信した電子メールを格納する。ここで、主制御部１は、通信制御部２より受信した信号がメッセージであるか音声通話であるかを判別し、メッセージであった場合には電子メール受信部４に送り、音声通話であった場合には送受話器部３に送る。
【００１５】
５は入力部であり、操作者が電話番号のダイアルや、読み上げ指示等を行うときに使用する。６は表示部であり、受信メッセージの表示等を行う。７は、音声特徴抽出部であり、入力された音声の特徴を抽出する。本実施形態においては、音声通話時の受話側のデジタル信号（公衆回線より通信制御部２に受信された音声信号をデジタル化した信号）を入力とする。８は個人データ記憶部であり、各個人の電話番号やメールアドレス及び音声の特徴が保管されている。９は音声特徴データ比較部であり、音声特徴抽出部７により抽出した音声特徴と、個人データ記憶部８に保管されている音声の特徴を比較し、必要に応じて個人データ記憶部８の内容を更新する。
【００１６】
１０はメールアドレス検索部であり、主制御部１より入力されたメールアドレスが、誰のメールアドレスかを個人データ記憶部８から検索する。１１は電話番号検索部であり、主制御部１より入力された電話番号が、誰の電話番号かを個人データ記憶部８から検索する。１２は音声合成部であり、声質データを設定し、入力されたテキスト文章の言語解析を行い、設定された声質データを使用して音声データを作成する。音声合成部１２で合成された音声データは、主制御部１を介して送受話器部３に送られ、音声信号となる。後述するが、本実施形態では、音声合成の対象は受信したメールのテキスト文章データであり、音声合成部１２では、当該メールの送信元の話者の音声特徴データに対応した声質データが設定される。
【００１７】
図２は、個人データ記憶部８に保管されているデータの構成例を示す図である。図２において、１３は個人毎に割り振られる管理番号である。１４は名前、１５は電話番号、１６はメールアドレスであり、これらは操作者が入力したものである。また、１７は音声特徴データであり、音声特徴抽出部７にて抽出された音声特徴データ、或いは、音声特徴データ比較部９により更新されたデータである。１８は学習回数であり、音声特徴抽出部７にて抽出を行った回数（個人データ記憶部８に記憶された音声特徴データに関する学習の回数）である。尚、学習回数１８には、学習の対象者であるかどうかの情報も含まれる。例えば学習回数データとして１６bitを用いる場合、最上位bitを学習の対象者であるか否かを表すフラグとして用いる。
【００１８】
図３は、主制御部１による、音声特徴データに対する学習処理を示すフローチャートである。本実施形態では、音声通話時に、当該通話相手の音声特徴データについて学習が行われる。
【００１９】
ステップＳ１０１で入力部５よりのダイアル入力を受けると、ステップＳ１０２で、通信制御部２を用いて公衆回線に接続する。こうして送受話器部３により通話が可能となる。ステップＳ１０３では、電話番号検索部１１によりステップＳ１０１で入力されたダイアル番号で個人データ記憶部８を検索する。ステップＳ１０４では、この検索結果を受け、個人データ記憶部８に該当者がいなかった場合または該当者が学習の対象者でなかった場合にはステップＳ１０５へ進み、そのまま学習処理を終了する。すなわち、学習処理は行われず、通話のみが行われることになる。
【００２０】
一方、ステップＳ１０３の検索の結果、該当者があり、その該当者が学習の対象者であった場合にはステップＳ１０６へ処理を移す。ステップＳ１０６では公衆回線から送られてくる受話信号を音声特徴抽出部７に送り、音声特徴の抽出を行う。ステップＳ１０７では、音声特徴抽出部７により抽出された音声特徴に基づいて、抽出音声特徴データを作成する。ステップＳ１０８では、個人データ記憶部８の該当者の学習回数１８のチェックを行う。ここで学習回数が０回でなければ、ステップＳ１０９へと移る。ステップＳ１０９では、音声特徴データ比較部９により、個人データ記憶部８の該当者欄に登録されている音声特徴データ１７と、ステップＳ１０７で作成した抽出音声特徴データの比較を行う。尚、音声特徴データ比較部９では、２つのデータの差異を検出し、差異のある部分に関して、登録されている音声特徴データ１７に学習回数１８に応じた重み付けを行い、抽出音声特徴データとの補完及び平均化を行う。ステップＳ１１０では、音声特徴データ比較部９の比較結果に基づき新音声特徴データを作成し、ステップＳ１１１にて個人データ記憶部８に登録する。
【００２１】
尚、ステップＳ１０８で学習回数１８が０回の場合には、比較すべき音声特徴データ１７がない。従って、ステップＳ１０８からステップＳ１１２へ進み、ステップＳ１０７で作成された抽出音声特徴データを個人データ記憶部８の音声特徴データ１７として登録する。ステップＳ１１３では、学習回数１８を＋１してステップＳ１０５で終了する。
【００２２】
なお、上記処理は発信による通話時の学習を説明したが、着信による通話時においても、例えば番号通知により通話相手の電話番号を特定できれば、通話相手の音声特徴を学習できる。
【００２３】
本実施形態のメッセージ読み上げ装置は、以上のようにして生成、学習された音声特徴データを用いてメールのメッセージを読み上げる。図４は、実施形態１によるメール読み上げ時の主制御部１の処理を説明するフローチャートである。
【００２４】
ステップＳ２０１で入力部５より読み上げ指示を受けると、ステップＳ２０２で、電子メール受信部４より読み上げの対象とする電子メールを取り出す。ステップＳ２０３では、この電子メールに付加されている送信者メールアドレス情報を抽出する。次に、ステップＳ２０４では、メールアドレス検索部１０により、このメールアドレスを用いて個人データ記憶部８を検索する。
【００２５】
この検索結果を受け、該当者がいる場合はステップＳ２０５からステップＳ２０６へ処理を移す。ステップＳ２０６では、該当する個人データの学習回数１８をチェックし、０回でなければステップＳ２０７へ移る。ステップＳ２０７では、該当者の音声特徴データ１７を読み出し、ステップＳ２０８で、音声合成部１２に声質データとして設定する。尚、ステップＳ２０５で該当者がない場合、或いは、ステップＳ２０６で学習回数１８が０回の場合は、ステップＳ２０９で予め標準として設定している声質データを設定する。
【００２６】
ステップＳ２１０では、音声合成部１２において、ステップＳ２０８或いはステップＳ２０９で設定された声質データを用いて電子メールのメッセージを音声データへと変換、合成する。そして、ステップＳ２１１にて、ステップＳ２１０で変換された音声データを送受話器部３にて音声とする。
【００２７】
また、本実施形態のメッセージ読み上げ装置では、主制御部１の制御により、入力部５、表示部６を用いて個人データ記憶部８に格納されている音声特徴データの調整を行うことができる。以下、この調整処理について説明する。
【００２８】
図５は、登録されている音声特徴データ１７を調整する際の画面推移を示す図である。なお、本例では、表示部６の上に入力部５としてタッチパネルが重ねてある構成とするが、他の表示、入力形態であってもよい。
【００２９】
５０１は、調整モードの画面であり、個人データ記憶部８に記憶されている名前と学習回数が表示される。学習回数表示中の「○」は、学習回数０回を示し、「◎」は、学習対象外を示している。５０２は、テスト発声文章の選択操作を示しており、「音声［自己紹介］」をタッチすることによりプルダウンメニューが表示され、発声文章を選ぶことができる。ここでは「自己紹介」が選択されたものとする。５０３では、上下方向キーにより名前の選択を行い、そこで「発声」をタッチすると、登録されている音声特徴データ１７を用いて５０２で選択した定型文章の読み上げを行う。ここでは「自己紹介」が選択されており、「私の名前は＊＊です。どうぞよろしく。」と発声する。
【００３０】
音声の調整をする場合は、５０４に示すように、調整したい人にカーソルを合わせ「調整」をタッチすれば良い。５０５は調整画面である。調整画面においては、ピッチ、トーン、語尾の調整が可能となっている。それぞれの項目をタッチすることにより選択し、左右方向キーにて、ピッチの速さ、トーンの高低、語尾の上下を調整する画面である。ここで「継続」をタッチすると、調整したデータを音声特徴データ１７として設定するが今後も学習を継続することを意味している。「固定」の時は、調整したデータを音声特徴データ１７とし、５０６に示すように、学習回数欄の表示が◎となり、学習の対象外となり音声特徴データ１７は固定される。
【００３１】
以上のように、実施形態１によれば、以下のような効果がある。
・相手の声でメッセージを読み上げることが可能となり、送信者の特定やイメージが明確になる。
・メッセージの読み上げのための音声特徴データを生成するにあたって、操作者が意識及び作業することなく相手の声をサンプリングできる。
・通話毎の学習により、より送信者の声に近づけることができる。
・通話毎の学習により、送信者の声の変化に対応することが可能である。
・音声の調整ができることにより、合成音声をより自分のイメージする声に近づけることが可能である。
【００３２】
＜実施形態２＞
実施形態２では、感情要素を加味したメッセージの読み上げを可能とするメッセージ読み上げ装置を説明する。
【００３３】
図６は実施形態２によるメッセージ読み上げ装置の構成を示すブロック図である。図６において、参照番号１〜１２で示される構成は、図１に示した実施形態１の各構成と同様の機能を有する。また、個人データ記憶部８の中に記憶されている個人データの内容も図２と概ね同様であるが、音声特徴データ１７が感情別に構成される点が異なる。
【００３４】
図７は実施形態２による音声特徴データの内容を説明する図である。実施形態２における音声特徴データ１７は、感情により大きな変化を生じない基本音声データ２４と、感情に影響される感情影響音声データ２５から構成されている。感情影響音声データ２５の構成を説明する。２６は感情分類であり、本実施形態では＜喜び：Ｈ＞＜期待：Ｅ＞＜通常：Ｕ＞＜悲しみ：Ｓ＞＜怒り：Ａ＞に大別している。２７は、感情別データ２８の「ピッチ」及び「トーン」が、音声から抽出したデータであるか、予測・調整されたデータであるかを、＜確定：Ｙ＞，＜暫定：Ｎ＞で示すサンプリング符号である。
【００３５】
再び図６において、１９は音声感情抽出部であり、音声通話時の受話側のデジタル信号を入力とし、入力された音声信号のピッチやアクセント等の「音声律情報」に注目し、通話相手の感情を感情分類２６の何れかに分類する。ここで分類された感情に、音声特徴抽出部７の抽出結果に基づいて生成される音声特徴データを、感情影響音声データ２５として記憶する。
【００３６】
２０は感情データ調整部であり、音声特徴データ２３の感情別データの内、検出できていない感情、つまり、サンプリング符号２７が、＜暫定：Ｎ＞となっている感情別データ２８を、他の＜確定：Ｙ＞となっている感情別データ、及び予め定めた基準に沿って予測・調整する。２１は文章感情抽出部であり、電子メール受信部４にて受信したメッセージを入力とし、単語単位で数量化した感情情報を予め辞書として持つことにより、単語列に含まれる感情情報を抽出し、単語または文節毎に感情を判断する。２２は感情音声生成部であり、音声合成部１２により作成された音声データに、文章感情抽出部２１の結果に応じて、感情影響音声データ２５の感情別データ２８を付加し、感情情報の加わった音声データを生成する。
【００３７】
図８及び図９は、実施形態２による音声特徴データの作成例を示す図である。８０１は、初期状態を示す。基本音声データ２４には標準音声データが設定されている。また、感情影響音声データ２５にも標準のデータが設定されている。すなわち、サンプリング符号２７は全て＜暫定：Ｎ＞であり、感情別データ２８もピッチ：a1〜a5，トーン：b1〜b5と標準の設定となっている。この時点における学習回数１８は０回である。
【００３８】
８０２は、音声の通話が行われた状態である。音声は、音声特徴抽出部７と音声感情抽出部１９に入力される。音声特徴抽出部７では、入力された音声から基本音声データ２４ａ（Ｚ）と、感情別データ２８（Ａ３，Ｂ３）を検出する。また、音声感情抽出部１９では、入力された音声から感情を判断し、感情分類２６（図ではＵに分類されている）として検出する。検出した基本音声データ２４（Ｚ）は、学習回数１８が０であることより、音声特徴データ２３にそのまま基本音声データ２４として記憶される。これにより学習回数１８は＋１され１回となる。
【００３９】
一方、検出した感情影響音声データ２５（Ａ３，Ｂ３）は、検出した感情分類２６ａがＵであり、感情影響音声データの＜通常：Ｕ＞のサンプリング符号２７が＜暫定：Ｎ＞となっていることより、＜通常：Ｕ＞の感情別データ２８として記憶する。また、サンプリング符号２７は、＜暫定：Ｎ＞から＜確定：Ｙ＞へと変更する。
【００４０】
次に８０３では、８０２で感情影響音声データ２５が更新されたことにより感情データ調整部２０が起動される。感情データ調整部２０は、サンプリング符号２７が＜確定：Ｙ＞となっている感情別データ２８の全てを用いて、サンプリング符号２７が＜暫定：Ｎ＞となっているすべての感情別データを調整する。例えば、８０３の場合、サンプリング符号２７が＜確定：Ｙ＞となっている＜通常：Ｕ＞の感情別データ２８を取り込み、他の感情分類２６、つまりサンプリング符号２７が＜暫定：Ｎ＞となっているすべて、Ｓ８０３では＜喜び：Ｈ＞＜期待：Ｅ＞＜悲しみ：Ｓ＞＜怒り：Ａ＞の感情別データ２８を、予め定めた感情分類２６の互いの相関関係、例えば、ピッチの関係はＥ：Ｕ＝１．２：１といった相関関係から決定し、それらを更新する。
【００４１】
８０４は、次の通話が行われた状態を示している。８０１と同じ流れにより入力された音声から基本音声データ２４ｂと、感情別データ２８（Ａ２，Ｂ２）、感情分類２６（Ｅ）が検出される。検出された感情別データ（Ａ２，Ｂ２）は、検出した感情分類２６ｂ（Ｅ）に対応する＜期待：Ｅ＞のサンプリング符号２７が＜暫定：Ｎ＞となっていることより、＜期待：Ｅ＞の感情別データ２８として記憶する。そして、＜期待：Ｅ＞のサンプリング符号２７を＜暫定：Ｎ＞から＜確定：Ｙ＞へと変える。
【００４２】
検出した基本音声データ２４（Ｙ）は、学習回数１８が１であることより、８０５へと移り、音声特徴データ比較部９に、検出した基本音声データ２４ｂ（Ｙ）と、記憶されていた基本音声データ２４（Ｚ）と、学習回数１８を入力する。音声特徴データ比較部９では、記憶されていた基本音声データ２４（Ｚ）に学習回数１８で重み付けを行い、検出した基本音声データ２４（Ｙ）との平均をとり、新たな音声基本データ２４ｃ（Ｘ）を作成し、音声特徴データ２３の基本音声データ２４として記憶する。続いて学習回数１８が＋１され、学習回数は２回となる。
【００４３】
８０６では、感情影響音声データ２５が更新されたことにより感情データ調整部２０が起動される。上述したように、感情データ調整部２０は、サンプリング符号２７が＜確定：Ｙ＞となっている感情別データ２８の全て、８０６では＜通常：Ｕ＞と＜期待：Ｅ＞の感情別データ２８を取り込み、サンプリング符号２７が＜暫定：Ｎ＞となっているすべて、８０６では＜喜び：Ｈ＞＜悲しみ：Ｓ＞＜怒り：Ａ＞の感情別データ２８を、予め定めた感情分類２６の互いの相関関係に従って更新する。なお、例えば、ピッチの関係の標準値をＨ：Ｅ：Ｕ＝１．５：１．２：１といった関係に対して、実際に取り込んだ感情別データ２８ではＥ：Ｕ＝１．０８：１となっており、サンプリング符号２７が＜確定：Ｙ＞となっている感情別データ２８を再度検出した場合には、実施形態１で示した学習方法と同様に感情データ２８を更新し、当更新時にも＜暫定：Ｎ＞のデータを更新しても良い。Ｅ：Ｕ＝１．０８：１となっていた場合は、感情がピッチに反映される比率が低い人という判断を行い、Ｈも標準より低くなると予想し、Ｈ＝１．５×１．０８／１．２＝１．３５というように補正を行ってデータを決定し、更新する。
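The pitch correction worked through in paragraph [0043] can be checked with a short calculation. The sketch below is only an illustration: the function name and the dictionary layout are assumptions, and the standard ratios for sadness and anger are omitted because the text does not give them. It scales the standard ratio of an undetected emotion by how far the speaker's measured ratio for a detected emotion deviates from the standard.

```python
# Standard pitch ratios relative to <normal: U>, as quoted in the text (H:E:U = 1.5:1.2:1).
STANDARD_PITCH_RATIO = {"H": 1.5, "E": 1.2, "U": 1.0}


def predict_pitch_ratio(target: str, reference: str, measured_reference: float) -> float:
    """Predict the pitch ratio of an undetected emotion `target`.

    `measured_reference` is the speaker's measured pitch ratio (relative to U)
    for an already detected emotion `reference`. Hypothetical helper.
    """
    standard_target = STANDARD_PITCH_RATIO[target]
    standard_reference = STANDARD_PITCH_RATIO[reference]
    # A speaker whose detected emotion shows less than the standard pitch change
    # is assumed to show proportionally less change for the other emotions too.
    return standard_target * measured_reference / standard_reference


# Measured E:U = 1.08:1, so H is predicted as 1.5 * 1.08 / 1.2.
predicted_h = predict_pitch_ratio("H", "E", 1.08)
print(round(predicted_h, 2))
```

With the measured value E:U = 1.08:1 from the text, this reproduces the corrected value H = 1.35.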
【００４４】
図１０は、実施形態２によるメッセージ読み上げ時の音声データの作成経路を示す図である。
【００４５】
２９は読み上げ対象となる受信メッセージであり、受信メッセージ２９は、文章感情抽出部２１と音声合成部１２に渡される。また、基本音声データ２４も音声合成部１２に渡される。文章感情抽出部２１では、受信メッセージ２９から、単語または文節毎に感情を判断し、受信メッセージ２９に感情情報を付加した感情メッセージデータ３０を作成する。音声合成部１２では、声質データとして基本音声データ２４を用いて、受信メッセージ２９を音声合成し、音声データ３１を作成する。
【００４６】
感情メッセージデータ３０と音声データ３１及び感情影響音声データ２５は、感情音声生成部２２に渡される。感情音声生成部２２では、入力された音声データ３１に対して、感情メッセージデータ３０に含まれる感情情報に対応した感情影響音声データ２５の感情別データにて加工を行い、感情入り音声データ３２を作成する。感情入り音声データ３２は、送受話器部３によって音声となる。
【００４７】
以上説明したように、実施形態２によれば以下の効果が得られる。
・送信者の感情を読み上げ音声から知ることができ、より正確に相手の意図を掴むことが可能である。
・送信者の感情より緊急度を計ることが可能となる。
・送信者が感情を表面に現さない人でも、その文面から相手の感情を知ることにより相手への理解を深めることができる。
【００４８】
＜実施形態３＞
次に、入力された音声に対して音声認識を行い、電子メールを生成し、これを送信するメッセージ送信装置と、このような電子メールを受信して読み上げるメッセージ読み上げ装置について説明する。なお、送信対象は、電子メールに限らず、チャットのようなメッセージ送信であってもかまわない。
【００４９】
図１１は実施形態３によるメッセージ送信装置の構成を示すブロック図である。実施形態１の装置構成（図１）で説明した構成と同様の機能を有する構成には同一の参照番号を付してある。
【００５０】
３３は音声入力部であり、マイクにより操作者の音声を電気信号として入力する。３４は電子メール送信部である。３５は音声認識部であり、音声入力部３３より入力された音声を音声認識して、テキストデータに変換する。３６は音量・速度検出部であり、音声入力部３３より入力された音声の、音量と話す速度を検出し、他の音声部分と比べ、予め定めた一定率以上に異なる個所を検出する。３７は文章作成部であり、音声認識部３５にて認識したテキストデータに、音声特徴データの付加と音量・速度検出部３６の検出結果の情報を含んだ送信メールを作成するものである。なお、音声入力部３３は送受話器部３の通話部を用いてもよい。
【００５１】
図１２は、実施形態３による送信メールのデータの作成経路を示す図である。３８は操作者の音声であり、音声入力部３３から入力され、音声データ３１となる。音声データ３１は音声特徴抽出部７と、音声認識部３５と、音量・速度検出部３６に入力される。
【００５２】
音声認識部３５では、入力された音声信号３１を順次音声認識することによりテキストデータ３９とし、これを文章作成部３７へ入力する。音量・速度検出部３６では、音声データ３１において音量・速度に大きな変化を生じた時にその内容を文章作成部３７に通知する。文章作成部３７では、その通知を受け、テキストデータ３９のその音声部分に対応する箇所に、予め定めた情報付加方法に従って情報を付加する。情報付加方法としては、テキスト文字に対してアンダーラインや斜め文字等の装飾や、文字サイズ、フォント種別、また画面上に表示されないデータとして付加するなどの方法がある。
【００５３】
音声特徴抽出部７では入力された音声データ３１から、音声特徴データ１７を作成し文章作成部３７へと渡す。文章作成部３７では、情報を付加したテキスト文章に音声特徴データ１７及びその他の情報を付加して送信メール４０を作成する。
【００５４】
図１３に送信メール４０の構成例を示す。４１は送信先の宛先情報であり、４２は送信者、つまり自分の情報である。４３はメール種別情報であり、当メールにおいては音声情報入りメールであることを示す（音声情報入りメールとは、音声の音量や速度に応じてフォント、サイズ、文字修飾等が変更されているメールである）。次に入力時の音声特徴データ１７が格納されている。また、４４はSubjectであり、４５は音声情報を含んだ文章例である。
【００５５】
図１３の例において、文章４５では、「こんにちは＊＊です。」は通常音声であり、文字は標準として設定されている文字を使用している。「例の彼女の写真ですが」では、ここで「彼女の写真」の入力時に音量・速度検出部３６が、音量が小さいと検出したことよりフォントサイズは小さくなり、また同時に話す速度が速いと検出したことより、さらに斜め文字修飾が行われている。「今日中に必ずお願いします」では、音量・速度検出部３６が、音量が著しく大きいと検出したことより、フォントサイズを大きくし、また同時に話す速度が遅いと検出したことより、太字となっている。さらに、前記２つの検出の組み合わせ（音量が大きくゆっくりしていること）から重要部分として認識され、アンダーラインが付加されている。
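The mapping in paragraph [0055] from detected volume and speaking speed to text decoration can be sketched as follows. This is a minimal illustration under stated assumptions: the 20% threshold, the font scale factors, and all names are hypothetical, since the text only says that portions differing from the rest of the speech by more than a predetermined rate are detected.

```python
from dataclasses import dataclass


@dataclass
class Decoration:
    font_scale: float = 1.0   # 1.0 = standard character size
    italic: bool = False
    bold: bool = False
    underline: bool = False


def decorate(volume_ratio: float, speed_ratio: float, threshold: float = 0.2) -> Decoration:
    """Choose a decoration for one speech segment.

    Both ratios are the segment's value divided by the value for the rest of
    the utterance (assumed representation). `threshold` is a hypothetical rate.
    """
    loud = volume_ratio > 1 + threshold
    quiet = volume_ratio < 1 - threshold
    fast = speed_ratio > 1 + threshold
    slow = speed_ratio < 1 - threshold

    decoration = Decoration()
    if quiet:
        decoration.font_scale = 0.8    # quiet speech: smaller font
    if loud:
        decoration.font_scale = 1.25   # loud speech: larger font
    decoration.italic = fast           # fast speech: italic
    decoration.bold = slow             # slow speech: bold
    decoration.underline = loud and slow  # loud and slow: marked as important
    return decoration
```

For example, `decorate(0.6, 1.4)` corresponds to the quiet, fast phrase in FIG. 13 (small italic text), and `decorate(1.5, 0.6)` to the loud, slow phrase (large, bold, underlined).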
【００５６】
尚、送信メール４０を表示部６に表示し、入力部５にて文字装飾の追加や修正を行った後に送信するようにすることも当然可能である。
【００５７】
次に、以上のような音声情報入りメールを読み上げる読み上げ装置について説明する。図１４は、実施形態３による音声情報入りメールに対応したメール読み上げ装置の構成を示すブロック図である。
【００５８】
送信メール４０は公衆回線を通して通信制御部２で受信され、主制御部１を介して電子メール受信部４へと送られる。入力部５からメールが指定されると、表示部６にそのメール内容が表示される。このときの表示状態は、図１３の４５に示したように、音声状態に応じて設定されたフォント、サイズ、文字修飾に従ったものとなる。
【００５９】
また入力部５から読み上げを指示された場合は、メールは電子メール受信部４から主制御部１へと移り、主制御部１でメール種別情報４３により音声情報入りメールであることを確認し、Subject４４と文章４５が音声合成部１２に渡される。このとき、声質データとして当該電子メールに含まれる音声特徴データ１７を設定する。また、文章４５は音量・速度調整部４７へも渡される。音声合成部１２では、声質データを元に音声データを作成する。また、音量・速度調整部４７では、入力された文章４５から文字の装飾や大きさに含まれている付加情報を取り出し、この付加情報に基づいて、音声合成部１２で作成された音声データに音量や速度等を設定する加工を行い、最終読み上げデータを作成する。最終読み上げデータは、主制御部１を介して音声出力部４７に送られ、音声へと変換される。
【００６０】
尚、本例においては、音声入力時の音声の特徴を検出する例を示したが、実施形態１で示したように、音声入力部３３を送受話器部３として、通話時に音声特徴データ１７を作成しておき、送信メールに音声特徴データ１７を付加するようにしても良い。この場合には当然、実施形態１に示した学習や音声の調整を行えるものである。
【００６１】
以上説明したように、実施形態３によれば、以下のような効果が得られる。
・声の大小や喋りかたにより意図的に文章に変化を入れることが可能である。
・受信者は送信者の声で聞くことができるので、送信者を特定でき、意図を理解し易くなる。
・音声データとして送信するよりもデータ量が少なく、通信料金が少ない。
・受信者は視覚的にも送信者の意図を見ることが可能である。
【００６２】
＜実施形態４＞
次に、実施形態４として、実施形態３に感情要素を加味したメッセージ送信装置について説明する。
【００６３】
図１５は実施形態４によるメッセージ送信装置の構成を示すブロック図である。図１５に示される構成は、実施形態３で示した構成図（図１１）に、感情検出部４８が加わったものである。感情検出部４８は、音声入力部３３から入力された音声データ３１より、話者の感情を判断し、その結果が予め定めた「通常の感情」の範囲以上の変化を示した時に、その結果を感情分類２６として出力するものである。
【００６４】
図１６は、実施形態４によるメッセージ送信装置における、音声入力から感情情報を含むメッセージを作成するまでの作成経路を示す図である。
【００６５】
音声入力部３３に入力された話者の音声３８は、音声データ３１となり、音声認識部３５、感情検出部４８、音声特徴検出部７へと入力される。音声認識部３５では、音声データ３１を順次音声認識することによりテキストデータ３９を作成し文章作成部３７へ渡す。感情検出部４８では、音声データ３１から感情の判断を行い、感情が設定範囲を超えた場合にその感情分類２６を出力する。感情分類２６は、音声特徴抽出部７と文章作成部３７に入力される。音声特徴抽出部７では、音声データ３１より音声の特徴を抽出すると共に、入力された感情分類２６により感情別の特徴も抽出し、これに基づいて基本音声データ２４と、感情影響音声データ２５を作成する。
【００６６】
文章作成部３７では、感情分類２６の通知を受け、その時のテキストデータ３９に文節毎に、予め感情分類２６毎に定めた情報付加方法に従って情報を付加する。文章作成部３７では、情報を付加したテキスト文章に基本音声データ２４、感情影響音声データ２５及びその他の情報を付加して送信メール４９を作成する。
【００６７】
図１７は実施形態４における送信メールの構成例を示す図である。実施形態４の送信メール４９の主な構成は図１３と同様であり、図１７には送信メール４９の特徴的な部分が示されている。
【００６８】
音声特徴データ１７は、基本音声データ２４と感情影響音声データ２５とによって構成される。また、Subject４４に続き感情情報入り文章データ５０が格納される。図１７では、感情情報入り文章データの一例が示されている。この文章例の中で、「こんにちは，＊＊です。」は、感情検出部４８で感情分類２６で＜通常＞と判断し、標準として設定してある文字にて記述される。「楽しかった」は、感情分類２６で＜喜び＞と判断され、＜喜び＞に対応した文字として「（笑）」を加えることにより＜喜び＞の情報を付加している。「誰もいなかった」では、感情分類２６で＜悲しみ＞と判断し、文節後に0003hという画面上表記されないデータを加えることにより＜悲しみ＞の情報を付加している。「頭にきた」では、感情分類２６で＜怒り＞と判断し、「!!」を加えることにより＜怒り＞の情報を付加している。「返事待ってます」では、感情分類２６で＜期待＞と判断し、絵文字である「(^_^)」を加えることにより＜期待＞の情報を付加している。前記情報付加の方法は、方法として各種の方法を示したものであり、各感情分類２６に対して特定した方法ではない。
【００６９】
尚、作成した文章５０を編集する時には、＜悲しみ＞の情報である０００３ｈも情報として表示され、編集することが可能となっている。
【００７０】
このような送信メッセージを読み上げる装置には、実施形態２で説明したような装置を用いることができる。すなわち、送信メールに含まれる基本音声データ２４、感情影響音声データ２５を用いて、図１０に示すようにして音声合成を行う。ただし、実施形態４の場合、文章感情抽出部２１では、予め決められた記号や文字列（!!、(笑)等）、コード（0003h）等を読み上げ対象のメッセージから取り出し、これに基づいて感情メッセージデータ３０を生成することになる。
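The marker-based extraction described in paragraph [0070] can be sketched as follows. This is an illustration only: the marker table simply reuses the examples from FIG. 17 (which the text says are examples of methods, not a fixed assignment per emotion), carrying the non-displayed code 0003h as the character U+0003 is an assumption, and the function name is hypothetical.

```python
import re
from typing import List, Tuple

# Marker -> emotion code (H: joy, S: sadness, A: anger, E: expectation).
MARKERS = {
    "(笑)": "H",
    "（笑）": "H",
    "\x03": "S",    # assumption: hidden code 0003h represented as U+0003
    "!!": "A",
    "(^_^)": "E",
}

# Longest markers first so that a longer marker is never split by a shorter one.
_PATTERN = re.compile(
    "(" + "|".join(re.escape(m) for m in sorted(MARKERS, key=len, reverse=True)) + ")"
)


def extract_emotions(text: str) -> List[Tuple[str, str]]:
    """Split text into (phrase, emotion) pairs; a marker applies to the phrase before it."""
    # With one capturing group, re.split yields [phrase, marker, phrase, marker, ..., phrase].
    parts = _PATTERN.split(text)
    result = []
    for i in range(0, len(parts), 2):
        phrase = parts[i]
        marker = parts[i + 1] if i + 1 < len(parts) else None
        if not phrase:
            continue
        # Phrases without a trailing marker are treated as <normal: U>.
        result.append((phrase, MARKERS[marker] if marker else "U"))
    return result
```

The markers are removed from the returned phrases, so the text passed on for synthesis contains only the words to be spoken.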
【００７１】
以上のように、実施形態４によれば、以下の効果がある。
・感情要素が加わることにより、より送信者の意図や気持ちを伝えることが可能となる。
・受信者は視覚的にも、送信者の感情を知ることが可能である。
【００７２】
【発明の効果】
以上説明したように、本発明によれば、複雑な登録操作をすることなく、受信されたメッセージを、発信者本人の声で読み上げることが可能となる。
また、本発明によれば、発信者本人の声で、かつ、感情のこもった読み上げを行うことが可能となり、発信者の意図をより正確に伝えることができる。
更に、本発明によれば、入力音声を音声認識処理して得られる文字列に、その入力音声の感情を示す表記を自動的に組み込んで送信メッセージを生成することが可能となり、感情表現豊かなメッセージの送信を実現できる。
【図面の簡単な説明】
【図１】実施形態１によるメッセージ読み上げ装置の構成を示すブロック図である。
【図２】個人データ記憶部８に保管されているデータの構成例を示す図である。
【図３】主制御部１による、音声特徴データに対する学習処理を示すフローチャートである。
【図４】実施形態１によるメール読み上げ時の主制御部１の処理を説明するフローチャートである。
【図５】実施形態１において、登録されている音声特徴データ１７を調整する際の画面推移を示す図である。
【図６】実施形態２によるメッセージ読み上げ装置の構成を示すブロック図である。
【図７】実施形態２による音声特徴データの内容を説明する図である。
【図８】実施形態２による音声特徴データの作成例を示す図である。
【図９】実施形態２による音声特徴データの作成例を示す図である。
【図１０】実施形態２によるメッセージ読み上げ時の音声データの作成経路を示す図である。
【図１１】実施形態３によるメッセージ送信装置の構成を示すブロック図である。
【図１２】実施形態３による送信メールのデータの作成経路を示す図である。
【図１３】実施形態３による送信メール４０の構成例を示す図である。
【図１４】実施形態３による音声情報入りメールに対応したメール読み上げ装置の構成を示すブロック図である。
【図１５】実施形態４によるメッセージ送信装置の構成を示すブロック図である。
【図１６】実施形態４によるメッセージ送信装置における、音声入力から感情情報を含むメッセージを作成するまでの作成経路を示す図である。
【図１７】実施形態４における送信メールの構成例を示す図である。
[0001]
BACKGROUND OF THE INVENTION
The present invention relates to an information processing apparatus and method capable of processing a message using voice. More particularly, the present invention relates to an apparatus capable of reading a received message by voice or inputting a message to be transmitted using voice recognition.
[0002]
[Prior art]
With the spread of e-mail, an increasing number of devices have a function for reading aloud the messages contained in e-mails. As a way of making such read-aloud text easier for the listener to understand, Japanese Laid-Open Patent Publication No. 2000-148175 discloses a method in which the sender inserts expression symbols into the text to convey the intended nuance or expression, and the text is read aloud according to those symbols. Many methods for having a computer read human emotions have also been proposed with the aim of improving the human interface. For example, Japanese Patent Publication No. 6-82376 discloses a method for extracting the author's emotions from Japanese sentences, and Japanese Laid-Open Patent Publication No. 5-12023 discloses a method for extracting emotion from speech using speech recognition.
[0003]
[Problems to be solved by the invention]
However, Japanese Laid-Open Patent Publication No. 2000-148175 has the drawback that the sender must insert the expression symbols manually. Japanese Patent Publication No. 6-82376 and Japanese Laid-Open Patent Publication No. 5-12023 disclose only means for extracting emotions and do not disclose how the extracted emotions are to be used.
[0004]
In addition, when a message is read aloud, a single fixed voice quality is generally used regardless of who sent the message.
[0005]
  An object of the present invention is to enable a received message to be read aloud in the sender's own voice.
  Another object of the present invention is to convey the sender's intention more accurately by reading the message aloud in the sender's own voice and with emotion.
[0006]
[Means for Solving the Problems]
  In order to achieve the above object, an information processing apparatus according to the present invention comprises the following arrangement. That is,
  Storage means for associating and storing personal telephone numbers, email addresses, and voice feature data;
  Call means for starting a call by dialing a designated telephone number and conducting a voice call;
  Generating means for generating voice feature data from the voice of the other party obtained during the call;
  Updating means for updating, using the voice feature data generated by the generating means, the voice feature data stored in the storage means that corresponds to the telephone number designated by the call means;
  A receiving means for receiving an email including text data;
  Obtaining means for obtaining voice feature data corresponding to the sender mail address of the mail received by the receiving means from the storage means;
  Synthesizing means for generating synthesized voice data for text data included in the mail using the voice feature data acquired by the acquiring means.
  An information processing apparatus according to another aspect of the present invention for achieving the above object has the following configuration. That is,
  Storage means for associating and storing personal telephone numbers, email addresses, and voice feature data;
  Call means for starting a call by receiving an incoming call from a calling party and conducting a voice call;
  Generating means for generating voice feature data from the voice of the other party obtained during the call;
  Update means for updating the voice feature data corresponding to the telephone number specified by the telephone number notification of the calling party, stored in the storage means, using the voice feature data generated by the generating means;
  A receiving means for receiving an email including text data;
  Obtaining means for obtaining voice feature data corresponding to the sender mail address of the mail received by the receiving means from the storage means;
  Synthesizing means for generating synthesized voice data for the text data included in the mail using the voice feature data acquired by the acquiring means.
[0011]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, some preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings.
[0012]
<Embodiment 1>
FIG. 1 is a block diagram illustrating a configuration of a message reading apparatus according to the first embodiment.
[0013]
Reference numeral 1 denotes a main control unit, and reference numeral 2 denotes a communication control unit connected to a public line. Reference numeral 3 denotes a handset unit used for voice calls. Voice calls are carried out with this configuration. For transmission, the handset unit 3 converts the input voice signal into a digital signal, which is sent from the communication control unit 2 to the public line via the main control unit 1. For reception, the communication control unit 2 receives a signal from the public line and converts it into a digital signal, which the handset unit 3 converts into a voice signal via the main control unit 1. A voice call is thus established by sending and receiving voice.
[0014]
Reference numeral 4 denotes an e-mail receiving unit, which stores e-mail received through the communication control unit 2. The main control unit 1 determines whether a signal received from the communication control unit 2 is a message or a voice call. If it is a message, the signal is sent to the e-mail receiving unit 4; if it is a voice call, it is sent to the handset unit 3.
[0015]
Reference numeral 5 denotes an input unit, which the operator uses to dial a telephone number, give a read-aloud instruction, and so on. Reference numeral 6 denotes a display unit, which displays received messages and the like. Reference numeral 7 denotes a voice feature extraction unit, which extracts the features of input voice. In the present embodiment, its input is the receive-side digital signal during a voice call (the signal obtained by digitizing the voice signal received by the communication control unit 2 from the public line). Reference numeral 8 denotes a personal data storage unit, which stores each individual's telephone number, mail address, and voice features. Reference numeral 9 denotes a voice feature data comparison unit, which compares the voice features extracted by the voice feature extraction unit 7 with the voice features stored in the personal data storage unit 8 and updates the contents of the personal data storage unit 8 as necessary.
[0016]
Reference numeral 10 denotes a mail address search unit, which searches the personal data storage unit 8 to find whose mail address has been input from the main control unit 1. Reference numeral 11 denotes a telephone number search unit, which searches the personal data storage unit 8 to find whose telephone number has been input from the main control unit 1. Reference numeral 12 denotes a voice synthesis unit, which is set with voice quality data, performs language analysis on the input text, and creates voice data using the set voice quality data. The voice data synthesized by the voice synthesis unit 12 is sent to the handset unit 3 via the main control unit 1 and becomes a voice signal. As described later, in this embodiment the target of voice synthesis is the text data of a received mail, and the voice synthesis unit 12 is set with voice quality data corresponding to the voice feature data of the speaker who sent that mail.
[0017]
FIG. 2 is a diagram illustrating a configuration example of the data stored in the personal data storage unit 8. In FIG. 2, 13 is a management number assigned to each individual. 14 is a name, 15 is a telephone number, and 16 is a mail address; these are entered by the operator. Reference numeral 17 denotes voice feature data, which is either the voice feature data extracted by the voice feature extraction unit 7 or data updated by the voice feature data comparison unit 9. Reference numeral 18 denotes a learning count, which is the number of times extraction has been performed by the voice feature extraction unit 7 (the number of times the voice feature data stored in the personal data storage unit 8 has been learned). The learning count 18 also includes information on whether the person is a learning target. For example, when 16 bits are used for the learning count data, the most significant bit is used as a flag indicating whether the person is a learning target.
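The record layout of FIG. 2 and the 16-bit learning count can be sketched as follows. This is a minimal illustration, not the patented implementation: the field names are hypothetical, and the convention that a set most significant bit means "excluded from learning" is an assumption (the text only says that the top bit flags whether the person is a learning target).

```python
from dataclasses import dataclass, field
from typing import List

FLAG_MASK = 0x8000   # most significant bit of the 16-bit word (assumed: set = excluded from learning)
COUNT_MASK = 0x7FFF  # remaining 15 bits hold the learning count


@dataclass
class PersonalRecord:
    """One entry of the personal data storage unit 8 (field names are hypothetical)."""
    management_number: int                                      # 13
    name: str                                                   # 14
    phone_number: str                                           # 15
    mail_address: str                                           # 16
    voice_features: List[float] = field(default_factory=list)  # 17
    learning_word: int = 0                                      # 18: flag bit plus count

    @property
    def learning_count(self) -> int:
        return self.learning_word & COUNT_MASK

    @property
    def is_learning_target(self) -> bool:
        return (self.learning_word & FLAG_MASK) == 0

    def increment_count(self) -> None:
        # Saturate at the 15-bit maximum so the count never spills into the flag bit.
        count = min(self.learning_count + 1, COUNT_MASK)
        self.learning_word = (self.learning_word & FLAG_MASK) | count

    def exclude_from_learning(self) -> None:
        # Set the flag bit while leaving the stored count untouched.
        self.learning_word |= FLAG_MASK
```

Packing the flag into the count word keeps the record compact, at the cost of limiting the count to 15 bits.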
[0018]
FIG. 3 is a flowchart showing a learning process for the voice feature data by the main control unit 1. In this embodiment, at the time of a voice call, learning is performed on the voice feature data of the other party.
[0019]
When dial input is received from the input unit 5 in step S101, the communication control unit 2 connects to the public line in step S102. A call can then be made with the handset unit 3. In step S103, the telephone number search unit 11 searches the personal data storage unit 8 for the number dialed in step S101. In step S104, based on the search result, if no matching person exists in the personal data storage unit 8, or if the matching person is not a learning target, the process proceeds to step S105 and the learning process ends. That is, no learning is performed and only the call takes place.
[0020]
On the other hand, if the search in step S103 finds a matching person who is a learning target, the process proceeds to step S106. In step S106, the received signal from the public line is sent to the voice feature extraction unit 7, which extracts the voice features. In step S107, extracted voice feature data is created from the voice features extracted by the voice feature extraction unit 7. In step S108, the learning count 18 of the matching person in the personal data storage unit 8 is checked. If the learning count is not 0, the process proceeds to step S109. In step S109, the voice feature data comparison unit 9 compares the voice feature data 17 registered for the matching person in the personal data storage unit 8 with the extracted voice feature data created in step S107. The voice feature data comparison unit 9 detects the differences between the two sets of data and, for the portions that differ, weights the registered voice feature data 17 according to the learning count 18 and then complements and averages it with the extracted voice feature data. In step S110, new voice feature data is created based on the comparison result of the voice feature data comparison unit 9, and in step S111 it is registered in the personal data storage unit 8.
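The weighted averaging of steps S109 to S111 can be sketched as follows. The source does not give the exact formula, so this sketch assumes a running mean in which the registered data is weighted by the learning count and the new sample by 1; representing the voice feature data as a list of floats, and the function name, are also assumptions.

```python
from typing import List


def merge_features(registered: List[float], extracted: List[float],
                   learning_count: int) -> List[float]:
    """Blend registered voice features with newly extracted ones.

    Assumed weighting: registered data counts `learning_count` times,
    the newly extracted data counts once (a running mean).
    """
    if learning_count <= 0:
        # Nothing registered yet: the extracted data is used as-is.
        return list(extracted)
    if len(registered) != len(extracted):
        raise ValueError("feature vectors must have the same length")
    n = learning_count
    return [(n * old + new) / (n + 1) for old, new in zip(registered, extracted)]
```

Under this weighting, a profile learned over many calls changes only slightly with each new call, while an early profile adapts quickly.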
[0021]
If the learning count 18 is 0 in step S108, there is no voice feature data 17 to be compared. Accordingly, the process proceeds from step S108 to step S112, and the extracted voice feature data created in step S107 is registered as the voice feature data 17 of the personal data storage unit 8. In step S113, the learning count 18 is incremented by 1, and the process ends in step S105.
[0022]
In the above processing, the learning at the time of a call by outgoing call has been described. However, even at the time of a call by incoming call, for example, if the telephone number of the other party can be identified by number notification, the voice characteristics of the other party can be learned.
[0023]
The message reading apparatus according to the present embodiment reads a mail message using the voice feature data generated and learned as described above. FIG. 4 is a flowchart for explaining processing of the main control unit 1 when reading a mail according to the first embodiment.
[0024]
When a read-aloud instruction is received from the input unit 5 in step S201, the e-mail to be read aloud is retrieved from the e-mail receiving unit 4 in step S202. In step S203, the sender mail address information attached to the e-mail is extracted. Next, in step S204, the mail address search unit 10 searches the personal data storage unit 8 using this mail address.
[0025]
Based on the search result, if a matching person exists, the process proceeds from step S205 to step S206. In step S206, the learning count 18 of the matching personal data is checked; if it is not 0, the process proceeds to step S207. In step S207, the voice feature data 17 of the matching person is read out, and in step S208 it is set in the voice synthesis unit 12 as voice quality data. If no matching person is found in step S205, or if the learning count 18 is 0 in step S206, the voice quality data preset as the standard is set in step S209.
[0026]
In step S210, the voice synthesis unit 12 converts the e-mail message into voice data by synthesis, using the voice quality data set in step S208 or step S209. In step S211, the handset unit 3 outputs the voice data converted in step S210 as voice.
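The voice selection in steps S204 to S209 amounts to a lookup with a fallback, which can be sketched as follows. The dictionary keyed by mail address, the `Entry` type, and the placeholder standard voice are all assumptions made for illustration; the synthesis step itself (S210) is outside the scope of the sketch.

```python
from typing import Dict, List, NamedTuple, Optional


class Entry(NamedTuple):
    voice_features: List[float]   # voice feature data 17 (assumed representation)
    learning_count: int           # learning count 18, without the flag bit


# Placeholder for the voice quality data preset as the standard.
DEFAULT_VOICE: List[float] = [0.0, 0.0, 0.0]


def select_voice(sender_address: str, directory: Dict[str, Entry]) -> List[float]:
    """Return the voice quality data to set in the synthesis unit for a mail sender."""
    entry: Optional[Entry] = directory.get(sender_address)   # S204, S205
    if entry is None or entry.learning_count == 0:           # S205, S206
        return DEFAULT_VOICE                                 # S209
    return entry.voice_features                              # S207, S208
```

A sender who is registered but has never been learned falls back to the standard voice, exactly like an unknown sender.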
[0027]
In the message reading apparatus according to the present embodiment, the voice feature data stored in the personal data storage unit 8 can be adjusted using the input unit 5 and the display unit 6 under the control of the main control unit 1. Hereinafter, this adjustment process will be described.
[0028]
FIG. 5 is a diagram showing the screen transitions when the registered voice feature data 17 is adjusted. In this example, a touch panel serving as the input unit 5 is overlaid on the display unit 6, but other display and input forms may be used.
[0029]
Reference numeral 501 denotes the adjustment mode screen, on which the names and learning counts stored in the personal data storage unit 8 are displayed. In the learning count display, “○” indicates a learning count of 0, and “◎” indicates that the person is excluded from learning. Reference numeral 502 shows the operation of selecting a test utterance text: touching “Voice [Self-introduction]” displays a pull-down menu from which the utterance text can be chosen. Here, it is assumed that “Self-introduction” is selected. In 503, a name is selected with the up and down direction keys, and when “Speak” is touched, the fixed text selected in 502 is read aloud using the registered voice feature data 17. Here, “Self-introduction” is selected, so “My name is **. Nice to meet you.” is spoken.
[0030]
To adjust the voice, as shown in 504, the cursor is placed on the person to be adjusted and “Adjust” is touched. Reference numeral 505 denotes the adjustment screen, on which the pitch, tone, and sentence ending can be adjusted. Each item is selected by touching it, and the left and right direction keys adjust the pitch speed, the tone height, and the rise or fall of the sentence ending. When “Continue” is touched, the adjusted data is set as the voice feature data 17 and learning continues thereafter. When “Fix” is touched, the adjusted data is set as the voice feature data 17 and, as shown in 506, the learning count column changes to ◎; the person is excluded from learning and the voice feature data 17 is fixed.
[0031]
As described above, the first embodiment has the following effects.
・ Messages can be read aloud in the other party's voice, which makes it easier to identify and picture the sender.
・ When voice feature data for reading messages is generated, the other party's voice can be sampled without any awareness or effort on the operator's part.
・ Learning on every call brings the synthesized voice closer to the sender's voice.
・ Learning on every call makes it possible to follow changes in the sender's voice.
・ Because the voice can be adjusted, the synthesized voice can be brought closer to the voice the user imagines.
[0032]
<Embodiment 2>
In the second embodiment, a message reading apparatus that reads messages aloud with emotional elements taken into account will be described.
[0033]
FIG. 6 is a block diagram showing the configuration of the message reading apparatus according to the second embodiment. In FIG. 6, the configurations indicated by reference numerals 1 to 12 have the same functions as the configurations of the first embodiment shown in FIG. Further, the content of the personal data stored in the personal data storage unit 8 is substantially the same as that shown in FIG. 2, except that the voice feature data 17 is configured according to emotions.
[0034]
FIG. 7 is a diagram for explaining the contents of the voice feature data according to the second embodiment. The voice feature data 17 according to the second embodiment is composed of basic voice data 24, which does not change greatly with emotion, and emotion influence voice data 25, which is influenced by emotion. The configuration of the emotion influence voice data 25 is as follows. Reference numeral 26 denotes an emotion classification, which in this embodiment is roughly divided into <joy: H>, <expectation: E>, <normal: U>, <sadness: S>, and <anger: A>. Reference numeral 27 denotes a sampling code, which indicates by <determined: Y> or <provisional: N> whether the “pitch” and “tone” of the emotion-specific data 28 were extracted from speech or were predicted and adjusted.
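The layout described above can be pictured with a minimal sketch. The class and field names below are hypothetical, the basic voice data is reduced to a single number, and the default pitch and tone values are placeholders rather than anything taken from FIG. 7.

```python
from dataclasses import dataclass, field
from typing import Dict

# Emotion classifications 26: joy, expectation, normal, sadness, anger
EMOTIONS = ("H", "E", "U", "S", "A")

@dataclass
class EmotionEntry:
    """One row of the emotion influence voice data 25 (names are illustrative)."""
    determined: bool = False  # sampling code 27: True = <determined: Y>, False = <provisional: N>
    pitch: float = 1.0        # placeholder standard value
    tone: float = 1.0         # placeholder standard value

@dataclass
class VoiceFeatureData:
    """Basic voice data 24 plus per-emotion data 25, with the learning count 18."""
    basic_voice: float = 0.0  # stand-in for the real basic voice data
    emotion_voice: Dict[str, EmotionEntry] = field(
        default_factory=lambda: {e: EmotionEntry() for e in EMOTIONS}
    )
    learning_count: int = 0

data = VoiceFeatureData()
print(sorted(data.emotion_voice))  # one entry per emotion classification
```

A freshly created record has every sampling code set to provisional and a learning count of 0, which matches the initial state the description assigns to reference numeral 801.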
[0035]
Returning to FIG. 6, reference numeral 19 denotes a voice emotion extraction unit, which receives the receiver-side digital signal during a voice call, focuses on prosodic information of the input voice signal such as pitch and accent, and classifies the emotion into one of the emotion classifications 26. The voice feature data generated from the extraction result of the voice feature extraction unit 7 is stored as the emotion influence voice data 25 under the emotion classified here.
[0036]
Reference numeral 20 denotes an emotion data adjustment unit. Among the emotion-specific data of the voice feature data 23, it predicts and adjusts the data for emotions that have not yet been detected, that is, the emotion-specific data 28 whose sampling code 27 is <provisional: N>, according to the emotion-specific data whose sampling code is <determined: Y> and a predetermined criterion. Reference numeral 21 denotes a sentence emotion extraction unit, which receives the message received by the e-mail receiving unit 4 and, holding in advance a dictionary of emotion information quantified in units of words, extracts the emotion information contained in the word string and judges the emotion for each word or phrase. Reference numeral 22 denotes an emotion voice generation unit, which applies the emotion-specific data 28 of the emotion influence voice data 25 to the voice data created by the voice synthesis unit 12 according to the result of the sentence emotion extraction unit 21, thereby generating voice data to which emotion information has been added.
[0037]
FIG. 8 and FIG. 9 are diagrams showing an example of creating voice feature data according to the second embodiment. Reference numeral 801 denotes an initial state. Standard audio data is set in the basic audio data 24. In addition, standard data is also set in the emotion influence sound data 25. That is, all the sampling codes 27 are <provisional: N>, and the emotion-specific data 28 is also set as standard with pitches: a1 to a5 and tones: b1 to b5. The number of learnings 18 at this time is 0.
[0038]
Reference numeral 802 denotes a state in which a voice call is performed. The voice is input to the voice feature extraction unit 7 and the voice emotion extraction unit 19. The voice feature extraction unit 7 detects basic voice data 24a (Z) and emotion-specific data 28 (A3, B3) from the input voice. The voice emotion extraction unit 19 determines an emotion from the input voice and detects it as an emotion classification 26 (classified as U in the figure). The detected basic voice data 24 (Z) is stored as the basic voice data 24 in the voice feature data 23 as it is because the learning count 18 is zero. As a result, the learning count 18 is incremented by 1 and becomes 1.
[0039]
Meanwhile, for the detected emotion influence voice data 25 (A3, B3), the detected emotion classification 26a is U, and the sampling code 27 of <normal: U> in the emotion influence voice data is <provisional: N>. The detected data is therefore stored as the emotion-specific data 28 of <normal: U>, and the sampling code 27 is changed from <provisional: N> to <determined: Y>.
[0040]
Next, in 803, the emotion data adjustment unit 20 is activated by the update of the emotion influence voice data 25 in 802. Using all the emotion-specific data 28 whose sampling code 27 is <determined: Y>, the emotion data adjustment unit 20 adjusts all the emotion-specific data whose sampling code 27 is <provisional: N>. In the case of 803, for example, the emotion-specific data 28 of <normal: U>, whose sampling code 27 is <determined: Y>, is fetched. The values for the other emotion classifications 26 whose sampling code 27 is <provisional: N>, namely the emotion-specific data 28 of <joy: H>, <expectation: E>, <sadness: S>, and <anger: A> in 803, are then determined from predetermined correlations between the emotion classifications 26, for example a pitch relationship such as E:U = 1.2:1, and are updated.
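As a rough illustration of the 803 case, where only <normal: U> has been determined, the provisional pitches can be predicted by scaling the measured normal pitch with a table of standard ratios. The function name is hypothetical. The H and E ratios follow the pitch relationship given in this description (H:E:U = 1.5:1.2:1), while the S and A ratios are invented purely for the example.

```python
# Standard pitch ratios relative to <normal: U>.
# H and E come from the description; S and A are assumed values for illustration only.
STANDARD_PITCH_RATIO = {"H": 1.5, "E": 1.2, "U": 1.0, "S": 0.8, "A": 1.3}

def predict_from_normal(normal_pitch):
    """Predict a pitch for every emotion classification from the measured <normal: U> pitch."""
    return {emotion: normal_pitch * ratio for emotion, ratio in STANDARD_PITCH_RATIO.items()}

predicted = predict_from_normal(200.0)
print(round(predicted["E"], 2), round(predicted["H"], 2))
```

Only the entries whose sampling code is still provisional would be overwritten with these predictions; determined entries keep their measured values.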
[0041]
Reference numeral 804 denotes a state in which the next call is performed. Basic voice data 24b, emotion-specific data 28 (A2, B2), and emotion classification 26 (E) are detected from the input voice in the same flow as in 801. Because the sampling code 27 of <expectation: E>, which corresponds to the detected emotion classification 26b (E), is <provisional: N>, the detected emotion-specific data (A2, B2) is stored as the emotion-specific data 28 of <expectation: E>. Then, the sampling code 27 of <expectation: E> is changed from <provisional: N> to <determined: Y>.
[0042]
For the detected basic voice data 24b (Y), processing moves to 805 because the learning count 18 is 1: the detected basic voice data 24b (Y), the stored basic voice data 24 (Z), and the learning count 18 are input to the voice feature data comparison unit 7. The voice feature data comparison unit 7 weights the stored basic voice data 24 (Z) by the learning count 18 and averages it with the detected basic voice data 24b (Y) to create new basic voice data 24c (X), which is stored as the basic voice data 24 of the voice feature data 23. Subsequently, the learning count 18 is incremented by 1 and becomes 2.
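One plausible reading of "weighted by the learning count and averaged" is a running mean, sketched below. The function name and the representation of the basic voice data as a list of numbers are assumptions, and the exact weighting used by the apparatus is not specified in the text.

```python
def update_basic_voice(stored, detected, learning_count):
    """Merge newly detected basic voice data into the stored data.

    stored and detected are equal-length lists of feature values (an assumed
    representation). Returns the merged data and the incremented learning count.
    """
    if learning_count == 0:
        # First call: the detected data is stored as it is.
        return list(detected), 1
    merged = [
        (s * learning_count + d) / (learning_count + 1)
        for s, d in zip(stored, detected)
    ]
    return merged, learning_count + 1

# Stored data Z learned from one call, newly detected data Y
new_data, count = update_basic_voice([100.0, 0.4], [110.0, 0.6], 1)
print(new_data, count)
```

With this weighting, each call contributes equally to the stored data, so a single unusual call shifts the learned voice less and less as the learning count grows.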
[0043]
In 806, the emotion data adjustment unit 20 is activated by the update of the emotion influence voice data 25. As described above, the emotion data adjustment unit 20 uses all the emotion-specific data 28 whose sampling code 27 is <determined: Y>, here the data of <normal: U> and <expectation: E>, to update all the emotion-specific data 28 whose sampling code 27 is <provisional: N>, in 806 the data of <joy: H>, <sadness: S>, and <anger: A>, according to the predetermined correlations between the emotion classifications 26. For example, suppose the standard pitch relationship is H:E:U = 1.5:1.2:1, whereas the emotion-specific data 28 actually captured gives E:U = 1.08:1. In that case it is judged that this speaker's emotion is reflected in pitch to a lesser degree than standard, so H is also expected to be lower than standard; it is corrected to H = 1.5 × 1.08 / 1.2 = 1.35, and the data is determined and updated. When emotion-specific data 28 whose sampling code 27 is already <determined: Y> is detected again, it is updated by the same learning method as in the first embodiment, and the <provisional: N> data may also be updated at that time.
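The correction in this example amounts to scaling the standard H ratio by how far the observed E ratio deviates from its standard value. A small sketch, with a hypothetical function name, reproduces the arithmetic:

```python
def correct_ratio(standard_target, standard_reference, observed_reference):
    """Scale a standard ratio by the observed/standard deviation of a reference emotion."""
    return standard_target * observed_reference / standard_reference

# Standard H:E:U = 1.5:1.2:1, observed E:U = 1.08:1
h_ratio = correct_ratio(1.5, 1.2, 1.08)
print(round(h_ratio, 2))  # 1.35
```

The same scaling could presumably be applied to the other provisional classifications, although the description gives standard ratios only for H, E, and U.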
[0044]
FIG. 10 is a diagram illustrating a voice data creation path when reading a message according to the second embodiment.
[0045]
Reference numeral 29 is a received message to be read out, and the received message 29 is passed to the sentence emotion extracting unit 21 and the speech synthesizing unit 12. The basic voice data 24 is also passed to the voice synthesizer 12. The sentence emotion extraction unit 21 determines an emotion for each word or phrase from the received message 29 and creates emotion message data 30 in which emotion information is added to the received message 29. The voice synthesizer 12 synthesizes the received message 29 with voice using the basic voice data 24 as voice quality data, and creates voice data 31.
[0046]
The emotion message data 30, the voice data 31, and the emotion influence voice data 25 are passed to the emotion voice generation unit 22. The emotion voice generation unit 22 processes the input voice data 31 with the emotion-specific data of the emotion influence voice data 25 that corresponds to the emotion information included in the emotion message data 30, and creates the voice data 32 with emotion. The voice data 32 with emotion is converted into voice by the handset unit 3.
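The reading path of FIG. 10 can be approximated in a few lines. Everything concrete here is assumed for illustration: the tiny word dictionary, the splitting of the message into phrases at periods, and the idea of returning a per-phrase pitch factor instead of processing real audio.

```python
# Hypothetical word-to-emotion dictionary (the real dictionary holds quantified emotion information)
EMOTION_WORDS = {"happy": "H", "glad": "H", "sad": "S", "angry": "A", "waiting": "E"}

def tag_emotions(message):
    """Return (phrase, emotion) pairs, a stand-in for the emotion message data 30."""
    tagged = []
    for phrase in message.split("."):
        phrase = phrase.strip()
        if not phrase:
            continue
        emotion = "U"  # <normal> unless a dictionary word is found
        for word in phrase.lower().split():
            if word in EMOTION_WORDS:
                emotion = EMOTION_WORDS[word]
                break
        tagged.append((phrase, emotion))
    return tagged

def plan_reading(message, emotion_pitch):
    """Attach the sender's per-emotion pitch factor to each phrase."""
    return [(p, e, emotion_pitch[e]) for p, e in tag_emotions(message)]

pitch_table = {"H": 1.35, "E": 1.08, "U": 1.0, "S": 0.8, "A": 1.3}  # illustrative values
for row in plan_reading("I am happy. See you tomorrow.", pitch_table):
    print(row)
```

In the apparatus itself the emotion voice generation unit 22 would apply the selected emotion-specific data to synthesized audio; the sketch stops at choosing which data applies to which phrase.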
[0047]
As described above, according to the second embodiment, the following effects can be obtained.
・ Sender's emotions can be known from the reading voice, and the intention of the other party can be grasped more accurately.
・ It is possible to measure the degree of urgency from the emotion of the sender.
-Even if the sender does not show emotions on the surface, it is possible to deepen the understanding of the other party by knowing the other party's emotions from the text.
[0048]
<Embodiment 3>
Next, a description will be given of a message transmission device that performs speech recognition on input speech, generates an email, and transmits the email, and a message reading device that receives and reads such email. Note that the transmission target is not limited to electronic mail, but may be message transmission such as chat.
[0049]
FIG. 11 is a block diagram illustrating a configuration of a message transmission device according to the third embodiment. The components having the same functions as those described in the apparatus configuration of the first embodiment (FIG. 1) are denoted by the same reference numerals.
[0050]
An audio input unit 33 inputs an operator's voice as an electric signal through a microphone. Reference numeral 34 denotes an e-mail transmission unit. A voice recognition unit 35 recognizes the voice input from the voice input unit 33 and converts it into text data. Reference numeral 36 denotes a volume / speed detection unit that detects the volume and speaking speed of the voice input from the voice input unit 33 and detects portions that differ by a predetermined rate or more compared to other voice portions. Reference numeral 37 denotes a sentence creation unit which creates a transmission mail including the addition of voice feature data to the text data recognized by the voice recognition unit 35 and information on the detection result of the volume / speed detection unit 36. Note that the voice input unit 33 may use the call unit of the handset unit 3.
[0051]
FIG. 12 is a diagram illustrating a creation route of outgoing mail data according to the third embodiment. The operator's voice 38 is input from the voice input unit 33 and becomes voice data 31. The voice data 31 is input to the voice feature extraction unit 7, the voice recognition unit 35, and the volume / speed detection unit 36.
[0052]
The voice recognition unit 35 sequentially recognizes the input voice data 31, converts it into text data 39, and inputs it to the sentence creation unit 37. When a large change in volume or speed occurs in the voice data 31, the volume / speed detection unit 36 notifies the sentence creation unit 37 of the change. In response, the sentence creation unit 37 adds information to the portion of the text data 39 that corresponds to that voice portion, according to a predetermined information addition method. Information addition methods include text decoration such as underlining and italics, changes of character size or font type, and addition as data that is not displayed on the screen.
[0053]
The voice feature extraction unit 7 creates the voice feature data 17 from the input voice data 31 and passes it to the sentence creation unit 37. The sentence creation unit 37 creates the outgoing mail 40 by adding the voice feature data 17 and other information to the text sentence to which the information is added.
[0054]
FIG. 13 shows a configuration example of the outgoing mail 40. Reference numeral 41 denotes destination information, and 42 denotes sender information, that is, the sender's own information. Reference numeral 43 denotes mail type information, which indicates that this mail contains voice information (a mail containing voice information is one whose font, size, character decoration, and so on have been changed according to the volume and speed of the voice). Next, the voice feature data 17 at the time of input is stored. Reference numeral 44 denotes a subject, and reference numeral 45 denotes an example sentence including voice information.
[0055]
In the example of FIG. 13, in the sentence 45, "Hello, this is **." was spoken in a normal voice, so the characters set as standard are used. For "her photo", the volume / speed detection unit 36 detected that the volume was low, so the font size is reduced; it also detected that the speaking speed was fast, so italic decoration is applied as well. For “Please be sure today”, the volume / speed detection unit 36 detected that the volume was extremely high, so the font size is increased; it also detected that the speaking speed was slow, so the characters are made bold. Furthermore, the combination of the two detections (loud and slow) is recognized as marking an important part, and an underline is added.
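The decoration rules in this example can be summarized as a small lookup. The sketch below assumes a single relative threshold (30 percent) for "differs by a predetermined rate", and the function and style names are hypothetical.

```python
def decorate(volume, speed, avg_volume, avg_speed, rate=0.3):
    """Return the set of text decorations for one voice portion.

    A portion counts as loud/quiet or fast/slow when it differs from the
    average by more than `rate` (an assumed threshold).
    """
    loud = volume > avg_volume * (1 + rate)
    quiet = volume < avg_volume * (1 - rate)
    fast = speed > avg_speed * (1 + rate)
    slow = speed < avg_speed * (1 - rate)

    styles = set()
    if loud:
        styles.add("large")      # high volume: larger font
    if quiet:
        styles.add("small")      # low volume: smaller font
    if fast:
        styles.add("italic")     # fast speech: italics
    if slow:
        styles.add("bold")       # slow speech: bold
    if loud and slow:
        styles.add("underline")  # loud and slow: marked as important
    return styles

print(sorted(decorate(50, 150, 100, 100)))  # quiet and fast
print(sorted(decorate(200, 50, 100, 100)))  # loud and slow
```

A portion spoken at roughly average volume and speed returns an empty set, which corresponds to leaving the standard characters unchanged.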
[0056]
Of course, it is also possible to display the transmission mail 40 on the display unit 6 and transmit it after adding or modifying the character decorations at the input unit 5.
[0057]
Next, a device that reads aloud the above mail containing voice information will be described. FIG. 14 is a block diagram showing a configuration of a mail reading device corresponding to a voice information-containing mail according to the third embodiment.
[0058]
The outgoing mail 40 is received by the communication control unit 2 through the public line and sent to the electronic mail receiving unit 4 via the main control unit 1. When a mail is designated from the input unit 5, the mail content is displayed on the display unit 6. The display state at this time is in accordance with the font, size, and character modification set according to the voice state, as indicated by 45 in FIG.
[0059]
When reading aloud is instructed from the input unit 5, the e-mail is transferred from the e-mail receiving unit 4 to the main control unit 1. The main control unit 1 confirms from the mail type information 43 that the e-mail contains voice information, and passes the subject 44 and the sentence 45 to the voice synthesis unit 12. At this time, the voice feature data 17 included in the e-mail is set as the voice quality data. The sentence 45 is also passed to the volume / speed adjustment unit 47. The voice synthesis unit 12 creates voice data based on the voice quality data. The volume / speed adjustment unit 47 extracts the additional information carried by the character decoration and size in the input sentence 45 and, based on it, processes the voice data created by the voice synthesis unit 12 to set the volume, speed, and so on, creating the final read-out data. The final read-out data is sent to the voice output unit 47 via the main control unit 1 and converted into voice.
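On the reading side, the volume / speed adjustment is the inverse of the decoration step: character decoration is turned back into playback settings. The sketch below assumes the same style names as a sender might use and picks arbitrary multipliers; neither is specified in the description.

```python
def reading_params(styles):
    """Map text decorations back to relative volume and speed (multipliers are assumed)."""
    volume, speed = 1.0, 1.0
    if "large" in styles:
        volume = 1.5   # larger font was produced by a loud voice
    if "small" in styles:
        volume = 0.5   # smaller font was produced by a quiet voice
    if "italic" in styles:
        speed = 1.5    # italics were produced by fast speech
    if "bold" in styles:
        speed = 0.5    # bold was produced by slow speech
    return {"volume": volume, "speed": speed}

print(reading_params({"small", "italic"}))
print(reading_params({"large", "bold", "underline"}))
```

The underline carries no separate playback setting in this sketch because, in the example of FIG. 13, it is derived from the loud and slow combination that the size and weight already encode.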
[0060]
This example detects the voice features at the time of voice input. However, as shown in the first embodiment, the handset unit 3 may be used as the voice input unit 33, the voice feature data 17 may be generated during a call, and that voice feature data 17 may be added to the outgoing mail. In this case, naturally, the learning and voice adjustment shown in the first embodiment can also be performed.
[0061]
As described above, according to the third embodiment, the following effects can be obtained.
・ The appearance of the text can be deliberately varied by the loudness of the voice and the manner of speaking.
・ Since the receiver can listen in the voice of the sender, the sender can be identified and the intention is easy to understand.
・ The amount of data is smaller than when transmitting voice data, so the communication fee is lower.
・ The receiver can grasp the intention of the sender visually.
[0062]
<Embodiment 4>
Next, as a fourth embodiment, a message transmission device in which emotion elements are added to the third embodiment will be described.
[0063]
FIG. 15 is a block diagram illustrating a configuration of a message transmission apparatus according to the fourth embodiment. The configuration shown in FIG. 15 is obtained by adding an emotion detection unit 48 to the configuration diagram (FIG. 11) shown in the third embodiment. The emotion detection unit 48 determines the speaker's emotion from the voice data 31 input from the voice input unit 33, and when the result indicates a change beyond the predetermined range of “normal emotion”, outputs the result as the emotion classification 26.
[0064]
FIG. 16 is a diagram illustrating a creation route from voice input to creation of a message including emotion information in the message transmission device according to the fourth embodiment.
[0065]
The speaker's voice 38 input to the voice input unit 33 becomes the voice data 31 and is input to the voice recognition unit 35, the emotion detection unit 48, and the voice feature extraction unit 7. The voice recognition unit 35 sequentially recognizes the voice data 31 to create text data 39, which is passed to the sentence creation unit 37. The emotion detection unit 48 determines the emotion from the voice data 31 and outputs the emotion classification 26 when the emotion exceeds the set range. The emotion classification 26 is input to the voice feature extraction unit 7 and the sentence creation unit 37. The voice feature extraction unit 7 extracts the voice characteristics from the voice data 31 and also extracts the emotion-specific characteristics based on the input emotion classification 26, and from these creates the basic voice data 24 and the emotion-affected voice data 25.
[0066]
The sentence creation unit 37 receives the notification of the emotion classification 26, and adds information to the text data 39 at that time according to the information addition method predetermined for each emotion classification 26 for each phrase. The sentence creation unit 37 creates the outgoing mail 49 by adding the basic voice data 24, the emotion influence voice data 25, and other information to the text sentence to which the information is added.
[0067]
FIG. 17 is a diagram illustrating a configuration example of a transmission mail according to the fourth embodiment. The main configuration of the transmission mail 49 according to the fourth embodiment is the same as that shown in FIG. 13, and FIG.
[0068]
The voice feature data 17 is composed of basic voice data 24 and emotion influence voice data 25. Further, text data 50 including emotion information is stored following the subject 44. FIG. 17 shows an example of sentence data with emotion information. In this example, "Hello, this is **." is determined by the emotion detection unit 48 to be <normal> in the emotion classification 26, and is written in the characters set as standard. “Happy” is determined to be <joy>, and the <joy> information is added by appending "(laugh)" as characters corresponding to <joy>. “nobody” is determined to be <sadness>, and the <sadness> information is added by appending the data 0003h, which is not displayed on the screen, after the phrase. “I came to my head” is determined to be <anger>, and the <anger> information is added by appending “!!”. “Waiting for reply” is determined to be <expectation>, and the <expectation> information is added by appending the emoticon “(^ _ ^)”. These additions are shown to illustrate the variety of possible information addition methods; they are not meant as the method prescribed for each emotion classification 26.
[0069]
When editing the created sentence 50, 0003h, which is information of <sadness>, is also displayed as information and can be edited.
[0070]
As an apparatus for reading out such a transmission message, the apparatus described in the second embodiment can be used. That is, voice synthesis is performed as shown in FIG. 10 using the basic voice data 24 and the emotion influence voice data 25 included in the outgoing mail. In the case of the fourth embodiment, however, the sentence emotion extraction unit 21 extracts predetermined symbols or character strings (!!, (laugh), etc.), codes (0003h), and the like from the message to be read out, and generates the emotion message data 30 based on them.
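A minimal sketch of this marker-based extraction is shown below. The marker table is an assumption built from the FIG. 17 example, the non-displayed code 0003h is represented as the control character U+0003, the emoticon is written without spaces, and removing the marker from the text to be spoken is a design choice of the sketch rather than something the description states.

```python
# Assumed marker table based on the FIG. 17 example
MARKERS = {
    "(laugh)": "H",   # <joy>
    "\u0003": "S",    # <sadness>, the non-displayed code 0003h
    "!!": "A",        # <anger>
    "(^_^)": "E",     # <expectation>
}

def extract_emotion(phrase):
    """Return (text_to_read, emotion) for one phrase; <normal: U> when no marker is present."""
    for marker, emotion in MARKERS.items():
        if marker in phrase:
            return phrase.replace(marker, "").strip(), emotion
    return phrase.strip(), "U"

for p in ["Hello", "Happy(laugh)", "nobody\u0003", "Waiting for reply(^_^)"]:
    print(extract_emotion(p))
```

The resulting (text, emotion) pairs play the role of the emotion message data 30, so the rest of the reading path can stay the same as in the second embodiment.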
[0071]
As described above, the fourth embodiment has the following effects.
・ By adding emotional elements, it becomes possible to convey the intentions and feelings of the sender.
・ The receiver can know the emotion of the sender visually.
[0072]
[Effects of the Invention]
As described above, according to the present invention, the received message can be read out by the voice of the caller without performing a complicated registration operation.
Further, according to the present invention, it is possible to read out with a voice of the caller himself and with emotion, and the intention of the caller can be conveyed more accurately.
Furthermore, according to the present invention, a transmission message can be generated automatically by incorporating a notation indicating the emotion of the input voice into the character string obtained by performing voice recognition processing on that voice, so that message transmission rich in emotional expression can be realized.
[Brief description of the drawings]
FIG. 1 is a block diagram illustrating a configuration of a message reading apparatus according to a first embodiment.
FIG. 2 is a diagram illustrating a configuration example of data stored in a personal data storage unit 8;
FIG. 3 is a flowchart showing a learning process for voice feature data by the main control unit 1;
FIG. 4 is a flowchart illustrating processing of the main control unit 1 when reading a mail according to the first embodiment.
FIG. 5 is a diagram illustrating screen transition when adjusting registered voice feature data 17 in the first embodiment.
FIG. 6 is a block diagram showing a configuration of a message reading apparatus according to the second embodiment.
FIG. 7 is a diagram illustrating the contents of audio feature data according to the second embodiment.
FIG. 8 is a diagram illustrating an example of creating voice feature data according to the second embodiment.
FIG. 9 is a diagram illustrating an example of creating voice feature data according to the second embodiment.
FIG. 10 is a diagram illustrating a voice data creation path when reading a message according to the second embodiment;
FIG. 11 is a block diagram illustrating a configuration of a message transmission device according to a third embodiment.
FIG. 12 is a diagram showing a creation route of outgoing mail data according to the third embodiment.
FIG. 13 is a diagram showing a configuration example of an outgoing mail 40 according to the third embodiment.
FIG. 14 is a block diagram showing a configuration of a mail reading device corresponding to a voice information-containing mail according to the third embodiment.
FIG. 15 is a block diagram illustrating a configuration of a message transmission device according to a fourth embodiment.
FIG. 16 is a diagram illustrating a creation route from voice input to creation of a message including emotion information in the message transmission device according to the fourth embodiment.
FIG. 17 is a diagram illustrating a configuration example of a transmission mail according to the fourth embodiment.

Claims

Storage means for associating and storing personal telephone numbers, email addresses, and voice feature data;
Call means for starting a call by designating the telephone number and making a voice call;
Generating means for generating voice feature data from the voice of the other party obtained during the call;
Updating means for updating the voice feature data corresponding to the telephone number designated by the call means, stored in the storage means, using the voice feature data generated by the generating means;
A receiving means for receiving an email including text data;
Obtaining means for obtaining voice feature data corresponding to the sender mail address of the mail received by the receiving means from the storage means;
An information processing apparatus comprising: synthesizing means for generating synthesized voice data for text data included in the mail using the voice feature data acquired by the acquiring means.

Storage means for associating and storing personal telephone numbers, email addresses, and voice feature data;
A call means for initiating a call by receiving an outgoing call from the other party and making a voice call;
Generating means for generating voice feature data from the voice of the other party obtained during the call;
Update means for updating the voice feature data corresponding to the telephone number specified by the telephone number notification of the calling party, stored in the storage means, using the voice feature data generated by the generating means;
A receiving means for receiving an email including text data;
Obtaining means for obtaining voice feature data corresponding to the sender mail address of the mail received by the receiving means from the storage means;
An information processing apparatus comprising: synthesizing means for generating synthesized voice data for text data included in the mail using the voice feature data acquired by the acquiring means.

The information processing apparatus according to claim 1 or 2, further comprising an adjusting means for adjusting the audio characteristic data stored in the storage unit manually.

Classifying means for classifying the voice of the calling party obtained from the calling means into any of a plurality of emotion classification items,
The generating means acquires voice feature data for each emotion classification item classified by the classifying means, and the storage means stores the voice feature data for each emotion classification item. The information processing apparatus according to claim 1 or 2.

The information processing apparatus according to claim 4, wherein the classification unit performs emotion classification based on phonetic information such as pitch and accent detected from the voice.

A determination means for determining which of the plurality of emotion classification items the text data included in the message belongs to;
The acquisition unit acquires voice feature data corresponding to the emotion item classification determined by the determination unit of the call partner corresponding to the sender of the message received by the reception unit from the storage unit. The information processing apparatus according to claim 4.

When voice feature data corresponding to a certain emotion classification item is generated in the generation means, the voice generation data corresponding to another emotion classification item is further updated using the voice feature data. The information processing apparatus according to claim 4.

A call process for starting a call by specifying a telephone number and making a call by voice,
A generation step of generating voice feature data from the voice of the call partner obtained during the call;
Corresponding to the telephone number specified in the calling step stored in the storing means for storing the personal telephone number, mail address and voice characteristic data in association with each other using the voice feature data generated in the generating step An update process for updating the voice feature data;
A receiving process for receiving an email including text data;
An acquisition step of acquiring voice feature data corresponding to a sender email address of the email received in the reception step from the storage means;
And a synthesis step of generating synthesized voice data for the text data included in the mail using the voice feature data acquired in the acquisition step.

A call process in which a call is started by receiving a call from a caller and a call is made by voice;
A generation step of generating voice feature data from the voice of the call partner obtained during the call;
Using the voice feature data generated in the generating step, the telephone specified by the telephone number notification of the calling party stored in the storage means for storing the personal phone number, mail address and voice feature data in association with each other An update process for updating the voice feature data corresponding to the number;
A receiving process for receiving an email including text data;
An acquisition step of acquiring voice feature data corresponding to a sender email address of the email received in the reception step from the storage means;
And a synthesis step of generating synthesized voice data for the text data included in the mail using the voice feature data acquired in the acquisition step.

The information processing method according to claim 8 , further comprising an adjustment step of manually adjusting the voice feature data stored in the storage step.

Further comprising a classification step of classifying the voice of the call partner obtained from the calling step into any of a plurality of emotion classification items,
The generation step acquires voice feature data for each emotion category item classified by the classification step, and the storage step stores voice feature data in the storage means for each emotion category item. The information processing method according to claim 8 or 9 .

12. The information processing method according to claim 11 , wherein the classification step performs emotion classification based on phonetic information such as pitch and accent detected from the voice.

A determination step of determining which of the plurality of emotion classification items the text data included in the message belongs to;
The acquisition step is characterized in that voice feature data corresponding to the emotion item classification determined in the determination step is acquired from the storage unit of the call partner corresponding to the sender of the message received in the reception step. The information processing method according to claim 11 .

When voice feature data corresponding to an emotion classification item is generated in the generation step, the voice feature data further includes an update step of updating voice feature data corresponding to another emotion classification item using the voice feature data. The information processing method according to claim 11 .

A storage medium storing a computer program for causing a computer to execute the information processing method according to any one of claims 8 to 14.