JP4228442B2 - Voice response device

Info

Publication number: JP4228442B2
Application number: JP36289898A
Authority: JP
Inventors: 雅一服部; 崇司笹井; 弘史角田; 靖彦加藤
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 1998-12-21
Filing date: 1998-12-21
Publication date: 2009-02-25
Anticipated expiration: 2018-12-21
Also published as: JP2000181475A

Abstract

PROBLEM TO BE SOLVED: To inform a user of the reliability of an answer returned by an answering system to a question input by the user, and to adaptively vary the level of detail of the answer. SOLUTION: An input analysis block 2 causes a retrieval block 5 to retrieve data according to information input from an input block 1, an answer text generation block 4 generates an answer sentence from the retrieved data, and a voice synthesis block 3 outputs a synthesized voice of the answer sentence. The answer text generation block 4 generates an answer sentence that reflects the reliability of the answer, varies the degree of abstraction of the answer sentence's expression according to the accuracy of the retrieved data, and also varies that degree of abstraction according to the accuracy requirement estimated from the input information and from the history held in a history management block 9.

Description

【０００１】
【発明の属する技術分野】
本発明は、例えばユーザ等から入力された質問等に対して応答を行うようなシステムに関し、特に自然言語の音声による応答を行う音声応答装置に関する。
【０００２】
【従来の技術】
従来より、例えばユーザ等から入力された質問等に対して応答を行うようなシステムが存在する。当該応答システムの一例としては、ユーザ等からの質問等に対して自然言語の音声による応答を行う音声応答装置が知られている。
【０００３】
また、応答の内容やその情報の出所に応じて応答音声を変えるようになされた音声合成システム及び音声合成方法も存在している。
【０００４】
【発明が解決しようとする課題】
ところで、例えばユーザ等から入力された質問に対して応答システムから応答を返す際に、その応答の信頼性についてもユーザに知らせたい場合がある。しかし、従来の音声応答装置では、口調、音質などが一定のものが多い。これでは、応答の内容をしっかり聞かないと、ユーザはその信頼性がわからない。
【０００５】
また、例えば応答の内容やその情報の出所に応じて応答音声を変えるシステムを用いたとしても、応答の信頼性を伝えるのに充分ではない。例えば、ユーザの質問への理解に不安があっての応答ならば、その応答の信頼性は低いと言える。一方で、応答に利用した情報の細かい部分が不確かであっても、ユーザの要求が詳細を求めていなければ、充分な応答ができると考えられる。
【０００６】
さらに、応答の詳細度を適応的に変えることも、音声応答においては重要なポイントである。すなわち例えば、音声出力で必要以上に詳しく説明してしまうと、適切な応答とは言えなくなる。このような場合における応答の一つの方法としては、特開平８−１３７６９８号公報にて開示されるように、ユーザからの質問の形式、履歴を反映して、必要と思われる部分の説明だけを行い、他を省略するような方法が考えられる。また、他の方法としては、データを応答文に変換する時の抽象化の程度によって、詳細度を変えることも考えられる。例えば、あるイベントの日程を知らせる時に、いつ頃かという「月、季節」を知らせるのか、或いは具体的な「日時」まで知らせるのかといったことである。
【０００７】
しかしながら、従来の音声応答システムでは、データから応答文への変換の時に、用意された応答文のテンプレートにデータの項目を入れていくものが多く、１対１対応でデータを応答文に変換するため、簡潔な応答のための抽象化には対応できない。
【０００８】
そこで、本発明はこのような状況に鑑みてなされたものであり、ユーザ等から入力された質問に対して応答システムから応答を返す際に、その応答の信頼性についてユーザに知らせることが可能であり、また、応答の詳細度を適応的に変えることも可能な、音声応答装置を提供することを目的とする。
【００１０】
【課題を解決するための手段】
本発明の音声応答装置は、質問としての情報を入力する入力手段と、入力された情報に基づいて質問に対する応答となるデータの検索を行う検索手段と、検索したデータから応答文を生成する応答文生成手段と、応答文を合成音に変換して出力する合成音変換出力手段とを有し、応答文生成手段では、検索したデータに基づいた応答文を生成するとき、入力した質問から応答に対して期待される精度と検索したデータの精度とを比較した結果に応じて当該応答文の表現の予め定められた抽象度に応じた単語又は語尾の表現を変えることにより、上述した課題を解決する。
【００１２】
すなわち本発明においては、ユーザ等から入力された質問に対して応答を返す際に、例えば、データの種類、確信度、詳細度、量などに応じて、応答文の口調、声などを変えることで、その応答の信頼性についてユーザに知らせることを可能とし、また、応答の詳細度を適応的に変えることを可能として、さらにユーザから指示された要点に応じて応答文を簡潔に出力するようにしている。
【００１３】
【発明の実施の形態】
本発明の好ましい実施の形態について、図面を参照しながら説明する。
【００１４】
本発明の音声応答装置が適用される一実施の形態の音声応答システムの構成例を図１に示す。
【００１５】
先ず、第１の実施の形態の音声応答システム構成として、ユーザからの質問に対応する情報を検索し、その検索した情報から音声を生成して出力する情報検索システムを例に挙げる。
【００１６】
この図１において、入力ブロック１は、ユーザからの入力を受け、その入力を数値や記号に変換したり、テキストに変換するための入力変換手段である。ここで、ユーザからの入力とは、例えばキーワードや命令、質問等であり、これら入力としては、例えばボタン、キーボード、タッチパネル等をユーザが操作することによる入力や、マイクロホンへの音声による入力などが考えられる。
【００１７】
入力解析ブロック２は、入力ブロック１の処理結果を受けて、ユーザ入力を解釈し、どの種類の情報から何をキーワードにしてデータを検索するのかを決定し、その決定結果を検索ブロック５に知らせる。また、当該入力解析ブロック２は、検索ブロック５での検索の結果得られたデータを、どの程度の詳しさで音声出力するのかを決定し、その決定結果を応答テキスト生成ブロック４に知らせる。
【００１８】
履歴管理ブロック９は、入力解析ブロック２での決定を行うための情報が欠けている場合にそれを補う役割を持つ。ユーザの過去の入力、それに対して行った応答出力などの履歴を管理する。当該履歴管理ブロック９は、入力解析ブロック２から不足情報に関する問い合わせがあると、それに対する回答を送る。
【００１９】
検索ブロック５は、入力解析ブロック２からの情報をもとに、データベースブロック６に格納されている情報の中から所望のデータを検索する。なお、当該検索の結果得られるデータには、最終的にユーザが知りたい以外の情報や必要以上に詳しい情報が含まれるが、この検索結果のデータはそのまま応答テキスト生成ブロック４に送られる。
【００２０】
データベースブロック６は、検索の対象となる情報が記録されている場所である。このデータベースブロック６に格納される情報は、例えば地図や辞典などシステム側で用意しておく情報の他、予定表、電話帳など、ユーザが後から編集、追加していく情報も含まれる。また、交通情報、ニュース、天気予報など、ネットワーク経由で取得する情報も含まれる。
【００２１】
応答テキスト生成ブロック４は、検索ブロック５から受け取ったデータをもとに、応答用のテキストを生成する。すなわち、応答テキスト生成ブロック４では、例えば入力解析ブロック２からの情報をもとに、ユーザが求めていないデータ項目をカットしたり、表現の抽象度を決めたりすることが行われる。また、応答テキスト生成ブロック４は、テキスト生成用データ格納ブロック８から単語を引き、それらを繋ぎ合わせてテキストを生成したり、そのテキストを音声出力する時の声の質、大きさなどを決める。
【００２２】
テキスト生成用データ格納ブロック８には、データの各要素に対応する単語、及び、出力情報の特徴を表現するための語尾のセットが用意されている。データと単語の対応は１対多であり、例えば「１０月５日」「１０月上旬」「今年の秋」という表現がすべて１つのデータに対応する。そのどれを選ぶかは、データの正確性や、音声テキスト生成ブロック４で決められた表現の抽象度による。また、語尾については、出力情報の特徴をユーザに感覚的に伝えられるように使い分ける。例えば、応答の信頼性が高い時は「〜です」、低い時は「〜かな」という語尾を使用する。
【００２３】
音声合成ブロック３は、応答テキスト生成ブロック４からテキストとその出力方法を受け取る。当該音声合成ブロック３では、それらをもとに音声合成用データ格納ブロック７から必要なデータを取得し、それらを繋ぎ合わせて合成音データを生成する。そして、この音声合成ブロック３にて生成された応答文の合成音データは、最終的にスピーカに音声信号として送られ、これにより当該スピーカから応答文の合成音が出力される。
【００２４】
音声合成用データ格納ブロック７には、テキストを合成音に変換するための音素などのデータセットが用意されている。なお、この音声合成用データ格納ブロック７に用意された情報は、出力情報の特徴をユーザに感覚的に伝えられるように使い分けられる。すなわち本実施の形態のシステムでは、例えば、応答の信頼性が高い時は大人の声を、低い時は子供の声を使用するようなことが行われる。
【００２５】
ここで、上記応答の信頼性を決める要素には、例えば、ユーザ入力の解釈に対する自信と、応答に利用した情報の出所と、ユーザが応答に対して期待する正確性などが考えられる。
【００２６】
先ず、上記ユーザ入力の解釈に対する自信が、上記応答の信頼性を決める要素となる例について具体的に説明する。
【００２７】
例えば「東京から静岡までの時間は？」という質問がユーザから入力された場合、この質問には、東京から静岡まで移動するのに車を使用するのか或いは電車するのか、といった移動手段についての内容が抜けている。ここで、入力解析ブロック２は、履歴管理ブロック９に保存されているユーザ入力の履歴を管理しており、当該履歴管理ブロック９の履歴から例えばユーザは車で移動することが多いとわかったとする。その場合、本実施の形態のシステムでは、ユーザの質問に抜けている移動手段についての内容をユーザに聞き返すようなことは行わず、とりあえず車での移動時間を調べて応答を返すことができる。ただし、この場合、車での移動時間を応答で返すことが正しいかどうかは不明なので、応答の信頼性がその分落ちる。すなわち、上記ユーザ入力の解釈に対する自信とは、例えば上述のようにユーザ入力を解釈した場合にその解釈がどの程度確かであるかを表し、当該ユーザ入力の解釈に対する自信によって、上記応答の信頼性が決定される。
【００２８】
次に、上記応答に利用した情報の出所が、上記応答の信頼性を決める要素となる例について具体的に説明する。
【００２９】
例えば、東京から静岡まで車での移動時間を計算する場合は、例えば、道路地図を利用して車での移動時間を計算する方法や、交通情報を利用して車での移動時間を計算する方法などがある。上記道路地図を利用する場合は、例えば出発地と目的地の２地点間の道路距離を計算し、その道路距離を予想される時速で割って移動時間を求める。また、交通情報を利用する場合は、例えば出発地と目的地の２地点の間に設けられている、交通量や渋滞等の各測定ポイント間の所要時間を調べ、それらを合計することで移動時間を求める。このように、応答に利用した情報の出所として、道路地図を利用する場合と交通情報を利用する場合の２つの出所があるとすると、リアルタイムな交通量や渋滞情報等を加味できる上記交通情報を利用した場合の方が、応答の信頼性は高くなる。このように、応答に利用した情報の出所により、応答の信頼性を決める。
【００３０】
次に、上記ユーザが応答に対して期待する正確性が、上記応答の信頼性を決める要素となる例について具体的に説明する。
【００３１】
例えば、東京から静岡までの車での移動時間を道路地図から計算することにより、移動時間として例えば２．１時間という計算結果が出たとする。ここで、例えば１時間単位での正確性しかないとすると、ユーザからの「東京から静岡までの時間は？」という質問に対して「２時間くらい」という応答は信頼できると考えられる。しかし、ユーザからの「東京から静岡まで何分くらいかかる？」という質問に対して、「２時間６分」という応答は信頼性が低いと考えられる。すなわち、同じデータからの応答であっても、ユーザの期待との比較によって、その応答の信頼性が変わる。このように、ユーザが応答に対して期待する正確性によって、応答の信頼性を決める。
【００３２】
一方、上記データ表現の抽象度を決める要素には、例えば、データ自身の正確性と、ユーザが応答に対して期待する正確性などが考えられる。
【００３３】
先ず、上記データ自身の正確性が、上記データ表現の抽象度を決める要素となる例について具体的に説明する。
【００３４】
例えば、東京から静岡までの車での移動時間を道路地図から計算することにより、移動時間として例えば２．１時間という計算結果が出たとする。このような場合、ユーザに対する応答の表現の仕方としては、当該移動時間の計算結果である２．１時間を、「２時間６分」と表現したり、「２時間ちょっと」と表現したり、或いは「２時間くらい」と表現するように、想定される誤差に応じて表現を変えることができる。このようにデータ自身の正確性に応じて、応答のデータ表現の抽象度を決める。
【００３５】
次に、上記ユーザが応答に対して期待する正確性が、上記データ表現の抽象度を決める要素となる例について具体的に説明する。
【００３６】
例えば、ユーザからの「明日の予定は？」という質問に対しては、「午前中にテニスです。」と応答するようにする。一方、ユーザからの「明日の午前中の予定は？」という質問に対しては、「９時から１１時までテニスです。」と、時間まで具体的に答えるようにする。このように、ユーザが応答に対して期待する正確性によって、データ表現の抽象度を決める。なお、ユーザが応答に対して期待する正確性は、ユーザからの質問入力の形状からだけでなく、過去の履歴から推測することもできる。例えば、いつも正確な時間まで求めるユーザに対しては、「明日の予定は？」と質問されたとき、「９時から１１時までテニスです。」と答えるようにする。
【００３７】
以下、第１の実施の形態のシステムの動作について、図１及び図２を用いて具体的に説明する。
【００３８】
ここでは、ユーザがキーワードと共に質問を入力すると、本実施の形態のシステムから音声出力による応答を返す場合を考える。
【００３９】
先ず、初期設定として、ユーザが前記入力ブロック１を操作し、道路に関する情報を得るモードとする。
【００４０】
次に、ステップＳ１として、ユーザが「東京から静岡までの時間は？」という質問を入力する。ここでは、質問入力のためのインターフェイスとして音声入力を想定し、ユーザがシステムのマイクロホンに向かって喋るものとする。マイクロホンからの入力音声は、入力ブロック１内で音声認識処理にかけられ、テキストデータに変換される。
【００４１】
入力解析ブロック２は、入力ブロック１から送られたテキストデータを解析し、対応する検索処理を決定する。すなわち、入力解析ブロック２は、処理を決定するためのキーワードのリストを持っており、先ずはこのキーワードにマッチする言葉を捜す。
【００４２】
ここで、「東京から静岡までの時間は？」という質問からは、キーワードにマッチする言葉として「東京」「静岡」「時間」の３つを見つける。このとき、「東京」「静岡」といった地名と「時間」というキーワードから交通情報を調べることまでは分かるが、このままでは交通手段（移動手段）が分からない。そこで、入力解析ブロック２は、履歴管理ブロック９に問い合わせを行う。この問い合わせから、過去において交通手段として例えば車が選択される場合が殆どだったということを知ると、入力解析ブロック２は、これを踏まえて、ステップＳ１０として交通手段は車であると想定する。これにより、交通手段（移動手段）についてユーザが明示していないが、履歴から車の可能性が高く、ユーザ入力に対する理解には自信があるものとする。
【００４３】
次に、入力解析ブロック２は、検索ブロック５に対し、ステップＳ２として「東名高速道路の交通情報を取得する」ように指示を行い、また、応答テキスト生成ブロック４に対し、「検索した情報から東京−静岡間の所要時間を計算して出力する」ように指示する。また、入力解析ブロック２は、ユーザの質問の形式から、応答としては大体の時間を知らせればよいものと判断し、応答テキスト生成ブロック４に対して、時間の分単位については抽象的に表現するように指示する。
【００４４】
検索ブロック５は、上記入力解析ブロック２からの指示に応じ、データベースブロック６にアクセスして交通情報を取り寄せる。このとき、当該検索ブロック５は、システムの記憶部であるデータベースブロック６にローカルに保存されている情報を調べ、当該データベースブロック６の情報が古かったり無かったりした場合は、例えばネットワーク経由で最新の情報を取得する。なお、ネットワーク上のどこから交通情報が得られるか、また、その情報を得るためのシステムの動作は、予め決められて分かっているものとする。この検索ブロック５での検索により得られた交通情報は、応答テキスト生成ブロック４に送られる。
【００４５】
このときの応答テキスト生成ブロック４は、入力解析ブロック２から「東京−静岡間の所要時間を計算して出力する」ことを知らされており、先ず、ステップＳ３として、検索ブロック５の検索により得られた交通情報から例えば「東京−厚木」「厚木−御殿場」「御殿場−静岡」の各区間の所要時間を必要データとしてピックアップし、さらにステップＳ４としてその合計時間（この例では２１５分）を計算する。
【００４６】
次に、当該応答テキスト生成ブロック４では、応答テキストを生成する。先ず、時間の表現について決定する。上記の例の場合、入力解析ブロック５からの指示に応じて、分単位については抽象的な表現に決定する。この抽象的な表現の決定に際し、応答テキスト生成ブロック４は、テキスト生成用データ格納ブロック８を参照し、抽象的な表現の単語として、例えば「（何も言わない）」「ちょっと」「半」「弱」の４つの中から、例えば「半」という表現を選択し、これによりステップＳ５として上記２１５分という時間を「３時間半」という表現にする。また、当該計算した時間は、交通情報がもとになっていて正確性が高いため、語尾の表現を言い切り型の「です」にし、さらに、ユーザ入力に対して十分な信頼性を持つ応答であると判断し、大人の声の大きくはっきりした口調の音声出力を行うためのテキストと音声合成に関する指示を音声合成ブロック３に出力する。
【００４７】
音声合成ブロック３では、応答テキスト生成ブロック４から、「３時間半です」というテキストと音声合成に関する指示を受け取る。これにより、音声合成ブロック３は、ステップＳ１１として、「３時間半です」という大人の声の大きくはっきりした口調の音声出力のデータを生成し、スピーカから応答として出力する。
【００４８】
ところで、「東京から静岡までの時間」を調べるための情報源は、交通情報だけとは限らない。正確性では劣るものの、道路地図から距離を調べて、所要時間を概算することも可能である。
【００４９】
この場合、上記入力解析ブロック２は、ステップＳ６として、「東名高速道路の地図情報を所得する」のように、検索ブロック５に知らせる。
【００５０】
このときの検索ブロック５は、データベースブロック６にアクセスして地図情報を取り寄せる。当該検索ブロック５がデータベースブロック６を検索して得られた情報は応答テキスト生成ブロック４に送られる。
【００５１】
またこのときの応答テキスト生成ブロック４は、入力解析ブロック２から、「東京−静岡間の所要時間を出力する」ことを知らさてており、ステップＳ７として、検索ブロック５の検索により得られた地図情報から例えば「東京−厚木」「厚木−御殿場」「御殿場−静岡」の各区間の距離を必要データとしてピックアップし、さらにステップＳ８としてその合計距離（この例では２１０ｋｍ）を計算する。またさらに、応答テキスト生成ブロック４では、車の移動速度を平均で時速１００ｋｍとし、上記合計距離と当該平均時速とから所要時間の２．１時間を概算する。
【００５２】
次に、当該応答テキスト生成ブロック４では、応答テキストを生成する。この場合も先ず、時間の表現について決定する。上記の例の場合、入力解析ブロック５からの指示に応じて、分単位については抽象的な表現に決定する。この抽象的な表現の決定に際し、応答テキスト生成ブロック４は、テキスト生成用データ格納ブロック８を参照し、抽象的な表現の単語として、前述同様に「（何も言わない）」「ちょっと」「半」「弱」の４つの中から、例えば「（何も言わない）」の表現を選択し、これによりステップＳ９として上記２．１時間を「２時間」という表現にする。また、当該計算した時間は、大体の時間としては合っているものとし、語尾の表現を言い切り型の「です」にする。ただし、地図情報を利用した場合は、上述の交通情報を利用した場合に比べて応答の信頼性が低いので、大人の声ではあるが小さな声の音声出力を行うためのテキストと音声合成に関する指示を音声合成ブロック３に出力する。
【００５３】
音声合成ブロック３では、応答テキスト生成ブロック４から、「２時間です」というテキストと音声合成に関する指示を受け取る。これにより、音声合成ブロック３は、ステップＳ１２として、「２時間です」という大人の声ではあるが小さな声の音声出力のデータを生成し、スピーカから応答として出力する。
【００５４】
なお、上述のように道路地図から時間を計算する場合において、ユーザからの質問が「東京から静岡までの時間は？」でなく、「東京から静岡まで何分？」だったとすると、以下のような応答が考えられる。
【００５５】
この場合、入力解析ブロック２では、分単位までの表現の応答をユーザが望んでいると判断される。
【００５６】
これにより、応答テキスト生成ブロック４では、それに基づき、上記計算により得られた２．１時間を「２時間６分」という表現にする。ただし、この計算結果の分単位は信用できないものとすると、当該応答における「６分」の部分は不正確（嘘）である可能性が高い。このため、当該応答テキスト生成ブロック４では、応答の信頼性が十分でないので、語尾には曖昧性を表現した「かな」を使うようにする。そして、応答テキスト生成ブロック４では、例えば子供の声で音声出力を行うためのテキストと音声合成に関する指示を音声合成ブロック３に出力する。
【００５７】
音声合成ブロック３では、応答テキスト生成ブロック４から、「２時間６分かな」というテキストと音声合成に関する指示を受け取る。これにより、音声合成ブロック３は、「２時間６分かな」という子供の声の音声出力のデータを生成し、スピーカから応答として出力する。
【００５８】
更に、例えば正確性を保証できないデータ表現はしないという方針にすると、上述のように道路地図から時間を計算する場合において、ユーザからの質問が「東京から静岡まで何分？」だったとすると、以下のような応答が考えられる。
【００５９】
この例の場合、入力解析ブロック２では、分単位までの表現の応答をユーザが望んでいると判断される。
【００６０】
これにより、応答テキスト生成ブロック４では、それに基づき、上記計算により得られた２．１時間を「２時間６分」という表現にしようとする。しかしここで、上記計算結果の分単位は信用できないものとすると、当該応答の「６分」の部分は不正確（嘘）である可能性が高い。そこで、応答テキスト生成ブロック４では、表現の抽象度を上げて、具体的な分単位は言わず「２時間」という表現に決定する。さらに、応答テキスト生成ブロック４は、それ以上具体的には答えられないことを反映して、その語尾に「くらいです」をつける。ただし、この場合、応答としてユーザの要求を満足するものではないので、応答テキスト生成ブロック４では、例えば子供の声で音声出力を行うためのテキストと音声合成に関する指示を音声合成ブロック３に出力する。
【００６１】
音声合成ブロック３では、応答テキスト生成ブロック４から、「２時間くらいです」というテキストと音声合成に関する指示を受け取る。これにより、音声合成ブロック３は、「２時間くらいです」という子供の声の音声出力のデータを生成し、スピーカから応答として出力する。
【００６２】
次に、本発明の第２の実施の形態の音声応答システム構成として、ユーザのスケジュールを管理するスケジューラを例に挙げる。以下、当該第２の実施の形態のシステムの動作について、図１及び図３を用いて具体的に説明する。
【００６３】
先ず、初期設定として、ユーザが前記入力ブロック１を操作し、予定や電話番号などの個人情報を得るモードとする。
【００６４】
次に、ステップＳ３１として、ユーザが例えば「明日の予定は？」という質問を入力する。ここでは、質問入力のためのインターフェイスとして音声入力を想定し、ユーザがシステムのマイクロホンに向かって喋るものとする。マイクロホンからの入力音声は、入力ブロック１内で音声認識処理にかけられ、テキストデータに変換される。
【００６５】
入力解析ブロック２は、入力ブロック１から送られたテキストデータを解析し、対応する検索処理を決定する。すなわち、入力解析ブロック２は、処理を決定するためのキーワードのリストを持っており、先ずはこのキーワードにマッチする言葉を捜す。
【００６６】
ここで、「明日の予定は？」という質問からは、キーワードにマッチする言葉として「明日」「予定」の２つの言葉を見つける。そして、入力解析ブロック２は、「予定」というキーワードから予定表の検索を決定し、また、「明日」というキーワードから明日について調べることを決定する。
【００６７】
このため、入力解析ブロック２は、ステップＳ３２として、検索ブロック５に対して「明日の予定のデータを取得する」ように指示し、また、応答テキスト生成ブロック４には「明日の予定を全て知らせる」ように指示する。ただし、この場合の入力解析ブロック２は、ユーザの質問の形式から、ユーザが知りたいのは概略だと判断し、予定の詳しい時間、詳しい内容までは言わないようにする。
【００６８】
上記入力解析ブロック２から検索の指示を受けた検索ブロック５は、ステップＳ３３として、データベースブロック６にアクセスして明日の予定を取り出す。なお、このとき取り出された情報としては、「９：００−１１：００テニス at 品川」、「１４：００− 買い物 at 新宿 with 友達」であるとする。当該検索ブロック５での検索により得られた情報は、応答テキスト生成ブロック４に送られる。
【００６９】
次に、応答テキスト生成ブロック４は応答テキストを生成する。この時、先ず時間については表現について決定するが、上述のように入力解析ブロック２から「明日の予定の概略を出力する」ことが知らされているため、当該応答テキスト生成ブロック４では、時間については表現の抽象度を上げるように決定する。例えば、テキスト生成用データ格納ブロック８を参照すると例えば「朝」「昼」「夜」の３つの分類があり、そこで、応答テキスト生成ブロック４は、明日の予定のうち「９：００−１１：００」は「朝」、「１４：００−（１４時以降）」は「昼」に相当するものと決定する。また、予定の内容については、場所、相手を省略し、用件だけ言うようにする。すなわち、当該応答テキスト生成ブロック４は、ステップＳ３４として、「午前テニス」「午後買い物」のような応答テキストを生成する。
【００７０】
次に、応答テキスト生成ブロック４では、文の語尾及び音声合成に関する指示を決定する。この第２の実施の形態のように、スケジューラの場合はユーザ入力の解釈に対して他に間違えようが無く、応答のデータもユーザ自身が入れたものであるため正確で正しいとすると、応答の信頼性は十分である。そこで、応答テキスト生成ブロック４は、ステップＳ３５として、語尾の表現として言い切り型の「です」を使い、大人の声の大きくはっきりした口調で音声出力行うためのテキストと音声合成に関する指示を音声合成ブロック３に出力する。
【００７１】
音声合成ブロック３は、応答テキスト生成ブロック４から、「午前中テニスで、午後は買い物です。」というテキストと音声合成に関する指示を受け取る。これにより、音声合成ブロック３は、ステップＳ３９として、「午前中テニスで、午後は買い物です。」という大人の声で大きくはっきりした口調の音声出力のデータを生成し、スピーカから応答として出力する。
【００７２】
ただし、上記の例ように「明日の予定は？」とユーザが聞いた時に、日時、用件を詳しく返答することがあってもよい。
【００７３】
例えば、入力解析ブロック２において、履歴管理ブロック９の情報から「ユーザに概略を言うと、その後詳しい日時を聞き直されることが多い。」ことが分かったとする。この場合は、入力解析ブロック２では、現在の入力の表現よりも履歴を優先し、応答テキスト生成ブロック４には「明日の予定は１件１件を詳しくしらせる」ように指示する。
【００７４】
この場合、応答テキスト生成ブロック４では、「午前９時から１１までテニス、午後２時からは買い物です。」という、大人の声の大きくはっきりした口調で音声出力行うためのテキストと音声合成に関する指示を音声合成ブロック３に出力する。
【００７５】
音声合成ブロック３は、応答テキスト生成ブロック４から当該テキストと音声合成に関する指示を受け取ると、「午前９時から１１までテニス、午後２時からは買い物です。」という大人の声で大きくはっきりした口調の音声出力のデータを生成し、スピーカから応答として出力する。
【００７６】
一方、例えば、ユーザが入力ブロック１を操作して個人情報を得るモードを指定後、例えばステップＳ３６として「明日の午前中の予定は？」という質問を入力した場合を考える。なお、この場合も、前述と同様、質問入力のためのインターフェイスとして音声入力を想定し、ユーザがシステムのマイクロホンに向かって喋るものとする。マイクロホンからの入力音声は、入力ブロック１内で音声認識処理にかけられ、テキストデータに変換される。
【００７７】
この場合、入力解析ブロック２は、入力ブロック１から送られたテキストデータを解析し、対応する検索処理を決定する。入力解析ブロック２は、処理を決定するためのキーワードのリストを持っており、先ずはこのキーワードにマッチする言葉を捜す。
【００７８】
ここで、「明日の午前中の予定は？」という質問からは、キーワードにマッチする言葉として「明日」「午前中」「予定」の３つの言葉を見つける。そして、入力解析ブロック２は、「明日」「予定」というキーワードの組み合わせから「明日の予定を検索する」ことを決定する。また、このときの入力解析ブロック２は、ステップＳ３２として、検索ブロック５に対して「明日の予定のデータを取得する」ように指示し、応答テキスト生成ブロック４には「明日の午前中の予定のみ知らせる」ように指示する。また、この例の場合、入力解析ブロック２は、ユーザの質問の形式から、ユーザが詳しい時間まで知りたいと判断し、予定の詳しい時間、ついでに詳しい内容まで言うようにする。
【００７９】
上記入力解析ブロック２から検索の指示を受けた検索ブロック５は、ステップＳ３３として、データベースブロック６にアクセスして明日の予定を取り出す。当該検索ブロック５での検索により得られた情報は、応答テキスト生成ブロック４に送られる。
【００８０】
次に、応答テキスト生成ブロック４は応答テキストを生成する。この時の応答テキスト生成ブロック４は、入力解析ブロック２から「午前中の予定のみ出力する」ことが知らされているため、当該応答テキスト生成ブロック４では、ステップＳ３７として、「１４：００−」の予定は午前中の予定ではないので、出力から除外するが、予定の内容については時間、場所、用件全てを出力用に残す。また、当該応答テキスト生成ブロック４では、時間をできるだけ詳しく知らせるために、「９：００−１１：００」を「９時から１１まで」という表現にする。
【００８１】
次に、応答テキスト生成ブロック４では、文の語尾及び音声合成に関する指示を決定する。この例では、ユーザ入力の解釈に対して他に間違えようが無く、応答のデータもユーザ自身が入れたもので正しいとすると、応答の信頼性は十分である。そこで、応答テキスト生成ブロック４では、ステップＳ３８として、語尾の表現は言い切り型の「です」を使い、「９時から１１時まで品川でテニスです。」という、大人の声の大きくはっきりした口調で音声出力行うためのテキストと音声合成に関する指示を音声合成ブロック３に出力する。
【００８２】
音声合成ブロック３は、応答テキスト生成ブロック４からテキストと音声合成に関する指示を受け取ると、ステップＳ４０として、「９時から１１時まで品川でテニスです。」という大人の声で大きくはっきりした口調の音声出力のデータを生成し、スピーカから応答として出力する。
【００８３】
次に、図４には、本発明の音声応答システムを実現するためのハードウェア構成を示す。
【００８４】
この図４において、入力部１１は、図１の入力ブロック１に対応し、例えばボタン、キーボード、タッチパネル、マイクロホンなどのインターフェイスを備えている。ユーザは、当該入力部１１から質問やキーワードを入力する。
【００８５】
ＣＰＵ１２は、システム各部の制御、各種プログラム処理を実行する。当該ＣＰＵ１２は、図１の入力解析ブロック２、音声合成ブロック３、応答テキスト生成ブロック４等における各種処理を受け持つ。
【００８６】
ＲＯＭ１３は、固定データや固定のプログラムを記憶する。前述のテキスト生成、音声合成の各種データや処理のためのプログラムも、当該ＲＯＭ１３に記憶されている。
【００８７】
ＲＡＭ１４は、ＣＰＵ１２でのプログラム処理中に必要なデータを一時保存する。ユーザからの入力や、テキスト生成、音声合成の結果、通信部１６で得た情報などが当該ＲＡＭ１４に一時的に保存される。
【００８８】
補助記憶部１５は、フラッシュメモリやＥＥＰＲＯＭからなり、追加、書き換えなどを行うデータでプログラム処理中以外でも常に残しておきたいデータを保存する。図１の履歴管理ブロック９の個人情報や、通信で得たデータ、その他、ＣＰＵの処理で使うデータ等が、当該補助記憶部１５に記憶される。
【００８９】
通信部１６は、システムがネットワーク経由で情報を取得したり、発信したりする時に使用するものである。例えば、電話、インターネット、赤外線、ラジオなどの通信を行う。
【００９０】
スピーカ部１７は、音声合成された応答などの音声の出力を行う。
【００９１】
表示部１８は、液晶パネルなどのディスプレイ装置を備えてなり、いわゆるＧＵＩ（Graphical User Interface）の画面を出力したり、表、リストなど、各種情報を表示したりする。
【００９２】
なお、本発明実施の形態のシステムとしては、補助記憶部１５、通信部１６、表示部１８は、必ずしも必要ない。
【００９３】
上述したように、本発明実施の形態によれば、ユーザからの質問に対する応答文を音声出力する時の口調、音質などを、応答の信頼度に応じて変えるようにしているため、ユーザは自分の要求に合った応答が返ってきているかどうかある程度知りつつ、音声出力を聞くことができる。また、本実施の形態によれば、音声出力を聞くだけの場合でも、システムの状態が分かりやすくなり、次の対処（例えばちゃんと聞く、不足条件を入れる、別の質問をするなど）を早く行うことが可能である。更に、本実施の形態によれば、応答出力中のデータ表現の抽象度を変えることにより、その正確性の程度を表したり、出力を簡潔にしたりすることができる。
【００９４】
【発明の効果】
以上の説明で明らかなように、本発明の音声応答装置は、入力された情報に基づいてデータ検索を行い、その検索したデータから応答文を生成し、合成音として出力する場合において、応答の信頼性を反映した応答文を生成すること、或いは、検索したデータの正確性に応じて当該応答文の表現の抽象度を変えること、若しくは、入力した情報及び履歴から推測した正確性の要求に応じて応答文の表現の抽象度を変えることにより、ユーザ等から入力された質問に対して応答システムから応答を返す際に、その応答の信頼性についてユーザに知らせることが可能であり、また、応答の詳細度を適応的に変えることも可能である。
【００９５】
すなわち、本発明によれば、応答文を音声出力する時の口調、音質などを、応答の信頼度に応じて変えることで、ユーザは自分の要求に合った応答が返ってきているかどうかある程度知りつつ、音声出力を聞くことができ、また、音声出力を聞くだけの場合でも、システムの状態が分かりやすくなり、次の対処が早くなる。さらに、本発明によれば、応答出力中のデータ表現の抽象度を変えることにより、その正確性の程度を表したり、出力を簡潔にしたりすることができる。
【図面の簡単な説明】
【図１】本発明実施の形態の音声応答システムの概略構成を示す機能ブロック図である。
【図２】第１の実施の形態のシステムの一動作例の流れを示すフローチャートである。
【図３】第２の実施の形態のシステムの一動作例の流れを示すフローチャートである。
【図４】本発明の音声応答システムを実現するためのハードウェア構成を示すブロック回路図である。
【符号の説明】
１入力ブロック、２入力解析ブロック、３音声合成ブロック、４応答テキスト生成ブロック、５検索ブロック、６データベースブロック、７音声合成用データ格納ブロック、８テキスト生成用データ格納ブロック、９履歴管理ブロック、１１入力部、１２ＣＰＵ、１３ＲＯＭ、１４ＲＡＭ、１５補助記憶部、１６通信部、１７スピーカ部、１８表示部
[0001]
BACKGROUND OF THE INVENTION
The present invention relates to a system that responds to, for example, a question input from a user or the like, and more particularly, to a voice response device that responds by natural language speech.
[0002]
[Prior art]
Systems that respond to questions and the like input by a user have long existed. One known example of such a response system is a voice response device that responds to a question from a user in natural-language speech.
[0003]
There are also speech synthesis systems and speech synthesis methods that change the response speech depending on the content of the response and the source of the information.
[0004]
[Problems to be solved by the invention]
When a response system returns a response to a question input by a user, it is sometimes desirable to also inform the user of how reliable that response is. However, many conventional voice response devices use a constant tone and sound quality. As a result, the user cannot judge the reliability of a response without listening carefully to its content.
[0005]
Even a system that changes the response voice according to the content of the response or the source of the information does not sufficiently convey the reliability of the response. For example, a response given despite uncertainty about how the user's question was understood can be said to have low reliability. On the other hand, even if the fine details of the information used for the response are uncertain, a sufficient response can still be given if the user's request does not call for details.
[0006]
Furthermore, adaptively changing the level of detail of the response is also important in a voice response. For example, a voice output that explains more than is necessary cannot be called an appropriate response. One way to respond in such a case, as disclosed in Japanese Patent Laid-Open No. 8-137698, is to reflect the form and history of the user's questions, explain only the parts that seem necessary, and omit the rest. Another way is to change the level of detail through the degree of abstraction applied when data is converted into a response sentence. For example, when announcing the schedule of an event, the system may give only the approximate "month or season", or it may give the specific "date and time".
[0007]
However, many conventional voice response systems convert data into a response sentence by inserting data items into a prepared response sentence template. Because this converts data into response sentences in a one-to-one correspondence, such systems cannot perform the abstraction needed for a concise response.
[0008]
The present invention has been made in view of this situation. Its object is to provide a voice response device that, when a response system returns a response to a question input by a user or the like, can inform the user of the reliability of the response and can also adaptively change the level of detail of the response.
[0010]
[Means for Solving the Problems]
The voice response device of the present invention comprises input means for inputting information as a question; search means for searching, based on the input information, for data that serves as a response to the question; response sentence generation means for generating a response sentence from the retrieved data; and synthesized sound conversion output means for converting the response sentence into synthesized sound and outputting it. When generating a response sentence based on the retrieved data, the response sentence generation means compares the accuracy expected of the response, as inferred from the input question, with the accuracy of the retrieved data, and according to the result of that comparison changes the word or sentence-ending expression that corresponds to a predetermined abstraction level of the response sentence's expression. The problems described above are thereby solved.
[0012]
That is, in the present invention, when a response is returned to a question input by a user or the like, the tone, voice, and so on of the response sentence are changed according to, for example, the type, certainty, level of detail, and amount of the data. This makes it possible to inform the user of the reliability of the response and to adaptively change the level of detail of the response, and the response sentence is output concisely according to the main points indicated by the user.
[0013]
DETAILED DESCRIPTION OF THE INVENTION
A preferred embodiment of the present invention will be described with reference to the drawings.
[0014]
FIG. 1 shows a configuration example of a voice response system according to an embodiment to which the voice response device of the present invention is applied.
[0015]
First, as the voice response system configuration of the first embodiment, an information search system is taken as an example; it searches for information corresponding to a question from a user, generates speech from the retrieved information, and outputs that speech.
[0016]
In FIG. 1, an input block 1 is an input conversion means that receives input from a user and converts it into numerical values, symbols, or text. Here, the input from the user is, for example, a keyword, a command, or a question, and it may be entered, for example, by the user operating a button, keyboard, or touch panel, or by voice through a microphone.
[0017]
The input analysis block 2 receives the processing result of the input block 1, interprets the user input, decides which type of information to search and which keywords to use for the data search, and informs the search block 5 of the decision. The input analysis block 2 also decides at what level of detail the data obtained by the search in the search block 5 is to be output as speech, and informs the response text generation block 4 of that decision.
[0018]
The history management block 9 serves to fill in information that the input analysis block 2 lacks when making its decisions. It manages a history of the user's past inputs and of the responses output for them. When the input analysis block 2 inquires about missing information, the history management block 9 sends back an answer.
[0019]
The search block 5 searches the information stored in the database block 6 for the desired data, based on the information from the input analysis block 2. The data obtained by the search includes information other than what the user ultimately wants to know, as well as information that is more detailed than necessary, but the search result is sent as is to the response text generation block 4.
[0020]
The database block 6 is a place where information to be searched is recorded. The information stored in the database block 6 includes information that the user edits and adds later, such as a schedule and a telephone book, in addition to information prepared on the system side such as a map and a dictionary. In addition, information acquired via a network, such as traffic information, news, and weather forecasts, is also included.
[0021]
The response text generation block 4 generates response text based on the data received from the search block 5. Specifically, based on information from the input analysis block 2, the response text generation block 4 cuts data items that the user has not requested and decides the abstraction level of the expression. It also looks up words in the text generation data storage block 8 and joins them to generate the text, and it decides the voice quality, volume, and so on to be used when the text is output as speech.
[0022]
The text generation data storage block 8 holds the words corresponding to each element of the data, together with a set of sentence endings for expressing the characteristics of the output information. The correspondence between data and words is one-to-many. For example, the expressions "October 5", "early October", and "this autumn" all correspond to a single piece of data. Which one is selected depends on the accuracy of the data and on the abstraction level of the expression decided by the response text generation block 4. The sentence endings are chosen so that the characteristics of the output information are conveyed to the user intuitively. For example, the assertive ending "〜です" (-desu) is used when the reliability of the response is high, and the uncertain ending "〜かな" (-kana) is used when it is low.
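The one-to-many mapping from a single data item to several expressions can be sketched as follows. This is a minimal illustration, not the patent's implementation: the numeric abstraction levels, the day boundaries for "early", "mid", and "late", the season boundaries, and the names `express_date` and `choose_ending` are all assumptions made for the example.

```python
from datetime import date

MONTHS = ("January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December")

# Sentence endings keyed by whether the response is considered reliable
# (romanized forms of the endings named in the text).
ENDINGS = {True: "desu", False: "kana"}


def season_of(d: date) -> str:
    """Map a date to a season name (assumed Northern Hemisphere boundaries)."""
    if d.month in (3, 4, 5):
        return "spring"
    if d.month in (6, 7, 8):
        return "summer"
    if d.month in (9, 10, 11):
        return "autumn"
    return "winter"


def express_date(d: date, abstraction: int) -> str:
    """Render one date at a hypothetical abstraction level.

    0 = exact day, 1 = part of the month, 2 or more = season.
    """
    month = MONTHS[d.month - 1]
    if abstraction <= 0:
        return f"{month} {d.day}"
    if abstraction == 1:
        prefix = "early " if d.day <= 10 else "mid-" if d.day <= 20 else "late "
        return f"{prefix}{month}"
    return f"this {season_of(d)}"


def choose_ending(reliable: bool) -> str:
    """Pick the sentence ending that signals reliability to the listener."""
    return ENDINGS[reliable]
```

With these assumptions, `express_date(date(1998, 10, 5), 1)` yields "early October", and raising the level to 2 yields "this autumn".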
[0023]
The speech synthesis block 3 receives the text and its output method from the response text generation block 4. Based on these, it acquires the necessary data from the speech synthesis data storage block 7 and joins the data to generate synthesized sound data. The synthesized sound data of the response sentence is finally sent to the speaker as an audio signal, and the synthesized voice of the response sentence is output from the speaker.
[0024]
The speech synthesis data storage block 7 holds data sets, such as phonemes, for converting text into synthesized sound. The data in the speech synthesis data storage block 7 is chosen so that the characteristics of the output information are conveyed to the user intuitively. In the system of the present embodiment, for example, an adult voice is used when the reliability of the response is high, and a child's voice is used when it is low.
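The voice selection described here amounts to a small lookup keyed on reliability. The sketch below assumes reliability is a score between 0 and 1 and uses an arbitrary threshold; the patent does not define a numeric scale, and `VoiceStyle` and `choose_voice` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceStyle:
    speaker: str  # "adult" or "child"


def choose_voice(reliability: float, threshold: float = 0.5) -> VoiceStyle:
    """Select the synthesis voice from a reliability score in [0, 1].

    Both the numeric score and the threshold are assumptions for this example.
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return VoiceStyle("adult" if reliability >= threshold else "child")
```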
[0025]
Here, factors that determine the reliability of the response include, for example, confidence in the interpretation of the user input, the origin of information used for the response, and the accuracy that the user expects from the response.
[0026]
First, an example in which confidence in the interpretation of the user input is an element that determines the reliability of the response will be specifically described.
[0027]
For example, when the user inputs the question "What is the time from Tokyo to Shizuoka?", the question does not say which means of travel is meant, such as whether the user will go from Tokyo to Shizuoka by car or by train. Suppose that the input analysis block 2 consults the user input history stored in the history management block 9 and learns from it that, for example, the user usually travels by car. In that case, the system of the present embodiment does not ask the user about the means of travel missing from the question; it can instead look up the travel time by car and return a response for the time being. However, since it is not certain that answering with the travel time by car is correct, the reliability of the response drops accordingly. In other words, confidence in the interpretation of the user input expresses how certain an interpretation such as the one above is, and the reliability of the response is determined by that confidence.
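One way to picture this step is to take the most frequent past choice as the assumed means of travel and use its share of the history as the interpretation confidence. This is only a sketch under that assumption; the patent does not say how the confidence is computed, and `infer_transport` is a hypothetical name.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple


def infer_transport(history: Sequence[str]) -> Tuple[Optional[str], float]:
    """Guess the missing means of travel from past choices.

    Returns the most frequent past choice and its share of the history,
    which serves here as an assumed measure of interpretation confidence.
    """
    if not history:
        return None, 0.0
    mode, count = Counter(history).most_common(1)[0]
    return mode, count / len(history)
```

For a history of `["car", "car", "train", "car"]` the function returns `("car", 0.75)`: the system can answer for a car without asking back, while the confidence below 1 lowers the reliability of the response.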
[0028]
Next, an example in which the source of information used for the response is an element that determines the reliability of the response will be described in detail.
[0029]
For example, the travel time by car from Tokyo to Shizuoka can be calculated using a road map or using traffic information. When a road map is used, the road distance between the starting point and the destination is calculated, and the travel time is obtained by dividing that distance by the expected speed. When traffic information is used, the required times between the measurement points (for traffic volume, congestion, and so on) located between the starting point and the destination are looked up and summed to obtain the travel time. If the information used for the response can thus come from two sources, a road map or traffic information, the response is more reliable when traffic information is used, because it can take real-time traffic volume and congestion into account. In this way, the reliability of the response is determined by the source of the information used for it.
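The two calculations can be written out directly. In this sketch the per-segment distances and times are invented for illustration; only their totals (210 km at an average of 100 km/h, and 215 minutes) follow the worked example given later in the description. The function names are hypothetical.

```python
from typing import Sequence


def hours_from_map(segment_km: Sequence[float], speed_kmh: float) -> float:
    """Road-map estimate: total road distance divided by an expected speed."""
    if speed_kmh <= 0:
        raise ValueError("speed must be positive")
    return sum(segment_km) / speed_kmh


def minutes_from_traffic(segment_minutes: Sequence[int]) -> int:
    """Traffic-information estimate: sum of measured times between points."""
    return sum(segment_minutes)


# Illustrative per-segment values for Tokyo-Atsugi, Atsugi-Gotemba and
# Gotemba-Shizuoka; the individual numbers are assumptions.
map_hours = hours_from_map([35.0, 50.0, 125.0], speed_kmh=100.0)
traffic_minutes = minutes_from_traffic([40, 55, 120])
```

The two sources can disagree noticeably, which is why the text treats the traffic-based figure as the more reliable one.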
[0030]
Next, an example in which the accuracy expected by the user for the response is an element that determines the reliability of the response will be described in detail.
[0031]
For example, suppose that calculating the travel time by car from Tokyo to Shizuoka from a road map gives a result of 2.1 hours. If this result is accurate only to within about one hour, the response "about 2 hours" to the user's question "What is the time from Tokyo to Shizuoka?" can be considered reliable. However, the response "2 hours 6 minutes" to the user's question "How many minutes does it take from Tokyo to Shizuoka?" is considered to have low reliability. That is, even for responses derived from the same data, the reliability changes depending on how the data compares with the user's expectation. In this way, the reliability of the response is determined by the accuracy that the user expects of it.
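The comparison between the accuracy the user expects and the accuracy the data can support can be sketched as a granularity check. The word test on "minute" is a crude stand-in for the question analysis, and both function names are assumptions for this example.

```python
def expected_granularity_minutes(question: str) -> int:
    """Estimate the granularity the user expects, in minutes.

    Assumption: a question that mentions minutes expects 1-minute
    granularity; any other question is satisfied with whole hours.
    """
    return 1 if "minute" in question.lower() else 60


def is_reliable(data_granularity_minutes: int, question: str) -> bool:
    """A response is reliable if the data is at least as fine as expected."""
    return data_granularity_minutes <= expected_granularity_minutes(question)
```

With data that is good only to the hour (`data_granularity_minutes=60`), a general question about the travel time passes the check, while a question that asks for minutes does not, even though both are answered from the same data.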
[0032]
On the other hand, factors that determine the level of abstraction of the data representation include, for example, the accuracy of the data itself and the accuracy that the user expects from the response.
[0033]
First, an example in which the accuracy of the data itself is an element that determines the level of abstraction of the data representation will be specifically described.
[0034]
For example, suppose again that calculating the travel time by car from Tokyo to Shizuoka from a road map gives a result of 2.1 hours. In that case, the response to the user can express the calculated 2.1 hours as "2 hours 6 minutes", as "a little over 2 hours", or as "about 2 hours", with the expression changed according to the assumed error. In this way, the abstraction level of the data expression in the response is determined according to the accuracy of the data itself.
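A sketch of choosing among these three expressions from the assumed error follows. The error thresholds (under 5 minutes, under 30 minutes) are invented for the example; the patent states only that the expression changes with the assumed error. `express_duration` is a hypothetical name.

```python
def _unit(n: int, word: str) -> str:
    """Format a count with a singular or plural unit name."""
    return f"{n} {word}" + ("" if n == 1 else "s")


def express_duration(hours: float, error_minutes: int) -> str:
    """Render a travel time at an abstraction level chosen from its error.

    The thresholds are assumptions: small errors allow minute-level wording,
    moderate errors a hedged hour, and large errors only a rounded hour.
    """
    total = round(hours * 60)
    h, m = divmod(total, 60)
    if error_minutes < 5:
        return f"{_unit(h, 'hour')} {_unit(m, 'minute')}"
    if error_minutes < 30 and m > 0:
        return f"a little over {_unit(h, 'hour')}"
    return f"about {_unit(round(total / 60), 'hour')}"
```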
[0035]
Next, an example in which the accuracy expected by the user for the response is an element that determines the abstraction level of the data expression will be described in detail.
[0036]
For example, the question "What are your plans for tomorrow?" from the user is answered with "Tennis in the morning." On the other hand, the question "What are your plans for tomorrow morning?" is answered with the specific times: "Tennis from 9 o'clock to 11 o'clock." In this way, the abstraction level of the data expression is determined by the accuracy that the user expects of the response. The accuracy that the user expects can be estimated not only from the form of the question input but also from the past history. For example, a user who always asks for exact times is answered "Tennis from 9 o'clock to 11 o'clock." even when the question is "What are your plans for tomorrow?"
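The schedule example can be sketched as a two-step decision: first estimate whether the user wants exact times, from the question form or from history, then render the event accordingly. The check for the word "morning" and the boolean history flag are simplifications assumed for this example, and `Event`, `wants_detail`, and `describe` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    start_hour: int
    end_hour: int
    title: str


def wants_detail(question: str, usually_asks_for_times: bool) -> bool:
    """Estimate whether exact times are expected.

    Assumption: naming a part of the day signals interest in times, and a
    history of follow-up questions about times overrides the question form.
    """
    return "morning" in question.lower() or usually_asks_for_times


def describe(event: Event, detailed: bool) -> str:
    """Render an event at a detailed or an abstract level."""
    title = event.title.capitalize()
    if detailed:
        return (f"{title} from {event.start_hour} o'clock "
                f"to {event.end_hour} o'clock.")
    part = "morning" if event.start_hour < 12 else "afternoon"
    return f"{title} in the {part}."
```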
[0037]
Hereinafter, the operation of the system according to the first embodiment will be specifically described with reference to FIGS. 1 and 2.
[0038]
Here, it is assumed that when a user inputs a question together with a keyword, a response by voice output is returned from the system of the present embodiment.
[0039]
First, as an initial setting, the user operates the input block 1 to set a mode for obtaining road information.
[0040]
Next, as step S1, the user inputs the question “What is the time from Tokyo to Shizuoka?” Here, voice input is assumed as the interface for inputting the question, and the user speaks toward the microphone of the system. The input speech from the microphone is subjected to speech recognition processing in the input block 1 and converted into text data.
[0041]
The input analysis block 2 analyzes the text data sent from the input block 1 and determines a corresponding search process. That is, the input analysis block 2 has a list of keywords for determining processing, and first searches for a word that matches the keyword.
[0042]
Here, three words that match keywords are found in the question “What is the time from Tokyo to Shizuoka?”: “Tokyo”, “Shizuoka”, and “time”. At this point, the place names “Tokyo” and “Shizuoka” and the keyword “time” are understood, but the means of transportation is not. Therefore, the input analysis block 2 makes an inquiry to the history management block 9. If this inquiry shows that, for example, a car has been selected as the means of transportation in the past, the input analysis block 2 assumes in step S10 that the means of transportation is a car. As a result, although the user has not specified the means of transportation, the history indicates that a car is highly likely, and the interpretation of the user input is treated as reliable.
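The keyword matching and the history inquiry described above might look like the following sketch. The keyword table, the function `analyze`, and the representation of the history as a list of past transport choices are assumptions for illustration; the specification does not define these data structures.

```python
import re
from collections import Counter

# Hypothetical keyword list held by the input analysis block.
KEYWORDS = {
    "tokyo": "place",
    "shizuoka": "place",
    "time": "quantity",
    "car": "transport",
    "train": "transport",
}


def analyze(question: str, transport_history: list) -> dict:
    """Find keywords and, if no transport is named, assume the most frequent past choice."""
    words = re.findall(r"[a-z]+", question.lower())
    found = [w for w in words if w in KEYWORDS]
    transport = next((w for w in found if KEYWORDS[w] == "transport"), None)
    assumed = False
    if transport is None and transport_history:
        transport = Counter(transport_history).most_common(1)[0][0]
        assumed = True
    return {"keywords": found, "transport": transport, "assumed_from_history": assumed}


result = analyze("What is the time from Tokyo to Shizuoka?", ["car", "train", "car"])
print(result)
```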
[0043]
Next, as step S2, the input analysis block 2 instructs the search block 5 to “acquire traffic information on the Tomei Expressway”, and also instructs the response text generation block 4 to “calculate and output the required time between Tokyo and Shizuoka from the retrieved information”. Further, the input analysis block 2 determines from the form of the user's question that it is sufficient to answer in units of hours, and instructs the response text generation block 4 to express the minutes abstractly.
[0044]
In response to the instruction from the input analysis block 2, the search block 5 accesses the database block 6 and obtains traffic information. At this time, the search block 5 first examines the information stored locally in the database block 6, which is a storage unit of the system. If the information in the database block 6 is old or absent, the latest information is obtained, for example, via a network. It is assumed that where on the network the traffic information is obtained from, and how the system operates to obtain it, are determined in advance. The traffic information obtained by the search block 5 is sent to the response text generation block 4.
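The local-first lookup described above can be sketched as below. The dictionary standing in for the database block, the `max_age` limit, and the `fetch_latest` callback are all hypothetical; the specification only says that old or missing information is refreshed via a network.

```python
from typing import Callable


def get_traffic_info(local_store: dict, now: float, max_age: float,
                     fetch_latest: Callable[[], str]) -> str:
    """Use the locally stored entry unless it is missing or older than max_age."""
    entry = local_store.get("traffic")
    if entry is None or now - entry["fetched_at"] > max_age:
        # Stale or absent: obtain the latest information and store it locally.
        entry = {"data": fetch_latest(), "fetched_at": now}
        local_store["traffic"] = entry
    return entry["data"]


store = {}
print(get_traffic_info(store, 0.0, 600.0, lambda: "fresh"))     # fetched
print(get_traffic_info(store, 100.0, 600.0, lambda: "newer"))   # served locally
print(get_traffic_info(store, 1000.0, 600.0, lambda: "newer"))  # refreshed
```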
[0045]
At this time, the response text generation block 4 has been instructed by the input analysis block 2 to “calculate and output the required time between Tokyo and Shizuoka”. First, as step S3, the response text generation block 4 picks out the necessary data from the traffic information obtained by the search block 5, for example the required time for each of the sections “Tokyo-Atsugi”, “Atsugi-Gotemba”, and “Gotemba-Shizuoka”, and as step S4 calculates the total time (215 minutes in this example).
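Steps S3 and S4 amount to selecting the sections on the route and adding up their required times. In the sketch below the per-section minutes are invented values chosen only so that they add up to the 215 minutes of the example; the specification does not give the individual section times.

```python
# Hypothetical per-section times in minutes; only the 215-minute total is from the text.
SECTION_MINUTES = {
    "Tokyo-Atsugi": 60,
    "Atsugi-Gotemba": 70,
    "Gotemba-Shizuoka": 85,
    "Shizuoka-Hamamatsu": 50,  # not on the requested route, so it is ignored
}

ROUTE = ["Tokyo-Atsugi", "Atsugi-Gotemba", "Gotemba-Shizuoka"]


def total_minutes(section_minutes: dict, route: list) -> int:
    """Pick out the sections on the route (step S3) and sum them (step S4)."""
    return sum(section_minutes[section] for section in route)


print(total_minutes(SECTION_MINUTES, ROUTE))
```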
[0046]
Next, the response text generation block 4 generates a response text. First, the expression of the time is determined. In this example, in accordance with the instruction from the input analysis block 2, the minutes are to be expressed abstractly. To determine this abstract expression, the response text generation block 4 refers to the text generation data storage block 8 and selects, for example, “half” from four abstract expression words, “(say nothing)”, “a little over”, “half”, and “a little under”, and in step S5 changes the time of 215 minutes to the expression “three and a half hours”. In addition, the calculated time is based on traffic information and is therefore judged to be highly accurate, so the text and an instruction regarding speech synthesis for voice output in a loud, clear tone in an adult voice are output to the speech synthesis block 3.
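One way to picture step S5 is a function that maps the leftover minutes onto one of the four abstract expression words. The cut-off values of 10, 25, and 45 minutes are assumptions for this sketch, as is the function name `abstract_hours`; the specification names the four words but not the boundaries between them.

```python
def abstract_hours(total_minutes: int) -> str:
    """Replace the minute part of a duration with one of four abstract wordings."""
    hours, minutes = divmod(total_minutes, 60)
    if minutes < 10:                       # "(say nothing)"
        return f"{hours} hours"
    if minutes < 25:                       # "a little over"
        return f"a little over {hours} hours"
    if minutes < 45:                       # "half"
        return f"{hours} and a half hours"
    return f"a little under {hours + 1} hours"  # "a little under"


print(abstract_hours(215))
print(abstract_hours(126))
```

With these assumed boundaries, 215 minutes falls in the “half” range and 126 minutes (2.1 hours) falls in the “(say nothing)” range.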
[0047]
The speech synthesis block 3 receives the text “3 hours and a half” and the instruction regarding speech synthesis from the response text generation block 4. Then, as step S11, the speech synthesis block 3 generates voice output data saying “3 hours and a half” in a loud, clear tone in an adult voice, and outputs it from the speaker as the response.
[0048]
Incidentally, the information source for determining the “time from Tokyo to Shizuoka” is not limited to traffic information. Although less accurate, the required time can also be estimated by examining the distance on a road map.
[0049]
In this case, in step S6, the input analysis block 2 instructs the search block 5 to “acquire map information on the Tomei Expressway”.
[0050]
The search block 5 at this time accesses the database block 6 and obtains map information. Information obtained by the search block 5 searching the database block 6 is sent to the response text generation block 4.
[0051]
At this time, the response text generation block 4 has been instructed by the input analysis block 2 to “output the required time between Tokyo and Shizuoka”. As step S7, it picks out the necessary data from the map information obtained by the search block 5, for example the distance of each of the sections “Tokyo-Atsugi”, “Atsugi-Gotemba”, and “Gotemba-Shizuoka”, and as step S8 calculates the total distance (210 km in this example). Furthermore, the response text generation block 4 assumes an average vehicle speed of 100 km/h and estimates a required time of 2.1 hours from the total distance and the average speed.
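The estimate in steps S7 and S8 is the total distance divided by the assumed average speed, 210 km / 100 km/h = 2.1 hours. The sketch below reproduces that arithmetic; the per-section distances are invented so that they sum to 210 km, since the specification gives only the total.

```python
# Hypothetical per-section distances in km; only the 210 km total is from the text.
SECTION_KM = {"Tokyo-Atsugi": 35, "Atsugi-Gotemba": 50, "Gotemba-Shizuoka": 125}
AVERAGE_SPEED_KMH = 100  # average speed assumed in the text


def estimate_hours(section_km: dict, speed_kmh: float) -> float:
    """Sum the section distances (step S8) and divide by the average speed."""
    return sum(section_km.values()) / speed_kmh


hours = estimate_hours(SECTION_KM, AVERAGE_SPEED_KMH)
whole_hours, minutes = divmod(round(hours * 60), 60)
print(hours, whole_hours, minutes)
```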
[0052]
Next, the response text generation block 4 generates a response text. In this case as well, the expression of the time is determined first. In this example, in accordance with the instruction from the input analysis block 2, the minutes are to be expressed abstractly. To determine this abstract expression, the response text generation block 4 refers to the text generation data storage block 8 and selects, for example, “(say nothing)” from the four abstract expression words “(say nothing)”, “a little over”, “half”, and “a little under”, and in step S9 changes the 2.1 hours to “2 hours”. In addition, the calculated time is judged to be adequate as an approximate time, so the sentence ending is the assertive form “is”. However, when map information is used, the response is less reliable than when the traffic information described above is used, so the text and an instruction regarding speech synthesis for voice output in an adult voice but at a low volume are output to the speech synthesis block 3.
[0053]
The speech synthesis block 3 receives the text “2 hours” and the instruction regarding speech synthesis from the response text generation block 4. Then, as step S12, the speech synthesis block 3 generates voice output data saying “2 hours” in an adult voice but at a low volume, and outputs it from the speaker as the response.
[0054]
In addition, when the time is calculated from the road map as described above, the following response is possible if the user's question is not “What is the time from Tokyo to Shizuoka?” but “How many minutes from Tokyo to Shizuoka?”.
[0055]
In this case, the input analysis block 2 determines that the user desires a response of expression up to the minute unit.
[0056]
Accordingly, the response text generation block 4 expresses the 2.1 hours obtained by the above calculation as “2 hours 6 minutes”. However, if the minutes of the calculation result are unreliable, the “6 minutes” portion of the response is likely to be inaccurate (false). For this reason, since the reliability of the response is insufficient, the response text generation block 4 uses the sentence ending “kana”, which expresses uncertainty. The response text generation block 4 then outputs to the speech synthesis block 3 the text and an instruction regarding speech synthesis for voice output in, for example, a child's voice.
[0057]
The speech synthesis block 3 receives the text “2 hours 6 minutes” and the instruction regarding speech synthesis from the response text generation block 4. It then generates voice output data saying “2 hours 6 minutes” in a child's voice and outputs it from the speaker as the response.
[0058]
Furthermore, if the policy is not to express data whose accuracy cannot be guaranteed, then when the time is calculated from the road map as described above and the user's question is “How many minutes from Tokyo to Shizuoka?”, a response such as the following is possible.
[0059]
In the case of this example, the input analysis block 2 determines that the user desires a response of expression up to the minute unit.
[0060]
Accordingly, the response text generation block 4 first attempts to express the 2.1 hours obtained by the above calculation as “2 hours and 6 minutes”. However, if the minutes of the calculation result are unreliable, the “6 minutes” portion of the response is likely to be inaccurate (false). Therefore, the response text generation block 4 raises the abstraction level of the expression and settles on “2 hours”, without stating specific minutes. Further, the response text generation block 4 appends “about” to the expression to reflect that it cannot answer more specifically. In this case, however, the response does not satisfy the user's request, so the response text generation block 4 outputs to the speech synthesis block 3 the text and an instruction regarding speech synthesis for voice output in, for example, a child's voice.
[0061]
The speech synthesis block 3 receives the text “It is about 2 hours” and the instruction regarding speech synthesis from the response text generation block 4. It then generates voice output data saying “It is about 2 hours” in a child's voice and outputs it from the speaker as the response.
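The cases of the first embodiment can be summarized as a small lookup from the information source and the unit the user expects to a sentence ending and voice setting. The table below is only a sketch: the names `Style`, `STYLES`, and `choose_style` are hypothetical, the ending for the traffic-information case is assumed (the text states only the voice), and combinations not described in the text are deliberately left out.

```python
from typing import NamedTuple, Optional


class Style(NamedTuple):
    ending: str            # "assertive" or "uncertain" (e.g. the "kana" ending)
    voice: str             # "adult" or "child"
    volume: Optional[str]  # None where the text does not state a volume


STYLES = {
    # Traffic information, hour-level question: loud, clear adult voice.
    ("traffic", "hour"): Style("assertive", "adult", "loud"),  # ending is an assumption
    # Map estimate, hour-level question: adequate but less reliable.
    ("map", "hour"): Style("assertive", "adult", "low"),
    # Map estimate, minute-level question: the request cannot be met reliably.
    ("map", "minute"): Style("uncertain", "child", None),
}


def choose_style(source: str, expected_unit: str) -> Style:
    """Look up the style; raises KeyError for cases the text does not describe."""
    return STYLES[(source, expected_unit)]


print(choose_style("map", "minute"))
```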
[0062]
Next, as the configuration of a voice response system according to a second embodiment of the present invention, a scheduler that manages a user's schedule is taken as an example. Hereinafter, the operation of the system according to the second embodiment will be specifically described with reference to FIGS. 1 and 3.
[0063]
First, as an initial setting, the user operates the input block 1 to set a mode for obtaining personal information such as a schedule and telephone numbers.
[0064]
Next, as step S31, the user inputs a question such as “What is your plan for tomorrow?” Here, voice input is assumed as the interface for inputting the question, and the user speaks toward the microphone of the system. The input speech from the microphone is subjected to speech recognition processing in the input block 1 and converted into text data.
[0065]
The input analysis block 2 analyzes the text data sent from the input block 1 and determines a corresponding search process. That is, the input analysis block 2 has a list of keywords for determining processing, and first searches for a word that matches the keyword.
[0066]
Here, two words that match keywords, “tomorrow” and “schedule”, are found in the question “What is the schedule for tomorrow?”. The input analysis block 2 then decides to search the schedule based on the keyword “schedule”, and to search for tomorrow's entries based on the keyword “tomorrow”.
[0067]
Therefore, as step S32, the input analysis block 2 instructs the search block 5 to “acquire tomorrow's schedule data”, and instructs the response text generation block 4 to “report all of tomorrow's schedule”. In this case, however, the input analysis block 2 determines from the form of the user's question that the user wants to know only an outline, and instructs that the detailed times and details of the schedule not be stated.
[0068]
The search block 5 that has received the search instruction from the input analysis block 2 accesses the database block 6 and retrieves tomorrow's schedule as step S33. It is assumed that the information retrieved at this time is “9:00-11:00 tennis at Shinagawa” and “14:00- shopping at Shinjuku with friends”. The information obtained by the search block 5 is sent to the response text generation block 4.
[0069]
Next, the response text generation block 4 generates a response text. The expression of the time is determined first; since the input analysis block 2 has instructed, as described above, that “an outline of tomorrow's schedule is to be output”, the response text generation block 4 decides to raise the abstraction level of the expression. Referring to the text generation data storage block 8, there are, for example, three categories, “morning”, “daytime”, and “night”, and the response text generation block 4 determines that “9:00-11:00” corresponds to “morning” and that “14:00-” (from 14:00) corresponds to “daytime”. As for the content of each entry, the place and companion are omitted and only the activity is stated. That is, in step S34 the response text generation block 4 generates response texts such as “tennis in the morning” and “shopping in the afternoon”.
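The outline generation of step S34 can be sketched as a mapping from start times to the three categories, with the place and companion dropped. The category boundaries at 12:00 and 18:00 and the entry field names are assumptions for this example; the specification gives only the category names and the two mappings used above.

```python
# Schedule entries from the example; the field names are hypothetical.
ENTRIES = [
    {"start": 9, "end": 11, "activity": "tennis", "place": "Shinagawa"},
    {"start": 14, "end": None, "activity": "shopping", "place": "Shinjuku", "with": "friends"},
]


def part_of_day(start_hour: int) -> str:
    """Map a start hour to one of the three categories (boundaries are assumed)."""
    if start_hour < 12:
        return "morning"
    if start_hour < 18:
        return "daytime"
    return "night"


def outline(entries: list) -> list:
    """Keep only the category and the activity; omit place and companion."""
    return [f"{part_of_day(e['start'])}: {e['activity']}" for e in entries]


print(outline(ENTRIES))
```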
[0070]
Next, the response text generation block 4 determines the sentence ending and the instruction regarding speech synthesis. In the case of a scheduler, as in this second embodiment, there is no particular ambiguity in interpreting the user input, and the response data was entered by the user himself or herself, so the reliability of the response is sufficient. Therefore, in step S35, the response text generation block 4 uses the assertive form “da” as the sentence ending, and outputs to the speech synthesis block 3 the text and an instruction regarding speech synthesis for voice output in a loud, clear tone in an adult voice.
[0071]
The speech synthesis block 3 receives from the response text generation block 4 the text “Tennis in the morning, shopping in the afternoon” and the instruction regarding speech synthesis. Then, as step S39, the speech synthesis block 3 generates voice output data saying “Tennis in the morning and shopping in the afternoon” in a loud, clear tone in an adult voice, and outputs it from the speaker as the response.
[0072]
However, even when the user asks “What is your plan for tomorrow?”, the following response is also possible.
[0073]
For example, suppose the input analysis block 2 finds from the information in the history management block 9 that “when this user is given an outline, the user often asks again for the detailed date and time”. In this case, the input analysis block 2 gives the history priority over the wording of the current input, and instructs the response text generation block 4 to “state each of tomorrow's schedule entries in more detail”.
[0074]
In this case, the response text generation block 4 outputs to the speech synthesis block 3 the text “Tennis from 9 am to 11 am, shopping from 2 pm” and an instruction regarding speech synthesis for voice output in a loud, clear tone in an adult voice.
[0075]
On receiving the text and the instruction regarding speech synthesis from the response text generation block 4, the speech synthesis block 3 outputs from the speaker, as the response, “Tennis from 9 am to 11 am, shopping from 2 pm” in a loud, clear tone in an adult voice.
[0076]
On the other hand, consider a case where, after the user operates the input block 1 to designate the mode for obtaining personal information, a question such as “What are your plans for tomorrow in the morning?” is input as step S36. In this case as well, voice input is assumed as the interface for inputting the question, and the user speaks toward the microphone of the system. The input speech from the microphone is subjected to speech recognition processing in the input block 1 and converted into text data.
[0077]
In this case, the input analysis block 2 analyzes the text data sent from the input block 1 and determines a corresponding search process. The input analysis block 2 has a list of keywords for determining processing, and first searches for a word that matches the keyword.
[0078]
Here, three words that match keywords, “tomorrow”, “morning”, and “schedule”, are found in the question “What is your plan for tomorrow in the morning?”. The input analysis block 2 then decides to “search tomorrow's schedule” from the combination of the keywords “tomorrow” and “schedule”. At this time, the input analysis block 2 instructs the search block 5 in step S32 to “acquire tomorrow's schedule data”, and instructs the response text generation block 4 to “report only tomorrow's morning schedule”. In this example, the input analysis block 2 determines from the form of the user's question that the user wants to know the detailed time, and instructs that the detailed time of each entry and then its detailed content be stated.
[0079]
The search block 5 that has received the search instruction from the input analysis block 2 accesses the database block 6 and retrieves tomorrow's schedule as step S33. Information obtained by the search in the search block 5 is sent to the response text generation block 4.
[0080]
Next, the response text generation block 4 generates a response text. Since the input analysis block 2 has instructed that “only the morning schedule is to be output”, in step S37 the response text generation block 4 excludes the “14:00-” entry from the output because it is not in the morning, while keeping the time, place, and activity of the remaining entry for output. Further, in order to state the time as precisely as possible, the response text generation block 4 expresses “9:00-11:00” as “from 9 o'clock to 11 o'clock”.
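Step S37 can be pictured as a filter followed by detailed formatting. In the sketch below the noon boundary, the entry field names, and the function `morning_details` are assumptions made for illustration.

```python
# Schedule entries from the example; the field names are hypothetical.
SCHEDULE = [
    {"start": 9, "end": 11, "activity": "tennis", "place": "Shinagawa"},
    {"start": 14, "end": None, "activity": "shopping", "place": "Shinjuku"},
]


def morning_details(entries: list) -> list:
    """Drop entries that do not start in the morning; keep time, place, and activity."""
    lines = []
    for e in entries:
        if e["start"] >= 12:  # assumed boundary: not a morning entry
            continue
        lines.append(
            f"{e['activity']} at {e['place']} "
            f"from {e['start']} o'clock to {e['end']} o'clock"
        )
    return lines


print(morning_details(SCHEDULE))
```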
[0081]
Next, the response text generation block 4 determines the sentence ending and the instruction regarding speech synthesis. In this example as well, there is no particular ambiguity in interpreting the user input and the response data was entered by the user himself or herself, so the reliability of the response is sufficient. Therefore, as step S38, the response text generation block 4 uses the assertive form “da” as the sentence ending, and outputs to the speech synthesis block 3 the text “It is tennis in Shinagawa from 9:00 to 11:00” and an instruction regarding speech synthesis for voice output in a loud, clear tone in an adult voice.
[0082]
On receiving the text and the instruction regarding speech synthesis from the response text generation block 4, the speech synthesis block 3 generates, in step S40, voice output data saying “It is tennis in Shinagawa from 9:00 to 11:00” in a loud, clear tone in an adult voice, and outputs it from the speaker as the response.
[0083]
Next, FIG. 4 shows a hardware configuration for realizing the voice response system of the present invention.
[0084]
In FIG. 4, an input unit 11 corresponds to the input block 1 of FIG. 1, and includes interfaces such as buttons, a keyboard, a touch panel, and a microphone. A user inputs a question and a keyword from the input unit 11.
[0085]
The CPU 12 controls each part of the system and executes various program processes. The CPU 12 is responsible for the processing of the input analysis block 2, the speech synthesis block 3, the response text generation block 4, and the like in FIG. 1.
[0086]
The ROM 13 stores fixed data and fixed programs. Various data for the text generation and speech synthesis described above and programs for processing are also stored in the ROM 13.
[0087]
The RAM 14 temporarily stores data needed during program processing by the CPU 12. Information input by the user, data for text generation and speech synthesis, information obtained by the communication unit 16, and the like are temporarily stored in the RAM 14.
[0088]
The auxiliary storage unit 15 includes a flash memory or an EEPROM, and stores data that must be retained permanently and that may be added to or rewritten during program processing. The personal information of the history management block 9 in FIG. 1, data obtained by communication, other data used for processing by the CPU 12, and the like are stored in the auxiliary storage unit 15.
[0089]
The communication unit 16 is used when the system acquires or sends information via a network. For example, communication is performed via telephone, the Internet, infrared, or radio.
[0090]
The speaker unit 17 outputs sound such as the speech-synthesized response.
[0091]
The display unit 18 includes a display device such as a liquid crystal panel, and outputs a so-called GUI (Graphical User Interface) screen and displays various information such as a table and a list.
[0092]
In the system according to the embodiment of the present invention, the auxiliary storage unit 15, the communication unit 16, and the display unit 18 are not necessarily required.
[0093]
As described above, according to the embodiment of the present invention, the tone, sound quality, and the like used when outputting a response to a question from the user are changed according to the reliability of the response, so that the user can listen to the voice output while knowing to some extent whether a response matching the request is being returned. Also, according to the present embodiment, the state of the system is easy to understand even from the voice output alone, and the next action (for example, listening carefully, supplying a missing condition, or asking a different question) can be taken quickly. Furthermore, according to the present embodiment, changing the abstraction level of the data expression in the response output makes it possible to convey the degree of accuracy or to simplify the output.
[0094]
[Effect of the invention]
As is clear from the above description, the voice response device of the present invention performs a data search based on the input information, generates a response sentence from the retrieved data, and outputs the response as synthesized sound. In doing so, it generates a response sentence that reflects the reliability of the response, changes the abstraction level of the expression of the response sentence according to the accuracy of the retrieved data, or changes the abstraction level of the response sentence according to the required accuracy estimated from the input information and the history. As a result, when a response is returned from the response system to a question entered by the user, the user can be informed of the reliability of the response, and the level of detail of the response can be changed adaptively.
[0095]
In other words, according to the present invention, the tone, sound quality, and the like used when outputting the response sentence are changed according to the reliability of the response, so that the user can listen to the voice output while knowing to some extent whether a response matching the request is being returned. The state of the system is easy to understand even from the voice output alone, and the next action can be taken quickly. Furthermore, according to the present invention, changing the abstraction level of the data expression in the response output makes it possible to convey the degree of accuracy or to simplify the output.
[Brief description of the drawings]
FIG. 1 is a functional block diagram showing a schematic configuration of a voice response system according to an embodiment of the present invention.
FIG. 2 is a flowchart showing the flow of an operation example of the system according to the first embodiment.
FIG. 3 is a flowchart showing the flow of an operation example of the system according to the second embodiment.
FIG. 4 is a block circuit diagram showing a hardware configuration for realizing the voice response system of the present invention.
[Explanation of symbols]
1 input block, 2 input analysis block, 3 speech synthesis block, 4 response text generation block, 5 search block, 6 database block, 7 speech synthesis data storage block, 8 text generation data storage block, 9 history management block, 11 input unit, 12 CPU, 13 ROM, 14 RAM, 15 auxiliary storage unit, 16 communication unit, 17 speaker unit, 18 display unit

Claims

An input means for inputting information as a question;
Search means for searching for data serving as a response to the question based on the input information;
Response sentence generation means for generating a response sentence from the retrieved data;
A synthesized sound conversion output means for converting the response sentence into a synthesized sound and outputting the synthesized sound;
A voice response device having the above means, in which the response sentence generation means, when generating a response sentence based on the retrieved data, changes the expression of a word or sentence ending in accordance with a predetermined abstraction level of the expression of the response sentence, depending on the result of comparing the accuracy expected of the response, as determined from the input question, with the accuracy of the retrieved data.

2. The voice response device according to claim 1, wherein the synthesized sound conversion output means changes the type or volume of the voice according to the comparison result.

In the response sentence generating means, when generating a response sentence based on the searched data, the expression of the word or ending according to a predetermined abstraction level of the expression of the response sentence is changed according to the comparison result,
The voice response device according to claim 1, wherein the synthesized sound conversion output means changes a type or a volume of the voice according to the comparison result.