JP3945356B2 - Spoken dialogue apparatus and program

Info

Publication number: JP3945356B2
Application number: JP2002269941A
Authority: JP
Inventors: 竜一鈴木; 美樹男笹木
Original assignee: Denso Corp
Current assignee: Denso Corp
Priority date: 2002-09-17
Filing date: 2002-09-17
Publication date: 2007-07-18
Anticipated expiration: 2022-09-17
Also published as: JP2004109323A

Description

【０００１】
【発明の属する技術分野】
本発明は、ユーザとの間で音声対話を行うための音声対話装置に関する。
【０００２】
【従来の技術】
従来より、例えばカーナビゲーションシステムにおいてレストラン等の目的地の位置情報を音声により問い合わせたりする情報検索のための装置、音声対話を通じてユーザを楽しませる娯楽用の装置等の音声対話装置が知られている。特に近年では、こうした音声対話においてユーザとの間で自然な対話を実現するために、対話のためのシナリオを予め複数用意してユーザの発話に対応する音声対話装置が提案されている（例えば、特許文献１参照）。
【０００３】
【特許文献１】
特開２００１−３５７０５３号公報
【０００４】
【発明が解決しようとする課題】
しかしながら、上記特許文献１の音声対話装置では、ユーザの発話に対してシステム側の応答が予め対応づけられており、ユーザの質問等の発話に対して予め決められたことしか答えることができなかった。このため、答えを知らない場合には応答することができず、対話を中断するか話題を変えるなどの手段をとるしかなく、知的な対話を行うという観点からは十分ではなかった。
【０００５】
また、ユーザの発話に対してシステム側の応答が予め対応づけられているため、決められた言葉に対して決まりきった発話をすることしかできず、自然な対話を行うという観点からも十分と言えるものではなかった。
本発明は、こうした問題に鑑みなされたものであり、ユーザとの対話状況に応じて発話内容を臨機応変に変えることができると共に、ユーザの知的好奇心にも応えることができ、知的で自然な音声対話を実現する音声対話装置を提供することを目的とする。
【０００６】
【課題を解決するための手段】
上記課題に鑑み、請求項１記載の音声対話装置においては、ユーザから対話のための音声入力がなされると、認識手段がこの入力内容を音声認識する。記憶手段には、ユーザとの対話内容に応じた複数のシナリオと、各シナリオに沿った発話対象語が予め記憶されており、選択手段が、認識手段による認識に応じて記憶手段に記憶された発話対象語の中からユーザに向けた発話語を選択し、出力手段が、この選択手段によって選択された発話語を音声により出力することにより、ユーザとの間で対話を行う。
【０００７】
そして特に、学習手段が、記憶手段に記憶された発話対象語の中に、ユーザとの対話内容に応じた発話対象語がない場合に、選択手段にユーザにこの対話内容の答えを問い返すための発話語を選択させ、この問い返しに対してユーザから入力された対話内容の答えに基づき、この対話内容に応じたシナリオを学習し、このシナリオと各シナリオに沿った発話対象語を記憶手段に新たに記憶させ、更新手段が、学習手段により学習された対話情報に基づき、記憶手段において、この対話内容についての音声対話に必要なシナリオ、対話辞書、認識辞書を、自動的に更新するようにし、第２の記憶手段が、学習手段により学習された対話情報について、複数回交わされた同種の対話内容の履歴を記憶するようにし、上記更新手段が、同種の対話内容について互いに不整合なシナリオがある場合には、対話確率の高いシナリオに順次変更して更新するようにする。
【０００８】
すなわち、かかる音声認識装置においては、ユーザから知らないことを聞かれたら、その場では分からないと答えるが、次に同じようなことを聞かれたら、答えられるような学習機能を有する。つまり、ユーザから知らないことを聞かれた場合にユーザにその答えを問い返し、その質問内容と答えを記憶して、次からの対話に用いるようにする。
【０００９】
このため、知らない対話内容によって対話を中断させたり、ユーザの提示する話題を変更したりする必要性が小さくなると共に、学習によって新たなシナリオや語彙を増やして知識を向上させ、次回からのユーザとの対話に反映することができる。その結果、学習を重ねる毎に特定のユーザに対して満足のできる対話を実現することができるようになる。また、異なるユーザに対しては新たな話題や情報を提供することができ、知的な対話を実現することができる。
【００１０】
また、学習によって保有するシナリオや発話語彙のバリエーションを増加させることができ、同じ発話内容であってもその発話語を様々なタイプのユーザに応じて適宜変化させることができる。このため、様々なタイプのユーザとの間で自然な対話を実現することができる。
【００１２】
すなわち、かかる音声対話装置においては、ユーザの発話を認識するための語彙が参照される認識辞書、ユーザとの対話内容に沿った発話を実現するために予め用意された複数種類のシナリオ、各シナリオに沿った発話を実現するための語彙が参照される対話辞書が設けられている。そして、ユーザとの対話を通じて学習手段により新たに学習されたシナリオや語彙を自動的に更新し、次回から参照可能なシナリオや語彙を増加させることにより、上記知的で自然な対話を実現するのである。
【００１３】
しかし、ユーザにより教えられたことが間違っている場合もあるので、対話を行っていく中で、１つの質問に対する答えがいくつか返ってくる場合がある。その場合に上記更新手段が対話内容を逐次更新していくと、次回からの対話においてユーザに間違った情報を提供する虞がある。
【００１５】
しかしながら、かかる音声対話装置においては、例えば異なる複数のユーザとの間で交わされた対話内容を通じて学習手段が学習した問いかけと答えとを履歴情報として記憶する。そして、その問いかけと答えの対応が複数の対話間で異なる場合には、更新手段が、その問いかけに対して最も対話確率（頻度）の高い答えを採用するようにシナリオを順次変更していく。この結果、次回からの対話においては、同じ内容の問いかけに対して、この最も頻度の高いシナリオに沿った発話をすることになる。その結果、発話内容が自然に実際の答えに近づいていくようになり、知的な対話を実現することができるのである。
【００１６】
その際、対話内容についての対話確率が等しい場合も考えられるので、請求項２に記載のように、更新手段は、不整合なシナリオ間で、その対話内容についての対話確率が等しい場合には、先に出現したものを優先するようにしてもよい。
すなわち、対話確率が等しいからといって発話内容が直前に出現したものに度々されると、ユーザから優柔不断と思われ、不快感を感じられる場合がある。そこで、このように先に（より過去に）出現したものを優先することで、音声対話装置としての意志を強調して対話に勢いや信頼性を持たせることができる。
【００４５】
尚、このような音声対話装置の各手段をコンピュータにて実現する機能は、例えば、コンピュータ側で起動するプログラムとして備えることができる（請求項３）。このようなプログラムの場合、例えば、ＦＤ、ＭＯ、ＤＶＤ、ＣＤ−ＲＯＭ、ハードディスク等のコンピュータ読取可能な記録媒体に記録し、必要に応じてコンピュータにロードして起動することにより用いることができる。この他、ＲＯＭやバックアップＲＡＭをコンピュータ読取可能な記録媒体としてプログラムを記録しておき、このＲＯＭ或いはバックアップＲＡＭをコンピュータに組み込んでもよい。尚、ここでいう「各手段」とは、各請求項中の各構成要件としての個々の手段を意味するのではなく、請求項単位の手段の集まりを意味する。
【００４６】
【発明の実施の形態】
以下、本発明の実施の形態を具体化した実施例を図面と共に説明する。図１は本実施例の音声対話装置の全体構成を表すブロック図である。
１．音声対話装置の構成
同図に示すように、音声対話装置１は、音声対話ロボットとして構成され、音声認識部１０，シナリオインタープリタ２０，対話シナリオ部３０，顔画像認識判定部４０，ロボット発話語決定部５０，学習機能部６０，及び音声合成部７０等を備えている。また、ユーザの姿態を撮像可能なカメラがその頭部の周りに一定間隔で複数設けられており、ユーザがたとえロボットの後方から話し掛けてきても、これを認識することができるようになっている。そして、ユーザからはその複数のカメラのうちの２つがロボットの目として認識できるように構成されている。さらに、ユーザの音声を入力するための指向性マイクが、その頭部の周りに一定間隔で複数設けられており、四方からユーザの発話音声を入力できるようになっている。そして、ユーザからはその複数の指向性マイクのうちの２つがロボットの耳として認識できるように構成されている。
【００４７】
そして、ユーザの発話音声は、上記指向性マイクを介してまず音声認識部１０に入力される。音声認識部１０は、ユーザの発話により指向性マイクから入力される音声レベルの大きさやその音声レベルの変化等により、ユーザがロボットの方向を向いて話し掛けてきたかどうか、また、どの方向から話し掛けてきたか等を認識することができる。また、顔画像認識判定部４０は、これと同時に上記複数のカメラから入力されたユーザの顔画像からその顔の位置や向きを判定し、ユーザがロボットの方向を向いて話し掛けてきたかどうかの判定精度を向上させたり、ユーザの唇の動きを解析して所謂リップリーディングによる画像認識を行い、音声認識の精度を向上させることができる。
【００４８】
そして、音声認識部１０は、話し掛けてきたユーザの発話音声を認識すると、対話に必要な語彙が格納された認識辞書１１を参照してこの発話音声の内容を認識し、この認識結果をシナリオインタープリタ２０に出力する。
対話シナリオ部３０には、対話上の条件分岐等を表す複数種類のシナリオが格納されている。この対話シナリオ部３０は、シナリオインタープリタ２０を介して得た上記認識結果，時間計測器８０による経過時間情報等を参照して、対話の進行状況に適合したシナリオ（発話語）を生成し、その情報をシナリオインタープリタ２０に出力する。
【００４９】
シナリオインタープリタ２０は、対話シナリオ部３０にて決定されたシナリオに従って、対話用認識辞書２１及び発話リスト格納部２２を参照し、次の発話内容を設定するための演算処理を行う。ここで、対話用認識辞書２１には、対話用の単語等の装置で用いられる単語が格納され、発話リスト格納部２２には、対話のシナリオに応じて複数設定された文章化された発話語が選択可能に格納されている。
【００５０】
さらに、シナリオインタープリタ２０が、発話リスト格納部２２や対話用認識辞書２１を参照しても、ユーザとの対話内容に応じた発話対象語がない場合には、ユーザにこの対話内容の答えを問い返すことになる。その際、学習機能部６０が、この問い返しに対してユーザから入力された対話内容の答えに基づき、この対話内容に応じたシナリオとこのシナリオに沿った発話対象語（語彙）を学習し、当該シナリオ及び発話対象語をシナリオインタープリタ２０を介して、対話用認識辞書２１，発話リスト格納部２２，対話シナリオ部３０に格納して更新し、次回からの対話に反映させる。つまり、ここでは円滑でかつ適切な対話をするために、シナリオの作成、及びそれに伴う対話辞書、発話リストの作成が行われる。
【００５１】
そして、ユーザとの発話において、ロボット発話語決定部５０が、対話シナリオ部３０に新たに格納されたシナリオ等も含めてロボットの発話語を決定し、これに対応した発話語を発話リスト格納部２２からシナリオインタープリタ２０に出力させる。そして、シナリオインタープリタ２０にて最終的に生成された応答内容が、音声合成部７０にて音声合成され、ロボットの発話としてスピーカから出力される。
２．学習機能（知的な対話）
音声対話を進めていくうちに、ロボットが答えられない（シナリオに記述されていない）ことをユーザから聞かれることが出てくる。その際、はじめは分からないので、ユーザにその答えを問い返す。このことにより、ロボットはその質問内容と答えを学習し、自動的に音声対話装置１に必要なシナリオ、対話辞書、認識辞書を更新する。したがって、２回目以降は、今まで答えられなかったことに対しても、答えられるようになっていく。しかし、学習した答えが間違っている場合もあり、対話を行っていく中で、１つの質問に対して、いくつかの答えが発生する場合が出てくる。その場合は、その答えの中での出現確率が一番高いものをロボットが発話する答えとする。等確率のものが発生した場合は、先に出現したものを優先する。このような機能を備えることで、知らないことを学習できるようになり、また、その答えも高い確率で正確な答えに近づいていくようになる。
３．ロボット発話（自然な対話）
ロボットが応答する発話を、質問に対して、毎回同じことを発話するのではなく、ユーザの応答時間間隔や答え方、発話内容などによって、ロボットの発話も様々に変化させることができる。また、ユーザの感情や方言など、様々な要因によってもロボット発話を変化させることができる。
４．作動
次に、図２〜図４に示すフローチャートに基づいて、本実施例の音声対話装置の動作について説明する。
４．１全体の流れ
本実施例の音声対話装置１の全体の流れとしては、対話シナリオ部３０で設定したシナリオどおりに進んでいく。ユーザの応答待ち、すなわちシナリオの各分岐点において、図３に示す学習機能に関する動作フロー、図４に示すロボット発話に関する動作フローを適用し、それに対応するロボットの発話を出力していく。この操作を繰り返し、シナリオにより、対話終了となった時点で終了とする。
【００５２】
すなわち、図２に示すように、ユーザの応答を待ち（Ｓ１１０）、まず顔画像認識判定処理を実行する（Ｓ１２０）。
この顔画像認識判定処理においては、上述したロボットの複数の目（カメラ）及び耳（マイク）を介して、音声認識部１０が、ユーザがロボットの方向を向いて話し掛けてきたか否かを判定し、図１０に示すように、ユーザがロボットの方向を向いて話し掛けてきたと判定されると音声認識を開始する。そのとき、発話中にユーザが顔を背けても、認識をストップさせず、その発話が完了するまで認識は止めない。また、ユーザがロボットの方向を向いていなくて入ってきた音声は認識せず、その途中で、ユーザがロボットの方向に顔を向けた場合は、その時点から認識を開始させるようにする。この場合、上述のように、ロボットにはその頭部周囲の３６０度全てにわたって一定の間隔で、目（カメラ）及び耳（マイク）が設けられているため、ロボットの後ろの方からの音声でも、ユーザがロボットの方向を向いて話し掛けてきた音声は認識することができる。本実施例では、その際、ロボットがそのユーザの方向を向くようにされており、これにより、人間が行っている会話のように自然な動作／対話をするようになっている。こうすることで、ロボットは、本当に認識したい語彙のみ認識するようになる。
【００５３】
そして、ユーザの顔画像を認識したと判定されると（Ｓ１３０：ＹＥＳ）、続いて、図３に示す学習機能に関する動作フローを実行する（Ｓ１４０）。
４．２学習機能（知的な対話）
音声対話を進めていくうちに、ロボットが答えられない（シナリオに記述されていない）ことをユーザから聞かれることが出てくる。本学習機能は、このような事態に対応できる知的な対話を実現するものである。尚、図５及び図６には、ユーザに知らないことを聞かれたときの対応例が示されている。
【００５４】
すなわち、ユーザとの対話を通じて、シナリオインタープリタ２０が対話シナリオ部３０を参照し、ユーザからの発話内容（質問内容）についてロボットが知っている内容であるか否かを判定する（Ｓ２１０）。このとき、対応するシナリオがなく、知らない内容であると判定されると（Ｓ２１０：ＮＯ）、ロボット側からユーザに対して、「分からないので、教えて」などと言ってその答えを問い返し（Ｓ２２０）、これに対するユーザの回答に基づき、学習機能部６０がその質問内容とその答えを学習する（Ｓ２３０）。そして、このとき得られた新たなシナリオや発話語彙をシナリオインタープリタ２０を介して対話用認識辞書２１，発話リスト格納部２２，及び対話シナリオ部３０等に格納する。つまり、全く知らないことに対してはシナリオを増やしていき、言葉の意味が分からないだけの場合には、認識語彙を増やしていく。このことにより、２回目以降は、分からなかったことに対しても答えられるようになり、新たなシナリオや発話語彙が蓄積されていくことで知的なロボットとなっていく。具体例を示すと、図５に示す如くである。
【００５５】
一方、Ｓ２１０において、対応するシナリオがあり、知っている内容であると判定されると（Ｓ２１０：ＹＥＳ）、そのシナリオに記述されている発話を選択し、音声合成部７０にて音声合成して発話する（Ｓ２４０）。そして、この発話に対してユーザに誤りを指摘されなければ（Ｓ２５０：ＮＯ）、当該動作フローを終了する。
【００５６】
ただし、はじめに答えてもらったユーザの答えが必ずしも正解であるとは限らない。このため、２回目以降にユーザ側から「違うよ」と指摘されることも想定される。このため、Ｓ２５０において、ユーザに誤りを指摘された場合には（Ｓ２５０：ＹＥＳ）、学習機能部６０が、まずその質問内容とその答えを学習する（Ｓ２６０）。そして、その質問内容と同じ対話内容について、発話リスト格納部２２から過去の履歴を参照し、その質問内容に対して現在一番出現確率（累積値）の高い答えとの確率の比較を行う（Ｓ２７０）。
【００５７】
このとき、今回の答えの出現確率が大きいと判定されると、当該質問内容についての答えを正解とみなし、次回からの答えに変更し、シナリオを更新して当該動作フローを終了する（Ｓ２８０）。一方、今回の答えの出現確率が小さいと判定されると、当該質問内容についての答えの変更は行わない（Ｓ２９０）。さらに、両者の確率が等しい場合には、先に（より過去に）出現したものを、次回の発話に使用するように設定する（Ｓ３００）。これらの具体例を示すと、図６に示す如くである。
【００５８】
図２に戻り、続いて図４に示すロボット発話に関する動作フローを実行する（Ｓ１５０）。
４．３ロボット発話（自然な対話）
本動作フローでは、図４に示すように、ロボット側の発話に対するユーザの応答時間間隔（Ｓ３１０），ユーザの答え方（Ｓ３４０），及びユーザの発話内容の判断（Ｓ３５０）に基づき、その発話内容を強調した発話をしたり（Ｓ３２０）、曖昧性を持たせた発話をしたりする（Ｓ３３０）。
【００５９】
具体的には図７に示すように、例えばロボットがユーザに対して、「好きな食べ物は」と聞いた場合、ユーザがすぐに、例えば「リンゴだよ」と答えたとすると、ロボットは「そうですか。リンゴが大好きなんだね。」などと、好きなことを強調するような発話にする。逆に、ユーザが間を開けて（例えば１０秒程度）、「リンゴかな」と答えたとすると、ロボットは「本当にリンゴが好きなの。」といったような、ユーザが考えて出した答えに対して、曖昧性を持たせた返答とする。このように、返答に差をつけることで、意味、感情といった部分を考慮に入れた知的なロボットとなる。
【００６０】
また、ユーザの答え方、例えば、「リンゴだよ」と「リンゴかな」の違いのように、「だよ」であると確信を持った断定的な言い方であるし、「かな」であると少し曖昧性を持った言い方であるので、このような点を見極めて、ロボット発話の返答をかえる。
【００６１】
さらに、ユーザの発話内容から、「えーと、・・・」などと、頭に語彙が入ると、考えていて、あまり確信がなく曖昧な言い方と受け取れるので、このような点も考慮に入れて、ロボット発話の返答をかえる。
図２に戻り、以上のようにして決定されたロボットの発話内容に従って発話を行い（Ｓ１６０）、続いて終了条件判定処理を実行する（Ｓ１７０）。そして、シナリオに基づいて予め設定した終了条件を具備したと判定されると（Ｓ１８０：ＹＥＳ）、一連の処理を終了する。
５．その他の知的で自然なロボット音声対話に関する要因
図８に知的で自然なロボット音声対話装置に関する要因を示す。
５．１感情
図８に示すように、ユーザの発話に含まれる感情を認識し、ロボットの発話口調を変化させる。例えば、ユーザがロボットに対して、怒っているような口調で話し掛けた場合には、ロボットの発話は、なだめるようなやさしい言葉で発話させるようにする。
【００６２】
また図９に示すように、そのユーザの感情の対象の違い、例えば、ロボットの発話に対しての感情か、一般的なことに対しての感情かによっても、ロボット発話を変化させることができるようにする。
５．２方言
図８に示すように、ユーザの発話の方言に対して、ロボットの発話も同様の方言を用いて発話させることで、親しみのわく知的なロボットとする。例えば、ユーザが関西弁で話し掛けた場合には、ロボットの発話も関西弁にするといった具合である。
【００６３】
また、方言の認識より、話題をその土地柄にちなんだものに進めていくようにする。こうすることで、話題の転換ができ、ユーザにとっても答えやすい話題へと進んでいくという工夫を入れている。
５．３言語
図８に示すように、ユーザの発話の言語に対して、ロボットの発話も同様の言語で発話させる。例えば、英語で話し掛けられたら、ロボットも英語で発話するといった具合である。
【００６４】
また、言語の認識より、話題をその国にちなんだものに進めていくようにする。こうすることで、話題の転換ができ、ユーザにとっても答えやすい話題へと進んでいくという工夫を入れている。
５．４発話音声
図８に示すように、ユーザの年齢や性別に応じて、ロボットの発話音声を変化させる。例えば、小さい子供に対しては、幼稚園の先生のようなお姉さんの声で対応し、男の人には女の人の声で、女の人には男の人の声で応答するといった具合である。
【００６５】
また、発話音声だけでなく、年齢や性別にちなんだ話題へと進めていくことができるという工夫を入れている。
５．５画像（目をもたせる）
図８に示すように、１つには、目をもたせることで、ユーザの唇の動きを見ることができるので、リップリーディングの技術を応用し、認識率の向上に努めることができ、どんな言葉でも認識できるという知的なロボットに役立つ。
【００６６】
もう１つには、ユーザの顔の位置を認識し、ロボットはユーザの顔が正面を向いていると判定したときの音声のみを認識することで、ノイズ対策が行え、認識率向上につながる。
具体的には、上記において図１０に基づいて説明したとおりである。

以上に説明したように、本実施例の音声対話装置１においては、ユーザから知らないことを聞かれたら、その場では分からないと答えるが、次に同じようなことを聞かれたら、答えられるような学習機能を有する。つまり、ユーザから知らないことを聞かれた場合にユーザにその答えを問い返し、その質問内容と答えを記憶して、次からの対話に用いるようにする。
【００６７】
このため、知らない対話内容によって対話を中断させたり、ユーザの提示する話題を変更したりする必要性が小さくなると共に、学習によって新たなシナリオや語彙を増やして知識を向上させ、次回からのユーザとの対話に反映することができる。その結果、学習を重ねる毎に特定のユーザに対して満足のできる対話を実現することができるようになる。また、異なるユーザに対しては新たな話題や情報を提供することができ、知的な対話を実現することができる。
【００６８】
また、ユーザの受け答えのタイミングや発話内容等に応じて、音声対話装置１側もその応答内容やタイミングを様々に変化させる。このような装置側の発話の変化により、自然な音声対話を実現することができる。
さらに、音声対話装置１としてのロボットに複数の目（カメラ）や耳（マイク）を設けたり、リップリーディングの技術を応用することにより、ユーザの発話語以外のノイズを除去して音声認識することができ、認識手段による発話語の認識率が向上する。それにより、ユーザの発話に対する装置側の錯誤が防止又は抑制することができ、ユーザとの間で自然な対話を実現することができる。その結果、ユーザとの間で知的な対話を進めることができる。
【００６９】
尚、本実施例において、音声認識部１０が認識手段に該当し、顔画像認識判定部４０が画像認識手段に該当し、シナリオインタープリタ２０，対話シナリオ部３０及びロボット発話語決定部５０が、選択手段，判定手段，更新手段，識別手段に該当する。また、対話用認識辞書２１，発話リスト格納部２２が、記憶手段，第２の記憶手段に該当し、学習機能部６０が学習手段に該当し、音声合成部７０が出力手段に該当する。
【００７０】
以上、本発明の実施例について説明したが、本発明の実施の形態は、上記実施例に何ら限定されることなく、本発明の技術的範囲に属する限り種々の形態をとり得ることはいうまでもない。
例えば、上記実施例では、本発明の音声対話装置をロボットとして構成した例を示したが、これに限らず、ナビゲーションシステム等の装置として構成してもよいことはもちろんである。
【図面の簡単な説明】
【図１】本発明の実施例に係る音声対話装置の概略構成を表すブロック図である。
【図２】実施例の音声対話装置の動作を表すフローチャートである。
【図３】音声対話装置の学習機能の動作を表すフローチャートである。
【図４】音声対話装置に係るロボット発話動作を表すフローチャートである。
【図５】ロボットが知らないことを聞かれたとき（学習機能）の対応例１を表す説明図である。
【図６】ロボットが知らないことを聞かれたとき（学習機能）の対応例２を表す説明図である。
【図７】ユーザの受け答えのタイミングの違いによるロボットの対応例を表す説明図である。
【図８】知的なロボットに関する要因を表す説明図である。
【図９】ユーザの感情の対象の違いによるロボット発話の対応例を表す説明図である。
【図１０】ロボットの画像認識による音声対話の例を表す説明図である。
【図１１】ロボット音声対話の対話例を表す説明図である。
【符号の説明】
１・・・音声対話装置、１０・・・音声認識部、
２０・・・シナリオインタープリタ、２１・・・対話用認識辞書、
２２・・・発話リスト格納部、３０・・・対話シナリオ部、
４０・・・顔画像認識判定部、５０・・・ロボット発話語決定部、
６０・・・学習機能部、７０・・・音声合成部
[0001]
BACKGROUND OF THE INVENTION
The present invention relates to a voice dialogue apparatus for carrying out voice dialogue with a user.
[0002]
[Prior art]
Conventionally, voice dialogue devices have been known, such as information retrieval devices that, for example in a car navigation system, let a user ask by voice for the location of a destination such as a restaurant, and entertainment devices that amuse the user through voice dialogue. In recent years in particular, to realize natural dialogue with the user, voice dialogue devices have been proposed that prepare a plurality of dialogue scenarios in advance and use them to respond to the user's utterances (see, for example, Patent Document 1).
[0003]
[Patent Document 1]
JP 2001-357053 A
[0004]
[Problems to be solved by the invention]
However, in the voice dialogue device of Patent Document 1, the system-side response is associated in advance with each user utterance, so the device could give only predetermined answers to the user's questions and other utterances. Consequently, when it did not know the answer it could not respond and had no option but to interrupt the dialogue or change the topic, which was insufficient from the viewpoint of conducting an intelligent dialogue.
[0005]
In addition, because the system-side response is associated in advance with each user utterance, the device could only give a fixed utterance in reply to a fixed phrase, which was also insufficient from the viewpoint of conducting a natural dialogue.
The present invention has been made in view of these problems. Its object is to provide a voice dialogue device that can flexibly change its utterance content according to the state of the dialogue with the user, can also satisfy the user's intellectual curiosity, and thereby realizes an intelligent and natural voice dialogue.
[0006]
[Means for Solving the Problems]
In view of the above problems, in the voice dialogue device according to claim 1, when the user provides voice input for dialogue, the recognition means performs speech recognition on the input. The storage means stores in advance a plurality of scenarios corresponding to the content of dialogue with the user, together with candidate utterance words for each scenario. The selection means selects, according to the recognition result, an utterance word directed to the user from the candidate utterance words stored in the storage means, and the output means outputs the selected utterance word by voice, thereby conducting a dialogue with the user.
[0007]
In particular, when the candidate utterance words stored in the storage means include none that corresponds to the content of the dialogue with the user, the learning means causes the selection means to select an utterance word that asks the user for the answer. Based on the answer the user gives in response, the learning means learns a scenario for that dialogue content and newly stores the scenario and its candidate utterance words in the storage means. Based on the dialogue information learned by the learning means, the update means automatically updates, in the storage means, the scenario, dialogue dictionary, and recognition dictionary needed for voice dialogue about that content. The second storage means stores, for the learned dialogue information, a history of dialogues of the same kind exchanged multiple times. When scenarios for the same kind of dialogue content are mutually inconsistent, the update means updates by successively changing to the scenario with the higher dialogue probability.
[0008]
That is, such a device has a learning function: when the user asks something it does not know, it answers on the spot that it does not know, but when it is asked a similar thing the next time, it can answer. In other words, when asked something unknown, the device asks the user for the answer, stores the question and the answer, and uses them in subsequent dialogues.
[0009]
This reduces the need to interrupt the dialogue, or to change the topic raised by the user, because of unknown content. Learning also adds new scenarios and vocabulary, improving the device's knowledge so that it can be reflected in subsequent dialogues with the user. As a result, the more the device learns, the more satisfactory its dialogue with a particular user becomes. It can also offer new topics and information to different users, realizing an intelligent dialogue.
[0010]
In addition, learning increases the variety of scenarios and utterance vocabulary that the device holds, so that even for the same utterance content the wording can be varied to suit different types of users. For this reason, a natural dialogue can be realized with various types of users.
[0012]
That is, such a voice dialogue device is provided with a recognition dictionary that is consulted for the vocabulary used to recognize the user's utterances, a plurality of types of scenarios prepared in advance to produce utterances that follow the content of the dialogue with the user, and a dialogue dictionary that is consulted for the vocabulary used to produce utterances along each scenario. Scenarios and vocabulary newly learned by the learning means through dialogue with the user are updated automatically, increasing the scenarios and vocabulary that can be consulted from the next time onward, and this realizes the intelligent and natural dialogue described above.
[0013]
However, there may be a case where what is taught by the user is wrong, and there are cases where some answers to one question are returned during the dialogue. In this case, if the updating means sequentially updates the content of the dialogue, there is a risk that wrong information will be provided to the user in the next dialogue.
[0015]
In such a voice dialogue device, however, the questions and answers learned by the learning means, for example through dialogues exchanged with a plurality of different users, are stored as history information. When the correspondence between a question and its answer differs among a plurality of dialogues, the update means successively changes the scenario so as to adopt the answer with the highest dialogue probability (frequency) for that question. As a result, in subsequent dialogues the device responds to the same question with an utterance that follows this most frequent scenario. The utterance content thus naturally approaches the actual answer, and an intelligent dialogue can be realized.
[0016]
The dialogue probabilities for a given dialogue content may also be equal. In that case, as described in claim 2, the update means may give priority to the scenario that appeared first among the inconsistent scenarios whose dialogue probabilities are equal.
That is, if the utterance content were repeatedly switched to whatever answer appeared most recently merely because the dialogue probabilities are equal, the user might regard the device as indecisive and find it unpleasant. Giving priority to the answer that appeared earlier (further in the past) emphasizes the device's own will and gives the dialogue momentum and reliability.
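The selection rule described above (most frequent answer, with ties resolved in favor of the answer that appeared first) can be illustrated with a minimal sketch. The function name `select_answer` and the representation of the history as a plain list of answer strings are assumptions made for illustration; the patent does not specify a data structure.

```python
from collections import Counter


def select_answer(history):
    """Pick the answer to utter for one question.

    `history` is a hypothetical list of answers in order of appearance.
    The most frequent answer wins; when frequencies are equal, the answer
    that appeared first is preferred.
    """
    if not history:
        return None
    counts = Counter(history)
    best = max(counts.values())
    # Walk the history in order of appearance so that a tie resolves
    # to the earliest answer.
    for answer in history:
        if counts[answer] == best:
            return answer


# Two answers with equal frequency: the earlier one is kept.
print(select_answer(["apple", "orange", "orange", "apple"]))
```

Because the tie-break only looks at first appearance, a newly taught answer has to become strictly more frequent before it displaces the current one.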
[0045]
The function of realizing each means of such a voice dialogue device on a computer can be provided, for example, as a program that runs on the computer (claim 3). Such a program can be used by recording it on a computer-readable recording medium such as an FD, MO, DVD, CD-ROM, or hard disk, and loading it into the computer and starting it as needed. Alternatively, the program may be recorded in a ROM or backup RAM serving as the computer-readable recording medium, and the ROM or backup RAM may be incorporated in the computer. Here, "each means" does not refer to the individual means that are the constituent elements of each claim, but to the group of means taken claim by claim.
[0046]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, an embodiment of the present invention will be described with reference to the drawings. FIG. 1 is a block diagram showing the overall configuration of the voice interactive apparatus of this embodiment.
1. Configuration of spoken dialogue device
As shown in the figure, the voice dialogue device 1 is configured as a voice dialogue robot and includes a voice recognition unit 10, a scenario interpreter 20, a dialogue scenario unit 30, a face image recognition determination unit 40, a robot utterance word determination unit 50, a learning function unit 60, a speech synthesis unit 70, and the like. A plurality of cameras capable of imaging the user are provided at regular intervals around the robot's head, so that the robot can recognize the user even when the user speaks to it from behind. The robot is configured so that two of these cameras appear to the user as the robot's eyes. Likewise, a plurality of directional microphones for inputting the user's voice are provided at regular intervals around the head, so that the user's speech can be picked up from all directions. The robot is configured so that two of these directional microphones appear to the user as the robot's ears.
[0047]
The user's speech is first input to the voice recognition unit 10 via the directional microphones. From the magnitude of the voice level input from the directional microphones and the changes in that level, the voice recognition unit 10 can recognize whether the user spoke while facing the robot and from which direction the user spoke. At the same time, the face image recognition determination unit 40 determines the position and orientation of the user's face from the face images input from the plurality of cameras. This improves the accuracy of determining whether the user spoke while facing the robot. The unit can also analyze the movement of the user's lips and perform image recognition by so-called lip reading, which improves the accuracy of speech recognition.
[0048]
When the voice recognition unit 10 detects speech from a user who has spoken to the robot, it recognizes the content of the speech by referring to the recognition dictionary 11, which stores the vocabulary needed for dialogue, and outputs the recognition result to the scenario interpreter 20.
The dialogue scenario unit 30 stores a plurality of types of scenarios that represent conditional branches and the like in the dialogue. Referring to the recognition result obtained through the scenario interpreter 20, the elapsed-time information from the time measuring device 80, and the like, the dialogue scenario unit 30 generates a scenario (utterance word) suited to the progress of the dialogue and outputs that information to the scenario interpreter 20.
[0049]
In accordance with the scenario determined by the dialogue scenario unit 30, the scenario interpreter 20 refers to the dialogue recognition dictionary 21 and the utterance list storage unit 22 and performs the computation needed to set the next utterance content. The dialogue recognition dictionary 21 stores the words used by the device, such as words for dialogue. The utterance list storage unit 22 stores, in selectable form, a plurality of utterances written out as sentences and set according to the dialogue scenarios.
[0050]
Further, when the scenario interpreter 20 finds no candidate utterance word corresponding to the content of the dialogue with the user even after referring to the utterance list storage unit 22 and the dialogue recognition dictionary 21, the device asks the user for the answer. The learning function unit 60 then learns, based on the answer the user gives in response, a scenario for that dialogue content and the candidate utterance words (vocabulary) that go with the scenario. It stores the scenario and the candidate utterance words, via the scenario interpreter 20, in the dialogue recognition dictionary 21, the utterance list storage unit 22, and the dialogue scenario unit 30, updating them so that they are reflected in subsequent dialogues. In other words, to keep the dialogue smooth and appropriate, a scenario is created here together with the corresponding dialogue dictionary and utterance list entries.
[0051]
In the dialogue with the user, the robot utterance word determination unit 50 determines the robot's utterance word, taking into account the scenarios newly stored in the dialogue scenario unit 30, and causes the corresponding utterance to be output from the utterance list storage unit 22 to the scenario interpreter 20. The response content finally generated by the scenario interpreter 20 is synthesized into speech by the speech synthesis unit 70 and output from the speaker as the robot's utterance.
2. Learning function (intelligent dialogue)
As the voice dialogue progresses, the user will sometimes ask about something the robot cannot answer (something not described in any scenario). Because the robot does not know the answer at first, it asks the user for it. In this way the robot learns the question and the answer and automatically updates the scenario, dialogue dictionary, and recognition dictionary needed by the voice dialogue device 1. From the second time onward, it can therefore answer questions that it previously could not. However, a learned answer may be wrong, and in the course of dialogue several different answers may arise for one question. In that case, the robot utters the answer with the highest appearance probability among them. If several answers have the same probability, the one that appeared first is given priority. With this function the robot can learn what it does not know, and its answers also approach the correct answer with high probability.
3. Robot utterance (natural dialogue)
Rather than uttering the same thing every time in reply to a question, the robot can vary its utterances according to the user's response time interval, manner of answering, utterance content, and so on. The robot's utterances can also be varied according to various other factors, such as the user's emotion and dialect.
4. Operation
Next, the operation of the voice dialogue device of this embodiment will be described based on the flowcharts shown in FIGS. 2 to 4.
4.1 Overall flow
The overall operation of the voice dialogue device 1 of this embodiment proceeds according to the scenario set in the dialogue scenario unit 30. While waiting for the user's response, that is, at each branch point of the scenario, the operation flow for the learning function shown in FIG. 3 and the operation flow for robot utterance shown in FIG. 4 are applied, and the corresponding robot utterance is output. This operation is repeated, and processing ends when the scenario indicates that the dialogue is finished.
[0052]
That is, as shown in FIG. 2, the device waits for the user's response (S110) and first executes the face image recognition determination process (S120).
In this face image recognition determination process, the voice recognition unit 10 determines, via the robot's plurality of eyes (cameras) and ears (microphones) described above, whether the user has spoken while facing the robot. As shown in FIG. 10, speech recognition starts when it is determined that the user has spoken while facing the robot. Once recognition has started, it is not stopped even if the user turns away during the utterance; it continues until the utterance is complete. Conversely, speech that arrives while the user is not facing the robot is not recognized, and if the user turns to face the robot partway through, recognition starts from that point. Because the robot has eyes (cameras) and ears (microphones) at regular intervals over the full 360 degrees around its head, it can recognize speech that comes from behind it, provided the user is facing the robot while speaking. In this embodiment the robot then turns toward the user, so that it behaves and converses naturally, as people do in conversation. In this way the robot recognizes only the vocabulary that it actually needs to recognize.
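The gating behavior described above can be sketched as a small state machine: recognition starts only when the user speaks while facing the robot, and once started it continues until the utterance ends even if the user looks away. The function name `gate_recognition` and the per-frame tuple format `(facing_robot, speaking, chunk)` are assumptions for illustration; in the embodiment these signals would come from the face image recognition determination unit 40 and the microphones.

```python
def gate_recognition(frames):
    """Group audio chunks into utterances that should be recognized.

    `frames` is a hypothetical sequence of (facing_robot, speaking, chunk)
    tuples, one per time step. Returns a list of utterances, each a list
    of the chunks that would be passed to the recognizer.
    """
    utterances = []
    current = None  # chunks of the utterance in progress, or None when idle
    for facing, speaking, chunk in frames:
        if current is None:
            # Idle: start only when the user speaks while facing the robot.
            if facing and speaking:
                current = [chunk]
        elif speaking:
            # Already recognizing: continue even if the user has turned away.
            current.append(chunk)
        else:
            # Speech stopped: the utterance is complete.
            utterances.append(current)
            current = None
    if current is not None:
        utterances.append(current)
    return utterances


frames = [
    (False, True, "a"),   # speaking but not facing: ignored
    (True, True, "b"),    # turns to the robot: recognition starts here
    (False, True, "c"),   # looks away mid-utterance: still recognized
    (False, False, ""),   # silence ends the utterance
    (False, True, "d"),   # not facing: ignored
]
print(gate_recognition(frames))
```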
[0053]
If it is determined that the face image of the user has been recognized (S130: YES), the operation flow related to the learning function shown in FIG. 3 is subsequently executed (S140).
4.2 Learning function (intelligent dialogue)
As the voice dialogue progresses, the user will sometimes ask about something the robot cannot answer (something not described in any scenario). This learning function realizes an intelligent dialogue that can cope with such situations. FIGS. 5 and 6 show examples of how the robot responds when the user asks about something it does not know.
[0054]
That is, during the dialogue with the user, the scenario interpreter 20 refers to the dialogue scenario unit 30 and determines whether the content of the user's utterance (the question) is something the robot knows (S210). If there is no corresponding scenario and the content is determined to be unknown (S210: NO), the robot asks the user for the answer, saying for example "I don't know, so please tell me" (S220), and the learning function unit 60 learns the question and its answer from the user's reply (S230). The new scenario and utterance vocabulary obtained in this way are stored, via the scenario interpreter 20, in the dialogue recognition dictionary 21, the utterance list storage unit 22, the dialogue scenario unit 30, and the like. In other words, scenarios are added for things the robot does not know at all, and recognition vocabulary is added when the robot merely does not understand the meaning of a word. From the second time onward the robot can therefore answer what it previously did not know, and as new scenarios and utterance vocabulary accumulate it becomes an increasingly intelligent robot. A specific example is shown in FIG. 5.
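Steps S210 to S230 amount to a lookup that falls back to asking the user and storing the reply. The following is a minimal sketch under stated assumptions: the class name `DialogueMemory`, the single dictionary standing in for the dialogue recognition dictionary 21, utterance list storage unit 22, and dialogue scenario unit 30, the `ask_user` callback, and the acknowledgement phrase are all hypothetical.

```python
class DialogueMemory:
    """Toy stand-in for the stored scenarios (question -> answer)."""

    def __init__(self):
        self.scenarios = {}

    def respond(self, question, ask_user):
        """Answer a question, learning the answer if it is unknown.

        `ask_user` is a hypothetical callback that utters a prompt and
        returns the user's spoken reply as text.
        """
        if question in self.scenarios:            # S210: YES, known content
            return self.scenarios[question]       # S240: utter stored answer
        # S220: ask the user for the answer.
        answer = ask_user("I don't know, so please tell me.")
        # S230: learn the question and its answer for next time.
        self.scenarios[question] = answer
        return "I see, thank you."                # assumed acknowledgement


memory = DialogueMemory()
first = memory.respond("favorite fruit?", lambda prompt: "apple")
second = memory.respond("favorite fruit?", lambda prompt: "unused")
print(first)
print(second)
```

On the second call the stored answer is returned and the callback is never invoked, which corresponds to the robot being able to answer from the second time onward.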
[0055]
On the other hand, if it is determined in S210 that a corresponding scenario exists and the content is known (S210: YES), the utterance described in that scenario is selected, synthesized by the speech synthesis unit 70, and uttered (S240). If the user does not point out an error in this utterance (S250: NO), the operation flow ends.
[0056]
However, the answer first given by a user is not necessarily correct, so from the second time onward a user may point out "That's wrong." When the user points out an error in S250 (S250: YES), the learning function unit 60 first learns the question and the new answer (S260). It then refers to the past history in the utterance list storage unit 22 for the same dialogue content and compares the probability of the new answer with that of the answer that currently has the highest appearance probability (cumulative value) for that question (S270).
[0057]
If the appearance probability of the new answer is determined to be higher, that answer is regarded as correct, the answer to be given from the next time is changed to it, the scenario is updated, and the operation flow ends (S280). If the appearance probability of the new answer is determined to be lower, the answer to the question is not changed (S290). If the two probabilities are equal, the answer that appeared first (further in the past) is set to be used in the next utterance (S300). Specific examples are shown in FIG. 6.
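The correction branch (S260 to S300) can be sketched as a comparison between the newly taught answer and the answer currently in use. This is an illustration only: the function name `apply_correction`, the use of raw occurrence counts in place of probabilities, and the list-based history are assumptions, and `current_answer` is assumed to be the answer that already has the highest count.

```python
def apply_correction(history, current_answer, corrected_answer):
    """Return the answer to use next time after the user corrects the robot.

    `history` is a hypothetical list of all answers taught for one question,
    in order of appearance; it is updated in place.
    """
    history.append(corrected_answer)              # S260: learn the new answer
    new_count = history.count(corrected_answer)   # S270: compare frequencies
    cur_count = history.count(current_answer)
    if new_count > cur_count:
        return corrected_answer                   # S280: switch to new answer
    if new_count < cur_count:
        return current_answer                     # S290: keep current answer
    # S300: equal frequency, so prefer the answer that appeared first.
    return min((current_answer, corrected_answer), key=history.index)


history = ["apple"]
answer = "apple"
answer = apply_correction(history, answer, "orange")  # 1 vs 1: earlier wins
print(answer)
answer = apply_correction(history, answer, "orange")  # 2 vs 1: switch
print(answer)
```

Since counts over the same history have the same denominator, comparing counts gives the same ordering as comparing appearance probabilities.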
[0058]
Returning to FIG. 2, the operation flow related to the robot utterance shown in FIG. 4 is executed (S150).
4.3 Robot utterance (natural dialogue)
In this operation flow, as shown in FIG. 4, the robot either makes an utterance that emphasizes the content (S320) or makes an utterance that carries some ambiguity (S330), based on the user's response time interval to the robot's utterance (S310), the user's manner of answering (S340), and a judgment of the user's utterance content (S350).
[0059]
Specifically, as shown in FIG. 7, when the robot asks the user “What is your favorite food?” and the user answers immediately, for example “It's an apple”, the robot responds with, for example, “I love apples.” On the other hand, if the user pauses (for example, for about 10 seconds) before answering “I think it's an apple,” the robot gives an ambiguous response that reflects the user's hesitation, such as “I really like apples.” By varying its responses in this way, the robot becomes an intelligent one that takes nuances of meaning and emotion into account.
[0060]
In addition, the user's answers themselves can differ in wording, for example between “It's an apple” and a slightly more ambiguous phrasing of the same answer, so these differences need to be identified and the response of the robot utterance changed accordingly.
[0061]
Furthermore, when the user's utterance begins with a filler such as “Uh ...”, it can be taken as an unclear way of speaking with little conviction, so the response of the robot utterance is changed with this in mind as well.
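The three judgments described above (response time, wording of the answer, and fillers at the start of the utterance) can be sketched as follows. This is only an illustrative sketch: the function name, the hedge and filler word lists, and the rule that any single cue triggers an ambiguous response are assumptions. Only the threshold of about 10 seconds comes from the example in the text.

```python
def choose_response_style(response_delay_s, answer_text,
                          delay_threshold_s=10.0,
                          hedges=("i think", "maybe", "probably"),
                          fillers=("uh", "um", "er")):
    """Return "plain" or "ambiguous" for the robot's next utterance."""
    text = answer_text.strip().lower()
    words = text.split()
    # First word with surrounding punctuation removed, e.g. "uh..." -> "uh".
    first_word = words[0].strip(".,!?…") if words else ""

    slow = response_delay_s >= delay_threshold_s   # response interval (S310)
    hedged = any(h in text for h in hedges)        # wording of answer (S340)
    has_filler = first_word in fillers             # utterance content (S350)

    if slow or hedged or has_filler:
        return "ambiguous"                         # hedged response (S330)
    return "plain"                                 # ordinary response (S320)


print(choose_response_style(1.0, "It's an apple"))
print(choose_response_style(10.0, "I think it's an apple"))
```

A real system would presumably weigh these cues rather than treat them as equivalent; the sketch only shows where each judgment enters the decision.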
Returning to FIG. 2, the utterance is performed according to the utterance content of the robot determined as described above (S160), and then the end condition determination process is executed (S170). If it is determined that the preset termination condition is satisfied based on the scenario (S180: YES), the series of processes is terminated.
5. Other factors related to intelligent and natural robotic voice interaction
FIG. 8 shows factors relating to an intelligent and natural robot voice interactive device.
5.1 Emotion
As shown in FIG. 8, the emotion contained in the user's utterance is recognized, and the tone of the robot's utterance is changed accordingly. For example, when the user speaks to the robot in an angry tone, the robot responds with gentle, soothing words.
[0062]
Also, as shown in FIG. 9, the robot utterance can be changed depending on the target of the user's emotion, for example whether the emotion is directed at the robot's utterance or is a general emotion.
5.2 Dialect
As shown in FIG. 8, the robot responds in the same dialect as the user's utterance, which makes it a familiar and intelligent robot. For example, when the user speaks in the Kansai dialect, the robot utterance is also changed to the Kansai dialect.
[0063]
Beyond merely recognizing the dialect, the topic may also be steered toward the corresponding region. Doing so makes it possible to change the topic and makes it easier for the user to respond.
5.3 Language
As shown in FIG. 8, the robot speaks in the same language as the user's utterance. For example, if the user speaks in English, the robot also speaks in English.
[0064]
Beyond merely recognizing the language, the topic may also be steered toward the corresponding country. Doing so makes it possible to change the topic and makes it easier for the user to respond.
5.4 Speech
As shown in FIG. 8, the voice of the robot is changed according to the age and sex of the user. For example, the robot responds to a small child in the voice of a young woman like a kindergarten teacher, to a man in a woman's voice, and to a woman in a man's voice.
[0065]
In addition to changing the voice, the dialogue can also be steered toward topics suited to the user's age and gender.
5.5 Image (having eyes)
As shown in FIG. 8, giving the robot eyes allows it to see the movement of the user's lips, so lip-reading technology can be applied to improve the recognition rate. Being able to recognize lip movement in this way also contributes to an intelligent robot.
[0066]
Another use is to recognize the position of the user's face: the robot performs voice recognition only when it determines that the user's face is facing the front, which serves as a noise countermeasure and improves the recognition rate.
Specifically, as described above with reference to FIG.
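The face-orientation gating described in this section can be sketched as below. The `Frame` type and its fields are hypothetical; how the face image recognition determination unit 40 actually decides that the face is frontal is not specified here, so the sketch simply assumes a boolean flag is supplied with each audio segment.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    """Hypothetical pairing of an audio segment with a face-orientation flag."""
    audio: str           # placeholder for an audio segment
    face_frontal: bool   # True when the user's face is judged to face the robot


def gate_by_face(frames):
    """Keep only audio captured while the user was facing the robot."""
    return [f.audio for f in frames if f.face_frontal]


frames = [
    Frame("tv noise", False),
    Frame("what is the weather", True),
    Frame("someone else talking", False),
]
print(gate_by_face(frames))  # only the segment spoken while facing the robot
```

Only the segments that pass the gate would be handed to voice recognition, so speech and noise picked up while the user looks elsewhere are discarded before recognition.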

As described above, when the user asks the voice interactive apparatus 1 of the present embodiment about something it does not know, it answers on the spot that it does not know, but it has a learning function that allows it to answer when asked something similar the next time. That is, when the user asks about something it does not know, it asks the user for the answer, stores the question content and the answer, and uses them in subsequent dialogues.
[0067]
For this reason, the need to interrupt the dialogue because of unknown dialogue content, or to change the topic presented by the user, is reduced. In addition, new scenarios and vocabulary are added through learning, which improves the apparatus's knowledge and can be reflected in subsequent dialogues with the user. As a result, each time learning is repeated, a more satisfactory dialogue can be realized for a specific user. New topics and information can also be provided to different users, so that an intelligent dialogue can be realized.
[0068]
Further, the voice interaction device 1 varies the content and timing of its responses according to the timing of the user's response, the content of the user's utterance, and the like. Varying the device-side utterances in this way makes a natural voice dialogue possible.
Furthermore, the robot serving as the voice interaction device 1 is provided with a plurality of eyes (cameras) and ears (microphones), and by applying lip-reading technology, voice recognition is performed with noise other than the user's spoken words removed, which improves the rate at which the recognition means recognizes spoken words. This prevents or suppresses errors on the device side with respect to the user's utterances and makes a natural dialogue with the user possible. As a result, an intelligent dialogue with the user can be carried forward.
[0069]
In this embodiment, the voice recognition unit 10 corresponds to the recognition means, the face image recognition determination unit 40 corresponds to the image recognition means, and the scenario interpreter 20, the dialogue scenario unit 30, and the robot utterance word determination unit 50 correspond to the selection means, determination means, update means, and identification means. Further, the dialogue recognition dictionary 21 and the utterance list storage unit 22 correspond to the storage means and the second storage means, the learning function unit 60 corresponds to the learning means, and the speech synthesis unit 70 corresponds to the output means.
[0070]
Although an embodiment of the present invention has been described above, the present invention is of course not limited to that embodiment in any way and can take various forms as long as they fall within the technical scope of the present invention.
For example, in the above-described embodiment, an example in which the voice interactive apparatus of the present invention is configured as a robot has been described. However, the present invention is not limited thereto, and may be configured as an apparatus such as a navigation system.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a schematic configuration of a voice interactive apparatus according to an embodiment of the present invention.
FIG. 2 is a flowchart illustrating the operation of the voice interactive apparatus according to the embodiment.
FIG. 3 is a flowchart showing the operation of the learning function of the voice interaction apparatus.
FIG. 4 is a flowchart showing a robot utterance operation related to the voice interaction apparatus.
FIG. 5 is an explanatory diagram showing response example 1 for the case where the robot is asked about something it does not know (learning function).
FIG. 6 is an explanatory diagram showing response example 2 for the case where the robot is asked about something it does not know (learning function).
FIG. 7 is an explanatory diagram illustrating a response example of the robot according to differences in the timing of the user's answer.
FIG. 8 is an explanatory diagram showing factors related to an intelligent robot.
FIG. 9 is an explanatory diagram illustrating a correspondence example of a robot utterance based on a difference in a user's emotional target.
FIG. 10 is an explanatory diagram illustrating an example of a voice dialogue based on image recognition of a robot.
FIG. 11 is an explanatory diagram illustrating a dialog example of a robot voice dialog.
[Explanation of symbols]
1 ... voice dialogue apparatus, 10 ... voice recognition unit,
20 ... scenario interpreter, 21 ... dialogue recognition dictionary,
22 ... utterance list storage unit, 30 ... dialogue scenario unit,
40 ... face image recognition determination unit, 50 ... robot utterance word determination unit,
60 ... learning function unit, 70 ... speech synthesis unit

Claims

A voice interaction device for conducting a dialogue with a user, comprising:
recognition means for recognizing, by speech recognition, the input content when a voice input for dialogue is made by the user;
storage means in which a plurality of scenarios corresponding to the content of the dialogue with the user, and utterance target words in accordance with each scenario, are stored in advance;
selection means for selecting an utterance word directed to the user from among the utterance target words stored in the storage means, in accordance with the recognition by the recognition means; and
output means for outputting by voice the utterance word selected by the selection means,
the voice interaction device further comprising:
learning means for, when the utterance target words stored in the storage means include no utterance target word corresponding to the content of the dialogue with the user, causing the selection means to select an utterance word for asking the user back for the answer to that dialogue content, learning a scenario corresponding to the dialogue content based on the answer input by the user in response to that question, and newly storing the scenario and the utterance target words in accordance with each scenario in the storage means;
updating means for automatically updating, in the storage means, the scenario, dialogue dictionary, and recognition dictionary necessary for voice dialogue about the dialogue content, based on the dialogue information learned by the learning means; and
second storage means for storing, for the dialogue information learned by the learning means, a history of dialogue content of the same kind exchanged a plurality of times,
wherein, when there are mutually inconsistent scenarios for dialogue content of the same kind, the updating means, based on the dialogue information stored in the second storage means, sequentially changes to and updates with the scenario having the higher dialogue probability.

The voice interaction device according to claim 1,
wherein, when the dialogue probabilities for the dialogue content are equal between the inconsistent scenarios, the updating means gives priority to the scenario that appeared earlier.

A program for causing a computer to function as each of the means of the voice interaction device according to claim 1 or 2.