JP4292846B2

JP4292846B2 - Spoken dialogue device, spoken dialogue substitution device, and program thereof

Info

Publication number: JP4292846B2
Application number: JP2003093194A
Authority: JP
Inventors: 鈴木 忠; 石川 泰; 西田 稔; 炭田 昌人
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2003-03-31
Filing date: 2003-03-31
Publication date: 2009-07-08
Anticipated expiration: 2023-03-31
Also published as: JP2004301980A

Description

【０００１】
【発明の属する技術分野】
この発明は、ネットワーク経由で取得可能し対話操作により利用するコンテンツを音声により利用する音声対話装置及び音声対話代行装置並びにそれらのプログラムに係るものであり、特に操作回数を低減し、短時間で所望のコンテンツを得ることができる音声対話装置及び音声対話代行装置並びにそれらのプログラムに関する。
【０００２】
【従来の技術】
近年、電話を介して音声によりインターネットサービスを利用できるようにしたボイスポータルが増えてきている。例えば、株式会社ＮＴＴコミュニケーションズの「Ｖポータル」（ＵＲＬ：ｈｔｔｐ：／／ｗｗｗ．ｎｔｔ．ｃｏｍ／ｖ−ｐｏｒｔａｌ／）や、株式会社電話放送局の「大阪ボイスポータル」（ＵＲＬ：ｈｔｔｐ：／／ｗｗｗ．ｖｐｓｉｔｅ．ｎｅｔ／）などがある。
【０００３】
これらは、もともと文字として表現されていたインターネット上のコンテンツを、音声合成により音声として利用者に提供するものである。ここで、これらのコンテンツが対話操作を含む場合には、対話操作を促す文字情報が音声ガイダンスに変換され、また本来キーボードやマウスによる操作指示の入力が必要な場面では、音声認識技術を利用して、利用者の発話を音声認識技術により操作指示に変換して利用できるようにしている。
【０００４】
ところで、音声によるガイダンスと音声入力とを組み合わせた対話操作と、通常のインターネットコンテンツが前提としている画面での文字表示とキーボードあるいはマウスによる操作指示とを組み合わせた対話操作とでは、次のような点が異なっている。
【０００５】
例えば、音声ガイダンスや発話は、言語として完結しなければ、意味が不明確となる。そこで、音声ガイダンスを再現したり、発話を最後まで行うために、数秒以上の時間を要する。このため、音声による対話処理は、画面に文字列を表示し、キーボードやマウスを通じて操作指示を行う対話処理よりも、所要時間が長い。音声を通じて何度も同じコンテンツを利用しようとする利用者は、毎回同じような操作を行うにもかかわらず、本来必要とする情報に辿りつくまでに長い間待たされることになる。
【０００６】
音声による対話操作を通じて、情報機器を操作するインターフェースは、ＩＴＳ（ＩｎｔｅｌｌｉｇｅｎｔＴｒａｎｓｐｏｒｔＳｙｓｔｅｍ：高度道路交通システム）の普及につれて、運転者が視線を逸らさずに情報を得る手段として有望視されている。特に今後ＤＳＲＣ（ＤｅｄｉｃａｔｅｄＳｈｏｒｔＲａｎｇｅＣｏｍｍｕｎｉｃａｔｉｏｎ、専用狭域通信）技術によって、運転中に高度な情報を供給できるようになってくることが想定される。そこで、音声対話操作インターフェースを普及させるためにも、上記のような煩わしさを解決する必要がある。
【０００７】
このような音声対話の操作性上の問題点を解決しようとした技術として、情報提供の順序を利用者に合わせて変更し、利用者が頻繁に利用する情報に辿りつくまでの操作を省略できるようにした方法が提案されている（例えば、特許文献１）。
【０００８】
【特許文献１】
特開２０００−２７０１０５「音声応答システム」（第１図、第７図、第３頁−第５頁）
【０００９】
【発明が解決しようとする課題】
しかし上記の方法は、情報を利用者に供給するサーバの側で、情報提供の方法を利用者ごとに変更する手段を採用している。そのため、情報を利用者に供給するサーバが、利用者固有の情報提供順序を記憶しておかなければならない。例えば、現在のインターネットでは、おびただしい量のコンテンツが存在する。このような場合に、大量のコンテンツそれぞれについて情報提供の順序を利用者ごとに変更し、さらにその変更内容を記憶させることは現実的ではない。
【００１０】
この発明は、上記のような問題を解決するために行われたもので、ネットワークを通じて取得したコンテンツを音声によって操作するインターフェースにおいて、定型の対話操作の回数を低減するものである。
【００１１】
【課題を解決するための手段】
この発明に係る音声対話装置は、
コンテンツ取得手段と、履歴記憶手段と、コンテンツ解釈手段と、対話操作代行手段と、音声認識手段とを備えた音声対話装置であって、
前記コンテンツ取得手段は、利用者の操作指示を促すメッセージと操作指示毎に異なる動作とを定義した対話操作制御情報を有するコンテンツをネットワークを通じて取得し、前記履歴記憶手段は、前記コンテンツ取得手段が取得したコンテンツについて前記利用者がこれまで行った操作指示を使用履歴として記憶し、
前記コンテンツ解釈手段は、前記履歴記憶手段の記憶している使用履歴が所定の条件を満たす場合には、前記対話操作代行手段の出力する操作指示に基づいて、前記コンテンツの有する対話操作制御情報に定義された動作を決定する一方で、前記使用履歴が前記所定の条件を満たさない場合には、前記コンテンツの有する対話操作制御情報のメッセージを前記利用者に提示するとともに、前記音声認識手段が出力する操作指示に基づいて前記動作を決定するものであって、前記履歴記憶手段が記憶する使用履歴から前記コンテンツを使用した回数を算出すると共に、ハードウェア環境情報に基づいて所定値を算出し、前記回数が前記所定値以上の場合に、前記対話操作代行手段の出力する操作指示に基づいて、前記動作を決定する一方で、前記回数が前記所定値未満の場合には、前記メッセージを前記利用者に提示するとともに、前記音声認識手段が出力する操作指示に基づいて前記動作を決定し、
前記対話操作代行手段は、前記履歴記憶手段が記憶している使用履歴の操作指示を出力し、
前記音声認識手段は、前記コンテンツ解釈手段が提示したメッセージに対して前記利用者が行った発話を音声認識し、前記コンテンツの有する対話操作制御情報に対する操作指示として出力するとともに、該操作指示を前記履歴記憶手段に記憶させることを特徴とするものである。
【００１２】
またこの発明に係る音声対話代行装置は、利用者の操作指示により異なる動作を行い、かつ前記利用者の操作指示を促すメッセージを含むコンテンツをネットワークを通じて取得するコンテンツ取得手段と、
前記利用者の発話を音声認識により前記コンテンツ取得手段が取得するコンテンツに対する操作指示に変換し出力する音声認識手段と、
前記コンテンツ取得手段が取得するコンテンツの有する前記メッセージを前記利用者に報知する報知手段と、
前記音声認識手段が出力する操作指示に基づいて動作を決定するコンテンツ解釈手段とを備えた音声対話装置とともに使用する音声対話代行装置であって、
前記コンテンツ取得手段が取得したコンテンツと前記利用者の発話とを関連づけて使用履歴として記憶する履歴記憶手段と、
利用者情報を記憶する利用者情報記憶手段と、
前記利用者が使用するコンテンツについて、前記履歴記憶手段が前記利用者の発話を記憶している場合に、前記利用者の発話を再生して前記音声認識手段に出力するものであって、前記履歴記憶手段が記憶する使用履歴から前記コンテンツを使用した回数を算出すると共に、前記利用者情報記憶手段が記憶する前記利用者情報に基づいて所定値を算出し、前記回数が前記所定値以上の場合に、前記利用者の発話を再生して前記音声認識手段に出力する発話再生手段と、
を備えるものである。
【００１３】
以下、この発明の実施の形態について説明する。
実施の形態１．
図１は、この発明の実施の形態１による音声対話装置の構成を表すブロック図である。図において、コンテンツ記憶部１は、コンテンツを記憶し、ネットワークを経由して利用者にそのコンテンツを供給する装置である。具体的には、コンテンツ記憶部１はコンピュータを用いて構成されたサーバ装置である。コンテンツ２は、コンテンツ記憶部１によって供給されるコンテンツである。ここで、コンテンツとは、利用者が利用する情報を総称するものであり、具体的には、ＨＴＭＬ（ＨｙｐｅｒＴｅｘｔＭａｒｋｕｐＬａｎｇｕａｇｅ）、ＸＭＬ（ｅＸｔｅｎｄｅｄＭａｒｋｕｐＬａｎｇｕａｇｅ）などの構造化文書形式、その他のバイナリ形式で供給される情報を含む。ネットワーク３は、ＬＡＮや電話通信回線を初めとする双方向でデジタルデータを送受信するための通信路である。ここでネットワーク３は、このような目的を達するものであればどのようなものでもよく、有線／無線の別を問わない。
【００１４】
音声対話装置４は、実施の形態１による音声対話装置であって、コンテンツ２を取得して、利用者に提供する装置である。メッセージ５は、コンテンツ２に対する対話操作を促すためのメッセージであり、音声又は文字やアイコンによって利用者に提供されるものである。発話６は、メッセージ５に応答して、利用者がコンテンツ２に対する操作指示を行うために発声する音声である。音声対話装置４は、発話６を音声認識により解釈して、コンテンツ２の対話操作制御情報に適合した形式の操作指示に変換するものである。
【００１５】
ここで対話操作制御情報とは、コンテンツ２に組み込まれた、あるいはコンテンツ２と関連づけられた対話操作処理を実行するためのプログラムコードである。コンテンツ２がＨＴＭＬやＸＭＬであるならば、このような対話操作制御情報はジャバスクリプトやＶｏｉｃｅＸＭＬ、あるいはＨＴＭＬとｃｇｉプログラムとの組み合わせなどによって実現されることが多い。もっとも、実施の形態１におけるコンテンツ２は、必ずしも音声対話操作を前提として構成されている必要はない。
【００１６】
次に音声対話装置４の構成について説明する。図２は、音声対話装置４の詳細な構成を示すブロック図である。図において、コンテンツ取得手段７は、ネットワーク３を経由してコンテンツ２を取得する部位であって、具体的にはネットワーク入出力を行ってコンテンツ２を取得するものである。
【００１７】
制御ＩＤ取得手段８は、コンテンツ２の対話操作制御情報に割り振られた制御ＩＤを取得する部位である。制御ＩＤとは、対話操作制御情報に割り振られた識別子であって、対話操作制御情報を一意に識別する識別子である。このような識別子としては、例えば、コンテンツ２がＨＴＭＬデータであれば、特定のタグを用いてもよいし、そのようなタグがないデータの場合は、行番号やデータの先頭からのオフセット値（データの先頭を０番地とした場合のそのデータの開始アドレス）を用いてもよい。
【００１８】
コンテンツ解釈手段９は、コンテンツ取得手段７が取得したコンテンツ２の内容を解析して、図示せぬディスプレイ装置やスピーカーなどによって、利用者に対話操作を促すメッセージ５を利用者に報知する。また利用者からの操作指示に従って、対話操作制御情報に予め定められているいずれかの動作を選択し、場合によってはその動作を実行する部位である。
【００１９】
音声認識手段１０は、利用者がメッセージ５に応答して操作指示を発話すると、この発話をマイクロホンで集音し、集音した発話を音声認識してコンテンツ２の対話操作制御情報に適合した操作指示に変換するものである。
【００２０】
また履歴記憶手段１１は、利用者がコンテンツ２にアクセスした履歴を使用履歴として記憶する部位である。具体的には、ハードディスク装置やフラッシュメモリなどの不揮発性記憶装置によって構成されており、音声認識手段１０によって音声認識された操作指示を、制御ＩＤ取得手段８が取得した識別子に関連づけて、記憶するようになっている。
【００２１】
対話操作代行手段１２は、対話操作制御情報を通じてコンテンツ２が要求する対話操作を自動的に行うために、履歴記憶手段１１によって記憶されている使用履歴を参照して、過去の利用者の操作指示を取得し、出力する部位である。
【００２２】
次に、コンテンツ２の詳細について説明する。図３は、コンテンツ２の一例を示したものである。図３の矩形２０内のリストはＶｏｉｃｅＸＭＬ言語に準拠して記述されたコンテンツのリストである。また図の左端の数字とコロンの組み合わせは、説明のために付された行番号である。以下の説明において、＜という文字と、＞という文字とによって括られた文字列（トークン）をタグと呼ぶこととする。
【００２３】
図において、＜ｆｏｒｍｉｄ＞タグで開始し、＜／ｆｏｒｍ＞タグで終了する行は、コンテンツ２を利用すると行われる対話操作処理を定義するものである。図３の例では、このような対話操作処理として、２行目から１６行目までの対話操作制御情報（＜ｆｏｒｍｉｄ＝”説明文出力の確認”＞で開始する対話操作制御情報、以後、単に対話操作制御情報２１という）及び１７行目から２１行目までの対話操作制御情報（＜ｆｏｒｍｉｄ＝”説明文の出力”＞で開始する対話操作制御情報、以後単に対話操作制御情報２２という）が表されている。
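[0024]
Next, the operation of the voice interaction device 4 is described. FIG. 4 is a flowchart showing the processing of the voice interaction device 4. In the figure, step S101 is performed by the content acquisition means 7, which acquires the content 2 from the content storage unit 1 via the network 3. The content 2 is acquired using, for example, ftp (file transfer protocol) or http (hypertext transfer protocol).
[0025]
Next, in step S102, the control ID acquisition means 8 acquires the control ID of the dialogue operation control information 21. In the example of FIG. 3, the value of a <form id> tag is never used more than once within the content 2, so this value can serve as the identifier. Considering that the voice interaction device 4 may handle multiple contents, the content name or content URL combined with the value of the <form id> tag (e.g., http://www.コンテンツ2#説明文の出力) may also be used as the identifier.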
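As an illustration of the composite identifier described in step S102, the following Python sketch joins a content URL and a <form id> value with "#". The function name make_control_id and the example URL are assumptions made for this sketch; they do not appear in the patent.

```python
from typing import Optional


def make_control_id(form_id: str, content_url: Optional[str] = None) -> str:
    """Build a control ID for one block of dialogue operation control information.

    Hypothetical helper: within a single content the <form id> value alone is
    unique; when several contents are handled, the URL is prefixed so that
    IDs from different contents stay distinct.
    """
    if content_url is None:
        return form_id
    return f"{content_url}#{form_id}"


print(make_control_id("説明文の出力", "http://example.com/content2"))
```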
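[0026]
In step S103, the content interpretation means 9 determines whether the user's usage of the content 2 matches a predetermined condition. The usage of the content 2 here means the current user's access status for the content 2: for example, whether this user has accessed the content 2 before and, if so, how frequently. This information is obtained by referring to the usage history stored in the history storage means 11. The predetermined condition is, in this case, for example "whether or not this is the first access" or "whether or not the number of past accesses is at least a predetermined number".
[0027]
For example, if the predetermined condition is "whether or not this is the first access", the content interpretation means 9 searches the usage history and checks whether an access history for the content 2 can be obtained. If the content 2 has been accessed in the past, the result of step S103 is YES. If it has never been accessed, the result of step S103 is NO.
[0028]
Similarly, if the predetermined condition is "whether or not the number of past accesses is at least a predetermined number", the content interpretation means 9 searches the usage history and calculates the number of accesses to the content 2. If this number is greater than or equal to the predetermined number, the result of step S103 is YES. If it has not reached the predetermined number, the result of step S103 is NO.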
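The two example conditions can be written as a single comparison, because "not the first access" is the same as an access count of at least 1. The sketch below assumes that the usage history is a plain list of accessed content identifiers; that representation and the name condition_met are assumptions made for illustration.

```python
from typing import List


def condition_met(usage_history: List[str], content_id: str, min_count: int = 1) -> bool:
    """Step S103 as a sketch: True (YES) when past accesses reach min_count.

    min_count=1 corresponds to "first access or not"; a larger value
    corresponds to "at least a predetermined number of past accesses".
    """
    access_count = usage_history.count(content_id)
    return access_count >= min_count
```

With an empty history the function returns False, which matches the NO branch taken on a first access.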
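[0029]
When the user accesses the content 2 for the first time, the predetermined condition above is not satisfied, so the result of step S103 is NO. The processing for a NO result is therefore described first. In this case, processing proceeds to step S104 (step S103: NO).
[0030]
In step S104, the content interpretation means 9 outputs the message contained in the dialogue operation control information as the message 5, prompting the user to perform an interactive operation. In the example of the dialogue operation control information 21, the message contained in the dialogue operation control information is the message defined by the <prompt> tag, such as "システムの説明が必要ですか？" ("Do you need an explanation of the system?"). In this example the message is expressed as a character string, but the information may also be conveyed to the user in combination with image data such as icons, or by image data alone.
[0031]
In step S105, the voice recognition means 10 recognizes the user's utterance and converts it into an operation instruction. That is, when the user utters an operation instruction in response to the message or voice guidance, the voice recognition means 10 recognizes the utterance and converts it into an operation instruction. This recognition may be implemented with a general-purpose speech recognition dictionary. Alternatively, the contents of the <filled> tag of the dialogue operation control information 21 (lines 8 to 14) may be analyzed, the character string "いいえ" ("no") on line 9 extracted, for example, and matching performed against speech data for "いいえ".
[0032]
Next, in step S106, the history storage means 11 associates the operation instruction converted by the voice recognition means 10 with the control ID acquired by the control ID acquisition means 8 and stores them as usage history. If an operation instruction is already stored in the usage history in association with this control ID, the history storage means 11 evaluates whether the new operation instruction is the same as the stored one. Only if it differs does the history storage means 11 erase the stored operation instruction and store the new operation instruction in association with the control ID.
[0033]
If the voice interaction device 4 has storage capacity to spare, new operation instructions may always be appended instead of overwriting the existing one. In that case, multiple operation instructions are stored for a single control ID, and the most recent one (the one added to the usage history last) is used. Alternatively, when the history storage means 11 holds multiple operation instructions for a control ID, the most frequent of them may be selected.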
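The storage policies of step S106 (overwrite only when the instruction changes, or always append and then pick the latest or the most frequent instruction) can be sketched as follows. The class name UsageHistory, its methods, and the in-memory dictionary are assumptions for illustration; the patent only requires a non-volatile store keyed by control ID.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional


class UsageHistory:
    """Hypothetical in-memory stand-in for the history storage means."""

    def __init__(self, keep_all: bool = False) -> None:
        self.keep_all = keep_all  # True: append every instruction
        self._records: Dict[str, List[str]] = defaultdict(list)

    def record(self, control_id: str, instruction: str) -> None:
        entries = self._records[control_id]
        if self.keep_all:
            entries.append(instruction)
        elif not entries or entries[-1] != instruction:
            # Overwrite policy: replace only when the instruction differs.
            entries[:] = [instruction]

    def latest(self, control_id: str) -> Optional[str]:
        entries = self._records.get(control_id)
        return entries[-1] if entries else None

    def most_frequent(self, control_id: str) -> Optional[str]:
        entries = self._records.get(control_id)
        if not entries:
            return None
        return Counter(entries).most_common(1)[0][0]
```

Whether latest or most_frequent is the better choice depends on how stable the user's answers are; the source leaves both options open.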
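[0034]
Finally, in step S107, the content interpretation means 9 obtains the operation instruction converted by the voice recognition means 10 and, according to it, selects an operation defined in the dialogue operation control information. For example, with the dialogue operation control information 21, when the user says "いいえ" ("no"), the voice recognition means 10 recognizes it and outputs it as an operation instruction. The content interpretation means 9 substitutes this operation instruction into line 9, <if cond="YN=='いいえ'">, and evaluates it. Here the result is true, so line 10 is selected as the operation. The content interpretation means 9 then interprets and executes line 10 (<goto next="#次の処理"/>) and processes the <form id> tag that begins on line 22.
[0035]
If, on the other hand, the user says anything other than "いいえ", the voice recognition means 10 recognizes it and outputs it to the content interpretation means 9. The content interpretation means 9 substitutes this operation instruction into line 9 and evaluates it. The result is false, so line 12 is selected as the operation. The content interpretation means 9 then interprets and executes line 12 (<goto next="#説明文の出力"/>) and processes the dialogue operation control information 22, which begins on line 17.
[0036]
The above is the processing when the predetermined condition is not met. Next, the processing when it is met (step S103: YES) is described. In this case, processing proceeds to step S108, where the dialogue operation proxy means 12 obtains, from the usage history stored in the history storage means 11, the operation instruction stored in association with the control ID of the dialogue operation control information. For example, if the control ID is "説明文出力の確認" ("confirm explanation output"), the user has in the past given an operation instruction such as "いいえ". The history storage means 11 therefore stores the control ID "説明文出力の確認" in association with an operation instruction such as "いいえ". The dialogue operation proxy means 12 obtains the operation instruction "いいえ" associated with the control ID "説明文出力の確認" from the usage history and outputs it.
[0037]
Processing then returns to step S107, where the content interpretation means 9 obtains the operation instruction output by the dialogue operation proxy means 12 and selects an operation defined in the dialogue operation control information according to it. The subsequent processing is the same as when the predetermined condition is not met.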
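Steps S103 to S108 can be put together in one short sketch. It is only an illustration under stated assumptions: the history is a dictionary from control ID to the stored instruction, speech recognition is replaced by a callable named ask_user, and the extra check that a stored instruction actually exists is a safeguard added here, not something the patent spells out.

```python
from typing import Callable, Dict


def handle_dialogue(
    control_id: str,
    condition_met: bool,
    history: Dict[str, str],
    prompt: str,
    ask_user: Callable[[str], str],
    actions: Dict[str, str],
    default_action: str,
) -> str:
    """Return the action selected for one block of dialogue control information."""
    if condition_met and control_id in history:
        # S108: the proxy outputs the instruction stored for this control ID.
        instruction = history[control_id]
    else:
        # S104-S105: present the message and recognize the user's reply.
        instruction = ask_user(prompt)
        # S106: remember the instruction under the control ID.
        history[control_id] = instruction
    # S107: choose the action defined for this instruction.
    return actions.get(instruction, default_action)
```

For the FIG. 3 example, actions would be {"いいえ": "#次の処理"} with "#説明文の出力" as the default action, so any reply other than "いいえ" leads to the explanation output.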
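[0038]
As is clear from the above, in the voice interaction device 4 the dialogue operation proxy means 12 inputs, on the user's behalf, operations the user performed in the past, so interactive operations are omitted for content that the user has already operated by voice. The user therefore no longer needs to repeat the same voice interaction every time the same content is used, and an easy-to-use voice interaction interface can be provided.
[0039]
Furthermore, because the usage history is stored on the terminal side, in the voice interaction device 4, rather than on the server side such as in the content storage unit 1, the usage history can be managed per user. Routine voice interactions can therefore be reduced in accordance with each user's content preferences and operating procedures.
[0040]
Although Embodiment 1 has been described using content 2 that conforms to VoiceXML, the content used is not limited to this format.
[0041]
In Embodiment 1 the content 2 has multiple pieces of dialogue operation control information, so the control ID acquisition means 8 is used to distinguish them. When handling content that does not have multiple pieces of dialogue operation control information, however, the control ID acquisition means 8 may be omitted. In that case, the operation instruction can be associated with the content name, the content URL, or the like and stored as usage history.
[0042]
Even for content that has multiple pieces of dialogue operation control information (content whose interaction consists of multiple steps), the control ID acquisition means can likewise be omitted if the sequence of the user's operation instructions across the steps is stored as a single unit.
[0043]
Even when the predetermined condition is met in step S103, the content interpretation means 9 may still provide some information to the user based on the message 5. Because the dialogue operation proxy means 12 responds on the user's behalf in this case, the message need not prompt an interactive operation; it may be a modified message such as "A proxy response will be made".
[0044]
It is of course also possible to implement the same functions as the voice interaction device 4 as a computer program executed by a computer. Such a program causes a computer to execute, in sequence, programs that perform the processing of the content acquisition means 7, speech synthesis means, the voice recognition means 10, the control ID acquisition means 8, the history storage means 11, the dialogue operation proxy means 12, and the content interpretation means 9.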
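[0045]
Embodiment 2.
In Embodiment 1, whether to omit the interactive operation was decided based on the number of times the content had been used. Embodiment 2 additionally uses hardware environment information in making this decision.
[0046]
FIG. 5 is a block diagram showing the configuration of the voice interaction device according to Embodiment 2. Components given the same reference signs as in the voice interaction device of Embodiment 1 operate in the same way, so their description is omitted. This configuration differs from that of FIG. 1 in that the content interpretation means 9 acquires hardware environment information.
[0047]
Here, hardware environment information means information indicating the specifications of the equipment of the voice interaction device 4 and of the environment in which voice interaction is performed. More specifically, it is information that parameterizes factors affecting how easily the user can perceive and operate content on the voice interaction device 4, for example:
(1) whether the voice interaction device 4 has a display device, a keyboard device, and the like;
(2) whether it is used as in-vehicle equipment;
(3) whether it is used as a mobile phone.
Any other information may be used as long as it expresses factors affecting the perceptibility and operability of content in a form that a computer can process.
[0048]
This information is recorded in ROM (Read Only Memory) and read out using a BIOS (Basic Input Output System) program. Alternatively, it may be recorded as system configuration information, in a file or similar form, on a storage device (not shown) and read from there.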
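[0049]
Next, the operation of the voice interaction device 4 is described using hardware environment information (1) to (3) above as examples. FIG. 6 is a flowchart showing the processing of the voice interaction device 4. Steps given the same reference signs as in the flowchart of Embodiment 1 are the same as in Embodiment 1, so their description is omitted. The portion enclosed in a dotted rectangle and labeled S103 corresponds to step S103 of Embodiment 1, shown in more detail for the explanation of Embodiment 2.
[0050]
The following description therefore covers only the details of this step S103. All of this processing is performed by the content interpretation means 9 and compares the use count with a predetermined value. This predetermined value is called the "threshold" here, and its value is assumed to be, for example, 3.
[0051]
In step S201 of the flowchart in FIG. 6, the use count of the content 2 is calculated by referring to the usage history stored in the history storage means 11. Next, in step S202, the hardware environment information is acquired. Then, in step S203, the threshold is changed based on the hardware environment information.
[0052]
Taking (1) to (3) above as examples, the following describes concretely how hardware environment information affects the perceptibility and operability of the content 2, and how the threshold is changed in step S203 in light of that effect.
[0053]
(1) The presence or absence of a display device and keyboard device affects the operability of the content 2. For example, if the voice interaction device 4 has a display device and also input devices such as a keyboard and mouse, the message prompting an operation instruction in the dialogue operation control information can be displayed on the screen, and operation instructions can be given with the keyboard, mouse, or other input device. In this case the message is displayed on the screen in addition to the voice guidance, so the user can take in a lot of information in a very short time. Compared with using the same content through a device without these capabilities, the user can be expected to get used to the operation after fewer uses and to come to find voice interaction tedious.
[0054]
Accordingly, in step S203, if the hardware environment information indicates that the voice interaction device 4 is not equipped with a display device or keyboard device, the threshold is left at 3. If it indicates that the voice interaction device 4 is equipped with a display device or keyboard device, the threshold is changed to 1 or 2.
[0055]
(2) Whether the device is in-vehicle equipment affects the operability of the content 2 for the user. If the user is a driver who wants to use the content 2 while driving, voice interaction can be an effective user interface. Even so, repeating the same interaction many times while driving in order to use a single content is tedious. In addition, the noise level inside a vehicle is high and the speech recognition rate degrades, so the user ends up repeating spoken operation instructions. In such cases it is desirable to use past operation instructions so that voice interaction becomes unnecessary after fewer uses.
[0056]
Accordingly, in step S203, if the hardware environment information indicates that the voice interaction device 4 is not in-vehicle equipment, the threshold is left at 3. If it indicates that the voice interaction device 4 is in-vehicle equipment, the threshold is changed to 1 or 2.
[0057]
(3) Whether the device is a mobile phone likewise affects the operability of content for the user. Like in-vehicle equipment, mobile phones are often used in noisy environments, where the speech recognition rate degrades. A mobile phone also comes with a display device, and operation instructions can be given with its numeric keypad.
[0058]
Accordingly, in step S203, if the hardware environment information indicates that the voice interaction device 4 is not a mobile phone, the threshold is left at 3. If it indicates that the voice interaction device 4 is a mobile phone, the threshold is changed to 1 or 2.
[0059]
Next, in step S204, the use count of the content 2 calculated in step S201 is compared with the threshold calculated in step S203. If the use count is greater than or equal to the threshold, processing proceeds to step S108 (step S204: YES). If the use count is less than the threshold, processing proceeds to step S104 (step S204: NO). The subsequent processing is the same as in Embodiment 1, so its description is omitted.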
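Steps S201 to S204 amount to adjusting a threshold from the hardware environment and then comparing the use count against it. The sketch below uses the base threshold of 3 from the description; the HardwareEnv fields, the function names, and the choice of 2 as the reduced value (the source allows 1 or 2) are assumptions made for illustration.

```python
from dataclasses import dataclass

BASE_THRESHOLD = 3  # example value given in the description


@dataclass
class HardwareEnv:
    """Hypothetical container for hardware environment information (1)-(3)."""
    has_display_and_keyboard: bool = False
    in_vehicle: bool = False
    is_mobile_phone: bool = False


def threshold_for(env: HardwareEnv, reduced: int = 2) -> int:
    """Step S203: lower the threshold when any of the three conditions holds."""
    if env.has_display_and_keyboard or env.in_vehicle or env.is_mobile_phone:
        return reduced
    return BASE_THRESHOLD


def skip_dialogue(use_count: int, env: HardwareEnv) -> bool:
    """Step S204: True means proceed to S108 (proxy response)."""
    return use_count >= threshold_for(env)
```

With two previous uses, an in-vehicle device would already take the proxy path, while a device with none of the three properties would still prompt the user.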
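[0060]
As is clear from the above, the voice interaction device 4 according to Embodiment 2 changes, according to the hardware environment information, the number of uses required before the interactive operation is omitted. As a result, voice interaction is made more efficient in accordance with the specifications and usage environment of the voice interaction device 4, and an easy-to-use voice interaction interface can be provided.
[0061]
The voice interaction device 4 may also be given a function for detecting the external noise level and converting it into hardware environment information that is output to the content interpretation means 9, so that the threshold is changed dynamically based on the noise level supplied as hardware environment information.
[0062]
Embodiment 3.
In Embodiment 2, the number of uses required before the interactive operation is omitted was changed according to hardware environment information. Embodiment 3 describes an example in which this number is changed based on attributes specific to the user.
[0063]
FIG. 7 is a block diagram showing the configuration of the voice interaction device according to Embodiment 3. Components given the same reference signs as in the voice interaction device of Embodiment 1 operate in the same way, so their description is omitted. This configuration differs from that of FIG. 1 in that user information storage means 13 is newly provided.
[0064]
The user information storage means 13 stores user information and is specifically configured from a non-volatile storage device such as a hard disk or flash memory. The user information storage means 13 may be separate from the voice interaction device 4. For example, the user information may be stored on a magnetic card that the voice interaction device 4 reads. Alternatively, the user information may be stored on a mobile phone and transferred to the voice interaction device 4 by infrared communication.
[0065]
Here, user information means information specific to the user of the voice interaction device 4, such as the user's age. Information such as hearing and eyesight may also be included. Older users take longer to get used to the same voice interaction. Users in their twenties to forties, on the other hand, get used to voice interaction quickly and soon find it tedious to repeat the same interaction.
[0066]
In Embodiment 3 as well, whether to omit the interactive operation is decided by comparing the number of times the content has been used with a threshold. If this threshold is determined based on, for example, the user's age, voice interaction best suited to the user can be provided.
[0067]
Similarly, the time needed to get used to the same voice interaction differs between users with weak eyesight or hearing and users with normal eyesight and hearing. In such cases too, a different threshold is used when evaluating the content use count.
[0068]
Furthermore, when the voice interaction device 4 is used on aircraft or passenger ships, or at airports and the like, not all users necessarily understand the same language. Nationality, language used, and the like may therefore be stored as user information. For example, voice guidance in English is hard for Japanese users to follow, and it takes them time to get used to it. In such cases the number of uses required before the interactive operation is omitted needs to be increased. Storing nationality and language as user information makes it possible to reduce interactive operations in a way that suits the user.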
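A threshold derived from user information might look like the sketch below. The direction of each adjustment follows the description (users in their twenties to forties adapt quickly, older users and users hearing guidance in a foreign language need more repetitions), but every number, the age cutoff of 65, and the function name are hypothetical choices made only for illustration.

```python
def threshold_for_user(
    age: int,
    guidance_language: str,
    user_language: str,
    base: int = 3,
) -> int:
    """Return the use count after which the dialogue is answered by proxy.

    All step sizes are assumed values, not taken from the patent.
    """
    threshold = base
    if 20 <= age < 50:
        threshold -= 1  # adapts quickly, so skip the dialogue sooner
    elif age >= 65:
        threshold += 1  # assumed cutoff for users who need more repetitions
    if guidance_language != user_language:
        threshold += 2  # guidance in an unfamiliar language takes longer
    return max(1, threshold)
```

The lower bound of 1 keeps the first use interactive, since no instruction has been stored before the first use.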
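[0069]
The processing of the voice interaction device 4 according to Embodiment 3 follows the flowchart of FIG. 4 and differs only in that step S103 performs the processing described above. Its description is therefore omitted.
[0070]
As is clear from the above, the voice interaction device 4 according to Embodiment 3 omits voice interaction in accordance with circumstances specific to the user, so appropriate voice interaction can be provided.
[0071]
Embodiment 4.
In the voice interaction devices 4 described in Embodiments 1 to 3, the content interpretation means 9 selects an operation based on the operation instructions stored in the history storage means 11. Embodiment 4, in contrast, is characterized by deciding whether to adopt an operation instruction recorded in the history storage means 11 based on the date and time at which the operation instruction for the content 2 was recorded.
[0072]
While a user is using the content 2 frequently, the user gets used to its interactive operations and finds it burdensome to repeat the same operations. By contrast, a user who used the content 2 frequently at some time in the past but returns to it after a long interval does not necessarily remember its contents. If automatic responses are made as before in such a case, the user may be unable to follow the transitions in the content 2 and become confused. The voice interaction device 4 according to Embodiment 4 addresses this problem.
[0073]
The configuration of the voice interaction device 4 according to Embodiment 4 is as shown in FIG. 2 and is the same as that of the voice interaction device according to Embodiment 1, so its description is omitted.
[0074]
Next, the operation of the voice interaction device 4 is described. FIG. 8 is a flowchart showing the processing of the voice interaction device 4. Steps given the same reference signs as in the flowchart of Embodiment 1 are the same as in Embodiment 1, so their description is omitted. The portion enclosed in a dotted rectangle and labeled S103 corresponds to step S103 of Embodiment 1, shown in more detail for the explanation of Embodiment 4. In addition, step S106-2 in the flowchart of FIG. 8 differs from Embodiment 1; it corresponds to step S106 of Embodiment 1. The following description therefore covers only step S103 and step S106-2.
[0075]
Step S103 is described first. Step S103 consists of steps S301 to S304, all of which are performed by the content interpretation means 9. They compare the use count with a predetermined value (hereinafter called the threshold) and compare the time elapsed from the last use to the present with another predetermined value (hereinafter simply called the predetermined value).
[0076]
In step S301 of the flowchart in FIG. 8, the last-use time of the content 2 is obtained by referring to the usage history stored in the history storage means 11. The last-use time of the content 2 is stored by the history storage means 11 in step S106-2, described later.
[0077]
Next, in step S302, the last-use time is subtracted from the current time to obtain the time elapsed since the last use, and it is checked whether this elapsed time is less than or equal to the predetermined value. If it is, processing proceeds to step S303 (step S302: YES). If the elapsed time exceeds the predetermined value, processing leaves step S103 and proceeds to step S104 (step S302: NO). Thus, when more than a certain time has passed since the user's last use, step S104 and the subsequent steps are performed and the user's utterance is recognized.
[0078]
Next, in step S303, the use count of the content 2 is calculated by referring to the usage history stored in the history storage means 11, and in step S304 it is evaluated whether this use count is greater than or equal to the threshold. If it is, processing proceeds to step S108 (step S304: YES). If the use count is less than the threshold, processing proceeds to step S104 (step S304: NO). This completes the detailed processing of step S103. Next, the processing of step S106-2 is described.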
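The two checks of steps S301 to S304 can be sketched as a single predicate. The function and parameter names are assumptions, and treating a missing last-use time as "do not use the stored instruction" is a choice made for this sketch rather than something stated in the description.

```python
from datetime import datetime, timedelta
from typing import Optional


def use_stored_instruction(
    last_used: Optional[datetime],
    use_count: int,
    now: datetime,
    max_gap: timedelta,
    threshold: int,
) -> bool:
    """Return True to go to S108 (proxy), False to go to S104 (prompt the user)."""
    if last_used is None:
        return False  # assumption: no recorded use means no proxy response
    # S302: too long since the last use, so fall back to normal dialogue.
    if now - last_used > max_gap:
        return False
    # S304: enough recent uses to answer on the user's behalf.
    return use_count >= threshold
```

The elapsed-time check comes first, so a long gap forces a normal dialogue even when the use count is high.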
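[0079]
In step S106-2, the history storage means 11 stores the operation instruction converted by the voice recognition means 10 in association with the control ID acquired by the control ID acquisition means 8, and also stores, together with the operation instruction, the time at which the user gave it. The way the history storage means 11 stores the user's operation instructions is the same as in Embodiment 1, so a detailed description is omitted.
[0080]
As is clear from the above, the voice interaction device 4 according to Embodiment 4 does not respond automatically when the predetermined time has elapsed since the content was last used, so routine voice responses are reduced only within an appropriate range.
[0081]
Embodiment 5.
Embodiments 1 to 4 described cases in which the voice interaction device itself has the function of reducing voice interaction. It is also conceivable to combine a voice interaction device that lacks this function with equipment that provides it. The voice interaction proxy device according to Embodiment 5 is such equipment.
[0082]
FIG. 9 is a block diagram showing the configuration of the voice interaction proxy device according to Embodiment 5 and of the voice interaction device used in combination with it. Components given the same reference signs as in FIG. 1 are the same as in Embodiment 1, so their description is omitted. The voice interaction device 14 in the figure is a device with which content acquired via the network can be operated by the user's utterances. The voice interaction proxy device 15 is used in combination with the voice interaction device 14 and reduces the effort of voice interaction on the voice interaction device 14.
[0083]
FIG. 10 is a block diagram showing the detailed configuration of the voice interaction device 14 and the voice interaction proxy device 15. Components given the same reference signs as in FIG. 2 are the same as in Embodiment 1, so their description is omitted. In the voice interaction device 14, the notification means 31 informs the user of the message 5, which prompts an interactive operation, when the content acquisition means 7 acquires the content 2. Specifically, it consists of a display device, a speaker, or the like. When the notification means 31 announces the message 5 through a speaker, it synthesizes the message 5 as speech.
[0084]
The voice interaction proxy device 15 is connected to the voice interaction device 14 via RS232C, USB (Universal Serial Bus), or some other bus. Any connection method may be used as long as it allows electrical signals, digital signals, or audio signals to be exchanged with the voice interaction device 14. Alternatively, the proxy device may be equipped with a microphone so that the audio output by the notification means 31 of the voice interaction device is input directly as sound.
[0085]
In the voice interaction proxy device 15, the history storage means 32 stores the message 5 output by the notification means 31 and the user's operation instructions. Specifically, it is configured from a non-volatile storage device such as a hard disk device or flash memory. The utterance playback means 33 captures operation instructions spoken by the user, has the history storage means 32 store the utterance, and outputs the user's utterance to the voice recognition means 10 of the voice interaction device 14. In addition, when the user does not speak, the utterance playback means 33 plays back the user's utterance stored in the history storage means 32 and outputs it to the voice recognition means 10 of the voice interaction device 14, thereby performing the voice interaction on the user's behalf.
[0086]
Next, the processing of the voice interaction device 14 and the voice interaction proxy device 15 is described. FIG. 11 is a flowchart showing the processing of the voice interaction device 14 and the voice interaction proxy device 15. In step S401 in the figure, the content acquisition means 7 of the voice interaction device 14 acquires the content 2 from the content storage unit 1 through the network 3. This processing is the same as step S101 in Embodiment 1, so its description is omitted.
[0087]
Next, in step S402, the notification means 31 of the voice interaction device 14 announces the contents of the content 2. As described above, to announce the message 5 the notification means 31 synthesizes it as speech and outputs it from a speaker (not shown), or displays it on a display device. It also outputs the message 5 to the voice interaction proxy device 15 via the RS232C interface or bus.
[0088]
Next, in step S403, the history storage means 32 of the voice interaction proxy device 15 checks whether it has stored a user utterance for the message 5. If it has, it outputs the user's utterance to the utterance playback means 33 and processing proceeds to step S404 (step S403: YES). If not, processing proceeds to step S406 (step S403: NO). Step S406 and the subsequent processing are described later.
[0089]
In step S404, the utterance playback means 33 plays back the user's utterance output by the history storage means 32 as audio data. The played-back audio data is transmitted as an electrical signal via the RS232C interface or bus connecting the voice interaction device 14 and the voice interaction proxy device 15. Alternatively, the utterance playback means 33 may itself play the utterance as actual sound through a speaker so that it is picked up by the microphone of the voice interaction device 14.
[0090]
Finally, in step S405, the voice recognition means 10 recognizes the user's utterance and converts it into an operation instruction. This processing is the same as step S105 in Embodiment 1, so its description is omitted.
[0091]
If, in step S403, the history storage means 32 of the voice interaction proxy device 15 has not stored a user utterance for the message 5 (step S403: NO), step S406 is executed. In step S406, the history storage means 32 associates the user's utterance with the message 5 and stores them in a non-volatile storage device (not shown) such as a hard disk device or flash memory.
[0092]
In this case the history storage means 32 outputs nothing, and accordingly the utterance playback means 33 outputs nothing either. As a result, the voice recognition means 10 of the voice interaction device 14 waits for input. When the user then utters an operation instruction, the voice recognition means 10 recognizes the utterance in step S405 and converts it into an operation instruction.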
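The branch at step S403 can be sketched as a small function on the proxy side. It assumes that utterances are kept as raw audio bytes in a dictionary keyed by the message text and that a callable named listen returns the user's live utterance; these names and types are illustrative assumptions, not part of the patent.

```python
from typing import Callable, Dict


def proxy_step(
    message: str,
    store: Dict[str, bytes],
    listen: Callable[[], bytes],
) -> bytes:
    """Return the audio that reaches the recognizer for this message."""
    if message in store:
        # S403 YES -> S404: replay the utterance recorded earlier.
        return store[message]
    # S403 NO: nothing is replayed, so the recognizer waits for the user.
    utterance = listen()
    # S406: remember the utterance for the next time this message appears.
    store[message] = utterance
    return utterance
```

On the first call for a message the user has to speak; on later calls listen is not invoked and the stored audio is returned.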
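[0093]
As is clear from the above, the voice interaction proxy device 15 makes it possible to add a labor-saving function to equipment such as the voice interaction device 14, which has no means of reducing voice interaction.
[0094]
In step S403, when the history storage means 32 has stored an utterance for the message 5, processing need not move unconditionally to playing back that utterance. The move to utterance playback may instead be decided by a condition check such as those performed in Embodiments 1 to 4.
[0095]
The voice interaction device 14 may also output a control ID representing the dialogue operation control information of the content 2, and the voice interaction proxy device 15 may store the user's utterance in association with this control ID.
[0096]
Furthermore, even when the history storage means 32 has stored an utterance for the message 5, utterance playback need not be performed immediately. The device may instead wait for a fixed time and perform utterance playback only if the user does not speak during that time. In this way a previously given operation instruction is normally played back, and the user needs to speak only when a special operation instruction is really necessary for the operation to proceed.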
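The wait-then-replay variant can be sketched as follows. The callable wait_for_user is assumed to block for up to timeout_s seconds and return the live utterance, or None if the user stays silent; both the callable and the default of 3 seconds are assumptions made for this sketch.

```python
from typing import Callable, Optional


def utterance_for_recognizer(
    stored: Optional[bytes],
    wait_for_user: Callable[[float], Optional[bytes]],
    timeout_s: float = 3.0,
) -> Optional[bytes]:
    """Prefer a live utterance; fall back to the stored one after the wait."""
    live = wait_for_user(timeout_s)
    if live is not None:
        return live  # the user chose to give a different instruction
    return stored  # may be None when nothing was recorded yet
```

A shorter wait makes the proxy response feel faster, while a longer wait gives the user more time to override it.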
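[0097]
[Effects of the Invention]
The voice interaction device according to the present invention comprises history storage means that stores, as usage history, the operation instructions given by the user when using content; dialogue operation proxy means that outputs the operation instructions stored as usage history as operation instructions conforming to the dialogue operation control information of the content; and content interpretation means that determines the operation based on the operation instruction output by the dialogue operation proxy means when the use count of the content is greater than or equal to a predetermined value based on hardware environment information. It therefore has the effect of automating routine voice interactions and providing an easy-to-use voice interaction interface.
[0098]
The voice interaction proxy device according to the present invention comprises history storage means that stores the content output by a voice interaction device lacking a function for reducing voice interaction in association with the user's utterance, and utterance playback means that plays back the user's utterance stored in the history storage means when the use count of the content is greater than or equal to a predetermined value based on hardware environment information, and it outputs the user's utterance to the voice interaction device. It therefore has the effect of automating routine voice interactions and providing an easy-to-use voice interaction interface even for a voice interaction device that lacks a function for reducing voice interaction.
[Brief Description of the Drawings]
[FIG. 1] A block diagram showing the configuration of Embodiment 1 of the present invention.
[FIG. 2] A block diagram showing the detailed configuration of Embodiment 1 of the present invention.
[FIG. 3] A program listing showing an example of the content in Embodiment 1 of the present invention.
[FIG. 4] A flowchart of the processing of Embodiment 1 of the present invention.
[FIG. 5] A block diagram showing the detailed configuration of Embodiment 2 of the present invention.
[FIG. 6] A flowchart of the processing of Embodiment 2 of the present invention.
[FIG. 7] A block diagram showing the detailed configuration of Embodiment 3 of the present invention.
[FIG. 8] A flowchart of the processing of Embodiment 4 of the present invention.
[FIG. 9] A block diagram showing the configuration of Embodiment 5 of the present invention.
[FIG. 10] A block diagram showing the detailed configuration of Embodiment 5 of the present invention.
[FIG. 11] A flowchart of the processing of Embodiment 5 of the present invention.
[Explanation of Reference Signs]
1: content storage unit, 2: content, 3: network,
4: voice interaction device, 5: message, 6: user's utterance,
7: content acquisition means, 8: control ID acquisition means, 9: content interpretation means,
10: voice recognition means, 11: history storage means, 12: dialogue operation proxy means,
13: user information storage means, 14: voice interaction device, 15: voice interaction proxy device,
31: notification means, 32: history storage means, 33: utterance playback means,
34: user's utterance
[0001]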
[Technical Field of the Invention]
The present invention relates to a voice interaction device, a voice interaction proxy device, and programs for them, with which content that can be acquired via a network and is used through interactive operation is used by voice. In particular, it relates to such devices and programs that reduce the number of operations so that the desired content can be obtained in a short time.
[0002]
[Prior art]
In recent years, voice portals that make Internet services usable by voice over the telephone have been increasing. Examples include NTT Communications Corporation's "V Portal" (URL: http://www.ntt.com/v-portal/) and the Telephone Broadcasting Corporation's "Osaka Voice Portal" (URL: http://www.vpsite.net/).
[0003]
These services take content on the Internet that was originally expressed as text and provide it to the user as speech through speech synthesis. When the content includes interactive operations, the text that prompts the operation is converted into voice guidance, and where operation instructions would normally have to be entered with a keyboard or mouse, speech recognition is used to convert the user's utterance into an operation instruction.
[0004]
Interactive operation that combines voice guidance with voice input differs in the following respects from the interactive operation that ordinary Internet content assumes, which combines on-screen text with operation instructions given by keyboard or mouse.
[0005]
For example, voice guidance and utterances are unclear in meaning unless they are completed as language. Playing voice guidance or finishing an utterance therefore takes several seconds or more. As a result, voice dialogue takes longer than dialogue in which character strings are displayed on a screen and operation instructions are given with a keyboard or mouse. A user who uses the same content by voice again and again must wait a long time to reach the information actually needed, despite performing much the same operations every time.
[0006]
Interfaces for operating information devices through voice interaction are seen as a promising way for drivers to obtain information without taking their eyes off the road as ITS (Intelligent Transport System) spreads. In particular, DSRC (Dedicated Short Range Communication) technology is expected to make it possible to supply advanced information while driving. The inconvenience described above therefore needs to be resolved if voice interaction interfaces are to become widespread.
[0007]
As a technique aimed at solving this operability problem of voice dialogue, a method has been proposed in which the order of information provision is changed to suit the user, so that the operations needed to reach frequently used information can be omitted (for example, Patent Document 1).
[0008]
[Patent Document 1]
JP 2000-270105, "Voice Response System" (FIG. 1, FIG. 7, pages 3 to 5)
[0009]
[Problems to be solved by the invention]
However, the above method changes the way information is provided for each user on the side of the server that supplies the information. The server must therefore store an information provision order specific to each user. The Internet today holds an enormous amount of content, and it is not realistic to change the order of information provision per user for each item of that content and to store those changes as well.
[0010]
The present invention has been made to solve the above-described problems, and reduces the number of routine interactive operations in an interface for operating contents acquired through a network by voice.
[0011]
[Means for Solving the Problems]
The voice interaction device according to the present invention is a voice interaction device comprising content acquisition means, history storage means, content interpretation means, dialogue operation proxy means, and voice recognition means, wherein:
the content acquisition means acquires, through a network, content having dialogue operation control information that defines a message prompting the user for an operation instruction and a different operation for each operation instruction, and the history storage means stores, as usage history, the operation instructions the user has given so far for the content acquired by the content acquisition means;
the content interpretation means, when the usage history stored in the history storage means satisfies a predetermined condition, determines the operation defined in the dialogue operation control information of the content based on the operation instruction output by the dialogue operation proxy means and, when the usage history does not satisfy the predetermined condition, presents the message of the dialogue operation control information of the content to the user and determines the operation based on the operation instruction output by the voice recognition means; specifically, it calculates the number of times the content has been used from the usage history stored in the history storage means, calculates a predetermined value based on hardware environment information, and, when the number of times is greater than or equal to the predetermined value, determines the operation based on the operation instruction output by the dialogue operation proxy means, whereas when the number of times is less than the predetermined value it presents the message to the user and determines the operation based on the operation instruction output by the voice recognition means;
the dialogue operation proxy means outputs the operation instruction of the usage history stored in the history storage means; and
the voice recognition means recognizes the utterance made by the user in response to the message presented by the content interpretation means, outputs it as an operation instruction for the dialogue operation control information of the content, and causes the history storage means to store that operation instruction.
[0012]
The voice interaction proxy device according to the present invention is used together with a voice interaction device comprising:
content acquisition means that acquires, through a network, content that performs different operations according to the user's operation instruction and includes a message prompting the user for an operation instruction;
voice recognition means that converts the user's utterance, by speech recognition, into an operation instruction for the content acquired by the content acquisition means and outputs it;
notification means that informs the user of the message of the content acquired by the content acquisition means; and
content interpretation means that determines an operation based on the operation instruction output by the voice recognition means.
The voice interaction proxy device comprises:
history storage means that stores, as usage history, the content acquired by the content acquisition means in association with the user's utterance;
user information storage means that stores user information; and
utterance playback means that, when the history storage means has stored the user's utterance for the content used by the user, plays back the user's utterance and outputs it to the voice recognition means, the utterance playback means calculating the number of times the content has been used from the usage history stored in the history storage means, calculating a predetermined value based on the user information stored in the user information storage means, and playing back the user's utterance and outputting it to the voice recognition means when the number of times is greater than or equal to the predetermined value.
[0013]
Embodiments of the present invention will be described below.
Embodiment 1.
FIG. 1 is a block diagram showing the configuration of a voice interaction device according to Embodiment 1 of the present invention. In the figure, the content storage unit 1 is a device that stores content and supplies it to the user via a network. Specifically, the content storage unit 1 is a server device built on a computer. The content 2 is content supplied by the content storage unit 1. Here, content is a collective term for information used by the user; specifically, it includes information supplied in structured document formats such as HTML (HyperText Markup Language) and XML (eXtensible Markup Language), and in other binary formats. The network 3 is a communication path for sending and receiving digital data bidirectionally, such as a LAN or a telephone line. Any network that serves this purpose may be used, whether wired or wireless.
[0014]
The voice interaction device 4 is a voice interaction device according to the first embodiment, and is a device that acquires the content 2 and provides it to the user. The message 5 is a message for prompting an interactive operation with respect to the content 2, and is provided to the user by voice, text, or icon. The utterance 6 is a voice uttered in response to the message 5 for the user to give an operation instruction to the content 2. The voice interaction device 4 interprets the utterance 6 by voice recognition and converts it into an operation instruction in a format suitable for the interaction operation control information of the content 2.
[0015]
Here, dialogue operation control information is program code, embedded in or associated with the content 2, for executing dialogue operation processing. If the content 2 is HTML or XML, such dialogue operation control information is often implemented with JavaScript, VoiceXML, or a combination of HTML and CGI programs. The content 2 in Embodiment 1 does not, however, necessarily have to be built on the assumption of voice interaction.
[0016]
Next, the configuration of the voice interaction device 4 will be described. FIG. 2 is a block diagram showing a detailed configuration of the voice interaction device 4. In the figure, the content acquisition means 7 is a part that acquires the content 2 via the network 3, and specifically, acquires the content 2 by performing network input / output.
[0017]
The control ID acquisition means 8 acquires the control ID assigned to the dialogue operation control information of the content 2. The control ID is an identifier assigned to dialogue operation control information that identifies it uniquely. For example, if the content 2 is HTML data, a specific tag may be used as the identifier; for data without such tags, a line number or an offset from the beginning of the data (the start address of the item when the beginning of the data is taken as address 0) may be used.
[0018]
The content interpretation means 9 analyzes the content 2 acquired by the content acquisition means 7 and informs the user of the message 5, which prompts an interactive operation, through a display device, speaker, or the like (not shown). In accordance with the user's operation instruction, it also selects one of the operations predefined in the dialogue operation control information and, in some cases, executes that operation.
[0019]
When the user utters an operation instruction in response to the message 5, the voice recognition means 10 collects the utterance with a microphone, performs voice recognition on the collected utterance, and converts it into an operation instruction that conforms to the interactive operation control information of the content 2.
[0020]
The history storage unit 11 is a part that stores a history of access to the content 2 by the user as a usage history. Specifically, it is configured by a non-volatile storage device such as a hard disk device or a flash memory, and stores the operation instruction recognized by the voice recognition unit 10 in association with the identifier acquired by the control ID acquisition unit 8.
[0021]
The dialogue operation proxy unit 12 is a part that, in order to automatically perform the dialogue operation requested by the content 2 through the dialogue operation control information, refers to the usage history stored in the history storage unit 11 and acquires and outputs an operation instruction.
[0022]
Next, details of the content 2 will be described. FIG. 3 shows an example of the content 2. The listing in the rectangle 20 in FIG. 3 is content written in conformity with the VoiceXML language. The number and colon at the left end of each line form a line number added for explanation. In the following description, a character string (token) enclosed between the character < and the character > is called a tag.
[0023]
In the figure, each span of lines starting with a <form id> tag and ending with a </form> tag defines a dialogue operation process performed when the content 2 is used. As such dialogue operation processes, the example of FIG. 3 shows the dialogue operation control information from the 2nd line to the 16th line, which starts with <form id = “confirmation of description output”> (hereinafter simply referred to as dialogue operation control information 21), and the dialogue operation control information from the 17th line to the 21st line, which starts with <form id = “output of description”> (hereinafter simply referred to as dialogue operation control information 22).
[0024]
Next, the operation of the voice interaction device 4 will be described. FIG. 4 is a flowchart showing processing of the voice interaction device 4. In the figure, step S101 is performed by the content acquisition unit 7, which acquires the content 2 from the content storage unit 1 via the network 3. For acquiring the content 2, for example, FTP (File Transfer Protocol) or HTTP (Hypertext Transfer Protocol) is used.
[0025]
Next, in step S102, the control ID acquisition unit 8 acquires the control ID of the dialogue operation control information 21. In the example of FIG. 3, the value of the <form id> tag is not used more than once in the content 2, so this value can be used as an identifier. To handle a plurality of contents in the voice interaction device 4, a combination of the content name or the URL of the content and the value of the <form id> tag (e.g., http: //www.content 2 # Output etc.) may be used as an identifier.
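The identifier acquisition of step S102 can be illustrated with a minimal sketch. It assumes that the content is well-formed VoiceXML without namespaces and that the id attribute of each form element serves as the control ID. The function name acquire_control_ids, the sample content, and the positional fallback identifier are hypothetical and are not taken from FIG. 3.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample content for illustration only (not the listing of FIG. 3).
SAMPLE_CONTENT = """<vxml version="2.0">
  <form id="confirm_description">
    <field name="YN"><prompt>Do you need a description of the system?</prompt></field>
  </form>
  <form id="output_description">
    <block>Description text.</block>
  </form>
</vxml>"""


def acquire_control_ids(content_text, content_url=None):
    """Return one control ID per <form> element.

    When content_url is given, each ID is qualified as "<url>#<form id>" so
    that identifiers stay unique across several contents.
    """
    root = ET.fromstring(content_text)
    control_ids = []
    for index, form in enumerate(root.iter("form")):
        # Assumption: fall back to the position of the form when it has no id,
        # in the spirit of the line-number or offset alternative.
        form_id = form.get("id", f"form-{index}")
        control_ids.append(f"{content_url}#{form_id}" if content_url else form_id)
    return control_ids
```

Qualifying the identifier with the URL corresponds to the case where the device handles more than one content.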
[0026]
In step S103, the content interpretation unit 9 determines whether or not the user's usage condition of the content 2 matches a predetermined condition. The usage condition of the content 2 means the user's current access status with respect to the content 2: for example, whether or not the user has accessed the content 2 before and, if so, how often. This information is obtained by referring to the usage history stored in the history storage unit 11. The predetermined condition is, for example, “whether it is the first access or not” or “whether the number of past accesses is a predetermined number or more”.
[0027]
For example, if the predetermined condition is “whether it is the first access or not”, the content interpretation unit 9 searches the usage history to check whether an access history for the content 2 can be acquired. As a result, if the content 2 has been accessed in the past, the result of step S103 is YES. If it has not been accessed, the result of step S103 is NO.
[0028]
Similarly, if the predetermined condition is “whether the number of past accesses is equal to or greater than a predetermined number”, the content interpretation unit 9 searches the usage history and calculates the number of accesses to the content 2. As a result, if this number is equal to or greater than the predetermined number, the result of step S103 is YES. If the predetermined number has not been reached, the result of step S103 is NO.
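The two example conditions of step S103 can be sketched as follows. The usage history is assumed here to be a list of (content name, control ID, operation instruction) records, one per past access; this layout and the function names are hypothetical, since the text does not fix a storage format.

```python
def count_accesses(usage_history, content_name):
    """Count past accesses to one content in the usage history."""
    return sum(1 for name, _control_id, _operation in usage_history
               if name == content_name)


def meets_condition(usage_history, content_name, min_accesses=1):
    """Step S103: True corresponds to YES, False to NO.

    min_accesses=1 models "has the content been accessed before";
    a larger value models "accessed a predetermined number of times or more".
    """
    return count_accesses(usage_history, content_name) >= min_accesses
```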
[0029]
When the user accesses the content 2 for the first time, the predetermined condition is not satisfied, so the determination result in step S103 is NO. Therefore, first, a process when the determination result of step S103 is NO will be described. In this case, the process proceeds to step S104 (step S103: NO).
[0030]
In step S104, the content interpretation unit 9 outputs a message included in the dialogue operation control information as the message 5 to prompt the user to perform a dialogue operation. The message included in the dialogue operation control information is, in the example of the dialogue operation control information 21, a message such as “Do you need a description of the system?” defined by the <prompt> tag. In this example the message is represented as a character string, but the information may instead be presented in combination with image data such as an icon, or with image data alone in a form the user can understand.
[0031]
In step S105, the voice recognition means 10 recognizes the user's utterance and converts it into an operation instruction. That is, when the user utters an operation instruction in response to this message or voice guidance, the voice recognition means 10 recognizes the utterance and converts it into an operation instruction. This voice recognition process may be realized using a general voice recognition dictionary. Alternatively, the content of the <filled> tag (from the 8th line to the 14th line) of the dialogue operation control information 21 may be analyzed to extract, for example, the character string “No” in the 9th line, and the utterance may be matched against voice data for “No”.
[0032]
Next, in step S106, the history storage unit 11 associates the operation instruction converted by the voice recognition unit 10 with the control ID acquired by the control ID acquisition unit 8, and stores it as a use history. When there is an operation instruction that is already associated with this control ID and stored as a use history, the history storage unit 11 evaluates whether the new operation instruction is the same as the already stored operation instruction. Only in the case of a different operation instruction, the already stored operation instruction is deleted, and a new operation instruction and a control ID are stored in association with each other.
[0033]
Note that when the storage capacity of the voice interaction device 4 has a margin, the existing operation instruction is not overwritten in this way, but a process of constantly adding a new operation instruction may be performed. In this way, a plurality of operation instructions are stored for one control ID. Therefore, in this case, the latest operation instruction (the last operation instruction added to the use history) is used. Alternatively, when there are a plurality of operation instructions stored in the history storage unit 11 for a certain control ID, the operation instruction with the highest frequency may be selected from the operation instructions.
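The two storage policies described for step S106 (overwriting versus appending) and the two ways of selecting an instruction (latest or most frequent) can be sketched as follows. The class UsageHistory and its method names are hypothetical, and an in-memory dictionary stands in for the non-volatile storage device.

```python
from collections import Counter


class UsageHistory:
    """Operation instructions keyed by control ID."""

    def __init__(self, append_mode=False):
        # append_mode=False: keep a single instruction per control ID.
        # append_mode=True: keep every instruction (needs spare storage capacity).
        self.append_mode = append_mode
        self._records = {}

    def store(self, control_id, operation):
        operations = self._records.setdefault(control_id, [])
        if self.append_mode:
            operations.append(operation)
        elif operations != [operation]:
            # Replace the stored instruction only when the new one differs.
            operations[:] = [operation]

    def latest(self, control_id):
        """Return the instruction added last, or None if nothing is stored."""
        operations = self._records.get(control_id)
        return operations[-1] if operations else None

    def most_frequent(self, control_id):
        """Return the instruction stored most often, or None if nothing is stored."""
        operations = self._records.get(control_id)
        if not operations:
            return None
        return Counter(operations).most_common(1)[0][0]
```

In overwrite mode, latest and most_frequent return the same value because only one instruction is kept per control ID.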
[0034]
Finally, in step S107, the content interpretation unit 9 acquires the operation instruction converted by the voice recognition unit 10, and selects an action defined in the dialogue operation control information according to the operation instruction. For example, in the case of the dialogue operation control information 21, when the user says “No”, the voice recognition means 10 recognizes this and outputs it as an operation instruction. The content interpretation means 9 then substitutes this operation instruction into the <if cond = “YN ==“ no ””> on the ninth line and evaluates it. In this case, the evaluation value is “true”, and the tenth line is selected as the action. As a result, the content interpretation means 9 interprets and executes the 10th line (<goto next = “# next process” />), and processes the <form id> tag starting from the 22nd line.
[0035]
On the other hand, when the user says something other than “No”, the voice recognition unit 10 recognizes this and outputs it to the content interpretation unit 9. The content interpretation means 9 substitutes this operation instruction into the 9th line and evaluates it. As a result, the evaluation value becomes “false” and the 12th line is selected as the action. The content interpretation unit 9 therefore interprets and executes the 12th line (<goto next = “# output of description” />), and processes the dialogue operation control information 22 starting from the 17th line.
[0036]
The above is the processing when the predetermined condition is not met. Next, the processing when the predetermined condition is met (step S103: YES) will be described. In this case, the process proceeds to step S108, where the dialogue operation proxy unit 12 acquires, from the usage history stored in the history storage unit 11, the operation instruction stored in association with the control ID of the dialogue operation control information. For example, suppose that for the control ID “confirmation of description output” the user gave an operation instruction such as “No” in the past. The history storage unit 11 then stores the control ID “confirmation of description output” and the operation instruction “No” in association with each other. In this case, the dialogue operation proxy means 12 acquires the operation instruction “No” associated with the control ID “confirmation of description output” from the usage history and outputs it.
[0037]
Next, in step S107 again, the content interpretation unit 9 acquires the operation instruction output by the dialogue operation proxy unit 12, and selects an action defined in the dialogue operation control information according to this operation instruction. The subsequent processing is the same as the processing when the predetermined condition is not met.
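The branch between the spoken path (steps S104 to S106) and the proxy path (step S108), both of which end in step S107, can be summarized in a minimal sketch. The function run_dialogue_step and its callable parameters are hypothetical stand-ins for the means described above, and the usage history is simplified to a dictionary from control ID to operation instruction. Falling back to the spoken path when no instruction is stored is an assumption of this sketch.

```python
def run_dialogue_step(control_id, history, condition_met,
                      prompt, recognize, select_action):
    """Process one piece of dialogue operation control information.

    history        -- dict mapping control ID to the stored operation instruction
    condition_met  -- result of step S103 (True corresponds to YES)
    prompt         -- callable that outputs message 5 (step S104)
    recognize      -- callable that returns the recognized instruction (step S105)
    select_action  -- callable that maps an instruction to an action (step S107)
    """
    stored = history.get(control_id)
    if condition_met and stored is not None:
        operation = stored               # S108: proxy response from the history
    else:
        prompt()                         # S104: prompt the user
        operation = recognize()          # S105: recognize the utterance
        history[control_id] = operation  # S106: store in the usage history
    return select_action(operation)      # S107: select the action
```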
[0038]
As is clear from the above, in the voice interaction device 4 the dialogue operation proxy means 12 inputs, on the user's behalf, the operation that the user performed in the past, so that the dialogue operation is omitted for content on which the user has already performed that dialogue operation by voice. For this reason, when the user uses the same content, it is not necessary to repeat the voice interaction operation many times, and it is possible to provide a user-friendly voice interaction interface.
[0039]
Further, since the use history is not stored on the server side such as the content storage unit 1 but is stored on the voice interactive device 4 side which is the terminal side, the use history can be managed for each user. Accordingly, it is possible to save labor for dialog operations using routine voices in accordance with the content preferences and operation procedures for each user.
[0040]
In the first embodiment, the content 2 conforming to VoiceXML has been described as an example. However, the content to be used is not limited to such a format.
[0041]
In the first embodiment, since the content 2 has a plurality of pieces of interactive operation control information, the control ID acquisition unit 8 is used to identify them. However, the control ID acquisition means 8 may be omitted when dealing with content that does not have a plurality of interactive operation control information. In this case, the content name or the URL of the content may be associated with the operation instruction and stored as a usage history.
[0042]
Furthermore, even in the case of content having a plurality of interactive operation control information (contents in which the interactive operation consists of a plurality of steps), a series of flows (sequences) of user operation instructions for each step is handled as one operation. If it is stored as a group of instructions, the control ID acquisition means can also be omitted.
[0043]
Also, even when the determination in step S103 finds that the predetermined condition is met, the content interpretation unit 9 may still provide some information to the user based on the message 5. In this case, since the dialogue operation proxy means 12 makes a proxy response, the message notified to the user need not prompt a dialogue operation; it may be a modified message such as “a proxy response is made”.
[0044]
Furthermore, it is naturally possible to configure functions similar to those of the voice interaction device 4 as a computer program that causes a computer to execute them. Such a computer program causes the computer to sequentially execute a program for the processing by the content acquisition means 7, a program for the processing by the speech synthesis means, a program for the processing by the speech recognition means 10, a program for the processing by the control ID acquisition means 8, a program for the processing by the history storage unit 11, a program for the processing by the dialogue operation proxy unit 12, and a program for the processing by the content interpretation unit 9.
[0045]
Embodiment 2.
In the first embodiment, it is determined whether to omit the interactive operation based on the number of times the content is used. In the second embodiment, it is further determined whether to omit the dialogue operation using the hardware environment information.
[0046]
FIG. 5 is a block diagram showing a configuration of the voice interactive apparatus according to the second embodiment. Components bearing the same reference numerals as in the voice interactive apparatus according to the first embodiment operate in the same way, and their description is omitted. The difference between the configuration of this figure and the configuration of FIG. 1 is that the content interpretation means 9 acquires hardware environment information.
[0047]
Here, the hardware environment information means information indicating the specifications of the devices included in the voice interaction device 4 and the specifications of the environment in which the voice interaction processing is performed. More specifically, it is information obtained by parameterizing factors that affect the recognizability and operability of content in the voice interaction device 4 for the user, such as the following:
(1) Whether or not the device has a display device or a keyboard device
(2) Whether or not it is used as an in-vehicle device
(3) Whether or not it is used as a mobile phone
Any other information may also be used, as long as it expresses a factor affecting the recognizability and operability of the content in a form that can be processed by a computer or the like.
[0048]
These pieces of information are recorded in a ROM (Read Only Memory) and read out using a BIOS (Basic Input Output System) program. Alternatively, system configuration information may be recorded as a file in a storage device (not shown) and read out.
[0049]
Next, the operation of the voice interaction apparatus 4 will be described using the hardware environment information (1) to (3) described above as an example. FIG. 6 is a flowchart showing processing of the voice interaction device 4. In this flowchart, processes denoted by the same reference symbols as in the flowchart of Embodiment 1 are the same as in Embodiment 1, and their description is omitted. The part enclosed by a dotted rectangle and denoted by reference symbol S103 corresponds to step S103 of the first embodiment, shown in more detail for the description of the second embodiment.
[0050]
Therefore, in the following description, only the details of step S103 will be described. These processes are all performed by the content interpretation means 9, which examines the magnitude relationship between the number of uses and a predetermined value. Here, the predetermined value is referred to as a “threshold value”, and its value is, for example, 3.
[0051]
In step S201 of the flowchart of FIG. 6, the usage count of the content 2 is calculated with reference to the usage history stored in the history storage unit 11. In step S202, hardware environment information is acquired. In step S203, the threshold value is changed based on the hardware environment information.
[0052]
Here, taking (1) to (3) above as examples, the influence of the hardware environment information on the recognizability and operability of the content 2, and the method of changing the threshold in step S203 in view of this influence, will be specifically described.
[0053]
The presence or absence of the display device or keyboard device of (1) affects the operability of the content 2. For example, when the voice interaction device 4 has a display device and also has an input device such as a keyboard and a mouse, a message prompting an operation instruction for the interaction operation control information is displayed on the screen, and operation instructions can be given using the input device. In such a case, since a message is displayed on the screen in addition to the voice guidance, the user can take in a lot of information in a very short time. Therefore, the user is expected to get used to the operation after fewer uses than when the same content is used through a device without such functions, and to come to find the voice interaction operation troublesome sooner.
[0054]
Therefore, in step S203, when hardware environment information indicating that no display device or keyboard device is installed in the voice interaction device 4 is acquired, the threshold value is kept at 3. On the other hand, when hardware environment information indicating that a display device or a keyboard device is installed in the voice interaction device 4 is acquired, the threshold value is changed to 1 or 2.
[0055]
The condition (2) regarding whether or not the device is a vehicle-mounted device affects the operability of the content 2 for the user. When the user is an automobile driver and intends to use the content 2 while driving, the voice interaction operation can be an effective user interface. However, although it is an interactive operation by voice, it is troublesome to perform the same interactive operation many times during driving in order to use one content. In addition, since the noise level is high and the voice recognition rate deteriorates in the automobile, the operation instruction by speech is tried many times. Therefore, in such a case, it is desired to perform a process that eliminates the need for a voice interactive operation with a smaller number of times using operation instructions performed in the past.
[0056]
Therefore, in step S203, when the hardware environment information indicating that the voice interaction device 4 is not a vehicle-mounted device is acquired, the threshold value remains at 3. On the other hand, when the hardware environment information indicating that the voice interaction device 4 is a vehicle-mounted device is acquired, the threshold value is changed to 1 or 2.
[0057]
The condition (3) of whether or not the device is a mobile phone also affects the operability of the content for the user. Considering its usage environment, a mobile phone is often used in a noisy environment, as a vehicle-mounted device is, and the speech recognition rate deteriorates. In addition, a mobile phone has a display device, and operation instructions can be given by numeric keypad operation.
[0058]
Therefore, in step S203, when the hardware environment information indicating that the voice interaction device 4 is not a mobile phone is acquired, the threshold value is kept at 3. On the other hand, when hardware environment information indicating that the voice interaction device 4 is a mobile phone is acquired, the threshold value is changed to 1 or 2.
[0059]
In step S204, the number of uses of content 2 calculated in step S201 is compared with the threshold calculated in step S203. If the number of uses is equal to or greater than the threshold, the process proceeds to step S108 (step S204: YES), and if the number of uses is less than the threshold, the process proceeds to step S104 (step S204: NO). Since the subsequent processing is the same as that of the first embodiment, the description thereof is omitted.
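Steps S203 and S204 can be sketched as follows, assuming the hardware environment information is available as three boolean flags corresponding to (1) to (3). The function and parameter names are hypothetical. The text allows a reduced threshold of 1 or 2; the value 2 used as the default here is an arbitrary choice for illustration.

```python
def adjust_threshold(has_display_or_keyboard=False, is_in_vehicle=False,
                     is_mobile_phone=False, default=3, reduced=2):
    """Step S203: lower the threshold when any of conditions (1) to (3) holds."""
    if has_display_or_keyboard or is_in_vehicle or is_mobile_phone:
        return reduced
    return default


def should_omit_dialogue(use_count, threshold):
    """Step S204: True means proceed to S108, False means proceed to S104."""
    return use_count >= threshold
```

For example, a content used twice would still be handled by spoken dialogue on a device with no display, keyboard, vehicle, or mobile phone flag set, but by proxy response on an in-vehicle device.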
[0060]
As is clear from the above, the voice interaction device 4 according to the second embodiment changes the number of uses required before the dialogue operation is omitted according to the hardware environment information. As a result, it is possible to improve the efficiency of the voice dialogue operation according to the device specifications of the voice dialogue device 4 and its usage environment, and to provide an easy-to-use voice dialogue operation interface.
[0061]
The voice interaction device 4 may also be provided with a function for detecting the external noise level, with the noise level converted into hardware environment information and output to the content interpretation means 9, so that the threshold value is changed dynamically based on the detected noise level.
[0062]
Embodiment 3.
In the second embodiment, the number of uses required before the interactive operation is omitted is changed according to the hardware environment information. In the third embodiment, by contrast, an example in which this number of uses is changed based on user-specific attributes will be described.
[0063]
FIG. 7 is a block diagram showing the configuration of the voice interactive apparatus according to the third embodiment. Components bearing the same reference numerals as in the voice interactive apparatus according to the first embodiment operate in the same way, and their description is omitted. The difference between the configuration of this figure and the configuration of FIG. 1 is that a user information storage means 13 is newly provided.
[0064]
The user information storage means 13 stores user information, and specifically comprises a nonvolatile storage device such as a hard disk or a flash memory. Note that the user information storage unit 13 may be separate from the voice interaction device 4. For example, the user information may be stored in a magnetic card, and the voice interaction device 4 may be configured to read the user information stored in the magnetic card. Alternatively, the user information may be stored in the mobile phone, and the user information may be transferred to the voice interaction device 4 by infrared communication.
[0065]
Here, the user information refers to information specific to the user who uses the voice interaction device 4, such as the user's age. Information such as hearing ability and visual acuity may also be included. An elderly user takes time to get used to even a repeated voice interaction operation. On the other hand, a user in their 20s to 40s, for example, gets used to a voice interaction operation in a short period and soon finds it bothersome to perform the same voice interaction operation many times.
[0066]
Also in the third embodiment, it is determined whether or not the dialogue operation is omitted by determining the magnitude relation between the content usage frequency and the threshold value. Therefore, for example, if this threshold value is determined based on the age of the user, an optimal voice interaction operation can be provided to the user.
[0067]
Similarly, the time required to get used to the same voice interaction operation differs depending on whether the user's visual acuity or hearing ability is weak or normal. Therefore, in such cases as well, the content usage count is evaluated using different threshold values.
[0068]
Furthermore, when the voice interaction device 4 is used on an aircraft or a passenger ship, or at an airport, not all users understand the same language. Therefore, nationality and language used may be stored as user information. For example, English voice guidance is difficult for Japanese users to follow and takes time to get used to. In such a case, it is necessary to increase the number of uses required before the interactive operation is omitted. If the nationality and the language used are stored as user information, labor saving of dialogue operations can be achieved in a form appropriate for the user even in such a case.
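The idea of deriving the threshold from user information can be sketched as follows. The text gives no concrete values, so the base value of 3 (borrowed from the second embodiment), the age boundary of 65, and the step of 1 are all assumptions made purely for illustration. The function and parameter names are likewise hypothetical.

```python
def threshold_from_user_info(age=None, guidance_language_understood=True, base=3):
    """Return the number of uses required before the dialogue operation is omitted."""
    threshold = base
    if age is not None:
        if 20 <= age < 50:
            # Users in their 20s to 40s get used to the operation quickly.
            threshold -= 1
        elif age >= 65:
            # Assumed boundary for users who need more time to get used to it.
            threshold += 1
    if not guidance_language_understood:
        # Guidance in an unfamiliar language takes longer to get used to.
        threshold += 1
    return max(1, threshold)
```

Hearing ability or visual acuity could be handled with a further adjustment of the same kind; the direction and size of that adjustment are not specified in the text and are therefore left out of the sketch.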
[0069]
Note that the processing of the voice interaction apparatus 4 according to the third embodiment is based on the flowchart of FIG. 4 and differs only in that step S103 performs the processing described above. Its description is therefore omitted.
[0070]
As is clear from the above, the voice interaction device 4 according to the third embodiment omits the voice interaction processing in accordance with the circumstances specific to the user, and therefore can provide an appropriate voice interaction operation.
[0071]
Embodiment 4.
In all of the spoken dialogue apparatuses 4 described in the first to third embodiments, the content interpretation unit 9 selects an operation based on the operation instruction stored in the history storage unit 11. On the other hand, the fourth embodiment is characterized in that whether to accept the operation instruction recorded in the history storage unit 11 is determined based on the date and time when the operation instruction for the content 2 is recorded.
[0072]
During a period in which the content 2 is used frequently, the user gets used to the interactive operation of the content 2 and finds it burdensome to repeat the same operation over and over. On the other hand, even if the content 2 was used frequently in the past, when it is used again after some time the user does not necessarily remember the content 2. In such a case, if an automatic response is made as before, the user may be unable to follow the transitions of the content 2 and may become confused. The voice interaction apparatus 4 according to the fourth embodiment addresses this problem.
[0073]
The configuration of the voice interaction device 4 according to the fourth embodiment is as shown in FIG. 2 and is the same as that of the voice interaction device according to the first embodiment, and thus the description thereof is omitted.
[0074]
Next, the operation of the voice interaction device 4 will be described. FIG. 8 is a flowchart showing the processing of the voice interaction apparatus 4. In this flowchart, processes denoted by the same reference symbols as in the flowchart of Embodiment 1 are the same as in Embodiment 1, and their description is omitted. The part enclosed by a dotted rectangle and denoted by reference symbol S103 corresponds to step S103 of the first embodiment, shown in more detail for the description of the fourth embodiment. Further, in the flowchart of FIG. 8, the process of step S106-2 differs from that of the first embodiment. Step S106-2 corresponds to the process of step S106 of the first embodiment. Therefore, in the following description, only step S103 and step S106-2 will be described.
[0075]
First, step S103 will be described. Step S103 consists of steps S301 to S304. These are all performed by the content interpreting means 9, which evaluates the magnitude relationship between the number of uses and a predetermined value (hereinafter referred to as the threshold value) and the magnitude relationship between the elapsed time from the last use to the current time and another predetermined value (hereinafter simply referred to as the “predetermined value”).
[0076]
In step S301 of the flowchart of FIG. 8, the last use time of the content 2 is acquired with reference to the use history stored in the history storage unit 11. Here, the last use time of the content 2 is stored in the history storage unit 11 in step S106-2, which is described later.
[0077]
Next, in step S302, the last use time is subtracted from the current time to obtain the elapsed time since the last use, and it is checked whether this elapsed time is equal to or less than the predetermined value. If the elapsed time is less than or equal to the predetermined value, the process proceeds to step S303 (step S302: YES). On the other hand, when the elapsed time exceeds the predetermined value, the process leaves step S103 and proceeds to step S104 (step S302: NO). As a result, when more than the predetermined time has elapsed since the user's last use, the processing from step S104 onward is performed, and the user's utterance is recognized by voice recognition.
[0078]
Next, in step S303, the usage history stored in the history storage unit 11 is referred to calculate the usage count of the content 2, and in step S304, it is evaluated whether the usage count is equal to or greater than a threshold value. If the number of times of use is greater than or equal to the threshold, the process proceeds to step S108 (step S304: YES). On the other hand, when the number of times of use is less than the threshold, the process proceeds to step S104 (step S304: NO). The above is the detailed processing of step S103. Next, the process of step S106-2 will be described.
[0079]
In step S106-2, the history storage unit 11 stores the operation instruction converted by the voice recognition unit 10 and the control ID acquired by the control ID acquisition unit 8 in association with each other, and additionally stores the time at which the user issued the operation instruction together with the operation instruction. Since the method by which the history storage unit 11 stores the user's operation instruction is the same as in the first embodiment, the detailed description is omitted.
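Steps S301 to S304 and step S106-2 can be sketched as follows. Times are assumed to be plain numbers in seconds, the usage history is simplified to a dictionary, and the function names, the parameter max_elapsed (the “predetermined value”), and the default threshold of 3 are hypothetical choices for illustration.

```python
import time


def store_with_time(history, control_id, operation, now=None):
    """Step S106-2: store the instruction together with the time it was issued."""
    history[control_id] = (operation, time.time() if now is None else now)


def should_use_proxy(last_used, use_count, now, max_elapsed, threshold=3):
    """Steps S301 to S304: True means proceed to S108, False means proceed to S104."""
    if last_used is None:
        # Assumption: with no recorded last use, fall back to spoken dialogue.
        return False
    if now - last_used > max_elapsed:
        # S302: NO. Too long since the last use, so no automatic response.
        return False
    # S303 and S304: compare the usage count with the threshold.
    return use_count >= threshold
```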
[0080]
As is clear from the above, the voice interaction device 4 according to the fourth embodiment does not perform an automatic response when a predetermined time has elapsed since the content was last used, so that labor saving of routine voice response processing is performed only within an appropriate range.
[0081]
Embodiment 5.
In the first to fourth embodiments, the case has been described in which the voice interaction apparatus itself has a function for saving labor in the voice interaction operation. On the other hand, a mode is also conceivable in which a device having such a labor-saving function is used in combination with a voice interaction device that lacks it. The spoken dialogue proxy device according to the fifth embodiment is a device having such a function.
[0082]
FIG. 9 is a block diagram showing the configuration of a voice interaction proxy device according to the fifth embodiment and a voice interaction device used in combination with this device. In the figure, the components denoted by the same reference numerals as those in FIG. 1 are the same as those in the first embodiment, and thus the description thereof is omitted. The spoken dialogue apparatus 14 shown in the figure is an apparatus that can operate content acquired through a network by a user's utterance. The voice dialogue substitution device 15 is used in combination with the voice dialogue device 14 and is a device that saves the voice dialogue operation by the voice dialogue device 14.
[0083]
FIG. 10 is a block diagram showing the detailed configuration of the voice interaction device 14 and the voice dialogue proxy device 15. In the figure, components denoted by the same reference numerals as in FIG. 2 are the same as those of the first embodiment, and their description is omitted. In the voice interaction device 14, the notification means 31 notifies the user of the message 5 prompting an interactive operation when the content acquisition means 7 acquires the content 2; specifically, it is constituted by a display device, a speaker, or the like. When the notification means 31 presents the message 5 through the speaker, the message 5 is converted to speech by speech synthesis.
[0084]
The voice dialogue proxy device 15 is connected to the voice interaction device 14 via RS232C, USB (Universal Serial Bus), or another bus. Any connection method may be used as long as it can carry electrical signals, digital signals, and voice signals to and from the voice interaction device 14. Alternatively, a microphone may be provided so that the voice output from the notification means 31 of the voice interaction device 14 is input directly as sound.
[0085]
In the voice dialogue proxy device 15, the history storage means 32 stores the message 5 output from the notification means 31 and the user's operation instruction; specifically, it is constituted by a nonvolatile storage device such as a hard disk device or a flash memory. The utterance reproduction means 33 captures an operation instruction uttered by the user, stores the utterance content in the history storage means 32, and outputs the user's utterance to the voice recognition means 10 of the voice interaction device 14. Furthermore, when the user does not utter, the utterance reproduction means 33 reproduces the user's utterance stored in the history storage means 32 and outputs it to the voice recognition means 10 of the voice interaction device 14, thereby carrying out the voice interaction on the user's behalf.
[0086]
Next, the processing of the voice interaction device 14 and the voice dialogue proxy device 15 will be described. FIG. 11 is a flowchart showing this processing. In step S401, the content acquisition unit 7 of the voice interaction device 14 acquires the content 2 from the content storage unit 1 through the network 3. Since this process is the same as step S101 in the first embodiment, a description thereof is omitted.
[0087]
Subsequently, in step S402, the notification unit 31 of the voice interaction device 14 notifies the user of the content 2. As described above, to present the message 5, the notification unit 31 either converts the message 5 to speech by speech synthesis and outputs it from a speaker (not shown) or displays it on a display device. At the same time, the message 5 is also output to the voice dialogue proxy device 15 via the RS232C interface or the bus.
[0088]
Next, in step S403, the history storage unit 32 of the voice interaction proxy device 15 checks whether or not the user's utterance for the message 5 is stored. If the user's utterance is stored, the user's utterance is output to the utterance reproduction means 33, and the process proceeds to step S404 (step S403: YES). If not stored, the process proceeds to step S406 (step S403: NO). The processing after step S406 will be described later.
[0089]
In step S404, the utterance reproduction unit 33 reproduces the user's utterance output from the history storage unit 32 as voice data. The reproduced voice data is transmitted as an electrical signal via an RS232C interface or a bus connecting the voice dialogue device 14 and the voice dialogue substitution device 15. Alternatively, it may be reproduced as actual voice from the speaker by the utterance reproduction means 33 itself and output to the microphone of the voice interaction device 14.
[0090]
Finally, in step S405, the voice recognition means 10 recognizes the user's utterance and converts it into an operation instruction. Since this process is the same as the process of step S104 in the first embodiment, a description thereof will be omitted.
[0091]
On the other hand, in step S403, when the history storage unit 32 of the voice interaction proxy device 15 does not store the user's utterance for the message 5 (step S403: NO), step S406 is executed. In step S406, the history storage unit 32 associates the user's speech with the message 5 and stores them in a non-volatile storage device such as a hard disk device or a flash memory (not shown).
[0092]
In this case, the history storage means 32 outputs nothing, and the utterance reproduction means 33 does not output anything accordingly. As a result, the voice recognition means 10 of the voice interaction device 14 is in an input waiting state. In this state, when the user utters an operation instruction, in step S405, the speech recognition means 10 recognizes the utterance and converts it into an operation instruction.
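The branch structure of steps S403 to S406 can be summarized in a short sketch. The class `DialogueProxy`, its method names, and the use of the message text as a dictionary key are assumptions made for illustration; the recognizer callback merely stands in for the voice recognition means 10 and is not a real speech recognizer.

```python
from typing import Callable, Dict, Optional


class DialogueProxy:
    """Hypothetical model of the proxy device 15.

    Combines the roles of the history storage means 32 and the
    utterance reproduction means 33.
    """

    def __init__(self, recognizer: Callable[[bytes], str]) -> None:
        self.history: Dict[str, bytes] = {}  # message 5 -> recorded utterance
        self.recognizer = recognizer         # stand-in for voice recognition means 10

    def on_message(self, message: str,
                   live_utterance: Optional[bytes] = None) -> Optional[str]:
        stored = self.history.get(message)         # S403: is an utterance stored?
        if stored is not None:
            return self.recognizer(stored)         # S404 replay, then S405
        if live_utterance is None:
            return None                            # recognizer stays in input-waiting state
        self.history[message] = live_utterance     # S406: store with the message
        return self.recognizer(live_utterance)     # S405: recognize the live utterance
```

In this sketch, the first time a message is seen the user's own utterance is recorded and recognized; on later occurrences of the same message the recorded utterance is replayed without user input.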
[0093]
As is clear from the above, according to the voice dialogue substitution device 15, it is possible to add a labor saving function to a device such as the voice dialogue device 14 that does not have means for saving the voice dialogue operation.
[0094]
In step S403, when the history storage unit 32 stores an utterance for the content contents 7, the process need not shift unconditionally to the utterance reproduction process. For example, whether to shift to the utterance reproduction process may be decided by a condition determination such as those performed in the first to fourth embodiments.
[0095]
Alternatively, a control ID representing the dialogue operation control information of the content 2 may be output from the voice dialogue device 14 and the voice dialogue proxy device 15 may store the control ID and the user's utterance in association with each other.
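A minimal sketch of this variation follows, in which stored utterances are keyed by control ID rather than by message text. The class name `ControlIdHistory` and the example ID strings are hypothetical; the format of a control ID is not specified here.

```python
from typing import Dict, Optional


class ControlIdHistory:
    """Hypothetical variant of the history storage means 32 keyed by control ID."""

    def __init__(self) -> None:
        self._utterances: Dict[str, bytes] = {}

    def store(self, control_id: str, utterance: bytes) -> None:
        # Associate the user's utterance with the control ID output by device 14.
        self._utterances[control_id] = utterance

    def lookup(self, control_id: str) -> Optional[bytes]:
        # Return the stored utterance, or None if nothing is stored for this ID.
        return self._utterances.get(control_id)
```

One plausible reason for this design is that a lookup by control ID does not depend on the exact wording of the message 5.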
[0096]
Further, even when the history storage unit 32 stores an utterance for the content contents 7, the utterance reproduction process need not be performed immediately; instead, the device may wait for a certain period of time and perform the utterance reproduction process only when the user does not utter during that time. In this way, the operation instruction that the user normally gives is reproduced automatically, while the user can speak only when a special operation instruction must be given, and that utterance is then used to advance the operation.
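The wait-then-replay behavior can be sketched as follows. The function name `choose_utterance`, the use of a thread-safe queue as the source of live utterances, and the timeout value are assumptions made for illustration only.

```python
import queue
from typing import Optional


def choose_utterance(live_input: "queue.Queue[bytes]",
                     stored: Optional[bytes],
                     wait_seconds: float) -> Optional[bytes]:
    """Prefer a live utterance; fall back to the stored one after a timeout.

    Returns None when the user stays silent and nothing is stored, in
    which case the caller would simply keep waiting for input.
    """
    try:
        # Give the user a chance to speak a special instruction.
        return live_input.get(timeout=wait_seconds)
    except queue.Empty:
        # No utterance arrived in time: reproduce the usual one, if any.
        return stored
```

In this sketch a live utterance always takes priority over the stored one, which matches the intent that the user speaks only when a non-routine instruction is needed.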
[0097]
[Effect of the invention]
The voice interaction device according to the present invention includes history storage means for storing, as a usage history, operation instructions given by a user during use of content; dialogue operation proxy means for outputting an operation instruction stored in the usage history as a valid operation instruction for the interactive operation control information of the content; and content interpretation means for determining the operation based on the operation instruction output by the dialogue operation proxy means when the number of uses is equal to or greater than a predetermined value based on hardware environment information. Routine voice interaction operations are thereby automated, which has the effect of making it possible to provide an easy-to-use voice interaction interface.
[0098]
In addition, the voice dialogue proxy device according to the present invention includes history storage means for storing, in association with each other, the content output by a voice interaction device that has no function for saving labor in the voice interaction operation and the user's utterance content; and utterance reproduction means for reproducing the user's utterance content stored in the history storage means when the number of times the content has been used is equal to or greater than a predetermined value based on hardware environment information. Because the user's utterance content is output to the voice interaction device, routine voice interaction operations are automated even for a voice interaction device that has no labor-saving function, which has the effect of making it possible to provide an easy-to-use voice interaction interface.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of a first embodiment of the present invention.
FIG. 2 is a block diagram showing details of the configuration of the first embodiment of the present invention.
FIG. 3 is a program list showing an example of content contents according to the first embodiment of the present invention.
FIG. 4 is a flowchart of processing according to the first embodiment of the present invention.
FIG. 5 is a block diagram showing details of the configuration of the second embodiment of the present invention.
FIG. 6 is a flowchart of processing according to the second embodiment of the present invention.
FIG. 7 is a block diagram showing details of the configuration of the third embodiment of the present invention.
FIG. 8 is a flowchart of processing according to the fourth embodiment of the present invention.
FIG. 9 is a block diagram showing a configuration of a fifth embodiment of the present invention.
FIG. 10 is a block diagram showing details of the configuration of the fifth embodiment of the present invention.
FIG. 11 is a flowchart of processing according to the fifth embodiment of the present invention.
[Explanation of symbols]
1: content storage unit, 2: content, 3: network,
4: Voice interaction device, 5: Message, 6: User utterance,
7: content acquisition means, 8: control ID acquisition means, 9: content interpretation means,
10: voice recognition means, 11: history storage means, 12: dialogue operation proxy means,
13: User information storage means, 14: Spoken dialogue device, 15: Spoken dialogue substitution device,
31: Notification means, 32: History storage means, 33: Speech reproduction means,
34: User utterance

Claims

A voice interaction device comprising content acquisition means, history storage means, content interpretation means, dialogue operation proxy means, and voice recognition means,
The content acquisition means acquires, through a network, content having interactive operation control information that defines a message prompting a user's operation instruction and a different operation for each operation instruction, and the history storage means stores, as a usage history, the operation instructions that the user has so far given for the content acquired by the content acquisition means,
The content interpretation means, when the usage history stored in the history storage means satisfies a predetermined condition, determines the operation defined in the interactive operation control information of the content based on an operation instruction output by the dialogue operation proxy means, and, when the usage history does not satisfy the predetermined condition, presents the message of the interactive operation control information of the content to the user and determines the operation based on an operation instruction output by the voice recognition means; the content interpretation means calculates the number of times the content has been used from the usage history stored in the history storage means, calculates a predetermined value based on hardware environment information, determines the operation based on the operation instruction output by the dialogue operation proxy means when the number of times is equal to or greater than the predetermined value, and, when the number of times is less than the predetermined value, presents the message to the user and determines the operation based on the operation instruction output by the voice recognition means,
The dialogue operation proxy means outputs an operation instruction of the usage history stored in the history storage means,
The voice recognition means recognizes speech uttered by the user in response to the message presented by the content interpretation means, outputs it as an operation instruction for the interactive operation control information of the content, and stores the operation instruction for the message in the history storage means.

A voice interaction device comprising content acquisition means, history storage means, content interpretation means, dialogue operation proxy means, voice recognition means, and user information storage means,
The user information storage means stores user information,
The content acquisition means acquires, through a network, content having interactive operation control information that defines a message prompting a user's operation instruction and a different operation for each operation instruction, and the history storage means stores, as a usage history, the operation instructions that the user has so far given for the content acquired by the content acquisition means,
The content interpretation means, when the usage history stored in the history storage means satisfies a predetermined condition, determines the operation defined in the interactive operation control information of the content based on an operation instruction output by the dialogue operation proxy means, and, when the usage history does not satisfy the predetermined condition, presents the message of the interactive operation control information of the content to the user and determines the operation based on an operation instruction output by the voice recognition means; the content interpretation means calculates the number of times the content has been used from the usage history stored in the history storage means, calculates a predetermined value based on the user information stored in the user information storage means, determines the operation based on the operation instruction output by the dialogue operation proxy means when the number of times is equal to or greater than the predetermined value, and, when the number of times is less than the predetermined value, presents the message to the user and determines the operation based on the operation instruction output by the voice recognition means,
The dialogue operation proxy means outputs an operation instruction of the usage history stored in the history storage means,
The voice recognition means recognizes speech uttered by the user in response to the message presented by the content interpretation means, outputs it as an operation instruction for the interactive operation control information of the content, and stores the operation instruction for the message in the history storage means.

3. The spoken dialogue apparatus according to claim 2, wherein the user information storage means stores at least the age of the user as user information, and the content interpretation means calculates the predetermined value based on the user's age stored in the user information storage means.

A voice interaction device comprising content acquisition means, history storage means, content interpretation means, dialogue operation proxy means, and voice recognition means,
The content acquisition means acquires, through a network, content having interactive operation control information that defines a message prompting a user's operation instruction and a different operation for each operation instruction, and the history storage means stores, as a usage history, the operation instructions that the user has so far given for the content acquired by the content acquisition means and the time at which the user last used the content,
The content interpretation means, when the usage history stored in the history storage means satisfies a predetermined condition, determines the operation defined in the interactive operation control information of the content based on an operation instruction output by the dialogue operation proxy means, and, when the usage history does not satisfy the predetermined condition, presents the message of the interactive operation control information of the content to the user and determines the operation based on an operation instruction output by the voice recognition means; the content interpretation means calculates the number of times the content has been used from the usage history stored in the history storage means, calculates a predetermined value from the time at which the content was last used, as stored in the history storage means, and the current time, determines the operation based on the operation instruction output by the dialogue operation proxy means when the number of times is equal to or greater than the predetermined value, and, when the number of times is less than the predetermined value, presents the message to the user and determines the operation based on the operation instruction output by the voice recognition means,
The dialogue operation proxy means outputs an operation instruction of the usage history stored in the history storage means,
The voice recognition means recognizes speech uttered by the user in response to the message presented by the content interpretation means, outputs it as an operation instruction for the interactive operation control information of the content, and stores the operation instruction for the message in the history storage means.

5. The voice interaction device according to claim 1, further comprising control ID acquisition means for acquiring a control ID that uniquely identifies the interactive operation control information of the content, wherein
The history storage means stores the control ID acquired by the control ID acquisition means in association with an operation instruction given by the user, and
The dialogue operation proxy means outputs the operation instruction stored in the history storage means in association with the control ID of the interactive operation control information of the content, as an operation instruction conforming to that interactive operation control information.

The voice interaction device according to any one of claims 1 to 5, wherein the content interpretation means changes the message of the interactive operation control information of the content when the usage history satisfies the predetermined condition, and presents the changed message to the user.

A voice dialogue proxy device for use with a voice dialogue device that comprises:
  content acquisition means for acquiring, through a network, content that performs different operations according to the user's operation instructions and that includes a message prompting those operation instructions;
  voice recognition means for converting the user's utterance, by voice recognition, into an operation instruction for the content acquired by the content acquisition means;
  informing means for informing the user of the message of the content acquired by the content acquisition means; and
  content interpretation means for determining an operation based on the operation instruction output by the voice recognition means,
the voice dialogue proxy device being characterized by comprising:
  history storage means for storing, as a usage history, the content acquired by the content acquisition means and the user's utterance in association with each other; and
  utterance reproduction means that, when the history storage means stores the user's utterance for the content used by the user, reproduces the user's utterance and outputs it to the voice recognition means, the utterance reproduction means calculating the number of times the content has been used from the usage history stored in the history storage means, calculating a predetermined value based on hardware environment information, and reproducing the user's utterance and outputting it to the voice recognition means when the number of times is equal to or greater than the predetermined value.

A voice dialogue proxy device for use with a voice dialogue device that comprises:
  content acquisition means for acquiring, through a network, content that performs different operations according to the user's operation instructions and that includes a message prompting those operation instructions;
  voice recognition means for converting the user's utterance, by voice recognition, into an operation instruction for the content acquired by the content acquisition means;
  informing means for informing the user of the message of the content acquired by the content acquisition means; and
  content interpretation means for determining an operation based on the operation instruction output by the voice recognition means,
the voice dialogue proxy device being characterized by comprising:
  history storage means for storing, as a usage history, the content acquired by the content acquisition means and the user's utterance in association with each other;
  user information storage means for storing user information; and
  utterance reproduction means that, when the history storage means stores the user's utterance for the content used by the user, reproduces the user's utterance and outputs it to the voice recognition means, the utterance reproduction means calculating the number of times the content has been used from the usage history stored in the history storage means, calculating a predetermined value based on the user information stored in the user information storage means, and reproducing the user's utterance and outputting it to the voice recognition means when the number of times is equal to or greater than the predetermined value.

9. The spoken dialogue proxy device according to claim 8, wherein the user information storage means stores at least the age of the user as user information, and the utterance reproduction means calculates the predetermined value based on the user's age stored in the user information storage means.

A voice dialogue proxy device for use with a voice dialogue device that comprises:
  content acquisition means for acquiring, through a network, content that performs different operations according to the user's operation instructions and that includes a message prompting those operation instructions;
  voice recognition means for converting the user's utterance, by voice recognition, into an operation instruction for the content acquired by the content acquisition means, and outputting the operation instruction;
  informing means for informing the user of the message of the content acquired by the content acquisition means; and
  content interpretation means for determining an operation based on the operation instruction output by the voice recognition means,
the voice dialogue proxy device being characterized by comprising:
  history storage means for storing, as a usage history, the content acquired by the content acquisition means and the user's utterance in association with each other, together with the time at which the user last used the content; and
  utterance reproduction means that, when the history storage means stores the user's utterance for the content used by the user, reproduces the user's utterance and outputs it to the voice recognition means, the utterance reproduction means calculating the number of times the content has been used from the usage history stored in the history storage means, calculating a predetermined value from the time at which the content was last used, as stored in the history storage means, and the current time, and reproducing the user's utterance and outputting it to the voice recognition means when the number of times is equal to or greater than the predetermined value.

The spoken dialogue proxy device according to any one of claims 7 to 10, wherein the utterance reproduction means reproduces the user's utterance and outputs it to the voice recognition means when a predetermined time has elapsed after the informing means has informed the user of the message.

A voice interaction program that causes a computer to execute a content acquisition procedure, a history storage procedure, a content interpretation procedure, a dialogue operation proxy procedure, and a voice recognition procedure,
The content acquisition procedure acquires, through a network, content having interactive operation control information that defines a message prompting a user's operation instruction and a different operation for each operation instruction, and the history storage procedure stores, as a usage history, the operation instructions that the user has so far given for the content acquired by the content acquisition procedure,
The content interpretation procedure, when the usage history stored by the history storage procedure satisfies a predetermined condition, determines the operation defined in the interactive operation control information of the content based on an operation instruction output by the dialogue operation proxy procedure, and, when the usage history does not satisfy the predetermined condition, presents the message of the interactive operation control information of the content to the user and determines the operation based on an operation instruction output by the voice recognition procedure; the content interpretation procedure calculates the number of times the content has been used from the usage history stored by the history storage procedure, calculates a predetermined value based on hardware environment information, determines the operation based on the operation instruction output by the dialogue operation proxy procedure when the number of times is equal to or greater than the predetermined value, and, when the number of times is less than the predetermined value, presents the message to the user and determines the operation based on the operation instruction output by the voice recognition procedure,
The dialogue operation proxy procedure outputs an operation instruction of the usage history stored by the history storage procedure,
The voice recognition procedure recognizes speech uttered by the user in response to the message presented by the content interpretation procedure, outputs it as an operation instruction for the interactive operation control information of the content, and causes the operation instruction to be stored by the history storage procedure.

A voice dialogue proxy program for use with a voice dialogue device that comprises:
  content acquisition means for acquiring, through a network, content that performs different operations according to the user's operation instructions and that includes a message prompting those operation instructions;
  voice recognition means for converting the user's utterance, by voice recognition, into an operation instruction for the content acquired by the content acquisition means;
  informing means for informing the user of the message of the content acquired by the content acquisition means; and
  content interpretation means for determining an operation based on the operation instruction output by the voice recognition means,
the voice dialogue proxy program being characterized by causing a computer to execute:
  a history storage procedure for storing, as a usage history, the content acquired by the content acquisition means and the user's utterance in association with each other; and
  an utterance reproduction procedure that, when the user's utterance for the content used by the user has been stored by the history storage procedure, reproduces the user's utterance and outputs it to the voice recognition means, the utterance reproduction procedure calculating the number of times the content has been used from the usage history stored by the history storage procedure, calculating a predetermined value based on hardware environment information, and reproducing the user's utterance and outputting it to the voice recognition means when the number of times is equal to or greater than the predetermined value.