JP2004354942A

JP2004354942A - Voice interactive system, voice interactive method and voice interactive program

Info

Publication number: JP2004354942A
Application number: JP2003155749A
Authority: JP
Inventors: Norihiko Maeda; 典彦前田; Masayuki Takahashi; 真之高橋; Tasuku Shinozaki; 翼篠崎; Shinichi Tomiyama; 伸一富山; Masashi Satomura; 昌史里村; Yoichi Kitano; 陽一北野
Original assignee: Honda Motor Co Ltd; Nippon Telegraph and Telephone Corp
Current assignee: Honda Motor Co Ltd; Nippon Telegraph and Telephone Corp
Priority date: 2003-05-30
Filing date: 2003-05-30
Publication date: 2004-12-16

Abstract

PROBLEM TO BE SOLVED: To provide a voice interactive system in which a user can select the input that is convenient for the user. SOLUTION: The voice interactive system has a talking button switch that is pushed only while the speaker is talking, and the speaker controls devices while interacting with the system by voice. The system includes a time monitoring means that measures the elapsed time from when the talking button switch is pushed until it is released, an operation setting storage means that stores in advance the operation to be performed next for each elapsed time, and a control means that reads from the operation setting storage means the operation corresponding to the elapsed time notified by the time monitoring means and executes it. COPYRIGHT: (C)2005,JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は、機器の制御処理や情報提供処理を行う場合に、音声対話によってユーザの要求を獲得する音声対話システム、音声対話方法及び音声対話プログラムに関する。
【０００２】
【従来の技術】
車両内におけるカーナビゲーションシステム等の操作において、運転者の運転動作ヘの影響を少なくするためには、運転者がシステムと対話を行なう動作の負荷を小さくすることが望ましい。一方、音声認識の認識率を向上させるためには、話者がスイッチ等を操作しながら発声した方が音声認識処理を簡単にすることができる。例えば、話者が発話している間だけ、発話ボタンをプッシュすることにより、発話ボタンがプッシュされていた時間内に発声された音声を音声認識の対象とする技術（プレストークという）が知られている。また、発話ボタンの他に、取り消しボタンを設けることで、誤認識が発生した場含に、一つ前の状態に戻ることが容易に行なえるようにした技術や音声操作以外に、モニタ画面に選択肢を示し、タッチ操作やリモコン操作によって、入力内容を選択肢から選んで決定する技術も知られている。
【０００３】
ところで、対話時の運転者負荷低減や製造コスト低減のため、対話に用いるボタンやスイッチは少ない方がよく、運転動作以外に用いられるボタンやスイッチの数を減らす取り組みが行なわれている。例えば、発話ボタンとは別に設けられた操作スイッチにおいて、短時間ボタンプッシュ／長時間ボタンプッシュという複数の操作を一つの操作スイッチで実現している（特許文献１）。
また、ユーザが音声コマンドを入力したあと、システムは認識結果が正しいか、ユーザに確認を取ってから、音声コマンドに対応した処理を実行し、このユーザの確認を取る場面で、音声入力による確認の他、ステアリング上の「許可スイッチ」によっても入力することができるようになっている（特許文献２）。ただし、この「許可スイッチ」の意味は一つであり、発話ボタンとは関係なく、音声認識機能とは独立に存在している。
また、発話ボタンと、操作ボタンを統合し、「この場面では発話ボタン、この場面では操作ボタン」というように、システムは場面に応じてボタンの用途をどちらか一方に限定して、発話ボタンを特定の場面において、操作スイッチとして機能させている（特許文献３）。
【０００４】
【特許文献１】
特開２００１−１５４６８９号公報
【特許文献２】
特開２００２−１２１００号公報
【特許文献３】
特開２００１−２１６１３０号公報
【０００５】
【発明が解決しようとする課題】
しかしながら、特許文献１、２の技術は、発話ボタンと操作ボタンが独立に存在しているため、複数のボタンを使い分ける必要があり、手や指の移動、視線の移動が伴うことになり、ユーザ負荷が増加するという問題がある。また、特許文献３の技術は、発話ボタンと操作ボタンを一つに統合してはいるが、その場面に応じて、どちらか一方の機能のみに限定されているため、「音声入力すべき場面」と「操作ボタンとして入力すべき場面」がシステムの都合で決められており、ユーザの希望により音声入力の場面で、発話ボタンを「次へ」と等価のショートカットボタンとして利用するなどの操作を行うことができないという問題がある。
【０００６】
本発明は、このような事情に鑑みてなされたもので、発話ボタン一つで、音声入力ためのプレストーク機能と、対話動作のショートカットを実現するための操作ボタン機能を同時に実現することができ、ユーザは自分にとって都合のよい入力を選択することができる音声対話システム、音声対話方法及び音声対話プログラムを提供することを目的とする。
【０００７】
【課題を解決するための手段】
請求項１に記載の発明は、話者が発話している間だけプッシュする発話ボタンスイッチを備え、音声によって話者がシステムと対話を行いながら機器の制御を行う音声対話システムであって、前記発話ボタンスイッチがプッシュされてからリリースされるまでの経過時間を計測する時間監視手段と、前記経過時間毎に次に行うべき動作が予め記憶された動作設定記憶手段と、前記時間監視手段が通知する経過時間に該当する動作を前記動作設定記憶手段から読み出して、読み出した動作を実行する制御手段とを備えたことを特徴とする。
【０００８】
請求項２に記載の発明は、前記動作設定記憶手段は、前記経過時間が第１のしきい値未満のときに行うべき第１の動作と、前記経過時間が第１のしきい値以上第２のしきい値未満のときに行うべき第２の動作と、前記経過時間が第２のしきい値を超えるときに行うべき第３の動作が予め記憶され、前記第２の動作は、前記発話ボタンスイッチが発話ボタンとして機能する動作であり、前記第１の動作および第３の動作は、前記発話ボタンスイッチがそれぞれ異なる操作ボタンとして機能する動作であることを特徴とする。
【０００９】
請求項３に記載の発明は、前記動作設定記憶手段は、対話を行う場合の場面毎に動作が記憶され、前記第１のしきい値は、対象となる場面において発話される可能性がある応答内容のうち、最も短い応答内容を発話するのに必要な時間より短い時間であり、前記第２のしきい値は、対象となる場面において発話される可能性がある応答内容のうち、最も長い応答内容を発話するのに必要な時間より長い時間であることを特徴とする。
【００１０】
請求項４に記載の発明は、前記音声対話システムは、前記第１及び第２のしきい値を任意に設定する入力手段を備えたことを特徴とする。
【００１１】
請求項５に記載の発明は、話者が発話している間だけプッシュする発話ボタンスイッチを備え、音声によって話者がシステムと対話を行いながら機器の制御を行う音声対話システムにおける音声対話方法であって、前記発話ボタンスイッチがプッシュされてからリリースされるまでの経過時間を計測する時間監視過程と、前記経過時間毎に次に行うべき動作を予め記憶しておく動作設定記憶過程と、前記時間監視過程が通知する経過時間に該当する動作を前記動作設定記憶過程で記憶しておいた動作の中から読み出して実行する制御過程とを有することを特徴とする。
【００１２】
請求項６に記載の発明は、前記動作設定記憶過程は、前記経過時間が第１のしきい値未満のときに行うべき第１の動作と、前記経過時間が第１のしきい値以上第２のしきい値未満のときに行うべき第２の動作と、前記経過時間が第２のしきい値を超えるときに行うべき第３の動作を予め記憶し、前記第２の動作は、前記発話ボタンスイッチが発話ボタンとして機能する動作であり、前記第１の動作および第３の動作は、前記発話ボタンスイッチがそれぞれ異なる操作ボタンとして機能する動作であることを特徴とする。
【００１３】
請求項７に記載の発明は、前記動作設定記憶過程は、対話を行う場合の場面毎に動作が記憶され、前記第１のしきい値は、対象となる場面において発話される可能性がある応答内容のうち、最も短い応答内容を発話するのに必要な時間より短い時間であり、前記第２のしきい値は、対象となる場面において発話される可能性がある応答内容のうち、最も長い応答内容を発話するのに必要な時間より長い時間であることを特徴とする。
【００１４】
請求項８に記載の発明は、前記音声対話方法は、前記第１及び第２のしきい値を任意に設定する入力過程を有することを特徴とする。
【００１５】
請求項９に記載の発明は、話者が発話している間だけプッシュする発話ボタンスイッチを備え、音声によって話者がシステムと対話を行いながら機器の制御を行う音声対話システムにおいて動作する音声対話プログラムであって、前記発話ボタンスイッチがプッシュされてからリリースされるまでの経過時間を計測する時間監視処理と、前記経過時間毎に次に行うべき動作を予め記憶しておく動作設定記憶処理と、前記時間監視処理が通知する経過時間に該当する動作を前記動作設定記憶処理で記憶しておいた動作の中から読み出して実行する制御処理とをコンピュータに行わせることを特徴とする。
【００１６】
請求項１０に記載の発明は、前記動作設定記憶処理は、前記経過時間が第１のしきい値未満のときに行うべき第１の動作と、前記経過時間が第１のしきい値以上第２のしきい値未満のときに行うべき第２の動作と、前記経過時間が第２のしきい値を超えるときに行うべき第３の動作を予め記憶し、前記第２の動作は、前記発話ボタンスイッチが発話ボタンとして機能する動作であり、前記第１の動作および第３の動作は、前記発話ボタンスイッチがそれぞれ異なる操作ボタンとして機能する動作であることを特徴とする。
【００１７】
請求項１１に記載の発明は、前記動作設定記憶処理は、対話を行う場合の場面毎に動作が記憶され、前記第１のしきい値は、対象となる場面において発話される可能性がある応答内容のうち、最も短い応答内容を発話するのに必要な時間より短い時間であり、前記第２のしきい値は、対象となる場面において発話される可能性がある応答内容のうち、最も長い応答内容を発話するのに必要な時間より長い時間であることを特徴とする。
【００１８】
請求項１２に記載の発明は、前記音声対話プログラムは、前記第１及び第２のしきい値を任意に設定する入力処理をさらにコンピュータに行わせることを特徴とする。
【００１９】
【発明の実施の形態】
以下、本発明の一実施形態による音声対話システムを図面を参照して説明する。図１は同実施形態の構成を示すブロック図である。この図において、符号１は、自動車のステアリングホイールに取り付けられた発話ボタンである。この発話ボタンはプッシュしたときとリリースしているときの２つの状態を識別することができるボタンスイッチであり、運転者は、ステアリングホイールから手を離さずにこの発話ボタンを操作可能である。符号２は、発話ボタン１の２つの状態（プッシュとリリース）を検出する発話ボタン操作検出部である。符号３は、発話ボタン操作検出部２の出力に基づいて、発話ボタン１がプッシュされてからリリースされるまでの経過時間を計測する時間監視部である。符号４は、運転者と対話を行いながら機器の制御処理（エアコンのＯＮ／ＯＦＦやオーディオ機器の再生／停止等）や情報提供処理（データベース検索を実行し検索結果を読み上げる等）を行うための制御信号を機器に対して出力するＨＭＩ（Ｈｕｍａｎ−ｍａｃｈｉｎｅＩｎｔｅｒｆａｃｅ）制御部である。符号５は、発話ボタン１の操作状況に応じた動作が予め設定されたボタン操作動作設定記憶部である。符号６は、音声を集音するマイクである。符号７は、マイク６で集音した音声を認識する音声認識部である。符号８は、システムが発する音声を発音するスピーカである。符号９は、音声データを再生して、スピーカ８から発音させる音声再生部である。符号１０は、システムが発する音声データを合成する音声合成部である。符号１１は、運転者と対話を行う場合のシナリオの流れを制御する対話制御部である。符号１２は、運転者と対話を行う場合のシナリオが予め定義された対話シナリオ記憶部である。符号１３は、タッチパネル等の操作スイッチで構成される入力部である。
【００２０】
次に、図２、３を参照して、図１に示す音声対話システムの動作を説明する。以下の説明において、ＨＭＩ制御部４は、自動車内において、食事をする店を検索して結果情報を運転者へ提供する情報提供処理を行うものとして説明する。
初めに、図２を参照して、運転者が発話した音声を入力した場合の動作を説明する。まず、ＨＭＩ制御部４は、音声対話システムが起動した時点で、ボタン操作動作設定記憶部５に記憶されている動作設定の内容を読み込む（ステップＳ０）。この動作は起動時に１回実行するのみである。
【００２１】
ここで、図４を参照して、ボタン操作動作設定記憶部５に記憶されている内容を説明する。ボタン操作動作設定記憶部５には、発話ボタン１が「プッシュ」された場合と「リリース」された場合に分けて、ＨＭＩ制御部４が行うべき動作がアプリケーション毎（ここでは、データベース検索、エアコン制御、オーディオ制御）に定義されている。そして、「リリース」の場合は、さらに、「プッシュ」から「リリース」までの経過時間毎に細分化された動作が定義されている。この例では、「プッシュ」された場合には、音声再生部９に対して、ガイダンス再生停止と、チャイム音を再生する指示を行い、音声認識部７に対しては、音声の録音開始を指示することが定義されている。また、データベース検索では、「リリース」までの経過時間が、０〜２００ｍｓｅｃ、２０１〜１００００ｍｓｅｃ、１０００１ｍｓｅｃ以上毎に動作が定義されている。このテーブル参照に基づく動作の詳細は、後述する。
【００２２】
この２００ｍｓｅｃが請求項でいう第１のしきい値であり、１００００ｍｓｅｃが第２のしきい値である。この第１及び第２のしきい値は、各アプリケーション毎に異なり、それぞれ操作を容易にするために経過時間毎に動作が予め定義される。ただし、第１のしきい値は、対象のアプリケーション中において、話者が発話する可能性がある語句のうち、最も短い語句（例えば、「はい」）を発話するのに必要な時間より短い時間に設定する。また、第２のしきい値は、対象のアプリケーション中において、話者が発話する可能性がある語句のうち、最も長い語句（例えば、「和食レストラン検索」）を発話するのに必要な時間より長い時間に設定する。
【００２３】
次に、図２に戻り、運転者は、入力部１３のスイッチを操作して、または音声入力操作によって、データベース検索を選択する。対話制御部１１は、データベース検索処理が選択されたことを受けて、ＨＭＩ制御部４に対して、音声入力を行うように指示を出す。そして、システム側から、「何のデータを検索しますか」という音声の質問が発声され、ＨＭＩ制御部４は、音声入力待ち状態となる。
【００２４】
ここで、運転者が発話ボタン１をプッシュし、例えば６秒（６０００ｍｓｅｃ）間かけて「食事検索」と発話した後、発話ボタン１をリリースしたとする。これを受けて、発話ボタン操作検出部２は、発話ボタン１がプッシュされたことを検出して、発話ボタンプッシュ検出通知を時間監視部３へ送る（ステップＳ１）。これを受けて、時間監視部３は、タイマ測定を開始する（ステップＳ２）とともに、ＨＭＩ制御部４に対して、発話ボタンプッシュ検出通知を送る（ステップＳ３）。
【００２５】
次に、ＨＭＩ制御部４は、事前に読み込んだ動作設定に従い、プッシュ時の処理を行う（ステップＳ４）。プッシュ時の処理は、まず、ガイダンス音の再生を停止させるため、音声再生部９に対してガイダンス停止を指示し（ステップＳ５）、これによりガイダンスが停止する（ステップＳ６）。そして、プッシュ操作がシステムによって検出されたことを運転者に対して通知するため、音声再生部９に対してチャイム再生を指示し（ステップＳ７）、これにより「ピ」というチャイム音が再生される（ステップＳ８）。また、音声認識部７に対して、音声録音開始の指示を出し（ステップＳ９）、これにより、音声認識部７は、マイク６で集音された音声の録音を開始するとともに、録音された音声の認識処理を実行する（ステップＳ１０）。
【００２６】
次に、発話ボタン操作検出部２は、発話ボタン１のリリースを検出すると、時間監視部３に対して発話ボタンリリース検出通知を送る（ステップＳ１１）。時間監視部３はこの通知を受けて、タイマ測定を終了し（ステップＳ１２）、プッシュからリリースまでの経過時間を求め、この経過時間を含む発話ボタンリリース検出通知をＨＭＩ制御部４へ送る（ステップＳ１３）。
【００２７】
次に、ＨＭＩ制御部４は、通知された経過時間に基づくリリース時の処理を行う（ステップＳ１４）。ここでは、経過時間が６０００ｍｓｅｃであるので、図４に示すボタン操作動作設定記憶部５に定義されている２０１〜１００００ｍｓｅｃの動作が実施されることとなる。経過時間が２０１〜１００００ｍｓｅｃの場合、ＨＭＩ制御部４は、既に実行状態にあった音声認識部７に対し、認識結果を要求し（ステップＳ１５）、認識結果を得る（ステップＳ１６）。そして、ここで得た音声認識結果（ここでは「食事検索」となる）を対話制御部１１に対して通知する（ステップＳ１７）。
【００２８】
これを受けて、対話制御部１１は、対話シナリオ記憶部１２を参照して、「食事検索」が指定された場合に次に行うべき動作シナリオを読み込む（ステップＳ１８）。そして、読み込んだシナリオに基づいて、次に発声させるべきガイダンスの内容を生成する（ステップＳ１９）。この例では、次に発声するべきガイダンス内容が、「食事検索ですね」であるものとする。続いて、対話制御部１１は、ここで生成したガイダンス内容「食事検索ですね」をＨＭＩ制御部４へ通知する（ステップＳ２０）。
【００２９】
次に、ＨＭＩ制御部４は、通知されたガイダンス内容を含む音声合成要求を音声合成部１０へ送る（ステップＳ２１）。これを受けて、音声合成部１０は、「食事検索ですね」というガイダンスの音声データを音声合成によって生成し（ステップＳ２２）、この音声データを含む音声合成結果応答をＨＭＩ制御部４へ返す（ステップＳ２３）。
【００３０】
次に、ＨＭＩ制御部４は、この音声データを含む音声再生要求を音声再生部９へ送る（ステップＳ２４）。これを受けて音声再生部９は、送られた音声データを再生する（ステップＳ２５）。これにより、スピーカ８から「食事検索ですね」という音声が発声する。そして、音声再生部９は、ＨＭＩ制御部４に対して、ガイダンス音声再生終了通知を送る（ステップＳ２６）。
【００３１】
続いて、運転者は、「食事検索ですね」という質問に対して、発話ボタン１をプッシュして、「はい」と発話し、発話ボタン１をリリースすると、前述した動作と同様に処理がなされ、対話シナリオが続行することとなる。
【００３２】
次に、図３を参照して、運転者が発話する代わりに発話ボタン１を操作して応答した場合の動作を説明する。ここでは、スピーカ８から「食事検索ですね」という質問が発声されたものとする。ここで、運転者が発話ボタン１をプッシュし、直ぐに（１５０ｍｓｅｃ後）発話ボタン１をリリースしたとする。これを受けて、発話ボタン操作検出部２は、発話ボタン１がプッシュされたことを検出して、発話ボタンプッシュ検出通知を時間監視部３へ送る（ステップＳ１）。これを受けて、時間監視部３は、タイマ測定を開始する（ステップＳ２）とともに、ＨＭＩ制御部４に対して、発話ボタンプッシュ検出通知を送る（ステップＳ３）。
【００３３】
次に、ＨＭＩ制御部４は、事前に読み込んだ動作設定に従い、プッシュ時の処理を行う（ステップＳ４）。プッシュ時の処理は、まず、ガイダンス音の再生を停止させるため、音声再生部９に対してガイダンス停止を指示し（ステップＳ５）、これによりガイダンスが停止する（ステップＳ６）。そして、プッシュ操作がシステムによって検出されたことを運転者に対して通知するため、音声再生部９に対してチャイム再生を指示し（ステップＳ７）、これにより「ピ」というチャイム音が再生される（ステップＳ８）。また、音声認識部７に対して、音声録音開始の指示を出し（ステップＳ９）、これにより、音声認識部７は、マイク６で集音された音声の録音を開始するとともに、録音された音声の認識処理を実行する（ステップＳ１０）。
【００３４】
次に、発話ボタン操作検出部２は、発話ボタン１のリリースを検出すると、時間監視部３に対して発話ボタンリリース検出通知を送る（ステップＳ１１）。時間監視部３はこの通知を受けて、タイマ測定を終了し（ステップＳ１２）、プッシュからリリースまでの経過時間を求め、この経過時間を含む発話ボタンリリース検出通知をＨＭＩ制御部４へ送る（ステップＳ１３）。図３に示すステップＳ１〜Ｓ１３の動作は、図２に示すステップＳ１〜Ｓ１３の動作と同一である。
【００３５】
次に、ＨＭＩ制御部４は、通知された経過時間に基づくリリース時の処理を行う（ステップＳ３１）。ここでは、経過時間が１５０ｍｓｅｃであるので、図４に示すボタン操作動作設定記憶部５に定義されている０〜２００ｍｓｅｃの動作が実施されることとなる。経過時間が０〜２００ｍｓｅｃの場合、ＨＭＩ制御部４は、音声認識部７に対して既に実行状態にあった音声認識処理を中止させる指示を出す（ステップＳ３２）とともに、操作ボタン入力として識別したことを運転者に通知するため、音声再生部９に対してチャイム再生を指示し（ステップＳ３３）、これにより「ポ」というチャイム音が再生される（ステップＳ３４）。そして、対話制御部１１に対し、ユーザの入力が、操作ボタン入力であり、その値が短時間プッシュ（０〜２００ｍｓｅｃ）であったことを通知する（ステップＳ３５）。これを受けて対話制御部１１は、対話シナリオ記憶部１２を参照して（ステップＳ３６）、この入力結果を認識する（ステップＳ３７）。この認識により、ステップＳ３５において通知された入力結果が、「食事検索ですね」の質問に対して、あたかも音声で「はい」と答えたものと見なして、以降の処理が続行する。ステップＳ３７の以降の動作は、図２に示すステップＳ１８〜Ｓ２６と同一であるので説明を省略する。
【００３６】
また、図３に示すステップＳ１３において通知された経過時間が、１０００１ｍｓｅｃ以上であった場合、ＨＭＩ制御部４は操作ボタン入力として識別したことを示す「ポポ」というチャイムを再生させ、既に実行状態にあった音声認識処理を中止させ、対話制御部１１に対し、ユーザの入力が、操作ボタン入力であり、その値が長時間プッシュであったことを通知する。
【００３７】
ＨＭＩ制御部４は、音声認識部７に対して既に実行状態にあった音声認識処理を中止させる指示を出す（ステップＳ３２）とともに、操作ボタン入力として識別したことを運転者に通知するため、音声再生部９に対してチャイム再生を指示し（ステップＳ３３）、これにより「ポポ」というチャイム音が再生される（ステップＳ３４）。そして、対話制御部１１に対し、ユーザの入力が、操作ボタン入力であり、その値が長時間プッシュ（１０００１ｍｓｅｃ以上）であったことを通知する（ステップＳ３５）。この長時間プッシュは、現在実行中のアプリケーション（ここでは、データベース検索）を途中で終了させたり、現時点までの処理内容をクリアして最初の状態に戻したいときに用いる。
【００３８】
このように、時間監視部３を設け、運転者が発話ボタン１をプッシュしてからリリースするまでの経過時間を測定し、この経過時間が短い場合（２００ｍｓｅｃ以下）と長い場合（１０００１ｍｓｅｃ以上）において、発話ボタン１が操作ボタンとして利用されたと判断して、ボタン操作動作設定記憶部５に定義された動作を実施するようにしたため、発話ボタン１一つで、音声入力ためのプレストーク機能と、対話動作のショートカットなどを実現するための操作ボタン機能を同時に実現することができ、ユーザは自分にとって都合のよい入力を選択することができる。これにより、運転者への問い合わせ内容に対する回答が「はい／いいえ」であるような場面において、短時間プッシュを「はい」の音声入力とみなしたり、また、運転者ヘの問い合わせ内容に対する回答が複数候補からの選択であるような場面において、短時間プッシュを「もう一度」の音声入力とみなして、状態遷移の制御を行うことが容易にできるようになる。
【００３９】
また、音声認識は、走行ノイズ等により誤認識を起こす可能性があるのに対し、ボタン操作によるショートカットは確実であり、使用頻度の高い音声入力コマンドをショートカットで置き換えることにより、短時間に対話を進めることができる。このため、ユーザ操作の簡略化と負荷低減が図れるとともに、操作ボタンの数を減らすことにより、製造コスト低減を図ることが可能となる。
【００４０】
なお、前述した説明では、図２に示すステップＳ０において、図４に示すボタン操作動作設定記憶部５の内容の全てをシステム起動時に読み込むようにしたが、アプリケーションが起動された時点で、必要なテーブルの内容のみを読み込むようにしてもよい。例えば、データベース検索のアプリケーションが起動された場合は、ボタン操作動作設定記憶部５からデータベース検索用の動作設定のみを読み込むようにする。これにより、設定動作の内容を読み込む時間を短縮することができる。また、図４に示す例では、１つのアプリケーションに対して、１つのテーブルを用意した例を示したが、１つのアプリケーションの各場面毎に１つのテーブルを用意し、各場面毎で操作の持つ意味を変えるようにしてもよい。また、前述した説明では音声のみによる受け答えの例を示したが、対話の進行に応じて表示内容や選択候補ボタンが変化するようなタッチパネル表示装置など、音声以外の対話手段を併用するようにしてもよい。
【００４１】
また、図４に示すボタン操作動作設定記憶部５の内容は、運転者が入力部３から希望する内容を入力し、この入力内容に基づいて書き換え可能としてもよい。ここでいう書き換えとは、新規追加、削除、変更を含む。これにより、短時間プッシュと見なす経過時間の第１のしきい値を２００ｍｓｅｃでなく、３００ｍｓｅｃとしたり、長時間プッシュと見なす経過時間の第２のしきい値を１０００１ｍｓｅｃでなく、８０００ｍｓｅｃとすることができるため、運転者が操作しやすくカスタマイズすることが可能となる。ただし、前述したように、第１のしきい値は、話者が発話する最も短い語句を発話するのに必要な時間より短い時間に設定し、第２のしきい値は、話者が発話する最も長い語句を発話するのに必要な時間より長い時間に設定しなければならない。
【００４２】
また、前述の説明では、ナビゲーションシステムを例にして説明したが、プレストーク入力を用いてシステムとの音声対話を可能にする機器であれば同様に前述の方式を適用可能である。
【００４３】
また、図１における各処理部の機能を実現するためのプログラムをコンピュータ読み取り可能な記録媒体に記録して、この記録媒体に記録されたプログラムをコンピュータシステムに読み込ませ、実行することにより音声対話処理を行ってもよい。なお、ここでいう「コンピュータシステム」とは、ＯＳや周辺機器等のハードウェアを含むものとする。また、「コンピュータ読み取り可能な記録媒体」とは、フレキシブルディスク、光磁気ディスク、ＲＯＭ、ＣＤ−ＲＯＭ等の可搬媒体、コンピュータシステムに内蔵されるハードディスク等の記憶装置のことをいう。さらに「コンピュータ読み取り可能な記録媒体」とは、インターネット等のネットワークや電話回線等の通信回線を介してプログラムが送信された場合のサーバやクライアントとなるコンピュータシステム内部の揮発性メモリ（ＲＡＭ）のように、一定時間プログラムを保持しているものも含むものとする。
【００４４】
また、上記プログラムは、このプログラムを記憶装置等に格納したコンピュータシステムから、伝送媒体を介して、あるいは、伝送媒体中の伝送波により他のコンピュータシステムに伝送されてもよい。ここで、プログラムを伝送する「伝送媒体」は、インターネット等のネットワーク（通信網）や電話回線等の通信回線（通信線）のように情報を伝送する機能を有する媒体のことをいう。また、上記プログラムは、前述した機能の一部を実現するためのものであっても良い。さらに、前述した機能をコンピュータシステムにすでに記録されているプログラムとの組み合わせで実現できるもの、いわゆる差分ファイル（差分プログラム）であっても良い。
【００４５】
【発明の効果】
以上説明したように、この発明によれば、時間監視手段を設け、ユーザが発話ボタンをプッシュしてからリリースするまでの経過時間を測定し、この経過時間が短い場合と長い場合において、発話ボタンが操作ボタンとして利用されたと判断して、ボタン操作動作設定記憶手段に定義された動作を実施するようにしたため、発話ボタン一つで、音声入力ためのプレストーク機能と、対話動作のショートカットなどを実現するための操作ボタン機能を同時に実現することができ、ユーザは自分にとって都合のよい入力を選択することができるという効果が得られる。また、ユーザへの問い合わせ内容に対する回答が「はい／いいえ」であるような場面において、短時間プッシュを「はい」の音声入力とみなしたり、また、運転者ヘの問い合わせ内容に対する回答が複数候補からの選択であるような場面において、短時間プッシュを「もう一度」の音声入力とみなして、状態遷移の制御を容易に行うことが可能になるという効果も得られる。また、入力手段を設け、動作設定記憶手段に記憶されている内容を変更できるようにしたため、ユーザの使いやすい形態にカスタマイズすることができる。
【図面の簡単な説明】
【図１】本発明の一実施形態の構成を示すブロック図である。
【図２】図１に示す音声対話システムにおける音声入力時の動作を示すシーケンス図である。
【図３】図１に示す音声対話システムにおけるボタン操作時の動作を示すシーケンス図である。
【図４】図１に示すボタン操作動作設定記憶部５のテーブル構造を示す説明図である。
【符号の説明】
１・・・発話ボタン
２・・・発話ボタン操作検出部
３・・・時間監視部
４・・・ＨＭＩ制御部
５・・・ボタン操作動作設定記憶部
６・・・マイク
７・・・音声認識部
８・・・スピーカ
９・・・音声再生部
１０・・・音声合成部
１１・・・対話制御部
１２・・・対話シナリオ記憶部
[0001]
[Technical field of the invention]
The present invention relates to a voice interaction system, a voice interaction method, and a voice interaction program for acquiring a user's request by voice interaction when performing a device control process or an information providing process.
[0002]
[Prior art]
To reduce the effect on the driver's driving when operating a car navigation system or similar equipment in a vehicle, it is desirable to reduce the load of the actions by which the driver interacts with the system. On the other hand, to improve the recognition rate of speech recognition, the recognition processing becomes simpler if the speaker operates a switch or the like while speaking. For example, a technique called press talk is known in which the speaker pushes an utterance button only while speaking, and the speech uttered while the utterance button was pushed is taken as the target of speech recognition. Also known are a technique in which a cancel button is provided in addition to the utterance button so that the system can easily return to the previous state when a misrecognition occurs, and a technique in which, apart from voice operation, options are shown on a monitor screen and the input is selected from the options and confirmed by touch operation or remote-control operation.
[0003]
Meanwhile, to reduce the driver's load during dialogue and to reduce manufacturing cost, the fewer buttons and switches used for dialogue the better, and efforts are being made to reduce the number of buttons and switches used for purposes other than driving. For example, in an operation switch provided separately from the utterance button, multiple operations (short button push and long button push) are realized with a single operation switch (Patent Document 1).
In another technique, after the user inputs a voice command, the system asks the user to confirm that the recognition result is correct and then executes the processing corresponding to the voice command. At the point where this confirmation is obtained, the user can answer not only by voice but also with a "permission switch" on the steering wheel (Patent Document 2). However, this "permission switch" has only one meaning, and it exists independently of the speech recognition function and has no relation to the utterance button.
In yet another technique, the utterance button and the operation button are integrated, and the system restricts the button to one use or the other depending on the scene ("utterance button in this scene, operation button in that scene"), so that the utterance button functions as an operation switch in specific scenes (Patent Document 3).
[0004]
[Patent Document 1]
JP 2001-154689 A
[Patent Document 2]
JP 2002-12100 A
[Patent Document 3]
JP 2001-216130 A
[0005]
[Problems to be solved by the invention]
However, in the techniques of Patent Documents 1 and 2, the utterance button and the operation button exist independently, so the user must use several buttons selectively. This involves moving the hand, the fingers, and the line of sight, and increases the user's load. The technique of Patent Document 3 integrates the utterance button and the operation button into one, but the button is limited to only one of the two functions depending on the scene. "Scenes for voice input" and "scenes for operation-button input" are therefore decided at the system's convenience, and the user cannot, for example, choose to use the utterance button as a shortcut button equivalent to "Next" in a voice-input scene.
[0006]
The present invention has been made in view of these circumstances. Its object is to provide a voice interaction system, a voice interaction method, and a voice interaction program in which a single utterance button simultaneously provides a press talk function for voice input and an operation button function for realizing shortcuts in the dialogue, so that the user can select the input that is convenient for the user.
[0007]
[Means for Solving the Problems]
The invention according to claim 1 is a voice interaction system that includes an utterance button switch pushed only while the speaker is speaking and in which the speaker controls a device while interacting with the system by voice, the system comprising: time monitoring means for measuring the elapsed time from when the utterance button switch is pushed until it is released; operation setting storage means in which the operation to be performed next is stored in advance for each elapsed time; and control means for reading, from the operation setting storage means, the operation corresponding to the elapsed time notified by the time monitoring means and executing the read operation.
[0008]
The invention according to claim 2 is characterized in that the operation setting storage means stores in advance a first operation to be performed when the elapsed time is less than a first threshold, a second operation to be performed when the elapsed time is equal to or greater than the first threshold and less than a second threshold, and a third operation to be performed when the elapsed time exceeds the second threshold; the second operation is an operation in which the utterance button switch functions as an utterance button, and the first operation and the third operation are operations in which the utterance button switch functions as respectively different operation buttons.
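The three-way split in claim 2 can be sketched as a small classifier. This is only an illustrative sketch: the names `ButtonRole` and `classify_press` are hypothetical, and because the claim wording leaves an elapsed time exactly equal to the second threshold unassigned, the sketch assumes that case is treated as speech input. Note that the embodiment's table uses inclusive millisecond ranges (0 to 200, 201 to 10000, 10001 or more) rather than the claim's boundary wording.

```python
from enum import Enum


class ButtonRole(Enum):
    SHORT_PRESS = "first operation: operation button"
    TALK = "second operation: utterance button"
    LONG_PRESS = "third operation: operation button"


def classify_press(elapsed_ms, first_threshold_ms, second_threshold_ms):
    """Map a push-to-release duration to one of the three roles in claim 2."""
    if first_threshold_ms >= second_threshold_ms:
        raise ValueError("first threshold must be below second threshold")
    if elapsed_ms < first_threshold_ms:
        return ButtonRole.SHORT_PRESS
    if elapsed_ms > second_threshold_ms:
        return ButtonRole.LONG_PRESS
    # Assumption: elapsed == second threshold is treated as speech input,
    # since the claim text does not assign that exact value to any operation.
    return ButtonRole.TALK
```

With the example thresholds from the embodiment (200 msec and 10000 msec), a 150 msec press would be classified as a short press and a 6000 msec press as speech input.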
[0009]
The invention according to claim 3 is characterized in that the operation setting storage means stores operations for each scene of the dialogue, the first threshold is a time shorter than the time required to utter the shortest of the responses that may be uttered in the target scene, and the second threshold is a time longer than the time required to utter the longest of the responses that may be uttered in the target scene.
[0010]
The invention according to claim 4 is characterized in that the voice interaction system includes input means for arbitrarily setting the first and second thresholds.
[0011]
The invention according to claim 5 is a voice interaction method in a voice interaction system that includes an utterance button switch pushed only while the speaker is speaking and in which the speaker controls a device while interacting with the system by voice, the method comprising: a time monitoring step of measuring the elapsed time from when the utterance button switch is pushed until it is released; an operation setting storage step of storing in advance the operation to be performed next for each elapsed time; and a control step of reading, from among the operations stored in the operation setting storage step, the operation corresponding to the elapsed time notified by the time monitoring step and executing it.
[0012]
The invention according to claim 6 is characterized in that the operation setting storage step stores in advance a first operation to be performed when the elapsed time is less than a first threshold, a second operation to be performed when the elapsed time is equal to or greater than the first threshold and less than a second threshold, and a third operation to be performed when the elapsed time exceeds the second threshold; the second operation is an operation in which the utterance button switch functions as an utterance button, and the first operation and the third operation are operations in which the utterance button switch functions as respectively different operation buttons.
[0013]
The invention according to claim 7 is characterized in that, in the operation setting storage step, operations are stored for each scene of the dialogue, the first threshold is a time shorter than the time required to utter the shortest of the responses that may be uttered in the target scene, and the second threshold is a time longer than the time required to utter the longest of the responses that may be uttered in the target scene.
[0014]
The invention according to claim 8 is characterized in that the voice interaction method has an input step of arbitrarily setting the first and second thresholds.
[0015]
The invention according to claim 9 is a voice interaction program that runs in a voice interaction system that includes an utterance button switch pushed only while the speaker is speaking and in which the speaker controls a device while interacting with the system by voice, the program causing a computer to perform: a time monitoring process of measuring the elapsed time from when the utterance button switch is pushed until it is released; an operation setting storage process of storing in advance the operation to be performed next for each elapsed time; and a control process of reading, from among the operations stored in the operation setting storage process, the operation corresponding to the elapsed time notified by the time monitoring process and executing it.
[0016]
The invention according to claim 10 is characterized in that the operation setting storage process stores in advance a first operation to be performed when the elapsed time is less than a first threshold, a second operation to be performed when the elapsed time is equal to or greater than the first threshold and less than a second threshold, and a third operation to be performed when the elapsed time exceeds the second threshold; the second operation is an operation in which the utterance button switch functions as an utterance button, and the first operation and the third operation are operations in which the utterance button switch functions as respectively different operation buttons.
[0017]
The invention according to claim 11 is characterized in that, in the operation setting storage process, operations are stored for each scene of the dialogue, the first threshold is a time shorter than the time required to utter the shortest of the responses that may be uttered in the target scene, and the second threshold is a time longer than the time required to utter the longest of the responses that may be uttered in the target scene.
[0018]
A twelfth aspect of the present invention is characterized in that the voice interaction program further causes a computer to perform an input process for arbitrarily setting the first and second thresholds.
[0019]
[Embodiments of the invention]
Hereinafter, a voice interaction system according to an embodiment of the present invention will be described with reference to the drawings. FIG. 1 is a block diagram showing the configuration of the embodiment. In this figure, reference numeral 1 denotes an utterance button attached to the steering wheel of an automobile. The utterance button is a button switch whose two states, pushed and released, can be distinguished, and the driver can operate it without taking a hand off the steering wheel. Reference numeral 2 denotes an utterance button operation detection unit that detects the two states of the utterance button 1 (push and release). Reference numeral 3 denotes a time monitoring unit that measures, based on the output of the utterance button operation detection unit 2, the elapsed time from when the utterance button 1 is pushed until it is released. Reference numeral 4 denotes an HMI (Human-Machine Interface) control unit that, while interacting with the driver, outputs control signals to devices in order to perform device control processing (such as turning an air conditioner on or off, or playing or stopping an audio device) and information providing processing (such as executing a database search and reading out the search results). Reference numeral 5 denotes a button operation setting storage unit in which operations corresponding to the operation state of the utterance button 1 are set in advance. Reference numeral 6 denotes a microphone that collects sound. Reference numeral 7 denotes a speech recognition unit that recognizes the speech collected by the microphone 6. Reference numeral 8 denotes a speaker that outputs the speech produced by the system. Reference numeral 9 denotes an audio reproduction unit that reproduces audio data and outputs it from the speaker 8. Reference numeral 10 denotes a speech synthesis unit that synthesizes the audio data of the speech produced by the system. Reference numeral 11 denotes a dialogue control unit that controls the flow of the scenario when a dialogue is carried out with the driver. Reference numeral 12 denotes a dialogue scenario storage unit in which the scenarios for dialogue with the driver are defined in advance. Reference numeral 13 denotes an input unit made up of operation switches such as a touch panel.
[0020]
Next, the operation of the voice interaction system shown in FIG. 1 will be described with reference to FIGS. 2 and 3. In the following description, the HMI control unit 4 is assumed to perform an information providing process that, inside the automobile, searches for a place to eat and provides the results to the driver.
First, with reference to FIG. 2, the operation when speech uttered by the driver is input will be described. When the voice interaction system starts up, the HMI control unit 4 reads the operation settings stored in the button operation setting storage unit 5 (step S0). This operation is executed only once, at startup.
[0021]
Here, the contents stored in the button operation setting storage unit 5 will be described with reference to FIG. 4. The button operation setting storage unit 5 defines, for each application (here, database search, air conditioner control, and audio control), the operations that the HMI control unit 4 should perform, separately for when the utterance button 1 is "pushed" and when it is "released". For "release", the operations are further subdivided by the elapsed time from "push" to "release". In this example, it is defined that on "push" the audio reproduction unit 9 is instructed to stop guidance playback and to play a chime sound, and the speech recognition unit 7 is instructed to start recording speech. For database search, operations are defined for elapsed times until "release" of 0 to 200 msec, 201 to 10000 msec, and 10001 msec or more. Details of the operations based on this table lookup are described later.
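The table described for FIG. 4 could be represented as a nested mapping keyed by application, with one entry for "push" and a list of elapsed-time ranges for "release". The following is a minimal sketch under stated assumptions: the figure itself is not reproduced in the text, so the structure, the names `ReleaseRule`, `BUTTON_SETTINGS`, and `lookup_release_actions`, and all action labels are hypothetical. Only the database search entry is filled in, because the text gives ranges only for that application.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ReleaseRule:
    min_ms: int                 # inclusive lower bound of the range
    max_ms: Optional[int]       # inclusive upper bound, None means open-ended
    actions: Tuple[str, ...]    # hypothetical action labels


# Assumed layout of the button operation settings for one application.
BUTTON_SETTINGS = {
    "database_search": {
        "push": ("stop_guidance", "play_chime", "start_recording"),
        "release": (
            ReleaseRule(0, 200, ("cancel_recognition", "play_chime",
                                 "notify_short_press")),
            ReleaseRule(201, 10000, ("request_recognition_result",
                                     "notify_recognition_result")),
            ReleaseRule(10001, None, ("cancel_recognition", "play_chime",
                                      "notify_long_press")),
        ),
    },
}


def lookup_release_actions(app, elapsed_ms):
    """Return the actions defined for this application and elapsed time."""
    for rule in BUTTON_SETTINGS[app]["release"]:
        upper_ok = rule.max_ms is None or elapsed_ms <= rule.max_ms
        if rule.min_ms <= elapsed_ms and upper_ok:
            return rule.actions
    raise ValueError(f"no rule covers {elapsed_ms} msec")
```

Keeping the ranges as data rather than as branches in code matches the description's point that the settings are read from a storage unit, and it would let a per-scene table replace the per-application one without changing the lookup.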
[0022]
Here, 200 msec is the first threshold referred to in the claims, and 10000 msec is the second threshold. The first and second thresholds differ from application to application, and the operation for each elapsed time is defined in advance so as to make operation easy. However, the first threshold is set to a time shorter than the time required to utter the shortest phrase (for example, "yes") among the phrases the speaker may utter in the target application. The second threshold is set to a time longer than the time required to utter the longest phrase (for example, "Japanese restaurant search") among the phrases the speaker may utter in the target application.
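The constraint on the two thresholds can be expressed as a simple check against the utterance times of the phrases an application accepts. This is a sketch only: the function name `thresholds_are_valid` is hypothetical, and the phrase durations below are invented for illustration, since the text gives no measured utterance times.

```python
def thresholds_are_valid(first_ms, second_ms, phrase_durations_ms):
    """Check the rule that the first threshold lies below the shortest
    utterance and the second threshold lies above the longest one."""
    shortest = min(phrase_durations_ms.values())
    longest = max(phrase_durations_ms.values())
    return first_ms < shortest and second_ms > longest


# Assumed durations, for illustration only (not taken from the document).
EXAMPLE_DURATIONS_MS = {
    "yes": 400,
    "Japanese restaurant search": 2500,
}
```

Under these assumed durations, `thresholds_are_valid(200, 10000, EXAMPLE_DURATIONS_MS)` holds, whereas raising the first threshold to 500 msec would fail the check because a quick "yes" could then be mistaken for a short press.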
[0023]
Next, returning to FIG. 2, the driver selects a database search by operating a switch of the input unit 13 or by voice input operation. In response to the selection of the database search process, the dialog control unit 11 instructs the HMI control unit 4 to perform voice input. Then, the system issues a voice question "What data do you want to search?", And the HMI control unit 4 enters a voice input waiting state.
[0024]
Here, it is assumed that the driver pushes the utterance button 1 and utters “meal search” for, for example, six seconds (6000 msec), and then releases the utterance button 1. In response to this, the utterance button operation detection unit 2 detects that the utterance button 1 has been pushed, and sends an utterance button push detection notification to the time monitoring unit 3 (step S1). In response to this, the time monitoring unit 3 starts timer measurement (step S2) and sends an utterance button push detection notification to the HMI control unit 4 (step S3).
[0025]
Next, the HMI control unit 4 performs the push-time processing according to the operation settings read in advance (step S4). In the push-time processing, first, to stop playback of the guidance sound, the audio reproduction unit 9 is instructed to stop the guidance (step S5), whereby the guidance stops (step S6). Then, to notify the driver that the system has detected the push operation, the audio reproduction unit 9 is instructed to play a chime (step S7), whereby a "pi" chime sound is played (step S8). In addition, the speech recognition unit 7 is instructed to start recording speech (step S9), whereby the speech recognition unit 7 starts recording the speech collected by the microphone 6 and performs recognition processing on the recorded speech (step S10).
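The push-time sequence of steps S4 to S10 can be sketched with stand-in objects for the audio reproduction unit and the speech recognition unit. The class and method names here (`AudioPlayerStub`, `RecognizerStub`, `handle_push`) are hypothetical stubs that only record what they were asked to do; they are not an actual audio or recognition API.

```python
class AudioPlayerStub:
    """Stand-in for the audio reproduction unit; logs requests only."""

    def __init__(self):
        self.log = []

    def stop_guidance(self):
        self.log.append("guidance stopped")

    def play_chime(self, name):
        self.log.append(f"chime:{name}")


class RecognizerStub:
    """Stand-in for the speech recognition unit."""

    def __init__(self):
        self.recording = False

    def start_recording(self):
        self.recording = True


def handle_push(player, recognizer):
    """Push-time processing in the order given in the description."""
    player.stop_guidance()        # steps S5 and S6
    player.play_chime("pi")       # steps S7 and S8
    recognizer.start_recording()  # steps S9 and S10
```

Recording starts on every push, before the system knows whether the press will turn out to be speech input or a button operation; that decision is made only at release time.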
[0026]
Next, when the utterance button operation detection unit 2 detects the release of the utterance button 1, it sends an utterance button release detection notification to the time monitoring unit 3 (step S11). On receiving this notification, the time monitoring unit 3 ends the timer measurement (step S12), obtains the elapsed time from push to release, and sends an utterance button release detection notification containing this elapsed time to the HMI control unit 4 (step S13).
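The role of the time monitoring unit in steps S2, S12, and S13 can be sketched as a small timer object. The class name `TimeMonitor` and its methods are hypothetical; the sketch assumes a monotonic clock (Python's `time.monotonic`) and accepts an injected clock so that the behaviour can be checked without waiting in real time.

```python
import time
from typing import Callable, Optional


class TimeMonitor:
    """Measures the push-to-release duration in milliseconds."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._pushed_at: Optional[float] = None

    def on_push(self) -> None:
        # Step S2: start the timer measurement.
        self._pushed_at = self._clock()

    def on_release(self) -> int:
        # Steps S12 and S13: stop the timer and report the elapsed time.
        if self._pushed_at is None:
            raise RuntimeError("release received without a preceding push")
        elapsed_ms = round((self._clock() - self._pushed_at) * 1000)
        self._pushed_at = None
        return elapsed_ms
```

A monotonic clock is used in this sketch because wall-clock time can jump (for example, on a time synchronization), which would distort a duration measured in hundreds of milliseconds.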
[0027]
Next, the HMI control unit 4 performs a process at the time of release based on the notified elapsed time (step S14). Here, since the elapsed time is 6000 msec, the operation of 201 to 10000 msec defined in the button operation setting storage unit 5 shown in FIG. 4 is performed. If the elapsed time is from 201 to 10000 msec, the HMI control unit 4 requests a recognition result from the speech recognition unit 7 already in the execution state (step S15), and obtains the recognition result (step S16). Then, the obtained speech recognition result (here, “meal search”) is notified to the dialogue control unit 11 (step S17).
[0028]
In response to this, the dialog control unit 11 refers to the dialog scenario storage unit 12 and reads the scenario of the operation to be performed next when “meal search” is specified (step S18). Then, based on the read scenario, it generates the contents of the guidance to be uttered next (step S19). In this example, the guidance content to be uttered next is “meal search”. Subsequently, the dialog control unit 11 notifies the HMI control unit 4 of the generated guidance content "meal search" (step S20).
[0029]
Next, the HMI control unit 4 sends a voice synthesis request including the notified guidance content to the voice synthesis unit 10 (step S21). In response to this, the voice synthesis unit 10 generates voice data of the guidance "meal search" by voice synthesis (step S22), and returns a voice synthesis result response including the voice data to the HMI control unit 4 (step S23).
[0030]
Next, the HMI control unit 4 sends a sound reproduction request including the sound data to the sound reproduction unit 9 (Step S24). In response to this, the audio reproduction unit 9 reproduces the transmitted audio data (Step S25). As a result, a voice saying “meal search” is uttered from the speaker 8. Then, the audio reproduction unit 9 sends a guidance audio reproduction end notification to the HMI control unit 4 (Step S26).
[0031]
Subsequently, when the driver pushes the utterance button 1 in response to the question "meal search", utters "yes", and releases the utterance button 1, processing is performed in the same manner as the operation described above, and the dialog scenario continues.
[0032]
Next, the operation performed when the driver responds by operating the utterance button 1 instead of speaking will be described with reference to FIG. 3. Here, it is assumed that the question “meal search” has been uttered from the speaker 8, and that the driver pushes the utterance button 1 and releases it immediately (after 150 msec). In response to this, the utterance button operation detection unit 2 detects that the utterance button 1 has been pushed, and sends an utterance button push detection notification to the time monitoring unit 3 (step S1). In response to this, the time monitoring unit 3 starts timer measurement (step S2) and sends an utterance button push detection notification to the HMI control unit 4 (step S3).
[0033]
Next, the HMI control unit 4 performs push-time processing according to the operation settings read in advance (step S4). In the push-time processing, the HMI control unit 4 first instructs the audio reproduction unit 9 to stop the guidance (step S5), whereby reproduction of the guidance sound is stopped (step S6). Then, to notify the driver that the system has detected the push operation, it instructs the audio reproduction unit 9 to reproduce a chime (step S7), whereby the chime sound "pi" is reproduced (step S8). In addition, it instructs the voice recognition unit 7 to start voice recording (step S9), whereby the voice recognition unit 7 starts recording the voice collected by the microphone 6 and simultaneously starts the voice recognition process (step S10).
[0034]
Next, when detecting the release of the utterance button 1, the utterance button operation detection unit 2 sends an utterance button release detection notification to the time monitoring unit 3 (step S11). Upon receiving this notification, the time monitoring unit 3 ends the timer measurement (step S12), obtains the elapsed time from the push to the release, and sends an utterance button release detection notification including the elapsed time to the HMI control unit 4 (step S13). The operations in steps S1 to S13 shown in FIG. 3 are the same as those in steps S1 to S13 shown in FIG. 2.
[0035]
Next, the HMI control unit 4 performs release-time processing based on the notified elapsed time (step S31). Here, since the elapsed time is 150 msec, the operation defined for 0 to 200 msec in the button operation setting storage unit 5 shown in FIG. 4 is performed. If the elapsed time is 0 to 200 msec, the HMI control unit 4 instructs the voice recognition unit 7 to stop the voice recognition process that is already being executed (step S32) and, to notify the driver that the input has been identified as an operation button input, instructs the audio reproduction unit 9 to reproduce a chime (step S33), whereby the chime sound “Po” is reproduced (step S34). Then, the HMI control unit 4 notifies the dialog control unit 11 that the user input is an operation button input and that its value is a short-time push (0 to 200 msec) (step S35). In response to this, the dialog control unit 11 refers to the dialog scenario storage unit 12 (step S36) and recognizes this input result (step S37). Through this recognition, the input notified in step S35 is treated as if the driver had answered "Yes" by voice to the question "meal search", and the subsequent processing is continued. The operations after step S37 are the same as steps S18 to S26 shown in FIG. 2.
[0036]
When the elapsed time notified in step S13 shown in FIG. 3 is 10001 msec or more, the HMI control unit 4 reproduces a chime “popo” indicating that the input has been identified as an operation button input, cancels the voice recognition process that is already in the execution state, and notifies the dialog control unit 11 that the user's input is an operation button input and that its value is a long-time push.
[0037]
In this case, the HMI control unit 4 instructs the voice recognition unit 7 to stop the voice recognition process that is already being executed (step S32) and, to notify the driver that the input has been identified as an operation button input, instructs the audio reproduction unit 9 to reproduce a chime (step S33), whereby the chime sound “popo” is reproduced (step S34). Then, the HMI control unit 4 notifies the dialog control unit 11 that the user input is an operation button input and that its value is a long-time push (10001 msec or more) (step S35). The long-time push is used when the currently running application (here, database search) is to be terminated partway through, or when the processing performed so far is to be cleared and the system returned to the initial state.
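The release-time processing described for the three elapsed-time ranges can be summarized as an ordered list of instructions issued by the HMI control unit 4. The following Python sketch is illustrative only; the function name `release_actions`, the unit labels, and the instruction strings are hypothetical, while the ranges, chime sounds and step order follow the description above.

```python
from typing import List, Tuple

Action = Tuple[str, str]  # (destination unit, instruction); labels are hypothetical

SHORT_PUSH_MAX_MS = 200   # upper end of the short-time push range
LONG_PUSH_MIN_MS = 10001  # lower end of the long-time push range


def release_actions(elapsed_ms: int) -> List[Action]:
    """List the instructions issued at release time, in order."""
    if elapsed_ms < 0:
        raise ValueError("elapsed time must not be negative")
    if SHORT_PUSH_MAX_MS < elapsed_ms < LONG_PUSH_MIN_MS:
        # Voice input: steps S15 to S17.
        return [
            ("voice_recognition_unit", "request_result"),
            ("dialog_control_unit", "notify_recognition_result"),
        ]
    is_short = elapsed_ms <= SHORT_PUSH_MAX_MS
    chime = "po" if is_short else "popo"
    value = "short_push" if is_short else "long_push"
    # Operation button input: steps S32 to S35.
    return [
        ("voice_recognition_unit", "stop_recognition"),
        ("audio_reproduction_unit", f"play_chime:{chime}"),
        ("dialog_control_unit", f"notify_button_input:{value}"),
    ]
```

Returning the instructions as data, rather than calling the units directly, is a choice made here only so that the sequence can be inspected; an actual system would dispatch each instruction to the corresponding unit.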
[0038]
As described above, the time monitoring unit 3 is provided to measure the elapsed time from when the driver pushes the utterance button 1 until it is released. When the elapsed time is short (200 msec or less) or long (10001 msec or more), it is determined that the utterance button 1 has been used as an operation button, and the operation defined in the button operation setting storage unit 5 is performed. Therefore, a single utterance button can simultaneously provide a press-talk function for voice input and an operation button function for shortcuts to dialogue operations and the like, and the user can select whichever input is convenient. As a result, state transition control can be performed easily: in a scene where the answer to an inquiry to the driver is “yes / no”, a short-time push is regarded as a voice input of “yes”, and in a scene where the answer is selected from a plurality of candidates, a short-time push is regarded as a voice input of "again".
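The scene-dependent interpretation of a button input on the dialog control side can be sketched as follows. The scene names, the function `interpret_input`, and the label `reset_to_initial_state` for the long-time push are assumptions made for illustration; only the mappings of a short-time push to "yes" and "again" come from the description.

```python
from typing import Dict

# Voice input that a short-time push stands for in each scene (scene names assumed).
SHORT_PUSH_EQUIVALENT: Dict[str, str] = {
    "yes_no": "yes",
    "candidate_selection": "again",
}


def interpret_input(scene: str, kind: str, value: str) -> str:
    """Map a notified input to the response used for state transition control."""
    if kind == "voice":
        # A recognition result is used as it is.
        return value
    if kind != "button":
        raise ValueError(f"unknown input kind: {kind}")
    if value == "short_push":
        return SHORT_PUSH_EQUIVALENT[scene]
    if value == "long_push":
        # Hypothetical label for terminating the application or clearing the state.
        return "reset_to_initial_state"
    raise ValueError(f"unknown button value: {value}")
```

Because the button input is converted into the equivalent voice response before the scenario is consulted, the same dialog scenario can serve both input methods in this sketch.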
[0039]
In addition, whereas voice recognition may produce erroneous recognition due to driving noise and the like, shortcuts by button operation are reliable, and replacing frequently used voice input commands with shortcuts allows the dialogue to proceed in a short time. Therefore, the user operation can be simplified and the load reduced, and the manufacturing cost can be reduced by reducing the number of operation buttons.
[0040]
In the above description, all the contents of the button operation setting storage unit 5 shown in FIG. 4 are read at system start-up in step S0 shown in FIG. 2; however, only the contents of the required table may be read. For example, when a database search application is started, only the operation settings for database search are read from the button operation setting storage unit 5. This shortens the time required to read the operation settings. In the example shown in FIG. 4, one table is prepared for each application; however, a table may be prepared for each scene of an application so that the meaning of an operation changes from scene to scene. Also, although the above description shows an example of answering by voice only, dialogue means other than voice, such as a touch panel display device whose display contents and selection candidate buttons change as the dialogue progresses, may be used together.
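Reading only the tables of the application being started, with one table per scene, might look like the following Python sketch. The storage layout `SETTING_STORAGE`, the application name `other_application`, the scene names, and the operation labels are all hypothetical; the elapsed-time ranges are those used in this description.

```python
from typing import Dict, List, Optional, Tuple

Row = Tuple[int, Optional[int], str]  # (min msec, max msec or None, operation)
SceneTables = Dict[str, List[Row]]

# Assumed contents of the button operation setting storage unit 5.
SETTING_STORAGE: Dict[str, SceneTables] = {
    "database_search": {
        "yes_no": [(0, 200, "yes"), (201, 10000, "voice_input"), (10001, None, "reset")],
        "candidate_selection": [(0, 200, "again"), (201, 10000, "voice_input"), (10001, None, "reset")],
    },
    "other_application": {
        "default": [(0, 200, "yes"), (201, 10000, "voice_input"), (10001, None, "reset")],
    },
}


def load_settings(application: str) -> SceneTables:
    """Read only the tables that belong to the application being started."""
    return {scene: list(rows) for scene, rows in SETTING_STORAGE[application].items()}


def lookup(tables: SceneTables, scene: str, elapsed_ms: int) -> str:
    """Find the operation for the current scene and elapsed time."""
    for low, high, operation in tables[scene]:
        if elapsed_ms >= low and (high is None or elapsed_ms <= high):
            return operation
    raise ValueError(f"no operation defined for {elapsed_ms} msec in scene {scene}")
```

Keying the tables by scene lets the same short-time push yield a different operation in each scene without changing the lookup logic.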
[0041]
Further, the contents of the button operation setting storage unit 5 shown in FIG. 4 may be made rewritable based on contents that the driver enters through an input unit. Here, rewriting includes new addition, deletion, and change. This makes it possible, for example, to set the first threshold of the elapsed time regarded as a short-time push to 300 msec instead of 200 msec, and the second threshold of the elapsed time regarded as a long-time push to 8000 msec instead of 10001 msec, so that the driver can customize the settings for ease of operation. However, as described above, the first threshold must be set to a time shorter than the time required to speak the shortest phrase spoken by the speaker, and the second threshold must be set to a time longer than the time required to speak the longest phrase spoken by the speaker.
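The constraint on customized thresholds can be checked before the table is rewritten, as in the following Python sketch. The function name `build_table` and the utterance durations used in the usage note (400 msec and 7000 msec) are hypothetical. The sketch also assumes that a push at or below the first threshold is a short-time push and a push at or above the second threshold is a long-time push, which matches the ranges 0 to 200 msec and 10001 msec or more used in this description.

```python
from typing import List, Optional, Tuple

Row = Tuple[int, Optional[int], str]  # (min msec, max msec or None, operation)


def build_table(
    first_ms: int,
    second_ms: int,
    shortest_utterance_ms: int,
    longest_utterance_ms: int,
) -> List[Row]:
    """Validate customized thresholds and return the rewritten table."""
    if first_ms < 0:
        raise ValueError("the first threshold must not be negative")
    if shortest_utterance_ms > longest_utterance_ms:
        raise ValueError("the shortest utterance cannot exceed the longest one")
    if first_ms >= shortest_utterance_ms:
        raise ValueError("the first threshold must be shorter than the shortest utterance")
    if second_ms <= longest_utterance_ms:
        raise ValueError("the second threshold must be longer than the longest utterance")
    return [
        (0, first_ms, "short_push"),
        (first_ms + 1, second_ms - 1, "voice_input"),
        (second_ms, None, "long_push"),
    ]
```

For example, under the assumed utterance durations, `build_table(300, 8000, 400, 7000)` is accepted, whereas `build_table(500, 8000, 400, 7000)` is rejected because a 500 msec first threshold could swallow the shortest spoken response.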
[0042]
In the above description, the navigation system has been described as an example. However, the above-described method can be similarly applied to any device that enables a voice conversation with the system using a press talk input.
[0043]
Also, the voice interaction processing may be performed by recording a program for realizing the functions of the respective processing units in FIG. 1 on a computer-readable recording medium, and causing a computer system to read and execute the program recorded on this recording medium. Here, the “computer system” includes an OS and hardware such as peripheral devices. The “computer-readable recording medium” refers to a portable medium such as a flexible disk, a magneto-optical disk, a ROM, or a CD-ROM, or a storage device such as a hard disk built into a computer system. Further, the “computer-readable recording medium” also includes media that hold the program for a certain period of time, such as a volatile memory (RAM) inside a computer system that serves as a server or a client when the program is transmitted via a network such as the Internet or a communication line such as a telephone line.
[0044]
Further, the above program may be transmitted from a computer system storing the program in a storage device or the like to another computer system via a transmission medium or by a transmission wave in the transmission medium. Here, the "transmission medium" for transmitting the program refers to a medium having a function of transmitting information, such as a network (communication network) such as the Internet or a communication line such as a telephone line. Further, the program may be one for realizing a part of the functions described above. Furthermore, the program may be a so-called difference file (difference program), which realizes the functions described above in combination with a program already recorded in the computer system.
[0045]
[Effects of the invention]
As described above, according to the present invention, time monitoring means is provided to measure the elapsed time from when the user pushes the utterance button to when the utterance button is released; when the elapsed time is short or long, it is determined that the utterance button has been used as an operation button, and the operation defined in the operation setting storage means is performed. Therefore, a single utterance button can simultaneously provide a press-talk function for voice input and an operation button function for shortcuts to dialogue operations and the like, and the user can select whichever input is convenient. This also provides the effect that state transition control can be performed easily: in a scene where the answer to an inquiry to the user is “yes / no”, a short-time push is regarded as a voice input of “yes”, and in a scene where the answer is selected from a plurality of candidates, a short-time push is regarded as a voice input of "again". Further, since input means is provided so that the contents stored in the operation setting storage means can be changed, the operation settings can be customized into a form that is easy for the user to use.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of an embodiment of the present invention.
FIG. 2 is a sequence diagram showing an operation at the time of voice input in the voice interaction system shown in FIG. 1.
FIG. 3 is a sequence diagram showing an operation when a button is operated in the voice interaction system shown in FIG. 1.
FIG. 4 is an explanatory diagram showing a table structure of the button operation setting storage unit 5 shown in FIG. 1.
[Explanation of symbols]
1 ... utterance button
2 ... utterance button operation detection unit
3 ... time monitoring unit
4 ... HMI control unit
5 ... button operation setting storage unit
6 ... microphone
7 ... voice recognition unit
8 ... speaker
9 ... audio reproduction unit
10 ... voice synthesis unit
11 ... dialog control unit
12 ... dialog scenario storage unit

Claims

1. A voice interaction system that includes an utterance button switch to be pushed only while a speaker is speaking, and that controls a device while the speaker interacts with the system by voice, the voice interaction system comprising:
time monitoring means for measuring an elapsed time from when the utterance button switch is pushed to when it is released;
operation setting storage means in which an operation to be performed next is stored in advance for each elapsed time; and
control means for reading the operation corresponding to the elapsed time notified by the time monitoring means from the operation setting storage means and executing the read operation.

2. The voice interaction system according to claim 1, wherein
the operation setting storage means stores in advance a first operation to be performed when the elapsed time is less than a first threshold, a second operation to be performed when the elapsed time is equal to or more than the first threshold and less than a second threshold, and a third operation to be performed when the elapsed time exceeds the second threshold; the second operation is an operation in which the utterance button switch functions as an utterance button; and the first operation and the third operation are operations in which the utterance button switch functions as respectively different operation buttons.

3. The voice interaction system according to claim 2, wherein
the operation setting storage means stores an operation for each scene in which a dialogue is performed; the first threshold is set to a time shorter than the time required to utter the shortest response content among the response contents that may be uttered in the target scene; and the second threshold is set to a time longer than the time required to utter the longest response content among the response contents that may be uttered in the target scene.

4. The voice interaction system according to claim 2,
further comprising input means for arbitrarily setting the first and second thresholds.

5. A voice interaction method in a voice interaction system that includes an utterance button switch to be pushed only while a speaker is speaking, and that controls a device while the speaker interacts with the system by voice, the voice interaction method comprising:
a time monitoring step of measuring an elapsed time from when the utterance button switch is pushed to when it is released;
an operation setting storing step of storing in advance an operation to be performed next for each elapsed time; and
a control step of reading, from the operations stored in the operation setting storing step, the operation corresponding to the elapsed time notified by the time monitoring step, and executing the read operation.

6. The voice interaction method according to claim 5, wherein
the operation setting storing step stores in advance a first operation to be performed when the elapsed time is less than a first threshold, a second operation to be performed when the elapsed time is equal to or more than the first threshold and less than a second threshold, and a third operation to be performed when the elapsed time exceeds the second threshold; the second operation is an operation in which the utterance button switch functions as an utterance button; and the first operation and the third operation are operations in which the utterance button switch functions as respectively different operation buttons.

7. The voice interaction method according to claim 6, wherein
the operation setting storing step stores an operation for each scene in which a dialogue is performed; the first threshold is set to a time shorter than the time required to utter the shortest response content among the response contents that may be uttered in the target scene; and the second threshold is set to a time longer than the time required to utter the longest response content among the response contents that may be uttered in the target scene.

8. The voice interaction method according to claim 6,
further comprising an input step of arbitrarily setting the first and second thresholds.

9. A voice interaction program that operates in a voice interaction system which includes an utterance button switch to be pushed only while a speaker is speaking and which controls a device while the speaker interacts with the system by voice, the voice interaction program causing a computer to perform:
a time monitoring process of measuring an elapsed time from when the utterance button switch is pushed to when it is released;
an operation setting storage process of storing in advance an operation to be performed next for each elapsed time; and
a control process of reading, from the operations stored in the operation setting storage process, the operation corresponding to the elapsed time notified by the time monitoring process, and executing the read operation.

10. The voice interaction program according to claim 9, wherein
the operation setting storage process stores in advance a first operation to be performed when the elapsed time is less than a first threshold, a second operation to be performed when the elapsed time is equal to or more than the first threshold and less than a second threshold, and a third operation to be performed when the elapsed time exceeds the second threshold; the second operation is an operation in which the utterance button switch functions as an utterance button; and the first operation and the third operation are operations in which the utterance button switch functions as respectively different operation buttons.

11. The voice interaction program according to claim 10, wherein
the operation setting storage process stores an operation for each scene in which a dialogue is performed; the first threshold is set to a time shorter than the time required to utter the shortest response content among the response contents that may be uttered in the target scene; and the second threshold is set to a time longer than the time required to utter the longest response content among the response contents that may be uttered in the target scene.

12. The voice interaction program according to claim 10 or 11,
further causing a computer to perform an input process of arbitrarily setting the first and second thresholds.