JP3838159B2

JP3838159B2 - Speech recognition dialogue apparatus and program

Info

Publication number: JP3838159B2
Application number: JP2002158985A
Authority: JP
Inventors: 亮輔池谷
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2002-05-31
Filing date: 2002-05-31
Publication date: 2006-10-25
Anticipated expiration: 2022-05-31
Also published as: JP2004004239A

Description

【０００１】
【発明の属する技術分野】
本発明は、話者が発話した内容に対する応答を音声出力する音声認識対話装置に関し、特に音声認識対話装置の周囲に複数の話者がいる中で、ある特定の話者とだけ集中して対話をしたり、複数の話者と代わる代わる対話をしたりすることができる音声認識対話装置に関する。
【０００２】
【従来の技術】
話者の発話内容に対する応答を音声出力する音声認識対話装置においては、話者の発話内容を高い認識率で認識することが必要になる。認識率を高いものとするため、周囲雑音等の影響を低減し、ある特定の話者の発する音声を良好な品質で取り込むようにした音声認識装置は、従来から提案されている（例えば、特開２０００−１４８１８４号公報）。
【０００３】
図７は、特開２０００−１４８１８４号公報に記載されている音声認識装置の構成を示すブロック図である。図７を参照すると、マイクロフォンアレイ等の指向特性や感度特性等を可変できる構成とした音声情報入力部７０と、音声情報入力部７０の指向特性あるいは感度特性等を調整する音声入力制御部７１と、音声入力制御部７１の制御に基づいて音声情報入力部７０より入力された音声信号をＡ／Ｄ変換し、周波数分析を行い、音声の特徴ベクトル列に変換する音声特徴ベクトル抽出部７２と、音声特徴ベクトル抽出部７２から得られた音声特徴ベクトルによって音声認識を行う音声認識部７３と、音声認識部７３の認識結果を表示する認識結果表示部７４と、カメラ等の撮像装置で構成される画像情報入力部７５と、画像情報入力部７５から入力された画像情報を解析する画像情報解析部７６とを備えている。
【０００４】
続いて、特開２０００−１４８１８４号公報に記載されている音声認識装置の動作について説明する。図７において、画像情報解析部７６は、画像情報入力部７５から得られる画像データを解析し、画像内の話者の位置を検出する。画像内における話者の位置は、話者の顔画像を抽出し、それを追跡することなどで求めることができる。音声入力制御部７１は、画像情報解析部７６から送られてくる話者の位置データに基づいて、音声情報入力部７０の指向特性や入力特性、方向を制御する。
【０００５】
【発明が解決しようとする課題】
しかしながら、前述した従来の音声認識装置を音声認識対話装置に使用した場合、次のような問題が発生する。
【０００６】
第１の問題点は、複数の話者が音声認識対話装置のまわりにいる中で、別の方向にいる複数の話者と代わる代わる対話を行うことができないことである。
【０００７】
その理由は、ある特定の話者の音声認識率を向上させるために、特定話者のいる方向にマイクロフォンの感度特性や、マイクロフォンの指向特性を調整しており、他の方向にいる話者の音声を捕捉しづらくしてしまうためである。
【０００８】
第２の問題点は、複数の話者が音声認識対話装置のまわりにいる中で、同じ方向にいる特定の話者とだけ集中した対話を行うことができないことである。
【０００９】
その理由は、話者のいる方向にマイクロフォンの感度特性や、マイクロフォンの指向特性を調整するだけなので、同じ方向からの他の話者が発話した音声も捕捉して音声認識してしまうためである。
【００１０】
【発明の目的】
本発明の目的は、複数の話者が音声認識対話装置のまわりにいる中で、時にはある特定の話者とだけ集中して対話をしたり、時には複数の話者と代わる代わる対話をしたりすることが、対話の中で自然に切り替えてできる音声認識対話装置を提供することにある。
【００１１】
【課題を解決するための手段】
本発明の音声認識対話装置は、発話された音声情報を分析し得られる話者位置特定情報や照合話者情報や音声認識結果をもとに話者への集中度を管理し制御する話者への集中度制御部（図1の１４）と、話者への集中度制御部（図１の１４）が集中度を決定する際に必要な情報を格納し参照および更新が行われる話者及び集中度管理のデータベース（図１の２０）とを有する。
【００１２】
より具体的には、本発明の音声認識対話装置は、
音声情報を取り込む音声入力部（図１の１０）と、
発話した話者の方向を特定する話者位置特定部（図１の１１）と、
発話した話者を特定する話者照合部（図１の１２）と、
前記音声入力部（図１の１０）から入力される音声情報を分析し、音声を認識する音声認識部（図１の１３）と、
特定話者を示す特定話者識別名と、該特定話者識別名によって示される特定話者に対する集中度のレベルとが設定される集中度設定テーブル（図１の２４）と、
該集中度設定テーブル（図１の２４）の内容と前記話者照合部（図１の１２）で特定された話者とに基づいて前記話者の発話を有効にするか否かを判定し、有効にすると判定した場合は、前記話者の発話に対する前記音声認識部（図１の１３）の認識結果に基づいて決定した集中度のレベルと前記話者照合部（図１の１２）で特定された話者の識別名とを用いて前記集中度設定テーブル（図１の２４）中のレベル及び特定話者識別名を更新し、該更新後の集中度設定テーブル（図１の２４）の内容と前記話者位置特定部（図１の１１）で特定された話者の方向とに基づいて、前記音声入力部（図１の１０）の指向性及び方向を制御する集中度制御部（図１の１４）と、
該集中度制御部（図１の１４）で有効にすると判定された発話の認識結果に対する応答を音声出力する音声出力部（図１の１９）とを備えている。
【００１３】
更に、本発明の音声認識対話装置は、所定のイベントが発生したとき、集中度設定テーブル（図１の２４）に設定されている特定話者の集中度のレベルを変更できるようにするため、
所定のイベントが発生したことを検出する他イベント管理部（図１の１６）を備え、且つ、
前記集中度制御部（図１の１４）が、前記他イベント管理部（図１の１６）によって前記所定のイベントの発生が検出されたとき、前記集中度設定テーブル（図１の２４）に設定されている集中度のレベルを変更する構成を有している。
【００１４】
より具体的には、
前記所定のイベントが、前記集中度設定テーブル（図１の２４）に特定話者識別名が設定されている特定話者による発話が所定時間なかったことであり、且つ、
前記集中度制御部（図１の１４）が、前記他イベント管理部（図１の１６）で前記所定のイベントの発生が検出され、且つ、前記集中度設定テーブル（図１の２４）に設定されている集中度のレベルが、該集中度設定テーブル（図１の２４）に設定されている特定話者識別名によって示される特定話者の発話のみを有効にするほど高いものである場合、前記集中度設定テーブル（図１の２４）に設定されている集中度のレベルを、他の話者による発話も有効にするレベルに下げる構成を有する。
【００１５】
【作用】
複数の話者と対話をする中で、予め設定しておく集中度の変移条件をもとに、話者への集中度を制御し、集中度のレベルに応じて、マイクロフォンアレイ等の音声入力部（図１の１０）の指向性や方向を調整する。また、集中度のレベルに応じて、特定話者以外の話者の発話を無効にする。
【００１６】
特定話者に対する集中度のレベルを、特定話者の発話内容のみに基づいて決定すると、特定話者が集中度のレベルを高くする発話を行った後に音声認識対話装置から離れた場合、他の話者の発話が無効にされる状態が続いてしまい、他の話者が、音声認識対話装置と対話を行えなくなってしまう。そこで、他イベント管理部（図１の１６）で所定のイベント（例えば、特定話者による発話がない時間が所定時間継続）の発生が検出された場合、集中度制御部（図１の１４）が、集中度設定テーブル（図１の２４）に設定されている集中度のレベルを、他の話者による発話も有効にするレベルまで下げる。これにより、他の発話者も音声認識対話装置と対話することが可能になる。
【００１７】
【発明の実施の形態】
次に本発明の実施の形態について図面を参照して詳細に説明する。図１を参照すると、本発明に係る音声認識対話装置の第１の実施の形態は、音声入力部１０と、話者位置特定部１１と、話者照合部１２と、音声認識部１３と、集中度制御部１４と、音声入力制御部１５と、他イベント管理部１６と、対話制御部１７と、音声合成部１８と、音声出力部１９と、話者及び集中度管理のためのデータベース２０とから構成されている。
【００１８】
音声入力部１０は、音声情報を電気信号に変換する機能を有している。また、音声入力部１０は、指向性及び方向を変更可能なものであり、例えば、複数のマイクロフォンを円形状に一定の間隔で配置したマイクロフォンアレイにより構成される。
【００１９】
話者位置特定部１１は、音声入力部１０から入力される音声情報を分析し話者の方向を特定する機能を有する。例えば、音声入力部１０が、複数のマイクロフォンを円形状に配置したマイクロフォンアレイにより構成されている場合は、最も出力レベルの高いマイクロフォンの方向を話者の方向とする。上記マイクロフォンの方向は、音声認識対話装置の基準方向に対する方向であり、複数のマイクロフォンの内の基準マイクロフォンと出力レベルが最も高いマイクロフォンとの角度と、上記基準方向と上記基準マイクロフォンとの角度とを加算することにより求まる。
【００２０】
話者照合部１２は、音声入力部１０から入力される音声情報を分析し、登録済みの話者の音声情報と照合し話者を特定する機能を有する。
【００２１】
音声認識部１３は、音声入力部１０から入力される音声情報を分析し音声を認識する機能を有する。
【００２２】
集中度制御部１４は、話者位置特定部１１から入力される話者位置特定情報、話者照合部１２から入力される照合話者情報、音声認識部１３から入力される音声認識結果及び他イベント管理部１６からの通知をもとに話者への集中度を制御する機能を有する。
【００２３】
より具体的には、集中度制御部１４は、以下の機能を有する。
【００２４】
・集中度設定テーブル２４の内容と話者照合部１２からの照合話者情報（話者の識別名）とに基づいて、照合話者情報によって特定される話者の発話を有効にするか否かを判定する機能。
・有効にしないと判定した場合は、音声認識部１３から入力される認識結果を棄却する機能。
・有効にすると判定した場合は、音声認識部１３に認識結果を対話制御部１７に渡す機能。
・有効にすると判定した場合は、音声認識部１３の認識結果と変移条件テーブル２１の内容とに基づいて集中度のレベルを決定し、この決定した集中度のレベルと話者照合部１２からの照合話者情報とに基づいて集中度設定テーブル２４の内容を更新する機能。
・更新後の集中度設定テーブル２４の内容と、定義テーブル２２の内容と、情報テーブル２３の内容とに基づいて、音声入力制御部１５に対して音声入力部１０の方向及び指向性の調整を指示する機能。
【００２５】
なお、データベース２０中の各テーブル２１〜２４については、後で詳細に説明する。
【００２６】
音声入力制御部１５は、集中度制御部１４からの指示に従って、音声入力部１０のマイクロフォンアレイ等の指向性や方向（音声認識対話装置の基準方向に対する基準マイクロフォンの方向）を調整する機能を有する。
【００２７】
他イベント管理部１６は、音声入力以外の時間等の他のイベントを管理し、集中度制御部１４にイベント発生を通知する機能を有する。
【００２８】
対話制御部１７は、集中度制御部１４から送られてくる音声認識結果及び話者照合情報をもとに対話内容を管理し、次の応答内容を決定する機能を有する。
【００２９】
音声合成部１８は、対話制御部１７より入力される応答内容の合成音声を生成する機能を有する。
【００３０】
音声出力部１９は、音声合成部１８から入力される合成音声を出力する機能を有するものであり、スピーカー等によって構成される。
【００３１】
データベース２０は、集中度制御部１４が、話者への集中度を制御する際に使用する変移条件テーブル２１、定義テーブル２２、情報テーブル２３及び集中度設定テーブル２４を備えている。
【００３２】
変移条件テーブル２１には、特定話者に対する集中度のレベルを変移させる各種の条件が格納されている。各条件は、それぞれ条件内容と、現在の集中度のレベル（現在レベル）と、変移させる集中度のレベル（変移レベル）とを含んでいる。例えば、条件Ｎｏ１は、現在の集中度のレベルが「中」のときに、「ありがとう」或いは「もういいよ」が発話されたら、レベルを「低」に変移させることを示している。また、例えば、条件Ｎｏ７は、現在の集中度のレベルが「高」のときに、３０秒間にわたって特定話者による発話がなかった場合、レベルを「中」に変移させることを示している。
【００３３】
定義テーブル２２には、集中度のレベル毎に、集中度制御部１４が行う制御内容が定義されている。例えば、集中度のレベルが「低」の場合には、集中度制御部１４は、音声入力部１０の指向性を−１８０度〜１８０度とし、集中度設定テーブル２４に設定されている特定話者以外の音声認識結果も有効にする。また、集中度が「高」の場合には、集中度制御部１４は、音声入力部１０の方向を特定話者の方向にし、指向性を−４５度〜４５度とし、集中度設定テーブル２４に設定されている特定話者以外の音声認識結果を無効にする。
【００３４】
情報テーブル２３には、話者照合部１２が特定した話者の識別名と話者位置特定部１１で特定された方向とが対応付けて登録されている。この図１の例は、音声認識対話装置の基準方向に対して、父親が０度、母親が９０度、不明者が１８０度の位置に存在することを示している。
【００３５】
集中度設定テーブル２４には、現時点における集中度のレベルと、その対象となる特定話者の識別名とが対応して設定されている。この図１の例は、現時点の集中度のレベルが「高」で、父親が対象となっていることを示している。
【００３６】
次に、図１、図２及び図３を参照して本実施の形態の動作について詳細に説明する。
【００３７】
先ず、図１及び図２を参照して話者が発話したときの動作を説明する。話者が発話をすると、マイクロフォンアレイ等の音声入力部１０を介して入力された音声情報は、それぞれ話者位置特定部１１、話者照合部１２、音声認識部１３へ出力される。話者位置特定部１１では、入力された音声情報を分析し話者の音源方向の特定を行い、話者位置特定情報を集中度制御部１４へ出力する。話者照合部１２では、入力された音声情報を分析し、登録済みの話者の音声情報と照合し話者の特定を行い、照合話者情報を集中度制御部１４へ出力する。音声認識部１３では、入力された音声情報を分析し音声認識結果を集中度制御部１４へ出力する。
【００３８】
集中度制御部１４では、入力される話者位置特定情報と照合話者情報とをもとに、照合話者の情報テーブル２３の位置方向を更新する（図２、Ｓ２０）。
【００３９】
次に、集中度設定テーブル２４に設定されている集中度が、集中した対話状態であるか否かを判定する（Ｓ２１）。判定の結果、集中した対話状態を示すレベル「高」の場合は、照合された話者が、集中度設定テーブル２４中の特定話者の識別名と一致するか否かを判定する（Ｓ２２）。
【００４０】
そして、一致しない場合は、入力された音声認識結果を棄却する（Ｓ２３）。これに対して、一致する場合は、変移条件テーブル２１を検索し、現在レベルが集中度設定テーブル２４に設定されているレベルと一致し、且つ条件内容が音声認識結果と一致する条件を探す（Ｓ２４）。なお、ステップＳ２１でレベル「高」でないと判定された場合も、ステップＳ２４の処理が行われる。
【００４１】
ステップＳ２４において、該当する条件を探し出すことができなかった場合は、ステップＳ２６の処理を行う。これに対して該当する条件を探し出すことができた場合は、集中度設定テーブル２４に設定されている集中度のレベルを、ステップＳ２４で探し出した条件中の変移レベルに変更した後（Ｓ２５）、ステップＳ２６の処理を行う。ステップＳ２６では、集中度設定テーブル２４に設定されている特定話者の識別名を、話者照合部１２で特定された話者の識別名に変更する処理が行われる。
【００４２】
次に、話者への集中度制御部１４は、集中度設定テーブル２４と情報テーブル２３とを参照し、特定話者の位置方向をマイクロフォンアレイ等の方向の設定情報として音声入力制御部１５へ出力すると共に、定義テーブル２２を参照し、現在の集中度のレベルに対応して定義されている、マイクロフォンアレイ等の指向性の設定情報を音声入力制御部１５へ出力し（Ｓ２７）、更に、音声認識結果と照合話者情報とを対話制御部１７へ出力する（Ｓ２８）。
【００４３】
音声入力制御部１５では、話者への集中度制御部１４より入力されたマイクロフォンアレイ等の方向、指向性の設定情報をもとに、音声入力部１０のマイクロフォンアレイ等の指向性や方向を調整する。
【００４４】
対話制御部１７では、話者への集中度制御部１４より入力された音声認識結果と照合話者情報をもとに、次の応答する内容を決定し、音声合成部１８に応答内容を出力する。
【００４５】
音声合成部１８では、入力された応答内容から合成音声を生成し、スピーカー等の音声出力部１９を介して合成音声を出力する。
【００４６】
次に、図１及び図３を参照して、他イベント管理部１６が、予め定められているイベントの発生を検出した場合の動作を説明する。他イベント管理部１６は、予め定められているイベントの発生を検出すると、発生したイベントの種類を集中度制御部１４に通知する。
【００４７】
これにより、集中度制御部１４は、変移条件テーブル２１を検索し、現在レベルが集中度設定テーブル２４に設定されているレベルと一致し、且つ条件内容が通知されたイベントの種類と一致する条件を探す（図３、Ｓ３１）。
【００４８】
そして、ステップＳ３１において該当する条件を探し出すことができなかった場合は、集中度制御部１４は処理を終了する。これに対して、該当する条件を探し出すことができた場合は、集中度制御部１４は、集中度設定テーブル２４に設定されている集中度のレベルを、探し出した条件中の変移レベルに変更し（Ｓ３２）、定義テーブル２２を参照し、現在の集中度のレベルに対応して定義されている、マイクロフォンアレイ等の指向性の設定情報を音声入力制御部１５へ出力し（Ｓ３３）、その後、処理終了となる。
【００４９】
次に、データベース２０内の変移条件テーブル２１および定義テーブル２２の内容が図１に示すものであり、集中度設定テーブル２４に集中度のレベルとしてあらゆる方向からの発話を捕捉できる集中度が発散した状態を表す「低」が設定されている場合を例に挙げて本実施の形態の動作を詳細に説明する。
【００５０】
例えば、音声認識対話装置の背面、側面にそれぞれ父親、母親がいるような複数の話者が別の方向にいる状況下で、父親が「こんにちは」と発話したとする。
【００５１】
この場合、集中度制御部１４は、先ず、話者位置特定部１１から入力される話者位置特定情報と、話者照合部１２から入力される照合話者情報とに基づいて、情報テーブル２３中の父親の位置方向を更新する（図２、Ｓ２０）。その後、集中度制御部１４は、変移条件テーブル２１中の条件Ｎｏ５に従って、集中度設定テーブル２４の集中度のレベルを「中」に変更し、更に、集中度の対象となる特定話者を「父親」に変更する（Ｓ２１がＮｏ、Ｓ２４がＹｅｓ、Ｓ２５、Ｓ２６）。その後、集中度制御部１４は、定義テーブル２２中の集中度のレベル「中」の定義内容に従って、音声入力部１０の方向を特定話者である父親のいる背面方向に向けると共に指向性を−９０度〜９０度に調整する（Ｓ２７）。更に、集中度制御部１４は、ステップＳ２８の処理を行い、これにより、父親が発話した「こんにちは」に対する応答が音声出力部１９から出力される。
【００５２】
その後、側面にいる母親が「元気？」と変移条件テーブル２１の条件内容と一致しない発話を行った場合、集中度制御部１４は、情報テーブル２３中の母親の位置方向を更新し（Ｓ２０）、更に、集中度設定テーブル２４の集中度をレベル「中」の通常の対話状態を持続したまま、集中度の対象となる特定話者を「母親」に変更する（Ｓ２１がＮｏ、Ｓ２４がＮｏ、Ｓ２６）。その後、集中度制御部１４は、音声入力部１０の方向を特定話者である母親のいる側面方向に向けると共に、指向性を−９０度〜９０度に調整する（Ｓ２７）。更に、集中度制御部１４はステップＳ２８の処理を行い、これにより母親が発話した「元気？」に対する応答が音声出力部１９から出力される。
【００５３】
その後、父親が「元気だよね」等と発話した場合は、集中度制御部１４は、ステップＳ２０で情報テーブル２３中の父親の位置方向を更新し、ステップＳ２６で集中度設定テーブル２４中の集中度の対象となる特定話者を父親に変更し、ステップＳ２７で音声入力部１０の方向を、特定話者である父親のいる位置方向に変更する。このように、別の方向にいる父親と母親が代わる代わる音声認識対話装置を相手に対話を行うことができる。
【００５４】
このような通常の対話状態中に、父親が音声認識対話装置を自分に集中させた状態で対話をしたいと考えた場合、「よく聞いて」と発話する。これにより、集中度制御部１４は、ステップＳ２０において情報テーブル２３中の父親の位置方向を変更し、ステップＳ２５において、変移条件テーブル２１の条件Ｎｏ４に従って、集中度設定テーブル２４の集中度のレベルを「高」に変移させ、ステップＳ２６において、集中度の対象となる特定話者を「父親」に変更し、ステップＳ２７において、音声入力部１０の方向を特定話者である父親のいる位置方向に向けると共に指向性を−４５度〜４５度に調整する。この状況下で、父親が続けて対話を行えば、音声入力部１０がまわりの関係のない人の発話や雑音をひろう確率も低減し父親の音声を捕捉しやすくなり音声認識率も向上する。このため、この状況下で母親が何か発話した場合でも、指向性の調整結果により音声入力部１０が音声を捕捉する確率が低減する。仮に、音声入力部１０が音声を捕捉したとしても話者照合部１２で照合される話者は母親となり、現在の集中度設定テーブル２４の集中度の対象となる特定話者の父親と一致しないため（Ｓ２２がＮｏ）、母親の発話内容の音声認識結果は棄却されることになる（Ｓ２３）。
【００５５】
次に、この状況下で、父親が、「昨日のことだけど」等と変移条件テーブル２１の条件内容と一致しない発話を行った場合は、ステップＳ２４の判断結果がＮｏとなるので、集中度設定テーブル２４の集中度のレベルが「高」に保たれたままとなり、父親との集中した対話状態を持続される。
【００５６】
次に、この状況下で、父親が、集中した対話状態を止めたいと考えた場合、父親は「もういいよ」と発話する。これにより、集中度制御部１４は、ステップＳ２５において、変移条件テーブル２１中の条件Ｎｏ２に従って、集中度設定テーブル２４の集中度のレベルを「低」に変移させ、ステップＳ２７において、定義テーブル２２の集中度のレベル「低」の定義内容に基づき、指向性を−１８０度〜１８０度に調整する。また、ステップＳ２５において、集中度設定テーブル２４中のレベルが「低」に変更されているので、次回から特定話者以外の音声認識結果も棄却されずに有効となる（Ｓ２１がＮｏ）。
【００５７】
また仮に、現在の集中度設定テーブル２４の集中度の対象となる特定話者である父親が、集中度のレベルを「高」にしたまま、即ち集中した対話状態にしたままその場を立ち去った場合でも、他イベント管理部１６からの通知に基づいて、母親や他の話者が音声認識対話装置と対話を行えるようになる。
【００５８】
即ち、他イベント管理部１６は、集中度設定テーブル２４に設定されている特定話者の発話がない時間が３０秒続くというイベントを検出すると、上記イベントの種類を集中度制御部１４に通知する。これにより、集中度制御部１４は、変移条件テーブル２１中の条件Ｎｏ７に基づいて、集中度設定テーブル２４中の集中度のレベルを「中」に変更し（図３、Ｓ３１がＹｅｓ、Ｓ３２）、その後、定義テーブル２２中のレベル「中」の指向性に基づいて、音声入力制御部１５に対して、音声入力部１０の指向性−９０度〜９０度に調整することを指示する（Ｓ３３）。
【００５９】
さらに、集中度設定テーブル２４に登録されている特定話者による発話がない時間が３０秒続くと、他イベント管理部１６は、再度上記イベントの種類を集中度制御部１４に通知する。これにより、集中度制御部１４は、変移条件テーブル２１中の条件Ｎｏ６に基づいて、集中度設定テーブル２４中の集中度のレベルを「低」とし（Ｓ３１がＹｅｓ、Ｓ３２）、その後、定義テーブル２２中のレベル「低」の指向性に基づいて音声入力制御部１５に対して、音声入力部１０の指向性を−１８０度〜１８０度に調整することを指示する（Ｓ３３）。以上のように、発話がない時間が３０秒続くと、集中度設定テーブル２４中のレベルが「高」から「中」へ、或いは「中」から「低」へ変更されるので、特定話者である父親が集中度のレベルを「高」にしたまま、その場を立ち去っても、母親や他の話者が音声認識対話装置と対話することが可能になる。
【００６０】
なお、他イベント管理部１６は、例えば、次のようにして、集中度設定テーブル２４に登録されている特定話者による発話がない時間が３０秒続いたことを検出する。
【００６１】
他イベント管理部１６には、集中度制御部１４からクリア信号と、カウント開始信号とが入力されている。クリア信号は、集中度制御部１４が、集中度設定テーブル２４に設定されている特定話者の発話開始を検出したときに出力する信号であり、カウント開始信号は、集中度制御部１４が集中度設定テーブル２４に設定されている特定話者の発話終了を検出したときに出力する信号である。他イベント管理部１６は、その内部にカウンタを有しており、クリア信号が入力されると、カウンタのカウント値を「０」にすると共にカウント動作を停止し、カウント開始信号が入力されると、カウント動作を開始する。そして、カウント値が３０秒に対応する値になると、集中度制御部１４に対して発話のない時間が３０秒続いたことを通知し、更に、カウント値を「０」にしてカウント動作を再開する。
【００６２】
次に、例えば、音声認識対話装置の背面に父親と母親がいるような複数の話者が同じ方向にいる状況下において、父親が「こんにちは」と発話した場合の動作を説明する。なお、変移条件テーブル２１、定義テーブル２２の内容は図１に示すものであり、集中度設定テーブル２４には、集中度のレベルとしてあらゆる方向からの発話を捕捉できる集中度が発散した状態を表す「低」が設定されているとする。
【００６３】
父親が「こんにちは」と発話すると、集中度制御部１４は、ステップＳ２０において、情報テーブル２３中の父親の位置方向を更新し、ステップＳ２５において、変移条件テーブル２１の条件Ｎｏ５に従って、集中度設定テーブル２４中の集中度のレベルを「中」に変更し、ステップＳ２６において集中度設定テーブル２４に集中度の対象となる特定話者として「父親」を設定する。その後、集中度制御部１４は、ステップＳ２７において、定義テーブル２２の集中度のレベル「中」の定義内容に基づいて、音声入力部１０の方向を特定話者である父親のいる背面方向に調整すると共に、指向性を−９０度〜９０度に調整する。
【００６４】
この状況下で、同じ方向にいる母親が「元気？」と発話した場合は、集中度制御部１４は、集中度設定テーブル２４の集中度をレベル「中」の通常の対話状態にしたまま、集中度の対象となる特定話者を母親に変更する（Ｓ２４がＮｏ、Ｓ２６）。集中度設定テーブル２４のレベルが「中」のままであるので、音声入力部１０は同じ方向を向いたままとなる。この状況下で父親が「元気だよね」等と発話した場合は、現在の集中度設定テーブル２４の集中度の対象となる特定話者が父親に変更されるというように、同じ方向にいる父親と母親とが音声認識対話装置と代わる代わる対話を行うことができる。
【００６５】
このような対話中に、父親が音声認識対話装置を自分に集中させた状態で対話をしたいと考えた場合、父親は「よく聞いて」と発話する。これにより、集中度制御部１４は、ステップＳ２５において、変移条件テーブル２１中の条件Ｎｏ４に従って、集中度設定テーブル２４中の集中度のレベルが「高」に変更し、ステップＳ２６において、集中度の対象となる特定話者を「父親」に変更する。この状況下で同じ方向にいる母親が何か発話した場合、音声入力部１０で音声を捕捉するが話者照合部１２で照合される話者は母親となり、現在の集中度設定テーブル２４の集中度の対象となる特定話者の父親と一致しないため、母親の発話内容の音声認識結果は棄却されることになり（Ｓ２１がＹｅｓ、Ｓ２２がＮｏ、Ｓ２３）、父親と集中して対話ができるようになる。また、集中度設定テーブル２４の集中度のレベルが「高」の時は、集中度の定義テーブル２２の集中度のレベル「高」の定義内容により指向性も−４５度〜４５度に調整されるため、音声入力部１０が別の方向の関係のない人の発話や雑音をひろう確率も低減し父親の音声を捕捉しやすくなり音声認識率も向上する。
【００６６】
次に、この状況下で、父親が、「昨日のことだけど」等と集中度の変移条件テーブル２１の集中度の条件内容と一致しない発話を行った場合（Ｓ２４がＮｏ）は、集中度設定テーブル２４の集中度のレベルを「高」にしたままの集中した対話状態を持続する。
【００６７】
次に、この状況下で、父親が集中した対話状態を止めたいと考えた場合、父親は「もういいよ」と発話する。これにより、集中度制御部１４は、ステップＳ２５において、変移条件テーブル２１中の条件Ｎｏ２に従って、集中度設定テーブル２４中の集中度のレベルを「低」に変更し、ステップＳ２７において音声入力部１０の指向性を−１８０度〜１８０度に調整する。集中度設定テーブル２４の集中度のレベルが、あらゆる方向からの発話も捕捉できる集中度が発散した状態を表す「低」となるので、次回から特定話者以外の音声認識結果も棄却されずに有効とされる（Ｓ２１がＮｏ）。
【００６８】
また仮に、集中度設定テーブル２４に識別名が設定されている特定話者である父親が、集中度のレベル「高」の集中した対話状態にしたままその場を立ち去った場合でも、図３の流れ図を用いて既に説明してあるように、発話がない時間が３０秒続くと集中度の変移条件テーブル２１の条件Ｎｏ７により、集中度設定テーブル２４中の集中度のレベルが「中」に変移し、さらに発話がない時間が３０秒続くと集中度の変移条件テーブル２１の条件Ｎｏ６により、集中度設定テーブル２４中の集中度のレベルが「低」に変移するため、母親や他の話者も音声認識対話装置と対話することが可能になる。
【００６９】
次に本実施の形態の効果について説明する。
【００７０】
本実施の形態では、複数の話者が別の方向や同じ方向にいる状況下で、話者への集中度制御部１４で話者への集中度を制御することにより、時にはある特定の話者とだけ集中して対話をし、時には複数の話者と代わる代わる対話をするといった切り換えを、対話の中で自然に行うことができる。
【００７１】
また、特定の話者との対話中に、他の関係のない人の発話や雑音を拾ってしまう確率を対話の中で低減させることができる。
【００７２】
【発明の他の実施例】
図４は、本発明の第２の実施の形態を示すブロック図である。図４を参照すると、本発明の第２の実施の形態は、図１に示された第１の実施の形態と、画像入力部４０が追加されている点、話者位置特定部１１の代わりに話者位置特定部４１を備えている点、話者照合部１２の代わりに話者照合部４２を備えている点が相違している。なお、他の図１と同一符号は同一部分を表している。
【００７３】
画像入力部４０は、３６０度の範囲の画像情報を取り込む機能を有するものであり、例えば、複数台のＣＣＤカメラ等により実現される。
【００７４】
話者位置特定部４１は、音声入力部１０から入力される音声情報と、画像入力部４０から入力される画像情報とに基づいて、発話した話者の方向を特定する機能を有する。
【００７５】
話者照合部４２は、音声入力部１０からの音声情報と画像入力部４０からの画像情報とに基づいて話者を特定する機能を有する。
【００７６】
次に本実施の形態の動作について説明する。
【００７７】
話者位置特定部４１は、音声入力部１０から音声情報が入力されると、先ず、音声情報に基づいて発話した話者の方向を特定する。その後、話者位置特定部４１は、画像入力部４０が入力した画像情報に基づいて、音声認識対話装置の周囲にいる全ての話者の方向を求める。その後、画像情報に基づいて求めた各話者の方向の内の、音声情報に基づいて求めた話者の方向に最も近い方向を発話した話者が存在する方向とし、その方向を集中度制御部１４に出力する。
【００７８】
話者照合部４２は、音声入力部１０から音声情報が入力されると、音声情報に基づいて発話した話者を特定する。更に、話者照合部４２は、画像入力部４０を解析し、口元が動いている話者を認識し、この話者の顔の画像と、予め登録されている複数の話者の顔画像とを照合することにより、発話した話者を特定する。音声情報により特定した話者と、画像情報により特定した話者とが一致する場合は、上記話者を示す照合話者情報を集中度制御部１４に対して出力し、一致しない場合は、例えば、画像情報により特定した話者を示す照合話者情報を集中度制御部１４に対して出力する。
【００７９】
上記した動作以外は、第１の実施の形態と同様であるので、ここでは、説明を省略する。
【００８０】
上述したように本実施の形態は、マイクロフォンアレイ等の音声入力部１０に加え、カメラ等の画像入力部４０を備えており、音声情報と画像情報の両方に基づいて、発話した話者の方向、発話した話者を認識しているので、認識精度を高いものにすることができる。
【００８１】
図５は、本発明の第３の実施の形態を示すブロック図である。図５を参照すると、本発明の第３の実施の形態は、図１に示された第１の実施の形態の構成に音声モデルデータベース５１を追加した点、音声認識部１３の代わりに音声認識部５２を備えた点、および集中度制御部１４の代わりに集中度制御部５３を備えた点で異なる。なお、他の図１と同一符号は、同一部分を表している。
【００８２】
音声モデルデータベース５１には、音声認識対話装置を使用する各話者それぞれの音声モデル、および標準音声モデルが登録されている。これらは、音声認識を行う際に使用される。
【００８３】
集中度制御部５３は、集中度制御部１４が備えている機能に加え、集中度設定テーブル２４に設定されている特定話者識別名を音声モデルデータベース５１に設定する。
【００８４】
音声認識部５２は、音声認識を行う際、音声モデルデータベース５１中の音声モデルの内、集中度制御部５３によって設定されている特定話者識別名と対応する話者の音声モデルを使用して音声認識を行う。このようにすることにより、集中度の対象となる特定話者の音声認識率を向上させることができる効果がある。なお、特定話者識別名が「不明」となっている場合は、音声認識部５２は、標準音声モデルを使用して音声認識を行う。
【００８５】
図６は本発明に係る音声認識対話装置のハードウェア構成の一例を示すブロック図であり、コンピュータ６１と、記録媒体６２と、音声入力部６３と、音声出力部６４と、データベース６５とから構成されている。音声入力部６３、音声出力部６４、データベース６５は、それぞれ図１に示した音声入力部１０、音声出力部１９、データベース２０に対応する。記録媒体６２は、ディスク、半導体メモリ、その他の記録媒体であり、コンピュータ６１を音声認識対話装置の一部として機能させるためのプログラムが記録されている。このプログラムは、コンピュータ６１によって読み取られ、その動作を制御することで、コンピュータ６１上に図１に示した話者位置特定部１１、話者照合部１２、音声認識部１３、集中度制御部１４、音声入力制御部１５、他イベント管理部１６、対話制御部１７、音声合成部１８を実現する。
【００８６】
【発明の効果】
第１の効果は、複数の話者が音声認識対話装置のまわりにいる中で、特に、別の方向に複数の話者がいる場合でも、時にはある特定の話者とだけ集中して対話をし、時には複数の話者と代わる代わる対話をするといった切り替えを、対話の中で自然に行えるということである。
【００８７】
その理由は、話者の発話内容に応じて話者に対する集中度のレベルを決定し、集中度のレベルに応じて、マイクロフォンアレイ等の音声入力部の指向性や方向を調整させることができるためである。
【００８８】
第２の効果は、複数の話者が音声認識対話装置のまわりにいる中で、特に、同じ方向に複数の話者がいる場合でも、時にはある特定の話者とだけ集中して対話をし、時には複数の話者と代わる代わる対話をするといった切り替えを、対話の中で自然に行えるということである。
【００８９】
その理由は、話者の発話内容に応じて話者に対する集中度のレベルを決定し、集中度のレベルに応じて、特定話者以外の話者の発話を無効にできるためである。
【図面の簡単な説明】
【図１】本発明の第１の実施の形態の構成例を示すブロック図である。
【図２】話者位置特定部１１、話者照合部１２、音声認識部１３から入力があったときの集中度制御部１４の処理例を示す流れ図である。
【図３】他イベント管理部１６から通知があったときの集中度制御部１４の処理例を示す流れ図である。
【図４】本発明の第２の実施の形態の構成例を示すブロック図である。
【図５】本発明の第３の実施の形態の構成例を示すブロック図である。
【図６】音声認識対話装置のハードウェア構成の一例を示すブロック図である。
【図７】従来の技術を説明するためのブロック図である。
【符号の説明】
１０音声入力部
１１話者位置特定部
１２話者照合部
１３音声認識部
１４集中度制御部
１５音声入力制御部
１６他イベント管理部
１７対話制御部
１８音声合成部
１９音声出力部
２０データベース
２１変移条件テーブル
２２定義テーブル
２３情報テーブル
２４集中度設定テーブル
４０画像入力部
４１話者位置特定部
４２話者照合部
５１音声モデルデータベース
５２音声認識部
５３集中度制御部
６１コンピュータ
６２記録媒体
６３音声入力部
６４音声出力部
６５データベース
７０音声入力部
７１音声入力制御部
７２音声特徴ベクトル抽出部
７３音声認識部
７４認識結果表示部
７５画像情報入力部
７６画像情報解析部
[0001]
TECHNICAL FIELD OF THE INVENTION
The present invention relates to a speech recognition dialogue apparatus that outputs a spoken response to what a speaker has said. In particular, it relates to a speech recognition dialogue apparatus that, when a plurality of speakers are around it, can either concentrate its dialogue on one specific speaker or converse with the plurality of speakers in turn.
[0002]
[Prior art]
In a speech recognition dialogue apparatus that outputs a spoken response to a speaker's utterance, the utterance must be recognized at a high recognition rate. To raise the recognition rate, speech recognition apparatuses that reduce the influence of ambient noise and the like and capture the voice of a specific speaker with good quality have been proposed (for example, Japanese Patent Laid-Open No. 2000-148184).
[0003]
FIG. 7 is a block diagram showing the configuration of the speech recognition apparatus described in Japanese Patent Laid-Open No. 2000-148184. Referring to FIG. 7, the apparatus comprises: a voice information input unit 70, such as a microphone array, whose directivity characteristics, sensitivity characteristics, and the like can be varied; a voice input control unit 71 that adjusts the directivity characteristics, sensitivity characteristics, and the like of the voice information input unit 70; a voice feature vector extraction unit 72 that A/D-converts the voice signal input from the voice information input unit 70 under the control of the voice input control unit 71, performs frequency analysis, and converts the signal into a sequence of voice feature vectors; a speech recognition unit 73 that performs speech recognition using the voice feature vectors obtained from the voice feature vector extraction unit 72; a recognition result display unit 74 that displays the recognition result of the speech recognition unit 73; an image information input unit 75 composed of an imaging device such as a camera; and an image information analysis unit 76 that analyzes the image information input from the image information input unit 75.
[0004]
Subsequently, the operation of the speech recognition apparatus described in Japanese Patent Application Laid-Open No. 2000-148184 will be described. In FIG. 7, the image information analysis unit 76 analyzes the image data obtained from the image information input unit 75 and detects the position of the speaker in the image. The position of the speaker in the image can be obtained by extracting the face image of the speaker and tracking it. The voice input control unit 71 controls the directivity characteristics, input characteristics, and direction of the voice information input unit 70 based on the speaker position data transmitted from the image information analysis unit 76.
[0005]
[Problems to be solved by the invention]
However, when the above-described conventional speech recognition apparatus is used for a speech recognition dialogue apparatus, the following problems occur.
[0006]
The first problem is that, when a plurality of speakers are around the speech recognition dialogue apparatus, the apparatus cannot converse in turn with speakers located in different directions.
[0007]
The reason is that, to improve the speech recognition rate for a specific speaker, the sensitivity characteristics and directivity characteristics of the microphone are adjusted toward the direction of that speaker, which makes it harder to capture the voices of speakers in other directions.
[0008]
The second problem is that, while a plurality of speakers are around the speech recognition dialogue apparatus, it is not possible to conduct a concentrated dialogue only with a specific speaker in the same direction.
[0009]
The reason is that the sensitivity characteristics and directivity characteristics of the microphone are merely adjusted toward the direction of the speaker, so that voices uttered by other speakers in the same direction are also captured and recognized.
[0010]
OBJECT OF THE INVENTION
An object of the present invention is to provide a speech recognition dialogue apparatus that, when a plurality of speakers are around it, can switch naturally in the course of the dialogue between conversing intensively with one specific speaker and conversing with the plurality of speakers in turn.
[0011]
[Means for Solving the Problems]
The speech recognition dialogue apparatus according to the present invention has a speaker concentration control unit (14 in FIG. 1) that manages and controls the degree of concentration on a speaker based on the speaker position specifying information, verification speaker information, and speech recognition result obtained by analyzing the uttered voice information, and a speaker and concentration management database (20 in FIG. 1) that stores the information the concentration control unit (14 in FIG. 1) needs when determining the degree of concentration and that is referenced and updated.
[0012]
More specifically, the speech recognition dialogue apparatus of the present invention comprises:
a voice input unit (10 in FIG. 1) that captures voice information;
a speaker position specifying unit (11 in FIG. 1) that specifies the direction of the speaker who has spoken;
a speaker verification unit (12 in FIG. 1) that identifies the speaker who has spoken;
a speech recognition unit (13 in FIG. 1) that analyzes the voice information input from the voice input unit (10 in FIG. 1) and recognizes the speech;
a concentration setting table (24 in FIG. 1) in which a specific speaker identification name indicating a specific speaker and a concentration level for the specific speaker indicated by that identification name are set;
a concentration control unit (14 in FIG. 1) that determines, based on the contents of the concentration setting table (24 in FIG. 1) and the speaker identified by the speaker verification unit (12 in FIG. 1), whether to treat the speaker's utterance as valid; that, when the utterance is determined to be valid, updates the level and the specific speaker identification name in the concentration setting table (24 in FIG. 1) using the concentration level determined from the recognition result of the speech recognition unit (13 in FIG. 1) for that utterance and the identification name of the speaker identified by the speaker verification unit (12 in FIG. 1); and that controls the directivity and direction of the voice input unit (10 in FIG. 1) based on the contents of the updated concentration setting table (24 in FIG. 1) and the speaker direction specified by the speaker position specifying unit (11 in FIG. 1); and
a voice output unit (19 in FIG. 1) that outputs by voice a response to the recognition result of an utterance determined to be valid by the concentration control unit (14 in FIG. 1).
[0013]
Furthermore, so that the concentration level of the specific speaker set in the concentration setting table (24 in FIG. 1) can be changed when a predetermined event occurs, the speech recognition dialogue apparatus of the present invention
comprises an other event management unit (16 in FIG. 1) that detects that a predetermined event has occurred, and
the concentration control unit (14 in FIG. 1) is configured to change the concentration level set in the concentration setting table (24 in FIG. 1) when the other event management unit (16 in FIG. 1) detects the occurrence of the predetermined event.
[0014]
More specifically,
the predetermined event is that the specific speaker whose identification name is set in the concentration setting table (24 in FIG. 1) has not spoken for a predetermined time, and
the concentration control unit (14 in FIG. 1) is configured so that, when the other event management unit (16 in FIG. 1) detects the occurrence of the predetermined event and the concentration level set in the concentration setting table (24 in FIG. 1) is high enough that only utterances of the specific speaker indicated by the identification name set in the concentration setting table (24 in FIG. 1) are treated as valid, it lowers the concentration level set in the concentration setting table (24 in FIG. 1) to a level at which utterances by other speakers are also valid.
[0015]
[Operation]
While conversing with a plurality of speakers, the apparatus controls the degree of concentration on a speaker based on preset concentration transition conditions and, according to the concentration level, adjusts the directivity and direction of the voice input unit (10 in FIG. 1), such as a microphone array. Also according to the concentration level, it invalidates the utterances of speakers other than the specific speaker.
[0016]
If the concentration level for the specific speaker were determined only from the content of the specific speaker's utterances, then when the specific speaker left the speech recognition dialogue apparatus after making an utterance that raised the concentration level, the utterances of other speakers would remain invalidated and those speakers could no longer converse with the apparatus. Therefore, when the other event management unit (16 in FIG. 1) detects the occurrence of a predetermined event (for example, a period with no utterance by the specific speaker lasting a predetermined time), the concentration control unit (14 in FIG. 1) lowers the concentration level set in the concentration setting table (24 in FIG. 1) to a level at which utterances by other speakers are also valid. As a result, other speakers can also converse with the speech recognition dialogue apparatus.
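The silence-triggered lowering of the concentration level can be sketched as a small counter. This is only an illustration under stated assumptions: the class `SilenceTimer`, the function `lower_level`, and the once-per-second `tick` call are hypothetical names, and the 30-second timeout and the step order "high" to "medium" to "low" follow the example conditions given elsewhere in the description rather than a required implementation.

```python
from typing import Callable

# Assumed step-down order, following the example conditions in the description.
LOWER = {"high": "medium", "medium": "low"}


def lower_level(level: str) -> str:
    """Return the next lower concentration level ("low" stays "low")."""
    return LOWER.get(level, level)


class SilenceTimer:
    """Hypothetical counter for silence from the specific speaker."""

    def __init__(self, timeout_s: int, on_timeout: Callable[[], None]) -> None:
        self.timeout_s = timeout_s
        self.on_timeout = on_timeout
        self.count = 0
        self.running = False

    def clear(self) -> None:
        # The specific speaker started speaking: reset and stop counting.
        self.count = 0
        self.running = False

    def start(self) -> None:
        # The specific speaker stopped speaking: begin counting silence.
        self.running = True

    def tick(self) -> None:
        # Assumed to be called once per elapsed second.
        if not self.running:
            return
        self.count += 1
        if self.count >= self.timeout_s:
            self.count = 0  # keep counting so a further timeout can fire
            self.on_timeout()
```

Driving the timer with a callback that applies `lower_level` to the stored level would step the level down once per silent interval, so that other speakers regain access after the specific speaker leaves.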
[0017]
DETAILED DESCRIPTION OF THE INVENTION
Next, embodiments of the present invention will be described in detail with reference to the drawings. Referring to FIG. 1, the first embodiment of the speech recognition dialogue apparatus according to the present invention comprises a voice input unit 10, a speaker position specifying unit 11, a speaker verification unit 12, a speech recognition unit 13, a concentration control unit 14, a voice input control unit 15, an other event management unit 16, a dialogue control unit 17, a speech synthesis unit 18, a voice output unit 19, and a database 20 for speaker and concentration management.
[0018]
The voice input unit 10 has a function of converting voice information into an electrical signal. In addition, the voice input unit 10 can change directivity and direction, and is configured by, for example, a microphone array in which a plurality of microphones are arranged in a circular shape at regular intervals.
[0019]
The speaker position specifying unit 11 has a function of analyzing the voice information input from the voice input unit 10 and specifying the direction of the speaker. For example, when the voice input unit 10 is a microphone array in which a plurality of microphones are arranged in a circle, the direction of the microphone with the highest output level is taken as the speaker direction. This microphone direction is expressed relative to the reference direction of the speech recognition dialogue apparatus, and is obtained by adding the angle between the reference microphone (one of the plurality of microphones) and the microphone with the highest output level to the angle between the reference direction and the reference microphone.
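The angle addition described above can be sketched as follows. This is a minimal illustration, not the apparatus's actual processing: the function name `speaker_direction`, the assumption that microphones are evenly spaced and indexed in order of increasing angle, and the wrapping of the result into the range (−180, 180] are all assumptions made for the example.

```python
from typing import Sequence


def speaker_direction(levels: Sequence[float],
                      ref_mic_index: int,
                      ref_mic_offset_deg: float) -> float:
    """Estimate the speaker direction relative to the apparatus reference direction.

    levels: output level of each microphone (assumed evenly spaced on a circle,
            indexed in order of increasing angle).
    ref_mic_index: index of the reference microphone.
    ref_mic_offset_deg: angle from the reference direction to the reference microphone.
    """
    n = len(levels)
    loudest = max(range(n), key=lambda i: levels[i])
    # Angle between the reference microphone and the loudest microphone.
    angle_from_ref_mic = (loudest - ref_mic_index) * 360.0 / n
    # Add the angle between the reference direction and the reference microphone.
    total = (angle_from_ref_mic + ref_mic_offset_deg) % 360.0
    # Wrap into (-180, 180] (an assumed convention for this sketch).
    return total - 360.0 if total > 180.0 else total
```

With four microphones and the reference microphone aligned with the reference direction, the loudest microphone at index 1 would give 90 degrees.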
[0020]
The speaker verification unit 12 has a function of analyzing the voice information input from the voice input unit 10 and matching it against the voice information of registered speakers to identify the speaker.
[0021]
The speech recognition unit 13 has a function of analyzing the voice information input from the voice input unit 10 and recognizing the speech.
[0022]
The concentration control unit 14 has a function of controlling the degree of concentration on a speaker based on the speaker position specifying information input from the speaker position specifying unit 11, the verification speaker information input from the speaker verification unit 12, the speech recognition result input from the speech recognition unit 13, and notifications from the other event management unit 16.
[0023]
More specifically, the concentration control unit 14 has the following functions.
[0024]
・A function of determining, based on the contents of the concentration setting table 24 and the verification speaker information (the speaker's identification name) from the speaker verification unit 12, whether to treat the utterance of the speaker identified by the verification speaker information as valid.
・A function of rejecting the recognition result input from the speech recognition unit 13 when the utterance is determined not to be valid.
・A function of passing the recognition result from the speech recognition unit 13 to the dialogue control unit 17 when the utterance is determined to be valid.
・A function of, when the utterance is determined to be valid, determining the concentration level based on the recognition result of the speech recognition unit 13 and the contents of the transition condition table 21, and updating the contents of the concentration setting table 24 based on the determined concentration level and the verification speaker information from the speaker verification unit 12.
・A function of instructing the voice input control unit 15 to adjust the direction and directivity of the voice input unit 10 based on the contents of the updated concentration setting table 24, the contents of the definition table 22, and the contents of the information table 23.
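The accept-or-reject decision and the table update listed above can be sketched as follows. This is a simplified illustration under assumptions: `ConcentrationSetting`, `is_valid`, and `handle_utterance` are hypothetical names, the new level is passed in directly instead of being looked up in the transition condition table 21, the rule that only the "high" level rejects other speakers is taken from the definition table example in the description, and the steering instruction to the voice input control unit 15 is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConcentrationSetting:
    level: str    # "low", "medium" or "high"
    speaker: str  # identification name of the specific speaker


def is_valid(setting: ConcentrationSetting, speaker: str) -> bool:
    # At "high", only the specific speaker's utterances are accepted.
    return setting.level != "high" or speaker == setting.speaker


def handle_utterance(setting: ConcentrationSetting,
                     speaker: str,
                     text: str,
                     new_level: Optional[str] = None) -> Optional[str]:
    """Return the recognition result to hand to dialogue control, or None if rejected."""
    if not is_valid(setting, speaker):
        return None  # recognition result is rejected; the table is left unchanged
    if new_level is not None:
        setting.level = new_level  # level decided from the recognition result
    setting.speaker = speaker      # the accepted speaker becomes the specific speaker
    return text
```

Keeping the validity check separate from the update makes it easy to see that a rejected utterance never changes the stored level or speaker.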
[0025]
The tables 21 to 24 in the database 20 are described in detail later.
[0026]
The voice input control unit 15 has a function of adjusting, in accordance with instructions from the concentration control unit 14, the directivity and the direction (the direction of the reference microphone with respect to the reference direction of the speech recognition dialogue apparatus) of the microphone array or the like of the voice input unit 10.
[0027]
The other event management unit 16 has a function of managing events other than voice input, such as the passage of time, and notifying the concentration control unit 14 when an event occurs.
[0028]
The dialogue control unit 17 has a function of managing the dialogue content based on the speech recognition result and the verification speaker information sent from the concentration control unit 14 and determining the content of the next response.
[0029]
The speech synthesis unit 18 has a function of generating synthesized speech for the response content input from the dialogue control unit 17.
[0030]
The voice output unit 19 has a function of outputting the synthesized speech input from the speech synthesis unit 18, and is composed of a loudspeaker or the like.
[0031]
The database 20 includes a transition condition table 21, a definition table 22, an information table 23, and a concentration setting table 24, which the concentration control unit 14 uses when controlling the degree of concentration on a speaker.
[0032]
The transition condition table 21 stores the conditions under which the concentration level for a specific speaker is changed. Each condition consists of a condition content, a current concentration level (current level), and a concentration level to change to (transition level). For example, condition No. 1 indicates that the level is changed to “low” when “thank you” or “you are better” is uttered while the current concentration level is “medium”. Likewise, condition No. 7 indicates that the level is changed to “medium” when the specific speaker has made no utterance for 30 seconds while the current concentration level is “high”.
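As a rough illustration of how such a table could be represented, the following Python sketch models each condition as a record and looks up the first condition that matches the current level and a trigger. The rows are limited to what this description states about conditions No. 1, 2, 4, 5, 6, and 7. The field names, the `find_transition` helper, the `SILENCE_30S` event label, and the assumption that condition No. 4 applies at level "medium" are hypothetical and are not taken from FIG. 1.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

SILENCE_30S = "no_utterance_30s"  # hypothetical label for the 30-second silence event


@dataclass(frozen=True)
class Condition:
    number: int                 # condition No. as cited in the description
    triggers: Tuple[str, ...]   # utterances or event labels (condition content)
    current_level: str          # level at which the condition applies
    transition_level: str       # level to change to


# Illustrative rows only; the full table of FIG. 1 is not reproduced here.
TRANSITION_CONDITIONS = (
    Condition(1, ("thank you", "you are better"), "medium", "low"),
    Condition(2, ("I'm fine",), "high", "low"),
    Condition(4, ("Please listen carefully",), "medium", "high"),  # current level assumed
    Condition(5, ("Hello",), "low", "medium"),
    Condition(6, (SILENCE_30S,), "medium", "low"),
    Condition(7, (SILENCE_30S,), "high", "medium"),
)


def find_transition(current_level: str, trigger: str) -> Optional[Condition]:
    """Return the first condition matching both the current level and the trigger."""
    for cond in TRANSITION_CONDITIONS:
        if cond.current_level == current_level and trigger in cond.triggers:
            return cond
    return None
```

A lookup that returns `None` corresponds to the case in which no condition matches and the level is left unchanged.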
[0033]
The definition table 22 defines, for each concentration level, the control performed by the concentration level control unit 14. For example, when the concentration level is “low”, the concentration level control unit 14 sets the directivity of the voice input unit 10 to −180 degrees to 180 degrees and also treats as valid the voice recognition results of speakers other than the specific speaker set in the concentration level setting table 24. When the concentration level is “high”, the concentration level control unit 14 points the voice input unit 10 in the direction of the specific speaker, sets the directivity to −45 degrees to 45 degrees, and invalidates the speech recognition results of speakers other than the specific speaker set in the concentration level setting table 24.
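The per-level definitions can be sketched as a small lookup structure. In the Python sketch below, the directivity ranges for "low" and "high" come from this paragraph and the range for "medium" (−90 to 90 degrees) from the operation examples in this description. The class and field names, the `is_result_valid` helper, and the choice not to steer the microphone at level "low" are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LevelDefinition:
    directivity_deg: Tuple[int, int]   # (min, max) pickup range around the microphone direction
    point_at_specific_speaker: bool    # whether the input unit is turned toward the specific speaker
    accept_other_speakers: bool        # whether results from other speakers are treated as valid


DEFINITIONS = {
    "low": LevelDefinition((-180, 180), False, True),   # steering at "low" is an assumption
    "medium": LevelDefinition((-90, 90), True, True),
    "high": LevelDefinition((-45, 45), True, False),
}


def is_result_valid(level: str, verified_speaker: str, specific_speaker: str) -> bool:
    """A recognition result is valid unless the level rejects non-specific speakers."""
    definition = DEFINITIONS[level]
    return definition.accept_other_speakers or verified_speaker == specific_speaker
```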
[0034]
In the information table 23, the speaker identification name specified by the speaker verification unit 12 and the direction specified by the speaker position specifying unit 11 are registered in association with each other. The example of FIG. 1 indicates that the father is at 0 degrees, the mother is at 90 degrees, and the unknown person is at 180 degrees with respect to the reference direction of the speech recognition dialogue apparatus.
[0035]
In the concentration level setting table 24, the level of the current concentration level and the identification name of the specific speaker as the target are set correspondingly. The example of FIG. 1 shows that the current concentration level is “high” and the father is the target.
[0036]
Next, the operation of the present embodiment will be described in detail with reference to FIG. 1, FIG. 2, and FIG. 3.
[0037]
First, the operation when a speaker speaks will be described with reference to FIGS. 1 and 2. When the speaker speaks, the voice information input via the voice input unit 10, such as a microphone array, is output to the speaker position specifying unit 11, the speaker verification unit 12, and the voice recognition unit 13. The speaker position specifying unit 11 analyzes the input voice information, specifies the direction of the sound source, that is, of the speaker, and outputs the speaker position specifying information to the concentration level control unit 14. The speaker verification unit 12 analyzes the input voice information, compares it with the voice information of registered speakers, identifies the speaker, and outputs the verification speaker information to the concentration level control unit 14. The voice recognition unit 13 analyzes the input voice information and outputs the voice recognition result to the concentration level control unit 14.
[0038]
The concentration level control unit 14 updates the position direction of the verified speaker in the information table 23 based on the input speaker position specifying information and verification speaker information (FIG. 2, S20).
[0039]
Next, it is determined whether or not the concentration level set in the concentration level setting table 24 indicates the concentrated conversation state (S21). If the level is “high”, which indicates the concentrated conversation state, it is then determined whether or not the verified speaker matches the identification name of the specific speaker in the concentration level setting table 24 (S22).
[0040]
If they do not match, the input speech recognition result is rejected (S23). If they match, the transition condition table 21 is searched for a condition whose current level matches the level set in the concentration level setting table 24 and whose condition content matches the voice recognition result (S24). The process of step S24 is also performed when it is determined in step S21 that the level is not “high”.
[0041]
If no corresponding condition is found in step S24, the process of step S26 is performed. If a corresponding condition is found, the concentration level set in the concentration level setting table 24 is changed to the transition level of the condition found in step S24 (S25), and then the process of step S26 is performed. In step S26, the identification name of the specific speaker set in the concentration level setting table 24 is changed to the identification name of the speaker specified by the speaker verification unit 12.
[0042]
Next, the concentration level control unit 14 refers to the concentration level setting table 24 and the information table 23 and outputs the position direction of the specific speaker to the voice input control unit 15 as direction setting information for the microphone array or the like. It also refers to the definition table 22 and outputs to the voice input control unit 15 the directivity setting information for the microphone array or the like that is defined for the current concentration level (S27). It then outputs the voice recognition result and the verification speaker information to the dialogue control unit 17 (S28).
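The flow of steps S20 to S28 can be sketched as a single function. This is a minimal illustration under stated assumptions, not the patented implementation: the dictionary layout of `state`, the function name `handle_utterance`, the three example transition rows, and the return format are all hypothetical. Only the step order, the rejection rule, and the directivity ranges are taken from this description.

```python
# Illustrative rows: (current level, utterance, transition level).
TRANSITIONS = [
    ("low", "Hello", "medium"),
    ("medium", "Please listen carefully", "high"),
    ("high", "I'm fine", "low"),
]
DIRECTIVITY = {"low": (-180, 180), "medium": (-90, 90), "high": (-45, 45)}


def handle_utterance(state, speaker, direction_deg, text):
    """Process one utterance.

    state holds 'info' (speaker name -> direction), 'level', and 'target'.
    Returns None when the recognition result is rejected.
    """
    state["info"][speaker] = direction_deg                        # S20: update information table
    if state["level"] == "high" and speaker != state["target"]:   # S21, S22
        return None                                               # S23: reject the result
    for current, utterance, new_level in TRANSITIONS:             # S24: search conditions
        if current == state["level"] and utterance == text:
            state["level"] = new_level                            # S25: change the level
            break
    state["target"] = speaker                                     # S26: update specific speaker
    mic = {
        "direction": state["info"][state["target"]],
        "directivity": DIRECTIVITY[state["level"]],
    }                                                             # S27: settings for unit 15
    return {"mic": mic, "to_dialogue": (text, speaker)}           # S28: pass on to unit 17
```

Note that the position update of S20 happens before the validity check, so a rejected speaker's direction is still recorded, as in the flow described above.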
[0043]
The voice input control unit 15 adjusts the directivity and direction of the microphone array or the like of the voice input unit 10 based on the direction and directivity setting information input from the concentration level control unit 14.
[0044]
The dialogue control unit 17 determines the next response content based on the speech recognition result and the verification speaker information input from the concentration level control unit 14, and outputs the response content to the speech synthesis unit 18.
[0045]
The voice synthesizer 18 generates a synthesized voice from the input response content, and outputs the synthesized voice via the voice output unit 19 such as a speaker.
[0046]
Next, the operation when the other event management unit 16 detects the occurrence of a predetermined event will be described with reference to FIGS. 1 and 3. When detecting the occurrence of a predetermined event, the other event management unit 16 notifies the concentration level control unit 14 of the type of event that has occurred.
[0047]
In response, the concentration level control unit 14 searches the transition condition table 21 for a condition whose current level matches the level set in the concentration level setting table 24 and whose condition content matches the type of the notified event (FIG. 3, S31).
[0048]
If no corresponding condition is found in step S31, the concentration level control unit 14 ends the process. If a corresponding condition is found, the concentration level control unit 14 changes the concentration level set in the concentration level setting table 24 to the transition level of the condition found (S32), refers to the definition table 22, outputs to the voice input control unit 15 the directivity setting information for the microphone array or the like defined for the new concentration level (S33), and then ends the process.
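Steps S31 to S33 can be sketched in the same style. The two rows below reflect conditions No. 7 and No. 6 as this description states them (30 seconds without an utterance lowers the level by one step). The event label `no_utterance_30s`, the `state` layout, and the function name are assumptions for illustration.

```python
# (current level, event type, transition level)
EVENT_TRANSITIONS = [
    ("high", "no_utterance_30s", "medium"),   # condition No. 7
    ("medium", "no_utterance_30s", "low"),    # condition No. 6
]
DIRECTIVITY = {"low": (-180, 180), "medium": (-90, 90), "high": (-45, 45)}


def handle_event(state, event_type):
    """Return the new directivity setting, or None when no condition matches."""
    for current, event, new_level in EVENT_TRANSITIONS:    # S31: search conditions
        if current == state["level"] and event == event_type:
            state["level"] = new_level                     # S32: change the level
            return DIRECTIVITY[new_level]                  # S33: setting for unit 15
    return None                                            # no match: end the process
```

Calling `handle_event` twice from level "high" steps the level down to "medium" and then to "low", after which further silence events leave the state unchanged.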
[0049]
Next, the operation of this embodiment will be described in detail for an example in which the contents of the transition condition table 21 and the definition table 22 in the database 20 are as shown in FIG. 1 and the concentration level setting table 24 is set to “low”, the level representing a diffuse state in which utterances from all directions can be captured.
[0050]
For example, suppose that several speakers are in different directions, with the father behind the speech recognition dialogue apparatus and the mother to its side, and that the father says “Hello”.
[0051]
In this case, the concentration level control unit 14 first updates the position direction of the father in the information table 23 based on the speaker position specifying information input from the speaker position specifying unit 11 and the verification speaker information input from the speaker verification unit 12 (FIG. 2, S20). The concentration level control unit 14 then changes the concentration level in the concentration level setting table 24 to “medium” according to condition No. 5 in the transition condition table 21, and changes the specific speaker targeted by the concentration level to “father” (S21 is No, S24 is Yes, S25, S26). Next, the concentration level control unit 14 points the voice input unit 10 toward the back, where the father who is the specific speaker is located, and adjusts the directivity to -90 degrees to 90 degrees according to the definition of the concentration level “medium” in the definition table 22 (S27). The concentration level control unit 14 then performs the process of step S28, so that a response to the father's “Hello” is output from the voice output unit 19.
[0052]
After that, when the mother at the side says “How are you?”, an utterance that does not match any condition content in the transition condition table 21, the concentration level control unit 14 updates the position direction of the mother in the information table 23 (S20). It then keeps the concentration level setting table 24 at level “medium”, the normal conversation state, and changes the specific speaker targeted by the concentration level to “mother” (S21 is No, S24 is No, S26). Next, the concentration level control unit 14 points the voice input unit 10 toward the side, where the mother who is the specific speaker is located, and adjusts the directivity to -90 degrees to 90 degrees (S27). The concentration level control unit 14 then performs the process of step S28, so that a response to the mother's “How are you?” is output from the voice output unit 19.
[0053]
Thereafter, when the father says “I'm fine” or the like, the concentration level control unit 14 updates the position direction of the father in the information table 23 in step S20, changes the specific speaker targeted in the concentration level setting table 24 to “father” in step S26, and in step S27 points the voice input unit 10 toward the position of the father, who is now the specific speaker. In this way, a father and a mother in different directions can take turns conversing with the speech recognition dialogue apparatus.
[0054]
In this normal dialogue state, when the father wants the speech recognition dialogue apparatus to concentrate on him, he says “Please listen carefully”. The concentration level control unit 14 then updates the position direction of the father in the information table 23 in step S20, changes the concentration level in the concentration level setting table 24 to “high” in step S25 according to condition No. 4 of the transition condition table 21, and sets the specific speaker targeted by the concentration level to “father” in step S26. In step S27, it points the voice input unit 10 toward the position of the father, who is the specific speaker, and adjusts the directivity to -45 degrees to 45 degrees. If the father continues the conversation in this state, the voice input unit 10 is less likely to capture the utterances of unrelated people and noise and more likely to capture the father's voice, which improves the voice recognition rate. Consequently, even if the mother says something in this state, the directivity adjustment lowers the probability that the voice input unit 10 captures her voice. Even if the voice input unit 10 does capture it, the speaker verified by the speaker verification unit 12 is the mother, who does not match the father, the specific speaker currently targeted in the concentration level setting table 24 (S22 is No), so the speech recognition result for the mother's utterance is rejected (S23).
[0055]
Next, if the father in this state makes an utterance such as “but yesterday” that does not match any condition content in the transition condition table 21, the determination result in step S24 is No, so the concentration level in the concentration level setting table 24 remains “high” and the concentrated conversation state with the father is maintained.
[0056]
Next, when the father wants to end the concentrated conversation, he says “I'm fine”. The concentration level control unit 14 then changes the concentration level in the concentration level setting table 24 to “low” in step S25 according to condition No. 2 in the transition condition table 21, and adjusts the directivity to -180 degrees to 180 degrees based on the definition of the concentration level “low”. Because the level in the concentration level setting table 24 has been changed to “low” in step S25, speech recognition results for speakers other than the specific speaker are treated as valid from the next utterance onward and are no longer rejected (S21 is No).
[0057]
Also, suppose that the father who is the specific speaker who is the target of concentration in the current concentration setting table 24 leaves the place with the concentration level set to “high”, that is, in a concentrated conversation state. Even in this case, based on the notification from the other event management unit 16, the mother and other speakers can interact with the speech recognition dialogue apparatus.
[0058]
That is, when the other event management unit 16 detects an event in which the time with no utterance by the specific speaker set in the concentration level setting table 24 has continued for 30 seconds, it notifies the concentration level control unit 14 of the type of the event. The concentration level control unit 14 then changes the concentration level in the concentration level setting table 24 to “medium” based on condition No. 7 in the transition condition table 21 (FIG. 3, S31 is Yes, S32). After that, based on the directivity defined for level “medium” in the definition table 22, it instructs the voice input control unit 15 to adjust the directivity of the voice input unit 10 to −90 degrees to 90 degrees (S33).
[0059]
Furthermore, if the time with no utterance by the specific speaker registered in the concentration level setting table 24 continues for another 30 seconds, the other event management unit 16 again notifies the concentration level control unit 14 of the event type. The concentration level control unit 14 then sets the concentration level in the concentration level setting table 24 to “low” based on condition No. 6 in the transition condition table 21 (S31 is Yes, S32), and instructs the voice input control unit 15 to adjust the directivity of the voice input unit 10 to −180 degrees to 180 degrees based on the directivity defined for level “low” in the definition table 22 (S33). In this way, each time the period with no utterance lasts 30 seconds, the level in the concentration level setting table 24 is changed from “high” to “medium” or from “medium” to “low”, so that even if the father leaves while the concentration level is “high”, the mother and other speakers can interact with the speech recognition dialogue apparatus.
[0060]
Note that the other event management unit 16 detects, for example, as follows, that a time period during which no utterance is made by a specific speaker registered in the concentration setting table 24 has continued for 30 seconds.
[0061]
A clear signal and a count start signal are input from the concentration level control unit 14 to the other event management unit 16. The clear signal is output when the concentration level control unit 14 detects the start of an utterance by the specific speaker set in the concentration level setting table 24, and the count start signal is output when the concentration level control unit 14 detects the end of an utterance by that specific speaker. The other event management unit 16 contains a counter. When the clear signal is input, it sets the count value to “0” and stops counting; when the count start signal is input, it starts counting. When the count value reaches the value corresponding to 30 seconds, it notifies the concentration level control unit 14 that the time with no utterance has continued for 30 seconds, resets the count value to “0”, and resumes counting.
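The counter behaviour described above can be sketched as a small class. The class name `SilenceCounter`, the `tick` method (assumed here to be called once per second by some external clock), the `notify` callback, and the event label are hypothetical. The clear, start, notify, and reset-and-resume behaviour follows the description.

```python
class SilenceCounter:
    """Counts time without an utterance by the specific speaker."""

    def __init__(self, notify, limit_ticks=30):
        self.notify = notify          # callback standing in for notifying unit 14
        self.limit = limit_ticks      # ticks corresponding to 30 seconds (1 tick = 1 s assumed)
        self.count = 0
        self.running = False

    def clear(self):
        """Clear signal: the specific speaker started speaking."""
        self.count = 0
        self.running = False

    def start(self):
        """Count start signal: the specific speaker stopped speaking."""
        self.running = True

    def tick(self):
        """Advance the counter by one unit of time if counting is active."""
        if not self.running:
            return
        self.count += 1
        if self.count >= self.limit:
            self.notify("no_utterance_30s")
            self.count = 0            # reset to 0; counting resumes on the next tick
```

Because the counter resumes after each notification, a second notification follows after a further 30 seconds of silence, which is what allows the level to step from "high" to "medium" and then to "low".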
[0062]
Next, the operation will be described for a situation in which several speakers are in the same direction, for example with both the father and the mother behind the speech recognition dialogue apparatus, and the father says “Hello”. Assume that the contents of the transition condition table 21 and the definition table 22 are as shown in FIG. 1 and that the concentration level setting table 24 is set to “low”, the level representing a diffuse state in which utterances from all directions can be captured.
[0063]
When the father says “Hello”, the concentration level control unit 14 updates the position direction of the father in the information table 23 in step S20, changes the concentration level in the concentration level setting table 24 to “medium” in step S25 according to condition No. 5 of the transition condition table 21, and sets “father” as the specific speaker targeted in the concentration level setting table 24 in step S26. Then, in step S27, the concentration level control unit 14 points the voice input unit 10 toward the back, where the father who is the specific speaker is located, and adjusts the directivity to -90 degrees to 90 degrees based on the definition of the concentration level “medium” in the definition table 22.
[0064]
In this state, when the mother, who is in the same direction, says “How are you?”, the concentration level control unit 14 keeps the concentration level setting table 24 at level “medium”, the normal dialogue state, and changes the specific speaker targeted by the concentration level to the mother (S24 is No, S26). Since the level in the concentration level setting table 24 remains “medium”, the voice input unit 10 stays pointed in the same direction. When the father then says “I'm fine” or the like, the specific speaker currently targeted in the concentration level setting table 24 is changed to the father. In this way, a father and a mother in the same direction can take turns conversing with the speech recognition dialogue apparatus.
[0065]
During such a dialogue, when the father wants the speech recognition dialogue apparatus to concentrate on him, he says “Please listen carefully”. The concentration level control unit 14 then changes the concentration level in the concentration level setting table 24 to “high” in step S25 according to condition No. 4 in the transition condition table 21, and sets the specific speaker targeted by the concentration level to “father”. If the mother, in the same direction, says something in this state, the voice input unit 10 captures her voice, but the speaker verified by the speaker verification unit 12 is the mother, who does not match the father, the specific speaker currently targeted in the concentration level setting table 24. The speech recognition result for the mother's utterance is therefore rejected (S21 is Yes, S22 is No, S23), and the apparatus can converse with the father alone. In addition, when the concentration level in the concentration level setting table 24 is “high”, the directivity is adjusted to −45 degrees to 45 degrees according to the definition of the concentration level “high” in the definition table 22. The voice input unit 10 is therefore less likely to capture utterances of unrelated people and noise from other directions and more likely to capture the father's voice, which improves the voice recognition rate.
[0066]
Next, if the father in this state makes an utterance such as “but yesterday” that does not match any condition content in the transition condition table 21 (S24 is No), the concentration level in the concentration level setting table 24 remains “high” and the concentrated conversation state is maintained.
[0067]
Next, when the father wants to end the concentrated conversation, he says “I'm fine”. The concentration level control unit 14 then changes the concentration level in the concentration level setting table 24 to “low” in step S25 according to condition No. 2 in the transition condition table 21, and in step S27 adjusts the directivity of the voice input unit 10 to -180 degrees to 180 degrees. Because the level in the concentration level setting table 24 is now “low”, the diffuse state in which utterances from any direction can be captured, speech recognition results for speakers other than the specific speaker are treated as valid from the next utterance onward and are no longer rejected (S21 is No).
[0068]
Further, even if the father, the specific speaker whose identification name is set in the concentration level setting table 24, leaves while the concentrated conversation state with concentration level “high” is in effect, the concentration level in the concentration level setting table 24 changes to “medium” under condition No. 7 of the transition condition table 21 once the time with no utterance has continued for 30 seconds, as already described with reference to the flowchart of FIG. 3. If the time with no utterance continues for a further 30 seconds, the concentration level changes to “low” under condition No. 6 of the transition condition table 21, so the mother and other speakers can also interact with the speech recognition dialogue apparatus.
[0069]
Next, the effect of this embodiment will be described.
[0070]
In the present embodiment, when several speakers are in different directions or in the same direction, the concentration level control unit 14 controls the level of concentration on a speaker. This makes it possible to switch naturally between conversing with only a specific speaker at some times and taking turns conversing with several speakers at others.
[0071]
In addition, the probability of picking up the utterances of unrelated people and noise during a dialogue with a specific speaker can be reduced.
[0072]
Other Embodiments of the Invention
FIG. 4 is a block diagram showing a second embodiment of the present invention. Referring to FIG. 4, the second embodiment differs from the first embodiment shown in FIG. 1 in that an image input unit 40 is added, a speaker position specifying unit 41 is provided instead of the speaker position specifying unit 11, and a speaker verification unit 42 is provided instead of the speaker verification unit 12. The other reference numerals that are the same as in FIG. 1 denote the same parts.
[0073]
The image input unit 40 has a function of capturing image information in a range of 360 degrees, and is realized by, for example, a plurality of CCD cameras.
[0074]
The speaker position specifying unit 41 has a function of specifying the direction of the speaker who has spoken based on the audio information input from the audio input unit 10 and the image information input from the image input unit 40.
[0075]
The speaker verification unit 42 has a function of specifying a speaker based on the voice information from the voice input unit 10 and the image information from the image input unit 40.
[0076]
Next, the operation of the present embodiment will be described.
[0077]
When voice information is input from the voice input unit 10, the speaker position specifying unit 41 first specifies the direction of the speaker who spoke based on the voice information. It then obtains the directions of all speakers around the speech recognition dialogue apparatus based on the image information input from the image input unit 40. Of the speaker directions obtained from the image information, it takes the one closest to the direction obtained from the voice information as the direction of the speaker who spoke, and outputs that direction to the concentration level control unit 14.
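The "closest direction" selection can be sketched as follows. The function names are hypothetical, directions are assumed to be angles in degrees, and the wrap-around handling (so that −175 degrees and 170 degrees count as 15 degrees apart) and the fallback to the audio estimate when no speaker is visible are assumptions that the description does not spell out.

```python
def angular_distance(a_deg, b_deg):
    """Smallest absolute difference between two directions, in degrees (0 to 180)."""
    diff = abs(a_deg - b_deg) % 360
    return min(diff, 360 - diff)


def pick_speaking_direction(audio_direction_deg, image_directions_deg):
    """Choose the image-derived direction closest to the audio-derived direction."""
    if not image_directions_deg:
        # Assumption: with no speaker visible, fall back to the audio estimate.
        return audio_direction_deg
    return min(
        image_directions_deg,
        key=lambda d: angular_distance(d, audio_direction_deg),
    )
```

For instance, with image-derived directions of 0, 90, and −175 degrees and an audio-derived direction of 170 degrees, the sketch selects −175 degrees.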
[0078]
When voice information is input from the voice input unit 10, the speaker verification unit 42 identifies the speaker who spoke based on the voice information. The speaker verification unit 42 also analyzes the image information from the image input unit 40 to recognize the speaker whose mouth is moving, and identifies the speaker who spoke by comparing that speaker's face image with the face images of a plurality of speakers registered in advance. When the speaker specified from the voice information matches the speaker specified from the image information, verification speaker information indicating that speaker is output to the concentration level control unit 14; when they do not match, verification speaker information indicating the speaker specified from the image information is output to the concentration level control unit 14.
[0079]
Since operations other than those described above are the same as those in the first embodiment, description thereof is omitted here.
[0080]
As described above, the present embodiment includes the image input unit 40, such as a camera, in addition to the voice input unit 10, such as a microphone array, and recognizes both the direction of the speaker who spoke and the identity of that speaker based on both the voice information and the image information, so the recognition accuracy can be increased.
[0081]
FIG. 5 is a block diagram showing a third embodiment of the present invention. Referring to FIG. 5, the third embodiment differs from the first embodiment shown in FIG. 1 in that a voice model database 51 is added, a voice recognition unit 52 is provided instead of the voice recognition unit 13, and a concentration level control unit 53 is provided instead of the concentration level control unit 14. The other reference numerals that are the same as in FIG. 1 denote the same parts.
[0082]
In the voice model database 51, a voice model and a standard voice model of each speaker who uses the voice recognition dialogue apparatus are registered. These are used when performing speech recognition.
[0083]
In addition to the functions of the concentration level control unit 14, the concentration level control unit 53 has a function of setting, in the speech model database 51, the specific speaker identification name set in the concentration level setting table 24.
[0084]
When performing speech recognition, the speech recognition unit 52 uses, from among the speech models in the speech model database 51, the speech model of the speaker corresponding to the specific speaker identification name set by the concentration level control unit 53. This improves the speech recognition rate for the specific speaker who is the target of concentration. When the specific speaker identification name is “unknown”, the speech recognition unit 52 performs speech recognition using the standard speech model.
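The model selection rule can be sketched as a dictionary lookup. The key `"standard"`, the function name, and the fallback to the standard model for a name that has no registered model are assumptions for illustration. The description itself only states that the standard model is used when the identification name is “unknown”.

```python
STANDARD_MODEL_KEY = "standard"  # hypothetical key for the standard voice model


def select_voice_model(models, specific_speaker_name):
    """Pick the voice model to use for recognition.

    models maps a speaker identification name (or STANDARD_MODEL_KEY) to a model object.
    """
    if specific_speaker_name == "unknown":
        return models[STANDARD_MODEL_KEY]
    # Assumption: names without a registered model also fall back to the standard model.
    return models.get(specific_speaker_name, models[STANDARD_MODEL_KEY])
```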
[0085]
FIG. 6 is a block diagram showing an example of a hardware configuration of the speech recognition dialogue apparatus according to the present invention, which consists of a computer 61, a recording medium 62, a voice input unit 63, a voice output unit 64, and a database 65. The voice input unit 63, the voice output unit 64, and the database 65 correspond to the voice input unit 10, the voice output unit 19, and the database 20 shown in FIG. 1. The recording medium 62 is a disk, semiconductor memory, or other recording medium, and stores a program for causing the computer 61 to function as part of the speech recognition dialogue apparatus. The computer 61 reads this program, which controls the operation of the computer 61 so as to realize on it the speaker position specifying unit 11, the speaker verification unit 12, the speech recognition unit 13, the concentration level control unit 14, the voice input control unit 15, the other event management unit 16, the dialogue control unit 17, and the speech synthesis unit 18 shown in FIG. 1.
[0086]
[Effects of the invention]
The first effect is that, when several speakers are around the speech recognition dialogue apparatus, and in particular when they are in different directions, the apparatus can switch naturally between conversing with only a specific speaker at some times and taking turns conversing with several speakers at others.
[0087]
The reason is that the level of concentration on a speaker is determined according to the content of the speaker's utterance, and the directivity and direction of the voice input unit, such as a microphone array, are adjusted according to that level.
[0088]
The second effect is that, when several speakers are around the speech recognition dialogue apparatus, and in particular when they are in the same direction, the apparatus can switch naturally between conversing with only a specific speaker at some times and taking turns conversing with several speakers at others.
[0089]
The reason is that the level of concentration on a speaker is determined according to the content of the speaker's utterance, and the utterances of speakers other than the specific speaker are invalidated according to that level.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration example of a first exemplary embodiment of the present invention.
FIG. 2 is a flowchart showing a processing example of the concentration level control unit 14 when there is input from the speaker position specifying unit 11, the speaker verification unit 12, and the voice recognition unit 13.
FIG. 3 is a flowchart showing a processing example of the concentration level control unit 14 when it receives a notification from the other event management unit 16.
FIG. 4 is a block diagram showing a configuration example of a second embodiment of the present invention.
FIG. 5 is a block diagram showing a configuration example of a third embodiment of the present invention.
FIG. 6 is a block diagram illustrating an example of a hardware configuration of a speech recognition dialogue apparatus.
FIG. 7 is a block diagram for explaining a conventional technique.
[Explanation of symbols]
10 Voice input part
11 Speaker position identification part
12 Speaker verification section
13 Voice recognition unit
14 Concentration control unit
15 Voice input control unit
16 Other Event Management Department
17 Speaker control unit
18 Speech synthesis unit
19 Audio output unit
20 database
21 Transition condition table
22 Definition table
23 Information table
24 Concentration setting table
40 Image input unit
41 Speaker position specifying unit
42 Speaker verification unit
51 Voice model database
52 Voice recognition unit
53 Concentration control unit
61 Computer
62 Recording medium
63 Voice input unit
64 Audio output unit
65 Database
70 Voice input unit
71 Voice input control unit
72 Speech feature vector extraction unit
73 Voice recognition unit
74 Recognition result display unit
75 Image information input unit
76 Image information analysis unit

Claims

A speech recognition dialogue apparatus comprising:
a voice input unit for capturing voice information;
a speaker position specifying unit for specifying the direction of the speaker who has spoken;
a speaker verification unit that identifies the speaker who has spoken;
a voice recognition unit that analyzes the voice information input from the voice input unit and recognizes the speech;
a speaker concentration control unit that controls the degree of concentration on a speaker;
a voice input control unit that adjusts the input state of the voice input unit according to the level of concentration; and
a speaker and concentration management database that stores the information the speaker concentration control unit needs to control the degree of concentration, and that is referred to and updated.

A speech recognition dialogue apparatus comprising:
a voice input unit for capturing voice information;
a speaker position specifying unit for specifying the direction of the speaker who has spoken;
a speaker verification unit that identifies the speaker who has spoken;
a voice recognition unit that analyzes the voice information input from the voice input unit and recognizes the speech;
a concentration level setting table in which a specific speaker identification name indicating a specific speaker and the level of concentration on that specific speaker are set;
a concentration control unit that determines, based on the content of the concentration level setting table and the speaker identified by the speaker verification unit, whether to treat the speaker's utterance as valid; updates the level and the specific speaker identification name in the concentration level setting table using the level of concentration determined from the voice recognition unit's recognition result for the utterance and the identification name of the speaker identified by the speaker verification unit; and controls the directivity and direction of the voice input unit based on the content of the updated concentration level setting table and the direction of the speaker specified by the speaker position specifying unit; and
a voice output unit that outputs a response to the recognition result of an utterance determined to be valid by the concentration control unit.

The speech recognition dialogue apparatus according to claim 2, further comprising
an other event management unit that detects that a predetermined event has occurred,
wherein the concentration control unit changes the level of concentration set in the concentration level setting table when the other event management unit detects the occurrence of the predetermined event.

The speech recognition dialogue apparatus according to claim 3,
wherein the predetermined event is that the specific speaker whose specific speaker identification name is set in the concentration level setting table has made no utterance for a predetermined time, and
wherein, when the other event management unit detects the occurrence of the predetermined event and the level of concentration set in the concentration level setting table is high enough that only utterances by the specific speaker indicated by the specific speaker identification name are treated as valid, the concentration control unit lowers the level of concentration set in the concentration level setting table to a level at which utterances by other speakers are also treated as valid.
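The silence-timeout behavior in the preceding claim can be sketched as follows. This is a minimal Python illustration under stated assumptions: the level values `EXCLUSIVE` and `SHARED`, the 10-second default timeout, and the names `ConcentrationSetting` and `on_silence_check` are hypothetical and are not taken from the specification.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical levels: at EXCLUSIVE only the specific speaker's utterances are
# valid; at SHARED utterances by other speakers are valid as well.
SHARED, EXCLUSIVE = 1, 2


@dataclass
class ConcentrationSetting:
    level: int
    target_speaker: Optional[str]
    last_utterance_time: float  # time of the specific speaker's last utterance, in seconds


def on_silence_check(setting: ConcentrationSetting, now: float,
                     timeout_s: float = 10.0) -> bool:
    """Lower the level if the specific speaker has been silent too long.

    Returns True when the level was lowered, False otherwise.
    """
    silent_for = now - setting.last_utterance_time
    if setting.level >= EXCLUSIVE and silent_for >= timeout_s:
        setting.level = SHARED  # other speakers' utterances become valid again
        return True
    return False
```

In this sketch the check would be called periodically by whatever component plays the role of the other event management unit.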

The speech recognition dialogue apparatus according to any one of claims 1 to 4,
wherein the voice input unit is composed of a microphone array whose directivity can be changed.

The speech recognition dialogue apparatus according to any one of claims 1 to 5,
wherein the speaker position specifying unit specifies the direction of the speaker who has spoken based on the voice information input by the voice input unit.

The speech recognition dialogue apparatus according to any one of claims 1 to 6,
wherein the speaker verification unit identifies the speaker who has spoken based on the voice information input by the voice input unit.

The speech recognition dialogue apparatus according to any one of claims 1 to 5, further comprising
an image input unit for capturing image information,
wherein the speaker position specifying unit specifies the direction of the speaker who has spoken based on the voice information input by the voice input unit and the image information input by the image input unit.

The speech recognition dialogue apparatus according to any one of claims 1 to 5, further comprising
an image input unit for capturing image information,
wherein the speaker verification unit identifies the speaker who has spoken based on the voice information input by the voice input unit and the image information input by the image input unit.

The speech recognition dialogue apparatus according to any one of claims 1 to 5, further comprising
a voice model database in which voice models of a plurality of speakers are registered,
wherein the voice recognition unit performs speech recognition using, among the voice models registered in the voice model database, the voice model of the specific speaker whose specific speaker identification name is set in the concentration level setting table.

The speech recognition dialogue apparatus according to claim 1,
wherein the speaker and concentration management database comprises:
a concentration level transition condition table storing the conditions for changing the level of concentration on a speaker;
a concentration level definition table that defines, for each level of concentration, the directivity and direction of the microphone array or the like;
a collated speaker information table storing information on the speaker collated when a speaker speaks, together with that speaker's position information; and
a current concentration level setting table storing the currently set level of concentration and information on the target speaker,
so that the information needed to control the degree of concentration on a speaker can be referred to and updated.
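One way to picture the transition condition table named in the preceding claim is a list of conditions on the recognized utterance, each paired with the level to move to. The Python sketch below assumes phrase matching as the condition; the phrases, the level numbers, and the names `TRANSITION_TABLE` and `next_level` are invented for illustration, and the specification may define its conditions differently.

```python
from typing import List, Tuple

# Hypothetical transition condition table: (phrase in the utterance, new level).
# Level 2 stands for concentrating on one speaker, level 0 for talking with everyone.
TRANSITION_TABLE: List[Tuple[str, int]] = [
    ("talk only to me", 2),
    ("everyone", 0),
]


def next_level(current_level: int, recognized_text: str) -> int:
    """Return the concentration level after hearing recognized_text."""
    text = recognized_text.lower()
    for phrase, new_level in TRANSITION_TABLE:
        if phrase in text:
            return new_level  # the first matching condition wins
    return current_level      # no condition matched: keep the current level
```

For instance, `next_level(0, "Please talk only to me")` moves to level 2 under these assumed rules, after which the current concentration level setting table would be updated with the new level and the speaker's identification name.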

A program for causing a computer equipped with a voice input unit for capturing voice information to function as a speech recognition dialogue apparatus, the program causing the computer to function as:
a speaker position specifying unit for specifying the direction of the speaker who has spoken;
a speaker verification unit that identifies the speaker who has spoken;
a voice recognition unit that analyzes the voice information input from the voice input unit and recognizes the speech;
a concentration control unit that determines, based on the speaker identified by the speaker verification unit and the content of a concentration level setting table in which a specific speaker identification name indicating a specific speaker and the level of concentration on that specific speaker are set, whether to treat the speaker's utterance as valid; updates the level and the specific speaker identification name in the concentration level setting table using the level of concentration determined from the voice recognition unit's recognition result for the utterance and the identification name of the speaker identified by the speaker verification unit; and controls the directivity and direction of the voice input unit based on the content of the updated concentration level setting table and the direction of the speaker specified by the speaker position specifying unit; and
a voice output unit that outputs a response to the recognition result of an utterance determined to be valid by the concentration control unit.