JP3926242B2 - Spoken dialogue system, program for spoken dialogue, and spoken dialogue method

Info

Publication number: JP3926242B2
Application number: JP2002272689A
Authority: JP
Inventors: 賢司阿部; 直司松尾; 鏡子奥山
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2002-09-19
Filing date: 2002-09-19
Publication date: 2007-06-06
Anticipated expiration: 2022-09-19
Also published as: JP2004109563A

Description

【０００１】
【発明の属する技術分野】
本発明は、音声認識に関し、特にインタラクティブな即ち対話的な音声認識に関する。
【０００２】
【従来の技術】
通常の固定電話に加えて携帯電話が普及し、最近、インターネットを介して音声によってアクセスされるボイスポータルの試験的運用が開始されている。そのような中で、音声対話システムの高度化に対するニーズ（要求）が強くなっている。
【０００３】
音声対話システムにおいて最も重要なことは、ユーザの意図を的確に抽出または推定することである。そのためには、まず、ユーザが発した音声信号を的確に認識する必要がある。即ち、音声認識の性能が音声対話システムの性能を左右する。そのため、音声対話システムにおいて、音声信号中の雑音成分を除去したり、音響モデルや言語モデルを改良することによって、音声認識の性能を上げることが試みられているが、不規則な雑音の混入および言語表現の多様性により、いかなる状況においても１００％の認識率が得られるようにすることは事実上不可能である。
【０００４】
認識結果が得られず音声認識に失敗した場合、例えば、音声区間の切り出しに失敗して認識できない場合、および音声認識の結果のスコアがシステムで定めた閾値よりも低いためその結果が不採用になった場合に、例えば、対話によって音声の再入力を要求する機能を有するようにした対話システムが提案された。しかし、そのシステムでは失敗原因がユーザに知らされないので、同じ失敗を何度も繰り返すことがある。
【０００５】
また、認識失敗の原因をユーザに通知して失敗の繰り返しを回避するという手法も考案されている。
【０００６】
特開平１０−１３３８４９号公報（特許文献１）には、音声認識に失敗したときに、エラーメッセージを表示することが記載されている。
【特許文献１】
特開平１０−１３３８４９号公報
【０００７】
特開２０００−１１２４９７号公報（特許文献２）には、入力された音声認識に失敗した場合に、その理由情報を通知することが記載されている。
【特許文献２】
特開２０００−１１２４９７号公報
【０００８】
特開２００２−２３９０３号公報（特許文献３）には、初回利用者に対してインストラクションを与えることが記載されている。
【特許文献３】
特開２００２−２３９０３号公報
【０００９】
【発明が解決しようとする課題】
上述の従来の手法は、音声認識の失敗に関するメッセージをディスプレイ上に表示したり、定常的に低雑音であるような音響的環境を前提として音声メッセージを提供するものである。しかし、例えば携帯電話ではユーザが音声通話中においてディスプレイを利用することは実際には不可能である。ユーザが音声を聞き取れないような大きな雑音が存在する音響的環境においては、必ずしも音声メッセージが有効であるとは限らない。ユーザが失敗原因を知ったとしても、ユーザがその意味を理解できるとは限らず、それに対する解決手段が設けられているとは限らない。例えば、ユーザがある雑音環境下にいて、その場を離れることができない場合、雑音が失敗の原因であることを知ったとしても、ユーザはそれに対処するのが困難である。従って、実際には、これらの手法の適用は限られている。
【００１０】
発明者たちは、ユーザ端末にシステム応答をユーザの環境に適した方法でユーザ端末に送信すると有利であると認識した。発明者たちは、音声認識に失敗した場合には、その原因をユーザ端末に通知するだけでなく、それに対処するための手段をユーザ端末に提供すると有利であると認識した。また、発明者たちは、システムを高度化するためには、ユーザの特徴および過去の状況を考慮して効率的に音声認識を行うことが有利であると認識した。音声認識に失敗して対話が成立しないときに、失敗メッセージの応答を繰り返し送信して対話を継続するのは無駄である。
【００１１】
本発明の目的は、効率的な音声対話システムを実現することである。
【００１２】
【課題を解決するための手段】
本発明の特徴によれば、音声対話システムは、端末より接続可能な音声対話システムであり、前記端末からの音声信号を受信する入力信号受信手段と、前記音声信号を解析し、ユーザ発声前の背景雑音のレベルを検出し、前記レベルが所定の閾値より大きいかどうかを判定する音響解析手段と、前記音声信号を音声信号データベースに蓄積する音声信号記録手段と、前記音声信号データベースに蓄積された音声信号に対して、認識パラメータを用いて音声認識を実行し、前記音声認識の認識結果が得られたかどうかによって前記音声認識が成功したかどうかを判定し、さらに前記認識結果が得られた場合には前記認識結果の認識スコアを求めて前記音声認識が成功したかどうかを判定する音声認識手段と、前記背景雑音の前記レベルが所定の閾値より大きい場合には認識失敗原因情報および予め登録されているユーザ情報に基づいて前記背景雑音の前記レベルが前記音声認識の失敗の原因になるかどうかを判定し、前記背景雑音の前記レベルが前記音声認識の失敗の原因になると判定した場合には前記認識失敗原因情報および前記ユーザ情報に基づいて前記認識パラメータの調整によって前記背景雑音に対処できるかどうかを判定し、また、前記音声認識に失敗した場合には、前記認識失敗原因情報および前記ユーザ情報に基づいて、その失敗原因を判定しかつ前記認識パラメータの調整によって前記背景雑音または前記失敗に対処できるかどうかを判定し、前記背景雑音または前記失敗に対処できない場合にはユーザによる対処法を判定する、認識失敗原因判定手段と、前記認識失敗原因判定手段が前記認識パラメータの調整によって前記背景雑音または前記失敗に対処できると判定した場合に、前記認識パラメータを調整し、前記音声認識手段に音声認識を要求する認識パラメータ設定手段と、前記認識パラメータ設定手段が前記認識パラメータの調整によって前記背景雑音に対処できないと判定した場合に前記認識失敗原因判定手段によって検出された音声信号の特徴および前記ユーザ情報を参照することによって、前記端末の音響的環境を推定する推定手段と、前記推定された音響的環境に応じて前記端末のユーザに対する応答方法を決定する応答方法決定手段と、前記対処法を表す情報を前記決定された応答方法で前記端末に送信する送信手段と、を具え、前記認識失敗原因判定手段が前記認識パラメータの調整によって前記背景雑音または前記失敗に対処できないと判定した場合に、前記推定手段、前記応答方法決定手段および前記送信手段を動作させる。
【００１３】
音声対話システムは、さらに、ユーザの識別情報および利用履歴情報を記憶する記憶手段を具えていてもよい。その対処法決定手段は、その記憶手段に格納されたそのユーザのその識別情報および利用履歴情報に従ってその対処法を決定してもよい。
【００１４】
前記対処法には音声認識用のパラメータの調整が含まれていてもよい。その対処法には対処用のプログラムのその端末への送信が含まれていてもよい。その対処法決定手段が、音声対話は不可能と判定したときに、その対処法としてその端末との通信を切断することを決定することを含んでいてもよい。
【００１５】
音声対話システムは、さらに、その決定された対処法を記憶する対処法履歴記憶手段を具えていてもよい。その音声認識手段が、その対処法履歴記憶手段に格納されたその対処法の履歴に従って音声認識を実行してもよい。
【００１６】
その応答方法決定手段は、その推定された音響的環境に応じて、その決定された対処法を表す音声信号、電子メールおよび／または画像信号が前記端末に送信されるようにしてもよい。
【００１８】
本発明のさらに別の特徴によれば、情報処理装置において用いられる音声対話のためのプログラムは、端末からの音声信号を受信するステップと、前記音声信号を解析し、ユーザ発声前の背景雑音のレベルを検出し、前記レベルが所定の閾値より大きいかどうかを判定するステップと、前記音声信号を音声信号データベースに蓄積するステップと、前記音声信号データベースに蓄積された音声信号に対して、認識パラメータを用いて音声認識を実行し、前記音声認識の認識結果が得られたかどうかによって前記音声認識が成功したかどうかを判定し、さらに前記認識結果が得られた場合には前記認識結果の認識スコアを求めて前記音声認識が成功したかどうかを判定するステップと、前記背景雑音の前記レベルが所定の閾値より大きい場合には認識失敗原因情報および予め登録されているユーザ情報に基づいて前記背景雑音の前記レベルが前記音声認識の失敗の原因になるかどうかを判定し、前記背景雑音の前記レベルが前記音声認識の失敗の原因になると判定した場合には前記認識失敗原因情報および前記ユーザ情報に基づいて前記認識パラメータの調整によって前記背景雑音に対処できるかどうかを判定し、また、前記音声認識に失敗した場合には、前記認識失敗原因情報および前記ユーザ情報に基づいて、その失敗原因を判定しかつ前記認識パラメータの調整によって前記背景雑音または前記失敗に対処できるかどうかを判定し、前記背景雑音または前記失敗に対処できない場合にはユーザによる対処法を判定する判定ステップと、前記判定ステップにおいて前記認識パラメータの調整によって前記背景雑音または前記失敗に対処できると判定された場合に、前記認識パラメータを調整し、前記音声認識手段に音声認識を要求するステップと、前記認識パラメータ設定手段が前記認識パラメータの調整によって前記背景雑音に対処できないと判定した場合に前記認識失敗原因判定手段によって検出された音声信号の特徴および前記ユーザ情報を参照することによって、前記端末の音響的環境を推定する推定ステップと、前記推定された音響的環境に応じて前記端末のユーザに対する応答方法を決定する応答方法決定ステップと、前記決定された対処法を表す情報を前記決定された応答方法で前記端末に送信する送信ステップと、を実行させるよう動作可能であり、前記判定ステップにおいて前記認識パラメータの調整によって前記背景雑音または前記失敗に対処できないと判定された場合に、前記推定ステップ、前記応答方法決定ステップおよび前記送信ステップを実行させるよう動作可能である。
【００１９】
本発明のさらに別の特徴によれば、音声対話システムにおいて用いられる音声対話方法は、端末からの音声信号を受信するステップと、前記音声信号を解析し、ユーザ発声前の背景雑音のレベルを検出し、前記レベルが所定の閾値より大きいかどうかを判定するステップと、前記音声信号を音声信号データベースに蓄積するステップと、前記音声信号データベースに蓄積された音声信号に対して、認識パラメータを用いて音声認識を実行し、前記音声認識の認識結果が得られたかどうかによって前記音声認識が成功したかどうかを判定し、さらに前記認識結果が得られた場合には前記認識結果の認識スコアを求めて前記音声認識が成功したかどうかを判定するステップと、前記背景雑音の前記レベルが所定の閾値より大きい場合には認識失敗原因情報および予め登録されているユーザ情報に基づいて前記背景雑音の前記レベルが前記音声認識の失敗の原因になるかどうかを判定し、前記背景雑音の前記レベルが前記音声認識の失敗の原因になると判定した場合には前記認識失敗原因情報および前記ユーザ情報に基づいて前記認識パラメータの調整によって前記背景雑音に対処できるかどうかを判定し、また、前記音声認識に失敗した場合には、前記認識失敗原因情報および前記ユーザ情報に基づいて、その失敗原因を判定しかつ前記認識パラメータの調整によって前記背景雑音または前記失敗に対処できるかどうかを判定し、前記背景雑音または前記失敗に対処できない場合にはユーザによる対処法を判定する判定ステップと、前記判定ステップにおいて前記認識パラメータの調整によって前記背景雑音または前記失敗に対処できると判定された場合に、前記認識パラメータを調整し、前記音声認識手段に音声認識を要求するステップと、前記認識パラメータ設定手段が前記認識パラメータの調整によって前記背景雑音に対処できないと判定した場合に前記認識失敗原因判定手段によって検出された音声信号の特徴および前記ユーザ情報を参照することによって、前記端末の音響的環境を推定する推定ステップと、前記推定された音響的環境に応じて前記端末のユーザに対する応答方法を決定する応答方法決定ステップと、前記決定された対処法を表す情報を前記決定された応答方法で前記端末に送信する送信ステップと、を含み、前記判定ステップにおいて前記認識パラメータの調整によって前記背景雑音または前記失敗に対処できないと判定された場合に、前記推定ステップ、前記応答方法決定ステップおよび前記送信ステップを実行する。
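[0020]
According to the present invention, an efficient spoken dialogue system can be realized. A system response can be transmitted to the user terminal by a method suited to the user's environment. When speech recognition fails, the system can not only notify the user terminal of the cause but also provide the user terminal with a means of dealing with it. Speech recognition can be performed efficiently in consideration of the user's characteristics and past circumstances, so that the system can be made more sophisticated.
[0021]
[Embodiments of the Invention]
FIG. 1 shows a spoken dialogue system 100 according to an embodiment of the present invention. The spoken dialogue system 100 includes an input signal receiving unit 102, an acoustic analysis unit 104, a speech signal recording unit 106, a speech signal database 108, a speech recognition unit 110, a recognition failure cause determination unit 112, a recognition parameter setting unit 114, a dialogue management unit 116, a task processing unit 118, a user environment estimation unit 120, a response method determination unit 122, a response generation unit 124, an output signal transmission unit 126, and a processor or controller 150.
[0022]
The units 102 to 126 are controlled by the processor 150. Each of the units 102 to 126 is implemented in hardware or software, and may be implemented as a program executed by the processor 150.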
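[0023]
The input signal receiving unit 102 receives a speech signal transmitted from the user terminal and supplies it to the acoustic analysis unit 104. The acoustic analysis unit 104 analyzes the speech signal from the input signal receiving unit 102. For the background noise preceding the user's utterance, the acoustic analysis unit 104 detects the noise level, supplies the detected level to the recognition failure cause determination unit 112, and requests it to determine whether the noise will cause recognition to fail. This allows the system 100 to deal with a cause (factor) of speech recognition failure before recognizing the user's utterance. The acoustic analysis unit 104 supplies the speech signal to the speech signal recording unit 106 except when, as described later, the background noise level is greater than the threshold and it is determined that the noise cannot be dealt with inside the system.
[0024]
The speech signal recording unit 106 records the speech signal data supplied from the acoustic analysis unit 104 and stores the data in the speech signal database 108. The speech signal database 108 accumulates the speech data.
[0025]
The speech recognition unit 110 performs speech recognition on the speech data in the speech signal database 108 and derives a character string corresponding to the speech signal. When speech recognition fails, the speech recognition unit 110 requests the recognition failure cause determination unit 112 to determine the cause of the failure.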
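[0026]
In response to a request from the acoustic analysis unit 104, the recognition failure cause determination unit 112 determines whether the background noise will cause recognition to fail. If it determines that the background noise will be a cause of failure, it further determines whether the problem can be dealt with by adjusting the recognition parameters, that is, by changing parameter values, and if so, requests the recognition parameter setting unit 114 to adjust the parameters.
[0027]
The recognition failure cause determination unit 112 also analyzes the cause of a speech recognition failure in response to a request from the speech recognition unit 110. If it determines that the failure can be dealt with by adjusting the recognition parameters, it requests the recognition parameter setting unit 114 to adjust them. If it determines that the failure cannot be dealt with inside the system 100, it requests the user environment estimation unit 120 to estimate the user's acoustic environment.
[0028]
The recognition parameter setting unit 114 adjusts the parameter values in response to a parameter adjustment request from the recognition failure cause determination unit 112 and, when the speech needs to be recognized again, requests the speech recognition unit 110 to perform speech recognition. The user environment estimation unit 120 estimates the user's current acoustic environment in response to a request from the recognition failure cause determination unit 112.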
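[0029]
When the speech recognition unit 110 succeeds in speech recognition, the dialogue management unit 116 analyzes the derived character data to estimate or extract the user's intention and decides how to respond to that intention. The task processing unit 118 processes the task requested through the dialogue or requests an external module (not shown) to process it.
[0030]
The response method determination unit 122 determines the method or means for outputting a response to the user based on the result of estimating the user's acoustic environment. The response generation unit 124 generates a response to the user in accordance with the determined response method or means. The output signal transmission unit 126 transmits the response signal generated by the response generation unit 124 to the user terminal by the response method or means determined by the response method determination unit 122. The response methods include "disconnecting" communication with the user terminal.
[0031]
FIG. 2 shows a flowchart of the processing in the dialogue system 100 of FIG. 1.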
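[0032]
In step 202, the input signal receiving unit 102 receives a speech signal transmitted from the user terminal. In step 204, the acoustic analysis unit 104 analyzes the received speech signal. In step 206, the acoustic analysis unit 104 detects the level of the background noise preceding the user's utterance and determines whether the level is greater than a threshold. If the level is determined to be not greater than, that is, equal to or less than, the threshold, the acoustic analysis unit 104 analyzes the signal and supplies the speech signal to the speech signal recording unit 106. In step 208, the speech signal recording unit 106 stores the speech signal in the database 108.
[0033]
If it is determined in step 206 that the level of the background noise preceding the user's utterance is greater than the threshold, the acoustic analysis unit 104 supplies the level value to the recognition failure cause determination unit 112 and requests a determination of whether it will cause recognition to fail. The procedure proceeds to step 214.
[0034]
In step 214, in response to the request from the acoustic analysis unit 104, the recognition failure cause determination unit 112 determines whether the background noise will cause recognition to fail. If so, the recognition failure cause determination unit 112 further determines in step 216 whether the problem can be dealt with by adjusting the recognition parameters. If it can, the recognition failure cause determination unit 112 requests the recognition parameter setting unit 114 to adjust the recognition parameters. In step 218, the recognition parameter setting unit 114 adjusts the recognition parameters, that is, changes their values. If the speech signal has not yet been recorded in step 208 at that point, the recognition failure cause determination unit 112 causes the speech signal recording unit 106 to store the speech signal in the database 108 as in step 208. Because the failure factor is dealt with as soon as the background noise preceding the user's utterance is detected, speech processing becomes more efficient. If, on the other hand, it is determined in step 216 that the problem cannot be dealt with inside the system 100, the procedure proceeds to step 224.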
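[0035]
In step 210, the speech recognition unit 110 performs speech recognition on the speech signal recorded in the speech signal database 108.
[0036]
In step 212, the speech recognition unit 110 determines whether speech recognition has succeeded, that is, whether a recognition result has been obtained. Recognition is judged successful when the speech segments have been extracted successfully, a character string has been derived for the speech data of each extracted segment, and the recognition score, which represents the certainty or reliability of the message (word, phrase, sentence, etc.) expressed by the derived character string, is higher than a predetermined threshold; in that case the procedure proceeds to step 220. If recognition is judged to have failed, the procedure proceeds to step 214.
[0037]
In step 220, the dialogue management unit 116 extracts or estimates the user's intention from the message obtained as the recognition result, decides how to respond to the user accordingly, and, if necessary, requests the task processing unit 118 to process a task. After step 220, the procedure proceeds to step 226. In step 222, the task processing unit 118 processes the task in accordance with the request expressed by the message. The procedure then proceeds to step 226.
[0038]
In step 214, the recognition failure cause determination unit 112 also analyzes the cause of a speech recognition failure in response to a request from the speech recognition unit 110. The recognition failure cause determination unit 112 determines whether the failure can be dealt with by adjusting the recognition parameters. If so, it requests the recognition parameter setting unit 114 to adjust the recognition parameters. In step 218, the recognition parameter setting unit 114 adjusts the recognition parameters, that is, changes their values. If the recognition failure cause determination unit 112 determines that the failure cannot be dealt with inside the system 100, the procedure proceeds to step 224.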
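[0039]
In step 224, the user environment estimation unit 120 estimates the user's acoustic environment and extracts the information needed to determine the method or means of responding to the user.
[0040]
In step 226, the response method determination unit 122 determines the response method or means in consideration of the user's acoustic environment. In step 228, the response generation unit 124 generates a response signal according to the response method. In step 230, the output signal transmission unit 126 transmits the response signal to the user terminal by the determined response method or means, for example as a voice message or an e-mail. As one form of response, communication with the user terminal may be disconnected in some cases.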
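[0041]
FIG. 3 shows the configuration of a voice portal 300 that includes the dialogue system 100 of FIG. 1 connected to the Internet, according to an embodiment of the present invention. The voice portal 300 comprises the dialogue system 100 of FIG. 1, a user authentication unit 310, a user information database 320, and a recognition failure cause handling history database 330.
[0042]
The user authentication unit 310 authenticates the user either with a user ID and password received as input signals from the user or by voice authentication of the user's speech. The user ID and password and the user information needed for voice authentication are registered in the user information database 320 in advance, when the user first uses the system 100. When user authentication succeeds, the units 102 to 126 of FIG. 1 refer to the user information database 320 in each step of FIG. 2 and carry out each process taking into account the user's characteristics in using the system 100, for example "always uses the system from a place with loud background noise" or "always speaks in a loud voice".
[0043]
The user information database 320 stores information about the user's use of the system 100: in addition to user identification information such as the user ID and password, it stores the user's usage history and information about the acoustic environment at the time of use. The user information database 320 is accessed by the speech recognition unit 110, the recognition failure cause determination unit 112, the recognition parameter setting unit 114, the dialogue management unit 116, the task processing unit 118, the user environment estimation unit 120, the response method determination unit 122 and the response generation unit 124 of the dialogue system 100 of FIG. 1, which read (refer to) and write information. User information is deleted from the user information database 320 at the user's request.
[0044]
The recognition failure cause handling history database 330 stores a history (log) of the countermeasures taken when the acoustic environment on the user side was judged to be poor, for example "the background noise is too loud", or when speech recognition failed. For example, it records information such as "when extraction of the speech segments failed, changing the segmentation parameter value from 'A1' to 'A2' made re-recognition succeed". Using this database 330 makes it possible to avoid failures efficiently in the second and subsequent recognition attempts and to respond quickly when recognition fails, which improves processing efficiency. For example, by giving priority to combinations of parameter values that were used when re-recognition succeeded in the past, the recognition parameter setting unit 114 reduces the number of re-recognition attempts and speeds up recognition.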
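The idea of trying previously successful parameter combinations first can be sketched as follows. This is only an illustration: the function name `rank_parameter_sets`, the shape of the history records, and the use of tuples for parameter sets are assumptions, not details given in the specification.

```python
from collections import Counter


def rank_parameter_sets(history, candidates):
    """Order candidate parameter sets so that past winners are tried first.

    history    -- list of (params, succeeded) pairs from earlier re-recognitions
    candidates -- hashable parameter sets (e.g. tuples) available for the retry
    """
    wins = Counter(params for params, succeeded in history if succeeded)
    # sorted() is stable, so candidates with equal counts keep their order.
    return sorted(candidates, key=lambda params: wins[params], reverse=True)


# Hypothetical log: value 'A1' failed once, value 'A2' led to a success.
history = [(("A1",), False), (("A2",), True)]
print(rank_parameter_sets(history, [("A1",), ("A2",)]))
```

With the hypothetical log above, the set containing 'A2' is moved ahead of 'A1', so the retry starts from the value that worked before.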
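[0045]
FIG. 4 shows the configuration of the recognition failure cause determination unit 112. The recognition failure cause determination unit 112 includes a speech segment detection unit 402, an S/N ratio detection unit 404, a speech rate detection unit 406, a recognition failure countermeasure determination unit 408, and a recognition failure cause determination information database 410.
[0046]
Table 1 shows an example of the recognition failure cause determination information in the recognition failure cause determination information database 410.
[Table 1]

[0047]
Table 1 describes, as the recognition failure cause determination information, the determination items (factors), the error threshold for each item, and the countermeasure for each cause.
[0048]
In Table 1, the "minimum speech segment" is the shortest interval that is extracted from the input signal as a speech segment. Isolated bursts of noise are generally short, so setting the minimum speech segment longer reduces the amount of noise that is extracted. If the minimum speech segment is too long, however, short words such as "ni" (2) can no longer be extracted, so the recognition parameter needs to be adjusted.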
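The trade-off around the minimum speech segment can be illustrated with a small filter. This is a minimal sketch: the function name and the millisecond segment boundaries are made up for the example, and only the 50 ms and 100 ms settings echo values mentioned elsewhere in the description.

```python
def keep_speech_segments(segments, min_len_ms):
    """Drop candidate segments shorter than the minimum speech segment length.

    segments -- list of (start_ms, end_ms) pairs cut out of the input signal
    """
    return [(start, end) for (start, end) in segments if end - start >= min_len_ms]


# Hypothetical candidates: a 40 ms noise burst, an 80 ms short word, a 500 ms phrase.
candidates = [(0, 40), (100, 180), (400, 900)]
print(keep_speech_segments(candidates, 50))   # the noise burst is removed
print(keep_speech_segments(candidates, 100))  # the short word is lost as well
```

Raising the minimum from 50 ms to 100 ms removes more noise but also discards the 80 ms word, which is why the value has to be tuned rather than simply maximized.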
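[0049]
The "recognition score" represents the certainty or reliability of the message in the character string obtained by speech recognition. The way the score is calculated (its scale) may vary with the processing method. In speech recognition, the candidate with the highest recognition score among several candidate solutions is output as the recognition result, provided that its score is at or above the threshold. Conversely, even when a recognition result message is obtained, it is judged unreliable and rejected if its recognition score is below the threshold. When the recognition rate is low, lowering the recognition score threshold can sometimes recover messages that were correct but had been rejected because of a low score. Lowering the threshold too far, however, also admits inappropriate results, so the parameter value needs to be adjusted.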
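The accept-or-reject rule for the recognition score can be sketched as below. The function name, the candidate texts and the score values are hypothetical; the score scale itself is left open by the description.

```python
def pick_recognition_result(candidates, score_threshold):
    """Return the best-scoring candidate text, or None if it is rejected.

    candidates -- list of (text, score) pairs produced by the recognizer
    """
    if not candidates:
        return None  # nothing was recognized at all
    text, score = max(candidates, key=lambda candidate: candidate[1])
    # The best candidate is accepted only when its score reaches the threshold.
    return text if score >= score_threshold else None


hypotheses = [("ni", 55), ("shichi", 40)]
print(pick_recognition_result(hypotheses, 60))  # rejected at the stricter threshold
print(pick_recognition_result(hypotheses, 50))  # accepted after lowering it
```

Lowering the threshold from 60 to 50 lets the best hypothesis through, which is the recovery effect described above; a much lower threshold would equally let wrong hypotheses through.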
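[0050]
The "S/N ratio" is the power ratio of the speech signal to the noise signal. Even when the noise is loud, the S/N ratio is large if the speech signal is louder still. In general, a larger S/N ratio gives a better recognition rate. When the S/N ratio is below the threshold, the problem can be dealt with by having the user re-enter the speech in a less noisy place or in a louder voice.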
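As a worked illustration, the power ratio is usually expressed in decibels as 10·log10(P_signal / P_noise). The sketch below applies that standard formula; the function names and the 10 dB default threshold (borrowed from the example value used later in the description) are assumptions for illustration.

```python
import math


def snr_db(signal_power, noise_power):
    """Signal-to-noise ratio in decibels from two positive power values."""
    if signal_power <= 0 or noise_power <= 0:
        raise ValueError("powers must be positive")
    return 10.0 * math.log10(signal_power / noise_power)


def needs_reinput(signal_power, noise_power, threshold_db=10.0):
    """True when the S/N ratio is below the threshold and re-input should be requested."""
    return snr_db(signal_power, noise_power) < threshold_db


print(needs_reinput(1000.0, 10.0))  # speech far louder than noise
print(needs_reinput(20.0, 10.0))    # speech only slightly louder than noise
```

The same noise power gives different outcomes depending on the speech power, matching the remark that loud noise is acceptable when the speech is louder still.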
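[0051]
The "speech rate" is how fast the user speaks. It is generally expressed as the number of morae (roughly, syllables) per second of speech; the more words uttered per unit time, the faster the speech rate. The recognition rate generally falls when the speech is too fast, so when the detected speech rate is greater than the threshold in Table 1, the user is instructed to speak slowly. The recognition rate also falls when the speech is too slow, so when the detected speech rate is smaller than another threshold, the user is instructed to speak faster.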
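The two-sided speech rate check can be sketched as follows. The limits of 10 and 3 morae per second are placeholder values chosen for the example (the actual thresholds of Table 1 are not available here), and the function name and advice strings are likewise hypothetical.

```python
def speech_rate_advice(mora_count, duration_s, fast_limit=10.0, slow_limit=3.0):
    """Return an instruction for the user, or None when the rate is acceptable.

    The rate is measured in morae per second; both limits are assumed values.
    """
    rate = mora_count / duration_s
    if rate > fast_limit:
        return "please speak more slowly"
    if rate < slow_limit:
        return "please speak faster"
    return None


print(speech_rate_advice(24, 2.0))  # 12 morae/s
print(speech_rate_advice(12, 2.0))  # 6 morae/s
print(speech_rate_advice(4, 2.0))   # 2 morae/s
```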
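[0052]
Table 1 may also contain other recognition failure cause determination information. Although its content is shown as a table, it may take another form, for example text.
[0053]
As examples of the processing performed by the recognition failure cause determination unit 112, the processing of the portion of the input signal preceding the speech segment and the processing of the speech portion of the input signal are described below.
[0054]
Processing of the portion of the input signal preceding the speech segment
When the procedure in FIG. 2 has proceeded from step 206 (YES) to step 214, in step 214 the recognition failure cause determination unit 112 of FIG. 4, in response to the request from the acoustic analysis unit 104, determines whether the background noise of the input signal will cause recognition to fail, based on the noise level and the recognition failure cause determination information. If it determines that the background noise will cause recognition to fail, it also refers to the information in the user information database 320 and the recognition failure cause handling history database 330 and decides on a countermeasure for the cause. This processing is described in more detail below.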
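[0055]
In the input signal, the signal before the user speaks, that is, the portion in which the user is not speaking, represents background noise. The recognition failure cause determination unit 112 estimates whether that background noise "will cause speech recognition to fail" by the following procedure.
[0056]
First, as preprocessing, the acoustic analysis unit 104 (FIG. 1) extracts the background noise portion from the input signal and detects an acoustic feature of the background noise, for example the noise level.
[0057]
Step 1: The recognition failure cause determination unit 112 receives from the acoustic analysis unit 104 information on the acoustic features of the background noise together with a request to determine whether the background noise extracted in the preprocessing "will cause speech recognition to fail".
[0058]
Step 2: In response to the determination request, the recognition failure countermeasure determination unit 408 of the recognition failure cause determination unit 112 determines whether the background noise "will cause speech recognition to fail" from the information on the acoustic features of the background noise and the recognition failure cause determination information (see Table 1). Suppose, for example, that the recognition failure cause determination information states that "an S/N ratio (signal-to-noise ratio) of 10 dB or less can cause misrecognition". If the S/N ratio estimated from the actual background noise level is 10 dB or less, the unit 408 determines that "the background noise of the input signal will cause speech recognition to fail". (Because the background noise interval contains no speech signal, this is an estimated S/N ratio in which the speech signal level S is assumed to take a typical value.) The unit 408 also refers, as determination items, to user information such as "this user is user X" and "user X always uses the system in an environment with an estimated S/N ratio of 10 dB, yet the speech recognition rate over the past 10 uses is 98% or higher", and to the recognition failure cause handling history when one exists. In such a case, an estimated S/N ratio of 10 dB does not affect speech recognition for user X, so the unit 408 determines that "the background noise of the input signal will not cause speech recognition to fail". Because the processing can be changed according to the user in this way, speech processing tailored to the user becomes possible.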
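The step 2 decision, including the per-user override, might look like the following sketch. The 10 dB threshold and the 98% recognition rate come from the example in the text; the assumed typical speech level of 70 dB, the function name and the parameter names are illustrative assumptions only.

```python
def noise_is_failure_cause(noise_level_db,
                           assumed_speech_db=70.0,
                           snr_threshold_db=10.0,
                           user_recognition_rate=None,
                           tolerated_rate=0.98):
    """Decide whether pre-utterance background noise is likely to break recognition.

    No speech exists yet, so the S/N ratio is estimated from an assumed
    typical speech level (a hypothetical 70 dB here) minus the noise level.
    """
    estimated_snr_db = assumed_speech_db - noise_level_db
    if estimated_snr_db > snr_threshold_db:
        return False  # enough margin between speech and noise
    # User history can override the generic rule: this user copes with the noise.
    if user_recognition_rate is not None and user_recognition_rate >= tolerated_rate:
        return False
    return True


print(noise_is_failure_cause(60.0))                              # generic user
print(noise_is_failure_cause(60.0, user_recognition_rate=0.98))  # user with a good track record
```

The same noise level produces different verdicts for an unknown user and for a user whose history shows that recognition still works in that environment.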
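[0059]
Step 3: If the recognition failure countermeasure determination unit 408 determines that the background noise "will cause speech recognition to fail", it requests the user environment estimation unit 120 to estimate the user's acoustic environment before notifying the user of that fact.
[0060]
Processing of the speech signal in the input signal
When the procedure in FIG. 2 has proceeded from step 212 (NO) to step 214, in step 214 the recognition failure cause determination unit 112 of FIG. 4, in response to the request from the speech recognition unit 110, detects the speech segments, S/N ratio and speech rate of the speech data for which recognition failed. It thereby derives or measures feature values of the speech signal such as the minimum speech segment length, the maximum speech segment length, the noise level and the speech rate. The recognition failure cause determination unit 112 then compares the derived feature values with the recognition failure cause determination information in the database 410 and decides on a countermeasure for the recognition failure. This processing is described in more detail below.
[0061]
Step 1: When the speech recognition unit 110 (FIG. 1) obtains no recognition result, the recognition failure countermeasure determination unit 408 of the recognition failure cause determination unit 112 receives from the speech recognition unit 110 a request to determine the cause of the failure. Even when no recognition result is obtained, if data from the recognition process is available, such as segmentation information or the recognition score at each step, the unit 408 also receives that information from the speech recognition unit 110.
[0062]
Step 2: In response to the determination request, the speech segment detection unit 402, the S/N ratio detection unit 404 and the speech rate detection unit 406 of the recognition failure cause determination unit 112 detect or measure the speech segments of the input speech data (segments extracted in units of phonemes, syllables, words, phrases, sentences or the like), the S/N ratio and the speech rate. The recognition failure countermeasure determination unit 408 also uses the above-mentioned recognition process data from the speech recognition unit 110 as necessary.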
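[0063]
Step 3: The recognition failure countermeasure determination unit 408 determines the cause of the recognition failure from the detection results of step 2 and the recognition failure cause determination information. For example, if the recognition failure cause determination information states that "the minimum speech segment for speech segment extraction is 50 ms" and many of the actually extracted segments with a length of about 50 ms are estimated to be noise, the unit 408 determines that "there may be segmentation errors" or "a lot of noise is being extracted", and concludes that "setting the minimum speech segment longer, for example to 100 ms, would stop noise of about 50 ms from being extracted", that is, "recognition should be retried with the minimum word length set to 100 ms". When the detected S/N ratio is smaller than the threshold given in the recognition failure cause determination information, the unit 408 determines that "the noise is fatal" and that "re-input in a less noisy place or in a louder voice is needed". When it is determined that "the speech rate is too fast", the unit 408 determines that "re-input with slower speech is needed". In making these determinations, as described above, the unit refers to the user information and the recognition failure cause handling history and, for example, omits the above detection or changes the countermeasure as follows.
[0064]
- If re-recognition has already been performed with adjusted parameters and recognition has failed again, the detection already performed is skipped and re-recognition is performed with the parameters adjusted further.
[0065]
- After re-recognition with parameter adjustment has been performed several times (for example, three times), handling by parameter adjustment is stopped.
[0066]
- If there is information that "user X always speaks fast, and this is fatal to recognition", countermeasures related to the speech rate are given priority.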
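[0067]
Step 4: If the problem cannot be dealt with by parameter adjustment in step 3, the recognition failure countermeasure determination unit 408, for example when the user needs to be notified, first has the user environment estimation unit 120 estimate the user's acoustic environment; if the problem can be dealt with by parameter adjustment, it has the recognition parameter setting unit 114 adjust the parameters.
[0068]
In more detail, when it is determined in step 3 that the problem can be dealt with by parameter adjustment, the unit 408 requests the recognition parameter setting unit 114 to adjust the parameters. For example, if the unit 408 estimates from the recognition failure cause determination information in Table 1 that the minimum speech segment is set too short and is causing segmentation errors, it determines that the problem can be dealt with by adjusting the segmentation parameter. Taking into account user information and handling history such as "this user's recognition rate is high when the parameter value is A1" and "in the past history, the recognition rate is high when the parameter value is A2", it requests the recognition parameter setting unit 114 to adjust the parameter and instructs the speech recognition unit 110 to perform recognition again with the adjusted parameter. Because the processing can be changed according to the user in this way, speech processing tailored to the user becomes possible.
[0069]
On the other hand, when it is determined that the speech recognition failure cannot be dealt with inside the system 100, for example when recognition still does not succeed after re-recognition has been repeated a predetermined number of times with different parameter values, the unit 408 requests the user environment estimation unit 120 to estimate the user's acoustic environment before notifying the user of the cause of the failure and the countermeasure.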
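The bounded retry behaviour (adjust the parameter, recognize again, give up after a fixed number of attempts and escalate to the user) can be sketched as follows. The callables `recognize` and `adjust`, the stand-in recognizer and the limit of three retries are assumptions for illustration; the description only gives "three times" as an example.

```python
def recognize_with_retries(recognize, adjust, params, max_retries=3):
    """Retry recognition with adjusted parameters up to max_retries times.

    recognize -- callable(params) returning a result, or None on failure
    adjust    -- callable(params) returning the next parameter value to try
    Returns (result, retries_used); a None result means the failure cannot
    be handled inside the system and the user has to be involved.
    """
    result = recognize(params)
    retries = 0
    while result is None and retries < max_retries:
        params = adjust(params)
        result = recognize(params)
        retries += 1
    return result, retries


def fake_recognize(min_speech_ms):
    # Stand-in recognizer: succeeds once the minimum segment reaches 70 ms.
    return "ni" if min_speech_ms >= 70 else None


print(recognize_with_retries(fake_recognize, lambda ms: ms + 10, 50))
print(recognize_with_retries(lambda ms: None, lambda ms: ms + 10, 50))
```

In the first call recognition succeeds on the second retry; in the second call the retry budget is exhausted, which corresponds to handing over to the user environment estimation.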
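[0070]
FIG. 5 shows the configuration of the recognition parameter setting unit 114. The recognition parameter setting unit 114 includes a parameter determination unit 502 and a recognition execution command generation unit 504. When the parameter determination unit 502 receives a parameter adjustment request from the recognition failure cause determination unit 112 of FIG. 4, it first refers to the user information in the database 320 and the recognition failure cause handling history in the database 330 and determines the parameter values for re-recognition. The recognition execution command generation unit 504 then creates a speech recognition execution command that reflects the determined parameter values and, as necessary, requests the speech recognition unit 110 to perform re-recognition.
[0071]
FIG. 6 shows an example flowchart of the processing performed by the recognition parameter setting unit 114 of FIG. 5.
[0072]
In step 602, the parameter determination unit 502 receives a recognition parameter setting request from the recognition failure cause determination unit 112 together with the determination made by that unit, for example "set the minimum speech segment longer than 50 ms" or "lower the recognition score threshold below 60".
[0073]
In step 604, the parameter determination unit 502 determines whether user information and a recognition failure cause handling history exist. If not, in step 608 the parameter determination unit 502 decides how much to change the parameter value. Basically, the value is changed by a fixed amount determined empirically (experimentally), for example according to rules such as "change the minimum speech segment in steps of 10 ms" or "change the recognition score threshold in steps of 5". The procedure then proceeds to step 610.
[0074]
If it is determined in step 604 that user information and a handling history exist, in step 606 the parameter values are determined with reference to them as well. For example, when there is user information or handling history such as "over user X's past 10 uses, the recognition rate was highest with the recognition score setting at 50" or "the minimum speech segment setting has already been changed and is currently 60 ms", that information is also used to determine the parameter values. For example, when there is information that "a recognition score setting of 50 works well for user X", as above, the threshold is set to 50 immediately instead of being lowered from 60 in two steps of 5, which makes the processing more efficient. The procedure then proceeds to step 610.
[0075]
In step 610, the recognition execution command generation unit 504 generates a recognition execution command reflecting the new parameter values based on the result of step 606 or 608, and in step 612 requests the speech recognition unit 110 to recognize the speech again.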
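Steps 604 to 608 can be sketched as a small function: use a value known to work for the user when one is on record, otherwise move the parameter by a fixed step. The step sizes of 10 ms and 5 and the starting values of 50 ms and 60 are the example values from the text; the dictionary layout and the names `DEFAULT_STEPS` and `decide_parameters` are assumptions.

```python
# Fixed, empirically chosen increments (example values from the description).
DEFAULT_STEPS = {"min_speech_ms": +10, "score_threshold": -5}


def decide_parameters(current, requested, user_best=None):
    """Return new parameter values for the re-recognition attempt.

    current   -- dict of current parameter values
    requested -- names of the parameters that should be adjusted
    user_best -- optional dict of values known to work well for this user
    """
    new_values = dict(current)
    for name in requested:
        if user_best and name in user_best:
            new_values[name] = user_best[name]  # step 606: jump straight to the known value
        else:
            new_values[name] = current[name] + DEFAULT_STEPS[name]  # step 608: fixed step
    return new_values


current = {"min_speech_ms": 50, "score_threshold": 60}
print(decide_parameters(current, ["score_threshold"]))
print(decide_parameters(current, ["score_threshold"], user_best={"score_threshold": 50}))
```

Without history the threshold moves from 60 to 55; with the user record it goes directly to 50, saving one re-recognition round.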
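[0076]
FIG. 7 shows the configuration of the user environment estimation unit 120. The user environment estimation unit 120 includes a noise stationarity analysis unit 702 and an environment estimation unit 704.
[0077]
The noise stationarity analysis unit 702 analyzes the stationarity of the noise in the input speech data in response to a user environment estimation request from the recognition failure cause determination unit 112 of FIG. 4.
[0078]
The environment estimation unit 704 refers, for example, to features of the speech signal detected by the recognition failure cause determination unit 112, such as the S/N ratio, and to user information such as "user X uses the system in a constant noise environment", and judges aspects of the acoustic environment such as "whether a mainly voice-based dialogue is possible", "at what volume the output must be played for the user to hear it" and "whether using a noise canceller (noise removal device or tool) would be effective". For example, when the noise is relatively stationary, the S/N ratio is 10 dB and there is user information such as "user X always uses the system from the workplace", it makes the following estimates.
[0079]
- An S/N ratio of 10 dB indicates that the background noise may be loud and that voice dialogue is somewhat difficult.
- However, the user can hear a voice response output at maximum volume. Because the noise is relatively stationary, a noise canceller would be effective.
- Installing a noise canceller matched to the acoustic environment of user X's workplace on user X's mobile terminal would make speech recognition easier from the next time on.
[0080]
According to the estimated acoustic environment of the user, the environment estimation unit 704 decides, for example, to "set the system's voice output to maximum volume and conduct a mainly voice-based dialogue". It further decides, for example, to "send a noise canceller matched to user X's workplace environment to user X's mobile phone".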
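The example estimates above can be turned into a rough rule-based sketch. Only the 10 dB figure comes from the text; the 0 dB limit below which voice dialogue is treated as impossible, the function name and the keys of the returned dictionary are hypothetical choices made for this illustration.

```python
def estimate_environment(snr_db, noise_is_stationary,
                         noisy_snr_db=10.0, min_dialogue_snr_db=0.0):
    """Summarize the user's acoustic environment from two observations.

    noisy_snr_db        -- at or below this S/N ratio the environment counts as noisy
    min_dialogue_snr_db -- assumed limit at or below which voice dialogue is given up
    """
    noisy = snr_db <= noisy_snr_db
    return {
        "voice_dialogue_possible": snr_db > min_dialogue_snr_db,
        "output_volume": "max" if noisy else "normal",
        # A noise canceller helps when the noise is loud but steady.
        "send_noise_canceller": noisy and noise_is_stationary,
    }


print(estimate_environment(10.0, True))
```

For the workplace example (10 dB, stationary noise) the sketch keeps the voice dialogue, raises the output volume and suggests sending a noise canceller.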
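[0081]
To acquire user information such as "user X always uses the system from the same workplace", the user may be instructed, for example at the first use of the system 100 or when the user's usage environment changes, to register the surrounding acoustic conditions at the time of voice input using the terminal. If the background noise in that situation is recorded and analyzed in advance by the acoustic analysis unit 104, the environment estimation unit 704 can estimate that "the system is being used in the same environment" when the acoustic analysis unit 104 detects similar noise components on later occasions.
[0082]
FIG. 8 shows a flowchart of the processing performed by the response method determination unit 122 of FIG. 1. The response method determination unit 122 determines the response method or means based on the result of the user environment estimation described above.
[0083]
In step 802, the response method determination unit 122 determines whether speech recognition has succeeded. If it has, that is, if a recognition result has been obtained (the speech segments were extracted successfully, a character string was derived for the speech in each segment, and the recognition score is higher than the predetermined threshold), the user is judged to be in an acoustic environment whose noise does not affect speech recognition, and in step 808 the response method determination unit 122 decides to continue the dialogue using the normal response method or means, with voice as the main medium.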
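[0084]
If, on the other hand, it is determined in step 802 that recognition has not succeeded, that is, if the background noise has been determined to cause recognition failure or no recognition result has been obtained, in step 804 the response method determination unit 122 obtains the estimate of the user's acoustic environment from the user environment estimation unit 120.
[0085]
In step 806, the response method determination unit 122 determines whether voice dialogue is possible. If it is, in step 810 the unit decides to continue the dialogue by presenting to the user, using the normal voice-based response method or means, the fact that recognition failed, its cause, and the countermeasure.
[0086]
If it is determined in step 806 that voice dialogue is not possible, in step 812 the response method determination unit 122 decides to announce by voice that the dialogue will be ended, disconnect communication with the user terminal, and then use a response method or means based on text or image signals sent by e-mail to tell the user that the dialogue was ended unilaterally and to present the fact that recognition failed, its cause, and the countermeasure. Because communication between the user terminal and the system 100 is disconnected when no speech signal can be detected or when speech recognition has failed and voice dialogue is difficult, useless dialogue or communication is avoided. In addition, because the user can be told that the dialogue is ending by a response method suited to the user's environment before the communication is disconnected, any unpleasantness for the user is kept to a minimum. At that time, the magnitude of the background noise may be detected from the S/N ratio of the input signal and the volume of the output signal raised accordingly, so that the message reaches the user even in a noisy environment.
[0087]
In step 810, or before the communication is disconnected in step 812, the response method determination unit 122 may further decide, as necessary, to send the user terminal a tool needed to deal with the recognition failure, for example a noise canceller program for dealing with noise, and to notify the user that a tool for dealing with the cause of the failure is being sent. Because a tool needed to deal with the cause of a recognition failure is created or prepared and sent to the user terminal as necessary, the user is assisted in handling failures and the burden on the user is reduced.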
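The branching of FIG. 8 (steps 802 to 812), together with the optional tool delivery, can be sketched as a function that returns the ordered list of response actions. The action labels, the function name and the boolean inputs are hypothetical; the step numbers in the comments refer to the flow described above.

```python
def decide_response(recognized, voice_dialogue_possible=True, send_tool=False):
    """Return the ordered response actions for one user input."""
    if recognized:                                    # step 802: recognition succeeded
        return ["voice_reply"]                        # step 808: normal voice dialogue
    if voice_dialogue_possible:                       # step 806
        actions = ["voice_failure_cause_and_remedy"]  # step 810
        if send_tool:
            actions.append("send_tool")
        return actions
    # Step 812: announce the end by voice, optionally send a tool,
    # disconnect, and follow up by e-mail.
    actions = ["voice_end_notice"]
    if send_tool:
        actions.append("send_tool")
    return actions + ["disconnect", "email_failure_cause_and_remedy"]


print(decide_response(True))
print(decide_response(False, voice_dialogue_possible=False, send_tool=True))
```

Keeping the tool delivery ahead of the disconnect action mirrors the requirement that the tool be sent before the communication is cut.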
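[0088]
FIG. 9 shows the configuration of the response generation unit 124. The response generation unit 124 includes a communication disconnection signal generation unit 902, a program generation unit 904, a program database 906, a response generation management unit 908, an e-mail generation unit 910, a response sentence generation unit 912, a voice signal generation unit (speech synthesis unit) 914, and an image signal generation unit 916.
[0089]
In accordance with the response method or means determined by the response method determination unit 122, the response generation management unit 908 determines the response content (for example, a response sentence or a program), the type of response signal (for example, a voice signal, an e-mail or a communication disconnection signal), and the timing of its generation and output, and requests the generation units 902 to 916 to generate them.
[0090]
The communication disconnection signal generation unit 902 generates a signal for disconnecting communication with the user terminal. The program generation unit 904 either newly generates a program for dealing with the cause of the recognition failure for use on the user terminal (for example, a noise canceller) or selects one from the set of tools stored in advance in the program database 906, and prepares it as the system response. The response sentence generation unit 912 generates response messages for the user, such as general responses to user utterances and notifications of recognition failure, also using the user information and the recognition failure handling history. The e-mail generation unit 910 generates an e-mail incorporating the response content generated by the response sentence generation unit 912 or the program generation unit 904. The voice signal generation unit 914 converts the response content generated by the response sentence generation unit 912 into a voice signal, adjusting the volume with reference to the user's acoustic environment. The image signal generation unit 916 converts the response content generated by the response sentence generation unit 912 into an image signal.
[0091]
The output signal transmission unit 126 has the function of transmitting to the user terminal the various response signals generated by the generation units 902, 910, 914 and 916 of FIG. 9, namely, a function of disconnecting communication with the user terminal in response to detection of a communication disconnection signal, a function of sending e-mail, a function of transmitting voice signals, and a function of transmitting image signals.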
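[0092]
Thus, according to the embodiment of the present invention, when the user's acoustic environment is judged to be poor from the background noise preceding the user's utterance, or when speech recognition fails, the cause can be identified and the cause and its countermeasure can be transmitted to the user terminal, so that recognition failures can be handled appropriately. Moreover, when the cause and countermeasure are reported to the user terminal, they can be conveyed to the user more reliably by a response method suited to the user's environment.
[0093]
The embodiments described above are given merely as typical examples. Modifications and variations will be apparent to those skilled in the art, who can clearly make various modifications to the embodiments without departing from the principle of the invention and the scope of the invention set out in the claims.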
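[0094]
(Supplementary note 1) A spoken dialogue system connectable from a terminal, comprising:
speech recognition means for performing speech recognition on a speech signal from the terminal;
countermeasure determination means for determining the cause when no result of the speech recognition is obtained and deciding on a countermeasure for the determined cause;
estimating means for estimating the acoustic environment of the terminal;
response method determination means for determining a response method for the terminal according to the estimated acoustic environment; and
transmission means for transmitting information representing the determined countermeasure to the terminal by the determined response method.
(End of supplementary note 1.)
(Supplementary note 2) The spoken dialogue system according to supplementary note 1, further comprising storage means for storing identification information and usage history information of a user,
wherein the countermeasure determination means decides on the countermeasure in accordance with the identification information and usage history information of the user stored in the storage means.
(Supplementary note 3) The spoken dialogue system according to supplementary note 1 or 2, wherein the countermeasure determination means, on determining that voice dialogue is impossible, decides to disconnect communication with the terminal as the countermeasure.
(Supplementary note 4) The spoken dialogue system according to any one of supplementary notes 1 to 3, further comprising countermeasure history storage means for storing the determined countermeasure,
wherein the speech recognition means performs speech recognition in accordance with the history of countermeasures stored in the countermeasure history storage means.
(Supplementary note 5) The spoken dialogue system according to any one of supplementary notes 1 to 4, wherein the response method determination means causes a voice signal, an e-mail and/or an image signal representing the determined countermeasure to be transmitted to the terminal according to the estimated acoustic environment.
(Supplementary note 6) A spoken dialogue system comprising: speech recognition means for performing speech recognition on a speech signal from a user;
countermeasure determination means for determining the cause when no result of the speech recognition is obtained and deciding on a countermeasure for the determined cause;
estimating means for estimating the acoustic environment of the user;
response method determination means for determining a response method for the user according to the estimated acoustic environment; and
notification means for notifying the user of information representing the determined countermeasure by the determined response method.
(End of supplementary note 6.)
(Supplementary note 7) A program for spoken dialogue used in an information processing apparatus, the program being operable to cause execution of:
a step of performing speech recognition on a speech signal from a terminal;
a step of determining the cause when no result of the speech recognition is obtained;
a step of deciding on a countermeasure for the determined cause;
a step of estimating the acoustic environment of the terminal;
a step of determining a response method for the terminal according to the estimated acoustic environment; and
a step of transmitting information representing the determined countermeasure to the terminal by the determined response method.
(End of supplementary note 7.)
(Supplementary note 8) The program according to supplementary note 7, wherein the step of deciding on the countermeasure includes deciding on the countermeasure in accordance with identification information and usage history information of the user stored in storage means.
(Supplementary note 9) The program according to supplementary note 7 or 8, further operable to cause execution of a step of storing the determined countermeasure,
wherein the step of performing speech recognition includes performing speech recognition in accordance with the history of countermeasures stored in the countermeasure history storage means.
(Supplementary note 10) The program according to any one of supplementary notes 7 to 9, wherein the step of determining the response method includes causing a voice signal, an e-mail and/or an image signal representing the determined countermeasure to be transmitted to the terminal according to the estimated acoustic environment.
(Supplementary note 11) A spoken dialogue method used in a spoken dialogue system, the method including:
a step of performing speech recognition on a speech signal from a terminal;
a step of determining the cause when no result of the speech recognition is obtained;
a step of deciding on a countermeasure for the determined cause;
a step of estimating the acoustic environment of the terminal;
a step of determining a response method for the terminal according to the estimated acoustic environment; and
a step of transmitting information representing the determined countermeasure to the terminal by the determined response method.
(End of supplementary note 11.)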
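[0095]
[Effects of the Invention]
With the features described above, the present invention makes it possible to select a response method suited to the user's environment, to provide the user terminal not only with the cause of a speech recognition failure but also with a means of dealing with it, and to perform speech recognition efficiently in consideration of the user's characteristics and past circumstances.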
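[Brief Description of the Drawings]
[FIG. 1] FIG. 1 shows a spoken dialogue system according to an embodiment of the present invention.
[FIG. 2] FIG. 2 shows a flowchart of the processing in the dialogue system of FIG. 1.
[FIG. 3] FIG. 3 shows the configuration of a voice portal that includes the dialogue system of FIG. 1 connected to the Internet, according to an embodiment of the present invention.
[FIG. 4] FIG. 4 shows the configuration of the recognition failure cause determination unit.
[FIG. 5] FIG. 5 shows the configuration of the recognition parameter setting unit.
[FIG. 6] FIG. 6 shows an example flowchart of the processing performed by the recognition parameter setting unit of FIG. 5.
[FIG. 7] FIG. 7 shows the configuration of the user environment estimation unit.
[FIG. 8] FIG. 8 shows a flowchart of the processing performed by the response method determination unit.
[FIG. 9] FIG. 9 shows the configuration of the response generation unit.
[Explanation of Reference Numerals]
100 Spoken dialogue system
102 Input signal receiving unit
104 Acoustic analysis unit
110 Speech recognition unit
112 Recognition failure cause determination unit
114 Recognition parameter setting unit
116 Dialogue management unit
118 Task processing unit
120 User environment estimation unit
122 Response method determination unit
124 Response generation unit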
126 Output signal transmission unit
[0001]
BACKGROUND OF THE INVENTION
The present invention relates to speech recognition, and more particularly to interactive speech recognition.
[0002]
[Prior art]
Mobile phones have become popular in addition to regular landline phones, and recently, a pilot operation of a voice portal accessed by voice via the Internet has started. Under such circumstances, there is a strong need (request) for the advancement of spoken dialogue systems.
[0003]
The most important task of a spoken dialogue system is to extract or estimate the user's intention accurately. To do so, the system must first recognize the speech signal uttered by the user accurately; in other words, speech recognition performance determines the performance of the spoken dialogue system. Attempts have therefore been made to improve speech recognition performance in spoken dialogue systems by removing noise components from the speech signal and by improving the acoustic and language models. However, because of irregular noise contamination and the diversity of linguistic expressions, it is virtually impossible to achieve a 100% recognition rate in every situation.
[0004]
A dialogue system has been proposed that, when no recognition result is obtained and speech recognition fails, for example when extraction of the speech segments fails so that recognition is impossible, or when the result is rejected because its recognition score is lower than the threshold set by the system, asks the user through the dialogue to re-enter the speech. However, because that system does not tell the user the cause of the failure, the same failure may be repeated many times.
[0005]
In addition, a method has been devised in which the user is notified of the cause of recognition failure to avoid repeated failures.
[0006]
Japanese Patent Laid-Open No. 10-133849 (Patent Document 1) describes that an error message is displayed when speech recognition fails.
[Patent Document 1]
JP-A-10-133849
[0007]
Japanese Unexamined Patent Publication No. 2000-112497 (Patent Document 2) describes notifying the user of reason information when recognition of input speech fails.
[Patent Document 2]
JP 2000-112497 A
[0008]
Japanese Patent Laid-Open No. 2002-23903 (Patent Document 3) describes giving instructions to first-time users.
[Patent Document 3]
JP 2002-23903 A
[0009]
[Problems to be solved by the invention]
The conventional methods described above either display a message about the speech recognition failure on a display or provide a voice message on the assumption of an acoustic environment in which the noise is constantly low. However, with a mobile phone, for example, it is in practice impossible for the user to use the display during a voice call. In an acoustic environment with noise so loud that the user cannot hear the voice, a voice message is not necessarily effective. Even if the user learns the cause of the failure, the user will not necessarily understand what it means, and a means of resolving it is not necessarily available. For example, if the user is in a noisy environment and cannot leave, it is difficult for the user to deal with the noise even on learning that it is the cause of the failure. In practice, therefore, the applicability of these methods is limited.
[0010]
The inventors have recognized that it is advantageous to send the system response to the user terminal in a manner suitable for the user's environment. The inventors have recognized that it is advantageous to not only notify the user terminal of the cause of voice recognition failure but also provide the user terminal with a means for dealing with it. In addition, the inventors have recognized that it is advantageous to perform speech recognition efficiently in consideration of user characteristics and past situations in order to upgrade the system. When speech recognition fails and dialogue is not established, it is useless to continue the dialogue by repeatedly sending a failure message response.
[0011]
An object of the present invention is to realize an efficient spoken dialogue system.
[0012]
[Means for Solving the Problems]
According to a feature of the present invention, a spoken dialogue system connectable from a terminal comprises: input signal receiving means for receiving a speech signal from the terminal; acoustic analysis means for analyzing the speech signal, detecting the level of background noise before the user's utterance, and determining whether the level is greater than a predetermined threshold; speech signal recording means for storing the speech signal in a speech signal database; speech recognition means for performing speech recognition, using recognition parameters, on the speech signal stored in the speech signal database, determining whether the speech recognition has succeeded according to whether a recognition result has been obtained, and, when a recognition result has been obtained, further obtaining a recognition score of the recognition result to determine whether the speech recognition has succeeded; recognition failure cause determination means for determining, when the level of the background noise is greater than the predetermined threshold, whether the level of the background noise will cause the speech recognition to fail, based on recognition failure cause information and user information registered in advance, for determining, when the level is determined to be such a cause, whether the background noise can be dealt with by adjusting the recognition parameters, based on the recognition failure cause information and the user information, for determining, when the speech recognition has failed, the cause of the failure and whether the background noise or the failure can be dealt with by adjusting the recognition parameters, based on the recognition failure cause information and the user information, and for determining a countermeasure to be taken by the user when the background noise or the failure cannot be dealt with; recognition parameter setting means for adjusting the recognition parameters and requesting the speech recognition means to perform speech recognition when the recognition failure cause determination means determines that the background noise or the failure can be dealt with by adjusting the recognition parameters; estimating means for estimating the acoustic environment of the terminal by referring to the features of the speech signal detected by the recognition failure cause determination means and to the user information when the recognition parameter setting means determines that the background noise cannot be dealt with by adjusting the recognition parameters; response method determination means for determining a method of responding to the user of the terminal according to the estimated acoustic environment; and transmission means for transmitting information representing the countermeasure to the terminal by the determined response method. When the recognition failure cause determination means determines that the background noise or the failure cannot be dealt with by adjusting the recognition parameters, the estimating means, the response method determination means and the transmission means are operated.
[0013]
The spoken dialogue system may further include storage means for storing user identification information and usage history information. The countermeasure determining means may determine the countermeasure according to the identification information and usage history information of the user stored in the storage means.
[0014]
The countermeasure may include adjustment of parameters for speech recognition. The countermeasure may include transmission of a countermeasure program to the terminal. When the countermeasure determining means determines that voice dialogue is impossible, it may decide to disconnect communication with the terminal as the countermeasure.
[0015]
The spoken dialogue system may further include a countermeasure history storage means for storing the determined countermeasure. The voice recognition means may execute voice recognition according to the history of the countermeasure stored in the countermeasure history storage means.
[0016]
The response method determining means may transmit an audio signal, an e-mail and / or an image signal representing the determined countermeasure to the terminal in accordance with the estimated acoustic environment.
[0018]
According to yet another feature of the present invention, a program for spoken dialogue used in an information processing apparatus is operable to cause execution of: a step of receiving a speech signal from a terminal; a step of analyzing the speech signal, detecting the level of background noise before the user's utterance, and determining whether the level is greater than a predetermined threshold; a step of storing the speech signal in a speech signal database; a step of performing speech recognition, using recognition parameters, on the speech signal stored in the speech signal database, determining whether the speech recognition has succeeded according to whether a recognition result has been obtained, and, when a recognition result has been obtained, further obtaining a recognition score of the recognition result to determine whether the speech recognition has succeeded; a determination step of determining, when the level of the background noise is greater than the predetermined threshold, whether the level of the background noise will cause the speech recognition to fail, based on recognition failure cause information and user information registered in advance, of determining, when the level is determined to be such a cause, whether the background noise can be dealt with by adjusting the recognition parameters, based on the recognition failure cause information and the user information, of determining, when the speech recognition has failed, the cause of the failure and whether the background noise or the failure can be dealt with by adjusting the recognition parameters, based on the recognition failure cause information and the user information, and of determining a countermeasure to be taken by the user when the background noise or the failure cannot be dealt with; a step of adjusting the recognition parameters and requesting the speech recognition means to perform speech recognition when it is determined in the determination step that the background noise or the failure can be dealt with by adjusting the recognition parameters; an estimation step of estimating the acoustic environment of the terminal by referring to the features of the speech signal detected by the recognition failure cause determination means and to the user information when the recognition parameter setting means determines that the background noise cannot be dealt with by adjusting the recognition parameters; a response method determination step of determining a method of responding to the user of the terminal according to the estimated acoustic environment; and a transmission step of transmitting information representing the determined countermeasure to the terminal by the determined response method. The program is operable to cause the estimation step, the response method determination step and the transmission step to be executed when it is determined in the determination step that the background noise or the failure cannot be dealt with by adjusting the recognition parameters.
[0019]
According to still another aspect of the present invention, a voice interaction method used in a voice interaction system includes: a step of receiving an audio signal from a terminal; a step of analyzing the audio signal, detecting a level of background noise before user utterance, and determining whether the level is greater than a predetermined threshold; a step of storing the audio signal in an audio signal database; a step of performing speech recognition, using recognition parameters, on the audio signal stored in the audio signal database, determining whether the speech recognition has succeeded according to whether a recognition result is obtained, and, when a recognition result is obtained, obtaining a recognition score of the recognition result and determining whether the speech recognition has succeeded; a determination step of, when the level of the background noise is greater than the predetermined threshold, determining, based on recognition failure cause information and user information registered in advance, whether the level of the background noise causes speech recognition failure, and, when it is determined to cause speech recognition failure, determining, based on the recognition failure cause information and the user information, whether the background noise can be dealt with by adjusting the recognition parameters, of, when the speech recognition fails, determining the cause of the failure based on the recognition failure cause information and the user information, and of determining a coping method to be taken by the user when the background noise or the failure cannot be dealt with by adjusting the recognition parameters; a step of adjusting the recognition parameters and requesting speech recognition when the determination step determines that the background noise or the failure can be dealt with by adjusting the recognition parameters; an estimation step of estimating an acoustic environment of the terminal by referring to the characteristics of the audio signal detected by the recognition failure cause determination unit and to the user information, when the recognition parameter setting unit determines that the background noise cannot be dealt with by adjusting the recognition parameters; a response method determining step of determining a response method to the user of the terminal according to the estimated acoustic environment; and a transmission step of transmitting information representing the determined countermeasure to the terminal using the determined response method. When it is determined in the determination step that the background noise or the failure cannot be dealt with by adjusting the recognition parameters, the estimation step, the response method determining step, and the transmission step are executed.
[0020]
According to the present invention, an efficient voice dialogue system can be realized. The system response can be transmitted to the user terminal in a manner suitable for the user's environment. When voice recognition fails, not only the cause is notified to the user terminal, but also means for coping with it can be provided to the user terminal. It is possible to efficiently perform speech recognition in consideration of user characteristics and past situations, and to improve the system.
[0021]
DETAILED DESCRIPTION OF THE INVENTION
FIG. 1 shows a voice interaction system 100 according to an embodiment of the present invention. The voice dialogue system 100 includes an input signal receiving unit 102, an acoustic analysis unit 104, an audio signal recording unit 106, an audio signal database 108, a voice recognition unit 110, a recognition failure cause determination unit 112, a recognition parameter setting unit 114, a dialogue management unit 116, a task processing unit 118, a user environment estimation unit 120, a response method determination unit 122, a response generation unit 124, an output signal transmission unit 126, and a processor or controller 150.
[0022]
Each unit 102 to 126 is controlled by the processor 150. Each unit 102 to 126 is implemented in the form of hardware or software. Each unit 102 to 126 may be implemented as a program executed by the processor 150.
[0023]
The input signal receiving unit 102 receives an audio signal transmitted from the user terminal and supplies the audio signal to the acoustic analysis unit 104. The acoustic analysis unit 104 analyzes the audio signal from the input signal receiving unit 102. The acoustic analysis unit 104 detects the level of the background noise before the user's utterance, supplies the detected noise level to the recognition failure cause determination unit 112, and requests it to determine whether the noise causes a recognition failure. Thereby, the system 100 can cope with a cause (factor) of speech recognition failure before recognizing the user's utterance. As will be described later, the acoustic analysis unit 104 supplies the audio signal to the audio signal recording unit 106 unless it is determined that the background noise level is greater than the threshold value and cannot be dealt with in the system.
[0024]
The audio signal recording unit 106 records the audio signal data supplied from the acoustic analysis unit 104 and accumulates the data in the audio signal database 108. The audio signal database 108 stores the audio data.
[0025]
The voice recognition unit 110 performs voice recognition on the voice data in the voice signal database 108 to derive a character string corresponding to the voice signal. When the speech recognition fails, the speech recognition unit 110 requests the recognition failure cause determination unit 112 to determine the cause of failure.
[0026]
The recognition failure cause determination unit 112 determines whether background noise causes the recognition failure in response to a request from the acoustic analysis unit 104. When the recognition failure cause determination unit 112 determines that the background noise causes the failure, it further determines whether the noise can be dealt with by adjusting the recognition parameters, that is, by changing the parameter values. If it can, the recognition failure cause determination unit 112 requests the recognition parameter setting unit 114 to adjust the parameters.
[0027]
The recognition failure cause determination unit 112 further analyzes the cause of the voice recognition failure in response to the request from the voice recognition unit 110. If the recognition failure cause determination unit 112 determines that the failure can be dealt with by adjusting the recognition parameters, it requests the recognition parameter setting unit 114 to adjust the parameters. When it determines that the failure cannot be dealt with within the system 100, the recognition failure cause determination unit 112 requests the user environment estimation unit 120 to estimate the acoustic environment of the user.
[0028]
The recognition parameter setting unit 114 adjusts the parameter value in response to the parameter adjustment request from the recognition failure cause determination unit 112, and requests the speech recognition unit 110 to perform speech recognition when speech recognition is necessary. In response to the request from the recognition failure cause determination unit 112, the user environment estimation unit 120 estimates the current acoustic environment of the user.
[0029]
When the speech recognition unit 110 succeeds in speech recognition, the dialogue management unit 116 analyzes the derived character data to estimate or extract the user's intention, and determines a response to the intention. The task processing unit 118 processes the requested task through the dialog or requests an external module (not shown) for processing.
[0030]
The response method determination unit 122 determines a method or means for outputting a response to the user based on the result of estimation of the user's acoustic environment. The response generation unit 124 generates a response to the user according to the determined response method or response means. The output signal transmission unit 126 transmits the response signal generated by the response generation unit 124 to the user terminal using the response method or response means determined by the response method determination unit 122. The response methods include “disconnection” of communication with the user terminal.
[0031]
FIG. 2 shows a flowchart of processing in the interactive system 100 of FIG. 1.
[0032]
In step 202, the input signal receiving unit 102 receives an audio signal transmitted from a user terminal. In step 204, the acoustic analysis unit 104 analyzes the received audio signal. In step 206, the acoustic analysis unit 104 detects the level of background noise before the user utterance, and determines whether the level is greater than a threshold value. If it is determined that the level is not greater than the threshold value, that is, at or below the threshold value, the acoustic analysis unit 104 supplies the analyzed audio signal to the audio signal recording unit 106. In step 208, the audio signal recording unit 106 stores the audio signal in the database 108.
[0033]
If it is determined in step 206 that the level of the background noise before the user utterance is equal to or higher than the threshold value, the acoustic analysis unit 104 supplies the value of the level to the recognition failure cause determination unit 112 and requests it to determine whether the noise causes a recognition failure. The procedure then proceeds to step 214.
[0034]
In step 214, the recognition failure cause determination unit 112 determines whether background noise causes the recognition failure in response to the request from the acoustic analysis unit 104. If it is determined that this is the cause of failure, the recognition failure cause determination unit 112 further determines in step 216 whether or not it can be dealt with by adjusting parameters during recognition. When it is determined that the problem can be dealt with by adjusting the parameters, the recognition failure cause determination unit 112 requests the recognition parameter setting unit 114 to adjust the recognition parameters. In step 218, the recognition parameter setting unit 114 adjusts the recognition parameter, that is, changes the value of the recognition parameter. At this time, if the audio signal is not recorded in step 208, the recognition failure cause determination unit 112 causes the audio signal recording unit 106 to accumulate the audio signal in the database 108 as in step 208. As described above, since the failure factor is dealt with when the background noise before the user utters is detected, the voice processing efficiency is increased. On the other hand, if it is determined in step 216 that it is impossible to cope with within the system 100, the procedure proceeds to step 224.
[0035]
In step 210, the voice recognition unit 110 performs voice recognition on the recorded voice signal in the voice signal database 108.
[0036]
In step 212, the speech recognition unit 110 determines whether speech recognition has been successful, that is, whether a recognition result has been obtained. Speech recognition is determined to have succeeded when the speech sections have been cut out successfully, a character string has been derived for the voice data of each cut-out section, and the recognition score representing the certainty or reliability of the message (word, phrase, sentence, etc.) represented by the derived character string is higher than a predetermined threshold. In that case, the procedure proceeds to step 220. If it is determined that the recognition has not succeeded, that is, has failed, the procedure proceeds to step 214.
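The success test of step 212 can be illustrated with a minimal sketch. The class name `RecognitionResult`, the function name `recognition_succeeded`, and the numeric values in the usage note are hypothetical and are not taken from the embodiment; the sketch only assumes that a result consists of a derived character string and a recognition score.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RecognitionResult:
    """Hypothetical container for the output of the speech recognition unit 110."""
    text: Optional[str]  # derived character string, or None if none was derived
    score: float         # recognition score (certainty of the message)


def recognition_succeeded(result: Optional[RecognitionResult],
                          score_threshold: float) -> bool:
    """Return True only if a result exists and its score exceeds the threshold."""
    if result is None or result.text is None:
        # No recognition result, e.g. the speech sections could not be cut out.
        return False
    return result.score > score_threshold
```

For instance, under an assumed threshold of 60, a result with score 72 would lead to step 220, while a result with score 55, or no result at all, would lead to step 214.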
[0037]
In step 220, the dialogue management unit 116 extracts or estimates the user's intention from the message obtained as a result of the speech recognition, determines a response method to the user according to the result, and, if necessary, requests the task processing unit 118 to process a task. After step 220, the procedure proceeds to step 226. In step 222, the task processing unit 118 processes the task according to the request represented by the message. Thereafter, the procedure proceeds to step 226.
[0038]
In step 214, the recognition failure cause determination unit 112 further analyzes the cause of the voice recognition failure in response to the request from the voice recognition unit 110. The recognition failure cause determination unit 112 determines whether or not the failure can be dealt with by adjusting parameters at the time of recognition. When it is determined that the problem can be dealt with by adjusting the parameters, the recognition failure cause determination unit 112 requests the recognition parameter setting unit 114 to adjust the recognition parameters. In step 218, the recognition parameter setting unit 114 adjusts the recognition parameter, that is, changes the value of the recognition parameter. If the recognition failure cause determination unit 112 determines that the countermeasure in the system 100 is impossible, the procedure proceeds to step 224.
[0039]
In step 224, the user environment estimation unit 120 estimates the acoustic environment of the user and extracts information for determining a response method or response means for the user.
[0040]
In step 226, the response method determination unit 122 determines the response method or response means in consideration of the acoustic environment of the user. In step 228, the response generation unit 124 generates a response signal corresponding to the response method. In step 230, the output signal transmission unit 126 transmits a response signal to the user terminal by the determined response method or response means, for example, a voice message or electronic mail. As one form of the response method, communication with the user terminal may be disconnected depending on circumstances.
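The order of the steps of FIG. 2 described above can be traced with the following sketch, which returns the step numbers visited for one input signal. It is an illustration under stated assumptions, not the embodiment itself: the function name `run_dialogue_turn` and all of its parameters are hypothetical, the decisions of the units 104, 110 and 112 are replaced by plain booleans, background noise that is not judged to be a failure cause is assumed to continue to recording in step 208, and step 222 is assumed to run between steps 220 and 226 when a task is requested.

```python
def run_dialogue_turn(noise_level, noise_threshold, noise_is_cause,
                      noise_adjustable, recognition_outcomes,
                      failure_adjustable, task_requested=True):
    """Return the list of FIG. 2 step numbers visited for one input signal.

    recognition_outcomes holds one boolean per recognition attempt
    (True = success); its length bounds the number of attempts.
    """
    outcomes = list(recognition_outcomes)
    if not outcomes:
        raise ValueError("at least one recognition attempt is required")

    respond = [224, 226, 228, 230]  # estimate environment, then respond
    steps = [202, 204, 206]

    # Background noise check before the user utterance.
    if noise_level > noise_threshold:
        steps.append(214)
        if noise_is_cause:
            steps.append(216)
            if not noise_adjustable:
                return steps + respond
            steps.append(218)  # adjust recognition parameters

    steps.append(208)  # store the audio signal in the database

    for attempt, ok in enumerate(outcomes):
        steps += [210, 212]
        if ok:
            steps.append(220)
            if task_requested:
                steps.append(222)
            return steps + [226, 228, 230]
        steps.append(214)  # analyze the cause of the failure
        last_attempt = attempt == len(outcomes) - 1
        if not failure_adjustable or last_attempt:
            return steps + respond
        steps.append(218)  # adjust parameters and recognize again

    return steps
```

A quiet input recognized at the first attempt visits steps 202 to 212 and then 220 to 230, whereas loud noise that cannot be handled by parameter adjustment goes from step 216 directly to step 224.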
[0041]
FIG. 3 shows a configuration of a voice portal 300 including the interactive system 100 of FIG. 1 connected to the Internet according to an embodiment of the present invention. The voice portal 300 includes the interactive system 100 of FIG. 1, a user authentication unit 310, a user information database 320, and a recognition failure cause handling history database 330.
[0042]
The user authentication unit 310 authenticates the user by means of a user ID and password, or by voice authentication of the user's voice, received as an input signal from the user. The user ID and password, and the user information necessary for voice authentication, are registered in advance in the user information database 320 when the system 100 is used for the first time. When the user authentication succeeds, each of the units 102 to 126 in FIG. 1 refers to the user information database 320 in each step in FIG. 2 and executes its processing in consideration of user characteristics such as “always uses the system from a noisy place” and “always speaks loudly”.
[0043]
The user information database 320 stores information related to the user's use of the system 100. In addition to user identification information such as a user ID and password, it stores information on the user's usage history and the acoustic environment at the time of use. The user information database 320 is accessed by the voice recognition unit 110, the recognition failure cause determination unit 112, the recognition parameter setting unit 114, the dialogue management unit 116, the task processing unit 118, the user environment estimation unit 120, the response method determination unit 122 and the response generation unit 124 of the dialogue system 100 in FIG. 1, which read (refer to) and write information. User information in the user information database 320 is deleted in response to a user request.
[0044]
The recognition failure cause handling history database 330 stores a history (log) of the countermeasures taken when the acoustic environment on the user side is judged to be poor, for example when “background noise is too large”, or when voice recognition fails. For example, it records information such as “when cutting out a speech section failed, re-recognition succeeded after the value of the cut-out parameter was changed from “A1” to “A2””. By using this database 330, failures can be efficiently avoided in the second and subsequent speech recognitions, and processing efficiency is improved when recognition fails. For example, the recognition parameter setting unit 114 preferentially adopts a combination of parameter values with which re-recognition succeeded in the past, which reduces the number of re-recognition trials and speeds up recognition.
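How such a history could be used to prefer previously successful parameter values is sketched below. The record layout (keys `parameter`, `new_value`, `succeeded`), the parameter name `cutout`, and the function name `preferred_parameter_values` are assumptions made for illustration; the embodiment does not specify a storage format for the database 330.

```python
from collections import Counter


def preferred_parameter_values(history, parameter):
    """Order candidate values by how often re-recognition succeeded with them.

    history is a list of dicts with the hypothetical keys
    "parameter", "new_value" and "succeeded".
    """
    counts = Counter(
        entry["new_value"]
        for entry in history
        if entry["parameter"] == parameter and entry["succeeded"]
    )
    # most_common() sorts by descending success count.
    return [value for value, _ in counts.most_common()]


# Hypothetical log entries in the style of the "A1" to "A2" example.
history = [
    {"parameter": "cutout", "new_value": "A2", "succeeded": True},
    {"parameter": "cutout", "new_value": "A3", "succeeded": True},
    {"parameter": "cutout", "new_value": "A2", "succeeded": True},
    {"parameter": "cutout", "new_value": "A4", "succeeded": False},
]
candidates = preferred_parameter_values(history, "cutout")
```

Trying the candidates in this order is one simple way to reduce the number of re-recognition trials; values that never led to a successful re-recognition are not proposed at all.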
[0045]
FIG. 4 shows the configuration of the recognition failure cause determination unit 112. The recognition failure cause determination unit 112 includes a speech section detection unit 402, an S/N ratio detection unit 404, a speech speed detection unit 406, a recognition failure cause coping method determination unit 408, and a recognition failure cause determination information database 410.
[0046]
Table 1 exemplifies recognition failure cause determination information in the recognition failure cause determination information database 410.
[Table 1]

[0047]
In Table 1, as the recognition failure cause determination information, determination items (factors), an error threshold for each determination item, and a countermeasure for the cause are described.
[0048]
In Table 1, “the shortest speech section” represents the shortest section that is cut out from the input signal as a speech signal section. Generally, since a single noise has a short duration, the cutting out of noise can be reduced by setting the shortest speech section longer. However, if the shortest speech section is too long, a short word such as “2” cannot be cut out. The recognition parameter therefore needs to be adjusted.
[0049]
“Recognition score” represents the certainty or reliability of the message of a character string obtained by speech recognition. The calculation method (scale) of the recognition score may vary depending on the processing method. In speech recognition, the candidate that has the highest recognition score among several solution candidates, and whose score is equal to or greater than a threshold value, is output as the recognition result. Conversely, even if a message indicating the recognition result is obtained, if the recognition score is lower than the threshold value, the reliability is judged to be low and the result is rejected. When the recognition rate is low, lowering the threshold of the recognition score may make it possible to extract, as a correct answer, a message that was rejected because of a low score despite being correct. However, if the threshold is too low, inappropriate results are also accepted. The recognition parameter value therefore needs to be adjusted.
[0050]
The “S/N ratio” is the power ratio between the audio signal and the noise signal. Even if the noise is large, the S/N ratio is high if the audio signal is larger still. In general, the recognition rate improves as the S/N ratio increases. When the S/N ratio is smaller than the threshold value, the problem can be dealt with by having the user re-input the voice in a place with less noise, or re-input the voice in a louder voice.
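As a hedged illustration of this determination item, the sketch below expresses the power ratio in decibels using the standard definition 10·log10(signal power / noise power) and compares it with a threshold. The function names `snr_db` and `needs_reinput` are hypothetical, and the threshold is passed in by the caller because the actual value in Table 1 is not reproduced here.

```python
import math


def snr_db(signal_power: float, noise_power: float) -> float:
    """S/N ratio in decibels from signal and noise power (linear units)."""
    if signal_power <= 0 or noise_power <= 0:
        raise ValueError("powers must be positive")
    return 10.0 * math.log10(signal_power / noise_power)


def needs_reinput(signal_power: float, noise_power: float,
                  threshold_db: float) -> bool:
    """True if the S/N ratio is below the threshold, so that the user should
    re-input in a place with less noise or in a louder voice."""
    return snr_db(signal_power, noise_power) < threshold_db
```

Because only the ratio matters, doubling both the speech power and the noise power leaves the result unchanged, which matches the remark that loud noise is tolerable when the speech is louder still.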
[0051]
“Speaking speed” represents the speed at which the user speaks. Generally, the speed is expressed as the number of morae per second (approximately the number of syllables per second). That is, the faster the speech, the more morae are uttered per unit time. In general, the recognition rate decreases when the speaking speed is too fast, so if the detected speaking speed is larger than the threshold value in Table 1, the user is instructed to speak more slowly. On the other hand, the recognition rate also decreases when the speaking speed is too slow, so if the detected speaking speed is smaller than another threshold value, the user is instructed to speak more quickly.
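The two-sided check on the speaking speed can be sketched as follows. The function name `speech_rate_advice` and the threshold values used in the usage note are hypothetical; the thresholds of Table 1 are not assumed.

```python
from typing import Optional


def speech_rate_advice(mora_count: int, duration_s: float,
                       fast_threshold: float,
                       slow_threshold: float) -> Optional[str]:
    """Return an instruction for the user, or None if the rate is acceptable.

    The rate is measured in morae per second.
    """
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    rate = mora_count / duration_s
    if rate > fast_threshold:
        return "speak more slowly"
    if rate < slow_threshold:
        return "speak more quickly"
    return None
```

With assumed thresholds of 10 and 4 morae per second, 24 morae uttered in 2 seconds (12 morae per second) would trigger the instruction to speak more slowly, and 6 morae in 2 seconds (3 morae per second) the instruction to speak more quickly.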
[0052]
Table 1 may also include other recognition failure cause determination information. The contents are shown in the form of a table, but may be in another form, for example, in text form.
[0053]
As an example of processing by the recognition failure cause determination unit 112, processing for a portion before the speech section in the input signal and processing for the speech signal portion in the input signal are described.
[0054]
Processing of the input signal before the speech segment
In FIG. 2, when the procedure proceeds from step 206 (YES) to step 214, in step 214 the recognition failure cause determination unit 112 in FIG. 4 determines, in response to the request from the acoustic analysis unit 104, whether the background noise of the input signal causes a recognition failure, based on the noise level and the recognition failure cause determination information. When it is determined that the background noise causes the recognition failure, the recognition failure cause determination unit 112 also refers to the information in the user information database 320 and the recognition failure cause handling history database 330 to decide how to cope with the failure cause. This process is described in more detail next.
[0055]
In the input signal, a signal before the user utters, that is, a signal of a portion not uttered by the user represents background noise. The recognition failure cause determination unit 112 estimates whether or not the background noise causes “speech recognition failure” by the following procedure.
[0056]
First, as preprocessing, a background noise portion is extracted from the input signal by the acoustic analysis unit 104 (FIG. 1), and an acoustic feature of the background noise, for example, a noise level is detected.
[0057]
Step 1: The recognition failure cause determination unit 112 receives, from the acoustic analysis unit 104, the background noise extracted in the preprocessing and the information on the acoustic characteristics of the background noise, together with a request to determine whether the background noise is a “cause of speech recognition failure”.
[0058]
Step 2: In response to the determination request, the recognition failure cause coping method determination unit 408 of the recognition failure cause determination unit 112 determines, from the information on the acoustic characteristics of the background noise and the recognition failure cause determination information (see Table 1), whether the background noise causes “speech recognition failure”. For example, suppose the recognition failure cause determination information states that “an S/N ratio (signal-to-noise ratio) of 10 dB or less may cause misrecognition”. If the S/N ratio estimated from the actual background noise level is 10 dB or less, the recognition failure cause coping method determination unit 408 determines that “the background noise of the input signal causes speech recognition failure”. (Because the background noise section contains no speech signal, the S/N ratio here is an estimate that assumes a typical value for the speech signal level S.) Furthermore, if there is user information or a recognition failure cause handling history such as “this user is Mr. X” and “Mr. X always uses the system in an environment where the estimated S/N ratio is 10 dB, and the speech recognition rate is 98% or more”, the recognition failure cause coping method determination unit 408 also refers to that information as a determination item. In such a case, even if the estimated S/N ratio is 10 dB, speech recognition is not affected in Mr. X's case, so the recognition failure cause coping method determination unit 408 determines that “the background noise of the input signal does not cause speech recognition failure”. Thus, since the processing method can be changed according to the user, speech processing specialized for the user can be performed.
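The determination of Step 2, including the user-specific exception, might be sketched as below. Everything except the 10 dB example is an assumption made for illustration: the function name `noise_causes_failure`, the representation of levels in dB (so that the estimated S/N ratio is a simple difference), and the single value `user_ok_snr_db` standing in for the user information and handling history.

```python
from typing import Optional


def noise_causes_failure(noise_level_db: float,
                         assumed_speech_level_db: float,
                         snr_threshold_db: float = 10.0,
                         user_ok_snr_db: Optional[float] = None) -> bool:
    """Decide whether background noise is treated as a cause of failure.

    assumed_speech_level_db is a typical speech level, because the
    background noise section contains no speech to measure.
    user_ok_snr_db, if given, is the lowest estimated S/N ratio at which
    this user is known (hypothetically, from the user information) to be
    recognized well.
    """
    estimated_snr_db = assumed_speech_level_db - noise_level_db
    if estimated_snr_db > snr_threshold_db:
        return False  # noise is low enough for every user
    if user_ok_snr_db is not None and estimated_snr_db >= user_ok_snr_db:
        return False  # this user is recognized well even at this S/N ratio
    return True
```

For a user without history, an estimated S/N ratio of exactly 10 dB is treated as a failure cause, whereas for a user known to be recognized well at 10 dB the same input is allowed to proceed to recognition.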
[0059]
Step 3: When the recognition failure cause coping method determination unit 408 determines that the background noise is a “cause of speech recognition failure”, it requests the user environment estimation unit 120 to estimate the user's acoustic environment before notifying the user of that fact.
[0060]
Processing of the speech signal portion of the input signal
In FIG. 2, when the procedure proceeds from step 212 (NO) to step 214, the recognition failure cause determination unit 112 in FIG. 4 detects, in response to the request from the speech recognition unit 110, the speech sections, the S/N ratio and the speech speed of the input speech data. Thereby, the values of features of the speech signal such as the shortest speech segment length, the longest speech segment length, the noise level and the speech speed are derived or measured. Next, the recognition failure cause determination unit 112 collates these derived feature values with the recognition failure cause determination information in the database 410 to determine a countermeasure for the recognition failure. This process is described in more detail next.
[0061]
Step 1: When the speech recognition unit 110 (FIG. 1) does not obtain a recognition result, the recognition failure cause coping method determination unit 408 of the recognition failure cause determination unit 112 receives a request from the speech recognition unit 110 to determine the cause of failure. Even when no recognition result is obtained, if data from the recognition process exists, for example the cut-out information or the recognition score at each stage, the recognition failure cause coping method determination unit 408 also receives that information from the speech recognition unit 110.
[0062]
Step 2: In response to the determination request, the speech segment detection unit 402, the S/N ratio detection unit 404, and the speech rate detection unit 406 of the recognition failure cause determination unit 112 detect or measure the speech segments (phonemes, syllables) of the input speech data, the S/N ratio, and the speech speed. The recognition failure cause coping method determination unit 408 also uses, as necessary, the above-described data of the speech recognition process from the speech recognition unit 110.
[0063]
Step 3: The recognition failure cause coping method determination unit 408 determines the recognition failure cause from the detection results of Step 2 and the recognition failure cause determination information. For example, suppose the recognition failure cause determination information states that “the shortest speech segment at the time of speech segment cut-out is 50 ms”, and most of the actually cut-out segments with a length of about 50 ms are estimated to be noise. In that case, the recognition failure cause coping method determination unit 408 determines that “there may be a cut-out error” or “a lot of noise has been cut out”, and that “if the shortest speech segment is set to 100 ms, noise with a segment length of about 50 ms will not be cut out”, that is, “the shortest segment length at the time of recognition should be set to 100 ms and recognition should be performed again”. When the detected S/N ratio is smaller than the threshold value described in the recognition failure cause determination information, the recognition failure cause coping method determination unit 408 determines that “the noise is critical” and that “re-input in a place with little noise, or re-input in a loud voice, is necessary”. When it is determined that “the speech speed is too fast”, the recognition failure cause coping method determination unit 408 determines that “re-input with slow utterance is necessary”. In making these determinations, the unit refers, as described above, to the user information and the recognition failure cause handling history, and omits the above detection or changes the countermeasure, for example as follows.
[0064]
-If the parameter has already been adjusted and re-recognition processing has been performed and recognition has failed again, the above-described detection once performed is omitted, and the parameters are further adjusted to perform re-recognition processing.
[0065]
-After performing re-recognition with parameter adjustment multiple times (for example, 3 times), discontinue countermeasures by parameter adjustment.
[0066]
-If there is information that "Mr. X always speaks quickly and this is critical", give priority to dealing with speech speed.
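The three rules listed above can be combined into a small decision function, sketched here under explicit assumptions. The function name `plan_next_action`, the action labels, the trait label `speaks_fast`, and the order in which the rules are checked are all hypothetical; only the example limit of 3 retries comes from the text.

```python
MAX_PARAMETER_RETRIES = 3  # "for example, 3 times" in the text


def plan_next_action(retries_done: int,
                     user_traits: frozenset = frozenset()) -> str:
    """Choose the next action after a recognition failure.

    retries_done counts re-recognitions already performed with adjusted
    parameters; user_traits stands in for the user information database.
    """
    if retries_done >= MAX_PARAMETER_RETRIES:
        # Repeated adjustment did not help: stop adjusting parameters.
        return "stop_parameter_adjustment"
    if retries_done > 0:
        # Detection was already performed once: adjust further without it.
        return "adjust_parameters_again_skip_detection"
    if "speaks_fast" in user_traits:
        # Known fast speaker: deal with the speech speed first.
        return "handle_speech_rate_first"
    return "detect_features"
```

Checking the retry limit first ensures that the loop of parameter adjustment always terminates, after which the failure is handled by notifying the user.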
[0067]
Step 4: If the failure cannot be dealt with by the parameter adjustment of Step 3, for example when notification to the user is necessary, the recognition failure cause coping method determination unit 408 first causes the user environment estimation unit 120 to estimate the acoustic environment of the user. If the failure can be dealt with by adjusting a parameter, it causes the recognition parameter setting unit 114 to adjust the parameter.
[0068]
More specifically, when it is determined in Step 3 that the failure can be dealt with by adjusting a parameter, the recognition failure cause coping method determination unit 408 requests the recognition parameter setting unit 114 to adjust the parameter. For example, when the recognition failure cause coping method determination unit 408 estimates, based on the recognition failure cause determination information in Table 1, that the shortest speech segment is set too short and speech segment cut-out errors occur, it determines that the failure can be dealt with by adjusting the cut-out parameter. Taking into account user information and recognition failure cause handling history such as “this user has a high recognition rate when the parameter value is A1” or “in the past history, the recognition rate was high when the parameter value was A2”, it requests the recognition parameter setting unit 114 to adjust the parameter and instructs the voice recognition unit 110 to recognize the voice again with the adjusted parameter. Thus, since the processing method can be changed according to the user, voice processing specialized for the user can be performed.
[0069]
On the other hand, when it is determined that the speech recognition failure cannot be dealt with within the system 100, for example when the above-described re-recognition does not succeed even after being repeated a predetermined number of times while changing the parameter value, the recognition failure cause coping method determination unit 408 requests the user environment estimation unit 120 to estimate the acoustic environment of the user before notifying the user of the cause of the failure and the coping method for the failure.
[0070]
FIG. 5 shows the configuration of the recognition parameter setting unit 114. The recognition parameter setting unit 114 includes a parameter determination unit 502 and a recognition execution command generation unit 504. When the parameter determination unit 502 receives a parameter adjustment request from the recognition failure cause determination unit 112 in FIG. 4, it first refers to the user information in the database 320 and the recognition failure cause handling history in the database 330 and determines the parameter value for re-recognition. Next, the recognition execution command generation unit 504 creates a voice recognition execution command reflecting the determined parameter value and, as necessary, requests the voice recognition unit 110 to perform re-recognition processing.
[0071]
FIG. 6 shows an example flowchart of processing by the recognition parameter setting unit 114.
[0072]
In step 602, the parameter determination unit 502 receives a recognition parameter setting request from the recognition failure cause determination unit 112, together with the information determined by that unit, for example, “make the setting of the shortest speech interval longer than 50 ms” or “lower the threshold value of the recognition score from 60”.
[0073]
In step 604, the parameter determination unit 502 determines whether there is user information and a recognition failure cause handling history. If it is determined that neither exists, in step 608 the parameter determination unit 502 determines how much the parameter value is to be changed. Basically, the value is changed by a fixed amount determined empirically (experimentally). For example, the parameter value is changed based on information such as “change the value by 10 ms for the shortest speech interval” or “change the threshold value of the recognition score by 5”. Thereafter, the procedure proceeds to step 610.
[0074]
If it is determined in step 604 that there is user information and a recognition failure cause handling history, the parameter value is determined in step 606 with reference to them. For example, if there is user information or a recognition failure cause handling history such as “in Mr. X's past 10 uses, the recognition rate is highest when the recognition score threshold is set to 50” or “the setting of the shortest speech interval has already been changed and is currently set to 60 ms”, the parameter value is determined with reference to that information. For example, if there is information that “for Mr. X the recognition score threshold should be set to 50” as described above, the threshold is set to 50 immediately rather than being lowered from 60 in two steps of 5, which makes the processing more efficient. Thereafter, the procedure proceeds to step 610.
[0075]
In step 610, the recognition execution command generation unit 504 generates a recognition execution command reflecting the new parameter value based on the result of step 606 or 608, and in step 612 it requests the voice recognition unit 110 to re-recognize the voice.
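The flow of steps 602 to 612 can be illustrated with a short sketch. The names `AdjustRequest`, `determine_parameter`, and `build_recognition_command`, the parameter keys, and the dictionary shape of the command are all hypothetical. The step sizes of 10 ms and 5 are the example values quoted in the description; their signs are an assumption taken from the two example requests (lengthen the shortest speech interval, lower the score threshold).

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Fixed step sizes from the examples in the description. The signs are an
# assumption: the interval is lengthened and the score threshold is lowered.
DEFAULT_STEP: Dict[str, float] = {"min_speech_ms": 10, "score_threshold": -5}


@dataclass
class AdjustRequest:
    parameter: str   # hypothetical key, e.g. "score_threshold"
    current: float   # value in use when recognition failed


def determine_parameter(
    request: AdjustRequest,
    preferred: Optional[Dict[str, float]] = None,
) -> float:
    """Steps 604 to 608: choose the parameter value for re-recognition."""
    if preferred and request.parameter in preferred:
        # Step 606: user information or coping history names a value directly.
        return preferred[request.parameter]
    # Step 608: no such information, so change by the fixed amount.
    return request.current + DEFAULT_STEP[request.parameter]


def build_recognition_command(request: AdjustRequest, value: float) -> dict:
    """Step 610: a re-recognition command carrying the new parameter value."""
    return {"action": "re-recognize", "parameters": {request.parameter: value}}
```

With a stored preference of 50 for Mr. X, a single call yields the threshold 50 directly, whereas without it the threshold moves from 60 to 55 and a second failure would be needed to reach 50.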
[0076]
FIG. 7 shows the configuration of the user environment estimation unit 120. The user environment estimation unit 120 includes a noise stationarity analysis unit 702 and an environment estimation unit 704.
[0077]
The noise stationarity analysis unit 702 analyzes the stationarity of the noise in the input voice data in response to a user environment estimation request from the recognition failure cause determination unit 112.
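The specification does not state how noise stationarity is analyzed. One common approach, shown here purely as an assumed example, is to split the noise segment into frames, compute the energy of each frame in decibels, and treat the noise as stationary when the spread of those energies is small. The function names, the frame length of 160 samples, and the 3 dB limit are illustrative choices, not values from the source.

```python
import math
from statistics import pstdev
from typing import List, Sequence


def frame_energies_db(samples: Sequence[float], frame_len: int) -> List[float]:
    """Mean power of each non-overlapping frame, in decibels."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(x * x for x in frame) / frame_len
        energies.append(10.0 * math.log10(power + 1e-12))  # guard against log(0)
    return energies


def is_stationary(
    noise: Sequence[float],
    frame_len: int = 160,       # assumed frame length in samples
    max_std_db: float = 3.0,    # assumed limit on frame-energy spread
) -> bool:
    """True when frame energies vary by no more than max_std_db (std. dev.)."""
    energies = frame_energies_db(noise, frame_len)
    return len(energies) >= 2 and pstdev(energies) <= max_std_db
```

A steady hum gives nearly identical frame energies and is classified as stationary, while intermittent bursts produce a large spread and are not.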
[0078]
The environment estimation unit 704 estimates the acoustic environment by referring to, for example, features of the audio signal such as the S/N ratio detected by the recognition failure cause determination unit 112, and user information such as “Mr. X uses the system under a certain noise environment”. It judges, for example, “whether the environment allows a conversation conducted mainly by voice”, “how loud the sound must be for the user to hear it”, or “whether use of a noise canceller (noise removal device or tool) is effective”. For example, when the noise is relatively stationary, the S/N ratio is 10 dB, and there is user information such as “Mr. X always uses the system from the workplace”, the estimation is performed as follows.
[0079]
- An S/N ratio of 10 dB indicates that the background noise may be large, which means that voice dialogue is somewhat difficult.
- However, if the voice response is output at the maximum volume, the user can hear it. Since the noise is relatively stationary, it is effective to use a noise canceller.
- If a noise canceller that matches the acoustic environment of Mr. X's workplace is installed on Mr. X's mobile terminal, voice recognition becomes easier from the next time onward.
[0080]
The environment estimation unit 704 determines, for example, “maximize the volume of the audio output on the system side and perform a conversation mainly using voice” according to the estimated acoustic environment of the user. The environment estimation unit 704 further determines, for example, “send a noise canceller that matches Mr. X's work environment to Mr. X's mobile phone”.
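The reasoning in the example above can be written as a small rule set. This is a sketch under stated assumptions: the specification gives only the single data point of 10 dB with stationary noise, so the thresholds `min_dialogue_snr_db` and `quiet_snr_db`, the class `EnvironmentEstimate`, and the function `estimate_environment` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EnvironmentEstimate:
    voice_dialogue_possible: bool
    output_volume: str           # "normal" or "maximum"
    send_noise_canceller: bool


def estimate_environment(
    snr_db: float,
    noise_is_stationary: bool,
    known_location: bool,              # e.g. "always uses the system from the workplace"
    min_dialogue_snr_db: float = 5.0,  # assumed: below this, voice dialogue is impossible
    quiet_snr_db: float = 20.0,        # assumed: at or above this, noise is negligible
) -> EnvironmentEstimate:
    """Rule-based estimate of the user's acoustic environment."""
    voice_ok = snr_db >= min_dialogue_snr_db
    volume = "normal" if snr_db >= quiet_snr_db else "maximum"
    # A canceller matched to the location only helps when the noise is
    # noticeable, steady, and the user keeps returning to the same place.
    send_canceller = snr_db < quiet_snr_db and noise_is_stationary and known_location
    return EnvironmentEstimate(voice_ok, volume, send_canceller)
```

For the case in the description (10 dB, stationary noise, known workplace), these rules give a voice dialogue at maximum volume plus delivery of a noise canceller.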
[0081]
In order to acquire user information such as “Mr. X always uses the system from the same workplace”, the user may be instructed, for example at the first use of the system 100 or when the user's usage environment changes, to register the state of the surrounding acoustic environment using the voice input terminal. At that time, the background noise in that situation is recorded and analyzed acoustically in advance by the acoustic analysis unit 104, so that when the same noise component is detected by the acoustic analysis unit 104 on a later occasion, the environment estimation unit 704 can estimate that “the system is being used in the same environment”.
[0082]
FIG. 8 shows a flowchart of processing by the response method determination unit 122. The response method determination unit 122 determines a response method or response means based on the result of the user environment estimation described above.
[0083]
In step 802, the response method determination unit 122 determines whether the voice recognition has succeeded. The speech recognition is determined to be successful when a recognition result is obtained, that is, when the speech section is successfully cut out, a character string for the speech in the cut-out section is derived, and the recognition score is higher than a predetermined threshold. In that case, in step 808, the response method determination unit 122 determines that the user is in an acoustic environment in which any noise present does not affect voice recognition, and decides to continue the dialogue.
[0084]
On the other hand, if it is determined in step 802 that the recognition has not succeeded, that is, if it is determined that background noise causes the recognition failure or if no recognition result can be obtained in speech recognition, then in step 804 the response method determination unit 122 acquires the estimation result of the user's acoustic environment from the user environment estimation unit 120.
[0085]
In step 806, the response method determination unit 122 determines whether a voice dialogue is possible. If it is determined that a voice dialogue is possible, in step 810 the response method determination unit 122 decides to continue the dialogue by presenting the fact of the recognition failure, its cause, and how to deal with it to the user, using the normal response method or response means with voice as the main medium.
[0086]
If it is determined in step 806 that voice dialogue is impossible, in step 812 the response method determination unit 122 decides to notify the user of the end of the dialogue by voice, to disconnect communication with the user terminal and thereby end the dialogue unilaterally, and to present the fact of the recognition failure, its cause, and how to deal with it to the user by a response method or response means that uses text or image signals sent by e-mail as the medium. As described above, when the voice signal cannot be detected, or when voice recognition fails and voice dialogue is difficult, communication between the user terminal and the system 100 is disconnected, so that wasteful dialogue or communication can be avoided. In addition, since the user can be notified of the end of the dialogue by a response method suited to the user's environment before communication is disconnected, discomfort to the user can be minimized. At that time, the background noise may be detected based on the S/N ratio of the input signal, and the volume of the output signal may be increased according to the detected value so that the message reaches the user even in a noisy environment.
[0087]
In step 810, or before the communication is disconnected in step 812, the response method determination unit 122 may further decide, as necessary, to transmit to the user terminal a tool needed for coping with the recognition failure, for example, a noise canceller program for coping with noise, and to notify the user that a tool for coping with the cause of the voice recognition failure will be sent. In this way, tools necessary for coping with the cause of the voice recognition failure are created or prepared as necessary and sent to the user terminal, so that coping with the failure is supported and the burden on the user is reduced.
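The branching of FIG. 8 (steps 802 to 812) reduces to two decisions, which can be sketched as below. The enum `ResponseMethod` and the function `determine_response_method` are names chosen for this illustration only; the optional tool transmission described in the preceding paragraph is left out for brevity.

```python
from enum import Enum


class ResponseMethod(Enum):
    # Step 808: recognition succeeded, carry on as usual.
    CONTINUE_DIALOGUE = "continue the dialogue"
    # Step 810: voice still works, so report the failure, cause, and remedy by voice.
    VOICE_WITH_FAILURE_NOTICE = "voice response presenting failure, cause, and remedy"
    # Step 812: announce the end by voice, disconnect, and follow up by e-mail.
    END_AND_FOLLOW_UP = "end dialogue, disconnect, send text or image by e-mail"


def determine_response_method(
    recognition_succeeded: bool,
    voice_dialogue_possible: bool,
) -> ResponseMethod:
    """Select the response method along the lines of steps 802 to 812."""
    if recognition_succeeded:                         # step 802
        return ResponseMethod.CONTINUE_DIALOGUE
    if voice_dialogue_possible:                       # step 806
        return ResponseMethod.VOICE_WITH_FAILURE_NOTICE
    return ResponseMethod.END_AND_FOLLOW_UP
```

In a fuller version, `voice_dialogue_possible` would come from the acoustic environment estimate acquired in step 804.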
[0088]
FIG. 9 shows the configuration of the response generation unit 124. The response generation unit 124 includes a communication disconnection signal generation unit 902, a program generation unit 904, a program database 906, a response generation management unit 908, an e-mail generation unit 910, a response sentence generation unit 912, a voice signal generation unit (voice synthesis unit) 914, and an image signal generation unit 916.
[0089]
The response generation management unit 908 determines, according to the response method or response means determined by the response method determination unit 122, the response content (for example, a response sentence or a program), the response signal (for example, a voice signal, an e-mail, or a communication disconnection signal), and their generation timing and output timing, and requests the generation units 902 to 916 to generate them.
[0090]
The communication disconnection signal generation unit 902 generates a signal for disconnecting communication with the user terminal. The program generation unit 904 newly generates a coping program (for example, a noise canceller) for the cause of the recognition failure to be used in the user terminal, or selects one from a group of tools stored in advance in the program database 906, and prepares it as a system response. The response sentence generation unit 912 generates response messages to the user, such as a general response to a user utterance or a notification when recognition fails. At this time, user information and information on the recognition failure handling history are also used. The e-mail generation unit 910 generates an e-mail incorporating the response content generated by the response sentence generation unit 912 or the program generation unit 904. The voice signal generation unit 914 converts the response content generated by the response sentence generation unit 912 into a voice signal. At that time, the voice signal generation unit 914 adjusts the volume of the voice signal with reference to the acoustic environment of the user. The image signal generation unit 916 converts the response content generated by the response sentence generation unit 912 into an image signal.
[0091]
The output signal transmission unit 126 transmits the various response signals generated by the response generation units 902, 910, 914, and 916 of FIG. 9 to the user terminal. That is, it has a function of disconnecting communication in response to detection of the communication disconnection signal, a function of transmitting an e-mail, a function of transmitting a voice signal, and a function of transmitting an image signal.
[0092]
As described above, according to the embodiment of the present invention, when the acoustic environment of the user is determined to be poor from the background noise before the user utters, or when speech recognition fails, the cause is identified, and the cause and a countermeasure for it can be transmitted to the user terminal, so that the recognition failure can be dealt with appropriately. Further, when the user terminal is notified of the cause and the countermeasure, they can be reliably conveyed to the user by a response method suited to the user's environment.
[0093]
The embodiments described above are given merely as typical examples, and modifications and variations thereof will be apparent to those skilled in the art. It is obvious that those skilled in the art can make various modifications to the above-described embodiments without departing from the principles of the present invention and the scope of the invention described in the claims.
[0094]
(Supplementary note 1) A voice dialogue system connectable from a terminal,
Voice recognition means for performing voice recognition on a voice signal from the terminal;
Coping method determining means for determining the cause when the result of the speech recognition is not obtained and determining a coping method for the determined cause;
Estimating means for estimating the acoustic environment of the terminal;
Response method determining means for determining a response method for the terminal according to the estimated acoustic environment;
Transmitting means for transmitting information representing the determined countermeasure to the terminal by the determined response method;
A spoken dialogue system characterized by comprising:
(Supplementary note 2) The spoken dialogue system according to supplementary note 1, further comprising storage means for storing identification information and usage history information of a user,
wherein the coping method determining means determines the coping method according to the identification information and usage history information of the user stored in the storage means.
(Supplementary note 3) The spoken dialogue system according to supplementary note 1 or 2, wherein the coping method determining means determines, as the countermeasure, that communication with the terminal is to be disconnected when it is determined that voice dialogue is impossible.
(Supplementary note 4) The spoken dialogue system according to any one of supplementary notes 1 to 3, further comprising countermeasure history storage means for storing the determined countermeasure,
wherein the voice recognition means performs voice recognition according to the history of the countermeasure stored in the countermeasure history storage means.
(Supplementary note 5) The spoken dialogue system according to any one of supplementary notes 1 to 4, wherein the response method determining means causes a voice signal, an e-mail, and/or an image signal representing the determined countermeasure to be transmitted to the terminal according to the estimated acoustic environment.
(Supplementary Note 6) Voice recognition means for executing voice recognition on a voice signal from a user;
Coping method determining means for determining the cause when the result of the speech recognition is not obtained and determining a coping method for the determined cause;
Estimating means for estimating the acoustic environment of the user;
Response method determining means for determining a response method for the user according to the estimated acoustic environment;
Notification means for notifying the user of information indicating the determined countermeasure by the determined response method;
A spoken dialogue system characterized by comprising:
(Supplementary note 7) A program for voice conversation used in an information processing apparatus,
Performing voice recognition on a voice signal from a terminal;
Determining the cause when the speech recognition result is not obtained;
Determining a countermeasure for the determined cause;
Estimating the acoustic environment of the terminal;
Determining a response method for the terminal according to the estimated acoustic environment;
Transmitting information representing the determined countermeasure to the terminal in the determined response method;
the program being operable to execute the above steps.
(Supplementary note 8) The program according to supplementary note 7, wherein the step of determining the countermeasure includes determining the countermeasure according to identification information and usage history information of a user stored in storage means.
(Supplementary note 9) The program according to supplementary note 7 or 8, further operable to execute a step of storing the determined countermeasure in countermeasure history storage means,
wherein the step of performing voice recognition includes performing voice recognition according to the history of the countermeasure stored in the countermeasure history storage means.
(Supplementary note 10) The program according to any one of supplementary notes 7 to 9, wherein the step of determining the response method includes causing a voice signal, an e-mail, and/or an image signal representing the determined countermeasure to be transmitted to the terminal according to the estimated acoustic environment.
(Supplementary Note 11) A voice dialogue method used in a voice dialogue system,
Performing voice recognition on a voice signal from a terminal;
Determining the cause when the speech recognition result is not obtained;
Determining a countermeasure for the determined cause;
Estimating the acoustic environment of the terminal;
Determining a response method for the terminal according to the estimated acoustic environment;
Transmitting information representing the determined countermeasure to the terminal in the determined response method;
the method comprising the above steps.
[0095]
[Effect of the invention]
According to the present invention, owing to the above-described features, a response method suited to the user's environment can be selected; when voice recognition fails, the user terminal can not only be informed of the cause but also be provided with means for coping with it; and speech recognition can be performed efficiently in consideration of the user's characteristics and past situations.
[Brief description of the drawings]
FIG. 1 shows a voice interaction system according to an embodiment of the present invention.
FIG. 2 shows a flow diagram of processing in the dialogue system.
FIG. 3 shows the configuration of a voice portal including the interactive system of FIG. 1 connected to the Internet, according to an embodiment of the present invention.
FIG. 4 shows a configuration of a recognition failure cause determination unit.
FIG. 5 shows a configuration of a recognition parameter setting unit.
FIG. 6 illustrates a flowchart of processing by the recognition parameter setting unit.
FIG. 7 shows a configuration of a user environment estimation unit.
FIG. 8 shows a flowchart of processing by a response method determination unit.
FIG. 9 shows a configuration of a response generation unit.
[Explanation of symbols]
100 Spoken dialogue system
102 Input signal receiving unit
104 Acoustic analysis unit
110 Voice recognition unit
112 Recognition failure cause determination unit
114 Recognition parameter setting unit
116 Dialogue management unit
118 Task processing unit
120 User environment estimation unit
122 Response method determination unit
124 Response generation unit
126 Output signal transmission unit

Claims

A voice interaction system that can be connected from a terminal,
Input signal receiving means for receiving an audio signal from the terminal ;
Acoustic analysis means for analyzing the audio signal, detecting a level of background noise before user utterance, and determining whether the level is greater than a predetermined threshold;
Audio signal recording means for storing the audio signal in an audio signal database;
Speech recognition means for performing speech recognition on the audio signal stored in the audio signal database using a recognition parameter, determining whether the speech recognition has succeeded based on whether a recognition result of the speech recognition is obtained, and, if the recognition result is obtained, obtaining a recognition score of the recognition result and determining whether the speech recognition has succeeded,
Recognition failure cause determination means for: when the level of the background noise is greater than the predetermined threshold, determining whether the level of the background noise causes failure of the speech recognition based on recognition failure cause information and pre-registered user information, and, when it is determined that the level of the background noise causes failure of the speech recognition, determining whether the background noise can be dealt with by adjusting the recognition parameter based on the recognition failure cause information and the user information; when the speech recognition fails, determining the cause of the failure based on the recognition failure cause information and the user information and determining whether the failure can be dealt with by adjusting the recognition parameter; and, when the background noise or the failure cannot be dealt with by adjusting the recognition parameter, determining a countermeasure to be taken by the user,
Recognition parameter setting means for adjusting the recognition parameter and requesting the speech recognition means to perform speech recognition when the recognition failure cause determination means determines that the background noise or the failure can be dealt with by adjusting the recognition parameter,
Estimation means for estimating an acoustic environment by referring to the features of the audio signal detected by the recognition failure cause determination means and the user information when the recognition parameter setting means determines that the background noise cannot be dealt with by adjusting the recognition parameter;
Response method determining means for determining a response method for the user of the terminal according to the estimated acoustic environment;
Transmitting means for transmitting information representing the countermeasure to the terminal by the determined response method;
the system comprising the above means,
and being characterized in that the estimation means, the response method determining means, and the transmitting means are operated when the recognition failure cause determination means determines that the background noise or the failure cannot be dealt with by adjusting the recognition parameter.

The spoken dialogue system according to claim 1, further comprising storage means for storing the user information including user identification information and usage history information,
wherein the recognition failure cause determination means determines the countermeasure according to the identification information and usage history information of the user stored in the storage means.

The spoken dialogue system according to claim 1 or 2, wherein the recognition failure cause determination means causes an audio signal, an e-mail, and/or an image signal representing the countermeasure to be transmitted to the terminal according to the estimated acoustic environment.

A program for voice conversation used in an information processing apparatus,
Receiving an audio signal from the terminal ;
Analyzing the audio signal, detecting a level of background noise before user utterance, and determining whether the level is greater than a predetermined threshold;
Storing the audio signal in an audio signal database;
A step of performing speech recognition on the audio signal stored in the audio signal database using a recognition parameter, determining whether the speech recognition has succeeded based on whether a recognition result of the speech recognition is obtained, and, if the recognition result is obtained, obtaining a recognition score of the recognition result and determining whether the speech recognition has succeeded,
A determining step of: when the level of the background noise is greater than the predetermined threshold, determining whether the level of the background noise causes failure of the speech recognition based on recognition failure cause information and pre-registered user information, and, when it is determined that the level of the background noise causes failure of the speech recognition, determining whether the background noise can be dealt with by adjusting the recognition parameter based on the recognition failure cause information and the user information; when the speech recognition fails, determining the cause of the failure based on the recognition failure cause information and the user information and determining whether the failure can be dealt with by adjusting the recognition parameter; and, when the background noise or the failure cannot be dealt with by adjusting the recognition parameter, determining a countermeasure to be taken by the user,
Adjusting the recognition parameter and requesting the speech recognition means to perform speech recognition when it is determined in the determining step that the background noise or the failure can be dealt with by adjusting the recognition parameter;
An estimation step of estimating an acoustic environment by referring to the features of the audio signal detected by the recognition failure cause determination means and the user information when the recognition parameter setting means determines that the background noise cannot be dealt with by adjusting the recognition parameter;
A response method determination step of determining a response method for the user of the terminal according to the estimated acoustic environment,
A transmission step of transmitting information representing the determined countermeasure to the terminal by the determined response method;
the program being operable to execute the above steps,
and being operable to execute the estimation step, the response method determination step, and the transmission step when it is determined in the determining step that the background noise or the failure cannot be dealt with by adjusting the recognition parameter.

A spoken dialogue method used in a spoken dialogue system,
Receiving an audio signal from the terminal ;
Analyzing the audio signal, detecting a level of background noise before user utterance, and determining whether the level is greater than a predetermined threshold;
Storing the audio signal in an audio signal database;
A step of performing speech recognition on the audio signal stored in the audio signal database using a recognition parameter, determining whether the speech recognition has succeeded based on whether a recognition result of the speech recognition is obtained, and, if the recognition result is obtained, obtaining a recognition score of the recognition result and determining whether the speech recognition has succeeded,
A determining step of: when the level of the background noise is greater than the predetermined threshold, determining whether the level of the background noise causes failure of the speech recognition based on recognition failure cause information and pre-registered user information, and, when it is determined that the level of the background noise causes failure of the speech recognition, determining whether the background noise can be dealt with by adjusting the recognition parameter based on the recognition failure cause information and the user information; when the speech recognition fails, determining the cause of the failure based on the recognition failure cause information and the user information and determining whether the failure can be dealt with by adjusting the recognition parameter; and, when the background noise or the failure cannot be dealt with by adjusting the recognition parameter, determining a countermeasure to be taken by the user,
Adjusting the recognition parameter and requesting the speech recognition means to perform speech recognition when it is determined in the determining step that the background noise or the failure can be dealt with by adjusting the recognition parameter;
An estimation step of estimating an acoustic environment by referring to the features of the audio signal detected by the recognition failure cause determination means and the user information when the recognition parameter setting means determines that the background noise cannot be dealt with by adjusting the recognition parameter;
A response method determination step of determining a response method for the user of the terminal according to the estimated acoustic environment,
A transmission step of transmitting information representing the determined countermeasure to the terminal by the determined response method;
the method comprising the above steps,
wherein the estimation step, the response method determination step, and the transmission step are performed when it is determined in the determining step that the background noise or the failure cannot be dealt with by adjusting the recognition parameter.