JP2004212533A

JP2004212533A - Voice command adaptive equipment operating device, voice command adaptive equipment, program, and recording medium

Info

Publication number: JP2004212533A
Application number: JP2002380493A
Authority: JP
Inventors: Junichi Takami; 淳一鷹見
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2002-12-27
Filing date: 2002-12-27
Publication date: 2004-07-29

Abstract

PROBLEM TO BE SOLVED: To provide a voice command-compatible device operating device capable of providing good voice command operability to an unspecified number of users.
SOLUTION: A parameter storage device 12 stores, for each user, the optimum values of the voice recognition environment parameters, which are parameters for improving recognition accuracy such as voice section extraction parameters and acoustic model parameters. A user is identified by a user identification device 11. An operation history storage device 16 stores each user's operation history of the voice command-compatible device. When a voice recognition error occurs, its cause is estimated from the operation history to identify which of that user's voice recognition environment parameters should be adjusted, and a parameter adjusting device 17 adjusts that parameter, so that the voice recognition environment parameters are optimized automatically.
COPYRIGHT: (C)2004,JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は、音声コマンド対応機器操作装置、音声コマンド対応機器、プログラム、及び記録媒体に関し、より詳細には、音声認識を利用して操作が可能なオフィス機器等の音声コマンド対応機器を操作するための音声コマンド対応機器操作装置、その装置を組み込んだ音声コマンド対応機器、その機器に組み込まれるプログラム、及び記録媒体に関する。
【０００２】
【従来の技術】
昨年６月に米国で施行されたリハビリテーション法修正５０８条の内容からも分かる通り、最近では身体に障害を持った者でも健常者と同等に操作することのできるオフィス機器の開発が要求されている。
【０００３】
特に、最近のオフィス機器のように、フラットな液晶タッチパネル上に場面に応じて表示される様々な情報や仮想ボタンなどを手がかりに、豊富な機能の選択やモードの設定などを実行するような操作系は、特に視覚障害者にとって大きな障壁となっている。
【０００４】
この障壁を取り除く有望な手段として、音声ガイダンス出力や音声コマンド入力を利用する音声インタフェースが考えられる。
【０００５】
音声コマンドによる機器操作では、パソコン上のソフトのように特定の個人が特定の環境で使用するものに関しては事前に利用者に数十〜数百単語のサンプル音声を発話させて音響モデルのパラメータをチューニングする話者適応と呼ばれる学習を行うことによって認識精度の向上を図る場合が多い。しかし、オフィス機器のように、多数の利用者が利用するものに関しては、サンプル音声の登録にかなりの時間を要する話者適応処理を実際の機器を前にして利用者毎に実行してもらうことは難しい。
【０００６】
そのため、一般的には多数話者の音声を認識することが可能な不特定話者向きの音響モデルが使用されるが、この場合、話者によっては極端に低い認識精度を示す話者が存在する可能性がある。このように、音声コマンド入力では、音声認識誤りを完全に無くすことは不可能であり、話者によっては極端に認識精度が低くなってしまう可能性もある。
【０００７】
その原因としては、
（１）発話の開始のタイミングが不適切であり、音声区間の始端がきちんと抽出されない、
（２）コマンド中にある程度長い息継ぎやポーズが混入し、音声区間の終端がきちんと抽出されない、
（３）話者の声質が、使用されている音響モデルのパラメータと合致していない、
などの原因が考えられる。
【０００８】
【発明が解決しようとする課題】
上述の原因（１），（２）は、音声認識の前処理として実施される音声区間切り出し処理において生ずる原因であるが、この音声区間切り出し処理においても話者の発話様式によって適切なパラメータは異なる。上述の原因（１），（２）はある程度の訓練で改善される可能性もあるが、なんらかの障害によって利用者側での対応が難しいケースも考えられる。したがって、音声区間抽出時のパラメータを適切に選ぶ必要があるが、このパラメータを変更（調整により改善）するような装置は現状には存在しない。また、上述の原因（３）に対しても、音響モデルのパラメータの調整による認識精度の改善が必須である。
【０００９】
本発明は、上述のごとき実情に鑑みてなされたものであり、音声区間抽出時のパラメータや音響モデルのパラメータなど、認識精度改善のためのパラメータ調整を、利用者毎に自動的に実行するようにすることによって、不特定多数の利用者に対して良好な音声コマンドによる操作性を提供することが可能な、音声コマンド対応機器操作装置、その装置を組み込んだ音声コマンド対応機器、その機器に組み込まれるプログラム、及びそのプログラムを記録したコンピュータ読み取り可能な記録媒体を提供することをその目的とする。
【００１０】
【課題を解決するための手段】
請求項１の発明は、音声で入力された様々なコマンドを認識する音声認識装置を有する音声コマンド対応機器を操作する音声コマンド対応機器操作装置であって、利用者を特定する利用者特定手段と、利用者毎に最適な音声認識環境パラメータを記憶するパラメータ記憶手段と、各利用者による前記音声コマンド対応機器の操作履歴を記憶する操作履歴記憶手段と、操作履歴に応じてその利用者の音声認識環境パラメータを最適化するパラメータ最適化手段と、を備えることを特徴としたものである。
【００１１】
請求項２の発明は、請求項１の発明において、前記利用者特定手段は、テンキー等の前記音声コマンド対応機器上のボタンによって利用者がＩＤ番号を入力することで利用者を特定する特定手段と、音声認識を利用して、利用者名を登録しておき照合することによって利用者を特定する特定手段と、磁気カード，非接触型ＩＣカード，ＰＨＳ，携帯電話，パーソナル無線端末など特別な機器・装置の利用によって利用者を特定する特定手段と、のうちの少なくとも１つを有することを特徴としたものである。
【００１２】
請求項３の発明は、請求項１又は２の発明において、前記パラメータ最適化手段は、前記音声認識環境パラメータとして、タイムアウト時間、ポーズ許容時間、音響モデルの確率パラメータ、の少なくとも１つを調整可能とし、前記タイムアウト時間は、ボタン押下などの音声入力の開始を指示する操作が行われてから実際の発話が開始するまでの時間遅れを制限するパラメータとし、前記ポーズ許容時間は、コマンドの途中に挿入される息継ぎやポーズなどの時間を制限するパラメータとすることを特徴としたものである。
【００１３】
請求項４の発明は、請求項１乃至３のいずれか１の発明において、前記パラメータ最適化手段は、音声認識誤りが発生した場合に、その原因を推定し、調整すべきパラメータを特定する調整パラメータ特定手段を有することを特徴としたものである。
【００１４】
請求項５の発明は、請求項４の発明において、前記パラメータ最適化手段は、音声認識誤りの原因として推定された要因に関連するパラメータを、誤認識が減少するように調整するパラメータ調整手段を有することを特徴としたものである。
【００１５】
請求項６の発明は、請求項１乃至５のいずれか１の音声コマンド対応機器操作装置を組み込んだ音声コマンド対応機器である。
【００１６】
請求項７の発明は、コンピュータを、請求項１乃至５のいずれか１の音声コマンド対応機器操作装置として機能させるためのプログラムである。
【００１７】
請求項８の発明は、請求項７のプログラムを記録したコンピュータ読み取り可能な記録媒体である。
【００１８】
【発明の実施の形態】
本発明に係る音声コマンド対応操作装置は、音声で入力された様々なコマンドを認識する音声認識装置を有する機器、すなわち音声認識を利用して操作が可能なオフィス機器等の機器（本明細書中では音声コマンド対応機器と呼ぶ）を操作する操作装置であって、音声コマンド対応機器のアクセシビリティ向上を実現するために有用な装置である。本発明に係る音声コマンド対応機器操作装置（しばしば、操作装置と呼ぶ）は、後述する利用者特定手段，パラメータ記憶手段，操作履歴記憶手段，パラメータ最適化手段を有するものとし、これら手段により、音声区間抽出時のパラメータや音響モデルのパラメータなど、認識精度改善のためのパラメータ調整を、利用者毎に自動的に実行するようにし、その結果、不特定多数の利用者に対して良好な音声コマンドによる操作性を提供することが可能となる。また、本発明に係る音声コマンド対応機器は、音声認識を利用して操作が可能なオフィス機器等の機器であり、本発明に係る操作装置が組み込まれることで、その操作性を向上させることが可能である。
【００１９】
パラメータ記憶手段では、利用者毎に最適な音声認識環境パラメータを記憶し、操作履歴記憶手段では、各利用者による音声コマンド対応機器（音声コマンド対応のオフィス機器等）の操作履歴を記憶する。また、パラメータ最適化手段では、操作履歴に応じてその利用者の音声認識環境パラメータを自動的に最適化する。パラメータ最適化手段は、操作履歴に従って音声認識環境パラメータを最適化していくことから、パラメータ学習機能を備えているともいえる。これら手段により、不特定多数の利用者に対して良好な音声コマンドによる操作性を提供することが可能となる。例えば、視覚障害者でも操作することのできるオフィス機器を提供することも可能となる。
【００２０】
利用者特定手段は利用者を特定する手段であり、次の３つの特定手段のうち少なくとも１つの特定手段を有することが好ましい。第１の特定手段は、テンキー等の当該音声コマンド対応機器上のボタンによって利用者がＩＤ番号を入力することで利用者を特定する手段である。第２の特定手段は、音声認識を利用して、利用者名を登録しておき照合することによって利用者を特定する手段である。第３の特定手段は、磁気カード，非接触型ＩＣカード，ＰＨＳ，携帯電話，パーソナル無線端末など特別な機器・装置の利用によって利用者を特定する手段である。この特定手段により、利用者毎に最適な音声認識環境を音声コマンド対応機器に用意するための枠組みを提供することが可能となる。
【００２１】
また、本発明の一実施形態に係るパラメータ最適化手段では、前記音声認識環境パラメータとして、タイムアウト時間、ポーズ許容時間、音響モデルの確率パラメータ、の少なくとも１つを調整可能とする。ここで、タイムアウト時間とは、ボタン押下などの音声入力の開始を指示する操作が行われてから実際の発話が開始するまでの時間遅れを制限するパラメータとする。また、ポーズ許容時間とは、コマンドの途中に挿入される息継ぎやポーズなどの時間を制限するパラメータとする。これら調整可能なパラメータを、利用者毎に最適な音声認識環境を用意するために提供することで、認識精度の改善がなされ、誤認識が少なくなる。
【００２２】
また、パラメータ最適化手段はその最適化に際し、音声認識誤りが発生した場合に、その原因を推定し、調整すべきパラメータを特定する調整パラメータ特定手段を有するようにすることが好ましい。この調整パラメータ特定手段により、利用者毎に最適な音声認識環境を音声コマンド対応機器に用意するためのパラメータ調整の枠組みを提供することが可能となる。
【００２３】
さらに、パラメータ最適化手段は、この調整パラメータ特定手段だけでなく、音声認識誤りの原因として推定された要因に関連するパラメータを、誤認識が減少するように自動的に調整するパラメータ調整手段を有するようにすることが好ましい。パラメータ調整手段による調整方法を、利用者毎に最適な音声認識環境を用意するために提供することで、認識精度の改善がなされ、誤認識が少なくなる。
【００２４】
また、本発明の他の実施形態として、コンピュータを、上述のいずれかの形態に係る操作装置として、機能させるためのプログラム（例えば図１で後述する処理の手順を実行するためのプログラム）や、そのプログラムを記録したコンピュータ読み取り可能な記録媒体としての形態も採用可能である。プログラムとしては、その音声コマンド対応機器に組み込まれ、音声コマンド対応機器の制御用演算処理装置（ここではコンピュータとして表現している）に、本発明に係る機能を実現させるよう制御させるべきプログラムである。記録媒体としては、具体的には、ＣＤ−ＲＯＭ、光磁気ディスク、ＤＶＤ−ＲＯＭ、ＦＤ、フラッシュメモリ、メモリスティック、及びその他各種ＲＯＭやＲＡＭ等が想定でき、これら記録媒体に上述した本発明に係る操作装置の機能をコンピュータ（汎用コンピュータに限らず、機器内に組み込まれた上述のごとき演算処理装置（ＣＰＵ，ＭＰＵ等）も含むものとする）に実行させ、この機能を実現するためのプログラムを記録して流通させることにより、当機能の実現を容易にする。そして操作装置或いは音声コマンド対応機器に上述のごとくの記録媒体を装着して操作装置或いは音声コマンド対応機器によりプログラムを読み出すか、若しくは操作装置或いは音声コマンド対応機器が具備する記録媒体に当プログラムを記憶させておき、必要に応じて読み出すことにより、本発明に係わる音声コマンド対応機器操作の機能を実行することができる。
【００２５】
図１は、本発明の一実施形態に係る音声コマンド対応機器の一実施例を示すブロック図で、図中、１は音声コマンド機能付きコピー機の構成例を示している。本発明の一実施形態に係る音声コマンド対応機器（しばしば、本機器と呼ぶ）を、図１の音声コマンド機能付きコピー機１を例に挙げて説明する。
【００２６】
コピー機１は、コピー機本体の機能１０に加え、利用者特定装置１１、パラメータ記憶手段としてのパラメータ記憶装置１２、マイクロフォン（以下、マイクと呼ぶ）１３、Ａ／Ｄ変換器１４、音声認識装置１５、操作履歴記憶手段としての操作履歴記憶装置１６、パラメータ最適化手段としてのパラメータ最適化装置（その主たる機能を果たすものとしてパラメータ調整装置１７として図示）を備えるものとする。
【００２７】
パラメータ記憶装置１２は、利用者毎の最適な音声認識環境パラメータを記憶しておく記憶装置であり、操作履歴記憶装置１６は、各利用者によるコピー機１の操作履歴を記憶しておく記憶装置である。パラメータ調整装置１７を有するパラメータ最適化装置は、記憶された操作履歴に応じてその利用者の音声認識環境パラメータを最適化するよう演算する装置である。本発明においては、操作履歴記憶装置１６に格納された操作履歴に応じて、音声認識誤りが生じた場合にはその原因を推定してその利用者の調整すべき音声認識環境パラメータを特定し、パラメータ調整装置１７により調整することで、音声認識環境パラメータを自動的に最適化する。
【００２８】
音声認識装置１５は、マイク１３及びＡ／Ｄ変換器１４を介して入力された音声データに対し、音声区間を切り出す音声区間切り出し手段２３と、切り出した音声区間のそれぞれに対して音響分析を行う音響分析手段２４と、音響分析後のデータに基づいて音声を認識する音声認識手段２５と、音声認識関連パラメータを保存或いは呼び出しにより一時記憶しておく記憶領域（作業領域とも言える）２０とを備えるものとする。ここで、音声認識関連パラメータとは、パラメータ記憶装置１２から対象となる利用者（ここでは利用者３）の音声認識環境パラメータから呼び出されたパラメータであり、音声区間切り出しに用いる区間抽出関連パラメータ２１と、音響分析結果から音声を認識するための音響モデルパラメータ２２とを含むものとする。
【００２９】
上述のごとき機器構成によりコピー機１における処理手順は次のようになる。まず、機器使用開始時に、利用者は、まず利用者特定装置１１を利用して利用者のＩＤを機器に伝える。これは、例えば磁気カードＣをカードリーダなどに通すことによって行われる。
【００３０】
利用者のＩＤを確認した後、音声認識関連パラメータ記憶装置１２から、特定された利用者用の音声認識関連パラメータが作業領域２０にロードされる。その後、利用者は、コピー機本体のキー入力や、コピー機本体或いはマイク１３，Ａ／Ｄ変換器１４を介した音声入力によって、機器の操作を実行する。一連の操作中で実行された全てオペレーションに関する情報、すなわち、各ボタンの押下タイミング情報やマイク１３から入力された全ての音情報を、Ａ／Ｄ変換器１４にてＡ／Ｄ変換した後のディジタルデータは、操作履歴記憶装置１６に保存される。
【００３１】
図２は、音声切り出し処理のタイムチャートの一例を示す図である。
まず、図２を参照して、音声コマンドによる操作が失敗する原因を説明する。失敗の原因としては、大きく次の（Ａ）〜（Ｃ）の３点が考えられる。
【００３２】
（Ａ）発話の開始タイミングが遅れたため、音声区間切り出し処理で実施しているタイムアウト処理（音声入力開始ボタンが押されて一定時間（タイムアウト時間Ｔｏ）経過しても、音声区間が検出されない場合には、処理を中断する）によって、音声の取り込み自体が失敗した。
（Ｂ）利用者が発声した音声の始終端が正しく検出できなかったため、認識処理に失敗した。
（Ｃ）利用者の声質が音響モデルのパラメータとマッチしていないため、認識処理に失敗した。
【００３３】
（Ａ）のタイムアウト処理は、快適な音声入力環境を提供するために必要な処理である。そうでないと、誤って音声入力開始ボタンを押してしまった場合には、明示的にその取り消し動作を行うか、何らかの音が入力されるまで音声入力処理が完了しないため、音声以外の外部ノイズなどに反応して機器が誤動作してしまう可能性が高くなる。そのためデフォルト値としてはある程度短め（数百ミリ秒）に設定されているが、利用者によっては（何らかの障害などにより）予め定められたタイムアウト時間Ｔｏ内での発話が難しい場合もあり得る。
【００３４】
そこで、（Ａ）の失敗については、操作履歴として保存されている音声データを分析し、タイムアウト発生後、比較的短時間内に何らかの発声があったことが検出された場合には、その利用者のためのタイムアウト時間Ｔｏを一定の割合で増加する。ただし、タイムアウト時間Ｔｏの増加を無制限に許した場合には上述のごとき問題が発生するので、実際には適当な上限値（１〜２秒程度）を設ける。
【００３５】
一方、（Ｂ）や（Ｃ）のような音声コマンドの誤認識が発生した場合には、利用者は再び音声コマンドを発話するか、或いはキー操作など、別の操作手段によって同じ操作をやり直すことになる。そこで、音声コマンド入力の認識結果に応じて選択されたモードが確定されないまま、再び操作がやり直された場合には、その音声コマンド入力の認識結果が誤りであったものと仮定することができる（これを誤認識仮説と呼ぶ）。この場合、複数回連続して誤認識が発生する場合も考えられるが、この場合はどれか一つ（例えば最初の誤認識）を誤認識仮説とする。さらに、誤認識仮説成立後にやり直された操作が利用者によって確定された時点で、その操作に対応する音声コマンドが先に誤認識した音声コマンドに対する正解であると仮定することができる（これを正解仮説と呼ぶ）。
【００３６】
ただし、実際には利用者の実行したい内容が音声コマンド入力後に急に変更された可能性もあるため、上述の仮説が常に正しいものである保証はない。そこで、上述のごとき誤認識仮説と正解仮説のペアが確定した時点で、機器（ここではコピー機１）は、利用者に対して、先に入力に失敗した音声コマンドがこの正解仮説を意図したものであるかを確認するメッセージを出力し、利用者に確認させる。
【００３７】
先に入力に失敗した音声コマンドデータは、操作履歴として保存されている音声データから、その入力の開始を示す音声入力開始ボタン等が押された時刻以降のデータを抽出して実際の音声認識時に使用している音声区間切り出し処理を行うことによって忠実に再現することができる。この際、音声区間の始終端検出に失敗した（Ｂ）の場合には、ここで抽出された音声コマンドデータの再生音が、利用者の意図したものとは異なっている（文頭や文末の一部が欠けているなど）はずである。一方、音声区間切り出しは成功したものの、その後の認識処理で失敗した（Ｃ）の場合は、ここで抽出された音声は利用者の意図したものと同じはずである。
【００３８】
そこで、両者を区別するため、例えばまず「先に入力したコマンドは『両面コピー（この部分は上記処理で抽出された利用者の肉声の再生音）』ですか」のようなメッセージを出力する。そして引き続き「先に入力したコマンドの正解は『両面コピー（この部分は正解仮説）』ですか」のようなメッセージを出力する。２番目の質問への回答がＹＥＳの場合には、その時点で誤認識の原因、及び誤認識した音声コマンドデータとその正解というペアが確定される。なお、２番目の質問への回答がＮＯの場合には正解が特定できないため、これ以降の処理は行わない。
【００３９】
１番目の質問がＮＯの場合、すなわち、音声区間の始終端検出に失敗している場合、その原因としては次の２つが考えられる。
（ａ）突発的なノイズなど、予期できない音が音声の直前や直後に入力された。
（ｂ）コマンド発話中に想定外に長いポーズがあり、その時点で発話が完了したものとみなされた。
【００４０】
（ａ）の問題は、現在の音声認識技術では対処することが難しい本質的な問題であるが、（ｂ）のポーズ区間の問題については、ポーズ許容時間Ｔｐを拡大することによってその発生をある程度回避することが可能である。しかし、その場合、実際には発話が完了しているにも拘らず、その時間だけは待たないと終端の確定ができなくなるため、音声コマンドに対する応答性が劣化してしまう。
【００４１】
そのため、このデフォルト値はある程度短め（数百ミリ秒）に設定されているが、利用者によっては（何らかの障害などにより）予め定められたポーズ許容時間以上のポーズの挿入が不可避な場合もあり得る。
【００４２】
そこで、操作履歴として保存されている音声データに対して、ポーズ許容時間Ｔｐを一時的に一定割合で増加させた後で音声区間の再切り出し処理を行い、そのデータに対する音声認識結果が正解仮説と一致した場合に限り、実際のポーズ許容時間Ｔｐをその（一時的に増加させた）値で置き換える。ただし、ポーズ許容時間Ｔｐの増加を無制限に許した場合には上述のごとき問題が発生するので、実際には適当な上限値（１〜２秒程度）を設ける。
【００４３】
最後に残る１つの理由、すなわち利用者の声質が音響モデルのパラメータとマッチしていないために認識処理に失敗したと考えられる場合には、誤認識した音声データと正解情報を用いて、音響モデルの話者適応を実施する。話者適応の対象となる音響モデルとしては、初めてこの話者適応を実行する話者に対してはデフォルトの不特定話者用モデルを使用し、既に自分専用の適応済み音響モデルを有している話者に対してはそのモデルを使用する。
【００４４】
なお、この話者適応処理は、例えば広く知られているＭＡＰ（最大事後確率）推定法のような少サンプルによる適応処理に向いた話者適応アルゴリズムを利用して実施する。
【００４５】
こうした一連の処理で変更されたパラメータは、変更の都度、或いは一連の操作が終了した時点で作業領域２０から話者毎のパラメータ記憶装置１２に再び格納することで、その値を更新する。
【００４６】
【発明の効果】
本発明によれば、音声区間抽出時のパラメータや音響モデルのパラメータなど、認識精度改善のためのパラメータ調整を、利用者毎に自動的に実行するようにすることによって、不特定多数の利用者に対して良好な音声コマンドによる操作性を提供することが可能となる。
【図面の簡単な説明】
【図１】本発明の一実施形態に係る音声コマンド対応機器の一実施例を示すブロック図である。
【図２】音声切り出し処理のタイムチャートの一例を示す図である。
【符号の説明】
１…コピー機、１０…コピー機本体機能、１１…利用者特定装置、１２…パラメータ記憶装置、１３…マイクロフォン、１４…Ａ／Ｄ変換器、１５…音声認識装置、１６…操作履歴記憶装置、１７…パラメータ調整装置、２０…記憶領域（作業領域）、２１…区間抽出関連パラメータ、２２…音響モデルパラメータ、２３…音声区間切り出し手段、２４…音響分析手段、２５…音声認識手段。
[0001]
TECHNICAL FIELD OF THE INVENTION
The present invention relates to a voice command-compatible device operating device, a voice command-compatible device, a program, and a recording medium. More particularly, it relates to a voice command-compatible device operating device for operating a voice command-compatible device, such as office equipment that can be operated using voice recognition, to a voice command-compatible device incorporating that operating device, to a program incorporated in that device, and to a recording medium.
[0002]
[Prior art]
As is clear from Section 508 of the amended Rehabilitation Act, which came into force in the United States last June, there has recently been a demand for office equipment that people with physical disabilities can operate as well as people without disabilities.
[0003]
In particular, an operation system like that of recent office equipment, in which the user selects among many functions and sets modes guided by various information and virtual buttons displayed on a flat LCD touch panel according to the situation, is a major barrier, especially for visually impaired users.
[0004]
As a promising means to remove this barrier, a voice interface using voice guidance output or voice command input can be considered.
[0005]
In device operation by voice commands, for products used by a specific individual in a specific environment, such as software on a personal computer, recognition accuracy is often improved by a training step called speaker adaptation, in which the user utters sample speech of tens to hundreds of words in advance and the acoustic model parameters are tuned to it. However, for devices used by many users, such as office equipment, it is difficult to have each user carry out speaker adaptation in front of the actual device, because registering the sample speech takes considerable time.
[0006]
Therefore, a speaker-independent acoustic model capable of recognizing the voices of many speakers is generally used, but with such a model the recognition accuracy may be extremely low for some speakers. Thus, with voice command input, recognition errors cannot be eliminated completely, and for some speakers the recognition accuracy may become extremely low.
[0007]
Possible causes include the following:
(1) The utterance starts at an inappropriate time, so the beginning of the voice section is not extracted properly.
(2) A fairly long breath or pause occurs within the command, so the end of the voice section is not extracted properly.
(3) The speaker's voice quality does not match the parameters of the acoustic model in use.
[0008]
[Problems to be solved by the invention]
Causes (1) and (2) above arise in the voice section cutout processing performed as preprocessing for voice recognition, and the appropriate parameters for this processing also differ with the speaker's utterance style. Causes (1) and (2) may be improved to some extent by practice, but there are also cases in which it is difficult for the user to adapt because of some disability. The parameters used at the time of voice section extraction therefore need to be chosen appropriately, yet no device currently exists that changes these parameters (improves them by adjustment). For cause (3) as well, it is essential to improve the recognition accuracy by adjusting the parameters of the acoustic model.
[0009]
The present invention has been made in view of the above circumstances. Its object is to provide a voice command-compatible device operating device, a voice command-compatible device incorporating that operating device, a program incorporated in that device, and a computer-readable recording medium on which the program is recorded, each capable of providing good voice command operability to an unspecified number of users by automatically performing, for each user, the adjustment of parameters for improving recognition accuracy, such as voice section extraction parameters and acoustic model parameters.
[0010]
[Means for Solving the Problems]
The invention according to claim 1 is a voice command-compatible device operating device that operates a voice command-compatible device having a voice recognition device that recognizes various commands input by voice, the operating device comprising: user specifying means for identifying a user; parameter storage means for storing the optimum voice recognition environment parameters for each user; operation history storage means for storing each user's operation history of the voice command-compatible device; and parameter optimizing means for optimizing the user's voice recognition environment parameters according to the operation history.
[0011]
According to a second aspect of the present invention, in the first aspect, the user specifying means has at least one of: specifying means for identifying a user by having the user enter an ID number with buttons on the voice command-compatible device, such as a numeric keypad; specifying means for identifying a user by voice recognition, by registering user names in advance and collating against them; and specifying means for identifying a user by the use of a special device or apparatus such as a magnetic card, a non-contact IC card, a PHS, a mobile phone, or a personal wireless terminal.
[0012]
According to a third aspect of the present invention, in the first or second aspect, the parameter optimizing means can adjust at least one of a timeout time, a pause allowable time, and a probability parameter of an acoustic model as the voice recognition environment parameter. The timeout time is a parameter that limits the delay from an operation instructing the start of voice input, such as a button press, until the actual utterance starts, and the pause allowable time is a parameter that limits the duration of a breath or pause inserted in the middle of a command.
[0013]
According to a fourth aspect of the present invention, in any one of the first to third aspects, the parameter optimizing means has adjustment parameter specifying means for estimating the cause of a voice recognition error when one occurs and identifying the parameter to be adjusted.
[0014]
According to a fifth aspect of the present invention, in the fourth aspect, the parameter optimizing means has parameter adjusting means for adjusting the parameter related to the factor estimated as the cause of the voice recognition error so that misrecognition is reduced.
[0015]
A sixth aspect of the present invention is a voice command compatible device incorporating the voice command compatible device operation device according to any one of the first to fifth aspects.
[0016]
According to a seventh aspect of the present invention, there is provided a program for causing a computer to function as the voice command-compatible device operating device according to any one of the first to fifth aspects.
[0017]
The invention according to claim 8 is a computer-readable recording medium on which the program according to claim 7 is recorded.
[0018]
BEST MODE FOR CARRYING OUT THE INVENTION
The voice command-compatible device operating device according to the present invention is an operating device for operating a device having a voice recognition device that recognizes various commands input by voice, that is, a device such as office equipment that can be operated using voice recognition (referred to in this specification as a voice command-compatible device), and it is useful for improving the accessibility of such a device. The voice command-compatible device operating device (often referred to simply as the operating device) includes the user specifying means, parameter storage means, operation history storage means, and parameter optimizing means described later. With these means, the adjustment of parameters for improving recognition accuracy, such as voice section extraction parameters and acoustic model parameters, is performed automatically for each user, and as a result good voice command operability can be provided to an unspecified number of users. The voice command-compatible device according to the present invention is a device such as office equipment that can be operated using voice recognition, and its operability can be improved by incorporating the operating device according to the present invention.
[0019]
The parameter storage means stores the optimum voice recognition environment parameters for each user, and the operation history storage means stores each user's history of operating the voice command-compatible device (such as voice command-compatible office equipment). The parameter optimizing means automatically optimizes the user's voice recognition environment parameters according to the operation history. Because it optimizes the voice recognition environment parameters by following the operation history, the parameter optimizing means can also be said to have a parameter learning function. With these means, good voice command operability can be provided to an unspecified number of users. For example, it becomes possible to provide office equipment that even a visually impaired person can operate.
[0020]
The user specifying means is a means for specifying a user, and preferably has at least one of the following three specifying means. The first specifying means is a means for specifying a user by inputting an ID number by a button on the voice command compatible device such as a numeric keypad. The second specifying means is a means for specifying a user by registering and collating the user name using voice recognition. The third specifying means is a means for specifying a user by using a special device such as a magnetic card, a non-contact type IC card, a PHS, a mobile phone, or a personal wireless terminal. With this specifying means, it is possible to provide a framework for preparing an optimum voice recognition environment for each user in a voice command compatible device.
[0021]
Further, in the parameter optimizing means according to one embodiment of the present invention, at least one of a timeout time, a pause allowable time, and a probability parameter of an acoustic model can be adjusted as the speech recognition environment parameter. Here, the timeout time is a parameter that limits the time delay from when an operation for instructing the start of voice input, such as pressing a button, is performed to when an actual utterance starts. In addition, the pause allowable time is a parameter for limiting a time such as a breath or a pause inserted in the middle of a command. By providing these adjustable parameters in order to prepare an optimal speech recognition environment for each user, recognition accuracy is improved and erroneous recognition is reduced.
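As an illustration of how the two timing parameters act on voice section extraction, the following is a minimal sketch, not the method disclosed here. It assumes a per-frame voice-activity flag is already available; the function name `extract_voice_section` and its arguments are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def extract_voice_section(
    is_speech: Sequence[bool],
    frame_ms: int,
    timeout_ms: int,
    pause_ms: int,
) -> Optional[Tuple[int, int]]:
    """Return (start_frame, end_frame_exclusive), or None on timeout.

    is_speech[i] is an assumed per-frame voice-activity flag; frame 0
    begins when the voice input start button is pressed.
    """
    # Locate the first frame that contains speech.
    start = next((i for i, s in enumerate(is_speech) if s), None)
    # Timeout time: speech must begin within timeout_ms of the button press.
    if start is None or start * frame_ms >= timeout_ms:
        return None

    end = start + 1  # exclusive index just past the last speech frame
    silence = 0      # consecutive silent frames since the last speech frame
    for i in range(start + 1, len(is_speech)):
        if is_speech[i]:
            end = i + 1
            silence = 0
        else:
            silence += 1
            # Pause allowable time: a longer silence ends the command.
            if silence * frame_ms > pause_ms:
                break
    return start, end
```

With a short pause allowable time, a command containing a long breath is cut at the breath; with a longer value the whole command is kept, at the cost of waiting longer before the end point is fixed.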
[0022]
In addition, it is preferable that the parameter optimizing means include an adjustment parameter specifying means for estimating a cause of a speech recognition error and specifying a parameter to be adjusted when a speech recognition error occurs. With this adjustment parameter specifying means, it is possible to provide a parameter adjustment framework for preparing an optimum voice recognition environment for each user in a voice command compatible device.
[0023]
Further, it is preferable that the parameter optimizing means include not only the adjustment parameter specifying means but also parameter adjusting means for automatically adjusting the parameter related to the factor estimated as the cause of a voice recognition error so that misrecognition is reduced. Providing this adjustment method in order to prepare an optimum voice recognition environment for each user improves recognition accuracy and reduces misrecognition.
[0024]
Further, as other embodiments of the present invention, it is also possible to adopt the form of a program for causing a computer to function as the operating device according to any of the above-described forms (for example, a program for executing the processing procedure described later with reference to FIG. 1), or the form of a computer-readable recording medium on which that program is recorded. The program is incorporated in the voice command-compatible device and causes the control arithmetic processing unit of that device (referred to here as a computer) to realize the functions according to the present invention. Specific examples of the recording medium include a CD-ROM, a magneto-optical disk, a DVD-ROM, an FD, a flash memory, a memory stick, and various other ROMs and RAMs. Recording on such a medium, and distributing, a program that causes a computer (not only a general-purpose computer but also an arithmetic processing unit such as a CPU or MPU built into a device, as described above) to execute the functions of the operating device makes these functions easy to realize. The function of operating a voice command-compatible device according to the present invention can then be executed either by mounting such a recording medium on the operating device or the voice command-compatible device and reading the program from it, or by storing the program on a recording medium provided in the operating device or the voice command-compatible device and reading it out as needed.
[0025]
FIG. 1 is a block diagram showing an example of a voice command compatible device according to an embodiment of the present invention. In the drawing, reference numeral 1 denotes a configuration example of a copy machine with a voice command function. A voice command compatible device according to an embodiment of the present invention (often referred to as the device) will be described using the copy machine 1 with a voice command function of FIG. 1 as an example.
[0026]
In addition to the functions 10 of the copy machine body, the copy machine 1 includes a user identification device 11, a parameter storage device 12 serving as the parameter storage means, a microphone 13, an A/D converter 14, a voice recognition device 15, an operation history storage device 16 serving as the operation history storage means, and a parameter optimizing device serving as the parameter optimizing means (shown in the figure as the parameter adjusting device 17, which performs its main function).
[0027]
The parameter storage device 12 is a storage device that stores the optimum voice recognition environment parameters for each user, and the operation history storage device 16 is a storage device that stores each user's operation history of the copy machine 1. The parameter optimizing device, which has the parameter adjusting device 17, is a device that performs computation to optimize the user's voice recognition environment parameters according to the stored operation history. In the present invention, when a voice recognition error occurs, its cause is estimated from the operation history stored in the operation history storage device 16, the voice recognition environment parameter of that user that should be adjusted is identified, and the parameter adjusting device 17 adjusts it, so that the voice recognition environment parameters are optimized automatically.
[0028]
The voice recognition device 15 includes voice section cutout means 23 for cutting out voice sections from the voice data input via the microphone 13 and the A/D converter 14, acoustic analysis means 24 for performing acoustic analysis on each cut-out voice section, voice recognition means 25 for recognizing speech based on the data after acoustic analysis, and a storage area (which can also be called a work area) 20 for temporarily holding the voice recognition-related parameters as they are saved or loaded. Here, the voice recognition-related parameters are the parameters loaded from the parameter storage device 12 out of the voice recognition environment parameters of the target user (here, user 3), and they include section extraction-related parameters 21 used for voice section cutout and acoustic model parameters 22 used for recognizing speech from the acoustic analysis result.
[0029]
The processing procedure in the copy machine 1 is as follows according to the device configuration as described above. First, at the start of use of the device, the user uses the user identification device 11 to transmit the user ID to the device. This is performed, for example, by passing the magnetic card C through a card reader or the like.
[0030]
After the user ID is confirmed, the voice recognition-related parameters for the identified user are loaded from the voice recognition-related parameter storage device 12 into the work area 20. The user then operates the device by key input on the copy machine body or by voice input via the copy machine body or the microphone 13 and the A/D converter 14. Information on all operations performed during the series of operations, that is, the timing of each button press and the digital data obtained by A/D-converting, in the A/D converter 14, all sound input from the microphone 13, is stored in the operation history storage device 16.
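The load, work, and log flow described above can be sketched as follows. This is an illustrative data layout only: the class names, field names, and default values are assumptions (the text says only that the timing defaults are on the order of several hundred milliseconds), and the acoustic model is reduced to a placeholder string.

```python
import copy
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class RecognitionParams:
    timeout_ms: int = 500   # timeout time To (assumed default)
    pause_ms: int = 500     # pause allowable time Tp (assumed default)
    acoustic_model: str = "speaker-independent"  # placeholder for model parameters


@dataclass
class ParameterStore:
    """Stands in for parameter storage device 12: one entry per user ID."""
    by_user: Dict[str, RecognitionParams] = field(default_factory=dict)

    def load(self, user_id: str) -> RecognitionParams:
        # Copy into the work area so edits do not touch the stored values.
        return copy.deepcopy(self.by_user.get(user_id, RecognitionParams()))

    def save(self, user_id: str, params: RecognitionParams) -> None:
        self.by_user[user_id] = copy.deepcopy(params)


@dataclass
class OperationHistory:
    """Stands in for operation history storage device 16."""
    events: List[Tuple[float, str, Any]] = field(default_factory=list)

    def log(self, time_s: float, kind: str, payload: Any) -> None:
        # kind is e.g. "button" or "audio"; payload is the key name or samples.
        self.events.append((time_s, kind, payload))
```

Unknown users fall back to the defaults here, which mirrors starting from the speaker-independent settings; that fallback is a design choice of this sketch.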
[0031]
FIG. 2 is a diagram illustrating an example of a time chart of the voice section cutout processing.
First, the causes of failure of an operation by voice command will be described with reference to FIG. 2. Broadly, the following three causes (A) to (C) can be considered.
[0032]
(A) The utterance started late, so audio capture itself failed because of the timeout process performed in the voice section cutout processing (processing is aborted if no voice section is detected within a fixed time, the timeout time To, after the voice input start button is pressed).
(B) The recognition process failed because the start and end of the voice uttered by the user could not be detected correctly.
(C) The recognition process failed because the voice quality of the user did not match the parameters of the acoustic model.
[0033]
The timeout process in (A) is necessary for providing a comfortable voice input environment. Without it, if the voice input start button is pressed by mistake, the voice input process does not finish until the user explicitly cancels it or some sound is input, which makes it more likely that the device will malfunction in response to external noise other than speech. For this reason the default value is set fairly short (several hundred milliseconds), but some users may find it difficult (because of some disability, for example) to start speaking within the predetermined timeout time To.
[0034]
Therefore, for a failure of type (A), the voice data stored as the operation history is analyzed, and if it is detected that some utterance was made within a relatively short time after the timeout occurred, the timeout time To for that user is increased at a fixed rate. However, allowing To to increase without limit would cause the problem described above, so in practice an appropriate upper limit (about 1 to 2 seconds) is set.
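The timeout adjustment rule can be sketched as below. The step size, the length of the "relatively short time" window, and the exact cap are not specified in the text, so `step_percent`, `grace_ms`, and `upper_ms` are assumed values chosen for illustration; integer arithmetic is used to keep the result exact.

```python
from typing import Optional


def adjust_timeout(
    timeout_ms: int,
    speech_onset_ms: Optional[int],
    grace_ms: int = 1000,     # assumed window after the timeout
    step_percent: int = 20,   # assumed fixed rate of increase
    upper_ms: int = 2000,     # assumed cap within the 1 to 2 second range
) -> int:
    """Return the new timeout time To for this user.

    speech_onset_ms is the time from the button press to the first speech
    found in the logged audio, or None if no speech was found.
    """
    if speech_onset_ms is None:
        return timeout_ms
    late_by = speech_onset_ms - timeout_ms
    # Lengthen To only if the user began speaking shortly after the timeout.
    if 0 <= late_by <= grace_ms:
        increased = timeout_ms * (100 + step_percent) // 100
        return min(increased, upper_ms)
    return timeout_ms
```

A user who starts speaking 200 ms after a 500 ms timeout would get 600 ms next time under these assumed settings, while sound arriving much later is treated as unrelated and leaves To unchanged.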
[0035]
On the other hand, when a voice command is misrecognized as in (B) or (C), the user either utters the voice command again or redoes the same operation by another means, such as key operation. Therefore, if an operation is redone without the mode selected according to the recognition result of a voice command input having been confirmed, the recognition result of that voice command input can be assumed to have been wrong (this is called the misrecognition hypothesis). Misrecognition may occur several times in succession, in which case one of the occurrences (for example, the first) is taken as the misrecognition hypothesis. Further, once an operation redone after the misrecognition hypothesis was established has been confirmed by the user, the voice command corresponding to that operation can be assumed to be the correct answer for the previously misrecognized voice command (this is called the correct answer hypothesis).
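One way to derive the two hypotheses from a logged sequence of operations is sketched below. The `Operation` record and its fields are hypothetical, and the sketch follows the option mentioned above of taking the first of several consecutive misrecognitions.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Operation:
    source: str                     # "voice" or "key"
    command: str                    # recognized or keyed command
    confirmed: bool                 # True if the user finalized the selected mode
    audio_id: Optional[int] = None  # reference to logged audio, if any


def find_hypothesis_pair(
    ops: List[Operation],
) -> Optional[Tuple[Operation, Operation]]:
    """Return (misrecognition hypothesis, correct answer hypothesis) or None."""
    misrecognized: Optional[Operation] = None
    for op in ops:
        if misrecognized is None:
            # An unconfirmed voice result is the candidate misrecognition.
            if op.source == "voice" and not op.confirmed:
                misrecognized = op
            continue
        # The first operation the user confirms afterwards is taken as the
        # correct answer; further unconfirmed retries are skipped.
        if op.confirmed:
            return misrecognized, op
    return None
```

The pair is only a hypothesis at this stage; it still needs to be confirmed with the user before any parameter is changed.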
[0036]
However, the above hypotheses are not guaranteed to be correct, because the user may simply have changed their mind about what to execute after inputting the voice command. Therefore, once a pair of a misrecognition hypothesis and a correct answer hypothesis has been determined as described above, the device (the copy machine 1 in this case) outputs a message asking the user to confirm that the correct answer hypothesis really is the correct answer for the voice command whose input failed first.
[0037]
The voice command data whose input failed can be faithfully reproduced by taking, from the voice data stored as the operation history, the data following the time at which the voice input start button or the like indicating the start of input was pressed, and applying to it the same voice section extraction processing that was used in the actual voice recognition. In case (B), in which detection of the start or end of the voice section failed, the playback of the voice command data extracted here differs from what the user intended (part of the beginning or end of the utterance is missing). In case (C), in which voice section extraction succeeded but the subsequent recognition processing failed, the voice extracted here should be the same as what the user intended.
[0038]
Therefore, in order to distinguish between the two, the device first outputs a message such as 'Was the command you input first "double-sided copy"?' (the quoted command is the playback of the user's actual voice extracted by the above processing). It then outputs a message such as 'Is the correct answer for the command input earlier "double-sided copy"?' (here the quoted command is the correct answer hypothesis). If the answer to the second question is YES, the cause of the misrecognition (given by the answer to the first question) and the pair of the misrecognized voice command data and its correct answer are determined. If the answer to the second question is NO, the correct answer cannot be identified, and no further processing is performed.
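The decision made from the two answers can be summarized in a small sketch. The enum `Cause` and the function `classify_cause` are hypothetical names introduced only for illustration; the mapping itself follows the text: a NO to the second question stops processing, and otherwise the first answer separates case (B) from case (C).

```python
from enum import Enum
from typing import Optional


class Cause(Enum):
    SECTION_EXTRACTION = "B"  # start or end of the voice section was missed
    RECOGNITION = "C"         # extraction succeeded, recognition failed


def classify_cause(first_yes: bool, second_yes: bool) -> Optional[Cause]:
    """Map the user's two YES/NO answers to the estimated cause.

    first_yes:  the played-back voice matched what the user intended to say.
    second_yes: the correct answer hypothesis was confirmed by the user.
    """
    if not second_yes:
        return None  # correct answer unknown, so no further processing
    return Cause.RECOGNITION if first_yes else Cause.SECTION_EXTRACTION
```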
[0039]
If the answer to the first question is NO, that is, if detection of the start or end of the voice section failed, the following two causes can be considered.
(a) An unexpected sound, such as a sudden noise, was input immediately before or immediately after the voice.
(b) There was an unexpectedly long pause in the middle of the command utterance, and the utterance was judged to have ended at that point.
[0040]
Problem (a) is an inherent problem that is difficult to handle with current speech recognition technology. Problem (b), the pause section, can however be avoided to some extent by increasing the pause allowable time Tp. In that case, though, even when the utterance has actually finished, its end cannot be determined until that time has elapsed, so the responsiveness to voice commands deteriorates.
[0041]
For this reason, the default value is set somewhat short (several hundred milliseconds). Depending on the user, however, pauses longer than the predetermined pause allowable time may unavoidably be inserted (for example, because of some kind of impairment).
[0042]
Therefore, the pause allowable time Tp is provisionally increased at a fixed rate, the voice section is cut out again from the voice data stored as the operation history, and only when the voice recognition result for that data matches the correct answer hypothesis is the actual pause allowable time Tp replaced with the provisionally increased value. However, if Tp were allowed to increase without limit, the problem described above would occur, so an appropriate upper limit (about 1 to 2 seconds) is set in practice.
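The trial increase of Tp can be sketched as below. Everything concrete here is an assumption made for illustration: the stored voice data is reduced to one voice-activity flag per 100 ms frame, `cut_section` is a simplified stand-in for the voice section cutout means, the recognizer is passed in as a plain callable, and the rate of 1.5 and the 2000 ms limit are example values.

```python
from typing import Callable, Optional, Sequence, Tuple

FRAME_MS = 100  # assumed frame length of the voice-activity flags


def cut_section(active: Sequence[bool], pause_ms: int) -> Optional[Tuple[int, int]]:
    """Return (start, end) frame indices of the first utterance, end exclusive.

    The utterance is treated as finished once silence lasts longer than pause_ms.
    """
    frames = list(active)
    if True not in frames:
        return None
    start = frames.index(True)
    last_voiced = start
    for i in range(start + 1, len(frames)):
        if frames[i]:
            last_voiced = i
        elif (i - last_voiced) * FRAME_MS > pause_ms:
            break
    return start, last_voiced + 1


def try_longer_pause(active: Sequence[bool], tp_ms: int, correct: str,
                     recognize: Callable[[Tuple[int, int]], str],
                     rate: float = 1.5, limit_ms: int = 2000) -> int:
    """Return the Tp to keep: the trial value only if recognition now matches."""
    trial_ms = min(round(tp_ms * rate), limit_ms)  # fixed-rate increase, capped
    section = cut_section(active, trial_ms)
    if section is not None and recognize(section) == correct:
        return trial_ms
    return tp_ms  # no match: the stored Tp is left unchanged
```

With an utterance that contains a 400 ms pause, a Tp of 300 ms cuts the command off at the pause, while the trial value of 450 ms keeps the whole command; the new value is adopted only if the re-recognized result equals the correct answer hypothesis.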
[0043]
For the last cause, that is, when the recognition process is considered to have failed because the user's voice quality does not match the parameters of the acoustic model, speaker adaptation of the acoustic model is performed using the misrecognized voice data and the correct answer information. As the acoustic model to be adapted, the default speaker-independent model is used for a speaker undergoing this speaker adaptation for the first time, and the speaker's own adapted acoustic model is used for a speaker who already has one.
[0044]
This speaker adaptation is performed using a speaker adaptation algorithm suited to adaptation with a small number of samples, such as the widely known MAP (maximum a posteriori) estimation method.
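As a rough illustration of why MAP estimation suits small sample counts, the sketch below applies the standard MAP update of a single Gaussian mean, (tau * prior_mean + sum of samples) / (tau + n). The text only names the MAP method, so this formula, the function name `map_adapt_mean`, and the prior weight `tau` are assumptions; a full acoustic model adaptation would weight each sample by its state occupation probability rather than assign it to one Gaussian.

```python
from typing import List, Sequence


def map_adapt_mean(prior_mean: Sequence[float],
                   samples: Sequence[Sequence[float]],
                   tau: float = 10.0) -> List[float]:
    """MAP update of one Gaussian mean vector.

    prior_mean: mean from the current model (speaker-independent or adapted).
    samples:    feature vectors from the misrecognized utterance that were
                aligned to this Gaussian using the correct answer.
    tau:        assumed prior weight; larger values trust the prior more.
    """
    n = len(samples)
    dim = len(prior_mean)
    sums = [sum(x[d] for x in samples) for d in range(dim)]
    # With few samples the result stays near the prior mean; with many samples
    # it approaches the sample mean.
    return [(tau * prior_mean[d] + sums[d]) / (tau + n) for d in range(dim)]
```

With no samples the mean is unchanged, which is the behavior wanted when only one or two misrecognized utterances are available per adjustment.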
[0045]
The parameters changed in this series of processing are written back from the work area 20 to the parameter storage device 12 for each speaker, either each time a parameter is changed or when the series of operations is completed.
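The relationship between the work area and the per-user storage can be sketched as a load, modify, write-back cycle. The class `ParameterStorage`, the dictionary keys, and the default values are hypothetical; the text states only that defaults are several hundred milliseconds and that changed parameters are written back per speaker.

```python
import copy
from typing import Any, Dict

# Assumed defaults for a user with no stored parameters yet.
DEFAULTS: Dict[str, Any] = {
    "timeout_ms": 500,
    "pause_ms": 300,
    "acoustic_model": "speaker-independent",
}


class ParameterStorage:
    """Stand-in for the parameter storage device 12 (one entry per user)."""

    def __init__(self) -> None:
        self._by_user: Dict[str, Dict[str, Any]] = {}

    def load(self, user_id: str) -> Dict[str, Any]:
        """Copy the user's parameters (or the defaults) into a work area."""
        return copy.deepcopy(self._by_user.get(user_id, DEFAULTS))

    def save(self, user_id: str, work_area: Dict[str, Any]) -> None:
        """Write the work area back so that it persists for this user."""
        self._by_user[user_id] = copy.deepcopy(work_area)
```

Because `load` returns a copy, adjustments made in the work area affect the stored values only when `save` is called, which corresponds to writing back after each change or at the end of the series of operations.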
[0046]
[Effect of the invention]
According to the present invention, parameters for improving recognition accuracy, such as the parameters used in voice section extraction and the parameters of the acoustic model, are adjusted automatically for each user, so that good operability by voice command can be provided.
[Brief description of the drawings]
FIG. 1 is a block diagram illustrating an example of a voice command compatible device according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating an example of a time chart of a voice cutout process.
[Explanation of symbols]
1: copy machine, 10: copy machine main body function, 11: user identification device, 12: parameter storage device, 13: microphone, 14: A/D converter, 15: voice recognition device, 16: operation history storage device, 17: parameter adjustment device, 20: storage area (work area), 21: section extraction related parameters, 22: acoustic model parameters, 23: voice section cutout means, 24: acoustic analysis means, 25: voice recognition means.

Claims

1. A voice command compatible device operating device for operating a voice command compatible device having a voice recognition device that recognizes various commands input by voice, comprising: user specifying means for specifying a user; parameter storage means for storing voice recognition environment parameters; operation history storage means for storing an operation history of the voice command compatible device for each user; and parameter optimizing means for optimizing the user's voice recognition environment parameters according to the operation history.

2. The voice command compatible device operating device according to claim 1, wherein the user specifying means comprises at least one of: means for specifying a user by input of an ID number using buttons on the voice command compatible device such as a numeric keypad; means for specifying a user by using voice recognition to match against user names registered in advance; and means for specifying a user by using a special device such as a magnetic card, a non-contact IC card, a PHS, a mobile phone, or a personal wireless terminal.

3. The voice command compatible device operating device according to claim 1 or 2, wherein the parameter optimizing means is capable of adjusting, as the voice recognition environment parameters, at least one of a timeout time, a pause allowable time, and a probability parameter of an acoustic model, the timeout time being a parameter that limits the delay from an operation instructing the start of voice input, such as pressing a button, to the start of the actual utterance, and the pause allowable time being a parameter that limits the duration of breaths, pauses, and the like inserted in the middle of a command.

4. The voice command compatible device operating device according to claim 1, wherein the parameter optimizing means has adjustment parameter specifying means for estimating the cause of a speech recognition error and specifying the parameter to be adjusted.

5. The voice command compatible device operating device according to claim 4, wherein the parameter optimizing means includes parameter adjusting means for adjusting the parameter associated with the factor estimated to be the cause of the speech recognition error so as to reduce misrecognition.

6. A voice command compatible device incorporating the voice command compatible device operating device according to claim 1.

7. A program for causing a computer to function as the voice command compatible device operating device according to claim 1.

8. A computer-readable recording medium on which the program according to claim 7 is recorded.