JP3594356B2

JP3594356B2 - Audio processing device

Info

Publication number: JP3594356B2
Application number: JP08424195A
Authority: JP
Inventors: 和也佐古; 昇治藤本; 博之藤本; 育恵高橋; 良明寺本; 晋太木村
Original assignee: Denso Ten Ltd; Fujitsu Ltd
Current assignee: Denso Ten Ltd; Fujitsu Ltd
Priority date: 1995-04-10
Filing date: 1995-04-10
Publication date: 2004-11-24
Anticipated expiration: 2019-11-24
Also published as: JPH08278797A

Description

【０００１】
【産業上の利用分野】
本発明は音声のディジタル信号を処理する音声処理装置に関し、特に入力信号の精度確保が低コストで実現することができる音声認識装置に関する。
【０００２】
【従来の技術】
図２４は従来の音声処理装置の概略を示す図である。本図（ａ）に示す音声処理装置は、車両に搭載され、音声のアナログ信号を入力してディジタル信号に変換するＡ／Ｄ変換器１（ＡｎａｌｏｇＴｏＤｉｇｉｔａｌＣｏｎｖｅｒｔｅｒ）と、このＡ／Ｄ変換器１に接続され音声認識処理を行うプロセッサ２と、このプロセッサ２に接続されるインタフェース３から構成される。そして、この音声処理装置は、例えば、パワーウインドウに対して、「窓開」、「窓閉」の音声を認識し、オーディオ機器に対しては、「オーディオオン」、「オーディオオフ」の音声を認識し制御を行うものである。本図（ｂ）、（ｃ）は入力信号のレベルに対する実質的ダイナミックレンジを示すが、ノイズレベルが非常に小さい場合には、この実質的ダイナミックレンジは近似的にＳ／Ｎで示される。この場合、Ａ／Ｄ変換器１がｎ＝１６ビットで変換を行う場合には、実質的ダイナミックレンジの最大値は６ｎ＋２＝９８ｄＢとなる。そして、入力レベル変化が大きい二つの入力レベルＡ、Ｂがあり、これに対応する実質的ダイナミックレンジをａ、ｂとする。この場合、Ａ、ＢにＡ≫Ｂの関係があるなら、ａ≫ｂとなる。
【０００３】
【発明が解決しようとする課題】
ところで、車両に搭載される上記音声処理装置では、これを使用するドライバーの声の大小により音声信号の入力レベルが大小し、図示しないマイクロフォンとの距離の大小により音声信号の入力レベルが大小する。
しかしながら、上記音声処理装置では、入力レベルが変化する場合に、例えば、入力レベルが大きいと、入力信号Ｂに関しては実質的ダイナミックレンジｂは十分に大きくとれるが、入力レベルが小さいと、入力信号Ａに関する実質的ダイナミックレンジａは小さく、十分なＳ／Ｎ比が取れない。このため、後段のプロセッサ２において音声認識処理の精度の悪化を招来していたという問題点があった。
【０００４】
一方、Ｓ／Ｎを高くするために、Ａ／Ｄ変換器１として高ビット（例えば１６ビット〜１８ビット以上）のものを使用すると、コストアップという別の問題点を招来する。また、高ビットの調整作業が必要となる。
また一般的なアナログ利得制御回路を入力部に用いる方法もあったが音声区間内で利得が変化し音声信号に歪が加わる場合があるので必ずしも良好な結果は得られなかった。
【０００５】
本発明は、前記問題点に鑑み、入力信号の変化が大きい音声信号を、低ビットのＡ／Ｄ変換器で、高精度に処理することができる音声認識装置を提供することを目的とする。
【０００６】
【課題を解決するための手段】
本発明は、前記問題点を解決するために、次の構成を有する音声処理装置を提供する。すなわち、音声のアナログ信号をデジタル信号に変換して音声認識処理を行う音声処理装置に、音声のアナログ信号の電圧を制御する信号利得調整部と、音声のデジタル信号から音声区間を検出する音声区間処理部と、音声区間のデータを基に利得制御値を導出する利得制御値導出部とが設けられる。フィードバック判断部は前記利得制御値を前記信号利得調整部に設定して音声認識処理の結果を再評価して最適な利得制御値を設定する。
【０００７】
前記音声区間処理部の音声区間は母音として推定される範囲であるようにしてもよい。
前記音声区間処理部の音声区間は子音として推定される範囲であるようにしてもよい。
前記音声区間処理部の音声区間は音声のデジタル信号のレベルを基に求められるようにしてもよい。
【０００８】
前記音声区間処理部の音声区間は音声のデジタル信号のパワーレベルを基に求められるようにしてもよい。
前記音声区間処理部の音声区間は複数の窓に分割されるようにしてもよい。
前記窓の幅が可変長であるようにしてもよい。
前記利得制御値導出部の利得制御値は音声区間のデータに任意の係数を乗算して導出されるようにしてもよい。
【０００９】
前記利得制御値導出部の利得制御値は音声区間のデータに複数の係数から選択した１つの係数を乗算して導出されるようにしてもよい。
前記利得制御値導出部の利得制御値は音声区間のデータのそれぞれに１組の係数のそれぞれを乗算して導出されるようにしてもよい。
前記利得制御値導出部の利得制御値は音声区間のデータのそれぞれに複数の組から選択した１つの組の係数のそれぞれを乗算して導出されるようにしてもよい。
【００１０】
フィードバック判断部の音声認識処理結果の再評価は、前記利得制御値導出部による利得制御値を前記信号利得調整部に設定して得られた認識候補の上位の距離の平均値を用いて行われるが、前記認識候補の上位の距離の平均値が利得制御値設定前の平均値よりも大きい場合には設定前の利得制御値が使用され、この逆の場合には設定後の利得制御値が使用されて行われるようにしてもよい。
【００１１】
フィードバック判断部の音声認識処理結果の再評価は、前記利得制御値導出部による利得制御値を前記信号利得調整部に設定して得られた認識候補の上位の距離の平均値を用いて行われるが、前記認識候補の上位の距離の平均値が最小になる利得制御値が使用されて、行われるようにしてもよい。
フィードバック判断部の音声認識処理結果の再評価は、前記利得制御値導出部による利得制御値を前記信号利得調整部に設定した後の認識の修正、次候補の呼び出し操作回数を用いるが、この操作回数が所定値よりも大きい場合には設定前の利得制御値が使用され、この逆の場合には設定後の利得制御値が使用されて、行われるようにしてもよい。
【００１２】
フィードバック判断部の音声認識処理結果の再評価は、前記利得制御値導出部による利得制御値を前記信号利得調整部に設定した後の認識の修正、次候補の呼び出し操作回数を用いるが、この操作回数が最小になる利得制御値が使用されて、行われるようにしてもよい。
フィードバック判断部は、１単語前の音声区間を用いて得られた利得制御値を各単語の終端検出後に設定するようにしてもよい。
【００１３】
フィードバック判断部は、複数の単語前からの音声区間を用いて得られた利得制御値を複数の単語の終端検出後に設定するようにしてもよい。
前記利得制御値を前記信号利得調整部に設定し、デジタル信号への変換後で音声認識処理前にデジタル信号に前記利得制御値の逆数を乗算するようにしてもよい。
【００１４】
前記利得制御値導出部の利得制御値は、音声区間のデータの最大値を基に求められるようにしてもよい。
前記利得制御値導出部の利得制御値は、音声区間のデータの最大値を基に求められるようにしてもよい。
前記利得制御値導出部の利得制御値は、音声区間のデータの平均値を基に求められるようにしてもよい。
【００１５】
前記利得制御値導出部の利得制御値は、音声区間のデータの絶対値を基に求められるようにしてもよい。
前記利得制御値導出部の利得制御値は、音声区間のデータの完全積分値を基に必要に応じて値をリセットし、求められるようにしてもよい。
前記値はある値以上にならないようにクリップされるようにしてもよい。
【００１６】
前記利得制御値導出部の利得制御値は、音声区間のデータのリーキー積分値を基に求められるようにしてもよい。
前記利得制御値導出部の利得制御値は、音声区間のデータのピークホールド値を基に求められるようにしてもよい。
前記利得制御値導出部の利得制御値は、音声区間のデータのピークホールド時のアタック時間及びリリース時間を基に求められるようにしてもよい。
【００１７】
前記利得制御値導出部の利得制御値は、認識候補の上位の距離の平均値が最小になるように、音声区間のデータのピークホールド時のアタック時間及びリリース時間を変化させて、求められるようにしてもよい。
【００１８】
【作用】
本発明の音声処理装置によれば、音声のデジタル信号から音声区間を検出し、音声区間のデータを基に利得制御値を導出し、前記利得制御値を前記信号利得調整部に設定して音声認識処理の結果を再評価して最適な利得制御値を設定することにより、入力信号レベルのバラツキによらずほぼ一定したＳ／Ｎ比が得られ、信号処理精度の悪化を抑制でき、また比較的簡単な構成で低ビットのＡ／Ｄ変換器を用いることができ、認識率の向上とシステムコストの低減が可能になる。
【００１９】
【実施例】
以下本発明の実施例について図面を参照して説明する。
図１は本発明の実施例に係る音声認識装置の概略を示す図である。本図に示す構成で、図２４と異なるものは、Ａ／Ｄ変換器１の入力段に設けられる信号利得調整部４である。そして、プロセッサ２には、信号利得調整部４の利得を制御するためのフィードバック信号を形成する利得制御部２１及び音声区間処理部、音声認識部などの音声処理機能が設けられる。
【００２０】
図２は図１の信号利得調整部４の構成を示す図である。本図に示すように、信号利得調整部４は、利得を変化して入力した音声信号の電圧を制御しＡ／Ｄ変換器１に出力する電圧制御増幅器４１と、プロセッサ２からのフィードバックのデジタル信号をアナログ信号に変換するＤ／Ａ変換器４２（ＤｉｇｉｔａｌＴｏＡｎａｌｏｇＣｏｎｖｅｒｔｅｒ）と、Ｄ／Ａ変換器４２に接続され高周波成分を除去した後の信号で電圧制御増幅器４１の利得を制御する低域通過フィルタ４３とを具備する。
【００２１】
図３は図１の利得制御部２１の構成を示す図である。本図に示すように、利得制御部２１は、Ａ／Ｄ変換器１に接続され音声区間に処理したデータ群を形成する音声区間処理部２２と、処理データ群に一定の係数を乗算して利得制御値を形成する利得制御値導出部２３と、この利得制御値を前記信号利得調整部４へフィードバックすべきか否かを判断するフィードバック制御部２４とを有する。なお各部の出力信号又は出力データに基づき音声認識を行なう音声認識部２２−１も有している。
【００２２】
図４は図３の音声区間処理部２２を説明する図である。本図に示すように、Ａ／Ｄ変換器１からの離散した音声信号（図４（ｂ）参照）を記憶するバッファメモリ３１と、バッファメモリ３１に記憶された音声信号値、又は算出されたパワー値について一定の閾値以上のブロックの音声区間（図４（ｃ）、（ｄ）参照）を切り出すための音声区間検出部３２と、このようにして切り出された音声区間を記憶する音声区間メモリ３３とからなる。この音声区間メモリ３３に記憶されたデータは、利得制御値導出部２３に出力される。また音声認識部２２−１は１からの入力信号又は音声区間処理部２２の出力信号に基づき認識処理を行ない認識結果を出力する。この時利得制御導出部２３からの利得制御情報を用いて入力信号を補正して使用しても良い。
【００２３】
図５は図４の音声区間検出部３２の変形を説明する図である。本図に示すように、母音部は一般的に子音部に比べて振幅が大きく音素長も長いのでこの特性を使用し一連の入力信号に含まれる母音区間を推定し、この入力値を音声区間検出結果として用いる。例えばこの推定では、振幅が閾値ｔｈｖ１よりも大きい場合が母音区間とされる。
【００２４】
さらに、この入力レベルを二乗してパワーを算出して、この入力値に代わり用いてもよい。
また、上記とは逆に子音の区間を推定し、同様にこの入力レベルを音声区間検出結果として用いる。さらに、この入力レベルを二乗してパワーを算出して、この入力値に代わり用いてもよい。
【００２５】
図６は図４の音声区間処理部２２の第１の変形を示す図である。本図に示すように、音声区間処理部２２の音声区間メモリ３３の後段にそれぞれデータを二乗してパワーを求める二乗部３４と、二乗して得られたパワーデータを記憶するパワーメモリ３５が設けられる。このパワーメモリ３５のパワーデータは、利得制御値導出部２３に出力される。
【００２６】
図７は図４の音声区間処理部２２の第２の変形を示す図である。本図に示すように、音声区間処理部２２の音声区間メモリ３３の後段に、音声区間を複数の窓に分割して記憶する分割メモリ３６が設けられる。この分割メモリ３６に記憶されたデータは、利得制御値導出部２３に出力される。後述するフィードバックの判断の精度を向上させるためである。
【００２７】
図８は図４の音声区間処理部２２の第３の変形を示す図である。本図に示すように、音声区間処理部２２の音声区間メモリ３３の後段に、音声区間を複数の窓に分割して記憶する分割メモリ３６と、分割メモリ３６に記憶されたデータを二乗してパワーを求める二乗部３７と、二乗して得られたパワーデータを記憶するパワーメモリ３８が設けられる。このパワーメモリ３８のパワーデータは、利得制御値導出部２３に出力される。
【００２８】
図９は図４の音声区間処理部２２の第４の変形であって、図８の窓の幅を変化させる例を示す図である。本図に示すように、図９の分割メモリ３６、パワーメモリ３８の窓の幅を変化させる。同様に、図９の分割メモリ３６の幅を変化させてもよい。後述するフィードバックの判断の精度を向上させるためである。
図１０は図３の利得制御値導出部２３の一例を説明する図である。本図に示すように、利得制御値導出部２３では、音声区間処理部２２での処理後の各データ値ｄ０，ｄ１，ｄ２，…，ｄｎに係数ｋ１を乗算する。すなわち、図４の音声区間メモリ３３の入力レベルのデータに係数ｋ１を乗算して係数利得調整部４の利得制御値を形成する。
【００２９】
図６のパワーメモリ３５のパワーデータに係数ｋ１を乗算して利得制御値を形成することも可能である。
さらに図７の分割メモリ３６の入力値のデータに係数ｋ１を乗算して利得制御値を形成することも可能である。
さらに図８の分割メモリ３７のパワーデータに係数ｋ１を乗算して利得制御値を形成することも可能である。
【００３０】
さらに図９の可変幅のパワーメモリ３８のパワーデータに係数ｋ１を乗算して利得制御値を形成することも可能である。なお、可変幅の分割メモリ３６の入力データに係数ｋ１を乗算して利得制御値を形成してもよい。
以上は係数ｋ１を乗算する場合であるが、本図に示すように、ｋ２、ｋ３、…、ｋｎの係数を選択して乗算して利得制御値を形成することをさらに可能にしておく。後述するフィードバックの判断の精度を向上させるためでもある。
【００３１】
以上は、係数を乗算する線形処理について説明したが、次に非線形処理について説明する。
図１１は図３の利得制御値導出部２３の他の例を説明する図である。本図に示すように、音声区間処理部２２での処理後の各データ値ｄ０，ｄ１，ｄ２，…，ｄｎに対して、Ｍａｐ１として、非線形の係数をｋ１０，ｋ１１，ｋ１２， …ｋ１ｎを乗算して利得制御値を形成する。
【００３２】
さらに、Ｍａｐ２，……，Ｍａｐ２として、係数をｋ２０，ｋ２１，ｋ２２， …ｋ２ｎ、……、ｋｎ０，ｋｎ１，ｋｎ２， …ｋｎｎを追加してこれらを選択的に乗算して利得制御値とする。
図１２は非線形係数を使用する場合に音声認識を可能にするための例を説明する図である。本図（ａ）に示すように、利得制御値として非線形の係数を使用する場合には、プロセッサ２の利得制御部２１は、利得制御値ｋｇを決定して信号利得調整部４に設定して音声入力信号ｖｉ・ｋｇとした後にプロセッサ２は、Ａ／Ｄ変換後の信号を逆数倍してｖｉ／ｋｇとして音声認識を行う。本図（ｂ）に示すように、プロセッサ２内では信号ＳもノイズＮも含め元の信号の大きさに復元して信号の不連続性を除去し音声区間検出処理の精度を向上する方法をとっても良い。
【００３３】
図１３は図３のフィードバック判断部２４の一例を説明する図である。本図に示すように、フィードバック判断部２４は、図４の音声区間メモリ３３、図６のパワーメモリ３５、図７の分割メモリ３６、図８のパワーメモリ３８等のデータを音声認識処理する信号処理メインルーチン４１と、音声認識処理された結果としての認識候補Ｎｏ．及び音声認識の程度を表す距離を抽出して記憶する音声認識処理データ部４２と、抽出された認識候補Ｎｏ．のうち音声認識の程度が高いつまり距離が小さいものの平均値を基に、利得制御値の変更の評価を行い利得制御値の決定を行う利得制御値判断部４３とを具備する。
【００３４】
つまり、利得制御値判断部４３は、図１０の利得制御値ｋ１又は図１３のＭａｐ１を用いて、

の制御値とする。例えば、ｍ＝５とする。
【００３５】
Ｒ１＜利得制御値変更前の値なら変更後の利得制御値とする。
さらに、利得制御値判断部４３は、図９の利得制御値ｋ１、ｋ２、…、ｋｎ又は図１１のＭａｐ１、Ｍａｐ２、…、Ｍａｐｎをパラメータとして、一定期間毎にＲ１を求め、パラメータに対してＲ１が最小となるものを最終的な利得制御値とする。
【００３６】
フィードバック判断部２４が動作中にはプロセッサ３の音声認識結果をインタフェース３に出力するのを禁止し、利得制御値決定後に出力するのを許可する様にしてもよい。
図１４は図３のフィードバック判断部２４の他の例を示す図である。本図（ａ）に示すように、インタフェース３には開始スイッチ５１、音声の再入力により修正するスイッチスイッチ５２、次候補を選択する次候補スイッチ５３が設けられ、パワーウインドウ、オーディオ等の制御対象機器６０が接続される。プロセッサ３の利得制御部２１のフィードバック判断部２４は、修正スイッチ５１、次候補スイッチ５３の操作回数Ｃｒをカウントし、このカウントＣｒが所定値ｔｈ１を越える場合には利得制御値ｋ１に変える。
【００３７】
さらに、利得制御値判断部４３は、図９の利得制御値ｋ１、ｋ２、…、ｋｎ又は図１０のＭａｐ１、Ｍａｐ２、…、Ｍａｐｎをパラメータとして、操作回数Ｃｒを求め、このパラメータに対して操作回数Ｃｒが最小となるものを最終的な利得制御値とする。
使用者の操作（内容）や操作回数（音声認識における操作回数、言い直し回数）により信号処理品質（例えば認識率）の推定を行い、利得制御値を算出することが可能になる。
【００３８】
さらに、信号処理の品質を複数回分使用し、平均的な推定値を使用し利得制御値を算出するようにしてもよい。
さらに、通常開始スイッチ５１のオンによりプロセッサ２の処理開始されるが、開始スイッチ５１がオンされる前で本音声処理装置が未使用時に、プロセッサ２内で入力信号を用いて利得制御を行い、信号処理品質を仮に評価し良好な状態を予め制御しておいてもよい。
【００３９】
図１５は利得制御値の設定時期を説明する図である。本図に示すように、１単語前の音声区間を用いて、入力信号データ、パワーデータに係数を乗算して求めた利得制御値は、各単語の終端検出後に、前記信号利得調整部４に、設定される。
図１６は利得制御値の別の設定時期を説明する図である。本図に示すように、複数個前の音声区間を用いて、入力データ、パワーデータに係数を乗算して求めた利得制御値は、複数単語の終端検出後に、前記信号利得調整部４に、設定される。
【００４０】
以上では予め利得制御値を保持していたが、簡略のために、音声区間内のデータから利得制御値を決定する例を、以下に、説明する。
図１７は音声区間内の最大値を用いて利得制御値を決定する例を説明する図である。最大値と利得制御値との関係を予め決めておき、本図に示すように、音声区間内の最大値ｄｉ（１）を求めて、これに対応する利得制御値を算出する。
【００４１】
図１８は音声区間内の最大値を求めるのにピークホールド値を用いて利得制御値を決定する例を説明する図である。本図（ａ）に示すように、区間検出部３２の後段にピークホールド処理部５１を設け、本図（ｂ）に示すように、区間検出部からの離散入力信号列ｖｉに対して、本図（ｃ）に示すように、ｖｉ（Ｌ−１）≦ｖｉ（Ｌ）ならば、ｖｉ（Ｌ）をｖｉ’（Ｌ）とする。
【００４２】
さらに、本図（ｄ）に示すように、次の音声区間での最大値測定のためにリリース時間を制御を、下記式を用いて、行う。
ｖｉ（Ｌ）≦ｖｉ’（Ｌ）・ｋｔ１、ｋｔ１＝０．９９
図１９は図１８の変形を示す図である。本図に示すように、ピークホールド処理部５１の前に低域通過フィルタ（ＬＰＦ）で構成されるアタックタイム処理部５２を設け、さらにピーク処理部５１にはリリース時間の制御部が設けられる。
【００４３】
このアタック時間及びリリース時間を変化させて、図１３のフィードバック判断部２４を介して、最適なアタック時間及びリリース時間の制御を行う。
次に、音声区間内の振幅値の平均値と利得制御値との関係を予め決めておき、音声区間内のデータ値の平均値ｖｉａｖを

求め、これに対応する利得制御値を算出する。
【００４４】
さらに変形として、音声区間内の振幅値の絶対値と利得制御値との関係を予め決めておき、音声区間内のデータの絶対値を
｜ｖｉ（Ｌ）｜、Ｌ＝０，…ｍ
求め、これに対応する利得制御値を算出する。
図２０は音声区間内の完全積分値を用いて利得制御値を決定する例を説明する図である。完全積分値と利得制御値との関係を予め決めておき、本図に示すように、音声区間内の完全積分値ｖｉ’（Ｌ）を
ｖｉ’（Ｌ）＝ｖｉ（Ｌ）＋ｋｘ１・ｖｉ’（Ｌ−１）、ｋｘ１＝０．０９
として求めて、これに対応する利得制御値を算出する。このままでは入力が入るたびにｖｉ’（１）が増大するため一定期間（時間）ごとにｋｘ１を１サンプルだけ０にする。
【００４５】
図２１は図２０の変形を示す図である。本図に示すように、完全積分にレベルクリップにより出力値を制限する。すなわち、ｖｉ’（Ｌ） ≧ｋＬ１のとき、ｖｉ’（Ｌ）＝ｋＬ１とする。
図２２は音声区間内のリーキー積分値を用いて利得制御値を決定する例を説明する図である。リーキー積分値と利得制御値との関係を予め決めておき、本図に示すように、音声区間内の完全積分値ｖｉ’（Ｌ）を
ｖｉ’（Ｌ）＝ｋｘ２・ｖｉ（Ｌ）＋ｋｘ１・ｖｉ’（Ｌ−１）、ｋｘ１＋ｋｘ２≦１
として求めて、これに対応する利得制御値を算出する。ｋｘ１＋ｋｘ２≦１とすることにより、ｖｉ’（Ｌ）の増大傾向を防止する。
【００４６】
図２３は本実施例の効果を説明する図である。本図に示すように、入力信号データのバラツキによらず、ほぼ一定したＳ／Ｎ比が得られ、信号処理精度の悪化を招来することなく、また比較的簡単な構成で低ビット（８〜１２ビット）のＡ／Ｄ変換器を用いることができる。例えば、音声認識装置のおいては認識率の向上とシステムのコストの低減が可能になる。
【００４７】
【発明の効果】
以上説明したように本発明によれば、音声のデジタル信号から音声区間を検出し、音声区間のデータを基に利得制御値を導出し、前記利得制御値を前記信号利得調整部に設定して音声認識処理の結果を再評価して最適な利得制御値を設定するので、入力信号レベルのバラツキによらずほぼ一定したＳ／Ｎ比が得られ、信号処理精度の悪化を抑制でき、また比較的簡単な構成で低ビットのＡ／Ｄ変換器を用いることができ、認識率の向上とシステムコストの低減が可能になる。
【図面の簡単な説明】
【図１】本発明の実施例に係る音声認識装置の概略を示す図である。
【図２】図１の信号利得調整部４の構成を示す図である。
【図３】図１の利得制御部２１の構成示す図である。
【図４】図３の音声区間処理部２２説明する図である。
【図５】図４の音声区間検出部３２の変形を説明する図である。
【図６】図４の音声区間処理部２２の第１の変形を示す図である。
【図７】図４の音声区間処理部２２の第２の変形を示す図である。
【図８】図４の音声区間処理部２２の第３の変形を示す図である。
【図９】図４の音声区間処理部２２の第４の変形であって、図８の窓の幅を変化させる例を示す図である。
【図１０】図３の利得制御値導出部２３の一例を説明する図である。
【図１１】図３の利得制御値導出部２３の他の例を説明する図である。
【図１２】非線形係数を使用する場合に音声認識を可能にするための例を説明する図である。
【図１３】図３のフィードバック判断部２４の一例を説明する図である。
【図１４】図３のフィードバック判断部２４の他の例を示す図である。
【図１５】利得制御値の設定時期を説明する図である。
【図１６】利得制御値の別の設定時期を説明する図である。
【図１７】音声区間内の最大データを用いて利得制御値を決定する例を説明する図である。
【図１８】音声区間内の最大データを求めるのにピークホールド値を用いて利得制御値を決定する例を説明する図である。
【図１９】図１８の変形を示す図である。
【図２０】音声区間内の完全積分値を用いて利得制御値を決定する例を説明する図である。
【図２１】図２０の変形を示す図である。
【図２２】音声区間内のリーキー積分値を用いて利得制御値を決定する例を説明する図である。
【図２３】本実施例の効果を説明する図である。
【図２４】従来の音声処理装置の概略を示す図である。
【符号の説明】
１…Ａ／Ｄ変換器
２…プロセッサ
３…インタフェース
４…信号利得調整部
２１…利得制御部
２２…音声区間処理部
２３…利得制御値導出部
２４…フィードバック判断部
[0001]
[Industrial applications]
The present invention relates to a speech processing apparatus for processing a digital signal of speech, and more particularly to a speech recognition apparatus capable of ensuring the accuracy of an input signal at low cost.
[0002]
[Prior art]
FIG. 24 is a diagram schematically showing a conventional audio processing device. The audio processing device shown in part (a) of the figure is mounted on a vehicle and comprises an A/D converter 1 (analog-to-digital converter) that receives an analog speech signal and converts it into a digital signal, a processor 2 that is connected to the A/D converter 1 and performs speech recognition processing, and an interface 3 connected to the processor 2. This audio processing device recognizes, for example, the utterances “window open” and “window close” for a power window and the utterances “audio on” and “audio off” for an audio unit, and controls those devices accordingly. Parts (b) and (c) of the figure show the substantial dynamic range as a function of the input signal level; when the noise level is very small, this substantial dynamic range is approximately given by the S/N ratio. When the A/D converter 1 converts with n = 16 bits, the maximum substantial dynamic range is 6n + 2 = 98 dB. Suppose there are two input levels A and B that differ greatly, and let a and b be the corresponding substantial dynamic ranges. If A≫B, then a≫b.
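As a quick numerical check of the 6n + 2 rule quoted above, the following Python sketch evaluates it for a few word lengths. The function name `max_dynamic_range_db` is introduced here only for illustration and does not appear in the patent.

```python
def max_dynamic_range_db(n_bits: int) -> int:
    """Approximate maximum dynamic range of an n-bit A/D converter,
    using the 6n + 2 dB rule of thumb quoted in the text."""
    return 6 * n_bits + 2


for bits in (8, 12, 16):
    # 8 -> 50 dB, 12 -> 74 dB, 16 -> 98 dB
    print(bits, "bits:", max_dynamic_range_db(bits), "dB")
```

The 8-bit and 12-bit rows show how much range is given up by a low-bit converter, which is the loss the gain adjustment described later is meant to offset.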
[0003]
[Problems to be solved by the invention]
In the above-described audio processing device mounted on a vehicle, the input level of the speech signal varies with how loudly the driver using the device speaks, and also with the distance between the driver and the microphone (not shown).
However, when the input level changes in the above-described audio processing device, a sufficiently large substantial dynamic range b is obtained for the input signal B when the input level is high, whereas the substantial dynamic range a for the input signal A is small when the input level is low, so that a sufficient S/N ratio cannot be obtained. As a result, the accuracy of the speech recognition processing in the subsequent processor 2 deteriorates.
[0004]
On the other hand, using a high-bit A/D converter (for example, 16 to 18 bits or more) as the A/D converter 1 in order to raise the S/N ratio causes a different problem, namely increased cost. Adjustment work for the high bit count also becomes necessary.
In addition, there was a method of using a general analog gain control circuit for the input unit, but good results were not always obtained because the gain might change in the voice section and the voice signal might be distorted.
[0005]
The present invention has been made in view of the above problems, and an object of the present invention is to provide a speech recognition device that can process a speech signal having a large change in an input signal with a low-bit A / D converter with high accuracy.
[0006]
[Means for Solving the Problems]
To solve the above problems, the present invention provides an audio processing device having the following configuration. An audio processing device that converts an analog speech signal into a digital signal and performs speech recognition processing is provided with a signal gain adjustment unit that controls the voltage of the analog speech signal, a voice section processing unit that detects a voice section from the digital speech signal, and a gain control value deriving unit that derives a gain control value based on the data of the voice section. A feedback determination unit sets the gain control value in the signal gain adjustment unit, re-evaluates the result of the speech recognition processing, and sets the optimal gain control value.
[0007]
The voice section of the voice section processing unit may be a range estimated as a vowel.
The voice section of the voice section processing unit may be a range estimated as a consonant.
The voice section of the voice section processing unit may be obtained based on the level of a voice digital signal.
[0008]
The voice section of the voice section processing unit may be obtained based on a power level of a voice digital signal.
The voice section of the voice section processing unit may be divided into a plurality of windows.
The width of the window may be variable.
The gain control value of the gain control value deriving unit may be derived by multiplying data of a voice section by an arbitrary coefficient.
[0009]
The gain control value of the gain control value deriving unit may be derived by multiplying data of a voice section by one coefficient selected from a plurality of coefficients.
The gain control value of the gain control value deriving unit may be derived by multiplying each of the data of the voice section by each of a set of coefficients.
The gain control value of the gain control value deriving unit may be derived by multiplying each of the data of the voice section by each of a set of coefficients selected from a plurality of sets.
[0010]
The feedback determination unit may re-evaluate the speech recognition result using the average of the distances of the top-ranked recognition candidates obtained after the gain control value from the gain control value deriving unit has been set in the signal gain adjustment unit. If this average is larger than the average before the gain control value was set, the gain control value from before the setting is used; otherwise, the gain control value after the setting is used.
[0011]
Alternatively, the feedback determination unit may re-evaluate the speech recognition result using the average of the distances of the top-ranked recognition candidates obtained after the gain control value from the gain control value deriving unit has been set in the signal gain adjustment unit, and use the gain control value that minimizes this average.
The feedback determination unit may also re-evaluate the speech recognition result using the number of recognition-correction and next-candidate call operations performed after the gain control value from the gain control value deriving unit has been set in the signal gain adjustment unit. If this number of operations is larger than a predetermined value, the gain control value from before the setting is used; otherwise, the gain control value after the setting is used.
[0012]
Alternatively, the feedback determination unit may re-evaluate the speech recognition result using the number of recognition-correction and next-candidate call operations performed after the gain control value from the gain control value deriving unit has been set in the signal gain adjustment unit, and use the gain control value that minimizes this number of operations.
The feedback determination unit may set the gain control value obtained by using the speech section one word before, after detecting the end of each word.
[0013]
The feedback determination unit may set the gain control value obtained from the voice sections of a plurality of preceding words after detecting the end of a plurality of words.
The gain control value may be set in the signal gain adjustment unit, and the digital signal may be multiplied by a reciprocal of the gain control value after conversion into the digital signal and before speech recognition processing.
[0014]
The gain control value of the gain control value deriving unit may be obtained based on a maximum value of data of a voice section.
The gain control value of the gain control value deriving unit may be obtained based on a maximum value of data of a voice section.
The gain control value of the gain control value deriving unit may be obtained based on an average value of data of a voice section.
[0015]
The gain control value of the gain control value deriving unit may be obtained based on an absolute value of data of a voice section.
The gain control value of the gain control value deriving unit may be determined by resetting the value as needed based on the complete integral value of the data of the voice section.
The value may be clipped so as not to exceed a certain value.
[0016]
The gain control value of the gain control value deriving unit may be obtained based on a leaky integral value of data of a voice section.
The gain control value of the gain control value deriving unit may be obtained based on a peak hold value of data of a voice section.
The gain control value of the gain control value deriving unit may be determined based on an attack time and a release time at the time of peak hold of data of a voice section.
[0017]
The gain control value of the gain control value deriving unit may be obtained by varying the attack time and the release time of the peak hold applied to the voice section data so that the average of the distances of the top-ranked recognition candidates is minimized.
[0018]
[Action]
According to the audio processing device of the present invention, a voice section is detected from the digital speech signal, a gain control value is derived based on the data of the voice section, and the gain control value is set in the signal gain adjustment unit, after which the result of the speech recognition processing is re-evaluated and the optimal gain control value is set. As a result, a nearly constant S/N ratio is obtained regardless of variations in the input signal level, deterioration of signal processing accuracy is suppressed, and a low-bit A/D converter can be used with a relatively simple configuration, which improves the recognition rate and reduces the system cost.
[0019]
[Embodiment]
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
FIG. 1 is a diagram schematically illustrating a speech recognition device according to an embodiment of the present invention. The configuration shown in this figure differs from that of FIG. 24 in the signal gain adjustment unit 4 provided at the input stage of the A/D converter 1. In addition, the processor 2 is provided with a gain control unit 21, which forms the feedback signal for controlling the gain of the signal gain adjustment unit 4, and with speech processing functions such as a voice section processing unit and a speech recognition unit.
[0020]
FIG. 2 is a diagram showing the configuration of the signal gain adjustment unit 4 of FIG. 1. As shown in the figure, the signal gain adjustment unit 4 comprises a voltage-controlled amplifier 41 that controls the voltage of the input speech signal by varying its gain and outputs the result to the A/D converter 1, a D/A converter 42 (digital-to-analog converter) that converts the digital feedback signal from the processor 2 into an analog signal, and a low-pass filter 43 that is connected to the D/A converter 42 and controls the gain of the voltage-controlled amplifier 41 with the signal remaining after the high-frequency components have been removed.
[0021]
FIG. 3 is a diagram showing the configuration of the gain control unit 21 of FIG. 1. As shown in the figure, the gain control unit 21 has a voice section processing unit 22 that is connected to the A/D converter 1 and forms a group of data processed into voice sections, a gain control value deriving unit 23 that multiplies the processed data group by a fixed coefficient to form a gain control value, and a feedback determination unit 24 that determines whether this gain control value should be fed back to the signal gain adjustment unit 4. It also has a speech recognition unit 22-1 that performs speech recognition based on the output signals or output data of these units.
[0022]
FIG. 4 is a diagram illustrating the voice section processing unit 22 of FIG. 3. As shown in the figure, the voice section processing unit 22 consists of a buffer memory 31 that stores the discrete speech signal from the A/D converter 1 (see FIG. 4(b)), a voice section detection unit 32 that cuts out the voice sections, that is, the blocks in which the speech signal values stored in the buffer memory 31 or the power values calculated from them are at or above a fixed threshold (see FIGS. 4(c) and 4(d)), and a voice section memory 33 that stores the voice sections cut out in this way. The data stored in the voice section memory 33 is output to the gain control value deriving unit 23. The speech recognition unit 22-1 performs recognition processing based on the input signal from the A/D converter 1 or the output signal of the voice section processing unit 22, and outputs the recognition result. In doing so, it may correct the input signal using the gain control information from the gain control value deriving unit 23.
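The threshold-based cut-out performed by the voice section detection unit 32 can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the block length, the use of the block peak (or mean power) as the block level, and the names `detect_voice_sections`, `block_size` and `use_power` are all assumptions made here.

```python
def detect_voice_sections(samples, threshold, block_size=4, use_power=False):
    """Return (start, end) sample-index pairs covering runs of consecutive
    blocks whose level is at or above the threshold.

    Assumption: the level of a block is its peak absolute amplitude, or its
    mean squared value when use_power is True. The text does not fix this.
    """
    sections = []
    start = None
    for begin in range(0, len(samples), block_size):
        block = samples[begin:begin + block_size]
        if use_power:
            level = sum(x * x for x in block) / len(block)
        else:
            level = max(abs(x) for x in block)
        if level >= threshold:
            if start is None:
                start = begin            # a voice section begins here
        elif start is not None:
            sections.append((start, begin))  # the section ended at this block
            start = None
    if start is not None:
        sections.append((start, len(samples)))
    return sections


buffered = [0, 1, 0, 1, 5, 6, 7, 5, 6, 5, 0, 0, 0, 1, 0, 0]
print(detect_voice_sections(buffered, threshold=4))  # [(4, 12)]
```

The returned index pairs play the role of the contents of the voice section memory 33: only the samples inside them would be passed on to the gain control value derivation.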
[0023]
FIG. 5 is a diagram illustrating a modification of the voice section detection unit 32 of FIG. 4. As shown in the figure, vowels generally have larger amplitudes and longer phoneme durations than consonants, so this property is used to estimate the vowel sections contained in a sequence of input signals, and the input values of those sections are used as the voice section detection result. In this estimation, for example, a section in which the amplitude is larger than a threshold thv1 is taken to be a vowel section.
[0024]
Further, the power may be calculated by squaring the input level and used instead of the input value.
Conversely, a consonant section is estimated, and this input level is similarly used as a voice section detection result. Further, the power may be calculated by squaring the input level and used instead of the input value.
[0025]
FIG. 6 is a diagram showing a first modification of the voice section processing unit 22 of FIG. 4. As shown in the figure, a squaring unit 34 that squares each data item to obtain its power and a power memory 35 that stores the resulting power data are provided downstream of the voice section memory 33 of the voice section processing unit 22. The power data in the power memory 35 is output to the gain control value deriving unit 23.
[0026]
FIG. 7 is a diagram showing a second modification of the voice section processing unit 22 of FIG. 4. As shown in the figure, a divided memory 36 that divides the voice section into a plurality of windows and stores them is provided downstream of the voice section memory 33 of the voice section processing unit 22. The data stored in the divided memory 36 is output to the gain control value deriving unit 23. This is to improve the accuracy of the feedback determination described later.
[0027]
FIG. 8 is a diagram showing a third modification of the voice section processing unit 22 of FIG. 4. As shown in the figure, a divided memory 36 that divides the voice section into a plurality of windows and stores them, a squaring unit 37 that squares the data stored in the divided memory 36 to obtain the power, and a power memory 38 that stores the resulting power data are provided downstream of the voice section memory 33 of the voice section processing unit 22. The power data in the power memory 38 is output to the gain control value deriving unit 23.
[0028]
FIG. 9 is a diagram showing a fourth modification of the voice section processing unit 22 in FIG. 4 and showing an example in which the width of the window in FIG. 8 is changed. As shown in the figure, the widths of the windows of the divided memory 36 and the power memory 38 in FIG. 9 are changed. Similarly, the width of the divided memory 36 in FIG. 9 may be changed. This is for improving the accuracy of the feedback determination described later.
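The window division and power calculation of FIGS. 7 to 9 can be pictured with the short sketch below. It is only an illustration under stated assumptions: the helper names `split_into_windows` and `window_power` are hypothetical, window widths are given as an explicit list so that fixed and variable widths are handled the same way, and the power of a window is taken to be the mean of the squared samples, which the text does not specify.

```python
def split_into_windows(data, widths):
    """Split voice-section data into consecutive windows.

    widths lists the width of each window in samples, so the windows may be
    of equal width (FIG. 7, FIG. 8) or of differing widths (FIG. 9).
    Any samples left over after the last listed width form a final window.
    """
    windows, pos = [], 0
    for width in widths:
        if pos >= len(data):
            break
        windows.append(data[pos:pos + width])
        pos += width
    if pos < len(data):
        windows.append(data[pos:])
    return windows


def window_power(windows):
    """Mean squared value of each window (an assumed definition of power)."""
    return [sum(x * x for x in w) / len(w) for w in windows]


section = [1, 2, 3, 4, 5, 6]
windows = split_into_windows(section, [2, 4])
print(windows)                # [[1, 2], [3, 4, 5, 6]]
print(window_power(windows))  # [2.5, 21.5]
```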
FIG. 10 is a diagram illustrating an example of the gain control value deriving unit 23 of FIG. 3. As shown in the figure, the gain control value deriving unit 23 multiplies each data value d0, d1, d2, ..., dn processed by the voice section processing unit 22 by a coefficient k1. That is, the gain control value for the signal gain adjustment unit 4 is formed by multiplying the input level data in the voice section memory 33 of FIG. 4 by the coefficient k1.
[0029]
It is also possible to form a gain control value by multiplying the power data of the power memory 35 of FIG. 6 by a coefficient k1.
Further, it is possible to form a gain control value by multiplying the data of the input value of the division memory 36 of FIG. 7 by the coefficient k1.
Further, it is also possible to form a gain control value by multiplying the power data of the divided memory 37 of FIG. 8 by a coefficient k1.
[0030]
Further, it is also possible to form a gain control value by multiplying the power data of the variable width power memory 38 in FIG. 9 by a coefficient k1. The gain control value may be formed by multiplying the input data of the variable width divided memory 36 by the coefficient k1.
The above describes multiplication by the coefficient k1, but as shown in the figure, the coefficients k2, k3, ..., kn can also be selected and used as the multiplier to form the gain control value. This too is to improve the accuracy of the feedback determination described later.
[0031]
The linear processing for multiplying the coefficient has been described above. Next, the non-linear processing will be described.
FIG. 11 is a diagram illustrating another example of the gain control value deriving unit 23 of FIG. 3. As shown in the figure, the data values d0, d1, d2, ..., dn processed by the voice section processing unit 22 are multiplied by the nonlinear coefficients k10, k11, k12, ..., k1n, which together form Map1, to produce the gain control value.
[0032]
Further, coefficients k20, k21, k22, ..., k2n through kn0, kn1, kn2, ..., knn are added as Map2, ..., Mapn, and one of these maps is selected and multiplied in to obtain the gain control value.
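The two derivation styles described above, a single selectable coefficient (FIG. 10) and a selectable map of per-element coefficients (FIG. 11), might be sketched as follows. The function names and the zero-based `index` argument are assumptions for illustration, as is the reading that coefficient k1j of a map multiplies data value dj.

```python
def derive_linear(data, coefficients, index=0):
    """Linear case: multiply every data value by one coefficient
    selected from k1, k2, ..., kn (index 0 selects k1)."""
    k = coefficients[index]
    return [k * d for d in data]


def derive_with_map(data, maps, index=0):
    """Nonlinear case: multiply data value d_j by the j-th coefficient of
    the selected map (index 0 selects Map1). Assumes the map has one
    coefficient per data value."""
    selected = maps[index]
    if len(selected) != len(data):
        raise ValueError("map length must match data length")
    return [k * d for k, d in zip(selected, data)]


data = [1, 2, 3]
print(derive_linear(data, [0.5, 2.0], index=1))                # [2.0, 4.0, 6.0]
print(derive_with_map(data, [[1, 0.5, 0.25], [1, 1, 1]], 0))   # [1, 1.0, 0.75]
```

Keeping the coefficient sets in a list makes it easy for a feedback stage to try each candidate in turn and keep the one that scores best.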
FIG. 12 is a diagram illustrating an example that makes speech recognition possible when nonlinear coefficients are used. As shown in part (a) of the figure, when nonlinear coefficients are used for the gain control value, the gain control unit 21 of the processor 2 determines the gain control value kg and sets it in the signal gain adjustment unit 4, so that the speech input signal becomes vi·kg; the processor 2 then multiplies the A/D-converted signal by the reciprocal of kg (written vi/kg) and performs speech recognition on the result. As shown in part (b) of the figure, the processor 2 may in this way restore the signal, including both the signal S and the noise N, to its original magnitude, which removes the discontinuities in the signal and improves the accuracy of the voice section detection processing.
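The benefit of amplifying before conversion and dividing the gain back out afterwards can be seen in a small numerical experiment. The sketch below is an idealized model, not the circuit of the patent: `quantize` is a hypothetical uniform quantizer with a full-scale range of ±1, and the values 0.01, 16 and 8 bits are chosen only for illustration.

```python
def quantize(x, n_bits, full_scale=1.0):
    """Idealized uniform n-bit quantizer (an assumption for illustration)."""
    step = full_scale / (2 ** (n_bits - 1))
    code = round(x / step)
    # Clip to the representable signed range of an n-bit converter.
    code = max(-(2 ** (n_bits - 1)), min(2 ** (n_bits - 1) - 1, code))
    return code * step


def convert_with_gain(v, kg, n_bits):
    """Amplify by kg before the converter, then multiply by 1/kg afterwards
    so that the signal returns to its original magnitude."""
    return quantize(v * kg, n_bits) / kg


quiet_sample = 0.01
direct = quantize(quiet_sample, 8)                 # no gain adjustment
restored = convert_with_gain(quiet_sample, 16, 8)  # gain 16, then 1/16
print(direct, abs(direct - quiet_sample))
print(restored, abs(restored - quiet_sample))
```

With these numbers the directly converted sample lands on 0.0078125, while the amplified and restored sample lands on 0.009765625, which is much closer to 0.01. The same 8-bit converter therefore represents the quiet input more accurately once the gain has been applied and compensated.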
[0033]
FIG. 13 is a diagram illustrating an example of the feedback determination unit 24 of FIG. 3. As shown in the figure, the feedback determination unit 24 comprises a signal processing main routine 41 that performs speech recognition processing on the data in the voice section memory 33 of FIG. 4, the power memory 35 of FIG. 6, the divided memory 36 of FIG. 7, the power memory 38 of FIG. 8, and so on; a speech recognition processing data unit 42 that extracts and stores the recognition candidate numbers and the distances indicating the degree of recognition obtained as the result of the speech recognition processing; and a gain control value determination unit 43 that evaluates a change of the gain control value and decides the gain control value based on the average distance of those extracted recognition candidates whose degree of recognition is high, that is, whose distance is small.
[0034]
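That is, using the gain control value k1 of FIG. 10 or Map1 of FIG. 13, the gain control value determination unit 43 obtains the evaluation value R1
[equation missing from this text; per the surrounding description, R1 is the average distance of the top m recognition candidates]
and uses it to evaluate the control value. For example, m = 5.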
[0035]
If R1 is smaller than its value before the gain control value was changed, the changed gain control value is adopted.
Further, the gain control value determination unit 43 obtains R1 at regular intervals with the gain control values k1, k2, ..., kn of FIG. 9 or Map1, Map2, ..., Mapn of FIG. 11 as parameters, and takes the parameter for which R1 is smallest as the final gain control value.
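The evaluation carried out by the gain control value determination unit 43 might look roughly like the following. This is a sketch under assumptions: R1 is taken to be the plain average of the m smallest candidate distances, and the function names and the dictionary of distances per parameter are hypothetical.

```python
def mean_top_distance(distances, m=5):
    """R1: average distance of the m best (smallest-distance) candidates."""
    top = sorted(distances)[:m]
    return sum(top) / len(top)


def accept_change(r1_before, r1_after):
    """Adopt the changed gain control value only if R1 became smaller."""
    return r1_after < r1_before


def select_parameter(distances_by_parameter, m=5):
    """Return the parameter (for example 'k1', 'k2', ...) whose R1 is smallest."""
    return min(
        distances_by_parameter,
        key=lambda name: mean_top_distance(distances_by_parameter[name], m),
    )


trial_distances = {
    "k1": [9, 8, 7, 6, 5, 4],   # R1 = (4 + 5 + 6 + 7 + 8) / 5 = 6.0
    "k2": [3, 4, 5, 6, 7, 9],   # R1 = (3 + 4 + 5 + 6 + 7) / 5 = 5.0
}
print(select_parameter(trial_distances))  # k2
```

A smaller distance means a closer match to the stored pattern, so minimizing R1 favours the gain setting under which the recognizer's best candidates fit best.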
[0036]
While the feedback determination unit 24 is operating, output of the speech recognition result of the processor 2 to the interface 3 may be inhibited, and output may be permitted after the gain control value has been decided.
FIG. 14 is a diagram showing another example of the feedback determination unit 24 of FIG. 3. As shown in part (a) of the figure, the interface 3 is provided with a start switch 51, a correction switch 52 for making a correction by re-entering the utterance, and a next-candidate switch 53 for selecting the next candidate, and is connected to a controlled device 60 such as a power window or an audio unit. The feedback determination unit 24 in the gain control unit 21 of the processor 2 counts the number of operations Cr of the correction switch 52 and the next-candidate switch 53, and switches to the gain control value k1 when this count Cr exceeds a predetermined value th1.
[0037]
Alternatively, the gain control value determination unit 43 obtains the number of operations Cr for each of the gain control values k1, k2, ..., kn of FIG. 9 or Map1, Map2, ..., and the value that minimizes Cr is taken as the final gain control value.
In this way, the signal processing quality (for example, the recognition rate) can be estimated from what the user operates and how often (the number of switch operations during speech recognition, the number of re-utterances), and the gain control value can be calculated from that estimate.
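A minimal sketch of this operation-count feedback is given below. The event labels, function names, and sample logs are hypothetical. Only the counting of correction and next-candidate operations, the threshold th1, and the choice of the gain with the fewest operations come from the description.

```python
from collections import Counter


def count_operations(events):
    """Cr: number of correction and next-candidate switch operations."""
    counts = Counter(events)
    return counts["correction"] + counts["next_candidate"]


def gain_needs_change(events, th1):
    """True when Cr exceeds the predetermined value th1."""
    return count_operations(events) > th1


def pick_gain_by_operations(events_by_gain):
    """Return the gain control value that produced the fewest operations."""
    return min(events_by_gain,
               key=lambda gain: count_operations(events_by_gain[gain]))


# Hypothetical switch logs recorded under three gain control values.
logs = {
    1.0: ["start", "correction", "correction", "next_candidate"],
    2.0: ["start", "next_candidate"],
    4.0: ["start", "correction", "next_candidate", "next_candidate", "correction"],
}
preferred_gain = pick_gain_by_operations(logs)
```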
[0038]
Further, the signal processing quality may be estimated a plurality of times, and the gain control value may be calculated from the average of the estimates.
Normally, the processing of the processor 2 starts when the start switch 51 is turned on. Before the start switch 51 is turned on, that is, while the speech processing device is not in use, the processor 2 may perform gain control using the input signal, provisionally evaluate the signal processing quality, and bring the device into a good state in advance.
[0039]
FIG. 15 is a diagram illustrating the timing of setting the gain control value. As shown in the figure, a gain control value, obtained by multiplying the input signal data and power data of the voice section one word earlier by a coefficient, is set in the signal gain adjustment unit 4 after the end of each word is detected.
FIG. 16 is a diagram for explaining another timing for setting the gain control value. As shown in the figure, a gain control value, obtained by multiplying the input data and power data of a plurality of preceding voice sections by a coefficient, is set in the signal gain adjustment unit 4 after the end of a plurality of words is detected.
[0040]
In the above description, the gain control value is held in advance. For the sake of simplicity, an example in which the gain control value is determined from data in a voice section will be described below.
FIG. 17 is a diagram illustrating an example in which a gain control value is determined using the maximum value in a voice section. The relationship between the maximum value and the gain control value is determined in advance, and as shown in this figure, the maximum value di (1) in the voice section is obtained, and the corresponding gain control value is calculated.
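A predetermined relationship between the maximum value and the gain control value could, for instance, be held as a lookup table. The sketch below assumes that "maximum" refers to the largest sample magnitude and that quieter speech maps to a larger gain. The table values and names are purely illustrative and do not come from the patent.

```python
from bisect import bisect_left

# Hypothetical table: (upper bound of the section maximum, gain control value).
GAIN_TABLE = [(0.05, 16.0), (0.1, 8.0), (0.25, 4.0), (0.5, 2.0), (1.0, 1.0)]


def gain_from_maximum(section, table=GAIN_TABLE):
    """Look up the gain control value for the largest magnitude in a section."""
    peak = max(abs(value) for value in section)
    bounds = [bound for bound, _ in table]
    # Select the first row whose bound is >= peak; clamp to the last row.
    index = min(bisect_left(bounds, peak), len(table) - 1)
    return table[index][1]


quiet_gain = gain_from_maximum([0.01, -0.02, 0.015])
normal_gain = gain_from_maximum([0.01, -0.2, 0.12])
```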
[0041]
FIG. 18 is a diagram illustrating an example in which a gain control value is determined using a peak hold value to obtain the maximum value in a voice section. As shown in part (a) of the figure, a peak hold processing unit 51 is provided at the stage following the section detection unit 32. As shown in part (b), a discrete input signal sequence vi is supplied from the section detection unit 32. As shown in part (c), if vi(L-1) ≦ vi(L), vi′(L) is set to vi(L).
[0042]
Further, as shown in part (d) of the figure, the release time is controlled using the following expression so that the maximum value in the next voice section can be measured.
vi(L) ≦ vi′(L)·kt1, kt1 = 0.99
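One way to read the peak hold with release control is sketched below. It rests on an assumption that the text does not state explicitly: the held value follows a rising input immediately and is otherwise multiplied by kt1 on every sample. Sample magnitudes are used, and the function name is hypothetical.

```python
def peak_hold(samples, kt1=0.99):
    """Peak hold with a multiplicative release (assumed interpretation).

    A rising input replaces the held value at once; otherwise the held
    value decays by the factor kt1 per sample, so that the peak of the
    next voice section can be measured afresh.
    """
    held = 0.0
    output = []
    for value in samples:
        # Take whichever is larger: the new magnitude or the decayed hold.
        held = max(abs(value), held * kt1)
        output.append(held)
    return output


trace = peak_hold([0.5, 1.0, 0.2, 0.1], kt1=0.5)
```

A kt1 close to 1, such as the 0.99 quoted above, gives a slow release. The 0.5 in the usage line is chosen only to make the decay visible in a few samples.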
FIG. 19 is a diagram showing a modification of FIG. 18. As shown in the figure, an attack time processing unit 52 composed of a low-pass filter (LPF) is provided in front of the peak hold processing unit 51, and a release time control unit is provided in the peak hold processing unit 51.
[0043]
The attack time and the release time are varied, and they are controlled to their optimal values via the feedback determination unit 24 of FIG. 3.
As another example, the relationship between the average of the amplitude values in the voice section and the gain control value is determined in advance, the average value viav of the data values in the voice section is calculated, and the gain control value corresponding to it is calculated.
[0044]
As a further modification, the relationship between the absolute value of the amplitude in the voice section and the gain control value is determined in advance, the absolute value |vi(L)|, L = 0, ..., of the data in the voice section is obtained, and the gain control value corresponding to it is calculated.
FIG. 20 is a diagram illustrating an example in which a gain control value is determined using a complete integral value in a voice section. The relationship between the complete integral value and the gain control value is determined in advance, and as shown in the figure, the complete integral value vi′(L) in the voice section is calculated as
vi′(L) = vi(L) + kx1 · vi′(L-1), kx1 = 0.09
and the gain control value corresponding to it is calculated. Left as it is, vi′(L) increases each time a sample is input, so kx1 is set to 0 for just one sample every fixed period (time).
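The recursion and its periodic reset can be sketched as follows. The function name and the reset_period parameter are hypothetical, and the inputs are assumed to be non-negative level data. The usage line sets kx1 to 1.0, rather than the 0.09 quoted above, only so that the accumulation and the reset are easy to see.

```python
def complete_integral(samples, kx1=0.09, reset_period=None):
    """vi'(L) = vi(L) + kx1 * vi'(L-1), with kx1 forced to 0 periodically.

    reset_period is the assumed number of samples between resets; on a
    reset sample the previous integral is discarded for that one step.
    """
    output = []
    previous = 0.0
    for index, value in enumerate(samples):
        coefficient = kx1
        if reset_period and index > 0 and index % reset_period == 0:
            coefficient = 0.0  # one-sample reset of the accumulated value
        previous = value + coefficient * previous
        output.append(previous)
    return output


growing = complete_integral([1, 1, 1, 1], kx1=1.0)
with_reset = complete_integral([1, 1, 1, 1], kx1=1.0, reset_period=2)
```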
[0045]
FIG. 21 is a diagram showing a modification of FIG. 20. As shown in the figure, the level of the complete integral output is limited. That is, when vi′(L) ≧ kL1, vi′(L) is set to kL1.
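A sketch of the limited integral is shown below. Whether the limited value is also what is fed back into the next step is not stated in the text. This sketch assumes that it is, and the function name is hypothetical.

```python
def limited_integral(samples, kx1, kL1):
    """Complete integral whose output is clamped at the limit kL1.

    Assumption: the clamped value is carried into the next step.
    """
    output = []
    previous = 0.0
    for value in samples:
        previous = min(value + kx1 * previous, kL1)
        output.append(previous)
    return output


clamped = limited_integral([1, 1, 1, 1], kx1=1.0, kL1=2.5)
```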
FIG. 22 is a diagram illustrating an example in which a gain control value is determined using a leaky integral value in a voice section. The relationship between the leaky integral value and the gain control value is determined in advance, and as shown in the figure, the leaky integral value vi′(L) in the voice section is calculated as
vi′(L) = kx2 · vi(L) + kx1 · vi′(L-1), kx1 + kx2 ≦ 1
and the gain control value corresponding to it is calculated. Setting kx1 + kx2 ≦ 1 prevents vi′(L) from growing without bound.
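The leaky integral is a first-order recursive average, and a short sketch makes its bounded behaviour visible. The function name and the coefficient values in the usage line are illustrative assumptions.

```python
def leaky_integral(samples, kx1, kx2):
    """vi'(L) = kx2 * vi(L) + kx1 * vi'(L-1), requiring kx1 + kx2 <= 1."""
    if kx1 + kx2 > 1:
        raise ValueError("kx1 + kx2 must not exceed 1")
    output = []
    previous = 0.0
    for value in samples:
        previous = kx2 * value + kx1 * previous
        output.append(previous)
    return output


settling = leaky_integral([1, 1, 1], kx1=0.5, kx2=0.5)
```

For a constant non-negative input the output approaches, but never exceeds, the input level, which is why no periodic reset is needed here.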
[0046]
FIG. 23 is a diagram for explaining the effect of this embodiment. As shown in the figure, a nearly constant S/N ratio is obtained regardless of variations in the input signal data, the signal processing accuracy does not deteriorate, and a low-bit (8- to 12-bit) A/D converter can be used. In a speech recognition device, for example, this improves the recognition rate and reduces the system cost.
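The effect can be illustrated with the 6n + 2 dB figure that the background section gives for an n-bit converter. The sketch below is a simplified model that assumes quantization noise dominates, so analog noise amplified along with the signal is ignored. The function names and the 30 dB example are hypothetical.

```python
def max_dynamic_range_db(bits):
    """Maximum effective dynamic range of an n-bit A/D converter (6n + 2 dB)."""
    return 6 * bits + 2


def effective_range_db(bits, level_below_full_scale_db, gain_db=0.0):
    """Range left for a signal sitting below full scale, after analog gain."""
    shortfall = max(level_below_full_scale_db - gain_db, 0.0)
    return max_dynamic_range_db(bits) - shortfall


# A 12-bit converter with speech 30 dB below full scale.
without_gain = effective_range_db(12, 30)
with_gain = effective_range_db(12, 30, gain_db=30)
```

In this model the gain stage recovers the range that a quiet input would otherwise lose, which is the behaviour the figure describes.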
[0047]
[Effects of the invention]
According to the present invention as described above, a voice section is detected from the digital voice signal, a gain control value is derived from the voice section data, the gain control value is set in the signal gain adjustment unit, and the result of the speech recognition processing is re-evaluated to set the optimal gain control value. Consequently, a nearly constant S/N ratio is obtained irrespective of variations in the input signal level, deterioration of the signal processing accuracy is suppressed, and a low-bit A/D converter can be used with a simple configuration, which improves the recognition rate and reduces the system cost.
[Brief description of the drawings]
FIG. 1 is a diagram schematically illustrating a speech recognition device according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating the configuration of the signal gain adjustment unit 4 of FIG. 1.
FIG. 3 is a diagram illustrating the configuration of the gain control unit 21 of FIG. 1.
FIG. 4 is a diagram for explaining the voice section processing unit 22 of FIG. 3.
FIG. 5 is a diagram illustrating a modification of the voice section detection unit 32 of FIG. 4.
FIG. 6 is a diagram illustrating a first modification of the voice section processing unit 22 of FIG. 4.
FIG. 7 is a diagram illustrating a second modification of the voice section processing unit 22 of FIG. 4.
FIG. 8 is a diagram illustrating a third modification of the voice section processing unit 22 of FIG. 4.
FIG. 9 is a diagram illustrating a fourth modification of the voice section processing unit 22 of FIG. 4, in which the width of the window of FIG. 8 is changed.
FIG. 10 is a diagram illustrating an example of the gain control value deriving unit 23 of FIG. 3.
FIG. 11 is a diagram illustrating another example of the gain control value deriving unit 23 of FIG. 3.
FIG. 12 is a diagram illustrating an example for enabling speech recognition when a nonlinear coefficient is used.
FIG. 13 is a diagram illustrating an example of the feedback determination unit 24 of FIG. 3.
FIG. 14 is a diagram illustrating another example of the feedback determination unit 24 of FIG. 3.
FIG. 15 is a diagram illustrating the timing of setting the gain control value.
FIG. 16 is a diagram illustrating another timing for setting the gain control value.
FIG. 17 is a diagram illustrating an example in which a gain control value is determined using maximum data in a voice section.
FIG. 18 is a diagram illustrating an example in which a gain control value is determined using a peak hold value to determine the maximum data in a voice section.
FIG. 19 is a diagram showing a modification of FIG. 18.
FIG. 20 is a diagram illustrating an example in which a gain control value is determined using a complete integral value in a voice section.
FIG. 21 is a diagram showing a modification of FIG. 20.
FIG. 22 is a diagram illustrating an example in which a gain control value is determined using a leaky integral value in a voice section.
FIG. 23 is a diagram illustrating the effect of the present embodiment.
FIG. 24 is a diagram schematically showing a conventional audio processing device.
[Explanation of symbols]
1 ... A/D converter
2 ... Processor
3 ... Interface
4 ... Signal gain adjustment unit
21 ... Gain control unit
22 ... Voice section processing unit
23 ... Gain control value deriving unit
24 ... Feedback determination unit

Claims

A voice processing device that converts an analog voice signal into a digital signal and performs speech recognition processing, comprising:
a signal gain adjustment unit (4) that controls the voltage of the analog voice signal;
a voice section processing unit (22) that detects a voice section from the digital voice signal;
a gain control value deriving unit (23) that derives a gain control value based on the data of one or more preceding voice sections; and
a feedback determination unit (24) that sets the gain control value in the signal gain adjustment unit, re-evaluates the result of the speech recognition processing, and sets an optimal gain control value,
wherein the gain control value deriving unit derives the gain control value either by multiplying the data of the voice section by one coefficient selected from a plurality of coefficients, or by multiplying each item of the voice section data by the corresponding coefficient of one set of coefficients selected from a plurality of sets.

The speech recognition device according to claim 1, wherein the feedback determination unit uses, in re-evaluating the speech recognition result, the average distance of the top-ranked recognition candidates obtained with the gain control value from the gain control value deriving unit set in the signal gain adjustment unit, and
when that average distance is larger than the average before the gain control value was set, the gain control value before the setting is used, and when that average distance is smaller than the average before the gain control value was set, the gain control value after the setting is used.

The speech recognition device according to claim 2, wherein the feedback determination unit uses, in re-evaluating the speech recognition result, the average distance of the top-ranked recognition candidates obtained with the gain control value from the gain control value deriving unit set in the signal gain adjustment unit, and the gain control value that minimizes that average distance is used.

The speech recognition device according to claim 2, wherein the feedback determination unit uses, in re-evaluating the speech recognition result, the number of recognition correction and next-candidate call operations performed after the gain control value from the gain control value deriving unit is set in the signal gain adjustment unit, and
when the number of operations is larger than a predetermined value, the gain control value before the setting is used, and when the number of operations is smaller than the predetermined value, the gain control value after the setting is used.

The speech recognition device according to claim 2, wherein the feedback determination unit uses, in re-evaluating the speech recognition result, the number of recognition correction and next-candidate call operations performed after the gain control value from the gain control value deriving unit is set in the signal gain adjustment unit, and
the gain control value that minimizes the number of operations is used.

The speech recognition device according to claim 2, wherein the gain control value is set in the signal gain adjustment unit, and, after conversion into a digital signal and before the speech recognition processing, the digital signal is multiplied by a constant, by the reciprocal of the gain control value, or by a coefficient inversely proportional to the gain control value.

A voice processing device that converts an analog voice signal into a digital signal and performs speech recognition processing, comprising:
a signal gain adjustment unit (4) that controls the voltage of the analog voice signal;
a voice section processing unit (22) that detects a voice section from the digital voice signal;
a gain control value deriving unit (23) that derives a gain control value based on the data of one or more preceding voice sections; and
a feedback determination unit (24) that sets the gain control value in the signal gain adjustment unit, re-evaluates the result of the speech recognition processing, and sets an optimal gain control value,
wherein the feedback determination unit sets a gain control value obtained using the voice section one word earlier after the end of each word is detected, or sets a gain control value obtained using the voice sections of a plurality of preceding words after the end of the plurality of words is detected.

A voice processing device that converts an analog voice signal into a digital signal and performs speech recognition processing, comprising:
a signal gain adjustment unit (4) that controls the voltage of the analog voice signal;
a voice section processing unit (22) that detects a voice section from the digital voice signal;
a gain control value deriving unit (23) that derives a gain control value based on the data of one or more preceding voice sections; and
a feedback determination unit (24) that sets the gain control value in the signal gain adjustment unit, re-evaluates the result of the speech recognition processing, and sets an optimal gain control value,
wherein the gain control value deriving unit obtains the gain control value by changing the attack time and the release time of the peak hold data of the voice section so that the average distance of the top-ranked speech recognition candidates is minimized.

A voice processing device that converts an analog voice signal into a digital signal and performs speech recognition processing, comprising:
a signal gain adjustment unit (4) that controls the voltage of the analog voice signal;
a voice section processing unit (22) that detects a voice section from the digital voice signal;
a gain control value deriving unit (23) that derives a gain control value based on the data of one or more preceding voice sections; and
a feedback determination unit (24) that sets the gain control value in the signal gain adjustment unit, re-evaluates the result of the speech recognition processing, and sets an optimal gain control value,
wherein the gain control value deriving unit determines the gain control value based on data that has been corrected and restored using the reciprocal of the gain control value in use for the input signal at the time of derivation.

A voice processing device that converts an analog voice signal into a digital signal and performs speech recognition processing, comprising:
a signal gain adjustment unit (4) that controls the voltage of the analog voice signal;
a voice section processing unit (22) that detects a voice section from the digital voice signal;
a gain control value deriving unit (23) that derives a gain control value based on the data of one or more preceding voice sections; and
a feedback determination unit (24) that sets the gain control value in the signal gain adjustment unit, re-evaluates the result of the speech recognition processing, and sets an optimal gain control value,
wherein the voice section processing unit performs voice section detection based on input data that has been corrected and restored using the reciprocal of the gain control value in use for the input signal at the time of the detection processing, and the speech recognition processing derives its result using the data before restoration or data multiplied by a coefficient different from the reciprocal.