JP2004219918A

JP2004219918A - Speech recognition environment judging method

Info

Publication number: JP2004219918A
Application number: JP2003009683A
Authority: JP
Inventors: Hiroki Yamamoto; 寛樹山本; Makoto Hirota; 誠廣田; Toshiaki Fukada; 俊明深田
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2003-01-17
Filing date: 2003-01-17
Publication date: 2004-08-05

Abstract

PROBLEM TO BE SOLVED: To provide a speech recognition environment judging method capable of judging whether the environment in which a speech recognition device is used is suitable for speech recognition processing.

SOLUTION: The speech recognition device has a speech recognition environment judging function that judges the use environment of a speech recognition device performing speech recognition with an acoustic model. Speech is first input, and it is judged whether the input speech is speech to be recognized. The similarity between the speech judged to be a recognition target and the background noise present when the acoustic model was generated is then calculated, and the use environment of the speech recognition device is judged from the calculated similarity.

COPYRIGHT: (C)2004, JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は、音声認識装置による音声認識が適切に行える環境であるか、音声認識には厳しい環境であるかといった音声認識の使用環境の良否を判定する技術に関する。
【０００２】
【従来の技術】
音声認識は、従来、キーボードやボタン等を手で操作することによりデータ入力や指示を可能としていた入力手段に代わって、音声を使用することにより誰でも簡単に使える入力手段として様々な場面での利用が期待されている。例えば、近年、音声で文章を入力するディクテーションソフト、電話応答を自動で行う通話システム、呼びかけると返事をする玩具等に至るまで、実にさまざまな分野で、さまざまな使用者を対象にした音声認識技術が実用化され、利用されるようになってきている。
【０００３】
このように、音声認識は、認識性能が高い場合には非常に便利な入力手段として利用できる反面、認識性能が低い場合には誤認識が発生して、従来の手で操作する入力手段よりも非効率的な場合がある。音声認識における認識率の低下の要因としては、様々な要因が考えられるが、なかでも使用環境の背景雑音による影響が極めて大きい。例えば、オフィス環境のような比較的静かで背景雑音が少ない環境では正しく認識できる音声認識装置であっても、人混みや騒音の激しい場所で利用すると認識率が極端に低下するといった場合がある。
【０００４】
また、音声認識装置の種類によっても雑音への耐性が異なっており、ある音声認識装置で認識できる場合であっても、別の音声認識装置では全く認識できないといった場合もある。例えば、カーナビゲーションシステム等で利用されている音声認識装置は走行中での自動車内でも十分な性能を発揮する場合が多いが、走行中の自動車内でディクテーションソフトによって文章を入力しようとしても正しく入力できない場合がある。
【０００５】
上述したように、音声認識装置における認識性能が使用環境によって劣化することから、音声認識にとって好ましくない環境であることが事前にわかっていれば、アプリケーションによっては当初からユーザに対して音声入力手段以外の手段で入力させるという選択をさせることが可能である。また、ユーザが音声認識装置の性能劣化の原因が背景雑音によるものであると予測可能な場合には、静かな場所へ移動したり、発声をより明瞭にしたり、雑音源を自力で除去する等の対応をとることができる。このように、ユーザが意図している装置への情報等の入力を音声で入力させたり、手入力で入力させるといった適切な指示を入力時の環境に応じて判定してユーザに通知することにより、ユーザは環境に応じた円滑な入力操作を実行することが可能となる。
【０００６】
従来から、音声認識装置のユーザに対して、音声入力が可能か否かを通知する方法がある（例えば、特許文献１参照）。特許文献１に記載の方法によれば、音声認識装置に音声入力用とノイズ入力用の二つのマイクロフォンを備え、入力音声と入力ノイズのパワーの大きさの差に基づいて音声入力可能か否かを判定している。すなわち、入力音声のパワーが入力ノイズのパワーよりも大きく、その差が大きい場合は音声入力可能と判定し、両者の差が小さい場合は、さらに入力ノイズ波形の相関を調べ、入力ノイズのパワーの大きさと合わせて総合的に音声認識が可能か否かを判定している。
【０００７】
【特許文献１】
特開平１０−２４０２９１号公報
【０００８】
【発明が解決しようとする課題】
しかしながら、一般に利用されているパーソナルコンピュータ（以下、「ＰＣ」と称す。）や音声入力を備えた機器では、複数のマイクロフォンを搭載することは少ないため、従来の方法を一般的なＰＣ等に対してそのまま適用することはできない。すなわち、新たにマイクロフォンを増設するなどのコストが余分にかかることになる。
【０００９】
一方、従来から音声認識に用いる音響モデルを雑音の混入した音声から作成する方法が知られているように、入力音声の認識に音響モデルを使用することによって雑音の大きさが大きい場合であっても入力音声の認識が可能な場合があり、この音響モデルを使用する方法は背景雑音に強いという特性を有している。
【００１０】
本発明は、このような事情を考慮してなされたものであり、複数のマイクロフォンを必要とせず、音声認識装置の使用環境が音声認識処理にとって適切か否かを好適に判定することができる音声認識環境判定方法を提供することを目的とする。
【００１１】
【課題を解決するための手段】
上記目的を達成するため、本発明に係る音声認識環境判定方法は、音響モデルを用いて音声認識を行う音声認識装置の使用環境を判定する音声認識環境判定方法であって、
音声を入力する入力工程と、
入力された前記音声が音声認識対象の音声であるか否かを判定する音声判定工程と、
音声認識対象の音声と判定された前記音声と前記音響モデル作成時の背景雑音との類似度を算出する類似度算出工程と、
算出された前記類似度に基づいて前記音声認識装置の使用環境を判定する環境判定工程とを有することを特徴とする。
【００１２】
【発明の実施の形態】
以下、図面を参照して、本発明に係る音声認識装置の使用環境を判定する方法に関する実施の形態について詳細に説明する。
【００１３】
＜第１の実施形態＞
図１は、本発明の第１の実施形態に係る音声認識環境判定機能を備える音声認識装置の構成を示すブロック図である。
【００１４】
図１に示すように、第１の実施形態に係る音声認識装置において、１００は、背景雑音やユーザの入力音声等を入力するマイクロフォンであり、本音声認識装置では背景雑音及び入力音声とも同一もマイクロフォンから入力される。また、２００は、マイクロフォン１００を通して入力された音声が、ユーザによる入力音声であるか否かを判定する音声判定部である。以下、本実施形態では、ユーザによる入力音声以外の音声を背景雑音とみなす。
【００１５】
また、５００は、本音声認識装置が音声認識環境判定機能を動作するためのプログラム及び音声認識環境判定機能を動作するために必要なデータや動作の過程で生成されるデータを一時的に格納するＲＯＭ、ＲＡＭ、ハードディスク等で構成される記憶装置である。尚、記憶装置５００には、さらに、音響モデル作成時（学習時）の背景雑音の特徴が雑音モデル５０１として記憶されている。例えば、音響モデル学習時の雑音からＦＦＴ分析などの音響分析方法を用いてスペクトルを求めておき、それを予め記憶装置５００に雑音モデル５０１として記憶しておく。尚、音響モデル作成時の雑音のスペクトルについては、数秒〜数十秒程度の雑音から求めた時間平均スペクトルを求めておくことが望ましい。
【００１６】
また、記憶装置５００には、雑音モデル５０１以外にも、音声認識環境判定機能を備える本音声認識装置が、一般的な音声認識装置として機能する際に必要な装置全体を制御するプログラム、音声認識に必要な音響モデル、認識辞書等も記憶されている。
【００１７】
さらに、３００は、入力された背景雑音と音響モデル学習時の背景雑音との類似度を算出する類似度算出部である。さらにまた、４００は、類似度算出部３００で算出された類似度に基づいて、音声認識の使用環境を判定する環境判定部である。さらにまた、６００は、マイクロフォン１００から入力された音声が雑音ではなくユーザによる入力音声であると音声判定部２００で判定された場合、当該入力音声を認識する音声認識部である。さらにまた、本音声認識装置は、本装置全体の制御を行う制御部７００と、音声入力以外の手段でデータを入力したり、本装置に対して指示を付与するためのキーボード、マウス等で構成される入力装置８００と、ディスプレイやスピーカ等で実現される音声認識実行可能性の判定結果や音声認識の実行結果等を出力する出力装置９００とから構成される。
【００１８】
すなわち、本実施形態に係る音声認識装置には、後述するように、音響モデルを用いて音声認識を行う音声認識装置の使用環境を判定する音声認識環境判定機能が備わっており、音声を入力し、入力された音声が音声認識対象の音声であるか否かを判定し、音声認識対象の音声と判定された音声と音響モデル作成時の背景雑音との類似度を算出し、算出された類似度に基づいて音声認識装置の使用環境を判定することを特徴とする。
【００１９】
次に、図１に示すような構成を備えた本実施形態に係る音声認識環境判定機能を備える音声認識装置の動作例について説明する。図２は、本発明の第１の実施形態に係る音声認識環境判定機能を備える音声認識装置の動作手順を説明するためのフローチャートである。以下、図２に従って、本音声認識装置全体の処理の流れについて説明する。
【００２０】
まず、本装置が起動され（ステップＳ１０１）、マイクロフォン１００を通じてユーザの音声又は背景雑音の取り込みが開始される（ステップＳ１０２）。この取り込み処理は、ユーザが入力装置８００を用いて指示したり、音声を感知することによって自動的に感知した音声を取り込むようにしてもよい。
【００２１】
そして、マイクロフォン１００による音声取り込み開始後、音声判定部２００において音声認識が開始されたかどうかが判断される（ステップＳ１０３）。例えば、本実施形態では、キーボード、マウス、ボタン等の入力装置７００を用いてユーザが予め決められた操作（例えば、スペースキーの押下等の操作）を行った場合に「音声認識開始」であると判断されるようにしておく。また、逆に、上記のような予め決められた操作以外の場合が音声認識可能状態であって、取り込んだ音声を自動的に認識するようにして、上記操作が行われた場合に取り込んだ音声を音声認識用以外の音声（例えば、背景雑音）と判定するようにしてもよい。
【００２２】
さらに、ユーザの発声を検出することによって音声認識を自動的に開始するようにしてもよい。このような場合は、一般的な音声認識処理で用いられているような音声検出方法を用いて、音声認識部６００が音声を検出した段階を音声認識開始と判断するようにすればよい。一般に、音声認識では、１５〜３０ミリ秒分の音声データを一塊として５〜１５ミリ秒ごとにオーバーラップさせながら処理が行われる。ここで、一度に処理するデータの長さをフレーム長、オーバーラップする際にずらす長さをフレーム周期という。図４は、本実施形態に係る音声処理を実行する際のフレームの概念を説明するための図である。例えば、音声認識の開始を検出する処理は、フレーム周期ごとに行っても良いし、数フレームごとに行って処理量を落とすようにしても良い。
【００２３】
ステップＳ１０３において、音声判定部２００で音声認識開始を判断した結果、音声認識を開始したと判断された場合（Ｙｅｓ）、取り込んだ音声に対して音声認識部６００によって音声認識が行われる（ステップＳ１０８）。そして、制御部７００は、音声認識の結果に基づいて、事前に決められた手順に従って本音声認識装置の各種制御を行う（ステップＳ１０９）。ここで、各種制御とは、例えば、音声認識結果を出力装置９００に出力したり、認識結果をアプリケーションに送るといった制御部７００による制御がある。そして、各種制御の後、ステップＳ１０３に戻って、再び音声認識が開始されたか否かを判断する。
【００２４】
一方、ステップＳ１０３において、音声判定部２００によって音声認識が開始されていないと判断された場合（Ｎｏ）、マイクロフォン１００から取り込んだ音声を雑音として取り込む（ステップＳ１０４）。次に、類似度算出部３００において、入力された雑音と記憶装置５００に雑音モデル５０１として予め記憶されている音響モデル作成時の雑音との類似度が算出される（ステップＳ１０５）。
【００２５】
ここで、本実施形態においては、ステップＳ１０５で算出される類似度として、例えば、
（１）入力された雑音と音響モデル学習時の雑音とのスペクトル距離の逆数、
（２）音響モデル学習時の雑音をＨＭＭ（ＨｉｄｄｅｎＭａｒｋｏｖＭｏｄｅｌ）でモデル化したときの当該ＨＭＭに対する入力された雑音の尤度、
等を用いることができる。
【００２６】
次に、上述した（１）及び（２）に例示されるそれぞれの類似度を算出する場合の実施形態について詳述する。
【００２７】
［スペクトル距離の逆数を類似度として用いる場合］
入力された雑音と音響モデル学習時の雑音とのスペクトル距離Ｄを求め、その逆数を類似度と定義する。すなわち、類似度Ｌは、
【００２８】
【数１】

で定義される。
【００２９】
すなわち、本実施形態に係る音声認識環境判定方法では、ステップＳ１０３の音声判定ステップにより音声認識対象の音声と判定されなかった音声（すなわち、入力された雑音）のスペクトルと音響モデル作成時の背景雑音のスペクトルとに基づいて算出されたスペクトル距離Ｄの逆数Ｌを類似度として、ステップＳ１０５の類似度算出ステップを実行することを特徴とする。
【００３０】
ここで、２つのスペクトル間の距離（スペクトル距離）Ｄを求める方法について説明する。
【００３１】
尚、以下では、２つのスペクトルをスペクトルＡ、Ｂとし、
Ｉ：スペクトルを構成する要素数、
ｘ_Ｋ（ｉ）：スペクトルＫのｉ番目のスペクトル強度（１≦ｉ≦Ｉ）
μ_Ｋ（ｉ）：スペクトルＫのｉ番目のスペクトル強度の平均（１≦ｉ≦Ｉ）
σ_Ｋ（ｉ）：スペクトルＫのｉ番目のスペクトル強度の分散（１≦ｉ≦Ｉ）
とする。
【００３２】
次に、スペクトルＡに関しては、１フレーム分のスペクトルが求まる場合と、数フレーム分の時間平均スペクトルが求まる場合とについて説明する。また、スペクトルＢに関しては、時間平均スペクトルが求まっている場合について説明する。
【００３３】
ここで、スペクトル距離として以下のような距離を用いることができる。
（ａ）ユークリッド距離
【００３４】
【数２】

（ｂ）スペクトルＢの分散を考慮した距離
【００３５】
【数３】

（ｃ）スペクトルＡ、スペクトルＢの分散を考慮した距離
【００３６】
【数４】

本実施形態では、上記（ａ）〜（ｃ）のいずれかで定義されるスペクトル距離を用いる。
【００３７】
そして、音響モデル学習時の雑音のスペクトルを求めたときと同じ音響分析手法を用いて、入力された雑音のスペクトルを算出する。ここで、入力された雑音のスペクトルをスペクトルＡ、学習時の雑音のスペクトルをスペクトルＢとすると、上記（ａ）〜（ｃ）のいずれかの方法でスペクトル距離Ｄを計算することができる。
【００３８】
すなわち、本実施形態に係る音声認識環境判定方法では、ステップＳ１０５の類似度算出ステップで、スペクトル距離を算出する際に、音響モデル作成時の背景雑音のスペクトル（スペクトルＢ）の分散、又は、音声認識用の音声と判定されなかった音声のスペクトル（スペクトルＡ）の分散及び音響モデル作成時の背景雑音のスペクトル（スペクトルＢ）の分散を考慮して、スペクトル距離を算出することを特徴とする。
【００３９】
例えば、音声認識開始を検出する処理（ステップＳ１０３）をフレーム周期ごとに行う場合は、入力雑音のスペクトルは１フレームごとに求まるため、スペクトルＡの時間平均や分散を用いないｄ_ａ１、ｄ_ａ２をスペクトル距離として用いる。
【００４０】
一方、音声認識開始を検出する処理（ステップＳ１０３）をＴフレームごとに行う場合は、Ｔフレーム分の入力雑音のスペクトルが求まる。この場合、Ｔフレーム分のスペクトルの時間平均及び分散を計算することによって、ｄ_ａ２、ｄ_ｂ２、ｄ_ｃ２のいずれかの方法でスペクトル距離を求めることができる。
【００４１】
また別の方法として、各フレームごとにスペクトル距離をｄ_ａ１、ｄ_ｂ１、ｄ_ｃ１のいずれかで求め、その平均値をスペクトル距離としてもよい。この場合のスペクトル距離Ｄは、次に示すようになる。
【００４２】
【数５】

【００４３】
ここで、ｄ（ｔ）はｔフレームに入力された雑音スペクトルと学習時の雑音スペクトルの距離であり、各フレームにおけるスペクトル距離ｄ（ｔ）は、上述したｄ_ａ１、ｄ_ｂ１、ｄ_ｃ１のいずれかの方法で求めたものである。
【００４４】
また、その他の方法として、次に示す距離Ｄを用いても良い。
【００４５】
【数６】

【００４６】
これは、各フレームにおけるスペクトル距離のうち最も距離が遠く最大スペクトル距離、すなわち入力時と学習時の雑音が最も似ておらず、最小の類似度である時のスペクトル距離を用いるものである。
【００４７】
すなわち、本実施形態に係る音声認識環境判定方法では、ステップＳ１０５の類似度算出ステップにおいて、音声認識用の音声と判定されなかった音声（入力された雑音）の複数フレームの各フレームごとにスペクトル距離を算出し、各フレームごとに算出されたスペクトル距離の平均スペクトル距離又は最大スペクトル距離を算出し、平均スペクトル距離の逆数又は最大スペクトル距離の逆数を類似度とすることを特徴とする。
【００４８】
次に、音響モデルの学習時雑音が複数種類存在する場合について説明する。これは、いわゆるマルチコンディショントレーニングと言われる様々な環境で収録した音声で学習して雑音耐性を高めた音響モデルや、雑音環境ごとに用意した複数の音響モデルを使用する場合のスペクトル距離の計算方法である。
【００４９】
例えば、学習時の雑音が、オフィス雑音、自動車内雑音及び人ごみの雑音の３種類の雑音であった場合、以下のようにしてスペクトル距離Ｄを求める。まず、
スペクトルＢ１：オフィス雑音のスペクトル
スペクトルＢ２：自動車内雑音のスペクトル
スペクトルＢ３：人ごみの雑音のスペクトル
とし、入力雑音のスペクトルＡとスペクトルＢ１、Ｂ２、Ｂ３との間のスペクトル距離をそれぞれ前述したｄ_ａ１、ｄ_ａ２、ｄ_ｂ１、ｄ_ｂ２、ｄ_ｃ１、ｄ_ｃ２のいずれかの方法で求め、これらのうち最も近い距離をスペクトル距離Ｄとする。すなわち、学習時にＮ種類の雑音を用いた場合のスペクトル距離Ｄは次のように算出する。
【００５０】
【数７】

【００５１】
ここで、ｄ（Ａ，Ｂｎ）は入力雑音のスペクトルＡと学習時のｎ番目のスペクトルＢｎとのスペクトル距離であり、それぞれｄ_ａ１、ｄ_ａ２、ｄ_ｂ１、ｄ_ｂ２、ｄ_ｃ１、ｄ_ｃ２のいずれかの方法で求める。
【００５２】
すなわち、本実施形態に係る音声認識環境判定方法では、ステップＳ１０５の類似度算出ステップにおいて、音響モデル作成時の背景雑音Ｂｎが複数存在する場合に、背景雑音ごとに算出されたスペクトル距離ｄ（Ａ，Ｂｎ）のうちの最小スペクトル距離の逆数を類似度とすることを特徴とする。
【００５３】
本実施形態では、上述したような様々なスペクトル距離Ｄの計算方法によって求めたものを使用することが可能である。
【００５４】
ここで、上記いずれの計算方法においても、求めたスペクトル距離Ｄについて、さらに入力時雑音と音響モデルの学習に用いていない雑音とのスペクトル距離で正規化するようにしても良い。すなわち、音響モデルの学習に用いていない雑音のスペクトルをＣとすると、正規化したスペクトル距離Ｄ’は以下のように定義できる。
【００５５】
【数８】

【００５６】
ここで、Ｄ（Ｘ，Ｙ）はスペクトルＸとスペクトルＹのスペクトル距離であり、上述したいずれの方法を用いて計算しても良い。また、この場合の類似度は同様にＤ’の逆数となる。
【００５７】
すなわち、本実施形態に係る音声認識環境判定方法では、ステップＳ１０５の類似度算出ステップにおいて、音声認識用の音声と判定されなかった音声（入力時雑音）と音響モデル作成時の背景雑音とのスペクトル距離を、入力時雑音と音響モデル作成時に用いていない背景雑音とのスペクトル距離を用いて正規化し、正規化されたスペクトル距離の逆数を類似度とすることを特徴とする。
【００５８】
いずれの方法においても、使用するスペクトル距離計算に必要な学習時の雑音のスペクトルのパラメータを雑音モデル５０１として記憶しておく。例えば、ｄ_ａ１の場合は学習時の雑音の時間平均スペクトルを雑音モデル５０１として記憶する。また、学習時の分散を考慮したｄ_ｂ１で求める場合は、時間平均スペクトル及び分散を雑音モデル５０１として記憶する。さらに、Ｎ種類の雑音でマルチコンディショントレーニングした音響モデルを使用する場合においては、Ｎ種類分の平均スペクトル及び分散を雑音モデル５０１として記憶する。
【００５９】
尚、以上述べた実施形態では、スペクトルの種類について特に定義していないが、振幅スペクトル、パワースペクトル、対数スペクトル、スペクトル包絡等の一般に音響分析で用いられているスペクトル表現のいずれの場合でも、上述の方法で距離を算出することができる。また、スペクトルに限らず、音声認識で用いられる他のパラメータ、例えば、ケプストラム（ｃｅｐｓｔｒｕｍ）等を用いて同様の距離計算を行って類似度を定義できることは言うまでもない。
【００６０】
［学習時の雑音をモデル化したＨＭＭの尤度を類似度として用いる場合］
次に、類似度として学習時の雑音をモデル化したＨＭＭに対する入力雑音の対数尤度を用いる方法について説明する。以下では、学習時の雑音をモデル化したＨＭＭをモデルＢ、モデルＢに対するフレームｔの入力雑音の対数尤度をＰ_Ｂ（ｔ）とした場合の類似度Ｌの定義を示す。
【００６１】
ここで、１フレームごとに類似度を求める場合は、
【００６２】
【数９】

で示すようにフレームｔの対数尤度により類似度を定義する。
【００６３】
すなわち、本実施形態に係る音声認識環境判定方法では、ステップＳ１０５の類似度算出ステップにおいて、音響モデル作成時の背景雑音をモデル化した隠れマルコフモデルＨＭＭに対する入力雑音（音声認識用の音声と判定されなかった音声）の対数尤度を類似度とすることを特徴とする。
【００６４】
また、Ｔフレームごとに類似度を求める場合であって、Ｔフレームの対数尤度和を類似度とする場合は、
【００６５】
【数１０】

により類似度を定義する。
【００６６】
さらに、Ｔフレームごとに類似度を求める場合であって、Ｔフレーム中の最小の対数尤度を類似度とする場合は、
【００６７】
【数１１】

により類似度を定義する。
【００６８】
すなわち、本実施形態に係る音声認識環境判定方法では、ステップＳ１０５の類似度算出ステップにおいて、入力雑音（音声認識用の音声と判定されなかった音声）の複数フレームに対する類似度を算出する際に、各フレームごとに音響モデル作成時の背景雑音をモデル化したＨＭＭに対する音声の対数尤度を求め、入力雑音の複数フレームの和又は当該複数フレーム中最小となる対数尤度を類似度とすることを特徴とする。
【００６９】
さらにまた、学習時雑音がＮ種類ある場合は、
【００７０】
【数１２】

により類似度を定義する。
【００７１】
ここでｌ（Ｂｎ）は、学習時のｎ番目の雑音Ｂｎに対する入力雑音の類似度であって、前述したいずれの方法で求めても良い。
【００７２】
すなわち、本実施形態に係る音声認識環境判定方法では、ステップＳ１０５の類似度算出ステップにおいて、音響モデル作成時の背景雑音が複数存在する場合に、背景雑音ごとに求めた対数尤度のうち最大となる対数尤度を類似度とすることを特徴とする。
【００７３】
さらに、求めた類似度Ｌに対して、さらに音響モデルの学習に用いていない雑音に対する類似度で正規化しても良い。この場合、音響モデルの学習に用いていない雑音のモデルをＣとすると、正規化した類似度Ｌ’は以下のように定義できる。
【００７４】
【数１３】

【００７５】
すなわち、本実施形態に係る音声認識環境判定方法では、ステップＳ１０５の類似度算出ステップにおいて、音響モデル作成時の背景雑音をモデル化したＨＭＭに対する入力雑音（音声認識用の音声と判定されなかった音声）の対数尤度を、当該音響モデル作成時に用いていない背景雑音をモデル化したＨＭＭに対する入力雑音の対数尤度で正規化した値を類似度とすることを特徴とする。
【００７６】
以上説明した類似度の計算に必要な雑音をモデル化したＨＭＭを、雑音モデル５０１として記憶装置５００に記憶しておく。
【００７７】
そして、環境判定部４００において、算出された類似度Ｌに基づいて、音声認識の使用環境を判定する（ステップＳ１０６）。その結果、双方の雑音がよく似ている場合、すなわち、類似度Ｌが大きい場合は、使用されている環境が音響モデル作成時に想定していた環境に近いと考えられ、十分な音声認識性能が発揮できるので音声認識に良好な環境であると判定する。一方、雑音が似ていない場合、すなわち、類似度Ｌが小さい場合は、入力音声の認識性能が劣化する可能性があるため音声認識の使用環境としては良好ではない（劣悪で認識できない）と判定する。
【００７８】
具体的には、類似度Ｌに対して閾値Ｔｈを設け、計算した類似度ＬがＴｈ以上である場合には音声認識に良好な環境、Ｔｈ以下である場合には音声認識には向いていない環境であると判定するようにする。尚、１つの閾値を用いて環境の良否を判定するだけでなく、複数の閾値を用いて音声認識装置の使用環境の状態を複数の段階で判定することも可能である。
【００７９】
そして、ステップＳ１０６の判定結果に基づいて、出力装置９００に判定結果を出力することによって使用者に通知することができる（ステップＳ１０７）。
例えば、ディスプレイ上に表示しても良いし、機器に備えられたＬＥＤの点滅として結果を出力するようにしても良い。
【００８０】
図３は、音声認識環境の判定結果をディスプレイ上に表示する場合の一例を示す図である。例えば、図３（ａ）には、使用環境を３段階で判定し、ディスプレイ上に「◎」（良好）、「○」（普通）、「×」（劣悪）といったマークを用いて判定結果を表示する例を示す。また、図３（ｂ）には、使用環境を５段階に判定し、高さの異なるバーでを使ってグラフィカルに表示する場合の例である。例えば、４本のバーが全て表示された場合は音声認識の使用環境として良好、２本表示された場合は普通、全く表示されていない場合は劣悪であるようにする。また、その他の判定結果の出力方法として、ビープ音や合成音声を使用することによって出力し、ユーザに通知するようにしても良い。
【００８１】
すなわち、本実施形態に係る音声認識環境判定方法では、ステップＳ１０７において、音声認識装置の使用環境の判定結果を表示するが、その際に、判定結果を図形又は記号を用いてグラフィカルに表示することを特徴とする。
【００８２】
＜第２の実施形態＞
上述した第１の実施形態では、図２のステップＳ１０７の判定結果を出力する処理において、ディスプレイ等の出力装置に対して判定結果を出力する場合について示したが、本発明の適用はこれに限られるものではない。例えば、音声認識の使用環境が劣悪、すなわち音声認識使用に適さない環境であると判定した場合は、制御部７００によって、音声認識部６００による音声認識処理自体を強制的に実行不可能な状態にしてもよい。すなわち、図２のステップＳ１０３で音声認識開始と判断された場合であってもステップＳ１０８の処理を行わないよう制御するようにすることも可能である。
【００８３】
＜その他の実施形態＞
尚、本発明は、複数の機器（例えば、ホストコンピュータ、インタフェース機器、リーダ、プリンタ等）から構成されるシステムに適用しても、一つの機器からなる装置（例えば、複写機、ファクシミリ装置等）に適用してもよい。
【００８４】
また、本発明の目的は、前述した実施形態の機能を実現するソフトウェアのプログラムコードを記録した記録媒体（または記憶媒体）を、システムあるいは装置に供給し、そのシステムあるいは装置のコンピュータ（またはＣＰＵやＭＰＵ）が記録媒体に格納されたプログラムコードを読み出し実行することによっても、達成されることは言うまでもない。この場合、記録媒体から読み出されたプログラムコード自体が前述した実施形態の機能を実現することになり、そのプログラムコードを記録した記録媒体は本発明を構成することになる。また、コンピュータが読み出したプログラムコードを実行することにより、前述した実施形態の機能が実現されるだけでなく、そのプログラムコードの指示に基づき、コンピュータ上で稼働しているオペレーティングシステム（ＯＳ）などが実際の処理の一部または全部を行い、その処理によって前述した実施形態の機能が実現される場合も含まれることは言うまでもない。
【００８５】
さらに、記録媒体から読み出されたプログラムコードが、コンピュータに挿入された機能拡張カードやコンピュータに接続された機能拡張ユニットに備わるメモリに書き込まれた後、そのプログラムコードの指示に基づき、その機能拡張カードや機能拡張ユニットに備わるＣＰＵなどが実際の処理の一部または全部を行い、その処理によって前述した実施形態の機能が実現される場合も含まれることは言うまでもない。
【００８６】
本発明を上記記録媒体に適用する場合、その記録媒体には、先に説明したフローチャートに対応するプログラムコードが格納されることになる。
【００８７】
本発明の実施態様の例を以下に列挙する。
【００８８】
［実施態様１］音響モデルを用いて音声認識を行う音声認識装置の使用環境を判定する音声認識環境判定方法であって、
音声を入力する入力工程と、
入力された前記音声が音声認識対象の音声であるか否かを判定する音声判定工程と、
音声認識対象の音声と判定された前記音声と前記音響モデル作成時の背景雑音との類似度を算出する類似度算出工程と、
算出された前記類似度に基づいて前記音声認識装置の使用環境を判定する環境判定工程とを有することを特徴とする音声認識環境判定方法。
【００８９】
［実施態様２］前記類似度算出工程が、前記音声判定工程により音声認識対象の音声と判定されなかった前記音声のスペクトルと前記音響モデル作成時の前記背景雑音のスペクトルとに基づいて算出されたスペクトル距離の逆数を前記類似度とすることを特徴とする実施態様１に記載の音声認識環境判定方法。
【００９０】
［実施態様３］前記類似度算出工程が、前記スペクトル距離を算出する際に、前記音響モデル作成時の前記背景雑音のスペクトルの分散、又は、音声認識用の音声と判定されなかった前記音声のスペクトルの分散及び前記音響モデル作成時の前記背景雑音のスペクトルの分散を考慮して、前記スペクトル距離を算出することを特徴とする実施態様２に記載の音声認識環境判定方法。
【００９１】
［実施態様４］前記類似度算出工程が、
音声認識用の音声と判定されなかった前記音声の複数フレームの各フレームごとに前記スペクトル距離を算出し、
各フレームごとに算出されたスペクトル距離の平均スペクトル距離又は最大スペクトル距離を算出し、
前記平均スペクトル距離の逆数又は前記最大スペクトル距離の逆数を前記類似度とすることを特徴とする実施態様２に記載の音声認識環境判定方法。
【００９２】
［実施態様５］前記類似度算出工程は、前記音響モデル作成時の前記背景雑音が複数存在する場合に、前記背景雑音ごとに算出されたスペクトル距離のうちの最小スペクトル距離の逆数を前記類似度とすることを特徴とする実施態様２から４までのいずれか１つに記載の音声認識環境判定方法。
【００９３】
［実施態様６］前記類似度算出工程が、音声認識用の音声と判定されなかった前記音声と前記音響モデル作成時の前記背景雑音とのスペクトル距離を、前記音声と前記音響モデル作成時に用いていない背景雑音とのスペクトル距離を用いて正規化し、
正規化された前記スペクトル距離の逆数を前記類似度とすることを特徴とする実施態様２から５までのいずれか１つに記載の音声認識環境判定方法。
【００９４】
［実施態様７］前記類似度算出工程が、前記音響モデル作成時の前記背景雑音をモデル化した隠れマルコフモデルＨＭＭに対する音声認識用の音声と判定されなかった前記音声の対数尤度を前記類似度とすることを特徴とする実施態様１に記載の音声認識環境判定方法。
【００９５】
［実施態様８］前記類似度算出工程が、
音声認識用の音声と判定されなかった前記音声の複数フレームに対する前記類似度を算出する際に、各フレームごとに前記音響モデル作成時の前記背景雑音をモデル化したＨＭＭに対する前記音声の対数尤度を求め、
前記音声の前記複数フレームの和又は該複数フレーム中最小となる対数尤度を前記類似度とすることを特徴とする実施態様１又は７に記載の音声認識環境判定方法。
【００９６】
［実施態様９］前記類似度算出工程が、前記音響モデル作成時の背景雑音が複数存在する場合に、前記背景雑音ごとに求めた対数尤度のうち最大となる対数尤度を前記類似度とすることを特徴とする実施態様１、７、８のいずれか１つに記載の音声認識環境判定方法。
【００９７】
［実施態様１０］前記類似度算出工程が、前記音響モデル作成時の前記背景雑音をモデル化したＨＭＭに対する音声認識用の音声と判定されなかった前記音声の対数尤度を、該音響モデル作成時に用いていない背景雑音をモデル化したＨＭＭに対する前記音声の対数尤度で正規化した値を前記類似度とすることを特徴とする実施態様１、７、８、９のいずれか１つに記載の音声認識環境判定方法。
【００９８】
［実施態様１１］音声認識装置の使用環境の判定結果を表示する表示工程をさらに有し、
前記判定結果を図形又は記号を用いてグラフィカルに表示することを特徴とする実施態様１から１０までのいずれか１項に記載の音声認識環境判定方法。
【００９９】
［実施態様１２］音響モデルを用いて音声認識を行う音声認識装置の使用環境を判定する音声認識環境判定装置であって、
音声を入力する入力手段と、
入力された前記音声が音声認識対象の音声であるか否かを判定する音声判定手段と、
音声認識対象の音声と判定された前記音声と前記音響モデル作成時の背景雑音との類似度を算出する類似度算出手段と、
算出された前記類似度に基づいて前記音声認識装置の使用環境を判定する環境判定手段とを備えることを特徴とする音声認識環境判定装置。
【０１００】
［実施態様１３］コンピュータに、音響モデルを用いて音声認識を行う音声認識装置の使用環境を判定させるためのプログラムであって、
入力された音声が音声認識対象の音声であるか否かを判定する音声判定手順と、
音声認識対象の音声と判定された前記音声と前記音響モデル作成時の背景雑音との類似度を算出する類似度算出手順と、
算出された前記類似度に基づいて前記音声認識装置の使用環境を判定する環境判定手順とを実行させるためのプログラム。
【０１０１】
［実施態様１４］実施態様１３に記載のプログラムを格納したことを特徴とするコンピュータ読み取り可能な記録媒体。
【０１０２】
【発明の効果】
以上説明したように、本発明によれば、複数のマイクロフォンを必要とせず、音声認識装置の使用環境が音声認識処理にとって適切か否かを好適に判定することができる。
【図面の簡単な説明】
【図１】本発明の第１の実施形態に係る音声認識環境判定機能を備える音声認識装置の構成を示すブロック図である。
【図２】本発明の第１の実施形態に係る音声認識環境判定機能を備える音声認識装置の動作手順を説明するためのフローチャートである。
【図３】音声認識環境の判定結果をディスプレイ上に表示する場合の一例を示す図である。
【図４】本実施形態に係る音声処理を実行する際のフレームの概念を説明するための図である。

[0001]
TECHNICAL FIELD OF THE INVENTION
The present invention relates to a technique for judging the quality of the environment in which speech recognition is used, that is, whether the environment allows a speech recognition device to recognize speech properly or is a harsh one for speech recognition.
[0002]
[Prior art]
Speech recognition is expected to be used in many situations as an input means that anyone can use easily by voice, replacing conventional input means that require operating a keyboard, buttons, or the like by hand to enter data or give instructions. In recent years, for example, speech recognition technology aimed at a wide range of users has been put to practical use in a wide variety of fields, from dictation software for entering text by voice and call systems that answer telephone calls automatically to toys that reply when spoken to.
[0003]
Speech recognition is thus a very convenient input means when recognition performance is high. When recognition performance is low, however, misrecognition occurs and it can be less efficient than conventional hand-operated input means. Many factors can lower the recognition rate, but the influence of background noise in the use environment is especially large. For example, a speech recognition device that recognizes speech correctly in a relatively quiet environment with little background noise, such as an office, may suffer a drastic drop in recognition rate when used in a crowd or in a very noisy place.
[0004]
Noise tolerance also differs by type of speech recognition device: speech that one device can recognize may not be recognized at all by another. For example, the speech recognition devices used in car navigation systems often perform adequately even in a moving car, whereas text dictated with dictation software in a moving car may not be entered correctly.
[0005]
As described above, the recognition performance of a speech recognition device degrades depending on the use environment. If it is known in advance that the environment is unfavorable for speech recognition, then depending on the application, the user can be directed from the outset to use an input means other than speech. Likewise, if the user can anticipate that the degraded performance is caused by background noise, the user can respond by moving to a quiet place, speaking more clearly, or removing the noise source. By judging from the environment at input time whether the user should enter the intended information by voice or by hand, and notifying the user of the appropriate instruction, the device lets the user carry out input smoothly in a manner suited to the environment.
[0006]
There is a conventional method of notifying the user of a speech recognition device whether speech input is possible (see, for example, Patent Document 1). In the method described in Patent Document 1, the speech recognition device has two microphones, one for speech input and one for noise input, and judges whether speech input is possible from the difference in power between the input speech and the input noise. Specifically, if the power of the input speech exceeds that of the input noise by a large margin, speech input is judged to be possible. If the difference is small, the correlation of the input noise waveform is also examined and, together with the power of the input noise, used to judge comprehensively whether speech recognition is possible.
[0007]
[Patent Document 1]
JP-A-10-240291
[0008]
[Problems to be solved by the invention]
However, commonly used personal computers (hereinafter "PCs") and devices with speech input rarely have more than one microphone, so the conventional method cannot be applied as is to an ordinary PC or similar device. In other words, it incurs extra cost, such as adding a microphone.
[0009]
On the other hand, as shown by the long-known method of creating the acoustic model used for speech recognition from speech mixed with noise, using such an acoustic model can make it possible to recognize input speech even when the noise is loud. Methods that use an acoustic model of this kind are therefore robust against background noise.
[0010]
The present invention has been made in view of these circumstances. Its object is to provide a speech recognition environment determination method that does not require a plurality of microphones and can suitably determine whether the use environment of a speech recognition device is appropriate for speech recognition processing.
[0011]
[Means for Solving the Problems]
To achieve the above object, a speech recognition environment determination method according to the present invention is a method for determining the use environment of a speech recognition device that performs speech recognition using an acoustic model, the method comprising:
an input step of inputting speech;
a speech determination step of determining whether the input speech is speech to be recognized;
a similarity calculation step of calculating the similarity between the speech determined to be speech to be recognized and the background noise at the time the acoustic model was created; and
an environment determination step of determining the use environment of the speech recognition device based on the calculated similarity.
[0012]
BEST MODE FOR CARRYING OUT THE INVENTION
Hereinafter, an embodiment of a method for determining a use environment of a speech recognition device according to the present invention will be described in detail with reference to the drawings.
[0013]
<First embodiment>
FIG. 1 is a block diagram illustrating a configuration of a speech recognition device having a speech recognition environment determination function according to the first embodiment of the present invention.
[0014]
As shown in FIG. 1, in the speech recognition device according to the first embodiment, reference numeral 100 denotes a microphone for inputting background noise, the user's input speech, and the like; in this device, both the background noise and the input speech are input through the same microphone. Reference numeral 200 denotes a speech determination unit that determines whether the sound input through the microphone 100 is speech input by the user. In this embodiment, any sound other than the user's input speech is regarded as background noise.
[0015]
Reference numeral 500 denotes a storage device composed of a ROM, a RAM, a hard disk, and the like. It stores the program with which the speech recognition device runs the speech recognition environment determination function, and temporarily holds the data that function needs and the data generated while it runs. The storage device 500 also stores the characteristics of the background noise present when the acoustic model was created (trained) as a noise model 501. For example, a spectrum is obtained from the training-time noise by an acoustic analysis method such as FFT analysis and stored in the storage device 500 in advance as the noise model 501. For this spectrum, it is desirable to use a time-averaged spectrum computed from several seconds to several tens of seconds of noise.
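As a rough illustration of how such a noise model might be prepared, the following Python sketch computes an amplitude spectrum for each frame and averages the spectra over time. It is a sketch under stated assumptions, not the patent's implementation: the function names are hypothetical, the amplitude spectrum is only one of several possible spectral representations, and a direct DFT from the standard library stands in for the FFT analysis mentioned above (it gives the same values but is slower).

```python
import cmath
import math


def amplitude_spectrum(frame):
    """Amplitude spectrum of one frame (bins 0..N/2) via a direct DFT.

    A direct DFT is used only to keep the sketch dependency-free;
    an FFT library would normally be used instead.
    """
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        spectrum.append(abs(acc))
    return spectrum


def build_noise_model(frames):
    """Time-averaged spectrum over all frames (hypothetical noise model 501)."""
    spectra = [amplitude_spectrum(f) for f in frames]
    bins = len(spectra[0])
    return [sum(s[i] for s in spectra) / len(spectra) for i in range(bins)]


# Two toy 4-sample frames standing in for training-time noise.
training_frames = [[1.0, 0.0, 0.0, 0.0], [3.0, 0.0, 0.0, 0.0]]
noise_model = build_noise_model(training_frames)
print(noise_model)
```

In practice the frames would cover several seconds to several tens of seconds of noise, as the description recommends.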
[0016]
In addition to the noise model 501, the storage device 500 also stores the program that controls the entire device when this speech recognition device with a speech recognition environment determination function operates as an ordinary speech recognition device, as well as the acoustic model, the recognition dictionary, and other data needed for speech recognition.
[0017]
Reference numeral 300 denotes a similarity calculation unit that calculates the similarity between the input background noise and the background noise at the time of acoustic model training. Reference numeral 400 denotes an environment determination unit that determines the use environment for speech recognition based on the similarity calculated by the similarity calculation unit 300. Reference numeral 600 denotes a speech recognition unit that recognizes the input speech when the speech determination unit 200 determines that the sound input from the microphone 100 is not noise but speech input by the user. The speech recognition device further comprises a control unit 700 that controls the entire device; an input device 800, composed of a keyboard, a mouse, and the like, for entering data by means other than speech and for giving instructions to the device; and an output device 900, realized by a display, a speaker, or the like, that outputs the result of judging whether speech recognition is feasible, the result of speech recognition, and so on.
[0018]
That is, as described later, the speech recognition device according to this embodiment has a speech recognition environment determination function for determining the use environment of a speech recognition device that performs speech recognition using an acoustic model. It inputs speech, determines whether the input speech is speech to be recognized, calculates the similarity between the speech determined to be speech to be recognized and the background noise at the time the acoustic model was created, and determines the use environment of the speech recognition device based on the calculated similarity.
[0019]
Next, an example of the operation of the speech recognition device with the speech recognition environment determination function according to this embodiment, configured as shown in FIG. 1, is described. FIG. 2 is a flowchart explaining the operating procedure of the speech recognition device with the speech recognition environment determination function according to the first embodiment of the present invention. The overall processing flow of the device is described below with reference to FIG. 2.
[0020]
First, the device is started (step S101), and capture of the user's speech or background noise through the microphone 100 begins (step S102). Capture may be started by a user instruction through the input device 800, or the device may detect sound and automatically capture what it detects.
[0021]
After the microphone 100 starts capturing sound, the speech determination unit 200 determines whether speech recognition has started (step S103). For example, in this embodiment, speech recognition is judged to have started when the user performs a predetermined operation (for example, pressing the space key) with the input device 800, such as a keyboard, mouse, or buttons. Conversely, the device may be in a recognition-ready state whenever that predetermined operation is not being performed, automatically recognizing the captured speech, and may treat sound captured while the operation is performed as sound other than speech for recognition (for example, background noise).
[0022]
Speech recognition may also be started automatically by detecting the user's utterance. In that case, a speech detection method of the kind used in ordinary speech recognition processing is applied, and the point at which the speech recognition unit 600 detects speech is taken as the start of speech recognition. In general, speech recognition processes speech data in blocks of 15 to 30 milliseconds, with successive blocks shifted by 5 to 15 milliseconds so that they overlap. The length of data processed at one time is called the frame length, and the amount by which successive blocks are shifted is called the frame period. FIG. 4 is a diagram for explaining the concept of a frame in the speech processing according to this embodiment. The process of detecting the start of speech recognition may, for example, be performed every frame period, or only once every several frames to reduce the processing load.
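The relationship between frame length and frame period can be sketched in Python as follows. The function name `split_frames` and the concrete values (16 kHz sampling, 25 ms frame length, 10 ms frame period) are assumptions chosen from within the ranges given above, not values stated in the source.

```python
def split_frames(samples, frame_len, frame_shift):
    """Cut samples into overlapping frames.

    frame_len is the frame length and frame_shift the frame period,
    both expressed in samples. Trailing samples that do not fill a
    whole frame are dropped.
    """
    if frame_len <= 0 or frame_shift <= 0:
        raise ValueError("frame_len and frame_shift must be positive")
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += frame_shift
    return frames


# Assumed settings: 16 kHz audio, 25 ms frames, 10 ms frame period.
SAMPLE_RATE = 16000
FRAME_LEN = SAMPLE_RATE * 25 // 1000    # 400 samples
FRAME_SHIFT = SAMPLE_RATE * 10 // 1000  # 160 samples

one_second = [0.0] * SAMPLE_RATE
frames = split_frames(one_second, FRAME_LEN, FRAME_SHIFT)
print(len(frames))  # 98
```

Because the frame period is shorter than the frame length, consecutive frames share samples, which is the overlap the description refers to.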
[0023]
If the speech determination unit 200 determines in step S103 that speech recognition has started (Yes), the speech recognition unit 600 performs speech recognition on the captured speech (step S108). The control unit 700 then performs various controls of the speech recognition device according to a predetermined procedure, based on the speech recognition result (step S109). These controls include, for example, outputting the speech recognition result to the output device 900 and sending the recognition result to an application. After these controls, the process returns to step S103 and again determines whether speech recognition has started.
[0024]
If, on the other hand, the speech determination unit 200 determines in step S103 that speech recognition has not started (No), the sound captured from the microphone 100 is taken in as noise (step S104). The similarity calculation unit 300 then calculates the similarity between the input noise and the noise at the time the acoustic model was created, which is stored in advance in the storage device 500 as the noise model 501 (step S105).
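The branch at step S103 can be sketched as a small dispatch function. This is only an illustration of the control flow: `handle_captured_sound`, `recognize`, and `score_noise` are hypothetical names standing in for the speech recognition unit 600 and the similarity calculation unit 300, and the callables passed in the example are placeholders.

```python
def handle_captured_sound(sound, recognition_started, recognize, score_noise):
    """Route captured sound according to the result of step S103.

    recognize and score_noise are hypothetical callables supplied by
    the caller; they are not defined by the source.
    """
    if recognition_started:
        # Step S108: treat the sound as speech to be recognized.
        return ("recognition", recognize(sound))
    # Steps S104 and S105: treat the sound as noise and score it
    # against the stored noise model.
    return ("similarity", score_noise(sound))


# Placeholder callables, used only to show the control flow.
result = handle_captured_sound(
    [0.1, 0.2],
    recognition_started=False,
    recognize=lambda s: "hello",
    score_noise=lambda s: 0.8,
)
print(result)  # ('similarity', 0.8)
```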
[0025]
In this embodiment, the similarity calculated in step S105 can be, for example,
(1) the reciprocal of the spectral distance between the input noise and the noise at the time of acoustic model training, or
(2) the likelihood of the input noise with respect to an HMM (Hidden Markov Model) that models the noise at the time of acoustic model training.
[0026]
Next, a detailed description will be given of an embodiment in which the respective similarities exemplified in the above (1) and (2) are calculated.
[0027]
[When reciprocal of spectral distance is used as similarity]
The spectral distance D between the input noise and the noise at the time of acoustic model training is obtained, and its reciprocal is defined as the similarity. That is, the similarity L is defined by
[0028]
(Equation 1)
$$L = \frac{1}{D}$$
[0029]
That is, in the speech recognition environment determination method of this embodiment, the similarity calculation step of step S105 uses as the similarity the reciprocal L of the spectral distance D, which is calculated from the spectrum of the sound that the speech determination step of step S103 did not judge to be speech to be recognized (that is, the input noise) and the spectrum of the background noise at the time the acoustic model was created.
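A minimal Python sketch of this definition follows. The function name and the small floor `eps`, which avoids division by zero when the two spectra coincide, are assumptions added for illustration; the source defines only the reciprocal itself.

```python
def similarity_from_distance(distance, eps=1e-12):
    """Similarity L as the reciprocal of the spectral distance D.

    eps is an assumed safeguard for D = 0 and is not part of the source.
    """
    if distance < 0:
        raise ValueError("a distance cannot be negative")
    return 1.0 / max(distance, eps)


# A smaller distance (more similar noise) gives a larger similarity.
print(similarity_from_distance(0.5))  # 2.0
print(similarity_from_distance(2.0))  # 0.5
```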
[0030]
Here, a method of obtaining a distance (spectral distance) D between two spectra will be described.
[0031]
In the following, the two spectra are called spectrum A and spectrum B, and
I: the number of elements making up a spectrum,
x_K(i): the i-th spectral intensity of spectrum K (1 ≤ i ≤ I),
μ_K(i): the mean of the i-th spectral intensity of spectrum K (1 ≤ i ≤ I),
σ_K(i): the variance of the i-th spectral intensity of spectrum K (1 ≤ i ≤ I).
[0032]
For spectrum A, two cases are described: one in which the spectrum of a single frame is obtained and one in which a time-averaged spectrum over several frames is obtained. For spectrum B, the case in which a time-averaged spectrum has been obtained is described.
[0033]
Here, the following distances can be used as the spectral distance.
(a) Euclidean distance
[0034]
(Equation 2)

(b) Distance that takes the variance of spectrum B into account
[0035]
(Equation 3)

(c) Distance that takes the variances of spectrum A and spectrum B into account
[0036]
(Equation 4)

In the present embodiment, the spectral distance defined in any of (a) to (c) above is used.
[0037]
Then, the spectrum of the input noise is calculated using the same acoustic analysis technique that was used to obtain the spectrum of the noise at the time of training the acoustic model. Taking the spectrum of the input noise as spectrum A and the spectrum of the training noise as spectrum B, the spectral distance D can be calculated by any of the methods (a) to (c) above.
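The three kinds of distance (a) to (c) can be sketched in Python as follows. The exact forms of Equations 2 to 4 are not available in this text, so the formulas below (a plain Euclidean distance and two variance-weighted variants) are assumptions chosen only to match the verbal descriptions. The function names and the small `eps` guard against division by zero are likewise illustrative.

```python
import math

def euclidean_distance(spec_a, mean_b):
    # (a) Euclidean distance between spectrum A and the mean spectrum of B.
    return math.sqrt(sum((x - m) ** 2 for x, m in zip(spec_a, mean_b)))

def distance_with_b_variance(spec_a, mean_b, var_b):
    # (b) Assumed form: each squared difference is divided by the variance of B.
    return math.sqrt(sum((x - m) ** 2 / v
                         for x, m, v in zip(spec_a, mean_b, var_b)))

def distance_with_both_variances(mean_a, var_a, mean_b, var_b):
    # (c) Assumed form: squared difference of the means divided by the
    # sum of the variances of A and B.
    return math.sqrt(sum((ma - mb) ** 2 / (va + vb)
                         for ma, va, mb, vb in zip(mean_a, var_a, mean_b, var_b)))

def similarity_from_distance(distance, eps=1e-12):
    # Similarity L is the reciprocal of the spectral distance D.
    return 1.0 / max(distance, eps)
```

For instance, `similarity_from_distance(euclidean_distance(a, mean_b))` gives the similarity for a single input frame `a` under variant (a).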
[0038]
That is, in the speech recognition environment determination method according to the present embodiment, when the spectral distance is calculated in the similarity calculation step of step S105, it is calculated in consideration of either the variance of the spectrum of the background noise at the time of creating the acoustic model (spectrum B), or both the variance of the spectrum of the speech not determined to be speech to be recognized (spectrum A) and the variance of spectrum B.
[0039]
For example, when the process of detecting the start of speech recognition (step S103) is performed every frame period, the spectrum of the input noise is obtained frame by frame. The time average and variance of spectrum A are therefore not used, and d_a1 or d_a2 is used as the spectral distance.
[0040]
On the other hand, when the process of detecting the start of speech recognition (step S103) is performed every T frames, the spectrum of the input noise for T frames is obtained. In this case, by calculating the time average and variance of the spectrum over the T frames, the spectral distance can be obtained by any of d_a2, d_b2, and d_c2.
[0041]
Alternatively, the spectral distance for each frame may be obtained by d_a1, d_b1, or d_c1, and the average of these values may be used as the spectral distance. The spectral distance D in this case is as follows.
[0042]
(Equation 5)
$$D = \frac{1}{T} \sum_{t=1}^{T} d(t)$$
[0043]
Here, d(t) is the distance between the spectrum of the noise input in frame t and the spectrum of the noise at the time of training, and the per-frame spectral distance d(t) is obtained by any of d_a1, d_b1, and d_c1 above.
[0044]
As another method, the following distance D may be used.
[0045]
(Equation 6)
$$D = \max_{1 \le t \le T} d(t)$$
[0046]
This uses the maximum spectral distance, that is, the largest of the per-frame spectral distances. It corresponds to the frame in which the input noise and the training noise are least similar, and therefore to the minimum similarity.
[0047]
That is, in the speech recognition environment determination method according to the present embodiment, in the similarity calculation step of step S105, the spectral distance is calculated for each of a plurality of frames of the speech that was not determined to be speech to be recognized (the input noise), the average or the maximum of the per-frame spectral distances is calculated, and the reciprocal of the average spectral distance or of the maximum spectral distance is used as the similarity.
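The aggregation over T frames described above can be illustrated with a short Python sketch. It takes per-frame distances d(t) that have already been computed, reduces them by their average or their maximum, and returns the reciprocal as the similarity. The function names, the `mode` argument, and the `eps` guard are illustrative assumptions, not part of the described device.

```python
def aggregate_distance(frame_distances, mode="mean"):
    # Reduce per-frame spectral distances d(1)..d(T) to a single distance D.
    if not frame_distances:
        raise ValueError("at least one frame distance is required")
    if mode == "mean":
        return sum(frame_distances) / len(frame_distances)
    if mode == "max":
        return max(frame_distances)
    raise ValueError("mode must be 'mean' or 'max'")

def similarity_from_frames(frame_distances, mode="mean", eps=1e-12):
    # Similarity is the reciprocal of the aggregated distance.
    return 1.0 / max(aggregate_distance(frame_distances, mode), eps)
```

Using the maximum is the more conservative choice, since a single dissimilar frame lowers the similarity for the whole block of T frames.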
[0048]
Next, a case where there are multiple types of noise at the time of training the acoustic model will be described. This is the method of calculating the spectral distance when using an acoustic model whose noise tolerance has been improved by training on speech recorded in various environments (so-called multi-condition training), or when using multiple acoustic models prepared for individual noise environments.
[0049]
For example, when the training noise comprises three kinds of noise, namely office noise, in-car noise, and crowd noise, the spectral distance D is obtained as follows. First, let
Spectrum B1: spectrum of office noise
Spectrum B2: spectrum of noise in the car
Spectrum B3: spectrum of crowd noise
Then the spectral distances between the spectrum A of the input noise and each of the spectra B1, B2, and B3 are obtained by any of d_a1, d_a2, d_b1, d_b2, d_c1, and d_c2, and the smallest of these distances is defined as the spectral distance D. That is, the spectral distance D when N kinds of noise are used at the time of training is calculated as follows.
[0050]
(Equation 7)
$$D = \min_{1 \le n \le N} d(A, B_n)$$
[0051]
Here, d(A, Bn) is the spectral distance between the spectrum A of the input noise and the n-th training noise spectrum Bn, and is obtained by any of d_a1, d_a2, d_b1, d_b2, d_c1, and d_c2.
[0052]
That is, in the speech recognition environment determination method according to the present embodiment, in the similarity calculation step of step S105, when there are a plurality of background noises Bn at the time of creating the acoustic model, the reciprocal of the minimum of the spectral distances d(A, Bn) calculated for the respective background noises is used as the similarity.
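A minimal Python sketch of the multi-noise case follows. It compares the input noise spectrum A against each stored training noise spectrum and keeps the smallest distance. The Euclidean distance is used here only as a stand-in for whichever of the distances is chosen, and the dictionary layout and function names are assumptions made for illustration.

```python
import math

def euclidean(a, b):
    # Stand-in distance; any of the spectral distances could be passed instead.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def closest_training_noise(spectrum_a, training_spectra, distance=euclidean):
    # training_spectra maps a noise name to its spectrum Bn.
    # Returns (name, D), where D is the minimum of d(A, Bn) over all n.
    if not training_spectra:
        raise ValueError("at least one training noise spectrum is required")
    best_name, best_d = None, math.inf
    for name, spectrum_b in training_spectra.items():
        d = distance(spectrum_a, spectrum_b)
        if d < best_d:
            best_name, best_d = name, d
    return best_name, best_d
```

The returned name is a by-product that could be useful for logging which training condition the current environment resembles most.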
[0053]
In the present embodiment, a value obtained by any of the methods for calculating the spectral distance D described above can be used.
[0054]
Here, in any of the above calculation methods, the obtained spectral distance D may be further normalized by the spectral distance between the input noise and noise not used for training the acoustic model. That is, letting C be the spectrum of noise not used for training the acoustic model, the normalized spectral distance D′ can be defined as follows.
[0055]
(Equation 8)

[0056]
Here, D(X, Y) is the spectral distance between spectrum X and spectrum Y, and may be calculated using any of the methods described above. The similarity in this case is likewise the reciprocal of D′.
[0057]
That is, in the speech recognition environment determination method according to the present embodiment, in the similarity calculation step of step S105, the spectral distance between the speech not determined to be speech to be recognized (the input noise) and the background noise at the time of creating the acoustic model is normalized using the spectral distance between the input noise and background noise not used when creating the acoustic model, and the reciprocal of the normalized spectral distance is used as the similarity.
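The normalization can be sketched as below. Equation 8 is not available in this text, so the ratio D(A, B) / D(A, C) used here is an assumption about what "normalized by" means. The function names and the `eps` guard are also illustrative.

```python
def normalized_distance(d_ab, d_ac, eps=1e-12):
    # Assumed form of D': distance to the training noise B divided by the
    # distance to a reference noise C that was not used for training.
    return d_ab / max(d_ac, eps)

def normalized_similarity(d_ab, d_ac, eps=1e-12):
    # Similarity is the reciprocal of the normalized distance D'.
    return 1.0 / max(normalized_distance(d_ab, d_ac, eps), eps)
```

Under this assumed form, a value of D′ below 1 means the input noise is closer to the training noise than to the reference noise.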
[0058]
In any of these methods, the parameters of the training-noise spectrum needed to calculate the chosen spectral distance are stored as the noise model 501. For example, when d_a1 is used, the time-averaged spectrum of the training noise is stored as the noise model 501. When the distance is obtained by d_b1, which takes the variance at training time into account, the time-averaged spectrum and the variance are stored as the noise model 501. Further, when an acoustic model subjected to multi-condition training with N types of noise is used, the average spectra and variances for the N types of noise are stored as the noise model 501.
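As an illustration of what such a stored noise model might contain, the following Python sketch computes the time-averaged spectrum and the per-element variance from a list of training noise frames. The dictionary layout and the function name are assumptions, and the population variance is used here as one possible choice.

```python
def build_noise_model(frames):
    # frames: list of spectra (one list of I intensities per frame).
    # Returns the time-averaged spectrum and the per-element variance.
    if not frames:
        raise ValueError("at least one frame is required")
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[i] for f in frames) / n for i in range(dim)]
    var = [sum((f[i] - mean[i]) ** 2 for f in frames) / n for i in range(dim)]
    return {"mean": mean, "var": var}
```

For multi-condition training with N noise types, one such entry per noise type could be kept, for example in a dictionary keyed by noise name.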
[0059]
In the embodiment described above, the type of spectrum is not specified. The distance can be calculated by the methods described above for any spectral representation commonly used in acoustic analysis, such as an amplitude spectrum, a power spectrum, a logarithmic spectrum, or a spectral envelope. Furthermore, the similarity can also be defined by performing a similar distance calculation on other parameters used in speech recognition, for example the cepstrum, instead of the spectrum.
[0060]
[When the likelihood for an HMM that models the training noise is used as the similarity]
Next, a method of using, as the similarity, the log likelihood of the input noise for an HMM that models the noise at the time of training will be described. In the following, the HMM that models the training noise is called model B, the log likelihood of the input noise of frame t for model B is written P_B(t), and the definition of the similarity L is shown.
[0061]
When the similarity is calculated for each frame, the similarity is defined by the log likelihood of frame t as follows.
[0062]
(Equation 9)
$$L = P_B(t)$$
[0063]
That is, in the speech recognition environment determination method according to the present embodiment, in the similarity calculation step of step S105, the log likelihood of the input noise (the speech not determined to be speech to be recognized) for the hidden Markov model (HMM) that models the background noise at the time of creating the acoustic model is used as the similarity.
[0064]
When the similarity is obtained every T frames and the sum of the log likelihoods over the T frames is used as the similarity, the similarity is defined as follows.
[0065]
(Equation 10)
$$L = \sum_{t=1}^{T} P_B(t)$$
[0066]
When the similarity is obtained every T frames and the minimum log likelihood among the T frames is used as the similarity, the similarity is defined as follows.
[0067]
(Equation 11)
$$L = \min_{1 \le t \le T} P_B(t)$$
[0068]
That is, in the speech recognition environment determination method according to the present embodiment, in the similarity calculation step of step S105, when the similarity is calculated over a plurality of frames of the input noise (the speech not determined to be speech to be recognized), the log likelihood of the speech for the HMM that models the background noise at the time of creating the acoustic model is obtained for each frame, and either the sum of the log likelihoods over the plurality of frames or the smallest log likelihood among the plurality of frames is used as the similarity.
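The likelihood-based similarity can be sketched as follows. To keep the example self-contained, the noise HMM is replaced by a single diagonal-covariance Gaussian, which is a simplifying assumption and not the model described in the text. A real system would evaluate P_B(t) with its HMM decoder. The function names and the `mode` argument are illustrative.

```python
import math

def frame_log_likelihood(frame, mean, var):
    # Log likelihood of one frame under a diagonal Gaussian (stand-in for
    # the noise HMM, model B).
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (x - m) ** 2 / v
                      for x, m, v in zip(frame, mean, var))

def likelihood_similarity(frames, mean, var, mode="sum"):
    # Similarity over T frames: the sum of the per-frame log likelihoods,
    # or the smallest per-frame log likelihood.
    if not frames:
        raise ValueError("at least one frame is required")
    lls = [frame_log_likelihood(f, mean, var) for f in frames]
    if mode == "sum":
        return sum(lls)
    if mode == "min":
        return min(lls)
    raise ValueError("mode must be 'sum' or 'min'")
```

Note that a sum over T frames grows in magnitude with T, so any threshold applied to it has to be chosen with the block length in mind.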
[0069]
Furthermore, when there are N types of noise at the time of training, the similarity is defined as follows.
[0070]
(Equation 12)
$$L = \max_{1 \le n \le N} l(B_n)$$
[0071]
Here, l(Bn) is the similarity of the input noise to the n-th training noise Bn, and may be obtained by any of the methods described above.
[0072]
That is, in the speech recognition environment determination method according to the present embodiment, in the similarity calculation step of step S105, when there are a plurality of background noises at the time of creating the acoustic model, the largest of the log likelihoods obtained for the respective background noises is used as the similarity.
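Selecting the best-matching noise model among N candidates is a simple maximum, sketched below. The function name and the dictionary of precomputed log likelihoods are assumptions for illustration.

```python
def best_matching_noise(log_likelihoods):
    # log_likelihoods maps a noise name to l(Bn), the log likelihood of the
    # input noise for the n-th noise model. Returns (name, maximum value).
    if not log_likelihoods:
        raise ValueError("at least one noise model is required")
    name = max(log_likelihoods, key=log_likelihoods.get)
    return name, log_likelihoods[name]
```

This mirrors the minimum taken over spectral distances: in both cases the training condition closest to the current environment determines the similarity.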
[0073]
Further, the obtained similarity L may be normalized by the similarity to noise not used for training the acoustic model. In this case, letting C be a model of noise not used for training the acoustic model, the normalized similarity L′ can be defined as follows.
[0074]
(Equation 13)

[0075]
That is, in the speech recognition environment determination method according to the present embodiment, in the similarity calculation step of step S105, the log likelihood of the input noise (the speech not determined to be speech to be recognized) for the HMM that models the background noise at the time of creating the acoustic model is normalized by the log likelihood of the input noise for an HMM that models background noise not used when creating the acoustic model, and the normalized value is used as the similarity.
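One possible reading of this normalization is sketched below. Equation 13 is not available in this text, so the difference of log likelihoods (a log likelihood ratio) is an assumption, chosen because subtracting log likelihoods corresponds to dividing likelihoods. The function name is illustrative.

```python
def normalized_log_likelihood(ll_b, ll_c):
    # Assumed form of L': log likelihood for the training-noise model B minus
    # the log likelihood for a reference noise model C not used in training.
    return ll_b - ll_c
```

Under this assumed form, a positive L′ means the input noise is better explained by the training-noise model than by the reference model.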
[0076]
The HMMs modeling the noise that are needed to calculate the similarity described above are stored in the storage device 500 as the noise model 501.
[0077]
Then, the environment determining unit 400 determines the use environment of the speech recognition based on the calculated similarity L (step S106). When the two noises are very similar, that is, when the similarity L is large, the use environment is considered close to the environment assumed when the acoustic model was created, so sufficient speech recognition performance can be expected and the environment is determined to be favorable for speech recognition. Conversely, when the noises are not similar, that is, when the similarity L is small, the recognition performance for the input speech may deteriorate, so the use environment is determined to be unfavorable for speech recognition.
[0078]
Specifically, a threshold value Th is set for the similarity L. If the calculated similarity L is equal to or greater than Th, the environment is determined to be favorable for speech recognition; if the calculated similarity L is less than Th, the environment is determined to be unsuitable for speech recognition. In addition to judging the environment as good or bad with a single threshold, the state of the use environment of the speech recognition device can also be determined in multiple stages using a plurality of thresholds.
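The threshold decision can be sketched in Python as follows. The single-threshold function returns a good or poor judgment, and the multi-threshold function returns a stage index. The function names, the return values, and the use of `bisect_right` are illustrative assumptions, and a similarity equal to a threshold is treated as reaching it.

```python
from bisect import bisect_right

def judge_environment(similarity, threshold):
    # Single threshold Th: favorable when L >= Th, unsuitable otherwise.
    return "good" if similarity >= threshold else "poor"

def judge_environment_level(similarity, thresholds):
    # Multiple thresholds: returns a stage from 0 (worst) to len(thresholds)
    # (best), counting how many thresholds the similarity reaches.
    return bisect_right(sorted(thresholds), similarity)
```

With two thresholds the second function yields three stages, and with four thresholds it yields five stages.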
[0079]
Then, the user can be notified of the determination result of step S106 by outputting it to the output device 900 (step S107).
For example, the result may be displayed on a display, or the result may be output as blinking of an LED provided in the device.
[0080]
FIG. 3 is a diagram illustrating examples in which the result of determining the speech recognition environment is displayed on a display. In FIG. 3A, the use environment is determined in three stages, and the determination result is displayed on the display using marks such as “◎” (good), “○” (normal), and “×” (poor). FIG. 3B shows an example in which the use environment is determined in five levels and displayed graphically using bars of different heights. For example, when all four bars are displayed, the use environment for speech recognition is good; when two bars are displayed, it is normal; and when no bar is displayed, it is poor. As another method of outputting the determination result, the user may be notified by a beep sound or a synthesized voice.
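A text-only approximation of the two display styles might look like the sketch below. The marks are those named for FIG. 3A. The bar rendering with `#` and `-` characters is an illustrative stand-in for the graphical bars of FIG. 3B, and the function names are assumptions.

```python
# Marks for the three-stage display, ordered from poor to good.
MARKS = ("×", "○", "◎")

def mark_for_level(level):
    # level: 0 (poor), 1 (normal), 2 (good).
    if not 0 <= level < len(MARKS):
        raise ValueError("level out of range")
    return MARKS[level]

def bars_for_level(level, max_bars=4):
    # Five-level display: 0 to max_bars bars shown, rendered as text.
    if not 0 <= level <= max_bars:
        raise ValueError("level out of range")
    return "#" * level + "-" * (max_bars - level)
```

For example, `bars_for_level(2)` produces a half-filled indicator corresponding to a normal environment.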
[0081]
That is, in the speech recognition environment determination method according to the present embodiment, the determination result of the use environment of the speech recognition device is displayed in step S107, and the determination result is displayed graphically using a graphic or a symbol.
[0082]
<Second embodiment>
In the first embodiment described above, the process of outputting the determination result in step S107 of FIG. 2 outputs the result to an output device such as a display, but the application of the present invention is not limited to this. For example, when the use environment is determined to be poor, that is, unsuitable for speech recognition, the control unit 700 may forcibly place the speech recognition processing by the speech recognition unit 600 in an unexecutable state. In other words, control may be performed so that the process of step S108 is not carried out even when the start of speech recognition is determined in step S103 of FIG. 2.
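The gating behaviour of the second embodiment can be sketched as a small guard function. The function name, its arguments, and the use of `None` to signal that recognition was skipped are assumptions for illustration. The `recognize` callable stands in for the processing of step S108.

```python
def run_recognition_if_allowed(speech_started, environment_ok, recognize):
    # Recognition (step S108) runs only when the start of speech has been
    # detected AND the environment was judged suitable; otherwise it is skipped.
    if not speech_started:
        return None
    if not environment_ok:
        return None
    return recognize()
```

A caller would pass the result of the environment judgment as `environment_ok`, so that a poor environment suppresses recognition even when speech is detected.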
[0083]
<Other embodiments>
Note that the present invention may be applied to a system composed of a plurality of devices (for example, a host computer, an interface device, a reader, and a printer) or to an apparatus consisting of a single device (for example, a copying machine or a facsimile machine).
[0084]
Needless to say, the object of the present invention is also achieved by supplying a recording medium (or storage medium) on which program code of software realizing the functions of the above-described embodiments is recorded to a system or an apparatus, and having a computer (or CPU or MPU) of the system or apparatus read and execute the program code stored in the recording medium. In this case, the program code itself read from the recording medium realizes the functions of the above-described embodiments, and the recording medium on which the program code is recorded constitutes the present invention. Needless to say, the functions of the above-described embodiments are realized not only when the computer executes the read program code, but also when an operating system (OS) or the like running on the computer performs part or all of the actual processing based on the instructions of the program code and the functions are realized by that processing.
[0085]
Further, needless to say, the functions of the above-described embodiments are also realized when the program code read from the recording medium is written into a memory provided in a function expansion card inserted into the computer or in a function expansion unit connected to the computer, and a CPU or the like provided in the function expansion card or function expansion unit then performs part or all of the actual processing based on the instructions of the program code.
[0086]
When the present invention is applied to the recording medium, the recording medium stores program codes corresponding to the flowcharts described above.
[0087]
Examples of embodiments of the present invention are listed below.
[0088]
[Embodiment 1] A speech recognition environment determination method for determining a use environment of a speech recognition device that performs speech recognition using an acoustic model, comprising:
an input step of inputting voice;
a voice determination step of determining whether the input voice is voice to be recognized;
a similarity calculation step of calculating a similarity between the voice not determined to be voice to be recognized and the background noise at the time of creating the acoustic model; and
an environment determination step of determining the use environment of the speech recognition device based on the calculated similarity.
[0089]
[Embodiment 2] The speech recognition environment determination method according to embodiment 1, wherein the similarity calculation step uses, as the similarity, the reciprocal of a spectral distance calculated from the spectrum of the voice not determined to be voice to be recognized in the voice determination step and the spectrum of the background noise at the time of creating the acoustic model.
[0090]
[Embodiment 3] The speech recognition environment determination method according to embodiment 2, wherein, when calculating the spectral distance, the similarity calculation step calculates the spectral distance in consideration of either the variance of the spectrum of the background noise at the time of creating the acoustic model, or both the variance of the spectrum of the voice not determined to be voice to be recognized and the variance of the spectrum of the background noise at the time of creating the acoustic model.
[0091]
[Embodiment 4] The speech recognition environment determination method according to embodiment 2, wherein the similarity calculation step:
calculates the spectral distance for each of a plurality of frames of the voice not determined to be voice to be recognized;
calculates the average spectral distance or the maximum spectral distance of the spectral distances calculated for the respective frames; and
uses the reciprocal of the average spectral distance or the reciprocal of the maximum spectral distance as the similarity.
[0092]
[Embodiment 5] The speech recognition environment determination method according to any one of embodiments 2 to 4, wherein, when there are a plurality of background noises at the time of creating the acoustic model, the similarity calculation step uses, as the similarity, the reciprocal of the minimum spectral distance among the spectral distances calculated for the respective background noises.
[0093]
[Embodiment 6] The speech recognition environment determination method according to any one of embodiments 2 to 5, wherein the similarity calculation step normalizes the spectral distance between the voice not determined to be voice to be recognized and the background noise at the time of creating the acoustic model, using the spectral distance between that voice and background noise not used when creating the acoustic model,
and uses the reciprocal of the normalized spectral distance as the similarity.
[0094]
[Embodiment 7] The speech recognition environment determination method according to embodiment 1, wherein the similarity calculation step uses, as the similarity, the log likelihood of the voice not determined to be voice to be recognized for a hidden Markov model (HMM) that models the background noise at the time of creating the acoustic model.
[0095]
[Embodiment 8] The speech recognition environment determination method according to embodiment 1 or 7, wherein the similarity calculation step,
when calculating the similarity over a plurality of frames of the voice not determined to be voice to be recognized, obtains for each frame the log likelihood of the voice for an HMM that models the background noise at the time of creating the acoustic model,
and uses, as the similarity, the sum of the log likelihoods over the plurality of frames or the minimum log likelihood among the plurality of frames.
[0096]
[Embodiment 9] The speech recognition environment determination method according to any one of embodiments 1, 7, and 8, wherein, when there are a plurality of background noises at the time of creating the acoustic model, the similarity calculation step uses, as the similarity, the maximum log likelihood among the log likelihoods obtained for the respective background noises.
[0097]
[Embodiment 10] The speech recognition environment determination method according to any one of embodiments 1, 7, 8, and 9, wherein the similarity calculation step uses, as the similarity, a value obtained by normalizing the log likelihood of the voice not determined to be voice to be recognized for the HMM that models the background noise at the time of creating the acoustic model by the log likelihood of that voice for an HMM that models background noise not used when creating the acoustic model.
[0098]
[Embodiment 11] The speech recognition environment determination method according to any one of embodiments 1 to 10, further comprising a display step of displaying the determination result of the use environment of the speech recognition device,
wherein the determination result is displayed graphically using a graphic or a symbol.
[0099]
[Embodiment 12] A speech recognition environment determination device that determines a use environment of a speech recognition device that performs speech recognition using an acoustic model, comprising:
input means for inputting voice;
voice determination means for determining whether the input voice is voice to be recognized;
similarity calculation means for calculating a similarity between the voice not determined to be voice to be recognized and the background noise at the time of creating the acoustic model; and
environment determination means for determining the use environment of the speech recognition device based on the calculated similarity.
[0100]
[Embodiment 13] A program for causing a computer to determine a use environment of a speech recognition device that performs speech recognition using an acoustic model, the program causing the computer to execute:
a voice determination procedure for determining whether the input voice is voice to be recognized;
a similarity calculation procedure for calculating a similarity between the voice not determined to be voice to be recognized and the background noise at the time of creating the acoustic model; and
an environment determination procedure for determining the use environment of the speech recognition device based on the calculated similarity.
[0101]
[Embodiment 14] A computer-readable recording medium storing the program according to Embodiment 13.
[0102]
[Effects of the Invention]
As described above, according to the present invention, it is possible to suitably determine whether the use environment of the speech recognition device is appropriate for speech recognition processing, without requiring a plurality of microphones.
[Brief description of the drawings]
FIG. 1 is a block diagram illustrating a configuration of a speech recognition device having a speech recognition environment determination function according to a first embodiment of the present invention.
FIG. 2 is a flowchart illustrating an operation procedure of the voice recognition device having the voice recognition environment determination function according to the first embodiment of the present invention.
FIG. 3 is a diagram illustrating an example of a case where a determination result of a voice recognition environment is displayed on a display.
FIG. 4 is a diagram for explaining the concept of a frame when executing audio processing according to the embodiment.

Claims

A speech recognition environment determination method for determining a use environment of a speech recognition device that performs speech recognition using an acoustic model, comprising:
an input step of inputting voice;
a voice determination step of determining whether the input voice is voice to be recognized;
a similarity calculation step of calculating a similarity between the voice not determined to be voice to be recognized and the background noise at the time of creating the acoustic model; and
an environment determination step of determining the use environment of the speech recognition device based on the calculated similarity.