JP3615088B2

JP3615088B2 - Speech recognition method and apparatus

Info

Publication number: JP3615088B2
Application number: JP18321699A
Authority: JP
Inventors: 亮典小柴; 三慶舘森; 博史金澤
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1999-06-29
Filing date: 1999-06-29
Publication date: 2005-01-26
Anticipated expiration: 2019-06-29
Also published as: JP2001013988A

Description

【０００１】
【発明の属する技術分野】
本発明は、発声された音声を高精度に認識するのに好適な音声認識方法及び装置に関する。
【０００２】
【従来の技術】
近年、音声認識技術は、優れたマンマシンインタフェースを実現する上で重要な役割を担っている。最近では、ＨＭＭを用いたワードスポッティングや連続音声認識など、発声者の発声方式に制約を要求しない、自然発話認識のための研究や開発が盛んに行われている。従来これらの音声認識手法においては、入力信号から、話者が音声を発声していると判断される区間を切り出し、その部分を標準パターンとマッチングさせることにより、発話内容を認識していた。
【０００３】
ところが、実際の自然発話においては、発声区間と判断された部分にも、促音や、摩擦音、無声化した有声音など、信号のパワーの低い無音区間が生じることがある。信号のパワーの低い区間では、背景雑音の影響が相対的に大きくなるため、信号のスペクトルが安定せず、その結果誤ったパターンとマッチングしてしまい、誤認識が生じることがしばしばあった。
【０００４】
更に、このような自然発話において生じるパワーの低い無音区間は、予め予期することが難しいため、標準パターンとして登録しておくことができなかった。
【０００５】
【発明が解決しようとする課題】
このように従来は、発声区間として検出された区間内に、パワーの低い無音区間が存在すると、その部分においては背景雑音のスペクトルが支配的となり、誤ったパターンマッチングが生じるという問題があった。また、発声区間内において、パワーが低くなる区間は予め予期することが難しく、そのため、それらのパターンを標準パターンとして登録することができない、という問題もあった。
【０００６】
本発明は、上記事情を考慮してなされたもので、発声区間内に、不規則に発生するパワーの低い無音区間が存在しても、その影響を受けることなく、高精度の認識を可能とする音声認識方法及び装置を提供することを目的とする。
【０００７】
【課題を解決するための手段】
本発明は、入力される信号を音響分析して音声が発声された区間を検出して、検出した発声区間の音声信号から特徴ベクトル系列を抽出し、前記抽出した特徴ベクトル系列と所定の認識候補ごとに予め用意されている音声信号の標準パターンとを第１の照合方式にて照合することにより、両者の類似度または距離を表す照合スコアを計算し、各認識候補ごとの照合スコアに基づいて認識結果を判定する音声認識方法において、上記検出した発声区間の音声信号の短時間パワーから音声信号の無音区間を検出し、その無音区間の特徴ベクトル系列をパターン照合の対象外とすると共に、無音区間から有音区間へ変化する時刻に相当する特徴ベクトル系列につき無音区間の影響を考慮した第２の照合方式を用いて照合することにより照合スコアを計算することを特徴とする。ここで、第１の照合方式にＨＭＭ（隠れマルコフモデル）照合方式を適用し、第２の照合方式にナル遷移を許すＨＭＭ照合方式を適用するとよい。
【０００８】
本発明によれば、発声区間内に予期しないパワーの低い無音区間が存在していたとしても、その無音区間を検出して、標準パターンとの照合の際には無音区間を除いて照合を行うことにより、無音区間における誤ったパターンマッチングを回避することができ、高精度な認識が可能となる。しかも本発明においては、無音区間から有音区間へ変化する時刻に相当する特徴ベクトル系列につき無音区間の影響を考慮した第２の照合方式、例えばナル遷移を許すＨＭＭ照合方式を適用することから、無音区間（の特徴ベクトル）を照合に用いなかったことによる状態遷移の矛盾が生じない。
【０００９】
ここで、無音区間から有音区間へ切り替わった時刻にナル遷移を許す場合、その際のＨＭＭの状態（第１の状態ｉ）へのナル遷移を起こすＨＭＭの状態として、直前の時刻（フレーム）における状態ｉ以前の状態のうち最適経路の照合スコアが最大となる状態（第２の状態ｊ）を選択することで、状態ｊから状態ｉへのナル遷移を起こし、状態ｉの上記直前の時刻における照合スコアを、状態ｊの同時刻における照合スコアに置き換えるとよい。この状態ｉへのナル遷移が可能な状態を、無音区間の継続時間などによって制限するようにしてもよい。
【００１０】
また本発明は、発声区間の音声信号の短時間パワーに基づく無音区間の検出を、異なる閾値を用いて独立に行い、発声区間の音声信号から抽出された特徴ベクトル系列と所定の認識候補ごとに予め用意されている音声信号の標準パターンとを、上記異なる閾値に基づいて独立に検出される無音区間の情報に基づいて、隠れマルコフモデル照合方式にて照合することにより、各閾値別に照合スコアを計算し、その際に対応する閾値に基づいて検出した無音区間の特徴ベクトル系列をパターン照合の対象外とすると共に、無音区間から有音区間へ変化する時刻にのみ、ナル遷移を許す隠れマルコフ照合方式を適用し検出し、各閾値別に求めた各認識候補ごとの照合スコアに基づいて認識結果を判定することをも特徴とする。
【００１１】
このように、各閾値別に得られる無音区間情報を用いて、各閾値別に、対応する無音区間をパターン照合の対象外として各認識候補ごとの照合スコアを求め、その照合スコアに基づいて認識結果を判定することで、無音区間における誤ったマッチングの影響を減らすことができる。
【００１２】
ここで、１つの閾値について各認識候補ごとの照合スコアを計算する都度、その認識候補ごとの照合スコアに基づいて認識候補を絞り、その動作を、上記閾値を一定方向に段階的に切り替えながら繰り返すようにするとよい。なお、異なる閾値を用いた無音区間の検出自体は、並行して行っても、閾値を切り替えながら順次行っても構わない。前者の場合には、無音区間の検出結果を記憶しておく必要がある。また、後者の場合には、少なくとも発声区間の音声信号を記憶しておく必要がある。
【００１３】
このように、無音区間検出用の閾値（パワーの閾値）を一定方向に段階的に変えて、認識候補の枝刈りをしながらパターン照合を行うことにより、段階的に認識候補を絞ることができ、認識の精度を向上させ、誤認識を減らすことができる。
【００１４】
ここで、閾値の切り替えを当該閾値が小さくなる方向に行うならば、認識候補の選択の際に、スペクトルが安定するパワーの大きな部分に重みをかけることができ、スペクトルが不安定なパワーの低い区間の影響を減らすことができる。
【００１５】
また、閾値の切り替えを当該閾値が大きくなる方向に行うようにしてもよい。この場合、最初は無音区間における誤ったマッチングが許されて複数の認識候補が選択されるものの、正解候補は無音区間以外では正しくマッチングするので上位候補に入り、徐々に閾値を大きくしてマッチングを行うことにより、無音区間における誤ったマッチングの影響を減らすことができ、最終的に正しい正解候補を検出することができる。
【００１６】
また、閾値を一定方向に段階的に切り替えながら認識候補を絞るのではなく、同一認識候補について各閾値別に得られる照合スコアの重み付け和を算出する処理を全ての認識候補について実行し、その全認識候補各々の照合スコアの重み付け和に基づいて認識結果を判定することも可能である。
この場合、無音区間の影響を任意に照合スコアに反映させることができ、これにより無音区間における誤ったマッチングの影響を減らすことができる。
【００１７】
なお、方法に係る本発明は装置に係る発明としても成立する。
また、本発明は、コンピュータに当該発明に相当する手順を実行させるための（或いはコンピュータを当該発明に相当する手段として機能させるための、或いはコンピュータに当該発明に相当する機能を実現させるための）プログラムを記録したコンピュータ読み取り可能な記録媒体としても成立する。
【００１８】
【発明の実施の形態】
以下、本発明の実施の形態につき図面を参照して説明する。
【００１９】
［第１の実施形態］
図１は、本発明の第１の実施形態に係る音声認識装置を概略的に示すものである。
図１に示す音声認識装置は、入力された信号を分析して発声区間を検出する発声区間検出部１０１と、この発声区間検出部１０１で検出された発声区間の音声信号を音響分析することにより、特徴ベクトルを抽出する特徴ベクトル抽出部１０２と、発声区間検出部１０１で検出された発声区間の音声信号から、当該音声信号のパワーを用いて無音区間を検出する無音区間検出部１０６と、予め学習された所定の各認識候補の標準特徴パターンが記憶されている標準特徴パターン記憶部１０４と、無音区間検出部１０６で検出された無音区間情報を用いて、特徴ベクトル抽出部１０２で抽出された特徴ベクトル系列と、標準特徴パターン記憶部１０４に記憶された各認識候補の標準特徴パターンとを、ＨＭＭを用いた照合方式で照合するパターン照合部１０３と、このパターン照合部１０３で得られる認識候補ごとの照合結果をもとに、認識された発声内容を判定する、認識結果判定部１０５とを具備している。
【００２０】
なお図１では、発声者が発声した音声を入力してデジタルの電気信号（デジタル音声信号）に変換する、マイクロホン、Ａ／Ｄ（アナログ／デジタル）変換器を含む音声入力部は省略されている。
【００２１】
次に、図１の構成の音声認識装置の処理概念を説明する。
発声区間検出部１０１において検出された発声区間の音声信号は、特徴ベクトル抽出部１０２で、予め定められた複数の周波数帯域毎に周波数分析され、特徴ベクトル系列（特徴ベクトル時系列）｛ｘｔ｝に変換される。特徴ベクトル（特徴パラメータ）はフレームと呼ばれる固定の時間長を単位に求められる。音声認識に使用される代表的な特徴ベクトルとしては、バンドパスフィルタまたはフーリエ変換によって求めることができるパワースペクトラムや、ＬＰＣ（線形予測）分析によって求められるケプストラム係数などがよく知られている。但し、本実施形態では、使用する特徴ベクトルの種類は問わない。
特徴ベクトル抽出部１０２により抽出された特徴ベクトルの時系列は、パターン照合部１０３に送られる。
【００２２】
一方、上記発声区間の音声信号は、無音区間検出部１０６にも送られ、当該音声信号の短時間パワーから、上記特徴ベクトル系列のフレームと同期して無音区間が検出される。図２はこの部分の処理によって、無音区間が検出された信号の様子を概念的に表わしている。図２の横軸は時間、縦軸は信号の短時間パワーであり、ＴＨは予め設定されているパワーの閾値である。
【００２３】
無音区間検出部１０６では、各時刻ｔの短時間パワーの値Ｐｔとパワーの閾値ＴＨが毎時刻比較され、Ｐｔ＜ＴＨとなる区間が無音区間と判定される。このようにして得られた無音区間を示す情報（無音区間情報）は、パターン照合部１０３に送られる。なお、ここで時刻ｔは、発声区間におけるｔ番目のフレームを指す。
【００２４】
パターン照合部１０３では、入力された特徴ベクトル系列、無音区間情報、及び予め学習しておいた標準特徴パターン（標準パターン）を用いて、パターン照合が行われる。標準特徴パターンは、所定の認識候補（認識単位）ごとにＨＭＭとして標準特徴パターン記憶部１０４に予め記憶されている。認識の際には、このＨＭＭをそのまま、或いは組み合わせて用いる。
【００２５】
図３は、照合に用いられるＨＭＭの構造を表わしている。ここで状態遷移のうち符号ｃが付された遷移はナル遷移であり、符号ａ，ｂが付された遷移はそれぞれ、通常の状態遷移及び自己ループである。なお、図３のＨＭＭでは、ナル遷移はすべての状態間に仮定しているが、ここに制約を設けてナル遷移が生じる状態を制限することも可能である。
【００２６】
次に、パターン照合部１０３で適用される、図３の構造のＨＭＭを用いたパターン照合方式について図４のフローチャートを参照して説明する。
ステップＳ１０１では、入力された時刻ｔの信号、即ちｔ番目のフレームの信号が発声区間であるか否かが、発声区間検出部１０１での検出結果に基づいて判定される。時刻ｔの入力信号が発声区間の信号である場合にはステップＳ１０２に、発声区間の信号でなければステップＳ１０６に進む。
【００２７】
ステップＳ１０２では、無音区間検出部１０６での検出結果に基づいて、入力された時刻ｔの信号が無音区間の信号であるか否かが判定される。無音区間の信号と判定された場合にはステップＳ１０７に、有音区間の信号と判定された場合にはステップＳ１０３に進む。
【００２８】
ステップＳ１０３では、フラグ（ＦＬＡＧ）の値が評価される。フラグは０または１の値を取り、時刻ｔ−１の信号（つまり１フレーム前の信号）が無音区間に属していたか（ＦＬＡＧ＝０の場合）、有音区間に属していたか（ＦＬＡＧ＝１の場合）を示す。フラグの値が０の場合には時刻ｔが（時刻ｔ−１までの）無音区間から有音期間に切り替わった（変化した）時刻であると判定されて最終ステップＳ１０８に、１の場合には有音区間が継続していると判定されてステップＳ１０４に進む。
【００２９】
ステップＳ１０４では、図３に示されるＨＭＭにおいて、時刻ｔの信号に対する、ナル遷移を除くすべての状態遷移確率、及びすべての分布の出力確率が計算され、最適な遷移が決定される。決定後、ステップＳ１０５に進む。
ステップＳ１０５では、時刻ｔが次の時刻ｔ＋１に設定され、ステップＳ１０１に戻る。
【００３０】
ステップＳ１０６では、各認識候補ごとに、図３に示されたＨＭＭにおいて、発声区間終了時刻ｔで照合スコアが最大となる状態が選択され、その認識候補ごとの照合スコアが認識結果判定部１０５に送られ、処理を終了する。ここで照合スコアは、周知のように入力音声信号の特徴ベクトル系列と標準特徴パターンとの類似度または距離を表す評価値である。
ステップＳ１０７では、ステップＳ１０２で時刻ｔの信号が無音区間の信号であると判定されたことを受け、前述したフラグの値を０に設定し、ステップＳ１０５に進む。ここでは、パターン照合は行われず、したがって時刻ｔにおける特徴ベクトルはパターン照合の対象外とされ、各状態の照合スコアは更新されない。
【００３１】
ステップＳ１０８では、ステップＳ１０３で時刻ｔの信号が、無音区間から有音区間へ切り替わった時刻であると判定されたことを受け、図３に示されたＨＭＭにおいて、まずナル遷移を行い、各状態における時刻ｔ−１における照合スコアを更新する。照合スコア更新後、ナル遷移を除くすべての状態遷移確率、及びすべての分布の出力確率が計算され、最適な遷移が決定される。決定後、ステップＳ１０９へ進む。この部分の処理の詳細は、後述する。
【００３２】
ステップＳ１０９では、ステップＳ１０２で時刻ｔの信号が有音区間の信号であると判定されたことを受けて、前述したフラグの値を１に設定し、ステップＳ１０５に進む。
【００３３】
以上が、本発明に直接関係するパターン照合方式の概略と流れである。
上記パターン照合方式を適用したパターン照合部１０３での処理により、すべての認識候補の照合スコアが計算され、認識結果判定部１０５において最大スコアをとる認識候補が認識結果として選択される。
【００３４】
ここで、無音区間から有音区間へ切り替わった時刻ｔにおける上記ステップＳ１０８の処理の詳細について、図５のフローチャートを参照して説明する。
時刻ｔにおいて、まずステップＳ４０１で状態番号ｉが最終状態に設定される。
【００３５】
ステップＳ４０２では、状態ｉについて、状態０から状態ｉのうち、時刻ｔ−１（１フレーム前）における最適経路の照合スコアが最大となる状態ｊが選択される。
【００３６】
ステップＳ４０３では、状態ｊから状態ｉへのナル遷移が起こり、状態ｉの時刻ｔ−１（１フレーム前）における照合スコアが、状態ｊの同時刻ｔ−１における照合スコアに置き換えられる。
【００３７】
ステップＳ４０４では、状態ｉが先頭の状態０であるかどうかが判定される。状態０である場合には最終ステップＳ４０６に、そうでなければステップＳ４０５に進む。
【００３８】
ステップＳ４０５では、ｉが１だけカウントダウンされ、ステップＳ４０２に戻る。
ステップＳ４０６では、すべての状態に対して、時刻ｔにおける、ナル遷移を除く最適経路、及びその照合スコアが求められる。
【００３９】
このように無音区間から有音区間へ切り替わった時刻にナル遷移を考えることにより、無音区間の特徴ベクトルを照合に用いなかった影響を取り除くことができる。なお、ここでは、状態ｉへのナル遷移は、状態０から状態ｉのすべての状態から起こり得るとしているが、ここに制約を設けて、例えば、無音区間の継続時間などによって状態ｉへのナル遷移が可能な状態を制限する（継続時間が短いほど状態数を減らす）ことも可能である。また無音区間の継続時間が所定の閾値以下の場合には、ナル遷移を起こさないようにすることも可能である。更に、ここでは、ナル遷移が可能な状態を最終状態から先頭の状態すべてについて探索しているが、これは必ずしもすべての状態について行う必要はなく、予め事前情報に基づいて無音区間が発生しやすい状態についてのみナル遷移を行うことも可能である。
【００４０】
次に、本実施形態の効果を図６乃至図９を参照して説明する。
図６は「とさか（ＴＯＳＡＫＡ）」と発声したときの、信号のパワーのイメージ図である。ここで、時刻Ｔ０，Ｔ７はそれぞれ、発声区間の始端時刻、終端時刻を示している。また、時刻Ｔ０−Ｔ１，Ｔ２−Ｔ３，Ｔ４−Ｔ５，Ｔ６−Ｔ７の各区間は、それぞれ、パワーの閾値ＴＨにより無音区間と判定された区間である。
【００４１】
一般に発声区間中の無音区間は、促音や摩擦音、有声音の無声化などにより発生し、この区間内では、背景雑音の影響が相対的に大きくなるため、誤ったパターンとのマッチングが起こりやすい。そしてその結果、誤認識が生じることがある。図６によれば、Ｔ０からＴ１、Ｔ２からＴ３、Ｔ４からＴ５、及びＴ６からＴ７の区間で誤ったパターンマッチングが生じる虞がある。
【００４２】
図７は、Ｔ２からＴ３の区間における音声信号の短時間パワーの様子と発声内容（ここでは音素列で表現）を更に詳細に示したものである。この例では、摩擦音／Ｓ／に相当する区間は、完全に閾値ＴＨ以下となっている。上述したように、この場合、パワーの閾値ＴＨ以下であるＴ２からＴ３の区間は、誤ったマッチングを起こしやすい。
【００４３】
図８は、簡単のため１つの音素を１状態で表わした「ＴＯＳＡＫＡ」を表わすＨＭＭである。ここでは簡単のため、状態／Ｏ／／Ｓ／／Ａ／／Ｋ／／Ａ／からのナル遷移については省略してある。
【００４４】
図８のようなＨＭＭに対して、先に述べたパターン照合方式を適用すると、Ｔ２からＴ３の区間（音声信号の無音区間）では、特徴ベクトル系列が照合に用いられないように制御される。このため、音声信号の有音区間、無音区間に無関係に特徴ベクトル系列が照合に用いられる従来技術とは異なって、Ｔ２からＴ３の区間（無音区間）における誤ったマッチングが生じることがなく、したがって照合スコアに悪影響を与えることがない。しかも、本実施形態で適用されるパターン照合方式では、無音区間から有音区間へ変わる時刻にはナル遷移を許しているので、無音区間を照合に用いなかったことによる状態遷移の矛盾が生じない。
【００４５】
以上の結果、本実施形態では、照合スコアに悪影響を与えることなく、図９で示したような遷移が可能になる。この例では、音素／Ｓ／に相当する特徴ベクトルのパワーが、パワーの閾値ＴＨ以下となっているため、この部分の特徴ベクトルが照合に使われず、それを表現するために、音素／Ｏ／から音素／Ａ／へのナル遷移を許し、音素／Ｓ／の状態を経由することを回避している。
このことは、Ｔ２−Ｔ３以外の無音区間（Ｔ０−Ｔ１，Ｔ４−Ｔ５，Ｔ６−Ｔ７）についても全く同様に考えられる。
【００４６】
発声区間が終了した場合には、すべての状態の、時刻Ｔ７における最適な状態遷移経路、及びそのときの照合スコアが求まるので、最大となるスコアを認識結果の判定に用いればよい。
【００４７】
この方法を用いれば、発声に対する認識候補の照合において、無音区間の誤ったマッチングにより、誤った認識候補の照合スコアが大きくなることを回避できる。その結果、照合スコアの精度が向上するので、認識率の改善につながる。以上が本発明の第１の実施形態に係る音声認識装置の構成、作用、効果の詳細な説明である。
【００４８】
［第２の実施形態］
図１０は、本発明の第２の実施形態に係る音声認識装置を概略的に示すものである。
【００４９】
図１０に示す音声認識装置は、発声区間検出部２０１、特徴ベクトル抽出部２０２、パターン照合部２０３、標準特徴パターン記憶部２０４、認識結果判定部２０５、及びＮ個の無音区間検出部（＃１）２０６−１〜（＃Ｎ）２０６−Ｎとを具備している。
【００５０】
図１０の構成の特徴は、（図１中の無音区間検出部１０６に相当する）Ｎ個の無音区間検出部＃１（２０６−１）〜＃Ｎ（２０６−Ｎ）により、予め用意された異なる信号のパワーの閾値ＴＨ１〜ＴＨＮに基づいて（発声区間の）音声信号の無音区間が検出されるようになっている点にある。このため、（図１中のパターン照合部１０３、認識結果判定部１０５に相当する）パターン照合部２０３、認識結果判定部２０５の機能も、後述するように一部異なっている。なお、それ以外の構成要素、即ち発声区間検出部２０１、特徴ベクトル抽出部２０２、標準特徴パターン記憶部２０４は、図１中の発声区間検出部１０１、特徴ベクトル抽出部１０２、標準特徴パターン記憶部１０４と同様である。
【００５１】
そこで、図１０の構成の音声認識装置の動作について、図１の音声認識装置と異なる部分を中心に説明する。
無音区間検出部＃１（２０６−１）〜＃Ｎ（２０６−Ｎ）には、発声区間検出部２０１で検出された音声信号が並列に入力される。各無音区間検出部＃ｉ（ｉ＝１〜Ｎ）には、それぞれ異なるパワーの閾値ＴＨｉが用意されており、それらの閾値を用いて独立に音声信号の無音区間が検出される。
【００５２】
図１１は、無音区間検出部＃１（２０６−１）、無音区間検出部＃２（２０６−２）、…無音区間検出部＃Ｎ（２０６−Ｎ）で、予め設定されたパワーの閾値ＴＨ１，ＴＨ２，…ＴＨＮに基づき、発声区間における音声信号の無音区間が検出される様子を表している。ここでは、ＴＨｉ＞ＴＨｉ＋１となるように設定されているものとする。
【００５３】
無音区間検出部＃１（２０６−１）〜＃Ｎ（２０６−Ｎ）にて独立に検出された無音区間を示す情報（無音区間情報）はパターン照合部２０３に送られる。パターン照合部２０３には、特徴ベクトル抽出部２０２により抽出された特徴ベクトルの時系列（特徴ベクトル系列）も送られる。パターン照合部２０３では、特徴ベクトル抽出部２０２から入力される特徴ベクトル系列と、各無音区間検出部＃１（２０６−１）〜＃Ｎ（２０６−Ｎ）から入力される無音区間情報を用いて、各認識候補の照合スコアが計算される。
【００５４】
ここで、パターン照合部２０３及び認識結果判定部２０５における処理を、図１２のフローチャートを参照して説明する。
ステップＳ２０１では、初期設定処理が行われ、無音区間検出部＃ｉを示すパラメータ（無音区間検出部番号）としてｉ＝１が設定される。
【００５５】
ステップＳ２０２では、すべての認識候補について、無音区間検出部＃ｉからの無音区間情報を用いてパターン照合部２０３により照合スコアが算出される。このパターン照合部２０３での照合スコア計算には、前記第１の実施形態で述べた（パターン照合部１０３での）照合方式を用いる。
【００５６】
ステップＳ２０３では、ステップＳ２０２で算出された各認識候補ごとの照合スコアから、予め用意された枝刈りのための認識候補数Ｍｉに従い、上位Ｍｉ位までの認識候補が選択され、次のステップの認識候補として残される。ここでは、Ｍｉ＞Ｍｉ＋１となるように設定されているものとする。
【００５７】
ステップＳ２０４では、ｉが無音区間検出部＃Ｎを表すパラメータ値（無音区間検出部番号）Ｎに達したかどうかが判定される。ｉ＝Ｎとなったなら最終ステップＳ２０６に、そうでなければステップＳ２０５に進む。
ステップＳ２０５では、ｉが１だけカウントアップされ、ステップＳ２０２に戻る。
【００５８】
ステップＳ２０６では、その時点において残されている（上位ＭＮ位までの）認識候補の中から照合スコアが最大となるものが認識結果判定部２０５により選ばれ、認識結果として出力される。
以上、第２の実施形態でのパターン照合部２０３及び認識結果判定部２０５における処理について説明した。
【００５９】
以上の方式を用いれば、認識候補の選択の際に、スペクトルが安定するパワーの大きな部分に重みをかけることができ、スペクトルが不安定なパワーの低い区間の影響を減らすことができる。また、パワーの閾値を段階的に変えて、認識候補の枝刈りをしながらパターン照合を行うことにより、段階的に認識候補を絞ることができ、認識の精度を向上させ、誤認識を減らすことができる。
以上が本発明の第２の実施形態に係る音声認識装置の構成、作用、効果の詳細な説明である。
【００６０】
（第２の実施形態の第１変形例）
以上に述べた第２の実施形態では、パターン照合部２０３における認識候補の枝刈りを、パワーの閾値の大きいものから順に用いて行うものとして説明したが、逆にパワーの閾値の小さいものから順に行うことも可能である。
【００６１】
そこで、図１０の構成において認識候補の枝刈りをパワーの閾値の小さいものから順に行う方式を適用した、第２の実施形態の第１変形例について、図１３のフローチャートを参照して説明する。
【００６２】
ステップＳ３０１では、ｉ＝Ｎが初期設定される。
ステップＳ３０２では、すべての認識候補に対して、無音区間検出部＃ｉからの無音区間情報を用いてパターン照合部２０３により照合スコアが算出される。このパターン照合部２０３での照合スコア計算には、前記第１の実施形態で述べた（パターン照合部１０３での）照合方式を用いる。
【００６３】
ステップＳ３０３では、ステップＳ３０２で算出された照合スコアから、予め用意された枝刈りのための認識候補数Ｍｉに従い、上位Ｍｉ位までの認識候補が選択され、次のステップの認識候補として残される。ここでは、先の照合方式の例と異なって、Ｍｉ＜Ｍｉ＋１となるように設定されているものとする。
【００６４】
ステップＳ３０４では、ｉが無音区間検出部＃１を表すパラメータ値（無音区間検出部番号）１に達したかどうかが判定される。ｉ＝１となったなら最終ステップＳ３０６に、そうでなければステップＳ３０５に進む。
ステップＳ３０５では、ｉが１だけカウントダウンされ、ステップＳ３０２に戻る。
【００６５】
ステップＳ３０６では、その時点において残されている（上位Ｍ１位までの）認識候補の中から照合スコアが最大となるものが認識結果判定部２０５により選ばれ、認識結果として出力される。
以上、第２の実施形態の第１変形例に係るパターン照合部２０３及び認識結果判定部２０５における処理について説明した。
【００６６】
以上の方式では、まず小さいパワーの閾値で、無音区間における誤ったマッチングを許して複数の認識候補が選択される。正解候補は、無音区間以外では正しくマッチングするので、上位候補に入る。そして、徐々にパワーの閾値を大きくしてマッチングを行うことにより、無音区間における誤ったマッチングの影響を減らすことができ、最終的に正解候補を検出することが可能である。
【００６７】
このような方式を用いれば、認識候補選択の際に、まず、無音区間における誤ったマッチングを含む認識候補の中から、段階的に無音区間の誤ったマッチングの影響を減らしていくことができ、認識の精度を向上させ、誤認識を減らすことができる。
以上が本発明の第２の実施形態の第１変形例における音声認識装置の作用、効果の詳細な説明である。
【００６８】
（第２の実施形態の第２変形例）
以上に述べた第２の実施形態、及び当該実施形態の第１変形例では、異なるパワーの閾値ＴＨｉを用いて各閾値ＴＨｉごとに検出される無音区間の情報に対してパターン照合部２０３で得られる複数の照合スコアを順番に用いて認識候補を枝刈りし、認識結果を求めるものとして説明したが、これに限るものではない。例えば、各閾値ＴＨｉごとに得られる照合スコアの重み付け和をとることにより認識結果を判定することも可能である。
【００６９】
そこで、この方式を用いた第２の実施形態の第２変形例について、図１４のフローチャートを参照して説明する。
ステップＳ５０１では、認識候補番号ｉが１に初期設定される。
【００７０】
ステップＳ５０２では、無音区間検出部＃ｊを示すパラメータ（無音区間検出部番号）ｊが１に初期設定される。
ステップＳ５０３では、無音区間検出部＃ｊからの無音区間情報を用いて認識候補ｉ（認識候補番号がｉの認識候補）の照合スコアｓｉｊがパターン照合部２０３により計算される。
【００７１】
ステップＳ５０４では、ｊが無音区間検出部＃Ｎを表すパラメータ値（無音区間検出部番号）Ｎに達したかどうかが判定される。ｊ＝ＮとなったならステップＳ５０６に、そうでなければステップＳ５０５に進む。
ステップＳ５０５では、ｊが１だけカウントアップされ、ステップＳ５０３に戻る。
【００７２】
ステップＳ５０６では、各無音区間検出部＃ｊ（ｊ＝１〜Ｎ）、つまり無音区間検出部＃１〜＃Ｎからの無音区間情報を用いて算出された照合スコアｓｉｊの重みｗｊによる重み付け和、つまりｓｉ１〜ｓｉＮの重みｗ１〜ｗＮによる重み付け和が計算され、認識結果判定に用いられる認識候補ｉの照合スコアＳｉが計算される。ここでｗｊは予め定められている重み（０≦ｗｊ≦１）であり、無音区間検出部＃ｊからの無音区間情報を用いて算出された照合スコアｓｉｊに対する重みである。
【００７３】
ステップＳ５０７では、すべての認識候補について照合スコアＳｉが計算されたかどうかが、ｉの値により判定される。ｉが、認識候補数に達していれば最終ステップＳ５０９に、達していなければステップＳ５０８に進む。
ステップＳ５０８では、認識候補番号ｉが１だけカウントアップされて、ステップＳ５０２に戻る。
【００７４】
ステップＳ５０９では、認識結果判定部２０５により、すべての認識候補の照合スコアＳｉが比較され、Ｓｉが最大となる認識候補が認識結果として判定されて出力される。
以上、第２の実施形態の第２変形例に係るパターン照合部２０３及び認識結果判定部２０５における処理について説明した。
【００７５】
以上の方式では、異なるパワーの閾値ＴＨ１〜ＴＨＮに基づいて得られる各閾値ごとの無音区間情報を用いて算出される、同一認識候補ｉについての照合スコアｓｉ１〜ｓｉＮに対して適当な重みｗ１〜ｗＮをかけて和をとることにより、無音区間の影響を任意に照合スコアに反映させることができる。このため、無音区間における誤ったマッチングの影響を減らすことができる。
以上が本発明の第２の実施形態の第２変形例における音声認識装置の作用、効果の詳細な説明である。
【００７６】
なお、前記第２の実施形態では、無音区間検出部＃１（２０６−１）〜＃Ｎ（２０６−Ｎ）が並行して動作するものとして説明したが、発声区間検出部２０１で検出された発声区間の音声信号をメモリ等の記憶手段に格納しておき、この状態で無音区間検出部＃１（２０６−１）〜＃Ｎ（２０６−Ｎ）を順に起動して、上記記憶手段内の音声信号を対象としてその無音区間検出部に固有の閾値で無音区間を検出させ、その都度検出した無音区間情報をパターン照合部２０３に送るようにしても構わない。
【００７７】
また、以上の実施形態における発声区間検出部１０１（２０１）、特徴ベクトル抽出部１０２（２０２）、パターン照合部１０３（２０３）、認識結果判定部１０５（２０５）、無音区間検出部１０６（２０６−１〜２０６−Ｎ）の各機能は、ソフトウェアとしても実現可能である。
【００７８】
また、本実施形態は、コンピュータに以上の実施形態に係る音声認識装置で適用したパターン照合方式を含む所定の手順を実行させるための（或いはコンピュータを音声認識装置の持つ所定の手段として機能させるための、或いはコンピュータに音声認識装置の持つ所定の機能を実現させるための）プログラムを記録したコンピュータ読み取り可能なＣＤ−ＲＯＭ等の記録媒体として実施することもできる。また、このプログラムが通信媒体を介してダウンロードされるものであっても構わない。
【００７９】
この他、本発明の実現形態には上述の例に対して種々の変形が可能であり、それらも趣旨に反しない限り本発明の実施形態の範囲内である。
【００８０】
【発明の効果】
以上説明したように、本発明によれば、発声区間内に予期しないパワーの低い無音区間が存在しても、その無音区間を検出し、標準特徴パターンとの照合の際に利用することにより、無音区間における誤ったパターンマッチングを回避することができ、高精度な認識が可能となる等の実用上多大な効果が奏せられる。
【図面の簡単な説明】
【図１】本発明の第１の実施形態に係る音声認識装置の基本構成を表わすブロック図。
【図２】入力音声信号における無音区間を表わす概念図。
【図３】ナル遷移を含むＨＭＭの構成を示す図。
【図４】パターン照合方式の流れを示す図。
【図５】無音区間から有音区間へ切り替わった時刻における処理の流れを示す図。
【図６】入力音声信号におけるパワーの様子を示す図。
【図７】入力音声信号におけるパワーの様子の詳細を示す図。
【図８】ＨＭＭの構成の具体例を示す図。
【図９】パターン照合処理後の最適経路の概念図。
【図１０】本発明の第２の実施形態に係る音声認識装置の基本構成を表わすブロック図。
【図１１】複数の閾値による入力信号の無音区間を表わす概念図。
【図１２】複数の閾値を用いるパターン照合方式の流れを示す図。
【図１３】複数の閾値を用いるパターン照合方式の流れの第１変形例を示す図。
【図１４】複数の閾値を用いるパターン照合方式の流れの第２変形例を示す図。
【符号の説明】
１０１，２０１…発声区間検出部
１０２，２０２…特徴ベクトル抽出部
１０３，２０３…パターン照合部
１０４，２０４…標準特徴パターン記憶部
１０５，２０５…認識結果判定部
１０６，２０６−１〜２０６−Ｎ…無音区間検出部

[0001]
[Technical field of the invention]
The present invention relates to a speech recognition method and apparatus suitable for accurately recognizing spoken speech.
[0002]
[Prior art]
In recent years, speech recognition technology has played an important role in realizing an excellent man-machine interface. Recently, research and development for natural utterance recognition, which does not require restrictions on the utterance method of the speaker, such as word spotting using HMM and continuous speech recognition, have been actively conducted. Conventionally, in these speech recognition methods, an utterance content is recognized by cutting out a section in which it is determined that a speaker is speaking from an input signal and matching that portion with a standard pattern.
[0003]
However, in actual natural utterances, low-power silent sections, caused for example by geminate consonants (sokuon), fricatives, or devoiced voiced sounds, may occur even within portions judged to be utterance sections. In a low-power section the influence of background noise becomes relatively large, so the spectrum of the signal is unstable; as a result the section is matched with an incorrect pattern, and misrecognition often occurred.
[0004]
Furthermore, since the silent section with low power generated in such natural speech is difficult to anticipate in advance, it cannot be registered as a standard pattern.
[0005]
[Problems to be solved by the invention]
As described above, conventionally, when a silent section with low power is present in the section detected as the utterance section, there is a problem that the background noise spectrum becomes dominant in the section and erroneous pattern matching occurs. Further, in the utterance section, it is difficult to predict a section where the power is low in advance, and therefore, there is a problem that those patterns cannot be registered as a standard pattern.
[0006]
The present invention has been made in consideration of the above circumstances, and an object thereof is to provide a speech recognition method and apparatus capable of highly accurate recognition that is unaffected by irregularly occurring low-power silent sections, even when such sections exist within the utterance section.
[0007]
[Means for Solving the Problems]
The present invention provides a speech recognition method in which an input signal is acoustically analyzed to detect a section in which speech is uttered, a feature vector series is extracted from the speech signal of the detected utterance section, the extracted feature vector series is matched, by a first matching method, against standard patterns of speech signals prepared in advance for each predetermined recognition candidate so as to calculate a matching score representing the similarity or distance between the two, and a recognition result is determined based on the matching score of each recognition candidate. The method is characterized in that a silent section of the speech signal is detected from the short-time power of the speech signal of the detected utterance section, the feature vector series of that silent section is excluded from pattern matching, and the matching score is calculated by matching the feature vector series corresponding to the time of change from a silent section to a voiced section using a second matching method that takes the influence of the silent section into account. Here, an HMM (hidden Markov model) matching method may be applied as the first matching method, and an HMM matching method that allows null transitions as the second matching method.
[0008]
According to the present invention, even if an unexpected low-power silent section exists within the utterance section, that silent section is detected and excluded when matching against the standard patterns, so that erroneous pattern matching in the silent section can be avoided and highly accurate recognition becomes possible. Moreover, because the second matching method that takes the influence of the silent section into account, for example an HMM matching method that allows null transitions, is applied to the feature vector series corresponding to the time of change from a silent section to a voiced section, no inconsistency arises in the state transitions from not having used the silent section (its feature vectors) for matching.
[0009]
Here, when a null transition is allowed at the time of switching from a silent section to a voiced section, the HMM state that makes the null transition into the HMM state at that time (first state i) is preferably chosen as the state (second state j) whose optimum-path matching score is the maximum among the states up to and including state i at the immediately preceding time (frame). A null transition from state j to state i is then made, and the matching score of state i at that immediately preceding time is replaced with the matching score of state j at the same time. The states from which a null transition into state i is possible may be restricted, for example according to the duration of the silent section.
[0010]
The present invention is also characterized in that: detection of silent sections based on the short-time power of the speech signal of the utterance section is performed independently with different threshold values; the feature vector series extracted from the speech signal of the utterance section is matched, by a hidden Markov model matching method, against the standard patterns of speech signals prepared in advance for each predetermined recognition candidate, using the silent section information detected independently for each of the different threshold values, so that a matching score is calculated for each threshold value; in doing so, the feature vector series of the silent sections detected with the corresponding threshold value is excluded from pattern matching, and a hidden Markov matching method that allows null transitions is applied only at times of change from a silent section to a voiced section; and the recognition result is determined based on the matching scores of each recognition candidate obtained for each threshold value.
[0011]
In this way, using the silent section information obtained for each threshold, for each threshold, the corresponding silent section is excluded from pattern matching, and a matching score is obtained for each recognition candidate, and the recognition result is obtained based on the matching score. By determining, it is possible to reduce the influence of erroneous matching in the silent section.
[0012]
Here, each time the matching score of each recognition candidate is calculated for one threshold value, the recognition candidates are preferably narrowed down based on those scores, and this operation is repeated while the threshold value is switched stepwise in one direction. The silent section detection using the different threshold values may itself be performed in parallel, or sequentially while switching the threshold value. In the former case, the silent section detection results need to be stored. In the latter case, at least the speech signal of the utterance section needs to be stored.
[0013]
In this way, by changing the threshold for silent section detection (the power threshold) stepwise in one direction and performing pattern matching while pruning the recognition candidates, the recognition candidates can be narrowed down step by step, which improves recognition accuracy and reduces misrecognition.
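The stepwise pruning described above can be sketched as follows. This is only an illustrative sketch, not the patented implementation: the function name `prune_by_thresholds`, the callback `score_fn`, and the toy score table are hypothetical, the per-stage candidate counts stand in for the counts Mi of the second embodiment, and a higher matching score is assumed to be better.

```python
def prune_by_thresholds(candidates, thresholds, keep_counts, score_fn):
    """Narrow the candidate list stage by stage.

    thresholds:  power thresholds in the order in which they are applied.
    keep_counts: number of candidates kept after each stage (assumed values).
    score_fn:    hypothetical callback, score_fn(candidate, threshold) -> score,
                 where a higher score means a better match (assumption).
    """
    remaining = list(candidates)
    scores = {}
    for th, m in zip(thresholds, keep_counts):
        # Score only the candidates that survived the previous stage.
        scores = {c: score_fn(c, th) for c in remaining}
        # Keep the top-m candidates for the next stage.
        remaining = sorted(remaining, key=scores.get, reverse=True)[:m]
    best = remaining[0]  # list is sorted, so the first entry has the top score
    return best, scores[best]


# Toy score table standing in for the HMM matching (made-up numbers).
table = {
    ("tosaka", 0.5): -2.0, ("tosaka", 0.1): -3.0,
    ("osaka", 0.5): -3.0, ("osaka", 0.1): -3.5,
    ("asaka", 0.5): -9.0, ("asaka", 0.1): -1.0,
}
result = prune_by_thresholds(
    ["tosaka", "osaka", "asaka"],
    thresholds=[0.5, 0.1],   # decreasing thresholds
    keep_counts=[2, 1],
    score_fn=lambda c, th: table[(c, th)],
)
```

In this toy run the candidate "asaka" is pruned at the first (largest) threshold, so its good score at the smaller threshold, which in the scenario of the text would come from noisy low-power frames, never influences the result.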
[0014]
Here, if the threshold is switched in the direction of decreasing threshold value, weight can be placed, when selecting recognition candidates, on the high-power portions where the spectrum is stable, and the influence of low-power sections where the spectrum is unstable can be reduced.
[0015]
The threshold may instead be switched in the direction of increasing threshold value. In this case, erroneous matching in the silent sections is allowed at first and a plurality of recognition candidates are selected, but the correct candidate matches correctly outside the silent sections and therefore ranks among the top candidates. By then performing matching with a gradually increased threshold, the influence of erroneous matching in the silent sections can be reduced, and the correct candidate can finally be detected.
[0016]
Also, instead of narrowing down the recognition candidates while switching the threshold stepwise in one direction, it is possible to execute, for all recognition candidates, a process of calculating a weighted sum of the matching scores obtained for each threshold value for the same recognition candidate, and to determine the recognition result based on the weighted sums of the matching scores of all the recognition candidates.
In this case, the influence of the silent section can be arbitrarily reflected in the matching score, thereby reducing the influence of erroneous matching in the silent section.
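The weighted-sum variant can be illustrated with a short sketch. The names `combine_scores`, `per_threshold_scores` and `weights`, as well as the numbers, are hypothetical; the sketch only assumes that each candidate already has one matching score per threshold and that a higher score is better.

```python
def combine_scores(per_threshold_scores, weights):
    """Weighted sum of per-threshold matching scores for every candidate.

    per_threshold_scores: dict mapping a candidate to a list of scores,
                          one score per power threshold.
    weights:              one weight per threshold (assumed to lie in [0, 1]).
    Returns the best candidate and the dict of combined scores.
    """
    combined = {
        cand: sum(w * s for w, s in zip(weights, scores))
        for cand, scores in per_threshold_scores.items()
    }
    best = max(combined, key=combined.get)
    return best, combined


# Made-up scores for two candidates under two thresholds.
best, combined = combine_scores(
    {"tosaka": [-2.0, -4.0], "osaka": [-3.0, -3.0]},
    weights=[0.75, 0.25],
)
```

Choosing a larger weight for the scores obtained with the larger threshold makes the high-power frames count for more in the final decision; the weights themselves are a tuning choice.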
[0017]
Note that the present invention relating to a method is also established as an invention relating to an apparatus.
The present invention can also be realized as a computer-readable recording medium on which is recorded a program for causing a computer to execute a procedure corresponding to the invention (or for causing a computer to function as means corresponding to the invention, or for causing a computer to realize functions corresponding to the invention).
[0018]
[Embodiments of the invention]
Embodiments of the present invention will be described below with reference to the drawings.
[0019]
[First Embodiment]
FIG. 1 schematically shows a speech recognition apparatus according to a first embodiment of the present invention.
The speech recognition apparatus shown in FIG. 1 comprises: an utterance section detection unit 101 that analyzes an input signal to detect an utterance section; a feature vector extraction unit 102 that extracts feature vectors by acoustically analyzing the speech signal of the utterance section detected by the utterance section detection unit 101; a silent section detection unit 106 that detects silent sections from the speech signal of the utterance section detected by the utterance section detection unit 101, using the power of that speech signal; a standard feature pattern storage unit 104 that stores standard feature patterns, learned in advance, of each predetermined recognition candidate; a pattern matching unit 103 that, using the silent section information detected by the silent section detection unit 106, matches the feature vector series extracted by the feature vector extraction unit 102 against the standard feature patterns of each recognition candidate stored in the standard feature pattern storage unit 104 by a matching method using HMMs; and a recognition result determination unit 105 that determines the recognized utterance content based on the matching result for each recognition candidate obtained by the pattern matching unit 103.
[0020]
FIG. 1 omits the speech input unit, which includes a microphone and an A/D (analog/digital) converter and which receives the speech uttered by the speaker and converts it into a digital electric signal (digital speech signal).
[0021]
Next, the processing concept of the speech recognition apparatus configured as shown in FIG. 1 will be described.
The speech signal of the utterance section detected by the utterance section detection unit 101 is frequency-analyzed by the feature vector extraction unit 102 for each of a plurality of predetermined frequency bands and converted into a feature vector series (feature vector time series) {xt}. A feature vector (feature parameter) is obtained in units of a fixed time length called a frame. Well-known feature vectors used for speech recognition include the power spectrum, which can be obtained with a bandpass filter bank or a Fourier transform, and cepstrum coefficients obtained by LPC (linear prediction) analysis. In this embodiment, however, the type of feature vector used does not matter.
The feature vector time series extracted by the feature vector extraction unit 102 is sent to the pattern matching unit 103.
[0022]
On the other hand, the speech signal in the utterance interval is also sent to the silence interval detection unit 106, and the silence interval is detected in synchronization with the frame of the feature vector series from the short-time power of the audio signal. FIG. 2 conceptually shows the state of a signal in which a silent section is detected by this part of processing. In FIG. 2, the horizontal axis represents time, the vertical axis represents the short-time power of the signal, and TH represents a preset power threshold.
[0023]
The silent section detection unit 106 compares the short-time power value Pt at each time t with the power threshold TH, and judges any section where Pt < TH to be a silent section. Information indicating the silent sections obtained in this way (silent section information) is sent to the pattern matching unit 103. Here, time t refers to the t-th frame in the utterance section.
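The frame-wise comparison of Pt with TH can be sketched as below. The text does not define how the short-time power is computed; the mean squared amplitude of a frame is used here purely as an assumption, and the function names and sample values are hypothetical.

```python
def short_time_power(frame):
    """Short-time power of one frame, assumed here to be the mean squared amplitude."""
    return sum(sample * sample for sample in frame) / len(frame)


def detect_silent_frames(frames, th):
    """Return one flag per frame: True where P_t < TH, i.e. the frame is silent."""
    return [short_time_power(frame) < th for frame in frames]


# Three toy frames of three samples each (made-up values).
frames = [
    [0.5, -0.5, 0.4],     # high power
    [0.01, -0.02, 0.01],  # low power
    [0.3, 0.2, -0.4],     # high power
]
silent = detect_silent_frames(frames, th=0.01)
```

The resulting list plays the role of the silent section information that is passed, frame-synchronously with the feature vectors, to the pattern matching unit.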
[0024]
The pattern matching unit 103 performs pattern matching using the input feature vector series, silent section information, and a standard feature pattern (standard pattern) learned in advance. The standard feature pattern is stored in advance in the standard feature pattern storage unit 104 as an HMM for each predetermined recognition candidate (recognition unit). At the time of recognition, this HMM is used as it is or in combination.
[0025]
FIG. 3 shows the structure of the HMM used for collation. Here, of the state transitions, the transition with the symbol c is a null transition, and the transitions with the symbols a and b are a normal state transition and a self-loop, respectively. In the HMM of FIG. 3, null transitions are assumed between all states, but it is also possible to limit the states in which null transitions occur by providing constraints here.
[0026]
Next, the pattern matching method applied in the pattern matching unit 103, which uses the HMM having the structure of FIG. 3, will be described with reference to the flowchart of FIG. 4.
In step S101, whether the input signal at time t, that is, the signal of the t-th frame, belongs to an utterance section is determined based on the detection result of the utterance section detection unit 101. If the input signal at time t belongs to an utterance section, the process proceeds to step S102; if not, the process proceeds to step S106.
[0027]
In step S102, whether the input signal at time t belongs to a silent section is determined based on the detection result of the silent section detection unit 106. If the signal is determined to belong to a silent section, the process proceeds to step S107; if it is determined to belong to a voiced section, the process proceeds to step S103.
[0028]
In step S103, the value of the flag (FLAG) is evaluated. The flag takes the value 0 or 1 and indicates whether the signal at time t-1 (that is, the signal one frame earlier) belonged to a silent section (FLAG = 0) or to a voiced section (FLAG = 1). When the flag is 0, time t is determined to be the time at which the signal switched (changed) from a silent section (lasting until time t-1) to a voiced section, and the process proceeds to step S108; when the flag is 1, the voiced section is determined to be continuing, and the process proceeds to step S104.
[0029]
In step S104, in the HMM shown in FIG. 3, all state transition probabilities except for the null transition and output probabilities of all distributions for the signal at time t are calculated, and the optimum transition is determined. After the determination, the process proceeds to step S105.
In step S105, time t is set to the next time t + 1, and the process returns to step S101.
[0030]
In step S106, for each recognition candidate, the state whose matching score is the maximum at the utterance section end time t is selected in the HMM shown in FIG. 3, the matching score of each recognition candidate is sent to the recognition result determination unit 105, and the process ends. Here, as is well known, the matching score is an evaluation value representing the similarity or distance between the feature vector series of the input speech signal and the standard feature pattern.
In step S107, in response to the determination in step S102 that the signal at time t belongs to a silent section, the flag described above is set to 0, and the process proceeds to step S105. No pattern matching is performed here; the feature vector at time t is therefore excluded from pattern matching, and the matching score of each state is not updated.
[0031]
In step S108, in response to the determination in step S103 that time t is the time at which the signal switched from a silent section to a voiced section, null transitions are first performed in the HMM shown in FIG. 3, and the matching score of each state at time t-1 is updated. After the matching scores are updated, all state transition probabilities except those of the null transitions, and the output probabilities of all distributions, are calculated, and the optimum transition is determined. After the determination, the process proceeds to step S109. Details of this part of the processing will be described later.
[0032]
In step S109, when it is determined in step S102 that the signal at time t is a signal in a voiced section, the value of the flag described above is set to 1, and the process proceeds to step S105.
[0033]
The above is the outline and flow of the pattern matching method directly related to the present invention.
By the processing in the pattern matching unit 103 to which the pattern matching method is applied, the matching scores of all the recognition candidates are calculated, and the recognition candidate having the maximum score is selected as the recognition result in the recognition result determination unit 105.
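The flow of steps S102 to S109 can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the implementation of the embodiment: the function name `match_candidate`, the log-domain scores, the start in state 0 with score 0, and the initial flag value of 1 are all assumptions introduced for the example.

```python
import math

NEG_INF = -math.inf


def match_candidate(frames, is_silent, log_trans, log_emit):
    """Return the best matching score (log domain) for one recognition candidate.

    frames    : feature vectors of the utterance section, one per frame
    is_silent : is_silent[t] is True when frame t lies in a silent section
    log_trans : log_trans[j][i] is the log probability of the ordinary
                (non-null) transition from state j to state i
    log_emit  : function (state, frame) -> log output probability
    """
    n = len(log_trans)
    # Assumption: matching starts in state 0 with score 0.
    score = [0.0] + [NEG_INF] * (n - 1)
    flag = 1  # assumption: treat the frame before the first one as voiced
    for frame, silent in zip(frames, is_silent):
        if silent:
            # S107: no pattern matching, scores stay as they are.
            flag = 0
            continue
        if flag == 0:
            # S108, first half: null transitions. State i takes over the
            # best score found among states 0..i at time t-1.
            best = NEG_INF
            updated = []
            for s in score:
                best = max(best, s)
                updated.append(best)
            score = updated
        # S104 (and second half of S108): ordinary transitions and
        # output probabilities, keeping the optimum transition per state.
        score = [
            max(score[j] + log_trans[j][i] for j in range(n)) + log_emit(i, frame)
            for i in range(n)
        ]
        flag = 1  # S109; has no effect when the flag was already 1
    # S106: score of the best state at the end of the utterance section.
    return max(score)
```

With a three-state left-to-right model, marking the middle frame as silent lets the score pass from the first state toward the last state without paying for a mismatched frame in between.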
[0034]
Here, the details of the processing in step S108, performed at the time t at which the silent section switches to the voiced section, will be described with reference to the flowchart of FIG. 5.
At time t, first, in step S401, the state number i is set to the final state.
[0035]
In step S402, for state i, the state j that maximizes the matching score of the optimum route at time t-1 (one frame before) is selected from among states 0 through i.
[0036]
In step S403, a null transition from state j to state i occurs, and the matching score at time t-1 (one frame before) of state i is replaced with the matching score at time t-1 of state j.
[0037]
In step S404, it is determined whether or not the state i is the leading state 0. If the state is 0, the process proceeds to the final step S406, and if not, the process proceeds to step S405.
[0038]
In step S405, i is counted down by 1, and the process returns to step S402.
In step S406, the optimum route excluding the null transition and its matching score at time t are obtained for all states.
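Steps S401 to S405 can be written out as the following Python sketch. The function name `apply_null_transitions` and the representation of the scores at time t-1 as a plain list indexed by state number are assumptions made for illustration; step S406 (the ordinary transition and output probability calculation) is not included.

```python
def apply_null_transitions(prev_score):
    """Update the matching scores of time t-1 by null transitions.

    prev_score[k] is the matching score of the optimum route that is in
    state k at time t-1. A new list is returned; the input is not modified.
    """
    score = list(prev_score)
    i = len(score) - 1  # S401: start from the final state
    while True:
        # S402: among states 0..i, pick the state j with the maximum score.
        j = max(range(i + 1), key=lambda k: score[k])
        # S403: null transition j -> i replaces the score of state i.
        score[i] = score[j]
        if i == 0:  # S404: the leading state has been processed
            break
        i -= 1  # S405
    return score
```

Because the loop runs from the final state down to state 0, the score written to state i never influences the choice made for a lower-numbered state, so each state ends up with the maximum of the original scores of states 0 through i.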
[0039]
Thus, by taking the null transition into account at the time when the silent section switches to the voiced section, the effect of not using the feature vectors of the silent section for matching can be removed. Here, it is assumed that a null transition to state i can occur from any of states 0 through i. However, this can be restricted; for example, the states from which a null transition to state i is possible may be limited according to the duration of the silent section or the like (with fewer such states the shorter the duration). Further, when the duration of the silent section is equal to or less than a predetermined threshold, null transitions may be prevented altogether. Further, although states from which a null transition is possible are searched for here for every state, starting from the final state, this need not be done for all states; null transitions may be performed only for states in which a silent section is likely to occur according to prior information.
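One way such a duration-based restriction might look is sketched below. The text does not specify how the duration maps to the number of permitted source states, so the function name `apply_limited_null_transitions` and the parameters `min_frames` and `frames_per_state` (with their default values) are purely hypothetical choices for illustration.

```python
def apply_limited_null_transitions(prev_score, silence_frames,
                                   min_frames=3, frames_per_state=5):
    """Null-transition update whose reach depends on the silence duration.

    prev_score      : matching scores at time t-1, indexed by state number
    silence_frames  : duration of the preceding silent section in frames
    min_frames      : hypothetical threshold; at or below it, no null
                      transition is performed
    frames_per_state: hypothetical scale; one more preceding state becomes
                      a permitted source for every this many silent frames
    """
    if silence_frames <= min_frames:
        return list(prev_score)  # silence too short: scores unchanged
    # Longer silence allows null transitions from states further back.
    max_skip = max(1, silence_frames // frames_per_state)
    return [
        max(prev_score[max(0, i - max_skip): i + 1])
        for i in range(len(prev_score))
    ]
```

Each state i then takes the best score among itself and at most `max_skip` preceding states, so a short silent section permits fewer source states than a long one.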
[0040]
Next, the effect of this embodiment will be described with reference to FIGS. 6 to 9.
FIG. 6 is a schematic diagram of signal power when “TOSAKA” is spoken. Here, times T0 and T7 indicate the start time and end time of the utterance section, respectively. In addition, each of the sections T0-T1, T2-T3, T4-T5, and T6-T7 is a section determined to be a silent section by the power threshold TH.
[0041]
In general, a silent section within an utterance section arises from a geminate consonant (sokuon), a fricative, or the devoicing of a voiced sound. In such a section, the influence of background noise becomes relatively large, so that matching with an erroneous pattern is likely to occur, and erroneous recognition may result. In FIG. 6, erroneous pattern matching may occur in the sections T0 to T1, T2 to T3, T4 to T5, and T6 to T7.
[0042]
FIG. 7 shows in more detail the short-time power of the speech signal and the utterance content (expressed here as a phoneme string) in the section from T2 to T3. In this example, the section corresponding to the fricative /S/ lies entirely below the threshold TH. As described above, erroneous matching is likely to occur in the section from T2 to T3, where the power is equal to or less than the threshold TH.
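Detecting silent sections from short-time power and a threshold TH, as in FIGS. 6 and 7, can be sketched as follows. This is a minimal sketch under assumptions: the function names, the non-overlapping frames, and the definition of short-time power as the mean squared sample value per frame are illustrative choices, not details given in the text.

```python
def short_time_power(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame (assumed definition)."""
    return [
        sum(x * x for x in samples[s:s + frame_len]) / frame_len
        for s in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def silent_sections(power, th):
    """Return (start, end) frame index pairs, end exclusive, where power <= th."""
    sections = []
    start = None
    for t, p in enumerate(power):
        if p <= th and start is None:
            start = t  # a silent section begins at frame t
        elif p > th and start is not None:
            sections.append((start, t))  # the silent section ended before frame t
            start = None
    if start is not None:
        sections.append((start, len(power)))  # silence runs to the last frame
    return sections
```

Applied to a power contour like that of FIG. 6, the returned pairs would correspond to sections such as T0-T1, T2-T3, T4-T5, and T6-T7.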
[0043]
FIG. 8 shows an HMM representing “TOSAKA” in which, for simplicity, each phoneme is represented by one state. Also for simplicity, the null transitions from the states /O/, /S/, /A/, /K/, and /A/ are omitted.
[0044]
When the pattern matching method described above is applied to the HMM shown in FIG. 8, control is performed so that the feature vector series in the section from T2 to T3 (a silent section of the speech signal) is not used for matching. For this reason, unlike the prior art, in which the feature vector series is used for matching regardless of whether the speech signal is voiced or silent, no erroneous matching occurs in the section from T2 to T3 (the silent section), and the matching score is not adversely affected. Moreover, because the pattern matching method applied in the present embodiment allows a null transition at the time when the silent section changes to the voiced section, no state transition inconsistency arises from not using the silent section for matching.
[0045]
As a result, in the present embodiment, the transition shown in FIG. 9 is possible without adversely affecting the matching score. In this example, since the power of the feature vectors corresponding to the phoneme /S/ is equal to or less than the power threshold TH, the feature vectors of this portion are not used for matching, and a transition from the phoneme /O/ state to the phoneme /A/ state, passing over the phoneme /S/ state, is allowed.
This can be considered in the same manner for silent sections other than T2-T3 (T0-T1, T4-T5, T6-T7).
[0046]
When the utterance period ends, the optimal state transition path at time T7 and the matching score at that time are obtained for all states, and the maximum score may be used for determination of the recognition result.
[0047]
If this method is used, it is possible to avoid raising the matching score of an incorrect recognition candidate through erroneous matching in a silent section when recognition candidates are matched against an utterance. As a result, the accuracy of the matching score is improved, leading to an improvement in recognition rate. The above is the detailed description of the configuration, operation, and effect of the speech recognition apparatus according to the first embodiment of the present invention.
[0048]
[Second Embodiment]
FIG. 10 schematically shows a speech recognition apparatus according to the second embodiment of the present invention.
[0049]
The speech recognition apparatus shown in FIG. 10 includes an utterance section detection unit 201, a feature vector extraction unit 202, a pattern matching unit 203, a standard feature pattern storage unit 204, a recognition result determination unit 205, and N silent section detection units #1 (206-1) to #N (206-N).
[0050]
The configuration of FIG. 10 is characterized in that the N silent section detection units #1 (206-1) to #N (206-N) (each corresponding to the silent section detection unit 106 in FIG. 1) detect silent sections of the speech signal (of the utterance section) based on mutually different signal power thresholds TH1 to THN prepared in advance. For this reason, the functions of the pattern matching unit 203 (corresponding to the pattern matching unit 103 in FIG. 1) and the recognition result determination unit 205 (corresponding to the recognition result determination unit 105 in FIG. 1) also differ in part, as will be described later. The other components, that is, the utterance section detection unit 201, the feature vector extraction unit 202, and the standard feature pattern storage unit 204, are the same as the utterance section detection unit 101, the feature vector extraction unit 102, and the standard feature pattern storage unit 104 in FIG. 1.
[0051]
Therefore, the operation of the speech recognition apparatus having the configuration shown in FIG. 10 will be described focusing on the differences from the speech recognition apparatus shown in FIG. 1.
The speech signal of the utterance section detected by the utterance section detection unit 201 is input in parallel to the silent section detection units #1 (206-1) to #N (206-N). Each silent section detection unit #i (i = 1 to N) has a different power threshold THi, and silent sections of the speech signal are detected independently using these thresholds.
[0052]
FIG. 11 shows how the silent section detection units #1 (206-1), #2 (206-2), ..., #N (206-N) detect silent sections of the speech signal in the utterance section using the preset power thresholds TH1, TH2, ..., THN. Here, it is assumed that THi > THi+1 is set.
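The independent detection with N thresholds can be illustrated by the following sketch. The function name `silent_masks` and the per-frame boolean representation of the silent section information are assumptions for the example; only the ordering THi > THi+1 comes from the text.

```python
def silent_masks(power, thresholds):
    """Return one per-frame silent/voiced mask per threshold.

    power      : short-time power of each frame in the utterance section
    thresholds : [TH1, TH2, ..., THN], strictly decreasing (THi > THi+1)
    Result     : masks[i][t] is True when frame t is silent for threshold TH(i+1)
    """
    if any(a <= b for a, b in zip(thresholds, thresholds[1:])):
        raise ValueError("thresholds must satisfy THi > THi+1")
    return [[p <= th for p in power] for th in thresholds]
```

Since the thresholds decrease, every frame judged silent with a smaller threshold is also judged silent with any larger one, so the silent sections shrink as i increases.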
[0053]
Information indicating the silent sections (silent section information) detected independently by the silent section detection units #1 (206-1) to #N (206-N) is sent to the pattern matching unit 203. The time series of feature vectors (feature vector series) extracted by the feature vector extraction unit 202 is also sent to the pattern matching unit 203. The pattern matching unit 203 calculates a matching score for each recognition candidate using the feature vector series input from the feature vector extraction unit 202 and the silent section information input from the silent section detection units #1 (206-1) to #N (206-N).
[0054]
Here, processing in the pattern matching unit 203 and the recognition result determination unit 205 will be described with reference to the flowchart of FIG. 12.
In step S201, an initial setting process is performed, and i = 1 is set as a parameter (silent section detector number) indicating the silent section detector #i.
[0055]
In step S202, for all recognition candidates, the matching score is calculated by the pattern matching unit 203 using the silent section information from the silent section detection unit #i. For the matching score calculation in the pattern matching unit 203, the matching method (of the pattern matching unit 103) described in the first embodiment is used.
[0056]
In step S203, based on the matching score for each recognition candidate calculated in step S202, the top Mi recognition candidates are selected according to the pruning candidate count Mi prepared in advance, and are kept as the recognition candidates for the next step. Here, it is assumed that Mi > Mi+1 is set.
[0057]
In step S204, it is determined whether i has reached a parameter value (silent section detector number) N representing the silent section detector #N. If i = N, the process proceeds to the final step S206; otherwise, the process proceeds to step S205.
In step S205, i is incremented by 1, and the process returns to step S202.
[0058]
In step S206, the recognition result determination unit 205 selects the candidate having the largest matching score from the recognition candidates remaining at that point (the top MN candidates), and outputs it as the recognition result.
The processing in the pattern matching unit 203 and the recognition result determination unit 205 in the second embodiment has been described above.
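The staged pruning of steps S201 to S206 can be sketched as follows. This is a hedged illustration: the function name `recognize_with_pruning`, the callback `score_fn(candidate, i)` standing in for matching with the silent section information of detection unit #i, and the choice to re-score only the candidates that survived the previous stage are assumptions, and a larger matching score is taken to mean a better match.

```python
def recognize_with_pruning(candidates, score_fn, keep_counts):
    """Narrow down recognition candidates stage by stage.

    candidates  : iterable of hashable recognition candidates
    score_fn    : function (candidate, i) -> matching score obtained with the
                  silent section information of detection unit #i (1-based)
    keep_counts : non-empty list [M1, M2, ..., MN] with Mi > Mi+1
    """
    remaining = list(candidates)
    scores = {}
    for i, m_i in enumerate(keep_counts, start=1):  # S201, S204, S205
        # S202: matching scores for the candidates still in play.
        scores = {c: score_fn(c, i) for c in remaining}
        # S203: keep only the top Mi candidates for the next stage.
        remaining = sorted(remaining, key=scores.get, reverse=True)[:m_i]
    # S206: best candidate among those left after the last stage.
    return max(remaining, key=scores.get)
```

A candidate that ranks highly only at one threshold can be dropped at a later stage, which is the intended narrowing effect.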
[0059]
When the above method is used, recognition candidates can be selected with greater weight on high-power portions, where the spectrum is stable, and the influence of low-power sections, where the spectrum is unstable, can be reduced. In addition, by changing the power threshold in stages and performing pattern matching while pruning recognition candidates, the recognition candidates can be narrowed down in stages, improving recognition accuracy and reducing misrecognition.
The above is the detailed description of the configuration, operation, and effect of the speech recognition apparatus according to the second embodiment of the present invention.
[0060]
(First Modification of Second Embodiment)
In the second embodiment described above, the pruning of recognition candidates in the pattern matching unit 203 was described as being performed in descending order of power threshold, but conversely, it can also be performed in ascending order of power threshold.
[0061]
Therefore, a first modification of the second embodiment, in which recognition candidates are pruned in ascending order of power threshold in the configuration of FIG. 10, will be described with reference to the flowchart of FIG. 13.
[0062]
In step S301, i is initialized to N.
In step S302, for all recognition candidates, the matching score is calculated by the pattern matching unit 203 using the silent section information from the silent section detecting unit #i. For the matching score calculation in the pattern matching unit 203, the matching method (in the pattern matching unit 103) described in the first embodiment is used.
[0063]
In step S303, based on the matching scores calculated in step S302, the top Mi recognition candidates are selected according to the pruning candidate count Mi prepared in advance, and are kept as the recognition candidates for the next step. Here, unlike the matching method described earlier, it is assumed that Mi < Mi+1 is set.
[0064]
In step S304, it is determined whether i has reached a parameter value (silent section detecting unit number) 1 representing the silent section detecting unit # 1. If i = 1, the process proceeds to the final step S306; otherwise, the process proceeds to step S305.
In step S305, i is counted down by 1, and the process returns to step S302.
[0065]
In step S306, the recognition result determination unit 205 selects the candidate having the maximum matching score from the recognition candidates remaining at that point (the top M1 candidates) and outputs it as the recognition result.
The processing in the pattern matching unit 203 and the recognition result determination unit 205 according to the first modification of the second embodiment has been described above.
[0066]
In the above method, a plurality of recognition candidates are first selected with a small power threshold, tolerating erroneous matching in the silent sections. Since the correct candidate matches correctly outside the silent sections, it is included among the top candidates. Then, by gradually increasing the power threshold and repeating the matching, the influence of erroneous matching in the silent sections is reduced, and the correct candidate is finally detected.
[0067]
If such a method is used, the influence of erroneous matching in the silent sections can be reduced step by step, starting from a set of recognition candidates selected while tolerating such erroneous matching, so that recognition accuracy can be improved and misrecognition can be reduced.
The above is the detailed description of the operation and effect of the speech recognition apparatus in the first modification of the second embodiment of the present invention.
[0068]
(Second modification of the second embodiment)
In the second embodiment described above and its first modification, the pattern matching unit 203 was described as using the silent section information detected for each of the different power thresholds THi to prune the recognition candidates in sequence with the resulting matching scores and thereby obtain the recognition result; however, the present invention is not limited to this. For example, the recognition result can also be determined by taking a weighted sum of the matching scores obtained for each threshold THi.
[0069]
Therefore, a second modification of the second embodiment using this method will be described with reference to the flowchart of FIG. 14.
In step S501, the recognition candidate number i is initially set to 1.
[0070]
In step S502, a parameter (silent section detector number) j indicating the silent section detector #j is initially set to 1.
In step S503, the pattern matching unit 203 calculates the matching score sij of the recognition candidate i (recognition candidate whose recognition candidate number is i) using the silent section information from the silent section detection unit #j.
[0071]
In step S504, it is determined whether j has reached a parameter value (silent section detector number) N representing the silent section detector #N. If j = N, the process proceeds to step S506; otherwise, the process proceeds to step S505.
In step S505, j is incremented by 1, and the process returns to step S503.
[0072]
In step S506, the weighted sum, with weights wj, of the matching scores sij calculated using the silent section information from the respective silent section detection units #j (j = 1 to N), that is, the sum of si1 to siN weighted by w1 to wN, is calculated to obtain the matching score Si of recognition candidate i used for recognition result determination (Si = w1·si1 + w2·si2 + ... + wN·siN). Here, wj is a predetermined weight (0 ≦ wj ≦ 1) applied to the matching score sij calculated using the silent section information from the silent section detection unit #j.
[0073]
In step S507, whether or not the matching score Si has been calculated for all recognition candidates is determined based on the value of i. If i has reached the number of recognition candidates, the process proceeds to the final step S509, and if not, the process proceeds to step S508.
In step S508, the recognition candidate number i is incremented by 1, and the process returns to step S502.
[0074]
In step S509, the recognition result determination unit 205 compares the matching scores Si of all recognition candidates, and determines and outputs the recognition candidate having the maximum Si as the recognition result.
The processing in the pattern matching unit 203 and the recognition result determination unit 205 according to the second modification example of the second embodiment has been described above.
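Steps S501 to S509 amount to computing Si as the weighted sum of si1 to siN for every candidate and taking the maximum. A minimal Python sketch follows; the function name `recognize_by_weighted_sum` and the callback `score_fn(candidate, j)`, which stands in for the matching score sij obtained with detection unit #j, are assumptions for illustration.

```python
def recognize_by_weighted_sum(candidates, score_fn, weights):
    """Pick the candidate with the largest weighted sum of matching scores.

    candidates : iterable of hashable recognition candidates
    score_fn   : function (candidate, j) -> matching score s_ij obtained with
                 the silent section information of detection unit #j (1-based)
    weights    : [w1, ..., wN], each assumed to satisfy 0 <= wj <= 1
    Returns (best candidate, dict of weighted sums S_i for all candidates).
    """
    totals = {}
    for c in candidates:  # outer loop over candidates: S501, S507, S508
        # Inner loop over detection units and the weighted sum: S502 to S506.
        totals[c] = sum(w * score_fn(c, j)
                        for j, w in enumerate(weights, start=1))
    best = max(totals, key=totals.get)  # S509
    return best, totals
```

Setting a weight to 0 removes the contribution of the corresponding threshold entirely, while equal weights treat all thresholds alike.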
[0075]
In the above method, the matching scores si1 to siN for the same recognition candidate i, calculated using the silent section information obtained for each of the different power thresholds TH1 to THN, are multiplied by appropriate weights w1 to wN and summed, so that the influence of the silent sections can be reflected in the matching score to any desired degree. For this reason, the influence of erroneous matching in silent sections can be reduced.
The above is the detailed description of the operation and effect of the speech recognition apparatus in the second modification of the second embodiment of the present invention.
[0076]
In the second embodiment, the silent section detection units #1 (206-1) to #N (206-N) were described as operating in parallel. Alternatively, the speech signal of the utterance section detected by the utterance section detection unit 201 may be stored in storage means such as a memory, and the silent section detection units #1 (206-1) to #N (206-N) may then be activated one at a time, each detecting silent sections in the stored speech signal with its own threshold and sending the detected silent section information to the pattern matching unit 203 each time.
[0077]
In addition, the functions of the utterance section detection unit 101 (201), the feature vector extraction unit 102 (202), the pattern matching unit 103 (203), the recognition result determination unit 105 (205), and the silent section detection unit 106 (206-1 to 206-N) in the above embodiments can also be realized as software.
[0078]
Further, the present embodiment can also be implemented as a computer-readable recording medium, such as a CD-ROM, on which is recorded a program for causing a computer to execute a predetermined procedure including the pattern matching method applied in the speech recognition apparatus according to the above embodiment (or for causing the computer to function as predetermined means of the speech recognition apparatus, or for causing the computer to realize predetermined functions of the speech recognition apparatus). The program may also be downloaded via a communication medium.
[0079]
In addition, the embodiment of the present invention can be variously modified with respect to the above-described example, and these are also within the scope of the embodiment of the present invention unless they are contrary to the spirit.
[0080]
[Effects of the Invention]
As described above, according to the present invention, even if an unexpected low-power silent section exists within the utterance section, that silent section is detected and excluded from matching against the standard feature pattern, so that erroneous pattern matching in the silent section can be avoided. This yields the significant practical effect of enabling highly accurate recognition.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a basic configuration of a speech recognition apparatus according to a first embodiment of the present invention.
FIG. 2 is a conceptual diagram showing a silent section in an input voice signal.
FIG. 3 is a diagram showing a configuration of an HMM including a null transition.
FIG. 4 is a diagram showing a flow of a pattern matching method.
FIG. 5 is a diagram showing a flow of processing at a time when a silent section is switched to a voiced section.
FIG. 6 is a diagram showing a state of power in an input audio signal.
FIG. 7 is a diagram showing details of the state of power in an input audio signal.
FIG. 8 is a diagram showing a specific example of the configuration of an HMM.
FIG. 9 is a conceptual diagram of an optimum route after pattern matching processing.
FIG. 10 is a block diagram showing a basic configuration of a speech recognition apparatus according to a second embodiment of the present invention.
FIG. 11 is a conceptual diagram showing a silent section of an input signal with a plurality of threshold values.
FIG. 12 is a diagram showing a flow of a pattern matching method using a plurality of threshold values.
FIG. 13 is a diagram showing a first modification of the flow of the pattern matching method using a plurality of threshold values.
FIG. 14 is a diagram showing a second modification of the flow of the pattern matching method using a plurality of threshold values.
[Explanation of symbols]
101, 201 ... Utterance section detection unit
102, 202 ... Feature vector extraction unit
103, 203 ... Pattern matching unit
104, 204 ... Standard feature pattern storage unit
105, 205 ... Recognition result determination unit
106, 206-1 to 206-N ... Silent section detection unit

Claims

A speech recognition method in which an input signal is acoustically analyzed to detect a section in which speech is uttered, a feature vector series is extracted from the speech signal of the detected utterance section, the extracted feature vector series is matched, by a hidden Markov model matching method, against standard patterns of speech signals prepared in advance for each predetermined recognition candidate to calculate a matching score representing the similarity or distance between the two, and a recognition result is determined based on the matching score for each recognition candidate, wherein
a silent section of the speech signal is detected from the short-time power of the speech signal of the detected utterance section, and
the feature vector series of the silent section is excluded from pattern matching, and, for the feature vector series corresponding to the time at which the silent section changes to a voiced section, the matching score is calculated by matching with a hidden Markov model matching method that allows null transitions, in which a null transition is performed to update the matching score of each state one frame earlier and, after the matching score is updated, all state transition probabilities except for the null transition and the output probabilities of all distributions are calculated to determine the optimum transition.

A speech recognition method in which an input signal is acoustically analyzed to detect a section in which speech is uttered, and a feature vector series is extracted from the speech signal of the detected utterance section,
silent sections of the speech signal are detected independently, based on mutually different thresholds, from the short-time power of the speech signal of the detected utterance section,
for each threshold, the extracted feature vector series is matched, by a hidden Markov model matching method and based on the information on the silent sections detected independently with the different thresholds, against standard patterns of speech signals prepared in advance for each predetermined recognition candidate, to calculate a matching score representing the similarity or distance between the two, wherein the feature vector series of the silent section detected with the corresponding threshold is excluded from pattern matching, and a hidden Markov model matching method that allows null transitions is applied in which, only at the time when the silent section changes to a voiced section, a null transition is performed to update the matching score of each state one frame earlier and, after the matching score is updated, all state transition probabilities except for the null transition and the output probabilities of all distributions are calculated to determine the optimum transition, and
a recognition result is determined based on the matching score for each recognition candidate obtained for each threshold.

The speech recognition method according to claim 2, wherein each time the matching score for each recognition candidate has been calculated for one threshold, the recognition candidates are narrowed down based on those matching scores, and this operation is repeated while the threshold is switched stepwise in a fixed direction.

The speech recognition method according to claim 2, wherein a weighted sum of the matching scores obtained for each threshold for the same recognition candidate is calculated for all recognition candidates, and the recognition result is determined based on the weighted sums of the matching scores of all the recognition candidates.

An utterance section detecting means for acoustically analyzing an input signal and detecting a section in which the voice is uttered;
Feature vector extraction means for extracting a feature vector series from the speech signal of the utterance section detected by the utterance section detection means;
A silent section detecting means for detecting a silent section of the speech signal from the short-time power of the speech signal of the utterance section detected by the utterance section detecting means;
A standard pattern storage means for storing a standard pattern of the audio signal of each predetermined recognition candidate;
Pattern matching means for matching the feature vector series extracted by the feature vector extraction means against the standard pattern of each recognition candidate stored in the standard pattern storage means by a hidden Markov model matching method, thereby calculating a matching score representing the similarity or distance between the two, the pattern matching means excluding the feature vector series of the silent section detected by the silent section detecting means from pattern matching and performing matching by a hidden Markov model matching method that allows null transitions, in which, only at the time when the silent section changes to a voiced section, a null transition is performed to update the matching score of each state one frame earlier and, after the matching score is updated, all state transition probabilities except for the null transition and the output probabilities of all distributions are calculated to determine the optimum transition;
A speech recognition apparatus comprising: recognition result determining means for determining a recognition result based on the matching score for each recognition candidate obtained by the pattern matching means.

An utterance section detecting means for acoustically analyzing an input signal and detecting a section in which the voice is uttered;
Feature vector extraction means for extracting a feature vector series from the speech signal of the utterance section detected by the utterance section detection means;
A plurality of silent section detecting means for detecting silent sections of the speech signal, based on mutually different thresholds, from the short-time power of the speech signal of the utterance section detected by the utterance section detecting means;
A standard pattern storage means for storing a standard pattern of the audio signal of each predetermined recognition candidate;
Pattern matching means for calculating, for each of the different thresholds and by a hidden Markov model matching method, a matching score representing the similarity or distance between the feature vector series extracted by the feature vector extraction means, excluding the feature vector series of the silent section detected by the corresponding silent section detecting means, and the standard pattern of each recognition candidate stored in the standard pattern storage means, the pattern matching means performing matching by a hidden Markov model matching method that allows null transitions, in which, only at the time when the silent section changes to a voiced section, a null transition is performed to update the matching score of each state one frame earlier and, after the matching score is updated, all state transition probabilities except for the null transition and the output probabilities of all distributions are calculated to determine the optimum transition;
A speech recognition apparatus comprising: recognition result determining means for determining a recognition result based on the matching score for each recognition candidate obtained by the pattern matching means for each threshold.