JP2005004018A

JP2005004018A - Speech recognition apparatus

Info

Publication number: JP2005004018A
Application number: JP2003168641A
Authority: JP
Inventors: Michihiro Yamazaki; 道弘山崎; Jun Ishii; 純石井
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2003-06-13
Filing date: 2003-06-13
Publication date: 2005-01-06

Abstract

PROBLEM TO BE SOLVED: To provide a means of preventing the recognition rate from decreasing when speech recognition is performed on an analog speech signal that contains an instantaneous interruption section or an overflow section exceeding the input range of an A/D converter.
SOLUTION: A sound analysis part 3 is provided which, even when the analog speech signal contains such an unstable section (an instantaneous interruption section or an overflow section exceeding the input range of the A/D converter), calculates input speech feature quantities from the analog speech signal remaining in the unstable section, so that speech recognition also uses the speech signal in the unstable section.
COPYRIGHT: (C)2005, JPO & NCIPI

Description

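[0001]
[Technical field of the invention]
The present invention relates to a speech recognition device that improves recognition accuracy even in environments where the power of the input speech may exceed the input range of the A/D converter or where instantaneous interruptions may occur. In particular, it relates to a technique that improves recognition accuracy through the signal processing or likelihood calculation applied to sections that exceed the input range or to instantaneous interruption sections.
[0002]
[Prior art]
In the prior art, in a section where an instantaneous interruption or overflow occurs, the same acoustic likelihood (hereinafter simply "likelihood") is given to all basic recognition units (all phonemes, phones, or syllables stored in a given acoustic model). This prevents the misrecognition that would otherwise occur when the likelihood of the correct basic recognition unit falls in the distorted section of the speech signal and lowers the likelihood of the correct vocabulary entry (see, for example, Non-Patent Document 1).
[0003]
Although it is not a technique for handling instantaneous interruptions or overflow, there is also a method that treats low-power sections as silent sections and excludes the speech features of those sections from pattern matching (see, for example, Patent Documents 1 and 2).
[0004]
[Patent Document 1]
JP 2001-13988 A, "Speech recognition method and apparatus", Fig. 2, pp. 3-7
[Patent Document 2]
JP 2000-194385 A, "Speech recognition processing device"
[Non-patent document]
Proceedings of the Acoustical Society of Japan (September-October 1999, Vol. 1, P149 3-Q-16)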
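[0005]
[Problems to be solved by the invention]
Because a conventional speech recognition device does not use the speech information remaining in overflow sections, instantaneous interruption sections, or silent sections, accurate recognition is difficult. In particular, the recognition rate drops as the overflow or interruption section becomes longer.
[0006]
On the other hand, the speech information remaining in these sections is unstable. For example, acoustic analysis fails when it is applied to a section in which digital samples of value 0 continue, as in an instantaneous interruption section. To avoid this, one prior-art method reuses the acoustic analysis result obtained immediately before the section. With that method, however, the longer the interruption, the further the signal departs from the preceding analysis result, so matching ends up being performed on incorrect data.
[0007]
The present invention was made to solve these problems, and its object is to perform accurate speech recognition even on speech that contains instantaneous interruptions or overflow.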
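[0008]
[Means for solving the problems]
The speech recognition device according to the present invention receives an analog speech signal, converts it into a digital signal with an A/D converter, calculates an input speech feature from the digital signal, and calculates a speech recognition result for the analog speech signal based on the input speech feature.
The device is characterized by comprising acoustic analysis means that, even when the analog speech signal contains an unstable section, calculates the input speech feature based on the analog speech signal remaining in that unstable section.
[0009]
Here, an unstable section means an instantaneous interruption section contained in the analog speech signal input to the A/D converter of the speech recognition device, or an overflow section in which the signal exceeds the input range of the A/D conversion means.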
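[0010]
[Embodiments of the invention]
Embodiments of the present invention are described below.
Embodiment 1.
Fig. 1 is a block diagram showing the configuration of a speech recognition device according to Embodiment 1 of the present invention. In the figure, the A/D converter 1 is an element or circuit that converts the analog signal of the input speech into a digital signal. It digitizes the input by linear pulse code modulation (PCM) with, for example, a sampling frequency of 8 kHz and a resolution of 16 bits. Each sample takes a value within the range given by Equation (1).
[Equation 1]

[0011]
Fig. 2 is a waveform diagram of an analog signal input to the A/D converter 1. Fig. 3 is a waveform diagram of the signal of Fig. 2 after digital conversion. In the figures, Smax and Smin are the upper and lower limits of the input range of the A/D converter 1. The dashed circle 102 is an enlargement of the signal inside the dashed circle 101 and shows that the portions of the input that exceed Smax (overflowed samples) are flattened (clipped) to Smax, the upper limit of the input range.
[0012]
Fig. 4 is a waveform diagram of the input speech when an instantaneous interruption occurs in the analog signal of Fig. 2. In this case, regardless of the input range of the A/D converter 1, no sample values exist for a certain interval, and the A/D converter 1 outputs a signal whose sample values are 0 during that interval.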
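[0013]
The configuration of the speech recognition device according to Embodiment 1 is described further with reference to Fig. 1. The minute signal output unit 2 is an element or circuit that superimposes a minute signal (minute noise) on the output of the A/D converter 1. The acoustic analysis unit 3 takes the digital signal with the minute signal superimposed and, using the signal in fixed time intervals, outputs the speech features (input speech features) used for recognition. The acoustic likelihood calculation unit 4 compares the standard speech pattern (standard speech feature) of each basic recognition unit with the speech features output by the acoustic analysis unit 3 and calculates a likelihood for each basic recognition unit.
[0014]
The acoustic model storage unit 5 consists of a storage medium, or storage elements and circuits (collectively, a storage device), that stores the standard speech patterns of the basic recognition units for which the acoustic likelihood calculation unit 4 calculates likelihoods. It may also include a computer program or controller that manages and configures the storage device.
[0015]
The matching unit 6 calculates vocabulary likelihoods from the likelihoods calculated per basic recognition unit, following the vocabulary/language model on which the device is based, and selects the vocabulary entry with the highest likelihood as the recognition candidate. The vocabulary/language model storage unit 7 is a storage medium, or storage elements and circuits (a storage device), that stores the vocabulary/language model referenced by the matching unit 6. Like the acoustic model storage unit 5, it may include a computer program or controller that manages and configures the storage device.
[0016]
The A/D converter 1 corresponds to the A/D conversion means, the minute signal output unit 2 to the minute signal output means, the acoustic analysis unit 3 to the acoustic analysis means, the acoustic likelihood calculation unit 4 and the acoustic model storage unit 5 to the acoustic likelihood calculation means, and the matching unit 6 and the vocabulary/language model storage unit 7 to the matching means.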
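[0017]
Next, the operation of the speech recognition device according to Embodiment 1 is described. The A/D converter 1 converts the speech signal input as an analog signal into a digital signal. The minute signal output unit 2 superimposes a minute signal on the digital signal output by the A/D converter 1 and outputs the result. This superimposition is called non-zeroing here. The minute signal is, for example, white noise whose maximum sample value is about 2^4.
[0018]
Instead of connecting the A/D converter 1 and the minute signal output unit 2 in series, a switch may be provided. For example, the A/D converter 1 detects an instantaneous interruption or overflow by monitoring the power, and based on the detection result the switch connects its movable terminal to either the output of the A/D converter 1 or the output of the minute signal output unit 2.
[0019]
As another non-zeroing method, the connection order of the A/D converter 1 and the minute signal output unit 2 may be swapped so that the output of the minute signal output unit 2 becomes the input of the A/D converter 1. Because the output of the minute signal output unit 2 is then always fed into the A/D converter 1, the sample values output by the A/D converter 1 never remain 0 continuously for more than a certain period, even when an instantaneous interruption occurs.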
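[0020]
The acoustic analysis unit 3 then calculates features (for example, LPC cepstrum coefficients) from the speech (digital signal) on which the minute noise output by the minute signal output unit 2 has been superimposed. It does so at fixed intervals (for example, a frame period of 10 msec), each time using a fixed length of the digital signal (for example, a frame length of 25 msec). As a result, the output O of the acoustic analysis unit 3 is a time series of features, as shown for example in Equation (2).
[Equation 2]

[0021]
In Equation (2), o(t) is the feature of the t-th frame and is a vector of dimension K, as shown in Equation (3).
[Equation 3]

[0022]
When LPC cepstrum coefficients are used as the speech feature, the n-th order LPC cepstrum o'(t, n) is calculated by Equations (4) to (6).
[Equation 4]

Here α_i (i = 1, 2, ..., Na) are the linear prediction coefficients, obtained as follows.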
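[0023]
First, let Ns be the window length (the number of samples in one frame), and let x(t, i) (0 ≤ i ≤ Ns − 1) be the 1st to Ns-th speech samples of the t-th frame multiplied by a finite-length window function (such as a Hamming window) that is 0 outside the frame. The autocorrelation sequence R_0, R_1, R_2, ..., R_Na is then calculated by Equation (7).
[Equation 5]

[0024]
Next, Equation (8), a system of simultaneous equations in α_i, is solved.
[Equation 6]

Written in matrix form, Equation (8) becomes Equation (9).
[Equation 7]

where r_i = R_i / R_0.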
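[0025]
Exploiting the Toeplitz structure of the matrix in Equation (9), α_i can be obtained by the Levinson-Durbin recursion. Each α_n is computed iteratively from n = 1 to n = Na. The value of α_n obtained in the m-th iteration (1 ≤ m ≤ Na) is written α_n^(m), and in particular, when m = n, k_n = α_n^(n). The initial values are set as
[Equation 8]

and then k_m, α_i^(m), and E^(m) are calculated in turn for m = 2, 3, 4, ... from the following recurrence.
[Equation 9]

[0026]
In Equation (11), m is increased step by step, and the recurrence ends when m reaches Na, at which point α_i (i = 1, 2, ..., Na) have been obtained. Now suppose that all the speech samples in the frame are 0 during this LPC cepstrum calculation. Then x(t, i) = 0 (i = 0, 1, 2, ..., Ns − 1), so R_0 calculated by Equation (7) is 0, as in the following equation.
[Equation 10]

[0027]
Consequently, computing r_i = R_i / R_0 in Equation (9) requires division by 0, and r_i cannot be calculated. This means that the LPC cepstrum o'(t, n) cannot be calculated. In other words, when the output of the A/D converter 1 becomes 0 because of an instantaneous interruption or overflow, the LPC cepstrum cannot be computed and the speech feature calculation fails. Division by 0 is treated as a serious error, typically handled by raising a trap in ordinary computer systems. For this reason, conventional devices cannot stably perform speech recognition using the speech signal remaining in an unstable section. This was the problem with conventional speech recognition when the input signal contains an unstable section.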
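[0028]
The speech recognition device of Embodiment 1 solves this problem by providing the minute signal output unit 2. Even if the output of the A/D converter 1 becomes 0 in an unstable section, the minute signal output unit 2 supplies a minute signal consisting of non-zero components, so the input to the acoustic analysis unit 3 is never all zero. Speech features are therefore calculated stably for input that contains unstable sections, and the problem described above does not arise when features are computed from the speech signal remaining in those sections.
[0029]
In Embodiment 1, providing the minute signal output unit 2 is a physical means of non-zeroing, that is, of keeping the input signal from becoming 0. Needless to say, other methods may be used instead. For example, the acoustic analysis unit 3 may non-zero the input by forcing predetermined low-order bits of each input sample, such as the least significant bit, to 1.
[0030]
This concludes the description of the acoustic analysis unit 3. The description of the operation of the speech recognition device according to Embodiment 1 continues below.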
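[0031]
The acoustic model storage unit 5 stores standard patterns that represent the typical features of each basic recognition unit. In an HMM (hidden Markov model), standard patterns are often represented by Gaussian distributions. The following description uses phonemes as the basic recognition unit, but the processing flow is unchanged if phones, syllables, or the like are used instead.
[0032]
The acoustic likelihood calculation unit 4 compares the feature time series O output by the acoustic analysis unit 3 with the standard patterns (for example, one per phoneme) stored in the acoustic model storage unit 5 and computes the likelihood of each frame for each phoneme. With a Gaussian distribution that uses a diagonal covariance matrix, the likelihood B(p, t) of the feature o(t) of frame t for phoneme p is calculated by Equation (14).
[Equation 11]

[0033]
The matching unit 6 calculates the likelihood of each vocabulary entry from the likelihoods obtained by the acoustic likelihood calculation unit 4 and the phoneme sequence of each entry stored in the vocabulary/language model storage unit 7, and outputs the entry with the highest likelihood as the recognition result. That is, for the feature time series O output by the acoustic analysis unit 3, it extracts the recognition result W' using Equation (15).
[Equation 12]
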
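[0034]
In Equation (15), the first term P(O|W) is the acoustic probability, computed under the assumption of vocabulary entry W. HMMs are now commonly used to compute it. The second term P(W) is the probability of the assumed entry W, that is, the linguistic probability. Statistical language models are now commonly used to obtain it.
[0035]
Letting the state transition sequence be q = {q(0), q(1), ..., q(T)} (where q(0) is the initial state and q(T) is an element of the set F of final states), P(O|W) in Equation (15) can be expressed by Equation (16).
[Equation 13]

[0036]
In Equation (16), π_i is the initial probability of the i-th state (π_0 = 1; π_1, ..., π_T = 0), a(i, j) is the transition probability from the i-th state to the j-th state, and b(i, t) is the likelihood of the i-th state at time (frame) t. F is the set of final states. If the i-th state represents phoneme p, then b(i, t) = B(p, t).
[0037]
In this way, the speech recognition device according to Embodiment 1 outputs the maximum-likelihood recognition result for the input speech signal.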
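[0038]
As is clear from the above, the speech recognition device of Embodiment 1 non-zeroes either the analog speech signal or the digital signal obtained from it, and thereby calculates speech features stably even when the input analog speech signal contains unstable sections. Recognition can therefore use the speech signal remaining in unstable sections, which prevents the recognition rate from dropping for speech signals with overflow or instantaneous interruptions.
[0039]
Replacing the components of Embodiment 1 other than the A/D converter 1, the minute signal output unit 2, and the acoustic analysis unit 3 with other components does not impair the features of the invention.
[0040]
The A/D converter 1, the minute signal output unit 2, the acoustic analysis unit 3, the acoustic likelihood calculation unit 4, and the matching unit 6 may be implemented in hardware. Alternatively, a speech recognition program that performs their processing may be written and executed by a computer.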
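[0041]
Embodiment 2.
Embodiment 1 addressed the inability to calculate speech features in unstable sections by superimposing a minute signal on the input or by forcing low-order bits of the digital signal to 1, so that features are calculated stably and the speech signal remaining in unstable sections can be used. Embodiment 2 focuses on the low reliability of likelihoods in unstable sections and improves that reliability by correcting them with likelihoods from outside the unstable section.
[0042]
Fig. 5 is a block diagram showing the configuration of the speech recognition device according to Embodiment 2. Components with the same reference numerals as in Fig. 1 are the same as in Embodiment 1 and are not described again. The unstable section detection unit 8 detects whether an instantaneous interruption or overflow has occurred in the A/D converter 1. The acoustic likelihood correction unit 9 corrects the likelihoods that the acoustic likelihood calculation unit 4 calculated for unstable sections. A signal line from the unstable section detection unit 8 notifies it whether the current section is unstable.
[0043]
The acoustic likelihood calculation unit 4, the acoustic model storage unit 5, and the unstable section detection unit 8 correspond to the acoustic likelihood calculation means. The acoustic likelihood correction unit 9 corresponds to the acoustic likelihood correction means. The matching unit 6, the vocabulary/language model storage unit 7, and the unstable section detection unit 8 correspond to the matching means.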
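[0044]
Next, the operation of the speech recognition device according to Embodiment 2 is described. The A/D converter 1 converts the analog speech signal into a digital signal as in Embodiment 1. The unstable section detection unit 8 monitors the power on the input line of the A/D converter 1. When it detects an unstable section, that is, an instantaneous interruption or overflow, it drives the signal line to the acoustic likelihood correction unit 9 to Hi. Outside unstable sections, it leaves the signal line at Low.
[0045]
The acoustic analysis unit 3 and the acoustic likelihood calculation unit 4 operate as in Embodiment 1, calculating for each frame t the speech feature o(t) and the acoustic likelihood B(p, t) for each phoneme p.
[0046]
When the signal line from the unstable section detection unit 8 is Low, the acoustic likelihood correction unit 9 outputs the likelihood B(p, t) calculated by the acoustic likelihood calculation unit 4 unchanged. When the signal line is Hi, it corrects the likelihood as follows. Taking the times at which the unstable section starts and ends as the start point ts and the end point te, it corrects the likelihood by Equation (18).
[Equation 14]

[0047]
Here N is the maximum time over which correction using the likelihoods before and after the unstable section (the likelihood calculated just before the start point and the likelihood calculated just after the end point) is permitted, and Bth is a predetermined value. In Equation (18), the region further than N from both the start point and the end point (case (C) above) is given the constant value Bth. The regions within N of the start point or the end point (cases (A) and (B) above) are given likelihoods that are continuous with the likelihood just before the start point, the likelihood just after the end point, and Bth of case (C). N is chosen with the acoustic analysis frame length in mind, for example 40 msec.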
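The image for Equation (18) is not reproduced in this text, so its exact form is unknown. The following is only a minimal sketch of a correction that matches the verbal description: constant Bth in the middle, and a continuous blend toward the likelihood just before the start point or just after the end point within N frames of either end. The linear blend, the function name `correct_likelihoods`, and the use of frame counts for N are assumptions, not the patent's formula.

```python
def correct_likelihoods(B, ts, te, N, Bth):
    """Correct the per-frame likelihoods of one phoneme inside an unstable section.

    B   : list of likelihood scores, one per frame (assumed log-domain scores)
    ts  : index of the first unstable frame (must be >= 1)
    te  : index of the last unstable frame (must be <= len(B) - 2)
    N   : number of frames over which the neighbouring likelihoods still count
    Bth : constant score used far from both ends of the section
    """
    before = B[ts - 1]   # last reliable score before the section
    after = B[te + 1]    # first reliable score after the section
    out = list(B)
    for t in range(ts, te + 1):
        d_start = t - ts + 1   # distance from the last reliable frame
        d_end = te - t + 1     # distance to the next reliable frame
        if d_start <= d_end and d_start < N:
            w = d_start / N                       # case (A): near the start
            out[t] = (1 - w) * before + w * Bth
        elif d_end < d_start and d_end < N:
            w = d_end / N                         # case (B): near the end
            out[t] = (1 - w) * after + w * Bth
        else:
            out[t] = Bth                          # case (C): middle of the section
    return out
```

With a frame period of 10 msec, the 40 msec example for N in the text would correspond to `N = 4` in this sketch.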
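[0048]
The matching unit 6 then calculates the maximum-likelihood recognition result with Equation (15), as in Embodiment 1. This is the operation of the speech recognition device according to Embodiment 2.
[0049]
As is clear from the above, the speech recognition device of Embodiment 2 corrects the acoustic likelihood of an unstable section based on the likelihoods before and after it. The phoneme likelihoods before and after the unstable section are thus reflected near its start point or end point, which compensates for the discontinuity in speech information caused by overflow or interruption and prevents misrecognition.
[0050]
The influence of the likelihood just before the start point and of the likelihood just after the end point is considered to diminish with distance from those points and to vanish in the middle region beyond a certain distance, so a constant likelihood is used there. This prevents those likelihoods from having more effect than warranted when the unstable section is long.
[0051]
Because recognition uses the speech signal remaining in the unstable section while the likelihood is corrected, a drop in recognition rate is prevented even for speech signals with overflow or instantaneous interruptions.
[0052]
Methods other than Equation (18) are also conceivable for reflecting the likelihood just before the start point and the likelihood just after the end point in the unstable section. For example, a likelihood distribution that increases or decreases monotonically from the former to the latter may be assumed, and the likelihood in the unstable section determined from it. This also compensates for the discontinuity in speech information caused by overflow or interruption and so prevents misrecognition.
[0053]
Needless to say, this may be combined with the technique of Embodiment 1, which non-zeroes the input signal in unstable sections so that speech features are calculated stably.
[0054]
In the speech recognition device of Embodiment 2, the unstable section detection unit 8 detects that an instantaneous interruption or overflow is occurring in the A/D converter 1. Alternatively, the acoustic analysis unit 3, for example, may judge an instantaneous interruption when a sample value from the A/D converter 1 is at or below (or strictly below) a predetermined lower limit, and an overflow when the absolute value of a sample is at or above (or strictly above) a predetermined value. When a speech feature is generated from such samples, a special flag or the like may be set so that the acoustic likelihood calculation unit 4 and the acoustic likelihood correction unit 9 can tell. For example, if the minute signal output unit 2 of Embodiment 1 is included and superimposes a minute signal of about 2^4, the lower limit may be set to about 2^5 for consistency. If the A/D converter 1 has 16-bit resolution, the value range is −32768 to 32767, so an overflow may be judged to have occurred when the absolute value of a sample is 32767 or more.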
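The sample-based detection described above can be sketched as follows. The thresholds 2^5 and 32767 are the example values from the text. Summarizing a frame by the fraction of flagged samples, comparing the absolute value against the lower limit, and the function name `detection_fractions` are assumptions made for this illustration; the text only says that a flag "or the like" is attached to the feature.

```python
def detection_fractions(frame, low_limit=2 ** 5, clip_level=32767):
    """Return (overflow_fraction, interruption_fraction) for one frame of 16-bit samples.

    A sample counts as overflowed when |s| >= clip_level and as interrupted
    when |s| <= low_limit (an assumed reading of "at or below the lower limit").
    """
    n = len(frame)
    overflow = sum(1 for s in frame if abs(s) >= clip_level) / n
    interrupted = sum(1 for s in frame if abs(s) <= low_limit) / n
    return overflow, interrupted
```

Ordinary speech also passes through small values at zero crossings, so a caller would presumably flag a frame as interrupted only when the second fraction is high; the cut-off for that decision is not given in the text.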
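[0055]
Embodiment 3.
The speech recognition device of Embodiment 2 corrects the acoustic likelihood in unstable sections so that the maximum-likelihood phoneme (or other basic recognition unit) is selected appropriately and misrecognition is prevented. Another approach is to lower the weight of acoustic likelihoods in unstable sections when matching against the vocabulary. The speech recognition device of Embodiment 3 operates on this principle.
[0056]
Fig. 6 is a block diagram showing the configuration of the speech recognition device according to Embodiment 3. Components with the same reference numerals as in Fig. 5 are the same as in Embodiment 2 and are not described again. As Fig. 6 shows, the difference from Fig. 5 is that the signal line from the unstable section detection unit 8 runs to the matching unit 6. In Embodiment 3, the unstable section detection unit 8 detects not only whether a section is unstable but also whether an unstable section is an overflow section or an instantaneous interruption section, and the signal line can take three states (for example, Normal: not an unstable section; Hi: overflow; Low: instantaneous interruption).
[0057]
Next, the operation of the speech recognition device according to Embodiment 3 is described. The A/D converter 1, the unstable section detection unit 8, the acoustic analysis unit 3, and the acoustic likelihood calculation unit 4 operate as in Embodiment 2 and are not described again. The matching unit 6 then sets, according to whether the signal line from the unstable section detection unit 8 is Normal, Hi, or Low, the contribution that the per-phoneme likelihoods calculated by the acoustic likelihood calculation unit 4 make to the likelihood of the whole input speech signal. It then performs matching using the per-phoneme likelihoods, the vocabulary/language model in the vocabulary/language model storage unit 7, and the contributions, and outputs the recognition result.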
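[0058]
With f(t) denoting the frame contribution of frame t, the acoustic probability P(O|W) in Equation (15) is given by Equation (19).
[Equation 15]

The frame contribution f(t) is set as follows.
[Equation 16]

[0059]
Here f1 and f2 are constants, set for example to f1 = 0.5 and f2 = 0.1. In this example, the contribution of likelihoods in overflow sections to the total is half that of normal sections, and the contribution of likelihoods in instantaneous interruption sections is one tenth that of normal sections.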
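The images for Equations (19) to (21) are not reproduced in this text. As a rough illustration of the idea only, the sketch below assumes that the contribution f(t) acts as an exponent on the per-frame likelihood, which becomes a multiplier on the log-likelihood. The values 0.5 and 0.1 are the examples from the text; the exponent form, the state labels, and the function name `weighted_path_score` are assumptions.

```python
# Contribution per detector state; 0.5 and 0.1 are the example values f1 and f2.
FRAME_CONTRIBUTION = {"normal": 1.0, "overflow": 0.5, "interrupt": 0.1}

def weighted_path_score(log_likelihoods, states):
    """Sum of f(t) * log b(q(t), t) along one state path (assumed form).

    log_likelihoods : per-frame log-likelihoods along the path
    states          : per-frame detector state, one of the keys above
    """
    return sum(FRAME_CONTRIBUTION[s] * ll for ll, s in zip(log_likelihoods, states))
```

Under this assumption a badly scoring frame inside an interruption section costs the hypothesis only a tenth of what the same frame would cost in a normal section.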
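[0060]
Alternatively, let the peak detection rate Po(t) be the proportion of samples within the frame at time t that exceed the maximum value, and let the interruption detection rate Pc(t) be the proportion of samples within that frame that are in the interrupted state. As shown in Equation (22), the frame contribution f(t) may then be determined from the peak detection rate Po(t) during overflow and from Pc(t) during an instantaneous interruption.
[Equation 17]

[0061]
More concretely, the calculation may follow Equations (23) and (24), for example.
[Equation 18]

[Equation 19]

[0062]
In this example, when the peak detection rate is at or below a fixed value (0.05), the computed likelihood can be trusted, so the frame contribution is 1 (the same as in normal sections). When the peak detection rate exceeds a fixed value (0.3), the input is too distorted for the likelihood calculation to be trusted, so the frame contribution is 0 (no contribution to the total likelihood). Between 0.05 and 0.3, the frame contribution decreases as the peak detection rate increases.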
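A minimal sketch of a contribution function with the behaviour just described is shown below. The thresholds 0.05 and 0.3 come from the text. The linear ramp between them is an assumption, since the images for Equations (23) and (24) are not reproduced here, and `contribution_from_peak_rate` is a hypothetical name.

```python
def contribution_from_peak_rate(peak_rate, low=0.05, high=0.3):
    """Map a peak detection rate Po(t) in [0, 1] to a frame contribution f(t)."""
    if peak_rate <= low:
        return 1.0   # little clipping: trust the likelihood fully
    if peak_rate > high:
        return 0.0   # heavy clipping: ignore this frame
    # Assumed linear decrease from 1 at `low` to 0 at `high`.
    return (high - peak_rate) / (high - low)
```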
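[0063]
The frame contribution may also be determined from the time elapsed from the start or the end of the overflow section. Equation (25) shows an example, where ts is the start point of the unstable section and te is its end point.
[Equation 20]

In the above equation, min(x, y) selects the smaller of x and y. In this example, the contribution is 1 at the start point and the end point and 0.5 in the middle of the unstable section.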
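The image for Equation (25) is not reproduced in this text. One form that uses min(x, y) and yields the stated values (1 at both ends, 0.5 in the middle) is f(t) = 1 − min(t − ts, te − t) / (te − ts). The sketch below implements that candidate; it is a guess that fits the description, not a confirmed reconstruction.

```python
def contribution_from_position(t, ts, te):
    """Frame contribution from the distance to the nearer end of the unstable section.

    Assumes ts < te and ts <= t <= te.
    """
    return 1.0 - min(t - ts, te - t) / (te - ts)
```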
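[0064]
As is clear from the above, reducing the contribution that the likelihoods of the instantaneous interruption sections or overflow sections reported by the unstable section detection unit 8 make to the total likelihood (making them count for less) allows the speech signal remaining in unstable sections to be used while reducing misrecognition caused by sections whose likelihoods are unreliable.
[0065]
Setting the frame contribution based on the peak detection rate, the time from the ends of the overflow section, and so on makes it possible to set a contribution suited to the state of the input.
[0066]
In Embodiment 3, the unstable section detection unit 8 distinguishes the three states of overflow section, instantaneous interruption section, and normal section. As in Embodiment 2, however, the acoustic analysis unit 3 may make this judgment instead and include components or data that identify the state in the speech feature.
[0067]
Needless to say, this embodiment can be combined with the minute signal output unit 2 of Embodiment 1 or the acoustic likelihood correction unit 9 of Embodiment 2.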
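[0068]
Embodiment 4.
The speech recognition devices of Embodiments 1 to 3 perform recognition on speech signals containing overflow or instantaneous interruptions by calculating speech features stably even in unstable sections, by correcting the likelihoods of unstable sections, or by weighting the likelihoods of unstable sections less than those of other sections. Another approach is to prepare acoustic models intended for recognizing speech signals in unstable sections. The speech recognition device of Embodiment 4 operates on this principle.
[0069]
Fig. 7 is a block diagram showing the configuration of the speech recognition device according to Embodiment 4. Components with the same reference numerals as in Fig. 6 are the same as in Embodiment 3 and are not described again. In Embodiment 4, however, the acoustic model storage unit 5 stores a plurality of acoustic models. The acoustic model selection unit 10 selects, from the acoustic models stored in the acoustic model storage unit 5, the one that fits the conditions. The signal line from the unstable section detection unit 8 is connected to the acoustic model selection unit 10.
[0070]
Next, the operation of the speech recognition device according to Embodiment 4 is described. The A/D converter 1, the unstable section detection unit 8, and the acoustic analysis unit 3 operate as in Embodiment 3 and are not described again. As in Embodiment 3, the signal line carrying the detection result of the unstable section detection unit 8 takes three states: Hi (overflow section), Low (instantaneous interruption section), and Normal (steady or normal state, that is, a stable section).
[0071]
The acoustic likelihood calculation unit 4, the acoustic model storage unit 5, the unstable section detection unit 8, and the acoustic model selection unit 10 correspond to the acoustic likelihood calculation means.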
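[0072]
The operation of the acoustic model selection unit 10 is described next. The acoustic model selection unit 10 calculates the peak detection rate and the interruption detection rate from the detection results output by the unstable section detection unit 8. Based on the calculated rates, it selects the most suitable acoustic model from the plurality of acoustic models stored in the acoustic model storage unit 5.
[0073]
The acoustic model storage unit 5 stores acoustic models trained under conditions that produce given peak detection and interruption detection rates, each associated with those rates. The acoustic model selection unit 10 compares the rates associated with each acoustic model with the current rates and selects the acoustic model whose associated rates are closest to the current ones. In other words, acoustic models trained in poor conditions and in good conditions are both prepared, and the one closest to the actual conditions is selected.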
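The selection step can be sketched as a nearest-neighbour lookup over the rates stored with each model. The text only says that the model with the smallest distance is chosen, so the squared Euclidean distance, the tuple layout, and the model names used here are assumptions for illustration.

```python
def select_acoustic_model(models, peak_rate, interruption_rate):
    """Pick the model whose training-condition rates are closest to the current rates.

    models : list of (name, trained_peak_rate, trained_interruption_rate) tuples
    """
    def distance(model):
        _, p, c = model
        return (p - peak_rate) ** 2 + (c - interruption_rate) ** 2

    return min(models, key=distance)[0]


# Hypothetical model inventory: one clean model and two degraded-condition models.
MODELS = [
    ("clean", 0.0, 0.0),
    ("clipped", 0.2, 0.0),
    ("dropouts", 0.0, 0.2),
]
```

For example, `select_acoustic_model(MODELS, 0.18, 0.01)` picks the model trained on clipped speech in this hypothetical inventory.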
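[0074]
The acoustic likelihood calculation unit 4 calculates the likelihood of each phoneme (or other basic recognition unit such as a phone or syllable) from the acoustic model that the acoustic model selection unit 10 selected based on the peak detection and interruption detection rates, and the matching unit 6 outputs the maximum-likelihood recognition result based on the calculated likelihoods.
[0075]
As is clear from the above, the speech recognition device of Embodiment 4 prepares a plurality of acoustic models trained in advance for various peak detection and interruption detection rates and selects the one closest to the current rates. Recognition thus uses the speech signal remaining in unstable sections together with an acoustic model matched to the poor conditions, which improves accuracy. The recognition rate can also be improved for sections without interruption or overflow, by selecting an acoustic model that matches the S/N degradation caused by quantization noise.
[0076]
The peak detection rate and the interruption detection rate above are, as defined in Embodiment 3, calculated from the proportion of overflowed samples or interrupted samples in each frame. The interval over which these rates are calculated is not limited to a frame, however. They may be calculated per utterance, for example, or at predetermined time intervals (for example, every 40 msec).
[0077]
The power of each frame may also be used as the training condition for the acoustic models in place of the peak detection and interruption detection rates. That is, a plurality of acoustic models trained at given power levels may be prepared, and the acoustic model selected based on the power of the actual frame. Needless to say, the selection in this case may also be based on the average power per utterance or over a predetermined time rather than per frame.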
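[0078]
[Effects of the invention]
Because the speech recognition device according to the present invention uses the speech information remaining in unstable sections, it provides the notable effect of preventing the recognition rate from dropping even when the unstable section is long.
[Brief description of the drawings]
[Fig. 1] A block diagram showing the configuration of the speech recognition device according to Embodiment 1 of the invention.
[Fig. 2] A waveform diagram of an overflowing analog speech signal input to the speech recognition device according to Embodiment 1 of the invention.
[Fig. 3] A waveform diagram of the overflowing speech signal of Embodiment 1 after it has been input to the speech recognition device and digitally converted.
[Fig. 4] A waveform diagram of a speech signal containing an instantaneous interruption that is input to the speech recognition device according to Embodiment 1 of the invention.
[Fig. 5] A block diagram showing the configuration of the speech recognition device according to Embodiment 2 of the invention.
[Fig. 6] A block diagram showing the configuration of the speech recognition device according to Embodiment 3 of the invention.
[Fig. 7] A block diagram showing the configuration of the speech recognition device according to Embodiment 4 of the invention.
[Explanation of reference signs]
1 A/D converter
2 Minute signal output unit
3 Acoustic analysis unit
4 Acoustic likelihood calculation unit
5 Acoustic model storage unit
6 Matching unit
7 Vocabulary/language model storage unit
8 Unstable section detection unit
9 Acoustic likelihood correction unit
10 Acoustic model selection unit
[0001]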
BACKGROUND OF THE INVENTION
The present invention relates to a speech recognition device that improves the accuracy of speech recognition even in an environment where the power of the input speech may exceed the input range of the A/D converter or where instantaneous interruptions may occur. In particular, the present invention relates to a technique for improving the accuracy of speech recognition by devising the signal processing or likelihood calculation applied to a section exceeding the input range or to an instantaneous interruption section.
[0002]
[Prior art]
According to the conventional technique, in a section where an instantaneous interruption or overflow occurs, the same acoustic likelihood (hereinafter simply called likelihood) is given to all basic recognition units (all phonemes, phones, or syllables stored in a certain acoustic model). This prevents the misrecognition that would otherwise occur when the likelihood of the correct basic recognition unit is lowered in the distorted section of the speech signal and the likelihood of the correct vocabulary is lowered as a result (for example, Non-Patent Document 1).
[0003]
Further, although it is not a technique for dealing with instantaneous interruption or overflow, there is a method in which a low-power section is treated as a silent section and the speech features of the silent section are excluded from pattern matching (for example, Patent Document 1 and Patent Document 2).
[0004]
[Patent Document 1]
Japanese Patent Laid-Open No. 2001-13988 “Voice Recognition Method and Apparatus” FIG. 2, pages 3-7
[Patent Document 2]
JP 2000-194385 “Voice Recognition Processing Device”
[Non-patent literature]
Proceedings of the Acoustical Society of Japan (September to October 1999, Vol. 1 P149 3-Q-16)
[0005]
[Problems to be solved by the invention]
The speech recognition device according to the prior art does not use the speech information remaining in the overflow section, the instantaneous interruption section, or the silent section, so highly accurate speech recognition is difficult. In particular, the recognition rate decreases when the overflow section or the instantaneous interruption section becomes long.
[0006]
On the other hand, the speech information remaining in these sections is unstable. For example, if acoustic analysis is performed on a section in which digital samples with a value of 0 continue, as in an instantaneous interruption section, the acoustic analysis fails. To avoid this problem, the prior art includes a method that repeatedly uses the acoustic analysis result obtained immediately before the section. However, with this method, as the instantaneous interruption section becomes longer, the deviation from the immediately preceding acoustic analysis result increases, and matching is performed using incorrect data.
[0007]
The present invention has been made to solve the above-described problems, and an object thereof is to perform highly accurate speech recognition even for speech with instantaneous interruption or overflow.
[0008]
[Means for Solving the Problems]
The speech recognition apparatus according to the present invention receives an analog speech signal, converts it into a digital signal with an A/D converter, calculates an input speech feature from the digital signal, and calculates a speech recognition result for the analog speech signal based on the input speech feature.
The apparatus is characterized by comprising acoustic analysis means that calculates the input speech feature based on the analog speech signal remaining in an unstable section even when such an unstable section exists in the analog speech signal.
[0009]
Here, the unstable section means an instantaneous interruption section included in the analog speech signal input to the A/D converter of the speech recognition apparatus, or an overflow section exceeding the input range of the A/D conversion means.
[0010]
DETAILED DESCRIPTION OF THE INVENTION
Embodiments of the present invention will be described below.
Embodiment 1.
FIG. 1 is a block diagram showing the configuration of a speech recognition apparatus according to Embodiment 1 of the present invention. In the figure, an A/D converter 1 is an element or circuit that converts the analog signal of the input speech into a digital signal. It digitizes the input signal by linear pulse code modulation (PCM) with, for example, a sampling frequency of 8 kHz and a resolution of 16 bits. Each sample takes a value within the range given by Equation (1).
[Equation 1]

[0011]
FIG. 2 is a waveform diagram showing an analog signal input to the A/D converter 1. FIG. 3 is a waveform diagram after digital conversion of the analog signal shown in FIG. 2. In the figures, Smax and Smin indicate the upper and lower limits of the input range of the A/D converter 1. The circle 102 drawn with a broken line is an enlargement of the signal in the circle 101 drawn with a broken line, and it shows that the portions of the input signal exceeding Smax (overflowed samples) are flattened (clipped) to Smax, the upper limit of the input range.
[0012]
FIG. 4 is a waveform diagram showing the input speech when an instantaneous interruption occurs in the analog signal waveform shown in FIG. 2. In this case, regardless of the input range of the A/D converter 1, no sample values exist during a certain interval, and the A/D converter 1 outputs a signal whose sample values are 0 in that interval.
[0013]
Next, the configuration of the speech recognition apparatus according to Embodiment 1 of the present invention is described further with reference to FIG. 1. The minute signal output unit 2 is an element or circuit that superimposes a minute signal (minute noise) on the output signal of the A/D converter 1. The acoustic analysis unit 3 takes the digital signal on which the minute signal is superimposed and, using the signal at regular intervals, outputs the speech features (input speech features) used for speech recognition. The acoustic likelihood calculation unit 4 compares the standard speech pattern (standard speech feature) of each basic recognition unit with the speech features output from the acoustic analysis unit 3 and calculates the likelihood for each basic recognition unit.
[0014]
The acoustic model storage unit 5 includes a storage medium or a storage element and a circuit (collectively referred to as a storage device) that stores a speech standard pattern of a recognition basic unit for which the acoustic likelihood calculation unit 4 calculates the likelihood. A computer program and a controller for managing and configuring the storage device may be included.
[0015]
The matching unit 6 calculates the likelihood of each vocabulary entry from the likelihoods calculated for each basic recognition unit, according to the vocabulary/language model on which the speech recognition apparatus is based, and selects the entry with the maximum likelihood as the recognition candidate. The vocabulary/language model storage unit 7 is a storage medium, or a storage element and circuit (storage device), that stores the vocabulary/language model referred to by the matching unit 6. Like the acoustic model storage unit 5, it may include a computer program or controller that manages and configures the storage device.
[0016]
The A/D converter 1 corresponds to the A/D conversion means, the minute signal output unit 2 to the minute signal output means, the acoustic analysis unit 3 to the acoustic analysis means, the acoustic likelihood calculation unit 4 and the acoustic model storage unit 5 to the acoustic likelihood calculation means, and the matching unit 6 and the vocabulary/language model storage unit 7 to the matching means.
[0017]
Next, the operation of the speech recognition apparatus according to Embodiment 1 of the present invention will be described. The A/D converter 1 converts a speech signal input as an analog signal into a digital signal. The minute signal output unit 2 superimposes a minute signal on the digital signal output from the A/D converter 1 and outputs the result. The process of superimposing such a minute signal is referred to here as non-zeroing. The minute signal is, for example, white noise whose maximum sample value is about 2^4.
[0018]
In addition to connecting the A / D converter 1 and the minute signal output unit 2 in series, for example, the A / D converter 1 detects the occurrence of momentary interruption or overflow by detecting the power. Based on the detection result, a switch for connecting the movable terminal to either the output of the A / D converter 1 or the output of the minute signal output unit 2 may be provided.
[0019]
As another non-zeroing method, the connection order of the A/D converter 1 and the minute signal output unit 2 may be swapped so that the output of the minute signal output unit 2 becomes the input of the A/D converter 1. Because the output of the minute signal output unit 2 is then always input to the A/D converter 1, the sample values output by the A/D converter 1 never remain 0 continuously for more than a certain period, even if an instantaneous interruption occurs.
[0020]
Subsequently, the acoustic analysis unit 3 calculates features (for example, LPC cepstrum coefficients) from the speech (digital signal) on which the minute noise output from the minute signal output unit 2 is superimposed. It does so at regular intervals (for example, a frame period of 10 msec), each time using a fixed length of the digital signal (for example, a frame length of 25 msec). As a result, the output O of the acoustic analysis unit 3 is a time series of features, as shown for example in Equation (2).
[Equation 2]

[0021]
In Equation (2), o(t) is the feature of the t-th frame and is a vector of dimension K, as shown in Equation (3).
[Equation 3]

[0022]
Here, when the speech feature is the LPC cepstrum coefficient, the n-th order LPC cepstrum o'(t, n) is calculated by Equations (4) to (6).
[Equation 4]

Here α_i (i = 1, 2, ..., Na) are the linear prediction coefficients and are obtained as follows.
[0023]
First, let Ns be the window length (the number of samples in one frame), and let x(t, i) (0 ≤ i ≤ Ns − 1) be the 1st to Ns-th speech samples of the t-th frame multiplied by a finite-length window function (such as a Hamming window) that is 0 outside the frame. The autocorrelation sequence R_0, R_1, R_2, ..., R_Na is then calculated by Equation (7).
[Equation 5]

[0024]
Next, Equation (8), a system of simultaneous equations in α_i, is solved.
[Equation 6]

When Equation (8) is written in matrix form, Equation (9) is obtained.
[Equation 7]

where r_i = R_i / R_0.
[0025]
Using the Toeplitz property of the matrix in Equation (9), α_i can be obtained by the Levinson-Durbin recursion. Each α_n is computed iteratively from n = 1 to n = Na. The value of α_n obtained in the m-th iteration (1 ≤ m ≤ Na) is written α_n^(m), and in particular, when m = n, k_n = α_n^(n). The initial values are set as
[Equation 8]

and then k_m, α_i^(m), and E^(m) are calculated in turn for m = 2, 3, 4, ... from the following recurrence.
[Equation 9]

[0026]
In Equation (11), m is increased step by step, and the recurrence ends when m reaches Na, at which point α_i (i = 1, 2, ..., Na) have been obtained. Now suppose that all the speech samples in the frame are 0 during this LPC cepstrum calculation. Then x(t, i) = 0 (i = 0, 1, 2, ..., Ns − 1), so R_0 calculated by Equation (7) becomes 0, as in the following equation.
[Equation 10]

[0027]
As a result, computing r_i = R_i / R_0 in Equation (9) requires division by 0, and r_i cannot be calculated. This means that the LPC cepstrum o'(t, n) cannot be calculated. That is, when the output of the A/D converter 1 becomes 0 because of an instantaneous interruption or overflow, the LPC cepstrum cannot be calculated, and an error occurs in the calculation of the speech feature. Division by zero is treated as a serious error, typically handled by raising a trap in an ordinary computer system. For this reason, conventional devices cannot stably perform speech recognition using the speech signal remaining in the unstable section. This was the problem with conventional speech recognition when the input signal contains an unstable section.
[0028]
The speech recognition apparatus according to Embodiment 1 solves this problem by providing the minute signal output unit 2. Even when the output of the A/D converter 1 becomes 0 in the unstable section, the minute signal output unit 2 supplies a minute signal composed of non-zero components, so the input speech signal of the acoustic analysis unit 3 is never all zero. The speech feature is therefore calculated stably for an input speech signal containing an unstable section, and the problem described above does not arise when the speech feature is computed from the speech signal remaining in the unstable section.
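The failure mode and its remedy can be reproduced in a few lines. The sketch below uses the textbook autocorrelation method and Levinson-Durbin recursion; the equation images are not reproduced in this text, so the exact indexing of Equations (7) to (11) may differ, and the function names are illustrative only.

```python
def autocorrelation(frame, order):
    """R_0 .. R_order of one (already windowed) frame."""
    n = len(frame)
    return [sum(frame[j] * frame[j + i] for j in range(n - i))
            for i in range(order + 1)]


def levinson_durbin(R):
    """Solve the normalised Toeplitz equations for the prediction coefficients.

    Returns (alpha, err) where alpha[i - 1] is alpha_i and err is the final
    normalised prediction error. Fails when R[0] == 0, i.e. for an all-zero frame.
    """
    if R[0] == 0:
        raise ZeroDivisionError("R_0 = 0: the frame contains only zero samples")
    r = [v / R[0] for v in R]          # r_i = R_i / R_0
    order = len(r) - 1
    alpha = [0.0] * (order + 1)        # alpha[0] is unused
    err = 1.0
    for m in range(1, order + 1):
        acc = r[m] - sum(alpha[i] * r[m - i] for i in range(1, m))
        k = acc / err                  # reflection coefficient k_m
        prev = alpha[:]
        alpha[m] = k
        for i in range(1, m):
            alpha[i] = prev[i] - k * prev[m - i]
        err *= 1.0 - k * k
    return alpha[1:], err
```

Calling `levinson_durbin(autocorrelation([0] * 200, 4))` raises the division-by-zero error described above, while the same frame with even a small non-zero signal superimposed is analysed without error.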
[0029]
In Embodiment 1, providing the minute signal output unit 2 is a physical means of non-zeroing, that is, of preventing the input signal from becoming zero. Needless to say, other methods may be adopted instead. For example, the acoustic analysis unit 3 may non-zero the input by forcing predetermined low-order bits of the input speech signal, for example the least significant bit, to 1.
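Both non-zeroing variants can be sketched in software as follows. The noise amplitude of 2^4 is the example value from the text. The uniform integer noise, the clamping to the 16-bit range, and the function names are assumptions for illustration.

```python
import random


def add_minute_noise(samples, amplitude=2 ** 4, seed=None):
    """Superimpose small white noise on 16-bit samples (software stand-in for unit 2)."""
    rng = random.Random(seed)
    out = []
    for s in samples:
        noisy = s + rng.randint(-amplitude, amplitude)
        out.append(max(-32768, min(32767, noisy)))  # stay inside the 16-bit range
    return out


def force_lsb(samples):
    """Force the least significant bit of every sample to 1, so no sample equals 0."""
    return [s | 1 for s in samples]
```

Individual noise samples can still be 0 in the first variant, but a whole frame of zeros becomes vanishingly unlikely. The second variant guarantees that every sample is odd and therefore non-zero.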
[0030]
The above is the operation of the acoustic analysis unit 3. Subsequently, the operation of the speech recognition apparatus according to Embodiment 1 will be described.
[0031]
The acoustic model storage unit 5 stores a standard pattern representing the typical features of each basic recognition unit. In an HMM (Hidden Markov Model), the standard pattern is often expressed by a Gaussian distribution. In the following description, phonemes are used as the basic recognition unit, but the flow of processing does not change at all if phones, syllables, or the like are used instead.
[0032]
The acoustic likelihood calculation unit 4 compares the feature time series O output from the acoustic analysis unit 3 with the standard patterns (for example, one per phoneme) stored in the acoustic model storage unit 5 and computes the likelihood of each frame for each phoneme. With a Gaussian distribution using a diagonal covariance matrix, the likelihood B(p, t) of the feature o(t) of frame t for phoneme p is calculated by Equation (14).
[Equation 11]
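The image for Equation (14) is not reproduced in this text. A standard log-likelihood for a Gaussian with diagonal covariance, which is consistent with the description, can be sketched as below. Working in the log domain and the argument names are choices made here, not details taken from the patent.

```python
import math


def log_gaussian_diag(o, mean, var):
    """Log-likelihood of feature vector o under a diagonal-covariance Gaussian.

    o, mean, var : sequences of length K (var holds the per-dimension variances)
    """
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (x - m) ** 2 / v
        for x, m, v in zip(o, mean, var)
    )
```

In this sketch B(p, t) would correspond to `math.exp(log_gaussian_diag(o_t, mean_p, var_p))` for the stored mean and variance of phoneme p.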

[0033]
The matching unit 6 calculates the likelihood of each vocabulary entry from the likelihoods obtained by the acoustic likelihood calculation unit 4 and the phoneme sequence of each entry stored in the vocabulary/language model storage unit 7, and outputs the entry with the highest likelihood as the recognition result. That is, the speech recognition result W' is extracted using Equation (15) for the feature time series O output by the acoustic analysis unit 3.
[Equation 12]

[0034]
In equation (15), P (O | W) in the first term is an acoustic probability. This probability is calculated assuming the recognition target vocabulary W. Recently, HMMs are often used to calculate acoustic probabilities. The second term P (W) represents the assumed probability of the vocabulary W and is a linguistic probability. Recently, statistical language models are often used to obtain linguistic probabilities.
[0035]
Here, letting the state transition sequence be q = {q(0), q(1), ..., q(T)} (where q(0) is the initial state and q(T) is an element of the set F of final states), P(O|W) in Equation (15) can be expressed by Equation (16) below.
[Equation 13]

[0036]
In Equation (16), π_i is the initial probability of the i-th state (π_0 = 1; π_1, ..., π_T = 0), a(i, j) is the transition probability from the i-th state to the j-th state, and b(i, t) is the likelihood of the i-th state at time (frame) t. F is the set of final states. If the i-th state represents the phoneme p, then b(i, t) = B(p, t).
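The images for Equations (15) and (16) are not reproduced in this text. The sketch below shows one common way to evaluate such a model: a forward pass that sums over all state sequences starting from the initial distribution and ending in a final state, followed by an argmax over vocabulary entries of P(O|W) · P(W). Whether the patent sums over sequences or takes the best one (Viterbi) is not recoverable from the text, so the summation, the non-emitting initial state, and the function names are assumptions.

```python
def acoustic_probability(pi, a, b, final_states):
    """P(O|W) for one word HMM via the forward algorithm.

    pi[i]        : initial probability of state i (state at time 0 emits nothing)
    a[i][j]      : transition probability from state i to state j
    b[j][t]      : likelihood of state j for frame t (t = 0 .. T-1)
    final_states : indices of the states in the final-state set F
    """
    n_states = len(pi)
    n_frames = len(b[0])
    forward = list(pi)
    for t in range(n_frames):
        forward = [
            sum(forward[i] * a[i][j] for i in range(n_states)) * b[j][t]
            for j in range(n_states)
        ]
    return sum(forward[i] for i in final_states)


def recognize(candidates):
    """candidates maps each word W to (P(O|W), P(W)); return the argmax of the product."""
    return max(candidates, key=lambda w: candidates[w][0] * candidates[w][1])
```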
[0037]
Thus, the speech recognition apparatus according to Embodiment 1 outputs the maximum likelihood speech recognition result for the input speech signal.
[0038]
As is apparent from the above, according to the speech recognition apparatus of the first embodiment, an analog speech signal that is input by de-zeroing an analog speech signal or a digital signal obtained by digitally converting this analog speech signal. Even if an unstable section exists, the voice feature amount is stably calculated. Therefore, it becomes possible to perform speech recognition based on the speech signal remaining in the unstable section, and as a result, it is possible to prevent the recognition rate from being lowered even for speech signals with overflow or instantaneous interruption.
[0039]
In addition, even if components other than the A / D converter 1, the minute signal output unit 2, and the acoustic analysis unit 3 among the components of Embodiment 1 are replaced with others, the characteristics of the present invention are not impaired.
[0040]
Further, the A / D converter 1, the minute signal output unit 2, the acoustic analysis unit 3, the acoustic likelihood calculation unit 4, and the matching unit 6 may be configured by hardware, or a speech recognition program may be created and executed by a computer.
[0041]
Embodiment 2.
In the first embodiment, the problem that the speech feature amount cannot be calculated in an unstable section is solved by superimposing a minute signal on the input signal, or by masking the low-order bits of the digital signal to 1, so that the speech feature amount is calculated stably and the speech signal remaining in the unstable section can be used. Embodiment 2 focuses on the low reliability of the likelihood in such an unstable section and corrects the likelihood of the unstable section using likelihoods outside the unstable section, thereby improving the reliability of the acoustic likelihood of the section.
[0042]
FIG. 5 is a block diagram showing the configuration of the speech recognition apparatus according to the second embodiment. In the figure, the components denoted by the same reference numerals as those in FIG. 1 are the same as those in the first embodiment, and thus their description is omitted. The unstable section detection unit 8 detects whether an instantaneous interruption or overflow has occurred in the A / D converter 1. The acoustic likelihood correction unit 9 corrects the likelihood calculated by the acoustic likelihood calculation unit 4 for the unstable section, and a signal line from the unstable section detection unit 8 is provided to notify it whether or not the current section is unstable.
[0043]
The acoustic likelihood calculation unit 4, the acoustic model storage unit 5, and the unstable section detection unit 8 correspond to the acoustic likelihood calculating means; the acoustic likelihood correction unit 9 corresponds to the acoustic likelihood correcting means; and the collation unit 6, the vocabulary / language model storage unit 7, and the unstable section detection unit 8 correspond to the collating means.
[0044]
Next, the operation of the speech recognition apparatus according to the second embodiment will be described. The A / D converter 1 converts an analog audio signal into a digital signal as in the first embodiment. The unstable section detection unit 8 monitors the power on the input line of the A / D converter 1 and, when it detects an unstable section, that is, the occurrence of an instantaneous interruption or an overflow, sets the signal line to the acoustic likelihood correction unit 9 to Hi. When the section is not unstable, this signal line is kept Low.
[0045]
The acoustic analysis unit 3 and the acoustic likelihood calculation unit 4 operate in the same manner as in the first embodiment, and calculate the speech feature amount o(t) and the acoustic likelihood B(p, t) for the phoneme p for each frame t.
[0046]
When the signal line from the unstable section detection unit 8 is Low, the acoustic likelihood correction unit 9 outputs the likelihood B(p, t) calculated by the acoustic likelihood calculation unit 4 as it is. When the signal line is Hi, the acoustic likelihood correction unit 9 corrects the likelihood calculated by the acoustic likelihood calculation unit 4 as follows. That is, taking the time at which the unstable section starts and the time at which it ends as the start point ts and the end point te on the time axis, the likelihood is corrected by equation (18).
[Expression 14]

[0047]
Here, N is the maximum time over which correction using the likelihoods before and after the unstable section (the likelihood calculated immediately before the start point of the unstable section and the likelihood calculated immediately after the end point) is allowed, and Bth is a predetermined value. That is, in equation (18), the constant value Bth is set in the section (case (C) above) that is separated from both the start point and the end point by more than the time N. In the sections that lie within the time N of the start point or the end point (cases (A) and (B) above), a likelihood is given that forms a continuous likelihood distribution connecting the likelihood immediately before the start point, or the likelihood immediately after the end point, to Bth in (C). The time N is determined in consideration of the frame length of acoustic analysis and is, for example, 40 msec.
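Because the image of equation (18) is not reproduced here, the following sketch only illustrates the three cases described in the text. It assumes that time is counted in frames, that likelihoods are log values, and that the continuous connection to Bth in cases (A) and (B) is linear; all of these, and the function name, are assumptions for illustration.

```python
def corrected_likelihood(t, ts, te, n, b_before, b_after, b_th):
    """Corrected likelihood at frame t inside the unstable section [ts, te].

    b_before : likelihood calculated immediately before the start point ts
    b_after  : likelihood calculated immediately after the end point te
    b_th     : constant value used far from both ends
    n        : maximum distance (in frames, n > 0) over which the ends have influence
    """
    from_start = t - ts
    to_end = te - t
    if from_start > n and to_end > n:      # case (C): far from both ends
        return b_th
    if from_start <= to_end:               # case (A): closer to the start point
        w = from_start / n
        return (1 - w) * b_before + w * b_th
    w = to_end / n                         # case (B): closer to the end point
    return (1 - w) * b_after + w * b_th
```

With ts = 10, te = 30, n = 4, the value moves from `b_before` at frame 10 toward `b_th`, stays at `b_th` in the middle, and moves toward `b_after` as frame 30 approaches.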
[0048]
Next, the collation unit 6 calculates the maximum likelihood speech recognition result using Expression (15) in the same manner as in the first embodiment. The above is the operation of the speech recognition apparatus according to the second embodiment.
[0049]
As is clear from the above, according to the speech recognition apparatus of the second embodiment, the acoustic likelihood of the unstable section is corrected based on the likelihoods before and after the section. As a result, the likelihood of the phonemes before and after the unstable section is reflected in the vicinity of the start point or end point of the unstable section, so that misrecognition can be prevented.
[0050]
Also, the influence of the likelihood immediately before the start point and the likelihood immediately after the end point is considered to decrease with distance from the start point and the end point, and to be lost entirely in the intermediate section that is more than a fixed time away from them, so a constant value is used as the likelihood there. Thereby, when the unstable section is long, the likelihood immediately before the start point and the likelihood immediately after the end point are prevented from exerting more influence than necessary.
[0051]
In addition, since speech recognition is performed based on the speech signal remaining in the section while correcting the likelihood even in the unstable section, a decrease in the recognition rate can be prevented even for a speech signal containing an overflow or an instantaneous interruption.
[0052]
Besides the correction using equation (18), other methods of reflecting the likelihood immediately before the start point and the likelihood immediately after the end point in the likelihood of the unstable section are conceivable. For example, a likelihood distribution that monotonically increases or monotonically decreases from the likelihood immediately before the start point to the likelihood immediately after the end point may be assumed, and the likelihood of the unstable section may be determined based on that distribution. Such a method also compensates for the discontinuity of speech information due to overflow or instantaneous interruption, and thus misrecognition can be prevented.
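A monotonic distribution of this kind can be sketched with straight-line interpolation. The linear shape and the function name are assumptions chosen for illustration; any monotonic curve between the two boundary likelihoods would fit the description.

```python
def monotonic_likelihood(t, ts, te, b_before, b_after):
    """Interpolate linearly from b_before at ts to b_after at te (requires te > ts)."""
    w = (t - ts) / (te - ts)
    return (1 - w) * b_before + w * b_after

# Likelihood falls steadily from -2.0 to -6.0 across frames 10..20.
midpoint = monotonic_likelihood(15, 10, 20, -2.0, -6.0)
```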
[0053]
Further, it goes without saying that this configuration may be combined with the technique shown in the first embodiment, in which the speech feature amount is calculated stably by de-zeroing the input signal in an unstable section.
[0054]
Furthermore, in the speech recognition apparatus according to the second embodiment, the unstable section detection unit 8 is provided to detect that an instantaneous interruption or overflow has occurred in the A / D converter 1. Alternatively, for example, the acoustic analysis unit 3 may determine that an instantaneous interruption has occurred when the sample value from the A / D converter 1 is below (or at or below) a predetermined lower limit, and that an overflow has occurred when the absolute value of the sample value exceeds (or is at or above) a predetermined value. When generating a speech feature amount from such sample values, it may set a special flag or the like so that the acoustic likelihood calculation unit 4 or the acoustic likelihood correction unit 9 can make the determination. For example, if the minute signal output unit 2 shown in the first embodiment is provided and superimposes a minute signal on the order of 2⁴, the lower limit may be set to about 2⁵. If the resolution of the A / D converter 1 is 16 bits, sample values range from −32768 to 32767, so an overflow may be determined to have occurred when the absolute value of the sample value is 32767 or more.
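The sample-value test described above can be sketched as follows, using the 16-bit example and the 2⁵ lower limit from the text. How per-sample decisions are combined into a per-frame flag is not specified in the source; the rule used here (every sample small means interruption, any clipped sample means overflow) and the function name are assumptions.

```python
def classify_frame(frame, lower_limit=2 ** 5, overflow_level=32767):
    """Label a frame of 16-bit samples as 'interruption', 'overflow', or 'normal'."""
    if all(abs(s) <= lower_limit for s in frame):
        return "interruption"   # only the superimposed minute signal remains
    if any(abs(s) >= overflow_level for s in frame):
        return "overflow"       # at least one sample hit the converter's range
    return "normal"

labels = [
    classify_frame([3, -7, 16, 0]),
    classify_frame([100, 32767, -50]),
    classify_frame([100, -2000, 300]),
]
```

The resulting label could serve as the special flag attached to the speech feature amount of the frame.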
[0055]
Embodiment 3.
The speech recognition apparatus according to Embodiment 2 appropriately selects the maximum likelihood phoneme (or another recognition basic unit) by correcting the acoustic likelihood in the unstable section, and prevents erroneous recognition. In addition, a method of lowering the weight of acoustic likelihood in an unstable section when collating with a vocabulary can be considered. The speech recognition apparatus according to Embodiment 3 operates according to such a principle.
[0056]
FIG. 6 is a block diagram showing the configuration of the speech recognition apparatus according to the third embodiment. In the figure, the components denoted by the same reference numerals as those in FIG. 5 are the same as those in the second embodiment, and thus their description is omitted. As is clear from FIG. 6, the signal line from the unstable section detection unit 8 reaches the collation unit 6, which differs from FIG. 5. The unstable section detection unit 8 in the third embodiment not only detects whether or not the section is unstable, but also distinguishes between an overflow section and an instantaneous interruption section. The signal line can take three states (for example, Normal: not an unstable section, Hi: overflow, Low: instantaneous interruption).
[0057]
Next, the operation of the speech recognition apparatus according to the third embodiment will be described. Since the operations of the A / D converter 1, the unstable section detection unit 8, the acoustic analysis unit 3, and the acoustic likelihood calculation unit 4 are the same as those in the second embodiment, their description is omitted. The collation unit 6 then sets, depending on whether the signal line of the unstable section detection unit 8 is Normal, Hi, or Low, the contribution of the per-phoneme likelihood calculated by the acoustic likelihood calculation unit 4 to the likelihood calculation for the entire input speech signal. It then collates the per-phoneme likelihoods using the vocabulary / language model storage unit 7 and this contribution, and outputs a recognition result.
[0058]
Here, the acoustic probability P (O | W) in the equation (15) when the frame contribution degree in the frame t is f (t) is given by the equation (19).
[Expression 15]

The frame contribution f (t) is as follows.
[Expression 16]

[0059]
Here, f1 and f2 are set to constant values, for example, f1 = 0.5 and f2 = 0.1 are set. In this example, the overall contribution of the likelihood of the overflow section is half that of the normal section, and the contribution of the likelihood of the instantaneous interruption section to the whole is 1/10 of the normal section.
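The use of the frame contribution can be sketched as below. The images of equations (19) to (21) are not reproduced here, so this assumes that f(t) acts as an exponent on the per-frame likelihood, which becomes a multiplier in the log domain. That reading, the function names, and the sample values are assumptions; the constants f1 = 0.5 and f2 = 0.1 are the example values from the text.

```python
F1 = 0.5   # contribution of an overflow frame (signal line Hi)
F2 = 0.1   # contribution of an instantaneous interruption frame (signal line Low)

def frame_contribution(state):
    """Map the three signal-line states to the frame contribution f(t)."""
    return {"Normal": 1.0, "Hi": F1, "Low": F2}[state]

def weighted_log_likelihood(log_likelihoods, states):
    """Accumulate per-frame log-likelihoods, each scaled by its contribution."""
    return sum(frame_contribution(s) * lb
               for lb, s in zip(log_likelihoods, states))

total = weighted_log_likelihood([-1.0, -4.0, -10.0], ["Normal", "Hi", "Low"])
```

The very poor score of the interrupted frame (-10.0) contributes only -1.0 to the total, so it cannot dominate the comparison between vocabulary entries.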
[0060]
Alternatively, let the proportion of the signal exceeding the maximum value within one frame at time t be called the peak detection rate, denoted Po(t), and let the proportion of the signal in an instantaneous interruption state within one frame at time t be called the instantaneous interruption detection rate, denoted Pc(t). Then, as shown in equation (22), the frame contribution f(t) may be determined using the peak detection rate Po(t) during overflow and the instantaneous interruption detection rate Pc(t) during instantaneous interruption.
[Expression 17]

[0061]
More specifically, these can be calculated as shown in, for example, equations (23) and (24).
[Expression 18]

[Expression 19]

[0062]
In this example, when the peak detection rate is equal to or less than a certain value (0.05), the obtained likelihood can be trusted, so the frame contribution is set to 1 (the same as normal). When the peak detection rate is a certain value (0.3) or more, the input distortion is too large and the likelihood calculation cannot be trusted, so the frame contribution is set to 0 (the frame does not contribute to the overall likelihood). Further, when the peak detection rate is between 0.05 and 0.3, the greater the peak detection rate, the smaller the frame contribution.
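The mapping from peak detection rate to frame contribution can be sketched as follows. The thresholds 0.05 and 0.3 come from the text; the linear decrease between them, the treatment of full-scale samples as clipped, and the function names are assumptions, since the images of equations (23) and (24) are not reproduced here.

```python
def peak_detection_rate(frame, max_level=32767):
    """Fraction of samples in the frame whose magnitude reaches the maximum level."""
    return sum(1 for s in frame if abs(s) >= max_level) / len(frame)

def contribution_from_rate(rate, low=0.05, high=0.3):
    """1 at or below `low`, 0 at or above `high`, decreasing linearly in between."""
    if rate <= low:
        return 1.0
    if rate >= high:
        return 0.0
    return (high - rate) / (high - low)

frame = [32767, -32768] + [100] * 18      # 2 clipped samples out of 20
rate = peak_detection_rate(frame)
f_t = contribution_from_rate(rate)
```

The same mapping could be applied to the instantaneous interruption detection rate Pc(t) with its own thresholds.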
[0063]
The frame contribution may also be set according to the time from the beginning and the end of the overflow section. An example of calculation in this case is shown in equation (25). In equation (25), ts is the start point of the unstable section, and te is the end point of the unstable section.
[Expression 20]

In the above equation, min (x, y) is an operation for selecting the smaller of x and y. In this example, the contribution of the start point and the end point is 1, and is 0.5 in the middle of the unstable section.
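One form that uses min(x, y) and has the stated properties (contribution 1 at the start and end points, 0.5 in the middle) is sketched below. The image of equation (25) is not reproduced here, so this particular expression and the function name are assumptions rather than the equation itself.

```python
def positional_contribution(t, ts, te):
    """Contribution that is 1 at ts and te and falls to 0.5 at the midpoint (te > ts)."""
    return 1.0 - min(t - ts, te - t) / (te - ts)

values = [positional_contribution(t, 10, 20) for t in (10, 12, 15, 20)]
```

For an unstable section spanning frames 10 to 20, the contribution is 1 at frame 10, about 0.8 at frame 12, 0.5 at frame 15, and 1 again at frame 20.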
[0064]
As is clear from the above, by making the contribution to the overall likelihood of the likelihood of the instantaneous interruption section or overflow section reported by the unstable section detection unit 8 small (hard to reflect), misrecognition caused by sections whose likelihood has low reliability can be reduced.
[0065]
In addition, it is possible to set the contribution according to the input state by setting the frame contribution based on the peak detection rate, the time difference from the end of the overflow section, or the like.
[0066]
In the third embodiment, the unstable section detection unit 8 determines the three states of the overflow section, the instantaneous interruption section, and the normal section. However, as in the second embodiment, the acoustic analysis unit 3 may make this determination, and components or data identifying this information may be included in the speech feature amount.
[0067]
Furthermore, it goes without saying that it can be used in combination with the minute signal output unit 2 in the first embodiment and the acoustic likelihood correction unit 9 in the second embodiment.
[0068]
Embodiment 4.
In the speech recognition apparatuses according to the first to third embodiments, speech recognition is performed on a speech signal containing overflow or instantaneous interruption by a method of stably calculating the speech feature amount even in an unstable section, a method of correcting the likelihood in an unstable section, and a method of making the weighting of the likelihood in an unstable section smaller than the weighting of the likelihood in other sections. In addition to these, a method of preparing acoustic models on the premise of recognizing speech signals in unstable sections is also conceivable. The speech recognition apparatus according to the fourth embodiment operates on this principle.
[0069]
FIG. 7 is a block diagram showing the configuration of the speech recognition apparatus according to the fourth embodiment. In the figure, the components denoted by the same reference numerals as those in FIG. 6 are the same as those in the third embodiment, and thus the description thereof is omitted. However, in the fourth embodiment, it is assumed that the acoustic model storage unit 5 stores a plurality of acoustic models. The acoustic model selection unit 10 is a part that selects an acoustic model that meets a condition from a plurality of acoustic models stored in the acoustic model storage unit 5. Further, the signal line from the unstable section detector 8 is connected to the acoustic model selector 10.
[0070]
Next, the operation of the speech recognition apparatus according to the fourth embodiment will be described. Since the operations of the A / D converter 1, the unstable section detection unit 8, and the acoustic analysis unit 3 are the same as those in the third embodiment, their description is omitted. As in the third embodiment, the signal line carrying the detection result of the unstable section detection unit 8 takes three states: Hi (an overflow section), Low (an instantaneous interruption section), and Normal (a steady state, normal state, or stable section).
[0071]
The acoustic likelihood calculation unit 4, the acoustic model storage unit 5, the unstable section detection unit 8, and the acoustic model selection unit 10 correspond to acoustic likelihood calculation means.
[0072]
Next, the operation of the acoustic model selection unit 10 will be described. The acoustic model selection unit 10 calculates the peak detection rate and the instantaneous interruption detection rate based on the unstable section detection result output from the unstable section detection unit 8. Based on the calculated peak detection rate and instantaneous interruption detection rate, an optimal acoustic model is selected from a plurality of acoustic models stored in the acoustic model storage unit 5.
[0073]
The acoustic model storage unit 5 stores acoustic models learned in environments having predetermined peak detection rates and instantaneous interruption detection rates, each in association with its peak detection rate and instantaneous interruption detection rate. The acoustic model selection unit 10 compares the peak detection rate and instantaneous interruption detection rate associated with each acoustic model with the current peak detection rate and instantaneous interruption detection rate, and selects the acoustic model whose associated rates have the smallest distance from the current rates. That is, an acoustic model learned in a poor environment and an acoustic model learned in a good environment are prepared, and the acoustic model closest to the actual environment is selected.
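The selection by smallest distance can be sketched as a nearest-neighbor lookup. The source does not state which distance is used; Euclidean distance over the pair of rates is assumed here, and the model names, dictionary keys, and rate values are hypothetical.

```python
import math

def select_acoustic_model(models, peak_rate, interruption_rate):
    """Return the stored model whose training-condition rates are closest to the current ones."""
    return min(
        models,
        key=lambda m: math.hypot(m["peak_rate"] - peak_rate,
                                 m["interruption_rate"] - interruption_rate),
    )

# Hypothetical models, each tagged with the rates of its training environment.
models = [
    {"name": "clean",   "peak_rate": 0.0, "interruption_rate": 0.0},
    {"name": "clipped", "peak_rate": 0.2, "interruption_rate": 0.0},
    {"name": "dropout", "peak_rate": 0.0, "interruption_rate": 0.2},
]
chosen = select_acoustic_model(models, peak_rate=0.18, interruption_rate=0.02)
```

With a current peak detection rate of 0.18, the model trained under heavy clipping is the closest match.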
[0074]
The acoustic likelihood calculation unit 4 calculates the likelihood for each phoneme (or other basic recognition unit such as a phone or syllable) from the acoustic model that the acoustic model selection unit 10 selected based on the peak detection rate and the instantaneous interruption detection rate, and the collation unit 6 outputs the maximum likelihood recognition result based on the calculated likelihood.
[0075]
As is clear from the above, in the speech recognition apparatus according to the fourth embodiment, a plurality of acoustic models learned in advance under various peak detection rates and instantaneous interruption detection rates are prepared, and the acoustic model closest to the current peak detection rate and instantaneous interruption detection rate is selected. As a result, speech recognition is performed with an acoustic model that matches the degraded environment, using the speech signal remaining in the unstable section, so that accuracy can be improved. That is, the recognition rate can be improved by selecting an acoustic model corresponding to the degradation of S / N due to quantization noise even in a section where no instantaneous interruption or overflow occurs.
[0076]
In the above, the peak detection rate and the instantaneous interruption detection rate are calculated based on the ratio of the overflowed signal or the instantaneous interruption signal for each frame as defined in the third embodiment. However, these rate calculation sections are not limited to frames, and may be calculated, for example, in units of utterances, or may be calculated every predetermined time (eg, 40 msec).
[0077]
Further, as a learning condition for the acoustic model, the power of each frame may be employed instead of the peak detection rate and the instantaneous interruption detection rate. That is, a plurality of acoustic models learned under a predetermined power may be prepared, and the acoustic model may be selected based on the actual frame power. Also in this case, it goes without saying that the acoustic model may be selected based on the average power for each utterance or for a predetermined time instead of for each frame.
[0078]
[Effect of the invention]
Since the speech recognition apparatus according to the present invention uses the speech information remaining in the unstable section, it has the remarkable effect of preventing a decrease in the recognition rate even when the unstable section is long.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of a speech recognition apparatus according to Embodiment 1 of the present invention.
FIG. 2 is a waveform diagram of an overflowed analog speech signal input to the speech recognition apparatus according to Embodiment 1 of the present invention.
FIG. 3 is a waveform diagram of an overflowed speech signal after it has been input to the speech recognition apparatus according to Embodiment 1 of the present invention and digitally converted.
FIG. 4 is a waveform diagram of an audio signal including a momentary interruption input to the speech recognition apparatus according to Embodiment 1 of the present invention.
FIG. 5 is a block diagram showing a configuration of a speech recognition apparatus according to Embodiment 2 of the present invention.
FIG. 6 is a block diagram showing a configuration of a speech recognition apparatus according to Embodiment 3 of the present invention.
FIG. 7 is a block diagram showing a configuration of a speech recognition apparatus according to Embodiment 4 of the present invention.
[Explanation of symbols]
1 A / D converter
2 Minute signal output unit
3 Acoustic analysis unit
4 Acoustic likelihood calculation unit
5 Acoustic model storage unit
6 Collation unit
7 Vocabulary / language model storage unit
8 Unstable section detection unit
9 Acoustic likelihood correction unit
10 Acoustic model selection unit

Claims

A speech recognition apparatus that receives an analog speech signal, converts it into a digital signal with an A / D converter, calculates an input speech feature quantity from the digital signal, and calculates a speech recognition result for the analog speech signal based on the input speech feature quantity,
the speech recognition apparatus comprising an acoustic analysis unit that, even if the analog speech signal contains an instantaneous interruption section or an overflow section exceeding the input range of the A / D converter (hereinafter referred to as an unstable section), calculates the input speech feature quantity based on the analog speech signal remaining in the unstable section.

The speech recognition apparatus according to claim 1, wherein the acoustic analysis unit calculates the input speech feature quantity from the digital signal in the unstable section that has been made non-zero.

The speech recognition apparatus according to claim 2, further comprising minute signal output means for superimposing a minute signal so that the digital signal is made non-zero even if the unstable section is present in the analog speech signal,
wherein the acoustic analysis unit calculates the input speech feature quantity from the digital signal that has been made non-zero by the minute signal output means.

The speech recognition apparatus according to claim 1, further comprising:
acoustic likelihood calculating means for calculating the acoustic likelihood between the input speech feature quantity and a standard speech feature quantity;
acoustic likelihood correcting means for correcting the acoustic likelihood of the unstable section based on the acoustic likelihoods calculated by the acoustic likelihood calculating means immediately before and immediately after the unstable section; and
collating means for calculating a speech recognition result for the input speech feature quantity based on the acoustic likelihood calculated by the acoustic likelihood calculating means or the acoustic likelihood corrected by the acoustic likelihood correcting means.

The speech recognition apparatus according to claim 4, wherein the acoustic likelihood correcting means assumes, as the acoustic likelihood distribution of the unstable section, an acoustic likelihood distribution that is continuous with each of the acoustic likelihoods immediately before and immediately after the unstable section, and corrects the acoustic likelihood of the unstable section based on that acoustic likelihood distribution.

The speech recognition apparatus according to claim 4, wherein, when the time length of the unstable section exceeds a predetermined length, the acoustic likelihood correcting means corrects the acoustic likelihood of an intermediate section that is a fixed time away from both the start point and the end point of the section to a constant value.

The speech recognition apparatus according to claim 4, wherein the acoustic likelihood correcting means assumes an acoustic likelihood distribution that monotonically increases or monotonically decreases from the acoustic likelihood immediately before the unstable section toward the acoustic likelihood immediately after the section, and corrects the acoustic likelihood of the unstable section based on that acoustic likelihood distribution.

The speech recognition apparatus according to claim 1, further comprising:
acoustic likelihood calculating means for calculating the acoustic likelihood between the input speech feature quantity and a standard speech feature quantity; and
collating means for calculating a speech recognition result for the input speech feature quantity while making the weighting of the acoustic likelihood of the unstable section, among the acoustic likelihoods calculated by the acoustic likelihood calculating means, smaller than the weighting of the acoustic likelihood of sections other than the unstable section.

The speech recognition apparatus, wherein the collating means changes the weighting of the acoustic likelihood of the unstable section based on the rate at which input range excess or instantaneous interruption occurs in the A / D converter.

The speech recognition apparatus according to any one of the above, wherein the acoustic analysis means and the acoustic likelihood calculating means calculate the input speech feature quantity and the acoustic likelihood for each frame, and
the collating means reduces the weighting of the acoustic likelihood of a frame as the time difference between that frame and the start point or the end point of the unstable section becomes larger.

The speech recognition apparatus according to claim 1, further comprising acoustic model storage means for storing a plurality of acoustic models learned in environments in which input range excess of the A / D converter or instantaneous interruption occurs at different rates,
wherein the acoustic likelihood calculating means selects one of the plurality of acoustic models based on the rate at which input range excess of the A / D converter or instantaneous interruption occurs, and calculates the acoustic likelihood by comparing the standard speech feature quantity stored in the selected acoustic model with the input speech feature quantity.

The speech recognition apparatus according to claim 11, wherein the acoustic likelihood calculating unit calculates the ratio for each predetermined period and selects one of the plurality of acoustic models based on the ratio.

The speech recognition apparatus according to claim 12, wherein the acoustic likelihood calculating means uses each utterance period as the predetermined period.

The speech recognition apparatus according to claim 12, wherein the acoustic likelihood calculating means uses the period of each frame as the predetermined period.

The speech recognition apparatus according to claim 11, wherein the acoustic model storage means stores a plurality of acoustic models learned in environments of different power, instead of environments with different rates of input range excess or instantaneous interruption, and
the acoustic likelihood calculating means selects one of the plurality of acoustic models based on the power of the analog speech signal at the A / D converter, instead of the rate at which input range excess or instantaneous interruption occurs.