JP4028136B2

JP4028136B2 - Speech recognition method and apparatus

Info

Publication number: JP4028136B2
Application number: JP15028499A
Authority: JP
Inventors: 徹郎北添; 星一金; 知幸市来
Original assignee: Japan Science and Technology Agency; National Institute of Japan Science and Technology Agency
Current assignee: Japan Science and Technology Agency; National Institute of Japan Science and Technology Agency
Priority date: 1999-05-28
Filing date: 1999-05-28
Publication date: 2007-12-26
Anticipated expiration: 2019-05-28
Also published as: JP2000338989A

Description

【発明の属する技術分野】
本発明は、連続入力される音声からその要素となっている各音素の種類を識別して音声を認識する音声認識方法および装置の改良に関する。
【０００１】
音声認識では、認識率をいかに向上させるかが最も重要な問題となっており、本発明では、入力音声の音素とあらかじめ保持されている基準の各音素とを比較してそれぞれの類似度を求めた後の、唯一の音素を決定する音韻認識処理に、脳の立体視に関わる神経回路モデルを適用することにより、認識率の飛躍的な向上を図っている。
【０００２】
【従来の技術】
最近、大語彙連続音声認識の研究が盛んに行なわれるようになり、実用化の兆しも見えて来つつある。この研究には、大きく分けて二つの技術的問題がある。一つは音響モデルの問題で、一つ一つの音素の認識率を向上させようとするものである。もう一つは言語モデルの問題で、音素の繋がりに関する言語的あるいは文法的知識を用いて単語や文の認識率を高めようとするものである。前者に対してはヒドン（隠れ）マルコフモデル（ＨＭＭ）を基本として、それを補強するようなモデルが一般的に用いられてきた。現在は、特に後者の言語モデル改良に力が注入されているが、全体の文認識に対しては、言語モデルによる１０−２０％の改良は音響モデルによる１−２％の改良と等価な寄与しかしないことが知られており、大きな期待は望めないものとなっている。一方、音響モデルの方でも技術的に限界が感じられており、ＨＭＭのもとでこれ以上の発展を見込むのは困難である。
【０００３】
図９に、従来のＨＭＭモデルによる音声認識装置のブロック構成を示す。図において、１は連続音声入力部、２は音声信号処理部、３は音韻認識部、４は単語同定部、５は文同定部、６は文認識部である。連続音声入力部１では連続音声入力を行ない、デジタル形式の音声信号を入力する。音声信号処理部２で、入力された音声信号を一定の時間間隔（フレーム）に分割し、各フレームにおいて信号処理を行なって、音響パラメータ（通常メルケプストラム係数ＭＦＣＣ）を抽出する。音韻認識部３は、すでに学習されたＨＭＭモデルによる音韻認識を行なう。その際、各フレームにおける認識結果を見ながら、各音韻の平均的な長さや辞書的知識を参照して、音韻境界抽出を行ない、音韻と音韻の間の望ましい境界を決定する。それらの一連の作業によって決定された音韻の列により単語同定部４で単語同定処理を行ない、ついでそれらの同定された単語の積み上げとして、文同定部５で文同定を行なう。文同定部５では文同定の候補が挙げられるが、その際、文法的知識や意味的知識による検討が行なわれる。もし、この段階で文法的および意味的に問題がなければ、文認識部６の文認識処理において文認識が完成するが、そうでなければ単語同定部４、音韻認識部３にフィードバックして第二候補が検討される。
【０００４】
【発明が解決しようとする課題】
本発明の課題は、従来のヒドンマルコフモデル（ＨＭＭ）に比べて認識率の一層の向上が可能な音韻認識手段を有する音声認識方法および装置を提供することにある。
【０００５】
【課題を解決するための手段】
本発明は、音声認識の音響モデルとして、従来のヒドンマルコフモデルとは全く異なる原理に基づく神経回路モデルを用いるものである。
【０００６】
本発明者らは、先に人間の脳における立体視（ステレオビジョン）の機能について考察したが、立体視では、３次元物体を左目と右目の網膜にそれぞれ投影して得た二つの２次元像を、脳の中の神経回路によって比較して瞬時に類似度処理を行い、物体の立体知覚を得ていることから、このステレオビジョン神経回路機能は、音声の入力データと基準の学習データとの間の類似度の高速処理にも有効であると考えた。そこで、ステレオビジョン神経回路から音響モデルのための新しい神経回路方程式を構成、発展させ、実際にその神経回路モデルによる効果を確認できた。
【０００７】
本来のステレオビジョン神経回路における立体視のための類似度処理では、方程式中に、映像のある画素に対応するニューロン（セル）の活性度（興奮度）を他の画素に対応するニューロンの活性度により抑制するように作用する競合項と、活性度を強調するように作用する協調項とを併せてもっている。本発明では、ステレオビジョン神経回路の方程式において、立体視の左右二つの映像のデータの代わりに、音声認識対象の入力データの音素が比較基準となる各音素の学習データについてそれぞれ得られる類似度のデータを適用し、唯一の音素を決定する音韻認識処理を行なわせるようにした。音韻認識処理には、神経回路モデルが用いられ、競合と協調の作用は、出力層から入力層へのフィードバックの過程で行なわれるようにした。なお、本発明による音声認識にかかわる神経回路方程式の詳細については、後述される。
【０００８】
本発明は、以下のように構成される。
（１）連続して入力された音声からフレームごとに抽出した音響パラメータからなる入力データと、あらかじめ学習されている複数の音素の音響パラメータからなる基準の学習データとの間でフレームごとに各音素に対する類似度を計算し、得られた入力データのフレームごとの類似度データに基づいて音韻認識処理を行う音声認識方法において、
上記音韻認識処理は、競合項と協調項とを含む神経回路方程式にしたがう神経回路モデルを用いて行われ、該神経回路モデルでは、入力層における各音素対応のセルの活性度は、出力層における同一フレームの競合する他の音素対応のセルの活性度に応じて抑制を受け、また出力層における近接フレームの協調する同一音素の活性度に応じて強調されるように処理されることを特徴とする音声認識方法の構成。
（２）前項（１）において、各音素の学習データは、ガウス確率分布関数で標準化して保持されていることを特徴とする音声認識方法の構成。
（３）連続して入力された音声からフレームごとに抽出した音響パラメータからなる入力データと、あらかじめ学習されている複数の音素の音響パラメータからなる基準の学習データとの間でフレームごとに各音素に対する類似度を計算する類似度計算手段と、得られた入力データのフレームごとの類似度データに基づいて音韻認識を行なう音韻認識手段とを備えた音声認識装置において、
上記音韻認識手段は、競合項と協調項とを含む神経回路方程式にしたがう神経回路モデルを有し、該神経回路モデルは、入力層における各音素対応のセルの活性度が、出力層における同一フレームの競合する他の音素対応のセルの活性度に応じて抑制され、また出力層における近接フレームの協調する同一音素の活性度に応じて強調されるように構成されていることを特徴とする音声認識装置の構成。
（４）前項（３）において、各音素の学習データは、ガウス確率分布関数で標準化して保持されていることを特徴とする音声認識装置の構成。
【０００９】
図１により、本発明の基本構成を説明する。
【００１０】
図１において、認識対象の連続音声信号は、連続音声入力部１１に入力され、次に音声信号処理部１２で一定時間幅のフレームに分割されて、フレームごとに音響パラメータを抽出される。
【００１１】
一方、学習部１３には、あらかじめ認識基準となる音声データが入力されて学習が行われて、音素単位にフレームごとに音響パラメータを抽出して作成された学習データ１３ａが保持されている。
【００１２】
類似度計算部１４は、音声信号処理部１２から出力された入力データと学習データ１３ａの各音素とをフレームごとに比較し、各音素に対する類似度をそれぞれフレームごとに算出して類似度データを作成し、音韻認識部１５へ出力する。
【００１３】
音韻認識部１５は、前述した競合項と協調項を含む２層または３層神経回路モデル１５ａを備えており、類似度計算部１４から出力された各音素対応の類似度データを入力して動作させる。その結果、神経回路は一つの音素のみが勝利するように収束して、音韻認識出力を生じる。なお神経回路モデル１５ａは、コンピュータプログラム上に実現されている。
【００１４】
音韻認識部１５から順次出力される音韻認識出力は、図示省略されているが、さらに単語同定処理や文同定処理をされて、連続音声認識結果として出力される。
【００１５】
学習部１３には、多数の音声の学習データが各音素に分類して記憶されている。この学習データは、音素ごとにガウス確率分布関数（pdf）のような標準形式で記憶されている。
【００１６】
類似度計算部１４では、入力データの音素が、記憶されている学習データの各音素のガウスpdfに対して参照され、比較することによって類似度が求められる。
類似度の説明
ｕをフレーム番号、ａを音素名として、あるフレームｕにおける入力データとある音素/a/ との類似度を次の〔数１〕で表わし、またその類似度が対応する神経回路内のニューロン（セル）の活性度（アクティビティ）、つまり興奮のレベルを次の〔数２〕で表わす。
【００１７】
【数１】

【００１８】
【数２】

【００１９】
神経回路方程式は、入力として〔数１〕の類似度データを受け取った後、神経回路が安定点に向かって動作するように〔数２〕の活性度を処理し、安定点に達したとき認識を完了する。
【００２０】
記憶されている学習データは、次の〔数３〕に示すガウスpdfの式(1) で表現される。
【００２１】
【数３】

【００２２】
ここで、οは入力、μ_ａはある音素/a/ に対するケプストラムで表わされた学習データの平均値である。Σ_ａは、次の〔数４〕に示す式(2) で表わされる。
【００２３】
【数４】

【００２４】
ここで、ο_ｎはある音素/a/ の学習データである。ある音素/a/ に対するｕ番目のフレームにおける入力データο_ｕの正規化された類似度を次の〔数５〕で表わした場合、次の〔数６〕に示す式(3) のように定義される。
【００２５】
【数５】

【００２６】
【数６】

【００２７】
ここで、Ｎ′はＮの対数尤度を表わし、＜Ｎ′＞は、各音素の平均値を示す。
【００２８】
【発明の実施の形態】
以下に、本発明の具体的な実施の形態について説明する。
【００２９】
図２は、本発明の１実施例による音声認識処理のフローである。図中、21から23までは音素データの学習過程、24から28までは音韻認識過程を示す。なお、連続音声認識に伴う単語同定や文同定の過程は、省略されている。
【００３０】
21では、入力する音素の学習データとして、すでに専門家によって音声データの各音韻部分にラベルが振られたデータが利用される。これらの音声データから、ラベルに基づいて音韻を切り取り、膨大な音素データを作成する。
【００３１】
22では、各音素データからフレームごとに音響パラメータを抽出する。
【００３２】
23では、ガウス確率分布関数（pdf）を用いて各音素の標準モデルを計算し保存する。
【００３３】
24では、認識対象の音声データを入力する。ここでは、入力データとしてすでに音韻ごとに切り出された音素データが用いられる。音素データからは、フレームごとに音響パラメータが抽出される。
【００３４】
25では、入力された音素データと保存されている標準モデルの各音素データとをフレームごとに比較して、それぞれの類似度を計算する。
【００３５】
26では、フレームごとに各音素の類似度データを神経回路方程式に入力して、計算処理する。その際、27で、神経回路の活性度変数に対して適当な初期値の設定を行う。なお、計算時間を短縮する必要がある場合には、入力する類似度データ数を限定し、たとえば上位の５つの音素に限定してもよい。
【００３６】
28では、神経回路方程式を計算処理した結果、定常解が得られたときに音韻認識出力する。フレームごとに、定常解で出力層（最終層）の神経活性度が正の一定値をとる音素が認識候補となり、０に近い値の音素は捨てられる。各フレームで認識頻度がもっとも高かった音素がその音韻の第一候補となり、認識結果として出力される。
【００３７】
次に、競合項と協調項を含む神経回路方程式の実施例について説明する。
【００３８】
本発明において、音韻認識を行なう神経回路モデルを規定する競合項と協調項を含む神経回路方程式には種々の態様のものが考えられるが、以下に、３層構造神経回路方程式と２層構造神経回路方程式の２つの実施例について述べる。
実施例Ａ：３層構造神経回路（３ＬＮＮ）方程式
図３は、３層構造の神経回路方程式による神経回路（Three Layered Neural Net：３ＬＮＮ) モデルを示したものである。図示の神経回路モデルは、入力層、中間層、出力層の３層からなり、各層には、複数のセル（ニューロン）が２次元配列されている。２次元配列の横の各セル行は順次のフレームに対応付けられており、縦の各セル列は異なる音素の種類に対応付けられている。
【００３９】
入力層、中間層、出力層の各セルの活性度は、それぞれ〔数１２〕のように表される。
【００４０】
【数１２】

【００４１】
入力層の各セルには、〔数１３〕に示す類似度データが入力される。また各層のセルには、図中に矢線で例示されるような、神経回路方程式の競合項と協調項に基づく結合が設けられている。図４にその結合の様子を示す。
【００４２】
【数１３】

【００４３】
図４において、入力層上の〔数１４〕で示す音素のセルへは、出力層上の〔数１５〕で示す同じフレームの他の複数の音素のセルから、点線の矢線のように競合項に基づく興奮を抑制する結合が行なわれる。また、出力層上の近接する複数のフレームにある同一音素に属する〔数１６〕のセルからは、実線の矢線のように協調項に基づく興奮を強調させる結合が行なわれる。これらの競合と強調に基づく結合の結果は、図３において〔数１４〕で表わされる入力層にフィードバックされる。
【００４４】
【数１４】

【００４５】
【数１５】

【００４６】
【数１６】

【００４７】
次に、３層構造神経回路モデル（３ＬＮＮ）の神経回路方程式（以下、３ＬＮＮ方程式という）について説明する。
３ＬＮＮ方程式は、以下の〔数１７〕に示す式(7) で与えられる。
【００４８】
【数１７】

【００４９】
ここで、右辺第1項の下記〔数１８〕は時間依存の神経活性度を表わし、第２項のf(x)は下記〔数１９〕の式(8) で与えられるSigmoid 関数を表わしている。また〔数２０〕は〔数２１〕の式(9) で表わされ、g(u)は〔数２２〕に示す式(10)、(11)で与えられる。
【００５０】
【数１８】

【００５１】
【数１９】

【００５２】
【数２０】

【００５３】
【数２１】

【００５４】
【数２２】

【００５５】
なお、τ_１，τ_２，Ａ，Ｂ，Ｄ，ｗは、それぞれ適切に選択される正の定数である。また、ｈは適切なしきい値定数である。
【００５６】
図３に示されるように、入力層の〔数２３〕で示す神経活性度は、〔数２４〕の類似度入力とともに、〔数２５〕が零の場合の〔数２６〕で示す近傍の神経活性度の影響をも受ける。式(11)において、右辺の第２項は入力項、第３項は競合項、第４項は協調項である。この第２項は、第ｕフレームにおけるある音素/a/ に対する入力データの類似度を表わす。また第３項は、〔数２７〕で示す他の音素の活性度との競合を表わし、第４項は、同一音素についての近接フレームからの協調を表わす。
【００５７】
【数２３】

【００５８】
【数２４】

【００５９】
【数２５】

【００６０】
【数２６】

【００６１】
【数２７】

【００６２】
式(11)の右辺第３項中の〔数２８〕に示す加算指標は、a′≠a の制限のもとで、
a−a_s ＜＝ a′＜＝a＋a_s
として定義される不同検索競合範囲を網羅する。また式(11)の右辺第４項中の〔数２９〕に示す加算指標は、u′≠u の制限のもとで、
ｕ−ｌ＜＝ u′＜＝ｕ＋ｌ
として定義される協調範囲を網羅する。
【００６３】
【数２８】

【００６４】
【数２９】

【００６５】
方程式の本質的特徴を理解するために〔数３０〕の平衡解を考慮すると、式(10)、(11)は次の〔数３１〕のように書き替えられる。
【００６６】
【数３０】

【００６７】
【数３１】

【００６８】
図５は、曲線ｙ＝ξ および曲線ｙ＝ｆ( g(α) ＋ g (ξ))のグラフであり、（ａ）〜（ｄ）は、〔数３２〕を正の大きい値４から正の小さい値1.3 まで変化させたときのものを順に示す。
【００６９】
【数３２】

【００７０】
式の解は、図５における二つの曲線の交点で与えられる。図５において、上記〔数３２〕の値が（ａ）の４から（ｄ）の1.3 まで減少するならば、（ｃ）の値に達するまで解はほぼ〔数３３〕の値を維持する。これとは反対に、〔数３２〕の値が（ｄ）から（ａ）に増加するならば、（ｂ）の値に達するまで解はほぼ〔数３４〕の値に維持される。
【００７１】
【数３３】

【００７２】
【数３４】

【００７３】
この事実から、以下の二つの結論が得られる。
（１） αの値が大きい場合、ξは比較的大きい値（ほぼ１）をとり、αの値が小さい場合は、ξは小さい値（ほぼ０）をとる。
（２） αが増加するか減少するかにしたがって、解のξは異なるパスをとる。これは、ヒステリシス現象の存在を示唆している。
【００７４】
なお、ｗ＞１を仮定するならば、図５の（ｂ）と（ｃ）の間に、安定でない第３の解が存在する。
（３ＬＮＮ方程式による音韻認識処理）
【００７５】
【表１】

【００７６】
表１は、実際には/n/ と発音された入力音素について学習データの各音素との間で類似度計算を行なった結果の候補のベスト５の類似度マップを示す。ここでは、音素/n/,/m/,/o/,/g/,/w/ がベスト５として選択された。これらのデータが、３ＬＮＮ方程式に入力されて、フレームごとに唯一の音素のξのみが勝利を収める勝者決定の処理が行なわれる。表２は、その処理結果を例示したものである。
【００７７】
【表２】

【００７８】
表２の例の場合、フレーム１〜１１では音素/n/ のみが正の大きい値をとって他の音素ではほぼ０となることから/n/ が勝者となり、一方、フレーム１２〜１５では音素/m/ のみが正の大きい値をとって他はほぼ０であるから/m/ が勝者となっている。そこで、全フレームの平均あるいは頻度から、/n/ を音韻認識結果として出力する。
【００７９】
３ＬＮＮ方程式のこのような処理についての動的な理解を得るには、αの典型的な値に応じて変わるSigmoid 関数の形に注目するのがよい。また３ＬＮＮ方程式の安定解は、ほぼ１の大きい値か、ほぼ０の小さい値かのどちらかだけを与える式(12)によって決定されるから、すべてのξに対して、初期値として0.5 を設定した。
【００８０】
図６と図７は、それぞれ、表１の類似度マップが入力されるときの、音素/n/,/m/,/o/,/g/,/w/ に対する第５フレームにおけるαとξの時間変化特性を示す。最初に、入力された各音素の類似度データλが異なっている場合、音素間の差だけがαから３ＬＮＮ方程式中に導入される。音素/m/,/o/,/g/,/w/についてのξは、α＜０に対するSigmoid 関数形に基づいて、図８に示すように減少し始める。これに対して、もっとも大きいλを持つ音素/n/ に対するαは、競合項の値の増加につれて正の値をとりはじめる。αｎが正になると、活性度ξｎは図７に示すように、Sigmoid 関数形にしたがって増加に転じる。この段階になると、αｎの協調項がαｎの立ち上がりを助け、ξｎの増加を加速することが注目される。一方、他の各音素のαは、ξｎの増加に基づき競合項の値が増加するため減少し始める。
（認識実験例）
学習データから各音素についてのガウス確率分布関数（pdf）を作成するために、１０人の男性話者によって話された４０００語からなるＡＴＲデータと、６人の男性話者によって話された５００の文のＡＳＪデータとから、あらかじめラベル付けされている音素を抽出した。また認識実験のための入力音声データは、二つの種類で構成された。その一つは、２１６語のデータベースからのものであり、他の一つは、３人の異なる男性話者によって話された２４０語の一つからのものである。音声データは、次のようにして解析された。
【００８１】
本発明による神経回路モデルと従来のモデルとの性能を比較するため、同じデータベースを使用し、単一の混合と三つの状態をもつヒドンマルコフモデル（ＨＭＭ）により音韻認識実験が行なわれた。認識テストは、表３に示すように、１０次元のＭＦＣＣと、その速度成分の１０次元のデルタＭＦＣＣを使用して実行された。学習データの各音素のケプストラムデータは、フレームの中間位置の前半と後半に分けて別々にガウスpdfを作成された。入力音声データも前半と後半に分割され、学習データの前半と後半のガウスpdfの対応部分と別々に比較され、類似度マップが作成された。音素の種類は２４あるが、それらの類似度データの上位５つの候補が、３ＬＮＮ方程式に適用された。
【００８２】
【表３】

【００８３】
【表４】

【００８４】
【表５】

【００８５】
表４と表５に、音韻認識結果を示す。非特定話者認識のときの認識率は、216語データベースの場合、ＨＭＭでは71.56 ％であったのに対して３ＬＮＮでは78.05 ％が得られた。また240 語データベースの場合は、ＨＭＭでは72.37 ％であったのに対して３ＬＮＮでは78.94 ％が得られた。
実施例Ｂ：２層構造神経回路（２ＬＮＮ）方程式
図８は、２層構造の神経回路方程式による神経回路（Two Layered Neural Net：２ＬＮＮ) モデルを示したものである。図示の神経回路モデルは、入力層Ｖ1 と出力層Ｖ2 の２層からなり、各層には、複数のセル（ニューロン）が２次元配列されている。２次元配列の横の各セル行は順次のフレームに対応付けられており、縦の各セル列は異なる音素の種類に対応付けられている。入力層と出力層の各セルの活性度は、それぞれ〔数３５〕のように表される。
【００８６】
【数３５】

【００８７】
入力層の各セルには、〔数３６〕に示す類似度データが入力される。また各層のセルには、図中に矢線で例示されるような、２ＬＮＮ方程式の競合項と協調項に基づく結合が, ３ＬＮＮについて図５で述べたように設けられている。
【００８８】
【数３６】

【００８９】
次に、２層構造神経回路モデル（２ＬＮＮ）の神経回路方程式（以下、２ＬＮＮ方程式という）について説明する。
【００９０】
２ＬＮＮ方程式は、以下の〔数３７〕に示す式(13)および式(14)で与えられる。
【００９１】
【数３７】

【００９２】
ここで、式(13）中の下記〔数３８〕の項は時間依存の神経活性度を表わし、f(x)の項は式 (14)で与えられるSigmoid 関数を表わしている。また〔数３９〕は〔数３７〕中の式(15)で表わされ、式(15)中のg(u)は式(16)で与えられる。
【００９３】
なお、τ，Ａ，Ｂ，Ｄ，ｗは、それぞれ適切に選択される正の定数である。また、ｈは適切なしきい値定数である。
【００９４】
【数３８】

【００９５】
図８に示されるように、入力層の〔数３９〕で示す神経活性度は、〔数３６〕の類似度入力とともに前記〔数２６〕で示す近傍の神経活性度の影響をも受ける。〔数３７〕中の式(15)において、右辺の第２項は入力項、第３項は競合項、第４項は協調項である。この第２項は、第ｕフレームにおけるある音素/a/ に対する入力データの類似度を表わす。また第３項は、〔数４０〕で示す他の音素の活性度との競合を表わし、第４項は、同一音素についての近接フレームからの協調を表わす。
【００９６】
【数３９】

【００９７】
【数４０】

【００９８】
式 (15) の第３項中の〔数４１〕に示す加算指標は、a′≠a の制限のもとで、
a−a_s ＜＝ a′＜＝a＋a_s
として定義される不同検索競合範囲を網羅する。また式(15)の第４項中の〔数４２〕に示す加算指標は、u′≠u の制限のもとで、
ｕ−ｌ＜＝ u′＜＝ｕ＋ｌ
として定義される協調範囲を網羅する。
【００９９】
【数４１】

【０１００】
【数４２】

【０１０１】
方程式の本質的特徴を理解するために〔数４３〕の平衡解を考慮すると、式(15)、(16)は次の〔数４４〕の式(17)および〔数４５〕の式(18)のように書き替えられる。
【０１０２】
【数４３】

【０１０３】
【数４４】

【０１０４】
【数４５】

【０１０５】
〔数４４〕の式(17)では、〔数３７〕の式(14)で与えられるSigmoid 関数により、〔数３９〕で示す神経活性度に対する競合と協調の効果によって、直接的に〔数３８〕に対して勝者と敗者の決定が下される仕組みとなっている。すなわち、〔数３９〕で示す神経活性度の大きな値に対しては１に近い〔数３８〕の出力を与え、〔数３９〕で示す神経活性度の小さな値に対しては０に近い小さな〔数３８〕の出力を与える。
【０１０６】
この２ＬＮＮ方程式によっても、３ＬＮＮ方程式の場合と同様に、ＨＭＭに比べて高い音韻認識率を得ることができる。
【０１０７】
【発明の効果】
本発明は、連続して入力される音声からその要素となっている各音素の種類を識別して音声を認識する手段として、脳の立体視にかかわる神経回路モデルを適用することにより、認識率の向上を図っている。従来は、ヒドンマルコフモデル（ＨＭＭ）を基本として、それを補強するようなモデルが一般に用いられてきたが、ヒドンマルコフモデルでは、技術的限界が見えていた。本発明による２層および３層神経回路モデル（２ＬＮおよび３ＬＮ）は、ヒドンマルコフモデルとは根本的に異なる考え方に基づいており、音韻認識率を著しく改善することができた。この本発明のモデルを連続的な単語認識や文認識に適用することにより、音声認識の飛躍的向上が期待される。
【図面の簡単な説明】
【図１】本発明の基本構成図である。
【図２】本発明の１実施例による音声認識処理のフローである。
【図３】３層構造神経回路モデル（３ＬＮＮ）の概念図である。
【図４】異なる音素の類似度間の競合と近接フレームデータ間の協調の説明図である。
【図５】ｙ＝ξおよびｙ＝ｆ（ｇ（α）＋ｇ（ξ））のグラフである。
【図６】 αの時間変化特性を示すグラフである。
【図７】 ξの時間変化特性を示すグラフである。
【図８】２層構造神経回路モデル（２ＬＮＮ）の概念図である。
【図９】従来の連続音声認識装置のブロック構成図である。
【符号の説明】
１１：連続音声入力部
１２：音声信号処理部
１３：学習部
１３ａ：学習データ
１４：類似度計算部
１５：音韻認識部
１５ａ：２層または３層神経回路モデル

BACKGROUND OF THE INVENTION
The present invention relates to an improved speech recognition method and apparatus that recognize speech by identifying the type of each constituent phoneme in continuously input speech.
[0001]
In speech recognition, improving the recognition rate is the most important problem. In the present invention, the phonemes of the input speech are first compared with reference phonemes stored in advance to obtain a similarity for each, and a neural circuit model related to stereoscopic vision in the brain is then applied to the phoneme recognition process that selects a single phoneme, with the aim of greatly improving the recognition rate.
[0002]
[Prior art]
Recently, research on large-vocabulary continuous speech recognition has become active, and signs of practical use are beginning to appear. This research faces two main technical problems. One is the acoustic model problem, which concerns improving the recognition rate of individual phonemes. The other is the language model problem, which concerns raising the recognition rate of words and sentences by using linguistic or grammatical knowledge about how phonemes connect. For the former, models based on the hidden Markov model (HMM), together with extensions that reinforce it, have generally been used. At present, effort is concentrated on the latter, improving the language model. For overall sentence recognition, however, a 10-20% improvement from the language model is known to contribute only as much as a 1-2% improvement from the acoustic model, so large gains cannot be expected from it. The acoustic model, meanwhile, also appears to be reaching its technical limits, and further progress under the HMM is difficult to foresee.
[0003]
FIG. 9 shows the block configuration of a conventional HMM-based speech recognition apparatus. In the figure, 1 is a continuous speech input unit, 2 is a speech signal processing unit, 3 is a phoneme recognition unit, 4 is a word identification unit, 5 is a sentence identification unit, and 6 is a sentence recognition unit. The continuous speech input unit 1 receives continuous speech and inputs it as a digital speech signal. The speech signal processing unit 2 divides the input speech signal into fixed time intervals (frames), performs signal processing on each frame, and extracts acoustic parameters (usually mel-frequency cepstral coefficients, MFCC). The phoneme recognition unit 3 performs phoneme recognition using an already trained HMM. In doing so, it examines the recognition result in each frame, refers to the average length of each phoneme and to dictionary knowledge, and extracts phoneme boundaries to determine the desirable boundary between phonemes. The word identification unit 4 performs word identification on the phoneme sequence determined by this series of operations, and the sentence identification unit 5 then identifies sentences by building them up from the identified words. The sentence identification unit 5 proposes candidate sentences, which are examined using grammatical and semantic knowledge. If no grammatical or semantic problem is found at this stage, sentence recognition is completed in the sentence recognition unit 6; otherwise, processing is fed back to the word identification unit 4 and the phoneme recognition unit 3, and the second candidate is examined.
[0004]
[Problems to be solved by the invention]
An object of the present invention is to provide a speech recognition method and apparatus having a phoneme recognition means capable of further improving the recognition rate as compared with a conventional Hidden Markov Model (HMM).
[0005]
[Means for Solving the Problems]
The present invention uses a neural circuit model based on a completely different principle from the conventional Hidden Markov model as an acoustic model for speech recognition.
[0006]
The present inventors previously studied the function of stereoscopic vision (stereo vision) in the human brain. In stereoscopic vision, the two two-dimensional images obtained by projecting a three-dimensional object onto the retinas of the left and right eyes are compared by neural circuits in the brain, which process their similarity almost instantaneously to produce a three-dimensional perception of the object. The inventors therefore considered that this stereo vision neural circuit function would also be effective for fast processing of the similarity between speech input data and reference learning data. They accordingly constructed and developed a new neural circuit equation for an acoustic model from the stereo vision neural circuit, and were able to confirm the effect of the resulting neural circuit model in practice.
[0007]
In the similarity processing for stereoscopic vision in the original stereo vision neural circuit, the equation contains both a competition term, which acts so that the activity (degree of excitation) of the neuron (cell) corresponding to one pixel of the image is suppressed by the activity of neurons corresponding to other pixels, and a cooperation term, which acts to enhance the activity. In the present invention, the left and right image data of stereoscopic vision in the stereo vision neural circuit equation are replaced by the similarity data obtained by comparing the phoneme of the input data to be recognized with the learning data of each reference phoneme, and the equation is made to perform the phoneme recognition process that selects a single phoneme. A neural circuit model is used for the phoneme recognition process, and competition and cooperation act during feedback from the output layer to the input layer. The neural circuit equations used for speech recognition in the present invention are described in detail later.
[0008]
The present invention is configured as follows.
(1) A speech recognition method in which a similarity to each phoneme is calculated for each frame between input data consisting of acoustic parameters extracted frame by frame from continuously input speech and reference learning data consisting of acoustic parameters of a plurality of phonemes learned in advance, and phoneme recognition processing is performed on the basis of the resulting frame-by-frame similarity data of the input data,
wherein the phoneme recognition processing is performed using a neural circuit model that follows a neural circuit equation including a competition term and a cooperation term, and in the neural circuit model the activity of the cell for each phoneme in the input layer is suppressed according to the activity of the cells in the output layer for other, competing phonemes in the same frame, and is enhanced according to the activity in the output layer of the same, cooperating phoneme in neighboring frames.
(2) The speech recognition method according to (1), wherein the learning data of each phoneme is held in a form standardized as a Gaussian probability distribution function.
(3) A speech recognition apparatus comprising similarity calculation means that calculates, for each frame, a similarity to each phoneme between input data consisting of acoustic parameters extracted frame by frame from continuously input speech and reference learning data consisting of acoustic parameters of a plurality of phonemes learned in advance, and phoneme recognition means that performs phoneme recognition on the basis of the resulting frame-by-frame similarity data of the input data,
wherein the phoneme recognition means has a neural circuit model that follows a neural circuit equation including a competition term and a cooperation term, and the neural circuit model is configured so that the activity of the cell for each phoneme in the input layer is suppressed according to the activity of the cells in the output layer for other, competing phonemes in the same frame, and is enhanced according to the activity in the output layer of the same, cooperating phoneme in neighboring frames.
(4) The speech recognition apparatus according to (3), wherein the learning data of each phoneme is held in a form standardized as a Gaussian probability distribution function.
[0009]
The basic configuration of the present invention will be described with reference to FIG.
[0010]
In FIG. 1, a continuous speech signal to be recognized is input to a continuous speech input unit 11, and then divided into frames having a certain time width by a speech signal processing unit 12, and acoustic parameters are extracted for each frame.
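The framing step can be sketched as follows. This is only a minimal illustration: the helper name `split_into_frames` is hypothetical, frames are assumed not to overlap, and a trailing partial frame is assumed to be dropped. The description does not specify the frame length or overlap, and the extraction of acoustic parameters (for example MFCC) from each frame is omitted here.

```python
def split_into_frames(samples, frame_len):
    """Split a sample sequence into consecutive fixed-length frames.

    Assumption: frames do not overlap and a trailing partial frame is dropped.
    """
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


# Ten toy samples with a frame length of four give two full frames.
frames = split_into_frames(list(range(10)), frame_len=4)
```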
[0011]
Meanwhile, the learning unit 13 holds learning data 13a, which is created by inputting speech data that serves as the recognition reference in advance, training on it, and extracting acoustic parameters frame by frame for each phoneme.
[0012]
The similarity calculation unit 14 compares the input data output from the speech signal processing unit 12 with each phoneme of the learning data 13a frame by frame, calculates the similarity to each phoneme for each frame to create similarity data, and outputs the data to the phoneme recognition unit 15.
[0013]
The phoneme recognition unit 15 includes a two-layer or three-layer neural circuit model 15a containing the competition term and cooperation term described above, and runs the model on the similarity data for each phoneme output from the similarity calculation unit 14. As a result, the neural circuit converges so that only one phoneme wins, producing the phoneme recognition output. The neural circuit model 15a is implemented as a computer program.
[0014]
Although not shown, the phoneme recognition outputs produced sequentially by the phoneme recognition unit 15 are further subjected to word identification and sentence identification and are output as the continuous speech recognition result.
[0015]
The learning unit 13 stores learning data from a large number of utterances, classified by phoneme. The learning data is stored for each phoneme in a standard form such as a Gaussian probability distribution function (pdf).
[0016]
In the similarity calculation unit 14, the phoneme of the input data is compared against the Gaussian pdf of each phoneme in the stored learning data to obtain the similarity.
Description of similarity
Let u be the frame number and a the phoneme name. The similarity between the input data in a frame u and a phoneme /a/ is expressed by the following [Equation 1], and the activity of the corresponding neuron (cell) in the neural circuit, that is, its level of excitation, is expressed by the following [Equation 2].
[0017]
[Expression 1]

[0018]
[Expression 2]

[0019]
After receiving the similarity data of [Equation 1] as input, the neural circuit equation processes the activity of [Equation 2] so that the neural circuit moves toward a stable point, and recognition is complete when the stable point is reached.
[0020]
The stored learning data is expressed by equation (1), the Gaussian pdf shown in the following [Equation 3].
[0021]
[Equation 3]

[0022]
Here, ο is the input, and μ_a is the mean of the learning data, expressed as cepstra, for a phoneme /a/. Σ_a is given by equation (2), shown in the following [Equation 4].
[0023]
[Expression 4]

[0024]
Here, ο_n is the learning data of a phoneme /a/. The normalized similarity of the input data ο_u in the u-th frame to a phoneme /a/, written as the following [Equation 5], is defined by equation (3), shown in the following [Equation 6].
[0025]
[Equation 5]

[0026]
[Equation 6]

[0027]
Here, N ′ represents the log likelihood of N , and < N ′ > represents the average value of each phoneme.
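The similarity computation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reproduction of equations (1) to (3): each phoneme model is taken to be a Gaussian with a diagonal covariance (the Σ_a of the description may be a full matrix), all function names are hypothetical, and the normalization, which subtracts the mean log-likelihood over the phonemes and divides by its absolute value, is only one possible reading of equation (3).

```python
import math


def fit_gaussian(vectors):
    """Per-dimension mean and variance of one phoneme's training vectors."""
    n, dim = len(vectors), len(vectors[0])
    mean = [sum(v[d] for v in vectors) / n for d in range(dim)]
    var = [sum((v[d] - mean[d]) ** 2 for v in vectors) / n for d in range(dim)]
    return mean, var


def log_likelihood(o, mean, var):
    """Log of a diagonal-covariance Gaussian pdf evaluated at input vector o."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * s) + (x - m) ** 2 / s)
        for x, m, s in zip(o, mean, var)
    )


def normalized_similarity(o, models):
    """Map each phoneme to (N' - <N'>) / |<N'>| for one frame (assumed form)."""
    ll = {a: log_likelihood(o, m, s) for a, (m, s) in models.items()}
    avg = sum(ll.values()) / len(ll)
    return {a: (v - avg) / abs(avg) for a, v in ll.items()}


# Toy two-dimensional training data for two phonemes (illustrative values only).
models = {
    "n": fit_gaussian([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]),
    "m": fit_gaussian([[4.0, 4.0], [5.0, 5.0], [4.0, 5.0], [5.0, 4.0]]),
}
lam = normalized_similarity([0.5, 0.5], models)
```

In this toy case the input vector lies at the mean of the /n/ model, so /n/ receives the larger similarity. A practical version would also need to guard against zero variances and a zero mean log-likelihood.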
[0028]
DETAILED DESCRIPTION OF THE INVENTION
Specific embodiments of the present invention will be described below.
[0029]
FIG. 2 is a flowchart of speech recognition processing according to an embodiment of the present invention. In the figure, 21 to 23 show the phoneme data learning process, and 24 to 28 show the phoneme recognition process. Note that the word identification and sentence identification process associated with continuous speech recognition is omitted.
[0030]
In 21, speech data in which each phoneme segment has already been labeled by experts is used as the phoneme learning data to be input. Phonemes are cut out of this speech data according to the labels to create a very large set of phoneme data.
[0031]
In 22, acoustic parameters are extracted for each frame from each phoneme data.
[0032]
In 23, a standard model of each phoneme is calculated and stored using a Gaussian probability distribution function (pdf).
[0033]
In 24, speech data to be recognized is input. Here, phoneme data that has already been extracted for each phoneme is used as input data. From the phoneme data, acoustic parameters are extracted for each frame.
[0034]
In 25, the input phoneme data is compared frame by frame with the phoneme data of each stored standard model, and the respective similarities are calculated.
[0035]
In 26, the similarity data of each phoneme is input to the neural circuit equation frame by frame, and the equation is computed. At that time, in 27, suitable initial values are set for the activity variables of the neural circuit. If the computation time needs to be shortened, the number of similarity values that are input may be limited, for example to the top five phonemes.
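Limiting the input to the best candidates can be sketched as follows. The helper name `top_k_phonemes` and the similarity values are hypothetical, and selecting the candidates separately for each frame is an assumption; the description only states that the input may be limited to the top five phonemes.

```python
def top_k_phonemes(similarities, k=5):
    """Return the k phonemes with the highest similarity for one frame."""
    ranked = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:k])


# Illustrative similarity values for one frame.
frame_similarity = {
    "n": 0.9, "m": 0.8, "o": 0.5, "g": 0.4, "w": 0.3, "a": 0.1, "k": -0.2,
}
kept = top_k_phonemes(frame_similarity)
```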
[0036]
In 28, the phoneme recognition result is output once computation of the neural circuit equation yields a steady-state solution. For each frame, phonemes whose neural activity in the output layer (final layer) takes a fixed positive value in the steady-state solution become recognition candidates, and phonemes with values close to 0 are discarded. The phoneme recognized most frequently across the frames becomes the first candidate for that phoneme segment and is output as the recognition result.
[0037]
Next, examples of a neural circuit equation including a competition term and a cooperation term will be described.
[0038]
In the present invention, the neural circuit equation containing a competition term and a cooperation term, which defines the neural circuit model that performs phoneme recognition, can take various forms. Two examples are described below: a three-layer neural circuit equation and a two-layer neural circuit equation.
Example A: Three-layer neural circuit (3LNN) equation
FIG. 3 shows a neural circuit model (Three Layered Neural Net: 3LNN) based on the three-layer neural circuit equation. The illustrated neural circuit model consists of three layers, an input layer, an intermediate layer, and an output layer, and in each layer a plurality of cells (neurons) are arranged in a two-dimensional array. Each horizontal row of cells in the array is associated with successive frames, and each vertical column of cells is associated with a different phoneme type.
[0039]
The activity of each cell in the input layer, the intermediate layer, and the output layer is expressed as [Equation 12].
[0040]
[Expression 12]

[0041]
The similarity data shown in [Equation 13] is input to each cell in the input layer. The cells of each layer are connected according to the competition term and the cooperation term of the neural circuit equation, as illustrated by the arrows in the figure. FIG. 4 shows these connections.
[0042]
[Formula 13]

[0043]
In FIG. 4, the cell of the phoneme indicated by [Equation 14] on the input layer receives connections, shown by the dotted arrows, that suppress excitation on the basis of the competition term from the cells of the other phonemes in the same frame indicated by [Equation 15] on the output layer. It also receives connections, shown by the solid arrows, that enhance excitation on the basis of the cooperation term from the cells of [Equation 16], which belong to the same phoneme in several neighboring frames on the output layer. The result of these competitive and cooperative connections is fed back to the input layer represented by [Equation 14] in FIG. 3.
[0044]
[Expression 14]

[0045]
[Expression 15]

[0046]
[Expression 16]

[0047]
Next, a neural circuit equation (hereinafter referred to as 3LNN equation) of the three-layer structure neural circuit model (3LNN) will be described.
The 3LNN equation is given by equation (7) shown in the following [Equation 17].
[0048]
[Expression 17]

[0049]
Here, [Equation 18] below, in the first term on the right-hand side, represents the time-dependent neural activity, and f(x) in the second term represents the sigmoid function given by equation (8) in [Equation 19] below. [Equation 20] is given by equation (9) in [Equation 21], and g(u) is given by equations (10) and (11) shown in [Equation 22].
[0050]
[Expression 18]

[0051]
[Equation 19]

[0052]
[Expression 20]

[0053]
[Expression 21]

[0054]
[Expression 22]

[0055]
Note that τ₁, τ₂, A, B, D, and w are suitably chosen positive constants, and h is a suitable threshold constant.
[0056]
As shown in FIG. 3, the neural activity of the input layer indicated by [Equation 23] is affected not only by the similarity input of [Equation 24] but also by the neighboring neural activities indicated by [Equation 26] when [Equation 25] is zero. In equation (11), the second term on the right-hand side is the input term, the third term is the competition term, and the fourth term is the cooperation term. The second term represents the similarity of the input data to a phoneme /a/ in the u-th frame. The third term represents competition with the activities of the other phonemes indicated by [Equation 27], and the fourth term represents cooperation from neighboring frames for the same phoneme.
[0057]
[Expression 23]

[0058]
[Expression 24]

[0059]
[Expression 25]

[0060]
[Equation 26]

[0061]
[Expression 27]

[0062]
The summation index shown in [Equation 28] in the third term on the right-hand side of equation (11) covers, under the restriction a′ ≠ a, the competition search range defined by
a − a_s <= a′ <= a + a_s
Likewise, the summation index shown in [Equation 29] in the fourth term on the right-hand side of equation (11) covers, under the restriction u′ ≠ u, the cooperation range defined by
u − l <= u′ <= u + l
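How the input, competition, and cooperation contributions combine over these two ranges can be sketched as follows. This is a hedged illustration only: the equation images are not reproduced in this text, so the exact form of equation (11) is not known here. The sketch assumes that g is a simple rectifier, that the competition sum is weighted by A and subtracted, and that the cooperation sum is weighted by D and added, which matches the described suppressing and enhancing roles. The function names, the integer phoneme indexing, and all constants are hypothetical.

```python
def g(x):
    """Assumed rectifier: passes positive activity, blocks negative activity."""
    return x if x > 0.0 else 0.0


def alpha(lam, xi, u, a, A=1.0, D=1.0, a_s=2, l=1):
    """Input, competition and cooperation contributions for cell (u, a).

    lam[u][a] is the similarity input and xi[u][a] the output-layer activity.
    The signs follow the text: competition suppresses, cooperation enhances.
    """
    n_frames, n_phonemes = len(xi), len(xi[0])
    # Other phonemes a' != a in the same frame, within a - a_s .. a + a_s.
    competition = sum(
        g(xi[u][a2])
        for a2 in range(max(0, a - a_s), min(n_phonemes, a + a_s + 1))
        if a2 != a
    )
    # Same phoneme in neighboring frames u' != u, within u - l .. u + l.
    cooperation = sum(
        g(xi[u2][a])
        for u2 in range(max(0, u - l), min(n_frames, u + l + 1))
        if u2 != u
    )
    return lam[u][a] - A * competition + D * cooperation


# Three frames by three phonemes, all activities at the initial value 0.5.
lam = [[0.6, 0.2, 0.1], [0.7, 0.3, 0.0], [0.5, 0.4, 0.1]]
xi = [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5], [0.5, 0.5, 0.5]]
value = alpha(lam, xi, u=1, a=0)
```

With equal weights and uniform activities the two sums cancel, so the cell initially sees only its own similarity input; raising A makes the competition dominate.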
[0063]
[Expression 28]

[0064]
[Expression 29]

[0065]
Considering the equilibrium solution of [Equation 30] to understand the essential characteristics of the equation, Equations (10) and (11) can be rewritten as the following [Equation 31].
[0066]
[Equation 30]

[0067]
[Equation 31]

[0068]
FIG. 5 shows graphs of the curve y = ξ and the curve y = f(g(α) + g(ξ)); panels (a) to (d) show, in order, the cases in which [Equation 32] is varied from the large positive value 4 down to the small positive value 1.3.
[0069]
[Expression 32]

[0070]
The solution of the equation is given by the intersection of the two curves in FIG. 5. If the value of [Equation 32] decreases from 4 in (a) to 1.3 in (d), the solution stays at approximately the value of [Equation 33] until the value in (c) is reached. Conversely, if the value of [Equation 32] increases from (d) to (a), the solution stays at approximately the value of [Equation 34] until the value in (b) is reached.
[0071]
[Expression 33]

[0072]
[Expression 34]

[0073]
From this fact, the following two conclusions can be drawn.
(1) When the value of α is large, ξ takes a relatively large value (approximately 1), and when the value of α is small, ξ takes a small value (approximately 0).
(2) The solution ξ takes different paths according to whether α increases or decreases. This suggests the existence of a hysteresis phenomenon.
[0074]
Assuming w > 1, an unstable third solution exists between (b) and (c) of FIG. 5.
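The hysteresis described above can be illustrated numerically with a small sketch. It assumes a logistic form for the sigmoid f with gain w and threshold h, a rectifier for g, and hypothetical constants w = 10 and h = 1; none of these are taken from the equations of the description, so the values of α at which the solution jumps differ from those in FIG. 5. The sketch solves ξ = f(g(α) + g(ξ)) by simple iteration while α is stepped up and then down.

```python
import math


def f(x, w=10.0, h=1.0):
    """Assumed logistic sigmoid with gain w and threshold h."""
    return 1.0 / (1.0 + math.exp(-w * (x - h)))


def g(x):
    """Assumed rectifier."""
    return x if x > 0.0 else 0.0


def settle(alpha, xi, steps=500):
    """Iterate xi <- f(g(alpha) + g(xi)) from the given starting value."""
    for _ in range(steps):
        xi = f(g(alpha) + g(xi))
    return xi


def sweep(alphas, xi0):
    """Track the solution while alpha is changed step by step."""
    xi, path = xi0, []
    for alpha in alphas:
        xi = settle(alpha, xi)
        path.append(xi)
    return path


alphas = [0.0, 0.25, 0.5, 0.75, 1.0]
rising = sweep(alphas, xi0=0.0)                # alpha increased from small to large
falling = sweep(alphas[::-1], xi0=1.0)[::-1]   # alpha decreased, re-ordered to match
```

For the middle value of α the rising sweep remains on the low branch while the falling sweep remains on the high branch, which is the path dependence noted in conclusion (2).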
(Phonological recognition processing by 3LNN equation)
[0075]
[Table 1]

[0076]
Table 1 shows the similarity map of the top five candidates obtained by calculating the similarity between an input phoneme actually pronounced as /n/ and each phoneme of the learning data. Here, the phonemes /n/, /m/, /o/, /g/, and /w/ were selected as the top five. These data are input to the 3LNN equation, and a winner determination process is performed in which the activity ξ of only a single phoneme wins in each frame. Table 2 illustrates the processing results.
[0077]
[Table 2]

[0078]
In the example of Table 2, only the phoneme /n/ takes a large positive value in frames 1 to 11 while the other phonemes are almost 0, so /n/ is the winner. In frames 12 to 15, only the phoneme /m/ takes a large positive value and the others are almost 0, so /m/ is the winner. Therefore, /n/ is output as the phoneme recognition result, based on the average over all frames or on the frequency of wins.
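The frame-wise decision and the final choice by frequency can be sketched as follows. The activity values are hypothetical numbers shaped like the description of Table 2 (the table itself is not reproduced in this text), and the function names are illustrative only.

```python
from collections import Counter

PHONEMES = ["n", "m", "o", "g", "w"]


def frame_winners(xi_frames, phonemes=PHONEMES):
    """Return, for each frame, the phoneme with the largest activity xi."""
    return [phonemes[max(range(len(row)), key=row.__getitem__)] for row in xi_frames]


def recognize(xi_frames, phonemes=PHONEMES):
    """Output the phoneme that wins the largest number of frames."""
    winners = frame_winners(xi_frames, phonemes)
    return Counter(winners).most_common(1)[0][0]


# Hypothetical activities: /n/ wins frames 1 to 11, /m/ wins frames 12 to 15.
xi = [[0.98, 0.01, 0.0, 0.0, 0.0]] * 11 + [[0.02, 0.97, 0.0, 0.0, 0.0]] * 4
print(recognize(xi))  # n
```

Averaging ξ over all frames and taking the largest mean would give the same result for this example.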
[0079]
To gain a dynamic understanding of how the 3LNN equation performs this processing, it is helpful to focus on the shape of the Sigmoid function, which varies with the typical value of α. In addition, since the stable solution of the 3LNN equation is determined by Equation (12), which gives either a large value of approximately 1 or a small value of approximately 0, 0.5 was set as the initial value for all ξ.
[0080]
FIGS. 6 and 7 respectively show the time variation characteristics of α and ξ in the fifth frame for the phonemes /n/, /m/, /o/, /g/, and /w/ when the similarity map of Table 1 is input. First, if the input similarity data λ differ between phonemes, only the difference between phonemes is introduced into the 3LNN equation through α. The activity ξ for the phonemes /m/, /o/, /g/, and /w/ starts to decrease, as shown in FIG. 7, following the form of the Sigmoid function for α < 0. On the other hand, α for the phoneme /n/, which has the largest λ, begins to take a positive value as the value of the competition term increases. When αn becomes positive, the activity ξn changes to increase, as shown in FIG. 7, following the form of the Sigmoid function. Note that at this stage the cooperation term in αn helps the rise of αn and accelerates the increase in ξn. Meanwhile, α of each of the other phonemes starts to decrease, because the magnitude of the competition term grows as ξn increases.
(Example of recognition experiment)
To create a Gaussian probability distribution function (pdf) for each phoneme from the training data, pre-labeled phonemes were extracted from ATR data of 4000 items spoken by 10 male speakers and from ASJ data of 500 sentences spoken by 6 male speakers. The input speech data for the recognition experiment were of two types: one from a database of 216 words and the other from a database of 240 words, spoken by three different male speakers. The speech data were analyzed as follows.
[0081]
In order to compare the performance of the neural circuit model according to the present invention with that of the conventional model, a phoneme recognition experiment was also performed on the same database using a Hidden Markov model (HMM) having a single mixture and three states. The recognition test was performed using a 10-dimensional MFCC and a 10-dimensional delta MFCC, which is its velocity component, as shown in Table 3. The cepstrum data of each phoneme in the learning data were divided at the middle frame position into a first half and a second half, and a Gaussian pdf was created separately for each half. The input speech data were likewise divided into a first half and a second half, each half was compared with the Gaussian pdf of the corresponding half of the learning data, and a similarity map was created. Although there are 24 types of phonemes, only the top five candidates by similarity were applied to the 3LNN equation.
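The half-by-half comparison and the selection of five candidates can be sketched as below. Several points are assumptions: the log density of a diagonal-covariance Gaussian is used as a stand-in for the similarity measure, whose actual definition is not reproduced in this text; candidates are ranked by the score summed over the frames of the segment; and the two-dimensional toy features and the sixth phoneme "k" are hypothetical.

```python
import math


def log_gauss(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def split_halves(frames):
    """Split a phoneme segment at its middle frame position."""
    mid = len(frames) // 2
    return frames[:mid], frames[mid:]


def similarity_map(frames, models):
    """Score each frame against the pdf of the matching half.

    models[phoneme] = ((mean1, var1), (mean2, var2)) for the first and second half.
    """
    rows = []
    for half_index, half in enumerate(split_halves(frames)):
        for x in half:
            rows.append({ph: log_gauss(x, *pdfs[half_index])
                         for ph, pdfs in models.items()})
    return rows


def top_candidates(rows, k=5):
    """Keep the k phonemes with the highest score summed over all frames."""
    totals = {}
    for row in rows:
        for ph, score in row.items():
            totals[ph] = totals.get(ph, 0.0) + score
    return sorted(totals, key=totals.get, reverse=True)[:k]


# Toy models: phoneme i has mean (i, i) in the first half and (i + 0.5, i) in the second.
models = {ph: (([i, i], [1.0, 1.0]), ([i + 0.5, i], [1.0, 1.0]))
          for i, ph in enumerate(["n", "m", "o", "g", "w", "k"])}
frames = [[0.1, 0.0], [0.0, 0.1], [0.4, 0.1], [0.6, -0.1]]
print(top_candidates(similarity_map(frames, models)))
```

In the experiment described above the feature vectors would be the 20-dimensional MFCC and delta MFCC vectors and the candidate set would cover all 24 phoneme types.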
[0082]
[Table 3]

[0083]
[Table 4]

[0084]
[Table 5]

[0085]
Tables 4 and 5 show the phoneme recognition results. For speaker-independent recognition on the 216-word database, the recognition rate was 71.56% with the HMM, whereas 78.05% was obtained with the 3LNN. On the 240-word database, 72.37% was obtained with the HMM, whereas 78.94% was obtained with the 3LNN.
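The size of the improvement can be worked out directly from the four rates quoted above. The short sketch below computes the absolute gain in percentage points and, as an additional derived figure that does not appear in the original tables, the relative reduction of the error rate.

```python
results = {  # recognition rates in percent, as quoted in the text
    "216-word database": {"HMM": 71.56, "3LNN": 78.05},
    "240-word database": {"HMM": 72.37, "3LNN": 78.94},
}


def gains(hmm, nn):
    """Absolute gain in points and relative reduction of the error rate in percent."""
    absolute = nn - hmm
    relative = 100.0 * absolute / (100.0 - hmm)
    return round(absolute, 2), round(relative, 1)


for name, r in results.items():
    print(name, gains(r["HMM"], r["3LNN"]))
```

The gains are 6.49 points and 6.57 points, which correspond to roughly 23% and 24% fewer phoneme errors than with the HMM.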
Example B: Two-layer neural circuit model (2LNN)
FIG. 8 illustrates a neural circuit model based on the neural circuit equation of a two-layer structure (Two Layered Neural Net: 2LNN). The illustrated neural circuit model is composed of two layers, an input layer V1 and an output layer V2, and a plurality of cells (neurons) are arranged two-dimensionally in each layer. Each horizontal cell row of the two-dimensional array is associated with one of the sequential frames, and each vertical cell column is associated with a different phoneme type. The activity of each cell in the input layer and the output layer is expressed as [Equation 35].
[0086]
[Expression 35]

[0087]
The similarity data shown in [Equation 36] are input to each cell in the input layer. Further, the cells of each layer are coupled according to the competition term and the cooperation term of the 2LNN equation, as exemplified by the arrows in the figure, in the same manner as described with reference to FIG. 4 for the 3LNN.
[0088]
[Expression 36]

[0089]
Next, a neural circuit equation (hereinafter referred to as 2LNN equation) of the two-layer structure neural circuit model (2LNN) will be described.
[0090]
The 2LNN equation is given by Equation (13) and Equation (14) shown in the following [Equation 37].
[0091]
[Expression 37]

[0092]
Here, the term shown in the following [Equation 38] in Equation (13) represents the time-dependent neural activity, and f(x) represents the Sigmoid function given by Equation (14). Further, [Equation 39] is expressed by Equation (15) in [Equation 37], and g(u) in Equation (15) is given by Equation (16).
[0093]
Note that τ, A, B, D, and w are appropriately selected positive constants, and H is an appropriate threshold constant.
[0094]
[Formula 38]

[0095]
As shown in FIG. 8, the neural activity of the input layer indicated by [Equation 39] is affected not only by the similarity input of [Equation 36] but also by the neighboring neural activity indicated by [Equation 26]. In Equation (15) in [Equation 37], the second term on the right side is the input term, the third term is the competition term, and the fourth term is the cooperation term. The second term represents the similarity of the input data to a phoneme /a/ in the u-th frame. The third term represents competition with the activity of the other phonemes represented by [Equation 40], and the fourth term represents cooperation from adjacent frames for the same phoneme.
[0096]
[Expression 39]

[0097]
[Formula 40]

[0098]
The summation index shown in [Expression 41] in the third term of Equation (15) covers, under the restriction a′ ≠ a, the competition search range defined as
a − a_s <= a′ <= a + a_s
Likewise, the summation index shown in [Expression 42] in the fourth term of Equation (15) covers, under the restriction u′ ≠ u, the cooperation range defined as
u − l <= u′ <= u + l
[0099]
[Expression 41]

[0100]
[Expression 42]

[0101]
To understand the essential characteristics of the equations, consider the equilibrium solution of [Equation 43]. Equations (15) and (16) are then expressed by Equation (17) and Equation (18) shown in the following [Equation 44].
[0102]
[Equation 43]

[0103]
[Expression 44]

[0104]
[Equation 45]

[0105]
In Equation (17) of [Equation 44], the Sigmoid function given by Equation (14) of [Equation 37] acts directly on the neural activity shown in [Equation 39], which carries the effects of competition and cooperation, and thereby determines the winners and losers. That is, an output of [Equation 38] close to 1 is given for a large value of the neural activity shown in [Equation 39], and an output of [Equation 38] close to 0 is given for a small value of that neural activity.
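The winner and loser behavior described here can be illustrated with a small numerical sketch. Equations (13) to (16) are not reproduced in this text, so the forms used below are assumptions consistent with the verbal description: first-order relaxation of the output activity ξ toward f(α), first-order relaxation of the input-layer activity α toward the sum of input, competition, and cooperation terms, a logistic Sigmoid f, and a rectifying g. All constants, the initial values, the Euler step, and the similarity values are hypothetical.

```python
import math

# Hypothetical constants; the text only says they are suitably chosen positive values.
TAU1, TAU2 = 1.0, 1.0
A, B, D = 4.0, 3.0, 0.5
W, H = 2.5, 1.0
A_S, L_COOP = 4, 1  # half-widths of the competition and cooperation ranges


def f(x):
    """Sigmoid output function (assumed logistic form)."""
    return 1.0 / (1.0 + math.exp(-2.0 * W * (x - H)))


def g(u):
    """Rectification used inside the competition and cooperation sums (assumed)."""
    return max(u, 0.0)


def run_2lnn(lam, dt=0.05, steps=600):
    """Euler-integrate the assumed 2LNN dynamics for similarity input lam[u][a]."""
    n_frames, n_ph = len(lam), len(lam[0])
    xi = [[0.5] * n_ph for _ in range(n_frames)]     # output-layer activity
    alpha = [[0.0] * n_ph for _ in range(n_frames)]  # input-layer activity
    for _ in range(steps):
        new_alpha = [row[:] for row in alpha]
        new_xi = [row[:] for row in xi]
        for u in range(n_frames):
            for a in range(n_ph):
                # Other phonemes in the same frame (a' != a) compete.
                compete = sum(g(xi[u][ap])
                              for ap in range(max(0, a - A_S), min(n_ph, a + A_S + 1))
                              if ap != a)
                # The same phoneme in adjacent frames (u' != u) cooperates.
                cooperate = sum(g(xi[up][a])
                                for up in range(max(0, u - L_COOP),
                                                min(n_frames, u + L_COOP + 1))
                                if up != u)
                drive = A * lam[u][a] - B * compete + D * cooperate
                new_alpha[u][a] += dt / TAU2 * (-alpha[u][a] + drive)
                new_xi[u][a] += dt / TAU1 * (-xi[u][a] + f(alpha[u][a]))
        alpha, xi = new_alpha, new_xi
    return xi


# Hypothetical similarity map: 3 frames x 5 phonemes (/n/, /m/, /o/, /g/, /w/).
lam = [[0.90, 0.50, 0.40, 0.30, 0.20],
       [0.85, 0.55, 0.35, 0.30, 0.25],
       [0.80, 0.60, 0.40, 0.20, 0.20]]
for row in run_2lnn(lam):
    print([round(x, 3) for x in row])
```

With these assumed settings the first phoneme, which has the largest similarity in every frame, ends with an output close to 1 in each frame while the other four end close to 0. Different constants can change how quickly, or whether, a single winner emerges.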
[0106]
Also with this 2LNN equation, as in the case of the 3LNN equation, it is possible to obtain a higher phoneme recognition rate than with the HMM.
[0107]
[Effects of the invention]
The present invention applies a neural circuit model related to stereoscopic vision in the brain to the means for recognizing speech by identifying the type of each constituent phoneme in continuously input speech, thereby improving the recognition rate. Conventionally, models that reinforce the Hidden Markov model (HMM) have generally been used, but the Hidden Markov model has reached a technical limit. The two-layer and three-layer neural circuit models (2LNN and 3LNN) according to the present invention are based on an idea fundamentally different from the Hidden Markov model and can significantly improve the phoneme recognition rate. By applying this model of the present invention to continuous word recognition and sentence recognition, a dramatic improvement in speech recognition is expected.
[Brief description of the drawings]
FIG. 1 is a basic configuration diagram of the present invention.
FIG. 2 is a flowchart of speech recognition processing according to one embodiment of the present invention.
FIG. 3 is a conceptual diagram of a three-layered neural circuit model (3LNN).
FIG. 4 is an explanatory view of competition between similarities of different phonemes and cooperation between adjacent frame data.
FIG. 5 is a graph of y = ξ and y = f (g (α) + g (ξ)).
FIG. 6 is a graph showing the time variation characteristics of α.
FIG. 7 is a graph showing a time change characteristic of ξ.
FIG. 8 is a conceptual diagram of a two-layer structure neural circuit model (2LNN).
FIG. 9 is a block diagram of a conventional continuous speech recognition apparatus.
[Explanation of symbols]
11: continuous speech input unit
12: speech signal processing unit
13: learning unit
13a: learning data
14: similarity calculation unit
15: phoneme recognition unit
15a: 2-layer or 3-layer neural circuit model

Claims

1. A speech recognition method in which a similarity to each phoneme is calculated for each frame between input data consisting of acoustic parameters extracted for each frame from continuously input speech and reference learning data consisting of acoustic parameters of a plurality of phonemes learned in advance, and phoneme recognition processing is performed based on the obtained similarity data for each frame of the input data, wherein
the phoneme recognition processing is performed using a neural circuit model according to a neural circuit equation including a competition term and a cooperation term, and in the neural circuit model, the activity of the cell corresponding to each phoneme in the input layer is processed so as to be suppressed according to the activity of the cells corresponding to other phonemes competing in the same frame in the output layer, and to be enhanced according to the activity of the cells of the same phoneme cooperating in adjacent frames in the output layer.

2. The speech recognition method according to claim 1, wherein learning data of each phoneme is standardized and held by a Gaussian probability distribution function.

3. A speech recognition apparatus comprising: similarity calculation means for calculating a similarity to each phoneme for each frame between input data consisting of acoustic parameters extracted for each frame from continuously input speech and reference learning data consisting of acoustic parameters of a plurality of phonemes learned in advance; and phoneme recognition means for performing phoneme recognition based on the obtained similarity data for each frame of the input data, wherein
the phoneme recognition means has a neural circuit model according to a neural circuit equation including a competition term and a cooperation term, and the neural circuit model is configured so that the activity of the cell corresponding to each phoneme in the input layer is suppressed according to the activity of the cells corresponding to other phonemes competing in the same frame in the output layer, and is enhanced according to the activity of the cells of the same phoneme cooperating in adjacent frames in the output layer.

4. The speech recognition apparatus according to claim 3, wherein learning data of each phoneme is standardized and held by a Gaussian probability distribution function.