JP3410756B2 - Voice recognition device

Info

Publication number: JP3410756B2
Application number: JP05810393A
Authority: JP
Inventors: 哲中村; 和彦宮田; 俊夫赤羽; 清治濱口
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 1993-03-18
Filing date: 1993-03-18
Publication date: 2003-05-26
Anticipated expiration: 2018-05-26
Also published as: JPH06274197A

Description

[Detailed Description of the Invention]

[0001] [Field of Industrial Application] The present invention relates to a speech recognition device capable of recognizing arbitrary words.

[0002] [Prior Art] To recognize an arbitrary vocabulary, conventional speech recognition devices use feature sequences of monosyllables or phonemes as units and perform recognition on combinations of these units.

[0003] [Problems to Be Solved by the Invention] In such conventional devices, however, the standard pattern used for matching is built over the entire time series corresponding to each unit. Time-normalized matching by dynamic programming (DP) or a similar method is therefore needed to normalize the differences in temporal structure between utterances, which complicates the configuration. In addition, a standard pattern corresponding to a phoneme or the like can compute only the likelihood for its own phoneme category, so pattern discrimination performance is low.

[0004] Conventional speech recognition devices thus had problems concerning the efficient choice of recognition units, automatic scanning for phoneme features, reduction of the DP computation, and the methods used to train recognition units and standard patterns.

[0005] In view of these problems, an object of the present invention is to provide a speech recognition device that reduces the amount of DP computation and performs pattern recognition efficiently.

[0006] [Means for Solving the Problems] The object of the present invention is achieved by a speech recognition device comprising a plurality of speech event detection means that extract characteristic portions of input speech, and word detection means that obtains the likelihood of a word from the outputs of the plurality of speech event detection means, the device scanning the speech event detection means, concatenated according to the features of a recognition target word, independently of one another along the time axis over the input speech, and recognizing the input speech from the outputs of the speech event detection means and the word detection means, wherein the device recognizes the input speech based on likelihoods with respect to positive reference vectors, which correspond to a characteristic portion of the input speech, and anti-reference vectors, which do not correspond to that portion, and wherein the device scans each speech event detection means over an acoustic signal containing the speech to be recognized, obtains the output of the word detection means at each time on the assumption that word matching ends at that time, and detects the corresponding word from the local maxima of the time series of that output, thereby performing recognition continuously.

[0012] [Operation] In the speech recognition device of the present invention, the plurality of speech event detection means extract characteristic portions of the input speech, and the word detection means obtains the likelihood of a word from their outputs. The speech event detection means, concatenated according to the features of the recognition target word, are each scanned independently along the time axis over the input speech, and the input speech is recognized from the outputs of the speech event detection means and the word detection means. In doing so, the device recognizes the input speech based on likelihoods with respect to positive reference vectors, which correspond to a characteristic portion of the input speech, and anti-reference vectors, which do not. It scans each speech event detection means over an acoustic signal containing the speech to be recognized, obtains the output of the word detection means at each time on the assumption that word matching ends at that time, and detects the corresponding word from the local maxima of the time series of that output, so that recognition is performed continuously.

[0018] [Embodiments] An embodiment of the speech recognition device of the present invention is described in detail below with reference to the drawings.

[0019] FIG. 1 is a block diagram showing the configuration of one embodiment of the speech recognition device of the present invention.

[0020] The speech recognition device of FIG. 1 comprises a microphone 11, an analog/digital (A/D) converter 12 connected to the microphone 11, a microprocessor 13 connected to the A/D converter 12, a read-only memory (ROM) 14 that is connected to the microprocessor 13 and constitutes the speech event detection means and the word detection means, a random access memory (RAM) 15 connected to the microprocessor 13, and an external interface 16 connected to the microprocessor 13.

[0021] The operation of the speech recognition device of FIG. 1 is described next.

[0022] Input speech is picked up by the microphone 11 and converted into an electrical signal, passed through a low-pass filter, and then converted from an analog signal into a digital signal by the A/D converter 12.

[0023] The speech signal digitized by the A/D converter 12 is transferred over a bus to the microprocessor 13.

[0024] Under the speech recognition program stored in the ROM 14, the microprocessor 13 calls the neural networks corresponding to the phoneme strings of the recognition words, which are also stored in the ROM 14. It performs recognition while temporarily storing data in the RAM 15, which serves as the working area, and outputs the recognition result through the external interface 16.

[0025] FIG. 2 shows an example of a speech waveform. It is an example of the unvoiced plosive /k/, in which the burst appears within noise. The time of the burst can vary from one utterance to the next. A speech event that characterizes speech can thus be regarded as a more or less fixed feature time series that occurs while fluctuating along the time axis.

[0026] In the present invention, a feature time series of L frames is used to capture a speech event, and a neural network is built on this feature time series.

[0027] The neural network may be either a layered perceptron-type network as shown in FIG. 3 or a learning vector quantization (LVQ) type network as shown in FIG. 4. The LVQ-type network of FIG. 4 is described here.

[0028] An LVQ-type neural network has a plurality of reference vectors and computes its output from the vector distances to, or inner products with, those reference vectors. By analogy with a layered neural network, each reference vector itself is called an output unit, the values of the reference vector are called the weights of the unit, and the inner product with them is called the output of the output unit. The group of reference vectors trained for a phoneme event, together with the inner-product operation, is called a phoneme event neural network. Whether the LVQ type is a neural network is a matter of debate, but it is currently regarded as one kind of neural network.

[0029] For simplicity, the phoneme is used as the unit of recognition in the following description. In an LVQ-type neural network, as shown in FIG. 5, reference vectors V_ki {i = 0, ..., N} are prepared to represent a phoneme category k. When training data belonging to this category is presented, the reference vectors are moved toward the data vector; when data belonging to a different category is presented, they are moved away from it.

[0030] Positive reference vectors, which indicate membership of the category, do not discriminate sufficiently on their own. The present invention therefore also prepares anti-reference vectors U_kj {j = 0, ..., M}, which indicate non-membership of the category. When training data belonging to the category is presented, the reference vectors V_ki are moved toward the training data and the anti-reference vectors U_kj are moved away from it. Conversely, when training data not belonging to the category is presented, the reference vectors V_ki are moved away and the anti-reference vectors U_kj are moved closer.

[0031] When an input to be recognized is given, its inner products with these positive and anti-reference vectors are computed, and the likelihood for the category is computed by the following expression.

[0032]

$$D(l, k) = h\bigl(f(d(X_l, V_k)) - g(d(X_l, U_k))\bigr), \qquad l_s < l < l_e$$

Here X_l is the time-series pattern that starts at frame l of the input speech to be recognized, l_s is the matching start time, and l_e is the matching end time. The function d(X_l, V_k) is the similarity between X_l and the positive reference vectors of category k, and d(X_l, U_k) is the similarity between X_l and the anti-reference vectors of category k.

[0033] For example, the similarity to the positive reference vectors given by d can be defined as the maximum of the outputs for the individual positive reference vectors, and the similarity to the anti-reference vectors as the maximum of the outputs for the individual anti-reference vectors.

[0034] f is a function that obtains the likelihood for the category from the input and the group of positive reference vectors, and g is a function that obtains the anti-likelihood for the category from the input and the group of anti-reference vectors. Either may be, for example, max.

[0035] h is the optimal-positioning function applied when scanning over the target interval from l_s to l_e. max can likewise be used for this function.

[0036] FIG. 6 shows actual distances to the reference vectors. At the event position of the target phoneme, the similarity to the positive reference vectors increases and the similarity to the anti-reference vectors decreases.

[0037] For isolated word recognition, the phoneme event neural networks corresponding to the phoneme string of each word in the recognition vocabulary are concatenated. Each network scans the time axis, subject to temporal constraints, and finds its maximum output; the word detection neural network then forms a weighted sum of these outputs to obtain the recognition result. The structure of the word detection neural network is shown in FIG. 4. It is trained so that the phoneme events that are reliable for the recognition target word receive greater weight.

[0038] Training of the phoneme event neural networks is described next.

[0039] First, each phoneme event neural network undergoes initial training on a speech database to which a certain amount of labeling has been applied.

[0040] A human indicates the characteristic point of each phoneme, and that portion is learned. The training is the LVQ learning described above.
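The positive and anti-reference vector scheme described above can be illustrated with a short Python sketch. It is not taken from the patent. The function names, the step size `alpha`, and the choice to move every vector in a set rather than only the nearest one are assumptions made for illustration. The inner product is used as the similarity d, and max is used for f, g, and h, as the text suggests.

```python
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product, used here as the similarity d."""
    return sum(x * y for x, y in zip(a, b))


def move(refs: List[List[float]], x: Sequence[float],
         alpha: float, toward: bool) -> None:
    """Shift each vector in refs toward x, or away from it, by a fraction alpha."""
    sign = 1.0 if toward else -1.0
    for v in refs:
        for n in range(len(v)):
            v[n] += sign * alpha * (x[n] - v[n])


def train_step(pos: List[List[float]], anti: List[List[float]],
               x: Sequence[float], alpha: float, in_category: bool) -> None:
    """In-category data attracts the positive vectors and repels the
    anti-reference vectors; out-of-category data does the opposite."""
    move(pos, x, alpha, toward=in_category)
    move(anti, x, alpha, toward=not in_category)


def frame_score(x: Sequence[float], pos: List[List[float]],
                anti: List[List[float]]) -> float:
    """f(d(X, V)) - g(d(X, U)) with f = g = max."""
    return max(dot(x, v) for v in pos) - max(dot(x, u) for u in anti)


def event_likelihood(windows: Sequence[Sequence[float]],
                     pos: List[List[float]],
                     anti: List[List[float]]) -> Tuple[float, int]:
    """Scan the candidate windows (h = max); return the best score and its position."""
    scores = [frame_score(x, pos, anti) for x in windows]
    best = max(range(len(scores)), key=lambda l: scores[l])
    return scores[best], best
```

With one positive vector `[1.0, 0.0]` and one anti-reference vector `[0.0, 1.0]`, scanning the windows `[0.0, 1.0]`, `[1.0, 0.0]`, `[0.5, 0.5]` selects the second window, where the positive similarity is high and the anti-reference similarity is low.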
[0041] Next, as shown in FIG. 7, the training words are recognized using these phoneme event neural networks. Each phoneme event neural network is then retrained, and thereby optimized, at the phoneme event position that it found by scanning the time axis during recognition. This is called optimization learning.

[0042] Next, the word detection neural network is trained using these phoneme event neural networks. In this embodiment the word detection neural network is formed as the sum of the phoneme event neural networks, but it may instead be formed from a set of LVQ-type reference vectors and trained on the training-word data.
Such a network serves to memorize the pattern of outputs that the phoneme event neural networks produce within the target word.

[0043] The processing procedures above are described with reference to FIGS. 8 to 14.

[0044] FIG. 8 is a flowchart showing the operation of initial learning.

[0045] First, initialization is performed (step S1). The training data for the neural networks is classified by the phoneme label information assigned to it in advance, and the feature parameter sequences of the speech are obtained (step S2). Step S2 is described later with reference to FIG. 9. Using the parameter sequences of the training data, classified and analyzed label by label, the positive and anti-reference vectors of the LVQ-type neural network are trained (steps S3 to S7). This training is repeated until a predetermined termination condition is met (for example, a fixed number of iterations).

[0046] So that training is not biased by the order in which the training data is presented, the presentation order is determined by choosing phoneme labels with random numbers (step S4). The training data is then read (step S6), and the positive and anti-reference vectors are trained (step S7). The loop at step S5 applies this to all the training data. The training of the positive and anti-reference vectors in step S7 is described later with reference to FIG. 10.

[0047] The reference vectors trained by repeating the loop until the condition at step S3 is satisfied are stored together with the average phoneme lengths (step S8), and processing ends.

[0048] The processing of step S2 is described next with reference to FIG. 9.

[0049] After initialization (step S201), a label file is specified (step S202) and read (step S203). Next, the first phoneme in the label file is specified (step S204), a training phoneme is specified (step S205), and the label of the training data read in step S203 is compared with the current training phoneme (step S206). If they are the same phoneme (step S207), the parameters of the speech data corresponding to the phoneme position in the label file are read (step S208), the position given in advance is determined from the label information (step S209), and the data is stored in the training data buffer (step S210).

[0050] The loop at step S204 applies this processing to every phoneme in one label file. The loop at step S202 applies the processing from step S203 onward to every label file.

[0051] The training of the positive and anti-reference vectors in step S7 of FIG. 8 is described next with reference to FIG. 10.

[0052] First, the order in which phonemes are trained is determined by random numbers so that the result does not depend on that order (step S701). Next, a training phoneme is specified, and the phoneme label of the training data is compared with the training phoneme (step S703). If they match (step S704), positive reference vector training is performed (step S705). With X_l as the input vector, the positive reference vectors V_ki and the anti-reference vectors U_kj are updated as follows.

[0053]

$$V_{ki} = V_{ki} + \alpha (X_l - V_{ki}), \qquad U_{kj} = U_{kj} - \alpha (X_l - U_{kj})$$

If they do not match in step S704, anti-reference vector training is performed (step S706). With X_l as the input vector, the positive reference vectors V_ki and the anti-reference vectors U_kj are updated as follows.

[0054]

$$V_{ki} = V_{ki} - \alpha (X_l - V_{ki}), \qquad U_{kj} = U_{kj} + \alpha (X_l - U_{kj})$$

Next, the processing in the output layer, which is the recognition section of the LVQ-type neural network of FIG. 4, is described with reference to FIG. 11.

[0055] After initialization (step T1), the unit weights of all phonemes, that is, the positive and anti-reference vectors, are read (step T2), and the word dictionary made up of the phoneme strings of the recognition target words is read (step T3). One frame of input speech is read (step T4), and the speech-detected flag is checked (step T5). If detection has already been confirmed, processing ends; if not, the input speech is analyzed, its norm is computed, and it is normalized (step T6). The loop starting at step T7 matches the input against the reference vectors of each phoneme in turn and, at the same time, checks for word detection. The likelihood with respect to the reference vectors of each phoneme is obtained and compared with the firing threshold (step T8). If the phoneme has not fired (step T9), processing moves on to matching the next phoneme. If the firing threshold is exceeded in step T9, the firing time and firing level are stored in a first-in first-out (FIFO) memory of suitable size (step T10).
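Steps T7 to T10 above can be pictured with the following Python sketch. It is an illustration under stated assumptions, not the patented implementation. The `Firing` record, the `record_firings` function, and the dictionary of per-phoneme likelihoods are hypothetical names, and a `collections.deque` with `maxlen` stands in for the FIFO memory "of suitable size".

```python
from collections import deque
from typing import Deque, Dict, List, NamedTuple


class Firing(NamedTuple):
    """One phoneme event detection: which phoneme fired, when, and how strongly."""
    phoneme: str
    time: int
    level: float


def record_firings(frame_index: int,
                   likelihoods: Dict[str, float],
                   threshold: float,
                   fifo: Deque[Firing]) -> List[Firing]:
    """Compare each phoneme likelihood for this frame with the firing threshold
    and push every firing into the FIFO. Returns the firings of this frame."""
    fired = []
    for phoneme, level in likelihoods.items():
        if level > threshold:          # "exceeds the firing threshold"
            firing = Firing(phoneme, frame_index, level)
            fifo.append(firing)        # a full deque drops its oldest entry
            fired.append(firing)
    return fired
```

A bounded deque keeps only the most recent firings, which is all the later backward check from a word's final phoneme needs. The appropriate length depends on the longest word in the dictionary and is left open here.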
The word dictionary is then checked for this firing. If the phoneme is not the final phoneme of any word (step T11), processing moves on to matching the next phoneme; if it is a final phoneme, the output of the word network is checked (step T12).

[0056] The word network output check of step T12 in FIG. 11 is described next with reference to FIG. 12.

[0057] After initialization (step T101), matching against all the recognition target words is performed in the loop of steps T102 to T107.

[0058] First, one recognition word is specified (step T102), and whether its final phoneme has fired is checked (step T103). If it has not fired, matching moves on to the next word. If it has fired, the firing times of the phonemes in the dictionary entry are examined backward in time from the firing time of the final phoneme, and each firing time is checked to see whether it lies within the tolerance set by the phoneme duration (steps T105, T106). The tolerance may be, for example, 0.75 to 1.5 times the average duration of the phoneme, but it can also be learned from the training data.

[0059] If the firing lies within the tolerance, the firing value P of that phoneme is added to O_w, and the next phoneme in the dictionary entry is examined (step T107). When matching against all words is complete at step T102, the likelihoods of the words are sorted, the word-detected flag is turned on (step T108), and the top M results are output (step T109).

[0060] Optimization learning of the LVQ-type neural network of FIG. 4 is described next with reference to FIG. 13.

[0061] Optimization learning is a process that trains the networks in the way best suited to the recognition algorithm, and it has a large effect on recognition performance.

[0062] After initialization (step V1), the unit weights of all phonemes obtained by initial learning, that is, the reference vectors, are read (step V2), and the phoneme strings of the recognition target words are read (step V3). Optimization learning is repeated at step V4 until a predetermined condition, for example a fixed number of iterations, is reached.

[0063] The loop at step V5 updates the reference vectors of each phoneme over all the training words.

[0064] First, one training word is specified, the phoneme networks are concatenated according to that word, and recognition is run on its data (step V6). This yields the position of each phoneme event as determined by the recognition process. If the resulting output of the word network is at or below a threshold, the word is judged not to be used for training (step V7). This keeps words that are recognized with very poor accuracy out of training. Because the outputs come to exceed the threshold as training progresses, the threshold is set so that all words are eventually used. This improves the training speed. For words that exceed the threshold in step V7, optimization learning of the word W is performed (step V8). Applying this to all words, updating the reference vectors of all phonemes, and repeating the whole procedure optimizes the recognition section and the learning section together. When the prescribed training is finished, the unit weights, that is, the reference vectors, and the average phoneme lengths are stored (step V9), and processing ends.

[0065] The optimization learning of the word W in step V8 of FIG. 13 is described next with reference to FIG. 14.

[0066] After initialization (step V801), the loop of steps V802 to V808 trains the phonemes of the target word in order.

[0067] First, phonemes are specified in order from the beginning of the word, and the firing position of each phoneme is read (step V803). The loop starting at step V804 refers to all phonemes in turn and trains the positive and anti-reference vectors.

[0068] First, the first training phoneme is specified, and the phoneme of the recognized data is compared with the training phoneme (step V806). If they match in step V806, positive reference vector training is performed (step V807); if they do not match, anti-reference vector training is performed (step V808). The training of the positive and anti-reference vectors here is the same as in steps S705 and S706 of FIG. 10. When all the phonemes in the word have been trained at step V802, processing ends.

[0069] Continuous speech recognition and word spotting require continuous matching. As shown in FIG. 15, this is done on the outputs of the phoneme detection neural networks, using continuous dynamic programming (continuous DP) matching or a similar method: each time is hypothesized as an end point, and a word is taken to be detected at the point where the likelihood of the whole word exceeds a threshold and reaches a maximum. If each phoneme event neural network is considered only where its likelihood is at or above a fixed threshold, the amount of continuous DP computation can be reduced.

[0070] Speech is characterized by speech events that occur at varying positions along the time series, and taking an entire recognition unit such as a phoneme as the target can end up including unnecessary portions. Learning the features of the speech event portion with a neural network, scanning it in time to obtain the speech sequence, and recognizing over the entire time span of the acoustic parameters that form the recognition unit therefore improves memory requirements, computational cost, and recognition accuracy.

[0071] Moreover, because only portions whose output exceeds a certain likelihood are considered, the amount of dynamic programming (DP) computation used to find the optimal phoneme sequence is also reduced.

[0072] Conventional methods measured the match between the input speech and a standard pattern (model) by distance or similarity alone.
In the speech recognition device of the present invention, however, in addition to the positive reference vector used to measure similarity to the input speech, an anti-reference vector is trained to measure the degree of dissimilarity, and the true likelihood is calculated from both, which makes accurate discrimination possible. As the training method for the speech event neural networks, initial learning is first performed based on label information obtained by visual inspection or similar means, and the resulting networks are concatenated to form a model of each training word. Each speech event neural network is then retrained and optimized at the speech event positions detected by the networks, so that neural networks best suited to the speech events to be recognized are constructed.

The speech recognition device of the present invention comprises a plurality of speech event detection means for extracting characteristic portions of input speech, and word detection means for obtaining the likelihood of a word based on the outputs of the plurality of speech event detection means. The speech event detection means, concatenated according to the features of the recognition target word, are each scanned independently along the time axis over the input speech, and the input speech is recognized based on the outputs of the speech event detection means and the word detection means. The input speech is recognized based on the likelihood obtained from a positive reference vector corresponding to a characteristic portion of the input speech and an anti-reference vector not corresponding to that portion. Each speech event detection means is scanned over an acoustic signal containing the speech to be recognized, the output of the word detection means is obtained by assuming at each time the end of word matching, and the corresponding word is detected based on the local maxima of the time series of that output. Since recognition is performed continuously in this way, any word can be recognized efficiently and with high accuracy.
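
As a rough illustration of the continuous recognition described above, the following Python sketch treats the output of the word detection means as a list of per-frame word likelihoods and reports the frames where that value exceeds a threshold and forms a local maximum. The function name spot_word_endpoints, the list representation, and the threshold value are assumptions made for this example and do not appear in the patent. How the word likelihood itself is computed from the speech event outputs is not shown.

```python
from typing import List, Sequence


def spot_word_endpoints(word_likelihood: Sequence[float], threshold: float) -> List[int]:
    """Return frame indices hypothesized as word end points.

    A frame is reported when its word likelihood is above `threshold`
    and is a local maximum of the time series (hypothetical criterion:
    strictly greater than the previous frame, not smaller than the next).
    """
    hits: List[int] = []
    n = len(word_likelihood)
    for t in range(n):
        value = word_likelihood[t]
        if value <= threshold:
            # Frames at or below the threshold are never candidates.
            continue
        # Missing neighbours at the edges are treated as minus infinity.
        left = word_likelihood[t - 1] if t > 0 else float("-inf")
        right = word_likelihood[t + 1] if t + 1 < n else float("-inf")
        if value > left and value >= right:
            hits.append(t)
    return hits


# Example: two peaks above the (assumed) threshold of 0.5.
scores = [0.1, 0.3, 0.9, 0.4, 0.2, 0.6, 0.8, 0.7, 0.1]
detected = spot_word_endpoints(scores, threshold=0.5)
```

In this sketch a plateau of equal values is reported once, at its first frame, so a flat peak does not produce repeated detections of the same word.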

BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block diagram showing the configuration of one embodiment of the speech recognition device of the present invention.
FIG. 2 is an explanatory diagram of speech and speech events.
FIG. 3 is an explanatory diagram showing a configuration example of a layered neural network.
FIG. 4 is an explanatory diagram showing a configuration example of an LVQ-type neural network.
FIG. 5 is an explanatory diagram of the arrangement of reference vectors.
FIG. 6 is an explanatory diagram of the output of each phoneme event neural network.
FIG. 7 is an explanatory diagram of the optimization learning positions.
FIG. 8 is a flowchart for explaining initial learning.
FIG. 9 is a flowchart for explaining initial learning.
FIG. 10 is a flowchart for explaining initial learning.
FIG. 11 is a flowchart for explaining the operation of the recognition unit.
FIG. 12 is a flowchart for explaining the operation of the recognition unit.
FIG. 13 is a flowchart for explaining optimization learning.
FIG. 14 is a flowchart for explaining optimization learning.
FIG. 15 is an explanatory diagram of a word spotting application.

[Description of Reference Numerals]
11 Microphone
12 A/D converter
13 Microprocessor
14 ROM
15 RAM
16 External interface

Continuation of front page
(72) Inventor: Seiji Hamaguchi, c/o Sharp Corporation, 22-22 Nagaike-cho, Abeno-ku, Osaka-shi, Osaka
(56) References: JP-A-64-81999 (JP, A); JP-A-1-204099 (JP, A); JP-A-3-269500 (JP, A); JP-A-1-116869 (JP, A); JP-A-2-170265 (JP, A); JP-A-5-334276 (JP, A)
(58) Field searched (Int. Cl.⁷, DB name): G10L 15/08

Claims

(57) [Claims]
[Claim 1] A speech recognition device comprising a plurality of speech event detection means for extracting characteristic portions of input speech, and word detection means for obtaining the likelihood of a word based on the outputs of the plurality of speech event detection means, in which the speech event detection means, concatenated according to the features of a recognition target word, are each scanned independently along the time axis over the input speech, and the input speech is recognized based on the outputs of the speech event detection means and the word detection means, wherein the input speech is recognized based on the likelihood obtained from a positive reference vector corresponding to a characteristic portion of the input speech and an anti-reference vector not corresponding to the characteristic portion, each speech event detection means is scanned over an acoustic signal containing the speech to be recognized, the output of the word detection means is obtained by assuming at each time the end of word matching, and the corresponding word is detected based on the local maxima of the time series of that output, whereby recognition is performed continuously.