JP3584002B2 - Voice recognition device and voice recognition method

Info

Publication number: JP3584002B2
Application number: JP2001095790A
Authority: JP
Inventors: 計美大倉
Original assignee: Sanyo Electric Co Ltd
Current assignee: Sanyo Electric Co Ltd
Priority date: 2001-03-29
Filing date: 2001-03-29
Publication date: 2004-11-04
Anticipated expiration: 2021-03-29
Also published as: JP2002297182A

Description

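[0001]
[Technical Field of the Invention]
The present invention relates to a speech recognition device and a speech recognition method, and in particular to ones capable of accurately determining the appropriate words from speech that contains unnecessary words.
[0002]
[Prior Art]
As a technique for recognizing words in a word dictionary from speech that contains unnecessary words (for example, meaningless utterances or particles), the technique described in Japanese Patent Application Laid-Open No. 7-77998 is known.
[0003]
In this conventional technique, features of unnecessary words are first generated from the speech features of all recognition candidates (necessary words) stored in the word dictionary, and these unnecessary-word features are registered in the recognition dictionary in advance. Portions of the input speech signal that match the unnecessary-word features are then recognized as unnecessary words, and the words so recognized are removed from the recognition result to obtain the recognition result for the necessary words.
[0004]
The unnecessary-word features are generated as follows. First, the features of the necessary words are generated: several speech samples are input for each necessary word, and each sample is acoustically analyzed and used for training to generate the features of that word. Next, the features of all the necessary words generated in this way are averaged, and the average is set as the feature of the unnecessary word.
[0005]
[Problems to Be Solved by the Invention]
Because the conventional technique generates the unnecessary-word features from the speech features of all recognition candidates (necessary words), the unnecessary-word features must be regenerated every time a recognition candidate in the recognition dictionary is added or modified. Adding or deleting recognition candidates is therefore laborious.
[0006]
In addition, during recognition, likelihood or approximate-distance calculations against the input speech signal must be performed for the unnecessary-word features as well as for the necessary words. This adds recognition processing steps and lengthens the time needed to obtain the recognition result.
[0007]
An object of the present invention is therefore to provide a speech recognition method that can recognize speech quickly even when recognition candidates are added or deleted.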
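[0008]
[Means for Solving the Problems]
In view of the above problems, the present invention has the following features.
[0009]
The invention of claim 1 relates to a speech recognition device comprising: acoustic analysis means for acoustically analyzing an input speech signal and extracting frame features; word model creation means for creating a word model by connecting silence models to both ends of a reference model of a word; likelihood calculation means for comparing the frame features with the features of the reference models and calculating the likelihood of the frame features for the word model; matching calculation means for calculating, based on the calculated likelihoods, the degree of matching of the word model to the input speech signal; recognition candidate setting means for setting recognition candidates according to the degree of matching; and garbage likelihood setting means for setting a garbage likelihood for the silence model, wherein the garbage likelihood setting means calculates the garbage likelihood by averaging the N largest of the likelihoods of the frame features for the reference model features, and the matching calculation means performs the matching calculation using, as the likelihood of the silence model, the garbage likelihood set by the garbage likelihood setting means.
The invention of claim 2 relates to a speech recognition device comprising: acoustic analysis means for acoustically analyzing an input speech signal and extracting frame features; word model creation means for creating a word model by connecting silence models to both ends of a reference model of a word; likelihood calculation means for comparing the frame features with the features of the reference models and calculating the likelihood of the frame features for the word model; matching calculation means for calculating, based on the calculated likelihoods, the degree of matching of the word model to the input speech signal; recognition candidate setting means for setting recognition candidates according to the degree of matching; and garbage likelihood setting means for setting a garbage likelihood for the silence model, wherein the garbage likelihood setting means calculates the garbage likelihood by averaging the N largest of the likelihoods of the frame features for the reference model features, and the matching calculation means performs the matching calculation by selecting, as the likelihood of the silence model, either the silence model likelihood calculated by the likelihood calculation means or the garbage likelihood set by the garbage likelihood setting means.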
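[0010]
The invention of claim 3 relates to a speech recognition device comprising: acoustic analysis means for acoustically analyzing an input speech signal and extracting frame features; word model creation means for creating a word model by connecting silence models to both ends of a reference model of a word; likelihood calculation means for comparing the frame features with the features of the reference models and calculating the likelihood of the frame features for the word model; matching calculation means for calculating, based on the calculated likelihoods, the degree of matching of the word model to the input speech signal; recognition candidate setting means for setting recognition candidates according to the degree of matching; and garbage likelihood setting means for setting a garbage likelihood for the silence model, wherein the garbage likelihood setting means takes as the garbage likelihood the K-th largest of the likelihoods of the frame features for the reference model features, and the matching calculation means performs the matching calculation using, as the likelihood of the silence model, the garbage likelihood set by the garbage likelihood setting means.
[0011]
The invention of claim 4 relates to a speech recognition device comprising: acoustic analysis means for acoustically analyzing an input speech signal and extracting frame features; word model creation means for creating a word model by connecting silence models to both ends of a reference model of a word; likelihood calculation means for comparing the frame features with the features of the reference models and calculating the likelihood of the frame features for the word model; matching calculation means for calculating, based on the calculated likelihoods, the degree of matching of the word model to the input speech signal; recognition candidate setting means for setting recognition candidates according to the degree of matching; and garbage likelihood setting means for setting a garbage likelihood for the silence model, wherein the garbage likelihood setting means takes as the garbage likelihood the K-th largest of the likelihoods of the frame features for the reference model features, and the matching calculation means performs the matching calculation by selecting, as the likelihood of the silence model, either the silence model likelihood calculated by the likelihood calculation means or the garbage likelihood set by the garbage likelihood setting means.
[0012]
The invention of claim 5 has, in addition to the features of claim 2 or 4, the further feature that the matching calculation means compares the silence model likelihood calculated by the likelihood calculation means with the garbage likelihood set by the garbage likelihood setting means and selects whichever is larger.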
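[0013]
The invention of claim 6 relates to a speech recognition method comprising the steps of: acoustically analyzing an input speech signal and extracting frame features; creating a word model by connecting silence models to both ends of a reference model of a word; comparing the frame features with the features of the reference models and calculating the likelihood of the frame features for the word model; calculating, based on the calculated likelihoods, the degree of matching of the word model to the input speech signal; setting recognition candidates according to the degree of matching; and setting a garbage likelihood for the silence model, wherein the garbage likelihood setting step calculates the garbage likelihood by averaging the N largest of the likelihoods of the frame features for the reference model features, and the matching calculation step performs the matching calculation using, as the likelihood of the silence model, the garbage likelihood set in the garbage likelihood setting step.
[0014]
The invention of claim 7 relates to a speech recognition method comprising the steps of: acoustically analyzing an input speech signal and extracting frame features; creating a word model by connecting silence models to both ends of a reference model of a word; comparing the frame features with the features of the reference models and calculating the likelihood of the frame features for the word model; calculating, based on the calculated likelihoods, the degree of matching of the word model to the input speech signal; setting recognition candidates according to the degree of matching; and setting a garbage likelihood for the silence model, wherein the garbage likelihood setting step calculates the garbage likelihood by averaging the N largest of the likelihoods of the frame features for the reference model features, and the matching calculation step performs the matching calculation by selecting, as the likelihood of the silence model, either the silence model likelihood calculated in the likelihood calculation step or the garbage likelihood set in the garbage likelihood setting step.
[0015]
The invention of claim 8 relates to a speech recognition method comprising the steps of: acoustically analyzing an input speech signal and extracting frame features; creating a word model by connecting silence models to both ends of a reference model of a word; comparing the frame features with the features of the reference models and calculating the likelihood of the frame features for the word model; calculating, based on the calculated likelihoods, the degree of matching of the word model to the input speech signal; setting recognition candidates according to the degree of matching; and setting a garbage likelihood for the silence model, wherein the garbage likelihood setting step takes as the garbage likelihood the K-th largest of the likelihoods of the frame features for the reference model features, and the matching calculation step performs the matching calculation using, as the likelihood of the silence model, the garbage likelihood set in the garbage likelihood setting step.
The invention of claim 9 relates to a speech recognition method comprising the steps of: acoustically analyzing an input speech signal and extracting frame features; creating a word model by connecting silence models to both ends of a reference model of a word; comparing the frame features with the features of the reference models and calculating the likelihood of the frame features for the word model; calculating, based on the calculated likelihoods, the degree of matching of the word model to the input speech signal; setting recognition candidates according to the degree of matching; and setting a garbage likelihood for the silence model, wherein the garbage likelihood setting step takes as the garbage likelihood the K-th largest of the likelihoods of the frame features for the reference model features, and the matching calculation step performs the matching calculation by selecting, as the likelihood of the silence model, either the silence model likelihood calculated in the likelihood calculation step or the garbage likelihood set in the garbage likelihood setting step.
[0016]
The invention of claim 10 has, in addition to the features of claim 7 or 9, the further feature that the matching calculation step compares the silence model likelihood calculated in the likelihood calculation step with the garbage likelihood set in the garbage likelihood setting step and selects whichever is larger.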
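[0021]
The features and effects of the present invention will become clear from the following embodiments.
[0022]
[Embodiments of the Invention]
First, an overview of the speech recognition device and speech recognition method according to this embodiment is given with reference to FIGS. 1 to 3.
[0023]
When speech is input to the speech recognition device, the speech signal to be recognized is cut out from the speech input signal. This cut-out is performed, for example, by monitoring the power of the speech input signal.
[0024]
That is, when speech is input to the microphone, the power of the input speech signal from the microphone rises from the silence level. The interval from this rise until the signal level next reaches the silence level is the interval that should properly be the recognition target. If this interval were cut out exactly, the beginning or end of the target speech might be clipped. To make sure the whole interval is recognized, the speech signal is therefore normally cut out so that it includes a fixed length of the silent signal before and after the interval.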
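The power-based cut-out with a silence margin can be sketched as follows. This is only an illustration under stated assumptions: the function name `cut_out_speech`, the per-frame power list, and the `silence_level` and `margin` parameters are hypothetical and do not come from the patent, which does not specify how the power is measured or how long the margin is.

```python
def cut_out_speech(power, silence_level, margin):
    """Return (start, end) frame indices of the cut-out segment, end exclusive.

    power:         per-frame signal power (assumed already computed)
    silence_level: power at or below this value counts as silence (assumption)
    margin:        number of silent frames kept before and after the speech
    """
    n = len(power)
    # Find the first frame whose power rises above the silence level.
    start = next((i for i, p in enumerate(power) if p > silence_level), None)
    if start is None:
        return None  # no speech found
    # Advance until the power next returns to the silence level.
    end = start
    while end < n and power[end] > silence_level:
        end += 1
    # Widen the interval by the margin, clamped to the signal bounds.
    return max(0, start - margin), min(n, end + margin)
```

Keeping the margin avoids clipping the first and last syllables, at the cost of silent frames that the matching stage then has to absorb.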
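[0025]
For example, in FIGS. 1 to 3 the cut-out speech signal is 「○○ええがぞうがおおきい○○」, where ○ is a silent signal. The utterance actually input to the microphone means "Eh, the image is large" (「エ〜、画像が大きい」). Of this, 「エ〜」 is a meaningless unnecessary word that speakers often utter when they begin speech input. 「画像」 (gazō, "image") is a recognition candidate (word) registered in the recognition dictionary, 「が」 (ga) is a particle that is not in the recognition dictionary (an unnecessary word), and 「大きい」 (ōkii, "large") is a recognition candidate (word) registered in the recognition dictionary.
[0026]
The speech signal input and cut out in this way is acoustically analyzed at a fixed period (the frame period), and acoustic features (hereinafter "frame features") are extracted. Each extracted frame feature is compared with the features of the reference syllables, and its degree of similarity to each reference syllable, that is, the likelihood, is calculated.
[0027]
The memory of the speech recognition device stores in advance the features of the reference syllables, namely the syllables of the basic kana table (あいうえお, etc.), the voiced syllables (がぎぐげご, etc.), the semi-voiced syllables (ぱぴぷぺぽ, etc.), and the contracted syllables (ぎゃぎゅぎょ, etc.), together with the feature of the silence syllable.
[0028]
The frame features extracted at the frame period are compared with the features of all the reference syllables stored in the memory, and a likelihood is calculated for each reference syllable. In FIG. 1, for example, the frame features are those of the portions corresponding to 「○」, 「○」, 「え」, 「え」, 「が」, 「ぞ」, 「う」, 「が」, 「お」, 「お」, 「き」, 「い」, 「○」, 「○」. Each of these frame features is compared in turn with the reference syllable features, and the likelihoods are calculated.
[0029]
In the block shown in the lower part of FIG. 1, the leftmost column lists the reference syllables, and the numbers in the columns to its right are the likelihoods of each frame feature for each reference syllable. For example, the first frame feature of the target speech signal (cut out so as to include silence before and after) corresponds to a 「○ (silent signal)」 portion. This feature is compared in turn with the features of all the reference syllables, and the likelihood for each reference syllable is calculated. In FIG. 1, the likelihood for the reference syllable 「あ」 is 0.1, the likelihood for 「い」 is 0.1, ..., and the likelihood for the reference syllable "silence" is 0.9.
[0030]
Once all the likelihoods for the first frame feature have been calculated, the likelihoods for the next frame feature, again a 「○ (silent signal)」 portion, are calculated. In the same way, the likelihoods for the reference syllables 「あ」, 「い」, 「う」, ..., 「ぺ」, 「ぽ」, 「○ (silence)」 are calculated for the frame features over the whole speech signal, 「え」, 「え」, 「が」, ..., 「い」, 「○」, 「○」.
[0031]
The likelihoods calculated in this way are written into memory (RAM: Random Access Memory) so that each is mapped onto a matrix whose axes are the reference syllables and the frames. That is, the likelihoods on the matrix shown in the lower part of FIG. 1 are mapped and stored in memory as they are.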
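[0032]
Once the likelihoods for the reference syllables have been calculated and written to memory, the degree of matching of the speech input signal to one recognition candidate in the word dictionary is calculated.
[0033]
The block in the upper part of FIG. 1 shows the likelihood score (degree of matching) for the recognition candidate 「おおきい」.
[0034]
As described above, the speech signal is cut out so as to include silent portions. When the matching of the speech signal to a recognition candidate is calculated, the candidate with silence syllables added before and after it is therefore used as the recognition candidate syllable sequence. That is, if the recognition candidate is 「おおきい」 as in FIG. 1, the syllable 「○ (silence)」 is added before and after its constituent syllables 「お」, 「お」, 「き」, 「い」, and the concatenation of these syllables is used as the recognition candidate syllable sequence.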
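[0035]
Once the recognition candidate syllable sequence has been formed, the likelihoods of the frame features of the speech signal are assigned to each of its syllables.
[0036]
First, the likelihoods between the first frame feature extracted from the speech signal (the feature of a 「○ (silent signal)」 portion) and each syllable of the recognition candidate syllable sequence are read from the RAM. That is, from the likelihoods of the speech signal for the reference syllables stored in the RAM (the likelihoods in the lower block of FIG. 1), the likelihoods of this frame feature for the syllables 「○ (silence)」, 「お」, 「お」, 「き」, 「い」, 「○ (silence)」 are read out and assigned to the cells of the upper block of FIG. 1 where the first frame feature intersects these syllables.
[0037]
Next, the likelihoods between the second frame feature extracted from the speech signal (the feature of a 「○ (silent signal)」 portion) and each syllable are read from the RAM and placed in the cells of the upper block of FIG. 1 in the same way.
[0038]
Likewise, the likelihoods for the third through the last extracted frame features are assigned to the upper block of FIG. 1.
[0039]
The likelihoods assigned to the upper block of FIG. 1 are in practice stored in a predetermined area of the RAM, with the frames and the recognition candidate syllables as axes.
[0040]
Once the likelihoods of the speech signal have been set and distributed for the recognition candidate syllable sequence, the degree of matching of the speech signal to that sequence is calculated from this group of likelihoods.
[0041]
To calculate the degree of matching, routes are set that run cell by cell from the lower-left corner cell to the upper-right corner cell in FIG. 1, and the sum of the likelihoods in the cells on each route is calculated. A route is set, for example, so that the cell preceding any given cell is either directly to its left or diagonally below and to its left. Alternatively, the route may be set so that the preceding cell is to the left, diagonally below and to the left, or directly below. There are various formulations of such DP matching; for example, the one described in "Digital Speech Processing (1st printing)", Tokai University Press, P167 to P167, can be used.
[0042]
Although this embodiment is described in terms of likelihoods, log-likelihoods may be used instead, as may values based on the reciprocal of the distance (the absolute difference between the frame feature and the feature of each syllable). If transition probabilities and the like are added, the scheme can also be expressed as the well-known HMM (Hidden Markov Model).
[0043]
The likelihood sum is calculated for every route that can be set in this way, and the largest of the sums is taken as the degree of matching (score) of the speech signal to the recognition candidate.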
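The route search described above can be sketched as a small dynamic program. This is a minimal illustration, not the patent's implementation: the function name `matching_score`, the one-dict-per-frame likelihood layout, and the label `"sil"` for the silence syllable are assumptions made here, and only the first route rule (the preceding cell is to the left or diagonally below-left) is modeled.

```python
def matching_score(likelihood, word, silence="sil"):
    """Best route score of a word against per-frame syllable likelihoods.

    likelihood: list with one dict per frame, mapping syllable -> likelihood
    word:       sequence of syllables; silence is added at both ends
    """
    states = [silence] + list(word) + [silence]
    n_frames, n_states = len(likelihood), len(states)
    neg_inf = float("-inf")
    # best[t][s] = highest likelihood sum of any route ending at frame t, state s
    best = [[neg_inf] * n_states for _ in range(n_frames)]
    best[0][0] = likelihood[0][states[0]]  # routes start in the lower-left cell
    for t in range(1, n_frames):
        for s in range(n_states):
            prev = best[t - 1][s]  # cell directly to the left
            if s > 0:
                prev = max(prev, best[t - 1][s - 1])  # cell diagonally below-left
            if prev > neg_inf:
                best[t][s] = prev + likelihood[t][states[s]]
    return best[-1][-1]  # routes end in the upper-right cell
```

Because every route starts in the first silence state and ends in the last one, the silence likelihoods at the edges enter every score, which is why the choice of silence likelihood matters so much later in the description.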
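[0044]
In the example of FIG. 1, the recognition candidate is 「おおきい」 and the speech signal contains 「おおきい」, so in the upper block of FIG. 1 the likelihoods are high in the cells where the 「おおきい」 frames of the speech signal intersect the recognition candidate syllables. The degree of matching (score) for the recognition candidate 「おおきい」 is therefore large owing to the likelihoods in these cells.
[0045]
By contrast, when the degree of matching (score) is calculated for a recognition candidate that is not contained in the speech signal (for example, 「ちいさい」), the likelihoods in all the cells of the upper block of FIG. 1 are low, so the degree of matching (score) is also low.
[0046]
Therefore, if the scores calculated for the recognition candidates are compared with one another and the top few candidates are selected in descending order of score as provisional recognition candidates, the correct candidate is highly likely to be among them.
[0047]
These provisional recognition candidates are then, for example, all displayed on the monitor of the speech recognition device, and the operator selects the appropriate one, which is fixed as the recognition result.
[0048]
Alternatively, the recognition candidates may be divided in advance by content type (for example, a size category such as "large" and "small", or an information category such as "image" and "sound") into separate dictionaries. The top few (for example, five) candidates by score are taken from each dictionary, and combinations of the candidates taken from the dictionaries are formed (for example, 5 × 5 = 25 combinations for two dictionaries with five candidates each). The candidates in each combination are connected by silence, and silence is added before and after, to generate a recognition candidate syllable sequence (for example, 「○○がぞう○○おおきい○○」 for the combination of 「がぞう」 and 「おおきい」). The degree of matching (score) between this sequence and the speech signal is then calculated again as above, and the combination with the highest score is fixed as the recognition result.
[0049]
As described above, if silence syllables are added before and after a recognition candidate to generate the recognition candidate syllable sequence and the degree of matching between this sequence and the speech signal is evaluated, then even when the speech signal is cut out including silent portions, the influence of those portions is absorbed by the 「○ (silence)」 syllables added before and after the candidate, and reasonably accurate recognition results can be obtained.
[0050]
However, when unnecessary words are added to the speech signal, the feature of the syllable 「○ (silence)」 and the features of the unnecessary words are normally dissimilar, so the 「○ (silence)」 syllable cannot absorb the influence of the unnecessary words, and the accuracy with which recognition candidates are determined falls.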
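[0051]
This embodiment therefore improves the way likelihoods are assigned to the recognition candidate syllable 「○ (silence)」, so that the influence of unnecessary words in the speech signal is absorbed together with the influence of its silent portions. Specifically, it improves the setting of the likelihood of each frame feature, extracted by acoustically analyzing the cut-out speech signal at the frame period, for the feature of the recognition candidate syllable 「○ (silence)」 (hereinafter the "silence model likelihood").
[0052]
In detail, as shown in FIG. 2, the highest of each frame feature's likelihoods for the reference syllables is set as the silence model likelihood. That is, in the upper block of FIG. 2, the silence model likelihoods for the frame features 「え」, 「え」, 「が」, 「ぞ」, 「う」, 「が」, 「お」, 「お」, 「き」, 「い」 are assigned the highest values of these frame features' likelihoods for the reference syllables, namely 0.9, 0.9, 0.9, 0.9, 0.9, 0.8, 0.9, 1.0, respectively.
[0053]
With this assignment, the route with the highest matching score, in the upper block of FIG. 2 for example, passes through the silence model likelihood cells, then moves diagonally up and to the right through 「お」, 「お」, 「き」, 「い」 as shown by the arrows, and then passes through the silence model likelihood cells again. That is, in the 「○」, 「○」, 「が」, 「ぞ」, 「う」, 「が」 portion of the cut-out speech signal and in the 「○」, 「○」 portion that follows 「おおきい」, the silence model likelihood cells are set to the highest value, so the route with the highest likelihood sum normally passes through them.
[0054]
Therefore, even when the cut-out speech contains unnecessary words in addition to silent portions, the disturbance of the likelihoods caused by both the silent portions and the unnecessary words is entirely absorbed by the silence model likelihood.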
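[0055]
In the embodiment above, the highest of each frame feature's likelihoods for the reference syllables is set as the silence model likelihood. This has the drawback that the part of the speech signal whose likelihood should be emphasized, that is, the part corresponding to the recognition candidate, is not emphasized.
[0056]
For example, in the upper block of FIG. 2, the cells on the route shown by the arrows are where the recognition candidate and the speech signal coincide, so their likelihoods must stand out sufficiently from those of the other cells. As noted above, however, these cells carry the same likelihood as the silence model likelihood for that interval. The likelihoods of the cells on the arrows, which ought to influence the degree of matching (score) strongly, are therefore not emphasized much, and as a result the recognition accuracy becomes susceptible to disturbances.
[0057]
To remedy this drawback, in the embodiment of FIG. 3 the average of the N highest of each frame feature's likelihoods for the reference syllables (hereinafter the "garbage likelihood") is calculated, and when the garbage likelihood is greater than the silence model likelihood, the silence model likelihood is replaced with the garbage likelihood.
[0058]
With this replacement, as shown in FIG. 3, the likelihoods of the cells on the arrow route become considerably larger than the silence model likelihood, so the cells that should be emphasized are effectively emphasized. In addition, the silence model likelihood is properly emphasized in the silent portions of the speech signal, and also in the unnecessary-word portions (including the 「がぞう」 portion, which is not the recognition target here), so the silence model likelihood in those periods effectively absorbs the influence of the silent and unnecessary-word portions.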
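The top-N averaging and the replacement rule can be sketched per frame as follows. The names `garbage_likelihood` and `silence_model_likelihood`, the dict layout, and the `"sil"` key are assumptions for illustration. The sketch also assumes that the silence syllable's own likelihood takes part in the ranking and that `n` is at least 1; the description does not settle either point.

```python
def garbage_likelihood(frame_likelihoods, n):
    """Average of the n largest likelihoods of one frame (n >= 1 assumed)."""
    top = sorted(frame_likelihoods.values(), reverse=True)[:n]
    return sum(top) / len(top)


def silence_model_likelihood(frame_likelihoods, n, silence="sil"):
    """Silence likelihood for one frame, replaced by the garbage likelihood
    whenever the garbage likelihood is larger."""
    return max(frame_likelihoods[silence], garbage_likelihood(frame_likelihoods, n))
```

With `n = 1` this reduces to the FIG. 2 variant (the frame's highest likelihood). Larger `n` pulls the value below the best syllable's likelihood, so the cells of the true word can stand out from the silence row.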
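[0059]
The above is an overview of this embodiment. Various examples showing the embodiment in more detail are described below.
[0060]
FIG. 4 shows a block diagram of this example. In the figure, 1 is a speech input unit such as a microphone, and 2 is a speech signal cut-out unit that cuts out the speech signal from the speech input signal supplied by the speech input unit. As described above, the speech signal cut-out unit 2 monitors the power of the speech input signal and cuts out the speech signal so that it includes silent signals before and after it.
[0061]
3 is an acoustic analysis unit, which acoustically analyzes the cut-out speech signal at each predetermined frame period and extracts feature parameters (hereinafter "frame feature parameters"). The frame feature parameters are, for example, linear prediction coefficients, LPC cepstra, or per-band energies. Such acoustic analysis is well known, so a detailed description is omitted here. The frame feature parameters are synonymous with the frame features of the embodiment above.
[0062]
4 is a reference model parameter unit that stores the acoustic characteristic parameters of each reference model. The reference models are acoustically analyzed by the same method as in the acoustic analysis unit 3, and the resulting parameters are stored as the reference parameters of each model. A reference model corresponds, for example, to a reference syllable of the embodiment above and, as shown there, the reference models include a silence model. The reference models may be reference syllables as in the embodiment above, or reference phonemes instead. The feature parameters of whole words may also be used as reference models. Hidden Markov Models and the like can be used as the reference models, and the reference model parameters can be expressed by discrete distributions, continuous distributions, and so on.
[0063]
5 is a likelihood calculation unit, which compares the frame feature parameters extracted by the acoustic analysis unit 3 at each predetermined frame period with the feature parameters of each reference model in the reference model parameter unit 4 and calculates the likelihood between them. The likelihood can be calculated by a well-known method, for example the one described in Chapter 3 of "Speech Recognition by Probabilistic Models" published by the Institute of Electronics, Information and Communication Engineers.
[0064]
6 is a RAM unit, which stores the likelihood of each frame calculated by the likelihood calculation unit 5 in association with each reference model. For example, the likelihoods shown as matrices in the lower parts of FIGS. 1 to 3 of the embodiment above are mapped onto the RAM and stored.
[0065]
7 is a recognition dictionary unit, in which the words serving as recognition candidates are stored. As shown in FIG. 5, the recognition dictionary unit holds several recognition dictionaries divided into categories such as "size" and "type of information".
[0066]
8 is a word model creation unit, which connects the reference models of the word to be evaluated and adds silence models before and after them to create a word model. For example, as shown in FIG. 6, if the word to be recognized is 「おおきい」, the word model created is 「○おおきい○」 (where ○ is the silence model).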
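[0067]
9 is a matching calculation unit, which refers to the likelihood of each reference model stored in the RAM unit 6 as described above and to the garbage likelihood from the garbage likelihood calculation unit 10 described below, and calculates the degree of matching (score) for the word model from the word model creation unit 8. The degree of matching (score) is calculated, for example, as in the embodiment above: the likelihoods between each reference model making up the word model (including the silence models) and each frame feature parameter extracted from the speech signal are mapped to form a matrix (see the upper parts of FIGS. 1 to 3), and the highest of the total likelihood scores over the various routes from the lower-left corner to the upper-right corner of this matrix is taken as the degree of matching (Viterbi matching).
[0068]
For the likelihood of the silence model stored in the RAM, as in the embodiment above, the silence model likelihood is replaced with the highest of the likelihoods of each frame feature parameter, extracted at the frame period, for the reference parameters, or with the average of the N highest of those likelihoods, for example. The likelihood calculated and substituted in this way is the garbage likelihood calculated by the garbage likelihood calculation unit 10.
[0069]
10 is a garbage likelihood calculation unit, which refers to the likelihood of each reference model stored in the RAM unit 6 and calculates the garbage likelihood. As in the embodiment above, the garbage likelihood is, for example, the highest of the likelihoods of the frame feature parameters, extracted at the frame period, for the reference parameters, or the average of the N highest of those likelihoods.
[0070]
11 is a recognition candidate storage unit, which compares the degrees of matching (scores) calculated by the matching calculation unit 9 word by word and stores the M highest-scoring words as recognition candidates. M candidate words are stored for each dictionary (category), such as "size" and "type of information". The stored candidate words may be displayed as they are so that the operator selects the intended one, or, as described later, speech recognition processing may be performed again on these M candidates.
[0071]
FIG. 7 is a block diagram showing the matching calculation unit 9 in detail. 91 is a silence model likelihood determination unit, which compares the silence model likelihood stored in the RAM unit 6 with the garbage likelihood from the garbage likelihood calculation unit 10 and determines the likelihood of the silence model.
[0072]
FIG. 8 shows how the silence model likelihood determination unit 91 determines the likelihood of the silence model. For the likelihood of each reference model in the word model, step S1 determines whether that reference model is the silence model. If it is not, the likelihood is left as stored in the RAM unit 6, as a likelihood for the word to be recognized.
[0073]
If step S1 determines that the likelihood is that of the silence model, steps S2 and S3 determine whether the silence model likelihood or the garbage likelihood from the garbage likelihood calculation unit 10 is larger, and if the garbage likelihood is larger, steps S5 and S6 replace the silence model likelihood with the garbage likelihood. The replacement may be made by overwriting the silence model likelihood in the RAM unit 6 with the garbage likelihood. Alternatively, the silence model likelihood in the RAM unit 6 may be left unchanged and the garbage likelihood used for the silence model only during the calculation in the matching calculation unit 9.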
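[0074]
The speech recognition operation of the above example is described with reference to FIG. 9.
[0075]
When the operator inputs speech in a given speech input mode, the dictionaries to be used in that mode are selected from the various dictionaries stored in the recognition dictionary unit 7, and one of these dictionaries is set as the dictionary to be recognized (steps S101, S102). Once the dictionary is set, one of the words stored in it is read out as the word to be recognized (W1) (steps S103, S104). This word (W1) is then compared with the input speech signal as described above, and the likelihood calculation and score calculation (matching processing) for word recognition are performed (step S105).
[0076]
Once the score has been calculated for the word (W1) read from the dictionary, the score is compared with that of the lowest-scoring of the M words (Ws1, Ws2, ..., Wsm) stored in the recognition candidate storage unit 11 by earlier processing. If it is larger, the word (W1) is stored with its score in place of that previously stored word. At this point the word (W1) is the first word read from the dictionary, so no candidate words are stored in the recognition candidate storage unit 11 yet. The word (W1) is therefore stored in the recognition candidate storage unit 11 with its score as it is (step S106).
[0077]
When the processing of the word (W1) is finished, operation returns to step S103, the next word (W2) is read from the recognition dictionary, and the same processing as in steps S104 to S106 is performed. Until M words have been read from the dictionary, the recognition candidate storage unit 11 does not yet hold M candidates, so the words read from the dictionary are stored in it in order with their scores. When the (M+1)-th word is read from the dictionary, the score of this word (Wm+1) is compared with the scores of the M words stored in the recognition candidate storage unit 11. If it is larger than the lowest of them, the word (Wm+1) and its score are stored in the recognition candidate storage unit 11, and the lowest-scoring of the M previously stored words and its score are erased.
[0078]
When the above processing has been performed for all the words stored in the dictionary, step S104 determines that the setting of recognition candidates for that dictionary has finished, and processing returns to step S101. At this point the recognition candidate storage unit 11 holds, as recognition candidates, the M words in the dictionary that scored highest against the speech input signal.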
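The candidate bookkeeping of steps S103 to S106 amounts to keeping the M best-scoring words while streaming through one dictionary. A minimal sketch with Python's `heapq` follows; the function name `top_m_candidates` and the (word, score) pair layout are assumptions, and a min-heap is only one convenient way to find the lowest-scoring stored word.

```python
import heapq


def top_m_candidates(scored_words, m):
    """Keep the m highest-scoring words from an iterable of (word, score) pairs."""
    if m <= 0:
        return []
    heap = []  # min-heap of (score, word); heap[0] is the lowest stored score
    for word, score in scored_words:
        if len(heap) < m:
            # Fewer than m candidates stored so far: store unconditionally.
            heapq.heappush(heap, (score, word))
        elif score > heap[0][0]:
            # Better than the lowest stored score: swap the lowest one out.
            heapq.heapreplace(heap, (score, word))
    # Return the candidates in descending order of score.
    return [(word, score) for score, word in sorted(heap, reverse=True)]
```

Comparing only against the lowest stored score keeps the per-word cost at O(log M), which helps when the dictionary is much larger than M.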
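[0079]
Once the setting of recognition candidates for the first dictionary has finished in this way, the next dictionary is selected as the dictionary to be recognized in steps S101 to S103, and the processing of steps S103 to S106 is performed in turn on the words in this dictionary. As a result, the top M words of this second dictionary are stored in the recognition candidate storage unit 11 as recognition candidates.
[0080]
When the above operation has been performed for all the dictionaries to be used in the speech input mode, step S102 determines that speech recognition processing has finished for all the dictionaries. At this point the recognition candidate storage unit 11 holds M candidate words for each of the dictionaries to be used in the speech input mode.
[0081]
These M recognition candidates are then displayed in step S107, for each dictionary category, on the monitor of the speech recognition device, for example. The operator selects the desired one from the displayed candidates. The word corresponding to the input speech is thereby fixed for each dictionary category.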
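[0082]
In the speech recognition operation above, M words are displayed on the monitor as recognition candidates for each dictionary category, and the operator selects among them. However, the more words are displayed as candidates, the more needless selection work is imposed on the operator. The number of words displayed should be as small as possible, and their accuracy as recognition candidates should be high.
[0083]
In the following example, therefore, the M words are not displayed as recognition candidates as they are. Instead, the number of words is narrowed further and the accuracy of the candidates is raised.
[0084]
FIG. 10 shows the configuration of this example. It differs from the example of FIG. 4 only in the word model creation unit 8 and the recognition candidate storage unit 11. The rest of the configuration is the same as in FIG. 4.
[0085]
In this example, from the M words stored per dictionary category in the recognition candidate storage unit 11 by the same processing as in the previous example, one word is selected from each dictionary category, the selected words are connected by silence models to create a new word model, and the matching between this word model and the input speech is calculated.
[0086]
FIG. 11 shows an example of a word model created by the word model creation unit 8. This word model combines the word 「がぞう」, selected from one dictionary category, with the word 「おおきい」, selected from another, out of the M words stored per dictionary category in the recognition candidate storage unit 11.
[0087]
For example, if two dictionaries are to be used in the speech input mode and M words have been set as recognition candidates for each dictionary by the processing of the previous example, the total number of word models created by selecting one word from each dictionary category is M × M. Likewise, if three dictionaries are to be used, the total number of word models is M × M × M.
[0088]
In this example, likelihood calculation and matching against the input speech signal are performed for all M^P word models created in this way (P is the number of dictionaries to be used in the speech input mode). The L word models with the highest scores are identified, and the words connected in these word models are taken as the recognition candidates.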
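Enumerating the M^P combined word models can be sketched with `itertools.product`. This is an illustration under assumptions: the function name `combined_word_models`, the representation of a word as a tuple of syllable labels, and the `"sil"` label are all hypothetical, and one silence model is placed between words and at both ends.

```python
from itertools import product


def combined_word_models(candidates_per_dictionary, silence="sil"):
    """Build one word model per combination of one candidate from each dictionary.

    candidates_per_dictionary: list of P lists, each holding M words, where a
    word is a tuple of syllable labels.
    Returns a list of (combination, states) pairs.
    """
    models = []
    for combo in product(*candidates_per_dictionary):
        states = [silence]          # leading silence model
        for word in combo:
            states.extend(word)     # reference models of the word
            states.append(silence)  # silence between words and at the end
        models.append((combo, states))
    return models
```

Each returned state sequence can then be scored against the input frames, and the L best combinations kept.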
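[0089]
When several words are connected into a word model and compared with the input speech in this way, the gap in matching scores between word models becomes large according to how many of each model's words are contained in the speech input signal: one, two, three, or none at all.
[0090]
To explain this in comparison with the previous example: there, a word model was created for a single word only, and the degree of matching (score) between it and the input speech signal was calculated. The speech input signal therefore always contains many unnecessary words besides the word in the word model, so the score of each word model does not become very large even if its word is contained in the input speech signal, and the gap in the degree of matching (score) between word models is not very large. By contrast, if word models are created from several words as in this example and compared with the input speech signal to calculate the degree of matching (score), the score gap between word models becomes large depending on whether one or two of the constituent words are present in the input speech. If the input speech signal contains all of the words, the score of the word model made up of those words is extremely high.
[0091]
Accordingly, building word models from several words as in this example yields larger score gaps between word models than building them from a single word as in the previous example, so more accurate candidate words can be offered to the operator.
[0092]
However, if word models were created by combining every word, one at a time, from all the word dictionaries used in the speech input mode, the number of word models would be enormous. Matching such a number of word models against the input speech signal would take an enormous amount of processing time and would also mean repeating wasted processing on word models formed by pointless combinations.
[0093]
This example therefore takes only the M words per dictionary category obtained in FIGS. 4 to 9, selects one word from each dictionary category, connects them to create a word model, and compares it with the input speech signal, thereby narrowing the number of final recognition candidates and raising their accuracy.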
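[0094]
The operation of this example is described below with reference to FIG. 12, for the case where two dictionaries are to be used in the speech input mode. In FIG. 12, steps S101 to S106 operate as in the previous example. That is, these steps set M words as recognition candidates for each dictionary.
[0095]
Once M words have been set as recognition candidates for each of the two dictionaries to be used, operation moves from step S102 to step S201. A candidate word (Ws11) set for the first dictionary is read out (steps S201, S202), and a candidate word (Ws21) set for the second dictionary is read out (steps S203, S204). These words (Ws11) and (Ws21) are connected by a silence model, and silence models are further connected to both ends to create a word model (step S205).
[0096]
Once the word model has been created, the likelihood calculation and score calculation (matching processing) against the input speech signal are performed for it as in the previous example (step S206). The word model is then stored with its score in the recognition candidate storage unit 11.
[0097]
When the processing of one word model has finished in this way, operation returns to step S203 and the next word (Ws22) of the second dictionary is read out. This word (Ws22) is connected in the same way with the word (Ws11) of the first dictionary, and a new word model is created (step S205).
[0098]
For the created word model, the likelihood calculation and score calculation against the input speech signal are performed as above (step S206), and the word model is stored with its score in the recognition candidate storage unit 11.
[0099]
The operation of steps S203 to S206 is repeated until the M-th word (Ws2m) set for the second dictionary has been connected with the word (Ws11) of the first dictionary, scored, and stored in the recognition candidate storage unit 11.
[0100]
When all M words set for the second dictionary have been read out and the above processing has finished, operation returns to step S201 and the next word (Ws12) set for the first dictionary is read out (steps S201, S202). By repeating the processing of steps S203 to S206 as above, this word is connected in turn with the M words of the second dictionary to create M word models, and the likelihood calculation and score calculation between each of these word models and the input speech signal are performed in turn. The calculated scores are stored in turn with their word models in the recognition candidate storage unit 11.
[0101]
The above processing is repeated until all M words set for the first dictionary have been connected with the M words of the second dictionary and processed.
[0102]
When the above processing has finished, the recognition candidate storage unit 11 holds a total of M × M word models and their scores. In step S207 the scores of these M × M word models are compared, and the top L word models are selected. The words of each dictionary contained in these top L word models are identified and displayed on the monitor as the recognition candidates for each dictionary.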
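[0103]
Although this example describes the operation for two dictionaries used in the speech input mode, it is not limited to that case. For three dictionaries, for example, one more level of steps equivalent to steps S201 and S202 (for the first dictionary) and steps S203 and S204 (for the second dictionary) may be added below step S204 in FIG. 12. As the number of target dictionaries grows, such steps are added so that all M words of each dictionary are combined.
[0104]
Also, even when there are three or more target dictionaries (for example, K), instead of selecting one word from each of the K dictionaries, J of the dictionaries (J < K) may be selected, and one word from each of the J selected dictionaries may be selected and connected.
[0105]
Further, in this example the M words set for each dictionary are combined to create M^P word models (P is the number of dictionaries). However, a null (no word) may be added as a word to the M words set for each dictionary, so that M + 1 words are set per dictionary and (M + 1)^P word models are created. In this case, a combination of nulls and words is formed by connecting the words with the nulls omitted. For example, with three target dictionaries, if the word from the first dictionary is null, the word from the second dictionary is Ws1, and the word from the third dictionary is Ws2, the combined word model is created by connecting the words Ws1 and Ws2 with a silence model and connecting silence models to both ends. With two target dictionaries, if the first dictionary gives null and the second gives Ws2, the word model is the word Ws2 with silence models connected to both ends, the same kind of word model as in FIG. 6, for example.
[0106]
Adding a null to the M words in this way allows the words that were input to be recognized correctly even when the operator did not input words for all the types and categories requested by the speech input mode. For example, suppose the speech input mode requests words of three categories A, B, and C, and the operator inputs words of categories A and B only. In the example of FIG. 12, steps S101 to S106 set M words as recognition candidates for the dictionaries of categories A, B, and C, but the M words set for dictionary C correspond to a category the operator did not input, so all of them are wrong as recognition candidates. Nevertheless, in the example of FIG. 12, steps S201 to S206 set candidate words for dictionary C as well, and these are displayed on the monitor.
[0107]
If a null is added to the M words set for dictionaries A, B, and C, the word model in which the null is selected for dictionary C scores higher than the others. That is, if the words from dictionaries A and B are Wa and Wb, the word model in this case is ○ + Wa + ○ + Wb + ○ (where ○ is the silence model), while the input speech corresponds to categories A and B. Wa therefore matches the speech portion for A and Wb matches the speech portion for B, and the overall score is large.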
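The null option can be sketched by letting each dictionary contribute one extra choice, represented here by `None`, and dropping the nulls when the words are connected. The function name `word_models_with_null`, the tuple-of-syllables word format, and the `"sil"` label are assumptions. Skipping the combination in which every dictionary gives null is a choice made in this sketch so that no model consists of silence only.

```python
from itertools import product


def word_models_with_null(candidates_per_dictionary, silence="sil"):
    """Build word models where each dictionary may contribute a word or null."""
    choices = [list(words) + [None] for words in candidates_per_dictionary]
    models = []
    for combo in product(*choices):
        words = [w for w in combo if w is not None]  # omit the nulls
        if not words:
            continue  # all-null combination: skipped in this sketch
        states = [silence]
        for word in words:
            states.extend(word)
            states.append(silence)
        models.append((combo, states))
    return models
```

With M candidates in each of P dictionaries this yields (M + 1)^P − 1 models, including the single-word models of the FIG. 6 form.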
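[0108]
With a matching method in which the score is proportional to the length of the word model, selecting a null shortens the word model, so the score must be normalized. This normalization is achieved, for example, by averaging the score over the length of the word model.
[0109]
The same applies when only a single word is targeted, as in the example of FIG. 4. The number of syllables is not uniform but differs from word to word. For example, 「がめん」 has three syllables and 「おんせい」 has four. In this case too the length of the word model varies with the number of syllables, but the normalization averages the score over the length of the word model, which corrects the score gap due to length.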
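The length normalization can be sketched as a simple average. The function name `normalized_score` is hypothetical, and dividing by the number of states in the word model (syllables plus silence models) is one assumed reading of "averaging the score over the length of the word model"; dividing by the number of frames on the route would be another.

```python
def normalized_score(total_score, model_length):
    """Average a route's likelihood sum over the word model length.

    model_length: number of states in the word model, silence models included
    (an assumption of this sketch).
    """
    if model_length <= 0:
        raise ValueError("model_length must be positive")
    return total_score / model_length
```

For instance, a three-syllable word such as 「がめん」 gives a five-state model and a four-syllable word such as 「おんせい」 gives a six-state model, so their raw sums are divided by 5 and 6 before being compared.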
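[0110]
Various examples of the present invention have been described above, but the present invention is not limited to these examples.
[0111]
For example, when a word model is created from a single word in the examples above, one silence model is added to each end of the word. Two or more may be added, and the number of silence models may differ before and after the word.
[0112]
When two or more words are connected to create a word model, the examples above place one silence model between words. Two or more may be placed, or the words may be connected directly with no silence model between them. The number of silence models may also vary with the position between words, for example two between words Wa and Wb and one between words Wb and Wc.
[0113]
In the examples above, the largest of the likelihoods of the frame features for the reference model features, or the average of the N largest, is adopted as the garbage likelihood. Instead, the K-th largest likelihood may be set as the garbage likelihood. If K is chosen so that the K-th likelihood is statistically close to the average of the N likelihoods, the same effect as using the average is obtained while the averaging is omitted.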
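The K-th-largest variant can be sketched as a rank lookup. The function name `kth_largest_likelihood` and the dict layout are assumptions, and K is taken to be 1-based and no larger than the number of reference models.

```python
def kth_largest_likelihood(frame_likelihoods, k):
    """Return the k-th largest likelihood of one frame (k is 1-based)."""
    ranked = sorted(frame_likelihoods.values(), reverse=True)
    return ranked[k - 1]
```

For a frame with likelihoods 0.9, 0.5, 0.3, and 0.1, the top-3 average is about 0.57 and the second-largest value is 0.5, so K = 2 approximates that average without the summation.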
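[0114]
In the examples above, the silence models are added by the word model creation unit 8. Instead, the words may be stored in the recognition dictionary unit 7 with the silence models already added.
[0115]
In the examples above, a null is added to the M words set as recognition candidates for each dictionary before the words are connected. In this case, if the null is selected for every dictionary, the word model consists of silence models only. The all-null word model may therefore be excluded from matching. Alternatively, if the matching score of the all-null word model ranks higher than the H-th place, the processing result for that input speech may be discarded and the operator prompted to input the speech again.
[0116]
In the examples above, the number of recognition candidates set for each dictionary is uniformly M, but the number may differ from dictionary to dictionary. The number of candidates may be set for each dictionary in advance, or it may be set for a dictionary according to the scores obtained during recognition. In the latter case, for example, a score threshold may be set and only words scoring at or above the threshold taken as recognition candidates. The number of candidates then depends on the scores and the threshold, and may be more or fewer than M.
[0117]
In the examples above, for instance in FIG. 12, the characteristic analysis and calculation in step S105 are the same as those in step S206. However, the processing in step S105 may be made coarse and the processing in step S206 precise. That is, step S105 handles many word models, so coarse processing is used to prioritize speed, whereas step S206 handles few word models, so fine processing is used to raise accuracy. This raises the overall processing speed while still yielding accurate recognition results.
[0118]
The coarse and fine processing are distinguished by changing which acoustic analysis parameters, such as the spectrum of the speech signal, the change in the spectrum, the power, and the change in the power, are processed. For example, the coarse processing uses the spectrum parameters only, while the fine processing uses the spectrum, the change in the spectrum, the power, and the change in the power. Alternatively, the number of frames extracted from the input speech signal may differ between the coarse and fine processing. For example, if the fine processing uses 100 frames, the coarse processing thins them out to 50.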
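[0119]
Various other modifications can be made to the characteristic analysis, the matching processing, and so on. The generation of the garbage model can also be modified in various ways besides taking the maximum likelihood of the frame or the average of the top N as described above.
[0120]
[Effects of the Invention]
According to the present invention, the likelihood of the silence model for a frame feature is replaced as appropriate with the garbage likelihood. Even if the input speech signal is cut out including silent portions, the influence of those portions on the matching calculation is absorbed by the silence model, and the influence of unnecessary-word portions in the speech signal is absorbed by the replacement with the garbage likelihood. Speech can therefore be recognized accurately even when the input contains unnecessary words.
[0121]
In addition, the garbage likelihood is calculated from the likelihoods of the frame features for the reference model features. Even when a word to be recognized is added to the dictionary, there is no need to recalculate and reset it separately from all the words as in the conventional example, which gives the device more flexibility when words are added or changed.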
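[Brief Description of the Drawings]
[FIG. 1] A diagram for explaining the overview of the embodiment
[FIG. 2] A diagram for explaining the overview of the embodiment
[FIG. 3] A diagram for explaining the overview of the embodiment
[FIG. 4] A diagram showing the configuration of the example
[FIG. 5] A diagram showing the storage state of the recognition dictionary unit according to the example
[FIG. 6] A diagram showing the structure of a word model according to the example
[FIG. 7] A diagram showing the configuration of the matching calculation unit according to the example
[FIG. 8] A diagram showing the method of setting the silence model likelihood according to the example
[FIG. 9] A diagram showing the operation of the example
[FIG. 10] A diagram showing the configuration of the second example
[FIG. 11] A diagram showing the structure of a word model according to the second example
[FIG. 12] A diagram showing the operation of the second example
[Description of Reference Numerals]
1 Speech input unit
2 Speech signal cut-out unit
3 Acoustic analysis unit
4 Reference model parameter unit
5 Likelihood calculation unit
6 RAM unit
7 Recognition dictionary unit
8 Word model creation unit
9 Matching calculation unit
10 Garbage likelihood calculation unit
11 Recognition candidate storage unit
[0001]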
[Technical Field of the Invention]
The present invention relates to a speech recognition device and a speech recognition method, and in particular to ones capable of accurately determining the appropriate words from speech that contains unnecessary words.
[0002]
[Prior art]
As a method of recognizing a word in a word dictionary from a voice including an unnecessary word (for example, a meaningless voice or a particle), for example, a method described in Japanese Patent Application Laid-Open No. 7-77998 is known.
[0003]
In such a conventional speech recognition method, a feature amount of an unnecessary word is first generated from the speech feature amounts of all recognition candidates (necessary words) stored in a word dictionary, and this feature amount is registered in the recognition dictionary in advance. Then, portions of the input speech signal that match the feature amount of the unnecessary word are recognized as unnecessary words, and the words so recognized are removed from the recognition result to obtain the speech recognition result for the necessary words.
[0004]
Generation of the feature amount of the unnecessary word in such a conventional method is performed as follows. First, prior to generation of the feature amount of the unnecessary word, the feature amount of the necessary word is generated. The feature amount of the necessary word is generated by inputting several types of voice feature amounts as samples for one necessary word, and acoustically analyzing and learning each of the samples to generate the feature amount of the necessary word. Next, the feature amounts of all the necessary words thus generated are averaged, and the average value is set as the feature amount of the unnecessary word.
[0005]
[Problems to be solved by the invention]
As described above, the conventional recognition method generates the speech feature amount of the unnecessary word from the speech feature amounts of all the recognition candidates (necessary words). The speech feature amount of the unnecessary word must therefore be regenerated every time a recognition candidate in the recognition dictionary is added or modified, which makes adding or deleting recognition candidates laborious.
[0006]
Also, at the time of voice recognition processing, likelihood calculation and approximate distance calculation with the input voice signal must be performed not only for the necessary words but also for the voice feature amounts of the unnecessary words, and a voice recognition processing step is added accordingly. Therefore, there is a problem that the time required for deriving the recognition result becomes long.
[0007]
Therefore, an object of the present invention is to provide a speech recognition method that can quickly perform speech recognition even when a recognition candidate is added or deleted.
[0008]
[Means for Solving the Problems]
In view of the above problems, the present invention has the following features.
[0009]
The invention of claim 1 relates to a speech recognition device comprising: acoustic analysis means for acoustically analyzing an input speech signal and extracting frame features; word model creation means for creating a word model by connecting silence models to both ends of a reference model of a word; likelihood calculation means for comparing the frame features with the features of the reference models and calculating the likelihood of the frame features for the word model; matching calculation means for calculating, based on the calculated likelihoods, the degree of matching of the word model to the input speech signal; recognition candidate setting means for setting recognition candidates according to the degree of matching; and garbage likelihood setting means for setting a garbage likelihood for the silence model, wherein the garbage likelihood setting means calculates the garbage likelihood by averaging the N largest of the likelihoods of the frame features for the reference model features, and the matching calculation means performs the matching calculation using, as the likelihood of the silence model, the garbage likelihood set by the garbage likelihood setting means.
The invention of claim 2 relates to a speech recognition device comprising: acoustic analysis means for acoustically analyzing an input speech signal and extracting frame features; word model creation means for creating a word model by connecting silence models to both ends of a reference model of a word; likelihood calculation means for comparing the frame features with the features of the reference models and calculating the likelihood of the frame features for the word model; matching calculation means for calculating, based on the calculated likelihoods, the degree of matching of the word model to the input speech signal; recognition candidate setting means for setting recognition candidates according to the degree of matching; and garbage likelihood setting means for setting a garbage likelihood for the silence model, wherein the garbage likelihood setting means calculates the garbage likelihood by averaging the N largest of the likelihoods of the frame features for the reference model features, and the matching calculation means performs the matching calculation by selecting, as the likelihood of the silence model, either the silence model likelihood calculated by the likelihood calculation means or the garbage likelihood set by the garbage likelihood setting means.
[0010]
The invention of claim 3 isThe present invention relates to a speech recognition apparatus, and relates to acoustic analysis means for acoustically analyzing an input speech signal to extract a frame feature, and word model creation means for creating a word model by connecting a silence model to both ends of a word reference model. And a likelihood calculating means for comparing the frame feature quantity with the feature quantity of the reference model to calculate the likelihood of the frame feature quantity with respect to the word model, and a likelihood calculating means for calculating the likelihood of the word model based on the calculated likelihood. Matching calculation means for calculating the degree of matching with respect to the input voice signal, recognition candidate setting means for setting recognition candidates according to the degree of matching, and garbage likelihood setting means for setting garbage likelihood for the silent model. The garbage likelihood setting means determines whether the magnitude of the likelihood is higher in the likelihood of the frame feature with respect to the reference model feature. The K-th likelihood and garbage likelihood, the matching calculation means, as the likelihood of silence model, and performing a matching operation using the garbage likelihood set by the garbage likelihood setting means.
[0011]
The invention of claim 4 isThe present invention relates to a speech recognition apparatus, and relates to acoustic analysis means for acoustically analyzing an input speech signal to extract a frame feature, and word model creation means for creating a word model by connecting a silence model to both ends of a word reference model. And a likelihood calculating means for comparing the frame feature quantity with the feature quantity of the reference model to calculate the likelihood of the frame feature quantity with respect to the word model, and a likelihood calculating means for calculating the likelihood of the word model based on the calculated likelihood. Matching calculation means for calculating the degree of matching with respect to the input voice signal, recognition candidate setting means for setting recognition candidates according to the degree of matching, and garbage likelihood setting means for setting garbage likelihood for the silent model. The garbage likelihood setting means determines whether the magnitude of the likelihood is higher in the likelihood of the frame feature with respect to the reference model feature. The K-th likelihood is set as the garbage likelihood, and the matching calculation means is set as the likelihood of the silence model, the likelihood of the silence model calculated by the likelihood calculation means, and the garbage likelihood setting means. It is characterized in that one of the garbage likelihoods is selected and a matching operation is performed..
[0012]
The invention of claim 5 is characterized in that, in addition to the features of claim 2 or 4, the matching calculation means compares the likelihood of the silence model calculated by the likelihood calculation means with the garbage likelihood set by the garbage likelihood setting means, and selects the larger of the two.
[0013]
The invention of claim 6 relates to a speech recognition method comprising the steps of: acoustically analyzing an input speech signal to extract a frame feature amount; creating a word model by connecting a silence model to both ends of a reference model of a word; comparing the frame feature amount with the feature amount of the reference model to calculate the likelihood of the frame feature amount with respect to the word model; calculating, based on the calculated likelihood, the degree of matching of the word model with the input speech signal; setting recognition candidates according to the degree of matching; and setting a garbage likelihood for the silence model, wherein the garbage likelihood setting step calculates the garbage likelihood by averaging the N largest of the likelihoods of the frame feature amount with respect to the reference model feature amounts, and the matching calculation step performs the matching calculation using, as the likelihood of the silence model, the garbage likelihood set in the garbage likelihood setting step.
[0014]
The invention of claim 7 relates to a speech recognition method comprising the steps of: acoustically analyzing an input speech signal to extract a frame feature amount; creating a word model by connecting a silence model to both ends of a reference model of a word; comparing the frame feature amount with the feature amount of the reference model to calculate the likelihood of the frame feature amount with respect to the word model; calculating, based on the calculated likelihood, the degree of matching of the word model with the input speech signal; setting recognition candidates according to the degree of matching; and setting a garbage likelihood for the silence model, wherein the garbage likelihood setting step calculates the garbage likelihood by averaging the N largest of the likelihoods of the frame feature amount with respect to the reference model feature amounts, and the matching calculation step performs the matching calculation by selecting, as the likelihood of the silence model, either the likelihood of the silence model calculated in the likelihood calculation step or the garbage likelihood set in the garbage likelihood setting step.
[0015]
The invention of claim 8 relates to a speech recognition method comprising the steps of: acoustically analyzing an input speech signal to extract a frame feature amount; creating a word model by connecting a silence model to both ends of a reference model of a word; comparing the frame feature amount with the feature amount of the reference model to calculate the likelihood of the frame feature amount with respect to the word model; calculating, based on the calculated likelihood, the degree of matching of the word model with the input speech signal; setting recognition candidates according to the degree of matching; and setting a garbage likelihood for the silence model, wherein the garbage likelihood setting step sets, as the garbage likelihood, the K-th largest of the likelihoods of the frame feature amount with respect to the reference model feature amounts, and the matching calculation step performs the matching calculation using, as the likelihood of the silence model, the garbage likelihood set in the garbage likelihood setting step.
The invention of claim 9 relates to a speech recognition method comprising the steps of: acoustically analyzing an input speech signal to extract a frame feature amount; creating a word model by connecting a silence model to both ends of a reference model of a word; comparing the frame feature amount with the feature amount of the reference model to calculate the likelihood of the frame feature amount with respect to the word model; calculating, based on the calculated likelihood, the degree of matching of the word model with the input speech signal; setting recognition candidates according to the degree of matching; and setting a garbage likelihood for the silence model, wherein the garbage likelihood setting step sets, as the garbage likelihood, the K-th largest of the likelihoods of the frame feature amount with respect to the reference model feature amounts, and the matching calculation step performs the matching calculation by selecting, as the likelihood of the silence model, either the likelihood of the silence model calculated in the likelihood calculation step or the garbage likelihood set in the garbage likelihood setting step.
[0016]
The invention of claim 10 is characterized in that, in addition to the features of claim 7 or the above-mentioned features, the matching calculation step compares the likelihood of the silence model calculated in the likelihood calculation step with the garbage likelihood set in the garbage likelihood setting step, and selects the larger of the two.
[0021]
The features of the present invention and the effects thereof will be apparent by referring to the following embodiments.
[0022]
BEST MODE FOR CARRYING OUT THE INVENTION
First, an outline of a voice recognition device and a voice recognition method according to the present embodiment will be described with reference to FIGS.
[0023]
When a voice is input to the voice recognition device, a voice signal to be recognized is cut out from the voice input signal. This extraction is performed, for example, by monitoring the power of the audio input signal.
[0024]
That is, when a voice is input to the microphone, the power of the input voice signal from the microphone rises from the silence level. The section from this rise until the level of the audio signal next returns to the silence level is the section of the audio signal that should actually be recognized. If this section is cut out exactly as it is, however, the beginning or end of the audio signal to be recognized may be cut off. Therefore, to ensure that the whole section to be recognized is captured, the audio signal is usually cut out so as to include a certain length of silent signal before and after the section.
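The power-based cut-out described above can be sketched as follows. This is only a minimal illustration: the function name `cut_out_speech`, the per-frame power list, the threshold `silence_level`, and the margin length `margin` are assumptions introduced here and are not specified in the embodiment.

```python
def cut_out_speech(frame_power, silence_level, margin):
    """Return (start, end) frame indices of the cut-out section (end exclusive).

    frame_power   : per-frame power of the input signal (hypothetical input)
    silence_level : power at or below which a frame is treated as silence
    margin        : number of silent frames kept before and after the speech
    """
    # First frame whose power rises above the silence level.
    start = next((i for i, p in enumerate(frame_power) if p > silence_level), None)
    if start is None:
        return None  # no speech found
    # First later frame whose power has returned to the silence level.
    end = next(
        (i for i in range(start, len(frame_power)) if frame_power[i] <= silence_level),
        len(frame_power),
    )
    # Widen the section so that the beginning and end of the speech are not cut off.
    return max(0, start - margin), min(len(frame_power), end + margin)
```

The margin is what later makes it necessary to add silence models before and after each recognition candidate.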
[0025]
For example, in FIG. 1 to FIG. 3, the cut-out audio signal is “○○ e-, gazou ga ookii ○○”. Here, ○ is a silent signal. The utterance actually input to the microphone means “er, the image is large”. Of this, “e-” is a meaningless unnecessary word that speakers often utter when starting voice input. “gazou” (image) is a recognition candidate (word) registered in the recognition dictionary, “ga” is a particle (unnecessary word) not in the recognition dictionary, and “ookii” (large) is a recognition candidate (word) registered in the recognition dictionary.
[0026]
The audio signal input and cut out as described above is subjected to acoustic analysis at regular intervals (frame periods), and acoustic feature values (hereinafter, referred to as “frame feature values”) are extracted. Each extracted frame feature value is compared with the feature value of the reference syllable, and the degree of approximation with each reference syllable, that is, the likelihood is calculated.
[0027]
Here, the memory of the voice recognition device stores in advance the reference syllables, that is, the fifty basic sounds (a, i, u, e, o, etc.), voiced sounds (ga, gi, gu, ge, go, etc.), semi-voiced sounds (pa, pi, pu, pe, po, etc.), and contracted sounds (gya, gyu, gyo, etc.), together with the feature amount of each syllable.
[0028]
The frame feature amounts extracted by acoustic analysis at the frame period are compared with the feature amounts of all the reference syllables stored in the memory, and the likelihood is calculated for each reference syllable. For example, in FIG. 1, the frame feature amounts are “○”, “○”, “e”, “e”, “ga”, “zo”, “u”, “ga”, “o”, “o”, “ki”, “i”, “○”, and “○”, and the feature amounts of these frames and of the reference syllables are sequentially compared to calculate the likelihoods.
[0029]
In the block shown in the lower part of FIG. 1, the leftmost column lists the above-mentioned reference syllables, and the numbers in the columns to its right are the likelihoods of each frame feature amount with respect to each of these reference syllables. For example, in FIG. 1, of the audio signal to be recognized (the audio signal cut out so as to include silence before and after), the first frame feature amount corresponds to part of the “○ (silence signal)” portion. This feature amount is sequentially compared with the feature amounts of all the reference syllables, and the likelihood for each reference syllable is calculated. In FIG. 1, the likelihood for the reference syllable “a” is 0.1, the likelihood for the reference syllable “i” is 0.1, ..., and the likelihood for the reference syllable “silence” is 0.9.
[0030]
When the calculation of the likelihoods for the first frame feature amount is completed in this way, the likelihoods with respect to the reference syllables are calculated for the next frame feature amount “○ (silence signal)”. Thereafter, the likelihoods with respect to the reference syllables are similarly calculated for each of the subsequent frame feature amounts “e”, “e”, “ga”, ..., up to the final “○ (silence)”.
[0031]
The likelihoods calculated in this manner are written to a memory (RAM: Random Access Memory) so that each likelihood is mapped onto a matrix whose axes are the reference syllables and the frames. That is, the likelihoods on the matrix shown in the lower part of FIG. 1 are mapped and stored in the memory as they are.
[0032]
As described above, when the likelihood for each reference syllable is calculated and written into the memory, the degree of matching of the speech input signal to one recognition candidate in the word dictionary is calculated.
[0033]
The upper block in FIG. 1 shows the score (matching degree) of the likelihood for the recognition candidate “large”.
[0034]
As described above, since the speech signal is cut out so as to include a silent part, when calculating the matching of the speech signal with a recognition candidate, the recognition candidate syllable is formed by adding a silent syllable before and after the recognition candidate. That is, as shown in FIG. 1, if the recognition candidate is “ookii” (large), the syllable “○ (silence)” is added before and after the syllables “o”, “o”, “ki”, and “i” that constitute “ookii”, and the concatenation of these syllables is taken as the recognition candidate syllable.
[0035]
When the configuration of the recognition candidate syllable is made in this way, the likelihood of the frame feature of the speech signal is allocated to each syllable of the recognition candidate syllable.
[0036]
First, the likelihoods between the frame feature amount first extracted from the audio signal (the feature amount of the “○ (silence signal)” portion) and each syllable of the recognition candidate syllable are read from the RAM. That is, of the likelihoods of the audio signal with respect to the reference syllables stored in the RAM (the likelihoods allocated to the lower block in FIG. 1), the likelihoods of this first frame feature amount “○ (silence signal)” with respect to the syllables “○ (silence)”, “o”, “o”, “ki”, “i”, and “○ (silence)” are read from the RAM, and these likelihoods are allocated, in the upper block in FIG. 1, to the columns where the first frame feature amount and each of these recognition candidate syllables intersect.
[0037]
Next, the likelihoods between the frame feature amount extracted second from the audio signal (the feature amount of the “○ (silence signal)” portion) and each syllable are read from the RAM and, in the same manner as above, allocated to the corresponding columns of the upper block in FIG. 1.
[0038]
Hereinafter, similarly, the likelihood from the third extracted frame feature amount to the last extracted feature amount is assigned to the upper block in FIG.
[0039]
Actually, the likelihood assigned to the upper block in FIG. 1 is stored in a predetermined area in the RAM, using each frame and the recognition candidate syllable as a correlation axis.
[0040]
When the setting and allocation of the likelihoods of the speech signal to the recognition candidate syllable are completed as described above, the degree of matching of the speech signal with the recognition candidate syllable is calculated using the group of likelihoods allocated in this manner.
[0041]
To calculate the matching degree, first, a route that proceeds column by column from the lower left corner column to the upper right corner column in FIG. 1 is set, and the total value of the likelihoods in the columns on the route is calculated. Such a route is set, for example, such that the column preceding a given column is either the column to its left or the column diagonally to its lower left. Alternatively, the route may be set such that the column preceding a given column is the column to its left, diagonally to its lower left, or directly below it. There are various formulations of such DP matching; for example, those described in “Digital Audio Processing (1st printing)”, Tokai University Press, pages 167 to P167, can be used.
[0042]
In the present embodiment, the description uses the likelihood; however, the log likelihood, or a value based on the reciprocal of the distance (the absolute value of the difference between the frame feature amount and the feature amount of each syllable), may be used instead. If transition probabilities and the like are added to this, the well-known HMM (Hidden Markov Model) can be used.
[0043]
The total value of the likelihoods is calculated for all the routes that can be set in this way, and the largest of the total values calculated for the routes is taken as the degree of matching (score) of the speech signal with the recognition candidate.
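The route search described above can be sketched with dynamic programming rather than by enumerating every route. The sketch below assumes the first route constraint (the preceding column is either to the left or diagonally to the lower left) and a plain sum of likelihoods; the function name `matching_score` and the row and column layout of `likelihood` are assumptions made for illustration.

```python
def matching_score(likelihood):
    """Largest total likelihood over all admissible routes.

    likelihood[s][t] is the likelihood of frame t for the s-th syllable of
    the recognition candidate syllable (s = 0 is the bottom row of the
    upper block, t = 0 is the leftmost column).
    """
    n_syllables = len(likelihood)
    n_frames = len(likelihood[0])
    neg_inf = float("-inf")
    # best[s][t]: best total of any route from the lower left corner to (s, t).
    best = [[neg_inf] * n_frames for _ in range(n_syllables)]
    best[0][0] = likelihood[0][0]
    for t in range(1, n_frames):
        for s in range(n_syllables):
            stay = best[s][t - 1]                                  # from the left
            advance = best[s - 1][t - 1] if s > 0 else neg_inf      # from the lower left
            prev = max(stay, advance)
            if prev > neg_inf:
                best[s][t] = prev + likelihood[s][t]
    # Score of the best route ending in the upper right corner.
    return best[n_syllables - 1][n_frames - 1]
```

Replacing the likelihoods with log likelihoods leaves the code unchanged, since only sums and maxima are used.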
[0044]
In the example of FIG. 1, the recognition candidate is “ookii” (large) and the voice signal includes an “ookii” portion. Therefore, in the upper block of FIG. 1, the columns where the frame feature amounts of this portion and the recognition candidate syllables intersect have high likelihoods, and the degree of matching (score) for the recognition candidate “ookii” becomes large under the influence of the likelihoods in these columns.
[0045]
On the other hand, when the matching degree (score) is calculated for a recognition candidate not included in the audio signal (for example, “small”), the likelihood of each column in the upper block in FIG. 1 is low, and therefore the degree of matching (score) is also low.
[0046]
Therefore, the degrees of matching (scores) calculated for the recognition candidates are compared with each other, and the top several recognition candidates are selected in descending order of score and set as provisional recognition candidates. It is highly likely that the proper recognition candidate is included among them.
[0047]
Then, for example, all of the provisional recognition candidates are displayed on the monitor of the voice recognition device, and the operator is allowed to select an appropriate one from among them, thereby to determine the recognition result.
[0048]
Alternatively, instead of such a method, recognition candidates may be classified in advance by content type (for example, a size category such as “large” or “small”, and an information-type category such as “image” or “sound”) to construct separate dictionaries. The top several (for example, five) recognition candidates by matching degree (score) are extracted from each dictionary, and combinations of the recognition candidates extracted from the dictionaries are created (for example, if five recognition candidates are extracted from each of two dictionaries, 5 × 5 = 25 combinations). The recognition candidates in each combination are connected via silence, and silence is added before and after, to generate a recognition candidate syllable (for example, for the combination of “gazou” (image) and “ookii” (large), the recognition candidate syllable “○○ gazou ○○ ookii ○○”). The degree of matching (score) between this recognition candidate syllable and the above speech signal is then calculated again in the same manner as described above, and the combination with the highest score is determined as the recognition result.
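The generation of combined recognition candidate syllables can be sketched as follows. The function name, the representation of each candidate as a list of syllables, and the use of a single silence entry between and around the words (instead of the doubled “○○” written above) are assumptions made to keep the illustration short.

```python
from itertools import product

SILENCE = "○"  # hypothetical marker for the silence syllable


def combined_candidate_syllables(per_dictionary_candidates):
    """Build one recognition candidate syllable per combination of words.

    per_dictionary_candidates: one list per dictionary, each containing the
    extracted candidates as lists of syllables.
    """
    results = []
    for combination in product(*per_dictionary_candidates):
        syllables = [SILENCE]          # silence before the first word
        for word in combination:
            syllables.extend(word)
            syllables.append(SILENCE)  # silence between words and after the last
        results.append(syllables)
    return results


# Two candidates from an information-type dictionary, three from a size dictionary.
info_type = [["ga", "zo", "u"], ["o", "n", "se", "i"]]
size = [["o", "o", "ki", "i"], ["chi", "i", "sa", "i"], ["fu", "tsu", "u"]]
candidates = combined_candidate_syllables([info_type, size])
```

Each resulting syllable list would then be scored against the speech signal in the same way as a single-word candidate.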
[0049]
As described above, when a silence syllable is added before and after a recognition candidate to generate a recognition candidate syllable, and the degree of matching between this recognition candidate syllable and the audio signal is determined, the effect of the silent signal portion is absorbed by the recognition candidate syllable “○ (silence)” added before and after the recognition candidate, even though the speech signal is cut out so as to include the silent signal portion as described above. A relatively accurate recognition result can therefore be obtained.
[0050]
However, when an unnecessary word is added to the voice signal, the feature amount of the recognition candidate syllable “○ (silence)” and the feature amount of the unnecessary word are usually dissimilar. The effect of the unnecessary word therefore cannot be absorbed by the recognition candidate syllable “○ (silence)”, and the accuracy of determining the recognition candidate is reduced.
[0051]
Therefore, in the present embodiment, the method of assigning the likelihood to the recognition candidate syllable “○ (silence)” is improved so that not only the effect of the silent part included in the audio signal but also the effect of unnecessary words in the audio signal can be absorbed at the same time. More specifically, the setting of the likelihood of each frame feature amount, extracted by acoustically analyzing the cut-out audio signal at the frame period, with respect to the feature amount of the recognition candidate syllable “○ (silence)” (hereinafter referred to as “silence model likelihood”) is improved.
[0052]
More specifically, as shown in FIG. 2, the highest of the likelihoods of each frame feature amount with respect to the reference syllables is set as the silence model likelihood. That is, in the upper block of FIG. 2, the silence model likelihoods of the frame feature amounts “e”, “e”, “ga”, “zo”, “u”, “ga”, “o”, “o”, “ki”, and “i” are assigned 0.9, 0.9, 0.9, 0.9, 0.9, 0.8, 0.9, and 1.0, respectively, which are the maximum values of the likelihoods of these frame feature amounts with respect to the reference syllables.
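As a small illustration of this setting, the sketch below takes, for each frame, the maximum likelihood over all reference syllables and uses it as the silence model likelihood. The function name, the dictionary-per-frame layout, and the numbers are hypothetical and are not taken from FIG. 2.

```python
def silence_likelihood_by_max(frame_likelihoods):
    """Per-frame silence model likelihood set to the maximum likelihood.

    frame_likelihoods: one dict per frame mapping each reference syllable
    to the likelihood of that frame for the syllable.
    """
    return [max(column.values()) for column in frame_likelihoods]


# Hypothetical likelihoods for a silent frame and for a frame of "e".
frames = [
    {"a": 0.1, "i": 0.1, "e": 0.1, "silence": 0.9},
    {"a": 0.2, "i": 0.3, "e": 0.9, "silence": 0.1},
]
silence_row = silence_likelihood_by_max(frames)
```

With this setting the silence row is as large as the best-matching syllable in every frame, which is why unnecessary words no longer lower the score.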
[0053]
When the likelihoods are assigned in this manner, the route with the highest matching degree (score) first passes through the silence model likelihood columns in the upper block of FIG. 2, proceeds diagonally to the upper right at the “o”, “ki”, and “i” portions, and then again passes through the silence model likelihood columns. That is, in the cut-out audio signal, the silence model likelihood is set to the highest value for the “○”, “○”, “ga”, “zo”, “u”, “ga” portion and for the “○”, “○” portion following “ookii” (large), so the route with the highest total likelihood usually passes through the silence model likelihood columns there.
[0054]
Therefore, even if the cut-out voice includes unnecessary words in addition to the silent signal portion, all the likelihood disturbances caused by the silent signal portion and the unnecessary words are absorbed by the silence model likelihood.
[0055]
Incidentally, in the above embodiment, the maximum value of the likelihoods of each frame feature amount with respect to the reference syllables is set as the silence model likelihood. With this setting, however, there is a disadvantage that the likelihood of the portion that should be emphasized, that is, the portion corresponding to the recognition candidate, is not emphasized.
[0056]
For example, in the upper block of FIG. 2, the columns on the route indicated by the arrow are the portion where the recognition candidate and the speech signal match, so their likelihoods should be sufficiently emphasized compared with the likelihoods in the other columns. However, for the section of the columns on the arrow, as described above, the silence model likelihood is set to the same value as these likelihoods. For this reason, the likelihoods in the columns on the arrow, which should greatly affect the matching degree (score), are not emphasized much, and as a result the accuracy of the recognition result is easily affected by disturbances and the like.
[0057]
Therefore, to remedy this inconvenience, in the embodiment of FIG. 3, the average value of the top N likelihoods among the likelihoods of each frame feature amount with respect to the reference syllables (hereinafter, “garbage likelihood”) is calculated, and when the garbage likelihood is larger than the silence model likelihood, the silence model likelihood is replaced with the garbage likelihood.
[0058]
When the silence model likelihood is replaced in this way, as shown in FIG. 3, the likelihoods in the columns on the arrow route become several steps larger than the silence model likelihood, so the likelihoods in the columns on the arrow route, which should be emphasized, are effectively emphasized. In addition, the silence model likelihood of the silence portion of the speech signal is appropriately emphasized, and the silence model likelihood of the unnecessary word portion (including the “gazou” (image) portion, which is not the recognition target) is also properly emphasized, so the effects of the silence portion and the unnecessary word portion can be effectively absorbed by the silence model likelihood of those sections.
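The top-N averaging and the replacement rule of the embodiment of FIG. 3 can be sketched as follows. The function names, the `"silence"` key, and the example likelihoods are assumptions for illustration; the value of N is left as a parameter because the embodiment does not fix it here.

```python
def garbage_likelihood(column, n):
    """Average of the n largest likelihoods of one frame over all reference syllables."""
    top = sorted(column.values(), reverse=True)[:n]
    return sum(top) / len(top)


def effective_silence_likelihood(column, n, silence_key="silence"):
    """Silence model likelihood after replacement by the garbage likelihood.

    The garbage likelihood is used only when it is larger than the
    likelihood calculated for the silence model itself.
    """
    return max(column[silence_key], garbage_likelihood(column, n))


# Hypothetical frame of the unnecessary word "e" and a genuinely silent frame.
frame_e = {"a": 0.1, "e": 0.9, "i": 0.3, "silence": 0.1}
frame_silent = {"a": 0.1, "e": 0.1, "i": 0.1, "silence": 0.9}
```

For `frame_e` with N = 2 the garbage likelihood is the mean of 0.9 and 0.3, which stays below the best syllable likelihood of 0.9, so a matching syllable on the route remains emphasized relative to the silence row. For `frame_silent` the calculated silence likelihood is kept.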
[0059]
The above is the outline of the present embodiment. Hereinafter, various examples illustrating the present embodiment in further detail will be described.
[0060]
FIG. 4 shows a block diagram of this embodiment. In the figure, reference numeral 1 denotes an audio input unit such as a microphone, and 2 denotes an audio signal cutout unit that cuts out an audio signal from an audio input signal from the audio input unit. As described above, the audio signal cutout unit 2 monitors the power of the audio input signal, and cuts out the audio signal so as to include a silent signal before and after.
[0061]
Reference numeral 3 denotes an acoustic analysis unit that performs acoustic analysis on the cut-out audio signal at predetermined frame periods, and extracts characteristic parameters (hereinafter, referred to as “frame characteristic parameters”). The frame feature parameters include, for example, a linear prediction coefficient, an LPC cepstrum, and energy for each frequency band. Since such acoustic analysis is already well known, a detailed description is omitted here. Note that such a frame feature parameter is synonymous with the frame feature amount in the above embodiment.
[0062]
Reference numeral 4 denotes a reference model parameter section that stores acoustic characteristic parameters for each reference model. The reference model parameter section 4 performs acoustic analysis of the reference model by the same method as the above-described acoustic analysis section 3 and stores the parameters as reference parameters of each model. Here, the reference model corresponds to, for example, a reference syllable in the above embodiment. Such a reference model includes a silent model as described in the above embodiment. Such a reference model may be a reference syllable as in the above embodiment, or may be a reference phoneme instead. In addition, the feature parameters of each whole word can be used as a reference model. Note that Hidden Markov Model or the like can be used as the reference model. Further, the reference model parameters can be represented by a discrete distribution, a continuous distribution, or the like.
[0063]
Reference numeral 5 denotes a likelihood calculation unit which compares the frame feature parameter for each predetermined frame period extracted by the acoustic analysis unit 3 with the feature parameter of each reference model in the reference model parameter section 4, and calculates the likelihood between the two. As a method of calculating the likelihood, for example, a well-known method described in Chapter 3 of “Speech Recognition by Stochastic Model” issued by the Institute of Electronics, Information and Communication Engineers can be used.
[0064]
A RAM unit 6 stores the likelihood for each frame calculated by the likelihood calculating unit 5 in association with each reference model. For example, the likelihoods shown in the form of a matrix in the lower part of FIGS. 1 to 3 shown in the above embodiment are mapped on the RAM and stored.
[0065]
Reference numeral 7 denotes a recognition dictionary, which stores words as recognition candidates. As shown in FIG. 5, a plurality of recognition dictionaries are prepared in the recognition dictionary section, which are divided into categories such as "size" and "type of information".
[0066]
Reference numeral 8 denotes a word model creation unit that connects the reference models of a word to be recognized and evaluated and adds a silence model before and after them to create a word model. For example, as shown in FIG. 6, if the word to be recognized is “ookii” (large), the created word model is “○ ookii ○” (○ is a silence model).
[0067]
Reference numeral 9 denotes a matching calculation unit which refers to the likelihood for each reference model stored in the RAM unit 6 as described above and the garbage likelihood from a garbage likelihood calculation unit 10 described later, and calculates the degree of matching (score) for the word model. The matching degree (score) is calculated, for example, as described in the above embodiment: a matrix is constructed by mapping the likelihoods between each reference model (including the silence model) constituting the word model and each frame feature parameter extracted from the audio signal (see the upper part of FIGS. 1 to 3), and the highest of the total likelihood scores over the various routes that go from the lower left corner to the upper right corner is taken as the matching degree (Viterbi matching method).
[0068]
Here, as the likelihood of the silence model stored in the RAM, either of the following methods is adopted: as in the above embodiment, the highest likelihood in the group of likelihoods of each frame feature parameter, extracted in the frame period, with respect to the reference parameters is used as the likelihood of the silence model; or the likelihood of the silence model is replaced by the average value of the top N likelihoods in that group. The likelihood calculated and substituted in this manner is the garbage likelihood calculated by the garbage likelihood calculation unit 10.
[0069]
Reference numeral 10 denotes a garbage likelihood calculation unit that calculates the garbage likelihood by referring to the likelihood for each reference model stored in the RAM unit 6. As the garbage likelihood, either of the following methods is adopted: as in the above embodiment, the highest likelihood in the group of likelihoods of the frame feature parameter, extracted in the frame period, with respect to the reference parameters is set as the garbage likelihood; or the average value of the top N likelihoods of the frame feature parameter with respect to the reference parameters is used as the garbage likelihood.
[0070]
Reference numeral 11 denotes a recognition candidate storage unit, which compares the matching degrees (scores) calculated by the matching calculation unit 9 for each word, and stores the M words with the highest scores as recognition candidate words. Here, M words are stored as recognition candidates for each dictionary (category) such as “size” and “information type”. The recognition candidate words stored in this way may be displayed as they are so that the operator can select the intended one, or, as described later, voice recognition processing may be performed again on the M recognition candidates.
[0071]
FIG. 7 is a block diagram showing the details of the matching calculation unit 9. Reference numeral 91 denotes a silence model likelihood determination unit which compares the likelihood of the silence model stored in the RAM unit 6 with the garbage likelihood from the garbage likelihood calculation unit 10 to determine the likelihood of the silence model.
[0072]
FIG. 8 shows the method of determining the likelihood of the silence model in the silence model likelihood determination unit 91. For the likelihood of each reference model of the word model, it is determined in step S1 whether or not the reference model is the silence model. When it is determined that the model is not the silence model, the likelihood of the model is stored in the RAM unit 6 as the likelihood of the word to be recognized.
[0073]
When it is determined in step S1 that the likelihood is that of the silence model, it is determined in steps S2 and S3 which of the likelihood of the silence model and the garbage likelihood from the garbage likelihood calculation unit 10 is larger, and if the garbage likelihood is larger, the likelihood of the silence model is replaced with the garbage likelihood in steps S5 and S6. For this replacement, the likelihood of the silence model in the RAM unit 6 may be rewritten to the garbage likelihood, or the likelihood of the silence model in the RAM unit 6 may be left unchanged and the garbage likelihood used for the silence model only at the time of the calculation in step 9.
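The second option above, in which the stored likelihoods are left unchanged and the garbage likelihood is applied only when the matrix for the word model is assembled, can be sketched as follows. The function name, the layout of `ram` as a mapping from reference model to per-frame likelihoods, and the `"silence"` key are assumptions for illustration.

```python
def word_model_likelihoods(ram, word_model, garbage, silence_key="silence"):
    """Assemble the likelihood rows for a word model without rewriting `ram`.

    ram        : dict mapping each reference model to its per-frame likelihoods
    word_model : sequence of reference models, silence model at both ends
    garbage    : per-frame garbage likelihoods
    """
    rows = []
    for model in word_model:
        row = list(ram[model])  # copy, so the stored likelihoods stay intact
        if model == silence_key:
            # Silence model: use whichever of the two likelihoods is larger.
            row = [max(stored, g) for stored, g in zip(row, garbage)]
        rows.append(row)
    return rows


# Hypothetical two-frame example.
ram = {"silence": [0.9, 0.1], "o": [0.1, 0.8]}
garbage = [0.5, 0.6]
rows = word_model_likelihoods(ram, ["silence", "o", "silence"], garbage)
```

Leaving `ram` untouched means the same stored likelihoods can be reused for every word in the dictionary.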
[0074]
The speech recognition operation in the above embodiment will be described with reference to FIG.
[0075]
When the operator inputs a voice in a predetermined voice input mode, the dictionaries to be used in that mode are selected from the various dictionaries stored in the recognition dictionary unit 7, and one of these dictionaries is set as the recognition target dictionary (steps S101 and S102). When the recognition target dictionary is set, one of the words stored in the dictionary is read as the word to be recognized (W1) (steps S103 and S104). This word (W1) is then compared with the input speech signal as described above, and likelihood calculation and score calculation (matching processing) for word recognition are performed (step S105).
[0076]
When the score has been calculated for the word (W1) read from the dictionary, it is compared with the score of the lowest-scoring word among the M words (Ws1, Ws2, ..., Wsm) stored in the recognition candidate storage unit 11 by the preceding processing. If the score is higher, the word (W1) is stored together with its score in place of that previously stored word. At this point, however, the word (W1) is the first word read from the dictionary, so the recognition candidate storage unit 11 does not yet store any recognition candidate word. Therefore, the word (W1) is stored as is in the recognition candidate storage unit 11 together with its score (step S106).
[0077]
When the processing of the word (W1) is completed, the process returns to step S103, where the next word (W2) is read from the recognition dictionary, and the same processing as in steps S104 to S106 is performed. Since fewer than M recognition candidates are stored in the recognition candidate storage unit 11 until M words have been read from the dictionary, the words read from the dictionary are stored one after another in the recognition candidate storage unit 11 together with their scores. When the (M + 1)-th word is read from the dictionary, the score of this word (Wm+1) is compared with the scores of the M words stored in the recognition candidate storage unit 11; if it is larger than the lowest of them, this word (Wm+1) and its score are stored in the recognition candidate storage unit 11, and the word with the lowest score among the M previously stored words is deleted from the recognition candidate storage unit 11 together with its score.
[0078]
When the above processing is performed for all the words stored in the dictionary, it is determined in step S104 that the setting of the recognition candidates for the dictionary has been completed, and the processing returns to step S101. At this time, among the words stored in the dictionary, the top M words having a high score with respect to the speech input signal are stored in the recognition candidate storage unit 11 as recognition candidates.
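
The candidate-replacement procedure of steps S103 to S106 can be sketched as follows. This is only an illustrative sketch: the function name `top_m_candidates`, the use of a min-heap, the scorer `score_fn`, and the toy scores are assumptions and do not appear in the specification (the word "moji" is also invented for the example).

```python
import heapq
from typing import Callable, Iterable, List, Tuple


def top_m_candidates(words: Iterable[str],
                     score_fn: Callable[[str], float],
                     m: int) -> List[Tuple[float, str]]:
    """Keep the M highest-scoring words, replacing the lowest-scoring
    stored candidate whenever a better word is read from the dictionary."""
    store: List[Tuple[float, str]] = []    # min-heap: store[0] has the lowest score
    for word in words:                     # steps S103/S104: read the next word
        score = score_fn(word)             # step S105: matching (hypothetical scorer)
        if len(store) < m:                 # fewer than M stored: store directly
            heapq.heappush(store, (score, word))
        elif score > store[0][0]:          # beats the current lowest score
            heapq.heapreplace(store, (score, word))
    return sorted(store, reverse=True)     # highest score first


# Made-up scores standing in for real matching results.
toy_scores = {"gamen": 0.62, "onsei": 0.35, "gazou": 0.91, "moji": 0.10}
candidates = top_m_candidates(toy_scores, lambda w: toy_scores[w], m=2)
print(candidates)
```

With M = 2, the two stored candidates after the whole dictionary has been read are "gazou" and "gamen", the two words with the highest toy scores.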
[0079]
When the setting of the recognition candidates for the first dictionary is completed as described above, the next dictionary is selected as the recognition target dictionary in steps S101 to S103, and the words in this dictionary are processed in sequence in steps S103 to S106. As a result, the top M words of the second dictionary are also stored in the recognition candidate storage unit 11 as recognition candidates.
[0080]
When the above operation is performed for all dictionaries to be used in the voice input mode, it is determined in step S102 that the speech recognition processing for all dictionaries has been completed. At this time, the recognition candidate storage unit 11 stores M words for each dictionary as recognition candidates for all dictionaries to be used in the voice input mode.
[0081]
Then, in step S107, the M recognition candidates are displayed for each dictionary section, for example, on a monitor of a voice recognition device. The operator selects a desired one from the recognition candidates displayed on the monitor. As a result, the word corresponding to the input voice is determined for each dictionary section.
[0082]
In the above speech recognition operation, M words are displayed on the monitor as recognition candidates for each dictionary section, and the operator is allowed to select one. However, when the number of words displayed as recognition candidates is large, the operator is forced to perform useless selection operations accordingly. It is preferable that the number of words to be displayed is as small as possible and that the accuracy of the word as a recognition candidate is high.
[0083]
Therefore, in the following embodiment, the M words are not displayed as recognition candidates as they are; instead, the number of words is further reduced and their accuracy as recognition candidates is increased.
[0084]
FIG. 10 shows the configuration of this embodiment. The configuration of FIG. 10 is different from the embodiment of FIG. 4 only in the configuration of the word model creation unit 8 and the recognition candidate storage unit 11, and the other configuration is the same as the configuration of FIG.
[0085]
In the present embodiment, one word is selected from each dictionary section among the M words stored for each dictionary section in the recognition candidate storage unit 11 by the same processing as in the above-described embodiment, the selected words are linked via silence models to create a new word model, and the degree of matching between this word model and the input speech is calculated.
[0086]
FIG. 11 shows an example of a word model created by the word model creation unit 8. In this word model, the word “gazou” is selected from one dictionary section and the word “large” from another dictionary section, out of the M words stored for each dictionary section in the recognition candidate storage unit 11, and the two are combined.
[0087]
For example, suppose there are two dictionaries to be used in the voice input mode and M words have been set as recognition candidates for each dictionary by the processing of the above-described embodiment. The total number of word models created by selecting one word from each dictionary section is then M × M. Similarly, when there are three dictionaries to be used in the voice input mode, the total number of word models is M × M × M.
[0088]
In the present embodiment, the likelihood calculation and matching processing with the input speech signal are performed for all of the word models, which number M raised to the P-th power (P is the number of dictionaries to be used in the voice input mode), the L word models with the highest scores are identified, and each word linked in these word models is set as a recognition candidate.
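
As a rough illustration of this enumeration, the following sketch scores every combination of one candidate per dictionary and keeps the L best. The function name `top_l_word_models`, the scorer `score_fn`, the toy scoring rule, and the word "small" are assumptions made for the example; the actual matching processing is the one described in the embodiment.

```python
import heapq
from itertools import product
from typing import Callable, List, Sequence, Tuple


def top_l_word_models(candidates_per_dict: Sequence[Sequence[str]],
                      score_fn: Callable[[Tuple[str, ...]], float],
                      top_l: int) -> List[Tuple[float, Tuple[str, ...]]]:
    """Score every combination of one candidate per dictionary and
    return the L best (score, words) pairs, best first."""
    # With M candidates in each of P dictionaries this yields M**P combinations.
    combos = product(*candidates_per_dict)
    scored = ((score_fn(combo), combo) for combo in combos)
    return heapq.nlargest(top_l, scored)


# Toy scorer (assumption): counts how many words of the model were "spoken".
spoken = {"gazou", "large"}


def toy_score(words: Tuple[str, ...]) -> float:
    return float(sum(word in spoken for word in words))


per_dict = [["gazou", "gamen"], ["large", "small"]]   # M = 2, P = 2
best = top_l_word_models(per_dict, toy_score, top_l=1)
print(best)
```

Here M × M = 4 word models are scored, and the single best one combines "gazou" and "large".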
[0089]
When a word model is created by connecting a plurality of words in this way and compared with the input speech, the difference in matching score between word models becomes large depending on how many of the words of each word model are contained in the input speech signal, that is, whether one, two, or three of them are contained, or none at all.
[0090]
To explain this point in comparison with the above embodiment: in the above embodiment, a word model is created for only one word, and the degree of matching (score) between that word model and the input speech signal is calculated. The input speech signal therefore always contains many unnecessary words in addition to the word of the word model, so the score of each word model does not become very high even when its word is contained in the input speech signal. Consequently, the difference in the degree of matching (score) between word models does not become very large. In contrast, if a word model is created from a plurality of words as in the present embodiment and compared with the input speech signal to calculate the degree of matching (score), the scores differ greatly between word models depending on whether the input speech contains one or two of the words constituting the model. If all of the words are contained in the input speech signal, the score of the word model composed of those words becomes extremely high.
[0091]
Therefore, constructing a word model from a plurality of words as in the present embodiment, rather than from one word as in the above embodiment, makes the difference between the scores of the candidates larger, which makes it possible to present highly accurate recognition candidate words to the operator.
[0092]
However, if word models are created by connecting, one by one, all the words from all the word dictionaries used in the voice input mode, the number of word models becomes enormous. When matching processing with the input speech signal is performed on such a large number of word models, a huge processing time is required, and processing is repeated needlessly for word models formed from pointless combinations.
[0093]
Therefore, in the present embodiment, only the M words for each dictionary section obtained in FIGS. 4 to 9 are targeted, and one word is selected from each dictionary section and connected to create a word model. Then, by comparing this with the input voice signal, the number of final recognition candidates is reduced and the accuracy thereof is increased.
[0094]
Hereinafter, the operation of this embodiment will be described with reference to FIG. This operation is an operation in the case where there are two dictionaries to be used according to the voice input mode. In FIG. 12, the operations in steps S101 to S106 are the same as those in the above embodiment. That is, through these steps, M words are set as recognition candidates for each dictionary.
[0095]
When M words have been set as recognition candidates for each of the two dictionaries to be used, the operation proceeds from step S102 to step S201. The recognition candidate word (Ws11) set for the first dictionary is read out (steps S201 and S202), and the recognition candidate word (Ws21) set for the second dictionary is read out (steps S203 and S204). These words (Ws11) and (Ws21) are then connected via a silence model, and a silence model is further connected to both ends to create a word model (step S205).
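
The word model construction of step S205 can be pictured as a simple sequence of units. The sketch below is an assumption-laden illustration: the label `"<sil>"` and the function `build_word_model` are hypothetical, and a real word model would consist of reference models rather than strings.

```python
from typing import List, Sequence

SILENCE = "<sil>"   # hypothetical label standing in for the silence model


def build_word_model(words: Sequence[str]) -> List[str]:
    """Link the words with one silence model between each pair and
    add one silence model at both ends."""
    model = [SILENCE]            # leading silence model
    for word in words:
        model.append(word)       # reference model of the word
        model.append(SILENCE)    # silence after it (between words, or trailing)
    return model


model = build_word_model(["Ws11", "Ws21"])
print(model)   # ['<sil>', 'Ws11', '<sil>', 'Ws21', '<sil>']
```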
[0096]
When the word model is created in this manner, likelihood calculation and score calculation (matching processing) with respect to the input speech signal are performed on this word model in the same manner as in the above embodiment (step S206). Then, this word model is stored in the recognition candidate storage unit 11 together with the score.
[0097]
When the processing for one word model is completed as described above, the process returns to step S203, and the next word (Ws22) set for the second dictionary is read out. This word (Ws22) is then connected to the word (Ws11) of the first dictionary in the same manner as described above, and a new word model is created (step S205).
[0098]
In the same manner as above, the created word model is subjected to the likelihood calculation and score calculation against the input speech signal (step S206) and is stored in the recognition candidate storage unit 11 together with its score.
[0099]
The operations in steps S203 to S206 described above are repeated until the M-th word (Ws2m) set for the second dictionary has been linked to the word (Ws11) of the first dictionary, its score has been calculated, and the result has been stored in the recognition candidate storage unit 11.
[0100]
When all M words set for the second dictionary have been read out and the above processing is finished, the process returns to step S201, and the next word (Ws12) set for the first dictionary is read out (steps S201 and S202). Then, by repeating the processing of steps S203 to S206 in the same manner as described above, the word (Ws12) is connected in turn to each of the M words of the second dictionary to create M word models, and the likelihood calculation and score calculation against the input speech signal are performed on them in turn. The calculated scores are stored in the recognition candidate storage unit 11 together with the word models.
[0101]
The above process is repeated until all of the M words set for the first dictionary are connected to the M words of the second dictionary and processed.
[0102]
When the above processing is completed, the recognition candidate storage unit 11 stores a total of M × M word models and their scores. The scores of the M × M word models are compared in step S207, and the top L word models are selected. Then, words in each dictionary included in the upper L word models are determined, and the words are displayed on a monitor as recognition candidates for each dictionary.
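
One possible way to derive the per-dictionary recognition candidates from the top L word models is sketched below. The specification does not say how repeated words are handled; removing duplicates while keeping rank order is an assumption of this sketch, as is the function name `candidates_from_models`.

```python
from typing import List, Sequence, Tuple


def candidates_from_models(top_models: Sequence[Tuple[str, ...]]) -> List[List[str]]:
    """For each dictionary position, list the distinct words that appear
    in the top-L word models, in order of first appearance (best first)."""
    if not top_models:
        return []
    per_dict: List[List[str]] = [[] for _ in top_models[0]]
    for model in top_models:                  # models are ordered best first
        for index, word in enumerate(model):
            if word not in per_dict[index]:   # assumption: show each word once
                per_dict[index].append(word)
    return per_dict


# Three top-ranked word models for two dictionaries (made-up ranking).
top = [("Ws11", "Ws22"), ("Ws11", "Ws21"), ("Ws13", "Ws22")]
shown = candidates_from_models(top)
print(shown)   # [['Ws11', 'Ws13'], ['Ws22', 'Ws21']]
```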
[0103]
This embodiment describes the operation when two dictionaries are used in the voice input mode, but the operation is not limited to this. For example, when there are three dictionaries, steps corresponding to steps S201 and S202 (for the first dictionary) and steps S203 and S204 (for the second dictionary) need only be added for the third dictionary. As the number of target dictionaries increases, such steps may be added so that all the M words of each dictionary can be combined.
[0104]
Furthermore, when there are three or more target dictionaries (for example, K), instead of selecting one word from each of the K dictionaries, J dictionaries (J < K) may be selected, and one word may be selected from each of the selected J dictionaries and connected.
[0105]
Further, in this embodiment, the M words set for each dictionary are combined to create M to the P-th power word models (P is the number of dictionaries). Alternatively, null (no word) may be added as a word in addition to the M words, and the M + 1 words set for each dictionary may be used to create (M + 1) to the P-th power word models. In this case, a combination containing nulls is formed by connecting only the words other than the nulls. For example, if there are three target dictionaries and the word from the first dictionary is null, the word from the second dictionary is Ws1, and the word from the third dictionary is Ws2, the combined word model consists of the word Ws1 and the word Ws2 connected by a silence model, with a silence model further connected to both ends. When there are two target dictionaries, the first being null and the second Ws2, the word model consists of the word Ws2 with a silence model connected to both ends.
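
The null-augmented combination can be sketched as follows. Representing null by Python's `None`, the label `"<sil>"`, and the function names are assumptions made for illustration only.

```python
from itertools import product
from typing import List, Optional, Sequence, Tuple

SILENCE = "<sil>"   # hypothetical label standing in for the silence model
Selection = Tuple[Optional[str], ...]


def combos_with_null(candidates_per_dict: Sequence[Sequence[str]]) -> List[Selection]:
    """Every selection of one entry per dictionary, where each dictionary
    additionally offers null (None): (M + 1)**P selections in total."""
    augmented = [list(words) + [None] for words in candidates_per_dict]
    return list(product(*augmented))


def word_model(selection: Selection) -> List[str]:
    """Connect only the non-null words, with a silence model between
    words and at both ends."""
    model = [SILENCE]
    for word in selection:
        if word is not None:          # nulls contribute nothing to the model
            model += [word, SILENCE]
    return model


print(word_model((None, "Ws1", "Ws2")))   # ['<sil>', 'Ws1', '<sil>', 'Ws2', '<sil>']
print(word_model((None, "Ws2")))          # ['<sil>', 'Ws2', '<sil>']
print(len(combos_with_null([["a", "b"], ["c", "d"], ["e", "f"]])))   # 27
```

The last line corresponds to M = 2 and P = 3, giving (2 + 1) to the third power, that is 27, selections. A selection that is null for every dictionary yields a model made of the silence model alone.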
[0106]
As described above, if null is added in addition to the M words, the words of the types actually input can be recognized correctly even when the operator does not input words for all of the types and categories required by the voice input mode. For example, suppose the voice input mode requires words of three types/categories A, B, and C, and the operator has input only words of types/categories A and B. In this case, in the embodiment of FIG. 12, M words are set as recognition candidates in steps S101 to S106 for each of the dictionaries corresponding to A, B, and C. Of these, the M words set for the dictionary of C correspond to a type/category that the operator did not input, so all of them are erroneous as recognition candidates. Nevertheless, in the embodiment of FIG. 12, recognition candidate words are set for the dictionary of C through steps S201 to S206 and are displayed on the monitor.
[0107]
Therefore, if null is further added to the M words set for each of the dictionaries A, B, and C, the score of the word model in which null is selected for the dictionary of C becomes higher than the others. That is, letting the words from the A and B dictionaries be Wa and Wb, respectively, the word model in this case is O + Wa + O + Wb + O (O is a silence model). Since this word model conforms to the types/categories actually input, Wa matches the speech portion for A and Wb matches the speech portion for B, and the overall score increases.
[0108]
In the case of a matching method in which the score is proportional to the length of the word model, if the null is selected, the length of the word model is reduced, so that the score needs to be normalized. Such normalization is achieved, for example, by averaging scores according to the length of the word model.
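
A minimal sketch of such length normalization is given below. It assumes that the length of a word model is counted as a number of units and that the accumulated score is simply divided by that length; the function name and the numerical values are made up for illustration.

```python
def normalize_score(total_score: float, model_length: int) -> float:
    """Average an accumulated matching score over the length of the word model."""
    if model_length <= 0:
        raise ValueError("model_length must be positive")
    return total_score / model_length


# Made-up values: a two-word model (length 5) and a model where null was
# selected for one dictionary (length 3).
long_model = normalize_score(10.0, 5)
short_model = normalize_score(7.5, 3)
print(long_model, short_model)   # 2.0 2.5
```

In this toy case the shorter model has the lower raw score (7.5 against 10.0) but the higher normalized score, which is the kind of bias the normalization is meant to remove.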
[0109]
This also applies when only one word is targeted, as in the embodiment of FIG. 9. That is, the number of syllables is not uniform but differs from word to word. For example, “Gamen” has three syllables and “Onsei” has four. In such a case too, the length of the word model changes with the number of syllables, but the normalization process averages the scores according to the length of the word model, so this difference is corrected.
[0110]
Although various embodiments according to the present invention have been described above, the present invention is not limited to such embodiments.
[0111]
For example, in the above embodiment, when a word model is created from one word, only one silence model is added to each end of the word. However, two or more silence models may be added, and the number of silence models may differ between the front and the back.
[0112]
In addition, when a word model is created by connecting two or more words, one silence model is interposed between words in the above embodiment, but two or more may be interposed, or the words may be linked directly without a silence model. Further, the number of silence models may be varied between word pairs, for example two between the words Wa and Wb and one between the words Wb and Wc.
[0113]
In the above-described embodiment, the largest likelihood, or the average of the top N likelihoods, among the likelihoods of the frame feature with respect to the reference model features is adopted as the garbage likelihood. Alternatively, the K-th largest likelihood may be set as the garbage likelihood. In this case, if K is chosen so that the K-th likelihood is statistically close to the average of the N likelihoods, the same effect as adopting the average can be obtained without performing the averaging.
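
The garbage likelihood variants mentioned here can be sketched as follows, assuming that the likelihoods of one frame with respect to all reference models are available as a list of numbers. The function names and the sample values are hypothetical. The last function illustrates the selection of the larger of the silence-model likelihood and the garbage likelihood that is recited in the claims.

```python
from typing import Sequence


def garbage_top_n_mean(likelihoods: Sequence[float], n: int) -> float:
    """Average of the N largest likelihoods of a frame (N = 1 gives the maximum)."""
    if n < 1 or not likelihoods:
        raise ValueError("need n >= 1 and at least one likelihood")
    top = sorted(likelihoods, reverse=True)[:n]
    return sum(top) / len(top)


def garbage_kth(likelihoods: Sequence[float], k: int) -> float:
    """The K-th largest likelihood of a frame (K counted from 1)."""
    if not 1 <= k <= len(likelihoods):
        raise ValueError("k out of range")
    return sorted(likelihoods, reverse=True)[k - 1]


def silence_frame_likelihood(silence: float, garbage: float) -> float:
    """Use whichever is larger as the likelihood of the silence model."""
    return max(silence, garbage)


frame = [0.5, 0.25, 1.0, 0.75]           # made-up likelihoods for one frame
print(garbage_top_n_mean(frame, 1))      # 1.0
print(garbage_top_n_mean(frame, 2))      # 0.875
print(garbage_kth(frame, 2))             # 0.75
print(silence_frame_likelihood(0.1, garbage_kth(frame, 2)))   # 0.75
```

Because only the likelihoods already computed for the frame are used, no separate garbage model needs to be retrained when words are added to a dictionary.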
[0114]
Further, in the above-described embodiment, the silence model is added by the word model creation unit 8. Alternatively, a silence model may be added to the word in advance and stored in the recognition dictionary unit 7.
[0115]
Further, in the above embodiment, null is added to the M words set as recognition candidates for each dictionary before the words are connected. In this case, when null is selected for all dictionaries, the word model consists only of silence models. All-null word models may therefore be excluded from matching. Alternatively, if the matching score of the all-null word model ranks within the top H, the processing result for the input speech may be discarded and the operator prompted to input speech again.
[0116]
In the above embodiment, the number of recognition candidates set for each dictionary is uniformly set to M, but the number of recognition candidates may be changed for each dictionary. At this time, the number of recognition candidates may be set in advance for each dictionary, or the number of recognition candidates for the dictionary may be set according to the score at the time of recognition processing. In the latter case, for example, a threshold value of the score may be set, and only those having a score equal to or larger than the threshold value may be set as recognition candidates. In this case, the number of recognition candidates depends on the score and the threshold, and may be more than or less than M.
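
The threshold-based variant can be sketched as below. The function name, the dictionary of scores, and the threshold value are assumptions; the only point illustrated is that the number of candidates then depends on the scores rather than being fixed at M.

```python
from typing import Dict, List


def candidates_above_threshold(scores: Dict[str, float], threshold: float) -> List[str]:
    """Return the words whose score is at or above the threshold, best first."""
    kept = [(score, word) for word, score in scores.items() if score >= threshold]
    return [word for score, word in sorted(kept, reverse=True)]


toy_scores = {"gamen": 0.62, "onsei": 0.35, "gazou": 0.91}   # made-up scores
print(candidates_above_threshold(toy_scores, 0.5))    # ['gazou', 'gamen']
print(candidates_above_threshold(toy_scores, 0.95))   # []
```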
[0117]
In the above embodiment, for example in FIG. 12, the characteristic analysis and calculation processing in step S105 and in step S206 are the same. However, the processing in step S105 may be made coarse and the processing in step S206 precise. That is, in step S105 the number of target word models is large, so processing speed is prioritized by coarse processing, whereas in step S206 the number of target word models is small, so accuracy is increased by dense processing. This makes it possible to obtain an accurate recognition result while increasing the overall processing speed.
[0118]
Here, coarse processing and dense processing are distinguished by changing which of the acoustic analysis parameters, such as the spectrum of the speech signal, the amount of change in the spectrum, the power, and the amount of change in the power, are processed. For example, the coarse processing targets only the spectrum, and the dense processing targets the spectrum, the amount of change in the spectrum, the power, and the amount of change in the power. Alternatively, the number of frames extracted from the input speech signal may be changed between the coarse processing and the dense processing. For example, when the number of frames for dense processing is set to 100, the number of frames for coarse processing is reduced to 50.
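
The two ways of coarsening the processing can be sketched as follows. The parameter names, the representation of a frame as a dictionary of single numbers (real spectral parameters would be vectors), and the even-spacing rule for thinning frames are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence, TypeVar

T = TypeVar("T")

DENSE_PARAMS = ("spectrum", "delta_spectrum", "power", "delta_power")
COARSE_PARAMS = ("spectrum",)   # coarse processing looks at the spectrum only


def select_params(frame: Dict[str, float], dense: bool) -> Dict[str, float]:
    """Keep only the acoustic parameters used at the chosen precision."""
    keys = DENSE_PARAMS if dense else COARSE_PARAMS
    return {key: frame[key] for key in keys}


def thin_frames(frames: Sequence[T], keep: int) -> List[T]:
    """Pick `keep` roughly evenly spaced frames (all frames if keep is larger)."""
    if keep <= 0:
        raise ValueError("keep must be positive")
    if keep >= len(frames):
        return list(frames)
    step = len(frames) / keep
    return [frames[int(i * step)] for i in range(keep)]


frame = {"spectrum": 0.3, "delta_spectrum": 0.1, "power": 0.8, "delta_power": -0.2}
print(select_params(frame, dense=False))        # {'spectrum': 0.3}
print(len(thin_frames(list(range(100)), 50)))   # 50
```

In this sketch the coarse pass would use `select_params(frame, dense=False)` on 50 thinned frames, and the dense pass would use all four parameters on all 100 frames.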
[0119]
In addition, various changes can be made to the characteristic analysis and the matching process. In addition, the garbage model can be generated in various ways other than the method of obtaining the maximum likelihood of the frame and the method of obtaining the average of the upper N items as described above.
[0120]
[Effect of the Invention]
According to the present invention, the likelihood of the silence model for a frame feature is replaced, as appropriate, with the garbage likelihood. Therefore, even if the input speech signal is cut out with silence portions included, the influence of the silence portions on the matching calculation is absorbed by the silence model, and the influence of unnecessary-word portions in the speech signal on the matching calculation is absorbed by the replacement with the garbage likelihood, so that speech recognition can be performed.
[0121]
Further, since the garbage likelihood is calculated based on the likelihoods of the frame feature with respect to the reference model features, even when a word to be recognized is added to the dictionary, there is no need, unlike in the conventional example, to recalculate and reset it based on all the words, and the flexibility of the device with respect to adding or changing words is improved.
[Brief description of the drawings]
FIG. 1 is a diagram illustrating an outline of an embodiment.
FIG. 2 is a diagram illustrating an outline of an embodiment.
FIG. 3 is a diagram illustrating an outline of an embodiment.
FIG. 4 is a diagram showing a configuration of an embodiment.
FIG. 5 is a diagram illustrating a storage state of a recognition dictionary unit according to the embodiment.
FIG. 6 is a diagram showing a configuration of a word model according to the embodiment.
FIG. 7 is a diagram illustrating a configuration of a matching calculation unit according to the embodiment.
FIG. 8 is a diagram illustrating a method of setting a likelihood of a silence model according to the embodiment.
FIG. 9 is a diagram showing the operation of the embodiment.
FIG. 10 is a diagram showing a configuration of a second embodiment.
FIG. 11 is a diagram showing a configuration of a word model according to a second embodiment.
FIG. 12 is a diagram showing the operation of the second embodiment.
[Description of Reference Numerals]
1 Voice input section
2 Audio signal extraction unit
3 Sound analysis unit
4 Reference model parameter section
5 Likelihood calculation unit
6 RAM section
7 Recognition dictionary unit
8 Word model creation unit
9 Matching operation unit
10 Garbage likelihood calculator
11 Recognition candidate storage unit

Claims

A speech recognition device comprising: acoustic analysis means for acoustically analyzing an input speech signal to extract a frame feature; word model creation means for creating a word model by connecting silence models to both ends of a reference model of a word; likelihood calculation means for calculating the likelihood of the frame feature with respect to the word model by comparing the frame feature with the features of the reference model; matching calculation means for calculating the degree of matching of the word model with respect to the input speech signal based on the calculated likelihood; recognition candidate setting means for setting recognition candidates according to the degree of matching; and garbage likelihood setting means for setting a garbage likelihood for the silence model,
wherein the garbage likelihood setting means calculates the garbage likelihood by averaging the likelihoods ranked from the highest to the N-th in magnitude among the likelihoods of the frame feature with respect to the reference model features, and
the matching calculation means performs the matching calculation using the garbage likelihood set by the garbage likelihood setting means as the likelihood of the silence model.

A speech recognition device comprising: acoustic analysis means for acoustically analyzing an input speech signal to extract a frame feature; word model creation means for creating a word model by connecting silence models to both ends of a reference model of a word; likelihood calculation means for calculating the likelihood of the frame feature with respect to the word model by comparing the frame feature with the features of the reference model; matching calculation means for calculating the degree of matching of the word model with respect to the input speech signal based on the calculated likelihood; recognition candidate setting means for setting recognition candidates according to the degree of matching; and garbage likelihood setting means for setting a garbage likelihood for the silence model,
wherein the garbage likelihood setting means calculates the garbage likelihood by averaging the likelihoods ranked from the highest to the N-th in magnitude among the likelihoods of the frame feature with respect to the reference model features, and
the matching calculation means performs the matching calculation by selecting, as the likelihood of the silence model, either the likelihood of the silence model calculated by the likelihood calculation means or the garbage likelihood set by the garbage likelihood setting means.

A speech recognition device comprising: acoustic analysis means for acoustically analyzing an input speech signal to extract a frame feature; word model creation means for creating a word model by connecting silence models to both ends of a reference model of a word; likelihood calculation means for calculating the likelihood of the frame feature with respect to the word model by comparing the frame feature with the features of the reference model; matching calculation means for calculating the degree of matching of the word model with respect to the input speech signal based on the calculated likelihood; recognition candidate setting means for setting recognition candidates according to the degree of matching; and garbage likelihood setting means for setting a garbage likelihood for the silence model,
wherein the garbage likelihood setting means sets, as the garbage likelihood, the likelihood ranked K-th from the highest in magnitude among the likelihoods of the frame feature with respect to the reference model features, and
the matching calculation means performs the matching calculation using the garbage likelihood set by the garbage likelihood setting means as the likelihood of the silence model.

A speech recognition device comprising: acoustic analysis means for acoustically analyzing an input speech signal to extract a frame feature; word model creation means for creating a word model by connecting silence models to both ends of a reference model of a word; likelihood calculation means for calculating the likelihood of the frame feature with respect to the word model by comparing the frame feature with the features of the reference model; matching calculation means for calculating the degree of matching of the word model with respect to the input speech signal based on the calculated likelihood; recognition candidate setting means for setting recognition candidates according to the degree of matching; and garbage likelihood setting means for setting a garbage likelihood for the silence model,
wherein the garbage likelihood setting means sets, as the garbage likelihood, the likelihood ranked K-th from the highest in magnitude among the likelihoods of the frame feature with respect to the reference model features, and
the matching calculation means performs the matching calculation by selecting, as the likelihood of the silence model, either the likelihood of the silence model calculated by the likelihood calculation means or the garbage likelihood set by the garbage likelihood setting means.

The speech recognition device according to claim 2 or 4, wherein the matching calculation means compares the likelihood of the silence model calculated by the likelihood calculation means with the garbage likelihood set by the garbage likelihood setting means, and selects the larger likelihood.

A speech recognition method comprising the steps of: acoustically analyzing an input speech signal to extract a frame feature; creating a word model by connecting silence models to both ends of a reference model of a word; calculating the likelihood of the frame feature with respect to the word model by comparing the frame feature with the features of the reference model; calculating the degree of matching of the word model with respect to the input speech signal based on the calculated likelihood; setting recognition candidates according to the degree of matching; and setting a garbage likelihood for the silence model,
wherein the garbage likelihood setting step calculates the garbage likelihood by averaging the likelihoods ranked from the highest to the N-th in magnitude among the likelihoods of the frame feature with respect to the reference model features, and
the matching calculation step performs the matching calculation using the garbage likelihood set in the garbage likelihood setting step as the likelihood of the silence model.

A speech recognition method comprising the steps of: acoustically analyzing an input speech signal to extract a frame feature; creating a word model by connecting silence models to both ends of a reference model of a word; calculating the likelihood of the frame feature with respect to the word model by comparing the frame feature with the features of the reference model; calculating the degree of matching of the word model with respect to the input speech signal based on the calculated likelihood; setting recognition candidates according to the degree of matching; and setting a garbage likelihood for the silence model,
wherein the garbage likelihood setting step calculates the garbage likelihood by averaging the likelihoods ranked from the highest to the N-th in magnitude among the likelihoods of the frame feature with respect to the reference model features, and
the matching calculation step performs the matching calculation by selecting, as the likelihood of the silence model, either the likelihood of the silence model calculated in the likelihood calculation step or the garbage likelihood set in the garbage likelihood setting step.

A speech recognition method comprising the steps of: acoustically analyzing an input speech signal to extract a frame feature; creating a word model by connecting silence models to both ends of a reference model of a word; calculating the likelihood of the frame feature with respect to the word model by comparing the frame feature with the features of the reference model; calculating the degree of matching of the word model with respect to the input speech signal based on the calculated likelihood; setting recognition candidates according to the degree of matching; and setting a garbage likelihood for the silence model,
wherein the garbage likelihood setting step sets, as the garbage likelihood, the likelihood ranked K-th from the highest in magnitude among the likelihoods of the frame feature with respect to the reference model features, and
the matching calculation step performs the matching calculation using the garbage likelihood set in the garbage likelihood setting step as the likelihood of the silence model.

A speech recognition method comprising the steps of: acoustically analyzing an input speech signal to extract a frame feature; creating a word model by connecting silence models to both ends of a reference model of a word; calculating the likelihood of the frame feature with respect to the word model by comparing the frame feature with the features of the reference model; calculating the degree of matching of the word model with respect to the input speech signal based on the calculated likelihood; setting recognition candidates according to the degree of matching; and setting a garbage likelihood for the silence model,
wherein the garbage likelihood setting step sets, as the garbage likelihood, the likelihood ranked K-th from the highest in magnitude among the likelihoods of the frame feature with respect to the reference model features, and
the matching calculation step performs the matching calculation by selecting, as the likelihood of the silence model, either the likelihood of the silence model calculated in the likelihood calculation step or the garbage likelihood set in the garbage likelihood setting step.

10. The speech recognition method according to claim 7, wherein the matching calculation step compares the likelihood of the silence model calculated in the likelihood calculation step with the garbage likelihood set in the garbage likelihood setting step, and selects the larger likelihood.