JP4678464B2 - Voice recognition apparatus, voice recognition method, program, and recording medium

Info

Publication number: JP4678464B2
Application number: JP2001189179A
Authority: JP
Inventors: Hitoshi Honda (本田 等)
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2001-06-22
Filing date: 2001-06-22
Publication date: 2011-04-27
Anticipated expiration: 2021-06-22
Also published as: JP2003005780A

Abstract

PROBLEM TO BE SOLVED: To perform voice recognition processing at a speed or with a precision that meets the requirements of a user or the like. SOLUTION: A partial space detection part 21 detects the partial space to which a feature vector of a voice belongs in the feature vector space. A calculation object function selection part 23 stores a calculation object function table in which one or more probability density functions, which define the HMM used in matching processing with feature vectors, are associated with each of a plurality of partial spaces of the feature vector space. From the probability density functions associated with the partial space output by the partial space detection part 21, the calculation object function selection part 23 selects those corresponding to speed/precision information from a speed/precision setting part 27. A score calculation part 25 uses the selected probability density functions to perform matching processing between the feature vector and the HMM.

Description

【０００１】
【発明の属する技術分野】
本発明は、音声認識装置および音声認識方法、並びにプログラムおよび記録媒体に関し、例えば、ユーザ等の要求に応じた速度または精度の音声認識処理を行うことができるようにする音声認識装置および音声認識方法、並びにプログラムおよび記録媒体に関する。
【０００２】
【従来の技術】
図１は、従来の音声認識装置の一例の構成を示している。
【０００３】
ユーザが発した音声は、マイク（マイクロフォン）１に入力され、マイク１では、その入力音声が、電気信号としての音声信号に変換される。この音声信号は、ＡＤ(Analog Digital)変換部２に供給される。ＡＤ変換部２では、マイク１からのアナログ信号である音声信号がサンプリング、量子化され、ディジタル信号である音声データに変換される。この音声データは、特徴抽出部３に供給される。
【０００４】
特徴抽出部３は、ＡＤ変換部２からの音声データについて、適当なフレームごとに音響処理を施し、これにより、例えば、ＭＦＣＣ(Mel Frequency Cepstrum Coefficient)等の特徴ベクトル（特徴量）を抽出し、マッチング部４に供給する。なお、特徴抽出部３では、その他、例えば、スペクトルや、線形予測係数、ケプストラム係数、線スペクトル対等の特徴量を抽出することが可能である。
【０００５】
マッチング部４は、特徴抽出部３からの特徴ベクトルを用いて、音響モデルデータベース５、辞書データベース６、および文法データベース７を必要に応じて参照しながら、マイク１に入力された音声（入力音声）を、例えば、連続分布ＨＭＭ法等に基づいて音声認識する。
【０００６】
即ち、音響モデルデータベース５は、音声認識する音声の言語における個々の音素や音節などの音響的な特徴を表す音響モデルを記憶している。ここでは、連続分布ＨＭＭ法に基づいて音声認識を行うので、音響モデルとしては、例えば、ＨＭＭ(Hidden Markov Model)が用いられる。辞書データベース６は、認識対象の各単語について、その発音に関する情報（音韻情報）が記述された単語辞書を記憶している。文法データベース７は、辞書データベース６の単語辞書に登録されている各単語が、どのように連鎖する（つながる）かを記述した文法規則（言語モデル）を記憶している。ここで、文法規則としては、例えば、文脈自由文法（ＣＦＧ）や、統計的な単語連鎖確率（Ｎ−ｇｒａｍ）などに基づく規則を用いることができる。
【０００７】
マッチング部４は、辞書データベース６の単語辞書を参照することにより、音響モデルデータベース５に記憶されている音響モデルを接続することで、単語の音響モデル（単語モデル）を構成する。さらに、マッチング部４は、幾つかの単語モデルを、文法データベース７に記憶された文法規則を参照することにより接続し、そのようにして接続された単語モデルと、マイク１に入力された音声から抽出された特徴ベクトル系列とのマッチング処理を、例えば、連続分布ＨＭＭ法に基づいて行い、その音声を認識する。即ち、マッチング部４は、特徴抽出部３から供給される時系列の特徴ベクトルが出力（観測）されるスコア（尤度）が最も高い単語モデルの系列を検出し、その単語モデルの系列に対応する単語列を、音声の認識結果として出力する。
【０００８】
つまり、マッチング部４は、接続された単語モデルに対応する単語列について、各特徴ベクトルの出現確率を累積し、その累積値をスコアとして、そのスコアを最も高くする単語列を、音声認識結果として出力する。
【０００９】
スコア計算は、一般に、音響モデルデータベース５に記憶された音響モデルによって与えられる音響的なスコア（以下、適宜、音響スコアという）と、文法データベース７に記憶された文法規則によって与えられる言語的なスコア（以下、適宜、言語スコアという）とを総合評価することで行われる。
【００１０】
即ち、音響スコアとしては、例えば、ＨＭＭ法による場合には、単語モデルを構成する音響モデルから、特徴抽出部３が出力する特徴ベクトルの系列が出力（観測）される確率（出力確率）の累積値が計算される。また、言語スコアとしては、例えば、バイグラムによる場合には、注目している単語と、その単語の直前の単語とが連鎖（連接）する確率が求められる。そして、各単語についての音響スコアと言語スコアとを総合評価（例えば、重み付け加算など）して得られる最終的なスコア（以下、適宜、最終スコアという）に基づいて、音声認識結果が確定される。
【００１１】
以上のような処理が行われることにより、図１の音声認識装置では、例えば、ユーザが、「今日はいい天気ですね」と発話した場合には、「今日」、「は」、「いい」、「天気」、「ですね」といった各単語に、音響スコアおよび言語スコアが与えられ、それらを総合評価して得られる最終スコアが最も大きいときに、単語列「今日」、「は」、「いい」、「天気」、「ですね」が、音声認識結果として出力される。
【００１２】
ところで、上述の場合において、辞書データベース６の単語辞書に、「今日」、「は」、「いい」、「天気」、および「ですね」の５単語が登録されているとすると、これらの５単語を用いて構成しうる５単語の並びは、５⁵通り存在する。従って、単純には、マッチング部４では、この５⁵通りの単語列を評価し、その中から、ユーザの発話に最も適合するもの（最終スコアを最も大きくするもの）を決定しなければならない。そして、単語辞書に登録する単語数が増えれば、その単語数分の単語の並びの数は、単語数の単語数乗通りになるから、評価の対象としなければならない単語列は、膨大な数となる。
【００１３】
さらに、一般には、発話中に含まれる単語の数は未知であるから、５単語の並びからなる単語列だけでなく、１単語、２単語、・・・からなる単語列も、評価の対象とする必要がある。従って、評価すべき単語列の数は、さらに膨大なものとなるから、そのような膨大な単語列の中から、音声認識結果として最も確からしいものを、演算量の観点から効率的に決定することは、非常に重要な問題である。
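The growth in the number of candidate word strings described above can be checked with a few lines of arithmetic. This is only an illustrative sketch: the function names are hypothetical, and it assumes that any word may appear at any position (repetition allowed), which is what the count of 5⁵ implies.

```python
def count_word_strings(vocabulary_size, length):
    """Number of ordered word strings of exactly `length` words."""
    return vocabulary_size ** length


def count_up_to(vocabulary_size, max_length):
    """Number of word strings containing 1 to `max_length` words."""
    return sum(count_word_strings(vocabulary_size, n)
               for n in range(1, max_length + 1))


five_word_strings = count_word_strings(5, 5)   # strings of exactly 5 words
all_lengths = count_up_to(5, 5)                # strings of 1, 2, ..., 5 words
```

Even with a five-word dictionary the count is in the thousands, which is why exhaustive evaluation is impractical for realistic vocabularies.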
【００１４】
Methods for reducing the amount of computation while suppressing degradation of speech recognition accuracy include omitting part of the calculation of the output probabilities that make up the acoustic score, as described, for example, in the following: E. Bocchieri, "Vector quantization for the efficient computation of continuous density likelihoods," in International Conference on Acoustics, Speech, and Signal Processing, volume 2, pages 692-695, Apr. 1993 (hereinafter, Document 1); K. M. Knill, M. J. F. Gales, and S. J. Young, "Use of Gaussian selection in large vocabulary continuous speech recognition using HMMs," in International Conference on Spoken Language Processing, volume 1, pages 470-473, Oct. 1996 (hereinafter, Document 2); M. J. F. Gales, K. M. Knill, and S. J. Young, "State-based Gaussian selection in large vocabulary continuous speech recognition using HMMs," Cambridge University Technical Report TR284, Jan. 1997 (hereinafter, Document 3); and S. M. Herman and R. A. Sukkar, "Variable threshold vector quantization for reduced continuous density likelihood computation in speech recognition," in IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings, pages 331-338, Santa Barbara, 1997 (hereinafter, Document 4).
【００１５】
即ち、例えば、連続ＨＭＭ法によれば、ＨＭＭが、ある状態ｓにおいて、時刻ｔの特徴ベクトルｘ_tを出力する出力確率ｂ_s（ｘ_t）は、次式で計算される。
【００１６】
b_s(x_t) = Σ_m c_m × g_m(x_t) ... (1)
【００１７】
ここで、式（１）において、ｃ_mは、ｍ番目の関数ｇ_m（）に対する重み係数であり、関数ｇ_m（）は、ＨＭＭを構成するｍ番目の確率密度関数（例えば、ガウス分布）である。また、Σは、変数ｍについてのサメーションを表す。従って、式（１）によれば、出力確率ｂ_s（ｘ_t）は、確率密度関数ｇ_m（ｘ_t）の重みｃ_m付き和として計算される。
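Equation (1) can be illustrated with a minimal sketch. It assumes that each probability density function g_m is a Gaussian with a diagonal covariance matrix; the text only gives a Gaussian distribution as an example, and the function and parameter names here are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Diagonal-covariance Gaussian density g_m(x) (an assumed form of g_m)."""
    log_p = 0.0
    for xi, mu, v in zip(x, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * v) + (xi - mu) ** 2 / v)
    return math.exp(log_p)


def output_probability(x, weights, means, variances):
    """b_s(x_t) = sum over m of c_m * g_m(x_t), as in equation (1)."""
    return sum(c * gaussian_pdf(x, mu, v)
               for c, mu, v in zip(weights, means, variances))
```

With N mixture components, every call evaluates N densities; this is the cost that the calculation target function table described later is meant to reduce.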
【００１８】
重み係数ｃ_mと確率密度関数ｇ_m（）は、音響モデルであるＨＭＭを定義する情報としての定義情報のひとつであり（他の定義情報としては、例えば、ＨＭＭの状態が、ある状態から、必要に応じてその状態を含む他の状態に遷移する確率としての状態遷移確率がある）、ＨＭＭは、重み係数ｃ_mと確率密度関数ｇ_m（）のセットを、１セットだけ用いて定義される場合の他、複数セット用いて定義される場合がある。
【００１９】
ＨＭＭが、複数としてのＮセットの重み係数ｃ₀乃至ｃ_N-1と確率密度関数ｇ₀乃至g_N-1（）を用いて定義される場合、式（１）の計算は、変数ｍを０からＮ−１までの整数値に変えて行う必要がある。
【００２０】
しかしながら、ＨＭＭを定義するＮ個の確率密度関数ｇ₀（ｘ_t），ｇ₁（ｘ_t），・・・，ｇ_N-1（ｘ_t）の中には、出力確率ｂ_s（ｘ_t）、ひいては音響スコアに寄与する大きさ（程度）（以下、適宜、寄与度という）が非常に小さいもの（ほとんど寄与しないもの）が存在する場合がある。
【００２１】
そこで、文献１乃至４に記載の方法では、寄与度が非常に小さい確率密度関数ｇ_m（）について、式（１）の計算から省略することで、音声認識精度の劣化を抑えながら、計算量の低減化を図るようになっている。
【００２２】
具体的には、例えば、図２に示すような、特徴ベクトル空間の所定の部分空間ごとに、その部分空間に属する特徴ベクトルｘ_tの出力確率ｂ_s（ｘ_t）の計算に用いる１以上の確率密度関数ｇ_m（）を対応付けた表（以下、適宜、計算対象関数表という）が作成され、ある部分空間に属する特徴ベクトルｘ_tの出力確率ｂ_s（ｘ_t）の計算は、Ｎ個の確率密度関数ｇ₀（）乃至ｇ_N-1（）のうち、計算対象関数表において、特徴ベクトルｘ_tが属する部分空間に対応付けられているものだけを用いて行われる。
【００２３】
この場合、一部の確率密度関数の計算を省くことができるので、演算量が低減され、さらに、音声認識処理の速度を向上させることができる。また、計算が省かれる確率密度関数は、スコアｂ_s（ｘ_t）に対する寄与度がほとんどないものであるから、その計算の省略による音声認識精度の劣化を抑えることができる。
【００２４】
ここで、図２の計算対象関数表において（後述する図６乃至図９においても同様）、特徴ベクトル空間は、Ｙ₀，Ｙ₁，・・・，Ｙ₅₁₁の５１２の部分空間に分割されている。
【００２５】
また、計算対象関数表では、部分空間ごとではなく、特徴ベクトルｘ_tごとに、その特徴ベクトルｘ_tの出力確率ｂ_s（ｘ_t）の計算に用いる１以上の確率密度関数ｇ_m（）を対応付けておくのが理想的であるが、そうすると、特徴ベクトルｘ_tは連続量であることから、計算対象関数表を作成することができなくなるため、計算対象関数表は、部分空間ごとに、確率密度関数ｇ_mを対応付ける形で作成される。
【００２６】
図２の計算対象関数表によれば、例えば、部分空間Ｙ₀に属する特徴ベクトルｘ_tの出力確率ｂ_s（ｘ_t）は、式ｂ_s（ｘ_t）＝ｃ₀ｇ₀（ｘ_t）＋ｃ₁ｇ₁（ｘ_t）＋ｃ₅ｇ₅（ｘ_t）＋ｃ₁₅ｇ₁₅（ｘ_t）によって計算されることになる。
【００２７】
一方、ＨＭＭを定義する確率密度関数ｇ_m（）の総数Ｎを、例えば、１６とすると、式（１）をそのまま採用する場合には、特徴ベクトルｘ_tの出力確率ｂ_s（ｘ_t）は、式ｂ_s（ｘ_t）＝ｃ₀ｇ₀（ｘ_t）＋ｃ₁ｇ₁（ｘ_t）＋・・・＋ｃ₁₅ｇ₁₅（ｘ_t）によって計算されることになる。
【００２８】
従って、式（１）をそのまま用いる場合には、１６の確率密度関数について演算を行う必要があるが、計算対象関数表を用いる場合には、４つの確率密度関数について演算を行えば済むことになり、大幅に演算量を低減することができる。
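The saving from 16 evaluations to 4 can be sketched as follows. The table entry for subspace Y₀ (g₀, g₁, g₅, g₁₅) is taken from the example above; the toy densities, the uniform weights, and all names are assumptions made only so that the number of evaluations can be observed.

```python
def reduced_output_probability(x, subspace, table, weights, densities):
    """Sum c_m * g_m(x) only over the densities listed for this subspace."""
    return sum(weights[m] * densities[m](x) for m in table[subspace])


evaluated = []  # records which densities were actually computed


def make_density(m):
    # Toy stand-in for g_m: returns a constant and logs the call.
    def g(x):
        evaluated.append(m)
        return float(m + 1)
    return g


densities = [make_density(m) for m in range(16)]
weights = [1.0 / 16] * 16
table = {0: [0, 1, 5, 15]}  # subspace Y0 -> g0, g1, g5, g15

b = reduced_output_probability([0.0], 0, table, weights, densities)
```

After this call, `evaluated` holds only the four listed indices, so 12 of the 16 density evaluations were skipped.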
【００２９】
なお、計算対象関数表を用いる場合には、特徴ベクトルｘ_tが、５１２の部分空間Ｙ₀乃至Ｙ₅₁₁のうちのいずれに属するかを検出する必要があるが、この部分空間の検出方法としては、例えば、ベクトル量子化を用いることができる。
【００３０】
【発明が解決しようとする課題】
上述のように、計算対象関数表を用いることにより、音響スコア（出力確率）を求めるための演算量を低減し、音声認識精度の低下を抑えながら、音声認識処理速度を向上させることができる。
【００３１】
しかしながら、例えば、音声認識処理に割り当てられるリソースが少なくなった場合であっても、リアルタイムでの音声認識処理が要求されるときには、多少の音声認識精度の劣化があったとしても、音声認識処理速度を向上させること、即ち、演算量をリソースにあわせて少なくすることが望ましい。
【００３２】
これは、リアルタイムでの音声認識処理が要求される場合には、その後に、その音声認識結果に基づいて処理が行われることが一般的であり、従って、リアルタイムで音声認識結果が得られない場合には、その後の処理に支障をきたすこととなるからである。
【００３３】
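On the other hand, when ample resources can be allocated to speech recognition processing, those resources can be used to perform many computations and obtain highly accurate speech recognition results in real time. That is, in this case, a highly accurate speech recognition result can still be obtained in real time even if the speed of the speech recognition processing is lowered. Therefore, when resources are ample, it is desirable from the standpoint of effective resource utilization to use them to perform highly accurate speech recognition processing.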
【００３４】
本発明は、このような状況に鑑みてなされたものであり、要求に応じた速度や精度の音声認識処理を行うことができるようにするものである。
【００３５】
【課題を解決するための手段】
本発明の一側面の音声認識装置、プログラム、又は、記録媒体は、音声を認識する音声認識装置であって、前記音声の特徴量を抽出する抽出手段と、前記音声の特徴量が、その特徴量空間において属する部分空間を検出する検出手段と、前記特徴量空間の複数の部分空間それぞれごとに、前記音声の特徴量とのマッチング処理に用いられるＨＭＭ(Hidden Markov Model)を定義する１以上の定義情報を対応付けて記憶している記憶手段と、前記音声の特徴量が属する前記部分空間に対応付けられている前記１以上の定義情報から、任意の１以上の定義情報を選択する選択手段と、前記選択手段において選択された定義情報を用いて、前記音声の特徴量と前記ＨＭＭとのマッチング処理を行うことにより、前記音声が、前記ＨＭＭに対応するものであることの尤度を表すスコアを求め、そのスコアに基づいて、前記音声の音声認識結果を出力するマッチング手段とを備え、前記定義情報は、前記ＨＭＭが前記特徴量を出力する出力確率を求めるのに用いられる確率密度関数または確率関数であり、前記選択手段は、ユーザの操作にしたがって設定される音声認識処理の速度若しくは精度、又は、音声認識処理に割り当て可能なリソースに応じて設定される音声認識処理の速度若しくは精度に基づいて、前記音声認識処理の速度または精度に対応する個数の前記定義情報を、前記定義情報がスコアに寄与する大きさに対応する順番で選択する音声認識装置、そのような音声認識装置として、コンピュータを機能させるためのプログラム、又は、そのようなプログラムが記録されている記録媒体である。
【００３６】
本発明の一側面の音声認識方法は、音声を認識する音声認識装置の音声認識方法であって、前記音声認識装置が、前記音声の特徴量を抽出する抽出ステップと、前記音声の特徴量が、その特徴量空間において属する部分空間を検出する検出ステップと、前記特徴量空間の複数の部分空間それぞれごとに、前記音声の特徴量とのマッチング処理に用いられるＨＭＭ(Hidden Markov Model)を定義する１以上の定義情報を対応付けて記憶している記憶手段における、前記音声の特徴量が属する前記部分空間に対応付けられている前記１以上の定義情報から、任意の１以上の定義情報を選択する選択ステップと、前記選択ステップにおいて選択された定義情報を用いて、前記音声の特徴量と前記ＨＭＭとのマッチング処理を行うことにより、前記音声が、前記ＨＭＭに対応するものであることの尤度を表すスコアを求め、そのスコアに基づいて、前記音声の音声認識結果を出力するマッチングステップとを備え、前記定義情報は、前記ＨＭＭが前記特徴量を出力する出力確率を求めるのに用いられる確率密度関数または確率関数であり、前記選択ステップでは、ユーザの操作にしたがって設定される音声認識処理の速度若しくは精度、又は、音声認識処理に割り当て可能なリソースに応じて設定される音声認識処理の速度若しくは精度に基づいて、前記音声認識処理の速度または精度に対応する個数の前記定義情報を、前記定義情報がスコアに寄与する大きさに対応する順番で選択する音声認識方法である。
【００３９】
本発明の一側面においては、前記音声の特徴量が抽出され、前記音声の特徴量が、その特徴量空間において属する部分空間が検出される。さらに、前記特徴量空間の複数の部分空間それぞれごとに、前記音声の特徴量とのマッチング処理に用いられるＨＭＭ(Hidden Markov Model)を定義する１以上の定義情報を対応付けて記憶している記憶手段における、前記音声の特徴量が属する前記部分空間に対応付けられている前記１以上の定義情報から、任意の１以上の定義情報が選択される。そして、その選択された定義情報を用いて、前記音声の特徴量と前記ＨＭＭとのマッチング処理を行うことにより、前記音声が、前記ＨＭＭに対応するものであることの尤度を表すスコアが求められ、そのスコアに基づいて、前記音声の音声認識結果が出力される。前記定義情報は、前記ＨＭＭが前記特徴量を出力する出力確率を求めるのに用いられる確率密度関数または確率関数であり、その定義情報の選択では、ユーザの操作にしたがって設定される音声認識処理の速度若しくは精度、又は、音声認識処理に割り当て可能なリソースに応じて設定される音声認識処理の速度若しくは精度に基づいて、前記音声認識処理の速度または精度に対応する個数の前記定義情報が、前記定義情報がスコアに寄与する大きさに対応する順番で選択される。
【００４０】
【発明の実施の形態】
図３は、本発明を適用した音声認識装置の一実施の形態の構成例を示している。なお、図中、図１における場合と対応する部分については、同一の符号を付してあり、以下では、その説明は、適宜省略する。即ち、図３の音声認識装置は、マッチング部４に代えて、マッチング部１１が新たに設けられている他は、図１における場合と同様に構成されている。
【００４１】
図４は、図３のマッチング部１１の構成例を示している。
【００４２】
特徴抽出部３（図３）が出力する時系列の特徴ベクトルは、部分空間検出部２１とスコア計算部２５に供給されるようになっている。
【００４３】
部分空間検出部２１は、部分空間データ記憶部２２を参照することにより、そこに供給される特徴ベクトルが、その特徴ベクトル空間において属する部分空間を検出し、その部分空間を表す部分空間情報を、計算対象関数選択部２３に供給する。
【００４４】
部分空間データ記憶部２２は、部分空間検出部２１が、特徴ベクトルが属する部分空間を検出するのに必要な情報としての部分空間データを記憶している。
【００４５】
ここで、部分空間検出部２１においては、例えば、ベクトル量子化によって、特徴ベクトルが属する部分空間を検出するようにすることができ、この場合、部分空間データ記憶部２２においては、部分空間データとして、そのベクトル量子化に用いられるコードブックが記憶される。
【００４６】
なお、コードブックは、多数の音声データを用い、コードブック学習用のアルゴリズムの１つである、例えば、ＬＢＧ(Linde Buzo Gray)アルゴリズム等によって学習を行うことにより作成することが可能である。
【００４７】
コードブックには、特徴ベクトル空間を幾つかの部分空間（本実施の形態では、前述したように、５１２の部分空間であるとする）に分割したときの各部分空間の代表のベクトルとしてのコードベクトルと、そのコードベクトルを表すコードとが登録されている。従って、特徴ベクトル空間を、例えば、５１２の部分空間に分割した場合には、コードブックには、５１２のコードベクトルと対応するコードが登録されている。
【００４８】
部分空間検出部２１は、特徴ベクトルと、コードブックに登録された５１２のコードベクトルそれぞれとの距離を計算し、その距離を最も短くするコードベクトルを検出する。そして、部分空間検出部２１は、そのコードベクトルを代表のベクトルとする部分空間が、特徴ベクトルが属する部分空間であるとして、その検出したコードベクトルに対応するコードを、特徴ベクトルが属する部分空間を表す部分空間情報として出力する。
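The subspace detection by vector quantization described above amounts to a nearest-neighbour search over the codebook. The sketch below assumes squared Euclidean distance (the text does not specify the distance measure) and uses a three-entry toy codebook instead of 512 code vectors; the function name is hypothetical.

```python
def detect_subspace(x, codebook):
    """Return the code (index) of the code vector closest to feature vector x."""
    best_code, best_dist = None, float("inf")
    for code, vec in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(x, vec))
        if dist < best_dist:
            best_code, best_dist = code, dist
    return best_code


codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]  # toy code vectors
code = detect_subspace([0.9, 1.2], codebook)
```

The returned code plays the role of the subspace information that is passed on to the calculation target function selection unit 23.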
【００４９】
計算対象関数選択部２３は、部分空間検出部２１からの部分空間情報に基づき、計算対象関数表記憶部２４に記憶された計算対象関数表（定義情報テーブル）を参照することで、特徴ベクトルを用いた音響スコア（出力確率）の計算に用いる、音響モデル（ここでは、前述したように、ＨＭＭ）を定義する確率密度関数等を選択する。
【００５０】
即ち、計算対象関数選択部２３には、部分空間検出部２１から部分空間情報が供給される他、速度／精度設定部２７から、速度／精度情報も供給されるようになっている。
【００５１】
計算対象関数選択部２３は、計算対象関数表記憶部２４に記憶された計算対象関数表において、部分空間検出部２１からの部分空間情報が表す部分空間に対応付けられている確率密度関数等から、速度／精度設定部２７から供給される速度／精度設定情報に基づき、１以上の確率密度関数等を選択する。そして、計算対象関数選択部２３は、その選択した確率密度関数等を表す選択情報を、スコア計算部２５に供給する。
【００５２】
計算対象関数表記憶部２４は、特徴ベクトル空間の複数の部分空間それぞれごとに、音響モデルデータベース５に記憶された音響モデルを定義する１以上の確率密度関数等を対応付けた計算対象関数表を記憶している。
【００５３】
スコア計算部２５は、特徴抽出部３から供給される特徴ベクトルを用いて、音響モデルデータベース５に記憶された音響モデル、辞書データベース６に記憶された単語辞書、および文法データベース７に記憶された文法規則を必要に応じて参照し、音声認識結果の候補（以下、適宜、仮説という）を構成しながら、各仮説について、前述したようなＨＭＭ法に基づく音響スコアと、言語スコアを計算する。
【００５４】
但し、スコア計算部２５は、音響スコアについては、音響モデルを定義する確率密度関数すべてではなく、計算対象関数選択部２３から供給される選択情報が表す確率密度関数等を用いて、特徴ベクトルｘ_tの出力確率ｂ_s（ｘ_t）を求め、その出力確率に基づいて、音響スコアを求める。
【００５５】
スコア計算部２５において求められた音響スコアおよび言語スコアは、出力選択部２６に供給され、出力選択部２６は、各仮説について得られた音響スコアおよび言語スコアを総合評価して最終スコアを得て、例えば、その最終スコアを最も大きくする仮説を選択し、音声認識結果として出力する。
【００５６】
速度／精度設定部２７は、操作レバー２８の操作にしたがい、音声認識処理の速度または精度を設定し、その設定した速度または精度を表す速度／精度情報を、計算対象関数選択部２３に供給する。
【００５７】
操作レバー２８は、ユーザが、音声認識処理の速度または精度を指定するときに操作され、その操作に対応する操作信号を、速度／精度設定部２７に供給する。
【００５８】
従って、速度／精度設定部２７では、ユーザの要求にしたがって、音声認識処理の速度または精度が設定される。
【００５９】
ここで、操作レバー２８は、物理的なレバーとして構成することもできるし、画面上に表示される仮想的なレバーとして構成することもできる。操作レバー２８が、物理的なレバーとして構成される場合には、操作レバー２８は、ユーザが実際に掴んで操作することになる。また、操作レバー２８が仮想的なレバーとして構成される場合は、操作レバー２８は、ユーザがマウスでドラッグ等して操作することになる。
【００６０】
なお、図４の実施の形態においては、より低速または高精度の音声認識処理を要求する場合には、操作レバー２８は左方向に操作され、逆に、より高速または低精度の音声認識処理を要求する場合には、操作レバー２８は右方向に操作されるようになっている。
【００６１】
次に、図５のフローチャートを参照して、図４のマッチング部１１で行われるマッチング処理について説明する。
【００６２】
ユーザが発話を行い、これにより、特徴抽出部３が、その音声の特徴ベクトルの出力を開始すると、マッチング部１１は、マッチング処理を開始する。
【００６３】
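That is, the time-series feature vectors output by the feature extraction unit 3 are supplied to the subspace detection unit 21 and the score calculation unit 25. In step S1, the subspace detection unit 21 refers to the subspace data storage unit 22 and detects the subspace to which the feature vector x_t from the feature extraction unit 3 belongs. The subspace detection unit 21 then supplies subspace information representing that subspace to the calculation target function selection unit 23, and the process proceeds to step S2.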
【００６４】
ステップＳ２では、計算対象関数選択部２３が、計算対象関数表記憶部２４に記憶された計算対象関数表において、部分空間検出部２１からの部分空間情報が表す部分空間に対応付けられている確率密度関数等から、１以上の確率密度関数等を、必要に応じて速度／精度設定部２７からの速度／精度設定情報に基づいて選択し、その選択した確率密度関数等を表す選択情報を、スコア計算部２５に供給する。
【００６５】
スコア計算部２５は、ステップＳ３において、辞書データベース６の単語辞書に記憶された単語について、計算対象関数選択部２３から供給される選択情報が表す、音響モデルデータベース５の音響モデルを定義する確率密度関数等を用いて、特徴ベクトルｘ_tの出力確率ｂ_s（ｘ_t）を求め、その出力確率に基づいて、音響スコアを求めるとともに、文法データベース７の文法規則を参照することで、言語スコアを求める。さらに、スコア計算部２５は、その音響スコアおよび言語スコアに基づき、必要に応じて、仮説（音声認識結果の候補）を生成して、ステップＳ４に進む。
【００６６】
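In step S4, it is determined whether the calculation of the acoustic score and the language score has been completed up to the end point of the speech section in which the user spoke. If it is determined that the calculation has not been completed, the process returns to step S1, and the same processing is repeated for the next feature vector. The acoustic and language scores are calculated while pruning with a beam search as necessary.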
【００６７】
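If it is determined in step S4 that the calculation of the acoustic score and the language score has been completed up to the end point of the speech section in which the user spoke, the process proceeds to step S5. The output selection unit 26 comprehensively evaluates the acoustic and language scores obtained for one or more hypotheses to obtain final scores, selects, for example, the hypothesis with the largest final score, and outputs it as the speech recognition result, whereupon the matching process ends.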
【００６８】
次に、図６乃至図９を参照して、図４の計算対象関数表記憶部２４に記憶される計算対象関数表について説明する。なお、以下においては、例えば、特徴ベクトル空間は５１２の部分空間Ｙ₀乃至Ｙ₅₁₁に分割されており、音響モデルデータベース５に記憶された音響モデルとしてのＨＭＭは、１６の確率密度関数ｇ₀（）乃至ｇ₁₅（）で定義されるものとする。
【００６９】
図４のマッチング部１１では、基本的には、上述したように、計算対象関数表記憶部２４に記憶された計算対象関数表において、部分空間検出部２１が出力する部分空間情報が表す部分空間（特徴ベクトルｘ_tが属する部分空間）に対応付けられている確率密度関数等から、１以上の確率密度関数等が、速度／精度情報に基づいて選択され、その確率密度関数等を用いて、特徴ベクトルｘ_tの出力確率ｂ_s（ｘ_t）が求められるが、この出力確率ｂ_s（ｘ_t）は、計算対象関数表において、特徴ベクトルｘ_tが属する部分空間に対応付けられている確率密度関数等すべてを用いて計算することも可能である。
【００７０】
即ち、マッチング部１１では、出力確率ｂ_s（ｘ_t）を、特徴ベクトルｘ_tが属する部分空間に対応付けられている確率密度関数等から選択したものを用いて計算することも可能であるし、また、特徴ベクトルｘ_tが属する部分空間に対応付けられている確率密度関数等すべてを用いて計算することも可能である。
【００７１】
いま、出力確率ｂ_s（ｘ_t）を、特徴ベクトルｘ_tが属する部分空間に対応付けられている確率密度関数等から選択したものを用いて計算するモードを、選択可能モードというとともに、出力確率ｂ_s（ｘ_t）を、特徴ベクトルｘ_tが属する部分空間に対応付けられている確率密度関数等すべてを用いて計算するモードを、選択不可能モードというものとすると、選択不可能モードでは、例えば、図６に示すような計算対象関数表が用いられる。
【００７２】
即ち、図６の計算対象関数表では、５１２の部分空間Ｙ₀乃至Ｙ₅₁₁それぞれに、その部分空間Ｙ_j（ｊ＝０，１，・・・，５１１）に属する特徴ベクトルｘ_tの出力確率ｂ_s（ｘ_t）の計算に用いる確率密度関数のリスト｛ｇ_m｝、またはフロア値が対応付けられている。
【００７３】
図６の計算対象関数表が用いられる場合、計算対象関数選択部２３は、確率密度関数｛ｇ_m｝が対応付けられている部分空間Ｙ_jを表す部分空間情報を、部分空間検出部２１から受信したときには、その部分空間Ｙ_jに対応付けられている確率密度関数｛ｇ_m｝すべてを選択し、その確率密度関数｛ｇ_m｝すべてを表す選択情報を、スコア計算部２５に供給する。
【００７４】
スコア計算部２５では、選択情報が表す確率密度関数を用いて、出力確率ｂ_s（ｘ_t）が計算される。従って、この場合、スコア計算部２５では、前述の図２で説明した場合と同様にして、出力確率ｂ_s（ｘ_t）が計算される。
【００７５】
ところで、図６の計算対象関数表では、部分空間Ｙ_jに対して、特徴ベクトルｘ_tの出力確率ｂ_s（ｘ_t）の計算に用いる確率密度関数｛ｇ_m｝が対応付けられている場合の他、フロア値が対応付けられている場合がある。
【００７６】
フロア値は、それが対応付けられている部分空間Ｙ_jに属する特徴ベクトルの出力確率の最小値を表す固定の値であり、フロア値が対応付けられている部分空間Ｙ_jに属する特徴ベクトルの出力確率は、そのフロア値とされる。
【００７７】
即ち、計算対象関数選択部２３は、フロア値が対応付けられている部分空間Ｙ_jを表す部分空間情報を、部分空間検出部２１から受信したときには、その部分空間Ｙ_jに対応付けられているフロア値を選択し、そのフロア値を表す選択情報を、スコア計算部２５に供給する。
【００７８】
スコア計算部２５では、選択情報がフロア値を表す場合、そのフロア値を、出力確率ｂ_s（ｘ_t）とする。
【００７９】
従って、この場合、出力確率は、確率密度関数を用いた計算を行うことなく求めることができるので、演算量を削減することができる。
【００８０】
即ち、図６の計算対象関数表においては、部分空間Ｙ_jに対して、確率密度関数｛ｇ_m｝が対応付けられている場合と、フロア値が対応付けられている場合とがあり、フロア値が対応付けられている部分空間Ｙ_jに属する特徴ベクトルｘ_tの出力確率ｂ_s（ｘ_t）を求めるにあたっては、確率密度関数を計算する必要はないから、図２に示したように、すべての部分空間に対して、確率密度関数が対応付けられている計算対象関数表を用いる場合に比較して、より演算量を削減することができる。
【００８１】
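In the calculation target function table of FIG. 6, the floor value "-30.0" is associated with subspace Y₃; this floor value is the logarithm of the output probability. The same applies to the floor values shown in the embodiments of FIGS. 7 to 9, described later.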
【００８２】
また、選択不可能モードにおいては、図６の計算対象関数表の他、図７に示すような計算対象関数表を用いることも可能である。
【００８３】
即ち、図７の計算対象関数表は、図６の計算対象関数表に対して、各部分空間Ｙ_jに対応付けられている確率密度関数の個数を追加したものとなっている。図７の計算対象関数表を用いる場合には、計算対象関数選択部２３が出力する選択情報に、特徴ベクトルが属する部分空間Ｙ_jに対応付けられている個数を含めることができ、この場合、スコア計算部２５において、出力確率を求めるのにあたって計算しなければならない確率密度関数の個数を、即座に認識することができる。
【００８４】
なお、計算対象関数表に登録される確率密度関数は、出力確率に対する寄与度が大きいものであり、従って、図６や図７の計算対象関数表において、部分空間Ｙ₃に対して、確率密度関数が登録されていないのは、音響モデルを定義する確率密度関数ｇ₀（）乃至ｇ₁₅（）それぞれの、部分空間Ｙ₃に属する特徴ベクトルの出力確率に対する寄与度が、相対的に差がないためである。また、部分空間Ｙ₃に属する任意の特徴ベクトルについては、音響モデルを定義する１６の確率密度関数ｇ₀（）乃至ｇ₁₅（）を用いて計算される出力確率（の対数をとったもの）が−３０程度であり、従って、出力確率を−３０．０の固定値としても、精度のよい近似が可能であるため、図６や図７の計算対象関数表の部分空間Ｙ₃については、出力確率が−３０．０の固定値とされている。
【００８５】
次に、図８は、選択可能モードの場合に、計算対象関数表記憶部２４に記憶される計算対象関数表の一実施の形態の構成例を示している。
【００８６】
図８の計算対象関数表においては、各部分空間Ｙ_jに、確率密度関数｛ｇ_m｝またはフロア値の他、出力確率（ひいては音響スコア）の計算に用いる確率密度関数の個数（以下、適宜、計算個数という）が複数対応付けられている。
【００８７】
即ち、図８の実施の形態においては、例えば、部分空間Ｙ₀に対して、フロア値「−２９．０」、確率密度関数｛ｇ₅（），ｇ₁（），ｇ₁₅（），ｇ₀（）｝、計算個数｛０，１，４｝が対応付けられている。また、例えば、部分空間Ｙ₁に対して、フロア値「−４５．０」、確率密度関数｛ｇ₀（），ｇ₁（），ｇ₁₇（），ｇ₈（），ｇ₃（），ｇ₁₀（）｝、計算個数｛０，３，６｝が対応付けられている。さらに、部分空間Ｙ₂に対して、フロア値「−２０．０」、確率密度関数｛ｇ₂（），ｇ₆（），ｇ₄（）｝、計算個数｛０，３，３｝が対応付けられている。また、部分空間Ｙ₃に対して、フロア値「−３０．０」、計算個数｛０，０，０｝が対応付けられている。以下、同様にして、部分空間Ｙ₄乃至Ｙ₅₁₀にも、フロア値または確率密度関数｛ｇ_m｝と、計算個数が対応付けられており、最後の部分空間Ｙ₅₁₁に対して、フロア値「−４０．０」、確率密度関数｛ｇ₁₅（）｝、計算個数｛０，０，１｝が対応付けられている。
【００８８】
図８の計算対象関数表が用いられる場合、計算対象関数選択部２３は、まず、部分空間検出部２１から供給される部分空間情報に基づき、計算対象関数表において、ベクトルｘ_tが属する部分空間Ｙ_jのエントリ（行）を選択する。いま、このようにして、部分空間情報に基づき、計算対象関数表から選択された部分空間Ｙ_jのエントリを、選択エントリというものとすると、さらに、計算対象関数選択部２３は、選択エントリにおける複数の計算個数から、速度／精度設定部２７から供給される速度／精度情報に対応するものを選択する。
【００８９】
即ち、計算対象関数表の各エントリにおける複数の計算個数それぞれは、計算対象関数選択部２３に選択させる確率密度関数の個数を表しており、音声認識処理に要求される速度または精度に基づいて登録されている。
【００９０】
具体的には、例えば、いま、音声認識処理について、「高速／低精度」、「中速／中精度」、「低速／高精度」の３つの速度または精度の設定が可能であるとすると、図８の実施の形態では、計算対象関数表の各エントリに、３つの計算個数が登録されているが、この３つの計算個数のうち、最も左側の計算個数は、「高速／低精度」の速度または精度が設定されたときに、左から２番目の計算個数は、「中速／中精度」の速度または精度が設定されたときに、最も右側の計算個数は、「低速／高精度」の速度または精度が設定されたときに、それぞれ選択される。
【００９１】
従って、図８の実施の形態において、速度／精度設定部２７から供給される速度／精度情報が、「高速／低精度」を表す場合には、計算対象関数選択部２３では、部分空間Ｙ₀乃至Ｙ₅₁₁のエントリそれぞれに登録されている３つの計算個数のうち、最も左側にある０，０，０，０，・・・，０が選択される。また、速度／精度情報が、「中速／中精度」を表す場合には、計算対象関数選択部２３では、部分空間Ｙ₀乃至Ｙ₅₁₁のエントリそれぞれに登録されている３つの計算個数のうち、左から２番目にある１，３，３，０，・・・，０が選択される。さらに、速度／精度情報が、「低速／高精度」を表す場合には、計算対象関数選択部２３では、部分空間Ｙ₀乃至Ｙ₅₁₁のエントリそれぞれに登録されている３つの計算個数のうち、最も右側にある４，６，３，０，・・・，１が選択される。
【００９２】
以上から、特徴ベクトルｘ_tが、例えば、部分空間Ｙ₀に属するとした場合、計算対象関数選択部２３は、その部分空間Ｙ₀のエントリを選択エントリとする。さらに、速度／精度設定部２７から供給される速度／精度情報が、「高速／低精度」を表す場合には、計算対象関数選択部２３は、選択エントリに登録されている３つの計算個数「０，１，４」のうちの最も左側の「０」を選択する。また、計算対象関数選択部２３は、速度／精度情報が「中速／中精度」を表す場合には、選択エントリに登録されている３つの計算個数「０，１，４」のうちの左から２番目の「１」を選択し、速度／精度情報が「低速／高精度」を表す場合には、選択エントリに登録されている３つの計算個数「０，１，４」のうちの最も右側の「４」を選択する。
【００９３】
いま、上述のようにして、選択エントリに登録されている複数の計算個数から選択されたものを、選択計算個数というものとすると、計算対象関数選択部２３は、選択エントリから、選択計算個数だけの確率密度関数を選択する。
【００９４】
従って、例えば、図８において、部分空間Ｙ₀のエントリが選択エントリとされた場合において、選択計算個数が、「０」、「１」、「４」とされたときには、計算対象関数選択部２３は、部分空間Ｙ₀に登録されている確率密度関数から、０，１，４個を選択する。
【００９５】
ここで、選択エントリからの確率密度関数の選択は、次のようにして行われる。
【００９６】
即ち、選択計算個数が、「０」の場合は、選択エントリからは、確率密度関数は選択されず、フロア値が選択される。また、選択計算個数が、「０」以外の値である場合には、選択エントリからは、そこに登録されている１以上の確率密度関数のうちの、左から、選択計算個数分だけの確率密度関数が選択される。
【００９７】
従って、図８において、部分空間Ｙ₀のエントリが選択エントリとされた場合において、選択計算個数が、「０」とされたときには、計算対象関数選択部２３は、部分空間Ｙ₀のエントリに登録されているフロア値「−２９．０」を選択する。また、選択計算個数が、「１」とされたときには、計算対象関数選択部２３は、部分空間Ｙ₀のエントリに登録されている確率密度関数｛ｇ₅（），ｇ₁（），ｇ₁₅（），ｇ₀（）｝のうちの、左から１つだけ、即ち、｛ｇ₅（）｝を選択する。さらに、選択計算個数が、「４」とされたときには、計算対象関数選択部２３は、部分空間Ｙ₀に登録されている確率密度関数｛ｇ₅（），ｇ₁（），ｇ₁₅（），ｇ₀（）｝のうちの、左から４つ、即ち、部分空間Ｙ₀に登録されている確率密度関数の全部｛ｇ₅（），ｇ₁（），ｇ₁₅（），ｇ₀（）｝を選択する。
【００９８】
そして、計算対象関数選択部２３は、その選択したフロア値または確率密度関数を表す選択情報を、スコア計算部２５に供給する。
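The selection procedure for the table of FIG. 8 can be summarised in a short sketch. The entries for subspaces Y₀, Y₃, and Y₅₁₁ (floor values, function lists, and calculation counts) are the ones quoted in the description; the dictionary layout, the mode keys, and the function name are hypothetical.

```python
# Position of each speed/accuracy setting within the calculation counts.
MODES = {"fast": 0, "medium": 1, "accurate": 2}

# Subspace -> floor value (log domain), density indices ordered by
# contribution, and one calculation count per mode.
TABLE = {
    0: {"floor": -29.0, "pdfs": [5, 1, 15, 0], "counts": (0, 1, 4)},
    3: {"floor": -30.0, "pdfs": [], "counts": (0, 0, 0)},
    511: {"floor": -40.0, "pdfs": [15], "counts": (0, 0, 1)},
}


def select(subspace, mode):
    """Return ('floor', value) or ('pdfs', leftmost n indices) for the entry."""
    entry = TABLE[subspace]
    n = entry["counts"][MODES[mode]]
    if n == 0:
        return ("floor", entry["floor"])
    return ("pdfs", entry["pdfs"][:n])
```

Because the function lists are ordered by contribution, taking the leftmost n entries keeps the functions that matter most for the output probability.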
【００９９】
ここで、図８の計算対象関数表においては、部分空間Ｙ₃のエントリにおける３つの計算個数は、いずれも「０」となっている。従って、部分空間Ｙ₃のエントリが選択エントリとされた場合には、速度／精度情報が、「高速／低精度」、「中速／中精度」、「低速／高精度」のうちのいずれを表すときであっても、選択計算個数は「０」であり、従って、計算対象関数選択部２３では、フロア値「−３０．０」が選択されることになる。
【０１００】
以上から、特徴ベクトルｘ_tが、例えば、部分空間Ｙ₀に属する場合において、速度／精度情報が「高速／低精度」に設定されている「高速／低精度」モードでは、スコア計算部２５において、特徴ベクトルｘ_tの出力確率ｂ_s（ｘ_t）は、フロア値「−２９．０」とされる。従って、この場合、出力確率ｂ_s（ｘ_t）は、確率密度関数を用いた計算をせずに求められるから、精度は落ちるが、高速な処理が可能となる。
【０１０１】
また、速度／精度情報が「中速／中精度」に設定されている「中速／中精度」モードでは、スコア計算部２５において、特徴ベクトルｘ_tの出力確率ｂ_s（ｘ_t）は、１の確率密度関数ｇ₅（ｘ_t）を計算し、さらに、前述の式（１）に基づき、その重み付け値ｃ₅ｇ₅（ｘ_t）を計算することによって求められる。従って、この場合、出力確率ｂ_s（ｘ_t）は、１つの確率密度関数ｇ₅（ｘ_t）を用いた計算によって求められるから、「高速／低精度」モードの場合に比較して、処理速度は低下するが、精度は向上することになる。
【０１０２】
さらに、速度／精度情報が「低速／高精度」に設定されている「低速／高精度」モードでは、スコア計算部２５において、特徴ベクトルｘ_tの出力確率ｂ_s（ｘ_t）は、４つの確率密度関数ｇ₅（ｘ_t），ｇ₁（ｘ_t），ｇ₁₅（ｘ_t），ｇ₀（ｘ_t）を計算し、さらに、前述の式（１）に基づき、その重み付け和ｃ₅ｇ₅（ｘ_t）＋ｃ₁ｇ₁（ｘ_t）＋ｃ₁₅ｇ₁₅（ｘ_t）＋ｃ₀ｇ₀（ｘ_t）を計算することによって求められる。従って、この場合、出力確率ｂ_s（ｘ_t）は、４つの確率密度関数ｇ₅（ｘ_t），ｇ₁（ｘ_t），ｇ₁₅（ｘ_t），ｇ₀（ｘ_t）を用いた計算によって求められるから、「高速／低精度」モードの場合に比較して、処理速度はさらに低下するが、精度はさらに向上することになる。
【０１０３】
図４のマッチング部１１では、速度／精度情報は、ユーザによって操作される操作レバー２８にしたがって設定されるようになっており、従って、ユーザの要求に応じた速度や精度での音声認識処理が可能となる。
【０１０４】
なお、計算対象関数表のエントリに、複数の確率密度関数を登録する場合には、その複数の確率密度関数ｇ_m（）は、例えば、そのサフィックスｍの昇順や降順に並べても良いが、図８の実施の形態においては、計算対象関数表のエントリに登録されている複数の確率密度関数は、出力確率（ひいては、音響スコア）に対する寄与度が大きい順に並べられている（最も左の確率密度関数が、出力確率に対する寄与度が最も大きいものとなっている）。
【０１０５】
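Accordingly, in this case, the calculation target function selection unit 23 preferentially selects probability density functions with a large contribution to the output probability (and hence the acoustic score), and the score calculation unit 25 likewise preferentially uses such probability density functions to calculate the output probability (and hence the acoustic score). The error in the output probability (acoustic score) caused by omitting the calculation of some probability density functions based on the calculation target function table can therefore be kept to a minimum.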
【０１０６】
なお、図８の実施の形態では、計算対象関数表の各エントリに、「高速／低精度」モード、「中速／中精度」モード、および「低速／高精度」モードの３つの速度／精度モードそれぞれに対する３つの計算個数を登録するようにしたが、計算対象関数表の各エントリには、計算個数ではなく、そのエントリに登録されている確率密度関数の総数を設定し、計算対象関数選択部２３において、速度／精度情報に基づき、０から確率密度関数の総数までの範囲（以下、適宜、選択範囲という）の整数値から、計算個数を選択するようにすることが可能である。
【０１０７】
即ち、例えば、図８の計算対象関数表の部分空間Ｙ₀のエントリには、４つの確率密度関数｛ｇ₅（），ｇ₁（），ｇ₁₅（），ｇ₀（）｝が登録されているから、選択範囲は０乃至４で、選択範囲内の整数値としては、０，１，２，３，４の５つを取り得るから、計算個数も、その５つの整数値から選択される。
【０１０８】
この場合、図４の操作レバー２８の可動範囲を、左端付近、左端と中心の中間付近、中心付近、中心と右端の中間付近、右端付近の５つの範囲に分けて、操作レバー２８が、その５つの範囲それぞれに位置するときは、計算対象関数選択部２３において、計算個数として、０，１，２，３，４をそれぞれ選択するようにすることができる。
【０１０９】
この場合、スコア計算部２５では、操作レバー２８が、左端付近に位置するときには、４つの確率密度関数｛ｇ₅（），ｇ₁（），ｇ₁₅（），ｇ₀（）｝のうちの０個、即ち、フロア値を用いて、出力確率が求められることになる。また、操作レバー２８が、左端と中心の中間付近に位置するときには、４つの確率密度関数｛ｇ₅（），ｇ₁（），ｇ₁₅（），ｇ₀（）｝のうちの、出力確率に対する寄与度が最も高い１つの確率密度関数ｇ₅（）を計算することによって、出力確率が求められることになる。さらに、操作レバー２８が、中心付近に位置するときには、４つの確率密度関数｛ｇ₅（），ｇ₁（），ｇ₁₅（），ｇ₀（）｝のうちの、出力確率に対する寄与度が最も高い確率密度関数ｇ₅（）と２番目に高い確率密度関数ｇ₁（）の２つを計算することによって、出力確率が求められることになる。また、操作レバー２８が、中心と右端の中間付近に位置するときには、４つの確率密度関数｛ｇ₅（），ｇ₁（），ｇ₁₅（），ｇ₀（）｝のうちの、出力確率に対する寄与度が高い順に３つの確率密度関数ｇ₅（），ｇ₁（），ｇ₁₅（）を計算することによって、出力確率が求められることになる。さらに、操作レバー２８が、右端付近に位置するときには、４つの確率密度関数ｇ₅（），ｇ₁（），ｇ₁₅（），ｇ₀（）すべてを計算することによって、出力確率が求められることになる。
【０１１０】
従って、この場合、５段階の速度または精度での音声認識処理が可能となる。
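The five-step mapping from lever position to calculation count can be sketched as below. It assumes the lever position is normalised to the range 0.0 (left end) to 1.0 (right end) and that the range is split into equal parts; both are illustrative assumptions, as is the function name. The direction follows this passage, where the left end corresponds to a count of 0.

```python
def count_from_lever(position, total):
    """Map a lever position in [0.0, 1.0] to a calculation count in 0..total."""
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must be between 0.0 and 1.0")
    # Split the range into (total + 1) equal bands; clamp the right end.
    return min(int(position * (total + 1)), total)


pdfs_y0 = [5, 1, 15, 0]  # entry for subspace Y0, ordered by contribution
counts = [count_from_lever(p, len(pdfs_y0)) for p in (0.0, 0.25, 0.5, 0.75, 1.0)]
selected_mid = pdfs_y0[:count_from_lever(0.5, len(pdfs_y0))]
```

A count of 0 would mean that the floor value is used instead of any probability density function.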
【０１１１】
なお、上述のように、操作レバー２８の位置に応じて、出力確率の計算に用いる確率密度関数を選択する場合には、計算対象関数表の各部分空間のエントリには、音響モデルを定義する１６の確率密度関数ｇ₀（）乃至ｇ₁₅（）すべてを、寄与度の高い順に登録しておくことが可能である。
【０１１２】
次に、図９は、選択可能モードの場合に、計算対象関数表記憶部２４に記憶される計算対象関数表の他の実施の形態の構成例を示している。
【０１１３】
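In the embodiment of FIG. 9, the calculation target function table storage unit 24 stores two calculation target function tables: a high-speed/low-accuracy table (FIG. 9(A)) and a low-speed/high-accuracy table (FIG. 9(B)). Based on the speed/accuracy information supplied from the speed/accuracy setting unit 27, the calculation target function selection unit 23 selects one of the two tables and refers to the selected table to select probability density functions or a floor value.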
【０１１４】
即ち、速度／精度設定部２７は、操作レバー２８が左側に位置する場合は、低速または高精度の音声認識処理を行うことを設定し、その旨の速度／精度情報を、計算対象関数選択部２３に供給する。この場合、計算対象関数選択部２３では、低速／高精度用計算対象関数表（図９（Ｂ））が選択される。一方、速度／精度設定部２７は、操作レバー２８が右側に位置する場合は、高速または低精度の音声認識処理を行うことを設定し、その旨の速度／精度情報を、計算対象関数選択部２３に供給する。この場合、計算対象関数選択部２３では、高速／低精度用計算対象関数表（図９（Ａ））が選択される。
【０１１５】
図９の実施の形態において、低速／高精度用計算対象関数表（図９（Ｂ））における各エントリに登録されている確率密度関数は、基本的に、高速／低精度用計算対象関数表（図９（Ａ））において、対応するエントリに登録されている確率密度関数に対して、０以上の確率密度関数を加えたものとなっている。
【０１１６】
従って、低速／高精度用計算対象関数表を参照して、確率密度関数等（確率密度関数またはフロア値）を選択する場合には、高速／低精度用計算対象関数表を参照して、確率密度関数等を選択する場合に比較して、出力確率の計算に要する演算量が多くなるので、処理速度が低速にはなるが、精度の高い音声認識結果が得られる。
【０１１７】
また、逆に、高速／低精度用計算対象関数表を参照して、確率密度関数等を選択する場合には、低速／高精度用計算対象関数表を参照して、確率密度関数等を選択する場合に比較して、精度は劣化するかもしれないが、出力確率の計算に要する演算量が少なくなるので、処理速度を高速化することができる。
【０１１８】
なお、図９の実施の形態においても、計算対象関数表には、確率密度関数を、出力確率に対する寄与度が高い順に登録することができ、さらに、操作レバー２８の位置に応じて、計算対象関数表から選択する確率密度関数の個数を変化させるようにすることができる。
【０１１９】
即ち、例えば、操作レバー２８が左側に位置する場合は、低速／高精度用計算対象関数表（図９（Ｂ））を選択し、さらに、操作レバー２８が、どの程度左側に位置するかによって、低速／高精度用計算対象関数表から選択する確率密度関数の個数を変化させることができる。また、操作レバー２８が右側に位置する場合は、高速／低精度用計算対象関数表（図９（Ａ））を選択し、さらに、操作レバー２８が、どの程度右側に位置するかによって、高速／低精度用計算対象関数表から選択する確率密度関数の個数を変化させることができる。この場合、音声認識処理の速度または精度について、より細かな制御を行うことが可能となる。
【０１２０】
また、図９の実施の形態においては、計算対象関数表記憶部２４に、２つの計算対象関数表を記憶させるようにしたが、計算対象関数表記憶部２４には、その他、音声認識処理に要求される速度または精度に応じて、登録されている確率密度関数の数が異なる３以上の計算対象関数表を記憶させておき、計算対象関数選択部２３においては、その３以上の計算対象関数表から、速度／精度情報に基づいて、参照する計算対象関数表を選択するようにすることが可能である。
【０１２１】
次に、上述した一連の処理は、ハードウェアにより行うこともできるし、ソフトウェアにより行うこともできる。一連の処理をソフトウェアによって行う場合には、そのソフトウェアを構成するプログラムが、汎用のコンピュータ等にインストールされる。
【０１２２】
そこで、図１０は、上述した一連の処理を実行するプログラムがインストールされるコンピュータの一実施の形態の構成例を示している。
【０１２３】
プログラムは、コンピュータに内蔵されている記録媒体としてのハードディスク１０５やＲＯＭ１０３に予め記録しておくことができる。
【０１２４】
あるいはまた、プログラムは、フレキシブルディスク、CD-ROM(Compact Disc Read Only Memory)，MO(Magneto optical)ディスク，DVD(Digital Versatile Disc)、磁気ディスク、半導体メモリなどのリムーバブル記録媒体１１１に、一時的あるいは永続的に格納（記録）しておくことができる。このようなリムーバブル記録媒体１１１は、いわゆるパッケージソフトウエアとして提供することができる。
【０１２５】
なお、プログラムは、上述したようなリムーバブル記録媒体１１１からコンピュータにインストールする他、ダウンロードサイトから、ディジタル衛星放送用の人工衛星を介して、コンピュータに無線で転送したり、LAN(Local Area Network)、インターネットといったネットワークを介して、コンピュータに有線で転送し、コンピュータでは、そのようにして転送されてくるプログラムを、通信部１０８で受信し、内蔵するハードディスク１０５にインストールすることができる。
【０１２６】
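The computer has a built-in CPU (Central Processing Unit) 102. An input/output interface 110 is connected to the CPU 102 via a bus 101. When a command is input via the input/output interface 110 by the user operating an input unit 107 composed of a keyboard, a mouse, a microphone, and the like, the CPU 102 executes a program stored in the ROM (Read Only Memory) 103 accordingly. Alternatively, the CPU 102 loads into the RAM (Random Access Memory) 104 and executes a program stored on the hard disk 105, a program transferred from a satellite or a network, received by the communication unit 108, and installed on the hard disk 105, or a program read from the removable recording medium 111 mounted in the drive 109 and installed on the hard disk 105. The CPU 102 thereby performs the processing according to the flowchart described above or the processing performed by the configurations of the block diagrams described above. Then, as necessary, the CPU 102 outputs the processing result from an output unit 106 composed of an LCD (Liquid Crystal Display), a speaker, and the like via the input/output interface 110, transmits it from the communication unit 108, or records it on the hard disk 105, for example.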
【０１２７】
ここで、本明細書において、コンピュータに各種の処理を行わせるためのプログラムを記述する処理ステップは、必ずしもフローチャートとして記載された順序に沿って時系列に処理する必要はなく、並列的あるいは個別に実行される処理（例えば、並列処理あるいはオブジェクトによる処理）も含むものである。
【０１２８】
また、プログラムは、１のコンピュータにより処理されるものであっても良いし、複数のコンピュータによって分散処理されるものであっても良い。さらに、プログラムは、遠方のコンピュータに転送されて実行されるものであっても良い。
【０１２９】
なお、図３に示した音声認識装置は、例えば、音声によってデータベースの検索を行う場合や、各種の機器の操作を行う場合、各機器へのデータ入力を行う場合、音声対話システム等に適用可能である。より具体的には、例えば、音声による地名の問合せに対して、対応する地図情報を表示するデータベース検索装置や、音声による命令に対して、荷物の仕分けを行う産業用ロボット、キーボードの代わりに音声入力によりテキスト作成を行うディクテーションシステム、ユーザとの会話を行うロボットにおける対話システム等に適用可能である。
【０１３０】
また、本実施の形態では、速度／精度設定部２７において、ユーザによる操作レバー２８の操作に応じて、音声認識処理の速度または精度を設定するようにしたが、音声認識処理の速度または精度は、その他、例えば、音声認識処理に割り当て可能なリソース等の要因に基づいて設定することが可能である。
【０１３１】
即ち、例えば、図１０に示したようなコンピュータにプログラムを実行させることによって、図３に示した音声認識装置を実現する場合においては、CPU１０２は、一般に、音声認識処理以外のタスクも実行することから、音声認識処理に割り当て可能なリソースは、時々刻々と変化する。そこで、速度／精度設定部２７においては、CPU１０２が音声認識処理に割り当て可能なリソースを認識し、そのリソースによって、リアルタイムで、かつ最大の精度が得られるように、音声認識処理の速度と精度を設定するようにすることができる。
【０１３２】
また、本実施の形態では、スコア計算部２５において、連続量の特徴ベクトルを用いて、連続ＨＭＭ法に基づく音響スコアを計算するようにしたが、本発明は、例えば、離散値の特徴ベクトルを用いて、離散ＨＭＭ法に基づく音響スコアを計算する場合にも適用可能である。
【０１３３】
即ち、例えば、Satoshi Takahashi, Kiyoaki Aikawa, and Shigeki Sagayama. Discrete mixture hmm. In International Conference on Acoustic, Speech, and Signal Processing, pages 971-974, 1997等には、離散混合分布型ＨＭＭ(discrete mixture HMM)による音声認識手法が記載されているが、この離散混合分布型ＨＭＭによれば、特徴ベクトルｘ_tの出力確率ｂ_s（ｘ_t）は、例えば、次式にしたがって計算される。
【０１３４】
b_s(x_t) = Σ_m C_m × G_m(Z_i) ... (2)
【０１３５】
ここで、式（２）において、Ｃ_mは、ｍ番目の関数Ｇ_m（）に対する重み係数であり、関数Ｇ_m（）は、離散混合分布型ＨＭＭを構成するｍ番目の確率関数である。また、Σは、変数ｍについてのサメーションを表す。また、Ｚ_iは、特徴ベクトルｘ_tが属する特徴ベクトル空間の部分空間を表し、例えば、連続量の特徴ベクトルｘ_tをベクトル量子化して得られるものである。従って、Ｚ_iは、特徴ベクトルｘ_tが属する特徴ベクトル空間の部分空間のコードベクトル（代表ベクトル）を表すと考えることもでき、その値は離散値である。
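Equation (2) can be sketched in the same way as equation (1), with each probability function G_m represented as a lookup table indexed by the code Z_i. The table representation, the optional `selected` argument (used to restrict the sum to a subset, as a calculation target function table would), and the names are assumptions for illustration.

```python
def discrete_output_probability(code, weights, prob_tables, selected=None):
    """b_s(x_t) = sum over m of C_m * G_m(Z_i), with G_m stored as a table."""
    indices = range(len(weights)) if selected is None else selected
    return sum(weights[m] * prob_tables[m][code] for m in indices)


weights = [0.6, 0.4]
prob_tables = [
    [0.1, 0.9],  # G_0 over codes 0 and 1
    [0.5, 0.5],  # G_1 over codes 0 and 1
]
full = discrete_output_probability(1, weights, prob_tables)
partial = discrete_output_probability(1, weights, prob_tables, selected=[0])
```

Restricting `selected` to the highest-contribution functions trades accuracy for speed in the same way as in the continuous case.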
【０１３６】
式（２）は、前述した式（１）と同様の形をしているから、離散混合分布型ＨＭＭを用いる場合も、計算対象関数表によって、出力確率を求めるための確率関数の計算の一部を省くことが可能であり、従って、図６乃至図９に示した計算対象関数表における確率密度関数ｇ_m（）を、確率関数Ｇ_m（）に置き換えた計算対象関数表を用いることにより、ユーザ等の要求に応じた速度や精度の音声認識処理を行うことが可能となる。
【０１３７】
また、本実施の形態では、ＨＭＭ法に基づく音声認識を行うようにしたが、本発明は、その他のアルゴリズムに基づく音声認識にも適用可能である。
【０１３８】
さらに、本実施の形態では、計算対象関数表を、確率密度関数の他、必要なフロア値も用いて構成するようにしたが、計算対象関数表は、フロア値を用いずに構成することも可能である。
【０１３９】
なお、図６乃至図９に示した計算対象関数表は、原理的には、例えば、次のようにして作成することが可能である。即ち、特徴ベクトル空間の各部分空間Ｙ₀乃至Ｙ₅₁₁を代表するコードベクトルｖ₀乃至ｖ₅₁₁それぞれが、音響モデルを定義する１６の確率密度関数ｇ₀（）乃至ｇ₁₅（）それぞれから出力される確率を求め、その確率を、出力確率に対する寄与度として、各部分空間について、寄与度の相対的に大きい確率密度関数を選択し、その部分空間のエントリに登録することにより、計算対象関数表を作成することができる。なお、計算対象関数表のフロア値としては、例えば、各部分空間に属する任意の特徴ベクトルｘ_tについて、式（１）にしたがって計算される出力確率の最小値や、最大値、平均値などを採用することが可能である。
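The table construction outlined above can be sketched as a ranking of densities at each code vector. The sketch takes the value g_m(v_j) as the contribution, as the text describes, and keeps a fixed number of top-ranked functions per subspace; the fixed cut-off `keep`, the toy one-dimensional densities, and all names are assumptions.

```python
import math


def build_table(codebook, densities, keep):
    """For each code vector, list the `keep` densities with the largest value."""
    table = {}
    for j, v in enumerate(codebook):
        ranked = sorted(range(len(densities)),
                        key=lambda m: densities[m](v), reverse=True)
        table[j] = ranked[:keep]
    return table


def make_density(mu):
    # Toy unnormalised 1-D bump centred at mu, standing in for g_m.
    return lambda v: math.exp(-(v[0] - mu) ** 2)


densities = [make_density(0.0), make_density(1.0), make_density(5.0)]
codebook = [[0.0], [4.0]]
table = build_table(codebook, densities, keep=2)
```

Each list comes out ordered from largest to smallest contribution, which matches the left-to-right ordering used in the table of FIG. 8.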
【０１４０】
【発明の効果】
本発明の一側面によれば、例えば、ユーザ等の要求に応じた速度や精度の音声認識処理を行うことが可能となる。即ち、速度重視の音声認識処理や精度重視の音声認識処理を行うことが可能となる。
【図面の簡単な説明】
【図１】従来の音声認識装置の一例の構成例を示すブロック図である。
【図２】従来の音声認識装置で用いられる計算対象関数表を示す図である。
【図３】本発明を適用した音声認識装置の一実施の形態の構成例を示すブロック図である。
【図４】マッチング部１１の構成例を示すブロック図である。
【図５】マッチング部１１によるマッチング処理を説明するフローチャートである。
【図６】計算対象関数表記憶部２４に記憶される計算対象関数表の第１実施の形態の構成例を示す図である。
【図７】計算対象関数表記憶部２４に記憶される計算対象関数表の第２実施の形態の構成例を示す図である。
【図８】計算対象関数表記憶部２４に記憶される計算対象関数表の第３実施の形態の構成例を示す図である。
【図９】計算対象関数表記憶部２４に記憶される計算対象関数表の第４実施の形態の構成例を示す図である。
【図１０】本発明を適用したコンピュータの一実施の形態の構成例を示すブロック図である。
【符号の説明】
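1 microphone, 2 AD conversion unit, 3 feature extraction unit, 5 acoustic model database, 6 dictionary database, 7 grammar database, 11 matching unit, 21 subspace detection unit, 22 subspace data storage unit, 23 calculation target function selection unit, 24 calculation target function table storage unit, 25 score calculation unit, 26 output selection unit, 27 speed/accuracy setting unit, 28 operation lever, 101 bus, 102 CPU, 103 ROM, 104 RAM, 105 hard disk, 106 output unit, 107 input unit, 108 communication unit, 109 drive, 110 input/output interface, 111 removable recording medium
[0001]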
BACKGROUND OF THE INVENTION
The present invention relates to a voice recognition device, a voice recognition method, a program, and a recording medium, and in particular, for example, to a voice recognition device, voice recognition method, program, and recording medium capable of performing voice recognition processing at a speed or accuracy that meets a request from a user or the like.
[0002]
[Prior art]
FIG. 1 shows an example of the configuration of a conventional speech recognition apparatus.
[0003]
The voice uttered by the user is input to a mic (microphone) 1, which converts the input voice into an audio signal as an electrical signal. This audio signal is supplied to an AD (Analog Digital) conversion unit 2. In the AD conversion unit 2, the audio signal, which is an analog signal from the microphone 1, is sampled and quantized, and thereby converted into audio data, which is a digital signal. This audio data is supplied to the feature extraction unit 3.
[0004]
The feature extraction unit 3 performs acoustic processing on the audio data from the AD conversion unit 2 for each appropriate frame, thereby extracting a feature vector (feature amount) such as MFCC (Mel Frequency Cepstrum Coefficient), and supplies it to the matching unit 4. The feature extraction unit 3 can also extract other feature amounts, such as a spectrum, linear prediction coefficients, cepstrum coefficients, or line spectrum pairs.
[0005]
The matching unit 4 uses the feature vectors from the feature extraction unit 3 and, referring to the acoustic model database 5, the dictionary database 6, and the grammar database 7 as necessary, recognizes the voice input to the microphone 1 (input voice) based on, for example, the continuous distribution HMM method.
[0006]
That is, the acoustic model database 5 stores acoustic models representing the acoustic features of individual phonemes, syllables, and the like in the language of the speech to be recognized. Here, since speech recognition is performed based on the continuous distribution HMM method, an HMM (Hidden Markov Model), for example, is used as the acoustic model. The dictionary database 6 stores a word dictionary in which information on pronunciation (phonological information) is described for each word to be recognized. The grammar database 7 stores grammar rules (a language model) describing how the words registered in the word dictionary of the dictionary database 6 are chained (connected). As the grammar rules, for example, rules based on a context-free grammar (CFG) or statistical word chain probabilities (N-gram) can be used.
[0007]
The matching unit 4 constructs an acoustic model of a word (word model) by connecting the acoustic models stored in the acoustic model database 5 with reference to the word dictionary of the dictionary database 6. The matching unit 4 further connects several word models with reference to the grammar rules stored in the grammar database 7, and performs matching between the word models connected in this way and the feature vector sequence extracted from the voice input to the microphone 1, based on, for example, the continuous distribution HMM method, thereby recognizing the voice. That is, the matching unit 4 detects the sequence of word models with the highest score (likelihood) of outputting (observing) the time-series feature vectors supplied from the feature extraction unit 3, and outputs the word string corresponding to that sequence of word models as the speech recognition result.
[0008]
In other words, the matching unit 4 accumulates the appearance probabilities of the feature vectors for the word string corresponding to the connected word models, takes the accumulated value as the score, and outputs the word string with the highest score as the speech recognition result.
[0009]
The score is generally calculated by comprehensively evaluating an acoustic score given by the acoustic model stored in the acoustic model database 5 (hereinafter referred to as the acoustic score as appropriate) and a linguistic score given by the grammar rules stored in the grammar database 7 (hereinafter referred to as the language score as appropriate).
[0010]
That is, in the case of the HMM method, for example, the acoustic score is calculated for each word as the accumulated probability with which the feature vector series output by the feature extraction unit 3 is observed from the acoustic model constituting the word model. In the case of a bigram, for example, the language score is obtained as the probability that the word of interest and the word immediately preceding it are linked (connected). A speech recognition result is then determined based on a final score (hereinafter referred to as the final score as appropriate) obtained by comprehensive evaluation (for example, weighted addition) of the acoustic score and the language score of each word.
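The weighted addition mentioned above can be illustrated with a minimal Python sketch. The use of log-domain scores, the weight value `lm_weight`, and the function names `final_score` and `best_hypothesis` are assumptions made for illustration only; the description does not specify them.

```python
def final_score(acoustic_scores, language_scores, lm_weight=0.5):
    """Weighted addition of per-word acoustic and language scores.

    Scores are assumed to be log-domain values, one pair per word.
    lm_weight is a hypothetical weight applied to the language score.
    """
    if len(acoustic_scores) != len(language_scores):
        raise ValueError("one acoustic and one language score per word expected")
    return sum(a + lm_weight * l
               for a, l in zip(acoustic_scores, language_scores))


def best_hypothesis(hypotheses, lm_weight=0.5):
    """Return the word string whose final score is the largest.

    Each hypothesis is a tuple (words, acoustic_scores, language_scores).
    """
    best = max(hypotheses, key=lambda h: final_score(h[1], h[2], lm_weight))
    return best[0]
```

With this sketch, raising `lm_weight` makes the grammar rules weigh more heavily against the acoustic evidence; how the weight would actually be chosen is outside the scope of the description.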
[0011]
By performing the processing described above, in the speech recognition apparatus of FIG. 1, for example, when the user utters “Today is good weather”, an acoustic score and a language score are given to each of the words “today”, “ha”, “good”, “weather”, and “it is”, and when the final score obtained by their comprehensive evaluation is the largest, the word string “today”, “ha”, “good”, “weather”, “it is” is output as the speech recognition result.
[0012]
Incidentally, in the above case, if the five words “today”, “ha”, “good”, “weather”, and “it is” are registered in the word dictionary of the dictionary database 6, there are $5^5$ possible sequences of five words that can be constructed from these five words. Therefore, put simply, the matching unit 4 must evaluate these $5^5$ word strings and determine the one that best matches the user's utterance (the one that maximizes the final score). As the number of words registered in the word dictionary increases, the number of sequences of that many words grows as the number of words raised to the power of the number of words, so the number of word strings to be evaluated becomes enormous.
[0013]
Furthermore, since the number of words included in an utterance is generally unknown, it is necessary to evaluate not only word strings consisting of five words but also word strings consisting of one word, two words, and so on. The number of word strings to be evaluated therefore becomes even larger, and efficiently determining the most probable speech recognition result from among such an enormous number of word strings, in terms of the amount of computation, is a very important issue.
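The growth in the number of candidate word strings can be checked with a short Python sketch. The function names and the upper bound `max_length` on the utterance length are assumptions for illustration; the description states only that the number of words in an utterance is unknown.

```python
def count_fixed_length(vocab_size, length):
    """Number of word strings of exactly `length` words from the vocabulary."""
    return vocab_size ** length


def count_up_to_length(vocab_size, max_length):
    """Number of word strings of 1 to `max_length` words (assumed bound)."""
    return sum(vocab_size ** n for n in range(1, max_length + 1))


# Five registered words, five-word strings: 5 to the power of 5.
print(count_fixed_length(5, 5))   # 3125
# Allowing every length from one to five words.
print(count_up_to_length(5, 5))   # 3905
```

Even for this five-word dictionary, exhaustive evaluation already requires thousands of word strings, which is why the amount of computation per score matters.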
[0014]
For example, there is a method of omitting part of the calculation of the output probabilities that constitute the acoustic score, as described in the following documents: E. Bocchieri. Vector quantization for the efficient computation of continuous density likelihoods. In International Conference on Acoustic, Speech, and Signal Processing, volume 2, pages 692-695, Apr. 1993 (hereinafter referred to as Document 1); K. M. Knill, M. J. F. Gales, and S. J. Young. Use of gaussian selection in large vocabulary continuous speech recognition using hmms. In International Conference on Spoken Language Processing, volume 1, pages 470-473, Oct. 1996 (hereinafter referred to as Document 2); M. J. F. Gales, K. M. Knill, and S. J. Young. State-based gaussian selection in large vocabulary continuous speech recognition using hmms. In Cambridge University Technical Report, TR284, Jan. 1997 (hereinafter referred to as Document 3); and S. M. Herman and R. A. Sukkar. Variable threshold vector quantization for reduced continuous density likelihood computation in speech recognition. In IEEE Workshop on Acoustic Speech Recognition and Understanding Proceedings, pages 331-338, Santa Barbara, 1997 (hereinafter referred to as Document 4).
[0015]
That is, according to the continuous HMM method, for example, the output probability $b_s(x_t)$ with which the HMM in state $s$ outputs the feature vector $x_t$ at time $t$ is calculated by the following equation.
[0016]
$$b_s(x_t) = \sum_m c_m \, g_m(x_t) \qquad \cdots (1)$$
[0017]
Here, in equation (1), $c_m$ is the weighting factor for the $m$-th function $g_m()$, and the function $g_m()$ is the $m$-th probability density function (for example, a Gaussian distribution) constituting the HMM. Σ represents summation over the variable $m$. Therefore, according to equation (1), the output probability $b_s(x_t)$ is calculated as the sum of the probability density functions $g_m(x_t)$ weighted by $c_m$.
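Equation (1) can be sketched in Python as follows. The description gives a Gaussian distribution only as an example of $g_m()$; the diagonal-covariance form used here, and the names `gaussian_pdf` and `output_probability`, are assumptions for illustration.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a Gaussian with diagonal covariance at vector x.

    `var` holds the per-dimension variances (an assumed parameterization).
    """
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return math.exp(log_p)


def output_probability(x, weights, means, variances):
    """Equation (1): b_s(x_t) = sum over m of c_m * g_m(x_t)."""
    return sum(c * gaussian_pdf(x, mu, var)
               for c, mu, var in zip(weights, means, variances))
```

Each term of the sum costs one density evaluation, so the cost of equation (1) grows linearly with the number of probability density functions that define the HMM.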
[0018]
The weighting factor $c_m$ and the probability density function $g_m()$ are one item of the definition information that defines the HMM serving as the acoustic model (other definition information includes, for example, the state transition probability, that is, the probability that the HMM transitions from a certain state to another state, including the same state where applicable). The HMM may be defined using only one set of a weighting factor $c_m$ and a probability density function $g_m()$, or it may be defined using a plurality of sets.
[0019]
When the HMM is defined using a plurality of $N$ sets of weighting factors $c_0$ to $c_{N-1}$ and probability density functions $g_0()$ to $g_{N-1}()$, the calculation of equation (1) must be performed while varying the variable $m$ over the integer values from 0 to $N-1$.
[0020]
However, the $N$ probability density functions $g_0(x_t), g_1(x_t), \ldots, g_{N-1}(x_t)$ that define the HMM may include ones whose contribution to the output probability $b_s(x_t)$, and hence to the acoustic score (hereinafter referred to as the degree of contribution as appropriate), is very small.
[0021]
Therefore, in the methods described in Documents 1 to 4, probability density functions $g_m()$ with a very small degree of contribution are omitted from the calculation of equation (1), which reduces the amount of calculation while suppressing deterioration of speech recognition accuracy.
[0022]
Specifically, for example, as shown in FIG. 2, a table is created in which each predetermined subspace of the feature vector space is associated with one or more probability density functions $g_m()$ to be used in calculating the output probability $b_s(x_t)$ of a feature vector $x_t$ belonging to that subspace (hereinafter referred to as a calculation target function table as appropriate). The output probability $b_s(x_t)$ of a feature vector $x_t$ belonging to a certain subspace is then calculated using only those of the $N$ probability density functions $g_0()$ to $g_{N-1}()$ that are associated, in the calculation target function table, with the subspace to which the feature vector $x_t$ belongs.
[0023]
In this case, the calculation of some of the probability density functions can be omitted, so the amount of calculation is reduced and the speed of the speech recognition processing can be improved. In addition, since the probability density functions whose calculation is omitted contribute little to the output probability $b_s(x_t)$, the degradation of speech recognition accuracy caused by omitting them can be suppressed.
[0024]
Here, in the calculation target function table of FIG. 2 (the same applies to FIGS. 6 to 9 described later), the feature vector space is divided into 512 subspaces $Y_0, Y_1, \ldots, Y_{511}$.
[0025]
Ideally, the calculation target function table would associate, not with each subspace but with each individual feature vector $x_t$, the one or more probability density functions $g_m()$ to be used in calculating the output probability $b_s(x_t)$ of that feature vector. However, because the feature vector $x_t$ is a continuous quantity, such a table cannot be created, and the calculation target function table is therefore created in a form that associates probability density functions $g_m()$ with each subspace.
[0026]
According to the calculation target function table of FIG. 2, the output probability $b_s(x_t)$ of a feature vector $x_t$ belonging to the subspace $Y_0$ is calculated by the formula $b_s(x_t) = c_0 g_0(x_t) + c_1 g_1(x_t) + c_5 g_5(x_t) + c_{15} g_{15}(x_t)$.
[0027]
On the other hand, assuming that the total number $N$ of probability density functions $g_m()$ defining the HMM is 16, for example, when equation (1) is used as is, the output probability $b_s(x_t)$ of the feature vector $x_t$ is calculated by the formula $b_s(x_t) = c_0 g_0(x_t) + c_1 g_1(x_t) + \cdots + c_{15} g_{15}(x_t)$.
[0028]
Therefore, when using the formula (1) as it is, it is necessary to perform calculations for 16 probability density functions. However, when using the calculation target function table, it is sufficient to perform calculations for four probability density functions. Therefore, the amount of calculation can be greatly reduced.
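The reduction from 16 evaluations to four can be sketched in Python as follows. Only the entry for the subspace $Y_0$ (functions 0, 1, 5, and 15) is taken from the description of FIG. 2; the dictionary layout, the names `CALC_TARGET_TABLE` and `output_probability_subset`, and the representation of each $g_m()$ as a Python callable are assumptions for illustration.

```python
# Calculation target function table: subspace index -> indices m of the
# probability density functions to evaluate. Only the Y_0 entry described
# for FIG. 2 is shown; the remaining entries are omitted here.
CALC_TARGET_TABLE = {0: [0, 1, 5, 15]}


def output_probability_subset(x, weights, pdfs, table, subspace):
    """Sum c_m * g_m(x) only over the functions listed for the subspace.

    `pdfs` is a list of callables, one per probability density function.
    """
    return sum(weights[m] * pdfs[m](x) for m in table[subspace])
```

For a feature vector in $Y_0$ this evaluates `len(CALC_TARGET_TABLE[0])`, that is, four functions instead of all 16; the result is an approximation of equation (1) that is close only when the omitted functions contribute little.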
[0029]
When the calculation target function table is used, it is necessary to detect which of the 512 subspaces $Y_0$ to $Y_{511}$ the feature vector $x_t$ belongs to; vector quantization, for example, can be used as a method for detecting this subspace.
[0030]
[Problems to be solved by the invention]
As described above, by using the calculation target function table, it is possible to reduce the amount of calculation for obtaining the acoustic score (output probability), and to improve the speech recognition processing speed while suppressing a decrease in speech recognition accuracy.
[0031]
However, when real-time speech recognition processing is required even though the resources allocated to the speech recognition processing have been reduced, for example, it is desirable to maintain the speech recognition processing speed, that is, to reduce the amount of calculation in accordance with the resources, even at the cost of some deterioration in speech recognition accuracy.
[0032]
This is because, when real-time speech recognition processing is required, subsequent processing is generally performed based on the speech recognition result, and failure to obtain the speech recognition result in real time therefore hinders that subsequent processing.
[0033]
On the other hand, when sufficient resources can be allocated to the speech recognition processing, for example, a highly accurate speech recognition result can be obtained in real time by performing many calculations with those resources. That is, in this case, even if the speech recognition processing speed is reduced, a highly accurate speech recognition result can still be obtained in real time. Therefore, when resources are sufficient, it is desirable from the viewpoint of effective use of resources to perform highly accurate speech recognition processing using those sufficient resources.
[0034]
The present invention has been made in view of such a situation, and makes it possible to perform speech recognition processing with speed and accuracy according to a request.
[0035]
[Means for Solving the Problems]
  A speech recognition device, program, or recording medium according to an aspect of the present invention is a speech recognition device that recognizes speech, including: extraction means for extracting a feature quantity of the speech; detection means for detecting the subspace of the feature quantity space to which the feature quantity of the speech belongs; selection means for selecting any one or more pieces of definition information from among the one or more pieces of definition information associated with the subspace to which the feature quantity of the speech belongs, in storage means that stores, in association with each of a plurality of subspaces of the feature quantity space, one or more pieces of definition information defining an HMM (Hidden Markov Model) used for matching processing with the feature quantity of the speech; and matching means for obtaining a score representing the likelihood that the speech corresponds to the HMM by performing matching processing between the feature quantity of the speech and the HMM using the definition information selected by the selection means, and for outputting a speech recognition result of the speech based on the score, wherein the definition information is a probability density function or a probability function used to obtain the output probability with which the HMM outputs the feature quantity, and the selection means selects, based on the speed or accuracy of the speech recognition processing set according to a user's operation or the speed or accuracy of the speech recognition processing set according to the resources that can be allocated to the speech recognition processing, a number of pieces of the definition information corresponding to that speed or accuracy, in an order corresponding to the magnitude of the contribution of the definition information to the score; or a program for causing a computer to function as such a speech recognition device; or a recording medium on which such a program is recorded.
[0036]
  A speech recognition method according to an aspect of the present invention is a speech recognition method of a speech recognition device that recognizes speech, in which the speech recognition device performs: an extraction step of extracting a feature quantity of the speech; a detection step of detecting the subspace of the feature quantity space to which the feature quantity of the speech belongs; a selection step of selecting any one or more pieces of definition information from among the one or more pieces of definition information associated with the subspace to which the feature quantity of the speech belongs, in storage means that stores, in association with each of a plurality of subspaces of the feature quantity space, one or more pieces of definition information defining an HMM (Hidden Markov Model) used for matching processing with the feature quantity of the speech; and a matching step of obtaining a score representing the likelihood that the speech corresponds to the HMM by performing matching processing between the feature quantity of the speech and the HMM using the definition information selected in the selection step, and outputting a speech recognition result of the speech based on the score, wherein the definition information is a probability density function or a probability function used to obtain the output probability with which the HMM outputs the feature quantity, and in the selection step, based on the speed or accuracy of the speech recognition processing set according to a user's operation or the speed or accuracy of the speech recognition processing set according to the resources that can be allocated to the speech recognition processing, a number of pieces of the definition information corresponding to that speed or accuracy is selected in an order corresponding to the magnitude of the contribution of the definition information to the score.
[0039]
  In one aspect of the present invention, the feature amount of the voice is extracted, and a partial space to which the feature amount of the voice belongs in the feature amount space is detected. Further, each of the plurality of partial spaces of the feature amount space is stored in association with one or more definition information defining an HMM (Hidden Markov Model) used for the matching processing with the feature amount of the speech. Any one or more pieces of definition information are selected from the one or more pieces of definition information associated with the partial space to which the audio feature amount belongs. Then, by using the selected definition information to perform a matching process between the voice feature quantity and the HMM, a score representing the likelihood that the voice corresponds to the HMM is obtained. And the speech recognition result of the speech is output based on the score. The definition information is a probability density function or a probability function that is used to obtain an output probability that the HMM outputs the feature value. In the selection of the definition information, a speech recognition process set according to a user operation is performed. Based on speed or accuracy, or speed or accuracy of speech recognition processing set according to resources that can be allocated to speech recognition processing, the number of the definition information corresponding to the speed or accuracy of the speech recognition processing, The definition information is selected in the order corresponding to the size that contributes to the score.
[0040]
DETAILED DESCRIPTION OF THE INVENTION
FIG. 3 shows a configuration example of an embodiment of a speech recognition apparatus to which the present invention is applied. In the figure, portions corresponding to those in FIG. 1 are denoted by the same reference numerals, and description thereof will be omitted below as appropriate. That is, the speech recognition apparatus in FIG. 3 is configured in the same manner as in FIG. 1 except that a matching unit 11 is newly provided instead of the matching unit 4.
[0041]
FIG. 4 shows a configuration example of the matching unit 11 of FIG. 3.
[0042]
The time-series feature vector output from the feature extraction unit 3 (FIG. 3) is supplied to the subspace detection unit 21 and the score calculation unit 25.
[0043]
The subspace detection unit 21 refers to the subspace data storage unit 22, detects the subspace of the feature vector space to which the feature vector supplied to it belongs, and supplies subspace information representing that subspace to the calculation target function selection unit 23.
[0044]
The subspace data storage unit 22 stores subspace data as information necessary for the subspace detection unit 21 to detect the subspace to which the feature vector belongs.
[0045]
Here, the subspace detection unit 21 can detect the subspace to which the feature vector belongs by, for example, vector quantization. In this case, the subspace data storage unit 22 stores, as the subspace data, a code book used for the vector quantization.
[0046]
The code book can be created by using a large number of audio data and performing learning using, for example, an LBG (Linde Buzo Gray) algorithm, which is one of code book learning algorithms.
[0047]
In the code book, a code vector serving as the representative vector of each subspace obtained when the feature vector space is divided into several subspaces (in this embodiment, 512 subspaces as described above) is registered together with a code representing that code vector. Therefore, when the feature vector space is divided into, for example, 512 subspaces, 512 code vectors and their corresponding codes are registered in the code book.
[0048]
The subspace detection unit 21 calculates the distance between the feature vector and each of the 512 code vectors registered in the code book, and detects the code vector that minimizes the distance. The subspace detection unit 21 then regards the subspace whose representative vector is that code vector as the subspace to which the feature vector belongs, and outputs the code corresponding to the detected code vector as subspace information on the subspace to which the feature vector belongs.
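The nearest-code-vector search described above can be sketched in Python as follows. The description says only that a distance is calculated; the squared Euclidean distance, the use of the list index as the code, and the name `detect_subspace` are assumptions for illustration.

```python
def detect_subspace(feature, codebook):
    """Return the code of the code vector nearest to the feature vector.

    `codebook` is a list of code vectors; the list index j is assumed to
    serve as the code that identifies subspace Y_j.
    """
    def squared_distance(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    return min(range(len(codebook)),
               key=lambda j: squared_distance(feature, codebook[j]))
```

For example, `detect_subspace([0.9, 1.2], [[0, 0], [1, 1], [5, 5]])` returns 1, since the second code vector is the closest. A full search like this costs one distance calculation per code vector, that is, 512 per feature vector in this embodiment.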
[0049]
The calculation target function selection unit 23 refers to the calculation target function table (definition information table) stored in the calculation target function table storage unit 24 based on the subspace information from the subspace detection unit 21, and thereby selects the probability density functions or the like, defining the acoustic model (here, an HMM as described above), that are to be used for calculating the acoustic score (output probability) of the feature vector.
[0050]
That is, in addition to the partial space information supplied from the partial space detection unit 21, the calculation target function selection unit 23 is also supplied with speed / accuracy information from the speed / accuracy setting unit 27.
[0051]
The calculation target function selection unit 23 selects one or more probability density functions or the like, based on the speed / accuracy information supplied from the speed / accuracy setting unit 27, from among the probability density functions or the like associated, in the calculation target function table stored in the calculation target function table storage unit 24, with the subspace represented by the subspace information from the subspace detection unit 21. The calculation target function selection unit 23 then supplies selection information representing the selected probability density functions or the like to the score calculation unit 25.
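One way such a selection could work is sketched below in Python. The description states that the number of functions selected corresponds to the set speed or accuracy and that they are taken in order of contribution; the representation of the speed / accuracy information as a value `accuracy` between 0.0 and 1.0, the linear mapping to a count, and the assumption that each table entry is already sorted by decreasing contribution are all hypothetical.

```python
import math


def select_functions(candidates, accuracy):
    """Select the leading probability density functions for one subspace.

    `candidates` lists function indices m, assumed to be sorted by
    decreasing contribution to the output probability. `accuracy` is a
    hypothetical setting: 1.0 uses every candidate (slowest, most
    accurate) and 0.0 uses only the largest contributor (fastest).
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must lie between 0.0 and 1.0")
    count = max(1, math.ceil(accuracy * len(candidates)))
    return candidates[:count]
```

With the entry `[0, 1, 5, 15]`, a setting of 1.0 keeps all four functions, 0.5 keeps `[0, 1]`, and 0.0 keeps `[0]`, so moving the setting trades evaluations for accuracy one function at a time.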
[0052]
The calculation target function table storage unit 24 stores a calculation target function table in which one or more probability density functions defining the acoustic model stored in the acoustic model database 5 are associated with each of the plurality of subspaces of the feature vector space.
[0053]
  The score calculation unit 25 uses the feature vector supplied from the feature extraction unit 3 and, while referring as necessary to the acoustic model stored in the acoustic model database 5, the word dictionary stored in the dictionary database 6, and the grammar rules stored in the grammar database 7, constructs speech recognition result candidates (hereinafter referred to as hypotheses as appropriate) and calculates, for each hypothesis, an acoustic score based on the HMM method as described above and a language score.
[0054]
However, for the acoustic score, the score calculation unit 25 calculates the output probability $b_s(x_t)$ of the feature vector $x_t$ using, instead of all the probability density functions defining the acoustic model, only the probability density functions represented by the selection information supplied from the calculation target function selection unit 23, and obtains the acoustic score based on that output probability.
[0055]
The acoustic score and language score obtained in the score calculation unit 25 are supplied to the output selection unit 26, and the output selection unit 26 comprehensively evaluates the acoustic score and language score obtained for each hypothesis to obtain a final score. For example, a hypothesis that maximizes the final score is selected and output as a speech recognition result.
[0056]
The speed / accuracy setting unit 27 sets the speed or accuracy of the voice recognition processing according to the operation of the operation lever 28 and supplies speed / accuracy information representing the set speed or accuracy to the calculation target function selection unit 23.
[0057]
The operation lever 28 is operated when the user specifies the speed or accuracy of the voice recognition processing, and supplies an operation signal corresponding to the operation to the speed / accuracy setting unit 27.
[0058]
Accordingly, the speed / accuracy setting unit 27 sets the speed or accuracy of the speech recognition process in accordance with the user's request.
[0059]
Here, the operation lever 28 can be configured as a physical lever, or can be configured as a virtual lever displayed on the screen. In the case where the operation lever 28 is configured as a physical lever, the operation lever 28 is actually gripped and operated by the user. When the operation lever 28 is configured as a virtual lever, the operation lever 28 is operated by the user by dragging with the mouse.
[0060]
In the embodiment shown in FIG. 4, when a lower speed or higher accuracy voice recognition process is requested, the operation lever 28 is operated in the left direction, and conversely, a higher speed or lower accuracy voice recognition process is performed. When requested, the operation lever 28 is operated in the right direction.
[0061]
Next, the matching process performed by the matching unit 11 in FIG. 4 will be described with reference to the flowchart in FIG. 5.
[0062]
When the user speaks and the feature extraction unit 3 starts outputting the feature vector of the voice, the matching unit 11 starts the matching process.
[0063]
That is, the time-series feature vectors output from the feature extraction unit 3 are supplied to the subspace detection unit 21 and the score calculation unit 25, and in step S1 the subspace detection unit 21 refers to the subspace data storage unit 22 and detects the subspace to which the feature vector $x_t$ from the feature extraction unit 3 belongs. The subspace detection unit 21 then supplies subspace information representing that subspace to the calculation target function selection unit 23, and the process proceeds to step S2.
[0064]
In step S2, the calculation target function selection unit 23 selects, as necessary based on the speed / accuracy information from the speed / accuracy setting unit 27, one or more probability density functions or the like from among those associated, in the calculation target function table stored in the calculation target function table storage unit 24, with the subspace represented by the subspace information from the subspace detection unit 21, and supplies selection information representing the selected probability density functions or the like to the score calculation unit 25.
[0065]
In step S3, for the words stored in the word dictionary of the dictionary database 6, the score calculation unit 25 calculates the output probability $b_s(x_t)$ of the feature vector $x_t$ using the probability density functions or the like, defining the acoustic model of the acoustic model database 5, that are represented by the selection information supplied from the calculation target function selection unit 23, and obtains the acoustic score based on that output probability; it also obtains the language score by referring to the grammar rules of the grammar database 7. Further, the score calculation unit 25 generates hypotheses (speech recognition result candidates) as necessary based on the acoustic score and the language score, and the process proceeds to step S4.
[0066]
In step S4, it is determined whether the calculation of the acoustic score and the language score has been completed up to the end point of the speech section uttered by the user. If it is determined that the calculation has not been completed, the process returns to step S1, and the same processing is thereafter repeated for the next feature vector. The calculation of the acoustic score and the language score is performed with pruning by the beam search method as necessary.
[0067]
If it is determined in step S4 that the calculation of the acoustic score and the language score has been completed up to the end point of the speech section uttered by the user, the process proceeds to step S5, where the output selection unit 26 obtains a final score for each of the one or more hypotheses by comprehensively evaluating the acoustic score and the language score obtained for it, selects, for example, the hypothesis that maximizes the final score, and outputs it as the speech recognition result, whereupon the matching process ends.
[0068]
Next, the calculation target function table stored in the calculation target function table storage unit 24 of FIG. 4 will be described with reference to FIGS. 6 to 9. In the following, it is assumed, for example, that the feature vector space is divided into 512 subspaces $Y_0$ to $Y_{511}$ and that the HMM serving as the acoustic model stored in the acoustic model database 5 is defined by 16 probability density functions $g_0()$ to $g_{15}()$.
[0069]
In the matching unit 11 of FIG. 4, as described above, one or more probability density functions or the like are basically selected, based on the speed / accuracy information, from among the probability density functions or the like associated, in the calculation target function table stored in the calculation target function table storage unit 24, with the subspace represented by the subspace information output by the subspace detection unit 21 (the subspace to which the feature vector $x_t$ belongs), and the output probability $b_s(x_t)$ of the feature vector $x_t$ is obtained using the selected functions. However, the output probability $b_s(x_t)$ can also be calculated using all of the probability density functions associated, in the calculation target function table, with the subspace to which the feature vector $x_t$ belongs.
[0070]
That is, in the matching unit 11, the output probability $b_s(x_t)$ can be calculated using probability density functions or the like selected from among those associated with the subspace to which the feature vector $x_t$ belongs, or it can be calculated using all of the probability density functions associated with that subspace.
[0071]
Now, let the mode in which the output probability $b_s(x_t)$ is calculated using probability density functions or the like selected from among those associated with the subspace to which the feature vector $x_t$ belongs be called the selectable mode, and let the mode in which the output probability $b_s(x_t)$ is calculated using all of the probability density functions associated with that subspace be called the non-selectable mode. In the non-selectable mode, a calculation target function table as shown in FIG. 6, for example, is used.
[0072]
That is, in the calculation target function table of FIG. 6, each of the subspaces $Y_0$ to $Y_{511}$ is associated with either a list $\{g_m\}$ of the probability density functions used in calculating the output probability $b_s(x_t)$ of a feature vector $x_t$ belonging to that subspace $Y_j$ ($j = 0, 1, \ldots, 511$) or a floor value.
[0073]
When the calculation target function table of FIG. 6 is used, on receiving from the subspace detection unit 21 subspace information representing a subspace $Y_j$ with which a list of probability density functions $\{g_m\}$ is associated, the calculation target function selection unit 23 selects all of the probability density functions $\{g_m\}$ associated with that subspace $Y_j$ and supplies selection information representing all of them to the score calculation unit 25.
[0074]
The score calculation unit 25 calculates the output probability $b_s(x_t)$ using the probability density functions represented by the selection information. In this case, therefore, the score calculation unit 25 calculates the output probability $b_s(x_t)$ in the same manner as described with reference to FIG. 2.
[0075]
Incidentally, in the calculation target function table of FIG. 6, a subspace $Y_j$ may be associated with a floor value instead of the probability density functions $\{g_m\}$ used in calculating the output probability $b_s(x_t)$ of the feature vector $x_t$.
[0076]
The floor value is a fixed value representing the minimum value of the output probability of feature vectors belonging to the subspace $Y_j$ with which it is associated, and the output probability of a feature vector belonging to a subspace $Y_j$ associated with a floor value is set to that floor value.
[0077]
In other words, on receiving from the subspace detection unit 21 subspace information representing a subspace $Y_j$ with which a floor value is associated, the calculation target function selection unit 23 selects the floor value associated with that subspace $Y_j$ and supplies selection information representing the floor value to the score calculation unit 25.
[0078]
When the selection information represents a floor value, the score calculation unit 25 uses that floor value as the output probability $b_s(x_t)$.
[0079]
Therefore, in this case, the output probability can be obtained without performing a calculation using the probability density function, so that the amount of calculation can be reduced.
[0080]
That is, in the calculation target function table of FIG. 6, some subspaces $Y_j$ are associated with a floor value instead of probability density functions $\{g_m\}$, and for a feature vector $x_t$ belonging to a subspace $Y_j$ associated with a floor value, no probability density function needs to be evaluated to obtain the output probability $b_s(x_t)$. Compared with using a calculation target function table in which probability density functions are associated with all of the subspaces, as shown in FIG. 2, the amount of calculation can therefore be reduced further.
[0081]
In the calculation target function table of FIG. 6, the floor value “−30.0” is associated with the subspace $Y_3$; this floor value is the logarithm of the output probability. The same applies to the floor values shown in the embodiments of the later figures.
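The handling of floor values can be sketched in Python as follows. Only the floor value −30.0 for the subspace $Y_3$ and its log-domain interpretation come from the description; representing a table entry as either a `float` (floor value) or a list of function indices, and the name `log_output_probability`, are assumptions for illustration.

```python
import math


def log_output_probability(x, entry, weights, pdfs):
    """Return the logarithm of the output probability for one table entry.

    `entry` is either a float floor value (already a log probability) or a
    list of indices m of the probability density functions to evaluate.
    """
    if isinstance(entry, float):
        # Floor value: no probability density function is evaluated.
        return entry
    return math.log(sum(weights[m] * pdfs[m](x) for m in entry))


# Hypothetical two-entry table: functions for Y_0, a floor value for Y_3.
table = {0: [0, 1, 5, 15], 3: -30.0}
```

A call such as `log_output_probability(x, table[3], weights, pdfs)` returns −30.0 immediately, whatever the feature vector, which is where the additional saving over the table of FIG. 2 comes from.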
[0082]
In the non-selectable mode, a calculation target function table as shown in FIG. 7 can be used in addition to the calculation target function table of FIG. 6.
[0083]
That is, the calculation target function table of FIG. 7 differs from the calculation target function table of FIG. 6 in that the number of probability density functions associated with each subspace Y_j is added. When the calculation target function table of FIG. 7 is used, the selection information output from the calculation target function selection unit 23 can include the number of probability density functions associated with the subspace Y_j to which the feature vector belongs. In this case, the score calculation unit 25 can immediately recognize how many probability density functions must be calculated to obtain the output probability.
[0084]
Note that the probability density functions registered in the calculation target function table are those with a large contribution to the output probability. In the calculation target function table of FIG. 6 or FIG. 7, no probability density function is registered for the subspace Y₃ because the 16 probability density functions g₀() to g₁₅() that define the acoustic model do not differ appreciably in their contribution to the output probability of feature vectors belonging to the subspace Y₃. In addition, for any feature vector belonging to the subspace Y₃, the output probability (its logarithm) calculated using the 16 probability density functions g₀() to g₁₅() is about −30, so setting the output probability to the fixed value −30.0 still gives a good approximation. For this reason, the output probability for the subspace Y₃ in the calculation target function tables of FIG. 6 and FIG. 7 is the fixed value −30.0.
[0085]
Next, FIG. 8 shows a configuration example of an embodiment of a calculation target function table stored in the calculation target function table storage unit 24 in the selectable mode.
[0086]
In the calculation target function table of FIG. 8, each subspace Y_j is associated not only with probability density functions {g_m} or a floor value but also with a plurality of numbers of probability density functions to use for calculating the output probability (and hence the acoustic score); each such number is hereinafter referred to as a “calculated number” as appropriate.
[0087]
That is, in the embodiment of FIG. 8, for example, the subspace Y₀ is associated with the floor value “−29.0”, the probability density functions {g₅(), g₁(), g₁₅(), g₀()}, and the calculated numbers {0, 1, 4}. The subspace Y₁ is associated with the floor value “−45.0”, the probability density functions {g₀(), g₁(), g₁₇(), g₈(), g₃(), g₁₀()}, and the calculated numbers {0, 3, 6}. The subspace Y₂ is associated with the floor value “−20.0”, the probability density functions {g₂(), g₆(), g₄()}, and the calculated numbers {0, 3, 3}. The subspace Y₃ is associated with the floor value “−30.0” and the calculated numbers {0, 0, 0}. In the same manner, each of the subspaces Y₄ to Y₅₁₀ is associated with a floor value or probability density functions {g_m} and with calculated numbers, and the last subspace Y₅₁₁ is associated with the floor value “−40.0”, the probability density function {g₁₅()}, and the calculated numbers {0, 0, 1}.
[0088]
When the calculation target function table of FIG. 8 is used, the calculation target function selection unit 23 first selects, based on the subspace information supplied from the subspace detection unit 21, the entry (row) of the calculation target function table for the subspace Y_j to which the feature vector x_t belongs. If the entry for the subspace Y_j selected in this way is called the selected entry, the calculation target function selection unit 23 then selects, from the plurality of calculated numbers in the selected entry, the one corresponding to the speed/accuracy information supplied from the speed/accuracy setting unit 27.
[0089]
That is, each of the plurality of calculated numbers in each entry of the calculation target function table represents the number of probability density functions to be selected by the calculation target function selection unit 23, and is registered according to the speed or accuracy required of the speech recognition processing.
[0090]
Specifically, suppose, for example, that three levels of speed or accuracy can be set for the speech recognition processing: “high speed / low accuracy”, “medium speed / medium accuracy”, and “low speed / high accuracy”. In the embodiment of FIG. 8, three calculated numbers are then registered in each entry of the calculation target function table. Of these three, the leftmost is selected when “high speed / low accuracy” is set, the second from the left when “medium speed / medium accuracy” is set, and the rightmost when “low speed / high accuracy” is set.
[0091]
Therefore, in the embodiment of FIG. 8, when the speed/accuracy information supplied from the speed/accuracy setting unit 27 represents “high speed / low accuracy”, the calculation target function selection unit 23 selects the leftmost of the three calculated numbers registered in each of the entries for the subspaces Y₀ to Y₅₁₁, namely 0, 0, 0, 0, ..., 0. When the speed/accuracy information represents “medium speed / medium accuracy”, it selects the second from the left, namely 1, 3, 3, 0, ..., 0. When the speed/accuracy information represents “low speed / high accuracy”, it selects the rightmost, namely 4, 6, 3, 0, ..., 1.
[0092]
From the above, when the feature vector x_t belongs to, for example, the subspace Y₀, the calculation target function selection unit 23 takes the entry for the subspace Y₀ as the selected entry. When the speed/accuracy information supplied from the speed/accuracy setting unit 27 indicates “high speed / low accuracy”, the calculation target function selection unit 23 selects the leftmost “0” of the three calculated numbers “0, 1, 4” registered in the selected entry. When the speed/accuracy information indicates “medium speed / medium accuracy”, it selects the second from the left, “1”, and when the speed/accuracy information indicates “low speed / high accuracy”, it selects the rightmost, “4”.
[0093]
If the calculated number selected in this way from the plurality of calculated numbers registered in the selected entry is called the selected calculation number, the calculation target function selection unit 23 selects from the selected entry as many probability density functions as the selected calculation number.
[0094]
Thus, for example, in FIG. 8, when the selected entry is that of the subspace Y₀ and the selected calculation number is “0”, “1”, or “4”, the calculation target function selection unit 23 selects 0, 1, or 4, respectively, of the probability density functions registered in the entry for the subspace Y₀.
[0095]
Here, probability density functions are selected from the selected entry as follows.
[0096]
That is, when the selected calculation number is “0”, the floor value is selected from the selected entry and no probability density function is selected. When the selected calculation number is a value other than “0”, that number of probability density functions is selected from the selected entry, counting from the left of the one or more probability density functions registered there.
[0097]
Therefore, in FIG. 8, when the selected entry is that of the subspace Y₀ and the selected calculation number is “0”, the calculation target function selection unit 23 selects the floor value “−29.0” registered in the entry for the subspace Y₀. When the selected calculation number is “1”, it selects only the leftmost of the probability density functions {g₅(), g₁(), g₁₅(), g₀()} registered in that entry, that is, {g₅()}. When the selected calculation number is “4”, it selects four from the left, that is, all of the probability density functions {g₅(), g₁(), g₁₅(), g₀()} registered in the entry for the subspace Y₀.
[0098]
Then, the calculation target function selection unit 23 supplies selection information representing the selected floor value or probability density function to the score calculation unit 25.
[0099]
Here, in the calculation target function table of FIG. 8, the three calculated numbers in the entry for the subspace Y₃ are all “0”. Therefore, when the entry for the subspace Y₃ is the selected entry, the selected calculation number is “0” regardless of whether the speed/accuracy information indicates “high speed / low accuracy”, “medium speed / medium accuracy”, or “low speed / high accuracy”, and the calculation target function selection unit 23 selects the floor value “−30.0”.
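The selection procedure for the table of FIG. 8 can be sketched in Python as follows. The rows reproduce the values quoted in the text for the subspaces Y₀ to Y₃ and Y₅₁₁; the `Entry` class, the `select` function, and the tuple-based selection information are hypothetical names and representations chosen for this illustration, not structures defined by the patent.

```python
from dataclasses import dataclass

MODES = (
    "high speed / low accuracy",
    "medium speed / medium accuracy",
    "low speed / high accuracy",
)


@dataclass(frozen=True)
class Entry:
    floor: float    # log-domain floor value
    pdfs: tuple     # component indices, leftmost = largest contribution
    counts: tuple   # one calculated number per mode, left to right


# Rows quoted in the text; the rows for Y4 to Y510 are not given there.
FIG8_ROWS = {
    0: Entry(-29.0, (5, 1, 15, 0), (0, 1, 4)),
    1: Entry(-45.0, (0, 1, 17, 8, 3, 10), (0, 3, 6)),
    2: Entry(-20.0, (2, 6, 4), (0, 3, 3)),
    3: Entry(-30.0, (), (0, 0, 0)),
    511: Entry(-40.0, (15,), (0, 0, 1)),
}


def select(subspace, mode):
    """Return ("floor", value) or ("pdfs", indices) for the selected entry."""
    entry = FIG8_ROWS[subspace]
    n = entry.counts[MODES.index(mode)]  # selected calculation number
    if n == 0:
        return ("floor", entry.floor)
    return ("pdfs", entry.pdfs[:n])      # first n from the left


for mode in MODES:
    print(mode, "->", select(0, mode))
```

Because the functions are stored in descending order of contribution, taking a prefix of `pdfs` always keeps the most significant components.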
[0100]
From the above, when the feature vector x_t belongs to, for example, the subspace Y₀, in the “high speed / low accuracy” mode, in which the speed/accuracy information is set to “high speed / low accuracy”, the score calculation unit 25 sets the output probability b_s(x_t) of the feature vector x_t to the floor value “−29.0”. In this case, because the output probability b_s(x_t) is obtained without any calculation using a probability density function, accuracy is reduced but high-speed processing is possible.
[0101]
In the “medium speed / medium accuracy” mode, in which the speed/accuracy information is set to “medium speed / medium accuracy”, the score calculation unit 25 obtains the output probability b_s(x_t) of the feature vector x_t by evaluating the single probability density function g₅(x_t) and calculating the weighted value c₅g₅(x_t) based on equation (1) above. In this case, because the output probability b_s(x_t) is obtained by evaluating one probability density function g₅(x_t), the processing speed is lower than in the “high speed / low accuracy” mode, but the accuracy is improved.
[0102]
Further, in the “low speed / high accuracy” mode, in which the speed/accuracy information is set to “low speed / high accuracy”, the score calculation unit 25 obtains the output probability b_s(x_t) of the feature vector x_t by evaluating the four probability density functions g₅(x_t), g₁(x_t), g₁₅(x_t), and g₀(x_t) and calculating the weighted sum c₅g₅(x_t) + c₁g₁(x_t) + c₁₅g₁₅(x_t) + c₀g₀(x_t) based on equation (1) above. In this case, because the output probability b_s(x_t) is obtained by evaluating four probability density functions, the processing speed is reduced further compared with the “high speed / low accuracy” mode, but the accuracy is improved further.
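For the entry of the subspace Y₀, the three modes described above can be summarized in a single expression. Writing every case in the log domain is an assumption of this summary, made so that the weighted sums are comparable with the floor value, which the text states is a logarithm; equation (1) itself is not reproduced in this section.

```latex
\log b_s(x_t) \approx
\begin{cases}
-29.0
  & \text{high speed / low accuracy (floor value)} \\[4pt]
\log\bigl( c_5\, g_5(x_t) \bigr)
  & \text{medium speed / medium accuracy} \\[4pt]
\log\bigl( c_5\, g_5(x_t) + c_1\, g_1(x_t) + c_{15}\, g_{15}(x_t) + c_0\, g_0(x_t) \bigr)
  & \text{low speed / high accuracy}
\end{cases}
```

The number of density evaluations per frame is 0, 1, and 4, respectively, which matches the calculated numbers {0, 1, 4} registered for Y₀.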
[0103]
In the matching unit 11 shown in FIG. 4, the speed/accuracy information is set according to the operation lever 28 operated by the user, so speech recognition processing can be performed at the speed and accuracy the user requests.
[0104]
When a plurality of probability density functions are registered in an entry of the calculation target function table, the probability density functions g_m() may be arranged, for example, in ascending or descending order of the suffix m. In the embodiment of FIG. 8, however, the probability density functions registered in each entry are arranged in descending order of their contribution to the output probability (and hence the acoustic score); that is, the leftmost probability density function has the largest contribution to the output probability.
[0105]
Accordingly, in this case, the calculation target function selection unit 23 preferentially selects the probability density functions with a large contribution to the output probability (and hence the acoustic score), and the score calculation unit 25 calculates the output probability using those functions. This minimizes the error in the output probability (acoustic score) caused by omitting the calculation of some probability density functions based on the calculation target function table.
[0106]
In the embodiment of FIG. 8, three calculated numbers, one for each of the “high speed / low accuracy”, “medium speed / medium accuracy”, and “low speed / high accuracy” modes, are registered in each entry of the calculation target function table. Alternatively, instead of calculated numbers, each entry may hold the total number of probability density functions registered in that entry, and the calculation target function selection unit 23 may select the calculated number, based on the speed/accuracy information, from the integer values in the range from 0 to that total number (hereinafter referred to as the selection range as appropriate).
[0107]
That is, for example, four probability density functions {g₅(), g₁(), g₁₅(), g₀()} are registered in the entry for the subspace Y₀ of the calculation target function table of FIG. 8, so the selection range is 0 to 4 and contains five integer values: 0, 1, 2, 3, and 4. The calculated number is therefore selected from these five integer values.
[0108]
In this case, the movable range of the operation lever 28 in FIG. 4 can be divided into five ranges: near the left end, near the midpoint between the left end and the center, near the center, near the midpoint between the center and the right end, and near the right end. When the operation lever 28 is positioned in each of these five ranges, the calculation target function selection unit 23 can select 0, 1, 2, 3, or 4, respectively, as the calculated number.
[0109]
In this case, when the operation lever 28 is located near the left end, the score calculation unit 25 obtains the output probability using none of the four probability density functions {g₅(), g₁(), g₁₅(), g₀()}, that is, using the floor value. When the operation lever 28 is located near the midpoint between the left end and the center, the output probability is obtained by calculating only g₅(), which has the highest contribution to the output probability among the four. When the operation lever 28 is located near the center, the output probability is obtained by calculating two functions: g₅(), which has the highest contribution, and g₁(), which has the second highest. When the operation lever 28 is located near the midpoint between the center and the right end, the output probability is obtained by calculating the three functions g₅(), g₁(), and g₁₅(), taken in descending order of contribution. When the operation lever 28 is located near the right end, the output probability is obtained by calculating all four probability density functions g₅(), g₁(), g₁₅(), and g₀().
[0110]
Therefore, in this case, speech recognition processing can be performed at five levels of speed or accuracy.
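One possible way to map a lever position onto the selection range is sketched below. The normalized position (0.0 at the left end, 1.0 at the right end), the equal-width bands, and the function name `calculated_number` are assumptions made for illustration; the text only states that the movable range is divided into as many ranges as there are integer values in the selection range.

```python
def calculated_number(lever_position, total_pdfs):
    """Map a normalized lever position to a calculated number.

    lever_position: 0.0 (left end) to 1.0 (right end), an assumed encoding.
    total_pdfs: number of functions registered in the selected entry.
    The range is split into total_pdfs + 1 equal bands, one per integer
    in the selection range 0..total_pdfs.
    """
    if not 0.0 <= lever_position <= 1.0:
        raise ValueError("lever_position must be within [0.0, 1.0]")
    bands = total_pdfs + 1
    # The right end itself falls into the last band.
    return min(int(lever_position * bands), total_pdfs)


# Five bands for the four functions registered for subspace Y0.
for pos in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(pos, "->", calculated_number(pos, 4))
```

With an entry that registers all 16 functions, the same mapping would yield 17 levels without any change to the table format.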
[0111]
As described above, when the probability density functions used for calculating the output probability are selected according to the position of the operation lever 28, all 16 probability density functions g₀() to g₁₅() that define the acoustic model can be registered, in descending order of contribution, in each subspace entry of the calculation target function table.
[0112]
Next, FIG. 9 shows a configuration example of another embodiment of the calculation target function table stored in the calculation target function table storage unit 24 in the selectable mode.
[0113]
In the embodiment of FIG. 9, the calculation target function table storage unit 24 stores two calculation target function tables: a high speed / low accuracy calculation target function table (FIG. 9(A)) and a low speed / high accuracy calculation target function table (FIG. 9(B)). Based on the speed/accuracy information supplied from the speed/accuracy setting unit 27, the calculation target function selection unit 23 selects one of the two tables, refers to the selected table, and selects probability density functions or a floor value.
[0114]
That is, when the operation lever 28 is positioned on the left side, the speed/accuracy setting unit 27 sets low-speed, high-accuracy speech recognition processing and supplies speed/accuracy information to that effect to the calculation target function selection unit 23. In this case, the calculation target function selection unit 23 selects the low speed / high accuracy calculation target function table (FIG. 9(B)). On the other hand, when the operation lever 28 is positioned on the right side, the speed/accuracy setting unit 27 sets high-speed, low-accuracy speech recognition processing and supplies speed/accuracy information to that effect to the calculation target function selection unit 23. In this case, the calculation target function selection unit 23 selects the high speed / low accuracy calculation target function table (FIG. 9(A)).
[0115]
In the embodiment of FIG. 9, the probability density functions registered in each entry of the low speed / high accuracy calculation target function table (FIG. 9(B)) are basically those registered in the corresponding entry of the high speed / low accuracy calculation target function table (FIG. 9(A)) plus zero or more additional probability density functions.
[0116]
Therefore, when a probability density function or the like (probability density function or floor value) is selected with reference to the calculation target function table for low speed / high accuracy, the probability is calculated with reference to the calculation target function table for high speed / low accuracy. Compared with the case where a density function or the like is selected, the amount of calculation required for calculating the output probability is increased, so that the processing speed is reduced but a highly accurate speech recognition result is obtained.
[0117]
Conversely, when a probability density function or the like is selected with reference to the high speed / low accuracy calculation target function table, the accuracy may be lower than when the low speed / high accuracy calculation target function table is referenced, but the amount of calculation required for the output probability is reduced, so the processing speed can be increased.
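The two-table arrangement can be sketched as follows. The table contents below are hypothetical placeholders, since the entries of FIG. 9 are not reproduced in the text; the names `select_from_tables` and `extends` are likewise assumptions. The helper `extends` checks the stated property that each low speed / high accuracy entry contains the functions of the corresponding high speed / low accuracy entry plus zero or more others.

```python
# Hypothetical contents; each entry is (floor value, component indices).
HIGH_SPEED_TABLE = {
    0: (-29.0, (5,)),
    1: (-45.0, (0, 1)),
    3: (-30.0, ()),
}
LOW_SPEED_TABLE = {
    0: (-29.0, (5, 1, 15)),
    1: (-45.0, (0, 1, 17, 8)),
    3: (-30.0, ()),
}


def select_from_tables(subspace, high_accuracy):
    """Pick the table by mode, then return the floor or the registered pdfs."""
    table = LOW_SPEED_TABLE if high_accuracy else HIGH_SPEED_TABLE
    floor, pdfs = table[subspace]
    return ("pdfs", pdfs) if pdfs else ("floor", floor)


def extends(fast_table, slow_table):
    """True if every slow entry holds all components of the fast entry."""
    return all(
        set(fast_table[j][1]) <= set(slow_table[j][1]) for j in fast_table
    )


print(select_from_tables(0, high_accuracy=False))
print(select_from_tables(0, high_accuracy=True))
print(extends(HIGH_SPEED_TABLE, LOW_SPEED_TABLE))
```

Compared with the single table of FIG. 8, this layout trades extra storage for a simpler lookup: no calculated number has to be consulted once the table has been chosen.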
[0118]
In the embodiment of FIG. 9 as well, the probability density function can be registered in the calculation target function table in descending order of contribution to the output probability, and further, the calculation target is determined according to the position of the operation lever 28. The number of probability density functions selected from the function table can be changed.
[0119]
That is, for example, when the operation lever 28 is located on the left side, the low speed / high accuracy calculation target function table (FIG. 9(B)) is selected, and the number of probability density functions selected from that table can be changed depending on how far to the left the operation lever 28 is located. Likewise, when the operation lever 28 is located on the right side, the high speed / low accuracy calculation target function table (FIG. 9(A)) is selected, and the number of probability density functions selected from that table can be changed depending on how far to the right the operation lever 28 is located. In this case, the speed or accuracy of the speech recognition processing can be controlled more finely.
[0120]
In the embodiment of FIG. 9, the calculation target function table storage unit 24 stores two calculation target function tables. Alternatively, the calculation target function table storage unit 24 may store three or more calculation target function tables that differ in the number of registered probability density functions according to the speed or accuracy required of the speech recognition processing, and the calculation target function selection unit 23 may select the table to be referenced from among them based on the speed/accuracy information.
[0121]
Next, the series of processes described above can be performed by hardware or software. When a series of processing is performed by software, a program constituting the software is installed in a general-purpose computer or the like.
[0122]
Therefore, FIG. 10 shows a configuration example of an embodiment of a computer in which a program for executing the series of processes described above is installed.
[0123]
The program can be recorded in advance in a hard disk 105 or a ROM 103 as a recording medium built in the computer.
[0124]
Alternatively, the program can be stored (recorded) temporarily or permanently on a removable recording medium 111 such as a flexible disk, a CD-ROM (Compact Disc Read Only Memory), an MO (Magneto Optical) disk, a DVD (Digital Versatile Disc), a magnetic disk, or a semiconductor memory. Such a removable recording medium 111 can be provided as so-called package software.
[0125]
The program can be installed on the computer from the removable recording medium 111 as described above. It can also be transferred from a download site to the computer wirelessly via an artificial satellite for digital satellite broadcasting, or transferred to the computer via a network such as a LAN (Local Area Network) or the Internet; the computer can receive the program transferred in this way with the communication unit 108 and install it on the built-in hard disk 105.
[0126]
The computer includes a CPU (Central Processing Unit) 102. An input/output interface 110 is connected to the CPU 102 via the bus 101. When the user inputs a command via the input/output interface 110 by operating the input unit 107, which includes a keyboard, a mouse, a microphone, and the like, the CPU 102 executes the program stored in the ROM (Read Only Memory) 103 accordingly. Alternatively, the CPU 102 loads into the RAM (Random Access Memory) 104 and executes a program stored on the hard disk 105; a program transferred from a satellite or a network, received by the communication unit 108, and installed on the hard disk 105; or a program read from the removable recording medium 111 mounted in the drive 109 and installed on the hard disk 105. The CPU 102 thereby performs the processing according to the above-described flowchart or the processing performed by the configuration of the above-described block diagram. Then, as necessary, the CPU 102 outputs the processing result from the output unit 106, which includes an LCD (Liquid Crystal Display), a speaker, and the like, via the input/output interface 110, transmits it from the communication unit 108, or records it on the hard disk 105.
[0127]
Here, in this specification, the processing steps describing a program for causing a computer to perform various types of processing need not necessarily be processed in time series in the order described in the flowchart; they also include processing executed in parallel or individually (for example, parallel processing or object-based processing).
[0128]
Further, the program may be processed by a single computer, or may be processed in a distributed manner by a plurality of computers. Furthermore, the program may be transferred to a remote computer and executed.
[0129]
Note that the speech recognition apparatus shown in FIG. 3 can be applied to, for example, searching a database by voice, operating various devices, inputting data to devices, and speech dialogue systems. More specifically, it can be applied to, for example, a database search device that displays map information in response to a spoken place-name query, an industrial robot that sorts luggage in response to spoken instructions, a dictation system that creates text from speech input instead of a keyboard, and a dialogue system in a robot that converses with a user.
[0130]
In the present embodiment, the speed/accuracy setting unit 27 sets the speed or accuracy of the speech recognition processing according to the user's operation of the operation lever 28. Alternatively, the speed or accuracy can be set based on other factors, for example, the resources that can be allocated to the speech recognition processing.
[0131]
That is, for example, when the speech recognition apparatus shown in FIG. 3 is realized by causing a computer as shown in FIG. 10 to execute the program, the CPU 102 generally also executes tasks other than the speech recognition processing, so the resources that can be allocated to the speech recognition processing change from moment to moment. The speed/accuracy setting unit 27 can therefore recognize the resources of the CPU 102 that can be allocated to the speech recognition processing and set the speed and accuracy so that the maximum accuracy obtainable in real time with those resources is achieved.
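A resource-driven setting could be sketched as below. The idea of expressing available resources as a budget of density evaluations per frame, and the function name `choose_calculated_number`, are assumptions made for this illustration; the text does not specify how resources are measured or converted into a speed/accuracy setting.

```python
def choose_calculated_number(counts, evals_per_frame_budget):
    """Pick the largest registered calculated number that fits the budget.

    counts: calculated numbers registered in the selected entry.
    evals_per_frame_budget: assumed number of density evaluations the
        processor can afford per frame while staying real time.
    Falls back to the smallest registered number if none fits.
    """
    feasible = [n for n in counts if n <= evals_per_frame_budget]
    return max(feasible) if feasible else min(counts)


# Entry for subspace Y0 in FIG. 8 registers the calculated numbers {0, 1, 4}.
for budget in (0, 2, 4, 10):
    print(budget, "->", choose_calculated_number((0, 1, 4), budget))
```

Re-evaluating the budget every frame, or every few frames, would let the accuracy follow the load on the processor without user intervention.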
[0132]
In the present embodiment, the score calculation unit 25 calculates the acoustic score by the continuous HMM method using continuous-valued feature vectors. However, the present invention is also applicable, for example, to the case where the acoustic score is calculated by the discrete HMM method using discrete-valued feature vectors.
[0133]
That is, a discrete mixed distribution type HMM is described in, for example, Satoshi Takahashi, Kiyoaki Aikawa, and Shigeki Sagayama, “Discrete mixture HMM,” in International Conference on Acoustic, Speech, and Signal Processing, pages 971-974, 1997. According to this discrete mixed distribution type HMM, the output probability b_s(x_t) of the feature vector x_t is calculated, for example, according to the following equation.
[0134]
b_s(x_t) = Σ_m C_m × G_m(Z_i) ... (2)
[0135]
Here, in equation (2), C_m is the weighting factor for the m-th function G_m(), and the function G_m() is the m-th probability function constituting the discrete mixed distribution type HMM. Σ represents summation over the variable m. Z_i represents the subspace of the feature vector space to which the feature vector x_t belongs, and is obtained by vector quantization of the feature vector x_t. Z_i can therefore be regarded as representing the code vector (representative vector) of the subspace of the feature vector space to which the feature vector x_t belongs, and its value is discrete.
[0136]
Since equation (2) has the same form as equation (1) described above, the calculation of the probability functions for obtaining the output probability can likewise be restricted by a calculation target function table when the discrete mixed distribution type HMM is used. Therefore, by using calculation target function tables in which the probability density functions g_m() in the calculation target function tables described above are replaced with the probability functions G_m(), speech recognition processing can be performed at the speed and accuracy the user requests.
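Equation (2) and its restriction to selected components can be sketched in Python as follows. The nearest-neighbour quantizer, the list-of-lists representation of the probability functions G_m, and the names `quantize` and `discrete_output_probability` are assumptions for illustration; the toy numbers are not taken from any figure.

```python
def quantize(x, codebook):
    """Return the index i of the code vector nearest to x (squared distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(x, codebook[i])),
    )


def discrete_output_probability(z, weights, prob_tables, selected=None):
    """Evaluate b_s = sum over m of C_m * G_m(Z_i), as in equation (2).

    prob_tables[m][z] plays the role of G_m(Z_i).
    selected: optional component indices taken from a calculation target
        function table; None means every component is used.
    """
    components = range(len(weights)) if selected is None else selected
    return sum(weights[m] * prob_tables[m][z] for m in components)


codebook = [(0.0, 0.0), (1.0, 1.0)]        # toy code vectors
weights = [0.5, 0.5]                       # C_0, C_1
prob_tables = [[0.1, 0.3], [0.2, 0.4]]     # G_0, G_1 over the two codes

z = quantize((0.9, 0.8), codebook)
print(z, discrete_output_probability(z, weights, prob_tables))
print(discrete_output_probability(z, weights, prob_tables, selected=[1]))
```

Since each G_m is a table lookup rather than a density evaluation, the saving per omitted component is smaller than in the continuous case, but the selection logic is unchanged.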
[0137]
In this embodiment, voice recognition based on the HMM method is performed. However, the present invention can also be applied to voice recognition based on other algorithms.
[0138]
Furthermore, in the present embodiment, the calculation target function table is configured using floor values as necessary in addition to probability density functions, but the calculation target function table can also be configured without using floor values.
[0139]
The calculation target function tables shown in FIGS. 6 to 9 can be created in principle as follows, for example. For each of the code vectors v₀ to v₅₁₁ of the subspaces Y₀ to Y₅₁₁ of the feature vector space, the output probability is obtained from each of the 16 probability density functions g₀() to g₁₅() that define the acoustic model. For each subspace, the probability density functions whose contribution to the output probability is relatively large are then selected and registered in the entry for that subspace. As the floor value of the calculation target function table, for example, the minimum, maximum, or average of the output probabilities calculated according to equation (1) for arbitrary feature vectors x_t belonging to each subspace can be adopted.
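The table construction outlined above can be sketched for one entry as follows. One-dimensional Gaussians, the `min_share` threshold that decides what counts as a relatively large contribution, and the use of the minimum log output probability over sample vectors as the floor are assumptions made for this example; the text leaves the selection criterion open and also allows the maximum or the average as the floor.

```python
import math


def gaussian(mean, var):
    """Return a one-dimensional Gaussian density with the given parameters."""
    def pdf(x):
        return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
    return pdf


def build_entry(code_vector, samples, weights, pdfs, min_share=0.1):
    """Build one table entry for the subspace represented by code_vector.

    Components are ranked by their contribution c_m * g_m(v) at the code
    vector; those contributing at least min_share of the total are kept.
    The floor is the minimum log output probability over `samples`.
    """
    contrib = [w * g(code_vector) for w, g in zip(weights, pdfs)]
    total = sum(contrib)
    order = sorted(range(len(pdfs)), key=lambda m: contrib[m], reverse=True)
    kept = [m for m in order if total > 0 and contrib[m] / total >= min_share]
    floor = min(
        math.log(sum(w * g(x) for w, g in zip(weights, pdfs))) for x in samples
    )
    return {"pdfs": kept, "floor": floor}


pdfs = [gaussian(0.0, 1.0), gaussian(5.0, 1.0), gaussian(0.5, 1.0)]
weights = [1 / 3, 1 / 3, 1 / 3]
entry = build_entry(0.0, samples=[-0.5, 0.0, 0.5], weights=weights, pdfs=pdfs)
print(entry)
```

In this toy setup the component centred at 5.0 contributes almost nothing near the code vector 0.0, so it is left out of the entry, while the remaining two are stored in descending order of contribution.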
[0140]
[Effects of the invention]
According to one aspect of the present invention, it is possible, for example, to perform speech recognition processing with speed and accuracy according to a request from a user or the like. That is, it is possible to perform speech recognition processing that emphasizes speed or speech recognition processing that emphasizes accuracy.
[Brief description of the drawings]
FIG. 1 is a block diagram illustrating an exemplary configuration of a conventional speech recognition apparatus.
FIG. 2 is a diagram showing a calculation target function table used in a conventional speech recognition apparatus.
FIG. 3 is a block diagram showing a configuration example of an embodiment of a speech recognition apparatus to which the present invention is applied.
FIG. 4 is a block diagram illustrating a configuration example of the matching unit 11.
FIG. 5 is a flowchart illustrating the matching process performed by the matching unit 11.
FIG. 6 is a diagram illustrating a configuration example of a first embodiment of the calculation target function table stored in the calculation target function table storage unit 24.
FIG. 7 is a diagram illustrating a configuration example of a second embodiment of the calculation target function table stored in the calculation target function table storage unit 24.
FIG. 8 is a diagram illustrating a configuration example of a third embodiment of the calculation target function table stored in the calculation target function table storage unit 24.
FIG. 9 is a diagram illustrating a configuration example of a fourth embodiment of the calculation target function table stored in the calculation target function table storage unit 24.
FIG. 10 is a block diagram illustrating a configuration example of an embodiment of a computer to which the present invention has been applied.
[Explanation of symbols]
1 microphone, 2 AD conversion unit, 3 feature extraction unit, 5 acoustic model database, 6 dictionary database, 7 grammar database, 11 matching unit, 21 subspace detection unit, 22 subspace data storage unit, 23 calculation target function selection unit, 24 calculation target function table storage unit, 25 score calculation unit, 26 output selection unit, 27 speed/accuracy setting unit, 28 operation lever, 101 bus, 102 CPU, 103 ROM, 104 RAM, 105 hard disk, 106 output unit, 107 input unit, 108 communication unit, 109 drive, 110 input/output interface, 111 removable recording medium

Claims

A speech recognition apparatus for recognizing speech, comprising:
extraction means for extracting a feature amount of the speech;
detection means for detecting the partial space of a feature amount space to which the feature amount of the speech belongs;
storage means for storing, in association with each of a plurality of partial spaces of the feature amount space, one or more pieces of definition information defining an HMM (Hidden Markov Model) used for matching processing with the feature amount of the speech;
selection means for selecting one or more pieces of definition information from among the one or more pieces of definition information associated with the partial space to which the feature amount of the speech belongs; and
matching means for performing, using the definition information selected by the selection means, matching processing between the feature amount of the speech and the HMM to obtain a score representing the likelihood that the speech corresponds to the HMM, and for outputting a speech recognition result for the speech based on the score,
wherein the definition information is a probability density function or a probability function used to obtain an output probability with which the HMM outputs the feature amount, and
wherein the selection means selects, based on a speed or accuracy of the speech recognition processing set in accordance with a user operation, or a speed or accuracy of the speech recognition processing set in accordance with resources that can be allocated to the speech recognition processing, a number of pieces of the definition information corresponding to that speed or accuracy, in an order corresponding to the magnitude with which the definition information contributes to the score.
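The selection recited above can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the patented implementation: the partial space is assumed to be found by nearest-centroid lookup, the definition information is assumed to be weighted diagonal Gaussian densities, and every name and number (`CENTROIDS`, `COMPONENTS`, `FUNCTION_TABLE`, `COUNT_BY_SETTING`, the setting labels) is hypothetical.

```python
import math

# Hypothetical codebook: one centroid per partial space of a 2-D feature space.
CENTROIDS = [(0.0, 0.0), (4.0, 4.0)]

# Hypothetical mixture components: id -> (weight, mean, diagonal variance).
COMPONENTS = {
    "g0": (0.5, (0.0, 0.0), (1.0, 1.0)),
    "g1": (0.3, (1.0, 1.0), (1.0, 1.0)),
    "g2": (0.2, (4.0, 4.0), (1.0, 1.0)),
}

# Hypothetical table: partial space -> component ids, largest contribution first.
FUNCTION_TABLE = {0: ["g0", "g1", "g2"], 1: ["g2", "g1", "g0"]}

# Hypothetical mapping from a speed/accuracy setting to the number of functions used.
COUNT_BY_SETTING = {"fast": 1, "balanced": 2, "accurate": 3}


def detect_partial_space(x):
    """Return the index of the nearest centroid (assumed detection rule)."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(range(len(CENTROIDS)), key=lambda i: sq_dist(CENTROIDS[i]))


def select_functions(space, setting):
    """Take the first N table entries, where N is fixed by the setting."""
    return FUNCTION_TABLE[space][:COUNT_BY_SETTING[setting]]


def density(x, mean, var):
    """Diagonal-covariance Gaussian density."""
    p = 1.0
    for a, m, v in zip(x, mean, var):
        p *= math.exp(-((a - m) ** 2) / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)
    return p


def output_probability(x, setting):
    """Sum only the selected weighted densities."""
    ids = select_functions(detect_partial_space(x), setting)
    return sum(COMPONENTS[i][0] * density(x, *COMPONENTS[i][1:]) for i in ids)
```

With a faster setting fewer densities are evaluated per feature vector, so the output probability is a coarser approximation of the full mixture; a more accurate setting evaluates more of them at greater cost.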

The speech recognition apparatus according to claim 1, wherein the definition information further includes a fixed value representing an output probability that the HMM outputs the feature amount.

The speech recognition apparatus according to claim 1, wherein the storage means also stores, for each speed or accuracy of the speech recognition processing, the number of pieces of the definition information to be selected by the selection means, and
the selection means selects the number of pieces of the definition information corresponding to the speed or accuracy of the speech recognition processing.

The speech recognition apparatus according to claim 1, wherein the storage means stores, for each speed or accuracy of the speech recognition processing, a definition information table in which one or more pieces of definition information defining the HMM used for matching processing with the feature amount of the speech are associated with each of the plurality of partial spaces of the feature amount space, and
the selection means selects the definition information from the definition information table corresponding to the speed or accuracy of the speech recognition processing.
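The two dependent claims directly above describe two storage layouts: a stored count per speed or accuracy setting, and a separate table per setting. The following Python sketch contrasts them under assumed data shapes; the identifiers, setting labels, and table contents are hypothetical and chosen only so that both layouts yield the same selection.

```python
# Layout with stored counts (assumed shape): one table whose entries are ordered
# by contribution, plus the number of entries to use for each setting.
ORDERED_TABLE = {0: ["g0", "g1", "g2"], 1: ["g2", "g1", "g0"]}
COUNT_BY_SETTING = {"fast": 1, "accurate": 3}

# Layout with per-setting tables (assumed shape): a complete table for each setting.
TABLE_BY_SETTING = {
    "fast": {0: ["g0"], 1: ["g2"]},
    "accurate": {0: ["g0", "g1", "g2"], 1: ["g2", "g1", "g0"]},
}


def select_with_counts(space, setting):
    """Truncate the single ordered list to the stored count."""
    return ORDERED_TABLE[space][:COUNT_BY_SETTING[setting]]


def select_with_tables(space, setting):
    """Read the list directly from the table that belongs to the setting."""
    return list(TABLE_BY_SETTING[setting][space])
```

In this sketch the count-based layout stores each list once, while the per-setting layout uses more memory but lets each setting hold an arbitrary list for each partial space.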

The speech recognition apparatus according to claim 1, wherein the matching means performs the matching processing by a continuous HMM method or a discrete HMM method, using a feature amount that is a continuous quantity or a discrete value, respectively.
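The difference between the two HMM variants named above lies in how the output probability is obtained. A minimal Python sketch, assuming a one-dimensional feature, a single Gaussian for the continuous case, and a hypothetical three-symbol codebook for the discrete case:

```python
import math


def continuous_output_probability(x, mean, var):
    """Continuous HMM: evaluate a probability density function (1-D Gaussian here)."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


# Discrete HMM: the feature is first quantized to a symbol, and the output
# probability is read from a probability function stored as a table.
CODEBOOK = [0.0, 1.0, 2.0]            # hypothetical code vectors
SYMBOL_PROBABILITY = [0.6, 0.3, 0.1]  # hypothetical values, summing to 1


def quantize(x):
    """Return the index of the nearest code vector."""
    return min(range(len(CODEBOOK)), key=lambda i: abs(x - CODEBOOK[i]))


def discrete_output_probability(x):
    """Look up the output probability of the quantized symbol."""
    return SYMBOL_PROBABILITY[quantize(x)]
```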

A speech recognition method for a speech recognition apparatus that recognizes speech, the method comprising:
an extraction step in which the speech recognition apparatus extracts a feature amount of the speech;
a detection step in which the speech recognition apparatus detects the partial space of a feature amount space to which the feature amount of the speech belongs;
a selection step in which the speech recognition apparatus selects one or more pieces of definition information from among the one or more pieces of definition information associated with the partial space to which the feature amount of the speech belongs, in storage means that stores, in association with each of a plurality of partial spaces of the feature amount space, one or more pieces of definition information defining an HMM (Hidden Markov Model) used for matching processing with the feature amount of the speech; and
a matching step in which the speech recognition apparatus performs, using the definition information selected in the selection step, matching processing between the feature amount of the speech and the HMM to obtain a score representing the likelihood that the speech corresponds to the HMM, and outputs a speech recognition result for the speech based on the score,
wherein the definition information is a probability density function or a probability function used to obtain an output probability with which the HMM outputs the feature amount, and
wherein, in the selection step, based on a speed or accuracy of the speech recognition processing set in accordance with a user operation, or a speed or accuracy of the speech recognition processing set in accordance with resources that can be allocated to the speech recognition processing, a number of pieces of the definition information corresponding to that speed or accuracy is selected in an order corresponding to the magnitude with which the definition information contributes to the score.

A program for causing a computer to perform speech recognition processing for recognizing speech, the program causing the computer to function as:
extraction means for extracting a feature amount of the speech;
detection means for detecting the partial space of a feature amount space to which the feature amount of the speech belongs;
storage means for storing, in association with each of a plurality of partial spaces of the feature amount space, one or more pieces of definition information defining an HMM (Hidden Markov Model) used for matching processing with the feature amount of the speech;
selection means for selecting one or more pieces of definition information from among the one or more pieces of definition information associated with the partial space to which the feature amount of the speech belongs; and
matching means for performing, using the definition information selected by the selection means, matching processing between the feature amount of the speech and the HMM to obtain a score representing the likelihood that the speech corresponds to the HMM, and for outputting a speech recognition result for the speech based on the score,
wherein the definition information is a probability density function or a probability function used to obtain an output probability with which the HMM outputs the feature amount, and
wherein the selection means selects, based on a speed or accuracy of the speech recognition processing set in accordance with a user operation, or a speed or accuracy of the speech recognition processing set in accordance with resources that can be allocated to the speech recognition processing, a number of pieces of the definition information corresponding to that speed or accuracy, in an order corresponding to the magnitude with which the definition information contributes to the score.

A recording medium on which is recorded a program for causing a computer to perform speech recognition processing for recognizing speech, the program causing the computer to function as:
extraction means for extracting a feature amount of the speech;
detection means for detecting the partial space of a feature amount space to which the feature amount of the speech belongs;
storage means for storing, in association with each of a plurality of partial spaces of the feature amount space, one or more pieces of definition information defining an HMM (Hidden Markov Model) used for matching processing with the feature amount of the speech;
selection means for selecting one or more pieces of definition information from among the one or more pieces of definition information associated with the partial space to which the feature amount of the speech belongs; and
matching means for performing, using the definition information selected by the selection means, matching processing between the feature amount of the speech and the HMM to obtain a score representing the likelihood that the speech corresponds to the HMM, and for outputting a speech recognition result for the speech based on the score,
wherein the definition information is a probability density function or a probability function used to obtain an output probability with which the HMM outputs the feature amount, and
wherein the selection means selects, based on a speed or accuracy of the speech recognition processing set in accordance with a user operation, or a speed or accuracy of the speech recognition processing set in accordance with resources that can be allocated to the speech recognition processing, a number of pieces of the definition information corresponding to that speed or accuracy, in an order corresponding to the magnitude with which the definition information contributes to the score.