JP2001125593A

JP2001125593A - Speech recognition method and device, and medium recorded with speech recognition program

Info

Publication number: JP2001125593A
Application number: JP30818099A
Authority: JP
Inventors: Akira Tsuruta; 彰鶴田
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 1999-10-29
Filing date: 1999-10-29
Publication date: 2001-05-11
Anticipated expiration: 2019-10-29
Also published as: JP3534665B2

Abstract

PROBLEM TO BE SOLVED: To provide a speech recognition device capable of reducing the calculation volume of likelihood calculation, thereby increasing the processing speed. SOLUTION: The speech recognition device includes an auxiliary selection dictionary 4a generated for each partial space, a likelihood calculation part 3 which calculates an approximate likelihood of each distribution constituting the states of an HMM (Hidden Markov Model) on the basis of an input vector and the contents stored in the auxiliary selection dictionary 4a and calculates the likelihood of the input vector for each distribution extracted by using the approximate likelihood, and a dictionary search part 5 which performs speech recognition on the basis of the likelihood calculated by the calculation part 3. Since the likelihood calculation part 3 calculates the likelihood of the input vector only for the distributions extracted by using the approximate likelihood, the calculation volume of likelihood calculation is reduced and the processing speed of speech recognition increases.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to speech recognition technology, and more particularly to a speech recognition device, a speech recognition method, and a medium recording a speech recognition program that use a hidden Markov model (hereinafter, HMM).

[0002]

[Prior Art] In recent years, speech recognition devices have been actively developed so that information processing equipment such as personal computers and word processors can accept text input by voice. Among speech recognition approaches, HMM-based recognition has been researched and developed especially actively, because it achieves high recognition accuracy even when the spectrum itself varies, for example because of differences between individual speakers.

[0003] HMM-based speech recognition uses a model of the statistical characteristics of speech obtained from a large amount of speech data. HMM-based speech recognition is explained in detail in reference works such as Seiichi Nakagawa, 確率モデルによる音声認識 ("Speech Recognition by Stochastic Models"), to which the reader is referred.

[0004]

[Problems to Be Solved by the Invention] Most development of HMM-based speech recognition devices targets large-vocabulary, speaker-independent, continuous speech recognition. Building a device that runs in real time using software processing alone requires faster speech recognition processing.

[0005] In a speech recognition device using continuous-density HMMs, computing the HMM output probabilities is, like the search of the language space, a computationally expensive process. Its cost is proportional to the total number of distributions (= (number of HMMs) × (number of states per HMM) × (number of mixture components per state)). In general, given sufficient training data, a larger number of distributions yields a higher recognition rate at a higher computational cost, whereas a smaller number lowers both the recognition rate and the cost. A technique that reduces the number of distributions actually evaluated, without lowering the recognition rate, is therefore important.
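The proportionality above can be made concrete with a short sketch. The function name and the sizes below are hypothetical, chosen only for illustration; the source gives no per-model sizes at this point.

```python
def total_distributions(num_hmms: int, states_per_hmm: int, mixtures_per_state: int) -> int:
    """Number of distributions whose output probability is evaluated per frame
    when every distribution is computed exhaustively."""
    return num_hmms * states_per_hmm * mixtures_per_state


# Hypothetical sizes (assumption, not from the source).
n_total = total_distributions(num_hmms=50, states_per_hmm=3, mixtures_per_state=10)
print(n_total)
```

Doubling any one of the three factors doubles the per-frame cost, which is why the number of distributions actually evaluated is the quantity worth reducing.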

[0006] A first object of the present invention is to provide a speech recognition device, a speech recognition method, and a medium recording a speech recognition program that reduce the amount of likelihood computation and thereby improve processing speed.

[0007] A second object is to provide a speech recognition device, a speech recognition method, and a medium recording a speech recognition program that have high recognition performance.

[0008]

[Means for Solving the Problems] According to one aspect of the present invention, a speech recognition device includes: a preliminary selection dictionary created for each subspace; likelihood calculation means for calculating an approximate likelihood of each distribution constituting the HMM states on the basis of an input vector and the contents stored in the preliminary selection dictionary, and for calculating the likelihood with respect to the input vector for the distributions selected using the approximate likelihood; and recognition means for recognizing speech on the basis of the likelihood calculated by the likelihood calculation means.

[0009] Because the likelihood calculation means calculates the likelihood with respect to the input vector only for the distributions selected using the approximate likelihood, the amount of likelihood computation is reduced and the processing speed of speech recognition improves.

[0010] According to another aspect of the present invention, the preliminary selection dictionary is created for each feature space. Because the dictionary is created for each feature space, a more finely resolved approximate likelihood can be calculated, and a speech recognition device with high recognition performance can be provided.

[0011] According to still another aspect of the present invention, the preliminary selection dictionary represents the cluster representative distribution created for each subspace as a normal distribution and stores its mean and variance.

[0012] Because the cluster representative distribution created for each subspace is represented as a normal distribution, the approximate likelihood is easy to calculate, and processing speed can be improved further.

[0013] According to still another aspect of the present invention, the preliminary selection dictionary includes a multilayer dictionary created for each subspace.

[0014] For example, the preliminary selection dictionary includes a first-layer dictionary and a second-layer dictionary. The approximate likelihood is first calculated using the first-layer dictionary and then calculated in more detail using the second-layer dictionary, which reduces the amount of computation needed for the approximate likelihood.

[0015] According to still another aspect of the present invention, the likelihood calculation means includes: a detailed dictionary storing the output probability distribution of each distribution obtained by training on speech data; approximate likelihood calculation means for calculating the approximate likelihood of each distribution constituting the HMM on the basis of the input vector and the contents stored in the preliminary selection dictionary; distribution selection means for selecting, on the basis of the approximate likelihood calculated by the approximate likelihood calculation means, the distributions for which a detailed likelihood is to be calculated; and detailed likelihood calculation means for calculating, using the detailed dictionary, the detailed likelihood of the distributions selected by the distribution selection means.

[0016] Because the detailed likelihood calculation means calculates the detailed likelihood, using the detailed dictionary, only for the distributions selected by the distribution selection means, the number of distributions requiring a detailed likelihood is reduced and the processing speed of speech recognition improves.

[0017] According to still another aspect of the present invention, the preliminary selection dictionary is created by classifying the distributions stored in the detailed dictionary for each feature space.

[0018] Because the preliminary selection dictionary is created by classifying the distributions stored in the detailed dictionary for each feature space, a detailed approximate likelihood can be calculated, and a speech recognition device with high recognition performance can be provided.

[0019] According to still another aspect of the present invention, the preliminary selection dictionary is created as follows: after the distributions stored in the detailed dictionary have been classified for each subspace, the speech data is classified according to which cluster it belongs to, and the dictionary is computed from the feature parameter values of the speech data belonging to each cluster.

[0020] Because the preliminary selection dictionary is computed from the feature parameter values of the speech data belonging to each cluster, after the distributions stored in the detailed dictionary have been classified for each subspace and the speech data has been assigned to clusters, the distributions are classified more accurately, and a speech recognition device with high recognition performance can be provided.

[0021] According to still another aspect of the present invention, the distribution selection means selects, from among the approximate likelihoods calculated by the approximate likelihood calculation means, the distributions having large approximate likelihoods.

[0022] Because the distribution selection means selects the distributions having large approximate likelihoods from among those calculated by the approximate likelihood calculation means, the distributions that require a detailed likelihood calculation can be selected accurately.

[0023] According to still another aspect of the present invention, the distribution selection means calculates a reference likelihood from the maximum likelihood of the subspaces and a threshold, and selects the distributions whose approximate likelihood is larger than the reference likelihood.

[0024] Because the distribution selection means calculates a reference likelihood from the maximum likelihood of the subspaces and a threshold and selects the distributions whose approximate likelihood is larger than the reference likelihood, the distributions that require a detailed likelihood calculation can be selected accurately.

[0025] According to still another aspect of the present invention, a speech recognition method includes the steps of: calculating the approximate likelihood of each distribution constituting the HMM on the basis of an input vector and the contents of a preliminary selection dictionary created for each subspace; calculating the likelihood with respect to the input vector for the distributions selected using the approximate likelihood; and recognizing speech on the basis of the calculated likelihood.

[0026] Because the likelihood with respect to the input vector is calculated only for the distributions selected using the approximate likelihood, the amount of likelihood computation is reduced and the processing speed of speech recognition improves.

[0027] According to still another aspect of the present invention, a speech recognition program recorded on a computer-readable medium includes the steps of: calculating the approximate likelihood of each distribution constituting the HMM on the basis of an input vector and the contents of a preliminary selection dictionary created for each subspace; calculating the likelihood with respect to the input vector for the distributions selected using the approximate likelihood; and recognizing speech on the basis of the calculated likelihood.

[0028] Because the likelihood with respect to the input vector is calculated only for the distributions selected using the approximate likelihood, the amount of likelihood computation is reduced and the processing speed of speech recognition improves.

[0029]

[Embodiments of the Invention] FIG. 1 is a block diagram showing the functional configuration of a speech recognition device according to an embodiment of the present invention. The speech recognition device includes: a speech input unit 1, which converts the analog speech signal picked up by a microphone into a digital signal and cuts out and outputs the speech section to be recognized; an acoustic analysis unit 2, which analyzes the digital speech signal output from the speech input unit 1 and extracts and outputs acoustic parameters (the input vector); a likelihood calculation unit 3, which calculates the likelihood of each distribution constituting the HMM states on the basis of the acoustic parameters output from the acoustic analysis unit 2; an HMM storage unit 4, which stores the HMMs used by the likelihood calculation unit 3; a word dictionary 6; a dictionary search unit 5, which searches the word dictionary 6 on the basis of the likelihoods calculated by the likelihood calculation unit 3 and outputs a recognition result; and a display unit 7, which displays the recognition result output from the dictionary search unit 5.

[0030] The speech input unit 1 includes a microphone 1a, into which the user speaks, and an A/D (analog/digital) converter 1b, which converts the speech signal input through the microphone 1a from an analog signal into a digital signal.

[0031] The HMM storage unit 4 includes a detailed dictionary 4b, which stores the distributions constituting the states of HMMs trained on a large amount of speech data, and a preliminary selection dictionary 4a, which stores cluster representative distributions created by clustering, for each feature space, the distributions stored in the detailed dictionary 4b. The HMMs are trained with the Baum-Welch algorithm. The preliminary selection dictionary 4a and the detailed dictionary 4b are described in detail later.

[0032] The likelihood calculation unit 3 includes: an approximate likelihood calculation unit 3a, which computes likelihoods by comparing the acoustic parameters extracted by the acoustic analysis unit 2 with the cluster representative distributions stored in the preliminary selection dictionary 4a and then computes the approximate likelihood of each distribution using that distribution's index information; a distribution selection unit 3b, which selects either a predetermined number of distributions with the highest approximate likelihoods or the distributions whose approximate likelihood exceeds a predetermined reference likelihood; and a detailed likelihood calculation unit 3c, which recomputes the likelihood of the distributions selected by the distribution selection unit 3b using the detailed dictionary 4b.

[0033] The dictionary search unit 5 applies the Viterbi algorithm to the likelihoods of the distributions calculated by the likelihood calculation unit 3 and to the words registered in the word dictionary 6 to compute a score, and outputs the word with the maximum score as the recognition result.
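As a rough illustration of the scoring step, the following is a minimal log-domain Viterbi sketch for a small left-to-right word model. It is a sketch under stated assumptions, not the patent's implementation: the data layout (`log_emit[t][s]`, `log_trans[s][s2]`), the requirement that a path starts in the first state and ends in the last, and the function names are all hypothetical.

```python
import math

NEG_INF = float("-inf")


def viterbi_score(log_emit, log_trans):
    """Best-path log score of one word model.

    log_emit[t][s]   : log output probability of state s at frame t
    log_trans[s][s2] : log transition probability from state s to state s2
    Assumption: the path starts in state 0 and ends in the last state.
    """
    n_states = len(log_trans)
    score = [NEG_INF] * n_states
    score[0] = log_emit[0][0]
    for t in range(1, len(log_emit)):
        new_score = [NEG_INF] * n_states
        for s2 in range(n_states):
            best_prev = max(score[s] + log_trans[s][s2] for s in range(n_states))
            new_score[s2] = best_prev + log_emit[t][s2]
        score = new_score
    return score[-1]


def recognize(models):
    """models maps each word to (log_emit, log_trans); returns the best-scoring word."""
    return max(models, key=lambda word: viterbi_score(*models[word]))
```

In the device described here, the per-state output probabilities would come from the likelihoods produced by the likelihood calculation unit 3; the sketch simply takes them as given.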

[0034] The description above assumes that speech feature vector sequences are represented by HMMs, but standard patterns represented as time series of frame vectors may be used instead. In that case, the likelihood calculation unit 3, which computes likelihoods by comparing the acoustic parameters with the distributions constituting the HMM states, is replaced by a component that computes and evaluates the distance between the acoustic parameters and the frame vectors of the standard patterns. The dictionary search unit 5 is then configured to compute a score by DP (dynamic programming) matching over the frame-vector distances and the words registered in the word dictionary 6, and to output the word with the minimum score as the recognition result.
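The DP-matching alternative can be sketched as a classic dynamic time warping recursion. The squared Euclidean local distance, the three-way step pattern, and the function name are assumptions for illustration; the source names DP matching but does not specify these details.

```python
def dp_matching_distance(frames, template):
    """Accumulated distance of the best alignment between an input frame
    sequence and a standard pattern (both are lists of equal-length vectors)."""
    inf = float("inf")
    n, m = len(frames), len(template)
    # acc[i][j]: best accumulated distance aligning the first i frames
    # with the first j template vectors.
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = sum((a - b) ** 2 for a, b in zip(frames[i - 1], template[j - 1]))
            acc[i][j] = local + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]
```

Because this score is a distance rather than a likelihood, the recognized word is the one with the minimum score, which matches the change from "maximum" to "minimum" in the paragraph above.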

[0035] FIG. 2 is a flowchart explaining the processing procedure of the speech recognition device of this embodiment. First, when the user speaks into the microphone 1a, the analog speech signal is converted into a digital signal by the A/D converter 1b (S11). The acoustic analysis unit 2 extracts acoustic parameters (the input vector), which characterize the speech signal, using linear predictive analysis or the like (S12). In this embodiment, four kinds of features are used as the acoustic parameters: LPC (Linear Predictive Coding) cepstrum of orders 1 to 16, LPC Δcepstrum of orders 1 to 16, power, and Δpower.
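The four features amount to a 34-dimensional input vector (16 + 16 + 1 + 1) that is handled as four subspaces. A minimal sketch of that split follows; the ordering of the features inside the vector and all names are assumptions, since the source does not state a layout.

```python
# Hypothetical layout of one input vector (the ordering is an assumption).
SUBSPACES = {
    "lpc_cepstrum": slice(0, 16),         # LPC cepstrum, orders 1 to 16
    "delta_lpc_cepstrum": slice(16, 32),  # LPC delta-cepstrum, orders 1 to 16
    "power": slice(32, 33),
    "delta_power": slice(33, 34),
}
VECTOR_DIM = 34


def split_features(vector):
    """Split one input vector into its four per-feature sub-vectors."""
    if len(vector) != VECTOR_DIM:
        raise ValueError(f"expected {VECTOR_DIM} values, got {len(vector)}")
    return {name: list(vector[part]) for name, part in SUBSPACES.items()}
```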

[0036] Next, the approximate likelihood calculation unit 3a compares the acoustic parameters extracted by the acoustic analysis unit 2 with the preliminary selection dictionary 4a created for each feature, and builds a per-subspace likelihood table. It then looks up the per-subspace likelihood table using the distribution numbers in the index table and the index information pointing to each feature's cluster, and computes the approximate likelihood of each distribution (S13).

[0037] The methods for creating the preliminary selection dictionary 4a, the index table, and the per-subspace likelihood table are described next.

[0038] FIG. 3 shows an example of the contents of the preliminary selection dictionary 4a, and FIG. 4 shows an example of the index table and the per-subspace likelihood table. FIG. 5 is a flowchart explaining how the preliminary selection dictionary 4a is created. First, the distributions constituting the HMM states stored in the detailed dictionary 4b are classified into M clusters for each subspace (S21). The K-means method, for example, can be used to classify each feature of the distributions into M clusters.
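Step S21 can be sketched with a plain K-means routine applied separately in each subspace. What exactly is clustered is an assumption here (the sketch clusters one sub-vector per distribution, such as the mean of its power component), as are the deterministic initialisation from the first M points and the fixed iteration count.

```python
def kmeans(points, m, iterations=20):
    """Plain K-means over a list of equal-length vectors.

    Returns (centroids, assignment), where assignment[i] is the 0-based
    cluster index of points[i]. Initialised from the first m points.
    """
    centroids = [list(p) for p in points[:m]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, p in enumerate(points):
            assignment[i] = min(
                range(m),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(m):
            members = [points[i] for i in range(len(points)) if assignment[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment
```

Running the routine once per subspace yields one cluster number per feature for every distribution, which is the information the index table records.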

[0039] Next, an index table representing the relationship between each distribution and the clusters is created from the clustering result (S22). The index table shows which cluster (1 to M) each feature of each distribution has been classified into. In FIG. 4, for example, for the distribution with distribution number "1", the power is classified into the cluster with cluster number "2", the LPC cepstrum into cluster number "1", the Δpower into cluster number "1", and the LPC Δcepstrum into cluster number "2".
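One possible in-memory shape for the index table is a mapping from distribution number to a per-feature cluster number. The sketch below builds it from per-feature cluster assignments; the dictionary layout and names are assumptions. The first row reproduces the FIG. 4 example quoted above, while the second row is made up for illustration.

```python
def build_index_table(assignments):
    """assignments[feature][i] is the 1-based cluster number of distribution i + 1.

    Returns {distribution_number: {feature: cluster_number}}.
    """
    num_distributions = len(next(iter(assignments.values())))
    return {
        d + 1: {feature: clusters[d] for feature, clusters in assignments.items()}
        for d in range(num_distributions)
    }


index_table = build_index_table({
    "power": [2, 1],               # second entry of each list is hypothetical
    "lpc_cepstrum": [1, 3],
    "delta_power": [1, 1],
    "delta_lpc_cepstrum": [2, 2],
})
```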

[0040] Next, a large amount of speech data (the speech data used to generate the detailed dictionary 4b) is compared with the distributions in the detailed dictionary 4b, and the speech data is classified again according to which cluster it belongs to (S23). Specifically, the speech data is recognized using the contents of the detailed dictionary 4b and the matching path is traced, which yields the distribution with the highest likelihood for each input vector; the index table is then consulted to determine which cluster each feature belongs to.

[0041] The mean and variance are then computed from the acoustic parameter values of the speech data belonging to each cluster to create the cluster representative distribution (S24). The preliminary selection dictionary shown in FIG. 3 stores the cluster representative distributions created in this way for each feature. To create the preliminary selection dictionary 4a more simply, the mean and variance may instead be computed from the clustering result of step S21 and used as the cluster representative distribution.
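Step S24 reduces to computing a mean and a variance per cluster. The sketch below assumes a diagonal (per-dimension) variance and uses the population variance; neither choice is stated in the source, and the function name is hypothetical.

```python
def representative_distribution(frames):
    """Per-dimension mean and population variance of the sub-vectors
    assigned to one cluster (frames is a non-empty list of equal-length vectors)."""
    n = len(frames)
    columns = list(zip(*frames))
    mean = [sum(col) / n for col in columns]
    var = [sum((x - mu) ** 2 for x in col) / n for col, mu in zip(columns, mean)]
    return mean, var
```

In practice a small variance floor is often applied so that a cluster with nearly identical members does not produce a zero variance; that refinement is omitted here.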

[0042] Although FIG. 3 shows the case where only one layer of per-subspace preliminary selection dictionary is created, a multilayer structure such as that shown in FIG. 6 may be used. In that case, for example, the preliminary selection dictionary 4a shown in FIG. 4 is regarded as the second layer; for each subspace, the M2 cluster representative distributions of the second layer are clustered to create M1 cluster representative distributions, and the correspondence between the first and second layers is stored in the index table. With this structure, the likelihoods of the first-layer cluster representative distributions are computed from the acoustic parameters to be recognized and the clusters with high likelihood are selected. The likelihoods are then recomputed for the second-layer cluster representative distributions that the selected clusters point to, while for second-layer clusters pointed to by unselected clusters the likelihood of the first-layer cluster representative distribution is used as the approximate likelihood. This reduces the amount of likelihood computation.
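The two-layer lookup can be sketched for a single subspace as follows. The diagonal-Gaussian log-likelihood, the `parent` list linking each second-layer cluster to its first-layer cluster, and the `keep` parameter (how many first-layer clusters are expanded) are assumptions for illustration; the source does not say how many first-layer clusters are selected.

```python
import math


def log_gauss(x, mean, var):
    """Log density of a diagonal-covariance normal distribution."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (a - m) ** 2 / v)
        for a, m, v in zip(x, mean, var)
    )


def two_layer_table(x, layer1, layer2, parent, keep):
    """Log-likelihood of every second-layer cluster for input x.

    layer1, layer2 : lists of (mean, var) pairs
    parent[j]      : index of the first-layer cluster that owns second-layer cluster j
    keep           : number of best first-layer clusters whose children are recomputed
    """
    l1 = [log_gauss(x, mean, var) for mean, var in layer1]
    ranked = sorted(range(len(l1)), key=lambda i: l1[i], reverse=True)
    selected = set(ranked[:keep])
    return [
        log_gauss(x, *layer2[j]) if parent[j] in selected else l1[parent[j]]
        for j in range(len(layer2))
    ]
```

With M1 first-layer and M2 second-layer clusters, this evaluates M1 Gaussians plus only the children of the kept clusters instead of all M2.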

[0043] Next, the computation of the per-subspace likelihood table shown in FIG. 4 and the extraction of the distributions whose detailed likelihood must be computed are described with reference to FIGS. 7(a) and 7(b). As shown in FIG. 7(a), the approximate likelihood calculation unit 3a first computes, for each subspace, likelihoods by comparing the acoustic parameters to be recognized with the cluster representative distributions stored in the preliminary selection dictionary 4a, and builds the per-subspace likelihood table as shown in FIG. 4 (S31).

[0044] Next, the approximate likelihood calculation unit 3a refers to the index in order of the distribution numbers in the index table, extracts the likelihood of each feature from the per-subspace likelihood table, and adds them to obtain the approximate likelihood (S32). As shown in FIG. 4, the entries of the per-subspace likelihood table are logarithms, so the approximate likelihood can be obtained by addition alone.
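Step S32 is a table lookup followed by a sum of log-likelihoods. A minimal sketch follows, assuming the index table maps each distribution number to a per-feature cluster number and the per-subspace table maps each feature to a per-cluster log-likelihood; both layouts and all numbers in the usage example are hypothetical.

```python
def approximate_log_likelihoods(index_table, subspace_table):
    """Approximate log-likelihood of every distribution.

    index_table[d][feature]    : cluster number of distribution d in that subspace
    subspace_table[feature][c] : log-likelihood of cluster c for the current frame
    """
    return {
        d: sum(subspace_table[feature][cluster] for feature, cluster in clusters.items())
        for d, clusters in index_table.items()
    }


# Hypothetical two-feature example.
approx = approximate_log_likelihoods(
    {1: {"power": 2, "lpc_cepstrum": 1}, 2: {"power": 1, "lpc_cepstrum": 1}},
    {"power": {1: -1.0, 2: -3.0}, "lpc_cepstrum": {1: -2.0, 2: -5.0}},
)
```

Each distribution costs only one lookup and one addition per feature, however many dimensions the features span, because the Gaussian evaluations were done once per cluster when the table was built.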

[0045] Next, the distribution selection unit 3b sorts the approximate likelihoods of the distributions computed by the approximate likelihood calculation unit 3a and extracts the top T distributions with the largest approximate likelihoods (S33). The detailed likelihood calculation unit 3c then retrieves from the detailed dictionary 4b the output probability distributions corresponding to the top T distributions extracted by the distribution selection unit 3b, and computes the detailed likelihoods by comparing the acoustic parameters to be recognized with those output probability distributions (S34).
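The top-T selection of step S33 can be sketched with the standard library. The source describes sorting all approximate likelihoods; `heapq.nlargest` is swapped in here as an equivalent way to obtain the same T distributions, and the function name is hypothetical.

```python
import heapq


def select_top_t(approx, t):
    """Return the numbers of the t distributions with the largest approximate
    log-likelihood, best first. approx maps distribution number to log-likelihood."""
    return heapq.nlargest(t, approx, key=approx.get)
```

Only the distributions returned here are passed on to the detailed likelihood calculation with the detailed dictionary 4b.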

[0046] As an alternative method, shown in FIG. 7(b), the approximate likelihood calculation unit 3a first computes, for each subspace, likelihoods by comparing the acoustic parameters to be recognized with the cluster representative distributions stored in the preliminary selection dictionary 4a, and builds the per-subspace likelihood table as shown in FIG. 4 (S41).

[0047] Next, the approximate likelihood calculation unit 3a refers to the index in order of the distribution numbers in the index table, extracts the likelihood of each feature from the per-subspace likelihood table, and adds them to obtain the approximate likelihood (S42).

[0048] Next, the distribution selection unit 3b computes the maximum likelihood of each subspace and derives a reference likelihood (reference likelihood < maximum likelihood) from these per-subspace maximum likelihoods and a predetermined threshold. It then selects the distributions whose approximate likelihood is larger than the reference likelihood (S43). The detailed likelihood calculation unit 3c retrieves from the detailed dictionary 4b the output probability distributions corresponding to the distributions selected by the distribution selection unit 3b, and computes the detailed likelihoods by comparing the acoustic parameters to be recognized with those output probability distributions (S44).
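The threshold-based selection of step S43 can be sketched as below. The source does not give the formula for the reference likelihood; this sketch assumes it is the sum of the per-subspace maximum log-likelihoods minus a positive threshold, which satisfies the stated condition that the reference likelihood is smaller than the maximum. Treat the formula and the names as hypothetical.

```python
def select_by_reference(approx, subspace_table, threshold):
    """Select distributions whose approximate log-likelihood exceeds the reference.

    approx                     : {distribution_number: approximate log-likelihood}
    subspace_table[feature][c] : log-likelihood of cluster c for the current frame
    threshold                  : positive margin below the best attainable sum (assumption)
    """
    best_possible = sum(max(table.values()) for table in subspace_table.values())
    reference = best_possible - threshold
    return [d for d, value in approx.items() if value > reference]
```

Unlike top-T selection, the number of selected distributions varies from frame to frame: frames with one clearly dominant cluster select few distributions, and ambiguous frames select more.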

[0049] Returning to the flowchart of FIG. 2: as described above, distributions are selected with reference to the preliminary selection dictionary 4a and the detailed dictionary 4b (S14), and the detailed likelihoods of the selected distributions are computed (S15).

[0050] Next, the dictionary search unit 5 applies the Viterbi algorithm to the likelihoods of the distributions computed in step S15 and to the per-word models registered in the word dictionary 6 to compute scores (S16), and displays the word with the maximum score on the display unit 7 as the recognition result (S17).

[0051] For example, if the number of distributions N is 1500, the number of clusters M in each subspace is 50, and the number of distributions T for which the detailed likelihood is calculated is 100, then the required work consists of the likelihood calculation for 50 (number of clusters M) + 100 (number of distributions T) = 150 distributions, the calculation of the approximate likelihood of each distribution, and the selection of the top 100 distributions. Of these, the time needed to calculate the approximate likelihoods and to select the top 100 distributions is short compared with the likelihood calculation. Therefore, whereas speech recognition using a conventional HMM requires likelihood calculation for all N = 1500 distributions, the speech recognition device of the present embodiment requires likelihood calculation for only 150 distributions, so the likelihood calculation takes about 1/10 of the conventional amount of computation.
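The arithmetic behind the 1/10 figure can be checked directly. The variable names are illustrative, and the count follows the text's own accounting, in which the table lookups and the top-T selection are treated as negligible.

```python
N = 1500  # total number of distributions
M = 50    # clusters per subspace
T = 100   # distributions that receive a detailed likelihood

conventional = N   # likelihood calculations without preliminary selection
proposed = M + T   # cluster likelihoods plus detailed likelihoods
ratio = proposed / conventional
print(f"{proposed} of {conventional} likelihood calculations, ratio {ratio:.2f}")
```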

[0052] When the likelihood calculation uses a preliminary selection dictionary created by clustering over the entire feature space, the amount of computation required is the same as when the preliminary selection dictionary of the present embodiment is used. However, a dictionary created by clustering over the entire feature space can express only M distinct approximate likelihoods, whereas the preliminary selection dictionary of the present embodiment is clustered separately for each feature space and can therefore express M × M × M × M distinct approximate likelihoods. A finer-grained approximate likelihood can thus be calculated, which improves speech recognition performance.
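The difference in resolution can be made concrete with the example value M = 50 and four feature spaces. This is a simple count of index combinations, so it is an upper bound on the number of distinct sums; the variable names are illustrative.

```python
import math

clusters_per_subspace = [50, 50, 50, 50]  # M = 50 in each of four feature spaces

full_space_patterns = 50  # one clustering over the whole feature space
subspace_patterns = math.prod(clusters_per_subspace)  # M * M * M * M combinations
```

With these values, per-subspace clustering distinguishes up to 6,250,000 index combinations, against 50 for whole-space clustering.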

[0053] In the speech recognition device of the present embodiment, four feature spaces are used: power, Δ power, LPC cepstrum, and LPC Δ cepstrum. Other feature spaces may be used instead. In addition, although the description assumed the same number of clusters in the subspace of every feature space, a different number of clusters may be set for each feature space.

[0054] FIG. 8 is a diagram showing an example of the external appearance of the speech recognition device of the present invention. The speech recognition device includes a voice input unit 1, a computer main body 11, a graphic display device 12, a magnetic tape device 13 into which a magnetic tape 14 is loaded, a keyboard 15, a mouse 16, a CD-ROM (Compact Disc-Read Only Memory) device 17 into which a CD-ROM 18 is loaded, and a communication modem 19. The speech recognition program is supplied on a recording medium such as the magnetic tape 14 or the CD-ROM 18. The program is executed by the computer main body 11, and the operator issues speech recognition instructions and the like by operating the keyboard 15 or the mouse 16 while viewing the graphic display device 12. The speech recognition program may also be supplied to the computer main body 11 from another computer over a communication line via the communication modem 19.

[0055] FIG. 9 is a block diagram showing a configuration example of the speech recognition device of the present invention. The computer main body 11 shown in FIG. 8 includes a CPU (Central Processing Unit) 20, a ROM (Read Only Memory) 21, a RAM (Random Access Memory) 22, and a hard disk 23. The CPU 20 performs processing while exchanging data with the graphic display device 12, the magnetic tape device 13, the keyboard 15, the mouse 16, the CD-ROM device 17, the communication modem 19, the ROM 21, the RAM 22, or the hard disk 23. The speech recognition program recorded on the magnetic tape 14 or the CD-ROM 18 is first stored on the hard disk 23 by the CPU 20 via the magnetic tape device 13 or the CD-ROM device 17. The CPU 20 performs speech recognition by loading the speech recognition program from the hard disk 23 into the RAM 22 as needed and executing it.

[0056] As described above, the speech recognition device of the present embodiment calculates an approximate likelihood of each distribution using the preliminary selection dictionary 4a created for each subspace, extracts the distributions whose output probabilities need to be calculated exactly, and calculates the output probability only for the extracted distributions. This makes it possible to shorten the time required for recognition. In addition, because the preliminary selection dictionary 4a is created by clustering each feature space separately, the approximate likelihood calculation unit 3a can calculate a fine-grained approximate likelihood, which makes it possible to improve recognition performance.

[0057] The embodiments disclosed herein are to be considered in all respects as illustrative and not restrictive. The scope of the present invention is defined by the claims rather than by the description above, and is intended to include all modifications within the meaning and scope equivalent to the claims.

[Brief description of the drawings]

FIG. 1 is a diagram illustrating an outline of the functional configuration of a speech recognition device according to an embodiment of the present invention.

FIG. 2 is a flowchart illustrating the processing procedure of the speech recognition device according to the embodiment of the present invention.

FIG. 3 is a diagram illustrating an example of the preliminary selection dictionary 4a.

FIG. 4 is a diagram illustrating an example of the index table and the per-subspace likelihood table.

FIG. 5 is a flowchart illustrating the creation of the preliminary selection dictionary 4a and the index table for each subspace.

FIG. 6 is a diagram showing a case where the preliminary selection dictionary 4a has a multilayer structure.

FIG. 7 is a flowchart illustrating the creation of the per-subspace likelihood table and the calculation of the detailed likelihood.

FIG. 8 is a diagram illustrating an example of the external appearance of the speech recognition device according to the embodiment of the present invention.

FIG. 9 is a diagram illustrating a schematic configuration of the speech recognition device according to the embodiment of the present invention.

[Explanation of symbols]

1 voice input unit, 1a microphone, 1b A/D converter, 2 acoustic analysis unit, 3 likelihood calculation unit, 3a approximate likelihood calculation unit, 3b distribution selection unit, 3c detailed likelihood calculation unit, 4 HMM storage unit, 4a preliminary selection dictionary, 4b detailed dictionary, 5 dictionary search unit, 6 word dictionary, 7 display unit, 11 computer main body, 12 graphic display device, 13 magnetic tape device, 14 magnetic tape, 15 keyboard, 16 mouse, 17 CD-ROM device, 18 CD-ROM, 19 communication modem, 20 CPU, 21 ROM, 22 RAM, 23 hard disk.

Claims

[Claims]

1. A speech recognition device comprising: a preliminary selection dictionary created for each subspace; likelihood calculating means for calculating an approximate likelihood of each distribution constituting a state of an HMM based on an input vector and the contents stored in the preliminary selection dictionary, and for calculating the likelihood, with respect to the input vector, of each distribution selected using the approximate likelihood; and recognition means for recognizing speech based on the likelihood calculated by the likelihood calculating means.

2. The speech recognition device according to claim 1, wherein the preliminary selection dictionary is created for each feature space.

3. The speech recognition apparatus according to claim 1, wherein the preliminary selection dictionary expresses a cluster representative distribution created for each subspace as a normal distribution, and stores an average value and a variance.

4. The speech recognition device according to claim 1, wherein said preliminary selection dictionary includes a multi-layer dictionary created for each subspace.

5. The speech recognition device according to any one of the preceding claims, wherein the likelihood calculating means includes: a detailed dictionary storing an output probability distribution of each distribution obtained by learning speech data; approximate likelihood calculating means for calculating the approximate likelihood of each distribution constituting the state of the HMM based on the input vector and the contents stored in the preliminary selection dictionary; distribution selecting means for selecting a distribution based on the approximate likelihood calculated by the approximate likelihood calculating means; and detailed likelihood calculating means for calculating, by using the detailed dictionary, a detailed likelihood of the distribution selected by the distribution selecting means.

6. The speech recognition device according to claim 5, wherein the preliminary selection dictionary is created by classifying each distribution stored in the detailed dictionary for each feature space.

7. The speech recognition device according to claim 5, wherein the preliminary selection dictionary is created by classifying each distribution stored in the detailed dictionary for each subspace, then classifying which cluster the speech data belongs to, and calculating from the values of the feature parameters.

8. The speech recognition device according to claim 5, wherein the distribution selecting means selects distributions having a large approximate likelihood from among the approximate likelihoods calculated by the approximate likelihood calculating means.

9. The speech recognition device according to claim 5, wherein the distribution selecting means calculates a reference likelihood from the maximum likelihood of the subspace and a threshold, and selects a distribution having an approximate likelihood larger than the reference likelihood.

10. A speech recognition method comprising: calculating an approximate likelihood of each distribution constituting a state of an HMM based on an input vector and the contents of a preliminary selection dictionary created for each subspace; calculating the likelihood, with respect to the input vector, of each distribution selected using the approximate likelihood; and performing speech recognition based on the calculated likelihood.

11. A computer-readable recording medium on which a speech recognition program is recorded, the program comprising the steps of: calculating an approximate likelihood of each distribution constituting a state of an HMM based on an input vector and the contents of a preliminary selection dictionary created for each subspace; calculating the likelihood, with respect to the input vector, of each distribution selected using the approximate likelihood; and performing speech recognition based on the calculated likelihood.