JP4801107B2

JP4801107B2 - Voice recognition apparatus, method, program, and recording medium thereof

Info

Publication number: JP4801107B2
Application number: JP2008055977A
Authority: JP
Inventors: 厚徳小川; 敏高橋
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2008-03-06
Filing date: 2008-03-06
Publication date: 2011-10-26
Anticipated expiration: 2028-03-06
Also published as: JP2009210975A

Description

The present invention relates to speech recognition technology, and in particular to a technique for increasing the speed of speech recognition processing.

A conventional speech recognition apparatus 100' will be described with reference to FIG. 7.
Input speech is input to the acoustic analysis unit 10. The acoustic analysis unit 10 calculates a feature vector from the input speech for each frame of a fixed time length and generates a time series of feature vectors. The generated time series of feature vectors is sent to the search unit 30'.
The search unit 30' uses the acoustic model read from the acoustic model storage unit 40 to match the words or word strings expressed by the grammar read from the grammar storage unit 50 against the time series of feature vectors, that is, it performs a search process, and outputs the word or word string with the highest likelihood as the recognition result.

Cepstrum analysis is often used as the speech analysis method in the acoustic analysis unit 10. Examples of features include MFCC (Mel-Frequency Cepstral Coefficients), ΔMFCC, ΔΔMFCC, log power, and Δ log power, and these features form a feature vector of about 10 to 100 dimensions. The analysis is performed, for example, with a frame width of about 30 ms and a frame shift of about 10 ms.
The acoustic model stored in the acoustic model storage unit 40 holds speech features such as MFCC as standard patterns for appropriate categories. For the feature vector of a given section of the input speech, it is used to calculate the acoustic closeness to each standard pattern as a likelihood and to estimate which category that section belongs to.
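As a rough illustration of the frame width and frame shift mentioned for the acoustic analysis unit 10, the sketch below lists the start positions of the analysis frames. The 16 kHz sampling rate and the function name `frame_starts` are assumptions made for this example and do not come from the source.

```python
def frame_starts(n_samples, sample_rate_hz, width_ms=30.0, shift_ms=10.0):
    """Return the start sample index of every full analysis frame.

    width_ms and shift_ms default to the approximate values in the text;
    the sampling rate is supplied by the caller (an assumption here).
    """
    width = int(round(sample_rate_hz * width_ms / 1000.0))  # samples per frame
    shift = int(round(sample_rate_hz * shift_ms / 1000.0))  # samples per shift
    if n_samples < width:
        return []  # not enough audio for even one frame
    return list(range(0, n_samples - width + 1, shift))


# One second of audio at an assumed 16 kHz sampling rate.
starts = frame_starts(n_samples=16000, sample_rate_hz=16000)
```

One feature vector X_t would then be computed from each frame, so the length of `starts` is the number of frames in the time series.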

At present, hidden Markov models (hereinafter abbreviated as HMMs), which are modeled on the basis of probability and statistical theory, are widely used as acoustic models. An HMM is usually created for each phoneme category. Each HMM created for a phoneme category is called a phoneme HMM. A set of multiple phoneme HMMs forms one acoustic model.
Monophone-HMMs, biphone-HMMs, and triphone-HMMs are often used as phoneme HMMs.

A monophone-HMM is a context-independent phoneme HMM that takes neither the phoneme preceding the central phoneme nor the phoneme following it into account as phoneme context. For example, the monophone-HMM of phoneme a can be written as *-a-*, where * is an arbitrary phoneme.
Biphone-HMMs include preceding-context-dependent phoneme HMMs, which take only the phoneme preceding the central phoneme into account, and following-context-dependent phoneme HMMs, which take only the phoneme following the central phoneme into account. For example, the preceding-context-dependent biphone-HMM of phoneme a whose preceding phoneme is p can be written as p-a-*. The following-context-dependent biphone-HMM of phoneme a whose following phoneme is t can be written as *-a-t.

A triphone-HMM is a phoneme HMM that takes both the phoneme preceding the central phoneme and the phoneme following it into account as phoneme context. For example, the triphone-HMM of phoneme a whose preceding phoneme is p and whose following phoneme is t can be written as p-a-t.
A biphone-HMM represents the phoneme context in more detail than a monophone-HMM, and a triphone-HMM represents it in more detail than a biphone-HMM.
The number of phoneme categories represented by phoneme HMMs depends on the training data of the acoustic model, but because chains that cannot occur as Japanese phoneme sequences, such as t-t-t, are excluded, it is generally in the range of several thousand to several tens of thousands.

The structure of the phoneme HMMs included in the acoustic model will be described with reference to FIGS. 8 and 9. A phoneme HMM is composed of a plurality of states S, as described later.
A state S is represented as a mixture probability distribution, as illustrated in FIG. 8. The component distributions of a mixture may be discrete or continuous probability distributions, but the most commonly used at present is the multidimensional normal distribution (also called the multidimensional Gaussian distribution), which is a continuous probability distribution. Among these, the most commonly used is the multidimensional uncorrelated normal distribution, which has no correlation between dimensions, that is, whose covariance matrix has off-diagonal components equal to 0. Each dimension of the multidimensional normal distribution corresponds to a dimension of the feature vector.

In FIG. 8, the state S is represented as a multidimensional Gaussian mixture distribution M whose component distributions are four multidimensional normal distributions. Although FIG. 8 shows only one dimension i of the multidimensional normal distributions, the other dimensions are represented in the same way.
A phoneme HMM is formed by a probabilistic chain of from several to a little over ten states S such as the one illustrated in FIG. 8. There are many variations in how many states, and what kind of probabilistic chain, make up a phoneme HMM. The structure may also differ from one phoneme HMM to another.

The structure most commonly used at present is the so-called three-state left-to-right HMM illustrated in FIG. 9. It consists of three states, a first state S_1, a second state S_2, and a third state S_3, arranged from left to right. The probabilistic chain of states, that is, the state transitions, consists of the transitions to the same state (self-transitions) S_1→S_1, S_2→S_2, and S_3→S_3, and the transitions to the next state S_1→S_2 and S_2→S_3. In many cases, all phoneme HMMs in an acoustic model have this three-state left-to-right structure.
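The three-state left-to-right structure can be written down as a small transition table. The sketch below is only an illustration: the self-transition values are made up, and the extra final column, which holds the probability of leaving the last state (for example, into the next phoneme HMM), is an assumption added so that each row sums to 1.

```python
def left_to_right_transitions(self_probs):
    """Build a[j][k] for a left-to-right HMM from its self-transition probabilities.

    Row j has a[j][j] (self-transition) and a[j][j+1] (move to the next state);
    all other entries are 0. The last column is a hypothetical exit state.
    """
    n = len(self_probs)
    a = [[0.0] * (n + 1) for _ in range(n)]
    for j, p in enumerate(self_probs):
        a[j][j] = p            # S_j -> S_j
        a[j][j + 1] = 1.0 - p  # S_j -> S_{j+1} (or exit, for the last state)
    return a


# Illustrative self-transition probabilities for S_1, S_2, S_3.
A = left_to_right_transitions([0.6, 0.7, 0.5])
```

No entry allows a move back to an earlier state or a skip over a state, which is what makes the model left-to-right.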

The acoustic likelihood calculation of a phoneme HMM will now be described. Specifically, the calculation performed when a time series of feature vectors is input to the phoneme HMM of FIG. 9 is described. For example, the acoustic likelihood P(X | S_e, HMM), which is the probability that the time series X = X_1, X_2, X_3, X_4, X_5, X_6 of feature vectors for six frames is output from one state transition sequence S_e = S_1→S_1→S_2→S_2→S_3→S_3 of the phoneme HMM, is calculated as follows.
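One form of equation (1) that is consistent with the surrounding definitions is given below. It assumes that each frame is scored by the state occupied at that frame and that one transition probability is applied between consecutive frames. Whether the original equation also includes an initial or exit transition is not recoverable from the text, so this should be read as a reconstruction under that assumption.

$$
P(X \mid S_e, \mathrm{HMM}) = b_1(X_1)\, a_{11}\, b_1(X_2)\, a_{12}\, b_2(X_3)\, a_{22}\, b_2(X_4)\, a_{23}\, b_3(X_5)\, a_{33}\, b_3(X_6) \tag{1}
$$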

Here, a_jk is the transition probability from state S_j to state S_k. The state likelihood b_j(X_t) is the probability that the feature vector X_t at time t, that is, at frame t, is output from the multidimensional Gaussian mixture distribution M_j representing state S_j. The state likelihood b_j(X_t) is calculated as follows using the output probability P_jm(X_t) of the m-th multidimensional normal distribution in the mixture M_j.
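Given the definitions in the text (the mixture count m_j, the distribution weights W_jm, and the component output probabilities P_jm(X_t)), equation (2) presumably takes the standard weighted-sum form of a mixture distribution:

$$
b_j(X_t) = \sum_{m=1}^{m_j} W_{jm}\, P_{jm}(X_t) \tag{2}
$$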

Here, the mixture count m_j is the number of normal distributions in the multidimensional Gaussian mixture distribution M_j, and W_jm is the distribution weight of the m-th normal distribution in M_j. W_jm satisfies the following equation.
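The condition referred to here is presumably the usual normalization of mixture weights, written as equation (3):

$$
\sum_{m=1}^{m_j} W_{jm} = 1 \tag{3}
$$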

When the normal distributions in the multidimensional Gaussian mixture distribution M_j are multidimensional uncorrelated normal distributions, P_jm(X_t) is calculated as follows.
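For an uncorrelated (diagonal-covariance) normal distribution, the density factorizes over the dimensions, so equation (4) can be reconstructed from the symbols μ_jmi, σ_jmi^2, X_ti, and I defined in the text as:

$$
P_{jm}(X_t) = \prod_{i=1}^{I} \frac{1}{\sqrt{2\pi\sigma_{jmi}^{2}}} \exp\!\left(-\frac{(X_{ti}-\mu_{jmi})^{2}}{2\sigma_{jmi}^{2}}\right) \tag{4}
$$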

Here, μ_jmi and σ_jmi^2 are the mean and variance in dimension i of the m-th multidimensional uncorrelated normal distribution in the multidimensional Gaussian mixture distribution M_j. X_ti is the value of dimension i of the feature vector X_t. I is the number of dimensions of the feature vector and of the multidimensional uncorrelated normal distribution.
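The state likelihood of a diagonal-covariance Gaussian mixture can be sketched in a few lines. The version below works in the log domain, which is a common way to avoid underflow. The function names and the log-sum-exp step are choices made for this example, not details taken from the source.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of an uncorrelated normal distribution (product over dimensions)."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def log_state_likelihood(x, weights, means, variances):
    """log b_j(X_t) for one state: log of the weighted sum of component densities."""
    terms = [
        math.log(w) + log_gaussian_diag(x, mu, var)
        for w, mu, var in zip(weights, means, variances)
    ]
    peak = max(terms)  # log-sum-exp: factor out the largest term for stability
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


# Hypothetical one-dimensional state with two identical standard-normal components.
value = log_state_likelihood([0.0], [0.5, 0.5], [[0.0], [0.0]], [[1.0], [1.0]])
```

The weights, means, and variances passed in correspond to the state parameters W_jm, μ_jmi, and σ_jmi^2 of a single state j.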

The acoustic likelihood calculation described above is for one state transition sequence S_e. Other such state transition sequences also exist. The method that calculates, for every such state transition sequence, the probability of outputting the time series of feature vectors, and takes the sum of these probabilities as the acoustic likelihood when the time series X of feature vectors is input to the phoneme HMM, is called the trellis algorithm.

In contrast, the method that sequentially determines, frame by frame along the time series of feature vectors, the state transition sequence giving the highest acoustic likelihood among all state transition sequences, and takes the likelihood on reaching the final frame as the acoustic likelihood when the time series X of feature vectors is input to the phoneme HMM, is called the Viterbi algorithm. In general, the Viterbi algorithm is often used because it can greatly reduce the amount of computation compared with the trellis algorithm.
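The difference between the two algorithms can be seen in a small frame-by-frame recursion: the trellis score sums over incoming paths, while the Viterbi score keeps only the best one. The sketch below is a simplified illustration with raw probabilities. It assumes that the path starts in the first state and ends in the last state, and the function name and the toy values are hypothetical.

```python
def trellis_and_viterbi(a, b):
    """Return (sum over all paths, best single path) for one HMM.

    a[j][k]: transition probability from state j to state k.
    b[j][t]: state likelihood b_j(X_t) of state j at frame t.
    """
    n_states, n_frames = len(b), len(b[0])
    total = [b[0][0]] + [0.0] * (n_states - 1)  # trellis: sum over paths
    best = list(total)                           # Viterbi: max over paths
    for t in range(1, n_frames):
        new_total, new_best = [], []
        for k in range(n_states):
            new_total.append(b[k][t] * sum(total[j] * a[j][k] for j in range(n_states)))
            new_best.append(b[k][t] * max(best[j] * a[j][k] for j in range(n_states)))
        total, best = new_total, new_best
    return total[-1], best[-1]


# Toy three-state left-to-right model, four frames, all state likelihoods set to 1.
a = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]]
b = [[1.0] * 4 for _ in range(3)]
trellis_score, viterbi_score = trellis_and_viterbi(a, b)
```

In this toy case three paths reach the last state, with probabilities 0.125, 0.125, and 0.25, so the trellis score is their sum and the Viterbi score is the largest of them.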

The acoustic likelihood calculation described above is for one phoneme HMM. In practice, before the search unit 30' performs the search process, phoneme HMMs are concatenated to create an HMM search network for the words or word strings expressed by the grammar stored in the grammar storage unit 50, and the time series of feature vectors of the input speech is matched against the words or word strings represented by the search network, that is, the search process is performed. The word or word string with the highest acoustic likelihood is then output as the recognition result.

In continuous speech recognition, in addition to the acoustic likelihood described above, a language likelihood from a language model that statistically represents how readily words connect is taken into account, and the word or word string with the highest combined likelihood is output. Although the acoustic likelihood calculation above handles probability values directly, in practice the logarithms of the probability values are used in order to prevent underflow (for the above, see, for example, Non-Patent Documents 1 and 2).

The time spent calculating the state likelihood b_j(X_t) accounts for 45% to 65% of speech recognition processing time, so speeding up the process of obtaining b_j(X_t) is an effective way to speed up speech recognition processing. Many techniques for speeding up the process of obtaining the state likelihood b_j(X_t) have been proposed (see, for example, Non-Patent Documents 3 and 4).
The technique described in Non-Patent Document 4 for speeding up the process of obtaining the state likelihood b_j(X_t) is described below. The technique of Non-Patent Document 4 achieves this speed-up on the basis of the following two experimental facts.

1. An investigation of CPU behavior during calculation of the state likelihood b_j(X_t) showed that the most time-consuming part is not the calculation of b_j(X_t) defined by equation (2) above itself, but the process of fetching the state parameters of the target state j, which are needed to calculate b_j(X_t), from main memory into the CPU cache.

2. When the state likelihood b_j(X_t) has been calculated for a frame t of a state j, it is highly likely that the state likelihood b_j(X_{t+1}) will be calculated for the next frame t+1 of that state j. Non-Patent Document 4 reports that the state likelihood b_j(X_{t+1}) for the next frame t+1 is calculated with a probability of 75% or more.

The technique of Non-Patent Document 4 will be described with reference to the state likelihood table illustrated in FIG. 10. A state likelihood table shows, for each state and in time order, the frames for which the state likelihood b_j(X_t) is calculated.

For example, suppose it becomes necessary to calculate the state likelihood b_j(X_t) for frame t of state j. At this point, not only b_j(X_t) but also the state likelihoods b_j(X_{t+1}), ..., b_j(X_{t+K}) for up to K frames ahead are calculated at the same time, and the results are stored in a table. This process of calculating the state likelihoods for up to K frames ahead is called "batch state likelihood calculation processing". K is an integer of about 7.

If it later becomes necessary to calculate any of the state likelihoods b_j(X_{t+1}), ..., b_j(X_{t+K}), they are obtained by referring to the table without actually calculating them. This speeds up the process of obtaining the state likelihood b_j(X_t).
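Batch state likelihood calculation can be sketched as a look-ahead cache keyed by state and frame. In the sketch below, the class name, the `score_fn` callback standing in for the real likelihood calculation, and the `fetches` counter that models one state parameter load per batch are all hypothetical. The sketch shows the bookkeeping only, not the memory behavior of an actual CPU cache.

```python
class BatchStateLikelihood:
    """Compute b_j(X_t) together with up to K following frames and cache the results."""

    def __init__(self, features, score_fn, k=7):
        self.features = features  # feature vectors X_0 .. X_{T-1}
        self.score_fn = score_fn  # hypothetical callback: (state j, X_t) -> b_j(X_t)
        self.k = k                # number of extra frames computed per batch
        self.table = {}           # (j, t) -> cached state likelihood
        self.fetches = 0          # batches started, i.e. modeled parameter loads
        self.computed = 0         # state likelihoods actually calculated

    def get(self, j, t):
        if (j, t) not in self.table:
            self.fetches += 1
            last = min(t + self.k, len(self.features) - 1)
            for u in range(t, last + 1):
                if (j, u) not in self.table:
                    self.table[(j, u)] = self.score_fn(j, self.features[u])
                    self.computed += 1
        return self.table[(j, t)]


# Toy data: scalar "features" and a stand-in scoring function.
cache = BatchStateLikelihood(list(range(20)), lambda j, x: j + x, k=7)
first = cache.get(2, 3)   # miss: computes frames 3..10 for state 2
second = cache.get(2, 4)  # hit: read from the table
```

If frames 5 to 10 of state 2 are never requested, the six values computed for them are wasted work, which is the cost that a fixed K cannot control.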

According to the technique of Non-Patent Document 4, the number of times the state parameters are fetched into the CPU cache, which is the time-consuming step described in "1." above, can be reduced. The calculation of acoustic likelihood can therefore be accelerated, and speech recognition processing can be speeded up.
Non-Patent Document 1: Kiyohiro Shikano and four others, "IT Text Speech Recognition System", Ohmsha, May 2001, p. 1-51
Non-Patent Document 2: Akio Ando, "Real-time Speech Recognition", The Institute of Electronics, Information and Communication Engineers, September 2003, p. 1-58, p. 125-170
Non-Patent Document 3: Shigeki Sagayama and four others, "New acceleration in speech recognition", Proceedings of the Acoustical Society of Japan, 1-5-12, March 1996, p. 25-28
Non-Patent Document 4: M. Saraclar and three others, "Towards automatic closed captioning: low latency real time broadcast news transcription", Proc. ICSLP'02, September 2002, p. 1741-1744

However, it is not known whether the K frames of state likelihoods b_j(X_{t+1}), ..., b_j(X_{t+K}) calculated at the same time will actually be used, and if they are not used, the state likelihood calculation has been wasted.
In the technique of Non-Patent Document 4, the value of K is fixed without regard to circumstances, so wasted state likelihood calculations may have been performed. As a result, speech recognition processing may not have been speeded up sufficiently.

In view of the above problem, an object of the present invention is to provide a speech recognition apparatus, method, program, and recording medium therefor that further improve the speed of speech recognition processing.

According to one aspect of the present invention, the acoustic model storage unit is a storage unit that stores an acoustic model including state parameters and self-transition probabilities, and the state parameter storage unit is a storage unit faster than the acoustic model storage unit. The acoustic analysis unit obtains a feature vector from the input speech for each frame of a fixed time length and stores the time series of feature vectors in the feature vector storage unit. With j and t each being an arbitrary integer, and the state likelihood b_j(X_t) being the probability that a state j outputs the feature vector X_t of frame t, the fetch unit reads the state parameters of state j from the acoustic model storage unit into the state parameter storage unit before the state likelihood b_j(X_t) is calculated. The self-transition probability frame number determination unit determines, as the number of frames K, an integer K_A(j) that is larger the higher the self-transition probability a_jj of state j read from the acoustic model storage unit. The state likelihood calculation unit calculates the state likelihood b_j(X_t) using the state parameters of state j read from the state parameter storage unit and the feature vector X_t read from the feature vector storage unit. It additionally calculates the state likelihoods b_j(X_{t+1}), ..., b_j(X_{t+K}) using the state parameters of state j read from the state parameter storage unit and the feature vectors X_{t+1}, ..., X_{t+K} read from the feature vector storage unit, and stores these additionally calculated state likelihoods in the state likelihood storage unit. When any of the state likelihoods b_j(X_{t+1}), ..., b_j(X_{t+K}) is needed, the state likelihood reference unit obtains it by referring to the state likelihood storage unit.

By varying the value of the number of frames K appropriately according to the state, the amount of wasted state likelihood calculation can be reduced. As a result, speech recognition processing can be made faster than before.

As illustrated by the state likelihood table shown in FIG. 6, the present invention is characterized in that the number of frames K for which state likelihoods are calculated in advance is varied appropriately from state to state.
Examples of embodiments of the present invention will be described below with reference to the drawings. Parts that are the same as in the background art are given the same reference numerals, and duplicate description is omitted.

[First embodiment]
The first embodiment is an embodiment for the case where speech with properties acoustically close to the target speech to be recognized (hereinafter also referred to as adaptation-target data or development data) is not available.

The first embodiment uses the self-transition probability a_jj of each state of the phoneme HMMs. The duration of a vowel such as /a/ is usually longer than that of a consonant such as /p/. For this reason, the self-transition probability of each state of a phoneme HMM whose central phoneme is a vowel is larger than that of each state of a phoneme HMM whose central phoneme is a consonant. The higher the self-transition probability of a state j, the more likely it is that, when the state likelihood b_j(X_t) has been calculated for a frame t, the state likelihood b_j(X_{t+1}) will be calculated for the next frame t+1.

Using this property, a large number of frames K is assigned to a state with a high self-transition probability, and conversely a small number of frames K is assigned to a state with a low self-transition probability. In other words, the higher the self-transition probability of a state, the larger the number of frames K assigned to it.
By varying the number of frames K for which state likelihoods are calculated in advance from state to state in this way, the amount of wasted state likelihood calculation can be reduced. The calculation of acoustic likelihood can therefore be made faster than before, and speech recognition processing can be speeded up.

An example of the first embodiment of the present invention will be described with reference to FIGS. 1 and 2. FIG. 1 is a functional block diagram of an example of the speech recognition apparatus. FIG. 2 is a flowchart illustrating the processing flow of the speech recognition method.

The speech recognition apparatus 100 of the first embodiment includes, for example, the acoustic analysis unit 10, feature vector storage unit 20, search unit 30, acoustic model storage unit 40, grammar storage unit 50, fetch unit 60, state parameter storage unit 70, state likelihood storage unit 80, and frame number determination unit 90, which are indicated by solid lines in FIG. 1. The search unit 30 includes, for example, a state likelihood calculation unit 31 and a state likelihood reference unit 32. The frame number determination unit 90 includes, for example, a self-transition probability frame number determination unit 91.

<Step S1>
Input speech is input to the acoustic analysis unit 10. The acoustic analysis unit 10 calculates a feature vector X_t from the input speech for each frame of a fixed time length and generates a time series of feature vectors X_t. The generated time series of feature vectors X_t is sent to the feature vector storage unit 20.
The feature vector storage unit 20 is, for example, a buffer that temporarily stores the feature vectors X_t.

<Step S2>
Before the state likelihood calculation unit 31 calculates the state likelihood b_j(X_t) for frame t of state j, the fetch unit 60 reads the state parameters of state j from the acoustic model storage unit 40, in which the acoustic model is stored, and stores them in the state parameter storage unit 70.

A state parameter is a numerical value needed to calculate the state likelihood b_j(X_t). Examples are the distribution weights W_jm (m = 1, ..., m_j) appearing in equation (2) of the background art section, and the means μ_jmi (m = 1, ..., m_j; i = 1, ..., I) and variances σ_jmi^2 (m = 1, ..., m_j; i = 1, ..., I) appearing in equation (4).
The state parameter storage unit 70 is a storage medium with faster reads and writes than the acoustic model storage unit 40, for example the cache 1a of the CPU 1 (see FIG. 5).

<Step S3>
The self-transition probability frame number determination unit 91 of the frame number determination unit 90 uses the self-transition probability a_jj of state j read from the acoustic model storage unit 40 and determines, as the number of frames K, an integer K_A(j) that is larger the higher the self-transition probability a_jj. Information on the number of frames K is sent to the state likelihood calculation unit 31.

For example, let a_l be a number from 0 to 1 inclusive, a_h a number from a_l to 1 inclusive, K_min an integer of 0 or more, K_max an integer of K_min + 1 or more, and f(·) a function that truncates the fractional part of its argument and outputs an integer. a_l, a_h, K_min, and K_max are numbers determined appropriately in advance according to the target speech, the hardware performance, the target speech recognition processing speed, and so on. For example, a_l is set to 0.2 to 0.3, a_h to 0.7 to 0.8, K_min to 3 to 4, and K_max to 10 to 12. K_A(j) can then be obtained by the following equation.

$$
K_A(j) =
\begin{cases}
K_{\min} & (a_{jj} < a_l) \\[4pt]
f\!\left(\dfrac{(K_{\max}-K_{\min})\,a_{jj}}{a_h-a_l} + \dfrac{K_{\min}\,a_h-K_{\max}\,a_l}{a_h-a_l}\right) & (a_l \le a_{jj} < a_h) \\[4pt]
K_{\max} & (a_{jj} \ge a_h)
\end{cases}
$$

That is, as illustrated in FIG. 3, K_A(j) = K_min if the self-transition probability a_jj is below a_l, K_A(j) = f((K_max − K_min) a_jj / (a_h − a_l) + (K_min a_h − K_max a_l) / (a_h − a_l)) if a_jj is at least a_l and below a_h, and K_A(j) = K_max if a_jj is a_h or more.
In this way, a function K_A(j) that outputs a larger integer the higher the self-transition probability a_jj is defined, and the number of frames K is determined individually for each state according to this function.
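The piecewise rule for K_A(j) translates directly into a short function. In the sketch below, the function name is hypothetical and the default parameter values are simply picked from within the example ranges given in the text. It also assumes a_h is strictly greater than a_l so that the division is defined.

```python
import math


def frames_ahead(a_jj, a_l=0.25, a_h=0.75, k_min=3, k_max=10):
    """Return K_A(j) for a state whose self-transition probability is a_jj."""
    if a_jj < a_l:
        return k_min
    if a_jj >= a_h:
        return k_max
    # Linear interpolation between (a_l, k_min) and (a_h, k_max), then truncation.
    slope = (k_max - k_min) / (a_h - a_l)
    intercept = (k_min * a_h - k_max * a_l) / (a_h - a_l)
    return math.floor(slope * a_jj + intercept)


# A short consonant-like state versus a long vowel-like state (illustrative values).
k_consonant = frames_ahead(0.1)
k_middle = frames_ahead(0.5)
k_vowel = frames_ahead(0.9)
```

With these defaults the interpolated value at a_jj = 0.5 is 6.5, which f truncates to 6, so K grows from 3 to 10 as the self-transition probability rises.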

<Step S4>
The state likelihood calculation unit 31 calculates the state likelihood b_j(X_t) for frame t of state j using the parameters of state j read from the state parameter storage unit 70 and the feature vector X_t of frame t read from the feature vector storage unit 20. At the same time, it additionally calculates the state likelihoods b_j(X_{t+1}), ..., b_j(X_{t+K}) for frames t+1, ..., t+K of state j using the parameters of state j read from the state parameter storage unit 70 and the feature vectors X_{t+1}, ..., X_{t+K} of frames t+1, ..., t+K read from the feature vector storage unit 20.

The calculated state likelihood b_j(X_t) is used by the search unit 30 to calculate the acoustic likelihood. The calculated state likelihoods b_j(X_{t+1}), ..., b_j(X_{t+K}), on the other hand, are stored in the state likelihood storage unit 80.

<Step S5>
When the search unit 30 needs any of the state likelihoods b_j(X_{t+1}), ..., b_j(X_{t+K}) in order to calculate the acoustic likelihood, the state likelihood reference unit 32 obtains that state likelihood by referring to the state likelihood storage unit 80.
Using the state likelihood obtained by the state likelihood reference unit 32, the search unit 30 calculates the acoustic likelihood in the same way as in the background art and outputs a speech recognition result.
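Steps S4 and S5 together amount to computing a batch of likelihoods while the parameters of state j are at hand, then serving later requests from a table. The sketch below illustrates that idea under stated assumptions: the class name `BatchStateLikelihood`, the counter `param_fetches`, and the use of a diagonal-covariance Gaussian log density as a stand-in for b_j are all hypothetical choices made for illustration, not the state model of the specification.

```python
import math


def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian (assumed stand-in for b_j)."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


class BatchStateLikelihood:
    def __init__(self, features, state_params, frames_ahead):
        self.features = features          # feature vectors X_0 .. X_{T-1}
        self.state_params = state_params  # j -> (mean, var), hypothetical layout
        self.frames_ahead = frames_ahead  # j -> K for that state
        self.table = {}                   # (j, t) -> stored log likelihood
        self.param_fetches = 0            # how often state parameters were read

    def get(self, j, t):
        if (j, t) in self.table:          # Step S5: reuse a stored likelihood
            return self.table[(j, t)]
        mean, var = self.state_params[j]  # read the parameters of state j once
        self.param_fetches += 1
        last = min(t + self.frames_ahead[j], len(self.features) - 1)
        for u in range(t, last + 1):      # Step S4: frame t plus up to K more
            if (j, u) not in self.table:
                self.table[(j, u)] = log_gaussian(self.features[u], mean, var)
        return self.table[(j, t)]


if __name__ == "__main__":
    feats = [[0.0], [0.5], [1.0], [1.5], [2.0], [2.5]]
    scorer = BatchStateLikelihood(feats, {0: ([1.0], [1.0])}, {0: 3})
    for t in range(len(feats)):
        scorer.get(0, t)
    print("parameter fetches:", scorer.param_fetches)
```

With K = 3, scoring six consecutive frames of one state reads that state's parameters twice instead of six times, which is the saving the batch calculation is after.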

[Second Embodiment]
The second embodiment is an embodiment for the case where adaptation-target data or development data is available.
Speech recognition is performed on the development data with ordinary state likelihood calculation, that is, without batch state likelihood calculation, and, for example, a state likelihood table is obtained. From this, the proportion of all frames in which the state likelihood was calculated (hereinafter referred to as the likelihood calculation rate q_j) is obtained for each state j. The higher the likelihood calculation rate q_j of a state j, the more likely it is that, when the state likelihood b_j(X_t) for a certain frame t has been calculated, the state likelihood b_j(X_{t+1}) for the next frame t+1 will also be calculated.

Using this property, a large number of frames K is given to a state j with a high likelihood calculation rate q_j, and conversely a small number of frames K is given to a state j with a low likelihood calculation rate q_j. That is, the higher the likelihood calculation rate q_j of a state j, the larger the number of frames K it is given.
In the second embodiment, therefore, the number of frames K is determined in consideration of both the self-transition probability a_jj and the likelihood calculation rate q_j.

In this way, by taking both the self-transition probability a_jj and the likelihood calculation rate q_j into account and varying, state by state, the number of frames K for which state likelihoods are calculated in advance, the amount of wasted state likelihood calculation can be reduced further. The calculation of the acoustic likelihood, and hence the speech recognition processing, can therefore be made even faster.

An example of the second embodiment is described below with reference to FIGS. 1 and 4. Only the parts that differ from the first embodiment are described; description of the parts common to both is omitted. FIG. 4 is a flowchart illustrating the processing flow of the speech recognition apparatus according to the second embodiment.
In addition to the self-transition probability frame number determination unit 91, the frame number determination unit 90 of the speech recognition apparatus according to the second embodiment includes, for example, the likelihood calculation rate calculation unit 92, the likelihood calculation rate frame number determination unit 93, and the integrated frame number determination unit 94, which are shown by broken lines in FIG. 1.

<Step S3' (FIG. 4)>
As in the first embodiment, the self-transition probability frame number determination unit 91 determines an integer K_A(j) that is larger the higher the self-transition probability a_jj. K_A(j) is sent to the integrated frame number determination unit 94. Unlike in the first embodiment, K_A(j) is not sent directly to the state likelihood calculation unit 31 as K. That is, in the second embodiment K is not simply set to K_A(j); K is determined by the processing of step S8 described later.

<Step S6>
The likelihood calculation rate calculation unit 92 performs speech recognition on the development data with ordinary state likelihood calculation, without batch state likelihood calculation, and obtains the likelihood calculation rate q_j for each state j. The likelihood calculation rate q_j is sent to the likelihood calculation rate frame number determination unit 93.
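As a minimal sketch of step S6, suppose the baseline recognition run leaves behind, for each state j, one flag per frame saying whether b_j(X_t) was calculated (a simplified, assumed form of the state likelihood table). The likelihood calculation rate q_j is then the fraction of flagged frames. The function name and the dictionary layout are hypothetical.

```python
def likelihood_calculation_rates(computed):
    """Return {j: q_j}, where computed[j][t] is True if b_j(X_t) was calculated."""
    rates = {}
    for j, flags in computed.items():
        # A state with no recorded frames is treated as never evaluated.
        rates[j] = sum(1 for flag in flags if flag) / len(flags) if flags else 0.0
    return rates


if __name__ == "__main__":
    table = {
        0: [True, True, False, True],     # evaluated in 3 of 4 frames
        1: [False, False, False, True],   # evaluated in 1 of 4 frames
    }
    print(likelihood_calculation_rates(table))
```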

<Step S7>
The likelihood calculation rate frame number determination unit 93 determines an integer K_B(j) that is larger the higher the likelihood calculation rate q_j of state j. K_B(j) is sent to the integrated frame number determination unit 94.

For example, let q_l be a number from 0 to 1 inclusive, q_h a number from q_l to 1 inclusive, K_min an integer of 0 or more, K_max an integer of K_min + 1 or more, and f(·) a function that truncates the fractional part of its argument and outputs an integer. K_B(j) can then be obtained by the following equation. q_l, q_h, K_min, and K_max are numbers predetermined as appropriate according to the target speech, the hardware performance, the target speech recognition processing speed, and the like. For example, q_l is set to 0.2 to 0.3, q_h to 0.7 to 0.8, K_min to 3 to 4, and K_max to 10 to 12.

That is, K_B(j) = K_min if the likelihood calculation rate q_j is less than q_l; K_B(j) = f(((K_max - K_min) q_j / (q_h - q_l)) + ((K_min q_h - K_max q_l) / (q_h - q_l))) if q_j is q_l or more and less than q_h; and K_B(j) = K_max if q_j is q_h or more.

<Step S8>
The integrated frame number determination unit 94 determines the number of frames K in consideration of both K_A(j) and K_B(j). The determined number of frames K is sent to the state likelihood calculation unit 31. For example, with f(·) a function that truncates the fractional part of its argument and outputs an integer, and the weighting coefficient λ a predetermined number from 0 to 1 inclusive, K may be obtained from the following linear interpolation of K_A(j) and K_B(j).

K = f((1 - λ) K_A(j) + λ K_B(j))

λ is a weighting coefficient that adjusts how much trust is placed in K_B(j). When K_B(j) is considered reliable, for example because a large amount of development data is available, λ is given a value close to 1; in the opposite case, λ is given a value close to 0.
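The integration in step S8 can be sketched as below. The sketch takes the linear interpolation to be the convex combination (1 - λ) K_A(j) + λ K_B(j), so that λ = 0 returns K_A(j) and λ = 1 returns K_B(j), in line with the description of λ as the trust placed in K_B(j). The function name `integrate_frames` is a hypothetical label.

```python
import math


def integrate_frames(k_a, k_b, lam, f=math.floor):
    """Blend K_A(j) and K_B(j) into a single frame count K.

    lam is the weight on k_b; f turns the blended value into an integer.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return int(f((1.0 - lam) * k_a + lam * k_b))


if __name__ == "__main__":
    for lam in (0.0, 0.25, 0.5, 1.0):
        print(lam, integrate_frames(10, 4, lam))
```

Because K_A(j) and K_B(j) each lie between their own K_min and K_max, a convex combination keeps K between the smaller K_min and the larger K_max.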

[Modifications, etc.]
In the above examples, f(·) is a function that truncates the fractional part of its argument and outputs an integer, but f(·) may instead be a function that rounds the fractional part up and outputs an integer, or a function that rounds the fractional part off and outputs an integer.
In equation (5) above, K_A(j) = f(((K_max - K_min) a_jj / (a_h - a_l)) + ((K_min a_h - K_max a_l) / (a_h - a_l))) is used when a_jj = a_l, but K_A(j) = K_min may be used instead when a_jj = a_l. Likewise, K_A(j) = K_max is used when a_jj = a_h, but K_A(j) = f(((K_max - K_min) a_jj / (a_h - a_l)) + ((K_min a_h - K_max a_l) / (a_h - a_l))) may be used instead when a_jj = a_h.

Similarly, in equation (6) above, K_B(j) = f(((K_max - K_min) q_j / (q_h - q_l)) + ((K_min q_h - K_max q_l) / (q_h - q_l))) is used when q_j = q_l, but K_B(j) = K_min may be used instead when q_j = q_l. Likewise, K_B(j) = K_max is used when q_j = q_h, but K_B(j) = f(((K_max - K_min) q_j / (q_h - q_l)) + ((K_min q_h - K_max q_l) / (q_h - q_l))) may be used instead when q_j = q_h.
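The three choices of f(·) mentioned above differ only in how a fractional frame count is turned into an integer. The following sketch shows one way to write them in Python; the helper names are hypothetical. Note that Python's built-in `round` rounds halves to the nearest even integer, so it is not used for the round-off variant here.

```python
import math


def truncate(x):
    """Drop the fractional part (frame counts here are non-negative)."""
    return math.floor(x)


def round_up(x):
    """Round any fractional part up to the next integer."""
    return math.ceil(x)


def round_half_up(x):
    """Round off: fractions of 0.5 or more go up, smaller ones go down."""
    return math.floor(x + 0.5)


if __name__ == "__main__":
    for value in (6.1, 6.5, 6.9):
        print(value, truncate(value), round_up(value), round_half_up(value))
```

Any of the three can be passed wherever a rounding function f is expected; rounding up never yields a smaller K than truncation for the same input.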

K_min and K_max in the self-transition probability frame number determination unit 91 may be the same as or different from K_min and K_max in the likelihood calculation rate frame number determination unit 93.
When the above configuration is realized by a computer, the processing contents of the functions of each unit of the speech recognition apparatus are described by a program. The functions of the units are then realized on the computer by executing this program on the computer illustrated in FIG. 5.

That is, the CPU 1 sequentially reads and executes the program, whereby the functions of the acoustic analysis unit 10, the feature vector storage unit 20, the search unit 30, the state likelihood calculation unit 31, the state likelihood reference unit 32, the fetch unit 60, the frame number determination unit 90, the self-transition probability frame number determination unit 91, the likelihood calculation rate calculation unit 92, the likelihood calculation rate frame number determination unit 93, and the integrated frame number determination unit 94 are each realized. In this case, the CPU 1, functioning as each unit of the speech recognition apparatus, processes data read from the memory 2 and from the auxiliary storage device 3 such as a hard disk, and stores the processed data in the memory 2 and the auxiliary storage device 3.

In the example shown in FIG. 5, the auxiliary storage device 3 corresponds to the acoustic model storage unit 40, the grammar storage unit 50, and the state likelihood storage unit 80. The cache 1a corresponds to the state parameter storage unit 70.

The program describing these processing contents can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, such as a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, or a magnetic tape can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), or CD-R (Recordable) / RW (ReWritable) as the optical disc; an MO (Magneto-Optical disc) as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Alternatively, the program may be stored in a storage device of a server computer and distributed by transferring it from the server computer to other computers via a network.

As a mode of execution different from the embodiments described above, the computer may read the program directly from the portable recording medium and execute processing according to the program. Furthermore, each time the program is transferred from the server computer to this computer, the computer may sequentially execute processing according to the received program. The above processing may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to this computer and the processing functions are realized only through execution instructions and the acquisition of results. Note that the program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the processing of the computer).

In this embodiment, the apparatus is configured by executing a predetermined program on a computer, but at least part of these processing contents may instead be realized in hardware.

The various processes described above need not be executed in time series in the order described; they may be executed in parallel or individually, according to the processing capability of the apparatus that executes them or as necessary. For example, in FIG. 2, the process of step S2 and the process of step S3 may be performed in parallel. In FIG. 4, the process of step S2 and the process of step S3' may be performed in parallel. Furthermore, in FIG. 4, the process of step S3' and the processes of steps S6 and S7 may be performed in parallel.
Needless to say, other modifications may be made as appropriate without departing from the spirit of the present invention.
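As an illustration of the parallel execution mentioned above, the following sketch runs a stand-in for step S3' and a stand-in for steps S6 and S7 concurrently and combines their results afterwards. Everything here is hypothetical: the two step functions use a simple 0.5 threshold in place of the piecewise mappings, and the weight of 0.5 is arbitrary. The point is only that neither branch needs the other's output.

```python
from concurrent.futures import ThreadPoolExecutor


def step_s3_prime(self_transition):
    """Stand-in for step S3': K_A(j) from self-transition probabilities."""
    return {j: 3 if a < 0.5 else 10 for j, a in self_transition.items()}


def steps_s6_s7(calc_rate):
    """Stand-in for steps S6 and S7: K_B(j) from likelihood calculation rates."""
    return {j: 3 if q < 0.5 else 10 for j, q in calc_rate.items()}


def determine_frames(self_transition, calc_rate, lam=0.5):
    # The two branches are independent, so they can be submitted together.
    with ThreadPoolExecutor(max_workers=2) as pool:
        fut_a = pool.submit(step_s3_prime, self_transition)
        fut_b = pool.submit(steps_s6_s7, calc_rate)
        k_a, k_b = fut_a.result(), fut_b.result()
    # Combine per state; int() truncates, which is fine for non-negative values.
    return {j: int((1.0 - lam) * k_a[j] + lam * k_b[j]) for j in k_a}


if __name__ == "__main__":
    print(determine_frames({0: 0.9, 1: 0.1}, {0: 0.2, 1: 0.8}))
```

In CPython, threads mainly show the independence of the two branches; CPU-bound work would need processes or native code to run truly in parallel.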

FIG. 1 is a functional block diagram of an example of the speech recognition apparatus of the present invention.
FIG. 2 is a flowchart illustrating the processing flow of the speech recognition apparatus according to the first embodiment of the present invention.
FIG. 3 is a diagram for explaining an example of how the number of frames K is determined.
FIG. 4 is a flowchart illustrating the processing flow of the speech recognition apparatus according to the second embodiment of the present invention.
FIG. 5 is an example of a functional block diagram for the case where the speech recognition apparatus of the present invention is realized by a computer.
FIG. 6 is an example of the state likelihood table of the present invention.
FIG. 7 is a functional block diagram of an example of a conventional speech recognition apparatus.
FIG. 8 is a diagram for explaining an example of the state S.
FIG. 9 is a diagram for explaining an example of a phoneme HMM.
FIG. 10 is an example of a conventional state likelihood table.

Explanation of symbols

10 acoustic analysis unit
20 feature vector storage unit
30 search unit
31 state likelihood calculation unit
32 state likelihood reference unit
40 acoustic model storage unit
50 grammar storage unit
60 fetch unit
70 state parameter storage unit
80 state likelihood storage unit
90 frame number determination unit
91 self-transition probability frame number determination unit
92 likelihood calculation rate calculation unit
93 likelihood calculation rate frame number determination unit
94 integrated frame number determination unit

Claims

A speech recognition apparatus comprising:
an acoustic model storage unit that stores an acoustic model including state parameters and self-transition probabilities;
a state parameter storage unit that is faster than the acoustic model storage unit;
an acoustic analysis unit that obtains a feature vector for each frame of a fixed time length from input speech and obtains a time series of the feature vectors;
a feature vector storage unit that stores the obtained time series of the feature vectors;
a fetch unit that, where j and t are arbitrary integers and the state likelihood b_j(X_t) is the probability that a state j outputs the feature vector X_t of frame t, reads the state parameters of state j from the acoustic model storage unit into the state parameter storage unit before the state likelihood b_j(X_t) is calculated;
a self-transition probability frame number determination unit that determines, as the number of frames K, an integer K_A(j) that is larger the higher the self-transition probability a_jj of state j read from the acoustic model storage unit;
a state likelihood calculation unit that calculates the state likelihood b_j(X_t) using the state parameters of state j read from the state parameter storage unit and the feature vector X_t read from the feature vector storage unit, and further calculates the state likelihoods b_j(X_{t+1}), ..., b_j(X_{t+K}) using the state parameters of state j read from the state parameter storage unit and the feature vectors X_{t+1}, ..., X_{t+K} read from the feature vector storage unit;
a state likelihood storage unit that stores the further calculated state likelihoods b_j(X_{t+1}), ..., b_j(X_{t+K}); and
a state likelihood reference unit that, when any of the state likelihoods b_j(X_{t+1}), ..., b_j(X_{t+K}) is needed, obtains that state likelihood by referring to the state likelihood storage unit.

The speech recognition apparatus according to claim 1, wherein,
where a_l is a predetermined number from 0 to 1 inclusive, a_h is a predetermined number from a_l to 1 inclusive, K_min is a predetermined integer of 0 or more, K_max is a predetermined integer of K_min + 1 or more, and f(·) is a function that outputs an integer by rounding down, rounding up, or rounding off the fractional part of its argument,
the self-transition probability frame number determination unit sets:
K_A(j) = K_min if the self-transition probability a_jj is less than a_l;
K_A(j) = f(((K_max - K_min) a_jj / (a_h - a_l)) + ((K_min a_h - K_max a_l) / (a_h - a_l))) if the self-transition probability a_jj is greater than a_l and less than a_h;
K_A(j) = K_max if the self-transition probability a_jj is greater than a_h;
K_A(j) = K_min or K_A(j) = f(((K_max - K_min) a_jj / (a_h - a_l)) + ((K_min a_h - K_max a_l) / (a_h - a_l))) if the self-transition probability a_jj = a_l; and
K_A(j) = K_max or K_A(j) = f(((K_max - K_min) a_jj / (a_h - a_l)) + ((K_min a_h - K_max a_l) / (a_h - a_l))) if the self-transition probability a_jj = a_h.

A speech recognition method in which an acoustic model storage unit is a storage unit that stores an acoustic model including state parameters and self-transition probabilities, and a state parameter storage unit is a storage unit faster than the acoustic model storage unit, the method comprising:
an acoustic analysis step in which an acoustic analysis unit obtains a feature vector for each frame of a fixed time length from input speech and stores a time series of the feature vectors in a feature vector storage unit;
a fetch step in which a fetch unit, where j and t are arbitrary integers and the state likelihood b_j(X_t) is the probability that a state j outputs the feature vector X_t of frame t, reads the state parameters of state j from the acoustic model storage unit into the state parameter storage unit before the state likelihood b_j(X_t) is calculated;
a self-transition probability frame number determination step in which a self-transition probability frame number determination unit determines, as the number of frames K, an integer K_A(j) that is larger the higher the self-transition probability a_jj of state j read from the acoustic model storage unit;
a state likelihood calculation step in which a state likelihood calculation unit calculates the state likelihood b_j(X_t) using the state parameters of state j read from the state parameter storage unit and the feature vector X_t read from the feature vector storage unit, further calculates the state likelihoods b_j(X_{t+1}), ..., b_j(X_{t+K}) using the state parameters of state j read from the state parameter storage unit and the feature vectors X_{t+1}, ..., X_{t+K} read from the feature vector storage unit, and stores the further calculated state likelihoods b_j(X_{t+1}), ..., b_j(X_{t+K}) in a state likelihood storage unit; and
a state likelihood reference step in which a state likelihood reference unit, when any of the state likelihoods b_j(X_{t+1}), ..., b_j(X_{t+K}) is needed, obtains that state likelihood by referring to the state likelihood storage unit.

The speech recognition method according to claim 3, wherein,
where a_l is a predetermined number from 0 to 1 inclusive, a_h is a predetermined number from a_l to 1 inclusive, K_min is a predetermined integer of 0 or more, K_max is a predetermined integer of K_min + 1 or more, and f(·) is a function that outputs an integer by rounding down, rounding up, or rounding off the fractional part of its argument,
the self-transition probability frame number determination step sets:
K_A(j) = K_min if the self-transition probability a_jj is less than a_l;
K_A(j) = f(((K_max - K_min) a_jj / (a_h - a_l)) + ((K_min a_h - K_max a_l) / (a_h - a_l))) if the self-transition probability a_jj is greater than a_l and less than a_h;
K_A(j) = K_max if the self-transition probability a_jj is greater than a_h;
K_A(j) = K_min or K_A(j) = f(((K_max - K_min) a_jj / (a_h - a_l)) + ((K_min a_h - K_max a_l) / (a_h - a_l))) if the self-transition probability a_jj = a_l; and
K_A(j) = K_max or K_A(j) = f(((K_max - K_min) a_jj / (a_h - a_l)) + ((K_min a_h - K_max a_l) / (a_h - a_l))) if the self-transition probability a_jj = a_h.

A speech recognition program for causing a computer to function as each unit of the speech recognition apparatus according to claim 1.

A computer-readable recording medium on which the speech recognition program according to claim 5 is recorded.