JPH067352B2

JPH067352B2 - Voice recognizer

Info

Publication number: JPH067352B2
Application number: JP60003536A
Authority: JP
Inventors: 正宏浜田; 明寿山田
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1985-01-11
Filing date: 1985-01-11
Publication date: 1994-01-26
Anticipated expiration: 2009-01-26
Also published as: JPS61162099A

Description

[Detailed description of the invention]

[Field of industrial application]

The present invention relates to a speech recognition apparatus that performs acoustic analysis and phoneme identification on a frame-by-frame basis.

[Prior art]

In recent years there have been many attempts to improve the performance of speech recognition apparatuses, and apparatuses that identify phonemes by means of linear discriminant functions are among those being studied.

As one example of the prior art, Japanese Laid-Open Patent Publication No. Sho 59-131999 describes applying a statistical distance measure, such as a distance based on Bayesian decision, the Mahalanobis distance, or a linear discriminant function, to the feature parameters of speech in order to recognize the input speech.

A conventional speech recognition apparatus of this kind is described below with reference to the drawings.

FIG. 2 shows a conventional speech recognition apparatus. In FIG. 2, 1 is an acoustic analysis unit, 12 is a coefficient memory, 3 is a discrimination calculation unit, 6 is a word standard pattern memory, and 7 is a recognition unit.

The operation of the speech recognition apparatus configured as above is as follows.

The input speech is sent to the acoustic analysis unit 1, where it is analyzed at short intervals of about 5 to 30 msec (hereinafter called frames) and converted into feature parameter vectors. Linear prediction (LPC) analysis is the most common analysis method, and LPC cepstrum coefficients are often used as the feature parameters. These parameter vectors are input to the discrimination calculation unit 3. The coefficient memory 12, meanwhile, stores in advance the various coefficients needed to calculate the statistical distance measure by the method described later, and these coefficients are also input to the discrimination calculation unit 3. The discrimination calculation unit 3 receives these two inputs, calculates the statistical distance measure, and identifies the phoneme of the frame in question. This processing is carried out for every frame, and the resulting phoneme sequence is sent to the recognition unit 7. There, the distance between words is evaluated as a whole between the standard patterns obtained from the word standard pattern memory 6 and the phoneme sequence, and the word standard pattern closest to the input speech is taken as the recognition result.

The above publication explains the statistical distance measures as follows.

Let the standard pattern for phoneme j be represented by its mean, and let the unknown input frame be represented by a parameter column vector. Bayesian decision then corresponds to taking as the identification result the phoneme that maximizes Eq. (1), where n is the order of the vector and the suffix T denotes transposition.

The Mahalanobis distance is given by Eq. (2), and the phoneme that minimizes Eq. (2) is the identification result.

The linear discriminant function is given by Eq. (3). If the value of the left-hand side of Eq. (3) is positive, the unknown input belongs to phoneme u; if it is negative, it belongs to phoneme v. The coefficients for discriminating between phoneme u and phoneme v form a linear discriminant coefficient column vector, and b_u/v is the constant term of that linear discriminant.
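Equations (1) to (3) themselves do not survive in this text. A conventional reconstruction that is consistent with the description above is sketched below. The symbols x (input parameter column vector), μ_j and Σ_j (mean vector and covariance matrix of phoneme j) and a_{u/v} (coefficient vector) are assumed names, as is the sign convention of the constant term in (3); only n, T, d_j and b_u/v appear in the surviving text.

```latex
\begin{align}
P_j &= (2\pi)^{-n/2}\,|\Sigma_j|^{-1/2}
       \exp\!\left\{-\tfrac{1}{2}(x-\mu_j)^{T}\Sigma_j^{-1}(x-\mu_j)\right\} \tag{1}\\
d_j &= (x-\mu_j)^{T}\,\Sigma_j^{-1}\,(x-\mu_j) \tag{2}\\
a_{u/v}^{T}\,x + b_{u/v} &\gtrless 0 \tag{3}
\end{align}
```

Under this reading, (1) is maximized and (2) is minimized over j, while the sign of the left-hand side of (3) decides between u (positive) and v (negative).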

[Problems to be solved by the invention]

With the configuration described above, however, there are problems: the amount of computation needed to calculate the statistical distance measure is very large, or the reliability of the phoneme identification result is unknown, so the subsequent word-level similarity evaluation lacks certainty. Specifically, when the measure of Eq. (1) or Eq. (2) is used, the matrix product of Eq. (4) must be computed for every phoneme j in every frame, and the amount of computation this requires is very large, as Eq. (5) shows, even if only multiplications are counted. An apparatus that implements this therefore has to be fast and large in scale.
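Equations (4) and (5) are likewise missing here. Judging from the Mahalanobis measure of Eq. (2) and from the count n(n+1) that the embodiment attributes to a single Mahalanobis calculation, they plausibly take the following form. This is an inference, not the original notation: x, μ_j and Σ_j are assumed symbol names, and J is assumed to denote the number of phonemes.

```latex
\begin{align}
&(x-\mu_j)^{T}\,\Sigma_j^{-1}\,(x-\mu_j) \tag{4}\\
&J \cdot n(n+1) \tag{5}
\end{align}
```

The factor n(n+1) follows from n^2 multiplications for the matrix-vector product and n more for the final inner product; repeating this for each of J phonemes gives (5).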

When the measure of Eq. (3) is used, on the other hand, the number of multiplications is small, as Eq. (6) shows. The discrimination result, however, indicates only which phoneme the unknown input frame belongs to in relative terms; it gives no distance from the phoneme standard pattern, in other words no index of the reliability of the phoneme identification. Consequently, when the recognition unit 7 in FIG. 2 carries out the overall evaluation of phoneme similarity against the word standard patterns from memory 6, frames whose phoneme identification is highly reliable and frames whose identification is unreliable are given the same weight, which ultimately lowers the final word recognition rate.
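The limitation of pairwise linear discrimination can be illustrated with a small Python sketch. It assumes a single-elimination scheme in which the current winner meets each remaining phoneme once, giving J-1 comparisons of n multiplications each; the text does not say how the pairs are combined, so this scheme, the function name `pairwise_tournament`, and the nearest-mean demo coefficients are all hypothetical.

```python
import numpy as np

def pairwise_tournament(x, coeffs, consts, num_phonemes):
    """Select one phoneme index using J-1 pairwise linear discriminants.

    coeffs[(u, v)] and consts[(u, v)] (with u < v) hold the coefficient
    vector and constant term for the pair; a positive score favours u.
    """
    winner, comparisons = 0, 0
    for challenger in range(1, num_phonemes):
        # The winner index is always smaller than the challenger index.
        key = (winner, challenger)
        score = coeffs[key] @ x + consts[key]  # n multiplications
        comparisons += 1
        if score < 0:
            winner = challenger
    return winner, comparisons

# Hypothetical demo data: 4 phoneme means in a 4-dimensional feature space,
# with discriminants built as nearest-mean classifiers (an assumption).
means = 3.0 * np.eye(4)
coeffs, consts = {}, {}
for u in range(4):
    for v in range(u + 1, 4):
        coeffs[(u, v)] = means[u] - means[v]
        consts[(u, v)] = -(means[u] @ means[u] - means[v] @ means[v]) / 2.0

x = means[2] + np.array([0.1, -0.1, 0.0, 0.2])
label, used = pairwise_tournament(x, coeffs, consts, 4)
print(label, used)
```

The function returns only a label and the number of comparisons. Nothing in its output says how close the frame is to the chosen phoneme, which is the missing reliability index described above.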

In view of these problems, the present invention provides a speech recognition apparatus with a high recognition rate that satisfies two conflicting requirements: reducing the amount of computation, and having a phoneme identification function that can attach a reliability to the identification result.

[Means for solving the problems]

To solve these problems, the speech recognition apparatus of the present invention comprises: an acoustic analysis unit that analyzes the input speech frame by frame to obtain feature parameters; a first coefficient memory that stores sets of linear discriminant coefficients; a discrimination calculation unit that discriminates the phoneme of an arbitrary frame; a second coefficient memory that stores, for each phoneme, a predetermined feature parameter mean vector and the inverse of the covariance matrix; a Mahalanobis distance calculation unit; a word standard pattern memory that stores the standard phoneme sequences of the words to be recognized; and a recognition unit that evaluates the distance over the whole word.

[Operation]

With the configuration described above, the present invention resolves the problems of the prior art through the following operation.

The input speech is converted into feature parameter vectors by the acoustic analysis unit and input to the Mahalanobis distance calculation unit and the discrimination calculation unit. The linear discriminant coefficients in the first coefficient memory are also input to the discrimination calculation unit, where a linear discriminant function is calculated against the feature parameter vector for each pair of phonemes, and the resulting phoneme sequence is input to the second coefficient memory and the recognition unit. The amount of computation required for the linear discriminant functions is small, as shown by Eq. (6). Next, on the basis of the phoneme sequence information input to the second coefficient memory, the matching pair of feature parameter mean vector and inverse covariance matrix is selected from those stored there for each phoneme and input to the Mahalanobis distance calculation unit. The Mahalanobis distance calculation unit receives these two inputs and calculates the first candidate phoneme distance between the input speech and the standard pattern of the phoneme to which it was judged to belong. This distance calculation needs to be carried out only for the single candidate phoneme identified by the discrimination calculation unit, so the total amount of computation can be expected to be far smaller than in the conventional example of Eq. (5). Finally, the recognition unit receives the phoneme sequence from the discrimination calculation unit, the first candidate phoneme distance for each frame from the Mahalanobis distance calculation unit, and the word standard patterns from the word standard pattern memory, evaluates the overall distance over the whole word, and outputs the recognition result.
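The two-stage flow described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the apparatus itself: the names `mahalanobis`, `identify_frame` and `first_stage` are hypothetical, the first stage is passed in as any callable that returns a phoneme index, and the two-phoneme demo data are invented for the example.

```python
import numpy as np

def mahalanobis(x, mean, inv_cov):
    """Quadratic form (x - mean)^T inv_cov (x - mean)."""
    diff = x - mean
    return float(diff @ inv_cov @ diff)

def identify_frame(x, first_stage, means, inv_covs):
    """Return (phoneme index, first candidate phoneme distance) for one frame."""
    j = first_stage(x)                          # cheap stage: label only
    d = mahalanobis(x, means[j], inv_covs[j])   # one distance, not one per phoneme
    return j, d

# Hypothetical demo: two phonemes in a 2-dimensional feature space.
means = np.array([[0.0, 0.0], [4.0, 0.0]])
inv_covs = [np.eye(2), 0.25 * np.eye(2)]

def first_stage(x):
    # Single linear discriminant between phoneme 0 and phoneme 1
    # (nearest-mean coefficients, chosen only for illustration).
    score = (means[0] - means[1]) @ x + 8.0
    return 0 if score > 0 else 1

print(identify_frame(np.array([1.0, 0.0]), first_stage, means, inv_covs))
print(identify_frame(np.array([3.5, 0.0]), first_stage, means, inv_covs))
```

The stored mean vector and inverse covariance matrix are looked up by the label from the first stage, which mirrors the role the text gives to the second coefficient memory.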

[Embodiment]

A speech recognition apparatus according to one embodiment of the present invention is described below with reference to the drawings. FIG. 1 shows one embodiment of the present invention.

In FIG. 1, 1 is an acoustic analysis unit that analyzes the input speech frame by frame, 2 is a first coefficient memory that stores sets of linear discriminant coefficients, 3 is a discrimination calculation unit that discriminates the phoneme of an arbitrary frame, 4 is a second coefficient memory that stores, for each phoneme, a predetermined feature parameter mean vector and the inverse of the covariance matrix, 5 is a Mahalanobis distance calculation unit, 6 is a word standard pattern memory that stores the standard phoneme sequences of the words to be recognized, and 7 is a recognition unit that evaluates the distance over the whole word.

The operation of the speech recognition apparatus configured as above is as follows. First, the input speech is analyzed by the acoustic analysis unit 1. Any conventional analysis method may be used, but this embodiment uses linear prediction analysis. If the target speech is limited to the telephone band, it is desirable, in order to both minimize the amount of computation and maximize recognition performance, to sample and quantize the speech at 8 kHz and 12 bits and then perform a 10th-order linear prediction analysis at frame intervals of 10 to 20 msec to obtain the LPC cepstrum coefficients C_i (i = 1, 2, ..., 10). The LPC cepstrum coefficients are explained in detail in "Digital Processing of Speech Signals" by L. R. Rabiner and R. W. Schafer. In summary, when the linear prediction model H(z) is given by Eq. (7), C_i is given by Eq. (8).
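Equations (7) and (8) are not reproduced in this text. As an illustration of the kind of relation they express, the Python sketch below uses the widely known recursion from all-pole predictor coefficients to cepstrum coefficients. It assumes the convention H(z) = 1 / (1 - sum_k a_k z^-k) with unit gain; the patent's own sign convention and gain term may differ, and the function name `lpc_to_cepstrum` is hypothetical.

```python
def lpc_to_cepstrum(a, num_ceps):
    """Cepstrum coefficients c_1..c_num_ceps of an all-pole model.

    a[k-1] holds a_k in H(z) = 1 / (1 - sum_k a_k z^-k)  (assumed convention).
    Recursion: c_n = a_n + sum_{k=1}^{n-1} (k/n) * c_k * a_{n-k},
    with a_n taken as 0 for n greater than the predictor order.
    """
    p = len(a)
    c = [0.0] * (num_ceps + 1)  # c[0] is unused (gain term omitted)
    for n in range(1, num_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    return c[1:]

# Single-pole check: for a_1 = 0.5 the cepstrum is 0.5**n / n.
print(lpc_to_cepstrum([0.5], 3))
```

For the embodiment's settings one would call it with ten predictor coefficients and `num_ceps=10`, giving the ten values C_1 to C_10 per frame.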

The feature parameter vector obtained in this way,

$$\{C_i\} = (C_1, C_2, \ldots, C_{10}) \tag{9}$$

is, for each frame, the input to the discrimination calculation unit 3 in FIG. 1 and the second input to the Mahalanobis distance calculation unit 5.

In FIG. 1, the discrimination calculation unit 3 identifies the phoneme of the input speech for each frame using linear discriminant functions, and the resulting first candidate phoneme is input to the second coefficient memory 4 and the recognition unit 7. The linear discriminant functions are handled in the same way as described in the prior art section of this specification, so the explanation is not repeated here. The first coefficient memory 2 in the figure stores the linear discriminant coefficients for discriminating between arbitrary pairs of phonemes, and these coefficients are read out as the discrimination calculation unit 3 requires.

The second coefficient memory 4, meanwhile, uses the phoneme sequence input from the discrimination calculation unit 3 to select the feature parameter mean vector and the inverse covariance matrix for the phoneme in question, and supplies them as the first input to the Mahalanobis distance calculation unit 5. The Mahalanobis distance calculation unit 5 receives the first input and the second input, calculates the Mahalanobis distance d_j according to Eq. (2), and inputs it to the recognition unit 7 as the first candidate phoneme distance. Here the elements of the column vector in Eq. (2) are the elements C_i of Eq. (9).

The distance calculation of Eq. (2) is carried out only for the single phoneme per frame determined by the discrimination calculation unit 3, so the number of multiplications it requires differs from Eq. (5). Since Eq. (4) needs to be calculated only once per frame, the number of multiplications required is given by Eq. (10).

The total number of multiplications required for phoneme identification in this embodiment is therefore given, per frame, by Eq. (11). In Eq. (11), the term n(J-1) is the amount required to calculate the linear discriminant functions in the discrimination calculation unit 3 of FIG. 1, and the term n(n+1) is the amount required in the Mahalanobis distance calculation unit 5.

The multiplication counts discussed above are summarized in the following table.

The reduction in computation achieved by this embodiment is evident from the comparison of multiplication counts.
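The comparison can be made concrete with a short Python sketch. The terms n(J-1) and n(n+1) are taken from the description of Eq. (11); reading Eq. (6) as n(J-1) and Eq. (5) as J times n(n+1) is an inference from the surrounding text, not a quotation. The value n = 10 matches the ten LPC cepstrum coefficients of the embodiment, while J = 20 phonemes is a purely hypothetical figure chosen for illustration.

```python
def mults_all_phonemes(n, J):
    """Assumed form of Eq. (5): one quadratic form per phoneme."""
    return J * n * (n + 1)

def mults_linear(n, J):
    """Inferred form of Eq. (6): J-1 discriminants of n multiplications."""
    return n * (J - 1)

def mults_one_mahalanobis(n):
    """Eq. (10): a single quadratic form."""
    return n * (n + 1)

def mults_embodiment(n, J):
    """Eq. (11): linear discrimination plus one Mahalanobis distance."""
    return mults_linear(n, J) + mults_one_mahalanobis(n)

n, J = 10, 20  # n from the embodiment; J is a hypothetical phoneme count
rows = [
    ("Eq. (1) or (2) for every phoneme", mults_all_phonemes(n, J)),
    ("Eq. (3) only", mults_linear(n, J)),
    ("This embodiment", mults_embodiment(n, J)),
]
for name, count in rows:
    print(f"{name:35s} {count:6d}")
```

With these assumed values the embodiment needs somewhat more multiplications per frame than linear discrimination alone, but far fewer than evaluating the quadratic form for every phoneme, while still producing a distance for the selected phoneme.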

Finally, the recognition unit 7 uses the phoneme sequence obtained from the discrimination calculation unit 3, the first candidate phoneme distances obtained from the Mahalanobis distance calculation unit 5, and the standard phoneme sequences of the recognition target words obtained from the word standard pattern memory 6 to evaluate the overall distance over the whole word, and outputs the recognition result. Various methods are conceivable for the overall distance evaluation; this embodiment uses the following one. Phoneme-level DP matching is performed frame by frame between the phoneme sequence of a word standard pattern and the phoneme sequence of the input speech, accumulating the inter-word distance. When the two phonemes disagree at a representative frame position of the input speech on the matching path, a distance is accumulated that is weighted more heavily the smaller the first candidate phoneme distance of that frame is. The smaller the Mahalanobis distance, the more reliable the phoneme identification result for that frame, so it is reasonable for a phoneme class mismatch there to contribute more to the increase in the distance for the whole word.
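A minimal Python sketch of reliability-weighted DP matching is given below. It is one possible reading of the description rather than the patented procedure: the weighting function 1 / (1 + d), the symmetric three-way step pattern, and the application of the weight at every mismatching frame (rather than only at representative frames) are all assumptions, and `weighted_dp_distance` is a hypothetical name.

```python
def weighted_dp_distance(ref, inp, dists, weight=lambda d: 1.0 / (1.0 + d)):
    """Accumulated distance between a reference phoneme sequence and
    per-frame input phoneme labels.

    ref   : standard phoneme sequence of a word
    inp   : first candidate phoneme for each input frame
    dists : first candidate phoneme distance for each input frame
    A mismatch costs weight(d): small d (reliable frame) gives a large cost.
    """
    R, T = len(ref), len(inp)
    INF = float("inf")
    D = [[INF] * (T + 1) for _ in range(R + 1)]
    D[0][0] = 0.0
    for i in range(1, R + 1):
        for t in range(1, T + 1):
            cost = 0.0 if ref[i - 1] == inp[t - 1] else weight(dists[t - 1])
            D[i][t] = cost + min(D[i - 1][t - 1], D[i - 1][t], D[i][t - 1])
    return D[R][T]

ref = ["a", "k", "a"]
inp = ["a", "a", "k", "i", "a"]

# The mismatching frame "i" is unreliable (large distance): small penalty.
print(weighted_dp_distance(ref, inp, [0.5, 0.4, 1.0, 9.0, 0.3]))
# The same mismatch on a reliable frame (small distance): large penalty.
print(weighted_dp_distance(ref, inp, [0.5, 0.4, 1.0, 0.0, 0.3]))
```

Running the function against each word standard pattern and choosing the smallest accumulated distance would give the word-level decision; how the patent normalizes or ranks these distances is not stated here.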

As described above, in this embodiment the phoneme of each frame is discriminated from the LPC cepstrum coefficients using linear discriminant functions, which require little computation. Because this alone leaves the certainty of the identified phoneme unknown, the Mahalanobis distance, which requires a large amount of computation, is calculated only for the phoneme judged to be the first candidate in order to obtain a phoneme distance. This allows the distance evaluation in phoneme-level DP matching to take the reliability of the phoneme identification into account, and as a result a speech recognition apparatus with a higher recognition rate than before can be realized.

[Effects of the invention]

As described above, the speech recognition apparatus according to the present invention comprises acoustic analysis means that analyzes the input speech frame by frame to obtain feature parameters, a first coefficient memory that stores sets of linear discriminant coefficients for phoneme identification, a discrimination calculation unit that performs phoneme discrimination, a second coefficient memory that stores, for each phoneme, a predetermined feature parameter mean vector and the inverse of the covariance matrix, a Mahalanobis distance calculation unit, a word standard pattern memory that stores the standard phoneme sequences of the words to be recognized, and a recognition unit that evaluates the distance over the whole word. With these, both phoneme identification and an evaluation of the reliability of that identification can be carried out with a small amount of computation, and as a result a higher recognition rate than before can be obtained.

[Brief description of drawings]

FIG. 1 is a block diagram of a speech recognition apparatus according to one embodiment of the present invention, and FIG. 2 is a block diagram of a conventional speech recognition apparatus.

1: acoustic analysis unit; 2: first coefficient memory; 3: discrimination calculation unit; 4: second coefficient memory; 5: Mahalanobis distance calculation unit; 6: word standard pattern memory; 7: recognition unit.

Claims

1. A speech recognition apparatus comprising: an acoustic analysis unit that analyzes input speech frame by frame to obtain feature parameters; a first coefficient memory that stores sets of linear discriminant coefficients for discriminating between arbitrary pairs of phonemes; a discrimination calculation unit that discriminates the phoneme of an arbitrary frame from the linear discriminant coefficients obtained from the first coefficient memory and the feature parameters of that frame obtained from the acoustic analysis unit, and outputs the sequence of discriminated phonemes; a second coefficient memory that stores, for each phoneme, a predetermined feature parameter mean vector and the inverse of the covariance matrix for calculating the Mahalanobis distance between an arbitrary frame and the standard phoneme; a Mahalanobis distance calculation unit; a word standard pattern memory that stores the standard phoneme sequences of words to be recognized; and a recognition unit that evaluates the distance over the whole word; wherein, in accordance with the phoneme sequence obtained frame by frame from the discrimination calculation unit, the feature parameter mean vector and the inverse covariance matrix of the corresponding phoneme are selected from the second coefficient memory and supplied as a first input to the Mahalanobis distance calculation unit, while the feature parameters for each frame obtained from the acoustic analysis unit are supplied as a second input to the Mahalanobis distance calculation unit; the Mahalanobis distance obtained from the first input and the second input arising in the same frame is output from the Mahalanobis distance calculation unit as a first candidate phoneme distance; and the recognition unit performs word recognition using the phoneme sequence obtained from the discrimination calculation unit, the first candidate phoneme distance obtained from the Mahalanobis distance calculation unit, and the standard phoneme sequences of words obtained from the word standard pattern memory.