JPH067353B2

JPH067353B2 - Voice recognizer

Info

Publication number: JPH067353B2
Application number: JP60003537A
Authority: JP
Inventors: 正宏浜田; 明寿山田
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1985-01-11
Filing date: 1985-01-11
Publication date: 1994-01-26
Anticipated expiration: 2009-01-26
Also published as: JPS61162100A

Description

[Field of industrial application]

The present invention relates to a speech recognition apparatus that performs acoustic analysis and phoneme identification frame by frame.

[Prior art]

In recent years many attempts have been made to improve the performance of speech recognition apparatuses, and apparatuses that use phoneme identification by linear discriminant functions are among those being studied.

One conventional technique, disclosed for example in Japanese Unexamined Patent Publication No. S59-131999, applies a statistical distance measure such as a distance based on the Bayes decision, the Mahalanobis distance, or a linear discriminant function to the feature parameters of speech, and uses that measure to recognize the input speech.

A conventional speech recognition apparatus of this kind is described below with reference to the drawings.

FIG. 2 shows a conventional speech recognition apparatus. In FIG. 2, 1 is an acoustic analysis unit, 13 is a coefficient memory, 4 is a discrimination calculation unit, 8 is a word standard pattern memory, and 9 is a recognition unit.

The operation of the speech recognition apparatus configured as above is described below.

The input speech is sent to the acoustic analysis unit 1, where it is analyzed at short intervals of about 5 to 30 msec (hereinafter called frames) and converted into feature parameters. Linear prediction (LPC) analysis is commonly used as the analysis method, and LPC cepstrum coefficients are commonly used as the feature parameters. These parameters are input to the discrimination calculation unit 4.

Meanwhile, the coefficient memory 13 stores in advance the various coefficients needed to calculate the statistical distance measure by the method described later, and these coefficients are also input to the discrimination calculation unit 4. The discrimination calculation unit 4 receives these two inputs, calculates the statistical distance measure, and identifies the phoneme of the frame in question. This processing is performed for every frame, and the resulting phoneme sequence is sent to the recognition unit 9. There, the overall similarity between the phoneme sequence and each standard pattern obtained from the word standard pattern memory 8 is evaluated, and the word standard pattern closest to the input speech is taken as the recognition result.

The statistical distance measures mentioned above are explained in the cited publication as follows.

Let the standard pattern for phoneme j be described by its mean, and let the parameter column vector of the input frame be given. The Bayes decision then corresponds to taking as the identification result the phoneme that maximizes equation (1), where n is the dimension of the vector and T denotes transposition.

The Mahalanobis distance is given by equation (2), and the phoneme that minimizes equation (2) is the identification result.

The linear discriminant function is given by equation (3). If the value of the left-hand side of equation (3) is positive, the unknown input belongs to phoneme u; if it is negative, it belongs to phoneme v. The coefficient vector in equation (3) is the linear discriminant coefficient column vector for discriminating between phoneme u and phoneme v, and b_u/v is the corresponding constant for the same pair.
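The equation images for (1) to (3) are not reproduced in this text. For orientation only, the textbook forms that fit the descriptions above can be sketched as follows. The symbols x (input frame vector), \mu_j (mean of phoneme j), \Sigma_j (covariance matrix of phoneme j) and a_{u/v} (discriminant coefficient vector) are assumed names and may differ from the notation of the cited publication; n, T, j, u, v and b_{u/v} are taken from the text.

```latex
% (1) Bayes decision, assuming Gaussian class densities:
%     choose the phoneme j that maximizes
P_j = (2\pi)^{-n/2}\,\lvert\Sigma_j\rvert^{-1/2}
      \exp\!\left\{-\tfrac{1}{2}\,(x-\mu_j)^{T}\,\Sigma_j^{-1}\,(x-\mu_j)\right\}

% (2) Mahalanobis distance: choose the phoneme j that minimizes
d_j^{2} = (x-\mu_j)^{T}\,\Sigma_j^{-1}\,(x-\mu_j)

% (3) Linear discriminant between phonemes u and v:
%     positive left-hand side -> u, negative -> v
a_{u/v}^{T}\,x + b_{u/v} > 0
```

In these forms, (1) and (2) both contain a quadratic form in an n-by-n matrix for each phoneme, whereas (3) needs only one n-term inner product per phoneme pair.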

[Problems to be solved by the invention]

The configuration described above has problems, however: the amount of computation needed to calculate the statistical distance measure is very large, or the reliability of the phoneme identification result is unknown, so that the subsequent word-level similarity evaluation lacks certainty. Specifically, when the measure of equation (1) or equation (2) is used, the matrix products in those expressions must be calculated for every phoneme j in every frame, and the amount of computation this requires is very large, as shown by equation (5), even when only multiplications are counted. An apparatus that implements this must therefore be fast and large in scale.

When the measure of equation (3) is used instead, the number of multiplications is small, as shown by equation (6).

However, the discrimination result only indicates which phoneme the unknown input frame belongs to. It gives no distance from the phoneme standard pattern, in other words no index of the reliability of the phoneme identification. Consequently, when the recognition unit 9 in FIG. 2 evaluates the overall phoneme similarity against the word standard pattern memory 8, frames whose phoneme identification is highly reliable and frames whose identification is unreliable are evaluated with the same weight, which lowers the final word recognition rate.

In view of the above problems, the present invention provides a speech recognition apparatus with a high recognition rate that satisfies two conflicting requirements: reducing the amount of computation, and having a phoneme identification function that can attach a reliability to its identification results.

[Means for solving the problems]

To solve the above problems, the speech recognition apparatus of the present invention comprises an acoustic analysis unit that analyzes input speech frame by frame, a parameter memory that stores feature parameters, a first coefficient memory that stores a set of linear discriminant coefficients, a discrimination calculation unit that performs phoneme discrimination of an arbitrary frame, a phoneme sequence memory that stores the phoneme discrimination results, a second coefficient memory that stores distance coefficients for calculating phoneme distances, a phoneme distance calculation unit, a word standard pattern memory that stores the standard phoneme sequences of the words to be recognized, and a recognition unit that evaluates similarity over the whole word.

[Operation]

With the above configuration, the present invention resolves the conventional problems through the following operation.

The input speech is converted into feature parameters by the acoustic analysis unit and stored frame by frame in the parameter memory. The feature parameters are also input to the discrimination calculation unit. The linear discriminant coefficients in the first coefficient memory are likewise input to the discrimination calculation unit, where the linear discriminant function is calculated against the feature parameters for each pair of phonemes, and the discrimination result is stored frame by frame in the phoneme sequence memory. The computation required for the linear discriminant function is small, as shown by equation (6).

Further, when the same phoneme appears consecutively in the phoneme sequence written to the phoneme sequence memory, one representative frame is chosen arbitrarily from that run, and the feature parameters with the same frame number are read from the parameter memory and input to the phoneme distance calculation unit.

Meanwhile, the phoneme information of the representative frame is input to the second coefficient memory, and the matching entry among the per-phoneme distance coefficients stored there is selected and input to the phoneme distance calculation unit. The phoneme distance calculation unit receives these two inputs and calculates the phoneme distance between them. In ordinary speech the same phoneme continues for a long time in the stationary part of a vowel, so such a section has only one representative frame, and the phoneme distance calculation unit no longer needs to run on every frame. The total amount of computation can therefore be expected to fall substantially compared with the conventional example.

Finally, the recognition unit receives the per-frame phoneme sequence from the phoneme sequence memory, the phoneme distance of each representative frame from the phoneme distance calculation unit, and the word standard patterns from the word standard pattern memory, evaluates the overall similarity over the whole word, and outputs the recognition result.

[Embodiment]

A speech recognition apparatus according to one embodiment of the present invention is described below with reference to the drawings. FIG. 1 shows one embodiment of the present invention. In FIG. 1, 1 is an acoustic analysis unit that analyzes input speech frame by frame, 2 is a parameter memory that stores the feature parameters obtained by the acoustic analysis unit 1, 3 is a first coefficient memory that stores a set of linear discriminant coefficients, 4 is a discrimination calculation unit that performs phoneme discrimination of an arbitrary frame, 5 is a phoneme sequence memory that stores the phoneme discrimination results, 6 is a second coefficient memory that stores a set of inverses of the per-phoneme covariance matrices, 7 is a Mahalanobis distance calculation unit, 8 is a word standard pattern memory that stores the standard phoneme sequences of the words to be recognized, and 9 is a recognition unit that evaluates similarity over the whole word.

The operation of the speech recognition apparatus configured as above is described below.

First, the input speech is analyzed by the acoustic analysis unit 1. Any conventional analysis method may be used; this embodiment uses linear prediction analysis. If the target speech is limited to the telephone band, it is desirable, in order both to minimize the amount of computation and to maximize recognition performance, to sample and quantize at 8 kHz and 12 bits and then perform 10th-order linear prediction analysis at a frame interval of 10 to 20 msec to obtain LPC cepstrum coefficients (C_i, i = 1, 2, ..., 10). LPC cepstrum coefficients are explained in detail in L. R. Rabiner and R. W. Schafer, "Digital Processing of Speech Signals". In summary, when the linear prediction model H(z) is given by equation (7), the LPC cepstrum coefficients C_i are given by equation (8).
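The images for equations (7) and (8) are not reproduced in this text. As a hedged reference, the all-pole model and the cepstrum recursion in the convention of the Rabiner and Schafer book cited above can be written as follows. The symbols G, \alpha_k, p and c_n are the textbook's names and an assumption here; the patent's own notation and sign convention may differ (c_n plays the role of C_i, with p = 10 in this embodiment).

```latex
% (7) All-pole linear prediction model of order p
H(z) = \frac{G}{1 - \sum_{k=1}^{p} \alpha_k z^{-k}}

% (8) LPC cepstrum coefficients obtained recursively from the
%     predictor coefficients, for 1 <= n <= p
c_n = \alpha_n + \sum_{k=1}^{n-1} \frac{k}{n}\, c_k\, \alpha_{n-k}
```

For a first-order model with \alpha_1 = a, the recursion gives c_n = a^n / n, which agrees with the power series of -\ln(1 - a z^{-1}).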

The feature parameters obtained in this way are stored frame by frame in the parameter memory 2 in FIG. 1 and are also input to the discrimination calculation unit 4.
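The analysis performed by the acoustic analysis unit 1 can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the autocorrelation method with Levinson-Durbin recursion, the Hamming window, the predictor sign convention s[n] ≈ Σ alpha[k]·s[n-k], and all function names are assumptions. Only the 10th-order analysis and the LPC cepstrum output come from the text.

```python
import math

def autocorrelation(frame, order):
    """r[k] = sum over i of frame[i] * frame[i + k], for k = 0..order."""
    n = len(frame)
    return [sum(frame[i] * frame[i + k] for i in range(n - k))
            for k in range(order + 1)]

def levinson_durbin(r, order):
    """Return (alpha[1..order], residual energy) for the predictor
    s[n] ~ sum_k alpha[k] * s[n - k] (assumed sign convention)."""
    alpha = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(alpha[j] * r[i - j] for j in range(1, i))
        k = acc / err                      # reflection coefficient
        new = alpha[:]
        new[i] = k
        for j in range(1, i):
            new[j] = alpha[j] - k * alpha[i - j]
        alpha = new
        err *= (1.0 - k * k)
    return alpha[1:], err

def lpc_to_cepstrum(alpha):
    """c_n = alpha_n + sum_{k=1}^{n-1} (k/n) * c_k * alpha_{n-k}."""
    p = len(alpha)
    c = [0.0] * (p + 1)
    for n in range(1, p + 1):
        c[n] = alpha[n - 1] + sum((k / n) * c[k] * alpha[n - k - 1]
                                  for k in range(1, n))
    return c[1:]

def analyze_frame(samples, order=10):
    """One frame of samples -> `order` LPC cepstrum coefficients."""
    n = len(samples)
    # Hamming window: an assumption, the source does not name a window.
    windowed = [s * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, s in enumerate(samples)]
    r = autocorrelation(windowed, order)
    if r[0] == 0.0:                        # silent frame: nothing to predict
        return [0.0] * order
    alpha, _ = levinson_durbin(r, order)
    return lpc_to_cepstrum(alpha)
```

At 8 kHz sampling, a 20 msec frame is 160 samples, so `analyze_frame` would be called once per 160-sample block and its output stored under the frame number.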

The discrimination calculation unit 4 identifies the phoneme of the input speech in each frame using the linear discriminant function and writes the results sequentially to the phoneme sequence memory 5. The linear discriminant function was explained in the prior art section of this specification and is not repeated here. The first coefficient memory 3 in the figure stores the linear discriminant coefficients for discriminating between arbitrary phoneme pairs, and these coefficients are read out as needed by the discrimination calculation unit 4.
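A per-frame identification built from pairwise linear discriminants might look like the sketch below. The source does not say how the pairwise decisions are combined into one phoneme label, so the knockout scheme used here is an assumption, as are the function names and the dictionary layout standing in for the first coefficient memory 3. A score of exactly zero, which the source does not cover, is assigned to v.

```python
def discriminate(x, a, b, u, v):
    """Pairwise decision between phonemes u and v.

    a[(u, v)] is the coefficient vector and b[(u, v)] the constant for the
    pair; a positive score selects u, otherwise v is selected.
    """
    score = sum(ai * xi for ai, xi in zip(a[(u, v)], x)) + b[(u, v)]
    return u if score > 0 else v

def identify_phoneme(x, phonemes, a, b):
    """Knockout over the phoneme list (hypothetical combination rule).

    The current winner always precedes the challenger in `phonemes`, so
    coefficients are needed only for pairs (p, q) with p listed before q.
    Uses len(phonemes) - 1 inner products per frame.
    """
    winner = phonemes[0]
    for challenger in phonemes[1:]:
        winner = discriminate(x, a, b, winner, challenger)
    return winner

# Toy 2-dimensional example with three phonemes centred at
# a = (0, 0), i = (4, 0), u = (0, 4).
phonemes = ["a", "i", "u"]
coef = {("a", "i"): [-4.0, 0.0], ("a", "u"): [0.0, -4.0], ("i", "u"): [4.0, -4.0]}
const = {("a", "i"): 8.0, ("a", "u"): 8.0, ("i", "u"): 0.0}
label = identify_phoneme([3.5, 0.5], phonemes, coef, const)
```

Each decision costs one inner product of the feature dimension, which is why this step stays cheap compared with a quadratic form per phoneme.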

When ordinary speech is phoneme-identified in frames of about 10 to 20 msec as in this embodiment, the same phoneme is usually output for several consecutive frames because the acoustic characteristics of vowel segments are stationary. In this embodiment, one representative frame is chosen arbitrarily from such a run, and the feature parameters with the same frame number are read from the parameter memory 2 and input to the Mahalanobis distance calculation unit 7.
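Choosing representative frames from the stored phoneme sequence can be sketched as follows. The text leaves the choice within a run arbitrary; taking the middle frame of each run is one possible choice, and treating a run of length one as its own representative is an assumption. The function name is hypothetical.

```python
from itertools import groupby

def representative_frames(phoneme_seq):
    """Return one (phoneme, first_frame, last_frame, representative_frame)
    tuple per run of identical consecutive labels."""
    reps = []
    start = 0
    for phoneme, run in groupby(phoneme_seq):
        length = sum(1 for _ in run)
        middle = start + (length - 1) // 2   # middle frame of the run
        reps.append((phoneme, start, start + length - 1, middle))
        start += length
    return reps

runs = representative_frames(list("aaaiiuuuu"))
```

For the nine-frame example, only three frames go on to the distance calculation instead of nine.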

Meanwhile, the phoneme information of the representative frame is read from the phoneme sequence memory 5 and input to the second coefficient memory 6, where the matching entry among the stored per-phoneme inverse matrices is selected and becomes the first input to the Mahalanobis distance calculation unit 7. The frame number of the representative frame is also input to the parameter memory 2, and the corresponding feature parameters become the second input to the Mahalanobis distance calculation unit 7. The Mahalanobis distance calculation unit 7 receives the first and second inputs, calculates the Mahalanobis distance according to equation (2), and outputs the result to the recognition unit 9. As stated above, only one representative frame is obtained for each run of consecutive identical phonemes, so the amount of calculation in the Mahalanobis distance calculation unit 7 is much smaller than in the conventional example.
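The data flow into the Mahalanobis distance calculation unit 7 can be illustrated with the sketch below. It assumes the standard quadratic form for the distance and assumes that the per-phoneme mean vectors are stored alongside the inverse covariance matrices; the text only mentions the inverse matrices. All names are hypothetical, and the squared distance is returned without taking a square root.

```python
def mahalanobis_sq(x, mean, inv_cov):
    """Squared Mahalanobis distance (x - mean)^T inv_cov (x - mean)."""
    d = [xi - mi for xi, mi in zip(x, mean)]
    n = len(d)
    return sum(d[i] * inv_cov[i][j] * d[j] for i in range(n) for j in range(n))

def representative_distances(reps, parameter_memory, means, inv_covs):
    """Compute the distance only at representative frames.

    reps             : list of (frame_number, phoneme) pairs
    parameter_memory : feature vectors indexed by frame number
    means, inv_covs  : per-phoneme mean vector and inverse covariance
    """
    return {frame: mahalanobis_sq(parameter_memory[frame],
                                  means[phoneme], inv_covs[phoneme])
            for frame, phoneme in reps}

# Toy example: two frames stored, one representative frame for phoneme "a".
params = [[1.0, 2.0], [0.9, 2.1]]
means = {"a": [0.0, 0.0]}
inv_covs = {"a": [[1.0, 0.0], [0.0, 0.25]]}
dist = representative_distances([(0, "a")], params, means, inv_covs)
```

Each call costs on the order of n² multiplications for an n-dimensional feature vector, so restricting it to one frame per run is where the saving comes from.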

Finally, the recognition unit 9 receives the per-frame phoneme sequence from the phoneme sequence memory 5, the Mahalanobis distance of each representative frame from the Mahalanobis distance calculation unit 7, and the word standard patterns from the word standard pattern memory 8, evaluates the overall similarity over the whole word, and outputs the recognition result. Various methods are conceivable for the overall similarity evaluation; the present invention adopts the following method as one embodiment. Phoneme-level DP matching between the word standard pattern and the phoneme sequence is performed frame by frame to accumulate the inter-word distance. When, on the matching path, the phoneme of the input speech and the phoneme of the standard pattern disagree at a representative frame position of the input speech, the distance that is accumulated carries a larger weight the smaller the Mahalanobis distance is. The smaller the Mahalanobis distance, the more reliable the phoneme identification result for that frame, so it is reasonable for a phoneme class mismatch there to contribute more to the increase in distance over the whole word.
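The weighted matching described above can be sketched as a small dynamic program. Several details are assumptions, since the text does not specify them: the path constraint (each input frame maps to one reference phoneme, and the reference index advances by 0 or 1 per frame), the unit mismatch cost, and the weight 1 + 1/(1 + d), which is just one decreasing function of the Mahalanobis distance d. The function name is hypothetical.

```python
def word_distance(frames, rep_dist, reference, base_cost=1.0):
    """Accumulated distance between per-frame phoneme labels and a word's
    standard phoneme sequence.

    frames    : phoneme label of every input frame
    rep_dist  : {frame_number: Mahalanobis distance} for representative frames
    reference : standard phoneme sequence of one word
    """
    inf = float("inf")
    n, m = len(frames), len(reference)

    def local(i, j):
        if frames[i] == reference[j]:
            return 0.0
        if i in rep_dist:
            # Smaller distance = more reliable label = heavier penalty.
            return base_cost * (1.0 + 1.0 / (1.0 + rep_dist[i]))
        return base_cost

    dp = [[inf] * m for _ in range(n)]
    dp[0][0] = local(0, 0)
    for i in range(1, n):
        for j in range(m):
            best = dp[i - 1][j]
            if j > 0:
                best = min(best, dp[i - 1][j - 1])
            if best < inf:
                dp[i][j] = best + local(i, j)
    return dp[n - 1][m - 1]

frames = list("aaiiu")
rep = {0: 0.5, 2: 1.0, 4: 0.2}   # representative frames of the three runs
same = word_distance(frames, rep, list("aiu"))
other = word_distance(frames, rep, list("aeu"))
```

The recognition result would then be the word whose standard pattern gives the smallest accumulated distance. With this path constraint the input must contain at least as many frames as the reference has phonemes, otherwise the function returns infinity.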

As described above, according to this embodiment, phoneme discrimination of each frame from the LPC cepstrum coefficients is performed with a linear discriminant function, which requires relatively little computation, while the computationally expensive Mahalanobis distance is calculated only for the representative frames chosen as phoneme centers, yielding a phoneme similarity that cannot be obtained from the linear discriminant function alone. As a result, global, qualitative phoneme sequence information covering the whole word and local, quantitative phoneme distance information at the representative frames can both be fed into a global, quantitative DP matching operation covering the whole word, and a speech recognition apparatus with a higher recognition rate than conventional apparatuses can be realized.

[Effects of the invention]

As described above, by providing an acoustic analysis unit that analyzes input speech frame by frame, a parameter memory that stores feature parameters, a first coefficient memory that stores a set of linear discriminant coefficients, a discrimination calculation unit that performs phoneme discrimination of an arbitrary frame, a phoneme sequence memory that stores the phoneme discrimination results, a second coefficient memory that stores distance coefficients for calculating phoneme distances, a phoneme distance calculation unit, a word standard pattern memory that stores the standard phoneme sequences of the words to be recognized, and a recognition unit that evaluates similarity over the whole word, the present invention can provide a speech recognition apparatus with a high recognition rate that satisfies the two conflicting requirements of reducing the amount of computation and having a phoneme identification function that can attach a reliability to its identification results.

[Brief description of drawings]

FIG. 1 is a block diagram of a speech recognition apparatus according to one embodiment of the present invention, and FIG. 2 is a block diagram of a conventional speech recognition apparatus.

1: acoustic analysis unit, 2: parameter memory, 3: first coefficient memory, 4: discrimination calculation unit, 5: phoneme sequence memory, 6: second coefficient memory, 7: Mahalanobis distance calculation unit, 8: word standard pattern memory, 9: recognition unit.

Claims

[Claims]

1. A speech recognition apparatus comprising: an acoustic analysis unit that analyzes input speech frame by frame; a parameter memory that stores the feature parameters obtained by the acoustic analysis unit; a first coefficient memory that stores a set of predetermined linear discriminant coefficients for performing phoneme discrimination between arbitrary phoneme pairs; a discrimination calculation unit that performs phoneme discrimination of an arbitrary frame using the feature parameters of that frame obtained from the parameter memory and the linear discriminant coefficients obtained from the first coefficient memory; a phoneme sequence memory that stores the discrimination results of the discrimination calculation unit; a second coefficient memory that stores predetermined per-phoneme distance coefficients for calculating the phoneme distance between an arbitrary frame and a standard phoneme; a phoneme distance calculation unit that calculates the phoneme distance between an arbitrary frame and a standard phoneme; a word standard pattern memory that stores the standard phoneme sequences of the words to be recognized; and a recognition unit that evaluates similarity over the whole word, wherein, when the same phoneme is written consecutively in the phoneme sequence memory, the feature parameters having the same frame number as an arbitrarily selected representative frame are selected from the parameter memory, the per-phoneme distance coefficient corresponding to the phoneme discriminated by the discrimination calculation unit is selected from the second coefficient memory, a representative phoneme distance for the representative frame is calculated from the two selected items, and recognition is performed using the phoneme sequence obtained from the phoneme sequence memory, the representative phoneme distance obtained from the phoneme distance calculation unit, and the standard phoneme sequences of words obtained from the word standard pattern memory.

2. The speech recognition apparatus according to claim 1, wherein the second coefficient memory is an inverse matrix memory that stores a set of inverses of per-phoneme covariance matrices predetermined for each phoneme in order to calculate the Mahalanobis distance between an arbitrary frame and a standard phoneme, and the phoneme distance calculation unit is a Mahalanobis distance calculation unit.