JPH0344320B2

Info

Publication number: JPH0344320B2
Application number: JP58175304A
Authority: JP
Inventors: Satoshi Fujii; Katsuyuki Futayada
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1983-09-22
Filing date: 1983-09-22
Publication date: 1991-07-05
Also published as: JPS6067996A

Description

DETAILED DESCRIPTION OF THE INVENTION

Field of the Invention

The present invention relates to a speech recognition method for automatically recognizing speech signals uttered by a human.

Conventional Configuration and Its Problems

A speech recognition device that recognizes speech automatically is considered a very effective means of giving data and commands from a person to computers and other machines.

Most speech recognition devices studied or announced to date operate on the pattern matching method. In this method, standard patterns are stored in advance for every word to be recognized. An unknown input pattern is compared with each of them to compute a degree of match (hereinafter called similarity), and the input is judged to be the word whose standard pattern gives the maximum match.

Because pattern matching requires a standard pattern for every word to be recognized, new standard patterns must be input and stored whenever the speaker changes. The principle is simple and the method is effective for small vocabularies. When several hundred or more words are to be recognized, however, uttering and registering every word takes time and effort, and the memory required for registration is expected to become enormous.

A further drawback is that the time required to match the input pattern against the standard patterns grows with the number of words.
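The word-level pattern matching described above can be sketched as follows. This is only an illustration: the negated Euclidean distance used as the similarity, the function names, and the three-dimensional toy patterns are assumptions, not details given in the text.

```python
import math


def euclidean_similarity(a, b):
    """Negated Euclidean distance, so a larger value means a closer match."""
    return -math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def recognize_word(input_pattern, standard_patterns):
    """Return the word whose stored standard pattern best matches the input."""
    return max(
        standard_patterns,
        key=lambda word: euclidean_similarity(input_pattern, standard_patterns[word]),
    )


# Hypothetical toy patterns; real patterns would be spectral time series.
standard_patterns = {
    "AKAI": [1.0, 0.2, 0.1],
    "AOI": [0.1, 0.9, 0.3],
}
best = recognize_word([0.9, 0.3, 0.1], standard_patterns)
```

Every stored word is visited once per input, which is why both the matching time and the memory grow with the vocabulary size.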

In contrast, a method that divides the input speech into phoneme units, recognizes it as a combination of phonemes (hereinafter called phoneme recognition), and computes its similarity to a word dictionary written in phoneme units has several advantages: the word dictionary needs far less memory, pattern matching takes less time, and the dictionary contents are easy to change. For example, the utterance "akai" (red) can be expressed in the very simple form AKAI by combining the three phonemes /a/, /k/, and /i/, so speech from unspecified speakers over a large vocabulary is easy to handle.

FIG. 1 is a block diagram of a speech recognition method based on phoneme recognition. Speech input through a microphone or the like is analyzed by the acoustic analysis unit 1. The analysis uses a bank of band-pass filters or linear predictive analysis to obtain spectral information every frame period (about 10 ms). The phoneme discrimination unit 2 uses the spectral information from the acoustic analysis unit 1 and the data in the standard pattern storage unit 3 to discriminate the phoneme in each frame. The standard patterns stored in the standard pattern storage unit 3 are obtained in advance for each phoneme from the speech of many speakers. The segmentation unit 4 uses the output of the acoustic analysis unit 1 to detect speech intervals and to determine the boundaries between phonemes (hereinafter called segmentation). The phoneme recognition unit 5 uses the results of the segmentation unit 4 and the phoneme discrimination unit 2 to decide which phoneme each phoneme interval contains, which yields a phoneme sequence. The word recognition unit 6 matches this phoneme sequence against the word dictionary 7, which is likewise written as phoneme sequences, and outputs the word with the highest similarity as the recognition result.
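The final dictionary lookup can be sketched as below. The text does not say how the similarity between a recognized phoneme sequence and a dictionary entry is computed, so the Levenshtein edit distance used here is an assumption, as are the function names and the small dictionary (its words are taken from examples elsewhere in this description).

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def recognize_word(phonemes, dictionary):
    """Return the dictionary word whose phoneme sequence is closest to the input."""
    return min(dictionary, key=lambda word: edit_distance(phonemes, dictionary[word]))


# Hypothetical phoneme-level word dictionary.
dictionary = {
    "AKAI": ["a", "k", "a", "i"],
    "OOSAMA": ["o", "o", "s", "a", "m", "a"],
    "YASUMONO": ["j", "a", "s", "u", "m", "o", "n", "o"],
}

# One phoneme was misrecognized (/e/ instead of /a/); the lookup still finds AKAI.
result = recognize_word(["a", "k", "e", "i"], dictionary)
```

Because each entry is only a short symbol sequence, adding or changing a word means editing one dictionary line rather than recording new speech.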

When this method is applied to unspecified speakers, the most important point is to obtain high speech recognition accuracy stably for any speaker and any environment. Achieving this must not place an excessive burden on the speaker or require expensive components when the method is built into a speech recognition device.

However, the speech recognition devices announced or prototyped so far do not adequately satisfy these conditions. One conventional example is a method based on prediction residuals (see Shikano and Kohda, "Evaluation of LPC Distance Measures for Vowel Recognition in Conversational Speech," Transactions of the Institute of Electronics and Communication Engineers, 80/5, Vol. J63-D, No. 5). In this method, the maximum likelihood parameters $A_{ij}$ (j = 1, 2, ..., p, where p is the analysis order) of phoneme i are obtained in advance by linear predictive analysis of the speech of many speakers, and the prediction residual is computed by the following equation.

$$N_i = \sum_{j=1}^{p} A_{ij} S_j$$

where $S_j$ is the autocorrelation coefficient obtained from the unknown input speech. The prediction residual $N_i$ is computed for each candidate phoneme and used as a distance measure; the phoneme with the smallest $N_i$ is taken as the discrimination result.
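A minimal sketch of this residual-based discrimination follows. The parameter values and the names `prediction_residual` and `classify_frame` are hypothetical; only the sum over j and the choice of the smallest residual come from the text.

```python
def prediction_residual(A_i, S):
    """N_i = sum over j of A_ij * S_j for one phoneme i."""
    return sum(a * s for a, s in zip(A_i, S))


def classify_frame(A, S):
    """Return the phoneme with the smallest prediction residual, plus all residuals."""
    residuals = {phoneme: prediction_residual(A_i, S) for phoneme, A_i in A.items()}
    return min(residuals, key=residuals.get), residuals


# Hypothetical maximum likelihood parameters (analysis order p = 2) for two phonemes.
A = {
    "a": [1.0, -0.5],
    "i": [1.0, 0.4],
}
S = [1.0, 0.8]  # hypothetical autocorrelation coefficients of the input frame

label, residuals = classify_frame(A, S)
```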

However, because the maximum likelihood parameters $A_{ij}$, which correspond to the standard pattern of a phoneme, are merely averages, this method cannot cope with the variations in utterance caused by coarticulation, even if a learning function is provided that rebuilds $A_{ij}$ for the user. Its recognition rate is therefore low.

In addition, phonemes such as vowels and semivowels are discriminated with frame-by-frame standard patterns, and segmentation and phoneme recognition are performed on combinations of those discrimination results. Temporal movement is therefore not captured adequately, and the recognition rate does not improve.

Object of the Invention

An object of the present invention is to eliminate the above drawbacks and to provide a speech recognition method that handles unspecified speakers and obtains high speech recognition accuracy stably, unaffected by differences in speakers or words.

Structure of the Invention

To achieve this object, the present invention provides a speech recognition method in which a standard pattern consisting of LPC cepstral coefficients of a plurality of frames, expressing the temporal movement within a phoneme, is created for each phoneme from the speech of many speakers, and speech recognition is performed using a similarity or phoneme sequence obtained on the basis of a statistical distance measure from the standard patterns and the LPC cepstral coefficients of a plurality of frames of unknown speech.

Description of Embodiments

The speech recognition method of the present invention exploits the fact that the time-frequency characteristics of the spectrum differ between vowels and semivowels, and between one vowel and another.

The recognition of vowels and semivowels is described below as an example.

FIG. 2A shows the spectrum of the vowel /a/ in the utterance OOSAMA, and FIG. 2B shows the spectrum of the transition from the semivowel /j/ to the vowel /a/ in the /ja/ portion of the utterance YASUMONO. The vertical axis represents frequency, and the horizontal axis represents time in frames of 10 ms. The horizontal axis also represents spectral intensity. Peaks toward the right of each spectrum are formants, and the movement of the formants is shown by broken lines.

Comparing A and B, a vowel and a semivowel differ in the movement of the formants from the beginning of the phoneme toward the center of the vowel (here /a/; the center is marked in the figure). The time to the phoneme center is 5 frames for /a/ but 10 frames for /ja/. In addition, the formant of /a/ lies between 500 and 1000 Hz, whereas that of /ja/ is spread widely between 250 and 1000 Hz.

The same tendency holds between vowels.

For example, /a/ and /o/ differ in how the spectrum changes on the way toward the center and on the way away from it.

Focusing on this phenomenon, the present invention constructs the standard patterns of vowels and semivowels from the frequency spectra of a plurality of frames, and thereby aims to distinguish phonemes such as /a/, /i/, /u/, /e/, /o/, /j/, and /w/ from one another more accurately than the conventional method, which distinguishes phonemes frame by frame.

To obtain high speech recognition accuracy, the distance measure used in the present invention is preferably a statistical distance measure such as a distance based on Bayes decision, the Mahalanobis generalized distance, or a linear discriminant function.

To reduce the amount of computation, it is desirable to use, for example, the Mahalanobis generalized distance expanded into a linear discriminant function (called the simplified Mahalanobis distance). The following embodiment uses the simplified Mahalanobis distance as an example. The Mahalanobis generalized distance requires matrix operations, but when the variances of the target phonemes do not differ greatly, a common covariance matrix can be used for all phonemes, and the distance can be expanded into the simplified Mahalanobis distance, which requires little computation.

An embodiment of the speech recognition method according to the present invention, based on the above idea, is described with reference to FIG. 3.

First, speech corresponding to phoneme i from many speakers is input to create the standard pattern. Block 11 obtains, as spectral information, a two-dimensional array of LPC cepstral coefficients covering n frames at analysis order p.

$$\mathbf{C}_i = \begin{pmatrix} C_{i11} & C_{i12} & \cdots & C_{i1p} \\ C_{i21} & C_{i22} & \cdots & C_{i2p} \\ \vdots & \vdots & & \vdots \\ C_{in1} & C_{in2} & \cdots & C_{inp} \end{pmatrix}$$

Block 12 converts this array into a vector $\mathbf{C}_i$ of dimension M = n × p:

$$\mathbf{C}_i = (C_{i11}, C_{i12}, \ldots, C_{i1p}, C_{i21}, \ldots, C_{i2p}, \ldots, C_{in1}, \ldots, C_{inp})$$

Block 13 uses $\mathbf{C}_i$ to create the standard pattern of phoneme i. These steps are carried out for each phoneme.
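The rearrangement in block 12 amounts to concatenating the frames in order. A minimal sketch, with a hypothetical function name and made-up coefficient values:

```python
def flatten_frames(frames):
    """Concatenate n frames of p coefficients each into one vector of M = n * p values."""
    p = len(frames[0])
    if any(len(frame) != p for frame in frames):
        raise ValueError("every frame must hold the same number of coefficients")
    return [coefficient for frame in frames for coefficient in frame]


# Hypothetical example with n = 2 frames and analysis order p = 3.
frames = [
    [0.5, -0.1, 0.03],   # frame 1
    [0.4, -0.2, 0.01],   # frame 2
]
vector = flatten_frames(frames)
```

Keeping the frame order inside the vector is what lets a single pattern carry the temporal movement within the phoneme.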

If learning is needed, block 14 then learns the user's speech and corrects the standard patterns. The learning step is included only as required.

The unknown speech, meanwhile, is analyzed in block 15 to obtain a two-dimensional array $\mathbf{X}$ of LPC cepstral coefficients covering n frames at analysis order p.

$$\mathbf{X} = \begin{pmatrix} C_{11} & C_{12} & \cdots & C_{1p} \\ C_{21} & C_{22} & \cdots & C_{2p} \\ \vdots & \vdots & & \vdots \\ C_{n1} & C_{n2} & \cdots & C_{np} \end{pmatrix}$$

Block 16 converts this array into an M-dimensional vector $\mathbf{x}$:

$$\mathbf{x} = (C_{11}, C_{12}, \ldots, C_{1p}, C_{21}, \ldots, C_{2p}, \ldots, C_{n1}, \ldots, C_{np})$$

Using $\mathbf{x}$ and the standard patterns, block 17 computes the similarity to each phoneme with the simplified Mahalanobis distance and performs discrimination.

Based on the above idea, FIG. 4 shows, as an example, the procedure for creating the standard patterns of vowels and semivowels when learning is used. The input speech corresponding to phoneme i is subjected to linear predictive analysis in block 21, and a two-dimensional pattern is formed from the LPC cepstral coefficients of the n frames to be used as the standard pattern, taking $C_{ij1}$ to $C_{ijp}$ along the frequency axis. Here i denotes the phoneme type, j the frame number, and p the analysis order. Block 22 then rearranges the parameters into the vector

$$\mathbf{C}_i = (C_{i11}, C_{i12}, \ldots, C_{i1p}, C_{i21}, \ldots, C_{in1}, \ldots, C_{inp})$$

Block 23 then accumulates the vectors $\mathbf{C}_i$ from many utterances and takes their mean as $\mathbf{m}_i$ (an M-dimensional vector). The covariance matrix obtained in block 24 is common to all phoneme types and is denoted $\mathbf{\Sigma}$. Block 25 computes its inverse $\mathbf{\Sigma}^{-1}$, and block 26 takes the (j, j′) element of the inverse as $\sigma_{jj'}$. Block 27 then gives the discriminant coefficient $a_{ij}$ for the j-th parameter of phoneme i as

$$a_{ij} = 2\sum_{j'=1}^{M} \sigma_{jj'}\, m_{ij'} \tag{1}$$

where $m_{ij'}$ is the j′-th component of $\mathbf{m}_i$.

The constant $d_i$, which is determined by the phoneme, is obtained in block 28 as

$$d_i = \mathbf{m}_i^{t}\, \mathbf{\Sigma}^{-1}\, \mathbf{m}_i \tag{2}$$

where t denotes the transpose.

The values $a_{ij}$ and $d_i$ obtained above are stored as the phoneme standard patterns in the coefficient memory shown in block 29.
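Equations (1) and (2) can be sketched in plain Python as below. The sketch assumes the inverse of the shared covariance matrix is already available (the output of block 25); the function names and the two-dimensional toy values are hypothetical.

```python
def discriminant_coefficients(mean, sigma_inv):
    """Equation (1): a_j = 2 * sum over j' of sigma_inv[j][j'] * mean[j']."""
    M = len(mean)
    return [2.0 * sum(sigma_inv[j][k] * mean[k] for k in range(M)) for j in range(M)]


def phoneme_constant(mean, sigma_inv):
    """Equation (2): d = mean^t * sigma_inv * mean."""
    M = len(mean)
    return sum(mean[j] * sigma_inv[j][k] * mean[k] for j in range(M) for k in range(M))


# Hypothetical M = 2 example: mean vector of one phoneme and the shared inverse covariance.
mean_i = [1.0, 2.0]
sigma_inv = [
    [2.0, -0.5],
    [-0.5, 1.0],
]

a_i = discriminant_coefficients(mean_i, sigma_inv)
d_i = phoneme_constant(mean_i, sigma_inv)
```

Only the M coefficients and one constant per phoneme need to be kept for recognition; the matrix itself is needed again only when the patterns are re-derived.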

The mean $\mathbf{m}_i$ obtained in block 23 and the inverse matrix $\mathbf{\Sigma}^{-1}$ obtained in block 25 are stored in the learning unit shown in block 30 for use in learning.

The standard patterns created in this way are corrected by learning. The similarity between the unknown input speech and the corrected standard patterns is then computed with the simplified Mahalanobis distance, and phoneme discrimination is performed.

The Mahalanobis distance $D_i^2$ of the input parameter vector $\mathbf{x} = (x_1, x_2, \ldots, x_M)$ from the distribution of phoneme i is

$$D_i^2 = \mathbf{x}^{t}\, \mathbf{\Sigma}^{-1}\, \mathbf{x} - \sum_{j=1}^{M} a_{ij} x_j + \mathbf{m}_i^{t}\, \mathbf{\Sigma}^{-1}\, \mathbf{m}_i \tag{3}$$

Here t denotes the transpose.
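The step from the usual quadratic form of the Mahalanobis distance to equation (3) can be written out as follows. This is a sketch that assumes $\mathbf{\Sigma}^{-1}$ is symmetric (as the inverse of a covariance matrix is) and writes $\mathbf{m}_i$ for the mean vector of phoneme i and $\sigma_{jj'}$ for the elements of $\mathbf{\Sigma}^{-1}$; the bold symbols are a reconstruction of glyphs lost in this copy of the text.

$$\begin{aligned}
D_i^2 &= (\mathbf{x} - \mathbf{m}_i)^{t}\, \mathbf{\Sigma}^{-1}\, (\mathbf{x} - \mathbf{m}_i) \\
&= \mathbf{x}^{t}\, \mathbf{\Sigma}^{-1}\, \mathbf{x} - 2\, \mathbf{m}_i^{t}\, \mathbf{\Sigma}^{-1}\, \mathbf{x} + \mathbf{m}_i^{t}\, \mathbf{\Sigma}^{-1}\, \mathbf{m}_i \\
&= \mathbf{x}^{t}\, \mathbf{\Sigma}^{-1}\, \mathbf{x} - \sum_{j=1}^{M} \Bigl( 2 \sum_{j'=1}^{M} \sigma_{jj'}\, m_{ij'} \Bigr) x_j + \mathbf{m}_i^{t}\, \mathbf{\Sigma}^{-1}\, \mathbf{m}_i \\
&= \mathbf{x}^{t}\, \mathbf{\Sigma}^{-1}\, \mathbf{x} - \sum_{j=1}^{M} a_{ij} x_j + \mathbf{m}_i^{t}\, \mathbf{\Sigma}^{-1}\, \mathbf{m}_i
\end{aligned}$$

The last line substitutes equation (1), which is where the factor of 2 in the discriminant coefficient comes from.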

Because the first term of equation (3) does not depend on the phoneme type, the similarity $L_i$ can be written in simplified form as

$$L_i = \sum_{j=1}^{M} a_{ij} x_j - \mathbf{m}_i^{t}\, \mathbf{\Sigma}^{-1}\, \mathbf{m}_i \tag{4}$$

The second term of equation (4) is a constant determined by the phoneme type. Writing it as $d_i$ according to equation (2), the similarity becomes

$$L_i = \sum_{j=1}^{M} a_{ij} x_j - d_i \tag{5}$$

where $a_{ij}$ and $d_i$ are the values already obtained as the standard pattern.

Equation (5) is called the simplified Mahalanobis distance.
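A minimal sketch of discrimination with equation (5) follows. The function names and the toy standard patterns are hypothetical; the patterns correspond to two made-up phoneme means, (1, 0) and (0, 1), with an identity inverse covariance, so that the result can be checked by hand.

```python
def similarity(a_i, d_i, x):
    """Equation (5): L_i = sum over j of a_ij * x_j, minus d_i."""
    return sum(a * xj for a, xj in zip(a_i, x)) - d_i


def classify(patterns, x):
    """Return the phoneme with the largest similarity, plus all similarities."""
    scores = {phoneme: similarity(a, d, x) for phoneme, (a, d) in patterns.items()}
    return max(scores, key=scores.get), scores


# Hypothetical standard patterns (a_i, d_i) for two phonemes.
patterns = {
    "a": ([2.0, 0.0], 1.0),
    "o": ([0.0, 2.0], 1.0),
}
x = [0.9, 0.2]  # hypothetical input vector

label, scores = classify(patterns, x)
```

Since $L_i$ equals the phoneme-independent first term of equation (3) minus $D_i^2$, the phoneme with the largest $L_i$ is also the one with the smallest Mahalanobis distance, and each score costs only M multiplications and additions.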

The learning and recognition procedures are described with reference to the block diagram of the device in FIG. 5. First, the standard patterns $a_{ij}$ and $d_i$ created by the procedure of FIG. 4 are stored in the coefficient memory 31, and $\mathbf{m}_i$ and $\mathbf{\Sigma}^{-1}$ are stored in the learning unit 32.

Correction of the standard patterns by learning is described next.

The user utters vowels and semivowels into the microphone 33. The signal is converted by the A/D converter 34, and the signal processing circuit 35 applies a Hamming window and pre-emphasis. The linear predictive analysis processor 36 computes the LPC cepstral coefficients, rearranges them into the vector $\mathbf{x}_i$, and transfers it to the learning unit 32. Parameter information from the band-pass filter 39 may also be used here as required. The vector $\mathbf{x}_i$ is also transferred to the similarity calculation unit 37, which computes the similarity $L_i$ of equation (5) using the standard patterns in the coefficient memory 31.

This similarity is transferred to the main memory 38. Meanwhile, the band-pass filter 39 computes the parameters for segmentation (band power and total power) and transfers them to the main memory 38. From the results of the similarity calculation unit 37 and the band-pass filter 39, the main processor 40 determines the position on the time axis to be learned and specifies it to the learning unit 32 through the output unit 41. The learning unit 32 adapts the standard patterns to the speaker using the stored mean $\mathbf{m}_i$ and inverse covariance matrix $\mathbf{\Sigma}^{-1}$, as follows. The adapted mean $\mathbf{m}'_i$ is

$$\mathbf{m}'_i = \frac{\alpha\, \mathbf{m}_i + \mathbf{x}_i}{\alpha + 1} \tag{6}$$

where $\mathbf{x}_i$ is the vector obtained from the user's utterance of phoneme i and $\alpha$ is a weighting coefficient.

Using $\mathbf{m}'_i$, the adapted coefficients $a'_{ij}$ follow from equation (1):

$$a'_{ij} = 2\sum_{j'=1}^{M} \sigma_{jj'}\, m'_{ij'} \tag{7}$$

where $m'_{ij'}$ is the j′-th component of $\mathbf{m}'_i$.

Likewise, the adapted constant $d'_i$ follows from equation (2):

$$d'_i = {\mathbf{m}'_i}^{t}\, \mathbf{\Sigma}^{-1}\, \mathbf{m}'_i \tag{8}$$

The coefficient memory 31 is rewritten with $a'_{ij}$ and $d'_i$ as the speaker-adapted standard patterns.
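The speaker adaptation of equations (6) to (8) can be sketched as below. Reading the second term in the numerator of equation (6) as the user's own vector for phoneme i is an interpretation of the garbled original, and the function names and toy values are hypothetical.

```python
def adapt_mean(mean, x, alpha):
    """Equation (6): weighted average of the stored mean and the user's vector."""
    return [(alpha * m + xj) / (alpha + 1.0) for m, xj in zip(mean, x)]


def adapt_pattern(mean, x, alpha, sigma_inv):
    """Recompute the mean, coefficients a' (eq. 7) and constant d' (eq. 8) for one phoneme."""
    new_mean = adapt_mean(mean, x, alpha)
    M = len(new_mean)
    a = [2.0 * sum(sigma_inv[j][k] * new_mean[k] for k in range(M)) for j in range(M)]
    d = sum(new_mean[j] * sigma_inv[j][k] * new_mean[k] for j in range(M) for k in range(M))
    return new_mean, a, d


# Hypothetical M = 2 example with an identity inverse covariance matrix.
stored_mean = [1.0, 0.0]
user_vector = [2.0, 1.0]
sigma_inv = [[1.0, 0.0], [0.0, 1.0]]

new_mean, a_new, d_new = adapt_pattern(stored_mean, user_vector, alpha=3.0, sigma_inv=sigma_inv)
```

A larger weighting coefficient keeps the adapted mean closer to the multi-speaker mean; the shared inverse covariance matrix is reused unchanged, so adaptation needs no matrix inversion.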

This completes learning. Actual speech recognition then proceeds as follows.

The input speech passes through the A/D converter 34 and the signal processing circuit 35 and is subjected to linear predictive analysis in the linear predictive analysis processor 36 to obtain p LPC cepstral coefficients per frame. The coefficients of n frames are rearranged into the M-dimensional input vector $\mathbf{x}$. Using $\mathbf{x}$ and the corrected standard patterns $a'_{ij}$ and $d'_i$ stored in the coefficient memory 31, the discrimination filter 37 computes the similarity $L_i$ by the following equation.

$$L_i = \sum_{j=1}^{M} a'_{ij} x_j - d'_i \tag{9}$$

where $x_j$ is the j-th component of the input vector $\mathbf{x}$.

$L_i$ and the output of the band-pass filter 39 are transferred to the main memory 38. From these data the main processor 40 performs speech interval detection, segmentation, and phoneme recognition, and creates a phoneme sequence. This sequence is matched against the word dictionary memory 42, which is likewise written as phoneme sequences, and the name of the word with the highest similarity is output to the output unit 41 as the recognition result.

As described above, in a speech recognition method based on phoneme recognition, the method of this embodiment constructs the standard pattern of each phoneme from the frequency spectra of a plurality of frames, so that the temporal movement within a phoneme is fully taken into account. It further creates the phoneme standard patterns automatically through learning and adapts them to the speaker, which gives high speech recognition performance. Moreover, because the Mahalanobis generalized distance is used as the distance measure and then simplified, the computations for phoneme similarity and learning are simple and can be implemented without arithmetic circuits of high precision.

FIG. 6 compares, for ten adult female speakers, the recognition rates of vowels and nasals obtained with conventional frame-by-frame phoneme discrimination without learning (51) and with learning (52) against those obtained with the method of this embodiment (53). Frame-by-frame discrimination improves considerably with learning, but not as much as the present embodiment.

Constructing the standard pattern from the frequency spectra of a plurality of frames in this way gives a marked improvement for every speaker. The average error rate is 6.7%, which is 70% of the error rate for frame-by-frame discrimination with learning (52).
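If the 70% figure is read as a ratio of average error rates, the two numbers together imply the average error rate of case 52. The value below is derived arithmetic under that reading, not a figure stated in the text:

$$e_{52} \approx \frac{6.7\%}{0.70} \approx 9.6\%$$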

FIG. 7 compares the recognition rates of semivowels for conventional frame-by-frame discrimination with learning (61) and the method of this embodiment (62). The conventional method recognized only 68.5% on average, whereas the present embodiment reaches 84.4%. The recognition rate improves by 15.9 percentage points, and the error rate is reduced to about one half.
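The two summary figures for semivowels can be checked directly from the reported recognition rates. The helper name below is hypothetical, and the error rate is assumed to be simply 100% minus the recognition rate.

```python
def error_rate(recognition_rate_pct):
    """Error rate in percent, assuming every utterance is either recognized or an error."""
    return 100.0 - recognition_rate_pct


conventional = 68.5  # frame-by-frame discrimination with learning (61)
proposed = 84.4      # method of this embodiment (62)

gain_points = proposed - conventional                           # about 15.9 percentage points
error_ratio = error_rate(proposed) / error_rate(conventional)   # about 0.495
```

The error rate falls from 31.5% to 15.6%, which is the "about one half" stated above.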

In summary, this embodiment has the following features.

(1) Constructing the standard pattern of each phoneme from the spectra of a plurality of frames, or from similar information, yields a high phoneme recognition rate.

(2) Creating the phoneme standard patterns automatically through learning and adapting them to the speaker allows accurate phoneme recognition even for speakers who could not previously be recognized (for example, speakers YM and YI in FIG. 6).

(3) The effects of (1) and (2) make it possible to build a high-performance speech recognition device, and a high word recognition rate can be expected.

Although the above embodiment was described for the case with learning, the feature of the present invention lies in constructing standard patterns from the LPC cepstral coefficients of a plurality of frames that express the temporal movement within a phoneme, and in computing the similarity on the basis of a statistical distance measure. A good phoneme recognition rate can therefore be obtained even without learning.

Effects of the Invention

In short, the present invention provides a speech recognition method in which a standard pattern consisting of LPC cepstral coefficients of a plurality of frames, expressing the temporal movement within a phoneme, is created for each phoneme from the speech of many speakers, and speech recognition is performed using a similarity or phoneme sequence obtained on the basis of a statistical distance measure from the standard patterns and the LPC cepstral coefficients of a plurality of frames of unknown speech. It achieves an extremely high phoneme recognition rate and makes it possible to realize a speech recognition device with excellent performance.

[Brief explanation of drawings]

FIG. 1 is a block diagram of a conventional speech recognition method based on phoneme recognition. FIG. 2 shows examples of spectral patterns relevant to the method of the present invention. FIG. 3 is a block diagram of a speech recognition method according to an embodiment of the present invention. FIG. 4 is a block diagram of the method of creating the standard patterns of the present invention. FIG. 5 is a block diagram of an example configuration of a speech recognition device embodying the speech recognition method of the present invention. FIGS. 6 and 7 show the effect of this embodiment as speech recognition rates for each speaker.

31: coefficient memory; 32: learning unit; 36: linear predictive analysis processor; 37: similarity calculation unit; 38: main memory; 39: band-pass filter; 40: main processor; 41: output unit.

Claims

1. A speech recognition method characterized in that a standard pattern consisting of LPC cepstral coefficients of a plurality of frames expressing the temporal movement within a phoneme is created for each phoneme from the speech of many speakers, and speech recognition is performed using a similarity or a phoneme sequence obtained on the basis of a statistical distance measure from the standard pattern and the LPC cepstral coefficients of a plurality of frames of unknown speech.