JPH035595B2


Info

Publication number: JPH035595B2
Application number: JP55038747A
Authority: JP
Inventors: Atsuo Tanaka; Shinji Kanehara; Kazumi Yamashita
Original assignee: Computer Basic Technology Research Association Corp
Current assignee: Computer Basic Technology Research Association Corp
Priority date: 1980-03-25
Filing date: 1980-03-25
Publication date: 1991-01-25
Also published as: JPS56133800A

Description

[Detailed Description of the Invention]

The present invention relates to a speech recognition method that uses the dynamic features of speech.

In speech recognition, pattern matching that exploits the dynamic features of speech has conventionally been done mostly with dynamic programming. Because the speaking rate of an input pattern usually differs from that of the standard pattern, this approach first applies time warping, stretching or compressing the input pattern so as to maximize the similarity between the two patterns, and only then computes the similarity between them.
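The conventional dynamic-programming matching described above can be sketched as follows. This is a minimal illustration, not the method of any particular prior system: the function name `dtw_distance`, the Euclidean local distance, and the three allowed step directions are all assumptions, and the sketch minimizes an accumulated distance rather than maximizing a similarity.

```python
import numpy as np

def dtw_distance(x, y):
    """Accumulated distance between two feature-vector sequences
    under the best time warp. x has shape (n, d), y has shape (m, d)."""
    n, m = len(x), len(y)
    # local[i, j]: Euclidean distance between frame i of x and frame j of y
    local = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # extend the cheapest of the three admissible predecessor paths
            acc[i, j] = local[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1]
            )
    return acc[n, m]
```

A slowly spoken copy of a pattern (frames repeated) then has zero distance to the original, which is the property the time warping is meant to provide.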

The present invention aims to further improve recognition methods that use the dynamic features of speech. Specifically, it exploits those dynamic features through a dynamic model of the feature vectors and performs optimal estimation with an evaluation function having prescribed properties. This reduces the influence of fluctuations in the feature-vector values obtained directly from the input speech, increases the separation between speech categories, and makes matching against the standard patterns easier. The invention is described in detail below with reference to an embodiment.

Before the input speech is matched against the standard patterns, fluctuations in the feature-vector values are damped as follows. The temporal change of the input speech spectrum is expressed as dynamics (equation (1)). From this representation, an evaluation function known as a Kalman filter estimates the optimal spectrum for each frame, and pattern matching is carried out with the estimated spectrum in place of the measured one. The principle for obtaining the optimal estimate from the input speech is explained first.

As in the conventional method, the measured input speech is time-warped beforehand so that it can be matched against the standard pattern.

Speech applied to a speech input unit such as a microphone is passed to an analysis unit, where it is sampled at predetermined intervals and converted into a digital signal to form a feature pattern of the speech. A pattern here is a time series of feature vectors and can in general be expressed as a time series in a multidimensional space. For example, the output obtained by frequency analysis of the input speech waveform through bandpass filters or the like can be written as a vector whose elements are the output values of the individual frequency bands; this is the feature vector. If the frequency range is divided into $n$ bands ($n = 24$ in this embodiment), the feature vector is $n$-dimensional and the pattern is a time series $\mathbf{X}_1 \mathbf{X}_2 \ldots \mathbf{X}_k$ of $n$-dimensional feature vectors. In the following, bold type denotes vectors.

The feature vector at frame $i$ of a pattern can be expressed as a temporal change relative to frame $i-1$ by the following difference equation.

$$\mathbf{X}_i = A_{i-1}\mathbf{X}_{i-1} + \mathbf{U}_{i-1} \tag{1}$$

Here the parameter $A_{i-1}$ is a matrix describing the temporal change of $\mathbf{X}_{i-1}$; for simplicity it is taken to be diagonal. $\mathbf{U}_i$ is the excitation source, expressed as the temporal change of the smooth spectral outline described later.

In contrast to the true feature vector $\mathbf{X}_i$ above, the vector $\mathbf{Y}_i$ actually obtained from measurement contains an error $\mathbf{W}_i$ and is written as follows.

$$\mathbf{Y}_i = \mathbf{X}_i + \mathbf{W}_i \tag{2}$$

The forms of equations (1) and (2) are only examples and do not limit the field of application of the present invention.

In the recognition process, the measured value $\mathbf{Y}_i$ is obtained, and from it the optimal estimate $\hat{\mathbf{X}}_i$ of $\mathbf{X}_i$ is computed using an evaluation function with a prescribed algorithm. A Kalman filter is used as this evaluation function.

Because a Kalman filter is used, the evaluation function is the squared error: the optimal estimate $\hat{\mathbf{X}}_i$ is the vector that minimizes the squared error. This least-squares estimate $\hat{\mathbf{X}}_i$ is obtained by the following algorithm.

$$\hat{\mathbf{X}}_i = \bar{\mathbf{X}}_i + P_i W_i^{-1}\left(\mathbf{Y}_i - \bar{\mathbf{X}}_i\right) \tag{3}$$

$$\bar{\mathbf{X}}_i = A_{i-1}\hat{\mathbf{X}}_{i-1} + \bar{\mathbf{U}}_{i-1} \tag{4}$$

$$P_i = \left(M_i^{-1} + W_i^{-1}\right)^{-1} \tag{5}$$

$$M_i = A_{i-1} P_{i-1} A_{i-1} + U_{i-1} \tag{6}$$

Here $W_i$ is the covariance matrix of $\mathbf{W}_i$. For simplicity, the mean of $\mathbf{W}_i$ is taken to be zero and the variance of each of its elements is a constant $\sigma^2$ independent of time, so that $W_i = \sigma^2 I$ ($I$ is the identity matrix). $U_i$ is the covariance matrix of $\mathbf{U}_i$, $\bar{\mathbf{U}}_i$ is the mean of $\mathbf{U}_i$ ($\mathbf{U}_i = \mathbf{g}_{i+1} - \mathbf{g}_i$), and $P_i$ is the covariance matrix of the estimation error.
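The recursion of equations (3) to (6) can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the circuit of the embodiment: the function name `kalman_estimate` and the array layout are hypothetical, the excitation covariance is taken as one constant matrix `U_cov`, and the recursion is started from the first measurement with $P = \sigma^2 I$ because the source does not say how $P_0$ and $\hat{\mathbf{X}}_0$ are chosen.

```python
import numpy as np

def kalman_estimate(Y, A_diag, U_mean, U_cov, sigma2):
    """Estimate X_hat[i] for every frame from measurements Y.

    Y      : (k, n) measured feature vectors
    A_diag : (k-1, n) diagonals of the transition matrices A_i
    U_mean : (k-1, n) mean excitation vectors
    U_cov  : (n, n) excitation covariance (assumed constant, positive definite)
    sigma2 : measurement-noise variance, W = sigma2 * I
    """
    k, n = Y.shape
    W_inv = np.eye(n) / sigma2
    X_hat = np.empty_like(Y, dtype=float)
    X_hat[0] = Y[0]                  # assumed initial estimate
    P = sigma2 * np.eye(n)           # assumed initial error covariance
    for i in range(1, k):
        A = np.diag(A_diag[i - 1])
        x_pred = A @ X_hat[i - 1] + U_mean[i - 1]            # eq. (4)
        M = A @ P @ A + U_cov                                # eq. (6), A diagonal
        P = np.linalg.inv(np.linalg.inv(M) + W_inv)          # eq. (5)
        X_hat[i] = x_pred + P @ W_inv @ (Y[i] - x_pred)      # eq. (3)
    return X_hat
```

When the measurements follow the model exactly, the correction term in equation (3) vanishes and the estimate reproduces the measurements; with noisy measurements the estimate is pulled toward the model prediction, which is the smoothing effect the text relies on.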

FIG. 1 shows the dynamic model of the feature vectors $\mathbf{X}_i$, $\mathbf{X}_{i+1}$, $\mathbf{X}_{i+2}$ in frames $i$, $i+1$, $i+2$. Each feature vector consists of the output values of 24 bands, and the element values are plotted as white circles for each frame. As the figure shows, the excitation source $\mathbf{U}_i$ is obtained from the temporal change of the spectral outline $\mathbf{g}_i$, drawn as a broken line, as $\mathbf{U}_i = \mathbf{g}_{i+1} - \mathbf{g}_i$, so that the time series of feature vectors can be expressed consistently by equation (1).

FIG. 2 shows the processing procedure for obtaining the feature vector $\mathbf{X}_i$ and the spectral outline $\mathbf{g}_i$.

First, the sampled and digitized speech waveform is divided into fixed time intervals (frames) of, for example, 10 to 20 ms, and each frame is analyzed separately. The data of each frame are windowed, that is, multiplied by weights that depend on the time position of each sample (here, the weighting coefficients known as the Hanning window).

A digital Fourier transform (DFT) is then applied and the power spectrum is computed. Taking the logarithm of each spectral component (LOG) gives the log spectrum (LOG SPECTRUM). This spectrum is divided into bands (PASSBAND WINDOW; 24 channels here) and thereby compressed into a small number of spectral components, which yields the feature vector $\mathbf{X}_i$.
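The feature-vector branch of FIG. 2 (window, DFT, log power, band compression) can be sketched as below. The names `band_average` and `feature_vector` are hypothetical, and the band layout is an assumption: the source gives only the number of channels (24), so equal-width bands averaged over their bins are used here purely for illustration.

```python
import numpy as np

def band_average(spectrum, n_bands=24):
    """Compress a spectrum into n_bands values by averaging contiguous bins
    (equal-width bands are an assumption, not taken from the source)."""
    return np.array([b.mean() for b in np.array_split(spectrum, n_bands)])

def feature_vector(frame, n_bands=24, eps=1e-12):
    """One frame of samples -> n_bands-dimensional log-spectral feature vector."""
    windowed = frame * np.hanning(len(frame))     # Hanning window
    power = np.abs(np.fft.rfft(windowed)) ** 2    # DFT -> power spectrum
    log_spec = np.log(power + eps)                # log spectrum; eps avoids log(0)
    return band_average(log_spec, n_bands)        # passband compression
```

For a 256-sample frame this produces 129 spectral bins, which the sketch groups into 24 band values.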

In parallel, the cepstrum coefficients (CEPSTRUM) are computed from the log spectrum by an inverse Fourier transform (IDFT). Only the low-order coefficients (up to 5th order here) are retained (WINDOW), and a further digital Fourier transform (DFT) gives the envelope of the log spectrum. This envelope spectrum is divided into bands (PASS BAND WINDOW; 24 channels here) to obtain the outline vector $\mathbf{g}_i$, which has a smooth shape.
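The outline branch (cepstrum, low-order windowing, transform back, band compression) can be sketched in the same way. The names `log_power_spectrum` and `spectral_outline` are hypothetical; the sketch assumes an even frame length, keeps cepstral coefficients 0 through `n_cep` together with their mirror images, and reuses the equal-width band averaging, none of which is specified in the source.

```python
import numpy as np

def log_power_spectrum(frame, eps=1e-12):
    """Hanning-windowed log power spectrum of one frame (one-sided)."""
    windowed = frame * np.hanning(len(frame))
    return np.log(np.abs(np.fft.rfft(windowed)) ** 2 + eps)

def spectral_outline(frame, n_cep=5, n_bands=24):
    """Smooth spectral outline g_i obtained by cepstral windowing."""
    n = len(frame)                                  # assumed even
    log_spec = log_power_spectrum(frame)
    cepstrum = np.fft.irfft(log_spec, n=n)          # inverse transform -> real cepstrum
    lifter = np.zeros(n)
    lifter[: n_cep + 1] = 1.0                       # keep low-order coefficients
    lifter[n - n_cep:] = 1.0                        # and their symmetric counterparts
    envelope = np.fft.rfft(cepstrum * lifter).real  # back to the spectral domain
    return np.array([b.mean() for b in np.array_split(envelope, n_bands)])
```

Keeping every cepstral coefficient reproduces the band-averaged log spectrum exactly, so `n_cep` directly controls how much fine spectral detail is discarded.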

Suppose that, for a speech category $\alpha$, the time series of feature vectors $(\mathbf{X}^{\alpha}_1, \mathbf{X}^{\alpha}_2, \ldots, \mathbf{X}^{\alpha}_k)$ and the vector time series $(\mathbf{U}^{\alpha}_1, \mathbf{U}^{\alpha}_2, \ldots, \mathbf{U}^{\alpha}_{k-1})$ have been obtained in advance as the standard pattern. The time series of the parameters $A^{\alpha}_i$ for category $\alpha$ can then be obtained from equation (1).
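Because $A_{i-1}$ is diagonal, solving equation (1) for it reduces to an element-wise division. The sketch below shows one way this could be done; the function name `transition_diagonals`, the use of the outline vectors to form the excitation, and the small-denominator guard `eps` are assumptions added for illustration.

```python
import numpy as np

def transition_diagonals(X, G, eps=1e-8):
    """Solve X[i] = A[i-1] * X[i-1] + U[i-1] for diagonal A.

    X : (k, n) standard-pattern feature vectors
    G : (k, n) spectral outline vectors of the same frames
    Returns (A_diag, U), each of shape (k-1, n).
    """
    U = G[1:] - G[:-1]                                   # U_i = g_{i+1} - g_i
    # guard against division by values too close to zero (assumption)
    denom = np.where(np.abs(X[:-1]) < eps, eps, X[:-1])
    A_diag = (X[1:] - U) / denom
    return A_diag, U
```

Substituting the result back into equation (1) reproduces the standard pattern frame by frame, which is a quick consistency check on the stored parameters.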

From the input speech, on the other hand, measurement yields the time series of feature vectors $(\mathbf{Y}_1, \mathbf{Y}_2, \ldots, \mathbf{Y}_k)$ and the excitation sources $(\mathbf{U}_1, \mathbf{U}_2, \ldots, \mathbf{U}_{k-1})$. From these values and from $(\mathbf{U}^{\alpha}_1, \mathbf{U}^{\alpha}_2, \ldots, \mathbf{U}^{\alpha}_{k-1})$ and $(A^{\alpha}_1, A^{\alpha}_2, \ldots, A^{\alpha}_{k-1})$ for category $\alpha$, the vector of estimates $(\hat{\mathbf{X}}_1, \hat{\mathbf{X}}_2, \ldots, \hat{\mathbf{X}}_k)$ can be obtained by the algorithm of equations (3), (4), (5) and (6).

The time series of feature vectors $(\hat{\mathbf{X}}_1, \hat{\mathbf{X}}_2, \ldots, \hat{\mathbf{X}}_k)$ optimally estimated in this way from the measured values $(\mathbf{Y}_1, \mathbf{Y}_2, \ldots, \mathbf{Y}_k)$ and $(\mathbf{U}_1, \mathbf{U}_2, \ldots, \mathbf{U}_{k-1})$ is used in place of the measured values $(\mathbf{Y}_1, \mathbf{Y}_2, \ldots, \mathbf{Y}_k)$ for pattern matching against the standard pattern $(\mathbf{X}^{\alpha}_1, \mathbf{X}^{\alpha}_2, \ldots, \mathbf{X}^{\alpha}_k)$. Compared with matching the measured values $\mathbf{Y}_i$ directly against the standard pattern, this matching starts from input speech whose fluctuations have already been reduced, so the patterns are better separated and matching is faster and more efficient.

FIGS. 3 and 4 are block diagrams of processing circuits that form the optimally estimated time-series spectrum from the input speech.

In the figures, M₁ is a first memory that stores the standard patterns; the vectors $(\mathbf{X}^{\alpha}_1, \mathbf{X}^{\alpha}_2, \ldots, \mathbf{X}^{\alpha}_k)$ and $(\mathbf{U}^{\alpha}_1, \mathbf{U}^{\alpha}_2, \ldots, \mathbf{U}^{\alpha}_{k-1})$ are recorded in it. M₂ is a second memory that stores the measured pattern $(\mathbf{Y}_1, \mathbf{Y}_2, \ldots, \mathbf{Y}_k)$, $(\mathbf{U}_1, \mathbf{U}_2, \ldots, \mathbf{U}_{k-1})$ brought in from an input unit such as a microphone. The matrices $A^{\alpha}_1, A^{\alpha}_2, \ldots, A^{\alpha}_{k-1}$, which are the parameters of the dynamic model (the temporal change of the standard pattern), are stored in memory M₄. The embodiment of FIG. 3 computes the parameters $A^{\alpha}_i$ sequentially, whereas FIG. 4 shows the case where precomputed values are already stored in memory M₄. In the FIG. 4 embodiment, memory M₄ therefore holds the time series of matrices for each speech category $\alpha, \beta, \gamma, \ldots$ (the per-category transformation matrices $[A^{\alpha}], [A^{\beta}], \ldots$, that is, $A^{\alpha}_1, A^{\alpha}_2, \ldots, A^{\alpha}_i, \ldots$, $A^{\beta}_1, A^{\beta}_2, \ldots, A^{\beta}_i, \ldots$, $A^{\gamma}_1, A^{\gamma}_2, \ldots, A^{\gamma}_i, \ldots$). In the FIG. 3 embodiment, calculation unit C₃ computes the matrices $A_i$ one after another from the standard pattern stored in memory M₁, and each result is transferred to and stored in memory M₄.

The main calculation unit C₁ includes a Kalman filter, is provided with the arithmetic circuit shown in FIG. 5, and executes the calculation based on equations (3), (4), (5) and (6). The calculation proceeds sequentially along the time series. While the $i$-th step is being computed, the values $P_{i-1}$ and $\hat{\mathbf{X}}_{i-1}$ obtained in the previous step are already stored in memory M₃. At the initial step, memory M₅ holds the initial values $P_0$ and $\hat{\mathbf{X}}_0$. The main calculation unit C₁ thus uses the values stored in memories M₁, M₂, M₃ and M₄ to compute the estimated feature vector $\hat{\mathbf{X}}_i$ at each step, and the results are stored successively in memories M₄ and M₅. Memory M₅ therefore ends up holding the estimated vector of the final frame.

Calculation unit C₂ computes the distance between the estimated vector $\hat{\mathbf{X}}_i$ and the standard-pattern feature vectors $\mathbf{X}^{\alpha}_i, \mathbf{X}^{\beta}_i, \ldots$, and the distance is accumulated in memory M₆. D is a decision unit that compares the distances stored in memory M₆ for each standard pattern and outputs the decision result; in other words, it determines which category the input speech belongs to.
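The roles of calculation unit C₂, memory M₆ and decision unit D can be illustrated with a short sketch. Several points are assumptions, since the source does not state them: the distance is taken to be Euclidean, one estimated sequence is assumed to be available per category (because the estimate depends on that category's parameters), and the names `classify`, `estimates` and `templates` are hypothetical.

```python
import numpy as np

def classify(estimates, templates):
    """Pick the category whose standard pattern is closest to its estimate.

    estimates : dict category -> (k, n) estimated feature vectors X_hat
    templates : dict category -> (k, n) standard-pattern feature vectors
    Returns (best_category, accumulated_distances).
    """
    totals = {}
    for cat, ref in templates.items():
        # per-frame distance, accumulated over all frames (role of C2 and M6)
        totals[cat] = float(np.sum(np.linalg.norm(estimates[cat] - ref, axis=1)))
    best = min(totals, key=totals.get)   # comparison step (role of D)
    return best, totals
```

The category with the smallest accumulated distance is reported as the recognition result.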

As described above, according to the present invention, a feature vector is optimally estimated from the measured speech using an evaluation function, the Kalman filter, that takes the dynamic features of speech into account, and recognition is performed by matching this feature vector against the standard patterns. Compared with using the measured feature vectors directly without a Kalman filter, fluctuations in the feature-vector values are reduced, pattern matching becomes more efficient, and the separation between speech categories becomes more accurate.

[Brief Description of the Drawings]

FIG. 1 is a diagram showing the dynamic model of the feature vectors according to the present invention; FIG. 2 is a flowchart explaining the operation of the invention; FIGS. 3 and 4 are block diagrams explaining embodiments of the invention; and FIG. 5 is a block diagram showing the main part of those block diagrams in detail.

$\mathbf{X}_i$: feature vector; $\mathbf{Y}_i$: measured feature vector; M₁, M₂, M₃, M₄, M₅, M₆: memories; C₁: main calculation unit; C₂: calculation unit; D: decision unit.

Claims

[Claims]

1. A speech recognition method characterized in that, for each category, feature vectors extracted from a speech signal are set in advance as a standard pattern whose temporal change follows a predetermined time series; for measured values extracted from input speech information, the feature vector of the input speech is estimated using the evaluation function based on the temporal change of the standard pattern

$$\hat{\mathbf{X}}_i = \bar{\mathbf{X}}_i + P_i W_i^{-1}\left(\mathbf{Y}_i - \bar{\mathbf{X}}_i\right)$$

$$\bar{\mathbf{X}}_i = A_{i-1}\hat{\mathbf{X}}_{i-1} + \bar{\mathbf{U}}_{i-1}$$

$$P_i = \left(M_i^{-1} + W_i^{-1}\right)^{-1}$$

$$M_i = A_{i-1} P_{i-1} A_{i-1} + U_{i-1}$$

($\hat{\mathbf{X}}_i$: estimated feature vector; $\mathbf{Y}_i$: vector of measured values; $A_i$: matrix describing the temporal change; $W_i$: covariance matrix of the measurement error; $U_i$: covariance matrix of the excitation source; $P_i$: covariance matrix of the estimation error; $\bar{\mathbf{U}}_i$: mean of $\mathbf{U}_i$; $i$: frame number); and the estimated feature vector is compared with the standard patterns registered in advance to perform pattern matching of the speech signal.