JPH0426480B2


Info

Publication number: JPH0426480B2
Application number: JP59264782A
Authority: JP
Inventors: Sadahiro Furui
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1984-12-14
Filing date: 1984-12-14
Publication date: 1992-05-07
Also published as: JPS61141500A

Description

[Detailed Description of the Invention] "Industrial Field of Application" This invention relates to a speech recognition device that determines the similarity between the time series of feature quantities of input speech and the time series of the corresponding standard pattern speech on the basis of distances computed between the elements of these time series.

"Prior Art" In conventional speech recognition devices of this type, given the time series of input speech features (elements 1 through l) and the time series of pattern speech features (elements 1 through n), the distance measure between the i-th element of the former and the j-th element of the latter was computed from spectra in which the total power of each element was normalized to a constant value. As a result, the distance could become quite small even between elements whose speech power differs greatly, for example when the i-th input element is a vowel and the j-th pattern element is a consonant.

In speech recognition it is desirable to recognize the speech of as many people as possible with a single standard pattern. Many conventional methods, however, used distances based on the spectrum, which is easily affected by individual differences. Errors such as confusing vowels with consonants were therefore likely, and the recognition rate dropped sharply, particularly when the input speech did not come from the same speaker as the standard pattern.

To address this problem, attempts have been made to use an inter-element matching distance measure for speech time series that combines spectral information with power information (see, for example, Aikawa, Shikano, Furui: "Word speech recognition using a distance weighted by power information," Acoustical Society of Japan, Speech Research Group materials, S81-59, 1981). In that method, however, fluctuations in the power information caused by the speech input level must be dealt with, so the maximum and minimum power over the entire speech section had to be found and the speech power normalized so that these take fixed values. The change in speech power therefore had to be examined until the end of the speech section, or over several seconds, before calculation of the distance measure could begin. This had the drawback that the recognition result was delayed, or that high-speed elements had to be used to compute the distance measure.

The object of this invention is to provide a word speech recognition device that recognizes speech, including that of unspecified speakers, more accurately than conventional methods and without time delay.

"Means for Solving the Problems" According to this invention, a distance between the linear regression coefficients of the time waveform of speech power, which varies little between individuals and is unaffected by fluctuations in the speech input level, is used together with the spectral distance. Specifically, the inter-element spectral distance $D_s$ between the feature time series of the input speech and of the standard pattern speech is obtained. In addition, for both the input speech and the standard pattern speech, the time waveform of the linear regression coefficient is computed at every time point from the time waveform of the speech power, and these linear regression coefficients are used to obtain the inter-element power regression coefficient distance $D_p$ between the input speech and the standard pattern. The similarity between the input speech and the standard pattern speech is then computed by time-normalized matching, using the inter-element matching distance obtained from the spectral distance $D_s$ and the power regression coefficient distance $D_p$.

"Embodiment" Fig. 1 shows an embodiment of this invention. The input speech signal applied to the speech input terminal 1 first passes through the speech section detection circuit 2, which removes silent (noise) sections and extracts only the actual speech section. Several well-known methods can be used for this detection, for example ones based on the short-time power of the input speech signal or on how long the power stays above a certain value. The signal of the detected speech section is sent to the linear prediction analysis circuit 3 and converted into linear prediction coefficients and a time waveform of power. This technique is already known (see, for example, Itakura, Saito: "Estimation of speech spectral density and formant frequencies by a statistical method," Transactions of the Institute of Electronics and Communication Engineers, 53-A, 1, p. 35, 1970), so details are omitted. Basically, the signal is first passed through a low-pass filter, then sampled and quantized; a short section of the waveform is cut out at fixed intervals by multiplying it by a Hamming window or the like; and the power and the correlation coefficients are computed by sum-of-products operations. The Hamming window length is, for example, 30 ms, and the window is advanced with a period of, for example, 10 ms. The linear prediction coefficients are extracted from the correlation coefficients by solving an algebraic equation through iterative computation. Linear prediction coefficients of, for example, the 1st through 10th orders are computed.
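
The analysis step described above can be sketched roughly as follows. This is a minimal illustration, not the circuit of the patent: it assumes the autocorrelation method with the Levinson-Durbin recursion as the "iterative computation", and the function names (`hamming`, `autocorrelation`, `levinson_durbin`, `analyze`) and the small constant added before taking the logarithm are hypothetical choices made here. Low-pass filtering and quantization are assumed to have happened already.

```python
import math

def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def autocorrelation(frame, order):
    """Sum-of-products correlation r[0..order]; r[0] is the frame energy."""
    n = len(frame)
    return [sum(frame[t] * frame[t + k] for t in range(n - k))
            for k in range(order + 1)]

def levinson_durbin(r, order):
    """Iteratively solve the normal equations for the predictor
    s[t] ~ sum_k a[k] * s[t-k]. Returns (a[1..order], residual energy)."""
    a = [0.0] * (order + 1)
    err = r[0]
    for m in range(1, order + 1):
        if err <= 0.0:          # degenerate frame: stop early, keep zeros
            break
        acc = r[m] - sum(a[k] * r[m - k] for k in range(1, m))
        k_m = acc / err         # reflection coefficient of order m
        new_a = a[:]
        new_a[m] = k_m
        for k in range(1, m):
            new_a[k] = a[k] - k_m * a[m - k]
        a = new_a
        err *= (1.0 - k_m * k_m)
    return a[1:], err

def analyze(signal, sample_rate, order=10, win_ms=30, hop_ms=10):
    """Return one (lpc, log_power) pair per 10 ms step, using 30 ms windows."""
    win = int(sample_rate * win_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    w = hamming(win)
    frames = []
    for start in range(0, len(signal) - win + 1, hop):
        frame = [signal[start + i] * w[i] for i in range(win)]
        r = autocorrelation(frame, order)
        lpc, _ = levinson_durbin(r, order)
        # Small constant (an assumption) avoids log(0) on silent frames.
        frames.append((lpc, math.log(r[0] + 1e-12)))
    return frames
```

Each call to `analyze` yields, per frame, the two quantities the text names: the linear prediction coefficients and a log power value from which the power time waveform is built.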

The time waveform of the extracted linear prediction coefficients is converted into linear prediction cepstrum coefficients by the cepstrum conversion circuit 4. This technique is also already known (see, for example, Saito, Nakata: "Fundamentals of Speech Information Processing," Ohmsha, Chapter 7, p. 102, 1981), so details are omitted, but linear prediction cepstrum coefficients (hereinafter simply called cepstrum coefficients) are easily obtained by a recursive formula in the linear prediction coefficients. The extracted cepstrum coefficients are temporarily stored in the feature parameter register 5.
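
For illustration, one commonly used form of such a recursion is sketched below. It assumes the predictor sign convention s[t] ≈ Σ a_k s[t−k] (all-pole model 1 / (1 − Σ a_k z^−k)) and ignores the gain term; the patent does not state which convention its circuit uses, and the function name `lpc_to_cepstrum` is hypothetical.

```python
def lpc_to_cepstrum(lpc, n_ceps=None):
    """Cepstrum c[1..n_ceps] of the all-pole model 1 / (1 - sum_k a_k z^-k).

    Recursion: c_n = a_n + sum_{k=1}^{n-1} (k/n) * c_k * a_{n-k},
    with a_n taken as 0 for n > p.
    """
    p = len(lpc)
    n_ceps = p if n_ceps is None else n_ceps
    a = [0.0] + list(lpc)            # shift to 1-based indexing
    c = [0.0] * (n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = a[n] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc += (k / n) * c[k] * a[n - k]
        c[n] = acc
    return c[1:]
```

For a single-pole model with coefficient 0.5 the cepstrum is 0.5^n / n, so `lpc_to_cepstrum([0.5], 3)` gives 0.5, 0.125, and 0.125 / 3, which is a quick sanity check on the recursion.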

Meanwhile, the time waveform of power, the other characteristic extracted by the linear prediction analysis circuit 3, is handled as follows. At every extraction period (10 ms in the above example), a section of the waveform of fixed duration is logarithmically converted and temporarily stored in the power register 6; the contents of register 6 are sent to the regression coefficient calculation circuit 7, which computes the linear regression coefficient. The length of the time waveform supplied to register 6 and the regression coefficient calculation circuit 7 is, for example, 50 ms. If the time waveform of log power is written $x_j$ ($j = -M, \ldots, M$), the linear regression coefficient $a$ (hereinafter called the power regression coefficient) is obtained by the following calculation.

$$a = \frac{\sum_{j=-M}^{M} x_j \, j}{\sum_{j=-M}^{M} j^2} \tag{1}$$

The power regression coefficient is computed from the input to the regression coefficient calculation circuit 7, which is updated at every period mentioned above, and is stored in the feature parameter register 5 together with the cepstrum coefficients.
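
The regression computation of equation (1) can be sketched as below. The sketch assumes M = 2, that is, five log power values spaced 10 ms apart to span roughly the 50 ms section mentioned in the text; the edge handling (repeating the first and last values) is an assumption made here, since the patent does not describe the boundaries. The function names are hypothetical.

```python
def regression_coefficient(window):
    """Slope of the least-squares line through an odd-length window,
    i.e. sum(x_j * j) / sum(j^2) for j = -M..M."""
    if len(window) % 2 == 0 or len(window) < 3:
        raise ValueError("window length must be odd and at least 3")
    m = len(window) // 2
    offsets = range(-m, m + 1)
    num = sum(x * j for x, j in zip(window, offsets))
    den = sum(j * j for j in offsets)
    return num / den

def regression_track(values, m=2):
    """Regression coefficient at every frame of a per-frame sequence.
    Edges are padded by repeating the boundary value (an assumption)."""
    if not values:
        return []
    padded = [values[0]] * m + list(values) + [values[-1]] * m
    return [regression_coefficient(padded[t:t + 2 * m + 1])
            for t in range(len(values))]
```

Adding a constant to every value in the window leaves the result unchanged: `regression_coefficient([1, 2, 3, 4, 5])` and `regression_coefficient([11, 12, 13, 14, 15])` both give 1.0.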

Switch 8 selects between the learning mode and the recognition mode. Switch 8 is first connected to terminal 8a. From the speech of the person who will later input the speech to be recognized, or of several other people, a feature parameter waveform consisting of cepstrum coefficients and power regression coefficients is obtained for each word of the recognition vocabulary; it is stored in the feature parameter register 5, then passed to the standard pattern storage unit 9 and stored as the standard pattern of that word.

For speech to be recognized thereafter, switch 8 is connected to terminal 8b, and the contents of the feature parameter register 5 are input to the time-normalization matching circuit 10. At the same time, the standard patterns for the individual vocabulary words are read one by one from the standard pattern storage unit 9 and input to the time-normalization matching circuit 10, which computes the degree of similarity between the feature parameters of the standard pattern and those of the input speech.

Speaking rate varies locally and globally each time, even when the same speaker repeats the same word. To compare two utterances, one time axis must therefore be stretched and compressed nonlinearly to fit the other so that common sounds (phonemes) correspond, and the feature parameters at corresponding time points must be compared. It is known that optimization by dynamic programming can be used to warp one time axis nonlinearly, taking the other as reference, so that the two match best (so that their similarity is maximized) (Sakoe, Chiba: "Continuous word recognition based on time normalization of speech using dynamic programming," Journal of the Acoustical Society of Japan, 27, 9, p. 483, 1971).

In the device of this invention as well, the time-normalization matching circuit 10 performs the dynamic programming computation. Let the cepstrum coefficients of the standard pattern at time point $k$ be $C^R_{ki}$ ($1 \le i \le p$, where $p$ takes a value such as 10 as mentioned above) and its power regression coefficient be $a^R_k$, and let the cepstrum coefficients of the input speech at time point $l$ be $C^I_{li}$ ($1 \le i \le p$) and its power regression coefficient be $a^I_l$. The following values are used as the distances $D_s(k, l)$ and $D_p(k, l)$ (a smaller value indicates greater similarity) between the standard pattern at time point $k$ and the input speech at time point $l$, for the cepstrum coefficients and the power regression coefficients respectively.

$$D_s(k, l) = \sum_{i=1}^{p} \left( C^R_{ki} - C^I_{li} \right)^2 \tag{2}$$

$$D_p(k, l) = \left( a^R_k - a^I_l \right)^2 \tag{3}$$

Next, $D(k, l)$ is obtained as the following weighted average of the two, and the dynamic programming computation is carried out with this value as the inter-element matching distance between the standard pattern at time point $k$ and the input speech at time point $l$.

$$D(k, l) = \sqrt{W \, D_s(k, l) + (1 - W) \, D_p(k, l)} \tag{4}$$

The weight $W$ in this equation takes a value from 0 to 1. It is set, on the basis of preliminary experiments, to a value that gives relatively high recognition accuracy, and is stored in the weight register 11.
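
As a rough sketch of how equations (2) to (4) combine, the following computes the inter-element matching distance for one pair of frames. It assumes that equation (4) takes the square root of the whole weighted sum, which is how the garbled source formula is read here; the function and parameter names are hypothetical.

```python
import math

def spectral_distance(c_ref, c_in):
    """Eq. (2): squared Euclidean distance between cepstrum vectors."""
    if len(c_ref) != len(c_in):
        raise ValueError("cepstrum vectors must have the same order p")
    return sum((r - x) ** 2 for r, x in zip(c_ref, c_in))

def power_regression_distance(a_ref, a_in):
    """Eq. (3): squared difference of the power regression coefficients."""
    return (a_ref - a_in) ** 2

def matching_distance(c_ref, a_ref, c_in, a_in, w):
    """Eq. (4), as read here: sqrt(W * D_s + (1 - W) * D_p), 0 <= W <= 1."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("weight W must lie in [0, 1]")
    d_s = spectral_distance(c_ref, c_in)
    d_p = power_regression_distance(a_ref, a_in)
    return math.sqrt(w * d_s + (1.0 - w) * d_p)
```

With W = 1 the measure reduces to the cepstral distance alone, which corresponds to the conventional device the text compares against; with W = 0 only the power regression coefficient is used.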

Through the dynamic programming computation, the time axes are aligned so that the standard pattern and the input speech agree best, and the inter-element matching distance between the standard pattern and the input speech at corresponding time points is averaged over the entire speech section. This value is called the overall distance between the standard pattern and the input speech. The overall distances between the input speech and the standard patterns of the individual vocabulary words are input to the comparison circuit 12, whose logic circuit determines the word with the smallest overall distance. The result is output from the output terminal 13.
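
The matching and decision steps can be illustrated with a small dynamic programming sketch. The patent does not specify the local path constraint or the exact normalization, so this assumes a simple symmetric step pattern (weights 1, 2, 1) normalized by the sum of the two sequence lengths; `dtw_distance`, `recognize`, and the `dist` callback are hypothetical names.

```python
def dtw_distance(ref, inp, dist):
    """Path-normalized overall distance between two frame sequences.

    dist(ref_frame, inp_frame) is the inter-element matching distance.
    Symmetric step pattern: horizontal/vertical steps cost d, the
    diagonal step costs 2*d; the total is divided by len(ref) + len(inp).
    """
    n, m = len(ref), len(inp)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")
    inf = float("inf")
    g = [[inf] * m for _ in range(n)]
    g[0][0] = 2 * dist(ref[0], inp[0])
    for k in range(n):
        for l in range(m):
            if k == 0 and l == 0:
                continue
            d = dist(ref[k], inp[l])
            best = inf
            if k > 0:
                best = min(best, g[k - 1][l] + d)
            if l > 0:
                best = min(best, g[k][l - 1] + d)
            if k > 0 and l > 0:
                best = min(best, g[k - 1][l - 1] + 2 * d)
            g[k][l] = best
    return g[n - 1][m - 1] / (n + m)

def recognize(templates, inp, dist):
    """Return the vocabulary word whose standard pattern is closest."""
    return min(templates, key=lambda word: dtw_distance(templates[word], inp, dist))
```

In a fuller pipeline, each frame would hold the cepstrum coefficients and the power regression coefficient, and `dist` would be the combined inter-element matching distance; scalar frames with an absolute difference are enough to try the sketch out.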

The time waveform of speech power has the basic property of being high in vowel portions and low in consonant portions, and this property does not change from speaker to speaker. Fig. 2 shows the log power time waveforms of the word "Sapporo" uttered twice each by four speakers, normalized so that the maximum and minimum values are constant. As Fig. 2 shows, the power time waveform differs little between speakers and changes relatively smoothly in time. Therefore, a fixed section of about 50 ms is shifted in steps of about 10 ms, and the linear regression coefficient of the time waveform within each section, that is, the slope of its linear approximation, is obtained. By the nature of the linear regression coefficient, this value is unaffected when the whole time waveform is raised or lowered by a constant amount, so stable word features that are common to different speakers and unaffected by fluctuations in utterance level can be extracted.

Accordingly, when time-normalized matching of the standard pattern and the input speech is performed with the power regression coefficient combined with the cepstrum coefficients as in this embodiment, portions that are similar in both spectrum (cepstrum) and power are matched to each other, matching of vowels to consonants is avoided, and the recognition rate improves. With this structure, stable features contained in the power time waveform are used without finding the maximum and minimum power over the entire speech section and normalizing the power time waveform. A word speech recognition device can thus be realized that starts the recognition computation as soon as speech is input and outputs recognition results with high accuracy for any speaker's voice, without time delay.

In experiments so far, with 100 city names as the recognition vocabulary and the speech of one speaker other than the test speaker as the standard pattern, a conventional device using only cepstrum coefficients gave a recognition rate of 85.5%, whereas the device of this embodiment gave 89.3%, confirming the superiority of this invention.
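
The level invariance claimed above can be checked directly from the definition of the regression coefficient as the ratio of Σ x_j j to Σ j² over j = −M, ..., M. The symbols c and a′ are introduced here for illustration only: scaling the input amplitude by a gain g multiplies the power by g², which adds the constant c = 2 log g to every log power value x_j. The coefficient a′ of the shifted waveform is then

```latex
\begin{aligned}
a' &= \frac{\sum_{j=-M}^{M} (x_j + c)\, j}{\sum_{j=-M}^{M} j^2}
    = \frac{\sum_{j=-M}^{M} x_j\, j \;+\; c \sum_{j=-M}^{M} j}{\sum_{j=-M}^{M} j^2}
    = \frac{\sum_{j=-M}^{M} x_j\, j}{\sum_{j=-M}^{M} j^2}
    = a,
\qquad \text{since } \sum_{j=-M}^{M} j = 0 .
\end{aligned}
```

The offset term vanishes because the window is symmetric about j = 0, which is why no normalization over the whole speech section is needed.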

An even higher recognition rate can be obtained by computing the linear regression coefficient $b$ of the cepstrum coefficients (called the cepstrum regression coefficient) and performing time-normalized matching of the input speech against the standard pattern speech using the cepstrum coefficients, the cepstrum regression coefficients, and the power regression coefficient.

Fig. 3 shows this example; parts corresponding to those in Fig. 1 carry the same reference numerals. The cepstrum coefficients $C_o$ computed by the cepstrum conversion circuit 4 are supplied directly to the feature parameter register 5. In addition, at fixed intervals a section of fixed duration of the time waveform of the cepstrum coefficients $C_o$ is temporarily stored in the cepstrum register 14; the contents of register 14 are sent to the regression coefficient calculation circuit 15, which computes the linear regression coefficient (cepstrum regression coefficient). The length of the time waveform supplied to the cepstrum register 14 and the regression coefficient calculation circuit 15 is, for example, 50 ms, and it is updated with a period of, for example, 10 ms. If the time waveform of a cepstrum coefficient is written $y_j$ ($j = -M, \ldots, M$), the cepstrum regression coefficient $b$ is obtained by the following calculation.

$$b = \frac{\sum_{j=-M}^{M} y_j \, j}{\sum_{j=-M}^{M} j^2} \tag{5}$$

The cepstrum regression coefficient $b$ is computed for the cepstrum coefficient of each order from the input to the regression coefficient calculation circuit 15, which is updated every 10 ms. Together with the cepstrum coefficients it is sent to the feature parameter register 5 and stored as a $2p$-dimensional feature parameter. In the time-normalization matching circuit 10, let the cepstrum coefficients and cepstrum regression coefficients of the standard pattern at time point $k$ be $r_{ki}$ ($1 \le i \le 2p$), and those of the input speech at time point $l$ be $x_{li}$ ($1 \le i \le 2p$). The following value is used as the distance between the two (a smaller value indicates greater similarity).

$$d = \frac{1}{2p} \sum_{i=1}^{2p} w_i^2 \left( r_{ki} - x_{li} \right)^2 \tag{6}$$

The index $i$ runs to $2p$ because the cepstrum coefficients are of order $p$ and the cepstrum regression coefficients are of order $p$, giving $2p$ in total. Here $w_i$ is a predetermined weight for each coefficient; it is set, on the basis of preliminary experiments, to a value that gives relatively high recognition accuracy, and is stored in the weight register 16. As equation (6) shows, the distance $d$ is computed as a weighted sum of squared differences between the input speech and the standard pattern over the $p$ cepstrum coefficients and the $p$ cepstrum regression coefficients at the same time point. Since cepstrum coefficients and cepstrum regression coefficients differ in nature yet are used together, the weights $w_i$ balance them. At least two values are therefore used for $w_i$: $w_a$ for the cepstrum coefficients and $w_b$ for the cepstrum regression coefficients. These weights $w_a$ and $w_b$ are stored in the weight register 16.
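
A minimal sketch of the weighted distance of equation (6) follows. It assumes exactly two weight values and that the first p entries of each 2p-dimensional vector are cepstrum coefficients and the last p are cepstrum regression coefficients; that ordering and the function name `weighted_distance` are assumptions made for illustration.

```python
def weighted_distance(r, x, w_a, w_b):
    """Eq. (6): (1 / 2p) * sum_i w_i^2 * (r_i - x_i)^2 over 2p features.

    w_a weights the first p entries (cepstrum coefficients) and
    w_b the last p entries (cepstrum regression coefficients).
    """
    if len(r) != len(x) or len(r) == 0 or len(r) % 2 != 0:
        raise ValueError("vectors must have the same even length 2p")
    p = len(r) // 2
    total = 0.0
    for i, (ri, xi) in enumerate(zip(r, x)):
        w = w_a if i < p else w_b
        total += (w * w) * (ri - xi) ** 2
    return total / (2 * p)
```

For example, `weighted_distance([1, 0, 2, 0], [0, 0, 0, 0], 1.0, 0.5)` gives 0.5: the cepstral term contributes 1, the regression term contributes 0.25 × 4 = 1, and the sum is divided by 2p = 4.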

The time-normalization matching circuit 10 then uses the distance $d(k, l)$ between the standard pattern at time point $k$ and the input speech at time point $l$, obtained from equation (6), as $D_s(k, l)$ in equation (4) and evaluates equation (4). The other operations are the same as in Fig. 1.

Although cepstrum coefficients are used as the speech feature here, linear prediction coefficients, formant frequencies, PARCOR coefficients, log area ratios, zero-crossing counts, and the like may be used instead.

"Effects of the Invention" As described above, this invention matches the input speech against the standard pattern speech using a distance composed of the power regression coefficient distance and the spectral distance. It therefore improves recognition performance in speaker-independent word speech recognition, where the spectral distance alone tends to cause recognition errors. Moreover, because no normalization of the absolute power value is required, the recognition computation incurs no time delay.

[Brief explanation of drawings]

Fig. 1 is a block diagram functionally showing an embodiment of the word speech recognition device of this invention; Fig. 2 shows time patterns of the speech log power of the word "Sapporo"; and Fig. 3 is a block diagram functionally showing another embodiment of this invention.
1: speech input terminal, 2: speech section detection circuit, 3: linear prediction analysis circuit, 4: cepstrum conversion circuit, 5: feature parameter register, 6: power register, 7: regression coefficient calculation circuit, 8: switch, 9: standard pattern storage unit, 10: time-normalization matching circuit, 11: weight register, 12: comparison circuit, 13: output terminal.

Claims

[Claims] 1. A word speech recognition device comprising: means for obtaining the inter-element spectral distance $D_s$ between the feature time series of input speech and of standard pattern speech; means for computing, for each of the input speech and the standard pattern speech, the time waveform of the linear regression coefficient at all time points from the time waveform of speech power; means for obtaining, from these linear regression coefficients, the inter-element power regression coefficient distance $D_p$ between the input speech and the standard pattern; and means for computing the similarity between the input speech and the standard pattern speech by time-normalized matching, using the inter-element matching distance obtained from the spectral distance $D_s$ and the power regression coefficient distance $D_p$.