JPS61141500A

JPS61141500A - Word voice recognition equipment

Info

Publication number: JPS61141500A
Application number: JP59264782A
Authority: JP
Inventors: 貞煕古井
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1984-12-14
Filing date: 1984-12-14
Publication date: 1986-06-28
Also published as: JPH0426480B2

Abstract

(57) [Abstract] This publication contains application data from before electronic filing, so no abstract data is recorded.

Description

[Detailed Description of the Invention]

"Field of Industrial Application" This invention relates to a speech recognition device that determines the similarity between the time series of feature values of an input speech and the time series of the corresponding standard pattern speech by computing the distances between the elements of these time series.

"Prior Art" In conventional speech recognition devices of this kind, given the time series A = a_1, a_2, ... of feature values of the input speech and the time series B = b_1, b_2, ... of feature values of the standard pattern speech, the distance measure between elements a_i and b_j was obtained from spectra in which the total power of each element had been normalized to a constant. As a result, the distance could become quite small even between elements whose speech power differs greatly, for example when element a_i is a vowel and element b_j is a consonant.

In speech recognition it is desirable to recognize the speech of as many people as possible with a single standard pattern. Many conventional methods, however, used distances based on the spectrum, which is easily affected by individual differences. Errors such as confusing vowels with consonants were therefore likely, and the recognition rate dropped sharply, particularly when the input speech did not come from the same speaker as the standard pattern.

To address this problem, attempts have been made to use an inter-element matching distance measure for speech time series that combines spectral information with power information (see, for example, Aikawa, Shikano, Furui: "Word speech recognition using a distance weighted by power information", Acoustical Society of Japan, Speech Research Group materials, S81-59, 1981). With that method, however, the power information varies with the input level of the speech, so the maximum and minimum power over the whole speech section had to be found and the speech power normalized so that they took fixed values. The change in speech power therefore had to be observed until the speech section ended, or for several seconds, before computation of the distance measure could start. This either delayed the recognition result or required high-speed components for computing the distance measure.

The object of this invention is to provide a word speech recognition device that recognizes speech, including that of unspecified speakers, more accurately than conventional methods and without delay.

"Means for Solving the Problems" According to this invention, a distance between the linear regression coefficients of the time waveform of speech power, which shows little individual difference and is unaffected by variations in the input level of the speech, is used together with the spectral distance. That is, the spectral distance Ds between elements of the feature time series of the input speech and of the standard pattern speech is determined. In addition, the time waveform of linear regression coefficients is computed from the time waveform of speech power for both the input speech and the standard pattern speech, and these linear regression coefficients are used to determine the power regression coefficient distance Dp between elements of the input speech and the standard pattern. An inter-element matching distance derived from the spectral distance Ds and the power regression coefficient distance Dp is then used to compute the similarity between the input speech and the standard pattern speech by time-normalization matching.

"Embodiment" FIG. 1 shows an embodiment of this invention. An input speech signal applied to speech input terminal 1 first passes through speech section detection circuit 2, which removes the silent (noise) sections and extracts only the actual speech section. Several well-known methods can be used for this detection, for example ones based on the short-time power of the input speech signal or on how long the power stays above a given threshold. The signal waveform of the detected speech section is sent to linear prediction analysis circuit 3 and converted into time waveforms of the linear prediction coefficients and of the power. This technique is already well known (see, for example, Itakura, Saito: "Estimation of speech spectral density and formant frequencies by a statistical method", Transactions of the Institute of Electronics and Communication Engineers of Japan, 53-A, 1, p. 35, 1970), so the details are omitted. Basically, the signal is first passed through a low-pass filter and then sampled and quantized. At fixed intervals a short segment of the waveform is cut out by multiplying it by a Hamming window or similar, and the power and the correlation coefficients are computed by sum-of-products operations. The Hamming window length is, for example, 30 ms, and the update period is, for example, 10 ms. The linear prediction coefficients are extracted from these correlation coefficients by solving an algebraic equation through iterative computation. The linear prediction coefficients are computed, for example, from the 1st to the 10th order.
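The analysis chain described above (Hamming window, sum-of-products correlation, iterative solution for the prediction coefficients) can be sketched as follows. This is a minimal illustration, not the circuit of the patent: the 8 kHz sample rate, the function names, and the use of the Levinson-Durbin recursion as the "iterative computation" are all assumptions.

```python
import math

def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def autocorrelation(frame, max_lag):
    """Sum-of-products correlation r[0..max_lag]; r[0] is the frame power (energy)."""
    n = len(frame)
    return [sum(frame[t] * frame[t + lag] for t in range(n - lag))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Iteratively solve the normal equations (assumed method).
    Returns (a, err) for the predictor x[t] ~ sum(a[i-1] * x[t-i], i = 1..order)."""
    a = [0.0] * (order + 1)
    err = r[0]
    for m in range(1, order + 1):
        if err <= 0.0:          # degenerate frame: stop refining
            break
        k = (r[m] - sum(a[i] * r[m - i] for i in range(1, m))) / err
        new_a = a[:]
        new_a[m] = k
        for i in range(1, m):
            new_a[i] = a[i] - k * a[m - i]
        a = new_a
        err *= 1.0 - k * k
    return a[1:], err

def lpc_frames(signal, sample_rate=8000, win_ms=30, hop_ms=10, order=10):
    """Return one (power, coefficients) pair per 10 ms step of a 30 ms window.
    sample_rate is a hypothetical value; the source does not state one."""
    win = int(sample_rate * win_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    w = hamming(win)
    out = []
    for start in range(0, len(signal) - win + 1, hop):
        frame = [s * wi for s, wi in zip(signal[start:start + win], w)]
        r = autocorrelation(frame, order)
        coeffs, _ = levinson_durbin(r, order)
        out.append((r[0], coeffs))
    return out
```

With these assumed settings, a 50 ms signal at 8 kHz (400 samples) yields three analysis frames, each carrying a power value and ten prediction coefficients.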

The time waveform of the extracted linear prediction coefficients is converted into linear prediction cepstral coefficients by cepstrum conversion circuit 4. This technique is also already well known (see, for example, Saito, Nakata: "Fundamentals of Speech Information Processing", Ohmsha, Chapter 7, p. 203, 1981), so the details are omitted; the linear prediction cepstral coefficients (hereinafter simply called cepstral coefficients) are easily obtained by a recursive computation on the linear prediction coefficients. The extracted cepstral coefficients are temporarily stored in feature parameter register 5.
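The recursion from linear prediction coefficients to cepstral coefficients is not written out in the text. The sketch below uses the commonly cited recursion for an all-pole model; the function name and the sign convention (predictor x[t] ≈ Σ a_i·x[t−i]) are assumptions, and the gain term c_0 is left out.

```python
def lpc_to_cepstrum(a, n_ceps):
    """Convert predictor coefficients to cepstral coefficients c_1..c_n_ceps.

    a[i-1] holds a_i of the assumed predictor x[t] ~ sum_i a_i * x[t-i],
    i.e. the model 1 / (1 - sum_i a_i z^-i).
    """
    p = len(a)
    c = [0.0] * (n_ceps + 1)          # c[0] is unused here
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    return c[1:]
```

For a single-pole model with a_1 = 0.5, the series expansion of the log spectrum gives c_n = 0.5^n / n, which the recursion reproduces.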

Meanwhile, the time waveform of the power, the other feature extracted by linear prediction analysis circuit 3, is handled as follows. At every extraction period (10 ms in the example above), a segment of fixed time length is logarithmically transformed and temporarily stored in power register 6. The contents of register 6 are sent to regression coefficient calculation circuit 7, which computes the linear regression coefficient. The length of the time waveform fed to register 6 and regression coefficient calculation circuit 7 is, for example, 50 ms. Denoting the log-power time waveform by x_j (j = -M, ..., M), the linear regression coefficient a (hereinafter called the power regression coefficient) is obtained by the following calculation.

[Equation (1) is missing from the source text.]
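Since the equation for the power regression coefficient does not survive in this text, the sketch below assumes the standard least-squares slope for a window indexed j = -M..M, namely a = Σ j·x_j / Σ j². The function names are hypothetical; M = 2 corresponds to a 50 ms window over 10 ms frames.

```python
def regression_coefficient(window):
    """Least-squares slope of x_j for j = -M..M (assumed form of the missing equation)."""
    if len(window) % 2 == 0:
        raise ValueError("window must have odd length 2M + 1")
    m = len(window) // 2
    num = sum(j * x for j, x in zip(range(-m, m + 1), window))
    den = sum(j * j for j in range(-m, m + 1))
    return num / den

def power_regression_track(log_power, m=2):
    """Slope at every frame that has m neighbours on each side."""
    return [regression_coefficient(log_power[t - m:t + m + 1])
            for t in range(m, len(log_power) - m)]
```

Because the indices j sum to zero over a symmetric window, adding a constant to every x_j leaves the slope unchanged.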

The power regression coefficient is computed from the input to regression coefficient calculation circuit 7, which is updated at every period mentioned above, and is stored in feature parameter register 5 together with the cepstral coefficients.

Switch 8 selects between learning mode and recognition mode. Switch 8 is first connected to terminal 8a. For each vocabulary word to be recognized, a feature parameter waveform consisting of cepstral coefficients and power regression coefficients is obtained from the speech of the person who will later input the speech to be recognized, or from the speech of several other people. It is stored in feature parameter register 5, then passed to standard pattern storage section 9 and stored there as the standard pattern of that word.

For speech to be recognized afterwards, switch 8 is connected to terminal 8b and the contents of feature parameter register 5 are fed to time-normalization matching circuit 10. At the same time, the standard patterns for each vocabulary word are read out one by one from standard pattern storage section 9 and fed to time-normalization matching circuit 10, which computes the degree of similarity between the feature parameters of the standard pattern and those of the input speech.

The speaking rate varies both locally and overall each time the same speaker repeats the same word. To compare two utterances, one time axis must therefore be stretched and compressed nonlinearly to fit the other so that corresponding sounds (phonemes) line up, and the feature parameters at corresponding time points must then be compared.

Dynamic programming, an optimization technique, is known to be usable for nonlinearly warping the time axis of one pattern, with the other as reference, so that the two match best, that is, so that their similarity is greatest (Sakoe, Chiba: "Continuous word recognition based on time normalization of speech using dynamic programming", Journal of the Acoustical Society of Japan, 27, 9, P2S5, 1971).

In the device of this invention as well, time-normalization matching circuit 10 performs a dynamic programming computation. Let the cepstral coefficients of the standard pattern at time k be $c_i^k$ (1 ≤ i ≤ p, where p is, for example, 10 as mentioned above) and its power regression coefficient be $a^k$. Let the cepstral coefficients of the input speech at time l be $c_i^l$ (1 ≤ i ≤ p) and its power regression coefficient be $a^l$. The distances between the standard pattern at time k and the input speech at time l with respect to the cepstral coefficients and the power regression coefficient, $D_s(k, l)$ and $D_p(k, l)$ (smaller values indicate greater similarity), are then defined as follows.

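[Equation (2), which defines $D_s(k, l)$, is missing from the source text.]

$D_p(k, l) = (a^k - a^l)^2$    (3)

Here $a^k$ and $a^l$ are the power regression coefficients of the standard pattern at time k and of the input speech at time l. Next, the weighted average $D(k, l)$ of the two distances is computed as follows, and the dynamic programming computation is carried out with this value as the inter-element matching distance between the standard pattern at time k and the input speech at time l.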

$D(k, l) = W \cdot D_s(k, l) + (1 - W) \cdot D_p(k, l)$    (4)

The weight W in this equation takes a value from 0 to 1. It is set to a suitable value, chosen from the results of preliminary experiments so that relatively high recognition accuracy is obtained, and stored in weight register 11.
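The combination in equation (4) can be sketched as below. The form of the spectral distance is an assumption: equation (2) is not present in this text, so a squared Euclidean distance between cepstral vectors is used for illustration. The function names and the default weight are hypothetical.

```python
def spectral_distance(c_ref, c_in):
    """Assumed form of Ds: squared Euclidean distance between cepstral vectors."""
    return sum((r - x) ** 2 for r, x in zip(c_ref, c_in))

def power_regression_distance(a_ref, a_in):
    """Dp of equation (3): squared difference of power regression coefficients."""
    return (a_ref - a_in) ** 2

def element_distance(c_ref, a_ref, c_in, a_in, w=0.5):
    """Equation (4): D = W * Ds + (1 - W) * Dp, with 0 <= W <= 1."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    return (w * spectral_distance(c_ref, c_in)
            + (1.0 - w) * power_regression_distance(a_ref, a_in))
```

With W = 1 the measure reduces to the spectrum-only distance of the conventional device; smaller W gives more influence to the power slope.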

The dynamic programming computation aligns the time axes so that the standard pattern and the input speech match best. The inter-element matching distances between the standard pattern and the input speech at the corresponding time points are then averaged over the whole speech section. This value is called the overall distance between the standard pattern and the input speech.

The overall distances between the input speech and the standard patterns for each vocabulary word are fed to comparison circuit 12, whose logic circuit determines the word with the smallest overall distance. The result is output from output terminal 13.
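The matching and decision steps can be sketched as follows. The text does not specify path constraints or how the average is normalized, so this sketch assumes the widely used symmetric dynamic programming form (diagonal steps weighted 2, result divided by the sum of the two lengths). The names `dtw_distance` and `recognize` are hypothetical.

```python
def dtw_distance(ref, inp, dist):
    """Time-normalized overall distance between two feature sequences.

    Assumed symmetric form: horizontal and vertical steps cost d, diagonal
    steps cost 2d, and the total is divided by len(ref) + len(inp).
    """
    n, m = len(ref), len(inp)
    inf = float("inf")
    g = [[inf] * m for _ in range(n)]
    for k in range(n):
        for l in range(m):
            d = dist(ref[k], inp[l])
            if k == 0 and l == 0:
                g[k][l] = 2 * d
                continue
            best = inf
            if k > 0:
                best = min(best, g[k - 1][l] + d)
            if l > 0:
                best = min(best, g[k][l - 1] + d)
            if k > 0 and l > 0:
                best = min(best, g[k - 1][l - 1] + 2 * d)
            g[k][l] = best
    return g[n - 1][m - 1] / (n + m)

def recognize(templates, inp, dist):
    """Return the word whose standard pattern has the smallest overall distance."""
    return min(templates, key=lambda word: dtw_distance(templates[word], inp, dist))
```

Here `dist` stands for the inter-element matching distance; any function of two frames that returns a non-negative number can be passed in.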

The time waveform of speech power has the basic property of being high in vowel portions and low in consonant portions, and this property holds regardless of the speaker. FIG. 2 shows the log-power time waveforms of the word "Sapporo" uttered twice by each of four speakers, normalized so that the maximum and minimum values are constant. As FIG. 2 shows, the power waveform differs little between speakers and changes relatively smoothly over time. The linear regression coefficient of the waveform within a fixed interval of about 50 ms, that is, the slope of its linear approximation, can therefore be computed while shifting the interval by about 10 ms at a time. By the nature of the linear regression coefficient, this value is unaffected by a uniform rise or fall of the whole waveform, so stable word features that are common across speakers and unaffected by variations in utterance level can be extracted. If, as in this embodiment, the power regression coefficient is combined with the cepstral coefficients for time-normalization matching of the standard pattern and the input speech, portions that are similar in both spectrum (cepstrum) and power are matched to each other. Matching of vowels against consonants is avoided, and the recognition rate improves.
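The level invariance argued above follows because a gain g applied to the signal power adds the constant log g to every log-power value, and a constant offset does not change a slope. A small numerical check, using made-up frame powers and a hypothetical gain:

```python
import math

def slope(values):
    """Least-squares slope over a symmetric window of odd length."""
    m = len(values) // 2
    idx = range(-m, m + 1)
    return sum(j * v for j, v in zip(idx, values)) / sum(j * j for j in idx)

frame_power = [0.02, 0.05, 0.20, 0.60, 0.90]   # illustrative values only
gain = 7.5                                     # hypothetical input-level change

quiet = slope([math.log(p) for p in frame_power])
loud = slope([math.log(gain * p) for p in frame_power])
# quiet and loud agree up to floating-point rounding
```

No maximum or minimum over the whole utterance is needed, which is why the computation can start as soon as a few frames are available.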

With this structure, there is no need to find the maximum and minimum power over the whole speech section in order to normalize the power waveform. By using the stable features contained in the power time waveform, the device can start the recognition computation as soon as speech is input and output a highly accurate recognition result for any speaker without delay. In experiments to date, with 100 city names as the recognition vocabulary and the speech of a single speaker other than the user as the standard pattern, a conventional device using only cepstral coefficients achieved a recognition rate of 85.5%, whereas the device of this embodiment achieved 89.3%, confirming the superiority of this invention.

An even higher recognition rate can be obtained by also computing the linear regression coefficient b of the cepstral coefficients (called the cepstral regression coefficient) and performing time-normalization matching of the input speech against the standard pattern speech using the cepstral coefficients, the cepstral regression coefficients, and the power regression coefficient.

FIG. 3 shows this example; parts corresponding to those in FIG. 1 carry the same reference numerals. The cepstral coefficients $C_n$ computed by cepstrum conversion circuit 4 are supplied directly to feature parameter register 5. In addition, at fixed intervals a segment of fixed time length of the time waveform of $C_n$ is temporarily stored in cepstrum register 14, and the contents of register 14 are sent to regression coefficient calculation circuit 15, which computes the linear regression coefficient (cepstral regression coefficient). The length of the time waveform fed to cepstrum register 14 and regression coefficient calculation circuit 15 is, for example, 50 ms, and the update period is, for example, 10 ms. Denoting the time waveform of a cepstral coefficient by $y_j$ (j = -M, ..., M), the cepstral regression coefficient b is obtained by the following calculation.

[Equation (5) is missing from the source text.]
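The equation for the cepstral regression coefficient is likewise absent from this text. The sketch below assumes the same least-squares slope, b = Σ j·y_j / Σ j² over j = -M..M, applied separately to each cepstral order. The function name and the frame layout (one list of p coefficients per frame) are assumptions.

```python
def cepstral_regression(frames, m=2):
    """Per-order regression coefficients for each frame with m neighbours on both sides.

    frames: list of cepstral vectors, one per 10 ms frame (assumed layout).
    Returns a list of vectors of the same dimension, one per usable frame.
    """
    den = sum(j * j for j in range(-m, m + 1))
    out = []
    for t in range(m, len(frames) - m):
        order_count = len(frames[t])
        out.append([
            sum(j * frames[t + j][i] for j in range(-m, m + 1)) / den
            for i in range(order_count)
        ])
    return out
```

Concatenating each frame's p cepstral coefficients with its p regression coefficients gives the 2p-dimensional feature vector used in the second embodiment.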

The cepstral regression coefficient b is computed for the cepstral coefficient of each order from the input to regression coefficient calculation circuit 15, which is updated every 10 ms. Together with the cepstral coefficients it forms a 2p-dimensional feature parameter, which is sent to feature parameter register 5 and stored there. In time-normalization matching circuit 10, let the cepstral coefficients and cepstral regression coefficients of the standard pattern at time k be $r_{ki}$ (1 ≤ i ≤ 2p), and those of the input speech at time l be $x_{li}$ (1 ≤ i ≤ 2p). The following value is then used as the distance between the two (smaller values indicate greater similarity).

[Equation (6) is missing from the source text.]

The index i runs up to 2p because the cepstral coefficients have order p and the cepstral regression coefficients have order p, giving 2p in total. Here $w_i$ is a predetermined weight for each coefficient. It is set to a suitable value, chosen from the results of preliminary experiments so that relatively high recognition accuracy is obtained, and stored in weight register 16. As equation (6) indicates, the distance d is computed as a sum of squared differences between the input speech and the standard pattern over the p cepstral coefficients and the p cepstral regression coefficients at the same time point. Cepstral coefficients and cepstral regression coefficients are quantities of different nature used together, and the weights $w_i$ balance them. At least two values of $w_i$ are therefore used: $W_a$ for terms involving cepstral coefficients and $W_b$ for terms involving cepstral regression coefficients.
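A sketch of the weighted distance described here, assuming that the missing equation (6) has the form d(k, l) = Σ w_i (r_ki − x_li)² over i = 1..2p, with the first p components being cepstral coefficients and the last p being cepstral regression coefficients. The component ordering, the function name, and the example weights are assumptions.

```python
def weighted_distance(ref, inp, p, w_a, w_b):
    """Weighted sum of squared differences over a 2p-dimensional feature vector.

    ref, inp: p cepstral coefficients followed by p cepstral regression
    coefficients (assumed ordering). w_a weights the first group, w_b the second.
    """
    if len(ref) != 2 * p or len(inp) != 2 * p:
        raise ValueError("feature vectors must have 2p components")
    total = 0.0
    for i, (r, x) in enumerate(zip(ref, inp)):
        w = w_a if i < p else w_b
        total += w * (r - x) ** 2
    return total
```

Regression coefficients are typically much smaller in magnitude than the cepstral coefficients themselves, so W_b would normally be chosen larger than W_a; the actual values are left to preliminary experiments, as the text states.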

These weights $W_a$ and $W_b$ are stored in weight register 16.

Time-normalization matching circuit 10 then takes the distance d(k, l) between the standard pattern at time k and the input speech at time l obtained from equation (6), uses it as $D_s(k, l)$ in equation (4), and evaluates equation (4). The rest of the operation is the same as in FIG. 1.

Although cepstral coefficients are used here as the speech features, linear prediction coefficients, formant frequencies, PARCOR coefficients, log area ratios, zero-crossing counts, and the like may be used instead.

"Effects of the Invention" As described above, this invention matches the input speech against the standard pattern speech using a distance composed of the power regression coefficient distance and the spectral distance. It therefore improves recognition performance in speaker-independent word speech recognition, where the spectral distance alone tends to cause recognition errors. Moreover, because no normalization of the absolute power value is required, the recognition computation incurs no delay.

[Brief explanation of the drawing]

FIG. 1 is a block diagram functionally showing an embodiment of the word speech recognition device of this invention. FIG. 2 shows the time patterns of the speech log power of the word "Sapporo". FIG. 3 is a block diagram functionally showing another embodiment of this invention.

1: speech input terminal, 2: speech section detection circuit, 3: linear prediction analysis circuit, 4: cepstrum conversion circuit, 5: feature parameter register, 6: power register, 7: regression coefficient calculation circuit, 8: switch, 9: standard pattern storage section, 10: time-normalization matching circuit, 11: weight register, 12: comparison circuit, 13: output terminal.

Patent applicant: Nippon Telegraph and Telephone Public Corporation
Agent: 草野 卓

Claims

[Claims]

(1) A word speech recognition device comprising: means for determining the spectral distance Ds between elements of the feature time series of an input speech and of a standard pattern speech; means for deriving the time waveform of linear regression coefficients from the time waveform of speech power for each of the input speech and the standard pattern speech; means for determining, using these linear regression coefficients, the power regression coefficient distance Dp between elements of the input speech and the standard pattern; and means for calculating the similarity between the input speech and the standard pattern speech by time-normalization matching, using an inter-element matching distance determined from the spectral distance Ds and the power regression coefficient distance Dp.