JP2577891B2 - Word voice preliminary selection device

Info

Publication number: JP2577891B2
Application number: JP61184521A
Authority: JP
Inventors: 貞煕古井
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1986-08-06
Filing date: 1986-08-06
Publication date: 1997-02-05
Anticipated expiration: 2012-02-05
Also published as: JPS6340200A

Description

[Detailed Description of the Invention] [Field of Industrial Application] The present invention relates to a word speech preliminary selection device. It is well suited to word speech recognition over a large vocabulary and is used to reduce the number of words that must undergo time-normalized matching, which processes a large amount of information.

[Prior Art] In word speech recognition, all words to be recognized are stored in advance as standard feature parameters. The feature parameters of an unknown input speech word are extracted, and time-normalized matching is performed between them and the stored standard feature parameters of each word, thereby recognizing the unknown input speech word. However, when the number of words to be recognized is large, that is, when the vocabulary is large, the amount of computation required for time-normalized matching becomes very large.

It has therefore been proposed to first recognize the unknown input speech word roughly, select several words, and then perform time-normalized matching only between the standard feature parameters of each selected word and the unknown input speech word. Selecting these words is called word preliminary selection. If preliminary selection can be done with relatively little computation, speech word recognition as a whole becomes efficient.

Among conventional word preliminary selection methods, the first divides the features of the word speech into equal intervals along the time axis and performs preliminary selection using, for each interval, the average of each feature parameter over the sample points in that interval. The second represents the change in power level of the word speech as a binary pattern, or represents global properties of the phoneme sequence (for example, whether /a/ occurs in the word) as a binary pattern, and performs preliminary selection by simple comparison of these pattern sequences. The third clusters the short-time spectral variations of the word speech, builds a set of representative spectra for each word as a template set, vector-quantizes the spectrum of the input speech with these templates, and performs preliminary selection based on the resulting distortion.

In the first method, preliminary selection performance degrades when the interval boundaries are shifted in time, which happens when the endpoints of the uttered word are cut out differently or when the way parts of the word are lengthened during utterance does not match the stored word. The second method has poor preliminary selection performance because power level patterns are often similar across words and phoneme sequences are difficult to extract with sufficient accuracy. The third method ignores timing information such as the temporal progression and duration of the short-time spectra, so a relatively long segment of one word may be matched to a momentary segment of a different word, causing many selection errors.

If preliminary selection performs poorly, either many words are selected and the benefit of preliminary selection is lost, or the correct word is dropped from the selection and the result is a recognition error or a failure to recognize.

The object of the present invention is to overcome the above drawbacks of conventional preliminary selection methods by providing a word speech preliminary selection device that uses feature parameters which can be extracted with sufficient accuracy, carry timing information, and are little affected by errors in cutting out the speech interval or by temporal stretching and compression, thereby improving preliminary selection performance and reducing the subsequent matching computation.

[Means for Solving the Problems] The present invention provides means for calculating, at short intervals, linear regression coefficients that represent relatively local temporal changes derived from the time waveforms of the spectral parameters. Combinations of these linear regression coefficients and the spectral parameters are clustered for each vocabulary item by clustering means, and the results are stored as templates for each vocabulary item. The spectral parameters and linear regression coefficients of the speech wave to be recognized are vector-quantized with these templates, the resulting distortion is calculated, and a plurality of words with relatively small distortion are selected.

The present invention differs from conventional preliminary selection devices in that spectral parameters and their linear regression coefficients taken at the same instant are combined as if they were a single kind of parameter and clustered to form the templates, and vector quantization is performed with these templates.

[Embodiment] Fig. 1 shows an embodiment of the word speech preliminary selection device according to the present invention. Word speech entering at the word speech input terminal 1 passes through the speech interval detection circuit 2 and the spectrum analysis unit 3, where its spectrum and (logarithmic) power are analyzed, and the analysis results are stored in the spectral parameter storage unit 4. The spectrum is represented by parameters that can describe it with relatively few values, such as the cepstrum or LSP (line spectrum pair) parameters. The description below uses the cepstrum; exactly the same applies when LSP parameters or the like are used.

The time waveforms of the cepstrum and power stored in the spectral parameter storage unit 4 are fed to the regression coefficient calculation circuit 5, which extracts linear regression coefficients. Because the absolute value of the power varies easily with speaking level, it is excluded; the time waveforms of the cepstrum and of the linear regression coefficients of the cepstrum and power (collectively called the feature parameter waveform) are temporarily held in the feature parameter register 6. Switch 7 selects between learning mode and recognition mode. In learning mode, the feature parameter waveform is fed to the clustering unit 8 and clustered, and the multiple templates produced for each vocabulary item are stored in the template storage unit 9. In recognition mode, the feature parameter waveform and the templates of each vocabulary item are fed to the vector quantization circuit 10, which calculates the distortion incurred when the input is vector-quantized with the templates of each vocabulary item, and the result is stored in the quantization distortion storage unit 11. The distortions for all vocabulary items are fed to the candidate word selection circuit 12, which delivers to output terminal 14, as data identifying the candidate words, either a predetermined number of vocabulary items with relatively small distortion or the vocabulary items whose distortion is below the threshold held in the threshold storage unit 13.

The operation is now described in more detail. First, the speech wave used for word recognition is input at the word speech input terminal 1. Because the input normally contains both actual speech and silent (noise) intervals, it is fed to the speech interval detection circuit 2, which detects the speech interval. Several well-known methods can be used for this detection, for example ones based on the short-time power of the input signal or on how long the power stays above a given level. The signal in the detected speech interval is sent to the spectrum analysis unit 3 and converted into time waveforms of cepstrum and power. This technique is already known (see, for example, Furui: Digital Speech Processing, Tokai University Press, pp. 44-48, 1985), so details are omitted. In outline, the signal is passed through a low-pass filter, sampled, and quantized; short segments of the waveform are cut out at fixed intervals and multiplied by a Hamming window or the like; and the cepstrum is extracted either by two Fourier transforms and a logarithmic transform or by linear predictive analysis followed by an iterative computation. The Hamming window length is, for example, 30 ms, and the window is advanced with a period of, for example, 10 ms. The cepstrum is computed up to a predetermined order p, for example from the 1st to the 10th order.
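The FFT-based route described above (window, Fourier transform, logarithm, inverse Fourier transform) can be sketched as follows. This is a minimal illustration assuming NumPy and an already sampled, endpoint-detected signal; the function name `frame_cepstra`, the small constant added before taking logarithms, and the 8 kHz rate in the usage note are assumptions, not details given in the specification.

```python
import numpy as np

def frame_cepstra(signal, sample_rate, p=10, win_ms=30.0, hop_ms=10.0):
    """Return (cepstra, log_power) with shapes (frames, p) and (frames,).

    Hypothetical helper: 30 ms Hamming window advanced every 10 ms,
    cepstral orders 1..p taken from the real cepstrum of each frame.
    """
    signal = np.asarray(signal, dtype=float)
    win = int(round(sample_rate * win_ms / 1000.0))
    hop = int(round(sample_rate * hop_ms / 1000.0))
    window = np.hamming(win)
    n_frames = 1 + (len(signal) - win) // hop if len(signal) >= win else 0
    cepstra = np.zeros((n_frames, p))
    log_power = np.zeros(n_frames)
    eps = 1e-10  # assumption: guards against log(0) on silent frames
    for t in range(n_frames):
        frame = signal[t * hop : t * hop + win] * window
        log_spectrum = np.log(np.abs(np.fft.rfft(frame)) + eps)  # first transform + log
        ceps = np.fft.irfft(log_spectrum)                         # second transform
        cepstra[t] = ceps[1 : p + 1]                              # orders 1..p
        log_power[t] = np.log(np.sum(frame ** 2) + eps)
    return cepstra, log_power
```

For a one-second signal sampled at 8 kHz, the window is 240 samples and the hop 80 samples, so the function returns 98 frames of 10 cepstral coefficients each.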

From the extracted cepstrum and power time waveforms, a segment of fixed duration is stored temporarily in the spectral parameter storage unit 4 at fixed intervals; the contents of storage unit 4 are sent to the regression coefficient calculation circuit 5, which computes the linear regression coefficients. The length of the time waveform segment fed to the spectral parameter storage unit 4 and the regression coefficient calculation circuit 5 is, for example, 50 ms, and it is updated with a period of, for example, 10 ms. Writing the time waveform as x_j (j = −M, …, M), the linear regression coefficient a can be obtained by the following calculation.
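The equation itself is not reproduced in this text. As a hedged reconstruction, presumably corresponding to equation (1) of the specification, the ordinary least-squares slope of a straight line fitted to x_j over the symmetric index range j = −M, …, M takes the form below. The symbols a, x_j, and M are those defined in the text; the form of the expression is the standard result, not a quotation from the original.

```latex
a = \frac{\sum_{j=-M}^{M} j \, x_j}{\sum_{j=-M}^{M} j^{2}}
```

Because the indices are symmetric about zero, the mean of j vanishes and the intercept does not enter the slope. With a 50 ms segment sampled every 10 ms, M = 2 (five points) would be one consistent choice, although the text does not state the value of M.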

The linear regression coefficients are calculated for each cepstral order and for the power from the input to the regression coefficient calculation circuit 5, which is updated every 10 ms. Together with the cepstrum they form a (2p+1)-dimensional feature parameter, which is sent to the feature parameter register 6 and stored.
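The assembly of the (2p+1)-dimensional feature parameter can be sketched as below, assuming NumPy and the least-squares slope over 2M+1 frames. The function names, the default M = 2, and the repetition of edge frames to fill the window at the start and end of the word are assumptions made for illustration.

```python
import numpy as np

def regression_coefficients(track, M=2):
    """Least-squares slope over a sliding window of 2M+1 frames, per column."""
    track = np.asarray(track, dtype=float)
    if track.ndim == 1:
        track = track[:, None]
    j = np.arange(-M, M + 1)
    denom = np.sum(j ** 2)
    # Assumption: repeat the first and last frames so every frame has a full window.
    padded = np.pad(track, ((M, M), (0, 0)), mode="edge")
    slopes = np.zeros_like(track)
    for t in range(track.shape[0]):
        slopes[t] = j @ padded[t : t + 2 * M + 1] / denom
    return slopes

def feature_vectors(cepstra, log_power, M=2):
    """Stack p cepstra, p cepstral slopes, and 1 power slope: (frames, 2p+1)."""
    d_ceps = regression_coefficients(cepstra, M)
    d_pow = regression_coefficients(log_power, M)
    # The absolute power itself is left out, as the text specifies.
    return np.hstack([np.asarray(cepstra, dtype=float), d_ceps, d_pow])
```

With p = 10, each frame yields a 21-dimensional vector.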

Switch 7 selects between learning mode and recognition mode. For each vocabulary item, switch 7 is first connected to terminal 7a, feature parameter waveforms are obtained from the speech of the person who will later input the speech to be recognized, or of several other people, and the clustering unit 8 clusters the short-time feature parameters. Clustering here means reducing a large number of parameter sets to a predetermined fixed number of representative sets. For example, suppose (2p+1)-dimensional feature parameters (the cepstrum and the regression coefficients of the cepstrum and power) are extracted every 10 ms from utterances of a given word by four speakers. If the word lasts 500 ms on average, there are 50 × 4 = 200 (2p+1)-dimensional feature parameters in total. A known method (Y. Linde, A. Buzo, and R. M. Gray: An algorithm for vector quantization, IEEE Trans. Commun., vol. COM-28, pp. 84-95, 1980) can be used to reduce these to, for example, 32 representative (2p+1)-dimensional feature parameters. In this method, similar feature parameters are grouped and represented by a single mean value, and the representative values are chosen so that the total error incurred when each of the original 200 feature parameters is replaced by the nearest of the 32 representative values is minimized. The 32 representative (2p+1)-dimensional feature parameter values determined in this way for each vocabulary item are stored in the template storage unit 9 as templates.
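The clustering step can be illustrated with a plain nearest-mean iteration, sketched below with NumPy. This is a simplified stand-in for the cited method, not a reproduction of it: initialization by random sampling, the fixed iteration count, the unweighted squared Euclidean distance, and the name `train_templates` are all assumptions.

```python
import numpy as np

def train_templates(features, n_templates=32, n_iter=20, seed=0):
    """Reduce (N, D) feature vectors of one word to n_templates mean vectors."""
    features = np.asarray(features, dtype=float)
    rng = np.random.default_rng(seed)
    # Assumption: start from randomly chosen training vectors.
    start = rng.choice(len(features), size=n_templates, replace=False)
    templates = features[start].copy()
    for _ in range(n_iter):
        # Squared distance from every vector to every template: shape (N, K).
        dist = ((features[:, None, :] - templates[None, :, :]) ** 2).sum(axis=2)
        nearest = dist.argmin(axis=1)
        for c in range(n_templates):
            members = features[nearest == c]
            if len(members) > 0:  # leave a template unchanged if it attracts no vectors
                templates[c] = members.mean(axis=0)
    return templates
```

For the example in the text, `features` would hold the 200 vectors collected from four speakers (50 frames each), and the result would be the 32 templates stored for that word.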

For speech to be recognized thereafter, the switch is connected to terminal 7b, and the contents of the feature parameter register 6, that is, the short-time feature parameters of the input speech, are fed to the vector quantization circuit 10. The vector quantization circuit reads the templates of each vocabulary item from the template storage unit 9 in turn and performs vector quantization. This is done by selecting, for each short-time feature parameter of the input speech, the nearest template, and then calculating the quantization distortion, which is the error between the selected template and the feature parameter averaged over the entire input speech.
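The per-word quantization distortion described here can be sketched as follows, assuming NumPy. For simplicity the sketch uses an unweighted squared Euclidean distance; the function names and the dictionary that maps each word to its template array are hypothetical.

```python
import numpy as np

def quantization_distortion(features, templates):
    """Mean over input frames of the squared distance to the nearest template."""
    features = np.asarray(features, dtype=float)
    templates = np.asarray(templates, dtype=float)
    sq = ((features[:, None, :] - templates[None, :, :]) ** 2).sum(axis=2)
    return float(sq.min(axis=1).mean())

def distortions_for_vocabulary(features, vocabulary_templates):
    """Evaluate the distortion of one input against every word's templates."""
    return {word: quantization_distortion(features, tmpl)
            for word, tmpl in vocabulary_templates.items()}
```

Note that the frame order of the input never enters the computation: the timing information is carried only by the regression coefficients inside each feature vector.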

Let r_kli (1 ≤ i ≤ 2p+1; comprising the p cepstral coefficients, the p cepstral linear regression coefficients, and the power linear regression coefficient) denote the feature parameters of the l-th template of a word k, and let x_mi (1 ≤ i ≤ 2p+1) denote the feature parameters of the input speech at time m. The following value is used as the distance between the two (a number for which smaller values indicate greater similarity).
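The distance expression is not reproduced in this text. Based on the description that follows it (a sum of squared differences between input and template, weighted by w_i, and referred to as equation (2)), a hedged reconstruction is given below. The symbols x_mi, r_kli, w_i, and p are those of the text; the exact form in the original may differ.

```latex
d = \sum_{i=1}^{2p+1} w_i \left( x_{mi} - r_{kli} \right)^{2}
```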

Here w_i is a predetermined weight for each feature parameter. Its value is set on the basis of preliminary experiments so that relatively high accuracy is obtained, and it is held in the weight register 15. As equation (2) shows, the distance d is calculated as the sum of squared differences between the input speech and the template over the p cepstral coefficients, the p cepstral linear regression coefficients, and the power regression coefficient at the same instant. Spectral parameters and linear regression coefficients, which differ in nature, are thus used together, and the weights w_i balance them. Accordingly, at least three values of w_i are used, corresponding to the cepstrum, the linear regression coefficients of the cepstrum, and the linear regression coefficient of the power.
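A weighted distance with one weight per parameter group can be sketched as below, assuming NumPy and the weighted sum-of-squares form described in the text. The helper names and the example weight values in the usage note are hypothetical; the specification says only that the weights are chosen from preliminary experiments.

```python
import numpy as np

def make_weights(p, w_ceps, w_dceps, w_dpow):
    """Build the (2p+1)-element weight vector from three group weights."""
    return np.concatenate([np.full(p, w_ceps), np.full(p, w_dceps), [w_dpow]])

def weighted_distance(x_m, r_kl, w):
    """Weighted sum of squared differences between one frame and one template."""
    diff = np.asarray(x_m, dtype=float) - np.asarray(r_kl, dtype=float)
    return float(np.sum(np.asarray(w, dtype=float) * diff ** 2))
```

For instance, `make_weights(10, 1.0, 2.0, 3.0)` gives a 21-element vector in which the ten cepstral terms, the ten cepstral slopes, and the power slope are scaled by 1.0, 2.0, and 3.0 respectively.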

The quantization distortion is obtained in this way between the input speech and the templates of every vocabulary item and stored in the quantization distortion storage unit 11. These values are then fed to the candidate word selection circuit 12, which outputs at terminal 14, as candidate words for the input speech, the names of the words whose quantization distortion is below the threshold held in the threshold storage unit 13. Alternatively, the quantization distortions of all words are compared and a fixed number of words with relatively small quantization distortion (for example, 1/10 of the vocabulary, or 100 words out of 1000) are output as candidate words. A smaller threshold yields fewer candidate words and greatly reduces the subsequent time-normalized matching, but if it is too small the correct word is cut by preliminary selection and cannot be recovered in later processing. The threshold should be set according to how the system is used, as a trade-off between recognition rate and amount of processing. The same applies to the number of candidates output when a fixed number of candidate words is used.
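The two selection rules (distortion below a threshold, or a fixed number of lowest-distortion words) can be sketched in plain Python. The function name, the argument names, and the example city names and values in the usage note are hypothetical.

```python
def select_candidates(distortions, threshold=None, top_n=None):
    """Return candidate words ordered by increasing quantization distortion.

    distortions: dict mapping word -> quantization distortion.
    threshold:   if given, keep only words with distortion below it.
    top_n:       if given, keep at most this many words.
    """
    ranked = sorted(distortions.items(), key=lambda item: item[1])
    if threshold is not None:
        ranked = [(word, d) for word, d in ranked if d < threshold]
    if top_n is not None:
        ranked = ranked[:top_n]
    return [word for word, _ in ranked]
```

Given `{"tokyo": 0.2, "osaka": 0.9, "kyoto": 0.4}`, a threshold of 0.5 keeps `tokyo` and `kyoto`, while `top_n=1` keeps only `tokyo`. Only the returned words would then go on to time-normalized matching.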

Conventionally, clustering was performed on, for example, the cepstrum alone, and the input speech was vector-quantized to calculate the quantization distortion. In this embodiment, the linear regression coefficients of the cepstrum and power at the same instant as the cepstrum are included in the clustering and vector quantization. The linear regression is a straight-line approximation of the cepstrum and power time waveforms, and the slope of that line is the linear regression coefficient; the degree of similarity between the input speech and the templates is therefore also evaluated with respect to the trend of change in the cepstrum and power. As a result, a steady-state portion of the input speech spectrum is no longer matched to a template from a transient portion of a different word, nor a transient portion of the input to a template from a steady-state portion of a different word, and a word speech preliminary selection system that is highly accurate for any speaker's voice can be realized.

In experiments to date on recognition of 100 city names uttered by unspecified speakers, templates for each word were created from the speech of four male speakers. With the quantization distortion threshold set appropriately, the candidate words for the speech of 20 male speakers other than those four could be narrowed to 4.5 words on average, and the correct word was included among the candidates in 99.9% of cases. With the conventional method using the cepstrum alone, the candidates could be narrowed only to 42.5 words on average, and the correct word was included in only 99.0% of cases. This comparison shows the advantage of the present invention.

[Effects of the Invention] As described above, the preliminary selection device of the present invention selects representative patterns that include the local temporal change characteristics of the spectral parameters and performs vector quantization with them. It is therefore little affected by errors in the temporal endpoints of the word speech or by gradual stretching and compression within the word, and because the representative patterns are matched to the word speech in a way that respects the temporal progression, it has the advantage of providing high-performance preliminary selection.

[Brief description of the drawings]

Fig. 1 is a block diagram of a word speech preliminary selection device showing an embodiment of the present invention.
1: word speech input terminal, 2: speech interval detection circuit, 3: spectrum analysis unit, 4: spectral parameter storage unit, 5: linear regression coefficient calculation circuit, 6: feature parameter register, 7: switch, 8: clustering unit, 9: template storage unit, 10: vector quantization circuit, 11: quantization distortion storage unit, 12: candidate word selection circuit, 13: threshold storage unit, 14: output terminal, 15: weight register.

Claims

(57) [Claims]

In an apparatus that recognizes an unknown input speech word by performing time-normalized matching between feature parameters of the unknown input speech word and standard feature parameters of each vocabulary item to be recognized, a word speech preliminary selection device for preselecting the standard feature parameters to be subjected to the time-normalized matching, comprising: means for calculating and storing parameters representing the temporal change of the frequency spectrum and power of a speech wave; means for calculating a linear regression coefficient at short intervals from the time waveforms of those parameters; means for clustering, for each vocabulary item to be recognized, feature parameters consisting of a combination of the parameters and the linear regression coefficients; means for storing, for each vocabulary item to be recognized, the plurality of templates resulting from this clustering; means for vector-quantizing the time waveforms of the parameters and linear regression coefficients of the speech wave to be recognized using the templates; means for calculating the distortion of that quantization for each vocabulary item to be recognized; and means for selecting the standard feature patterns of a plurality of vocabulary items having relatively small distortion as the standard feature patterns to be subjected to time-normalized matching.