JPS5868098A

JPS5868098A - Voice recognition equipment

Info

Publication number: JPS5868098A
Application number: JP56166735A
Authority: JP
Inventors: 野尻　忠雄; 信之寺浦
Original assignee: NipponDenso Co Ltd
Current assignee: Denso Corp
Priority date: 1981-10-19
Filing date: 1981-10-19
Publication date: 1983-04-22

Abstract

(57) [Abstract] This publication contains application data from before electronic filing, so no abstract data is recorded.

Description

[Detailed Description of the Invention] This invention relates to a speech recognition device that uses dynamic programming as its pattern-matching means and is characterized in particular by its means for creating learning patterns.

A speech recognition device analyzes input speech to extract feature vectors, obtains one feature vector per specified unit of time, and identifies the input speech by pattern matching, that is, by comparing the resulting feature-vector pattern against patterns that have been learned and stored.

In conventional speech recognition devices that use dynamic programming as the pattern-matching means, learning patterns have been formed from the feature vectors representing each unit time, as extracted by the input-speech analysis means, without regard to how similar each feature vector is to the feature vectors that precede and follow it in time.

When learning is performed multiple times, either the temporally corresponding feature vectors are simply averaged, or a separate learning pattern is formed for each utterance of the same learning speech and, at recognition time, each pattern is subjected to an independent recognition operation as if it were the learning pattern of a different speech.

Furthermore, speech input during the recognition process has been used merely for comparison with the learning patterns and has had no effect on the learning patterns themselves.

Human utterances differ each time they are spoken. The speech input in the recognition process therefore usually differs from the speech input in the learning process. With the learning-pattern formation means described above, that variation shows up conspicuously in the pattern-matching result, and the recognition rate could not be improved beyond a certain limit.

Moreover, when learning is performed multiple times, it is difficult to reflect the results of all the learning sessions in a single learning pattern. The method of forming a separate learning pattern for each learning session has the problems that the memory capacity becomes bloated and that the amount of computation during the recognition process grows, so that the response takes time.

As noted above, human utterances are not uniform and change from moment to moment. Consequently, a learning pattern that was learned, for example, more than a month earlier often causes practical difficulties, and with conventional methods learning must be repeated frequently even though the recognition device is in continuous use.

This invention was made in view of the above points. It seeks to provide a speech recognition device that absorbs the fluctuations in input speech caused by changes in utterance, so that stable recognition results are obtained effectively; that can, for example, perform reliable recognition computation within a limited matching range, so that the recognition operation is effectively sped up; and, in particular, that creates learning patterns corresponding to the speaker's current utterances, so that a decline in the recognition rate over time is reliably prevented.

That is, in this invention, during the speech learning process, the similarity between temporally consecutive feature vectors is computed for the feature vectors, each representing a unit time (for example, 10 ms) of speech, that are obtained by analyzing the input speech. If the similarity between adjacent feature vectors is at or above a fixed value, the same sound is regarded as having been input continuously, and the feature vectors concerned are averaged. Those feature vectors are replaced by the averaged feature vector, or by the averaged feature vector together with the number of feature vectors that were averaged (the duration), to create the learning pattern. Dynamic programming is used as the pattern-matching means.

Adopting this means of creating learning patterns makes it easy to reflect the effect of multiple learning sessions. When speech that has already been learned, and for which a learning pattern already exists, is learned again, pattern matching by dynamic programming is performed between the input speech and the learning pattern. For the feature vectors of the input speech and of the learning pattern that are associated by the time warping relation found in that process, a weighted average is taken according to the number of learning sessions, and the time-direction averaging described above is then applied to the resulting time-series feature vector sequence. Creating the learning pattern through such multiple learning yields a more average learning pattern and stable recognition results. In addition, having many different people train the device makes speech recognition for unspecified speakers possible as well.

Furthermore, when the input speech is determined during the recognition process to be identical to speech that has already been learned, it is handled in the same way as input speech for the multiple learning described above, so that the speech input for recognition is itself learned. When speech recognition is performed routinely, part of the learning pattern is thus updated each time the device is used, and the learning pattern best suited to the current speech waveform is always maintained.

An embodiment of this invention is described below with reference to the drawings. FIG. 1 shows its configuration. Reference numeral 11 denotes a microphone that detects the input speech in the learning process and the recognition process. The electrical signal corresponding to the input speech captured by the microphone 11 is amplified by an amplifier 12 and supplied to a low-pass filter 13. A signal corresponding to the speech waveform of the input is taken from the low-pass filter 13, and this output signal is converted to a digital signal by an A/D converter 14 and read by a CPU 15. The CPU 15 is supplied with an interrupt signal for sampling from a timer 16, and on each interrupt raised by this signal it samples the speech waveform data that the A/D converter 14 has converted to digital values. Based on the sampled data, the CPU 15 creates and stores learning patterns, performs the speech recognition operation by pattern matching between the learning patterns and the pattern derived from the input speech, and drives and controls an output device 17 according to the recognition result.

In a speech recognition device configured in this way, learning patterns for speech recognition are created first, and the recognition operation is then performed on the basis of those learning patterns. The learning process and the recognition process are described with reference to the flowcharts shown in FIG. 2 and FIG. 3, respectively.

First, when the learning process starts, the procedure proceeds to step 100. On each interrupt raised by the interrupt signal from the timer 16, the speech waveform that was input from the microphone 11 and converted to digital values via the amplifier 12, the low-pass filter 13 and the A/D converter 14 is sampled. When a fixed number of samples has been taken (for example 256; the data in units of this number is hereinafter called a frame), the procedure proceeds to step 101 and the data is analyzed. The analysis extracts a small number of parameters (F of them) that represent the sampled data. This set is called the feature vector and is expressed by the following equation.

V_t = (v_{t,1}, v_{t,2}, ..., v_{t,f}, ..., v_{t,F})

Here, V_t is the feature vector of the t-th frame, and v_{t,f} is its f-th component.

Even while the data analysis is running, the data sampling of the next frame is carried out in step 100 on each interrupt raised by the interrupt signal from the timer 16.

When the data analysis ends, the procedure proceeds to step 102, where it is determined from the feature vectors obtained by the analysis whether the input has ended. The end of input is determined, for example, by the following conditions (a) and (b).

(a) The power of the input speech is computed from the feature vector; the condition is whether input that can be judged valid has occurred at least a fixed number of times.

(b) Whether input of such low power that it can be judged that no speech is present has occurred a fixed number of times in succession.
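Conditions (a) and (b) can be sketched as a check over the per-frame power values. The sketch assumes that both conditions must hold for the input to count as finished, that one power threshold serves both tests, and that the silent run is counted at the end of the input; the text does not state these details, and all names and parameters are hypothetical.

```python
def input_finished(frame_powers, power_threshold, min_valid_frames, min_silent_run):
    """Return True when enough valid frames were seen and the input ends in silence."""
    # Condition (a): number of frames whose power counts as valid input.
    valid = sum(1 for p in frame_powers if p >= power_threshold)
    # Condition (b): length of the run of low-power frames at the end of the input.
    silent_run = 0
    for p in reversed(frame_powers):
        if p < power_threshold:
            silent_run += 1
        else:
            break
    return valid >= min_valid_frames and silent_run >= min_silent_run
```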

If the result of this determination is "NO", the procedure returns to steps 100 and 101, and data sampling and data analysis are performed again. If it is "YES", the procedure proceeds to step 103, where it is determined whether the input speech has already been learned. If the result is "NO", the procedure proceeds to step 107 and the averaging process is performed. If it is "YES", the procedure proceeds to step 104, where pattern matching is performed between the feature vector sequence of the input speech and the learning pattern.

The pattern matching is performed using dynamic programming (hereinafter DP) in which the start points and end points of the input pattern and the learning pattern are made to coincide.

As a result, the distance L between the speech patterns is determined according to a defined distance measure (described later); the smaller L is, the more similar the patterns are. At the same time, the time warping function, which expresses the correspondence between the feature vectors of the input pattern and those of the learning pattern, is also determined. FIG. 4 shows an example of a time warping function. In the figure, T and Tp are the numbers of feature vectors in the input pattern and the learning pattern, respectively.

Next, the procedure proceeds to step 105, where it is determined whether the input speech and the learning speech are the same. The distance L between the two is compared with a predetermined threshold L_01: if L < L_01 the two are judged to be the same, and otherwise they are judged not to be the same. If they are judged not to be the same, the input speech is treated as an invalid input that differs from the learning speech, and the learning process ends. If they are judged to be the same, the procedure proceeds to step 106 and the multiple learning process is performed.

The multiple learning process is described in detail with reference to the flowchart shown in FIG. 5. On entering the multiple learning process, the procedure proceeds to step 116, where the index t indicating the feature vector number of the input pattern is set to "1". The procedure then proceeds to step 126, where, using the time warping function W obtained in step 104, the learning-pattern feature vector number corresponding to input-pattern feature vector number t is set in j; the same value is also set in j_0, which holds the learning-pattern feature vector number used in the previous computation.

Next, the procedure proceeds to step 166, where a weighted average of the input-pattern feature vector V_t and the learning-pattern feature vector V_j is taken. Taking the number of learning sessions so far to be N, the average is computed for each feature vector element [the equation is not legible in this text]. The quantities in that equation are the f-th element of the j-th feature vector of the learning pattern, the f-th element of the t-th feature vector of the input pattern, and the f-th element of the averaged j-th feature vector. As described later, the learning pattern consists of feature vectors and the number of consecutive frames each one covers, but in the multiple learning process it is expanded before the computation is carried out. The upper limit of N is "7"; when N is "6" or less, "1" is added to N.
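The update can be sketched as below. The form of the weighted average, (N * old + new) / (N + 1), is an assumption: the equation itself is missing from the text, and this form was chosen because it agrees with the later statement that 1/8 of the learning pattern is updated once N has reached its upper limit of 7. Skipping an input frame that maps to the same learning vector as the previous frame follows the step 146 test in FIG. 5. All names are hypothetical.

```python
MAX_LEARN_COUNT = 7  # upper limit of N stated in the text


def multiple_learning(ref, inp, warp, n):
    """Blend input vectors into the expanded learning pattern.

    ref  : expanded learning pattern (list of feature vectors)
    inp  : input pattern (list of feature vectors)
    warp : warp[t] is the learning-pattern index aligned with inp[t]
    n    : number of learning sessions so far (N)
    Returns (new_ref, new_n).
    """
    new_ref = [list(v) for v in ref]
    prev_j = None  # plays the role of j_0
    for t, j in enumerate(warp):
        if j == prev_j:
            continue  # same learning vector as the previous frame: do not take it in again
        prev_j = j
        # Assumed weighted average: old value weighted by N, new input weighted by 1.
        new_ref[j] = [(n * p + x) / (n + 1) for p, x in zip(ref[j], inp[t])]
    new_n = n + 1 if n < MAX_LEARN_COUNT else n
    return new_ref, new_n
```

With N = 7 each updated element moves one eighth of the way toward the new input, so an old value of 0.0 and an input of 8.0 give 1.0.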

Next, the procedure proceeds to step 176, where it is determined whether averaging has been completed for all feature vectors of the input pattern. If "YES", the multiple learning process ends. If "NO", the procedure proceeds to step 186, where "1" is added to t, and then to step 136, where, as in step 126, the number of the learning-pattern feature vector corresponding to the input-pattern feature vector V_t is obtained from the time warping function W and set in j. Next, the procedure proceeds to step 146, where it is determined whether the learning-pattern feature vector number obtained in step 136 is equal to the number of the learning-pattern feature vector that was averaged the previous time. If "YES", DP has judged feature vectors t and (t-1) to be nearly identical, so it is judged unnecessary to take the same feature vector into the learning pattern again, and the procedure proceeds to step 176. If "NO", the procedure proceeds to step 156, where the value of j is set in j_0, and then to step 166.

The remainder proceeds in the same way as the process described above.

When the multiple learning process ends, the procedure proceeds to the averaging process of step 107 in FIG. 2.

The averaging process is described in detail with reference to the flowchart shown in FIG. 6. On entering the averaging process, the procedure proceeds to step 171 for initialization. The index t, which indicates the number of the feature vector in the sequence being averaged, is set to "2". The index k, which indicates how many consecutive feature vectors have been judged to be the same sound, and the index i, which is the number under which a feature vector is registered as a distinct feature vector after averaging, are each set to "1". Next, the procedure proceeds to step 172, where the reference feature vector for comparison, V_R, is set to V_{t-1} (V_1 on the first pass, and likewise thereafter). The procedure then proceeds to step 173, where the distance between the feature vectors V_R and V_t is computed and the value is set in D_t. The method of computing the distance depends on the data analysis technique. For example, with frequency analysis the Euclidean distance between log spectra is used; with cepstrum analysis the Euclidean distance between cepstrum coefficients is used; and with linear prediction analysis the likelihood ratio or the cosh measure is used. With PARCOR analysis, a weighted Euclidean distance between PARCOR coefficients is used [the equation is not legible in this text].

Next, the procedure proceeds to step 174, where it is determined whether the distance D_t obtained by the above computation is greater than a predetermined threshold D_0. If "NO", V_R and V_t are judged to be feature vectors representing the same sound, and the procedure proceeds to step 175, where the weighted average of V_R and V_t is computed [the equation is not legible in this text].

The averaging is performed for all components of the feature vector (f = 1 to F).

Next, the procedure proceeds to step 176, where it is determined whether the comparison has reached the last feature vector.

If "YES", the procedure proceeds to step 184. If "NO", "1" is added to each of t and k, and the procedure returns to step 173.

If the result in step 174 is "YES", the feature vectors V_R and V_t are judged not to represent the same sound, and the procedure proceeds to step 178. There V_R is set as the i-th feature vector of the learning pattern, and the value of k is set in M_i, the number of feature vectors that were judged to be equal to this feature vector.

Next, the procedure proceeds to step 179, where, as in step 176, it is determined whether the comparison has reached the last feature vector. If "YES", the procedure proceeds to step 184. If "NO", the procedure proceeds to step 180, where k is reset to "1", and in the following step "1" is added to each of i and t. The procedure then proceeds to step 182, where the reference feature vector V_R is updated to a new value, and in step 183 "1" is added to t before the procedure returns to step 173.

When it is determined in step 176 or 179 above that all feature vectors have been compared, the procedure proceeds to step 184, where V_R is set as the i-th feature vector of the learning pattern and the value of k is set in M_i, and the averaging process ends. As a result of this averaging process, the learning pattern is converted into feature vectors and the durations of the sounds they represent (numbers of feature vectors). Shown as a table, the result is as in FIG. 7.

When the averaging process ends, the procedure proceeds to step 108 of FIG. 2, where the learning pattern shown in FIG. 7 is registered as the learning pattern of the designated speech, and the learning process ends.

Next, the recognition process is described with reference to the flowchart shown in FIG. 3. The data sampling, data analysis and end-of-input determination in steps 100, 101 and 102 are the same as in the learning process shown in FIG. 2, so their description is omitted. When the speech input ends, the procedure proceeds to step 110, where pattern matching by DP is performed between the input pattern of the input speech and every learning pattern. This selects the learning pattern with the smallest distance from the input pattern. Let L_i be the distance between the i-th learning pattern and the input pattern. Next, the procedure proceeds to step 111, where it is determined whether there is a learning pattern that can be judged to be the same as the input pattern. The determination is made as follows. Among {L_1, ..., L_I} (where I is the number of learning patterns), let the smallest value be L_B (the learning pattern having that value being the B-th learning pattern) and let the next smallest value be L_B'.

Then, for predetermined thresholds L_02 and L_03, if both of the conditions

L_B < L_02 (1)
L_B - L_B' < L_03 (2)

are satisfied, the B-th learning pattern is judged to be the learning pattern of the same speech as the input speech. If condition (1) is not satisfied, it is judged that speech different from all of the learning speech was input. If condition (2) is not satisfied, it is judged that the input speech is likely to be the learning speech of the B-th learning pattern but cannot be clearly distinguished from the other learning patterns.

If it is not judged that a learning pattern corresponding to the input pattern exists, the procedure proceeds to step 116, where a message is output to the output device stating that there was valid input but that no corresponding learning pattern existed or that the input could not be distinguished, and the recognition process ends. If it is judged that a corresponding learning pattern exists, the procedure proceeds to step 112, where it is determined whether L_B < L_04 for a predetermined threshold L_04, and thus whether the speech input for recognition should be learned.

Here it is assumed that L_04 < L_02.

If the result is "NO", the input speech is judged to be equal to the learning speech of the B-th learning pattern, but because it differs from the learning pattern by more than a certain limit it is judged that learning should not be performed. The procedure proceeds to step 116, outputs to the output device that the input speech is the learning speech of the B-th learning pattern, and ends the recognition process. If the result is "YES", the input speech is judged to be the same as the learning speech of the B-th learning pattern and to have been uttered with sufficient care. The distance L_B is regarded as ordinary variation in utterance, and the procedure proceeds to step 113, where the multiple learning process is performed. The process is the same as in the learning process described above, so its description is omitted. Through this process, when the device is operated continuously, 1/8 of the learning pattern is updated.

Next, the procedure proceeds to step 114, where the feature vector averaging process is performed. This process is also the same as in the learning process, so its description is omitted. The procedure then proceeds to step 115, where the B-th learning pattern is updated and registered, and then to step 116, where it outputs to the output device that the input speech is the learning speech of the B-th learning pattern, and the recognition process ends.

Averaging of feature vectors is particularly effective when PARCOR coefficients are used as the feature vector, because of their linear interpolation property, but a similar effect can also be obtained with linear prediction coefficients, cepstrum coefficients or frequency components.

In the above embodiment, the input speech is supplied to the low-pass filter 13 and the speech waveform is sampled. Alternatively, as shown in FIG. 8, the signal corresponding to the input speech from the amplifier 12 may be supplied to a plurality of band-pass filters 18-1, 18-2, ..., 18-F with different pass bands, and the output signals of the band-pass filters 18-1 to 18-F may be detected by detectors 19-1 to 19-F, respectively, so that each frequency component of the input speech is sampled. In this case, the sampled data for each frequency component obtained from the detectors 19-1 to 19-F is read in sequence by a multiplexer 20 controlled by an interrupt command from the CPU 15, converted to digital values by the A/D converter 14, and supplied to the CPU 15. Using these sampled values as the feature vector gives the same effect as the embodiment described above.

The embodiment also shows one means of comparing temporally adjacent feature vectors, but neighboring feature vectors may instead be compared directly as they are. Furthermore, in the embodiment the averaging process is performed after the multiple learning process, but the averaging process may be performed first, with multiple learning then carried out by averaging the corresponding feature vectors.

As described above, the speech recognition device according to this invention computes the similarity of temporally adjacent feature vectors and, when they are judged to be feature vectors representing the same sound, averages them, so that a single averaged feature vector represents that sound in the learning pattern. This absorbs the variation and fluctuation in the speaker's utterance, avoids overestimating that variation, and yields a stable recognition rate. When learning is performed multiple times, pattern matching by dynamic programming is first performed between the input speech and the learning pattern, and the time warping function obtained in that process is used to associate the feature vectors of the input speech with those of the learning pattern before averaging. Multiple learning can thus be carried out accurately, and the variation between utterances can be absorbed. In addition, when it is determined during the recognition process that a learning pattern corresponding to the input speech exists, the input speech is learned. This makes it possible to omit the periodic learning operation that is otherwise needed when the device is used frequently. Moreover, because part of the learning pattern is updated successively, a learning pattern based on the current utterances is always maintained, and a stable, high recognition rate can be sustained.

[Brief explanation of the drawing]

FIG. 1 is a block diagram of a speech recognition device according to one embodiment of this invention; FIG. 2 is a flowchart of the learning process of the device; FIG. 3 is a flowchart of the recognition process; FIG. 4 shows an example of a time warping function; FIG. 5 is a flowchart of the multiple learning process; FIG. 6 is a flowchart of the averaging process; FIG. 7 shows the structure of an averaged learning pattern; and FIG. 8 is a block diagram of another embodiment of this invention.

11: microphone; 13: low-pass filter; 14: A/D converter; 15: CPU; 16: timer; 17: output device; 18-1 to 18-F: band-pass filters; 19-1 to 19-F: detectors; 20: multiplexer.

Applicant's agent: Patent Attorney Takehiko Suzue

Claims

[Claims]

(1) A speech recognition device comprising: means for analyzing speech input to a microphone and extracting, for each sampling unit time, a feature vector representing the input speech; means for computing the similarity of temporally consecutive feature vectors in the time-series feature vector sequence extracted by said means; and means for averaging the feature vectors when the similarity obtained by said means is determined to be at or above a specified value; wherein a learning pattern is created from the averaged feature vectors, or from the averaged feature vectors and the durations of the speech they represent.

(2) The device according to claim 1, comprising: means for determining, in a recognition operation between input speech and a learning pattern that has already been created, whether the input speech is the same speech; and means for performing, when the speech is determined to be the same, an averaging operation between the corresponding feature vectors of the input speech and the learning pattern; wherein said averaging means is applied to the resulting averaged feature vector sequence to create a learning pattern by multiple learning.

(3) The device according to claim 2, comprising: means for determining whether input speech during the recognition process is the same as a learning pattern; and means for performing said multiple learning on the input speech when it is determined to be the same; whereby the learning pattern is updated successively.