JPS6273299A

JPS6273299A - Voice recognition system

Info

Publication number: JPS6273299A
Application number: JP21341885A
Authority: JP
Inventors: 森戸　誠; 田部井　幸雄; 山田　興三
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1985-09-26
Filing date: 1985-09-26
Publication date: 1987-04-03

Abstract

(57) [Abstract] This publication consists of application data filed before electronic filing was introduced, so no abstract data is recorded.

Description

[Detailed Description of the Invention] (Field of Industrial Application) The present invention relates to a speech recognition method with high recognition accuracy.

(Prior Art) Research and development on speech recognition has long been pursued in order to make input to information and communication equipment more efficient and to improve system functions. A common way of performing such speech recognition is the pattern matching method.

Before describing the present invention, the conventional pattern matching method is first explained with reference to FIG. 6.

In FIG. 6, 10 is a speech input terminal, 11 is a speech analysis unit, 12 is a section detection unit, 13 is an input memory unit, 14 is a comparison pattern memory unit, 15 is a similarity calculation unit, 16 is a decision unit, and 17 is an output terminal.

In this conventional recognition method, the speech supplied to the speech input terminal 10 is converted in the speech analysis unit 11 into a time-series pattern of feature vectors (hereinafter referred to as a speech pattern). The speech pattern is generally obtained by sampling, at every time interval T0 (for example, 8 milliseconds), the in-band frequency components extracted by a group of band-pass filters with different center frequencies. The speech analysis unit 11 also calculates the speech power at the time points corresponding to the speech pattern. The speech pattern calculated by the speech analysis unit 11 is stored sequentially in the input memory unit 13, and the speech power is output to the section detection unit 12.

The section detection unit 12 determines the speech section, that is, the start and end points of the speech, from the speech power supplied by the speech analysis unit 11. Algorithms for determining the start and end points from the speech power include complex ones such as that disclosed in Japanese Patent Application No. Sho 59-108668, a simple one that takes the point at which the speech power rises to or above a threshold as the start of the speech and the point at which it falls below the threshold as the end, and others; section detection is performed with whichever algorithm is appropriate. The speech pattern between the start and end points determined by the section detection unit 12 is read from the input memory unit and sent to the similarity calculation unit 15. The similarity calculation unit 15 separately receives comparison patterns from the comparison pattern memory unit 14. A comparison pattern is a time-series pattern of vectors obtained by applying the same speech analysis processing as for the speech pattern to a word to be recognized (hereinafter referred to as a category), and is stored in the comparison pattern memory unit 14 in advance.

Comparison patterns are created before being stored, and how they are created depends on the purpose of the recognition. In a speaker-dependent recognition method, for example, speech uttered by the designated speaker is processed by the speech analysis unit 11 or by equivalent speech analysis processing, and the resulting speech pattern is stored in the comparison pattern memory unit 14 as a comparison pattern.

The similarity calculation unit 15 calculates the similarity between the speech pattern and each comparison pattern. This calculation uses, for example, the method known as weighted linear matching disclosed in "Oki Electric Research and Development" No. 118, 48, (3) (December 1982), pp. 53-58, or another suitable method.

Using the per-category similarities output by the similarity calculation unit 15, the decision unit 16 outputs, as the recognition result, the category name assigned to the comparison pattern that gives the maximum similarity.

The above is an outline of the conventional speech recognition method based on pattern matching.

(Problems to be Solved by the Invention) The conventional recognition method described above evaluates, by a measure called similarity, the difference between a speech pattern describing the shape of the speech spectrum and comparison patterns calculated in advance by the same analysis processing, and takes the category name of the comparison pattern giving the maximum similarity as the recognition result. The similarity is therefore large when the speech pattern and the comparison pattern belong to the same category and small when they do not. However, when the shape of the speech spectrum is distorted by something other than the speech itself, for example by external noise, the similarity can no longer be relied on to be large even within the same category.

In addition, the conventional recognition method takes time to compute and requires a large storage capacity, so a device implementing it becomes large.

In view of these problems, an object of the present invention is to provide a speech recognition method that achieves good recognition accuracy even in a noisy environment.

Another object of the present invention is to provide a speech recognition method that computes quickly and needs only a small storage capacity, so that a device built on it can be simple and compact.

(Means for Solving the Problems) To achieve the above objects, the speech recognition method of the present invention takes the following measures.

(a) First, the frequency components of the input speech are extracted by a plurality of band-pass filters, and their outputs are sampled at a fixed time interval T0 (referred to as a speech frame) to calculate a feature vector.

(b) A noise pattern is calculated by time-averaging the feature vectors in a predetermined noise section that is known in advance to contain only noise.

(c) After the noise pattern has been extracted, the noise pattern is subtracted from each feature vector to calculate a speech feature vector.

(d) For each speech frame, a least-squares approximation line is calculated from the speech feature vector, and a local peak vector is calculated by setting to 1 the components corresponding to the channels that are local maxima along the frequency axis relative to this line.

(e) The frame power of each speech frame is calculated from its speech feature vector, and the start and end points of the speech are determined from this frame power.

(f) The local peak vectors calculated for each speech frame from the start point to the end point are linearly expanded or contracted along the time axis to a fixed number of speech frames.

(g) At registration time, processing corresponding to items (a) to (f) above, as applied to input speech, is performed on the speech of each word to be recognized in order to create a comparison pattern (this is referred to as registration processing).

(h) At recognition time, linear matching is performed between the input pattern, obtained by applying items (a) to (f) above to the uttered speech, and each comparison pattern, and the pattern similarity between the comparison pattern and the input pattern is calculated.

(i) Among the pattern similarities calculated for the comparison patterns, the category name attached to the comparison pattern giving the maximum similarity is taken as the recognition result.

The recognition result for the input speech is obtained in this way.

The processing of items (a), (b) and (c) above extracts only the speech from input received under high noise, and it also makes the speech section detection of item (e), which is considered difficult under high noise, easier.

Furthermore, using the local peak vector calculated in item (d) for the similarity calculation of items (h) and (i) improves recognition performance in high-noise environments. This is because the similarity calculation uses local peak vectors, which are determined by the positions of the peaks of the speech spectrum, rather than vectors describing the shape of the spectrum as in the conventional method. It relies on the fact that when noise is mixed in, the shape of the spectrum changes greatly but the positions of its peaks do not.

(Operation) Next, the operation of the present invention is described.

The functions that realize the speech recognition method of the present invention are made up of the processing units shown in FIG. 1.

The processing is described in detail below.

Speech is converted into an electrical signal by a microphone, passed through an amplifier (not shown) and a low-pass filter (not shown), and sent to an A/D converter (not shown), where it is sampled, for example, every 83 microseconds and then supplied to the input terminal 21.

Each of the items above is described below.

[Feature vector calculation processing of item (a)] The feature vector calculation unit 22 performs frequency analysis on the speech data supplied to the input terminal 21 and converts it into a time series of feature vectors, one per speech frame.

For frequency analysis, the feature vector calculation unit 22 comprises a plurality of band-pass filters, each with a different center frequency as shown in FIG. 2, low-pass filters, and sampling means that samples once per speech frame (these components themselves are not shown in the drawings).

Each band-pass filter extracts from the speech only the component around its center frequency. The data series separated by each band-pass filter in this way is called a channel. The band-pass output of each channel is converted to its absolute value and then fed to a low-pass filter. The low-pass filter output of each channel is resampled by the sampling means once per speech frame period to obtain the components of the feature vector.

Let $a_i^k$ be the low-pass filter output of channel $k$ in the $i$-th speech frame. The feature vector $a_i$ in the $i$-th speech frame can then be expressed as $a_i = (a_i^1, a_i^2, \ldots, a_i^k, \ldots, a_i^K)$, where $K$ is the number of channels.
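The rectify, smooth and resample chain described above can be sketched as follows. This is a minimal illustration rather than the patent's implementation: it assumes the band-pass filtering has already been done, and it uses a plain block average over each frame as a stand-in for the low-pass filter plus frame-rate resampling. The function name `frame_features` and its parameters are hypothetical.

```python
from typing import List


def frame_features(channel_outputs: List[List[float]], frame_len: int) -> List[List[float]]:
    """Turn per-channel band-pass outputs into one feature vector per frame.

    channel_outputs[k] holds the band-pass output samples of channel k
    (the band-pass filtering itself is assumed to have been done already).
    frame_len is the number of samples in one speech frame.
    """
    n_frames = min(len(ch) for ch in channel_outputs) // frame_len
    features = []
    for i in range(n_frames):
        start = i * frame_len
        vec = []
        for ch in channel_outputs:
            block = ch[start:start + frame_len]
            # Absolute value followed by a block average: an assumed,
            # simplified substitute for the low-pass filter and resampler.
            vec.append(sum(abs(x) for x in block) / frame_len)
        features.append(vec)
    return features
```

Each returned vector plays the role of $a_i$, with one component per channel.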

[Noise pattern calculation processing of item (b)] This processing is performed by the noise pattern calculation unit 23. A section in which only noise is input and no speech is input is set, for example, as 10 consecutive speech frames (the number of frames is not essential), and this is called the noise section.

A feature vector in the noise section represents the spectral shape of the noise; it is specifically called a noise vector and is denoted $n_i$. The average of the noise spectrum over the noise section is calculated from these noise vectors, and this average is called the noise pattern.

Letting $N^k$ be the components of the noise pattern $N$, we have $N = (N^1, N^2, \ldots, N^k, \ldots, N^K)$.

[Speech feature vector calculation processing of item (c)] This processing is performed by the speech feature vector calculation unit 24.

After the noise section, that is, once the noise pattern has been calculated, the noise pattern $N$ from the noise pattern calculation unit 23 is subtracted from each feature vector $a_i$ output by the feature vector calculation unit 22 to give the speech feature vector $b_i = (b_i^1, b_i^2, \ldots, b_i^k, \ldots, b_i^K)$.

The processing in the unit 24 is a technique for improving speech recognition performance in high-noise environments, and it is effective when the noise is comparatively stationary.
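Items (b) and (c) can be sketched together as below. The averaging follows the text directly. The clamp to a small positive `floor` in `subtract_noise` is an assumption added here so that the later logarithm stays defined; the source does not say how non-positive differences are handled. Both function names are hypothetical.

```python
from typing import List


def noise_pattern(noise_frames: List[List[float]]) -> List[float]:
    """Time-average the feature vectors of the noise section, channel by channel."""
    n = len(noise_frames)
    k_channels = len(noise_frames[0])
    return [sum(frame[k] for frame in noise_frames) / n for k in range(k_channels)]


def subtract_noise(feature: List[float], noise: List[float], floor: float = 1e-6) -> List[float]:
    """Subtract the noise pattern from one feature vector.

    The clamp to `floor` is an assumption of this sketch, not something
    stated in the source.
    """
    return [max(a - n, floor) for a, n in zip(feature, noise)]
```

In use, `noise_pattern` would be called once on the frames of the noise section, and `subtract_noise` on every subsequent frame.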

[Local peak vector calculation processing of item (d)] This processing is performed by the local peak vector calculation unit 25.

The speech feature vector $b_i$ sent from the speech feature vector calculation unit 24 is converted into a local peak vector $r_i$ in the local peak vector calculation unit 25.

This conversion is explained with reference to FIGS. 3(A) to 3(C).

Each component $b_i^k$ of the speech feature vector $b_i$ is first logarithmically transformed to give $x_i(k)$.

FIG. 3(A) shows an example of the logarithmically transformed speech feature vector components $x_i(k)$, with the channel number $k$ on the horizontal axis and $x_i(k)$ on the vertical axis. The figure represents the shape of the logarithmic spectrum of the speech in the $i$-th speech frame.

Next, normalization is performed using the least-squares approximation line $y_i(k) = u_i k + v_i$ fitted to $x_i(k)$.

$z_i(k) = x_i(k) - y_i(k) = x_i(k) - u_i k - v_i$   (5)

An example of the normalized speech feature vector components $z_i(k)$ is shown in FIG. 3(B), with the channel number on the horizontal axis and $z_i(k)$ on the vertical axis.

Next, the local peak vector $r_i$ is calculated from $z_i(k)$ on the basis of the condition in equation (6).

For each $k$ that satisfies the condition of equation (6), $r_i^k = 1$; for each $k$ that does not, $r_i^k = 0$. A vector $r_i = (r_i^1, r_i^2, \ldots, r_i^k, \ldots, r_i^K)$ having these values as components is calculated. This vector $r_i$ is called the local peak vector. An example of the local peak vector $r_i$ is shown in FIG. 3(C).

[Speech section detection processing of item (e)] This processing is performed by the speech section detection unit 26.

For each speech frame, the frame power $P_i$ of that frame is calculated from the speech feature vector $b_i$ produced by the speech feature vector calculation unit 24.

The speech section detection unit 26 detects the speech section using the frame power $P_i$ obtained from the speech feature vector $b_i$.

As mentioned above, various algorithms have been proposed for speech section detection. The present invention is not concerned with the algorithm itself; its point is that the speech feature vector $b_i$, obtained by subtracting the noise pattern $N$ from the feature vector $a_i$, is used for speech section detection. For convenience of explanation, therefore, the speech frame at which the frame power $P_i$ reaches or exceeds a predetermined threshold $P_s$ is taken here as the start point $I_S$ of the speech, and the speech frame after the start point at which the frame power $P_i$ falls below the threshold $P_s$ is taken as the end point $I_E$.
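The simple threshold rule used here for explanation can be sketched as below. The definition of frame power is an assumption, since the source equation for it is missing: this sketch uses the sum of squared components. Returning the last frame as the end point when the power never drops below the threshold is likewise an assumption. The names `frame_power` and `detect_endpoints` are hypothetical.

```python
from typing import List, Optional, Tuple


def frame_power(b: List[float]) -> float:
    """Frame power of one speech feature vector (assumed: sum of squares)."""
    return sum(v * v for v in b)


def detect_endpoints(powers: List[float], threshold: float) -> Optional[Tuple[int, int]]:
    """Return (start, end) frame indices, or None if no frame reaches the threshold.

    start: first frame whose power is at or above the threshold.
    end:   first later frame whose power is below the threshold.
    """
    start = None
    for i, p in enumerate(powers):
        if start is None:
            if p >= threshold:
                start = i
        elif p < threshold:
            return start, i
    if start is None:
        return None
    # Assumption: if the power never drops again, end at the last frame.
    return start, len(powers) - 1
```

A practical detector would usually add hangover or minimum-duration logic; the text deliberately leaves the choice of algorithm open.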

FIGS. 4(A) and 4(B) show frame power characteristics for the input speech "Sarapolo" with automobile noise added so that the S/N is 10 dB. FIG. 4(A) shows the frame power $P_i$ calculated from the speech feature vector $b_i$ in a noise-free environment, and FIG. 4(B) shows the frame power $P_i'$ calculated in the same manner from the feature vector $a_i$ in a noisy environment. In each, time is plotted on the horizontal axis and frame power on the vertical axis.

As FIGS. 4(A) and 4(B) show, the frame power $P_i$ obtained from the speech feature vector $b_i$, from which the noise pattern has been subtracted, clearly separates the sections in which speech is being uttered from those in which it is not. Speech section detection is therefore easy even in a noisy environment.

[Linear expansion/contraction processing of item (f)] This processing is performed by the linear expansion/contraction unit 27. The local peak vectors between the start point $I_S$ and the end point $I_E$ detected by the speech section detection unit 26 are linearly expanded or contracted along the time axis to a fixed number of speech frames. This is done mainly to simplify the linear matching described later, and also to simplify memory area management when the comparison patterns described later are stored.

Next, the method of linear time-axis expansion and contraction is described. For the sake of explanation, consider expansion or contraction to 32 speech frames. Let the start point be $I_S$ and the end point be $I_E$, and let $i'$ ($i' = 1$ to $32$) be the speech frame number after linear expansion/contraction. The speech frame number $i$ before expansion/contraction is calculated from equation (8), and the local peak vector $r_i$ of the $i$-th speech frame before expansion/contraction is taken as the local peak vector $r'_{i'}$ of the $i'$-th speech frame after expansion/contraction. In equation (8), [ ] denotes the Gauss bracket (the floor function).

As a result, the sequence of local peak vectors from the start point to the end point, $r_{I_S}, r_{I_S+1}, \ldots, r_i, \ldots, r_{I_E-1}, r_{I_E}$, is linearly expanded or contracted into the vector sequence $r'_1, r'_2, \ldots, r'_{i'}, \ldots, r'_{31}, r'_{32}$.
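Linear time normalization to a fixed number of frames can be sketched as follows. Equation (8) is not legible in the source, so the index mapping used here is an assumed, common form: the target positions are spread evenly between the start and end frames and rounded to the nearest source frame, with integer floor division playing the role of the Gauss bracket. The function name `linear_stretch` is hypothetical.

```python
from typing import List, Sequence


def linear_stretch(frames: Sequence[Sequence[int]], start: int, end: int,
                   target_len: int = 32) -> List[List[int]]:
    """Pick target_len frames between start and end (inclusive) by linear mapping.

    target_len must be at least 2.
    """
    span = end - start
    out = []
    for j in range(target_len):              # j corresponds to i' - 1
        # i = start + floor(span * j / (target_len - 1) + 0.5),
        # written with integers only (assumed form of the mapping).
        i = start + (2 * span * j + (target_len - 1)) // (2 * (target_len - 1))
        out.append(list(frames[i]))
    return out
```

Utterances shorter than the target length repeat some frames and longer ones skip some, so every pattern ends up the same size.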

Hereinafter, unless otherwise noted, speech frames are numbered as they are after linear expansion/contraction.

[Comparison pattern calculation and storage processing of item (g)] This processing is performed by the comparison pattern storage unit 28.

In a speaker-dependent recognition method, each word to be recognized (hereinafter referred to as a category) must be uttered in advance, and a pattern representing that word (referred to as a comparison pattern) must be stored in advance. The comparison pattern storage unit 28 holds these comparison patterns. The method of creating a comparison pattern is described below. The processing that creates comparison patterns is called registration processing.

For the sake of explanation, let the number of categories be M.

A comparison pattern can also be created by uttering the same category several times and averaging the resulting patterns, but in this example a comparison pattern is created from a single utterance of each category.

The speech used to create a comparison pattern is called training speech.

The digitized $m$-th training speech is sent from the input terminal 21 to the feature vector calculation unit 22, which calculates the feature vectors of the training speech. The noise pattern calculation unit 23, meanwhile, has already extracted a noise pattern from a period in which no training speech was being input. The speech feature vector calculation unit 24 therefore subtracts the noise pattern supplied by the noise pattern calculation unit 23 from the feature vectors supplied by the feature vector calculation unit 22 to obtain the speech feature vectors of the training speech.

Next, the local peak vector calculation unit 25 converts these speech feature vectors into local peak vectors.

Meanwhile, the speech section detection unit 26 calculates the power of the training speech and detects its start and end points.

Next, the linear expansion/contraction unit 27 applies linear expansion or contraction along the time axis, converting the result into a sequence of local peak vectors 32 frames long. A local peak vector of the training speech is specifically called a comparison local peak vector and is denoted ${}^m s_i$:

${}^m s_i = ({}^m s_i^1, {}^m s_i^2, \ldots, {}^m s_i^k, \ldots, {}^m s_i^K)$

The pattern represented by the sequence of comparison local peak vectors is denoted $S_m$ and is called a comparison pattern.

The comparison pattern $S_m$ for each category is stored in the comparison pattern storage unit 28 together with the corresponding category name $C_m$.

As already explained, the linear expansion/contraction processing makes every comparison pattern $S_m$ the same size, so memory address management when storing multiple comparison patterns becomes very easy.

[Linear matching of item (h)] This processing is performed by the linear matching unit 29.

In contrast to the registration processing that creates comparison patterns as described above, the processing carried out when performing recognition is called recognition processing. The speech input during recognition processing is called input speech.

The speech section of the input speech is also determined by the speech section detection unit 26.

The input speech is also subjected to processing identical or similar to items (a) to (f) above to obtain local peak vectors $r_i$ (referred to as input local peak vectors).

The pattern of the input speech represented in this way by the time series of input local peak vectors from the start point to the end point is called the input pattern and is denoted $R$.

As already explained, the $m$-th comparison pattern $S_m$ is likewise represented as a time series from the start point to the end point and is stored in the comparison pattern storage unit 28.

Next, the processing that calculates the similarity between the input pattern $R$ and the comparison pattern $S_m$ is described.

Methods for calculating pattern similarity include nonlinear DP matching, among others, but the present invention uses linear matching, which is simple to process.

The pattern similarity $D_m$ between the input pattern $R$, represented by 32 input local peak vectors $r_i$, and the comparison pattern $S_m$, represented by 32 comparison local peak vectors ${}^m s_i$, is defined by an expression over these vectors, in which the right superscript $t$ denotes the transpose of a vector.

[Decision processing of item (i)] This processing is performed by the decision processing unit 30. The maximum is determined among the pattern similarities $D_m$ obtained for the categories.

The category name $C_{m_{\max}}$ corresponding to the number $m_{\max}$ of the comparison pattern that gives the maximum value is output from the output terminal 31 as the recognition result.
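The decision step amounts to an arg-max over the per-category similarities, which can be sketched as follows. The dictionary layout and the name `recognize` are illustrative assumptions. When several categories tie, this sketch returns the first one encountered; the source does not specify tie handling.

```python
from typing import Dict


def recognize(similarities: Dict[str, int]) -> str:
    """Return the category name whose comparison pattern scored highest."""
    return max(similarities, key=similarities.get)
```

For example, `recognize({"one": 3, "two": 7, "three": 5})` selects `"two"`.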

As is clear from the above, the speech recognition method of the present invention calculates the frame power from speech feature vectors obtained by removing the noise pattern from the input speech, and uses it to detect the speech section. As the comparison in FIGS. 4(A) and 4(B) between the frame power $P_i$ calculated from the speech feature vector and the frame power $P_i'$ calculated from the unprocessed feature vector shows, speech section detection errors are therefore few. Input speech can thus be recognized with high accuracy even in a noisy environment.

Furthermore, because the pattern similarity is calculated using local peak vectors derived from the speech feature vectors, the computation is extremely simple.

In addition, because the comparison patterns also consist of comparison local peak vectors, their storage requirement can be made very small. Together with the simplified computation described above, this allows the speech recognition system to be made compact.

(Embodiment) An embodiment of the present invention is described below with reference to FIG. 5.

FIG. 5 is a block diagram showing a specific circuit configuration for implementing an embodiment of the speech recognition method of the present invention.

In FIG. 5, 41 is a microphone, 42 is an amplifier for amplifying the speech signal, 43 is a low-pass filter, 44 is an A/D converter that converts the speech into a digital signal, 45 is a signal processor that calculates feature vectors, 46 is a processor, 47 is a program memory storing the processor's program, 48 is a comparison pattern memory for storing comparison patterns, 49 is a working memory, 50 is a noise pattern memory for storing the noise pattern, and 51 is an interface for outputting the recognition result to the outside. Strictly speaking, interface circuits are needed between the components, but they are omitted here.

Next, an example of the speech recognition method of the present invention is described with reference to FIG. 5.

The input speech from the microphone 41 is amplified by the amplifier 42, after which the low-pass filter (LPF) 43 removes its high-frequency components.

Next, the input speech from which the high-frequency components have been removed is sampled by the A/D converter 44 at a sampling frequency of, for example, 12 kHz with 12-bit resolution. The filtering in the low-pass filter 43 is a prerequisite for this sampling, so a low-pass filter with, for example, a cutoff frequency of 5 kHz and an attenuation slope of 48 dB/oct is used.
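As a rough check of the example figures above (12 kHz sampling, 5 kHz cutoff, 48 dB/oct), the following sketch estimates how strongly components above the cutoff are attenuated before sampling. It assumes an idealized filter whose attenuation is zero up to the cutoff and then falls along the asymptotic 48 dB/oct slope; the function name `rolloff_attenuation_db` is hypothetical and a real filter will deviate near the cutoff.

```python
import math

def rolloff_attenuation_db(freq_hz, cutoff_hz=5000.0, slope_db_per_oct=48.0):
    """Idealized attenuation in dB at freq_hz (assumption: asymptotic slope only)."""
    if freq_hz <= cutoff_hz:
        return 0.0
    # Number of octaves above the cutoff times the slope per octave.
    return slope_db_per_oct * math.log2(freq_hz / cutoff_hz)

sampling_hz = 12000.0
nyquist_hz = sampling_hz / 2.0          # 6 kHz
# A 7 kHz component would fold back onto 5 kHz (12 kHz - 7 kHz) after sampling.
for f in (nyquist_hz, 7000.0, 10000.0):
    print(f"{f:7.0f} Hz: {rolloff_attenuation_db(f):5.1f} dB")
```

Under this idealization, a component one octave above the cutoff (10 kHz) is attenuated by 48 dB, while components just above the 6 kHz Nyquist frequency are attenuated far less.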

The digital speech data sampled by the A/D converter 44 is converted into feature vectors by the signal processor 45. For example, the 32010 manufactured by TI can be used as the signal processor 45.

The processor 46 performs processing using the feature vector output from the signal processor 45 every speech frame period. This processing is divided into (1) registration processing and (2) recognition processing, each of which is described below.

[Registration processing] This processing is divided into the following steps.

Noise pattern calculation, speech feature vector calculation, comparison local peak vector calculation, speech section detection, and linear expansion/contraction with comparison pattern storage. Each of these is described below.

(Noise pattern calculation processing) For the registration processing, a period of, for example, 10 speech frames is defined as the noise section. During this section the speaker does not speak, so that only the ambient noise enters through the microphone 41.

This noise input is sent through the signal path (42, 43, 44) to the signal processor 45, which produces noise vectors that are stored one after another in the working memory 49. Once the noise vectors for 10 speech frames have been stored in the working memory 49, they are averaged and the average is stored in the noise pattern memory 50.
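The averaging step can be sketched as follows. This is a minimal illustration, assuming each noise vector is a plain list with one value per filter channel; the function name `noise_pattern` and the sample data are hypothetical and not taken from the source.

```python
def noise_pattern(noise_frames):
    """Average the noise vectors channel by channel over the noise section."""
    if not noise_frames:
        raise ValueError("at least one noise frame is required")
    n_frames = len(noise_frames)
    n_channels = len(noise_frames[0])
    return [
        sum(frame[k] for frame in noise_frames) / n_frames
        for k in range(n_channels)
    ]

# Ten noise-only frames with three channels each (made-up values).
frames = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]] * 5
print(noise_pattern(frames))
```

The result has the same number of channels as a single feature vector, so it can later be subtracted from each incoming frame.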

(Speech feature vector calculation processing) After the noise section ends, a speech feature vector is calculated by subtracting the noise pattern held in the noise pattern memory 50 from the feature vector input from the signal processor 45, and is stored in the working memory 49.

This processing is performed every speech frame period. The speech feature vectors obtained before the speech section detection processing finds the start point are not needed, so they are discarded as appropriate in order to use the working memory 49 efficiently.
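The subtraction described above can be sketched per frame as shown below. The sketch assumes list-valued vectors of equal length; the name `speech_feature_vector` is hypothetical. The text here does not say whether negative results are clipped, so the sketch leaves them as they are.

```python
def speech_feature_vector(feature, noise):
    """Subtract the stored noise pattern from one frame's feature vector."""
    if len(feature) != len(noise):
        raise ValueError("feature vector and noise pattern differ in length")
    # Plain channel-wise subtraction; no clipping is applied (assumption).
    return [a - n for a, n in zip(feature, noise)]

print(speech_feature_vector([5.0, 3.0, 1.0], [1.0, 1.0, 1.0]))
```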

(Comparison local peak vector calculation processing) The speech feature vectors stored in the working memory 49 are converted into comparison local peak vectors by the processing of item (d) described above and stored in the working memory 49. This processing is also performed every speech frame period. The comparison local peak vectors obtained before the start point is detected are likewise discarded as appropriate.
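One possible reading of the local peak conversion is sketched below. It assumes that a least-squares line is fitted to the channel values of one frame, and that a channel is marked 1 when its value lies above that line and is strictly greater than its neighbouring channels; all other channels are marked 0. The exact peak criterion of item (d) is not reproduced in this section, so this rule and the name `local_peak_vector` are assumptions.

```python
def local_peak_vector(x):
    """Binary vector marking channels that are local maxima above the fitted line."""
    n = len(x)
    if n < 2:
        raise ValueError("need at least two channels to fit a line")
    k_mean = (n - 1) / 2.0
    x_mean = sum(x) / n
    # Least-squares slope and intercept over channel index k = 0..n-1.
    slope = sum((k - k_mean) * (v - x_mean) for k, v in enumerate(x)) / sum(
        (k - k_mean) ** 2 for k in range(n)
    )
    intercept = x_mean - slope * k_mean
    peaks = []
    for k, v in enumerate(x):
        above_line = v > slope * k + intercept
        higher_than_left = k == 0 or v > x[k - 1]
        higher_than_right = k == n - 1 or v > x[k + 1]
        peaks.append(1 if above_line and higher_than_left and higher_than_right else 0)
    return peaks

print(local_peak_vector([1.0, 5.0, 2.0, 1.0, 6.0, 2.0, 1.0, 1.0]))
# -> [0, 1, 0, 0, 1, 0, 0, 0]
```

Because each frame is reduced to a vector of 0s and 1s, both the later comparison and the stored patterns need far less computation and memory than the original channel values.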

(Speech section detection processing) The frame power is calculated from the speech feature vectors stored in the working memory 49.

The start and end of the speech are determined by comparing this frame power with a threshold.
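A simplified sketch of threshold-based section detection follows. It assumes that the frame power is the sum of the channel values of the speech feature vector and that the section runs from the first to the last frame whose power exceeds a fixed threshold. The actual power formula and any duration conditions are not given in this section, and the names `frame_power` and `detect_section` are hypothetical.

```python
def frame_power(speech_feature):
    """Frame power taken as the sum of the channel values (assumption)."""
    return sum(speech_feature)

def detect_section(powers, threshold):
    """Return (start, end) frame indices, or None if no frame exceeds the threshold."""
    above = [i for i, p in enumerate(powers) if p > threshold]
    if not above:
        return None
    return above[0], above[-1]

powers = [0.1, 0.2, 3.0, 4.0, 2.5, 0.3, 0.1]
print(detect_section(powers, threshold=1.0))  # -> (2, 4)
```

Because the noise pattern has already been subtracted, the power of noise-only frames stays close to zero, which makes a fixed threshold easier to choose.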

(Linear expansion/contraction and comparison pattern storage processing) Of the comparison local peak vectors stored in the working memory 49, those from the start point to the end point are linearly expanded or contracted in time by the processing of item (f) and stored in the comparison pattern memory 48.
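Linear time normalization can be sketched as picking, for each output frame, the input frame at the proportionally corresponding position. The target length of 32 frames matches the figure used later in the pattern similarity step; the rounding rule and the name `linear_normalize` are assumptions, since the formula of item (f) is not reproduced in this section.

```python
def linear_normalize(frames, target_len=32):
    """Linearly stretch or shrink a frame sequence to target_len frames."""
    n = len(frames)
    if n == 0 or target_len < 1:
        raise ValueError("need at least one input frame and one output frame")
    if target_len == 1:
        return [frames[0]]
    out = []
    for j in range(target_len):
        # Nearest input index to j * (n - 1) / (target_len - 1), in integer arithmetic.
        i = (2 * j * (n - 1) + (target_len - 1)) // (2 * (target_len - 1))
        out.append(frames[i])
    return out

print(linear_normalize([[0], [1], [2], [3], [4]], target_len=3))  # -> [[0], [2], [4]]
```

After this step every utterance has the same number of frames, so patterns can be compared frame by frame without any further time alignment.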

The registration processing described above creates one comparison pattern from one training utterance. To improve recognition performance, however, it is considered better to create the comparison pattern from several training utterances of the same category. In that case various approaches are possible, such as averaging the comparison patterns created from the several utterances into a single comparison pattern, or keeping all of the individual comparison patterns. Because this is not essential to the present invention, a detailed description is omitted.

[Recognition processing] This processing is further divided into the following steps.

Noise pattern calculation, speech feature vector calculation, input local peak vector calculation, speech section detection, linear expansion/contraction, pattern similarity calculation, and determination.

(Noise pattern calculation processing) Because the noise conditions may have changed between registration and recognition, the noise pattern is calculated again.

Ideally the noise pattern would be calculated before each word is input. However, this slows down word input and makes it likely that the speaker will speak during the noise measurement, so it is more realistic to provide a dedicated noise section as appropriate and measure the noise pattern in that section.

As in registration, a period of 10 speech frames is defined as the noise section, during which the speaker does not speak. In this state only the ambient noise enters through the microphone 41 and is sent to the signal processor 45 as described above, and the resulting noise vectors are stored one after another in the working memory 49. Once the noise vectors for 10 speech frames have been stored, they are averaged and the average noise vector is stored in the noise pattern memory 50.

(Speech feature vector calculation processing) After the noise section ends, speech feature vectors are calculated using the new noise pattern.

A speech feature vector is calculated by subtracting the noise pattern stored in the noise pattern memory 50 from the feature vector input from the signal processor 45, and is stored in the working memory 49. This processing is performed every speech frame period. The speech feature vectors obtained before the start point detection described later are not needed and are discarded as appropriate.

(Input local peak vector calculation processing) The speech feature vectors stored in the working memory 49 are converted into input local peak vectors by the processing of item (d) described above and stored in the working memory 49.

This processing is also performed every speech frame period. The input local peak vectors obtained before the start point is detected are likewise discarded as appropriate.

(Speech section detection processing) The frame power Pi is calculated from the speech feature vectors stored in the working memory 49. The start and end of the speech are determined by comparing this frame power Pi with a threshold.

(Linear expansion/contraction processing) The input local peak vectors from the start point to the end point stored in the working memory 49 are linearly expanded or contracted in time by the processing of item (f) and stored in the working memory 49.

(Pattern similarity calculation processing) The pattern similarity calculation processing of item (h) described above (linear matching) is performed between the 32 input local peak vectors stored in the working memory 49 and the 32 comparison local peak vectors per category stored in the comparison pattern memory 48, and the resulting pattern similarities are stored in the working memory 49.
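Frame-by-frame (linear) matching between two equal-length sequences of binary local peak vectors can be sketched as below. The sketch assumes that the similarity is the number of positions where both patterns have a peak, summed over all frames. The actual similarity measure of item (h) is not reproduced in this section, so this measure and the name `pattern_similarity` are assumptions.

```python
def pattern_similarity(input_pattern, comparison_pattern):
    """Count coinciding peaks, frame by frame, between two binary patterns."""
    if len(input_pattern) != len(comparison_pattern):
        raise ValueError("patterns must have the same number of frames")
    return sum(
        a * b
        for in_frame, cmp_frame in zip(input_pattern, comparison_pattern)
        for a, b in zip(in_frame, cmp_frame)
    )

inp = [[0, 1, 0], [1, 0, 0]]
ref = [[0, 1, 0], [0, 0, 1]]
print(pattern_similarity(inp, ref))  # -> 1
```

With binary vectors the products reduce to logical ANDs, which illustrates why the arithmetic stays simple.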

(Determination processing) Using the pattern similarities stored in the working memory 49, the determination processing of item (i) described above is performed, and the resulting category name C_{m_max} is output to the outside through the interface 51.
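The determination step amounts to picking the category whose comparison pattern gave the largest similarity. A minimal sketch, assuming the similarities are held in a dictionary keyed by category name (the name `decide` and the sample categories are hypothetical):

```python
def decide(similarities):
    """Return the category name with the maximum pattern similarity."""
    if not similarities:
        raise ValueError("no comparison patterns were evaluated")
    return max(similarities, key=similarities.get)

print(decide({"zero": 12, "one": 27, "two": 19}))  # -> one
```

If several categories tie, this sketch returns the first one encountered; the source does not say how ties are resolved.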

(Effects of the Invention) As is clear from the above description, the present invention provides the following effects.

(1) Because the frame power is calculated, and the speech section detected, using speech feature vectors from which the noise pattern has been removed, speech section detection errors are reduced. The recognition accuracy for input speech is therefore higher than with conventional methods even in a noisy environment.

(2) Because the pattern similarity calculation uses local peak vectors calculated from the speech feature vectors, the arithmetic processing needed to implement the speech recognition method of the present invention is extremely simple.

(3) Because the comparison patterns also consist of comparison local peak vectors, their storage capacity is extremely small. Together with effect (2) above, this makes the structure of a device implementing the recognition method of the present invention simple and compact.

[Brief explanation of drawings]

FIG. 1 is a block diagram for explaining the recognition processing of the speech recognition method of the present invention. FIG. 2 is a diagram showing the characteristics of the band-pass filters used for speech analysis. FIG. 3 is an explanatory diagram for explaining local peak vector calculation. FIG. 4 is a diagram showing frame power. FIG. 5 is a block diagram showing an embodiment of the present invention. FIG. 6 is a block diagram for explaining a conventional speech recognition method.

21: input terminal
22: feature vector calculation unit
23: noise pattern calculation unit
24: speech feature vector calculation unit
25: local peak vector calculation unit
26: speech section detection unit
27: linear expansion/contraction unit
28: comparison pattern storage unit
29: linear matching unit
30: determination unit
31: output terminal
41: microphone
42: amplifier
43: low-pass filter
44: A/D converter
45: signal processor
46: processor
47: program memory
48: comparison pattern memory
49: working memory
50: noise pattern memory
51: interface

Procedural amendment, June 25, 1986

Claims

[Claims]

(1) A speech recognition method characterized by comprising:
(a) a process of frequency-analyzing input speech and calculating feature vectors, which are vectors whose components are the frequency components of the input speech, at fixed time intervals called speech frames;
(b) a process of calculating a noise pattern from the feature vectors in a predetermined section that is known in advance to contain only noise;
(c) a process of calculating a speech feature vector by subtracting the noise pattern from the feature vector calculated for each speech frame;
(d) a process of calculating, for each speech frame, a least-squares approximation line from the speech feature vector, and calculating a local peak vector by setting to 1 the components corresponding to channels that are maxima in the frequency direction with reference to the least-squares approximation line and setting the other components to 0;
(e) a process of calculating the frame power of each speech frame from the speech feature vector and detecting the start and end of the speech using the frame power;
(f) a process of linearly expanding or contracting the local peak vectors calculated for each speech frame from the start to the end to a fixed number of speech frames;
(g) a process of calculating a comparison pattern, by processing identical or similar to processes (a) to (f), from training speech uttered once or several times in advance for each word to be recognized, and storing the comparison pattern;
(h) a process of calculating the pattern similarity between an input pattern, obtained by applying processes (a) to (f) to the input speech to be recognized, and each comparison pattern by performing non-linear matching processing between them; and
(i) a process of outputting, as the recognition result, the category name attached to the comparison pattern that gives the maximum pattern similarity among the pattern similarities calculated for the respective comparison patterns.