JPS61230199A

JPS61230199A - Voice recognition

Info

Publication number: JPS61230199A
Application number: JP60070035A
Authority: JP
Inventors: 田部井　幸雄; 森戸　誠; 高橋　圭子
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1985-04-04
Filing date: 1985-04-04
Publication date: 1986-10-14

Abstract

(57) [Abstract] This publication contains application data filed before electronic filing was introduced, so no abstract data is recorded.

Description

DETAILED DESCRIPTION OF THE INVENTION (Field of Industrial Application) The present invention relates to a speech recognition method, and more particularly to a speech recognition method that performs word speech recognition using local peaks.

(Prior Art) A conventional speech recognition method of this type is described in the Transactions of the Institute of Electronics and Communication Engineers, J68-A (1) (January 1985), pp. 78-85. FIG. 2 is a flowchart of this conventional speech recognition method using local peaks. The input speech is frequency-analyzed every 10 ms by a bank of 15 bandpass filters (see 1 in FIG. 2). To normalize individual differences in glottal source characteristics, the speech spectrum is expressed logarithmically on both the amplitude and frequency axes, a least-squares approximation line is obtained (see 2 in FIG. 2), and the spectrum is corrected by taking the difference from that line. However, when the slope of the least-squares approximation line is positive, the difference from the mean value is taken instead. Then, as shown in FIG. 3, in each frame (10 ms), for each portion that is 0 dB or higher, the channel that takes the maximum value among those whose amplitude is at least 1/2 of the respective maximum is regarded as having a local peak and set to "1", and all other channels are set to "0", thereby binarizing the frame (see 3 in FIG. 2). The bandpass filter bank has 15 channels, but a 16th channel is appended to carry the sign of the slope: it is set to "1" when the slope of the least-squares approximation line is negative, in which case the frame is regarded as voiced, and to "0" when the slope is positive, in which case the frame is regarded as unvoiced (see 4 in FIG. 2).

The weighted-average dictionary is obtained as a multi-valued pattern by linearly stretching a plurality of binarized patterns along the time axis to the length of the longest one and adding them together (see 5 in FIG. 2).

To match a binary input pattern against the multi-valued weighted-average dictionary, the shorter pattern is linearly stretched in the time direction to the length of the longer one, a similarity measure is computed, and the category name of the standard pattern that gives the maximum similarity is taken as the recognition result (see 6 in FIG. 2).

(Problems to Be Solved by the Invention) However, although the conventional method described above works effectively in environments with a good signal-to-noise ratio, such as when a close-talking microphone is used, in a noisy environment it tends to pick up noise peaks as local peaks, particularly in silent portions, which increases recognition errors.

The present invention eliminates this problem of the prior art, namely that in a noisy environment noise peaks are extracted as local peaks of speech, particularly in silent portions, and recognition errors increase, and provides a speech recognition method that is highly robust to noise.

(Means for Solving the Problems) In the speech recognition method according to the present invention, the input speech signal is first frequency-analyzed into analysis data of a plurality of channels for each speech frame. Next, the average value of the analysis data, that is, the average power, is calculated for each speech frame.

Each item of analysis data is also spectrum-normalized using the least-squares approximation line of the speech spectrum. For the spectrum-normalized analysis data, the presence or absence of a local maximum along the frequency axis, that is, a local peak, is determined with the least-squares approximation line as the reference, and the data is converted into a binary local peak pattern in which a channel with a local peak is "1" and a channel without one is "0". The similarity between each of a plurality of standard patterns and the local peak pattern of the input speech is then calculated for each speech frame, weighted by the average power described above, and accumulated over all speech frames to give the final similarity. The category name of the standard pattern whose final similarity is the maximum is taken as the recognition result.

(Operation) Because the present invention calculates the final similarity by weighting the similarity between each standard pattern and the local peak pattern of the input speech by the average value of the frequency analysis data of the input speech, that is, the average power, it can compensate for imperfect local peak extraction in noisy speech, particularly in the low-level portions of the speech.

(Embodiment) FIG. 1 is a flowchart of the speech recognition method according to the present invention. The input speech signal is frequency-analyzed every 10 ms (the frame period) in frequency analysis step 11, and logarithmic values are output as the analysis data of a plurality of channels for each speech frame. Next, in average power calculation step 12, all the logarithmic values in each speech frame are averaged to calculate the average power. The logarithmic values of the frequency analysis data are also spectrum-normalized using the least-squares approximation line of the speech spectrum in the speech frame to which they belong (step 13). Next, for the spectrum-normalized logarithmic values, the presence or absence of a local maximum along the frequency axis, that is, a local peak, is determined with the least-squares approximation line as the reference, and the values are converted into a binary input local peak pattern in which a channel with a local peak is "1" and a channel without one is "0".

Standard patterns 15 are created in advance by linearly stretching or compressing a plurality of binary local peak patterns to the average utterance length of each category and adding them together.

In matching step 16, the binary input local peak pattern is linearly stretched or compressed to the frame length of each standard pattern, the similarity is calculated for each speech frame, each similarity is weighted by the average power output from average power calculation step 12, and the weighted similarities are accumulated over all speech frames to obtain the final similarity. The category name of the standard pattern that gives the maximum final similarity is output as the recognition result.

Next, an embodiment of a speech recognition apparatus based on the speech recognition method of the present invention will be described.

FIG. 4 is a block diagram showing the overall circuit configuration of an embodiment of the speech recognition apparatus according to the present invention.

In FIG. 4, 101 is a speech input terminal, 102 is a frequency analysis unit, 103 is a spectrum normalization unit, 104 is an average power memory, 105 is a speech interval detection unit, 106 is a local peak feature extraction unit, 107 is a standard pattern memory, 108 is a linear matching unit, 109 is a decision unit, and 110 is an output terminal for the recognition result. The operation is described below.

First, the input speech signal, converted into an electrical signal by a microphone or the like, is applied to speech input terminal 101, and its frequency spectrum is extracted by frequency analysis unit 102. FIG. 5 shows the circuit configuration of frequency analysis unit 102.

In FIG. 5, 201 is a preamplifier (Amp), 202 is a bank of bandpass filters (BPF-1, ..., BPF-N), 203 is a bank of full-wave rectifiers (DET-1, ..., DET-N), 204 is a bank of low-pass filters (LPF-1, ..., LPF-N), 205 is an analog multiplexer, 206 is an A/D converter, and 207 is a ROM for logarithmic conversion. The N bandpass filters have center frequencies spaced at equal logarithmic intervals and are numbered 1 to N from the lowest frequency; they are referred to below as the first channel, ..., the N-th channel. In this embodiment, N = 22.
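As a rough illustration of the logarithmic channel spacing, the sketch below computes N center frequencies at equal logarithmic intervals. The function name `log_spaced_centers` and the 200 Hz to 6400 Hz range are assumptions made for the example; the specification gives only N = 22 and the logarithmic spacing, not the band limits.

```python
def log_spaced_centers(f_low, f_high, n):
    """Return n center frequencies equally spaced on a logarithmic axis.

    f_low and f_high are hypothetical band limits in Hz; the patent text
    does not state them.
    """
    if n < 2 or f_low <= 0 or f_high <= f_low:
        raise ValueError("need n >= 2 and 0 < f_low < f_high")
    ratio = f_high / f_low
    # Channel 1 is the lowest frequency, channel n the highest.
    return [f_low * ratio ** (k / (n - 1)) for k in range(n)]


if __name__ == "__main__":
    centers = log_spaced_centers(200.0, 6400.0, 22)  # N = 22 as in the embodiment
    print([round(f) for f in centers])
```

With this spacing, each channel's center frequency is a constant multiple of the previous one, which is what "equal logarithmic intervals" means.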

The outputs of low-pass filter bank 204 are selected in turn by multiplexer 205, converted into digital values by A/D converter 206 at a sampling period of about 10 ms, converted into logarithmic values by logarithmic conversion ROM 207, and sent to spectrum normalization unit 103.

In this embodiment, frequency analysis unit 102 is built from analog filters, but a digital filter bank would be essentially equivalent.

Let $X_1(K), \ldots, X_N(K)$ denote the N channels of speech spectrum data for the K-th frame output from logarithmic conversion ROM 207. The spectrum normalization unit first calculates the mean value $\bar{X}(K)$ for each frame using equation (1):

$$\bar{X}(K) = \frac{1}{N} \sum_{i=1}^{N} X_i(K) \qquad (1)$$
), ..., X. The spectrum normalization unit first calculates the average value X(6) for each frame using the following equation (1).

Hereinafter, $\bar{X}(K)$ is referred to as the "average power" of frame K. The average power is stored in average power memory 104 and then sent to speech interval detection unit 105.
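The per-frame average described above can be sketched as follows, assuming each frame is a list of N logarithmic channel values. The function names are illustrative and do not come from the specification.

```python
def average_power(frame):
    """Mean of the logarithmic channel values of one frame."""
    if not frame:
        raise ValueError("frame must contain at least one channel value")
    return sum(frame) / len(frame)


def average_powers(frames):
    """Average power of every frame of an utterance, in frame order."""
    return [average_power(f) for f in frames]


if __name__ == "__main__":
    # Two hypothetical frames with three channels each.
    print(average_powers([[1.0, 2.0, 3.0], [4.0, 4.0, 4.0]]))
```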

Speech interval detection uses, for example, two thresholds on the average power, $E_1$ and $E_2$ ($E_1 > E_2$). As shown in FIG. 6, when the average power exceeds $E_2$ and then exceeds $E_1$ without first falling below $E_2$, the point at which it exceeded $E_2$ is taken as the start point $K_1$. The end point $K_2$ is found in the same way with the time axis reversed, and $(K_1, K_2)$ is taken as the speech interval.

Spectrum normalization unit 103 normalizes the speech spectrum data $X_i(K)$ ($K = K_1, \ldots, K_2$; $i = 1, \ldots, N$) using the least-squares approximation line of equation (2):

$$Y_i(K) = A(K) \cdot i + B(K) \qquad (2)$$

That is, the difference between the speech spectrum data $X_i(K)$ and the least-squares approximation line is calculated by equation (3) to obtain the spectrum-normalized data $Z_i(K)$.

$$Z_i(K) = X_i(K) - \bigl(A(K) \cdot i + B(K)\bigr) \qquad (K = K_1, \ldots, K_2;\ i = 1, \ldots, N) \qquad (3)$$

$A(K)$ and $B(K)$ of the least-squares approximation line in equation (2) are given by equations (4) and (5).

The calculation of equation (3) is also performed when $A(K)$ is positive, that is, in most cases, for unvoiced sounds as well. In the time direction, equation (3) is calculated for all frames $K$, starting before speech interval detection.
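The two-threshold interval detection and the spectrum normalization described above can be sketched as follows. This is an illustration under stated assumptions, not the circuit of the embodiment: frame indices are 0-based, the function names are hypothetical, and because equations (4) and (5) are not reproduced in this text, the slope and intercept use the standard closed-form least-squares solution for a line fitted over the channel index i = 1..N.

```python
def find_start(powers, e1, e2):
    """Index where power exceeds e2 and then exceeds e1 without dropping below e2."""
    candidate = None
    for k, p in enumerate(powers):
        if p < e2:
            candidate = None          # fell below the lower threshold: discard
        elif p > e2 and candidate is None:
            candidate = k             # first frame above the lower threshold
        if candidate is not None and p > e1:
            return candidate
    return None


def detect_speech_interval(powers, e1, e2):
    """Return (K1, K2) as 0-based frame indices, or None if no speech is found."""
    if e1 <= e2:
        raise ValueError("expected e1 > e2")
    start = find_start(powers, e1, e2)
    if start is None:
        return None
    # The same search on the time-reversed sequence gives the end point.
    end = len(powers) - 1 - find_start(powers[::-1], e1, e2)
    return start, end


def normalize_spectrum(x):
    """Subtract the least-squares line fitted over channel index i = 1..N."""
    n = len(x)
    if n < 2:
        raise ValueError("need at least two channels")
    idx = range(1, n + 1)
    s_i = sum(idx)
    s_ii = sum(i * i for i in idx)
    s_x = sum(x)
    s_ix = sum(i * v for i, v in zip(idx, x))
    a = (n * s_ix - s_i * s_x) / (n * s_ii - s_i * s_i)   # slope A(K)
    b = (s_x - a * s_i) / n                               # intercept B(K)
    z = [v - (a * i + b) for i, v in zip(idx, x)]         # Z_i(K), equation (3)
    return a, b, z
```

A frame whose log spectrum is already a straight line normalizes to all zeros, so only deviations from the overall spectral tilt remain for peak picking.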

Next, local peak feature extraction unit 106 processes $Z_i(K)$ ($K = K_1, \ldots, K_2$; $i = 1, \ldots, N$) from the start to the end of the speech sequentially, one frame at a time.

FIG. 7 shows the detailed configuration of local peak feature extraction unit 106.

In FIG. 7, 401 is a data storage memory, 402 is a peak extraction unit, and 403 is a local peak pattern storage memory.

Data storage memory 401 holds the spectrum-normalized data $Z_i(K)$ ($K = K_1, \ldots, K_2$; $i = 1, 2, \ldots, N$) from the start point $K_1$ to the end point $K_2$, and peak extraction unit 402 detects the local peaks of this data one frame at a time. For the spectrum-normalized data $Z_i(K)$ ($i = 1$ to $N$) of the K-th frame, peak extraction is performed as follows.

$$Z_{i+1}(K) - Z_i(K) > 0 \qquad (6)$$

$$Z_{i+2}(K) - Z_{i+1}(K) < 0 \qquad (7)$$

When equations (6) and (7) are satisfied simultaneously, the spectrum-normalized data $Z_{i+1}(K)$ is a local maximum, $(i+1)$ is the channel number at which the local peak lies, and the local peak pattern $P_{i+1}(K)$ is set to 1.

The spectrum-normalized data for $i = 1$ to $(N-1)$ is checked in turn against equations (6) and (7); where both are satisfied the local peak pattern $P_{i+1}(K)$ is set to "1", and otherwise to "0". The binarized result is stored in local peak pattern storage memory 403.

The boundary conditions are as follows:

$$P_1(K) = 1 \quad \text{when } i = 1 \text{ and } Z_{i+1}(K) - Z_i(K) < 0 \qquad (8)$$

$$P_N(K) = 1 \quad \text{when } i = N-1 \text{ and } Z_{i+1}(K) - Z_i(K) > 0 \qquad (9)$$

When these conditions are not satisfied the value is set to "0". The binarized result is stored in local peak pattern storage memory 403.
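The peak test of conditions (6) to (9) can be sketched for a single frame as follows. The function name is hypothetical, list indices are 0-based (channel 1 is index 0), and the interior scan is taken to cover channels 2 to N-1, with the two end channels handled by the boundary conditions; that reading of the index range is an assumption.

```python
def local_peak_pattern(z):
    """Binary local-peak pattern for one frame of spectrum-normalized data."""
    n = len(z)
    if n < 2:
        raise ValueError("need at least two channels")
    p = [0] * n
    # Interior channels: rising into the channel (6) and falling after it (7).
    for j in range(1, n - 1):
        if z[j] - z[j - 1] > 0 and z[j + 1] - z[j] < 0:
            p[j] = 1
    # Boundary conditions (8) and (9) for the first and last channel.
    if z[1] - z[0] < 0:
        p[0] = 1
    if z[n - 1] - z[n - 2] > 0:
        p[n - 1] = 1
    return p


if __name__ == "__main__":
    print(local_peak_pattern([1.0, 3.0, 2.0, 0.0, 4.0]))
```

Note that the test uses strict inequalities, so a flat plateau of equal values is not marked as a peak in this sketch.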

FIG. 8 shows a flowchart of the extraction of the local peak pattern $P_i(K)$ ($i = 1, 2, \ldots, N$) for a given K-th frame in local peak feature extraction unit 106.

The spectrum-normalized data $Z_i(K)$ ($i = 1, 2, \ldots, N$; $K = K_1, \ldots, K_2$) in data storage memory 401 is read out one frame at a time to obtain the binary local peak pattern.

Standard pattern memory 107 stores one standard pattern per category, created by superimposing a plurality of local peak patterns obtained in advance. To normalize differences in speaking rate, the average utterance length is determined for each category, the frame length in the time direction of each binary local peak pattern is aligned to that average utterance length by linear stretching or compression, and the patterns are superimposed over channels 1 to N to create a multi-valued pattern. Standard pattern memory 107 also stores the standard pattern length SL for each standard pattern.
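A minimal sketch of how such a multi-valued standard pattern might be built for one category is shown below. The specification does not state how frames are mapped during linear stretching or how the average length is rounded, so the floor-based index mapping in `stretch` and the round-half-up in `build_standard_pattern` are assumptions, as are both function names.

```python
def stretch(frames, target_len):
    """Linearly resample a sequence of frames to target_len frames."""
    src_len = len(frames)
    if src_len == 0 or target_len <= 0:
        raise ValueError("need non-empty input and a positive target length")
    # Assumed mapping: output frame t takes source frame floor(t * src / target).
    return [frames[(t * src_len) // target_len] for t in range(target_len)]


def build_standard_pattern(utterances):
    """Superimpose the binary patterns of one category into a multi-valued pattern.

    utterances: list of patterns, each a list of frames of 0/1 channel values.
    Returns (pattern, sl) where sl is the standard pattern length SL.
    """
    if not utterances:
        raise ValueError("need at least one training utterance")
    mean_len = sum(len(u) for u in utterances) / len(utterances)
    sl = max(1, int(mean_len + 0.5))          # rounded average utterance length
    n = len(utterances[0][0])
    pattern = [[0] * n for _ in range(sl)]
    for u in utterances:
        for t, frame in enumerate(stretch(u, sl)):
            for i, bit in enumerate(frame):
                pattern[t][i] += bit          # count peaks per frame and channel
    return pattern, sl
```

Each cell of the result counts how many training utterances had a local peak at that frame and channel, which is what makes the pattern multi-valued.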

Linear matching unit 108 linearly stretches or compresses the time frame length of the input local peak pattern received from local peak feature extraction unit 106 to the standard pattern length SL of each category in standard pattern memory 107, and then, for each frame, calculates the initial similarity and weights it by the average power.

The initial similarity $S(K)$ is calculated for each frame according to equation (10):

$$S(K) = \frac{P \cdot D^t}{\|P\| \cdot \|D\|} \qquad (10)$$

Here $P$ is the feature vector of the input local peak pattern in the K-th frame:

$$P = (P_1(K), P_2(K), \ldots, P_N(K)) \qquad (11)$$

$D$ is the feature vector of an arbitrary standard pattern in the K-th frame:

$$D = (D_1(K), D_2(K), \ldots, D_N(K)) \qquad (12)$$

$t$ denotes transposition, and $\|P\|$ and $\|D\|$ denote the norms of $P$ and $D$, respectively.

Furthermore, because the speech is easily affected by noise when its average power is low, the final similarity $S$ is obtained from equation (13):

$$S = \frac{\sum_{K=1}^{\mathit{SL}} \bar{X}(K) \cdot S(K)}{\mathit{SL} \cdot \max(\bar{X}(K))} \qquad (13)$$

Here $\bar{X}(K)$ is the average power given by equation (1), $\max(\bar{X}(K))$ is the maximum average power within the time-normalized interval of the input speech, and SL is the standard pattern length. In other words, the final similarity $S$ is the initial similarity $S(K)$ of equation (10) weighted by the average power.
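The power-weighted matching described above can be sketched as follows. The function names, the floor-based frame mapping in `stretch`, and the choice to score a frame with no peaks as zero are assumptions made for the example. The sketch also assumes the average power values are positive, as they would be if the logarithmic conversion ROM outputs non-negative codes.

```python
import math


def stretch(seq, target_len):
    """Linearly resample a sequence to target_len items (assumed floor mapping)."""
    return [seq[(t * len(seq)) // target_len] for t in range(target_len)]


def frame_similarity(p, d):
    """Cosine similarity of one input frame p and one standard-pattern frame d."""
    dot = sum(a * b for a, b in zip(p, d))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_d = math.sqrt(sum(b * b for b in d))
    if norm_p == 0 or norm_d == 0:
        return 0.0   # assumption: a frame with no peaks contributes nothing
    return dot / (norm_p * norm_d)


def final_similarity(input_pattern, powers, standard):
    """Power-weighted similarity accumulated over the standard pattern length."""
    sl = len(standard)
    if sl == 0 or not input_pattern or len(input_pattern) != len(powers):
        raise ValueError("inconsistent or empty inputs")
    p_frames = stretch(input_pattern, sl)   # input pattern aligned to length SL
    x_bar = stretch(powers, sl)             # average power aligned the same way
    peak = max(x_bar)
    if peak <= 0:
        raise ValueError("sketch assumes a positive maximum average power")
    total = sum(x * frame_similarity(p, d)
                for x, p, d in zip(x_bar, p_frames, standard))
    return total / (sl * peak)
```

Because each frame's similarity is scaled by its average power relative to the maximum, low-power frames, where noise peaks are most likely, contribute little to the final score.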
X(fQ) is the maximum value of the average power in the time-normalized section of the input audio, and SL is the standard pattern length. That is, the final similarity S is a value obtained by weighting the initial similarity S@) determined by the α1 formula by the average power.

Linear matching unit 108 is configured as shown in FIG. 9, in which 701 is the input terminal for the input local peak pattern, 702 is the input terminal for the standard pattern, 703 is the input terminal for the average power, 705 is the input terminal for the standard pattern length, 706, 707 and 708 are multiplier-adders, 709 and 710 are square-root calculators, 711 is a multiplier, 712 is a divider, 713 is a multiplier-adder, 714 is a divider, 715 is a maximum value detector, 716 is a multiplier, and 717 is an output terminal.

The norm of the input local peak pattern $P$ from input terminal 701 is calculated by multiplier-adder 706 and square-root calculator 709 as in equation (14):

$$\|P\| = \sqrt{\sum_{i=1}^{N} P_i(K)^2} \qquad (14)$$

The norm of the standard pattern $D$ is obtained in the same way by multiplier-adder 708 and square-root calculator 710. Multiplier-adder 707 gives $P \cdot D^t$ as in equation (15):

$$P \cdot D^t = \sum_{i=1}^{N} P_i(K) \cdot D_i(K) \qquad (15)$$

Divider 712 then gives the initial similarity $S(K)$ of equation (10). Meanwhile, the average power $\bar{X}(K)$ is input from input terminal 703, and multiplier-adder 713 gives the sum of equation (16):

$$\sum_{K=1}^{\mathit{SL}} \bar{X}(K) \cdot S(K) \qquad (16)$$

Maximum value detector 715 gives the maximum of the speech average power, which multiplier 716 multiplies by the standard pattern length SL. Divider 714 uses this result and the result of equation (16) to obtain the final similarity $S$, which is output at terminal 717 and sent to decision unit 109.
(Kl is determined. On the other hand, the average power XE is input from the input terminal 703, and the multiplier/adder 713 determines the maximum value of the audio average power. ' A final similarity S is determined by a divider 714 using this result and the result of the oQ formula, and is output to a terminal 717 and sent to the determination unit 109 .

Decision unit 109 outputs, via output terminal 110, the category name or category number of the standard pattern that gives the maximum of the similarities obtained for the standard patterns.
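The decision step amounts to picking the category with the largest final similarity. A minimal sketch, assuming the final similarities have already been collected in a dictionary keyed by category name (the function name and the example category names are hypothetical):

```python
def decide(final_similarities):
    """Return (category, score) for the largest final similarity."""
    if not final_similarities:
        raise ValueError("no standard patterns were scored")
    category = max(final_similarities, key=final_similarities.get)
    return category, final_similarities[category]


if __name__ == "__main__":
    print(decide({"ichi": 0.5, "ni": 0.75}))
```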

(Effects of the Invention) As described in detail above, the present invention extracts local peaks with the least-squares approximation line as the reference regardless of whether the sound is voiced or unvoiced, and weights the similarity by the average power. It can therefore compensate for imperfect extraction of speech local peaks even in noisy speech, particularly in the low-level portions of the speech; that is, peaks caused by noise are rarely mistaken for local peaks of speech, and good recognition is achieved.

[Brief Description of the Drawings]

FIG. 1 is a flowchart of the speech recognition method according to the present invention; FIG. 2 is a flowchart of the conventional speech recognition method; FIG. 3 is a diagram explaining the binarization of the input speech signal; FIG. 4 is a block diagram showing the overall circuit configuration of an embodiment of the speech recognition apparatus according to the present invention; FIG. 5 is a block diagram showing the circuit configuration of the frequency analysis unit; FIG. 6 is a diagram explaining speech interval detection; FIG. 7 is a block diagram showing the configuration of the local peak feature extraction unit; FIG. 8 is a flowchart of local peak pattern extraction; and FIG. 9 is a block diagram showing the circuit configuration of the linear matching unit.

101: input terminal; 102: frequency analysis unit; 103: spectrum normalization unit; 104: average power memory; 105: speech interval detection unit; 106: local peak feature extraction unit; 107: standard pattern memory; 108: linear matching unit; 109: decision unit; 110: output terminal.

Patent applicant: Oki Electric Industry Co., Ltd.

Procedural Amendment (voluntary)
1. Indication of the case: Patent Application No. 070035 of 1985 (Showa 60)
2. Title of the invention: Speech recognition method
3. Person making the amendment
Relationship to the case: patent applicant
Address: 1-7-12 Toranomon, Minato-ku, Tokyo 105
Name: (029) Oki Electric Industry Co., Ltd.
Representative: 橋本南海男
4. Agent
Address: 1-7-12 Toranomon, Minato-ku, Tokyo 105
5. Subject of the amendment: the "Detailed Description of the Invention" section of the specification, and drawing "FIG. 8"
6. Contents of the amendment: as shown in the attached sheet
(1) On page 9, line 5 of the specification, "at a sampling period of about 10 ms" is amended to "at a sampling period of 10 ms".
(2) Equation (16) on page 17 of the specification is amended.
(3) Equation (17) on page 17 of the specification is amended.
(4) Drawing "FIG. 8" is amended as shown in the attached sheet.

Claims

[Claims] 1. A speech recognition method comprising: a process of frequency-analyzing input speech and extracting analysis data of a plurality of channels for each speech frame; a process of averaging, for each speech frame, all the analysis data corresponding to that frame to calculate an average power; a process of spectrum-normalizing the analysis data using the least-squares approximation line of the speech spectrum in the speech frame to which the data belongs; a process of determining, for the spectrum-normalized analysis data, the presence or absence of a local maximum along the frequency axis, that is, a local peak, with the least-squares approximation line as the reference, and converting the data into a binarized local peak pattern of the input speech in which a local peak is "1" and the absence of a local peak is "0"; a process of calculating the similarity between each of a plurality of standard patterns, represented by local peak patterns prepared in advance, and the local peak pattern of the input speech for each speech frame normalized between the standard pattern and the input speech; a process of calculating a final similarity by weighting the similarity by the average power and accumulating it over all the normalized speech frames; and a process of determining, as the recognition result, the category name of the standard pattern corresponding to the maximum of the final similarities of the standard patterns.