JPS61228500A

JPS61228500A - Voice recognition

Info

Publication number: JPS61228500A
Application number: JP60069053A
Authority: JP
Inventors: 田部井　幸雄
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1985-04-03
Filing date: 1985-04-03
Publication date: 1986-10-11
Also published as: JPH0556520B2

Abstract

(57) [Abstract] This publication contains application data from before electronic filing, so no abstract data is recorded.

Description

DETAILED DESCRIPTION OF THE INVENTION (Field of Industrial Application) The present invention relates to a speech recognition method, and in particular to a speech recognition method that performs word speech recognition using local peaks.

(Prior Art) A conventional speech recognition method of this type is described in the Transactions of the Institute of Electronics and Communication Engineers of Japan, J68-A (1) (January 1985), pp. 78-85. Figure 2 is a flowchart of that conventional speech recognition method using local peaks. The input speech is frequency-analyzed every 10 msec by a bank of 15 bandpass filter channels (see 1 in Figure 2). To normalize individual differences in glottal source characteristics, the speech spectrum is expressed logarithmically on both the amplitude and frequency axes, a least-squares approximation line is fitted to it (see 2 in Figure 2), and the spectrum is corrected by taking the difference from that line.

However, when the slope of the least-squares approximation line is positive, the difference from the mean value is taken instead. Then, as shown in Figure 3, in each frame (10 msec), for each portion at or above 0 dB, the channel that takes the maximum value among those whose amplitude is at least 1/2 of the respective maximum is marked "1" as having a local peak, and all other channels are set to "0" (see 3 in Figure 2). The bandpass filter bank has 15 channels, but a 16th channel is appended to carry the sign of the slope: "1" when the slope of the least-squares approximation line is negative, which is regarded as voiced, and "0" when it is positive, which is regarded as unvoiced (see 4 in Figure 2).

The weighted-average dictionary is obtained as a multi-valued pattern by linearly stretching several binarized patterns along the time axis to the length of the longest one and summing them (see 5 in Figure 2).

To match a binary input pattern against the multi-valued weighted-average dictionary, the shorter pattern is linearly stretched along the time axis to the length of the longer one, a similarity measure is computed, and the category name of the standard pattern giving the maximum similarity is taken as the recognition result (see 6 in Figure 2).

(Problems to be Solved by the Invention) The conventional method described above works well in environments with a good signal-to-noise ratio, such as when a close-talking microphone is used. In noisy environments, however, noise peaks are easily picked up as local peaks, especially in silent portions, and recognition errors increase.

The present invention eliminates this problem of the prior art, in which noise peaks are extracted as speech local peaks in noisy environments, especially in silent portions, so that recognition errors increase, and provides a speech recognition method that is highly robust to noise.

(Means for Solving the Problems) In the speech recognition method according to the present invention, the input speech is first frequency-analyzed into analysis data of a plurality of channels for each speech frame. All analysis data of all speech frames in a predetermined noise interval, known in advance to contain only noise, are averaged to obtain the noise average power. For each speech frame of the input speech, all analysis data belonging to that frame are averaged to obtain the average power. Then, for each speech frame, the speech average power is calculated as the difference between that average power and the noise average power. Separately, the analysis data of the input speech are spectrum-normalized using the least-squares approximation line of the speech spectrum in the speech frame to which the data belong. For the spectrum-normalized analysis data, the presence or absence of local maxima along the frequency axis, that is, local peaks, is determined with the least-squares approximation line as the reference, and the data are converted into a binarized local peak pattern in which "1" indicates a local peak and "0" indicates none. The similarity between each of a plurality of standard patterns, prepared in advance and expressed as local peak patterns, and the local peak pattern of the input speech is calculated for each speech frame after normalization between the standard pattern and the input speech. This similarity is weighted by the speech average power and accumulated over all the normalized speech frames to obtain the final similarity. The category name of the standard pattern whose final similarity is the maximum is taken as the recognition result.

(Operation) The present invention calculates the speech average power, which is the difference between the average power of the frequency-analysis data of the input speech and the noise average power of a predetermined noise interval, and obtains the final similarity by weighting the similarity between each standard pattern and the local peak pattern of the input speech by this speech average power. This compensates for imperfect local peak extraction in noisy speech, particularly in its low-level portions.

(Embodiment) Figure 1 is a flowchart of the speech recognition method according to the present invention. The input speech signal is frequency-analyzed every 10 msec (the frame period) in frequency analysis step 11, and logarithmic values are output as the analysis data of a plurality of channels for each speech frame.

Next, noise average power calculation step 12 takes the logarithmic data of a noise interval given in advance and averages all logarithmic values of all speech frames in that interval to obtain the noise average power.

Meanwhile, speech average power calculation step 13 calculates the speech average power from the difference between the average power, obtained from the frequency-analyzed logarithmic values of each speech frame of the input speech, and the noise average power.

The logarithmic values of the frequency-analysis data of the input speech are spectrum-normalized using the least-squares approximation line of the speech spectrum in the speech frame to which they belong (step 14). Next, for the spectrum-normalized logarithmic values, the presence or absence of local maxima along the frequency axis, that is, local peaks, is determined with the least-squares approximation line as the reference, and the values are converted into a binary input local peak pattern in which "1" indicates a local peak and "0" indicates none.

The standard patterns 16 are created in advance, for each category, by linearly stretching or compressing a plurality of binary local peak patterns to the average utterance length of the category and summing them.

In matching step 17, the binary input local peak pattern is linearly stretched or compressed to the frame length of each standard pattern, the similarity to each standard pattern is calculated, the similarity is weighted by the speech average power output by speech average power calculation step 13, and the weighted similarity is accumulated over all speech frames to obtain the final similarity. The category name of the standard pattern giving the largest final similarity is output as the recognition result.

Next, an embodiment of a speech recognition apparatus based on the speech recognition method of the present invention is described. Figure 4 is a block diagram showing the overall circuit configuration of an embodiment of the speech recognition apparatus according to the present invention.

As shown in Figure 4, this embodiment comprises: 101, a speech input terminal; 102, a frequency analysis section; 103, a spectrum normalization section; 105, an average power memory; 106, a noise power calculation section; 107, a speech interval detection section; 108, a speech average power calculation section; 109, a local peak feature extraction section; 110, a standard pattern memory; 111, a linear matching section; 112, a determination section; and 113, an output terminal for the recognition result.

The operation is described below.

First, a speech signal converted into an electrical signal by a microphone or the like is input to the speech input terminal 101, and the frequency analysis section 102 extracts its frequency spectrum. The circuit configuration of the frequency analysis section 102 is shown in Figure 5.

In Figure 5, 201 is a preamplifier (Amp), 202 is a bandpass filter bank (BPF-1, ..., BPF-N), 203 is a bank of full-wave rectifiers (DET-1, ..., DET-N), 204 is a low-pass filter bank (LPF-1, ..., LPF-N), 205 is an analog multiplexer, 206 is an AD converter, and 207 is a ROM for logarithmic conversion. The N bandpass filters have center frequencies spaced at equal logarithmic intervals and are numbered 1 to N from the lowest frequency; they are referred to below as the first channel, ..., the N-th channel. In this embodiment, N = 22.

The outputs of the low-pass filter bank 204 are switched by the multiplexer 205, converted to digital values by the AD converter 206 at a sampling period of 10 ms, converted to logarithmic values by the logarithmic conversion ROM 207, and sent to the spectrum normalization section 103.

This embodiment uses analog filters in the frequency analysis section 102, but a digital filter bank would be essentially equivalent.

Let $X_1(K), \ldots, X_N(K)$ denote the N channels of speech spectrum data of the K-th frame output by the logarithmic conversion ROM 207. The spectrum normalization section 103 first calculates the average value $\bar{X}(K)$ for each frame according to equation (1).
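Equation (1) itself did not survive in this text. A plausible reading, assuming it is the plain arithmetic mean of the N channel log values of frame K (which is what "average value for each frame" suggests), would be:

```latex
\bar{X}(K) = \frac{1}{N} \sum_{i=1}^{N} X_i(K) \qquad (1)
```

The uniform $1/N$ weighting is an assumption; the source only states that a per-frame average is taken.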

Hereinafter, $\bar{X}(K)$ is referred to as the "average power" of frame K.

The average power $\bar{X}(K)$ is stored in the average power memory 105 and, at the same time, sent to the noise power calculation section 106.

On an instruction input from outside, the noise power calculation section 106 calculates the noise average power in an interval $[K_{E1}, K_{E2}]$ known in advance to contain only noise. The noise-only interval is identified by having the speaker stop speaking.

For the noise power measurement, the noise average power $N_p$ in the noise interval $[K_{E1}, K_{E2}]$ is obtained according to equation (2).
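Equation (2) is likewise missing from this text. Assuming the noise average power is simply the mean of the per-frame average powers over the noise interval, which matches the description of averaging all log values of all frames in that interval, one possible form is:

```latex
N_p = \frac{1}{K_{E2} - K_{E1} + 1} \sum_{K = K_{E1}}^{K_{E2}} \bar{X}(K) \qquad (2)
```

The exact normalizing count (here the number of frames, inclusive of both ends) is an assumption.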

In this embodiment, $K_{E2} - K_{E1}$ = 20 to 30 (frames). In the next step, the speech interval detection section 107 detects the speech interval.

The speech interval is extracted using, for example, two thresholds on the average power, $E_1$ and $E_2$ ($E_1 > E_2$). As shown in Figure 6, when the average power exceeds $E_2$ and then exceeds $E_1$ without falling below $E_2$ in between, the point at which it exceeded $E_2$ is taken as the start point $K_1$. The end point $K_2$ is found in the same way with the time axis reversed, and $[K_1, K_2]$ is taken as the speech interval.
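The two-threshold endpoint rule can be sketched as follows. This is a minimal illustration, not the circuit of section 107: the function names `find_start` and `detect_interval` are hypothetical, and treating a value exactly equal to the lower threshold as "fallen back" is an assumption.

```python
def find_start(power, e1, e2):
    """Return the frame where power rose above e2 before exceeding e1, or None.

    A candidate start is the first frame above the lower threshold e2.
    It is confirmed once power exceeds the upper threshold e1 without
    having dropped back to e2 or below in between.
    """
    candidate = None
    for k, p in enumerate(power):
        if p <= e2:
            candidate = None      # fell back: discard the candidate
            continue
        if candidate is None:
            candidate = k         # first frame above e2
        if p > e1:
            return candidate
    return None


def detect_interval(power, e1, e2):
    """Return (K1, K2) as frame indices, or None if no speech is found."""
    start = find_start(power, e1, e2)
    if start is None:
        return None
    # Same rule applied with the time axis reversed gives the end point.
    rev_end = find_start(power[::-1], e1, e2)
    return start, len(power) - 1 - rev_end
```

A brief rise above the lower threshold that never reaches the upper one is ignored, which is what makes the rule tolerant of short noise bursts.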

The thresholds $E_1$ and $E_2$ are derived from the noise average power $N_p$ by equation (3) (not reproduced in this text), where $L_1$ and $L_2$ are arbitrary constants with $L_1 > L_2 > 0$.

The spectrum normalization section 103 also normalizes the speech spectrum data $X_i(K)$ ($K = K_1, \ldots, K_2$; $i = 1, \ldots, N$) using the least-squares approximation line given by the following equation (4).

$Y_i(K) = A(K) \cdot i + B(K)$ ... (4)

That is, the difference between the speech spectrum data $X_i(K)$ and the least-squares approximation line is calculated by the following equation (5) to obtain the spectrum-normalized data $Z_i(K)$.

$Z_i(K) = X_i(K) - (A(K) \cdot i + B(K))$ ($K = K_1, \ldots, K_2$; $i = 1, \ldots, N$) ... (5)

$A(K)$ and $B(K)$ of the least-squares approximation line in equation (4) are given by equations (6) and (7), which are not reproduced in this text.
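Since equations (6) and (7) are not legible here, the sketch below assumes they are the ordinary closed-form least-squares slope and intercept of the channel values against the channel index i = 1, ..., N. The function names `fit_line` and `normalize_spectrum` are hypothetical.

```python
def fit_line(x):
    """Least-squares slope A and intercept B of x[i-1] against i = 1..N.

    Assumes the standard closed-form solution; requires at least 2 channels.
    """
    n = len(x)
    idx = range(1, n + 1)
    s_i = sum(idx)
    s_ii = sum(i * i for i in idx)
    s_x = sum(x)
    s_ix = sum(i * v for i, v in zip(idx, x))
    a = (n * s_ix - s_i * s_x) / (n * s_ii - s_i * s_i)
    b = (s_x - a * s_i) / n
    return a, b


def normalize_spectrum(x):
    """Equation (5): subtract the fitted line from each channel value."""
    a, b = fit_line(x)
    return [v - (a * i + b) for i, v in enumerate(x, start=1)]
```

With an intercept in the fit, the residuals of one frame sum to zero, so the normalized values are spread above and below the fitted line, which then serves as the 0 reference for peak picking.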

The calculation of equation (5) is also performed when $A(K)$ is positive, that is, in most cases for unvoiced sounds as well. Along the time axis, equation (5) is computed for every frame K, starting before speech interval detection. In addition, the speech average power $X'(K)$ is calculated from the average power $\bar{X}(K)$, $K \in [K_1, K_2]$, of the speech interval and the noise average power $N_p$.

The calculation of equation (8), for $K = K_1, \ldots, K_2$, is performed by the speech average power calculation section 108. That is, $X'(K)$ is given by $\max(0, \bar{X}(K) - N_p)$.
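The three power quantities described so far fit in a few lines. This is a minimal sketch: the function names are hypothetical, frames are assumed to be lists of per-channel log values, and the plain arithmetic means are an assumption consistent with the description of averaging all analysis data.

```python
def frame_average_power(frame):
    """Average power of one frame: mean of its per-channel log values."""
    return sum(frame) / len(frame)


def noise_average_power(noise_frames):
    """Noise average power: mean of all log values in the noise-only frames."""
    values = [v for frame in noise_frames for v in frame]
    return sum(values) / len(values)


def voice_average_power(frame, noise_power):
    """Speech average power: max(0, frame average power - noise average power)."""
    return max(0.0, frame_average_power(frame) - noise_power)
```

Clamping at zero means that frames no louder than the noise floor receive zero weight in the later similarity accumulation.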

Next, the local peak feature extraction section 109 processes the spectrum-normalized data $Z_i(K)$ ($K = K_1, \ldots, K_2$; $i = 1, \ldots, N$) from the start to the end of the speech, one frame at a time.

The detailed configuration of the local peak feature extraction section 109 is shown in Figure 7.

In Figure 7, 401 is a data storage memory, 402 is a peak extraction section, and 403 is a local peak pattern storage memory.

The data storage memory 401 holds the spectrum-normalized data $Z_i(K)$ ($K = K_1, \ldots, K_2$; $i = 1, 2, \ldots, N$) from the start point $K_1$ to the end point $K_2$, and the peak extraction section 402 detects the local peaks of this data frame by frame. Peak extraction is applied to spectrum-normalized data $Z_i(K)$ that is 0 or greater, as follows.

$Z_{i+1}(K) - Z_i(K) > 0$ ... (9)

$Z_{i+2}(K) - Z_{i+1}(K) < 0$ ... (10)

When equations (9) and (10) are satisfied simultaneously, the spectrum-normalized data $Z_{i+1}(K)$ is a local maximum, $(i+1)$ is the number of a channel with a local peak, and the local peak pattern $P_{i+1}(K)$ is set to 1.

For the spectrum-normalized data from $i = 1$ to $(N-1)$, equations (9) and (10) are checked in turn; where they are satisfied the local peak pattern $P_{i+1}(K)$ is set to "1", and where they are not it is set to "0". The binarized result is stored in the local peak pattern storage memory 403.

As boundary conditions, however,

when $i = 1$ and $Z_{i+1}(K) - Z_i(K) < 0$, $P_i(K) = 1$ ... (11)

when $i = (N-1)$ and $Z_{i+1}(K) - Z_i(K) > 0$, $P_{i+1}(K) = 1$ ... (12)

and where these conditions are not satisfied the value is set to "0"; the binarized result is stored in the local peak pattern storage memory 403.
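The peak rule and its boundary conditions can be sketched for one frame as follows. The function name `local_peak_pattern` is hypothetical, indices are 0-based rather than the 1-based channel numbers of the text, and applying the "0 or greater" restriction to the two edge channels as well is an assumption.

```python
def local_peak_pattern(z):
    """Binary local-peak pattern for one frame of normalized values z[0..N-1].

    A channel is marked 1 when its value is at or above 0 (the fitted line),
    it is higher than its left neighbour, and its right neighbour is lower.
    At the edges the missing neighbour is treated as satisfying the
    condition, mirroring the boundary conditions in the text.
    """
    n = len(z)
    p = [0] * n
    for c in range(n):
        if z[c] < 0:
            continue                                   # below the reference line
        rises = c == 0 or z[c] - z[c - 1] > 0          # no left neighbour at c == 0
        falls = c == n - 1 or z[c + 1] - z[c] < 0      # no right neighbour at c == N-1
        if rises and falls:
            p[c] = 1
    return p
```

Because strict inequalities are used, a flat-topped maximum spanning two equal channels produces no peak; the source does not say how such ties should be handled.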

Figure 8 shows a flowchart of the extraction of the local peak pattern $P_i(K)$ ($i = 1, 2, \ldots, N$; $K = K_1, \ldots, K_2$) for a given K-th frame in the local peak feature extraction section 109.

In this way, the spectrum-normalized data $Z_i(K)$ ($i = 1, 2, \ldots, N$; $K = K_1, \ldots, K_2$) in the data storage memory 401 is read out frame by frame and a binary local peak pattern is obtained.

The standard pattern memory 110 stores, for each category, one standard pattern created by superimposing a plurality of local peak patterns obtained in advance.

To normalize differences in speaking rate, each standard pattern is created as a multi-valued pattern: the average utterance length is determined for each category, the binary local peak patterns are linearly stretched or compressed along the time axis to that average length, and they are then superimposed channel by channel over channels 1 to N. The standard pattern memory 110 also stores the standard pattern length SL for each standard pattern.
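Building one category's standard pattern can be sketched as below. The source does not say how the linear stretch picks frames, so the nearest-lower-frame resampling here is an assumption, as are the function names `stretch` and `build_standard_pattern` and the rounding of the average length.

```python
def stretch(pattern, length):
    """Linearly resample a list of frames to `length` frames.

    Output frame j copies input frame floor(j * len(pattern) / length).
    """
    n = len(pattern)
    return [pattern[j * n // length] for j in range(length)]


def build_standard_pattern(utterances):
    """Sum the binary patterns of one category after stretching each one
    to the category's average utterance length (in frames)."""
    avg_len = max(1, round(sum(len(u) for u in utterances) / len(utterances)))
    channels = len(utterances[0][0])
    total = [[0] * channels for _ in range(avg_len)]
    for u in utterances:
        for k, frame in enumerate(stretch(u, avg_len)):
            for c, bit in enumerate(frame):
                total[k][c] += bit
    return total
```

The length of the returned list plays the role of the standard pattern length SL, and each cell counts how many training utterances had a local peak at that frame and channel.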

The linear matching section 111 linearly stretches or compresses the time-frame length of the input local peak pattern received from the local peak feature extraction section 109 to the standard pattern length SL of each category in the standard pattern memory 110, and then, for each frame, calculates an initial similarity and weights it by the average power.

The initial similarity $S(K)$ is calculated for each frame according to equation (13).

Here, P denotes the feature vector of the input local peak pattern in the K-th frame.

$P = (P_1(K), P_2(K), \ldots, P_N(K))$ ... (14)

D denotes the feature vector of the K-th frame of an arbitrary standard pattern.

$D = (D_1(K), D_2(K), \ldots, D_N(K))$ ... (15)

$D^t$ denotes the transpose, and $\|P\|$ and $\|D\|$ denote the norms of P and D, respectively.
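The equation for the initial similarity is not legible in this text. Given that the surrounding definitions introduce only $P$, $D$, the transpose $D^t$, and the two norms, and that the circuit of Figure 9 forms an inner product and divides it by a product of square roots, it is presumably the normalized inner product (cosine similarity):

```latex
S(K) = \frac{P \cdot D^{t}}{\lVert P \rVert \, \lVert D \rVert}
```

This form is a reconstruction from context, not a transcription of the original equation.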

Furthermore, because speech is easily affected by noise when its average power is small, the final similarity S is obtained from equation (16).

Here, $X'(K)$ is the speech average power obtained from equation (8), $\max(X'(K))$ is the maximum speech average power within the time-normalized interval of the input speech, and SL is the standard pattern length. In other words, the final similarity S is the initial similarity $S(K)$, weighted by the speech average power $X'(K)$ of equation (8), from which the noise average power $N_p$ has been subtracted.
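The equation for the final similarity is not legible here. The sketch below assumes the form suggested by the description of the matching circuit: the power-weighted sum of per-frame similarities divided by SL times the maximum speech average power. The function names are hypothetical, the input pattern is assumed to be already stretched to SL frames, and returning 0 for an all-zero vector or zero peak power is an added guard not stated in the source.

```python
import math


def cosine_similarity(p, d):
    """Initial per-frame similarity: inner product over the product of norms."""
    dot = sum(a * b for a, b in zip(p, d))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in d))
    return dot / norm if norm > 0 else 0.0


def final_similarity(input_frames, standard_frames, voice_power):
    """Assumed form: S = sum_K X'(K) * S(K) / (SL * max_K X'(K))."""
    sl = len(standard_frames)
    peak = max(voice_power)
    if peak <= 0:
        return 0.0
    weighted = sum(w * cosine_similarity(p, d)
                   for p, d, w in zip(input_frames, standard_frames, voice_power))
    return weighted / (sl * peak)
```

Recognition then amounts to evaluating `final_similarity` against every category's standard pattern and choosing the category with the largest value.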

The linear matching section 111 is configured as shown in Figure 9: 701 is the input terminal for the input local peak pattern, 702 is the input terminal for the standard pattern, 703 is the input terminal for the speech average power, 705 is the input terminal for the standard pattern length, 706, 707, and 708 are multiply-accumulators, 709 and 710 are square-root units, 711 is a multiplier, 712 is a divider, 713 is a multiply-accumulator, 714 is a divider, 715 is a maximum value detector, 716 is a multiplier, and 717 is the output terminal.

The norm of the input local peak pattern P, input from input terminal 701, is calculated by the multiply-accumulator 706 and the square-root unit 709 as $\|P\| = \sqrt{\sum_{i=1}^{N} P_i(K)^2}$. The norm of the standard pattern D is obtained in the same way by the multiply-accumulator 708 and the square-root unit 710. The inner product $P \cdot D = \sum_{i=1}^{N} P_i(K) D_i(K)$ is obtained by the multiply-accumulator 707.

The divider 712 then yields the initial similarity $S(K)$ of equation (13).

Meanwhile, the speech average power $X'(K)$ is input from input terminal 703, and the multiply-accumulator 713 accumulates the initial similarities weighted by it. The maximum value detector 715 finds the maximum of the time-normalized speech average power, and the multiplier 716 multiplies it by the standard pattern length SL. This product and the output of the multiply-accumulator 713 are input to the divider 714, which yields the final similarity S; S is output at output terminal 717 and sent to the determination section 112.

The determination section 112 outputs, via output terminal 113, the category name or category number of the standard pattern that gives the maximum final similarity among all standard patterns, as the recognition result.

(Effects of the Invention) As described in detail above, the present invention extracts local peaks with the least-squares approximation line as the reference, regardless of whether the sound is voiced or unvoiced, and weights the similarity by the speech average power (= average power − noise average power). As a result, even for speech mixed with noise, imperfect extraction of speech local peaks, particularly in low-level portions of the speech, can be compensated for; noise peaks are rarely mistaken for speech local peaks, and good recognition is achieved.

[Brief Description of the Drawings]

Figure 1 is a flowchart of the speech recognition method according to the present invention; Figure 2 is a flowchart of a conventional speech recognition method; Figure 3 is a diagram explaining the binarization of an input speech signal; Figure 4 is a block diagram showing the overall circuit configuration of an embodiment of the speech recognition apparatus according to the present invention; Figure 5 is a block diagram showing the circuit configuration of the frequency analysis section; Figure 6 is a diagram explaining speech interval detection; Figure 7 is a block diagram showing the configuration of the local peak feature extraction section; Figure 8 is a flowchart of local peak pattern extraction; and Figure 9 is a block diagram showing the circuit configuration of the linear matching section.

101: input terminal; 102: frequency analysis section; 103: spectrum normalization section; 105: average power memory; 106: noise power calculation section; 107: speech interval detection section; 108: speech average power calculation section; 109: local peak feature extraction section; 110: standard pattern memory; 111: linear matching section; 112: determination section; 113: output terminal.

Patent applicant: Oki Electric Industry Co., Ltd.

1. Indication of the case: Patent Application No. 69053 of 1985 (Showa 60). 2. Title of the invention: Speech recognition method. 3. Person making the amendment: relationship to the case, patent applicant; address: 1-7-12 Toranomon, Minato-ku, Tokyo 105. 5. Subject of the amendment: the "Detailed Description of the Invention" section of the specification. 6. Contents of the amendment: (1) On page 10, line 11 of the specification, "at about 10 ms" is corrected to "at 10 ms". (2) On page 13, line 1, "(K = K1, ..., K2; i = 1, ..., 2)" is corrected to "(K = K1, ..., K2; i = 1, ..., N)". (3) On the same page, fourth line from the bottom, "K ∈ (K1, K2)" is corrected to "K ∈ [K1, K2]".

Claims

[Claims] A speech recognition method comprising: a process of frequency-analyzing input speech and extracting analysis data of a plurality of channels for each speech frame; a process of averaging all the analysis data of all speech frames in a predetermined noise interval to calculate a noise average power; a process of averaging, for each speech frame of the input speech, all the analysis data corresponding to that frame to calculate an average power; a process of calculating, for each speech frame of the input speech, a speech average power that is the difference between the average power and the noise average power; a process of spectrum-normalizing the analysis data of the input speech using the least-squares approximation line of the speech spectrum in the speech frame to which the data belong; a process of determining, for the spectrum-normalized analysis data, the presence or absence of local maxima along the frequency axis, that is, local peaks, with the least-squares approximation line as the reference, and converting the data into a binarized local peak pattern of the input speech in which "1" indicates a local peak and "0" indicates none; a process of calculating the similarity between each of a plurality of standard patterns, prepared in advance and expressed as local peak patterns, and the local peak pattern of the input speech for each normalized speech frame between the standard pattern and the input speech, weighting the similarity by the speech average power, and accumulating it over all the normalized speech frames to calculate a final similarity; and a process of determining, as the recognition result, the category name of the standard pattern corresponding to the maximum of the final similarities of the standard patterns.