JPH0556520B2

Info

Publication number: JPH0556520B2
Application number: JP60069053A
Authority: JP
Inventors: Yukio Tabei
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1985-04-03
Filing date: 1985-04-03
Publication date: 1993-08-19
Also published as: JPS61228500A

Description

DETAILED DESCRIPTION OF THE INVENTION (Field of Industrial Application) The present invention relates to a speech recognition method, and in particular to a speech recognition method that recognizes spoken words using local peaks.

(Prior Art) A conventional speech recognition method of this kind is described in the Transactions of the Institute of Electronics and Communication Engineers, J68-A [1] (January 1985), pp. 78-85. FIG. 2 is a flowchart of this conventional speech recognition method using local peaks. The input speech is frequency-analyzed every 10 msec by a bank of 15-channel bandpass filters (see 1 in FIG. 2). To normalize individual differences in the glottal source characteristics, the speech spectrum is expressed logarithmically on both the amplitude and frequency axes, a least-squares approximation line is obtained (see 2 in FIG. 2), and the spectrum is corrected by taking its difference from that line. If the slope of the least-squares approximation line is positive, however, the difference from the mean value is taken instead. Then, as shown in FIG. 3, for each frame (10 msec) and for each portion of the spectrum that is 0 dB or more, the channel taking the maximum value among those whose amplitude is at least 1/2 of the respective maximum is regarded as having a local peak and set to "1", and all other channels are set to "0", thereby binarizing the spectrum (see 3 in FIG. 2). The bandpass filter bank has 15 channels, but a 16th channel is added to carry the sign of the slope: it is set to "1" when the slope of the least-squares approximation line is negative, in which case the frame is regarded as voiced, and to "0" when the slope is positive, in which case the frame is regarded as unvoiced (see 4 in FIG. 2).

The weighted-average dictionary is obtained as a multi-valued pattern by linearly stretching a plurality of binarized patterns along the time axis to the length of the longest one and adding them together (see 5 in FIG. 2).

To match a binary input pattern against the multi-valued weighted-average dictionary, the shorter pattern is linearly stretched in the time direction to the length of the longer one, a calculation based on a similarity measure is performed, and the category name of the standard pattern giving the maximum similarity is taken as the recognition result (see 6 in FIG. 2).

(Problems to Be Solved by the Invention) The conventional method described above works effectively in environments with a good signal-to-noise ratio, such as when a close-talking microphone is used. In noisy environments, however, it tends to pick up noise peaks as local peaks, particularly in silent portions, which increases recognition errors.

The present invention eliminates this problem of the prior art, in which noise peaks are extracted as local peaks of speech in noisy environments, particularly in silent portions, so that recognition errors increase. It thereby provides a speech recognition method that is highly robust to noise.

(Means for Solving the Problems) In the speech recognition method according to the present invention, the input speech is first frequency-analyzed into analysis data of a plurality of channels for each speech frame. The noise average power is calculated by averaging all the analysis data of all the speech frames in a predetermined noise section that is known in advance to contain only noise. For each speech frame of the input speech, the average power is calculated by averaging all the analysis data of that frame. Next, for each speech frame, the speech average power is calculated as the difference between this average power and the noise average power. Meanwhile, the analysis data of the input speech are spectrum-normalized using the least-squares approximation line of the speech spectrum in the speech frame to which the data belong. For the spectrum-normalized analysis data, the presence or absence of a local maximum along the frequency axis, that is, a local peak, is determined with the least-squares approximation line as the reference, and the data are converted into a binarized local peak pattern in which "1" denotes the presence of a local peak and "0" its absence.

Then the similarity between the local peak pattern of the input speech and each of a plurality of standard patterns, which are represented by local peak patterns prepared in advance, is calculated for each speech frame normalized between the standard pattern and the input speech. This similarity is weighted by the speech average power and accumulated over all the normalized speech frames to obtain the final similarity. The category name of the standard pattern whose final similarity is the maximum is determined to be the recognition result.

(Operation) The present invention calculates the speech average power, which is the difference between the average power of the frequency analysis data of the input speech and the noise average power of a predetermined noise section, and obtains the final similarity by weighting the similarity between each standard pattern and the local peak pattern of the input speech by this speech average power. This compensates for imperfect local peak extraction from noise-contaminated speech, particularly in low-level portions of the speech.

(Embodiment) FIG. 1 is a flowchart of the speech recognition method according to the present invention. In frequency analysis step 11, the input speech signal is frequency-analyzed every 10 msec (the frame period), and logarithmic values are output as the analysis data of a plurality of channels for each speech frame.

Next, in noise average power calculation step 12, the noise average power is calculated by averaging all the logarithmic values of all the speech frames within a noise section given in advance.

Meanwhile, in speech average power calculation step 13, the speech average power is calculated as the difference between the average power obtained from the frequency-analyzed logarithmic values of each speech frame of the input speech and the noise average power.

The logarithmic values of the frequency analysis data of the input speech are also spectrum-normalized using the least-squares approximation line of the speech spectrum in the speech frame to which they belong (step 14). Next, for the spectrum-normalized logarithmic values, the presence or absence of a local maximum along the frequency axis, that is, a local peak, is determined with the least-squares approximation line as the reference, and the values are converted into a binary input local peak pattern in which "1" denotes the presence of a local peak and "0" its absence.

The standard patterns 16 are created in advance by linearly expanding or contracting a plurality of binary local peak patterns to the average utterance length of each category and adding them together.

In matching step 17, the binary input local peak pattern is linearly expanded or contracted to the frame length of each standard pattern, and its similarity to each standard pattern is calculated. The similarity is weighted by the speech average power output from speech average power calculation step 13, and the weighted similarities are accumulated over all speech frames to obtain the final similarity. The category name of the standard pattern giving the maximum final similarity is output as the recognition result.

Next, an embodiment of a speech recognition device that uses the speech recognition method of the present invention will be described.

FIG. 4 is a block diagram showing the overall circuit configuration of an embodiment of the speech recognition device according to the present invention.

As shown in FIG. 4, this embodiment comprises a speech input terminal 101, a frequency analysis unit 102, a spectrum normalization unit 103, an average power memory 105, a noise power calculation unit 106, a speech section detection unit 107, a speech average power calculation unit 108, a local peak feature extraction unit 109, a standard pattern memory 110, a linear matching unit 111, a determination unit 112, and a recognition result output terminal 113. Its operation is described below.

First, a speech signal converted into an electrical signal by a microphone or the like is input to the speech input terminal 101, and the frequency analysis unit 102 extracts its frequency spectrum. The circuit configuration of the frequency analysis unit 102 is shown in FIG. 5.

In FIG. 5, 201 is a preamplifier (Amp), 202 is a bank of bandpass filters (BPF-1, ..., BPF-N), 203 is a bank of full-wave rectifiers (DET-1, ..., DET-N), 204 is a bank of low-pass filters (LPF-1, ..., LPF-N), 205 is an analog multiplexer, 206 is an AD converter, and 207 is a ROM for logarithmic conversion. The N bandpass filters have center frequencies spaced at equal logarithmic intervals and are numbered 1 to N from the lowest frequency upward; they are referred to below as the 1st channel, ..., the Nth channel. In this embodiment, N = 22.

The outputs of the low-pass filter bank 204 are selected in turn by the multiplexer 205, converted into digital values by the AD converter 206 with a sampling period of 10 ms, converted into logarithmic values by the logarithmic conversion ROM 207, and sent to the spectrum normalization unit 103.

Although analog filters are used for the frequency analysis unit 102 in this embodiment, a digital filter bank would be essentially equivalent.

Let $X_1(K), \ldots, X_N(K)$ denote the N channels of speech spectrum data of the K-th frame output by the logarithmic conversion ROM 207. The spectrum normalization unit 103 first calculates the mean value $\bar{X}(K)$ for each frame according to equation (1) below.

$$\bar{X}(K) = \frac{1}{N}\sum_{i=1}^{N} X_i(K) \qquad \cdots (1)$$

Hereinafter, $\bar{X}(K)$ is referred to as the "average power" of frame K. The average power $\bar{X}(K)$ is stored in the average power memory 105 and, at the same time, sent to the noise power calculation unit 106.

In response to an external instruction, the noise power calculation unit 106 calculates the noise average power in a section $[K_{E1}, K_{E2}]$ that is known in advance to contain only noise. The noise-only section is designated by having the speaker refrain from speaking.

In the noise power measurement, the noise average power $N_p$ over the noise section $[K_{E1}, K_{E2}]$ is obtained as in equation (2) below.

$$N_p = \frac{1}{K_{E2} - K_{E1} + 1}\sum_{K=K_{E1}}^{K_{E2}} \bar{X}(K) \qquad \cdots (2)$$

In this embodiment, $K_{E2} - K_{E1}$ is 20 to 30 frames. As the next step, the speech section detection unit 107 detects the speech section.
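The averaging in equations (1) and (2) can be sketched in a few lines of Python. This is only an illustration: the function names, the 0-based inclusive frame indices, and the toy values are assumptions of the sketch, not part of the patent.

```python
from typing import Sequence


def frame_average_power(frame: Sequence[float]) -> float:
    """Equation (1): mean of the N log-spectrum channel values of one frame."""
    return sum(frame) / len(frame)


def noise_average_power(
    frames: Sequence[Sequence[float]], k_e1: int, k_e2: int
) -> float:
    """Equation (2): mean of the frame average powers over the noise section.

    k_e1 and k_e2 are inclusive 0-based frame indices (an assumption of this
    sketch; the text does not fix an index origin).
    """
    powers = [frame_average_power(frames[k]) for k in range(k_e1, k_e2 + 1)]
    return sum(powers) / (k_e2 - k_e1 + 1)


# Toy data: three noise-only frames with four channels each (made-up values).
noise_frames = [
    [1.0, 2.0, 3.0, 2.0],
    [2.0, 2.0, 2.0, 2.0],
    [3.0, 1.0, 1.0, 3.0],
]
n_p = noise_average_power(noise_frames, 0, 2)
print(n_p)
```

Because the channel values are logarithmic, the "average power" here is a mean of log values rather than the log of a mean power, which follows the wording of the text.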

The speech section is detected using, for example, two thresholds $E_1$ and $E_2$ ($E_1 > E_2$) on the average power. As shown in FIG. 6, when the average power exceeds $E_2$ and subsequently exceeds $E_1$ without falling below $E_2$, the point at which $E_2$ was exceeded is taken as the start point $K_1$. The end point $K_2$ is obtained in the same way with the time axis reversed, and $[K_1, K_2]$ is taken as the speech section.

The thresholds $E_1$ and $E_2$ are set from the noise average power $N_p$ as

$$E_1 = N_p + L_1, \qquad E_2 = N_p + L_2 \qquad \cdots (3)$$

where $L_1$ and $L_2$ are arbitrary constants satisfying $L_1 > L_2 > 0$.
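The two-threshold rule and equation (3) might be implemented along the following lines. The function names, the handling of a frame that exactly equals a threshold, and the sample power sequence are assumptions made for this sketch.

```python
from typing import Optional, Sequence, Tuple


def find_start(power: Sequence[float], e1: float, e2: float) -> Optional[int]:
    """Return the frame where power rose above e2 and then exceeded e1
    without dropping below e2 in between, or None if that never happens."""
    candidate = None
    for k, p in enumerate(power):
        if p < e2:
            candidate = None      # fell below the lower threshold: start over
        elif p > e2 and candidate is None:
            candidate = k         # first frame above the lower threshold
        if candidate is not None and p > e1:
            return candidate
    return None


def detect_speech_section(
    power: Sequence[float], n_p: float, l1: float, l2: float
) -> Optional[Tuple[int, int]]:
    """Return (K1, K2) as 0-based frame indices, or None if no speech found."""
    e1, e2 = n_p + l1, n_p + l2   # equation (3), with l1 > l2 > 0
    start = find_start(power, e1, e2)
    if start is None:
        return None
    # Same rule with the time axis reversed gives the end point.
    reversed_start = find_start(list(power)[::-1], e1, e2)
    end = len(power) - 1 - reversed_start
    return start, end


# Made-up average-power sequence with a short false start at frame 2.
power = [2.0, 2.1, 3.5, 2.0, 3.2, 5.0, 6.0, 4.0, 3.1, 2.0]
print(detect_speech_section(power, n_p=2.0, l1=2.0, l2=1.0))
```

The brief rise at frame 2 is discarded because the power drops below $E_2$ again before reaching $E_1$, which is the behaviour the two thresholds are meant to provide.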

The spectrum normalization unit 103 also spectrum-normalizes the speech spectrum data $X_i(K)$ ($K = K_1, \ldots, K_2$; $i = 1, \ldots, N$) using the least-squares approximation line given by equation (4) below.

$$Y_i(K) = A(K)\cdot i + B(K) \qquad \cdots (4)$$

That is, the difference between the speech spectrum data $X_i(K)$ and the least-squares approximation line is calculated by equation (5) below to obtain the spectrum-normalized data $Z_i(K)$.

$$Z_i(K) = X_i(K) - \bigl(A(K)\cdot i + B(K)\bigr) \qquad (K = K_1, \ldots, K_2;\ i = 1, \ldots, N) \qquad \cdots (5)$$

$A(K)$ and $B(K)$ of the least-squares approximation line in equation (4) are given by equations (6) and (7).
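A minimal Python sketch of the normalization in equations (4) and (5) follows. It assumes that $A(K)$ and $B(K)$ are the ordinary closed-form least-squares slope and intercept of $X_i(K)$ regressed on the channel index $i = 1, \ldots, N$; the function names are illustrative and do not come from the patent.

```python
from typing import List, Sequence, Tuple


def least_squares_line(x: Sequence[float]) -> Tuple[float, float]:
    """Slope A and intercept B of the line fitted to (i, x_i), i = 1..N,
    using the standard closed-form least-squares solution (assumed here)."""
    n = len(x)
    idx = range(1, n + 1)
    s_i = sum(idx)
    s_ii = sum(i * i for i in idx)
    s_x = sum(x)
    s_ix = sum(i * xi for i, xi in zip(idx, x))
    a = (n * s_ix - s_i * s_x) / (n * s_ii - s_i * s_i)
    b = (s_x - a * s_i) / n
    return a, b


def normalize_spectrum(x: Sequence[float]) -> List[float]:
    """Equation (5): subtract the fitted line from each channel value."""
    a, b = least_squares_line(x)
    return [xi - (a * i + b) for i, xi in enumerate(x, start=1)]


# A spectrum that lies exactly on a line normalizes to all zeros.
print(normalize_spectrum([1.0, 3.0, 5.0, 7.0]))
# A single bump keeps its shape but is re-centred around zero.
print(normalize_spectrum([0.0, 2.0, 0.0]))
```

Subtracting the fitted line removes the overall spectral tilt, so the remaining values describe only how each channel deviates from that tilt.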

The calculation of equation (5) is also performed when $A(K)$ is positive, which in most cases corresponds to unvoiced sound. In the time direction, equation (5) is calculated for every frame K, starting before the speech section is detected. In addition, a quantity $\bar{X}'(K)$, referred to as the speech average power, is calculated from the average power $\bar{X}(K)$, $K \in [K_1, K_2]$, of the speech section and the noise average power $N_p$.

$$\bar{X}'(K) = \begin{cases} \bar{X}(K) - N_p & \text{if } \bar{X}(K) > N_p \\ 0 & \text{if } \bar{X}(K) \le N_p \end{cases} \qquad (K = K_1, \ldots, K_2) \qquad \cdots (8)$$

The operation of equation (8) is performed by the speech average power calculation unit 108. In other words, $\bar{X}'(K)$ is given by $\max\{0, \bar{X}(K) - N_p\}$.

Next, the local peak feature extraction unit 109 processes the spectrum-normalized data $Z_i(K)$ ($K = K_1, \ldots, K_2$; $i = 1, \ldots, N$) from the start to the end of the speech sequentially, one frame at a time.

The detailed configuration of the local peak feature extraction unit 109 is shown in FIG. 7.

In FIG. 7, 401 is a data storage memory, 402 is a peak extraction unit, and 403 is a local peak pattern storage memory.

The data storage memory 401 stores the spectrum-normalized data $Z_i(K)$ ($K = K_1, \ldots, K_2$; $i = 1, 2, \ldots, N$) from the start point $K_1$ to the end point $K_2$, and the peak extraction unit 402 detects the local peaks of the spectrum-normalized data frame by frame. Peak extraction is performed as follows on the spectrum-normalized data $Z_i(K)$ that are "0" or greater.

$$Z_{i+1}(K) - Z_i(K) > 0 \qquad \cdots (9)$$

$$Z_{i+2}(K) - Z_{i+1}(K) < 0 \qquad \cdots (10)$$

When expressions (9) and (10) are satisfied simultaneously, the spectrum-normalized data $Z_{i+1}(K)$ is a local maximum, $(i+1)$ is the channel number at which a local peak exists, and the local peak pattern $P_{i+1}(K)$ is set to 1.

Whether expressions (9) and (10) are satisfied is checked in order for the spectrum-normalized data from $i = 1$ to $(N-1)$. If they are satisfied, the local peak pattern $P_{i+1}(K)$ is set to "1"; otherwise it is set to "0". The binarized result is stored in the local peak pattern storage memory 403.

As boundary conditions, however,

$$P_1(K) = 1 \quad \text{when } i = 1 \text{ and } Z_{i+1}(K) - Z_i(K) < 0 \qquad \cdots (11)$$

$$P_N(K) = 1 \quad \text{when } i = N-1 \text{ and } Z_{i+1}(K) - Z_i(K) > 0 \qquad \cdots (12)$$

and the pattern is set to "0" when the condition is not satisfied. The binarized result is stored in the local peak pattern storage memory 403.
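The peak test of expressions (9) to (12) for a single frame could look like the Python sketch below. It reads the "0 or greater" restriction as requiring the peak value itself to be non-negative, which is one interpretation of the text; the function name and test values are likewise assumptions.

```python
from typing import List, Sequence


def local_peak_pattern(z: Sequence[float]) -> List[int]:
    """Binary local peak pattern P_1..P_N for one frame of normalized data.

    Indices are 0-based here, so list position j corresponds to channel j + 1.
    Requires at least two channels.
    """
    n = len(z)
    p = [0] * n
    # Interior channels: rising into the channel (9) and falling out of it (10).
    for j in range(1, n - 1):
        if z[j] - z[j - 1] > 0 and z[j + 1] - z[j] < 0 and z[j] >= 0:
            p[j] = 1
    # Boundary condition (11): lowest channel is a peak if the data fall after it.
    if z[1] - z[0] < 0 and z[0] >= 0:
        p[0] = 1
    # Boundary condition (12): highest channel is a peak if the data rise into it.
    if z[n - 1] - z[n - 2] > 0 and z[n - 1] >= 0:
        p[n - 1] = 1
    return p


frame = [0.5, -0.2, 0.3, 1.0, 0.4, -0.6, -0.1, 0.2]
print(local_peak_pattern(frame))
```

Plateaus (equal neighbouring values) produce no peak in this sketch because both inequalities are strict, as they are in expressions (9) and (10).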

FIG. 8 shows a flowchart of the extraction of the local peak pattern $P_i(K)$ ($i = 1, 2, \ldots, N$; $K = K_1, \ldots, K_2$) for a given K-th frame in the local peak feature extraction unit 109.

In this way, the spectrum-normalized data $Z_i(K)$ ($i = 1, 2, \ldots, N$; $K = K_1, \ldots, K_2$) in the data storage memory 401 are read out frame by frame to obtain a binary local peak pattern.

The standard pattern memory 110 stores one standard pattern per category, created by superimposing a plurality of local peak patterns obtained in advance. To normalize differences in speaking rate, the average utterance length is determined for each category, the binary local peak patterns are linearly expanded or contracted in the time direction so that their frame lengths equal the average utterance length, and the patterns are superimposed channel by channel from channel 1 to channel N to form a multi-valued pattern. The standard pattern memory 110 also stores the standard pattern length SL of each standard pattern.
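One possible way to build such a multi-valued standard pattern is sketched below in Python. The patent says only that the expansion or contraction is linear; the proportional frame-index mapping, the rounding of the average length, and all names used here are assumptions of this sketch.

```python
from typing import List, Sequence

Pattern = Sequence[Sequence[int]]  # frames x channels, binary values


def linear_warp(pattern: Pattern, target_len: int) -> List[List[int]]:
    """Resample a pattern to target_len frames by proportional index mapping."""
    src_len = len(pattern)
    return [list(pattern[(k * src_len) // target_len]) for k in range(target_len)]


def build_standard_pattern(patterns: Sequence[Pattern]) -> List[List[int]]:
    """Warp every training pattern to the average length and add them up."""
    avg_len = round(sum(len(p) for p in patterns) / len(patterns))
    n_channels = len(patterns[0][0])
    standard = [[0] * n_channels for _ in range(avg_len)]
    for pattern in patterns:
        for k, frame in enumerate(linear_warp(pattern, avg_len)):
            for i, value in enumerate(frame):
                standard[k][i] += value
    return standard


# Two made-up utterances of one category, 2 and 4 frames long, 2 channels.
short = [[1, 0], [0, 1]]
long_ = [[1, 0], [1, 0], [0, 1], [0, 1]]
standard = build_standard_pattern([short, long_])
print(len(standard), standard)
```

Here the standard pattern length SL is simply `len(standard)`, and each cell counts how many training utterances had a local peak at that frame and channel.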

The linear matching unit 111 linearly expands or contracts the time frame length of the input local peak pattern supplied by the local peak feature extraction unit 109 to the standard pattern length SL of each category in the standard pattern memory 110, and then, for each frame, calculates the initial similarity and weights the similarity by the average power.

The initial similarity $\hat{S}(K)$ is calculated for each frame according to equation (13) below.

$$\hat{S}(K) = \frac{P \cdot D^{t}}{\|P\|\,\|D\|} \qquad \cdots (13)$$

Here $P$ denotes the feature vector of the input local peak pattern in the K-th frame.

$$P = (P_1(K), P_2(K), \ldots, P_N(K)) \qquad \cdots (14)$$

$D$ denotes the feature vector of an arbitrary standard pattern in the K-th frame.

$$D = (D_1(K), D_2(K), \ldots, D_N(K)) \qquad \cdots (15)$$

$D^{t}$ denotes the transpose of $D$, and $\|P\|$ and $\|D\|$ denote the norms of $P$ and $D$, respectively.
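Equation (13) is a normalized inner product (cosine similarity) between the binary input frame and the multi-valued standard frame. A small Python sketch is given below; returning 0 when either vector has no peaks is an assumption of the sketch, since the text does not say how an all-zero frame is treated.

```python
import math
from typing import Sequence


def initial_similarity(p: Sequence[float], d: Sequence[float]) -> float:
    """Equation (13): P . D^t / (||P|| ||D||) for one frame."""
    dot = sum(pi * di for pi, di in zip(p, d))
    norm_p = math.sqrt(sum(pi * pi for pi in p))
    norm_d = math.sqrt(sum(di * di for di in d))
    if norm_p == 0.0 or norm_d == 0.0:
        return 0.0  # assumption: an empty frame contributes no similarity
    return dot / (norm_p * norm_d)


# Input peaks at channels 1 and 4; the standard frame has counts at the same places.
print(initial_similarity([1, 0, 0, 1], [3, 0, 0, 3]))
# Peaks in disjoint channels give zero similarity.
print(initial_similarity([1, 0], [0, 2]))
```

Because of the normalization, the similarity depends on where the peaks fall rather than on how many training utterances were accumulated in the standard pattern.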

Furthermore, because speech is easily affected by noise when its average power is low, the final similarity $S$ is obtained from equation (16) below.

$$S = \frac{\sum_{K=1}^{SL} \hat{S}(K)\cdot\bar{X}'(K)}{SL \cdot \max_{K}\{\bar{X}'(K)\}} \qquad \cdots (16)$$

Here $\bar{X}'(K)$ is the speech average power obtained from equation (8), $\max_{K}\{\bar{X}'(K)\}$ is the maximum speech average power within the time-normalized section of the input speech, and SL is the standard pattern length. That is, the final similarity $S$ is the initial similarity $\hat{S}(K)$ of equation (13) weighted by the speech average power $\bar{X}'(K)$ of equation (8), from which the noise average power $N_p$ has been subtracted.

The linear matching unit 111 is configured as shown in FIG. 9, in which 701 is an input terminal for the input local peak pattern, 702 is an input terminal for the standard pattern, 703 is an input terminal for the speech average power, 705 is an input terminal for the standard pattern length, 706, 707, and 708 are multiply-adders, 709 and 710 are square-root calculators, 711 is a multiplier, 712 is a divider, 713 is a multiply-adder, 714 is a multiplier, 715 is a maximum value detector, 716 is a multiplier, and 717 is an output terminal.

The norm of the input local peak pattern $P$ input from the input terminal 701 is calculated by the multiply-adder 706 and the square-root calculator 709 as in equation (17) below.

$$\|P\| = \sqrt{\sum_{i=1}^{N} P_i(K)^2} \qquad \cdots (17)$$

Similarly, the norm of the standard pattern $D$ is obtained by the multiply-adder 708 and the square-root calculator 710. The multiply-adder 707 obtains $P \cdot D$ as in expression (18) below.

$$\sum_{i=1}^{N} P_i(K)\cdot D_i(K) \qquad \cdots (18)$$

The divider 712 then yields the initial similarity $\hat{S}(K)$ of equation (13).

Meanwhile, the speech average power $\bar{X}'(K)$ is input from the input terminal 703, and the multiply-adder 713 obtains

$$\sum_{K=1}^{SL} \hat{S}(K)\cdot\bar{X}'(K) \qquad \cdots (19)$$

The maximum value detector 715 obtains the maximum of the time-normalized speech average power,

$$\max_{K \in [1, SL]}\{\bar{X}'(K)\} \qquad \cdots (20)$$

which the multiplier 716 multiplies by the standard pattern length SL. This product and the result of expression (19) are input to the divider 714, which yields the final similarity $S$. The final similarity is output to the output terminal 717 and sent to the determination unit 112.
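The arithmetic described for FIG. 9 can be sketched in Python as follows, assuming the final similarity is the accumulated product of expression (19) divided by SL times the maximum of expression (20), as the description of the divider suggests. The function names and the zero result for an all-zero power sequence are assumptions of this sketch.

```python
from typing import Sequence


def speech_average_power(avg_power: float, n_p: float) -> float:
    """Equation (8): average power minus noise average power, floored at 0."""
    return max(0.0, avg_power - n_p)


def final_similarity(
    initial: Sequence[float], speech_power: Sequence[float]
) -> float:
    """Power-weighted, length-normalized accumulation of frame similarities.

    initial[k] is the per-frame similarity and speech_power[k] the speech
    average power of the same time-normalized frame; both have length SL.
    """
    sl = len(initial)
    peak = max(speech_power)                      # expression (20)
    if peak <= 0.0:
        return 0.0  # assumption: no frame rises above the noise level
    weighted = sum(s * w for s, w in zip(initial, speech_power))  # (19)
    return weighted / (sl * peak)


# Made-up per-frame similarities and speech average powers, SL = 4.
similarities = [1.0, 0.5, 0.0, 1.0]
powers = [2.0, 4.0, 1.0, 0.0]
print(final_similarity(similarities, powers))
```

In the example the last frame matches perfectly but carries no weight because its speech average power is zero, which illustrates how low-level frames, where noise peaks are most likely, are discounted.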

The determination unit 112 outputs, via the output terminal 113, the category name or category number of the standard pattern giving the maximum final similarity among all the standard patterns, as the recognition result.

(Effects of the Invention) As described in detail above, the present invention extracts local peaks with the least-squares approximation line as the reference regardless of whether the sound is voiced or unvoiced, and weights the similarity by the speech average power (= average power − noise average power). This compensates for imperfect extraction of the local peaks of speech even when the speech is contaminated by noise, particularly in its low-level portions. In other words, peaks caused by noise are seldom mistaken for local peaks of speech, so that good recognition is achieved.

[Brief Description of the Drawings]

FIG. 1 is a flowchart of the speech recognition method according to the present invention; FIG. 2 is a flowchart of the conventional speech recognition method; FIG. 3 is a diagram for explaining the binarization of an input speech signal; FIG. 4 is a block diagram showing the overall circuit configuration of an embodiment of the speech recognition device according to the present invention; FIG. 5 is a block diagram showing the circuit configuration of the frequency analysis unit; FIG. 6 is a diagram for explaining speech section detection; FIG. 7 is a block diagram showing the configuration of the local peak feature extraction unit; FIG. 8 is a flowchart of local peak pattern extraction; and FIG. 9 is a block diagram showing the circuit configuration of the linear matching unit. 101: input terminal; 102: frequency analysis unit; 103: spectrum normalization unit; 105: average power memory; 106: noise power calculation unit; 107: speech section detection unit; 108: speech average power calculation unit; 109: local peak feature extraction unit; 110: standard pattern memory; 111: linear matching unit; 112: determination unit; 113: output terminal.

Claims

[Scope of Claims] 1. A speech recognition method comprising: a process of frequency-analyzing input speech and extracting analysis data of a plurality of channels for each speech frame; a process of calculating a noise average power by averaging all the analysis data of all the speech frames in a predetermined noise section; a process of calculating, for each speech frame of the input speech, an average power by averaging all the analysis data corresponding to that frame; a process of calculating, for each speech frame of the input speech, a speech average power that is the difference between the average power and the noise average power; a process of spectrum-normalizing the analysis data of the input speech using a least-squares approximation line of the speech spectrum in the speech frame to which the data belong; a process of determining, for the spectrum-normalized analysis data, the presence or absence of a local maximum along the frequency axis, that is, a local peak, with the least-squares approximation line as the reference, and converting the data into a binarized local peak pattern of the input speech in which "1" denotes the presence of a local peak and "0" its absence; a process of calculating the similarity between the local peak pattern of the input speech and each of a plurality of standard patterns represented by local peak patterns prepared in advance, for each speech frame normalized between the standard pattern and the input speech, weighting the similarity by the speech average power, and accumulating it over all the normalized speech frames to calculate a final similarity; and a process of determining, as the recognition result, the category name of the standard pattern corresponding to the maximum of the final similarities of the standard patterns.