JPS61206000A - Voice recognition equipment - Google Patents

Voice recognition equipment

Info

Publication number
JPS61206000A
Authority
JP
Japan
Prior art keywords
pattern
speech
spectral
recognition device
level
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
JP4653085A
Other languages
Japanese (ja)
Inventor
畑岡 信夫
天野 明雄
矢島 俊一
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hitachi Ltd
Original Assignee
Hitachi Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hitachi Ltd filed Critical Hitachi Ltd
Priority to JP4653085A priority Critical patent/JPS61206000A/en
Publication of JPS61206000A publication Critical patent/JPS61206000A/en
Pending legal-status Critical Current

Abstract

(57) [Abstract] This publication contains application data filed before the introduction of electronic filing, so no abstract data is recorded.

Description

[Detailed Description of the Invention]

[Field of Application of the Invention]

The present invention relates to a speech recognition device, and in particular to one that extracts a spectral pattern that is stable against fluctuations in utterance level, while mitigating the influence of the fundamental frequency of the speech and of analysis fluctuations.

[Background of the Invention]

The conventional spectral-pattern normalization method consisted of the two-stage processing (level correction, then reference-level matching) shown in Fig. 1 (see, e.g., "Simplified Speaker-Independent Speech Recognition", Speech Study Group material no. 884-152, Acoustical Society of Japan). The conventional method has the advantage of performing a normalization that preserves the absolute level of the original spectral pattern to some extent, but it gives no consideration to the effective use of the arithmetic word length of the analysis, which suffers when absolute levels differ because of utterance level. In a speech recognition device driven by discrete (isolated) utterances, where the utterance level can be kept roughly uniform (strictly, a device in which input speech is uttered in the same manner as the registered speech), this caused the conventional method little trouble. In a continuous-utterance recognition device, however, where the registered speech and the input speech are uttered differently, the absolute level of a registered word (generally a fragment of the input speech) can vary greatly depending on where it appears in the continuously uttered speech (e.g., word-medially or word-finally), and when the absolute level is small the arithmetic word length of the analysis could not be used effectively. Furthermore, in conventional band-pass filter (BPF) analysis, harmonics of the fundamental frequency of the speech and analysis fluctuations caused by insufficient arithmetic precision of the analysis filters could produce a pattern differing from the true spectral shape (e.g., spurious local peaks appearing in the low-frequency band).

[Objects of the Invention]

An object of the present invention is to provide a pattern normalization scheme for a speech recognition device in which the arithmetic word length of the analysis is used effectively even when the absolute level of the speech differs, and which is not affected by the fundamental frequency of the speech or by analysis fluctuations.

[Summary of the Invention]

To achieve the above object, the first feature of the present invention is a level normalization that divides the spectral value of each band by a value based on the sum of the values over the bands of the short-time spectral pattern; the second feature is a smoothing of the spectral pattern.

[Embodiments of the Invention]

The principle of the present invention is explained in detail below.

Fig. 1 shows the flow of the conventional pattern normalization. Taking as input the spectral pattern f_i (i = 1 to N channels) obtained from, e.g., band-pass filter (BPF) analysis, the first step applies a level correction such as a logarithmic transform; this is a processing step motivated by the characteristics of human hearing. Besides the transform shown in the flow, a simple logarithmic transform applied to fixed-point values, as described in the aforementioned "Simplified Speaker-Independent Speech Recognition", is also conceivable. The second step then performs reference-level matching to normalize the differences in absolute level caused by differences in utterance level.
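As a sketch (not the patent's own implementation), the conventional two-stage flow might look like the following; the constant `f_c` and the use of NumPy are illustrative assumptions:

```python
import numpy as np

def conventional_normalize(f, f_c=1.0):
    """Conventional two-stage pattern normalization (Fig. 1 flow)."""
    # Step 1: level correction -- auditory-motivated log transform
    g = np.log10(1.0 + f / f_c)
    # Step 2: reference-level matching -- subtract the channel mean
    return g - g.mean()

# Because the log transform acts on the raw (un-normalized) levels,
# the output still depends on the absolute utterance level:
f = np.array([0.2, 1.5, 0.8, 0.1])
print(np.allclose(conventional_normalize(f), conventional_normalize(10 * f)))  # False
```

This residual level-dependence is exactly the weakness the invention's extra normalization step addresses.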

Fig. 2 shows an example flow of the pattern normalization according to the present invention. Each step is explained in detail below.

Input: spectral pattern f_i, i = 1 to N channels.

(1) First step (level normalization):
f_i' = f_i / f_sum, where f_sum = Σ f_i (the sum over all N channels).

Level normalization is performed so that the arithmetic word length of the analysis is used effectively, unaffected by differences in absolute level. As a result, the logarithmic transform of the second step can be executed accurately even for speech with a small absolute level.

(2)第2ステツプ(レベルの補正) fl’ == nogxa (1,0+ f1/ f、
)、但しf、;定数 この処理は従来方式の第1ステツプに等しい。
(2) Second step (level correction) fl' == nogxa (1,0+f1/f,
), where f, ; constant This process is equivalent to the first step of the conventional method.

(3) Third step (reference-level matching):
f_i''' = f_i'' − f_avg'', where f_avg'' = (Σ f_i'') / N.

This is identical to the second step of the conventional method.

(4)第4ステツプ(平滑化処理) f、’=f1”’−1+2ft”+f1.1”’)/4
但しi=l〜n (N 一般に、帯域通過フィルタ分析の中心周波数配置は人間
の聴覚特性と母音の第1ホルマント周波数の分布(日本
語5母音の場合(男性)、Ialが600〜900 H
z、Ialが160〜300Hz、julが200〜5
00H2、lelが300〜600 Hz、101が4
00〜650Hzとなっている)を考慮して、対数スケ
ールあるいはメルスケール配置とするのが通例である。
(4) Fourth step (smoothing process) f,'=f1"'-1+2ft"+f1.1"')/4
However, i = l~n (N) In general, the center frequency arrangement for bandpass filter analysis is based on the human auditory characteristics and the distribution of the first formant frequency of vowels (for Japanese 5 vowels (male), Ial is 600~900 H
z, Ial is 160-300Hz, jul is 200-5
00H2, lel is 300-600 Hz, 101 is 4
00 to 650 Hz), it is customary to use a logarithmic scale or mel scale arrangement.
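As an illustrative check of such a placement (the exact channel counts depend on endpoint conventions that the text does not specify), log-scale center frequencies over 150-4000 Hz with 16 channels can be computed as follows:

```python
import math

def log_scale_centers(f_lo=150.0, f_hi=4000.0, n=16):
    """Place n band-pass filter center frequencies on a logarithmic scale."""
    ratio = f_hi / f_lo
    return [f_lo * ratio ** (k / (n - 1)) for k in range(n)]

centers = log_scale_centers()
# With this particular convention, 6 of the 16 centers fall below 500 Hz,
# illustrating the low-frequency crowding described in the text.
print(sum(1 for f in centers if f < 500))  # 6
```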

With such a placement, the center frequencies tend to be crowded toward the low-frequency side (e.g., with 16 channels covering the band 150-4000 Hz and center frequencies on a logarithmic scale, 6 channels fall below 500 Hz and 10 channels below 1 kHz). As a result, harmonics of the fundamental frequency of the speech (the periodicity of the voiced part of the waveform, corresponding to the vocal-fold vibration rate, 100-150 Hz for male speakers), which are unrelated to phonemic features (e.g., /a/, /i/), readily show up in the low-order channels and can corrupt the spectral pattern (e.g., two or three small peaks appearing in the low orders). The smoothing is performed to counter this phenomenon, and is applied in particular to the low-order channels up to about 500 Hz. A three-channel moving average, as in the example of the present invention, is one conceivable implementation.
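The four steps can be sketched end to end as follows (a hedged illustration: the constant `f_c`, the number of smoothed low-order channels `n_low`, and the use of NumPy are assumptions, not specified by the patent):

```python
import numpy as np

def normalize_pattern(f, f_c=1.0, n_low=6):
    """Four-step pattern normalization (Fig. 2 flow)."""
    # Step 1: level normalization -- divide by the sum over all channels
    f1 = f / f.sum()
    # Step 2: level correction -- log transform, now on a level-independent input
    f2 = np.log10(1.0 + f1 / f_c)
    # Step 3: reference-level matching -- subtract the channel mean
    f3 = f2 - f2.mean()
    # Step 4: smoothing -- 3-channel weighted moving average on low-order channels
    f4 = f3.copy()
    for i in range(1, n_low):
        f4[i] = (f3[i - 1] + 2 * f3[i] + f3[i + 1]) / 4
    return f4

# Step 1 makes the whole chain invariant to the absolute utterance level:
f = np.array([0.3, 2.0, 1.1, 0.4, 0.2, 0.1, 0.5, 0.9])
print(np.allclose(normalize_pattern(f), normalize_pattern(10 * f)))  # True
```

This level invariance is the point of contrast with the conventional two-step flow, whose output changes when the input is scaled.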

Fig. 3 compares the spectral patterns produced by the conventional normalization method and by scheme 1 of the present invention (first through third steps). The speech inputs were a word-medial /i/ and a word-final /i/, of greatly differing absolute level, taken from the utterance /aikurushii/. Compared with the conventional method, scheme 1 yields a sharp peak (first formant) that is not flattened, and the dynamic-range ratio of the pattern of the low-level word-final /i/, 2.4 under the conventional method, increased 1.4-fold, giving a good pattern normalization.

Fig. 4 shows the effect of scheme 2 of the present invention (first through fourth steps, i.e., with the smoothing added): the smoothing removes the small local peaks that had appeared below the first formant. The topmost pattern is the LPC spectrum, which represents the true spectral shape.

Next, specific embodiments of the present invention are described in detail.

Fig. 5 is a block diagram showing the configuration of one embodiment of a speech recognition device using the present invention. The input speech 1 passes through a low-pass filter (LPF) and analog-to-digital converter (ADC) 2, where it is sampled from analog to digital values while aliasing noise is removed. The speech analysis unit 3 then analyzes the input speech to obtain the spectral pattern needed for recognition.

The spectral pattern can be obtained by, for example, band-pass filter (BPF) analysis or FFT analysis. The pattern normalization unit 4 of the present invention then normalizes and corrects the pattern for utterance-level fluctuations and for the band-pass filter analysis results. Next, the distance calculator 6 computes the difference (or similarity) between the normalized input spectral pattern and the spectral pattern of the standard speech stored in the standard speech memory 5. A possible distance is the sum over channels of the absolute differences of the spectral values.
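A minimal sketch of such a per-frame distance (the city-block distance over channels) and of template selection by smallest distance; the function names and the two-channel toy templates are illustrative, not from the patent:

```python
import numpy as np

def frame_distance(a, b):
    """Sum over channels of absolute spectral differences (city-block distance)."""
    return float(np.abs(a - b).sum())

def nearest_template(pattern, templates):
    """Pick the stored standard pattern with the smallest distance to the input."""
    return min(templates, key=lambda name: frame_distance(pattern, templates[name]))

templates = {"a": np.array([0.9, 0.1]), "i": np.array([0.1, 0.9])}
print(nearest_template(np.array([0.8, 0.2]), templates))  # a
```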

Next, the matching unit 7 computes a total distance (matching) value that also accounts for differences in temporal structure between the input speech and the standard speech, and the decision unit 8 determines, from the relative magnitudes of the total distance values, which standard speech the input most resembles, outputting the recognition result 9. The distance calculator 6 can be built from simple adders and subtractors alone; the matching unit 7 can be built from, e.g., a circuit implementing a continuous NL (nonlinear) matching method (a known example: the continuous DP method, an improvement on JP-A-55-2205); and the decision unit 8 can be realized as a simple magnitude comparator. The division and logarithmic transform in the pattern normalization unit 4 of the present invention can be executed by table look-up, and the remaining processing can be built from adders and subtractors alone.
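The table look-up mentioned for the logarithmic transform can be sketched as follows (the table size and the [0, 1) input range are assumptions chosen for illustration, in the spirit of fixed-point hardware):

```python
import numpy as np

TABLE_SIZE = 1024
# Precomputed log10(1 + x) for x in [0, 1), as fixed-point hardware might store it
LOG_TABLE = np.log10(1.0 + np.arange(TABLE_SIZE) / TABLE_SIZE)

def log_lookup(x):
    """Approximate log10(1 + x) for x in [0, 1) by indexing the precomputed table."""
    return float(LOG_TABLE[int(x * TABLE_SIZE)])

print(abs(log_lookup(0.5) - np.log10(1.5)) < 1e-9)  # True
```

Replacing the transcendental function with an indexed read is what lets the normalization unit run with only adders, subtractors, and memory.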

[Effects of the Invention]

Fig. 6 shows consonant recognition results for continuous speech under the conventional method and the method of the present invention, expressed as the recognition rate when multiple candidates up to rank n are allowed. Compared with the conventional method, scheme 2 of the present invention, which includes the smoothing, improved the recognition rate by about 3% (within rank 1). The improvement is especially marked when candidates within ranks 4 and 5 are considered.

As described above, the present invention makes effective use of the arithmetic word length of the analysis regardless of absolute-level fluctuations, and mitigates the influence of the fundamental frequency and of analysis fluctuations in band-pass filter analysis, with the effect of improving recognition performance.

[Brief Description of the Drawings]

Fig. 1 shows the flow of the conventional pattern normalization; Fig. 2 shows one embodiment of the pattern normalization flow of the present invention; Fig. 3 compares the spectral patterns of the conventional method and the method of the present invention; Fig. 4 shows the effect of the smoothing scheme of the present invention in particular; Fig. 5 is a block diagram of one embodiment of a speech recognition device incorporating the present invention; and Fig. 6 presents recognition experiment results demonstrating the effect of the present invention. 4: pattern normalization unit.

Claims (1)

[Claims]

1. A speech recognition device comprising a speech analysis unit that extracts at least a short-time spectral pattern of speech, characterized by comprising: first means for level normalization of the spectral pattern obtained by the speech analysis unit; second means for level-correction transformation of the pattern obtained by the first means; and third means for reference-level matching of the pattern obtained by the second means.

2. The speech recognition device of claim 1, further characterized by comprising means for smoothing the spectral pattern obtained by the speech analysis unit using spectral information of adjacent bands.

3. The speech recognition device of claim 1, characterized in that the first means divides the spectral information of each band by a value based on the sum of the spectral values over the bands of the short-time spectral pattern, the second means performs a logarithmic transform, and the third means subtracts from the spectral value of each band a value based on the mean of the spectral values over the bands of the spectral pattern.
JP4653085A 1985-03-11 1985-03-11 Voice recognition equipment Pending JPS61206000A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP4653085A JPS61206000A (en) 1985-03-11 1985-03-11 Voice recognition equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP4653085A JPS61206000A (en) 1985-03-11 1985-03-11 Voice recognition equipment

Publications (1)

Publication Number Publication Date
JPS61206000A true JPS61206000A (en) 1986-09-12

Family

ID=12749836

Family Applications (1)

Application Number Title Priority Date Filing Date
JP4653085A Pending JPS61206000A (en) 1985-03-11 1985-03-11 Voice recognition equipment

Country Status (1)

Country Link
JP (1) JPS61206000A (en)

Similar Documents

Publication Publication Date Title
Reynolds Experimental evaluation of features for robust speaker identification
US5054085A (en) Preprocessing system for speech recognition
US7756700B2 (en) Perceptual harmonic cepstral coefficients as the front-end for speech recognition
Hunt et al. Speaker dependent and independent speech recognition experiments with an auditory model
JPH0743598B2 (en) Speech recognition method
JP2745535B2 (en) Voice recognition device
Vergin et al. Compensated mel frequency cepstrum coefficients
Bahaghighat et al. Textdependent Speaker Recognition by combination of LBG VQ and DTW for persian language
JPS6366600A (en) Method and apparatus for obtaining normalized signal for subsequent processing by preprocessing of speaker,s voice
Alam et al. Robust feature extractors for continuous speech recognition
JP3354252B2 (en) Voice recognition device
JPS61206000A (en) Voice recognition equipment
JPH0449952B2 (en)
Chougule et al. Channel robust MFCCs for continuous speech speaker recognition
Zhen et al. On the use of bandpass liftering in speaker recognition
Kyriakides et al. Isolated word endpoint detection using time-frequency variance kernels
JPH04230800A (en) Voice signal processor
Saha et al. Modified mel-frequency cepstral coefficient
Sahu et al. Significance of Filterbank Structure for Capturing Dysarthric Information through Cepstral Coefficients
Sivakumaran et al. Sub-band based speaker verification using dynamic recombination weights.
JPH1097288A (en) Background noise removing device and speech recognition system
Islam et al. Mel-Wiener filter for Mel-LPC based speech recognition
Ormanci et al. Subjective assessment of frequency bands for perception of speaker identity
JP2658426B2 (en) Voice recognition method
JPH0146079B2 (en)