JPS6336680B2 - Automatic voice recognition system - Google Patents

Info

Publication number
JPS6336680B2
JPS6336680B2 JP57021412A JP2141282A
Authority
JP
Japan
Prior art keywords
phoneme
recognition
word
unit
peak
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired
Application number
JP57021412A
Other languages
Japanese (ja)
Other versions
JPS58139199A (en)
Inventor
Satoshi Fujii
Katsuyuki Futayada
Hideji Morii
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Panasonic Holdings Corp
Original Assignee
Matsushita Electric Industrial Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Matsushita Electric Industrial Co Ltd filed Critical Matsushita Electric Industrial Co Ltd
Priority to JP57021412A priority Critical patent/JPS58139199A/en
Publication of JPS58139199A publication Critical patent/JPS58139199A/en
Publication of JPS6336680B2 publication Critical patent/JPS6336680B2/ja
Granted legal-status Critical Current

Description

DETAILED DESCRIPTION OF THE INVENTION

The present invention relates to an automatic speech recognition device for automatically recognizing speech signals uttered by humans.

An automatic speech recognition device that automatically recognizes speech uttered by humans is expected to become a highly effective means of supplying data and commands from humans to computers and various machines. For example, if a device that recognizes spoken digits is connected to a computer, numeric data such as slips can be entered by voice; in particular, since speech signals can be transmitted to remote locations over telephone lines, tasks such as issuing slips and making inventory inquiries can be carried out immediately. Considering further that speech input leaves the hands and feet free for other tasks, the benefits brought by automatic speech recognition devices are considered to be extremely large.

The pattern matching method has most often been adopted as the operating principle of the automatic speech recognition devices studied or announced to date. In this method, a reference pattern is stored in advance for every word that must be recognized; the degree of agreement with an unknown input pattern (hereinafter called the similarity) is computed by comparison, and the input is judged to be the same word as the reference pattern giving the best match. Because a reference pattern is prepared for every word to be recognized, new reference patterns must be entered and stored whenever the speaker changes. Consequently, when several hundred or more words are to be recognized, such as the names of cities throughout Japan, enormous time and effort are needed to speak and register every word, and the memory capacity required for registration can also be expected to be enormous. A further drawback is that the amount of processing required to match an input pattern against the reference patterns also becomes enormous as the number of words grows.
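
To make the bookkeeping concrete, the following is a minimal sketch of whole-word template matching (an illustration, not code from the patent); the fixed-length feature matrices and the Euclidean distance are simplifying assumptions, and practical systems of this period also needed time alignment such as dynamic time warping:

```python
import numpy as np

def match_word(input_pattern, templates):
    """Return the word whose stored reference pattern best matches the input.

    input_pattern : (frames, channels) array for the unknown utterance
    templates     : dict mapping word -> (frames, channels) reference pattern
    """
    best_word, best_sim = None, -np.inf
    for word, ref in templates.items():
        # naive frame-by-frame comparison; assumes comparable lengths,
        # which is why real systems need time normalization (e.g. DTW)
        n = min(len(input_pattern), len(ref))
        sim = -np.linalg.norm(input_pattern[:n] - ref[:n])  # higher = closer
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word
```

Every word needs its own stored template, so enrollment effort and memory grow linearly with the vocabulary, which is exactly the drawback described above.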

In contrast, a method that divides the input speech into phoneme units and recognizes it as a combination of phonemes (hereinafter called phoneme recognition), and then computes the similarity to a word dictionary written in phoneme units, has the advantages that the memory capacity required for the word dictionary is far smaller and that the contents of the dictionary are easy to change. An example of this method is described in "A spoken word recognition system using the outline of the speech spectrum and its dynamic characteristics," Miwa et al., Journal of the Acoustical Society of Japan, 34 (1978). Fig. 1 shows the block configuration of a speech recognition device based on this method. The input speech 50 enters a filter bank 40 and is converted into a frequency spectrum, after which an acoustic processing section 41 computes the parameters Pe1, Pe2, Pe3, G, H, V and W required for phoneme recognition.

Using these parameters, the phoneme recognition section 42 divides the speech into phoneme units (an operation hereinafter called segmentation) and performs phoneme recognition, deciding what each phoneme is on the basis of standard phoneme patterns 45. Since the string of phonemes at this stage is still imperfect, an error correction section 43 corrects it, mainly using rules 46 on how Japanese phonemes may combine, and completes the creation of the phoneme sequence. The word matching section 44 then computes the similarity to every entry of a word dictionary 48 written in phoneme names, using a table 47 (hereinafter called the confusion matrix) that gives statistically pre-determined probabilities of each phoneme being substituted by another phoneme, being dropped, or having another phoneme inserted, and outputs the dictionary entry with the greatest similarity as the recognition result.

This method performs phoneme recognition by focusing on the positions of the spectral peaks. Its algorithm is simple, and because it looks only at the positions of the peaks and their relative magnitudes, it has the advantage of being little affected by variations in the overall shape of the spectral pattern caused by differences in speaker or environment; it is therefore considered well suited to unspecified speakers. However, because it uses a frequency spectrum obtained from a filter bank, the method has the following drawback. Speech contains a pitch component related to the length of the vocal tract. The pitch frequency of a male voice is generally below 200 Hz and rarely overlaps the first formant of a vowel, but the pitch frequency of a female voice generally lies between 200 and 300 Hz and therefore overlaps the first formant; a large peak due to the pitch frequency appears in the low-frequency region of the spectrum of a female voice, and harmonics of the pitch frequency give rise to unnecessary peaks, so that the correct peak positions corresponding to the formants can no longer be detected. For example, Fig. 2 shows the frequency spectrum of the vowel /e/ in a female voice, obtained from a filter bank of 29 channels covering 250 to 6300 Hz at 1/6-octave spacing with filters of Q = 6. In the figure the vertical axis is the channel number, spaced at 1/6-octave intervals, and the horizontal axis is the frame number in 10 ms steps; the correct phonemes have been labeled beforehand by inspection. Peaks B, which do not correspond to formants of the vowel /e/, appear at channels 2 and 12 owing to the pitch, and they can no longer be distinguished from the peaks A that correspond to the true formants.
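
As a quick check of the filter bank geometry just described (a sketch; the exact channel alignment of the original system is an assumption), the 1/6-octave spacing reproduces the stated 250-6300 Hz span:

```python
import numpy as np

# 29 channels at 1/6-octave spacing starting at 250 Hz:
# channel k has center frequency 250 * 2**(k/6)
centers = 250.0 * 2.0 ** (np.arange(29) / 6.0)
print(centers[0], centers[-1])   # 250.0 and ~6350 Hz, i.e. the 250-6300 Hz span
```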

An example of phoneme recognition by this method is shown in Fig. 3.

Fig. 3 shows the word "yasumono" (a cheap article) spoken by an adult female, with the horizontal axis divided into frames. Row a gives the phonemes labeled by hand; a bar marks the start of each phoneme and the boxed part marks its center. Row b is the vowel recognition result, row c the semivowel recognition result, and row d the silence and consonant intervals, where Q denotes a silent interval and C a consonant interval. Row e is the consonant recognition result. Row f shows the various parameters used for segmentation, and row g shows the positions of the spectral peaks on the frequency axis, numbered 1, 2, 3 in decreasing order of power. The figure shows that in the /a/ portion a peak appears near 250 Hz owing to the pitch (region イ in the figure), so that /a/ is misrecognized as /i/ (region ニ). Likewise, at /o/ a pitch-induced peak appears near 250 Hz (regions ロ and ハ), and /o/ is misrecognized as /n/ (regions ホ and ヘ). Thus, in the conventional method the pitch component of the voice appears strongly for female voices, and the method cannot cope with them.

To locate accurately, for both male and female voices, the position of peak A in Fig. 2, which corresponds only to the formants, the present invention replaces the conventional filter bank with linear predictive analysis and obtains a frequency spectrum in which the pitch component of the speech has been suppressed. It thereby solves the above problem and provides a phoneme recognition method and speech recognition system that can handle unspecified speakers regardless of sex. Linear predictive analysis approximates the frequency spectrum with an all-pole model and separates the spectral envelope characteristics from the glottal wave characteristics, so the influence of the pitch frequency and its harmonics should be reduced. Moreover, since the resulting frequency spectrum contains no components outside the assumed model, it has the advantage of yielding a smooth spectral pattern. Fig. 4 shows the frequency spectrum of the same female vowel /e/ as in Fig. 2, obtained by the method of the present invention; the unnecessary peaks caused by the pitch have been removed and only the peaks A corresponding to the formants remain.

Fig. 5 shows the result of phoneme recognition, by the method of the present invention, of the same word as in Fig. 3. Looking at row g, no peak appears near 250 Hz at the position of /a/ (region ト in the figure), showing that the pitch component has been removed. Accordingly, row b shows that /a/ is correctly recognized as /a/. At the position of /o/ as well, the pitch component near 250 Hz that appeared in the conventional example of Fig. 3 has been removed (region チ in the figure), so that /o/ is segmented as a vowel. When the vowels, consonants and semivowels are segmented correctly in this way, word recognition can be carried out correctly.

Thus, by accurately extracting the peaks corresponding to the formants through linear predictive analysis, the present invention makes possible automatic speech recognition that applies equally to male and female voices and is suited to unspecified speakers.

Fig. 6 outlines the configuration of the present automatic speech recognition device. The speech input enters the acoustic processing section 1, which performs linear predictive analysis and computes the parameters required for phoneme recognition, such as the frequency spectrum and the power. The phoneme recognition section 2 performs phoneme recognition for each 10 ms analysis interval (hereinafter called a frame), using the parameters W, A, G and H obtained by the acoustic processing section 1 (described in detail later) together with the peaks Pe1, Pe2, Pe3 of the frequency spectrum (hereinafter called local peaks). The phoneme sequence creation section 3 corrects the string of phonemes (an operation hereinafter called error correction) using previously prepared rules on how typical Japanese phonemes may combine with one another (hereinafter called phoneme combination rules), and creates the string of phonemes for each word (hereinafter called the phoneme sequence). The word matching section 4 computes the similarity between the recognized phoneme sequence and a word dictionary registered in advance as phoneme sequences, using a statistically pre-compiled confusion matrix that gives the probabilities of each phoneme being substituted by another phoneme, of another phoneme being added, or of a phoneme being dropped, and outputs the word with the greatest similarity as the recognition result.

Fig. 7 shows the configuration of the acoustic processing section 1. The speech input is A/D-converted, and the pre-emphasis section 10 applies a 6 dB/octave high-frequency emphasis to correct the spectral tilt; the window section 11 then multiplies the speech cut out for each frame by a Hamming window of length T = 20 ms, given by equation (1).

$$y(t) = 0.56 + 0.44\cos\frac{2\pi t}{T} \quad (1)$$

with $y(t) = 0$ for $|t| > T/2$.

In the linear prediction analysis section 12, following "Speech Analysis and Synthesis by Linear Prediction of the Speech Wave," B. S. Atal et al., J. Acoust. Soc. Amer. 50 (1971), let $S_1, S_2, \dots, S_n, \dots, S_N$ be the speech samples of the windowed analysis interval; the linear prediction $\hat{S}_n$ at analysis order $p$ is then given by equation (2).

$$\hat{S}_n = \sum_{k=1}^{p} a_k^{(p)} S_{n-k} \quad (2)$$

where $a_k^{(p)}$ ($k = 1, 2, \dots, p$) are the linear prediction coefficients. If $r_k$ ($k = 1, 2, \dots, p$) denotes the autocorrelation coefficients of the analysis interval, $r_k$ is obtained from equation (3):

$$r_k = \frac{1}{N}\sum_{n=1}^{N-k} S_n S_{n+k} \quad (3)$$

To obtain the minimum mean squared prediction error in equation (2), with $N \gg p$, the $r_k$ of equation (3) are used to determine the coefficients $a_k^{(p)}$ from the system of $p$ simultaneous linear equations (4) (the autocorrelation normal equations):

$$\sum_{k=1}^{p} a_k^{(p)} r_{|i-k|} = r_i, \qquad i = 1, 2, \dots, p \quad (4)$$

The linear prediction coefficients $a_k^{(p)}$ ($k = 1, 2, \dots, p$) are thus obtained by solving equation (4), and it is well known that this can be computed efficiently by Levinson's method.
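
The chain from pre-emphasis through the solution of equation (4) can be sketched as follows (a minimal illustration consistent with equations (1), (3) and (4); the order p = 12 and the pre-emphasis coefficient mu are assumed values, as the text does not fix them at this point):

```python
import numpy as np

def pre_emphasis(x, mu=0.97):
    """6 dB/octave high-frequency emphasis as a first-order difference."""
    return np.append(x[0], x[1:] - mu * x[:-1])

def lpc_coefficients(frame, p=12):
    """Estimate linear prediction coefficients a_k by Levinson's method.

    frame : 1-D array of speech samples for one analysis window (non-silent)
    Returns (a, err): coefficients a[0..p-1] = a_1..a_p and residual power."""
    N = len(frame)
    t = np.arange(N) - (N - 1) / 2.0
    w = 0.56 + 0.44 * np.cos(2 * np.pi * t / N)   # constants as printed in eq. (1)
    s = frame * w
    # autocorrelation r_k of eq. (3)
    r = np.array([np.dot(s[:N - k], s[k:]) / N for k in range(p + 1)])
    # Levinson's recursion solves the normal equations (4)
    a, err = np.zeros(p + 1), r[0]
    for i in range(1, p + 1):
        k = (r[i] - np.dot(a[1:i], r[i - 1:0:-1])) / err
        a_next = a.copy()
        a_next[i] = k
        a_next[1:i] = a[1:i] - k * a[i - 1:0:-1]
        a, err = a_next, err * (1.0 - k * k)
    return a[1:], err
```

Each recursion step costs O(i), so the whole solve is O(p^2) rather than the O(p^3) of a general linear-system solver, which is why Levinson's method is singled out.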

The frequency spectrum calculation section 13 obtains the spectral envelope $\hat{X}(n)$ from the linear prediction coefficients $a_k^{(p)}$ found in the preceding stage as

$$\hat{X}(n) = \frac{\sigma^2}{A(0) + 2\sum_{i=1}^{p} A(i)\cos(i\,\theta_n)} \quad (5)$$

where $\sigma^2$ is the residual power and, taking $a_0^{(p)} = -1$,

$$A(i) = \sum_{j=0}^{p-i} a_j^{(p)} a_{j+i}^{(p)} \quad (6)$$

with $\theta_n = 2\pi f(n) T$. The frequencies $f(n)$ are set at equal octave intervals, and $\hat{X}(n)$ is computed with the residual power in equation (5) fixed at $\sigma^2 = 2\pi$.
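
Equation (5) can then be evaluated directly on the octave-spaced grid (a sketch consistent with equations (5) and (6); the 12 kHz sampling rate comes from the device description later in the text, while the particular frequency grid shown is an assumption):

```python
import numpy as np

def lpc_envelope(a, freqs, fs=12000.0, sigma2=2 * np.pi):
    """All-pole spectral envelope of eq. (5) at the frequencies freqs (Hz).

    a : linear prediction coefficients a_1..a_p (e.g. from lpc_coefficients)"""
    p = len(a)
    c = np.concatenate(([-1.0], a))               # a_0 = -1, as in eq. (6)
    A = np.array([np.dot(c[:p + 1 - i], c[i:]) for i in range(p + 1)])
    theta = 2 * np.pi * freqs / fs                # theta_n = 2*pi*f(n)*T
    denom = A[0] + 2 * np.sum(
        A[1:, None] * np.cos(np.outer(np.arange(1, p + 1), theta)), axis=0)
    return sigma2 / denom

# frequencies at equal octave spacing, e.g. 1/6-octave steps from 177 Hz
freqs = 177.0 * 2.0 ** (np.arange(25) / 6.0)      # 177 Hz up to ~2.8 kHz
```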

The peak extraction and parameter calculation section 14 obtains the positions and magnitudes of the local peaks on the frequency axis from the maxima and inflection points of the spectral envelope given by equation (5). The peaks are labeled Pe1, Pe2, Pe3 in decreasing order of magnitude and P1, P2, P3 in increasing order of frequency on the frequency axis; at the same time, the parameters G, H and A required for phoneme recognition are obtained as follows. First, the least-squares regression line Y(n) of the spectrum X(n), the logarithm of the spectral envelope $\hat{X}(n)$, is found from the following equation.

$$Y(n) = A \cdot n + B \quad (7)$$

Here the coefficient A represents the overall tilt of the spectrum, and B its overall level. Normalizing the spectrum X(n) by the least-squares line Y(n) to remove the spectral tilt gives the normalized spectrum Z(n):

$$Z(n) = X(n) - Y(n) \quad (8)$$

Z(n) is divided on the frequency axis into a low band (177–400 Hz), a mid band (400–1100 Hz) and a high band (1100–2800 Hz); G is obtained as the ratio of the average power of the whole normalized spectrum to the average power of the low band, and H as the ratio of the average power of the high band to the power of the mid band.
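
These definitions reduce to a line fit and band-power ratios over the envelope samples (a sketch; expressing the ratios as differences of mean log power, i.e. in dB, is our reading of "average power" here):

```python
import numpy as np

def tilt_and_band_ratios(env, freqs):
    """Spectral tilt A of eq. (7) and the band ratios G, H of eq. (8).

    env   : spectral envelope sampled at the frequencies in freqs (Hz),
            with freqs covering at least 177-2800 Hz"""
    X = 10.0 * np.log10(env)                  # log spectrum X(n)
    n = np.arange(len(X))
    A, B = np.polyfit(n, X, 1)                # least-squares line Y(n) = A*n + B
    Z = X - (A * n + B)                       # normalized spectrum, eq. (8)
    low = Z[(freqs >= 177) & (freqs < 400)]
    mid = Z[(freqs >= 400) & (freqs < 1100)]
    high = Z[(freqs >= 1100) & (freqs < 2800)]
    G = Z.mean() - low.mean()                 # whole spectrum vs. low band
    H = high.mean() - mid.mean()              # high band vs. mid band
    return A, G, H
```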

Further, the power of each 10 ms frame is obtained as the sum of squares of the speech samples and converted to a logarithm to give the parameter W:

$$W = \frac{10}{N}\log_{10}\left(\sum_{i=1}^{N} S_i^2\right) \quad (9)$$

While these parameters are being computed, the value of W together with the values of Pe1 and Pe2 is used to decide, for each frame, whether the frame is silent or non-silent.
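
Equation (9) is a one-line computation per frame (a direct transcription; the placement of the 1/N factor outside the logarithm follows the formula as printed):

```python
import numpy as np

def frame_power_W(frame):
    """Log power of one frame, eq. (9): W = (10/N) * log10(sum of S_i^2)."""
    N = len(frame)
    return 10.0 / N * np.log10(np.sum(np.asarray(frame, dtype=float) ** 2))
```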

The phoneme recognition section 2 first determines the start and end of the utterance from the silence/non-silence information obtained by the acoustic processing section 1, using the continuity and duration of sound and silence. Next, segmentation and phoneme decision are performed using the local peaks Pe1, Pe2, Pe3 and P1, P2, P3 obtained by the acoustic processing section 1 together with Ws, Gs, Hs, As, the smoothed versions of the parameters W, G, H, A. Taking Fig. 5 as an example, consonants are first segmented by detecting the local minima of Ws and As. Then, by fitting the Pe1, Pe2, Pe3 of each frame to standard patterns of Pe1, Pe2, Pe3 constructed in advance from the distributions of the local peaks of each consonant, consonant candidates are determined frame by frame, and the consonant of each interval is decided by applying a rule based on the number of consonant candidates; row d shows the result. Next, the semivowels are segmented using the maxima and minima of Gs and Hs, and the semivowel of each interval is decided by fitting the P1, P2 of each frame to a two-dimensional P1-P2 chart constructed from the distributions of P1 and P2 (dotted lines in Fig. 8); row c′ shows the result. Finally, vowels are recognized by fitting the P1, P2 of each frame to a two-dimensional P1-P2 chart of the five vowels /i/, /e/, /a/, /o/, /u/ and the intermediate vowels /ie/, /ea/, /ao/, /ou/, /ui/, constructed from the distributions of P1 and P2 (solid lines in Fig. 8), after which median smoothing over every five frames is applied. The frames are then cut into runs of vowels whose distance from the preceding and following frames is within 1, and each run lasting four frames (40 ms) or longer is decided to be one vowel.
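
The final vowel decision can be sketched as follows (an illustration; coding the vowels as small integers so that a median and an inter-frame distance are defined is our assumption, and the run test below uses exact label equality as a simple stand-in for the "distance within 1" criterion):

```python
import numpy as np

def decide_vowels(labels, min_frames=4):
    """Median-smooth per-frame vowel codes and keep runs of >= min_frames.

    labels : 1-D int array, one vowel code per 10 ms frame
    Returns a list of (start_frame, end_frame, vowel_code)."""
    labels = np.asarray(labels)
    padded = np.pad(labels, 2, mode='edge')
    # 5-frame median smoothing
    smooth = np.array([int(np.median(padded[i:i + 5]))
                       for i in range(len(labels))])
    # keep stable runs lasting at least 4 frames (40 ms)
    vowels, start = [], 0
    for i in range(1, len(smooth) + 1):
        if i == len(smooth) or smooth[i] != smooth[start]:
            if i - start >= min_frames:
                vowels.append((start, i - 1, smooth[start]))
            start = i
    return vowels
```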

The phoneme sequence creation section 3 corrects errors in the phoneme string on the basis of previously prepared Japanese phoneme combination rules, distinguishes long vowels from short vowels by vowel duration, and inserts the devoiced vowels /i/ and /u/, thereby creating the phoneme sequence.

The word matching section 4 computes the similarity between the recognized phoneme sequence and every entry of a simple word dictionary written only as phoneme sequences, using a confusion matrix compiled in advance from a large number of recognized phoneme sequences, and outputs the word with the greatest similarity as the recognition result.
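
A minimal sketch of such confusion-matrix scoring follows (our illustration: an edit-distance-style alignment whose substitution, insertion and deletion weights are log probabilities taken from the confusion matrix; the exact similarity measure used by the device is not spelled out in this text):

```python
import numpy as np

def word_similarity(recognized, entry, log_sub, log_ins, log_del):
    """Best log-probability alignment of two phoneme sequences.

    recognized, entry : sequences of phoneme indices
    log_sub[d][r] : log P(dictionary phoneme d recognized as r)
    log_ins[r]    : log P(phoneme r inserted spuriously)
    log_del[d]    : log P(dictionary phoneme d dropped)"""
    n, m = len(entry), len(recognized)
    D = np.full((n + 1, m + 1), -np.inf)
    D[0, 0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:   # substitution (or correct match)
                D[i, j] = max(D[i, j],
                              D[i - 1, j - 1] + log_sub[entry[i - 1]][recognized[j - 1]])
            if i > 0:             # dictionary phoneme dropped
                D[i, j] = max(D[i, j], D[i - 1, j] + log_del[entry[i - 1]])
            if j > 0:             # spurious phoneme inserted
                D[i, j] = max(D[i, j], D[i, j - 1] + log_ins[recognized[j - 1]])
    return D[n, m]

def best_word(recognized, word_dict, log_sub, log_ins, log_del):
    """Return the dictionary word with the greatest similarity."""
    return max(word_dict, key=lambda w: word_similarity(
        recognized, word_dict[w], log_sub, log_ins, log_del))
```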

Fig. 9 shows the configuration of an automatic speech recognition device according to the present invention. The speech signal from the microphone 20 is amplified to a suitable level by the amplifier 21 and converted by the A/D conversion section 22 into 12 bits at a 12 kHz sampling rate. After the signal processing circuit 29 applies the 6 dB/octave pre-emphasis and the Hamming window, the linear prediction analysis processor 23 computes the frequency spectrum from the linear prediction coefficients and calculates the parameters required for phoneme recognition. The main processor 24 uses the main memory 25 to perform segmentation, phoneme recognition and phoneme sequence creation, and transfers the resulting phoneme sequence to the word matching processor 27. The word matching processor 27 computes the similarity of each word by referring to the word dictionary and confusion matrix held in the memory 26, and returns the results to the main processor 24. The main processor 24 outputs the word with the greatest similarity to the I/O 28 as the recognition result, or rejects the input. The I/O 28 passes the received result on to other computers or directs other I/O devices to perform work. Providing dedicated processors in addition to the main processor to share the computation in this way makes high-speed operation possible.

A word recognition experiment was carried out with this device on a total of 3320 utterances of the names of 166 major cities throughout Japan, spoken by 20 adult males in a soundproof room; the average recognition rate was 84%. This is almost the same as with the conventional filter bank method. With data spoken by 20 adult females, on the other hand, the conventional filter bank method recognized only about 30% and thus could not cope with female voices at all, whereas the present device achieved the same 84% as for male voices. The effectiveness of the present invention was thus confirmed, opening the way to adaptation regardless of sex.

As described above, the present invention is based on recognition in phoneme units aimed at unspecified speakers. By performing phoneme recognition from the positions of the local peaks of the spectrum and their relative magnitudes, it is little affected by variations due to the speaker or the environment; and by replacing the conventional filter bank with linear predictive analysis for the spectral analysis, it can extract stable peaks corresponding to the formants, unaffected by the pitch component of the speech. Moreover, whereas a filter output is a discrete quantity and thus has a large extraction error, the LPC cepstrum of the present invention is a continuous quantity, so the peak extraction accuracy is improved. As a result, the invention makes possible an automatic speech recognition device characterized by its ability to handle unspecified speakers, male or female.

BRIEF DESCRIPTION OF THE DRAWINGS

Fig. 1 is a block diagram of a conventional automatic speech recognition system; Fig. 2 is a frequency spectrum of the female vowel /e/ obtained with a filter bank (conventional example); Fig. 3 shows an example of phoneme recognition by conventional filter analysis; Fig. 4 is a frequency spectrum of the female vowel /e/ obtained by the linear predictive analysis of the present invention; Fig. 5 shows an example of phoneme recognition by the linear predictive analysis of the present invention; Fig. 6 is a block diagram of the entire automatic speech recognition device according to the present invention; Fig. 7 is a block diagram of the acoustic processing section of the present invention; Fig. 8 is a vowel discrimination chart based on the local peaks P1, P2 of the present invention; Fig. 9 is a block diagram showing details of the automatic speech recognition device of the present invention.

1 …… acoustic processing section; 2 …… phoneme recognition section; 3 …… phoneme sequence creation section; 4 …… word matching section; 10 …… pre-emphasis section; 11 …… window section; 12 …… linear prediction analysis section; 13 …… frequency spectrum calculation section; 14 …… peak extraction and parameter calculation section; 22 …… A/D conversion circuit; 23 …… linear prediction analysis processor; 24 …… main processor; 26 …… memory for word dictionary and confusion matrix; 27 …… word matching processor.

Claims (1)

1. An automatic speech recognition device comprising: an acoustic processing section that processes a speech input and computes the parameters required for phoneme recognition; a phoneme recognition section that performs segmentation and phoneme recognition using said parameters; a phoneme sequence creation section that corrects the string of phonemes from said phoneme recognition section to create a phoneme sequence; and a word matching section that matches said phoneme sequence against a word dictionary; wherein linear predictive analysis is used in computing the parameters in said acoustic processing section, and speech recognition is performed on the basis of the peak positions of the frequency spectrum, obtained from the maxima or inflection points of the spectral envelope determined from the resulting linear prediction coefficients, and the relative magnitudes of the peaks.
JP57021412A 1982-02-12 1982-02-12 Automatic voice recognition system Granted JPS58139199A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP57021412A JPS58139199A (en) 1982-02-12 1982-02-12 Automatic voice recognition system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP57021412A JPS58139199A (en) 1982-02-12 1982-02-12 Automatic voice recognition system

Publications (2)

Publication Number Publication Date
JPS58139199A JPS58139199A (en) 1983-08-18
JPS6336680B2 true JPS6336680B2 (en) 1988-07-21

Family

ID=12054303

Family Applications (1)

Application Number Title Priority Date Filing Date
JP57021412A Granted JPS58139199A (en) 1982-02-12 1982-02-12 Automatic voice recognition system

Country Status (1)

Country Link
JP (1) JPS58139199A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005002704A (en) * 2003-06-13 2005-01-06 Nec Corp Personal authentication device and unlocking system with personal authentication function

Also Published As

Publication number Publication date
JPS58139199A (en) 1983-08-18

Similar Documents

Publication Publication Date Title
JPH0352640B2 (en)
US20010010039A1 (en) Method and apparatus for mandarin chinese speech recognition by using initial/final phoneme similarity vector
JP2996019B2 (en) Voice recognition device
JP4839970B2 (en) Prosody identification apparatus and method, and speech recognition apparatus and method
JP3493849B2 (en) Voice recognition device
JPS6336680B2 (en)
Sharma et al. Speech recognition of Punjabi numerals using synergic HMM and DTW approach
Lingam Speaker based language independent isolated speech recognition system
Gulzar et al. An improved endpoint detection algorithm using bit wise approach for isolated, spoken paired and Hindi hybrid paired words
Adam et al. Analysis of Momentous Fragmentary Formants in Talaqi-like Neoteric Assessment of Quran Recitation using MFCC Miniature Features of Quranic Syllables
JP6517417B1 (en) Evaluation system, speech recognition device, evaluation program, and speech recognition program
KR20180087038A (en) Hearing aid with voice synthesis function considering speaker characteristics and method thereof
JPS6313198B2 (en)
JP3110025B2 (en) Utterance deformation detection device
Sahu et al. Odia isolated word recognition using DTW
Nair et al. Comparison of Isolated Digit Recognition Techniques based on Feature Extraction
Ozaydin An isolated word speaker recognition system
JPS62143100A (en) Voice pattern matching system
JPH0469800B2 (en)
JPH0640274B2 (en) Voice recognizer
JPS59114600A (en) Speaker identification system
JPH0816186A (en) Voice recognition device
MAINDARGI et al. Implementation Of Speech Recognition System
Jolad et al. INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY DIFFERENT FEATURE EXTRACTION TECHNIQUES FOR AUTOMATIC SPEECH RECOGNITION: A REVIEW
JPS5958498A (en) Voice recognition equipment