JPS6336680B2 - Automatic voice recognition system - Google Patents

Info

Publication number
JPS6336680B2
JPS6336680B2 JP57021412A JP2141282A
Authority
JP
Japan
Prior art keywords
phoneme
recognition
word
unit
peak
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired
Application number
JP57021412A
Other languages
Japanese (ja)
Other versions
JPS58139199A (en)
Inventor
Satoshi Fujii
Katsuyuki Futayada
Hideji Morii
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Panasonic Holdings Corp
Original Assignee
Matsushita Electric Industrial Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Matsushita Electric Industrial Co Ltd filed Critical Matsushita Electric Industrial Co Ltd
Priority to JP57021412A priority Critical patent/JPS58139199A/en
Publication of JPS58139199A publication Critical patent/JPS58139199A/en
Publication of JPS6336680B2 publication Critical patent/JPS6336680B2/ja
Granted legal-status Critical Current

Description

DETAILED DESCRIPTION OF THE INVENTION

The present invention relates to an automatic speech recognition device for automatically recognizing speech signals uttered by humans.

An automatic speech recognition device that automatically recognizes speech uttered by humans is expected to become a highly effective means of supplying data and commands from humans to computers and various machines. For example, if a device that recognizes spoken digits is connected to a computer, numeric data such as slips can be entered by voice; in particular, since speech signals can be transmitted to remote locations over telephone lines, tasks such as issuing slips and making inventory inquiries can be carried out immediately. Considering further that speech input leaves the hands and feet free for other tasks, the benefits brought by automatic speech recognition devices are considered to be extremely large.

The pattern matching method has most often been adopted as the operating principle of the automatic speech recognition devices studied or announced to date. In this method, a reference pattern is stored in advance for every word that must be recognized; the degree of agreement with an unknown input pattern (hereinafter called the similarity) is computed by comparison, and the input is judged to be the same word as the reference pattern giving the best match. Because a reference pattern is prepared for every word to be recognized, new reference patterns must be entered and stored whenever the speaker changes. Consequently, when several hundred or more words are to be recognized, such as the names of cities throughout Japan, enormous time and effort are needed to speak and register every word, and the memory capacity required for registration can also be expected to be enormous. A further drawback is that the amount of processing required to match an input pattern against the reference patterns also becomes enormous as the number of words grows.
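
To make the bookkeeping concrete, the following is a minimal sketch of whole-word template matching (an illustration, not code from the patent); the fixed-length feature matrices and the Euclidean distance are simplifying assumptions, and practical systems of this period also needed time alignment such as dynamic time warping:

```python
import numpy as np

def match_word(input_pattern, templates):
    """Return the word whose stored reference pattern best matches the input.

    input_pattern : (frames, channels) array for the unknown utterance
    templates     : dict mapping word -> (frames, channels) reference pattern
    """
    best_word, best_sim = None, -np.inf
    for word, ref in templates.items():
        # naive frame-by-frame comparison; assumes comparable lengths,
        # which is why real systems need time normalization (e.g. DTW)
        n = min(len(input_pattern), len(ref))
        sim = -np.linalg.norm(input_pattern[:n] - ref[:n])  # higher = closer
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word
```

Every word needs its own stored template, so enrollment effort and memory grow linearly with the vocabulary, which is exactly the drawback described above.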

In contrast, a method that divides the input speech into phoneme units and recognizes it as a combination of phonemes (hereinafter called phoneme recognition), and then computes the similarity to a word dictionary written in phoneme units, has the advantages that the memory capacity required for the word dictionary is far smaller and that the contents of the dictionary are easy to change. An example of this method is described in "A spoken word recognition system using the outline of the speech spectrum and its dynamic characteristics," Miwa et al., Journal of the Acoustical Society of Japan, 34 (1978). Fig. 1 shows the block configuration of a speech recognition device based on this method. The input speech 50 enters a filter bank 40 and is converted into a frequency spectrum, after which an acoustic processing section 41 computes the parameters Pe1, Pe2, Pe3, G, H, V and W required for phoneme recognition.

Using these parameters, the phoneme recognition section 42 divides the speech into phoneme units (an operation hereinafter called segmentation) and performs phoneme recognition, deciding what each phoneme is on the basis of standard phoneme patterns 45. Since the string of phonemes at this stage is still imperfect, an error correction section 43 corrects it, mainly using rules 46 on how Japanese phonemes may combine, and completes the creation of the phoneme sequence. The word matching section 44 then computes the similarity to every entry of a word dictionary 48 written in phoneme names, using a table 47 (hereinafter called the confusion matrix) that gives statistically pre-determined probabilities of each phoneme being substituted by another phoneme, being dropped, or having another phoneme inserted, and outputs the dictionary entry with the greatest similarity as the recognition result.

This method performs phoneme recognition by focusing on the positions of the spectral peaks. Its algorithm is simple, and because it looks only at the positions of the peaks and their relative magnitudes, it has the advantage of being little affected by variations in the overall shape of the spectral pattern caused by differences in speaker or environment; it is therefore considered well suited to unspecified speakers. However, because it uses a frequency spectrum obtained from a filter bank, the method has the following drawback. Speech contains a pitch component related to the length of the vocal tract. The pitch frequency of a male voice is generally below 200 Hz and rarely overlaps the first formant of a vowel, but the pitch frequency of a female voice generally lies between 200 and 300 Hz and therefore overlaps the first formant; a large peak due to the pitch frequency appears in the low-frequency region of the spectrum of a female voice, and harmonics of the pitch frequency give rise to unnecessary peaks, so that the correct peak positions corresponding to the formants can no longer be detected. For example, Fig. 2 shows the frequency spectrum of the vowel /e/ in a female voice, obtained from a filter bank of 29 channels covering 250 to 6300 Hz at 1/6-octave spacing with filters of Q = 6. In the figure the vertical axis is the channel number, spaced at 1/6-octave intervals, and the horizontal axis is the frame number in 10 ms steps; the correct phonemes have been labeled beforehand by inspection. Peaks B, which do not correspond to formants of the vowel /e/, appear at channels 2 and 12 owing to the pitch, and they can no longer be distinguished from the peaks A that correspond to the true formants.
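
As a quick check of the filter bank geometry just described (a sketch; the exact channel alignment of the original system is an assumption), the 1/6-octave spacing reproduces the stated 250-6300 Hz span:

```python
import numpy as np

# 29 channels at 1/6-octave spacing starting at 250 Hz:
# channel k has center frequency 250 * 2**(k/6)
centers = 250.0 * 2.0 ** (np.arange(29) / 6.0)
print(centers[0], centers[-1])   # 250.0 and ~6350 Hz, i.e. the 250-6300 Hz span
```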

An example of phoneme recognition by this method is shown in Fig. 3.

Fig. 3 shows the word "yasumono" (a cheap article) spoken by an adult female, with the horizontal axis divided into frames. Row a gives the phonemes labeled by hand; a bar marks the start of each phoneme and the boxed part marks its center. Row b is the vowel recognition result, row c the semivowel recognition result, and row d the silence and consonant intervals, where Q denotes a silent interval and C a consonant interval. Row e is the consonant recognition result. Row f shows the various parameters used for segmentation, and row g shows the positions of the spectral peaks on the frequency axis, numbered 1, 2, 3 in decreasing order of power. The figure shows that in the /a/ portion a peak appears near 250 Hz owing to the pitch (region イ in the figure), so that /a/ is misrecognized as /i/ (region ニ). Likewise, at /o/ a pitch-induced peak appears near 250 Hz (regions ロ and ハ), and /o/ is misrecognized as /n/ (regions ホ and ヘ). Thus, in the conventional method the pitch component of the voice appears strongly for female voices, and the method cannot cope with them.

To locate accurately, for both male and female voices, the position of peak A in Fig. 2, which corresponds only to the formants, the present invention replaces the conventional filter bank with linear predictive analysis and obtains a frequency spectrum in which the pitch component of the speech has been suppressed. It thereby solves the above problem and provides a phoneme recognition method and speech recognition system that can handle unspecified speakers regardless of sex. Linear predictive analysis approximates the frequency spectrum with an all-pole model and separates the spectral envelope characteristics from the glottal wave characteristics, so the influence of the pitch frequency and its harmonics should be reduced. Moreover, since the resulting frequency spectrum contains no components outside the assumed model, it has the advantage of yielding a smooth spectral pattern. Fig. 4 shows the frequency spectrum of the same female vowel /e/ as in Fig. 2, obtained by the method of the present invention; the unnecessary peaks caused by the pitch have been removed and only the peaks A corresponding to the formants remain.

Fig. 5 shows the result of phoneme recognition, by the method of the present invention, of the same word as in Fig. 3. Looking at row g, no peak appears near 250 Hz at the position of /a/ (region ト in the figure), showing that the pitch component has been removed. Accordingly, row b shows that /a/ is correctly recognized as /a/. At the position of /o/ as well, the pitch component near 250 Hz that appeared in the conventional example of Fig. 3 has been removed (region チ in the figure), so that /o/ is segmented as a vowel. When the vowels, consonants and semivowels are segmented correctly in this way, word recognition can be carried out correctly.

Thus, by accurately extracting the peaks corresponding to the formants through linear predictive analysis, the present invention makes possible automatic speech recognition that applies equally to male and female voices and is suited to unspecified speakers.

Fig. 6 outlines the configuration of the present automatic speech recognition device. The speech input enters the acoustic processing section 1, which performs linear predictive analysis and computes the parameters required for phoneme recognition, such as the frequency spectrum and the power. The phoneme recognition section 2 performs phoneme recognition for each 10 ms analysis interval (hereinafter called a frame), using the parameters W, A, G and H obtained by the acoustic processing section 1 (described in detail later) together with the peaks Pe1, Pe2, Pe3 of the frequency spectrum (hereinafter called local peaks). The phoneme sequence creation section 3 corrects the string of phonemes (an operation hereinafter called error correction) using previously prepared rules on how typical Japanese phonemes may combine with one another (hereinafter called phoneme combination rules), and creates the string of phonemes for each word (hereinafter called the phoneme sequence). The word matching section 4 computes the similarity between the recognized phoneme sequence and a word dictionary registered in advance as phoneme sequences, using a statistically pre-compiled confusion matrix that gives the probabilities of each phoneme being substituted by another phoneme, of another phoneme being added, or of a phoneme being dropped, and outputs the word with the greatest similarity as the recognition result.

Fig. 7 shows the configuration of the acoustic processing section 1. The speech input is A/D-converted, and the pre-emphasis section 10 applies a 6 dB/octave high-frequency emphasis to correct the spectral tilt; the window section 11 then multiplies the speech cut out for each frame by a Hamming window of length T = 20 ms, given by equation (1).

$$y(t) = 0.56 + 0.44\cos\frac{2\pi t}{T} \quad (1)$$

with $y(t) = 0$ for $|t| > T/2$.

In the linear prediction analysis section 12, following "Speech Analysis and Synthesis by Linear Prediction of the Speech Wave," B. S. Atal et al., J. Acoust. Soc. Amer. 50 (1971), let $S_1, S_2, \dots, S_n, \dots, S_N$ be the speech samples of the windowed analysis interval; the linear prediction $\hat{S}_n$ at analysis order $p$ is then given by equation (2).

$$\hat{S}_n = \sum_{k=1}^{p} a_k^{(p)} S_{n-k} \quad (2)$$

where $a_k^{(p)}$ ($k = 1, 2, \dots, p$) are the linear prediction coefficients. If $r_k$ ($k = 1, 2, \dots, p$) denotes the autocorrelation coefficients of the analysis interval, $r_k$ is obtained from equation (3):

$$r_k = \frac{1}{N}\sum_{n=1}^{N-k} S_n S_{n+k} \quad (3)$$

To obtain the minimum mean squared prediction error in equation (2), with $N \gg p$, the $r_k$ of equation (3) are used to determine the coefficients $a_k^{(p)}$ from the system of $p$ simultaneous linear equations (4) (the autocorrelation normal equations):

$$\sum_{k=1}^{p} a_k^{(p)} r_{|i-k|} = r_i, \qquad i = 1, 2, \dots, p \quad (4)$$

The linear prediction coefficients $a_k^{(p)}$ ($k = 1, 2, \dots, p$) are thus obtained by solving equation (4), and it is well known that this can be computed efficiently by Levinson's method.
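
The chain from pre-emphasis through the solution of equation (4) can be sketched as follows (a minimal illustration consistent with equations (1), (3) and (4); the order p = 12 and the pre-emphasis coefficient mu are assumed values, as the text does not fix them at this point):

```python
import numpy as np

def pre_emphasis(x, mu=0.97):
    """6 dB/octave high-frequency emphasis as a first-order difference."""
    return np.append(x[0], x[1:] - mu * x[:-1])

def lpc_coefficients(frame, p=12):
    """Estimate linear prediction coefficients a_k by Levinson's method.

    frame : 1-D array of speech samples for one analysis window (non-silent)
    Returns (a, err): coefficients a[0..p-1] = a_1..a_p and residual power."""
    N = len(frame)
    t = np.arange(N) - (N - 1) / 2.0
    w = 0.56 + 0.44 * np.cos(2 * np.pi * t / N)   # constants as printed in eq. (1)
    s = frame * w
    # autocorrelation r_k of eq. (3)
    r = np.array([np.dot(s[:N - k], s[k:]) / N for k in range(p + 1)])
    # Levinson's recursion solves the normal equations (4)
    a, err = np.zeros(p + 1), r[0]
    for i in range(1, p + 1):
        k = (r[i] - np.dot(a[1:i], r[i - 1:0:-1])) / err
        a_next = a.copy()
        a_next[i] = k
        a_next[1:i] = a[1:i] - k * a[i - 1:0:-1]
        a, err = a_next, err * (1.0 - k * k)
    return a[1:], err
```

Each recursion step costs O(i), so the whole solve is O(p^2) rather than the O(p^3) of a general linear-system solver, which is why Levinson's method is singled out.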

The frequency spectrum calculation section 13 obtains the spectral envelope $\hat{X}(n)$ from the linear prediction coefficients $a_k^{(p)}$ found in the preceding stage as

$$\hat{X}(n) = \frac{\sigma^2}{A(0) + 2\sum_{i=1}^{p} A(i)\cos(i\,\theta_n)} \quad (5)$$

where $\sigma^2$ is the residual power and, taking $a_0^{(p)} = -1$,

$$A(i) = \sum_{j=0}^{p-i} a_j^{(p)} a_{j+i}^{(p)} \quad (6)$$

with $\theta_n = 2\pi f(n) T$. The frequencies $f(n)$ are set at equal octave intervals, and $\hat{X}(n)$ is computed with the residual power in equation (5) fixed at $\sigma^2 = 2\pi$.
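
Equation (5) can then be evaluated directly on the octave-spaced grid (a sketch consistent with equations (5) and (6); the 12 kHz sampling rate comes from the device description later in the text, while the particular frequency grid shown is an assumption):

```python
import numpy as np

def lpc_envelope(a, freqs, fs=12000.0, sigma2=2 * np.pi):
    """All-pole spectral envelope of eq. (5) at the frequencies freqs (Hz).

    a : linear prediction coefficients a_1..a_p (e.g. from lpc_coefficients)"""
    p = len(a)
    c = np.concatenate(([-1.0], a))               # a_0 = -1, as in eq. (6)
    A = np.array([np.dot(c[:p + 1 - i], c[i:]) for i in range(p + 1)])
    theta = 2 * np.pi * freqs / fs                # theta_n = 2*pi*f(n)*T
    denom = A[0] + 2 * np.sum(
        A[1:, None] * np.cos(np.outer(np.arange(1, p + 1), theta)), axis=0)
    return sigma2 / denom

# frequencies at equal octave spacing, e.g. 1/6-octave steps from 177 Hz
freqs = 177.0 * 2.0 ** (np.arange(25) / 6.0)      # 177 Hz up to ~2.8 kHz
```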

The peak extraction and parameter calculation section 14 obtains the positions and magnitudes of the local peaks on the frequency axis from the maxima and inflection points of the spectral envelope given by equation (5). The peaks are labeled Pe1, Pe2, Pe3 in decreasing order of magnitude and P1, P2, P3 in increasing order of frequency on the frequency axis; at the same time, the parameters G, H and A required for phoneme recognition are obtained as follows. First, the least-squares regression line Y(n) of the spectrum X(n), the logarithm of the spectral envelope $\hat{X}(n)$, is found from the following equation.

$$Y(n) = A \cdot n + B \quad (7)$$

Here the coefficient A represents the overall tilt of the spectrum, and B its overall level. Normalizing the spectrum X(n) by the least-squares line Y(n) to remove the spectral tilt gives the normalized spectrum Z(n):

$$Z(n) = X(n) - Y(n) \quad (8)$$

Z(n) is divided on the frequency axis into a low band (177–400 Hz), a mid band (400–1100 Hz) and a high band (1100–2800 Hz); G is obtained as the ratio of the average power of the whole normalized spectrum to the average power of the low band, and H as the ratio of the average power of the high band to the power of the mid band.
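
These definitions reduce to a line fit and band-power ratios over the envelope samples (a sketch; expressing the ratios as differences of mean log power, i.e. in dB, is our reading of "average power" here):

```python
import numpy as np

def tilt_and_band_ratios(env, freqs):
    """Spectral tilt A of eq. (7) and the band ratios G, H of eq. (8).

    env   : spectral envelope sampled at the frequencies in freqs (Hz),
            with freqs covering at least 177-2800 Hz"""
    X = 10.0 * np.log10(env)                  # log spectrum X(n)
    n = np.arange(len(X))
    A, B = np.polyfit(n, X, 1)                # least-squares line Y(n) = A*n + B
    Z = X - (A * n + B)                       # normalized spectrum, eq. (8)
    low = Z[(freqs >= 177) & (freqs < 400)]
    mid = Z[(freqs >= 400) & (freqs < 1100)]
    high = Z[(freqs >= 1100) & (freqs < 2800)]
    G = Z.mean() - low.mean()                 # whole spectrum vs. low band
    H = high.mean() - mid.mean()              # high band vs. mid band
    return A, G, H
```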

Further, the power of each 10 ms frame is obtained as the sum of squares of the speech samples and converted to a logarithm to give the parameter W:

$$W = \frac{10}{N}\log_{10}\left(\sum_{i=1}^{N} S_i^2\right) \quad (9)$$

While these parameters are being computed, the value of W together with the values of Pe1 and Pe2 is used to decide, for each frame, whether the frame is silent or non-silent.
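
Equation (9) is a one-line computation per frame (a direct transcription; the placement of the 1/N factor outside the logarithm follows the formula as printed):

```python
import numpy as np

def frame_power_W(frame):
    """Log power of one frame, eq. (9): W = (10/N) * log10(sum of S_i^2)."""
    N = len(frame)
    return 10.0 / N * np.log10(np.sum(np.asarray(frame, dtype=float) ** 2))
```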

The phoneme recognition section 2 first determines the start and end of the utterance from the silence/non-silence information obtained by the acoustic processing section 1, using the continuity and duration of sound and silence. Next, segmentation and phoneme decision are performed using the local peaks Pe1, Pe2, Pe3 and P1, P2, P3 obtained by the acoustic processing section 1 together with Ws, Gs, Hs, As, the smoothed versions of the parameters W, G, H, A. Taking Fig. 5 as an example, consonants are first segmented by detecting the local minima of Ws and As. Then, by fitting the Pe1, Pe2, Pe3 of each frame to standard patterns of Pe1, Pe2, Pe3 constructed in advance from the distributions of the local peaks of each consonant, consonant candidates are determined frame by frame, and the consonant of each interval is decided by applying a rule based on the number of consonant candidates; row d shows the result. Next, the semivowels are segmented using the maxima and minima of Gs and Hs, and the semivowel of each interval is decided by fitting the P1, P2 of each frame to a two-dimensional P1-P2 chart constructed from the distributions of P1 and P2 (dotted lines in Fig. 8); row c′ shows the result. Finally, vowels are recognized by fitting the P1, P2 of each frame to a two-dimensional P1-P2 chart of the five vowels /i/, /e/, /a/, /o/, /u/ and the intermediate vowels /ie/, /ea/, /ao/, /ou/, /ui/, constructed from the distributions of P1 and P2 (solid lines in Fig. 8), after which median smoothing over every five frames is applied. The frames are then cut into runs of vowels whose distance from the preceding and following frames is within 1, and each run lasting four frames (40 ms) or longer is decided to be one vowel.
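
The final vowel decision can be sketched as follows (an illustration; coding the vowels as small integers so that a median and an inter-frame distance are defined is our assumption, and the run test below uses exact label equality as a simple stand-in for the "distance within 1" criterion):

```python
import numpy as np

def decide_vowels(labels, min_frames=4):
    """Median-smooth per-frame vowel codes and keep runs of >= min_frames.

    labels : 1-D int array, one vowel code per 10 ms frame
    Returns a list of (start_frame, end_frame, vowel_code)."""
    labels = np.asarray(labels)
    padded = np.pad(labels, 2, mode='edge')
    # 5-frame median smoothing
    smooth = np.array([int(np.median(padded[i:i + 5]))
                       for i in range(len(labels))])
    # keep stable runs lasting at least 4 frames (40 ms)
    vowels, start = [], 0
    for i in range(1, len(smooth) + 1):
        if i == len(smooth) or smooth[i] != smooth[start]:
            if i - start >= min_frames:
                vowels.append((start, i - 1, smooth[start]))
            start = i
    return vowels
```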

The phoneme sequence creation section 3 corrects errors in the phoneme string on the basis of previously prepared Japanese phoneme combination rules, distinguishes long vowels from short vowels by vowel duration, and inserts the devoiced vowels /i/ and /u/, thereby creating the phoneme sequence.

The word matching section 4 computes the similarity between the recognized phoneme sequence and every entry of a simple word dictionary written only as phoneme sequences, using a confusion matrix compiled in advance from a large number of recognized phoneme sequences, and outputs the word with the greatest similarity as the recognition result.
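
A minimal sketch of such confusion-matrix scoring follows (our illustration: an edit-distance-style alignment whose substitution, insertion and deletion weights are log probabilities taken from the confusion matrix; the exact similarity measure used by the device is not spelled out in this text):

```python
import numpy as np

def word_similarity(recognized, entry, log_sub, log_ins, log_del):
    """Best log-probability alignment of two phoneme sequences.

    recognized, entry : sequences of phoneme indices
    log_sub[d][r] : log P(dictionary phoneme d recognized as r)
    log_ins[r]    : log P(phoneme r inserted spuriously)
    log_del[d]    : log P(dictionary phoneme d dropped)"""
    n, m = len(entry), len(recognized)
    D = np.full((n + 1, m + 1), -np.inf)
    D[0, 0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:   # substitution (or correct match)
                D[i, j] = max(D[i, j],
                              D[i - 1, j - 1] + log_sub[entry[i - 1]][recognized[j - 1]])
            if i > 0:             # dictionary phoneme dropped
                D[i, j] = max(D[i, j], D[i - 1, j] + log_del[entry[i - 1]])
            if j > 0:             # spurious phoneme inserted
                D[i, j] = max(D[i, j], D[i, j - 1] + log_ins[recognized[j - 1]])
    return D[n, m]

def best_word(recognized, word_dict, log_sub, log_ins, log_del):
    """Return the dictionary word with the greatest similarity."""
    return max(word_dict, key=lambda w: word_similarity(
        recognized, word_dict[w], log_sub, log_ins, log_del))
```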

Fig. 9 shows the configuration of an automatic speech recognition device according to the present invention. The speech signal from the microphone 20 is amplified to a suitable level by the amplifier 21 and converted by the A/D conversion section 22 into 12 bits at a 12 kHz sampling rate. After the signal processing circuit 29 applies the 6 dB/octave pre-emphasis and the Hamming window, the linear prediction analysis processor 23 computes the frequency spectrum from the linear prediction coefficients and calculates the parameters required for phoneme recognition. The main processor 24 uses the main memory 25 to perform segmentation, phoneme recognition and phoneme sequence creation, and transfers the resulting phoneme sequence to the word matching processor 27. The word matching processor 27 computes the similarity of each word by referring to the word dictionary and confusion matrix held in the memory 26, and returns the results to the main processor 24. The main processor 24 outputs the word with the greatest similarity to the I/O 28 as the recognition result, or rejects the input. The I/O 28 passes the received result on to other computers or directs other I/O devices to perform work. Providing dedicated processors in addition to the main processor to share the computation in this way makes high-speed operation possible.

A word recognition experiment was carried out with this device on a total of 3320 utterances of the names of 166 major cities throughout Japan, spoken by 20 adult males in a soundproof room; the average recognition rate was 84%. This is almost the same as with the conventional filter bank method. With data spoken by 20 adult females, on the other hand, the conventional filter bank method recognized only about 30% and thus could not cope with female voices at all, whereas the present device achieved the same 84% as for male voices. The effectiveness of the present invention was thus confirmed, opening the way to adaptation regardless of sex.

As described above, the present invention is based on recognition in phoneme units aimed at unspecified speakers. By performing phoneme recognition from the positions of the local peaks of the spectrum and their relative magnitudes, it is little affected by variations due to the speaker or the environment; and by replacing the conventional filter bank with linear predictive analysis for the spectral analysis, it can extract stable peaks corresponding to the formants, unaffected by the pitch component of the speech. Moreover, whereas a filter output is a discrete quantity and thus has a large extraction error, the LPC cepstrum of the present invention is a continuous quantity, so the peak extraction accuracy is improved. As a result, the invention makes possible an automatic speech recognition device characterized by its ability to handle unspecified speakers, male or female.

BRIEF DESCRIPTION OF THE DRAWINGS

Fig. 1 is a block diagram of a conventional automatic speech recognition system; Fig. 2 is a frequency spectrum of the female vowel /e/ obtained with a filter bank (conventional example); Fig. 3 shows an example of phoneme recognition by conventional filter analysis; Fig. 4 is a frequency spectrum of the female vowel /e/ obtained by the linear predictive analysis of the present invention; Fig. 5 shows an example of phoneme recognition by the linear predictive analysis of the present invention; Fig. 6 is a block diagram of the entire automatic speech recognition device according to the present invention; Fig. 7 is a block diagram of the acoustic processing section of the present invention; Fig. 8 is a vowel discrimination chart based on the local peaks P1, P2 of the present invention; Fig. 9 is a block diagram showing details of the automatic speech recognition device of the present invention.

1 …… acoustic processing section; 2 …… phoneme recognition section; 3 …… phoneme sequence creation section; 4 …… word matching section; 10 …… pre-emphasis section; 11 …… window section; 12 …… linear prediction analysis section; 13 …… frequency spectrum calculation section; 14 …… peak extraction and parameter calculation section; 22 …… A/D conversion circuit; 23 …… linear prediction analysis processor; 24 …… main processor; 26 …… memory for word dictionary and confusion matrix; 27 …… word matching processor.

Claims (1)

1. An automatic speech recognition device comprising: an acoustic processing section that processes a speech input and computes the parameters required for phoneme recognition; a phoneme recognition section that performs segmentation and phoneme recognition using said parameters; a phoneme sequence creation section that corrects the string of phonemes from said phoneme recognition section to create a phoneme sequence; and a word matching section that matches said phoneme sequence against a word dictionary; wherein linear predictive analysis is used in computing the parameters in said acoustic processing section, and speech recognition is performed on the basis of the peak positions of the frequency spectrum, obtained from the maxima or inflection points of the spectral envelope determined from the resulting linear prediction coefficients, and the relative magnitudes of the peaks.
JP57021412A 1982-02-12 1982-02-12 Automatic voice recognition system Granted JPS58139199A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP57021412A JPS58139199A (en) 1982-02-12 1982-02-12 Automatic voice recognition system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP57021412A JPS58139199A (en) 1982-02-12 1982-02-12 Automatic voice recognition system

Publications (2)

Publication Number Publication Date
JPS58139199A JPS58139199A (en) 1983-08-18
JPS6336680B2 true JPS6336680B2 (en) 1988-07-21

Family

ID=12054303

Family Applications (1)

Application Number Title Priority Date Filing Date
JP57021412A Granted JPS58139199A (en) 1982-02-12 1982-02-12 Automatic voice recognition system

Country Status (1)

Country Link
JP (1) JPS58139199A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005002704A (en) * 2003-06-13 2005-01-06 Nec Corp Personal authentication device and unlocking system with personal authentication function

Also Published As

Publication number Publication date
JPS58139199A (en) 1983-08-18

Similar Documents

Publication Publication Date Title
JPH0352640B2 (en)
US20010010039A1 (en) Method and apparatus for mandarin chinese speech recognition by using initial/final phoneme similarity vector
JP2996019B2 (en) Voice recognition device
JP4839970B2 (en) Prosody identification apparatus and method, and speech recognition apparatus and method
JP3493849B2 (en) Voice recognition device
JPS6336680B2 (en)
Sharma et al. Speech recognition of Punjabi numerals using synergic HMM and DTW approach
Lingam Speaker based language independent isolated speech recognition system
Gulzar et al. An improved endpoint detection algorithm using bit wise approach for isolated, spoken paired and Hindi hybrid paired words
Adam et al. Analysis of Momentous Fragmentary Formants in Talaqi-like Neoteric Assessment of Quran Recitation using MFCC Miniature Features of Quranic Syllables
JP6517417B1 (en) Evaluation system, speech recognition device, evaluation program, and speech recognition program
KR20180087038A (en) Hearing aid with voice synthesis function considering speaker characteristics and method thereof
JPS6313198B2 (en)
JP3110025B2 (en) Utterance deformation detection device
Sahu et al. Odia isolated word recognition using DTW
Nair et al. Comparison of Isolated Digit Recognition Techniques based on Feature Extraction
Ozaydin An isolated word speaker recognition system
JPS62143100A (en) Voice pattern matching system
JPH0469800B2 (en)
JPH0640274B2 (en) Voice recognizer
JPS59114600A (en) Speaker identification system
JPH0816186A (en) Voice recognition device
MAINDARGI et al. Implementation Of Speech Recognition System
Jolad et al. INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY DIFFERENT FEATURE EXTRACTION TECHNIQUES FOR AUTOMATIC SPEECH RECOGNITION: A REVIEW
JPS5958498A (en) Voice recognition equipment