JPS60147797A

JPS60147797A - Voice recognition equipment

Info

Publication number: JPS60147797A
Application number: JP59003923A
Authority: JP
Inventors: 藤井　諭; 森井　秀司; 昌克星見
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1984-01-12
Filing date: 1984-01-12
Publication date: 1985-08-03
Also published as: JPH0333280B2

Abstract

(57) [Abstract] This publication contains application data from before electronic filing, so no abstract data is recorded.

Description

DETAILED DESCRIPTION OF THE INVENTION

Field of Industrial Application

The present invention relates to a speech recognition device that automatically recognizes speech signals uttered by the human voice.

Configuration of the Prior Art and Its Problems

A speech recognition device that automatically recognizes speech is considered a very effective means of giving data and commands from humans to computers and other machines.

Most speech recognition devices studied or published so far adopt the pattern matching method as their operating principle.

In this method, a standard pattern is stored in advance for every word to be recognized. An unknown input pattern is compared with each standard pattern to compute the degree of matching (hereinafter called the similarity), and the input is judged to be the word whose standard pattern gives the best match. Because a standard pattern must be prepared for every word to be recognized, new standard patterns have to be entered and stored whenever the speaker changes. When several hundred words or more are to be recognized, uttering and registering every word takes time and effort, and the memory needed for registration is expected to become very large. A further drawback is that the time needed to match the input pattern against the standard patterns grows with the number of words.
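The pattern matching scheme described above can be sketched as a nearest-template search. This is only a minimal illustration: the feature vectors, the word list and the use of negative Euclidean distance as the similarity are assumptions made for the example, not details taken from the text.

```python
import math

def similarity(a, b):
    """Assumed similarity: negative Euclidean distance (larger is closer)."""
    return -math.dist(a, b)

def recognize(input_pattern, standard_patterns):
    """Return the word whose stored standard pattern best matches the input."""
    return max(
        standard_patterns,
        key=lambda word: similarity(input_pattern, standard_patterns[word]),
    )

# Hypothetical toy feature vectors, one stored pattern per word.
standard_patterns = {
    "akai": [1.0, 0.2, 0.1],
    "aoi": [0.9, 0.8, 0.1],
    "kuroi": [0.1, 0.7, 0.9],
}

print(recognize([0.95, 0.75, 0.2], standard_patterns))
```

Every word needs its own stored pattern and every recognition scans all of them, which is the memory and matching-time cost the paragraph points out.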

In contrast, there is a method that divides the input speech into phoneme units, recognizes it as a combination of phonemes (hereinafter called phoneme recognition), and computes the similarity against a word dictionary written in phoneme units. This method needs far less memory for the word dictionary, shortens the time required for pattern matching, and makes it easy to change the dictionary contents. For example, the utterance "akai" can be expressed in the very simple form AKAI by combining the three phonemes /a/, /k/ and /i/, so it is easy to handle a large vocabulary spoken by unspecified speakers.

Fig. 1 shows a block diagram of a speech recognition system characterized by phoneme recognition. Speech input through a microphone or the like is analyzed by the acoustic analysis unit 1.

The analysis uses a bank of band-pass filters or linear prediction analysis, and spectral information is obtained every frame period (about 10 ms). The phoneme discrimination unit 2 uses the spectral information from the acoustic analysis unit 1 and the data in the standard pattern storage unit 3 to discriminate the phoneme of each frame. The standard patterns stored in the standard pattern storage unit 3 are computed in advance for each phoneme from the speech of many speakers. The segmentation unit 4 uses the analysis output of the acoustic analysis unit 1 to detect the speech interval and to determine the boundary of each phoneme (hereinafter called segmentation). The phoneme recognition unit 5 then decides the phoneme of each segment from the outputs of the segmentation unit 4 and the phoneme discrimination unit 2. As a result, a phoneme sequence is completed.

The word recognition unit 6 compares this phoneme sequence with the word dictionary 7, which is likewise written as phoneme sequences, and outputs the word with the highest similarity as the recognition result.
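Matching a recognized phoneme sequence against a phoneme-based word dictionary can be sketched as follows. The text does not say how the similarity to dictionary entries is computed, so this sketch substitutes the Levenshtein edit distance as an assumed stand-in; the dictionary entries and the noisy input are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        cur = [i]
        for j, pb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # phoneme of a deleted
                cur[j - 1] + 1,            # phoneme of b inserted
                prev[j - 1] + (pa != pb),  # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1]

def recognize_word(phonemes, dictionary):
    """Return the dictionary word whose phoneme string is closest to the input."""
    return min(dictionary, key=lambda w: edit_distance(phonemes, dictionary[w]))

# Hypothetical dictionary: each word is stored as a short phoneme string.
dictionary = {
    "akai": list("akai"),
    "aoi": list("aoi"),
    "baNgoo": list("baNgoo"),
}

# A made-up noisy sequence with two added phonemes and one substitution.
print(recognize_word(list("banhugoo"), dictionary))
```

Because each entry is only a short symbol string, adding or changing a word means editing one dictionary line rather than recording a new acoustic template.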

The conventional segmentation unit 4 segmented consonants as follows.

Fig. 2(a) shows how power changes over time, and Fig. 2(b) shows how the rate of change of power changes over time. When the temporal change 8 of the power obtained with a band-pass filter has a concave shape (called a dip), let $n_1$ be the frame where the power takes its minimum. Let $n_2$ and $n_3$ be the frames before and after $n_1$ where the rate of change of power over time 9 (called the power difference value) takes its negative and positive extreme values. With $WD(n)$ denoting the difference value at frame $n$, the interval from $n_2$ to $n_3$ was taken as a consonant interval when the condition

$$WD(n_3) - WD(n_2) > \theta_w \qquad (1)$$

is satisfied. Here $\theta_w$ is a threshold for preventing the addition of spurious consonants, determined in advance from a statistical distribution.
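The dip-based consonant segmentation can be sketched as below. Several points are assumptions made for illustration: the difference value is taken as WD(n) = P(n) - P(n-1), a dip is accepted when the gap between the positive and negative extremes of WD exceeds the threshold, plateau-shaped minima are ignored, and the power values and threshold are invented.

```python
def find_dips(power, theta):
    """Return (n2, n1, n3) frame triples for each accepted power dip.

    power: per-frame band-pass filter power.
    theta: acceptance threshold (hypothetical value chosen by the caller).
    """
    # Assumed definition of the power difference value WD(n).
    wd = [0.0] + [power[n] - power[n - 1] for n in range(1, len(power))]
    dips = []
    for n1 in range(1, len(power) - 1):
        # n1 must be a strict local minimum of the power.
        if not (power[n1 - 1] > power[n1] < power[n1 + 1]):
            continue
        # Extend left over the falling slope that leads into n1.
        start = n1
        while start - 1 >= 1 and wd[start - 1] < 0:
            start -= 1
        # Extend right over the rising slope that leaves n1.
        end = n1 + 1
        while end + 1 < len(power) and wd[end + 1] > 0:
            end += 1
        n2 = min(range(start, n1 + 1), key=lambda n: wd[n])    # steepest fall
        n3 = max(range(n1 + 1, end + 1), key=lambda n: wd[n])  # steepest rise
        if wd[n3] - wd[n2] > theta:
            dips.append((n2, n1, n3))
    return dips

power = [10, 10, 9, 6, 2, 1, 3, 7, 10, 10, 10, 9.5, 10, 10]
print(find_dips(power, theta=3.0))  # only the deep dip passes
print(find_dips(power, theta=0.5))  # the shallow dip is accepted as well
```

Lowering the threshold lets the shallow dip near frame 11 through, which illustrates how small power fluctuations inside a vowel can be mistaken for consonants.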

Fig. 3 shows the segmentation unit 4 and the phoneme discrimination unit 2 in detail. The segmentation unit 4 consists of a dip detection unit 31, a consonant interval determination unit 32 and a consonant determination unit 33. Using the band-pass filter power obtained by acoustic analysis, the dip detection unit 31 performs the dip detection described above, and the consonant interval determination unit 32 determines the interval from n2 to n3 in Fig. 2 as a consonant interval. The consonant determination unit 33 then determines the consonant for this interval from the spectral shape. The phoneme discrimination unit 2, on the other hand, consists of a vowel candidate extraction unit 35 and a vowel interval determination unit 36. Using the LPC cepstrum coefficients obtained by acoustic analysis, the vowel candidate extraction unit 35 computes the similarity to the patterns in the standard pattern storage unit 3 and extracts the phoneme with the highest similarity as a vowel candidate.

In this case the standard patterns in the standard pattern storage unit 3 are created for the five vowels and the nasals, using the LPC cepstrum coefficients of each frame. The result is applied outside the consonant intervals found by the consonant interval determination unit 32, and the vowel interval determination unit 36 determines the vowel intervals and the vowel types. The phoneme recognition unit 5 combines this result with that of the consonant determination unit 33 to perform phoneme recognition, and sends the result to the word recognition unit 6 shown in Fig. 1.

This method segments and discriminates phonemes well, but because phoneme boundaries are determined solely by the presence of dips, it has two drawbacks. First, when the power becomes unstable within a vowel, that instability is also detected as a dip and a consonant is added. Because the rules of Japanese then require a vowel to follow, the addition of one consonant ends up adding two phonemes. Second, the dip interval does not always represent the correct boundary, so the correct boundary between vowel and consonant can no longer be established. This leads to vowel/consonant discrimination errors, confusion between short and long vowels, and similar errors.

Fig. 4 shows an example. It is an utterance of the word "番号" (bangou, "number"), and label a shows the position of each phoneme.

The dip detection unit 31 in Fig. 3 detects the dips c, and the result is passed to the consonant interval determination unit 32. The result of the subsequent decision by the consonant determination unit 33 is shown in d.

The extraction result of the vowel candidate extraction unit 35 is shown in e. The vowel interval determination unit 36 performs vowel recognition by combining the result of the consonant interval determination unit 32 with the vowel candidates e, and the result is shown in f. The vowel recognition result f and the consonant recognition result d are passed to the phoneme recognition unit 5, which yields the recognition result b. Row d (consonant recognition) shows the result of taking the interval from n2 to n3 of Fig. 2, located by the dips c, as the consonant interval and deciding the phoneme type from the spectral similarity to the standard patterns. Row e (vowel candidates) shows, for vowels and nasals, the phoneme with the highest spectral similarity. Treating the boundaries of consonant recognition d as correct and mechanically combining them with the vowel candidates e produces the phoneme sequence shown in row b (recognition result).

Comparing label a with the recognition result b shows that /h/ and /u/ have been added. In addition, /N/ has been replaced by /n/, and the interval of /g/ is wrong.

This is only one example. Such errors arise because the dip interval shown in Fig. 2 does not necessarily represent the consonant boundary.

How often such errors occur varies from person to person. They are frequent for speakers whose way of speaking is unstable and for speakers whose frequency characteristics deviate strongly from the band-pass filter used for dip detection.

As a result, phoneme additions, omissions and substitutions occur frequently and degrade word recognition performance.

Object of the Invention

The object of the present invention is to eliminate these drawbacks and to provide a high-performance speech recognition method by improving the accuracy of phoneme segmentation and phoneme discrimination.

Configuration of the Invention

To achieve this object, the present invention computes the similarity of phonemes to standard patterns and finds the positions of consonant candidates from changes in power. Vowel candidates, extracted on the basis of spectral stability as indicated by the continuity and strength of the similarity to vowels, are compared with the consonant candidates. This determines the positions of the boundaries between phonemes and the type of phoneme between the boundaries with high accuracy, enabling high-performance speech recognition.

Description of the Embodiments

An embodiment of the present invention is described below with reference to the drawings.

Errors like those in Fig. 4 arise because the dip interval does not necessarily represent the consonant boundary. A dip is caused by a fluctuation in power, but it does not necessarily correspond to a fluctuation in the spectrum. In other words, even where a dip exists, if the spectrum does not change, no consonant can be considered to exist there. Likewise, if the spectrum is stable at the start or end of the dip and changes greatly at some other position, the true phoneme boundary can be considered to lie at that other position. This embodiment makes active use of this property to determine consonant/vowel boundaries accurately.

Fig. 5 shows a block diagram of the main part of a speech recognition device according to an embodiment of the present invention.

The standard patterns stored in the standard pattern storage unit 44 are created for the vowels and nasals from the $p$-th order LPC cepstrum coefficients of $n$ frames near the center of each phoneme. That is, each standard pattern is a two-dimensional time-frequency pattern. Writing $c^{i}_{np}$ for the $p$-th order LPC cepstrum coefficient in the $n$-th frame of phoneme $i$, a vector $y_i$ is formed:

$$y_i = (c^{i}_{11}, c^{i}_{12}, \ldots, c^{i}_{1p}, c^{i}_{21}, \ldots, c^{i}_{np})$$

The vectors $y_i$ from many utterances are accumulated, and the mean of the $j$-th element of $y_i$ is written $m_{ij}$ ($j$ is the index of the parameter, and its maximum is $k = n \times p$). The covariance matrix is taken to be common to all phonemes and is written $W$. With $W^{-1}$ the inverse of $W$ and $\sigma^{jj'}$ its $(j, j')$ element, the weight coefficient $a_{ij}$ for the $j$-th parameter of phoneme $i$ can be written

$$a_{ij} = 2 \sum_{j'=1}^{k} \sigma^{jj'} m_{ij'} \qquad (2)$$

The Mahalanobis distance $D_i^2$ between a parameter vector $x = (x_1, x_2, \ldots, x_j, \ldots, x_k)$ obtained from the speech data of many speakers and the distribution of phoneme $i$ can be written

$$D_i^2 = (x - m_i)^t W^{-1} (x - m_i) = x^t W^{-1} x - \sum_{j=1}^{k} a_{ij} x_j + m_i^t W^{-1} m_i \qquad (3)$$

where $t$ denotes transposition. The first term of equation (3) does not depend on the phoneme and is therefore omitted, so the similarity $L_i$ can be obtained simply as

$$L_i = \sum_{j=1}^{k} a_{ij} x_j - m_i^t W^{-1} m_i \qquad (4)$$

The standard pattern storage unit 44 therefore only needs to hold the coefficients $a_{ij}$ and the constants $m_i^t W^{-1} m_i$ of equation (4).
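The simplified similarity of equation (4) can be checked numerically with a short sketch. It assumes the weights are a_ij = 2 Σ_j' σ^{jj'} m_ij', which is what expanding the Mahalanobis distance with a shared covariance matrix gives. The two-dimensional means and the inverse covariance matrix below are invented toy values, and the function names are hypothetical.

```python
def quad(u, w_inv, v):
    """Bilinear form u^t W^-1 v."""
    return sum(u[j] * w_inv[j][jp] * v[jp]
               for j in range(len(u)) for jp in range(len(v)))

def make_standard_pattern(mean, w_inv):
    """Precompute the weights a_ij and the constant m_i^t W^-1 m_i for one phoneme."""
    k = len(mean)
    a = [2.0 * sum(w_inv[j][jp] * mean[jp] for jp in range(k)) for j in range(k)]
    return a, quad(mean, w_inv, mean)

def similarity(x, pattern):
    """Equation (4): L_i = sum_j a_ij x_j - m_i^t W^-1 m_i."""
    a, const = pattern
    return sum(aj * xj for aj, xj in zip(a, x)) - const

def mahalanobis_sq(x, mean, w_inv):
    """Full squared Mahalanobis distance, used only as a cross-check."""
    d = [xj - mj for xj, mj in zip(x, mean)]
    return quad(d, w_inv, d)

# Toy data (assumed): a shared symmetric inverse covariance and two phoneme means.
w_inv = [[2.0, 0.5], [0.5, 1.0]]
means = {"a": [1.0, 0.0], "i": [0.0, 2.0]}
patterns = {ph: make_standard_pattern(m, w_inv) for ph, m in means.items()}

x = [0.8, 0.3]
scores = {ph: similarity(x, pat) for ph, pat in patterns.items()}
print(max(scores, key=scores.get), scores)
```

At recognition time each phoneme costs only k multiply-adds and one subtraction, and ranking by L_i gives the same order as ranking by smallest Mahalanobis distance.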

Next, for the parameter vector $x = (x_1, x_2, \ldots, x_j, \ldots, x_k)$ obtained from the input speech, the vowel candidate extraction unit 45 computes the similarity $L_i$ using equation (4). It extracts vowel candidates on the basis of spectral stability, as indicated by the continuity and strength of the similarity to vowels, and transfers the result to the vowel interval storage unit 46.
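One way to read "continuity and strength of the similarity" is sketched below: pick the best-scoring phoneme in each frame, group consecutive frames with the same label, and keep a run only if it is long enough and its similarity stays high enough. The text gives no concrete criteria, so the two thresholds, the function name and the frame scores are all assumptions.

```python
from itertools import groupby

def extract_vowel_candidates(frame_scores, min_frames=3, min_similarity=0.0):
    """Return stable runs as (phoneme, start_frame, end_frame), end inclusive.

    frame_scores: one dict {phoneme: similarity} per frame.
    min_frames, min_similarity: assumed continuity and strength thresholds.
    """
    best = [max(scores, key=scores.get) for scores in frame_scores]
    candidates = []
    start = 0
    for label, run in groupby(best):
        length = len(list(run))
        end = start + length - 1
        run_scores = [frame_scores[n][label] for n in range(start, end + 1)]
        # Continuity: the run is long enough. Strength: it never drops too low.
        if length >= min_frames and min(run_scores) >= min_similarity:
            candidates.append((label, start, end))
        start = end + 1
    return candidates

# Hypothetical scores: a stable /a/, two unstable transition frames, a stable /o/.
frame_scores = (
    [{"a": 2.0, "o": 0.5}] * 4
    + [{"a": 0.2, "o": 0.3}, {"a": 0.4, "o": 0.1}]
    + [{"a": 0.1, "o": 1.5}] * 3
)
print(extract_vowel_candidates(frame_scores, min_frames=3, min_similarity=1.0))
```

The two transition frames flip between labels with low scores, so they yield no candidate. This mirrors the behaviour the text relies on at consonant boundaries.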

Meanwhile, after acoustic analysis, the dip detection unit 40 detects dips in the band-pass filter power. The consonant interval determination unit 41 takes the interval from n2 to n3 shown in Fig. 2 as a provisional consonant interval and transfers the result to the consonant interval storage unit 42. The dip detection unit 40 and the consonant interval determination unit 41 together form the consonant candidate extraction unit 49.

The phoneme boundary determination unit 47 compares the contents of the consonant interval storage unit 42 and the vowel interval storage unit 46 to determine the phoneme boundaries. Because the patterns in the standard pattern storage unit 44 are built statistically from several frames near the phoneme center, slight spectral fluctuations within a vowel are absorbed as mere disturbances of the vowel spectrum. In the ambiguous region at a boundary with a consonant, on the other hand, the spectrum is not stable over time, so no large similarity appears. Using this property, vowel intervals can be extracted accurately.

Consonant candidates where no phoneme boundary can exist are therefore removed, and consonant intervals that are badly wrong are corrected. The results are returned to the consonant interval storage unit 42 for consonants and to the vowel interval storage unit 46 for vowels.
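The reconciliation of consonant candidates with stable vowel intervals can be sketched as follows. The text describes the effect (spurious candidates removed, wrong intervals corrected) but not the exact rules, so the two rules used here are assumptions: a consonant candidate lying entirely inside a stable vowel is dropped, and a candidate that partly overlaps a stable vowel is trimmed back to the vowel's edge. Frame numbers are invented.

```python
def reconcile(consonants, vowels):
    """Return the corrected consonant intervals.

    consonants: list of (start, end) frames, inclusive, from dip detection.
    vowels: list of (label, start, end) stable vowel candidates, inclusive.
    """
    kept = []
    for c_start, c_end in consonants:
        spurious = False
        for _, v_start, v_end in vowels:
            if v_start <= c_start and c_end <= v_end:
                spurious = True          # dip inside a stable vowel: drop it
                break
            if v_start <= c_end <= v_end:
                c_end = v_start - 1      # tail runs into the vowel: trim the tail
            elif v_start <= c_start <= v_end:
                c_start = v_end + 1      # head starts inside the vowel: trim the head
        if not spurious and c_start <= c_end:
            kept.append((c_start, c_end))
    return kept

vowels = [("a", 2, 9), ("o", 20, 40)]
consonants = [(0, 3), (25, 28), (16, 22), (10, 15)]
print(reconcile(consonants, vowels))
```

Here the candidate at frames 25 to 28 sits inside the long stable /o/ and disappears, while the candidates at 0 to 3 and 16 to 22 are shortened so that they no longer overlap a vowel.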

Next, on the basis of the result determined by the phoneme boundary determination unit 47 and passed through the consonant interval storage unit 42, the consonant determination unit 43 computes the spectral similarity to the standard patterns over the new interval and determines the consonant. The phoneme recognition unit 48 performs phoneme recognition by combining this result with that of the vowel interval storage unit 46, and transfers the result to the word recognition unit.
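The final combination step can be pictured as a merge of the two labelled interval lists in time order. This is a deliberately simple sketch: it assumes the intervals no longer overlap after boundary determination, and the segment tuples and frame numbers are hypothetical.

```python
def build_phoneme_sequence(consonants, vowels):
    """Merge labelled intervals into one phoneme sequence ordered by start frame.

    Each item is (label, start_frame, end_frame); overlaps are assumed to have
    been resolved already by the boundary determination step.
    """
    segments = sorted(consonants + vowels, key=lambda seg: seg[1])
    return [label for label, _, _ in segments]

consonants = [("b", 0, 3), ("g", 30, 34)]
vowels = [("a", 4, 15), ("N", 16, 29), ("o", 35, 60)]
print(build_phoneme_sequence(consonants, vowels))
```

The resulting list of labels is what a later word recognition stage would compare against its phoneme-based dictionary.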

Fig. 6 shows an example of recognition by this embodiment.

In the figure, a shows the labels determined by visual inspection.

c shows the dip regions detected by the dip detection unit 40 of Fig. 5, and d shows the consonant candidates determined by the consonant interval determination unit 41. e shows the consonant candidates after correction by the phoneme boundary determination unit 47, and f shows the consonant recognition result obtained when the consonant determination unit 43 judges the candidates in e. g shows the vowel candidates extracted by the vowel candidate extraction unit 45, and h shows the vowel recognition result after correction by the phoneme boundary determination unit 47. b shows the recognition result produced by the phoneme recognition unit 48 from the consonant recognition result f and the vowel recognition result h.

In this embodiment, for consonant recognition the dip detection unit 40 first detects the dip positions shown in Fig. 6 c. From these dip positions the consonant interval determination unit 41 extracts the consonant candidates /b/, /n/, /m/ and /h/ shown in Fig. 6 d and transfers them to the consonant interval storage unit 42.

For vowel recognition, the vowel candidate extraction unit 45 uses the time-frequency standard patterns stored in the standard pattern storage unit 44 to select the phoneme with the highest similarity in each frame, extracts the vowel candidates shown in Fig. 6 g, and transfers them to the vowel interval storage unit 46.

The phoneme boundary determination unit 47 refers to the contents of the consonant interval storage unit 42 and the vowel interval storage unit 46 to make the final, high-accuracy determination of the phoneme boundaries.

As described above, extracting vowel candidates with time-frequency standard patterns has the following properties.

(1) Small spectral disturbances within a vowel interval are absorbed, so vowels are extracted stably.

(2) In transition parts the spectrum is not stable over time, so the extraction of spurious vowel candidates is prevented.

(3) Consonant candidates added by a dip because the power is unstable within a vowel can be removed on the basis of the stability of the vowel candidates.

This embodiment makes active use of these properties and performs the following processing. First, at the added consonant /h/ in Fig. 6, the vowel candidate /o/ shown in g is extracted stably over a long interval, so the /h/ can be removed using the properties above. Next, in the /N/ part of label a, the vowel candidates g contain no stable vowel candidate other than /N/. By this property, the interval up to the next dip shown in c can be determined to be /N/, as shown in h. Finally, in the /m/ interval shown in d, the vowel candidates g show /o/ overlapping part of the /m/ interval, and this /o/ is stable over a long interval, so the /m/ interval can be corrected using the stability property. The result is shown in Fig. 6 e.
/N/ can be determined as shown in . Also, looking at the vowel candidates for q in the /m/ interval shown in d, 10/ overlaps with a part of the /m/ interval, and this 10/ is stable over a long interval. By utilizing the property of ``Gara'', the interval of /m/ can be modified. The results are shown in Figure 6e.

As a result of this processing, the consonant intervals are transferred to the consonant interval storage unit 42 as Fig. 6 e, and the vowel intervals are transferred as Fig. 6 h to the phoneme recognition unit 48 via the vowel interval storage unit 46.

Among the consonant candidates shown in Fig. 6 e, the consonant determination unit 43 re-examines the phoneme /m/, whose boundaries were corrected. It computes the similarity to the standard pattern spectra, corrects the candidate to the phoneme with the highest similarity, /S/, and transfers it to the phoneme recognition unit 48 as the consonant recognition result f.

In this way the method combines dip-based detection of consonant candidates with spectral stability to achieve more precise phoneme segmentation and phoneme discrimination.

The table shows the results of evaluating phoneme recognition with this method on 2,120 words uttered by 10 adult male speakers.

As the table shows, a good average recognition rate of 82.6% over all phonemes is obtained. The phoneme addition rate is 4.8% and the phoneme omission rate is 3.9%, so accurate phoneme sequences are produced with very few addition and omission errors.

Although the above embodiment uses LPC cepstrum coefficients as the spectral information, other information such as filter bank outputs may be used instead.

Effects of the Invention

In summary, the present invention computes the similarity of phonemes to standard patterns, finds consonant candidates from changes in power, and compares them with vowel candidates extracted on the basis of spectral stability, as indicated by the continuity and strength of the similarity to vowels. It thereby determines the positions of the boundaries between phonemes and the type of phoneme between the boundaries with high accuracy, and has the advantage of achieving highly reliable phoneme recognition.

[Brief explanation of drawings]

Fig. 1 is a block diagram of a conventional speech recognition device. Fig. 2 shows how power and the rate of change of power vary over time. Fig. 3 is a block diagram of the main part of the conventional speech recognition device. Fig. 4 shows an example of recognition by that device. Fig. 5 is a block diagram of the main part of a speech recognition device according to an embodiment of the present invention. Fig. 6 shows an example of a recognition result from that device.

40: dip detection unit
41: consonant interval determination unit
42: consonant interval storage unit
43: consonant determination unit
44: standard pattern storage unit
45: vowel candidate extraction unit
46: vowel interval storage unit
47: phoneme boundary determination unit
48: phoneme recognition unit
49: consonant candidate extraction unit

Agent: Patent Attorney Toshio Nakao and one other

Claims

[Claims]

(1) A speech recognition device comprising at least: a standard pattern storage unit that stores in advance standard patterns constructed on the basis of a statistical distance measure, using spectral information obtained from the speech of many speakers; a vowel candidate extraction unit that, for each analysis interval, uses spectral information to compute the similarity of phonemes to the standard patterns on the basis of the statistical distance measure; a consonant candidate extraction unit that extracts consonant candidates using dips in the temporal change of power; and a phoneme boundary determination unit that compares the vowel candidates, extracted by the vowel candidate extraction unit on the basis of spectral stability as indicated by the temporal continuity or the strength of the similarity to phonemes, with the consonant candidates from the consonant candidate extraction unit, and determines the positions of the boundaries between phonemes and the type of phoneme between the boundaries.

(2) The speech recognition device according to claim 1, wherein the standard patterns are constructed on the basis of a statistical distance measure from two-dimensional time-frequency patterns that use the spectral information of a plurality of analysis intervals near the center of each phoneme.