JPS6363920B2 - Voice recognition - Google Patents

Info

Publication number
JPS6363920B2
Authority
JP
Japan
Prior art keywords
frequency power
consonants
power
low
dip
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired
Application number
JP1177483A
Other languages
Japanese (ja)
Other versions
JPS59138000A (en)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed
Priority to JP1177483A
Publication of JPS59138000A
Publication of JPS6363920B2
Granted


Description

DETAILED DESCRIPTION OF THE INVENTION

Field of Industrial Application

The present invention relates to a speech recognition method.

Configuration of the Prior Art and Its Problems

FIG. 1 shows a block diagram of a conventional word recognition system that divides input speech into phoneme units, recognizes it as a combination of phonemes (called phoneme recognition), computes the similarity to a word dictionary written in phoneme units, and outputs the recognition result. First, the speech of many speakers is analyzed in advance, frame by frame (one frame being 10 msec), with a filter bank in the acoustic analysis unit 1, and feature parameters are derived from the resulting spectral information by the feature extraction unit 2. From these feature parameters, standard patterns are created for each phoneme group, such as the five vowels and the consonants, and registered in the standard pattern registration unit 3. Next, the segmentation unit 4 performs segmentation using the feature parameters obtained by the feature extraction unit 2. Based on this result, the phoneme discrimination unit 5 determines each phoneme by matching against the standard patterns in the standard pattern registration unit 3. Finally, the resulting phoneme sequence is sent to the word recognition unit 6, which outputs as the recognition result the word in the word dictionary 7 (likewise expressed as phoneme sequences) having the highest similarity.
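To make the final dictionary-matching step concrete, the sketch below scores a recognized phoneme sequence against phoneme-spelled dictionary entries. The patent does not specify its similarity measure, so the use of Levenshtein distance, and every name below, is an assumption for illustration only.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two phoneme strings (one char = one phoneme)."""
    dp = list(range(len(b) + 1))              # dp[j] = distance between a[:i] and b[:j]
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # delete from a
                                     dp[j - 1] + 1,      # insert into a
                                     prev + (ca != cb))  # substitute
    return dp[len(b)]

def recognize(phonemes: str, dictionary: dict[str, str]) -> str:
    """Pick the dictionary word whose phoneme spelling is nearest to the input."""
    return min(dictionary, key=lambda w: edit_distance(phonemes, dictionary[w]))

# recognize("akai", {"akai": "akai", "aki": "aki"})  ->  "akai"
```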
In this conventional segmentation, as shown in FIG. 2, when the curve 8 of the whole-band power over time takes a concave shape (this is called a dip), the frame at which the power reaches its minimum is denoted n1, and the frames before and after n1 at which the rate of change of the power over time (called the power difference value) 9 reaches its negative and positive extrema are denoted n2 and n3. If the difference value at frame n is written WD(n), then when WD(n2) and WD(n3) satisfy

WD(n2) ≤ −θw, WD(n3) ≥ θw

the interval from n2 to n3 is taken as a consonant interval. Here θw is a threshold for preventing spurious consonant insertions.
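A minimal sketch of this conventional dip detection, assuming a NumPy array of per-frame whole-band power. The ±10-frame search window around each power minimum is an assumption; the patent does not say how far from n1 the extrema n2 and n3 are sought.

```python
import numpy as np

def conventional_consonant_intervals(power: np.ndarray, theta_w: float, win: int = 10):
    """Find dips in the whole-band power contour and keep those whose
    difference values are steep enough: WD(n2) <= -theta_w, WD(n3) >= theta_w."""
    wd = np.diff(power, prepend=power[:1])      # difference value WD(n)
    intervals = []
    for n1 in range(1, len(power) - 1):
        # n1: frame where the power contour has a local minimum (bottom of a dip)
        if power[n1] < power[n1 - 1] and power[n1] <= power[n1 + 1]:
            lo = max(0, n1 - win)
            hi = min(len(power), n1 + win)
            n2 = lo + int(np.argmin(wd[lo:n1 + 1]))   # negative extremum before n1
            n3 = n1 + int(np.argmax(wd[n1:hi]))       # positive extremum after n1
            if wd[n2] <= -theta_w and wd[n3] >= theta_w:
                intervals.append((n2, n3))            # consonant interval n2..n3
    return intervals
```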
Next, feature parameters characterizing the phonemes are computed frame by frame over this consonant interval and compared with standard patterns prepared in advance for each phoneme, yielding a consonant classification for every frame. These frame-wise results are then applied to a consonant classification tree, and the consonant is assigned to the class whose conditions match.

Thus the conventional method first computes the whole-band power of the input speech, segments word-medial consonants from its power dips, then classifies consonants frame by frame, and finally fits the frame-wise classification results to a consonant classification tree to assign each consonant to the matching class. It is an intricate, laborious procedure with a very complicated algorithm.

Purpose of the Invention

The present invention was made to solve the conventional problems described above, and its purpose is to provide a speech recognition method that can segment the input speech and broadly classify its consonants extremely simply.

Structure of the Invention

To achieve this purpose, the present invention uses low-band power and high-band power together as feature parameters, and applies the sizes of the dips produced by the temporal variation of each power to a two-dimensional discriminant diagram, thereby performing the segmentation of word-medial consonants and their broad classification simultaneously and simply.

Description of an Embodiment

An embodiment of the present invention is described below with reference to the drawings.
In this embodiment, low-band power and high-band power are used together as the feature parameters. Because the characteristics of voiced consonants tend to appear in the high-band power and those of voiceless consonants in the low-band power, using the two bands together makes it possible to handle almost all consonants.

This embodiment shows an example in which the phonemes |p|, |t|, |k|, |c|, |b|, |d|, |m|, |n|, |s|, |h| are broadly classified into four groups: voiceless plosives {|p|, |t|, |k|, |c|}, voiced plosives {|b|, |d|}, nasals {|m|, |n|}, and voiceless fricatives {|s|, |h|}.

In this embodiment, the low-band power is obtained by passing the input speech signal through a low-band filter, computing the logarithmic power for each frame (one frame being 10 msec in this embodiment), and smoothing the result. The high-band power is obtained in the same way using a high-band filter.
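A sketch of computing such a band-power parameter with SciPy, under stated assumptions: the patent specifies only "low-band" and "high-band" filters, 10-msec frames, logarithmic power, and smoothing, so the band edges, the filter order, and the three-frame moving average below are illustrative values, not taken from the text.

```python
import numpy as np
from scipy.signal import butter, lfilter

def band_log_power(x: np.ndarray, fs: int, band: tuple[float, float],
                   frame_ms: float = 10.0, smooth: int = 3) -> np.ndarray:
    """Band-pass filter the signal, take log power per frame, then smooth."""
    nyq = fs / 2.0
    b, a = butter(4, [band[0] / nyq, band[1] / nyq], btype="band")
    y = lfilter(b, a, x)
    n = int(fs * frame_ms / 1000)                       # samples per 10-msec frame
    frames = y[: (len(y) // n) * n].reshape(-1, n)
    logp = 10.0 * np.log10(np.sum(frames ** 2, axis=1) + 1e-12)
    kernel = np.ones(smooth) / smooth                   # moving-average smoothing
    return np.convolve(logp, kernel, mode="same")

# e.g. low_power  = band_log_power(x, fs, (100.0, 900.0))   # illustrative band edges,
#      high_power = band_log_power(x, fs, (1800.0, 5000.0)) # assuming fs is high enough
```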
FIG. 3 shows how the size of a power dip is determined. In the figure, 10 and 11 show the temporal variation of the low-band and high-band power, respectively, and 12 and 13 show the temporal variation of their difference values. Let nA1 and nA2 be the frames at which the difference value of the low-band power reaches its negative and positive extrema, let nB1 and nB2 be the corresponding frames for the high-band power, and let WD(nA1), WD(nA2), WD(nB1), WD(nB2) be the difference values at those frames. The size PL of the low-band power dip and the size PH of the high-band power dip are then defined as

PL = WD(nA1) − WD(nA2)
PH = WD(nB1) − WD(nB2)
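Given the smoothed band-power contours, PL and PH follow directly from this definition. A minimal sketch; locating the extrema over the whole contour, rather than in the neighborhood of one dip, is a simplification here.

```python
import numpy as np

def dip_size(p: np.ndarray) -> float:
    """PL or PH per the text: WD at the negative extremum minus WD at the
    positive extremum of the band's difference value."""
    wd = np.diff(p, prepend=p[:1])          # difference value WD(n)
    n_fall = int(np.argmin(wd))             # nA1 / nB1: negative extremum
    n_rise = int(np.argmax(wd))             # nA2 / nB2: positive extremum
    return float(wd[n_fall] - wd[n_rise])   # PL = WD(nA1) - WD(nA2), etc.

# PL = dip_size(low_power); PH = dip_size(high_power)
```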
The consonant interval is taken to be the dip interval; when a low-band and a high-band power dip overlap, the interval where the two dip intervals meet is taken. In FIG. 3 the consonant interval is the interval C from nA1 to nB2, and the dip sizes of this consonant interval are PL and PH.
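The wording leaves some ambiguity between the overlap and the combined span of the two dips, but the FIG. 3 example (C running from nA1 to nB2) matches the combined span, which this sketch implements; the function name and tuple representation are assumptions.

```python
def consonant_interval(low_dip: tuple[int, int], high_dip: tuple[int, int]):
    """Merge a low-band dip (nA1, nA2) and a high-band dip (nB1, nB2) that
    overlap in time into one consonant interval, as with interval C in FIG. 3."""
    (a1, a2), (b1, b2) = low_dip, high_dip
    if a1 <= b2 and b1 <= a2:               # the dip intervals overlap
        return min(a1, b1), max(a2, b2)     # e.g. (nA1, nB2) in FIG. 3
    return None                             # disjoint: treat each dip separately

# consonant_interval((20, 28), (24, 31))  ->  (20, 31)
```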
For example, examining the distributions of PL and PH for the phonemes |k|, |d|, |n|, |s| gives FIGS. 4 to 7, in which PL is plotted on the horizontal axis, PH on the vertical axis, and the numerals indicate the number of phoneme samples. As is clear from the figures, phonemes exhibiting plosiveness, such as |k| and |d|, have large PL and PH; in particular the voiceless plosive |k| has a large PL, and the voiced plosive |d| has a large PH. The phonemes |n| and |s|, which exhibit no plosiveness, have small PL and PH but separate according to whether they are voiced or voiceless, as in FIGS. 6 and 7.

As described above, using low-band and high-band power together makes it possible to discriminate voiced from voiceless consonants, and using the dip size makes it possible to detect plosiveness.
From these properties, a two-dimensional PL-PH discriminant diagram can be constructed that broadly classifies consonants into the four groups: voiceless plosives {|p|, |t|, |k|, |c|}, voiced plosives {|b|, |d|}, nasals {|m|, |n|}, and voiceless fricatives {|s|, |h|}.

FIG. 8 shows a discriminant diagram created from a large amount of data. In the figure, region 14 is regarded as a consonant adjoined to a vowel and is not segmented as a consonant interval; region 15 corresponds to the nasals {|m|, |n|}, region 16 to the voiced plosives {|b|, |d|}, region 17 to the voiceless fricatives {|s|, |h|}, and region 18 to the voiceless plosives {|p|, |t|, |k|, |c|}.
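The actual region boundaries of FIG. 8 were drawn from training data and are not reproduced in the text, so the following is only a toy stand-in: the thresholds are invented, and the comparisons merely mimic the qualitative layout described above (small dips near the origin are not segmented; large dips indicate plosives; the dominant band indicates voicing).

```python
def classify_dip(pl: float, ph: float, eps: float = 0.5, big: float = 2.0) -> str:
    """Toy stand-in for the FIG. 8 discriminant diagram (thresholds invented)."""
    if pl < eps and ph < eps:
        # region 14: adjunct to a vowel, not segmented as a consonant
        return "not a consonant (region 14)"
    if pl >= big or ph >= big:
        # a large dip in either band indicates plosiveness
        return ("voiceless plosive {p,t,k,c} (region 18)" if pl >= ph
                else "voiced plosive {b,d} (region 16)")
    # smaller dips: non-plosive; the dominant band indicates voicing
    return ("voiceless fricative {s,h} (region 17)" if pl >= ph
            else "nasal {m,n} (region 15)")
```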
For unknown input speech, the dip sizes of the low-band and high-band power are computed and applied to the discriminant diagram of FIG. 8, so that the segmentation of word-medial consonants and their broad classification are performed simultaneously and simply.

Effects of the Invention

Using this embodiment, data (about 2,100 words) from ten male and female speakers, each uttering a 212-word set, were evaluated. The results are shown in Table 1. The number of dropped word-medial consonants is somewhat higher for the nasals {|m|, |n|}, but overall few consonants are dropped and the results are better than those of the conventional method. The recognition rate for the broad consonant classes is also high, at about 85% to 90%.

As described above, the present invention uses the dip sizes of the low-band and high-band power as parameters and fits them to a two-dimensional discriminant diagram, a very simple method that nevertheless determines the segmentation of consonant intervals and the broad classification of consonants with good accuracy.

[Table 1]

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a word recognition system; FIG. 2 is an explanatory diagram of the conventional method of segmenting word-medial consonants; FIG. 3 is an explanatory diagram of how dips in the low-band and high-band power are captured in the present invention; FIGS. 4 to 7 are diagrams showing the distributions of the low-band and high-band power dip sizes for each phoneme; and FIG. 8 is a discriminant diagram in an embodiment of the present invention.

1 ... acoustic analysis unit, 2 ... feature extraction unit, 3 ... standard pattern registration unit, 4 ... segmentation unit, 5 ... phoneme discrimination unit, 6 ... word recognition unit, 7 ... word dictionary, 8 ... whole-band power, 9 ... difference value of the whole-band power, 10 ... low-band power, 11 ... high-band power, 12 ... difference value of the low-band power, 13 ... difference value of the high-band power, 14 ... adjunct region, 15 ... nasal region, 16 ... voiced plosive region, 17 ... voiceless fricative region, 18 ... voiceless plosive region.

Claims (1)

[Claims]

1. A speech recognition method comprising a procedure for phoneme recognition, characterized in that the low-band power and the high-band power of the speech spectrum are determined, attention is paid to the sizes of the dips (concave portions) produced by the temporal variation of each, and these are applied to a two-dimensional discriminant diagram, whereby the segmentation of word-medial consonants and the broad classification of the consonants are performed simultaneously.
JP1177483A 1983-01-27 1983-01-27 Voice recognition Granted JPS59138000A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP1177483A JPS59138000A (en) 1983-01-27 1983-01-27 Voice recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP1177483A JPS59138000A (en) 1983-01-27 1983-01-27 Voice recognition

Publications (2)

Publication Number Publication Date
JPS59138000A JPS59138000A (en) 1984-08-08
JPS6363920B2 (en) 1988-12-08

Family

ID=11787308

Family Applications (1)

Application Number Title Priority Date Filing Date
JP1177483A Granted JPS59138000A (en) 1983-01-27 1983-01-27 Voice recognition

Country Status (1)

Country Link
JP (1) JPS59138000A (en)

Also Published As

Publication number Publication date
JPS59138000A (en) 1984-08-08

Similar Documents

Publication Publication Date Title
JPS6336676B2 (en)
JPH02195400A (en) Speech recognition device
JPS5972496A (en) Single sound identifier
US4817159A (en) Method and apparatus for speech recognition
Shahzadi et al. Recognition of emotion in speech using spectral patterns
Eray et al. An application of speech recognition with support vector machines
Kamble et al. Emotion recognition for instantaneous Marathi spoken words
Gaudani et al. Comparative study of robust feature extraction techniques for ASR for limited resource Hindi language
MY An improved feature extraction method for Malay vowel recognition based on spectrum delta
JPS6363920B2 (en)
Majidnezhad A HTK-based method for detecting vocal fold pathology
Bhattachajee et al. An experimental analysis of speech features for tone speech recognition
Mengistu et al. Text independent amharic language dialect recognition using neuro-fuzzy gaussian membership function
JPH0316040B2 (en)
JP2744622B2 (en) Plosive consonant identification method
Thubthong et al. A method for isolated Thai tone recognition using a combination of neural networks
JPH026079B2 (en)
JPS6069694A (en) Segmentation of head consonant
JPH0114600B2 (en)
Zahorian et al. Dynamic spectral shape features for speaker-independent automatic recognition of stop consonants
JPS63220297A (en) Segmentation sorting for consonant
JPH01260499A (en) Consonant recognizing method
Zaki et al. Effectiveness of multiscale fractal dimension for improvement of frame classification rate
JPH0120440B2 (en)
JPS5958498A (en) Voice recognition equipment