JPS6363920B2 - Voice recognition - Google Patents

Info

Publication number
JPS6363920B2
Authority
JP
Japan
Prior art keywords
frequency power
consonants
power
low
dip
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired
Application number
JP1177483A
Other languages
Japanese (ja)
Other versions
JPS59138000A (en)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed
Priority to JP1177483A
Publication of JPS59138000A
Publication of JPS6363920B2
Granted


Description

DETAILED DESCRIPTION OF THE INVENTION

Field of Industrial Application

The present invention relates to a speech recognition method.

Configuration of the Prior Art and Its Problems

FIG. 1 shows a block diagram of a conventional word recognition system that divides input speech into phoneme units, recognizes it as a combination of phonemes (called phoneme recognition), computes the similarity to a word dictionary written in phoneme units, and outputs the recognition result. First, the speech of many speakers is analyzed in advance, frame by frame (one frame being 10 msec), with a filter bank in the acoustic analysis unit 1, and feature parameters are derived from the resulting spectral information by the feature extraction unit 2. From these feature parameters, standard patterns are created for each phoneme group, such as the five vowels and the consonants, and registered in the standard pattern registration unit 3. Next, the segmentation unit 4 performs segmentation using the feature parameters obtained by the feature extraction unit 2. Based on this result, the phoneme discrimination unit 5 determines each phoneme by matching against the standard patterns in the standard pattern registration unit 3. Finally, the resulting phoneme sequence is sent to the word recognition unit 6, which outputs as the recognition result the word in the word dictionary 7 (likewise expressed as phoneme sequences) having the highest similarity.
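To make the final dictionary-matching step concrete, the sketch below scores a recognized phoneme sequence against phoneme-spelled dictionary entries. The patent does not specify its similarity measure, so the use of Levenshtein distance, and every name below, is an assumption for illustration only.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two phoneme strings (one char = one phoneme)."""
    dp = list(range(len(b) + 1))              # dp[j] = distance between a[:i] and b[:j]
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # delete from a
                                     dp[j - 1] + 1,      # insert into a
                                     prev + (ca != cb))  # substitute
    return dp[len(b)]

def recognize(phonemes: str, dictionary: dict[str, str]) -> str:
    """Pick the dictionary word whose phoneme spelling is nearest to the input."""
    return min(dictionary, key=lambda w: edit_distance(phonemes, dictionary[w]))

# recognize("akai", {"akai": "akai", "aki": "aki"})  ->  "akai"
```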
In this conventional segmentation, as shown in FIG. 2, when the curve 8 of the whole-band power over time takes a concave shape (this is called a dip), the frame at which the power reaches its minimum is denoted n1, and the frames before and after n1 at which the rate of change of the power over time (called the power difference value) 9 reaches its negative and positive extrema are denoted n2 and n3. If the difference value at frame n is written WD(n), then when WD(n2) and WD(n3) satisfy

WD(n2) ≤ −θw, WD(n3) ≥ θw

the interval from n2 to n3 is taken as a consonant interval. Here θw is a threshold for preventing spurious consonant insertions.
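A minimal sketch of this conventional dip detection, assuming a NumPy array of per-frame whole-band power. The ±10-frame search window around each power minimum is an assumption; the patent does not say how far from n1 the extrema n2 and n3 are sought.

```python
import numpy as np

def conventional_consonant_intervals(power: np.ndarray, theta_w: float, win: int = 10):
    """Find dips in the whole-band power contour and keep those whose
    difference values are steep enough: WD(n2) <= -theta_w, WD(n3) >= theta_w."""
    wd = np.diff(power, prepend=power[:1])      # difference value WD(n)
    intervals = []
    for n1 in range(1, len(power) - 1):
        # n1: frame where the power contour has a local minimum (bottom of a dip)
        if power[n1] < power[n1 - 1] and power[n1] <= power[n1 + 1]:
            lo = max(0, n1 - win)
            hi = min(len(power), n1 + win)
            n2 = lo + int(np.argmin(wd[lo:n1 + 1]))   # negative extremum before n1
            n3 = n1 + int(np.argmax(wd[n1:hi]))       # positive extremum after n1
            if wd[n2] <= -theta_w and wd[n3] >= theta_w:
                intervals.append((n2, n3))            # consonant interval n2..n3
    return intervals
```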
Next, feature parameters characterizing the phonemes are computed frame by frame over this consonant interval and compared with standard patterns prepared in advance for each phoneme, yielding a consonant classification for every frame. These frame-wise results are then applied to a consonant classification tree, and the consonant is assigned to the class whose conditions match.

Thus the conventional method first computes the whole-band power of the input speech, segments word-medial consonants from its power dips, then classifies consonants frame by frame, and finally fits the frame-wise classification results to a consonant classification tree to assign each consonant to the matching class. It is an intricate, laborious procedure with a very complicated algorithm.

Purpose of the Invention

The present invention was made to solve the conventional problems described above, and its purpose is to provide a speech recognition method that can segment the input speech and broadly classify its consonants extremely simply.

Structure of the Invention

To achieve this purpose, the present invention uses low-band power and high-band power together as feature parameters, and applies the sizes of the dips produced by the temporal variation of each power to a two-dimensional discriminant diagram, thereby performing the segmentation of word-medial consonants and their broad classification simultaneously and simply.

Description of an Embodiment

An embodiment of the present invention is described below with reference to the drawings.
In this embodiment, low-band power and high-band power are used together as the feature parameters. Because the characteristics of voiced consonants tend to appear in the high-band power and those of voiceless consonants in the low-band power, using the two bands together makes it possible to handle almost all consonants.

This embodiment shows an example in which the phonemes |p|, |t|, |k|, |c|, |b|, |d|, |m|, |n|, |s|, |h| are broadly classified into four groups: voiceless plosives {|p|, |t|, |k|, |c|}, voiced plosives {|b|, |d|}, nasals {|m|, |n|}, and voiceless fricatives {|s|, |h|}.

In this embodiment, the low-band power is obtained by passing the input speech signal through a low-band filter, computing the logarithmic power for each frame (one frame being 10 msec in this embodiment), and smoothing the result. The high-band power is obtained in the same way using a high-band filter.
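A sketch of computing such a band-power parameter with SciPy, under stated assumptions: the patent specifies only "low-band" and "high-band" filters, 10-msec frames, logarithmic power, and smoothing, so the band edges, the filter order, and the three-frame moving average below are illustrative values, not taken from the text.

```python
import numpy as np
from scipy.signal import butter, lfilter

def band_log_power(x: np.ndarray, fs: int, band: tuple[float, float],
                   frame_ms: float = 10.0, smooth: int = 3) -> np.ndarray:
    """Band-pass filter the signal, take log power per frame, then smooth."""
    nyq = fs / 2.0
    b, a = butter(4, [band[0] / nyq, band[1] / nyq], btype="band")
    y = lfilter(b, a, x)
    n = int(fs * frame_ms / 1000)                       # samples per 10-msec frame
    frames = y[: (len(y) // n) * n].reshape(-1, n)
    logp = 10.0 * np.log10(np.sum(frames ** 2, axis=1) + 1e-12)
    kernel = np.ones(smooth) / smooth                   # moving-average smoothing
    return np.convolve(logp, kernel, mode="same")

# e.g. low_power  = band_log_power(x, fs, (100.0, 900.0))   # illustrative band edges,
#      high_power = band_log_power(x, fs, (1800.0, 5000.0)) # assuming fs is high enough
```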
FIG. 3 shows how the size of a power dip is determined. In the figure, 10 and 11 show the temporal variation of the low-band and high-band power, respectively, and 12 and 13 show the temporal variation of their difference values. Let nA1 and nA2 be the frames at which the difference value of the low-band power reaches its negative and positive extrema, let nB1 and nB2 be the corresponding frames for the high-band power, and let WD(nA1), WD(nA2), WD(nB1), WD(nB2) be the difference values at those frames. The size PL of the low-band power dip and the size PH of the high-band power dip are then defined as

PL = WD(nA1) − WD(nA2)
PH = WD(nB1) − WD(nB2)
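Given the smoothed band-power contours, PL and PH follow directly from this definition. A minimal sketch; locating the extrema over the whole contour, rather than in the neighborhood of one dip, is a simplification here.

```python
import numpy as np

def dip_size(p: np.ndarray) -> float:
    """PL or PH per the text: WD at the negative extremum minus WD at the
    positive extremum of the band's difference value."""
    wd = np.diff(p, prepend=p[:1])          # difference value WD(n)
    n_fall = int(np.argmin(wd))             # nA1 / nB1: negative extremum
    n_rise = int(np.argmax(wd))             # nA2 / nB2: positive extremum
    return float(wd[n_fall] - wd[n_rise])   # PL = WD(nA1) - WD(nA2), etc.

# PL = dip_size(low_power); PH = dip_size(high_power)
```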
The consonant interval is taken to be the dip interval; when a low-band and a high-band power dip overlap, the interval where the two dip intervals meet is taken. In FIG. 3 the consonant interval is the interval C from nA1 to nB2, and the dip sizes of this consonant interval are PL and PH.
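The wording leaves some ambiguity between the overlap and the combined span of the two dips, but the FIG. 3 example (C running from nA1 to nB2) matches the combined span, which this sketch implements; the function name and tuple representation are assumptions.

```python
def consonant_interval(low_dip: tuple[int, int], high_dip: tuple[int, int]):
    """Merge a low-band dip (nA1, nA2) and a high-band dip (nB1, nB2) that
    overlap in time into one consonant interval, as with interval C in FIG. 3."""
    (a1, a2), (b1, b2) = low_dip, high_dip
    if a1 <= b2 and b1 <= a2:               # the dip intervals overlap
        return min(a1, b1), max(a2, b2)     # e.g. (nA1, nB2) in FIG. 3
    return None                             # disjoint: treat each dip separately

# consonant_interval((20, 28), (24, 31))  ->  (20, 31)
```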
For example, examining the distributions of PL and PH for the phonemes |k|, |d|, |n|, |s| gives FIGS. 4 to 7, in which PL is plotted on the horizontal axis, PH on the vertical axis, and the numerals indicate the number of phoneme samples. As is clear from the figures, phonemes exhibiting plosiveness, such as |k| and |d|, have large PL and PH; in particular the voiceless plosive |k| has a large PL, and the voiced plosive |d| has a large PH. The phonemes |n| and |s|, which exhibit no plosiveness, have small PL and PH but separate according to whether they are voiced or voiceless, as in FIGS. 6 and 7.

As described above, using low-band and high-band power together makes it possible to discriminate voiced from voiceless consonants, and using the dip size makes it possible to detect plosiveness.
From these properties, a two-dimensional PL-PH discriminant diagram can be constructed that broadly classifies consonants into the four groups: voiceless plosives {|p|, |t|, |k|, |c|}, voiced plosives {|b|, |d|}, nasals {|m|, |n|}, and voiceless fricatives {|s|, |h|}.

FIG. 8 shows a discriminant diagram created from a large amount of data. In the figure, region 14 is regarded as a consonant adjoined to a vowel and is not segmented as a consonant interval; region 15 corresponds to the nasals {|m|, |n|}, region 16 to the voiced plosives {|b|, |d|}, region 17 to the voiceless fricatives {|s|, |h|}, and region 18 to the voiceless plosives {|p|, |t|, |k|, |c|}.
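The actual region boundaries of FIG. 8 were drawn from training data and are not reproduced in the text, so the following is only a toy stand-in: the thresholds are invented, and the comparisons merely mimic the qualitative layout described above (small dips near the origin are not segmented; large dips indicate plosives; the dominant band indicates voicing).

```python
def classify_dip(pl: float, ph: float, eps: float = 0.5, big: float = 2.0) -> str:
    """Toy stand-in for the FIG. 8 discriminant diagram (thresholds invented)."""
    if pl < eps and ph < eps:
        # region 14: adjunct to a vowel, not segmented as a consonant
        return "not a consonant (region 14)"
    if pl >= big or ph >= big:
        # a large dip in either band indicates plosiveness
        return ("voiceless plosive {p,t,k,c} (region 18)" if pl >= ph
                else "voiced plosive {b,d} (region 16)")
    # smaller dips: non-plosive; the dominant band indicates voicing
    return ("voiceless fricative {s,h} (region 17)" if pl >= ph
            else "nasal {m,n} (region 15)")
```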
For unknown input speech, the dip sizes of the low-band and high-band power are computed and applied to the discriminant diagram of FIG. 8, so that the segmentation of word-medial consonants and their broad classification are performed simultaneously and simply.

Effects of the Invention

Using this embodiment, data (about 2,100 words) from ten male and female speakers, each uttering a 212-word set, were evaluated. The results are shown in Table 1. The number of dropped word-medial consonants is somewhat higher for the nasals {|m|, |n|}, but overall few consonants are dropped and the results are better than those of the conventional method. The recognition rate for the broad consonant classes is also high, at about 85% to 90%.

As described above, the present invention uses the dip sizes of the low-band and high-band power as parameters and fits them to a two-dimensional discriminant diagram, a very simple method that nevertheless determines the segmentation of consonant intervals and the broad classification of consonants with good accuracy.

[Table 1]

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a word recognition system; FIG. 2 is an explanatory diagram of the conventional method of segmenting word-medial consonants; FIG. 3 is an explanatory diagram of how dips in the low-band and high-band power are captured in the present invention; FIGS. 4 to 7 are diagrams showing the distributions of the low-band and high-band power dip sizes for each phoneme; and FIG. 8 is a discriminant diagram in an embodiment of the present invention.

1 ... acoustic analysis unit, 2 ... feature extraction unit, 3 ... standard pattern registration unit, 4 ... segmentation unit, 5 ... phoneme discrimination unit, 6 ... word recognition unit, 7 ... word dictionary, 8 ... whole-band power, 9 ... difference value of the whole-band power, 10 ... low-band power, 11 ... high-band power, 12 ... difference value of the low-band power, 13 ... difference value of the high-band power, 14 ... adjunct region, 15 ... nasal region, 16 ... voiced plosive region, 17 ... voiceless fricative region, 18 ... voiceless plosive region.

Claims (1)

[Claims]

1. A speech recognition method comprising a procedure for phoneme recognition, characterized in that the low-band power and the high-band power of the speech spectrum are determined, attention is paid to the sizes of the dips (concave portions) produced by the temporal variation of each, and these are applied to a two-dimensional discriminant diagram, whereby the segmentation of word-medial consonants and the broad classification of the consonants are performed simultaneously.
JP1177483A 1983-01-27 1983-01-27 Voice recognition Granted JPS59138000A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP1177483A JPS59138000A (en) 1983-01-27 1983-01-27 Voice recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP1177483A JPS59138000A (en) 1983-01-27 1983-01-27 Voice recognition

Publications (2)

Publication Number Publication Date
JPS59138000A JPS59138000A (en) 1984-08-08
JPS6363920B2 (en) 1988-12-08

Family

ID=11787308

Family Applications (1)

Application Number Title Priority Date Filing Date
JP1177483A Granted JPS59138000A (en) 1983-01-27 1983-01-27 Voice recognition

Country Status (1)

Country Link
JP (1) JPS59138000A (en)

Also Published As

Publication number Publication date
JPS59138000A (en) 1984-08-08

Similar Documents

Publication Publication Date Title
JPS6336676B2 (en)
JPH02195400A (en) Speech recognition device
JPS5972496A (en) Single sound identifier
US4817159A (en) Method and apparatus for speech recognition
Shahzadi et al. Recognition of emotion in speech using spectral patterns
Eray et al. An application of speech recognition with support vector machines
Kamble et al. Emotion recognition for instantaneous Marathi spoken words
Gaudani et al. Comparative study of robust feature extraction techniques for ASR for limited resource Hindi language
MY An improved feature extraction method for Malay vowel recognition based on spectrum delta
JPS6363920B2 (en)
Majidnezhad A HTK-based method for detecting vocal fold pathology
Bhattachajee et al. An experimental analysis of speech features for tone speech recognition
Mengistu et al. Text independent amharic language dialect recognition using neuro-fuzzy gaussian membership function
JPH0316040B2 (en)
JP2744622B2 (en) Plosive consonant identification method
Thubthong et al. A method for isolated Thai tone recognition using a combination of neural networks
JPH026079B2 (en)
JPS6069694A (en) Segmentation of head consonant
JPH0114600B2 (en)
Zahorian et al. Dynamic spectral shape features for speaker-independent automatic recognition of stop consonants
JPS63220297A (en) Segmentation sorting for consonant
JPH01260499A (en) Consonant recognizing method
Zaki et al. Effectiveness of multiscale fractal dimension for improvement of frame classification rate
JPH0120440B2 (en)
JPS5958498A (en) Voice recognition equipment