JPH0114600B2

Info

Publication number: JPH0114600B2
Application number: JP57171631A
Authority: JP
Inventors: Katsuyuki Futayada; Masakatsu Hoshimi
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1982-09-29
Filing date: 1982-09-29
Publication date: 1989-03-13
Also published as: JPS5958495A

Description

DETAILED DESCRIPTION OF THE INVENTION

Field of Industrial Application

The present invention relates to a speech segmentation method for use in speech recognition.

Conventional Configuration and Its Problems

Most automatic speech recognition systems studied or published to date operate on the pattern matching principle. In this method, a standard pattern is stored in advance for every word to be recognized; an unknown input pattern is compared against each standard pattern to compute a degree of match (hereinafter, "similarity"), and the input is judged to be the word whose standard pattern gives the highest similarity. Because a standard pattern must be prepared for every word to be recognized, new standard patterns have to be entered and stored whenever the speaker changes. When several hundred words or more must be recognized, as with the names of cities throughout Japan, uttering and registering every word takes an enormous amount of time and effort, and the memory needed to hold the registered patterns is also expected to be very large. A further drawback is that the time required to match the input pattern against the standard patterns grows with the number of words.
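The pattern matching principle described above can be illustrated with a minimal sketch. The similarity measure (negative Euclidean distance), the three-dimensional patterns, and the word list are all assumptions made for illustration; the source does not specify any of them.

```python
import math

def similarity(a, b):
    """Negative Euclidean distance: larger means more alike (an assumed measure)."""
    return -math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def recognize(input_pattern, templates):
    """Return the word whose stored standard pattern is most similar to the input."""
    return max(templates, key=lambda word: similarity(input_pattern, templates[word]))

# Hypothetical standard patterns, one per word to be recognized.
templates = {
    "tokyo": [1.0, 0.2, 0.1],
    "osaka": [0.1, 0.9, 0.3],
    "kyoto": [0.2, 0.1, 1.0],
}
best = recognize([0.9, 0.3, 0.2], templates)
```

Every word needs its own entry in `templates`, and `recognize` compares the input against all of them, which is the source of the registration, memory, and matching-time costs noted above.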

By contrast, a method that divides the input speech into phoneme units, recognizes it as a combination of phonemes (hereinafter, "phoneme recognition"), and computes the similarity with a word dictionary written in phoneme units has several advantages: the word dictionary needs far less memory, the time needed for pattern matching is shorter, and the dictionary contents are easy to change. An example of this method is described in Miwa et al., "Word speech recognition system using the outline form of the speech spectrum and its dynamic characteristics," Journal of the Acoustical Society of Japan, 34 (1978).

Fig. 1 shows a block diagram of a word recognition system based on this method. First, speech from many speakers is analyzed in advance by the acoustic analysis unit 1, using a filter bank, at every 10 ms analysis interval, and the feature extraction unit 2 derives feature parameters from the resulting spectral information. From these feature parameters, a standard pattern is created for each phoneme or phoneme group, typified by vowels such as /a/ and /o/ and consonants such as /m/ and /b/, and is registered in the standard pattern registration unit 5. Next, input speech from an unspecified speaker is analyzed in the same way by the acoustic analysis unit 1 at each analysis interval, and the feature extraction unit 2 derives its feature parameters. Using these feature parameters and the standard patterns in the standard pattern registration unit 5, the segmentation unit 3 separates vowels from consonants (hereinafter, "segmentation"). Based on this result, the phoneme discrimination unit 4 compares each segment with the standard patterns in the standard pattern registration unit 5 and assigns to that segment the phoneme whose standard pattern has the highest similarity. Finally, the resulting time series of phonemes (hereinafter, "phoneme sequence") is sent to the word recognition unit 6, which outputs as the recognition result the word corresponding to the entry in the word dictionary 7, likewise expressed as phoneme sequences, that has the highest similarity.

As the overall operation above shows, when the segmentation unit 3 makes a segmentation error, either a phoneme that should be present is missed (phoneme deletion) or a spurious phoneme is inserted where none actually exists (phoneme addition). When such errors occur, the phoneme sequence of a word may, through the deletion or addition, come to resemble that of a completely unrelated word, which raises the risk of misrecognition.
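The effect of a phoneme deletion on dictionary lookup can be illustrated with a small sketch. It assumes that similarity is measured by Levenshtein (edit) distance and that each romanized letter stands for one phoneme; both are simplifications for illustration, and the source does not state which similarity measure the word recognition unit uses. The two-word dictionary is hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

# Hypothetical word dictionary written as phoneme sequences.
dictionary = {
    "okayama": list("okayama"),
    "oyama": list("oyama"),
}
correct = list("okayama")    # segmentation found every phoneme
dropped_k = list("oayama")   # the /k/ interval was missed (phoneme deletion)

d_correct = {w: edit_distance(correct, p) for w, p in dictionary.items()}
d_dropped = {w: edit_distance(dropped_k, p) for w, p in dictionary.items()}
```

With the complete sequence the intended word is an exact match, but after a single deleted phoneme the two dictionary entries are equally close, so the recognizer can no longer tell them apart.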

Segmentation is thus the most important step in word recognition based on phoneme recognition, and the performance of the word recognition system depends heavily on its accuracy. Conventionally, the parameter used for segmentation has been the temporal movement of the power of the full-band speech spectrum, with segmentation based on the presence of power dips as shown in Fig. 2. That is, exploiting the fact that vowel power is greater than consonant power, a portion where the dip magnitude $D$ exceeds a threshold $\theta_D$ ($D > \theta_D$) was taken to be a consonant interval. This method has the following two problems.

(1) For some phonemes the dip is not clearly present in the full-band information, so accuracy is poor (especially for /r/, /η/, /h/, /m/, and /n/).
(2) The dip magnitude $D$ is expressed as the difference from the power of the vowels on either side. When the power movement within the vowel intervals is not simple, it is therefore difficult to obtain the dip magnitude directly.
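The conventional full-band threshold test ($D > \theta_D$) can be sketched as follows. How $D$ is measured against the neighboring vowels is not spelled out in the source; this sketch assumes it is the depth of a local minimum below the lower of the highest power values on either side. The power values and the threshold are hypothetical.

```python
def dip_magnitude(power, valley):
    """Depth of the dip at index `valley` below the lower of the two
    surrounding peaks (an assumed reading of "difference from the vowels")."""
    left_peak = max(power[: valley + 1])
    right_peak = max(power[valley:])
    return min(left_peak, right_peak) - power[valley]

def consonant_valleys(power, theta_d):
    """Indices of local minima whose dip magnitude D exceeds the threshold."""
    found = []
    for i in range(1, len(power) - 1):
        is_local_min = power[i] < power[i - 1] and power[i] <= power[i + 1]
        if is_local_min and dip_magnitude(power, i) > theta_d:
            found.append(i)
    return found

# Hypothetical full-band log power for a vowel, consonant, vowel utterance.
power = [30, 40, 42, 41, 30, 25, 31, 41, 43, 40, 32]
valleys = consonant_valleys(power, theta_d=10)
```

Note that `dip_magnitude` has to look at the whole stretch of power on both sides of the valley, which is the difficulty raised in problem (2).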

Object of the Invention

The present invention solves these problems, and its object is to perform segmentation within words with high accuracy.

Configuration of the Invention

In Japanese, words and sentences are normally built from alternating vowels and consonants, and a consonant is never followed directly by another consonant, apart from the moraic nasal (hatsuon). Accordingly, when recognizing Japanese speech, being able to separate vowels from consonants accurately contributes greatly to the recognition rate. The present invention uses both the low-band power and the high-band power of the speech spectrum as the information for segmentation, and uses the power dips produced by the temporal movement of each to detect consonant intervals accurately, thereby improving the accuracy of segmentation within words.

Description of the Embodiment

Fig. 3 shows spectral patterns of representative phonemes: (a) the five vowels, (b) nasals and the voicing (buzz) portion of voiced plosives, and (c) voiceless consonants. As these figures show, power in (a) is concentrated comparatively in the mid band, in (b) in the low band, and in (c) in the high band. There are also phonemes, such as the liquid /r/ and the velar nasal /η/, whose spectra are strongly affected by the neighboring phonemes. Taking these points into account, the magnitude of the high-band power is effective for distinguishing the vowel group (a) from the voiced consonant group (b), and the magnitude of the low-band power is effective for distinguishing the vowel group (a) from the voiceless consonant group (c).

Based on these findings, this embodiment uses the following segmentation parameters: for the low band, the low-band power obtained by smoothing the output of a 250 Hz to 600 Hz band-pass filter; for the high band, the high-band power obtained by smoothing the output of a 1500 Hz to 4000 Hz band-pass filter. Using low-band and high-band power together in this way yields larger power dips than the conventional method, which uses full-band power only, particularly for /m/, /n/, /η/, /r/, /h/, and /z/, and detection accuracy improved.
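As a rough software illustration of the two band powers, the sketch below sums DFT power inside each band for one 10 ms frame. This is a swapped-in technique, not the embodiment's method: the source uses band-pass filters followed by smoothing. The 8 kHz sampling rate, the function name `band_log_power`, and the test tones are assumptions.

```python
import cmath
import math

def band_log_power(frame, sample_rate, f_lo, f_hi):
    """Log power (dB) of `frame` between f_lo and f_hi Hz, via a direct DFT."""
    n = len(frame)
    power = 0.0
    for k in range(n // 2 + 1):
        freq = k * sample_rate / n
        if f_lo <= freq <= f_hi:
            coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                        for t, x in enumerate(frame))
            power += abs(coeff) ** 2
    return 10.0 * math.log10(power + 1e-12)  # small offset avoids log(0)

sample_rate = 8000   # assumed sampling rate in Hz
n = 80               # one 10 ms frame at 8 kHz
# Hypothetical test frames: a 400 Hz tone and a 2000 Hz tone.
low_tone = [math.sin(2 * math.pi * 400 * t / sample_rate) for t in range(n)]
high_tone = [math.sin(2 * math.pi * 2000 * t / sample_rate) for t in range(n)]

p_low_of_low = band_log_power(low_tone, sample_rate, 250, 600)
p_high_of_low = band_log_power(low_tone, sample_rate, 1500, 4000)
p_low_of_high = band_log_power(high_tone, sample_rate, 250, 600)
p_high_of_high = band_log_power(high_tone, sample_rate, 1500, 4000)
```

A frame dominated by low-frequency energy gives a high low-band value and a low high-band value, and the reverse holds for a high-frequency frame, which is the property the two-band parameters rely on.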

Computing the absolute magnitude of a power dip requires information over a wide range before and after the dip, so the conventional procedure is complicated and prone to detection errors. This embodiment instead adopts a simple and accurate dip detection method that takes the constraints of the speech production mechanism into account.

Speech is produced by the combined movement of the muscles of the lungs and trachea, which control the breath; the vocal cords, which generate voiced sound; and the articulatory organs, which determine the phoneme. The movement of speech power is therefore constrained by the movement of the muscles of the speech organs. As a result, the rate at which speech power changes over time, although fast for sounds such as plosives and slow for sounds such as semivowels, stays within a certain range. In practice, therefore, the dip magnitude can be replaced by the amount of change in power per unit time without any problem. A dip detection method based on this idea is described concretely below.

Fig. 4 illustrates the method. The power information is log-transformed and computed for every frame (one frame is 10 ms). Let $P(i)$ denote the log power in the $i$-th frame ($i = 1, \dots, i_{\max}$, where $i_{\max}$ is the last frame of the speech interval). Fig. 4(a) shows an example of the temporal movement of $P(i)$ for a vowel, consonant, vowel sequence. Besides the large dip in the consonant interval, small dips caused by fine fluctuations in power are superimposed. As noted earlier, these fine dips do not arise from the muscle movements needed for speech production, so they are removed by smoothing. Fig. 4(b) shows the result. The smoothed power $\bar{P}(i)$ is given by

$$\bar{P}(i) = \frac{P(i-1) + 2P(i) + P(i+1)}{4}$$

Next, the difference $P_D$ of the smoothed power is computed by the following equation to obtain the temporal change of the power information (Fig. 4(c)).
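The three-point smoothing step can be written as a short function. The source does not say how the first and last frames are handled; leaving them unchanged is an assumption, and the input values are hypothetical.

```python
def smooth(p):
    """Three-point (1, 2, 1)/4 smoothing of per-frame log power values.
    Endpoints are left unchanged (an assumption; the source is silent on this)."""
    if len(p) < 3:
        return list(p)
    out = [p[0]]
    for i in range(1, len(p) - 1):
        out.append((p[i - 1] + 2 * p[i] + p[i + 1]) / 4)
    out.append(p[-1])
    return out

# Hypothetical log power with a one-frame fluctuation in the middle.
raw = [40, 42, 38, 42, 40]
smoothed = smooth(raw)
```

The one-frame fluctuation of 4 in `raw` shrinks to 0.5 after smoothing, which is the intended suppression of small dips.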

$$P_D(i) = \bar{P}(i+1) - \bar{P}(i-1)$$

Here $\bar{P}$ is the smoothed power, so $P_D$ represents the temporal movement of the change over each 20 ms. $P_D$ takes its minimum at the inflection point on the falling edge of the power dip and its maximum at the inflection point on the rising edge. For the reason given above, the dip magnitude is replaced by $P$, the difference between the maximum and the minimum of $P_D$. The dip duration is taken to be $L$, the time from the minimum of $P_D$ to its maximum.
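The difference and the dip measurement described above can be sketched as follows. Pairing each negative local minimum of $P_D$ with the next positive local maximum, and setting the two end frames of $P_D$ to zero, are assumptions about details the source leaves open. The smoothed power values are hypothetical.

```python
def difference(p_bar):
    """P_D(i) = smoothed(i+1) - smoothed(i-1); end frames are set to 0 (assumed)."""
    n = len(p_bar)
    return [0.0] + [p_bar[i + 1] - p_bar[i - 1] for i in range(1, n - 1)] + [0.0]

def is_peak(p_d, j):
    """True when p_d[j] is a positive local maximum."""
    return p_d[j] > 0 and p_d[j] >= p_d[j - 1] and p_d[j] > p_d[j + 1]

def find_dips(p_d):
    """Pair each negative local minimum of P_D with the next positive local
    maximum. Returns (min_frame, max_frame, magnitude, duration_in_frames)."""
    dips = []
    i, n = 1, len(p_d)
    while i < n - 1:
        if p_d[i] < 0 and p_d[i] <= p_d[i - 1] and p_d[i] < p_d[i + 1]:
            j = i + 1
            while j < n - 1 and not is_peak(p_d, j):
                j += 1
            if j < n - 1:
                dips.append((i, j, p_d[j] - p_d[i], j - i))
                i = j
        i += 1
    return dips

# Hypothetical smoothed log power: vowel, consonant dip, vowel.
p_bar = [40, 40, 38, 32, 26, 24, 26, 32, 38, 40, 40]
p_d = difference(p_bar)
dips = find_dips(p_d)
```

Here the minimum of $P_D$ falls at frame 3 and the maximum at frame 7, so the dip has magnitude 24 and duration 4 frames (40 ms). Only values near the dip are needed, in contrast to measuring the dip depth against the surrounding vowels.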

Using both the low-band information ($P_L$) and the high-band information ($P_H$) described earlier as power information, and applying the method of Fig. 4 to each, yields dips from the low-band information and dips from the high-band information separately. Of these dips, only those satisfying $L \le L_{\max}$ are kept as consonant candidates. This condition is imposed because consonant intervals, apart from /s/ and the moraic nasal, are generally 150 ms or shorter ($L_{\max} = 15$ frames). /s/ and the moraic nasal can be detected by other methods.
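The duration condition amounts to a simple filter over the detected dips. In the sketch below the dictionary layout and the example dips are hypothetical; only the limit of 15 frames comes from the text.

```python
L_MAX = 15  # frames; 150 ms at 10 ms per frame, as stated in the text

def consonant_candidates(dips, l_max=L_MAX):
    """Keep only dips whose duration in frames does not exceed l_max."""
    return [d for d in dips if d["duration"] <= l_max]

# Hypothetical dips detected in one band.
dips = [
    {"start": 12, "duration": 6, "magnitude": 9},
    {"start": 40, "duration": 22, "magnitude": 11},  # too long to be a candidate
    {"start": 75, "duration": 15, "magnitude": 7},
]
kept = consonant_candidates(dips)
```

The same filter would be applied separately to the low-band dips and to the high-band dips.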

Some of the speech intervals obtained as consonant candidates are found from the low-band information ($P_L$) only, and some from the high-band information ($P_H$) only. These candidate intervals are also a mixture of two kinds: true consonant intervals and spurious ones (consonant additions). The method for separating true consonant intervals from additions among the candidates is described next.

Let $p_l$ and $p_h$ be the magnitudes of the dip change obtained from the low-band information $P_L$ and the high-band information $P_H$, respectively. Statistically, dips appear more prominently in true consonant intervals than in additions, so either or both of $p_l$ and $p_h$ take large values. For example, for the phoneme /b/ both $p_l$ and $p_h$ are large, for /h/ only $p_l$ is large, and for /m/ only $p_h$ is large. For dips caused by consonant addition, by contrast, both $p_l$ and $p_h$ are comparatively small. Taking these characteristics into account, a discriminant diagram in the $p_l$-$p_h$ plane is used to distinguish consonants from additions accurately and efficiently.

Fig. 5 is an example of the discriminant diagram. In the figure, the region inside the hatched boundary is the addition region and the region outside is the consonant region. Here $p_l$ and $p_h$ have been normalized and converted to integers. The discriminant diagram was determined from a large amount of data segmented visually in advance, taking into account both the probability of correct recognition as a consonant and the probability of addition, so that the result is optimal.

Next, the method of determining consonant intervals with the discriminant diagram is explained using the examples in Fig. 6. Fig. 6(a) is a case in which only a $p_l$ dip appears, with magnitude $p_l = 10$. Applied to the discriminant diagram of Fig. 5, the point (10, 0) lies in the addition region, so it is not a consonant interval. In (b), $p_l = 7$ and $p_h = 8$, which lies in the consonant region. In this case the union (logical OR) of the $p_l$ interval and the $p_h$ interval is taken as the consonant interval (for some phonemes the union is not used). (c) is an example of an interval in which only $p_h$ is present; the point (0, 12) lies in the consonant region of the discriminant diagram, so the $p_h$ interval is used as the consonant interval as is. In (d), dips are present in both $p_l$ and $p_h$, but the point lies in the addition region of the diagram, so it is treated as an addition.
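The decision procedure for the Fig. 6 cases can be sketched as below. The actual boundary of Fig. 5 is not reproduced in the text, so `is_consonant` is a hypothetical stand-in whose thresholds were chosen only so that the cases come out as described; the frame indices, and the magnitudes for case (d), are likewise invented.

```python
def is_consonant(p_l, p_h):
    """Hypothetical stand-in for the Fig. 5 discriminant diagram.
    The boundary values are invented, not taken from the source."""
    in_addition_region = p_l < 12 and p_h < 6
    return not in_addition_region

def consonant_interval(low, high):
    """low/high are (start, end, magnitude) tuples, or None when that band
    has no dip. Returns the union of the intervals, or None for an addition."""
    p_l = low[2] if low else 0
    p_h = high[2] if high else 0
    if not is_consonant(p_l, p_h):
        return None
    spans = [d for d in (low, high) if d]
    return (min(d[0] for d in spans), max(d[1] for d in spans))

# Magnitudes for (a) to (c) follow the text; everything else is made up.
case_a = consonant_interval((20, 26, 10), None)         # low-band dip only
case_b = consonant_interval((20, 26, 7), (22, 29, 8))   # dips in both bands
case_c = consonant_interval(None, (30, 36, 12))         # high-band dip only
case_d = consonant_interval((40, 45, 4), (41, 46, 3))   # both, but small
```

In a fuller implementation the rule in `is_consonant` would be replaced by a lookup into a table derived from labeled data, as the description of Fig. 5 suggests.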

This embodiment was evaluated using 212 words uttered by each of 10 male and female speakers. The word set is an evaluation set in which the consonant intervals were labeled in advance by visual inspection. The results of applying this embodiment were compared with the labels, and the method was evaluated by the proportion of cases in which segmentation was performed correctly. The results (correct rates) are shown below.

/r/: 94.7%, /h/: 94.8%, /z/: 98.7%, /b/: 99.5%, /d/: 99.7%, /η/: 91.3%, /m/: 85.7%, /n/: 85.7%

Meanwhile, the probability that a consonant is erroneously added in a vowel interval (the addition rate) is 6.9%.

Compared with the conventional method (which uses full-band spectral power and detects dips with a threshold), these results are several percent better for /r/, /h/, and /η/ and about 1% better for /b/ and /d/. Moreover, whereas dips for /m/ and /n/ cannot be detected with full-band power, they can be detected in this embodiment. The addition rate is about the same.

This embodiment thus makes it possible to segment word-medial consonants that have conventionally been considered difficult to detect (especially /r/, /η/, and /h/) with high accuracy.

Effects of the Invention

As described above, according to the present invention, segmentation accuracy is improved by using both low-band power information and high-band power information as parameters.

In addition, by using the temporal movement and the duration of the power dip, the presence of a dip can be detected simply.

Furthermore, by using the magnitudes of the power dip movements in both the low band and the high band and applying them to the discriminant diagram, the presence of consonants can be detected accurately.

[Brief explanation of drawings]

Fig. 1 is a block diagram of a conventional speech recognition system; Fig. 2 illustrates the conventional method of detecting consonants using power information; Figs. 3(a) to 3(c) show examples of vowel and consonant spectra; Figs. 4(a) to 4(c) illustrate the method of detecting power dips according to the present invention; Fig. 5 is a discriminant diagram for distinguishing consonants from additions by the magnitudes of the low-band and high-band power dips; and Fig. 6 shows an example of the method of determining consonant intervals.

Claims

1. A speech segmentation method characterized in that, as the information used for segmentation in speech recognition, the low-band power and the high-band power of the speech spectrum are used together, consonant candidate intervals are detected using the power dips produced by the temporal movement of each power, and consonant intervals are detected from among the consonant candidate intervals.

2. The speech segmentation method according to claim 1, characterized in that the temporal rate of change of each of the low-band power and the high-band power is obtained; consonant candidates are detected from the local maximum, the local minimum, and the time length of the temporal rate of change; for each consonant candidate, the difference between the local maximum and the local minimum is regarded as the magnitude of the power dip; and consonant intervals are detected from the consonant candidate intervals by applying the power dip magnitudes of the low-band power and the high-band power to a two-dimensional discriminant diagram.