JPS58211793A

JPS58211793A - Detection of voice section

Info

Publication number: JPS58211793A
Application number: JP57095434A
Authority: JP
Inventors: 森井　秀司; 二矢田　勝行; 藤井　諭; 郁夫井上
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1982-06-03
Filing date: 1982-06-03
Publication date: 1983-12-09
Also published as: JPH034918B2

Abstract

(57) [Abstract] This publication contains application data from before electronic filing, so no abstract data is recorded.

Description

DETAILED DESCRIPTION OF THE INVENTION

The present invention relates to a speech section detection method that automatically detects speech sections and non-speech sections in an acoustic signal containing speech uttered by a human.

Speech section detection is indispensable in speech recognition systems, in analysis systems for speech synthesis, in speech information compression, and in similar applications.

FIG. 1 shows a block diagram of an automatic speech recognition system. Reference numeral 1 denotes an acoustic processing unit, 2 a speech section detection unit, and 3 a recognition unit. As shown in FIG. 1, the speech section detection unit 2 sits in front of the recognition unit 3. Even if the downstream recognition unit 3 performs very well, correct recognition results are difficult to obtain unless the speech section is detected correctly, so the performance of the speech section detection unit strongly affects the system as a whole.

Many conventional speech section detection methods exploit the difference in signal energy between the speech signal and the non-speech signal (that is, environmental noise), setting a suitable threshold on the signal energy to detect speech sections. To improve detection of sounds such as unvoiced consonants, whose signal energy is low and differs little from that of environmental noise, some methods additionally threshold the number of zero crossings per suitable time interval of the signal.

An example of this is described in "Speech Recognition" (音声認識) by Yasunaga Niimi.

With threshold-based methods such as these, the threshold to be set depends on the environment. The threshold is usually determined experimentally, but because a threshold for detecting speech sections separates the environmental noise signal of the usage environment from the speech signal, it must be reset whenever the usage environment changes; the method therefore lacks adaptability to changes in the usage environment. Furthermore, when the environmental noise energy is high and the noise resembles white noise, there is almost no difference in energy or zero-crossing count between the environmental noise and unvoiced speech sounds, so conventional methods can no longer detect speech sections correctly. Conventional speech section detection thus has two drawbacks: the environments in which it can be used are limited, and it does not adapt to changes in the environment.

The object of the present invention is to provide a speech section detection method that remedies the drawbacks of the conventional methods; it is a method with an environment learning function.

Through environment learning, the average energy level of the environmental noise and the mean vector of the LPC cepstrum coefficients, which represents the average characteristics of its spectrum, are obtained in advance.

Next, the actually input speech signal, which contains noise, is normalized in signal level using the average environmental-noise energy obtained by learning. The Euclidean distance between the LPC cepstrum coefficients of the input signal and those of the environmental noise is also computed.

The present invention detects speech sections by thresholding these two parameters: the normalized energy and the Euclidean distance.

Compared with conventional methods, this method is markedly more robust to changes in the usage environment. The speech section detection method according to the present invention is described in detail below.

FIG. 2 is a rough functional block diagram of the speech section detection unit according to this method. As shown in FIG. 2, speech section detection by this method comprises an acoustic analysis unit 4, which computes the feature parameters used; an environment learning unit 5, which learns the characteristics of the usage environment; and a speech section detection unit 6, which actually detects the speech sections.

In the speech section detection method of the present invention, the standard environment is first learned in advance.

This process is much the same as the threshold-setting work in conventional methods: it sets three constants, namely the average energy E_S of the standard environment and the two signal-energy thresholds T_E1 and T_E2 that separate speech from non-speech. The constants obtained are stored in the speech section detection unit 6. In conventional methods this process has to be repeated every time the usage environment changes substantially; in the present method it need not be repeated once T_E1, T_E2, and E_S have been obtained.

A standard environment with low environmental noise energy and a good signal-to-noise ratio is set up, and for its environmental noise signal the energy E and the logarithmic energy E_L are computed by equations (1) and (2) for every frame of some suitable duration (called the frame length).

$$E_L = 10 \log_{10} E \tag{2}$$

The standard-environment average energy E_S is obtained from the average of E over a fixed period of time. From the mean and variance of E_L, an energy threshold T_E1 is set such that a frame whose E_L is at or below this value is a non-speech frame. In addition, from the mean and variance of the logarithmic energy E_L of unvoiced consonants in speech uttered by many speakers in the standard environment, an energy threshold T_E2 is set such that a frame whose E_L is at or above this value is a speech frame.
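The standard-environment learning step can be sketched in Python as follows. This is only an illustration under stated assumptions: equation (1) is not reproduced in the text, so frame energy is assumed to be the mean of the squared samples; the text says the thresholds are set from a mean and a variance without giving the formula, so placing them `k` standard deviations from the mean is a hypothetical choice; and all function and parameter names are invented for the example.

```python
import numpy as np

def frame_energies(signal, frame_len):
    """Split a 1-D signal into non-overlapping frames and return each frame's energy.

    Assumption: energy is the mean of the squared samples in a frame
    (equation (1) is not reproduced in the text).
    """
    signal = np.asarray(signal, dtype=float)
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    return np.mean(frames ** 2, axis=1)

def log_energy(energy, floor=1e-12):
    """Equation (2): E_L = 10 * log10(E). The floor avoids log(0) on silent frames."""
    return 10.0 * np.log10(np.maximum(energy, floor))

def learn_standard_environment(noise, unvoiced, frame_len=256, k=2.0):
    """Return (E_S, T_E1, T_E2) from standard-environment recordings.

    noise:    noise-only recording made in the standard environment
    unvoiced: unvoiced-consonant samples recorded in the standard environment
    Assumption: T_E1 = mean + k*std of the noise log energy and
    T_E2 = mean - k*std of the unvoiced-consonant log energy; k is hypothetical
    and the result should be checked to satisfy T_E1 < T_E2.
    """
    e_noise = frame_energies(noise, frame_len)
    el_noise = log_energy(e_noise)
    el_unvoiced = log_energy(frame_energies(unvoiced, frame_len))
    e_s = float(np.mean(e_noise))
    t_e1 = float(np.mean(el_noise) + k * np.std(el_noise))
    t_e2 = float(np.mean(el_unvoiced) - k * np.std(el_unvoiced))
    return e_s, t_e1, t_e2
```

Because the constants are learned once in the standard environment, they can be stored and reused; only the normalization coefficient and the spectral statistics need to be relearned when the usage environment changes.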

Next, the acoustic analysis unit is described.

An acoustic signal input from a microphone or the like and A/D converted by the acoustic processing unit 1 shown in FIG. 1 is sent to the acoustic analysis unit 4 of FIG. 2. The acoustic analysis unit 4 analyzes the incoming acoustic signal in frames of some suitable length and computes the parameters used in common by the downstream environment learning unit 5 and speech section detection unit 6. The parameters computed are the signal energy E given by equation (1) and the LPC cepstrum coefficient vector C, which represents the spectral characteristics of the signal.

The method for computing the LPC cepstrum coefficients C is not explained here; for details see J. D. Markel and A. H. Gray, Jr., Linear Prediction of Speech, Springer-Verlag (1976).
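Since the text leaves the computation to the cited reference, the following Python sketch shows one common way to obtain LPC cepstrum coefficients: autocorrelation analysis, the Levinson-Durbin recursion, and the standard LPC-to-cepstrum recursion. It is not taken from the patent. The analysis order, the number of cepstral coefficients, the omission of windowing and of the gain term c_0, and all function names are assumptions made for illustration.

```python
import numpy as np

def autocorrelation(frame, order):
    """Return autocorrelation lags r[0..order] of one frame."""
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    return np.array([np.dot(frame[: n - k], frame[k:]) for k in range(order + 1)])

def levinson_durbin(r, order):
    """Solve for A(z) = 1 + a_1 z^-1 + ... + a_p z^-p from autocorrelation lags."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:  # silent or degenerate frame: keep remaining coefficients at 0
            break
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err  # reflection coefficient
        prev = a.copy()
        a[1:i] = prev[1:i] + k * prev[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a

def lpc_to_cepstrum(a, n_ceps):
    """Cepstrum of 1/A(z): c_n = -a_n - (1/n) * sum_{m=1}^{n-1} m * c_m * a_{n-m}."""
    p = len(a) - 1
    c = np.zeros(n_ceps + 1)
    for n in range(1, n_ceps + 1):
        a_n = a[n] if n <= p else 0.0
        acc = 0.0
        for m in range(1, n):
            if n - m <= p:
                acc += m * c[m] * a[n - m]
        c[n] = -a_n - acc / n
    return c[1:]  # the gain-dependent term c_0 is left out

def lpc_cepstrum(frame, order=10, n_ceps=10):
    """LPC cepstrum coefficient vector C for one frame (order values are assumptions)."""
    r = autocorrelation(frame, order)
    return lpc_to_cepstrum(levinson_durbin(r, order), n_ceps)
```

For a first-order model A(z) = 1 - 0.5 z^-1 the recursion yields c_n = 0.5^n / n, which gives a quick sanity check of the implementation.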

The parameters computed by the acoustic analysis unit 4 are sent to the environment learning unit 5 when speech section detection by this method first starts operating, or when the usage environment has changed substantially and detection errors have become frequent. The input acoustic signal at this time contains no speech signal, only the environmental noise signal.

Using the per-frame parameters sent from the acoustic analysis unit 4, the environment learning unit 5 computes the normalization coefficient N_S, which normalizes the signal energy of the usage environment to the signal energy level of the standard environment; the mean vector C_S of the LPC cepstrum coefficients, which represents the average spectral characteristics of the usage-environment noise; and the distance threshold T_D. The distance threshold T_D is the threshold used to judge, from the Euclidean distance between a frame's LPC cepstrum coefficient vector C and the mean vector C_S, whether that frame is a speech frame or a non-speech frame.

The normalization coefficient N_S is computed as follows.

The mean value E_N of the environmental noise energy sent frame by frame is computed, and N_S is then computed by equation (3) from E_N and the standard-environment average energy E_S, which was stored in the speech section detection unit 6 beforehand during learning of the standard environment.

$$N_S = \frac{E_S}{E_N} \tag{3}$$

The mean vector C_S of the LPC cepstrum coefficients is obtained by averaging, element by element, the LPC cepstrum coefficient vectors C sent frame by frame. Then the Euclidean distance D between C_S and each LPC cepstrum coefficient vector C used to compute C_S is obtained for every frame, and from the mean and variance of D a threshold T_D is computed such that a frame whose Euclidean distance is at or below this value is a non-speech frame.
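The environment learning step described above might look like the following Python sketch. It assumes that equation (3) reads N_S = E_S / E_N (the printed equation is damaged, and this reading is the one consistent with normalizing the usage-environment noise to the standard level). The rule T_D = mean(D) + k * std(D) is also an assumption, since the text says only that T_D is derived from the mean and variance of D. The function name and the parameter `k` are hypothetical.

```python
import numpy as np

def learn_environment(noise_energies, noise_cepstra, e_s, k=2.0):
    """Return (N_S, C_S, T_D) from noise-only frames of the usage environment.

    noise_energies: shape (frames,), frame energies E
    noise_cepstra:  shape (frames, dims), LPC cepstrum vectors C
    e_s:            standard-environment average energy E_S
    Assumptions: N_S = E_S / E_N; T_D = mean(D) + k * std(D) with hypothetical k.
    """
    noise_energies = np.asarray(noise_energies, dtype=float)
    noise_cepstra = np.asarray(noise_cepstra, dtype=float)

    e_n = np.mean(noise_energies)          # average noise energy E_N
    n_s = e_s / e_n                        # normalization coefficient N_S

    c_s = np.mean(noise_cepstra, axis=0)   # element-wise mean vector C_S
    # Distance of every learning frame from the mean vector.
    d = np.linalg.norm(noise_cepstra - c_s, axis=1)
    t_d = float(np.mean(d) + k * np.std(d))
    return float(n_s), c_s, t_d
```

Note that the cepstrum vectors have to be kept until C_S is known, because the distances are measured from the finished mean; this is the role the FIFO buffer plays in the circuit of FIG. 3.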

The speech section detection unit 6 takes the signal energy E and the LPC cepstrum coefficient vector C sent from the acoustic analysis unit 4, together with the normalization coefficient N_S and the LPC cepstrum mean vector C_S obtained by the environment learning unit 5, computes the normalized signal logarithmic energy E_NL and the Euclidean distance D between C_S and C, and judges whether the frame is a speech signal or a non-speech signal.

The normalized signal logarithmic energy E_NL is obtained by equation (4), and the Euclidean distance D by equation (5).

$$E_{NL} = 10 \log_{10}(E \cdot N_S) \tag{4}$$

$$D = \left\{ (C - C_S)^{T} (C - C_S) \right\}^{1/2} \tag{5}$$

(T denotes the transpose.)

Whether a frame is a speech signal or a non-speech signal is judged as follows.

E_NL ≤ T_E1: non-speech
T_E1 < E_NL < T_E2 and D < T_D: non-speech
T_E1 < E_NL < T_E2 and D > T_D: speech
E_NL ≥ T_E2: speech

When the signal energy alone leaves it ambiguous whether a frame is speech or non-speech, this decision rule uses the spectral information of the signal to improve accuracy. Moreover, unlike the zero-crossing count of conventional methods, which carries only part of the information in the signal spectrum, the LPC cepstrum captures the characteristics of the whole spectrum, so performance degrades little when the noise spectrum changes.
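The four-way decision rule can be written as a small Python function. This is a sketch, not the patented circuit: it assumes the normalized log energy is 10 * log10(E * N_S), treats the boundary case D = T_D as non-speech (following the earlier statement that distances at or below T_D indicate non-speech), and uses an invented function name.

```python
import math

def classify_frame(energy, cepstrum, n_s, c_s, t_e1, t_e2, t_d):
    """Return True for a speech frame, False for a non-speech frame."""
    # Normalized log energy; the tiny floor guards against log(0).
    e_nl = 10.0 * math.log10(max(energy * n_s, 1e-12))

    if e_nl <= t_e1:      # clearly at or below the noise level
        return False
    if e_nl >= t_e2:      # clearly at or above the unvoiced-consonant level
        return True

    # Ambiguous energy band: fall back on the spectral distance from the noise mean.
    d = math.sqrt(sum((c - m) ** 2 for c, m in zip(cepstrum, c_s)))
    return d > t_d
```

The energy tests come first, so the distance is only computed for frames that the energy alone cannot settle, mirroring the way the circuit selects between the two comparison units.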

The per-frame decisions obtained in this way are smoothed to determine the final speech section.
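The text does not say how the smoothing is done. As one possible illustration only, the Python sketch below applies a sliding-window majority vote to the per-frame decisions and then extracts the runs of speech frames; the window size, the voting rule, and both function names are assumptions, not the patent's method.

```python
def smooth_decisions(flags, window=5):
    """Majority vote over a centered window of per-frame speech flags."""
    half = window // 2
    out = []
    for i in range(len(flags)):
        lo, hi = max(0, i - half), min(len(flags), i + half + 1)
        votes = sum(1 for f in flags[lo:hi] if f)
        out.append(votes * 2 > (hi - lo))   # strict majority counts as speech
    return out

def speech_sections(flags):
    """Return (start, end) frame-index pairs, end exclusive, for each run of speech frames."""
    sections, start = [], None
    for i, f in enumerate(flags):
        if f and start is None:
            start = i
        elif not f and start is not None:
            sections.append((start, i))
            start = None
    if start is not None:
        sections.append((start, len(flags)))
    return sections
```

With a window of three frames, a single non-speech frame inside a speech run is filled in and an isolated speech frame surrounded by noise is dropped, which is the kind of clean-up a final section decision needs.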

FIG. 3 is a functional block diagram of a speech section detection circuit according to this method. The acoustic signal input from a microphone or the like is A/D converted and sent, in frames of some suitable length, to the signal energy calculation unit 7 and the LPC cepstrum coefficient calculation unit 8. The signal energy calculation unit 7 computes the signal energy E, and the LPC cepstrum coefficient calculation unit 8 computes the LPC cepstrum coefficient vector C. Where the computed parameters go next depends on whether environment learning or speech section detection is being performed; this routing is controlled by the control unit 9. Broken lines in the figure indicate control lines.

In environment learning, the signal energy E is sent through the multiplexer 10 to the mean/variance calculation unit 11, where the average energy E_N is computed. E_N is then sent to the normalization coefficient determination unit 12, which determines the normalization coefficient N_S. The LPC cepstrum coefficient vector C is likewise sent through the multiplexer 10 to the mean/variance calculation unit 11 and is also stored in the FIFO buffer 13. The mean/variance calculation unit 11 computes the mean vector C_S and sends it to the LPC cepstrum coefficient mean vector memory 14. Once the data has been stored in the memory 14, the FIFO buffer 13 sends the LPC cepstrum coefficients C through the multiplexer 15 to the Euclidean distance calculation unit 16, where the Euclidean distance D is computed.

The computed Euclidean distance D is sent through the multiplexer 10 to the mean/variance calculation unit 11, where its mean and variance are computed. The mean and variance are sent to the threshold T_D determination unit 17, which determines the threshold T_D.

In speech section detection, on the other hand, the signal energy E is sent to the normalized logarithmic energy calculation unit 18, converted into the normalized logarithmic energy E_NL, and sent to the three-value comparison unit 19. The LPC cepstrum coefficient vector C is sent through the multiplexer 15 to the Euclidean distance calculation unit 16, where the Euclidean distance D is computed and passed to the two-value comparison unit 20. The three-value comparison unit 19 compares E_NL with the thresholds T_E1 and T_E2. If E_NL ≤ T_E1 or E_NL ≥ T_E2, the decision of the three-value comparison unit 19 is sent through the multiplexer 21 to the smoothing unit 22. Otherwise, the result of comparing the Euclidean distance D with the threshold T_D in the two-value comparison unit 20 is sent through the multiplexer 21 to the smoothing unit 22. The smoothing unit 22 smooths the decisions sent frame by frame, determines the speech section, and outputs it.

FIG. 4 shows the effect of the present invention, which normalizes signal energy through learning. FIG. 4A shows the logarithmic energy distribution of environmental noise in the standard environment (upper panel) and the logarithmic energy distribution of unvoiced consonants in speech uttered in that environment (lower panel). FIG. 4B shows the same two distributions, without normalization as in conventional methods, after the environment has changed so that the signal-to-noise ratio between the average speech energy and the environmental noise energy is about 20 dB. FIG. 4C shows the corresponding normalized logarithmic energy distributions of the present invention in the same environment as FIG. 4B. The broken lines in the figure are fits under a normal-distribution assumption.

These figures show that with unnormalized logarithmic energy, as in conventional methods, the logarithmic energy of the environmental noise approaches that of unvoiced consonants as the signal-to-noise ratio degrades. The thresholds set in the standard environment can then no longer separate the two, and they must be reset. Even after they are reset, accuracy drops because the two distributions overlap heavily. In the present invention, by contrast, the normalized logarithmic energy is distributed in much the same way as the logarithmic energy in the standard environment, so the thresholds need not be changed at all. The overlap between the two distributions is also smaller than in FIG. 4B, so the environmental noise signal and unvoiced consonants can be separated reliably, giving better results than the logarithmic energy commonly used in conventional methods.
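Why the thresholds can stay fixed can be seen from a short worked derivation. It assumes that the normalization coefficient is N_S = E_S / E_N and that the normalized logarithmic energy is E_NL = 10 log10(E N_S); both are readings of equations whose printed form is damaged, so the following is a sketch under those assumptions.

```latex
\begin{aligned}
E_{NL} &= 10\log_{10}(E \cdot N_S)
        = 10\log_{10}E + 10\log_{10}\frac{E_S}{E_N} \\
       &= E_L - \bigl(10\log_{10}E_N - 10\log_{10}E_S\bigr), \\[4pt]
E = E_N \;&\Rightarrow\; E_{NL} = 10\log_{10}E_S .
\end{aligned}
```

Normalization therefore shifts every frame by the same number of decibels, namely the amount by which the usage-environment noise exceeds the standard-environment noise. A frame at the average noise level always lands on the standard level 10 log10(E_S), whatever E_N is, so the noise distribution returns to where T_E1 was originally set.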

FIG. 5 shows, for the word "futa" (/huta/) uttered by a male speaker, the normalized logarithmic energy (A) and the Euclidean distance from the mean vector of the LPC cepstrum coefficients (B). With normalized logarithmic energy alone, the onset of /h/ and the end of /u/ are unclear, but using the Euclidean distance of the LPC cepstrum coefficients makes these ambiguous portions distinct and raises detection accuracy.

FIG. 6 compares the speech section detection method of the present invention with a conventional method that judges signal energy against a fixed threshold.

The figure was obtained from speech data of 200 words uttered by one male speaker and shows the identification rate defined by equation (6) as the signal-to-noise ratio of the speaking environment varies. The dash-dot line shows the conventional method, and the solid line the present invention.

(6) [equation illegible in the source]

With the conventional method, the identification rate falls sharply once the signal-to-noise ratio degrades below 30 dB, and below 25 dB every frame is judged to be a speech frame, so the identification rate drops to 60% and speech section detection becomes impossible. With the present method, the identification rate hardly changes down to a signal-to-noise ratio of about 20 dB, and even at about 10 dB a good identification rate of 91% is obtained. This is a marked improvement over the conventional method's weakness in adapting to environmental changes.

As described above, the present invention is a speech section detection method characterized by learning in advance the energy level and spectrum of the noise in the usage environment, and by also making use of spectral information.

Learning makes it possible to normalize the energy level of a noisy input signal to a constant level, which reduces the influence of environmental changes; and because spectral information is also used, speech sections can be detected with good accuracy.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram showing the general configuration of an automatic speech recognition system. FIG. 2 is a block diagram showing the speech section detection method according to the present invention. FIG. 3 is a functional block diagram of a speech section detection circuit according to the present invention. FIGS. 4A to 4C compare the signal energy distributions of the present invention and the conventional method. FIGS. 5A and 5B show the normalized logarithmic energy and the Euclidean distance for an actual speech signal under the present invention. FIG. 6 shows how the present invention and the conventional method respond to changes in the signal-to-noise ratio of the environment.

1: acoustic processing unit; 2: speech section detection unit; 3: recognition unit; 4: acoustic analysis unit; 5: environment learning unit; 6: speech section detection unit; 7: signal energy calculation unit; 8: LPC cepstrum coefficient calculation unit; 9: control unit; 10, 15, 21: multiplexers; 11: mean/variance calculation unit; 12: normalization coefficient determination unit; 13: FIFO buffer; 14: LPC cepstrum coefficient mean vector memory; 16: Euclidean distance calculation unit; 17: threshold determination unit; 18: normalized logarithmic energy calculation unit; 19: three-value comparison unit; 20: two-value comparison unit; 22: smoothing unit.

Agent: Patent attorney Toshio Nakao and one other

Claims

(1) A speech section detection method characterized in that an energy threshold is set in advance based on the average energy level of environmental noise in a standard environment; the average energy level of environmental noise in the usage environment is learned by an environment learning function; a normalization coefficient is determined from the average energy level of the environmental noise; the energy level of a speech input signal is normalized using the normalization coefficient; and speech sections are detected using the normalized energy level of the speech input signal and the energy threshold.

(2) The speech section detection method according to claim 1, characterized in that a mean vector of LPC cepstrum coefficients representing the spectral characteristics of the environmental noise in the usage environment is learned by the environment learning function; the Euclidean distance between the mean vector of the LPC cepstrum coefficients and the LPC cepstrum coefficient vector obtained from the speech input signal is computed; a distance threshold is set based on the Euclidean distance; and speech sections are detected using the distance threshold together with the energy level of the speech input signal normalized by the normalization coefficient and the energy threshold.