JP3105465B2

JP3105465B2 - Voice section detection method

Info

Publication number: JP3105465B2
Application number: JP09060236A
Authority: JP
Inventors: 達雄松岡; 泰浩南; 貞▲煕▼ 古井
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1997-03-14
Filing date: 1997-03-14
Publication date: 2000-10-30
Anticipated expiration: 2017-03-14
Also published as: JPH10254476A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical field of the invention] The present invention relates to a method for detecting speech sections in automatic speech recognition by machine. It is implemented in a speech recognition device and is used to detect speech sections in an input signal.

[0002]

[Prior art] In conventional speech recognition systems, the start and end points of a speech section have been detected mainly from the rise and fall of the speech power envelope. Methods have been proposed that improve detection accuracy by supplementing speech power with additional information such as the number of zero crossings per unit time, the recognition target vocabulary, and information about the target task. However, these methods cannot be said to have achieved sufficient accuracy when the background noise is non-stationary or when they are applied to continuously uttered speech.

[0003] FIG. 4 is a flowchart of a prior-art speech section detection method based on signal power. The digitized (sampled and quantized) input signal is divided into blocks of N samples, and the signal power of each block is calculated. The signal power is calculated as the sum, over the block, of the squared amplitude of each sample.
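
The block power computation described above can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the function name `block_powers` is hypothetical, and dropping an incomplete trailing block is an assumption, since the text does not say how a partial block is handled.

```python
def block_powers(samples, n):
    """Split `samples` into consecutive blocks of `n` samples and return
    the sum of squared amplitudes of each complete block."""
    if n <= 0:
        raise ValueError("n must be positive")
    powers = []
    # Assumption: an incomplete trailing block is ignored.
    for start in range(0, len(samples) - n + 1, n):
        block = samples[start:start + n]
        powers.append(sum(x * x for x in block))
    return powers
```

For example, `block_powers([1, 2, 3, 4, 5], 2)` yields one power value for the block `[1, 2]` and one for `[3, 4]`; the final sample is left over.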

[0004] Next, power normalization is performed to compensate for variations in the background noise level. First, $E_{\min}$ is calculated as follows.

$E_{\min} = \min_{1 \le k \le NF} E(k)$

Here, NF (Number of Frames) is the length of the input signal counted in frames, and $\min(\cdot)$ denotes the minimum value over k from 1 to NF.

[0005] The normalized power $E_n(k)$ is defined as follows.

$E_n(k) = E(k) - E_{\min}, \quad k = 1, 2, \ldots, NF$

Next, the background noise level is estimated by computing a histogram of the signal power. The histogram is computed, for example, over frames of 15 dB or more. Three-point median smoothing is then applied, and the corrected power contour $E_s(k)$ is obtained.

[0006] $E_s(k) = E_n(k) - \mathrm{Mode}$

Here, Mode is the mode of the smoothed histogram. The signal power $E_s(k)$ obtained in this way (energy in the figure) is compared with an experimentally determined threshold (THR in the figure), and the number of frames in which $E_s(k)$ exceeds THR is counted. When the count exceeds a fixed value (MINLEN in the figure), the start of that run of frames is detected as the start of the speech section. Because short pauses can occur within a speech section, silent intervals lasting up to a fixed time (MAXPAUSE in the figure) are included in the speech section. When the signal power remains below THR for longer than MAXPAUSE, the start of that low-power interval is detected as the end of the speech section.
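
The normalization in paragraphs [0004] to [0006] might look like the sketch below. Several details are assumptions, because the text leaves them open: power values are taken to be in dB, the histogram uses 1 dB bins over all frames (the frame-selection rule mentioned in [0005] is omitted), the mode is reported as the lower edge of the winning bin, and the three-point median leaves the two end bins unchanged. The names `median3` and `normalize_power` are hypothetical.

```python
def median3(values):
    """Three-point median smoothing; the two end points are left unchanged."""
    out = list(values)
    for i in range(1, len(values) - 1):
        out[i] = sorted(values[i - 1:i + 2])[1]
    return out


def normalize_power(e_db, bin_width=1.0):
    """Return (E_s, Mode) for a non-empty list of per-frame powers in dB."""
    e_min = min(e_db)                            # E_min = min E(k)
    e_n = [e - e_min for e in e_db]              # E_n(k) = E(k) - E_min
    # Histogram of the normalized power (assumed bin width: 1 dB).
    n_bins = int(max(e_n) // bin_width) + 1
    counts = [0] * n_bins
    for e in e_n:
        counts[int(e // bin_width)] += 1
    smoothed = median3(counts)
    # Mode = lower edge of the bin with the highest smoothed count.
    mode_bin = max(range(n_bins), key=lambda b: smoothed[b])
    mode = mode_bin * bin_width
    e_s = [e - mode for e in e_n]                # E_s(k) = E_n(k) - Mode
    return e_s, mode
```

After this step, frames at the estimated background level sit near 0, so a single threshold THR can be applied regardless of the absolute noise level.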

[0007] In the figure, speech and pause are counters that each count frames: speech counts frames whose signal power exceeds the threshold THR, and pause counts frames whose signal power is below THR. So that low-power portions at the start and end of a speech section are not missed, margins of BEG and END frames, respectively, are added to the detected speech section.

[0008] The processing of FIG. 4 is described step by step below. In step S20, one data block (frame) of the input signal, which has been divided into blocks of N samples, is input. In step S21, the signal power of the input block is calculated. In step S22, the calculated signal power energy is compared with the predetermined threshold THR; if energy is larger, the process proceeds to step S23, and otherwise to step S26.

[0009] In step S23, the counter speech, which counts the speech section, is incremented. If a short silent (pause) interval that may be included in the speech section has been counted (step S24), the value of the counter pause for that interval is added to the counter speech in step S25. The process then returns to step S20 and repeats for the next block.

[0010] When the signal power energy is below the threshold THR, step S26 checks whether a speech section has been counted; if not, the process returns to step S20. If a speech section has been counted, the counter pause is incremented in step S27.

[0011] In step S28, the value of the counter pause is compared with the predetermined MAXPAUSE. If pause is greater than MAXPAUSE, the process proceeds to step S29; otherwise it returns to step S20.

[0012] In step S29, the value of the counter speech is compared with the predetermined MINLEN. If speech is greater than MINLEN, the process proceeds to step S31; otherwise it proceeds to step S30. In step S30, speech and pause are reset to 0 so that short speech sections are ignored, and the process returns to step S20 and repeats.

[0013] In step S31, the start point of the speech section is set to the point reached by going back from the current block by the number of blocks equal to pause plus speech plus BEG, the margin at the start of the speech section. In step S32, the end point of the speech section is set to the point reached by advancing from that start point by the number of blocks equal to BEG plus speech plus END, the margin at the end of the speech section. Then speech and pause are reset in step S30, and processing continues in the same way.
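
Steps S20 to S32 can be read as a small state machine over per-block power values. The sketch below is one possible reading, not the flowchart itself: the function name `detect_by_power` is hypothetical, positions are treated as block boundaries so that results are half-open `(start, end)` ranges, results are clamped to the signal, and clearing pause after it is merged into speech in S25 is an assumption that the text does not state.

```python
def detect_by_power(energies, thr, minlen, maxpause, beg=0, end=0):
    """Return half-open (start, end) block ranges judged to be speech."""
    segments = []
    speech = pause = 0
    for i, energy in enumerate(energies):        # S20/S21: one block per pass
        if energy > thr:                         # S22
            speech += 1                          # S23
            if pause > 0:                        # S24
                speech += pause                  # S25
                pause = 0                        # assumption: cleared once merged
            continue
        if speech == 0:                          # S26: no speech counted yet
            continue
        pause += 1                               # S27
        if pause <= maxpause:                    # S28: pause still short
            continue
        if speech > minlen:                      # S29
            here = i + 1                         # assumption: boundary after block i
            start = here - pause - speech - beg  # S31
            stop = start + beg + speech + end    # S32
            segments.append((max(start, 0), min(stop, len(energies))))
        speech = pause = 0                       # S30
    return segments
```

In this reading a section is only emitted once more than MAXPAUSE low-power blocks follow it, so speech that runs to the very end of the input is not reported.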

[0014]

[Problem to be solved by the invention] The conventional methods described above work reasonably well in environments where the signal-to-noise ratio is 30 dB or higher, or where the noise is stationary. In realistic environments, however, there are many situations in which these methods do not work well. When non-stationary background noise is present, or when the background noise level is relatively high, it is very difficult to detect speech sections from the speech power envelope.

[0015] An object of the present invention is to provide a method for automatically detecting speech sections in an input signal, so that accurate speech recognition is possible even when the background noise is non-stationary and in noisy environments.

[0016]

[Means for solving the problem] The present invention uses a speech acoustic model, based on a hidden Markov model trained on all speech covering the recognition target vocabulary (classes), and a non-speech acoustic model, based on a hidden Markov model trained on sections in which no speech is uttered, and calculates the likelihood ratio between the speech acoustic model and the non-speech acoustic model for each section of appropriate length in the input signal. When a section in which the likelihood ratio exceeds a certain threshold continues for a certain time, the start of that section, or a point a certain time before the start of that section, is taken as the start of the speech section. Thereafter, when a section in which the likelihood ratio falls below a certain threshold continues for a certain time, the start of that section, or a point a certain time after its start, is detected as the end of the speech section.

[0017] As described above, the present invention detects speech sections in a signal on the basis of the statistical properties of speech and background noise in the frequency domain. It therefore has the advantage that speech sections can be detected accurately even when the background noise is non-stationary or the background noise level is high, which are the cases that were problematic for methods based on signal power or the number of zero crossings.

[0018]

[Embodiment of the invention] FIG. 1 is a block diagram showing an embodiment of the present invention. First, the input signal is filtered by a filtering unit 1 using a band-pass filter, and is then analog-to-digital converted (sampled and quantized) by an A/D conversion unit 2 to obtain a digital signal.

[0019] Next, a high-frequency emphasis unit 3 applies high-frequency emphasis to the digitized signal according to [Equation 1] below. This compensates for the tendency of the spectrum of a speech signal to fall off from low to high frequencies.

[0020] $H(z) = 1 - a z^{-1}$ [Equation 1]

Next, for feature extraction, a feature extraction unit 4 divides the signal into blocks of an appropriate length (for example, 32 ms) and performs feature analysis while shifting by an appropriate length (for example, 8 ms). In the following, this shift width is called a frame.
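
As a rough illustration of [Equation 1] and the block/shift scheme, the sketch below applies the first-order filter in the time domain and cuts overlapping analysis blocks. The coefficient value `a=0.97` and the 8000 Hz sample rate used in the example are assumptions (the text gives neither), the sample before the first one is taken to be 0, and the function names are hypothetical.

```python
def pre_emphasize(x, a=0.97):
    """Apply H(z) = 1 - a z^-1, i.e. y[n] = x[n] - a * x[n-1].

    Assumption: the sample before x[0] is treated as 0.
    """
    return [x[n] - a * (x[n - 1] if n > 0 else 0.0) for n in range(len(x))]


def split_blocks(x, sample_rate, block_ms=32, shift_ms=8):
    """Cut `x` into blocks of `block_ms`, advancing by `shift_ms` each time."""
    block = sample_rate * block_ms // 1000       # samples per analysis block
    shift = sample_rate * shift_ms // 1000       # samples per frame shift
    return [x[s:s + block] for s in range(0, len(x) - block + 1, shift)]
```

At an assumed 8000 Hz, a 32 ms block is 256 samples and an 8 ms shift is 64 samples, so consecutive blocks overlap by 75 percent.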

[0021] Each frame is weighted with a window such as the Hamming window shown in [Equation 2] below, which reduces the effect of cutting the signal into frames.

$w(n) = 0.54 - 0.46 \cos(2\pi n / N), \quad 0 \le n \le N - 1$ [Equation 2]

The features of the speech signal are the cepstrum based on LPC (Linear Predictive Coefficient) analysis and its first-order time derivative.
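
[Equation 2] translates directly into code. The sketch below follows the equation as written, with N in the denominator (the periodic form of the window; the symmetric form often seen elsewhere divides by N − 1). The function names are hypothetical.

```python
import math


def hamming_window(n_points):
    """w(n) = 0.54 - 0.46 cos(2 pi n / N) for 0 <= n <= N - 1, as in Equation 2."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / n_points)
            for n in range(n_points)]


def apply_window(block):
    """Weight one analysis block sample by sample with the Hamming window."""
    return [s * w for s, w in zip(block, hamming_window(len(block)))]
```

The window tapers both ends of each block toward 0.08 of the original amplitude, which is what reduces the edge effects mentioned in the text.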

[0022] After the digital signal processing described above, a likelihood ratio calculation unit 5 calculates the likelihoods for two HMMs (hidden Markov models). The first is a speech HMM corresponding to the entire vocabulary. This model is trained on speech that includes all of the target vocabulary, and is expected to give a high likelihood for speech signals from the target vocabulary and a low likelihood for other signals. The second is a non-speech HMM trained on signal sections outside the recognition target vocabulary, such as silent sections; it is expected to give a high likelihood in silent sections and a low likelihood for speech signals. Each HMM can be a model with a very simple structure, so the likelihoods can be computed quickly.
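
The patent does not specify the HMM topology or output distributions. Purely as a stand-in, the sketch below scores a feature vector under a single diagonal-covariance Gaussian, which can be viewed as a degenerate one-state HMM; the function name, the Gaussian form, and the example parameters are all assumptions made for illustration.

```python
import math


def diag_gaussian_loglik(o, mean, var):
    """Log density of feature vector `o` under a diagonal-covariance Gaussian.

    Assumption: this stands in for log P(o | model) of a one-state HMM.
    """
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(o, mean, var)
    )


# Hypothetical two-dimensional models: "speech" centred at 1, "background" at 0.
frame = [1.0, 1.0]
ll_speech = diag_gaussian_loglik(frame, [1.0, 1.0], [1.0, 1.0])
ll_background = diag_gaussian_loglik(frame, [0.0, 0.0], [1.0, 1.0])
```

With these made-up parameters, a frame lying at the speech mean scores higher under the speech model than under the background model, which is the behavior the paragraph expects of the two trained models.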

[0023] From the ratio of these likelihoods, a speech section determination unit 6 detects speech sections and outputs the result. FIG. 2 is a flowchart showing the flow of processing in this embodiment. The processing is described below following the flowchart.

[0024] In the following description, speech is a counter that measures the length of the speech section, and pause is a counter that measures the length of a section considered to be silent. Because short pauses can occur within a speech section, pauses no longer than MAXPAUSE are included in the speech section. In addition, a section that does not exceed MINLEN is not detected as a speech section, because extracting very short sections as speech sections increases false detections.

[0025] For the cepstrum and delta cepstrum of each frame, the likelihoods under the speech HMM and the non-speech HMM are calculated, and their likelihood ratio (diff in the figure) is obtained (steps S1 and S2). diff is defined by the following equation.

[0026] $\mathrm{diff} = \log P(o_t \mid \mathrm{allspeech}) - \log P(o_t \mid \mathrm{background})$

Here, $\log P(o_t \mid \mathrm{allspeech})$ is the log likelihood of the speech HMM for the input signal $o_t$ at time t, and $\log P(o_t \mid \mathrm{background})$ is likewise the log likelihood of the non-speech HMM. Because the per-frame log likelihood contains small gaps and is not a stable measure, it is smoothed by summing diff over M frames (step S3).
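
The per-frame difference and its smoothing (steps S2 and S3) can be sketched as below, taking the per-frame log likelihoods as given inputs. Which M frames enter each sum is not stated in the text; this sketch assumes a trailing window ending at the current frame, with fewer terms at the very start. The function names are hypothetical.

```python
def log_likelihood_ratio(ll_speech, ll_background):
    """diff[t] = log P(o_t | allspeech) - log P(o_t | background)."""
    return [s - b for s, b in zip(ll_speech, ll_background)]


def smooth_sum(diff, m):
    """measure[t] = sum of diff over the last `m` frames ending at t.

    Assumption: a trailing window; the first m - 1 outputs sum fewer frames.
    """
    measure = []
    running = 0.0
    for t, d in enumerate(diff):
        running += d
        if t >= m:
            running -= diff[t - m]               # drop the frame leaving the window
        measure.append(running)
    return measure
```

Summing over several frames means a single frame with a low ratio inside a stretch of speech does not, by itself, pull measure below the threshold.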

[0027] In the next step S4, the smoothed diff (measure in the figure) is compared with a threshold determined experimentally in advance (THR in the figure). If measure is greater than THR, the process proceeds to step S5; otherwise it proceeds to step S10.

[0028] When measure > THR, step S5 judges the frame to be part of a speech section and increments the counter speech. Next, step S6 checks whether the counter pause, which measures the length of the pause section, is 0. If pause = 0, step S7 is skipped; if pause ≠ 0, step S7 adds the value of pause to the speech section counter speech.

[0029] Next, in step S8, speech is compared with MINLEN, the length required for a section to be regarded as a speech section. If speech < MINLEN, or if the start of the speech section (startpoint) has already been set, the process returns to step S1 and the series of steps is repeated for the next frame of the input signal.

[0030] If speech ≥ MINLEN and the start has not been set, the frame BEG frames before the current frame is set as the start (startpoint). When speech recognition is run in parallel with this speech section detection, speech recognition is started at this point. The process then returns to step S1 and repeats for the next frame of the input signal.

[0031] If measure is found in step S4 to be smaller than THR, step S10 checks whether speech is 0; if it is, the process returns to step S1. If it is not 0, the process proceeds to step S11 and the counter pause is incremented.

[0032] Next, step S12 compares pause with the predetermined MAXPAUSE to check whether the pause section so far is short. When pause is not greater than MAXPAUSE, the pause section may still be regarded as part of the speech section, so the process returns to step S1 and handles the next frame in the same way.

[0033] If pause is greater than MAXPAUSE, the process proceeds to step S13, which checks whether the start (startpoint) has been set. If it has not, step S14 resets speech and pause to 0, and the process returns to step S1 and repeats the series of steps for the next frame of the input signal.

[0034] If the start has been set, step S15 outputs the span from the set start to the current frame as a speech section. Step S16 then initializes the system, and processing repeats from step S1.
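
The decision logic of steps S4 to S16 can be sketched as a loop over the smoothed measure values. This is one literal reading of paragraphs [0027] to [0034], not the flowchart of FIG. 2 itself: the function name is hypothetical, segments are returned as inclusive `(startpoint, current_frame)` pairs, the start is clamped at frame 0, and clearing pause after step S7 is an assumption that the text does not state.

```python
def detect_by_likelihood_ratio(measures, thr, minlen, maxpause, beg=0):
    """Return inclusive (startpoint, end_frame) pairs judged to be speech."""
    segments = []
    speech = pause = 0
    startpoint = None
    for t, measure in enumerate(measures):       # S1-S3 assumed done upstream
        if measure > thr:                        # S4
            speech += 1                          # S5
            if pause != 0:                       # S6
                speech += pause                  # S7
                pause = 0                        # assumption: cleared once merged
            if speech >= minlen and startpoint is None:   # S8
                startpoint = max(t - beg, 0)     # start set BEG frames back
            continue
        if speech == 0:                          # S10
            continue
        pause += 1                               # S11
        if pause <= maxpause:                    # S12: pause still short
            continue
        if startpoint is not None:               # S13
            segments.append((startpoint, t))     # S15: start through current frame
        speech = pause = 0                       # S14 / S16
        startpoint = None
    return segments
```

Read literally, the start is placed BEG frames before the frame at which speech reaches MINLEN, so in this sketch BEG needs to be at least MINLEN − 1 for the start to reach back to the first frame above the threshold.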

[0035]

[Example] FIG. 3 shows the results of a comparative evaluation of the conventional method, which detects speech sections from signal power, and the method of the present invention, which is based on the likelihood ratio of statistical acoustic models.

[0036] In the evaluation experiment, the target task was recognition of continuously uttered four-digit numbers. One might quantitatively evaluate a speech section detection method by measuring the deviation from the correct sections, but this leaves open questions such as how the correct sections should be defined. Because the speech section detection method of the present invention is intended for speech recognition, its effect can be evaluated more directly by measuring speech recognition accuracy. Here, the conventional method and the method of the present invention were compared as speech section detectors under the same speech recognition method.

[0037] The evaluation used 51 audio files, each containing 35 four-digit numbers, for a total of 1785 four-digit numbers. Data containing background noise was created by adding car driving noise. The S/N ratio of the speech data without car noise is approximately 25 dB, and that of the speech data with car noise is approximately 12 dB. In addition to the conventional power-based method and the method of the present invention based on the likelihood ratio of statistical acoustic models, the comparison included an experiment in which speech sections were detected by hand (by visually inspecting the waveform and listening to the audio).

[0038] FIG. 3(A) shows the parameters used in this experiment for the power-based method; the values were chosen experimentally as the best ones. FIG. 3(B) shows the parameters used for the method based on the likelihood ratio of statistical acoustic models (the method of the present invention); these values were also chosen experimentally. Here, MAXPAUSE and END were given the same value.

[0039] FIG. 3(C) shows the recognition results for four-digit numbers when each speech section detection method is applied to speech data with S/N ratios of 25 dB and 12 dB. In this figure, Error rate (%) is the proportion of misrecognitions, and False alarm is the number of sections in which a non-speech portion was detected as a speech section. These results show that the method of the present invention is comparable to manual detection and clearly more effective than the conventional method.

[0040]

[Effect of the invention] According to the present invention, speech sections can be detected accurately in an input signal even when the background noise is non-stationary or the noise level is high.

[Brief description of the drawings]

FIG. 1 is a block diagram showing an embodiment of the present invention.

FIG. 2 is a flowchart illustrating the flow of processing, according to an embodiment of the present invention, for detecting speech sections using the likelihood ratio between a speech HMM and a non-speech HMM.

FIG. 3 is a diagram showing the results of a comparative evaluation of the conventional method, which detects speech sections from signal power, and the method according to an example of the present invention, which is based on the likelihood ratio of statistical acoustic models.

FIG. 4 is a flowchart illustrating the flow of processing of the conventional method for detecting speech sections from signal power.

[Explanation of symbols]

1: filtering unit; 2: A/D conversion unit; 3: high-frequency emphasis unit; 4: feature extraction unit; 5: likelihood ratio calculation unit; 6: speech section determination unit

Continuation of the front page

(56) References: JP-A-1-169499 (JP, A); JP-A-2-205898 (JP, A); JP-A-62-211698 (JP, A); JP-A-8-87293 (JP, A); Patent 3034279 (JP, B2); Patent 2666296 (JP, B2)

(58) Fields investigated (Int. Cl.⁷, DB name): G10L 15/00 - 17/00; JICST file (JOIS); IEEE/IEEE Electronic Library Online

Claims

(57) [Claims]

1. A speech section detection method for detecting a speech section in an input signal in automatic speech recognition by machine, wherein a speech acoustic model, based on a hidden Markov model trained on all speech covering the recognition target vocabulary, and a non-speech acoustic model, based on a hidden Markov model trained on sections in which no speech is uttered, are used to calculate the likelihood ratio between the speech acoustic model and the non-speech acoustic model for each section of appropriate length in the input signal; when a section in which the likelihood ratio exceeds a certain threshold has continued for a certain time, the start of that section is taken as the start of the speech section; and thereafter, when a section in which the likelihood ratio falls below a certain threshold has continued for a certain time, the start of that section is detected as the end of the speech section.

2. A speech section detection method for detecting a speech section in an input signal in automatic speech recognition by machine, wherein a speech acoustic model, based on a hidden Markov model trained on all speech covering the recognition target vocabulary, and a non-speech acoustic model, based on a hidden Markov model trained on sections in which no speech is uttered, are used to calculate the likelihood ratio between the speech acoustic model and the non-speech acoustic model for each section of appropriate length in the input signal; when a section in which the likelihood ratio exceeds a certain threshold has continued for a certain time, a point a certain time before the start of that section is taken as the start of the speech section; and thereafter, when a section in which the likelihood ratio falls below a certain threshold has continued for a certain time, a point a certain time after the start of that section is detected as the end of the speech section.