JP2807457B2 - Voice section detection method

Info

Publication number: JP2807457B2
Application number: JP62179564A
Authority: JP
Inventors: 博喜内山; 博雄北川
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1987-07-17
Filing date: 1987-07-17
Publication date: 1998-10-08
Anticipated expiration: 2013-10-08
Also published as: JPS6423296A

Description

Detailed Description of the Invention

Technical Field

The present invention relates to a voice section detection method, and more particularly to a voice section detection method for voice response devices, voice recognition devices, and the like. It is also applicable to toys, musical instruments, automatic accompaniment devices, and similar equipment.

Prior Art

Level detection is a voice section detection method that requires little hardware and rests on a clear detection principle. It determines the start and end of a voice section by comparing a time-series feature obtained from the voice signal with a predetermined threshold. With this method, however, the start and end of the voice section are difficult to determine accurately, and there are problems such as malfunction caused by background noise and failure to detect weak signals.

Object

The present invention has been made in view of the above circumstances, and its object is to provide a voice detector that detects only the voice-signal sections of an input signal.

Configuration

To achieve the above object, the present invention is a voice section detection method in which a predetermined number of samples of the input signal forms one frame, the frame is used as a window length serving as the unit of detection, and a voice section is detected by comparing a feature within the window length with a predetermined threshold. The method is characterized in that a plurality of window lengths of different time lengths, and a threshold corresponding to each window length, are set. The start of the voice is found using a medium-time window length and a medium-level threshold: start detection is repeated until, among a predetermined number of frames following the frame that exceeded the medium-level threshold, the number of frames exceeding that threshold exceeds a predetermined number. The end of the voice is found using a long-time window length and a high-level threshold: end detection is repeated until, among a predetermined number of frames following the frame that fell below the high-level threshold, the number of frames below that threshold exceeds a predetermined number. The start and end are then corrected using a short-time window length and a low-level threshold: the feature calculated within the short window length is compared with the threshold, a point that falls below the low-level threshold at a time earlier in the time series than the previously detected start position is set as the new start, and a point that falls below the low-level threshold at a time later in the time series than the previously detected end position is set as the new end. The invention is described below on the basis of an embodiment.

As stated above, the level detection method, which detects the start and end of a voice section by comparing the energy of the voice signal (power, amplitude, etc.) with a preset threshold, has the merit of requiring little hardware and resting on a clear detection principle. On the other hand, it does not determine the start and end of the voice section accurately, and it suffers from malfunction caused by background noise and failure to detect weak signals.

In contrast, the present invention provides a plurality of window lengths of different time lengths for the voice signal and uses, as the feature within each window length, the sum of the absolute values of the differences between adjacent voice-signal samples. A medium-time window length is set to detect the start of the voice, and a long-time window length is set to find the end, so that a provisional start and a provisional end are obtained using windows of different time lengths and their respective thresholds. A short-time window length and its threshold are then used to find the accurate positions, so that the accurate start and end of the voice section are detected.

FIG. 1 is a configuration diagram for explaining an embodiment of the voice section detection method according to the present invention. In the figure, 1 is an A/D converter, 2 is a difference calculation unit, 3 is a feature calculation unit for each window length, 4 and 5 are comparison units, 6 and 7 are threshold units, A is the input voice signal, and B is the output signal giving the frame numbers of the start and end of the voice section. After A/D conversion, the difference between each sample of the input signal A and the preceding sample is calculated in turn. Each difference is converted to its absolute value, and the absolute values are summed over a predetermined number of samples to give the feature of one frame. Window lengths of different time lengths are set through the number of samples in one frame. In FIG. 1, for example, a window length of 20 ms is used for detecting the start of the voice section, and the feature within that window length is denoted fsi. Similarly, a window length of 30 ms is used for detecting the end, and the feature within that window length is denoted fei. Here i is the frame number. In addition, a window length of 10 ms is set for finding the accurate start and end, so that its feature is more sensitive than fsi and fei, and that feature is denoted fsei. The thresholds corresponding to the features fsi, fei, and fsei, which are used to find the start, the end, and their accurate positions, are TH, TL, and TE, respectively.

FIG. 2 is a waveform diagram for explaining the operating principle of the present invention. FIG. 2(a) shows the input signal waveform, FIG. 2(b) the feature fsi for a window length of 20 ms, FIG. 2(c) the feature fei for a window length of 30 ms, and FIG. 2(d) the feature fsei for a window length of 10 ms.

First, using the feature of FIG. 2(b), frames are searched from the time of input for one that exceeds TH, and a provisional start is obtained. TH is set somewhat high so that noise and the like are not picked up. In addition, so that sudden noise such as that shown in FIG. 2(b) is not picked up, the following processing is added: if, among a predetermined number of frames following the detected start, the number of frames exceeding the start-detection threshold does not reach a predetermined number, start detection is performed again from the point after the detected start, and this is repeated until the number of such subsequent frames exceeds the predetermined number. As a result, the point at which the threshold is first exceeded is found as s″, but the above operation moves the start position to s′.

For end detection, the feature of FIG. 2(c) is likewise applied to the frames after s′. The next frame after the start detection point s′ that falls below the threshold TL is searched for. To prevent dropouts in the middle of a word, if a frame exceeding TL appears within a predetermined number of frames following the detected end, the end position is updated: a point that falls below TL is searched for in the frames following the position that exceeded TL. This processing is repeated until no frame exceeding TL appears within the predetermined number of subsequent frames, and the end is thereby obtained. In FIG. 2(c), the point e″ that is first taken as the end is updated to e′ by this operation, and the loss of speech within the word is eliminated.

Once the start and end have been obtained from the features of FIGS. 2(b) and 2(c), the feature of FIG. 2(d) is used to find the point before s′ at which the feature falls below the threshold TE, and this point is taken as the start s. Similarly, the point after the provisional end e′ at which the feature falls below the threshold TE is found and taken as the end e.

The plurality of window lengths and their features may be calculated separately for each time length. Alternatively, for example, only the feature of the short window length may be calculated and used as a basis: the sum of two adjacent features may be used for the medium-time window length, and the sum of three consecutive features may be used for the long-time feature.

Effects

As is clear from the above description, the present invention makes it possible to detect the start and end of a voice section accurately and easily.
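
The detection procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch under several assumptions that the patent does not state: the function names (`short_features`, `sliding_sum`, `find_start`, `find_end`, `refine`) and the parameters `frame_len`, `look`, `need`, and `hang` are hypothetical; the medium and long features are taken as sliding sums of two and three short-window features sharing the short-frame index; and "reaches the predetermined number" is read as `>=`.

```python
from typing import List, Optional, Tuple


def short_features(samples: List[float], frame_len: int) -> List[float]:
    """Sum of |x[n] - x[n-1]| over each non-overlapping short frame (fsei)."""
    diffs = [abs(b - a) for a, b in zip(samples, samples[1:])]
    n_frames = len(diffs) // frame_len
    return [sum(diffs[k * frame_len:(k + 1) * frame_len]) for k in range(n_frames)]


def sliding_sum(feats: List[float], k: int) -> List[float]:
    """Feature of a window k short frames long, starting at each short frame.

    k=2 stands in for the medium window (fsi), k=3 for the long window (fei).
    The sliding (overlapping) layout is an assumption of this sketch.
    """
    return [sum(feats[i:i + k]) for i in range(len(feats) - k + 1)]


def find_start(f_mid: List[float], th: float, look: int, need: int) -> Optional[int]:
    """Provisional start s': first frame above TH that is confirmed by at
    least `need` of the next `look` frames also being above TH."""
    for i, value in enumerate(f_mid):
        if value > th:
            following = f_mid[i + 1:i + 1 + look]
            if sum(1 for x in following if x > th) >= need:
                return i
    return None  # no confirmed start found


def find_end(f_long: List[float], tl: float, start: int, hang: int) -> int:
    """Provisional end e': a frame below TL after which no frame rises above
    TL within the next `hang` frames. Falls back to the last frame."""
    n = len(f_long)
    i = start + 1
    while i < n:
        if f_long[i] < tl:
            resume = next(
                (j for j in range(i + 1, min(i + 1 + hang, n)) if f_long[j] > tl),
                None,
            )
            if resume is None:
                return i
            i = resume  # speech resumed: keep searching from there
        else:
            i += 1
    return n - 1


def refine(f_short: List[float], te: float, start: int, end: int) -> Tuple[int, int]:
    """Move the start backward and the end forward until the short-window
    feature is below TE, or the edge of the data is reached."""
    s = start
    while s > 0 and f_short[s] >= te:
        s -= 1
    e = end
    while e < len(f_short) - 1 and f_short[e] >= te:
        e += 1
    return s, e


# Hand-made short-window features: a noise spike, then a word with a short gap.
f_short = [0, 0, 9, 0, 0, 0, 1, 2, 6, 8, 9, 7, 0, 0, 7, 8, 2, 1, 0, 0, 0, 0]
f_mid = sliding_sum(f_short, 2)
f_long = sliding_sum(f_short, 3)

s_prov = find_start(f_mid, th=7, look=4, need=3)
e_prov = find_end(f_long, tl=10, start=s_prov, hang=3)
s_final, e_final = refine(f_short, te=2, start=s_prov, end=e_prov)
print(s_prov, e_prov, s_final, e_final)
```

In this toy data the isolated spike at the beginning is rejected because too few of the following frames stay above TH, the gap in the middle of the word does not end the section because the feature rises above TL again within `hang` frames, and the final step widens the provisional section using the more sensitive short-window feature. The threshold values and frame counts are arbitrary choices for the example, not values from the patent.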

Brief Description of the Drawings

FIG. 1 is a block diagram for explaining an embodiment of the present invention, and FIG. 2 is a waveform diagram for explaining the operating principle of the present invention.

1: A/D converter; 2: difference calculation unit; 3: feature calculation unit for each window length; 4, 5: comparison units; 6, 7: threshold units.

Continuation of the front page

(58) Fields searched (Int. Cl.⁶, DB name): G10L 3/00 - 9/18; JICST (JOIS)

Claims

(57) [Claims] A voice section detection method in which a predetermined number of samples of an input signal forms one frame, the frame is used as a window length serving as the unit of detection, and a voice section is detected by comparing a feature within the window length with a predetermined threshold, characterized in that: a plurality of window lengths of different time lengths and a threshold corresponding to each window length are set; the start of the voice is found using a medium-time window length and a medium-level threshold, by repeating start detection until, among a predetermined number of frames following the frame that exceeded the medium-level threshold for start detection, the number of frames exceeding that threshold exceeds a predetermined number; the end of the voice is found using a long-time window length and a high-level threshold, by repeating end detection until, among a predetermined number of frames following the frame that fell below the high-level threshold for end detection, the number of frames below that threshold exceeds a predetermined number; and the start and end are corrected using a short-time window length and a low-level threshold, by comparing the feature calculated within the short window length with the threshold, so that a point that falls below the low-level threshold at a time earlier in the time series than the previously detected start position is set as the new start, and a point that falls below the low-level threshold at a time later in the time series than the previously detected end position is set as the new end.