JP2807457B2 - Voice section detection method - Google Patents

Voice section detection method

Info

Publication number
JP2807457B2
Authority
JP
Japan
Prior art keywords
window length
threshold
detection
point
threshold value
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
JP62179564A
Other languages
Japanese (ja)
Other versions
JPS6423296A (en)
Inventor
博喜 内山
博雄 北川
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ricoh Co Ltd
Original Assignee
Ricoh Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
1987-07-17
Filing date
1987-07-17
Publication date
1998-10-08
Application filed by Ricoh Co Ltd
Priority to JP62179564A
Publication of JPS6423296A (1989-01-25)
Application granted
Publication of JP2807457B2 (1998-10-08)

Description

Detailed Description of the Invention

Technical Field

The present invention relates to a voice section detection method, and more particularly to a voice section detection method for voice response devices, voice recognition devices, and the like. It is applicable to toys, musical instruments, automatic accompaniment devices, and similar products.

Prior Art

The level detection method is a voice section detection method that needs little hardware and has a clear detection principle. It determines the start and end of a voice section by comparing a time-series feature amount obtained from the voice signal with a predetermined threshold. With this method, however, the start and end of the voice section are difficult to determine accurately, and there are problems such as malfunction due to background noise and failure to detect weak signals.

Object

The present invention has been made in view of the above circumstances, and its object is to provide a voice detector that detects only the voice signal sections of an input signal.

Configuration

To achieve the above object, the present invention is a voice section detection method in which one frame is formed from a predetermined number of samples of the input signal and used as the window length, that is, the unit of detection, and a feature amount within the window length is compared with a predetermined threshold to detect the voice section. A plurality of window lengths of different durations, and a threshold corresponding to each window length, are set.

The start of the voice is found using a medium-duration window length and a medium-level threshold: after the medium-level threshold for start detection is exceeded, detection is repeated until, among a predetermined number of subsequent frames, the number of frames exceeding that threshold exceeds a predetermined number.

The end of the voice is found using a long-duration window length and a high-level threshold: after the feature amount falls below the high-level threshold for end detection, detection is repeated until, among a predetermined number of subsequent frames, the number of frames below that threshold exceeds a predetermined number.

The start and end are then corrected using a short-duration window length and a low-level threshold. The feature amount calculated within the short window length is compared with the threshold; the point earlier in the time series than the previously detected start position at which the feature amount falls below the low-level threshold is set as the new start, and the point later in the time series than the previously detected end position at which the feature amount falls below the low-level threshold is set as the new end. The invention is described below on the basis of an embodiment.

As noted above, the level detection method, which detects the start and end of a voice section by comparing the energy (power, amplitude, etc.) of the voice signal with a preset threshold, has the merit of needing little hardware and having a clear detection principle. On the other hand, it cannot determine the start and end of the voice section accurately, and it suffers from malfunction due to background noise and failure to detect weak signals.

In contrast, the present invention provides a plurality of window lengths of different durations for the voice signal, and uses as the feature amount within each window length the sum of the absolute values of the differences between adjacent voice signal samples. A medium-duration window length is set to detect the start of the voice, and a long-duration window length is set to find the end, so that a provisional start and a provisional end are found using windows of different durations, each with its own threshold. A short-duration window length and its threshold are then used to find the accurate positions of the start and end, so that the start and end of the voice section are detected accurately.

FIG. 1 is a configuration diagram for explaining an embodiment of the voice section detection method according to the present invention. In the figure, 1 is an A/D conversion unit, 2 is a difference value calculation unit, 3 is a feature amount calculation unit for each window length, 4 and 5 are comparison units, 6 and 7 are threshold units, A is the input voice signal, and B is the output signal giving the frame numbers of the start and end of the voice section.

After A/D conversion, the difference between each sample of the input signal A and the preceding sample is calculated in sequence. Each difference value is converted to its absolute value, and the absolute values are summed over a predetermined number of samples to give the feature amount of one frame. Window lengths of different durations are set through the number of samples in one frame. In FIG. 1, for example, a window length of 20 ms is used to detect the start of the voice section, and the feature amount within that window length is denoted fsi. Similarly, a window length of 30 ms is used to detect the end, and the feature amount within that window length is denoted fei. Here i is the frame number. In addition, a window length of 10 ms is set so that the feature amount used to find the accurate start and end is more sensitive than fsi and fei; this feature amount is denoted fsei. The thresholds corresponding to the feature amounts fsi, fei, and fsei, which are used to find the start, the end, and their accurate positions, are TH, TL, and TE, respectively.

FIG. 2 is a waveform diagram for explaining the operating principle of the present invention. FIG. 2(a) shows the input signal waveform, FIG. 2(b) shows the feature amount fsi for a window length of 20 ms, FIG. 2(c) shows the feature amount fei for a window length of 30 ms, and FIG. 2(d) shows the feature amount fsei for a window length of 10 ms.

First, using the feature amount of FIG. 2(b), the frames are searched from the beginning of the input for one that exceeds TH, which gives a provisional start. TH is set somewhat high so that noise and the like are not picked up. In addition, so that sudden noise such as that shown in FIG. 2(b) is not picked up, the following processing is added: when, among a predetermined number of frames following the detected start, the number of frames exceeding the start-detection threshold does not reach a predetermined number, start detection is performed again from the point after the detected start point, and it is repeated until the number of such subsequent frames exceeds the predetermined number. As a result, the point at which the threshold is first exceeded is found as s″, but the above operation moves the start position to s′.

For end detection, the feature amount of FIG. 2(c) is similarly applied to the frames after s′. The frames following the start detection point s′ are searched for one that falls below the threshold TL. To prevent dropouts in the middle of a word, if a frame exceeding TL appears within a predetermined number of frames after the detected end, the end position is updated: the frames following the one that exceeds TL are searched for the next point that falls below TL. This processing is repeated until no frame exceeds TL within the predetermined number of subsequent frames, and the end is thereby determined. In FIG. 2(c), the point e″ first taken as the end is updated to e′ by this operation, so that no speech within the word is lost.

Once the start and end have been found from the feature amounts of FIGS. 2(b) and 2(c), the feature amount of FIG. 2(d) is used to find the point before s′ at which the feature amount falls below the threshold TE; this point is taken as the start s. Similarly, the point after the provisional end e′ at which the feature amount falls below TE is found and taken as the end e.

The feature amounts for the several window lengths may be calculated separately for each duration. Alternatively, only the feature amount of the short window length may be calculated and used as the basis: the sum of two adjacent feature amounts then serves as the medium-duration feature amount, and the sum of three consecutive feature amounts serves as the long-duration feature amount.

Effects

As is clear from the above description, the present invention makes it possible to detect the start and end of a voice section accurately and easily.
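The feature computation described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the circuit of FIG. 1: the function names `frame_features` and `widen` are hypothetical, short frames are assumed not to overlap, and the medium and long features are formed as sliding sums that advance one short frame at a time, a detail the description leaves open.

```python
def frame_features(samples, frame_len):
    """Per-frame feature: sum of |x[n] - x[n-1]| over frame_len differences.

    The first sample has no predecessor, so N samples give N - 1 differences.
    Frames do not overlap (an assumption); a trailing partial frame is dropped.
    """
    diffs = [abs(b - a) for a, b in zip(samples, samples[1:])]
    n_frames = len(diffs) // frame_len
    return [sum(diffs[i * frame_len:(i + 1) * frame_len]) for i in range(n_frames)]


def widen(short_feats, k):
    """Feature for a window k short frames long: sum of k consecutive short features.

    k = 2 gives the medium window and k = 3 the long window. The sum slides
    by one short frame at a time (an assumption).
    """
    return [sum(short_feats[i:i + k]) for i in range(len(short_feats) - k + 1)]


# Tiny example with 2 differences per short frame.
short = frame_features([0, 1, 3, 2, 2, 5, 4], frame_len=2)  # [3, 1, 4]
medium = widen(short, 2)                                     # [4, 5]
long_ = widen(short, 3)                                      # [8]
```

With the 10 ms, 20 ms, and 30 ms windows of the embodiment, `frame_len` would be the number of samples in 10 ms; at an assumed sampling rate of 8 kHz (the source gives no rate) that is 80 samples.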

Brief Description of the Drawings

FIG. 1 is a block diagram for explaining an embodiment of the present invention, and FIG. 2 is a waveform diagram for explaining the operating principle of the present invention.

1: A/D conversion unit, 2: difference value calculation unit, 3: feature amount calculation unit for each window length, 4, 5: comparison units, 6, 7: threshold units.

Continuation of the front page: (58) Fields searched (Int. Cl.6, DB name): G10L 3/00 - 9/18; JICST (JOIS)

Claims (1)

(57) [Claims]
1. A voice section detection method in which one frame is formed from a predetermined number of samples of an input signal and used as a window length serving as the unit of detection, and a feature amount within the window length is compared with a predetermined threshold to detect a voice section, characterized in that: a plurality of window lengths of different durations and a threshold corresponding to each window length are set; the start of the voice is found using a medium-duration window length and a medium-level threshold, by repeating detection until, among a predetermined number of frames following the point at which the medium-level threshold for start detection is exceeded, the number of frames exceeding said threshold for start detection exceeds a predetermined number; the end of the voice is found using a long-duration window length and a high-level threshold, by repeating detection until, among a predetermined number of frames following the point at which the feature amount falls below the high-level threshold for end detection, the number of frames below said threshold for end detection exceeds a predetermined number; and the start and end are corrected using a short-duration window length and a low-level threshold, by comparing the feature amount calculated within the short window length with the threshold, setting as the new start the point earlier in the time series than the previously detected start position at which the feature amount falls below the low-level threshold, and setting as the new end the point later in the time series than the previously detected end position at which the feature amount falls below the low-level threshold.
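The three stages recited in the claim can be sketched in Python as follows. This is only an illustration under several assumptions that the source does not state: the three feature sequences share one frame index and have equal length, the refinement walk begins at the provisional positions themselves, and the last frame is used when no confirmed end is found. All function and parameter names are hypothetical.

```python
def first_confirmed(feats, begin, hit, lookahead, min_count):
    """Index of the first frame at or after `begin` that satisfies `hit` and
    is followed, within `lookahead` frames, by more than `min_count` frames
    that also satisfy `hit`. Returns None if there is no such frame."""
    for i in range(begin, len(feats)):
        if hit(feats[i]):
            following = feats[i + 1:i + 1 + lookahead]
            if sum(1 for v in following if hit(v)) > min_count:
                return i
    return None


def refine(f_short, te, start, end):
    """Move start backward and end forward until the short-window feature
    falls below the low-level threshold te (or the signal edge is reached)."""
    s = start
    while s > 0 and f_short[s] >= te:
        s -= 1
    e = end
    while e < len(f_short) - 1 and f_short[e] >= te:
        e += 1
    return s, e


def detect_voice_section(f_medium, f_long, f_short, th, tl, te,
                         lookahead, min_above, min_below):
    """Return (start, end) frame indices, or None if no start is confirmed."""
    # Stage 1: provisional start from the medium window and threshold th.
    s1 = first_confirmed(f_medium, 0, lambda v: v > th, lookahead, min_above)
    if s1 is None:
        return None
    # Stage 2: provisional end from the long window and threshold tl.
    e1 = first_confirmed(f_long, s1 + 1, lambda v: v < tl, lookahead, min_below)
    if e1 is None:
        e1 = len(f_long) - 1  # assumption: fall back to the last frame
    # Stage 3: correction with the short window and threshold te.
    return refine(f_short, te, s1, e1)


# Synthetic features: a noise burst at frame 1 and a mid-word dip at frame 8.
f_medium = [1, 9, 1, 1, 4, 9, 9, 9, 9, 9, 9, 5, 1, 1, 1, 1]
f_long = [1, 9, 2, 2, 6, 12, 12, 12, 4, 12, 12, 4, 2, 1, 1, 1]
f_short = [0, 5, 0, 0, 1, 3, 5, 5, 2, 5, 5, 3, 2, 0, 0, 0]
section = detect_voice_section(f_medium, f_long, f_short, th=8, tl=5, te=2,
                               lookahead=3, min_above=1, min_below=2)
```

In this synthetic example the burst at frame 1 is rejected because none of the next three frames exceeds `th`, and the dip at frame 8 is not taken as the end because only one of the next three frames stays below `tl`.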
JP62179564A 1987-07-17 1987-07-17 Voice section detection method Expired - Fee Related JP2807457B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP62179564A JP2807457B2 (en) 1987-07-17 1987-07-17 Voice section detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP62179564A JP2807457B2 (en) 1987-07-17 1987-07-17 Voice section detection method

Publications (2)

Publication Number Publication Date
JPS6423296A (en) 1989-01-25
JP2807457B2 (en) 1998-10-08

Family

ID=16067938

Family Applications (1)

Application Number Title Priority Date Filing Date
JP62179564A Expired - Fee Related JP2807457B2 (en) 1987-07-17 1987-07-17 Voice section detection method

Country Status (1)

Country Link
JP (1) JP2807457B2 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070083365A1 (en) * 2005-10-06 2007-04-12 Dts, Inc. Neural network classifier for separating audio sources from a monophonic audio signal
CN108848435B (en) * 2018-09-28 2021-03-09 广州方硅信息技术有限公司 Audio signal processing method and related device
CN112908301A (en) * 2021-01-27 2021-06-04 科大讯飞(上海)科技有限公司 Voice recognition method, device, storage medium and equipment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS6147000A (en) * 1984-08-10 1986-03-07 ブラザー工業株式会社 Voice head detector
JPH0693199B2 (en) * 1985-07-08 1994-11-16 日本電気株式会社 Voice analyzer

Also Published As

Publication number Publication date
JPS6423296A (en) 1989-01-25

Similar Documents

Publication Publication Date Title
EP0077194B1 (en) Speech recognition system
US5617508A (en) Speech detection device for the detection of speech end points based on variance of frequency band limited energy
CA1246228A (en) Endpoint detector
US5579431A (en) Speech detection in presence of noise by determining variance over time of frequency band limited energy
JPS5844500A (en) Voice recognition system
JP2807457B2 (en) Voice section detection method
US4783805A (en) System for converting a voice signal to a pitch signal
Hahn et al. An improved speech detection algorithm for isolated Korean utterances
CA1147071A (en) Method of and apparatus for detecting speech in a voice channel signal
JPS5984300A (en) Voice section detecting circuit
JP3360978B2 (en) Voice recognition device
JPS61223796A (en) Voice section detection system
JP3031081B2 (en) Voice recognition device
JPH052157B2 (en)
JPS6147000A (en) Voice head detector
JPS6146998A (en) Voice head detector
JP2951333B2 (en) Audio signal section discrimination method
JPS5834986B2 (en) Adaptive voice detection circuit
JPS61273596A (en) Voice section detection system
JP3423233B2 (en) Audio signal processing method and apparatus
JPS59149400A (en) Syllable boundary selection system
JPS62237498A (en) Voice section detecting method
JPS6247319B2 (en)
JPS61269197A (en) Voice section detection system
JPH0376471B2 (en)

Legal Events

Date Code Title Description
LAPS Cancellation because of no payment of annual fees