JPS63235999A

JPS63235999A - Voice initial end detector

Info

Publication number: JPS63235999A
Application number: JP62069775A
Authority: JP
Inventors: 丹羽　美幸
Original assignee: Brother Industries Ltd
Current assignee: Brother Industries Ltd
Priority date: 1987-03-24
Filing date: 1987-03-24
Publication date: 1988-09-30

Abstract

(57) [Abstract] This publication contains application data filed before electronic filing was introduced, so no abstract data is recorded.

Description

[Detailed Description of the Invention] [Field of Industrial Application] The present invention relates to a speech start detection device that detects the time point corresponding to the start of a speech region by sampling and otherwise processing the amplitude of a signal containing speech.

[Prior Art] Conventional speech start detection commonly uses the energy value of each section of the signal. One example is the technique described in Japanese Patent Application Laid-Open No. 61-46999. In that technique, two fixed thresholds are used to extract a run of a fixed number of consecutive sections as a start detection interval, the average energy in that interval is taken as a third threshold, and the search proceeds backward in time within the start detection interval for the point at which the energy falls below the third threshold without again exceeding the lower of the two fixed thresholds.

A method that uses both the energy value and the zero-crossing count of each section is described in Yasunaga Niimi, Speech Recognition (音声認識), Kyoritsu Shuppan (1979). It uses three thresholds on the per-section energy value and zero-crossing count, obtained in advance by analyzing the non-speech portion.

In addition, as described in Japanese Patent Application Laid-Open No. 60-200300, some methods use dynamic factors such as the change in energy and the change in spectrum from section to section.

[Problems to Be Solved by the Invention] In speech generally, and particularly immediately before a plosive, a nasal-like signal called a buzz, with a low fundamental frequency and relatively low energy, is sometimes produced.

Even for the same speaker, the buzz occurs in some utterances and not in others. Therefore, particularly for speech recognition, the reference parameters must be extracted from speech signals that contain no buzz, and when a buzz occurs in the speech signal to be recognized, that portion must be removed beforehand. When extracting speech signals for speech recognition and similar uses, it is therefore desirable to treat the buzz automatically as non-speech and remove it in advance.

With the technique of the aforementioned Japanese Patent Application Laid-Open No. 61-46999, when a buzz occurs, the start of the buzz is detected as the start of the speech, so the buzz cannot be removed.

In the technique described in Niimi, Speech Recognition, Kyoritsu Shuppan (1979), the thresholds are obtained by analyzing the non-speech portion, so the buzz could be removed by extracting the buzz portion as the non-speech portion. Extracting the buzz portion is difficult, however, which makes this approach impractical.

The technique of Japanese Patent Application Laid-Open No. 60-200300 requires spectral or similar analysis, which has the drawbacks of making the processing difficult and the device complex.

[Object of the Invention] The present invention was made in view of the above drawbacks. Its main object is to provide a speech start detection device that is unaffected by the buzz and extracts speech signals suitable for processing such as speech recognition.

[Means for Solving the Problems] The present invention is a speech start detection device comprising: sampling means for sampling the amplitude of an input signal containing speech at fixed time intervals and converting it into a sequence of amplitude values; calculation means for dividing the sequence of amplitude values into a plurality of sections and calculating, for each section, an energy-related value related to the energy value of the input signal; and speech start determining means that includes comparison means for comparing at least the energy-related value of the input signal with a threshold for that value, and that determines the start of the speech by taking the comparison result into account. The device is characterized by further comprising: extraction means for extracting the consecutive sections identified as the first syllable of the speech; reference value determining means for determining a reference value that varies with increases and decreases in the energy-related values of the sections extracted by the extraction means; and threshold determining means for determining the threshold in relation to the reference value.

[Operation] In the present invention, the signal containing speech is converted by the sampling means into a sequence of amplitude values, the calculation means divides that sequence into a plurality of sections, and the energy-related value of each section is calculated.

The extraction means then extracts the sections of the first syllable of the speech contained in the signal. The reference value determining means derives a reference value that varies with the energy-related values of the extracted sections, and the threshold determining means calculates the threshold in relation to that reference value.

The start determining means then compares the energy-related value of each section with the threshold and detects the start of the speech by taking the comparison result into account.

In general, the energy-related value of the buzz produced by a single speaker is roughly constant and does not vary as much as the energy-related value of speech. At the same time, the speech signals used for speech recognition and the like are adjusted so that the level of their energy-related value is roughly constant. Setting the threshold in relation to the energy-related value of the speech that follows therefore makes it possible to separate the buzz from the speech well enough that no practical problem arises.

[Embodiment] An embodiment of the present invention is described in detail below with reference to FIGS. 2 and 3.

FIG. 2 is a block diagram of the speech start detection device of this embodiment, built around a general-purpose central processing unit (hereinafter, CPU) 15. A microphone 11, which picks up acoustic information including the speaker's voice and converts it into an electrical signal, is connected to the input terminal of an amplifier 12. The amplifier 12 amplifies the electrical signal from the microphone 11 to a level suitable for the subsequent processing. An analog low-pass filter 13 is connected to the output of the amplifier 12. The filter 13 has its cutoff frequency set to 4 kHz and blocks signal components above that frequency. An A/D converter 14, corresponding to the sampling means, is connected to the output of the filter 13, and the output terminal of the A/D converter 14 is connected to the CPU 15. Connected to the CPU 15 are a read-only memory (hereinafter, ROM) 16, which stores the program that determines the procedure of each process described below together with various constants, and a random-access memory (hereinafter, RAM) 17. The RAM 17 contains an amplitude buffer 17a, into which the amplitude values of the speech waveform sampled by the A/D converter 14 are written in sequence; an energy-related value buffer 17b, into which the energy-related values are written in sequence; a first pointer register 17c and a second pointer register 17d, each able to hold an arbitrary integer; a threshold register 17e, able to hold a threshold; and a working area for the processes described below.

The operation of the speech start detection device of this embodiment, configured as above, is described in detail below with reference to the flowchart of FIG. 3.

The speaker's voice is picked up by the microphone 11 and converted into an electrical signal. The amplifier 12 amplifies this signal to a level suitable for the processing described below, and the analog low-pass filter 13 blocks the components of the amplified signal at 4 kHz and above. These operations correspond to step 20.

The output of the analog low-pass filter 13 is input to the A/D converter 14, which corresponds to the sampling means. The A/D converter 14 samples the input signal at a sampling frequency of 8 kHz and outputs an amplitude value every 125 microseconds. By the sampling theorem, these amplitude values retain all the information in the input signal up to 4 kHz. The i-th amplitude value, that is, the amplitude 125 × (i − 1) microseconds after the start of sampling, is hereinafter written A(i). Here i runs from 1 to n, where n is the total number of sampled amplitude values. The values A(1) through A(n) are written in sequence into the amplitude buffer 17a. This processing is performed in step 21.

Next, in step 22, the CPU 15 divides the n amplitude values into sections of 64 values each, that is, 8 milliseconds of amplitude information per section, and calculates the energy-related value of each section. The energy-related value need not be an energy value in the strict sense; any quantity that behaves reasonably like an energy value will do. This embodiment uses the sum of the absolute values of the 64 amplitude values in each section. That is, writing E(j) for the j-th energy-related value, it is given by the following formula.

$$E(j) = \sum_{q=64j-63}^{64j} |A(q)|$$

Here j is an integer whose maximum value is the quotient of n divided by 64, rounded up when there is a remainder. This value of j is hereinafter called the section number of the section.

This processing corresponds to the operation of the calculation means.
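The per-section calculation of step 22 can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `energy_related_values` and the use of a plain list for the amplitude buffer are hypothetical, and the final section is simply allowed to be shorter than 64 samples when n is not a multiple of 64.

```python
from typing import List, Sequence

SECTION_LEN = 64  # samples per section (8 ms at 8 kHz), as in the embodiment


def energy_related_values(amplitudes: Sequence[int]) -> List[int]:
    """Return E(1)..E(m) as a 0-based list.

    Each entry is the sum of |A(q)| over one 64-sample section.
    The last section may hold fewer than 64 samples (an assumption).
    """
    values = []
    for start in range(0, len(amplitudes), SECTION_LEN):
        section = amplitudes[start:start + SECTION_LEN]
        values.append(sum(abs(a) for a in section))
    return values
```

For 130 samples the list has three entries, matching the rounded-up quotient of 130 / 64.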

Next, in step 23, the values of E(j) are compared, in order from j = 1, with a preset vowel detection threshold. The first run of sections in which E(j) exceeds the vowel detection threshold at least two times in succession is extracted; the section number of the first section in the run is stored in the first pointer register 17c, and the section number of the last section in the run is stored in the second pointer register 17d. This processing corresponds to the operation of the extraction means.
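The extraction in step 23 amounts to finding the first run of at least two consecutive sections above the vowel detection threshold. A minimal sketch follows; the function name, the `min_run` parameter, and the choice to return `None` when no run exists are assumptions made for illustration, since the text does not say what happens in that case.

```python
from typing import Optional, Sequence, Tuple


def extract_first_syllable(energy: Sequence[float],
                           vowel_threshold: float,
                           min_run: int = 2) -> Optional[Tuple[int, int]]:
    """Return 1-based (first, last) section numbers of the first run of at
    least `min_run` consecutive sections whose value exceeds the threshold.
    Returns None if there is no such run (hypothetical behaviour)."""
    run_start = None  # 0-based index where the current run began
    for idx, value in enumerate(energy):
        if value > vowel_threshold:
            if run_start is None:
                run_start = idx
        else:
            if run_start is not None and idx - run_start >= min_run:
                return run_start + 1, idx  # idx is the 1-based last section
            run_start = None
    if run_start is not None and len(energy) - run_start >= min_run:
        return run_start + 1, len(energy)
    return None
```

The two returned numbers play the role of the contents of the first and second pointer registers.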

Next, in step 24, the energy-related value of each section extracted in step 23 is read from the energy-related value buffer 17b and their maximum is found. This processing corresponds to the operation of the reference value determining means, and the maximum serves as the reference value.

Next, in step 25, the reference value obtained in step 24 is multiplied by a preset constant α, and the product is stored in the threshold register 17e.

This product is the threshold, hereinafter written Te. The threshold Te is preferably set slightly below the maximum energy-related value of the buzz; this embodiment achieves that by setting α to about 0.13. This processing corresponds to the operation of the threshold determining means.
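Steps 24 and 25 together reduce to Te = α × max E(j) over the extracted sections. The sketch below assumes 1-based, inclusive section numbers and a 0-based list of energy-related values; the function name is hypothetical.

```python
from typing import Sequence

ALPHA = 0.13  # constant used in the embodiment


def energy_threshold(energy: Sequence[float], first: int, last: int,
                     alpha: float = ALPHA) -> float:
    """Return Te = alpha * max(E(first)..E(last)).

    `first` and `last` are 1-based inclusive section numbers, as held in
    the first and second pointer registers.
    """
    reference = max(energy[first - 1:last])  # reference value (step 24)
    return alpha * reference                 # threshold Te (step 25)
```

With a reference value of 200, for example, the threshold comes out at about 26, so sections well below the syllable peak fall under it.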

Next, in step 30, the value of the first pointer register 17c is read, 1 is subtracted from it, and the result is stored back in the first pointer register 17c. The value held in this pointer register is hereinafter written k; in other words, this processing assigns k − 1 to k.

The value of k is the section number of the section processed in steps 31 through 33 below. Next, in step 31, the threshold Te obtained in step 25 is compared with the energy-related value E(k) of the section indicated by the first pointer register 17c. If E(k) ≥ Te, the process returns to step 30; if E(k) < Te, it proceeds to step 32.

Step 31 corresponds to the operation of the comparison means.

In step 32, the zero-crossing count of the section indicated by the first pointer register 17c is calculated. The calculation method is described in Japanese Patent Application Laid-Open No. 60-117299 and elsewhere, so the details are omitted here.
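The method of the cited publication is not reproduced here. Purely as an illustration, a zero-crossing count can be obtained by counting sign changes between successive non-zero samples of a section; this simple definition, and the function name, are assumptions and may differ from the cited method.

```python
from typing import Sequence


def zero_crossings(section: Sequence[int]) -> int:
    """Count sign changes between consecutive non-zero samples.

    This is a generic definition assumed for illustration, not the
    method of the publication cited in the text.
    """
    count = 0
    previous = 0  # last non-zero sample seen so far
    for sample in section:
        if sample == 0:
            continue  # zeros carry no sign; skip them
        if previous != 0 and (sample > 0) != (previous > 0):
            count += 1
        previous = sample
    return count
```

A low-pitched signal such as the buzz changes sign only a few times in an 8 ms section, which is why a small count is informative.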

The zero-crossing count obtained in step 32 is written Z(k). Next, in step 33, Z(k) is compared with a preset threshold for the zero-crossing count, which is 4 in this embodiment. Writing this threshold as Tz, if Z(k) ≥ Tz the process returns to step 30, and if Z(k) < Tz it proceeds to step 34. In summary, steps 30 through 33 search backward in time from the section immediately preceding the run of sections obtained in step 23, find the first section for which E(k) < Te and Z(k) < Tz, and leave its section number in the first pointer register 17c before proceeding to step 34. In step 34, 1 is added to the content of the first pointer register 17c at that point, and the section whose section number is that sum is taken as the start of the speech. The processing of steps 30 through 34 corresponds to the operation of the speech start determining means.
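The backward search of steps 30 through 34 can be sketched as a single loop. The sketch assumes the energy-related values and zero-crossing counts are already available as 0-based lists indexed by section, and it returns section 1 if the search runs off the beginning of the signal; the text does not specify that boundary case, so that behaviour and the function name are assumptions.

```python
from typing import Sequence


def find_speech_start(energy: Sequence[float], zero_cross: Sequence[int],
                      first: int, te: float, tz: int = 4) -> int:
    """Return the 1-based section number taken as the start of the speech.

    `first` is the section number in the first pointer register after
    step 23; `te` and `tz` are the energy and zero-crossing thresholds.
    """
    k = first - 1                      # step 30: k <- k - 1
    while k >= 1:
        below_energy = energy[k - 1] < te       # step 31
        below_zc = zero_cross[k - 1] < tz       # steps 32 and 33
        if below_energy and below_zc:
            break                      # first section with E(k) < Te and Z(k) < Tz
        k -= 1                         # otherwise back to step 30
    return k + 1                       # step 34: start is section k + 1


# Example: sections 5..7 were extracted as the first syllable.
energy = [2, 3, 20, 30, 100, 200, 150]
zero_cross = [1, 2, 2, 9, 5, 5, 5]
start = find_speech_start(energy, zero_cross, first=5, te=26.0)
```

In the example, section 4 still exceeds Te, section 3 is below both thresholds, and so the start is placed at section 4.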

Because this embodiment also takes into account the zero-crossing count, which differs relatively strongly between the buzz and speech, the start can be detected more accurately.

The present invention is not limited to the above embodiment. For example, the sum of absolute values used as the energy-related value in this embodiment may be replaced by the commonly used short-time energy given by the following formula.

$$E'(j) = \sum_{q=64j-63}^{64j} [A(q)]^2$$

The reference value was taken to be the maximum of the energy-related values of the sections extracted by the extraction means, but it may instead be the average of the values near the section that has the maximum, the mode of the energy-related values of the extracted sections, or the like. In that case, however, the value of the constant α differs from the value in this embodiment. Furthermore, this embodiment uses the variation in the zero-crossing count to determine the start, but changes in the short-time spectrum and the like can also be used.

[Effects of the Invention] The present invention determines the threshold from a reference value that varies with the input signal and detects the start of the speech on the basis of that threshold, so unwanted signals such as a buzz occurring immediately before the speech signal can be removed accurately. As a result, using the speech start detection device of the present invention together with an existing speech end detection device makes it possible to reliably extract speech information suitable for processing such as speech recognition.

[Brief Description of the Drawings]

FIG. 1 is a flowchart showing the overall operation of the present invention, FIG. 2 is a block diagram showing the configuration of an embodiment of the present invention, and FIG. 3 is a flowchart showing the operation of an embodiment of the present invention. In the figures, 14 is the A/D converter, 15 is the CPU, 16 is the ROM, and 17 is the RAM. Further, 21 is the step corresponding to the sampling means, 22 is the step corresponding to the calculation means, 23 is the step corresponding to the extraction means, 24 is the step corresponding to the reference value determining means, 25 is the step corresponding to the threshold determining means, and 30, 31, 32, 33, and 34 are the steps corresponding to the speech start determining means, of which 31 is the step corresponding to the comparison means.

Claims

[Claims] 1. A speech start detection device comprising: sampling means (14) for sampling the amplitude of an input signal containing speech at fixed time intervals and converting it into a sequence of amplitude values; calculation means (22) for dividing the sequence of amplitude values into a plurality of sections and calculating, for each section, an energy-related value related to the energy value of the input signal; and speech start determining means (30, 31, 32, 33, 34) that includes comparison means (31) for comparing at least the energy-related value of the input signal with a threshold for that value and that determines the start of the speech by taking the comparison result into account, the device being characterized by comprising: extraction means (23) for extracting the consecutive sections identified as the first syllable of the speech; reference value determining means (24) for determining a reference value that varies with increases and decreases in the energy-related values of the sections extracted by the extraction means; and threshold determining means (25) for determining the threshold in relation to the reference value.
2. The speech start detection device according to claim 1, characterized in that the reference value determining means (24) determines, as the reference value, the highest of the energy-related values of the sections extracted by the extraction means (23), or the average of the energy-related values of the sections near the section having that highest value.