JPH01307800A

JPH01307800A - Voice detecting method

Info

Publication number: JPH01307800A
Application number: JP63137395A
Authority: JP
Inventors: Kazuhiro Gomi; 五味　和洋; Yutaka Nishino; 豊西野
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1988-06-06
Filing date: 1988-06-06
Publication date: 1989-12-12

Abstract

PURPOSE: To improve voice detection performance by regarding the rising point of the envelope waveform, at which a newly read level exceeds the level read one sampling period earlier by at least a constant value, as the start point of a voice section, and by using the sum of the previously read waveform level and a constant value as the voice detection threshold. CONSTITUTION: The rising point of the envelope waveform, at which the newly read level exceeds the level read one sampling period earlier by at least a constant value, is regarded as the start point of the voice section, and the value obtained by adding a constant value, on a logarithmic scale, to the envelope waveform level read one sampling period earlier is used as the voice detection threshold. The noise level of the input signal can therefore be measured without providing a dedicated noise level measurement period, and the voice detection threshold is set to a value adapted to the noise level. Consequently, accurate voice detection is performed at all times.

Description

[Detailed Description of the Invention]

[Industrial Field of Application]

This invention relates to a voice detection method needed to distinguish voice sections from silent sections of an input signal under conditions in which the voice level and noise level of the input signal vary widely, as on a telephone line.

... Publication), natural dialogue is realized by detecting the end of the user's utterance and then starting to send the next response message. An interactive answering machine has also been proposed that further improves the naturalness of the dialogue by estimating the content of the user's utterance from the length of the user's speech, or from the presence or absence of a user reaction after a response message has been sent (the user's speech in reply to the response message), and then selecting the content of the next response message accordingly.

FIG. 6 shows the operation flow of the interactive answering machine described above, and FIGS. 7(a), (b), and (c) show the processing flows for measuring the user's voice length, checking for a user reaction to a response message, and detecting the end of an utterance, respectively.

The operation of this interactive answering machine is briefly described next.

First, when a call arrives, the machine automatically closes the line, sends the first response message, and records the user's message until the end of the utterance is detected. It then measures the voice length of the user's message. If the voice length is judged to be short, it sends the second response message, "Yes", to prompt the caller to continue speaking, and then records and measures the voice length again. If the voice length is judged to be long, it sends the fourth response message to prompt the caller to state their business.

If the voice length is judged to be medium, the machine sends the second response message, checks whether the user reacts, and, depending on the reaction, sends either the third response message, which asks the caller to state their name, or the fourth response message. When the third response message has been sent, the machine records the user's message and then sends the fourth response message.

Measuring the user's voice length, checking for a user reaction to a response message, and detecting the end of an utterance all rely on voice detection. For example, the user's voice length is measured as shown in FIG. 7(a).

When the voice detection operation starts, the voiced timer and the silence timer are reset, and both timers are then driven by the voice detection result. When voice is detected, the voiced timer starts counting; when the signal becomes silent, the voiced timer stops and the silence timer starts counting. If voice is detected again within a predetermined time (the utterance-end decision threshold), the silence timer is reset and the voiced timer resumes counting. This is repeated, and when silence continues for at least the utterance-end decision threshold, the utterance is regarded as finished and the voiced timer's count up to that point is taken as the voice length.
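The timer procedure above can be sketched as follows. This is a minimal illustration rather than the actual flow of FIG. 7(a): it assumes that the voice detection result arrives as one boolean per fixed-length frame, that both timers count frames, and that the silence timer only runs once voice has been detected at least once. The names `measure_voice_length`, `voice_flags`, and `end_threshold` are hypothetical.

```python
def measure_voice_length(voice_flags, end_threshold):
    """Return the voiced-timer count (in frames) when the utterance ends.

    voice_flags   -- iterable of booleans, one voice detection result per frame
    end_threshold -- utterance-end decision threshold, in frames of silence
    """
    voiced_timer = 0   # total frames judged to be voice
    silence_timer = 0  # consecutive silent frames since voice last stopped
    started = False    # assumption: silence before the first voice is ignored

    for is_voice in voice_flags:
        if is_voice:
            started = True
            silence_timer = 0      # voice resumed: reset the silence timer
            voiced_timer += 1      # and continue counting voice
        elif started:
            silence_timer += 1
            if silence_timer >= end_threshold:
                break              # silence lasted long enough: utterance ended
    return voiced_timer


flags = [False, True, True, False, True, True, True, False, False, False, True]
print(measure_voice_length(flags, end_threshold=3))
```

In this example the single silent frame in the middle does not end the utterance, while the run of three silent frames does, so the trailing voiced frame is not counted.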

For answering machines that record business messages or response messages in IC memory, a recording method has been proposed that extends the effective recording time by recording only the voice sections of the input signal and skipping the silent sections (for example, Japanese Patent Laid-Open No. 62-47329).

The voice detection method generally used in such answering machines extracts the envelope waveform of the input signal and compares the envelope waveform level with a voice detection threshold. The signal is judged to be voiced while the envelope waveform level is above the threshold and silent while it is below the threshold.

The envelope waveform can be extracted either by reading the input signal into a CPU and extracting the envelope by internal processing, or by placing an envelope extraction circuit, consisting of a rectifier circuit and a low-pass filter, ahead of the CPU. The power variation of the input signal is also sometimes treated as equivalent to the envelope waveform.
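As a rough software analogue of the rectifier and low-pass filter, the sketch below takes the absolute value of each sample, smooths it with a one-pole filter, and then applies the fixed-threshold comparison described earlier. The filter type, the smoothing coefficient `alpha`, and the function names are assumptions made for illustration; the source does not specify them.

```python
def extract_envelope(samples, alpha=0.05):
    """Full-wave rectify each sample, then smooth with a one-pole low-pass filter."""
    envelope = []
    level = 0.0
    for s in samples:
        level += alpha * (abs(s) - level)  # move a fraction alpha toward |s|
        envelope.append(level)
    return envelope


def is_voiced(envelope_level, threshold):
    """Fixed-threshold decision: voiced while the envelope is above the threshold."""
    return envelope_level > threshold


env = extract_envelope([1.0, -1.0, 1.0, -1.0], alpha=0.5)
print(env, [is_voiced(level, 0.8) for level in env])
```

A smaller `alpha` gives a smoother but slower envelope, which corresponds to a lower cutoff frequency in the analogue circuit.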

The performance of this voice detection method is determined mainly by how the voice detection threshold is chosen. In particular, when a telephone line is the input signal source, the voice level and the noise level vary widely with the line condition and the environment in which the far-end telephone is installed. As for the voice level, the design standard for telephone transmission paths allows a loss of up to 25 dB between terminals.

In addition, the speaking level of the far-end user varies from person to person. As for the noise level, the ambient noise level is around 40 dB SPL when the far-end telephone is in a quiet room such as a bedroom, whereas it can exceed 80 dB SPL when the far-end telephone is a public telephone installed on the street.

For accurate voice detection, the voice detection threshold must always be set above the noise level. If the threshold is too high, however, low-level voice sections may go undetected. The threshold should therefore be above the noise level, but preferably only slightly above it.

When the CPU compares the envelope waveform of the input signal with the voice detection threshold by internal processing, it is easy to measure the noise level and derive the threshold from it. One approach is therefore to set aside a noise level measurement period, during which the user is unlikely to speak and only noise is measured, and to determine the voice detection threshold from the noise level measured in that period. Because the noise level fluctuates somewhat over time, the measurement period is often made fairly long and the average over the period is used as the noise level.
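This conventional approach can be illustrated with a short sketch. It assumes that envelope levels are already expressed in dB and that the threshold is placed a fixed margin above the averaged noise level; the margin value and the names `threshold_from_noise_period` and `margin_db` are hypothetical, since the source only says that the threshold is derived from the measured noise level.

```python
def threshold_from_noise_period(levels_db, margin_db=6.0):
    """Average the levels seen during the measurement period and add a margin."""
    noise_db = sum(levels_db) / len(levels_db)  # mean noise level over the period
    return noise_db + margin_db


# Levels observed while the user is assumed to be silent.
print(threshold_from_noise_period([40.0, 42.0, 38.0]))
```

If the user speaks during the period, the mean rises and the threshold is set too high, which is the weakness the invention addresses.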

[Problems to be Solved by the Invention]

Depending on the service, however, it may not be possible to provide such a dedicated noise level measurement period. An interactive answering machine, for example, responds naturally so that the user does not feel awkward about talking to a machine, so voice messages are exchanged between the user and the answering machine without pause.

This makes it difficult to identify a period during which the user is unlikely to speak. Even if a noise level measurement period is provided, there is no guarantee that the user will stay silent during it; if the user does speak, the observed noise level is higher than the actual one and the noise level cannot be measured accurately. Sending a guidance message such as "Please do not speak for a moment" to suppress the user's speech is conceivable, but this restricts the user's freedom and degrades the service.

Furthermore, on a telephone line the noise level may change over time, depending on the installation conditions of the far-end telephone and the state of the line. In that case, a voice detection threshold derived from the noise level measured in a single measurement period inevitably degrades voice detection performance.

This invention was made to solve the above problems, and its object is to provide a voice detection method that always detects voice accurately regardless of the noise level.

[Means to solve the problem]

In the voice detection method according to claim (1) of this invention, the envelope waveform level read one sampling period earlier is stored, and the CPU compares each newly read envelope waveform level with it. The rising point of the envelope waveform, at which the newly read level exceeds the level read one sampling period earlier by at least a constant value (Vdif), is regarded as the start point of a voice section, and the value obtained by adding a constant value (Voff-a), on a logarithmic scale, to the envelope waveform level read one sampling period earlier is used as the voice detection threshold (Vth).

In the voice detection method according to claim (2), when the envelope waveform level rises above the voice detection threshold (Vth) and then falls below it, detection of a rise in the envelope waveform is started again. The point at which a rise is detected is regarded as the start point of a new voice section, and the value obtained by adding the constant value (Voff-a), on a logarithmic scale, to the envelope waveform level read one sampling period earlier is used as the new voice detection threshold (Vth).

In the voice detection method according to claim (3), the value obtained by adding a constant value (Voff-b), on a logarithmic scale, to the envelope waveform level read one sampling period earlier is used as the constant value (Vdif).

In the voice detection method according to claim (4), when the envelope waveform level does not exceed the voice detection threshold (Vth) within a fixed time (Tedge) after the envelope waveform rises, the section from the rise up to that point is regarded as a silent section and detection of a rise in the envelope waveform is started again. The point at which a rise is detected is regarded as the start point of a new voice section, and the value obtained by adding the constant value (Voff-a), on a logarithmic scale, to the envelope waveform level read one sampling period earlier is used as the new voice detection threshold (Vth).

In the voice detection method according to claim (5), when the envelope waveform level rises, exceeds the voice detection threshold (Vth) within the fixed time (Tedge), and then falls below the voice detection threshold (Vth) again within the minimum voiced duration (Tvc), that section is regarded as a silent section.

In the voice detection method according to claim (6), when the envelope waveform rises, the envelope waveform level exceeds the voice detection threshold (Vth) within the fixed time (Tedge), and the level then falls below the voice detection threshold (Vth) again after the minimum voiced duration (Tvc) has elapsed, the value of that voice detection threshold (Vth) is stored. If a rise of the envelope waveform is detected again within the minimum silence duration (Tnvc) after that, the stored value is used as the voice detection threshold (Vth).

[Effect]

In the voice detection method of claim (1) of this invention, the rising point of the envelope waveform, at which the newly read level exceeds the level read one sampling period earlier by at least the constant value (Vdif), is regarded as the start point of a voice section, and the value obtained by adding the constant value (Voff-a), on a logarithmic scale, to the envelope waveform level read one sampling period earlier is used as the voice detection threshold (Vth).

In other words, the noise level of the input signal can be measured without providing a dedicated noise level measurement period, and the voice detection threshold (Vth) is set to a value adapted to the noise level.

In the voice detection method of claim (2), when the envelope waveform level rises above the voice detection threshold (Vth) and then falls below it, detection of a rise in the envelope waveform is started again. The point at which a rise is detected is regarded as the start point of a new voice section, and the value obtained by adding the constant value (Voff-a), on a logarithmic scale, to the envelope waveform level read one sampling period earlier is used as the new voice detection threshold (Vth).

In other words, when it turns out that noise was mistakenly detected as voice, or when the end of a voice section is detected, the noise level is measured anew and the voice detection threshold (Vth) is set again at the next detected start of a voice section.

In the voice detection method of claim (3), the value obtained by adding the constant value (Voff-b), on a logarithmic scale, to the envelope waveform level read one sampling period earlier is used as Vdif.

In other words, the value against which the envelope waveform level difference is compared when detecting the start of voice is also adapted to the noise level at that time.

In the voice detection method of claim (4), when the envelope waveform level does not exceed the voice detection threshold (Vth) within the fixed time (Tedge) after the envelope waveform rises, the section from the rise up to that point is regarded as a silent section and detection of a rise in the envelope waveform is started again. The point at which a rise is detected is regarded as the start point of a new voice section, and the value obtained by adding the constant value (Voff-a), on a logarithmic scale, to the envelope waveform level read one sampling period earlier is used as the new voice detection threshold (Vth).

In the voice detection method of claim (5), when the envelope waveform level rises, exceeds the voice detection threshold (Vth) within the fixed time (Tedge), and then falls below the voice detection threshold (Vth) again within the minimum voiced duration (Tvc), that section is regarded as a silent section.

In other words, even when noise is mistakenly detected as the start of voice, the change of the envelope waveform after the start is detected, or after the voice detection threshold (Vth) is exceeded, differs from that of a voice section. Once the section is found not to be a voice section, it is judged to be a silent section.

In the voice detection method of claim (6), when the envelope waveform rises, the envelope waveform level exceeds the voice detection threshold (Vth) within the fixed time (Tedge), and the level then falls below the voice detection threshold (Vth) again after the minimum voiced duration (Tvc) has elapsed, the value of that voice detection threshold (Vth) is stored. If a rise of the envelope waveform is detected again within the minimum silence duration (Tnvc) after that, the stored value is used as the voice detection threshold (Vth).

In other words, when the envelope waveform level momentarily drops below the voice detection threshold during a voice section, the voice detection threshold (Vth) is not set anew; the previous value continues to be used.

[Example]

Looking at the envelope waveform of the input signal, a voiced section has a higher level than the sections before and after it, provided a certain S/N is maintained. The rising point of the envelope waveform is therefore regarded as the start point of a voice section. Specifically, the envelope waveform is input at a constant sampling period (let X_n denote the envelope waveform level input to the CPU, where n is the index of the sample in time order), the envelope waveform level of one sampling period earlier is held in a register, and the value of n that satisfies

$$X_n - X_{n-1} \geq V_{dif} \qquad (1)$$

(where Vdif is a constant) is taken as the rising point of the envelope waveform. How the envelope waveform rises depends on the type of the first syllable of the voice section. When the first syllable contains an unvoiced consonant (a consonant such as the S, P, or K sound), the envelope waveform rises gradually; otherwise it rises steeply.

Therefore, for the start of voice to be detected accurately even when the first syllable contains an unvoiced consonant (a consonant such as the S, P, or K sound), Vdif must be set to a sufficiently small value.
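The rising-edge test of equation (1) can be sketched as follows. This is only an illustration with a fixed Vdif; the function name `find_rising_edges` and the sample values are hypothetical, and the levels are assumed to be on a linear scale.

```python
def find_rising_edges(levels, v_dif):
    """Return every index n at which levels[n] - levels[n-1] >= v_dif (equation (1))."""
    return [n for n in range(1, len(levels)) if levels[n] - levels[n - 1] >= v_dif]


envelope = [1, 1, 2, 5, 5, 9]
print(find_rising_edges(envelope, v_dif=3))  # steep rises only
print(find_rising_edges(envelope, v_dif=1))  # smaller Vdif also catches the gradual rise
```

Lowering `v_dif` makes the detector sensitive to gradual onsets, at the cost of reacting to smaller noise fluctuations as well.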

When a rising point is detected in this way, the section immediately before the rise is a silent section, so the envelope waveform level just before the rise can be taken as representative of the noise level. The voice detection threshold (Vth) is therefore set relative to this value.

Normally, unless noise is added in the transmission system, the S/N of the input signal equals the S/N at the far-end telephone even if the line loss changes. Users also tend to speak louder as the ambient noise increases. For these reasons, when the voice detection threshold is derived from the noise level, it is desirable that the ratio between the noise level and the threshold be constant regardless of the noise level. Specifically, the threshold is obtained by adding a constant value (Voff-a) to the noise level on a logarithmic scale. This is expressed as follows.

$$\log V_{th} = \log X_{n-1} + V_{off\text{-}a} \qquad (2)$$

where n is the sample at which $X_n - X_{n-1} \geq V_{dif}$ holds.

Similarly, the way the envelope waveform rises at the start of voice is also thought to depend on the noise level, so voice detection sensitivity can be improved by modifying equation (1) as follows.

$$X_n - X_{n-1} \geq V_{dif}(X_{n-1}) \qquad (3)$$

where

$$\log V_{dif}(X_{n-1}) = \log X_{n-1} + V_{off\text{-}b} \qquad (4)$$

Setting Vdif(X_{n-1}) and Voff to small values improves voice detection sensitivity for input signals with a low S/N, but it also raises the probability of mistakenly detecting noise. To lower that probability while keeping the sensitivity for low-S/N input signals, the following two algorithms are needed.

■ When the input signal consists only of noise, the envelope waveform keeps varying randomly even after it has risen, whereas for voice it exceeds the voice detection threshold (Vth) within a short time. A rise-time threshold Tedge is therefore introduced: if the envelope waveform level does not exceed Vth within Tedge after the envelope waveform of the input signal rises, that section is regarded as a silent section.

■ When a person utters a meaningful message, the utterance lasts at least 500 msec. The minimum voiced duration Tvc is therefore set to about 500 msec: even when the envelope waveform of the input signal rises and its level exceeds Vth within Tedge, the section is regarded as a silent section if the level falls below Vth again within Tvc.

That is, when the envelope waveform level rises above Vth and then falls below it so that the end of a voice section is detected, or when the envelope waveform level does not exceed Vth even after Tedge has elapsed since the rise, the voice detection threshold (Vth) that was set is invalidated, and a new voice detection threshold (Vth) is set the next time the envelope waveform rises. With this processing, even if the noise level changes somewhat during voice detection, the noise level can be set so as to follow those changes.
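Equations (2) to (4), the two rejection rules, and the re-arming of the threshold can be combined into one small state machine, sketched below. Several points are assumptions rather than details from the source: the offsets are treated as dB values under a 20·log10 amplitude convention, times are counted in samples, the default values of `v_off_a_db`, `v_off_b_db`, and `t_edge` are arbitrary, and levels are assumed to be positive. The name `detect_voice_sections` is hypothetical. The threshold retention rule that uses Tnvc is not included.

```python
def gain(offset_db):
    """Convert a dB offset to a linear amplitude factor (assumed 20*log10 convention)."""
    return 10 ** (offset_db / 20)


def detect_voice_sections(levels, v_off_a_db=6.0, v_off_b_db=-6.0, t_edge=3, t_vc=25):
    """Return (start, end) index pairs of accepted voice sections; end is exclusive."""
    sections = []
    state = "idle"                     # idle -> rising -> voiced -> idle
    v_th = start = above_at = None
    for n in range(1, len(levels)):
        x, prev = levels[n], levels[n - 1]
        if state == "idle":
            # Equations (3) and (4): the required step scales with the previous level.
            if prev > 0 and x - prev >= prev * gain(v_off_b_db):
                v_th = prev * gain(v_off_a_db)   # equation (2): threshold from noise level
                start, state = n, "rising"
        if state == "rising":
            if x > v_th:
                above_at, state = n, "voiced"
            elif n - start >= t_edge:
                state = "idle"         # rule 1: V_th not exceeded within T_edge
        elif state == "voiced":
            if x <= v_th:
                if n - above_at >= t_vc:
                    sections.append((start, n))  # long enough to count as voice
                # rule 2: otherwise the burst was shorter than T_vc and is dropped
                state = "idle"         # V_th is discarded; the next rise sets a new one
    return sections


# Noise floor of 10, a 2-sample click, then 30 samples of speech.
levels = [10] * 5 + [30] * 2 + [10] * 5 + [40] * 30 + [10] * 5
print(detect_voice_sections(levels))
```

With a 20 msec sampling period, `t_vc=25` corresponds to the 500 msec minimum voiced duration mentioned above. A section still open when the input ends is not reported in this sketch.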

Note that the envelope waveform also fluctuates within a voice section and may momentarily fall below the voice detection threshold (Vth). The envelope waveform level at the moment it falls below Vth consists of the noise level with low-level voice superimposed on it. If the voice detection threshold (Vth) were set from this value at the next rise of the envelope waveform, it would be higher than desired, and the latter part of the voice section would be judged to be a silent section.

A voice detection threshold register and a minimum silence duration Tnvc are therefore provided. When the envelope waveform falls below Vth after the minimum voiced duration Tvc has elapsed since it exceeded Vth, the voice detection threshold (Vth) is saved in the voice detection threshold register. For a rise of the envelope waveform that occurs within Tnvc after the envelope waveform fell below Vth, no new threshold is computed; the voice detection threshold (Vth) saved in the register is used instead.

FIG. 1 is a block diagram for explaining one embodiment of the voice detection method of this invention. In the figure, 1 is an envelope waveform extraction circuit consisting of a rectifier circuit and a low-pass filter; 2 is an A/D converter for reading the envelope waveform level into the CPU; 3 is a control unit implemented by the CPU; 4 is a waveform register that stores the envelope waveform level; 5 is a waveform register that stores the envelope waveform level read one sampling period earlier; 6 is a threshold register that stores the voice detection threshold; 7 is a timer that signals the A/D converter 2 and the control unit 3 every sampling period Tsampl; 8 is a voiced counter for measuring the length of a voiced section; 9 is a silence counter for measuring the length of a silent section; 10 is an output status register that stores the voice detection result; and 11 is an invalid voiced section length register that stores the length of a section that was initially regarded as a voice section but was later found to be silent.
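The threshold retention rule described above (the threshold register together with Tnvc) reduces to a single decision at each new rise of the envelope waveform, sketched below. The function name `threshold_at_edge`, the use of sample counts for time, the inclusive comparison with `t_nvc`, and the dB interpretation of Voff-a are all assumptions made for illustration.

```python
def threshold_at_edge(prev_level, held_v_th, samples_since_drop, t_nvc, v_off_a_db=6.0):
    """Choose the threshold to use when a new rise of the envelope is detected.

    prev_level         -- envelope level one sampling period before the rise
    held_v_th          -- threshold saved in the threshold register, or None
    samples_since_drop -- samples elapsed since the envelope fell below V_th
    t_nvc              -- minimum silence duration, in samples
    """
    if held_v_th is not None and samples_since_drop <= t_nvc:
        # Short dip inside a voice section: keep the previous threshold.
        return held_v_th
    # Long enough silence: derive a fresh threshold from the current noise level.
    return prev_level * 10 ** (v_off_a_db / 20)


print(threshold_at_edge(15.0, held_v_th=20.0, samples_since_drop=3, t_nvc=10))
print(threshold_at_edge(15.0, held_v_th=20.0, samples_since_drop=30, t_nvc=10))
```

In the first call the dip is shorter than Tnvc, so the saved threshold is reused; in the second the silence is long enough that a fresh threshold is derived from the level just before the rise.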

FIG. 2 shows a configuration example of the envelope waveform extraction circuit 1 of FIG. 1, and FIG. 3 shows the processing flow in FIG. 1. The following description refers to FIGS. 1 and 3. ST1 to ST38 in FIG. 3 denote the individual steps.

After initialization (ST0), the input signal is read into the control unit 3 via the envelope waveform extraction circuit 1 and the A/D converter 2 at a constant sampling period Tsamp1 determined by the timer 7 (ST1). By the sampling theorem, Tsamp1 must be no more than 1/2 of the period at which the envelope waveform changes. Because the envelope waveform changes at roughly the rate of human speech (10 to 20 syllables/sec), a sampling period Tsamp1 of about 20 msec poses no problem.
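The arithmetic behind the 20 msec figure can be checked directly. The helper name `max_sampling_period_ms` below is hypothetical; the only inputs are the 10 to 20 syllables/sec range and the factor of 1/2 stated in the text.

```python
def max_sampling_period_ms(max_changes_per_sec):
    """Longest sampling period allowed by the sampling theorem, in milliseconds."""
    shortest_change_period_ms = 1000.0 / max_changes_per_sec
    return shortest_change_period_ms / 2.0

# Worst case in the text: 20 syllables/sec, i.e. one envelope change every 50 ms.
limit_ms = max_sampling_period_ms(20)
print(limit_ms)          # 25.0
print(20.0 <= limit_ms)  # True: a 20 ms sampling period is within the limit
```

At the slower end of the range (10 syllables/sec) the limit relaxes to 50 ms, so 20 ms covers the whole range.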

Next, the control unit 3 moves the data stored in waveform register 4 to waveform register 5 (ST2), and then stores the newly read envelope waveform value in waveform register 4 (ST3). It then obtains Vdif(Xn-1) from the value of Xn-1 according to equation (4), calculates the difference between the data in waveform register 4 (Xn) and the data in waveform register 5 (Xn-1), and determines whether these values satisfy equation (3) (ST4). If equation (3) is not satisfied, the state is judged to be silent: the output status register 10 is set to "silent" (ST5) and the silence counter 9 is incremented (ST6).
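The register shift and the rising-edge test of ST2 to ST4 might be sketched as follows. Equations (3) and (4) are defined elsewhere in the document, so the form used here (the level must rise by at least Vdif(Xn-1) in one sampling period) is an assumption based on the claims, and the names `WaveformRegisters`, `is_rising_edge`, and the 6.0 margin are purely illustrative. Levels are assumed to be on a logarithmic (dB-like) scale.

```python
class WaveformRegisters:
    """Stand-in for waveform registers 4 (current) and 5 (one period earlier)."""

    def __init__(self):
        self.reg4 = 0.0
        self.reg5 = 0.0

    def push(self, level):
        self.reg5 = self.reg4  # ST2: move register 4 into register 5
        self.reg4 = level      # ST3: store the newly read level in register 4


def is_rising_edge(x_now, x_prev, vdif):
    """Assumed form of equation (3): Xn - Xn-1 >= Vdif(Xn-1).

    vdif is a caller-supplied function standing in for equation (4).
    """
    return (x_now - x_prev) >= vdif(x_prev)


# Hypothetical stand-in for equation (4): a fixed 6.0 margin.
constant_vdif = lambda x_prev: 6.0

regs = WaveformRegisters()
regs.push(30.0)
regs.push(40.0)
print(is_rising_edge(regs.reg4, regs.reg5, constant_vdif))  # True
```

Passing Vdif as a function keeps the sketch neutral about how equation (4) adapts the margin to the previous level.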

On the other hand, if equation (3) is satisfied, the silence length up to that point is compared with the minimum silence duration Tnvc (ST7). Tnvc is the shortest time a person is considered to need when pausing speech, for example to take a breath. If the silence length is Tnvc or longer (silence counter value ≥ (Tnvc / Tsamp1)), the rise is judged to be the start of a new voiced section; the voice detection threshold (Vth) is newly calculated by equation (2) (ST8) and stored in threshold register 6 (ST9). If the silence length is shorter than Tnvc (silence counter value < (Tnvc / Tsamp1)), the rise is judged to be a continuation of the previous voiced section rather than the start of a new one, and the voice detection threshold Vth stored in threshold register 6 is read out (ST10). The envelope waveform level (Xn) is then compared with the voice detection threshold (Vth) (ST11). If the envelope waveform is at or above the threshold (Xn ≥ Vth), the state is judged to be voiced: the voiced counter 8 is incremented (ST12) and the output status register 10 is set to "voiced" (ST13).
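The choice between a newly calculated threshold and the stored one (ST7 to ST10) can be sketched as below. Following the abstract, equation (2) is taken to be the previous envelope level plus a constant on a logarithmic scale; the function name `select_threshold`, the parameter names, and the numeric values are assumptions for illustration.

```python
def select_threshold(silence_count, x_prev, stored_vth, tnvc_ms, tsamp1_ms, voff_a):
    """Return the voice detection threshold to use at a detected rise.

    Levels are assumed to be on a logarithmic scale, so adding voff_a is
    the assumed form of equation (2).
    """
    if silence_count >= tnvc_ms / tsamp1_ms:
        # ST8: silence lasted long enough, so this is a new voiced section.
        return x_prev + voff_a
    # ST10: short pause, treat as a continuation and reuse the stored value.
    return stored_vth


# Hypothetical values: Tnvc = 200 ms, Tsamp1 = 20 ms, Voff-a = 6.0.
print(select_threshold(15, 30.0, 50.0, 200, 20, 6.0))  # 36.0 (new threshold)
print(select_threshold(5, 30.0, 50.0, 200, 20, 6.0))   # 50.0 (stored threshold)
```

In the first call the caller would also write the result back to threshold register 6 (ST9).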

If, on the other hand, the envelope waveform is below the voice detection threshold (Xn < Vth), the state is still judged to be voiced because a rise of the envelope waveform (that is, the start of a voiced section) has been detected, and the voiced counter 8 is incremented (ST14). It is then determined whether the envelope waveform reaches the threshold before Tedge has elapsed since the rise (ST15); if it does, the output status register 10 is set to "voiced" (ST16).

However, if the envelope waveform level still has not reached the voice detection threshold after Tedge has elapsed since the rise (voiced counter value reaching (Tedge / Tsamp1)), the rise is regarded as having been caused by noise. The section from the rise up to that point, which had been judged to be voiced, is re-judged as silent: its length, as counted in the voiced counter 8, is set in the invalid voice section length register 11 (ST20), and "voiced section invalid" is set in the output status register 10 (ST21).

Next, the sum of the silence counter value and the voiced counter value is stored in the silence counter 9 (ST22), the voiced counter 8 is reset to 0 (ST23), and processing returns to step ST1.
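The timeout handling after a rise (ST15, ST16, ST20 to ST23) can be sketched as follows. The `Counters` dataclass, the function name `check_edge_timeout`, the status strings, and the 60 ms timeout are hypothetical; the sketch also assumes the caller has already incremented the voiced counter for the current sample (ST14).

```python
from dataclasses import dataclass


@dataclass
class Counters:
    voiced: int = 0       # voiced counter 8
    silence: int = 0      # silence counter 9
    invalid_len: int = 0  # invalid voice section length register 11
    status: str = "silent"  # output status register 10


def check_edge_timeout(c, x_now, vth, tedge_ms, tsamp1_ms):
    """Return True if the pending rise was discarded as noise."""
    if x_now >= vth:
        c.status = "voiced"                  # ST16: threshold reached in time
        return False
    if c.voiced >= tedge_ms / tsamp1_ms:
        c.invalid_len = c.voiced             # ST20: remember the discarded length
        c.status = "voiced section invalid"  # ST21
        c.silence += c.voiced                # ST22: fold it back into the silence
        c.voiced = 0                         # ST23
        return True
    return False                             # still waiting within the timeout


c = Counters(voiced=3, silence=10)
print(check_edge_timeout(c, 20.0, 36.0, 60, 20))  # True
print(c.silence, c.voiced, c.invalid_len)          # 13 0 3
```

Adding the discarded length to the silence counter keeps the measured silence continuous, as if the false rise had never occurred.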

Note that Tedge must be set shorter than the duration of a fricative, such as the "sh" sound, at the start of speech.

On the other hand, once the envelope waveform level has reached the voice detection threshold, the envelope waveform level continues to be compared with the threshold (ST24, 25, 26, 27, 28, 29, 30). When the voiced section has continued for the minimum voice duration Tvc or longer (voiced counter value ≥ (Tvc / Tsamp1)), the section is judged to be a true voiced section, the silence counter 9 is reset to 0 (ST31), and steps ST24 to ST30 continue to be executed. When the envelope waveform level subsequently falls below the threshold, "silent" is set in the output status register 10 (ST33), the silence counter 9 is set to 1 (ST34), the voiced counter 8 is reset to 0 (ST38), and processing returns to step ST1.

If the envelope waveform level falls below the voice detection threshold before the voiced section has lasted for the minimum voice duration Tvc (voiced counter value < (Tvc / Tsamp1)) (ST27, 32), the envelope level is regarded as having risen because of noise. The section from the rise up to that point, which had been judged to be voiced, is re-judged as silent: the section length counted in the voiced counter 8 is set in the invalid voice section length register 11 (ST35), and "voiced section invalid" is set in the output status register 10 (ST36). Next, the silence counter value plus the voiced counter value plus 1 is stored in the silence counter 9 (ST37), the voiced counter 8 is reset to 0 (ST38), and processing returns to step ST1.

Note that Tvc must be set to the shortest voiced-section length that occurs when a person utters a meaningful message.
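The two outcomes when the level drops back below the threshold (a confirmed voiced section ending, or a short burst discarded as noise) can be sketched as one function. The name `on_fall_below_threshold`, the tuple layout, and the Tvc value of 200 ms are assumptions made for illustration; the counter updates follow steps ST31 to ST38 as described above.

```python
def on_fall_below_threshold(voiced, silence, tvc_ms, tsamp1_ms):
    """Return (status, voiced, silence, invalid_len) once Xn < Vth.

    invalid_len is None when the invalid section length register is untouched.
    """
    if voiced >= tvc_ms / tsamp1_ms:
        # True voiced section: the silence counter was cleared at ST31,
        # then ST33, ST34, ST38 report "silent" and restart the counters.
        return ("silent", 0, 1, None)
    # Too short to be speech: ST35 to ST38 treat the burst as noise and
    # fold its length, plus the current sample, back into the silence.
    return ("voiced section invalid", 0, silence + voiced + 1, voiced)


# Hypothetical values: Tvc = 200 ms, Tsamp1 = 20 ms, so 10 samples are needed.
print(on_fall_below_threshold(12, 0, 200, 20))  # ('silent', 0, 1, None)
print(on_fall_below_threshold(4, 7, 200, 20))   # ('voiced section invalid', 0, 12, 4)
```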

FIG. 4 shows a configuration example in which the embodiment described with reference to FIG. 3 is applied to an IC recorder that records only the voiced sections of the input signal and does not record the silent sections. In this figure, 12 is a voice analyzer that converts the voice signal into a form that can be stored in IC memory; 13 is a control unit constituted by a CPU; 14 is a voice storage unit constituted by IC memory; 15 is a timer that determines the timing for reading voice data from the voice storage unit 14 into the control unit 13; 16 is an address pointer that specifies the access address of the voice storage unit 14; 17 is a voice detection circuit according to the embodiment described with reference to FIG. 3; 18 is the output status register, a component of the voice detection circuit 17; and 19 is the invalid voice section length register, likewise a component of the voice detection circuit 17.

FIG. 5 shows the processing flow of the embodiment of FIG. 4.

The control unit 3 reads the voice data analyzed by the voice analyzer 12 from the voice storage unit 14 at the period Tsamp2 determined by the timer 15. If the voice analyzer 12 uses ordinary PCM and the input signal band is the telephone voice band of 3.4 kHz, the read rate is 8 kHz. The control unit 3 then stores the read voice data at the address indicated by the address pointer 16 in the voice storage unit 14 and advances the address pointer 16 by one.

Normally the above processing is repeated, but whenever a new voice detection result is written to the output status register 18 of the voice detection circuit 17, which happens at the period Tsamp1 described above, the following operations are performed.

(1) When "silent" is written to the output status register 18: the immediately preceding Tsamp1 interval has been judged silent, so this portion is overwritten with the subsequent voice data. To do so, the address pointer 16 is moved back to the address that held the data from Tsamp1 earlier.

(2) When "voiced section invalid" is written to the output status register 18: the section from the most recent rise of the voice up to now has been judged silent, so this portion is overwritten with the subsequent voice data. To do so, the address pointer 16 is moved back to the address holding the first voice data of that section. The section length has been written to the invalid voice section length register 19, so that value is read out and used.

(3) When "voiced" is written to the output status register 18: the immediately preceding Tsamp1 interval has been judged voiced, so recording simply continues.
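The three cases for the address pointer 16 can be summarized in a short sketch. It assumes one stored sample per address, an 8 kHz read rate, and a 20 ms Tsamp1, which gives 160 addresses per detection interval; it also assumes the invalid section length is expressed in Tsamp1 intervals. The function name `update_address_pointer` and the status strings are hypothetical.

```python
SAMPLES_PER_INTERVAL = 160  # assumed: 8 kHz read rate x 20 ms detection interval


def update_address_pointer(pointer, status, invalid_len=0,
                           per_interval=SAMPLES_PER_INTERVAL):
    """Return the address pointer after a new detection result is written."""
    if status == "silent":
        rewind = per_interval                # drop the last interval
    elif status == "voiced section invalid":
        rewind = invalid_len * per_interval  # drop the whole discarded section
    elif status == "voiced":
        rewind = 0                           # keep recording as is
    else:
        raise ValueError(f"unknown status: {status!r}")
    return max(0, pointer - rewind)          # never move before the start


print(update_address_pointer(1600, "silent"))                     # 1440
print(update_address_pointer(1600, "voiced section invalid", 3))  # 1120
print(update_address_pointer(1600, "voiced"))                     # 1600
```

Rewinding the pointer rather than erasing data means the discarded samples are simply overwritten by whatever is recorded next.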

The above processing is repeated until an operation end command is input, or until the voice storage unit 14 runs out of free space and an operation stop command is issued.

〔Effect of the invention〕

Because the invention is configured as described above, it provides the following effects.

That is, in claim (1), the noise level of the input signal can be measured without providing a dedicated noise level measurement period, and the voice detection threshold (Vth) is set to a value adapted to the noise level, so accurate voice detection can always be performed regardless of the noise level.

In claim (2), when it turns out that noise was detected by mistake, or when the end of a voiced section is detected, the noise level is measured anew and the voice detection threshold (Vth) is set again at the time the start of the next voiced section is detected. Voice detection accuracy can therefore be maintained even if the noise level fluctuates somewhat during voice detection processing.

In claim (3), the value compared with the envelope waveform level difference when detecting the start of speech is also adapted to the noise level at that time, so the sensitivity of speech-start detection can be improved while keeping low the likelihood of mistaking noise for the start of speech.

Furthermore, in claims (4) and (5), even when noise is mistakenly detected as the start of speech, the section is judged to be silent once it becomes clear that it is not a voiced section, because the change in the envelope waveform after the start is detected, or after the voice detection threshold (Vth) is exceeded, differs from that of a voiced section. Consequently, even if the value compared with the envelope waveform level difference at speech-start detection and the voice detection threshold are set small to raise detection sensitivity, the probability of mistaking noise for speech can be kept low.

In claim (6), when the envelope waveform level momentarily drops below the voice detection threshold during a voiced section, the threshold is not set anew; the previous value is used as is. This avoids the situation in which the voice detection threshold (Vth) gradually rises during a voiced section and the speech at the end of an utterance ends up undetected.

[Brief explanation of the drawing]

FIG. 1 is a block diagram for explaining an embodiment of the voice detection method of the present invention; FIG. 2 shows a configuration example of the envelope waveform extraction circuit; FIG. 3 shows the processing flow in FIG. 1; FIG. 4 is a block diagram showing an application example of the invention; FIG. 5 shows the processing flow in FIG. 4; FIG. 6 shows the operation flow of an interactive answering machine; and FIGS. 7(a), (b), and (c) show the processing flows for measuring the user's speech length, checking whether the user reacts to a response message, and detecting the end of an utterance.

In the figures, 1 is the envelope waveform extraction circuit; 2 is the A/D converter; 3 is the control unit; 4 and 5 are waveform registers; 6 is the threshold register; 7 and 15 are timers; 8 is the voiced counter; 9 is the silence counter; 10 and 18 are output status registers; 11 and 19 are invalid voice section length registers; 12 is the voice analyzer; 13 is the control unit; 14 is the voice storage unit; 16 is the address pointer; and 17 is the voice detection circuit.

Claims

[Claims]

(1) A voice detection method for a voice detection circuit that extracts the envelope waveform of an input signal through a rectifier circuit and a low-pass filter, compares the envelope waveform level with a voice detection threshold by internal processing of a CPU, and judges a portion where the envelope waveform level exceeds the voice detection threshold to be a voiced section and a portion where it is below the threshold to be a silent section, characterized in that: the envelope waveform level read one sampling period earlier is stored; the envelope waveform level newly read by the CPU is compared with the envelope waveform level read one sampling period earlier; the rising point of the envelope waveform at which the newly read level is larger by a constant value (Vdif) or more is regarded as the start point of a voiced section; and the value obtained by adding a constant value (Voff-a) on a logarithmic scale to the envelope waveform level read one sampling period earlier is set as the voice detection threshold (Vth).

(2) The voice detection method according to claim (1), characterized in that, when the envelope waveform level falls below the voice detection threshold (Vth) after having exceeded it, detection of a rise of the envelope waveform is started again, the point at which the rise is detected is regarded as the start point of a new voiced section, and the value obtained by adding the constant value (Voff-a) to the envelope waveform level read one sampling period earlier is set as the new voice detection threshold (Vth).

(3) The voice detection method according to claim (1), characterized in that a value obtained by adding a constant value (Voff-b) to the envelope waveform level read one sampling period earlier is used as the constant value (Vdif).

(4) The voice detection method according to claim (1), characterized in that, if the envelope waveform level does not exceed the voice detection threshold (Vth) within a fixed time (Tedge) after the rise of the envelope waveform, the section from the rise up to that point is regarded as a silent section, detection of a rise of the envelope waveform is started again, the point at which the rise is detected is regarded as the start point of a new voiced section, and the value obtained by adding the constant value (Voff-a) to the envelope waveform level read one sampling period earlier is set as the new voice detection threshold (Vth).

(5) The voice detection method according to any one of claims (1) to (4), characterized in that, if the envelope waveform level exceeds the voice detection threshold (Vth) within the fixed time (Tedge) after the rise and then falls below the threshold (Vth) again within the minimum voice duration (Tvc), that section is regarded as a silent section.

(6) The voice detection method according to any one of claims (1) to (5), characterized in that, if the envelope waveform level exceeds the voice detection threshold (Vth) within the fixed time (Tedge) after the rise of the envelope waveform and then falls below the threshold (Vth) again after the minimum voice duration (Tvc) has elapsed, the value of the voice detection threshold (Vth) is stored, and if a rise of the envelope waveform is detected again within the minimum silence duration (Tnvc), the stored value is used as the voice detection threshold (Vth).