JPH1124692A - Method of judging generating/resting section of voice wave and device therefor

Info

Publication number: JPH1124692A
Application number: JP9176076A
Authority: JP
Inventors: Jiyoutarou Ikedo; 丈太朗池戸
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1997-07-01
Filing date: 1997-07-01
Publication date: 1999-01-29

Abstract

PROBLEM TO BE SOLVED: To minimize the influence of background noise and the amount of computation. SOLUTION: The short-time power of an input voice is detected (31) and supplied to a neural network (NNW) 34 both directly and after being sequentially delayed by a plurality of delay elements. The maximum autocorrelation of the input voice is detected over a range longer than the pitch period (32) and input to the NNW 34. An LSP (line spectrum pair) coefficient of the input voice is determined (33), and its error from a specified LSP vector is calculated (35) and input to the NNW 34. The NNW 34 is trained in advance to output 1 in a sounding section and 0 in a resting section, and the output of the NNW 34 is compared with a threshold (37) to give the sounding/resting section judgment.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention is applicable to fields such as digital voice transmission, belongs to the field of digital speech processing, and relates to a method and an apparatus for discriminating between sound sections and pause sections in a voice wave.

[0002]

[Prior Art] First, sound sections and pause sections in a speech waveform are described. FIG. 9 shows an example of sound sections and pause sections in an actual speech waveform. FIG. 9A clearly shows that pause sections T_US and T_UE exist before and after the speech because no voice is being uttered. It can also be seen that a very short pause section T_UM exists even in the middle of an utterance. FIG. 9B is an enlarged view of the vicinity of the pause section T_UM within the utterance. The pause section T_UM seen in this figure is of the kind that appears immediately before a plosive or a fricative.

[0003] Since the voice wave has no power in these pause sections T_US, T_UE and T_UM, voice transmission only needs to convey the information that these sections are pause sections. Mainly in the field of digital voice transmission, with the aim of using the frequency resources of the transmission path effectively, a method is used in which a pause section is signaled with a very small amount of information, thereby reducing the total amount of transmitted information.

[0004] Realizing such a voice transmission method requires a sound/pause section determination device that can accurately identify whether the voice is in a sound section or a pause section. The simplest way to discriminate between sound and pause sections is to measure the short-time power of the voice wave and compare it with a fixed threshold. However, this method is susceptible to voice level fluctuations and background noise and is prone to sound/pause determination errors.
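
The power-threshold baseline described in paragraph [0004] can be sketched as follows. This is a minimal illustration, not the circuit of any cited work: the function names, the 40-sample frame length (5 ms at 8 kHz) and the threshold value are assumptions chosen for the example.

```python
def short_time_power(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def power_threshold_vad(samples, frame_len, threshold):
    """True marks a frame judged as a sound section, False a pause section."""
    return [p > threshold for p in short_time_power(samples, frame_len)]


# Hypothetical input: one silent frame, one loud frame, one frame of weak noise.
silence = [0] * 40
speech = [1000 if n % 2 == 0 else -1000 for n in range(40)]
noise = [20 if n % 2 == 0 else -20 for n in range(40)]
decisions = power_threshold_vad(silence + speech + noise, frame_len=40, threshold=100.0)
```

With this fixed threshold the weak-noise frame is also marked as sound, which is the weakness the paragraph points out.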

[0005] One technique for solving this problem is described in "A study of a sound/silence detection circuit in VOX control", IEICE Spring Conference, B-422, 1993. FIG. 10 shows its configuration. The input voice 11 is processed by a band-pass filter 12, amplified by an amplifier 13, PCM-coded by a PCM encoder 14, and output at an output terminal 15. A silence detection circuit 16 calculates the power of the PCM-coded voice and compares it with a fixed threshold to determine the sound/pause section. Meanwhile, the level of the output of the amplifier 13 is detected by a detection circuit 17, the detected level is passed through a hold circuit 18 to a gain control circuit 19, and the control signal of the gain control circuit 19 adaptively controls the gain of the amplifier 13. Furthermore, while the silence detection circuit 16 is detecting a pause section, it controls the hold circuit 18 so as to hold the output of the detection circuit 17, so that gain control is not performed during pause sections.

[0006] In this method, the automatic gain control circuit 19 controls the gain of the amplifier 13 so that the output of the amplifier 13 is kept at a certain level, which reduces sound/pause determination errors caused by level fluctuations of the input voice. In addition, while the silence detection circuit 16 is detecting a pause section, the operation of the automatic gain control circuit 19 is held, which prevents the gain of the amplifier 13 from becoming unnecessarily large during voice pauses and suppresses sound/pause determination errors caused by background noise.

[0007] Another approach determines sound/pause sections using the autocorrelation coefficients and pitch lag of the voice wave in addition to its short-time power. An example of such a device is given in "European digital cellular telecommunications system (Phase 2); Voice Activity Detection (VAD) (GSM 06.32)", European Telecommunications Standards Institute (1994). FIG. 11 shows its configuration. After the input voice is PCM-coded, its autocorrelation coefficients and pitch lag are analyzed and input to the device through an autocorrelation coefficient input terminal 21 and a pitch lag input terminal 22, respectively. The autocorrelation coefficients are sent to a residual power calculation unit 23, which calculates the linear prediction residual power of the section under sound/pause determination. At the same time, the autocorrelation coefficients are sent to an autocorrelation coefficient averaging unit 24, which calculates the average of the autocorrelation coefficients over the past several frames including the section under determination and sends it to a spectrum comparison unit 25; the past autocorrelation average is also sent to a predicted value calculation unit 26. The predicted value calculation unit 26 predicts the current autocorrelation average from the past averages and sends it to the spectrum comparison unit 25. The spectrum comparison unit 25 compares the spectra derived from the two autocorrelation averages it receives and judges the stationarity of the spectrum. The input pitch lag is sent to a periodicity determination unit 27, which judges the stationarity of the pitch. The residual power, the spectral stationarity result and the pitch stationarity result are sent to a threshold adaptation unit 28, which determines the residual power threshold for the sound/pause determination. A VAD determination unit 29 compares the determined threshold with the residual power to determine the sound/pause section.

[0008] FIG. 12 shows the processing of the threshold adaptation unit. If the residual power is smaller than the provisional decision threshold pth (a constant) (S1), the section is unconditionally judged to be a pause section, the sound/pause determination threshold thvad is set to its initial value plev, and processing ends (S2). If the residual power is pth or more (S1) and the spectrum is non-stationary (S3) or the pitch is stationary (S4), the section is unconditionally judged to be a sound section, the sound/pause determination threshold is left unchanged, the value count of the threshold adaptation counter is set to 0, and processing ends (S5).

[0009] In all other cases, that is, when the residual power is pth or more, the spectrum is stationary and the pitch is not stationary, the section is treated as a background noise section, and the sound/pause determination threshold is changed when this condition continues for at least a fixed number of times adp. Specifically, when a background noise section is detected, the count value count of the threshold adaptation counter is incremented by 1 (S6); if count does not exceed the adaptation grace count adp, processing ends (S7). If it does, the sound/pause determination threshold thvad is divided by the threshold change step size coefficient dec and the result is subtracted from thvad to give a new thvad (S8). Next, if the product of the residual power pvad and the voice power/residual power comparison coefficient fac is larger than thvad (S9), thvad is set to the smaller of two values: thvad plus thvad divided by the coefficient inc, which defines the limit of the threshold change, and the product pvad × fac (S10). If the resulting thvad, or the thvad left unchanged because the condition of step S9 was not met, is larger than the sum of the residual power pvad and the residual power margin margin (S11), thvad is set to pvad + margin (S12). After step S12, or when step S11 finds that thvad is not larger, count is set to adp + 1 and processing ends (S13).
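
The flow of steps S1 to S13 described in paragraphs [0008] and [0009] can be sketched as a single update function. This is an illustration of the steps as worded in this description only; the class and function names are hypothetical, every parameter value is a placeholder rather than a value from GSM 06.32, and leaving count untouched in step S2 is an assumption, since the text does not say what happens to it there.

```python
from dataclasses import dataclass


@dataclass
class AdaptParams:
    # All values are hypothetical placeholders, not taken from GSM 06.32.
    pth: float = 100.0      # provisional decision threshold
    plev: float = 1000.0    # initial value of thvad
    adp: int = 2            # adaptation grace count
    dec: float = 32.0       # step size coefficient for lowering thvad
    inc: float = 16.0       # coefficient limiting how far thvad is raised
    fac: float = 3.0        # voice power / residual power comparison coefficient
    margin: float = 50.0    # residual power margin


def adapt_threshold(pvad, spectrum_stationary, pitch_stationary, thvad, count, p):
    """Return the updated (thvad, count) for one frame, following S1 to S13."""
    if pvad < p.pth:                                      # S1
        return p.plev, count                              # S2 (count kept: assumption)
    if not spectrum_stationary or pitch_stationary:       # S3, S4
        return thvad, 0                                   # S5
    count += 1                                            # S6
    if count <= p.adp:                                    # S7
        return thvad, count
    thvad -= thvad / p.dec                                # S8
    if pvad * p.fac > thvad:                              # S9
        thvad = min(thvad + thvad / p.inc, pvad * p.fac)  # S10
    if thvad > pvad + p.margin:                           # S11
        thvad = pvad + p.margin                           # S12
    return thvad, p.adp + 1                               # S13
```

Written this way, the number of comparisons and branches needed per frame is visible at a glance.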

[0010] This method uses not only the voice power but also information such as the stationarity of the spectrum and of the pitch to adaptively change the power threshold used for sound/pause determination, thereby reducing determination errors caused by voice level fluctuations and background noise.

[0011]

[Problems to be Solved by the Invention] In the conventional method using an automatic gain control circuit, the input voice wave is amplified by an amplifier whose gain varies, which is a problem for faithful transmission of the voice wave. The method using autocorrelation coefficients and pitch lag, on the other hand, inherently requires many control operations such as threshold comparisons and conditional branches, so the control circuit becomes complicated when the device is implemented.

[0012]

[Means for Solving the Problems] The invention of claim 1 inputs the following to a neural network and determines the sound/pause section from its output: one or more short-time voice powers; the vector distance between a parameter vector related to the spectral envelope, obtained by analyzing the voice wave, and a parameter vector of the same kind whose spectrum is flat; and the maximum value of the autocorrelation of the voice wave within a range that substantially covers the pitch period of the voice.

[0013] The invention of claim 2 differs in that the variance of the quantized sample values of the voice wave within one or more short-time sections is used in place of the short-time voice power of claim 1. Neither invention needs to amplify the input voice with a variable-gain amplifier, so both are superior to the method using an automatic gain control circuit in terms of faithful transmission of the voice wave.

[0014] Like the method using autocorrelation coefficients and pitch lag, the inventions of claims 1 and 2 determine the sound/pause section from a plurality of parameters. However, because the determination uses only the output obtained by feeding the set of parameters into the neural network, only a single threshold comparison is needed and no conditional branch control is required at all.

[0015] Furthermore, the invention of claim 2 can determine sound/pause sections stably even when the voice wave carries a bias, for example when the zero level of the AD converter does not match the zero level of the microphone input.

[0016]

[Embodiments of the Invention] FIG. 1 shows an embodiment of the invention of claim 1. The input voice from a terminal 11 is sent to a short-time voice power calculation unit 31, a maximum autocorrelation calculation unit 32 and an LSP coefficient calculation unit 33, which calculate the short-time voice power, the maximum autocorrelation and the line spectrum pair (LSP) parameters, respectively. The calculated short-time voice power is sent directly to a neural network unit 34 and is also passed sequentially through delay elements 35_1, 35_2, ..., 35_n, the output of each delay element likewise being sent to the neural network unit 34.

[0017] The maximum autocorrelation calculation unit 32 calculates autocorrelation coefficients up to a time lag long enough to capture the pitch period of the input voice, and their maximum value is input to the neural network unit 34. An LSP vector error calculation unit 36 calculates the vector error between the LSP parameter vector calculated by the LSP coefficient calculation unit 33 and a preset LSP parameter vector of a flat spectral envelope, and the resulting vector error is input to the neural network unit 34.
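
The two features of paragraph [0017] can be sketched as below. Several points are assumptions made for illustration: the lag range of 20 to 147 samples (the actual range is given in FIG. 2), normalizing the autocorrelation by the frame energy, the LSP order of 10, and the reference vector with entries k·π/(p+1), which is one equally spaced vector of the kind described for FIG. 3. The LSP analysis itself is not shown; `lsp_vector_error` expects an already computed LSP vector.

```python
import math


def max_autocorrelation(frame, min_lag, max_lag):
    """Largest energy-normalized autocorrelation over the given lag range."""
    energy = sum(s * s for s in frame)
    if energy == 0.0:
        return 0.0
    best = float("-inf")
    for lag in range(min_lag, max_lag + 1):
        r = sum(frame[n] * frame[n - lag] for n in range(lag, len(frame)))
        best = max(best, r / energy)
    return best


def flat_lsp_vector(order):
    """Equally spaced LSP frequencies in (0, pi), assumed to model a flat envelope."""
    return [math.pi * (k + 1) / (order + 1) for k in range(order)]


def lsp_vector_error(lsp, reference):
    """Euclidean distance between an analyzed LSP vector and the reference."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(lsp, reference)))


# Hypothetical periodic frame: 40 ms at 8 kHz with a 40-sample pitch period.
frame = [math.sin(2 * math.pi * n / 40) for n in range(320)]
peak = max_autocorrelation(frame, min_lag=20, max_lag=147)
reference = flat_lsp_vector(10)
```

A strongly periodic frame gives a peak close to 1, while an LSP vector equal to the reference gives an error of 0.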

[0018] The neural network unit 34 is given a large number of training utterances together with their sound/pause section information and is trained to output, for example, 1 in sound sections and 0 in silent sections. A general technique such as error backpropagation may be used for this training. The output of the neural network unit 34 is compared with a fixed threshold in a sound/pause section determination unit 37 to determine the sound/pause section.

[0019] The short-time voice power calculation unit 31 calculates the voice power by the same technique as conventional sound/pause detection methods of this kind and over a comparable time section, for example every 5 to 20 ms or so. The maximum autocorrelation calculation unit 32 in effect finds the maximum of the autocorrelation within a range that substantially covers the voice pitch interval. The delay time of each of the delay elements 35_1, 35_2, ... is made equal to the calculation section of the short-time voice power; for example, when the short-time voice power is calculated every 5 ms, each delay element has a delay of 5 ms. The most delayed short-time voice power input to the neural network unit 34 is preferably delayed by about 15 to 40 ms. In other words, if the calculation section of the short-time voice power is 5 ms, about 3 to 10 delay elements are appropriate. If this look-back time is too short, determination performance degrades; about 20 ms is particularly preferable, and making it longer does not improve performance much while increasing the amount of processing.
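
The chain of delay elements in paragraph [0019] behaves like a tapped delay line over the stream of short-time power values. The sketch below assumes four delay elements (20 ms of look-back at a 5 ms step), which is one choice within the range the paragraph gives; the class name and the zero initial state are also assumptions.

```python
from collections import deque


class TappedDelayLine:
    """Holds the current value and the outputs of num_delays delay elements."""

    def __init__(self, num_delays):
        size = num_delays + 1
        # Zero initial state; maxlen discards the oldest value automatically.
        self._taps = deque([0.0] * size, maxlen=size)

    def push(self, value):
        """Feed one new power value; return [current, 1 step old, ..., n steps old]."""
        self._taps.appendleft(value)
        return list(self._taps)


line = TappedDelayLine(num_delays=4)
outputs = [line.push(p) for p in [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]]
```

Each call to `push` yields the five power values that would be presented to the neural network unit 34 at that step.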

[0020] The sound/pause determination is made for each fixed time section of, for example, about 10 to 40 ms, and the calculation section of the short-time voice power is chosen to be about equal to or shorter than this determination section. With this configuration, the short-time voice power serves as one factor of the decision as in conventional methods. In addition, by using the maximum of the autocorrelation, a section can be judged to be sound when the pitch is stationary, and by using the vector distance, a section can be judged to be sound when the frequency characteristic of the spectrum is skewed, that is, not flat (the spectrum of noise is flat). Using these multiple parameters, correct determination is possible without being affected by background noise or voice level fluctuations, as with the conventional method shown in FIGS. 11 and 12.

[0021] Since the LSP vector error calculation unit 36 serves to detect the skew of the spectrum as described above, the parameters are not limited to LSP parameters; LPC (linear prediction coefficients), PARCOR coefficients or any other parameters related to the spectral envelope may be used. The invention of claim 2 is implemented by replacing the short-time voice power calculation unit 31 in the invention of claim 1 with a short-time variance calculation unit. The short-time variance calculation computes the variance of the quantized sample values within a short-time section. This variance is large for voice and small for noise. Moreover, when the input carries a bias, as when the zero level of the AD converter does not match that of the microphone input, a relatively large short-time voice power is detected in pause sections, but because the sample values there are constant, the variance is extremely small; the section is thus distinguished from sound and there is no risk of misjudgment.
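
The difference between short-time power and short-time variance under a DC bias, as described in paragraph [0021], can be checked with a small numerical sketch. The sample values and the 40-sample frame are hypothetical.

```python
def short_time_power(frame):
    """Mean squared sample value of one frame."""
    return sum(s * s for s in frame) / len(frame)


def short_time_variance(frame):
    """Population variance of the sample values of one frame."""
    mean = sum(frame) / len(frame)
    return sum((s - mean) ** 2 for s in frame) / len(frame)


# Hypothetical frames with a constant bias of 500 on the quantized samples.
biased_pause = [500] * 40
biased_speech = [500 + (1000 if n % 2 == 0 else -1000) for n in range(40)]

pause_power = short_time_power(biased_pause)
pause_variance = short_time_variance(biased_pause)
speech_variance = short_time_variance(biased_speech)
```

The biased pause frame has a large power but zero variance, so the variance still separates it from the speech-like frame.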

[0022] Since the inventions of claims 1 and 2 differ only in the short-time power calculation unit versus the short-time variance calculation unit, the example given here mainly concerns the invention of claim 1. FIG. 2 shows the analysis conditions for each parameter. The input voice is sampled at a sampling frequency of 8000 Hz, and the voice power is calculated every 5 ms. The autocorrelation coefficient is calculated for every sample lag in the range shown in FIG. 2, and the largest autocorrelation value is taken.

[0023] FIG. 3 shows the reference LSP parameter vector used to calculate the LSP vector error. As the figure shows, the orders are equally spaced, which indicates that the spectrum is flat. The LSP vector error is defined as the Euclidean distance between the LSP vector shown in the figure and the LSP vector obtained by analysis. The neural network unit 34 uses a four-layer model in which the input layer, first hidden layer, second hidden layer and output layer have 7, 3, 3 and 1 neurons, respectively. The output $o_{i,j}$ of the j-th neuron in the i-th layer is

$$o_{i,j} = \frac{1}{e^{-x} + 1} - \frac{1}{2} \qquad (1)$$

where

$$x = \sum_{k} w_{i,j,k}\, o_{i-1,k} + \beta_{i,j} \qquad (2)$$

Here $w_{i,j,k}$ is the weight applied when the output of the k-th neuron in layer i-1 is input to the j-th neuron in layer i, and $\beta_{i,j}$ is the bias on the input of the j-th neuron in layer i. The input layer is layer 0 and the output layer is layer 3. Since equation (1) yields outputs in the range -0.5 to 0.5, 0.5 is added to the output of the output layer so that the output range is 0 to 1. FIG. 4 shows an example of the weights, and FIG. 5 an example of the biases.
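
The forward pass defined by equations (1) and (2), followed by the single threshold comparison, can be sketched as below. The weights and biases of FIGS. 4 and 5 are not reproduced here, so the zero-valued parameters are placeholders; the decision threshold of 0.5 and the treatment of the input layer as passing the features through unchanged are assumptions. One reading consistent with a 7-neuron input layer is five power values (current plus four delayed), the maximum autocorrelation and the LSP vector error; that split is also an assumption.

```python
import math


def activation(x):
    """Equation (1): 1 / (exp(-x) + 1) - 1/2, written as 0.5 * tanh(x / 2)
    (the same function) so that large |x| cannot overflow."""
    return 0.5 * math.tanh(0.5 * x)


def layer(inputs, weights, biases):
    """Equation (2) followed by equation (1) for every neuron j of one layer."""
    return [
        activation(sum(w * o for w, o in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(features, weights, biases):
    """Run layers 1 to 3 and shift the single output from [-0.5, 0.5] to [0, 1]."""
    out = features
    for w, b in zip(weights, biases):
        out = layer(out, w, b)
    return out[0] + 0.5


def is_sound(features, weights, biases, threshold=0.5):
    """Single threshold comparison on the network output (threshold assumed)."""
    return forward(features, weights, biases) > threshold


# Placeholder parameters with the 7-3-3-1 shape; not the values of FIGS. 4 and 5.
weights = [
    [[0.0] * 7 for _ in range(3)],
    [[0.0] * 3 for _ in range(3)],
    [[0.0] * 3],
]
biases = [[0.0] * 3, [0.0] * 3, [0.0]]
y = forward([1.0] * 7, weights, biases)
```

With all parameters at zero the output sits at the midpoint of its range; trained weights would move it toward 1 for sound sections and toward 0 for pause sections.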

[0024] The invention of claim 2 can be realized by replacing the short-time voice power calculation unit in the invention of claim 1 with a short-time variance calculation unit and changing the weights and biases of the neural network unit 34 to those shown in FIGS. 6 and 7.

[0025]

[Effects of the Invention] In addition to the short-time power of the voice, the inventions of claims 1 and 2 use the vector error between a parameter vector related to the spectral envelope and the vector of the corresponding flat characteristic, and the maximum autocorrelation of the voice wave within a fixed time, thereby reducing the influence of background noise on sound/pause determination.

[0026] Moreover, processing the plurality of parameters with a neural network eliminates conditional branching entirely, so the device can be realized without a complicated control circuit. Comparing the required amount of computation with that of the conventional method using autocorrelation coefficients and pitch lag, the present invention, with the parameters shown in the example, is estimated to operate with about half the computation of the conventional method.

[0027] Furthermore, the invention of claim 2 uses the short-time variance of the instantaneous values of the voice wave instead of the short-time power of the voice, so that sound/pause sections can be determined stably even when a constant bias is superimposed on the voice wave. FIG. 8 compares the determination test results of the present invention with those of the conventional method, which determines sound/pause sections by comparing the short-time power of the voice with a fixed threshold. In this test, the reference is the result of sound/pause determination performed on voice without background noise by the method of comparing short-time power with a threshold (the conventional method). The threshold used for this reference is -45 dB relative to the long-term average power of the voice.

[0028] When the present invention is applied to voice without background noise (labeled noise-free in the figure), there are a few errors in which a sound section is judged to be a pause section, but most of these errors are cases where high-power non-speech sounds, such as breathing sounds or tongue clicks, were judged to be pause sections, and they pose no particular problem. For voice on which background noise of -20 dB relative to the long-term average power of the voice is superimposed, the conventional method with thresholds of -45 dB and -30 dB detected no pause sections at all. With a threshold of -20 dB, it correctly determined about 60% of the pause sections. With the present invention, by contrast, about 95% of the pause sections were correctly determined. The sound-section determination errors in this case were of the same nature as those without background noise.

[0029] For voice on which a bias equivalent to -20 dB relative to the long-term average power of the voice is superimposed, the invention of claim 1 could determine almost no pause sections, whereas the invention of claim 2 correctly determined about 65% of the pause sections.

[Brief description of the drawings]

【図１】請求項１の発明の実施例の機能構成を示す図。FIG. 1 is a diagram showing a functional configuration according to an embodiment of the present invention;

【図２】請求項１の発明を実施する場合の各パラメータ
の数値例を示す図。FIG. 2 is a diagram showing a numerical example of each parameter when the invention of claim 1 is implemented.

【図３】基準のＬＳＰパラメータベクトルの例を示す
図。FIG. 3 is a diagram showing an example of a reference LSP parameter vector.

【図４】請求項１の発明の実施例におけるニューラルネ
ットワーク部３４の各重み係数の例を示す図。FIG. 4 is a diagram showing an example of each weight coefficient of the neural network unit 34 in the embodiment of the first invention.

【図５】請求項１の発明の実施例におけるニューラルネ
ットワーク部のバイアスβ_i,jの例を示す図。FIG. 5 is a diagram showing an example of a bias β _{i, j} of the neural network unit in the embodiment of the first invention.

【図６】請求項２の発明の実施例におけるニューラルネ
ットワーク部３４の各重み係数の例を示す図。FIG. 6 is a diagram showing an example of each weight coefficient of the neural network unit 34 in the embodiment of the second invention.

FIG. 7 is a diagram showing an example of the biases β_{i,j} of the neural network unit 34 in the embodiment of the invention of claim 2.

FIG. 8 is a diagram showing the results of sound/pause section determination by the method of the present invention and by the conventional method.

FIG. 9 is a waveform diagram showing an example of sound sections and pause sections in a speech wave.

FIG. 10 is a diagram showing the functional configuration of a conventional sound/pause section determination device using automatic gain control.

FIG. 11 is a diagram showing the functional configuration of a conventional sound/pause section determination device for speech using autocorrelation coefficients and the pitch lag.

FIG. 12 is a flowchart showing the procedure of the threshold adaptation process in the conventional device of FIG. 11.

Claims

[Claims]

1. A sound/pause section determination method for a speech wave, in which a speech waveform sampled and quantized at a fixed period is divided into fixed time sections and it is determined, for each time section, whether the speech contained therein is a sound section or a pause section, the method comprising: analyzing the speech waveform to obtain a parameter vector related to its spectral envelope; obtaining the vector distance between that parameter vector and a parameter vector of the same kind whose spectral envelope is substantially flat; obtaining the short-time speech power of the speech waveform; obtaining the maximum value of the autocorrelation of the speech waveform within a range that substantially covers the pitch period of speech; and inputting the vector distance, at least one value of the short-time speech power, and the maximum value of the autocorrelation to a neural network, comparing one output thereof with a threshold value, and determining from the magnitude relation whether the speech waveform is a sound section or a pause section.

2. A sound/pause section determination method for a speech wave, in which a speech waveform sampled and quantized at a fixed period is divided into fixed time sections and it is determined, for each time section, whether the speech contained therein is a sound section or a pause section, the method comprising: analyzing the speech waveform to obtain a parameter vector related to its spectral envelope; obtaining the vector distance between that parameter vector and a parameter vector of the same kind whose spectral envelope is substantially flat; obtaining the variance of the quantized sample values of the speech waveform over a short time section; obtaining the maximum value of the autocorrelation of the speech waveform within a range that substantially covers the pitch period of speech; and inputting the vector distance, at least one value of the variance, and the maximum value of the autocorrelation to a neural network, comparing one output thereof with a threshold value, and determining from the magnitude relation whether the speech waveform is a sound section or a pause section.

3. A sound/pause section determination device for a speech wave, which receives a speech waveform sampled and quantized at a fixed period and determines, for each fixed time section, whether the speech contained therein is a sound section or a pause section, the device comprising: means for analyzing the speech waveform to obtain a parameter vector related to its spectral envelope; means for obtaining the vector distance between that parameter vector and a parameter vector of the same kind whose spectral envelope is substantially flat; means for obtaining the short-time speech power of the speech waveform; means for obtaining the maximum value of the autocorrelation of the speech waveform within a range that substantially covers the pitch period of speech; a neural network that receives the vector distance, at least one value of the short-time speech power, and the maximum value of the autocorrelation, and produces an output at one output terminal; and means for comparing the output of the output terminal with a threshold value and determining from the magnitude relation whether the speech waveform is a sound section or a pause section.

4. A sound/pause section determination device for a speech wave, which receives a speech waveform sampled and quantized at a fixed period and determines, for each fixed time section, whether the speech contained therein is a sound section or a pause section, the device comprising: means for analyzing the speech waveform to obtain a parameter vector related to its spectral envelope; means for obtaining the vector distance between that parameter vector and a parameter vector of the same kind whose spectral envelope is substantially flat; means for obtaining the variance of the quantized sample values of the speech waveform over a short time section; means for obtaining the maximum value of the autocorrelation of the speech waveform within a range that substantially covers the pitch period of speech; a neural network that receives the vector distance, at least one value of the variance, and the maximum value of the autocorrelation, and produces an output at one output terminal; and means for comparing the output of the output terminal with a threshold value and determining from the magnitude relation whether the speech waveform is a sound section or a pause section.
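As an informal illustration of the quantities named in the claims, the following minimal sketch computes the maximum normalized autocorrelation over a lag range, the distance to a flat-envelope reference vector, and the final threshold decision. Several points are assumptions, not taken from the patent: the normalization by frame energy, the Euclidean distance, the equally spaced LSP vector used as the flat reference (the patent's own reference vector is the one given in FIG. 3), the 0.5 threshold, and all function names. The LSP analysis and the trained neural network are omitted.

```python
import math

def max_normalized_autocorr(frame, min_lag, max_lag):
    """Largest autocorrelation over lags min_lag..max_lag, divided by frame energy."""
    energy = sum(x * x for x in frame)
    if energy == 0.0:
        return 0.0  # silent frame: no periodicity to report
    best = -1.0
    for lag in range(min_lag, max_lag + 1):
        r = sum(frame[n] * frame[n - lag] for n in range(lag, len(frame)))
        best = max(best, r / energy)
    return best

def flat_lsp_vector(order):
    """LSP frequencies of a flat spectral envelope: equally spaced in (0, pi)."""
    return [math.pi * (i + 1) / (order + 1) for i in range(order)]

def lsp_distance(lsp, reference):
    """Euclidean distance between two LSP vectors (assumed distance measure)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(lsp, reference)))

def decide(nn_output, threshold=0.5):
    """Threshold the single network output: large means sound, small means pause."""
    return "sound" if nn_output > threshold else "pause"

# Hypothetical voiced-like frame: a sinusoid with a period of 40 samples.
periodic = [math.sin(2 * math.pi * n / 40) for n in range(160)]
peak = max_normalized_autocorr(periodic, 20, 100)
```

For the periodic test frame the peak lies near the lag equal to its period, which is the behaviour the autocorrelation feature relies on. In the claimed method these values would be fed to the trained network, and only its output would be passed to the threshold comparison.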