JP3394506B2 - Voice discrimination device and voice discrimination method

Info

Publication number: JP3394506B2
Application number: JP2000188987A
Authority: JP
Inventors: 裕久田崎; 正山浦; 勝志瀬座
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 1993-08-17
Filing date: 2000-06-23
Publication date: 2003-04-07
Anticipated expiration: 2018-04-07
Also published as: JP2001022368A

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a voiced/unvoiced sound discrimination device (voice discrimination device) for a speech encoding/decoding apparatus used when speech is digitally transmitted or stored, and to a discrimination method therefor.

[0002]

[Prior Art] A conventional voiced/unvoiced sound discrimination device (voice discrimination device) of this type is disclosed in, for example, Japanese Patent Laid-Open No. 61-27800. That device uses the sum of the low-order terms of the cepstrum as the discrimination parameter for voiced and unvoiced sound, and its discrimination result is binary: voiced or unvoiced.

[0003] FIG. 6 is a block diagram of the conventional voiced/unvoiced sound discrimination device (voice discrimination device) shown in the above document, and FIG. 7 illustrates the distribution of discrimination parameter 20 of the device of FIG. 6. In the figures, 18 is a cepstrum, 19 is an adder circuit, 20 is a discrimination parameter, 21 is a threshold comparison circuit, and 22 is a discrimination result.

[0004] The operation of the voiced/unvoiced sound discrimination device (voice discrimination device) of FIG. 6 is described below with reference to FIGS. 6 and 7. First, adder circuit 19 computes the sum of the low-order terms of the input cepstrum 18 and outputs it as discrimination parameter 20. Threshold comparison circuit 21 judges the frame to be unvoiced when discrimination parameter 20 is below a predetermined fixed threshold and voiced when it is at or above that threshold, and outputs discrimination result 22.
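
The conventional scheme described above can be sketched in a few lines. This is only an illustration: the function name, the number of low-order terms `n_low`, and the value of `fixed_threshold` are assumptions, since the text gives no values for them.

```python
def conventional_decision(cepstrum, n_low, fixed_threshold):
    """Classify one frame as 'voiced' or 'unvoiced' from its cepstrum.

    n_low (how many low-order terms to sum) and fixed_threshold are
    hypothetical; the text does not state either value.
    """
    # Adder circuit 19: sum of the low-order cepstrum terms -> parameter 20.
    parameter = sum(cepstrum[:n_low])
    # Threshold comparison circuit 21: below the threshold is unvoiced,
    # at or above it is voiced -> discrimination result 22.
    return "unvoiced" if parameter < fixed_threshold else "voiced"
```

Because the threshold is fixed, the same parameter value always yields the same decision regardless of the background noise.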

[0005] FIG. 7 shows a model of the distribution of discrimination parameter 20 for two cases: when the background noise power is too large to ignore relative to the speech signal power (high noise level), and when it is small enough to ignore (low noise level). In the figure, curve A is unvoiced sound at a low noise level, curve C is unvoiced sound at a high noise level, curve D is the combined distribution of curves A and C, and curve B is the distribution of voiced sound. The distribution of discrimination parameter 20 for voiced sound does not change greatly with the noise level. Suppose E1 is the fixed threshold that optimally separates unvoiced sound A from voiced sound B at a low noise level; then at a high noise level, errors in which unvoiced sound C is judged to be voiced sound B increase. Conversely, suppose E2 is the fixed threshold that optimally separates unvoiced sound C from voiced sound B at a high noise level; then at a low noise level, errors in which voiced sound B is judged to be unvoiced sound A increase. If E3 is the fixed threshold that optimally separates unvoiced sound D from voiced sound B, discrimination errors clearly increase compared with using E1 at a low noise level and E2 at a high noise level. Moreover, whichever of these thresholds is used, discrimination errors are frequent and reliability is low when discrimination parameter 20 takes a value near the threshold.

[0006]

[Problems to be Solved by the Invention] Because the conventional voiced/unvoiced sound discrimination device (voice discrimination device) is configured as described above and uses only the sum of the low-order terms of the cepstrum as its discrimination parameter, discrimination errors are frequent when the parameter is near the discrimination threshold. Errors also increase for speech whose noise level differs from the background noise level assumed when the voiced/unvoiced discrimination threshold was set.

[0007] The present invention was made to solve these problems, and its object is to obtain a voiced/unvoiced sound discrimination device (voice discrimination device), and a discrimination method therefor, that produce few discrimination errors regardless of whether the background noise level is high or low.

[0008]

[Means for Solving the Problems] A voice discrimination device according to the present invention comprises: noise level determination means that divides an input speech frame into a plurality of subframes and obtains a noise level for each subframe on the basis of a comparison between input speech power and noise power; threshold calculation means that receives the noise level obtained for each subframe and obtains, for each subframe, a threshold for detecting a speech section; and collating means that determines, using the thresholds obtained for the subframes, whether the input speech frame is a speech section.

[0009] The noise level determination means outputs a binarized noise level. The threshold calculation means stores in advance two constants, one for each value of this binary determination, selects from the two constants the one corresponding to the binarized noise level value, and calculates the threshold using the selected constant.

[0010] Alternatively, the noise level determination means outputs a noise level quantized to three or more values. The threshold calculation means stores in advance a plurality of constants, one for each value of this multi-valued determination, selects from the plurality of constants the one corresponding to the multi-valued noise level value, and calculates the threshold using the selected constant.
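
The constant selection described above amounts to a table lookup per subframe. The sketch below is an assumption-laden illustration: the table values in `LEVEL_CONSTANTS` are hypothetical, and multiplying the noise power by the selected constant is an assumed formula, since the text only says that the selected constant is used to calculate the threshold.

```python
# Hypothetical constants, one per quantized noise level (0 = quietest).
LEVEL_CONSTANTS = (2.0, 3.0, 4.5)


def threshold_for_subframe(noise_level, noise_power, constants=LEVEL_CONSTANTS):
    """Pick the stored constant for this noise level and derive a threshold.

    Scaling the noise power by the constant is an assumed formula, not one
    taken from the text.
    """
    if not 0 <= noise_level < len(constants):
        raise ValueError("noise level outside the stored table")
    return constants[noise_level] * noise_power


def thresholds_for_frame(noise_levels, noise_power):
    """One threshold per subframe, given one noise level per subframe."""
    return [threshold_for_subframe(level, noise_power) for level in noise_levels]
```

With a two-entry table the same code covers the binarized case.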

[0011] A voice discrimination method according to the present invention comprises: a noise level determination step of dividing an input speech frame into a plurality of subframes and obtaining a noise level for each subframe on the basis of a comparison between input speech power and noise power; a threshold calculation step of receiving the noise level obtained for each subframe and obtaining, for each subframe, a threshold for detecting a speech section; and a collating step of determining, using the thresholds obtained for the subframes, whether the input speech frame is a speech section.

[0012] The noise level determination step outputs a binarized noise level. The threshold calculation step stores in advance two constants, one for each value of this binary determination, selects from the two constants the one corresponding to the binarized noise level value, and calculates the threshold using the selected constant.

[0013] Alternatively, the noise level determination step outputs a noise level quantized to three or more values. The threshold calculation step stores in advance a plurality of constants, one for each value of this multi-valued determination, selects from the plurality of constants the one corresponding to the multi-valued noise level value, and calculates the threshold using the selected constant.

[0014]

[Operation] In the embodiments of the invention described below, in a voiced/unvoiced sound discrimination device (voice discrimination device) that discriminates between voiced and unvoiced sound in a speech signal, a discrimination condition for discriminating voiced sound, unvoiced sound, and silence is selected from a plurality of different discrimination conditions on the basis of the value of a discrimination parameter obtained by analyzing the input speech frame. In accordance with the selected condition, the collating means uses at least one of the power, the peak value of the normalized autocorrelation, the number of zero crossings, the first-order linear prediction coefficient, the discrimination results of past speech frames, and the low-order terms of the cepstrum as discrimination parameters, collates them with predetermined thresholds, and outputs the voiced, unvoiced, or silence discrimination result. This reduces discrimination errors among voiced sound, unvoiced sound, and silence.

[0015] In the embodiments described below, the discrimination parameters obtained by analyzing the input speech frame are collated with the discrimination conditions for voiced sound, unvoiced sound, and silence. If the frame falls into one of these categories, voiced sound, unvoiced sound, or silence is output as the discrimination result. If it does not fall reliably into any category, the collating means outputs quasi-voiced sound when the frame has voiced-like features and quasi-silence when it has silence-like features. Thus, in addition to voiced sound, unvoiced sound, and silence, the intermediate results quasi-voiced sound and quasi-silence can be output.

[0016] In the embodiments described below, the noise level determination means obtains the background noise level of the input speech frame and outputs it as the noise level. According to this noise level value, a discrimination condition for discriminating voiced sound, unvoiced sound, and silence is selected from a plurality of different discrimination conditions, and the collating means collates the discrimination parameters obtained by analyzing the input speech frame with predetermined thresholds to perform the discrimination. The thresholds for voiced, unvoiced, and silence discrimination can therefore be varied according to the noise level value.

[0017] In the embodiments described below, the noise level determination means uses at least one of the discrimination result, the power, and the peak value of the normalized autocorrelation of the input speech frame and past speech frames as a discrimination parameter and collates it with a predetermined threshold, thereby determining which sections of the input and past speech frames correspond to voiced sections and silent sections. It calculates the average power of the voiced sections and of the silent sections to give the voiced average power and the silence average power, respectively, and by comparing the two it can determine and output whether the noise level is high or low.

[0018] In the embodiments described below, the noise level determination means calculates, updating at every input speech frame, the average power of the frames whose power is larger than the average power of past speech frames, and takes this as the voiced average power. Likewise, it calculates, updating at every input speech frame, the average power of the frames whose power is smaller than the average power of past speech frames, and takes this as the silence average power. By comparing the voiced average power with the silence average power, it can determine and output whether the noise level is high or low.
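
The per-frame update in paragraph [0018] can be sketched as follows. The class and attribute names are hypothetical, and two choices are assumptions not fixed by the text: the "average power of past speech frames" is taken over all frames seen so far, and all averages are plain arithmetic means.

```python
class RunningPowerAverages:
    """Track 'voiced' and 'silence' average power, updated every frame."""

    def __init__(self):
        self.total = 0.0        # sum of all frame powers seen so far
        self.count = 0
        self.voiced_sum = 0.0   # powers of frames above the past average
        self.voiced_n = 0
        self.silence_sum = 0.0  # powers of frames below the past average
        self.silence_n = 0

    def update(self, power):
        if self.count:  # a past average exists only after the first frame
            past_average = self.total / self.count
            if power > past_average:
                self.voiced_sum += power
                self.voiced_n += 1
            elif power < past_average:
                self.silence_sum += power
                self.silence_n += 1
        self.total += power
        self.count += 1

    @property
    def voiced_average(self):
        return self.voiced_sum / self.voiced_n if self.voiced_n else None

    @property
    def silence_average(self):
        return self.silence_sum / self.silence_n if self.silence_n else None
```

Comparing `voiced_average` with `silence_average` then gives the high/low noise decision; a small gap between them suggests that the background noise is strong relative to the speech.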

[0019]

[Embodiments] Embodiment 1. FIG. 1 is a block diagram showing Embodiment 1 of the voiced/unvoiced sound discrimination device (voice discrimination device) and discrimination method according to the present invention. In FIG. 1, 1 is the power, a discrimination parameter obtained by analyzing the input speech frame; 2 is the peak value of the normalized autocorrelation; 3 is the number of zero crossings; 4 is the first-order linear prediction coefficient; 5 is noise level determination means; 6 is the noise level; 7 is the silence average power; 8 is the voiced average power; 9 is threshold calculation means; 10 is the power discrimination threshold; 11 is collating means; 12 is the discrimination result; 13 is a register; 14 is the power of past speech frames; 15 is the normalized autocorrelation peak value of past speech frames; 16 is the discrimination result of past speech frames; and 17 is the low-order terms of the cepstrum.
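
Three of the discrimination parameters listed above (power 1, normalized autocorrelation peak 2, zero-crossing count 3) can be computed from a frame of samples as sketched below. The definitions are common textbook ones and are assumptions here: the text does not say whether power is a mean square, a sum, or a logarithmic value, nor which lag range is searched for the autocorrelation peak.

```python
def frame_power(frame):
    """Mean squared amplitude of the frame (assumed definition of power)."""
    return sum(x * x for x in frame) / len(frame)


def zero_crossings(frame):
    """Number of sign changes between consecutive samples."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))


def normalized_autocorr_peak(frame, min_lag, max_lag):
    """Largest autocorrelation in the lag range, divided by the lag-0 value.

    min_lag and max_lag are hypothetical search bounds; max_lag must be
    smaller than the frame length.
    """
    energy = sum(x * x for x in frame)
    if energy == 0:
        return 0.0
    best = max(
        sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        for lag in range(min_lag, max_lag + 1)
    )
    return best / energy
```

A strongly periodic frame gives a peak close to 1, while a noise-like frame gives a small peak and many zero crossings.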

[0020] The operation of the voiced/unvoiced sound discrimination device (voice discrimination device) of FIG. 1 and of its discrimination method is described below with reference to the figure. First, noise level determination means 5 applies a preset silent-section condition to the normalized autocorrelation peak value 2 of the input speech frame, the past normalized autocorrelation peak values 15 stored in register 13, and the discrimination results 16 of past speech frames (for example, for 10 consecutive frames the normalized autocorrelation peak value is below a predetermined threshold P1 and the frames have been judged to be silence). It obtains the average power of the section satisfying this condition from the power 1 of the input speech frame and the power 14 of past speech frames, and outputs it as silence average power 7. Likewise, it obtains the average power of the section satisfying a voiced-section condition (for example, for 5 consecutive frames the peak value of the normalized autocorrelation is at or above a predetermined threshold P2) from power 1 and power 14, and outputs it as voiced average power 8. When the difference between silence average power 7 and voiced average power 8 is smaller than a predetermined threshold D1, the noise level is judged to be high and "1" is output as noise level 6, the output of noise level determination means 5; when the difference is larger than D1, the noise level is judged to be low and "0" is output as noise level 6.

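The noise level decision of paragraph [0020] can be sketched as below, using the example conditions from the text (10 consecutive silent frames, 5 consecutive voiced frames). The function names, the string label `"silence"`, and the sign convention (voiced average minus silence average) are assumptions. For simplicity the sketch expects one discrimination result per frame, and it treats a difference exactly equal to D1 as low noise, a case the text leaves open.

```python
def run_average(powers, flags, min_run):
    """Average power over frames lying in runs of at least min_run True flags."""
    selected, run = [], []
    for power, flag in zip(powers, flags):
        if flag:
            run.append(power)
            continue
        if len(run) >= min_run:
            selected.extend(run)
        run = []
    if len(run) >= min_run:  # a qualifying run may end at the last frame
        selected.extend(run)
    return sum(selected) / len(selected) if selected else None


def noise_level(powers, ac_peaks, results, p1, p2, d1):
    """Return 1 (high noise), 0 (low noise), or None if a section is missing."""
    silent = [ac < p1 and r == "silence" for ac, r in zip(ac_peaks, results)]
    voiced = [ac >= p2 for ac in ac_peaks]
    silence_avg = run_average(powers, silent, min_run=10)
    voiced_avg = run_average(powers, voiced, min_run=5)
    if silence_avg is None or voiced_avg is None:
        return None
    # Equality with D1 is not covered by the text; it counts as low noise here.
    return 1 if voiced_avg - silence_avg < d1 else 0
```
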
[0021] Next, threshold calculation means 9 determines the discrimination thresholds for the power of the input speech frame, using equation (1) when the input noise level 6 is "0" and equation (2) when it is "1", where PUV denotes the silence average power 7 and PV the voiced average power 8 input from noise level determination means 5. It sends the power discrimination threshold 10 given by equations (1) and (2), the output of threshold calculation means 9, to collating means 11.

[0022]

[Equation 1]

[0023] Here, TH1, TH2, and TH3 are the power discrimination thresholds, PUV is the silence average power, and PV is the voiced average power.

[0024] Next, collating means 11 receives as inputs the power 1 of the input speech frame, the peak value 2 of the normalized autocorrelation, the number of zero crossings 3, the first-order linear prediction coefficient 4, the sum 17 of the low-order terms of the cepstrum, the noise level 6 from noise level determination means 5, the power discrimination threshold 10 from threshold calculation means 9, and the discrimination result 16 of past speech frames from register 13. It first selects, for example, either category a below or one of categories b to e. In case a, that is, when any one of the logical products in discrimination condition (3) is satisfied, the frame is judged to be unvoiced and discrimination result 12 is output. Otherwise, which of categories b to e is selected is decided by the magnitude relationship between TH, the power discrimination threshold 10 from threshold calculation means 9, and POW, the power 1 of the input speech frame. Here, a corresponds to the case where the frame can be judged unvoiced, b to a high probability of voiced sound, c to a somewhat high probability of voiced sound, d to a somewhat high probability of silence, and e to a high probability of silence.

[0025] Next, depending on which of categories b to e was selected, the frame is judged to be voiced sound, quasi-voiced sound, quasi-silence, or silence according to the decision flow of FIG. 2, FIG. 3, FIG. 4, or FIG. 5, respectively, and discrimination result 12 is output. Because the conditions under which voiced sound, unvoiced sound, and silence can be discriminated differ among categories a to e, the discrimination conditions must be set individually for each category; they are determined experimentally. Here, quasi-voiced sound is defined as the case where some of the conditions for a voiced judgment are not met, and quasi-silence as the case where some of the conditions for a silence judgment are not met.

[0026]

[Equation 2]

[0027]
b: If POW > TH1, the decision follows FIG. 2.
c: If TH1 ≥ POW > TH2, the decision follows FIG. 3.
d: If TH2 ≥ POW > TH3, the decision follows FIG. 4.
e: If POW ≤ TH3, the decision follows FIG. 5.
In the discriminant for category a above and in FIGS. 2, 3, 4, and 5 for categories b to e, TH1, TH2, and TH3 are the power discrimination thresholds 10 (where TH1 > TH2 > TH3), PUV is the silence average power 7, PV is the voiced average power 8, POW is the power 1, AC is the peak value 2 of the normalized autocorrelation, C is the sum 17 of the low-order terms of the cepstrum, CMIN is the discrimination threshold for the sum of the low-order terms of the cepstrum, Z is the number of zero crossings 3, A1 is the first-order linear prediction coefficient 4, NL is the noise level 6, and VO is the discrimination result 16 of past speech frames. T1, T11, T12, T2, T21, T22, T23, T24, T3, T31, T32, T33, T34, T4, T41, T42, T43, and T44 are all fixed thresholds.
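
The selection among categories b to e is a cascade of power comparisons, sketched below. The function name is hypothetical, and the category-a test (condition (3)) is left out because its content is not reproduced in this text.

```python
def select_category(pow_, th1, th2, th3):
    """Map frame power POW to category 'b'..'e' (requires TH1 > TH2 > TH3)."""
    if not th1 > th2 > th3:
        raise ValueError("thresholds must satisfy TH1 > TH2 > TH3")
    if pow_ > th1:
        return "b"  # POW > TH1: decision flow of FIG. 2
    if pow_ > th2:
        return "c"  # TH1 >= POW > TH2: FIG. 3
    if pow_ > th3:
        return "d"  # TH2 >= POW > TH3: FIG. 4
    return "e"      # POW <= TH3: FIG. 5
```

Each returned category would then dispatch to its own experimentally tuned decision flow.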

[0028] Next, register 13 takes in the power 1 and the normalized autocorrelation peak value 2 of the input speech frame and updates its stored contents: the power, the normalized autocorrelation peak values, and the collating-means discrimination results of the past 10 frames.
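
A fixed-depth history like register 13 maps naturally onto a bounded queue. The sketch below uses Python's `collections.deque` with `maxlen`; the class name and field names are hypothetical, and the depth of 10 frames follows the text.

```python
from collections import deque


class FrameRegister:
    """Keeps power, autocorrelation peak, and result for the last 10 frames."""

    def __init__(self, depth=10):
        self.powers = deque(maxlen=depth)
        self.ac_peaks = deque(maxlen=depth)
        self.results = deque(maxlen=depth)

    def update(self, power, ac_peak, result):
        # Appending to a full deque drops the oldest entry automatically.
        self.powers.append(power)
        self.ac_peaks.append(ac_peak)
        self.results.append(result)
```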

[0029] Embodiment 2. In Embodiment 1 the power discrimination thresholds are determined from the silence average power and the voiced average power, but they may instead be determined from the maximum power of past speech frames, for example by equation (4).

[0030]

[Equation 3]

[0031] In equation (4), TH1, TH2, and TH3 are the power discrimination thresholds, and Pmax is, for example, the maximum power over the past 30 frames. It is also possible to use the maximum power of past speech frames to correct the power discrimination thresholds obtained from the silence average power and the voiced average power, or to correct the voiced, unvoiced, or silence discrimination result.
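
Tracking Pmax over a sliding window can be sketched as follows. The window of 30 frames follows the text, but `thresholds_from_pmax` is only a hypothetical stand-in: equation (4) is not reproduced here, so scaling Pmax by three fixed factors (with made-up values) is an assumption, not the patent's formula.

```python
from collections import deque


class MaxPowerTracker:
    """Maximum frame power over the most recent `window` frames."""

    def __init__(self, window=30):
        self.recent = deque(maxlen=window)

    def update(self, power):
        self.recent.append(power)
        return max(self.recent)


def thresholds_from_pmax(pmax, factors=(0.5, 0.25, 0.125)):
    """Hypothetical stand-in for equation (4): scale Pmax by three factors."""
    th1, th2, th3 = (pmax * f for f in factors)
    return th1, th2, th3
```

Decreasing factors keep the ordering TH1 > TH2 > TH3 for any positive Pmax.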

[0032] Embodiment 3. In Embodiment 1, silence is discriminated according to FIG. 2 from the peak value of the normalized autocorrelation function, the discrimination results of past speech frames, and the noise level. It is also possible, for example, to use the low-order terms of the cepstrum coefficients to obtain the spectral envelope of frames previously judged to be silence, and to discriminate silence from the distance between this spectral envelope and the spectrum of the input speech frame.
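
One way to picture the distance-based silence check is shown below. Everything specific is an assumption: the text names neither the distance measure nor how the reference envelope is formed, so the sketch uses the mean of past silent frames' low-order cepstra as the reference and a Euclidean distance with a hypothetical limit `max_distance`.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equal-length cepstrum vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cepstral_distance(c_a, c_b):
    """Euclidean distance between two low-order cepstrum vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c_a, c_b)))


def looks_silent(frame_cepstrum, silent_cepstra, max_distance):
    """True if the frame is close to the average shape of past silent frames."""
    reference = mean_vector(silent_cepstra)
    return cepstral_distance(frame_cepstrum, reference) <= max_distance
```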

[0033] Embodiment 4. In Embodiment 1, discrimination uses the discrimination parameters obtained by analyzing each input speech frame. It is also possible to divide the input speech frame into a plurality of subframes and to perform the discrimination, or correct the discrimination result, using parameters obtained by analyzing each subframe.
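
Subframe analysis can be sketched as splitting the frame and computing a parameter per piece. The helper names are hypothetical, equal-length subframes are an assumption, and power is again taken as the mean squared amplitude.

```python
def split_into_subframes(frame, n_subframes):
    """Split a frame into n equal-length subframes."""
    if n_subframes <= 0 or not frame or len(frame) % n_subframes:
        raise ValueError("frame length must be a positive multiple of n_subframes")
    size = len(frame) // n_subframes
    return [frame[i:i + size] for i in range(0, len(frame), size)]


def subframe_powers(frame, n_subframes):
    """Mean squared amplitude of each subframe."""
    return [
        sum(x * x for x in sub) / len(sub)
        for sub in split_into_subframes(frame, n_subframes)
    ]
```

Per-subframe values like these could feed a per-subframe threshold comparison, which reacts faster to onsets than a single whole-frame value.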

[0034] Embodiment 5. In Embodiment 1, the power of the input speech frame is used as the discrimination parameter for selecting the discrimination-condition category, but the sum of the low-order terms of the cepstrum may be used instead.

[0035] Embodiment 6. In Embodiment 1 the noise level is determined as a binary value, but it may instead be a multi-valued or continuous quantity.

[0036] Embodiment 7. In Embodiment 1, the maximum amplitude within the frame may also be included among the discrimination parameters.

[0037]

[Effects of the Invention] As described above, the present invention provides a voiced/unvoiced sound discrimination device (voice discrimination device) and discrimination method that produce few discrimination errors even when a discrimination parameter obtained by analyzing the input speech frame is near its discrimination threshold, and few discrimination errors regardless of whether the background noise level is high or low. It also provides a device and method that can discriminate speech frames in an intermediate state having both voiced-like and unvoiced-like features.

[Brief description of drawings]

[FIG. 1] A block diagram of a voiced/unvoiced sound discrimination device (voice discrimination device) showing Embodiment 1 of the present invention.

[FIG. 2] A diagram illustrating discrimination conditions of the voiced/unvoiced sound discrimination device (voice discrimination device) of FIG. 1.

[FIG. 3] A diagram illustrating discrimination conditions of the voiced/unvoiced sound discrimination device (voice discrimination device) of FIG. 1.

[FIG. 4] A diagram illustrating discrimination conditions of the voiced/unvoiced sound discrimination device (voice discrimination device) of FIG. 1.

[FIG. 5] A diagram illustrating discrimination conditions of the voiced/unvoiced sound discrimination device (voice discrimination device) of FIG. 1.

[FIG. 6] A block diagram showing a conventional voiced/unvoiced sound discrimination device (voice discrimination device).

[FIG. 7] A diagram showing the distribution of the discrimination parameter of the voiced/unvoiced sound discrimination device (voice discrimination device) of FIG. 6.

[Explanation of symbols]

1: power of the input speech frame, 2: peak value of the normalized autocorrelation, 3: number of zero crossings, 4: first-order linear prediction coefficient, 5: noise level determination means, 6: noise level, 7: silence average power, 8: voiced average power, 9: threshold calculation means, 10: power discrimination threshold, 11: collating means, 12: discrimination result, 13: register, 14: power of past speech frames, 15: normalized autocorrelation peak value of past speech frames, 16: discrimination result of past speech frames, 17: low-order terms of the cepstrum, 18: cepstrum, 19: adder circuit, 20: discrimination parameter, 21: threshold comparison circuit, 22: discrimination result.

Continuation of the front page
(56) References: JP-A-4-100099 (JP, A); JP-A-5-130067 (JP, A); JP-A-5-224686 (JP, A)
(58) Fields searched (Int. Cl.7, DB name): G10L 11/02; G10L 15/04

Claims

(57) [Claims]

1. A voice discrimination device comprising: noise level determination means that divides an input speech frame into a plurality of subframes and obtains a noise level for each subframe on the basis of a comparison between input speech power and noise power; threshold calculation means that receives the noise level obtained for each subframe and obtains, for each subframe, a threshold for detecting a speech section; and collating means that determines whether the input speech frame is a speech section, using the result of comparing, for each subframe, the threshold obtained for that subframe with a predetermined parameter obtained by analyzing the input speech for that subframe, wherein the noise level determination means outputs a noise level quantized to three or more values, and the threshold calculation means stores in advance a plurality of constants corresponding to the respective values of this multi-valued determination, selects from the plurality of constants the constant corresponding to the multi-valued noise level value, and calculates the threshold using the selected constant.

2. A voice discrimination method comprising: a noise level determination step of dividing an input speech frame into a plurality of subframes and obtaining a noise level for each subframe on the basis of a comparison between input speech power and noise power; a threshold calculation step of receiving the noise level obtained for each subframe and obtaining, for each subframe, a threshold for detecting a speech section; and a collating step of determining whether the input speech frame is a speech section, using the result of comparing, for each subframe, the threshold obtained for that subframe with a predetermined parameter obtained by analyzing the input speech for that subframe, wherein the noise level determination step outputs a noise level quantized to three or more values, and the threshold calculation step stores in advance a plurality of constants corresponding to the respective values of this multi-valued determination, selects from the plurality of constants the constant corresponding to the multi-valued noise level value, and calculates the threshold using the selected constant.