JP2559475B2

JP2559475B2 - Voice detection method

Info

Publication number: JP2559475B2
Application number: JP63238049A
Authority: JP
Inventors: 新吾西村; 正志宮川; 雅幸海野
Original assignee: Sekisui Chemical Co Ltd
Current assignee: Sekisui Chemical Co Ltd
Priority date: 1988-09-22
Filing date: 1988-09-22
Publication date: 1996-12-04
Anticipated expiration: 2011-12-04
Also published as: JPH0285896A

Description

[Detailed Description of the Invention]

[Industrial Field of Application] The present invention relates to a voice detection method.

[Prior Art] Many methods have conventionally been used to detect the presence of speech in a noisy environment. As described in Japanese Examined Patent Publication No. 57-12999, such methods are used to detect speech intervals in communications or as preprocessing for recognizing spoken language content. Extending them to general use under high noise has been difficult, however. For example, it has not been possible to start answering a hands-free telephone by voice while the ring tone is sounding.

One simple method of detecting the presence of speech in a noisy environment is to count the number of times the input signal crosses a reference axis within a fixed time interval.

[Problems to be Solved by the Invention] However, the conventional voice detection methods described above generally assume that the amplitude of the noise is small compared with the amplitude of the speech. When the noise amplitude is comparable to the speech amplitude, they cannot detect the presence of speech.

The present applicant has therefore proposed two voice detection methods that can easily detect the presence of speech in a noisy environment: ① a method that detects voiced sound using, as feature parameters, the number of reference-axis crossings of the input signal and the crest value (a dimensionless measure of the amplitude level of the waveform); and ② a method that detects voiced sound using, as feature parameters, the number of reference-axis crossings of the input signal and the super-reference amplitude time (the time within a fixed time interval during which the waveform amplitude exceeds a threshold set with reference to the effective value).

Although methods ① and ② above are more useful than the conventional methods, there is a limit to how far they can improve the voiced sound detection rate, for the following reasons.

In the case of ①, if the input signal has a particularly high amplitude level at even one point within the fixed time interval, for example because of noise, that level becomes the crest value. Such intervals tend to increase the variance of the crest value and hinder improvement of the voiced sound detection rate.

In the case of ②, the super-reference amplitude time is merely a value along the time axis of the waveform. This parameter alone carries too little information about the amplitude level of the waveform, which makes it difficult to improve the voiced sound detection rate.

An object of the present invention is to detect the presence of speech in a noisy environment easily and with a high detection rate, even when the noise amplitude is large and strongly affects speech detection.

[Means for Solving the Problem] The invention according to claim 1 calculates, as feature parameters, the number of reference-axis crossings of the input signal, a crest value expressed as the ratio of the effective value of the amplitude within a fixed time interval to the maximum absolute value of the amplitude within that interval, and the time within the fixed time interval during which the waveform amplitude exceeds a threshold set with reference to the effective value. It compares the calculated result with dictionary data for voiced sound and specific noise and determines whether the input signal contains voiced sound. As the crest value, the crest value P given by the following equation is used, for example.

$$P = 20 \log_{10}\left(\frac{V_P}{V_{rms}}\right)$$

where $V_P$ is the maximum absolute value of the amplitude within the fixed time interval and $V_{rms}$ is the effective value of the amplitude within the same interval.

The invention according to claim 2 uses a crest value expressed as the ratio of the mean absolute value of the amplitude within a fixed time interval to the maximum absolute value of the amplitude within that interval, for example the crest value P given by the following equation.

$$P = 20 \log_{10}\left(\frac{V_P}{V_a}\right)$$

where $V_P$ is the maximum absolute value of the amplitude within the fixed time interval and $V_a$ is the mean of the absolute values of the amplitude within the same interval.

[Action] The invention according to claim 1 detects speech in a noisy environment as follows. In the present invention, voiced sound is treated as speech. Voiced sounds are sounds accompanied by vibration of the vocal cords, such as vowels, semivowels, and nasals, and almost all speech uttered by humans contains voiced sound.

(1) For voiced sound and specific noise, dictionary data is prepared whose feature parameters are: the number of reference-axis crossings of the signal within a fixed time interval (the number of times the signal crosses a predetermined reference level, such as the zero level); the crest value expressed as the ratio of the effective value of the amplitude within the fixed time interval to the maximum absolute value of the amplitude within that interval; and the time within the fixed time interval during which the waveform amplitude exceeds a threshold set with reference to the effective value (the super-reference amplitude time).

For example, the following (a), (b), and (c) are used as the dictionary data.

(a) Sets of feature parameters for voiced sounds obtained from a large number of voices.

(b) A large number of sets of feature parameters obtained for the specific noise (for example, the ring tone of a specific telephone).

(c) Sets of feature parameters obtained, for a large number of voices, from the result of adding voiced sound and the specific noise at a specific ratio.

The data of (a), (b), and (c) above can be prepared in various forms: numerical data obtained by converting acoustic data into feature parameters, statistical data such as means and variances obtained by statistically processing the numerical data, or discriminant data such as boundary equations determined from the statistical data.
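As a rough illustration of the statistical form of dictionary data mentioned above, the following sketch reduces a list of feature vectors to per-axis means and standard deviations. The function name `summarize_category`, the tuple layout, and the sample values are hypothetical; the patent does not specify a data layout.

```python
import statistics

def summarize_category(feature_sets):
    """Reduce feature vectors (x1, x2, x3) to per-axis mean and std.

    feature_sets: list of equal-length tuples, one per analysed interval.
    Returns (means, stds) as tuples with one entry per feature axis.
    """
    columns = list(zip(*feature_sets))                    # one column per axis
    means = tuple(statistics.fmean(c) for c in columns)
    stds = tuple(statistics.pstdev(c) for c in columns)   # population std
    return means, stds

# Hypothetical feature sets:
# (reference-axis crossings, crest value in dB, super-reference amplitude time)
voiced = [(10, 6.0, 40), (12, 8.0, 44), (14, 7.0, 48)]
means, stds = summarize_category(voiced)
```

The same function could be applied separately to the data of (a), (b), and (c) to obtain one set of statistics per category.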

(2) The input signal is sampled, and the number of reference-axis crossings within a fixed time interval, the crest value expressed as the ratio of the effective value of the amplitude within the fixed time interval to the maximum absolute value of the amplitude within that interval, and the super-reference amplitude time are calculated as feature parameters.

(3) The feature parameters calculated in (2) are compared, in the parameter space, with the standard patterns defined by the dictionary data prepared in (1), and pattern recognition determines whether the input signal contains voiced sound.

The pattern recognition using the dictionary data is performed, for example, as follows.

① A parameter space is constructed that is divided in two by the category "voiced sound" defined by the dictionary data (the category of the voiced sound of (a), or of the voiced sound to which the specific noise has been added at a specific ratio as in (c)) and the category "other", and it is determined which category the feature parameters of the input signal belong to.

② Next, considering that the specific noise may have a large amplitude and strongly affect the detection of voiced sound, a boundary between the category "specific noise" and the category "voiced sound" is defined in addition to ① above, and it is determined which of these categories the feature parameters of the input signal belong to.

If, as a result of the determinations in ① and ② above, the input signal belongs to the category "voiced sound" in ① and does not belong to the category "specific noise" in ②, it is determined that voiced sound is present in the input signal.
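The two-stage decision described above can be sketched as follows. The predicates `in_voiced_region` and `in_noise_region` are hypothetical placeholders for whatever boundaries the dictionary data defines, and the toy thresholds are for illustration only.

```python
from typing import Callable, Sequence

Features = Sequence[float]

def contains_voiced_sound(
    features: Features,
    in_voiced_region: Callable[[Features], bool],
    in_noise_region: Callable[[Features], bool],
) -> bool:
    """Return True only if the features fall in the "voiced sound" category
    (first test) and do not fall in the "specific noise" category (second test)."""
    return in_voiced_region(features) and not in_noise_region(features)

# Toy boundaries on the crest value (index 1), purely for illustration.
is_voiced = lambda f: f[1] > 5.0
is_noise = lambda f: f[1] > 12.0
```

Keeping the two tests as separate predicates mirrors the text: the first test alone would accept loud specific noise that happens to resemble voiced sound, and the second test rejects it.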

Thus, the invention according to claim 1 uses three feature parameters: the number of reference-axis crossings, the crest value (the ratio of the effective value of the amplitude within a fixed time interval to the maximum absolute value of the amplitude within that interval), and the super-reference amplitude time. This compensates for two shortcomings. ① When only the number of reference-axis crossings and the crest value are used, noise and similar effects increase the variance of the crest value, and the voiced sound detection rate tends not to improve; using the super-reference amplitude time as well compensates for this. ② When only the number of reference-axis crossings and the super-reference amplitude time are used, information about the amplitude level of the waveform is insufficient; using the crest value as well compensates for this. As a result, even when the specific noise has a large amplitude and strongly affects speech detection, the categories "voiced sound" and "specific noise" can be clearly separated in the parameter space, and the presence of speech in a noisy environment can be detected easily and with a high detection rate.

According to the invention of claim 1, using the crest value described above means using a parameter value that faithfully reflects the sharp waveform characteristic of voiced sound, which improves discrimination from noise. In addition, the effective value of the amplitude can be shared between the calculation of the crest value and that of the super-reference amplitude time, so the amount of computation is small and detection becomes simpler.

According to the invention of claim 2, the crest value is expressed as the ratio of the mean absolute value of the amplitude within a fixed time interval to the maximum absolute value of the amplitude within that interval. Compared with the invention of claim 1, this reduces the amount of computation while still using a parameter value that reflects the sharp waveform characteristic of voiced sound relatively faithfully, which improves discrimination from noise. A smaller amount of computation means a faster response.

[Embodiment] FIG. 1 is a block diagram showing an example of a voice detection device used to implement the present invention, and FIG. 2 is a schematic diagram showing the parameter space formed by the feature parameters of the present invention.

In FIG. 1, 11 is a microphone, 12 is an amplifier, 13 is a low-pass filter, 14 is an A/D converter, 15 is a parameter calculation unit, 16 is a dictionary data storage unit, 17 is a determination unit, and 18 is a result output unit. In this embodiment, speech in a noisy environment is detected as follows.

(1) For voiced sound and specific noise, dictionary data is prepared whose feature parameters are the number of reference-axis crossings X₁, the crest value X₂, and the super-reference amplitude time X₃ of the signal over 20 ms, and this data is stored in the dictionary data storage unit 16.

Either of the following ① and ② can be used as the crest value X₂.

① The crest value P given by the following equation.

$$P = 20 \log_{10}\left(\frac{V_P}{V_{rms}}\right)$$

where $V_P$ is the maximum absolute value of the amplitude within the fixed time interval and $V_{rms}$ is the effective value of the amplitude within the same interval.

② The crest value P given by the following equation.

$$P = 20 \log_{10}\left(\frac{V_P}{V_a}\right)$$

where $V_P$ is the maximum absolute value of the amplitude within the fixed time interval and $V_a$ is the mean of the absolute values of the amplitude within the same interval.

When the crest value of ① is used, a parameter value that reflects the sharp waveform characteristic of voiced sound relatively faithfully is used, which improves discrimination from noise.

When the crest value of ② is used, the amount of computation can be reduced compared with the crest value of ①, and a parameter value that faithfully reflects the sharp waveform characteristic of voiced sound is used, which improves discrimination from noise.
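The two crest value definitions above can be compared with a minimal sketch. The function names and the test frames are hypothetical, and the sketch assumes a non-silent frame (an all-zero frame would divide by zero).

```python
import math

def crest_rms_db(frame):
    """Crest value using the effective (RMS) amplitude: 20*log10(V_P / V_rms)."""
    v_p = max(abs(s) for s in frame)
    v_rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return 20.0 * math.log10(v_p / v_rms)

def crest_mean_abs_db(frame):
    """Crest value using the mean absolute amplitude: 20*log10(V_P / V_a)."""
    v_p = max(abs(s) for s in frame)
    v_a = sum(abs(s) for s in frame) / len(frame)
    return 20.0 * math.log10(v_p / v_a)

square = [1.0, -1.0] * 100      # flat waveform: peak equals RMS and mean |x|
impulse = [0.0] * 199 + [4.0]   # one sharp peak in an otherwise silent frame
```

For the flat square wave both functions return 0 dB, while the single sharp peak yields a large value under either definition. The mean-absolute variant needs no squaring or square root, which is consistent with the smaller amount of computation noted in the text.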

For example, the following (a), (b), and (c) are created as the dictionary data.

(a) Sets of feature parameters for the voiced sound [a] obtained from a large number of voices.

(b) A large number of sets of feature parameters obtained for the specific noise (the ring tone of a specific telephone).

(c) Sets of feature parameters obtained, for a large number of voices, from the result of adding the voiced sound [a] and the specific noise at a voiced-sound-to-specific-noise ratio of −10 dB, where the ratio is defined as

$$20 \log_{10}\left(\frac{S_{rms}}{N_{rms}}\right)\ \text{[dB]}$$

Here $S_{rms}$ is the effective value of the amplitude of the voiced sound [a], and $N_{rms}$ is the effective value of the amplitude of the specific noise.
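One way to produce a mixture at a given voiced-sound-to-specific-noise ratio, as in (c), is to rescale the noise before adding it. The sketch below assumes both signals are equal-length lists of samples; `mix_at_ratio` and the synthetic test signals are hypothetical and stand in for recorded speech and ring-tone data.

```python
import math

def rms(signal):
    """Effective (RMS) value of a sequence of samples."""
    return math.sqrt(sum(s * s for s in signal) / len(signal))

def mix_at_ratio(voiced, noise, ratio_db):
    """Scale noise so that 20*log10(S_rms / N_rms) equals ratio_db,
    then add it to the voiced signal sample by sample."""
    target_noise_rms = rms(voiced) / (10.0 ** (ratio_db / 20.0))
    gain = target_noise_rms / rms(noise)
    scaled_noise = [gain * n for n in noise]
    mixed = [v + n for v, n in zip(voiced, scaled_noise)]
    return mixed, scaled_noise

# Synthetic stand-ins: a sine for the voiced sound, a square wave for the noise.
voiced = [math.sin(2 * math.pi * 5 * k / 200) for k in range(200)]
noise = [1.0 if (k // 10) % 2 == 0 else -1.0 for k in range(200)]
mixed, scaled_noise = mix_at_ratio(voiced, noise, -10.0)
```

At −10 dB the noise RMS is about 3.16 times the voiced-sound RMS, which is the high-noise condition the embodiment targets.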

(2) The input signal is picked up by the microphone 11, amplified by the amplifier 12, and passed through the low-pass filter 13 to cut components at or above 4.2 kHz. The A/D converter 14 converts it into a digital signal with a sampling frequency of 10 kHz and a resolution of 16 bits, which is sent to the parameter calculation unit 15. The parameter calculation unit 15 calculates, as feature parameters, the number of reference-axis crossings X₁, the crest value X₂, and the super-reference amplitude time X₃ of the input signal over 20 ms.
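The parameter calculation of step (2) can be sketched as below. The 10 kHz sampling rate and 20 ms interval come from the embodiment; everything else is an assumption made for illustration: the function names, non-overlapping frames, a threshold equal to `threshold_scale` times the RMS value, comparison on the absolute amplitude, milliseconds as the unit of X₃, and the use of the RMS-based crest value.

```python
import math

SAMPLE_RATE_HZ = 10_000                          # from the embodiment
FRAME_MS = 20                                    # from the embodiment
FRAME_LEN = SAMPLE_RATE_HZ * FRAME_MS // 1000    # 200 samples per frame

def frame_features(frame, threshold_scale=1.0, reference=0.0):
    """Return (X1, X2, X3) for one frame.

    X1: number of crossings of the reference level
    X2: crest value 20*log10(V_P / V_rms) in dB
    X3: time in ms during which |amplitude| exceeds threshold_scale * V_rms
    """
    # X1: count changes of side relative to the reference level
    signs = [s >= reference for s in frame]
    x1 = sum(1 for a, b in zip(signs, signs[1:]) if a != b)
    # V_rms is computed once and shared by X2 and X3
    v_rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    v_p = max(abs(s) for s in frame)
    x2 = 20.0 * math.log10(v_p / v_rms)
    above = sum(1 for s in frame if abs(s) > threshold_scale * v_rms)
    x3 = above * 1000.0 / SAMPLE_RATE_HZ
    return x1, x2, x3

def split_frames(samples):
    """Split samples into consecutive, non-overlapping 20 ms frames."""
    n = len(samples) // FRAME_LEN
    return [samples[i * FRAME_LEN:(i + 1) * FRAME_LEN] for i in range(n)]

# 500 Hz square wave: every sample sits exactly at the RMS level.
square = [1.0 if (k // 10) % 2 == 0 else -1.0 for k in range(400)]
features = [frame_features(f) for f in split_frames(square)]
```

Computing the RMS value once and reusing it for both X₂ and X₃ reflects the shared calculation that the description credits for the low processing cost.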

(3) The determination unit 17 compares the feature parameters calculated in (2) with the standard patterns defined by the dictionary data prepared in (1), determines whether the input signal contains voiced sound, and outputs the result from the result output unit 18.

The pattern recognition using the dictionary data described above is performed, for example, in the parameter space of FIG. 2 as follows.

In FIG. 2, the three feature parameters, namely the number of zero crossings (the reference-axis level set to the zero level), the crest value, and the super-reference amplitude time, are plotted on the X₁, X₂, and X₃ axes, respectively. In FIG. 2, μ₁ is the mean of the dictionary parameters for voiced sound (the voiced sound [a] of (a), or the voiced sound to which the specific noise has been added at a specific voiced-sound-to-specific-noise ratio as in (c)), and σ₁₁, σ₁₂, and σ₁₃ are the standard deviations of its X₁, X₂, and X₃ components. μ₂, σ₂₁, σ₂₂, and σ₂₃ are the corresponding values for the dictionary parameters of the specific noise.

① A boundary 1 is defined that separates the category "voiced sound" defined by the dictionary data (the category of the voiced sound [a] of (a), or of the voiced sound to which the specific noise has been added at a specific ratio as in (c)) from the category "other". The side of boundary 1 that contains the mean μ₁ of the voiced sound dictionary data is the category "voiced sound". Boundary 1 is a concentration ellipse that expresses how closely the voiced sound dictionary data is concentrated around the mean; changing the lengths of its axes changes the proportion of the voiced sound dictionary data that falls inside the ellipse. In this embodiment, the axis lengths were set so that 90% of the voiced sound dictionary data falls inside the ellipse. The broken line represents the concept of the category "voiced sound" defined by μ and σ. In this step ①, therefore, it is determined on which side of boundary 1 the feature parameters of the input signal lie.
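A boundary of this kind can be sketched as an axis-aligned ellipsoid around the mean. This is an illustration under stated assumptions, not the patent's procedure: the axes are taken to be aligned with X₁, X₂, X₃ (no correlation between parameters), the 90% coverage is obtained from the empirical distribution of the dictionary samples, and all names and sample values are hypothetical.

```python
import math
import statistics

def normalised_distance(x, mu, sigma):
    """Distance from mu with each axis scaled by its standard deviation."""
    return math.sqrt(sum(((xi - m) / s) ** 2 for xi, m, s in zip(x, mu, sigma)))

def fit_ellipsoid(samples, coverage=0.9):
    """Fit an axis-aligned ellipsoid around the mean of samples.

    Returns (mu, sigma, radius), where radius is the normalised distance
    that encloses at least the given fraction of the samples.
    """
    cols = list(zip(*samples))
    mu = [statistics.fmean(c) for c in cols]
    sigma = [statistics.pstdev(c) for c in cols]
    dists = sorted(normalised_distance(x, mu, sigma) for x in samples)
    index = max(0, math.ceil(coverage * len(dists)) - 1)
    return mu, sigma, dists[index]

def inside(x, mu, sigma, radius):
    """True if x lies on the mean side of the boundary."""
    return normalised_distance(x, mu, sigma) <= radius

# Hypothetical voiced-sound dictionary samples (X1, X2, X3).
samples = [(10 + (i % 5), 6.0 + 0.5 * (i % 4), 40 + 2 * (i % 3)) for i in range(20)]
mu, sigma, radius = fit_ellipsoid(samples)
```

Scaling `radius` up or down corresponds to changing the axis lengths of the ellipse, and therefore the proportion of dictionary data it encloses.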

② Next, considering that the specific noise may have a large amplitude and strongly affect the detection of voiced sound, a boundary 2 between the category "specific noise" and the category "voiced sound" is defined in addition to the above. The side of boundary 2 that contains the mean μ₂ of the specific noise is the category "specific noise". Boundary 2 is the set of points at which the likelihoods for the category "voiced sound" and the category "specific noise" are equal. In this embodiment the specific noise is an artificially generated telephone ring tone, and its standard deviation is generally smaller than that of the dictionary data for voiced sound mixed with the specific noise at a specific voiced-sound-to-specific-noise ratio, so the category "specific noise" forms a closed region. The broken line represents the concept of the category "specific noise" defined by μ and σ. In this step ②, therefore, it is determined on which side of boundary 2 the feature parameters of the input signal lie.
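An equal-likelihood boundary can be illustrated by comparing log-likelihoods under two Gaussian models. The description does not state the distribution used, so the axis-aligned Gaussian below is an assumption, and the dictionary statistics are hypothetical values chosen so that the noise category has the smaller standard deviations.

```python
import math

def log_likelihood(x, mu, sigma):
    """Log density of an axis-aligned Gaussian with mean mu and std sigma."""
    return sum(
        -0.5 * ((xi - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2.0 * math.pi)
        for xi, m, s in zip(x, mu, sigma)
    )

def on_noise_side(x, voiced, noise):
    """True if x is more likely under the "specific noise" model."""
    return log_likelihood(x, *noise) > log_likelihood(x, *voiced)

# Hypothetical dictionary statistics: (mu, sigma) per category.
voiced = ((12.0, 8.0, 6.0), (4.0, 3.0, 2.0))
noise = ((30.0, 4.0, 9.0), (1.0, 0.5, 0.5))
```

Because the noise model is the narrower one, points far from both means fall on the voiced-sound side, so the noise category is a closed region around μ₂, as the text describes.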

When, as a result of the two determinations above, the input signal lies on the μ₁ side of boundary 1 and does not lie on the μ₂ side of boundary 2 in the feature parameter space, the input signal is determined to belong to the category "voiced sound". That is, it is determined that voiced sound is present in the input signal.

Thus, the above embodiment uses three feature parameters: the number of reference-axis crossings, the crest value, and the super-reference amplitude time. This compensates for two shortcomings. ① When only the number of reference-axis crossings and the crest value are used, noise and similar effects increase the variance of the crest value, and the voiced sound detection rate tends not to improve; using the super-reference amplitude time as well compensates for this. ② When only the number of reference-axis crossings and the super-reference amplitude time are used, information about the amplitude level of the waveform is insufficient; using the crest value as well compensates for this. As a result, even when the specific noise has a large amplitude and strongly affects speech detection, the categories "voiced sound" and "specific noise" can be clearly separated in the parameter space, and the presence of speech in a noisy environment can be detected easily and with a high detection rate.

In particular, an experiment was conducted in a very noisy environment with a voiced-sound-to-specific-noise ratio of −10 dB. The detection rate was 50% when ① voiced sound was detected using the number of zero crossings and the crest value as feature parameters, and 60% when ② voiced sound was detected using the number of zero crossings and the super-reference amplitude time as feature parameters. In the above embodiment it was 90%, confirming the effect of the present invention. In addition, because the calculations of the crest value and the super-reference amplitude time share many steps (for example, the effective value of the amplitude), the processing time of the above embodiment was almost the same as in case ① or ②.

In the above embodiment, a concentration ellipse and the set of points with equal likelihood for two categories were used as the boundaries defining the standard patterns in the feature parameter space. The present invention can of course be implemented with other common pattern recognition techniques. For example, instead of the set of points with equal likelihood for the categories "voiced sound" and "specific noise", the set of points at equal Mahalanobis distance or equal Euclidean distance can be used.
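The difference between the two alternative distance measures can be seen in a small sketch. It assumes a diagonal covariance for the Mahalanobis distance (per-axis standard deviations only), and the category statistics and the test point are hypothetical.

```python
import math

def euclidean(x, mu):
    """Plain Euclidean distance from x to the category mean."""
    return math.sqrt(sum((xi - m) ** 2 for xi, m in zip(x, mu)))

def mahalanobis_diag(x, mu, sigma):
    """Mahalanobis distance for a diagonal covariance (per-axis std sigma)."""
    return math.sqrt(sum(((xi - m) / s) ** 2 for xi, m, s in zip(x, mu, sigma)))

def nearest_category(x, categories, use_mahalanobis=True):
    """Return the name of the category whose mean is closest to x."""
    def dist(stats):
        mu, sigma = stats
        return mahalanobis_diag(x, mu, sigma) if use_mahalanobis else euclidean(x, mu)
    return min(categories, key=lambda name: dist(categories[name]))

# Hypothetical (mu, sigma) per category.
categories = {
    "voiced sound": ((12.0, 8.0, 6.0), (4.0, 3.0, 2.0)),
    "specific noise": ((30.0, 4.0, 9.0), (1.0, 0.5, 0.5)),
}
x = (24.0, 6.0, 8.0)
```

For this point the Euclidean distance favours "specific noise" because its mean is nearer in raw units, while the Mahalanobis distance favours "voiced sound" because the point lies many standard deviations from the tightly clustered noise data.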

[Effects of the Invention] As described above, according to the present invention, the presence of speech in a noisy environment can be detected easily and with a high detection rate, even when the noise amplitude is large and strongly affects speech detection.

[Brief description of drawings]

FIG. 1 is a block diagram showing an example of a voice detection device used to implement the present invention, and FIG. 2 is a schematic diagram showing the parameter space formed by the feature parameters of the present invention.

11: microphone, 15: parameter calculation unit, 16: dictionary data storage unit, 17: determination unit, 18: result output unit.

Continuation of front page: (51) Int. Cl.⁶ G10L 9/18, identification code 301; FI G10L 9/18 301A

Claims

(57) [Claims]

1. A voice detection method characterized by calculating, as feature parameters, the number of reference-axis crossings of an input signal, a crest value expressed as the ratio of the effective value of the amplitude within a fixed time interval to the maximum absolute value of the amplitude within that interval, and the time within the fixed time interval during which the waveform amplitude exceeds a threshold set with reference to the effective value; comparing the calculated result with dictionary data for voiced sound and specific noise; and determining whether the input signal contains voiced sound.

2. A voice detection method characterized by calculating, as feature parameters, the number of reference-axis crossings of an input signal, a crest value expressed as the ratio of the mean absolute value of the amplitude within a fixed time interval to the maximum absolute value of the amplitude within that interval, and the time within the fixed time interval during which the waveform amplitude exceeds a threshold set with reference to the effective value; comparing the calculated result with dictionary data for voiced sound and specific noise; and determining whether the input signal contains voiced sound.