JPH0285897A

JPH0285897A - Voice detecting system

Info

Publication number: JPH0285897A
Application number: JP63238050A
Authority: JP
Inventors: Shigenobu Nonaka; 重信野中; Masayuki Unno; 海野　雅幸
Original assignee: Sekisui Chemical Co Ltd
Current assignee: Sekisui Chemical Co Ltd
Priority date: 1988-09-22
Filing date: 1988-09-22
Publication date: 1990-03-27

Abstract

PURPOSE: To detect the presence of a voiced sound with a high detection rate by calculating specific values of an input signal as feature parameters and comparing the calculated values with dictionary data. CONSTITUTION: A parameter calculating part 17 calculates, as feature parameters, the number of reference-axis crossings of an input signal from a microphone 11, a value related to the amplitude distribution of the waveform, and a value related to the power spectrum. A dictionary data storage part 18 holds standard parameters defined by dictionary data for voiced sounds and a specific noise. The input signal is preprocessed by a multi-channel band-pass filter 14 and an A/D converter 15 to extract the feature parameter related to the power spectrum, and by an A/D converter 16 for the other feature parameters. A deciding part 19 then compares the values calculated by the calculating part 17 with the data of the storage part 18 and decides whether the input signal contains a voiced sound. In this way, even when the amplitude of the noise is large, the presence of a voice can be detected easily and with a high detection rate.

Description

[Detailed Description of the Invention] [Field of Industrial Application] The present invention relates to a voice detection method.

[Prior Art] Many methods have conventionally been available for detecting the presence of speech in a noisy environment. They are used, for example, to detect speech intervals in communications, as described in Japanese Examined Patent Publication No. 57-12999, or as preprocessing for recognizing spoken-language content. Extending them to general use under high noise is difficult, however; for example, a hands-free telephone could not be made to start answering by voice while its ringing bell was sounding.

One simple method of detecting the presence of speech in a noisy environment has been to count the number of times the input signal crosses a reference axis within a fixed time interval.

[Problems to Be Solved by the Invention] However, the conventional voice detection methods described above generally presuppose that the amplitude of the noise is small compared with the amplitude of the speech. When the noise amplitude is comparable to the speech amplitude, they cannot detect the presence of speech.

The present applicant has therefore proposed, as voice detection methods that can easily detect the presence of speech in a noisy environment, ① a method of detecting voiced sounds using as feature parameters the number of reference-axis crossings of the input signal and the crest value (a dimensionless measure of the amplitude level of the waveform), and ② a method of detecting voiced sounds using as feature parameters the number of reference-axis crossings of the input signal and the super-reference amplitude time (the time within a fixed time interval during which the amplitude of the waveform exceeds a threshold based on the effective value).

Although voice detection methods ① and ② above are more useful than the conventional methods, there is a limit to how far they can improve the voiced-sound detection rate, for the following reason.

Speech is characterized by having much power in its low-frequency components and little power in its high-frequency components. However, the feature parameters used in methods ① and ② above, namely the number of reference-axis crossings and the values related to the amplitude distribution of the waveform (crest value, super-reference amplitude time), approximate the frequency of the dominant frequency component of the input signal and provide no information about the frequency distribution of the input signal. Methods ① and ② therefore lack information about the frequency distribution, which is one of the basic characteristics of voiced sounds, and this makes it difficult to improve the voiced-sound detection rate.

An object of the present invention is to detect the presence of speech in a noisy environment easily and with a high detection rate, even when the amplitude of the noise is large and strongly affects speech detection.

[Means for Solving the Problems] The invention according to claim 1 calculates, as feature parameters, the number of reference-axis crossings of an input signal, a value related to the amplitude distribution of the waveform, and a value related to the power spectrum, compares the calculated results with dictionary data for voiced sounds and a specific noise, and determines whether the input signal contains a voiced sound.

Here, the number of reference-axis crossings is the number of times the input signal crosses a predetermined reference level, such as the zero level. When the input signal contains a voiced sound, it generally takes a value within a certain range.
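The crossing count described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `count_reference_crossings`, the default reference level of zero, and the choice to skip samples that lie exactly on the reference level are all assumptions.

```python
from typing import Sequence


def count_reference_crossings(frame: Sequence[float], reference: float = 0.0) -> int:
    """Count how often the signal crosses the reference level within one frame."""
    crossings = 0
    previous_sign = 0  # sign of the last sample that was not exactly on the reference
    for sample in frame:
        diff = sample - reference
        sign = (diff > 0) - (diff < 0)
        if sign == 0:
            continue  # assumption: a sample exactly on the reference is not a crossing by itself
        if previous_sign != 0 and sign != previous_sign:
            crossings += 1
        previous_sign = sign
    return crossings
```

Setting `reference` to a non-zero level gives the more general reference-axis case; the zero level corresponds to the ordinary zero-crossing count.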

In the invention according to claim 2, the crest value P expressed by, for example, the following equation is used as the value related to the amplitude distribution of the waveform. When the input signal contains a voiced sound, this crest value generally takes a value within a certain range.

$$P = 20 \log_{10}\left(V_p / V_{rms}\right)$$

where $V_p$ is the maximum absolute value of the amplitude within a fixed time interval and $V_{rms}$ is the effective (RMS) value of the amplitude within the same time interval.

In the invention according to claim 3, the crest value P expressed by, for example, the following equation is used as the value related to the amplitude distribution of the waveform. When the input signal contains a voiced sound, this crest value generally takes a value within a certain range.

$$P = 20 \log_{10}\left(V_p / V_a\right)$$

where $V_p$ is the maximum absolute value of the amplitude within a fixed time interval and $V_a$ is the mean of the absolute value of the amplitude within the same time interval.

In the invention according to claim 4, the time within a fixed time interval during which the amplitude exceeds a threshold based on the effective value (called the super-reference amplitude time) is used as the value related to the amplitude distribution of the waveform.

When the input signal contains a voiced sound, this super-reference amplitude time generally takes a value within a certain range.
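The three amplitude-distribution values of claims 2 to 4 can be sketched as below. This is only an illustration under stated assumptions: the function names are hypothetical, the threshold is taken as `threshold_scale` times the RMS value (the text says only that the threshold is based on the effective value), and the time is obtained by counting samples above the threshold.

```python
import math
from typing import Sequence


def _rms(frame: Sequence[float]) -> float:
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def crest_value_rms_db(frame: Sequence[float]) -> float:
    """P = 20*log10(Vp / Vrms): peak absolute amplitude over the RMS value (claim 2)."""
    peak = max(abs(s) for s in frame)
    rms = _rms(frame)
    if rms == 0.0:
        raise ValueError("silent frame: crest value is undefined")
    return 20.0 * math.log10(peak / rms)


def crest_value_mean_db(frame: Sequence[float]) -> float:
    """P = 20*log10(Vp / Va): peak absolute amplitude over the mean absolute amplitude (claim 3)."""
    peak = max(abs(s) for s in frame)
    mean_abs = sum(abs(s) for s in frame) / len(frame)
    if mean_abs == 0.0:
        raise ValueError("silent frame: crest value is undefined")
    return 20.0 * math.log10(peak / mean_abs)


def super_reference_time(frame: Sequence[float], sample_rate_hz: float,
                         threshold_scale: float = 1.0) -> float:
    """Seconds within the frame during which |amplitude| exceeds threshold_scale * RMS (claim 4)."""
    threshold = threshold_scale * _rms(frame)  # assumption: threshold proportional to the RMS value
    samples_over = sum(1 for s in frame if abs(s) > threshold)
    return samples_over / sample_rate_hz
```

The mean-absolute variant avoids the squaring and square root of the RMS variant, and the time-over-threshold variant needs no logarithm, which matches the ordering of computational cost stated for claims 2 to 4.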

In the invention according to claim 5, the value related to the power spectrum is obtained by dividing the audio frequency band of the input signal into a plurality of channels with a multi-channel band-pass filter and taking the ratio of the sum of the powers obtained from the low-band channels to the total of the powers obtained from all channels. Voiced sounds are characterized by having more power on the low-frequency side than on the high-frequency side. Accordingly, when the input signal contains a voiced sound, this ratio generally takes a larger value than it does for noise. The multi-channel band-pass filter consists of a plurality of channels obtained by dividing the audio frequency band into a plurality of bands at equal or logarithmic intervals.
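The low-band power ratio can be sketched in software as follows. Note the substitution: the patent uses an analog multi-channel band-pass filter, whereas this sketch approximates the channel powers by grouping the bins of a plain discrete Fourier transform. The function names, the band edges, and the number of channels counted as "low band" are assumptions for illustration only.

```python
import cmath
import math
from typing import List, Sequence


def band_powers(frame: Sequence[float], sample_rate_hz: float,
                band_edges_hz: Sequence[float]) -> List[float]:
    """Approximate per-channel power by summing DFT bin powers between consecutive band edges."""
    n = len(frame)
    powers = [0.0] * (len(band_edges_hz) - 1)
    for k in range(1, n // 2 + 1):
        freq = k * sample_rate_hz / n
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))
        for ch in range(len(powers)):
            if band_edges_hz[ch] <= freq < band_edges_hz[ch + 1]:
                powers[ch] += abs(coeff) ** 2
                break
    return powers


def low_band_power_ratio(powers: Sequence[float], n_low_channels: int) -> float:
    """Ratio of the summed power of the lowest channels to the total power of all channels."""
    total = sum(powers)
    if total == 0.0:
        raise ValueError("no power in any channel")
    return sum(powers[:n_low_channels]) / total
```

A frame dominated by low-frequency energy gives a ratio near 1, and a frame dominated by high-frequency energy gives a ratio near 0, which is the property the text relies on to separate voiced sounds from noise.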

[Operation] The invention according to claim 1 detects speech in a noisy environment as follows. In the present invention, voiced sounds are treated as speech. Voiced sounds are sounds accompanied by vibration of the vocal cords, such as vowels, semivowels, and nasals, and almost all human utterances contain them.

(1) For voiced sounds and the specific noise, dictionary data is prepared whose feature parameters are the number of reference-axis crossings within a fixed time interval of those signals, a value related to the amplitude distribution of the waveform, and a value related to the power spectrum of the waveform.

As the dictionary data, for example, the following (a), (b), and (c) are used.

(a) A set of feature parameters for voiced sounds obtained from a large number of utterances.

(b) A set of many feature parameters obtained for the specific noise (for example, the ringing bell sound of a specific telephone).

(c) A set of feature parameters obtained, for a large number of utterances, from the result of adding the voiced sound and the specific noise at a specific ratio.

The data of (a), (b), and (c) above can be prepared in various forms: numerical data obtained by converting acoustic data into feature parameters, statistical data such as means and variances obtained by statistically processing that numerical data, or discriminant data such as boundary equations determined from the statistical data.
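As one illustration of the "statistical data" form, the per-axis mean and standard deviation of a category can be computed from its feature vectors. The names `FeatureVector` and `summarize_category` and the use of the population standard deviation are assumptions; the patent does not specify how the statistics are computed.

```python
from statistics import fmean, pstdev
from typing import List, Sequence, Tuple

# Hypothetical layout: (crossings, amplitude-distribution value, power-spectrum value)
FeatureVector = Tuple[float, float, float]


def summarize_category(samples: Sequence[FeatureVector]) -> Tuple[List[float], List[float]]:
    """Return the per-axis means and population standard deviations of one category."""
    axes = list(zip(*samples))  # transpose: one tuple of values per feature axis
    means = [fmean(axis) for axis in axes]
    stds = [pstdev(axis) for axis in axes]
    return means, stds
```

One such summary per category (voiced sound, specific noise, voiced sound plus noise) would be enough to define boundaries in the parameter space.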

(2) An input signal is acquired, and the number of reference-axis crossings within a fixed time interval of the input signal, a value related to the amplitude distribution of the waveform, and a value related to the power spectrum of the waveform are calculated as feature parameters.

(3) The feature parameters calculated in (2) above are compared, in the parameter space, with the standard patterns defined by the dictionary data prepared in (1) above, and whether the input signal contains a voiced sound is determined by pattern recognition.

The pattern recognition using the dictionary data is performed, for example, as follows.

① A parameter space is constructed that is divided in two between the category "voiced sound" defined by the dictionary data (the category of the voiced sound of (a) above, or of the voiced sound of (c) above mixed with the specific noise at a specific ratio) and the category "other", and it is determined to which category the feature parameters of the input signal belong.

② Next, because the specific noise may have a large amplitude and may strongly affect the detection of voiced sounds, a boundary between the category "specific noise" and the category "voiced sound" is defined in addition to ① above, and it is determined to which category the feature parameters of the input signal belong.

③ As a result of the determinations in ① and ② above, a voiced sound is determined to be present in the input signal on the condition that the input signal belongs to the category "voiced sound" in ① and does not belong to the category "specific noise" in ②.

Thus, compared with the case where only two feature parameters are used (the number of reference-axis crossings and the value related to the amplitude distribution of the waveform), the invention according to claim 1 uses as a third parameter a feature parameter that reflects the bias of the frequency distribution, which is one of the basic characteristics of voiced sounds. Consequently, even when the amplitude of the specific noise is large and strongly affects speech detection, the voiced-sound category and the specific-noise category can be clearly separated in the parameter space, and the presence of speech in a noisy environment can be detected easily and with a high detection rate.

According to the invention of claim 2, the crest value described above is used as the value related to the amplitude distribution of the waveform, so a parameter value that faithfully reflects the sharply peaked waveform characteristic of voiced sounds is used, which has the advantage of improving discrimination from noise.

According to the invention of claim 3, the crest value described above is used as the value related to the amplitude distribution of the waveform, so the amount of computation can be reduced compared with the invention of claim 2 while still using a parameter value that reflects the sharply peaked waveform characteristic of voiced sounds relatively faithfully, which has the advantage of improving discrimination from noise. A smaller amount of computation means a faster response.

According to the invention of claim 4, the super-reference amplitude time described above is used as the value related to the amplitude distribution of the waveform, which has the advantage that the amount of computation can be reduced further compared with the invention of claim 2 or 3.

According to the invention of claim 5, the power ratio described above is used as the value related to the power spectrum of the waveform, so a parameter value that reflects the slope of the frequency distribution characteristic of voiced sounds is used, which has the advantage of improving discrimination from noise.

[Embodiment] FIG. 1 is a block diagram showing an example of a voice detection device used to carry out the present invention, and FIG. 2 is a schematic diagram showing the parameter space formed by the feature parameters of the present invention.

In FIG. 1, 11 is a microphone, 12 is an amplifier, 13 is a low-pass filter, 14 is a multi-channel band-pass filter, 15 and 16 are A/D converters, 17 is a parameter calculation unit, 18 is a dictionary data storage unit, 19 is a determination unit, and 20 is a result output unit. In this embodiment, speech in a noisy environment is detected as follows.

(1) For voiced sounds and the specific noise, dictionary data is prepared whose feature parameters are the number of reference-axis crossings X1 within a 20 ms interval of those signals, the value X2 related to the amplitude distribution of the waveform, and the value X3 related to the power spectrum of the waveform, and this dictionary data is stored in the dictionary data storage unit 18.

Here, any one of ①, ②, and ③ below can be used as the value X2 related to the amplitude distribution of the waveform.

① The crest value P expressed by the following equation.

$$P = 20 \log_{10}\left(V_p / V_{rms}\right)$$

where $V_p$ is the maximum absolute value of the amplitude within a fixed time interval and $V_{rms}$ is the effective (RMS) value of the amplitude within the same time interval.

② The crest value P expressed by the following equation.

$$P = 20 \log_{10}\left(V_p / V_a\right)$$

where $V_p$ is the maximum absolute value of the amplitude within a fixed time interval and $V_a$ is the mean of the absolute value of the amplitude within the same time interval.

③ The time within a fixed time interval during which the amplitude exceeds a threshold based on the effective value (the super-reference amplitude time).
, Maximum absolute value of amplitude within a time interval ■ Average value of absolute value of amplitude within a fixed time interval ■ Time when the amplitude exceeds a threshold value using the effective value as a guideline within a fixed time interval (super reference amplitude time).

When the crest value of ① above is used, a parameter value that reflects the sharply peaked waveform characteristic of voiced sounds relatively faithfully is used, which has the advantage of improving discrimination from noise.

When the crest value of ② above is used, the amount of computation is smaller than for the crest value of ① above, and a parameter value that faithfully reflects the sharply peaked waveform characteristic of voiced sounds is used, which has the advantage of improving discrimination from noise.

When the value of ③ above is used, there is the advantage that the amount of computation is smaller still than for the crest values of ① and ② above.

As the value X3 related to the power spectrum, the ratio described earlier can be used: the audio frequency band of the input signal is divided into a plurality of channels by the multi-channel band-pass filter, and the ratio of the sum of the powers obtained from the low-band channels to the total of the powers obtained from all channels is taken. Using this power ratio means using a parameter value that reflects the slope of the frequency distribution characteristic of voiced sounds, which has the advantage of improving discrimination from noise.
Parameter values that reflect the slope of the frequency distribution, which is a characteristic of voiced sounds, are used, which has the advantage of improving distinguishability from noise.

As the dictionary data, for example, the following (a), (b), and (c) are created.

(a) A set of feature parameters for the voiced sound [a] obtained from a large number of utterances.

(b) A set of many feature parameters obtained for the specific noise (the ringing bell sound of a specific telephone).

(c) A set of feature parameters obtained, for a large number of utterances, from the result of adding the voiced sound [a] and the specific noise at voiced-sound-to-specific-noise ratios of 3, 0, −3, −6, and −10 dB, where the ratio is defined as

$$20 \log_{10}\left(S_{rms} / N_{rms}\right) \ \text{[dB]}$$

Here $S_{rms}$ is the effective (RMS) value of the amplitude of the voiced sound [a] and $N_{rms}$ is the effective (RMS) value of the amplitude of the specific noise.
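Mixing a voiced sound with the specific noise at a prescribed ratio, as in item (c), can be sketched as follows. The function name `mix_at_ratio` is hypothetical, and scaling the noise (rather than the voiced sound) to reach the target ratio is an assumption; the patent gives only the definition of the ratio.

```python
import math
from typing import List, Sequence


def rms(signal: Sequence[float]) -> float:
    return math.sqrt(sum(s * s for s in signal) / len(signal))


def mix_at_ratio(voiced: Sequence[float], noise: Sequence[float], ratio_db: float) -> List[float]:
    """Add noise to the voiced sound so that 20*log10(S_rms / N_rms) equals ratio_db."""
    # Assumption: the noise is rescaled and the voiced sound is left unchanged.
    target_noise_rms = rms(voiced) / (10.0 ** (ratio_db / 20.0))
    gain = target_noise_rms / rms(noise)
    return [s + gain * n for s, n in zip(voiced, noise)]
```

Calling this for each of the ratios 3, 0, −3, −6, and −10 dB and then extracting the three feature parameters from each mixture would yield dictionary data of the kind described in (c).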

(2) An input signal is picked up by the microphone 11, amplified by the amplifier 12, and passed through the low-pass filter 13 so that only the audio-band components at or below 4.2 kHz are extracted. The input signal is then branched and sent along two paths: ① a path consisting of the multi-channel band-pass filter 14 and the A/D converter 15, which performs the preprocessing for extracting the feature parameter related to the power spectrum, and ② a path including the A/D converter 16, which performs the preprocessing for extracting the two feature parameters related to the amplitude distribution and the number of reference-axis crossings. The multi-channel band-pass filter 14 is a band-pass filter bank that divides the frequency band from 250 Hz to 4 kHz into 25 channels at 1/6-octave intervals. The A/D converters 15 and 16 have a sampling frequency of 10 kHz and a resolution of 16 bits. The digital values of the input signal and of its frequency-band signals obtained by these preprocessing stages (the filter 14 and the converters 15 and 16) are sent to the parameter calculation unit 17. The parameter calculation unit 17 calculates, as feature parameters, the number of reference-axis crossings X1 within a 20 ms interval of the input signal, the value X2 related to the amplitude distribution of the waveform, and the value X3 related to the power spectrum.
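The channel layout and frame size implied by these figures can be checked with a short calculation. This sketch reads the garbled channel spacing as 1/6 octave and the analysis interval as 20 ms, and it assumes that 250 Hz and 4 kHz are the center frequencies of the first and last channels; all three readings are assumptions, and the names are hypothetical.

```python
from typing import List


def sixth_octave_centers(f_low_hz: float = 250.0, n_channels: int = 25,
                         steps_per_octave: int = 6) -> List[float]:
    """Center frequencies spaced 1/steps_per_octave octave apart, starting at f_low_hz."""
    return [f_low_hz * 2.0 ** (k / steps_per_octave) for k in range(n_channels)]


SAMPLE_RATE_HZ = 10_000   # sampling frequency of A/D converters 15 and 16
FRAME_MS = 20             # assumed analysis interval
SAMPLES_PER_FRAME = SAMPLE_RATE_HZ * FRAME_MS // 1000

centers = sixth_octave_centers()
```

Under these assumptions, 25 channels at 1/6-octave spacing span exactly four octaves from 250 Hz to 4 kHz, and each analysis frame holds 200 samples.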

(3) The determination unit 19 compares the feature parameters calculated in (2) above with the standard parameters defined by the dictionary data prepared in (1) above, determines whether the input signal contains a voiced sound, and outputs the result from the result output unit 20.

The pattern recognition using the dictionary data is performed, for example, in the parameter space of FIG. 2 as follows.

FIG. 2 plots the three feature parameters, namely the number of zero crossings (the reference-axis level set to the zero level), the value related to the amplitude distribution of the waveform, and the value related to the power spectrum, on the X1, X2, and X3 axes, respectively. In FIG. 2, μ1, σ11, σ12, and σ13 denote, respectively, the mean of the dictionary parameters of the voiced sound (the voiced sound [a] of (a) above, or the voiced sound of (c) above mixed with the specific noise at a specific voiced-sound-to-specific-noise ratio), the standard deviation of its X1-axis component, the standard deviation of its X2-axis component, and the standard deviation of its X3-axis component. μ2, σ21, σ22, and σ23 denote the corresponding values for the dictionary parameters of the specific noise.

① A boundary 1 is defined that separates the category "voiced sound" defined by the dictionary data (the category of the voiced sound [a] of (a) above, or of the voiced sound of (c) above mixed with the specific noise at a specific ratio) from the category "other". The side of boundary 1 that contains the mean μ1 of the voiced-sound dictionary data is the category "voiced sound". Boundary 1 is a concentration ellipse that expresses how closely the voiced-sound dictionary data is concentrated around the mean; changing the lengths of its axes changes the proportion of the voiced-sound dictionary data that falls inside the ellipse. In this embodiment, the axis lengths were set so that 90% of the voiced-sound dictionary data falls inside the ellipse. The broken line represents the concept of the category "voiced sound" defined by μ and σ. In other words, step ① determines on which side of boundary 1 the feature parameters of the input signal lie.

② Next, because the specific noise may have a large amplitude and may strongly affect the detection of voiced sounds, a boundary 2 between the category "specific noise" and the category "voiced sound" is defined in addition to ① above. The side of boundary 2 that contains the mean μ2 of the specific noise is the category "specific noise". Boundary 2 is the set of points at which the likelihoods of the category "voiced sound" and the category "specific noise" are equal. In this embodiment, the specific noise is an artificially produced telephone ringing bell sound, so its standard deviation is generally smaller than that of the dictionary data for voiced sounds mixed with the specific noise at specific voiced-sound-to-specific-noise ratios; the category "specific noise" therefore forms a closed region. The broken line represents the concept of the category "specific noise" defined by μ and σ. In other words, step ② determines on which side of boundary 2 the feature parameters of the input signal lie.

③ As a result of the determinations in ① and ② above, when, in the feature parameter space, the input signal lies on the μ1 side of boundary 1 in ① and does not lie on the μ2 side of boundary 2 in ②, the input signal is determined to belong to the category "voiced sound". That is, a voiced sound is determined to be present in the input signal.
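The three-step decision can be sketched as below, under assumptions the patent does not state: each category is modeled as a Gaussian with independent axes (so the concentration ellipse is axis-aligned), boundary 1 is the ellipsoid of normalized radius `scale`, and boundary 2 is where the two Gaussian likelihoods are equal. For a three-dimensional Gaussian, a normalized radius of about 2.5 encloses roughly 90% of the data. All names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Stats = Tuple[Sequence[float], Sequence[float]]  # (per-axis means, per-axis standard deviations)


def normalized_sq_distance(x: Sequence[float], stats: Stats) -> float:
    mean, std = stats
    return sum(((xi - mi) / si) ** 2 for xi, mi, si in zip(x, mean, std))


def log_likelihood(x: Sequence[float], stats: Stats) -> float:
    """Gaussian log-likelihood with independent axes, up to a constant shared by all categories."""
    _, std = stats
    return -sum(math.log(si) for si in std) - 0.5 * normalized_sq_distance(x, stats)


def contains_voiced_sound(x: Sequence[float], voiced: Stats, noise: Stats,
                          scale: float = 2.5) -> bool:
    # Step 1: inside boundary 1 (concentration ellipsoid around the voiced-sound mean)?
    in_voiced = normalized_sq_distance(x, voiced) <= scale ** 2
    # Step 2: on the specific-noise side of boundary 2 (equal-likelihood surface)?
    in_noise = log_likelihood(x, noise) > log_likelihood(x, voiced)
    # Step 3: voiced sound is present only if step 1 holds and step 2 does not.
    return in_voiced and not in_noise
```

Because the noise category has the smaller spread in this model, the region where its likelihood dominates is bounded, which is consistent with the closed "specific noise" region described for boundary 2.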

Thus, compared with the case where only two feature parameters are used (the number of reference-axis crossings and the value related to the amplitude distribution of the waveform), the above embodiment uses as a third parameter a feature parameter that reflects the bias of the frequency distribution, which is one of the basic characteristics of voiced sounds. Consequently, even when the amplitude of the specific noise is large and strongly affects speech detection, the voiced-sound category and the specific-noise category can be clearly separated in the parameter space, and the presence of speech in a noisy environment can be detected easily and with a high detection rate.

In particular, the above embodiment was found to give a high voiced-sound detection rate even at a voiced-sound-to-specific-noise ratio of −6 dB, and a detection rate close to 100% at −3 dB.

In the above embodiment, a concentration ellipse and the set of points at which the likelihoods of two categories are equal were used as the boundaries defining the standard patterns in the feature parameter space, but other general pattern recognition techniques can of course be used in carrying out the present invention. For example, instead of the set of points at which the likelihoods of the category "voiced sound" and the category "specific noise" are equal, the set of points at which the Mahalanobis distances or the Euclidean distances to the two categories are equal can be used.
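The distance-based alternatives can be sketched as follows. The Mahalanobis distance here assumes a diagonal covariance (independent axes), which is a simplification; the general form uses the full covariance matrix. Function names are hypothetical.

```python
import math
from typing import Sequence


def mahalanobis_distance(x: Sequence[float], mean: Sequence[float],
                         std: Sequence[float]) -> float:
    """Mahalanobis distance under the assumption of a diagonal covariance matrix."""
    return math.sqrt(sum(((xi - mi) / si) ** 2 for xi, mi, si in zip(x, mean, std)))


def euclidean_distance(x: Sequence[float], mean: Sequence[float]) -> float:
    return math.dist(x, mean)


def nearer_to_voiced(x: Sequence[float],
                     voiced_mean: Sequence[float], voiced_std: Sequence[float],
                     noise_mean: Sequence[float], noise_std: Sequence[float]) -> bool:
    """True if x is closer (in Mahalanobis distance) to the voiced-sound category than to the noise."""
    return (mahalanobis_distance(x, voiced_mean, voiced_std)
            < mahalanobis_distance(x, noise_mean, noise_std))
```

The boundary is then the set of points where the two distances are equal. Unlike the Euclidean distance, the Mahalanobis distance scales each axis by the spread of the category, so axes with very different units (crossing counts, decibels, power ratios) contribute comparably.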

[Effects of the Invention] As described above, according to the present invention, the presence of speech in a noisy environment can be detected easily and with a high detection rate even when the amplitude of the noise is large and strongly affects speech detection.

[Brief Description of the Drawings]

FIG. 1 is a block diagram showing an example of a voice detection device used to carry out the present invention, and FIG. 2 is a schematic diagram showing the parameter space formed by the feature parameters of the present invention. 11: microphone, 14: multi-channel band-pass filter, 17: parameter calculation unit, 18: dictionary data storage unit, 19: determination unit, 20: result output unit. Patent applicant: Sekisui Chemical Co., Ltd. Representative: Kaoru Hirota (廣田 馨)

Claims

[Claims]

(1) A voice detection method in which the number of reference-axis crossings of an input signal, a value related to the amplitude distribution of the waveform, and a value related to the power spectrum are calculated as feature parameters, the calculated results are compared with dictionary data for voiced sounds and a specific noise, and it is determined whether the input signal contains a voiced sound.

(2) The voice detection method according to claim 1, wherein the value related to the amplitude distribution of the waveform is a crest value expressed as the ratio of the maximum absolute value of the amplitude within a fixed time interval to the effective (RMS) value of the amplitude within the same time interval.

(3) The voice detection method according to claim 1, wherein the value related to the amplitude distribution of the waveform is a crest value expressed as the ratio of the maximum absolute value of the amplitude within a fixed time interval to the mean of the absolute value of the amplitude within the same time interval.

(4) The voice detection method according to claim 1, wherein the value related to the amplitude distribution of the waveform is the time within a fixed time interval during which the amplitude exceeds a threshold based on the effective value.

(5) The voice detection method according to claim 1, wherein the value related to the power spectrum is the ratio of the sum of the powers obtained from the low-band channels to the total of the powers obtained from all channels, the channels being obtained by dividing the audio frequency band of the input signal into a plurality of channels with a multi-channel band-pass filter.