JPH05173592A

JPH05173592A - Method and device for voice/non-voice discrimination

Info

Publication number: JPH05173592A
Application number: JP3342631A
Authority: JP
Inventors: Yoshihisa Nakato; 良久中藤; Takeshi Norimatsu; 武志則松
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1991-12-25
Filing date: 1991-12-25
Publication date: 1993-07-13

Abstract

PURPOSE: To provide a voice/non-voice discrimination device of simple construction that makes the voice/non-voice decision automatically and with high precision. CONSTITUTION: A feature extraction part 11 extracts plural feature quantities from an input signal at constant time intervals, and a threshold determination part 12 determines thresholds in advance using training data covering many speech and non-speech samples. A rough decision part 13 and a detail decision part 14 decide between speech and other sounds by comparing the extracted feature quantities with the thresholds, and a final decision part decides whether a section is voice according to the proportion of frames that the rough decision part and the detail decision part judged to be voice.

Description

Detailed Description of the Invention

[0001]

[Industrial Field of Application] The present invention relates to a voice/non-voice discrimination method and device that determine whether an input signal is voice or some other sound, for use in switching the television cameras or microphones of a video conference system, in the preprocessing stage of a voice recognition device, and in similar applications.

[0002]

[Prior Art] In a voice/non-voice discrimination device used to switch the television cameras or microphones of a video conference system, switching may be triggered by inputs other than voice, such as the various noises present in a conference room. Likewise, in a device that performs voice processing such as voice recognition, misrecognition occurs if a signal other than voice is input and mistakenly judged to be voice. A voice/non-voice discrimination device that can accurately determine whether an input signal is voice is therefore required.

[0003] Conventional voice/non-voice discrimination devices generally simplify processing by judging as voice any portion in which the power of the input signal exceeds a predetermined threshold. In a real environment such as a conference room, however, various high-power sounds other than voice may be input, such as the sound of turning the pages of paper documents or noise caused by microphone vibration when someone breathes on it, so voice and non-voice cannot be distinguished by power alone.

[0004] Several methods have therefore been proposed that determine the voicedness of an input signal using a plurality of voice feature quantities other than power. One example is the method of "A Pattern Recognition Approach to Voiced-unvoiced-silence classification with application to speech recognition," IEEE Trans. Acoust., Speech, Signal Processing, ASSP-24-3 (1976). This method uses pattern recognition to classify silence, unvoiced sound, and voiced sound in a single step; if silence and unvoiced sound are regarded as noise rich in high-frequency components, it can be considered a kind of voice/non-voice discrimination device based on detecting voiced (vowel-like) sound. Specifically, five feature quantities are computed from the input at fixed time intervals: the number of zero crossings, the log energy of the signal, the first-order autocorrelation coefficient, the first-order linear prediction coefficient, and the log energy of the linear prediction residual. A normal distribution is assumed for each feature quantity, and the silence/unvoiced/voiced decision is made from their joint probability.

[0005]

[Problems to Be Solved by the Invention] However, the voice/non-voice discrimination device described above can remove essentially only noise whose energy is dominant in the high-frequency range. It also uses no feature quantities based on the characteristics of the individual phonemes in speech, so highly accurate voice/non-voice discrimination using feature quantities suited to detecting each phoneme is needed. The present invention solves these problems, and its object is to provide a high-performance voice/non-voice discrimination device. By using a feature quantity that is effective at removing noise whose energy is dominant in the low-frequency range, the invention can remove noise caused by microphone vibration from breath and the like, as well as low-frequency noise caused by resonance with a floor or desk. By additionally combining feature quantities suited to detecting the individual phonemes of speech, it aims to provide a voice/non-voice discrimination device that achieves high performance based on phoneme detection while retaining a simple configuration.

[0006]

[Means for Solving the Problems] To achieve the above object, the present invention focuses on detecting voiced (vowel-like) sound and comprises: a feature extraction unit that extracts, from the input signal at fixed time intervals, a plurality of voice feature quantities such as autocorrelation coefficients of first or higher order and cepstrum coefficients of first or higher order; a threshold determination unit that determines thresholds for deciding between voice and other sounds, using the feature quantities that the feature extraction unit extracted in advance, within a predetermined time, from a large body of voice and non-voice training data; a rough determination unit that decides between voice and other sounds by comparing the autocorrelation coefficients of first or higher order and the first-order cepstrum coefficient within the predetermined time with the thresholds determined by the threshold determination unit, thereby roughly judging how vowel-like the sound is in the frequency domain; a detailed determination unit that decides between voice and other sounds by comparing cepstrum coefficients of second or higher order, which are feature quantities suited to detecting voiced sound based on the characteristics of individual phonemes, with the thresholds determined by the threshold determination unit, thereby judging in detail how vowel-like the sound is; and a final determination unit that, for a section whose power is at or above a fixed level, decides whether the section is voice from the proportion of frames judged to be voice by the rough determination unit and the detailed determination unit.

[0007]

[Operation] With the above configuration, the present invention uses feature quantities that directly indicate the frequency band in which voiced sounds such as vowels carry most of their energy, together with feature quantities suited to detecting voiced sound based on the characteristics of individual phonemes. Voice is discriminated at fixed time intervals using thresholds set appropriately in advance from a large amount of highly reliable voice data and non-voice data containing various noises, so high-performance voice/non-voice discrimination becomes possible.

[0008]

[Embodiment] An embodiment of the present invention is described below. FIG. 1 is a block diagram showing the overall configuration of an embodiment of the present invention. In FIG. 1, reference numeral 11 denotes a feature extraction unit that extracts a plurality of feature quantities for voice discrimination. It consists of a power calculation unit 11a that calculates the power of each frame, a first-order autocorrelation coefficient calculation unit 11b that calculates the first-order autocorrelation coefficient of each frame, a seventh-order autocorrelation coefficient calculation unit 11c that calculates the seventh-order autocorrelation coefficient of each frame, a first-order cepstrum coefficient calculation unit 11d that calculates the first-order cepstrum coefficient of each frame, and a third-order cepstrum coefficient calculation unit 11e that calculates the third-order cepstrum coefficient of each frame. These feature quantities are used to detect voiced (vowel-like) sound in the input signal. The results of investigating the frequency distributions of these feature quantities are given below.

[0009] Two kinds of acoustic data were used in the investigation: voice data and noise data. The voice data consisted of 16 phonemes (/a/, /i/, /u/, /e/, /o/, /b/, /d/, /g/, /m/, /n/, /N/, /s/, /h/, /r/, /w/, /y/) taken from 212 words uttered by one male speaker and recorded in an anechoic room; the phoneme boundaries were determined by visual inspection for every phoneme. The noise data consisted of the 22 kinds of noise listed in Table 1, which can be expected in a conference room where the voice/non-voice discrimination device of this embodiment would be used. Table 2 shows the analysis conditions for the voice and noise data.

[0010]

[Table 1]

[0011]

[Table 2]

[0012] The results of the investigation are shown in the following figures: the frequency distributions of the first-order autocorrelation coefficient for the 16 phonemes and the 22 noises in FIG. 4 and FIG. 5, respectively; those of the seventh-order autocorrelation coefficient in FIG. 6 and FIG. 7; those of the first-order cepstrum coefficient in FIG. 8 and FIG. 9; and those of the third-order cepstrum coefficient in FIG. 10 and FIG. 11. In each figure, a filled circle marks the mean value and the vertical extent shows the standard deviation. The following tendencies were found.

[0013] First, autocorrelation coefficients of first or higher order are feature quantities that reflect differences in the frequency range where energy is concentrated. For strongly random noise whose energy lies predominantly in a high frequency band, such as unvoiced sound, the first-order autocorrelation coefficient takes a small value close to 0 (FIG. 5), whereas for voiced sound and the like it takes a value close to 1 (FIG. 4). For noise whose energy lies predominantly in a low frequency band, on the other hand, the seventh-order autocorrelation coefficient takes a value close to 1 (FIG. 7), whereas for voiced sound and the like it is close to 0 (FIG. 6). Cepstrum coefficients are feature quantities that represent the shape of the spectrum, and their values differ greatly from phoneme to phoneme even among voiced sounds. The first-order cepstrum coefficient represents coarse differences in spectral shape, such as voiced versus unvoiced: it is 1.0 or more for voiced sounds other than the phoneme /i/ and 1.0 or less for other sounds (FIG. 8). The third-order cepstrum coefficient is a feature quantity in which the characteristics of the phoneme /i/ appear especially strongly: it is 0.5 or more for /i/ and 0.5 or less for other sounds (FIG. 10).

[0014] In FIG. 1, reference numeral 12 denotes a threshold determination unit that uses a large body of voice and non-voice training data to determine in advance suitable thresholds for deciding between voice and other sounds. Reference numeral 13 denotes a rough determination unit that decides between voice and other sounds by comparing, frame by frame, the first- and seventh-order autocorrelation coefficients and the first-order cepstrum coefficient output by the feature extraction unit 11 with the thresholds determined by the threshold determination unit 12. Reference numeral 14 denotes a detailed determination unit that, for frames judged to be other than voice by the rough determination unit 13, decides whether the frame is the voiced sound /i/ or not by comparing, frame by frame, the third-order cepstrum coefficient output by the feature extraction unit 11 with the threshold determined by the threshold determination unit 12. Reference numeral 15 denotes a final determination unit that, for a block of input signal whose power is at or above a fixed level, judges the block to be voice when the proportion of frames judged to be voice by the rough determination unit 13 and the detailed determination unit 14 is at or above the threshold determined by the threshold determination unit 12.

[0015] An embodiment of the present invention is described in detail below with reference to the block diagram of FIG. 1, the flowchart of FIG. 2 explaining the operation of the rough determination unit 13, and the flowchart of FIG. 3 explaining the operation of the detailed determination unit 14.

[0016] When an acoustic signal is input through the microphone, the feature extraction unit 11 first extracts five feature quantities. The power calculation unit 11a calculates the power value for each fixed time interval using, for example, Equation 1. Here the fixed interval is, for example, 200 samples (20 ms) at a sampling frequency of 10 kHz, and this unit of time is called a frame.

[0017]

[Equation 1]

[0018] Here, P_i is the power value in frame i, and S_k is a sample value of the input signal within the frame. So that differences in power due to differing utterance conditions can be handled uniformly, this power value is normalized before use, for example so that the range between the minimum and maximum values within a high-power section maps to values from 0 to 1. The first-order autocorrelation coefficient calculation unit 11b calculates the first-order autocorrelation coefficient A_i(1) for each frame using Equation 2, and the seventh-order autocorrelation coefficient calculation unit 11c calculates the seventh-order autocorrelation coefficient A_i(7) for each frame using Equation 3; both A_i(1) and A_i(7) are then normalized by the zeroth-order autocorrelation coefficient.
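The framing and power steps can be sketched as follows. This is a minimal illustration, not the patent's implementation: Equation 1 is not reproduced in this text, so the mean-square form of `frame_power` is an assumption, and the function names `split_frames`, `frame_power`, and `normalize_powers` are hypothetical. The min-max normalization is applied to whatever power sequence is passed in; how the "high-power section" is delimited is left to the caller.

```python
from typing import List

FRAME_LEN = 200  # samples per frame: 20 ms at the 10 kHz sampling rate given in the text


def split_frames(signal: List[float], frame_len: int = FRAME_LEN) -> List[List[float]]:
    """Cut the signal into consecutive non-overlapping frames, dropping any partial tail."""
    n_frames = len(signal) // frame_len
    return [signal[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


def frame_power(frame: List[float]) -> float:
    """Mean-square power of one frame (assumed form; Equation 1 is not reproduced)."""
    return sum(s * s for s in frame) / len(frame)


def normalize_powers(powers: List[float]) -> List[float]:
    """Min-max normalize a sequence of power values to the range 0..1."""
    lo, hi = min(powers), max(powers)
    if hi == lo:
        return [0.0 for _ in powers]  # flat sequence: nothing to scale
    return [(p - lo) / (hi - lo) for p in powers]
```

For example, `normalize_powers([frame_power(f) for f in split_frames(signal)])` yields one normalized power value per 20 ms frame.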

[0019]

[Equation 2]

[0020]

[Equation 3]
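A short sketch of the normalized autocorrelation coefficients follows. Equations 2 and 3 are not reproduced in this text, so the standard short-time autocorrelation is assumed here, divided by the zeroth-order value as the description states. The function name `normalized_autocorr` and the two test signals are illustrative only; the 100 Hz tone stands in for low-frequency noise and the Gaussian sequence for strongly random noise.

```python
import math
import random
from typing import List


def normalized_autocorr(frame: List[float], lag: int) -> float:
    """Lag-`lag` autocorrelation of a frame divided by its lag-0 value."""
    r0 = sum(s * s for s in frame)
    if r0 == 0.0:
        return 0.0  # silent frame: define the coefficient as 0
    r = sum(frame[k] * frame[k + lag] for k in range(len(frame) - lag))
    return r / r0


fs, n = 10_000, 200  # 10 kHz sampling, one 200-sample frame
low_tone = [math.sin(2 * math.pi * 100 * k / fs) for k in range(n)]  # 100 Hz tone
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(n)]  # strongly random signal

a1_tone = normalized_autocorr(low_tone, 1)   # close to 1
a7_tone = normalized_autocorr(low_tone, 7)   # still large for a low-frequency signal
a1_noise = normalized_autocorr(noise, 1)     # small in magnitude for random noise
```

The low-frequency tone keeps a large seventh-order coefficient, which is the property the description relies on when it uses A_i(7) to reject low-frequency noise, while the random sequence has a first-order coefficient near 0.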

[0021] The first-order cepstrum coefficient calculation unit 11d obtains the first-order cepstrum coefficient C_i(1) in frame i, and the third-order cepstrum coefficient calculation unit 11e obtains the third-order cepstrum coefficient C_i(3) in frame i, both by linear prediction analysis. The first-order partial autocorrelation (PARCOR) coefficient may be used in place of the first-order autocorrelation coefficient, and the first-order linear prediction coefficient in place of the first-order cepstrum coefficient, with no problem at all, because the absolute values are equal in each case. Likewise, an autocorrelation coefficient of roughly sixth to twelfth order may be used in place of the seventh-order autocorrelation coefficient, since its purpose is to remove non-voice whose energy lies predominantly in a low frequency band. In this embodiment the first- and third-order cepstrum coefficients are used with attention to the phoneme /i/, as feature quantities in which the characteristics of /i/ appear especially strongly; to achieve still higher-performance voice/non-voice discrimination, cepstrum coefficients of first or higher order in which the characteristics of other phonemes such as /a/, /u/, /e/, and /o/ appear strongly may be used in combination. As the cepstrum coefficients, LPC cepstrum coefficients, FFT cepstrum coefficients, or mel-cepstrum coefficients may be used, since any of them allows phonemic character to be judged in detail from the characteristics of each phoneme in speech.
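One common way to obtain cepstrum coefficients by linear prediction analysis is the Levinson-Durbin recursion followed by the LPC-to-cepstrum recursion, sketched below. This is an illustrative sketch under stated assumptions, not the patent's procedure: the analysis conditions of Table 2 (including the prediction order) are not reproduced in this text, the sign convention A(z) = 1 + sum of a[j] z^-j is a choice made here, and the names `levinson_durbin` and `lpc_to_cepstrum` are hypothetical.

```python
from typing import List, Tuple


def levinson_durbin(r: List[float], order: int) -> Tuple[List[float], List[float]]:
    """Solve the LPC normal equations from autocorrelations r[0..order].

    Returns (a, k): a[0..order] with a[0] = 1 for A(z) = 1 + sum a[j] z^-j,
    and the reflection (PARCOR) coefficients k[0..order-1].
    """
    a = [1.0] + [0.0] * order
    k_list: List[float] = []
    err = r[0]
    for m in range(1, order + 1):
        if err <= 0.0:
            k_list.append(0.0)  # degenerate frame: leave higher orders at zero
            continue
        acc = r[m] + sum(a[j] * r[m - j] for j in range(1, m))
        k = -acc / err
        k_list.append(k)
        prev = a[:]
        for j in range(1, m):
            a[j] = prev[j] + k * prev[m - j]
        a[m] = k
        err *= 1.0 - k * k
    return a, k_list


def lpc_to_cepstrum(a: List[float], n_ceps: int) -> List[float]:
    """Cepstrum c[1..n_ceps] of the all-pole model 1/A(z); returned list starts at c[1]."""
    p = len(a) - 1
    c = [0.0] * (n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = a[n] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc += (k / n) * c[k] * a[n - k]
        c[n] = -acc
    return c[1:]


# Toy autocorrelation sequence r[m] = 0.5**m (illustrative input, not patent data).
r = [1.0, 0.5, 0.25, 0.125]
a, k = levinson_durbin(r, 3)
c = lpc_to_cepstrum(a, 3)  # c[0] is C(1), c[2] is C(3)
```

With this convention C(1) = -a[1], so the first-order cepstrum coefficient and the first-order linear prediction coefficient have equal absolute values, and the first reflection coefficient k[0] = -r[1]/r[0] has the same absolute value as the normalized first-order autocorrelation coefficient, which matches the substitutions the paragraph above permits.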

[0022] The threshold determination unit 12 extracts in advance the feature quantities obtained by the feature extraction unit 11 from the vowel portions of a large amount of voice data and from non-voice data, and, based on the distributions of these feature quantities, sets a suitable voice/non-voice threshold for each feature quantity. In addition, using the voice training data, it determines a suitable threshold for the voice/non-voice decision according to the proportion in which the feature quantities obtained by the feature extraction unit 11 occur within a given number of frames. As the non-voice data used in determining the thresholds, when the voice/non-voice discrimination device of this embodiment is to be used in a conference room, for example, expected noise data may be used, such as the sound of tapping a desk, paper rustling, cups, and objects touching the microphone.
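The description does not say how a threshold is derived from the feature distributions. One simple possibility, shown here purely as an assumption, is to pick for each feature the cut point that misclassifies the fewest training frames. The function name `pick_threshold` and the sample values are hypothetical.

```python
from typing import List, Tuple


def pick_threshold(voice: List[float], nonvoice: List[float],
                   voice_is_high: bool = True) -> Tuple[float, int]:
    """Return (threshold, errors) minimizing training errors for a single feature.

    With voice_is_high=True a frame is called voice when value >= threshold
    (as for A_i(1), C_i(1), C_i(3)); with voice_is_high=False it is called
    voice when value <= threshold (as for A_i(7)).
    """
    values = sorted(set(voice) | set(nonvoice))
    # Candidate cut points: midpoints between neighboring observed values, plus both ends.
    candidates = [values[0]] + [(x + y) / 2 for x, y in zip(values, values[1:])] + [values[-1]]
    best = (candidates[0], len(voice) + len(nonvoice) + 1)
    for t in candidates:
        if voice_is_high:
            errors = sum(v < t for v in voice) + sum(n >= t for n in nonvoice)
        else:
            errors = sum(v > t for v in voice) + sum(n <= t for n in nonvoice)
        if errors < best[1]:
            best = (t, errors)
    return best


# Made-up feature values for illustration only.
thr_a1, err_a1 = pick_threshold([0.9, 0.95, 0.8], [0.1, 0.2, 0.3])
thr_a7, err_a7 = pick_threshold([0.1, 0.2], [0.8, 0.9], voice_is_high=False)
```

Other criteria, such as a cut point derived from the means and standard deviations of the two distributions, would fit the description equally well.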

[0023] The feature quantities obtained from the acoustic signal by the feature extraction unit 11 are input to the rough determination unit 13 and the detailed determination unit 14. First, of these feature quantities, the first- and seventh-order autocorrelation coefficients and the first-order cepstrum coefficient are input to the rough determination unit 13.

[0024] In step 21 of FIG. 2, strongly random noise whose energy lies predominantly in a high frequency band, such as unvoiced sound, is removed according to the value of the first-order autocorrelation coefficient. Letting A1 be the threshold for the first-order autocorrelation coefficient determined by the threshold determination unit 12, the frame is judged to be voice when A_i(1) ≥ A1 and non-voice otherwise. Next, in step 22, noise whose energy lies predominantly in a low frequency band is removed according to the value of the seventh-order autocorrelation coefficient. That is, letting A7 be the threshold for the seventh-order autocorrelation coefficient determined by the threshold determination unit 12, the frame is judged to be voice when A_i(7) ≤ A7 and non-voice otherwise.

[0025] Further, in step 23, voiced sounds other than the phoneme /i/ are detected according to the value of the first-order cepstrum coefficient. Letting C1 be the threshold for the first-order cepstrum coefficient determined by the threshold determination unit 12, the frame is judged to be voice when C_i(1) ≥ C1 and non-voice otherwise. If the frame is voice, the output value V_i = 1 is sent to the detailed determination unit 14 in step 24; if it is non-voice, the output value V_i = 0 is sent in step 25.
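Steps 21 to 25 can be sketched as a cascade of three threshold tests. FIG. 2 is not reproduced in this text, so the sketch assumes that failing any one test yields V_i = 0 and that only a frame passing all three yields V_i = 1. The names `RoughThresholds` and `rough_decision` are hypothetical, and the threshold values are placeholders rather than values from the patent.

```python
from dataclasses import dataclass


@dataclass
class RoughThresholds:
    a1: float  # threshold A1 for the first-order autocorrelation coefficient
    a7: float  # threshold A7 for the seventh-order autocorrelation coefficient
    c1: float  # threshold C1 for the first-order cepstrum coefficient


def rough_decision(a1_i: float, a7_i: float, c1_i: float, thr: RoughThresholds) -> int:
    """Return V_i: 1 if the frame passes all three tests, otherwise 0."""
    if a1_i < thr.a1:   # step 21: reject high-frequency, strongly random noise
        return 0
    if a7_i > thr.a7:   # step 22: reject low-frequency noise
        return 0
    if c1_i < thr.c1:   # step 23: keep voiced sounds other than /i/
        return 0
    return 1            # step 24 (V_i = 1); the early returns correspond to step 25


# Placeholder thresholds for illustration, NOT values from the patent.
thr = RoughThresholds(a1=0.5, a7=0.5, c1=1.0)
v = rough_decision(a1_i=0.9, a7_i=0.2, c1_i=1.3, thr=thr)  # passes all three tests
```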

[0026] Next, in step 31 of the detailed determination unit 14, a frame judged to be voice by the rough determination unit 13, that is, one with V_i = 1, has its output value passed unchanged to the final determination unit 15; detection of the phoneme /i/ is performed only for frames judged to be non-voice by the rough determination unit 13, that is, those with V_i = 0. In step 32, letting C3 be the threshold for the third-order cepstrum coefficient determined by the threshold determination unit 12, the third-order cepstrum coefficient extracted from the acoustic signal by the feature extraction unit 11 is compared with it: the frame is judged to be /i/, that is, voice, only when C_i(3) ≥ C3, and non-voice otherwise. If the frame is voice, the output value V_i = 1 is sent to the final determination unit 15 in step 33; if it is non-voice, the output value V_i = 0 is sent in step 34.
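The detailed determination of steps 31 to 34 reduces to a small function, sketched here with the hypothetical name `detail_decision`. The threshold passed in the example is a placeholder, not a value from the patent.

```python
def detail_decision(v_rough: int, c3_i: float, thr_c3: float) -> int:
    """Return the final per-frame flag V_i after the /i/ check.

    Frames already flagged as voice pass through unchanged (step 31);
    the remaining frames are re-examined with the third-order cepstrum
    coefficient (step 32) and flagged as voice only if C_i(3) >= C3.
    """
    if v_rough == 1:
        return 1
    return 1 if c3_i >= thr_c3 else 0  # steps 33 and 34


# Placeholder threshold for illustration only.
flag = detail_decision(v_rough=0, c3_i=0.7, thr_c3=0.5)  # rescued as /i/
```

Because the second test only ever turns a 0 into a 1, the detailed stage can add voice frames such as /i/ that the rough stage missed but can never discard a frame the rough stage accepted.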

[0027] The final determination unit 15 first detects, from the sequence of power values obtained by the power calculation unit 11a, sections in which the power exceeds the power threshold set in advance by the threshold determination unit 12 for at least a prescribed length, and treats each as a voice candidate section. Let C be the length of the voice candidate section in frames. For this section, the number C1 of frames judged to be voice by the rough determination unit 13 and the detailed determination unit 14 is counted (this count is distinct from the cepstrum threshold C1 used in step 23). When the proportion of the voice candidate section occupied by frames judged to be voice exceeds the threshold M set in advance by the threshold determination unit 12, that is, when the condition of Equation 4 is satisfied, the voice candidate section is judged to be voice.

[0028]

[Equation 4]
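The final determination can be sketched as follows. Equation 4 is not reproduced in this text; from the surrounding description it is taken here to be the condition C1 / C > M, which is an inference. The function names `candidate_sections` and `final_decision`, and all numeric values in the example, are hypothetical.

```python
from typing import List, Tuple


def candidate_sections(powers: List[float], power_thr: float,
                       min_len: int) -> List[Tuple[int, int]]:
    """Find runs of frames with power above power_thr lasting at least min_len frames.

    Each section is returned as (start, end) with end exclusive.
    """
    sections: List[Tuple[int, int]] = []
    start = None
    for i, p in enumerate(powers):
        if p > power_thr:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_len:
                sections.append((start, i))
            start = None
    if start is not None and len(powers) - start >= min_len:
        sections.append((start, len(powers)))
    return sections


def final_decision(flags: List[int], section: Tuple[int, int], ratio_thr: float) -> bool:
    """Judge a candidate section as voice when C1 / C exceeds ratio_thr (M)."""
    start, end = section
    c = end - start                # C: section length in frames
    c1 = sum(flags[start:end])     # C1: frames flagged as voice (V_i = 1)
    return c1 / c > ratio_thr


# Illustrative values only.
powers = [0.1, 0.6, 0.7, 0.8, 0.1, 0.9, 0.1]
flags = [0, 1, 1, 0, 0, 1, 0]
sections = candidate_sections(powers, power_thr=0.5, min_len=2)
is_voice = [final_decision(flags, s, ratio_thr=0.5) for s in sections]
```

In this example the isolated high-power frame at index 5 is too short to form a candidate section, and the one remaining section has two voice frames out of three, so it is judged to be voice.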

[0029] As described above, the voice/non-voice discrimination device of this embodiment comprises: the feature extraction unit 11, which extracts a plurality of voice feature quantities from the input signal at fixed time intervals; the threshold determination unit 12, which determines thresholds for deciding between voice and other sounds using the feature quantities extracted in advance, frame by frame, from a large body of voice and non-voice training data; the rough determination unit 13, which decides between voice and other sounds by comparing a plurality of voice feature quantities with the thresholds determined by the threshold determination unit, thereby roughly judging how vowel-like the sound is in the frequency domain; the detailed determination unit 14, which decides between voice and other sounds by comparing a plurality of feature quantities suited to detecting voiced sound, based on the characteristics of the individual phonemes in speech, with the thresholds determined by the threshold determination unit, thereby judging in detail how vowel-like the sound is; and the final determination unit, which, for a section whose power is at or above a fixed level, decides whether the section is voice from the proportion of frames judged to be voice by the rough determination unit and the detailed determination unit. With this configuration, a voice/non-voice discrimination device can be provided that accurately classifies a variety of acoustic signals with a simple configuration.

[0030]

[Effects of the Invention] As is clear from the above description, according to the present invention a plurality of feature quantities that characterize voice are extracted, suitable thresholds are set in advance from the feature quantities of a large number of vowels and non-voice data, each frame is judged to be voice or non-voice, and each high-power portion is taken as a voice section candidate and classified as voice or non-voice from the proportion of its frames judged to be voice. A voice/non-voice discrimination device can therefore be provided that accurately determines, with a very simple configuration, whether an input signal is voice or something else.

[Brief description of drawings]

FIG. 1 is a block diagram showing the overall configuration of a voice/non-voice discrimination device according to an embodiment of the present invention.

FIG. 2 is a flowchart showing the operation of an embodiment of the rough determination unit of the present invention.

FIG. 3 is a flowchart showing the operation of an embodiment of the detailed determination unit of the present invention.

FIG. 4 is a frequency distribution diagram of the first-order autocorrelation coefficient for the 16 phonemes.

FIG. 5 is a frequency distribution diagram of the first-order autocorrelation coefficient for the 22 noises.

FIG. 6 is a frequency distribution diagram of the seventh-order autocorrelation coefficient for the 16 phonemes.

FIG. 7 is a frequency distribution diagram of the seventh-order autocorrelation coefficient for the 22 noises.

FIG. 8 is a frequency distribution diagram of the first-order cepstrum coefficient for the 16 phonemes.

FIG. 9 is a frequency distribution diagram of the first-order cepstrum coefficient for the 22 noises.

FIG. 10 is a frequency distribution diagram of the third-order cepstrum coefficient for the 16 phonemes.

FIG. 11 is a frequency distribution diagram of the third-order cepstrum coefficient for the 22 noises.

[Description of Reference Signs]
11 feature extraction unit
11a power calculation unit
11b first-order autocorrelation coefficient calculation unit
11c seventh-order autocorrelation coefficient calculation unit
11d first-order cepstrum coefficient calculation unit
11e third-order cepstrum coefficient calculation unit
12 threshold determination unit
13 rough determination unit
14 detailed determination unit
15 final determination unit

Claims

1. A voice/non-voice discrimination method comprising: extracting a plurality of voice feature quantities from an input signal at fixed time intervals, using at least one of a first-order autocorrelation coefficient and an autocorrelation coefficient of second or higher order that characterize voice; and determining whether the input signal is voice or non-voice.

2. A voice/non-voice discrimination method comprising: extracting a plurality of voice feature quantities from an input signal at fixed time intervals, using at least one of a first-order cepstrum coefficient and a cepstrum coefficient of second or higher order that characterize voice; and determining whether the input signal is voice or non-voice.

3. A voice/non-voice discrimination device comprising: a feature extraction unit that extracts from an input signal, at fixed time intervals, a first-order autocorrelation coefficient, an autocorrelation coefficient of second or higher order, a first-order cepstrum coefficient, and a cepstrum coefficient of second or higher order, which characterize voice; a threshold determination unit that determines thresholds for deciding between voice and other sounds, using the feature quantities extracted in advance by the feature extraction unit, within a predetermined time, from a large body of voice and non-voice training data; a rough determination unit that decides between voice and other sounds by comparing at least one of the first-order autocorrelation coefficient and the first-order cepstrum coefficient, extracted from the input signal by the feature extraction unit within a predetermined time, with the threshold determined by the threshold determination unit; a detailed determination unit that, for input judged to be other than voice by the rough determination unit, decides between voice and other sounds by comparing at least one cepstrum coefficient of second or higher order, extracted from the input signal by the feature extraction unit within a predetermined time, with the threshold determined by the threshold determination unit; and a final determination unit that, for a block of the input signal whose power is at or above a fixed level, judges the block to be voice when the proportion of frames judged to be voice by the rough determination unit and the detailed determination unit is at or above the threshold determined by the threshold determination unit.