JPH05249987A - Voice detecting method and device

Info

Publication number: JPH05249987A
Application number: JP4050327A
Authority: JP
Inventors: Yoshihisa Nakato; 良久中藤; Takeshi Norimatsu; 武志則松
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1992-03-09
Filing date: 1992-03-09
Publication date: 1993-09-28

Abstract

PURPOSE: To detect voice automatically and with high accuracy, using a comparatively simple structure, in a voice detection device that detects only voice and is used as preprocessing for a voice recognition device. CONSTITUTION: The voice detection device has a likelihood calculation unit 13 that takes the plurality of features extracted from the input signal at fixed time intervals by a feature extraction unit 11 and calculates their log likelihood against vowel standard models, which a vowel standard model creation unit 12 creates from a large amount of vowel training data; and a vowel determination unit 14 that calculates a frame-average log likelihood by pooling the log likelihoods of several frames and detects vowels by comparing it with a suitable threshold. A final determination unit 15 judges whether the section is voice according to the number of frames judged to be vowels by the vowel determination unit.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a voice detection method and a voice detection device that detect only voice, for use in, for example, the preprocessing stage of a voice recognition device, in real environments where non-stationary noise is present.

[0002]

[Prior Art] In devices that perform voice processing such as voice recognition, misrecognition occurs if non-stationary noise other than voice is input and mistakenly judged to be voice. A voice detection device that can accurately determine whether an input signal is voice is therefore needed.

[0003] To keep processing simple, conventional voice detection devices generally treat the portions of the input signal whose power exceeds a threshold as voice. In the real environments where voice recognition is used, however, many loud non-voice sounds may be input, such as the rustle of paper or documents being turned, noise caused by microphone vibration (for example from a puff of breath), or animal cries, so voice cannot be detected from power alone.

[0004] Several methods have therefore been proposed that use multiple voice features other than power to decide whether an input signal is voice or non-voice. One example is the method in "Discrimination between voice / non-voice in an actual environment" (Akira Ishida and Hidefumi Obata, Journal of the Acoustical Society of Japan, Vol. 7, No. 12 (1991)). This method discriminates voice from non-voice in real environments using acoustic features that are effective for distinguishing voice from the various non-stationary noises that arise in everyday laboratories and offices. Specifically, within the high-power portions of the signal, it discriminates voice from non-voice according to how much of the signal can be regarded as vowels. Five acoustic features are computed: (a) periodicity, (b) pitch frequency, (c) optimal linear prediction order, (d) distance to the five vowels, and (e) formant sharpness. An upper or lower limit is set for each feature, and voice is discriminated from non-voice by comparing each feature against its limit.

[0005]

[Problems to Be Solved by the Invention] However, the voice/non-voice discrimination device described above does not use features based on the characteristics of each individual vowel, and a high-accuracy voice detection scheme is needed that uses a per-vowel standard model suited to detecting each vowel. In addition, when an upper or lower limit is set for each feature and voice/non-voice is decided by comparison against it, setting a threshold for every feature becomes difficult as the number of features grows, and the method cannot be called robust to changes in feature values such as those caused by added stationary noise. Furthermore, for voice, and vowels in particular, judging several frames together as one block is more reliable than judging vowel-likeness one analysis frame at a time.

[0006] The present invention solves the above problems, and its object is to provide a high-performance voice detection device suited to voice signal processing such as voice recognition. By using a per-vowel standard model suited to detecting each vowel, the invention provides a voice detection device based on the detection of each vowel. Furthermore, as the evaluation value for judging as a whole the features that indicate whether a signal is voice or not, the invention uses the frame-average log likelihood, computed by treating several frames as one block, which is considered useful for judging vowel-likeness. The aim is a comparatively simple, high-performance voice detection device whose threshold is comparatively easy to set and which is reasonably robust to changes in feature values such as those caused by added stationary noise.

[0007]

[Means for Solving the Problems] To solve the above problems, the present invention extracts from the input signal, frame by frame (at fixed time intervals), at least one voice-characterizing feature from among autocorrelation coefficients of first or higher order and cepstrum coefficients of first or higher order, uses several frames of that feature together to obtain the probability that a vowel is present, performs vowel detection, and thereby detects only voice.

[0008] The invention is also characterized in that, for a large amount of vowel training data prepared in advance, it extracts frame by frame at least one voice-characterizing feature from among autocorrelation coefficients of first or higher order and cepstrum coefficients of first or higher order, uses several frames together to create a standard model for each vowel, and detects voice using these vowel standard models.

[0009] Furthermore, the voice detection device of the present invention, with vowel detection as its main focus, comprises: a feature extraction unit that extracts, at fixed time intervals, a plurality of voice features from the input signal, such as autocorrelation coefficients of first or higher order and cepstrum coefficients of first or higher order; a vowel standard model creation unit that, for a large amount of vowel training data prepared in advance, calculates a mean and a covariance matrix for each vowel from the features extracted by the feature extraction unit and thereby creates a standard model for each vowel; a likelihood calculation unit that, for at least one feature from among the first- or higher-order autocorrelation coefficients and the first-order cepstrum coefficients extracted frame by frame from the input signal by the feature extraction unit, calculates the log likelihood against each vowel standard model created by the vowel standard model creation unit; a vowel determination unit that calculates, for each vowel, a frame-average log likelihood over the several preceding and following frames processed by the likelihood calculation unit and compares it with a suitable threshold to decide whether the frames are a vowel or not; and a final determination unit that, for a block of input signal whose power is at or above a fixed level, judges the block to be voice when the proportion of vowel samples judged by the vowel determination unit to be some vowel is at or above a suitable threshold.

[0010]

[Operation] With the configuration described above, the invention uses features suited to vowel detection, based on the characteristics of each phoneme in voice, creates a vowel standard model for each vowel in advance from a large amount of reliable vowel data, detects vowels statistically over several frames at a time, and detects only voice, which makes high-performance voice detection possible.

[0011]

[Embodiment] An embodiment of the present invention is described below. FIG. 1 is a block diagram showing the overall configuration of one embodiment of the invention. In FIG. 1, 11 is a feature extraction unit that extracts a plurality of features for voice detection. It consists of a power calculation unit 11a that calculates the power of each frame (a fixed time interval), an autocorrelation coefficient calculation unit 11b that calculates the first- and seventh-order autocorrelation coefficients of each frame, and a cepstrum coefficient calculation unit 11c that calculates the first- and third-order cepstrum coefficients of each frame. These features are used to detect the vowel-likeness of the input signal.

[0012] Next, 12 is a vowel standard model creation unit that, for a large amount of vowel training data prepared in advance, calculates the mean and covariance matrix for each vowel from the features extracted by the feature extraction unit 11 and creates a standard model for each vowel. 13 is a likelihood calculation unit that calculates, for the first- and seventh-order autocorrelation coefficients and the first- and third-order cepstrum coefficients of each input-signal frame output by the feature extraction unit 11, the log likelihood against each vowel standard model created by the vowel standard model creation unit 12. 14 is a vowel determination unit that takes the log likelihoods computed in the same way by the likelihood calculation unit 13 for the several frames before and after the current frame, calculates the frame-average log likelihood for each vowel, and compares it with a suitable threshold to decide whether those frames of input signal are a vowel. 15 is a final determination unit that, for a block of input signal whose power is at or above a fixed level, judges the block to be voice when the number of frames judged by the vowel determination unit 14 to be some vowel is at or above a suitable threshold.

[0013] The embodiment is described in detail below with reference to the block diagram of FIG. 1. When an acoustic signal is input through a microphone, the feature extraction unit 11 first extracts a plurality of features. The power calculation unit 11a calculates the power at fixed time intervals, for example by (Equation 1). Here the fixed time interval is, for example, 200 samples (20 ms) at a sampling frequency of 10 kHz, and this unit of time is called a frame.
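The framing step can be sketched as follows. This is only an illustration under stated assumptions: the 10 kHz rate and 200-sample frame length come from the text, while the use of non-overlapping frames and the function name `split_into_frames` are assumptions made for the example.

```python
import numpy as np

SAMPLE_RATE_HZ = 10_000  # from the text: 10 kHz sampling frequency
FRAME_LEN = 200          # from the text: 200 samples = 20 ms


def split_into_frames(signal, frame_len=FRAME_LEN):
    """Cut a 1-D signal into consecutive frames of frame_len samples.

    Assumption: frames do not overlap. Trailing samples that do not fill
    a whole frame are dropped.
    """
    signal = np.asarray(signal, dtype=float)
    n_frames = len(signal) // frame_len
    return signal[: n_frames * frame_len].reshape(n_frames, frame_len)


# 1000 samples (0.1 s at 10 kHz) give five 20 ms frames.
frames = split_into_frames(np.arange(1000))
```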

[0014]

[Equation 1]

[0015] Here, P_i is the power of frame i, and S_k is a sample value of the input signal within the frame. So that power differences arising from different utterance conditions can be handled uniformly, this power value is normalized before use, for example to the range 0 to 1 between the minimum and maximum values within the high-power section. The autocorrelation coefficient calculation unit 11b calculates, for each frame, the first-order autocorrelation coefficient A_i(1) by (Equation 2) and the seventh-order autocorrelation coefficient A_i(7) by (Equation 3).
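A minimal sketch of the power calculation and its normalization follows. The image for Equation 1 is not available in this text, so the sum-of-squares form used in `frame_power` is an assumption, as are both function names; only the 0-to-1 min-max normalization is taken from the description.

```python
import numpy as np


def frame_power(frames):
    """Power P_i of each frame (rows of `frames`).

    Assumption: P_i is the sum of squared samples S_k in frame i; the
    original Equation 1 may include a scale factor or a logarithm.
    """
    return np.sum(np.asarray(frames, dtype=float) ** 2, axis=1)


def normalize_power(power):
    """Map the minimum..maximum of a section onto the range 0..1."""
    power = np.asarray(power, dtype=float)
    lo, hi = power.min(), power.max()
    if hi == lo:  # flat section: avoid dividing by zero
        return np.zeros_like(power)
    return (power - lo) / (hi - lo)


p = frame_power([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])  # powers 2, 8, 18
p_norm = normalize_power(p)
```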

[0016]

[Equation 2]

[0017]

[Equation 3]

[0018] A_i(1) and A_i(7) are further normalized by the zeroth-order autocorrelation coefficient. The cepstrum coefficient calculation unit 11c obtains the first- and third-order cepstrum coefficients C_i(1) and C_i(3) of frame i by linear prediction analysis.
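The two feature families can be illustrated with textbook routines. The sketch below is not the patent's own formulation (Equations 2 and 3 are missing from this text): it assumes the usual lagged-product autocorrelation, the Levinson-Durbin recursion for linear prediction, and the standard LPC-to-cepstrum recursion for the all-pole model 1/A(z). The LPC order is not stated in the surviving text, so the order 4 used in the example is arbitrary, and all function names are hypothetical.

```python
import numpy as np


def autocorr(frame, max_lag):
    """Unnormalized autocorrelation r[0..max_lag] of one frame."""
    x = np.asarray(frame, dtype=float)
    return np.array([np.dot(x[: len(x) - k], x[k:]) for k in range(max_lag + 1)])


def normalized_autocorr(frame, lag):
    """Lag-`lag` autocorrelation divided by the zeroth-order coefficient.

    Assumes the frame has nonzero energy.
    """
    r = autocorr(frame, lag)
    return r[lag] / r[0]


def levinson_durbin(r, order):
    """Return a[0..order] with A(z) = 1 + sum_k a[k] z^-k."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                      # reflection coefficient
        prev = a.copy()
        a[1:i] = prev[1:i] + k * prev[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k                  # prediction error update
    return a


def lpc_cepstrum(a, n_ceps):
    """Cepstrum c[1..n_ceps] of the all-pole model 1/A(z)."""
    p = len(a) - 1
    c = np.zeros(n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = sum(k * c[k] * a[n - k] for k in range(max(1, n - p), n))
        c[n] = -(a[n] if n <= p else 0.0) - acc / n
    return c[1:]


# Check on a first-order model with r[k] = 0.5**k, whose cepstrum is 0.5**n / n.
r = 0.5 ** np.arange(5)
a = levinson_durbin(r, order=4)
c = lpc_cepstrum(a, n_ceps=3)   # c[0] ~ C_i(1), c[2] ~ C_i(3)
```

In a full feature extractor, `normalized_autocorr(frame, 1)`, `normalized_autocorr(frame, 7)`, `c[0]` and `c[2]` would form the four-dimensional feature vector for one frame.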

[0019] In the vowel standard model creation unit 12, the features obtained by the feature extraction unit 11 are extracted in advance from the vowel portions of a large amount of voice data, and these features are used to calculate the mean and covariance matrix for each vowel by the following method, creating a standard model for each vowel. That is, taking the training data y_N for vowel k (N data items) as the vowel data, and assuming that y_N follows an m-dimensional multivariate normal distribution, the mean μ_k and covariance matrix Σ_k can be calculated as in (Equation 4) and (Equation 5).

[0020]

[Equation 4]

[0021]

[Equation 5]

[0022] Here t denotes the transpose. This gives the shape of the standard model for each vowel (mean μ_k and covariance Σ_k). y_N and μ_k are m-dimensional vectors (m-dimensional feature parameters), and Σ_k is an m × m matrix. As vowel data, one may for example cut out the vowel portions as training data for vowel k from a standard speaker and use the data of the vowel-center frame ±2 frames. Using data from multiple speakers yields a standard pattern that is robust to variation in speakers' utterances.
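Model creation amounts to a sample mean and sample covariance per vowel. The sketch below assumes the usual estimators, since Equations 4 and 5 are not reproduced in this text; in particular, dividing by N rather than N - 1 is an assumption, and `fit_vowel_model` and the toy data are purely illustrative.

```python
import numpy as np


def fit_vowel_model(samples):
    """Return (mean, covariance) for one vowel.

    samples: array of shape (N, m), one m-dimensional feature vector per row.
    Assumption: the covariance divides by N, not N - 1.
    """
    y = np.asarray(samples, dtype=float)
    mu = y.mean(axis=0)
    d = y - mu
    sigma = d.T @ d / len(y)
    return mu, sigma


# Toy training set: four 2-dimensional vectors for a hypothetical vowel "a".
train = {"a": [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]}
models = {vowel: fit_vowel_model(vecs) for vowel, vecs in train.items()}
```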

[0023] The likelihood calculation unit 13 calculates, for several feature parameters of each input-signal frame output by the feature extraction unit 11, the log likelihood against each vowel standard model created by the vowel standard model creation unit 12. The distance measure used for vowel detection is a statistical distance measure that assumes the feature parameters used follow a multivariate normal distribution. The log likelihood L_ik of the input vector (spectrum) x_i of the i-th frame with respect to the standard model k of each vowel is calculated by (Equation 6).

[0024]

[Equation 6]

[0025] Here x_i is an m-dimensional vector (m-dimensional feature parameters), t denotes the transpose, -1 the inverse matrix, and C a constant.
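Given the symbols listed above (transpose, inverse matrix, constant C), Equation 6 is presumably the log density of a multivariate normal distribution. The sketch below implements that standard form as an assumption; here the constant is taken to be -(m/2) ln 2π, which may differ from the patent's C, and the function name is hypothetical.

```python
import numpy as np


def log_likelihood(x, mu, sigma):
    """Log density of x under a normal distribution with mean mu, covariance sigma.

    Assumed form: -1/2 (x - mu)^t sigma^-1 (x - mu) - 1/2 ln|sigma| + C,
    with C = -(m/2) ln(2*pi).
    """
    x, mu, sigma = (np.asarray(v, dtype=float) for v in (x, mu, sigma))
    d = x - mu
    maha = d @ np.linalg.solve(sigma, d)      # avoids forming the inverse
    _, logdet = np.linalg.slogdet(sigma)
    const = -0.5 * len(mu) * np.log(2.0 * np.pi)
    return -0.5 * maha - 0.5 * logdet + const


ll_at_mean = log_likelihood([0.0, 0.0], [0.0, 0.0], np.eye(2))
ll_off_mean = log_likelihood([1.0, 0.0], [0.0, 0.0], np.eye(2))
```

Because C is the same for every frame, it shifts all log likelihoods equally and can be absorbed into the decision threshold.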

[0026] To capture the temporal continuity of vowels, the vowel determination unit 14 performs vowel determination using the N frames before and after the target frame (together called a segment). Using the log likelihoods L_ik calculated by the likelihood calculation unit 13 for each vowel, the segment is regarded as a vowel if the following condition (Equation 7) is satisfied.

[0027]

[Equation 7]

[0028] Here L_kTH is the discrimination threshold (the threshold on the frame-average log likelihood) for vowel standard model k.
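The segment test can be sketched as a moving average of L_ik compared with L_kTH. Equation 7 is not reproduced in this text, so two details are assumptions: that the segment contains the target frame plus N frames on each side (2N + 1 frames), and that the comparison is "greater than or equal to". Treating frames near the ends as non-vowel is also a choice made only for this example.

```python
import numpy as np


def segment_is_vowel(loglik, n_side, threshold):
    """Flag frames whose segment-average log likelihood reaches the threshold.

    loglik:    L_ik for one vowel k, one value per frame i.
    n_side:    N, the number of frames taken on each side of the target frame.
    threshold: L_kTH for this vowel.
    Frames without a full segment around them are flagged False.
    """
    loglik = np.asarray(loglik, dtype=float)
    seg_len = 2 * n_side + 1
    flags = np.zeros(len(loglik), dtype=bool)
    if len(loglik) < seg_len:
        return flags
    avg = np.convolve(loglik, np.ones(seg_len) / seg_len, mode="valid")
    flags[n_side: len(loglik) - n_side] = avg >= threshold
    return flags


# Only the middle frame has a 3-frame average (-1.0) above the threshold -1.2.
flags = segment_is_vowel([-5.0, -1.0, -1.0, -1.0, -5.0], n_side=1, threshold=-1.2)
```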

[0029] In this way, by using an evaluation value (the log likelihood) that judges the influence of all the feature parameters effectively and as a whole, a system can be built that is more robust to changes in feature values, such as those caused by added stationary noise, than methods that set a threshold for each feature parameter. There is also the advantage that many thresholds need not be determined heuristically. Furthermore, judging several frames of the acoustic signal as one block makes the method more effective for sustained sounds such as vowels.

[0030] The final determination unit 15 judges a block of input signal whose power is at or above a fixed level to be voice when the number of vowel samples in the block is at or above a suitable threshold. It first detects, from the sequence of power values obtained by the power calculation unit 11a, sections in which the power exceeds a predetermined power threshold for at least a predetermined length, and takes these as voice candidate sections. Within each voice candidate section, it counts the segments judged to be vowel k by the vowel determination unit 14. Letting C_k be the number of vowel segments judged to be vowel k and M_k a predetermined threshold on the number of vowel segments in the section, the voice candidate section is judged to be voice if the condition of (Equation 8) is satisfied. This is done for all vowels.

[0031]

[Equation 8]
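The final decision can be sketched in two steps: find the voice candidate sections from the power sequence, then count vowel segments inside each one. Equation 8 is not reproduced in this text, so the rule used here (a section is voice if C_k >= M_k for at least one vowel k) is an assumption; the original condition may instead use a ratio of C_k to section length. Function and parameter names are hypothetical.

```python
import numpy as np


def candidate_sections(power, power_threshold, min_len):
    """Return (start, end) frame ranges, end exclusive, where the power stays
    above power_threshold for at least min_len consecutive frames."""
    sections, start = [], None
    for i, above in enumerate(np.asarray(power) > power_threshold):
        if above and start is None:
            start = i
        elif not above and start is not None:
            if i - start >= min_len:
                sections.append((start, i))
            start = None
    if start is not None and len(power) - start >= min_len:
        sections.append((start, len(power)))
    return sections


def section_is_voice(vowel_flags, section, min_counts):
    """vowel_flags: {vowel: per-frame booleans}; min_counts: {vowel: M_k}.

    Assumption: voice if the count C_k reaches M_k for at least one vowel.
    """
    start, end = section
    return any(
        int(np.sum(np.asarray(flags, dtype=bool)[start:end])) >= min_counts[v]
        for v, flags in vowel_flags.items()
    )


power = [0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0]
sections = candidate_sections(power, power_threshold=0.5, min_len=2)
vowel_flags = {"a": [False, True, True, False, False, False, False]}
is_voice = section_is_voice(vowel_flags, sections[0], {"a": 2})
```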

[0032] The results of experiments actually carried out with this method are given below. Table 1 shows the four feature parameters used. Preliminary experiments showed that these feature parameters separate voice from other non-stationary noise comparatively well, and they are obtained easily in the course of calculating the LPC cepstrum coefficients. The first-order normalized autocorrelation coefficient and the first-order linear prediction coefficient are suited to voiced/unvoiced discrimination, and the seventh-order normalized autocorrelation coefficient is suited to distinguishing low-frequency noise. The third-order LPC cepstrum coefficient shows behavior characteristic of /i/ in particular among the five vowels.

[0033]

[Table 1]

[0034] The voice data are 200 Japanese words uttered by 10 male speakers. The standard models were created from the phoneme center ±2 frames of each vowel uttered by the standard speakers. For computational efficiency, the parameters were assumed to be uncorrelated, and only the diagonal elements of the covariance matrix were used in the calculation. The noise data are non-stationary noises from the five noise groups shown in Table 2 (about 900 samples). The analysis conditions are shown in Table 3.
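With uncorrelated parameters the covariance matrix reduces to its diagonal, and the log likelihood becomes a sum of one-dimensional terms with no matrix inversion. A minimal sketch, assuming the standard normal log density (the constant term is again an assumption) and a hypothetical function name:

```python
import numpy as np


def diag_log_likelihood(x, mu, var):
    """Log likelihood when features are treated as independent.

    var holds only the diagonal elements of the covariance matrix.
    """
    x, mu, var = (np.asarray(v, dtype=float) for v in (x, mu, var))
    return -0.5 * np.sum((x - mu) ** 2 / var + np.log(2.0 * np.pi * var))


ll_diag = diag_log_likelihood([1.0, 2.0], [0.0, 0.0], [1.0, 4.0])
```

For m features this costs O(m) per frame and per vowel instead of the O(m^2) or more needed with a full covariance matrix, which is the efficiency gain the paragraph refers to.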

[0035]

[Table 2]

[0036]

[Table 3]

[0037] Standard models were created from the vowel data of five male speakers, and voice detection experiments, together with rejection experiments on the non-stationary noises of Table 2, were run for 10 speakers including the standard speakers. FIG. 2 shows the relationship between the voice detection rate and the noise false-detection rate as the vowel segment length is varied from 1 to 11 frames. The best detection performance can be found by varying the discrimination threshold appropriately, but there is almost no difference in discrimination performance at segment lengths of 5 frames or more. In the end, a vowel segment length of 7 frames and a discrimination threshold of -1.2 gave a voice detection rate of 99.3% (noise false-detection rate 9.0%).

[0038] Next, to evaluate the detection performance of the method under stationary noise, the relationship between the detection rate and the S/N ratio with added white noise was examined. FIG. 3 shows the relationship between the voice detection rate and the noise false-detection rate at each S/N ratio with the vowel segment length fixed at 7 frames. Detection performance was almost unaffected down to an S/N ratio of 12 dB.
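A test signal at a chosen S/N ratio can be produced by scaling white noise against the signal power. The sketch below is one common way to do this and is not taken from the patent: Gaussian noise, the power-ratio definition of S/N over the whole signal, and the function name are all assumptions.

```python
import numpy as np


def add_white_noise(signal, snr_db, rng=None):
    """Return signal plus white Gaussian noise scaled to the requested S/N (dB).

    S/N is defined here as mean signal power over mean noise power.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    signal = np.asarray(signal, dtype=float)
    noise = rng.standard_normal(len(signal))
    sig_pow = np.mean(signal ** 2)
    noise_pow = np.mean(noise ** 2)
    scale = np.sqrt(sig_pow / (noise_pow * 10.0 ** (snr_db / 10.0)))
    return signal + scale * noise


# 0.1 s of a 200 Hz tone at 10 kHz, degraded to 12 dB S/N.
t = np.arange(1000) / 10_000
clean = np.sin(2.0 * np.pi * 200.0 * t)
noisy = add_white_noise(clean, snr_db=12.0)
```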

[0039] As described above, the voice detection device of this embodiment comprises: a feature extraction unit 11 that extracts a plurality of voice features from the input signal at fixed time intervals; a vowel standard model creation unit 12 that, for a large amount of vowel training data prepared in advance, calculates the mean and covariance matrix for each vowel from the features extracted frame by frame and creates a standard model for each vowel; a likelihood calculation unit 13 that calculates the log likelihood between the plurality of voice features obtained from the input signal and each vowel standard model created by the vowel standard model creation unit; a vowel determination unit 14 that uses the log likelihoods of the several frames before and after the frame under vowel determination to calculate the frame-average log likelihood for each vowel and compares it with a suitable threshold to decide whether those frames of input signal are a vowel; and a final determination unit 15 that, for a block of input signal whose power is at or above a fixed level, judges the block to be voice when the number of vowel samples judged by the vowel determination unit 14 to be some vowel is at or above a suitable threshold. With this configuration, a voice detection device can be provided that accurately identifies voice among a variety of acoustic signals using a comparatively simple structure.

[0040] The embodiment above was described using autocorrelation coefficients and cepstrum coefficients as the features with which the feature extraction unit detects the vowel-likeness of the input signal, but the invention is not limited to these; partial autocorrelation functions, mel-cepstrum coefficients, and the like may also be used.

[0041]

[Effects of the Invention] As is clear from the embodiment above, the present invention extracts a plurality of features that characterize voice, creates vowel standard models in advance from a large amount of vowel training data, calculates the log likelihood between the features obtained from the input signal and the vowel standard models, and detects vowels over several frames at a time in order to detect voice. A voice detection device can therefore be provided that accurately determines, with a comparatively simple structure, whether an input signal is voice or not.

[Brief description of drawings]

[FIG. 1] Block diagram showing the overall configuration of a voice detection device according to an embodiment of the present invention.

[FIG. 2] Diagram showing the relationship between the voice detection rate and the noise false-detection rate as the vowel segment length is varied.

[FIG. 3] Diagram showing the relationship between the voice detection rate and the noise false-detection rate as the S/N ratio is varied.

[Explanation of symbols]

11 feature extraction unit
11a power calculation unit
11b autocorrelation coefficient calculation unit
11c cepstrum coefficient calculation unit
12 vowel standard model creation unit
13 likelihood calculation unit
14 vowel determination unit
15 final determination unit

Claims

[Claims]

1. A voice detection method characterized in that, for at least one feature from among a first- or higher-order autocorrelation coefficient or partial autocorrelation function and a first- or higher-order cepstrum coefficient or mel-cepstrum coefficient that characterizes voice and is extracted from an input signal in frame units (at fixed time intervals), several frames are used together to obtain the probability that a vowel is present, vowel detection is performed, and only voice is thereby detected.

2. A voice detection method characterized in that, for a large amount of vowel training data prepared in advance, at least one feature from among a first- or higher-order autocorrelation coefficient or partial autocorrelation function and a first- or higher-order cepstrum coefficient or mel-cepstrum coefficient that characterizes voice and is extracted in frame units is used, several frames being used together, to create a standard model for each vowel, and voice is detected using the vowel standard models.

3. A voice detection device comprising: a feature extraction unit that extracts, from an input signal at fixed time intervals, a first- or higher-order autocorrelation coefficient or partial autocorrelation function and a first- or higher-order cepstrum coefficient or mel-cepstrum coefficient that characterize voice; a vowel standard model creation unit that, for a large amount of vowel training data prepared in advance, calculates a mean and a covariance matrix for each vowel from the features extracted by the feature extraction unit and creates a standard model for each vowel; a likelihood calculation unit that, for at least one feature from among the first- or higher-order autocorrelation coefficient or partial autocorrelation function and the first-order cepstrum coefficient or mel-cepstrum coefficient extracted in frame units from the input signal by the feature extraction unit, calculates the log likelihood against each vowel standard model created by the vowel standard model creation unit; a vowel determination unit that, for the frame in which a vowel is to be detected and the several frames before and after it, uses the log likelihoods calculated by the likelihood calculation unit to calculate a frame-average log likelihood for each vowel and compares it with a suitable threshold to determine whether the frames are a vowel; and a final determination unit that, for a block of the input signal whose power is at or above a fixed level, judges the block to be voice when the number of vowel samples determined by the vowel determination unit to be some vowel is at or above a suitable threshold.