JPH0458297A

JPH0458297A - Sound detecting device

Info

Publication number: JPH0458297A
Application number: JP2172028A
Authority: JP
Inventors: Kimitatsu Satou; 佐藤　仁樹; Tsuneo Nitta; 恒雄新田
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1990-06-27
Filing date: 1990-06-27
Publication date: 1992-02-25
Anticipated expiration: 2015-04-17
Also published as: JP3034279B2

Abstract

PURPOSE: To obtain a good detection rate, unaffected by the input level, even in environments with large background noise, by classifying the input signal frame by frame using a noise standard pattern and a speech standard pattern. CONSTITUTION: A feature vector calculation device 1 calculates linear prediction coefficients for each frame using the Durbin method or the like. A feature vector X(n) estimated to be noise by a noise interval estimation device 2 is transferred to a noise standard pattern creation device 7, which creates the noise standard pattern. A determination device 10 determines sound intervals based on the feature vector X(n) output from the feature vector calculation device 1. Because the noise standard pattern is created from feature vectors estimated to be noise, and speech/noise discrimination is based on this noise standard pattern and the speech standard pattern, a good detection rate can be obtained, unaffected by the input level, even in environments with large background noise.

Description

[Detailed Description of the Invention] [Object of the Invention] (Field of Industrial Application) The present invention relates to a sound detection device used to detect sound intervals in a speech signal in fields such as ATM (Asynchronous Transfer Mode) communication, DSI (Digital Speech Interpolation), packet communication, and speech recognition.

(Prior Art) Sound detection devices play an important role in the fields of communication and speech recognition, for example in improving communication efficiency and recognition rates.

FIG. 9 shows the configuration of a conventional sound detection device. First, a feature vector calculation device 23 calculates a feature vector from quantities such as the power, number of zero crossings, autocorrelation function, and spectrum of the input signal. A matching device 24 then measures the distance between this feature vector and each of a speech standard pattern and a noise standard pattern. If, for example, the distance to the speech standard pattern is shorter, the input is determined to be speech; conversely, if the distance to the noise standard pattern is shorter, it is determined to be noise.

FIG. 10 shows another example of a conventional sound detection device, in which a power calculation device 25 calculates the average power P(n) of the input frame. A threshold updating device 26 updates the threshold T(n) used for the determination, for example as follows.

$$T(n+1) = \begin{cases} P(n+1) \times \alpha & \text{if } P(n) < T(n) - P(n) \times (\alpha - 1) \\ P(n+1) \times \gamma & \text{if } P(n) \ge T(n) - P(n) \times (\alpha - 1) \end{cases}$$

Here, α and γ are constants.

A threshold comparison device 27 then classifies the input frame as speech if P(n) ≥ T(n) and as noise if P(n) < T(n).

The threshold updating device 26 may instead update the threshold as follows.

$$T(n+1) = \begin{cases} P(n+1) + \alpha & \text{if } P(n) < T(n) - \alpha \\ P(n+1) + \gamma & \text{if } P(n) \ge T(n) - \alpha \end{cases}$$

In general, the power of most vowels exceeds the background noise power, whereas the power of consonants is often below it. Consequently, in environments with large background noise, noise characteristics appear strongly in the feature vector even during consonant intervals.

However, because the conventional sound detection devices described above use feature vectors affected by background noise directly for the determination, consonant detection errors occur when the background noise is large. In addition, conventional sound detection devices that use only power as the speech feature vector often misjudge speech as noise when the input level is low. These problems degrade sound quality in the field of communication and lower the recognition rate in the field of speech recognition.

(Problems to Be Solved by the Invention) As described above, because conventional sound detection devices use feature vectors affected by background noise directly for the determination, consonant detection errors occur when the background noise is large.

In addition, conventional sound detection devices that use only power as the speech feature vector often misjudge speech as noise when the input level is low.

The present invention was made in view of these circumstances, and its object is to provide a sound detection device that achieves a good detection rate, unaffected by the input level, even in environments with large background noise.

[Structure of the Invention] (Means for Solving the Problems) To solve the problems described above, the present invention comprises: feature vector calculation means that divides an input signal into frames and calculates a feature vector of the input signal for each frame; sound interval estimation means that estimates sound intervals based on the feature vectors calculated by the feature vector calculation means; noise standard pattern creation means that creates a noise standard pattern based on the feature vectors of frames estimated by the sound interval estimation means not to be sound intervals; and determination means that determines, from the feature vector calculated by the feature vector calculation means, whether the input signal of each frame is speech or noise, using the noise standard pattern created by the noise standard pattern creation means and a speech standard pattern created in advance.

The second invention comprises: feature vector calculation means that calculates a feature vector of an input signal for each frame; feature vector conversion means that converts the feature vector calculated by the feature vector calculation means into a transformation vector from which fluctuations of the input signal have been removed; sound interval estimation means that estimates sound intervals for each frame based on the feature vector; noise standard pattern creation means that creates a noise standard pattern based on the transformation vectors of frames estimated by the sound interval estimation means not to be sound intervals; and determination means that determines whether the input signal of each frame is speech or noise, using the noise standard pattern created by the noise standard pattern creation means and a speech standard pattern created in advance.

(Operation) In the present invention, a noise standard pattern is created from the feature vectors or transformation vectors estimated to be noise, and this noise standard pattern and a speech standard pattern created in advance are used to determine whether the input signal of each frame is speech or noise. A good detection rate can therefore be obtained, unaffected by the input level, even in environments with large background noise.

(Embodiments) Embodiments of the present invention are described in detail below with reference to the drawings.

FIG. 1 is a block diagram showing the schematic configuration of a sound detection device according to one embodiment of the present invention.

In the following, the input signal is analyzed frame by frame to determine whether each frame is speech or noise. For example, the input signal is sampled at 8 kHz and every 160 samples are grouped into one frame. The frame length and analysis period, however, need not always be constant.
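The framing step described above can be sketched as follows. This is a minimal illustration, assuming non-overlapping frames and dropping any partial final frame; the function name `split_into_frames` and the placeholder signal are hypothetical and not taken from the patent.

```python
def split_into_frames(samples, frame_length=160):
    """Split a sequence of samples into consecutive, non-overlapping frames.

    A trailing remainder shorter than frame_length is dropped (an assumption;
    the text does not say how a partial final frame is handled).
    """
    n_frames = len(samples) // frame_length
    return [samples[i * frame_length:(i + 1) * frame_length]
            for i in range(n_frames)]


SAMPLE_RATE_HZ = 8000                    # sampling rate given in the text
one_second = [0] * SAMPLE_RATE_HZ        # placeholder signal, one second long
frames = split_into_frames(one_second)   # 160 samples per frame = 20 ms
```

At 8 kHz, a 160-sample frame covers 20 ms, so one second of input yields 50 frames.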

The feature vector calculation device 1 calculates linear prediction coefficients for each frame using the Durbin method or the like. The linear prediction coefficients may be converted into PARCOR coefficients, LPC cepstrum, mel cepstrum, or the like, and these used as the feature vector. The power, autocorrelation function, number of zero crossings, and so on may also be calculated.
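As a rough illustration of the Durbin method mentioned above, the sketch below computes linear prediction coefficients from a frame's autocorrelation with the Levinson-Durbin recursion. The sign convention (prediction error e(t) = sum of a[i] * x(t - i) with a[0] = 1) and the function names are assumptions; the patent does not specify them. The reflection coefficients returned here correspond to PARCOR coefficients up to a sign convention.

```python
def autocorrelation(frame, order):
    """Autocorrelation values r[0..order] of one frame."""
    n = len(frame)
    return [sum(frame[t] * frame[t + k] for t in range(n - k))
            for k in range(order + 1)]


def levinson_durbin(r, order):
    """Solve the LPC normal equations by the Levinson-Durbin recursion.

    Returns (a, k, err): the predictor polynomial a[0..order] with a[0] = 1,
    the reflection coefficients k[1..order] (k[0] is unused), and the final
    prediction error. Assumes r[0] > 0 (a silent frame would divide by zero).
    """
    a = [1.0] + [0.0] * order
    k = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k[i] = -acc / err
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k[i] * a[i - j]
        new_a[i] = k[i]
        a = new_a
        err *= 1.0 - k[i] ** 2
    return a, k, err


# Autocorrelation of a first-order process with correlation 0.5 (toy values).
a, k, err = levinson_durbin([1.0, 0.5, 0.25], order=2)
```

For this toy input the recursion recovers a first-order predictor: the second coefficient comes out as zero.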

The frame currently being classified as speech or noise is referred to below as the input frame. The feature vector of the input frame obtained by the feature vector calculation device is denoted X(n), where n is the sequential frame number.

The feature vector is a p-dimensional vector and can be written as

$$X(n) = (x_1(n), x_2(n), \ldots, x_p(n)).$$

The noise interval estimation device 2 comprises, for example, a power calculation device 3, a threshold updating device 4, and a threshold comparison device 5, as shown in FIG. 2.

Here, the power calculation device 3 calculates the average power P(n) of the input frame, and the threshold updating device 4 updates the threshold T(n) using the following equations.

$$T(n+1) = \begin{cases} P(n+1) \times \alpha & \text{if } P(n) < T(n) - P(n) \times (\alpha - 1) \\ P(n+1) \times \gamma & \text{if } P(n) \ge T(n) - P(n) \times (\alpha - 1) \end{cases}$$

Here, α and γ are constants.

The threshold comparison device 5 classifies the input frame as speech if P(n) ≥ T(n) and as noise if P(n) < T(n).

The output switch 6 transfers the feature vector X(n) of each frame estimated to be noise by the noise interval estimation device 2 to the noise standard pattern creation device 7.
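The noise interval estimation steps above can be sketched as follows. This is only a literal transcription of the rule as it reads in the text, assuming that average power means the mean of squared samples; the function names are hypothetical, and the original publication should be consulted before relying on the exact form of the update.

```python
def mean_power(frame):
    """Average power P(n) of one frame, taken as the mean of squared samples
    (an assumption; the text only says "average power")."""
    return sum(s * s for s in frame) / len(frame)


def is_speech(p_n, t_n):
    """Threshold comparison: speech if P(n) >= T(n), otherwise noise."""
    return p_n >= t_n


def next_threshold(p_n, t_n, p_next, alpha, gamma):
    """Threshold update, transcribed from the rule as printed.

    If P(n) < T(n) - P(n) * (alpha - 1), i.e. alpha * P(n) < T(n),
    then T(n+1) = P(n+1) * alpha; otherwise T(n+1) = P(n+1) * gamma.
    """
    if p_n < t_n - p_n * (alpha - 1):
        return p_next * alpha
    return p_next * gamma
```

Frames for which `is_speech` returns `False` are the ones whose feature vectors would be passed on for noise standard pattern creation.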

The noise standard pattern creation device 7 creates the noise standard pattern and comprises, for example, a buffer 8 and a mean/covariance matrix calculation device 9, as shown in FIG. 3. The buffer 8 stores the feature vectors X(n) estimated to be noise by the noise interval estimation device 2. The mean/covariance matrix calculation device 9 then creates the noise standard pattern from the feature vectors stored in the buffer 8.

The determination device 10 determines sound intervals based on the feature vector X(n) output from the feature vector calculation device 1. As shown in FIG. 4, for example, it comprises a matching device 11, a speech standard pattern storage device 12, and a noise standard pattern storage device 13 that stores the noise standard pattern created by the noise standard pattern creation device 7.

The matching device 11 measures the distance between the feature vector X(n) output from the feature vector calculation device 1 and each of the speech standard patterns and the noise standard pattern. If the feature vector matches a speech standard pattern, the frame is determined to be speech; if it matches the noise standard pattern, the frame is determined to be noise.

Specifically, the distance to each standard pattern $(\mu_i, \Sigma_i)$ ($i = 1, 2, \ldots, M+1$) is first measured using the following equation.

$$D_i(X) = (X - \mu_i)^{T} \, \Sigma_i^{-1} \, (X - \mu_i) + \ln |\Sigma_i|$$

X is taken to belong to the class $w_i$ whose index i minimizes $D_i(X)$. If $w_i$ belongs to speech, the frame is determined to be speech; if $w_i$ belongs to noise, the frame is determined to be noise.
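The matching step can be sketched as below, assuming the distance is the quadratic form (X - μ)ᵀ Σ⁻¹ (X - μ) plus ln|Σ|, with each standard pattern given as a mean vector and covariance matrix. The helper names and the small Gaussian-elimination solver are illustrative choices, not part of the patent.

```python
import math


def solve_and_logdet(sigma, d):
    """Solve sigma * z = d by Gaussian elimination with partial pivoting.

    Returns (z, ln|det(sigma)|). Assumes sigma is square and non-singular.
    """
    n = len(d)
    # Augmented matrix [sigma | d], copied so the inputs are not modified.
    m = [list(map(float, sigma[i])) + [float(d[i])] for i in range(n)]
    logdet = 0.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda row: abs(m[row][col]))
        if m[pivot][col] == 0.0:
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        logdet += math.log(abs(m[col][col]))
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for c in range(col, n + 1):
                m[row][c] -= factor * m[col][c]
    z = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(m[row][c] * z[c] for c in range(row + 1, n))
        z[row] = (m[row][n] - tail) / m[row][row]
    return z, logdet


def distance(x, mu, sigma):
    """D(X) = (X - mu)^T sigma^{-1} (X - mu) + ln|sigma|."""
    d = [xi - mi for xi, mi in zip(x, mu)]
    z, logdet = solve_and_logdet(sigma, d)
    return sum(di * zi for di, zi in zip(d, z)) + logdet


def classify(x, patterns):
    """Return the label of the standard pattern with the smallest distance.

    patterns is a list of (label, mu, sigma) tuples.
    """
    return min(patterns, key=lambda p: distance(x, p[1], p[2]))[0]


identity = [[1.0, 0.0], [0.0, 1.0]]
patterns = [("speech", [5.0, 5.0], identity), ("noise", [0.0, 0.0], identity)]
label = classify([0.2, -0.1], patterns)
```

Solving the linear system avoids forming the inverse covariance matrix explicitly, and the same elimination yields the log-determinant term.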

The speech standard patterns stored in the speech standard pattern storage device 12 can be defined as follows.

As is clear from the configuration of the matching device 11, a standard pattern consists of the mean vector μ and covariance matrix Σ of the feature vectors belonging to a class w.

Let the L r-dimensional feature vectors belonging to class w be

$$X_w(j) = (x_{w1}(j), x_{w2}(j), \ldots, x_{wr}(j)) \quad (j = 1, 2, \ldots, L).$$

Let the elements of $\mu_i$ and $\Sigma_i$ be $m^{(i)}_k$ and $\sigma^{(i)}_{kl}$, respectively. They are calculated by the mean/covariance matrix calculation device 14 shown in FIG. 5 (described later) using the following equations.

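$$m^{(i)}_k = \frac{1}{L} \sum_{j=1}^{L} x_{wk}(j)$$

$$\sigma^{(i)}_{kl} = \frac{1}{L} \sum_{j=1}^{L} \left( x_{wk}(j) - m^{(i)}_k \right) \left( x_{wl}(j) - m^{(i)}_l \right)$$

FIG. 5 shows the method for creating standard patterns 1 to M.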
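A standard pattern, that is, the mean vector and covariance matrix of a class's feature vectors, could be computed along the following lines. The sketch assumes division by L (rather than L - 1) in both averages; the function name `standard_pattern` is hypothetical.

```python
def standard_pattern(vectors):
    """Mean vector and covariance matrix (1/L normalisation) of one class.

    vectors is a list of L feature vectors of equal dimension.
    """
    count = len(vectors)
    dim = len(vectors[0])
    mean = [sum(v[k] for v in vectors) / count for k in range(dim)]
    cov = [[sum((v[k] - mean[k]) * (v[l] - mean[l]) for v in vectors) / count
            for l in range(dim)]
           for k in range(dim)]
    return mean, cov


mean, cov = standard_pattern([[1.0, 2.0], [3.0, 6.0]])
```

The same routine would serve for the noise standard pattern, applied to the vectors held in the noise buffer instead of labelled speech data.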

As shown in the figure, to create the standard patterns, a speech database 15 is created for each standard pattern class.

Specifically, several subjects are first asked to pronounce the phonemes belonging to each class, and their speech is recorded.

The speech signals obtained in this way are labeled frame by frame so that consonants can be distinguished from noise. The labeling is done by displaying the waveform and spectrum of the speech signal on, for example, a CRT and assigning a label to each frame while viewing the display. The labels corresponding to the speech database 15 form a label database 16.

Feature vectors are then calculated from the speech database 15. Next, with reference to the label database 16, the feature vectors of frames belonging to the class whose standard pattern is being created are used by the mean/covariance matrix calculation device 14 to calculate the mean and covariance.
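The selection of frames by label described above amounts to grouping feature vectors by their class label before the per-class mean and covariance are computed. A minimal sketch, with hypothetical label strings and a hypothetical function name:

```python
from collections import defaultdict


def group_by_label(feature_vectors, labels):
    """Collect the feature vectors of each class from per-frame labels."""
    grouped = defaultdict(list)
    for vector, label in zip(feature_vectors, labels):
        grouped[label].append(vector)
    return dict(grouped)


# Toy data: three frames, two of them labelled as the same (hypothetical) class.
groups = group_by_label([[1.0], [2.0], [3.0]], ["a", "noise", "a"])
```

Each list in `groups` would then be passed to the mean and covariance calculation for its class.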

Thus, the device of this embodiment creates a noise standard pattern from the feature vectors estimated to be noise and performs speech/noise determination based on this noise standard pattern and the speech standard patterns. A good detection rate can therefore be obtained, unaffected by the input level, even in environments with large background noise.

Next, another embodiment of the present invention is described.

FIG. 6 shows the configuration of the sound detection device according to this embodiment.

The feature vector calculation device 1 shown in the figure has the same configuration as that shown in FIG. 1.

The feature vector conversion device 17 is configured as shown in FIG. 7.

The buffer 18 shown in the figure stores the feature vectors $X(n) = (x_1(n), x_2(n), \ldots, x_p(n))$ from the head of the buffer toward the tail in the order in which they were input, so that their temporal order is preserved. That is, the most recent feature vector is stored at the head of the buffer and the oldest at the tail. The configuration of the buffer 18 is shown in FIG. 8. As shown in that figure, the buffer 18 stores the feature vectors of frames estimated to be noise by the threshold comparison device 19 (shown in FIG. 7). The feature vectors in the buffer 18 are therefore not necessarily contiguous in time.

Here, the set of feature vectors for N frames, starting at the S-th frame from the head of the buffer 18 and proceeding toward the tail, is denoted Ω(n) and expressed as follows.

$$\Omega(n) = \{ X_{Ln}(S), X_{Ln}(S+1), \ldots, X_{Ln}(S+N-1) \}$$

$$X_{Ln}(j) = (x_{Ln1}(j), x_{Ln2}(j), \ldots, x_{Lnp}(j))$$

The set $\Omega_i(n)$ of the i-th elements of $X_{Ln}(\cdot)$ is expressed by the following equation.
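The buffer and the window Ω(n) can be sketched as follows. The sketch assumes that the head of the buffer has index 0, so that the window covers indices S to S+N-1 and needs S+N stored vectors; the class name, the `capacity` parameter, and the decision to discard the oldest entry when full are all hypothetical.

```python
from collections import deque


class NoiseFrameBuffer:
    """Buffer of noise-estimated feature vectors, newest at the head."""

    def __init__(self, capacity):
        # capacity is a hypothetical limit; when it is exceeded the oldest
        # entry (at the tail) is discarded automatically by the deque.
        self._frames = deque(maxlen=capacity)

    def push(self, feature_vector):
        """Store a new noise-estimated vector at the head (index 0)."""
        self._frames.appendleft(feature_vector)

    def __len__(self):
        return len(self._frames)

    def window(self, s, n):
        """Omega(n): n vectors from index s (head = 0) toward the tail.

        Returns None until at least s + n vectors have been stored.
        """
        items = list(self._frames)
        if len(items) < s + n:
            return None
        return items[s:s + n]


buf = NoiseFrameBuffer(capacity=10)
for i in range(6):
    buf.push([float(i)])        # [5.0] ends up at the head
omega = buf.window(s=1, n=3)    # skips the newest vector
```

Skipping the newest S frames keeps the most recent, possibly misclassified, frames out of the noise statistics.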

$$\Omega_i(n) = \{ x_{Lni}(S), x_{Lni}(S+1), \ldots, x_{Lni}(S+N-1) \}$$

The threshold comparison device 19 compares the power component of the feature vector X(n) with the threshold T(n) to estimate whether the frame is speech or noise. Letting the power component of X(n) be $x_1(n)$, the frame is taken to be speech if $x_1(n) \ge T(n)$ and noise if $x_1(n) < T(n)$.

The threshold generation device 20 takes, from the feature vectors stored in the buffer 18, the set $\Omega_1(n)$ of the power elements $x_1$ of the feature vectors for N frames, starting S frames before the input frame (the S-th frame from the head of the buffer) and proceeding toward the tail, and calculates the mean and standard deviation of $\Omega_1(n)$.

First, the power elements $x_1$ of the feature vectors for N frames, starting at the S-th frame from the head of the buffer 18 and proceeding toward the tail, are taken out and denoted

$$\Omega_1(n) = \{ x_{Ln1}(S), x_{Ln1}(S+1), \ldots, x_{Ln1}(S+N-1) \}.$$

Next, the mean $m_1(n)$ and standard deviation $\sigma_1(n)$ of this element are calculated using the following equations.

$$m_1(n) = \frac{1}{N} \sum_{j=S}^{N+S-1} x_{Ln1}(j)$$

$$\sigma_1^2(n) = \frac{1}{N} \sum_{j=S}^{N+S-1} \left( x_{Ln1}(j) - m_1(n) \right)^2$$

The mean $m_1(n)$ and standard deviation $\sigma_1(n)$ can also be written as follows.

$$m_1(n) = \frac{1}{N} \sum_{j} x_1(j)$$

$$\sigma_1^2(n) = \frac{1}{N} \sum_{j} \left( x_1(j) - m_1(n) \right)^2$$

Here j satisfies the following condition, and the sums run over the N frames with the largest j that satisfy it.

$$\left( x_1(j) \in \Omega(n)' \right) \;\&\; \left( j < n - S \right)$$

Here, Ω(n)' is the set of feature vectors determined to be noise by the threshold comparison device 19.

The threshold T(n) is calculated by the threshold generation device 20, for example, as follows.

$$T(n) = \alpha \times m_1 + \beta \times \sigma_1$$

Here, α and β are arbitrary numbers.

However, until feature vectors for N + S frames have been stored in the buffer 18, T(n) takes a predetermined threshold value $T_0$.
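The threshold generation can be sketched as below: the threshold is α times the mean plus β times the standard deviation of the buffered noise power values, with a preset value used while too few frames are available. The function name and the convention of passing `None` (or an empty list) when the buffer is not yet full are assumptions for illustration.

```python
import math


def threshold(power_window, alpha, beta, t0):
    """T(n) = alpha * m_1 + beta * sigma_1 over the noise power window.

    power_window holds the N power values of Omega_1(n); pass None or an
    empty list while the buffer holds fewer than N + S frames, in which
    case the preset threshold t0 is returned.
    """
    if not power_window:
        return t0
    n = len(power_window)
    mean = sum(power_window) / n
    std = math.sqrt(sum((p - mean) ** 2 for p in power_window) / n)
    return alpha * mean + beta * std


t_initial = threshold(None, alpha=1.0, beta=2.0, t0=9.0)
t_adapted = threshold([1.0, 3.0], alpha=1.0, beta=2.0, t0=9.0)
```

Because the threshold follows the mean and spread of recent noise power, it adapts when the background noise level changes.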

The conversion device 21 uses Ω(n) to remove the effects of background noise and input signal level fluctuations from the feature vector X(n), yielding the transformation vector Y(n). The transformation vector is an r-dimensional vector (r ≤ p).

To calculate Y(n), the mean and variance of each element of Ω(n) are first calculated using the following equations.

$$m_i(n) = \frac{1}{N} \sum_{j=S}^{N+S-1} x_{Lni}(j)$$

$$\sigma_i^2(n) = \frac{1}{N} \sum_{j=S}^{N+S-1} \left( x_{Lni}(j) - m_i(n) \right)^2$$

Here, $V_r(n)$ is defined as follows.

$$V_r(n) = (y_1(n), y_2(n), \ldots, y_r(n))$$

Each element is defined by the following equation.

$$y_i(n) = \left( x_i(n) - m_i(n) \right) / \sigma_i(n)$$

Alternatively, the following equation may be used.

$$y_i(n) = x_i(n) - m_i(n)$$

Examples of calculating the transformation vector Y(n) are shown below.
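The element-wise transformation can be sketched as follows, assuming the statistics are taken over the window of noise-estimated feature vectors and that the standard deviation of every used element is non-zero. The function names and the `normalise` flag are hypothetical.

```python
import math


def element_stats(window):
    """Per-element mean m_i and standard deviation sigma_i over Omega(n)."""
    n = len(window)
    dim = len(window[0])
    means = [sum(v[i] for v in window) / n for i in range(dim)]
    stds = [math.sqrt(sum((v[i] - means[i]) ** 2 for v in window) / n)
            for i in range(dim)]
    return means, stds


def transform(x, means, stds, r, normalise=True):
    """First r elements of the transformed vector.

    normalise=True:  y_i = (x_i - m_i) / sigma_i  (sigma_i must be non-zero)
    normalise=False: y_i = x_i - m_i
    """
    if normalise:
        return [(x[i] - means[i]) / stds[i] for i in range(r)]
    return [x[i] - means[i] for i in range(r)]


means, stds = element_stats([[1.0, 10.0], [3.0, 14.0]])
y_norm = transform([4.0, 16.0], means, stds, r=2)
y_diff = transform([4.0, 16.0], means, stds, r=2, normalise=False)
```

Subtracting the noise mean removes a slowly varying offset, and dividing by the noise spread makes the result insensitive to the overall input level.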

Example 1

$$Y(n) = V_r(n)$$

$V_r(n)$ is the difference between X(n) and the mean vector M(n) of Ω(n), normalized using the variance of Ω(n). Here, i = 1, 2, ..., r, and r ≤ p.

Example 2

The transformation vector may also be defined as the norm of $V_r(n)$.

$$Y(n) = \| V_r(n) \|$$

Here, ‖·‖ denotes the norm of a vector.

Example 3

The norm may also be taken for each group of elements. For example, with

$$Y_1(n) = (y_1(n), \ldots, y_{r1}(n))$$

$$Y_2(n) = (y_{r1+1}(n), \ldots, y_{r2}(n))$$

$$Y_3(n) = (y_{r2+1}(n), \ldots, y_r(n))$$

and 1 < r1 < r2 < r, the transformation vector is defined from the norm of each vector by the following equation.

$$Y(n) = ( \| Y_1(n) \|, \| Y_2(n) \|, \| Y_3(n) \| )$$

The output switch 22 transfers the transformation vector Y(n) of each frame estimated to be noise by the threshold comparison device 19 to the noise standard pattern creation device 7.
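Examples 2 and 3 of the transformation vector described above can be sketched as follows. The Euclidean norm is an assumption, since the text does not say which norm is meant, and the function names are hypothetical.

```python
import math


def norm(v):
    """Euclidean norm of a vector (assumed; the text does not name the norm)."""
    return math.sqrt(sum(c * c for c in v))


def grouped_norms(v, r1, r2):
    """Example 3: split the vector into three groups and take each norm.

    With 1-based element numbers, the groups are elements 1..r1,
    r1+1..r2 and r2+1..r, where r = len(v).
    """
    return [norm(v[:r1]), norm(v[r1:r2]), norm(v[r2:])]


single = norm([3.0, 4.0])                                          # Example 2
triple = grouped_norms([3.0, 4.0, 6.0, 8.0, 5.0, 12.0], r1=2, r2=4)  # Example 3
```

Taking norms reduces the dimension of the vector passed to the pattern matching stage, at the cost of discarding the individual element values within each group.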

The noise standard pattern creation device 7 creates the noise standard pattern and is configured in the same way as that shown in FIG. 1. Here, however, the buffer 8 (see FIG. 3) stores the transformation vectors Y(n).

The determination device 10 is also configured in the same way as that shown in FIGS. 1 and 4. Here, however, sound intervals are determined based on the transformation vector Y(n) output from the feature vector conversion device 17. That is, the matching device 11 measures the distance between the transformation vector Y(n) output from the feature vector conversion device 17 and each of the speech standard patterns and the noise standard pattern. If the vector matches a speech standard pattern, the frame is determined to be speech; if it matches the noise standard pattern, the frame is determined to be noise.

Specifically, the distance to each standard pattern $(\mu_i, \Sigma_i)$ ($i = 1, 2, \ldots, M+1$) is first measured using the following equation.

$$D_i(Y) = (Y - \mu_i)^{T} \, \Sigma_i^{-1} \, (Y - \mu_i) + \ln |\Sigma_i|$$

Y is taken to belong to the class $w_i$ whose index i minimizes $D_i(Y)$. If $w_i$ belongs to speech, the frame is determined to be speech; if $w_i$ belongs to noise, the frame is determined to be noise.

As is clear from the configuration of the matching device 11, a standard pattern consists of the mean vector μ and covariance matrix Σ of the transformation vectors belonging to a class w.

Let the L r-dimensional transformation vectors belonging to class w be

$$Y_w(j) = (y_{w1}(j), y_{w2}(j), \ldots, y_{wr}(j)) \quad (j = 1, 2, \ldots, L).$$

Let the elements of $\mu_i$ and $\Sigma_i$ be $m^{(i)}_k$ and $\sigma^{(i)}_{kl}$, respectively. They are calculated by the mean/covariance matrix calculation device 14 using the following equations.

$$m^{(i)}_k = \frac{1}{L} \sum_{j=1}^{L} y_{wk}(j)$$

$$\sigma^{(i)}_{kl} = \frac{1}{L} \sum_{j=1}^{L} \left( y_{wk}(j) - m^{(i)}_k \right) \left( y_{wl}(j) - m^{(i)}_l \right)$$

Thus, according to the device of this embodiment, the feature vector conversion device 17 removes the influence of noise from the speech feature vectors and the noise standard pattern is created adaptively. As a result, a good detection rate was obtained even in environments with large background noise, with S/N ratios of about 20 dB down to 14 dB. A detection rate independent of the input level was also obtained.

[Effects of the Invention] As described above, according to the present invention, a noise standard pattern is created using only the feature vectors or transformation vectors estimated to be noise, and speech/noise determination is performed based on this noise standard pattern. A good detection rate can therefore be obtained, unaffected by the input level, even in environments with large background noise.

[Brief explanation of drawings]

FIG. 1 shows the configuration of a sound detection device according to one embodiment of the present invention. FIG. 2 shows the configuration of the noise interval estimation device shown in FIG. 1. FIG. 3 shows the configuration of the noise standard pattern creation device shown in FIG. 1. FIG. 4 shows the configuration of the determination device shown in FIG. 1. FIG. 5 illustrates the method for creating the speech standard patterns. FIG. 6 shows the configuration of a sound detection device according to another embodiment of the present invention. FIG. 7 shows the configuration of the feature vector conversion device shown in FIG. 6. FIG. 8 shows the configuration of the buffer shown in FIG. 7. FIGS. 9 and 10 show the configurations of conventional sound detection devices.

Description of symbols: 1: feature vector calculation device; 2: noise interval estimation device; 7: noise standard pattern creation device; 10: determination device.

Applicant: Toshiba Corporation

Claims

[Claims]

(1) A sound detection device comprising: feature vector calculation means that divides an input signal into frames and calculates a feature vector of the input signal for each frame; sound interval estimation means that estimates sound intervals based on the feature vectors calculated by the feature vector calculation means; noise standard pattern creation means that creates a noise standard pattern based on the feature vectors of frames estimated by the sound interval estimation means not to be sound intervals; and determination means that determines, from the feature vector calculated by the feature vector calculation means, whether the input signal of each frame is speech or noise, using the noise standard pattern created by the noise standard pattern creation means and a speech standard pattern created in advance.

(2) A sound detection device comprising: feature vector calculation means that calculates a feature vector of an input signal for each frame; feature vector conversion means that converts the feature vector calculated by the feature vector calculation means into a transformation vector from which fluctuations of the input signal have been removed; sound interval estimation means that estimates sound intervals for each frame based on the feature vector; noise standard pattern creation means that creates a noise standard pattern based on the transformation vectors estimated by the sound interval estimation means not to be sound intervals; and determination means that determines whether the input signal of each frame is speech or noise, using the noise standard pattern created by the noise standard pattern creation means and a speech standard pattern created in advance.