JP3032215B2

JP3032215B2 - Sound detection device and method

Info

Publication number: JP3032215B2
Application number: JP1183684A
Authority: JP
Inventors: 仁樹佐藤; 恒雄新田
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1989-07-18
Filing date: 1989-07-18
Publication date: 2000-04-10
Anticipated expiration: 2015-04-10
Also published as: JPH0348900A

Description

DETAILED DESCRIPTION OF THE INVENTION [Purpose of the Invention] (Industrial application field) The present invention relates to a sound detection device that accurately detects speech sections in a speech signal, and is applicable to ATM (Asynchronous Transfer Mode) communication, DSI (Digital Speech Interpolation), packet communication, and speech recognition.

(Prior Art) FIG. 6 shows one configuration of a conventional sound detection device.

Feature parameters such as power, zero-crossing count, autocorrelation function, and spectrum are calculated frame by frame by the feature parameter calculator 101 from the speech signal applied to the input terminal 100.

The calculated feature parameters are output to the matching unit 102, where they are compared with a preset speech standard pattern 103 and a preset noise standard pattern 104, and the distance to each pattern is calculated.

If the distance between the feature parameters and the speech standard pattern 103 is smaller than the distance between the feature parameters and the noise standard pattern 104, the input frame is judged to belong to speech; otherwise it is judged to belong to noise. The decision is output from the output terminal 105.

(Problems to be Solved by the Invention) However, even within speech, the power of consonants, unlike that of vowels, often falls below the power of the background noise. Consequently, in an environment with loud background noise, the characteristics of the background noise dominate the feature parameters of consonant sections.

Because the conventional sound detection device described above uses feature parameters affected by background noise directly for the decision, consonant detection errors become frequent when the background noise is loud.

This degrades sound quality in communication applications and lowers the recognition rate in speech recognition.

The present invention has been made in view of the above circumstances, and its object is to provide a sound detection device that can improve the accuracy of speech detection even when the background noise is loud.

[Structure of the Invention] (Means for Solving the Problems) To solve the above problems, the present invention comprises: feature parameter generating means for obtaining feature parameters of an input speech signal supplied in units of frames of a given length; noise determining means for provisionally determining, frame by frame, whether the input speech signal is noise on the basis of the feature parameters obtained by the feature parameter generating means; storage means for accumulating, over a plurality of frames, the feature parameters obtained by the feature parameter generating means for the frames provisionally determined to be noise by the noise determining means; conversion means for converting the feature parameters of a frame of the input speech signal into conversion parameters using the feature parameters of the plurality of frames accumulated in the storage means; and speech determining means for determining, on the basis of the conversion parameters produced by the conversion means, whether the frame of the input speech signal belongs to speech or to noise.

(Operation) With the above configuration, the present invention generates conversion parameters from either all the feature parameters obtained frame by frame or the feature parameters of noise sections, and uses these conversion parameters to discriminate between the speech sections and the noise sections of the speech signal. In particular, when the conversion parameters are generated from the feature parameters of noise sections, speech can be discriminated while avoiding the influence of noise.

(Embodiment) FIG. 1 is a block diagram showing the schematic configuration of a sound detection device according to the present invention. The device comprises a feature parameter calculator 1, a feature parameter converter 2, a speech presence determiner 3, a noise detector 4, a switch 5, and a buffer 6.

In the following embodiment, the speech signal is analyzed frame by frame to decide between speech and silence. For example, the speech signal is sampled at 8 kHz and every 160 samples are grouped into one frame (20 ms). However, the frame length need not always be constant.
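The framing in this example can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `split_into_frames` is hypothetical, frames are assumed to be non-overlapping, and a trailing partial frame is assumed to be discarded.

```python
SAMPLE_RATE_HZ = 8000   # sampling rate given in the example above
FRAME_LENGTH = 160      # samples per frame given in the example above


def split_into_frames(samples, frame_length=FRAME_LENGTH):
    """Group samples into consecutive, non-overlapping frames.

    Assumption: a trailing partial frame is dropped.
    """
    usable = len(samples) - len(samples) % frame_length
    return [samples[i:i + frame_length] for i in range(0, usable, frame_length)]


# One second of (silent) signal yields 8000 / 160 frames of 20 ms each.
one_second = [0.0] * SAMPLE_RATE_HZ
frames = split_into_frames(one_second)
frame_duration_ms = 1000 * FRAME_LENGTH / SAMPLE_RATE_HZ
```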

The feature parameter calculator 1 calculates linear prediction coefficients for each frame using, for example, the Durbin method. PARCOR coefficients, the LPC cepstrum, the mel cepstrum, and the like may be calculated from the linear prediction coefficients and used as feature parameters. Power, the autocorrelation function, the zero-crossing count, and so on may also be calculated.
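As an illustration of the Durbin method mentioned above, the following is a minimal sketch of the Levinson-Durbin recursion applied to a frame's autocorrelation sequence. The function names are hypothetical, no windowing is applied, and the sign convention A(z) = 1 + a_1 z^-1 + ... + a_p z^-p is an assumption; the reflection coefficients returned correspond to PARCOR coefficients up to that sign convention.

```python
def autocorrelation(frame, order):
    """Return autocorrelation values r[0..order] of one frame."""
    n = len(frame)
    return [sum(frame[t] * frame[t + lag] for t in range(n - lag))
            for lag in range(order + 1)]


def levinson_durbin(r, order):
    """Solve the LPC normal equations by the Levinson-Durbin recursion.

    Returns (a, ks, err): predictor polynomial a[0..order] with a[0] = 1,
    the reflection coefficients, and the final prediction error energy.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    ks = []
    for m in range(1, order + 1):
        if err <= 0.0:
            break  # degenerate input such as an all-zero frame
        acc = r[m] + sum(a[j] * r[m - j] for j in range(1, m))
        k = -acc / err
        ks.append(k)
        prev = a[:]
        for j in range(1, m):
            a[j] = prev[j] + k * prev[m - j]
        a[m] = k
        err *= 1.0 - k * k
    return a, ks, err


# Autocorrelation of a first-order process with coefficient 0.5.
a, ks, err = levinson_durbin([1.0, 0.5, 0.25], order=2)
```

For this input the recursion recovers a first-order predictor: the second coefficient is zero because the sequence is fully explained by one lag.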

The frame currently being judged as speech or silence is hereinafter called the input frame. The feature parameter of the input frame obtained by the feature parameter calculator 1 is denoted x(n), where n is the sequential frame number. The feature parameter is a p-dimensional vector and is written as equation (1):

x(n) = (x_1(n), x_2(n), ..., x_p(n)) ... (1)

The noise detector 4 measures the average power Pow of each frame by the following equation (3). Let the samples of the speech signal within the frame be a(i) (i = 0, 1, ..., s−1), where s is the number of samples per frame:

Pow = (1/s) Σ_{i=0}^{s−1} a(i)^2

The average power Pow is then compared with a threshold T given in advance, in order to detect sections of the input signal that are certain to be noise.

If Pow ≥ T, the frame is judged not to be noise and "0" is output to the switch 5 (SW5).

Otherwise, the frame is judged to be noise and "1" is output to SW5.

When the output of the noise detector 4 is "1", SW5 causes the feature parameter of that frame to be stored in the buffer 6.
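The provisional noise decision described above can be sketched as follows. This is a minimal illustration with hypothetical function names; it assumes that the average power is the mean of the squared samples in the frame, and the threshold values used in the example are arbitrary.

```python
def average_power(frame):
    """Mean of the squared samples in one frame (assumed definition of Pow)."""
    return sum(a * a for a in frame) / len(frame)


def noise_flag(frame, threshold):
    """Return 1 if the frame is provisionally judged to be noise, else 0.

    Pow >= T gives 0 (not noise); Pow < T gives 1 (noise), as in the text.
    """
    return 0 if average_power(frame) >= threshold else 1


quiet_frame = [0.1] * 160
loud_frame = [1.0] * 160
flags = [noise_flag(quiet_frame, 0.5), noise_flag(loud_frame, 0.5)]
```

A flag of 1 is the condition under which the switch passes the frame's feature parameter on to the buffer.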

As shown in FIG. 2, the buffer 6 preserves the temporal order in which feature parameters arrive by storing them, in order of arrival, from the head of the buffer toward the tail. That is, the newest feature parameter (that of the frame currently being judged) is held at the head of the buffer, and the oldest at the tail.
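One way to picture the head-to-tail ordering is a double-ended queue in which new entries are pushed at the head. The sketch below is illustrative only: the class name `FeatureBuffer` and the fixed `capacity` are assumptions not stated in the text, and the `is_noise` argument stands in for the switch controlled by the noise detector.

```python
from collections import deque


class FeatureBuffer:
    """Holds feature parameters newest-first (head) to oldest (tail)."""

    def __init__(self, capacity):
        # With maxlen set, appendleft on a full deque drops the tail entry.
        self._frames = deque(maxlen=capacity)

    def push(self, feature, is_noise=1):
        """Store the feature only when the noise flag is 1 (switch closed)."""
        if is_noise == 1:
            self._frames.appendleft(feature)

    def head_to_tail(self):
        """Return the stored features ordered from head to tail."""
        return list(self._frames)


buf = FeatureBuffer(capacity=3)
buf.push([1.0], is_noise=1)
buf.push([2.0], is_noise=0)   # judged not noise, so it is skipped
buf.push([3.0], is_noise=1)
buf.push([4.0], is_noise=1)
buf.push([5.0], is_noise=1)   # oldest entry [1.0] falls off the tail
contents = buf.head_to_tail()
```

Passing `is_noise=1` for every frame corresponds to the variant in which all feature parameters are stored without a noise decision.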

In this embodiment, only the feature parameters of frames judged to be noise by the noise detector 4 are stored in the buffer; however, all feature parameters may be stored in the buffer without performing the noise decision.

From the feature parameters stored in the buffer 6, a feature parameter set Ω of N frames is extracted, starting S frames before the input frame (the S-th frame from the head of the buffer) and proceeding toward the tail of the buffer, as shown in FIG. 2.

S and N may each be any number of frames, but values from a few frames to about 20 frames are preferable.

The feature parameter converter 2 converts the feature parameter so as to emphasize the difference between speech and noise. The converted feature parameter is hereinafter called the conversion parameter y(n); it is a p-dimensional vector.

Here, the conversion parameter y(n) is obtained by taking the difference between the feature parameter x(n) of the input frame and the mean vector m of Ω to form a distance vector, and normalizing it by the standard deviation of Ω. Its components are given by equations (3) to (7), and FIG. 3 illustrates the relationship among the feature parameter x(n), the feature parameter set Ω, the conversion parameter y(n), and the mean vector m of Ω.

With m_i and σ_i denoting the i-th components of the mean vector and of the standard deviation of Ω,

y_i(n) = (x_i(n) − m_i) / σ_i ... (5)

where i = 1, 2, ..., p.
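The conversion of equation (5) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, S is treated as a zero-based offset from the head of the buffer, the standard deviation is the population form (division by the number of frames in Ω), and a small `floor` guards against division by zero. None of these details are specified in the text.

```python
import math


def reference_set(buffer_head_to_tail, s, n):
    """Take up to n entries starting s positions from the head (the set Omega)."""
    return buffer_head_to_tail[s:s + n]


def convert(x, omega, floor=1e-8):
    """Apply y_i = (x_i - m_i) / sigma_i using the statistics of omega."""
    p = len(x)
    count = len(omega)
    mean = [sum(v[i] for v in omega) / count for i in range(p)]
    std = [math.sqrt(sum((v[i] - mean[i]) ** 2 for v in omega) / count)
           for i in range(p)]
    return [(x[i] - mean[i]) / max(std[i], floor) for i in range(p)]


# Two stored noise frames with mean (1, 12) and standard deviation (1, 2).
omega = [[0.0, 10.0], [2.0, 14.0]]
y = convert([3.0, 8.0], omega)
```

Because each component is expressed in units of the noise standard deviation, a frame that resembles the stored noise maps close to the origin, while a frame that departs from it in any component moves away from the origin.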

The speech presence determiner 3 determines speech sections on the basis of the conversion parameter obtained from the feature parameter converter 2. As shown in FIG. 4, the speech presence determiner 3 comprises a matching unit 7 and M standard patterns 8.

The standard patterns 8 can be defined as follows. Each standard pattern 8 consists of the mean vector μ of the conversion parameters y and the covariance matrix Σ of y. In equations (8) to (10) below, the index i denoting the class of the standard pattern is omitted for simplicity.

The elements μ_k and Σ_kl of μ and Σ are computed from the L p-dimensional conversion parameters belonging to class ω.
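Computing a standard pattern from the L conversion parameters of one class can be sketched as below. The function name is hypothetical, and normalizing by 1/L (rather than 1/(L−1)) is an assumption, since the defining equations are not reproduced in this text.

```python
def train_standard_pattern(samples):
    """Return (mu, sigma) for L p-dimensional conversion parameters.

    mu[k] is the mean of component k; sigma[k][l] is the covariance of
    components k and l, both normalized by the sample count L (assumed).
    """
    count = len(samples)
    p = len(samples[0])
    mu = [sum(y[k] for y in samples) / count for k in range(p)]
    sigma = [[sum((y[k] - mu[k]) * (y[l] - mu[l]) for y in samples) / count
              for l in range(p)]
             for k in range(p)]
    return mu, sigma


class_samples = [[1.0, 0.0], [-1.0, 0.0], [0.0, 2.0], [0.0, -2.0]]
mu, sigma = train_standard_pattern(class_samples)
```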

The matching unit 7 measures the distance between each standard pattern ω_i and the conversion parameter y(n); the frame is judged to be speech if it is matched to a standard pattern ω_i belonging to speech, and silence otherwise.

First, the distance between each standard pattern ω_i (i = 1, ..., M) and the conversion parameter y(n) is measured.

The class i that minimizes this distance is then found, and y(n) is taken to belong to the class ω_i. If ω_i is a pattern representing speech, the frame is judged to be speech; if ω_i is a pattern representing noise, the frame is judged to be noise.
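The matching step can be sketched as a nearest-pattern search. The distance formula itself is not reproduced in this text, so the squared Mahalanobis distance is used here purely as an assumed stand-in that makes use of both μ and Σ; the function names and class labels are likewise hypothetical, and each covariance matrix is assumed to be nonsingular.

```python
def solve(matrix, rhs):
    """Solve matrix * z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (aug[r][n] - tail) / aug[r][r]
    return z


def mahalanobis_sq(y, mu, sigma):
    """Squared Mahalanobis distance (y - mu)^T sigma^-1 (y - mu)."""
    d = [a - b for a, b in zip(y, mu)]
    return sum(a * b for a, b in zip(d, solve(sigma, d)))


def classify(y, patterns):
    """Return the label of the pattern (label, mu, sigma) nearest to y."""
    return min(patterns, key=lambda pat: mahalanobis_sq(y, pat[1], pat[2]))[0]


identity = [[1.0, 0.0], [0.0, 1.0]]
patterns = [("speech", [4.0, 4.0], identity), ("noise", [0.0, 0.0], identity)]
decisions = [classify([3.5, 4.2], patterns), classify([0.3, -0.2], patterns)]
```

With M standard patterns, several labels may represent speech (for example different consonant classes) and several may represent noise; the frame decision then depends only on which group the nearest pattern belongs to.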

The effects of the above embodiments are described below on the basis of specific measurement results.

Unlike vowels, consonants often have less power than the background noise. Therefore, in an environment with loud background noise, the characteristics of the noise dominate the feature parameters even in consonant sections. Because the conventional method uses feature parameters affected by background noise directly for the decision, consonant detection errors become frequent when the background noise is loud.

Because each embodiment of the present invention emphasizes the characteristics that distinguish noise from speech, a good detection rate was obtained even in environments with loud background noise, at S/N ratios of roughly 20 dB to 14 dB. The detection results for word-initial consonants with different feature parameters and feature parameter conversion methods are shown below. A frame whose label in the speech data indicates a consonant is counted as correctly detected if it is judged to belong to any of the consonant classes.

The detection rate shown in FIG. 5 is the average of the consonant detection rate and the noise detection rate. The consonant detection rate is defined by the following equation.

Likewise, a frame of noise data is counted as correctly detected if it is judged to belong to any of the noise classes. This gives the noise detection rate, which is defined by the following equation.

In FIG. 5, the vertical axis is the detection rate. The horizontal axis indicates the type of feature parameter: LPC is the LPC cepstrum, P is the average power within a frame, and P+LPC is the combined use of P and LPC.

In the following, the LPC cepstrum analysis order is 12, and the conversion parameter dimension is 4 when the feature parameter is LPC and 5 when it is P+LPC. The feature parameter conversion methods are distinguished by different plot markers.

c is the conventional method, which performs no feature parameter conversion.

n is the embodiment shown in FIG. 1, with the noise decision performed.

v is the embodiment shown in FIG. 1, without the noise decision.

[Effects of the Invention] As described above, according to the present invention, the influence of noise can be removed from the feature parameters by feature parameter conversion, so speech sections can be discriminated accurately even in an environment with loud background noise.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the schematic configuration of a sound detection device according to the present invention; FIG. 2 is a diagram of the buffer used in the embodiment; FIG. 3 is an explanatory diagram of the conversion parameters in the embodiment; FIG. 4 is a block diagram showing a configuration example of the speech presence determiner; FIG. 5 is a characteristic diagram showing the relationship between feature parameters and detection rate in each embodiment; and FIG. 6 is a block diagram showing a configuration example of a conventional sound detection device.

Description of symbols: 1: feature parameter calculator; 2: feature parameter converter; 3: speech presence determiner; 4: noise detector; 5: switch; 6: buffer; 7: matching unit; 8: standard patterns.

Continuation of front page

(56) References: JP-A-2-266400 (JP, A); JP-A-60-200300 (JP, A); JP-A-1-302298 (JP, A); JP-A-4-58297 (JP, A); JP-A-61-48898 (JP, A); JP-A-2-282798 (JP, A); JP-A-2-26640 (JP, A); JP-A-3-48900 (JP, A); JP-B-5-56512 (JP, B2); Proceedings of the 1989 IEICE Spring National Convention, Vol. 3, p. 3-78, "B-372 Speech Cellization Method for ATM Communication" (March 28, 1989); Furui, "Digital Speech Processing" (September 25, 1985), Tokai University Press, pp. 44-48; Saito and Nakata, "Basics of Speech Information Processing" (November 30, 1981), Ohmsha, pp. 99-103; IEICE Technical Report [Communication], Vol. 89, No. 132, CS89-33, "Sound Detection Method for Voice Packet Communication", pp. 61-66 (issued July 19, 1989)

(58) Fields investigated (Int. Cl.7, DB name): G10L 11/02; G10L 15/04; H04B 14/04; JICST file (JOIS)

Claims

(57) [Claims]

1. A sound detection device comprising: feature parameter generating means for obtaining feature parameters of an input speech signal supplied in units of frames of a given length; noise determining means for provisionally determining, frame by frame, whether the input speech signal is noise on the basis of the feature parameters obtained by the feature parameter generating means; storage means for accumulating, over a plurality of frames, the feature parameters obtained by the feature parameter generating means for the frames provisionally determined to be noise by the noise determining means; conversion means for converting the feature parameters of a frame of the input speech signal into conversion parameters using the feature parameters of the plurality of frames accumulated in the storage means; and speech determining means for determining, on the basis of the conversion parameters produced by the conversion means, whether the frame of the input speech signal belongs to speech or to noise.

2. The sound detection device according to claim 1, wherein the conversion means converts the feature parameters of the frame of the input speech signal into the conversion parameters by obtaining a distance vector between the feature parameters of the frame of the input speech signal and the feature parameters of the plurality of frames accumulated in the storage means.

3. A sound detection method comprising: a feature parameter generating step of obtaining feature parameters of an input speech signal supplied in units of frames of a given length; a noise determining step of provisionally determining, frame by frame, whether the input speech signal is noise on the basis of the feature parameters obtained in the feature parameter generating step; an accumulating step of accumulating, over a plurality of frames, the feature parameters obtained in the feature parameter generating step for the frames provisionally determined to be noise in the noise determining step; a converting step of converting the feature parameters of a frame of the input speech signal into conversion parameters using the feature parameters of the plurality of frames accumulated in the accumulating step; and a speech determining step of determining, on the basis of the conversion parameters produced in the converting step, whether the frame of the input speech signal belongs to speech or to noise.