JP3321156B2 - Voice operation characteristics detection - Google Patents

Voice operation characteristics detection

Info

Publication number
JP3321156B2
JP3321156B2 (application JP50377289A)
Authority
JP
Japan
Prior art keywords
means
value
signal
input signal
filter
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Lifetime
Application number
JP50377289A
Other languages
Japanese (ja)
Other versions
JPH03504283A (en)
Inventor
Freeman, Daniel Kenneth
Boyd, Ivan
Original Assignee
British Telecommunications public limited company
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority to GB888805795A priority Critical patent/GB8805795D0/en
Priority to GB8805795 priority
Priority to GB888813346A priority patent/GB8813346D0/en
Priority to GB8813346.7 priority
Priority to GB8820105.8 priority
Priority to GB888820105A priority patent/GB8820105D0/en
Application filed by British Telecommunications public limited company
Publication of JPH03504283A publication Critical patent/JPH03504283A/ja
Application granted granted Critical
Publication of JP3321156B2 publication Critical patent/JP3321156B2/en
Anticipated expiration legal-status Critical
Expired - Lifetime legal-status Critical Current

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals

Abstract

The first aspect provides a voice activity detection apparatus which receives an input signal, estimates the noise signal component of the input signal, and continually forms a measure M of the spectral similarity between a portion of the input signal and the noise signal. A circuit compares a parameter derived from the measure M with a threshold value T to produce an output indicating the presence or absence of speech, depending on whether that value is exceeded. A second aspect covers voice activity detection apparatus which continually forms a spectral distortion measure and carries out a comparison.

Description

DETAILED DESCRIPTION OF THE INVENTION

A voice activity detector is a device which is supplied with a signal, and whose purpose is to detect periods of speech, or periods containing only noise. Although the invention is not limited to such applications, one field to which such a detector is particularly suited is mobile radiotelephone systems, in which speech is coded by a speech coder to improve the efficient use of the radio spectrum; in those systems the noise levels (from in-vehicle units) are generally high.

The essence of voice activity detection is to look for quantities that differ between speech and non-speech periods. In an apparatus that includes a speech coder, many parameters are readily available from the coder or from another stage, and using such parameters economically simplifies the necessary processing. In many situations, the dominant noise occupies a limited region of the frequency spectrum; for example, the noise of a moving car (e.g., engine noise) lies in the low-frequency part of the spectrum. If knowledge of the location of the noise spectrum is available, it is desirable to base the decision on whether speech is present on a measure obtained from the relatively noise-free portions of the spectrum. Of course, it would be possible to filter the signal before the voice activity detection and analysis; but where the voice activity detector depends on the output of a speech coder, such pre-filtering would interfere with the coded speech signal.

According to the present invention there is provided a voice activity detector comprising means for receiving an input signal, means for adaptively and periodically estimating a noise signal component of the input signal, means for periodically forming a measure M of the spectral similarity between the input signal and the noise signal component, means for comparing a parameter derived from the measure M with a threshold value T, and means for generating an output indicating whether speech is present depending on whether that value is exceeded.

The measure is desirably an Itakura-Saito distortion measure.

 Other aspects of the invention are within the scope of the claims.

Some embodiments of the present invention will now be described with reference to the accompanying drawings.

 FIG. 1 is a block diagram showing a first embodiment of the present invention; FIG. 2 shows a second embodiment of the present invention; FIG. 3 shows a preferred third embodiment of the present invention.

The general principle underlying the first embodiment of the voice activity detector according to the present invention is as follows.

Suppose n signal samples (s0, s1, s2, s3, s4, ..., sn-1) are passed through a fourth-order finite impulse response (FIR) digital filter with impulse response (h0, h1, h2, h3), giving a filtered signal s' (ignoring samples carried over from previous frames). The zero-order autocorrelation coefficient of the filtered signal is the sum of the squares of its terms,

R'0 = s'0^2 + s'1^2 + ... + s'n-1^2,

normalized, i.e., divided by the number of terms (for a fixed frame length it is easy to omit the division). This is the power of the notionally filtered signal s', i.e., the power of that portion of the signal s lying within the passband of the conceptual filter.

Expanding, and ignoring the first few edge terms, R'0 is obtained as a combination of the autocorrelation coefficients Ri of the unfiltered signal, each weighted by a constant that determines the frequency band to which the measure responds. In fact, those weighting constants are the autocorrelation coefficients of the impulse response of the notional filter, so the above expression can be written simply as

M = R0·H0 + 2·(R1·H1 + R2·H2 + ... + RN·HN)     (Equation 1)

where N is the filter order and Hi is the (unnormalized) i-th autocorrelation coefficient of the filter's impulse response.

That is, the effect of filtering on the zero-order autocorrelation coefficient of a signal can be simulated by forming a weighted sum of the autocorrelation coefficients of the (unfiltered) signal, using the autocorrelation of the impulse response of the required filter as the weights.

Thus, a relatively simple calculation involving a small number of multiplication operations can simulate a digital filtering operation that would otherwise typically require on the order of a hundred multiplication operations.
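The weighted-autocorrelation trick described above can be sketched as follows. This is a minimal illustration of Equation 1, not the patent's implementation; the function names are my own.

```python
import numpy as np

def autocorr(x, nlags):
    """Unnormalized autocorrelation coefficients r[0..nlags] of a frame."""
    x = np.asarray(x, dtype=float)
    return np.array([np.dot(x[: len(x) - i], x[i:]) for i in range(nlags + 1)])

def filtered_power(signal, h):
    """Equation 1: power of the notionally filtered signal, computed from
    autocorrelations alone: M = R0*H0 + 2*(R1*H1 + ... + RN*HN)."""
    N = len(h) - 1                  # filter order
    R = autocorr(signal, N)         # signal autocorrelation coefficients
    H = autocorr(h, N)              # autocorrelation of the filter impulse response
    return R[0] * H[0] + 2.0 * float(np.dot(R[1:], H[1:]))
```

For a full (untruncated) convolution the identity is exact; the approximation noted in the text arises only when frame edges are ignored.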

This filtering operation can alternatively be viewed as a form of spectral comparison, in which the signal spectrum is matched against a reference spectrum (the inverse of the frequency response of the notional filter). Since the notional filter in this application is chosen to approximate the inverse of the noise spectrum, the operation can be seen as a spectral comparison between the speech and noise spectra, the generated zero-order autocorrelation coefficient (i.e., the energy of the notionally filtered signal) serving as a value that indicates the dissimilarity between the spectra. The Itakura-Saito distortion measure, used in LPC coding to evaluate the match between a predictor filter and the input spectrum, has one form as follows.

Here Ai denotes the i-th autocorrelation coefficient of the LPC parameter set. This turns out to be very similar to the relationship obtained above: the LPC coefficients are the taps of an FIR filter whose spectral response is the inverse of that of the input signal, so the LPC coefficient set can be considered the impulse response of an inverse LPC filter. In fact, the Itakura-Saito distortion measure is simply a form of Equation 1 in which the filter response H is derived from an all-pole model of the input signal.

In fact, by interchanging the roles of the two signals, using the LPC coefficients of the test spectrum and the autocorrelation coefficients of the reference spectrum, a different measure of spectral similarity is obtained.

The IS distortion measure is described in "Speech Coding based upon Vector Quantization" by A. Buzo, A. H. Gray, R. M. Gray and J. D. Markel, IEEE Trans. on ASSP, Vol. ASSP-28, No. 5, October 1980.

Since a frame of the signal has only a finite length and the number of terms is limited (N being the filter order), the above results are only approximate. Nevertheless, the measure indicates very well whether speech is present, and it is therefore used as the voice activity measure M. If the noise spectrum is known and the noise is stationary, it is quite possible to apply fixed coefficients h0, h1, etc. corresponding to an inverse noise filter.

However, a device that can adapt to different noise situations is even more beneficial.

FIG. 1 shows a first embodiment of the present invention, in which a signal s from a microphone (not shown) is received at an input 1 and converted by an analog-to-digital converter 2 into digital samples at a suitable sampling rate. An LPC analysis unit 3 (a conventional LPC coder) derives, for successive frames of n (e.g., 160) samples, a set of N (e.g., 8 or 12) LPC filter coefficients Li representing the input speech. The speech signal s is also supplied to a correlator unit 4; usually this is part of the LPC coder 3, since the autocorrelation vector is normally produced as a step in LPC analysis, so that no separate correlator need be provided. Correlator 4 produces the autocorrelation vector Ri of the input signal, comprising the zero-order coefficient R0 and at least two further autocorrelation coefficients R1, R2, R3. These are supplied to a multiplier unit 5.

A second input 11 is connected to a second microphone located away from the speaker, so that only background noise is received. The input from this microphone is converted into a sequence of digital samples by an A/D converter 12 and analyzed by an LPC analyzer 13. The "noise" LPC coefficients produced by analyzer 13 pass through a correlation unit 14; the autocorrelation vector generated there is multiplied term by term, in multiplier 5, with the autocorrelation coefficients Ri of the input signal from the speech microphone, and the weighted terms are summed by adder 6 according to Equation 1. This simulates a filter having the inverse of the shape of the noise spectrum from the noise-only microphone (which is essentially the shape of the noise component in the speech microphone signal), and thus filters out most of the noise. The resulting measure M is compared with a threshold value by a thresholder 7 to generate a logic output 8 indicating whether speech is present: when M is large, speech is considered to be present.
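The FIG. 1 signal path (LPC analysis of the noise channel, Equation-1 weighting of the speech channel's autocorrelation, thresholding) can be sketched as follows. This is a simplified illustration under my own naming; the Levinson-Durbin recursion stands in for the patent's unspecified LPC analysis units 3 and 13.

```python
import numpy as np

def autocorr(x, nlags):
    """Unnormalized autocorrelation coefficients r[0..nlags]."""
    x = np.asarray(x, dtype=float)
    return np.array([np.dot(x[: len(x) - i], x[i:]) for i in range(nlags + 1)])

def levinson_durbin(r, order):
    """LPC inverse-filter coefficients a (a[0] == 1) from autocorrelations r,
    plus the residual prediction error."""
    a = [1.0]
    err = float(r[0])
    for i in range(1, order + 1):
        acc = sum(a[j] * r[i - j] for j in range(i))
        k = -acc / err                  # reflection coefficient
        new_a = a + [0.0]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return np.array(a), err

def noise_inverse_measure(speech_frame, noise_frame, order=8):
    """FIG. 1 sketch: LPC-analyze the noise channel, use the autocorrelation
    Ai of the LPC (inverse-filter) coefficients as Equation-1 weights on the
    speech channel's autocorrelation Ri; a large M suggests speech."""
    rn = autocorr(noise_frame, order)
    lpc, _ = levinson_durbin(rn, order)
    A = autocorr(lpc, order)
    R = autocorr(speech_frame, order)
    return R[0] * A[0] + 2.0 * float(np.dot(R[1:], A[1:]))
```

A decision is then simply `noise_inverse_measure(frame, noise) > T` for some threshold T.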

This embodiment uses two microphones and two LPC analyzers, which adds cost and complexity.

Alternatively, another embodiment uses the autocorrelation coefficients from the noise microphone 11 and the corresponding values formed using the LPC coefficients from the main microphone 1; in that case a second autocorrelator is required instead of the second LPC analyzer.

Thus, these embodiments can work in different situations with different noise spectra, or in a given situation where the noise spectrum changes.

In the embodiment of FIG. 2, a buffer 15 is provided which stores a set of LPC coefficients (or a set of autocorrelation vectors); these values are obtained from the microphone input 1 during "non-speech" (i.e., noise-only) periods. The measure given by Equation 1 then corresponds to the Itakura-Saito distortion measure, with the difference that the stored LPC coefficients used are not those of the current frame but those of a frame matching the estimate of the inverse noise spectrum.

The LPC coefficient vector Li output by analyzer 3 is also directed to correlator 14, which generates the autocorrelation vector of the LPC coefficient vector. The buffer memory 15 is controlled by the speech/non-speech output of the thresholder 7: during "speech" frames the buffer retains the stored "noise" autocorrelation coefficients, while during "noise" frames the stored set is renewed. For example, a switch 16, through which the output of correlator 14 is connected to buffer 15, can be used to update the buffer with each new set of autocorrelation coefficients. Correlator 14 may alternatively be arranged after the buffer 15. Further, the speech/non-speech decision used for coefficient updating need not be taken from output 8, and can (preferably) be obtained in other ways.

As periods of silence occur frequently, the LPC coefficients stored in the buffer are updated from time to time, allowing the device to follow changes in the noise spectrum. If the noise spectrum is relatively stable over time (as is often the case), such buffer updates may be needed only rarely, or only at the initial operation of the detector; but in situations such as a moving vehicle radiotelephone, frequent updating is desirable.

As a variation on this embodiment, the system initially applies Equation 1 with coefficient terms corresponding to a simple fixed high-pass filter, and then begins adapting by switching to the LPC coefficients obtained during "noise" periods. If speech detection fails for some reason, the system can revert to the simple high-pass filter.

The above measure can be normalized by dividing by R0, so that the expression compared with the threshold becomes M/R0. This value is independent of the total signal power of the frame and therefore compensates for changes in overall signal level; however, it does not provide as great a contrast between "noise" and "speech" levels, and is therefore not preferred in noisy environments.

Since the noise spectrum changes only gradually (as discussed below), instead of using LPC analysis to obtain the inverse-filter coefficients of the noise signal (obtained from the noise microphone, or from noise-only periods, in the various embodiments described above), a model of the inverse noise spectrum can be generated using a conventional adaptive filter, with the relatively slow adaptation rate common to such filters. In an embodiment corresponding to FIG. 1, the LPC analysis unit 13 can simply be replaced by an adaptive filter (e.g., a transversal FIR or a lattice filter) connected so as to whiten the noise input, thereby forming a model of the inverse filter, whose coefficients are supplied to autocorrelator 14 as described above.

In the second embodiment shown in FIG. 2, the LPC analysis means 3 is likewise replaced by such an adaptive filter and the buffer means 15 is omitted; however, switch 16 operates to prevent the adaptive filter from adapting its coefficients during speech periods.

A second voice activity detector used in another embodiment of the present invention will now be described.

From the foregoing it is clear that the LPC coefficient vector is simply the impulse response of an FIR filter whose spectral response is the inverse of that of the input signal. When an Itakura-Saito distortion measure is formed between adjacent frames, the value is in effect the power of the current frame after filtering by the inverse LPC filter of the previous frame. Thus, if there is little difference between the spectra of adjacent frames, correspondingly little of the frame's spectral power escapes the filtering, and the value is small; conversely, a large spectral difference between frames produces a large Itakura-Saito distortion value. The value therefore reflects the similarity of the spectra of adjacent frames. For a speech coder it is desirable to maximize the frame length so as to minimize the data rate; that is, if the frame length is long enough, the speech signal shows significant spectral changes from frame to frame (if it did not, the coding would be redundant). Noise, on the other hand, has a spectral shape that changes only gradually from frame to frame, so during periods when no speech is present in the signal, the previous frame's inverse LPC filter removes ("filters out") most of the noise power, and the Itakura-Saito distortion values are correspondingly small.

For a noisy signal containing intermittent speech, the inter-frame Itakura-Saito distortion value is therefore generally greater during speech periods than during noise periods, and its variation (as indicated by its standard deviation) is large during speech and small during noise.

The standard deviation of M is itself a reliable measure; the effect of taking the standard deviation is essentially to smooth the value.

In this second form of the voice activity detector, the parameter used to determine whether speech is present is preferably the standard deviation of the Itakura-Saito distortion value, but other ways of measuring the variation, and other methods of measuring spectral distortion (for example, based on FFT analysis), can be applied.
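The second detector's inter-frame measure and its variation can be sketched as follows. The window length and the use of a plain standard deviation over recent frames are my assumptions; the patent specifies only that the standard deviation of the distortion is the preferred parameter.

```python
import numpy as np

def _ac(x, nlags):
    """Unnormalized autocorrelation coefficients r[0..nlags]."""
    x = np.asarray(x, dtype=float)
    return np.array([np.dot(x[: len(x) - i], x[i:]) for i in range(nlags + 1)])

def interframe_distortion(prev_inverse_filter, frame, order):
    """Itakura-Saito-style distortion between adjacent frames: energy of the
    current frame after filtering by the previous frame's inverse (LPC)
    filter, normalized by the frame energy R0."""
    R = _ac(frame, order)
    A = _ac(prev_inverse_filter, order)
    return (R[0] * A[0] + 2.0 * float(np.dot(R[1:], A[1:]))) / R[0]

def variation(measures, window=10):
    """Standard deviation of recent distortion values; a large value suggests
    speech (rapid spectral change), a small value suggests noise."""
    return float(np.std(measures[-window:]))
```

With `prev_inverse_filter = [1.0]` (no filtering) the distortion reduces to 1, i.e., the normalized frame energy.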

There are advantages in using an adaptive threshold for voice activity detection. Such a threshold should not be adjusted during periods of speech, lest the speech signal itself be thresholded out. It is therefore necessary to control the threshold adapter using a speech/non-speech control signal, which is preferably independent of the detector's own output. The threshold T is adjusted so that, when only noise is present, it is maintained at or above the level of the measure M. Since the measure generally varies randomly in the presence of noise, the threshold can be varied by determining the average level over many blocks and setting the threshold to a level proportional to this average. However, this is generally not sufficient in noisy situations, and it is better also to assess the degree of variation of the parameter over a number of blocks.

 Therefore, the threshold value T is calculated according to the following equation.

T = M' + K·d, where M' is the average of the measure over many consecutive frames, d is the standard deviation of the measure over those frames, and K is a constant (typically 2).
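The adaptive threshold T = M' + K·d can be sketched as follows. The window length and initial value are my assumptions; the gating on an independent speech/non-speech control signal follows the text above.

```python
import numpy as np
from collections import deque

class ThresholdAdapter:
    """Adaptive threshold T = M' + K*d over recent noise-only frames,
    adapted only while an independent control signal reports noise
    (window length and initial value are assumptions)."""
    def __init__(self, nframes=20, K=2.0, initial=1.0):
        self.history = deque(maxlen=nframes)
        self.K = K
        self.T = initial

    def update(self, M, speech_present):
        # Never adapt during speech, so the speech signal cannot raise
        # the threshold above itself and be thresholded out.
        if not speech_present:
            self.history.append(M)
            h = np.array(self.history)
            self.T = float(h.mean() + self.K * h.std())
        return self.T
```

In use, each frame's measure M is passed in together with the control circuit's speech/non-speech flag, and the returned T is supplied to the thresholder.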

In practice, adaptation should not restart immediately after the indication that speech is no longer present; rather, one should wait for confirmation that the measure has settled, to avoid repeated rapid switching between adapting and non-adapting states.

FIG. 3 shows a preferred embodiment of the present invention having the above-mentioned features. Input 1 receives the signal, which is sampled and digitized by an analog-to-digital converter (ADC) 2 and supplied to the input of an inverse-filter analyzer 3, which may in practice form part of the speech coder with which the voice activity detector operates, and which derives the coefficients Li (typically 8 in number) of a filter matching the inverse of the input signal spectrum. The digital signal is also fed to an autocorrelator 4 (which may be part of the analyzer 3), which generates the autocorrelation vector Ri of the input signal (or at least as many low-order terms as there are LPC coefficients). The operation of these parts of the device is as shown in FIGS. 1 and 2. The autocorrelation coefficients Ri are preferably averaged over several successive speech frames (each typically 5 to 20 ms), which improves their reliability. For this averaging, each set of autocorrelation coefficients output by autocorrelator 4 is stored in a buffer 4a, and an averager 4b generates a weighted sum of the current autocorrelation coefficients Ri and the coefficients from previous frames stored in, and supplied from, buffer 4a. The resulting averaged autocorrelation coefficients Rai are supplied to weighting and summing means 5, 6, which also receive, via buffer 15, the autocorrelation vector Ai of the inverse-filter coefficients Li stored from autocorrelator 14 during a noise period, and which form from Rai and Ai the measure M defined by Equation 1.
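The averaging performed by buffer 4a and averager 4b can be sketched as a weighted sum. The particular weighting (equal weight spread over the stored frames) is an assumption; the patent specifies only that a weighted sum of current and stored coefficients is formed.

```python
import numpy as np

def averaged_autocorr(current_R, stored_Rs, current_weight=0.5):
    """Averager 4b sketch: weighted sum of the current frame's autocorrelation
    vector and those from previous frames held in buffer 4a (the particular
    weights are an assumption)."""
    Ra = current_weight * np.asarray(current_R, dtype=float)
    if stored_Rs:
        w = (1.0 - current_weight) / len(stored_Rs)
        for prev in stored_Rs:
            Ra += w * np.asarray(prev, dtype=float)
    return Ra
```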

This measure is compared with a threshold value by a thresholder 7, and a logic result indicating whether speech is present is generated at output 8.

So that the inverse-filter coefficients Li match a good estimate of the inverse noise spectrum, it is desirable to update these coefficients during noise (and, of course, not during speech). However, if the speech/non-speech decision controlling the update were affected by the results of that update, a single incorrectly identified frame of signal could cause the voice activity detector to fall "out of lock" and misrecognize subsequent frames. A control signal generating circuit 20, i.e., a separate voice activity detector, is therefore provided, which forms an independent control signal indicating whether speech is present. This signal controls the inverse-filter analyzer 3 (or the buffer 15), so that the inverse-filter autocorrelation coefficients Ai used to form the measure M are updated only during "noise" periods. Circuit 20 includes an LPC analyzer 21 (which again may be part of the speech coder, and may in particular be realized by the analyzer 3), which analyzes the input signal to generate a set of LPC coefficients Mi, and an autocorrelator 21a (which may be realized by autocorrelator 3a), which obtains the autocorrelation coefficients Bi of Mi. If analyzer 21 is realized by analyzer 3, then Mi = Li and Bi = Ai. These autocorrelation coefficients are supplied to weighting and adding means 22, 23 (equivalent to 5, 6), which also receive the autocorrelation vector Ri of the input signal from autocorrelator 4. The spectral similarity between the current input frame and the previous frame is thereby calculated. As described above, this may be done by calculating the Itakura-Saito distortion value between the Ri of the current frame and the Bi of the previous frame, or by calculating the Itakura-Saito distortion value for the current frame's Ri and Bi and subtracting the corresponding value for the previous frame, stored in a buffer 24, to produce a spectral-difference signal (in each case the value is desirably divided by R0 for energy normalization). The buffer 24 is, of course, updated each frame. This spectral-difference signal, when compared with a threshold by thresholder 26, indicates whether speech is present, as described above. This method is good at discriminating unvoiced speech from noise (a task possible in conventional systems), but its ability to discriminate voiced speech from noise was found to be generally low. The circuit 20 therefore preferably also includes a voiced-speech detection circuit with a pitch analyzer 27 (which may in practice operate as part of the speech coder, in particular to measure the long-term predictor delay generated in a multi-pulse LPC coder). The pitch analyzer 27 generates a logic signal that is "true" when voiced speech is detected; this signal, and the output of thresholder 26 (which is generally "true" when unvoiced speech is present), are applied to the inputs of a NOR gate 28 to generate a signal that is "false" when speech is present and "true" when only noise is present. This signal is supplied to the buffer 15 (or to the inverse-filter analyzer 3), whereby the inverse-filter coefficients Li are updated only during noise periods.

A threshold adapter 29 is also connected to receive the non-speech control output of the control signal generation circuit 20; its output is supplied to the thresholder 7. The threshold adapter operates to increment or decrement the threshold, in steps proportional to the instantaneous threshold level, until the threshold approaches the noise power level (which is readily obtained, for example, from the weighting and adding means 22, 23). Preferably, when the input signal is very small, the threshold is automatically set to a low level, because at low signal levels the quantization of ADC 2 prevents the measure from producing reliable results.

In addition, a "hangover" generating means 30 is provided, which measures the period over which the thresholder 7 indicates speech and, when the presence of speech has been indicated for a period exceeding a predetermined time constant, holds its output high for a short "hangover" period afterwards. In this way, clipping in the middle of a low-level speech burst is avoided, while a suitable choice of time constant prevents short noise spikes erroneously indicated as speech from activating the hangover generator 30. Of course, all of the functions described above can be performed by a single, suitably programmed digital processing means, such as a digital signal processing (DSP) chip configured as part of an LPC codec (which is the preferred configuration), or a suitably programmed microcomputer or microcontroller chip with associated memory devices.
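The hangover behavior described above can be sketched as a small state machine. The specific frame counts are assumptions; the patent specifies only a time constant before the hangover is armed and a short hold afterwards.

```python
class HangoverGenerator:
    """Hangover means 30 sketch: after speech has been indicated for at least
    min_burst consecutive frames, hold the output high for `hangover` further
    frames; brief spikes never arm the hold (frame counts are assumptions)."""
    def __init__(self, min_burst=4, hangover=8):
        self.min_burst = min_burst
        self.hangover = hangover
        self.run = 0    # consecutive raw speech frames seen
        self.hold = 0   # remaining hangover frames

    def step(self, raw_speech_flag):
        if raw_speech_flag:
            self.run += 1
            if self.run >= self.min_burst:
                self.hold = self.hangover   # arm the hangover
            return True
        self.run = 0
        if self.hold > 0:
            self.hold -= 1
            return True
        return False
```

Because a short spike resets `run` before `min_burst` is reached, it passes through without triggering any hold, while a sustained burst keeps the output high briefly after the thresholder's raw decision drops.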

As described above, the voice activity detector can readily be implemented as part of an LPC codec. Alternatively, if the autocorrelation coefficients of the signal, or values related to them (such as the partial correlation or "parcor" coefficients), are transmitted to a remote station, voice activity detection can be performed remotely from the codec.

Continued from the front page: (31) Priority claim number 8820105.8; (32) Priority date August 24, 1988; (33) Priority claim country United Kingdom (GB). (72) Inventor: Boyd, Ivan, United Kingdom. (56) References: JP-A-62-111698 (JP, A); JP-A-62-150299 (JP, A); JP-A-59-115625 (JP, A). (58) Fields searched (Int. Cl.7, DB name): G10L 11/02, 15/04.

Claims (14)

    (57) [Claims]
  1. Apparatus comprising: (i) means for receiving a first input signal; (ii) means for periodically and adaptively generating a second signal representing an estimated noise signal component of the first signal; (iii) means for periodically forming, from the first and second signals, a measure of the spectral similarity between a portion of the input signal and the estimated noise signal component; and (iv) means for comparing the measure with a threshold value to produce an output indicating whether speech is present or absent; characterized in that (v) analysis means are provided operable to produce, for one of the input signal and the estimated noise signal component, filter coefficients having a spectral response which is the inverse of its frequency spectrum; and (vi) the means for forming the measure is operable to create a value proportional to the zero-order autocorrelation of the other of the input signal and the estimated noise signal component after filtering by a filter having those coefficients.
  2. Apparatus according to claim 1, characterized in that the means for forming the measure comprises means for calculating the autocorrelation coefficients Ai of the impulse response of the filter coefficients, means for calculating the autocorrelation coefficients Ri of the other of the input signal and the estimated noise signal component, and means connected to receive Ri and Ai for calculating the measure M from them.
  3. Apparatus according to claim 2, characterized in that the means for calculating the autocorrelation coefficients of the other of the input signal and the estimated noise signal component is arranged to calculate them from the autocorrelation coefficients of several successive portions of that signal.
  4. Apparatus according to claim 2 or 3, characterized in that M = R0·A0 + 2ΣRi·Ai, where Ai denotes the i-th autocorrelation coefficient of the impulse response of the filter.
  5. Apparatus according to claim 2 or 3, characterized in that M = (R0·A0 + 2ΣRi·Ai)/R0, where Ai denotes the i-th autocorrelation coefficient of the impulse response of the filter.
  6. Apparatus according to any one of claims 1 to 5, characterized in that said one of the input signal and the estimated noise signal component is the estimated noise signal component.
  7. Apparatus according to any one of the preceding claims, comprising a buffer connected to store the data from which the autocorrelation coefficients Ai of the filter response are obtained, the filter response being periodically calculated from the signal by LPC analysis means, the apparatus being connected and controlled such that the measure M is calculated using the stored data and such that the stored data are updated only during periods indicated as containing no speech.
  8. Apparatus according to claim 7, characterized in that the means for indicating that no speech is present, which controls the updating of the stored data, is a second voice activity detection means.
  9. Apparatus according to any one of claims 1 to 8, further comprising means for adjusting said threshold during periods when no speech is indicated.
  10. Apparatus according to claim 9, further comprising a second voice activity detection means arranged to inhibit adjustment of the threshold when speech is present.
  11. Apparatus according to claim 8 or 10, characterized in that said second voice activity detection means includes means for generating a measure of spectral similarity between a portion of the input signal and an earlier portion of the input signal.
  12. Apparatus for coding a speech signal, comprising an apparatus according to any one of claims 1 to 11.
  13. A mobile telephone apparatus comprising an apparatus according to any one of claims 1 to 12.
  14. A method of detecting voice activity in a first input signal, comprising: (a) periodically and adaptively generating a second signal representing an estimated noise signal component of the first signal; (b) periodically forming, from the first and second signals, a measure of the spectral similarity between a portion of the input signal and the estimated noise signal component; (c) comparing the measure with a threshold value to produce an output indicating whether speech is present or absent; and (d) creating, for one of the input signal and the estimated noise signal component, filter coefficients having a spectral response which is the inverse of its frequency spectrum; wherein (e) the measure is proportional to the zero-order autocorrelation of the other of the input signal and the estimated noise signal component after filtering by the filter having those coefficients.
JP50377289A 1988-03-11 1989-03-10 Voice operation characteristics detection Expired - Lifetime JP3321156B2 (en)

Priority Applications (6)

Application Number Priority Date Filing Date Title
GB888805795A GB8805795D0 (en) 1988-03-11 1988-03-11 Voice activity detector
GB8805795 1988-03-11
GB888813346A GB8813346D0 (en) 1988-06-06 1988-06-06 Voice activity detection
GB8813346.7 1988-06-06
GB8820105.8 1988-08-24
GB888820105A GB8820105D0 (en) 1988-08-24 1988-08-24 Voice activity detection

Publications (2)

Publication Number Publication Date
JPH03504283A JPH03504283A (en) 1991-09-19
JP3321156B2 true JP3321156B2 (en) 2002-09-03

Family

ID=27263821

Family Applications (2)

Application Number Title Priority Date Filing Date
JP50377289A Expired - Lifetime JP3321156B2 (en) 1988-03-11 1989-03-10 Voice operation characteristics detection
JP32819899A Expired - Lifetime JP3423906B2 (en) 1988-03-11 1999-11-18 Voice operation characteristic detection device and detection method

Family Applications After (1)

Application Number Title Priority Date Filing Date
JP32819899A Expired - Lifetime JP3423906B2 (en) 1988-03-11 1999-11-18 Voice operation characteristic detection device and detection method

Country Status (16)

Country Link
EP (2) EP0335521B1 (en)
JP (2) JP3321156B2 (en)
KR (1) KR0161258B1 (en)
AU (1) AU608432B2 (en)
BR (1) BR8907308A (en)
CA (1) CA1335003C (en)
DE (2) DE68910859T2 (en)
DK (1) DK175478B1 (en)
ES (2) ES2188588T3 (en)
FI (2) FI110726B (en)
HK (1) HK135896A (en)
IE (1) IE61863B1 (en)
NO (2) NO304858B1 (en)
NZ (1) NZ228290A (en)
PT (1) PT89978B (en)
WO (1) WO1989008910A1 (en)

Families Citing this family (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2643593B2 (en) * 1989-11-28 1997-08-20 日本電気株式会社 Voice / modem signal identification circuit
CA2040025A1 (en) * 1990-04-09 1991-10-10 Hideki Satoh Speech detection apparatus with influence of input level and noise reduced
US5241692A (en) * 1991-02-19 1993-08-31 Motorola, Inc. Interference reduction system for a speech recognition device
FR2697101B1 (en) * 1992-10-21 1994-11-25 Sextant Avionique Speech detection method.
SE470577B (en) * 1993-01-29 1994-09-19 Ericsson Telefon Ab L M Method and apparatus for encoding and / or decoding background sounds
JPH06332492A (en) * 1993-05-19 1994-12-02 Matsushita Electric Ind Co Ltd Method and device for voice detection
SE501305C2 (en) * 1993-05-26 1995-01-09 Ericsson Telefon Ab L M Method and apparatus for discriminating between stationary and non-stationary signals
EP0633658A3 (en) * 1993-07-06 1996-01-17 Hughes Aircraft Co Voice activated transmission coupled AGC circuit.
IN184794B (en) * 1993-09-14 2000-09-30 British Telecomm
SE501981C2 (en) * 1993-11-02 1995-07-03 Ericsson Telefon Ab L M Method and apparatus for discriminating between stationary and non-stationary signals
US5742734A (en) * 1994-08-10 1998-04-21 Qualcomm Incorporated Encoding rate selection in a variable rate vocoder
FR2727236B1 (en) * 1994-11-22 1996-12-27 Alcatel Mobile Comm France Detection of voice activity
GB2317084B (en) * 1995-04-28 2000-01-19 Northern Telecom Ltd Methods and apparatus for distinguishing speech intervals from noise intervals in audio signals
GB2306010A (en) * 1995-10-04 1997-04-23 Univ Wales Medicine A method of classifying signals
FR2739995B1 (en) * 1995-10-13 1997-12-12 Massaloux Dominique Method and device for creating comfort noise in a digital speech transmission system
US5794199A (en) * 1996-01-29 1998-08-11 Texas Instruments Incorporated Method and system for improved discontinuous speech transmission
EP0909442B1 (en) * 1996-07-03 2002-10-09 BRITISH TELECOMMUNICATIONS public limited company Voice activity detector
US6618701B2 (en) 1999-04-19 2003-09-09 Motorola, Inc. Method and system for noise suppression using external voice activity detection
DE10052626A1 (en) * 2000-10-24 2002-05-02 Alcatel Sa Adaptive noise level estimator
CN1617606A (en) * 2003-11-12 2005-05-18 皇家飞利浦电子股份有限公司 Method and device for transmitting non voice data in voice channel
US7155388B2 (en) * 2004-06-30 2006-12-26 Motorola, Inc. Method and apparatus for characterizing inhalation noise and calculating parameters based on the characterization
US7139701B2 (en) * 2004-06-30 2006-11-21 Motorola, Inc. Method for detecting and attenuating inhalation noise in a communication system
FI20045315A (en) * 2004-08-30 2006-03-01 Nokia Corp Detection of voice activity in an audio signal
US8708702B2 (en) * 2004-09-16 2014-04-29 Lena Foundation Systems and methods for learning using contextual feedback
US8775168B2 (en) 2006-08-10 2014-07-08 Stmicroelectronics Asia Pacific Pte, Ltd. Yule walker based low-complexity voice activity detector in noise suppression systems
US8175871B2 (en) 2007-09-28 2012-05-08 Qualcomm Incorporated Apparatus and method of noise and echo reduction in multiple microphone audio systems
US8954324B2 (en) 2007-09-28 2015-02-10 Qualcomm Incorporated Multiple microphone voice activity detector
US8223988B2 (en) 2008-01-29 2012-07-17 Qualcomm Incorporated Enhanced blind source separation algorithm for highly correlated mixtures
WO2009130388A1 (en) 2008-04-25 2009-10-29 Nokia Corporation Calibrating multiple microphones
US8244528B2 (en) 2008-04-25 2012-08-14 Nokia Corporation Method and apparatus for voice activity determination
US8275136B2 (en) 2008-04-25 2012-09-25 Nokia Corporation Electronic device speech enhancement
ES2371619B1 (en) * 2009-10-08 2012-08-08 Telefónica, S.A. Voice segment detection procedure.
KR20120091068A (en) * 2009-10-19 2012-08-17 텔레폰악티에볼라겟엘엠에릭슨(펍) Detector and method for voice activity detection

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3509281A (en) * 1966-09-29 1970-04-28 Ibm Voicing detection system
US4052568A (en) * 1976-04-23 1977-10-04 Communications Satellite Corporation Digital voice switch
US4358738A (en) * 1976-06-07 1982-11-09 Kahn Leonard R Signal presence determination method for use in a contaminated medium
JPS6244732B2 (en) * 1979-08-31 1987-09-22 Nippon Denki Kk
JPS6245730B2 (en) * 1982-12-22 1987-09-29 Nippon Electric Co
EP0127718B1 (en) * 1983-06-07 1987-03-18 International Business Machines Corporation Process for activity detection in a voice transmission system
JPH036689B2 (en) * 1984-10-17 1991-01-30 Sharp Kk

Also Published As

Publication number Publication date
EP0548054B1 (en) 2002-12-11
DK215690D0 (en) 1990-09-07
JP2000148172A (en) 2000-05-26
DK175478B1 (en) 2004-11-08
CA1335003C (en) 1995-03-28
EP0335521A1 (en) 1989-10-04
NO982568D0 (en) 1998-06-04
KR0161258B1 (en) 1999-03-20
AU3355489A (en) 1989-10-05
NZ228290A (en) 1992-01-29
FI115328B (en) 2005-04-15
NO903936D0 (en) 1990-09-10
NO304858B1 (en) 1999-02-22
AU608432B2 (en) 1991-03-28
KR900700993A (en) 1990-08-17
FI110726B (en) 2003-03-14
PT89978A (en) 1989-11-10
FI904410A0 (en) 1990-09-07
DE68929442D1 (en) 2003-01-23
FI20010933A (en) 2001-05-04
EP0335521B1 (en) 1993-11-24
NO316610B1 (en) 2004-03-08
DK215690A (en) 1990-09-07
EP0548054A3 (en) 1994-01-12
EP0548054A2 (en) 1993-06-23
HK135896A (en) 1996-08-02
JP3423906B2 (en) 2003-07-07
JPH03504283A (en) 1991-09-19
DE68929442T2 (en) 2003-10-02
NO982568L (en) 1990-11-09
ES2188588T3 (en) 2003-07-01
WO1989008910A1 (en) 1989-09-21
PT89978B (en) 1995-03-01
DE68910859T2 (en) 1994-12-08
IE61863B1 (en) 1994-11-30
FI110726B1 (en)
BR8907308A (en) 1991-03-19
IE890774L (en) 1989-09-11
ES2047664T3 (en) 1994-03-01
FI904410D0 (en)
DE68910859D1 (en) 1994-01-05
FI115328B1 (en)
NO903936L (en) 1990-11-09

Similar Documents

Publication Publication Date Title
Shrawankar et al. Techniques for feature extraction in speech recognition system: A comparative study
KR101613673B1 (en) Audio codec using noise synthesis during inactive phases
Gonzalez et al. PEFAC-a pitch estimation algorithm robust to high levels of noise
Ghosh et al. Robust voice activity detection using long-term signal variability
Tanyer et al. Voice activity detection in nonstationary noise
Gerkmann et al. Unbiased MMSE-based noise power estimation with low complexity and low tracking delay
JP5596039B2 (en) Method and apparatus for noise estimation in audio signals
Martin Noise power spectral density estimation based on optimal smoothing and minimum statistics
US10181327B2 (en) Speech gain quantization strategy
US5649055A (en) Voice activity detector for speech signals in variable background noise
CA2099655C (en) Speech encoding
McAulay et al. Sinusoidal Coding.
Hansen et al. Constrained iterative speech enhancement with application to speech recognition
KR100719650B1 (en) Endpointing of speech in a noisy signal
FI122273B (en) A method and apparatus for selecting an encoding rate in a variable rate vocoder
CA2034354C (en) Signal processing device
Tan et al. Low-complexity variable frame rate analysis for speech recognition and voice activity detection
US4628529A (en) Noise suppression system
JP4137634B2 (en) Voice communication system and method for handling lost frames
JP4764118B2 (en) Band expanding system, method and medium for band limited audio signal
US9953661B2 (en) Neural network voice activity detection employing running range normalization
ES2329046T3 (en) Procedure and device for improving voice in the presence of fund noise.
EP1326479B2 (en) Method and apparatus for noise reduction, particularly in hearing aids
JP3363336B2 (en) Frame speech determination method and apparatus
EP1208563B1 (en) Noisy acoustic signal enhancement

Legal Events

Date Code Title Description
S111 Request for change of ownership or part of ownership

Free format text: JAPANESE INTERMEDIATE CODE: R313113

R350 Written notification of registration of transfer

Free format text: JAPANESE INTERMEDIATE CODE: R350

R250 Receipt of annual fees

Free format text: JAPANESE INTERMEDIATE CODE: R250

R250 Receipt of annual fees

Free format text: JAPANESE INTERMEDIATE CODE: R250

R250 Receipt of annual fees

Free format text: JAPANESE INTERMEDIATE CODE: R250

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20080621

Year of fee payment: 6

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20090621

Year of fee payment: 7

EXPY Cancellation because of completion of term