JP2010061151A - Voice activity detector and validator for noisy environment

Info

Publication number: JP2010061151A
Application number: JP2009251650A
Authority: JP
Inventors: Douglas Ralph Ealey; イーリー，ダグラス・ラルフ; Holly Louise Kelleher; ケレハー，ホーリー・ルイーズ; David John Benjamin Pearce; ピアス，デイビッド・ジョン・ベンジャミン
Original assignee: Motorola Inc
Current assignee: Motorola Solutions Inc
Priority date: 2002-01-24
Filing date: 2009-11-02
Publication date: 2010-03-18
Also published as: KR20040075959A; KR100976082B1; FI20041013A; GB0201585D0; JP2005516247A; WO2003063138A1; CN1623186A; GB2384670B; CN1307613C; GB2384670A; FI124869B; KR20090127182A

Abstract

PROBLEM TO BE SOLVED: To provide an improved voice activity detector and validator for noisy environments.

SOLUTION: A communication unit 100 includes an audio processing unit 109 having a voice activity detection mechanism 130, 135. The voice activity detection mechanism 130, 135 measures the energy acceleration of a signal input to the communication unit 100 and determines, based on that measurement, whether the input signal is speech or noise. A method of detecting speech and a method of deciding whether an input signal is speech or noise are also described. Using an energy acceleration based voice activity detector and validator, particularly in noisy environments, provides the advantages of noise robustness, fast response and independence from the level of the input speech.

COPYRIGHT: (C)2010, JPO&INPIT

Description

[Field of the Invention]
The present invention relates to the detection of speech, commonly known as voice activity detection (VAD), in noisy environments. The invention is applicable to, but not limited to, energy acceleration measurement of speech signals in a speech detection system.

[Background of the Invention]
Many voice communication systems, such as the Global System for Mobile Communications (GSM) cellular telephony standard and the Terrestrial Trunked Radio (TETRA) system for private mobile radio users, use speech processing units to encode and decode speech patterns. In such systems, a speech encoder converts the analog speech pattern into a digital format suitable for transmission. A speech decoder converts a received digital speech signal into an audible analog speech pattern.

Methods and apparatus for detecting voice activity are known in the art. A voice activity detector (VAD) operates under the assumption that speech is present in only part of the audio signal. This assumption is usually correct, because many intervals of an audio signal contain only silence or background noise.

Voice activity detectors can be used for many purposes. These include suppressing overall transmission activity in a transmission system when no speech is present, thereby potentially saving power and channel bandwidth. When the VAD detects that voice activity has resumed, it can restart transmission activity.

A voice activity detector can also be used in conjunction with a speech storage device, by distinguishing audio portions that contain speech from those that are "speechless". The portions containing speech are then stored in the storage device and the "speechless" portions are discarded.

Conventional methods of detecting speech are based, at least in part, on detecting and evaluating the power of the speech signal. The estimated power is compared with either a constant or an adaptive threshold to decide whether the signal is speech. The main advantage of these methods is their low complexity, which makes them suitable for implementations with limited processing resources. Their main drawback is that background noise can cause "speech" to be detected when none is actually present. Conversely, speech that is present may go undetected because it is obscured by the background noise and is therefore hard to detect.

Some methods of detecting voice activity are aimed at noisy mobile environments and are based on adaptive filtering of the speech signal. This reduces the noise content of the signal before the final decision is made. The frequency spectrum and noise level may vary, because the method is used with different speakers and in different environments. The input filter and thresholds are therefore often adaptive, so as to track these variations.

Examples of these methods are given in GSM specification 06.42, Voice Activity Detector (VAD), for half-rate, full-rate and enhanced full-rate speech traffic channels respectively. Another such method is the "multi-boundary voice activity detection algorithm" proposed in ITU G.729 Annex B. These methods are more accurate in noisy environments but are considerably more complex to implement.

All of these methods require the speech signal as input. Some applications that employ speech decompression schemes need to perform speech detection during the decompression process.

European patent application No. EP-A-075419 by Benyassine et al. is directed to voice activity detection comprising the following steps:
(i) extracting a predetermined set of parameters from the incoming speech signal for each frame; and
(ii) making a frame voicing decision on the incoming speech signal for each frame according to a set of difference measures extracted from the predetermined set of parameters.

VADs in cellular systems are biased to ensure that, when a party speaks, the radio unit, including the speech codec, RF circuitry and so on, is active so that the speech is conveyed to the other party in the presence of background noise and other impairments. However, this bias leads to data being transmitted when the party is not speaking. This cost slightly reduces battery life and slightly increases interference to co-channel users in other cells of the system. These are essentially second-order (or higher-order) effects.

In these systems there is no concept of a finite resource being available for a duplex call. The resource is fully and constantly available to both the uplink and the downlink, which are normally on different carriers because each uses the full bandwidth simultaneously.

In the field of this invention, some voice activity or voice onset detectors (VAD/VOD) are known to attempt to use characteristics of speech, such as harmonic structure (for example via autocorrelation), to distinguish voiced speech. In noise, however, these structural indicators can fail, either because the speech structure is corrupted or because the noise itself has structure, for example vehicle engine, tire or air-conditioning noise. Finally, these methods are weak at detecting unvoiced speech.

An alternative is simply to use the frame energy level to detect speech. This is adequate for speech in high signal-to-noise ratio (SNR) conditions, where an arbitrary threshold above the noise level can be set to indicate speech. However, this approach fails in more realistic noise conditions.

For unnormalized databases, or in real-world applications, the noise level in one set of examples may well be greater than the speech level in another, which makes it impossible to set a threshold value. The conventional way of overcoming this is to average the first 100 milliseconds or so of an utterance, on the assumption that this portion represents noise, and so generate an ad hoc threshold for that utterance. Again, however, this is inadequate for non-stationary noise: the noise may diverge rapidly from the initial estimate, the noise may have a high variance, or the first few frames may actually contain speech rather than the noise being estimated.
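The conventional lead-in approach described above can be sketched roughly as follows. This is an illustrative Python sketch only, not code from any of the cited methods: the function name, the 10 ms frame length and the `margin` factor are assumptions introduced for the example.

```python
def adhoc_threshold_vad(frame_energies, frame_ms=10, lead_in_ms=100, margin=2.0):
    """Label frames as speech (True) or noise (False) with a per-utterance threshold.

    The first `lead_in_ms` of the utterance is assumed to be noise. Its mean
    energy, scaled by the hypothetical factor `margin`, becomes the threshold.
    """
    n_lead = max(1, lead_in_ms // frame_ms)
    lead_in = frame_energies[:n_lead]
    if not lead_in:
        return []
    noise_level = sum(lead_in) / len(lead_in)
    threshold = margin * noise_level
    return [energy > threshold for energy in frame_energies]
```

If the utterance begins with speech, the lead-in average is inflated and later speech frames at the same level fall below the threshold, which is the weakness noted above.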

Accordingly, a need exists for an improved voice activity detector and validator for noisy environments that alleviates the aforementioned drawbacks.

[Statement of Invention]
According to a first aspect of the present invention, there is provided a communication device as claimed in claim 1.

According to a second aspect of the present invention, there is provided a method of detecting a speech signal input to a communication device, as claimed in claim 11.

According to a third aspect of the present invention, there is provided a method of determining whether a signal input to a communication device is speech or noise, as claimed in claim 14.

Further aspects of the invention are as claimed in the dependent claims.

In summary, the present invention aims to address the case of non-stationary noise of arbitrary amplitude by using an energy acceleration measurement, rather than an energy amplitude measurement, to indicate the presence or absence of speech.

Exemplary embodiments of the present invention will now be described with reference to the accompanying drawings.

FIG. 1 shows a block diagram of a communication device adapted to perform voice activity detection and validation according to a preferred embodiment of the present invention.
FIG. 2 shows a flow chart of an energy acceleration based voice activity detector for noisy environments, according to a preferred embodiment of the present invention.
FIG. 3 shows a flow chart of energy acceleration based voice activity validation for noisy environments, according to a preferred embodiment of the present invention.
FIG. 4 illustrates the buffer operation according to a preferred embodiment of the present invention.

[Description of the Preferred Embodiment]
Voiced speech has a relatively high energy acceleration value, because its onset depends on the activity of the vocal cords, which are either vibrating or at rest. Likewise, unvoiced onsets (e.g., plosives) also have high energy acceleration values.

The inventors have found that, in representational domains that emphasize voicing, such as the narrow-band power spectrum or the mel spectrum, the resulting energy acceleration is significantly higher than that of non-stationary noise. The only notable exception is impulsive noise (e.g., a hand clap).

Accordingly, in a preferred embodiment of the present invention, the inventors have found that one can additionally discriminate against these noises by gathering the energy of the frequency region that is likely to contain the fundamental pitch of the speech signal. In particular, the inventors propose using a non-structural property of speech, namely energy acceleration (or the acceleration of some metric representing the speech energy or a component of it).

A particularly suitable application of the inventive concepts described herein is the Distributed Speech Recognition (DSR) standard currently being defined by the European Telecommunications Standards Institute (ETSI), namely "Speech Processing, Transmission and Quality Aspects (STQ); Distributed speech recognition; Front-end feature extraction algorithm; Compression algorithms" (ETSI ES 201 108 v1.1.2 (2000-04), April 2000).

Referring now to FIG. 1, there is shown an audio subscriber unit 100 adapted to support the inventive concepts of the preferred embodiment of the present invention.

The preferred embodiment is described with reference to a wireless audio communication unit, for example one capable of operating under the 3rd Generation Partnership Project (3GPP) standard for future cellular wireless communication systems and of providing DSR capability. However, it is within the contemplation of the invention that the inventive concepts described herein, relating to voice activity detection and its validation, are equally applicable to any electronic device that responds to speech signals and would benefit from improved voice activity detection circuitry.

As known in the art, the audio subscriber unit 100 includes an antenna 102, preferably coupled to a duplex filter, antenna switch or circulator 104 that provides isolation between the receive chain and the transmit chain within the audio subscriber unit 100.

The receiver chain includes receiver front-end circuitry 106 (effectively providing reception, filtering and intermediate or baseband frequency conversion). The front-end circuitry 106 is serially coupled to a signal processing function 108 (generally realized by a digital signal processor (DSP)). The signal processing function 108 performs signal demodulation, error correction and formatting. Recovered data from the signal processing function 108 is serially coupled to an audio processing function 109, which formats the received signal in a suitable manner and sends it to an audio enunciator/display 111.

In various embodiments of the invention, the signal processing function 108 and the audio processing function 109 may be provided within the same physical device. A controller 114 is configured to control the information flow and operational state of the components of the audio subscriber unit 100.

The transmit chain essentially includes an audio input device 120 coupled in series through the audio processing function 109, the signal processing function 108, transmitter/modulation circuitry 122 and a power amplifier 124. The signal processing function 108, transmitter/modulation circuitry 122 and power amplifier 124 are operationally responsive to the controller 114. The output of the power amplifier 124 is coupled to the duplex filter, antenna switch or circulator 104 and the antenna 102 in order to radiate the final radio frequency signal.

In particular, the audio processing function 109 includes a voice activity (or voice onset) detection (VAD) function 130 operably coupled to a voice activity decision function 135. In accordance with the preferred embodiment of the present invention, the VAD function 130 and the voice activity decision function 135 are adapted to provide an improved speech detection and decision mechanism, the operation of which is described further with respect to FIG. 2 and FIG. 3. Notably, the voice activity detector function 130 includes a frame-by-frame detection stage consisting of three measurements. The three measurements cover the following frequency ranges:
(i) the whole spectrum;
(ii) a sub-band of the spectrum; and
(iii) the variance of the spectrum.

The voice activity decision function 135 then makes a decision based on a buffer of measurements, which is analyzed for speech likelihood. The final decision from the decision stage is applied retrospectively to the earlier frames in the buffer.

In the preferred embodiment of the present invention, a timer/counter 118 is also adapted to perform the timing functions in the detection and decision processes of FIG. 2 and FIG. 3.

The signal processing function 108, audio processing function 109, VAD function 130 and voice activity decision function 135 may be implemented as discrete, operably coupled processing elements. Alternatively, one or more processors may be used to perform one or more of the corresponding processing operations. In yet another alternative embodiment, the aforementioned functions may be implemented as a mixture of hardware and software, or as firmware elements, using application specific integrated circuits (ASICs) and/or processors such as digital signal processors (DSPs).

Of course, the various components within the audio subscriber unit 100 can be realized in discrete or integrated component form, so the final structure is merely an arbitrary selection.

There are, accordingly, several ways of obtaining an indication of energy acceleration for use in the preferred embodiment of the present invention.

(i) The theoretically ideal method is literally to double-differentiate the energy level over successive frames of speech, as seen in previously published US patent application No. 6009391. The drawback of this approach is that it is likely to introduce delay, because a number of frames on either side of the frame under analysis must be analyzed.

(ii) A zero-delay estimate of the energy acceleration can be obtained by taking the ratio of the instantaneous value to a short-term average, using for example a frame average or a rolling average.

In each case, the method returns a value that can be interpreted as "deceleration" < 1 < "acceleration". One can then find empirical values for A^~ (in this specification, "X^~" denotes the symbol X with ^~ placed above it) and for the denominator length that best distinguish speech from noise.

The inventors have found that the preferred optimum is a denominator that can track non-stationary noise quickly but is too long to track a speech onset. The proposed sequence of weights for the rolling average is a = 0.2, b = 0.8 * a, c = 0.8 * b, and so on, which can be expressed simply as a recursion:

d_t = 0.2 x_t + 0.8 d_{t-1}   [3]

Therefore,

A = x_t / d_t   [4]
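Equations [3] and [4] can be illustrated with a short sketch. This is a minimal Python example under stated assumptions, not the patented implementation: the function name is hypothetical, the denominator is seeded with the first input (the text does not say how d is initialized), and a ratio of 1.0 is returned when the denominator is zero.

```python
def acceleration_ratios(frame_energies, weight=0.2):
    """Return A_t = x_t / d_t for each frame, following equations [3] and [4]."""
    ratios = []
    d = None
    for x in frame_energies:
        # Equation [3]: d_t = 0.2 * x_t + 0.8 * d_{t-1}.
        # Seeding d with the first input is an assumption of this sketch.
        d = x if d is None else weight * x + (1.0 - weight) * d
        # Equation [4]: A = x_t / d_t (1.0 for a zero denominator, by assumption).
        ratios.append(x / d if d > 0 else 1.0)
    return ratios
```

With a steady input the ratio stays at 1; a sudden rise in energy pushes it above 1 ("acceleration") and the following drop pushes it below 1 ("deceleration"), with no look-ahead frames required.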

The preferred VAD and parameter initialization system within the detection stage is summarized in the flow chart of FIG. 2. In non-stationary noise, a long-term energy threshold is not a reliable indicator of speech. Similarly, in high-noise conditions, the structure of speech (e.g., its harmonics) cannot be wholly relied upon as an indicator, because the harmonics may be corrupted by noise or structured noise may confuse the detector. The preferred voice activity detector therefore uses a noise-robust characteristic of speech, namely the energy acceleration associated with voice onset.

Referring now to FIG. 2, a flow chart 200 of the preferred detection process is shown. As indicated above, this detection process involves frame-by-frame analysis. The preferred VAD mechanism relates to the "whole-spectrum" measurement process. As shown in step 205, the frame counter is first evaluated to determine whether it is less than "N", where "N" defines the number of frames to be buffered. As an example of the preferred embodiment, assuming a frame increment of, say, 10 milliseconds, "N" is set to 15. If the frame counter is less than "N" in step 205, the rolling average for the initial acceleration test is updated, as shown in step 210. If the frame counter is not less than "N" in step 205, step 210 is skipped.

A determination is then made, as shown in step 235, as to whether the energy acceleration measurement is within one or more specified margins. If it is, the rolling average is updated with the result of a further energy acceleration test, as in step 240. If it is not, step 240 is skipped.

A determination is then made, as shown in step 260, as to whether the energy acceleration measurement is greater than a specified threshold. If it is, the frame is deemed a speech frame, as in step 265. If it is not, the frame is deemed a noise frame, as in step 270.

The frame counter is then incremented, as in step 275, and the process repeats from step 205.

As a refinement to this process, the sub-region measurement process shown in optional steps 215 and 245 may be performed instead of, or in addition to, the whole-spectrum measurement process. A particular sub-region of the spectrum is selected because it is the one most likely to contain the fundamental pitch.

In the sub-region measurement process, once the rolling average for the initial acceleration test has been updated in step 210 of the whole-spectrum measurement, a determination is made, as shown in step 220, as to whether the energy acceleration measurement is greater than a threshold value. If it is, the process of initializing other parameters is suspended, as shown in step 225. If it is not, the initialization of the other parameters is updated, as in step 230. The process then returns to step 235, as shown.

In step 235, following the above determination, a further preferred determination is made as to whether the energy acceleration measurement is within one or more specified margins. In step 250, the deceleration value is evaluated to determine whether it is "high"; if so, the rolling average for the energy acceleration test is updated slowly, as shown in step 255. The process then returns to the whole-spectrum method at step 260.

In this way, the generally higher signal-to-noise ratio (SNR) of the sub-region detector makes it very robust to noise. However, it is vulnerable to adverse changes in microphone and speaker, and to band-limited noise. It should therefore not be relied upon in all circumstances. Accordingly, the preferred embodiment of the present invention incorporates the sub-region detector to augment the whole-spectrum measurement.

A further measurement process is preferably performed using, for example, the "acceleration" of the variance of the values within the lower half of the spectrum of each frame. The variance measurement detects structure within the lower half of the spectrum, which makes it very sensitive to voiced speech. The variance measurement follows the sub-region process approach, with the lower half of the spectrum as the particular sub-region selected. It further complements the whole-spectrum measurement approach, which is better able to detect unvoiced and plosive speech.

All three measurements take their raw input from the spectral representation of the filter gain generated by the first stage of a double Wiener filter, as described in US patent application No. 09/427497, filed by Motorola with Yan-Ming Chen as inventor. As mentioned above, each measurement uses a different aspect of this data.

In particular, the whole-spectrum detector uses the known mel-filtered spectral representation of the filter gain generated by the first stage of the double Wiener filter. A single input value is obtained by squaring the sum of the mel filter banks.
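As a rough illustration of how the raw inputs described above might be formed, the following Python sketch computes the whole-spectrum input (the squared sum of the mel filter bank values) and the variance of the lower half of a frame's spectrum. The function names are hypothetical, and the use of the population variance is an assumption, since the text does not specify the variance estimator. The sketch computes only the raw variance, not its "acceleration".

```python
from statistics import pvariance


def whole_spectrum_input(mel_band_gains):
    """Single input value: the square of the sum of the mel filter bank values."""
    total = sum(mel_band_gains)
    return total * total


def lower_half_variance(spectrum_values):
    """Variance of the values in the lower half of one frame's spectrum.

    Population variance is assumed here; an empty lower half yields 0.0.
    """
    half = spectrum_values[: len(spectrum_values) // 2]
    return pvariance(half) if half else 0.0
```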

In the preferred embodiment of the present invention, the whole-spectrum detector applies the following process to every frame.

Step 1 initializes the noise estimate tracker as follows:

If Frame < 15 and Acceleration < 2.5, then Tracker = MAX(Tracker, Input)

The energy acceleration measurement prevents the tracker from being updated if speech occurs within the 15-frame lead-in time.

Step 2 updates the tracker value if the current input is similar to the noise estimate:

If Input < Tracker * UpperLimit and Input > Tracker * LowerLimit, then
Tracker = a * Tracker + (1 - a) * Input

Step 3 provides a fail-safe mechanism for those instances where speech or an uncharacteristically large noise component is present in the first few frames. It allows the resulting erroneously high noise estimate to decay. Step 3 preferably functions as follows:

If Input < Tracker * Floor, then
Tracker = b * Tracker + (1 - b) * Input

Step 4 returns a "true" speech decision if the current input is more than 165% of the tracker value:

If Input > Tracker * Threshold, then output True; otherwise output False.

The ratio of the instantaneous input to the short-term average tracker is a function of the energy acceleration of successive inputs.

Here, in the above:
a = 0.8 and b = 0.97;
the upper limit is 150% and the lower limit is 75%;
the floor is 50%; and
the threshold is 165%.
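
The four steps of the whole-spectrum detector can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the class name `WholeSpectrumTracker`, a frame counter that starts at 1, a tracker that starts at zero, and passing the energy acceleration in as an argument are all choices made for this example only.

```python
from dataclasses import dataclass


@dataclass
class WholeSpectrumTracker:
    """Hypothetical sketch of Steps 1 to 4; parameter values are those quoted in the text."""
    a: float = 0.8            # Step 2 smoothing factor
    b: float = 0.97           # Step 3 (fail-safe) smoothing factor
    upper: float = 1.50       # upper limit, 150%
    lower: float = 0.75       # lower limit, 75%
    floor: float = 0.50       # floor, 50%
    threshold: float = 1.65   # threshold, 165%
    lead_in: int = 15         # lead-in time in frames
    accel_limit: float = 2.5  # acceleration bound used during lead-in
    tracker: float = 0.0      # assumption: noise estimate starts at zero
    frame: int = 0            # assumption: first processed frame is frame 1

    def update(self, value: float, acceleration: float) -> bool:
        """Process one frame and return the per-frame speech decision."""
        self.frame += 1
        # Step 1: initialise the noise estimate during the lead-in,
        # unless the acceleration suggests that speech is present.
        if self.frame < self.lead_in and acceleration < self.accel_limit:
            self.tracker = max(self.tracker, value)
        # Step 2: adapt when the input resembles the current noise estimate.
        if self.tracker * self.lower < value < self.tracker * self.upper:
            self.tracker = self.a * self.tracker + (1 - self.a) * value
        # Step 3: fail-safe slow decay when the input is well below the estimate.
        if value < self.tracker * self.floor:
            self.tracker = self.b * self.tracker + (1 - self.b) * value
        # Step 4: speech decision.
        return value > self.tracker * self.threshold
```

With a steady input the tracker settles on that level and every frame is classified as noise; a later frame more than 1.65 times the tracker is classified as speech and leaves the tracker unchanged, since it lies outside both update regions.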

Note that no update occurs if the value is greater than the upper limit, or if it lies between the lower limit and the floor. Furthermore, as indicated above, the energy acceleration input can be calculated either:
by double differentiation of successive inputs; or
by estimation, tracking the ratio of two rolling averages of the input.

In particular, the ratio of a fast-adapting rolling average to a slow-adapting rolling average reflects the energy acceleration of successive inputs.

As an example, the contributions to the averages used above are:
(i) 0 * average + 1 * input, and
(ii) ((frame - 1) * average + 1 * input) / frame,
which make the energy acceleration measurement increasingly sensitive over the first 15 frames.
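
As a rough illustration of the two ways of obtaining the energy acceleration, the following Python sketch tracks the ratio of the fast average (i) to the slow average (ii), and also shows a plain second difference. The names `AccelerationEstimator` and `second_difference`, the direction of the ratio (fast over slow), and the value returned when the slow average is zero are assumptions made for this example.

```python
from typing import Sequence


class AccelerationEstimator:
    """Hypothetical sketch: ratio of a fast to a slow rolling average."""

    def __init__(self) -> None:
        self.frame = 0
        self.fast = 0.0
        self.slow = 0.0

    def update(self, value: float) -> float:
        self.frame += 1
        # (i) fast average: 0 * average + 1 * input (the instantaneous value)
        self.fast = 0 * self.fast + 1 * value
        # (ii) slow average: ((frame - 1) * average + 1 * input) / frame
        self.slow = ((self.frame - 1) * self.slow + value) / self.frame
        # Assumption: report "no acceleration" (1.0) if the slow average is zero.
        return self.fast / self.slow if self.slow > 0 else 1.0


def second_difference(values: Sequence[float]) -> float:
    """Discrete double differentiation of the three most recent inputs."""
    return values[-1] - 2 * values[-2] + values[-3]
```

For the inputs 1, 1, 1, 4 the ratio stays at 1 for the first three frames and rises to 4 / 1.75 on the fourth, because the slow average has only moved to 1.75.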

The sub-band detector preferably uses the average of the second, third and fourth Mel filter banks derived for the whole-spectrum measurement. The detector then applies the following process to every frame.

(i) input = p * current input + (1 - p) * previous input
(ii) If frame < 15, then
tracker = MAX(tracker, input)
(iii) If input < tracker * upper limit and input > tracker * lower limit, then
tracker = a * tracker + (1 - a) * input
(iv) If input < tracker * floor, then
tracker = b * tracker + (1 - b) * input
(v) If input > tracker * threshold, output true; otherwise output false.

Here, for the sub-region measurement, p = 0.75.

All other parameters are the same as for the whole-spectrum measurement, except for the threshold, which is equal to 3.25.
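
The sub-band process (i) to (v) might be sketched as below. The names `subband_input` and `SubbandDetector` are hypothetical, and two points the text leaves open are resolved by assumption: "previous input" is taken to be the previous smoothed value, and the first frame is used unsmoothed.

```python
from typing import Optional, Sequence


def subband_input(mel_bands: Sequence[float]) -> float:
    """Average of the second, third and fourth Mel filter banks (indices 1 to 3)."""
    return sum(mel_bands[1:4]) / 3.0


class SubbandDetector:
    """Hypothetical sketch of sub-band steps (i) to (v)."""

    def __init__(self, p: float = 0.75, a: float = 0.8, b: float = 0.97,
                 upper: float = 1.5, lower: float = 0.75, floor: float = 0.5,
                 threshold: float = 3.25, lead_in: int = 15) -> None:
        self.p, self.a, self.b = p, a, b
        self.upper, self.lower, self.floor = upper, lower, floor
        self.threshold, self.lead_in = threshold, lead_in
        self.tracker = 0.0
        self.smoothed: Optional[float] = None
        self.frame = 0  # assumption: first processed frame is frame 1

    def update(self, mel_bands: Sequence[float]) -> bool:
        self.frame += 1
        current = subband_input(mel_bands)
        # (i) smooth the input (assumption: recursive on the smoothed value)
        if self.smoothed is None:
            self.smoothed = current
        else:
            self.smoothed = self.p * current + (1 - self.p) * self.smoothed
        x = self.smoothed
        # (ii) lead-in initialisation (no acceleration condition here)
        if self.frame < self.lead_in:
            self.tracker = max(self.tracker, x)
        # (iii) adapt when the input resembles the noise estimate
        if self.tracker * self.lower < x < self.tracker * self.upper:
            self.tracker = self.a * self.tracker + (1 - self.a) * x
        # (iv) fail-safe decay
        if x < self.tracker * self.floor:
            self.tracker = self.b * self.tracker + (1 - self.b) * x
        # (v) decision with the sub-band threshold of 3.25
        return x > self.tracker * self.threshold
```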

For the spectral variance measurement, the variance of the values comprising the lower-frequency half of the narrowband spectral representation of the gain for each frame is used as the input. The detector then applies exactly the same process as for the whole-spectrum measurement.

The variance is calculated as follows:

where
N = FFT length / 4, and
w_i are the values of the narrowband spectral representation of the gain.
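
The variance equation itself is not reproduced in this text. As a hedged illustration only, the sketch below assumes the ordinary population variance over the first N = FFT length / 4 gain values; the function name `low_half_variance` and the choice of population rather than sample variance are assumptions, not taken from the source.

```python
from statistics import pvariance
from typing import Sequence


def low_half_variance(gains: Sequence[float], fft_length: int) -> float:
    """Population variance of the first N = fft_length / 4 gain values.

    Assumption: `gains` holds the narrowband spectral gain values w_i in
    ascending frequency order, so the first N values form the lower half
    of the spectrum.
    """
    n = fft_length // 4
    if n < 1 or len(gains) < n:
        raise ValueError("need at least fft_length / 4 gain values")
    return pvariance(gains[:n])
```

For example, with an FFT length of 16 only the first four gain values contribute, so `low_half_variance([1.0, 2.0, 3.0, 4.0, 9.0, 9.0, 9.0, 9.0], 16)` evaluates the variance of 1, 2, 3, 4.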

In accordance with the preferred embodiment of the present invention, the three measurements described above are provided to a VAD decision algorithm, as shown in the flow chart of FIG. 3. Successive inputs are fed to a buffer, which provides contextual analysis. This introduces a frame delay equal to the length of the buffer minus one frame.

Referring now to FIG. 3, a flow chart 300 of an acceleration-based voice activity validation process for noisy environments is shown in accordance with a preferred embodiment of the present invention.

For a buffer of N = 7 frames, the most recent true/false speech input is stored at position N in the data buffer, as shown in step 305. The decision logic applies a number of the following steps, and preferably every one of them.

Step 1:
V_N = measurement 1 OR measurement 2 OR measurement 3
The input V_N is defined as "true" if any of the three measurements returns a true speech indication.

Step 2:

The algorithm searches for the longest continuous sequence of "true" values in the buffer, as in step 310. Thus, for example, for the sequence "T T F T T T F", M would equal 3.

Step 3:
If M ≥ S_P and T < L_S, then T = L_S
Here, S_P corresponds to the first threshold in step 315. In step 315, if the longest sequence of true (T) speech values equals or exceeds the first threshold, for example S_P = 3 or more consecutive "true" values, the buffer is judged to contain "possible" speech. In step 320, if that timer value is not already set (or exceeded), a short timer (time 1) of, for example, L_S = 5 frames is activated in step 325.

Step 4:
If M ≥ S_L and F > F_S, then T = L_M;
otherwise, if M ≥ S_L, then T = L_L.

Here, S_L corresponds to the second threshold in step 330. If there are S_L = 4 or more consecutive "true" values, the buffer is judged to contain "likely" speech. If the current frame F is outside the initial lead-in safety period F_S, as determined in step 335, a medium timer T of, for example, L_M = 22 frames is activated in step 340. Otherwise, a fail-safe long timer T of, for example, L_L = 40 frames is used in step 345. This arrangement is used because the early presence of speech in an utterance can cause the VAD's initial noise estimate to be too high.

Step 5:
If M < S_P and T > 0, then T--
If the process determines in step 350 that there are fewer than S_P = 3 consecutive "true" values, and the timer is greater than zero in step 355, the timer is decremented in step 360.

Step 6:
If T > 0, output "true"; otherwise output "false".

In step 365, if the timer is greater than zero, the process outputs a "true" speech decision, as shown in step 370. Otherwise, if the timer is not greater than zero, the process outputs a "noise" decision, as shown in step 375.

Step 7:
Frame++, shift the buffer left, and return to Step 1.

In preparation for the next frame, in step 380 the buffer is shifted left to accept the next input, as shown in FIG. 4. The output speech decision is applied to the frame being ejected from the buffer. In step 305, the process repeats for the next true/false input to the data buffer.
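
Steps 1 to 7 can be sketched as a small state machine. This is an illustrative reading of the flow chart rather than a definitive implementation: the names `longest_true_run` and `VadDecision` are hypothetical, the frame counter is assumed to start at 1, and because this section gives no value for the lead-in safety period F_S it is a required argument (`safety_frames`) with no default.

```python
from collections import deque
from typing import Iterable


def longest_true_run(flags: Iterable[bool]) -> int:
    """Step 2: length M of the longest run of consecutive True values."""
    best = run = 0
    for flag in flags:
        run = run + 1 if flag else 0
        best = max(best, run)
    return best


class VadDecision:
    """Hypothetical sketch of the buffer-and-timer decision logic."""

    def __init__(self, safety_frames: int, n: int = 7, s_p: int = 3,
                 s_l: int = 4, l_s: int = 5, l_m: int = 22, l_l: int = 40) -> None:
        if n < s_l:
            raise ValueError("buffer length N must satisfy N >= S_L")
        self.safety_frames = safety_frames  # F_S, not specified in this section
        self.s_p, self.s_l = s_p, s_l
        self.l_s, self.l_m, self.l_l = l_s, l_m, l_l
        # A deque with maxlen drops its oldest entry on append, which
        # models "shift left and store the new input at position N".
        self.buffer = deque([False] * n, maxlen=n)
        self.timer = 0  # T
        self.frame = 0  # F

    def push(self, m1: bool, m2: bool, m3: bool) -> bool:
        """Feed the three per-frame measurements; return the speech decision.

        Per the text, the decision applies to the frame leaving the buffer.
        """
        self.frame += 1
        self.buffer.append(m1 or m2 or m3)          # Step 1 (and Step 7 shift)
        m = longest_true_run(self.buffer)           # Step 2
        if m >= self.s_p and self.timer < self.l_s:  # Step 3: possible speech
            self.timer = self.l_s
        if m >= self.s_l:                            # Step 4: likely speech
            self.timer = self.l_m if self.frame > self.safety_frames else self.l_l
        if m < self.s_p and self.timer > 0:          # Step 5: hangover countdown
            self.timer -= 1
        return self.timer > 0                        # Step 6
```

With three consecutive "true" inputs followed by "false" inputs, the decision in this sketch stays "true" while the run of three remains in the buffer and then for the remainder of the 5-frame short timer.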

It is within the contemplation of the invention that alternative mechanisms for making the speech-or-noise decision can be implemented based on the energy acceleration process described above. For example, the decision mechanism need not be based on one or more timers, and may simply determine whether one or more energy acceleration thresholds have been exceeded.

Referring now to FIG. 4, an example of a buffer operation 400 in accordance with a preferred embodiment of the present invention is shown in more detail. Assume that the first threshold is set to three consecutive "true" values. Assume also that at time "t" 410, only the current input (frame #7) 425 and the preceding input (frame #6) 420 were "true". Thus, when the buffer is shifted, the first frame (frame #1) 415 is marked "false".

At time "t+1" 430, a third "true" input (frame #8) 450 is received, supplementing the two earlier "true" inputs 440, 445. Thus, when the buffer is shifted, the next output frame (frame #2) 435 is marked "true".

It should be noted that the only constraints in the above decision process are:

(i) time 1 < time 2 < time 3, and
(ii) threshold 1 < threshold 2.

Assuming that only these three inputs (frame #6, frame #7 and frame #8) are "true", the full output sequence is as follows.

Here, frames #2 to #5 indicate "true" owing to the buffer lead-in function. Frames #6 to #8 indicate "true" as the positions of the actual original "true" speech inputs. Frames #9 to #12 indicate "true" owing to the buffer lead-out function. Frames #13 to #18 indicate "true" in response to the timer hangover used. Once all the frames in the utterance have been input, the buffer shifts in "false" inputs (frames #19 to #L_M) until it is empty.

It is within the contemplation of the invention that the buffer length and hangover timers can be adjusted dynamically to suit the requirements of the audio communication device. Accordingly, the preferred embodiment, which uses a buffer length "N" of 8 and a hangover timer of 5 frames, is described for illustrative purposes only. It should be noted, however, that the buffer length "N" should always be chosen such that N ≥ S_L.

In addition to its use as a VAD in its own right, it is within the contemplation of the invention that the energy acceleration measurement performed in the method steps of FIG. 2 can be used to validate the initialization of other parameters. For example, a spectral subtraction scheme requires an initial estimate of the noise based on the first 10 frames of audio (typically 100 milliseconds). Even in stationary noise, several events can occur that invalidate the initial estimate. Examples of such events include the following.

(a) Signal ramp-up:
For a variety of possible reasons, the very start of a recording may "ramp up" to full volume within the period under evaluation. Reasons for such a ramp-up include buffer filling in digital systems, capacitance, or tape-head engagement in analog systems. The effect of such an event would be to invalidate the estimate. The energy acceleration measurement can therefore be used to detect such a ramp-up and thereby prevent the error.

(b) Spike in the initial signal:
A common "spike" occurs on full depression of the "press-to-talk" (PTT) button on a subscriber radio unit, where electrical contact slightly precedes the button striking the back of the switch. When such an event occurs, the energy acceleration measurement described above can be used to abort the estimation process, as shown in step 225 of FIG. 2.

(c) Speech in the initial signal:
Another common occurrence, particularly with PTT systems, is for the user to begin speaking as soon as the PTT button is pressed. In this case, electrical contact is made after speech has begun. The energy acceleration measurement can identify this and accordingly abort the noise-based initialization, or force the use of a default estimate, as shown in step 225 of FIG. 2.
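
The validation idea in events (a) to (c) might look like the following sketch, which averages the first 10 frame energies but falls back to a default estimate as soon as the energy acceleration exceeds a limit. The function name `initial_noise_estimate`, the reuse of the 2.5 acceleration bound from the tracker initialization, and the choice to return the default rather than restart the estimate are all assumptions made for illustration.

```python
from typing import Sequence


def initial_noise_estimate(energies: Sequence[float], default_estimate: float,
                           n_init: int = 10, accel_limit: float = 2.5) -> float:
    """Mean energy of the first n_init frames, or the default if invalidated.

    Acceleration is approximated as the ratio of the instantaneous energy
    to the running mean (which includes the current frame).
    """
    window = list(energies[:n_init])
    if not window:
        return default_estimate
    total = 0.0
    for count, energy in enumerate(window, start=1):
        total += energy
        running_mean = total / count
        acceleration = energy / running_mean if running_mean > 0 else 1.0
        if acceleration > accel_limit:
            # Ramp-up, spike or early speech: abort and force the default.
            return default_estimate
    return total / len(window)
```

Because the running mean includes the current frame, the ratio can never exceed the frame count, so an event in the first two frames cannot trip the 2.5 limit in this sketch.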

In summary, a communication device has been described that includes an audio processing device having a voice activity detection mechanism. The voice activity detection mechanism provides an indication of the energy acceleration of a signal input to the communication device, and determines whether the input signal is speech or noise based on that indication.

In addition, a method of detecting a speech signal input to a communication device has been described. The method includes the steps of indicating the acceleration of a signal input to the communication device, and determining whether the input signal is speech or noise based on the indicating step.

Furthermore, a method of determining whether a signal input to a communication device is speech or noise has been described. The method includes the step of determining whether the input signal is speech or noise based on energy acceleration, for example using a frame average or rolling average of a number of input signals.

Accordingly, the energy acceleration based voice activity detector and validator for noisy environments described above provides the advantages of noise robustness and fast response. Because the preferred embodiment uses measurements that depend on energy acceleration rather than absolute measurements, the inventive concepts described herein can be applied to speech at any input level.

Although specific and preferred implementations of embodiments of the present invention have been described above, it is clear that a person skilled in the art could readily apply variations and modifications of such inventive concepts that fall within the scope of the invention.

Thus, an improved voice activity detector and validator for noisy environments has been described, in which the aforementioned disadvantages associated with prior art arrangements have been substantially alleviated.

Claims

A communication device (100) comprising an audio processing device (109) having a voice activity detection mechanism (130, 135),
The voice activity detection mechanism (130, 135) is configured to measure an energy acceleration of a signal input to the communication device (100) by calculating a ratio between an average energy value and an instantaneous energy value, and to determine frame by frame, based on the measurement, whether the input signal is speech or noise;
The communication device (100), wherein an input frame is determined to be a speech frame if the energy acceleration measurement yields an energy acceleration value greater than an energy acceleration threshold (265).

The communication device (100) according to claim 1, wherein the voice activity detection mechanism includes a voice activity detector function (130) configured to perform per-frame detection of speech on a signal input to the voice activity detection mechanism (130, 135).

The communication device (100) described, wherein the frame-by-frame detection includes one or more of: (i) performing an energy acceleration measurement over the entire spectrum of a signal input to the voice activity detection mechanism (130, 135); (ii) performing the energy acceleration measurement over a subband of the spectrum; and (iii) performing the energy acceleration measurement by using the acceleration of the variance of values within the low frequency half of the spectrum of each frame.

The communication device (100) of claim 3, wherein the voice activity detection mechanism includes a voice activity determination function (135) operatively coupled to the voice activity detector function (130) and configured to determine whether the input signal is speech based on a buffering operation of input frames of the input signal into a buffer and on the one or more energy acceleration measurements;
The voice activity determination function (135) is further configured to assign a true or false indication to each of the buffered input frames in the buffer, a true indication being assigned if any one of the one or more energy acceleration measurements for the input frame returns a speech indication, and the voice activity determination function (135) is further configured to determine that the input signal in the buffer is speech if the indication assigned to each of a series of buffered input frames in the buffer is true.

The communication device (100) according to any of claims 1 to 4, wherein the voice activity detection mechanism is configured to measure energy acceleration using a frame average or rolling average of a number of the input signals.

The energy acceleration is estimated by tracking the ratio of the two rolling averages of the input signal using (0 × average + 1 × input) and ((Frame-1) × average + 1 × input) / Frame, respectively. The communication device (100) according to any one of claims 1 to 4, wherein Frame represents a value of a frame counter.

The estimation of the energy acceleration using the frame average is

The communication device (100) according to claim 5, wherein:

If the energy acceleration measurement is within one or more specified margins, the estimation of energy acceleration using a rolling average is:

The communication device (100) according to claim 5 or 6.

The buffer has a buffer length of N frames, and if a continuous input frame is passed to and released from the buffer and it is determined that the input frame in the buffer is a voice frame, the input frame The communication device (100) of claim 4, wherein the determination that is a speech frame (265) is applied retroactively to earlier frames in the buffer.

The communication device (100) according to any one of claims 3, 4 or 9, wherein the subband includes a basic pitch of an audio signal.

A method of detecting an audio signal input to a communication device,
Measuring an energy acceleration of a signal input to the communication device (100) by calculating a ratio of an average energy value and an instantaneous energy value;
Determining whether the input signal is speech (370) or noise (375) on a frame basis based on the measuring step (315, 330, 350), wherein the energy acceleration measurement comprises: The method wherein the input frame is determined to be a speech frame if an energy acceleration value greater than the energy acceleration threshold is generated (265).

The method of claim 11, wherein the frame-by-frame measurement includes one or more of: (i) performing an energy acceleration measurement on the input signal over the entire spectrum; (ii) performing the energy acceleration measurement over a subband of the spectrum; and (iii) performing the energy acceleration measurement by using an acceleration of a variance of values within the low frequency half of the spectrum of each frame.

13. A method according to claim 11 or 12, wherein the step of measuring energy acceleration uses a frame average or rolling average of a number of input signals.

The energy acceleration is estimated by tracking the ratio of the two rolling averages of the input signal using (0 × average + 1 × input) and ((Frame-1) × average + 1 × input) / Frame, respectively. 13. The method according to claim 11, wherein Frame represents a value of a frame counter.

Measuring the energy acceleration comprises:

The method of claim 13, comprising estimating the energy acceleration using a frame average by calculating

The step of measuring the energy acceleration comprises: the energy acceleration measurement being within one or more specified margins;

15. The method according to claim 13 or 14, comprising estimating the energy acceleration using a rolling average by calculating.

The method of claim 11, wherein the buffer has a buffer length of N frames and successive input frames are passed into and ejected from the buffer, the method further comprising:
retroactively applying the determination to earlier frames in the buffer if it is determined that an input frame in the buffer is a speech frame.

The method of claim 11, wherein the determining step includes:
Buffering an input frame of the input signal in a buffer;
Assigning a true or false indication to each of the buffered input frames in the buffer, wherein a true indication is assigned if an energy acceleration measurement for the input frame returns a speech indication; and
Determining that the input signal in the buffer is speech if the indication assigned to each of a series of buffered input frames in the buffer is true.