JP2003500936A - Improving near-end audio signals in echo suppression systems

Info

Publication number: JP2003500936A
Application number: JP2000619908A
Authority: JP
Inventors: Nils Christensson; John Philipsson
Original assignee: Telefonaktiebolaget LM Ericsson (publ)
Priority date: 1999-05-20
Filing date: 2000-05-09
Publication date: 2003-01-07
Also published as: MY122658A; AU4563700A; US6510224B1; CN1361972A; CN1223109C; DE10084614T1; WO2000072565A1

Abstract

(57) Abstract: In a hands-free environment, an enhanced near-end voice signal is generated by receiving an audio signal, generating an estimated acoustic echo signal, and generating a processed signal by removing the estimated acoustic echo signal from the audio signal. A near-end enhancement spectrum is then determined that has one or more ranges of contiguous frequencies over which the detector spectrum has a maximum value, the ranges of contiguous frequencies being associated with relatively high echo return loss in the processed signal. The processed signal is filtered in accordance with the near-end enhancement spectrum, thereby generating the enhanced near-end voice signal. The enhanced near-end voice signal is then applied to any number of components intended to process near-end speech. For example, when it is applied to a voice activity detector, the amount of energy contained in the enhanced near-end voice signal is measured, and the presence or absence of near-end voice activity is determined on the basis of that measured energy. The process is repeated periodically to yield dynamically adjustable operation.

Description

Detailed Description of the Invention

Background

[0001] The present invention relates to the processing of voice signals in communication systems, and more particularly to the enhancement of near-end speech in a signal that contains near-end speech combined with an echo of far-end speech.

[0002] In the field of communications, for example with speakerphones and cellular telephones, it is often desirable for a user to be able to operate communication equipment without one or more hands being continuously occupied. This is an important factor in automotive environments, where a driver who is preoccupied with holding telephone equipment may endanger not only his or her own safety but also that of others sharing the road. The freedom to use one's hands for something other than holding a microphone is equally useful in other applications, such as Internet communication by personal computer, computer speech recognition, and audiovisual presentation systems.

[0003] To accommodate these needs, so-called "hands-free" equipment has been developed. In such equipment, the microphone and loudspeaker are mounted in the hands-free environment, which removes the need to hold them. In an automotive application, for example, the microphone of a cellular telephone may be mounted on the sun visor, while the loudspeaker may be a dashboard-mounted unit or may be one associated with the car stereo equipment. With components mounted in this way, the cellular telephone user can carry on a conversation without having to hold the cellular unit or its handset. Similarly, personal computers often have a microphone and a loudspeaker mounted, for example, within the monitor in relatively close proximity to each other.

[0004] One problem with a hands-free arrangement is that the microphone tends to pick up sound from the nearby loudspeaker in addition to the voice of the user of the hands-free equipment (the so-called "near-end user"). This is also a problem in some equipment that is not hands-free, such as handheld mobile telephones, which are becoming smaller and smaller. (Because of the small size, the microphone of a mobile telephone cannot be completely shielded from the sound emitted by its loudspeaker.) Sensing by the microphone of sound produced by the loudspeaker causes problems in many applications. In communication equipment, for example, the delay introduced by the communication system as a whole causes the sound from the loudspeaker to be heard by the individual at the other end of the call (the so-called "far end") as an echo of his or her own voice. Such an echo degrades audio quality, and its mitigation is desirable. A similar problem exists in automated systems that synthesize speech through a loudspeaker and that include a speech recognition component for recognizing and responding to spoken commands and other words sensed by the microphone. In such applications, the presence of an echo of the synthesized speech in the microphone signal significantly degrades the performance of the speech recognition component. Solutions for mitigating such echoes include the use of adaptive echo cancellation filters and echo attenuators.

[0005] As a representative example of hands-free equipment in general, FIG. 1 depicts a typical "hands-free" mobile telephone having a conventional echo canceller in the form of an adaptive filter arrangement. The hands-free communication environment may be, for example, the interior of an automobile in which the mobile telephone is installed. Such an environment affects the propagation of acoustic signals within it, and the effect is usually unknown. Hereinafter, this type of environment is referred to throughout this specification as the unknown system H(z). Microphone 105 is intended to detect the user's voice, but may have the undesired effect of also detecting the audio signal emitted by loudspeaker 109. It is this undesired behavior that introduces an echo signal into the system.

[0006] The circuit for attenuating the echo, if not removing it entirely, includes an adaptive filter 101, such as an adaptive finite impulse response (FIR) filter; an adaptation unit 103, such as a least mean square (LMS) cross-correlator; and a subtractor 107. In operation, adaptive filter 101 generates an echo estimate signal 102, commonly referred to as the û signal. Echo estimate signal 102 is the convolution of far-end signal 112 with a sequence of m filter weighting coefficients (h_i) of filter 101 (see Equation 1), where x(n) is the input signal, m is the number of weighting coefficients, and n is the sample number.
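The image of Equation 1 does not appear in this text. As a sketch only, assuming the usual FIR convolution implied by the definitions above (m weights h_i applied to the m most recent input samples), it would take the following form; the index range 0 to m−1 is an assumption chosen to match the vector x(n) = [x(n) … x(n−m+1)] used later in the description.

```latex
\hat{u}(n) \;=\; \sum_{i=0}^{m-1} h_i \, x(n-i)
```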

[0007] When the weighting coefficients are set correctly, adaptive filter 101 produces an impulse response approximately equal to the response produced by loudspeaker 109 within the unknown system H(z). Echo estimate signal 102 generated by adaptive filter 101 is subtracted from the incoming digitized microphone signal 126 (denoted u(n) in Equation 2) to produce an error signal e(n) (see Equation 2). Ideally, any echo response from the unknown system H(z) introduced by loudspeaker 109 is removed from digitized microphone signal 126 by the subtraction of echo estimate signal 102. In general, the number of weighting coefficients (hereinafter referred to as "coefficients") needed to cancel the echo effectively depends on the application. For a handheld telephone, fewer than 100 coefficients may be adequate. For a hands-free telephone in an automobile, about 200 to 400 coefficients are required. A larger space may require a filter with more than 1000 coefficients to provide adequate echo cancellation.
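The image of Equation 2 is likewise absent from this text. Assuming it expresses only the subtraction described above (the echo estimate removed from the digitized microphone signal), a consistent form would be:

```latex
e(n) \;=\; u(n) - \hat{u}(n)
```

Here u(n) is digitized microphone signal 126 and û(n) is echo estimate signal 102, following the paragraph's own notation.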

[0008] It can be seen that the effectiveness of the echo canceller is directly related to how well adaptive filter 101 is able to replicate the impulse response of the unknown system H(z). This, in turn, is directly related to the set of coefficients h_i maintained by filter 101.

[0009] It is advantageous to provide a mechanism that dynamically changes the coefficients h_i so that adaptive filter 101 can adapt to changes in the unknown system H(z). In a car with a hands-free cellular arrangement, such changes may occur when a window or a car door is opened or closed. A well-known coefficient adaptation scheme is the least mean square (LMS) process, which was first introduced by Widrow and Hoff in 1960 and is frequently used because of its efficiency and robust behavior. As applied to the echo cancellation problem, the LMS process is a stochastic gradient step method that uses a rough (noisy) estimate of the gradient, g(n) = e(n)x(n), to make incremental steps toward minimizing the energy of the echo signal in the microphone signal e(n). Here, x(n) is vector notation corresponding to x(n) = [x(n) x(n-1) x(n-2) … x(n-m+1)]. The update information e(n)x(n) generated by the LMS process is used to determine the values of the coefficients at the next sample. The next coefficient value h_i(n+1) is calculated according to Equation 3, where x(n) is the digitized input signal 134, h_i are the filter weighting coefficients, i designates a particular coefficient, m is the number of coefficients, n is the sample number, and μ is the step or update gain parameter.
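The image of Equation 3 is not present in this text. As a hedged reconstruction, assuming the textbook LMS update built from the gradient estimate g(n) = e(n)x(n) and the step parameter μ named above, the coefficient update would read:

```latex
h_i(n+1) \;=\; h_i(n) + \mu \, e(n) \, x(n-i), \qquad i = 0, 1, \ldots, m-1
```

Each coefficient moves by the product of the current error and the input sample it multiplies, scaled by μ. The sign convention assumes e(n) = u(n) − û(n).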

[0010] The LMS method generates information in incremental portions, each of which may have a positive or a negative value. The information generated by the LMS process is provided to the filter to update the filter's coefficients.

[0011] Referring again to FIG. 1, the conventional echo cancellation circuit includes a filter adaptation unit 103 in the form of an LMS cross-correlator that provides coefficient update information 104 to filter 101. In this arrangement, filter adaptation unit 103 monitors the corrected signal e(n), which represents digitized microphone signal 126 minus echo estimate signal 102 generated by filter 101. As described above, echo estimate signal 102 is generated using update information 104 provided to adaptive filter 101 by filter adaptation unit 103. The coefficients h_i of adaptive filter 101 accumulate update information 104 as shown in Equation 3.
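The loop formed by the filter, the subtractor, and the adaptation unit can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the circuit of FIG. 1: the function name `lms_echo_canceller`, the tap count, the step size `mu`, and the three-tap `true_path` standing in for the unknown system H(z) are all hypothetical, and the update rule is the textbook LMS form.

```python
import random


def lms_echo_canceller(far_end, mic, num_taps=8, mu=0.05):
    """Plain LMS echo canceller; returns (error_signal, final_weights)."""
    h = [0.0] * num_taps        # filter weights h_i (assumed to start at zero)
    x_hist = [0.0] * num_taps   # x(n), x(n-1), ..., x(n-m+1)
    errors = []
    for x_n, u_n in zip(far_end, mic):
        x_hist = [x_n] + x_hist[:-1]
        # Echo estimate: convolution of the far-end history with the weights.
        u_hat = sum(h_i * x_i for h_i, x_i in zip(h, x_hist))
        # Error signal: microphone sample minus echo estimate.
        e_n = u_n - u_hat
        # Textbook LMS update: h_i <- h_i + mu * e(n) * x(n-i).
        h = [h_i + mu * e_n * x_i for h_i, x_i in zip(h, x_hist)]
        errors.append(e_n)
    return errors, h


# Toy demonstration with a made-up echo path and no near-end speech.
random.seed(0)
true_path = [0.6, -0.3, 0.1]  # hypothetical stand-in for H(z)
far_end = [random.uniform(-1.0, 1.0) for _ in range(4000)]
mic = [
    sum(g * far_end[n - k] for k, g in enumerate(true_path) if n - k >= 0)
    for n in range(len(far_end))
]
errors, weights = lms_echo_canceller(far_end, mic)
```

In this noise-free toy case the weights settle on the made-up echo path and the residual error decays toward zero. With near-end speech present in `mic`, the error signal would instead carry that speech plus any residual echo.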

[0012] Once the presence of acoustic echo in the microphone signal has been reduced, the resulting signal is supplied to additional components for further application-specific processing. For example, in addition to the acoustic echo cancellation circuit described above, a transceiver such as the one depicted in FIG. 1 typically includes a near-end voice activity detector 150, which outputs a signal 153 indicating whether the near-end user is speaking. The most commonly used approach to near-end voice activity detection is a power calculation in the time domain. Typically, the decision as to whether voice activity is present is based mainly on a comparison between a threshold energy level (corresponding to background noise) and a measurement of the energy of the signal after band-pass filtering. The purpose of the band-pass filtering is to remove signal energy associated with background noise.
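The time-domain energy comparison described above can be illustrated with a short sketch. The names `frame_energy` and `detect_voice_activity`, the frame length of 160 samples, and the threshold value are assumptions made for the example, and the band-pass filtering stage is omitted: the input is assumed to be already filtered.

```python
import math


def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)


def detect_voice_activity(signal, frame_len=160, threshold=0.01):
    """Return one True/False flag per complete frame of the input signal."""
    flags = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        # Voice is declared present when the frame energy exceeds the
        # threshold, which stands in for the background-noise level.
        flags.append(frame_energy(frame) > threshold)
    return flags


# Two frames of a 400 Hz tone sampled at 8000 Hz: one quiet, one loud.
quiet = [0.01 * math.sin(2 * math.pi * 400 * t / 8000) for t in range(160)]
loud = [0.5 * math.sin(2 * math.pi * 400 * t / 8000) for t in range(160)]
flags = detect_voice_activity(quiet + loud)
```

A detector of this kind cannot tell residual echo from near-end speech when both fall inside the pass band, which is the limitation the later paragraphs address.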

[0013] A signal indicating the presence or absence of near-end speech is useful for any of a number of purposes. For one, in cellular communication systems such as the pan-European digital mobile telephony system (GSM), the digitized speech signal is not transmitted over the network in raw form, but is instead encoded in a manner that reduces the number of bits that actually need to be transmitted from one place to another. In GSM, the speech coder takes advantage of the fact that, in normal conversation, each participant speaks on average less than 40% of the time. By incorporating a voice activity detector as part of the speech coder's function, a GSM system operates in a discontinuous transmission mode (DTX). In that mode, the GSM transmitter is not active during periods of silence (that is, when near-end voice activity detector 150 indicates that the near-end user is not speaking). This approach extends the subscriber's battery life and reduces instantaneous radio interference. A comfort noise subsystem at the receiving end introduces background acoustic noise to compensate for the annoying switched muting that DTX would otherwise cause.

[0014] A near-end voice activity detector may also be used to control the attenuation factor of an active acoustic echo canceller on the basis of whether the speech signal contains a near-end speech component.

[0015] Furthermore, a near-end voice activity detector may also be used to control the adaptation speed of adaptive filter 101.

[0016] Voice activity detectors are not the only type of component that processes signals representing near-end speech. Such a signal may also be supplied, for example, to a speech recognition module. Speech recognition modules are well known and are useful in applications that allow a user to control a device or a computer by voice, and in applications in which a user can create an electronic document simply by dictating it.

[0017] Furthermore, a signal representing near-end speech may also be fed back within the system and used to control echo cancellation filter 101 itself, for example to control the speed of adaptation.

[0018] Despite the presence of an echo cancellation circuit such as that described above, the signal generated for further processing (for example, for transmission to a far-end user in a communication system, for near-end speech recognition, or for controlling the operation of echo cancellation filter 101) may quite often still contain an echo component. This may occur, for example, because the adaptive filter has not yet converged to a fully adapted state, or because the unknown system H(z) changes even after such convergence, each change requiring the adaptation process to be repeated. A strong echo component in the signal may be mistaken for near-end speech, and can therefore cause signal degradation or even malfunction of downstream processing components.

[0019] Conventional applications that process near-end speech signals, such as conventional voice activity detectors and speech recognition modules, typically assume that no echo is present in the signal being processed. They therefore have no ability to focus on near-end speech to the exclusion of echo signal components that may lie within the frequency range of human voice activity.

Summary

[0020] It is therefore an object of the present invention to provide a method and apparatus for generating a signal in which the near-end speech component is emphasized relative to the echo signal component.

[0021] The foregoing and other objects are achieved in a method and apparatus for generating an enhanced near-end voice signal. According to one aspect of the invention, generating the enhanced near-end voice signal includes receiving an audio signal, generating an estimated acoustic echo signal, and generating a processed signal by removing the estimated acoustic echo signal from the audio signal. These steps are useful, for example, in a hands-free telephone, in which the loudspeaker signal carrying information from the far-end user is picked up as an acoustic echo by the microphone of the hands-free telephone. Next, a near-end enhancement spectrum is determined. The near-end enhancement spectrum has at least one range of contiguous frequencies over which it has a magnitude greater than a predetermined threshold, the range of contiguous frequencies being associated with relatively large echo return loss in the processed signal. The processed signal is filtered in accordance with the near-end enhancement spectrum, thereby generating the enhanced near-end voice signal.

[0022] According to another aspect of the invention, the amount of energy contained in the enhanced near-end voice signal is measured. Whether near-end voice activity is present is detected on the basis of the measured energy of the enhanced near-end voice signal.

[0023] According to yet another aspect of the invention, the enhanced near-end voice signal may be applied to a near-end speech recognizer, thereby yielding improved speech recognition performance.

[0024] According to another aspect of the invention, the above process is repeated periodically, so that the determination of whether near-end voice activity is present is dynamically adjustable and can adapt to changing conditions.

[0025] In yet another aspect of the invention, determining the near-end enhancement spectrum includes determining the near-end enhancement spectrum as a function of a weighted spectrum, the weighted spectrum being defined as follows, where Γ is the spectrum of the estimate of the acoustic echo arising from the far-end signal; E is the echo return loss enhancement spectrum, representing the echo cancellation performance of step c); N is the spectrum of the processed signal; S is the echo spread spectrum, representing the spectral spreading characteristics of the echo path; Γ_max = max(Γ), E_max = max(E), S_max = max(S); and α, β, and γ are constants with α + β + γ > 0.

[0026] In yet another aspect of the invention, α + β + γ = 1.

[0027] In yet another aspect of the invention, determining the near-end enhancement spectrum as a function of the weighted spectrum includes determining the detector spectrum in accordance with the following expression, where Speech_min(i) is the i-th frequency at which N is greater than a predetermined threshold; Speech_max(i) is the i-th frequency at which N is less than the predetermined threshold; and Spectrum_totalmax is the maximum frequency of interest in the weighted spectrum W(f).

[0028] The objects and advantages of the invention will be understood by reading the following detailed description in conjunction with the accompanying drawings.

Detailed Description

[0029] The various features of the invention will now be described with reference to the figures, in which like parts are identified with the same reference characters.

[0030] According to one aspect of the invention, a signal in which the near-end speech component is emphasized relative to the echo signal component is generated using information about frequency: this information determines the frequency bands in which the echo canceller works well and in which signal energy is therefore likely to be due to near-end voice activity. By calculating power mainly at those selected frequencies at which echo cancellation is known to be effective, rather than over the wider frequency range that is merely generally associated with voice activity, a greater difference between the echo component and near-end speech is obtained. A greater difference improves the performance of downstream components designed to process near-end speech, such as a voice activity detector, a speech recognizer, or a feedback path that controls the echo cancellation operation itself.

[0031] The technique for selecting which frequencies to enhance depends on what type of echo canceller is used. For example, in an LMS-type echo cancellation approach, the echo return loss enhancement (ERLE) at each frequency depends on the spectral power of the signal. In FIG. 2, solid line 201 shows the power spectrum of a speech signal (one sentence) before echo cancellation is applied. For comparison, dashed line 203 shows the power spectrum of the same speech signal after echo cancellation has been applied. A substantial loss of echo cancelling performance is observable at frequencies below 250 Hz and above 1500 Hz. Consequently, a near-end speech processing unit (for example, a voice activity detector or a speech recognizer) that restricts its analysis to speech signal frequencies in the range from 250 Hz to 1500 Hz will be less likely to mistake an echo component for near-end speech. In general, the particular frequency bands in which a near-end speech processing unit should operate for improved performance will depend on the type of echo canceller used as well as on the signal spectral power.
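Restricting the analysis to the 250 Hz to 1500 Hz range can be illustrated by summing spectral power only over the bins that fall inside that band. The sketch below is an illustration under assumptions, not the patent's filter: `band_power` is a hypothetical helper, it uses a direct DFT for clarity rather than speed, and the 8000 Hz sampling rate and 160-sample frame are chosen for the example.

```python
import cmath
import math


def band_power(frame, sample_rate, f_lo, f_hi):
    """Sum of |X(k)|^2 / N^2 over DFT bins whose frequency lies in [f_lo, f_hi]."""
    n = len(frame)
    total = 0.0
    for k in range(n // 2 + 1):
        freq = k * sample_rate / n
        if f_lo <= freq <= f_hi:
            # Direct DFT of a single bin; adequate for a short frame.
            coeff = sum(
                s * cmath.exp(-2j * math.pi * k * t / n)
                for t, s in enumerate(frame)
            )
            total += abs(coeff) ** 2
    return total / (n * n)


# A tone inside the band and a tone outside it, both of unit amplitude.
tone_500 = [math.sin(2 * math.pi * 500 * t / 8000) for t in range(160)]
tone_3000 = [math.sin(2 * math.pi * 3000 * t / 8000) for t in range(160)]
in_band = band_power(tone_500, 8000, 250.0, 1500.0)
out_of_band = band_power(tone_3000, 8000, 250.0, 1500.0)
```

The 500 Hz tone contributes its full power to the measurement while the 3000 Hz tone contributes essentially nothing, so energy outside the band where cancellation works well no longer influences a downstream decision.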

[0032] The following considerations apply when selecting the frequency bands to enhance or focus on when it is desired to process near-end speech to the exclusion of the far-end echo signal. It must be recognized that, because the microphone mixes the near-end speech signal with the far-end echo signal, the true spectrum of the near-end speech signal is unknown. Conventional techniques for detecting speech in a noisy environment involve removing (for example, by filtering) the frequencies at which the noise is dominant. In the case of far-end echo, however, the frequencies associated with the far-end echo signal are themselves associated with speech. That is, one is attempting to detect near-end speech in the presence of other (for example, far-end) speech. Simply removing the frequencies associated with the echo would therefore probably also remove part of the signal associated with the near-end speech, which would defeat the purpose.

[0033] As stated above, it is not possible to measure the near-end speech spectrum, because a clean copy of the near-end speech signal is not available. (Indeed, if a clean copy of the near-end speech signal were available, the problem being addressed would not exist.) However, far-end speech signal 112, which is not contaminated by near-end speech, is available, and this can be put to good use. First, the spectral energy contained in the echo signal generally corresponds to the spectral energy of the near-end speech signal (because both are speech signals). Thus, to some extent, the far-end speech signal (or a signal derived from it) can be used as a source of information for focusing the search for near-end speech.

[0034] It is also possible to measure the frequencies at which echo cancellation is most effective. This information can be used to advantage in improving near-end speech processing, because at these frequencies the near-end speech signal is unlikely to be masked by the presence of an echo speech component.

【００３５】The number of frequency bands used in computing the enhancement spectrum for near-end speech is left to the designer. The maximum number of frequency bands in a computed frequency spectrum is half the number of signal samples from which the spectrum was computed. However, the maximum number of bands need not always be computed; determining fewer frequency bands from the same number of samples may yield more meaningful values. For example, suppose the frequency spectrum is generated from 1600 samples of a signal conveyed in a GSM cellular communication system. In GSM, these 1600 samples represent 200 milliseconds of speech, and the highest representable frequency is 4000 Hz (the Nyquist frequency). The 1600 samples are divided into 10 groups of 160 samples each. A 256-point fast Fourier transform (FFT) of each of the 10 groups produces 10 spectra, which are combined by a suitable weighted-averaging technique. If, for example, exponential averaging is used, the frequency bands of each newly generated spectrum are given a much smaller weight than the previously determined average (so that the average responds slowly to changes in the spectrum over time). The result of combining the spectra in this way is a spectrum in which each point (frequency band) is derived from ten times as much information as it would be if a single FFT had been performed on the original 1600 samples to produce more frequency bands. With weighted combining, a single spectrum generated from an unrepresentative set of samples will not substantially affect overall operation.
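The frame-wise averaging described above can be sketched as follows. This is only an illustration under stated assumptions: an 8 kHz sampling rate (implied by 1600 samples spanning 200 ms), a hypothetical smoothing weight of 0.2 for each new spectrum, a naive DFT in place of an FFT to keep the example dependency-free, and a synthetic 1 kHz test tone. None of the names come from the specification.

```python
import cmath
import math

FRAME_LEN = 160   # samples per 20 ms frame at the assumed 8 kHz rate
FFT_SIZE = 256    # zero-padded transform length, as in the text
SMOOTHING = 0.2   # hypothetical weight given to each new spectrum


def power_spectrum(frame, size=FFT_SIZE):
    """Return the power in bins 0..size/2 of a zero-padded naive DFT."""
    padded = list(frame) + [0.0] * (size - len(frame))
    bins = []
    for k in range(size // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * n / size)
                  for n, x in enumerate(padded))
        bins.append(abs(acc) ** 2)
    return bins


def averaged_spectrum(samples, frame_len=FRAME_LEN, weight=SMOOTHING):
    """Exponentially average the per-frame power spectra of `samples`."""
    average = None
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        current = power_spectrum(samples[start:start + frame_len])
        if average is None:
            average = current
        else:
            # A small weight makes the average slow to follow new spectra.
            average = [(1.0 - weight) * a + weight * c
                       for a, c in zip(average, current)]
    return average


# 200 ms of a 1 kHz tone sampled at 8 kHz: 1600 samples, ten frames.
tone = [math.sin(2 * math.pi * 1000 * n / 8000) for n in range(1600)]
spectrum = averaged_spectrum(tone)
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
peak_hz = peak_bin * 8000 / FFT_SIZE
```

Each of the 129 resulting bins summarizes all ten frames, which is the trade of frequency resolution for statistical stability that the paragraph describes.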

【００３６】In one embodiment of the invention, the designer first calculates one or more frequency bands in which the echo canceller is expected to perform well, and then adjusts the subsequent near-end voice processing to operate only in those frequency bands.

【００３７】In another embodiment, the frequency bands in which the subsequent near-end voice processing operates may be determined dynamically. This makes it possible to adapt the near-end voice processing to dynamically changing conditions, such as changes in echo canceller performance and changes in the spectral characteristics of the far-end signal 112. An exemplary embodiment of near-end speech enhancement according to this aspect of the invention will now be described with reference to the block diagram of FIG. 3.

【００３８】An exemplary acoustic echo cancelling arrangement 301 includes an adaptive filter 101, a filter adaptation unit 103, a loudspeaker 109, a microphone 105, a D/A converter 136, an A/D converter 124, and a subtractor 107, all of which operate in the same way as their counterparts depicted in FIG. 1. A description of these components is therefore not repeated here. The exemplary transceiver also includes a noise suppression unit 303, although this element is optional. When present, the noise suppression unit 303 is itself adjusted dynamically on the basis of information generated in accordance with the invention (for example, the operation of the noise suppression unit 303 is a function of whether near-end voice is detected in the signal e(n) generated at the output of the subtractor 107). The far-end signal 112 may be generated by any number of sources, depending on the particular application. In a cellular telephone, for example, the far-end signal 112 is supplied at the output of a speech decoder (not shown) that generates it from a received signal. The acoustic echo cancelling arrangement 301 produces a processed near-end voice signal 313 as its output, which may be supplied to the input of a near-end voice processor (not shown). The function of the near-end voice processor is application specific and is not described in detail here. In the cellular telephone example, the near-end voice processor may be a voice activity detector (not shown), or equally a speech encoder (not shown) that generates an encoded signal for transmission to the far-end user.

【００３９】In accordance with the invention, the acoustic echo cancelling arrangement 301 further includes a near-end enhancement spectrum generator 309. The output of the near-end enhancement spectrum generator 309 is supplied to a control input of the near-end voice processor in order to improve that processor's performance. For example, if the near-end voice processor is a voice activity detector, the detector can base its voice activity decision on the characteristics of particular spectral bands of the processed near-end voice signal 313, as indicated by the near-end enhancement spectrum generator 309. That is, the output of the near-end enhancement spectrum generator 309 determines what type of filtering is applied to the processed near-end voice signal 313 as part of voice activity detection.

【００４０】Similar control adjustments can be made for other types of near-end voice processing equipment, such as speech recognition equipment.

【００４１】The near-end enhancement spectrum generator 309 may be implemented in any of a number of forms, each of which is considered to be within the scope of the invention. Such forms include computer program instructions embodied as signals on a computer-usable storage medium such as random access memory (RAM), magnetic storage media (for example, magnetic disk, diskette, or tape), and optical storage media (for example, compact disc read-only memory (CD-ROM)). Alternatively, the invention may be embodied as a programmable processor that executes such instructions. The near-end enhancement spectrum generator 309 may also be implemented in any of a number of arrangements of hardwired components or programmed logic arrays.

【００４２】To describe the operation of the near-end enhancement spectrum generator 309, the following terms are defined.

【００４３】The estimated echo spectrum (Γ) is the spectrum of the estimated echo signal y(n) supplied by the adaptive filter 101 (that is, the signal that is subtracted from the digitized microphone signal d(n)). The estimated echo spectrum Γ may be generated, for example, by an FFT from the digitized microphone signal d(n), and is therefore a function of frequency f. The estimated echo spectrum Γ should normally represent the locally stationary spectrum of the echo of the far-end signal. In applications such as GSM cellular telephones, this should be the spectrum of 20 milliseconds of speech. Recognizing that in this case the spectral content does not change faster than every 20 milliseconds, the number of samples used to calculate the estimated echo spectrum Γ is preferably the same as the number of samples used by the near-end voice processor (for example, a near-end voice activity detector). If a combining technique (for example, weighted averaging) is applied to several measurements of the estimated echo spectrum Γ, the weights should be such that a newly calculated estimated echo spectrum Γ quickly affects the combination. In some preferred embodiments, no averaging is applied to the estimated echo spectrum Γ. The estimated echo spectrum Γ is used to indicate the frequencies associated with relatively high echo return loss.

【００４４】The echo return loss enhancement (ERLE) spectrum (E) is a spectrum that represents the echo cancelling performance of the echo cancelling filter. The ERLE spectrum E is a function of frequency f. Several alternative measures of the ERLE spectrum E may be used. In some embodiments, the ERLE spectrum is determined according to an equation (not reproduced in this text) expressed in terms of the Fourier transform, the digitized microphone signal d(n), which contains echo and noise components together with near-end voice, and the processed near-end voice signal 313, e'(n).
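The equation for this first ERLE measure did not survive in this text. Since ERLE conventionally compares the signal before and after echo cancellation, one plausible reading, stated here purely as an assumption and not as the specification's own formula, is a per-frequency ratio of transform magnitudes. The symbol \mathcal{F} and the use of magnitudes are likewise assumptions:

```latex
E(f) = \frac{\left|\mathcal{F}\{d(n)\}(f)\right|}{\left|\mathcal{F}\{e'(n)\}(f)\right|}
```

Under this reading, E(f) is large at frequencies where the canceller removes most of the echo.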

【００４５】In another embodiment, a different ERLE spectrum may be determined by first making a time-domain measurement and then generating a frequency-domain spectrum from it (the two equations are not reproduced in this text). Either measure of the ERLE spectrum E may be used to indicate the frequencies associated with relatively high echo return loss. In either embodiment, the ERLE spectrum E may also be determined separately for each group of samples, with the resulting spectra combined as described above (for example, by weighted averaging). The averaging rate (that is, the rate at which a newly calculated spectrum significantly affects the average) is preferably about the same as the adaptation rate of the adaptive filter 101, so that the ERLE spectrum E accurately reflects the echo cancellation performance.

【００４６】The near-end spectrum (N) is the spectrum of the signal obtained after echo cancelling and optional noise suppression (that is, the spectrum of the processed near-end voice signal 313). The near-end spectrum N is a function of frequency f and may be calculated as the FFT of the processed near-end voice signal 313, e'(n). It is preferably calculated from the same number of samples as was used to calculate the estimated echo spectrum Γ.

【００４７】The echo spread spectrum (S) represents the spectral spreading characteristics of the echo path. That is, it is an estimate of how well different frequencies are transferred between the loudspeaker 109 and the microphone 105. The echo spread spectrum S is a function of frequency f and may be calculated as the Fourier transform of the coefficients h(n) that determine the filtering performed by the adaptive filter 101:

$$S(f) = \mathcal{F}\{h(n)\}$$

where $\mathcal{F}$ denotes the Fourier transform.

As in the embodiments described earlier, using the ERLE spectrum (E) to determine the frequency bands in which near-end voice processing should operate (hereinafter referred to as the "detector spectrum") improves near-end detection performance. According to another aspect of the invention, the benefit of using the spectrum E is obtained without degrading performance when the estimated echo spectrum (Γ) does not correspond to E, by determining the detector spectrum as follows.

【００４８】Referring to the flowchart of FIG. 4, the spectra Γ, E, S, and N are first determined as described above (step 401).

【００４９】Next, in step 403, a weighted spectrum W(f) is determined from the estimated echo spectrum Γ, the ERLE spectrum E, and the echo spread spectrum S according to

$$W(f) = \alpha\,\frac{\Gamma(f)}{\Gamma_{max}} + \beta\,\frac{E(f)}{E_{max}} + \gamma\,\frac{S(f)}{S_{max}}$$

where Γ_max = max(Γ), E_max = max(E), S_max = max(S), and α, β, and γ are constants.

【００５０】It will be readily apparent that the purpose of dividing each of the spectra Γ, E, and S by its maximum value is to produce normalized spectra, which are combined after each has been scaled by the corresponding one of the weighting factors α, β, and γ.
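The normalize-then-weight step can be sketched in a few lines. The sketch assumes that the scaled spectra are combined by a weighted sum, which is consistent with the worked examples later in this description; the function names, the four-band layout, and the sample values are hypothetical.

```python
def normalize(spectrum):
    """Scale a spectrum so that its largest value becomes 1.0."""
    peak = max(spectrum)
    if peak <= 0.0:
        raise ValueError("spectrum must contain a positive value")
    return [value / peak for value in spectrum]


def weighted_spectrum(echo, erle, spread, alpha, beta, gamma_weight):
    """Combine the three normalized spectra band by band (assumed sum)."""
    g, e, s = normalize(echo), normalize(erle), normalize(spread)
    return [alpha * gi + beta * ei + gamma_weight * si
            for gi, ei, si in zip(g, e, s)]


# Four hypothetical bands with arbitrary, unnormalized values.
echo = [4.0, 4.0, 1.0, 1.0]      # estimated echo spectrum
erle = [8.0, 8.0, 2.0, 2.0]      # ERLE spectrum
spread = [3.0, 3.0, 3.0, 3.0]    # echo spread spectrum (flat)
w = weighted_spectrum(echo, erle, spread,
                      alpha=0.5, beta=0.5, gamma_weight=0.0)
```

Because every input is divided by its own maximum, the weights act on comparable 0-to-1 scales regardless of the original units of each spectrum.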

【００５１】In a preferred embodiment, α + β + γ is close to one (for example, it may range from a fractional value close to but not equal to zero up to a value of about 2), but this is not a strict requirement.

【００５２】Next, in step 405, a compression factor C is determined. It represents the extent to which the weighted spectrum W(f) contains power within the one or more frequency bands in which the near-end spectrum N has its largest energy components. Reference is made to one or more frequency bands because the near-end spectrum N may exceed a predetermined threshold level over several non-contiguous frequency bands, as illustrated in FIG. 5 by a first band between Speech_min(1) and Speech_max(1) and a second band between Speech_min(2) and Speech_max(2). The compression factor C is given by

$$C = \frac{\sum_i \int_{Speech_{min}(i)}^{Speech_{max}(i)} W(f)\,df}{\int_{0}^{Spectrum_{totalmax}} W(f)\,df} \qquad (7)$$

where Speech_min(i) is the i-th frequency at which N rises above a predetermined, application-specific threshold that is set by the designer; Speech_max(i) is the i-th frequency at which N falls below that threshold; and Spectrum_totalmax is the highest frequency of interest in the weighted spectrum W(f). That is, the value of W(f) may be assumed to equal zero for all frequencies above Spectrum_totalmax.

【００５３】Note also that although the compression factor C is defined as a ratio of two integrals, in practice it can often be calculated simply by approximating the corresponding spectra as substantially flat over various ranges of frequencies. This is illustrated further in the examples presented below.
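A minimal sketch of the flat-segment shortcut follows. It assumes that C is the area of W(f) inside the speech bands divided by its area from 0 Hz up to the highest frequency of interest; the segment representation, function names, and numeric values are illustrative assumptions only.

```python
def integrate_flat(segments, lo, hi):
    """Integrate a piecewise-constant spectrum between lo and hi (Hz).

    `segments` is a list of (start_hz, end_hz, value) tuples.
    """
    total = 0.0
    for start, end, value in segments:
        overlap = min(end, hi) - max(start, lo)
        if overlap > 0:
            total += overlap * value
    return total


def compression_factor(w_segments, speech_bands, total_max):
    """Ratio of W's area inside the speech bands to its total area."""
    inside = sum(integrate_flat(w_segments, lo, hi)
                 for lo, hi in speech_bands)
    whole = integrate_flat(w_segments, 0.0, total_max)
    return inside / whole


# Hypothetical weighted spectrum: 1.0 up to 1 kHz, then 0.5 up to 2 kHz.
w_segments = [(0.0, 1000.0, 1.0), (1000.0, 2000.0, 0.5)]
# Two hypothetical non-contiguous bands where N exceeds the threshold.
speech_bands = [(200.0, 600.0), (1200.0, 1600.0)]
c = compression_factor(w_segments, speech_bands, total_max=2000.0)
# inside = 400*1.0 + 400*0.5 = 600; whole = 1000 + 500 = 1500
```

With flat segments each integral reduces to width times height, so no numerical integration is needed.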

【００５４】Having determined the compression factor C and the weighted spectrum W(f), the detector spectrum is obtained in step 407 by calculating

$$\text{near-end enhancement spectrum}(f) = (1 - C) + C\,W(f) \qquad (8)$$

It will be appreciated that the resulting near-end enhancement spectrum is a function of frequency f.
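As a rough illustration of step 407, the sketch below assumes that the enhancement spectrum takes the form (1 − C) + C·W(f), a form that reproduces the numbers in the worked examples of this description. The function names and the three-band values are hypothetical.

```python
def enhancement_spectrum(weighted, c):
    """Blend a flat response with W(f): (1 - C) + C * W(f) per band."""
    if not 0.0 <= c <= 1.0:
        raise ValueError("compression factor is expected to lie in [0, 1]")
    return [(1.0 - c) + c * w for w in weighted]


def apply_enhancement(near_end, enhancement):
    """Shape the near-end spectrum N band by band."""
    return [n * h for n, h in zip(near_end, enhancement)]


weighted = [1.0, 0.5, 0.0]                   # hypothetical W(f), three bands
flat = enhancement_spectrum(weighted, 0.0)   # C = 0: no shaping at all
full = enhancement_spectrum(weighted, 1.0)   # C = 1: shaping equals W(f)
half = enhancement_spectrum(weighted, 0.5)   # halfway between the two
shaped = apply_enhancement([2.0, 2.0, 2.0], half)
```

Under this assumed form, C controls how strongly W(f) is allowed to shape the detector: the less of W's power falls inside the speech bands, the flatter the resulting response.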

【００５５】The near-end enhancement spectrum may then be supplied to the control input of the near-end voice processor (not shown). For example, the near-end enhancement spectrum may be used to determine the band-pass filtering performed by a near-end voice activity detector in a cellular telephone.

【００５６】For dynamically adjustable operation, these steps are repeated periodically, beginning again at step 401, as shown in FIG. 4. For example, in a system in which a frame of 160 samples is generated once every 20 milliseconds, a new near-end enhancement spectrum may likewise be determined once every 20 milliseconds.

【００５７】Several examples are now presented to illustrate the techniques described above. In each case, all of the spectra described are normalized except for the near-end spectrum N. (N is not normalized so that information about the actual energy level of the processed near-end voice signal 313 is retained.) In addition, in the following examples the spread spectrum is assumed to be uniformly distributed, as is often the case. To further facilitate understanding of the invention, N is shown as having only one region in which the power exceeds the predetermined threshold level. This avoids having to sum separately calculated integrals.

【００５８】A first example will now be described with reference to FIGS. 6A to 6E. FIG. 6A is a graph of the near-end speech spectrum N: N = 0.25 between f = 0 and f = 250 Hz, N = 1.0 between f = 250 Hz and f = 750 Hz, and N = 0.25 between f = 750 Hz and f = 1500 Hz. (The maximum value of 1.0 is shown for illustrative purposes only; in general, N is not normalized.)

【００５９】Continuing with the example, FIG. 6B is a graph of the normalized ERLE spectrum E: E = 1.0 between f = 0 and f = 750 Hz, and E = 0.25 between f = 750 Hz and f = 1500 Hz.

【００６０】A graph of the normalized estimated echo spectrum Γ is depicted in FIG. 6C: Γ = 1.0 between f = 0 and f = 750 Hz, and Γ = 0.25 between f = 750 Hz and f = 1500 Hz.

【００６１】In this example, the weighted spectrum is given by

$$W(f) = 0.5\,\Gamma(f) + 0.5\,E(f)$$

(Because the weighting factor γ = 0 in this example, the shape of the echo spread spectrum S is irrelevant.) Given the normalized estimated echo spectrum Γ (as depicted in FIG. 6C) and the normalized ERLE spectrum E (as depicted in FIG. 6B), the resulting weighted spectrum W(f) for this example is as depicted in FIG. 6D.

【００６２】Next, the compression factor C is calculated. Assuming that the predetermined threshold is 0.25, it can be seen from FIG. 6A that only one frequency band exceeds this threshold. The relevant bounds are Speech_min = 250 Hz, Speech_max = 750 Hz, and Spectrum_totalmax = 1500 Hz.

【００６３】Therefore, according to equation (7),

$$C = \frac{\int_{250}^{750} W(f)\,df}{\int_{0}^{1500} W(f)\,df} = \frac{500 \times 1.0}{750 \times 1.0 + 750 \times 0.25} \approx 0.533$$

Because the weighted spectrum W(f) is constant over each of several ranges, the integrals, and hence C, are relatively easy to calculate.

【００６４】The near-end enhancement spectrum can now be calculated according to equation (8). The leftmost spectrum in FIG. 6E depicts the resulting near-end enhancement spectrum for this example. It has a magnitude of 1.0 between f = 0 and f = 750 Hz and a value of 0.600 between f = 750 Hz and f = 1500 Hz.

【００６５】FIG. 6E further depicts the application of this near-end enhancement spectrum to the control of a near-end voice processor such as a voice activity detector. Such a voice activity detector has a band-pass filtering function that is adjusted to conform to the near-end enhancement spectrum. Consequently, when the processed near-end voice signal 313 (see the middle spectrum in FIG. 6E) is applied to the voice activity detector, the resulting voice activity detector spectrum looks like the one shown on the right side of FIG. 6E. The resulting detector spectrum is equal to 0.25 between f = 0 and f = 250 Hz, 1.0 between f = 250 Hz and f = 750 Hz, and 0.15 between f = 750 Hz and f = 1500 Hz. As a result, there is no change in operation for those frequency bands in which echo cancellation performs well (that is, between f = 0 Hz and f = 750 Hz; see the exemplary weighted spectrum in FIG. 6D). At frequencies associated with poor echo cancelling performance, however, the signal has only a small influence on the near-end detector. The performance of the near-end detector is thereby improved.
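The numbers in this first example can be checked with a short script. It assumes weights α = β = 0.5 (with γ = 0) and an enhancement spectrum of the form (1 − C) + C·W(f); both are inferred from the values quoted in the examples rather than taken from an equation that survives in this text. The variable names are hypothetical.

```python
# Three flat bands, in Hz, with values read from FIGS. 6A to 6C.
bands = [(0, 250), (250, 750), (750, 1500)]
near_end = [0.25, 1.0, 0.25]   # N (not normalized in general)
erle = [1.0, 1.0, 0.25]        # E, normalized
echo = [1.0, 1.0, 0.25]        # estimated echo spectrum, normalized
alpha, beta = 0.5, 0.5         # assumed weights; gamma weight is 0
threshold = 0.25               # threshold applied to N

# Weighted spectrum W(f) per band.
w = [alpha * g + beta * e for g, e in zip(echo, erle)]

# Compression factor: area of W inside the band where N exceeds the
# threshold, divided by the total area of W.
width = [hi - lo for lo, hi in bands]
speech_area = sum(wi * bw for wi, bw, n in zip(w, width, near_end)
                  if n > threshold)
total_area = sum(wi * bw for wi, bw in zip(w, width))
c = speech_area / total_area

# Assumed enhancement form, then the shaped detector spectrum.
enhancement = [(1.0 - c) + c * wi for wi in w]
detector = [n * h for n, h in zip(near_end, enhancement)]
```

Running the script gives C of roughly 0.533, an enhancement spectrum of about 1.0, 1.0, and 0.6, and a detector spectrum of about 0.25, 1.0, and 0.15, matching the values stated in the text.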

【００６６】A second example will now be described with reference to FIGS. 7A to 7E. FIG. 7A is a graph of the near-end speech spectrum N: N = 0.25 between f = 0 and f = 250 Hz, N = 1.0 between f = 250 Hz and f = 750 Hz, and N = 0.25 between f = 750 Hz and f = 1500 Hz. (The maximum value of 1.0 is shown for illustrative purposes only; in general, N is not normalized.)

【００６７】Continuing with the example, FIG. 7B is a graph of the normalized ERLE spectrum E: E = 1.0 between f = 0 and f = 750 Hz, and E = 0.25 between f = 750 Hz and f = 1500 Hz.

【００６８】Up to this point the example is the same as the one described above with reference to FIGS. 6A to 6E. Here, however, a different normalized estimated echo spectrum Γ is depicted in FIG. 7C: Γ = 0.25 between f = 0 and f = 750 Hz, and Γ = 1.0 between f = 750 Hz and f = 1500 Hz.

【００６９】In this example, it is again assumed that the weighted spectrum is given by

$$W(f) = 0.5\,\Gamma(f) + 0.5\,E(f)$$

(Because the weighting factor γ = 0 in this example, the shape of the echo spread spectrum S is irrelevant.) Given the normalized estimated echo spectrum Γ (as depicted in FIG. 7C) and the normalized ERLE spectrum E (as depicted in FIG. 7B), the resulting weighted spectrum W(f) for this example is as depicted in FIG. 7D. Note that it is constant (= 0.625) over the entire range from f = 0 to f = 1500 Hz.

【００７０】Next, the compression factor C is calculated. It can be seen from FIG. 7A that Speech_min = 250 Hz, Speech_max = 750 Hz, and Spectrum_totalmax = 1500 Hz.
【００７１】Therefore, according to equation (7),

$$C = \frac{\int_{250}^{750} W(f)\,df}{\int_{0}^{1500} W(f)\,df} = \frac{500 \times 0.625}{1500 \times 0.625} = \frac{1}{3}$$

Because the weighted spectrum W(f) is constant over the entire range between f = 0 and f = 1500 Hz, the integrals, and hence C, are again relatively easy to calculate.

【００７２】The near-end enhancement spectrum can now be calculated according to equation (8). The leftmost spectrum in FIG. 7E depicts the resulting near-end enhancement spectrum for this example. It has a magnitude of 0.875 over the entire range between f = 0 and f = 1500 Hz.

【００７３】FIG. 7E further depicts the application of this near-end enhancement spectrum to the control of a near-end voice processor such as a voice activity detector. Such a voice activity detector has a band-pass filtering function that is adjusted to conform to the near-end enhancement spectrum. Consequently, when the processed near-end voice signal 313 (see the middle spectrum in FIG. 7E) is applied to the voice activity detector, the resulting voice activity detector spectrum looks like the one shown on the right side of FIG. 7E. The resulting detector spectrum is equal to 0.21875 between f = 0 and f = 250 Hz, 0.875 between f = 250 Hz and f = 750 Hz, and 0.21875 again between f = 750 Hz and f = 1500 Hz. It can be seen that when there is little or no correlation between the ERLE spectrum E and the estimated echo spectrum Γ, the entire detector spectrum is attenuated. Nevertheless, the near-end detector still responds most strongly at the frequencies where the near-end spectrum N has its largest components.
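The figures quoted in this second example can be reproduced by hand. The arithmetic below assumes weights α = β = 0.5 with γ = 0, and an enhancement spectrum of the form (1 − C) + C·W(f); both are inferred from the numbers stated in the examples rather than quoted from an equation that survives in this text.

```latex
\begin{aligned}
W &= 0.5(0.25) + 0.5(1.0) = 0.625 && (0 \le f < 750\ \text{Hz}) \\
W &= 0.5(1.0) + 0.5(0.25) = 0.625 && (750 \le f \le 1500\ \text{Hz}) \\
C &= \frac{500 \times 0.625}{1500 \times 0.625} = \tfrac{1}{3} \\
(1 - C) + C\,W &= \tfrac{2}{3} + \tfrac{1}{3}(0.625) = 0.875 \\
N \times 0.875 &= 0.25 \times 0.875 = 0.21875 && (\text{outside } 250\text{ to }750\ \text{Hz}) \\
N \times 0.875 &= 1.0 \times 0.875 = 0.875 && (250\text{ to }750\ \text{Hz})
\end{aligned}
```

Because W(f) is flat here, the shaping carries no frequency preference, and the detector output is simply N scaled by 0.875.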

【００７４】The invention has been described with reference to particular embodiments. However, it will be readily apparent to those skilled in the art that the invention can be embodied in specific forms other than the preferred embodiments described above, and that this can be done without departing from the spirit of the invention.

【００７５】For example, the illustrated spectra have been idealized to facilitate discussion of the invention. In practice, any or all of these spectra may not match the exemplary step functions depicted in FIGS. 6A to 6E and FIGS. 7A to 7E. Rather, some or all of them may be described by more complex mathematical functions. Despite this difference, the resulting detector spectrum is expected to be characterized by contiguous ranges of frequencies over which the detector spectrum has its maximum values, these contiguous ranges being associated with relatively high echo return loss in the processed signal.

【００７６】Accordingly, the preferred embodiments are merely illustrative and should not be considered restrictive in any way. The scope of the invention is given by the appended claims rather than by the foregoing description, and all variations and equivalents that fall within the range of the claims are intended to be embraced therein.

[Brief description of drawings]

【図１】FIG. 1 is a block diagram of a conventional hands-free transceiver including an acoustic echo canceller and a near-end voice activity detector.

【図２】FIG. 2 is a comparison of the power spectra of a voice signal (one sentence) before and after echo cancellation is applied.

【図３】FIG. 3 is a block diagram of an exemplary embodiment of the invention.

【図４】FIG. 4 is a flowchart depicting steps performed in accordance with the invention.

【図５】FIG. 5 is an exemplary near-end spectrum N illustrating the case of several non-contiguous frequency bands in which the amplitude exceeds a predetermined threshold level.

【図６Ａ】FIG. 6A is a graph of an exemplary normalized near-end voice spectrum N.

【図６Ｂ】FIG. 6B is a graph of an exemplary normalized ERLE spectrum E.

【図６Ｃ】FIG. 6C is a graph of an exemplary normalized loudspeaker spectrum Γ.

【図６Ｄ】FIG. 6D is a graph of an exemplary weighted spectrum according to one aspect of the invention.

【図６Ｅ】FIG. 6E is a graph depicting the determination of an exemplary compression factor C according to one aspect of the invention.

【図７Ａ】FIG. 7A is a graph of another exemplary normalized near-end voice spectrum N.

【図７Ｂ】FIG. 7B is a graph of another exemplary normalized ERLE spectrum E.

【図７Ｃ】FIG. 7C is a graph of another exemplary normalized loudspeaker spectrum Γ.

【図７Ｄ】FIG. 7D is a graph of another exemplary weighted spectrum according to one aspect of the invention.

【図７Ｅ】FIG. 7E is a graph depicting the determination of another exemplary compression factor C according to one aspect of the invention.

[Written amendment] Submission of translation of amendment under Article 34 of the Patent Cooperation Treaty

[Submission date] July 26, 2001 (2001.7.26)

[Amendment 1]

[Document to be amended] Specification

[Item to be amended] Claims

[Method of amendment] Change

[Content of amendment]

[Claims]

[Amendment 2]

[Document to be amended] Specification

[Item to be amended] 0019

[Method of amendment] Change

[Content of amendment]

[0019] Conventional applications that process near-end speech signals, such as conventional voice activity detectors and speech recognition modules, typically assume that the signal being processed contains no echo. They therefore lack the ability to focus on near-end speech to the extent of removing echo signal components that may lie within the frequency range of human voice activity. European Patent Application No. 0854626 discloses an echo cancellation technique that involves comparing a received, unprocessed near-end signal with an estimated echo signal in the frequency domain. Information from this comparison is then supplied to a filter that generates an improved near-end audio signal based on that information.

Continuation of front page

(51) Int. Cl.7, identification symbol, FI, theme code (reference): H04M 1/60; G10L 3/02 301C; 9/08

(81) Designated states: EP (AT, BE, CH, CY, DE, DK, ES, FI, FR, GB, GR, IE, IT, LU, MC, NL, PT, SE), OA (BF, BJ, CF, CG, CI, CM, GA, GN, GW, ML, MR, NE, SN, TD, TG), AP (GH, GM, KE, LS, MW, SD, SL, SZ, TZ, UG, ZW), EA (AM, AZ, BY, KG, KZ, MD, RU, TJ, TM), AE, AG, AL, AM, AT, AU, AZ, BA, BB, BG, BR, BY, CA, CH, CN, CR, CU, CZ, DE, DK, DM, DZ, EE, ES, FI, GB, GD, GE, GH, GM, HR, HU, ID, IL, IN, IS, JP, KE, KG, KP, KR, KZ, LC, LK, LR, LS, LT, LU, LV, MA, MD, MG, MK, MN, MW, MX, NO, NZ, PL, PT, RO, RU, SD, SE, SG, SI, SK, SL, TJ, TM, TR, TT, TZ, UA, UG, UZ, VN, YU, ZA, ZW

F-terms (reference): 5D015 EE04 KK00; 5D020 CC05; 5K027 DD07 DD10; 5K038 AA07 CC01 FF13; 5K046 HH01 HH18 HH24 HH78 HH79

[Continuation of abstract] ...is repeated to provide dynamically adjustable operation.

Claims


1. A method for generating an improved near-end audio signal, comprising the steps of: a) receiving an audio signal; b) generating an estimated acoustic echo signal; c) generating a processed signal by removing the estimated acoustic echo signal from the audio signal; d) determining a near-end improvement spectrum having a contiguous frequency range, the contiguous frequency range being associated with a relatively high echo return loss in the processed signal and having a magnitude greater than a predetermined threshold throughout the range; and e) filtering the processed signal in accordance with the near-end improvement spectrum, thereby generating the improved near-end audio signal.

2. The method of claim 1, further comprising the steps of: f) measuring how much energy is contained in the improved near-end audio signal; and g) detecting whether near-end voice activity is present based on the measured energy of the improved near-end audio signal.

3. The method of claim 1, further comprising the step of: f) recognizing near-end speech contained in the improved near-end audio signal.

4. The method according to claim 1, wherein the steps a) to e) are repeated periodically.

5. The method of claim 1, wherein the step of determining the near-end improvement spectrum comprises determining the near-end improvement spectrum as a function of a weighted spectrum W, the weighted spectrum being determined according to [equation not reproduced in the source text], where Γ is the spectrum of the acoustic echo generated from the far-end signal; E is the echo return loss enhancement (ERLE) spectrum representing the echo cancellation performance of step c); N is the spectrum of the processed signal; S is an echo spread spectrum representing the spectral spread characteristics of the echo path; Γ_max = max(Γ); E_max = max(E); S_max = max(S); and α, β, and γ are constants with α + β + γ > 0.

6. The method according to claim 5, wherein α + β + γ = 1.

7. The method of claim 5, wherein the step of determining the near-end improvement spectrum as a function of the weighted spectrum comprises determining the near-end improvement spectrum according to [equation not reproduced in the source text], where Speech_min(i) is the i-th frequency at which N rises above a predetermined threshold, Speech_max(i) is the i-th frequency at which N falls below the predetermined threshold, and Spectrum_totalmax is the maximum frequency of interest in the weighted spectrum W(f).

8. An improved near-end audio signal generator, comprising: a) means for receiving an audio signal; b) means for generating an estimated acoustic echo signal; c) means for generating a processed signal by removing the estimated acoustic echo signal from the audio signal; d) means for determining a near-end improvement spectrum having a contiguous frequency range, the contiguous frequency range being associated with a relatively high echo return loss in the processed signal and having a magnitude greater than a predetermined threshold throughout the range; and e) a filter that filters the processed signal in accordance with the near-end improvement spectrum, thereby generating the improved near-end audio signal.

9. The improved near-end audio signal generator of claim 8, further comprising: f) means for measuring how much energy is contained in the improved near-end audio signal; and g) means for detecting whether near-end voice activity is present based on the measured energy of the improved near-end audio signal.

10. The improved near-end audio signal generator of claim 8, further comprising: f) a speech recognizer coupled to receive the improved near-end audio signal.

11. The improved near-end audio signal generator of claim 8, wherein components a) to e) operate repeatedly on a periodic basis.

12. The improved near-end audio signal generator of claim 8, wherein the means for determining the near-end improvement spectrum includes means for determining the near-end improvement spectrum as a function of a weighted spectrum W, the weighted spectrum being determined according to [equation not reproduced in the source text], where Γ is the spectrum of the acoustic echo generated from the far-end signal; E is the echo return loss enhancement (ERLE) spectrum representing the echo cancellation performance of the means for generating the processed signal; N is the spectrum of the processed signal; S is an echo spread spectrum representing the spectral spread characteristics of the echo path; Γ_max = max(Γ); E_max = max(E); S_max = max(S); and α, β, and γ are constants with α + β + γ > 0.

13. The improved near-end audio signal generator of claim 12, wherein α + β + γ = 1.

14. The improved near-end audio signal generator of claim 12, wherein the means for determining the near-end improvement spectrum as a function of the weighted spectrum determines the near-end improvement spectrum according to [equation not reproduced in the source text], where Speech_min(i) is the i-th frequency at which N rises above a predetermined threshold, Speech_max(i) is the i-th frequency at which N falls below the predetermined threshold, and Spectrum_totalmax is the maximum frequency of interest in the weighted spectrum W(f).
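
As an informal illustration of steps e) to g) recited in claims 1 and 2, the following Python sketch filters one frame of an echo-cancelled signal with a per-bin gain and then makes an energy-based activity decision. It is a minimal sketch under assumptions that are not stated in the claims: frame-wise FFT processing, a binary (0 or 1) gain per frequency bin, a fixed energy threshold, and the hypothetical names `band_mask`, `filter_frame`, and `detect_near_end`. The weighted spectrum of claim 5 is not modelled, because its equation is not reproduced in this text.

```python
import numpy as np

def band_mask(freqs, bands):
    """Per-bin gain: 1.0 inside any (low, high) band, 0.0 elsewhere (assumed binary shape)."""
    mask = np.zeros(len(freqs))
    for low, high in bands:
        mask[(freqs >= low) & (freqs <= high)] = 1.0
    return mask

def filter_frame(processed_frame, mask):
    """Step e): apply the per-bin gain to one frame of the echo-cancelled signal."""
    spectrum = np.fft.rfft(processed_frame)
    return np.fft.irfft(spectrum * mask, n=len(processed_frame))

def detect_near_end(enhanced_frame, energy_threshold):
    """Steps f) and g): measure mean-square energy and compare it with a threshold."""
    energy = float(np.mean(np.square(enhanced_frame)))
    return energy > energy_threshold, energy

# Assumed test frame: a strong 500 Hz residual echo plus a weaker 2000 Hz near-end tone.
fs, n = 8000, 256
t = np.arange(n) / fs
frame = 1.0 * np.sin(2 * np.pi * 500 * t) + 0.5 * np.sin(2 * np.pi * 2000 * t)

freqs = np.fft.rfftfreq(n, d=1 / fs)
mask = band_mask(freqs, [(1500.0, 2500.0)])  # pass band chosen only for illustration
active, energy = detect_near_end(filter_frame(frame, mask), energy_threshold=0.01)
```

In this example the pass band excludes the 500 Hz component, so the measured energy reflects only the 2000 Hz tone. A frame containing the 500 Hz component alone would fall below the threshold and be reported as inactive.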