JP6970422B2

JP6970422B2 - Acoustic signal processing device, acoustic signal processing method, and acoustic signal processing program

Info

Publication number: JP6970422B2
Application number: JP2017108148A
Authority: JP
Inventors: 薫鈴木; 有志武田
Original assignee: Tokyo Metropolitan Industrial Technology Research Institute (TIRI)
Current assignee: Tokyo Metropolitan Industrial Technology Research Institute (TIRI)
Priority date: 2017-05-31
Filing date: 2017-05-31
Publication date: 2021-11-24
Anticipated expiration: 2037-05-31
Also published as: JP2018207221A

Description

The present invention relates to a device, a method, and a program for removing acoustic echoes.

In a device that picks up a user's voice with a microphone and outputs a predetermined system voice from a speaker, the system voice output from the speaker often propagates through the air or another medium and is picked up by the microphone. The sound picked up by the microphone that originates from the system voice is called the echo of the system voice.

For example, in a robot that recognizes a user's voice and provides a service, the system voice uttered by the robot returns as an echo and is picked up by the robot's own microphone. Even though the user has not spoken, the robot mistakes this echo for the user's voice and starts some kind of response.

Therefore, the echo of the system voice mixed into the microphone input must be removed so that such erroneous responses do not occur. Because the system voice that the robot outputs from the speaker is known, the echo of the system voice can be erased from the microphone input by a function generally called an echo eraser (echo canceller).

The echo eraser uses the learning identification method (Normalized Least Mean Squares algorithm) or a similar method to make an adaptive filter learn the transfer function of the propagation path of the system voice (hereinafter, the echo path). It multiplies the system voice by the filter coefficients to generate a sound that simulates the echo (hereinafter, the echo replica), and subtracts this echo replica from the microphone input to generate an output sound from which the echo has been erased (hereinafter, the error output). If the adaptive filter learns well, the output sound should contain no echo.

However, when non-echo voice (the user's voice, environmental noise, and so on) enters the microphone together with the echo (the state with non-echo voice), the adaptive filter does not learn well, which can leave residual echo or distort the user's voice. To avoid this, a detector for the state with non-echo voice is provided, and when that state is detected, learning of the adaptive filter is stopped or slowed.

Patent Document 1 discloses a call determination device that determines the call state (such as the presence or absence of non-echo voice) from the levels of, and correlations among, the voice output from the speaker, the echo replica, and the microphone input. Because a station that determines the call state on its own may err, this device must compare the determination results made at both stations to determine the call state.

Patent Document 2 discloses a technique, for a signal adaptation processing device and an echo suppression device, that switches learning of the adaptive filter ON and OFF according to the power of the output voice after echo erasing when the adaptive filter is trained by the learning identification method or the like. If the voice level after the echo component has been erased from the microphone input exceeds a predetermined threshold, the technique regards the microphone input as containing voice other than echo and detects the state with non-echo voice.

With this technique, learning of the adaptive filter is switched ON and OFF depending on the output voice after echo erasing, so correct ON/OFF control is possible once the adaptive filter has learned correctly to some extent.

However, in the initial stage of learning, when the filter has not yet adapted sufficiently, this ON/OFF control makes errors. As a result, an incorrect filter may be learned and learning may slow down. This is because non-echo voice detection and adaptive filter learning depend on each other, like the chicken and the egg.

Patent Document 1: Japanese Unexamined Patent Publication No. 2008-060938
Patent Document 2: Japanese Unexamined Patent Publication No. 08-065214

The present invention was completed through intensive research focused on this problem. Its object is to determine promptly and correctly whether the microphone input contains voice other than echo (such as the user's voice), based on voice that contains no echo (or in which the echo has been weakened), while eliminating the interdependence between non-echo voice detection and adaptive filter learning.

To solve the above problem, the present invention is an acoustic signal processing device comprising: a first conversion unit that converts an audio signal, before it is output from a speaker, into first spectral data; a second conversion unit that converts an audio signal input from a microphone into second spectral data; a determination unit that determines the presence or absence of non-echo voice based on the first spectral data and the second spectral data; and an echo erasing unit that receives the first spectral data and the second spectral data and erases echo using an adaptive filter, wherein, when the non-echo voice is present, the echo erasing unit makes a coefficient indicating the learning strength of the adaptive filter lower than when the non-echo voice is absent.

According to the present invention, whether the microphone input contains voice other than echo (such as the user's voice) can be determined promptly and correctly.

A functional block diagram of the acoustic signal processing device according to Example 1 of the present invention.
A functional block diagram of the echo erasing unit 6 according to Example 1 of the present invention.
A functional block diagram of the non-echo voice presence/absence determination unit 9 according to Example 1 of the present invention.
A flowchart showing the processing flow of the acoustic signal processing device according to Example 1 of the present invention.
A hardware configuration diagram of the acoustic signal processing device according to Example 1 of the present invention.
A diagram showing the results of processing by the acoustic signal processing device according to Example 1 of the present invention.
A functional block diagram of the acoustic signal processing device according to Example 2 of the present invention.
A flowchart showing the processing flow of the acoustic signal processing device according to Example 2 of the present invention.
A diagram showing the results of processing by the acoustic signal processing device according to Example 2 of the present invention.

Embodiments of the present invention will be described with reference to the drawings. Parts common to the figures are given the same reference numerals, and duplicate descriptions are omitted.

FIG. 1 is a functional block diagram of the acoustic signal processing device according to Example 1 of the present invention. This example describes an acoustic signal processing device applied to a robot that recognizes a user's voice and provides a service. Here, the system voice means the voice uttered by the robot.

(Configuration)
The speaker 1 outputs the system voice signal x(t) as sound. The microphone 2 is a microphone for inputting the user's voice and other sound as an audio signal m(t). The echo 3 is the echo of the system voice output from the speaker 1.

The echo propagation path H is the path (echo path) along which the echo 3 of the system voice reaches the microphone 2. Here, t is an index representing time in units of the sampling period of the audio signal.

The first frequency decomposition unit 4 converts the time-domain audio signal x(t) of the system voice into frequency-domain spectral data x(ω, f) by FFT (Fast Fourier Transform) processing. In other words, the first frequency decomposition unit 4 is a first conversion unit that converts the audio signal x(t), before it is output from the speaker 1, into the first spectral data x(ω, f).

Here, ω is an index representing the frequency bin number of the FFT output. In FFT processing, an analysis window of a predetermined number of samples (frame length FL) is shifted by a predetermined number of samples (frame shift amount FS) at a time, and the time-domain signal inside the window is converted into frequency-domain spectral data. This is the FFT processing unit (frame), and f is an index representing time counted in frames (the frame number). The first spectral data x(ω, f), obtained as a complex number at time f, is two-dimensional vector data consisting of the scalar value of the real part and the scalar value of the imaginary part. The length and direction of the vector represent the amplitude and phase of the frequency component denoted by ω.
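The framing described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the patent: the function names `hann_window`, `dft`, and `frames_to_spectra` are hypothetical, a naive DFT stands in for the FFT, and the small FL and FS values in the usage lines are chosen only to keep the example short.

```python
import cmath
import math


def hann_window(length):
    """Periodic Hann window of the given length."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / length) for n in range(length)]


def dft(frame):
    """Naive O(N^2) DFT; an FFT computes the same values faster."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def frames_to_spectra(signal, fl, fs):
    """Slide a window of FL samples by FS samples; return spectra[f][omega]."""
    window = hann_window(fl)
    spectra = []
    start = 0
    while start + fl <= len(signal):
        frame = [signal[start + n] * window[n] for n in range(fl)]
        spectra.append(dft(frame))
        start += fs
    return spectra


# Usage: a sinusoid with exactly 2 cycles per 8-sample frame peaks in bin 2.
signal = [math.cos(2.0 * math.pi * 2 * t / 8) for t in range(20)]
spectra = frames_to_spectra(signal, fl=8, fs=4)
amplitude = [abs(c) for c in spectra[0]]      # vector length = amplitude
phase = [cmath.phase(c) for c in spectra[0]]  # vector direction = phase
```

Each entry of `spectra[f]` is a Python complex number, which matches the description of the spectral data as a two-dimensional vector of real and imaginary parts.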

Similarly, the audio signal m(t) input from the microphone 2 is converted into frequency-domain spectral data m(ω, f) by the second frequency decomposition unit 5. In other words, the second frequency decomposition unit 5 is a second conversion unit that converts the audio signal m(t) input from the microphone 2 into the second spectral data m(ω, f).

FIG. 2 is a functional block diagram of the echo erasing unit 6 according to Example 1 of the present invention. The echo erasing unit 6 consists of an adaptive filter 11 and a subtractor 12. It takes as input the second spectral data m(ω, f), which is the microphone input, and the first spectral data x(ω, f), which is the system voice, and erases the echo component from m(ω, f) by calculating the error output e(ω, f) from equation (1). Here, y(ω, f) is the echo replica, calculated by multiplying the first spectral data x(ω, f) by the filter coefficient w(ω, f).

The filter coefficient w(ω, f) in equation (1) learns the transfer characteristic of the echo path H by the learning identification method (Normalized Least Mean Squares algorithm) shown in equation (2). Here, * (asterisk) denotes the complex conjugate, and μ is the step size that controls the learning speed.
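Equations (1) and (2) are not reproduced in this text, so the following sketch uses the standard per-bin form of NLMS that is consistent with the description: the echo replica is y = w·x, the error output is e = m − y, and the coefficient moves by μ·e·x* normalized by the power of x. The function name `nlms_step` and the small regularizer `delta` are assumptions added for illustration and do not come from the patent.

```python
def nlms_step(w, x, m, mu, delta=1e-8):
    """One per-bin NLMS update on complex spectral values.

    Assumed forms (the original equation images are not available):
      y = w * x                                        echo replica
      e = m - y                                        error output
      w_next = w + mu * e * conj(x) / (|x|^2 + delta)  coefficient update
    delta is a small regularizer added here to avoid division by zero.
    """
    y = w * x
    e = m - y
    w_next = w + mu * e * x.conjugate() / (abs(x) ** 2 + delta)
    return e, w_next


# Usage: one bin whose echo path is the complex gain h (echo only, no user voice).
h = 0.5 - 0.3j
w = 0j
inputs = [1 + 1j, -2 + 0.5j, 0.3 - 1.2j, 1.5 + 0j] * 10
for x in inputs:
    e, w = nlms_step(w, x, m=h * x, mu=0.5)
```

With echo-only input, `w` approaches `h` and the error output shrinks toward zero, which is the behavior the text describes as successful learning.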

The learning identification method (NLMS algorithm) is a stochastic gradient algorithm based on the minimum mean square error criterion. Learning of the filter coefficient w(ω, f) with it therefore always proceeds toward the w(ω, f) that minimizes the power of the components of the error output e(ω, f) that correlate with the first spectral data x(ω, f). Consequently, if the second spectral data m(ω, f) contains components other than echo, such as the user's voice, the filter is trained to erase even part of the user's voice (the frequency components that are also contained in x(ω, f)).

However, the adaptive filter 11 learned in this way does not hold correct values, so it causes residual echo and distortion of the user's voice. It is therefore necessary to detect the situation in which the second spectral data m(ω, f) contains voice other than echo (the state with non-echo voice) and to stop or weaken learning of the adaptive filter 11.

To achieve this, the echo erasing unit 6 controls the value of the step size μ according to whether DT(f), described later, is 0 (state without non-echo voice) or 1 (state with non-echo voice). That is, μ is made smaller in the state with non-echo voice than in the state without it, so the learning strength (also called the learning speed) of the adaptive filter 11 is kept low during that time. In short, the step size μ is a coefficient indicating the learning strength of the adaptive filter 11.
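The step-size control can be sketched as a simple selection on DT(f). The two constants below are hypothetical values chosen for illustration; the text only states that μ is smaller when non-echo voice is present. The update form in `update_coefficient` is the standard NLMS form and is likewise an assumption, since equation (2) is not reproduced here.

```python
MU_WITHOUT_NON_ECHO = 0.5   # hypothetical step size when DT(f) = 0
MU_WITH_NON_ECHO = 0.05     # hypothetical reduced step size when DT(f) = 1


def select_step_size(dt):
    """Return the step size mu for the current frame, given DT(f) in {0, 1}."""
    if dt not in (0, 1):
        raise ValueError("DT(f) must be 0 or 1")
    return MU_WITH_NON_ECHO if dt == 1 else MU_WITHOUT_NON_ECHO


def update_coefficient(w, x, e, dt, delta=1e-8):
    """Assumed NLMS-style update of one bin, with mu chosen from DT(f)."""
    mu = select_step_size(dt)
    return w + mu * e * x.conjugate() / (abs(x) ** 2 + delta)


# Usage: the same error moves the coefficient less while non-echo voice is present.
moved_without = abs(update_coefficient(0j, 1 + 0j, 1 + 0j, dt=0))
moved_with = abs(update_coefficient(0j, 1 + 0j, 1 + 0j, dt=1))
```

Setting the reduced value to 0 would stop learning entirely during non-echo voice, which corresponds to the "stop" option mentioned in the text.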

The non-echo voice presence/absence determination unit 9 is a functional block that determines whether the microphone input contains voice other than echo (such as the user's voice), so that the echo erasing unit 6 can stop or weaken learning of the adaptive filter 11. Details are given later; making this determination more promptly and correctly improves the performance of the echo erasing unit 6.

However, even with such an echo erasing unit 6, residual echo may remain. Residual echo can degrade accuracy when, for example, the voice after echo erasing is passed to speech recognition. In this example, therefore, a residual echo suppression unit 7 is provided after the echo erasing unit 6. Note, however, that the effect of residual echo depends on the requirements of downstream processing such as speech recognition, so the residual echo suppression unit 7 is not essential to this example.

The residual echo suppression unit 7 shown in FIG. 1 takes the error output e(ω, f) and the first spectral data x(ω, f) as input and, according to equation (3), generates the voice o2(ω, f) in which the echo component remaining in e(ω, f) is suppressed.

As shown in equation (4), o1(ω, f) is the voice obtained by multiplying the amplitude of e(ω, f) by G. G is a coefficient that approximates the proportion of residual echo contained in the error output e(ω, f), and its value is determined experimentally. DS(ω, f) is the instantaneous value of the suppression coefficient, and gain(ω, f) is the moving average of DS(ω, f), calculated approximately using a forgetting factor. gs is a coefficient that sets the strength of suppression.

The waveform generation unit 8 shown in FIG. 1 generates the time-domain waveform signal O(t) by applying inverse FFT processing to the voice o2(ω, f) from the residual echo suppression unit 7. This O(t) is the final output audio signal of this example.

Next, the non-echo voice presence/absence determination unit 9 shown in FIG. 1 will be described. FIG. 3 is a functional block diagram of the non-echo voice presence/absence determination unit 9 according to Example 1 of the present invention. The non-echo voice presence/absence determination unit 9 includes an echo suppression unit 21, a waveform generation unit 22, and a determination unit 23.

The echo suppression unit 21 takes the second spectral data m(ω, f) and the first spectral data x(ω, f) as input and, according to equation (5), obtains the voice s(ω, f) in which the echo component contained in m(ω, f) is suppressed.

Here, gain(ω, f) in equation (5) is the suppression coefficient calculated by equation (6) below.

In equation (6), MR(ω, f) is the moving average of the gain from the speaker 1 to the microphone 2, and EL(ω, f) is the echo magnitude estimated from MR(ω, f). NL(ω, f) is the current magnitude of the non-echo voice calculated from EL(ω, f), and gs is a coefficient that sets the strength of suppression. FL(ω, f) is the quantity that gives the lower limit of NL(ω, f) and is determined from the second spectral data m(ω, f). From these, XX(ω, f) is calculated as the current magnitude of the non-echo voice by flooring NL(ω, f) at FL(ω, f). Then gain(ω, f) is calculated as the ratio of the non-echo voice magnitude XX(ω, f) to the second spectral data m(ω, f).
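Equations (5) and (6) are not reproduced in this text, so the sketch below is only one plausible reading of the quantities just described, for a single bin. Everything marked "assumed" is a guess made for illustration: the exponential moving average for MR, the subtraction used for NL, the floor FL as a fixed fraction of |m|, and the parameters `alpha`, `beta`, and `eps`. The function name `suppress_echo_bin` is hypothetical.

```python
def suppress_echo_bin(m, x, mr_prev, gs=1.0, alpha=0.9, beta=0.1, eps=1e-12):
    """Amplitude-only echo suppression for one bin (assumed formulas throughout)."""
    m_abs = abs(m)
    x_abs = abs(x)
    # MR: moving average of the speaker-to-microphone amplitude gain
    # (assumed: exponential average of |m| / |x|).
    if x_abs > eps:
        mr = alpha * mr_prev + (1.0 - alpha) * (m_abs / x_abs)
    else:
        mr = mr_prev
    el = mr * x_abs          # EL: echo magnitude estimated from MR (assumed product)
    nl = m_abs - gs * el     # NL: non-echo magnitude (assumed subtraction)
    fl = beta * m_abs        # FL: lower limit derived from m (assumed fraction)
    xx = max(nl, fl)         # XX: NL floored at FL
    gain = xx / m_abs if m_abs > eps else 0.0  # real-valued ratio XX / |m|
    s = gain * m             # s = gain * m: amplitude scaled, phase unchanged
    return s, gain, mr


# Usage: an echo-only bin is pushed down to the floor, while a bin that also
# carries other sound keeps more of its amplitude.
s_echo, gain_echo, _ = suppress_echo_bin(m=1 + 0j, x=2 + 0j, mr_prev=0.5)
s_mixed, gain_mixed, _ = suppress_echo_bin(m=1 + 3j, x=2 + 0j, mr_prev=0.5)
```

Because `gain` is real, `s` has the same phase as `m`; only the amplitude changes, which is the property the text relies on for a fast, amplitude-based decision.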

The s(ω, f) calculated by equation (5) is converted into the time-domain audio signal s(t) by the waveform generation unit 22 in the next stage.

In the determination unit 23 that follows, as shown in equation (7), the number S of samples whose absolute amplitude |s(t)| is at or above the threshold th2 is counted within the latest frame shift's worth of s(t). When S is at or above the threshold th1, frame f is judged to be in the state with non-echo voice and DT(f) = 1 is output. Otherwise, the frame is judged to be in the state without non-echo voice and DT(f) = 0 is output.
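The counting rule described for equation (7) can be sketched directly. The function name `detect_non_echo` and the threshold values in the usage lines are hypothetical; the text does not give numeric values for th1 or th2.

```python
def detect_non_echo(s_latest, th1, th2):
    """Return DT(f): 1 if at least th1 samples satisfy |s(t)| >= th2, else 0.

    s_latest holds the latest frame shift's worth (FS samples) of s(t).
    """
    count = sum(1 for sample in s_latest if abs(sample) >= th2)
    return 1 if count >= th1 else 0


# Usage with hypothetical thresholds th1 = 10 samples and th2 = 0.1.
short_spike = [0.0] * 157 + [0.9] * 3   # loud but only 3 samples long
sustained = [0.2] * 160                 # quieter but present throughout
dt_spike = detect_non_echo(short_spike, th1=10, th2=0.1)
dt_sustained = detect_non_echo(sustained, th1=10, th2=0.1)
```

Counting samples instead of taking a peak or an energy value means a brief burst does not trigger DT(f) = 1, while sustained sound does.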

In this way, applying echo suppression to the second spectral data m(ω, f) suppresses the echo contained in s(ω, f) faster than removing it with the adaptive filter 11 would.

In addition, s(ω, f) is first converted into the time-domain waveform s(t), and the amplitude values in that waveform at or above the threshold th2 are counted and thresholded. This makes the determination robust against echo in the early stage of suppression, which may be large in amplitude but does not last, and against sudden noise other than echo.

The echo suppression unit 21 in the non-echo voice presence/absence determination unit 9 can suppress echo quickly because gain(ω, f) in equation (5) is obtained quickly. Since gain(ω, f) is a real number, the echo suppression unit 21 changes only the amplitude of the second spectral data m(ω, f) and leaves its phase unchanged. This is not precise, but it is sufficient for judging the presence or absence of non-echo voice from amplitude. In short, the echo suppression unit 21 suppresses the echo contained in m(ω, f) with emphasis on speed rather than precision. The echo erasing unit 6, by contrast, erases the echo contained in m(ω, f) with emphasis on precision. Because both y(ω, f) and w(ω, f) in equation (1) are complex numbers, the echo erasing unit 6 controls both the amplitude and the phase of m(ω, f) and performs high-precision echo erasing. However, equation (2) must be iterated many times before a w(ω, f) that makes the echo disappear is obtained, so this takes time.

FIG. 4 is a flowchart showing the processing flow of the acoustic signal processing device according to Example 1 of the present invention. When the acoustic signal processing device of this example is started, the initialization step S1 is executed first. This step initializes the time index t and the frame number f to 0.

In the FS sample input step S2 that follows, FS samples (one frame shift) of each of the audio signals m(t) and x(t) are input.

Next, the FL sample accumulation determination step S3 determines whether the number of samples of each of m(t) and x(t) input so far is at least the frame length FL, which is the length of the FFT analysis window. If the number of samples input so far is less than FL, the subsequent FFT processing cannot be performed, so processing branches to the left in the figure (No) and executes the dummy output generation step S9. Otherwise, processing branches downward in the figure (Yes) and executes the frequency decomposition step S4.

In the dummy output generation step S9, either the microphone input signal is output as is, for example by setting the output audio signal O(t) = m(t), or silence is output.

The frequency decomposition step S4 corresponds to the first frequency decomposition unit 4 and the second frequency decomposition unit 5. It converts the input audio signals x(t) and m(t) into the first spectral data x(ω, f) and the second spectral data m(ω, f), respectively.

The non-echo voice detection step S5 corresponds to the non-echo voice presence/absence determination unit 9. It determines the value of DT(f) from the second spectral data m(ω, f) and the first spectral data x(ω, f) by calculating equations (5), (6), and (7).

The echo erasing step S6 corresponds to the echo erasing unit 6. It calculates e(ω, f) using equations (1) and (2) and updates the filter coefficient w(ω, f) while controlling the step size μ based on DT(f).

The residual echo suppression step S7 corresponds to the residual echo suppression unit 7. Using equations (3) and (4), it calculates the voice o2(ω, f) in which the residual echo in e(ω, f) is suppressed.

The output generation step S8 corresponds to the waveform generation unit 8. It calculates the output audio signal O(t) from o2(ω, f) by inverse FFT processing. After either the dummy output generation step S9 or the output generation step S8 is executed, processing returns to the FS sample input step S2. At that point, the time index t is increased by FS and the frame number f is increased by 1.
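The control flow of steps S1 to S9 can be sketched as a loop over pre-recorded signals. This is an illustration under assumptions, not the patent's program: `run_pipeline` is a hypothetical name, and `process_frame` is a hypothetical callback standing in for steps S4 to S8 that must return FS output samples. The dummy output here passes the microphone input through unchanged, which is one of the two options the text allows.

```python
def run_pipeline(m_signal, x_signal, fl, fs, process_frame):
    """Drive steps S1 to S9 over two signals of equal length."""
    t = 0                     # S1: time index
    f = 0                     # S1: frame number
    m_buf, x_buf = [], []
    output = []
    while t + fs <= len(m_signal):
        # S2: read FS new samples of m(t) and x(t).
        m_new = m_signal[t:t + fs]
        m_buf.extend(m_new)
        x_buf.extend(x_signal[t:t + fs])
        if len(m_buf) < fl:
            # S3 "No" -> S9: dummy output, here the microphone input as is.
            output.extend(m_new)
        else:
            # S3 "Yes" -> S4 to S8 on the latest FL samples.
            del m_buf[:-fl]
            del x_buf[:-fl]
            output.extend(process_frame(list(m_buf), list(x_buf), f))
        # Both branches return to S2 with t advanced by FS and f by 1.
        t += fs
        f += 1
    return output


# Usage with FL = 512 and FS = 160 and a placeholder that outputs silence.
m_signal = [float(i) for i in range(800)]
x_signal = [0.0] * 800
frames_seen = []


def silent_frame(m_window, x_window, f):
    frames_seen.append(f)
    return [0.0] * 160


out = run_pipeline(m_signal, x_signal, fl=512, fs=160, process_frame=silent_frame)
```

With these sizes the first three passes take the dummy branch, because 160, 320, and 480 accumulated samples are all below 512, and real processing starts on the fourth pass.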

FIG. 5 is a hardware configuration diagram of the acoustic signal processing device according to Example 1 of the present invention. This example is not limited to the acoustic signal processing device shown by the functional blocks of FIGS. 1 to 3 or to the acoustic signal processing method shown by the flowchart of FIG. 4. For example, it can also be implemented as a program that causes a computer to function as the acoustic signal processing device of FIG. 1 or to execute the processing procedure of the acoustic signal processing method of FIG. 4.

Specifically, this example can be implemented using a computer as shown in FIG. 5. A RAM (Random Access Memory) 104, a ROM (Read Only Memory) 105, an HDD (Hard Disk Drive) 106, a LAN (Local Area Network) 107, a mouse/keyboard 108, and a display 109 are connected to a CPU (Central Processing Unit) 103. These are the general elements that make up a computer.

The other storage 110 consists of drives for supplying programs and data to the computer from outside via storage media: specifically, an optical disk drive, a magnetic disk drive, a CF (CompactFlash)/SD (Secure Digital) card slot, USB (Universal Serial Bus) memory, and the like.

The microphone 101 and the speaker 112 correspond to the microphone 2 and the speaker 1 shown in FIG. 1. The microphone 113 corresponds to the microphone for noise input described later in Example 2.

The microphone 101 and the microphone 113 convert sound waves into electrical signals, which the A/D converter 102 and the A/D converter 114 convert into digital data. The CPU 103 processes the digital data from the A/D converter 102 and the A/D converter 114 in the course of executing program instructions.

In the computer device shown in FIG. 5, an acoustic signal processing program that executes the processing steps shown in FIG. 4 is stored in the HDD 106, read into the RAM 104, and executed by the CPU 103. The microphone 101 and the A/D converter 102 are used to input the audio signal m(t) containing the user's voice, and the D/A converter 111 and the speaker 112 are used to output the system voice x(t) as sound. The microphone 113 and the A/D converter 114 are used to input the noise signal n(t) described later in Example 2. The CPU 103 processes m(t) and x(t), or m(t), x(t), and n(t), to generate and output the output voice O(t).

As a result, the computer device shown in FIG. 5 functions as the acoustic signal processing device according to this embodiment. The computer device can also be supplied with the acoustic signal processing program from a recording medium inserted into the other storage 110 or from another device connected via the LAN 107.

The computer device can also accept operation input from the user and present information to the user via the mouse/keyboard 108 and the display 109. When the computer device is applied not only as an acoustic signal processing device but also to a robot that recognizes the user's voice and provides a service, elements that are unnecessary while the service is provided, such as the mouse/keyboard 108, can be detached from the computer device.

FIG. 6 shows the results of processing by the acoustic signal processing device according to the first embodiment of the present invention. The microphone input signal m(t) in panel (a) is converted into the second spectral data m(ω,f) by the second frequency decomposition unit 5.

The user's voice and the system voice echo are mixed in this microphone input signal m(t). The system voice x(t) is likewise converted into the first spectral data x(ω,f) by the first frequency decomposition unit 4. The FFT and inverse FFT use the Cooley-Tukey DFT algorithm with a frame length FL of 512 samples, a frame shift FS of 160 samples, and a Hanning window.
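The framing described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: only FL = 512 and FS = 160 come from the text, while the function names, the periodic form of the Hanning window, and the decision to drop a trailing partial frame are assumptions. The FFT itself is omitted.

```python
import math

FL = 512  # frame length in samples (value given in the text)
FS = 160  # frame shift in samples (value given in the text)


def hann_window(length):
    """Periodic Hanning window; the exact variant used is an assumption."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / length) for i in range(length)]


def split_into_frames(signal, frame_len=FL, frame_shift=FS):
    """Return windowed frames; frame f starts at sample f * frame_shift."""
    window = hann_window(frame_len)
    frames = []
    start = 0
    # Stop when a full frame no longer fits (assumed handling of the tail).
    while start + frame_len <= len(signal):
        frames.append([signal[start + i] * window[i] for i in range(frame_len)])
        start += frame_shift
    return frames


frames = split_into_frames([1.0] * 1600)
```

With these parameters consecutive frames overlap by 352 samples, so each input sample contributes to three or four analysis frames.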

Panel (b) shows the output audio signal O(t). The output o2(ω,f) obtained through the echo erasing unit 6 and the residual echo suppression unit 7 is converted by the waveform generation unit 8 into the time-domain signal O(t) and output. This is the output voice of the acoustic signal processing device of this embodiment.

Only the user's voice remains strongly in the output voice O(t). The graph of the output DT(f) of the non-echo voice presence/absence determination unit 9, shown in panel (c), also rises during the periods in which the user's voice is present.

The quantity m(t)/O(t) shown in panel (d) is an evaluation measure called ERLE (Echo Return Loss Enhancement). ERLE is defined by equation (8) below; it expresses in dB how much smaller the output power is than the input power, and a larger value indicates higher erasing performance. E[*] in the equation denotes an average computed over every n samples.
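Equation (8) is not reproduced in this text. Based on the description (input power relative to output power, in dB, with E[*] a block average), a commonly used form is ERLE = 10 log10(E[m(t)^2] / E[O(t)^2]). The Python sketch below assumes that form; the function name `erle_db` and the small constant `eps`, which guards against division by zero, are illustrative additions.

```python
import math


def erle_db(mic, out, eps=1e-12):
    """ERLE in dB for one block of n samples (assumed form of equation (8)).

    mic: block of microphone input samples m(t)
    out: block of output samples O(t), same length as mic
    """
    if len(mic) != len(out) or not mic:
        raise ValueError("blocks must be non-empty and of equal length")
    p_in = sum(v * v for v in mic) / len(mic)    # E[m(t)^2]
    p_out = sum(v * v for v in out) / len(out)   # E[O(t)^2]
    return 10.0 * math.log10((p_in + eps) / (p_out + eps))


# An output whose amplitude is one tenth of the input gives about 20 dB.
example = erle_db([1.0] * 100, [0.1] * 100)
```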

To evaluate echo erasing performance, ERLE must be calculated over periods in which only the system voice echo is present. Three such periods were selected from the graph and are labeled R1, R2, and R3 in the figure. Because only the system voice echo is present in all of these periods, DT(f) should not rise (indicating the non-echo-voice-absent state), and panel (c) confirms that it does not. The average ERLE in the early (R1), middle (R2), and late (R3) stages of learning is 67.7 dB, 85.9 dB, and 102.6 dB, respectively, all of which are high values.

(Effects)
According to this embodiment, the non-echo voice presence/absence determination unit 9 detects the situation in which the second spectral data m(ω,f) contains sound other than echo (the non-echo-voice-present state) and performs control that stops or weakens the learning of the adaptive filter 11. The embodiment therefore has the effect of determining quickly and correctly whether the microphone input contains sound other than echo (such as the user's voice).

Because the echo suppression unit 21 in the non-echo voice presence/absence determination unit 9 executes echo suppression processing that raises the amount of echo suppression faster than the echo erasing unit 6 does, the adaptive filter 11 can be trained quickly and correctly from the initial stage of its learning, which yields higher speed and higher accuracy. This is particularly suitable for mobile robots, whose echo conditions change frequently as they move.

When this embodiment is applied to a system that converses with the user by voice, the system no longer mistakenly recognizes its own voice, so wasted speech recognition processing is reduced. This saves CPU load and, when speech recognition on a cloud server is used, reduces the amount of communication.

The embodiment also widens the allowable range of the system's speaker volume and microphone sensitivity, for example by allowing the microphone sensitivity to be raised for users with quiet voices, and thus broadens the system's operating conditions.

Furthermore, even when the user and the system speak at the same time, only the user's voice can be extracted and recognized, so the user can speak to the system at any time, which improves the usability of the system.

The second embodiment describes an acoustic signal processing device that handles environmental noise. FIG. 7 is a functional block diagram of the acoustic signal processing device according to the second embodiment of the present invention. Reference numerals 1 to 9 in the figure denote the same functional blocks as in the first embodiment, and their description is omitted. The second embodiment adds functional blocks 31, 32, 33, and 34 to the configuration of the first embodiment. The following description focuses on these added blocks.

The microphone 31 is placed farther from both the speaker 1 and the user than the microphone 2 is. This placement is intended to ensure that the microphone 31 picks up the system voice and the user's voice only faintly. As a result, the microphone 31 picks up almost exclusively the surrounding environmental noise, which is input as the noise signal n(t). The noise signal n(t) is converted into frequency-domain data n(ω,f) by the third frequency decomposition unit 32. In other words, the third frequency decomposition unit 32 is a third conversion unit that converts the noise signal n(t) into the third spectral data n(ω,f).

Environmental noise, on the other hand, arrives from a relatively long distance (the sound of a train, for example), so it is mixed into the input signal m(t) of the microphone 2 at about the same level as at the microphone 31. Because the environmental noise mixed into m(t) is uncorrelated with the system voice x(t), it cannot be removed by the processing of the echo erasing unit 6 and the residual echo suppression unit 7.

Block 33 in FIG. 7 is an environmental noise suppression unit that suppresses this environmental noise. The environmental noise suppression unit 33 receives the output o2(ω,f) of the residual echo suppression unit 7 and the third spectral data n(ω,f), which is the noise data, and calculates, according to equation (9), the voice o3(ω,f) in which the noise component contained in o2(ω,f) is suppressed.

Here, gain(ω,f) in equation (9) is the suppression coefficient calculated by equation (10) below. This calculation is equation (6) with the flooring process removed.
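Equations (9) and (10) are not reproduced in this text, so the following Python sketch only illustrates the structure the description implies: a per-bin gain computed from o2(ω,f) and n(ω,f), then applied to o2(ω,f). The multiplicative form of equation (9) is an assumption, and `subtraction_gain` is a hypothetical placeholder (a magnitude spectral-subtraction gain, clipped at zero only to keep it non-negative), not the patent's equation (10).

```python
def suppress_noise(o2, n, gain_fn):
    """Apply a per-bin gain to one frame (assumed form: o3 = gain * o2).

    o2: complex spectrum after residual echo suppression
    n:  complex noise spectrum from the noise microphone
    gain_fn: callable(o2_bin, n_bin) -> real gain for that bin
    """
    return [gain_fn(s, v) * s for s, v in zip(o2, n)]


def subtraction_gain(s, v):
    """Hypothetical stand-in for equation (10): (|o2| - |n|) / |o2|, >= 0."""
    mag = abs(s)
    if mag == 0.0:
        return 0.0
    return max(mag - abs(v), 0.0) / mag


o3 = suppress_noise([2 + 0j, 0.5j, 0j], [1 + 0j, 1 + 0j, 1 + 0j], subtraction_gain)
```

Passing the gain as a function keeps the sketch usable if the actual equation (10) is substituted for the placeholder.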

Block 34 in FIG. 7 is a minute frequency component suppression unit. It receives the output o3(ω,f) of the environmental noise suppression unit 33 and calculates, according to equation (11), the voice o4(ω,f) in which the minute frequency components of o3(ω,f), those whose amplitude is below a predetermined threshold, are suppressed.

Here, gain(ω,f) in equation (11) is the suppression coefficient calculated by equation (12) below. The value 0.01 in the equation is a non-negative value smaller than 1.0 that produces the suppression effect; another value such as 0.0 or 0.02 may be used instead.
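Equations (11) and (12) are not reproduced in this text. From the description, a plausible reading is that each bin whose amplitude is below the threshold is multiplied by 0.01 and every other bin is passed unchanged. The Python sketch below assumes that reading; the function name and the threshold value used in the example are hypothetical, since the text does not state the threshold.

```python
def suppress_small_bins(o3, threshold, small_gain=0.01):
    """Attenuate bins whose amplitude is below threshold (assumed eq. (11)/(12)).

    o3: complex spectrum after environmental noise suppression
    threshold: amplitude threshold (hypothetical value chosen by the caller)
    small_gain: gain for small bins; 0.01 in the text, 0.0 or 0.02 also allowed
    """
    return [s if abs(s) >= threshold else small_gain * s for s in o3]


o4 = suppress_small_bins([1 + 0j, 0.001 + 0j], threshold=0.01)
```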

In the audio signal processing device of this embodiment, the waveform generation unit 8 generates the time-domain output signal O(t) by applying inverse FFT processing to the output o4(ω,f) of the minute frequency component suppression unit 34. This O(t) is the output audio signal of the audio signal processing device of this embodiment.

FIG. 8 is a flowchart showing the processing flow of the acoustic signal processing device according to the second embodiment of the present invention. When the acoustic signal processing device of this embodiment is started, the initialization step S21 is executed first. This step initializes the time index t and the frame number f to 0.

In the subsequent FS-sample input step S22, FS samples (one frame shift) of each of the audio signals m(t), x(t), and n(t) are input.

Next, the FL-sample accumulation determination step S23 determines whether the number of samples of each of m(t), x(t), and n(t) input so far is at least the frame length FL, which is the length of the FFT analysis window. If the number of samples input so far is less than FL, the subsequent FFT processing cannot be performed, so the flow branches to the left in the figure (No) and executes the dummy output generation step S29. Otherwise, the flow branches downward in the figure (Yes) and executes the frequency decomposition step S24.

In the dummy output generation step S29, either the microphone input signal is output as is, for example by setting the output audio signal O(t) = m(t), or silence is output.

The frequency decomposition step S24 corresponds to the first frequency decomposition unit 4, the second frequency decomposition unit 5, and the third frequency decomposition unit 32. It converts the input audio signals x(t), m(t), and n(t) into the first spectral data x(ω,f), the second spectral data m(ω,f), and the third spectral data n(ω,f), respectively.

The non-echo voice detection step S25 corresponds to the non-echo voice presence/absence determination unit 9. It determines the value of DT(f) from the second spectral data m(ω,f) and the first spectral data x(ω,f) by calculating equations (5), (6), and (7).
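The final decision of this determination, as recited in claim 2, counts the time-domain samples whose amplitude reaches a threshold and judges non-echo voice to be present when the count itself reaches a second threshold. A minimal Python sketch of that counting rule follows. The function name and both threshold values are hypothetical, and the preceding echo suppression and waveform generation stages that produce the waveform are not shown.

```python
def non_echo_voice_present(waveform, amp_threshold, count_threshold):
    """Return True when enough samples reach the amplitude threshold.

    waveform: time-domain samples generated from the echo-suppressed spectrum
    amp_threshold: amplitude threshold (hypothetical value)
    count_threshold: minimum number of samples at or above amp_threshold
                     (hypothetical value)
    """
    count = sum(1 for v in waveform if abs(v) >= amp_threshold)
    return count >= count_threshold


dt = non_echo_voice_present([0.0, 0.5, -0.6, 0.1], amp_threshold=0.4, count_threshold=2)
```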

The echo erasing step S26 corresponds to the echo erasing unit 6. It calculates e(ω,f) using equations (1) and (2) and updates the filter coefficients w(ω,f) while controlling the step size μ based on DT(f).
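Equations (1) and (2) are not reproduced in this part of the text, so the following Python sketch substitutes a generic single-tap normalized LMS update per frequency bin to illustrate how DT(f) can gate the step size μ. The update rule, the function name, and the values `mu_normal`, `mu_reduced`, and `eps` are all assumptions; only the idea of lowering μ while non-echo voice is present comes from the document.

```python
def echo_cancel_frame(m, x, w, dt, mu_normal=0.5, mu_reduced=0.0, eps=1e-8):
    """Process one frame and update w in place (assumed NLMS form).

    m: microphone spectrum m(w,f); x: system voice spectrum x(w,f)
    w: per-bin filter coefficients; dt: True when non-echo voice is present
    """
    # Learning is stopped (0.0) or weakened while non-echo voice is present.
    mu = mu_reduced if dt else mu_normal
    errors = []
    for k in range(len(m)):
        err = m[k] - w[k] * x[k]  # error output e(w,f)
        errors.append(err)
        w[k] += mu * err * x[k].conjugate() / (abs(x[k]) ** 2 + eps)
    return errors


w = [0j]
e_first = echo_cancel_frame([2 + 0j], [1 + 0j], w, dt=False)   # filter adapts
w_after_learning = w[0]
e_second = echo_cancel_frame([2 + 0j], [1 + 0j], w, dt=True)   # filter frozen
```

Freezing or slowing the update during double talk keeps the user's voice from being treated as echo and corrupting the coefficients.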

The residual echo suppression step S27 corresponds to the residual echo suppression unit 7. Using equations (3) and (4), it calculates the voice o2(ω,f) in which the residual echo in e(ω,f) is suppressed.

The environmental noise suppression step S30 corresponds to the environmental noise suppression unit 33. Using equations (9) and (10), it calculates the voice o3(ω,f) in which the noise component in o2(ω,f) is suppressed.

The minute frequency component suppression step S31 corresponds to the minute frequency component suppression unit 34. Using equations (11) and (12), it calculates the voice o4(ω,f) in which the minute frequency components in o3(ω,f) are suppressed.

The output generation step S28 corresponds to the waveform generation unit 8. It calculates the output audio signal O(t) from o4(ω,f) by inverse FFT processing.

After the dummy output generation step S29 or the output generation step S28 is executed, processing returns to the FS-sample input step S22. At that point the time index t is increased by FS and the frame number f is increased by 1.
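The control flow of steps S21 to S29 can be sketched in Python as below. This is a simplified single-signal illustration: only m(t) is buffered, the dummy output passes the input through, and `process_frame` is a hypothetical stand-in for steps S24 through S28 that is assumed to return FS output samples per call.

```python
FL = 512  # frame length (value given in the text)
FS = 160  # frame shift (value given in the text)


def run_stream(m, process_frame, frame_len=FL, frame_shift=FS):
    """Run the input/accumulate/process loop over a list of samples m."""
    t = 0          # S21: time index
    f = 0          # S21: frame number
    buffered = []  # most recent samples, at most frame_len of them
    out = []
    while t + frame_shift <= len(m):
        chunk = m[t:t + frame_shift]                # S22: input FS samples
        buffered = (buffered + chunk)[-frame_len:]
        if len(buffered) < frame_len:               # S23: No branch
            out.extend(chunk)                       # S29: dummy output O(t) = m(t)
        else:                                       # S23: Yes branch
            out.extend(process_frame(buffered, f))  # S24 to S28 (stand-in)
        t += frame_shift                            # t advances by FS
        f += 1                                      # f advances by 1
    return out, t, f


# Stand-in processing that outputs silence for every full frame.
out, t_end, f_end = run_stream([1.0] * 1600, lambda frame, f: [0.0] * FS)
```

With FL = 512 and FS = 160, the first three iterations take the dummy branch because only 480 samples have accumulated; full processing starts on the fourth.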

The hardware configuration of the acoustic signal processing device according to the second embodiment is the same as that of the first embodiment described with reference to FIG. 5, and its description is omitted.

FIG. 9 shows the results of processing by the acoustic signal processing device according to the second embodiment of the present invention. The microphone input signal m(t) in panel (a) is converted into the second spectral data m(ω,f) by the second frequency decomposition unit 5. The user's voice, the system voice echo, and environmental noise are mixed in this signal. The system voice x(t) is likewise converted into the first spectral data x(ω,f) by the first frequency decomposition unit 4, and the environmental noise n(t) is converted into the third spectral data n(ω,f) by the third frequency decomposition unit 32. As in the first embodiment, the FFT and inverse FFT use the Cooley-Tukey DFT algorithm with a frame length FL of 512 samples, a frame shift FS of 160 samples, and a Hanning window.

Panel (b) shows the output audio signal O(t). The output o4(ω,f) obtained through the echo erasing unit 6, the residual echo suppression unit 7, the environmental noise suppression unit 33, and the minute frequency component suppression unit 34 is converted by the waveform generation unit 8 into the time-domain signal O(t). This is the output voice of this embodiment. Only the user's voice remains strongly in the output voice O(t). The graph of the output DT(f) of the non-echo voice presence/absence determination unit 9, shown in panel (c), indicates the non-echo-voice-present state over the entire range because of the environmental noise, and this is the correct response.

The quantity m(t)/O(t) shown in panel (d) shows how the ERLE value changes over time. ERLE should properly be calculated over periods in which only the system voice echo is present. Here, however, it was calculated over periods with no user voice, in which only the interfering sound (system voice echo plus environmental noise) is present, in order to measure the erasing performance against that interfering sound. The averages read from the graph for the early (R1), middle (R2), and late (R3) stages of learning are 133.5 dB, 109.4 dB, and 150.0 dB, respectively, all of which are high values. This erasing performance includes the suppression of environmental noise and minute frequency components in addition to echo erasing and suppression.

(Effects)
According to this embodiment, the effects described for the first embodiment are obtained as a matter of course. In addition, the system can be operated in noisy surroundings (for example, an exhibition hall), which broadens the system's operating conditions.

Embodiments of the present invention (including modifications) have been described above. Two or more of these embodiments may be combined, one of them may be implemented in part, or two or more of them may be partially combined.

The present invention is in no way limited to the above description of the embodiments. Various modifications that a person skilled in the art could readily conceive of without departing from the scope of the claims are also included in the invention. For example, the audio signal processing device of the present invention can also be applied as front-end processing for other guidance robots that have a voice conversation function.

1 Speaker
2, 31 Microphone
3 Echo
4 First frequency decomposition unit (first conversion unit)
5 Second frequency decomposition unit (second conversion unit)
6 Echo erasing unit
7 Residual echo suppression unit
8 Waveform generation unit
9 Non-echo voice presence/absence determination unit
11 Adaptive filter
12 Subtractor
21 Echo suppression unit
22 Waveform generation unit
23 Determination unit
32 Third frequency decomposition unit (third conversion unit)
33 Environmental noise suppression unit
34 Minute frequency component suppression unit

Claims

An acoustic signal processing device comprising:
a first conversion unit that converts an audio signal, before it is output from a speaker, into first spectral data;
a second conversion unit that converts an audio signal input from a microphone into second spectral data;
a non-echo voice presence/absence determination unit that determines the presence or absence of non-echo voice based on the first spectral data and the second spectral data; and
an echo erasing unit that receives the first spectral data and the second spectral data and calculates an error output using an adaptive filter for erasing echo,
wherein the echo erasing unit lowers a coefficient indicating the learning strength of the adaptive filter when the non-echo voice is present, compared with when the non-echo voice is absent, and
wherein the non-echo voice presence/absence determination unit has an echo suppression unit that suppresses an echo component of the second spectral data, a generation unit that generates a time-domain audio signal from the output of the echo suppression unit, and a determination unit that determines the presence or absence of the non-echo voice from the time-domain audio signal.

The acoustic signal processing device according to claim 1, wherein the determination unit calculates the number of data points at which the amplitude of the waveform data of the audio signal generated by the generation unit is equal to or greater than a predetermined threshold, and determines that the non-echo voice is present when that number is equal to or greater than a predetermined threshold.

The acoustic signal processing device according to claim 1, wherein the coefficient indicating the learning strength of the adaptive filter is a step size.

The acoustic signal processing device according to claim 1, further comprising a residual echo suppression unit after the echo erasing unit.

The acoustic signal processing device according to claim 1, further comprising:
a third conversion unit that converts an environmental noise signal input from another microphone into third spectral data; and
an environmental noise suppression unit that suppresses the third spectral data.

An acoustic signal processing method comprising:
a first conversion step of converting an audio signal, before it is output from a speaker, into first spectral data;
a second conversion step of converting an audio signal input from a microphone into second spectral data;
a non-echo voice presence/absence determination step of determining the presence or absence of non-echo voice based on the first spectral data and the second spectral data; and
an echo erasing step of receiving the first spectral data and the second spectral data and calculating an error output using an adaptive filter for erasing echo,
wherein the echo erasing step lowers a coefficient indicating the learning strength of the adaptive filter when the non-echo voice is present, compared with when the non-echo voice is absent, and
wherein the non-echo voice presence/absence determination step has an echo suppression step of suppressing an echo component of the second spectral data, a generation step of generating a time-domain audio signal from the output of the echo suppression step, and a determination step of determining the presence or absence of the non-echo voice from the time-domain audio signal.

An acoustic signal processing program that causes a computer to execute:
a first conversion step of converting an audio signal, before it is output from a speaker, into first spectral data;
a second conversion step of converting an audio signal input from a microphone into second spectral data;
a non-echo voice presence/absence determination step of determining the presence or absence of non-echo voice based on the first spectral data and the second spectral data; and
an echo erasing step of receiving the first spectral data and the second spectral data and calculating an error output using an adaptive filter for erasing echo,
wherein the echo erasing step lowers a coefficient indicating the learning strength of the adaptive filter when the non-echo voice is present, compared with when the non-echo voice is absent, and
wherein the non-echo voice presence/absence determination step has an echo suppression step of suppressing an echo component of the second spectral data, a generation step of generating a time-domain audio signal from the output of the echo suppression step, and a determination step of determining the presence or absence of the non-echo voice from the time-domain audio signal.