JP2005258215A

JP2005258215A - Signal processing method and signal processing device

Info

Publication number: JP2005258215A
Application number: JP2004071507A
Authority: JP
Inventors: Shigeki Sagayama; 茂樹嵯峨山; Takuya Nishimoto; 卓也西本; Takashi Okajima; 崇岡嶋; Masaru Kamamoto; 優鎌本
Original assignee: Individual
Current assignee: Individual
Priority date: 2004-03-12
Filing date: 2004-03-12
Publication date: 2005-09-22

Abstract

PROBLEM TO BE SOLVED: To estimate the short-time spectrum of a target sound signal while suppressing the influence of noise through geometric processing in the frequency domain. SOLUTION: The signal processing method uses multiple channels, each with a signal input part. It includes a step of converting the observation signal from each signal input part into the frequency domain, channel by channel, so that the spectrum of each channel's observation signal is obtained as a point on the complex plane for each frequency ω, and a step of estimating, for each frequency ω, a circle on the complex plane that passes through or as close as possible to those spectra, and taking the center of the estimated circle as the spectrum of the target signal. COPYRIGHT: (C)2005, JPO&NCIPI

Description

The present invention relates to a signal processing method and a signal processing device for removing the influence of noise on a target signal. In one preferred embodiment, the invention relates to a signal processing method using a microphone array, and more specifically to a method for estimating the short-time spectrum of a target signal by applying geometric processing in the frequency domain to the signals observed at the individual microphones of the array, thereby removing the influence of noise.

When an acoustic signal is picked up with a microphone in a real environment, the target signal is degraded by ambient noise and reverberation. Such degraded signals are not only hard for humans to hear; they also make machine speech recognition, pitch extraction, and other analyses more difficult, and performance is often lower than with undegraded signals. For this reason, methods for picking up less degraded acoustic signals are being actively studied.

Some of these methods use a microphone array (a sound receiver composed of a plurality of microphone elements). A microphone array can suppress the influence of noise by observing, with several microphones, the spatial information of the sound that results from the difference in physical position between the source of the target acoustic signal and the noise source. Various microphone array techniques have already been studied. Among them, the basic delay-and-sum (DS) array requires no learning, but its performance is not sufficient. Adaptive microphone arrays, typified by Griffiths-Jim and AMNOR, are trained in advance on a speech-free interval; they then automatically form a null in sensitivity toward the unknown noise direction and can receive the target acoustic signal at a high signal-to-noise ratio. However, when the environment changes from moment to moment, for example the positions of the signal and noise sources or the reverberation, the learning cannot track the environment in time and performance may deteriorate.

To remain effective in such a constantly changing environment and against sudden noise, what is needed is not a method that detects and suppresses the noise direction through long-term learning, but a method that suppresses noise using only the information in a very short segment of the signal.

The present invention focuses on the fact that the time differences between the signals received at the microphones of the array appear as phase differences in the frequency domain. Its object is to suppress the influence of noise by geometric processing in the frequency domain and to estimate the short-time spectrum of the target acoustic signal. A further object of the invention is to suppress the influence of noise in signals in general, not only in acoustic signals.

The technical means adopted by the present invention is a signal processing method using multiple channels, each provided with a signal input unit. The method comprises a step of converting the observation signal from each signal input unit into the frequency domain, channel by channel, thereby obtaining the spectrum of each channel's observation signal as a point on the complex plane for each frequency ω; and a step of estimating, for each frequency ω, a circle on the complex plane based on the spectra obtained for the channels, and estimating the center of the estimated circle (hereinafter called the "circle center") as the spectrum of the target signal. This method is called the circle center method. The signal processing device adopted by the invention comprises multiple channels, each provided with a signal input unit; a complex spectrum value calculation unit that converts the observation signal from each signal input unit into the frequency domain, channel by channel, and thereby obtains the spectrum of each channel's observation signal as a point on the complex plane for each frequency ω; and a target signal spectrum estimation unit that, for each frequency ω, estimates a circle on the complex plane based on the spectra obtained for the channels and estimates the center of the estimated circle as the spectrum of the target signal.

In one preferred embodiment, the multiple channels provided with signal input units are a microphone array composed of a plurality of microphones. Preferably, the observation signals are brought into phase, and the spectra are obtained by converting the in-phase observation signals into the frequency domain. Preferably, the spectrum is a short-time spectrum: the observation signal of each channel is cut out with a common short-time frame, and each cut-out segment is converted into the frequency domain. In one preferred embodiment the short-time spectrum is obtained by a short-time Fourier transform. The means of converting the observation signal into the frequency domain (Fourier domain) is not, however, limited to the Fourier transform (including the short-time Fourier transform, the discrete Fourier transform, and the FFT); for example, a wavelet transform (in one preferred example, using a Gabor function) may be used.

In one preferred embodiment, the circle estimation based on the spectra obtained for each frequency ω estimates a circle on the complex plane that passes through those spectra or as close to them as possible. In one preferred example, the spectrum of the target signal at each frequency ω is estimated by finding the circle center through variance minimization. This spectrum estimation assumes that the noise arrives from a single direction. However, because the short-time spectrum is estimated independently for each analysis frame and each frequency, the estimation still applies when noise arrives from several directions, provided that within a given analysis frame the noise containing a component at a given frequency ω arrives from a single direction; the estimation then holds at that frequency in that frame. Furthermore, even when several noise directions share the same frequency component ω within an analysis frame, the estimation can be applied if one noise dominates the others.

When the observation signal contains several noises, the circle estimation for each frequency ω may, in one aspect, proceed as follows: based on the spectra obtained at that frequency, virtual spectra of virtual observation signals containing only one of the noises are assumed, and a circle passing through those virtual spectra on the complex plane is estimated. When several noises with different arrival directions are present, the observed spectral points do not lie on a single circle. Suppose, for example, that there are two noises. If virtual spectra of virtual observation signals containing only one of the two noises are assumed, and a circle passing through those virtual spectra on the complex plane is estimated, the center of that circle is the spectrum of the target signal. Each observed spectral point lies on a circle centered on the corresponding virtual spectrum, and its position on that circle reflects the phase difference. The observed spectral points can therefore be used to find the circle center that represents the spectrum of the target signal, and this technique is also considered to fall within the concept of the circle center method.

In the signal processing method of the present invention, the frequency ω is given as discrete values covering the whole of a finite frequency band. To obtain the speech-only spectrum (a function of ω) from a noisy signal, ω would ideally be continuous, but in practice the circle center is found for each value of a discretized ω. For example, when a finite frequency band (for example 0 to 5 kHz) is discretized into 256 points, the circle center is found for every value ω = 2kπ/256, k = 0, 1, ..., 256. A different circle center is then obtained for each ω, and regarded as a function of ω these centers form the complex spectrum of the speech with the noise removed.

In one preferred embodiment, the present invention further includes a step (or calculation unit) that finds, for each frequency ω, the centroid on the complex plane of the spectra obtained for the channels, and a step (or evaluation unit) that evaluates the reliability of the circle center found for each frequency ω. A weighted point between the circle center and the centroid, chosen according to the reliability evaluation, is then estimated as the spectrum of the target signal at that frequency. Finding the centroid of the channel spectra on the complex plane corresponds to the conventional delay-and-sum method (called the centroid method in this specification). The centroid method makes what may be called a conservative estimate and is therefore robust, so in some cases it is desirable to take the centroid into account, to a degree that depends on the evaluated reliability of the circle center method. One element of the reliability evaluation is the angular difference between the spectra on the complex plane as seen from the circle center at each frequency ω; a small angular difference makes a wrong estimate more likely. Another element is the variance obtained when the spectrum of the target signal is found by variance minimization: the smaller the variance (that is, the smaller the residual), the higher the reliability, and the larger the variance, the lower the reliability. A further element is the correlation coefficient between the real and imaginary parts of the observation signal spectra. As a concrete technique, for example, one or more of these elements are quantified and used to set the internal division ratio between the circle center and the centroid adaptively, and the resulting weighted point (internal division point) is used as the estimate of the target signal. Those skilled in the art may adopt various specific ways of determining the weighted point between the circle center and the centroid.

The circle center method can be applied in various fields. Speech recognition can be performed using the target signal spectrum obtained by the circle center method. The waveform of the target signal can also be obtained from the estimated spectrum (for example, by an inverse Fourier transform). In the present invention, the noise arrival direction may additionally be estimated for each frequency ω, and the per-frequency directions may be integrated into a single estimate of the noise arrival direction. The circle center method yields the complex spectrum of the target signal for each ω, and the noise arrival direction can be estimated from the phase angle of each spectral point as seen from the circle center, the frequency ω, and the distance between the microphones. It is hard to imagine that the noise arrival direction is completely independent from one frequency ω to the next. If a single noise source can be assumed, the noise arrival direction is estimated for each frequency ω by the circle center method, these estimates are integrated (for example, by averaging the directions) into an estimate of the noise arrival direction, and noise removal matched to that direction (for example, minimizing the gain in the noise direction) becomes possible. Even if the noise source is not strictly single, performance can be improved in practice by treating the above as holding within a band of neighboring frequencies ω.
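The direction estimate described above can be illustrated with a minimal sketch. It is not the patent's procedure: it assumes a uniform linear array with spacing `spacing`, a target at broadside (so the steering delays vanish), a sound speed of 340 m/s, and a noise delay of `i * spacing * sin(theta) / c` at microphone `i`. The names `noise_direction` and `combined_direction` are hypothetical, and the true center `S` stands in for the circle-center estimate.

```python
import cmath
import math

def noise_direction(M, center, omega, spacing, c=340.0):
    """Noise arrival angle (radians from broadside) at one frequency.

    Assumptions (not from the source): uniform linear array, target at
    broadside, and omega * spacing / c < pi so the phase does not wrap.
    """
    residual = [m - center for m in M]   # each is N * exp(-j*omega*tau_i)
    # The angle of this sum is a magnitude-weighted average of the phase
    # steps between adjacent microphones.
    step = sum(b * a.conjugate() for a, b in zip(residual, residual[1:]))
    dphi = cmath.phase(step)             # equals -omega * spacing * sin(theta) / c
    sin_theta = -dphi * c / (omega * spacing)
    return math.asin(max(-1.0, min(1.0, sin_theta)))

def combined_direction(per_frequency_angles):
    """Integrate per-frequency estimates by plain averaging."""
    return sum(per_frequency_angles) / len(per_frequency_angles)

# Synthetic check with illustrative values.
S = 1.0 + 0.5j
N = 0.8 * cmath.exp(0.3j)
spacing = 0.05
theta_true = math.radians(30.0)
angles = []
for f in (500.0, 1000.0, 1500.0):
    omega = 2 * math.pi * f
    taus = [i * spacing * math.sin(theta_true) / 340.0 for i in range(4)]
    M = [S + N * cmath.exp(-1j * omega * t) for t in taus]
    angles.append(noise_direction(M, S, omega, spacing))
theta_hat = combined_direction(angles)
print(math.degrees(theta_hat))
```

Averaging the angles is only one way of integrating the per-frequency estimates; the text leaves the integration rule open.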

The present invention also relates to a computer program for causing a computer to execute the signal processing method described above.

Because the present invention performs no learning and estimates the short-time spectrum independently within each analysis frame, it can be applied to constantly changing environments and to sudden noise, which adaptive arrays handle poorly; in this respect it is advantageous over conventional adaptive microphone arrays. It also performs better than the delay-and-sum microphone array, which likewise requires no learning.

Embodiments of the present invention are described below. The description concerns a method that estimates the short-time spectrum of the target signal by applying geometric processing in the frequency domain to the signals observed at the microphones of a microphone array, thereby removing the influence of noise. The signals received at the microphones of the array differ in time according to the direction of arrival. The invention notes that, after conversion to the frequency domain frame by frame (for example, by a short-time Fourier transform), this time difference appears as a phase difference, and it exploits the fact that the frequency components of the signals received at the microphones are distributed on a circle in the complex plane. By estimating the center of that circle through geometric processing, the frequency components of the target signal are estimated and its short-time spectrum is restored.

Geometric processing of the array input in the frequency domain is described first. Consider, as shown in FIG. 1, a microphone array of M microphones receiving acoustic signals that arrive as plane waves from a target source and from a single noise source. The arrival direction θ_S of the target signal is assumed to be known, and the arrival direction θ_N of the noise to be unknown. From θ_S and the microphone arrangement, the relative delays τ_S1, ..., τ_SM, which represent the differences in the times at which the target signal reaches the microphones, can be calculated. As in DS and Griffiths-Jim arrays, the delay units shown in the figure add a delay D_i (= D_0 - τ_Si, where D_0 is a fixed delay) to the signal received at each microphone, which corrects the time differences of the target signal and brings it into phase. The output m_i(t) of each delay unit is then

$$ m_i(t) = s(t) + n(t - \tau_i), \qquad i = 1, \dots, M \tag{1} $$

Here m_i(t) is the observed waveform for the i-th microphone, s(t) and n(t) are the signals of the target source and the noise source at time t, and τ_i is the time delay of the noise signal at the i-th microphone. s(t) is the in-phase target signal, and n(t - τ_i) is the noise with the relative delay τ_i applied. The delay τ_i is the relative delay determined by θ_N (unknown) and the microphone arrangement, plus the delay added by the delay unit.

Next, the representation of the microphone array input in the frequency domain is described. When each m_i(t) of equation (1) is short-time Fourier transformed with a common analysis frame, the component at frequency ω is

$$ M_i(\omega) = S(\omega) + N(\omega)\, e^{-j\omega\tau_i} \tag{2} $$

With ω fixed, equation (2) has a simple geometric meaning. The absolute value of e^{-jωτ_i} is 1 regardless of τ_i, so, as shown in FIG. 2, all M_i(ω) lie on a circle in the complex plane centered at S(ω) with radius |N(ω)|. Strictly speaking, equation (2) is only an approximation. Applying the ordinary Fourier transform to equation (1) yields equation (2), but here the signal is cut out with an analysis frame and a short-time Fourier transform is applied. Because τ_i generally differs from microphone to microphone, cutting out equation (1) with the same analysis frame cuts out a different time interval of the noise n(t) for each microphone, and the frequency content of those intervals differs.
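The circle geometry can be checked numerically with a small sketch. It assumes the exact model M_i(ω) = S(ω) + N(ω)·e^{-jωτ_i} implied by the text (ignoring the short-time framing error); the numeric values and the helper name `observed_bins` are illustrative only.

```python
import cmath
import math

def observed_bins(S, N, omega, delays):
    """Per-microphone components under the model M_i = S + N * exp(-j*omega*tau_i)."""
    return [S + N * cmath.exp(-1j * omega * tau) for tau in delays]

S = 1.0 + 0.5j                           # target component (illustrative)
N = 0.8 * cmath.exp(0.3j)                # noise component (illustrative)
omega = 2 * math.pi * 1000.0             # 1 kHz
delays = [0.0, 1.1e-4, 2.7e-4, 4.0e-4]   # hypothetical noise delays in seconds

M = observed_bins(S, N, omega, delays)
radii = [abs(m - S) for m in M]          # distance of each point from S
print(radii)
```

Every distance equals |N| up to rounding, whatever the delays are; the delays only move the points around the circle.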

Estimation of the short-time spectrum of the target signal by geometric processing in the frequency domain is described next. With a microphone array of M ≥ 3 microphones, the frequency component S(ω) of the target signal can be obtained, for a given ω, by finding the center of the circumscribed circle of the polygon whose vertices are the points M_i(ω) on the complex plane. Doing this for every ω yields the short-time spectrum S(ω) of the target signal. In contrast, the short-time spectrum of the signal obtained by DS (the centroid method) is

$$ \sum_{i=1}^{M} M_i(\omega) = M\,S(\omega) + N(\omega) \sum_{i=1}^{M} e^{-j\omega\tau_i} \tag{3} $$

Even after division by the number of microphones M for normalization, this corresponds to the centroid of the polygon in FIG. 2 and therefore does not coincide with the target signal spectrum S(ω).
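For M = 3 the difference between the two estimates can be seen in a short sketch. It assumes the noise-on-a-circle model M_i = S + N·e^{-jωτ_i} with illustrative values; `circumcenter` is a hypothetical helper using the standard closed form for the circle through three points.

```python
import cmath
import math

def circumcenter(a, b, c):
    """Center of the circle through three complex points."""
    b0, c0 = b - a, c - a                # shift so that a is the origin
    denom = b0.conjugate() * c0 - c0.conjugate() * b0
    if denom == 0:
        raise ValueError("points are collinear; no unique circle")
    return a + (abs(b0) ** 2 * c0 - abs(c0) ** 2 * b0) / denom

S = 1.0 + 0.5j
N = 0.8 * cmath.exp(0.3j)
omega = 2 * math.pi * 1000.0
delays = [0.0, 1.1e-4, 2.7e-4]           # hypothetical noise delays in seconds
M = [S + N * cmath.exp(-1j * omega * tau) for tau in delays]

center = circumcenter(*M)                # circle center estimate
centroid = sum(M) / len(M)               # normalized delay-and-sum output
print(abs(center - S), abs(centroid - S))
```

Under the exact model the circumcenter recovers S, while the centroid keeps a residual N·mean(e^{-jωτ_i}) that vanishes only if the phasors happen to cancel.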

The algorithm for spectrum estimation by geometric processing in the frequency domain is described next. To estimate the short-time spectrum of the target signal in noise, it suffices to find the center of the circumscribed circle of the polygon whose vertices are the points M_i(ω) on the complex plane. As already noted, however, equation (2) is only an approximation, so for a microphone array of four or more microphones there is in general no circle that passes exactly through all the observed M_i(ω). The problem then becomes one of estimating the true circle center from observation points M_i(ω) that contain errors. For accurate spectrum estimation it is important (i) to make the number of observation points (the number of microphones) M as large as possible, and (ii) to design the algorithm that estimates the circle center from the observation points so that it is robust against errors.

As one preferred example of a technique for estimating the center of the circumscribed circle, estimation by minimizing the variance of the squared distance is described. Here the estimate of S(ω) is taken to be the point on the complex plane for which the variance of the squared distances from the points M_i(ω) is smallest:

$$ \hat{S}(\omega) = \operatorname*{arg\,min}_{S'(\omega)} \operatorname{Var}\!\left[\, \left| M_i(\omega) - S'(\omega) \right|^2 \right] \tag{4} $$

where Var[·] denotes the variance of its argument over i. This point was chosen as the estimate for the following reason. First, when a circle passing through all M_i(ω) exists on the complex plane, the distances from its center to the points M_i(ω) are all equal, so the minimizing point of equation (4) coincides with the center of that circle. When no circle passes through all M_i(ω), it is desirable that the estimated circle pass as close as possible to each M_i(ω), which is equivalent to the distance from the estimated center to each M_i(ω) being as close as possible to the radius. Achieving this requires making the spread of the distances from the estimated center to the points M_i(ω) as small as possible. If the variance of the squared distance is chosen as the measure of this spread, the estimated circle center is the minimizing point of equation (4). The variance of the distance itself could also be used as the measure. In that case the estimated circle center is

$$ \bar{S}(\omega) = \operatorname*{arg\,min}_{S'(\omega)} \operatorname{Var}\!\left[\, \left| M_i(\omega) - S'(\omega) \right| \right] \tag{5} $$

but computing the estimate from equation (5) is considered harder than computing it from equation (4). This specification therefore describes, as one preferred aspect, estimation of the circle center based on equation (4).

The algorithm for minimizing the variance of the squared distance is as follows. A concrete way of obtaining the estimate of S(ω) from equation (4) is shown below. Let X and Y be the real and imaginary parts of S'(ω), let x_i and y_i be the real and imaginary parts of M_i(ω), and write Cov[a, b] for the covariance of a and b. Then

$$ \begin{aligned} \operatorname{Var}\!\left[\,|M_i(\omega) - S'(\omega)|^2\right] &= \operatorname{Var}\!\left[(x_i - X)^2 + (y_i - Y)^2\right] \\ &= \operatorname{Var}[x_i^2 + y_i^2] - 4X \operatorname{Cov}[x_i^2 + y_i^2,\, x_i] - 4Y \operatorname{Cov}[x_i^2 + y_i^2,\, y_i] \\ &\quad + 4X^2 \operatorname{Var}[x_i] + 8XY \operatorname{Cov}[x_i, y_i] + 4Y^2 \operatorname{Var}[y_i] \end{aligned} \tag{6} $$

Partially differentiating this with respect to X and Y gives

$$ \begin{aligned} \frac{\partial}{\partial X} \operatorname{Var}\!\left[\,|M_i(\omega) - S'(\omega)|^2\right] &= 8X \operatorname{Var}[x_i] + 8Y \operatorname{Cov}[x_i, y_i] - 4 \operatorname{Cov}[x_i^2 + y_i^2,\, x_i] \\ \frac{\partial}{\partial Y} \operatorname{Var}\!\left[\,|M_i(\omega) - S'(\omega)|^2\right] &= 8X \operatorname{Cov}[x_i, y_i] + 8Y \operatorname{Var}[y_i] - 4 \operatorname{Cov}[x_i^2 + y_i^2,\, y_i] \end{aligned} \tag{8} $$

For the (X, Y) that minimizes equation (6), the left-hand sides of equation (8) must equal 0, so that

$$ \begin{pmatrix} X \\ Y \end{pmatrix} = \frac{1}{2} \begin{pmatrix} \operatorname{Var}[x_i] & \operatorname{Cov}[x_i, y_i] \\ \operatorname{Cov}[x_i, y_i] & \operatorname{Var}[y_i] \end{pmatrix}^{-1} \begin{pmatrix} \operatorname{Cov}[x_i^2 + y_i^2,\, x_i] \\ \operatorname{Cov}[x_i^2 + y_i^2,\, y_i] \end{pmatrix} \tag{9} $$

With this (X, Y), the estimate of equation (4) is expressed as

$$ \hat{S}(\omega) = X + jY $$

For equation (8) to be solvable as in equation (9), the variance-covariance matrix of x_i and y_i,

$$ S_{xy} = \begin{pmatrix} \operatorname{Var}[x_i] & \operatorname{Cov}[x_i, y_i] \\ \operatorname{Cov}[x_i, y_i] & \operatorname{Var}[y_i] \end{pmatrix} $$

must be nonsingular. Using the correlation coefficient of x_i and y_i,

$$ r_{xy} = \frac{\operatorname{Cov}[x_i, y_i]}{\sqrt{\operatorname{Var}[x_i]\,\operatorname{Var}[y_i]}} $$

the determinant of S_xy can be written as

$$ \det S_{xy} = \operatorname{Var}[x_i]\,\operatorname{Var}[y_i]\,\left(1 - r_{xy}^2\right) $$

The absolute value of the correlation coefficient r_xy is at most 1, and it equals 1 only when the points M_i(ω) (i = 1, 2, ..., M) lie on a straight line in the complex plane. The points M_i(ω) also lie on a straight line when Var[x_i] or Var[y_i] equals 0. The necessary and sufficient condition for equation (9) to be usable is therefore that the points M_i(ω) do not all lie on one straight line in the complex plane. However, when the absolute value of r_xy is smaller than 1 but close to 1, equation (9) becomes unstable and sensitive to errors. As one way of avoiding this instability, whenever the square of the correlation coefficient r_xy exceeded 0.99, the centroid of the M_i(ω) was used as the estimate of the short-time spectrum of the target signal, as in DS.
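The closed-form estimate and the centroid fallback can be sketched as follows. This is one reading of the derivation, not the patent's code: it assumes population (divide-by-M) variances and covariances, and the function and parameter names (`circle_center_estimate`, `r2_limit`) are hypothetical. Setting the gradient of Var[(x_i - X)² + (y_i - Y)²] to zero gives the 2×2 linear system S_xy·(X, Y)ᵀ = ½·(Cov[x_i² + y_i², x_i], Cov[x_i² + y_i², y_i])ᵀ, which the code solves directly.

```python
import cmath
import math

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    """Population covariance of two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    return mean([(p - ma) * (q - mb) for p, q in zip(a, b)])

def circle_center_estimate(M, r2_limit=0.99):
    """Point minimizing the variance of squared distances to the points M."""
    x = [m.real for m in M]
    y = [m.imag for m in M]
    z = [p * p + q * q for p, q in zip(x, y)]
    vx, vy, cxy = cov(x, x), cov(y, y), cov(x, y)
    centroid = complex(mean(x), mean(y))
    # Fall back to the centroid when the points are collinear or nearly so
    # (zero variance, or squared correlation above the limit).
    if vx * vy == 0.0 or cxy * cxy > r2_limit * vx * vy:
        return centroid
    det = vx * vy - cxy * cxy            # determinant of the covariance matrix
    czx, czy = cov(z, x), cov(z, y)
    X = 0.5 * (vy * czx - cxy * czy) / det
    Y = 0.5 * (vx * czy - cxy * czx) / det
    return complex(X, Y)

S = 1.0 + 0.5j
N = 0.8 * cmath.exp(0.3j)
omega = 2 * math.pi * 1000.0
M = [S + N * cmath.exp(-1j * omega * t) for t in (0.0, 1.1e-4, 2.7e-4, 4.0e-4)]
estimate = circle_center_estimate(M)
fallback = circle_center_estimate([0j, 1 + 1j, 2 + 2j, 3 + 3j])  # collinear input
print(abs(estimate - S), fallback)
```

The 0.99 threshold is the value quoted in the text; the zero-variance check is an added guard so that the division is always well defined.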

FIG. 3 shows the procedure for estimating the short-time spectrum of the target signal by minimizing the variance of the squared distance.
(1) A delay unit adds a delay to the signal received at each microphone of the array, correcting the time differences of the target signal and bringing it into phase. The resulting signals are denoted m_i(t).
(2) The output m_i(t) of each delay unit is cut out with a common analysis frame, and each segment is Fourier transformed to obtain the short-time spectrum M_i(ω).
(3) For the components of the M spectra at the same frequency ω, the estimate of the component at frequency ω of the target signal's short-time spectrum is obtained by minimizing the variance of the squared distance.
(4) Step (3) is carried out for every frequency ω to estimate the short-time spectrum of the target signal.
(5) Steps (2) onward are repeated for the next analysis frame.
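Steps (2) to (4) for a single analysis frame can be sketched as below. The sketch assumes the channels have already been aligned as in step (1), uses a naive DFT in place of an FFT, applies no window, and uses population variances; `dft_bins`, `circle_center`, and `estimate_frame` are hypothetical names. The test signal is chosen so that target and noise are periodic in the frame, which makes the circle model exact.

```python
import cmath
import math

def dft_bins(frame):
    """Naive DFT of one real frame, bins 0..L/2 (stand-in for an FFT)."""
    L = len(frame)
    return [sum(x * cmath.exp(-2j * math.pi * k * n / L) for n, x in enumerate(frame))
            for k in range(L // 2 + 1)]

def _mean(v):
    return sum(v) / len(v)

def _cov(a, b):
    ma, mb = _mean(a), _mean(b)
    return _mean([(p - ma) * (q - mb) for p, q in zip(a, b)])

def circle_center(points, r2_limit=0.99):
    """Squared-distance variance minimizer with centroid fallback."""
    x = [p.real for p in points]
    y = [p.imag for p in points]
    z = [a * a + b * b for a, b in zip(x, y)]
    vx, vy, cxy = _cov(x, x), _cov(y, y), _cov(x, y)
    if vx * vy == 0.0 or cxy * cxy > r2_limit * vx * vy:
        return complex(_mean(x), _mean(y))
    det = vx * vy - cxy * cxy
    czx, czy = _cov(z, x), _cov(z, y)
    return complex(0.5 * (vy * czx - cxy * czy) / det,
                   0.5 * (vx * czy - cxy * czx) / det)

def estimate_frame(aligned_frames):
    """Step (2): transform each channel; steps (3)-(4): estimate every bin."""
    spectra = [dft_bins(f) for f in aligned_frames]
    return [circle_center([sp[k] for sp in spectra]) for k in range(len(spectra[0]))]

# Synthetic frame: target and noise both sit in bin k0; noise is delayed per channel.
L, k0 = 64, 5
noise_delays = [0, 2, 5, 9]              # hypothetical delays in samples
frames = [[math.cos(2 * math.pi * k0 * n / L + 0.4)
           + 0.7 * math.cos(2 * math.pi * k0 * (n - d) / L + 1.1)
           for n in range(L)] for d in noise_delays]
S_hat = estimate_frame(frames)
target_bin = (L / 2) * cmath.exp(0.4j)   # DFT of the target alone at bin k0
print(abs(S_hat[k0] - target_bin))
```

Step (5) would call `estimate_frame` again on the next block of samples; overlap and windowing are left out of this sketch.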

Those skilled in the art may select a suitable arrangement for the microphones of the array. Depending on the arrangement of the microphones and the noise arrival direction, the noise phase difference between microphones can come close to an integer multiple of 2π at certain frequencies ω. The points M_i(ω) in FIG. 2 are then concentrated in a narrow range of the circumference, and it can become difficult to find the center of the circumscribed circle accurately. Optimizing the microphone arrangement is therefore expected to improve performance further.
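As a rough illustration of where this degeneracy occurs, the sketch below lists the frequencies at which the noise phase step between adjacent microphones is an exact multiple of 2π. It assumes a uniform linear array and far-field plane waves, with a relative noise delay between neighbors of spacing·|sin θ_N − sin θ_S|/c after steering; none of these specifics come from the source, and `degenerate_frequencies` is a hypothetical name.

```python
def degenerate_frequencies(spacing, sin_noise, sin_target, f_max, c=340.0):
    """Frequencies (Hz) up to f_max where adjacent noise phases differ by 2*pi*n."""
    dtau = spacing * abs(sin_noise - sin_target) / c   # relative delay in seconds
    if dtau == 0.0:
        # Noise from the target direction: the points coincide at every frequency.
        raise ValueError("noise and target directions coincide")
    result = []
    n = 1
    while n / dtau <= f_max:
        result.append(n / dtau)          # omega * dtau = 2*pi*n at this frequency
        n += 1
    return result

# 10 cm spacing, endfire noise, broadside target, band up to 8 kHz.
bad = degenerate_frequencies(0.1, 1.0, 0.0, 8000.0)
print(bad)
```

With unequal spacings the phase steps differ between microphone pairs, so they do not all wrap at the same frequency; this is one way an arrangement can be chosen to reduce the problem.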

One preferred embodiment for improving robustness includes, for each frequency ω, a step of obtaining the centroid on the complex plane of the spectra obtained for the respective channels and a step of evaluating the reliability of the circle center obtained on the complex plane. A weighted point between the circle center and the centroid, chosen on the basis of the reliability evaluation, is taken as the estimate of the spectrum of the target signal at that frequency. Obtaining the centroid of the per-channel spectra on the complex plane corresponds to the conventional Delay-and-Sum method (the centroid method), which is robust because it makes what may be called a conservative estimate. It can therefore be desirable to take the centroid into account when estimating the spectrum of the target signal, to a degree that depends on the evaluated reliability of the circle-center method. One element of the reliability evaluation is the angular difference, as seen from the circle center, between the spectra obtained on the complex plane at each frequency ω; a small angular difference makes erroneous estimation more likely. Another element is the variance obtained when the spectrum of the target signal is found by variance minimization: the smaller the variance (that is, the smaller the residual), the higher the reliability. A further element is the correlation coefficient between the real and imaginary components of the spectra of the observation signals. In one example, one or more of these elements are quantified and used to set the internal division ratio between the circle center and the centroid adaptively, and the resulting weighted point (internally dividing point) is used as the estimate of the target signal.
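As an illustration of the weighted point between the circle center and the centroid, the sketch below uses only the first reliability element, the angular spread of the points seen from the circle center. The mapping from spread to weight (linear, with full trust at a spread of π radians) and the names `angular_spread` and `blended_estimate` are assumptions made for this example; the description does not fix a particular rule.

```python
import numpy as np

def circle_center(z):
    """Center minimizing the variance of |z_i - c|^2 (linear least squares)."""
    z = np.asarray(z, dtype=complex)
    centroid = z.mean()
    w = z - centroid
    d = np.abs(w) ** 2
    A = 2.0 * np.column_stack([w.real, w.imag])
    sol, *_ = np.linalg.lstsq(A, d - d.mean(), rcond=None)
    return centroid + complex(sol[0], sol[1])

def angular_spread(z, center):
    """Arc (radians) covered by the points as seen from the center."""
    ang = np.sort(np.angle(np.asarray(z) - center))
    # gaps between neighbouring angles, including the wrap-around gap
    gaps = np.diff(np.append(ang, ang[0] + 2 * np.pi))
    return 2 * np.pi - gaps.max()

def blended_estimate(z, full_trust_spread=np.pi):
    """Internally dividing point between circle center and centroid."""
    z = np.asarray(z, dtype=complex)
    center, centroid = circle_center(z), z.mean()
    # hypothetical reliability rule: weight grows linearly with angular spread
    w = float(np.clip(angular_spread(z, center) / full_trust_spread, 0.0, 1.0))
    return w * center + (1.0 - w) * centroid
```

With points spread widely around the circle the result is the circle center; as the points bunch together the result moves toward the centroid, that is, toward the Delay-and-Sum estimate.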

Next, the relationship between the present invention and the characteristics of the microphones and transmission paths is considered. When the assumption that the sound arrives as a plane wave does not hold, or when the microphone characteristic cannot be regarded as an impulse, the signal received by each microphone of the array is the signal emitted by the source convolved with the transmission-path characteristic and the microphone characteristic, and FIG. 1 becomes FIG. 4. Even in this case, suppose that the microphone characteristic h_Mi(k) is the same for every element, h_M(k), and that the transmission-path characteristics h_Si(k) and h_Ni(k) from the sources to the microphones differ only by a time delay, so that the impulse-response waveforms can be regarded as equal to h_S(k) and h_N(k), respectively, for every element. The equivalent circuit of FIG. 4 can then be redrawn as FIG. 5. If h_S*h_M*x_S(t) and h_N*h_M*x_N(t) in FIG. 5 are reinterpreted as the target signal and the noise of FIG. 1, the situation reduces to equation (1), and the principle of short-time spectrum estimation of the target signal by geometric processing in the frequency domain applies unchanged.
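The reduction can be written out as below. The notation is an assumption made for this sketch: capital letters denote Fourier transforms of the corresponding lower-case quantities, and τ_i is the residual delay of the noise at microphone i after the target has been brought into phase; equations (1) and (2) themselves are not reproduced in this passage.

```latex
% Assumed notation: X_S, X_N are the source spectra; H_S, H_N, H_M are the
% frequency responses of h_S, h_N, h_M; \tau_i is the residual noise delay.
\begin{align*}
m_i(t) &= (h_S * h_M * x_S)(t) + (h_N * h_M * x_N)(t - \tau_i) \\
M_i(\omega) &= \underbrace{H_S(\omega) H_M(\omega) X_S(\omega)}_{S'(\omega)}
  + \underbrace{H_N(\omega) H_M(\omega) X_N(\omega)}_{N'(\omega)}\, e^{-j\omega\tau_i} \\
\lvert M_i(\omega) - S'(\omega) \rvert &= \lvert N'(\omega) \rvert \quad \text{for every } i
\end{align*}
```

Under these assumptions the points M_i(ω) still lie on a circle, now centered on the filtered target spectrum S'(ω).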

In a real environment, moreover, the frequency characteristics of the microphones vary from element to element, and the impulse responses of the transmission paths from the target and noise sources to the microphones differ not merely by a time delay but also in waveform and frequency characteristic. For such reasons the differences between the signals received at the microphones may not be expressible by a simple time difference of the noise as in equation (1), which can cause the actually observed M_i(ω) to depart from the circle on the complex plane given by equation (2). Of the two, the microphone characteristics are easier to correct with a filter or the like than the transmission-path impulse responses, because (i) they are represented by an impulse response that is short compared with that of the transmission path, and (ii) unlike the transmission-path impulse response, which changes from moment to moment as the source moves or the surroundings change, the characteristic of an element hardly changes once it has been measured and can therefore be treated as known in applications. Correcting the microphones, the transmission paths, or both is expected to improve performance further.

The present invention can also be applied when there are multiple noise sources. In real environments noise often arrives from several directions. The principle of estimating the target spectrum by geometric processing of the microphone array input in the frequency domain assumes a single noise arrival direction, but the short-time spectrum is estimated independently for each analysis frame and each frequency. Hence, even when noise arrives from several directions, the principle applies, and a noise reduction effect can be expected, at any frequency ω of any analysis frame in which the noise containing that frequency component arrives from a single direction. This may hold, for example, when the noise sources are speech. Depending on the conditions, noise with the same frequency component ω may arrive from several directions within one analysis frame. Equation (2) then contains one noise term for each arrival direction, each with its own inter-microphone phase differences, and M_i(ω) is no longer distributed on a circle on the complex plane. If, among the noises present at that frequency of the analysis frame, one noise N1 is dominant, that is,

|N1| ≫ |N2|, |N3|, …

then M_i(ω) is distributed near a circle of radius |N1| on the complex plane. The frequency component obtained by estimating the center of the circumscribed circle of M_i(ω) then has an error of the order of |N2|, |N3|, but the influence of the dominant noise N1 can be greatly reduced. If, on the other hand, several noises of comparable strength dominate at that frequency of the analysis frame, that is,

|N1| ≈ |N2| ≈ …

then the distribution of M_i(ω) departs substantially from a circle, and a large error may arise in the estimated spectrum of the target signal.
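The effect of a second, weaker noise can be explored numerically. The sketch below is an assumption-laden toy: the per-microphone phases of the two noises (`phase1`, `phase2`) are hypothetical values rather than phases derived from an array geometry, and `circle_center` is one possible distance-squared variance minimizer written as a linear least-squares fit.

```python
import numpy as np

def circle_center(z):
    """Center minimizing the variance of |z_i - c|^2 (linear least squares)."""
    z = np.asarray(z, dtype=complex)
    centroid = z.mean()
    w = z - centroid
    d = np.abs(w) ** 2
    A = 2.0 * np.column_stack([w.real, w.imag])
    sol, *_ = np.linalg.lstsq(A, d - d.mean(), rcond=None)
    return centroid + complex(sol[0], sol[1])

def simulate_points(S, N1, N2, n_mics=8):
    """Observed spectra at one frequency with two noises (hypothetical phases)."""
    i = np.arange(n_mics)
    phase1 = 2 * np.pi * i / n_mics   # assumed phases of the dominant noise
    phase2 = 1.7 * i                  # assumed phases of the second noise
    return S + N1 * np.exp(-1j * phase1) + N2 * np.exp(-1j * phase2)

def center_error(S, N1, N2):
    """Distance between the estimated circle center and the true target S."""
    return abs(circle_center(simulate_points(S, N1, N2)) - S)
```

Calling `center_error(S, 1.0, 0.0)` recovers the target exactly, and sweeping the third argument from small values toward 1.0 shows how the estimate behaves as the second noise approaches the strength of the first.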

The circle center in the presence of multiple noise sources may also be obtained from a dedicated expression (an equation image in the original publication that is not reproduced in this text). Suppose, for example, that there are two noises, and consider the virtual spectra of virtual observation signals that contain only one of the two noises. If a circle passing through these virtual spectra is estimated on the complex plane, its center is the spectrum of the target signal. Each observed spectral point lies on a circle centered on the corresponding virtual observation, and its position on that circle reflects the phase difference. The circle center that represents the spectrum of the target signal can therefore be found from the observed spectral points by using the ratio of the phase differences.

The application of short-time spectrum estimation according to the present invention to speech recognition is described next. Among techniques that use a microphone array, methods such as Delay-and-Sum (DS), the Griffiths-Jim array, and AMNOR, described above, estimate the waveform of the target signal in the time domain. In contrast, the method of the present invention, geometric processing of the microphone array input in the frequency domain, estimates the short-time spectrum of the target signal. It can therefore be put to engineering use by incorporating it into any technique that requires the short-time spectrum of the target signal.

Speech recognition is a particularly important example of such a technique. The currently mainstream approach to speech recognition uses a system like that of FIG. 6. The acoustic analysis stage takes the time-domain waveform as input and extracts and outputs speech features. The search stage consults an acoustic model and a language model, determines the word sequence with the highest likelihood for these features, and outputs it as the recognition result.

Mel-frequency cepstral coefficients (MFCC) are commonly used as speech features. MFCCs are computed from the short-time spectrum of the signal. Computing the MFCCs from the short-time spectrum of the target signal estimated by geometric processing of the microphone array input in the frequency domain is therefore expected to reduce the influence of noise and improve speech recognition performance. The speech recognition methods to which the present invention applies are not limited to those using MFCCs.

For comparison, FIG. 7 shows the flow of acoustic analysis in three systems: an ordinary speech recognition system whose input is a speech signal that has undergone no microphone array processing such as noise reduction; a speech recognition system whose input is the time-domain target signal estimated by microphone array processing such as DS, the Griffiths-Jim array, or AMNOR; and a speech recognition system that computes the MFCCs from the short-time spectrum of the target signal estimated by geometric processing of the microphone array input in the frequency domain.

As the figure shows, speech recognition ordinarily convolves the signal with an FIR filter for high-frequency emphasis (pre-emphasis) before the Fourier transform. In the method based on geometric processing of the microphone array input in the frequency domain, the same emphasis is applied to each m_i(t). Because this convolves a common filter with every channel, the situation again reduces to equation (1), just as in the case of a common microphone characteristic. Alternatively, the time-domain emphasis filter may be omitted and the estimated short-time spectrum of the target signal multiplied by the frequency response of the emphasis filter in the frequency domain.
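The equivalence between the two ways of applying high-frequency emphasis can be checked with a small sketch. The first-order filter y[n] = x[n] − a·x[n−1] and the coefficient a = 0.97 are assumptions for illustration; the description does not give the filter. The identity below is exact only for circular convolution over the frame; with ordinary framing and windowing the two approaches agree approximately.

```python
import numpy as np

def preemphasis_time(x, a=0.97):
    """y[n] = x[n] - a*x[n-1], with circular wrap so the DFT identity is exact."""
    x = np.asarray(x, dtype=float)
    return x - a * np.roll(x, 1)

def preemphasis_freq(X, a=0.97):
    """Multiply a length-N DFT by H[k] = 1 - a*exp(-2j*pi*k/N)."""
    X = np.asarray(X, dtype=complex)
    n = len(X)
    k = np.arange(n)
    return X * (1.0 - a * np.exp(-2j * np.pi * k / n))
```

Applying `preemphasis_freq` to an estimated target spectrum corresponds to the second option in the paragraph above, where no time-domain filter is used.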

Speech recognition experiments were conducted to confirm the effectiveness of the method according to the present invention. To reflect a realistic situation, the task was large-vocabulary continuous speech recognition, and performance was evaluated by the word accuracy of the output word sequence. Three experiments were carried out. First, to verify the principle of spectrum estimation by geometric processing of the microphone array input in the frequency domain and to assess the usefulness of the method, a computer simulation evaluated the noise suppression obtained when an ideal plane wave is received by an array of ideal microphones whose frequency characteristics are perfectly flat and whose sensitivities are all equal. Second, to examine in detail the influence of microphone characteristics and reverberation in real environments, noise suppression was evaluated on data generated by convolving, on a computer, dry-source speech with impulse responses measured in environments with different reverberation times. Finally, to determine whether the method is immediately useful in an actual environment where variation in microphone characteristics and the effects of reverberation cannot be neglected, noise suppression was evaluated in an experiment using real loudspeakers and a real microphone array in an ordinary laboratory.

The conditions of the speech recognition experiments are as follows. First, the signals received at each microphone of the array when the target speech and interfering speech (noise) arrive from different directions were obtained by simulation or by laboratory experiment. The resulting microphone array input was then recognized by the following three systems.

System 1: an ordinary speech recognition system whose input is the signal from a single microphone, without a microphone array.
System 2: a speech recognition system whose input is the time-domain signal in which the interfering speech has been reduced by DS.
System 3: a speech recognition system that uses acoustic features computed from the short-time spectrum of the target signal estimated by the method of the present invention, geometric processing of the microphone array input in the frequency domain.
Finally, the recognition result of each system was compared with the sentence read aloud in the target speech signal, and recognition performance was evaluated by the word accuracy criterion described later. In the method according to the present invention, the center of the circumscribed circle was estimated by the distance-squared variance minimization described above. In the proposed method, as in the other systems, high-frequency emphasis was applied with a time-domain filter.

IPA-98-TestSet was used as the speech data to be recognized. This is the portion of the Acoustical Society of Japan newspaper-article read-speech corpus (ASJ-JNAS) that was not used to train the acoustic models, with 23 speakers and 100 sentences for both male and female speakers. The total number of words excluding punctuation is 1575, the out-of-vocabulary rate against the 20000-word vocabulary is 0.44% (7/1575), and the average duration per sentence is 5.8 s. The speech added as noise consisted of sentences selected from the same corpus that are not included in IPA-98-TestSet. The speech data for the recognition target and the noise were sampled and quantized waveforms (the kHz and bit values are missing from the source text), while operations such as adding speech signals, convolving impulse responses, and convolving filters to apply delays were carried out in floating point (the bit width is likewise missing).

In all systems the signal was pre-emphasized by convolution with a filter and framed with a Hamming window 25 ms wide at 10 ms intervals. The acoustic feature vector of each frame had 25 dimensions: 12 mel-frequency cepstral coefficients (MFCC), their first-order differences (ΔMFCC), and the first-order difference of the log power (ΔlogPower). In System 3 the power of the signal within a frame was computed from the estimated spectrum. Cepstral mean normalization (CMN) was applied per utterance. Table 1 lists the acoustic analysis conditions.
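The framing and normalization steps can be sketched as below. The 25 ms width and 10 ms shift come from the text; the sampling rate is passed in as a parameter because its value is not legible in this passage, and the function names are hypothetical.

```python
import numpy as np

def frame_signal(x, fs, width_s=0.025, shift_s=0.010):
    """Cut x into Hamming-windowed frames of width_s seconds every shift_s seconds."""
    x = np.asarray(x, dtype=float)
    width = int(round(width_s * fs))
    shift = int(round(shift_s * fs))
    n_frames = max(0, 1 + (len(x) - width) // shift)
    # index matrix: one row of sample indices per frame
    idx = np.arange(width)[None, :] + shift * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(width)

def cepstral_mean_normalize(features):
    """Subtract the per-utterance mean from each coefficient (CMN)."""
    features = np.asarray(features, dtype=float)
    return features - features.mean(axis=0, keepdims=True)
```

For example, one second of audio at an assumed 16 kHz gives frames of 400 samples with a shift of 160 samples.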

The recognition engine (decoder) was Julius (Ver.3.3p3), the large-vocabulary continuous speech recognition engine used in the IPA Japanese dictation basic software project [Exp]. Word accuracy (Acc.) was used to evaluate the output word-sequence hypothesis of each system.
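Word accuracy can be computed as in the sketch below, assuming the usual definition Acc = (N − S − D − I) / N, where N is the number of reference words and S, D, I are the substitutions, deletions, and insertions of a minimum-edit-distance alignment. The definition used in the publication appears outside this passage, so this is an assumption. The function name and the whitespace tokenization are also illustrative; Japanese text would first need to be segmented into words.

```python
def word_accuracy(reference, hypothesis):
    """(N - S - D - I) / N using a minimum-edit-distance word alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between the reference prefix so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return (len(ref) - prev[-1]) / len(ref)
```

Because insertions are penalized, the value can be negative when the hypothesis contains many extra words.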

The simulation experiment under ideal conditions is described first. The microphone characteristic was a unit impulse with no delay, and the transmission-path characteristic was a unit impulse with a delay computed from the arrival direction of the sound and the position of the microphone. These delays were applied to the speech data of the target signal and the interfering speech, and the signal received at each microphone was simulated on a computer. To examine the behavior when noise arrives from several directions, experiments were also run with multiple interfering speech sources. To evaluate how the arrival directions of multiple noises affect processing with a microphone array, two experiments with different noise arrival directions were run for the case of interfering speech from two directions. The power of each interfering speech signal was set to -10 dB relative to the power of the target signal. With a single interferer the SN ratio is therefore 10 dB, and with several interferers it is lower still.
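The drop in SN ratio with several interferers can be estimated with a short sketch, assuming the interfering signals are mutually uncorrelated so that their powers add. That assumption and the function name are not from the source.

```python
import math

def snr_db(n_interferers, interferer_level_db=-10.0):
    """SN ratio when each of n uncorrelated interferers is at the given level re. the target."""
    noise_power = n_interferers * 10.0 ** (interferer_level_db / 10.0)
    return -10.0 * math.log10(noise_power)
```

Under this assumption, two interferers at -10 dB each give an SN ratio of about 7 dB.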

The arrangement of the microphones and sound sources was as follows. The microphone array was circular, with a diameter of 30 cm or 60 cm, and carried 3, 4, 8, or 16 equally spaced microphones. The sources of the target signal and the interfering speech were assumed to lie in the same plane as the microphones. Table 2 lists the number of interfering sources and their arrival directions relative to the arrival direction of the target signal. In DS and in the present method, the delays used to bring the target signal into phase were computed from the microphone arrangement, with the simulated arrival direction of the target signal given as known.

The results of the simulation experiment under ideal conditions are as follows. Table 3 lists the recognition rate of each system. In every case the present method gave a higher recognition rate than DS, showing that the method is effective. For both DS and the present method, using more microphones reduced the influence of errors and raised the recognition rate. In particular, the present method remained more effective than DS when noise arrived from several directions. With a single noise arrival direction, the recognition rate of the present method came close to 89.4%, the recognition rate obtained under the same conditions with no noise added, confirming a strong noise reduction effect. Regarding the relation between noise arrival direction and array performance, a comparison of conditions 2-A and 2-B (two noises each) shows that under 2-A both DS and the present method achieved a higher recognition rate with 3 microphones than with 4, whereas under 2-B the opposite held. A possible explanation is the following. The microphones were equally spaced on the circle, and the direction of one of them was taken as the arrival direction of the target signal. Consequently, for noise from 270 degrees relative to the target with 4 microphones, and for noise from 60 degrees relative to the target with 3 microphones, there was a pair of microphones whose relative delay for the noise was zero. In DS and in the present method, the error contained in the data from such microphones is weighted more heavily than the error in the other data, which may have reduced the error reduction effect.
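The coincidence described above can be checked with a small geometric sketch. It assumes plane waves in the plane of the array, microphone 0 pointing toward the target at 0 degrees, and a speed of sound of 340 m/s. The speed value and the function names are assumptions; the 30 cm diameter and the 60 and 270 degree directions come from the text.

```python
import numpy as np

def residual_noise_delays(n_mics, noise_deg, diameter_m=0.30, c=340.0):
    """Noise delay at each microphone after the target (0 degrees) is brought into phase."""
    alpha = 2 * np.pi * np.arange(n_mics) / n_mics  # microphone angles on the circle
    r = diameter_m / 2.0
    theta = np.deg2rad(noise_deg)
    # a plane wave from direction phi reaches mic i at -r*cos(alpha_i - phi)/c
    # relative to the array center
    t_target = -r * np.cos(alpha) / c
    t_noise = -r * np.cos(alpha - theta) / c
    return t_noise - t_target

def has_coincident_pair(delays, tol=1e-9):
    """True if two microphones see the noise with the same residual delay."""
    return bool(np.any(np.diff(np.sort(delays)) < tol))
```

In this model a pair of microphones with identical residual noise delay appears for 3 microphones with noise at 60 degrees and for 4 microphones with noise at 270 degrees, but not for the two swapped combinations.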

Next, the simulation experiment that takes the characteristics of the microphones and transmission paths into account is described. The speech data were generated on a computer by convolving the target signal and the interfering speech with the impulse responses from each loudspeaker to each microphone of the array contained in the RWCP real-environment speech and acoustic database [RWCP]. These impulse responses were measured by the TSP (time-stretched pulse) method [TSP] using a 65536-point TSP and 16 synchronous additions. The impulse response data are not corrected for the microphone characteristics; correction data are available but were not used in this experiment. The SN ratio was 10 dB.

The arrangement of the microphones and sound sources was as follows. From the database described above, impulse response data were used for a microphone array of 16 equally spaced microphones on a circle 30 cm in diameter, parallel to the floor at a height of 162 cm, and for two loudspeakers each at a height of 172 cm above the floor and a distance of 202 cm from the array, separated by 60 degrees as seen from the array. To examine the relation between the degree of reverberation and performance, the experiment used impulse responses recorded in several rooms with different reverberation times. Table 4 lists the data names of the impulse responses used in this experiment.

In DS and in the present method, the delays used to bring the target signal into phase were determined by the following procedure.
(1) The sample point at which the sampled impulse response is negative with the largest absolute value was found.
(2) The interval spanning that sample point and one sample point on either side was upsampled by a factor of 20 relative to the sampling frequency using sinc interpolation.
(3) Among the interpolated sample points, the one that is negative with the largest absolute value was found, and its time was taken as the arrival time of the target signal.
Steps (1) to (3) were applied to every microphone to obtain the relative delays, and the target signal was brought into phase.
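The delay-determination procedure can be sketched as follows. This is one reading of the steps, assuming the impulse response has a dominant negative peak and that the sinc interpolation sums over all available samples; the function name `arrival_time_sinc` is hypothetical.

```python
import numpy as np

def arrival_time_sinc(h, factor=20):
    """Arrival time (in samples) of the most negative peak, refined by sinc interpolation."""
    h = np.asarray(h, dtype=float)
    k = int(np.argmin(h))                     # most negative sample
    n = np.arange(len(h))
    # fine grid on [k-1, k+1] at `factor` times the sampling rate
    t = k + np.arange(-factor, factor + 1) / factor
    # band-limited interpolation: h(t) = sum_n h[n] * sinc(t - n)
    fine = np.array([np.sum(h * np.sinc(ti - n)) for ti in t])
    return float(t[np.argmin(fine)])          # most negative interpolated value
```

Subtracting the arrival times of two microphones gives their relative delay with a resolution of one twentieth of a sample.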

The results of the simulation experiment that takes the microphone and transmission-path characteristics into account are as follows. Table 5 lists the results obtained under the above conditions. For the impulse responses of the anechoic room (ANE), which has almost no reverberation (although the microphone characteristics are present), the present method improved the recognition rate by 20% over DS. For the impulse responses of the room in which reverberation cannot be neglected (E1B), however, the recognition rate of DS was 7.8% higher than that of the proposed method. Even in the presence of reverberation, both DS and the present method improved the recognition rate compared with no array processing.

Finally, the recognition evaluation of the present method on noise-superimposed speech in a real acoustic space is described. In an ordinary laboratory, the speech data of the target signal and the interfering speech were played back through loudspeakers, and the data received with a microphone array were sampled at 16 kHz and quantized to 16 bits. The power of the interfering speech was set to -10 dB relative to the power of the target signal. With no other noise this corresponds to an SN ratio of 10 dB, but because noise from computers and an air conditioner was present in the laboratory, the actual SN ratio is thought to have been lower. To evaluate the effect of the method on this background noise, an experiment without playback of the interfering speech was also carried out.

The arrangement of the microphones and sound sources was as follows. The microphone array consisted of 16 microphones on a 4-by-4 grid with 10 cm spacing. The source (loudspeaker) of the target signal was placed in the direction perpendicular to the plane of the microphones, and the source of the interfering speech was placed 30 degrees away from the target source. Both sources were about 1 m from the microphone array.

In DS and in the present method, the delays used to bring the target signal into phase were determined by the following procedure. First, the impulse response from the target source to each microphone was measured with an m-sequence. Because an m-sequence stretched fourfold in time was used, the measured impulse response was the true impulse response convolved with a triangular window 7 samples wide. Next, the sample point at which the measured impulse response is negative with the largest absolute value was found, straight lines were fitted by least squares to the 4 samples before and the 4 samples after that point, and the time of their intersection was taken as the arrival time of the target signal. This processing was applied to every microphone to obtain the relative delays, and the target signal was brought into phase.
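The line-fitting step can be sketched as below. It assumes that the peak sample itself is excluded from both fits and that the peak lies at least `n_fit` samples from either end of the response; both are readings of the text rather than details stated in it, and the function name is hypothetical.

```python
import numpy as np

def arrival_time_apex(h, n_fit=4):
    """Intersection time of least-squares lines fitted on each side of the negative peak."""
    h = np.asarray(h, dtype=float)
    k = int(np.argmin(h))                      # most negative sample
    left = np.arange(k - n_fit, k)             # n_fit samples before the peak
    right = np.arange(k + 1, k + n_fit + 1)    # n_fit samples after the peak
    a1, b1 = np.polyfit(left, h[left], 1)      # h ~ a1*t + b1 on the left flank
    a2, b2 = np.polyfit(right, h[right], 1)    # h ~ a2*t + b2 on the right flank
    return float((b2 - b1) / (a1 - a2))        # solve a1*t + b1 = a2*t + b2
```

For an ideal triangular pulse whose flanks cover the fitted samples, the intersection coincides with the apex, which gives a sub-sample arrival time.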

The speech recognition results in the real acoustic space are as follows. Table 6 shows the results of the experiment conducted under the above conditions. When the interfering noise was played back, the present method improved the recognition rate by 4.1 relative to DS, although it fell short of the performance obtained in the simulation under ideal conditions. Furthermore, when the interfering speech was not played back, DS gave a recognition rate 1.6 higher than the present method. This is thought to be because the laboratory was relatively reverberant, so that the reverberant components of the target signal and the noise from the computers and the air conditioner arrived from a very large number of directions.

The experimental results are discussed below. When the characteristics of the microphones and of the transmission paths are ideal impulses, the present method was confirmed to be more effective than DS. Some data also suggested that, when noise arrives from a particular direction, particular data points are weighted unevenly and the noise reduction effect decreases as a result. Formulating this phenomenon may allow the performance to be improved further. When the characteristics of the microphones and transmission paths cannot be regarded as impulses, the effectiveness of the present method relative to DS decreased as those characteristics became more complex. However, insufficient accuracy of the method used in those experiments to determine the delays for bringing the target signal into phase may have contributed to this result. This result might therefore be improved by using a higher-performance technique for estimating the direction of arrival of the target signal, such as MUSIC.

Examples of fields of application of the present invention include speech recognition, noise suppression in a target signal, and waveform restoration of a target signal.

A diagram showing an example in which a target signal and a single noise, arriving as plane waves from the directions θ_S and θ, are received by a microphone array and delays are added to bring the target signal into phase.

A diagram showing the distribution on the complex plane of the frequency components M_i(ω) = S(ω) + N(ω)e^{-jωτ_i} of the received signals. All M_i(ω) lie at the same distance |N(ω)| from S(ω).

A diagram illustrating the method according to the present invention.

A circuit diagram for the case where the characteristics of the transmission paths and the microphones are taken into account.

An equivalent circuit for the case where the characteristics in FIG. 3 can be regarded as constant except for a time delay. By reinterpreting the signal and the noise, this reduces to FIG. 3.

A diagram showing the configuration of a speech recognition system.

A diagram showing the flow of acoustic analysis. (A) shows a system that performs no microphone array processing, (B) shows a system whose input is a waveform in which noise has been reduced by microphone array processing, and (C) shows a system that uses the estimate of the short-time spectrum of the target signal obtained by geometric processing of the microphone array input in the frequency domain.
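
The statement that all M_i(ω) lie at the distance |N(ω)| from S(ω) follows directly from the model M_i(ω) = S(ω) + N(ω)e^{-jωτ_i}, because the factor e^{-jωτ_i} has unit magnitude. A small numerical check is sketched below; the values of S, N, ω and τ_i are arbitrary illustrative choices, and the function name is hypothetical.

```python
import cmath

def observed_spectra(S, N, omega, taus):
    """Model M_i(w) = S(w) + N(w) * exp(-j*w*tau_i) for each channel delay tau_i."""
    return [S + N * cmath.exp(-1j * omega * tau) for tau in taus]

S = 0.5 - 0.2j                  # arbitrary target spectrum value at this frequency
N = 0.3 + 0.4j                  # arbitrary noise spectrum value, |N| = 0.5
omega = 2 * cmath.pi * 1000.0   # angular frequency for 1 kHz
taus = [k * 1e-4 for k in range(4)]  # illustrative noise delays in seconds

M = observed_spectra(S, N, omega, taus)
# Distance of every observed spectrum from the target spectrum S.
distances = [abs(m - S) for m in M]
print(distances)
```

Every distance equals |N|, so the observed spectra lie on a circle centered on S(ω); recovering that center recovers the target spectrum.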

Claims

1. A signal processing method using multiple channels each provided with a signal input unit, comprising:
a step of obtaining, for each frequency ω, the spectrum of the observation signal of each channel as a point on the complex plane by converting the observation signal from each signal input unit into the frequency domain for each channel; and
a step of estimating, for each frequency ω, a circle on the complex plane based on the plurality of spectra obtained for the respective channels, and estimating the center of the estimated circle as the spectrum of a target signal.

2. The signal processing method according to claim 1, wherein the multiple channels provided with signal input units are a microphone array composed of a plurality of microphones.

3. The signal processing method according to claim 1, wherein the observation signals are brought into phase, and the spectra are obtained by converting the in-phase observation signals into the frequency domain.

4. The signal processing method according to claim 1, wherein the spectrum is a short-time spectrum.

5. The signal processing method according to claim 4, wherein the short-time spectrum is obtained by a short-time Fourier transform.

6. The signal processing method according to any one of claims 1 to 5, wherein the observation signal m_i(t) is given by

m_i(t) = s(t) + n(t - τ_i)

and the spectrum M_i(ω) of the observation signal is given by

M_i(ω) = S(ω) + N(ω)e^{-jωτ_i}.

7. The signal processing method according to claim 1, wherein the estimation of the circle based on the plurality of spectra obtained for each frequency ω estimates a circle that passes through the plurality of spectra, or as close to them as possible, on the complex plane.

8. The signal processing method according to claim 7, wherein the spectrum of the target signal for each frequency ω is estimated by obtaining the center of the circle through variance minimization.

9. The signal processing method according to claim 8, wherein the estimate of the spectrum of the target signal for each frequency ω is obtained by the following expression:

[equation not reproduced in the source text]

10. The signal processing method according to claim 1, wherein the observation signal includes a plurality of noises, and the estimation of the circle based on the plurality of spectra obtained for the respective channels for each frequency ω assumes, based on the plurality of spectra obtained for each frequency ω, a virtual spectrum of a virtual observation signal that includes only one of the plurality of noises, and estimates a circle passing through the virtual spectrum on the complex plane.

11. The signal processing method according to claim 1, wherein the estimate of the spectrum of the target signal for each frequency ω is obtained by the following expression:

[equation not reproduced in the source text]

12. The signal processing method according to claim 1, wherein the frequency ω is a discrete value in a finite total frequency band.

13. The signal processing method according to any one of claims 1 to 12, further comprising:
a step of obtaining, for each frequency ω, the center of gravity on the complex plane of the plurality of spectra obtained for the respective channels; and
a step of evaluating the reliability of the circle center on the complex plane obtained for each frequency ω,
wherein a weighted point between the circle center and the center of gravity, based on the reliability evaluation, is estimated as the spectrum of the target signal for each frequency ω.

14. The signal processing method according to claim 13, wherein the step of evaluating the reliability of the circle center includes, as a factor, the angular differences between the plurality of spectra on the complex plane obtained for each frequency ω, as seen from the estimated circle center.

15. The signal processing method according to claim 13, wherein the spectrum of the target signal is obtained by variance minimization, and the step of evaluating the reliability of the circle center includes the variance as a factor.

16. The signal processing method according to claim 13, wherein the evaluation of the reliability of the circle center includes, as a factor, a correlation coefficient between the real part and the imaginary part of the spectra of the observation signals.

17. The signal processing method according to claim 1, further comprising a step of performing speech recognition using the spectrum of the obtained target signal.

18. The signal processing method according to claim 1, further comprising a step of obtaining a waveform of the target signal from the obtained spectrum of the target signal.

19. The signal processing method according to claim 1, further comprising a step of estimating the direction of arrival of the noise by estimating a noise arrival direction for each frequency ω and integrating the noise arrival directions obtained for the respective frequencies ω.

20. A signal processing apparatus comprising:
multiple channels each provided with a signal input unit;
a complex spectrum value calculation unit that obtains, for each frequency ω, the spectrum of the observation signal of each channel as a point on the complex plane by converting the observation signal from each signal input unit into the frequency domain for each channel; and
a target signal spectrum estimation unit that estimates, for each frequency ω, a circle on the complex plane based on the plurality of spectra obtained for the respective channels, and estimates the center of the estimated circle as the spectrum of a target signal.

21. The signal processing device according to claim 20, wherein the multiple channels provided with signal input units are a microphone array composed of a plurality of microphones.

22. The signal processing device according to claim 20, further comprising a delay unit that adds a delay to the observation signal of each channel to correct the time differences of the target signal and bring it into phase.

23. The signal processing device according to claim 20, wherein the complex spectrum value calculation unit is a short-time Fourier transform unit.

24. The signal processing device according to claim 20, wherein the estimation of the circle based on the plurality of spectra obtained for each frequency ω estimates a circle that passes through the plurality of spectra, or as close to them as possible, on the complex plane.

25. The signal processing apparatus according to claim 24, wherein the spectrum of the target signal for each frequency ω is estimated by obtaining the center of the circle through variance minimization.

26. The signal processing apparatus according to claim 20, wherein the observation signal includes a plurality of noises, and the estimation of the circle based on the plurality of spectra obtained for the respective channels for each frequency ω assumes, based on the plurality of spectra obtained for each frequency ω, a virtual spectrum of a virtual observation signal that includes only one of the plurality of noises, and estimates a circle passing through the virtual spectrum on the complex plane.

27. The signal processing device according to any one of claims 20 to 26, further comprising:
a center-of-gravity calculation unit that obtains, for each frequency ω, the center of gravity on the complex plane of the plurality of spectra obtained for the respective channels; and
a reliability evaluation unit that evaluates the reliability of the circle center on the complex plane obtained for each frequency ω,
wherein a weighted point between the circle center and the center of gravity, based on the reliability evaluation, is estimated as the spectrum of the target signal for each frequency ω.

28. The signal processing device according to claim 20, further comprising a speech recognition unit that performs speech recognition using a spectrum of the obtained target signal.

29. The signal processing apparatus according to claim 20, further comprising a waveform acquisition unit that obtains a waveform of the target signal from the obtained spectrum of the target signal.

30. The signal processing device according to claim 20, further comprising a noise arrival direction estimation unit that estimates the direction of arrival of the noise by estimating a noise arrival direction for each frequency ω and integrating the noise arrival directions obtained for the respective frequencies ω.

31. A computer program for causing a computer to execute the signal processing method according to claim 1.
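
The per-frequency estimation recited in the claims (a circle fitted to the channel spectra, its center taken as the target spectrum, and optionally a weighted point between that center and the center of gravity) can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed expressions, which are not reproduced in this text. The circle fit is a standard algebraic least-squares fit, which is one possible reading of "variance minimization" since it minimizes the variance of the squared distances from the center to the points. The reliability measure `angular_spread` and all function names are hypothetical.

```python
import cmath

def circle_center(points):
    """Center of the circle that minimizes the variance of the squared
    distances from the center to the points (algebraic least-squares fit)."""
    n = len(points)
    g = sum(points) / n  # center of gravity (centroid) of the spectra
    u = [(p - g).real for p in points]
    v = [(p - g).imag for p in points]
    suu = sum(a * a for a in u)
    svv = sum(b * b for b in v)
    suv = sum(a * b for a, b in zip(u, v))
    ru = 0.5 * sum(a * (a * a + b * b) for a, b in zip(u, v))
    rv = 0.5 * sum(b * (a * a + b * b) for a, b in zip(u, v))
    det = suu * svv - suv * suv
    if det <= 1e-12 * (suu + svv) ** 2:
        return g  # collinear or coincident points: no circle, fall back to centroid
    uc = (ru * svv - rv * suv) / det  # 2 x 2 normal equations, Cramer's rule
    vc = (rv * suu - ru * suv) / det
    return g + complex(uc, vc)

def angular_spread(points, center):
    """HYPOTHETICAL reliability cue in [0, 1]: largest angle between any two
    spectra as seen from the estimated center, divided by pi."""
    rel = [p - center for p in points]
    return max(abs(cmath.phase(a * b.conjugate())) for a in rel for b in rel) / cmath.pi

def blend(center, centroid, reliability):
    """Weighted point between circle center and centroid; weight clipped to [0, 1]."""
    w = min(1.0, max(0.0, reliability))
    return w * center + (1.0 - w) * centroid

# Synthetic spectra on a circle of radius |N| around S (illustrative values only).
S, N = 0.5 - 0.2j, 0.3 + 0.4j
points = [S + N * cmath.exp(-1j * phi) for phi in (0.0, 0.7, 1.9, 3.1)]
center = circle_center(points)
centroid = sum(points) / len(points)
w = angular_spread(points, center)
estimate = blend(center, centroid, w)
print(center, centroid, w, estimate)
```

When the spectra cover only a narrow arc, the fitted center is poorly conditioned, so a small reliability weight moves the estimate toward the center of gravity. The center of gravity corresponds to averaging the in-phase channels.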