JP4119328B2

JP4119328B2 - Sound collection method, apparatus thereof, program thereof, and recording medium thereof.

Info

Publication number: JP4119328B2
Application number: JP2003293785A
Authority: JP
Inventors: 和則小林; 賢一古家; 陽一羽田; 澄宇阪内
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2003-08-15
Filing date: 2003-08-15
Publication date: 2008-07-16
Anticipated expiration: 2023-08-15
Also published as: JP2005064968A

Description

The present invention relates to a sound collection method and apparatus for video conferences, audio conferences, telephone calls, remote lectures, and the like.

FIG. 6 is a block diagram of a conventional sound collection device. The device comprises microphones 11_1 to 11_M, adaptive filters 13B_1 to 13B_M, learning filters 13A_1 to 13A_M, spatial characteristic filters 18_{1,1} to 18_{J,M}, signal generators 17_1 to 17_J, delay units 19_1 to 19_J, a sound collection range setting unit 30, a virtual target sound source position setting unit 26, a spatial characteristic estimation unit 27, an adaptation period detection unit 20, an adaptive algorithm unit 16, and adders 12_1 to 12_M, 14A, 14B, 15, 51_1 to 51_M, and 52.

The conventional device of FIG. 6 suppresses noise and collects the target sound with high quality: it collects sound from sources inside a preset sound collection range and suppresses sound from noise sources outside that range. However, it distinguishes noise from the target sound by whether the source signal is stationary or non-stationary in time. The target sound is assumed to be a non-stationary signal such as speech, and the noise is assumed to be a stationary signal such as air-conditioning noise. Non-stationary noise therefore cannot be suppressed.

The signals collected by the microphones 11_1 to 11_M are filtered by the adaptive filters 13B_1 to 13B_M, respectively, then summed by the adder 14B and output. The adaptive filters 13B_1 to 13B_M are trained to have high sensitivity over the sound collection range set by the sound collection range setting unit 30 and low sensitivity toward noise source positions outside that range, so the output of the adder 14B is a high-quality sound with a high target-sound-to-noise ratio (SNR). However, because the conventional device always constrains the sensitivity over the entire sound collection range, its noise suppression performance degrades as the sound collection range becomes wider.

Next, the training of the adaptive filters 13B_1 to 13B_M is described in detail. Training uses the noise actually collected, a virtual collected signal synthesized from virtual target sound sources, and the learning filters. Virtual target sound sources are used because a real target source is always observed mixed with noise, so the target sound and the noise cannot be processed separately.

First, the part that synthesizes the virtual collected signal from the virtual target sound sources is described. The sound collection range setting unit 30 sets the range to be collected (the range over which the source moves, the range of source position measurement error, and so on), and the virtual target sound source position setting unit 26 places virtual target sound source positions uniformly within that range. The spacing between virtual target sound source positions must be sufficiently small: it is chosen so that when a source moves from one virtual position to an adjacent one, the change in relative delay between microphones is smaller than the sampling period. The spatial characteristic estimation unit 27 estimates the impulse response from each virtual target sound source position to each microphone position and sets it as the coefficients of the spatial characteristic filters 18_{1,1} to 18_{J,M}. The mutually uncorrelated stationary signals generated by the signal generators 17_1 to 17_J are filtered by the spatial characteristic filters 18_{1,1} to 18_{J,M} and summed per microphone by the adders 51_1 to 51_M. Filtering the signals with the spatial characteristic filters in this way virtually synthesizes the collected target sound signal.
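The spacing criterion for virtual target sound source positions can be checked numerically. The following is a minimal sketch, assuming free-field propagation and a sound speed of 340 m/s; the function names, microphone coordinates, and sampling rate are illustrative assumptions and do not come from the patent.

```python
import math

SPEED_OF_SOUND = 340.0  # m/s, assumed value


def relative_delays(source, mics, c=SPEED_OF_SOUND):
    """Propagation delay to each microphone relative to the first one (seconds)."""
    delays = [math.dist(source, m) / c for m in mics]
    return [d - delays[0] for d in delays]


def spacing_is_fine_enough(pos_a, pos_b, mics, fs, c=SPEED_OF_SOUND):
    """True if moving from pos_a to pos_b changes every inter-microphone
    relative delay by less than one sampling period (1 / fs)."""
    rel_a = relative_delays(pos_a, mics, c)
    rel_b = relative_delays(pos_b, mics, c)
    worst = max(abs(a - b) for a, b in zip(rel_a, rel_b))
    return worst < 1.0 / fs


# Hypothetical two-microphone array, 20 cm apart, sampled at 16 kHz.
mics = [(0.0, 0.0, 0.0), (0.2, 0.0, 0.0)]
print(spacing_is_fine_enough((1.0, 1.0, 0.0), (1.01, 1.0, 0.0), mics, fs=16000))
print(spacing_is_fine_enough((1.0, 1.0, 0.0), (2.0, 0.0, 0.0), mics, fs=16000))
```

A 1 cm step passes the check for this geometry, while a jump of more than a metre does not, which is why the grid of virtual positions has to be dense.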

Next, the virtually synthesized target sound signal and the actually collected noise signal are summed by the adders 12_1 to 12_M, filtered by the learning filters 13A_1 to 13A_M, and summed by the adder 14A. The output of the adder 14A is the output for the virtually synthesized collected signal. If the noise component of this output is small and the virtual target sound component is only slightly degraded, the sound is being collected with high quality. The adder 15 therefore subtracts the original virtual target sound from the output, and the output of the adder 15 is used as the error signal for updating the learning filters 13A_1 to 13A_M. To make the non-causal part of the learning filters 13A_1 to 13A_M effective, the delay units 19_1 to 19_J delay the original virtual target sounds, the adder 52 sums them, and this sum is what is subtracted.

From the error signal output by the adder 15 and the input signals to the learning filters 13A_1 to 13A_M, the adaptive algorithm unit 16 computes the update vector for the learning filters so that the mean square of the error signal is minimized. The adaptive filters 13B_1 to 13B_M are set to the same filter coefficients as the learning filters 13A_1 to 13A_M; they collect sound from target sources within the set sound collection range and suppress noise. If the signals collected by the microphones 11_1 to 11_M contain the actual target sound, the filters would be trained to lower the sensitivity toward the actual target source, so filter updating must stop while the actual target sound is present. The adaptation period detection unit 20 detects the presence of the actual target sound by monitoring the power of the signals collected by the microphones 11_1 to 11_M and stops the adaptation.

Next, the adaptive algorithm unit 16 is described in detail. Adaptive algorithms include the LMS algorithm, the NLMS algorithm, and the projection algorithm. This specification takes the NLMS method as an example and derives the converged filter solution and the update formula. First, the symbols used in the formulas are defined. Let n be the time discretized at the sampling period, M the number of microphones, and J the number of virtual target sound sources. Let x_i(n) be the signal collected by the i-th microphone 11_i at time n; L samples of it are taken and expressed as a matrix.

The output signal of the j-th signal generator 17_j is denoted v_j(n), and the spatial characteristic filter for the j-th signal generator 17_j and the i-th microphone 11_i is denoted g_{i,j}(n). The spatial characteristic filter output is u_{i,j}(n) = g_{i,j}(n) * v_j(n), where * denotes convolution; L samples of it are likewise taken and expressed as a matrix.

The learning filters 13A_1 to 13A_M and the adaptive filters 13B_1 to 13B_M are L-tap FIR filters whose coefficients are expressed as a matrix. Here h_i(n − l − 1) denotes the coefficient of the l-th tap of the filter for the i-th microphone at time n, and the same filter coefficients are used for the learning filters 13A_1 to 13A_M and the adaptive filters 13B_1 to 13B_M. Let y'(n) be the output of the adder 14A, y(n) the output of the adder 14B, and e(n) the error output by the adder 15. The delays of the delay units 19_1 to 19_J are all equal and are denoted τ_0 (typically, τ_0 is half the tap length of the learning filters 13A_1 to 13A_M).

First, the mean square of the error e(n) output by the adder 15 is computed. The filter that minimizes this mean square error is the optimum filter.

Here the overbar denotes time averaging. Since the virtual target signals v_j(n) are mutually uncorrelated, and the virtual target signals are uncorrelated with the noise, equation (1) can be rewritten as equation (2).

Treating the adaptive filter as an L-tap FIR filter and writing equation (2) in vector notation gives equation (3).

The filter that minimizes equation (3) is the optimum filter, so equation (3) is partially differentiated with respect to the filter coefficients and set to zero to find the minimum.

Solving equation (4) for the filter coefficients yields the optimum filter that minimizes equation (3), which is equation (5).
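The steps above follow the standard minimum-mean-square-error argument. The sketch below restates that argument with generic symbols, which are assumptions rather than the patent's own notation: \mathbf{h} stacks all filter coefficients, \mathbf{z}(n) stacks the corresponding filter inputs, and d(n) is the delayed virtual target sound subtracted at the adder 15. It is not a reproduction of equations (1) to (5).

```latex
\begin{align*}
e(n) &= \mathbf{h}^{T}\mathbf{z}(n) - d(n) \\
\overline{e^{2}(n)} &= \mathbf{h}^{T}\,\overline{\mathbf{z}(n)\mathbf{z}^{T}(n)}\,\mathbf{h}
  - 2\,\mathbf{h}^{T}\,\overline{\mathbf{z}(n)\,d(n)} + \overline{d^{2}(n)} \\
\frac{\partial\,\overline{e^{2}(n)}}{\partial \mathbf{h}}
  &= 2\,\overline{\mathbf{z}(n)\mathbf{z}^{T}(n)}\,\mathbf{h}
  - 2\,\overline{\mathbf{z}(n)\,d(n)} = \mathbf{0} \\
\mathbf{h}_{\mathrm{opt}} &= \left(\overline{\mathbf{z}(n)\mathbf{z}^{T}(n)}\right)^{-1}
  \overline{\mathbf{z}(n)\,d(n)}
\end{align*}
```

The last line is the familiar normal-equation solution: the inverse of the input correlation matrix applied to the cross-correlation between the input and the desired signal.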

Adaptive algorithms such as the LMS algorithm, the NLMS algorithm, and the projection algorithm can be used to obtain the optimum filter of equation (5). This specification uses the NLMS algorithm as the example; its update formula is given by equation (6).

The term used in equation (6) is defined by equation (7).
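For orientation, the following is a minimal sketch of a textbook single-channel NLMS update. It is not the multichannel update of equations (6) and (7). The function name, step size, regularization constant `eps`, and test signal are all assumptions made for illustration. The error here is defined as desired minus output, which is the opposite sign from the adder 15 described above but leads to the same minimizer.

```python
import numpy as np


def nlms(x, d, num_taps, step=0.5, eps=1e-8):
    """Generic NLMS: adapt FIR coefficients h so that h . x_vec(n) tracks d(n)."""
    h = np.zeros(num_taps)
    err = np.zeros(len(x))
    for n in range(num_taps - 1, len(x)):
        x_vec = x[n - num_taps + 1:n + 1][::-1]   # newest sample first
        err[n] = d[n] - h @ x_vec                  # a-priori error
        h += step * err[n] * x_vec / (x_vec @ x_vec + eps)  # normalized update
    return h, err


# Identify a hypothetical 4-tap system from noise-free data.
rng = np.random.default_rng(0)
x = rng.standard_normal(4000)
true_h = np.array([0.5, -0.3, 0.2, 0.1])
d = np.convolve(x, true_h)[:len(x)]   # desired signal: x passed through true_h
h_est, err = nlms(x, d, num_taps=4)
print(np.round(h_est, 3))
```

Dividing the update by the input energy is what distinguishes NLMS from LMS: the effective step size no longer depends on the input level.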

It has thus been shown that the optimum filter of equation (5) can be obtained using the update formula of equation (6).
JP-A-14-062895

However, the conventional sound collection method described above has two problems: it cannot suppress non-stationary noise signals (for example, the voice of a speaker who should not be collected), and its noise suppression performance degrades when the sound collection range is widened.

An object of the present invention is to provide a sound collection method, apparatus, program, and recording medium that collect sound while suppressing non-stationary noise signals and that achieve high suppression performance regardless of the width of the sound collection range.

To achieve the above object, the sound collection method of the present invention comprises: a sound collection range setting stage that sets a sound collection range; a speaker position detection stage that detects a speaker position from the sound signals received by each of a plurality of sound collection means; a filter coefficient setting stage that sets filter coefficients from the received sound signals under the condition that the speaker's voice is collected when the detected speaker position is inside the sound collection range and suppressed when it is outside; a filter stage that filters the sound signal received by each of the sound collection means with the filter coefficients; and an addition stage that sums the output signals of the filter stages.

This makes it possible to collect only the voice within the set sound collection range and to suppress other sounds.
According to an embodiment of the present invention, the filter coefficient setting stage includes: an FFT stage that converts the sound signal received by each of the sound collection means into the frequency domain; a covariance matrix calculation stage that multiplies the output signals of the FFT stage with one another for each frequency component to obtain a covariance matrix; a covariance matrix storage stage that averages the covariance matrix for each detected speaker position and stores it; and a filter coefficient calculation stage that calculates the filter coefficients from the stored covariance matrices, the detected speaker position, and the sound collection range.
According to an embodiment of the present invention, the filter coefficient setting stage includes: an FFT stage that converts the sound signal received by each of the sound collection means into the frequency domain; a covariance matrix calculation stage that multiplies the output signals of the FFT stage with one another for each frequency component to obtain a covariance matrix; a covariance matrix storage stage that averages the covariance matrix for each detected speaker position and stores it; a whitening stage that multiplies the stored covariance matrix by a gain that flattens the frequency characteristic of either the diagonal component with the largest power or the sum of the diagonal components of the stored covariance matrix; and a filter coefficient calculation stage that calculates the filter coefficients from the whitened covariance matrices, the detected speaker position, and the sound collection range.

To solve the above problem, another sound collection method of the present invention comprises: a sound collection range and volume range setting stage that sets a sound collection range and a volume range; a speaker position detection stage that detects a speaker position from the sound signals received by each of a plurality of sound collection means; a speaker volume estimation stage that estimates the speaker volume from the received sound signals; a filter coefficient setting stage that sets filter coefficients from the received sound signals under the condition that the speaker's voice is collected when the detected speaker position is inside the sound collection range and the estimated speaker volume is inside the volume range, and suppressed otherwise; a filter stage that filters the sound signal received by each of the sound collection means with the filter coefficients; and an addition stage that sums the output signals of the filter stages.

Adding the volume range condition to the sound collection range condition makes it possible to suppress only the unwanted voices of speakers far from the sound collection means.

According to an embodiment of the present invention, the filter coefficient setting stage includes: an FFT stage that converts the sound signal received by each of the sound collection means into the frequency domain; a covariance matrix calculation stage that multiplies the output signals of the FFT stage with one another for each frequency component to obtain a covariance matrix; a covariance matrix storage stage that averages the covariance matrix for each detected speaker position and stores it; and a filter coefficient calculation stage that calculates the filter coefficients from the stored covariance matrices, the detected speaker position and the sound collection range, and the estimated speaker volume and the volume range.

According to an embodiment of the present invention, the filter coefficient setting stage includes: an FFT stage that converts the sound signal received by each of the sound collection means into the frequency domain; a covariance matrix calculation stage that multiplies the output signals of the FFT stage with one another for each frequency component to obtain a covariance matrix; a covariance matrix storage stage that averages the covariance matrix for each detected speaker position and stores it; a whitening stage that multiplies the stored covariance matrix by a gain that flattens the frequency characteristic of either the diagonal component with the largest power or the sum of the diagonal components of the stored covariance matrix; and a filter coefficient calculation stage that calculates the filter coefficients from the whitened covariance matrices, the detected speaker position and the sound collection range, and the estimated speaker volume and the volume range.

Whitening the covariance matrix yields a filter that does not depend on the frequency characteristics of the sound source. The filter therefore does not change when the frequency characteristics of the source change, which prevents the processing of the present invention from altering the voice.

As described above, the present invention sets a sound collection range, collects the voice when the detected speaker position is inside the range, and suppresses voices outside the range. Because sounds outside the range are suppressed whether they are stationary or non-stationary, unwanted voices can be suppressed. In addition, because only the sensitivity toward source positions that are actually producing sound is controlled, the suppression performance does not degrade as the sound collection range becomes wider.

[First Embodiment]
FIG. 1 is a block diagram of a sound collection device according to the first embodiment of the present invention.

The sound collection device of the first embodiment comprises microphones 11_1 to 11_M, a speaker position detection unit 23, a sound collection range setting unit 25, a filter coefficient setting unit 24, filter units 21_1 to 21_M, and an adder 22.

The sound collection range setting unit 25 sets the range to be collected. The range is either set by the user with buttons, a remote control, or the like, or fixed in advance. The speaker position detection unit 23 detects the speaker position from the signals received by the microphones 11_1 to 11_M and the positions of the microphones. The filter coefficient setting unit 24 calculates filter coefficients so that the voice is collected if the detected speaker position is inside the sound collection range set by the sound collection range setting unit 25 and suppressed if it is outside. The calculated filter coefficients are copied to the filter units 21_1 to 21_M, which filter the signals received by the microphones 11_1 to 11_M, respectively. The outputs of the filter units 21_1 to 21_M are summed by the adder 22 to form the output signal. The result is an output signal that contains only the sound from inside the sound collection range, with unwanted sound from outside the range suppressed.
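The signal path through the filter units and the adder is a filter-and-sum structure. A minimal time-domain sketch follows, assuming FIR filters; the function name and array layout are illustrative assumptions.

```python
import numpy as np


def filter_and_sum(mic_signals, filters):
    """Filter each microphone signal with its own FIR filter and sum the results.

    mic_signals: array of shape (M, N); filters: array of shape (M, L).
    """
    n = mic_signals.shape[1]
    out = np.zeros(n)
    for x_i, h_i in zip(mic_signals, filters):
        out += np.convolve(x_i, h_i)[:n]   # role of filter unit 21_i
    return out                              # role of adder 22


# Two microphones, each scaled by 0.5: the output is the channel average.
mic_signals = np.array([[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]])
filters = np.array([[0.5, 0.0], [0.5, 0.0]])
output = filter_and_sum(mic_signals, filters)
print(output)
```

All of the spatial selectivity comes from the filter coefficients; the structure itself stays fixed while the filter coefficient setting unit changes the coefficients.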

The speaker position detection unit 23 and the filter coefficient setting unit 24 are described in detail below.

The speaker position detection unit 23 can be realized, for example, by the following method.

A covariance matrix is calculated from the signals of the microphones 11_1 to 11_M, and the voice power at each scanning position is estimated by multiplying the covariance matrix by the steering vector set for that scanning position. The scanning position with the maximum estimated power is detected as the speaker position.

This is described below using formulas.

First, let x_i(t) be the signal received by the i-th microphone 11_i and X_i(ω) its frequency-domain representation. The input signal vector is defined by equation (8), where ^T denotes the matrix transpose.

Next, the covariance matrix is given by equation (9), where ^H denotes the conjugate transpose of a matrix.
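Read together with the definitions above, equations (8) and (9) plausibly take the following standard form. The symbol names \mathbf{X}(\omega) and \mathbf{R}_{XX}(\omega) are assumptions, since the text does not spell them out.

```latex
\begin{align*}
\mathbf{X}(\omega) &= \left[X_{1}(\omega),\, X_{2}(\omega),\, \dots,\, X_{M}(\omega)\right]^{T} \\
\mathbf{R}_{XX}(\omega) &= \mathbf{X}(\omega)\,\mathbf{X}^{H}(\omega)
\end{align*}
```

Under this reading, \mathbf{R}_{XX}(\omega) is an M × M Hermitian matrix whose (i, k) element is X_i(\omega) multiplied by the complex conjugate of X_k(\omega).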

Next, the steering vector used for voice power estimation is described. The steering vector is set so that sound arriving from the scanning position becomes in phase across the microphones. With such a steering vector, only the in-phase signal (the sound generated at the scanning position) is emphasized, and a sharp directivity is formed toward the scanning position.

For a scanning position (x, y, z), the delay d_i(x, y, z) applied to the i-th microphone 11_i is obtained from the scanning position (x, y, z), the i-th microphone position (x_i, y_i, z_i), and the speed of sound c, using equations (10) and (11), so that sound emitted from the scanning position becomes in phase.

Here D is a fixed delay given in advance as a constant.

Converting equation (10) to the frequency domain gives equation (12), and collecting it into a vector gives the steering vector of equation (13).
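One common reading of this passage is that each channel is delayed by D minus its propagation time, so that after the delays the sound from the scanning position lines up in every channel. The sketch below follows that assumed form and is not a reproduction of equations (10) to (13); the function names, the 340 m/s sound speed, and the sign convention of the phase term are assumptions.

```python
import numpy as np

SPEED_OF_SOUND = 340.0  # m/s, assumed


def alignment_delays(scan_pos, mic_positions, fixed_delay, c=SPEED_OF_SOUND):
    """Delay d_i for each microphone so that sound emitted at scan_pos
    is time-aligned across channels (assumed form: D minus propagation time)."""
    dist = np.linalg.norm(np.asarray(mic_positions) - np.asarray(scan_pos), axis=1)
    return fixed_delay - dist / c


def steering_vector(omega, delays):
    """Frequency-domain form of the delays: one unit-modulus phase term per microphone."""
    return np.exp(-1j * omega * delays)


# Hypothetical three-microphone array and scanning position.
mics = [(0.0, 0.0, 0.0), (0.2, 0.0, 0.0), (0.0, 0.2, 0.0)]
scan = (1.0, 0.5, 0.0)
d = alignment_delays(scan, mics, fixed_delay=0.01)
a = steering_vector(2 * np.pi * 1000.0, d)
```

The fixed delay D only has to be large enough to keep every d_i non-negative, so that the delays are causal.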

Multiplying the covariance matrix by this steering vector and integrating over frequency gives the estimated voice power for each scanning position, as expressed by equation (14).

Because the steering vector brings into phase, and thereby emphasizes, only the sound generated at the scanning position (x, y, z), the estimated voice power takes a large value only when a sound source is present at the scanning position. The speaker position can therefore be estimated by detecting the scanning position (x_m, y_m, z_m) with the maximum estimated power.
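The scan over positions can be sketched as follows. This is an illustration under assumptions, not the patent's equation (14): the steering vector is taken as exp(-jω d_i) with d_i equal to a fixed delay minus the propagation time, the integral over frequency is replaced by a sum over a few bins, and the covariance matrices come from an ideal synthetic point source. All names are hypothetical.

```python
import numpy as np

C = 340.0  # speed of sound in m/s (assumed)


def estimated_power(cov_by_freq, omegas, scan_pos, mics, fixed_delay=0.0):
    """Sum over frequency of a^T R a^*, the power of the phase-aligned channel sum."""
    dist = np.linalg.norm(mics - scan_pos, axis=1)
    delays = fixed_delay - dist / C
    total = 0.0
    for omega, cov in zip(omegas, cov_by_freq):
        a = np.exp(-1j * omega * delays)      # steering vector for this bin
        total += np.real(a @ cov @ a.conj())
    return total


def locate_speaker(cov_by_freq, omegas, scan_positions, mics):
    """Return the scanning position with the largest estimated power."""
    powers = [estimated_power(cov_by_freq, omegas, p, mics) for p in scan_positions]
    return scan_positions[int(np.argmax(powers))]


# Synthetic check: an ideal point source placed at one of the scanning positions.
mics = np.array([[0.0, 0.0, 0.0], [0.2, 0.0, 0.0], [0.0, 0.2, 0.0], [0.2, 0.2, 0.0]])
scan_positions = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [-1.0, 0.0, 0.0]])
true_pos = scan_positions[1]
omegas = 2 * np.pi * np.array([500.0, 1000.0, 1500.0])
cov_by_freq = []
for omega in omegas:
    x = np.exp(-1j * omega * np.linalg.norm(mics - true_pos, axis=1) / C)
    cov_by_freq.append(np.outer(x, x.conj()))
found = locate_speaker(cov_by_freq, omegas, scan_positions, mics)
print(found)
```

At the true position every channel adds coherently, so the power reaches its upper bound of M squared per bin; at other positions the phase terms partly cancel.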

Next, the filter coefficient setting unit 24 is described in detail.

The filter coefficient setting unit 24 determines whether the speaker position detected by the speaker position detection unit 23 is inside the sound collection range. A position inside the range is treated as a collection target; any other position is treated as a suppression target.
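The in-range decision can be sketched as below. The patent does not specify how the sound collection range is represented, so the axis-aligned box used here, and the names `Box` and `classify`, are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical axis-aligned sound collection range: (min, max) per axis."""
    x: tuple
    y: tuple
    z: tuple

    def contains(self, pos):
        return all(lo <= p <= hi for p, (lo, hi) in zip(pos, (self.x, self.y, self.z)))


def classify(speaker_pos, collection_range):
    """Return 'collect' for positions inside the range, 'suppress' otherwise."""
    return "collect" if collection_range.contains(speaker_pos) else "suppress"


collection_range = Box(x=(0.0, 2.0), y=(-1.0, 1.0), z=(0.0, 1.5))
print(classify((1.0, 0.0, 1.0), collection_range))   # inside the box
print(classify((3.0, 0.0, 1.0), collection_range))   # outside along x
```

Any other region shape, such as an angular sector around the device, would work the same way as long as it offers a containment test.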

A filter that collects only the voice within the sound collection range and suppresses other sounds must satisfy two conditions. When the input signal vector of a collection target is filtered by the filter and summed, the result must equal the collection-target input signal simply mixed by the mixing vector. When the input signal vector of a suppression target is filtered by the filter and summed, the result must be 0. The filter is therefore optimal when it satisfies equations (15), (16), and (17).

Solving equations (15) to (17) for the filter in the least-squares sense gives equation (18).

Here C_Sj and C_Nk are the weights for collecting and for suppressing speaker voices, respectively: increasing C_Nk increases the suppression of unwanted voices, and increasing C_Sj reduces the degradation of the collected voice.
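A weighted least-squares solution of this kind can be sketched per frequency bin as follows. The closed form used here is derived from the two stated conditions under the assumption that the output is W^H X and the target is A^H X; it is offered as a plausible form, not as a reproduction of equation (18). The names `solve_filter`, `mixing`, and the diagonal loading term are assumptions.

```python
import numpy as np


def solve_filter(cov_collect, cov_suppress, w_collect, w_suppress, mixing, loading=0.0):
    """Weighted least-squares filter for one frequency bin.

    Minimizes sum_j C_Sj E|W^H X_Sj - A^H X_Sj|^2 + sum_k C_Nk E|W^H X_Nk|^2,
    whose normal equations are (sum C_Sj R_Sj + sum C_Nk R_Nk) W = (sum C_Sj R_Sj) A.
    """
    m = len(mixing)
    r_s = sum((c * r for c, r in zip(w_collect, cov_collect)), np.zeros((m, m), complex))
    r_n = sum((c * r for c, r in zip(w_suppress, cov_suppress)), np.zeros((m, m), complex))
    lhs = r_s + r_n + loading * np.eye(m)   # diagonal loading is an added safeguard
    return np.linalg.solve(lhs, r_s @ mixing)


# Hypothetical bin with one in-range talker and one out-of-range talker.
x_s = np.array([1, 1j, -1])             # collection target
x_n = np.array([1, 1, 1], complex)      # suppression target
a = np.array([1, 0, 0], complex)        # mixing vector: pass microphone 1 through
w = solve_filter([np.outer(x_s, x_s.conj())], [np.outer(x_n, x_n.conj())],
                 w_collect=[1.0], w_suppress=[1.0], mixing=a, loading=1e-8)
print(abs(np.vdot(w, x_n)))                    # response to the suppressed talker
print(abs(np.vdot(w, x_s) - np.vdot(a, x_s)))  # deviation from the mixed target
```

With fewer active talkers than microphones the summed covariance matrix is rank-deficient, which is why the sketch adds a small diagonal loading before solving.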

Equation (18) shows that computing the filter coefficients requires the covariance matrix of the input signal for each speaker position. In the present invention, the covariance matrix obtained by equation (9) is time-averaged and stored for each speaker position. The covariance matrices for collection-target speaker positions and those for suppression-target speaker positions are kept as separate quantities.
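Per-position averaging and storage can be sketched with a small keyed store. The text says only that the covariance matrix is time-averaged; the incremental arithmetic mean used here, and the class and method names, are assumptions.

```python
import numpy as np


class CovarianceStore:
    """Running time average of covariance matrices, keyed by speaker position."""

    def __init__(self):
        self._mean = {}
        self._count = {}

    def update(self, position_key, cov):
        n = self._count.get(position_key, 0) + 1
        prev = self._mean.get(position_key, np.zeros_like(cov))
        self._mean[position_key] = prev + (cov - prev) / n   # incremental mean
        self._count[position_key] = n

    def get(self, position_key):
        return self._mean[position_key]


store = CovarianceStore()
store.update((0, 1), 2.0 * np.eye(2))
store.update((0, 1), 4.0 * np.eye(2))
store.update((3, 0), 1.0 * np.eye(2))
print(store.get((0, 1)))
```

An exponentially weighted average would be a reasonable alternative when speakers move or the room changes over time.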

The filter coefficients can be obtained by equation (18) from the covariance matrices obtained above.

As shown above, this embodiment can collect only the voice within the set sound collection range and suppress other sounds.

FIG. 5 illustrates an example use of the present invention. A sound collection device using the present invention is placed on a table with speakers seated around it. The device has a mute button for each range; pressing a button mutes (does not collect) only the sound from the range corresponding to that button. Because the present invention can set a range from which sound is not collected regardless of whether the sound is stationary or non-stationary, this kind of use is possible.

[Second Embodiment]
FIG. 2 is a block diagram of a sound collection device according to the second embodiment of the present invention.

The sound collection device of the second embodiment is an example in which a sound collection range and volume range setting unit 31 and a speaker volume estimation unit 32 are added to the sound collection device of the first embodiment.

The sound collection range and volume range setting unit 31 sets the sound collection range and the volume range. They are set by the user with buttons, a remote control, or the like, or are fixed in advance. The speaker volume estimation unit 32 estimates the power of the voice signal from the signals received by the microphones 11_1 to 11_M. The voice is collected when the speaker position detected by the speaker position detection unit 23 is inside the sound collection range and the estimated speaker volume is inside the volume range; otherwise the speaker's voice is suppressed. This makes it possible, for example, to collect only voices that are close to the microphones 11_1 to 11_M and received with high power, while suppressing the voices of speakers far from the microphones.

The method of estimating the speaker volume is described below. The speaker volume is the product of the input signal vector and the mixing vector, averaged within the low-frequency band W, and is therefore obtained by equation (19).

Equation (19) shows that the speaker volume can be estimated from the covariance matrix. Accordingly, the covariance matrix is obtained by equation (9), and the speaker volume is then obtained by equation (19).
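The relationship between the covariance matrix and the speaker volume can be illustrated with a small sketch. Equation (19) itself is not reproduced in this text, so the form used here is an assumption: the volume is taken as the quadratic form a^H R(w) a averaged over the frequency bins of the band, where a is the mixing vector and R(w) = x(w) x(w)^H is the covariance matrix of the input signal vector x(w). The function names are hypothetical.

```python
def outer_conj(x):
    """Instantaneous covariance matrix x x^H for one frequency bin."""
    return [[xi * xj.conjugate() for xj in x] for xi in x]


def quadratic_form(a, cov):
    """Real part of a^H R a for mixing vector a and covariance matrix R."""
    total = 0j
    for i, ai in enumerate(a):
        for j, aj in enumerate(a):
            total += ai.conjugate() * cov[i][j] * aj
    return total.real


def speaker_volume(a, covs):
    """Average of a^H R(w) a over the bins in covs (assumed form of eq. 19)."""
    return sum(quadratic_form(a, cov) for cov in covs) / len(covs)


def speaker_volume_direct(a, spectra):
    """Same quantity computed from the signal: mean of |a^H x(w)|^2."""
    powers = []
    for x in spectra:
        mixed = sum(ai.conjugate() * xi for ai, xi in zip(a, x))
        powers.append(abs(mixed) ** 2)
    return sum(powers) / len(powers)
```

Because a^H (x x^H) a equals |a^H x|^2, both functions return the same value, which is why a stored covariance matrix is sufficient to estimate the volume without keeping the input signals themselves.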

In the second embodiment, adding a volume range condition to the sound collection range condition of the first embodiment also makes it possible to suppress only the unwanted speech of speakers who are far from the microphones 11_1 to 11_M.

The remaining parts are the same as in the first embodiment, and their description is omitted.

[Third Embodiment]
FIG. 3 is a block diagram of a sound collection device according to the third embodiment of the present invention. The sound collection device of the third embodiment is an example in which, in the sound collection device of the first or second embodiment, the filter coefficient setting unit 24 is realized by FFT units 41_1 to 41_M, a covariance matrix calculation unit 42, a covariance matrix storage unit 43, and a filter coefficient calculation unit 44.

The FFT units 41_1 to 41_M each convert the signal received by the corresponding microphone 11_1 to 11_M into the frequency domain. The covariance matrix calculation unit 42 multiplies the FFT output signals across channels and obtains the covariance matrix by equation (9). The covariance matrix storage unit 43 time-averages the covariance matrix for each speaker position and stores it. The filter coefficient calculation unit 44 calculates the filter coefficients by equation (18).
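The chain from frequency transform to per-position storage can be sketched as below. Several details are assumptions, since equation (9) is not reproduced here: the covariance element is taken as X_i(k) multiplied by the conjugate of X_j(k), the time average is a plain running mean (the text does not say which kind of average is used), a naive DFT stands in for the FFT to keep the example dependency-free, and the position key is a hypothetical label.

```python
import cmath


def dft(frame):
    """Naive DFT of one real or complex frame (stand-in for an FFT unit)."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n))
            for k in range(n)]


def covariance_per_bin(spectra):
    """spectra[m][k] is bin k of channel m.

    Returns cov[k][i][j] = X_i(k) * conj(X_j(k)), one matrix per bin.
    """
    n_bins = len(spectra[0])
    return [[[xi[k] * xj[k].conjugate() for xj in spectra] for xi in spectra]
            for k in range(n_bins)]


class CovarianceStore:
    """Keeps a running mean of the covariance matrices per speaker position."""

    def __init__(self):
        self._mean = {}
        self._count = {}

    def update(self, position, cov):
        n = self._count.get(position, 0) + 1
        self._count[position] = n
        if n == 1:
            self._mean[position] = cov
            return
        old = self._mean[position]
        # Incremental mean: new = old + (sample - old) / n, element by element.
        self._mean[position] = [
            [[o + (c - o) / n for o, c in zip(old_row, cov_row)]
             for old_row, cov_row in zip(old_bin, cov_bin)]
            for old_bin, cov_bin in zip(old, cov)
        ]

    def get(self, position):
        return self._mean[position]
```

Keying the store by detected position is what later allows the filter coefficient calculation to treat positions inside and outside the collection range differently.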

The remaining parts are the same as in the first or second embodiment, and their description is omitted.

[Fourth Embodiment]
FIG. 4 is a block diagram of a sound collection device according to the fourth embodiment of the present invention. The sound collection device of the fourth embodiment is an example in which, in the sound collection device of the first or second embodiment, the filter coefficient setting unit 24 is realized by FFT units 41_1 to 41_M, a covariance matrix calculation unit 42, a covariance matrix storage unit 43, a whitening unit 45, and a filter coefficient calculation unit 44.

The FFT units 41_1 to 41_M, the covariance matrix calculation unit 42, the covariance matrix storage unit 43, and the filter coefficient calculation unit 44 perform the same processing as in the third embodiment, and their description is omitted.

The whitening unit 45 whitens the covariance matrix in the frequency domain, that is, gives it a flat frequency characteristic. Whitening is performed by multiplying the covariance matrix either by a whitening gain that smooths the diagonal component of the covariance matrix with the largest power, or by a whitening gain that smooths the average power of the diagonal components of the covariance matrix. These gains are given by equations (20) and (21), respectively.

Here, β is a coefficient that adjusts the degree of whitening: β = 1 gives complete whitening, and β = 0 gives no whitening.
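A possible reading of the whitening step is sketched below. Equations (20) and (21) are not reproduced in this text, so the gain form is an assumption chosen only to match the stated behavior of β: the gain at each frequency bin is p(k) raised to the power -β, where p(k) is a reference power taken either from the diagonal component with the largest total power ("max") or from the average of the diagonal components ("mean"). The names `reference_power` and `whiten` and the `floor` guard are hypothetical.

```python
def reference_power(covs, mode="max"):
    """Reference power per bin taken from the diagonal of each covariance matrix.

    mode="max":  the channel whose diagonal power summed over bins is largest.
    mode="mean": the average of all diagonal components in each bin.
    """
    n_ch = len(covs[0])
    if mode == "max":
        totals = [sum(cov[i][i].real for cov in covs) for i in range(n_ch)]
        i_max = totals.index(max(totals))
        return [cov[i_max][i_max].real for cov in covs]
    if mode == "mean":
        return [sum(cov[i][i].real for i in range(n_ch)) / n_ch for cov in covs]
    raise ValueError("mode must be 'max' or 'mean'")


def whiten(covs, beta, mode="max", floor=1e-12):
    """Multiply each bin's covariance matrix by the assumed gain p(k) ** -beta.

    beta = 1 makes the reference power flat across bins; beta = 0 leaves
    the matrices unchanged. floor guards against division by zero.
    """
    ref = reference_power(covs, mode)
    whitened = []
    for cov, p in zip(covs, ref):
        gain = max(p, floor) ** (-beta)
        whitened.append([[gain * v for v in row] for row in cov])
    return whitened
```

With β = 1 the reference diagonal component becomes 1 in every bin, so the filter derived from the whitened matrices no longer tracks the spectral shape of the source; intermediate values of β trade off between the two extremes.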

In the fourth embodiment, whitening the covariance matrix makes it possible to obtain a filter that does not depend on the frequency characteristics of the sound source. As a result, the filter does not change even when the frequency characteristics of the sound source change, which prevents the processing of the present invention from altering the timbre.

The remaining parts are the same as in the first or second embodiment, and their description is omitted.

The sound collection method of the present invention may be implemented not only by dedicated hardware but also by recording a program that implements its functions on a computer-readable recording medium and having a computer system read and execute the program recorded on that medium. A computer-readable recording medium refers to a recording medium such as a floppy disk, a magneto-optical disk, or a CD-ROM, or to a storage device such as a hard disk drive built into a computer system. The term also covers media that hold a program dynamically for a short time (a transmission medium or transmission wave), as when a program is transmitted over the Internet, and media that hold a program for a certain period, such as volatile memory inside the computer system that acts as the server in that case.

FIG. 1 is a block diagram showing the sound collection device of the first embodiment of the present invention.
FIG. 2 is a block diagram showing the sound collection device of the second embodiment of the present invention.
FIG. 3 is a block diagram showing the sound collection device of the third embodiment of the present invention.
FIG. 4 is a block diagram showing the sound collection device of the fourth embodiment of the present invention.
FIG. 5 is a diagram illustrating an example use of the present invention.
FIG. 6 is a block diagram showing an example of a conventional sound collection device.

Explanation of symbols

11_1 to 11_M Microphones
21_1 to 21_M Filter units
22 Adder
23 Speaker position detection unit
24 Filter coefficient setting unit
25 Sound collection range setting unit
31 Sound collection range / volume range setting unit
32 Speaker volume estimation unit
41_1 to 41_M FFT units
42 Covariance matrix calculation unit
43 Covariance matrix storage unit
44 Filter coefficient calculation unit
45 Whitening unit
12_1 to 12_M Adders
13A_1 to 13A_M Learning filters
13B_1 to 13B_M Adaptive filters
14A Adder
14B Adder
15 Adder
16 Adaptive algorithm unit
17_1 to 17_J Signal generators
18_1,1 to 18_J,M Spatial characteristic filters
19_1 to 19_J Delay units
20 Adaptation period detector
51_1 to 51_M Adders
52 Adder
26 Virtual sound source position setting unit
27 Spatial characteristic estimation unit
30 Sound collection range setting unit

Claims

A sound collection method comprising:
a sound collection range setting step of setting a sound collection range;
a speaker position detection step of detecting a speaker position from received sound signals received by each of a plurality of sound collecting means;
a filter coefficient setting step of setting filter coefficients using the received sound signals, under the condition that the speaker's voice is collected when the detected speaker position is within the sound collection range and is suppressed when the detected speaker position is outside the sound collection range;
a filter step of filtering the received sound signals received by each of the plurality of sound collecting means with the filter coefficients; and
an addition step of adding the output signals of the filter step.

A sound collection method comprising:
a sound collection range / volume range setting step of setting a sound collection range and a volume range;
a speaker position detection step of detecting a speaker position from received sound signals received by each of a plurality of sound collecting means;
a speaker volume estimation step of estimating a speaker volume from the received sound signals received by each of the plurality of sound collecting means;
a filter coefficient setting step of setting filter coefficients using the received sound signals, under the condition that the speaker's voice is collected when the detected speaker position is within the sound collection range and the estimated speaker volume is within the volume range, and is suppressed otherwise;
a filter step of filtering the received sound signals received by each of the plurality of sound collecting means with the filter coefficients; and
an addition step of adding the output signals of the filter step.

The sound collection method according to claim 2, wherein the filter coefficient setting step includes:
an FFT step of converting the received sound signals received by each of the plurality of sound collecting means into the frequency domain;
a covariance matrix calculation step of multiplying the output signals of the FFT step together for each frequency component to obtain a covariance matrix;
a covariance matrix storage step of averaging the covariance matrix for each detected speaker position and storing it; and
a filter coefficient calculation step of calculating the filter coefficients using the stored covariance matrix, the detected speaker position and the sound collection range, and the estimated speaker volume and the volume range.

The sound collection method according to claim 2, wherein the filter coefficient setting step includes:
an FFT step of converting the received sound signals received by each of the plurality of sound collecting means into the frequency domain;
a covariance matrix calculation step of multiplying the output signals of the FFT step together for each frequency component to obtain a covariance matrix;
a covariance matrix storage step of averaging the covariance matrix for each detected speaker position and storing it;
a whitening step of multiplying the stored covariance matrix by a gain that smooths the frequency characteristic of either the diagonal component of the stored covariance matrix having the highest power or the sum of the diagonal components of the stored covariance matrix; and
a filter coefficient calculation step of calculating the filter coefficients using the whitened covariance matrix, the detected speaker position and the sound collection range, and the estimated speaker volume and the volume range.

The sound collection method according to claim 1, wherein the filter coefficient setting step includes:
an FFT step of converting the received sound signals received by each of the plurality of sound collecting means into the frequency domain;
a covariance matrix calculation step of multiplying the output signals of the FFT step together for each frequency component to obtain a covariance matrix;
a covariance matrix storage step of averaging the covariance matrix for each detected speaker position and storing it; and
a filter coefficient calculation step of calculating the filter coefficients using the stored covariance matrix, the detected speaker position, and the sound collection range.

The sound collection method according to claim 1, wherein the filter coefficient setting step includes:
an FFT step of converting the received sound signals received by each of the plurality of sound collecting means into the frequency domain;
a covariance matrix calculation step of multiplying the output signals of the FFT step together for each frequency component to obtain a covariance matrix;
a covariance matrix storage step of averaging the covariance matrix for each detected speaker position and storing it;
a whitening step of multiplying the stored covariance matrix by a gain that smooths the frequency characteristic of either the diagonal component of the stored covariance matrix having the highest power or the sum of the diagonal components of the stored covariance matrix; and
a filter coefficient calculation step of calculating the filter coefficients using the whitened covariance matrix, the detected speaker position, and the sound collection range.

A sound collection device comprising:
sound collection range setting means for setting a sound collection range;
speaker position detecting means for detecting a speaker position from received sound signals received by each of a plurality of sound collecting means;
filter coefficient setting means for setting filter coefficients using the received sound signals, under the condition that the speaker's voice is collected when the detected speaker position is within the sound collection range and is suppressed when the detected speaker position is outside the sound collection range;
filter means for filtering the received sound signals received by each of the plurality of sound collecting means with the filter coefficients; and
addition means for adding the output signals of the filter means.

A sound collection device comprising:
sound collection range / volume range setting means for setting a sound collection range and a volume range;
speaker position detecting means for detecting a speaker position from received sound signals received by each of a plurality of sound collecting means;
speaker volume estimation means for estimating a speaker volume from the received sound signals received by each of the plurality of sound collecting means;
filter coefficient setting means for setting filter coefficients using the received sound signals, under the condition that the speaker's voice is collected when the detected speaker position is within the sound collection range and the estimated speaker volume is within the volume range, and is suppressed otherwise;
filter means for filtering the received sound signals received by each of the plurality of sound collecting means with the filter coefficients; and
addition means for adding the output signals of the filter means.

The sound collection device according to claim 8, wherein the filter coefficient setting means includes:
FFT means for converting the received sound signals received by each of the plurality of sound collecting means into the frequency domain;
covariance matrix calculating means for multiplying the output signals of the FFT means together for each frequency component to obtain a covariance matrix;
covariance matrix storage means for averaging the covariance matrix for each detected speaker position and storing it; and
filter coefficient calculation means for calculating the filter coefficients using the stored covariance matrix, the detected speaker position and the sound collection range, and the estimated speaker volume and the volume range.

The sound collection device according to claim 8, wherein the filter coefficient setting means includes:
FFT means for converting the received sound signals received by each of the plurality of sound collecting means into the frequency domain;
covariance matrix calculating means for multiplying the output signals of the FFT means together for each frequency component to obtain a covariance matrix;
covariance matrix storage means for averaging the covariance matrix for each detected speaker position and storing it;
whitening means for multiplying the stored covariance matrix by a gain that smooths the frequency characteristic of either the diagonal component of the stored covariance matrix having the highest power or the sum of the diagonal components of the stored covariance matrix; and
filter coefficient calculation means for calculating the filter coefficients using the whitened covariance matrix, the detected speaker position and the sound collection range, and the estimated speaker volume and the volume range.

The sound collection device according to claim 7, wherein the filter coefficient setting means includes:
FFT means for converting the received sound signals received by each of the plurality of sound collecting means into the frequency domain;
covariance matrix calculating means for multiplying the output signals of the FFT means together for each frequency component to obtain a covariance matrix;
covariance matrix storage means for averaging the covariance matrix for each detected speaker position and storing it; and
filter coefficient calculation means for calculating the filter coefficients using the stored covariance matrix, the detected speaker position, and the sound collection range.

The sound collection device according to claim 7, wherein the filter coefficient setting means includes:
FFT means for converting the received sound signals received by each of the plurality of sound collecting means into the frequency domain;
covariance matrix calculating means for multiplying the output signals of the FFT means together for each frequency component to obtain a covariance matrix;
covariance matrix storage means for averaging the covariance matrix for each detected speaker position and storing it;
whitening means for multiplying the stored covariance matrix by a gain that smooths the frequency characteristic of either the diagonal component of the stored covariance matrix having the highest power or the sum of the diagonal components of the stored covariance matrix; and
filter coefficient calculation means for calculating the filter coefficients using the whitened covariance matrix, the detected speaker position, and the sound collection range.

A sound collection program for causing a computer to execute the sound collection method according to claim 1.

A computer-readable recording medium on which the sound collection program according to claim 13 is recorded.