JP2018191255A - Sound collecting device, method thereof, and program

Info

Publication number: JP2018191255A
Application number: JP2017094927A
Authority: JP
Inventors: Akira Emura (江村 暁)
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2017-05-11
Filing date: 2017-05-11
Publication date: 2018-11-29

Abstract

PROBLEM TO BE SOLVED: To provide a sound collecting device and the like capable of extracting a target sound with higher accuracy than conventional devices by using the MVDR method.
SOLUTION: A sound collecting device calculates a spatial correlation matrix for each frequency by using N-channel frequency-domain microphone signals; obtains, from the spatial correlation matrix, estimates of the intensities of incoming waves from K directions and estimates of the noise power contained in each microphone signal; obtains an estimate k_t of the arrival direction of a target sound; obtains estimates of the correlation matrices of the target sound and the non-target sound by using a matrix A(f)^H formed from K vectors a_k(f) and an N×N identity matrix, and a matrix V_s(f,l) obtained by setting to 0 all elements other than the (k_t,k_t) element of a diagonal matrix whose diagonal components are the intensity estimates and the noise power estimates; obtains a filter coefficient vector by using the estimated correlation matrices; and applies the filter coefficient vector to the microphone signals to obtain an output signal.
SELECTED DRAWING: Figure 2

Description

The present invention relates to a sound collection device that uses beamforming, a technique for forming a beam with a plurality of microphones, and to a method and a program therefor.

In recent years there has been a growing need for technology that places multiple microphones in a sound field, acquires multichannel microphone signals, and extracts from them the target voice or sound (hereinafter also called the target sound) clearly, with noise and other sounds (hereinafter also called non-target sounds) removed as far as possible. To that end, beamforming techniques, which form a beam using a plurality of microphones, have been actively researched and developed.

In beamforming, as shown in FIG. 1, a filter is applied in the filtering unit 92-n to each microphone signal y_n(t) picked up by the N microphones 91-n (where n=1,2,…,N). Here t is an index indicating time. Next, the adding unit 93 sums the outputs of the filtering units 92-n, and the sum is output as the output signal z(t) of the sound collection device. This configuration greatly reduces noise and allows the target sound to be extracted more clearly. The minimum variance distortionless response method (MVDR method) is often used to obtain such beamforming filters (see Non-Patent Document 1).

(Non-Patent Document 1) E. Habets, J. Benesty, I. Cohen, S. Gannot, and J. Dmochowski, "New Insights Into the MVDR Beamformer in Room Acoustics", IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 1, pp. 158-170, 2010.

To use the MVDR method, the correlation matrix of the sounds other than the target sound (the non-target sound) and the transfer characteristics from the target sound source position to each microphone must be estimated appropriately. However, the microphone signals inherently contain a mixture of components derived from the target sound and components derived from non-target sounds, so the desired correlation matrix and transfer characteristics cannot be extracted from them directly. Consequently, the MVDR method alone cannot extract the target sound clearly from the microphone signals only.

An object of the present invention is therefore to provide a sound collection device, a method, and a program that estimate the respective correlation matrices from microphone signals in which the target sound and non-target sounds are mixed, and extract the target sound using the MVDR method.

To solve the above problem, according to one aspect of the present invention, a sound collection device, where N and K are each an integer of 2 or more, n=1,2,…,N, and k=1,2,…,K, includes: an arrival wave decomposition unit that calculates a spatial correlation matrix R(f,l) for each frequency using N-channel frequency-domain microphone signals Y_n(f,l), and obtains from R(f,l) the estimates p_k(f,l) of the intensities of the waves arriving from K directions and the estimates q_n(f,l) of the noise power contained in each microphone signal Y_n(f,l); a target sound determination unit that obtains an estimate k_t of the arrival direction of the target sound; a correlation matrix synthesis unit that obtains an estimate R^_T(f,l) of the correlation matrix of the target sound and an estimate R^_NT(f,l) of the correlation matrix of the non-target sound, using a matrix A(f)^H=[a₁(f) a₂(f) … a_K(f) I_N] and a matrix V_s(f,l), where a_k(f) is the vector of output signals of a microphone array of N microphones when a plane wave of amplitude 1 arrives at the array from the k-th direction, A(f)^H consists of the K vectors a_k(f) and the N×N identity matrix I_N, and V_s(f,l) is obtained by setting to 0 all elements other than the (k_t,k_t) element of the diagonal matrix V(f,l) whose diagonal components are the intensity estimates p_k(f,l) and the noise power estimates q_n(f,l); and an array filtering unit that obtains a filter coefficient vector h(f,l) using the estimates R^_T(f,l) and R^_NT(f,l), applies h(f,l) to the microphone signals Y_n(f,l), and obtains an output signal z(f,l).

To solve the above problem, according to another aspect of the present invention, a sound collection method, where N and K are each an integer of 2 or more, n=1,2,…,N, and k=1,2,…,K, includes: an arrival wave decomposition step of calculating a spatial correlation matrix R(f,l) for each frequency using N-channel frequency-domain microphone signals Y_n(f,l), and obtaining from R(f,l) the estimates p_k(f,l) of the intensities of the waves arriving from K directions and the estimates q_n(f,l) of the noise power contained in each microphone signal Y_n(f,l); a target sound determination step of obtaining an estimate k_t of the arrival direction of the target sound; a correlation matrix synthesis step of obtaining an estimate R^_T(f,l) of the correlation matrix of the target sound and an estimate R^_NT(f,l) of the correlation matrix of the non-target sound, using a matrix A(f)^H=[a₁(f) a₂(f) … a_K(f) I_N] and a matrix V_s(f,l), where a_k(f) is the vector of output signals of a microphone array of N microphones when a plane wave of amplitude 1 arrives at the array from the k-th direction, A(f)^H consists of the K vectors a_k(f) and the N×N identity matrix I_N, and V_s(f,l) is obtained by setting to 0 all elements other than the (k_t,k_t) element of the diagonal matrix V(f,l) whose diagonal components are the intensity estimates p_k(f,l) and the noise power estimates q_n(f,l); and an array filtering step of obtaining a filter coefficient vector h(f,l) using the estimates R^_T(f,l) and R^_NT(f,l), applying h(f,l) to the microphone signals Y_n(f,l), and obtaining an output signal z(f,l).

According to the present invention, the target sound can be extracted with higher accuracy than before by using the MVDR method.

FIG. 1 is a functional block diagram of a sound collection device according to the prior art.
FIG. 2 is a functional block diagram of the sound collection devices according to the first and second embodiments.
FIG. 3 is a diagram showing an example of the processing flow of the sound collection devices according to the first and second embodiments.

Embodiments of the present invention are described below. In the drawings used in the following description, components having the same function and steps performing the same processing are given the same reference numerals, and duplicate description is omitted. In the following description, symbols such as "^" used in the text should properly be written directly above the immediately preceding character, but because of the limitations of text notation they are written immediately after that character. In equations, these symbols are written in their proper positions. Unless otherwise noted, processing performed on an element of a vector or matrix is applied to all elements of that vector or matrix.

<First embodiment>
FIG. 2 is a functional block diagram of the sound collection device 100 according to the first embodiment, and FIG. 3 shows its processing flow.

The sound collection device 100 of this embodiment takes as input the output signals (microphone signals) y_n(t) of a microphone array consisting of N microphones 91-n. For example, each microphone 91-n consists of an omnidirectional microphone element. N is an integer of 2 or more, and n=1,2,…,N. The sound collection device 100 obtains an estimate R^_NT(f,l) of the correlation matrix of the non-target sound from the N-channel microphone signals y_n(t), extracts the target sound component by the MVDR method, and outputs the extracted signal as the output signal z(t).

The sound collection device 100 is implemented by a computer that includes a CPU, a RAM, and a ROM storing a program for executing the processing described below, and is functionally configured as follows.

The sound collection device 100 includes a noise / arrival wave decomposition unit 101, a target sound determination unit 103, a correlation matrix synthesis unit 104, an array filtering unit 105, a Fourier transform unit 107, and an inverse Fourier transform unit 108.

<Fourier transform unit 107>
The Fourier transform unit 107 receives the N-channel time-domain microphone signals y_n(t) and, for each frame l, applies a short-time Fourier transform to obtain the frequency-domain microphone signals Y_n(f,l) (S107), which it outputs. The transform results at frequency f and frame l are handled as the vector
y(f,l)=[Y₁(f,l) Y₂(f,l) … Y_N(f,l)]^T (1)
The microphone signal y(f,l) composed of the N-channel frequency-domain microphone signals Y_n(f,l) decomposes as
y(f,l)=x(f,l)+v(f,l)
into a multichannel signal x(f,l) consisting of the direct wave of the target sound and a multichannel signal v(f,l) consisting of its reflections, its reverberation, and noise.
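The framing and vectorization performed by the Fourier transform unit can be sketched as follows. This is a minimal illustration, not the implementation described in the text: the frame length, hop size, Hann window, and the function names `stft` and `stack_channels` are all assumptions, and a naive DFT is used only to keep the example dependency-free.

```python
import cmath
import math

def stft(signal, frame_len=8, hop=4):
    """Return spec[l][f]: windowed DFT of frame l at frequency bin f."""
    # Periodic Hann window (an assumed choice; the text does not fix the window).
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / frame_len) for i in range(frame_len)]
    spec = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(frame_len)]
        # Naive DFT, kept for readability; a real system would use an FFT.
        spec.append([
            sum(frame[i] * cmath.exp(-2j * math.pi * f * i / frame_len) for i in range(frame_len))
            for f in range(frame_len // 2 + 1)
        ])
    return spec

def stack_channels(channel_specs, f, l):
    """Build y(f,l) = [Y_1(f,l), ..., Y_N(f,l)] from per-channel spectrograms."""
    return [spec[l][f] for spec in channel_specs]

# Two-channel toy input: a tone at bin 2 of an 8-point frame; channel 2 is delayed by one sample.
tone = [math.cos(2 * math.pi * 2 * t / 8) for t in range(32)]
delayed = [0.0] + tone[:-1]
specs = [stft(tone), stft(delayed)]
y = stack_channels(specs, f=2, l=1)
```

In this toy case the two entries of `y` have the same magnitude and differ only in phase, which is the inter-channel information the later stages rely on.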

<Noise / arrival wave decomposition unit 101>
The noise / arrival wave decomposition unit 101 receives the frequency-domain microphone signal y(f,l) and calculates its spatial correlation matrix R(f,l) at frequency f and frame l, for example by the following equation.
R(f,l)=E[y(f,l)y(f,l)^H] (2)

Here E[ ] denotes the expected value, and y(f,l)^H is the conjugate transpose of y(f,l). In actual processing, a short-time average is normally used in place of E[ ]. The unit then obtains, from the spatial correlation matrix R(f,l), the estimates p_k(f,l) of the intensities of the waves arriving from K directions and the estimates q_n(f,l) of the noise power contained in each microphone signal Y_n(f,l) (S101), and outputs the diagonal matrix V(f,l) whose diagonal components are p_k(f,l) and q_n(f,l). Here k is the index of the arrival direction; K directions are assumed as the possible arrival directions of plane waves, and k=1,2,…,K. The diagonal matrix V(f,l) is therefore expressed as follows.
V(f,l)=diag(p₁(f,l), p₂(f,l), …, p_K(f,l), q₁(f,l), q₂(f,l), …, q_N(f,l))
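The short-time average that stands in for E[ ] in equation (2) can be sketched as a recursive update. The exponential forgetting factor `alpha` and the function names are assumptions made for illustration; the text only says that a short-time average is used.

```python
def outer_h(y):
    """Return the N x N matrix y y^H for a complex vector y."""
    return [[a * b.conjugate() for b in y] for a in y]

def update_correlation(R_prev, y, alpha=0.9):
    """Recursive short-time average: R(f,l) = alpha*R(f,l-1) + (1-alpha)*y y^H."""
    yy = outer_h(y)
    n = len(y)
    return [[alpha * R_prev[i][j] + (1 - alpha) * yy[i][j] for j in range(n)] for i in range(n)]

# Toy frames for one frequency bin, N = 2 microphones.
frames = [[1 + 1j, 2 - 1j], [0.5j, 1 + 0j], [1 - 2j, -1 + 1j]]
R = [[0j, 0j], [0j, 0j]]
for y in frames:
    R = update_correlation(R, y)
```

Each update adds a Hermitian rank-one term, so the running estimate stays Hermitian with a real, non-negative diagonal.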

Note that K>N. The estimates p_k(f,l) of the intensities and q_n(f,l) of the noise power can be obtained, for example, by the method of Reference 1.
(Reference 1) P. Stoica, P. Babu, and J. Li, "SPICE: A sparse covariance-based estimation method for array processing", IEEE Transactions on Signal Processing, vol. 59, no. 2, pp. 629-638, 2011.

In this method, K directions (K>N) are assumed in advance as the possible arrival directions of plane waves. When a plane wave of amplitude 1 arrives at the microphone array from the k-th direction at frequency f, the responses (output signals) of the microphones are denoted a_k(f)=[a_k,1(f) a_k,2(f) … a_k,N(f)]^T, where a_k,n(f) is the response (output signal) of the n-th microphone to a plane wave of amplitude 1 arriving from the k-th direction at frequency f. The vectors a_k(f) are obtained in advance, before sound collection; they may be obtained by experiment (actual measurement) or simulation, or theoretical values obtained by calculation may be used. Using the matrix
A(f)^H=[a₁(f) a₂(f) … a_K(f) I_N] (3)
consisting of the K response vectors a_k(f) and the N×N identity matrix I_N, Reference 1 decomposes the matrix R(f,l) into the product of the matrix A(f)^H, the diagonal matrix V(f,l), and the matrix A(f), in the form
R(f,l)=A(f)^HV(f,l)A(f) (4)
This decomposition yields the estimate p_k(f,l) of the intensity of the plane wave from the k-th direction and the estimate q_n(f,l) of the noise power of the n-th microphone 91-n, both contained in the diagonal matrix V(f,l). In practice, the decomposition corresponds to finding the diagonal matrix V(f,l) that minimizes
||(A(f)^HV(f,l)A(f))^-1/2(R(f,l)-A(f)^HV(f,l)A(f))R(f,l)^-1/2||² (5)
In equation (5), ||x|| denotes the Frobenius norm of the matrix x.
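As one way to obtain theoretical values for the response vectors a_k(f), the sketch below computes far-field plane-wave responses and assembles the matrix A(f)^H of equation (3). The array geometry (a linear array on one axis), the speed of sound, the angle grid, and the function names are assumptions made for illustration; measured or simulated responses could be used instead, as the text notes.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, an assumed value

def steering_vector(freq_hz, angle_rad, mic_positions_m):
    """Far-field response a_k(f) of a linear array to a unit-amplitude plane wave."""
    return [
        cmath.exp(-2j * math.pi * freq_hz * x * math.cos(angle_rad) / SPEED_OF_SOUND)
        for x in mic_positions_m
    ]

def build_AH(freq_hz, angles_rad, mic_positions_m):
    """Return A(f)^H = [a_1 ... a_K I_N] as an N x (K+N) nested list."""
    n = len(mic_positions_m)
    vectors = [steering_vector(freq_hz, th, mic_positions_m) for th in angles_rad]
    return [
        [vec[row] for vec in vectors] + [1.0 if row == col else 0.0 for col in range(n)]
        for row in range(n)
    ]

mics = [0.0, 0.05, 0.10]                       # N = 3 microphones, 5 cm spacing
angles = [math.pi * k / 7 for k in range(8)]   # K = 8 candidate directions over 0..180 degrees
AH = build_AH(1000.0, angles, mics)
```

The first K columns hold the unit-modulus plane-wave responses and the trailing identity block carries the per-microphone noise terms, so K>N holds here with K=8 and N=3.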

<Target sound determination unit 103>
The target sound determination unit 103 obtains an estimate k_t of the arrival direction of the target sound (S103) and outputs it. For example, the unit receives the diagonal matrix V(f,l), uses the intensity estimates p_k(f,l) of the arrival directions k contained in V(f,l) to determine the direction with the greatest intensity as the arrival direction of the target sound (S103), and outputs the result (the estimate of the arrival direction) k_t. In this example, the unit obtains k_t from the intensity estimates p_k(f,l) in the 100 to 500 Hz band, where speech power is concentrated. The intensity of each arrival direction k in this band is
b(k,l)=Σ_{f=f₀}^{f₁} p_k(f,l) (6)
In this example, f₀ corresponds to 100 Hz and f₁ to 500 Hz. The k that maximizes b(k,l) is determined to be the arrival direction k_t of the target sound in frame l. The wave arriving from direction k_t is regarded as the target sound, and waves arriving from directions other than k_t are regarded as non-target sound.
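The band-limited direction decision can be sketched as below. The data layout (`p[k][i]` holding the intensity estimate of direction k at frequency bin i for one frame) and the function name are assumptions; indices are 0-based here, whereas the text counts directions from 1.

```python
def target_direction(p, freqs_hz, f_lo=100.0, f_hi=500.0):
    """Return (k_t, b) where b[k] sums p[k][i] over bins with f_lo <= freqs_hz[i] <= f_hi."""
    band = [i for i, f in enumerate(freqs_hz) if f_lo <= f <= f_hi]
    b = [sum(row[i] for i in band) for row in p]
    k_t = max(range(len(b)), key=lambda k: b[k])
    return k_t, b

# Toy data: K = 3 directions, 5 frequency bins for a single frame l.
freqs = [50.0, 150.0, 250.0, 450.0, 800.0]
p = [
    [9.0, 0.1, 0.2, 0.1, 9.0],   # strong only outside the 100-500 Hz band
    [0.0, 1.0, 2.0, 1.5, 0.0],   # strong inside the band
    [0.0, 0.5, 0.5, 0.5, 0.0],
]
k_t, b = target_direction(p, freqs)
```

Direction 0 carries the most energy overall but almost none of it in band, so the in-band sum selects direction 1 instead.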

<Correlation matrix synthesis unit 104>
The correlation matrix synthesis unit 104 receives the diagonal matrix V(f,l) and the arrival direction estimate k_t. Using the matrix A(f)^H=[a₁(f) a₂(f) … a_K(f) I_N] and the matrix V_s(f,l), it obtains the estimate R^_T(f,l) of the correlation matrix of the target sound and the estimate R^_NT(f,l) of the correlation matrix of the non-target sound (S104) and outputs them. Here V_s(f,l) is the diagonal matrix V(f,l) with all elements other than the (k_t,k_t) element set to 0; that is, it retains only the (estimated) intensity of the sound arriving from the (estimated) arrival direction of the target sound and sets every other element to 0. As described above, a_k(f) is obtained in advance, before sound collection.

For example, the correlation matrix synthesis unit 104 obtains the estimate R^_T(f,l) of the correlation matrix of the target sound and the estimate R^_NT(f,l) of the correlation matrix of the non-target sound by the following equations.
R^_T(f,l)=A(f)^HV_s(f,l)A(f)
R^_NT(f,l)=A(f)^H(V(f,l)-V_s(f,l))A(f) (7)
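The two products in equation (7) can be sketched directly on the diagonal of V(f,l). The helper names `congruence` and `split_correlation`, the 0-based index `k_t`, and the toy values are assumptions for illustration.

```python
def congruence(AH, diag):
    """Return A^H diag(d) A for an N x M matrix AH and a length-M diagonal d."""
    n, m = len(AH), len(diag)
    return [
        [sum(AH[i][c] * diag[c] * AH[j][c].conjugate() for c in range(m)) for j in range(n)]
        for i in range(n)
    ]

def split_correlation(AH, v_diag, k_t):
    """Split V into the target part V_s and the remainder, then map both through A."""
    v_s = [v if c == k_t else 0.0 for c, v in enumerate(v_diag)]
    v_rest = [v - s for v, s in zip(v_diag, v_s)]
    return congruence(AH, v_s), congruence(AH, v_rest)

# N = 2 microphones, K = 3 directions; columns of AH are a_1, a_2, a_3 followed by I_2.
AH = [
    [1 + 0j, 1 + 0j, 1 + 0j, 1 + 0j, 0j],
    [1j, -1 + 0j, -1j, 0j, 1 + 0j],
]
v_diag = [0.2, 3.0, 0.1, 0.05, 0.04]   # p_1..p_3 followed by q_1, q_2
R_T, R_NT = split_correlation(AH, v_diag, k_t=1)
```

Because V_s keeps a single diagonal entry, the target estimate is the rank-one matrix p_kt a_kt a_kt^H, and the two estimates add up to the full model A(f)^H V(f,l) A(f).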

<Array filtering unit 105>
The array filtering unit 105 receives the frequency-domain microphone signal y(f,l), the estimate R^_T(f,l) of the correlation matrix of the target sound, and the estimate R^_NT(f,l) of the correlation matrix of the non-target sound.

First, it obtains the filter coefficient vector (an N-dimensional complex vector) h(f,l) using the estimates R^_T(f,l) and R^_NT(f,l). For example, it obtains the filter coefficient vector by the MVDR method based on the estimated correlation matrix R^_NT(f,l). The MVDR method obtains h(f,l) by solving the following constrained optimization problem.
min_h h^H(f,l)R^_NT(f,l)h(f,l) subject to h^H(f,l)g^(d)(f)=g₁^(d)(f) (8)

Here g^(d)(f) is the vector of frequency transfer characteristics of the direct paths from the target sound source position to each microphone, and g₁^(d)(f) is the frequency transfer characteristic of the direct path from the source position to the reference microphone 91-1. In this example microphone 91-1 is the reference, but any of the other microphones 91-2 to 91-N may be used as the reference instead.

This constraint can be rewritten using the source signal S(f,l) of the target sound and the signal component X₁^(d)(f,l) that reaches microphone 91-1 directly from the target sound source, where X₁^(d)(f,l)=g₁^(d)(f)S(f,l).

Multiply both sides of the above constraint from the right by S(f,l)X₁^(d)*(f,l) and take the expected value, where the superscript * denotes the complex conjugate. The rewritten constraint becomes
h^H(f,l)E[x^(d)(f,l)X₁^(d)*(f,l)]=E[X₁^(d)(f,l)X₁^(d)*(f,l)] (9)
where x^(d)(f,l) is the vector of signal components that reach each microphone directly from the target sound source, that is, x^(d)(f,l)=g^(d)(f)S(f,l).

Here, the E[ ] on the left side of equation (9) is the first column vector of the estimate R^_T(f,l) of the correlation matrix of the target sound, and the E[ ] on the right side is its (1,1) element. The source signal S(f,l) of the target sound and the frequency transfer characteristics g^(d)(f) from the target sound source to each microphone are unknown. However, through the statistical procedure of taking expected values described above, the coefficients of the new constraint can be obtained from the estimate R^_T(f,l).
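Equation (9) is a single linear constraint on h(f,l), so minimizing the non-target output power under it has a closed-form solution by the usual Lagrange-multiplier argument. The following is a sketch of that standard result, not necessarily the form used in the original equations; it assumes R^_NT(f,l) is Hermitian positive definite, and the symbols c, d, and e₁ (the first unit vector) are introduced here only for illustration.

```latex
\begin{aligned}
\mathbf{c}(f,l) &= \hat{R}_{T}(f,l)\,\mathbf{e}_1, \qquad
d(f,l) = \mathbf{e}_1^{T}\,\hat{R}_{T}(f,l)\,\mathbf{e}_1,\\
\mathbf{h}(f,l) &= \arg\min_{\mathbf{h}}\ \mathbf{h}^{H}\hat{R}_{NT}(f,l)\,\mathbf{h}
\quad\text{subject to}\quad \mathbf{h}^{H}\mathbf{c}(f,l) = d(f,l),\\
\mathbf{h}(f,l) &= \frac{d(f,l)\,\hat{R}_{NT}(f,l)^{-1}\,\mathbf{c}(f,l)}
{\mathbf{c}(f,l)^{H}\,\hat{R}_{NT}(f,l)^{-1}\,\mathbf{c}(f,l)}.
\end{aligned}
```

Since d(f,l) is a diagonal element of a Hermitian matrix it is real, and the denominator is real and positive, so substituting the last line back gives h^H c = d as required.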

The array filtering unit 105 applies the obtained filter coefficient vector h(f,l) to the microphone signal y(f,l) as in the following equation, obtains the output signal z(f,l) (S105), and outputs it.
z(f,l)=h^H(f,l)y(f,l) (15)
With this configuration, the component of the target sound at frequency f can be extracted.
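The filter design and its application in equation (15) can be sketched end to end as follows. The closed form used in `mvdr_filter` is the standard Lagrange-multiplier solution of a linearly constrained minimum-variance problem, assumed here rather than taken from the text; the small Gaussian-elimination solver, the function names, and the toy matrices are likewise illustrative assumptions.

```python
def solve(M, rhs):
    """Solve M x = rhs by Gaussian elimination with partial pivoting (complex entries)."""
    n = len(rhs)
    aug = [list(M[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0j] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def mvdr_filter(R_T, R_NT):
    """h = d * R_NT^{-1} c / (c^H R_NT^{-1} c), c = first column of R_T, d = R_T[0][0]."""
    c = [row[0] for row in R_T]
    d = R_T[0][0].real
    w = solve(R_NT, c)  # w = R_NT^{-1} c
    denom = sum(ci.conjugate() * wi for ci, wi in zip(c, w)).real
    return [d * wi / denom for wi in w]

def apply_filter(h, y):
    """z(f,l) = h^H(f,l) y(f,l)."""
    return sum(hi.conjugate() * yi for hi, yi in zip(h, y))

# Toy example: target wave a = [1, -1] with power 3; non-target is a weak wave from
# [1, 1j] with power 0.2 plus sensor noise of power 0.1 on each microphone.
R_T = [[3 + 0j, -3 + 0j], [-3 + 0j, 3 + 0j]]
R_NT = [[0.3 + 0j, -0.2j], [0.2j, 0.3 + 0j]]
h = mvdr_filter(R_T, R_NT)
z = apply_filter(h, [1 + 0j, -1 + 0j])  # unit-amplitude target wave at the reference microphone
```

In this example the filter passes the target wave with the gain it has at the reference microphone (z equals 1) while minimizing the power contributed by the non-target matrix.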

<Inverse Fourier transform unit 108>
The inverse Fourier transform unit 108 receives the frequency-domain output signal z(f,l), applies a short-time inverse Fourier transform to the processing results at all frequencies (S108), and outputs the resulting time-domain output signal z(t).

<Effect>
With the above configuration, the respective correlation matrices can be estimated from microphone signals in which the target sound and non-target sounds are mixed, and the target sound can be extracted using the MVDR method.

<Modification>
In this embodiment, the noise / arrival wave decomposition unit 101 outputs the diagonal matrix V(f,l) to the target sound determination unit 103, but it may instead output only the intensity estimates p_k(f,l) contained in V(f,l). In short, it suffices that the target sound determination unit 103 can determine the arrival direction of the target sound.

In this embodiment, the sound collection device 100 includes the Fourier transform unit 107 and the inverse Fourier transform unit 108. However, these units may be provided as separate devices, in which case the sound collection device 100 may take the frequency-domain microphone signal y(f,l) as input and may output the frequency-domain output signal z(f,l).

<Second embodiment>
The description focuses on the differences from the first embodiment.

FIG. 2 is a functional block diagram of the sound collection device 200 according to the second embodiment, and FIG. 3 shows its processing flow.

The sound collection device 200 includes a noise / arrival wave decomposition unit 101, an intensity correction unit 202 (shown by a broken line in FIG. 2), a target sound determination unit 103, a correlation matrix synthesis unit 204, an array filtering unit 105, a Fourier transform unit 107, and an inverse Fourier transform unit 108.

<Intensity correction unit 202>
The intensity correction unit 202 receives the diagonal matrix V(f,l), obtains a correction coefficient β(f,l) from the total signal power of the spatial correlation matrix R(f,l) and the total signal power obtained from the matrix A(f)^H=[a₁(f) a₂(f) … a_K(f) I_N] and the diagonal matrix V(f,l) (S202, shown by a broken line in FIG. 3), and outputs it.

For example, the intensity correction unit 202 obtains the correction coefficient β(f,l) by the following equation.
β(f,l)=tr(R(f,l))/tr(A(f)^HV(f,l)A(f))

Here tr() is the function that takes the trace of a matrix. For example, the spatial correlation matrix R(f,l) calculated by the noise / arrival wave decomposition unit 101 may be used, and the matrix A(f)^H obtained in advance, before sound collection, by the method described in the first embodiment may be used.

<Correlation matrix synthesis unit 204>
The correlation matrix synthesis unit 204 receives the spatial correlation matrix R(f,l), the correction coefficient β(f,l), the diagonal matrix V(f,l), and the arrival direction estimate k_t. Using R(f,l), β(f,l), the matrix A(f)^H=[a₁(f) a₂(f) … a_K(f) I_N] obtained in advance before sound collection, and the matrix V_s(f,l) obtained by setting to 0 all elements of the diagonal matrix V(f,l) other than the (k_t,k_t) element, it obtains the estimate R^_NT(f,l) of the correlation matrix of the non-target sound (S204) and outputs it. For example, R^_NT(f,l) is obtained by the following equation.
R^_NT(f,l)=R(f,l)-β(f,l)A(f)^HV_s(f,l)A(f)
The estimate R^_T(f,l) of the correlation matrix of the target sound can be obtained in the same way as in the first embodiment.
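The second embodiment's correction can be sketched as below. The form of β(f,l) used here, the ratio of the trace of R(f,l) to the trace of A(f)^H V(f,l) A(f), is an assumption inferred from the description of the two total signal powers; the function names and toy values are also illustrative.

```python
def congruence(AH, diag):
    """Return A^H diag(d) A for an N x M matrix AH and a length-M diagonal d."""
    n, m = len(AH), len(diag)
    return [
        [sum(AH[i][c] * diag[c] * AH[j][c].conjugate() for c in range(m)) for j in range(n)]
        for i in range(n)
    ]

def trace(M):
    """Real part of the sum of the diagonal elements."""
    return sum(M[i][i] for i in range(len(M))).real

def corrected_nontarget(R, AH, v_diag, k_t):
    """Return (beta, R_NT) with R_NT = R - beta * A^H V_s A."""
    model = congruence(AH, v_diag)
    beta = trace(R) / trace(model)  # assumed form of the correction coefficient
    v_s = [v if c == k_t else 0.0 for c, v in enumerate(v_diag)]
    target = congruence(AH, v_s)
    n = len(R)
    R_NT = [[R[i][j] - beta * target[i][j] for j in range(n)] for i in range(n)]
    return beta, R_NT

# N = 2 microphones, K = 3 directions; columns of AH are a_1, a_2, a_3 followed by I_2.
AH = [
    [1 + 0j, 1 + 0j, 1 + 0j, 1 + 0j, 0j],
    [1j, -1 + 0j, -1j, 0j, 1 + 0j],
]
v_diag = [0.2, 3.0, 0.1, 0.05, 0.04]
# Observed matrix taken as twice the model, to mimic a scale mismatch in the decomposition.
R = [[2 * x for x in row] for row in congruence(AH, v_diag)]
beta, R_NT = corrected_nontarget(R, AH, v_diag, k_t=1)
```

With the observed matrix set to twice the model, the coefficient recovers the factor 2, and subtracting the rescaled target term leaves exactly the rescaled non-target part.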

<Effect>
This configuration provides the same effect as the first embodiment. In addition, using the correction coefficient allows the correlation matrix of the non-target sound to be estimated more accurately.

<Other modifications>
The present invention is not limited to the above-described embodiments and modifications. For example, the various processes described above need not be executed in time series in the order described; they may also be executed in parallel or individually, depending on the processing capability of the apparatus that executes them or as needed. Other changes may be made as appropriate without departing from the spirit of the present invention.

<Program and recording medium>
The various processing functions of each device described in the above embodiments and modifications may be realized by a computer. In that case, the processing contents of the functions that each device should have are described by a program. By executing this program on a computer, the various processing functions of each device are realized on the computer.

The program describing these processing contents can be recorded on a computer-readable recording medium. Any computer-readable recording medium may be used, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers via a network.

A computer that executes such a program first, for example, temporarily stores in its own storage unit the program recorded on the portable recording medium or the program transferred from the server computer. When executing processing, the computer reads the program stored in its own storage unit and executes processing in accordance with the read program. As another form of executing the program, the computer may read the program directly from the portable recording medium and execute processing in accordance with it. Furthermore, each time a program is transferred from the server computer to this computer, the computer may sequentially execute processing in accordance with the received program. The above processing may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to this computer and the processing functions are realized solely through execution instructions and acquisition of results. Note that the program here includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

In the above description, each device is configured by executing a predetermined program on a computer, but at least part of the processing contents may instead be implemented in hardware.

Claims

A sound collection device comprising:
an incoming wave decomposition unit that, where N and K are each an integer of 2 or more, n = 1, 2, ..., N, and k = 1, 2, ..., K, calculates a spatial correlation matrix R(f,l) for each frequency using N-channel frequency-domain microphone signals Y_n(f,l), and obtains, from the spatial correlation matrix R(f,l), an estimated value p_k(f,l) of the intensity of the incoming waves from K directions and an estimated value q_n(f,l) of the noise power included in each microphone signal Y_n(f,l);
a target sound determination unit that obtains an estimated value k_t of the direction of arrival of the target sound;
a correlation matrix synthesis unit that, where a_k(f) is the vector consisting of the output signals of a microphone array of N microphones when a plane wave of amplitude 1 arrives at the microphone array from the k-th direction, obtains an estimated value R^_T(f,l) of the correlation matrix of the target sound and an estimated value R^_NT(f,l) of the correlation matrix of the non-target sound, using a matrix A(f)^H = [a₁(f) a₂(f) ... a_K(f) I_N] consisting of the K vectors a_k(f) and the N×N unit matrix I_N, and a matrix V_s(f,l) obtained by setting to 0 all elements other than the (k_t,k_t) element of a diagonal matrix V(f,l) whose diagonal components are the estimated values p_k(f,l) of the intensity and the estimated values q_n(f,l) of the noise power; and
an array filtering unit that obtains a filter coefficient vector h(f,l) using the estimated values R^_T(f,l) and R^_NT(f,l) of the correlation matrices, and applies the filter coefficient vector h(f,l) to the microphone signals Y_n(f,l) to obtain an output signal z(f,l).

The sound collection device according to claim 1,
further comprising an intensity correction unit that obtains a correction coefficient β(f,l) using the total signal power tr(R(f,l)) of the spatial correlation matrix R(f,l) and the total signal power tr(A(f)^H V(f,l) A(f)) obtained from the matrix A(f)^H = [a₁(f) a₂(f) ... a_K(f) I_N] and the diagonal matrix V(f,l),
wherein the correlation matrix synthesis unit obtains the estimated value R^_NT(f,l) of the correlation matrix using the spatial correlation matrix R(f,l), the correction coefficient β(f,l), the matrix A(f)^H, and the matrix V_s(f,l).

A sound collection method comprising:
an incoming wave decomposition step of, where N and K are each an integer of 2 or more, n = 1, 2, ..., N, and k = 1, 2, ..., K, calculating a spatial correlation matrix R(f,l) for each frequency using N-channel frequency-domain microphone signals Y_n(f,l), and obtaining, from the spatial correlation matrix R(f,l), an estimated value p_k(f,l) of the intensity of the incoming waves from K directions and an estimated value q_n(f,l) of the noise power included in each microphone signal Y_n(f,l);
a target sound determination step of obtaining an estimated value k_t of the direction of arrival of the target sound;
a correlation matrix synthesis step of, where a_k(f) is the vector consisting of the output signals of a microphone array of N microphones when a plane wave of amplitude 1 arrives at the microphone array from the k-th direction, obtaining an estimated value R^_T(f,l) of the correlation matrix of the target sound and an estimated value R^_NT(f,l) of the correlation matrix of the non-target sound, using a matrix A(f)^H = [a₁(f) a₂(f) ... a_K(f) I_N] consisting of the K vectors a_k(f) and the N×N unit matrix I_N, and a matrix V_s(f,l) obtained by setting to 0 all elements other than the (k_t,k_t) element of a diagonal matrix V(f,l) whose diagonal components are the estimated values p_k(f,l) of the intensity and the estimated values q_n(f,l) of the noise power; and
an array filtering step of obtaining a filter coefficient vector h(f,l) using the estimated values R^_T(f,l) and R^_NT(f,l) of the correlation matrices, and applying the filter coefficient vector h(f,l) to the microphone signals Y_n(f,l) to obtain an output signal z(f,l).

The sound collection method according to claim 3,
further comprising an intensity correction step of obtaining a correction coefficient β(f,l) using the total signal power tr(R(f,l)) of the spatial correlation matrix R(f,l) and the total signal power tr(A(f)^H V(f,l) A(f)) obtained from the matrix A(f)^H = [a₁(f) a₂(f) ... a_K(f) I_N] and the diagonal matrix V(f,l),
wherein, in the correlation matrix synthesis step, the estimated value R^_NT(f,l) of the correlation matrix is obtained using the spatial correlation matrix R(f,l), the correction coefficient β(f,l), the matrix A(f)^H, and the matrix V_s(f,l).

A program for causing a computer to function as the sound collection device according to claim 1.