JP2010175431A

JP2010175431A - Device, method and program for estimating sound source direction

Info

Publication number: JP2010175431A
Application number: JP2009019355A
Authority: JP
Inventors: Yusuke Hioka; Kenichi Furuya; Yoichi Haneda
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2009-01-30
Filing date: 2009-01-30
Publication date: 2010-08-12

Abstract

PROBLEM TO BE SOLVED: To improve the accuracy of sound source direction estimation in a sound source direction estimation method, device, and program.

SOLUTION: The sound source direction estimation device includes the following parts:
- a microphone array of three microphones placed at the vertices of an equilateral triangle;
- a frequency converter that converts the signal received by each microphone into a frequency-domain signal;
- an arrival time difference calculating section that calculates the arrival time difference for each pair of distinct microphones;
- a sound source direction estimating section that obtains sound source direction candidates from the arrival time differences and classifies them.

The sound source direction estimating section has a sparsity determining section. For each frequency bin, that section determines whether sparsity can be assumed for the arrival time difference. The estimating section then obtains and classifies sound source direction candidates using only the arrival time differences of frequency bins for which sparsity can be assumed.

COPYRIGHT: (C)2010,JPO&INPIT

Description

The present invention relates to a sound source direction estimation apparatus, method, and program for detecting the direction of a speaker in videophone calls, audio conferences, and similar applications.

A conventional sound source direction estimation method used for audio conferencing and similar applications is disclosed, for example, in Non-Patent Document 1. As shown in FIG. 13, it estimates the directions S_n of N (N ≥ 2) different sound sources. It uses a microphone array of three microphones 1, 2, and 3 placed at the vertices of an equilateral triangle. FIG. 14 shows an example functional configuration of this sound source direction estimation apparatus 300, and its operation is described below.

The sound source direction estimation apparatus 300 comprises:
- three microphones 1, 2, and 3 placed at the vertices of an equilateral triangle;
- frequency conversion units 11, 12, and 13;
- arrival time difference calculation units 21, 22, and 23;
- a sound source direction estimation unit 150.

The signals x_i(n) at time sample n received by microphones 1, 2, and 3 are input to frequency conversion units 11, 12, and 13. These units convert them into frequency-domain signals X_i(ω, m), computed frame by frame, where a frame is a set of consecutive time samples. Here m is the index of the frequency-converted signal frame and ω is the frequency of the converted signal. The frequency-converted microphone signals are input to arrival time difference calculation units 21, 22, and 23. These units evaluate Expression (1) for each of the three pairs of distinct microphones. For each pair they output the arrival time difference τ_ij(ω, m) (i, j ≤ 3, i ≠ j), where i and j are microphone numbers.
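Expression (1) is not reproduced in this text. As a hedged illustration only, the Python sketch below uses a common phase-based estimator, τ_ij(ω, m) = arg(X_i(ω, m) X_j*(ω, m)) / ω. This estimator is assumed here, not taken from the patent. The sketch also frames the signal with a Hann window and an FFT; the 256-sample frame length and the hop size are illustrative choices. The function names stft and pairwise_tdoa are hypothetical.

```python
import numpy as np

def stft(x, frame_len=256, hop=256):
    """Split x into frames, apply a Hann window and an FFT; returns shape (F, M)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack(
        [x[m * hop : m * hop + frame_len] * window for m in range(n_frames)], axis=1
    )
    return np.fft.rfft(frames, axis=0)  # F = frame_len // 2 + 1 frequency bins

def pairwise_tdoa(X, fs, frame_len=256):
    """Per-bin arrival time differences for the pairs (1,2), (2,3), (3,1).

    X: complex spectra of shape (3, F, M), one row per microphone.
    Assumed estimator: tau_ij = arg(X_i * conj(X_j)) / omega, i.e. the delay of
    microphone j relative to microphone i, valid while |omega * tau| < pi.
    The DC bin (omega = 0) is dropped, so the result has shape (3, F - 1, M).
    """
    omega = 2 * np.pi * fs * np.arange(X.shape[1]) / frame_len  # rad/s per bin
    pairs = [(0, 1), (1, 2), (2, 0)]
    cross_phase = np.stack([np.angle(X[i] * np.conj(X[j])) for i, j in pairs])
    return cross_phase[:, 1:, :] / omega[1:, None]
```

With fs = 16 kHz and frame_len = 256, each frame is 16 ms long and yields 128 non-DC bins. This matches the frame and bin counts used for the histogram example in this section.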

The arrival time differences τ_ij are input to the sound source direction estimation unit 150, which outputs the estimated sound source directions θ̂_n. FIG. 15 shows an example functional configuration of the sound source direction estimation unit 150, and its operation is described below. The unit comprises a vectorization unit 151, a sound source direction calculation unit 152, and a histogram calculation unit 153. The vectorization unit 151 takes the arrival time differences τ_12(ω, m), τ_23(ω, m), and τ_31(ω, m) output by arrival time difference calculation units 21, 22, and 23. It outputs the arrival time difference vector t(ω, m) of Expression (2). It simply arranges the input arrival time differences τ_ij(ω, m) into a vector, t(ω, m) = [τ_12(ω, m), τ_23(ω, m), τ_31(ω, m)]^T.

The sound source direction calculation unit 152 multiplies the input arrival time difference vector t(ω, m) from the left by the coordinate transformation matrix D given by Expression (4), as in Expression (3). From the first and second elements of the result, it computes the sound source direction candidate θ′(ω, m) using Expression (5).
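Expressions (3) to (5) are not reproduced here. The sketch below shows one way to realize the same idea under these assumptions:
- a far-field source;
- a hypothetical array layout: side length d, microphone 1 on the 0-degree axis, microphones numbered counter-clockwise.

As a stand-in for the matrix D of Expression (4), it uses the Moore-Penrose pseudo-inverse of the linear model that maps a unit direction vector to the three arrival time differences. The patent's D may be scaled or arranged differently. The two elements of D t are treated as x and y coordinates, and θ′ is their four-quadrant arctangent.

```python
import numpy as np

def mic_positions(d):
    """Hypothetical layout: equilateral triangle of side d centred on the origin,
    microphone 1 on the 0-degree axis, numbering counter-clockwise."""
    r = d / np.sqrt(3.0)  # circumradius
    phi = np.deg2rad([0.0, 120.0, 240.0])
    return r * np.stack([np.cos(phi), np.sin(phi)], axis=1)  # shape (3, 2)

def transform_matrix(d, c=343.0):
    """Stand-in for D: pseudo-inverse of the far-field model t = M u."""
    p = mic_positions(d)
    # tau_ij = (p_i - p_j) . u / c is the delay of mic j relative to mic i
    # for a plane wave arriving from unit direction u.
    M = np.stack([p[0] - p[1], p[1] - p[2], p[2] - p[0]]) / c  # shape (3, 2)
    return np.linalg.pinv(M)  # shape (2, 3)

def direction_candidates(tau12, tau23, tau31, d, c=343.0):
    """Vectorize as in Expression (2) and return theta' in degrees, in [0, 360)."""
    t = np.stack([tau12, tau23, tau31], axis=-1)  # t(omega, m), shape (..., 3)
    xy = t @ transform_matrix(d, c).T              # (x, y) per bin, shape (..., 2)
    return np.rad2deg(np.arctan2(xy[..., 1], xy[..., 0])) % 360.0
```

Because the model is linear with full column rank, D t recovers (cos θ, sin θ) exactly for a noise-free single source under this layout.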

The histogram calculation unit 153 builds a histogram from the input sound source direction candidates θ′(ω, m). It outputs the directions at the histogram peaks as the estimated sound source directions θ̂_a (a = 1, …, A′). A′ is the maximum number of simultaneously active sound sources, given in advance.

The histogram is computed from all the sound source direction candidates θ′(ω, m) obtained in every frequency bin of several consecutive frames. These candidates are sorted into bins of a predetermined angular width. The number of frames used for the histogram corresponds to a time span over which the sound source does not move. For example, suppose the frame length is 16 ms and the sound source can be assumed not to move for about 0.5 s. The histogram is then built from the candidates θ′(ω, m) obtained in each of 30 frames. Suppose also that the sampling frequency is 16 kHz and the frequency conversion is a short-time Fourier transform over 256 samples. The number of candidates θ′(ω, m) then equals the number of frequency bins, 3840 (128 × 30).
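The histogram step can be sketched as follows. The 1-degree bin width, the circular local-maximum test, and the function name histogram_peaks are illustrative assumptions. The text only specifies a predetermined angular width and at most A′ peaks. The usage example pools 30 frames of 128 candidates each, i.e. the 3840 candidates counted above.

```python
import numpy as np

def histogram_peaks(candidates, max_sources, bin_width=1.0):
    """Histogram theta' candidates (degrees) and return the centres of at most
    max_sources local maxima with the highest counts, treating 0/360 as adjacent."""
    edges = np.arange(0.0, 360.0 + bin_width, bin_width)
    counts, _ = np.histogram(np.asarray(candidates) % 360.0, bins=edges)
    centres = edges[:-1] + bin_width / 2
    left, right = np.roll(counts, 1), np.roll(counts, -1)
    # Strict on the right so a flat top of equal bins yields a single peak.
    is_peak = (counts > 0) & (counts >= left) & (counts > right)
    peak_idx = np.flatnonzero(is_peak)
    best = peak_idx[np.argsort(counts[peak_idx])[::-1][:max_sources]]
    return np.sort(centres[best])

# Usage: 30 frames x 128 bins = 3840 candidates, mostly near 10 and 20 degrees.
rng = np.random.default_rng(1)
cands = np.concatenate([
    rng.uniform(10.1, 10.9, 1500),
    rng.uniform(20.1, 20.9, 1500),
    rng.uniform(0.0, 360.0, 840),
])
print(len(cands), histogram_peaks(cands, max_sources=2))
```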

Non-Patent Document 1: Masao Matsuo, Yusuke Hioka and Nozomu Hamada, "Estimating DOA of multiple speech signals by improved histogram mapping method," Proceedings of IWAENC 2005, pp. 129-132.

The conventional method relies on an assumption that applies when the source signals are non-stationary with components concentrated at particular frequencies, as speech is. It assumes that each frequency bin at any given time contains a component of only one of the sound sources. This assumption is called sparsity in the time-frequency domain.

[What is sparsity?]
Sparsity is the property that the energy of a signal of interest is concentrated in a small part of some domain (in many cases, the time-frequency domain) and is zero in most of the rest of that domain.

However, the sparsity assumption generally breaks down as the number of sound sources increases, so the conventional technique cannot estimate sound source directions with sufficient accuracy. For example, when speakers in different directions talk at the same time, the accuracy of the estimated directions degrades. Moreover, real environments often contain non-speech sounds. Many of these, such as the noise of an air conditioner or a personal computer fan, are stationary and spread over a wide frequency range. Sparsity cannot be assumed for such sounds, so when they are superimposed on the sound of the sources they degrade the estimation accuracy further.

The present invention was made in view of this problem. Its object is to provide a sound source direction estimation apparatus, method, and program that accurately estimate the directions of speakers located in different directions, even when they talk at the same time.

The sound source direction estimation apparatus of the present invention comprises:
- a microphone array of three microphones placed at the vertices of an equilateral triangle;
- a frequency conversion unit that converts the signals received by the microphones of the array into frequency-domain signals;
- an arrival time difference calculation unit that calculates the arrival time difference for each pair of distinct microphones;
- a sound source direction estimation unit that obtains sound source candidates from the arrival time differences and classifies the sound source direction candidates.

The sound source direction estimation unit includes a sparsity determination unit. For each frequency bin of the arrival time differences, this unit determines whether sparsity can be assumed. The estimation unit obtains sound source candidates only from the arrival time differences of frequency bins for which sparsity can be assumed, and classifies the sound source direction candidates.

According to the present invention, the sparsity determination unit removes the arrival time differences of frequency bins in which the sparsity of the sound sources cannot be assumed. Sound source candidates are then obtained from the arrival time differences of the remaining frequency bins, in which sparsity can be assumed. Thus, even when speakers at different positions talk at the same time, the apparatus excludes the arrival time differences of frequency bins in which their voices are mixed. It estimates each direction from arrival time differences that come from a single sound source. Sound source directions can therefore be estimated accurately.

FIG. 1 shows an example functional configuration of the sound source direction estimation apparatus 100 of the present invention.
FIG. 2 shows the operation flow of the sound source direction estimation apparatus 100.
FIG. 3 shows an example functional configuration of the sound source direction estimation unit 30.
FIG. 4 shows an example functional configuration of the sparsity determination unit 34.
FIG. 5 shows the operation flow of the sparsity determination unit 34.
FIG. 6 shows an example of an arrival time difference vector and its orthonormal vectors.
FIG. 7 shows an example of the vector orthogonality P(θ) when there are multiple sound sources.
FIG. 8 shows an example of the vector orthogonality P(θ) when there is one sound source.
FIG. 9 shows an example functional configuration of the sparsity determination unit 34′.
FIG. 10 shows the operation flow of the sparsity determination unit 34′.
FIG. 11 shows an example result of sound source direction estimation by the conventional sound source direction estimation apparatus 300.
FIG. 12 shows an example result of sound source direction estimation by the sound source direction estimation apparatus 100 of the present invention.
FIG. 13 is a plan view of the microphone array.
FIG. 14 shows an example functional configuration of the conventional sound source direction estimation apparatus 300.
FIG. 15 shows an example functional configuration of the conventional sound source direction estimation unit 150.

Embodiments of the present invention are described below with reference to the drawings. Identical components in different drawings are given the same reference numerals, and their description is not repeated.

FIG. 1 shows an example functional configuration of the sound source direction estimation apparatus 100 of the present invention. The apparatus 100 comprises:
- a microphone array of three microphones;
- frequency conversion units 11, 12, and 13;
- arrival time difference calculation units 21, 22, and 23;
- a sound source direction estimation unit 30.

It differs from the sound source direction estimation apparatus 300 described in the background in only two respects. The sound source direction estimation unit 30 includes a sparsity determination unit 34, and the processing uses that unit's determination result.

The parts that operate as in the conventional apparatus 300 are briefly described with reference also to the operation flow of FIG. 2.
- The frequency conversion units 11, 12, and 13 convert the signals received by microphones 1, 2, and 3 into frequency-domain signals (step S11).
- For each pair of distinct microphones 1, 2, and 3, the arrival time difference calculation units 21, 22, and 23 calculate the arrival time differences τ_ij(ω, m), namely τ_12(ω, m), τ_23(ω, m), and τ_31(ω, m) (step S21).
- The sound source direction estimation unit 30 obtains sound source candidates θ′(ω, m) from the arrival time differences τ_ij(ω, m) and classifies them (step S30).

What is new in the sound source direction estimation apparatus 100 is that the sound source direction estimation unit 30 includes a sparsity determination unit 34. For each frequency bin of the arrival time differences τ_ij(ω, m), this unit determines whether or not sparsity can be assumed. The unit 30 obtains sound source candidates from the arrival time differences τ_ij(ω, m) of the frequency bins that the sparsity determination unit 34 marks as sparse, and classifies them (step S30). The sparsity determination is made for every frame m and every frequency bin ω. Consequently, even when speakers at different positions talk at the same time, the arrival time differences τ_ij(ω, m) of frequency bins in which their voices are mixed are excluded. Each sound source direction can therefore be estimated accurately.

FIG. 3 shows an example functional configuration of the sound source direction estimation unit 30. It comprises the vectorization unit 151, the sparsity determination unit 34, a sound source direction calculation unit 152′, and the histogram calculation unit 153. Compared with the conventional sound source direction estimation unit 150 (FIG. 15), unit 30 differs in two respects. First, the sparsity determination unit 34 is placed between the vectorization unit 151 and the sound source direction calculation unit. Second, the sound source direction calculation unit 152′ calculates the sound source directions with reference to the determination result.

FIG. 4 shows an example functional configuration of the sparsity determination unit 34 of this embodiment, and FIG. 5 shows its operation flow. The sparsity determination unit 34 comprises an orthogonal matrix calculation unit 35, a vector orthogonality calculation unit 36, and an orthogonality determination unit 38. The orthogonal matrix calculation unit 35 receives the arrival time difference vector t(ω, m) output by the vectorization unit 151. It outputs two orthonormal vectors t_⊥1(ω, m) and t_⊥2(ω, m) that are orthogonal to t(ω, m) (step S35). These orthonormal vectors can be obtained, for example, by Gram-Schmidt orthonormalization (reference: G. Strang, "Linear Algebra and Its Applications," Japanese translation, Sangyo Tosho, pp. 141-143).
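As a sketch of step S35, the function below applies classical Gram-Schmidt to t(ω, m) followed by the standard basis vectors. It skips any vector that becomes (numerically) dependent. The first retained vector is t/|t|, and the next two serve as t_⊥1 and t_⊥2. The function name and tolerance are illustrative assumptions, and t is assumed to be non-zero.

```python
import numpy as np

def orthonormal_complement(t, tol=1e-6):
    """Return (t_perp1, t_perp2): unit vectors orthogonal to t and to each other."""
    t = np.asarray(t, dtype=float)
    basis = [t / np.linalg.norm(t)]               # assumes t is not the zero vector
    for e in np.eye(3):                           # candidate vectors e1, e2, e3
        v = e - sum((e @ b) * b for b in basis)   # remove components along basis
        norm = np.linalg.norm(v)
        if norm > tol:                            # skip vectors dependent on the basis
            basis.append(v / norm)
        if len(basis) == 3:
            break
    return basis[1], basis[2]

t_perp1, t_perp2 = orthonormal_complement([1.0, 2.0, -3.0])
```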

The orthonormal vectors t_⊥1(ω, m) and t_⊥2(ω, m) are input to the vector orthogonality calculation unit 36. This unit computes their orthogonality with respect to the theoretical arrival time difference vector t_e(θ) (step S36). The theoretical value t_e(θ) is given by Expression (6).

Here d is the length of one side of the triangle formed by microphones 1, 2, and 3 at its vertices (see FIG. 13), and c is the speed of sound. Thus t_e(θ) is a theoretical value that can be computed independently of any measurement. It may be read out sequentially from a recording unit 37, as shown in FIG. 4. Alternatively, values stored in advance inside the vector orthogonality calculation unit 36 may be used.
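Expression (6) itself is not reproduced in this text. The sketch below derives a closed form under these assumptions:
- microphone 1 on the 0-degree axis, as in FIG. 13, with microphones numbered counter-clockwise;
- a far-field source;
- τ_ij taken as the delay of microphone j relative to microphone i.

Under these assumptions, t_e(θ) = (d/c)[cos(θ + 30°), cos(θ − 90°), cos(θ + 150°)]^T. The sketch tabulates it on a 1-degree grid, the kind of table a recording unit such as 37 could hold. The values of d and c are placeholders, and the patent's expression may differ in sign convention or microphone ordering.

```python
import numpy as np

def theoretical_tdoa(theta_deg, d=0.04, c=343.0):
    """t_e(theta) for the assumed layout; d [m] and c [m/s] are placeholders."""
    theta = np.deg2rad(np.asarray(theta_deg, dtype=float))[..., None]
    offsets = np.deg2rad([30.0, -90.0, 150.0])  # pairs (1,2), (2,3), (3,1)
    return (d / c) * np.cos(theta + offsets)    # shape (..., 3)

# Precomputed table for theta = 0, 1, ..., 359 degrees.
TABLE = theoretical_tdoa(np.arange(360))
```

Two properties help when checking such a table under this model. The three elements always sum to zero, and every t_e(θ) has the same length, (d/c)·√1.5.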

Two orthonormal vectors t_⊥1(ω, m) and t_⊥2(ω, m) orthogonal to the arrival time difference vector t(ω, m) are obtained for the following reason. FIG. 6 shows t_⊥1(ω, m) and t_⊥2(ω, m) for an arbitrary arrival time difference vector t(ω, m). To find the direction of t(ω, m), it suffices to check whether a vector of known direction is orthogonal to both t_⊥1(ω, m) and t_⊥2(ω, m). If it is, the arrival time difference vector t(ω, m) is parallel to that known vector and therefore shares its direction.

The vector orthogonality calculation unit 36 computes the orthogonality P(θ) between the orthonormal vectors t_⊥1(ω, m), t_⊥2(ω, m) and the theoretical arrival time difference vector t_e(θ) using Expression (7) (step S36).

Expression (7) is evaluated for the orthonormal vectors t_⊥1(ω, m), t_⊥2(ω, m) of each arrival time difference vector t(ω, m). It is evaluated against the theoretical vectors t_e(θ) of all directions from 0 to 359 degrees. The direction of each theoretical vector t_e(θ) used in Expression (7) is known. When t_e(θ) is orthogonal to t_⊥1(ω, m) and t_⊥2(ω, m), the first and second terms of the denominator of Expression (7) both become 0, so the orthogonality P(θ) takes a large value. Conversely, at other angles the first and second terms of the denominator have non-negligible values, so P(θ) is small.

In this way, the method obtains the orthonormal vectors t_⊥1(ω, m), t_⊥2(ω, m) orthogonal to the arrival time difference vector t(ω, m). It then evaluates whether they are orthogonal to the theoretical vectors t_e(θ). This determines whether t(ω, m) was produced by a single sound source or by a mixture of signals from several sources.
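Expression (7) is not reproduced here, so the sketch below assumes a MUSIC-style form consistent with the description: P(θ) = 1 / ((t_⊥1 · e(θ))² + (t_⊥2 · e(θ))² + ε). Here e(θ) is t_e(θ) scaled to unit length, and ε is a small constant that avoids division by zero. The two orthonormal vectors come from a complete QR decomposition instead of explicit Gram-Schmidt; both give an orthonormal basis of the plane orthogonal to t. The layout, d, c, and ε are assumptions. The scale of P(θ), and therefore the threshold, will not match the values of about 32 and 15 quoted for FIGS. 7 and 8 and for Th.

```python
import numpy as np

D_M, C = 0.04, 343.0     # placeholder side length [m] and speed of sound [m/s]
THETAS = np.arange(360)  # candidate directions, 1-degree grid

def model_tdoa(theta_deg):
    """Assumed t_e(theta): mic 1 on the 0-degree axis, counter-clockwise order."""
    th = np.deg2rad(np.asarray(theta_deg, dtype=float))[..., None]
    return (D_M / C) * np.cos(th + np.deg2rad([30.0, -90.0, 150.0]))

def orthogonality(t, eps=1e-12):
    """Assumed MUSIC-style P(theta) for every direction in THETAS."""
    # Complete QR of the 3x1 matrix [t]: column 0 is +-t/|t|, and columns 1-2
    # are two orthonormal vectors t_perp1, t_perp2 orthogonal to t.
    q, _ = np.linalg.qr(np.asarray(t, dtype=float).reshape(3, 1), mode="complete")
    t_perp = q[:, 1:]                                    # shape (3, 2)
    e = model_tdoa(THETAS)
    e = e / np.linalg.norm(e, axis=1, keepdims=True)     # unit-length model vectors
    denom = ((e @ t_perp) ** 2).sum(axis=1)              # the two denominator terms
    return 1.0 / (denom + eps)

single = model_tdoa(40.0)                    # noise-free single source at 40 degrees
leaky = single / np.linalg.norm(single) + 0.5 * np.ones(3) / np.sqrt(3.0)
print(orthogonality(single).max(), orthogonality(leaky).max())
```

A noise-free single-source vector drives max P(θ) towards 1/ε. By contrast, the vector leaky has a component outside the plane spanned by the model vectors (its elements no longer sum to zero), and it gives max P(θ) ≈ 5. In this assumed form, P(θ) cannot tell θ from θ + 180 degrees, because t_e(θ + 180°) = −t_e(θ). The apparatus takes the direction itself from Expression (5).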

FIGS. 7 and 8 show concrete examples of the orthogonality P(θ) computed with Expression (7). The horizontal axis is the signal arrival direction in degrees, and the vertical axis is the maximum vector orthogonality max P(θ). The 0-degree direction is the direction of microphone 1 as seen from the center of the microphone array when the array is placed on a desk (FIG. 13). FIG. 7 shows max P(θ) when sound source 1 is fixed at 10 degrees and the angle of another sound source 2 is varied from 0 to 360 degrees. max P(θ) is large, about 32, only when the angles of sources 1 and 2 coincide. In all other directions it is small, about 12 or less.

FIG. 8 shows the maximum vector orthogonality max P(θ) when there is only one sound source and its angle is varied from 0 to 360 degrees. In every arrival direction, max P(θ) is about 32, the same as at 10 degrees in FIG. 7.

The orthogonality determination unit 38 compares the orthogonality P(θ) with a threshold Th to determine the orthogonality of the arrival time difference vector t(ω, m) (step S38). An arrival time difference vector t(ω, m) with high orthogonality comes from a single sound source at a fixed position, so sparsity can be assumed for it. Conversely, sparsity cannot be assumed for an arrival time difference vector t(ω, m) with small orthogonality P(θ).

Whether sparsity can be assumed is decided by Expression (8), with the threshold Th set to, for example, 15 (step S380).

If the orthogonality P(θ) is greater than Th = 15, the sparsity determination result N_J(ω, m) is set to 1 (step S382). If it is smaller, N_J(ω, m) is set to 0 (step S381). The arrival time difference vector t(ω, m) is then updated (step S384). This repeats until all arrival time difference vectors t(ω, m) have been judged (Y in step S383). Sparsity is thus determined for the arrival time difference vectors t(ω, m) of every frame m and frequency bin ω.
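The decision loop of steps S380 to S384 can be sketched as below. The scoring function is passed in as a hypothetical hook; in Embodiment 1 it would return max P(θ) for one vector. The default Th = 15 is the example value from the text. Treating a score exactly equal to Th as non-sparse is an assumption, since the text only specifies the larger and smaller cases.

```python
import numpy as np

def sparsity_flags(t, score, th=15.0):
    """Return N_J(omega, m): 1 where sparsity can be assumed, else 0.

    t:     arrival time difference vectors, shape (F, M, 3)
    score: callable mapping one 3-vector to its orthogonality, e.g. max P(theta)
    """
    n_bins, n_frames = t.shape[:2]
    n_j = np.zeros((n_bins, n_frames), dtype=np.int8)
    for f in range(n_bins):                  # visit every frequency bin ...
        for m in range(n_frames):            # ... of every frame (steps S383/S384)
            n_j[f, m] = 1 if score(t[f, m]) > th else 0  # steps S382 / S381
    return n_j
```

Once max P(θ) is available for all bins as an array, the double loop reduces to a single vectorized comparison such as (max_p > th).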

The sound source direction calculation unit 152′ refers to the sparsity determination result N_J(ω, m). It computes the sound source direction candidate θ′(ω, m) of Expression (5) only for arrival time difference vectors t(ω, m) with N_J(ω, m) = 1, and outputs the candidates to the histogram calculation unit 153. The remaining steps are the same as in the conventional technique: calculating θ′(ω, m), building the histogram in unit 153, and taking the angles at its peaks as the sound source directions.
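The only change relative to the conventional flow is which candidates reach the histogram. The minimal sketch below assumes that θ′ and N_J are stored as arrays of the same shape (F, M) and that bins are 1 degree wide. For simplicity it filters precomputed θ′ values. The apparatus instead computes θ′ only where N_J = 1, which yields the same histogram.

```python
import numpy as np

def sparse_histogram(theta_prime, n_j, bin_width=1.0):
    """Histogram of theta' (degrees) over the bins where N_J == 1 only."""
    theta_prime, n_j = np.asarray(theta_prime), np.asarray(n_j)
    kept = theta_prime[n_j == 1] % 360.0     # candidates from sparse bins
    edges = np.arange(0.0, 360.0 + bin_width, bin_width)
    counts, _ = np.histogram(kept, bins=edges)
    return counts

counts = sparse_histogram([[10.2, 47.0], [10.7, 200.0]], [[1, 0], [1, 0]])
```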

As described above, the sound source direction estimation apparatus 100 estimates sound source directions from the arrival time difference vectors t(ω, m) of frequency bins for which sparsity can be assumed. It can therefore estimate each direction accurately even when speakers at different positions talk at the same time. Sparsity determination has been described using orthonormal vectors of the arrival time difference vector, but the present invention is not limited to this method. Another embodiment of the sparsity determination method is described next.

The sparsity determination method of Embodiment 2 evaluates the difference in direction between the arrival time difference vector t(ω, m) and the theoretical arrival time difference vector t_e(θ). FIG. 9 shows an example functional configuration of the sparsity determination unit 34′ of Embodiment 2. It comprises an inter-vector distance calculation unit 90 and a vector matching determination unit 91.

The inter-vector distance calculation unit 90 receives the arrival time difference vector t(ω, m) and normalizes it by its own magnitude. It also normalizes the theoretical vector t_e(θ) by its own magnitude. Using Expression (9), it computes the distance P′(θ) as the absolute value of the difference between the two normalized vectors: P′(θ) = | t(ω, m)/|t(ω, m)| − t_e(θ)/|t_e(θ)| |.

Here |t_e(θ)| is the magnitude of the theoretical arrival time difference vector calculated by Expression (6). The theoretical value t_e(θ) may be read out sequentially from a recording unit 37′, as shown in FIG. 9. Alternatively, values stored inside the inter-vector distance calculation unit 90 may be used.

The distance P′(θ) becomes 0 when the direction of t(ω, m) coincides with that of t_e(θ). Its magnitude therefore indicates whether t(ω, m) comes from a single sound source or is affected by other sources. In other words, the size of P′(θ) shows whether sparsity can be assumed for t(ω, m).
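The following sketches the Embodiment 2 test. It uses the distance P′(θ) = | t/|t| − t_e(θ)/|t_e(θ)| | described above and takes the minimum over a 1-degree grid. The model vector t_e(θ), the array side d, and c are assumptions: microphone 1 on the 0-degree axis, counter-clockwise numbering, and a far-field source. The decision threshold is left open because the text does not give one for this embodiment.

```python
import numpy as np

D_M, C = 0.04, 343.0  # placeholder side length [m] and speed of sound [m/s]

def model_tdoa(theta_deg):
    """Assumed t_e(theta): mic 1 on the 0-degree axis, counter-clockwise order."""
    th = np.deg2rad(np.asarray(theta_deg, dtype=float))[..., None]
    return (D_M / C) * np.cos(th + np.deg2rad([30.0, -90.0, 150.0]))

def vector_distance(t):
    """P'(theta) for theta = 0..359 degrees: distance between unit vectors."""
    t = np.asarray(t, dtype=float)
    e = model_tdoa(np.arange(360))
    e = e / np.linalg.norm(e, axis=1, keepdims=True)    # normalized theoretical values
    return np.linalg.norm(t / np.linalg.norm(t) - e, axis=1)

single = 3.0 * model_tdoa(40.0)   # scale does not matter after normalization
unit40 = model_tdoa(40.0) / np.linalg.norm(model_tdoa(40.0))
leaky = unit40 + 0.5 * np.ones(3) / np.sqrt(3.0)
print(vector_distance(single).min(), vector_distance(leaky).min())
```

Because it compares unit vectors directly, P′(θ) also separates θ from θ + 180 degrees, where the distance is 2. A smaller minimum indicates that sparsity can be assumed.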

In the second embodiment, the vector matching determination unit 91 evaluates the magnitude of the distance P′(θ) by comparing it with a threshold (step S91). Contrary to the first embodiment, a smaller value of P′(θ) indicates an arrival time difference vector t(ω, m) for which sparsity can be assumed. The other processes are the same as in the first embodiment. The presence or absence of sparsity in t(ω, m) can thus also be determined in this way.
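The per-bin test of the second embodiment can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation. The measured vector is estimated from inter-microphone phase differences, a common approach that assumes no phase wrapping (|ωτ| < π) and may differ from the arrival time difference calculation unit actually used. The distance is evaluated over a table of candidate directions and its minimum is compared with the threshold, because this section does not say how θ is chosen. The threshold value 0.1 and all function names are hypothetical.

```python
import numpy as np

# Assumed ordering of pairs: t = [tau_12, tau_23, tau_31], tau_ij = arrival_i - arrival_j.
PAIRS = ((0, 1), (1, 2), (2, 0))


def measured_tdoa(X: np.ndarray, omega: float) -> np.ndarray:
    """Phase-based estimate of t(omega, m) for one time-frequency bin.

    X: complex STFT values of the three microphones at angular frequency omega (rad/s).
    With X_k = S * exp(-1j * omega * tau_k), angle(X_i * conj(X_j)) = -omega * (tau_i - tau_j).
    """
    return np.array([-np.angle(X[i] * np.conj(X[j])) / omega for i, j in PAIRS])


def normalized_distance(t: np.ndarray, t_e: np.ndarray) -> np.ndarray:
    """P'(theta) = || t/||t|| - t_e(theta)/||t_e(theta)|| || for each row of t_e.

    t:   shape (3,), measured vector for one bin.
    t_e: shape (K, 3), theoretical vectors for K candidate directions.
    """
    t_hat = t / np.linalg.norm(t)
    te_hat = t_e / np.linalg.norm(t_e, axis=1, keepdims=True)
    return np.linalg.norm(te_hat - t_hat, axis=1)


def is_sparse_bin(t: np.ndarray, t_e: np.ndarray, threshold: float = 0.1):
    """A SMALL minimum distance means the bin is taken to be dominated by one source.

    Returns (is_sparse, index of the best-matching candidate direction).
    """
    if not np.any(t):  # zero vector: direction undefined, treat the bin as non-sparse
        return False, -1
    dist = normalized_distance(t, t_e)
    best = int(np.argmin(dist))
    return bool(dist[best] < threshold), best
```

Only bins for which `is_sparse_bin` returns `True` would then be passed on to the direction-candidate classification, which matches the role the text gives the sparsity determination unit.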

〔Simulation Results〕
To confirm the effect of the present invention, the sound source direction estimation performance of the conventional sound source direction estimation apparatus 300 was compared with that of the sound source direction estimation apparatus 100 of the present invention. In the simulation, the sound sources were a male speaker located in the 10-degree direction and a female speaker located in the 20-degree direction, both speaking simultaneously, and non-sparse white noise was superimposed on their voices at a signal-to-noise ratio of 10 dB.

The resulting histograms are shown in FIGS. 11 and 12. The horizontal axis is the direction of arrival of the signal in degrees, and the vertical axis is the frequency (count). FIG. 11 is the histogram obtained with the conventional sound source direction estimation apparatus 300; its peaks are shifted to the 5-degree and 15-degree directions. FIG. 12 is the histogram obtained with the sound source direction estimation apparatus 100 of the present invention; two distinct peaks appear correctly at 10 degrees and 20 degrees, and they stand out far more clearly than those in FIG. 11. This confirms the high sound source direction estimation accuracy of the sound source direction estimation apparatus 100 of the present invention.

The sound source direction estimation apparatus and method of the present invention described above are not limited to the above-described embodiments and may be modified as appropriate without departing from the spirit of the present invention. For example, the processes described for the above apparatus and method need not be executed only in time series in the order described; they may also be executed in parallel or individually, according to the processing capability of the apparatus executing them or as needed.

When the processing means of the above apparatus are realized by a computer, the processing content of the functions that each apparatus should have is described by a program. By executing this program on the computer, the processing means of each apparatus are realized on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. Any computer-readable recording medium may be used, such as a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, or a magnetic tape can be used as the magnetic recording device; a DVD (Digital Versatile Disc), a DVD-RAM (Random Access Memory), a CD-ROM (Compact Disc Read Only Memory), or a CD-R (Recordable)/RW (ReWritable) can be used as the optical disc; an MO (Magneto-Optical disc) can be used as the magneto-optical recording medium; and a flash memory can be used as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium, such as a DVD or CD-ROM, on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers via a network.

Each means may be configured by executing a predetermined program on a computer, or at least part of the processing may be implemented in hardware.

Claims

1. A sound source direction estimating apparatus comprising:
a microphone array consisting of three microphones arranged at the vertices of an equilateral triangle;
a frequency conversion unit that converts the signals received by the microphones of the microphone array into frequency-domain signals;
an arrival time difference calculation unit that calculates an arrival time difference for each combination of microphone pairs formed from different microphones; and
a sound source direction estimation unit that obtains sound source candidates from the arrival time differences and classifies the sound source direction candidates,
wherein the sound source direction estimation unit includes a sparsity determination unit that determines, for each frequency bin of the arrival time difference, whether sparsity can be assumed, and obtains sound source candidates from the arrival time differences of the frequency bins for which sparsity can be assumed and classifies the sound source direction candidates.

2. The sound source direction estimating apparatus according to claim 1, wherein the sparsity determination unit comprises:
an orthogonal matrix calculation unit that receives the arrival time difference vector as input and calculates, for each frequency bin, two orthonormal arrival time difference vectors orthogonal to the arrival time difference vector;
a vector orthogonality calculation unit that receives the two orthonormal arrival time difference vectors as input and calculates their orthogonality with respect to the theoretical value of the arrival time difference vector; and
an orthogonality determination unit that compares the orthogonality with a threshold to determine the sparsity of the arrival time difference vector.

3. The sound source direction estimating apparatus according to claim 1, wherein the sparsity determination unit comprises:
an inter-vector distance calculation unit that calculates, for each frequency bin, a distance given by the absolute value of the difference obtained by subtracting a normalized theoretical value, namely the theoretical value of the arrival time difference vector normalized by its own magnitude, from a normalized measured value, namely the arrival time difference vector normalized by its own magnitude; and
a vector matching determination unit that compares the distance with a threshold to determine the sparsity of the arrival time difference vector.

4. A sound source direction estimating method comprising:
a frequency conversion step in which a frequency conversion unit converts the signals received by the microphones of a microphone array consisting of three microphones into frequency-domain signals;
an arrival time difference calculation step in which an arrival time difference calculation unit calculates an arrival time difference for each combination of microphone pairs formed from different microphones; and
a sound source direction estimation step in which a sound source direction estimation unit obtains sound source candidates from the arrival time differences and classifies the sound source direction candidates,
wherein the sound source direction estimation step includes a sparsity determination step in which a sparsity determination unit determines, for each frequency bin of the arrival time difference, whether sparsity can be assumed, and is a step of obtaining sound source candidates from the arrival time differences of the frequency bins for which sparsity can be assumed and classifying the sound source direction candidates.

5. The sound source direction estimating method according to claim 4, wherein the sparsity determination step comprises:
an orthogonal matrix calculation step in which an orthogonal matrix calculation unit receives the arrival time difference vector as input and calculates, for each frequency bin, two orthonormal arrival time difference vectors orthogonal to the arrival time difference vector;
a vector orthogonality calculation step in which a vector orthogonality calculation unit receives the two orthonormal arrival time difference vectors as input and calculates their orthogonality with respect to the theoretical value of the arrival time difference vector; and
an orthogonality determination step in which an orthogonality determination unit compares the orthogonality with a threshold to determine the sparsity of the arrival time difference vector.

6. The sound source direction estimating method according to claim 4, wherein the sparsity determination step comprises:
an inter-vector distance calculation step in which an inter-vector distance calculation unit receives the arrival time difference vector as input and calculates, for each frequency bin, a distance given by the absolute value of the difference obtained by subtracting a normalized theoretical value, namely the theoretical value of the arrival time difference vector normalized by its own magnitude, from a measured value normalized by the magnitude of the arrival time difference vector itself; and
a vector matching determination step in which a vector matching determination unit compares the distance with a threshold to determine the sparsity of the arrival time difference vector.

7. A program for causing a computer to function as the sound source direction estimating apparatus according to any one of claims 1 to 3.