JP2010206449A

JP2010206449A - Speech direction estimation device and method, and program

Info

Publication number: JP2010206449A
Application number: JP2009048971A
Authority: JP
Inventors: Kenta Niwa; 健太丹羽; Sumitaka Sakauchi; 澄宇阪内; Kenichi Furuya; 賢一古家; Yoichi Haneda; 陽一羽田
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2009-03-03
Filing date: 2009-03-03
Publication date: 2010-09-16
Anticipated expiration: 2029-03-03
Also published as: JP5235725B2

Abstract

PROBLEM TO BE SOLVED: To provide a speech direction estimation device that does not require many microphones to be arranged so as to surround a speaker and that can appropriately estimate the speech direction even in an environment with a long reverberation time.

SOLUTION: A correlation matrix calculation unit sequentially generates and outputs, as an expected value over a plurality of frames, an M×M correlation matrix representing the correlation between digital sound signals. A base matrix generation unit receives the digital sound signals and information on the speaker position and generates a base matrix consisting of D bases. A correlation matrix decomposition unit receives the correlation matrix and the base matrix for each frame and decomposes the correlation matrix into a component sequence projected onto the base matrix. A frequency averaging processing unit receives the component sequence and outputs, for each frame, a component sequence averaged over frequency. A front/lateral direction decision unit receives the frequency-averaged component sequence and estimates, for each frame, whether the speaker has spoken in the front direction.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to a technique for estimating the speech direction of a speaker from audio signals input to microphones.

A system for exchanging voice information, such as a telephone or an audio conference terminal, is generally called a voice communication system. A video conference system presents video together with the audio, so the situation at the remote site is conveyed easily; in a voice communication system, by contrast, it is difficult to grasp the situation on the other side. One piece of information about the other party's situation is speech direction information. Receiving this information from the other party makes it possible to know which direction the speaker is facing while speaking, which facilitates communication.

Conventional techniques for estimating such speech direction information are disclosed in Non-Patent Documents 1 and 2, among others, and a configuration example is shown in FIG. 15. The speech direction estimation device 10 in this configuration example estimates speech direction information as follows.
(i) Speech from the speaker 1 is picked up using M microphones 11-1, ..., 11-M (M is an integer of 2 or more). The AD conversion unit 12 converts the picked-up analog signals into digital signals $\mathrm{vx}(t) = [x_1(t), \ldots, x_M(t)]^T$, where $t$ is the discrete-time index.

(ii) The frequency domain transform unit 13 receives a set (frame) of the digital signals consisting of a plurality of samples and transforms it, by a fast Fourier transform or the like, into frequency-domain signals $\mathrm{VX}(\omega, n) = [X_1(\omega, n), \ldots, X_M(\omega, n)]^T$. Here, $\omega$ is the frequency index, the total number of frequency indexes is $\Omega$, and $n$ is the frame index.
(iii) The fixed beamformer design unit 14 designs a fixed beamformer $\mathrm{VG}(\omega, r, \theta) = [G_1(\omega, r, \theta), \ldots, G_M(\omega, r, \theta)]^T$ for each speaker position and speech direction. $G_i(\omega, r, \theta)$ is the coefficient by which the frequency component $X_i(\omega, n)$ of the i-th microphone is multiplied in order to emphasize or suppress a sound source at speaker position $r$ with speech direction $\theta$.

For the design, the acoustic propagation characteristics between the sound source and the microphones, $\mathrm{VH}(\omega, r, \theta) = [H_1(\omega, r, \theta), \ldots, H_M(\omega, r, \theta)]^T$, are obtained in advance for each preset speaker position and speech direction using simulated or measured values. Here, $H_i(\omega, r, \theta)$ is the acoustic propagation characteristic between a sound source at speaker position $r$ with speech direction $\theta$ and the i-th microphone.

The fixed beamformer $\mathrm{VG}(\omega, r, \theta)$ is designed as a value satisfying Expressions (1) and (2), which express its relationship with the acoustic propagation characteristics.
$\mathrm{VH}(\omega, r_T, \theta_T)^H \cdot \mathrm{VG}(\omega, r_T, \theta_T) = 1$  (1)
$\mathrm{VH}(\omega, r_U, \theta_U)^H \cdot \mathrm{VG}(\omega, r_T, \theta_T) = 0$  (2)
Expressions (1) and (2) indicate that the fixed beamformer $\mathrm{VG}(\omega, r, \theta)$ is designed to emphasize the output power for speaker position $r_T$ and speech direction $\theta_T$ and to suppress the output power for every other speaker position $r_U$ and speech direction $\theta_U$.

(iv) The product-sum calculation unit 15 receives the frequency-domain signals $\mathrm{VX}(\omega, n) = [X_1(\omega, n), \ldots, X_M(\omega, n)]^T$ and the fixed beamformer $\mathrm{VG}(\omega, r, \theta) = [G_1(\omega, r, \theta), \ldots, G_M(\omega, r, \theta)]^T$. For each frequency $\omega$, speaker position $r$, and speech direction $\theta$, it multiplies the frequency component $X_i(\omega, n)$ of each microphone by the fixed beamformer coefficient $G_i(\omega, r, \theta)$ and sums the resulting M components to obtain the output $Y(\omega, n, r, \theta)$. This is equivalent to calculating $Y(\omega, n, r, \theta) = \mathrm{VG}(\omega, r, \theta)^H \cdot \mathrm{VX}(\omega, n)$.

(v) The power calculation unit 16 calculates the power $|Y(\omega, n, r, \theta)|^2$ from the output $Y(\omega, n, r, \theta)$ of the product-sum calculation unit 15 and outputs it.
(vi) The frequency averaging processing unit 17 averages the power $|Y(\omega, n, r, \theta)|^2$ output from the power calculation unit 16 over frequency to obtain $\mathrm{AY}(n, r, \theta)$. With $F_0$ defined as the set of frequency indexes used in the averaging and $|F_0|$ as the number of those indexes, this is equivalent to calculating
$\mathrm{AY}(n, r, \theta) = \frac{1}{|F_0|} \sum_{\omega \in F_0} |Y(\omega, n, r, \theta)|^2$
Note that $F_0$ satisfies $\Omega \geq |F_0|$.
(vii) The sound source direction selection unit 18 searches, for each frame, for the speaker position $r$ and speech direction $\theta$ that maximize the frequency-averaged power $\mathrm{AY}(n, r, \theta)$, and takes the speech direction $\theta$ at that maximum as the estimated speech direction $\theta_{out}(n)$.
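Steps (iv) to (vii) of the conventional method can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the cited documents: the function names, the dictionary keyed by (position, direction), and the toy coefficients are all hypothetical, and the beamformer coefficients are simply given rather than designed from Expressions (1) and (2).

```python
def beamformer_output(vg, vx):
    """Y = VG^H . VX for one frequency bin: sum of conj(G_i) * X_i over microphones."""
    return sum(g.conjugate() * x for g, x in zip(vg, vx))


def averaged_power(vg_by_freq, vx_by_freq, f0):
    """AY: mean of |Y|^2 over the frequency indexes listed in f0."""
    total = sum(abs(beamformer_output(vg_by_freq[w], vx_by_freq[w])) ** 2 for w in f0)
    return total / len(f0)


def select_direction(beamformers, vx_by_freq, f0):
    """Return the (r, theta) key whose fixed beamformer yields the largest AY."""
    return max(beamformers, key=lambda key: averaged_power(beamformers[key], vx_by_freq, f0))


# Toy data (hypothetical): 2 microphones, 2 frequency bins, 2 candidate directions.
vx_by_freq = [[1 + 0j, 1 + 0j], [1 + 0j, 1 + 0j]]
beamformers = {
    ("r0", 0): [[0.5 + 0j, 0.5 + 0j], [0.5 + 0j, 0.5 + 0j]],     # passes the signal
    ("r0", 90): [[0.5 + 0j, -0.5 + 0j], [0.5 + 0j, -0.5 + 0j]],  # cancels the signal
}
best = select_direction(beamformers, vx_by_freq, f0=[0, 1])
```

In this toy case the first candidate passes the signal with unit power and the second cancels it, so the first is selected. With few microphones and a directional source, real candidates can give nearly equal powers, which is the weakness discussed below.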

Non-Patent Document 1: Hiroshi Nakajima, "Expanded Beamforming for Estimating Sound Source Direction", Proceedings of the Acoustical Society of Japan, September 2005, pp. 619-620
Non-Patent Document 2: Hiroshi Nakajima and 8 others, "Sound source directivity estimation using extended beamforming", Proceedings of the Acoustical Society of Japan, September 2005, pp. 621-622

The prior art has the following two problems.
(i) Handling speech from an arbitrary position and estimating the speech direction with high accuracy requires a large number of microphones and careful placement of them.
In the prior art, the larger the differences among the output powers $|Y(\omega, n, r, \theta)|^2$ of the fixed beamformers designed for each speaker position and speech direction, the more accurately the speech direction can be estimated. However, for a sound source with strong forward directivity, such as the sound radiated from a speaker's mouth, unless the sound is picked up by a large number of microphones surrounding the speaker as shown in FIG. 16, the output powers of the fixed beamformers do not differ for some speaker positions and speech directions, and the estimation error of the speech direction increases (for example, the experiment in Non-Patent Document 2 uses 64 microphones). Reducing the error therefore requires many microphones, which makes the device large and difficult to attach to a portable device such as a telephone or an audio conference terminal.

(ii) High speech direction estimation performance cannot be obtained in a reverberant environment in which the reverberation time (the time from the arrival of the direct wave until the picked-up power has decayed by 60 dB from that of the direct wave) is 250 msec or more.
In a reverberant environment with a reverberation time of 250 msec or more, many strong reflected waves are mixed in, which makes it difficult to design the acoustic propagation characteristics $\mathrm{VH}(\omega, r, \theta)$ accurately. The output of the fixed beamformer therefore becomes ambiguous and the estimation accuracy deteriorates. For example, in a real room without low-reverberation treatment, the reverberation time is generally about 250 to 500 msec, so accurate estimation is difficult.

An object of the present invention is to provide a speech direction estimation device, method, and program that do not require a large number of microphones to be arranged so as to surround the speaker and that can appropriately estimate the speech direction even in a reverberant environment with a reverberation time of 250 msec or more.

The speech direction estimation device of the present invention comprises an AD conversion unit, a frequency domain transform unit, a correlation matrix calculation unit, a base matrix generation unit, a correlation matrix decomposition unit, a frequency averaging processing unit, and a front/lateral orientation determination unit. The AD conversion unit converts the analog audio signals picked up by M microphones (M is an integer of 2 or more) into digital audio signals. The frequency domain transform unit transforms the digital audio signals from the time domain to the frequency domain in units of frames, each consisting of a plurality of samples. The correlation matrix calculation unit sequentially generates and outputs, for each frequency, an M×M correlation matrix representing the correlation between the frequency-domain digital audio signals as an expected value over a plurality of frames. The base matrix generation unit receives the digital audio signals and information on the speaker position and generates a base matrix consisting of D bases (D is an integer of 2 or more). The correlation matrix decomposition unit receives the correlation matrix and the base matrix for each frame and decomposes the correlation matrix into a component sequence projected onto the base matrix. The frequency averaging processing unit receives the component sequence and outputs, for each frame, a frequency-averaged component sequence. The front/lateral orientation determination unit receives the frequency-averaged component sequence and estimates, for each frame, whether or not the speaker is speaking toward the front.

According to the speech direction estimation device of the present invention, there is no need to arrange a large number of microphones so as to surround the speaker, and the speech direction can be estimated appropriately even in a reverberant environment with a reverberation time of 250 msec or more.

A diagram showing a functional configuration example of the speech direction estimation device 100 of the first embodiment.
A diagram showing a processing flow example of the speech direction estimation device 100 of the first embodiment.
A conceptual diagram showing the positional relationship between the microphones and the speaker and speech direction.
A conceptual diagram of the method for determining left/right orientation.
A diagram showing the propagation characteristics of an audio signal in the time domain.
A conceptual diagram showing the relationship among the speech direction, the basis vectors, and the eigenvalues and eigenvectors.
A conceptual diagram showing the relationship between the speech direction and the eigenvalues.
A diagram showing a functional configuration example of the speech direction estimation device 200 of the second embodiment.
A diagram showing a processing flow example of the speech direction estimation device 200 of the second embodiment.
A conceptual diagram of the method for determining front/lateral orientation.
A diagram showing a functional configuration example of the speech direction estimation device 300 of the third embodiment.
A diagram showing the environment and conditions used to verify the effect.
A diagram showing the verification results for the relationship between the speech direction and the eigenvalues.
A diagram showing the verification results for left/right orientation estimation.
A diagram showing a service configuration example in which the present invention is incorporated into an audio conference terminal.
A diagram showing a functional configuration example of a speech direction estimation device according to the prior art.
A conceptual diagram showing the relationship between the microphones and the speaker according to the prior art.

[First Embodiment]
FIG. 1 shows a functional configuration example of the speech direction estimation device 100 of the present invention, and FIG. 2 shows an example of its processing flow. The speech direction estimation device 100 estimates whether the speech direction is frontal, leftward, or rightward with respect to the microphone array.

The speech direction estimation device 100 includes a microphone array 101 consisting of M microphones 101-1 to 101-M (M is an integer of 2 or more), an AD conversion unit 12, a frequency domain transform unit 13, a correlation matrix calculation unit 102, an eigenvalue decomposition unit 201, a first eigenvector power calculation unit 104, a first frequency averaging processing unit 105, a left/right orientation determination unit 106, a second frequency averaging processing unit 202, and a front/lateral orientation determination unit 203.

In the prior art, a large number of microphones had to be arranged so as to surround the speaker, as shown in FIG. 16. In the present invention, the M microphones 101-1 to 101-M need only be arranged as close together as is practical. Although more microphones in the microphone array 101 are always better, the configuration of the present invention described below can estimate the speech direction with as few as two. The arrangement may be planar or three-dimensional. Arranging a small number of microphones close together in this way makes it possible to attach the array to a portable device such as a telephone or an audio conference terminal and estimate the speech direction of speakers around it. The speaker speaks somewhere around the microphone array 101. FIG. 3 shows, as seen from above, speakers talking around a microphone array 101 consisting of seven microphones; the arrows indicate the speech direction. FIG. 3(a) shows the speakers speaking toward the front at each position, and FIG. 3(b) shows them speaking sideways.

The AD conversion unit 12 converts the analog audio signals of the speech of speaker 1 picked up by the M microphones 101-1 to 101-M into digital audio signals $x_1(t), \ldots, x_M(t)$, respectively (S1). Here, $t$ is the discrete-time index.

The frequency domain transform unit 13 receives a set (frame) of the digital audio signals consisting of a plurality of discrete-time samples, transforms it by a fast Fourier transform or the like into frequency-domain digital audio signals $X_1(\omega, n), \ldots, X_M(\omega, n)$, and outputs them (S2). Here, $n$ is the frame index and $\omega$ is the frequency index. The total number of frequency indexes is $\Omega$.
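Step S2 can be sketched as follows. This is only an illustration under stated assumptions: the text specifies neither frame length, overlap, nor window, so non-overlapping rectangular frames are assumed here, and a naive DFT stands in for the fast Fourier transform to keep the example dependency-free. All function names are hypothetical.

```python
import cmath


def split_frames(samples, frame_len):
    """Split one channel into non-overlapping frames (assumed framing)."""
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, frame_len)]


def dft(frame):
    """Naive DFT of one frame; a practical system would use an FFT instead."""
    size = len(frame)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * w * t / size) for t, x in enumerate(frame))
        for w in range(size)
    ]


def to_frequency_domain(channels, frame_len):
    """Return spectra[n][w][i]: frame n, frequency index w, microphone i."""
    per_channel = [[dft(f) for f in split_frames(ch, frame_len)] for ch in channels]
    n_frames = len(per_channel[0])
    return [
        [[per_channel[i][n][w] for i in range(len(channels))] for w in range(frame_len)]
        for n in range(n_frames)
    ]


# Two microphones, eight samples each, frames of four samples.
channels = [[1.0, 0.0, -1.0, 0.0] * 2, [0.0, 1.0, 0.0, -1.0] * 2]
spectra = to_frequency_domain(channels, frame_len=4)
```

Here `spectra[n][w]` plays the role of the vector of the M frequency components at frame n and frequency index w, which is the input to the correlation matrix calculation.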

The correlation matrix calculation unit 102 receives the frequency-domain digital audio signals $X_1(\omega, n), \ldots, X_M(\omega, n)$ and sequentially generates and outputs, for each frequency $\omega$, an M×M correlation matrix $R(\omega, k)$ representing the correlation between the signals according to Expression (3) (S3).
$R(\omega, k) = E[\mathrm{VX}(\omega, n) \cdot \mathrm{VX}^H(\omega, n)]$  (3)

Here, $\mathrm{VX}(\omega, n) = [X_1(\omega, n), \ldots, X_M(\omega, n)]^T$, $H$ denotes the conjugate transpose, and $E$ is an operator that computes $\mathrm{VX}(\omega, n) \cdot \mathrm{VX}(\omega, n)^H$ for each frame and then takes the expected value over every L frames by averaging or the like. That is, a correlation matrix is output once every L frames, and $k$ is the index of these outputs. L is preferably an integer greater than or equal to M.
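The calculation of Expression (3) for a single frequency index can be sketched as below. The text says only that the expected value is obtained "by averaging or the like", so a plain arithmetic mean over consecutive, non-overlapping blocks of L frames is assumed here; the function names and the `block_len` parameter (standing for L) are hypothetical.

```python
def outer_hermitian(vx):
    """VX . VX^H for one frame at one frequency: an M x M matrix."""
    return [[a * b.conjugate() for b in vx] for a in vx]


def correlation_matrices(frames, block_len):
    """Average VX . VX^H over each block of block_len frames.

    frames[n] is the length-M vector for frame n at a fixed frequency.
    The position in the returned list corresponds to the output index k.
    """
    m = len(frames[0])
    result = []
    for start in range(0, len(frames) - block_len + 1, block_len):
        acc = [[0j] * m for _ in range(m)]
        for vx in frames[start:start + block_len]:
            outer = outer_hermitian(vx)
            for i in range(m):
                for j in range(m):
                    acc[i][j] += outer[i][j] / block_len
        result.append(acc)
    return result


# Four frames from two microphones, averaged two frames at a time.
frames = [[1 + 0j, 1j], [1 + 0j, -1j], [2 + 0j, 0j], [0j, 2 + 0j]]
r_list = correlation_matrices(frames, block_len=2)
```

Choosing L at least as large as M, as the text recommends, lets the averaged matrix reach full rank; an average of fewer than M outer products cannot.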

The eigenvalue decomposition unit 201 receives the correlation matrix $R(\omega, k)$ and first decomposes it, by eigenvalue decomposition, so as to satisfy Expression (4), into an eigenvalue matrix $\Lambda(\omega, k)$, which is a diagonal matrix whose diagonal elements are the squares of the M eigenvalues $\lambda_1(\omega, k), \ldots, \lambda_M(\omega, k)$, and an eigenvector matrix $V(\omega, k)$ whose elements are the M eigenvectors $\mathrm{vv}_1(\omega, k), \ldots, \mathrm{vv}_M(\omega, k)$.
$R(\omega, k) = V(\omega, k) \cdot \Lambda(\omega, k) \cdot V^H(\omega, k)$  (4)
where
$\Lambda(\omega, k) = \mathrm{diag}[\lambda_1^2(\omega, k), \ldots, \lambda_M^2(\omega, k)]$
$\lambda_1(\omega, k) \geq \lambda_2(\omega, k) \geq \cdots \geq \lambda_M(\omega, k)$
$V(\omega, k) = [\mathrm{vv}_1(\omega, k), \ldots, \mathrm{vv}_M(\omega, k)]^T$
$\mathrm{vv}_i(\omega, k) = [v_{i,1}(\omega, k), \ldots, v_{i,M}(\omega, k)]$
Here, diag[·] is an operator that makes the components in [·] the diagonal elements of a diagonal matrix.
The unit then outputs the first eigenvector $\mathrm{vv}_1(\omega, k)$ corresponding to the first eigenvalue $\lambda_1(\omega, k)$, which is the largest eigenvalue (S4).
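Since step S4 needs only the largest eigenvalue and its eigenvector, one way to illustrate it without a linear algebra library is power iteration. This is a swapped-in technique for illustration, not the decomposition method the text prescribes (it says only "eigenvalue decomposition"). The function names, the start vector, and the iteration count are assumptions.

```python
import math


def mat_vec(r, v):
    """Matrix-vector product for a square matrix stored as nested lists."""
    return [sum(r[i][j] * v[j] for j in range(len(v))) for i in range(len(r))]


def norm(v):
    """Euclidean norm of a complex vector."""
    return math.sqrt(sum(abs(x) ** 2 for x in v))


def first_eigenpair(r, iterations=200):
    """Power iteration on a Hermitian positive semidefinite matrix.

    Returns (largest eigenvalue of R, unit-norm eigenvector). In the notation
    of Expression (4) the returned value is the diagonal entry lambda_1^2.
    """
    v = [complex(1.0 + 0.1 * i, 0.0) for i in range(len(r))]  # arbitrary start
    for _ in range(iterations):
        w = mat_vec(r, v)
        length = norm(w)
        if length == 0.0:
            break
        v = [x / length for x in w]
    rv = mat_vec(r, v)
    eigenvalue = sum(a.conjugate() * b for a, b in zip(v, rv)).real  # Rayleigh quotient
    return eigenvalue, v


r = [[2 + 0j, 1 + 0j], [1 + 0j, 2 + 0j]]
eigenvalue, vv1 = first_eigenpair(r)
```

Power iteration converges slowly when the two largest eigenvalues are close, which is exactly the lateral-speech case described later, so a full Hermitian eigensolver is the safer choice in practice.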

The first eigenvector power calculation unit 104 receives the first eigenvector $\mathrm{vv}_1(\omega, k)$, calculates the power of each of its M elements $v_{1,1}(\omega, k), \ldots, v_{1,M}(\omega, k)$ according to Expression (5), and outputs M power elements $\mathrm{pv}_{1,1}(\omega, k), \ldots, \mathrm{pv}_{1,M}(\omega, k)$ (S5).
$\mathrm{pv}_{1,i}(\omega, k) = |v_{1,i}(\omega, k)|$  (5)
The first frequency averaging processing unit 105 calculates, for each of the M power elements $\mathrm{pv}_{1,1}(\omega, k), \ldots, \mathrm{pv}_{1,M}(\omega, k)$ generated at every frequency $\omega$, the average over frequency according to Expression (6) and outputs M averaged power elements $\mathrm{apv}_{1,1}(k), \ldots, \mathrm{apv}_{1,M}(k)$ (S6).
$\mathrm{apv}_{1,i}(k) = \frac{1}{|F_1|} \sum_{\omega \in F_1} \mathrm{pv}_{1,i}(\omega, k)$  (6)
Here, $F_1$ is the set of frequency indexes used for averaging and $|F_1|$ is the number of those indexes; $F_1$ is set as appropriate so as to satisfy $\Omega \geq |F_1|$.
The left/right orientation determination unit 106 receives the M averaged power elements $\mathrm{apv}_{1,1}(k), \ldots, \mathrm{apv}_{1,M}(k)$, determines whether the speaker spoke facing left or facing right, and outputs the result (S7).
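Steps S5 and S6 can be sketched as follows, assuming the first eigenvectors for one output index k are held in a dictionary keyed by frequency index and that the frequency average is a plain mean over the chosen indexes. The names `power_elements`, `averaged_power_elements`, and `f1` are hypothetical.

```python
def power_elements(vv1):
    """Expression (5): pv_{1,i} = |v_{1,i}| for each microphone i."""
    return [abs(v) for v in vv1]


def averaged_power_elements(vv1_by_freq, f1):
    """Expression (6) as described in the text: mean of pv_{1,i} over the indexes in f1."""
    m = len(vv1_by_freq[f1[0]])
    totals = [0.0] * m
    for w in f1:
        for i, p in enumerate(power_elements(vv1_by_freq[w])):
            totals[i] += p
    return [t / len(f1) for t in totals]


# First eigenvectors at three frequency indexes for a 2-microphone array.
vv1_by_freq = {
    0: [0.6 + 0j, 0.8j],
    1: [0.8 + 0j, 0.6 + 0j],
    2: [1 + 0j, 0j],  # not listed in f1 below, so it is ignored
}
apv = averaged_power_elements(vv1_by_freq, f1=[0, 1])
```

Taking the magnitude discards the phase of each eigenvector element, so the arbitrary overall phase of an eigenvector does not affect the averaged result.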

The left/right determination is made by taking the ratio of the two averaged power elements $\mathrm{apv}_{1,\alpha}(k)$ and $\mathrm{apv}_{1,\beta}(k)$ corresponding to two of the M microphones of the microphone array 101, microphones 101-α and 101-β, and comparing it with a predetermined threshold thr1. FIG. 4 illustrates the determination. In this example, speech directed at the midpoint between microphone α and microphone β, shown in FIG. 4(a), is taken to give $\mathrm{apv}_{1,\alpha}(k) / \mathrm{apv}_{1,\beta}(k) = \mathrm{thr1}$, and the left/right speech direction relative to this reference is determined from whether $\mathrm{apv}_{1,\alpha}(k) / \mathrm{apv}_{1,\beta}(k)$ is larger or smaller than thr1. Specifically, the further left the speaker faces, the more $\mathrm{apv}_{1,\beta}(k)$ is attenuated relative to $\mathrm{apv}_{1,\alpha}(k)$, so the speech can be judged leftward when $\mathrm{apv}_{1,\alpha}(k) / \mathrm{apv}_{1,\beta}(k) > \mathrm{thr1}$ (FIG. 4(b)). The further right the speaker faces, the more $\mathrm{apv}_{1,\alpha}(k)$ is attenuated relative to $\mathrm{apv}_{1,\beta}(k)$, so the speech can be judged rightward when $\mathrm{apv}_{1,\alpha}(k) / \mathrm{apv}_{1,\beta}(k) < \mathrm{thr1}$ (FIG. 4(c)). The two microphones are preferably the pair with the widest left-right separation as seen from the speaker's position, so that the values of $\mathrm{apv}_{1,\alpha}(k)$ and $\mathrm{apv}_{1,\beta}(k)$ differ as much as possible.
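The ratio test of step S7 can be sketched as below. The function name, the string labels, and the handling of a zero denominator are assumptions made for illustration; the text defines only the comparison of the ratio with thr1.

```python
def judge_left_right(apv, alpha, beta, thr1):
    """Compare apv[alpha] / apv[beta] with thr1 and return the judged side.

    apv   : averaged power elements, one per microphone
    alpha : index of microphone alpha; beta : index of microphone beta
    thr1  : ratio observed for the reference speech direction
    """
    if apv[beta] == 0.0:
        # Assumed convention: a fully attenuated beta channel counts as leftward.
        return "left"
    ratio = apv[alpha] / apv[beta]
    if ratio > thr1:
        return "left"
    if ratio < thr1:
        return "right"
    return "reference"  # speech aimed exactly at the reference direction


apv = [0.9, 0.5, 0.3]  # hypothetical averaged power elements for 3 microphones
result = judge_left_right(apv, alpha=0, beta=2, thr1=1.0)
```

In practice thr1 would be obtained by measuring the ratio for speech directed at the reference point, as the text describes, rather than being fixed at 1.0 as in this toy call.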

The theoretical background that allows the left/right speech direction to be determined with this configuration is as follows.
FIG. 5 shows the propagation characteristics of an audio signal in the time domain. The propagation characteristics can be broadly divided into three parts: the direct wave, the early reverberation, and the late reverberation. It is known that, during the interval in which the direct wave and the early reverberation are observed, waves with directionality arrive at a microphone array made up of a plurality of microphones. In particular, during the early reverberation interval (the time from the arrival of the direct wave until the picked-up power has decayed by 10 dB from that of the direct wave), strong directional reflected waves are mixed in, and the power of these reflected waves varies with the speech direction. Specifically, the more frontal the speech direction, the larger the power of the direct wave and hence the smaller the power of the reflected waves; the more lateral the speech direction, the smaller the power of the direct wave and the correspondingly larger the power of the reflected waves. The present invention uses this property to estimate the speech direction.

This is explained below in terms of the configuration of the present invention. FIG. 6 shows how frontal and lateral speech affect each eigenvalue $\lambda_i(\omega, k)$ of the correlation matrix $R(\omega, k)$, taking as an example a microphone array made up of three microphones. When the speaker faces the front, many direct waves reach the microphone array and the proportion of reflected waves is relatively low, so, as shown in FIG. 6(a), the basis vector representing the direct wave has much larger power than the group of basis vectors representing the reflected waves. In this case the first eigenvalue $\lambda_1(\omega, k)$ is markedly larger than the second eigenvalue $\lambda_2(\omega, k)$ and the third eigenvalue $\lambda_3(\omega, k)$. When the speaker faces sideways, on the other hand, fewer direct waves reach the microphone array and correspondingly more reflected waves arrive. As shown in FIG. 6(b), the power of the basis vector representing the direct wave therefore decreases and the power of the group of basis vectors representing the reflected waves increases. The first eigenvalue $\lambda_1(\omega, k)$ then becomes smaller than in the frontal case, while the second eigenvalue $\lambda_2(\omega, k)$ and the third eigenvalue $\lambda_3(\omega, k)$ become larger. FIG. 7 illustrates the difference in each eigenvalue between the frontal and lateral cases.
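The tendency described above can be checked on a deliberately simplified model. The model is an assumption introduced here for illustration and does not come from the source: the direct wave has power $p_d$ and a steering vector $a$ with $\|a\|^2 = M$, and the reflected waves are approximated as spatially uncorrelated with power $p_r$ at each microphone. The symbols $\mu_i$ denote the eigenvalues of $R$, which correspond to the diagonal entries $\lambda_i^2$ of $\Lambda$ in Expression (4).

```latex
\begin{align*}
R &= p_d\, a a^H + p_r I_M, \qquad \|a\|^2 = M \\
\mu_1 &= M p_d + p_r, \qquad \mu_2 = \cdots = \mu_M = p_r \\
\frac{\mu_1}{\sum_{i=1}^{M} \mu_i} &= \frac{M p_d + p_r}{M (p_d + p_r)}
\end{align*}
```

For M = 3, a frontal case with $(p_d, p_r) = (0.9, 0.1)$ gives a share of 2.8/3.0, about 0.93, whereas a lateral case with $(p_d, p_r) = (0.5, 0.5)$ gives 2.0/3.0, about 0.67. Under this model the first eigenvalue shrinks and the remaining ones grow as power moves from the direct wave to the reflections, in line with FIG. 6 and FIG. 7. Real early reflections are directional, so the actual eigenvalue spread will differ from this idealization.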

From the above, the power of the basis vector representing the direct wave is strongly reflected in the first eigenvalue. If so, the values of the power elements of the first eigenvector (the eigenvector corresponding to the first eigenvalue), one for each of the M microphones, can be regarded as a measure of how strongly the direct wave reaches each of the M microphones. The power with which the direct wave reaches each microphone varies with the speech direction. Therefore, as described above, the ratio between the power elements of any two microphones for the reference speech direction is taken as the threshold (thr1), and comparing this threshold with the ratio between the power elements of the same two microphones for a given speech direction makes it possible to determine, from which of the two is larger, whether the speaker spoke facing left or facing right relative to the reference speech direction.
The eigenvalue decomposition unit 201 further normalizes the first eigenvalue $\lambda_1(\omega, k)$ according to Expression (7) and outputs the first normalized eigenvalue $\mathrm{n}\lambda_1(\omega, k)$ (S8).

The second frequency averaging processing unit 202 calculates the average of the first normalized eigenvalues $\mathrm{n}\lambda_1(\omega, k)$ obtained at each frequency $\omega$ according to Expression (8) and outputs the first averaged eigenvalue $\mathrm{a}\lambda_1(k)$ (S9).
$\mathrm{a}\lambda_1(k) = \frac{1}{|F_2|} \sum_{\omega \in F_2} \mathrm{n}\lambda_1(\omega, k)$  (8)
Here, $F_2$ is the set of frequency indexes used for averaging and $|F_2|$ is the number of those indexes; $F_2$ is set as appropriate so as to satisfy $\Omega \geq |F_2|$.
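Steps S8 and S9 can be sketched as follows. Expression (7) is not reproduced in this text, so the normalization used here, dividing the largest eigenvalue by the sum of all eigenvalues at that frequency, is purely an assumption chosen for illustration; the frequency average follows the description of Expression (8). All names are hypothetical.

```python
def normalized_first_eigenvalue(eigenvalues):
    """ASSUMED form of Expression (7): largest eigenvalue / sum of all eigenvalues."""
    total = sum(eigenvalues)
    return max(eigenvalues) / total if total > 0 else 0.0


def averaged_first_eigenvalue(eigenvalues_by_freq, f2):
    """Expression (8) as described: mean of the normalized values over the indexes in f2."""
    total = sum(normalized_first_eigenvalue(eigenvalues_by_freq[w]) for w in f2)
    return total / len(f2)


# Eigenvalues of R at two frequency indexes for a 3-microphone array.
eigenvalues_by_freq = {0: [6.0, 1.0, 1.0], 1: [3.0, 2.0, 1.0]}
a_lambda1 = averaged_first_eigenvalue(eigenvalues_by_freq, f2=[0, 1])
```

Some normalization of this kind makes the value insensitive to overall signal level, so that a single threshold can be used for loud and quiet speech; whether the patent's Expression (7) takes this exact form cannot be confirmed from the text.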

The front/sideways determination unit 203 receives the determination result of the left/right determination unit 106 and the first averaged eigenvalue aλ₁(k), and compares aλ₁(k) with a predetermined threshold thr2. If aλ₁(k) < thr2, it determines that the speaker is facing sideways with respect to the microphone array; otherwise it determines that the speaker is facing the front. When the result is sideways, the determination result of the left/right determination unit 106 is output as it is; otherwise, a front-facing determination result is output (S10). FIG. 10 illustrates the front/sideways determination. Here, thr2 may be set arbitrarily according to the environment and the position of the speaker. Since the first averaged eigenvalue aλ₁(k) is obtained for each frame group k, the determination result is also output for each frame group k.
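The decision rule of step S10 can be sketched as a small function. The function name and the string labels are hypothetical; only the comparison aλ₁(k) < thr2 and the pass-through of the left/right result come from the description.

```python
def decide_direction(a_lambda1, left_right_result, thr2):
    """Return 'front' unless a_lambda1 < thr2, in which case pass through left/right."""
    if a_lambda1 < thr2:
        # Sideways: forward the result of the left/right determination unit.
        return left_right_result
    return "front"
```

For example, with thr2 = 0.5 a frame group with aλ₁(k) = 0.3 and a left/right result of "left" is reported as "left", while aλ₁(k) = 0.7 is reported as "front" regardless of the left/right result.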

The theoretical background that allows this configuration to determine whether the utterance direction is front-facing or sideways is as follows. The power of the basis vector expressing the direct wave is strongly reflected in the first eigenvalue. Specifically, when the speaker faces the front, the power of the basis vector expressing the direct wave is large and the first eigenvalue is also large, whereas when the speaker faces sideways, the power of that basis vector is smaller than in the front-facing case and the first eigenvalue is also smaller. Therefore, by using the first eigenvalue as the parameter for front/sideways determination, the direction can be appropriately determined to be front-facing if the first eigenvalue is larger than a certain threshold and sideways if it is smaller.

Thus, according to the utterance direction estimation device of the first embodiment, it is possible to determine whether the utterance direction is front, left, or right, which makes communication with the other party over a network smoother.

The first embodiment described an example in which the spatial correlation matrix R(ω,k) is decomposed by eigenvalue decomposition to estimate the speaker's direction, but other methods are also conceivable. An alternative method is described next.

[Second Embodiment]
FIG. 8 shows an example functional configuration of the utterance direction estimation device 200 of the second embodiment, and FIG. 9 shows its processing flow. The utterance direction estimation device 200 includes a microphone array 101 consisting of M microphones 101-1 to 101-M (M is an integer of 2 or more), an AD conversion unit 12, a frequency domain conversion unit 13, a correlation matrix calculation unit 102, a speaker position estimation unit 405, a basis matrix generation unit 400, a correlation matrix decomposition unit 410, a frequency averaging processing unit 420, and a front/sideways determination unit 430. Of these, the microphone array 101, the AD conversion unit 12, the frequency domain conversion unit 13, and the correlation matrix calculation unit 102 are the same as those described in the first embodiment, so the description of their functions and processing is omitted.

The speaker position estimation unit 405 estimates the position r of the speaker. Several methods of estimating the position r are conceivable. For example, it may be estimated from the time differences between the microphone signals output by the AD conversion unit 12, or image information may be obtained from an imaging device (not shown) and analyzed to estimate the position. Alternatively, information on the speaker position r may be obtained from an external source.

[Basis matrix generation unit]
The basis matrix generation unit 400 receives the position information r and the frequency-domain signal X_M(ω,n) of each microphone output by the frequency domain conversion unit 13, and generates a basis matrix with D bases that serve as the axes for the matrix decomposition (S400). Here, the number of bases D need not equal the number of microphones M, and is an integer of 2 or more. The method of generating the basis matrix UU(ω,k) is described below.

First, the first basis uu_1(ω,k) = [uu_11(ω,k), …, uu_1M(ω,k)] is designed for each frequency ω and frame k. Several methods are conceivable for designing uu_1(ω,k); two of them are described here.

One possible method is to simulate the transfer characteristics of the direct sound path. In this method, a frequency-domain representation of the transfer characteristics between a sound source at position r and the M microphones, Ĥ(ω,r) = [Ĥ_1(ω,r), …, Ĥ_M(ω,r)], is prepared. Here, Ĥ_1 to Ĥ_M are transfer functions representing the transfer characteristics between the sound source and each microphone. For each frame k, the speaker position r estimated by the speaker position estimation unit 405 is taken as input, and uu′_1(ω,r) = [Ĥ_1(ω,r), …, Ĥ_M(ω,r)] is set for each frequency ω. The transfer functions may be measured values prepared in advance or values calculated by simulation. Finally, uu_1(ω,k) is obtained by normalization as in Expression (9).
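The description allows the transfer functions to come from simulation. As a minimal sketch, assuming a free-field model (pure delay and 1/distance attenuation) and assuming that the normalization of Expression (9), which is not reproduced here, scales the vector to unit norm, the first basis could be computed as follows. The function name, the speed of sound, and the array geometry are hypothetical.

```python
import cmath
import math

SPEED_OF_SOUND = 340.0  # m/s, an assumed value


def direct_path_basis(omega, source_pos, mic_positions):
    """First basis from an assumed free-field direct-path model, scaled to unit norm."""
    h = []
    for mic in mic_positions:
        d = math.dist(source_pos, mic)  # source-to-microphone distance in metres
        # Delay of d / c seconds and 1/d spreading loss (free-field assumption).
        h.append(cmath.exp(-1j * omega * d / SPEED_OF_SOUND) / d)
    norm = math.sqrt(sum(abs(x) ** 2 for x in h))
    return [x / norm for x in h]


# Hypothetical 3-microphone line array with the speaker 1 m in front of its centre.
mics = [(-0.1, 0.0), (0.0, 0.0), (0.1, 0.0)]
uu1 = direct_path_basis(2 * math.pi * 1000.0, (0.0, 1.0), mics)
```

A measured transfer function would replace the body of the loop; the normalization step stays the same. The sketch does not handle a source located exactly on a microphone (d = 0).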

Another possible method uses an area enhancement filter. In this method, the position information r estimated by the speaker position estimation unit 405 is obtained for each frame k, and a filter ũ_1(ω,k) = [ũ_11(ω,k), …, ũ_1M(ω,k)]^T that emphasizes sound at position r is designed for each frame k and frequency ω. The area enhancement filter ũ_1(ω,k) is obtained, for example, by calculating Expression (10).

Here, h̃_11(ω) = [h̃_111(ω), …, h̃_11M(ω)]^T is a representative transfer characteristic of the area to be emphasized, and h̃_1q(ω) = [h̃_1q1(ω), …, h̃_1qM(ω)]^T (q is an integer from 2 to L) are representative transfer characteristics of areas that are not to be emphasized, of which L−1 are prepared. L is an integer of 2 or more. When L = M does not hold, the matrix to be inverted is not regular, so a pseudo-inverse is used instead of the inverse. Finally, uu_1(ω,k) is obtained by normalization as in Expression (11).

Next, based on the first basis uu_1(ω,k) obtained by either of the above methods, the remaining bases uu_2(ω,k), …, uu_D(ω,k) are calculated so that there are D bases in total, and the basis matrix UU(ω,k) is generated. Methods for generating this basis matrix are described next.

First, the basis matrix UU(ω,k) can be generated as an orthonormal basis. One such method generates bases orthogonal to the vector uu_1(ω,k) by, for example, the Gram-Schmidt process. By repeating the calculations of Expressions (12) and (13) D−1 times, the D−1 vectors uu_2(ω,k), …, uu_D(ω,k) can be generated.
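Expressions (12) and (13) are not reproduced in this text. The following is a minimal sketch of the standard (modified) Gram-Schmidt process applied to random initial vectors, which is one way to realize the step described above. The function names, the seeded random generator, and the tolerance are hypothetical choices.

```python
import math
import random


def inner(a, b):
    """Complex inner product a^H b."""
    return sum(x.conjugate() * y for x, y in zip(a, b))


def extend_basis(uu1, D, rng=None, tol=1e-10):
    """Return D orthonormal vectors whose first element is uu1 (assumed unit norm)."""
    rng = rng or random.Random(0)
    M = len(uu1)
    if D > M:
        raise ValueError("at most M mutually orthogonal vectors exist")
    basis = [list(uu1)]
    while len(basis) < D:
        # Random complex initial value chi for the next basis vector.
        chi = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(M)]
        # Remove the components along every basis vector found so far.
        for u in basis:
            c = inner(u, chi)
            chi = [x - c * y for x, y in zip(chi, u)]
        norm = math.sqrt(inner(chi, chi).real)
        if norm > tol:  # redraw if chi was (almost) linearly dependent
            basis.append([x / norm for x in chi])
    return basis
```

Calling `extend_basis([1.0, 0.0, 0.0], 3)` returns three mutually orthogonal unit vectors, the first of which is the given first basis.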

Here, χ_n(ω,k) is the initial value of the n-th basis uu_n(ω,k) and may be given randomly so that ‖uu_n(ω,k)‖ ≠ 0. The bases are generated recursively in order, starting from uu_2(ω,k) and continuing up to uu_M(ω,k).

The basis matrix UU(ω,k) can also be generated using area enhancement filters. In this method, filters are generated that emphasize areas different from the area emphasized by the already generated first basis uu_1(ω,k), and the bases uu_2(ω,k), …, uu_D(ω,k) are obtained by the calculation of Expression (14).

Here, h̃_n1(ω) = [h̃_n11(ω), …, h̃_n1M(ω)]^T is a representative transfer characteristic of the area to be emphasized by the n-th basis, and h̃_nq(ω) = [h̃_nq1(ω), …, h̃_nqM(ω)]^T (q is an integer from 2 to L) are representative transfer characteristics of areas that the n-th basis should not emphasize, of which L−1 are prepared. L is an integer of 2 or more. When L = M does not hold, the matrix to be inverted is not regular, so a pseudo-inverse is used instead of the inverse. Then, uu_n(ω,k) is obtained by normalization as in Expression (15).

[Correlation matrix decomposition unit]
The correlation matrix decomposition unit 410 receives, for each frequency ω and frame k, the correlation matrix R(ω,k) and the basis matrix UU(ω,k) = [uu_1(ω,k), …, uu_D(ω,k)]^T, and outputs a component sequence L(ω,k) = [L_1(ω,k), …, L_D(ω,k)] (S410).

The component sequence L(ω,k) is a feature useful for front/sideways determination, obtained by projecting the correlation matrix R(ω,k) onto the basis matrix UU(ω,k). Projection here means calculating the degree of similarity between the correlation matrix R(ω,k) and the basis matrix UU(ω,k). That is, the elements L_1(ω,k), …, L_D(ω,k) of the component sequence L(ω,k) are obtained by the calculation of Expression (16).
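Expression (16) is not reproduced in this text. One common reading of "projecting a correlation matrix onto a basis" is the quadratic form L_d = uu_d^H R uu_d, where H is the complex conjugate transpose. The sketch below implements that reading as an assumption, not as the patent's exact formula; the function name and sample values are hypothetical.

```python
def project(R, UU):
    """Return [uu_d^H R uu_d for each basis vector uu_d in UU]."""
    components = []
    for uu in UU:
        n = len(uu)
        R_uu = [sum(R[i][j] * uu[j] for j in range(n)) for i in range(n)]
        value = sum(uu[i].conjugate() * R_uu[i] for i in range(n))
        components.append(value.real)  # the quadratic form is real for Hermitian R
    return components


# Rank-one correlation matrix R = a a^H for a hypothetical direct-wave vector a.
a = [1 + 0j, 1j]
R = [[x * y.conjugate() for y in a] for x in a]
UU = [
    [2 ** -0.5, 1j * 2 ** -0.5],   # unit vector aligned with a
    [2 ** -0.5, -1j * 2 ** -0.5],  # unit vector orthogonal to a
]
L = project(R, UU)
```

In this toy case all of the signal power lands on the basis aligned with the direct wave and none on the orthogonal one, which is the behaviour the front/sideways feature relies on.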

Here, H denotes the complex conjugate transpose.

[Frequency averaging processing unit]
The frequency averaging processing unit 420 receives the component sequence L(ω,k) output from the correlation matrix decomposition unit 410, and outputs, for each frame k, a frequency-averaged component sequence L̄(k) = [L̄_1(k), …, L̄_D(k)] averaged over the frequencies ω (S420). Each element of the frequency-averaged component sequence L̄(k) is calculated by Expression (17).
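Expression (17) is not reproduced in this text. Assuming it is an arithmetic mean over a set of frequency indices F_3, by analogy with the averaging over F₂ in the first embodiment, it would take the following form:

```latex
\bar{L}_d(k) = \frac{1}{|F_3|} \sum_{\omega \in F_3} L_d(\omega, k), \qquad d = 1, \dots, D
```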

Here, |F_3| is the total number of frequency indices used in the frequency averaging.

The front/sideways determination unit 430 receives the frequency-averaged component sequence L̄(k) and determines and outputs, for each frame k, whether the speaker is in a "front-facing" state or a "sideways" state (S430). The front/sideways determination unit 430 differs from the front/sideways determination unit 203 of the first embodiment, which receives the first averaged eigenvalue aλ₁(k), only in that it receives the frequency-averaged component sequence L̄(k). The processing and the theoretical background are the same as for the front/sideways determination unit 203 of the first embodiment.

As described above, the speaker's direction can be estimated without decomposing the spatial correlation matrix R(ω,k) by eigenvalue decomposition. The second embodiment is an example that determines only whether the speaker is "front-facing" or "sideways", but the left/right direction can also be determined. A third embodiment that can additionally determine the left/right direction is described next.

[Third Embodiment]
FIG. 11 shows an example functional configuration of the utterance direction estimation device 300 of the third embodiment. The utterance direction estimation device 300 adds, to the functional configuration of the utterance direction estimation device 200 of the second embodiment, the eigenvalue decomposition unit 201, the first eigenvector power calculation unit 104, the first frequency averaging processing unit 105, and the left/right determination unit 106 of the utterance direction estimation device 100 of the first embodiment. It also differs from the utterance direction estimation device 200 in that the front/sideways determination unit 110 receives the left/right determination result and also determines the left/right direction of the speaker's utterance.

The eigenvalue decomposition unit 201, the first eigenvector power calculation unit 104, the first frequency averaging processing unit 105, and the left/right determination unit 106 have already been described. To make the configuration of the third embodiment clear, only the function of each unit is described again, and the description referring to the processing flow is omitted.

The eigenvalue decomposition unit 201 decomposes the correlation matrix R(ω,k) into an eigenvalue matrix, which is a diagonal matrix having the squares of the M eigenvalues as its diagonal elements, and an eigenvector matrix consisting of the M eigenvectors corresponding to the respective eigenvalues, and outputs the eigenvector corresponding to the largest eigenvalue (hereinafter referred to as the "first eigenvector").
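Only the eigenvector of the largest eigenvalue is needed downstream. The description uses a full eigenvalue decomposition; as a dependency-free illustration, the sketch below substitutes power iteration, a different technique that yields the same first eigenpair for a Hermitian positive semidefinite matrix when the start vector is not orthogonal to the first eigenvector. The function name and iteration count are hypothetical.

```python
import math


def first_eigenpair(R, iterations=200):
    """Estimate the largest eigenvalue and its eigenvector of a Hermitian PSD matrix."""
    M = len(R)
    v = [complex(1.0 / math.sqrt(M))] * M  # unit-norm start vector
    for _ in range(iterations):
        w = [sum(R[i][j] * v[j] for j in range(M)) for i in range(M)]
        norm = math.sqrt(sum(abs(x) ** 2 for x in w))
        if norm == 0.0:
            break  # start vector lies in the null space of R
        v = [x / norm for x in w]
    Rv = [sum(R[i][j] * v[j] for j in range(M)) for i in range(M)]
    eigenvalue = sum(v[i].conjugate() * Rv[i] for i in range(M)).real
    return eigenvalue, v
```

For the 2×2 Hermitian matrix `[[2, 1j], [-1j, 2]]`, whose eigenvalues are 3 and 1, the function converges to the eigenvalue 3. In practice a library routine for Hermitian eigendecomposition would normally be used instead.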

The first eigenvector power calculation unit 104 calculates the power of each of the M elements constituting the first eigenvector and outputs M power elements. The first frequency averaging processing unit 105 calculates the average of each of the M power elements generated for each frequency and outputs M averaged power elements. The left/right determination unit 106 takes the ratio of the two averaged power elements corresponding to any two microphones among the M averaged power elements, compares it with a predetermined threshold to determine whether the speaker spoke leftward or rightward with respect to the microphone array, and outputs a left/right determination result.
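The three units above can be sketched together in a few lines. The function name is hypothetical, and the mapping of "ratio above the threshold" to "left" is an assumed convention that in practice depends on which two microphones are chosen and where they sit relative to the speaker.

```python
def left_right(first_eigvecs_by_freq, mic_a, mic_b, thr1):
    """Compare the ratio of two frequency-averaged power elements with thr1."""
    n_freq = len(first_eigvecs_by_freq)
    M = len(first_eigvecs_by_freq[0])
    # Power |element|^2 of each eigenvector element, averaged over frequency.
    avg_power = [
        sum(abs(vec[m]) ** 2 for vec in first_eigvecs_by_freq) / n_freq
        for m in range(M)
    ]
    ratio = avg_power[mic_a] / avg_power[mic_b]
    # Assumed convention: a ratio above thr1 means more direct-wave power at
    # mic_a, taken here to be the microphone on the speaker's left.
    return "left" if ratio > thr1 else "right"
```

With the hypothetical first eigenvectors `[[0.8, 0.6], [0.9, 0.1j]]` for two frequency bins and thr1 = 1.0, microphone 0 receives more averaged power than microphone 1, so `left_right(vecs, 0, 1, 1.0)` reports "left". The sketch does not guard against a zero averaged power in the denominator.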

The front/sideways determination unit 110 receives the left/right determination result output by the left/right determination unit 106 and determines, for each frame, whether the speaker's utterance direction is the front direction and whether the speaker spoke leftward or rightward.

When the configuration of the utterance direction estimation device of each of the above embodiments is realized by a computer, the processing of the functions that each device should have is described by a program. The processing functions are then realized on the computer by executing this program on the computer. In this case, at least part of the processing may be realized in hardware.

The various processes described above need not be executed in time series in the order described; they may also be executed in parallel or individually according to the processing capability of the device that executes them or as needed. Other modifications may be made as appropriate without departing from the spirit of the present invention.

[Verification of effects]
The effect of the present invention was verified in the sound pickup environment shown in FIG. 12(a) under the conditions shown in FIG. 12(b). The utterance direction is defined as shown in FIG. 12(c).

FIG. 13 shows the relationship between each utterance direction and the first to seventh averaged eigenvalues aλ_i(k). Here, the second to seventh averaged eigenvalues were obtained in the same manner as the first averaged eigenvalue. FIG. 13(a) shows the case where the reverberation time is 250 msec, and FIG. 13(b) the case where it is 400 msec. For both reverberation times, the first averaged eigenvalue varies greatly with the utterance direction, whereas the second to seventh averaged eigenvalues vary little with the utterance direction. The first eigenvalue is largest at 0° (front-facing) and smallest at ±90° (sideways). This result shows that the configuration of the present invention, which determines front/sideways from the first averaged eigenvalue as shown in the second embodiment, is valid and effective.

FIG. 14 compares the left/right utterance direction estimated by the configuration of the first embodiment with the actual utterance direction. FIG. 14(a) shows the case where the reverberation time is 250 msec, and FIG. 14(b) the case where it is 400 msec. For both reverberation times, an accuracy of 75% or more was obtained. This result shows that the configuration of the present invention, which determines the left/right direction from two power elements of the first eigenvector as shown in the first embodiment, is valid and effective.

[Service application example]
FIG. 15 shows an example configuration of a service in which the present invention is incorporated into an audio conference terminal. Assume that conference room A and conference room B are connected by audio terminals over a network. Utterance direction information is extracted from the audio signals picked up by the microphones attached to the audio conference terminal and is transmitted to the other party together with the audio information. By presenting the utterance direction information as visual information at the other party's side, the situation in the room, which is difficult to convey by audio information alone, can be conveyed.

The utterance direction estimation technique can also be applied to a minutes system that records the exchanges in a meeting using video and audio. With this technique, the recorded audio and video can be tagged with who spoke to whom, which helps in organizing the minutes. In addition, services that currently rely on detecting face orientation in images, such as monitoring and crime prevention services using surveillance cameras and intercoms, or services that determine whether a viewer is paying attention to an advertisement on digital signage (electronic signboards), can also be implemented using audio signals instead.

Claims

An utterance direction estimation device comprising:
an AD conversion unit that converts analog audio signals picked up by M microphones (M is an integer of 2 or more) into digital audio signals;
a frequency domain conversion unit that converts the digital audio signals from the time domain to the frequency domain in units of frames each consisting of a plurality of samples;
a correlation matrix calculation unit that sequentially generates and outputs, for each frequency, an M × M correlation matrix representing the correlation between the digital audio signals converted into the frequency domain, as an expected value over a plurality of frames;
a basis matrix generation unit that receives the digital audio signals and speaker position estimation information and generates a basis matrix consisting of D bases (D is an integer of 2 or more);
a correlation matrix decomposition unit that receives the correlation matrix and the basis matrix for each frame and projects the correlation matrix onto the basis matrix to obtain a component sequence that is a feature useful for front/sideways determination;
a frequency averaging processing unit that receives the component sequence and outputs, for each frame, a frequency-averaged component sequence averaged over frequency; and
a front/sideways determination unit that receives the frequency-averaged component sequence and estimates, for each frame, whether the speaker's utterance direction is the front direction.

The utterance direction estimation device according to claim 1, further comprising:
an eigenvalue decomposition unit that decomposes the correlation matrix into an eigenvalue matrix, which is a diagonal matrix having the squares of the M eigenvalues as diagonal elements, and an eigenvector matrix consisting of M eigenvectors corresponding to the respective eigenvalues, and outputs the eigenvector corresponding to the largest eigenvalue (hereinafter referred to as the "first eigenvector");
a first eigenvector power calculation unit that calculates the power of each of the M elements constituting the first eigenvector and outputs M power elements;
a first frequency averaging processing unit that calculates the average of each of the M power elements generated for each frequency and outputs M averaged power elements; and
a left/right determination unit that takes the ratio of the two averaged power elements corresponding to any two microphones among the M averaged power elements, compares it with a predetermined threshold to determine whether the speaker spoke leftward or rightward with respect to the microphone array, and outputs a left/right determination result,
wherein the front/sideways determination unit receives the left/right determination result and determines, for each frame, whether the speaker's utterance direction is the front direction and whether the speaker spoke leftward or rightward.

The utterance direction estimation device according to claim 1 or 2,
wherein the basis matrix generation unit obtains the first basis of the basis matrix by a method that simulates the transfer characteristics of the direct sound path, obtains a plurality of mutually orthogonal basis vectors based on the first basis, and generates the basis matrix for decomposing the correlation matrix.

The utterance direction estimation device according to claim 1 or 2,
wherein the basis matrix generation unit obtains the first basis of the basis matrix by an area enhancement filter, prepares filters that emphasize sound in areas different from that of the first basis, and generates the basis matrix for decomposing the correlation matrix using those filters.

An utterance direction estimation method comprising:
an AD conversion step in which an AD conversion unit converts analog audio signals picked up by M microphones (M is an integer of 2 or more) into digital audio signals;
a frequency domain conversion step in which a frequency domain conversion unit converts the digital audio signals from the time domain to the frequency domain in units of frames each consisting of a plurality of samples;
a correlation matrix calculation step in which a correlation matrix calculation unit sequentially generates and outputs, for each frequency, an M × M correlation matrix representing the correlation between the digital audio signals converted into the frequency domain, as an expected value over a plurality of frames;
a basis matrix generation step in which a basis matrix generation unit receives the digital audio signals and speaker position estimation information and generates a basis matrix with D bases (D is an integer of 2 or more);
a correlation matrix decomposition step in which a correlation matrix decomposition unit receives the correlation matrix and the basis matrix for each frame and projects the correlation matrix onto the basis matrix to obtain a component sequence that is a feature useful for front/sideways determination;
a frequency averaging processing step in which a frequency averaging processing unit receives the component sequence and outputs, for each frame, a frequency-averaged component sequence averaged over frequency; and
a front/sideways determination step in which a front/sideways determination unit receives the frequency-averaged component sequence and estimates, for each frame, whether the speaker's utterance direction is the front direction.

The utterance direction estimation method according to claim 5, further comprising:
an eigenvalue decomposition step in which an eigenvalue decomposition unit decomposes the correlation matrix into an eigenvalue matrix, which is a diagonal matrix having the squares of the M eigenvalues as diagonal elements, and an eigenvector matrix consisting of M eigenvectors corresponding to the respective eigenvalues, and outputs the eigenvector corresponding to the largest eigenvalue (hereinafter referred to as the "first eigenvector");
a first eigenvector power calculation step in which a first eigenvector power calculation unit calculates the power of each of the M elements constituting the first eigenvector and outputs M power elements;
a first frequency averaging processing step in which a first frequency averaging processing unit calculates the average of each of the M power elements generated for each frequency and outputs M averaged power elements; and
a left/right determination step in which a left/right determination unit takes the ratio of the two averaged power elements corresponding to any two microphones among the M averaged power elements, compares it with a predetermined threshold to determine whether the speaker spoke leftward or rightward with respect to the microphone array, and outputs a left/right determination result,
wherein the front/sideways determination step receives the left/right determination result and determines, for each frame, whether the speaker's utterance direction is the front direction and whether the speaker spoke leftward or rightward.

The utterance direction estimation method according to claim 5 or 6,
wherein, in the basis matrix generation step, the first basis of the basis matrix is obtained by a method that simulates the transfer characteristics of the direct sound path, a plurality of mutually orthogonal basis vectors are obtained based on the first basis, and the basis matrix for decomposing the correlation matrix is generated.

The utterance direction estimation method according to claim 5 or 6,
wherein, in the basis matrix generation step, the first basis of the basis matrix is obtained by an area enhancement filter, filters that emphasize sound in areas different from that of the first basis are prepared, and the basis matrix for decomposing the correlation matrix is generated using those filters.

A device program for causing a computer to function as the utterance direction estimation device according to claim 1.