JP2010206392A

JP2010206392A - Speech direction estimation device and method, and program

Info

Publication number: JP2010206392A
Application number: JP2009048223A
Authority: JP
Inventors: Kenta Niwa; 健太丹羽; Sumitaka Sakauchi; 澄宇阪内; Kenichi Furuya; 賢一古家; Yoichi Haneda; 陽一羽田
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2009-03-02
Filing date: 2009-03-02
Publication date: 2010-09-16
Anticipated expiration: 2029-03-02
Also published as: JP5235722B2

Abstract

PROBLEM TO BE SOLVED: To provide a speech direction estimation device that does not require many microphones to be arranged so as to surround a speaker, and that can appropriately estimate the speech direction even in an environment with a long reverberation time.

SOLUTION: A microphone array composed of a plurality of microphones picks up a speech signal uttered by a speaker near the array. A correlation matrix representing the correlation between the speech signals picked up by the respective microphones is generated, and the direction in which the speaker spoke relative to the microphone array is estimated from an eigenvector obtained by decomposing the correlation matrix into an eigenvalue matrix and an eigenvector matrix.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to a technique for estimating the direction in which a speaker is speaking (the speech direction) from speech signals input to microphones.

Systems that exchange voice information, such as telephones and audio conference terminals, are generally called voice communication systems. A video conference system presents video along with the audio, so the situation at the remote site is easy to convey, whereas in a voice communication system it is difficult to grasp the situation of the other party. One piece of information about the other party's situation is speech direction information. Receiving this information from the other party makes it possible to know in which direction the speaker is speaking, which facilitates communication.

Conventional techniques for estimating such speech direction information are disclosed in Non-Patent Documents 1 and 2, among others, and a configuration example is shown in FIG. 12. The speech direction estimation device 10 in this configuration example estimates speech direction information as follows.

(i) Speech from the speaker 1 is picked up using M microphones 11-1, ..., 11-M (M is an integer of 2 or more). The AD conversion unit 12 converts the picked-up analog signals into digital signals $\mathbf{X}(t) = [X_1(t), \ldots, X_M(t)]^T$, where $t$ is the discrete-time index.

(ii) The frequency domain transform unit 13 takes as input a set (frame) of the digital signals consisting of a plurality of samples and transforms it, by a fast Fourier transform or the like, into frequency-domain signals $\mathbf{X}(\omega,n) = [X_1(\omega,n), \ldots, X_M(\omega,n)]^T$. Here, $\omega$ is the frequency index, the total number of frequency indexes is $\Omega$, and $n$ is the frame index.

(iii) The fixed beamformer design unit 14 designs a fixed beamformer $\mathbf{G}(\omega,r,\theta) = [G_1(\omega,r,\theta), \ldots, G_M(\omega,r,\theta)]^T$ for each speaker position and speech direction. $G_i(\omega,r,\theta)$ is the coefficient by which the frequency component $X_i(\omega,n)$ of the $i$-th microphone is multiplied in order to emphasize or suppress a sound source at speaker position $r$ with speech direction $\theta$.

For the design, the acoustic propagation characteristics between the sound source and the microphones, $\mathbf{H}(\omega,r,\theta) = [H_1(\omega,r,\theta), \ldots, H_M(\omega,r,\theta)]^T$, are obtained in advance from simulated or measured values for each preset speaker position and speech direction. Here, $H_i(\omega,r,\theta)$ is the acoustic propagation characteristic between a sound source at speaker position $r$ with speech direction $\theta$ and the $i$-th microphone.

The fixed beamformer $\mathbf{G}(\omega,r,\theta)$ is designed as a value satisfying equations (1) and (2), which express its relationship to the acoustic propagation characteristics.

$$\mathbf{H}(\omega,r_T,\theta_T)^{H} \cdot \mathbf{G}(\omega,r_T,\theta_T) = 1 \tag{1}$$

$$\mathbf{H}(\omega,r_U,\theta_U)^{H} \cdot \mathbf{G}(\omega,r_T,\theta_T) = 0 \tag{2}$$

Equations (1) and (2) state that the fixed beamformer $\mathbf{G}(\omega,r,\theta)$ is designed to emphasize the output power for speaker position $r_T$ and speech direction $\theta_T$ while suppressing the output power for every other speaker position $r_U$ and speech direction $\theta_U$.
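As a worked illustration of how conditions (1) and (2) constrain the design, suppose one target pair $(r_T,\theta_T)$ and $K$ suppressed pairs $(r_{U_1},\theta_{U_1}), \ldots, (r_{U_K},\theta_{U_K})$ are considered. The count $K$, the matrix $A$, and the vector $\mathbf{b}$ are hypothetical symbols introduced only for this sketch. The conditions then stack into a single linear system, and the minimum-norm solution shown is one common way to solve it when $K+1 \le M$ and the rows of $A$ are linearly independent. The source does not state which solution method is actually used.

```latex
A\,\mathbf{G}(\omega,r_T,\theta_T) = \mathbf{b},
\qquad
A =
\begin{bmatrix}
\mathbf{H}(\omega,r_T,\theta_T)^{H} \\
\mathbf{H}(\omega,r_{U_1},\theta_{U_1})^{H} \\
\vdots \\
\mathbf{H}(\omega,r_{U_K},\theta_{U_K})^{H}
\end{bmatrix},
\qquad
\mathbf{b} = [1, 0, \ldots, 0]^{T}

% Minimum-norm solution (an assumption, not taken from the source):
\mathbf{G}(\omega,r_T,\theta_T) = A^{H}\,(A A^{H})^{-1}\,\mathbf{b}
```

Substituting back gives $A\,\mathbf{G} = A A^{H}(A A^{H})^{-1}\mathbf{b} = \mathbf{b}$, so the first row reproduces (1) and the remaining rows reproduce (2).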

(iv) The product-sum calculation unit 15 takes as input the frequency-domain signals $\mathbf{X}(\omega,n) = [X_1(\omega,n), \ldots, X_M(\omega,n)]^T$ and the fixed beamformer $\mathbf{G}(\omega,r,\theta) = [G_1(\omega,r,\theta), \ldots, G_M(\omega,r,\theta)]^T$. For each frequency $\omega$, speaker position $r$, and speech direction $\theta$, it multiplies the frequency component $X_i(\omega,n)$ of each microphone by the fixed beamformer coefficient $G_i(\omega,r,\theta)$ and sums the resulting $M$ components to compute the output $Y(\omega,n,r,\theta)$. This is equivalent to computing $Y(\omega,n,r,\theta) = \mathbf{G}(\omega,r,\theta)^{H} \cdot \mathbf{X}(\omega,n)$.

(v) The power calculation unit 16 computes the power $|Y(\omega,n,r,\theta)|^2$ from the output $Y(\omega,n,r,\theta)$ of the product-sum calculation unit 15 and outputs it.

(vi) The frequency averaging processing unit 17 averages the power $|Y(\omega,n,r,\theta)|^2$ output from the power calculation unit 16 over frequency to obtain $\bar{Y}(n,r,\theta)$. With $F_0$ denoting the frequency indexes used in the averaging and $|F_0|$ their total number, this is equivalent to computing

$$\bar{Y}(n,r,\theta) = \frac{1}{|F_0|} \sum_{\omega \in F_0} |Y(\omega,n,r,\theta)|^2$$

Note that $F_0$ satisfies $\Omega \geq |F_0|$.

(vii) The sound source direction selection unit 18 searches, for each frame, for the speaker position $r$ and speech direction $\theta$ that maximize the frequency-averaged power $\bar{Y}(n,r,\theta)$, and outputs the maximizing speech direction $\theta$ as the estimated speech direction $\theta_{out}(n)$.
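Steps (iv) to (vii) of the conventional method can be sketched as follows. This is a minimal illustration using only the Python standard library, not the implementation of the cited documents. The function names, the dictionary layout, and the toy coefficients are all assumptions made for the example.

```python
def beamformer_output(g, x):
    """Y = G^H . X for one frequency bin: sum of conj(G_i) * X_i."""
    return sum(gi.conjugate() * xi for gi, xi in zip(g, x))


def averaged_power(beamformer, frame, freq_bins):
    """Mean of |Y|^2 over the frequency bins in F_0."""
    total = 0.0
    for w in freq_bins:
        y = beamformer_output(beamformer[w], frame[w])
        total += abs(y) ** 2
    return total / len(freq_bins)


def select_direction(beamformers, frame, freq_bins):
    """Return the (r, theta) key whose averaged power is largest."""
    return max(
        beamformers,
        key=lambda key: averaged_power(beamformers[key], frame, freq_bins),
    )


# Toy data (hypothetical): M = 2 microphones, 2 frequency bins,
# and 2 candidate (position, direction) pairs.
frame = {0: [1 + 0j, 1 + 0j], 1: [0 + 1j, 0 + 1j]}
beamformers = {
    ("r0", -90): {0: [0.5, -0.5], 1: [0.5, -0.5]},  # cancels in-phase input
    ("r0", +90): {0: [0.5, 0.5], 1: [0.5, 0.5]},    # passes in-phase input
}
best = select_direction(beamformers, frame, freq_bins=[0, 1])
```

In this toy case the second candidate passes the in-phase input while the first cancels it, so the search selects the pair with direction +90. With real data the powers of neighbouring candidates can be nearly equal, which is the weakness discussed in the following paragraphs.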

Non-Patent Document 1: Hiroshi Nakajima, "Extended Beamforming Capable of Estimating Sound Source Direction", Proceedings of the Acoustical Society of Japan, September 2005, pp. 619-620.
Non-Patent Document 2: Hiroshi Nakajima and 8 others, "Sound Source Directivity Estimation Using Extended Beamforming", Proceedings of the Acoustical Society of Japan, September 2005, pp. 621-622.

The prior art has the following two problems.
(i) Handling speech at an arbitrary position and estimating the speech direction with high accuracy requires a large number of microphones, and their placement must also be carefully arranged.

In the prior art, the larger the differences among the output powers $|Y(\omega,n,r,\theta)|^2$ of the fixed beamformers designed for each speaker position and speech direction, the more accurately the speech direction can be estimated. However, for a sound source with strong directivity toward the front of the mouth, such as the sound radiated from a speaker's mouth, unless the sound is picked up by a large number of microphones 11 surrounding the speaker as shown in FIG. 13, the fixed beamformer output powers may show no difference for some speaker positions and speech directions, and the estimation error of the speech direction increases (for example, the experiment in Non-Patent Document 2 uses 64 microphones). Reducing the error therefore requires a large number of microphones, which enlarges the device and makes it difficult to attach to and use with portable equipment such as a telephone or an audio conference terminal.

(ii) High speech direction estimation performance cannot be obtained in a reverberant environment where the reverberation time (the time, after arrival of the direct wave, for the picked-up power to decay by 60 dB from that of the direct wave) is 250 msec or more.
In a reverberant environment with a reverberation time of 250 msec or more, many strong reflected waves are mixed in, so it is difficult to design the acoustic propagation characteristics $\mathbf{H}(\omega,r,\theta)$ accurately. As a result, the output of the fixed beamformer becomes ambiguous and the estimation accuracy deteriorates. For example, in a real room that has not been treated for low reverberation, the reverberation time is generally about 250 to 500 msec, so accurate estimation is difficult.

An object of the present invention is to provide a speech direction estimation device, method, and program that do not require a large number of microphones to be arranged so as to surround the speaker, and that can appropriately estimate the speech direction even in a reverberant environment with a reverberation time of 250 msec or more.

The speech direction estimation device of the present invention comprises an AD conversion unit, a frequency domain transform unit, a correlation matrix calculation unit, an eigenvalue decomposition unit, a first eigenvector averaging processing unit, a left-right direction cost calculation unit, and a speech direction determination unit.

The AD conversion unit converts the analog speech signals, uttered by a speaker at position $r$ and picked up by a microphone array composed of M microphones (M is an integer of 2 or more), into respective digital speech signals.

The frequency domain transform unit transforms each of the digital speech signals from the time domain to the frequency domain.

The correlation matrix calculation unit generates and outputs an $M \times M$ correlation matrix representing the correlation between the digital speech signals transformed into the frequency domain.

The eigenvalue decomposition unit decomposes the correlation matrix into an eigenvalue matrix, which is a diagonal matrix whose diagonal elements are the squares of the M eigenvalues, and an eigenvector matrix composed of the M eigenvectors corresponding to the respective eigenvalues, and outputs the eigenvector corresponding to the largest eigenvalue (hereinafter, the "first eigenvector").

The first eigenvector averaging processing unit outputs an averaged first eigenvector by averaging, over frequency, the first eigenvectors obtained for the respective frequencies.

The left-right direction cost calculation unit calculates and outputs a left-right direction determination cost for each speech direction $\theta_j$ from the averaged first eigenvector and from model averaged first eigenvectors prepared in advance for each of a plurality of speech directions $\theta_j$ ($j = 1, 2, \ldots, N$; $N \geq 2$) at the position $r$.

The speech direction determination unit determines whether the $\theta_j$ with the smallest left-right direction determination cost corresponds to leftward or rightward relative to the microphone array, and outputs the determination result.

According to the speech direction estimation device of the present invention, there is no need to arrange a large number of microphones so as to surround the speaker, and the speech direction can be appropriately estimated even in a reverberant environment with a reverberation time of 250 msec or more.

FIG. 1 is a diagram showing the propagation characteristics of a speech signal in the time domain.
FIG. 2 is a conceptual diagram schematically representing, for each of three speech directions (front, left, and right), the group of acoustic propagation vectors constituting the correlation matrix that represents the correlation between the signals picked up by the microphones, together with the eigenspace.
FIG. 3 is a diagram showing a functional configuration example of the speech direction estimation device of the first embodiment.
FIG. 4 is a diagram showing a processing flow example of the speech direction estimation device of the first embodiment.
FIG. 5 is a conceptual diagram showing the positional relationship between the microphones and the speaker and speech direction.
FIG. 6 is a diagram showing a configuration example for obtaining the model averaged first eigenvector.
FIG. 7 is a conceptual diagram showing the relationship between the speech direction and the eigenvalues.
FIG. 8 is a diagram showing a functional configuration example of the speech direction estimation device of the second embodiment.
FIG. 9 is a diagram showing a processing flow example of the speech direction estimation device of the second embodiment.
FIG. 10 is a diagram showing a configuration example for obtaining the model averaged eigenvalue.
FIG. 11 is a diagram showing a service configuration example in which the present invention is incorporated into an audio conference terminal.
FIG. 12 is a diagram showing a functional configuration example of a speech direction estimation device according to the prior art.
FIG. 13 is a conceptual diagram showing the positional relationship between the microphones and the speaker according to the prior art.

[First Embodiment]
<Principle>
The first embodiment describes a configuration that can estimate whether the speech direction is leftward or rightward relative to the microphone array. The principle of estimating the left-right speech direction is therefore explained first.

FIG. 1 shows the propagation characteristics of a speech signal in the time domain. The propagation characteristics are broadly divided into three parts: the direct wave, early reflections, and late reverberation. It is known that, in the time span in which the direct wave and early reflections are observed, directional waves arrive at a microphone array composed of a plurality of microphones. In particular, in the early reverberation period (the time, after arrival of the direct wave, until the picked-up power decays by 10 dB from that of the direct wave), strong directional reflected waves are present, and the power of these reflected waves varies with the speech direction.

FIG. 2 schematically represents, for each of three speech directions (front, left, and right), the group of acoustic propagation vectors constituting the correlation matrix that represents the correlation between the signals picked up by the microphones, together with the eigenspace (the $i$-dimensional space formed by the eigenvectors $\mathbf{V}_i$ and eigenvalues $\lambda_i$). FIG. 2 is for a microphone array composed of three microphones, so the acoustic propagation vectors and the eigenspace are drawn in three dimensions. Comparing the leftward and rightward cases in FIG. 2, the acoustic propagation vectors that make up the direct wave and the late reverberation hardly differ, but those that make up the early reflections do. This is because the directionality of the strong wall reflections mixed into the observed signal depends on the speech direction. In other words, the direction and power of the acoustic propagation that makes up the early reflections change depending on whether the speech direction is left or right, and the way the eigenspace is spanned changes accordingly. The effect of this change appears prominently in the eigenvectors $\mathbf{V}_i$ of the correlation matrix, and especially in the first eigenvector $\mathbf{V}_1$, which corresponds to the largest eigenvalue. Evaluating the value taken by the first eigenvector $\mathbf{V}_1$ therefore makes it possible to distinguish whether the speech direction is leftward or rightward.

<Configuration>
FIG. 3 shows a functional configuration example of the speech direction estimation device 100 of the present invention, and FIG. 4 shows an example of its processing flow. The speech direction estimation device 100 comprises a microphone array 101 composed of M microphones 101-1 to 101-M (M is an integer of 2 or more), an AD conversion unit 12, a frequency domain transform unit 13, a correlation matrix calculation unit 102, an eigenvalue decomposition unit 103, a first eigenvector averaging processing unit 104, a left-right direction cost calculation unit 105, and a speech direction determination unit 106. Of these, the AD conversion unit 12 and the frequency domain transform unit 13 are the same as those used in the speech direction estimation device 10 described in the background art.

In the prior art, a large number of microphones had to be arranged so as to surround the speaker, as shown in FIG. 13. In the present invention, the M microphones 101-1 to 101-M need only be arranged as close together as practical. The more microphones the microphone array 101 has the better, but with the configuration of the present invention described below, the speech direction can be estimated with as few as two. The arrangement may be planar or three-dimensional. Arranging a small number of microphones close together in this way makes it possible to attach the array to portable equipment such as a telephone or an audio conference terminal and to estimate the speech direction of speakers around it. A speaker speaks at some position $r$ around the microphone array 101. FIG. 5 shows a top view of speakers talking around a microphone array 101 composed of seven microphones, where the direction of each arrow is the speech direction. FIG. 5(a) shows the speakers speaking leftward at each position, and FIG. 5(b) shows them speaking rightward.

The AD conversion unit 12 converts the analog speech signals, uttered by the speaker 1 at position $r$ and picked up by the M microphones 101-1 to 101-M, into digital speech signals $X_1(t), \ldots, X_M(t)$, respectively (S1). Here, $t$ is the discrete-time index.

The frequency domain transform unit 13 takes as input a set (frame) of the digital speech signals consisting of a plurality of discrete-time samples, transforms it by a fast Fourier transform or the like into frequency-domain digital speech signals $X_1(\omega,n), \ldots, X_M(\omega,n)$, and outputs them (S2). Here, $n$ is the frame index and $\omega$ is the frequency index. The total number of frequency indexes is $\Omega$.

The correlation matrix calculation unit 102 takes as input the frequency-domain digital speech signals $X_1(\omega,n), \ldots, X_M(\omega,n)$ and, for each frequency $\omega$, successively generates and outputs by equation (3) an $M \times M$ correlation matrix $R(\omega,k)$ representing the correlation between the signals (S3).

$$R(\omega,k) = E[\mathbf{X}(\omega,n) \cdot \mathbf{X}^{H}(\omega,n)] \tag{3}$$

where $\mathbf{X}(\omega,n) = [X_1(\omega,n), \ldots, X_M(\omega,n)]^T$.

In equation (3), $H$ denotes the conjugate transpose, and $E$ is an operator that computes $\mathbf{X}(\omega,n) \cdot \mathbf{X}(\omega,n)^{H}$ for each frame and then takes the expected value over every $L$ frames, for example by averaging. That is, a correlation matrix is output once every $L$ frames, and $k$ is the index of these outputs. $L$ is preferably an integer of $M$ or more.
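Equation (3) can be sketched for a single frequency bin as below. This is a minimal standard-library illustration that takes the expectation $E$ to be a plain average over $L$ frames, which is one of the options the text allows. The function name and the toy frames are hypothetical.

```python
def correlation_matrix(frames):
    """Average the outer products X . X^H over the given L frames.

    `frames` is a list of L vectors, each holding the M complex
    microphone spectra X_1..X_M for one frequency bin.
    """
    num_frames = len(frames)
    m = len(frames[0])
    r = [[0j] * m for _ in range(m)]
    for x in frames:
        for i in range(m):
            for j in range(m):
                # Element (i, j) of X . X^H is X_i * conj(X_j).
                r[i][j] += x[i] * x[j].conjugate()
    return [[value / num_frames for value in row] for row in r]


# Toy example (hypothetical): M = 2 microphones, L = 2 frames.
frames = [[1 + 0j, 0 + 1j], [1 + 0j, 0 - 1j]]
R = correlation_matrix(frames)
```

The result is Hermitian by construction, which is what allows the eigenvalue decomposition in the next step to produce real eigenvalues.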

The eigenvalue decomposition unit 103 takes the correlation matrix $R(\omega,k)$ as input and first decomposes it by eigenvalue decomposition, so as to satisfy equation (4), into an eigenvalue matrix $\Lambda(\omega,k)$, which is a diagonal matrix whose diagonal elements are the squares of the M eigenvalues $\lambda_1(\omega,k), \ldots, \lambda_M(\omega,k)$, and an eigenvector matrix $V(\omega,k)$ whose elements are the M eigenvectors $\mathbf{V}_1(\omega,k), \ldots, \mathbf{V}_M(\omega,k)$.

$$R(\omega,k) = V(\omega,k) \cdot \Lambda(\omega,k) \cdot V^{H}(\omega,k) \tag{4}$$

where

$$\Lambda(\omega,k) = \mathrm{diag}[\lambda_1^2(\omega,k), \ldots, \lambda_M^2(\omega,k)]$$

$$\lambda_1(\omega,k) \geq \lambda_2(\omega,k) \geq \cdots \geq \lambda_M(\omega,k)$$

$$V(\omega,k) = [\mathbf{V}_1(\omega,k), \ldots, \mathbf{V}_M(\omega,k)]^T$$

$$\mathbf{V}_i(\omega,k) = [V_{i,1}(\omega,k), \ldots, V_{i,M}(\omega,k)]$$

Here, $\mathrm{diag}[\cdot]$ is an operator that places the components in $[\cdot]$ on the diagonal of a diagonal matrix. The unit then outputs the first eigenvector $\mathbf{V}_1(\omega,k)$ corresponding to the first eigenvalue $\lambda_1(\omega,k)$, which is the largest eigenvalue (S4).
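Because only the first eigenvector is passed on, step S4 can be illustrated without a full decomposition. The sketch below uses power iteration, which is a stand-in chosen to keep the example dependency-free and is not the method named in the source (the source specifies eigenvalue decomposition). The function name and iteration count are assumptions. The returned scalar is the largest eigenvalue of $R$, which in the notation of equation (4) corresponds to $\lambda_1^2$.

```python
import math


def first_eigenvector(r, iterations=200):
    """Approximate the dominant eigenpair of a Hermitian matrix `r`.

    Power iteration: repeatedly apply R and renormalize. It converges
    to the first eigenvector provided the start vector is not
    orthogonal to it and the largest eigenvalue is strictly dominant.
    """
    m = len(r)
    v = [complex(1.0, 0.0)] * m
    for _ in range(iterations):
        w = [sum(r[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(abs(c) ** 2 for c in w))
        if norm == 0.0:
            raise ValueError("matrix maps the start vector to zero")
        v = [c / norm for c in w]
    # Rayleigh quotient v^H R v gives the eigenvalue for unit-norm v.
    rv = [sum(r[i][j] * v[j] for j in range(m)) for i in range(m)]
    eigenvalue = sum(v[i].conjugate() * rv[i] for i in range(m)).real
    return eigenvalue, v


# Toy 2x2 Hermitian correlation matrix with eigenvalues 3 and 1.
R = [[2 + 0j, 1 + 0j], [1 + 0j, 2 + 0j]]
eigenvalue, v1 = first_eigenvector(R)
```

An eigenvector is only defined up to a complex scale factor, so the overall phase of `v1` is arbitrary. That ambiguity is one reason the subsequent step works with phases relative to a reference rather than with raw vector elements.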

The first eigenvector averaging processing unit 104 obtains and outputs an averaged first eigenvector $\bar{\mathbf{V}}_1(k)$ by averaging, over frequency, the first eigenvectors $\mathbf{V}_1(\omega,k)$ obtained for the respective frequencies $\omega$ (S5). Here, the first eigenvector $\mathbf{V}_1(\omega,k)$ is a complex-valued vector that depends on the frequency $\omega$, so frequency averaging cannot be done by a simple product-sum operation. The first eigenvector $\mathbf{V}_1(\omega,k)$ is therefore first converted into a frequency-independent feature, with reference to the frequency normalization method disclosed in JP 2007-226036 A (paragraphs [0078], [0079], etc.), and the frequency averaging is then performed.

Specifically, the first eigenvector $\mathbf{V}_1(\omega,k) = [V_{1,1}(\omega,k), \ldots, V_{1,M}(\omega,k)]$ is first converted by equations (5) and (6) into a feature vector $\mathbf{P}_1(\omega,k) = [P_{1,1}(\omega,k), \ldots, P_{1,M}(\omega,k)]$ that measures the similarity of acoustic propagation characteristics independently of frequency.

Here, $i = 1, 2, \ldots, M$; $\xi_i(\omega,k)$ is a complex rotator; $\arg[\cdot]$ is an operator that computes the phase angle; $f_\omega$ is the frequency (Hz) corresponding to the frequency index $\omega$; $d$ is the maximum spacing (m) of the microphone array; and $c$ is the speed of sound (m/s).

The resulting feature vector $\mathbf{P}_1(\omega,k) = [P_{1,1}(\omega,k), \ldots, P_{1,M}(\omega,k)]$ is then frequency-averaged by equation (7), and the averaged first eigenvector $\bar{\mathbf{V}}_1(k) = [\bar{V}_{1,1}(k), \ldots, \bar{V}_{1,M}(k)]$ is output.

Here, $F_1$ denotes the frequency indexes used in the frequency averaging and $|F_1|$ is their total number. $F_1$ is set as appropriate so as to satisfy $\Omega \geq |F_1|$.
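Equations (5) to (7) are not reproduced in this text, so the sketch below is only an assumed form built from the quantities the description names ($\arg[\cdot]$, $f_\omega$, $d$, $c$, and the set $F_1$). It divides each element's phase, taken relative to a reference microphone, by $4 f_\omega d / c$ and keeps the magnitude, then takes an element-wise mean over the bins in $F_1$. The function names, the choice of reference microphone, and the default speed of sound are all assumptions, and the actual expressions in the patent may differ.

```python
import cmath
import math


def normalize_eigenvector(v, freq_hz, d_max, c=340.0, ref=0):
    """Map a first eigenvector to a frequency-independent feature vector.

    ASSUMED form (not taken from the source): the phase of each element
    relative to microphone `ref` is divided by 4 * f * d / c, and the
    magnitude is kept unchanged.
    """
    scale = 4.0 * freq_hz * d_max / c
    features = []
    for element in v:
        phase = cmath.phase(element / v[ref])
        features.append(abs(element) * cmath.exp(1j * phase / scale))
    return features


def average_over_frequency(features_by_bin):
    """Element-wise mean of the feature vectors over the bins in F_1."""
    count = len(features_by_bin)
    m = len(features_by_bin[0])
    return [sum(p[i] for p in features_by_bin) / count for i in range(m)]


# Toy check (hypothetical): a pure delay between two microphones gives
# a phase proportional to frequency, which the normalization removes.
tau = 1e-4  # assumed inter-microphone delay in seconds


def delayed_pair(freq_hz):
    second = cmath.exp(-2j * math.pi * freq_hz * tau)
    return [1 / math.sqrt(2), second / math.sqrt(2)]


feats = [normalize_eigenvector(delayed_pair(f), f, d_max=0.1)
         for f in (500.0, 1000.0)]
averaged = average_over_frequency(feats)
```

Under this assumed form the two bins yield the same feature vector, so averaging them does not cancel the phase information. That is the property the text requires before a frequency average is meaningful.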

The left-right direction cost calculation unit 105 calculates and outputs a left-right direction determination cost $C_1(k,r,\theta_j)$ for each speech direction $\theta_j$ (S6), from the averaged first eigenvector $\bar{\mathbf{V}}_1(k) = [\bar{V}_{1,1}(k), \ldots, \bar{V}_{1,M}(k)]$ obtained by the first eigenvector averaging processing unit 104 and from the model averaged first eigenvectors $\bar{\mathbf{S}}_1(k,r,\theta_j) = [\bar{S}_{1,1}(k,r,\theta_j), \ldots, \bar{S}_{1,M}(k,r,\theta_j)]$ prepared in advance for each of a plurality of speech directions $\theta_j$ ($j = 1, 2, \ldots, N$; $N \geq 2$) at the speech position $r$. The model averaged first eigenvector $\bar{\mathbf{S}}_1(k,r,\theta_j)$ for each speech direction $\theta_j$ can be obtained, for example as shown in FIG. 6, by applying the processing up to the first eigenvector averaging processing unit 104, under the same configuration as FIG. 3, to speech signals uttered in each direction $\theta_j$ at the speech position $r$. The model averaged first eigenvectors $\bar{\mathbf{S}}_1(k,r,\theta_j)$ may be supplied to the left-right direction cost calculation unit 105 by any method, for example by recording them in a database in advance and reading them from it.
The left-right direction determination cost $C_1(k,r,\theta_j)$ is obtained by equation (8).

The left-right direction determination cost $C_1(k,r,\theta_j)$ is an index of how close the direction of the speech under test is to each speech direction $\theta_j$ prepared in advance: the smaller the cost, the closer the direction of the speech under test is to $\theta_j$. The speech direction under test can therefore be estimated by selecting, from the prepared $\theta_j$, the one with the smallest cost.
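Equation (8) is not reproduced in this text. The sketch below therefore substitutes a squared Euclidean distance between the averaged first eigenvector and each model vector. That choice is an assumption that merely satisfies the stated property that a smaller cost means a closer direction, and it should not be read as the patent's actual cost. The function names and the numeric vectors are hypothetical.

```python
def direction_cost(averaged, model):
    """ASSUMED cost: squared Euclidean distance between the two vectors."""
    return sum(abs(a - s) ** 2 for a, s in zip(averaged, model))


def closest_direction(averaged, models):
    """Return (theta_j, costs), where theta_j minimizes the cost."""
    costs = {theta: direction_cost(averaged, s) for theta, s in models.items()}
    return min(costs, key=costs.get), costs


# Hypothetical model vectors for theta_1 = -90 deg and theta_2 = +90 deg.
models = {
    -90: [0.7 + 0.0j, 0.5 - 0.5j],
    90: [0.7 + 0.0j, 0.5 + 0.5j],
}
observed = [0.7 + 0.0j, 0.4 - 0.6j]
theta_hat, costs = closest_direction(observed, models)
```

Here the observed vector lies much nearer the model for -90 degrees, so that direction is selected. Any other dissimilarity measure with the same "smaller is closer" behavior could be dropped into `direction_cost` without changing the selection logic.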

The speech direction determination unit 106 determines whether the $\theta_j$ with the smallest left-right direction determination cost $C_1(k,r,\theta_j)$ corresponds to leftward or rightward relative to the microphone array 101, and outputs the determination result (S7). For example, take the direction from the speech position $r$ facing the microphone array 101 as 0°, leftward as negative angles, and rightward as positive angles, and suppose model averaged first eigenvectors are prepared for the two directions $\theta_1 = -90°$ and $\theta_2 = +90°$. Then the direction is determined to be leftward when $C_1(k,r,\theta_1) < C_1(k,r,\theta_2)$ (because $\theta_1$, which has the smaller cost, is a negative angle), and rightward when $C_1(k,r,\theta_1) > C_1(k,r,\theta_2)$ (because $\theta_2$, which has the smaller cost, is a positive angle).
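The decision rule in step S7 reduces to taking the sign of the minimum-cost model angle. The following sketch assumes the costs are held in a dictionary keyed by angle in degrees, using the sign convention of the example above. The function name and the cost values are hypothetical.

```python
def left_or_right(costs):
    """Map the minimum-cost model direction to 'left' or 'right'.

    `costs` maps each model angle theta_j in degrees (negative = left,
    positive = right) to its left-right direction determination cost.
    """
    theta = min(costs, key=costs.get)
    if theta == 0:
        raise ValueError("0 degrees is the frontal direction, not left/right")
    return "left" if theta < 0 else "right"


# Hypothetical costs for theta_1 = -90 deg and theta_2 = +90 deg.
decision = left_or_right({-90: 0.02, 90: 1.22})
```

Because the rule depends only on which model has the smallest cost, it extends unchanged to more than two model directions, for instance several negative and several positive angles.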

As described above, the utterance direction estimation device of the first embodiment can estimate whether the speaker spoke toward the left or toward the right relative to the microphone array. Because the microphone array only needs a small number of closely spaced microphones, the device can be made compact without surrounding the speaker with many microphones. Because the configuration actively exploits reverberation, the utterance direction can be estimated appropriately even in reverberant environments with a reverberation time of 250 msec or more. Furthermore, the eigenvalue decomposition at the core of the processing requires little computation, which is advantageous when the invention is built into equipment with a low-spec CPU such as a mobile terminal.

[Second Embodiment]
The first embodiment determines whether the utterance direction is left or right. The second embodiment adds a front category, making it possible to determine whether the utterance direction is front, left, or right.

<Principle>
As noted in the explanation of the principle of the first embodiment, strong reflected waves with directivity relative to the microphone array are mixed into the signal during the early reverberation period, and the power of these reflected waves changes with the utterance direction. Specifically, the closer the utterance direction is to the front, the greater the direct-wave power and the smaller the reflected-wave power; the closer it is to sideways, the smaller the direct-wave power and the correspondingly greater the reflected-wave power.

In FIG. 2, when the speaker faces the front, much of the direct wave reaches the microphone array and the proportion of reflected waves is relatively low, so the acoustic propagation vector representing the direct wave has greater power than the group of acoustic propagation vectors representing the reflected waves. In this case, the first eigenvalue λ_1 of the correlation matrix is markedly larger than the second eigenvalue λ_2 and the third eigenvalue λ_3. When the speaker faces sideways, less of the direct wave reaches the microphone array and correspondingly more of the reflected waves do. The power of the acoustic propagation vector representing the direct wave therefore decreases, and the power of the group of acoustic propagation vectors representing the reflected waves increases. The first eigenvalue λ_1 then becomes smaller than in the front-facing case, while the second eigenvalue λ_2 and the third eigenvalue λ_3 become larger. FIG. 7 illustrates the difference in the eigenvalues between the front-facing and sideways cases. Because the degree to which the direct wave arrives shows up clearly in the eigenvalues λ_i of the correlation matrix (especially the first eigenvalue λ_1), evaluating the values of the eigenvalues λ_i makes it possible to distinguish whether the utterance direction is front or sideways.
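The effect described here can be reproduced with a toy model. The sketch below is only an illustration under strong assumptions that are not in the original: three microphones, one direct-path vector and two reflection vectors chosen to be mutually orthogonal, and made-up power values. The largest eigenvalue is found by power iteration, used here in place of a full eigenvalue decomposition to keep the example dependency-free, and its share of the trace (which equals the sum of all eigenvalues) stands in for the "markedly larger" comparison.

```python
import math


def outer(v):
    """Return the outer product v v^T as a nested list."""
    return [[a * b for b in v] for a in v]


def correlation_matrix(direct_power, reflected_power):
    """Toy 3-microphone correlation matrix: one direct path plus two reflections."""
    direct = [1 / math.sqrt(3)] * 3
    reflections = [
        [1 / math.sqrt(2), -1 / math.sqrt(2), 0.0],
        [1 / math.sqrt(6), 1 / math.sqrt(6), -2 / math.sqrt(6)],
    ]
    R = [[direct_power * x for x in row] for row in outer(direct)]
    for vec in reflections:
        for i, row in enumerate(outer(vec)):
            for j, x in enumerate(row):
                R[i][j] += reflected_power * x
    return R


def first_eigenvalue(R, iterations=500):
    """Largest eigenvalue of a symmetric positive semidefinite matrix (power iteration)."""
    n = len(R)
    v = [1.0, 0.5, 0.25]
    for _ in range(iterations):
        w = [sum(R[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    Rv = [sum(R[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(a * b for a, b in zip(v, Rv))  # Rayleigh quotient of the unit vector v


def first_eigenvalue_share(R):
    """Fraction of the total eigenvalue sum (the trace) carried by lambda_1."""
    trace = sum(R[i][i] for i in range(len(R)))
    return first_eigenvalue(R) / trace


# Hypothetical powers: the direct wave dominates when facing front, less so sideways.
front = first_eigenvalue_share(correlation_matrix(0.8, 0.1))
side = first_eigenvalue_share(correlation_matrix(0.4, 0.3))
```

In this toy setting λ_1 carries a larger share of the total when the direct-wave power is high, which is the qualitative behaviour the text attributes to FIG. 7. Real propagation vectors are complex-valued and not orthogonal, so a practical implementation would use a Hermitian eigenvalue solver on measured correlation matrices.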

<Configuration>
FIG. 8 shows an example of the functional configuration of the utterance direction estimation device 200 of the present invention, and FIG. 9 shows an example of its processing flow.

The utterance direction estimation device 200 includes a microphone array 101 composed of M microphones 101-1 to 101-M (M is an integer of 2 or more), an AD conversion unit 12, a frequency domain conversion unit 13, a correlation matrix calculation unit 102, an eigenvalue decomposition unit 201, a first eigenvector averaging processing unit 104, a left-right cost calculation unit 105, an eigenvalue averaging processing unit 202, a front/sideways cost calculation unit 203, and an utterance direction determination unit 204. Except for the eigenvalue decomposition unit 201, the eigenvalue averaging processing unit 202, the front/sideways cost calculation unit 203, and the utterance direction determination unit 204, these are the same as the components with the same names and reference numerals described in the first embodiment, so the description of their functions and processing is omitted.

The eigenvalue decomposition unit 201 performs the same decomposition as the eigenvalue decomposition unit 103 of the first embodiment and outputs the first eigenvector vV_1(ω, k). It also normalizes each eigenvalue λ_i(ω, k) (i = 1, 2, ..., M) by Equation (9) and outputs the normalized eigenvalues nλ_i(ω, k) (S11).

When the front/sideways determination cost is calculated based only on the first eigenvalue λ_1(ω, k), which is the largest eigenvalue, only the normalized first eigenvalue nλ_1(ω, k) need be calculated and output.

The eigenvalue averaging processing unit 202 takes the frequency average, by Equation (10), of the normalized eigenvalues nλ_i(ω, k) obtained for each frequency ω, and outputs the averaged eigenvalues aλ_i(k) (S12).

Here, F_1 is the set of frequency indices used for averaging and |F_1| is the number of indices in it; F_1 is set as appropriate so that Ω ≥ |F_1|. When the front/sideways determination cost is calculated based only on the first eigenvalue λ_1(ω, k), which is the largest eigenvalue, only the averaged first eigenvalue aλ_1(k) need be output.
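Steps S11 and S12 can be sketched as below. The averaging follows the description above (a mean over the indices in F_1). The normalization is an assumption: Equation (9) is not reproduced in this text, so the sketch divides each eigenvalue by the sum of the eigenvalues in the same frequency bin. The function names and the sample eigenvalues are hypothetical.

```python
def normalize_eigenvalues(eigenvalues):
    """Assumed stand-in for Equation (9): scale eigenvalues so they sum to 1."""
    total = sum(eigenvalues)
    return [lam / total for lam in eigenvalues]


def average_over_frequencies(normalized_by_bin, f1):
    """Mean of n-lambda_i(omega, k) over the frequency indices in f1, for each i."""
    m = len(normalized_by_bin[f1[0]])
    return [
        sum(normalized_by_bin[omega][i] for omega in f1) / len(f1)
        for i in range(m)
    ]


# Illustrative eigenvalues lambda_i(omega, k) for one frame k and three bins.
eigenvalues_by_bin = {
    0: [6.0, 3.0, 1.0],
    1: [8.0, 1.0, 1.0],
    2: [5.0, 4.0, 1.0],
}
normalized = {w: normalize_eigenvalues(v) for w, v in eigenvalues_by_bin.items()}

f1 = [0, 1]  # subset of bins used for averaging, so |F_1| = 2 <= 3 bins
averaged = average_over_frequencies(normalized, f1)
```

Normalizing before averaging keeps loud frequency bins from dominating the mean, which is presumably why the eigenvalues are normalized per frequency before step S12.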

The front/sideways cost calculation unit 203 calculates a front/sideways determination cost C_2(k, r, θ_j) for each utterance direction θ_j and outputs it (S13). The cost is computed from the averaged eigenvalue sequence vaλ(k) = [aλ_1(k), aλ_2(k), ..., aλ_M(k)] obtained by the eigenvalue averaging processing unit 202 and the model averaged eigenvalue sequence vaQ(k, r, θ_j) = [aQ_1(k, r, θ_j), aQ_2(k, r, θ_j), ..., aQ_M(k, r, θ_j)] prepared in advance for each of a plurality of utterance directions θ_j (j = 1, 2, ..., N; N ≥ 2) at the speech position r. The model averaged eigenvalues aQ_i(k, r, θ_j) can be obtained, for example as shown in FIG. 10, by applying the processing up to the eigenvalue averaging processing unit 202, under the same configuration as FIG. 8, to a speech signal uttered in each direction θ_j at the speech position r. The model averaged eigenvalues may be supplied to the front/sideways cost calculation unit 203 by any method, such as recording them in a database in advance and reading them from there.
The front/sideways determination cost C_2(k, r, θ_j) is obtained by Equation (11).

Because the difference in eigenvalues between front-facing and sideways utterances is reflected most clearly in the first eigenvalue, the front/sideways determination cost C_2(k, r, θ_j) may instead be obtained from the first eigenvalue alone by Equation (12).
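The relationship between the full-sequence cost and the first-eigenvalue-only cost can be sketched as follows. Equations (11) and (12) are not reproduced in this text, so the distance measure here (squared Euclidean distance between the observed and model sequences) is purely an assumption, as are the function names and the sample values.

```python
def squared_distance(observed, model):
    """Assumed distance measure; the actual Equations (11)/(12) may differ."""
    return sum((o - m) ** 2 for o, m in zip(observed, model))


def front_side_costs(observed, models, first_only=False):
    """Cost C_2 per model direction theta_j.

    observed: averaged eigenvalue sequence [a-lambda_1(k), ..., a-lambda_M(k)]
    models:   theta_j in degrees -> model sequence [aQ_1, ..., aQ_M]
    first_only: compare only the first eigenvalue, in the spirit of Equation (12)
    """
    n = 1 if first_only else len(observed)
    return {
        theta: squared_distance(observed[:n], model[:n])
        for theta, model in models.items()
    }


# Illustrative values only. The left and right models are deliberately equal:
# eigenvalues separate front from sideways, not left from right.
observed = [0.7, 0.2, 0.1]
models = {0: [0.8, 0.1, 0.1], -90: [0.5, 0.3, 0.2], 90: [0.5, 0.3, 0.2]}

full = front_side_costs(observed, models)
first = front_side_costs(observed, models, first_only=True)
```

With these values both variants rank the front model as closest, while the left and right models tie, which is why this cost is combined with the left-right cost C_1 rather than used alone.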

The utterance direction determination unit 204 computes, for each θ_j, the sum C(k, r, θ_j) of the left-right direction determination cost C_1(k, r, θ_j) and the front/sideways determination cost C_2(k, r, θ_j). Among these sums, it selects the C(k, r, θ_j) closest to the minimum over all combinations of sums of a left-right direction determination cost and a front/sideways determination cost, determines whether the corresponding utterance direction θ_j is front, left, or right with respect to the microphone array, and outputs the determination result (S14).

For example, suppose the model first eigenvectors and model eigenvalues are prepared for three directions relative to the microphone array 101 from the speech position r: θ_1 = 0° (front), θ_2 = −90° (left), and θ_3 = +90° (right). The left-right cost calculation unit 105 then outputs three costs, C_1(k, r, θ_1), C_1(k, r, θ_2), and C_1(k, r, θ_3), and the front/sideways cost calculation unit 203 likewise outputs three costs, C_2(k, r, θ_1), C_2(k, r, θ_2), and C_2(k, r, θ_3). Taking these as input, the utterance direction determination unit 204 obtains C(k, r, θ_1), C(k, r, θ_2), and C(k, r, θ_3) from C(k, r, θ_j) = C_1(k, r, θ_j) + C_2(k, r, θ_j). The θ_j that gives the minimum of the three costs, min{C(k, r, θ_j)}, is estimated to be the direction of the utterance under evaluation. In this example, the direction is estimated to be front if C(k, r, θ_1) is the minimum, left if C(k, r, θ_2) is the minimum, and right if C(k, r, θ_3) is the minimum.
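The three-direction example can be written out as a short sketch. It follows the worked example (sum the two costs per direction, then take the minimum); the function name `decide_direction` and the cost values are hypothetical, and the costs themselves are assumed to come from Equations (8) and (11).

```python
def decide_direction(c1, c2):
    """Combine the two costs per direction and label the cheapest direction.

    c1, c2: theta_j in degrees -> C_1(k, r, theta_j) and C_2(k, r, theta_j).
    Returns the label and the combined costs C(k, r, theta_j).
    """
    total = {theta: c1[theta] + c2[theta] for theta in c1}
    best = min(total, key=total.get)
    if best == 0:
        label = "front"
    elif best < 0:
        label = "left"
    else:
        label = "right"
    return label, total


# Illustrative costs for theta_1 = 0, theta_2 = -90, theta_3 = +90 degrees.
c1 = {0: 0.30, -90: 0.10, 90: 0.50}
c2 = {0: 0.40, -90: 0.05, 90: 0.05}
label, total = decide_direction(c1, c2)
```

Here C_2 alone cannot separate left from right (both are 0.05), and the left-right cost C_1 breaks the tie, so the combined minimum falls on θ_2 = −90°.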

Thus, in addition to the effects of the first embodiment, the utterance direction estimation device of the second embodiment adds a front category and can determine whether the utterance direction is front, left, or right, which makes communication with the other party over a network smoother.

When the configuration of the utterance direction estimation device of each of the above embodiments is implemented by a computer, the processing of the functions that each device should have is described by a program. Executing this program on the computer realizes the above processing functions on the computer. In this case, at least part of the processing may be implemented in hardware.

The various processes described above need not be executed in time series in the order described; they may also be executed in parallel or individually, depending on the processing capability of the device that executes them or as needed. Other modifications may be made as appropriate without departing from the spirit of the present invention.

[Service application example]
FIG. 11 shows a configuration example of a service in which the present invention is incorporated into an audio conference terminal. Assume that conference hall A and conference hall B are connected by audio terminals over a network. Utterance direction information is extracted from the audio signal picked up by the microphones attached to the audio conference terminal and is transmitted to the other party together with the audio information. Presenting the utterance direction information as visual information on the other party's side conveys aspects of the situation that are hard to convey with audio alone.

The utterance direction estimation technique can also be applied to a minutes system that records the exchanges in a meeting using video and audio. The technique makes it possible to tag the recorded audio and video with who spoke to whom, which helps in organizing the minutes.

Furthermore, in services that currently detect face orientation from images, such as monitoring and crime-prevention services using surveillance cameras and intercoms, or services that determine whether a viewer is paying attention to an advertisement on digital signage, the image-based detection of orientation can be replaced with detection based on audio signals.

Claims

An utterance direction estimation device comprising:
an AD conversion unit that converts analog audio signals collected by a microphone array composed of M microphones (M is an integer of 2 or more) at a position r into digital audio signals;
a frequency domain conversion unit that converts each of the digital audio signals from the time domain to the frequency domain;
a correlation matrix calculation unit that generates and outputs an M × M correlation matrix representing the correlation between the digital audio signals converted into the frequency domain;
an eigenvalue decomposition unit that decomposes the correlation matrix into an eigenvalue matrix, which is a diagonal matrix having the squares of M eigenvalues as diagonal elements, and an eigenvector matrix composed of M eigenvectors corresponding to the respective eigenvalues, and outputs the eigenvector corresponding to the largest eigenvalue (hereinafter referred to as the "first eigenvector");
a first eigenvector averaging processing unit that outputs an averaged first eigenvector by taking a frequency average of the first eigenvectors obtained for each frequency;
a left-right cost calculation unit that calculates and outputs a left-right direction determination cost for each utterance direction θ_j from the averaged first eigenvector and a model averaged first eigenvector prepared in advance for each of a plurality of utterance directions θ_j (j = 1, 2, ..., N; N ≥ 2) at the position r; and
an utterance direction determination unit that determines whether the θ_j with the smallest left-right direction determination cost corresponds to the left direction or the right direction with respect to the microphone array, and outputs a determination result.

The utterance direction estimation device according to claim 1, wherein
the averaged first eigenvector is obtained by calculating, for each of the M elements constituting the first eigenvector, a feature value that represents the similarity of acoustic propagation characteristics and does not depend on frequency, and then taking a frequency average of the feature values.

An utterance direction estimation method comprising:
an AD conversion step of converting analog audio signals collected by a microphone array composed of M microphones (M is an integer of 2 or more) at a position r into digital audio signals;
a frequency domain conversion step of converting each of the digital audio signals from the time domain to the frequency domain;
a correlation matrix calculation step of generating and outputting an M × M correlation matrix representing the correlation between the digital audio signals converted into the frequency domain;
an eigenvalue decomposition step of decomposing the correlation matrix into an eigenvalue matrix, which is a diagonal matrix having the squares of M eigenvalues as diagonal elements, and an eigenvector matrix composed of M eigenvectors corresponding to the respective eigenvalues, and outputting the eigenvector corresponding to the largest eigenvalue (hereinafter referred to as the "first eigenvector");
a first eigenvector averaging processing step of outputting an averaged first eigenvector by taking a frequency average of the first eigenvectors obtained for each frequency;
a left-right cost calculation step of calculating and outputting a left-right direction determination cost for each utterance direction θ_j from the averaged first eigenvector and a model averaged first eigenvector prepared in advance for each of a plurality of utterance directions θ_j (j = 1, 2, ..., N; N ≥ 2) at the position r; and
an utterance direction determination step of determining whether the θ_j with the smallest left-right direction determination cost corresponds to the left direction or the right direction with respect to the microphone array, and outputting a determination result.

The utterance direction estimation method according to claim 3, wherein
the averaged first eigenvector is obtained by calculating, for each of the M elements constituting the first eigenvector, a feature value that represents the similarity of acoustic propagation characteristics and does not depend on frequency, and then taking a frequency average of the feature values.

A program for causing a computer to function as the apparatus according to claim 1.