JP2009271183A

JP2009271183A - Multiple signal sections estimation device and its method, and program and its recording medium

Info

Publication number: JP2009271183A
Application number: JP2008119717A
Authority: JP
Inventors: Akiko Araki; 章子荒木; Kentaro Ishizuka; 健太郎石塚; Masakiyo Fujimoto; 雅清藤本; Shoji Makino; 昭二牧野
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2008-05-01
Filing date: 2008-05-01
Publication date: 2009-11-19
Anticipated expiration: 2028-05-01
Also published as: JP4875656B2

Abstract

PROBLEM TO BE SOLVED: To reliably detect signal sections even when multiple signal sources are active at the same time.

SOLUTION: The multiple signal section estimation device includes a frequency domain conversion unit, a speech presence probability estimation unit, an arrival direction estimation unit, an arrival direction probability calculation unit, and a multiplication unit. The speech presence probability estimation unit calculates the speech presence probability in each frame. The arrival direction estimation unit and the arrival direction probability calculation unit estimate the speech arrival direction probability over all frequencies in each frame. The multiplication unit then outputs the product of the speech presence probability and the speech arrival direction probability as the utterance probability of each sound source. Because multiple sound sources are allowed to be present in each frame, section detection with few missed sections is achieved.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to a technique for recording a conversation among several people with a plurality of microphones and estimating "who spoke when". In particular, it relates to a multiple signal section estimation device, a method therefor, a program, and a recording medium that estimate, for data in which signals from a plurality of sound sources are mixed, the sections in which each sound source emits a signal.

Detecting the speech sections of each of several speakers is important, for example, for automatically attaching the speaker to each utterance when meeting minutes are generated automatically, or for adding speaker information to recorded meeting data so that the recording is easier to search and cue.

The method disclosed in Non-Patent Document 1 is known as a conventional sound source direction estimation device. FIG. 10 shows the functional configuration of the sound source direction estimation device 200 of Non-Patent Document 1, which is briefly described here. The sound source direction estimation device 200 includes a frequency domain conversion unit 201, a speech section estimation unit 202, an arrival direction estimation unit 203, and an arrival direction classification unit 204. The frequency domain conversion unit 201 cuts the digitized observation signals recorded by a plurality of microphones into segments with a window function, for example every 32 ms (each segment is hereinafter called a "frame"), and converts them into frequency domain signals by a Fourier transform or the like. The speech section estimation unit 202 estimates speech sections from the observation signals converted into the frequency domain. The arrival direction estimation unit 203 estimates the speech arrival direction from the observation signals of each frame judged to be a speech section. The arrival direction classification unit 204 outputs speech sections with similar arrival directions as the sections spoken by the speaker in that direction.

Non-Patent Document 1: S. Araki, H. Sawada, and S. Makino, "Blind speech separation in a meeting situation with maximum SNR beamformers," ICASSP 2007, vol. 1, pp. 41-44, Apr. 2007.

The conventional method has the following problems. The first is that the speech section estimation unit 202 outputs a deterministic decision on whether a frame is a speech section. This means that the speech section estimation unit 202 makes two kinds of estimation error: false rejection, in which a speech section is judged not to be one, and false acceptance, in which a section without speech is judged to be a speech section. Frames judged to be non-speech are not processed any further, so false rejections cause speech to be missed. False rejection and false acceptance are generally in a trade-off relationship. It is very difficult to set that trade-off so that it suits speech section detection for multiple speakers, and the conventional method therefore missed speech sections.

The second problem is that the arrival direction estimation unit 203 outputs only one arrival direction per frame, so only one arrival direction is obtained even when the utterances of several people are mixed within a frame. The speech sections of speakers in the undetected directions are therefore missed. Thus, in the conventional method, both the speech section estimation unit 202 and the arrival direction estimation unit 203 cause speech sections to be lost.

The present invention has been made in view of these points, and its object is to provide a multiple signal section estimation device, a method and a program therefor, and a recording medium therefor that do not lose speech sections.

The multiple signal section estimation device of the present invention includes a frequency domain conversion unit, a speech presence probability estimation unit, an arrival direction estimation unit, an arrival direction probability calculation unit, and a multiplication unit. The frequency domain conversion unit converts speech signals from a plurality of sound sources, recorded by a plurality of microphones, into frequency domain signals frame by frame. The speech presence probability estimation unit estimates, for each frame, the probability that speech from a sound source is present. The arrival direction estimation unit estimates, for each frame, the speech arrival direction of each frequency component. The arrival direction probability calculation unit classifies the speech arrival directions, obtains the distribution of speech arrival directions for each sound source, and calculates the speech arrival direction probability for each sound source. The multiplication unit calculates the product of the speech presence probability and the speech arrival direction probability and outputs the presence probability of each sound source in each frame.

The multiple signal section estimation device of the present invention calculates the speech presence probability in each frame and estimates the speech arrival direction probability over all frequencies of each frame. It then outputs the product of the speech presence probability and the speech arrival direction probability as the utterance probability of each sound source. This prevents the performance degradation caused by hard-decision errors in a speech section detection unit. In addition, the arrival directions of a plurality of sound sources can be estimated probabilistically in each frame. A multiple signal section estimation device that misses few speech sections can therefore be realized.

Embodiments of the present invention are described below with reference to the drawings. Identical elements in different drawings are given the same reference numerals, and their description is not repeated.

FIG. 1 shows an example of the functional configuration of the multiple signal section estimation device 100 of the present invention, and FIG. 2 shows its operation flow. The multiple signal section estimation device 100 includes a frequency domain conversion unit 11, a speech presence probability estimation unit 12, an arrival direction estimation unit 13, an arrival direction probability calculation unit 14, and a multiplication unit 15. The observation signal x(τ) input to the frequency domain conversion unit 11 consists of speech signals from a plurality of sound sources recorded by a plurality of microphones, digitized at a sampling frequency of, for example, 16 kHz. The AD converter that digitizes the observation signal is omitted from FIG. 1. The multiple signal section estimation device 100 is realized by loading a predetermined program into a computer composed of, for example, a ROM, a RAM, and a CPU, and having the CPU execute that program.

The frequency domain conversion unit 11 cuts the digitized observation signal into segments with a window function, for example every 512 samples, and converts each segment into a frequency domain signal by a Fourier transform or the like (step S11, see FIG. 2). In this case the frame length is 512 / 16 kHz = 32 ms. The speech presence probability estimation unit 12 estimates the speech presence probability p_v(τ) in each frame τ (step S12). The arrival direction estimation unit 13 estimates the speech arrival direction q(f, τ) for each frequency component of each frame (step S13). The arrival direction probability calculation unit 14 classifies the speech arrival directions, obtains the distribution of speech arrival directions for each sound source, and calculates the speech arrival direction probability p_k(τ) for each sound source (step S14). The multiplication unit 15 calculates the product of the speech presence probability p_v(τ) and the speech arrival direction probability p_k(τ) and outputs the utterance probability P_k(τ) of each sound source (step S15).
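The framing arithmetic of step S11 can be illustrated with a minimal sketch. It assumes NumPy, a Hann window, and non-overlapping 512-sample frames; the function name `stft_frames` and the choice of window are illustrative assumptions, not taken from the specification.

```python
import numpy as np

def stft_frames(x, frame_len=512, hop=512, window=None):
    """Cut a 1-D signal into windowed frames and apply an FFT to each frame."""
    if window is None:
        window = np.hanning(frame_len)  # assumed window; the text only says "a window function"
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    # One row per frame tau, one column per frequency bin f
    return np.fft.rfft(frames, axis=1)

fs = 16_000                              # sampling frequency in Hz
frame_ms = 1000 * 512 / fs               # frame length in milliseconds
x = np.random.default_rng(0).standard_normal(fs)  # 1 s of synthetic noise
X = stft_frames(x)
print(frame_ms, X.shape)
```

With one second of input, the sketch yields 31 frames of 257 frequency bins each. A multi-microphone recording would apply the same transform to every channel.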

As described above, speech sections are handled as a speech presence probability, and speech arrival directions are handled in each frame as arrival direction probabilities for a plurality of sound sources, so frames are rarely lost. In other words, a multiple signal section estimation device that misses little of the speech signal can be realized. The operation of each unit of the multiple signal section estimation device 100 is described in detail below. The frequency domain conversion unit 11 and the multiplication unit 15 can be built easily with conventional techniques, so their detailed description is omitted.

[Speech presence probability estimation unit]
FIG. 3 shows the functional blocks of the speech presence probability estimation unit 12. The speech presence probability estimation unit 12 includes a GMM parameter recording unit 120, a Kalman filter 121, a GMM likelihood calculation unit 122, a single Gaussian distribution likelihood calculation unit 123, a transition probability recording unit 124, a forward probability calculation unit 125, and a forward probability holding unit 126. Using a GMM (Gaussian mixture model), which represents the input feature vector by a mixture of Gaussian distributions, the speech presence probability estimation unit 12 calculates the speech presence probability p_v(τ) as the forward probability α_j(τ), as shown in Equations (1) and (2).

Here, i = 1 denotes the speech state (speech + noise) at time τ−1 and i = 0 the non-speech state (silence + noise). Likewise, j = 1 denotes the speech state at time τ and j = 0 the non-speech state. a_ij is the probability of a transition from state i to state j at time τ, and b_j(τ) is the output probability of the speech GMM or the non-speech GMM.

The Kalman filter 121 takes the observation signal x(f, τ) and the GMM parameters as input and, from the speech/non-speech GMM at time τ−1, estimates the mean μ_jmkτ and the variance Σ_jmkτ of each (k-th) Gaussian distribution at time τ. The single Gaussian distribution likelihood calculation unit 123 takes the mean μ_jmkτ and the variance Σ_jmkτ as input and calculates the likelihood b_jk(τ) of each Gaussian distribution by Equation (3).

Here, x(m, τ) is the m-th order mel spectrum of frame τ.
The GMM likelihood calculation unit 122 takes the likelihood b_jk(τ) of each Gaussian distribution and the weight coefficients ω_jk as input and calculates, by Equation (4), the likelihood b_j(τ) of the speech GMM, b_1(τ), and of the non-speech GMM, b_0(τ).
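Equations (3) and (4) are not reproduced in this text. As a rough illustration only, the sketch below assumes the usual form: a diagonal-covariance Gaussian density over the mel-spectral dimensions m for each component, and a weighted sum of component likelihoods for the GMM. The function names are hypothetical, and the Kalman-filter update of the means and variances is not modeled.

```python
import numpy as np

def gaussian_likelihood(x, mean, var):
    """Density of a diagonal-covariance Gaussian at feature vector x (assumed form of b_jk)."""
    return np.exp(-0.5 * np.sum((x - mean) ** 2 / var + np.log(2 * np.pi * var)))

def gmm_likelihood(x, weights, means, variances):
    """Weighted sum of component likelihoods (assumed form of b_j)."""
    return sum(w * gaussian_likelihood(x, m, v)
               for w, m, v in zip(weights, means, variances))

# Usage: a one-dimensional standard normal evaluated at its mean
x = np.array([0.0])
b_single = gaussian_likelihood(x, np.array([0.0]), np.array([1.0]))
b_mix = gmm_likelihood(x, [0.3, 0.7],
                       [np.array([0.0])] * 2, [np.array([1.0])] * 2)
```

In practice such likelihoods are usually evaluated in the log domain to avoid underflow when the feature dimension is large.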

The forward probability calculation unit 125 takes as input the speech GMM likelihood b_1(τ), the non-speech GMM likelihood b_0(τ), the forward probability α_j(τ−1) of the previous time step held in the forward probability holding unit 126, and the transition probabilities a_ij recorded in the transition probability recording unit 124, and outputs the forward probability α_1(τ) obtained by Equation (2) as the speech presence probability p_v(τ).
The speech presence probability p_v(τ) may instead be obtained by the calculation shown in Equation (5).
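The forward-probability recursion can be sketched as follows. Equations (1) and (2) are not reproduced in this text, so the sketch assumes the standard two-state forward algorithm, α_j(τ) = b_j(τ) Σ_i a_ij α_i(τ−1), with per-frame normalization so that α_0(τ) + α_1(τ) = 1. The function name, the initial value, and the normalization are assumptions.

```python
import numpy as np

def forward_speech_probability(b, A, alpha_init=(0.5, 0.5)):
    """b: (T, 2) likelihoods [non-speech, speech] per frame.
    A: (2, 2) with A[i, j] = probability of moving from state i to state j.
    Returns p_v, the normalized forward probability of the speech state."""
    alpha = np.asarray(alpha_init, dtype=float)
    p_v = np.empty(len(b))
    for t, b_t in enumerate(b):
        alpha = b_t * (alpha @ A)   # alpha_j(t) = b_j(t) * sum_i a_ij * alpha_i(t-1)
        alpha /= alpha.sum()        # assumed normalization so the result is a probability
        p_v[t] = alpha[1]
    return p_v

# Usage: three non-speech-like frames followed by three speech-like frames
b = np.array([[0.9, 0.1]] * 3 + [[0.1, 0.9]] * 3)
A = np.array([[0.9, 0.1],
              [0.1, 0.9]])
p_v = forward_speech_probability(b, A)
```

Because the previous forward probability enters each update, a single ambiguous frame does not flip the decision outright, which is the point of using a probability rather than a hard speech/non-speech label.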

Here, Λ is given by Equations (6) and (7).

Here, λ_N(f) is the average noise power at frequency f (obtained, for example, from the opening section of a recording in which speech is clearly absent), and L is the number of frequencies used in the Fourier transform (see, for example, the reference below).

Reference: J. Sohn, N. S. Kim, and W. Sung, "A Statistical Model-Based Voice Activity Detection," IEEE Signal Processing Letters, vol. 6, no. 1, pp. 1-3, 1999.
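Equations (5) to (7) are not reproduced in this text. The sketch below is therefore only an illustration in the spirit of the cited statistical-model-based approach: it computes a frame-level log likelihood ratio averaged over the L frequency bins, using the maximum-likelihood a priori SNR estimate, for which each bin contributes γ − log γ − 1 with γ = |x(f, τ)|² / λ_N(f). Mapping Λ to a probability as Λ / (1 + Λ) is an assumption made here, not something the text confirms.

```python
import numpy as np

def log_likelihood_ratio(power, noise_power):
    """Average log likelihood ratio over the frequency bins of one frame."""
    gamma = np.maximum(power / noise_power, 1.0)   # a posteriori SNR, floored at 1
    # With the ML a priori SNR estimate (gamma - 1), each bin gives gamma - log(gamma) - 1
    return np.mean(gamma - np.log(gamma) - 1.0)

def speech_probability(power, noise_power):
    """Assumed mapping from the likelihood ratio to a probability in [0, 1)."""
    lam = np.exp(log_likelihood_ratio(power, noise_power))
    return lam / (1.0 + lam)

noise = np.ones(8)                                 # lambda_N(f), e.g. from a speech-free lead-in
p_noise_only = speech_probability(np.ones(8), noise)
p_loud = speech_probability(10.0 * np.ones(8), noise)
```

A frame whose power equals the noise estimate maps to 0.5 under this assumed mapping, and frames well above the noise floor approach 1.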

[Arrival direction estimation unit]
FIG. 4 shows an example of the functional configuration of the arrival direction estimation unit 13 and the arrival direction probability calculation unit 14. The arrival direction estimation unit 13 includes an inter-microphone phase difference calculation unit 131 and a sound source direction vector calculation unit 132. The inter-microphone phase difference calculation unit 131 calculates, by Equation (8), the inter-microphone phase difference q′_jj′ for each frame τ and each frequency f of the observation signal x(f, τ) converted into the frequency domain.

Here, x_j(f, τ) is the observation signal of microphone j at frame τ and frequency f, and * denotes the complex conjugate. The vector formed by arranging the inter-microphone phase differences q′_jj′ of all microphone pairs is written q′(f, τ). The vector q′(f, τ) is input to the sound source direction vector calculation unit 132, which calculates the sound source direction vector q(f, τ) by Equation (9) using the Moore-Penrose pseudo-inverse.

Here, the superscript + denotes the Moore-Penrose pseudo-inverse, D = [d_1 − d_J, …, d_M − d_J]^T, and d_j is the vector [x, y, z] of the coordinates of microphone j. Letting θ be the horizontal angle and φ the elevation angle of the sound source as seen from the microphones, the sound source direction vector q(f, τ) can be expressed by Equation (10).

The sound source direction vector q(f, τ) is input to the arrival direction probability calculation unit 14. To keep the description simple, the explanation below uses only the horizontal angle θ(f, τ).
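Equations (8) to (10) are not reproduced in this text. The following sketch shows one way the described steps could fit together, under stated assumptions: a far-field plane wave, phase differences taken against a reference microphone J and divided by 2πf to obtain delays, a speed of sound of 340 m/s, and q = c · D⁺ q′. The function names and the sign convention are assumptions chosen to be self-consistent within the sketch.

```python
import numpy as np

C_SOUND = 340.0  # assumed speed of sound in m/s

def direction_vector(x, mic_pos, freq_hz, ref=-1):
    """x: (M,) complex STFT values at one time-frequency point.
    mic_pos: (M, 3) microphone coordinates in metres.
    Returns the estimated direction vector q pointing toward the source."""
    others = [j for j in range(len(x)) if j != ref % len(x)]
    # Phase difference of each microphone against the reference, converted to a delay
    q_prime = np.angle(x[others] * np.conj(x[ref])) / (2 * np.pi * freq_hz)
    D = mic_pos[others] - mic_pos[ref]           # rows d_j - d_J
    return C_SOUND * np.linalg.pinv(D) @ q_prime  # least-squares solution via pseudo-inverse

def horizontal_angle(q):
    return np.arctan2(q[1], q[0])

# Usage: three microphones on a 4 cm equilateral triangle, source at 60 degrees
r = 0.04 / np.sqrt(3)
ang = np.deg2rad([90.0, 210.0, 330.0])
mic_pos = np.stack([r * np.cos(ang), r * np.sin(ang), np.zeros(3)], axis=1)
theta_true = np.deg2rad(60.0)
u = np.array([np.cos(theta_true), np.sin(theta_true), 0.0])
f = 1000.0
x = np.exp(1j * 2 * np.pi * f * (mic_pos @ u) / C_SOUND)  # simulated plane wave
q = direction_vector(x, mic_pos, f)
theta_est = horizontal_angle(q)
```

With a planar array the pseudo-inverse recovers only the components of q in the array plane, which is sufficient for the horizontal angle used in the remainder of the description. The phase differences must stay within ±π, which limits the usable frequency for a given microphone spacing.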

[Arrival direction probability calculation unit]
The arrival direction probability calculation unit 14 includes a clustering unit 140, a per-cluster distribution calculation unit 160, and a probability calculation unit 170. In this embodiment, the clustering unit 140 performs online clustering of the horizontal angle θ(f, τ) of the sound source direction vector q(f, τ) at each frequency of each frame (f, τ). FIG. 5 shows the operation flow of the arrival direction probability calculation unit 14. The arrival direction probability calculation process (step S14, see FIG. 2) includes a classification step in which the clustering unit 140 classifies the sound source direction vectors by their distance to the cluster centroids (step S140, see FIG. 5), a distribution calculation step in which the per-cluster distribution calculation unit 160 calculates the distribution of the sound source direction vectors of each class (step S160), and a probability calculation step in which the probability calculation unit 170 divides the distribution of the sound source direction vectors of each class by the overall distribution of the sound source direction vectors to obtain the speech arrival direction probability (step S170). FIG. 6 shows the detailed operation flow of the classification step S140, which is described below.

<Step S141>
First, the update step size β, which determines how far a centroid (the central value of a group being classified) moves at each update, and the threshold z used for grouping are set. Both β and z are values determined experimentally as appropriate for the environment in which the invention is used.
<Step S142>
The frame τ and the frequency f are initialized (τ = 1, f = 1).
<Step S143>
The horizontal angle θ(f, τ) of the sound source direction vector q(f, τ) at the lowest frequency f = 1 of the first frame τ = 1 is taken as the first centroid c_1.
<Step S144>
The frequency f is incremented to the next frequency.

<Step S145>
The existing centroid c_k closest to the horizontal angle θ(f, τ) is found, and its index is set to k. That is, Equation (11) selects the cluster k closest to the horizontal angle of the frequency component being clustered.

<Step S146>
The distance between θ(f, τ) and the nearest centroid c_k found in step S145 is compared with the threshold z. If the distance is smaller than z (Yes in step S146), θ(f, τ) is judged to be a frequency component from the same direction (sound source), and step S147 is performed. If the distance is larger than z (No in step S146), it is judged to be a frequency component from a sound source in another direction, and step S149 is performed.
<Step S147>
The centroid c_k is updated by Equation (12).

Equation (12) moves the centroid c_k toward the horizontal angle θ(f, τ). This is a common clustering technique that keeps the clustering performance from depending on the initial value of the centroid c_k.
<Step S148>
Because the distance is smaller than the threshold z, the component is judged to come from the same direction (sound source), and the cluster number of cluster k is assigned to that time-frequency point (f, τ). Here, the cluster number of a time-frequency point (f, τ) is held in C(f, τ).

<Step S149>
Because the distance is larger than the threshold z, the sound source direction vector q(f, τ) is judged to be a frequency component from a sound source in another direction (No in step S146). To classify it as a frequency component from another direction, a new cluster with index max(k) + 1 is created, and its centroid is set to c_{max(k)+1} = θ(f, τ).
<Step S150>
The new cluster number is assigned to that time-frequency point (f, τ).
<Step S151>
It is determined whether the frequency f is the last frequency. If it is not (No in step S151), the frequency is incremented (step S154) and the operation returns to step S145.

<Step S152>
If the frequency f is the last frequency (Yes in step S151), it is determined whether the frame τ is the last frame. If it is, the clustering operation ends (Yes in step S152). If it is not (No in step S152), the frame τ is incremented, the frequency is reinitialized (step S155), and the operation returns to step S145. Clusters with few members may be discarded (step S153, shown with a broken line).
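Steps S141 to S155 above can be sketched as a single loop. Equations (11) and (12) are not reproduced in this text, so the sketch assumes an absolute-difference distance and the update c_k ← c_k + β(θ − c_k). The function name, the default values of β and z, and the (frame, frequency) array layout are assumptions, and the wrap-around of angles at ±π is ignored for brevity.

```python
import numpy as np

def online_cluster(theta, beta=0.05, z=np.deg2rad(15.0)):
    """theta: (T, F) horizontal angles in radians, indexed [frame, frequency].
    Returns (labels, centroids); labels[t, f] plays the role of C(f, tau)."""
    centroids = [float(theta[0, 0])]              # S143: first value seeds centroid c_1
    labels = np.zeros(theta.shape, dtype=int)     # the seed point belongs to cluster 0
    for t in range(theta.shape[0]):               # S152/S155: loop over frames
        for f in range(theta.shape[1]):           # S151/S154: loop over frequencies
            if t == 0 and f == 0:
                continue
            dist = np.abs(np.asarray(centroids) - theta[t, f])
            k = int(np.argmin(dist))              # S145: nearest existing centroid
            if dist[k] < z:                       # S146: same direction
                centroids[k] += beta * (theta[t, f] - centroids[k])  # S147 (assumed form)
            else:                                 # S149: open a new cluster
                centroids.append(float(theta[t, f]))
                k = len(centroids) - 1
            labels[t, f] = k                      # S148 / S150
    return labels, np.array(centroids)

# Usage: two directions (0 rad and 2 rad) interleaved across frequency bins
rng = np.random.default_rng(0)
mask = (np.arange(24).reshape(4, 6) % 2 == 1)
theta = np.where(mask, 2.0, 0.0) + 0.02 * rng.standard_normal((4, 6))
labels, centroids = online_cluster(theta)
```

The optional pruning of small clusters (step S153) is not shown; it could be added by counting the members of each label after the loop.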

Through the above operation, the sound source direction vectors q(f, τ) of all frames and all frequencies are clustered, and a cluster number k is assigned to each sound source direction vector q(f, τ).

The per-cluster distribution calculation unit 160 calculates the distribution of the horizontal angles θ(f, τ) of the sound source direction vectors q(f, τ) classified by the clustering unit 140. Using Equation (13), it models each cluster by a normal distribution with mean c_k and variance σ_k² (step S160, see FIG. 5).

Here, the mean c_k is either the cluster centroid or the value calculated by Equation (14).
The variance σ_k² is calculated by Equation (15).

Here, |C_k| is the number of components whose cluster number is C(f, τ) = k. The probability calculation unit 170 calculates the probability that a sound source in direction k is present in each frame τ using Equations (16) and (17).
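Equations (13) to (17) are not reproduced in this text. The sketch below is one plausible reading of the description, not the patented formula: each cluster is modeled by a normal distribution whose mean and variance are the sample statistics of its members, each cluster density is divided by the sum over all clusters at every time-frequency point, and the result is averaged over the frequencies of a frame. The function names and the averaging step are assumptions.

```python
import numpy as np

def fit_clusters(theta, labels, n_clusters):
    """Sample mean and (population) variance of the angles assigned to each cluster."""
    means = np.array([theta[labels == k].mean() for k in range(n_clusters)])
    variances = np.array([theta[labels == k].var() for k in range(n_clusters)])
    return means, variances

def arrival_probability_gaussian(theta, means, variances):
    """theta: (T, F). Returns an array of shape (T, K) standing in for p_k(tau)."""
    diff = theta[..., None] - means                              # (T, F, K)
    dens = np.exp(-0.5 * diff ** 2 / variances) / np.sqrt(2 * np.pi * variances)
    post = dens / dens.sum(axis=-1, keepdims=True)               # divide by the overall distribution
    return post.mean(axis=1)                                     # assumed: average over frequencies

# Usage: frame 0 dominated by a source near 0 rad, frame 1 by a source near 2 rad
rng = np.random.default_rng(1)
theta = np.stack([0.0 + 0.05 * rng.standard_normal(8),
                  2.0 + 0.05 * rng.standard_normal(8)])
labels = np.stack([np.zeros(8, dtype=int), np.ones(8, dtype=int)])
means, variances = fit_clusters(theta, labels, 2)
p_k = arrival_probability_gaussian(theta, means, variances)
```

An empty cluster or a cluster with zero variance would need special handling that the sketch omits.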

Finally, the multiplication unit 15 calculates the product of the speech presence probability p_v(τ) and the arrival direction probability p_k(τ) in each frame τ and outputs the resulting probability as the utterance probability P_k(τ) of sound source k (step S170).

Performing this calculation for every cluster k yields the utterance probabilities P_k(τ) of all sound sources.

Next, Embodiment 2 is described, in which the arrival direction probability p_k(τ) is obtained from the counts of the horizontal angles θ(f, τ) of the clustered sound source direction vectors q(f, τ). The arrival direction probability calculation unit 14′ of Embodiment 2 includes a clustering unit 140′ and a probability calculation unit 170′ (see FIG. 4). The rest of the configuration is the same as in Embodiment 1. FIG. 7 shows the operation flow.
The clustering unit 140′ clusters the sound source direction vectors q(f, τ) by performing the calculation shown in Equation (19) for the centroids c_k of the clusters that exist up to a given time τ (step S140′).

Here, th is a threshold and may, for example, be the same value as the threshold z of FIG. 6. A horizontal angle θ(f, τ) that belongs to a cluster C_k is classified as 1, and otherwise as 0. This classification gives the count for each cluster. The operation flow is clear from FIG. 6 described above, so its description is omitted.
The probability calculation unit 170′ calculates the arrival direction probability p_k(τ) by Equation (20) (step S170′).
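Equations (19) and (20) are not reproduced in this text. A minimal sketch of the count-based idea, assuming that membership is 1 when |θ(f, τ) − c_k| < th and 0 otherwise, and that p_k(τ) is the fraction of frequency bins in frame τ that belong to cluster k, is given below. The function name and the exact normalization are assumptions.

```python
import numpy as np

def arrival_probability_counts(theta, centroids, th):
    """theta: (T, F) horizontal angles; centroids: (K,).
    Returns an array of shape (T, K): per frame, the fraction of bins within th of each centroid."""
    member = np.abs(theta[..., None] - centroids) < th   # (T, F, K) 0/1 membership
    return member.mean(axis=1)

theta = np.array([[0.0, 0.0, 2.0, 2.0],
                  [2.0, 2.0, 2.0, 0.1]])
p_k = arrival_probability_counts(theta, np.array([0.0, 2.0]), th=0.3)
print(p_k)
```

Compared with evaluating a normal density for every cluster, this needs only a comparison and a count per time-frequency point.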

Obtaining the arrival direction probability p_k(τ) in this way reduces the computational load, and the reduced load can be expected to improve processing speed.

As Embodiment 3, the configuration of an arrival direction probability calculation unit 60 designed to suppress noise is shown in FIG. 4 and described here. The arrival direction probability calculation unit 60 includes an amplitude calculation unit 61. The rest of the configuration is the same as in Embodiments 1 and 2. FIG. 8 shows the operation flow.
The amplitude calculation unit 61 calculates, by Equation (21), the normalized amplitude value a(f, τ) at each time-frequency point (f, τ) of the sound source direction vector q(f, τ) (step S61).

The subscript 1 in x_1(f, τ) is the microphone number. The constant b is preferably an integer from 1 to 4: with b = 1 the normalized amplitude value a(f, τ) is based on amplitude, with b = 2 on power, and with b = 4 on kurtosis.
The probability calculation unit 62 uses the amplitude value a(f, τ) to calculate the arrival direction probability p_k(τ) by Equation (22) (step S170′).

Here, the normalized amplitude value a(f, τ) acts as a weight coefficient. At time-frequency points (f, τ) where speech is present, the amplitude is large, whereas noise alone, without speech, has a small amplitude. The normalized amplitude value a(f, τ) is therefore large in speech sections and small in non-speech sections.

Taking this normalized amplitude value a(f, τ) into account when calculating the arrival direction probability p_k(τ), as shown in equation (22), suppresses the false detection of noise as speech.
Equation (22) is the expression for the case where the amplitude calculation unit 61 is added to Example 2. A noise suppression effect can also be expected when the amplitude calculation unit 61 is added to Example 1, in which the distribution of the sound source direction vectors is obtained as a normal distribution.
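Equations (21) and (22) are not reproduced in this text, so the following Python sketch only illustrates the weighting idea and is not the patent's formula. It assumes (hypothetically) that |x₁(f, τ)|^b is normalized by its maximum over the whole recording, and that the weights a(f, τ) simply replace the per-bin counts of Example 2, so that a low-amplitude frame yields low direction probabilities. The function names and the `labels` representation are illustrative.

```python
def normalized_amplitudes(x1, b=2):
    """x1: list of frames, each a list of complex STFT values of microphone 1.

    ASSUMPTION: |x|^b is normalized by its maximum over the whole recording.
    The patent's equation (21) may normalize differently.
    """
    powered = [[abs(v) ** b for v in frame] for frame in x1]
    peak = max((p for frame in powered for p in frame), default=0.0)
    if peak == 0:
        return [[0.0 for _ in frame] for frame in powered]
    return [[p / peak for p in frame] for frame in powered]


def weighted_direction_probability(labels, weights, num_clusters):
    """Amplitude-weighted direction probability for one frame.

    labels[f]  : cluster index assigned to frequency bin f, or None
    weights[f] : normalized amplitude a(f, tau) of bin f
    ASSUMPTION: each bin contributes its weight instead of a count of 1,
    and the sum is divided by the number of bins in the frame.
    """
    probs = [0.0] * num_clusters
    for label, w in zip(labels, weights):
        if label is not None:
            probs[label] += w
    n = len(labels)
    return [p / n for p in probs] if n else probs


# Two frames with two bins each: a loud frame and a quiet (noise-like) frame.
a = normalized_amplitudes([[4 + 0j, 2 + 0j], [1 + 0j, 0j]], b=1)
loud = weighted_direction_probability([0, 1], a[0], 2)
quiet = weighted_direction_probability([0, 1], a[1], 2)
```

With these hypothetical inputs, the quiet frame receives much smaller direction probabilities than the loud frame even though both have the same cluster labels, which is the behavior the weighting is meant to produce.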

[Simulation results]
A simulation was carried out to confirm the performance of the multiple signal section estimation device of Example 2. The simulation conditions are briefly as follows. FIG. 7 shows a plan view of the room used for the simulation. One side of a deep room, along its width, was closed off with a 305 cm wide partition to form a room about 4 m wide and about 9.3 m deep. The reverberation time of this room is about 350 ms. A personal computer (PC) stood in one corner by the partition, and its fan noise acted as noise for this system. An oval table was placed close to the partition side. Speakers A and B sat on the partition side of the table, and speakers C and D sat on the opposite side. Three microphones were placed near the center of the four speakers, at the vertices of an equilateral triangle with 4 cm sides.

A five-minute meeting of the four speakers A to D was processed with a sampling frequency of 16 kHz, a Fourier transform frame length of 64 ms, and a frame shift of 32 ms. The speaker in the direction of cluster k was judged to have spoken whenever the utterance probability P_k(τ) of equation (16) above was 0.4 or more. The evaluation metric was the diarization error rate, DER = (duration of false alarms, misses, and speaker errors) / (total speech section length) × 100 [%].

Here, false alarm speaker time (FAT) is the length of time during which someone was judged to be speaking although no one was. Missed speaker time (MST) is the length of time during which no one was judged to be speaking although someone was. Speaker error time (SET) is the length of time during which the wrong speaker was identified. A smaller DER means more accurate speaker section estimation. All values are expressed in [%]. Table 1 shows the results.
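The DER definition above can be written as a small Python helper. This is a minimal sketch that assumes the three error durations are summed, as in the usual diarization error rate. The function name and the numbers in the usage line are hypothetical and are not results from Table 1.

```python
def diarization_error_rate(false_alarm_s, missed_s, speaker_error_s, total_speech_s):
    """Return DER in percent.

    false_alarm_s   : false alarm speaker time (FAT), seconds
    missed_s        : missed speaker time (MST), seconds
    speaker_error_s : speaker error time (SET), seconds
    total_speech_s  : total length of the speech sections, seconds
    """
    if total_speech_s <= 0:
        raise ValueError("total_speech_s must be positive")
    return (false_alarm_s + missed_s + speaker_error_s) / total_speech_s * 100.0


# Hypothetical durations, for illustration only.
example_der = diarization_error_rate(1.0, 2.0, 1.0, 40.0)
```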

With the conventional method, every type of error was frequent and the DER was large. With the method of Example 2, missed speaker time (MST) in particular improved greatly, and the DER improved as a result. This is because the speech section and the speech arrival direction are handled as probability values in each frame, and because estimating multiple directions in each frame makes speech sections less likely to be dropped.
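The frame-wise decision used in the simulation can be sketched in Python as follows. The sketch assumes, as stated in the abstract and claims, that the utterance probability of source k is the product of the per-frame speech presence probability and the arrival direction probability, and it applies the 0.4 threshold from the simulation. The function names, the data layout, and the merging of consecutive frames into sections are illustrative assumptions.

```python
THRESHOLD = 0.4  # decision threshold on the utterance probability used in the simulation


def active_sources(p_v, p_k):
    """Indices k whose utterance probability p_v * p_k[k] reaches THRESHOLD.

    p_v : speech presence probability of the frame
    p_k : list of arrival direction probabilities, one per source
    Several sources may be active in the same frame.
    """
    return [k for k, pk in enumerate(p_k) if p_v * pk >= THRESHOLD]


def sections(p_v_frames, p_k_frames, num_sources):
    """Merge consecutive active frames into (start_frame, end_frame) pairs.

    end_frame is exclusive. Multiply frame indices by the frame shift
    (32 ms in the simulation) to obtain approximate times.
    """
    result = {k: [] for k in range(num_sources)}
    open_start = {}
    n = len(p_v_frames)
    for t in range(n + 1):
        # One extra iteration with no active source closes open sections.
        active = set(active_sources(p_v_frames[t], p_k_frames[t])) if t < n else set()
        for k in range(num_sources):
            if k in active and k not in open_start:
                open_start[k] = t
            elif k not in active and k in open_start:
                result[k].append((open_start.pop(k), t))
    return result


# Hypothetical probabilities for four frames and two sources.
p_v = [0.9, 0.9, 0.1, 0.8]
p_k = [[0.5, 0.5], [0.9, 0.1], [0.5, 0.5], [0.2, 0.6]]
detected = sections(p_v, p_k, 2)
```

In frame 0 of this hypothetical input both sources exceed the threshold, so both are reported as active in the same frame.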

As described above, the multiple signal section estimation device of the present invention can estimate the sections of multiple signals while missing fewer speech signals. The multiple signal section estimation device and method based on the technical idea of the present invention are not limited to the embodiments described above and can be modified as appropriate without departing from the spirit of the invention. The processes described for the above device and method need not be executed only in time series in the order described. They may also be executed in parallel or individually, depending on the processing capability of the device executing them or as needed.
For example, one of the speech presence probability p_v(τ) and the arrival direction probability p_k(τ) may be calculated deterministically in order to reduce the computational load. Even when one of them is calculated deterministically, direction probabilities for multiple directions are still calculated in a frame judged to contain speech if multiple sound sources are present, so multiple sound sources within that frame are missed less often than with the conventional method.

The arrival direction probability may also be set to p_k(τ) = 1 if th_2 or more horizontal angles θ(f, τ) satisfying equation (17) exist among the frequencies, and to p_k(τ) = 0 otherwise. If the direction θ_k of each sound source is known in advance, the centroid c_k in the clustering unit may be given as that angle, c_k = θ_k. Instead of the horizontal angle θ(f, τ) for each frame and frequency (f, τ), a single horizontal angle θ(τ) obtained for each frame τ, as in the conventional GCC-PHAT method, may be clustered online and the resulting centroids used as c_k. Finally, although the description used an example in which the arrival directions of the sound sources are classified using the horizontal angle θ(f, τ), the classification may instead use the sound source direction vector q(f, τ) itself.
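The binary variant of the arrival direction probability can be sketched in Python as follows. Equation (17) is not reproduced in this section, so the sketch assumes (hypothetically) that it tests whether a horizontal angle lies within a tolerance of the centroid c_k. The tolerance value, the use of degrees, and the function names are illustrative assumptions.

```python
def angular_distance_deg(a, b):
    """Smallest absolute difference between two angles in degrees."""
    return abs((a - b + 180.0) % 360.0 - 180.0)


def binary_direction_probability(angles_deg, centroid_deg, tolerance_deg, th2):
    """Return 1 if at least th2 frequency bins point toward the centroid, else 0.

    angles_deg   : horizontal angles theta(f, tau) of one frame, one per bin
    centroid_deg : centroid c_k of source k (may be a known direction theta_k)
    ASSUMPTION: "satisfying equation (17)" is modeled as the angle lying
    within tolerance_deg of the centroid.
    """
    count = sum(
        1 for a in angles_deg
        if angular_distance_deg(a, centroid_deg) < tolerance_deg
    )
    return 1 if count >= th2 else 0


# Hypothetical frame: three bins point near 10 degrees, one near 100 degrees.
frame_angles = [10.0, 12.0, 100.0, 11.0]
p_near_10 = binary_direction_probability(frame_angles, 10.0, 5.0, 3)
p_near_100 = binary_direction_probability(frame_angles, 100.0, 5.0, 3)
```

Because the decision is a count against a threshold, this variant avoids estimating a distribution per cluster, which matches the stated goal of reducing the computation.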

When the processing means of the above device are implemented by a computer, the processing contents of the functions that each device should have are described by a program. Executing this program on the computer realizes the processing means of each device on the computer.
The program describing these processing contents can be recorded on a computer-readable recording medium. Any computer-readable recording medium may be used, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, a hard disk device, a flexible disk, a magnetic tape, or the like can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), CD-R (Recordable)/RW (ReWritable), or the like as the optical disc; an MO (Magneto Optical disc) or the like as the magneto-optical recording medium; and a flash memory or the like as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a recording device of a server computer and transferring it from the server computer to other computers via a network.
Each means may be configured by executing a predetermined program on a computer, or at least part of the processing contents may be realized in hardware.

A diagram showing an example functional configuration of the multiple signal section estimation device 100 of the present invention.
A diagram showing the operation flow of the multiple signal section estimation device 100.
A diagram showing the functional blocks of the speech presence probability estimation unit 12.
A diagram showing the functional blocks of the arrival direction estimation unit 13 and the arrival direction probability calculation unit 14.
A diagram showing the operation flow of the arrival direction probability calculation unit 14.
A diagram showing the operation flow of the clustering unit 140.
A diagram showing the operation flow of the arrival direction probability calculation unit 14′ of Example 2.
A diagram showing the operation flow of the arrival direction probability calculation unit 60 of Example 3.
A diagram showing the floor plan of the room in which the simulation was performed.
A diagram showing the functional configuration of the conventional speech section estimation device 200 disclosed in Non-Patent Document 1.

Claims

A multiple signal section estimation device comprising:
a frequency domain conversion unit that converts speech signals from a plurality of sound sources, recorded by a plurality of microphones, into frequency domain signals for each frame;
a speech presence probability estimation unit that estimates, for each frame, the presence probability of speech from the sound sources;
an arrival direction estimation unit that estimates, for each frame, the speech arrival direction of each frequency component;
an arrival direction probability calculation unit that classifies the speech arrival directions to obtain a distribution of speech arrival directions for each sound source, and calculates a speech arrival direction probability for each sound source; and
a multiplication unit that calculates the product of the speech presence probability and the speech arrival direction probability and outputs it as the presence probability of each sound source in each frame.

The multiple signal section estimation device according to claim 1, wherein
the arrival direction probability calculation unit includes a clustering unit, a per-cluster distribution calculation unit, and a probability calculation unit,
the clustering unit classifies the sound source direction vectors at each frequency calculated for each frame,
the per-cluster distribution calculation unit calculates the distribution of the sound source direction vectors of each cluster, and
the probability calculation unit normalizes the distribution of the sound source direction vectors of each cluster by the overall distribution of the sound source direction vectors and outputs the result as the speech arrival direction probability.

The multiple signal section estimation device according to claim 1, wherein
the arrival direction probability calculation unit includes a clustering unit and a probability calculation unit,
the clustering unit classifies the sound source direction vectors at each frequency calculated for each frame according to the distance of the sound source direction vector and a threshold value, and
the probability calculation unit outputs, as the speech arrival direction probability, the value obtained by dividing the occurrence frequency of each cluster by the total occurrence frequency of the sound source direction vectors.

The multiple signal section estimation device according to any one of claims 1 to 3, wherein
the arrival direction probability calculation unit includes an amplitude calculation unit that calculates a normalized amplitude value for each frame and each frequency, and
the normalized amplitude value is used as a weighting factor when calculating the speech arrival direction probability.

A multiple signal section estimation method including:
a frequency domain conversion process in which a frequency domain conversion unit converts speech signals from a plurality of sound sources, recorded by a plurality of microphones, into frequency domain signals for each frame;
a speech presence probability estimation process in which a speech presence probability estimation unit estimates, for each frame, the presence probability of speech from the sound sources;
an arrival direction estimation process in which an arrival direction estimation unit estimates, for each frame, the speech arrival direction of each frequency component;
an arrival direction probability calculation process in which an arrival direction probability calculation unit classifies the speech arrival directions to obtain a distribution of speech arrival directions for each sound source, and calculates a speech arrival direction probability for each sound source; and
a multiplication process in which a multiplication unit calculates the product of the speech presence probability and the speech arrival direction probability and outputs it as the presence probability of each sound source in each frame.

The multiple signal section estimation method according to claim 5, wherein the arrival direction probability calculation process includes:
a clustering step in which a clustering unit classifies the sound source direction vectors at each frequency calculated for each frame;
a distribution calculation step in which a per-cluster distribution calculation unit calculates the distribution of the sound source direction vectors of each cluster; and
a probability calculation step in which a probability calculation unit normalizes the distribution of the sound source direction vectors of each cluster by the overall distribution of the sound source direction vectors and calculates the result as the speech arrival direction probability.

The multiple signal section estimation method according to claim 5, wherein the arrival direction probability calculation process includes:
a clustering step in which a clustering unit classifies the sound source direction vectors at each frequency calculated for each frame; and
a probability calculation step in which a probability calculation unit calculates, as the speech arrival direction probability, the value obtained by dividing the occurrence frequency of each cluster by the total occurrence frequency of the sound source direction vectors.

The multiple signal section estimation method according to any one of claims 5 to 7, wherein
the arrival direction probability calculation process includes an amplitude calculation step in which an amplitude calculation unit calculates a normalized amplitude value for each frame and each frequency, and
the speech arrival direction probability is calculated using the normalized amplitude value as a weighting factor.

An apparatus program for causing a computer to function as the multiple signal section estimation apparatus according to any one of claims 1 to 4.

A computer-readable recording medium on which the apparatus program according to claim 9 is recorded.