JP4964204B2

JP4964204B2 - Multiple signal section estimation device, multiple signal section estimation method, program thereof, and recording medium

Info

Publication number: JP4964204B2
Application number: JP2008218677A
Authority: JP
Inventors: 章子荒木; 健太郎石塚; 雅清藤本; 智広中谷; 昭二牧野
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2008-08-27
Filing date: 2008-08-27
Publication date: 2012-06-27
Anticipated expiration: 2028-08-27
Also published as: JP2010054733A

Description

The present invention belongs to the technical field of signal processing. In particular, it relates to a multiple signal section estimation device, a multiple signal section estimation method, a program thereof, and a recording medium that, for acoustic data in which the speech signals of a plurality of persons are mixed, estimate the sections in which each person's speech signal is emitted.

Speech section detection technology that records a conversation among several people with multiple microphones and estimates "who spoke when" is useful, for example, in the automatic creation of meeting minutes, where it can automatically attach the speaker to each utterance, or add speaker information to recorded meeting data so that the recording can be searched and cued more easily.

Conventional speech section detection techniques include, for example, the methods disclosed in Patent Document 1 and Non-Patent Document 1. FIG. 11 shows an example of the functional configuration of a conventional multiple signal section estimation device 100, and FIG. 12 shows an example of its processing flow. The multiple signal section estimation device 100 comprises a frequency domain transform unit 110, a speech section estimation unit 120, an arrival direction estimation unit 130, and an arrival direction classification unit 140.

The frequency domain transform unit 110 cuts the time-domain observation signals x_j(t) (j = 1, ..., M), recorded by M microphones, with a window function, for example every 32 ms (each cut-out section is hereinafter called a "frame"), and converts each frame (with index τ) into a frequency-domain observation signal x_j(f,τ) (f = 1, ..., L) by a Fourier transform or the like (S1).
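Step S1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `to_frequency_domain`, the Hann window, and the non-overlapping hop equal to the frame length are all assumptions, since the text only specifies a window function and a 32 ms example.

```python
import numpy as np

def to_frequency_domain(x, fs, frame_ms=32.0):
    """Cut a time-domain signal x(t) into windowed frames and transform each one.

    Returns a complex array of shape (num_frames, L); row tau holds x(f, tau).
    """
    frame_len = int(round(fs * frame_ms / 1000.0))  # samples per frame
    hop = frame_len                                  # assumption: no overlap
    window = np.hanning(frame_len)                   # assumption: Hann window
    num_frames = max(0, 1 + (len(x) - frame_len) // hop)
    spectra = np.empty((num_frames, frame_len // 2 + 1), dtype=complex)
    for tau in range(num_frames):
        frame = x[tau * hop : tau * hop + frame_len] * window
        spectra[tau] = np.fft.rfft(frame)            # one-sided spectrum of frame tau
    return spectra
```

For M microphones the same function would be applied to each x_j(t) separately, giving one (num_frames, L) array per microphone.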

The speech section estimation unit 120 estimates whether speech is present in each frame of the observation signal converted into the frequency domain by the frequency domain transform unit 110, by calculating a speech presence probability (S2). The methods described in Non-Patent Document 2 and Non-Patent Document 3, for example, can be used for this calculation. With the former, the speech presence probability p_V(τ) of the frame is obtained by the following equation.

[Equation (1)]

Here, λ_N(f) is the average noise power at frequency f (obtained, for example, from the opening section of the recording, where speech is clearly absent), and x(f,τ) is the frequency-domain observation signal of any one microphone arbitrarily chosen from the frequency-domain observation signals x₁(f,τ) to x_M(f,τ) of the M microphones. Alternatively, x(f,τ) may be obtained as the average of the amplitudes of all microphones, as follows.

[Equation (2)]

The speech section estimation unit 120 may output the speech presence probability p_V(τ) obtained by equation (1) as it is. Alternatively, it may compare p_V(τ) with a threshold, judge the frame to be a speech section P_S if p_V(τ) is larger and a non-speech (noise) section P_N if it is smaller, and output that result.
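Step S2 can be illustrated with the sketch below. Because equation (1) is not reproduced in this text, the statistic used here is an assumption: it is a frame-averaged likelihood-ratio score in the style of the statistical model-based detector of Non-Patent Document 2, built from the ratio of observed power to the noise power λ_N(f). It is a score rather than a calibrated probability, and the function names are hypothetical.

```python
import numpy as np

def speech_presence_score(X, noise_power, eps=1e-12):
    """Frame-level speech score from spectra X (num_frames, L) and lambda_N(f) (L,).

    Assumed form: mean over f of (gamma - log(gamma) - 1), where
    gamma = |x(f, tau)|^2 / lambda_N(f). The score is 0 when the frame power
    equals the noise power and grows as the frame departs from the noise level.
    """
    gamma = (np.abs(X) ** 2 + eps) / (noise_power + eps)
    return np.mean(gamma - np.log(gamma) - 1.0, axis=1)

def mean_amplitude(X_all):
    """Average amplitude over microphones; X_all has shape (M, num_frames, L)."""
    return np.mean(np.abs(X_all), axis=0)

def classify_frames(score, threshold):
    """True where a frame is judged speech (P_S), False where noise (P_N)."""
    return score > threshold
```

The threshold has to be tuned for the recording conditions; the text does not give a value.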

The arrival direction estimation unit 130 estimates the arrival direction of the observation signal converted into the frequency domain by the frequency domain transform unit 110, either for each frame or for each frequency component of each frame (S3). Specifically, the arrival time difference q´_jj′ of the observation signal between microphone j and microphone j′ is obtained for all microphone pairs, and the speech arrival direction vector is estimated from the column vector formed by stacking these differences and from the coordinates of the microphones.

One technique for calculating the arrival time difference q´_jj′ for each frame is GCC-PHAT, disclosed in Non-Patent Document 4. In this technique, the arrival time difference q´_jj′(τ) is calculated according to the following equation.

[Equation (3)]

This is obtained for all microphone pairs jj′, and the column vector formed by stacking the results is denoted vq´(τ). Instead of using all microphone pairs, a reference microphone may be chosen and only the pairs formed by the reference microphone and each of the other microphones used. The speech arrival direction vector vq(τ) is estimated from vq´(τ), the speed of sound c, and the microphone coordinate matrix VD by the following equation.
vq(τ) = c・VD⁺・vq´(τ) (4)
Here, ⁺ denotes the Moore-Penrose pseudo-inverse, and, where vd_j is the vector [x, y, z] of the coordinates of microphone j, VD = [vd₁ − vd_j, ..., vd_M − vd_j]^T. With θ the horizontal angle and φ the elevation angle of the arrival direction, the speech arrival direction vector vq(τ) obtained in this way can be expressed as follows.
vq(τ) = [cosθ・cosφ, sinθ・cosφ, sinφ]^T (5)
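The per-frame estimate can be sketched as below. This is a minimal sketch under stated assumptions, not the patent's exact formulation: `gcc_phat_delay` is a textbook GCC-PHAT peak search (the patent's own equation for it is not reproduced in this text), and `doa_vector` applies equation (4) with the reference-microphone variant. The function names, the sign convention for the delays, and the default speed of sound of 340 m/s are assumptions.

```python
import numpy as np

def gcc_phat_delay(Xj, Xk, fs, n_fft, max_delay_s):
    """Delay (s) of microphone j relative to microphone k for one frame.

    Xj, Xk are one-sided spectra of the same frame. Positive result: the sound
    reaches microphone j later than microphone k.
    """
    cross = Xj * np.conj(Xk)
    cross = cross / (np.abs(cross) + 1e-12)          # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n_fft)                # generalized cross-correlation
    max_shift = max(1, min(int(round(max_delay_s * fs)), n_fft // 2 - 1))
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))  # lags -max..+max
    return (int(np.argmax(cc)) - max_shift) / fs

def doa_vector(delays, mic_positions, ref=0, c=340.0):
    """Equation (4): vq = c * pinv(VD) @ vq'.

    Assumed convention: delays[i] = (arrival time at the reference microphone)
    - (arrival time at the i-th other microphone); with the opposite convention
    the result changes sign.
    """
    mic_positions = np.asarray(mic_positions, dtype=float)
    others = [i for i in range(len(mic_positions)) if i != ref]
    VD = mic_positions[others] - mic_positions[ref]  # rows: vd_i - vd_ref
    return c * np.linalg.pinv(VD) @ np.asarray(delays, dtype=float)

def direction_angles(v):
    """Horizontal angle theta and elevation phi (radians) as in equation (5)."""
    v = v / np.linalg.norm(v)
    return np.arctan2(v[1], v[0]), np.arcsin(np.clip(v[2], -1.0, 1.0))
```

With the convention above, the value passed to `doa_vector` for microphone i would be `gcc_phat_delay(X_ref, X_i, ...)`.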

When the arrival time difference q´_jj′ is calculated for each frequency component of each frame, the arrival time difference q´_jj′(f,τ) between microphone j and microphone j′ is calculated according to the following equation.

[Equation (6)]

This is obtained for all microphone pairs jj′ (or with respect to a reference microphone, as described above), the column vector formed by stacking the results is denoted vq´(f,τ), and the speech arrival direction vector vq(f,τ) is estimated in the same manner as in equation (4).
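A per-frequency time difference can be illustrated with the phase of the cross-spectrum. The patent's own equation for this step is not reproduced in this text, so the formula below is an assumption: the standard phase-difference estimate, valid only while the phase does not wrap (no spatial aliasing). The function name is hypothetical.

```python
import numpy as np

def tdoa_per_frequency(Xj, Xk, fs, n_fft):
    """Delay (s) of microphone j relative to microphone k at each frequency bin.

    Assumed form: q'(f) = -arg(x_j(f) * conj(x_k(f))) / (2 * pi * f).
    Bin 0 (DC) carries no delay information and is returned as 0.
    The estimate is only meaningful where |2 * pi * f * delay| < pi.
    """
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)       # bin centre frequencies in Hz
    phase = np.angle(Xj * np.conj(Xk))
    q = np.zeros(len(freqs))
    q[1:] = -phase[1:] / (2.0 * np.pi * freqs[1:])
    return q
```

Stacking these values over the microphone pairs gives vq´(f,τ), which can then be mapped to a direction vector in the same way as the per-frame case.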

The processing of the speech section estimation unit 120 and that of the arrival direction estimation unit 130 may be performed in parallel. Alternatively, the speech sections may first be estimated by the speech section estimation unit 120, and the arrival direction estimation unit 130 may then process only the frames belonging to those speech sections.

The arrival direction classification unit 140 clusters the frames belonging to the speech section P_S so that frames with similar speech arrival directions (vectors vq(τ) or vq(f,τ)) form the speaker sections P_k (k = 1, ..., N), and outputs, for every cluster, the pair of the cluster index k and the indices τ of all frames belonging to that cluster (S4).

As the clustering method, the well-known k-means method or hierarchical clustering may be used, or online clustering may be used (see Non-Patent Document 5). Each cluster C_k obtained by this clustering corresponds to a speaker k located in the angular direction indicated by the centroid computed from the cluster members (vectors vq(τ) or vq(f,τ)), and the frames τ corresponding to those cluster members constitute the speaker section P_k of speaker k.
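Step S4 with the k-means option can be sketched as follows. This is a plain k-means written for illustration; the function names, the random initialisation, and the fixed number of speakers `k` passed in by the caller are assumptions, and an online or hierarchical method would replace `kmeans` here.

```python
import numpy as np

def kmeans(points, k, num_iters=100, seed=0):
    """Cluster direction vectors (num_frames, 3) into k groups.

    Returns (labels, centroids); centroid c indicates the direction of speaker c.
    """
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(num_iters):
        dist = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dist, axis=1)
        updated = np.array([points[labels == c].mean(axis=0) if np.any(labels == c)
                            else centroids[c] for c in range(k)])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    # Final assignment against the final centroids.
    dist = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return np.argmin(dist, axis=1), centroids

def speaker_sections(frame_indices, labels):
    """Map each cluster index k to the frame indices tau that belong to it."""
    sections = {}
    for tau, k in zip(frame_indices, labels):
        sections.setdefault(int(k), []).append(int(tau))
    return sections
```

Only the frames judged to be speech would be passed in, so `frame_indices` carries the original τ of each row of `points`.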

In the above description, the arrival direction estimation unit 130 estimates the arrival time difference vector vq´(τ) or vq´(f,τ) between the microphones and then further estimates the speech arrival direction vector vq(τ) or vq(f,τ), but it may simply estimate the arrival time difference vector only. In that case, as shown in FIG. 13, the arrival direction estimation unit 130 is configured as an arrival time difference estimation unit 131, and the arrival direction classification unit 140 is configured as an arrival time difference classification unit 141 that classifies vq´(τ) or vq´(f,τ) instead of vq(τ) or vq(f,τ).
Patent Document 1: Japanese Unexamined Patent Application Publication (Translation of PCT Application) No. 2000-512108
Non-Patent Document 1: S. Araki, M. Fujimoto, K. Ishizuka, H. Sawada and S. Makino, "Speaker indexing and speech enhancement in real meetings/conversations," IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-2008), 2008, pp. 93-96
Non-Patent Document 2: J. Sohn, N. S. Kim and W. Sung, "A Statistical Model-Based Voice Activity Detection," IEEE Signal Processing Letters, 1999, vol. 6, no. 1, pp. 1-3
Non-Patent Document 3: Fujimoto, Ishizuka and Nakatani, "Study and consideration of adaptive integration of multiple voice activity detection methods," IEICE Technical Committee on Speech, 2007, SP2007-97, pp. 7-12
Non-Patent Document 4: C. H. Knapp and G. C. Carter, "The generalized correlation method for estimation of time delay," IEEE Trans. Acoust. Speech and Signal Processing, 1976, vol. 24, no. 4, pp. 320-327
Non-Patent Document 5: R. O. Duda, P. E. Hart and D. G. Stork, "Pattern Classification," 2nd edition, Wiley Interscience, 2000

In the prior art, speakers are distinguished solely by the arrival direction of their speech. Consequently, when a speaker who was at one position moves to another position, that speaker may be identified as a new speaker despite being the same person, or a new speaker may be misidentified as a different speaker who previously occupied that position.
An object of the present invention is to provide a multiple signal section estimation device, a multiple signal section estimation method, a program thereof, and a recording medium that can assign the same index to the same speaker before and after a movement, even if a speaker's position changes during recording.

The multiple signal section estimation device of the present invention estimates the utterance sections of each speaker from observation signals that are recorded by a plurality of microphones and contain speech uttered by a plurality of speakers. It comprises a frequency domain transform unit, a speech section estimation unit, an arrival direction estimation unit, an arrival direction classification unit, and a speaker identification unit.

The frequency domain transform unit sequentially cuts the observation signal into frames of a predetermined length and converts each frame into the frequency domain.
The speech section estimation unit estimates, on the basis of the observation signal converted into the frequency domain, whether each frame belongs to a speech section.
The arrival direction estimation unit estimates, on the basis of the observation signal converted into the frequency domain, the arrival direction of the observation signal for each frame.

The arrival direction classification unit classifies each frame estimated to belong to a speech section into a cluster for each speaker on the basis of the similarity of the arrival directions.
The speaker identification unit then creates, for each cluster, a model of the speaker associated with that cluster on the basis of the frequency-domain observation signals of the frames classified into the same cluster up to a predetermined time, and estimates the speaker of the observation signal after the predetermined time on the basis of the speaker models.

According to the multiple signal section estimation device of the present invention, when the speaker sections of a plurality of speakers are estimated, the identity of each speaker can be determined in addition to estimating and classifying the arrival direction of the speech. Therefore, even if a speaker's position changes during recording, the same index can be assigned to the same speaker before and after the movement.

[First Embodiment]
FIG. 1 (solid-line portion) shows an example of the functional configuration of the multiple signal section estimation device 200 of the present invention, and FIG. 2 (solid-line portion) shows an example of its processing flow. The multiple signal section estimation device 200 comprises the frequency domain transform unit 110, the speech section estimation unit 120, the arrival direction estimation unit 130, and the arrival direction classification unit 140 described in the background art, together with a speaker identification unit 250. The processing of the speaker identification unit 250 follows S4 of the flow shown in FIG. 11. The content already described as background art is therefore kept to a minimum here, and the description focuses on the processing in the speaker identification unit 250.
FIG. 3 (solid-line portion) shows an example of the functional configuration of the speaker identification unit 250. The speaker identification unit 250 comprises feature extraction means 251, model learning means 252, and likelihood calculation means 253.

In the processing of the speaker identification unit 250, it is assumed that no speaker changes position from the start of recording of the observation signal until a predetermined time t_train, and a model M_k of each speaker is created from the clusters formed during that period. After time t_train, speakers are assumed to be able to move. For every speech segment after time t_train (a run of consecutive frames classified into the same cluster), which of the speakers who spoke before time t_train uttered it is determined on the basis of the speaker models created from the initial part of the observation signal (from the start of recording until time t_train). Creating each speaker's model from the initial part of the observation signal in this way makes it possible to identify speakers after time t_train without preparing speaker models in advance. Note that t_train is set to a time at or after the point at which every speaker to be identified has spoken at least once.

The feature extraction means 251 calculates, for each frame, the speech feature vector vf(τ) of an observation signal x(f,τ) arbitrarily chosen from the frequency-domain observation signals x₁(f,τ) to x_M(f,τ) of the M microphones (S5). For example, 12-dimensional MFCCs (Mel-Frequency Cepstrum Coefficients) can be used as the speech feature vector vf(τ). The fundamental frequency F0(τ), estimated by the autocorrelation method or the like, may also be used and included as one component of the speech feature vector vf(τ).

Among the frames classified by the arrival direction classification unit 140 into the same cluster C_k (k = 1, ..., N for N speakers), the model learning means 252 uses the speech feature vectors vf(τ) of the frames from the start of recording of the observation signal until the predetermined time t_train to create and output the model of speaker k, that is, the model parameters φ_k. It also outputs the pairs of the index τ of each frame up to the predetermined time t_train and the index k of the speaker associated with the cluster to which that frame belongs (S6). When frames of the same speaker are consecutive, the start and end times of the run of frames may be output instead of the index of each frame.

A Gaussian mixture model (GMM) is used here as an example of the speaker model, but other speaker identification or speaker recognition methods (hidden Markov models, vector quantization, and so on) may be used. Letting M_g be the number of Gaussians in the GMM and writing the model parameters of model M_k as φ_k = (means μ_k,m, covariance matrices Σ_k,m, Gaussian weights w_k,m), the GMM can be expressed by the following equation.

[Equation]

Here, p_k,m(vf(τ)) denotes the m-th multidimensional Gaussian of speaker k (its dimensionality d is the same as that of the speech feature vector). M_g is set to 10, for example. The model parameters φ_k can be calculated, using the EM algorithm or the like, as the value of φ_k that maximizes the log likelihood L given by the following equation over all frames belonging to cluster C_k up to the predetermined time t_train.

[Equation (10)]

The EM algorithm is a well-known technique described in, for example, Wang et al., "Computational Statistics I: New Techniques for Probability Calculation," Iwanami Shoten, 2003, pp. 158-162.
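The log likelihood of a set of frames under one speaker's GMM can be sketched as below. The equations themselves are not reproduced in this text, so this uses the standard GMM density, a weighted sum of M_g multivariate Gaussians, summed in the log domain over frames; that this matches the patent's equations term by term is an assumption. Fitting φ_k with EM is omitted, and the function names are hypothetical.

```python
import numpy as np

def gmm_frame_log_likelihood(vf, weights, means, covs):
    """log p(vf(tau) | phi_k) for each frame.

    vf: (T, d) feature vectors; weights: (Mg,); means: (Mg, d); covs: (Mg, d, d).
    """
    T, d = vf.shape
    log_terms = np.empty((T, len(weights)))
    for m, (w, mu, cov) in enumerate(zip(weights, means, covs)):
        diff = vf - mu
        maha = np.sum(diff * np.linalg.solve(cov, diff.T).T, axis=1)  # Mahalanobis
        _, logdet = np.linalg.slogdet(cov)
        log_terms[:, m] = np.log(w) - 0.5 * (d * np.log(2 * np.pi) + logdet + maha)
    peak = log_terms.max(axis=1, keepdims=True)      # log-sum-exp for stability
    return (peak + np.log(np.sum(np.exp(log_terms - peak), axis=1, keepdims=True)))[:, 0]

def gmm_log_likelihood(vf, weights, means, covs):
    """Log likelihood L of all frames: the sum of the per-frame log likelihoods."""
    return float(np.sum(gmm_frame_log_likelihood(vf, weights, means, covs)))
```

An EM routine would repeatedly re-estimate `weights`, `means`, and `covs` so that `gmm_log_likelihood` on the training frames of cluster C_k does not decrease.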

To improve the estimation accuracy of the model parameters φ_k in the model learning means 252, it is desirable that the frames τ be contiguous. One way of handling frames that are not contiguous is described below. FIG. 4(a) is an example of the time series of the arrival direction of the observation signal. In this example, the arrival direction changes in the order θ₁ → θ₂ → θ₃ → θ₂ → θ₁ between the start of recording and time t_train, that is, the speakers speak in the order speaker 1 → speaker 2 → speaker 3 → speaker 2 → speaker 1. Speaker 3 speaks three times in total, separated by short gaps. When the gaps are short like this (for example, 300 ms or less), it is desirable to learn the model treating the speech section as continuous, as shown in FIG. 4(b). For speakers 1 and 2, the first and second utterances are widely separated. In such a case, the model is learned treating the first and second utterances as joined into one, as shown in FIG. 4(b). Note that the indices τ output by the model learning means 252 are those before joining.
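The short-gap handling can be sketched as an interval merge. This is an illustrative helper only: the function name and the representation of a speaker's utterances as (start, end) times in seconds are assumptions, while the 300 ms default follows the example in the text.

```python
def merge_short_gaps(segments, max_gap_s=0.3):
    """Join one speaker's utterances that are separated by at most max_gap_s.

    segments: list of (start_s, end_s) tuples for a single speaker.
    Returns a new sorted list in which short gaps have been closed.
    """
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] <= max_gap_s:
            # Gap is short: extend the previous section instead of opening a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

For widely separated utterances of the same speaker, the training step would simply pool the feature vectors of all of that speaker's sections, while the reported frame indices stay as they were before joining.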

For the speech features of each run of contiguous frames classified into the same cluster after the predetermined time t_train (hereinafter a "segment"), the likelihood calculation means 253 calculates the likelihood under every speaker model created by the model learning means 252, and outputs the index k of the speaker whose model gives the maximum likelihood together with the indices τ of all frames contained in the segment (S7). When frames of the same speaker are consecutive, the start and end times of the run of frames may be output instead of the index of each frame.

When a GMM is used as the speaker model, the speech feature vectors vf(τ) of all frames τ contained in the segment are substituted into each speaker's model, the log likelihood is calculated by equation (10), and the index k of the model giving the largest log likelihood is assigned to the segment as its speaker index. Speaker identification need not be performed segment by segment; it may also be performed frame by frame. In that case, the log likelihood is calculated by equation (10) with the summation Σ removed.
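The maximum-likelihood assignment of step S7 can be sketched as follows. To keep the example self-contained, each speaker model is a single diagonal Gaussian given as (mean, variance); this stand-in, like the function names, is an assumption, and a GMM scorer with the same per-frame interface would be passed as `frame_ll` in its place.

```python
import numpy as np

def gaussian_frame_log_likelihood(vf, model):
    """Stand-in speaker model: log density of one diagonal Gaussian per frame."""
    mean, var = model
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (vf - mean) ** 2 / var, axis=1)

def identify_segment(vf_segment, speaker_models, frame_ll=gaussian_frame_log_likelihood):
    """Segment-level identification: sum the log likelihood over all frames."""
    scores = {k: float(np.sum(frame_ll(vf_segment, m)))
              for k, m in speaker_models.items()}
    return max(scores, key=scores.get), scores

def identify_frames(vf_segment, speaker_models, frame_ll=gaussian_frame_log_likelihood):
    """Frame-level identification: the same score without the sum over frames."""
    keys = list(speaker_models)
    per_frame = np.stack([frame_ll(vf_segment, speaker_models[k]) for k in keys])
    return [keys[i] for i in np.argmax(per_frame, axis=0)]
```

Segment-level scoring pools evidence over many frames, so it is usually less sensitive to a single atypical frame than the frame-level variant.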

As described above, in the present invention, when the speaker sections of a plurality of speakers are estimated, speakers are identified in addition to estimating and classifying the arrival direction of the speech. Therefore, even if a speaker's position changes during recording, the same index can be assigned to the same speaker before and after the movement.

[Second Embodiment]
In the first embodiment, the feature extraction means 251 uses the frequency-domain observation signal x(f,τ) output from the frequency domain transform unit 110 as it is. In a real meeting, however, several speakers often speak at the same time. Each frame must be attributed to exactly one speaker, and the speech of the other speakers becomes a noise component. If the observation signal x(f,τ) of a frame τ with simultaneous speech is used as it is, the low SN ratio may prevent proper feature extraction and degrade the estimation accuracy of the speaker model. The second embodiment therefore presents a functional configuration and processing method for improving this SN ratio.

The difference from the first embodiment in functional configuration is that the dotted-line portion of FIG. 1, namely a speech enhancement unit 260, is added, and the difference in processing flow is that the processing of the dotted-line portion of FIG. 2 is added.

The speech enhancement unit 260 enhances the speech signal component of each speaker k. A known beamforming-type technique that uses the observation signals of a plurality of microphones (see, for example, Reference 1) may be used here, or a noise-removal-type technique that processes the observation signal of a single microphone (for example, a Wiener filter) may be used.
[Reference 1] S. Araki, H. Sawada and S. Makino, "Blind Speech Separation in a Meeting Situation with Maximum SNR Beamformers," Proc. of ICASSP 2007, 2007, vol. I, pp. 41-45

With the maximum-SNR beamformer of Reference 1, a frequency-domain signal y_k(f,τ), in which the speech signal component of the speaker k associated with the cluster C_k to which each frame τ belongs is enhanced, is generated (S8) from the observation signal vector vx(f,τ) = [x₁(f,τ), ..., x_M(f,τ)]^T formed from the frequency-domain observation signals of the M microphones output by the frequency domain transform unit 110 and from the information, supplied by the arrival direction classification unit 140, on the frames τ belonging to each cluster C_k. This signal is used in the processing of the feature extraction means 251 in place of x(f,τ).

Adding the processing of the speech enhancement unit 260 to the configuration of the first embodiment in this way improves the SN ratio of the speech signal component of each speaker k input to the feature extraction means 251, and thus improves the estimation accuracy of the speaker model.

[Third Embodiment]
In the above embodiments, the model parameters φ_k are obtained from the observation signal up to time t_train and are then applied, unchanged, to the speaker identification processing after time t_train. However, the acoustic environment in which a conversation is recorded usually changes over time, and the model parameters φ_k obtained in this way may cease to suit the environment.

The third embodiment is a configuration for avoiding such a situation, and FIG. 5 shows an example of its processing flow. After a speaker index k has been assigned in S7 to a segment after time t_train, the speech feature vectors vf(τ) of the frames belonging to that segment are fed back from the likelihood calculation means 253 to the model learning means 252, as shown by the dash-dot line in FIG. 3, and these speech feature vectors vf(τ) are used to recalculate φ_k by equation (10) and update the model parameters (S9). The update may be performed each time or at predetermined update intervals.

With this configuration, speakers can be identified with appropriate model parameters even if the acoustic environment in which the conversation is recorded changes over time.

[Fourth Embodiment]
In each of the above embodiments, the likelihood calculation means 253 identifies the speaker under the following rule: the speech feature vectors vf(τ) of all frames τ contained in the segment to be identified are substituted into each speaker's model M_k, the log likelihood is calculated, and the index k of the model giving the maximum log likelihood is taken as the speaker index of the segment. Under this rule, however, even when a newly joined speaker speaks, one of the models of the speakers present from the beginning necessarily gives the maximum log likelihood, so the utterance is attributed to the speaker of that model.

The fourth embodiment is a configuration for avoiding such a situation. FIG. 6 shows an example of its processing flow. For each segment after the predetermined time t_train, the likelihood calculation means 253 substitutes the speech feature vectors into each speaker's model and calculates the log likelihood (S7-1), and then judges whether the maximum log likelihood is smaller than a predetermined threshold. If it is larger than the threshold, the index k of the speaker whose model gives the maximum likelihood and the indices τ of all frames contained in the segment are output (S7-2). If it is smaller than the threshold, a new speaker is judged to have joined and a new speaker index is assigned to the segment. In addition, the speech feature vectors vf(τ) of the frames belonging to that segment are fed back from the likelihood calculation means 253 to the model learning means 252, as shown by the dash-dot line in FIG. 3, and these speech feature vectors vf(τ) are used to calculate φ_k by equation (10), which is added as the model parameters of the new speaker (S10).
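The threshold decision of the fourth embodiment can be sketched as below. The function name, the choice of the next unused integer as the new index, and the treatment of a score exactly equal to the threshold as a known speaker are assumptions; the text does not specify them. A practical threshold would also have to account for segment length, since a summed log likelihood scales with the number of frames.

```python
def assign_or_enroll(scores, threshold, known_speakers):
    """Return (speaker_index, is_new) for one segment.

    scores: {speaker index k: log likelihood of the segment under model M_k}.
    known_speakers: iterable of the speaker indices that already have a model.
    """
    best = max(scores, key=scores.get)
    if scores[best] >= threshold:
        return best, False                 # S7-2: an existing speaker explains the segment
    new_index = max(known_speakers) + 1    # assumption: next unused integer index
    return new_index, True                 # S10: caller trains a model for the new speaker
```

When `is_new` is true, the caller would train model parameters from the segment's feature vectors and add them to the set of speaker models before scoring later segments.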

With this configuration, a newly joined speaker is detected and a model of that speaker is generated, so that identification can subsequently be performed for that speaker as well.
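
The decision rule of the fourth embodiment can be sketched as follows. This is a minimal illustration, not the patented implementation: the diagonal Gaussian model, the function names, and the use of a per-frame average log-likelihood (so that the threshold does not depend on segment length) are all assumptions, and `fit_model` is only a hypothetical stand-in for equation (10).

```python
import numpy as np

def frame_log_likelihood(x, mean, var):
    """Per-frame log density of features x (n_frames, dim) under a diagonal Gaussian."""
    x = np.atleast_2d(x)
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var, axis=1)

def fit_model(features):
    """Hypothetical stand-in for equation (10): fit a diagonal Gaussian to the frames."""
    features = np.atleast_2d(features)
    return features.mean(axis=0), features.var(axis=0) + 1e-6

def identify_segment(features, models, threshold):
    """Return the speaker index k for one segment, adding a new model if needed.

    models: dict mapping speaker index k -> (mean, var); updated in place.
    threshold: compared with the best per-frame average log-likelihood (assumption).
    """
    scores = {k: frame_log_likelihood(features, m, v).mean()
              for k, (m, v) in models.items()}
    best = max(scores, key=scores.get) if scores else None
    if best is not None and scores[best] >= threshold:
        return best                      # corresponds to S7-2: known speaker
    new_k = max(models, default=0) + 1   # corresponds to S10: new speaker joined
    models[new_k] = fit_model(features)
    return new_k
```

Once a new model has been added, later segments from the same speaker score above the threshold against it and receive the same index, which is the behavior the fourth embodiment describes.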

[Fifth Embodiment]
In each of the above embodiments, the model parameters are obtained from the observation signal up to time t_train and are then used for speaker identification after time t_train. However, when speech from the speakers who are expected to talk can be obtained beforehand, a model of each speaker can be prepared in advance from that speech and used for speaker identification.
The fifth embodiment is a configuration for such a case and can be realized by configuring the speaker identification unit 250 as shown in FIG. 7, for example. The difference in functional configuration from the above embodiments is that the model learning means 252 in FIG. 3 is replaced with a speaker model DB 264 that stores the model parameters of the speakers prepared in advance.

With this configuration, the model parameters no longer need to be obtained by learning, so the likelihood calculation means 253 can identify speakers from the very start of recording. Furthermore, by storing each speaker's name in the DB in association with that speaker's model parameters, the speaker index k can carry the speaker's name in addition to the direction information.
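
As a rough illustration of how a speaker model DB could associate names with model parameters, the sketch below keys hypothetical entries by speaker index k. The record layout, the placeholder names, and the `label_segment` helper are assumptions made for illustration; the patent does not specify the storage format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SpeakerEntry:
    name: str            # speaker name stored alongside the model
    model_params: tuple  # pre-trained model parameters (format is an assumption)

# Hypothetical contents of the speaker model DB, keyed by speaker index k.
speaker_model_db = {
    1: SpeakerEntry(name="Speaker A", model_params=(0.0, 1.0)),
    2: SpeakerEntry(name="Speaker B", model_params=(5.0, 1.0)),
}

def label_segment(speaker_index, direction_deg, db):
    """Attach the stored name and the estimated direction to an identified segment."""
    entry = db[speaker_index]
    return {"speaker": speaker_index, "name": entry.name, "direction_deg": direction_deg}
```
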
When the configuration of the multiple signal section estimation device of each of the above embodiments is realized by a computer, the processing of the functions that each device should have is described by a program. Executing this program on the computer realizes the above processing functions on the computer.

The program describing this processing can be recorded on a computer-readable recording medium. Any computer-readable recording medium may be used, such as a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, or a magnetic tape can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), or CD-R (Recordable)/RW (ReWritable) as the optical disc; an MO (Magneto-Optical disc) as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers via a network.

As an execution form different from the above-described embodiment, the computer may read the program directly from the portable recording medium and execute processing according to the program. Alternatively, each time a program is transferred from the server computer to this computer, the computer may sequentially execute processing according to the received program. The above processing may also be executed by a so-called ASP (Application Service Provider) type service, in which no program is transferred from the server computer to this computer and the processing functions are realized only through execution instructions and result acquisition. Note that the program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

In this embodiment, the device is configured by executing a predetermined program on a computer, but at least part of the processing may instead be realized in hardware.
Further, the various processes described above need not be executed in time series in the order described; they may be executed in parallel or individually, depending on the processing capability of the device that executes them or as required. Other modifications may be made as appropriate without departing from the spirit of the present invention.

[Confirmation of effect]
To confirm the effect of the invention, a speaker section estimation experiment was performed on five minutes of conference data with four participants, in a measurement environment using three microphones as shown in FIG. 8. In the conference, two male and two female speakers first took their seats at the positions male 1, female 1, male 2, and female 2 and introduced themselves; each speaker then moved in turn to position PP and spoke. The self-introductions were assumed to take place within the first 120 seconds of recording, so t_train was set to 120 seconds: the observation signal from the start of recording up to 120 seconds was used to create the speaker identification models, and speaker identification was performed from 120 seconds onward. The frame length of the short-time Fourier transform was 64 ms and the frame shift was 32 ms.
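
To make the framing concrete, the sketch below counts how many complete 64 ms frames with a 32 ms shift fit in a given duration and checks whether a frame falls inside the training period up to t_train. The helper names and the convention that a training frame must end at or before t_train are assumptions for illustration; the patent does not state how boundary frames are handled.

```python
def frame_count(duration_ms, frame_len_ms=64, frame_shift_ms=32):
    """Number of complete frames that fit in duration_ms."""
    if duration_ms < frame_len_ms:
        return 0
    return (duration_ms - frame_len_ms) // frame_shift_ms + 1

def frame_start_ms(tau, frame_shift_ms=32):
    """Start time of frame index tau (0-based)."""
    return tau * frame_shift_ms

def is_training_frame(tau, t_train_ms=120_000, frame_len_ms=64, frame_shift_ms=32):
    """True if frame tau ends at or before t_train (assumed convention)."""
    return frame_start_ms(tau, frame_shift_ms) + frame_len_ms <= t_train_ms

print(frame_count(120_000))  # frames available for model training
print(frame_count(300_000))  # frames in the full five-minute recording
```

Under these assumptions, the 120-second training period contains 3749 frames and the full five-minute recording contains 9374.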

The diarization error rate (DER) was used as the evaluation index.

Here, DER is an index comprising three types of detection error: missed speaker time (MST, the length of time judged as no one speaking although someone was speaking), false alarm speaker time (FAT, the length of time judged as someone speaking although no one was speaking), and speaker error time (SET, the length of time in which the speaker was judged incorrectly). A smaller DER therefore indicates more accurate speaker section estimation. Because the issue in the present invention is in particular whether the speaker is determined correctly, the degree of the effect should appear most clearly in SET.
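
The three error components can be computed from frame-wise labels as in the sketch below. It assumes the commonly used definition of DER as the sum of MST, FAT, and SET divided by the total reference speech time; the exact formula used in the patent is not reproduced in this text. It also assumes at most one speaker per frame and hypothesis labels already mapped to the reference labels, and the function name is hypothetical.

```python
def diarization_errors(reference, hypothesis, frame_shift_s=0.032):
    """Compute MST, FAT, SET (seconds) and DER from per-frame speaker labels.

    reference, hypothesis: equal-length sequences; None marks a non-speech frame.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")
    missed = false_alarm = speaker_error = speech = 0
    for ref, hyp in zip(reference, hypothesis):
        if ref is not None:
            speech += 1
            if hyp is None:
                missed += 1          # someone spoke, judged as silence
            elif hyp != ref:
                speaker_error += 1   # speech detected, wrong speaker
        elif hyp is not None:
            false_alarm += 1         # silence, judged as speech
    if speech == 0:
        raise ValueError("reference contains no speech frames")
    return {
        "MST": missed * frame_shift_s,
        "FAT": false_alarm * frame_shift_s,
        "SET": speaker_error * frame_shift_s,
        "DER": (missed + false_alarm + speaker_error) / speech,
    }
```

For example, with reference labels `[1, 1, 2, 2, None, None, 3, 3]` and hypothesis labels `[1, None, 2, 5, 4, None, 3, 3]`, one frame is missed, one is a false alarm, and one has the wrong speaker, so DER is 3/6 = 0.5.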

FIG. 9(a) shows the confirmation results. FIG. 10 illustrates the results: (a) shows the correct answer, (b) the estimation result of the conventional method, and (c) the estimation result of the method of the present invention. The arrival directions of male 1, female 1, male 2, and female 2 are 100°, 50°, −50°, and −100°, respectively, and position PP lies in the −160° arrival direction. Male 1 corresponds to speaker 1, female 1 to speaker 2, male 2 to speaker 3, and female 2 to speaker 4. As FIG. 10(b) shows, the conventional method estimates the speaker at position PP to be a separate speaker 5 distinct from speakers 1 to 4, and SET is large as shown in FIG. 9(a). In contrast, the method of the present invention distinguishes the speakers in the −160° direction in almost all time intervals, as in FIG. 10(a); as shown in FIG. 9(a), SET improves, and so does DER, the measure of overall performance.

FIG. 9(b) shows the results of conference simulations with 10 speaker combinations. These used conference simulation data created from speech signals and the impulse responses measured in the measurement environment of FIG. 8. In FIG. 9(b), simulation 1 is the case with no overlap between the speakers' speech and simulation 2 is the case with overlap; in both cases, the method of the present invention gives better DER and SET than the conventional method.

[Industrial Applicability]
The present invention can be used in systems, devices, and the like that need to estimate each speaker's speech section from acoustic data in which the speech signals of multiple speakers are mixed, and is particularly effective when speaker positions move during recording.

FIG. 1: Functional configuration example of the multiple signal section estimation device of the first and second embodiments.
FIG. 2: Processing flow example of the multiple signal section estimation device of the first and second embodiments.
FIG. 3: Functional configuration example of the speaker identification unit of the multiple signal section estimation device of the first to fourth embodiments.
FIG. 4: Explanation of how frames that are not connected are joined for processing.
FIG. 5: Processing flow example of the multiple signal section estimation device of the third embodiment.
FIG. 6: Processing flow example of the multiple signal section estimation device of the fourth embodiment.
FIG. 7: Functional configuration example of the multiple signal section estimation device of the fifth embodiment.
FIG. 8: Measurement environment used to confirm the effect.
FIG. 9: Table showing the results of the effect confirmation.
FIG. 10: Data underlying the results of the effect confirmation.
FIG. 11: Functional configuration example of a conventional multiple signal section estimation device.
FIG. 12: Processing flow example of a conventional multiple signal section estimation device.
FIG. 13: Another functional configuration example of a conventional multiple signal section estimation device.

Claims

A multiple signal section estimation device that estimates the utterance section of each speaker from an observation signal recorded by a plurality of microphones and containing speech uttered by a plurality of speakers, the device comprising:
a frequency domain transform unit that sequentially cuts the observation signal into frames of a predetermined length and transforms each frame into the frequency domain;
a speech section estimation unit that estimates whether each frame corresponds to a speech section, based on the observation signal transformed into the frequency domain (hereinafter referred to as the “frequency domain observation signal”);
an arrival direction estimation unit that estimates the arrival direction of the frequency domain observation signal for each frame, based on the frequency domain observation signal;
an arrival direction classification unit that classifies each frame estimated to correspond to a speech section into a cluster for each speaker, based on the similarity of the arrival directions; and
a speaker identification unit that creates, for each cluster, a model of the speaker associated with the cluster based on the frequency domain observation signals of the frames classified into the same cluster up to a predetermined time, and estimates the speaker of the observation signal after the predetermined time based on each speaker's model.

The multiple signal section estimation device according to claim 1, wherein the speaker identification unit comprises:
feature extraction means for calculating a speech feature of each frame of the frequency domain observation signal;
model learning means for creating and outputting, for each cluster, the model of the speaker associated with the cluster using the speech features of the frames classified into the same cluster up to the predetermined time, and for outputting pairs of the index of each frame up to the predetermined time and the index of the speaker associated with the cluster to which the frame belongs; and
likelihood calculation means for calculating, for each model, the likelihood of the speaker's model for the speech features of mutually connected frames classified into the same cluster after the predetermined time (hereinafter referred to as a “segment”), assigning to the segment the index of the speaker associated with the model with the maximum likelihood, and outputting that index together with the index of each frame in the segment.

A multiple signal section estimation device that estimates the utterance section of each speaker from an observation signal recorded by a plurality of microphones and containing speech uttered by a plurality of speakers, the device comprising:
a frequency domain transform unit that sequentially cuts the observation signal into frames of a predetermined length and transforms each frame into the frequency domain;
a speech section estimation unit that estimates whether each frame corresponds to a speech section, based on the observation signal transformed into the frequency domain (hereinafter referred to as the “frequency domain observation signal”);
an arrival direction estimation unit that estimates the arrival direction of the frequency domain observation signal for each frame, based on the frequency domain observation signal;
an arrival direction classification unit that classifies each frame estimated to correspond to a speech section into a cluster for each speaker, based on the similarity of the arrival directions;
a speech enhancement unit that generates, based on the frequency domain observation signal, a signal in which the speech signal component of each speaker associated with a cluster is emphasized (hereinafter referred to as an “enhanced signal”); and
a speaker identification unit that creates a model of each speaker based on the enhanced signal up to a predetermined time, and estimates the speaker of the observation signal after the predetermined time based on each speaker's model.

The multiple signal section estimation device according to claim 3, wherein the speaker identification unit comprises:
feature extraction means for calculating a speech feature of each frame of the enhanced signal;
model learning means for creating and outputting, for each speaker, the model of the speaker using the speech features of the frames of the enhanced signal up to the predetermined time, and for outputting pairs of the index of each frame up to the predetermined time and the index of the speaker associated with the cluster to which the frame belongs; and
likelihood calculation means for calculating, for each model, the likelihood of the speaker's model for the speech features of mutually connected frames classified into the same cluster after the predetermined time (hereinafter referred to as a “segment”), assigning to the segment the index of the speaker associated with the model with the maximum likelihood, and outputting that index together with the index of each frame in the segment.

A multiple signal section estimation method for estimating the utterance section of each speaker from an observation signal recorded by a plurality of microphones and containing speech uttered by a plurality of speakers, the method comprising executing:
a frequency domain transform step of sequentially cutting the observation signal into frames of a predetermined length and transforming each frame into the frequency domain;
a speech section estimation step of estimating whether each frame corresponds to a speech section, based on the observation signal transformed into the frequency domain (hereinafter referred to as the “frequency domain observation signal”);
an arrival direction estimation step of estimating the arrival direction of the frequency domain observation signal for each frame, based on the frequency domain observation signal;
an arrival direction classification step of classifying each frame estimated to correspond to a speech section into a cluster for each speaker, based on the similarity of the arrival directions; and
a speaker identification step of creating, for each cluster, a model of the speaker associated with the cluster based on the frequency domain observation signals of the frames classified into the same cluster up to a predetermined time, and estimating the speaker of the observation signal after the predetermined time based on each speaker's model.

The multiple signal section estimation method according to claim 5, wherein the speaker identification step executes:
a feature extraction sub-step of calculating a speech feature of each frame of the frequency domain observation signal;
a model learning sub-step of creating and outputting, for each cluster, the model of the speaker associated with the cluster using the speech features of the frames classified into the same cluster up to the predetermined time, and outputting pairs of the index of each frame up to the predetermined time and the index of the speaker associated with the cluster to which the frame belongs; and
a likelihood calculation sub-step of calculating, for each model, the likelihood of the speaker's model for the speech features of mutually connected frames classified into the same cluster after the predetermined time (hereinafter referred to as a “segment”), assigning to the segment the index of the speaker associated with the model with the maximum likelihood, and outputting that index together with the index of each frame in the segment.

A multiple signal section estimation method for estimating the utterance section of each speaker from an observation signal recorded by a plurality of microphones and containing speech uttered by a plurality of speakers, the method comprising executing:
a frequency domain transform step of sequentially cutting the observation signal into frames of a predetermined length and transforming each frame into the frequency domain;
a speech section estimation step of estimating whether each frame corresponds to a speech section, based on the observation signal transformed into the frequency domain (hereinafter referred to as the “frequency domain observation signal”);
an arrival direction estimation step of estimating the arrival direction of the frequency domain observation signal for each frame, based on the frequency domain observation signal;
an arrival direction classification step of classifying each frame estimated to correspond to a speech section into a cluster for each speaker, based on the similarity of the arrival directions;
a speech enhancement step of generating, based on the frequency domain observation signal, a signal in which the speech signal of each speaker associated with a cluster is emphasized (hereinafter referred to as an “enhanced signal”); and
a speaker identification step of creating a model of each speaker based on the enhanced signal up to a predetermined time, and estimating the speaker of the observation signal after the predetermined time based on each speaker's model.

The multiple signal section estimation method according to claim 7, wherein the speaker identification step executes:
a feature extraction sub-step of calculating a speech feature of each frame of the enhanced signal;
a model learning sub-step of creating and outputting, for each speaker, the model of the speaker using the speech features of the frames of the enhanced signal up to the predetermined time, and outputting pairs of the index of each frame up to the predetermined time and the index of the speaker associated with the cluster to which the frame belongs; and
a likelihood calculation sub-step of calculating, for each model, the likelihood of the speaker's model for the speech features of mutually connected frames classified into the same cluster after the predetermined time (hereinafter referred to as a “segment”), assigning to the segment the index of the speaker associated with the model with the maximum likelihood, and outputting that index together with the index of each frame in the segment.

The multiple signal section estimation method according to claim 6 or 8,
further comprising executing a model update step of, after a speaker index has been assigned to the segment in the likelihood calculation sub-step, newly creating a model of that speaker based on the speech features of the frames belonging to the segment and updating the speaker's model.

The multiple signal section estimation method according to any one of claims 6, 8, and 9,
further comprising executing a model addition step of, when the calculated maximum likelihood is smaller than a predetermined threshold, judging that a new speaker has joined, assigning an index of the new speaker to the segment, and creating a model of the new speaker based on the speech features of the frames belonging to the segment.

A program for causing a computer to function as the device according to any one of claims 1 to 4.

A computer-readable recording medium on which the program according to claim 11 is recorded.