JP5022387B2

JP5022387B2 - Clustering calculation apparatus, clustering calculation method, clustering calculation program, and computer-readable recording medium recording the program

Info

Publication number: JP5022387B2
Application number: JP2009015338A
Authority: JP
Inventors: 勝彦石黒; 武士山田; 章子荒木; 智広中谷
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2009-01-27
Filing date: 2009-01-27
Publication date: 2012-09-12
Anticipated expiration: 2029-01-27
Also published as: JP2010175614A

Abstract

PROBLEM TO BE SOLVED: To provide a clustering technique that accurately estimates the number of speakers and the parameters characterizing each speaker in diarization. SOLUTION: A clustering calculation device 3 includes: an observation amount generation section 26 that reads a power vector extracted from recorded conversation data and converts it into a vector observation amount compatible with an approximation model of the dynamic Hierarchical Dirichlet Process (dHDP); a storage means 10 that accumulates and stores the set data of the converted observation amounts; a variational posterior distribution inference section 30 that uses an expectation-maximization (EM) algorithm to estimate the posterior distribution values of a plurality of parameters for generating a plurality of clusters from the set data of the observation amounts under the dHDP approximation model, and that sequentially stores and updates those values in the storage means 10; and an output control section 28 that outputs the latest estimates of the posterior distributions of the plurality of parameters stored in the storage means 10 when a preset termination condition is satisfied. COPYRIGHT: (C)2010,JPO&INPIT

Description

The present invention relates to a technique for estimating the number of speakers from recorded conversation data in which the number of speakers is unknown.

The problem of estimating, from recorded data of a conversation among multiple speakers, the number of speakers who took part and the timing at which each speaker spoke is well known. Both this problem and the techniques that solve it are called diarization. Put simply, diarization is a technique for automatically estimating "who spoke when". Its expected applications are broad and include annotation of meeting data, search based on such annotations, and speech enhancement.

Existing approaches to diarization fall broadly into two groups. The first estimates the voice quality specific to each speaker and uses it to tell speakers apart; the second estimates the position of each speaker. The first method (using speaker voice quality) identifies the current speaker by extracting the characteristics of each speaker's voice. It has the advantage that a speaker can still be identified after moving, but the disadvantage that speakers become difficult to identify when several of them speak at the same time.

The second method (using information about speaker position) estimates the number of speakers and their positions by estimating where each speaker is (see, for example, Non-Patent Documents 1 to 3). The methods of Non-Patent Documents 1 to 3 identify speakers by estimating the position of each speaker with a microphone array. These methods therefore have the drawback that a speaker who moves can no longer be identified as the same speaker, but the advantage that the utterance activity of each speaker can be identified even when multiple speakers speak at the same time.

In diarization, the number of speakers in the recorded audio data is generally unknown and must be estimated. In addition, whether the first method (using speaker voice quality) or the second method (using information about speaker position) is employed, identifying each speaker requires estimating the quantities (parameters) that characterize that speaker. This corresponds to a clustering problem, that is, the problem of dividing observed data into an appropriate number of clusters and estimating the parameters of each cluster. Diarization cannot simply be handed to a standard clustering method, however, because the number of clusters and the partition of the data are unknown and must be learned together with the parameters. In the diarization problem, the number of clusters represents the number of speakers, the parameters of each cluster represent the features of a speaker, and the resulting partition of the data indicates the utterance timing of each speaker.

Non-Patent Document 1 uses a clustering technique called the leader-follower algorithm. This is a sequential clustering method in which clustering proceeds as samples are fed in a few at a time. When a newly input sample is at least a fixed distance away from every existing cluster, a new cluster is created centered on that sample. The leader-follower algorithm is described in R. O. Duda, P. E. Hart and D. G. Stork, "Pattern Classification", John Wiley & Sons, 2001.
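The sequential behavior described above can be sketched as follows. This is a minimal illustration, not the implementation of Non-Patent Document 1: the function name `leader_follower`, the Euclidean distance, and the `learning_rate` update of the winning center are assumptions made for the example.

```python
import math

def leader_follower(samples, threshold, learning_rate=0.1):
    """Cluster samples one at a time; a sample far from every center starts a new cluster."""
    centers = []  # current cluster centers
    labels = []   # cluster index assigned to each sample, in input order
    for x in samples:
        if centers:
            dists = [math.dist(x, c) for c in centers]
            k = min(range(len(centers)), key=dists.__getitem__)
        if not centers or dists[k] >= threshold:
            # at least `threshold` away from all clusters: create a new one
            centers.append(list(x))
            labels.append(len(centers) - 1)
        else:
            # assumed update rule: nudge the nearest center toward the sample
            centers[k] = [c + learning_rate * (xi - c) for c, xi in zip(centers[k], x)]
            labels.append(k)
    return centers, labels

samples = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (0.1, -0.1)]
centers, labels = leader_follower(samples, threshold=2.0)
print(len(centers), labels)
```

The number of clusters that comes out depends directly on `threshold`, which is the tuning difficulty this document attributes to the method.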

Non-Patent Document 3 proposes selecting the number of clusters that maximizes an evaluation value called the BIC criterion. In this method, the number of clusters is set to K, clustering is actually performed, and the evaluation value is computed; the number of clusters is then set to K+1 and the evaluation value is computed again. This is repeated to search for the number of clusters that maximizes the evaluation value.
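The search over K can be sketched as below. This is only an illustration under stated assumptions: one-dimensional data, a simple k-means as the clustering step, and a BIC-style score (hard-assignment Gaussian log-likelihood with a shared variance, minus a complexity penalty) stand in for whatever model Non-Patent Document 3 actually uses. All function names are hypothetical.

```python
import math

def assign(xs, centers):
    """Index of the nearest center for every sample."""
    return [min(range(len(centers)), key=lambda j: (x - centers[j]) ** 2) for x in xs]

def kmeans_1d(xs, k, iters=20):
    """Plain 1-D k-means with a deterministic initialization from order statistics."""
    s = sorted(xs)
    centers = [s[(2 * j + 1) * len(s) // (2 * k)] for j in range(k)]
    for _ in range(iters):
        labels = assign(xs, centers)
        for j in range(k):
            members = [x for x, lab in zip(xs, labels) if lab == j]
            if members:
                centers[j] = sum(members) / len(members)
    return centers, assign(xs, centers)

def bic_score(xs, centers, labels):
    """Score to maximize: log-likelihood minus 0.5 * (number of parameters) * log(n)."""
    n, k = len(xs), len(centers)
    sse = sum((x - centers[lab]) ** 2 for x, lab in zip(xs, labels))
    var = max(sse / n, 1e-6)  # shared variance, floored to avoid log(0)
    loglik = -0.5 * n * math.log(2 * math.pi * var) - sse / (2 * var)
    for j in range(k):
        nj = labels.count(j)
        if nj:
            loglik += nj * math.log(nj / n)  # log mixing weight of each sample's cluster
    n_params = 2 * k  # k means, k - 1 weights, 1 variance
    return loglik - 0.5 * n_params * math.log(n)

xs = [-0.2, -0.1, 0.0, 0.1, 0.2, 4.8, 4.9, 5.0, 5.1, 5.2, 9.8, 9.9, 10.0, 10.1, 10.2]
scores = {}
for k in range(1, 7):  # every candidate K requires a full clustering run
    centers, labels = kmeans_1d(xs, k)
    scores[k] = bic_score(xs, centers, labels)
best_k = max(scores, key=scores.get)
print(best_k)
```

Note that the loop clusters the data once per candidate K, including values of K that turn out to be poor choices.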

Meanwhile, nonparametric Bayesian models have in recent years come to be used in many settings as clustering models for data with an unknown number of clusters. For example, the Dirichlet Process Mixture (DPM), one type of nonparametric Bayesian model, can probabilistically optimize the number of clusters and the cluster assignment of each sample at the same time. A major feature of DPM is therefore that the number of clusters can be optimized easily, in the same manner as with an existing clustering model. As an extension of DPM that models continuous temporal change of the probability distribution, a model called the dynamic Hierarchical Dirichlet Process (dHDP) is known (see Non-Patent Document 4).
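One standard way to picture how a Dirichlet process mixture leaves the number of clusters open is the truncated stick-breaking construction, sketched below. This section does not state which representation the patent uses, so the construction, the truncation at `max_clusters`, and the concentration parameter `alpha` are assumptions made for illustration.

```python
import random

def stick_breaking_weights(alpha, max_clusters, rng):
    """Draw mixing weights by repeatedly breaking off a Beta(1, alpha) fraction of a unit stick."""
    weights = []
    remaining = 1.0  # length of stick not yet assigned to any cluster
    for k in range(max_clusters):
        # the last cluster takes whatever is left so that the weights sum to one
        v = 1.0 if k == max_clusters - 1 else rng.betavariate(1.0, alpha)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights

rng = random.Random(0)
weights = stick_breaking_weights(alpha=1.0, max_clusters=10, rng=rng)
print([round(w, 3) for w in weights])
```

With a small `alpha`, most of the mass typically falls on the first few clusters and the rest receive weights near zero, which is how a fixed upper bound on the cluster count can still yield a data-dependent effective count.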

Non-Patent Document 1: S. Araki, M. Fujimoto, K. Ishizuka, H. Sawada and S. Makino, "A DOA Based Speaker Diarization System for Real Meetings", Proceedings of the Joint Workshop on Hands-Free Speech Communication and Microphone Arrays, p. 29-32, 2008.
Non-Patent Document 2: S. Araki, M. Fujimoto, K. Ishizuka, H. Sawada and S. Makino, "Conference Speech Speaker Identification System Using Speech Interval Detection and Direction Information and Its Evaluation" (in Japanese), Proceedings of the Acoustical Society of Japan, Spring, vol. 1-10-1, p. 1-4, 2008.
Non-Patent Document 3: J. M. Pardo, X. Anguera and C. Wooters, "Speaker Diarization for Multi-Microphone Meetings Using Only Between-Channel Differences", Proceedings of the Third Joint Workshop on Multimodal Interaction and Related Machine Learning Algorithms, p. 257-264, 2008.
Non-Patent Document 4: L. Ren, D. B. Dunson and L. Carin, "The Dynamic Hierarchical Dirichlet Process", Proceedings of International Conference on Machine Learning, p. 824-831, 2008.

In conventional diarization research, the method of Non-Patent Document 1, for example, is computationally simple and fast, but it requires a distance threshold to be set for creating new clusters. This threshold determines the final clustering and the final number of clusters. At the same time, it is difficult to optimize the threshold so that the estimated number of clusters and the clustering result approach their true values.

The method of Non-Patent Document 3, for example, must actually perform clustering even for numbers of clusters that turn out to be inappropriate, so a large amount of computation and time is wasted along the way.

Conventional diarization research therefore left room for improvement in the part that estimates the number of speakers and the speaker features, that is, in the clustering problem. There is also demand for lower processing load and faster processing in diarization. Furthermore, nonparametric Bayesian models such as dHDP and DPM had not been adopted as clustering models in conventional diarization research, and no way of applying them was known.

Accordingly, an object of the present invention is to solve the problems described above and to provide a clustering technique that accurately estimates the number of speakers and the parameters characterizing each speaker in diarization.

To achieve this object, the inventors examined various approaches to the clustering that estimates the number of speakers and the parameters characterizing each speaker in diarization. As a result, they found that the number of speakers can be estimated accurately when a nonparametric Bayesian model is adopted as a model that can simultaneously optimize the number of clusters, the partition of the data, and the parameters in the sense of posterior probability maximization.

Accordingly, the clustering calculation apparatus according to the present invention is the clustering calculation apparatus of a diarization system that, in order to estimate the number of speakers in a conversation from recorded data of the conversation in which the number of speakers is unknown, has: feature amount extraction means for extracting feature amounts that characterize each speaker; a clustering calculation apparatus for estimating the values of a plurality of unknown parameters used when generating, from the extracted feature amounts, a plurality of clusters corresponding to the respective speakers; and identification means for identifying each speaker in the conversation from the estimated parameter values. The clustering calculation apparatus comprises: reading means for reading the extracted feature amounts; observation amount generation means for converting the read feature amounts, which are a plurality of speech powers by angle and by time, into vector observation amounts compatible with a nonparametric Bayesian model by quantizing them into a sample set whose number of elements is determined according to the speech power values; observation amount storage means for accumulating and storing the set data of the converted observation amounts; posterior distribution inference means for estimating and updating, by an EM algorithm, the posterior distribution values of a plurality of parameters used when generating a plurality of clusters from the set data of the observation amounts under the nonparametric Bayesian model; estimated value storage means for accumulating and storing the estimated and updated posterior distribution values of the plurality of parameters; and output control means for outputting the latest estimates of the posterior distributions of the plurality of parameters stored in the estimated value storage means when a preset termination condition is satisfied.
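The quantizing conversion performed by the observation amount generation means can be pictured with the following sketch. The exact quantization rule is not given in this passage, so the rule used here (each angle is repeated `floor(power / step)` times), the function name `power_to_samples`, and the example values are all assumptions.

```python
def power_to_samples(power_by_angle, angles_deg, step):
    """Turn one time frame of per-angle speech power into a set of angle samples.

    Assumed rule: an angle contributes more samples the larger its power is.
    """
    samples = []
    for angle, power in zip(angles_deg, power_by_angle):
        count = int(power // step)  # number of elements determined by the power value
        samples.extend([angle] * count)
    return samples

angles_deg = [0, 90, 180, 270]
power_frames = [          # one row per time frame, one column per angle
    [0.1, 2.3, 0.0, 0.4],
    [3.1, 0.2, 0.0, 1.0],
]
observations = [power_to_samples(frame, angles_deg, step=0.5) for frame in power_frames]
print(observations)
```

Each time frame then carries a variable-size sample set, which is the kind of grouped data a time-indexed mixture model takes as input.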

With this configuration, the clustering calculation apparatus converts the feature amounts extracted from the recorded conversation data into vector observation amounts so that they can be applied to a nonparametric Bayesian model, and uses the resulting set data of observation amounts in calculations under the nonparametric Bayesian model to simultaneously optimize the number of clusters, the partition of the data, and the parameters in the sense of posterior probability maximization. During this optimization, the clustering calculation apparatus repeatedly estimates and updates the posterior distributions with the EM algorithm and accumulates and stores them. Because the EM algorithm computes a locally optimal solution, repeating the calculation converges to a single solution. In the nonparametric Bayesian model, of the clusters prepared in advance up to the maximum number of clusters, the optimization converges on the effective number of clusters, and the mixing ratios of the remaining clusters become almost 0. Once this clustering result is obtained, the number of samples (the number of observation amounts) belonging to each cluster can be calculated, and the mixing ratio of each cluster can be calculated from those counts. The mixing ratios in turn determine the effective number of clusters. Since the clusters correspond to the speakers in the recorded conversation data, the number of speakers can be determined. Estimates of the various variables derived from the clustering result can be accumulated and stored in the same way as the clustering result itself. The clustering calculation apparatus then outputs the latest stored estimates when the termination condition is satisfied.
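The step from a clustering result to an effective cluster count can be sketched as follows. The soft assignment matrix `responsibilities`, the function name, and the cutoff `min_ratio` are assumptions for illustration; the text only states that unused clusters end up with mixing ratios of almost 0.

```python
def effective_clusters(responsibilities, min_ratio=0.01):
    """Count samples per cluster, derive mixing ratios, and keep clusters above a cutoff."""
    n = len(responsibilities)
    k = len(responsibilities[0])
    # expected number of samples belonging to each cluster
    counts = [sum(row[j] for row in responsibilities) for j in range(k)]
    ratios = [c / n for c in counts]
    active = [j for j, r in enumerate(ratios) if r > min_ratio]
    return counts, ratios, active

# 6 samples, maximum of 4 clusters prepared in advance; rows sum to 1
responsibilities = [
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.1, 0.9, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
]
counts, ratios, active = effective_clusters(responsibilities)
print(len(active), ratios)
```

Here two of the four prepared clusters carry all of the mass, so the estimated number of speakers in this toy example would be two.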

Further, in the clustering calculation apparatus according to the present invention, the posterior distribution inference means comprises E-step calculation means, M-step calculation means, and convergence determination means. As the E-step processing of the EM algorithm, the E-step calculation means reads the prior distribution, observation distribution, and maximum number of clusters predetermined in a dHDP (dynamic Hierarchical Dirichlet Process) model, the hyperparameter settings, the set data of the observation amounts converted from the past up to the estimation target time, and the estimates of the posterior distributions of the hidden variables obtained from the past up to the latest M step; it then calculates, for each calculation target time including past times, for each cluster, and for all data at each calculation target time, the posterior distribution values of the parameters relating to the clusters, the mixing ratios, and the weights representing the degree of temporal change of the cluster distribution. As the M-step processing of the EM algorithm, the M-step calculation means reads the hyperparameter settings, the set data of the observation amounts converted from the past up to the estimation target time, and the estimates of the posterior distributions of the parameters obtained from the past up to the latest E step; it then estimates, for each calculation target time including past times and for all data at each calculation target time, the posterior distribution values of two types of hidden variables. Of these, the posterior distribution value of the first hidden variable, which is associated with the weight representing the degree of temporal change of the cluster distribution, is calculated for each cluster, and the posterior distribution value of the second hidden variable, which is associated with the first hidden variable and the mixing ratio, is calculated for each time going back into the past from the calculation target time. The convergence determination means performs control so that the E-step processing and the M-step processing are executed alternately and repeatedly a predetermined number of times.

With this configuration, the clustering calculation apparatus uses the EM algorithm to simultaneously optimize the number of clusters, the partition of the data, and the parameters in the sense of posterior probability maximization, through calculations under the dHDP model, which is one of the nonparametric Bayesian models. The dHDP model expresses the parameter distribution at a time t as a superposition of the parameter distribution at the previous time t−1 and the change in the parameter distribution at time t, combined according to the proportion (weight) of that change at time t. In other words, dHDP is a model in which the distribution changes little by little with time. The algorithm of the clustering calculation apparatus was constructed on the premise that the parameter distribution at time t is made up of the speakers who actually spoke at time t. The dHDP model is therefore a probabilistic model well suited to the diarization task, in which the distribution of the observation amounts (samples) changes over time because of speaker turn-taking. Consequently, the unknown number of speakers and the parameters characterizing each speaker can be estimated accurately from recorded conversation data in which speakers take turns.
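The superposition described above can be written as G_t = (1 − w_t) G_{t−1} + w_t H_t, where H_t denotes the change (innovation) at time t and w_t its weight. The sketch below applies this recursion to discrete distributions over three speakers. The symbols, the time indexing of the weight, and the example numbers are assumptions made for illustration and may differ from the indexing used in Non-Patent Document 4.

```python
def dhdp_mix(prev_dist, innovation, weight):
    """Blend the previous distribution with the innovation using the given weight."""
    return [(1.0 - weight) * p + weight * h for p, h in zip(prev_dist, innovation)]

dist = [1.0, 0.0, 0.0]      # at the first time step only speaker 0 is active
innovations = [
    [0.0, 1.0, 0.0],        # speaker 1 starts talking
    [0.0, 1.0, 0.0],        # speaker 1 keeps the turn
    [0.0, 0.0, 1.0],        # speaker 2 joins in
]
weights = [0.5, 0.5, 0.2]   # larger weight = faster change of the distribution
history = [dist]
for h, w in zip(innovations, weights):
    dist = dhdp_mix(dist, h, w)
    history.append(dist)
print(history[-1])
```

A weight near 0 keeps the previous speaker mix almost unchanged, while a weight near 1 lets a turn change take over quickly.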

Further, in the clustering calculation apparatus according to the present invention, the E-step calculation means reads, for the set data of the observation amounts and the estimates of the posterior distributions of the hidden variables, the data for times going back into the past from the estimation target time by either a preset constant or a variable number that reflects the estimates of the immediately preceding M step. Taking as calculation target times the times from the estimation target time back to a past time earlier by a preset set value, it calculates the posterior distribution values of the parameters relating to the clusters, the mixing ratios, and the weights. Each time it moves on to another cluster for the per-cluster calculation, it determines, based on a preset criterion for clusters to be re-estimated, whether the cluster being processed should be re-estimated, and calculates the posterior distribution value of the parameter relating to the mixing ratio only if it should. Likewise, the M-step calculation means reads, for the set data of the observation amounts and the estimates of the posterior distributions of the parameters, the data for times going back into the past from the estimation target time by either a preset constant or a variable number that reflects the estimates of the immediately preceding M step. Taking as calculation target times the times from the estimation target time back to a past time earlier by a preset set value, it calculates the posterior distribution values of the first hidden variable and the second hidden variable. Each time it moves on to another cluster for the per-cluster calculation, it determines, based on the same criterion, whether the cluster being processed should be re-estimated, and calculates the posterior distribution value of the first hidden variable only if it should.

With this configuration, in the E step and the M step the clustering calculation apparatus reads the set data of the observation amounts going back from the estimation target time to a predetermined point in the past, together with the estimates needed for the calculation, and computes the estimates using as calculation target times the times from the estimation target time back to a past time earlier by a predetermined set value. Compared with the case where no past time point or set value is fixed in advance and the number of variables to be estimated simply keeps growing as the time steps advance, this reduces the processing load and speeds up processing. In addition, when the clustering calculation apparatus moves on to another cluster for the per-cluster calculation in the E step and the M step, it re-estimates the values for that cluster only when necessary. Compared with re-estimating every time for all of the clusters up to the maximum number of clusters preset in the dHDP model, this also reduces the processing load and speeds up processing.
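The two savings described above, a bounded look-back window and selective re-estimation of clusters, can be sketched as follows. The helper names and the concrete criterion (a cluster is re-estimated only if its current mixing ratio is at least `min_ratio`) are hypothetical; the text only says that a judgment criterion is set in advance.

```python
def target_times(t, window):
    """Calculation target times: the estimation target time t and up to `window` past steps."""
    return list(range(max(0, t - window), t + 1))

def clusters_to_reestimate(ratios, min_ratio):
    """Assumed criterion: skip clusters whose mixing ratio has collapsed to almost 0."""
    return [k for k, r in enumerate(ratios) if r >= min_ratio]

t, window = 10, 3
ratios = [0.6, 0.0, 0.4, 0.0]   # maximum of 4 clusters, 2 effectively in use
times = target_times(t, window)
clusters = clusters_to_reestimate(ratios, min_ratio=0.01)

windowed_updates = len(times) * len(clusters)   # work per sweep with both savings
full_updates = (t + 1) * len(ratios)            # work per sweep without them
print(times, clusters, windowed_updates, full_updates)
```

In this toy setting the per-sweep work stays constant as `t` grows, whereas the unrestricted count grows linearly with the number of time steps.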

To solve the problems described above, the clustering calculation method according to the present invention is a clustering calculation method of a clustering calculation apparatus in a diarization system that estimates the number of speakers in a conversation from recorded data of the conversation in which the number of speakers is unknown. The clustering calculation apparatus includes storage means, reading means, observation amount generation means, posterior distribution inference means, and output control means, and estimates the values of a plurality of unknown parameters used when generating a plurality of clusters corresponding to the respective speakers from the feature amounts, extracted from the recorded data, that characterize each speaker. The method comprises: a feature amount reading step in which the reading means reads the extracted feature amounts; an observation amount accumulation step in which the observation amount generation means converts the read feature amounts, which are a plurality of speech powers by angle and by time, into vector observation amounts compatible with a nonparametric Bayesian model by quantizing them into a sample set whose number of elements is determined according to the speech power values, and sequentially accumulates the set data of the converted observation amounts in the storage means; a posterior distribution estimation step in which the posterior distribution inference means estimates, by an EM algorithm, the posterior distribution values of a plurality of parameters used when generating a plurality of clusters from the set data of the observation amounts under the nonparametric Bayesian model, and sequentially stores and updates the estimates in the storage means; and an estimated value output step in which the output control means outputs the latest estimates of the posterior distributions of the plurality of parameters stored in the storage means when a preset termination condition is satisfied.

With this procedure, in the clustering calculation method the clustering calculation apparatus first reads the feature amounts extracted from the recorded conversation data, converts them into vector observation amounts so that they can be applied to a nonparametric Bayesian model, and accumulates them. Using the resulting set data of observation amounts, the clustering calculation apparatus then simultaneously optimizes the number of clusters, the partition of the data, and the parameters in the sense of posterior probability maximization through calculations under the nonparametric Bayesian model. During this optimization, it repeatedly estimates and updates the posterior distributions with the EM algorithm and accumulates and stores them. It then outputs the latest stored estimates when the termination condition is satisfied.

Further, in the clustering calculation method according to the present invention, the posterior distribution inference means performs the following two stages in the posterior distribution estimation step. In the first stage, as the E-step processing of the EM algorithm, it reads the prior distribution, observation distribution, and maximum number of clusters predetermined in the dHDP model, the hyperparameter settings, the set data of the observation amounts converted from the past up to the estimation target time, and the estimates of the posterior distributions of the hidden variables obtained from the past up to the latest M step; it then calculates, for each calculation target time (the estimation target time together with past times), for each cluster, and for all data at each calculation target time, the posterior distribution values of the parameters relating to the clusters, the mixing ratios, and the weights representing the degree of temporal change of the cluster distribution. In the second stage, as the M-step processing of the EM algorithm, it reads the hyperparameter settings, the set data of the observation amounts converted from the past up to the estimation target time, and the estimates of the posterior distributions of the parameters obtained from the past up to the latest E step; it then estimates, for each calculation target time (the estimation target time together with past times) and for all data at each calculation target time, the posterior distribution values of two types of hidden variables. Of these, the posterior distribution value of the first hidden variable, which is associated with the weight representing the degree of temporal change of the cluster distribution, is calculated for each cluster, and the posterior distribution value of the second hidden variable, which is associated with the first hidden variable and the mixing ratio, is calculated for each time going back into the past from the calculation target time. The E-step processing and the M-step processing are executed alternately and repeatedly a predetermined number of times.

According to this procedure, the clustering calculation apparatus uses the EM algorithm, through computation that follows the dHDP model (one of the nonparametric Bayesian models), to optimize the number of clusters, the partition of the data, and the parameters simultaneously in the sense of posterior probability maximization. Because dHDP models a distribution that changes gradually with time, it is well suited to the diarization task, in which the distribution of the observations (samples) changes over time as speakers take turns (turn-taking). Consequently, the unknown number of speakers and the parameters characterizing each speaker can be estimated accurately from recorded conversation data that contains speaker changes.

Further, in the clustering calculation method according to the present invention, in the E-step the posterior distribution inference means reads the collected observation data and the estimates of the posterior distributions of the hidden variables only for times going back from the estimation target time by either a preset constant or a variable number that reflects the estimates of the immediately preceding M-step. In the M-step it likewise reads the collected observation data and the estimates of the posterior distributions of the parameters only for times going back from the estimation target time by either a preset constant or a variable number that reflects the estimates of the immediately preceding M-step.

According to this procedure, the E-step and the M-step read only the collected observation data going back from the estimation target time to a predetermined point in the past, together with the estimates needed for the computation. Compared with the case where no such point is fixed in advance and the number of variables to be estimated simply grows as the time steps advance, the processing load is reduced and processing becomes faster.

Further, in the clustering calculation method according to the present invention, in the E-step the posterior distribution inference means takes as calculation target times the span from the estimation target time back to a past time that precedes it by a preset value, and calculates the values of the posterior distributions of the parameters relating to the clusters, the mixing ratios, and the weights. In the M-step it takes the same kind of span, from the estimation target time back by a preset value, as calculation target times and calculates the values of the posterior distributions of the first and second hidden variables.

According to this procedure, the E-step and the M-step compute the estimates using as calculation target times only the span from the estimation target time back to a past time that precedes it by a predetermined set value. Compared with the case where no set value is fixed in advance and the number of variables to be estimated simply grows as the time steps advance, the processing load is reduced and processing becomes faster.

Further, in the clustering calculation method according to the present invention, each time the posterior distribution inference means updates the cluster being processed in the per-cluster computation of the E-step, it determines, based on a preset criterion for clusters that should be re-estimated, whether that cluster should be re-estimated, and calculates the value of the posterior distribution of the parameter relating to the mixing ratio only when it should. In the M-step, each time it updates the cluster being processed in the per-cluster computation, it makes the same determination based on the same criterion and calculates the value of the posterior distribution of the first hidden variable only when the cluster should be re-estimated.

According to this procedure, when the cluster being processed is updated in the E-step and the M-step, the estimates for that cluster are re-estimated only when necessary. Compared with re-estimating every time for all of the clusters up to the maximum number preset in the dHDP model, the processing load is reduced and processing becomes faster.

A clustering calculation program according to the present invention is a program for causing a computer to function as each of the means constituting any one of the clustering calculation apparatuses described above. With this configuration, a computer on which the program is installed can realize each function based on the program.

A computer-readable recording medium according to the present invention is characterized in that the clustering calculation program is recorded on it. With this configuration, a computer in which the recording medium is loaded can realize each function based on the program recorded on the medium.

According to the present invention, a nonparametric Bayesian model is adopted for the speaker clustering problem in diarization and probabilistic clustering is used, so the number of speakers can be estimated easily without the parameter setting and search that conventional methods require. Furthermore, because a nonparametric Bayesian model is adopted, the number of clusters, the partition of the data, and the parameters can be optimized simultaneously in the sense of posterior probability maximization. As a result, the number of speakers and the parameters characterizing each speaker can be estimated accurately in diarization.

A configuration diagram outlining a diarization system that includes the clustering calculation apparatus according to an embodiment of the present invention.
A flowchart showing the overall processing flow of the clustering calculation method according to the embodiment of the present invention.
A flowchart showing the variational posterior distribution inference procedure shown in FIG. 2.
A flowchart showing an example of the E-step calculation procedure shown in FIG. 3.
A flowchart showing an example of the M-step calculation procedure shown in FIG. 3.
A flowchart showing the E-step calculation procedure in the clustering calculation method according to the embodiment of the present invention.
A flowchart showing the M-step calculation procedure in the clustering calculation method according to the embodiment of the present invention.
A functional block diagram showing an example of the configuration of the clustering calculation apparatus according to the embodiment of the present invention.
A graph showing the time-averaged power distribution of the artificial speech data used to evaluate the clustering performance of the clustering calculation apparatus according to the embodiment of the present invention.
A graph showing the time-averaged power distribution of the real speech data used to evaluate the clustering performance of the clustering calculation apparatus according to the embodiment of the present invention, in which (a) shows CP1, (b) shows CP2, (c) shows DC, and (d) shows CN.
A graph showing the result of clustering the artificial speech data with the clustering calculation apparatus according to the embodiment of the present invention, in which (a) shows DPM and (b) shows dHDP.
A graph showing the result of clustering the CP1 real speech data with the clustering calculation apparatus according to the embodiment of the present invention, in which (a) shows DPM and (b) shows dHDP.
A graph showing the result of clustering the DC real speech data with the clustering calculation apparatus according to the embodiment of the present invention, in which (a) shows DPM and (b) shows dHDP.

A mode for carrying out the clustering calculation apparatus and the clustering calculation method of the present invention (hereinafter, the "embodiment") is described in detail with reference to the drawings. The following sections describe, in order, an outline of the inference principle, an outline of the diarization system, an outline of the clustering calculation method, the calculation algorithm, and the clustering calculation apparatus.

[Outline of the inference principle]
In this embodiment, dHDP is used as an example of a nonparametric Bayesian model. dHDP is described briefly here. The mathematical model of dHDP is given by Equations (1) to (6). Other nonparametric Bayesian models, such as DPM, may of course be used instead. Details of DPM are given in, for example, N. Ueda and T. Yamada, "Nonparametric Bayesian Models," Oyo Suri (Applied Mathematics), Vol. 17, No. 3, pp. 196-214, 2007.

Here, ~ denotes sampling from a probability distribution. DP(·) denotes a Dirichlet process (an infinite-dimensional distribution), and γ, α_0, a_0, and b_0 are hyperparameters set in advance.

In dHDP, Equation (1) first generates a discrete parameter distribution G_0 with infinitely many components (clusters). In diarization, this distribution G_0 corresponds to the set of speakers and each speaker's share of the speech over the data as a whole. In nonparametric Bayesian models, of which DPM is representative, the estimated solution automatically concentrates on a distribution made up of a small number of parameters (speakers).

H_t in Equation (2) represents the temporal change (distribution change) of the speaker distribution at time t.
w_t in Equation (3) represents the rate (degree) of temporal change of the speaker distribution at time t.
θ_{ti} in Equation (5) represents the cluster parameter at time t, and x_{ti} in Equation (6) represents the sample distribution at time t. The index i denotes the i-th data item at time t.

G_t in Equation (4) is the parameter distribution at time t. In diarization, G_t represents the set of speakers who actually spoke at time t. Equation (4) expresses G_t as a mixture, weighted by w_t, of the distribution G_{t-1} at time t-1 and the distribution change H_t at time t. dHDP therefore allows the sample distribution (the distribution of x_{ti}) to vary from one time to the next. At the same time, because the time-invariant distribution G_0 is also estimated, the speaker clusters across the whole recording are learned as well.

Moreover, because dHDP also learns the rate of change w_t dynamically, it can model data whose rate of change is not constant: the distribution changes drastically when the speaker changes and hardly changes otherwise. Since dHDP models a distribution that changes gradually with time, it is a well-suited probabilistic model for the diarization task, in which the distribution of the observations (samples) changes over time as speakers take turns (turn-taking).
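
Written out, the model described above can take the following form. This is a reconstruction from the surrounding descriptions, assuming the patent follows the standard dHDP formulation; in particular, the base measure symbol H, the observation distribution symbol F, and the assignment of γ to Equation (1) and α_0 to Equation (2) are assumptions rather than a copy of the original equations.

```latex
\begin{align}
G_0 &\sim \mathrm{DP}(\gamma, H) \tag{1}\\
H_t &\sim \mathrm{DP}(\alpha_0, G_0) \tag{2}\\
w_t &\sim \mathrm{Beta}(a_0, b_0) \tag{3}\\
G_t &= (1 - w_t)\,G_{t-1} + w_t\,H_t \tag{4}\\
\theta_{ti} &\sim G_t \tag{5}\\
x_{ti} &\sim F(\theta_{ti}) \tag{6}
\end{align}
```

Under this reading, a value of w_t close to 0 leaves G_t almost equal to G_{t-1}, while a value close to 1 replaces it with the new component H_t, which matches the behavior described for speaker changes.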

[Outline of the diarization system]
FIG. 1 is a configuration diagram outlining a diarization system that includes the clustering calculation apparatus according to the embodiment of the present invention. The diarization of this embodiment is described as using the second method mentioned above (the method that uses information on speaker positions). A conversation among an unknown number of speakers is recorded in advance and used as the input to the diarization system 1. Here, three speakers H_A, H_B, and H_C are assumed to have conversed in a room 101 at the fixed positions shown in FIG. 1. The speech data (recorded conversation data) 102 is time-series data.

The diarization system 1 consists of one large computer or of a plurality of computers. Here, the diarization system 1 comprises three computers: a feature extraction unit 2, a clustering calculation apparatus 3, and an identification unit 4.

The feature extraction unit 2 performs preprocessing such as noise removal and extracts various features suited to diarization. For example, the feature extraction unit 2 extracts DOA (direction of arrival) information from recorded data acquired with a microphone array and outputs it to the clustering calculation apparatus 3. The DOA information (the angle of arrival of the speech) is an estimate of how strong a speech signal was observed from each direction relative to the microphones.

The clustering calculation apparatus 3 estimates the number of speakers and the parameters characterizing each speaker by clustering based on the DOA information. That is, the clustering calculation apparatus 3 clusters the extracted speech features and estimates the number of clusters and the parameters of each cluster. By applying a probabilistic clustering model, it optimizes the number of clusters, the partition of the data, and the parameters simultaneously in the sense of posterior probability maximization.

The identification unit 4 identifies the speech state of the speakers at each time from the clustering result (clustering estimates) obtained by the clustering calculation apparatus 3. The identification unit 4 analyzes the clustering estimates and presents the number of clusters and their positions on a screen display that the user can interpret.

In the graph of the diarization result display 103 illustrated in FIG. 1, the horizontal axis shows time (seconds) and the vertical axis shows direction (speaker position). In this example, rectangular waveforms are displayed for three directions, corresponding to the three speakers H_A, H_B, and H_C. The raised portions of each rectangular waveform represent that speaker's utterances. The display shows turn-taking: first, when speaker H_C finishes speaking, speaker H_A speaks next; when speaker H_C starts speaking again partway through, speaker H_A falls silent and speaker H_B starts speaking. It also shows that there are moments when two speakers speak at the same time.

[Outline of the clustering calculation method]
This section describes how the inference principle is introduced into diarization, and the clustering model.
<Introducing the inference principle into diarization>
Because the clustering calculation apparatus 3 uses a nonparametric Bayesian model (dHDP), the features extracted by the preceding feature extraction unit 2 are formulated here. Let f_{td} be the speech power (DOA information) heard from the direction of angle d (for example, d = -180, -179, ..., 0, ..., 180) at time t. The speech power vector at each time t is then

(f_{t,-180}, f_{t,-179}, ..., f_{t,180}).

The direction d = -180 and the direction d = 180 are the same.

This power vector is converted into a set of one-dimensional vectors x_{ti} so that it fits the clustering model used in this embodiment. Here, for each f_{td}, a set of n_{td} samples with the value g(d) is generated using a threshold constant τ and a parameter μ. The function g(·) is a suitable scaling function chosen for convenience of implementation; for example, a function that maps [-180, 180] to [-0.5, 0.5] can be used. n_{td} is defined by Equation (7). The conversion is not limited to one dimension; it may, for example, be configured to produce a set of three-dimensional vectors x_{ti}.

The function h(·) in Equation (7) returns some positive integer according to the power value, and may be a constant. This embodiment uses, for example, h = 1. This quantization is performed for every angle d, and the observations at time t are collected into a sample set.
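
The conversion from a power vector to a sample set can be sketched as follows. Equation (7) is not reproduced in this text, so the rule used here (emit h(f_td) samples when f_td exceeds τ, otherwise none) is an assumption, and the role of the parameter μ is left out because it cannot be recovered from the description. The function names `scale_angle` and `power_to_samples` are hypothetical.

```python
def scale_angle(d):
    """Assumed scale function g: maps [-180, 180] degrees to [-0.5, 0.5]."""
    return d / 360.0


def power_to_samples(power_by_angle, tau, h=lambda f: 1):
    """Convert {angle d: power f_td} at one time t into a list of samples x_ti.

    Assumption (Equation (7) is not available here):
    n_td = h(f_td) if f_td > tau, else 0. The default h is the constant 1,
    as in the embodiment.
    """
    samples = []
    for d in sorted(power_by_angle):
        f_td = power_by_angle[d]
        n_td = h(f_td) if f_td > tau else 0
        # Each of the n_td samples carries the scaled direction g(d).
        samples.extend([scale_angle(d)] * n_td)
    return samples


if __name__ == "__main__":
    powers = {-180: 0.1, 0: 2.0, 90: 3.0}
    print(power_to_samples(powers, tau=1.0))
```

With h = 1, every direction whose power clears the threshold contributes exactly one sample, so strong speech directions show up as dense groups of samples across consecutive times.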

The clustering problem treated in this embodiment can be viewed as clustering the sample set X_t = {x_1, ..., x_t}. If human speech power is assumed to be far greater than the background noise, a large number of samples whose values reflect speaker positions are observed, so the speaker positions should be estimable as the dominant clusters. The clustering calculation method of this embodiment therefore estimates the number of clusters K and, at the same time, obtains the probability distribution of the variable z_{ti} that indicates which cluster each sample x_{ti} belongs to, as well as the parameters Θ = {θ_k} corresponding to the K clusters.

<Clustering model>
In this embodiment, a dHDP approximate model is used in view of the computational cost and the simplicity of the algorithm. The generative model of the dHDP approximate model is given by Equations (9) to (15) below. The dHDP approximate model is described in I. Pruteanu-Malinici, L. Ren, J. Paisley, E. Wang and L. Carin, "Dynamic Hierarchical Dirichlet Process for Modeling Topics in Time-Stamped Documents," IEEE Transactions on Pattern Analysis and Machine Intelligence, submitted, 2008.

In the dHDP approximate model, the maximum number of clusters K is fixed first. K need only be sufficiently larger than the number of speakers to be estimated (for example, several tens to 100). In practice, K_eff (< K) "effective" clusters end up corresponding to the speakers being estimated. K_eff can be determined by exploiting the fact that, as a result of learning, the weights (mixing ratios) of the remaining clusters, which do not correspond to speakers, automatically become almost 0. In other words, the number of "effective" clusters K_eff is determined automatically in the course of estimation.

<<Equations (9) and (10)>>
In the dHDP approximate model, Equation (9) samples the parameters corresponding to the K clusters. Equation (10) samples the innovation measure H_t (see Equation (16c) below) from a finite-dimensional Dirichlet distribution; more specifically, it samples the mixing ratio π_t. This is sufficient because Equation (16a) and its rearranged form, Equation (16b), follow from Equation (4) above, so the speaker distribution G_t at time t can be expressed purely as a superposition of the H_{1:l} generated up to time t. Here, "H_{1:l}" denotes H_1 through H_l.

<<Equations (11) and (12)>>
Next, in the dHDP approximate model, Equation (11) samples w_t, which represents the degree of temporal change of the speaker distribution G_t, and this w_t is used to compute v_{tl} (l = 1, ..., t) defined by Equation (12). Here, time l denotes time t and the times before it.
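
The weights v_{tl} can be illustrated by unrolling the recursion G_t = (1 - w_t) G_{t-1} + w_t H_t described for Equation (4). Equation (12) itself is not reproduced in this text, so the closed form below is what the unrolling yields rather than a copy of the patent's formula, and it assumes the initial condition G_1 = H_1 (that is, w_1 is treated as 1). The function name `mixture_weights` is hypothetical.

```python
def mixture_weights(w):
    """Return [v_t1, ..., v_tt] given w = [w_1, ..., w_t].

    Unrolling G_t = (1 - w_t) G_{t-1} + w_t H_t gives
        v_tl = w_l * prod_{m = l+1..t} (1 - w_m).
    Assumption: G_1 = H_1, so w_1 is replaced by 1 regardless of its value.
    """
    t = len(w)
    w = [1.0] + list(w[1:])
    v = []
    for l in range(t):
        weight = w[l]
        # Every later time step shrinks the share of H_l by (1 - w_m).
        for m in range(l + 1, t):
            weight *= 1.0 - w[m]
        v.append(weight)
    return v


if __name__ == "__main__":
    v = mixture_weights([1.0, 0.5, 0.2])
    print(v, sum(v))
```

Under the stated assumption the weights always sum to 1, so G_t remains a proper mixture of H_1, ..., H_t.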

<<Equation (13)>>
The hidden variable d_{ti} in Equation (13) is a t-dimensional {0, 1} vector (t is the time; each element is either 0 or 1). Of its t elements at time t, only the l-th element d_{til} takes the value 1 (l ≤ t). The l-th element d_{til} corresponds to the distribution change H_l at time l, which is the distribution change from which the i-th sample x_{ti} at time t is drawn.

<<Equation (14)>>
Similarly, the hidden variable z_{ti} in Equation (14) is a K-dimensional {0, 1} vector.
The hidden variable z_{ti} takes the value 1 only in its k-th element z_{tik}, which corresponds to the cluster (parameter) k from which the sample x_{ti} is actually drawn.

<<Equation (15)>>
The observation x_{ti} in Equation (15) is generated from the parameter θ corresponding to the given cluster number k. Equation (15) is equivalent to Equations (6) and (8) above, written in a different form.
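
The roles of the two hidden variables can be illustrated with a small sampling sketch: d_{ti} selects which innovation H_l a sample comes from, z_{ti} selects the cluster within it, and the observation is drawn from that cluster. The equations themselves are not reproduced in this text, so the categorical draws, the argument layout, and the use of a normal observation with a per-cluster mean and precision are assumptions made for illustration; `sample_observation` is a hypothetical name.

```python
import random


def sample_observation(v_t, pi, means, precisions, rng=random):
    """Draw one (d_ti, z_ti, x_ti) triple.

    v_t[l]     : weight of the (l+1)-th innovation in G_t (sums to 1)
    pi[l][k]   : mixing ratio of cluster k under the (l+1)-th innovation
    means[k], precisions[k] : assumed normal parameters theta_k of cluster k
    """
    n_times = len(v_t)
    n_clusters = len(means)
    # d_ti: one-hot over times l <= t, chosen with probabilities v_t.
    l = rng.choices(range(n_times), weights=v_t)[0]
    # z_ti: one-hot over the K clusters, chosen with the mixing ratio pi_l.
    k = rng.choices(range(n_clusters), weights=pi[l])[0]
    d_ti = [1 if j == l else 0 for j in range(n_times)]
    z_ti = [1 if j == k else 0 for j in range(n_clusters)]
    # x_ti: normal observation; standard deviation is precision ** -0.5.
    x_ti = rng.gauss(means[k], precisions[k] ** -0.5)
    return d_ti, z_ti, x_ti


if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_observation([0.4, 0.4, 0.2],
                             [[0.7, 0.3], [0.1, 0.9], [0.5, 0.5]],
                             means=[-0.3, 0.2], precisions=[400.0, 400.0],
                             rng=rng))
```

Each returned vector has exactly one element equal to 1, which is the one-hot structure the text describes for d_{ti} and z_{ti}.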

<<Observation distribution and prior distribution>>
In the dHDP approximate model, the observation distribution F in Equation (15) and the prior distribution H of the parameters (see Equation (9)) must be specified in advance. In this embodiment, as an example, the observation distribution F is a normal distribution and the prior distribution H of the parameters is its conjugate prior, the Normal-Gamma distribution. The Normal-Gamma distribution is described in Reference 1: C. M. Bishop, "Pattern Recognition and Machine Learning," Springer Japan, 2007.

The Normal-Gamma distribution was adopted in this embodiment because no solution method for the dHDP approximate model using these distributions has been published before, and because the resulting model is applicable to many fields and was considered the most practical. These distributions may nevertheless be replaced with others to suit the purpose and the actual data.

<<Applying the dHDP approximate model to the diarization system>>
In the diarization system 1 of FIG. 1, the dHDP approximate model of Equations (9) to (15) is used to estimate the number of speakers (= the number of clusters), to cluster each sample, and to estimate the parameters of each cluster (= the speaker positions).

Of these, clustering each sample is equivalent to obtaining the distribution p(z_{ti}) of the hidden variable z_{ti} in Equation (14). Because z_{ti} indicates which cluster each sample x_{ti} belongs to, once this clustering result is obtained, the number of samples belonging to each cluster (or its expected value) can be computed. From that number (or its expected value), the mixing ratio of each cluster k (β_k^, described later) can be computed, and from the mixing ratio the number of "effective" clusters K_eff (= the number of speakers K_eff) can be determined. Once the clustering result is available, the parameters {θ_k} of each cluster are also easy to obtain. In this embodiment, the clustering calculation apparatus 3 carries the processing through to the determination of K_eff, but the identification unit 4 may perform that step instead. That is, the clustering calculation apparatus 3 may obtain the clustering result and the parameters {θ_k} of each cluster, and the identification unit 4 may then compute the mixing ratio (β_k^, described later) and determine K_eff.
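
The chain from clustering result to speaker count can be sketched as follows. The patent's own formula for the mixing ratio β_k^ appears later and is not part of this section, so normalizing the expected counts and comparing them with a fixed threshold (`min_ratio`, a hypothetical parameter) are assumptions made only to illustrate the idea.

```python
def effective_clusters(responsibilities, min_ratio=0.01):
    """Estimate mixing ratios and the indices of the "effective" clusters.

    responsibilities[n][k] approximates p(z_n = k) for sample n.
    Assumption: the mixing ratio of cluster k is its expected sample count
    divided by the number of samples, and a cluster is "effective" when
    that ratio exceeds min_ratio.
    """
    n_samples = len(responsibilities)
    n_clusters = len(responsibilities[0])
    expected_counts = [sum(r[k] for r in responsibilities)
                       for k in range(n_clusters)]
    ratios = [count / n_samples for count in expected_counts]
    effective = [k for k, ratio in enumerate(ratios) if ratio > min_ratio]
    return ratios, effective


if __name__ == "__main__":
    resp = [[0.9, 0.1, 0.0],
            [0.8, 0.2, 0.0],
            [0.1, 0.9, 0.0],
            [1.0, 0.0, 0.0]]
    ratios, effective = effective_clusters(resp)
    print(ratios, "K_eff =", len(effective))
```

In this toy input the third cluster receives no samples, so its ratio is 0 and only two clusters count as effective, mirroring how unused clusters drop out when K is set larger than the true number of speakers.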

[Calculation algorithm]
The calculation algorithm is described in detail below in five parts: 1) an online estimation method for the dHDP model, 2) the variational posterior distribution inference process, 3) specification of the observation model and the prior distribution, 4) the estimation results and the method of determining the number of clusters, and 5) methods of speeding up dHDP.

<1) Online estimation method of dHDP model>
This section presents the specific inference algorithm. FIG. 2 is a flowchart showing the overall processing flow of the clustering calculation method according to the embodiment of the present invention, that is, the entire inference process of clustering. Diarization is generally an online computation. However, no online estimation method for the dHDP model has previously been studied. In the present embodiment, an online estimation method for the dHDP model was developed. The inference process denoted by reference numeral 201 in FIG. 2 shows this online estimation method.

The online estimation method is executed at every time T (1 ≦ T ≦ T_total). Here, T denotes the current time of the estimation target, which advances moment by moment, and T_total denotes the final time. In the following, t denotes a calculation target time, which includes past times. How far back into the past the calculation reaches is determined in advance. For example, when the estimation target time is T = 5, the calculation target times may be t = 1, 2, 3, 4, 5, or t = 3, 4, 5. In this example, times back to t = 1 are considered.

When this inference process starts, T is first initialized, that is, T = 1 (step S1). Next, the estimated values of the hidden variables and parameters (hidden variables/parameters) from time 1 to T and the hyperparameters are input (step S2). The observations (samples) from time 1 to T−1 are input (step S3), and the speech power f_T at time T is input (step S4). From this speech power f_T, the observations {x_{T,i}} at time T are generated and input (step S5). The unknowns (unknown hidden variables/parameters) corresponding to time T are then initialized (step S6).

Then, the variational posterior distribution of the hidden variables/parameters is estimated (step S7). After the estimation, T is incremented, that is, T ← T + 1 (step S8). It is then determined whether the input has ended (step S9). If the input has not ended (step S9: No), the process returns to step S2. If, for example, the final time T_total has been exceeded, the input is determined to have ended (step S9: Yes), and the estimation result is output (step S10). Steps S2 to S4 may be performed in any order or in parallel.
Step S6 may be performed at any point as long as it precedes step S7.

As the online estimation method is repeated, observation samples {x_{t,i}} accumulate with each time step, and the variables are re-estimated each time. The estimation result at time T−1 is used as the initial value for the estimation at the next time T.
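
The outer loop of FIG. 2 (steps S1 to S10) can be sketched roughly as follows. This is only an illustrative outline: the function name `online_estimation` and the callables `make_observations`, `init_unknowns`, and `infer_posterior` are hypothetical placeholders for the processing of steps S5, S6, and S7, and are not taken from the patent.

```python
def online_estimation(power_frames, make_observations, init_unknowns,
                      infer_posterior, state=None):
    """Sketch of the online loop of FIG. 2 (all callables are hypothetical)."""
    samples = []                     # observations accumulated so far (S3)
    T = 1                            # S1: initialise the estimation target time
    for f_T in power_frames:         # S4: speech power at time T; loop ends with the input (S9)
        x_T = make_observations(f_T)                 # S5: build observations {x_{T,i}}
        samples.append(x_T)
        state = init_unknowns(state, T)              # S6: initialise unknowns for time T
        # S7: re-estimate; the state from time T-1 serves as the initial value
        state = infer_posterior(samples, state, T)
        T += 1                                       # S8
    return state                                     # S10: output the latest estimate
```

Passing the previous `state` back into `infer_posterior` reflects the warm start described above, where the result at time T−1 initialises the estimation at time T.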

The purpose of inference is to obtain estimates of all unknown variables ({z_{t,i}}, {d_{t,i}}, w, {π_t}, {θ_k}) given all the observed data. From the viewpoint of the probabilistic model, this corresponds to obtaining the posterior distribution of all variables. The present embodiment presents a posterior distribution estimation method based on the variational Bayes method (variational posterior distribution estimation) for the dHDP approximation model. The posterior distribution actually sought is Expression (17), but the variational method instead estimates a distribution that assumes all variables are independent, as in Expression (18) (the variational posterior distribution).

In Expression (18), q(·) denotes a variational posterior distribution. Suppose that the observations up to time T have been obtained. Then the variational posterior estimates q^* of the variables for each time t (≦ T), including times before the current time T, are given by Expressions (19) to (23). The distribution of each variable takes the form of the original distribution (Expressions (9), (10), (11), (13), and (14)) corrected by the information added by the data. Expression (19), however, depends on the observation model F and the prior distribution H of the parameters. The present embodiment therefore uses Expression (36), described later, which is derived from Expression (19).

Because it is difficult to optimize Expressions (19) to (23) simultaneously, the variational method optimizes each variable individually through iterative estimation in the style of the EM (Expectation-Maximization) algorithm. The EM algorithm is a computational method for maximizing over a plurality of variables jointly: it performs overall optimization by alternately repeating a computation step consisting of an E step (Expectation step) and an M step (Maximization step). The EM algorithm is described in Reference Document 1 cited above.

<2) Variational posterior distribution inference process>
FIG. 3 is a flowchart showing the variational posterior distribution inference procedure of FIG. 2, that is, the inference process including the EM algorithm. In the inference process of FIG. 3, the samples {x_{1:T}} from time 1 to T are first input (step S21). The estimated values of the hidden variables/parameters from time 1 to T and the hyperparameters are also input (step S22). Steps S21 and S22 may be performed in either order.

Next, the identifier j representing the number of repetitions of the EM step is initialized, that is, j = 1 (step S23). The E step is then calculated (step S24); the E step updates the estimated values of the parameters from time 1 to T. Subsequently, the M step is calculated (step S25); the M step updates the estimated values of the hidden variables from time 1 to T.

Then, the repetition count j of the EM step is incremented, that is, j ← j + 1 (step S26). It is determined whether the current repetition count j has exceeded a preset threshold (j_max), that is, whether j > j_max (step S27). If j ≦ j_max (step S27: No), the process returns to step S24. If j > j_max (step S27: Yes), it is determined whether the time T has reached a preset value, namely whether T is a multiple of a suitable positive integer t_update (step S28).

If T is a multiple of t_update (step S28: Yes), the hyperparameters are updated (step S29) and the estimation result is stored (step S30). If T is not a multiple of t_update (step S28: No), step S29 is skipped and the estimation result is stored (step S30).

By alternately repeating the E step and the M step j_max times in this way, the variational posterior distributions of Expressions (19) to (23) are obtained. These are computed in the M step.

In the present embodiment, performing steps S28 to S30 as shown in FIG. 3 means that the hyperparameters are estimated each time the time step T has been incremented a suitable positive integer t_update times. Hyperparameters are normally constants given in advance, but executing steps S28 to S30 can yield higher accuracy than when the hyperparameters are fixed. Various known techniques, such as sampling from the posterior distribution, can be used to update the hyperparameters. The hyperparameters themselves can also be estimated by the EM algorithm.
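
As a rough illustration of the control flow of FIG. 3 (steps S23 to S30), the following sketch runs a fixed number of E/M iterations and refreshes the hyperparameters only when T is a multiple of t_update. The callables `e_step`, `m_step`, and `update_hyper` and the default values of `j_max` and `t_update` are assumptions made for this example; the patent does not prescribe them.

```python
def variational_inference(samples, state, T, e_step, m_step, update_hyper,
                          j_max=10, t_update=5):
    """Sketch of FIG. 3; e_step, m_step and update_hyper are hypothetical callables."""
    j = 1                                     # S23: initialise the repetition counter
    while j <= j_max:                         # S27: stop once j > j_max
        state = e_step(samples, state, T)     # S24: update the parameter estimates
        state = m_step(samples, state, T)     # S25: update the hidden-variable estimates
        j += 1                                # S26
    if T % t_update == 0:                     # S28: is T a multiple of t_update?
        state = update_hyper(samples, state, T)   # S29: refresh the hyperparameters
    return state                              # S30: the caller stores this result
```

A fixed iteration count keeps the per-time-step cost predictable, which suits online use; a convergence test could be substituted, but that would be a departure from the flowchart as described.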

≪E step≫
The specific calculation formulas of the E step are given by Expressions (24) to (28). Here, ψ(·) is the psi function (also called the digamma function). Expression (27) depends on the observation model F. The present embodiment therefore uses Expression (37), described later, which is derived from Expression (27).

The E step recalculates the estimated values related to the unknown parameters (θ, π, w). Here, E_x[f(x)] denotes an expectation under the variational distribution; its definition is given in Expression (28).

FIG. 4 is a flowchart showing an example of the calculation procedure of the E step shown in FIG. 3. The E step recalculates all variables related to times up to T. Specifically, the samples {x_{1:T}} from time 1 to T are first input (step S31). The estimated values of the hidden variables/parameters from time 1 to T, the hyperparameters, and the latest M-step results are also input (step S32). Steps S31 and S32 may be performed in either order.

Next, the calculation target time t, which includes past times, is initialized, that is, t = 1 (step S33). Expressions (24) and (25) are calculated for this time t (step S34). The cluster identifier k is then initialized, that is, k = 1 (step S35), and Expression (26) is calculated for this time t and cluster k (step S36). Further, the identifier i of the data at time t is initialized, that is, i = 1 (step S37), and Expression (27) is calculated for the i-th data item at time t with respect to cluster k (step S38).

Then, the data identifier i at time t is incremented, that is, i ← i + 1 (step S39), and it is determined whether i > n_t (step S40). Here, n_t is the number shown in Expression (8). If i ≦ n_t (step S40: No), the process returns to step S38. If i > n_t (step S40: Yes), the process moves to the next cluster, that is, k ← k + 1 (step S41). It is then determined whether all clusters have been processed, that is, whether k > K (step S42). If k ≦ K (step S42: No), the process returns to step S36. If k > K (step S42: Yes), the calculation target time t is advanced, that is, t ← t + 1 (step S43). It is then determined whether the calculation target time t has passed the estimation target time T, that is, whether t > T (step S44). If t ≦ T (step S44: No), the process returns to step S34. If t > T (step S44: Yes), the E-step estimation result for the estimation target time T is stored (step S45).
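
The nesting of the three counters in FIG. 4 may be easier to follow as loops. The sketch below shows only the iteration order; `update_time`, `update_cluster`, and `update_sample` are hypothetical stand-ins for Expressions (24) and (25), Expression (26), and Expression (27), whose bodies are not reproduced here, and `n` is assumed to be a mapping from each time t to the sample count n_t.

```python
def e_step_sweep(T, K, n, update_time, update_cluster, update_sample):
    """Iteration order of the E step in FIG. 4 (the update callables are hypothetical)."""
    for t in range(1, T + 1):              # outer loop over calculation target times (S33, S44)
        update_time(t)                     # S34: Expressions (24) and (25)
        for k in range(1, K + 1):          # loop over clusters (S35, S41, S42)
            update_cluster(t, k)           # S36: Expression (26)
            for i in range(1, n[t] + 1):   # loop over samples at time t (S37, S39, S40)
                update_sample(t, k, i)     # S38: Expression (27)
```

Written this way, the cost of one sweep is proportional to K times the total number of samples from time 1 to T, which is why the work grows as T advances.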

≪M step≫
The specific calculation formulas of the M step are given by Expressions (29) to (32). The M step recalculates the estimated values related to the hidden variables (z_t, d_t).

FIG. 5 is a flowchart showing an example of the calculation procedure of the M step shown in FIG. 3. Like the E step, the M step recalculates all variables related to times up to T. Specifically, the samples {x_{1:T}} from time 1 to T are first input (step S51). The estimated values of the hidden variables/parameters from time 1 to T, the hyperparameters, and the latest E-step results are also input (step S52). Steps S51 and S52 may be performed in either order.

In the following, descriptions of processing similar to that of the E step are omitted as appropriate. In the M step, after setting t = 1 (step S53), i = 1 (step S54), and k = 1 (step S55), Expressions (29) and (30) are calculated (step S56). Then k ← k + 1 is set (step S57), and the process returns to step S56 until k > K. When k > K (step S58: Yes), the time identifier l is initialized, that is, l = 1 (step S59). Expressions (31) and (32) are then calculated for the i-th data item with respect to time t, time l, and cluster k (step S60).

Then, the time l is incremented, that is, l ← l + 1 (step S61). The process returns to step S60 until all calculations have been completed for time l, which ranges over the times from the past up to time t. When l > t (step S62: Yes), i ← i + 1 is set (step S63), and the process returns to step S55 until all calculations for i are complete. When i > n_t (step S64: Yes), t ← t + 1 is set (step S65), and the process returns to step S54 until all calculations for t are complete. When t > T (step S66: Yes), the estimation result is stored (step S67).

<3) Specification of observation model and prior distribution>
The variational posterior distribution inference process described with reference to FIGS. 3 to 5 and Expressions (19) to (27) was presented for a general observation model and prior distribution. That is, Expressions (19) and (27) depend on the observation model F = p(x_{t,i} | θ_k) and the prior distribution H of the parameter θ_k. In the present embodiment, the observation model F is assumed to be a normal distribution, and the prior distribution H is assumed to be a Normal-Gamma distribution. The Normal-Gamma distribution is the model represented by Expressions (33) to (35).

For this model, Expressions (19) and (27) take the forms of Expressions (36) and (37), respectively. The hyperparameters in Expressions (36) and (37) are given by Expressions (38) to (41), and the variables in Expressions (38) to (41) are given by Expressions (42a) to (42c). The calculations of Expressions (38) to (42) are completed within the E step.
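
For orientation, the textbook conjugate update of a Normal-Gamma prior under a normal likelihood with soft assignment weights is sketched below for scalar data. This is the standard form only; Expressions (38) to (42) are not reproduced in this section and may be parameterised differently (for example, for vector observations). The names `m0`, `kappa0`, `a0`, `b0` (prior hyperparameters) and `rs` (per-sample weights such as q(z_{t,i} = k)) are assumptions of this example.

```python
def normal_gamma_update(xs, rs, m0, kappa0, a0, b0):
    """Textbook weighted Normal-Gamma posterior update for scalar data (illustrative only)."""
    N = sum(rs)                                   # effective sample count for the cluster
    if N == 0:
        return m0, kappa0, a0, b0                 # no data assigned: the posterior equals the prior
    xbar = sum(r * x for r, x in zip(rs, xs)) / N            # weighted mean
    S = sum(r * (x - xbar) ** 2 for r, x in zip(rs, xs))     # weighted scatter
    kappa = kappa0 + N                            # precision scaling of the mean
    m = (kappa0 * m0 + N * xbar) / kappa          # posterior mean location
    a = a0 + N / 2.0                              # Gamma shape
    b = b0 + 0.5 * S + kappa0 * N * (xbar - m0) ** 2 / (2.0 * kappa)   # Gamma rate
    return m, kappa, a, b
```

A cluster that receives almost no weight keeps hyperparameters close to the prior, which is consistent with the later observation that near-empty clusters carry little information.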

<4) Estimation results and method of determining the number of clusters>
≪Method of determining the number of clusters≫
The EM algorithm always maintains K clusters, but as the estimation proceeds, only a small number of clusters receive large mixing ratios, and the size of the other clusters (cluster size) becomes almost zero. The expected number of samples assigned to cluster k at time t can be calculated from the definition in Expression (43).

In the present embodiment, the number of “effective” clusters K_eff is determined from the ratio of the sums over time t of ‖z_{t,k}‖ defined in Expression (43). For example, the mixing ratio of each cluster k can be estimated by Expression (44a). Under this rule, cluster k is judged to be an “effective” cluster if the condition of Expression (44b) holds.

Under this rule, the total number of such clusters k is taken as the number of “effective” clusters K_eff. This guarantees that the number of “effective” clusters is smaller than the maximum number of clusters K, that is, K_eff < K.
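
The rule can be illustrated with a small sketch. Expressions (43), (44a), and (44b) are not reproduced in this section, so the code makes three labeled assumptions: ‖z_{t,k}‖ is taken to be the sum over i of the responsibilities q(z_{t,i} = k); the mixing ratio is taken to be the normalised total over t; and the validity condition is taken to be a strict comparison against a threshold, defaulting to 1/K. The data layout `resp[t][i][k]` and the function name are hypothetical.

```python
def effective_clusters(resp, threshold=None):
    """Estimate mixing ratios and 'effective' clusters (assumed forms of Eqs. 43, 44a, 44b)."""
    K = len(resp[0][0])
    totals = [0.0] * K
    for frame in resp:                    # sum over times t
        for r in frame:                   # sum over samples i at time t
            for k, p in enumerate(r):
                totals[k] += p            # accumulates the expected count of cluster k
    grand = sum(totals)
    beta_hat = [c / grand for c in totals]        # assumed form of the mixing ratio
    if threshold is None:
        threshold = 1.0 / K               # assumption: default threshold, not from the patent
    effective = [k for k, b in enumerate(beta_hat) if b > threshold]
    return beta_hat, effective
```

With the assumed default threshold of 1/K and a strict inequality, not all K ratios can pass at once because they sum to 1, so the count of effective clusters stays below K, in line with the guarantee stated above.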

≪Estimation results to be stored≫
The estimation results to be stored are, first, the variational posterior distributions determined by the variables estimated by the EM algorithm and, second, the quantities obtained from the clustering result, such as the number of “effective” clusters K_eff, ‖z_{t,k}‖ shown in Expression (43), and β_k^ shown in Expression (44a). Here, ^ denotes a symbol placed above the character β.
In particular, the second group of estimates (the number of “effective” clusters K_eff, ‖z_{t,k}‖ of Expression (43), and β_k^ of Expression (44a)) consists of important quantities that can be used by the identification unit 4 in FIG. 1. The calculations of Expressions (43), (44a), and (44b) are performed within the M step.

<5) Methods of speeding up dHDP>
As is clear from the processing flows of FIGS. 4 and 5, a characteristic of the online estimation method in the clustering calculation method of the present embodiment is that the number of variables to be estimated grows as the time steps advance. Computation-saving techniques were therefore developed with real-time operation in mind. The present embodiment uses three broad types of speed-up method (speed-up 1, speed-up 2, and speed-up 3) for online speaker clustering with dHDP.

≪Speed-up 1: Forgetting data≫
Expression (16a) means that calculating the speaker distribution G_t at time t requires the distribution changes H_l for times 1 ≦ l ≦ t. Information from time 1 onward must therefore be retained, so the amount of computation for inference grows as the time step t advances. Now introduce the following assumption: suppose that a change of speakers occurs at a time l < t. The speaker distribution H is then expected to change greatly at that point. In Expression (4), this means that w_l ≈ 1, so the influence of G_{l−1} is almost eliminated. Comparing this with Expressions (16a) and (16b) gives v_{t1} ≈ … ≈ v_{l(t−1)} = 0. In practice, therefore, only the distributions from the distribution change H_l at time l onward contribute to the inference of G_t. It follows that s_{t,i,l} (see Expression (30)), which corresponds to the posterior distribution of the variables {d_{t,i}} representing the distribution selection, can also be expected to be zero for the most part.

Wherever s appears in the EM algorithm, it appears as a product of s with a constant or another variable, so the terms for which s_{t,i,l} = 0 need not be calculated, and the variables paired with those s need not be accessed. From this observation, computation time can be reduced by not accessing (that is, by forgetting), in the EM update formulas, variables and constants older than a suitable time step width W_1.

The simplest way to determine this time step width W_1 is to fix a suitable constant in advance. Alternatively, the observation above suggests the following method: from the estimate of s_{t,i} at each time, find l such that s_{t,i,1} = … = s_{t,i,l} = 0, and set W_1 = t − l. Such an l can be found by searching for the largest l that satisfies the relation of Expression (45) for a suitable threshold th (< 1.0). In this case, the time width W_1 for forgetting data changes dynamically according to the inference result.
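
One possible reading of the dynamic rule is sketched below. Expression (45) is not reproduced in this section, so the criterion used here is an assumption: the cumulative mass of s over the oldest l time steps must stay below th. The input `s_t` is a hypothetical list in which `s_t[l-1]` holds a representative value of s_{t,i,l} for l = 1, …, t (for example, an average over i).

```python
def forgetting_width(s_t, th=0.05):
    """Pick W_1 = t - l for the largest l whose cumulative s mass stays below th.

    The cumulative-mass test is an assumed stand-in for Expression (45).
    """
    t = len(s_t)
    cumulative = 0.0
    l_max = 0                         # l = 0 means nothing can be forgotten yet
    for l, s in enumerate(s_t, start=1):
        cumulative += s
        if cumulative < th:
            l_max = l                 # time steps 1..l still carry negligible mass
        else:
            break
    return t - l_max                  # number of recent time steps to keep
```

When a speaker change has occurred recently, most of the mass sits on the latest steps, so `l_max` is large and the retained window is short; with no change, the window stays long.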

≪Speed-up 2: Limiting the estimation time width≫
The online estimation method of FIG. 2 was described as performing EM estimation of all variables at every time T. This means that variables belonging to early time steps are re-estimated by EM many times. Because the EM algorithm computes a locally optimal solution, repeated computation converges to a single solution. The variables belonging to early time steps are therefore likely to have already converged, even without further re-estimation.

From this observation, the variables re-estimated by the EM algorithm can be limited, using a suitable time width W_2, to those in the range T − W_2 ≦ t ≦ T. This keeps the number of variables to be estimated constant rather than growing linearly with the time step T. The simplest way to determine the time width W_2 is to fix a suitable constant in advance.
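
The restricted range can be written as a small helper. The function name is hypothetical, and clamping the lower bound at time 1 (for the early steps where T − W_2 would fall below 1) is an assumption added for this example.

```python
def reestimation_times(T, W2):
    """Times t with T - W2 <= t <= T, clamped at 1 (the clamp is an assumption)."""
    start = max(1, T - W2)
    return range(start, T + 1)
```

Once T exceeds W_2, the range always contains W_2 + 1 time steps, so the per-step work no longer depends on T.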

≪Speed-up 3: Limiting re-estimation of cluster variables≫
Because dHDP is a nonparametric Bayesian model, even if the maximum number of clusters K is prepared in the model in advance, only the K_eff (< K) “effective” clusters become substantive clusters in practice; the mixing ratios of the other clusters become almost zero, so those clusters are effectively eliminated. Such eliminated clusters carry no useful information, so estimating parameters and mixing ratios for them is wasted computation.

Accordingly, in the operations that actually re-estimate variables for each cluster, that is, those indexed by the cluster variable k such as Expressions (26), (27), and (29), the number of re-estimations can be reduced stochastically (on a probabilistic basis). Three example methods are given below.

The first reduction method is a simple one: for every cluster, whether to re-estimate is decided at random each time with probability c (≦ 1.0).
The second reduction method increases or decreases the number of re-estimations by the EM algorithm according to the mixing ratio β_k^ of each cluster k (see Expression (44)).
The third reduction method re-estimates the cluster every time when the condition of Expression (44b) is satisfied, that is, when the target cluster is an “effective” cluster, and otherwise updates cluster k only stochastically.
With p_update(k) denoting the probability of updating each cluster k, the first to third reduction methods can be expressed as Expressions (46) to (48), respectively. In particular, the third reduction method can reduce the amount of computation to about K_eff/K with almost no loss of estimation accuracy.
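
The three reduction methods might be sketched as follows. Expressions (46) to (48) are not reproduced in this section, so the concrete forms below are assumptions: a constant c for the first method, a clipped multiple of the mixing ratio for the second, and 1 for effective clusters with a small constant c otherwise for the third. All function names are hypothetical. The selection test p_update(k) > u, with u a uniform random number on [0, 1], follows the comparison described for the flowcharts.

```python
import random

def p_update_constant(k, c=0.5):
    """First method: the same probability c for every cluster."""
    return c

def p_update_by_ratio(k, beta_hat):
    """Second method (assumed form): probability grows with the mixing ratio."""
    return min(1.0, beta_hat[k] * len(beta_hat))

def p_update_effective(k, effective, c=0.1):
    """Third method (assumed form): always update effective clusters, others with probability c."""
    return 1.0 if k in effective else c

def clusters_to_update(K, p_update, rng=random.random):
    """Draw a uniform u per cluster and keep the clusters with p_update(k) > u."""
    return [k for k in range(K) if p_update(k) > rng()]
```

For example, `clusters_to_update(K, lambda k: p_update_effective(k, {0, 2}))` always selects clusters 0 and 2 and only occasionally selects the remaining ones.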

FIGS. 6 and 7 show an example of the EM algorithm (the E step and the M step, respectively) when all three speed-up methods (speed-up 1, speed-up 2, and speed-up 3) are used. The three methods can also be used independently of one another.

FIG. 6 is a flowchart showing the calculation procedure of the E step in the clustering calculation method according to the embodiment of the present invention. Parts that differ from the flowchart of FIG. 4 are drawn with thick and broken lines, and only the differing processing is described below. In FIG. 6, speed-up 1 consists of steps S31A and S32A, as indicated by reference numeral 301. Here, the samples {x_{T−W_1:T}} from time T−W_1 to T are input (step S31A). The estimated values of the hidden variables/parameters from time T−W_1 to T, the hyperparameters, and the latest M-step results are also input (step S32A). Speed-up 2 consists of step S33A, as indicated by reference numeral 302. Here, the initial value of t is set to t = T−W_2 instead of 1 (step S33A).

Speed-up 3 is processing performed after step S35 and consists of steps S71 and S72, as indicated by reference numeral 303. Here, a uniform random number u on [0, 1] is generated (step S71), and it is determined whether p_update(k) > u (step S72). If p_update(k) > u (step S72: Yes), steps S36 to S45 described above are executed; that is, Expressions (26) and (27) are calculated. However, if k ≦ K in step S42 (step S42: No), the process returns to step S71. If p_update(k) ≦ u in step S72 (step S72: No), steps S36 to S40 are skipped and the process proceeds to step S41; that is, Expressions (26) and (27) are not calculated.

FIG. 7 is a flowchart showing the calculation procedure of the M step in the clustering calculation method according to the embodiment of the present invention. Parts that differ from the flowchart of FIG. 5 are drawn with thick lines and broken lines; only the differing processing is described here, and the rest is omitted as appropriate. In FIG. 7, speed-up 1 consists of steps S51A and S52A, as indicated by reference numeral 401. Here, the samples {x_{T−W₁:T}} from time T−W₁ to T are input (step S51A). The estimated values of the hidden variables and parameters from time T−W₁ to T, the hyperparameters, and the latest result of the E step are also input (step S52A). Next, speed-up 2 consists of step S53A, as indicated by reference numeral 402. Here, the initial value of t is set to t = T−W₂ instead of 1 (step S53A).

Furthermore, speed-up 3 is processing performed after step S55 and consists of steps S81 and S82, as indicated by reference numeral 403. Here, a uniform random number u on [0, 1] is generated (step S81), and it is determined whether p_update(k) > u holds (step S82). If p_update(k) > u (step S82: Yes), steps S56 to S67 described above are executed; that is, equations (29) to (32) are calculated. However, if k ≤ K in step S58 (step S58: No), the process returns to step S81. On the other hand, if p_update(k) ≤ u in step S82 (step S82: No), step S56 is skipped and the process proceeds to step S57; that is, equations (29) and (30) are not calculated.

［Clustering calculation device］
FIG. 8 is a functional block diagram showing an example of the configuration of the clustering calculation device according to the embodiment of the present invention. The clustering calculation device 3 implements, for example, the inference process of FIG. 2, and consists of a computer and a program installed on that computer. The computer includes an arithmetic unit such as a CPU, storage devices (storage means) such as memory and a hard disk, input devices such as a mouse and keyboard that detect input from outside, an interface device that exchanges various information with the outside, and a display device such as an LCD (Liquid Crystal Display).

The clustering calculation device 3 is realized through cooperation of hardware and software, with the hardware resources described above controlled by the program. As shown in FIG. 8, it includes the storage means 10 and, as functions of the CPU, a variational posterior distribution inference unit 30, a power vector reading unit 21, an input control unit 22, an allocation unit 23, an unknown initialization unit 24, a power vector writing unit 25, an observation amount generation unit 26, an end determination unit 27, and an output control unit 28.

＜Storage means＞
The storage means 10 consists of ROM, RAM, an HDD, and the like. It is divided into a program storage area, a setting data storage area, a temporary storage area for calculation data, an estimation result storage area, and so on, and stores various information such as commands, data, and programs. For example, as shown in FIG. 9, the estimation result storage area holds the following data: the posterior distribution estimates 11 of the unknowns, the hyperparameters 12, the estimates of the E step (collectively denoted E step 13), the estimates of the M step (collectively denoted M step 14), the power vectors 15, and the observation amounts (samples) 16.

＜Variational posterior distribution inference unit＞
The main variational posterior distribution inference unit 30 implements, for example, the inference processes of FIGS. 3, 6, and 7. Here it includes an E-step calculation unit 31, an M-step calculation unit 32, an EM convergence determination unit 33, a parameter update condition determination unit 34, and a hyperparameter update unit 35. Details are described later.

＜Power vector reading unit＞
The power vector reading unit 21 sequentially reads the power vector f_T for the estimation target time T and passes it to the input control unit 22.

＜Input control unit＞
On obtaining the power vector f_T, the input control unit 22 forwards it unchanged to the power vector writing unit 25 and outputs the necessary commands to the allocation unit 23, the unknown initialization unit 24, the observation amount generation unit 26, and the end determination unit 27.

＜Allocation unit＞
In accordance with a command from the input control unit 22, the allocation unit 23 sequentially generates the unknowns (parameters and hidden variables) of equations (9) to (14) described above for the current T (= 1, 2, …) and for the index i corresponding to the angle d (180, −179, …), with an initial value of, for example, 0, and stores them in the storage means 10 as the posterior distribution estimates 11 of the unknowns. In other words, the allocation unit 23 assigns T and i to the unknowns to be estimated. For this reason, it is labeled as the allocation unit 23 for T and i in FIG. 9. The initial values of the hyperparameters 12 used in the dHDP approximate model are stored in the storage means 10 in advance.

＜Unknown initialization unit＞
In accordance with a command from the input control unit 22, the unknown initialization unit 24 sequentially generates, for the current T and the index i corresponding to the angle d, the left-hand-side parameters of equations (24) to (26) and (37) described above for use in the E-step calculation (initial value, for example, 0), and stores them in the storage means 10 as the initial values of E step 13. It likewise sequentially generates the left-hand-side parameters of equations (29) to (32) for use in the M-step calculation (initial value, for example, 0) and stores them in the storage means 10 as the initial values of M step 14. In addition, it sequentially generates the left-hand-side parameters of equations (36) and (20) to (23) for use in calculating the EM estimates (initial value, for example, 0) and overwrites (updates) the posterior distribution estimates 11 of the unknowns in the storage means 10.

＜Power vector writing unit＞
The power vector writing unit 25 sequentially stores the power vector f_T obtained from the input control unit 22 in the storage means 10 as the power vector 15.

＜Observation amount generation unit＞
In accordance with a command from the input control unit 22, the observation amount generation unit 26 reads the power vector 15 for the current T from the storage means 10 and converts the power value at each angle d, according to a predetermined rule, into data consisting of identifiers i (n_t of them). It thereby generates observation amounts (samples) and sequentially stores them in the storage means 10 as the observation amounts 16. In the present embodiment, equations (7) and (8) described above are used as the predetermined rule.

＜End determination unit＞
The end determination unit 27 determines that input of the power vectors f_T has ended when the input signal (command) from the input control unit 22 has been absent for a predetermined period, and notifies the output control unit 28. In the present embodiment, input is determined to have ended when the final time T_total is reached.

＜Output control unit＞
On receiving the end notification, the output control unit 28 obtains the posterior distribution estimates 11 of the unknowns from the storage means 10 as the final estimates and outputs them to the identification unit 4.

［Details of the variational posterior distribution inference unit］
≪E-step calculation unit≫
For the current processing time T, the E-step calculation unit 31 reads the samples, the hyperparameters, the M-step results, and so on from the storage means 10 for each calculation target time t (t ≤ T), including past times. It then evaluates equations (24) to (26) and (37) for all calculation target times t (t ≤ T), including past times, and stores the results in the storage means 10 as E step 13.

≪M-step calculation unit≫
For the current processing time T, the M-step calculation unit 32 reads the samples, the hyperparameters, the E-step results, and so on from the storage means 10 for each calculation target time t (t ≤ T), including past times. It then evaluates equations (29) to (32) for all calculation target times t (t ≤ T), including past times, and stores the results in the storage means 10 as M step 14. Finally, using these stored results together, it evaluates equations (20) to (23) and (36) for all calculation target times t (1 ≤ t ≤ T), including past times, and overwrites (updates) the posterior distribution estimates 11 of the unknowns in the storage means 10 with the results.

≪EM convergence determination unit≫
The EM convergence determination unit 33 decides whether the EM algorithm has converged by checking whether the number of repetitions j of one set of processing, consisting of an E step and an M step, has reached a threshold (j_max). The threshold (j_max) is set in advance. If the EM algorithm has not converged, the EM convergence determination unit 33 performs control to repeat the E step and the M step. If it has converged, the unit performs control to stop the E-step and M-step processing. In the present embodiment, the EM convergence determination unit 33 notifies the parameter update condition determination unit 34 of the current value of T regardless of whether convergence has been reached.
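The iteration-count criterion can be sketched as a simple loop. The names `run_em`, `e_step`, `m_step`, and `state` are hypothetical placeholders; the actual E-step and M-step updates are the equations referenced above and are not reproduced here.

```python
def run_em(e_step, m_step, state, j_max):
    """Alternate E and M steps until the repetition count j reaches j_max.

    e_step, m_step: callables taking and returning the current state
    (placeholders for the update equations of the description).
    """
    j = 0
    while j < j_max:            # "not converged" means j has not reached j_max
        state = e_step(state)
        state = m_step(state)
        j += 1
    return state, j


# Toy stand-ins for the two steps, used only to exercise the control flow.
final_state, iterations = run_em(lambda s: s + 1, lambda s: s * 2, 0, 3)
```

Because the stopping rule depends only on j, the cost of each time step is bounded in advance, which suits the online setting described in this embodiment.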

≪Parameter update condition determination unit≫
The parameter update condition determination unit 34 determines whether the received processing time T equals a preset value (a multiple of a suitable positive integer t_update). If it does, the parameter update condition determination unit 34 notifies the hyperparameter update unit 35.
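As a minimal sketch, the update condition amounts to a divisibility test on T. The function name `should_update_hyperparameters` is an assumption made for illustration.

```python
def should_update_hyperparameters(T, t_update):
    """Return True when the processing time T is a multiple of t_update."""
    if t_update <= 0:
        raise ValueError("t_update must be a positive integer")
    return T % t_update == 0


# Times in 1..10 at which the hyperparameter update unit would be notified
# when t_update = 5.
update_times = [T for T in range(1, 11) if should_update_hyperparameters(T, 5)]
```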

≪Hyperparameter update unit≫
On receiving the notification, the hyperparameter update unit 35 updates the hyperparameters 12 stored in the storage means 10 by an arbitrary method. The E-step calculation unit 31 and the M-step calculation unit 32 can then use the updated hyperparameters from the next timing at which the processing time T equals a multiple of the preset positive integer. The parameter update condition determination unit 34 and the hyperparameter update unit 35 correspond to the processing of steps S28 to S30 described above. These components are optional, however.

The clustering calculation device 3 can be realized by running a general-purpose computer under a program (a clustering calculation program) that causes it to function as each of the means constituting the clustering calculation device 3. This program can be provided over a communication line or distributed on a recording medium such as a CD-ROM. On a computer with the program installed, the CPU loads the program stored in ROM or the like into RAM, and the computer thereby achieves the same effects as the clustering calculation device 3.

According to the present embodiment, probabilistic clustering is applied to the speaker clustering problem in diarization, which estimates, from a recording of a conversation among multiple speakers, the number of participating speakers, the position of each speaker, and the timing of each speaker's utterances. The number of speakers can therefore be estimated easily, without the parameter setting or search required by conventional methods.
Further, according to the present embodiment, adopting the dHDP approximate model makes it possible to appropriately model situations in which the speakers taking part in the conversation change over time. As a result, more accurate speaker clustering can be achieved.
Furthermore, according to the present embodiment, using the online estimation method for dHDP together with its speed-up methods makes fast inference possible. Real-time inference is also possible, subject to a trade-off between accuracy and time.

An embodiment of the present invention has been described above, but the invention is not limited to it and can be practiced in any form that does not depart from its gist. For example, the clustering calculation device 3 assumes a Normal-Gamma distribution in the dHDP approximate model, but another distribution may be assumed. In that case, formulas corresponding to equations (36) and (37) and the related expressions need only be re-derived for the assumed distribution.

In the present embodiment, the clustering calculation device 3 performs inference with the dHDP approximate model, but it may instead use, for example, the dHDP model itself or another nonparametric Bayesian model. If another nonparametric Bayesian model is used, it is preferably one that models continuous temporal change of a probability distribution.

In the present embodiment, to cluster on the basis of speaker position, the feature extraction unit 2 extracts, as one example, DOA information (the arrival angle of the speech) using a microphone array. Various other features can be extracted as long as they are suitable for diarization. For example, to cluster on the basis of each speaker's voice quality, MFCC features (Mel Frequency Cepstrum Coefficient) can be extracted. Such features can also be used with the algorithm described above.

To confirm the effect of the present invention, the performance of the clustering calculation device according to the present embodiment was measured. As a first stage, an experiment was conducted to verify whether dHDP can produce a clustering with the designed number of clusters and appropriate parameters (clustering verification experiment). As a second stage, an experiment was conducted to evaluate diarization accuracy using the resulting clusters (evaluation of diarization accuracy by DER). In both experiments, the performance of the clustering calculation device was checked on artificial speech data and real speech data.

＜Experimental data＞
≪Artificial speech data≫
The artificial speech data simulate a situation in which three speakers alternate between speaking and not speaking. The data consist of DOA data (speech arrival angles) computed at a time step of 64 [msec] and speech/non-speech decisions from a VAD (voice activity detector). Almost no noise is superimposed on the artificial speech data. In each experiment, non-speech segments were removed by thresholding on the VAD decisions, yielding artificial speech data in the form of a continuous sequence of 422 steps. To stabilize the sample distribution at each time step to some degree, the 422 steps were grouped, without overlap, into longer meta-steps of several steps each. In each experiment, 5 steps of data were grouped into one meta-step, and the meta-step index was used as t and T. The number of steps is therefore T_total = ceil(422/5) = 85, where ceil denotes rounding up. The artificial speech data also contain segments in which several speakers talk simultaneously.
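The grouping into meta-steps can be reproduced with a short sketch. The function name `group_into_metasteps` and the use of integer frame indices as stand-ins for the per-step data are assumptions made for illustration.

```python
import math


def group_into_metasteps(frames, size):
    """Split frames into consecutive, non-overlapping groups of `size` frames.

    The last group may be shorter when len(frames) is not a multiple of size.
    """
    return [frames[i:i + size] for i in range(0, len(frames), size)]


frames = list(range(422))                 # stand-ins for the 422 time steps
metasteps = group_into_metasteps(frames, 5)
T_total = len(metasteps)                  # equals math.ceil(422 / 5) = 85
```

The final meta-step holds only the two remaining steps, which is why the count is rounded up rather than down.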

FIG. 9 shows the distribution of the average power vector over all times for the artificial speech data. In FIG. 9, the horizontal axis shows the DOA data, that is, the arrival direction angle of the speech [deg], and the vertical axis shows the time-averaged power [dB]. As FIG. 9 shows, three peaks in the power distribution, one for each speaker, can be observed.

≪Real speech data≫
The real speech data are recordings of actual conversations among multiple speakers. The four data sets described in Non-Patent Document 1 were used as the real speech data. Details of the four data sets are given in Table 1. In Table 1, CP denotes crossword puzzle data, DC denotes discussion data, and CN denotes conversation data.

Each real speech data set is 300 seconds long. FIG. 10 shows the distribution of the average power vector for each of these data sets. The horizontal and vertical axes of FIG. 10 are the same as in the graph of FIG. 9, although the time-averaged power is of a lower order of magnitude.

For CP1 in FIG. 10(a) and CP2 in FIG. 10(b), the power distribution has four peaks, matching the four speakers listed in Table 1, so comparatively good clustering results are expected. For DC in FIG. 10(c) and CN in FIG. 10(d), on the other hand, the three peaks corresponding to the three speakers listed in Table 1 cannot be clearly observed, so estimating the correct number of clusters is expected to be difficult.

［Clustering verification experiment］
As the first stage, clustering performance was checked. Speaker clusters were estimated online with DPM (Reference Example 1) and with dHDP (Example 1). In this clustering verification experiment, the final clustering result was obtained by counting as valid only those clusters whose final mixing ratio at the final time T_total exceeded the chance level (1/K). DPM (Reference Example 1) and dHDP (Example 1) were then compared on the resulting clusterings to judge whether each matched the distribution and number of speakers. The chance level is the probability of a match occurring by chance.
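The counting rule can be sketched as follows. The function name `valid_clusters` and the example mixing ratios are hypothetical, and K is taken here to be the number of clusters for which a mixing ratio is available.

```python
def valid_clusters(mixing_ratios):
    """Return the (0-based) indices of clusters whose mixing ratio exceeds 1/K."""
    K = len(mixing_ratios)
    chance_level = 1.0 / K
    return [k for k, ratio in enumerate(mixing_ratios) if ratio > chance_level]


# Hypothetical final mixing ratios for K = 4 clusters (chance level 0.25).
kept = valid_clusters([0.40, 0.35, 0.20, 0.05])
estimated_num_speakers = len(kept)
```

Clusters that absorb only a small share of the samples fall below 1/K and are discarded, so the count of surviving clusters serves as the estimated number of speakers.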

≪Case of artificial speech data≫
FIG. 11 shows the results of applying DPM (Reference Example 1) and dHDP (Example 1) online to the artificial speech data of FIG. 9. In FIG. 11, the horizontal axis shows the normalized angle, that is, the DOA data converted to a dimensionless quantity by a function mapping [−180:180] to [−0.5:0.5]. The vertical axis shows the value of the probability density function (p.d.f.), which is dimensionless.
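One function with the stated range mapping is a simple linear scaling. The description does not give the exact function, so the linear form and the name `normalize_angle` below are assumptions.

```python
def normalize_angle(angle_deg):
    """Map an angle in [-180, 180] degrees linearly onto [-0.5, 0.5].

    Assumed linear form; the description only states the input and
    output ranges of the conversion.
    """
    if not -180.0 <= angle_deg <= 180.0:
        raise ValueError("angle must lie in [-180, 180] degrees")
    return angle_deg / 360.0


normalized = [normalize_angle(a) for a in (-180, -90, 0, 90, 180)]
```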

Comparing the dHDP result (Example 1) in FIG. 11(b) with the graph of FIG. 9 shows that dHDP obtained the correct number of clusters and parameters. With DPM (Reference Example 1) in FIG. 11(a), on the other hand, the number of clusters was 1, which was confirmed to be an inappropriate result for the artificial speech data of FIG. 9.

≪Case of real speech data (CP1, CP2)≫
FIG. 12 shows the results of applying DPM (Reference Example 2) and dHDP (Example 2) online to the real speech data (CP1) of FIG. 10(a). The axes of FIG. 12 are the same as in the graph of FIG. 11. Comparing the dHDP result (Example 2) in FIG. 12(b) with the graph of FIG. 10(a) shows that dHDP obtained the correct number of clusters and parameters. With DPM (Reference Example 2) in FIG. 12(a), on the other hand, the data were split into a large number of clusters, which was confirmed to be an inappropriate result for CP1 of FIG. 10(a). The same tendency was confirmed for the CP2 data; the results of DPM (Reference Example 3) and dHDP (Example 3) for that case are omitted.

≪Case of real speech data (DC)≫
FIG. 13 shows the results of applying DPM (Reference Example 4) and dHDP (Example 4) online to the real speech data (DC) of FIG. 10(c). The axes of FIG. 13 are the same as in the graph of FIG. 11. Comparing the dHDP result (Example 4) in FIG. 13(b) with the graph of FIG. 10(c) and the number of speakers in Table 1 shows that dHDP did not yield three clusters, the number of speakers. However, the normalized-angle positions of the three largest clusters by size (cluster size in the figure), "cluster 4", "cluster 6", and "cluster 14", corresponded to the speaker positions in the DC data. Here, the size (cluster size in the figure) is defined as the value given by the denominator on the right-hand side of equation (44) described above. The two smallest clusters were noise clusters.

With DPM (Reference Example 4) in FIG. 13(a), on the other hand, the data were split into even more clusters, and the three largest clusters did not each correspond to a speaker position. From this point as well, dHDP (Example 4) is considered to have achieved more accurate clustering than DPM (Reference Example 4). With DPM (Reference Example 4), the normalized-angle positions of the three largest clusters, "cluster 2", "cluster 4", and "cluster 5", corresponded to only two of the speaker positions in the DC data. Even the clusters ranked fourth to sixth did not correspond to the remaining speaker position. The CN data are considered to show the same tendency as the DC data.

［Evaluation of diarization accuracy by DER］
Following the first-stage clustering verification experiment, the second stage evaluated performance as clustering for diarization using the DER (diarization error rate). DER is a measure of speaker identification performance proposed by NIST. Specifically, DER expresses, as a percentage of the total speech duration, the combined duration of the three types of misidentified segments (1) to (3) below. A lower DER indicates better diarization.

(1) false alarm speaker time: duration of segments in which someone was detected as speaking although no one was speaking
(2) missed speaker time: duration of segments in which no one was judged to be speaking although someone was speaking
(3) speaker error time: duration of segments in which speech was correctly detected but attributed to the wrong speaker
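The definition above can be written as a short calculation. The function name `diarization_error` and the example durations are hypothetical and are not taken from the experiments.

```python
def diarization_error(false_alarm, missed, speaker_error, total_speech):
    """Return DER as a percentage of the total speech duration.

    All arguments are durations in the same unit (for example, seconds).
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return 100.0 * (false_alarm + missed + speaker_error) / total_speech


# Hypothetical durations in seconds, chosen only to illustrate the formula.
der = diarization_error(false_alarm=3.0, missed=5.0, speaker_error=2.0,
                        total_speech=100.0)
```

With these illustrative values the three error durations sum to 10 seconds out of 100 seconds of speech, giving a DER of 10%.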

Details of DER are given at the following URL.
NIST Speech Group, "Spring 2007 (RT-07) Rich Transcription Meeting Recognition Evaluation Plan", [online], [retrieved January 21, 2009], Internet ＜URL: http://www.nist.gov/speech/tests/rt/2007/index.html＞

Clustering by dHDP in the first stage solves only a sub-problem of diarization and does not by itself identify speakers. However, because dHDP clusters the samples at each time (= direction-tagged speech power data), counting the number of samples assigned to each cluster in each frame (time step) and applying a predetermined threshold makes it possible to decide whether each speaker is speaking or not. Accordingly, using ‖z_{t,k}‖ defined in equation (43) described above, the speaking or non-speaking state of speaker k at each time was decided by the rules given in equations (49) and (50). The threshold τ_DER in equation (49) was set to an appropriate value. Table 2 shows the DER results calculated with dHDP (Example 5 and Example 6) for each real speech data set in Table 1. These DER values were compared with the results in Non-Patent Document 1 (comparative example).
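The thresholding idea can be sketched as follows. Equations (49) and (50) are not reproduced in this part of the description, so the strict comparison, the function name `speech_activity`, and the example counts are assumptions made for illustration.

```python
def speech_activity(counts, tau_der):
    """Decide speaking / non-speaking for each time step and cluster.

    counts[t][k]: number of samples assigned to cluster k at time step t
    (standing in for the per-frame assignment count of the description).
    Returns a table of booleans with the same shape; True means "speaking".
    The strict '>' comparison is an assumption.
    """
    return [[count > tau_der for count in row] for row in counts]


# Hypothetical assignment counts for 2 time steps and 3 clusters.
activity = speech_activity([[0, 12, 3], [7, 7, 0]], tau_der=5)
```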

In Table 2, the comparative example shows the best result reported in Non-Patent Document 1, which is an existing method. Of the dHDP clustering results of the present invention (Examples 5 and 6), Example 5 (naive) gives the value obtained when DER is calculated using only the rules shown in Expressions (49) and (50). Example 6 (heuristic) additionally assumes, on top of the rules (decision rules) in Expressions (49) and (50), an upper limit on the number of simultaneous speakers within one frame (one time step), which reduces false alarms in non-speech intervals. The method adopted in Example 6 has points in common with the decision rule that gave the best result in Non-Patent Document 1. That is, Example 6 is the result of a search based on the sample-count threshold τ_DER of Expression (49) and the assumed upper limit on the number of simultaneous speakers. Consequently, Example 6 gave better results than Example 5. Furthermore, Example 6 clearly shows a better DER value even when compared with the comparative example, which also assumes an upper limit on the number of simultaneous speakers.
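One way the simultaneous-speaker assumption of Example 6 might be applied is sketched below in Python. The text does not spell out how the upper limit is enforced, so the choice made here, keeping only the clusters with the largest sample counts among those that exceed τ_DER, is an assumption, as are the names `capped_activity`, `counts`, and `max_speakers`.

```python
def capped_activity(counts, tau_der, max_speakers):
    """Threshold decision with a cap on simultaneous speakers.

    counts: list over frames; each entry is a list with the number of
            samples assigned to each cluster in that frame.
    Returns a list over frames of lists of booleans, one per cluster.
    """
    activity = []
    for frame_counts in counts:
        # Clusters that pass the (assumed) threshold rule.
        candidates = [k for k, c in enumerate(frame_counts) if c > tau_der]
        # Assumed heuristic: keep the clusters with the most samples.
        candidates.sort(key=lambda k: frame_counts[k], reverse=True)
        keep = set(candidates[:max_speakers])
        activity.append([k in keep for k in range(len(frame_counts))])
    return activity


if __name__ == "__main__":
    counts = [[5, 3, 4], [0, 2, 0], [1, 1, 1]]
    print(capped_activity(counts, tau_der=1, max_speakers=2))
```

In the first frame all three clusters pass the threshold, but only the two with the most samples are kept, which is how such a cap would suppress spurious detections.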

To summarize the above verification and evaluation experiments: what dHDP solves is clustering, which is a sub-problem (the first stage); however, by using, in the second stage, the sample assignments z computed in the course of clustering, the DER index, which indicates diarization accuracy, can be improved. In other words, it can be concluded that, according to the present invention, solving the clustering problem with dHDP naturally leads to good diarization.

1 Diarization system
2 Feature extraction unit (feature extraction means)
3 Clustering calculation device
4 Identification unit (identification means)
10 Storage means (estimated value storage means, observation amount storage means)
21 Power vector reading unit (reading means)
22 Input control unit
23 Allocation unit
24 Unknown initialization unit
25 Power vector writing unit
26 Observation amount generation unit (observation amount generation means)
27 End determination unit
28 Output control unit (output control means)
30 Variational posterior distribution inference unit (posterior distribution inference means)
31 E-step calculation unit (E-step calculation means)
32 M-step calculation unit (M-step calculation means)
33 EM convergence determination unit (convergence determination means)
34 Hyperparameter update condition determination unit
35 Hyperparameter update unit

Claims

A clustering calculation apparatus of a diarization system, the diarization system comprising, in order to estimate the number of speakers of a conversation from recording data of the conversation in which the number of speakers is unknown, feature amount extraction means for extracting a feature amount that characterizes each speaker, the clustering calculation apparatus, which estimates values of a plurality of unknown parameters used when generating, from the extracted feature amount, a plurality of clusters corresponding to the respective speakers, and identification means for identifying each speaker of the conversation based on the estimated values of the plurality of parameters, the clustering calculation apparatus comprising:
reading means for reading the extracted feature amount;
observation amount generation means for quantizing a plurality of voice powers for each angle and time, which are the read feature amounts, and converting them into a sample set having a number of elements determined according to the voice power values, thereby converting them into quantized vector observation amounts corresponding to a nonparametric Bayesian model;
observation amount storage means for accumulating and storing set data of the converted observation amounts;
posterior distribution inference means for estimating and updating, by an EM algorithm, the values of the posterior distributions of a plurality of parameters used when generating a plurality of clusters from the set data of the observation amounts by the nonparametric Bayesian model;
estimated value storage means for accumulating and storing the estimated and updated values of the posterior distributions of the parameters; and
output control means for outputting the latest estimated values of the posterior distributions of the plurality of parameters stored in the estimated value storage means when a preset termination condition is satisfied.

The clustering calculation apparatus according to claim 1, wherein the posterior distribution inference means comprises:
E-step calculation means for, as the processing of the E step of the EM algorithm, reading the prior distribution, the observation distribution, and the maximum number of clusters determined in advance in the dHDP (dynamic Hierarchical Dirichlet Process) model, the set data of the observation amounts converted from the past up to the estimation target time, and the estimated values of the posterior distributions of the hidden variables estimated from the past up to the latest M step, and calculating, for each calculation target time including past times up to the estimation target time, for each cluster, and for all data at each calculation target time, the values of the posterior distributions of the parameters relating to the clusters, the mixture ratios, and the weights representing the degree of temporal change of the cluster distribution;
M-step calculation means for, as the processing of the M step of the EM algorithm, reading the set values of the hyperparameters, the set data of the observation amounts converted from the past up to the estimation target time, and the estimated values of the posterior distributions of the parameters estimated from the past up to the latest E step, and estimating the values of the posterior distributions of two types of hidden variables for each calculation target time including past times up to the estimation target time and for all data at each calculation target time, wherein, of the two types of hidden variables, the value of the posterior distribution of a first hidden variable associated with the weight representing the degree of temporal change of the cluster distribution is calculated for each cluster, and the value of the posterior distribution of a second hidden variable associated with the first hidden variable and the mixture ratio is calculated for each time going back into the past from the calculation target time; and
convergence determination means for performing control so that the processing of the E step and the processing of the M step are alternately and repeatedly executed a predetermined number of times.

The clustering calculation apparatus according to claim 2, wherein
the E-step calculation means reads, for the set data of the observation amounts and the estimated values of the posterior distributions of the hidden variables, the data for the times going back into the past from the estimation target time by a preset constant or by a variable number reflecting the estimated values of the immediately preceding M step; calculates the values of the posterior distributions of the parameters relating to the clusters, the mixture ratios, and the weights, taking as the calculation target times the times from the estimation target time back to a past time preceding it by a preset set value; and, each time the cluster is updated for the per-cluster calculation, determines whether the cluster under estimation should be re-estimated based on a preset criterion for cluster re-estimation, and calculates the value of the posterior distribution of the parameter relating to the mixture ratio only if the cluster is to be re-estimated; and
the M-step calculation means reads, for the set data of the observation amounts and the estimated values of the posterior distributions of the parameters, the data for the times going back into the past from the estimation target time by a preset constant or by a variable number reflecting the estimated values of the immediately preceding M step; calculates the values of the posterior distributions of the first hidden variable and the second hidden variable, taking as the calculation target times the times from the estimation target time back to a past time preceding it by a preset set value; and, each time the cluster is updated for the per-cluster calculation, determines whether the cluster under estimation should be re-estimated based on the criterion, and calculates the value of the posterior distribution of the first hidden variable only if the cluster is to be re-estimated.

A clustering calculation method of a clustering calculation apparatus in a diarization system for estimating the number of speakers of a conversation from recording data of the conversation in which the number of speakers is unknown, the clustering calculation apparatus comprising storage means, reading means, observation amount generation means, posterior distribution inference means, and output control means, and estimating values of a plurality of unknown parameters used when generating a plurality of clusters corresponding to the respective speakers from a feature amount that characterizes each speaker and is extracted from the recording data, the method comprising:
a feature amount reading step in which the reading means reads the extracted feature amount;
an observation amount accumulating step in which the observation amount generation means quantizes a plurality of voice powers for each angle and time, which are the read feature amounts, and converts them into a sample set having a number of elements determined according to the voice power values, thereby converting them into quantized vector observation amounts corresponding to a nonparametric Bayesian model, and sequentially accumulates set data of the converted observation amounts in the storage means;
a posterior distribution estimation step in which the posterior distribution inference means estimates, by an EM algorithm, the values of the posterior distributions of a plurality of parameters used when generating a plurality of clusters from the set data of the observation amounts by the nonparametric Bayesian model, and sequentially stores and updates the estimated values in the storage means; and
an estimated value output step in which the output control means outputs the latest estimated values of the posterior distributions of the plurality of parameters stored in the storage means when a preset termination condition is satisfied.

The clustering calculation method according to claim 4, wherein, in the posterior distribution estimation step, the posterior distribution inference means:
as the processing of the E step of the EM algorithm, reads the prior distribution, the observation distribution, and the maximum number of clusters determined in advance in the dHDP model, the set values of the hyperparameters, the set data of the observation amounts converted from the past up to the estimation target time, and the estimated values of the posterior distributions of the hidden variables estimated from the past up to the latest M step, and calculates, for each calculation target time including past times up to the estimation target time, for each cluster, and for all data at each calculation target time, the values of the posterior distributions of the parameters relating to the clusters, the mixture ratios, and the weights representing the degree of temporal change of the cluster distribution;
as the processing of the M step of the EM algorithm, reads the set values of the hyperparameters, the set data of the observation amounts converted from the past up to the estimation target time, and the estimated values of the posterior distributions of the parameters estimated from the past up to the latest E step, and estimates the values of the posterior distributions of two types of hidden variables for each calculation target time including past times up to the estimation target time and for all data at each calculation target time, wherein, of the two types of hidden variables, the value of the posterior distribution of a first hidden variable associated with the weight representing the degree of temporal change of the cluster distribution is calculated for each cluster, and the value of the posterior distribution of a second hidden variable associated with the first hidden variable and the mixture ratio is calculated for each time going back into the past from the calculation target time; and
alternately and repeatedly executes the processing of the E step and the processing of the M step a predetermined number of times.

The clustering calculation method according to claim 5, wherein the posterior distribution inference means:
in the E step, reads, for the set data of the observation amounts and the estimated values of the posterior distributions of the hidden variables, the data for the times going back into the past from the estimation target time by a preset constant or by a variable number reflecting the estimated values of the immediately preceding M step; and
in the M step, reads, for the set data of the observation amounts and the estimated values of the posterior distributions of the parameters, the data for the times going back into the past from the estimation target time by a preset constant or by a variable number reflecting the estimated values of the immediately preceding M step.

The clustering calculation method according to claim 5, wherein the posterior distribution inference means:
in the E step, calculates the values of the posterior distributions of the parameters relating to the clusters, the mixture ratios, and the weights, taking as the calculation target times the times from the estimation target time back to a past time preceding it by a preset set value; and
in the M step, calculates the values of the posterior distributions of the first hidden variable and the second hidden variable, taking as the calculation target times the times from the estimation target time back to a past time preceding it by a preset set value.

The clustering calculation method according to claim 5, wherein the posterior distribution inference means:
in the E step, each time the cluster is updated for the per-cluster calculation, determines whether the cluster under estimation should be re-estimated based on a preset criterion for cluster re-estimation, and calculates the value of the posterior distribution of the parameter relating to the mixture ratio only if the cluster is to be re-estimated; and
in the M step, each time the cluster is updated for the per-cluster calculation, determines whether the cluster under estimation should be re-estimated based on the criterion, and calculates the value of the posterior distribution of the first hidden variable only if the cluster is to be re-estimated.

A clustering calculation program for causing a computer to function as each of the means constituting the clustering calculation apparatus according to any one of claims 1 to 3.

A computer-readable recording medium on which the clustering calculation program according to claim 9 is recorded.