JP2010170424A

JP2010170424A - Distribution estimation apparatus, clustering apparatus, estimation method for distribution estimation apparatus, and program

Info

Publication number: JP2010170424A
Application number: JP2009013522A
Authority: JP
Inventors: Ryohei Fujimaki; 遼平藤巻; Satoshi Morinaga; 聡森永; Michiya Monma; 道也門馬; Kenji Aoki; 健児青木
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2009-01-23
Filing date: 2009-01-23
Publication date: 2010-08-05

Abstract

PROBLEM TO BE SOLVED: To optimize the number of discretization regions and the discretization positions in each dimension by using a locally independent discrete mixture distribution when estimating a distribution.

SOLUTION: A distribution estimation apparatus includes initialization means for initializing data; expected value calculation means for calculating an expected value with respect to the posterior probability of the cluster assignment for the data initialized by the initialization means; minimization means for calculating parameters that minimize the expected value of an information criterion with respect to the expected value of the cluster assignment calculated by the expected value calculation means, thereby calculating a locally independent discrete mixture distribution; and optimality determination means for determining the optimality of the information criterion calculated by the minimization means and, if optimization is determined to be incomplete, causing the expected value calculation means to repeat the expected value calculation. The number of discretization regions and the discretization positions in each dimension are thereby optimized.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to a distribution estimation apparatus for multivariate data, a clustering apparatus, an estimation method for the distribution estimation apparatus, and a program, which optimize the number of mixture components and the discretization method simultaneously with distribution estimation.

Estimating, in a multidimensional space, the distribution of data, the predictive distribution of labels attached to the data (hereinafter, the label distribution), or the joint distribution of data and labels is an industrially important technology (hereinafter, these are collectively referred to simply as distributions).

For example, when data such as body weight, blood pressure, and medical expenses are given, estimating the multidimensional distribution over the three attributes makes it possible to analyze relationships among multiple attributes, such as which combinations of body weight and blood pressure are associated with a high probability of high medical expenses.

As another example, when sensor values such as a car's engine speed and vehicle speed are given as data and the type of failure for each data point is given as a label, estimating the label distribution makes it possible, when new data with an unknown failure type is acquired, to calculate the probability of each failure and identify the failure type.

Similarly, estimating the joint distribution of data and labels makes it possible to analyze the relationship between data and labels.

When estimating a distribution, discrete distributions, typified by histograms, have a wide range of applications because 1) no specific distribution needs to be assumed when the true distribution of the data is unknown, and 2) appropriate discretization makes the properties of the data and labels easier to analyze.

To explain 2) more concretely, consider representing the distribution of body weight as a histogram. The distribution is then estimated in a form that humans can interpret: for example, the probability that the weight is at least 50 kg and less than 60 kg is 30%, at least 60 kg and less than 70 kg is 45%, and otherwise 25% (see, for example, FIG. 1).

Likewise, considering the label distribution with a car's engine speed as an example, the relationship between data and failures is easy for humans to understand: when the engine speed is at least 0 and less than 1000, the probability of failure 1 is 70%, that of failure 2 is 10%, and that of other failures is 20% (see, for example, FIG. 2).

Various methods for learning such discrete distributions have been proposed. However, optimizing a discrete distribution while allowing dependencies among the dimensions increases the amount of computation. One way to avoid this problem is to treat each dimension independently, but then the dependencies among the dimensions can no longer be captured.

To solve these problems, a technique called the latent class model, proposed in Non-Patent Document 1, can be used. This model represents a distribution as a locally independent mixture distribution.

Taking the normal distribution as an example, as shown in FIG. 10, the dimensions are independent within each cluster (the three perfect circles in FIG. 10; local independence), and the dependencies among the dimensions are expressed by multiple clusters. Although FIG. 10 illustrates a continuous distribution, the same applies to discrete distributions.

Most latent class models are defined for the distribution of data, but similar models assuming local independence can also be considered for the label distribution and for the joint distribution of data and labels.

Non-Patent Document 1: Goodman, L. A., "Exploratory latent structure analysis using both identifiable and unidentifiable models," Biometrika, Aug 1974, Vol. 61, No. 2, pp. 215-231

However, to estimate a latent class model for discrete distributions, it is necessary to specify how many regions each dimension is discretized into, at which positions each dimension is discretized, and how many mixture components are required, and these could not be optimized.

The present invention has been made in view of the above problems, and its object is to optimize, when estimating a distribution, the number of discretization regions and the discretization positions in each dimension by using a locally independent discrete mixture distribution.

To solve the above problems, a distribution estimation apparatus according to the present invention includes: initialization means for initializing data; expected value calculation means for calculating an expected value with respect to the posterior probability of the cluster assignment for the data initialized by the initialization means; minimization means for calculating parameters that minimize the expected value of an information criterion with respect to the expected value of the cluster assignment calculated by the expected value calculation means, thereby calculating a locally independent discrete mixture distribution; and optimality determination means for determining the optimality of the information criterion calculated by the minimization means and, if it is determined not to be optimal, causing the expected value calculation means to calculate the expected value again. The apparatus thereby optimizes the number of discretization regions and the discretization positions in each dimension.

A clustering apparatus according to the present invention includes the distribution estimation apparatus described above.

An estimation method for the distribution estimation apparatus according to the present invention includes: an initialization step of initializing data; an expected value calculation step of calculating an expected value with respect to the posterior probability of the cluster assignment for the data initialized in the initialization step; a minimization step of calculating parameters that minimize the expected value of an information criterion with respect to the expected value of the cluster assignment calculated in the expected value calculation step, thereby calculating a locally independent discrete mixture distribution; and an optimality determination step of determining the optimality of the information criterion calculated in the minimization step and, if it is determined not to be optimal, causing the expected value calculation step to be performed again. The method thereby optimizes the number of discretization regions and the discretization positions in each dimension.

A program according to the present invention causes a computer to execute: an initialization process of initializing data; an expected value calculation process of calculating an expected value with respect to the posterior probability of the cluster assignment for the data initialized by the initialization process; a minimization process of calculating parameters that minimize the expected value of an information criterion with respect to the expected value of the cluster assignment calculated by the expected value calculation process, thereby calculating a locally independent discrete mixture distribution; and an optimality determination process of determining the optimality of the information criterion calculated by the minimization process and, if it is determined not to be optimal, causing the expected value calculation process to be performed again. The program thereby optimizes the number of discretization regions and the discretization positions in each dimension together with the distribution estimation.

According to the present invention, when estimating a distribution, the number of discretization regions and the discretization positions in each dimension can be optimized together with the distribution estimation by using a locally independent discrete mixture distribution model.

Explanatory diagram of a discrete distribution (part 1).
Explanatory diagram of a discrete distribution (part 2).
Configuration diagram of the distribution estimation apparatus according to an embodiment of the present invention.
Flowchart of the distribution estimation apparatus according to an embodiment of the present invention.
Configuration diagram of the distribution estimation apparatus according to another embodiment of the present invention.
Flowchart of the distribution estimation apparatus according to another embodiment of the present invention.
Configuration diagram of the clustering apparatus according to an embodiment of the present invention.
Configuration diagram of the outlier detection apparatus according to an embodiment of the present invention.
Configuration diagram of the data classification apparatus according to an embodiment of the present invention.
Explanatory diagram of the prior art.

Next, the best mode for carrying out the invention will be described in detail with reference to the drawings.

[First Embodiment]
FIG. 4 is a configuration diagram of the data distribution estimation apparatus according to an embodiment of the present invention. The distribution estimation apparatus 100 includes a data input unit 110, an initialization processing unit 120, a cluster assignment expected value calculation processing unit 130, an expected description length minimization processing unit 140, an optimality determination processing unit 150, and a distribution estimation result output unit 160. It calculates a distribution in which the number of discretization regions and the discretization boundary positions for each dimension are optimized, and outputs a distribution estimation result 190 containing the calculated distribution.

In the following, the i-th input data point is denoted x_i = (x_i1, x_i2, …, x_iD), and when labels are input at the same time, the label corresponding to x_i is denoted y_i. The number of labels (label types) is denoted C. The distributions to be estimated are expressed by (Equation 1) for the data distribution P(X), by (Equation 2) for the label distribution P(Y|X), and by (Equation 3) for the joint distribution P(X,Y) of data and labels.

Here, X = (X1, X2, …, XD) denotes D-dimensional data, Y denotes the label, and Z = (Z1, Z2, …, ZK) denotes the random variables for the cluster to which X belongs. Zk is a random variable that takes the value 1 when X belongs to the k-th cluster and 0 otherwise, and is called the cluster assignment. Because Z cannot be observed directly, it is generally called a hidden variable.

K denotes the number of mixture components (the number of clusters). P(Xd|Zk=1) denotes the histogram for the d-th dimension of the k-th cluster, and P(Y|Xd,Zk=1) denotes the probability of label Y conditioned on the value Xd of the d-th dimension of the data for the k-th cluster.
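As a rough illustration of the model form described above, the following Python sketch evaluates a locally independent mixture of histograms. It assumes that the data distribution has the standard latent class form P(X) = Σ_k P(Zk=1) Π_d P(Xd|Zk=1); the equation itself is not reproduced in this text, so this form and all names (`mixture_probability`, `weights`, `edges`, `probs`) are assumptions made for illustration only.

```python
import numpy as np

def mixture_probability(x, weights, edges, probs):
    """Probability of the cell containing x under a locally independent histogram mixture.

    weights[k]  : mixing ratio P(Zk=1)
    edges[k][d] : increasing bin boundaries of dimension d in cluster k
    probs[k][d] : bin probabilities of dimension d in cluster k (sum to 1)
    """
    total = 0.0
    for k, w in enumerate(weights):
        p = w
        for d, value in enumerate(x):
            # Index of the bin [edges[j], edges[j+1]) that contains the value.
            j = np.searchsorted(edges[k][d], value, side="right") - 1
            # Simplification: out-of-range values are clamped to the edge bins.
            j = min(max(j, 0), len(probs[k][d]) - 1)
            p *= probs[k][d][j]  # local independence: multiply across dimensions
        total += p
    return total

# Two clusters, two dimensions; each cluster has its own bins per dimension.
weights = [0.6, 0.4]
edges = [[np.array([0.0, 5.0, 10.0]), np.array([0.0, 10.0])],
         [np.array([0.0, 2.0, 4.0, 10.0]), np.array([0.0, 5.0, 10.0])]]
probs = [[np.array([0.8, 0.2]), np.array([1.0])],
         [np.array([0.1, 0.1, 0.8]), np.array([0.3, 0.7])]]
p = mixture_probability([6.0, 7.0], weights, edges, probs)
```

Each cluster treats its dimensions independently, and the dependency between the two dimensions comes only from mixing the clusters.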

The data input unit 110 is a functional unit for inputting the input data and labels 180; parameters required for distribution estimation can also be input at this time. In this embodiment, the number of mixture components of the mixture distribution is treated as a parameter input together with the data.

The initialization processing unit 120 initializes the cluster assignments and related quantities. Any initialization can be used, such as giving all clusters a uniform ratio or setting the values randomly using random numbers, as long as the values satisfy ΣP(Zk=1) = 1 and ΣP(Zk=1|Xd) = 1 in (Equation 1) to (Equation 3).
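The two initialization options mentioned above can be sketched as follows. This is a minimal example assuming the cluster assignments are stored as an N-by-K array whose rows sum to 1; the function names and the `seed` parameter are hypothetical.

```python
import numpy as np

def init_uniform(n_samples, n_clusters):
    """Give every cluster the same ratio for every data point."""
    return np.full((n_samples, n_clusters), 1.0 / n_clusters)

def init_random(n_samples, n_clusters, seed=0):
    """Draw random positive values and normalize each row to sum to 1."""
    rng = np.random.default_rng(seed)
    r = rng.random((n_samples, n_clusters)) + 1e-12  # avoid an all-zero row
    return r / r.sum(axis=1, keepdims=True)

r_uniform = init_uniform(5, 3)
r_random = init_random(5, 3)
```

Both arrays satisfy the normalization constraint row by row, which is the only requirement the text places on the initial values.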

The cluster assignment expected value calculation processing unit 130 calculates the expected value with respect to the posterior probability of the cluster assignment for the data X. That is, it calculates the value of Zk under P(Z|X) when estimating the data distribution, and under P(Z|X,Y) when estimating the label distribution or the joint distribution of data and labels.

As an example of the calculation method, the value can be calculated by (Equation 4) when estimating the data distribution, by (Equation 5) when estimating the label probability, and by (Equation 6) when estimating the joint distribution of data and labels.

Here, zi denotes the cluster assignment for xi.
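For the data distribution case, the expected cluster assignment can be sketched with Bayes' rule as below. (Equation 4) is not reproduced in this text, so the formula used here, E[z_ik | x_i] ∝ P(Zk=1) Π_d P(x_id|Zk=1), is the standard mixture-model posterior and is an assumption; the function and argument names are also hypothetical.

```python
import numpy as np

def cluster_assignment_expectation(weights, component_lik):
    """E[z_ik | x_i] = P(Zk=1) * L_ik / sum_j P(Zj=1) * L_ij  (Bayes' rule).

    weights       : shape (K,), mixing ratios P(Zk=1)
    component_lik : shape (N, K), L_ik = prod_d P(x_id | Zk=1)
    Assumes every data point has nonzero likelihood under at least one cluster.
    """
    with np.errstate(divide="ignore"):  # log(0) = -inf is acceptable here
        log_joint = np.log(weights) + np.log(component_lik)
    log_joint -= log_joint.max(axis=1, keepdims=True)  # stabilize before exponentiating
    joint = np.exp(log_joint)
    return joint / joint.sum(axis=1, keepdims=True)

weights = np.array([0.6, 0.4])
component_lik = np.array([[0.20, 0.56],
                          [0.80, 0.03]])
r = cluster_assignment_expectation(weights, component_lik)
```

Working in the log domain is a design choice for numerical stability when many per-dimension probabilities are multiplied; it does not change the result.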

The expected description length minimization processing unit 140 calculates the parameters that minimize the expected value of the description length with respect to the expected value of the cluster assignment calculated by the cluster assignment expected value calculation processing unit 130.

The description length is an optimization criterion for calculating a distribution from data and labels. It is expressed as the sum of the size of the data when compressed using the calculated distribution and the size of the distribution itself when compressed.

Intuitively, in the first embodiment of the present invention, the compression of the input data can be improved by increasing the number of discretization regions or by freely moving the discretization positions, but the distribution becomes correspondingly more complex and therefore compresses less well. The distribution that best balances this trade-off can be selected.
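The two-part structure described above can be written generically as follows. This is only a schematic form, not the patent's (Equation 7) to (Equation 9), which are not reproduced in this text; the symbols N (number of data points), \hat{\theta} (estimated parameters), and L (code length of the distribution) are introduced here for illustration.

```latex
\mathrm{DL} \;=\;
\underbrace{-\sum_{i=1}^{N} \log_2 P(x_i \mid \hat{\theta})}_{\text{size of the data compressed by the distribution}}
\;+\;
\underbrace{L(\hat{\theta})}_{\text{size of the compressed distribution}}
```

Adding discretization regions tends to shrink the first term and grow the second, which is the trade-off the minimization resolves.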

Examples of how to calculate and minimize the expected description length are described concretely below. In the following, Mkd denotes the number of discretization regions for dimension d of cluster k.

[When the discretization width is constant within each distribution]
When the discretization width is constant within each distribution, each discretization region of the discrete distribution for dimension d of cluster k is defined as one of the Mkd regions that divide the range of the data into equal parts. In this case, the expected description length can be calculated by (Equation 7) when estimating the data distribution, by (Equation 8) when estimating the label probability, and by (Equation 9) when estimating the joint distribution of data and labels.

Here, the base of the logarithm is 2, and e denotes the base of the natural logarithm. For a natural number a, log^* a denotes the sum log^* a = log c + log a + log log a + log log log a + …, where terms are added until a negative term appears. log^* 0 is defined to be 0, and c is a constant that does not depend on a.
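The definition of log^* above can be computed directly, as in the following sketch. The default value of `c` is the constant commonly used with Rissanen's universal prior for integers and is an assumption here; the source only states that c does not depend on a.

```python
import math

def log_star(a, c=2.865064):
    """log* a = log c + log a + log log a + ..., with base-2 logarithms.

    Terms are added while they are positive, and log* 0 is defined as 0.
    The default c is an assumed value, not one given in the source.
    """
    if a == 0:
        return 0.0
    total = math.log2(c)
    term = math.log2(a)
    while term > 0:
        total += term
        term = math.log2(term)
    return total

value = log_star(16, c=1.0)  # log 16 + log 4 + log 2 = 4 + 2 + 1
```

A term equal to zero contributes nothing, so stopping at the first non-positive term gives the same sum as stopping at the first negative one.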

E_Z|X[log (Σzik)], E_Z|xi,Y[log (Σzik)], and E_Z|X,Y[log (Σzik)] each denote the expected value of the logarithm of the sum of the cluster assignments with respect to the corresponding posterior distribution of Z. These expected values can be calculated by any method, for example by a Monte Carlo approximation or by a Taylor expansion of the log function around the mean of Σzik.
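Both approximations mentioned above can be sketched as follows. The example assumes that, under the posterior, the z_ik of one cluster are independent Bernoulli variables with success probabilities r[i]; the function names, the second-order truncation, and the handling of zero sums in the Monte Carlo estimate are illustrative choices, not details from the source.

```python
import numpy as np

def expected_log_sum_taylor(r):
    """Second-order Taylor approximation of E[log2(sum_i z_i)] around the mean.

    Assumes the z_i are independent Bernoulli variables with P(z_i = 1) = r[i].
    """
    r = np.asarray(r, dtype=float)
    mean = r.sum()
    var = (r * (1.0 - r)).sum()
    return np.log2(mean) - var / (2.0 * mean**2 * np.log(2.0))

def expected_log_sum_monte_carlo(r, n_draws=20000, seed=0):
    """Monte Carlo estimate of the same expectation (draws with sum 0 are skipped)."""
    r = np.asarray(r, dtype=float)
    rng = np.random.default_rng(seed)
    sums = (rng.random((n_draws, r.size)) < r).sum(axis=1)
    return np.log2(sums[sums > 0]).mean()

r = np.full(20, 0.9)  # posterior probabilities of one cluster for 20 data points
taylor = expected_log_sum_taylor(r)
monte_carlo = expected_log_sum_monte_carlo(r)
```

The Taylor form is deterministic and cheap, while the Monte Carlo estimate trades computation for accuracy when the sum is small or highly variable.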

Next, the procedure for minimizing the expected description length is described, taking the estimation of the data distribution as an example. First, for the distribution for dimension d of cluster k, the number of discretization regions Mkd is fixed. Next, the parameters of the distribution are estimated, for example by maximum likelihood estimation.

Next, for Mkd and the estimated parameters, the expected description length of the data and the expected description length of the parameters for the corresponding distribution are calculated by (Equation 10) when estimating the data distribution, by (Equation 11) when estimating the label probability, and by (Equation 12) when estimating the joint distribution of data and labels.

For the distribution for dimension d of cluster k, Mkd is varied, and the Mkd and corresponding parameters that minimize (Equation 10) when estimating the data distribution, (Equation 11) when estimating the label probability, or (Equation 12) when estimating the joint distribution of data and labels are adopted. This process is carried out for the distribution of each dimension of each cluster, and the optimal distribution for each is calculated.

The terms of (Equation 7) to (Equation 9) that have no counterpart in (Equation 10) to (Equation 12) do not depend on the number of discretization regions or the parameters of the individual distributions, so the above process yields the distribution for each dimension of each cluster.
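The search over Mkd for one cluster and one dimension can be sketched as below. Because (Equation 10) is not reproduced in this text, the cost used here is a generic two-part code length with an assumed parameter penalty of (M - 1)/2 · log2 of the expected cluster size; the function name and arguments are hypothetical.

```python
import numpy as np

def select_bin_count(x, r, lo, hi, candidates):
    """Pick the number of equal-width bins minimizing a two-part code length.

    x : shape (N,), values of one dimension, assumed to lie in [lo, hi]
    r : shape (N,), expected cluster assignments E[z_ik] for one cluster
    """
    n_eff = r.sum()  # expected number of data points in the cluster
    best = None
    for m in candidates:
        width = (hi - lo) / m
        counts, _ = np.histogram(x, bins=m, range=(lo, hi), weights=r)
        theta = counts / counts.sum()  # maximum likelihood bin probabilities
        nz = counts > 0                # empty bins contribute no data cost
        data_cost = -(counts[nz] * np.log2(theta[nz] / width)).sum()
        param_cost = 0.5 * (m - 1) * np.log2(n_eff)  # assumed penalty, not (Equation 10)
        total = data_cost + param_cost
        if best is None or total < best[1]:
            best = (m, total, theta)
    return best

# Density is 0.75 on [0, 1) and about 0.083 on [1, 4), so four bins of width 1 fit well.
x = np.concatenate([np.linspace(0.0, 1.0, 300, endpoint=False),
                    np.linspace(1.0, 4.0, 100, endpoint=False)])
r = np.ones_like(x)  # in practice these would be fractional posterior weights
m_best, cost, theta = select_bin_count(x, r, 0.0, 4.0, [1, 2, 4, 8, 16])
```

In this example coarser histograms fit the data poorly and finer ones pay a larger parameter penalty without a matching gain, so the search settles on four bins.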

Finally, the mixing ratio P(Zk) of each cluster is calculated by (Equation 13) when estimating the data distribution, by (Equation 14) when estimating the label probability, and by (Equation 15) when estimating the joint distribution of data and labels.
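A common way to update the mixing ratios in mixture models is to average the expected cluster assignments over the data, as sketched below. Whether (Equation 13) has exactly this form cannot be confirmed from this text, so the update and the function name are assumptions.

```python
import numpy as np

def mixing_ratios(r):
    """Estimate P(Zk=1) as the average expected cluster assignment over the data."""
    return r.mean(axis=0)

r = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.7, 0.3]])
ratios = mixing_ratios(r)
```

Because each row of `r` sums to 1, the resulting ratios also sum to 1.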

The optimality determination processing unit 150 compares the description length DL (not its expected value) of the calculated distribution with the description length DL_old calculated in the previous iteration of the loop, and determines whether the calculated distribution is optimal according to whether DL and DL_old are sufficiently close. Specifically, the criterion can be that DL_old - DL is 0 or sufficiently close to 0.

DL is calculated by (Equation 16) when estimating the data distribution, by (Equation 17) when estimating the label probability, and by (Equation 18) when estimating the joint distribution of data and labels.

The distribution estimation result output unit 160 outputs the calculated distribution estimation result 190 to an external medium such as a monitor or a hard disk.

The operation of the distribution estimation apparatus 100 according to this embodiment will be described with reference to the flowchart shown in FIG. 5.

Data 180 is input to the data input unit 110 (step S100). Next, the initialization processing unit 120 initializes the distribution to be calculated (step S101).

Next, the cluster assignment expected value calculation processing unit 130 calculates the expected value with respect to the posterior probability of the cluster assignment for each data point (step S102).

Next, the expected description length minimization processing unit 140 calculates the distribution that minimizes the expected description length (step S103).

Next, the optimality determination processing unit 150 determines whether the currently calculated distribution is optimal (step S104).

If it is determined not to be optimal, the processing of steps S102 to S104 is repeated. If it is determined to be optimal, the distribution estimation result output unit 160 outputs the result (step S105).
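The control flow of steps S101 to S105 can be sketched as a generic loop. The callables, the tolerance `tol`, and the iteration cap `max_iter` are hypothetical; the source only says that iteration stops when DL_old - DL is 0 or sufficiently close to 0.

```python
def estimate_distribution(data, initialize, e_step, m_step, description_length,
                          tol=1e-6, max_iter=100):
    """Iterate steps S102 to S104 until the description length stops changing."""
    model = initialize(data)                      # S101
    dl = dl_old = float("inf")
    for _ in range(max_iter):
        expectations = e_step(data, model)        # S102
        model = m_step(data, expectations)        # S103
        dl = description_length(data, model)      # S104
        if abs(dl_old - dl) <= tol:
            break
        dl_old = dl
    return model, dl                              # S105: result to be output

# Toy stand-ins chosen only so that the loop visibly converges.
data = [1.0, 2.0, 3.0]
target = sum(data) / len(data)
model, dl = estimate_distribution(
    data,
    initialize=lambda d: 0.0,
    e_step=lambda d, m: (m + target) / 2.0,
    m_step=lambda d, e: e,
    description_length=lambda d, m: (m - target) ** 2,
)
```

In a real implementation the three callables would be the cluster assignment expectation, the expected description length minimization, and the description length computation described above.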

According to this embodiment, the data distribution, the predictive distribution of labels attached to the data, or the joint distribution of data and labels is represented as a mixture of locally independent discrete distributions, and the number of discretization regions and the discretization positions in each dimension can be optimized together with the distribution estimation.

Although this embodiment has described a distribution estimation apparatus that uses the description length (stochastic complexity) as the criterion, the same distribution estimation can be carried out with any criterion for model selection, such as the Akaike information criterion or the generalized information criterion, and this extension is readily inferred.

[Second Embodiment]
FIG. 6 is a configuration diagram of the data distribution estimation apparatus according to the second embodiment of the present invention. The distribution estimation apparatus 200 differs from the distribution estimation apparatus 100 of the first embodiment in that it further includes a mixture number setting unit 210, a mixture number loop end determination processing unit 220, and an optimal distribution selection processing unit 230. Functional units identical to those of the first embodiment are given the same reference numerals, and their detailed description is omitted.

The mixture number setting unit 210 selects, from the candidate values for the number of mixture components that were set in advance or input together with the input data 180, a candidate value for which the expected description length has not yet been minimized, and outputs it to the initialization processing unit 120.

The initialization processing unit 120 operates in the same way as in the distribution estimation apparatus 100, except that the number of mixture components is input from the mixture number setting unit 210.

When the optimality determination processing unit 150 determines that the distribution is optimal, the description length of the calculated distribution is stored together with the number of mixture components and the estimated parameters.

The mixture number loop end determination processing unit 220 determines whether a distribution minimizing the expected description length has been calculated for every candidate value for the number of mixture components that was set in advance or input together with the data 180.

The optimal distribution selection processing unit 230 compares the description lengths of the distributions optimized for each number of mixture components and selects the distribution with the smallest description length as the optimal distribution.

図７を参照にして、本実施の形態に関する分布推定装置２００における動作について説明する。 With reference to FIG. 7, the operation in the distribution estimation apparatus 200 according to the present embodiment will be described.

データ入力部へデータ１８０を入力する（ステップＳ２００）。次に、混合数設定部２１０において、期待記述長の算出が終了していない混合数の候補値を1つ設定する（ステップＳ２０１）。 Data 180 is input to the data input unit (step S200). Next, the mixture number setting unit 210 sets one candidate value for the mixture number for which the calculation of the expected description length has not been completed (step S201).

次に、初期化処理部１２０において算出すべき分布の初期化処理を実施する（ステップＳ２０２）。 Next, the initialization processing unit 120 performs an initialization process for the distribution to be calculated (step S202).

次に、クラスタアサイメント期待値計算処理部１３０において、各データに対するクラスタアサイメントの期待値を計算する（ステップＳ２０３）。 Next, the cluster assignment expected value calculation processing unit 130 calculates the expected value of the cluster assignment for each data (step S203).

次に、期待記述長最小化処理部１４０において、期待記述長を最小化する分布を算出する（ステップＳ２０４）。 Next, the expected description length minimization processing unit 140 calculates a distribution that minimizes the expected description length (step S204).

次に、最適性判定処理部１５０において、現在算出されている分布が最適か否かを判定する（ステップＳ２０５）。 Next, the optimality determination processing unit 150 determines whether or not the currently calculated distribution is optimal (step S205).

最適ではないと判定された場合には、ステップＳ２０３からステップＳ２０５の処理を繰り返す。一方、最適と判定された場合には、混合数ループ終了判定処理部２２０において、全混合数の候補値に対して記述長の算出が終了しているかを判定する（ステップＳ２０６）。 If it is determined that it is not optimal, the processing from step S203 to step S205 is repeated. On the other hand, if it is determined to be optimal, the mixture number loop end determination processing unit 220 determines whether the calculation of the description length has been completed for the candidate values for the total number of mixtures (step S206).

ここで算出の終了していない候補値がある場合には、ステップＳ２０１からステップＳ２０６の処理を繰り返す。全候補値に対して記述長の算出が終了した場合には、最適分布選択処理部２３０において、候補値の中から記述長を最小とする混合数を選択する（ステップＳ２０７）。次に、分布推定結果出力装置１６０によって、結果を出力する（ステップＳ２０８）。 If any candidate value remains for which the calculation has not finished, the processing from step S201 to step S206 is repeated. When the description length has been calculated for all candidate values, the optimal distribution selection processing unit 230 selects, from the candidate values, the number of mixtures that minimizes the description length (step S207). Next, the distribution estimation result output device 160 outputs the result (step S208).
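The control flow of steps S201 to S207 can be sketched as follows. This is only a minimal illustration of the nested loops, not the implementation described in this publication: the function name `select_best_mixture`, the callback parameters (`initialize`, `e_step`, `m_step`, `is_optimal`, `description_length`), and the `max_iter` guard are all hypothetical names introduced for this sketch, and the actual expectation and minimization computations are left to the caller.

```python
from typing import Any, Callable, Iterable, Optional, Tuple


def select_best_mixture(
    data: Any,
    candidates: Iterable[int],
    initialize: Callable[[Any, int], Any],
    e_step: Callable[[Any, Any], Any],
    m_step: Callable[[Any, Any, int], Any],
    is_optimal: Callable[[Any, Any, Any], bool],
    description_length: Callable[[Any, Any], float],
    max_iter: int = 100,
) -> Optional[Tuple[float, int, Any]]:
    """Return (description length, mixture number, model) with the smallest
    description length, or None if there are no candidates."""
    best = None
    for k in candidates:                      # S201: pick the next candidate
        model = initialize(data, k)           # S202: initialize the distribution
        for _ in range(max_iter):             # iteration cap is an assumption
            resp = e_step(data, model)        # S203: expected cluster assignment
            new_model = m_step(data, resp, k) # S204: minimize expected length
            converged = is_optimal(data, model, new_model)  # S205
            model = new_model
            if converged:
                break
        score = description_length(data, model)
        if best is None or score < best[0]:   # S207: keep the shortest
            best = (score, k, model)
    return best                               # S206 is the end of the for loop
```

The outer `for` loop plays the role of the mixture number setting unit 210 and the loop end determination unit 220, while the final comparison corresponds to the optimal distribution selection unit 230.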

本実施の形態を利用すると、混合数も最適化することで分布推定部１００ではパラメータとして外部から与えられた混合数に関しても最適な分布を算出することが可能となる。 With this embodiment, the number of mixtures is optimized as well, so an optimal distribution can be calculated even with respect to the number of mixtures, which the distribution estimation unit 100 receives from outside as a parameter.

また、本発明の実施形態における分布推定装置１００および２００は、クラスタリング、異常検出、データ分類などに利用する事が可能である。 In addition, the distribution estimation apparatuses 100 and 200 according to the embodiment of the present invention can be used for clustering, abnormality detection, data classification, and the like.

例えばクラスタリングに利用する場合には、図８に示すように算出された最適な分布に対してクラスタアサイメント期待値計算処理部１３０で計算されたクラスタアサイメントの事後分布に対して、最も事後確率の高いクラスタへデータをクラスタリングすれば良い。 For example, for clustering, as shown in FIG. 8, each data item may be assigned to the cluster with the highest posterior probability under the posterior distribution of the cluster assignment, which the cluster assignment expected value calculation processing unit 130 calculates for the obtained optimal distribution.
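As a small illustration of this assignment rule, the sketch below takes a table of posterior probabilities (one row per data item, one column per cluster) and returns the index of the most probable cluster for each row. The function name `assign_clusters` and the list-of-lists input format are assumptions made for this example.

```python
from typing import List, Sequence


def assign_clusters(posteriors: Sequence[Sequence[float]]) -> List[int]:
    """For each data item, return the index of the cluster whose posterior
    probability is highest (ties go to the lowest index)."""
    return [
        max(range(len(row)), key=lambda c: row[c])
        for row in posteriors
    ]


# Two data items, two clusters.
labels = assign_clusters([[0.1, 0.9], [0.7, 0.3]])
```

Here `labels` is `[1, 0]`: the first item goes to cluster 1 and the second to cluster 0.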

例えば異常検出に利用する場合には、図９に示すように算出された最適な分布に対して、新規入力データの発生確率（あるいは確率密度）を計算し、その値が小さい場合に異常データとして検出すれば良い。 For example, for abnormality detection, as shown in FIG. 9, the occurrence probability (or probability density) of new input data is calculated under the obtained optimal distribution, and the data is detected as abnormal when that value is small.
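The following sketch illustrates this kind of check under simplifying assumptions. It assumes that each input has already been mapped to a bin index per dimension, that the estimated distribution is a mixture whose components are independent across dimensions, and that a fixed cutoff is used. The names `mixture_probability` and `is_anomalous`, the data layout, and the default `threshold` are hypothetical and are not taken from this publication.

```python
from typing import Sequence


def mixture_probability(
    x_bins: Sequence[int],
    weights: Sequence[float],
    tables: Sequence[Sequence[Sequence[float]]],
) -> float:
    """Probability of a discretized point under a mixture of components that
    are independent across dimensions.

    tables[k][d][b] is the probability of bin b in dimension d for component k.
    """
    total = 0.0
    for weight, component in zip(weights, tables):
        p = weight
        for d, b in enumerate(x_bins):
            p *= component[d][b]
        total += p
    return total


def is_anomalous(x_bins, weights, tables, threshold: float = 1e-3) -> bool:
    """Flag the point when its probability falls below the (assumed) cutoff."""
    return mixture_probability(x_bins, weights, tables) < threshold
```

In practice the cutoff would have to be chosen for the application, for example from the probabilities observed on known normal data.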

例えばデータ分類に利用する場合には、図１０に示すように算出された最適な分布に対して、新規入力データのラベル確率P(Y|X)を算出し、その値が最も高いクラスへデータを分類すれば良い。 For example, for data classification, as shown in FIG. 10, the label probability P(Y|X) of new input data is calculated under the obtained optimal distribution, and the data is classified into the class with the highest value.
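One way to obtain P(Y|X) is to normalize joint probabilities P(X, Y) over the labels, which the sketch below illustrates. It assumes the joint probability of the new input with each label has already been evaluated under the estimated distribution; the function name `classify_from_joint` and the dictionary input are hypothetical.

```python
from typing import Dict, Hashable, Tuple


def classify_from_joint(joint: Dict[Hashable, float]) -> Tuple[Hashable, float]:
    """Given joint[y] = P(X = x, Y = y) for one input x, return the label with
    the highest P(Y = y | X = x) and that posterior probability."""
    evidence = sum(joint.values())  # P(X = x), by marginalizing over labels
    if evidence <= 0.0:
        raise ValueError("input has zero probability under every label")
    best_label = max(joint, key=lambda y: joint[y])
    return best_label, joint[best_label] / evidence


label, posterior = classify_from_joint({"failure 1": 0.03, "normal": 0.01})
```

With these example values the input is classified as "failure 1", with a posterior probability of about 0.75.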

本発明を利用することで可能となる産業上の応用例をいくつか説明する。 Several industrial applications that can be achieved by using the present invention will be described.

［自動車の故障データの分析例］ [Automobile failure data analysis example]
自動車のElectric Control Unit (ECU) から取得される、エンジン回転数、車速、エンジン油温などの各種センサと、故障のラベルが与えられた場合に、本発明で提案する分布推定装置によって分布を学習することで、回転数が３０００回転以上でかつ車速が３０ｋｍ以下の場合には故障１が発生しやすいなど、故障とセンサ値の関係を分析することが可能となる。 When various sensor values obtained from a vehicle's Electric Control Unit (ECU), such as engine speed, vehicle speed, and engine oil temperature, are given together with failure labels, learning the distribution with the distribution estimation apparatus proposed in the present invention makes it possible to analyze the relationship between failures and sensor values, for example that failure 1 is likely to occur when the engine speed is 3000 rpm or more and the vehicle speed is 30 km/h or less.

正常に走行する車両データを分布推定装置に入力して、外れ値検出装置を構成することによって、正常データの特性を分析する事が可能になるとともに、新規入力データが正常から逸脱する状態にあるかどうかを判定し、異常な状態にある場合にはドライバーにアラームを出すなど、車両の異常検出システムへ応用することができる。 By inputting data from normally running vehicles into the distribution estimation apparatus and configuring an outlier detection device, the characteristics of normal data can be analyzed. It also becomes possible to determine whether new input data deviates from the normal state and, if the state is abnormal, to raise an alarm to the driver, so the invention can be applied to a vehicle abnormality detection system.

故障データからラベルの分布を推定する事で、新規の原因が特定できていない故障が発生した場合に、その故障の原因をすばやく特定する故障診断装置を構成することが可能である。 By estimating the distribution of labels from failure data, a failure diagnosis device can be configured that quickly identifies the cause when a new failure whose cause has not yet been identified occurs.

本発明の実施形態によれば、離散化の数や位置、また混合の数を自動的に最適化することが可能であるため、専門家にもモデルを作ることが難しい故障や、センサの数が多く全変数に対して離散化の方法を指定することが難しい場合であっても、適切な分布の推定を行うことができる。 According to the embodiment of the present invention, the number and positions of discretizations and the number of mixtures can be optimized automatically. An appropriate distribution can therefore be estimated even for failures that are difficult for experts to model, and even when the number of sensors is so large that specifying a discretization method for every variable is difficult.

以上、実施の形態を説明したが、特許請求の範囲に定義された本発明の広範囲な趣旨および範囲から逸脱することなく、これら実施の形態や具体例に様々な修正および変更が可能である。 Although the embodiments have been described above, various modifications and changes can be made to these embodiments and specific examples without departing from the broad spirit and scope of the present invention as defined in the claims.

１００ 分布推定装置
１１０ データ入力部
１２０ 初期化処理部
１３０ クラスタアサイメント期待値計算処理部
１４０ 期待記述長最小化処理部
１５０ 最適性判定処理部
１６０ 分布推定結果出力部
２１０ 混合数設定部
２２０ 混合数ループ終了判定処理部
２３０ 最適分布選択処理部

DESCRIPTION OF SYMBOLS
100 Distribution estimation apparatus
110 Data input unit
120 Initialization processing unit
130 Cluster assignment expected value calculation processing unit
140 Expected description length minimization processing unit
150 Optimality determination processing unit
160 Distribution estimation result output unit
210 Mixture number setting unit
220 Mixture number loop end determination processing unit
230 Optimal distribution selection processing unit

Claims

1. A distribution estimation apparatus comprising:
initialization means for initializing data;
expected value calculation means for calculating an expected value of the posterior probability of cluster assignment in the data initialized by the initialization means;
minimizing means for calculating a parameter that minimizes the expected value of an information criterion with respect to the expected value of the cluster assignment calculated by the expected value calculation means, thereby calculating a locally independent mixed discrete distribution; and
optimality determining means for determining the optimality of the information criterion calculated by the minimizing means and, if it is determined not to be optimal, causing the expected value calculation means to perform the expected value calculation again,
wherein the number of discretized regions in each dimension and the discretization positions in each dimension are optimized.

2. The distribution estimation apparatus according to claim 1, further comprising:
mixture number setting means for selecting, from candidate values of the number of mixtures, a candidate value for which minimization has not yet been performed, and inputting data to the initialization means;
mixture number determining means for determining, for the data determined to be optimal by the optimality determining means, whether a parameter that minimizes the expected value of the information criterion has been calculated for all candidate values of the number of mixtures and, if minimization has not been performed for all candidate values, transmitting data to the mixture number setting means; and
optimum parameter selection means for comparing the information criteria of the parameters optimized for each number of mixtures and selecting the parameter with the smallest information criterion as the optimum parameter,
wherein the number of discretized regions in each dimension, the discretization positions in each dimension, and the number of mixtures are optimized.

3. The distribution estimation apparatus according to claim 1, wherein any one of a description length, the Akaike information criterion, and a generalized information criterion is used as the information criterion.

4. The distribution estimation apparatus according to claim 1, wherein the calculation is performed for any one of a distribution of data, a conditional distribution of labels given the data, and a joint distribution of data and labels.

5. A clustering apparatus comprising the distribution estimation apparatus according to claim 1.

6. An estimation method for a distribution estimation apparatus, comprising:
an initialization step of initializing data;
an expected value calculation step of calculating an expected value of the posterior probability of cluster assignment in the data initialized in the initialization step;
a minimizing step of calculating a parameter that minimizes the expected value of an information criterion with respect to the expected value of the cluster assignment calculated in the expected value calculation step, thereby calculating a locally independent mixed discrete distribution; and
an optimality determination step of determining the optimality of the information criterion calculated in the minimizing step and, if it is determined not to be optimal, causing the expected value calculation of the expected value calculation step to be performed again,
wherein the number of discretized regions in each dimension and the discretization positions in each dimension are optimized.

7. A program that causes a computer to execute:
an initialization process of initializing data;
an expected value calculation process of calculating an expected value of the posterior probability of cluster assignment in the data initialized by the initialization process;
a minimizing process of calculating a parameter that minimizes the expected value of an information criterion with respect to the expected value of the cluster assignment calculated in the expected value calculation process, thereby calculating a locally independent mixed discrete distribution; and
an optimality determination process of determining the optimality of the information criterion calculated in the minimizing process and, if it is determined not to be optimal, causing the expected value calculation of the expected value calculation process to be performed again,
whereby the number of discretized regions in each dimension and the discretization positions in each dimension are optimized along with the distribution estimation.