JP6316773B2 - Statistical data reconstruction device, statistical data reconstruction method, program

Info

Publication number: JP6316773B2
Application number: JP2015094798A
Authority: JP
Inventors: 長谷川 聡; 菊池 亮
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2015-05-07
Filing date: 2015-05-07
Publication date: 2018-04-25
Anticipated expiration: 2035-05-07
Also published as: JP2016212217A

Description

The present invention relates to a statistical data reconstruction device, a statistical data reconstruction method, and a program for reconstructing statistical data from disturbance data generated by applying disturbance processing to original data.

Conventionally, techniques that conceal the individual records of a database by probabilistic means while reconstructing only the cross-tabulation result have been disclosed in, for example, Non-Patent Documents 1, 2, and 3.

Non-Patent Document 1: Dai Ikarashi, Koji Chida, Katsumi Takahashi, "A randomization method satisfying k-anonymity for numerical attributes," Proceedings of the Computer Security Symposium 2011, October 2011, Vol. 2011, pp. 450-455
Non-Patent Document 2: Dai Ikarashi, Koji Chida, Katsumi Takahashi, "Extension of k-anonymity to a probabilistic measure and its application examples," Proceedings of the Computer Security Symposium 2009, October 2009, Vol. 2009, pp. 1-6
Non-Patent Document 3: Rakesh Agrawal, Ramakrishnan Srikant, and Dilys Thomas, "Privacy preserving OLAP," In Proceedings of the 2005 ACM SIGMOD international conference on Management of data, 2005, pp. 251-262

In the prior art, a cross-tabulation table is estimated from the concealed data (hereinafter, disturbance data) without placing any assumption on the original data; this estimation is hereinafter called reconstruction processing. As a result, an enormous amount of data is needed to reconstruct the original data accurately, which is a problem.

An object of the present invention is therefore to provide a statistical data reconstruction device that can reconstruct statistical data accurately from fewer data items than before.

The present invention is a statistical data reconstruction device that reconstructs statistical data from disturbance data generated by applying disturbance processing to original data, and includes a parameter initialization unit, an E step calculation unit, an M step calculation unit, and a statistical data reconstruction unit.

The parameter initialization unit assumes that the original data before disturbance processing can be described by a linear sum of a finite number of Gaussian distributions whose covariance matrices have zero off-diagonal elements, and initializes the weight, mean, and variance of each Gaussian distribution.

The E step calculation unit repeatedly executes processing that calculates a burden rate for every combination of a predetermined number of disturbance data items, generated by adding predetermined noise to the original data, and a predetermined number of noise-added Gaussian distributions. The denominator of the burden rate contains the probability density of a given disturbance data item, and its numerator contains the probability density of that disturbance data item under a given noise-added Gaussian distribution.

The M step calculation unit repeatedly executes processing that updates the weight of each Gaussian distribution based on the mean of the burden rates, updates the mean of each Gaussian distribution in the direction that maximizes the likelihood, and updates the variance of each Gaussian distribution in the direction that maximizes the likelihood. The mean update uses the partial derivative of the logarithm of the likelihood of the probability density of the disturbance data with respect to the mean of the Gaussian distribution; the variance update uses the partial derivative of the same function with respect to the variance of the Gaussian distribution.

The statistical data reconstruction unit reconstructs the statistical data using the mean, variance, and weight that have converged through repeated execution of the burden rate calculation and of the weight, mean, and variance updates.

According to the statistical data reconstruction device of the present invention, statistical data can be reconstructed accurately from fewer data items than before.

FIG. 1 is a block diagram showing the configuration of the statistical data reconstruction device of Example 1.
FIG. 2 is a flowchart showing the operation of the statistical data reconstruction device of Example 1.
FIG. 3 is a block diagram showing the configuration of the statistical data reconstruction device of Example 2.
FIG. 4 is a flowchart showing the operation of the statistical data reconstruction device of Example 2.
FIG. 5 is a block diagram showing the configuration of the statistical data reconstruction device of Example 3.
FIG. 6 is a flowchart showing the operation of the statistical data reconstruction device of Example 3.
FIG. 7 is a block diagram showing the configuration of the E step calculation unit of Example 3.
FIG. 8 is a flowchart showing the operation of the E step calculation unit of Example 3.
FIG. 9 is a diagram explaining the rotation of data performed by the principal component analysis unit of Example 3.
FIG. 10 is a block diagram showing the configuration of the M step calculation unit of Example 3.
FIG. 11 is a flowchart showing the operation of the M step calculation unit of Example 3.

Embodiments of the present invention are described in detail below. Components having the same function are given the same reference numeral, and duplicate description is omitted.

<Notation>
In the following description, vectors are written in bold type and scalar values in italic type (or any typeface other than bold). The j-th element of a vector a is written a(j). The number of observed disturbance data items is N. The i-th (1 ≤ i ≤ N) disturbance data item is a one-dimensional vector y_i in Example 1 and a d-dimensional vector y_i (d is an integer of 2 or more) in Examples 2 and 3. The original data is a one-dimensional vector x in Example 1 and a d-dimensional vector x in Examples 2 and 3.

<Preparation>
The statistical data reconstruction device of the present invention reconstructs statistical data from disturbance data generated by applying disturbance processing to original data. Disturbance and reconstruction of statistical data consist broadly of the following two steps.
Disturbance: disturbance processing is applied to the original data to create disturbance data.
Reconstruction: statistical analysis is performed on the disturbance data to obtain a statistical result (statistical data).

Examples of disturbance processing include the noise addition of Non-Patent Document 1 and the retention-replacement disturbance of Non-Patent Document 2. Statistical analysis includes the method of estimating a cross tabulation described in Reference Non-Patent Document 1 and methods that perform a t-test. The present invention belongs to the latter of the two steps above, reconstruction (statistical analysis).
(Reference Non-Patent Document 1: Dai Ikarashi, Koji Chida, Katsumi Takahashi, "Efficient privacy-preserving cross tabulation applicable to multi-valued attributes," Proceedings of the Computer Security Symposium 2008, October 2008, Vol. 2008, pp. 497-502)

In the following examples, the technique of Non-Patent Document 1 is used as the disturbance processing. That is, in Example 1, noise that follows a one-dimensional Laplace distribution with mean 0 and parameter φ,

[Equation (1)]

(hereinafter, Laplace noise) is added to the original data, and the parameter φ is assumed to be public. In Examples 2 and 3, noise that follows a d-dimensional Laplace distribution with mean 0 and parameter φ,

[equation]

(hereinafter, also Laplace noise) is added to the original data. The statistical data reconstruction device of the present invention performs reconstruction processing using the disturbance data y_i (i = 1, ..., N) obtained by these operations.
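The disturbance step for Example 1 can be illustrated with a minimal sketch. It assumes that the public parameter φ is the usual scale parameter of the Laplace distribution, which this text does not confirm; the function name `disturb` and the sample data are hypothetical.

```python
import numpy as np

def disturb(x, phi, rng=None):
    """Add zero-mean Laplace noise to each original value."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    # Assumption: phi is the scale b of the density (1 / (2 b)) * exp(-|n| / b).
    noise = rng.laplace(loc=0.0, scale=phi, size=x.shape)
    return x + noise

rng = np.random.default_rng(0)
x = rng.normal(loc=170.0, scale=6.0, size=10_000)  # hypothetical original data
y = disturb(x, phi=5.0, rng=rng)                   # disturbance data y_i
```

Under this assumption the noise has variance 2φ², so the disturbance data are more spread out than the original data while keeping the same mean.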

The configuration and operation of the statistical data reconstruction device of Example 1 are described below with reference to FIGS. 1 and 2. FIG. 1 is a block diagram showing the configuration of the statistical data reconstruction device 1 of this example. FIG. 2 is a flowchart showing the operation of the statistical data reconstruction device 1 of this example.

As shown in FIG. 1, the statistical data reconstruction device 1 of this example includes a parameter initialization unit 11, a likelihood maximization unit 12, and a statistical data reconstruction unit 13. The likelihood maximization unit 12 includes an E step calculation unit 121 and an M step calculation unit 122.

The parameter initialization unit 11 assumes that the original data before disturbance processing can be described by a linear sum of a finite number K of Gaussian distributions whose covariance matrices have zero off-diagonal elements (K is a natural number, and k is a natural number satisfying 1 ≤ k ≤ K), that is, by

[Equation (2)]

and randomly initializes the weight ρ_k, mean μ_k, and variance σ_k² of each k-th Gaussian distribution (S11). In equation (2),

[symbol]

denotes a Gaussian distribution with mean μ and variance σ² whose covariance matrix has zero off-diagonal elements. Specifically, with z an arbitrary variable, it is expressed as

[Equation (3)]

As stated in equation (2), ρ_k is initialized randomly under the constraints that, for all (∀) k, the weight ρ_k is between 0 and 1 inclusive and the weights ρ_k sum to 1.
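A minimal sketch of step S11 follows. The source only says that the parameters are initialized randomly under the weight constraints; drawing the weights from a flat Dirichlet distribution, the means from the observed range, and the variances around the sample variance are assumptions of this sketch, and `init_params` is a hypothetical name.

```python
import numpy as np

def init_params(K, y, rng=None):
    """Randomly initialise weights, means and variances of K Gaussian components."""
    rng = np.random.default_rng() if rng is None else rng
    y = np.asarray(y, dtype=float)
    # Weights: non-negative and summing to 1 (a draw from a flat Dirichlet).
    rho = rng.dirichlet(np.ones(K))
    # Means: uniform over the observed range (an assumption of this sketch).
    mu = rng.uniform(y.min(), y.max(), size=K)
    # Variances: positive, scaled by the sample variance (also an assumption).
    var = np.var(y) * rng.uniform(0.5, 1.5, size=K)
    return rho, mu, var
```

Any scheme that keeps 0 ≤ ρ_k ≤ 1 and Σ ρ_k = 1 would satisfy the stated constraint equally well.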

In the following steps S121 and S122, the likelihood maximization unit 12 solves the following likelihood maximization problem.

[equation]
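The equation images are not reproduced in this text. As a reading aid, the model assumed up to this point can be written out as below. This is a reconstruction from the surrounding definitions, not the patent's own typesetting: the Laplace density assumes that φ is the usual scale parameter, the names p_Lap and p_0 are introduced here for illustration, and g stands for the probability density of a disturbance data item under one noise-added Gaussian component.

```latex
% (1) Laplace noise, assuming \phi is the scale parameter
p_{\mathrm{Lap}}(n) = \frac{1}{2\phi}\exp\!\left(-\frac{|n|}{\phi}\right)

% (2) assumed density of the original data, with the weight constraints
p_0(x) = \sum_{k=1}^{K} \rho_k\, \mathcal{N}(x \mid \mu_k, \sigma_k^2),
\qquad 0 \le \rho_k \le 1 \ (\forall k), \qquad \sum_{k=1}^{K} \rho_k = 1

% (3) Gaussian density
\mathcal{N}(z \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}
  \exp\!\left(-\frac{(z-\mu)^2}{2\sigma^2}\right)

% likelihood maximization over the mixture parameters
\max_{\{\rho_k,\,\mu_k,\,\sigma_k^2\}_{k=1}^{K}} \;
  \prod_{i=1}^{N} \sum_{k=1}^{K} \rho_k\, g\!\left(y_i \mid \mu_k, \sigma_k^2\right)
```

Maximizing this product is equivalent to maximizing its logarithm, which is the quantity L used in the M step.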

The probability distribution of the i-th disturbance data item is that of the sum of the two random variables described above (equations (1) and (3)). That is, the function g representing the probability distribution of the i-th disturbance data item is expressed as

[equation]

Here, erfc(x) is the complementary error function, which is defined as

[equation]

To simplify the expression, the function f is defined as

[equation]

With this definition, the function g representing the probability distribution of the i-th disturbance data item is expressed as

[equation]
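Because the patent's expressions for g and f are not reproduced in this text, the sketch below uses the standard closed form for the density of a Gaussian variable plus independent Laplace noise, which also involves erfc. It assumes the Laplace density (1/(2φ)) exp(-|n|/φ); the argument names are illustrative, and the patent's helper f may group the terms differently.

```python
import math

def g(y, mu, var, phi):
    """Density at y of X + N, with X ~ Normal(mu, var) and N ~ Laplace(0, phi).

    erfc(x) = (2 / sqrt(pi)) * integral from x to infinity of exp(-t^2) dt.
    """
    sigma = math.sqrt(var)
    t = y - mu
    a = sigma / (phi * math.sqrt(2.0))   # (sigma^2 / phi) / (sigma * sqrt(2))
    b = t / (sigma * math.sqrt(2.0))
    scale = math.exp(var / (2.0 * phi ** 2)) / (4.0 * phi)
    left = math.exp(-t / phi) * math.erfc(a - b)    # noise pushed y to the right of x
    right = math.exp(t / phi) * math.erfc(a + b)    # noise pushed y to the left of x
    return scale * (left + right)
```

This direct form can overflow when |y - mu| / φ or σ / φ is very large; a more careful implementation would work in log space.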

In this example, a generalized EM algorithm such as the one used in Reference Non-Patent Document 2 is applied to solve the above likelihood maximization problem.
(Reference Non-Patent Document 2: Jeff A. Bilmes et al., "A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models," International Computer Science Institute, Vol. 4, No. 510, 1998, pp. 1-13)

For every combination (i = 1, ..., N; l = 1, ..., K) of the N disturbance data items, generated by adding predetermined noise (Laplace noise) to the original data x, and the K noise-added Gaussian distributions, the E step calculation unit 121 repeatedly executes processing that calculates a burden rate γ(i,l) (S121). The denominator of γ(i,l) contains the probability density of the i-th disturbance data item, that is,

[equation]

and its numerator contains the probability density of the i-th disturbance data item under the l-th noise-added Gaussian distribution (l is a natural number satisfying 1 ≤ l ≤ K), that is,

[equation]

The burden rate γ(i,l) is expressed as

[equation]

It is called the burden rate because it represents the degree of burden placed on the given Gaussian distribution; in the EM literature this quantity is usually called the responsibility.
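The burden rate calculation of step S121 can be sketched as follows. The ratio itself follows the description above (component density times weight in the numerator, the full mixture density in the denominator). The density g is the standard Gaussian-plus-Laplace closed form under the assumption that φ is the Laplace scale parameter, since the patent's own expression is not reproduced in this text; all function names are hypothetical.

```python
import math
import numpy as np

_erfc = np.vectorize(math.erfc)  # element-wise complementary error function

def g(y, mu, var, phi):
    """Density of Normal(mu, var) plus Laplace(0, phi) noise; arguments broadcast."""
    sigma = np.sqrt(var)
    t = y - mu
    a = sigma / (phi * math.sqrt(2.0))
    b = t / (sigma * math.sqrt(2.0))
    scale = np.exp(var / (2.0 * phi ** 2)) / (4.0 * phi)
    return scale * (np.exp(-t / phi) * _erfc(a - b) + np.exp(t / phi) * _erfc(a + b))

def e_step(y, rho, mu, var, phi):
    """Return the N x K matrix of burden rates gamma[i, l]."""
    y = np.asarray(y, dtype=float)[:, None]                          # shape (N, 1)
    weighted = rho[None, :] * g(y, mu[None, :], var[None, :], phi)   # shape (N, K)
    return weighted / weighted.sum(axis=1, keepdims=True)

rho = np.array([0.5, 0.5])
mu = np.array([-3.0, 3.0])
var = np.array([1.0, 1.0])
gamma = e_step([-3.0, 3.0, 0.0], rho, mu, var, phi=1.0)
```

Each row of `gamma` sums to 1, and a data item lying midway between two identical components is shared equally between them.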

Next, the M step calculation unit 122 updates the weight of each Gaussian distribution based on the mean of the burden rates. That is, the M step calculation unit 122 calculates the updated weight ρ_l^new as

[equation]

(S122). Here N_l is

[Equation (7)]

so the updated weight ρ_l^new is the mean of the burden rates for the l-th Gaussian distribution. The M step calculation unit 122 takes as the function to be differentiated the value L obtained by taking the logarithm of the likelihood of the probability density of the disturbance data, that is,

[Equation (8)]

and uses its partial derivative with respect to the mean μ_l of the l-th Gaussian distribution, that is,

[equation]

to update the mean of each Gaussian distribution in the direction that maximizes the likelihood (S122). Specifically, the M step calculation unit 122 updates the mean μ_l of the l-th Gaussian distribution by

[equation]

(S122). α is called the learning rate, and the user sets it to an appropriate value before step S122 is executed. Similarly, the M step calculation unit 122 uses the partial derivative of the same function L (equation (8)) with respect to the variance σ_l of the l-th Gaussian distribution to update the variance of each Gaussian distribution in the direction that maximizes the likelihood (S122). That is, the M step calculation unit 122 updates the variance σ_l of the l-th Gaussian distribution by

[equation]

(S122). The M step calculation unit 122 repeatedly executes the processing of updating the weight ρ_l, mean μ_l, and variance σ_l described above (S122). Note that

[equation]

are partial derivatives of a differentiable function, so a solution can always be obtained. Steps S121 and S122 are executed alternately and repeatedly until a convergence condition is satisfied. Convergence is declared when there is little difference between L^new, calculated from μ^new, σ^new, and ρ^new, and L^old, calculated from μ^old, σ^old, and ρ^old, that is, when the difference between L^new and L^old is at or below a predetermined threshold.
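One generalized EM update of step S122 can be sketched as below. The weight update ρ_l^new = N_l / N follows the text. The patent uses analytic partial derivatives that are not reproduced here, so this sketch substitutes central finite differences of the log-likelihood L; evaluating the gradient with the already updated weights, the variance floor, and the Gaussian-plus-Laplace form of g (with φ taken as the Laplace scale) are further assumptions, and all names are hypothetical.

```python
import math
import numpy as np

_erfc = np.vectorize(math.erfc)  # element-wise complementary error function

def g(y, mu, var, phi):
    """Density of Normal(mu, var) plus Laplace(0, phi) noise; arguments broadcast."""
    sigma = np.sqrt(var)
    t = y - mu
    a = sigma / (phi * math.sqrt(2.0))
    b = t / (sigma * math.sqrt(2.0))
    scale = np.exp(var / (2.0 * phi ** 2)) / (4.0 * phi)
    return scale * (np.exp(-t / phi) * _erfc(a - b) + np.exp(t / phi) * _erfc(a + b))

def weighted_density(y, rho, mu, var, phi):
    """N x K matrix with entries rho_k * g(y_i | mu_k, var_k)."""
    return rho[None, :] * g(y[:, None], mu[None, :], var[None, :], phi)

def log_likelihood(y, rho, mu, var, phi):
    """L: sum over i of log(sum over k of rho_k * g(y_i | mu_k, var_k))."""
    return float(np.log(weighted_density(y, rho, mu, var, phi).sum(axis=1)).sum())

def m_step(y, rho, mu, var, phi, alpha, eps=1e-5, var_floor=1e-6):
    """Closed-form weight update, then one gradient-ascent step on means and variances."""
    w = weighted_density(y, rho, mu, var, phi)
    gamma = w / w.sum(axis=1, keepdims=True)   # burden rates
    rho_new = gamma.sum(axis=0) / len(y)       # N_l / N

    grad_mu = np.zeros_like(mu)
    grad_var = np.zeros_like(var)
    for l in range(len(mu)):
        e = np.zeros_like(mu)
        e[l] = eps
        # Central differences stand in for the analytic partial derivatives of L.
        grad_mu[l] = (log_likelihood(y, rho_new, mu + e, var, phi)
                      - log_likelihood(y, rho_new, mu - e, var, phi)) / (2 * eps)
        grad_var[l] = (log_likelihood(y, rho_new, mu, var + e, phi)
                       - log_likelihood(y, rho_new, mu, var - e, phi)) / (2 * eps)

    mu_new = mu + alpha * grad_mu
    var_new = np.maximum(var + alpha * grad_var, var_floor)  # keep variances positive
    return rho_new, mu_new, var_new
```

With a sufficiently small learning rate α, each call should not decrease L, which is what allows the difference between L^new and L^old to serve as the convergence test.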

Next, the statistical data reconstruction unit 13 reconstructs the statistical data using the mean, variance, and weight that have converged through the alternating, repeated execution of the burden rate calculation and of the weight, mean, and variance updates (S13).

Thus, according to the statistical data reconstruction device 1 of this example, statistical data can be reconstructed accurately from fewer data items than before.

The statistical data reconstruction device of Example 2, in which the original data is extended to d dimensions (d is an integer of 2 or more), is described below with reference to FIGS. 3 and 4. FIG. 3 is a block diagram showing the configuration of the statistical data reconstruction device 2 of this example. FIG. 4 is a flowchart showing the operation of the statistical data reconstruction device 2 of this example.

As shown in FIG. 3, the statistical data reconstruction device 2 of this example includes a parameter initialization unit 11, a likelihood maximization unit 22, and a statistical data reconstruction unit 13. The likelihood maximization unit 22 includes an E step calculation unit 221 and an M step calculation unit 222; the components other than the likelihood maximization unit 22 are the same as in Example 1. In this example, the likelihood maximization unit 22 solves the following likelihood maximization problem.

[equation]

Here,

[symbol]

is

[equation]

Specifically, the E step calculation unit 221 calculates the burden rate in each dimension (S221), and the M step calculation unit 222 updates the weight, mean, and variance in each dimension (S222).
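The d-dimensional expressions are not reproduced in this text. One reading that is consistent with the per-dimension processing described above is that the density of a d-dimensional disturbance data item factorizes over its coordinates. This holds if each Gaussian component has zero off-diagonal covariance and if the Laplace noise is added independently to each coordinate; the second condition is an assumption made here, not something this text states.

```latex
g\!\left(\mathbf{y}_i \mid \boldsymbol{\mu}_k, \boldsymbol{\sigma}_k^2\right)
  = \prod_{j=1}^{d} g\!\left(y_i(j) \mid \mu_k(j), \sigma_k^2(j)\right)
```

Under this reading, each factor on the right is the one-dimensional density g of Example 1, which is why the partial derivatives can be taken one dimension at a time.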

As described above, the statistical data reconstruction device 2 of this example takes partial derivatives dimension by dimension and applies the same generalized EM algorithm as in Example 1 to each dimension. Although the constraint that the covariances are 0 (the off-diagonal elements of the covariance matrix are 0) still applies, the device can also be used when the original data has multiple dimensions (d dimensions), and therefore supports a wider range of data formats than Example 1.

The following describes the statistical data reconstruction device of Example 3, which removes the constraint of the preceding examples that the off-diagonal elements of the covariance matrix are 0 and can therefore handle original data described by a linear sum of general multidimensional Gaussian distributions. When the off-diagonal elements of the covariance matrix are nonzero, that is, for a general multidimensional Gaussian mixture, the integral needed to compute g is difficult to evaluate analytically. The configuration of the statistical data reconstruction device 2 of Example 2 is therefore partially modified so that original data with a general covariance matrix can also be handled.

The configuration and operation of the statistical data reconstruction device of this example are described below with reference to FIGS. 5 and 6. FIG. 5 is a block diagram showing the configuration of the statistical data reconstruction device 3 of this example. FIG. 6 is a flowchart showing the operation of the statistical data reconstruction device 3 of this example.

As shown in FIG. 5, the statistical data reconstruction device 3 of this example includes a parameter initialization unit 31, a likelihood maximization unit 32, and a statistical data reconstruction unit 13. The likelihood maximization unit 32 includes an E step calculation unit 321 and an M step calculation unit 322. The statistical data reconstruction unit 13 is the same as in Examples 1 and 2.

The parameter initialization unit 31 assumes that the original data before disturbance processing can be described by a linear sum of a finite number of multidimensional Gaussian distributions, and initializes the weight, mean, and variance of each Gaussian distribution (S31). Steps S31 and S11 differ in the assumption placed on the original data. However, because the variances must be initialized with the covariances set to 0, step S31 is executed in the same way as step S11 described above. The E step calculation unit 321 executes the E step of this example (described later) (S321). The M step calculation unit 322 executes the M step of this example (described later) (S322). Step S13 is as described above.

The configuration and operation of the E step calculation unit 321 of this example are described below with reference to FIGS. 7 and 8. FIG. 7 is a block diagram showing the configuration of the E step calculation unit 321 of this example. FIG. 8 is a flowchart showing the operation of the E step calculation unit 321 of this example. As shown in FIG. 7, the E step calculation unit 321 of this example includes a burden rate calculation unit 3211, a data set generation unit 3212, and a principal component analysis unit 3213. The burden rate calculation unit 3211 calculates the burden rate in the same way as step S221 of Example 2 (S3211). Specifically, the burden rate calculation unit 3211 calculates the burden rate γ(i,l) as follows (S3211).

[equation]

For each i-th disturbance data item, the data set generation unit 3212 looks at the burden rates γ(i,k), k = 1, ..., K, and places that item in the data set (for example, D_k) whose index matches the Gaussian distribution with the highest burden rate (for example, the k-th). In this way it generates K data sets (D_1, ..., D_K) that together cover all the disturbance data (S3212).
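Step S3212 amounts to a hard assignment of each disturbance data item to the component with the largest burden rate. A minimal sketch, with hypothetical names and a toy burden rate matrix:

```python
import numpy as np

def make_data_sets(y, gamma):
    """Split the rows of y into K data sets by the component with the highest burden rate."""
    y = np.asarray(y, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    labels = np.argmax(gamma, axis=1)   # index k of the largest gamma(i, k) for each i
    return [y[labels == k] for k in range(gamma.shape[1])]

y = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]      # three 2-dimensional disturbance data items
gamma = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]  # toy burden rates for K = 2
D = make_data_sets(y, gamma)
```

Here items 0 and 2 end up in D[0] and item 1 in D[1]; a data set may be empty if no item favors its component.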

The principal component analysis unit 3213 performs principal component analysis on each of the K data sets, rotates each data set so that the data within it become uncorrelated, and thereby generates K uncorrelated data sets (S3213).
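The decorrelating rotation of step S3213 can be sketched for one data set as follows. The sketch assumes that principal component analysis is done by eigendecomposition of the sample covariance matrix; the function name and the centering convention are choices made here for illustration.

```python
import numpy as np

def decorrelate(D):
    """Rotate one data set onto its principal axes so its sample covariance is diagonal."""
    D = np.asarray(D, dtype=float)
    mean = D.mean(axis=0)
    cov = np.cov(D - mean, rowvar=False)
    _, V = np.linalg.eigh(cov)   # columns of V are orthonormal eigenvectors of cov
    Z = (D - mean) @ V           # coordinates of each item in the rotated frame
    return Z, V, mean
```

The matrix V is orthogonal, so the transform is undone by `Z @ V.T + mean`. Note that `eigh` may return a V with determinant -1, in which case the transform is a rotation combined with a reflection; the decorrelation is unaffected.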

The operation of the principal component analysis unit 3213 is explained further with reference to FIG. 9. FIG. 9 illustrates the rotation of data performed by the principal component analysis unit 3213 of this example. FIGS. 9A, 9B, and 9C are contour plots of the data distribution, with height data on the x-axis, weight data on the y-axis, and frequency on the z-axis. When the original data consist of two correlated dimensions, such as the height and weight shown in FIG. 9, the off-diagonal elements of the covariance matrix are nonzero, as shown in FIG. 9A, and the analysis by the M step calculation unit 322 described later becomes difficult. The principal component analysis unit 3213 of this example therefore performs principal component analysis and rotates the data set, for example as in FIG. 9B or FIG. 9C, so that the data within each data set are uncorrelated (the off-diagonal elements of the covariance matrix are 0).

The configuration and operation of the M step calculation unit 322 of this example are described below with reference to FIGS. 10 and 11. FIG. 10 is a block diagram showing the configuration of the M step calculation unit 322 of this example. FIG. 11 is a flowchart showing the operation of the M step calculation unit 322 of this example. As shown in FIG. 10, the M step calculation unit 322 of this example includes a parameter update unit 3221, a reverse rotation unit 3222, and a convergence determination unit 3223.

The parameter update unit 3221 updates the weight, mean, and variance in the same way as step S222 of Example 2 (S3221). Specifically, the parameter update unit 3221 updates the weight, mean, and variance according to the following equations (S3221).

[equations]

For N_l, equation (7) is used as is. The reverse rotation unit 3222 applies the rotation opposite to that of the principal component analysis to the diagonal matrix Σ^new, whose diagonal elements are the elements of the updated variance σ_l^new, to the updated mean μ_l^new, and to the data sets generated in step S3212 (S3222). The convergence determination unit 3223 determines that convergence has been reached when the change between the likelihood L^old before the update and the likelihood L^new after the update is at or below a predetermined threshold (S3223).
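Steps S3222 and S3223 can be sketched as below. The sketch assumes that the forward rotation was z = (x - mean) @ V with an orthogonal matrix V of principal axes, a convention chosen here for illustration; the function names and the default threshold are hypothetical.

```python
import numpy as np

def rotate_back(mu_rot, var_rot, V, mean):
    """Map a mean and per-axis variances from the rotated frame back to the original frame."""
    mu = V @ np.asarray(mu_rot, dtype=float) + mean   # undo z = (x - mean) @ V
    Sigma = V @ np.diag(var_rot) @ V.T                # full covariance in the original frame
    return mu, Sigma

def has_converged(L_old, L_new, threshold=1e-6):
    """Convergence test: change in likelihood at or below the threshold."""
    return abs(L_new - L_old) <= threshold
```

After the reverse rotation, Σ is in general no longer diagonal, which is how a component with correlated dimensions is represented even though the update itself was carried out on uncorrelated data.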

According to the statistical data reconstruction device 3 of this example, statistical data can be reconstructed accurately from fewer data items than before, even when the original data are described by a linear sum of general multidimensional Gaussian distributions.

The distinguishing feature of the statistical data reconstruction device 3 of this example is that it also handles original data that are correlated. When the original data are uncorrelated, the required integral can be solved analytically as disclosed in Example 1, but when they are correlated, analytic evaluation of the integral becomes difficult. Therefore, as shown in step S3213, the data are rotated to decorrelate them, so that correlated original data can also be handled.

Non-Patent Documents 1, 2, and 3 treat the frequency cross-tabulation table as the statistic that can be reconstructed, whereas the present invention reconstructs the probability distribution of the original data as the statistic. Once the probability distribution of the original data is obtained, a percentage cross-tabulation table can be derived from it, and if the number of data items is also known, a frequency cross-tabulation table can be derived as well. Thus, although the present invention reconstructs a probability distribution, this is essentially equivalent to reconstructing a frequency cross-tabulation table.
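The step from a reconstructed probability distribution to a tabulation can be sketched for the one-dimensional case. The sketch integrates a Gaussian mixture over bins with the normal cumulative distribution function to obtain a percentage table, then scales by the number of data items to obtain a frequency table. The function names, bin edges, and parameter values are hypothetical.

```python
import math
import numpy as np

def normal_cdf(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def bin_probabilities(edges, rho, mu, var):
    """Probability mass of a 1-D Gaussian mixture in each bin [edges[b], edges[b+1])."""
    probs = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        p = sum(r * (normal_cdf((hi - m) / math.sqrt(v)) - normal_cdf((lo - m) / math.sqrt(v)))
                for r, m, v in zip(rho, mu, var))
        probs.append(p)
    return np.array(probs)

def frequency_table(edges, rho, mu, var, n):
    """Expected counts per bin for n data items."""
    return n * bin_probabilities(edges, rho, mu, var)

# Hypothetical converged parameters for heights in centimetres.
edges = [150, 160, 170, 180, 190]
percent = 100.0 * bin_probabilities(edges, [0.4, 0.6], [165.0, 175.0], [25.0, 16.0])
counts = frequency_table(edges, [0.4, 0.6], [165.0, 175.0], [25.0, 16.0], n=1000)
```

For d-dimensional data with uncorrelated dimensions, the cell probabilities of a cross tabulation would be weighted sums of products of such one-dimensional bin probabilities.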

A question common to the statistical data reconstruction devices of the first, second, and third embodiments described above is what value to choose for K, the number of Gaussian distributions. K is preferably determined using the Akaike information criterion or the Bayesian information criterion.
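A minimal sketch of such a selection is given below. It assumes that the maximized log-likelihood has already been obtained for each candidate K, and that the mixture has diagonal covariance, so that the free-parameter count is (K - 1) weights plus K x D means plus K x D variances. A full-covariance model would need a different count. The log-likelihood values and function names are hypothetical.

```python
import math

def num_free_parameters(k, dims):
    """Free parameters of a diagonal-covariance mixture (assumed model)."""
    return (k - 1) + 2 * k * dims

def aic(log_likelihood, n_params):
    """Akaike information criterion: 2p - 2 ln L."""
    return 2.0 * n_params - 2.0 * log_likelihood

def bic(log_likelihood, n_params, n_samples):
    """Bayesian information criterion: p ln N - 2 ln L."""
    return n_params * math.log(n_samples) - 2.0 * log_likelihood

def select_k(log_likelihoods, dims, n_samples, criterion="bic"):
    """Return the K with the smallest criterion value."""
    scores = {}
    for k, ll in log_likelihoods.items():
        p = num_free_parameters(k, dims)
        scores[k] = aic(ll, p) if criterion == "aic" else bic(ll, p, n_samples)
    return min(scores, key=scores.get), scores

# Hypothetical maximized log-likelihoods for K = 1, 2, 3 on 1000 records.
log_likelihoods = {1: -1500.0, 2: -1400.0, 3: -1398.0}
best_bic, bic_scores = select_k(log_likelihoods, dims=1, n_samples=1000)
best_aic, aic_scores = select_k(log_likelihoods, dims=1, n_samples=1000,
                                criterion="aic")
```

In this example the gain in likelihood from K = 2 to K = 3 is too small to offset the penalty for the extra parameters, so both criteria select K = 2.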

<Supplementary note>
The device of the present invention, as a single hardware entity for example, has an input unit to which a keyboard or the like can be connected; an output unit to which a liquid crystal display or the like can be connected; a communication unit to which a communication device (for example, a communication cable) capable of communicating with the outside of the hardware entity can be connected; a CPU (Central Processing Unit, which may include a cache memory, registers, and the like); RAM and ROM as memory; an external storage device such as a hard disk; and a bus that connects the input unit, output unit, communication unit, CPU, RAM, ROM, and external storage device so that data can be exchanged among them. If necessary, the hardware entity may also be provided with a device (drive) that can read from and write to a recording medium such as a CD-ROM. A general-purpose computer is one example of a physical entity having such hardware resources.

The external storage device of the hardware entity stores the program required to realize the functions described above, the data required for the processing of this program, and the like. (Storage is not limited to the external storage device; for example, the program may be stored in a ROM, which is a read-only storage device.) Data obtained by the processing of these programs is stored as appropriate in the RAM, the external storage device, or the like.

In the hardware entity, each program stored in the external storage device (or the ROM or the like) and the data required for processing each program are read into memory as necessary, and are interpreted, executed, and processed by the CPU as appropriate. As a result, the CPU realizes the predetermined functions (the constituent elements referred to above as "... unit", "... means", and so on).

The present invention is not limited to the embodiments described above, and may be modified as appropriate without departing from the spirit of the present invention. The processes described in the above embodiments need not be executed in time series in the order described; they may be executed in parallel or individually, depending on the processing capability of the device that executes them or as needed.

As already described, when the processing functions of the hardware entity (the device of the present invention) described in the above embodiments are realized by a computer, the processing content of the functions that the hardware entity should have is described by a program. Executing this program on the computer realizes the processing functions of the hardware entity on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, a hard disk device, a flexible disk, a magnetic tape, or the like can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), CD-R (Recordable)/RW (ReWritable), or the like as the optical disc; an MO (Magneto-Optical disc) or the like as the magneto-optical recording medium; and an EEP-ROM (Electrically Erasable and Programmable Read-Only Memory) or the like as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Alternatively, the program may be stored in a storage device of a server computer and distributed by transferring it from the server computer to other computers via a network.

A computer that executes such a program, for example, first stores the program recorded on the portable recording medium, or the program transferred from the server computer, in its own storage device. When executing a process, the computer reads the program stored in its own recording medium and executes the process according to the program it has read. As another form of execution, the computer may read the program directly from the portable recording medium and execute processes according to it, or it may execute processes according to the received program successively each time the program is transferred to it from the server computer. The processes described above may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to the computer and the processing functions are realized solely through execution instructions and the retrieval of results. Note that the program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

In this embodiment the hardware entity is configured by executing a predetermined program on a computer, but at least part of the processing content may instead be realized in hardware.

Claims

A statistical data reconstruction device that reconstructs statistical data from disturbance data generated by performing disturbance processing on original data, the device comprising:
a parameter initial value unit that, on the assumption that the original data before the disturbance processing can be described by a linear sum of a finite number of Gaussian distributions whose covariance matrices have off-diagonal elements equal to 0, initializes the weight, mean, and variance of each Gaussian distribution;
an E-step calculation unit that repeatedly executes a process of calculating, for every combination of a given item of disturbance data generated by adding predetermined noise to the original data and a given Gaussian distribution to which the noise has been added, a burden rate whose denominator includes the probability density of that item of disturbance data and whose numerator includes the probability density of that item of disturbance data under that noise-added Gaussian distribution;
an M-step calculation unit that repeatedly executes a process of updating the weight of each Gaussian distribution based on the average value of the burden rate, updating the mean of each Gaussian distribution in the direction that maximizes the likelihood by using the partial derivative of the logarithm of the likelihood density of the disturbance data with respect to the mean of the Gaussian distribution, and updating the variance of each Gaussian distribution in the direction that maximizes the likelihood by using the partial derivative of the logarithm of the likelihood density of the disturbance data with respect to the variance of the Gaussian distribution; and
a statistical data reconstruction unit that reconstructs the statistical data using the mean, variance, and weight that have converged through repeated execution of the process of calculating the burden rate and the process of updating the weight, the mean, and the variance.

The statistical data reconstruction device according to claim 1, wherein
the original data is data having a plurality of dimensions,
the E-step calculation unit calculates the burden rate in each dimension, and
the M-step calculation unit updates the weight, the mean, and the variance in each dimension.

A statistical data reconstruction device that reconstructs statistical data from disturbance data generated by performing disturbance processing on original data, wherein the original data is data having a plurality of dimensions, the device comprising:
a parameter initialization unit that, on the assumption that the original data before the disturbance processing can be described by a linear sum of a finite number of multidimensional Gaussian distributions, initializes the weight, mean, and variance of each Gaussian distribution;
a burden rate calculation unit that repeatedly executes, for each dimension, a process of calculating, for every combination of a given item of disturbance data generated by adding predetermined noise to the original data and a given Gaussian distribution to which the noise has been added, a burden rate whose denominator includes the probability density of that item of disturbance data and whose numerator includes the probability density of that item of disturbance data under that noise-added Gaussian distribution;
a data set generation unit that generates a predetermined number of data sets covering all the disturbance data, such that each item of disturbance data is included in the data set corresponding to the number of the Gaussian distribution that has the highest burden rate among the burden rates for that item;
a principal component analysis unit that performs principal component analysis on each of the data sets and rotates the data in each data set so that it is uncorrelated;
a parameter updating unit that repeatedly executes a process of updating, for each dimension, the weight of each Gaussian distribution based on the average value of the burden rate, updating, for each dimension, the mean of each Gaussian distribution in the direction that maximizes the likelihood by using the partial derivative of the logarithm of the likelihood density of the disturbance data with respect to the mean of the Gaussian distribution, and updating, for each dimension, the variance of each Gaussian distribution in the direction that maximizes the likelihood by using the partial derivative of the logarithm of the likelihood density of the disturbance data with respect to the variance of the Gaussian distribution;
a reverse rotation unit that applies a rotation opposite to the rotation in the principal component analysis to the diagonal matrix whose diagonal elements are the updated variances, to the updated mean, and to the generated data sets;
a convergence determination unit that determines that convergence has been reached when the change between the likelihood before the update and the likelihood after the update is equal to or less than a predetermined threshold; and
a statistical data reconstruction unit that reconstructs the statistical data using the converged mean, variance, and weight.

A statistical data reconstruction method executed by a statistical data reconstruction device that reconstructs statistical data from disturbance data generated by performing disturbance processing on original data, the method comprising:
a parameter initial value step of, on the assumption that the original data before the disturbance processing can be described by a linear sum of a finite number of Gaussian distributions whose covariance matrices have off-diagonal elements equal to 0, initializing the weight, mean, and variance of each Gaussian distribution;
an E-step calculation step of repeatedly executing a process of calculating, for every combination of a given item of disturbance data generated by adding predetermined noise to the original data and a given Gaussian distribution to which the noise has been added, a burden rate whose denominator includes the probability density of that item of disturbance data and whose numerator includes the probability density of that item of disturbance data under that noise-added Gaussian distribution;
an M-step calculation step of repeatedly executing a process of updating the weight of each Gaussian distribution based on the average value of the burden rate, updating the mean of each Gaussian distribution in the direction that maximizes the likelihood by using the partial derivative of the logarithm of the likelihood density of the disturbance data with respect to the mean of the Gaussian distribution, and updating the variance of each Gaussian distribution in the direction that maximizes the likelihood by using the partial derivative of the logarithm of the likelihood density of the disturbance data with respect to the variance of the Gaussian distribution; and
a statistical data reconstruction step of reconstructing the statistical data using the mean, variance, and weight that have converged through repeated execution of the process of calculating the burden rate and the process of updating the weight, the mean, and the variance.

The statistical data reconstruction method according to claim 4, wherein
the original data is data having a plurality of dimensions,
the E-step calculation step calculates the burden rate in each dimension, and
the M-step calculation step updates the weight, the mean, and the variance in each dimension.

A statistical data reconstruction method executed by a statistical data reconstruction device that reconstructs statistical data from disturbance data generated by performing disturbance processing on original data, wherein the original data is data having a plurality of dimensions, the method comprising:
a parameter initialization step of, on the assumption that the original data before the disturbance processing can be described by a linear sum of a finite number of multidimensional Gaussian distributions, initializing the weight, mean, and variance of each Gaussian distribution;
a burden rate calculation step of repeatedly executing, for each dimension, a process of calculating, for every combination of a given item of disturbance data generated by adding predetermined noise to the original data and a given Gaussian distribution to which the noise has been added, a burden rate whose denominator includes the probability density of that item of disturbance data and whose numerator includes the probability density of that item of disturbance data under that noise-added Gaussian distribution;
a data set generation step of generating a predetermined number of data sets covering all the disturbance data, such that each item of disturbance data is included in the data set corresponding to the number of the Gaussian distribution that has the highest burden rate among the burden rates for that item;
a principal component analysis step of performing principal component analysis on each of the data sets and rotating the data in each data set so that it is uncorrelated;
a parameter updating step of repeatedly executing a process of updating, for each dimension, the weight of each Gaussian distribution based on the average value of the burden rate, updating, for each dimension, the mean of each Gaussian distribution in the direction that maximizes the likelihood by using the partial derivative of the logarithm of the likelihood density of the disturbance data with respect to the mean of the Gaussian distribution, and updating, for each dimension, the variance of each Gaussian distribution in the direction that maximizes the likelihood by using the partial derivative of the logarithm of the likelihood density of the disturbance data with respect to the variance of the Gaussian distribution;
a reverse rotation step of applying a rotation opposite to the rotation in the principal component analysis to the diagonal matrix whose diagonal elements are the updated variances, to the updated mean, and to the generated data sets;
a convergence determination step of determining that convergence has been reached when the change between the likelihood before the update and the likelihood after the update is equal to or less than a predetermined threshold; and
a statistical data reconstruction step of reconstructing the statistical data using the converged mean, variance, and weight.

A program for causing a computer to function as the statistical data reconstruction device according to any one of claims 1 to 3.