JP2018067227A - Data analyzing apparatus, data analyzing method, and data analyzing processing program

Info

Publication number: JP2018067227A
Application number: JP2016206718A
Authority: JP
Inventors: Miyuki Imada; Kei Hirose
Original assignee: Nippon Telegraph and Telephone Corp; Osaka University NUC
Current assignee: Nippon Telegraph and Telephone Corp; Osaka University NUC
Priority date: 2016-10-21
Filing date: 2016-10-21
Publication date: 2018-04-26

Abstract

PROBLEM TO BE SOLVED: To reduce the load caused by estimating a posterior distribution.
SOLUTION: A data analyzing apparatus according to the present embodiment has: parameter estimation means for regression-analyzing data to be analyzed to estimate a parameter that is a regression coefficient of an explanatory variable; Bayes estimation executing means for treating the parameter estimated by the parameter estimation means as the prior distribution of Bayes estimation and executing Bayes estimation on the basis of the prior distribution and a predetermined likelihood function to derive the posterior distribution of the Bayes estimation; and sparse estimation executing means for executing sparse estimation by assigning, to a loss function, the result of a logarithmic transformation of the posterior distribution derived by the Bayes estimation executing means.
SELECTED DRAWING: Figure 1

Description

Embodiments of the present invention relate to a data analysis apparatus, a data analysis method, and a data analysis processing program.

Bayesian estimation is a method for estimating parameters in an environment where data can be obtained only sequentially. In existing Bayesian estimation, the posterior distribution is difficult to obtain analytically, so it is usually obtained by an approximation such as MCMC (Markov chain Monte Carlo methods) (see, for example, Non-Patent Document 1).

Non-Patent Document 1: Yukito Iba et al., "Frontier of Statistical Computation 12: Computational Statistics II, Markov Chain Monte Carlo Method and its Surroundings", Iwanami Shoten, 2009.
Non-Patent Document 2: J. Friedman, T. Hastie and R. Tibshirani, "Regularization Paths for Generalized Linear Models via Coordinate Descent", Journal of Statistical Software, Vol. 33, Issue 1, January 2010. Retrieved August 25, 2016, from the Internet <https://core.ac.uk/download/files/153/6287975.pdf>

MCMC is a method for obtaining an optimal solution by repeated sampling. It is therefore computationally slow, and it cannot select variables so as to minimize the input data (input variables). In other words, MCMC, which is often used in Bayesian estimation, cannot reduce the dimension of the variables. For practical services it is desirable to reduce the number of input variables in order to lighten the data-input load, and the conventional method is not suited to this.

An object of the present invention is to provide a data analysis apparatus, a data analysis method, and a data analysis processing program that can reduce the load of estimating the posterior distribution.

To achieve the above object, a first aspect of the data analysis apparatus according to an embodiment of the present invention provides an apparatus having: parameter estimation means for performing regression analysis on data to be analyzed and estimating a parameter that is a regression coefficient of an explanatory variable; Bayesian estimation execution means for taking the parameter estimated by the parameter estimation means as the prior distribution of Bayesian estimation and deriving the posterior distribution of the Bayesian estimation by performing Bayesian estimation based on this prior distribution and a predetermined likelihood function; and sparse estimation execution means for executing sparse estimation by substituting the result of a logarithmic transformation of the posterior distribution derived by the Bayesian estimation execution means into a loss function.

In a second aspect of the data analysis apparatus configured as above, in the first aspect, the parameter estimation means classifies the data to be analyzed using explanatory variables for which the absolute value of the regression coefficient of the regression equation derived by the sparse estimation execution means has a magnitude satisfying a predetermined condition, performs regression analysis on the classified data, and estimates a parameter that is a regression coefficient of an explanatory variable.

A first aspect of the data analysis method according to an embodiment of the present invention is a method applied to a data analysis apparatus, the method comprising: performing regression analysis on data to be analyzed and estimating a parameter that is a regression coefficient of an explanatory variable; taking the estimated parameter as the prior distribution of Bayesian estimation and deriving the posterior distribution of the Bayesian estimation by performing Bayesian estimation based on this prior distribution and a predetermined likelihood function; and executing sparse estimation by substituting the result of a logarithmic transformation of the derived posterior distribution into a loss function.

In a second aspect of the data analysis method, in the first aspect, the data to be analyzed are classified using explanatory variables for which the absolute value of the regression coefficient of the derived regression equation has a magnitude satisfying a predetermined condition, regression analysis is performed on the classified data, and a parameter that is a regression coefficient of an explanatory variable is estimated.

An aspect of the data analysis processing program according to an embodiment of the present invention is a program used in a computer that operates as part of the data analysis apparatus of the first or second aspect, the program causing the computer to function as the parameter estimation means, the Bayesian estimation execution means, and the sparse estimation execution means.

According to the present invention, the load of estimating the posterior distribution can be reduced.

FIG. 1 is a diagram showing a configuration example of the data analysis apparatus according to the first embodiment of the present invention.
FIG. 2 is a flowchart showing an example of a processing procedure performed by the data analysis apparatus according to the first embodiment of the present invention.

Embodiments according to the present invention will be described below.
(First embodiment)
First, the first embodiment will be described. The first embodiment focuses on the fact that the expression obtained by taking the log (logarithm) of the posterior distribution of Bayesian estimation has the same form as the expression obtained by taking the log in sparse estimation, and performs sparse estimation using the posterior distribution output by Bayesian estimation.

In the first embodiment, instead of MCMC, sparse estimation is performed by using, as the Bayesian likelihood function, a regularized maximum likelihood method that estimates the mode of the posterior distribution (that is, the value that maximizes the density function of the posterior distribution). This sparse estimation is based on a Lasso (least absolute shrinkage and selection operator) type regularization method, which achieves stable prediction using only the explanatory variables that are effective for predicting the objective variable. Because a normal distribution is assumed for the prior distribution of Bayesian estimation, the sparse estimation algorithm can be used as it is to select effective parameters. The detailed algorithm is described below.

In the first embodiment, the parameter θ, which is the regression coefficient of each explanatory variable of a linear regression model, is estimated accurately in a prediction formula that can stably calculate predicted values.
Let X (the design matrix) hold quantities such as the degree of connection before an event occurs, age, and sex, and let y (the output, that is, the target quantity to be estimated) be the degree of connection after the event has occurred. It is desirable to predict y from X with a linear regression model, but when the sample size is small, the estimate $\hat{\theta}$ of the parameter θ, the regression coefficient of the explanatory variables of the linear regression model, becomes unstable.

Therefore, in the first embodiment, a more stable estimate $\hat{\theta}$ is calculated using the estimate $\hat{\theta}_{Web}$ of the parameter θ obtained from large-scale data such as a Web questionnaire.

Bayesian estimation is a method for calculating the estimate $\hat{\theta}$ even when the observation data available for estimation are scarce (for example, when the sample size is small).

To obtain a stable estimate, it is also desirable to reduce the number of explanatory variables as far as possible so as to lessen the constraints on the variables needed for the calculation. Sparse estimation is a method for obtaining such a stable estimate.

The first embodiment therefore describes a method that performs Bayesian estimation and sparse estimation together. Specifically, the mode of the posterior distribution is estimated by the Bayesian estimation described below. The steps for performing both Bayesian estimation and sparse estimation are outlined as follows.

(Step 1) Derivation of the posterior distribution by assuming normality for the prior distribution and the data
In Bayesian estimation, the mode or the mean of the posterior distribution must be calculated in order to derive the prediction formula that is ultimately sought. Here, calculation of the mode of the posterior distribution is described. This problem corresponds to maximizing the function obtained by adding the logarithm of the prior distribution to the log-likelihood function. In the first embodiment, a sparse estimation penalty term, which enables variable selection, is added to this function. This penalty term is needed to calculate the parameters corresponding to the regression coefficients of the explanatory variables in the prediction formula, and to find the maximum of the logarithm of the penalized likelihood function of the sparse estimation described in Step 2 below.

(Step 2) Selection of effective parameters from the mode of the posterior distribution of Bayesian estimation using a sparse estimation algorithm
A sparse estimation algorithm is used to calculate the mode of the posterior distribution of Bayesian estimation, and only the effective variables are selected. If the expression calculated in Step 1 were complicated, the optimization problem would become difficult. In the first embodiment, however, the expression has a form to which the conventional algorithm used in Lasso can be applied as it is to select effective parameters, so a fast and simple program can be written.

The details of Step 1 and Step 2 are described below.
First, the details of Step 1 are described.
MCMC, the posterior estimation technique mentioned above, is a general-purpose technique that can be used for any distribution, but its computation time is long. The approach of the first embodiment is not as general as MCMC, but when normality is assumed for the prior distribution and the data, the extremely fast Lasso algorithm can be used as it is to select effective parameters.

Specifically, consider Bayesian estimation that uses the estimate $\hat{\theta}_{Web}$ of the parameter θ obtained from large-scale data such as a Web questionnaire. Let the likelihood function, which is an index of goodness of fit to the data, be $f(y \mid \theta)$, and let the prior distribution of Bayesian estimation be $\pi(\theta)$ (the logarithm of the likelihood function must be a quadratic function, as it is for the normal distribution, among others). The posterior distribution $p(\theta \mid y)$ of Bayesian estimation is then

$$p(\theta \mid y) \propto f(y \mid \theta)\,\pi(\theta) \tag{1}$$

The goal here is to find the parameter θ that maximizes the left side of Equation (1). The left side is difficult to obtain directly. The right side, on the other hand, is the product of the likelihood function and the prior distribution of Bayesian estimation and is easy to calculate, so the right side of Equation (1) is calculated instead.

Here, it is assumed that the likelihood function $f(y \mid \theta)$ follows a normal distribution and that the prior distribution $\pi(\theta)$ follows a normal distribution with mean vector $\hat{\theta}_{Web}$. Normal distributions are assumed because a regression model that assumes normality is the most standard approach and because this makes the Lasso estimate described later easy to calculate. In the likelihood function and the prior distribution, the estimate $\hat{\theta}_{Web}$ of the parameter θ obtained from large-scale data such as a Web questionnaire is used as the mean vector.
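Under these normality assumptions, the mode of the posterior can be found from the product of likelihood and prior alone. The following minimal sketch illustrates this for a single coefficient; the function names, the variances `sigma2` and `gamma2`, and the toy data are illustrative assumptions, not values from this document. It compares a brute-force search over the unnormalized log-posterior with the closed-form mode that the quadratic form allows.

```python
def log_posterior_unnorm(theta, xs, ys, theta_web, sigma2, gamma2):
    """log f(y|theta) + log pi(theta), dropping terms that do not depend on theta."""
    log_lik = -sum((y - x * theta) ** 2 for x, y in zip(xs, ys)) / (2.0 * sigma2)
    log_prior = -((theta - theta_web) ** 2) / (2.0 * gamma2)
    return log_lik + log_prior


def map_closed_form(xs, ys, theta_web, sigma2, gamma2):
    """Mode of the posterior when likelihood and prior are both normal (1-D case)."""
    num = sum(x * y for x, y in zip(xs, ys)) / sigma2 + theta_web / gamma2
    den = sum(x * x for x in xs) / sigma2 + 1.0 / gamma2
    return num / den


# Toy data (assumed): three observations and a prior mean from "large-scale" data.
xs = [1.0, 2.0, 3.0]
ys = [1.1, 1.9, 3.2]
theta_web, sigma2, gamma2 = 0.8, 1.0, 0.5

# Brute-force search over a grid with step 0.001.
grid = [i / 1000.0 for i in range(0, 2001)]
theta_grid = max(
    grid, key=lambda t: log_posterior_unnorm(t, xs, ys, theta_web, sigma2, gamma2)
)
theta_map = map_closed_form(xs, ys, theta_web, sigma2, gamma2)
```

The two values agree to within the grid spacing, which is the point of the normality assumption: no sampling is needed to locate the mode.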

Because Equation (1) involves exponential functions, it is difficult to calculate as it stands. Taking the logarithm of both sides of Equation (1) gives

$$\log p(\theta \mid y) = \log f(y \mid \theta) + \log \pi(\theta) + \mathrm{const}$$

Substituting the density function of the normal distribution into this result yields Equation (2).

Here, τ (a tuning parameter) is the ratio of the variances σ and γ, and C is a constant. Thus, when normal distributions are assumed for the likelihood function and the prior distribution, the logarithm of the posterior distribution is a quadratic function of the parameter β.
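One reading of Equation (2) that is consistent with this description can be written out as follows. This is a hedged sketch, not the published equation: it assumes $y \mid \beta \sim N(X\beta, \sigma^{2} I)$ and $\beta \sim N(\beta_{0}, \gamma^{2} I)$, where $\beta$ stands for the parameter written θ above, $\beta_{0}$ stands for the Web-data estimate, and the isotropic covariances are an assumption made for illustration.

```latex
\begin{align*}
\log p(\beta \mid y)
  &= \log f(y \mid \beta) + \log \pi(\beta) + \mathrm{const} \\
  &= -\frac{1}{2\sigma^{2}} \lVert y - X\beta \rVert^{2}
     -\frac{1}{2\gamma^{2}} \lVert \beta - \beta_{0} \rVert^{2} + C \\
  &= -\frac{1}{2\sigma^{2}} \left\{ \lVert y - X\beta \rVert^{2}
     + \tau \lVert \beta - \beta_{0} \rVert^{2} \right\} + C,
  \qquad \tau = \frac{\sigma^{2}}{\gamma^{2}} .
\end{align*}
```

In this form τ controls how strongly the estimate is pulled toward $\beta_{0}$: a small prior variance $\gamma^{2}$ gives a large τ.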

Next, the details of Step 2 are described.
Here, the parameter θ obtained from large-scale data such as a Web questionnaire is estimated by the L1 regularization method (Lasso). That is, the Lasso estimate is calculated using the estimate $\hat{\theta}_{Web}$ of the parameter θ obtained in Step 1.

First, attention is paid to the Lasso loss function in Equation (2). A loss function expresses the magnitude of the loss incurred when the model does not fit the data, and Equation (2) contains the corresponding term. The Lasso loss function is quadratic, so if the loss function used for the present Bayesian estimation is also quadratic, the Lasso algorithm can be used as it is to select effective parameters. In this case, the loss function is the negative log-likelihood function $-\log f(y \mid \theta)$.

The first embodiment focuses on the fact that the loss function is quadratic. Rather than using only the negative log-likelihood function $-\log f(y \mid \theta)$, it uses the function obtained by subtracting the logarithm of the prior distribution from it, $-\log f(y \mid \theta) - \log \pi(\theta)$. This corresponds to Equation (2). Subtracting the logarithm of the prior distribution from the negative log-likelihood means that the prior distribution is incorporated into Equation (2). Incorporating the parameter θ obtained from the Web data into the prior distribution in this way gives Equation (2).

Next, the Lasso function is given as Equation (3).

Substituting Equation (3) into the quadratic function of Equation (2) yields Equation (4).

Equation (4) is a Lasso into which the posterior distribution $p(\theta \mid y)$ of Bayesian estimation is incorporated. Equation (5), which consists of the terms of Equation (4) other than the sum of absolute values, is clearly a quadratic function of the parameter β.
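A form of Equations (3) to (5) that matches this description can be sketched as follows. The notation is assumed rather than taken from the published equations: $\ell(\beta)$ is a generic loss, $\lambda$ is a penalty weight introduced here for illustration, $X$, $\beta_{0}$ and $\tau$ are as in a normal likelihood with a normal prior centered on the Web-data estimate, and the factor $\tfrac{1}{2}$ is a scaling choice.

```latex
\begin{align*}
\text{Lasso objective, cf. (3):}\quad
  & \min_{\beta}\; \ell(\beta) + \lambda \sum_{j=1}^{p} \lvert \beta_{j} \rvert \\
\text{with the posterior as loss, cf. (4):}\quad
  & \min_{\beta}\; \tfrac{1}{2}\left\{ \lVert y - X\beta \rVert^{2}
    + \tau \lVert \beta - \beta_{0} \rVert^{2} \right\}
    + \lambda \sum_{j=1}^{p} \lvert \beta_{j} \rvert \\
\text{quadratic part, cf. (5):}\quad
  & \tfrac{1}{2}\left\{ \lVert y - X\beta \rVert^{2}
    + \tau \lVert \beta - \beta_{0} \rVert^{2} \right\}
\end{align*}
```

Maximizing the log-posterior is the same as minimizing the bracketed quadratic, so the only non-smooth term is the sum of absolute values, exactly as in an ordinary Lasso.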

In this case, the Lasso algorithm can be used as it is to select effective parameters.
This Lasso algorithm is described, for example, in Non-Patent Document 2. The expression there that corresponds to the quadratic function of the parameter β is Equation (4) on p. 5 of Non-Patent Document 2, and Equation (5) above may be substituted into it.
This substitution is explained as applied to the present embodiment. The logarithm of the posterior distribution of Bayesian estimation is proportional to Equation (6).

The variables in Equation (6) have the following meanings.
f: likelihood function
π: prior distribution
θ: all parameters
y: output
When X (the design matrix) is replaced by A (a new explanatory variable) and y is replaced by Z (a new objective variable), Equation (6) becomes Equation (7), the expression for sparse estimation by Lasso (the logarithm of the penalized likelihood function).

The least-squares estimator then becomes the Bayes estimator of the original problem. Here p is the posterior distribution.
As noted above, Equation (7) is equal to Equation (2) described earlier, and a corresponding calculation holds for the expression inside the parentheses of Equation (2).

The variables above have the following meanings.
β: reciprocal of the variance σ² (corresponds to θ)
B0: coefficients obtained from large-scale data (corresponds to θ_Web)
T: transpose
I: identity matrix
According to the first embodiment, the Lasso algorithm shown in Equation (4) described in Step 2 can be used as it is to select effective parameters, and a prediction formula can be obtained.
MCMC can also generate random numbers from the posterior distribution of Bayesian estimation, but this is only random number generation from the distribution. It is not a method for finding the mode of the posterior distribution (that is, the value that maximizes the density function of the posterior distribution) as outlined in Step 1.
For this reason, MCMC cannot calculate the mode of the posterior distribution of Bayesian estimation exactly. The mode can of course be approximated, but because the regression coefficients cannot be estimated as exactly zero in the way Lasso does, variable selection is not possible. In contrast, the first embodiment is an algorithm that can use Lasso, so it also solves this problem of variable selection.
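The reduction to a standard Lasso can be sketched in code. The sketch below assumes one particular construction of the new variables: A is X with √τ·I stacked underneath, and Z is y with √τ·B0 stacked underneath, so that ||Z − Aβ||² = ||y − Xβ||² + τ||β − B0||². This construction, the function names, and the toy numbers are illustrative assumptions, not the patent's own definitions. The solver is a plain cyclic coordinate descent in the spirit of Non-Patent Document 2, not a reproduction of it.

```python
import math


def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def augment(X, y, beta0, tau):
    """Build (A, Z) so that ||Z - A b||^2 = ||y - X b||^2 + tau * ||b - beta0||^2."""
    p = len(beta0)
    s = math.sqrt(tau)
    A = [row[:] for row in X] + [[s if i == j else 0.0 for j in range(p)] for i in range(p)]
    Z = list(y) + [s * b for b in beta0]
    return A, Z


def lasso_cd(A, Z, lam, n_iter=500):
    """Minimise 0.5*||Z - A b||^2 + lam * sum|b_j| by cyclic coordinate descent."""
    n, p = len(A), len(A[0])
    b = [0.0] * p
    col_sq = [sum(A[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of column j with the residual that ignores coordinate j.
            rho = sum(
                A[i][j] * (Z[i] - sum(A[i][k] * b[k] for k in range(p) if k != j))
                for i in range(n)
            )
            b[j] = soft_threshold(rho, lam) / col_sq[j]
    return b


# Toy problem (assumed numbers): the prior mean says the second variable is irrelevant.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 0.1, 1.2]
beta0 = [1.0, 0.0]
A, Z = augment(X, y, beta0, tau=1.0)

beta_map = lasso_cd(A, Z, lam=0.0)     # no penalty: plain posterior mode
beta_sparse = lasso_cd(A, Z, lam=0.5)  # with penalty: second coefficient is exactly 0
```

With `lam=0.0` the result is the ordinary posterior mode, which solves (XᵀX + τI)β = Xᵀy + τB0. With a positive penalty the second coefficient is set to exactly zero, which is the variable selection that sampling from the posterior does not provide.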

Furthermore, the first embodiment obtains the mode of the Bayesian posterior distribution shown in Equation (1) without iterative computation such as MCMC, so a prediction formula can be derived, for example, three orders of magnitude faster than with MCMC.

FIG. 1 is a diagram showing a configuration example of the data analysis apparatus according to the first embodiment of the present invention.
As shown in FIG. 1, the data analysis apparatus 10 according to the first embodiment of the present invention includes a database 1, a large-scale data parameter θ_Web estimation unit 2, a prior distribution input unit 3, a Bayesian estimation execution unit 4, a posterior distribution logarithmic transformation unit 5, and a sparse estimation execution unit 6.

In the data analysis apparatus 10, a CPU built into the apparatus executes a program stored in advance in a storage device such as a hard disk and controls the operation of each part of the electronic circuitry, thereby causing each unit to function. That is, in the data analysis apparatus 10 the CPU controls the operation of each part of the circuitry according to the instructions written in the control program, and the various functions are realized by software and hardware operating in cooperation. The database 1 is a device that stores the data used for analysis and is realized by a storage device such as a nonvolatile memory.

The large-scale data parameter θ_Web estimation unit 2 performs regression analysis on the large-scale data stored in the database 1 and estimates the parameter θ_Web, which is the regression coefficient of each explanatory variable.
The prior distribution input unit 3 inputs the parameter θ_Web calculated by the large-scale data parameter θ_Web estimation unit 2 into the formula for the prior distribution of Bayesian estimation.

The Bayesian estimation execution unit 4 performs Bayesian estimation using the prior distribution input by the prior distribution input unit 3.
The posterior distribution logarithmic transformation unit 5 obtains the logarithmically transformed value of the posterior distribution output as the result of Bayesian estimation.
The sparse estimation execution unit 6 executes sparse estimation, using the logarithmically transformed value as the input to sparse estimation by Lasso.

Note that θ_Web denotes the parameter for the Web data, and $\hat{\theta}_{Web}$ denotes the value estimated using θ_Web.

Next, the processing procedure performed by the data analysis apparatus 10 is described. FIG. 2 is a flowchart showing an example of the processing procedure performed by the data analysis apparatus according to the first embodiment of the present invention.
First, the large-scale data parameter θ_Web estimation unit 2 reads from the database 1 the large-scale data needed for the analysis, which have been stored there in advance, uses them as a data set, and performs regression analysis on the target quantity y. The prior distribution input unit 3 uses the regression coefficients derived by the regression analysis as the parameter $\hat{\theta}_{Web}$ for the prior distribution of Bayesian estimation (201 in FIG. 2).

Next, the Bayesian estimation execution unit 4 takes the parameter $\hat{\theta}_{Web}$ input from the prior distribution input unit 3 as the prior distribution, executes Bayesian estimation based on a likelihood function assumed in advance, and derives the posterior distribution of the Bayesian estimation according to Step 1 above (202 in FIG. 2).
The posterior distribution logarithmic transformation unit 5 obtains the logarithmically transformed value of the posterior distribution derived by the Bayesian estimation execution unit 4, as described in Step 2 (203 in FIG. 2).

The sparse estimation execution unit 6 performs sparse estimation using Equation (4), which is obtained by substituting the loss function of sparse estimation by Lasso into the result of the logarithmic transformation (Equation (2)) (see Equation (5)) (204 in FIG. 2).

As a result, the data analysis apparatus 10 can derive the explanatory variables needed to predict the output y, together with their regression coefficients and the estimation accuracy (205 in FIG. 2).
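The procedure of steps 201 to 205 can be sketched end to end. This is a minimal illustration under stated assumptions, not the apparatus itself: the function names `fit` and `analyze`, the stacked-data construction for the log-posterior, the values of `tau` and `lam`, and the toy data are all hypothetical, and the estimation accuracy mentioned in step 205 is not computed.

```python
import math


def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def fit(A, Z, lam, n_iter=200):
    """Cyclic coordinate descent for 0.5*||Z - A b||^2 + lam * sum|b_j|."""
    n, p = len(A), len(A[0])
    b = [0.0] * p
    col_sq = [sum(A[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            rho = sum(
                A[i][j] * (Z[i] - sum(A[i][k] * b[k] for k in range(p) if k != j))
                for i in range(n)
            )
            b[j] = soft_threshold(rho, lam) / col_sq[j]
    return b


def analyze(web_X, web_y, X, y, tau, lam):
    # 201: regression on the large-scale data gives theta_web (lam = 0 is least squares).
    theta_web = fit(web_X, web_y, 0.0)
    # 202-203: with a normal likelihood and prior, the log-posterior becomes a
    # least-squares problem on stacked data (assumed construction).
    s, p = math.sqrt(tau), len(theta_web)
    A = [row[:] for row in X] + [[s if i == j else 0.0 for j in range(p)] for i in range(p)]
    Z = list(y) + [s * t for t in theta_web]
    # 204: sparse estimation on the transformed problem.
    beta = fit(A, Z, lam)
    # 205: explanatory variables kept for prediction.
    selected = [j for j, b in enumerate(beta) if b != 0.0]
    return theta_web, beta, selected


# Toy data (assumed): plenty of "Web" samples, but a single workplace observation.
web_X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
web_y = [2.0, 0.0, 2.0, 2.0]
X, y = [[1.0, 1.0]], [2.1]  # one sample, two coefficients: least squares alone is underdetermined
theta_web, beta, selected = analyze(web_X, web_y, X, y, tau=1.0, lam=0.5)
```

In this toy run the single small-sample observation cannot identify two coefficients on its own, but the prior from the large-scale data makes the problem well posed, and the penalty keeps only the first explanatory variable.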

As described above, in the first embodiment the posterior distribution is derived by assuming normality for the prior distribution and the data, and the parameters effective for prediction are selected from the posterior distribution using a sparse estimation algorithm, so the load of estimating the posterior distribution can be reduced.

(Second embodiment)
Next, the second embodiment will be described.
In the second embodiment, building on the first embodiment, a data set whose features are similar to those of the observation data used in Bayesian estimation is generated by focusing on the magnitude of the absolute values of the regression coefficients. The regression coefficients obtained by regression analysis of this data set are then used for the prior distribution of Bayesian estimation, so that the target regression equation is derived in a shorter time.

For example, to predict accurately which people in a group of men and women in their twenties like cake, dividing the population by sex gives a more accurate prediction, because psychology has found that more women than men like cake.
However, for a prediction whose structure is unknown, such as predicting which political party a group of men and women in their twenties will support, a means is needed for finding which attribute to use: whether sex is suitable, for example, or whether university department or occupation is better, or whether having a partner or spouse is better.

The second embodiment describes a method of making predictions by drawing on statistical knowledge. Specifically, the samples of the population are classified by the explanatory variables (corresponding to sex in the former example) for which the absolute value of the regression coefficient of the regression equation derived by the sparse estimation described in Step 2 of the first embodiment has a magnitude satisfying a predetermined condition (for example, is at least a predetermined value).
This sparse estimation has the property of setting to zero, or to a value close to zero, the regression coefficients of explanatory variables that are unrelated to predicting the objective variable of interest (corresponding to liking cake in the former example).

Therefore, in the second embodiment, the data analysis apparatus 10 regards an explanatory variable whose regression coefficient is not close to zero and has a large absolute value as effective for predicting the objective variable, and classifies the population by that explanatory variable. The data analysis apparatus 10 then performs the processing described in the first embodiment on the classified data set.

Specifically, the large-scale data parameter θ_Web estimation unit 2 generates, from the large-scale data stored in the database 1, a data set whose features are close to the data distribution introduced into the likelihood function. To generate this data set, the large-scale data parameter θ_Web estimation unit 2 performs regression analysis on the data set used to calculate the output y, and classifies the large-scale data by using data that has the same value for an explanatory variable with a high regression coefficient derived from that regression analysis.
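The classification step can be sketched as follows. This is a minimal illustration, assuming that the predetermined condition is a threshold on the absolute value of the coefficient and that records are plain dictionaries. The function names, field names, and values are hypothetical and do not come from the embodiment.

```python
def select_variables(coefficients, min_abs):
    """Return the explanatory variables whose |coefficient| is at least min_abs."""
    return [name for name, coef in coefficients.items() if abs(coef) >= min_abs]


def classify(records, variables):
    """Group records by their values on the selected explanatory variables."""
    groups = {}
    for record in records:
        key = tuple(record[v] for v in variables)
        groups.setdefault(key, []).append(record)
    return groups


# Hypothetical regression coefficients after sparse estimation.
coefficients = {"event_type": 0.9, "weekday": 0.02, "age": -0.01}

# Hypothetical large-scale records.
large_scale_data = [
    {"event_type": "praise", "weekday": 1, "age": 25, "y": 4},
    {"event_type": "reprimand", "weekday": 3, "age": 31, "y": 2},
    {"event_type": "praise", "weekday": 5, "age": 42, "y": 5},
]

selected = select_variables(coefficients, min_abs=0.5)
classes = classify(large_scale_data, selected)
print(selected)
print({key: len(rows) for key, rows in classes.items()})
```

Each resulting class could then be passed separately to the regression analysis that supplies the prior distribution.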

For example, consider a case in which the large-scale data parameter θ_Web estimation unit 2 estimates the large-scale data parameter θ_Web from large-scale data of 3500 samples collected by a Web questionnaire (collected January to February 2013), and the Bayesian estimation execution unit 4 performs Bayesian estimation on 166 samples collected in an actual workplace (collected October to December 2014) to predict what kinds of events change interpersonal emotion (rated on five levels).

In this case, classifying the large-scale data by the type of event with a large absolute regression coefficient improved prediction accuracy by about 10% compared with no classification. In addition, the number of samples needed to derive the prediction equation by Bayesian estimation was only about one half to two thirds of the number needed without classification.

As described above, in the second embodiment, classifying the large-scale data with attention to the regression coefficients improves prediction accuracy and allows an appropriate regression equation to be derived even from a small amount of data.
Note that the present invention is not limited to the above embodiments as they are, and at the implementation stage the constituent elements can be modified and embodied without departing from the gist of the invention. Various inventions can also be formed by appropriately combining the constituent elements disclosed in the above embodiments. For example, some constituent elements may be deleted from all the constituent elements shown in an embodiment. Furthermore, constituent elements from different embodiments may be combined as appropriate.

The methods described in the embodiments can also be stored, as a program (software means) that a computer can execute, on a recording medium such as a magnetic disk (floppy (registered trademark) disk, hard disk, etc.), an optical disk (CD-ROM, DVD, MO, etc.), or a semiconductor memory (ROM, RAM, flash memory, etc.), or transmitted over a communication medium for distribution. The program stored on the medium side also includes a setting program that configures, within the computer, the software means to be executed by the computer (including not only an execution program but also tables and data structures). A computer that implements this apparatus reads the program recorded on the recording medium, constructs the software means by the setting program where necessary, and executes the processing described above with its operation controlled by the software means. The recording medium referred to in this specification is not limited to media for distribution, and includes storage media such as a magnetic disk or a semiconductor memory provided inside the computer or in a device connected via a network.

DESCRIPTION OF SYMBOLS: 1 ... database, 2 ... large-scale data parameter θ_Web estimation unit, 3 ... prior distribution input unit, 4 ... Bayesian estimation execution unit, 5 ... posterior distribution logarithmic transformation unit, 6 ... sparse estimation execution unit.

Claims

A data analysis apparatus comprising:
parameter estimation means for performing regression analysis on data to be analyzed and estimating a parameter that is a regression coefficient of an explanatory variable;
Bayesian estimation execution means for setting the parameter estimated by the parameter estimation means as a prior distribution of Bayesian estimation, and performing Bayesian estimation based on the prior distribution and a predetermined likelihood function to derive a posterior distribution of the Bayesian estimation; and
sparse estimation execution means for executing sparse estimation by substituting into a loss function a result of logarithmic transformation of the posterior distribution derived by the Bayesian estimation execution means.

The data analysis apparatus according to claim 1, wherein the parameter estimation means classifies the data to be analyzed by using an explanatory variable for which the magnitude of the regression coefficient of the regression equation derived by the sparse estimation execution means satisfies a predetermined condition, performs regression analysis on the classified data, and estimates a parameter that is a regression coefficient of an explanatory variable.

A data analysis method applied to a data analysis apparatus, the method comprising:
performing regression analysis on data to be analyzed and estimating a parameter that is a regression coefficient of an explanatory variable;
setting the estimated parameter as a prior distribution of Bayesian estimation, and performing Bayesian estimation based on the prior distribution and a predetermined likelihood function to derive a posterior distribution of the Bayesian estimation; and
executing sparse estimation by substituting into a loss function a result of logarithmic transformation of the derived posterior distribution.

The data analysis method according to claim 3, wherein the data to be analyzed is classified by using an explanatory variable for which the magnitude of the regression coefficient of the derived regression equation satisfies a predetermined condition, regression analysis is performed on the classified data, and a parameter that is a regression coefficient of an explanatory variable is estimated.

A data analysis processing program used for a computer operating as a part of the data analysis apparatus according to claim 1, the program causing the computer to function as the parameter estimation means, the Bayesian estimation execution means, and the sparse estimation execution means.