JP2014041566A

JP2014041566A - Device, method, and program for linear regression model estimation

Info

Publication number: JP2014041566A
Application number: JP2012184608A
Authority: JP
Inventors: Shinya Murata; 眞哉村田; Noriko Takaya; 典子高屋; Masashi Uchiyama; 匡内山; Kunio Kashino; 邦夫柏野
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2012-08-23
Filing date: 2012-08-23
Publication date: 2014-03-06

Abstract

PROBLEM TO BE SOLVED: To estimate a more reliable linear regression model. SOLUTION: A linear regression modeling unit 31 defines a linear regression model that represents an observed value y as a linear sum of explanatory variables x and their regression coefficients β. A regression coefficient estimation unit 32 estimates the regression coefficients β for each setting of the parameters (λ₁, λ₂) that specify the magnitudes of the penalty terms, by minimizing a cost function in which a penalty term that grows larger as the norm of the explanatory variable x grows smaller is added to the residual between the linear regression model and the observed value. A model identification unit 33 identifies the linear regression model by selecting the regression coefficients that maximize the explanation rate (coefficient of determination).

Description

The present invention relates to a linear regression model estimation apparatus, method, and program.

A linear regression model captures the variation in actually observed data as a linear sum of explanatory variables. Modeling algorithms typified by the least squares method are simple to implement, and the estimated regression coefficients represent the partial correlations between the observed data and the explanatory variables and are easy to interpret, so these algorithms are used by many analysts. In practical analyses, however, the number of explanatory variables grows large and the least squares method no longer models the data well. For this reason, penalized modeling methods have been proposed that attach a penalty term to the magnitude of the regression coefficients and estimate a regression model that is soft-fitted to the data.

For example, a variable-selection regression modeling method has been proposed for estimating a linear regression model. It uses a cost function that takes into account the L1 and L2 norms of the regression coefficients in addition to the squared error against the observed data, and soft-fits the model to the data using only the explanatory variables that are highly correlated with the observed data (see, for example, Non-Patent Document 1). This method is called Elastic Net Regression, and an effective algorithm called LARS (Least Angle Regression) is often used to estimate its regression coefficients.

Non-Patent Document 1: Hui Zou and Trevor Hastie, "Regularization and Variable Selection via the Elastic Net", J. R. Statist. Soc. B, pp. 301-320, 2005.

An object of the present invention is to provide a linear regression model estimation apparatus, method, and program capable of estimating a more reliable linear regression model.

To achieve the above object, the linear regression model estimation apparatus of the present invention can be configured to include: regression coefficient estimation means for estimating, in a linear regression model that represents an observed value as a linear sum of explanatory variables and regression coefficients for those explanatory variables, the regression coefficients by minimizing a cost function in which a penalty term that grows larger as the norm of the explanatory variable grows smaller is added to the residual between the linear regression model and the observed value; and model identification means for identifying the linear regression model by selecting, from the regression coefficients estimated by the regression coefficient estimation means for each parameter that specifies the magnitude of the penalty term, the regression coefficients that give the largest explanation rate.

According to the linear regression model estimation apparatus of the present invention, a linear regression model is first defined that represents an observed value as a linear sum of explanatory variables and regression coefficients for those explanatory variables. The regression coefficient estimation means then estimates the regression coefficients by minimizing a cost function in which a penalty term that grows larger as the norm of the explanatory variable grows smaller is added to the residual between the linear regression model and the observed value. The model identification means identifies the linear regression model by selecting, from the regression coefficients estimated by the regression coefficient estimation means for each parameter that specifies the magnitude of the penalty term, the regression coefficients that give the largest explanation rate.

In this way, the regression coefficients are estimated by minimizing a cost function with a penalty term that grows larger as the norm of the explanatory variable, which is directly tied to its reliability as data, grows smaller. A highly reliable linear regression model can therefore be estimated.

The linear regression model estimation method of the present invention is a method performed in a linear regression model estimation apparatus that includes regression coefficient estimation means and model identification means. In this method, the regression coefficient estimation means estimates, in a linear regression model that represents an observed value as a linear sum of explanatory variables and regression coefficients for those explanatory variables, the regression coefficients by minimizing a cost function in which a penalty term that grows larger as the norm of the explanatory variable grows smaller is added to the residual between the linear regression model and the observed value. The model identification means then identifies the linear regression model by selecting, from the regression coefficients estimated by the regression coefficient estimation means for each parameter that specifies the magnitude of the penalty term, the regression coefficients that give the largest explanation rate.

The linear regression model estimation program of the present invention is a program for causing a computer to function as each of the means constituting the above linear regression model estimation apparatus.

As described above, the linear regression model estimation apparatus, method, and program of the present invention estimate the regression coefficients by minimizing a cost function with a penalty term that grows larger as the norm of the explanatory variable, which is directly tied to its reliability as data, grows smaller. This has the effect that a highly reliable linear regression model can be estimated.

FIG. 1 is a schematic diagram showing the configuration of the linear regression model estimation apparatus according to the present embodiment.
FIG. 2 is a diagram showing an example of the data structure of the time-series data of observed values.
FIG. 3 is a flowchart showing the linear regression model estimation processing routine in the present embodiment.
FIG. 4 is a graph showing the observed data used in the experiment.
FIG. 5 is a graph showing the explanatory variable x₁ used in the experiment.
FIG. 6 is a graph showing the explanatory variable x₄ used in the experiment.
FIG. 7 is a graph showing the explanatory variable x₇ used in the experiment.
FIG. 8 is a table showing the estimation results of the regression coefficients.

Hereinafter, an embodiment of the present invention will be described in detail with reference to the drawings.

<Outline of the present embodiment>
The linear regression model estimation apparatus according to the present embodiment first assumes and prepares a linear regression model that explains the variation in the observed values. Here, the observed values are assumed to be standardized to mean 0, and the explanatory variables to mean 0 and variance 1. Next, a cost function is prepared that extends Elastic Net Regression with a penalty on the magnitude (scale) of the explanatory variables. In the present embodiment, this method is called Scale Penalized Elastic Net Regression (SPEN). The observed values are then taken in, and the estimates of the regression coefficients that minimize this cost function are obtained. As with Elastic Net Regression, a variant of LARS can be used to estimate the regression coefficients. Finally, the model is identified using the explanation rate (coefficient of determination) of the regression model, and the estimated parameters of the linear regression model are output.

<System configuration>
The linear regression model estimation apparatus 10 according to the present embodiment is a computer that includes a CPU, a RAM, and a ROM storing a program for executing the linear regression model estimation processing routine described later. Functionally, as shown in FIG. 1, this computer can be represented as a configuration including an input unit 20, a calculation unit 30, and an output unit 40.

The input unit 20 accepts the input time-series data of observed values. Each observed value is a scalar, that is, the time-series data is a one-dimensional time series. As shown in FIG. 2, for example, the time-series data consists of sets of a time t_i, an observed value y_i, and explanatory variables x_{i,j} (i = 1, ..., n; j = 1, ..., p).

The calculation unit 30 can be represented as a configuration including a linear regression modeling unit 31, a regression coefficient estimation unit 32, and a model identification unit 33.

The linear regression modeling unit 31 accepts the setting values entered by the operator through the input unit 20 and defines the linear regression model shown in equations (1) and (2) below.

Here, y is the n × 1 vector of observed values, X is the n × p matrix of explanatory variables, and β is the p × 1 vector of regression coefficients. The x_i (i = 1, ..., p) are the p explanatory variables, each an n × 1 vector. Standardizing this linear regression equation gives equations (3) and (4) below.

Here, ||x_i|| (i = 1, ..., p) is the L2 norm of each explanatory variable, and Ā (written in the equations as the symbol A with an overbar) is the mean of A. In equation (4), the observed values have mean 0, and each explanatory variable has mean 0 and L2 norm 1, so this form is called the standardized linear regression model. The aim of the present embodiment is to estimate the regression coefficients β′ of equation (4) from the time-series data of observed values.
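The standardization step can be illustrated with a short sketch. Equations (3) and (4) are not reproduced in this text, so the code below assumes the common convention of centering y, centering each column of X, and dividing each column by the L2 norm of its centered values. The function names `standardize` and `to_original_scale` are hypothetical and introduced only for illustration.

```python
import numpy as np

def standardize(X, y):
    """Center y; center each column of X and scale it to unit L2 norm.

    Assumption: the scale of a variable is the L2 norm of its centered column.
    Returns the standardized matrix, the centered y, and the column norms,
    which are kept so that coefficients can be mapped back later.
    """
    y_centered = y - y.mean()
    X_centered = X - X.mean(axis=0)
    norms = np.linalg.norm(X_centered, axis=0)
    X_std = X_centered / norms
    return X_std, y_centered, norms

def to_original_scale(beta_std, norms):
    """Map coefficients of the standardized model back to the original scale."""
    return beta_std / norms
```

Under this assumption, a coefficient in the standardized model equals the original coefficient multiplied by the column norm, which is why the norms are returned alongside the standardized data.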

The regression coefficient estimation unit 32 estimates the parameter β′ that minimizes the cost function of Scale Penalized Elastic Net Regression (SPEN) shown in equation (5) below, by means of the optimization shown in equation (6) below.

The second term on the right-hand side of equation (5) is the L1 penalty term and the third term is the L2 penalty term; λ₁ and λ₂ are parameters that specify the magnitudes of the L1 and L2 penalty terms. W is a diagonal matrix whose diagonal elements are a function of the magnitude (scale) of the explanatory variables, as shown in equation (7) below.

Here, w(||x_p||) is a function of the L2 norm of x_p, defined so that it approaches 1 (that is, no penalty) when the norm is large and grows larger (a heavier penalty) when the norm is small. For example, w(||x_p||) can take the form shown in equation (8) below, with γ = 50 and m = 3.
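To make the role of W concrete, here is a minimal sketch of a weight function with the stated properties. The exact form of equation (8) is not reproduced in this text, so the exponential form below is a hypothetical stand-in, not the patent's function; reusing γ = 50 and m = 3 as defaults is likewise only for illustration. The names `scale_weight` and `scale_weight_matrix` are assumptions.

```python
import numpy as np

def scale_weight(norm, gamma=50.0, m=3.0):
    """Hypothetical weight: close to 1 for large norms, large for small norms.

    Assumed form: w(s) = 1 + gamma * exp(-m * s). This is NOT equation (8)
    of the patent, only a function that decreases monotonically toward 1.
    """
    return 1.0 + gamma * np.exp(-m * np.asarray(norm, dtype=float))

def scale_weight_matrix(X):
    """Build the diagonal matrix W from the L2 norms of the columns of X."""
    norms = np.linalg.norm(X, axis=0)
    return np.diag(scale_weight(norms))
```

Note that W has to be built from the explanatory variables before they are rescaled to unit norm; after standardization every column has the same norm and the weights would no longer distinguish between variables.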

The cost function of equation (5) above can be rewritten in the form of equation (9) below.

Here, β″ = G⁻¹β′, where G is given by equation (10) below.

Further, y″ and x″ are augmented versions of y′ and x′, given by equation (11) below.

Note here that β′ = Gβ″. Substituting this into the L1 penalty term of equation (5) gives |f(W)Gβ″|, and the L1 penalty term of equation (9) has been reduced to |β″| by setting f(W)G = I. In other words, f(W) in equation (5) is given by equation (12) below.

The cost function of equation (9) has the same form as that of the Lasso (Least Absolute Shrinkage and Selection Operator), which has only an L1 penalty term, so β″ can be estimated using a variant of LARS. The regression coefficients of the linear regression model of equation (1) are estimated by taking in the time-series data of observed values, estimating β″, and converting β″ → β′ → β. In addition, when equation (13) below is used as the minimization of the cost function of equation (9), all of the estimated regression coefficients can be constrained to be 0 or greater. This minimization can also be carried out with a variant of LARS, and corresponds to running the LARS iterations using only the explanatory variables whose regression coefficients are greater than 0.
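The idea of absorbing the L2 penalty into an enlarged least squares problem can be checked numerically. Equations (9) to (12) are not reproduced in this text, so the sketch below shows only the general row-augmentation trick known from the elastic net, under the assumption that the L2 penalty has the form λ₂‖Wβ‖². It omits the rescaling by G, and the function name `augment` is hypothetical.

```python
import numpy as np

def augment(X, y, W, lam2):
    """Stack sqrt(lam2) * W under X and p zeros under y.

    With this augmentation, the plain residual sum of squares of the
    enlarged problem equals the original residual sum of squares plus
    lam2 * ||W beta||^2, so only an L1 penalty remains to be handled.
    """
    p = X.shape[1]
    X_aug = np.vstack([X, np.sqrt(lam2) * W])
    y_aug = np.concatenate([y, np.zeros(p)])
    return X_aug, y_aug

# Numerical check of the identity on random data.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
y = rng.normal(size=30)
W = np.diag([1.0, 2.0, 1.0, 5.0])
beta = rng.normal(size=4)
lam2 = 0.3

X_aug, y_aug = augment(X, y, W, lam2)
lhs = np.sum((y_aug - X_aug @ beta) ** 2)
rhs = np.sum((y - X @ beta) ** 2) + lam2 * np.sum((W @ beta) ** 2)
```

Once the problem is in this form, any Lasso solver, including a LARS-based one, can be applied to the augmented data.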

The model identification unit 33 varies λ₁ and λ₂ in equation (5) and selects the model with the largest explanation rate (coefficient of determination). As the explanation rate, for example, the degrees-of-freedom-adjusted explanation rate (adjusted R²) defined by equation (14) below can be used.

The numerator of the second term of equation (14) is the residual sum of squares of the linear regression model, and the denominator is the sum of squared deviations of the observed values from their mean. n is the number of observed values and p is the number of explanatory variables in the model. The adjusted explanation rate is a measure of how well the regression model fits the observed values, adjusted for the degrees of freedom.
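Equation (14) itself is not reproduced in this text. Assuming it is the standard adjusted coefficient of determination, which matches the description of its numerator and denominator above, it would read as follows, where ŷ_i (a symbol introduced here) denotes the value fitted by the model for the i-th observation:

```latex
\bar{R}^2 \;=\; 1 \;-\; \frac{n-1}{n-p-1}\cdot
\frac{\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i-\bar{y}\right)^2}
```

The factor (n − 1)/(n − p − 1) is the degrees-of-freedom adjustment: for the same residual sum of squares, a model that uses more explanatory variables receives a lower score.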

<Operation of the linear regression model estimation apparatus>
Next, the operation of the linear regression model estimation apparatus 10 according to the present embodiment will be described. First, when the operator inputs time-series data consisting of the observed values at times t_1 to t_N into the linear regression model estimation apparatus 10, the apparatus stores the input time-series data in a memory (not shown). The linear regression model estimation apparatus 10 then executes the linear regression model estimation processing routine shown in FIG. 3.

First, in step 100, the time-series data of observed values stored in the memory is acquired. Then, in step 102, the linear regression modeling unit 31 uses the explanatory variables contained in the time-series data to define the linear regression model shown in equations (1) and (2) above, and standardizes it to define the standardized linear regression model shown in equation (4).

Next, in step 104, the regression coefficient estimation unit 32 sets the parameters (λ₁, λ₂) that specify the magnitudes of the L1 and L2 penalty terms in the SPEN cost function of equation (5), whose penalty terms take the magnitude (scale) of the explanatory variables into account. In the following step 106, it estimates the parameter β′ by minimizing the cost function of equation (5) according to equation (6).

Next, in step 108, the model identification unit 33 calculates the explanation rate (coefficient of determination), for example as given by equation (14), based on the linear regression model using the parameter β′ estimated in step 106, and temporarily stores it in a predetermined storage area together with the parameter β′.

Next, in step 110, the regression coefficient estimation unit 32 determines whether the parameter β′ has been estimated for every combination of the parameters (λ₁, λ₂). If an unprocessed (λ₁, λ₂) remains, the routine returns to step 104, sets the next (λ₁, λ₂), and repeats the processing of steps 106 and 108.

When processing has finished for every (λ₁, λ₂), the routine proceeds to step 112 and identifies the standardized linear regression model of equation (4) by selecting the parameter β′ for which the explanation rate calculated in step 108 is largest. The selected parameter β′ is then converted to β.

Next, in step 114, the output unit 40 outputs the parameter β obtained in step 112 as the estimated parameters of the linear regression model of equation (1), and the linear regression model estimation processing routine ends.
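Steps 104 to 112 amount to a grid search over (λ₁, λ₂), which can be sketched as follows. This is not the patent's implementation: it swaps the LARS variant for plain coordinate descent, assumes the penalty is λ₁ Σ w_j|β_j| + λ₂ Σ w_j² β_j² with per-variable weights w_j supplied by the caller, and counts only the nonzero coefficients as p in the adjusted R². All function names are hypothetical.

```python
import numpy as np

def soft_threshold(a, t):
    """Soft-thresholding operator used by L1-penalized coordinate descent."""
    return np.sign(a) * max(abs(a) - t, 0.0)

def weighted_enet_cd(X, y, w, lam1, lam2, n_iter=200):
    """Minimize ||y - X b||^2 + lam1 * sum(w|b|) + lam2 * sum(w^2 b^2).

    Assumption: coordinate descent stands in for the LARS variant.
    """
    p = X.shape[1]
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # Residual with the contribution of variable j removed.
            r_j = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r_j
            beta[j] = soft_threshold(rho, lam1 * w[j] / 2.0) / (col_sq[j] + lam2 * w[j] ** 2)
    return beta

def adjusted_r2(y, y_hat, p):
    """Standard adjusted R^2 with p explanatory variables."""
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1.0 - (rss / (n - p - 1)) / (tss / (n - 1))

def select_model(X, y, w, lam1_grid, lam2_grid):
    """Steps 104-112: fit for every (lam1, lam2) and keep the best score."""
    best = None
    for lam1 in lam1_grid:          # step 104: set the next parameter pair
        for lam2 in lam2_grid:
            beta = weighted_enet_cd(X, y, w, lam1, lam2)   # step 106
            k = int(np.count_nonzero(beta))
            score = adjusted_r2(y, X @ beta, k)            # step 108
            if best is None or score > best[0]:            # step 112
                best = (score, lam1, lam2, beta)
    return best
```

The returned coefficients are those of the standardized model; converting them back to the original scale corresponds to the β′ → β conversion at the end of step 112.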

<Experimental results>
The following describes the results of an experiment with SPEN, the method of the present embodiment, using artificial observation data generated by equation (15) below.

The four explanatory variables that generated this observation data are independent. In addition, six candidate explanatory variables, shown in equations (16) to (21) below, are prepared.

The magnitudes (scales) of these ten explanatory variables in total are given by equations (22) and (23) below.

In this set of explanatory variables, the variables are strongly correlated with one another, so the data exhibit multicollinearity. In addition, x₄, x₇, and x₈ have smaller magnitudes (scales) than the other explanatory variables and are therefore explanatory variables of low reliability as data. Plots of the observed values y and the explanatory variables x₁, x₄, and x₇ are shown in FIGS. 4 to 7.

FIG. 8 shows the results of modeling this data with SPEN, the method of the present embodiment. The values of λ₁ and λ₂ in equation (5) were determined by the processing of the model identification unit 33 described above, giving λ₁ = 0.1651 and λ₂ = 0.0131. The comparison methods are Ridge and Elastic Net (EN). Ridge is a regression with only an L2 penalty term on the regression coefficients, a method that soft-fits the model to the observed data; it coincides with SPEN when λ₁ = 0 and W = I. EN is SPEN with W = I and does not take into account a penalty based on the scale of the explanatory variables. Ridge can estimate regression coefficients stably even in the presence of multicollinearity, but the results show that it produces many misestimates. EN estimates the regression coefficients well overall but misestimates the coefficients for the explanatory variables x₇ and x₈.

SPEN, the method of the present embodiment, on the other hand, treats explanatory variables with a small scale as having low reliability as data, and therefore estimates the regression coefficients for the explanatory variables x₄, x₇, and x₈ as 0. The explanation rate is nearly the same for all three methods, but the linear regression model estimated by SPEN is the closest to the true model, and the penalty on the scale of the explanatory variables, which is the benefit of the present embodiment, works well.

As described above, the linear regression model estimation apparatus according to the present embodiment estimates the regression coefficients by minimizing the SPEN cost function, whose penalty terms take into account the magnitude (scale) of the explanatory variables, which is directly tied to their reliability as data. A highly reliable linear regression model can therefore be estimated.

Linear regression models are mathematically simple, easy to interpret, and supported by powerful estimation algorithms, so they are used by many analysts. Real data, however, vary in complex ways, and the number of explanatory variables used in the regression grows large, which makes appropriate modeling difficult. For this reason, modeling methods with penalty terms on the regression coefficients have been proposed with the aim of soft-fitting the model to the observed data. The present embodiment further proposes a linear regression modeling method that penalizes the magnitude (scale) of the explanatory variables. This makes it possible to model using only explanatory variables of sufficient magnitude (scale), that is, explanatory variables that are highly reliable as data, and to estimate robust regression equations from complex data sets.

The present invention is not limited to the embodiment described above, and various modifications and applications are possible without departing from the gist of the invention.

Although this specification describes an embodiment in which the program is installed in advance, the program can also be provided stored on a computer-readable recording medium.

DESCRIPTION OF SYMBOLS
10 Linear regression model estimation apparatus
20 Input unit
30 Calculation unit
31 Linear regression modeling unit
32 Regression coefficient estimation unit
33 Model identification unit
40 Output unit

Claims

A linear regression model estimation apparatus comprising:
regression coefficient estimation means for estimating, in a linear regression model that represents an observed value as a linear sum of explanatory variables and regression coefficients for the explanatory variables, the regression coefficients by minimizing a cost function in which a penalty term that grows larger as the norm of the explanatory variable grows smaller is added to the residual between the linear regression model and the observed value; and
model identification means for identifying the linear regression model by selecting, from the regression coefficients estimated by the regression coefficient estimation means for each parameter that specifies the magnitude of the penalty term, the regression coefficients that give the largest explanation rate.

A linear regression model estimation method in a linear regression model estimation apparatus including regression coefficient estimation means and model identification means, wherein:
the regression coefficient estimation means estimates, in a linear regression model that represents an observed value as a linear sum of explanatory variables and regression coefficients for the explanatory variables, the regression coefficients by minimizing a cost function in which a penalty term that grows larger as the norm of the explanatory variable grows smaller is added to the residual between the linear regression model and the observed value; and
the model identification means identifies the linear regression model by selecting, from the regression coefficients estimated by the regression coefficient estimation means for each parameter that specifies the magnitude of the penalty term, the regression coefficients that give the largest explanation rate.

A linear regression model estimation program for causing a computer to function as each of the means constituting the linear regression model estimation apparatus according to claim 1.