JP6239486B2 - Prediction model creation method

Info

Publication number: JP6239486B2
Application number: JP2014225055A
Authority: JP
Inventors: 森　俊樹
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2014-11-05
Filing date: 2014-11-05
Publication date: 2017-11-29
Anticipated expiration: 2034-11-05
Also published as: JP2016091306A

Description

Embodiments of the present invention relate to a prediction model creation method.

The importance of utilizing large-scale data is widely recognized, and data analysis techniques and prediction models are used in a variety of settings, such as analyzing large volumes of customer information and detecting anomalies from equipment sensor data.

In the field of software development management as well, advances in tools and infrastructure have led to the accumulation of large amounts of software development data, and controlling development projects by building prediction models has become an important issue.

When a prediction model is put to use, the desired characteristics include not only high prediction accuracy but also ease of interpreting the prediction results, for example being able to investigate why a given prediction was reached.

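Japanese Unexamined Patent Application Publication No. 2010-272004 (JP 2010-272004)
Japanese Unexamined Patent Application Publication No. 2012-194894 (JP 2012-194894)
Japanese Unexamined Patent Application Publication No. 2010-9177 (JP 2010-9177)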

Various prediction models have been proposed so far, but it has been difficult to achieve both high prediction accuracy and ease of interpreting prediction results. Simple prediction models such as naive Bayes and decision trees are easy to interpret, but their prediction accuracy is inferior to that of more complex ensemble learning methods such as random forest. Ensemble learning methods such as random forest, on the other hand, offer excellent prediction accuracy, but their prediction results tend to become a black box and are harder to interpret.

An object of the present invention is to create a prediction model that improves the ease of interpreting prediction results while maintaining prediction accuracy.

According to an embodiment, a prediction model creation method includes an evaluation step, a repetition step, an integration step, and an aggregation step. The evaluation step discretizes learning data, which is data accumulated in the past and used to create a prediction model; obtains, for each discrete region, a lift value, which is a coefficient for calculating a posterior probability from a prior probability; generates a unit prediction model representing the lift value of each discrete region; and evaluates the unit prediction model by applying it to the learning data, comparing the prediction results with the actual results, and obtaining the prediction accuracy and the values of necessary parameters. The repetition step changes the discrete regions and runs the evaluation step a plurality of times, so that a plurality of unit prediction models are generated and evaluated. The integration step converts the evaluation results of the plurality of unit prediction models into a single integrated evaluation result. The aggregation step generates an aggregated prediction model by linearly approximating the nonlinear part of the integrated evaluation result.

A block diagram showing an example of the configuration of the prediction model creation apparatus of the embodiment.
A flowchart showing an example of the prediction model creation method of the embodiment.
An example of the operation of creating weighted learning data by bootstrap sampling.
An example of the discretization of explanatory variables when entropy is used as the evaluation function.
An example of a unit prediction model, together with a calculation example of the lift value.
A calculation example of the logarithmic score.
A calculation example of the maximum correct answer rate.
An example of an ROC curve and the AUC.
An example of the evaluation result of a unit prediction model.
A flowchart showing an example of the weight update.
An example of updating the weights.
An example of the parameters for integrating unit prediction models.
A calculation example of the integrated score.
An example of an aggregated prediction model.
An example of the relationship between unit prediction models and the aggregated prediction model.
An example of the results of comparing the aggregated prediction model of the embodiment, an integrated prediction model (which uses the integrated score directly for prediction), and a random forest prediction model.

Before the embodiment is described, the terms used in the description are explained.

・Scale level
A classification of variables and their measured data, on mathematical and statistical grounds, according to the nature of the information they express. There are four scale levels, from lowest to highest: nominal scale, ordinal scale, interval scale, and ratio scale. A higher level includes the properties of the lower levels.

・Learning data
A set of data used to train a prediction model. Each data item is expressed as a vector consisting of one or more explanatory variables and one objective variable. The goal is to train the prediction model so that, for every combination of explanatory variable values, the objective variable takes the prescribed value. Learning data in which a weight has been attached to each data item is called weighted learning data.

・Test data
A set of data used to evaluate (test) a prediction model. The structure of the data set itself is the same as that of the learning data.

・Bootstrap sampling
A method in which a new sample is generated from a data set by sampling with replacement as many times as there are data items, and this is repeated in order to analyze properties of the population, the estimation error of a model, and so on. A sample drawn with replacement is called a bootstrap sample.

・OOB (Out-of-Bag) data
In bootstrap sampling, when some data items are selected more than once during sampling with replacement, a corresponding number of data items are never selected. The collection of these unselected items is called OOB (Out-of-Bag) data.

・Correct answer rate
One of the evaluation indices of a prediction model: the proportion of correct answers among all data to be predicted (for example, in discriminating between two classes, positive and negative, the cases predicted positive that are actually positive plus the cases predicted negative that are actually negative). Changing the discrimination threshold changes the correct answer rate; its maximum is called the maximum correct answer rate. The data misclassified at the maximum correct answer rate is called misclassification data.

・ROC curve
A graph for evaluating the performance of a prediction model. The data is sorted in descending order of the prediction model's output, and the false positive rate (for example, in discriminating between two classes, positive and negative, the proportion predicted positive but actually negative) is plotted on the horizontal axis against the true positive rate (the proportion predicted positive and actually positive) on the vertical axis. The area under the ROC curve is called the AUC (Area Under the ROC Curve); it is one of the evaluation indices of a prediction model and takes a value between 0 and 1. The closer the AUC is to 1, the higher the prediction accuracy.

・Bias-variance decomposition
The error of a prediction model can be decomposed into an error caused by the deviation of the prediction model from the true model (bias), an error caused by sample-to-sample variation in the learning data (variance), and an inherently irreducible error. In general, there is a trade-off between bias and variance.

・Cross-validation (K-fold cross-validation)
One of the experimental methods for evaluating the performance of a prediction model. The data set used for training and evaluating the prediction model is divided into K parts (K = 5 or 10 is commonly used); (K-1) parts are used as learning data and the remaining part as test data, and the evaluation of the prediction model is repeated K times.

・Prediction model
A representation, as a model from which various quantities can be calculated, of the relationships between factors and the underlying tendencies and patterns learned from observation or measurement data about a target. When the variable to be predicted (the objective variable) is continuous, the model is called a regression model; when it is discrete, it is called a discriminant model. Typical prediction models for discrimination include decision trees and naive Bayes.

An example of observation data is customer data, and an example of measurement data is sensor data from various devices. Customer data contains pattern information about past customers; for example, a structured representation of customer characteristics and purchasing patterns serves as a prediction model. Once a prediction model has been built, when a new customer arrives it can be used to predict that the customer would purchase a given kind of product. A simple prediction model can be expressed as a function that takes raw data as input and outputs a prediction result.

・Naive Bayes
A type of prediction model in which the posterior probability for discriminating the objective variable is calculated based on Bayes' theorem, under the assumption that the explanatory variables are independent. The posterior probability is calculated by successively multiplying the prior probability by correction coefficients called lift values. The logarithm of the model output (the estimated posterior probability) is called the logarithmic score.

・Ensemble learning
A learning method that improves accuracy by integrating or combining multiple prediction models, none of which is particularly accurate on its own. Majority voting, averaging, and the like are used to integrate or combine the results. Typical ensemble learning algorithms include bagging (the prediction results for bootstrap-sampled learning data are evaluated by majority vote), boosting (the weights of the learning data are changed based on the prediction results), and random forest. In this description, each individual prediction model that forms a component of ensemble learning is called a unit prediction model; the result of integrating or combining the evaluation results of multiple unit prediction models is called an integrated evaluation result; and a model obtained by integrating or combining multiple unit prediction models is called an integrated prediction model. The weight assigned to each unit prediction model when the unit prediction models are integrated or combined is called an integration weight. The operation of converting an integrated prediction model into a single unit prediction model is called aggregation.

・Random forest
A type of ensemble learning algorithm that uses decision trees. It is an improved version of bagging that randomizes the selection of explanatory variables in the decision trees, thereby increasing the variation among the individual predictions and improving the prediction accuracy of the integrated result.

Hereinafter, embodiments will be described with reference to the drawings.
FIG. 1 is a functional block diagram showing a configuration example of a system that executes the prediction model creation method. A broken line denotes an input to the model creation processing unit 70, and a solid line denotes an output from the model creation processing unit 70. The system consists of an input data management unit 10, an output data management unit 40, and the model creation processing unit 70. The input data management unit 10 manages bootstrap maximum count data 12, boosting maximum count data 14, learning data 16, and test data 18. The output data management unit 40 manages weighted learning data 42, a unit prediction model 44, a logarithmic score 46, maximum correct answer rate data 48, misclassification data 50, an integration weight 52, an integrated score 54, an aggregated prediction model 56, and evaluation result data 58. The model creation processing unit 70 includes a bootstrap sample generation unit 72, a naive Bayes prediction model generation unit 74, a boosting processing unit 76, a prediction model integration unit 78, a prediction model aggregation unit 80, and a prediction model evaluation unit 82.

The bootstrap maximum count data 12 is input to the bootstrap sample generation unit 72. The boosting maximum count data 14 is input to the boosting processing unit 76. The learning data 16 is input to the bootstrap sample generation unit 72. The test data 18 is input to the evaluation result data 58.

The bootstrap sample generation unit 72 outputs the weighted learning data 42. The weighted learning data 42 is input to the naive Bayes prediction model generation unit 74 and the boosting processing unit 76. The naive Bayes prediction model generation unit 74 outputs the unit prediction model 44, the logarithmic score 46, the maximum correct answer rate data 48, the misclassification data 50, and the integration weight 52. The maximum correct answer rate data 48 and the misclassification data 50 are input to the boosting processing unit 76. The boosting processing unit 76 also outputs weighted learning data 42. The weighted learning data 42, the logarithmic score 46, and the integration weight 52 are input to the prediction model integration unit 78, which outputs the integrated score 54. The integrated score 54 is input to the prediction model aggregation unit 80, which outputs the aggregated prediction model 56. The integrated score 54 and the aggregated prediction model 56 are input to the prediction model evaluation unit 82, which outputs the evaluation result data 58.

The embodiment creates a prediction model that has high prediction accuracy and whose results are easy to interpret.
To improve prediction accuracy, bootstrap sampling and boosting are combined to randomize the learning data, which is then applied to naive Bayes prediction models. The evaluation results of the individual prediction models are integrated by taking their weighted average, using weights (integration weights) calculated from the AUC on the OOB data, and a prediction model is generated from the integrated result, which yields high prediction accuracy. When a prediction model is built from learning data, which is past data, the goal is to make its prediction accuracy on test data, which is new data, as high as possible; however, fitting the model too closely to the learning data (overfitting) can actually degrade prediction accuracy on the test data. Using the OOB data, that is, the data not selected by the random sampling with replacement in bootstrap sampling, makes it possible to estimate the degree of overfitting and the predictive performance on new data before prediction accuracy is actually evaluated on the test data. If prediction accuracy on the OOB data is low, prediction accuracy on the test data may not be high even if prediction accuracy on the learning data is high. Using this information as the weights when integrating the evaluation results achieves high overall prediction accuracy.

A naive Bayes prediction model is generally a relatively stable model with high bias and low variance, and has therefore not been well suited to ensemble learning algorithms. Ensemble learning is more effective when applied to unstable prediction models with low bias and high variance, such as decision trees. Applying bootstrap sampling and boosting in combination therefore increases the variation among the naive Bayes prediction results, and using weights (integration weights) calculated from the AUC on the OOB data reduces the influence of extreme prediction results, so that prediction accuracy can be improved.

To make the results easier to interpret, the nonlinear part of the prediction model obtained by ensemble learning is linearly approximated and aggregated into an equivalent single naive Bayes prediction model while prediction accuracy is largely maintained, which yields a model whose prediction results are easy to interpret. Because the aggregated prediction model is equivalent to a single naive Bayes model, only one model needs to be examined, and the influence of each variable can be considered independently, so the prediction results are easy to interpret.

FIG. 2 is a flowchart showing an example of the prediction model creation method.
In block 102, the learning data 16 is input to the bootstrap sample generation unit 72, in which the bootstrap maximum count data 12 has been set. The learning data 16 is the raw data for creating the prediction model, for example customer data.

In step 104, the bootstrap sample generation unit 72 performs random sampling with replacement (known as bootstrap sampling) on the learning data 16, drawing as many times as there are data items, and outputs the weighted learning data 42.

FIG. 3 shows an example of bootstrap sampling. The learning data 16 consists of many data sets, for example 20, each of which contains a data identifier, explanatory variables X1, X2, and X3, and a binary objective variable Y. The goal of training the prediction model is, for example, to discriminate the objective variable Y = 1. The weighted learning data 42 obtained by bootstrap sampling also consists of 20 data sets, each of which contains a data identifier, explanatory variables X1, X2, and X3, the binary objective variable Y, and a weight. The weights of all the data sets sum to the number of data sets (= 20). In this example, the data with identifiers ID = 1 and ID = 2 were never drawn, and the data with identifier ID = 3 was drawn twice.
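
The weighting scheme in this example can be sketched as follows. This is only an illustrative sketch, not the implementation of the bootstrap sample generation unit 72: the function name `bootstrap_weights` and the fixed seed are assumptions made for the example, and each item's weight is simply taken to be the number of times it was drawn.

```python
import random
from collections import Counter

def bootstrap_weights(ids, seed=0):
    """Draw len(ids) items with replacement; weight = number of times drawn.

    Hypothetical helper for illustration; the name and the seed are assumptions.
    """
    rng = random.Random(seed)
    draws = rng.choices(ids, k=len(ids))
    counts = Counter(draws)
    # Items that were never drawn keep weight 0.
    return {item_id: counts.get(item_id, 0) for item_id in ids}

ids = list(range(1, 21))          # 20 data items, as in the example
weights = bootstrap_weights(ids)
print(sum(weights.values()))      # the weights always sum to the number of items
```

Because exactly as many draws are made as there are items, the weights sum to the number of items regardless of which items happen to be drawn.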

In step 106, the counter of the number of bootstrap samples generated is incremented (+1).

In step 108, the naive Bayes prediction model generation unit 74 discretizes the explanatory variables X of the weighted learning data 42. When an explanatory variable X1, X2, or X3 of the weighted learning data 42 is on an ordinal scale (or an interval or ratio scale), it is discretized based on a predetermined evaluation function. The main evaluation functions for discretization include entropy and the expected lift value (Japanese Patent Application No. 2013-118091); any of them may be used, and the evaluation function is not limited to a particular one.

As an example, FIG. 4 shows the discretization of the explanatory variable X1 when entropy is used as the evaluation function. Several discretization candidates are prepared for each variable, and the candidate that minimizes the evaluation function is selected. Since candidate 2 has the smallest entropy, candidate 2 is selected as the discretization of the variable X1. The variable X2 is discretized in the same way. The variable X3 is originally a discrete variable on a nominal scale, so it does not need to be discretized.
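
A minimal sketch of selecting the minimum-entropy candidate is given below. The description does not spell out the exact entropy formula, so the weighted conditional entropy of Y over the bins is used here as an assumption; the function names, the cut-point representation, and the sample rows are likewise hypothetical.

```python
import math
from bisect import bisect_right

def weighted_entropy(rows, cut_points):
    """Weighted average entropy of binary Y over the bins defined by cut_points.

    rows: list of (x, y, weight) with y in {0, 1}; cut_points: sorted list.
    Assumed definition; the source only says that entropy is minimized.
    """
    n_bins = len(cut_points) + 1
    pos = [0.0] * n_bins   # total weight with y == 1 per bin
    tot = [0.0] * n_bins   # total weight per bin
    for x, y, w in rows:
        b = bisect_right(cut_points, x)
        tot[b] += w
        pos[b] += w * y
    total = sum(tot)
    h = 0.0
    for p_w, t_w in zip(pos, tot):
        if t_w == 0:
            continue  # an empty bin contributes nothing
        p = p_w / t_w
        bin_h = -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)
        h += (t_w / total) * bin_h
    return h

def choose_discretization(rows, candidates):
    """Return the candidate (a list of cut points) with the smallest entropy."""
    return min(candidates, key=lambda cuts: weighted_entropy(rows, cuts))

# Illustrative data only: Y switches from 0 to 1 at x = 5.
rows = [(1, 0, 1), (2, 0, 2), (4, 0, 1), (5, 1, 1), (7, 1, 2), (9, 1, 1)]
candidates = [[2], [5], [8]]
best = choose_discretization(rows, candidates)
```

With this toy data the cut at 5 separates the two classes perfectly, so its entropy is zero and it is the candidate selected.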

In step 110, the naive Bayes prediction model generation unit 74 calculates the lift value of each discrete region for each of the explanatory variables X1 and X2. In the naive Bayes method, the lift value is a kind of correction coefficient for calculating the posterior probability P(A|D) from the prior probability P(A), and is defined by the following rearrangement of Bayes' theorem:
P(A|D) = P(A) × P(D|A) / P(D)
The posterior probability P(A|D) is the probability that the variable Y = A given that the variable X falls in the discrete region D. The prior probability P(A) is the probability that Y = A when no condition is imposed. In the lift value P(D|A) / P(D), the numerator P(D|A) is the probability that X falls in the discrete region D given that Y = A, and the denominator P(D) is the probability that X falls in the discrete region D. The resulting set of (variable, discrete region, lift value) tuples constitutes the unit prediction model 44.
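
The lift value calculation can be illustrated with weighted frequencies, as in the sketch below. The function name `lift_values`, the `region_of` callback, and the sample rows are assumptions introduced for the example; the probabilities are estimated as ratios of weight totals, which is one natural reading of the description rather than a confirmed detail.

```python
def lift_values(rows, region_of, target=1):
    """Return (prior, {region: lift}) for one explanatory variable.

    rows: list of (x, y, weight); region_of maps x to a discrete-region label.
    lift(D) = P(D | Y = target) / P(D), estimated from weight totals.
    Assumes at least one row with y == target and positive weight.
    """
    total = sum(w for _, _, w in rows)
    total_target = sum(w for _, y, w in rows if y == target)
    in_region = {}
    in_region_target = {}
    for x, y, w in rows:
        d = region_of(x)
        in_region[d] = in_region.get(d, 0.0) + w
        if y == target:
            in_region_target[d] = in_region_target.get(d, 0.0) + w
    prior = total_target / total
    lifts = {}
    for d, w_d in in_region.items():
        if w_d == 0:
            continue  # skip regions that carry no weight
        p_d = w_d / total
        p_d_given_target = in_region_target.get(d, 0.0) / total_target
        lifts[d] = p_d_given_target / p_d
    return prior, lifts

# Illustrative data: ten unit-weight rows, split into "low" (x < 5) and "high".
ys = [1, 0, 0, 0, 0, 1, 1, 1, 0, 0]
rows = [(x, y, 1) for x, y in enumerate(ys)]
prior, lifts = lift_values(rows, lambda x: "low" if x < 5 else "high")
```

In this toy data the prior is 0.4 and the lift of the "high" region is 1.5, so prior × lift = 0.6, which matches the observed share of Y = 1 in that region (3 of 5).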

FIG. 5 shows an example of the unit prediction model 44 together with an example of the lift value calculation. The unit prediction model also includes the initial probability (the total weight of the data with Y = 1 divided by the total number of data items).

In step 112, the naive Bayes prediction model generation unit 74 evaluates the unit prediction model 44 using the weighted learning data 42: the weighted learning data 42 is applied to the unit prediction model 44, and the model is evaluated based on its output. The evaluation result of the unit prediction model 44 consists of the logarithmic score (the logarithm of the posterior probability) 46, the maximum correct answer rate data 48, the misclassification data 50, and the AUC on the OOB data.

The logarithmic score 46 is calculated by applying the naive Bayes prediction model to each element (ID = 1, 2, ...) of the target data. It is obtained by multiplying the initial probability by the lift value of the discrete region into which each explanatory variable X1, X2, ... falls, and finally taking log_10.
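
Written out as a formula, the calculation above takes the following form. The symbols are notational assumptions introduced here: n is the number of explanatory variables, D_i is the discrete region into which the i-th explanatory variable falls, and A is the event Y = 1.

```latex
\[
\text{score}
  = \log_{10}\!\left( P(A) \prod_{i=1}^{n} \frac{P(D_i \mid A)}{P(D_i)} \right)
  = \log_{10} P(A) + \sum_{i=1}^{n} \log_{10} \frac{P(D_i \mid A)}{P(D_i)}
\]
```

The second form shows that the logarithmic score is a sum of one term per explanatory variable, so the contribution of each variable to the score can be read off separately.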

FIG. 6 shows a calculation example of the logarithmic score 46. The variable X1 falls in the discrete region D_{1,1}, and the variable X2 falls in the discrete region D_{2,2}. The data with identifiers ID = 1 and ID = 2 have weight 0 and are therefore excluded. The logarithmic score of the data with identifier ID = 3 is log_10(0.4 × 0.659 × 2.133 × …) = 0.749, where 0.4 is the initial probability, 0.659 is the lift value for X1 ∈ D_{1,1}, and 2.133 is the lift value for X2 ∈ D_{2,2}.
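
The arithmetic can be sketched in a few lines. The function name `log_score` is an assumption, and the call below uses only the initial probability and the two lift values quoted in the text; the remaining factors are elided in the source and are not filled in here, so the result is a partial product and is not expected to reproduce the final score.

```python
import math

def log_score(prior, lifts):
    """log10 of the prior multiplied by the lift of each matched discrete region.

    Hypothetical helper; assumes the prior and every lift value are positive.
    """
    return math.log10(prior) + sum(math.log10(lift) for lift in lifts)

# Partial product only: the initial probability and the X1 and X2 lift values.
partial = log_score(0.4, [0.659, 2.133])
print(partial)
```

Summing logarithms rather than multiplying probabilities gives the same value and avoids underflow when many explanatory variables are involved.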

The higher the logarithmic score obtained for a data item by applying the naive Bayes prediction model, the higher the probability that Y = 1 for that item is expected to be. A threshold is set for the logarithmic score, and the prediction result is fixed by regarding data whose logarithmic score is at or above the threshold as Y = 1. The correct answer rate of the prediction model can then be calculated by comparing the prediction results with the actual results.

Moving the threshold of the logarithmic score changes the correct answer rate accordingly. The maximum value of the correct answer rate is called the maximum correct answer rate data 48. The set of data whose prediction differs from the actual result at the maximum correct answer rate is called the misclassification data 50.

FIG. 7 shows a calculation example of the maximum correct answer rate. With threshold 1, the total weight of the data predicted as Y = 1 and actually Y = 1 is 5, and the total weight of the data predicted as Y = 0 and actually Y = 0 is 12. Since the total weight of all data is 20, the correct answer rate is (5 + 12) / 20 = 0.85. With threshold 2, the same calculation gives a correct answer rate of (7 + 7) / 20 = 0.7. The maximum correct answer rate is therefore 0.85, and the misclassification data is {4, 11}.
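
The threshold search can be sketched as follows. The function name, the row layout, and the sample scores are assumptions for illustration and do not reproduce the data of FIG. 7; only the final arithmetic check uses the figures quoted in the text.

```python
def max_correct_answer_rate(rows, thresholds):
    """Return (best_rate, best_threshold, misclassified_ids).

    rows: list of (item_id, score, y, weight) with y in {0, 1}.
    An item is predicted as Y = 1 when its score is at or above the threshold.
    """
    total = sum(w for _, _, _, w in rows)
    best = None
    for t in thresholds:
        correct = 0.0
        wrong_ids = []
        for item_id, score, y, w in rows:
            if w == 0:
                continue  # weight-0 items are excluded from the evaluation
            predicted = 1 if score >= t else 0
            if predicted == y:
                correct += w
            else:
                wrong_ids.append(item_id)
        rate = correct / total
        if best is None or rate > best[0]:
            best = (rate, t, wrong_ids)
    return best

# Illustrative rows only (not the data of FIG. 7).
rows = [(1, 0.9, 1, 2), (2, 0.6, 0, 1), (3, 0.4, 1, 1), (4, 0.1, 0, 1)]
rate, threshold, wrong = max_correct_answer_rate(rows, [0.5, 0.3])

# The arithmetic quoted in the text for threshold 1:
rate_threshold_1 = (5 + 12) / 20
```

The misclassified identifiers are returned together with the rate because the later weight update uses both.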

Besides the correct answer rate, the ROC curve is a common way to evaluate prediction results. The area under the ROC curve is called the AUC (Area Under the ROC Curve).

The data obtained from the weighted learning data 42 by inverting the weights (data with weight > 0 is changed to weight = 0, and data with weight = 0 is changed to weight = 1) is called OOB data.
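
The weight inversion is a one-line mapping, sketched below. The function name `oob_weights` is an assumption; the weights for ID = 1, 2, and 3 follow the bootstrap example in the text, while the weight for ID = 4 is an illustrative value only.

```python
def oob_weights(weights):
    """Invert bootstrap weights: drawn items get 0, never-drawn items get 1."""
    return {item_id: (1 if w == 0 else 0) for item_id, w in weights.items()}

# ID 1 and 2 were never drawn, ID 3 was drawn twice; ID 4's weight is made up.
weighted = {1: 0, 2: 0, 3: 2, 4: 1}
oob = oob_weights(weighted)
print(oob)
```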

From auc_oob, the AUC calculated on the OOB data, the integrated weight (w) 52 is obtained. This is the weight given to each unit prediction model 44 when the integrated score is calculated (step 124).
$w = \log_e\left(\dfrac{\mathrm{auc\_oob}}{1 - \mathrm{auc\_oob}}\right)$
However, when auc_oob <= 0.5, w = 0.
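The weight formula can be checked numerically with a few lines of Python. The function name is hypothetical, and the handling of auc_oob = 1 (where the formula divides by zero) is an assumption, since the text does not say how that case is treated.

```python
import math


def integration_weight(auc_oob):
    """w = ln(auc_oob / (1 - auc_oob)), clamped to 0 when auc_oob <= 0.5."""
    if auc_oob <= 0.5:
        return 0.0
    if auc_oob >= 1.0:
        # Not specified in the source; raising here is an assumption.
        raise ValueError("auc_oob = 1 makes the weight unbounded")
    return math.log(auc_oob / (1.0 - auc_oob))


w_example = integration_weight(0.875)   # ln(7), about 1.946
```

A unit prediction model that does no better than chance on OOB data (AUC at or below 0.5) therefore contributes nothing to the integrated score.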

FIG. 8 shows an example of the ROC curve and the AUC. The ROC curve is drawn by sorting the data in descending order of logarithmic score and plotting the false positive rate on the horizontal axis against the true positive rate on the vertical axis. The true positive rate on the vertical axis is the rate at which data are predicted as Y = 1 and are actually Y = 1, and the false positive rate on the horizontal axis is the rate at which data are predicted as Y = 1 but are actually Y = 0. The area under the ROC curve is the AUC; in this example, AUC = 0.875.
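The construction of the ROC curve described above can be sketched as a sweep over the data sorted by score, accumulating the area with the trapezoid rule. This is an unweighted illustration with hypothetical toy data (it does not reproduce FIG. 8), and it assumes both classes are present.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve from a descending-score sweep."""
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    prev_tpr = prev_fpr = 0.0
    auc = 0.0
    i = 0
    while i < len(pairs):
        j = i
        # Process tied scores together so ties form one diagonal segment.
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        tpr, fpr = tp / n_pos, fp / n_neg
        auc += (fpr - prev_fpr) * (tpr + prev_tpr) / 2.0
        prev_tpr, prev_fpr = tpr, fpr
        i = j
    return auc


# Hypothetical scores: two positives, four negatives.
auc_example = roc_auc([0.9, 0.7, 0.6, 0.5, 0.4, 0.3], [1, 0, 1, 0, 0, 0])
```

In this toy data, seven of the eight positive/negative pairs are ranked correctly, which gives an AUC of 0.875.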

The maximum correct answer rate data 48 and the misclassification data 50 output in step 112 are used in step 116, "update the weights of the weighted learning data". The logarithmic score 46 and the integrated weight (w) 52 output in step 112 are used in step 124, "calculate the integrated score".

FIG. 9 shows an example of the evaluation results of a unit prediction model. From the weighted learning data (with weight = 0 data removed), the logarithmic score (0.749 for the data with ID = 3), the maximum correct answer rate (= 0.85), and the misclassification data 50 (= {4, 11}) are obtained. From the OOB data, auc_oob = 0.875 gives the integrated weight w (= 1.946).

In step 114, the boosting processing unit 76 determines whether to end the boosting process. The boosting process adjusts the weights of the learning data based on the result of evaluating the prediction model on the learning data. Typical boosting methods include AdaBoost and LogitBoost; any of them may be used, and the boosting method is not particularly limited.

The end determination can be based on the number of boosting iterations or on the maximum correct answer rate. For example, the boosting loop is exited when the number of boosting iterations exceeds the boosting maximum count data 14 (value $N_{\mathrm{bst}}$), when the maximum correct answer rate = 1 (the misclassification data set is empty, φ), or when the maximum correct answer rate falls below 0.5.

If the end determination is negative, in step 116 the boosting processing unit 76 updates the weights of the weighted learning data 42 (boosting process). The boosting processing unit 76 takes the weighted learning data 42, the maximum correct answer rate data 48, and the misclassification data 50 as input, recalculates the weight of each data item, and updates the weighted learning data 42.

FIG. 10 shows an example of updating the weights. The update increases the weights of the misclassified data and then normalizes all weights. In step 1001, $b = a / (1 - a)$ is calculated, where $a$ is the maximum correct answer rate. In step 1002, the weights of the misclassified data are multiplied by $b$ ($W_{\mathrm{temp}} = b \times W_{\mathrm{old}}$); the weights of correctly classified data are left unchanged ($W_{\mathrm{temp}} = W_{\mathrm{old}}$). In step 1003, the ratio of the total weight after the change to the total weight before the change is obtained ($k = \sum W_{\mathrm{temp}} / \sum W_{\mathrm{old}}$). In step 1004, the changed weights are divided by $k$ to normalize them ($W_{\mathrm{new}} = W_{\mathrm{temp}} / k$). Here $\sum$ denotes the sum over all data items.

FIG. 11 shows an example of the updated weighted learning data.
As shown in FIG. 9, the maximum correct answer rate is 0.85, so b is calculated as follows (step 1001).
$b = 0.85 / (1 - 0.85) = 5.666$
The data with ID = 3 was classified correctly, so its weight is not changed (step 1002).
$W_{3,\mathrm{temp}} = W_{3,\mathrm{old}} = 2$
The data with ID = 4 was misclassified, so its weight is multiplied by b (step 1002).
$W_{4,\mathrm{temp}} = b \times W_{4,\mathrm{old}} = 11.333$
The weights of the remaining data are updated in the same way, according to whether each item was classified correctly or incorrectly.
$\sum_i W_{i,\mathrm{temp}} = 31$
$\sum_i W_{i,\mathrm{old}} = 20$
$k = \sum_i W_{i,\mathrm{temp}} / \sum_i W_{i,\mathrm{old}} = 1.55$
$W_{3,\mathrm{new}} = W_{3,\mathrm{temp}} / k = 1.29$
$W_{4,\mathrm{new}} = W_{4,\mathrm{temp}} / k = 7.31$
Thereafter, the counter of the number of boosting iterations is incremented (+1) in step 118, the process returns to step 108, and the boosting process continues.
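Steps 1001 to 1004 can be written as one short function. This is a sketch under stated assumptions: the function name, the dictionary layout, and the toy data are hypothetical and do not reproduce the table of FIG. 11.

```python
def update_weights(weights, max_accuracy, misclassified):
    """Boost the weights of misclassified items, then renormalize.

    max_accuracy must be below 1; the boosting loop exits before this
    function would be called with max_accuracy = 1.
    """
    b = max_accuracy / (1.0 - max_accuracy)              # step 1001
    temp = {i: (b * w if i in misclassified else w)      # step 1002
            for i, w in weights.items()}
    k = sum(temp.values()) / sum(weights.values())       # step 1003
    return {i: w / k for i, w in temp.items()}           # step 1004


old = {1: 2.0, 2: 1.0, 3: 1.0}    # hypothetical weights, total 4
new = update_weights(old, max_accuracy=0.75, misclassified={3})
```

Because of the division by k, the total weight is the same before and after the update; only the share held by the misclassified items grows.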

If the boosting end determination is affirmative, in step 120 the bootstrap sample generation unit 72 determines whether to end the bootstrap sample generation process. The end determination can be based on the number of bootstrap samples generated. For example, the bootstrap sample generation loop is exited when the number of bootstrap samples generated exceeds the bootstrap maximum count data 12 (value $N_{\mathrm{bsp}}$).

If the end determination is negative, the process returns to step 104 and the bootstrap sample generation process continues.

Up to this point, the two loops of bootstrap sampling and boosting have been repeated in a nested manner (the boosting loop inside, the bootstrap sampling loop outside). As a result, weighted learning data 42 with various weights, together with the corresponding unit prediction models 44, logarithmic scores 46, and integrated weights 52, are stored in the output data management unit 40. Hereinafter, each pass of the loop is referred to as Run = 1, 2, .... For example, when $N_{\mathrm{bst}} = 10$ and $N_{\mathrm{bsp}} = 20$, Runs start at Run = 1 and there can be up to 10 × 20 = 200 unit prediction models 44 and associated data sets.
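The nesting of the two loops, including the early exit from the boosting loop, might be organized as in the following skeleton. All names are hypothetical, and the three callables stand in for the processing units of the embodiment; their result keys `max_accuracy` and `misclassified` are assumptions made for this sketch.

```python
def run_ensemble(n_bsp, n_bst, make_sample, fit_and_evaluate, update_weights):
    """Collect one evaluation result per Run from the nested loops."""
    runs = []
    for _ in range(n_bsp):                    # outer loop: bootstrap sampling
        weights = make_sample()
        for _ in range(n_bst):                # inner loop: boosting
            result = fit_and_evaluate(weights)
            runs.append(result)
            a = result["max_accuracy"]
            if a == 1.0 or a < 0.5:           # nothing left to boost, or too weak
                break
            weights = update_weights(weights, a, result["misclassified"])
    return runs
```

With `n_bst = 10` and `n_bsp = 20` the list holds at most 200 entries, and fewer whenever the boosting loop exits early.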

If the bootstrap sampling end determination is affirmative, the prediction model integration unit 78 integrates the plurality of unit prediction models 44 stored in the output data management unit 40 for each Run. First, in step 122, the prediction model integration unit 78 uses the weighted learning data 42 and the logarithmic score 46 to calculate the logistic regression parameters α and β for each Run. Specifically, logistic regression analysis of the following equation is performed with the logarithmic score 46 as the explanatory variable (= S) and Y of the weighted learning data 42 as the objective variable, to obtain α and β.
$P(Y = 1) = \dfrac{1}{1 + e^{-(\alpha + \beta \times S)}}$
The logarithmic score 46 of each data item obtained in each Run is an approximate index of the posterior probability under the assumption that the variables are independent, but it is not a probability in the strict sense (it does not fall within the range 0 to 1). Experiments confirmed that, when a plurality of unit prediction models 44 are integrated, the scores therefore have inconsistent scales and prediction accuracy drops. Converting the logarithmic score 46 with the logistic regression equation normalizes it to the range 0 to 1 and, as a result, improves the prediction accuracy when the unit prediction models 44 are integrated.
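One way to obtain α and β is a weighted maximum-likelihood fit by Newton's method, sketched below in plain Python. The text does not state which fitting procedure is used or whether the data weights enter the fit, so both are assumptions; the function name and the toy data are hypothetical.

```python
import math


def fit_logistic(scores, labels, weights, iters=50):
    """Fit P(Y=1) = 1 / (1 + exp(-(alpha + beta * s))) by Newton's method."""
    alpha = beta = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for s, y, w in zip(scores, labels, weights):
            p = 1.0 / (1.0 + math.exp(-(alpha + beta * s)))
            g0 += w * (y - p)            # gradient w.r.t. alpha
            g1 += w * (y - p) * s        # gradient w.r.t. beta
            v = w * p * (1.0 - p)
            h00 += v
            h01 += v * s
            h11 += v * s * s
        det = h00 * h11 - h01 * h01
        if det == 0.0:
            break                        # degenerate data; stop without updating
        d_alpha = (h11 * g0 - h01 * g1) / det
        d_beta = (h00 * g1 - h01 * g0) / det
        alpha += d_alpha
        beta += d_beta
        if abs(d_alpha) + abs(d_beta) < 1e-10:
            break
    return alpha, beta


# Hypothetical logarithmic scores S and actual Y (not linearly separable).
S = [-2, -1, -1, 0, 0, 1, 1, 2]
Y = [0, 0, 1, 0, 1, 0, 1, 1]
W = [1.0] * len(S)
alpha, beta = fit_logistic(S, Y, W)
```

The fitted curve maps any logarithmic score to a value between 0 and 1, which puts the scores of different Runs on a common scale. The fit does not converge when the scores separate the two classes perfectly, so a production version would need a safeguard for that case.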

Thus, as shown in FIG. 12, the integrated weight w and the auc_oob data calculated in step 112 are used together with the logistic regression parameters α and β as the parameters for integrating the unit prediction models.

In step 124, the prediction model integration unit 78 uses the unit prediction models 44 of all Runs and the model integration parameters w, α, and β to calculate the integrated score $S_{\mathrm{all}}$ of the integrated prediction model, which is a kind of ensemble-learning prediction model.

FIG. 13 shows a calculation example of the integrated score $S_{\mathrm{all}}$. The prediction model integration unit 78 converts the logarithmic score $S_n = \sum_{i=0}^{m} \log \mathrm{Lift}^{i}_{(n)}$ derived from the unit prediction model 44 of each Run with the logistic regression equation $P_n = 1 / (1 + e^{-(\alpha_n + \beta_n \times S_n)})$, and takes the weighted average with the weight w of each unit prediction model 44. This gives the integrated score $S_{\mathrm{all}}$ of the integrated prediction model that combines the plurality of unit prediction models 44.
$S_{\mathrm{all}} = \sum_n w_n \times P_n / \sum_n w_n$
Here, the explanatory variables of the unit prediction model 44 are $X_1, X_2, \ldots, X_m$, the initial probability in Run = k is $\mathrm{Lift}^{0}_{(k)}$, and the lift value of the variable $X_i$ is $\mathrm{Lift}^{i}_{(k)}$. The sum in $S_n$ runs over i = 0 to m, and the sums in $S_{\mathrm{all}}$ run over all Runs.
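The weighted average can be sketched in a few lines. The function name and argument layout are hypothetical; each list holds one entry per Run, and the function assumes at least one integrated weight is positive.

```python
import math


def integrated_score(log_scores, alphas, betas, ws):
    """Weighted average of per-Run logistic-transformed logarithmic scores."""
    num = den = 0.0
    for s, a, b, w in zip(log_scores, alphas, betas, ws):
        p = 1.0 / (1.0 + math.exp(-(a + b * s)))   # P_n for this Run
        num += w * p
        den += w
    return num / den


# Two hypothetical Runs scoring the same data item.
s_all = integrated_score(log_scores=[0.0, 2.0], alphas=[0.0, 0.0],
                         betas=[1.0, 1.0], ws=[1.0, 3.0])
```

Runs with a larger integrated weight pull the result toward their own probability estimate, and Runs with w = 0 are ignored.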

In step 126, the prediction model aggregation unit 80 constructs one unit prediction model (the aggregate prediction model 56) that aggregates the integrated prediction model. Specifically, as shown in FIG. 14, the logistic regression equation $P_n = 1 / (1 + e^{-(\alpha_n + \beta_n \times S_n)})$ contained in the integrated score is Taylor-expanded, the nonlinear part of the resulting prediction formula is linearly approximated (the nonlinear terms are ignored and only the linear terms are kept), and the formula is rearranged and its terms collected to obtain the aggregate prediction model 56.
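The linearization can be illustrated with the following derivation. FIG. 14 is not reproduced here, so the expansion point ($z_n = 0$) and the grouping of terms into $A$ and $B_j$ are assumptions made for illustration and may differ from the figure.

```latex
% Taylor expansion of the logistic function about z_n = 0 (assumed expansion point)
P_n = \frac{1}{1 + e^{-z_n}}
    = \frac{1}{2} + \frac{z_n}{4} - \frac{z_n^{3}}{48} + \cdots ,
\qquad z_n = \alpha_n + \beta_n S_n

% Keep only the linear term
P_n \approx \frac{1}{2} + \frac{\alpha_n + \beta_n S_n}{4}

% Substitute into the weighted average, with S_n = \sum_{i=0}^{m} \log \mathrm{Lift}^{i}_{(n)}
S_{\mathrm{all}} \approx
\underbrace{\frac{1}{2}
  + \frac{\sum_n w_n \left( \alpha_n + \beta_n \log \mathrm{Lift}^{0}_{(n)} \right)}{4 \sum_n w_n}}_{A}
+ \sum_{j=1}^{m}
\underbrace{\frac{\sum_n w_n \beta_n \log \mathrm{Lift}^{j}_{(n)}}{4 \sum_n w_n}}_{B_j}
```

After this step the score is a constant plus one additive term per explanatory variable, which has the same form as the logarithmic score of a single naive Bayes model.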

By regarding A as the (log) initial probability and B as the (log) lift value of the variable $X_j$ in the aggregate prediction model 56, a single naive Bayes prediction formula is obtained. In this way, nesting and repeating the two loops of bootstrap sampling and boosting yields many unit prediction models corresponding to weighted learning data with various weights, and one aggregate prediction model that integrates them is obtained.

In step 128, the prediction model evaluation unit 82 evaluates the integrated prediction model and the aggregate prediction model 56 using test data. The embodiment does not actually construct the integrated prediction model; a virtual model that uses the integrated score directly for prediction is treated as the integrated prediction model. The integrated prediction model and the aggregate prediction model 56 are applied to the test data to obtain the evaluation result data 58.

As shown in FIG. 15, the unit prediction models 44 include the weights w and the logistic regression parameters α and β for all Runs. The aggregate prediction model 56, in contrast, is a single naive Bayes prediction model with aggregated discrete regions and lift values.

The prediction models can be evaluated with a general method such as the AUC described above. Here, as an example, the integrated prediction model and the aggregate prediction model 56 were compared with random forest on four data sets of software development data published by NASA. FIG. 16 shows the evaluation results. Prediction accuracy (AUC) was calculated by 5-fold cross-validation (all data are divided into five parts, four are used as learning data and one as test data, and the evaluation is repeated five times). This was repeated five times and the average was taken. On all four data sets, the prediction accuracy (AUC) of the integrated prediction model and the aggregate prediction model 56 exceeded that of random forest. The integrated prediction model and the aggregate prediction model 56 had almost the same prediction accuracy on these data sets. In terms of model complexity, however, the integrated prediction model is not much different from other ensemble-learning models (random forest and the like), whereas the aggregate prediction model 56 is aggregated into a single model, so its structure is simple and the model is easy to interpret (for example, to analyze what led to a prediction).
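The evaluation protocol (5-fold cross-validation repeated five times and averaged) can be sketched as below. The function names, the shuffling, and the seed are assumptions; `score_fn` is a hypothetical callable that would train on the training indices and return the AUC on the test indices.

```python
import random


def repeated_kfold_indices(n, k=5, repeats=5, seed=0):
    """Yield (train, test) index lists for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        folds = [idx[f::k] for f in range(k)]      # k disjoint folds
        for f in range(k):
            test = folds[f]
            train = [i for g in range(k) if g != f for i in folds[g]]
            yield train, test


def mean_cv_score(n, score_fn, k=5, repeats=5, seed=0):
    """Average score_fn(train, test) over all folds of all repetitions."""
    scores = [score_fn(tr, te)
              for tr, te in repeated_kfold_indices(n, k, repeats, seed)]
    return sum(scores) / len(scores)
```

With the default arguments, each model is trained and scored 25 times, and every data item appears in a test fold exactly once per repetition.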

As described above, according to the embodiment, prediction accuracy is improved by combining bootstrap sampling and boosting to randomize the learning data and applying the result to naive Bayes prediction models. Because bootstrap sampling draws at random with replacement, some data are never selected (OOB data). Evaluating prediction accuracy on the OOB data makes it possible to estimate in advance how accurate the predictions are likely to be on new data not included in the learning data. High prediction accuracy is achieved by using the integrated weights calculated from the AUC on the OOB data to take a weighted average of the evaluation results of the individual prediction models, and by generating the prediction model from that integrated result. In addition, linearly approximating the nonlinear part of the ensemble-learning prediction model and aggregating it into an equivalent single naive Bayes prediction model, while largely preserving prediction accuracy, yields a model whose prediction results are easy to interpret.

The processing of this embodiment can be implemented by a computer program. The same effects as those of this embodiment can therefore be obtained easily, simply by installing the computer program on a computer through a computer-readable storage medium that stores it and executing it.

The present invention is not limited to the embodiment exactly as described above; at the implementation stage, the constituent elements can be modified without departing from the gist of the invention. Various inventions can also be formed by appropriately combining the constituent elements disclosed in the embodiment. For example, some constituent elements may be removed from the full set shown in the embodiment, and constituent elements from different embodiments may be combined as appropriate. For example, bootstrap sampling, boosting, and the integration of a plurality of prediction models using OOB data can also be applied to unit prediction models other than naive Bayes. The linear approximation of the nonlinear part of an ensemble-learning prediction model may also be applied when aggregating a plurality of naive Bayes prediction models generated by methods other than the embodiment described above.

DESCRIPTION OF SYMBOLS: 16 ... learning data, 42 ... weighted learning data, 44 ... unit prediction model, 46 ... logarithmic score, 52 ... integrated weight, 54 ... integrated score, 56 ... aggregate prediction model, 72 ... bootstrap sample generation unit, 74 ... naive Bayes prediction model generation unit, 76 ... boosting processing unit, 78 ... prediction model integration unit, 80 ... prediction model aggregation unit, 82 ... prediction model evaluation unit.

Claims

A prediction model creation method comprising:
an evaluation step of discretizing learning data, which is data accumulated in the past and used to create a prediction model, obtaining for each discrete region a lift value that is a coefficient for calculating a posterior probability from a prior probability, generating a unit prediction model representing the lift value for each discrete region, applying the unit prediction model to the learning data, comparing the prediction result with the actual result, and evaluating the unit prediction model by calculating the prediction accuracy and the necessary parameter values;
a repeating step of changing the discrete regions and operating the evaluation step a plurality of times to generate and evaluate a plurality of unit prediction models;
an integration step of converting the evaluation results of the plurality of unit prediction models into one integrated evaluation result; and
an aggregation step of generating an aggregate prediction model from the integrated evaluation result,
wherein the aggregation step generates the aggregate prediction model by linearly approximating a nonlinear portion of the integrated evaluation result.

The prediction model creation method according to claim 1, wherein the evaluation step obtains the logarithm of the posterior probability of the unit prediction model and an integrated weight according to the area under the ROC curve of the unit prediction model, and
the integration step obtains logistic regression parameters corresponding to the logarithm and converts the evaluation results of the plurality of unit prediction models into the integrated evaluation result using the integrated weight and the logistic regression parameters.

3. The aggregation step generates the aggregated prediction model by omitting a nonlinear term of a logistic regression equation using the logistic regression parameter included in the integrated evaluation result and leaving a linear term for the integrated evaluation result. The prediction model creation method described.

The prediction model creation method according to claim 1, further comprising a weighted learning data generation step of bootstrap sampling the learning data to generate weighted learning data, wherein
the evaluation step discretizes the weighted learning data, obtains a lift value for each discrete region, generates the unit prediction model, and evaluates the unit prediction model,
the repeating step includes:
a first iteration step of updating the weights of the weighted learning data and operating the evaluation step m times, m being a first positive integer, to generate and evaluate m unit prediction models; and
a second iteration step of, after the m unit prediction models are evaluated, repeating the operation of the weighted learning data generation step n times, n being a second positive integer, to generate and evaluate m × n unit prediction models, and
the integration step converts the evaluation results of the m × n unit prediction models into one integrated evaluation result.

5. The prediction model creation method according to claim 4, wherein in the first iteration step, the evaluation step is operated until m exceeds a predetermined number of times, or a maximum correct answer rate of prediction becomes 1 or falls below a predetermined value.

6. The prediction model creation method according to claim 4, wherein the second iteration step operates the weighted learning data generation step until n exceeds a predetermined number of times.

The prediction model creation method according to claim 4, wherein
the learning data includes explanatory variables and an objective variable,
the weighted learning data further includes a weight indicating the number of times the learning data is extracted by bootstrap sampling,
the evaluation step generates a unit prediction model representing an initial probability, obtained by dividing the sum of the weights of the data whose objective variable takes a target value by the number of data, and a lift value for each discrete region, and
the logarithm of the posterior probability of the unit prediction model is obtained by multiplying the initial probability and the lift value for each discrete region.

A program executed by a computer, the program comprising:
an evaluation step of discretizing learning data, which is data accumulated in the past and used to create a prediction model, obtaining for each discrete region a lift value that is a coefficient for calculating a posterior probability from a prior probability, generating a unit prediction model representing the lift value for each discrete region, applying the unit prediction model to the learning data, comparing the prediction result with the actual result, and evaluating the unit prediction model by calculating the prediction accuracy and the necessary parameter values;
a repeating step of changing the discrete regions and operating the evaluation step a plurality of times to generate and evaluate a plurality of unit prediction models;
an integration step of converting the evaluation results of the plurality of unit prediction models into one integrated evaluation result; and
an aggregation step of generating an aggregate prediction model from the integrated evaluation result,
wherein the aggregation step generates the aggregate prediction model by linearly approximating a nonlinear portion of the integrated evaluation result.

The program according to claim 8, wherein the evaluation step obtains the logarithm of the posterior probability of the unit prediction model and an integrated weight according to the area under the ROC curve of the unit prediction model, and
the integration step obtains logistic regression parameters corresponding to the logarithm and converts the evaluation results of the plurality of unit prediction models into the integrated evaluation result using the integrated weight and the logistic regression parameters.

10. The aggregation step generates the aggregate prediction model by omitting a nonlinear term of a logistic regression equation that uses the logistic regression parameter included in the integrated evaluation result and leaving a linear term for the integrated evaluation result. The listed program.

The program according to claim 8, further comprising a weighted learning data generation step of bootstrap sampling the learning data to generate weighted learning data, wherein
the evaluation step discretizes the weighted learning data, obtains a lift value for each discrete region, generates the unit prediction model, and evaluates the unit prediction model,
the repeating step includes:
a first iteration step of updating the weights of the weighted learning data and operating the evaluation step m times, m being a first positive integer, to generate and evaluate m unit prediction models; and
a second iteration step of, after the m unit prediction models are evaluated, repeating the operation of the weighted learning data generation step n times, n being a second positive integer, to generate and evaluate m × n unit prediction models, and
the integration step converts the evaluation results of the m × n unit prediction models into one integrated evaluation result.

The program according to claim 11, wherein the first iteration step operates the evaluation step until m exceeds a predetermined number of times, or a maximum correct answer rate of prediction becomes 1 or falls below a predetermined value.

The program according to claim 11, wherein in the second repetition step, the weighted learning data generation step is operated until n exceeds a predetermined number of times.

The program according to claim 11, wherein
the learning data includes explanatory variables and an objective variable,
the weighted learning data further includes a weight indicating the number of times the learning data is extracted by bootstrap sampling,
the evaluation step generates a unit prediction model representing an initial probability, obtained by dividing the sum of the weights of the data whose objective variable takes a target value by the number of data, and a lift value for each discrete region, and
the logarithm of the posterior probability of the unit prediction model is obtained by multiplying the initial probability and the lift value for each discrete region.