JP2014241060A - Tree model learning device, tree model learning method, and tree model learning program

Info

Publication number: JP2014241060A
Application number: JP2013123375A
Authority: JP
Inventors: 浩嗣玉野; Koji Tamano
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2013-06-12
Filing date: 2013-06-12
Publication date: 2014-12-25

Abstract

PROBLEM TO BE SOLVED: To quickly obtain a tree model of a size appropriate in terms of generalization performance, without a test set for pruning.

SOLUTION: The tree model learning device includes: split evaluation value calculation means 51 that calculates an evaluation value for the case where a node is split at each split point in each dimension of an explanatory variable; no-split evaluation value calculation means 53 that calculates an evaluation value for the case where the node is not split; and split determination means 54 that determines, on the basis of the resulting evaluation values, whether to split the node in the dimension having the best evaluation value or not to split it. Each of the split evaluation value calculation means and the no-split evaluation value calculation means calculates the evaluation value by using an evaluation function including at least two types of terms that are in a trade-off relationship with respect to tree growth, one of the terms expressing the fit to the data, and the other being a penalty term that worsens the evaluation value according to the number of data items that fall into a child node.

Description

The present invention relates to a tree model learning device, a tree model learning method, and a tree model learning program for learning a model, such as a regression tree or a decision tree, that represents a partition of a data space as a tree.

There are tree models called regression trees and decision trees. A regression tree is a tree model that performs regression, and a decision tree is a tree model that performs classification. There are also tree models that can be used for clustering data.

These tree models partition the data space so that data items that are similar with respect to a certain index fall into the same leaf.

More specifically, in a tree model, a condition on an explanatory variable is associated with each branch extending from a node of the tree. The tree is traversed from the root node to a leaf node according to these conditions, and the value associated with the leaf node that is reached (a real value for a regression tree, a category value for a decision tree) is output as the predicted value. Such a tree model can be used to make a prediction for an unknown data sample.
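The traversal described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `Node` class, its field names, and the convention that the left branch is taken when the value is less than or equal to the threshold are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Node:
    """One tree node (hypothetical layout); a leaf has no children."""
    prediction: float
    split_dim: Optional[int] = None      # index of the explanatory variable tested
    split_value: Optional[float] = None  # threshold on that variable
    left: Optional["Node"] = None        # taken when x[split_dim] <= split_value
    right: Optional["Node"] = None       # taken when x[split_dim] > split_value


def predict(root: Node, x: Sequence[float]) -> float:
    """Walk from the root to a leaf and return the value stored at the leaf."""
    node = root
    while node.left is not None and node.right is not None:
        node = node.left if x[node.split_dim] <= node.split_value else node.right
    return node.prediction


# A two-leaf regression tree that tests the second explanatory variable.
tree = Node(prediction=100.0, split_dim=1, split_value=60.0,
            left=Node(prediction=80.0), right=Node(prediction=135.0))
```

For a decision tree the same traversal applies, with a category value stored at each leaf instead of a real number.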

Learning of tree models such as regression trees and decision trees has the advantages that it runs fast, because the learning method is simple, and that the resulting model is easy for humans to interpret. On the other hand, its accuracy is generally lower than that of methods such as support vector machines. However, it is known that high accuracy can be achieved in tree model learning by boosting, in which a plurality of trees are learned and combined. Regression trees and decision trees remain among the most commonly used data mining methods.

However, even for a tree model with a simple learning method, learning takes time when the amount of data is large, so speeding up is necessary. With the recent development of distributed and parallel processing infrastructure, learning algorithms have increasingly been accelerated by distributed parallelization (see, for example, Non-Patent Document 1). As a speed-up method on an axis different from distributed parallelism, there is a tree algorithm called the Hoeffding tree (see, for example, Non-Patent Document 2).

The Hoeffding tree is a decision tree learning algorithm designed for data streams, but it is not limited to data streams and is also effective for speeding up batch learning of regression trees and decision trees. The idea behind the speed-up is as follows: rather than splitting a node (that is, growing the tree) only after all of the data has been processed, the node is split as soon as the portion of the data processed so far is statistically sufficient to guarantee accuracy. When the amount of data is large, the tree can be built while processing only the minimum data needed to guarantee accuracy, so the tree model can be learned quickly.
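The "statistically sufficient" test in Hoeffding trees is usually based on the Hoeffding bound, which states that after n independent observations of a quantity with range R, the observed mean is within ε = sqrt(R² ln(1/δ) / (2n)) of the true mean with probability at least 1 − δ. The sketch below illustrates that check; the function names are hypothetical, and it assumes that a larger evaluation value is better.

```python
import math


def hoeffding_epsilon(value_range: float, delta: float, n: int) -> float:
    """Half-width of the one-sided Hoeffding bound for a mean of n bounded samples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def can_split_now(best: float, second_best: float,
                  value_range: float, delta: float, n: int) -> bool:
    """True when the observed gap between the two best candidates exceeds the
    bound, so that reading more data is unlikely to change which one wins."""
    return (best - second_best) > hoeffding_epsilon(value_range, delta, n)
```

With R = 1, δ = 0.05, and n = 1000, ε is roughly 0.039, so a gap of 0.10 between the two best candidates is enough to split immediately, while a gap of 0.01 is not.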

Non-Patent Document 1: Panda Biswanath, Herbach J.S., Basu Sugato, Bayardo R.J., "PLANET: massively parallel learning of tree ensembles with MapReduce", VLDB Endowment, 2009.
Non-Patent Document 2: Domingos Pedro, Hulten Geoff, "Mining high-speed data streams", ACM SIGKDD, 2000.

However, the Hoeffding tree algorithm has the following problem: it cannot determine from the split evaluation function where tree growth should stop. Because the Hoeffding tree algorithm uses information gain or the Gini coefficient as the split evaluation function, the evaluation always concludes that splitting is better than not splitting. If nodes are split on the basis of such an evaluation function, the tree grows too large and overfits, and generalization performance is lost. To avoid this problem, a test data set was needed for pruning the overgrown tree.

Apart from pruning, methods have also been considered that prevent overfitting by setting limits on the depth, the number of leaves, or the number of data items per leaf. However, it is difficult to set appropriate values for these limits in advance without restricting the data to be predicted. If appropriate values are not set, either the tree does not grow enough and an inaccurate model is generated, or the tree grows too much and generalization performance is lost.

In view of the above problems, an object of the present invention is to provide a tree model learning device, a tree model learning method, and a tree model learning program that can quickly obtain a tree model of a size appropriate in terms of generalization performance, without a test set for pruning.

A tree model learning device according to the present invention includes: split evaluation value calculation means for calculating, using a specified dimension of an explanatory variable, an evaluation value for the case where a specified node is split at each split point of that dimension; split point search means for searching, for each dimension of the explanatory variable, for the best split point at which to split the specified node, on the basis of the evaluation values obtained by the split evaluation value calculation means; no-split evaluation value calculation means for calculating an evaluation value for the case where the node is not split; and split determination means for determining, on the basis of the evaluation value at the best split point for each dimension of the explanatory variable and the evaluation value for the case of no split, whether to split the node in the dimension with the best evaluation value or not to split the node. The split evaluation value calculation means calculates the evaluation value using an evaluation function that includes at least two types of terms in a trade-off relationship with respect to tree growth: a term that expresses the fit to the data as the goodness of the evaluation value, and penalty terms that, for the branch node and the leaf nodes, worsen the evaluation value according to the number of data items passing through the node or falling into it, respectively. The no-split evaluation value calculation means calculates the evaluation value using an evaluation function that includes at least two types of terms in a trade-off relationship with respect to tree growth: a term that expresses the fit to the data as the goodness of the evaluation value, and a penalty term that, for the leaf node, worsens the evaluation value according to the number of data items falling into it.

A tree model learning method according to the present invention includes, for each node in order from the root node: calculating, for each dimension of an explanatory variable, an evaluation value for the case where the node is split at each split point of that dimension, using an evaluation function that includes at least two types of terms in a trade-off relationship with respect to tree growth, namely a term that expresses the fit to the data as the goodness of the evaluation value, and penalty terms that, for the branch node and the leaf nodes, worsen the evaluation value according to the number of data items passing through the node or falling into it, respectively; searching, for each dimension of the explanatory variable, for the best split point at which to split the node, on the basis of the obtained evaluation values; calculating an evaluation value for the case where the node is not split, using an evaluation function that includes at least two types of terms in a trade-off relationship with respect to tree growth, namely a term that expresses the fit to the data as the goodness of the evaluation value, and a penalty term that, for the leaf node, worsens the evaluation value according to the number of data items falling into it; and determining, on the basis of the evaluation value at the best split point for each dimension of the explanatory variable and the evaluation value for the case of no split, whether to split the node in the dimension with the best evaluation value or not to split the node.

A tree model learning program according to the present invention causes a computer to execute: split evaluation value calculation processing for calculating, using a specified dimension of an explanatory variable, an evaluation value for the case where a specified node is split at each split point of that dimension; split point search processing for searching, for each dimension of the explanatory variable, for the best split point at which to split the specified node, on the basis of the evaluation values obtained by the split evaluation value calculation processing; no-split evaluation value calculation processing for calculating an evaluation value for the case where the node is not split; and split determination processing for determining, on the basis of the evaluation value at the best split point for each dimension of the explanatory variable and the evaluation value for the case of no split, whether to split the node in the dimension with the best evaluation value or not to split the node. In the split evaluation value calculation processing, the program causes the computer to calculate the evaluation value using an evaluation function that includes at least two types of terms in a trade-off relationship with respect to tree growth: a term that expresses the fit to the data as the goodness of the evaluation value, and penalty terms that, for the branch node and the leaf nodes, worsen the evaluation value according to the number of data items passing through the node or falling into it, respectively. In the no-split evaluation value calculation processing, the program causes the computer to calculate the evaluation value using an evaluation function that includes at least two types of terms in a trade-off relationship with respect to tree growth: a term that expresses the fit to the data as the goodness of the evaluation value, and a penalty term that, for the leaf node, worsens the evaluation value according to the number of data items falling into it.

According to the present invention, a tree model of a size appropriate in terms of generalization performance can be obtained quickly, without a test set for pruning.

FIG. 1 is a block diagram showing a configuration example of the tree model learning system of the first embodiment.
FIG. 2 is a block diagram showing a hardware configuration example of the tree model learning device 10.
FIG. 3 is an explanatory diagram showing an example of learning data stored in the data storage unit 101.
FIG. 4 is an explanatory diagram schematically showing an example of a tree model stored in the tree model storage unit 102.
FIG. 5 is an explanatory diagram showing an example of bins obtained as a result of calculating binning points.
FIG. 6 is a flowchart showing an example of the operation of the tree model learning system of the first embodiment.
FIG. 7 is a block diagram showing a configuration example of the tree model learning system of the second embodiment.
FIG. 8 is a flowchart showing an example of the operation of the tree model learning system of the second embodiment.
FIG. 9 is a block diagram showing a minimum configuration example of the tree model learning device of the present invention.
FIG. 10 is a block diagram showing another configuration example of the tree model learning device of the present invention.

Embodiments of the present invention will be described below with reference to the drawings. The following description uses the case where the tree model is a regression tree as an example. The technical concepts of the split evaluation method and the confidence interval calculation method used in the tree model learning method of the present invention can be applied in common to the learning of other tree models. For example, decision tree learning can be reduced to processing similar to regression tree learning by the technique disclosed in Bouchard Guillaume, "Efficient Bounds for the softmax Function, Applications to Inference in Hybrid Models", NIPS 2007, and the like. Tree models other than regression trees can likewise be reduced to processing similar to regression tree learning, provided that each branching corresponds to a partition of the data space.

Embodiment 1.
FIG. 1 is a block diagram showing a configuration example of the tree model learning system according to the first embodiment of the present invention. As shown in FIG. 1, the tree model learning system of this embodiment includes a tree model learning device 10. The tree model learning device 10 includes a data storage unit 101, a tree model storage unit 102, split point calculation means 103, no-split calculation means 104, binning point calculation means 105, split determination means 106, and tree node generation means 107. Although FIG. 1 shows an example in which a single tree model learning device 10 includes these storage units and processing means, they may be implemented separately across a plurality of devices.

FIG. 1 also shows, together with the tree model learning device 10, prediction means 20 that performs prediction using the learned tree model. The tree model learning device 10 may include the prediction means 20.

FIG. 2 is a block diagram showing a hardware configuration example of the tree model learning device 10. As shown in FIG. 2, the tree model learning device 10 may include a calculation unit 11, a storage unit 12, and an input/output unit 13. Such a device is realized by, for example, an information processing device such as a personal computer that operates according to a program. In this case, the calculation unit 11, the storage unit 12, and the input/output unit 13 are realized by a CPU, a memory, and various input/output devices (for example, a keyboard, a mouse, and a network interface unit), respectively.

In this embodiment, the data storage unit 101 and the tree model storage unit 102 are realized by a storage device (for example, the storage unit 12 shown in FIG. 2). The split point calculation means 103, the no-split calculation means 104, the binning point calculation means 105, the split determination means 106, and the tree node generation means 107 are realized by, for example, a CPU that operates according to a program (for example, the calculation unit 11 shown in FIG. 2).

The data storage unit 101 stores the data used for learning the tree model (hereinafter referred to as learning data). FIG. 3 is an explanatory diagram showing an example of learning data stored in the data storage unit 101.

In the example shown in FIG. 3, the learning data is matrix data in which each row represents one learning data item and each column represents an attribute value of the learning data. FIG. 3 shows an example of learning data for regression, in which an objective variable is predicted from a three-dimensional explanatory variable.

The tree model storage unit 102 stores the learned tree model. FIG. 4 is an explanatory diagram schematically showing an example of a tree model stored in the tree model storage unit 102. The tree model storage unit 102 stores information representing the tree structure of the generated tree model (for example, how many nodes exist and how they are connected to other nodes) and, in association with each node, a predicted value, a list of IDs of the learning data that falls into the node, and, if child nodes exist, the conditions on the explanatory variable that correspond to the edges to the child nodes. FIG. 4 shows an example in which, for the node identified by ID = i, "135" is stored as the predicted value and "60.0 < X_2" is stored as the condition on the explanatory variable corresponding to the edge to a child node. The condition corresponding to an edge to a child node may thus be expressed as a split dimension (the second dimension in the example of Node i) together with a split value. The data storage format of the tree model is not limited to this; any format may be used as long as the learned tree model can be specifically identified from it.

The binning point calculation means 105 calculates, for each dimension of the data, binning points, which are the boundary points that divide the data into bins in that dimension. For example, for each dimension of the explanatory variables of learning data such as that shown in FIG. 3, the binning point calculation means 105 may obtain the boundary points between bins from among the values that the dimension can take. Methods of calculating binning points include, for example, dividing the range into bins of equal width, and choosing the binning points so that each bin contains approximately the same number of data items. FIG. 5 is an explanatory diagram showing an example of bins obtained as a result of calculating binning points for a certain dimension. FIG. 5 shows an example of n bins (Bin_1 to Bin_n) obtained by dividing the range of the first dimension of the explanatory variable into equal widths.
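The two binning strategies mentioned above can be sketched as follows. The function names are hypothetical, and the equal-frequency version simply picks order statistics from the sorted values, which is one of several reasonable ways to get bins of roughly equal count.

```python
from typing import List, Sequence


def equal_width_points(values: Sequence[float], n_bins: int) -> List[float]:
    """Return n_bins - 1 cut points that divide [min, max] into equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + width * k for k in range(1, n_bins)]


def equal_frequency_points(values: Sequence[float], n_bins: int) -> List[float]:
    """Return n_bins - 1 cut points so that each bin holds roughly the same
    number of values (a value equal to a cut point belongs to the upper bin)."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(n * k) // n_bins] for k in range(1, n_bins)]
```

Equal-width binning needs only the minimum and maximum of a dimension, whereas equal-frequency binning needs the sorted values (or an approximation of their quantiles) but adapts better to skewed data.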

The split point calculation means 103 calculates, for each dimension of the explanatory variable, the best split point (the split value that serves as the boundary condition) for splitting the node that is the target of the split/no-split determination (hereinafter referred to as the target node). It includes binning means 1031, split evaluation value calculation means 1032, split evaluation value confidence interval calculation means 1033, and split point search means 1034.

For a specified dimension, the binning means 1031 assigns the learning data that falls into the target node to bins, using the binning points obtained by the binning point calculation means 105 as boundaries, and calculates and stores statistics for each bin. The statistics calculated are those required by the formula for the evaluation value. For example, the binning means 1031 may calculate and store, for each bin, three statistics over the learning data in that bin: Σy^2, Σy, and Σ1. In this example, the binning means 1031 also calculates and stores the statistics accumulated over all bins, that is, the values of Σy^2, Σy, and Σ1 summed over all bins.
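As an illustration of this step, the sketch below assigns each sample to a bin and accumulates the three statistics Σy^2, Σy, and Σ1 per bin, then sums them over all bins. The function names and the tuple layout are assumptions made for the example; the cut points are assumed to be sorted in ascending order.

```python
from bisect import bisect_right
from typing import List, Sequence, Tuple

Stats = Tuple[float, float, int]  # (sum of y^2, sum of y, count)


def bin_statistics(xs: Sequence[float], ys: Sequence[float],
                   cut_points: Sequence[float]) -> List[Stats]:
    """Accumulate (sum y^2, sum y, count) for each bin of one dimension.
    A value equal to a cut point is placed in the upper bin."""
    acc = [[0.0, 0.0, 0] for _ in range(len(cut_points) + 1)]
    for x, y in zip(xs, ys):
        b = bisect_right(cut_points, x)  # index of the bin that x falls into
        acc[b][0] += y * y
        acc[b][1] += y
        acc[b][2] += 1
    return [(s[0], s[1], s[2]) for s in acc]


def total_statistics(stats: Sequence[Stats]) -> Stats:
    """Statistics accumulated over all bins, i.e. for the whole target node."""
    return (sum(s[0] for s in stats), sum(s[1] for s in stats),
            sum(s[2] for s in stats))
```

Because these statistics are additive, they can be updated incrementally as more learning data is read, which suits the partial-data processing described for the Hoeffding tree approach.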

For a specified dimension, the split evaluation value calculation means 1032 uses a predetermined evaluation function to calculate the evaluation value obtained when the target node is split at each candidate split point. Hereinafter, an evaluation value obtained by the split evaluation value calculation means 1032 may be referred to as a "split evaluation value". Rather than treating every possible split position as a candidate, the split evaluation value calculation means 1032 may calculate split evaluation values only at the boundaries between adjacent bins, as delimited by the binning points calculated by the binning point calculation means 105.
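Restricting candidates to bin boundaries means the quantities needed at every candidate can be derived from the per-bin statistics with a single running sum. The sketch below, with hypothetical names, returns the data count and the variance of the objective variable on each side of every boundary; how these feed into the evaluation function is left open here.

```python
from typing import List, Sequence, Tuple

Stats = Tuple[float, float, int]  # (sum of y^2, sum of y, count) for one bin


def variance(sum_y2: float, sum_y: float, n: int) -> float:
    """Population variance recovered from the three stored statistics."""
    mean = sum_y / n
    return max(0.0, sum_y2 / n - mean * mean)


def candidate_splits(bins: Sequence[Stats]) -> List[Tuple[int, int, float, int, float]]:
    """For each boundary k (between bin k and bin k + 1), return
    (k, n_left, var_left, n_right, var_right).
    Boundaries that leave one side empty are skipped."""
    total_y2 = sum(b[0] for b in bins)
    total_y = sum(b[1] for b in bins)
    total_n = sum(b[2] for b in bins)
    out = []
    left_y2 = left_y = 0.0
    left_n = 0
    for k in range(len(bins) - 1):
        left_y2 += bins[k][0]
        left_y += bins[k][1]
        left_n += bins[k][2]
        right_n = total_n - left_n
        if left_n == 0 or right_n == 0:
            continue
        out.append((k, left_n, variance(left_y2, left_y, left_n), right_n,
                    variance(total_y2 - left_y2, total_y - left_y, right_n)))
    return out
```

With n bins there are at most n − 1 candidates per dimension, so the cost of evaluating a node depends on the number of bins rather than on the number of distinct data values.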

The split evaluation value calculation means 1032 may calculate the split evaluation value using, for example, the evaluation function expressed by formula (1).

In addition to the terms representing the fit to the data, the evaluation function of formula (1) includes penalty terms associated with the complexity of the tree model. Specifically, a penalty term is a term that worsens the evaluation value according to the number of data items that fall into a child node. In formula (1), the second term is the penalty term for the left leaf node, the fourth term is the penalty term for the right leaf node, and the sixth term is the penalty term for the branch node. Each of these terms applies a penalty according to the number of data items that fall into the corresponding node (or, for the branch node, that pass through it).

In formula (1), N represents the total number of data items that fall into the target node. r_L and r_R represent the proportions of data that fall into the left and right leaf nodes of the target node, respectively. That is, r_L and r_R are random variables given by r_L = n_L / n and r_R = 1 - r_L, where n is the number of data items observed at the target node for a certain dimension and n_L and n_R are the numbers of those items that fall into the left and right leaf nodes, respectively. The squares of the hatted σ_L and σ_R represent the variances of the objective variable of the data that falls into the left and right leaf nodes, respectively. Formula (1) is one example of an evaluation function for obtaining the split evaluation value, and the evaluation function is not limited to it.

The evaluation function used in the Hoeffding tree algorithm contains only terms representing the fit to the data. The evaluation value is therefore calculated from the degree of fit alone, with no consideration of how complex the tree becomes after the split. With such an evaluation value, a comparison with the no-split evaluation value calculated by the no-split calculation means 104, described later, would always conclude that splitting is better.

In contrast, the evaluation function used by the split evaluation value calculation means 1032 of this embodiment includes two types of terms that are in a trade-off relationship with respect to tree growth. That is, it includes at least a term whose value increases as the tree grows and a term whose value decreases as the tree grows. In the example evaluation function above, these are the terms that express how well the leaves fit the data as the goodness of the evaluation value (the right and left leaves in the case of a binary tree) and the penalty terms that worsen the evaluation value according to the complexity of the tree (for the branch node and the leaf nodes). By including terms in such a trade-off relationship, the split evaluation value reflects a weighing of the benefit of a better fit to the data obtained by splitting at each point against the penalty for making the tree more complex.

With such an evaluation function, it is possible, for example, to permit a split at a given point when the benefit of the improved fit to the data exceeds the penalty for the added complexity of the tree, and to disallow the split at that point when the benefit is smaller than the penalty.
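The trade-off can be made concrete with a small sketch. Formula (1) is not reproduced in this text, so the score below is a hypothetical stand-in, not the patent's evaluation function: it assumes a Gaussian log-likelihood fit term, -(n/2) log(variance), for each leaf, and a penalty of (1/2) log(n) for each leaf and for the branch node, where n is the number of data items falling into or passing through the node. A larger score is treated as better.

```python
import math


def leaf_score(n: int, var: float) -> float:
    """Hypothetical leaf score: fit term minus a penalty that grows with n."""
    fit = -0.5 * n * math.log(max(var, 1e-12))  # better fit (smaller variance) -> larger
    penalty = 0.5 * math.log(n)                 # assumed penalty form
    return fit - penalty


def split_score(n_left: int, var_left: float,
                n_right: int, var_right: float) -> float:
    """Two leaves plus an assumed penalty for the branch node they hang from."""
    branch_penalty = 0.5 * math.log(n_left + n_right)
    return leaf_score(n_left, var_left) + leaf_score(n_right, var_right) - branch_penalty


def should_split(n_left: int, var_left: float, n_right: int, var_right: float,
                 var_parent: float) -> bool:
    """Split only if the penalized split score beats the penalized no-split score."""
    return split_score(n_left, var_left, n_right, var_right) > leaf_score(
        n_left + n_right, var_parent)
```

Under these assumptions, a split of 100 items into two halves whose variance drops from 1.0 to 0.25 is accepted, while a split that leaves the variance at 1.0 in both halves is rejected, because the fit does not improve and only the penalties increase.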

The split evaluation value confidence interval calculation means 1033 obtains a confidence interval for the split evaluation value calculated by the split evaluation value calculation means 1032. When the speed-up technique of the Hoeffding tree algorithm is applied to tree model learning, the split evaluation value is calculated from only part of the learning data, so an exact value is not obtained. In that case, the split evaluation value confidence interval calculation means 1033 obtains the confidence interval of the calculated split evaluation value so that the determination can take into account the error contained in that value. The confidence interval is the interval that contains the exact evaluation value with a high probability 1 − δ when only n data items have been read. Here, δ denotes the error probability of the confidence interval.

When the evaluation function for the split evaluation value is divided into a data-fitting term and a penalty term for tree complexity, the penalty term can be expected to involve a logarithm. In that case, a confidence interval cannot be obtained by treating the whole evaluation function as a single random variable. The split evaluation value confidence interval calculation means 1033 therefore obtains confidence intervals only for the parts of the evaluation function that are random variables (in the example above, the proportions of data falling into the left and right leaves and the squared errors of the data in the left and right leaves), and uses those intervals to calculate the confidence interval of the whole evaluation function.

For example, the split evaluation value confidence interval calculation means 1033 may calculate the confidence interval of the split evaluation value by maximizing and minimizing equation (1) above using equations (2) to (7) shown below. Equations (2) to (7) are an example of conditional expressions for deriving the confidence interval of the split evaluation value under the assumption that the squared error of the data falling into a leaf follows a chi-square distribution; the conditions are not limited to these.

The split point search means 1034 searches for the best split point in each dimension of the explanatory variable. For each dimension, it has the split evaluation value calculation means 1032 calculate the split evaluation value at every candidate split point of the target node and, based on the results, finds the best split point for that dimension. For example, the split point search means 1034 may take the candidate point with the best split evaluation value in a dimension as the best split point for that dimension.

In this way, the split point calculation means 103 computes, for each dimension of the explanatory variable, the split condition, the split evaluation value, and its confidence interval as the information on the best split point for splitting the target node.

The no-split calculation means 104 calculates the evaluation value, and its confidence interval, for the case where the target node is not split. It includes the no-split evaluation value calculation means 1041 and the no-split evaluation value confidence interval calculation means 1042.

The no-split evaluation value calculation means 1041 uses a predetermined evaluation function to calculate the evaluation value for the case where the target node is not split. Hereinafter, the evaluation value calculated by the no-split evaluation value calculation means 1041 may be referred to as the "no-split evaluation value". The no-split evaluation value calculation means 1041 may calculate the no-split evaluation value using, for example, the evaluation function expressed by equation (8) below.

In equation (8), N denotes the total number of data items falling into the target node, and the square of σ-hat (σ with a hat) denotes the variance of the objective variable of the data that has fallen into the target node. The learning data is the same as that used when the split evaluation value was obtained. Specifically, the statistics calculated by the binning means 1031 may be used.

The evaluation function used by the no-split evaluation value calculation means 1041 only needs to include two types of terms that are in a trade-off relationship with respect to tree growth at the target node. The example evaluation function above includes a term that expresses the fit at the leaf as a better evaluation value and a penalty term that worsens the evaluation value according to the complexity of the tree, and these are in a trade-off relationship. Specifically, the second term of equation (8) is the penalty term for the leaf node, and it applies a penalty according to the number of data items falling into the leaf node. Equation (8) is one example of an evaluation function for the no-split evaluation value, and the function is not limited to it.

The no-split evaluation value confidence interval calculation means 1042 obtains a confidence interval for the no-split evaluation value calculated by the no-split evaluation value calculation means 1041. For example, it may calculate the confidence interval of the no-split evaluation value by maximizing and minimizing equation (8) above using equation (9) shown below.

Equation (9) is an example of a conditional expression for deriving the confidence interval of the no-split evaluation value under the assumption that the squared error of the data falling into a leaf follows a chi-square distribution; the condition is not limited to this.
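Equation (9) itself is not reproduced in this text. For orientation only, the following is the textbook chi-square interval for a variance, which rests on the same kind of distributional assumption. It assumes that the n residuals in the node are independent and normally distributed with variance σ², that s² is their sample variance, and that χ²_{p, n−1} denotes the p-quantile of the chi-square distribution with n − 1 degrees of freedom. These symbols are illustrative assumptions and may differ from the notation of equation (9).

```latex
% Standard result, not the patent's equation (9):
% if (n-1) s^2 / \sigma^2 \sim \chi^2_{n-1}, then with probability 1 - \delta
\frac{(n-1)\, s^{2}}{\chi^{2}_{1-\delta/2,\; n-1}}
\;\le\; \sigma^{2} \;\le\;
\frac{(n-1)\, s^{2}}{\chi^{2}_{\delta/2,\; n-1}}
```

Substituting the two ends of such an interval into an evaluation function that depends on the variance is one way to bound the evaluation value from above and below.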

In this way, the no-split calculation means 104 computes the no-split evaluation value, which is the evaluation value for the case where the target node is not split, and its confidence interval.

The split determination means 106 determines whether or not to split based on the evaluation values and confidence intervals obtained by the split point calculation means 103 and the no-split calculation means 104. For example, among the split evaluation values at the best split point of each dimension and the no-split evaluation value for the target node, if the best evaluation value is a split evaluation value, the split determination means 106 may decide to split at the split point of the dimension that produced it; if the best evaluation value is the no-split evaluation value, it may decide not to split the target node.

The split determination means 106 may also take the confidence interval of each evaluation value into account and choose among three options. That is, for the target node it may decide to split, not to split, or to add learning data. When it decides to add learning data, the decision on whether to split the target node is deferred. For example, the split determination means 106 extracts the first- and second-ranked evaluation values from among the split evaluation values at the best split point of each dimension and the no-split evaluation value for the target node. If the confidence intervals of the first- and second-ranked evaluation values overlap, the number of data items processed so far is regarded as insufficient for deciding whether to split, and further learning data may be read and the evaluation values recalculated. If the confidence intervals do not overlap, whether to split is determined, as in the case described above, by whether the first-ranked evaluation value is a split evaluation value or the no-split evaluation value. Here, the confidence intervals overlap when the lower limit of the confidence interval of the first-ranked evaluation value is smaller than the upper limit of the confidence interval of the second-ranked evaluation value.

When the split determination means 106 has decided that the target node is to be split, the tree node generation means 107 splits the target node at the designated split dimension and split value. Specifically, it adds child nodes to the target node, making the target node a parent node, associates the predicted value, split dimension, and split value with the parent node, and associates with each child node the IDs of the data falling into that child node. It then stores this information in the tree model storage unit 102.

When an unknown data sample is input, the prediction means 20 performs prediction using the tree model stored in the tree model storage unit 102. For example, based on the value of each dimension of the explanatory variable obtained from the input data sample, the prediction means 20 may traverse the tree model from the root and return the predicted value associated with the leaf node it finally reaches.
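The traversal described above can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the `Node` fields, the `predict` function, and the convention that a sample goes left when its value is below the split value are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Node:
    """Hypothetical tree node; field names are illustrative only."""
    prediction: float                 # predicted value stored at the node
    split_dim: Optional[int] = None   # None marks a leaf
    split_value: float = 0.0
    left: Optional["Node"] = None     # assumed: taken when x[split_dim] < split_value
    right: Optional["Node"] = None    # assumed: taken otherwise


def predict(root: Node, x: Sequence[float]) -> float:
    """Follow the split conditions from the root down to a leaf."""
    node = root
    while node.split_dim is not None:
        if x[node.split_dim] < node.split_value:
            node = node.left
        else:
            node = node.right
    return node.prediction


# A one-split tree on dimension 0 at value 5.0.
tree = Node(0.0, split_dim=0, split_value=5.0, left=Node(1.0), right=Node(2.0))
```

With this tree, a sample whose first dimension is 3.0 reaches the left leaf and a sample at exactly 5.0 reaches the right leaf.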

Although not shown in the drawings, the tree model learning device 10 may include learning instruction means that performs overall control of tree model learning, such as reading the learning data, activating each processing means, and giving the necessary work instructions. The learning instruction means is realized, for example, by a CPU that operates according to a program (for example, the calculation unit 11 shown in FIG. 2).

Next, the operation of this embodiment will be described.

FIG. 6 is a flowchart showing an example of the operation of the tree model learning system of this embodiment. In the example shown in FIG. 6, the tree model learning device 10 (more specifically, the learning instruction means) first reads the learning data from the data storage unit 101 and activates the binning point calculation means 105. The activated binning point calculation means 105 calculates binning points for each dimension of the explanatory variable based on the learning data that has been read (step S101). The following description uses binning into 100 equal parts as an example. For instance, when a dimension of the explanatory variable takes values from 0 to 1000, a binning point is placed at every increment of 10.
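The equal-width binning in this example can be sketched as below. The function name `binning_points` and the choice to return both end points of the range are assumptions made for illustration; the source states only that the range is divided into 100 equal parts.

```python
from typing import List


def binning_points(lo: float, hi: float, n_bins: int = 100) -> List[float]:
    """Return n_bins + 1 equally spaced bin edges covering [lo, hi]."""
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]


# The example from the text: values from 0 to 1000, 100 equal parts.
points = binning_points(0.0, 1000.0)
```

Here `points` starts 0, 10, 20 and ends at 1000, which matches the increment of 10 mentioned in the text.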

Next, the tree model learning device 10 adds the root node to a node queue (hereinafter abbreviated as NQ) (step S102). NQ is a FIFO buffer from which nodes are taken out in the order in which they were added. It is assumed that the tree model storage unit 102 stores, as information on the root node, information designating all the learning data as the IDs of the data falling into that node.

Next, the tree model learning device 10 checks whether NQ is empty (step S103) and, if it is not empty, takes one node out of NQ (No in step S103, then step S104). The root node is taken out first. The node taken out in step S104 is then treated as the target node for splitting, and the processing of steps S105 to S115 is performed on it.

If NQ is empty, tree model learning ends (Yes in step S103).

In step S105, the tree model learning device 10 refers to the set of IDs of the data falling into the target node taken out of NQ and reads n items of learning data that fall into the target node.

The tree model learning device 10 then initializes a variable D that indicates the dimension of the explanatory variable to be processed (step S106). Here, D = 1 is set to indicate the first dimension.

When the reading of the learning data and the initialization of the variable D are complete for the target node, the tree model learning device 10 activates the split point calculation means 103 and has it start the processing for finding the best split point at the target node in the dimension indicated by the variable D.

When the split point calculation means 103 is activated, the binning means 1031 first performs binning for the dimension indicated by the variable D (step S107). Based on the binning points calculated by the binning point calculation means 105, the binning means 1031 processes the n items of learning data read in step S105 and calculates the statistics of each bin. For example, if a bin covers the range [10, 20), the binning means 1031 calculates, over the data whose D-th explanatory variable lies in this range, the sum of squares of the objective variable, the sum of the objective variable, and the number of data items, and saves them as the statistical information of the bin. When this step is executed again after learning data has been added for the same node and the same dimension as a result of the determination described later, the new statistics are accumulated onto the saved bin statistics.
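The per-bin statistics and their accumulation over successive batches can be sketched as follows. The names `BinStats` and `accumulate` are hypothetical, and clamping out-of-range values into the first or last bin is an assumption of this sketch, not something the source specifies.

```python
import bisect
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class BinStats:
    """Running statistics of the objective variable y for one bin."""
    count: int = 0
    sum_y: float = 0.0
    sum_y2: float = 0.0


def accumulate(stats: List[BinStats], edges: Sequence[float],
               xs: Sequence[float], ys: Sequence[float]) -> None:
    """Add one batch to the saved statistics; bin i covers [edges[i], edges[i+1])."""
    last = len(edges) - 2
    for x, y in zip(xs, ys):
        i = bisect.bisect_right(edges, x) - 1
        i = min(max(i, 0), last)   # assumption: clamp values outside the range
        stats[i].count += 1
        stats[i].sum_y += y
        stats[i].sum_y2 += y * y


edges = [0.0, 10.0, 20.0, 30.0]
stats = [BinStats() for _ in range(len(edges) - 1)]
accumulate(stats, edges, [10.0, 15.0, 20.0], [1.0, 2.0, 3.0])  # first batch
accumulate(stats, edges, [12.0], [2.0])                         # added batch
```

Because only counts and sums are stored, a later batch is folded in without revisiting earlier data, which is what makes the repeated reads in step S105 cheap.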

Next, the split point search means 1034 searches for the best split point in the dimension indicated by the variable D (step S108). In step S108, the split point search means 1034 first activates the split evaluation value calculation means 1032 and has it calculate the split evaluation value of each candidate split point in the dimension indicated by the variable D. Based on the bin statistics created in step S107, the split evaluation value calculation means 1032 calculates, using equation (1) for example, the split evaluation value for splitting at each boundary between adjacent bins, these boundaries being the candidate split points of the designated dimension.
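The scan over bin boundaries can be sketched as follows. Equation (1) is not reproduced in this text, so `toy_score` below is a hypothetical stand-in that merely has the same shape (a fit term minus a penalty that depends on the number of data items in each child); it is not the patent's evaluation function. The names `best_split`, `sse`, and the penalty weight `lam` are likewise assumptions.

```python
import math
from typing import Callable, List, Optional, Tuple

Stats = Tuple[int, float, float]  # (count, sum of y, sum of y squared) for one bin


def sse(n: int, s: float, s2: float) -> float:
    """Squared error around the mean, from count, sum, and sum of squares."""
    return s2 - s * s / n


def toy_score(n_l: int, sse_l: float, n_r: int, sse_r: float, lam: float = 1.0) -> float:
    """Hypothetical stand-in for equation (1): fit term minus size penalty."""
    return -(sse_l + sse_r) - lam * (math.log(n_l) + math.log(n_r))


def best_split(bins: List[Stats],
               score: Callable[..., float] = toy_score) -> Optional[Tuple[int, float]]:
    """Return (boundary index, score) of the best boundary between adjacent bins."""
    total = tuple(sum(col) for col in zip(*bins))
    left = (0, 0.0, 0.0)
    best = None
    for i, b in enumerate(bins[:-1]):
        left = tuple(l + v for l, v in zip(left, b))       # bins 0..i
        right = tuple(t - l for t, l in zip(total, left))  # bins i+1..end
        if left[0] == 0 or right[0] == 0:
            continue  # a split that leaves one child empty is not a candidate
        value = score(left[0], sse(*left), right[0], sse(*right))
        if best is None or value > best[1]:
            best = (i + 1, value)
    return best


# Two bins holding y = 1, 1 each and one bin holding y = 10, 10.
bins = [(2, 2.0, 2.0), (2, 2.0, 2.0), (2, 20.0, 200.0)]
result = best_split(bins)
```

In this example the boundary before the last bin separates the two groups perfectly, so it is the one returned. Passing a different `score` callable is how a real evaluation function would be plugged in.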

Next, the split evaluation value confidence interval calculation means 1033 calculates the confidence interval of the split evaluation value obtained in step S108 (step S109), using equations (2) to (7) for example.

Next, the split point search means 1034 determines whether D is smaller than the number of dimensions. If so, it increments D and returns to step S107 (Yes in step S110, then step S111). Otherwise, the processing is regarded as complete for all dimensions and the flow proceeds to step S112 (No in step S110).

Through the above processing, the split point that maximizes the evaluation function, its split evaluation value, and its confidence interval are obtained for every dimension of the explanatory variable.

In step S112, the tree model learning device 10 activates the no-split calculation means 104 and has it calculate the evaluation value for the case where the target node is not split (the no-split evaluation value) and its confidence interval. In step S112, the no-split evaluation value calculation means 1041 first calculates the no-split evaluation value, using equation (8) for example.

Next, the no-split evaluation value confidence interval calculation means 1042 calculates the confidence interval of the no-split evaluation value obtained in step S112 (step S113), using equations (8) and (9) for example.

Once the split evaluation value and confidence interval at the split point of each dimension, and the no-split evaluation value and its confidence interval, have been obtained for the target node in this way, the split determination means 106 determines whether to split the target node, not to split it, or to add learning data (step S114).

In step S114, the split determination means 106 finds, for example, the first- and second-ranked evaluation values among these evaluation values and makes the determination as follows.

In the following example, case 1 is the case where the first-ranked evaluation value is the split evaluation value of some dimension and the upper limit of the confidence interval of the second-ranked evaluation value is smaller than the lower limit of the confidence interval of the first-ranked evaluation value. Case 2 is the case where the first-ranked evaluation value is the no-split evaluation value and the upper limit of the confidence interval of the second-ranked evaluation value is smaller than the lower limit of the confidence interval of the first-ranked evaluation value. Case 3 is the case where, whichever evaluation value is ranked first, the upper limit of the confidence interval of the second-ranked evaluation value is larger than the lower limit of the confidence interval of the first-ranked evaluation value. The split determination means 106 determines which of the three cases the first- and second-ranked evaluation values fall under. In case 1, it decides to split at the split candidate point that produced the first-ranked evaluation value. In case 2, it decides not to split. In case 3, it decides to add learning data, because the number of data items processed so far is insufficient for deciding whether to split.
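The three cases can be sketched as a small decision function. The `Candidate` type, the string return values, and the use of `split_dim=None` to mark the no-split candidate are assumptions made for this example. The text does not say what happens when the two limits are exactly equal; this sketch treats equality as an overlap (case 3), which is also an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    value: float              # evaluation value (larger is better)
    lower: float              # lower limit of its confidence interval
    upper: float              # upper limit of its confidence interval
    split_dim: Optional[int]  # dimension of the split, or None for "no split"


def decide(candidates: List[Candidate]) -> str:
    """Return 'split', 'no_split', or 'add_data' from the top two candidates."""
    ranked = sorted(candidates, key=lambda c: c.value, reverse=True)
    first, second = ranked[0], ranked[1]
    if second.upper >= first.lower:
        return "add_data"      # case 3: intervals overlap, read more data
    if first.split_dim is None:
        return "no_split"      # case 2: not splitting is clearly best
    return "split"             # case 1: the best split is clearly best


case1 = decide([Candidate(5.0, 4.5, 5.5, 0), Candidate(3.0, 2.5, 3.5, None)])
case2 = decide([Candidate(5.0, 4.5, 5.5, None), Candidate(3.0, 2.5, 3.5, 1)])
case3 = decide([Candidate(5.0, 3.0, 7.0, 0), Candidate(4.0, 2.0, 6.0, None)])
```

The candidate list always holds at least two entries in practice, because it contains one entry per dimension plus the no-split entry.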

If the determination in step S114 is to split, the tree node generation means 107 splits the target node at the split point of the designated dimension (step S115). Specifically, the tree node generation means 107 adds to the tree model storage unit 102 the information on new child nodes whose parent is the target node, associates with the target node (the parent) the explanatory-variable condition corresponding to the edge to each added child node, and associates with each added child node its predicted value and the list of IDs of the learning data falling into it. When it adds nodes, the tree node generation means 107 also adds the new child nodes to NQ, so that they become target nodes for subsequent split determinations.

If the determination in step S114 is not to split, the tree model learning device 10 ends the processing for the target node without splitting it and returns to step S103 to determine whether the next node should be split. In step S103, any child nodes added to NQ in step S115 as a result of an earlier split determination are taken out in order and become the target node. In this way the tree grows as the split determination is repeated, with the target node changing each time, until NQ is empty.

If the determination in step S114 is to add learning data, the flow returns to step S105 with the target node unchanged. When the Hoeffding tree algorithm is adopted to speed up tree model learning, the evaluation values obtained contain error. In the determination above, therefore, the decision on whether to split is made from the first-ranked evaluation value only if it remains higher than the second-ranked evaluation value even when the error is taken into account; otherwise the decision is to add learning data. On returning to step S105, n more items of learning data are added, and the evaluation values are recalculated in steps S108 and S112. This addition of learning data is repeated until, for example, the accuracy is sufficient for the determination in step S114 to decide whether the target node should be split.
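The control flow of steps S102 to S115 can be sketched as the loop below. The callbacks `evaluate`, `split_node`, and `node_size` are hypothetical placeholders for the processing described above. The source does not say what happens if a node's data runs out while the decision is still "add data"; this sketch then leaves the node unsplit, which is an assumption.

```python
from collections import deque
from typing import Callable, Hashable, List, Sequence


def grow(root: Hashable,
         evaluate: Callable[[Hashable, int], str],
         split_node: Callable[[Hashable], Sequence[Hashable]],
         node_size: Callable[[Hashable], int],
         batch_size: int) -> List[Hashable]:
    """Breadth-first tree growth; returns the nodes in the order they were examined."""
    queue = deque([root])                      # S102: NQ starts with the root
    visited = []
    while queue:                               # S103: stop when NQ is empty
        node = queue.popleft()                 # S104: FIFO order
        visited.append(node)
        n_read, decision = 0, "add_data"
        while decision == "add_data" and n_read < node_size(node):
            n_read = min(n_read + batch_size, node_size(node))  # S105: read n more
            decision = evaluate(node, n_read)  # S106 to S114
        if decision == "split":
            queue.extend(split_node(node))     # S115: children join NQ
    return visited


# Stub callbacks: the root needs two batches before it can be split.
calls = []

def stub_evaluate(node, n_read):
    calls.append((node, n_read))
    if node == "root":
        return "split" if n_read >= 200 else "add_data"
    return "no_split"

order = grow("root", stub_evaluate, lambda n: [n + "/L", n + "/R"],
             lambda n: 1000, batch_size=100)
```

The stub run shows the root being evaluated at 100 and then 200 items before it is split, after which both children are examined in queue order.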

As described above, this embodiment makes it possible to learn a tree model of a size appropriate in terms of generalization performance without preparing a test data set for pruning. The reason is that the embodiment includes, for each node, means for calculating split evaluation values that rank multiple split candidates using an evaluation function containing two terms in a trade-off relationship with respect to tree growth, and means for calculating the evaluation value for the case of not splitting using an evaluation function that likewise contains two such terms. Whether it is better to split or to stop at each node can thus be determined from evaluation values that each weigh the fit to the data against the complexity of the tree. Determining whether to split at each node with such evaluation values allows the growth of the tree to be stopped at the size that is optimal in terms of generalization performance.

This embodiment also makes it possible to learn, at high speed, a tree model of the size that is optimal in terms of generalization performance. The reason is that it includes means for calculating the confidence interval of the evaluation value computed from part of the data. Whether to split can thus be decided from only the statistically sufficient portion of the data needed to guarantee accuracy, without processing all the data.

Embodiment 2.
FIG. 7 is a block diagram showing a configuration example of the tree model learning system according to the second embodiment of the present invention. The tree model learning system of this embodiment differs from the tree model learning system shown in FIG. 1 in that the split point calculation means 103 is replaced by the split point parallel calculation means 203 and in that the calculation-unnecessary dimension deletion means 108 is newly provided.

In this embodiment, the split point parallel calculation means 203 performs in parallel, dimension by dimension, the processing that the split point calculation means 103 performed for each dimension in turn in the first embodiment. For example, as many split point parallel calculation means 203 are provided as there are dimensions of the explanatory variable.

When the split determination means 106 has decided to add learning data for a target node, the calculation-unnecessary dimension deletion means 108 determines whether there are any split dimensions that can be excluded from the recalculation of evaluation values after the learning data is added and, if so, removes those split dimensions from the recalculation.

For example, if among the evaluation values available when the split determination means 106 made its determination there is one whose confidence interval has an upper limit smaller than the lower limit of the confidence interval of the first-ranked evaluation value, the calculation-unnecessary dimension deletion means 108 may exclude the dimension that produced it from subsequent recalculation of evaluation values at that node. In other words, if the upper limit of the confidence interval of the evaluation value at the best split point of a dimension is smaller than the lower limit of the confidence interval of the first-ranked evaluation value, that dimension is excluded from subsequent recalculation of evaluation values.

For example, the calculation-unnecessary dimension deletion means 108 may set a flag or the like indicating the exclusion so that the split point parallel calculation means 203 associated with an excluded dimension is not activated at recalculation.

FIG. 8 is a flowchart showing an example of the operation of the tree model learning system of the present embodiment. As shown in FIG. 8, the operation of the present embodiment differs from that of the first embodiment shown in FIG. 6 in two respects: the processing related to the variable D indicating the dimension is removed and steps S107 to S109 are executed in parallel, and a calculation-unnecessary dimension deletion step (step S201) is added.

Steps S107 to S109 may be executed in parallel as follows, for example. When reading of the learning data for the target node is complete, the tree model learning device 10 starts each split point parallel calculation means 203, and each one begins finding the best split point at the target node for its corresponding dimension.
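A per-dimension parallel search with a skip set for excluded dimensions might look like the sketch below. The function `best_splits_in_parallel`, its parameters, and the callable `find_best_split` are hypothetical names introduced for illustration; the use of a thread pool is likewise an assumption, since the description does not specify a parallelization mechanism.

```python
from concurrent.futures import ThreadPoolExecutor

def best_splits_in_parallel(columns, find_best_split, skip=frozenset()):
    """Run find_best_split once per active dimension, in parallel.

    columns: dict mapping dimension index -> list of values of that dimension
    find_best_split: hypothetical callable(dim, values) -> (split_point, evaluation_value)
    skip: dimensions flagged as excluded from recalculation
    """
    active = [d for d in columns if d not in skip]
    with ThreadPoolExecutor() as pool:
        # One task per dimension; flagged dimensions are never submitted.
        futures = {d: pool.submit(find_best_split, d, columns[d]) for d in active}
        return {d: f.result() for d, f in futures.items()}
```

In CPython, threads do not run pure-Python computation on multiple cores simultaneously, so a process pool or native numerical code may be a better fit for CPU-bound evaluation; the structure of the sketch stays the same.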

In step S201, once the addition of learning data has been decided, the calculation-unnecessary dimension deletion means 108 determines whether any split dimension can be excluded from the recalculation of evaluation values performed after the learning data is added and, if so, removes that split dimension from the recalculation targets. The calculation-unnecessary dimension deletion means 108 compares the top-ranked evaluation value among those used for the determination with the evaluation value of any rank i. If the upper limit of the confidence interval of the rank-i evaluation value is smaller than the lower limit of the confidence interval of the top-ranked evaluation value, it excludes the split dimension that yielded the rank-i evaluation value from subsequent split point calculation for that node.

As described above, the present embodiment not only parallelizes the calculation of the evaluation values of the split candidates, which is the processing bottleneck, but also provides means for deleting calculation-unnecessary dimensions. This makes it possible to skip the recalculation of split points for dimensions that no longer need it, which speeds up processing further.

Next, the minimum configuration of the tree model learning device according to the present invention will be described. FIG. 9 is a block diagram showing an example of the minimum configuration of the tree model learning device according to the present invention. As shown in FIG. 9, the tree model learning device according to the present invention includes, as its minimum components, split evaluation value calculation means 51, split point search means 52, splitless evaluation value calculation means 53, and split determination means 54.

The split evaluation value calculation means 51 (for example, the split evaluation value calculation means 1032) uses a specified dimension of the explanatory variables to calculate the evaluation value obtained when a specified node is split at each split point of that dimension.

The split point search means 52 (for example, the split point search means 1034) searches, for each dimension of the explanatory variables, for the best split point for splitting the specified node, based on the evaluation values obtained by the split evaluation value calculation means 51.

The splitless evaluation value calculation means 53 (for example, the splitless evaluation value calculation means 1041) calculates the evaluation value obtained when the specified node is not split.

The split determination means 54 determines, based on the evaluation value at the best split point of each dimension of the explanatory variables and the evaluation value for not splitting, whether to split the specified node in the dimension with the best evaluation value or not to split the specified node.
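The determination made by the split determination means 54 can be sketched as a comparison between the best per-dimension evaluation value and the no-split evaluation value. The function name `decide_split` is hypothetical, and two points are assumptions of this sketch rather than statements of the description: larger evaluation values are treated as better, and a tie is resolved in favor of not splitting.

```python
def decide_split(best_by_dim, no_split_value):
    """Choose between splitting in the best dimension and not splitting.

    best_by_dim: dict mapping dimension -> (split_point, evaluation_value)
    no_split_value: evaluation value for leaving the node unsplit
    Returns (dimension, split_point) when splitting wins, otherwise None.
    """
    if not best_by_dim:
        return None
    dim, (point, value) = max(best_by_dim.items(), key=lambda kv: kv[1][1])
    if value > no_split_value:
        return dim, point
    return None  # ties go to "do not split" (assumption)
```

Returning `None` marks the node as a leaf, which is how tree growth stops without a separate pruning pass.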

In the present invention, the split evaluation value calculation means 51 calculates the evaluation value using an evaluation function that includes at least two types of terms in a trade-off relationship with respect to tree growth: a term that expresses the fit to the data as the goodness of the evaluation value, and a penalty term that, for branch nodes and leaf nodes, acts to worsen the evaluation value according to the number of data passing through the node or the number of data falling into it, respectively. Likewise, the splitless evaluation value calculation means 53 calculates the evaluation value using an evaluation function that includes at least two types of terms in a trade-off relationship with respect to tree growth: a term that expresses the fit to the data as the goodness of the evaluation value, and a penalty term that, for leaf nodes, acts to worsen the evaluation value according to the number of data falling into the node.
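To make the trade-off concrete, the sketch below pairs a fit term with a data-count penalty for a regression node. It is purely illustrative: the negative sum of squared errors as the fit term, the logarithmic form of the penalty, the coefficients, and all function names are assumptions of this sketch and are not the evaluation function defined in this description.

```python
import math

def sse(ys):
    """Sum of squared errors around the mean of ys."""
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def eval_no_split(ys, leaf_penalty=1.0):
    """Fit term minus a penalty that grows with the data count of the leaf.

    ys must be non-empty. The log penalty is an illustrative assumption.
    """
    return -sse(ys) - leaf_penalty * math.log(len(ys))

def eval_split(left, right, branch_penalty=1.0, leaf_penalty=1.0):
    """Fit term of the two children minus branch and leaf penalties.

    left and right must be non-empty. The branch penalty depends on the
    number of data passing through the node, the leaf penalties on the
    number of data falling into each child (illustrative assumption).
    """
    fit = -(sse(left) + sse(right))
    penalty = branch_penalty * math.log(len(left) + len(right))
    penalty += leaf_penalty * (math.log(len(left)) + math.log(len(right)))
    return fit - penalty

# Two clearly separated groups: the gain in fit outweighs the extra penalty.
separated = eval_split([0.0, 0.0], [10.0, 10.0]) > eval_no_split([0.0, 0.0, 10.0, 10.0])
# Constant targets: splitting adds penalty without improving the fit.
constant = eval_split([1.0, 1.0], [1.0, 1.0]) < eval_no_split([1.0, 1.0, 1.0, 1.0])
```

Because a split improves the fit term but always adds penalty, the comparison against the no-split value gives a stopping rule that needs no held-out pruning set.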

Therefore, the minimum-configuration tree model learning device can quickly obtain a tree model of a size that is appropriate in terms of generalization performance, without a test set for pruning. The reason is as follows. The split evaluation values used to rank multiple split candidates are calculated with an evaluation function containing two terms in a trade-off relationship with respect to tree growth, and the evaluation value for not splitting is calculated with an evaluation function that likewise contains two such terms. As a result, the point at which tree growth should stop can be judged on sound grounds while the learning method remains simple.

FIG. 10 is a block diagram showing another configuration example of the tree model learning device of the present invention. As shown in FIG. 10, the tree model learning device according to the present invention may include split evaluation value confidence interval calculation means 55 and splitless evaluation value confidence interval calculation means 56.

The split evaluation value confidence interval calculation means 55 (for example, the split evaluation value confidence interval calculation means 1033) calculates the confidence interval of the evaluation value calculated by the split evaluation value calculation means 51. The splitless evaluation value confidence interval calculation means 56 (for example, the splitless evaluation value confidence interval calculation means 1042) calculates the confidence interval of the evaluation value calculated by the splitless evaluation value calculation means 53.

In such a case, each time a predetermined number of data is read for a node and the evaluation values for splitting and for not splitting are calculated, the split determination means 54 may make a three-way determination. Based on the evaluation value at the best split point of each dimension of the explanatory variables and its confidence interval, and on the evaluation value for not splitting and its confidence interval, it determines whether to split the node in the dimension with the best evaluation value, not to split the node, or that the number of data processed so far is insufficient to decide whether to split.
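The three-way determination can be sketched as follows. The decision rule used here is an assumption made for illustration: the top-ranked candidate is accepted only when the lower limit of its confidence interval exceeds the upper limit of every rival, and otherwise more data is requested. The names `Decision` and `decide_with_intervals` are hypothetical, and larger evaluation values are assumed to be better.

```python
from enum import Enum

class Decision(Enum):
    SPLIT = "split"
    NO_SPLIT = "no_split"
    NEED_MORE_DATA = "need_more_data"

def decide_with_intervals(split_cis, no_split_ci):
    """Three-way decision from evaluation values and confidence intervals.

    split_cis: dict mapping dimension -> (value, lower, upper) at its best split point
    no_split_ci: (value, lower, upper) for leaving the node unsplit
    Returns (Decision, dimension or None).
    """
    candidates = [("no_split", *no_split_ci)]
    candidates += [(d, *ci) for d, ci in split_cis.items()]
    candidates.sort(key=lambda c: c[1], reverse=True)
    best, rivals = candidates[0], candidates[1:]
    # Assumed rule: any rival whose upper limit reaches the winner's lower
    # limit means the ranking is not yet statistically settled.
    if any(r[3] >= best[2] for r in rivals):
        return Decision.NEED_MORE_DATA, None
    if best[0] == "no_split":
        return Decision.NO_SPLIT, None
    return Decision.SPLIT, best[0]
```

With well-separated intervals the decision is made on a subset of the data, which is what allows learning without reading the full data set at every node.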

With such a configuration, the tree model can be learned using only part of the learning data, so a tree model of a size appropriate in terms of generalization performance can be obtained even faster.

The tree model learning device may also run at least the evaluation value calculation by the split evaluation value calculation means 51 in parallel across dimensions.

With such a configuration, a tree model of a size appropriate in terms of generalization performance can be obtained even faster.

When the split determination means 54 determines that the number of data processed so far is insufficient to decide whether to split, the tree model learning device may read further data for the specified node and cause the evaluation value for splitting the specified node and its confidence interval, as well as the evaluation value for not splitting and its confidence interval, to be recalculated.
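The read-more-and-recalculate loop can be sketched as below. The function `learn_node_incrementally` and the callables `evaluate` and `decide` are hypothetical placeholders for the evaluation and determination steps; the behavior when the data runs out before a decision is reached is not stated in this passage, so the sketch simply returns the undecided state.

```python
def learn_node_incrementally(batches, evaluate, decide):
    """Read data batch by batch until the split decision is settled.

    batches: iterable of lists of records, each of the predetermined size
    evaluate: hypothetical callable(records) -> statistics for the decision
    decide: hypothetical callable(statistics) -> "split", "no_split",
            or "need_more_data"
    Returns (decision, number of records read).
    """
    seen = []
    decision = "need_more_data"
    for batch in batches:
        seen.extend(batch)                 # read further data for the node
        decision = decide(evaluate(seen))  # recalculate and re-determine
        if decision != "need_more_data":
            break
    return decision, len(seen)
```

The batch size controls the trade-off between the number of recalculation rounds and the amount of data read beyond what the decision needed.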

With such a configuration, adjusting the number of data read at a time makes it possible to obtain a tree model of a size appropriate in terms of generalization performance even faster.

The tree model learning device may further include calculation-unnecessary dimension deletion means 57. When the split determination means 54 determines that the number of data processed so far is insufficient to decide whether to split, the calculation-unnecessary dimension deletion means 57 (for example, the calculation-unnecessary dimension deletion means 108) excludes certain dimensions from the recalculation of evaluation values and their confidence intervals using newly read data. The excluded dimensions are those whose evaluation value, ranked second or lower, has a confidence-interval upper limit smaller than the lower limit of the confidence interval of the evaluation value that was best at the time of the determination.

With such a configuration, recalculation of split points for dimensions that no longer need it can be skipped, so a tree model of a size appropriate in terms of generalization performance can be obtained even faster.

Although the present invention has been described with reference to the embodiments and examples, the present invention is not limited to them. Various changes that those skilled in the art can understand may be made to the configuration and details of the present invention within its scope.

For example, another aspect of the present invention may be a computer program that causes a computer to implement the components of each of the above embodiments, or a computer-readable storage medium on which such a program is recorded. The storage medium includes non-transitory tangible media.

Part or all of the above embodiments can also be described as in the following supplementary notes, but are not limited to them.

(Supplementary note 1) A tree model learning method comprising, for each node in order from the root node: calculating, for each dimension of the explanatory variables, the evaluation value obtained when the node is split at each split point of that dimension, using an evaluation function that includes at least two types of terms in a trade-off relationship with respect to tree growth, namely a term that expresses the fit to the data as the goodness of the evaluation value and a penalty term that, for branch nodes and leaf nodes, acts to worsen the evaluation value according to the number of data passing through the node or the number of data falling into it, respectively; searching, based on the obtained evaluation values, for the best split point for splitting the node for each dimension of the explanatory variables; calculating the evaluation value obtained when the node is not split, using an evaluation function that includes at least two types of terms in a trade-off relationship with respect to tree growth, namely a term that expresses the fit to the data as the goodness of the evaluation value and a penalty term that, for leaf nodes, acts to worsen the evaluation value according to the number of data falling into the node; and determining, based on the evaluation value at the best split point of each dimension and the evaluation value for not splitting, whether to split the node in the dimension with the best evaluation value or not to split the node.

(Supplementary note 2) The tree model learning method according to supplementary note 1, wherein a confidence interval is calculated for at least the evaluation value at the best split point of each dimension and the evaluation value for not splitting, and, each time a predetermined number of data is read for the node and the evaluation values for splitting and for not splitting are calculated, it is determined, based on the evaluation value at the best split point of each dimension and its confidence interval and on the evaluation value for not splitting and its confidence interval, whether to split the node in the dimension with the best evaluation value, not to split the node, or that the number of data processed so far is insufficient to decide whether to split.

(Supplementary note 3) The tree model learning method according to supplementary note 2, wherein, when it is determined that the number of data processed so far is insufficient to decide whether to split, further data is read for the node, and the evaluation value for splitting the node and its confidence interval, as well as the evaluation value for not splitting and its confidence interval, are recalculated.

(Supplementary note 4) The tree model learning method according to supplementary note 3, wherein, when it is determined that the number of data processed so far is insufficient to decide whether to split, any dimension whose evaluation value, ranked second or lower, has a confidence-interval upper limit smaller than the lower limit of the confidence interval of the evaluation value that was best at the time of the determination is excluded from the recalculation of evaluation values and their confidence intervals using newly read data.

(Supplementary note 5) The tree model learning method according to any one of supplementary notes 1 to 4, wherein at least the processing for calculating the evaluation value for splitting is run in parallel across dimensions.

(Supplementary note 6) A tree model learning program for causing a computer to execute: split evaluation value calculation processing for calculating, using a specified dimension of the explanatory variables, the evaluation value obtained when a specified node is split at each split point of that dimension; split point search processing for searching, for each dimension of the explanatory variables, for the best split point for splitting the specified node, based on the evaluation values obtained by the split evaluation value calculation processing; splitless evaluation value calculation processing for calculating the evaluation value obtained when the node is not split; and split determination processing for determining, based on the evaluation value at the best split point of each dimension and the evaluation value for not splitting, whether to split the node in the dimension with the best evaluation value or not to split the node; wherein, in the split evaluation value calculation processing, the evaluation value is calculated using an evaluation function that includes at least two types of terms in a trade-off relationship with respect to tree growth, namely a term that expresses the fit to the data as the goodness of the evaluation value and a penalty term that, for branch nodes and leaf nodes, acts to worsen the evaluation value according to the number of data passing through the node or the number of data falling into it, respectively; and, in the splitless evaluation value calculation processing, the evaluation value is calculated using an evaluation function that includes at least two types of terms in a trade-off relationship with respect to tree growth, namely a term that expresses the fit to the data as the goodness of the evaluation value and a penalty term that, for leaf nodes, acts to worsen the evaluation value according to the number of data falling into the node.

(Supplementary note 7) The tree model learning program according to supplementary note 6, which causes the computer to execute split evaluation value confidence interval calculation processing for calculating the confidence interval of the evaluation value calculated in the split evaluation value calculation processing, and splitless evaluation value confidence interval calculation processing for calculating the confidence interval of the evaluation value calculated in the splitless evaluation value calculation processing, and which causes the split determination processing to determine, each time a predetermined number of data is read for the node and the evaluation values for splitting and for not splitting are calculated, based on the evaluation value at the best split point of each dimension and its confidence interval and on the evaluation value for not splitting and its confidence interval, whether to split the node in the dimension with the best evaluation value, not to split the node, or that the number of data processed so far is insufficient to decide whether to split.

(Supplementary note 8) The tree model learning program according to supplementary note 7, which causes the computer, when it is determined in the split determination processing that the number of data processed so far is insufficient to decide whether to split, to read further data for the node and to recalculate the evaluation value for splitting the node and its confidence interval, as well as the evaluation value for not splitting and its confidence interval.

(Supplementary note 9) The tree model learning program according to supplementary note 8, which causes the computer to execute calculation-unnecessary dimension deletion processing that, when it is determined in the split determination processing that the number of data processed so far is insufficient to decide whether to split, excludes from the recalculation of evaluation values and their confidence intervals using newly read data any dimension whose evaluation value, ranked second or lower, has a confidence-interval upper limit smaller than the lower limit of the confidence interval of the evaluation value that was best at the time of the determination.

(Supplementary note 10) The tree model learning program according to any one of supplementary notes 6 to 9, which causes the computer to run at least the split evaluation value calculation processing in parallel across dimensions.

The present invention is suitably applicable not only to regression trees and decision trees but to any use that obtains a tree model representing a partition of a data space as a tree.

Description of symbols
10 Tree model learning device
101 Data storage unit
102 Tree model storage unit
103 Split point calculation means
1031 Binning means
1032 Split evaluation value calculation means
1033 Split evaluation value confidence interval calculation means
1034 Split point search means
104 Splitless calculation means
1041 Splitless evaluation value calculation means
1042 Splitless evaluation value confidence interval calculation means
105 Binning point calculation means
106 Split determination means
107 Tree node generation means
108 Calculation-unnecessary dimension deletion means
20 Prediction means

Claims

A tree model learning device comprising:
split evaluation value calculation means for calculating, using a specified dimension of the explanatory variables, the evaluation value obtained when a specified node is split at each split point of that dimension;
split point search means for searching, for each dimension of the explanatory variables, for the best split point for splitting the node, based on the evaluation values obtained by the split evaluation value calculation means;
splitless evaluation value calculation means for calculating the evaluation value obtained when the node is not split; and
split determination means for determining, based on the evaluation value at the best split point of each dimension and the evaluation value for not splitting, whether to split the node in the dimension with the best evaluation value or not to split the node,
wherein the split evaluation value calculation means calculates the evaluation value using an evaluation function that includes at least two types of terms in a trade-off relationship with respect to tree growth, namely a term that expresses the fit to the data as the goodness of the evaluation value and a penalty term that, for branch nodes and leaf nodes, acts to worsen the evaluation value according to the number of data passing through the node or the number of data falling into it, respectively, and
wherein the splitless evaluation value calculation means calculates the evaluation value using an evaluation function that includes at least two types of terms in a trade-off relationship with respect to tree growth, namely a term that expresses the fit to the data as the goodness of the evaluation value and a penalty term that, for leaf nodes, acts to worsen the evaluation value according to the number of data falling into the node.

The tree model learning device according to claim 1, further comprising:
split evaluation value confidence interval calculation means for calculating the confidence interval of the evaluation value calculated by the split evaluation value calculation means; and
splitless evaluation value confidence interval calculation means for calculating the confidence interval of the evaluation value calculated by the splitless evaluation value calculation means,
wherein, each time a predetermined number of data is read for the node and the evaluation values for splitting and for not splitting are calculated, the split determination means determines, based on the evaluation value at the best split point of each dimension of the explanatory variables and its confidence interval and on the evaluation value for not splitting and its confidence interval, whether to split the node in the dimension with the best evaluation value, not to split the node, or that the number of data processed so far is insufficient to decide whether to split.

The tree model learning device according to claim 2, wherein, when the split determination means determines that the number of data processed so far is insufficient to decide whether to split, further data is read for the node, and the evaluation value for splitting the node and its confidence interval, as well as the evaluation value for not splitting and its confidence interval, are recalculated.

The tree model learning device according to claim 3, further comprising calculation-unnecessary dimension deletion means that, when the split determination means determines that the number of data processed so far is insufficient to decide whether to split, excludes from the recalculation of evaluation values and their confidence intervals using newly read data any dimension whose evaluation value, ranked second or lower, has a confidence-interval upper limit smaller than the lower limit of the confidence interval of the evaluation value that was best at the time of the determination.

The tree model learning device according to any one of claims 1 to 4, wherein at least the evaluation value calculation by the split evaluation value calculation means is run in parallel across dimensions.

A tree model learning method comprising, for each node in order from the root node:
calculating, for each dimension of the explanatory variables, the evaluation value obtained when the node is split at each split point of that dimension, using an evaluation function that includes at least two types of terms in a trade-off relationship with respect to tree growth, namely a term that expresses the fit to the data as the goodness of the evaluation value and a penalty term that, for branch nodes and leaf nodes, acts to worsen the evaluation value according to the number of data passing through the node or the number of data falling into it, respectively;
searching, based on the obtained evaluation values, for the best split point for splitting the node for each dimension of the explanatory variables;
calculating the evaluation value obtained when the node is not split, using an evaluation function that includes at least two types of terms in a trade-off relationship with respect to tree growth, namely a term that expresses the fit to the data as the goodness of the evaluation value and a penalty term that, for leaf nodes, acts to worsen the evaluation value according to the number of data falling into the node; and
determining, based on the evaluation value at the best split point of each dimension and the evaluation value for not splitting, whether to split the node in the dimension with the best evaluation value or not to split the node.

The tree model learning method according to claim 6, wherein a confidence interval is calculated for at least the evaluation value at the best split point of each dimension and the evaluation value for not splitting, and
each time a predetermined number of data is read for the node and the evaluation values for splitting and for not splitting are calculated, it is determined, based on the evaluation value at the best split point of each dimension and its confidence interval and on the evaluation value for not splitting and its confidence interval, whether to split the node in the dimension with the best evaluation value, not to split the node, or that the number of data processed so far is insufficient to decide whether to split.

A tree model learning program for causing a computer to execute:
split evaluation value calculation processing for calculating, using a specified dimension of the explanatory variables, the evaluation value obtained when a specified node is split at each split point of that dimension;
split point search processing for searching, for each dimension of the explanatory variables, for the best split point for splitting the node, based on the evaluation values obtained by the split evaluation value calculation processing;
splitless evaluation value calculation processing for calculating the evaluation value obtained when the node is not split; and
split determination processing for determining, based on the evaluation value at the best split point of each dimension and the evaluation value for not splitting, whether to split the node in the dimension with the best evaluation value or not to split the node,
wherein, in the split evaluation value calculation processing, the evaluation value is calculated using an evaluation function that includes at least two types of terms in a trade-off relationship with respect to tree growth, namely a term that expresses the fit to the data as the goodness of the evaluation value and a penalty term that, for branch nodes and leaf nodes, acts to worsen the evaluation value according to the number of data passing through the node or the number of data falling into it, respectively, and
wherein, in the splitless evaluation value calculation processing, the evaluation value is calculated using an evaluation function that includes at least two types of terms in a trade-off relationship with respect to tree growth, namely a term that expresses the fit to the data as the goodness of the evaluation value and a penalty term that, for leaf nodes, acts to worsen the evaluation value according to the number of data falling into the node.

The tree model learning program according to claim 8, further causing the computer to execute:
Split evaluation value confidence interval calculation processing for calculating a confidence interval of the evaluation value calculated by the split evaluation value calculation processing, and non-split evaluation value confidence interval calculation processing for calculating a confidence interval of the evaluation value calculated by the non-split evaluation value calculation processing,
Wherein, in the split determination processing, each time a predetermined number of data are read into the node and the evaluation value for the split case and the evaluation value for the non-split case are calculated, it is determined, based on the evaluation value at the best split point for each dimension and its confidence interval and on the evaluation value for the non-split case and its confidence interval, whether to split the node at the dimension with the best evaluation value, not to split the node, or that the number of data processed is not yet sufficient to decide whether to split.