JP7040168B2

JP7040168B2 - Learning identification device and learning identification method

Info

Publication number: JP7040168B2
Application number: JP2018050252A
Authority: JP
Inventors: 亮介笠原; 拓哉田中
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2018-03-16
Filing date: 2018-03-16
Publication date: 2022-03-23
Anticipated expiration: 2038-03-16
Also published as: JP2019160254A

Description

The present invention relates to a learning identification device and a learning identification method.

In recent years, attempts to replace human functions using large amounts of data have been spreading in many fields through machine learning, which has become widely known in connection with AI (Artificial Intelligence). The field is still advancing rapidly day by day, but several problems remain. Typical ones are the limit on accuracy, including generalization performance (the ability to extract general-purpose knowledge from data), and the limit on processing speed caused by the heavy computational load. Well-known high-performance machine learning algorithms include Deep Learning (DL) and, within it, the Convolutional Neural Network (CNN), in which the input vector is restricted to a local neighborhood. Compared with these methods, GBDT (Gradient Boosting Decision Tree) is currently known to be less accurate on input data such as images, sound, and language, for which feature extraction is difficult, but to perform better on other, structured data. In fact, GBDT is the most standard algorithm on Kaggle, a competition for data scientists. It is said that 70% of the real-world problems to be solved by machine learning involve structured data other than images, sound, and language, so GBDT is undoubtedly an important algorithm for solving real-world problems. Furthermore, methods that use decision trees to extract features from data such as images and sound have recently begun to be proposed.

As a technique for realizing high-speed identification processing using such decision trees, a technique has been disclosed that enhances the effect of the cache memory and speeds up identification processing by appropriately adjusting the threshold value of the search processing for the node data of the decision tree (see Patent Document 1).

However, the technique described in Patent Document 1 copies the sample data to a new memory area at every branch, so the deeper the node hierarchy becomes, the more memory capacity is required.

The present invention has been made in view of the above problem, and an object thereof is to provide a learning identification device and a learning identification method capable of reducing memory usage.

In order to solve the above problem and achieve the object, the present invention includes: a data memory that stores learning data for learning a decision tree; a learning unit that learns the decision tree by reading each feature amount of the learning data from the data memory and deriving node data based on the feature amounts; and an identification unit that determines, using the node data derived by the learning unit, to which side of the node the learning data read from the data memory branches. The data memory has at least two bank areas for storing addresses of the learning data. Each time the hierarchy of the node to be learned changes, the at least two bank areas are switched between a bank area for reading and a bank area for writing. The learning unit reads the addresses of the learning data branched at the node from the bank area for reading and reads the learning data from the area of the data memory indicated by those addresses. The identification unit writes the addresses of the learning data branched at the node to the bank area for writing.

According to the present invention, memory usage can be reduced.
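The two-bank scheme summarized above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the function name `branch_node`, the sample values, and the layout in which left-going addresses are packed from the low end of the range and right-going addresses from the high end are all hypothetical and are not taken from this section. The point it shows is that only addresses move between the two banks, while the sample data themselves are never copied.

```python
from typing import List, Sequence


def branch_node(features: Sequence[Sequence[float]],
                read_bank: List[int], write_bank: List[int],
                start: int, end: int,
                feature_idx: int, threshold: float) -> int:
    """Partition the addresses in read_bank[start:end] into write_bank[start:end].

    Addresses branching left are packed upward from `start`, addresses
    branching right downward from `end - 1` (an assumed layout).
    Returns the index of the first right-side slot.
    """
    left, right = start, end - 1
    for pos in range(start, end):
        addr = read_bank[pos]                      # address of one sample
        if features[addr][feature_idx] < threshold:
            write_bank[left] = addr
            left += 1
        else:
            write_bank[right] = addr
            right -= 1
    return left


# Six samples with one feature each; the feature data stay in place.
features = [[5.0], [1.0], [4.0], [2.0], [6.0], [3.0]]
bank_a = list(range(len(features)))   # bank for reading at depth 0
bank_b = [0] * len(features)          # bank for writing at depth 0

# Depth 0: branch the root on feature 0 at threshold 3.5.
mid = branch_node(features, bank_a, bank_b, 0, 6, 0, 3.5)

# Depth 1: the banks swap roles; branch the left child at threshold 1.5.
bank_a, bank_b = bank_b, bank_a
mid_left = branch_node(features, bank_a, bank_b, 0, mid, 0, 1.5)
```

Because the banks hold only addresses and are reused alternately, the storage needed in this sketch stays at two address arrays regardless of how deep the tree grows.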

FIG. 1 is a diagram showing an example of a decision tree model.
FIG. 2 is a diagram showing an example of the module configuration of the learning identification device according to the embodiment.
FIG. 3 is a diagram showing an example of the configuration of the pointer memory.
FIG. 4 is a diagram showing an example of the module configuration of the learning module.
FIG. 5 is a diagram showing the operation of the modules at the time of initialization of the learning identification device according to the embodiment.
FIG. 6 is a diagram showing the operation of the modules when determining the node parameters of depth 0, node 0 in the learning identification device according to the embodiment.
FIG. 7 is a diagram showing the operation of the modules at the time of branching at depth 0, node 0 in the learning identification device according to the embodiment.
FIG. 8 is a diagram showing the operation of the modules when determining the node parameters of depth 1, node 0 in the learning identification device according to the embodiment.
FIG. 9 is a diagram showing the operation of the modules at the time of branching at depth 1, node 0 in the learning identification device according to the embodiment.
FIG. 10 is a diagram showing the operation of the modules when determining the node parameters of depth 1, node 1 in the learning identification device according to the embodiment.
FIG. 11 is a diagram showing the operation of the modules at the time of branching at depth 1, node 1 in the learning identification device according to the embodiment.
FIG. 12 is a diagram showing the operation of the modules when, as a result of determining the node parameters of depth 1, node 1, no branching occurs in the learning identification device according to the embodiment.
FIG. 13 is a diagram showing the operation of the modules when updating the state information of all the sample data after learning of a decision tree is completed in the learning identification device according to the embodiment.
FIG. 14 is a diagram showing an example of the configuration of the model memory of the learning identification device according to a modified example.
FIG. 15 is a diagram showing an example of the configuration of the classification module of the learning identification device according to the modified example.

Embodiments of the learning identification device and the learning identification method according to the present invention are described in detail below with reference to FIGS. 1 to 15. The present invention is not limited by the following embodiments, and the components in the following embodiments include those that a person skilled in the art could easily conceive of, those that are substantially the same, and those within a so-called range of equivalents. Furthermore, various omissions, substitutions, changes, and combinations of the components may be made without departing from the gist of the following embodiments.

(About GBDT logic)
In DL, as a high-performance machine learning algorithm, implementation of classifiers in various kinds of hard logic has been attempted, and such implementations have been found to be more power-efficient than processing on a GPU (Graphics Processing Unit). However, in the case of CNNs in particular, the GPU architecture is a very good match, so in terms of speed an FPGA (Field-Programmable Gate Array) with a logic implementation is not necessarily faster at identification than a GPU. In contrast, for decision-tree algorithms such as GBDT, hard-logic implementations on FPGAs have been tried, and results faster than GPUs have been reported. As described later, this is because decision-tree algorithms are not suited to the GPU architecture owing to the characteristics of their data arrangement.

Research on learning lags behind research on identification: there are almost no reports even for DL, and few for decision-tree methods. In particular, no report on GBDT learning has appeared so far, and it is considered an unexplored field at present. Obtaining an accurate identification model requires selecting and designing feature amounts and selecting the hyperparameters of the learning algorithm during learning, which takes an enormous number of trials. Especially when there is a large amount of learning data, the speed of the learning processing therefore has a very large practical effect on the accuracy of the final model. Furthermore, in fields that require real-time tracking of environmental changes, such as robotics, HFT (High Frequency Trading), and RTB (Real-Time Bidding), speed is directly linked to performance. Consequently, if high-speed learning processing can be achieved for the highly accurate GBDT, the performance of systems that use it is expected to improve greatly.

(Affinity of GBDT for FPGA)
Why decision trees and GBDT are not fast on a GPU, and why they are fast on an FPGA, is explained below from the viewpoint of the affinity of GBDT for FPGAs.

First, consider the fact that GBDT is an algorithm that uses boosting. Among decision-tree methods, Random Forest (RF), which uses ensemble learning, has no dependencies between trees and is therefore easy to parallelize even on a GPU. GBDT, however, connects many trees by boosting, and learning of the next tree cannot start until the result of the previous tree is available. The processing is therefore serial, and the key is how quickly each individual tree can be learned. In RF, by contrast, even if each tree is slow, the overall learning can be made fast by learning many trees in parallel. For this reason, with RF it is considered possible to hide, to some extent, the DRAM (Dynamic Random Access Memory) access latency problem described next, even when a GPU is used.

Next, consider the limit on the access speed (particularly for random access) of the RAM (Random Access Memory) of a GPU device. The bus width of the SRAM (Static Random Access Memory) built into an FPGA can be made very wide, so even with a mid-range FPGA such as the Xilinx XC7k325T, the bandwidth reaches 3.2 [TB/sec], as shown below. The capacity of the built-in RAM is 16 [Mb].

445 BRAMs × 36 bits × 100 MHz × 2 ports = 445*36*2*100*10^6/10^12 = 3.2 TB/sec

With the Xilinx VU9P, a high-end FPGA, the bandwidth is 6.9 [TB/sec]. The capacity of the built-in RAM is 270 [Mb].

960 URAMs × 36 bits × 100 MHz × 2 ports = 960*36*2*100*10^6/10^12 = 6.9 TB/sec

These values assume a clock frequency of 100 [MHz]; in practice, with a well-designed circuit configuration, operation at about 200 to 500 [MHz] is conceivable, which raises the bandwidth limit several times. In contrast, the RAM connected to a CPU (Central Processing Unit) in the current generation is DDR4 (Double-Data-Rate4), and the bandwidth of a single DIMM (Dual Inline Memory Module) is only 25.6 [GB/sec], as shown below. Even with a four-DIMM interleaved configuration (256-bit width), it is only about 100 [GB/sec]. For the DDR4-3200 chip standard (bus width 64 bits, one DIMM), the calculation is as follows.

200 MHz × 2 (DDR) × 64 = 200*10^6*2*64/10^9 = 25.6 GB/sec

The GDDR5 (Graphics Double-Data-Rate5) memory mounted on GPUs has about four times the bandwidth of DDR4, but even so the maximum is only about 400 [GB/sec].
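The bandwidth products quoted above can be re-evaluated with a few lines of Python. This is a minimal sketch that reproduces the arithmetic exactly as written in the text; the helper name `raw_rate` is hypothetical. The widths are given in bits, so the products are strictly bit rates; the sketch keeps the figures of the text and performs no unit conversion.

```python
def raw_rate(units: int, width_bits: int, clock_hz: float, per_clock: int) -> float:
    """Product used in the text: units x width x transfers-per-clock x clock."""
    return units * width_bits * per_clock * clock_hz


bram = raw_rate(445, 36, 100e6, per_clock=2) / 1e12   # XC7k325T block RAM, two ports
uram = raw_rate(960, 36, 100e6, per_clock=2) / 1e12   # VU9P UltraRAM, two ports
ddr4 = raw_rate(1, 64, 200e6, per_clock=2) / 1e9      # one DIMM, double data rate

print(f"BRAM {bram:.1f}  URAM {uram:.1f}  DDR4 {ddr4:.1f}")
print(f"BRAM figure relative to DDR4 figure: {bram * 1e3 / ddr4:.0f}x")
```

Taking the figures at face value, the on-chip RAM of the mid-range FPGA comes out roughly two orders of magnitude above a single DDR4 DIMM, which is the gap the argument relies on.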

Thus, there is a large difference in bandwidth between the RAM inside an FPGA and the external memory of a GPU or CPU. Moreover, the discussion so far has concerned sequential access to addresses; the access time for random access has an even greater effect. Because the RAM built into an FPGA is SRAM, the access latency is one clock for both sequential and random access. DDR4 and GDDR5, however, are DRAM, and because of the sense amplifiers the latency increases when different columns are accessed. For example, a typical CAS latency (Column Address Strobe latency) for DDR4 RAM is 16 clocks, so as a rough estimate the throughput is only 1/16 of that for sequential access.

In a CNN, the data of adjacent pixels are processed, so random-access latency is not a major problem. In a decision tree, however, as branching proceeds, the addresses of the original data for each branch become increasingly discontiguous, and access becomes essentially random. Consequently, when the data are placed in DRAM, its throughput becomes the bottleneck and speed degrades severely. GPUs have caches to suppress performance degradation in such cases, but a decision tree is basically an algorithm that sweeps through all the data, so data access has no locality and the cache is very ineffective. In the GPU architecture, each arithmetic core (SM) is allocated a shared memory made of SRAM, and using it may allow high-speed processing; however, the capacity is small, 16 to 48 [kB] per SM, and access across SMs incurs a large latency. A trial calculation of the shared memory capacity of the Nvidia K80, a current expensive large-scale GPU, is shown below.

K80 = 2 × 13 SMX = 26 SMX = 4992 CUDA cores
26 × 48 × 8 = 9 Mb

Thus, even a large-scale GPU costing several hundred thousand yen has only 9 [Mb] of shared memory, which is far too small. In addition, as noted above, an SM that performs processing cannot directly access the shared memory of other SMs, which makes fast code difficult to write when a GPU is used for decision tree learning.

As described above, on the premise that the data fit in the SRAM on the FPGA, an FPGA is considered able to implement the GBDT learning algorithm faster than a GPU.

(GBDT algorithm)
FIG. 1 is a diagram showing an example of a decision tree model. The basic logic of GBDT is described below with reference to equations (1) to (22) and FIG. 1.

GBDT is a method of supervised learning. As shown in equation (1) below, supervised learning is a process of optimizing, by some measure, an objective function obj(θ) consisting of a loss function L(θ), which represents how well the model fits the learning data, and a regularization term Ω(θ), which represents the complexity of the learned model. The regularization term Ω(θ) serves to prevent the model (decision tree) from becoming too complicated, that is, to improve generalization performance.

$$\mathrm{obj}(\theta) = L(\theta) + \Omega(\theta) \tag{1}$$

The loss function in the first term of equation (1) is, for example, the sum of the losses calculated by an error function l for each sample data (learning data), as shown in equation (2) below. Here, n is the number of sample data, i is the sample number, y is the label, and ŷ (y-hat) is the predicted value of the model.

$$L(\theta) = \sum_{i=1}^{n} l(y_i, \hat{y}_i) \tag{2}$$

As the error function l, for example, a squared error function or a logistic loss function as shown in equations (3) and (4) below is used.

$$l(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2 \tag{3}$$

$$l(y_i, \hat{y}_i) = y_i \ln\bigl(1 + e^{-\hat{y}_i}\bigr) + (1 - y_i)\ln\bigl(1 + e^{\hat{y}_i}\bigr) \tag{4}$$

As the regularization term Ω(θ) in the second term of equation (1), for example, the squared norm of the parameter θ as shown in equation (5) below is used. Here, λ is a hyperparameter representing the weight of the regularization.

$$\Omega(\theta) = \lambda \lVert \theta \rVert^2 \tag{5}$$

Now consider the case of GBDT. First, the predicted value of GBDT for the i-th sample data x_i can be expressed by equation (6) below.

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i) \tag{6}$$

Here, K is the total number of decision trees, k is the decision tree number, f_k() is the output of the k-th decision tree, and x_i is the feature amount of the input sample data. This shows that GBDT, like RF and similar methods, takes the sum of the outputs of the decision trees as its final output. The parameter θ is θ = {f_1, f_2, ..., f_K}. From the above, the objective function of GBDT is expressed by equation (7) below.

$$\mathrm{obj}(\theta) = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k) \tag{7}$$

Learning is performed on the above objective function, but methods such as SGD (Stochastic Gradient Descent), which are used for training neural networks and the like, cannot be used for the decision tree model. Learning is therefore performed using Additive Training (boosting). In Additive Training, the predicted value at a given round (number of learning iterations, number of decision tree models) t is expressed by equation (8) below.

$$\begin{aligned}
\hat{y}_i^{(0)} &= 0 \\
\hat{y}_i^{(1)} &= f_1(x_i) = \hat{y}_i^{(0)} + f_1(x_i) \\
\hat{y}_i^{(2)} &= f_1(x_i) + f_2(x_i) = \hat{y}_i^{(1)} + f_2(x_i) \\
&\;\;\vdots \\
\hat{y}_i^{(t)} &= \sum_{k=1}^{t} f_k(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i)
\end{aligned} \tag{8}$$

Equation (8) shows that, in a given round t, it is necessary to find (the output of) the decision tree f_t(x_i). Conversely, in a given round t there is no need to consider other rounds. The following discussion therefore considers round t. The objective function in round t is expressed by equation (9) below.

$$\mathrm{obj}^{(t)} = \sum_{i=1}^{n} l\bigl(y_i, \hat{y}_i^{(t)}\bigr) + \sum_{k=1}^{t} \Omega(f_k) = \sum_{i=1}^{n} l\bigl(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\bigr) + \Omega(f_t) + \mathrm{constant} \tag{9}$$

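The Taylor expansion of the objective function in round t (truncated after the second-order term) is given by equation (10) below.

$$\mathrm{obj}^{(t)} \simeq \sum_{i=1}^{n} \Bigl[ l\bigl(y_i, \hat{y}_i^{(t-1)}\bigr) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \Bigr] + \Omega(f_t) + \mathrm{constant} \tag{10}$$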

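In equation (10), g_i and h_i are given by equation (11) below.

$$g_i = \partial_{\hat{y}_i^{(t-1)}}\, l\bigl(y_i, \hat{y}_i^{(t-1)}\bigr), \qquad h_i = \partial^2_{\hat{y}_i^{(t-1)}}\, l\bigl(y_i, \hat{y}_i^{(t-1)}\bigr) \tag{11}$$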

Ignoring the constant terms in equation (10), the objective function in round t becomes equation (12) below.

$$\mathrm{obj}^{(t)} = \sum_{i=1}^{n} \Bigl[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \Bigr] + \Omega(f_t) \tag{12}$$

Equation (12) shows that the objective function in round t is expressed by the first and second derivatives of the error function with respect to the predicted value of the previous round, together with the regularization term. Any error function whose first and second derivatives can be obtained can therefore be applied.
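As an illustration of the first and second derivatives g_i and h_i, the following Python sketch evaluates them for the two error functions named in the text. It assumes the squared error l = (y - ŷ)² and the logistic loss l = y ln(1 + e^(-ŷ)) + (1 - y) ln(1 + e^(ŷ)); the function names are hypothetical.

```python
import math
from typing import Tuple


def grad_hess_squared(y: float, y_hat: float) -> Tuple[float, float]:
    """g and h for l = (y - y_hat)^2, differentiated with respect to y_hat."""
    return 2.0 * (y_hat - y), 2.0


def grad_hess_logistic(y: float, y_hat: float) -> Tuple[float, float]:
    """g and h for the logistic loss, where y_hat is the raw (pre-sigmoid) score."""
    p = 1.0 / (1.0 + math.exp(-y_hat))   # sigmoid of the previous-round prediction
    return p - y, p * (1.0 - p)


g_sq, h_sq = grad_hess_squared(y=3.0, y_hat=1.0)
g_lg, h_lg = grad_hess_logistic(y=1.0, y_hat=0.0)
```

Only these two numbers per sample enter the round-t objective, which is why the tree-building step is independent of the particular error function.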

Now consider the decision tree model. FIG. 1 shows an example. A decision tree model consists of nodes and leaves. A node passes its input to the next node or leaf according to a branch condition, and each leaf has a leaf weight, which becomes the output for the input. For example, FIG. 1 shows that the leaf weight W2 of "leaf 2" is "-1".

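The decision tree model is formulated as shown in equation (13) below, where T denotes the number of leaves and d the dimension of the feature amount.

$$f_t(x) = w_{q(x)}, \qquad w \in \mathbb{R}^{T}, \qquad q : \mathbb{R}^{d} \to \{1, 2, \ldots, T\} \tag{13}$$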

In equation (13), w represents the leaf weights and q represents the structure of the tree. That is, an input (sample data x) is assigned to one of the leaves according to the tree structure q, and the leaf weight of that leaf is output.
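The relationship between the tree structure q and the leaf weights w can be sketched in Python as follows. The classes `Node` and `Leaf`, the branch condition `x[feature] < threshold`, and the example tree are hypothetical; only the weight -1 for leaf 2 echoes the value mentioned for FIG. 1.

```python
from dataclasses import dataclass
from typing import Sequence, Union


@dataclass
class Leaf:
    index: int      # leaf number j = q(x)
    weight: float   # leaf weight w_j


@dataclass
class Node:
    feature: int                 # which feature amount is tested
    threshold: float             # assumed branch condition: x[feature] < threshold
    left: Union["Node", Leaf]
    right: Union["Node", Leaf]


def q(tree: Union[Node, Leaf], x: Sequence[float]) -> Leaf:
    """Tree structure q: route x through the nodes and return the leaf reached."""
    while isinstance(tree, Node):
        tree = tree.left if x[tree.feature] < tree.threshold else tree.right
    return tree


def f(tree: Union[Node, Leaf], x: Sequence[float]) -> float:
    """Tree output: the leaf weight of the leaf that x is assigned to."""
    return q(tree, x).weight


# Hypothetical tree with two nodes and three leaves.
tree = Node(0, 10.0, Leaf(1, 2.0), Node(1, 0.5, Leaf(2, -1.0), Leaf(3, 0.3)))
```

For instance, `f(tree, [20.0, 0.2])` follows the right branch of the root and the left branch of the second node, reaching leaf 2.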

The complexity of the decision tree model is defined as shown in equation (14) below.

$$\Omega(f_t) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2 \tag{14}$$

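In equation (14), the first term represents the complexity due to the number of leaves, and the second term is the squared norm of the leaf weights. γ is a hyperparameter that controls the importance of the regularization term. From the above, the objective function in round t is rearranged as shown in equation (15) below.

$$\begin{aligned}
\mathrm{obj}^{(t)} &= \sum_{i=1}^{n} \Bigl[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \Bigr] + \Omega(f_t) \\
&= \sum_{i=1}^{n} \Bigl[ g_i w_{q(x_i)} + \tfrac{1}{2} h_i w_{q(x_i)}^2 \Bigr] + \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^2 \\
&= \sum_{j=1}^{T} \Bigl[ \Bigl( \sum_{i \in I_j} g_i \Bigr) w_j + \tfrac{1}{2} \Bigl( \sum_{i \in I_j} h_i + \lambda \Bigr) w_j^2 \Bigr] + \gamma T \\
&= \sum_{j=1}^{T} \Bigl[ G_j w_j + \tfrac{1}{2} (H_j + \lambda) w_j^2 \Bigr] + \gamma T
\end{aligned} \tag{15}$$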

In equation (15), I_j, G_j, and H_j are given by equation (16) below.

$$I_j = \{\, i \mid q(x_i) = j \,\}, \qquad G_j = \sum_{i \in I_j} g_i, \qquad H_j = \sum_{i \in I_j} h_i \tag{16}$$

Equation (15) shows that the objective function in a given round t is a quadratic function of the leaf weight w. In general, the minimum value of a quadratic function and the condition under which it is attained are expressed by equation (17) below.

$$\operatorname*{arg\,min}_{w} \Bigl( G w + \tfrac{1}{2} H w^2 \Bigr) = -\frac{G}{H} \quad (H > 0), \qquad \min_{w} \Bigl( G w + \tfrac{1}{2} H w^2 \Bigr) = -\frac{1}{2} \frac{G^2}{H} \tag{17}$$

That is, once the structure q of the decision tree in a given round t has been determined, its objective function and leaf weights are given by equation (18) below.

$$w_j^{*} = -\frac{G_j}{H_j + \lambda}, \qquad \mathrm{obj} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T \tag{18}$$

Up to this point, it has become possible to calculate the leaf weights once the structure of the decision tree in a given round has been determined. The procedure for learning the structure of the decision tree is described next.
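The leaf-weight calculation for a fixed tree structure can be sketched in Python as follows, assuming the closed forms w_j = -G_j / (H_j + λ) and obj = -(1/2) Σ G_j² / (H_j + λ) + γT for equation (18). The function names and the numeric values are hypothetical.

```python
from typing import Sequence, Tuple


def leaf_weight(G: float, H: float, lam: float) -> float:
    """Optimal leaf weight for one leaf: -G / (H + lambda)."""
    return -G / (H + lam)


def structure_objective(leaves: Sequence[Tuple[float, float]],
                        lam: float, gamma: float) -> float:
    """Objective of a fixed structure: -1/2 * sum(G^2 / (H + lambda)) + gamma * T."""
    T = len(leaves)
    return -0.5 * sum(G * G / (H + lam) for G, H in leaves) + gamma * T


# Two leaves, each described by its gradient sums (G_j, H_j).
leaves = [(-4.0, 3.0), (2.0, 1.0)]
weights = [leaf_weight(G, H, lam=1.0) for G, H in leaves]
objective = structure_objective(leaves, lam=1.0, gamma=0.5)
```

A lower objective value indicates a better structure, so this score is what the structure search compares between candidate trees.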

One method of learning the structure of a decision tree is the greedy method (Greedy Algorithm). The greedy method starts the tree structure from depth 0, calculates a branch score (Gain) at each node, and decides whether or not to branch, thereby learning the structure of the decision tree. The branch score is obtained by equation (19) below.

$$\mathrm{Gain} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma \tag{19}$$

Here, G_L and H_L are the gradient information of the samples branched to the left node, G_R and H_R are the gradient information of the samples branched to the right node, and γ is the regularization term. In the brackets [] of equation (19), the first term is the score (objective function) of the sample data branched to the left node, the second term is the score of the sample data branched to the right node, and the third term is the score when no branch is made; together they represent the degree to which branching improves the objective function.

The branch score in equation (19) represents how good a branch on a certain feature amount at a certain threshold value is, but on its own it cannot tell which condition is optimal. The greedy method therefore obtains the branch score for all threshold candidates of all feature amounts and searches for the condition that maximizes the branch score. As described above, the greedy method is very simple as an algorithm, but its calculation cost is high because the branch score is obtained for all threshold candidates of all feature amounts. Libraries such as XGBoost, described below, therefore include measures to reduce the calculation cost while maintaining performance.
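The exhaustive search of the greedy method can be sketched in Python as follows. This is a minimal, unoptimized illustration: it assumes the branch score Gain = (1/2)[G_L²/(H_L+λ) + G_R²/(H_R+λ) - (G_L+G_R)²/(H_L+H_R+λ)] - γ for equation (19), uses every observed feature value as a threshold candidate, and sends samples with `x < threshold` to the left. The function names and data are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def gain(GL: float, HL: float, GR: float, HR: float,
         lam: float, gamma: float) -> float:
    """Branch score of one candidate split."""
    return 0.5 * (GL * GL / (HL + lam) + GR * GR / (HR + lam)
                  - (GL + GR) ** 2 / (HL + HR + lam)) - gamma


def best_split(X: Sequence[Sequence[float]], g: Sequence[float], h: Sequence[float],
               lam: float = 1.0, gamma: float = 0.0) -> Optional[Tuple[float, int, float]]:
    """Try all thresholds of all features; return (gain, feature, threshold)."""
    G, H = sum(g), sum(h)
    best = None
    for feat in range(len(X[0])):
        for threshold in sorted({row[feat] for row in X}):
            left = [i for i, row in enumerate(X) if row[feat] < threshold]
            if not left or len(left) == len(X):
                continue                      # a branch must send samples both ways
            GL = sum(g[i] for i in left)
            HL = sum(h[i] for i in left)
            score = gain(GL, HL, G - GL, H - HL, lam, gamma)
            if best is None or score > best[0]:
                best = (score, feat, threshold)
    return best


X = [[1.0, 7.0], [2.0, 3.0], [3.0, 8.0], [4.0, 2.0]]
g = [-1.0, -1.0, 1.0, 1.0]
h = [1.0, 1.0, 1.0, 1.0]
best = best_split(X, g, h)
```

The nested loops over features, thresholds, and samples make the cost visible: every candidate requires a pass over the samples of the node, which is the expense that threshold-candidate reduction targets.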

(About XGBoost)
XGBoost, a well-known GBDT library, is described below. The learning algorithm of XGBoost incorporates two improvements: reduction of threshold candidates and handling of missing values.

First, the reduction of threshold candidates is described. The greedy method described above has the problem of high calculation cost. XGBoost reduces the number of threshold candidates by a method called Weighted Quantile Sketch. In calculating the branch score (Gain), what matters is the sum of the gradient information of the sample data split to the left and right, so only thresholds at which the sum of the gradient information changes by a certain ratio are taken as search candidates. Specifically, the second-order gradient h of the samples is used. Letting f be the dimension of the feature amount, the set of feature amounts and second-order gradients h of the sample data is expressed by equation (20) below.

$$D_f = \{ (x_{1f}, h_1), (x_{2f}, h_2), \ldots, (x_{nf}, h_n) \} \tag{20}$$

The rank function r_f is defined as shown in equation (21) below.

$$r_f(z) = \frac{1}{\sum_{(x,h) \in D_f} h} \sum_{(x,h) \in D_f,\; x < z} h \tag{21}$$

Here, z is a threshold candidate. The rank function r_f in equation (21) gives the ratio of the sum of the second-order gradients of the sample data smaller than a given threshold candidate to the sum of the second-order gradients of all the sample data. Ultimately, for the feature amount of dimension f, a set of threshold candidates {s_f1, s_f2, ..., s_fl} must be obtained, and this is done by equation (22) below.

$$\bigl| r_f(s_{fj}) - r_f(s_{fj+1}) \bigr| < \varepsilon, \qquad s_{f1} = \min(\{x_{1f}, x_{2f}, \ldots, x_{nf}\}), \qquad s_{fl} = \max(\{x_{1f}, x_{2f}, \ldots, x_{nf}\}) \tag{22}$$

Here, ε is a parameter that determines the degree of reduction of the threshold candidates, and approximately 1/ε threshold candidates are obtained.
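The effect of ε on the number of candidates can be illustrated with a small Python sketch. It is not the sketch data structure used inside XGBoost: it is an exact, single-pass greedy selection over sorted values that keeps the rank gap between consecutive candidates below ε wherever the data allow, and it assumes distinct feature values. The function name `threshold_candidates` and the data are hypothetical.

```python
from typing import List, Sequence


def threshold_candidates(values: Sequence[float], hessians: Sequence[float],
                         eps: float) -> List[float]:
    """Pick thresholds so that consecutive candidates differ in rank by less than eps."""
    pairs = sorted(zip(values, hessians))
    total = sum(h for _, h in pairs)

    # Rank of each sorted value: share of the total h held by smaller values.
    ranks, cum = [], 0.0
    for _, h in pairs:
        ranks.append(cum / total)
        cum += h

    picked = [0]                              # smallest value is always a candidate
    for i in range(1, len(pairs) - 1):
        # If skipping i would open a rank gap of eps or more, keep i.
        if ranks[i + 1] - ranks[picked[-1]] >= eps:
            picked.append(i)
    if picked[-1] != len(pairs) - 1:
        picked.append(len(pairs) - 1)         # largest value is always a candidate
    return [pairs[i][0] for i in picked]


values = [float(v) for v in range(1, 11)]     # ten distinct feature values
hessians = [1.0] * 10                         # equal second-order gradients
candidates = threshold_candidates(values, hessians, eps=0.25)
```

With equal weights, the ten values are thinned to six candidates at ε = 0.25; a larger ε thins further, and unequal h values shift the candidates toward regions where the second-order gradients are concentrated.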

ＷｅｉｇｈｔｅｄＱｕａｎｔｉｌｅＳｋｅｔｃｈは、決定木の最初のノードで（全サンプルデータに対して一括で）行うグローバルと、ノードごとに（当該ノードに割り当てられたサンプルについて毎回）行うローカルの２パターンが考えられる。汎化性能の面ではローカルの方がよいという結果が出ているので、ＸＧＢｏｏｓｔではローカルを採用している。 The Weighted Quantile Sketch can be considered in two patterns: global, which is performed at the first node of the decision tree (collectively for all sample data), and local, which is performed for each node (every time for the sample assigned to the node). Since the results show that local is better in terms of generalization performance, XGBoost adopts local.

次に、欠損値の扱いについて説明する。入力されるサンプルデータの欠損値の扱いはＧＢＤＴおよび決定木に限らず、機械学習分野において一般的に有効な手法はない。欠損値を、平均値、中央値、もしくは協調フィルタ等で補完する方法、または欠損値が多い特徴量を除外する方法等があるが、性能の面で多くのケースで成功するわけではない。しかし、構造化データは欠損値を含むことが多く、実用上は何らかの対応が求められる。 Next, the handling of missing values will be described. The handling of missing values in input sample data is an issue not only for GBDT and decision trees, and no generally effective method exists in the field of machine learning. Missing values can be imputed with the mean, the median, or collaborative filtering, or features with many missing values can be excluded, but in terms of performance these methods do not succeed in many cases. However, structured data often contains missing values, so some measure is required in practice.

ＸＧＢｏｏｓｔは、欠損値を含むサンプルデータを直接扱えるように学習アルゴリズムが工夫されている。これは、ノードの分岐スコアを求める際に、欠損値のデータを全て左右どちらかのノードに割り当てた時のスコアを求める方法である。また、上述のＷｅｉｇｈｔｅｄＱｕａｎｔｉｌｅＳｋｅｔｃｈを行う場合は、欠損値を含むサンプルデータを除外した集合に対してしきい値候補を求めるものとすればよい。 XGBoost has a learning algorithm devised so that it can directly handle sample data including missing values. This is a method of obtaining the score when all the missing value data is assigned to either the left or right node when obtaining the branch score of the node. Further, in the case of performing the above-mentioned Weighted Quantile Sketch, the threshold value candidate may be obtained for the set excluding the sample data including the missing value.
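
The missing-value handling described above can be sketched as follows: for one threshold, the branch score is computed once with all missing samples sent left and once with all of them sent right, and the better direction is kept. The gain formula is assumed to be the standard XGBoost form (equation (19) is not shown in this section), and the function names, the regularization parameter `lam`, and the rule that `x < threshold` goes left are illustrative assumptions.

```python
from typing import List, Optional, Tuple


def gain(gl: float, hl: float, gr: float, hr: float, lam: float = 1.0) -> float:
    """Branch score in the standard XGBoost form (assumed), without the gamma term."""
    def score(g: float, h: float) -> float:
        return g * g / (h + lam)
    return 0.5 * (score(gl, hl) + score(gr, hr) - score(gl + gr, hl + hr))


def best_missing_direction(x: List[Optional[float]], g: List[float],
                           h: List[float], threshold: float) -> Tuple[str, float]:
    """Try sending all missing samples left, then right, and keep the better gain."""
    gl = hl = gr = hr = gm = hm = 0.0
    for xi, gi, hi in zip(x, g, h):
        if xi is None:            # missing value: decide its side later
            gm += gi
            hm += hi
        elif xi < threshold:      # assumption: smaller values branch left
            gl += gi
            hl += hi
        else:
            gr += gi
            hr += hi
    gain_left = gain(gl + gm, hl + hm, gr, hr)
    gain_right = gain(gl, hl, gr + gm, hr + hm)
    if gain_left >= gain_right:
        return "left", gain_left
    return "right", gain_right


direction, score = best_missing_direction(
    x=[1.0, 2.0, None, 3.0], g=[-1.0, -1.0, -1.0, 1.0], h=[1.0] * 4, threshold=2.5)
```

In this example the missing sample has the same gradient sign as the left-hand samples, so assigning it to the left gives the higher score.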

（ＬｉｇｈｔＧＢＭについて）
次に、ＧＢＤＴのライブラリであるＬｉｇｈｔＧＢＭについて述べる。ＬｉｇｈｔＧＢＭは前処理にｂｉｎｎｉｎｇと呼ばれる特徴量の量子化を採用し、分岐スコアの計算にＧＰＵを利用した高速なアルゴリズムを採用している。ＬｉｇｈｔＧＢＭはＸＧＢｏｏｓｔと比較して性能は同程度で学習速度が数倍速く、近年利用者が増えてきている。 (About LightGBM)
Next, LightGBM, which is a library of GBDT, will be described. LightGBM adopts the quantization of the feature quantity called binning for the preprocessing, and adopts the high-speed algorithm using GPU for the calculation of the branch score. LightGBM has the same performance as XGBoost, and the learning speed is several times faster, and the number of users has been increasing in recent years.

まず、特徴量の量子化について説明する。分岐スコアは、データセットが大規模であれば大量のしきい値候補に対して計算が必要である。ＬｉｇｈｔＧＢＭは、学習の前処理として、特徴量を量子化することでしきい値候補数を削減している。また、量子化することでＸＧＢｏｏｓｔのようにノードごとにしきい値候補の値および数が変わることがなく、ＧＰＵを利用する場合に必須の処理となっている。 First, the quantization of the feature quantity will be described. The branch score needs to be calculated for a large number of threshold candidates if the dataset is large. LightGBM reduces the number of threshold candidates by quantizing the features as a pre-processing for learning. Further, by quantization, the value and the number of threshold candidates do not change for each node unlike XGBoost, which is an indispensable process when using the GPU.

特徴量の量子化についてはｂｉｎｎｉｎｇという名前で様々な研究がなされており、ＬｉｇｈｔＧＢＭでは、特徴量をｋ個のビンに分割しており、しきい値候補はｋ個だけとなる。ｋは２５５、６３、１５等であり、データセットによって性能または学習速度は異なる。 Various studies have been conducted on the quantization of features under the name of binning. In LightGBM, the features are divided into k bins, and the threshold candidates are only k. k is 255, 63, 15, etc., and the performance or learning speed differs depending on the data set.

また、特徴量を量子化したことで分岐スコアの計算が簡易になる。具体的には、しきい値候補が単に量子化された値になる。そのため、各特徴量について一次勾配および二次勾配のヒストグラムを作成し、各ビン（量子化された値）について分岐スコアを求めればよいことになる。これを特徴量ヒストグラムと呼んでいる。 In addition, the calculation of the branch score is simplified by quantizing the features. Specifically, the threshold candidates are simply quantized values. Therefore, it is sufficient to create histograms of the first-order gradient and the second-order gradient for each feature quantity and obtain the branch score for each bin (quantized value). This is called a feature histogram.
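
The feature-histogram approach above can be sketched for a single feature as follows. The gain formula is again assumed to be the standard XGBoost form, and the function name, the parameter `lam`, and the convention that bins up to and including the split bin go left are illustrative assumptions.

```python
from typing import List, Optional, Tuple


def best_split_from_histogram(bins: List[int], g: List[float], h: List[float],
                              n_bins: int, lam: float = 1.0
                              ) -> Tuple[float, Optional[int]]:
    """Build first/second-order gradient histograms and scan the bins for the best split."""
    hist_g = [0.0] * n_bins
    hist_h = [0.0] * n_bins
    for b, gi, hi in zip(bins, g, h):   # one pass over the samples
        hist_g[b] += gi
        hist_h[b] += hi

    total_g, total_h = sum(hist_g), sum(hist_h)
    best_gain, best_bin = float("-inf"), None
    gl = hl = 0.0
    for b in range(n_bins - 1):         # samples in bins <= b branch left
        gl += hist_g[b]
        hl += hist_h[b]
        gr, hr = total_g - gl, total_h - hl
        gain = 0.5 * (gl * gl / (hl + lam) + gr * gr / (hr + lam)
                      - total_g * total_g / (total_h + lam))
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_gain, best_bin


best_gain, best_bin = best_split_from_histogram(
    bins=[0, 0, 1, 2, 3, 3], g=[-1.0, -1.0, -1.0, 1.0, 1.0, 1.0],
    h=[1.0] * 6, n_bins=4)
```

Once the histograms exist, the scan costs only as many steps as there are bins, regardless of the number of samples, which is why histogram creation rather than the score calculation dominates the learning time.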

次に、ＧＰＵを利用した分岐スコアの計算について説明する。分岐スコアの計算自体は特徴量が量子化されているため最大でも２５６パターンであるが、サンプルデータ数はデータセットによっては数万件を超えるため、ヒストグラム作成が学習時間に対して支配的となる。上述で述べたように、分岐スコアの計算では、特徴量ヒストグラムを求める必要がある。ＧＰＵを利用した場合、複数のスレッドが同一のヒストグラムを更新する必要があるが、このとき同一のビンを更新する可能性がある。そのため、アトミック演算を使用する必要があり、同一のビンを更新する割合が高いとパフォーマンスが低下する。そこで、ＬｉｇｈｔＧＢＭでは、ヒストグラムの作成の際に、一次勾配および二次勾配のどちらのヒストグラムから値を更新するかをスレッドごとに分けており、これによって同一のビンを更新する頻度を下げている。 Next, the calculation of the branch score using a GPU will be described. Because the features are quantized, the branch score calculation itself involves at most 256 patterns, but the number of sample data exceeds tens of thousands for some datasets, so histogram creation dominates the learning time. As described above, the branch score calculation requires the feature histogram. When a GPU is used, multiple threads need to update the same histogram, and they may update the same bin at the same time. Atomic operations are therefore required, and performance degrades when the proportion of updates to the same bin is high. For this reason, when creating the histograms, LightGBM assigns to each thread whether it updates the first-order or the second-order gradient histogram first, which lowers the frequency of updates to the same bin.

（学習識別装置の構成）
図２は、実施形態に係る学習識別装置のモジュール構成の一例を示す図である。図３は、ポインタメモリの構成の一例を示す図である。図４は、ラーニングモジュールのモジュール構成の一例を示す図である。図２～図４を参照しながら、本実施形態に係る学習識別装置１のモジュール構成について説明する。 (Configuration of learning identification device)
FIG. 2 is a diagram showing an example of a module configuration of the learning identification device according to the embodiment. FIG. 3 is a diagram showing an example of the configuration of the pointer memory. FIG. 4 is a diagram showing an example of the module configuration of the learning module. The module configuration of the learning identification device 1 according to the present embodiment will be described with reference to FIGS. 2 to 4.

図２に示すように、本実施形態に係る学習識別装置１は、ＣＰＵ１０と、ラーニングモジュール２０（学習部）と、データメモリ３０と、モデルメモリ４０と、クラシフィケーションモジュール５０（識別部）と、を備えている。このうち、ラーニングモジュール２０、データメモリ３０、モデルメモリ４０およびクラシフィケーションモジュール５０は、ＦＰＧＡにより構成されている。ＣＰＵ１０と、当該ＦＰＧＡとはバスを介してデータ通信可能となっている。なお、学習識別装置１は、図２に示す各構成要素だけではなく、他の構成要素、例えば、ＣＰＵ１０のワークエリアとなるＲＡＭ、ＣＰＵ１０が実行するプログラム等を記憶したＲＯＭ（ＲｅａｄＯｎｌｙＭｅｍｏｒｙ）、各種データ（プログラム等）を記憶した補助記憶装置、および外部装置と通信を行う通信Ｉ／Ｆ等を備えているものとしてもよい。 As shown in FIG. 2, the learning identification device 1 according to the present embodiment includes a CPU 10, a learning module 20 (learning unit), a data memory 30, a model memory 40, and a classification module 50 (identification unit). Of these, the learning module 20, the data memory 30, the model memory 40, and the classification module 50 are implemented on an FPGA. The CPU 10 and the FPGA can exchange data via a bus. In addition to the components shown in FIG. 2, the learning identification device 1 may include other components, for example, a RAM serving as a work area of the CPU 10, a ROM (Read Only Memory) storing programs executed by the CPU 10, an auxiliary storage device storing various data (programs and the like), and a communication I/F for communicating with an external device.

ＣＰＵ１０は、全体でＧＢＤＴの学習を制御する演算装置である。ＣＰＵ１０は、制御部１１を有する。制御部１１は、ラーニングモジュール２０、データメモリ３０、モデルメモリ４０およびクラシフィケーションモジュール５０の各モジュールを制御する。制御部１１は、ＣＰＵ１０で実行されるプログラムによって実現される。 The CPU 10 is an arithmetic unit that controls GBDT learning as a whole. The CPU 10 has a control unit 11. The control unit 11 controls each module of the learning module 20, the data memory 30, the model memory 40, and the classification module 50. The control unit 11 is realized by a program executed by the CPU 10.

ラーニングモジュール２０は、決定木を構成するノード毎の最適な特徴量の番号（以下、「特徴量番号」と称する場合がある）、およびしきい値を算出し、当該ノードがリーフの場合は、リーフウェイトを算出し、モデルメモリ４０に書き込むハードウェアモジュールである。また、図４に示すように、ラーニングモジュール２０は、ゲイン算出モジュール２１＿１、２１＿２、・・・、２１＿ｎ（ゲイン算出部）と、最適条件導出モジュール２２（導出部）と、を備えている。ここで、ｎは、少なくともサンプルデータ（学習データ、識別データ双方含む）の特徴量の数以上の数である。なお、ゲイン算出モジュール２１＿１、２１＿２、・・・、２１＿ｎについて、任意のゲイン算出モジュールを示す場合、または総称する場合、単に「ゲイン算出モジュール２１」と称するものとする。 The learning module 20 is a hardware module that calculates, for each node constituting the decision tree, the number of the optimum feature (hereinafter sometimes referred to as the "feature number") and the threshold value, calculates the leaf weight when the node is a leaf, and writes the results to the model memory 40. As shown in FIG. 4, the learning module 20 includes gain calculation modules 21_1, 21_2, ..., 21_n (gain calculation units) and an optimum condition derivation module 22 (derivation unit). Here, n is a number equal to or greater than the number of features of the sample data (including both learning data and identification data). When an arbitrary one of the gain calculation modules 21_1, 21_2, ..., 21_n is indicated, or when they are referred to collectively, the term "gain calculation module 21" is simply used.

ゲイン算出モジュール２１は、入力されるサンプルデータに含まれる特徴量のうち対応する特徴量について、各しきい値における分岐スコアを、上述の式（１９）を用いて算出するモジュールである。ここで、サンプルデータのうち学習データには、特徴量の他、ラベル（真の値）が含まれ、サンプルデータのうち識別データには、特徴量が含まれるが、ラベルは含まれていない。また、各ゲイン算出モジュール２１は、一度（１クロック）で入力されたすべての特徴量について、それぞれにそのヒストグラムを演算・格納するメモリを有し、全特徴量を並列に演算する。そのヒストグラムの結果より、各特徴量のゲインを並列に算出する。これによって、一度に、または同時に全特徴量に対する処理が可能となるので、学習処理の速度を飛躍的に向上させることが可能となる。このように、並列に全部の特徴量を読み出し、処理していく方法をフィーチャパラレル（ＦｅａｔｕｒｅＰａｒａｌｌｅｌ）と呼ぶ。なお、この方法を実現するためには、データメモリは一度（１クロック）ですべての特徴量を読み出すことができる必要がある。そのため、通常の３２ビットや２５６ビット幅のデータ幅を持つメモリでは実現できない。また、ソフトウエアでは、通常ＣＰＵの一度に扱えるデータのビット数は６４ビットにとどまり、特徴量数が１００、各特徴量のビット数が８ビットだとしても８０００ビットが必要となるのに対して、全く対応できない。そのため、従来は、メモリのアドレス毎（例えば、ＣＰＵが扱える６４ビット幅）に別の特徴量を格納しておき、特徴量すべてでは、複数のアドレスにまたがって保存される方法が取られていた。それに対して、本方法では、メモリの１アドレスにすべての特徴量を格納し、１アクセスで全特徴量を読み出す点が新規の技術内容である。 The gain calculation module 21 is a module that calculates the branch score at each threshold value for the corresponding feature amount among the feature amounts included in the input sample data by using the above equation (19). Here, among the sample data, the training data includes a label (true value) in addition to the feature amount, and the identification data among the sample data includes the feature amount but does not include the label. Further, each gain calculation module 21 has a memory for calculating and storing a histogram of all the features input once (1 clock), and calculates all the features in parallel. From the result of the histogram, the gain of each feature is calculated in parallel. As a result, processing for all features can be performed at once or at the same time, so that the speed of learning processing can be dramatically improved. In this way, a method of reading out all the features in parallel and processing them is called a feature parallel. In order to realize this method, the data memory needs to be able to read all the features at once (1 clock). Therefore, it cannot be realized with a memory having a data width of a normal 32-bit or 256-bit width. 
Further, in software, the number of data bits that a CPU can normally handle at one time is limited to 64 bits, whereas 8000 bits are required even when the number of features is 100 and each feature has 8 bits, so software cannot cope with this at all. For this reason, the conventional approach has been to store a different feature at each memory address (for example, in the 64-bit width that the CPU can handle), so that the full set of features is stored across multiple addresses. In contrast, the novel technical point of this method is that all the features are stored at a single memory address and are all read out in a single access.
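
The idea of holding every feature of a sample at one memory address can be illustrated in software as follows. This is only a model of the data layout, assuming 8-bit quantized features and using a Python integer to stand in for one wide memory word; the function names and the `feature_memory` list are hypothetical.

```python
from typing import List


def pack_features(features: List[int], bits: int = 8) -> int:
    """Pack all features of one sample into a single wide word (feature 0 in the low bits)."""
    word = 0
    for i, value in enumerate(features):
        if not 0 <= value < (1 << bits):
            raise ValueError("feature value does not fit in the given bit width")
        word |= value << (i * bits)
    return word


def unpack_features(word: int, n_features: int, bits: int = 8) -> List[int]:
    """Recover every feature from one word, modelling a single wide read."""
    mask = (1 << bits) - 1
    return [(word >> (i * bits)) & mask for i in range(n_features)]


# One address per sample: each entry holds that sample's whole feature vector.
feature_memory = [pack_features([12, 0, 255, 7]), pack_features([3, 4, 5, 6])]
sample = unpack_features(feature_memory[1], n_features=4)
```

In hardware, the unpacking step corresponds to wiring slices of the wide word to the individual gain calculation modules, so no sequential loop is involved.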

上述のように、ＧＢＤＴでは決定木の学習についての並列化はできない。そのため、いかに一本ずつの決定木を速く学習するかが、学習処理の速度に関して支配的となる。一方、アンサンブルな学習を行うＲＦでは、決定木の間の依存関係は学習時にないので、決定木ごとの学習処理の並列化は容易であるが、一般的にＧＢＤＴに対して精度が劣る。上述のように、ＲＦよりも精度の高いＧＢＤＴの学習について、上述のようなフィーチャパラレル（ＦｅａｔｕｒｅＰａｒａｌｌｅｌ）を適用することで、決定木の学習処理の速度を向上させることができる。 As mentioned above, GBDT cannot parallelize the learning of decision trees. Therefore, how fast the decision trees are learned one by one is dominant in terms of the speed of the learning process. On the other hand, in an RF that performs ensemble learning, since there is no dependency between decision trees at the time of learning, it is easy to parallelize the learning process for each decision tree, but the accuracy is generally inferior to GBDT. As described above, the speed of the training process of the decision tree can be improved by applying the above-mentioned feature parallel (Feature Parallel) to the learning of GBDT having higher accuracy than RF.

ゲイン算出モジュール２１は、算出した分岐スコアを最適条件導出モジュール２２へ出力する。 The gain calculation module 21 outputs the calculated branch score to the optimum condition derivation module 22.

最適条件導出モジュール２２は、各ゲイン算出モジュール２１により出力された各特徴量に対応する各分岐スコアを入力し、分岐スコアが最大となる特徴量の番号（特徴量番号）およびしきい値を導出するモジュールである。最適条件導出モジュール２２は、導出した特徴量番号およびしきい値を、対応するノードの分岐条件データ（ノードのデータの一例）として、モデルメモリ４０へ書き込む。 The optimum condition derivation module 22 is a module that receives the branch scores for the respective features output by the gain calculation modules 21 and derives the number of the feature (feature number) and the threshold value that maximize the branch score. The optimum condition derivation module 22 writes the derived feature number and threshold value to the model memory 40 as branch condition data (an example of node data) of the corresponding node.

データメモリ３０は、各種データを格納するＳＲＡＭである。データメモリ３０は、ポインタメモリ３１と、フィーチャメモリ３２と、ステートメモリ３３と、を備えている。 The data memory 30 is an SRAM that stores various data. The data memory 30 includes a pointer memory 31, a feature memory 32, and a state memory 33.

ポインタメモリ３１は、フィーチャメモリ３２で格納されているサンプルデータの格納先アドレスを記憶するメモリである。ポインタメモリ３１は、図３に示すように、バンクＡ（バンク領域）と、バンクＢ（バンク領域）とを有する。なお、バンクＡおよびバンクＢの２バンクに分割して、サンプルデータの格納先アドレスを記憶する動作の詳細については、図５～図１３で後述する。なお、ポインタメモリ３１は、３つ以上のバンクを有することを制限するものではない。 The pointer memory 31 is a memory that stores the storage destination addresses of the sample data stored in the feature memory 32. As shown in FIG. 3, the pointer memory 31 has a bank A (bank area) and a bank B (bank area). The details of the operation of storing the storage destination addresses of the sample data divided between the two banks, bank A and bank B, will be described later with reference to FIGS. 5 to 13. The pointer memory 31 is not precluded from having three or more banks.

フィーチャメモリ３２は、サンプルデータ（学習データ、識別データを含む）を格納するメモリである。 The feature memory 32 is a memory for storing sample data (including learning data and identification data).

ステートメモリ３３は、ステート情報（上述のｗ、ｇ、ｈ）およびラベル情報を記憶するメモリである。 The state memory 33 is a memory for storing state information (w, g, h described above) and label information.

モデルメモリ４０は、決定木のノード毎の分岐条件データ（特徴量番号、しきい値）、そのノードがリーフであるか否かを示すリーフフラグ（フラグ情報、ノードのデータの一例）、および、そのノードがリーフである場合におけるリーフウェイトを記憶するＳＲＡＭである。 The model memory 40 is an SRAM that stores, for each node of the decision tree, the branch condition data (feature number, threshold value), a leaf flag indicating whether or not the node is a leaf (flag information, an example of node data), and the leaf weight when the node is a leaf.

クラシフィケーションモジュール５０は、ノードごと、決定木ごとにサンプルデータを振り分けるハードウェアモジュールである。また、クラシフィケーションモジュール５０は、ステート情報（ｗ，ｇ，ｈ）を計算して、ステートメモリ３３に書き込む。 The classification module 50 is a hardware module that distributes sample data for each node and each decision tree. Further, the classification module 50 calculates the state information (w, g, h) and writes it to the state memory 33.

なお、クラシフィケーションモジュール５０は、上述のように学習処理におけるサンプルデータ（学習データ）の識別（分岐）だけでなく、サンプルデータ（識別データ）に対する識別処理においても、同一のモジュール構成で、当該識別データに対する識別を行うことが可能である。また、識別処理時にも、一括して特徴量をすべて読み込むことにより、クラシフィケーションモジュール５０による処理をパイプライン化することができ、クロックごとに１つのサンプルデータの識別をすることまで処理の高速化が可能となる。一方、上述のように一括で読み込むことができない場合、どこの特徴量が必要になるかは、各ノードに分岐してみないとわからないため、毎回該当する特徴量のアドレスにアクセスする形態ではパイプライン化ができないことになる。 With the same module configuration, the classification module 50 can perform not only the identification (branching) of sample data (learning data) in the learning process as described above but also the identification of sample data (identification data) in the identification process. In the identification process as well, reading all the features at once allows the processing by the classification module 50 to be pipelined, so that the processing can be accelerated to the point of identifying one sample data per clock. On the other hand, if the features cannot be read all at once as described above, which feature is required is not known until the sample has branched to each node, so a configuration that accesses the address of the relevant feature each time cannot be pipelined.
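
The way a sample is routed through the node data held in the model memory can be sketched as follows. The record fields mirror the items the model memory is said to store; the rest is an illustrative assumption, namely the `Node` class, the dictionary keyed by (depth, node), the rule that a feature value below the threshold branches left, and the numbering of the children of node k as nodes 2k and 2k+1 in the next depth.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Node:
    feature_no: int = 0        # branch condition: feature number
    threshold: float = 0.0     # branch condition: threshold value
    is_leaf: bool = False      # leaf flag
    leaf_weight: float = 0.0   # used only when is_leaf is True


def classify(model: Dict[Tuple[int, int], Node], features: List[float]) -> float:
    """Follow the branch conditions from depth 0, node 0 down to a leaf."""
    depth, node = 0, 0
    while True:
        record = model[(depth, node)]
        if record.is_leaf:
            return record.leaf_weight
        # All features are already available, so any feature number can be used.
        go_right = features[record.feature_no] >= record.threshold
        depth, node = depth + 1, 2 * node + int(go_right)


model = {
    (0, 0): Node(feature_no=1, threshold=5.0),
    (1, 0): Node(is_leaf=True, leaf_weight=-0.5),
    (1, 1): Node(is_leaf=True, leaf_weight=0.7),
}
left_result = classify(model, [0.0, 3.0])
right_result = classify(model, [0.0, 9.0])
```

Because the whole feature vector is read before the first node is evaluated, no further memory access depends on the outcome of a branch, which is the property that makes pipelining possible.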

また、上述のクラシフィケーションモジュール５０を複数備えるものとし、複数の識別データを分割（データパラレル（ＤａｔａＰａｒａｌｌｅｌ））して、各クラシフィケーションモジュール５０に分配してそれぞれに識別処理をさせることによって、識別処理を高速化させることもできる。 The identification process can also be accelerated by providing a plurality of the classification modules 50 described above, dividing a plurality of identification data (data parallel), and distributing them to the classification modules 50 so that each performs the identification process.

（学習識別装置の学習処理）
以下、図５～図１３を参照しながら、学習識別装置１の学習処理について具体的に説明する。 (Learning process of learning identification device)
Hereinafter, the learning process of the learning identification device 1 will be specifically described with reference to FIGS. 5 to 13.

＜初期化＞
図５は、実施形態に係る学習識別装置の初期化時のモジュールの動作を示す図である。図５に示すように、まず、制御部１１は、ポインタメモリ３１を初期化する。例えば、図５に示すように、制御部１１は、ポインタメモリ３１のバンクＡに対して、サンプルデータ（学習データ）のフィーチャメモリ３２におけるアドレスを、学習データの数だけ順番に（例えば、アドレスの低い方から順に）書き込む。 <Initialization>
FIG. 5 is a diagram showing the operation of the modules at the time of initialization of the learning identification device according to the embodiment. As shown in FIG. 5, the control unit 11 first initializes the pointer memory 31. For example, as shown in FIG. 5, the control unit 11 writes the addresses of the sample data (learning data) in the feature memory 32 to bank A of the pointer memory 31, one for each learning data, in order (for example, starting from the lowest address).

なお、学習データのすべてを利用（すべてのアドレスを書き込み）することに限定されるものではなく、いわゆるデータサブサンプリングによって、所定の乱数に従った確率に基づいてランダムに選択した学習データを用いる（当該選択した学習データのアドレスを書き込む）ものとしてもよい。例えば、データサブサンプリングが０．５の場合、乱数に従った半分の確率で学習データの全アドレスのうち、半分のアドレスがポインタメモリ３１（ここではバンクＡ）に書き込まれるものとしてもよい。乱数の発生には、ＬＦＳＲ（ＬｉｎｅａｒＦｅｅｄｂａｃｋＳｈｉｆｔＲｅｇｉｓｔｅｒ：線形帰還シフトレジスタ）により作成された擬似乱数が使用可能である。 The learning process is not limited to using all of the learning data (writing all the addresses). By so-called data subsampling, learning data selected at random with a probability based on a predetermined random number may be used instead (the addresses of the selected learning data are written). For example, when the data subsampling rate is 0.5, each address is selected with a probability of one half according to the random number, so that about half of all the addresses of the learning data are written to the pointer memory 31 (here, bank A). Pseudo-random numbers generated by an LFSR (Linear Feedback Shift Register) can be used as the random numbers.
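
Data subsampling driven by an LFSR can be sketched as follows. The source does not specify the register width or taps, so this example assumes a common 16-bit Fibonacci LFSR (taps 16, 14, 13, 11) and a hypothetical seed; the function names are likewise illustrative.

```python
from typing import List


def lfsr16(state: int) -> int:
    """Advance a 16-bit Fibonacci LFSR by one step (taps 16, 14, 13, 11)."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)


def subsample_addresses(n_samples: int, rate: float, seed: int = 0xACE1) -> List[int]:
    """Return the addresses to write to the pointer memory, kept with probability ~rate."""
    threshold = int(rate * 0x10000)
    state = seed                     # must be non-zero for an LFSR
    selected = []
    for address in range(n_samples):
        state = lfsr16(state)
        if state < threshold:        # keep this learning data
            selected.append(address)
    return selected


bank_a = subsample_addresses(1000, rate=0.5)
```

Because the LFSR is deterministic for a given seed, the same subset is reproduced on every run, which is convenient when comparing hardware results against a software model.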

また、学習に使用する学習データのうちすべての特徴量を使用することに限定されるものではなく、いわゆるフィーチャサブサンプルによって、上述と同様の乱数に従った確率に基づいてランダムに選択（例えば、半分を選択）した特徴量のみを使用するものとしてもよい。この場合、例えば、フィーチャサブサンプルにより選択された特徴量以外の特徴量のデータとしては、フィーチャメモリ３２から定数が出力されるものとすればよい。これによって、未知のデータ（識別データ）に対する汎化性能が向上するという効果がある。 Similarly, the learning process is not limited to using all the features of the learning data. By so-called feature subsampling, only features selected at random (for example, half of them) with a probability based on a random number as described above may be used. In this case, for example, a constant may be output from the feature memory 32 as the data of the features other than those selected by the feature subsampling. This has the effect of improving the generalization performance for unknown data (identification data).

＜デプス０・ノード０の分岐条件データの決定＞
図６は、実施形態に係る学習識別装置のデプス０、ノード０のノードパラメータを決定する場合のモジュールの動作を示す図である。なお、決定木の一番上の階層を「デプス０」、そこから下の階層を順に「デプス１」、「デプス２」、・・・と称するものとし、特定の階層の一番左のノードを「ノード０」、そこから右のノードを順に「ノード１」、「ノード２」、・・・と称するものとする。 <Determination of branch condition data for depth 0 and node 0>
FIG. 6 is a diagram showing the operation of the modules when determining the node parameters of depth 0, node 0 of the learning identification device according to the embodiment. The top layer of the decision tree is referred to as "depth 0" and the layers below it are referred to in order as "depth 1", "depth 2", and so on. The leftmost node of a given layer is referred to as "node 0" and the nodes to its right are referred to in order as "node 1", "node 2", and so on.

図６に示すように、まず、制御部１１は、ラーニングモジュール２０へ開始アドレスおよび終了アドレスを送信し、トリガによりラーニングモジュール２０による処理を開始させる。ラーニングモジュール２０は、開始アドレスおよび終了アドレスに基づいて、ポインタメモリ３１（バンクＡ）から対象とする学習データのアドレスを指定し、当該アドレスによって、フィーチャメモリ３２から学習データ（特徴量）を読み出し、ステートメモリ３３からステート情報（ｗ，ｇ，ｈ）を読み出す。 As shown in FIG. 6, first, the control unit 11 transmits a start address and an end address to the learning module 20, and triggers the learning module 20 to start processing. The learning module 20 specifies the address of the target learning data from the pointer memory 31 (bank A) based on the start address and the end address, and reads the training data (feature amount) from the feature memory 32 according to the address. State information (w, g, h) is read from the state memory 33.

この場合、上述したように、ラーニングモジュール２０の各ゲイン算出モジュール２１は、対応する特徴量のヒストグラムを計算し、それぞれ自身のＳＲＡＭに格納し、その結果に基づいて各しきい値における分岐スコアを算出する。そして、ラーニングモジュール２０の最適条件導出モジュール２２は、各ゲイン算出モジュール２１により出力された各特徴量に対応する各分岐スコアを入力し、分岐スコアが最大となる特徴量の番号（特徴量番号）およびしきい値を導出する。そして、最適条件導出モジュール２２は、導出した特徴量番号およびしきい値を、対応するノード（デプス０、ノード０）の分岐条件データとして、モデルメモリ４０へ書き込む。この際、最適条件導出モジュール２２は、ノード（デプス０、ノード０）からさらに分岐されることを示すためにリーフフラグを「０」として、当該ノードのデータ（分岐条件データの一部としてもよい）をモデルメモリ４０へ書き込む。 In this case, as described above, each gain calculation module 21 of the learning module 20 calculates the histogram of its corresponding feature, stores it in its own SRAM, and calculates the branch score at each threshold value based on the result. The optimum condition derivation module 22 of the learning module 20 then receives the branch scores for the respective features output by the gain calculation modules 21 and derives the number of the feature (feature number) and the threshold value that maximize the branch score. The optimum condition derivation module 22 then writes the derived feature number and threshold value to the model memory 40 as the branch condition data of the corresponding node (depth 0, node 0). At this time, the optimum condition derivation module 22 sets the leaf flag to "0" to indicate that the node (depth 0, node 0) branches further, and writes the data of the node (which may be part of the branch condition data) to the model memory 40.

以上の動作について、ラーニングモジュール２０は、バンクＡに書き込まれた学習データのアドレスを順に指定し、当該アドレスによって、フィーチャメモリ３２から各学習データを読み出して行う。 For the above operation, the learning module 20 designates the addresses of the learning data written in the bank A in order, and reads out each learning data from the feature memory 32 by the addresses.

＜デプス０・ノード０でのデータ分岐処理＞
図７は、実施形態に係る学習識別装置のデプス０、ノード０の分岐時のモジュールの動作を示す図である。 <Data branch processing at depth 0 / node 0>
FIG. 7 is a diagram showing the operation of the module at the time of branching of the depth 0 and the node 0 of the learning identification device according to the embodiment.

図７に示すように、制御部１１は、クラシフィケーションモジュール５０へ開始アドレスおよび終了アドレスを送信し、トリガによりクラシフィケーションモジュール５０による処理を開始させる。クラシフィケーションモジュール５０は、開始アドレスおよび終了アドレスに基づいて、ポインタメモリ３１（バンクＡ）から対象とする学習データのアドレスを指定し、当該アドレスによって、フィーチャメモリ３２から学習データ（特徴量）を読み出す。また、クラシフィケーションモジュール５０は、モデルメモリ４０から対応するノード（デプス０、ノード０）の分岐条件データ（特徴量番号、しきい値）を読み出す。そして、クラシフィケーションモジュール５０は、分岐条件データに従って、読み出したサンプルデータを、ノード（デプス０、ノード０）の左側に分岐させるか、右側に分岐させるかを判定し、その判定結果により、当該学習データのフィーチャメモリ３２におけるアドレスを、ポインタメモリ３１の読み出しバンク（ここではバンクＡ）（読み出し用のバンク領域）と異なる他方のバンク（書き込みバンク）（ここではバンクＢ）（書き込み用のバンク領域）に書き込む。 As shown in FIG. 7, the control unit 11 transmits a start address and an end address to the classification module 50 and starts the processing by the classification module 50 with a trigger. Based on the start address and the end address, the classification module 50 specifies the addresses of the target learning data from the pointer memory 31 (bank A) and uses those addresses to read the learning data (features) from the feature memory 32. The classification module 50 also reads the branch condition data (feature number, threshold value) of the corresponding node (depth 0, node 0) from the model memory 40. The classification module 50 then determines, according to the branch condition data, whether the read sample data branches to the left side or to the right side of the node (depth 0, node 0). Based on the determination result, it writes the address of the learning data in the feature memory 32 to the other bank (write bank; here, bank B; bank area for writing) of the pointer memory 31, which is different from the read bank (here, bank A; bank area for reading).

この際、クラシフィケーションモジュール５０は、当該ノードの左側に分岐すると判定した場合、当該学習データのアドレスを、図７に示すように、バンクＢのアドレスの低い方から順に書き込み、当該ノードの右側に分岐すると判定した場合、当該学習データのアドレスを、バンクＢのアドレスの高い方から順に書き込む。これによって、書き込みバンク（バンクＢ）では、ノードの左側に分岐した学習データのアドレスは、アドレスの低い方に、ノードの右側に分岐した学習データのアドレスは、アドレスの高い方にきれいに分けて書き込むことができる。なお、書き込みバンクにおいて、ノードの左側に分岐した学習データのアドレスは、アドレスの高い方に、ノードの右側に分岐した学習データのアドレスは、アドレスの低い方に分けて書き込むものとしてもよい。 At this time, when the classification module 50 determines that the learning data branches to the left side of the node, it writes the address of the learning data to bank B in order from the lowest address, as shown in FIG. 7. When it determines that the learning data branches to the right side of the node, it writes the address of the learning data to bank B in order from the highest address. As a result, in the write bank (bank B), the addresses of the learning data branched to the left side of the node are written cleanly to the low-address side, and the addresses of the learning data branched to the right side are written to the high-address side. Alternatively, in the write bank, the addresses of the learning data branched to the left side of the node may be written to the high-address side and those branched to the right side to the low-address side.

このように、ポインタメモリ３１では、上述のように、バンクＡおよびバンクＢの２つが構成されており、交互に読み書きすることによって、ＦＰＧＡ内のＳＲＡＭの容量が限られている中、効率的にメモリを使用することが可能となる。単純には、フィーチャメモリ３２およびステートメモリ３３を、それぞれ２バンク構成する方法もあるが、一般的に、サンプルデータよりも、フィーチャメモリ３２でのアドレスを示すデータの方が小さいので、本実施形態のように、ポインタメモリ３１を準備しておき、間接的にアドレスを指定する方法の方が、メモリ容量を節約することが可能となる。 As described above, the pointer memory 31 is configured with two banks, bank A and bank B, and reading and writing them alternately makes it possible to use memory efficiently even though the capacity of the SRAM in the FPGA is limited. A simpler approach would be to configure the feature memory 32 and the state memory 33 with two banks each. In general, however, the data indicating an address in the feature memory 32 is smaller than the sample data itself, so preparing the pointer memory 31 and specifying addresses indirectly, as in the present embodiment, saves more memory capacity.

以上の動作について、クラシフィケーションモジュール５０は、全学習データに対して分岐処理を行う。ただし、分岐処理が終了した後、ノード（デプス０、ノード０）の左側と右側とに同数の学習データが分けられるわけではないので、クラシフィケーションモジュール５０は、左側に分岐した学習データのアドレスと、右側に分岐した学習データのアドレスとの境界に対応する書き込みバンク（バンクＢ）におけるアドレス（中間アドレス）を、制御部１１に返す。当該中間アドレスは、次の分岐処理の際に使用される。 The classification module 50 performs the above branch processing on all the learning data. However, the learning data are not necessarily divided equally between the left side and the right side of the node (depth 0, node 0) once the branch processing is finished. The classification module 50 therefore returns to the control unit 11 the address (intermediate address) in the write bank (bank B) corresponding to the boundary between the addresses of the learning data branched to the left side and the addresses of the learning data branched to the right side. This intermediate address is used in the next branch processing.
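
The two-bank branch processing described above can be sketched as follows. Plain Python lists stand in for the banks of the pointer memory and for the feature memory. The function name, the rule that a feature value below the threshold branches left, and the choice to return the index of the last left-side entry as the intermediate address are illustrative assumptions.

```python
from typing import List, Optional


def branch_node(read_bank: List[int], write_bank: List[Optional[int]],
                start: int, end: int, feature_memory: List[List[float]],
                feature_no: int, threshold: float) -> int:
    """Split the pointers read_bank[start..end] into write_bank by branch side.

    Left-branching addresses fill the write bank from the low end and
    right-branching ones from the high end. Returns the intermediate address,
    taken here as the index of the last left-side entry (start - 1 if none).
    """
    low, high = start, end
    for position in range(start, end + 1):
        address = read_bank[position]            # indirect access via pointer
        if feature_memory[address][feature_no] < threshold:
            write_bank[low] = address
            low += 1
        else:
            write_bank[high] = address
            high -= 1
    return low - 1


feature_memory = [[5.0], [1.0], [7.0], [2.0], [9.0]]
bank_a = [0, 1, 2, 3, 4]                         # read bank at depth 0
bank_b: List[Optional[int]] = [None] * 5         # write bank at depth 0
intermediate = branch_node(bank_a, bank_b, 0, 4, feature_memory, 0, 4.0)
```

At the next depth the roles swap: `bank_b` is read over the range from `start` to `intermediate` for the left child and from `intermediate + 1` to `end` for the right child, while `bank_a` is overwritten within the same ranges.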

＜デプス１・ノード０の分岐条件データの決定＞
図８は、実施形態に係る学習識別装置のデプス１、ノード０のノードパラメータを決定する場合のモジュールの動作を示す図である。基本的には、図６に示した、デプス０・ノード０の分岐条件データの決定の処理と同様であるが、対象とするノードの階層が変わる（デプス０からデプス１になる）ので、ポインタメモリ３１のバンクＡおよびバンクＢの役割が反転する。具体的には、バンクＢが読み出しバンクとなり、バンクＡが書き込みバンク（図９参照）となる。 <Determination of branch condition data for depth 1 and node 0>
FIG. 8 is a diagram showing the operation of the modules when determining the node parameters of depth 1, node 0 of the learning identification device according to the embodiment. The processing is basically the same as the determination of the branch condition data of depth 0, node 0 shown in FIG. 6, but because the layer of the target node changes (from depth 0 to depth 1), the roles of bank A and bank B of the pointer memory 31 are reversed. Specifically, bank B becomes the read bank and bank A becomes the write bank (see FIG. 9).

図８に示すように、制御部１１は、デプス０での処理でクラシフィケーションモジュール５０から受け取った中間アドレスに基づいて、ラーニングモジュール２０へ開始アドレスおよび終了アドレスを送信し、トリガによりラーニングモジュール２０による処理を開始させる。ラーニングモジュール２０は、開始アドレスおよび終了アドレスに基づいて、ポインタメモリ３１（バンクＢ）から対象とする学習データのアドレスを指定し、当該アドレスによって、フィーチャメモリ３２から学習データ（特徴量）を読み出し、ステートメモリ３３からステート情報（ｗ，ｇ，ｈ）を読み出す。具体的には、ラーニングモジュール２０は、図８に示すように、バンクＢの左側（アドレスが低い方）から中間アドレスまで順にアドレスを指定していく。 As shown in FIG. 8, the control unit 11 transmits a start address and an end address to the learning module 20 based on the intermediate address received from the classification module 50 in the processing at depth 0, and starts the processing by the learning module 20 with a trigger. Based on the start address and the end address, the learning module 20 specifies the addresses of the target learning data from the pointer memory 31 (bank B), and uses those addresses to read the learning data (features) from the feature memory 32 and the state information (w, g, h) from the state memory 33. Specifically, as shown in FIG. 8, the learning module 20 specifies the addresses in order from the left side (lower addresses) of bank B up to the intermediate address.

この場合、上述したように、ラーニングモジュール２０の各ゲイン算出モジュール２１は、読み出した学習データの各特徴量をそれぞれ自身のＳＲＡＭに格納して、各しきい値における分岐スコアを算出する。そして、ラーニングモジュール２０の最適条件導出モジュール２２は、各ゲイン算出モジュール２１により出力された各特徴量に対応する各分岐スコアを入力し、分岐スコアが最大となる特徴量の番号（特徴量番号）およびしきい値を導出する。そして、最適条件導出モジュール２２は、導出した特徴量番号およびしきい値を、対応するノード（デプス１、ノード０）の分岐条件データとして、モデルメモリ４０へ書き込む。この際、最適条件導出モジュール２２は、ノード（デプス１、ノード０）からさらに分岐されることを示すためにリーフフラグを「０」として、当該ノードのデータ（分岐条件データの一部としてもよい）をモデルメモリ４０へ書き込む。 In this case, as described above, each gain calculation module 21 of the learning module 20 stores the corresponding feature of the read learning data in its own SRAM and calculates the branch score at each threshold value. The optimum condition derivation module 22 of the learning module 20 then receives the branch scores for the respective features output by the gain calculation modules 21 and derives the number of the feature (feature number) and the threshold value that maximize the branch score. The optimum condition derivation module 22 then writes the derived feature number and threshold value to the model memory 40 as the branch condition data of the corresponding node (depth 1, node 0). At this time, the optimum condition derivation module 22 sets the leaf flag to "0" to indicate that the node (depth 1, node 0) branches further, and writes the data of the node (which may be part of the branch condition data) to the model memory 40.

以上の動作について、ラーニングモジュール２０は、バンクＢの左側（アドレスが低い方）から中間アドレスまで順に指定し、当該アドレスによって、フィーチャメモリ３２から各学習データを読み出して行う。 The learning module 20 performs the above operation by specifying the addresses in order from the left side (lower addresses) of bank B up to the intermediate address and reading each learning data from the feature memory 32 using those addresses.

<Data branch processing at depth 1, node 0>
FIG. 9 is a diagram showing the operation of the modules at the time of branching at depth 1, node 0 of the learning identification device according to the embodiment.

As shown in FIG. 9, the control unit 11 transmits a start address and an end address to the classification module 50 based on the intermediate address received from the classification module 50 in the processing at depth 0, and triggers the classification module 50 to start processing. Based on the start address and the end address, the classification module 50 specifies the addresses of the target learning data from the left side of the pointer memory 31 (bank B), and uses those addresses to read the learning data (features) from the feature memory 32. The classification module 50 also reads the branch condition data (feature number, threshold) of the corresponding node (depth 1, node 0) from the model memory 40. In accordance with the branch condition data, the classification module 50 then determines whether each piece of read sample data branches to the left or to the right of the node (depth 1, node 0) and, according to the result, writes the address of that learning data in the feature memory 32 to the bank of the pointer memory 31 other than the read bank (here, bank B; the bank area for reading), namely the write bank (here, bank A; the bank area for writing).

At this time, when the classification module 50 determines that the learning data branches to the left of the node, it writes the address of that learning data to bank A in order from the lowest address, as shown in FIG. 9; when it determines that the learning data branches to the right of the node, it writes the address to bank A in order from the highest address. As a result, in the write bank (bank A), the addresses of learning data that branched to the left are written cleanly to the low-address side and the addresses of learning data that branched to the right to the high-address side. Alternatively, the write bank may hold the addresses of left-branching learning data on the high-address side and those of right-branching learning data on the low-address side.

The classification module 50 performs the above branch processing on the learning data specified by the addresses written to the left of the intermediate address in bank B, out of all the learning data. However, the branch processing does not necessarily divide the learning data equally between the left and right sides of the node (depth 1, node 0), so the classification module 50 returns to the control unit 11 the address in the write bank (bank A) that marks the boundary between the addresses of the left-branching learning data and those of the right-branching learning data (the intermediate address). This intermediate address is used in the next branch processing.
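The two-ended write into the write bank can be illustrated with a short software sketch. The names below are hypothetical, the rule "feature value below the threshold goes left" is an assumed comparison direction, and returning the index of the first right-hand slot as the intermediate address is likewise an assumption, since the description only says that the address lies between the two groups.

```python
def partition_addresses(read_bank, start, end, features, feature_no, threshold, write_bank):
    """Copy pointers read_bank[start..end] into write_bank[start..end].

    Left-branching pointers are written upward from `start`, right-branching
    pointers downward from `end`. Returns the boundary index between the groups.
    """
    lo, hi = start, end
    for i in range(start, end + 1):
        addr = read_bank[i]                 # address into the feature memory
        if features[addr][feature_no] < threshold:
            write_bank[lo] = addr           # left branch: fill from the low side
            lo += 1
        else:
            write_bank[hi] = addr           # right branch: fill from the high side
            hi -= 1
    return lo                               # first slot of the right-hand group


# Six samples with one feature each; split on feature 0 at threshold 4.
feature_memory = [[5], [1], [7], [2], [9], [0]]
bank_b = [0, 1, 2, 3, 4, 5]                 # read bank
bank_a = [None] * 6                         # write bank
mid = partition_addresses(bank_b, 0, 5, feature_memory, 0, 4, bank_a)
```

Because only pointers are moved, the feature memory itself is never rewritten, and swapping the roles of the two banks at the next depth reuses the same storage.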

<Determination of branch condition data at depth 1, node 1>
FIG. 10 is a diagram showing the operation of the modules when determining the node parameters at depth 1, node 1 of the learning identification device according to the embodiment. As in the case of FIG. 8, this node is in the same layer as the node at depth 1, node 0, so bank B is the read bank and bank A is the write bank (see FIG. 11).

As shown in FIG. 10, the control unit 11 transmits a start address and an end address to the learning module 20 based on the intermediate address received from the classification module 50 in the processing at depth 0, and triggers the learning module 20 to start processing. Based on the start address and the end address, the learning module 20 specifies the addresses of the target learning data from the pointer memory 31 (bank B), and uses those addresses to read the learning data (features) from the feature memory 32 and the state information (w, g, h) from the state memory 33. Specifically, as shown in FIG. 10, the learning module 20 specifies addresses in order from the right side of bank B (the higher addresses) down to the intermediate address.

In this case, as described above, each gain calculation module 21 of the learning module 20 stores the corresponding feature of the read learning data in its own SRAM and calculates the branch score at each threshold. The optimal condition derivation module 22 of the learning module 20 then receives the branch scores for each feature output by the gain calculation modules 21, and derives the feature number and the threshold for which the branch score is maximal. The optimal condition derivation module 22 writes the derived feature number and threshold to the model memory 40 as the branch condition data of the corresponding node (depth 1, node 1). At this time, the optimal condition derivation module 22 sets the leaf flag to "0" to indicate that the node (depth 1, node 1) branches further, and writes the node data (which may be part of the branch condition data) to the model memory 40.

The learning module 20 performs the above operations by specifying addresses in order from the right side of bank B (the higher addresses) down to the intermediate address and reading each piece of learning data from the feature memory 32 using those addresses.

<Data branch processing at depth 1, node 1>
FIG. 11 is a diagram showing the operation of the modules at the time of branching at depth 1, node 1 of the learning identification device according to the embodiment.

As shown in FIG. 11, the control unit 11 transmits a start address and an end address to the classification module 50 based on the intermediate address received from the classification module 50 in the processing at depth 0, and triggers the classification module 50 to start processing. Based on the start address and the end address, the classification module 50 specifies the addresses of the target learning data from the right side of the pointer memory 31 (bank B), and uses those addresses to read the learning data (features) from the feature memory 32. The classification module 50 also reads the branch condition data (feature number, threshold) of the corresponding node (depth 1, node 1) from the model memory 40. In accordance with the branch condition data, the classification module 50 then determines whether each piece of read sample data branches to the left or to the right of the node (depth 1, node 1) and, according to the result, writes the address of that learning data in the feature memory 32 to the bank of the pointer memory 31 other than the read bank (here, bank B; the bank area for reading), namely the write bank (here, bank A; the bank area for writing).

At this time, when the classification module 50 determines that the learning data branches to the left of the node, it writes the address of that learning data to bank A in order from the lowest address, as shown in FIG. 11; when it determines that the learning data branches to the right of the node, it writes the address to bank A in order from the highest address. As a result, in the write bank (bank A), the addresses of learning data that branched to the left are written cleanly to the low-address side and the addresses of learning data that branched to the right to the high-address side. Alternatively, the write bank may hold the addresses of left-branching learning data on the high-address side and those of right-branching learning data on the low-address side. In that case, the operation in FIG. 9 must be changed to match.

The classification module 50 performs the above branch processing on the learning data specified by the addresses written to the right of the intermediate address in bank B, out of all the learning data. However, the branch processing does not necessarily divide the learning data equally between the left and right sides of the node (depth 1, node 1), so the classification module 50 returns to the control unit 11 the address in the write bank (bank A) that marks the boundary between the addresses of the left-branching learning data and those of the right-branching learning data (the intermediate address). This intermediate address is used in the next branch processing.

<When no branching occurs in determining the branch condition data at depth 1, node 1>
FIG. 12 is a diagram showing the operation of the modules when, as a result of determining the node parameters at depth 1, node 1 of the learning identification device according to the embodiment, the node does not branch. As in the case of FIG. 8, this node is in the same layer as the node at depth 1, node 0, so bank B is the read bank.

As shown in FIG. 12, the control unit 11 transmits a start address and an end address to the learning module 20 based on the intermediate address received from the classification module 50 in the processing at depth 0, and triggers the learning module 20 to start processing. Based on the start address and the end address, the learning module 20 specifies the addresses of the target learning data from the pointer memory 31 (bank B), and uses those addresses to read the learning data (features) from the feature memory 32 and the state information (w, g, h) from the state memory 33. Specifically, as shown in FIG. 12, the learning module 20 specifies addresses in order from the right side of bank B (the higher addresses) down to the intermediate address.

When the learning module 20 determines from the calculated branch scores and the like that the node (depth 1, node 1) will not branch any further, it sets the leaf flag to "1", writes the node data (which may be part of the branch condition data) to the model memory 40, and also notifies the control unit 11 that the leaf flag of the node is "1". It is thereby recognized that no branching occurs into layers below the node (depth 1, node 1). Furthermore, when the leaf flag of the node (depth 1, node 1) is "1", the learning module 20 writes the leaf weight (w) (which may be part of the branch condition data) to the model memory 40 in place of the feature number and the threshold. This allows the capacity of the model memory 40 to be smaller than if separate storage were provided for each.
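The idea of one node record whose payload is either a branch condition or a leaf weight, selected by the leaf flag, might look like the following in software. The class and function names are hypothetical and the record layout is an assumption; the description does not specify field widths or encoding.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class NodeData:
    """One model-memory entry; the payload slot is shared by both node kinds."""
    leaf_flag: int                                # 0: branches further, 1: leaf
    payload: Union[Tuple[int, float], float]      # (feature_no, threshold) or leaf weight


def make_branch_node(feature_no: int, threshold: float) -> NodeData:
    return NodeData(leaf_flag=0, payload=(feature_no, threshold))


def make_leaf_node(leaf_weight: float) -> NodeData:
    # The leaf weight occupies the slot that would otherwise hold the condition.
    return NodeData(leaf_flag=1, payload=leaf_weight)


def describe(node: NodeData) -> str:
    if node.leaf_flag == 1:
        return f"leaf, w={node.payload}"
    feature_no, threshold = node.payload
    return f"branch on feature {feature_no} at threshold {threshold}"


branch = make_branch_node(3, 0.5)
leaf = make_leaf_node(-0.25)
```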

Proceeding with the processing shown in FIGS. 6 to 12 layer by layer (depth by depth) completes the whole decision tree (the decision tree is learned).

<When learning of a decision tree is complete>
FIG. 13 is a diagram showing the operation of the modules when the state information of all the sample data is updated after learning of a decision tree is complete in the learning identification device according to the embodiment.

When learning of one decision tree constituting the GBDT is complete, the first-order gradient g and the second-order gradient h of the error function for each piece of learning data, and the leaf weight w for each piece of learning data, must be calculated for use in boosting (here, gradient boosting) to the next decision tree. As shown in FIG. 13, the control unit 11 triggers the classification module 50 to start this calculation. The classification module 50 performs branch determination at the nodes of every depth (layer) for all the learning data and calculates the leaf weight corresponding to each piece of learning data. It then calculates the state information (w, g, h) from the calculated leaf weight based on the label information, and writes it back to the original address in the state memory 33. The next decision tree is learned using the state information updated in this way.
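As a rough illustration of this state update, the sketch below assumes a binary logistic loss and assumes that w accumulates the leaf weights of the trees learned so far; neither assumption is stated in this passage, and the function name and the learning_rate parameter are hypothetical.

```python
import math


def update_state(w, leaf_weight, label, learning_rate=1.0):
    """Return the updated (w, g, h) for one piece of learning data.

    Assumes logistic loss with label in {0, 1}: g = p - label, h = p * (1 - p),
    where p is the sigmoid of the accumulated score w.
    """
    w_new = w + learning_rate * leaf_weight   # accumulate this tree's leaf weight
    p = 1.0 / (1.0 + math.exp(-w_new))        # predicted probability
    g = p - label                             # first-order gradient
    h = p * (1.0 - p)                         # second-order gradient
    return w_new, g, h


state = update_state(0.0, 0.0, 1)
```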

As described above, in the learning identification device 1 according to the present embodiment, the learning module 20 includes a separate memory (for example, an SRAM) for reading each feature of the input sample data. All the features of the sample data can therefore be read in a single access, and the gain calculation modules 21 can process all the features at once, which dramatically increases the speed of decision tree learning.

In the learning identification device 1 according to the present embodiment, the pointer memory 31 is configured with two banks, bank A and bank B, which are read and written alternately. This allows memory to be used efficiently. A simpler approach would be to give the feature memory 32 and the state memory 33 two banks each, but because the data indicating an address in the feature memory 32 is generally smaller than the sample data itself, preparing the pointer memory 31 and specifying addresses indirectly, as in the present embodiment, reduces memory usage. In addition, when the classification module 50 determines that learning data branches to the left of a node, it writes the address of that learning data to the write bank of the two banks in order from the lowest address; when it determines that the learning data branches to the right of the node, it writes the address to the write bank in order from the highest address. As a result, in the write bank, the addresses of learning data that branched to the left are written cleanly to the low-address side and the addresses of learning data that branched to the right to the high-address side.

(Modification)
FIG. 14 is a diagram showing an example of the configuration of the model memory of the learning identification device according to a modification. With reference to FIG. 14, a configuration in which the model memory 40 of the learning identification device 1 according to this modification has a separate memory for each depth (layer) of the decision tree will be described.

As shown in FIG. 14, the model memory 40 of the learning identification device 1 according to this modification has a depth 0 memory 41_1, a depth 1 memory 41_2, ..., and a depth (m-1) memory 41_m, which store the model data of the learned decision tree (specifically, the branch condition data) separately for each depth (layer). Here, m is a number at least equal to the number of depths (layers) of the decision tree model. In other words, the model memory 40 has independent ports for simultaneously retrieving the data for each depth (depth 0 node data, depth 1 node data, ..., depth (m-1) node data) of the learned decision tree model. This lets the classification module 50 read the data (branch condition data) for the next node based on the branch result at the first node in the decision tree, do so in parallel at all depths (layers), and execute the branch processing at every depth simultaneously in one clock for one piece of sample data (identification data) without passing through memory along the way (pipeline processing). The identification processing in the classification module 50 therefore takes only a time proportional to the number of sample data, which dramatically increases its speed. In conventional techniques, by contrast, the sample data is copied to a new memory area at every node, so the memory read and write time affects the speed and the identification time becomes (number of sample data × number of depths (layers)); the identification processing of this modification is thus far superior, as described above.

FIG. 15 is a diagram showing an example of the configuration of the classification module of the learning identification device according to the modification. As shown in FIG. 15, the classification module 50 has a node 0 discriminator 51_1, a node 1 discriminator 51_2, a node 2 discriminator 51_3, and so on. The feature memory 32 supplies one piece of sample data per clock as features. As shown in FIG. 15, the features are first input to the node 0 discriminator 51_1, which receives the data for its node (depth 0 node data: the condition for going right or left, and the feature number to use) from the corresponding depth 0 memory 41_1 of the model memory 40. The node 0 discriminator 51_1 determines, according to that condition, whether the corresponding sample data goes right or left. Here, each depth memory (the depth 0 memory 41_1, the depth 1 memory 41_2, the depth 2 memory 41_3, ...) is assumed to have a latency of one clock. The result of the node 0 discriminator 51_1 addresses which node to go to in the next depth 1 memory 41_2, and the data of the corresponding node (depth 1 node data) is extracted and input to the node 1 discriminator 51_2.

Because the depth 0 memory 41_1 has a latency of one clock, the features are likewise delayed by one clock before being input to the node 1 discriminator 51_2. In the same clock, the features of the next sample data are input to the node 0 discriminator 51_1. By performing identification as pipeline processing in this way, one decision tree as a whole can identify one piece of sample data per clock, on the premise that the memories for all depths output simultaneously. The depth 0 memory 41_1 needs only one address because depth 0 has only one node; the depth 1 memory 41_2 needs two addresses because depth 1 has two nodes; likewise, the depth 2 memory 41_3 needs four addresses and the depth 3 memory (not shown) needs eight. Although this classification module 50 identifies the whole tree, when a node is being learned the same circuit can be reused by using only the node 0 discriminator 51_1, which reduces the circuit scale.
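A small clock-by-clock simulation can make the pipelining concrete. It is a software sketch only: the function name, the list-of-lists layout of the per-depth memories, the child numbering (2 × node for left, 2 × node + 1 for right), and the rule "feature value below the threshold goes left" are all assumptions made for illustration.

```python
def classify_pipelined(samples, depth_memories):
    """Simulate one discriminator stage per depth, one new sample per clock.

    depth_memories[d][node] = (feature_no, threshold) for the node at depth d.
    Returns (leaf index per sample, number of simulated clocks).
    """
    n_stages = len(depth_memories)
    in_flight = [None] * n_stages            # in_flight[d] = (sample_id, node) at stage d
    leaves = [None] * len(samples)
    clocks = 0
    next_sample = 0
    while next_sample < len(samples) or any(x is not None for x in in_flight):
        advanced = [None] * n_stages
        for d in range(n_stages):            # all stages work in the same clock
            if in_flight[d] is None:
                continue
            sample_id, node = in_flight[d]
            feature_no, threshold = depth_memories[d][node]
            go_right = 0 if samples[sample_id][feature_no] < threshold else 1
            child = 2 * node + go_right
            if d + 1 < n_stages:
                advanced[d + 1] = (sample_id, child)
            else:
                leaves[sample_id] = child    # result leaves the last stage
        if next_sample < len(samples):
            advanced[0] = (next_sample, 0)   # a new sample enters stage 0
            next_sample += 1
        in_flight = advanced
        clocks += 1
    return leaves, clocks


# Depth 0 has 1 node, depth 1 has 2 nodes (2**d addresses per depth memory).
memories = [[(0, 5)], [(1, 2), (1, 8)]]
data = [(3, 1), (3, 4), (7, 9), (7, 0)]
leaf_ids, n_clocks = classify_pipelined(data, memories)
```

In this model the clock count is the number of samples plus the number of depths, rather than their product, which is the advantage the passage attributes to keeping a separate memory per depth.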

The following describes the predicted learning speed of the learning identification device 1 according to the embodiment described above.

First, for comparison, the learning speeds of XGBoost and LightGBM mentioned above, which are representative GBDT libraries, were evaluated. As of December 2017, LightGBM with a GPU was regarded as the fast configuration, and this configuration was measured.

The processing time was calculated from the clock of the hardware configuration. The hardware logic implemented here has three main processes: learning by the learning module 20, identification by the classification module 50 (per node), and identification by the classification module 50 (per tree).

<Processing of the learning module>
Here, creating the gradient histograms from the features of the sample data and calculating the branch scores are dominant. Creating the gradient histograms requires reading all the sample data at every depth (layer). Because learning finishes for some sample data while the tree is still shallow, this estimate is a maximum. Calculating the branch scores refers to every bin of the gradient histogram and therefore takes as many clocks as there are bins (the feature dimension). The number of clocks C_{learning} for the processing of the learning module 20 is therefore expressed by the following equation (23).
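The body of equation (23) is not reproduced in this text. Under the description above (one read of every training sample per depth, plus one clock per bin for each node), a form consistent with it would be the following; this is an inferred reconstruction, not the original notation:

```latex
C_{learning} = \left( n_{sample\_train} \times maxdepth \right) + \left( n_{feature} \times n_{node} \right) \tag{23}
```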

Here, n_{sample_train} is the number of sample data used to learn the decision tree, generally a set subsampled from all the sample data. maxdepth is the maximum depth of the decision tree, n_{feature} is the number of bins (the feature dimension), and n_{node} is the number of nodes.

<Processing of the classification module (per node)>
Here, the results of the learned node are used to assign each piece of sample data to the lower node on the left or on the right. Since the total number of sample data processed at each depth does not change, the number of clocks C_{Classification_node} is expressed by the following equation (24). In practice, learning ends partway at some nodes, so the estimate below is a maximum.
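The body of equation (24) is not reproduced in this text. Given that the same number of training samples is processed at every depth, a form consistent with the description would be the following inferred reconstruction:

```latex
C_{Classification\_node} = n_{sample\_train} \times maxdepth \tag{24}
```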

<Processing of the classification module (per tree)>
Here, after learning of one decision tree is complete, the gradient information is updated for each piece of sample data in preparation for learning the next decision tree. This requires making a prediction for all the sample data using the learned decision tree. Per-tree processing incurs a delay equal to the depth. The number of clocks C_{Classification_tree} in this case is expressed by the following equation (25).
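The body of equation (25) is not reproduced in this text. With one sample entering the pipeline per clock and a delay equal to the depth, a form consistent with the description would be the following; the symbol n_{sample_all} for the total number of sample data is an assumed name:

```latex
C_{Classification\_tree} = n_{sample\_all} + maxdepth \tag{25}
```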

Here, all the sample data means the total of all the learning sample data before subsampling and all the validation sample data.

From the above, the number of clocks C_{tree} (maximum) required to learn one decision tree is expressed by the following equation (26).
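The body of equation (26) is not reproduced in this text. Since the three processes described above run for each tree, a form consistent with the description is their sum (an inferred reconstruction):

```latex
C_{tree} = C_{learning} + C_{Classification\_node} + C_{Classification\_tree} \tag{26}
```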

Since a GBDT consists of many decision trees, if the number of decision trees is n_{tree}, the number of clocks C_{gbdt} for the whole GBDT model is expressed by the following equation (27).
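The body of equation (27) is not reproduced in this text. With every tree costing at most C_{tree} clocks, a form consistent with the description would be the following inferred reconstruction:

```latex
C_{gbdt} = C_{tree} \times n_{tree} \tag{27}
```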

The above is an estimate for the feature parallel case described earlier. In so-called data parallel operation, where many of these modules are arranged in parallel and the data is divided among them, a speedup of essentially the number of modules is possible provided that the number of data at each node is balanced across the modules. How much imbalance arises depends on the sample data and on how the sample data is divided among the modules, so this overhead will be examined with real data in future. Even allowing for this overhead, an efficiency of 50% or more is expected.

<Data used>
The test sample data consists of learning data and identification data (evaluation data) selected at random from about 100,000 records. An outline of the data set is given below.

・ Number of classes: 2
・ Feature dimension: 129
・ Number of learning data: 63415
・ Number of evaluation data: 31707

The speed measurement conditions are shown in (Table 1) below. The FPGA clock frequency is provisionally assumed to be 100 [MHz] (in practice it is likely to be higher).
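To show how a clock count turns into a time at the assumed 100 MHz, the sketch below adds up the cost terms as the text describes them: one read of each training sample per depth plus one clock per bin per node for learning, one pass over the training samples per depth for per-node classification, and one pass over all samples plus the depth for per-tree classification. How the terms combine is inferred from the prose, and the values of maxdepth, n_node and n_tree in the example are placeholders, not the measurement conditions of (Table 1).

```python
def gbdt_clocks(n_sample_train, n_sample_all, n_feature, n_node, maxdepth, n_tree):
    """Upper-bound clock count for learning a whole GBDT model (inferred formula)."""
    c_learning = n_sample_train * maxdepth + n_feature * n_node
    c_classification_node = n_sample_train * maxdepth
    c_classification_tree = n_sample_all + maxdepth
    c_tree = c_learning + c_classification_node + c_classification_tree
    return c_tree * n_tree


def clocks_to_seconds(clocks, clock_hz=100e6):
    """Convert a clock count to seconds at the given clock frequency."""
    return clocks / clock_hz


# Data set sizes from the text; maxdepth, n_node and n_tree are placeholder values.
estimated_clocks = gbdt_clocks(
    n_sample_train=63415,
    n_sample_all=63415 + 31707,
    n_feature=129,
    n_node=63,
    maxdepth=6,
    n_tree=100,
)
estimated_seconds = clocks_to_seconds(estimated_clocks)
```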

<Estimation of hardware logic>
The following (Table 2) shows the estimated learning speed for the architecture described above, using the speed formulas given above. Note that this estimate assumes that all sample data reaches the terminal branches, so it is a worst-case value.

<Comparison results including actual measurements on CPU and GPU>
The measured results on CPU and GPU are shown in (Table 3) below. For comparison, the hard logic estimates are also included. Since the estimates so far cover only feature parallel (Feature Parallel), the estimate for the case where data parallel (Data Parallel) is also used is added for reference.

For this data, the GPU turns out to be slower than the CPU. Microsoft, the developer of LightGBM, states that using a GPU gives a speedup of about 3 to 10 times but that this depends heavily on the data; for this data, GPU acceleration was not effective. This result also indicates that the GBDT algorithm is not as easy to accelerate on a GPU as a CNN. On the CPU, the later LightGBM is about 10 times faster than XGBoost, the most basic library. Hard logic with feature parallel (Feature Parallel) alone is about 2.3 times faster than the fastest CPU result on a PC (Personal Computer), which is LightGBM. When 15-way data parallel (Data Parallel) is also used, the estimated speed is 25 times or more even with a data parallel efficiency of 75%. For an AWS f1.16xlarge instance with 240-way parallelism and an assumed efficiency of 50%, the estimated speed is 275 times or more. However, this estimate assumes that memory bandwidth is the limiting factor, and whether this much logic fits in the FPGA requires further study.
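The speedup figures quoted above can be checked with a short calculation. The sketch below assumes that the data parallel gain simply multiplies the feature parallel gain by the module count and the efficiency; the function name `combined_speedup` is illustrative and does not come from the source.

```python
def combined_speedup(feature_parallel_gain: float, n_modules: int, efficiency: float) -> float:
    """Speedup over the CPU baseline when data parallel is layered on feature parallel.

    Assumes the gain scales as (module count) x (efficiency), as the text suggests.
    """
    return feature_parallel_gain * n_modules * efficiency


# About 2.3x: feature parallel hard logic versus LightGBM on the CPU (from the text).
BASE_GAIN = 2.3

speedup_15 = combined_speedup(BASE_GAIN, n_modules=15, efficiency=0.75)
speedup_240 = combined_speedup(BASE_GAIN, n_modules=240, efficiency=0.50)

print(f"15 modules at 75% efficiency: about {speedup_15:.1f}x")
print(f"240 modules at 50% efficiency: about {speedup_240:.1f}x")
```

Both results land just above the thresholds stated in the text (25 times and 275 times).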

Power consumption is predicted to be several [W] for the FPGA, whereas it is 100 [W] or more for the CPU and GPU. Because the power consumption differs by two orders of magnitude in addition to the difference in speed, the difference in power efficiency could be three orders of magnitude or more.

1 Learning identification device
10 CPU
11 Control unit
20 Learning module
21, 21_1, 21_2 Gain calculation module
22 Optimal condition derivation module
30 Data memory
31 Pointer memory
32 Feature memory
33 State memory
40 Model memory
41_1 Memory for depth 0
41_2 Memory for depth 1
41_3 Memory for depth 2
50 Classification module
51_1 Node 0 discriminator
51_2 Node 1 discriminator
51_3 Node 2 discriminator

Japanese Patent No. 5032602

Claims

A learning identification device comprising:
a data memory that stores learning data for learning a decision tree;
a learning unit that learns the decision tree by reading each feature amount of the learning data from the data memory and deriving data of a node based on each feature amount; and
an identification unit that determines, based on the data of the node derived by the learning unit, to which side of the node the learning data read from the data memory branches,
wherein the data memory has at least two bank areas for storing addresses of the learning data,
the at least two bank areas are switched between a read bank area and a write bank area each time the hierarchy level of the node to be learned changes,
the learning unit reads the addresses of the learning data branched at the node from the read bank area and reads the learning data from the area of the data memory indicated by those addresses, and
the identification unit writes the addresses of the learning data branched at the node to the write bank area.

The learning identification device according to claim 1, wherein the identification unit
writes the addresses of the learning data branched from the node to one of the lower nodes to the write bank area in ascending order, starting from the smallest address of the write bank area, and
writes the addresses of the learning data branched from the node to the other lower node to the write bank area in descending order, starting from the largest address of the write bank area.
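The bank switching and two-ended address writing recited in the two claims above can be illustrated with a small software model. This is a minimal sketch under stated assumptions, not the claimed hardware: the class `PointerMemory`, its methods, and the sample feature values are all hypothetical names introduced for illustration.

```python
from typing import Callable, List


class PointerMemory:
    """Two address banks that swap read/write roles at every tree depth (software model)."""

    def __init__(self, n_samples: int) -> None:
        # Bank 0 initially holds every sample address; bank 1 is the first write bank.
        self.banks: List[List[int]] = [list(range(n_samples)), [0] * n_samples]
        self.read_idx = 0

    def split_node(self, start: int, end: int, goes_left: Callable[[int], bool]) -> int:
        """Partition the addresses in read-bank range [start, end) into the write bank.

        Left-branch addresses are written upward from `start`, right-branch
        addresses downward from `end - 1`. Returns the first right-branch index.
        """
        read = self.banks[self.read_idx]
        write = self.banks[1 - self.read_idx]
        lo, hi = start, end - 1
        for i in range(start, end):
            addr = read[i]
            if goes_left(addr):
                write[lo] = addr
                lo += 1
            else:
                write[hi] = addr
                hi -= 1
        return lo

    def swap_banks(self) -> None:
        """Exchange the read and write roles before learning the next depth."""
        self.read_idx = 1 - self.read_idx


# Hypothetical single feature with a branch condition "value < 4".
feature = [5, 1, 7, 3, 9, 2]
mem = PointerMemory(len(feature))
boundary = mem.split_node(0, len(feature), lambda addr: feature[addr] < 4)
mem.swap_banks()

read_bank = mem.banks[mem.read_idx]
left_addrs, right_addrs = read_bank[:boundary], read_bank[boundary:]
print(left_addrs, right_addrs)
```

Writing from both ends keeps the addresses of each lower node contiguous without knowing in advance how many samples go to each side, so the next depth can address each node by a start and end position in the read bank.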

The learning identification device according to claim 1 or 2, wherein the learning unit learns the decision tree by reading all the feature amounts contained in one item of the learning data from the data memory in a single access and deriving the data of the node based on each feature amount.

The learning identification device according to claim 3, wherein the learning unit
has at least as many memories as the number of feature amounts of the learning data, and
stores a histogram of each feature amount of the learning data read from the data memory in the corresponding memory, and performs the processing based on each feature amount at once.

The learning identification device according to claim 4, wherein the learning unit includes:
a gain calculation unit, provided for each of the memories, that calculates a branch score for each threshold value based on the histogram of the feature amount stored in that memory; and
a derivation unit that derives, as the data of the node, the feature amount number and the threshold value for which the branch score is optimal.
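The per-feature histogram and per-threshold branch score described in the two claims above can be sketched in software as follows. The branch score formula is not given in this section; the sketch assumes the common gradient boosting gain based on first-order and second-order gradient sums with a regularization term `lam`. The function name `best_split`, the pre-binned feature layout, and the sample values are illustrative assumptions. In the claimed device each feature has its own memory and gain calculation unit so the features are processed at once, whereas this sketch loops over them sequentially.

```python
from typing import List, Optional, Tuple


def best_split(
    binned: List[List[int]],   # binned[i][f]: bin index of feature f for sample i
    grads: List[float],        # first-order gradient per sample
    hessians: List[float],     # second-order gradient per sample
    n_bins: int,
    lam: float = 1.0,          # assumed regularization term
) -> Tuple[float, Optional[int], Optional[int]]:
    """Return (branch score, feature number, threshold bin) with the highest score.

    Samples whose bin index is <= the threshold bin go to the left child.
    """
    g_total, h_total = sum(grads), sum(hessians)
    parent = g_total ** 2 / (h_total + lam)
    best: Tuple[float, Optional[int], Optional[int]] = (float("-inf"), None, None)

    for f in range(len(binned[0])):
        # One gradient histogram per feature (a dedicated memory in hardware).
        g_hist = [0.0] * n_bins
        h_hist = [0.0] * n_bins
        for row, g, h in zip(binned, grads, hessians):
            g_hist[row[f]] += g
            h_hist[row[f]] += h

        # Scan thresholds with running left-side sums.
        g_left = h_left = 0.0
        for t in range(n_bins - 1):
            g_left += g_hist[t]
            h_left += h_hist[t]
            g_right, h_right = g_total - g_left, h_total - h_left
            score = (
                g_left ** 2 / (h_left + lam)
                + g_right ** 2 / (h_right + lam)
                - parent
            )
            if score > best[0]:
                best = (score, f, t)
    return best


# Feature 0 separates negative from positive gradients; feature 1 does not.
binned = [[0, 1], [0, 0], [3, 1], [3, 0]]
score, feature_no, threshold = best_split(
    binned, grads=[-1.0, -1.0, 1.0, 1.0], hessians=[1.0] * 4, n_bins=4
)
print(feature_no, threshold, round(score, 3))
```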

The learning identification device according to any one of claims 1 to 5, wherein the identification unit reads out the label information of the learning data together with the feature amount of the learning data from the data memory.

The learning identification device according to any one of claims 1 to 6, wherein the learning unit learns the next decision tree by gradient boosting based on the learning result of the learned decision tree.

The learning identification device according to claim 7, wherein, so that the learning unit can learn the next decision tree by the gradient boosting, the identification unit calculates a first-order gradient and a second-order gradient corresponding to the error function of each item of the learning data, and a leaf weight for each item of the learning data, and writes them to the data memory.

The learning identification device according to any one of claims 1 to 8, wherein the identification unit performs the operation of identifying identification data at the time of identification by sharing the configuration that performs the operation of identifying the learning data at the time of learning.

The learning identification device according to any one of claims 1 to 9, further comprising a model memory that stores the data of the node of the decision tree,
wherein the learning unit
writes, when the node to be learned is to be branched further, flag information indicating that fact to the model memory as the data of the node, and
writes, when the node to be learned is not to be branched further, flag information indicating that fact to the model memory as the data of the node.

The learning identification device according to any one of claims 1 to 10, wherein the learning unit performs learning using a subset of all the feature amounts of the learning data.

The learning identification device according to any one of claims 1 to 10, wherein the learning unit performs learning using a part of the learning data among all the learning data.

The learning identification device according to any one of claims 1 to 12, wherein at least the data memory, the learning unit, and the identification unit are configured on an FPGA (Field-Programmable Gate Array).

A learning identification method comprising:
a learning step of learning a decision tree by reading each feature amount of learning data from a data memory that stores the learning data for learning the decision tree and deriving data of a node based on each feature amount;
an identification step of determining, based on the derived data of the node, to which side of the node the learning data read from the data memory branches;
a step of switching at least two bank areas, which the data memory has for storing addresses of the learning data, between a read bank area and a write bank area each time the hierarchy level of the node to be learned changes;
a step of reading the addresses of the learning data branched at the node from the read bank area and reading the learning data from the area of the data memory indicated by those addresses; and
a step of writing the addresses of the learning data branched at the node to the write bank area.