JP6659618B2

JP6659618B2 - Analysis apparatus, analysis method and analysis program

Info

Publication number: JP6659618B2
Application number: JP2017091187A
Authority: JP
Inventors: 哲哉塩田; 一樹及川; 雅人澤田; 拓郎宇田川
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2017-05-01
Filing date: 2017-05-01
Publication date: 2020-03-04
Anticipated expiration: 2037-05-01
Also published as: JP2018190130A

Description

The present invention relates to an analysis device, an analysis method, and an analysis program.

In recent years, data analysis using machine learning has been applied in a growing number of cases. However, acquiring the knowledge of statistics and machine learning that data analysis requires takes medium- to long-term training. Techniques that support data analysis have therefore been disclosed so that non-experts can readily engage in data analysis without acquiring that knowledge.

For example, a method is known that evaluates the accuracy of each pipeline using sequential model-based optimization (SMBO) and searches for an optimal pipeline (see, for example, Non-Patent Documents 1 and 2). Here, a pipeline is a series of processes for constructing a prediction model, and includes preprocessing of the input data, learning of the data based on hyperparameters, and the like. A technique is also known that presents to the user, from among a large number of pipelines designed in advance by experts, a small number of pipelines suited to the data to be analyzed.

In general, as the number of candidate settings for the data preprocessing and prediction algorithms that make up a pipeline increases, it becomes more likely that a pipeline capable of constructing a prediction model with higher prediction accuracy will be found, but the time required for the search also increases.

Non-Patent Document 1: Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Tobias Springenberg, Manuel Blum, Frank Hutter, "Efficient and Robust Automated Machine Learning", NIPS'15 Proceedings of the 28th International Conference on Neural Information Processing Systems, December 2015, pp. 2755-2763
Non-Patent Document 2: Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, Ameet Talwalkar, "Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization", arXiv:1603.06560v3, cs.LG, November 2016

However, conventional techniques for automating data analysis have the problem that they may fail to construct a prediction model with high prediction accuracy while shortening the time required for the pipeline search. For example, the search time could be reduced simply by reducing the candidate settings to be searched, but in that case the search may be insufficient and the accuracy of the prediction model may decrease.

The analysis device of the present invention includes: an extraction unit that, when data is input, extracts a predetermined ratio of the input data as sample data and, when the predetermined ratio is increased, further extracts the increased predetermined ratio of the input data as the sample data; an estimation unit that, each time the extraction unit extracts the sample data, estimates the time required to perform a predetermined process using the input data, based on the time taken to perform part of the predetermined process using the sample data, and increases the predetermined ratio if the estimated time is less than a preset time limit; a search unit that searches, from candidate combinations of setting contents applied to a plurality of processes executed when constructing a prediction model based on predetermined data, for combinations of setting contents such that the prediction accuracy of the prediction model based on the sample data, constructed when the combination is applied to the plurality of processes, satisfies a predetermined condition; and a selection unit that selects, from the combinations of setting contents found by the search unit, a combination of setting contents for which the prediction accuracy of the prediction model based on the input data, constructed when the combination is applied to the plurality of processes, is equal to or higher than a predetermined threshold.

According to the present invention, a prediction model with high prediction accuracy can be constructed while shortening the time required for the pipeline search.

FIG. 1 is a diagram illustrating an example of the configuration of the analyzer according to the first embodiment.
FIG. 2 is a diagram illustrating an example of the data configuration of the setting information according to the first embodiment.
FIG. 3 is a diagram illustrating an example of the data configuration of the predictor information according to the first embodiment.
FIG. 4 is a diagram for explaining an outline of the processing of the analyzer according to the first embodiment.
FIG. 5 is a diagram for explaining the pipeline search according to the first embodiment.
FIG. 6 is a diagram for explaining cross-validation according to the first embodiment.
FIG. 7 is a diagram for explaining the pipeline search according to the first embodiment.
FIG. 8 is a diagram for explaining the promising region according to the first embodiment.
FIG. 9 is a diagram for explaining pipeline selection according to the first embodiment.
FIG. 10 is a diagram for explaining the promising region according to the first embodiment.
FIG. 11 is a diagram for explaining the synthesis of prediction models according to the first embodiment.
FIG. 12 is a diagram for explaining the synthesis of prediction models according to the first embodiment.
FIG. 13 is a flowchart illustrating the processing flow of the analyzer according to the first embodiment.
FIG. 14 is a flowchart illustrating the processing flow of the analyzer according to the first embodiment.
FIG. 15 is a diagram for explaining the effect of the first embodiment.
FIG. 16 is a diagram for explaining the effect of the first embodiment.
FIG. 17 is a flowchart illustrating the processing flow of the analyzer according to the second embodiment.
FIG. 18 is a diagram illustrating an example of a computer that executes the analysis program.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings. The present invention is not limited by these embodiments. In the drawings, identical parts are denoted by identical reference numerals.

[First Embodiment]
[Configuration of the analyzer]
The configuration of the analyzer 10 will be described with reference to FIG. 1. FIG. 1 is a diagram illustrating an example of the configuration of the analyzer according to the first embodiment. As shown in FIG. 1, the analyzer 10 is realized by a general-purpose computer such as a workstation or a personal computer, and includes an input unit 11, an output unit 12, a communication control unit 13, a storage unit 14, and a control unit 15.

The input unit 11 is realized using input devices such as a keyboard and a mouse, and inputs various kinds of instruction information to the control unit 15 in response to input operations by an operator. The output unit 12 is realized by a display device such as a liquid crystal display, a printing device such as a printer, an information communication device, or the like, and outputs the results of data analysis and other information to the operator.

The communication control unit 13 is realized by a NIC (Network Interface Card) or the like, and controls communication between the control unit 15 and external devices, such as a management server, over a telecommunication line such as a LAN (Local Area Network) or the Internet.

The storage unit 14 is realized by a semiconductor memory element such as a RAM (Random Access Memory) or a flash memory, or by a storage device such as a hard disk or an optical disk. The storage unit 14 stores, in advance or temporarily each time processing is performed, the processing program that operates the analyzer 10, the data used while the processing program runs, and the like. The storage unit 14 may be configured to communicate with the control unit 15 via the communication control unit 13. The storage unit 14 also stores setting information 141 and predictor information 142.

The setting information 141 will now be described with reference to FIG. 2. FIG. 2 is a diagram illustrating an example of the data configuration of the setting information according to the first embodiment. As shown in FIG. 2, the setting information 141 includes an execution order and setting content candidates for each step. The setting content candidates are the candidate settings for the setting item corresponding to each step.

In the example of FIG. 2, the setting information 141 indicates that the steps are "missing value imputation method search", "normalization method search", and "feature selection method search". These steps correspond to data preprocessing.

In the example of FIG. 2, the setting information 141 indicates that the step "feature selection method search" is the third step to be executed. The setting information 141 also indicates that the candidate settings for the setting item corresponding to the step "feature selection method search" are "decision tree", "L1 regularization", "analysis of variance", and "no processing". In the example of FIG. 2, the setting item corresponding to the step "feature selection method search" is the method used for feature selection.

Next, the predictor information 142 will be described with reference to FIG. 3. FIG. 3 is a diagram illustrating an example of the data configuration of the predictor information according to the first embodiment. As shown in FIG. 3, the predictor information 142 includes default parameters for each predictor. The default parameters are the default values of the hyperparameters of each predictor. For example, the predictor information 142 indicates that the default value of hyperparameter N of predictor A_1 is 100.

The control unit 15 functions as an extraction unit 151, a search unit 152, an estimation unit 154, a synthesis unit 155, and a verification unit 156, as illustrated in FIG. 1, when an arithmetic processing device such as a CPU (Central Processing Unit) executes a processing program stored in memory. The search unit 152 includes a step selection unit 152a, an accuracy calculation unit 152b, and a setting content determination unit 152c. Each of these functional units, or some of them, may be implemented on different hardware.

When data is input, the extraction unit 151 extracts a predetermined ratio of the input data as sample data, and when the predetermined ratio is increased, it further extracts the increased predetermined ratio of the input data as the sample data. As shown in FIG. 4, the extraction unit 151 extracts the sample data from the learning data, which is the entire input data. FIG. 4 is a diagram for explaining an outline of the processing of the analyzer according to the first embodiment. The predetermined ratio at which the extraction unit 151 extracts the sample data, that is, the sampling rate, is assumed to be set in advance.
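The extraction step can be sketched in Python as follows. This is only an illustration under assumptions: the function name `extract_sample`, the use of uniform random sampling without replacement, and the fixed seed are not taken from the description, which does not state how the sample is drawn.

```python
import random

def extract_sample(data, ratio, seed=0):
    """Return a random subset holding `ratio` of `data` (at least one row).

    Assumption: uniform sampling without replacement; the description only
    says that a predetermined ratio of the input data is extracted.
    """
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0, 1]")
    size = max(1, round(len(data) * ratio))
    rng = random.Random(seed)
    return rng.sample(list(data), size)

learning_data = list(range(1000))                # stand-in for the input data
sample_10 = extract_sample(learning_data, 0.10)  # initial sampling rate
sample_20 = extract_sample(learning_data, 0.20)  # after the ratio is increased
```

In this sketch the sample is simply redrawn at the larger ratio; whether the larger sample should contain the earlier one is a design choice the text leaves open.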

The search unit 152 searches, from candidate combinations of setting contents applied to a plurality of processes executed when constructing a prediction model based on predetermined data, for combinations of setting contents such that the prediction accuracy of the prediction model based on the sample data, constructed when the combination is applied to the plurality of processes, satisfies a predetermined condition.

As shown in FIG. 4, the search unit 152 performs the search for each predictor. The search unit 152 determines the effectiveness of each preprocessing step, then performs a parameter search, and selects or discards pipelines according to the cross-validation accuracy obtained by cross-validation.

The processing of the search unit 152 will now be described with reference to FIG. 5. FIG. 5 is a diagram for explaining the pipeline search according to the first embodiment. The processing in the search unit 152 is performed by the step selection unit 152a, the accuracy calculation unit 152b, and the setting content determination unit 152c.

As shown in FIG. 5, the search unit 152 first decides, for each predictor, the pipeline needed to construct a prediction model by executing steps 1 to 3. For example, the search unit 152 executes steps 1 to 3 for predictor A_1 and then executes a parameter search.

Each step corresponds to one of the plurality of processes executed when constructing a prediction model, that is, the processes of the pipeline, and the setting content of the corresponding process is decided step by step. Each time a setting content is decided, the step selection unit 152a selects the step to be executed next. The setting content determination unit 152c decides the setting content of each step from among the setting content candidates included in the setting information 141. In doing so, the step selection unit 152a follows the execution order indicated in the setting information 141 and selects the step that comes after the step whose setting content has been decided. If no step has been executed yet, the step selection unit 152a selects the step that is first in the execution order.

For example, as shown in FIG. 5, the step following "normalization method search" is "feature selection method search". Therefore, when the setting content of the step "normalization method search" has been decided, the step selection unit 152a selects "feature selection method search" as the next step.
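The step selection rule can be illustrated with a short Python sketch. The names `EXECUTION_ORDER` and `select_next_step`, and the representation of decided steps as a dictionary, are hypothetical; only the order of the three steps comes from the description.

```python
# Execution order of the preprocessing steps, as described for FIG. 2 and FIG. 5.
EXECUTION_ORDER = [
    "missing value imputation method search",
    "normalization method search",
    "feature selection method search",
]

def select_next_step(decided):
    """Return the earliest step whose setting is still undecided.

    `decided` maps step names to their decided setting content.
    Returns None once every step has a decided setting.
    """
    for step in EXECUTION_ORDER:
        if step not in decided:
            return step
    return None

# Once normalization is decided, feature selection is chosen next.
decided = {
    "missing value imputation method search": "mean",
    "normalization method search": "standardization",
}
next_step = select_next_step(decided)
```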

The steps "missing value imputation method search", "normalization method search", and "feature selection method search" in FIG. 5 decide the setting contents of missing value imputation, normalization, and feature selection, respectively, which are preprocessing applied to the data for learning and analysis. The setting content candidates of these three steps are the methods used for missing value imputation, normalization, and feature selection, respectively.

The accuracy calculation unit 152b calculates the prediction accuracy of each prediction model that is constructed when, among the plurality of processes, the processes whose setting contents have already been decided are performed with those decided setting contents, and the process corresponding to the step selected by the step selection unit 152a is performed with each of the candidate setting contents.

For example, when the step selection unit 152a selects the step "feature selection method search", the setting contents of the steps "missing value imputation method search" and "normalization method search", which precede it in the execution order, have already been decided. It is therefore possible to construct prediction models that apply the setting contents decided in those two steps together with each candidate setting content of the step "feature selection method search". Because the step "feature selection method search" has four candidate setting contents, at least four prediction models can be constructed when a single setting content has been decided for each of the steps "missing value imputation method search" and "normalization method search".

The accuracy calculation unit 152b then calculates the prediction accuracy of each prediction model that can be constructed. More than one set of setting contents may have been decided for the steps "missing value imputation method search" and "normalization method search". For example, when two sets of setting contents have been decided for those steps, at least eight prediction models can be constructed. The accuracy calculation unit 152b can calculate the prediction accuracy by performing cross-validation using the learning data divided into a predetermined number of parts. Cross-validation will now be described with reference to FIG. 6. FIG. 6 is a diagram for explaining cross-validation according to the first embodiment.

As shown in FIG. 6, the accuracy calculation unit 152b first divides the sample data into four pieces of divided data 20a, 20b, 20c, and 20d. Then, as the first round, the accuracy calculation unit 152b uses the prediction model to have the predictor learn the divided data 20b, 20c, and 20d, and measures the accuracy of the trained predictor using the divided data 20a.

Similarly, in the second round, the accuracy calculation unit 152b has the predictor learn the divided data 20a, 20c, and 20d, and measures the accuracy of the trained predictor using the divided data 20b. In the third round, it has the predictor learn the divided data 20a, 20b, and 20d, and measures the accuracy using the divided data 20c. In the fourth round, it has the predictor learn the divided data 20a, 20b, and 20c, and measures the accuracy using the divided data 20d. The accuracy calculation unit 152b then takes the cross-validation accuracy, which is the average of the accuracies measured in the four rounds, as the prediction accuracy. The number of divisions in cross-validation is not limited to four and may be any number.
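The four-round procedure above corresponds to ordinary k-fold cross-validation, which can be sketched in Python as follows. The function names, the strided way of dividing the data, and the toy majority-label predictor are assumptions made for illustration; the description does not specify how the data is divided or which predictor is used.

```python
from collections import Counter

def k_fold_splits(samples, k=4):
    """Yield (training, validation) pairs; each fold is used once for validation."""
    folds = [samples[i::k] for i in range(k)]  # strided split (an assumption)
    for i in range(k):
        training = [row for j, fold in enumerate(folds) if j != i for row in fold]
        yield training, folds[i]

def cross_validation_accuracy(samples, train, accuracy, k=4):
    """Average the accuracy measured in each of the k rounds."""
    scores = [accuracy(train(tr), va) for tr, va in k_fold_splits(samples, k)]
    return sum(scores) / len(scores)

def train_majority(rows):
    """Toy predictor: always predicts the most frequent training label."""
    label = Counter(y for _, y in rows).most_common(1)[0][0]
    return lambda x: label

def accuracy(model, rows):
    """Fraction of rows whose label the model predicts correctly."""
    return sum(model(x) == y for x, y in rows) / len(rows)

# 20 rows of (feature, label); the label is 0 for every fourth row, else 1.
samples = [(i, 1 if i % 4 else 0) for i in range(20)]
cv_accuracy = cross_validation_accuracy(samples, train_majority, accuracy, k=4)
```

With this toy data the first fold contains only label 0 while the others contain only label 1, so three of the four rounds score 1.0 and one scores 0.0.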

The setting content determination unit 152c compares the prediction accuracies calculated by the accuracy calculation unit 152b and decides, as the setting content of the process corresponding to the step selected by the step selection unit 152a, the candidate setting content that gives the highest prediction accuracy.

For example, as shown in FIG. 5, in the step "normalization method search", the accuracy calculation unit 152b calculates a prediction accuracy of 72% for the prediction model corresponding to the setting content "min-max", 78% for "standardization", 72% for "Z-score", and 70% for "no processing". Because the prediction model with the highest prediction accuracy in the step "normalization method search" is the one corresponding to "standardization", the setting content determination unit 152c decides "standardization" as the setting content of the setting item corresponding to the step "normalization method search". In other words, the setting content determination unit 152c decides that standardization is the method used for normalization, which is a data preprocessing step.
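The decision in this example amounts to taking the candidate with the highest accuracy, as in the following Python sketch. The function name `decide_setting` and the English candidate labels are illustrative; the accuracy values are the ones given for FIG. 5.

```python
def decide_setting(accuracy_by_candidate):
    """Return the candidate with the highest prediction accuracy.

    Ties are resolved in favor of the candidate listed first, which is
    an assumption; the description does not say how ties are handled.
    """
    return max(accuracy_by_candidate, key=accuracy_by_candidate.get)

# Prediction accuracies from the "normalization method search" example.
normalization_accuracy = {
    "min-max": 0.72,
    "standardization": 0.78,
    "Z-score": 0.72,
    "no processing": 0.70,
}
decided_setting = decide_setting(normalization_accuracy)
```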

Then, as described above, the step selection unit 152a selects the step to be executed after the step whose setting content was decided by the setting content determination unit 152c. For example, when the setting content determination unit 152c has decided the setting content of the step "normalization method search", the step selection unit 152a selects the step "feature selection method search".

The search unit 152 further performs a parameter search. In the parameter search, the search unit 152 exhaustively calculates the prediction accuracy for every combination of hyperparameters and searches for the best parameter, that is, the combination that gives the highest prediction accuracy.
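An exhaustive parameter search of this kind is commonly written as a grid search. The Python sketch below is one possible form under assumptions: `grid_search`, the second hyperparameter `depth`, the grid values, and the toy objective standing in for cross-validation accuracy are all hypothetical; only the hyperparameter name N and its default of 100 appear in the description.

```python
from itertools import product

def grid_search(grid, evaluate):
    """Evaluate every hyperparameter combination; return (best_params, best_score)."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_accuracy(params):
    """Toy objective standing in for cross-validation accuracy."""
    return 1.0 - abs(params["N"] - 100) / 1000 - abs(params["depth"] - 4) / 100

best_params, best_score = grid_search(
    {"N": [50, 100, 200], "depth": [2, 4, 8]},
    toy_accuracy,
)
```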

Finally, the search unit 152 searches for pipelines for which the prediction accuracy of the prediction model satisfies a predetermined condition. For example, when the prediction accuracy threshold is 70% and only pipelines whose prediction accuracy is at or above the threshold are searched for, the search unit 152 returns pipelines P_1 to P_3 as the search result, as shown in FIG. 7. FIG. 7 is a diagram for explaining the pipeline search according to the first embodiment.

The result of the hyperparameter search by the search unit 152 can be represented as a region, as shown in FIG. 8. FIG. 8 is a diagram for explaining the promising region according to the first embodiment. For example, in FIG. 8(a), the vertical axis represents a first hyperparameter and the horizontal axis represents a second hyperparameter. FIG. 8(a) shows that the denser the pattern in a region, the higher the prediction accuracy was.

Here, the search unit 152 can include in the search result not only pipelines whose prediction accuracy is 70% or higher, but also pipelines whose prediction accuracy shows no statistically significant difference from 70%. That is, the search unit 152 searches for combinations of setting contents for which the prediction accuracy of the prediction model based on the sample data, constructed when the combination is applied to the plurality of processes, is 70% or higher, and for combinations whose prediction accuracy shows no statistically significant difference from a prediction accuracy of 70% or higher. The value of 70% here is an example of a second threshold. The region corresponding to the hyperparameter combinations in this search result is called the promising region.

FIG. 8(a) shows the promising region of prediction accuracy when the sample data is used. FIG. 8(b) shows the promising region of prediction accuracy when the learning data from which the sample data was extracted is used. As shown in FIG. 8, the two promising regions are similar. The promising region is used in the processing by the selection unit 153.

The selection unit 153 selects, from the combinations of setting contents found by the search unit 152, the combinations for which the prediction accuracy of the prediction model based on the input data, constructed when the combination is applied to the plurality of processes, is equal to or higher than a predetermined threshold.

In other words, whereas the search unit 152 evaluates the prediction accuracy of the prediction model using the sample data, the selection unit 153 evaluates it using the learning data. In doing so, the selection unit 153 further narrows down the pipelines found by the search unit 152 by applying a threshold to the prediction accuracy of each pipeline.

The selection unit 153 may simply select the pipelines whose prediction accuracy is at or above the threshold, or it may select the pipelines for which the prediction accuracy divided by the maximum prediction accuracy is at or above the threshold. For example, as shown in FIG. 9, when the prediction accuracies of pipelines P_1, P_2, and P_3 are 85%, 70%, and 83%, the maximum prediction accuracy is 85%. FIG. 9 is a diagram for explaining pipeline selection according to the first embodiment.

In this case, the selection unit 153 can select pipelines using the results of comparing 85/85 with the threshold for pipeline P_1, 70/85 with the threshold for pipeline P_2, and 83/85 with the threshold for pipeline P_3.
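The two selection rules can be illustrated with a minimal Python sketch. The function name `select_pipelines`, the `relative` flag, and the threshold value 0.95 are assumptions made for illustration; the description does not fix a threshold.

```python
def select_pipelines(accuracies, threshold, relative=True):
    """Return the names of pipelines that pass the threshold.

    accuracies: dict mapping pipeline name -> prediction accuracy (0..1).
    relative:   if True, compare accuracy / max accuracy with the threshold;
                if False, compare the raw accuracy with the threshold.
    """
    if not accuracies:
        return []
    best = max(accuracies.values())
    selected = []
    for name, acc in accuracies.items():
        score = acc / best if (relative and best > 0) else acc
        if score >= threshold:
            selected.append(name)
    return selected


# Accuracies taken from the FIG. 9 example; the threshold is an assumed value.
fig9 = {"P_1": 0.85, "P_2": 0.70, "P_3": 0.83}
print(select_pipelines(fig9, threshold=0.95))
```

With the assumed threshold, P_1 (85/85) and P_3 (83/85) pass, while P_2 (70/85) does not.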

Further, the selection unit 153 fixes the setting contents related to data preprocessing in the selected combination and performs a hyperparameter search. Here, taking the hyperparameter that gave the highest prediction accuracy in the search by the search unit 152 as the best parameter 210, the promising region can be represented as in FIG. 10. FIG. 10 is a diagram for explaining the promising region according to the first embodiment.

The filled portion of FIG. 10 represents the promising region. The hatched portion of FIG. 10 is a region that was a target of the hyperparameter search by the search unit 152 but is excluded from the hyperparameter search by the selection unit 153. Note that the search unit 152 can determine whether there is a significant difference by a t-test using the prediction accuracy obtained with the best parameter 210.
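One way to picture the t-test-based promising region is the following Python sketch. It assumes a paired t-test over per-fold cross-validation accuracies and a caller-supplied critical value. The description only says that a t-test is used, so the test variant, the function names, and the hyperparameter labels are all hypothetical.

```python
import math
import statistics


def differs_significantly(best_scores, other_scores, critical_t=2.776):
    """Paired t-test on per-fold accuracies (assumed test variant).

    critical_t defaults to the two-sided 5% critical value for 4 degrees of
    freedom (5 folds); pass another value for a different number of folds.
    """
    diffs = [b - o for b, o in zip(best_scores, other_scores)]
    mean = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    if sd == 0:
        return mean != 0
    t = mean / (sd / math.sqrt(len(diffs)))
    return abs(t) > critical_t


def promising_region(fold_scores):
    """Keep the best hyperparameter and those not significantly different."""
    best = max(fold_scores, key=lambda k: statistics.mean(fold_scores[k]))
    return [k for k, scores in fold_scores.items()
            if k == best or not differs_significantly(fold_scores[best], scores)]


# Hypothetical per-fold accuracies for three hyperparameter values.
scores = {
    "C=0.1": [0.70, 0.72, 0.69, 0.71, 0.70],
    "C=1":   [0.85, 0.86, 0.84, 0.85, 0.86],
    "C=10":  [0.84, 0.86, 0.85, 0.84, 0.86],
}
print(promising_region(scores))
```

In this toy data, "C=10" is statistically indistinguishable from the best value "C=1" and stays in the region, while "C=0.1" is excluded.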

Each time the extraction unit 151 extracts sample data, the estimation unit 154 estimates the time required to perform the predetermined processing on the input data, based on the time taken to perform part of the predetermined processing on the sample data. If the estimated time is less than a preset time limit, the estimation unit 154 increases the predetermined ratio. The processing by the estimation unit 154 may be executed at any time during the processing of the search unit 152, before the processing of the search unit 152, or after the processing of the search unit 152 and before the processing of the selection unit 153. For example, the estimation unit 154 can estimate the processing times of the search unit 152 and the selection unit 153 at the point when the prediction accuracy of "mean value" for the step "missing value imputation method search" in FIG. 5 has been calculated.

First, it is assumed that, for each predictor, the growth in computational cost as the number of data items increases is known. For example, suppose the computational cost of predictor A_1 grows as O(n^2). Suppose also that the sample rate is 20% and that calculating the prediction accuracy of "mean value" for the step "missing value imputation method search" took 100 seconds. As shown in FIG. 5, there are three steps, each with four candidate setting contents, so the estimation unit 154 calculates the processing time of the search unit 152 for predictor A_1 as in equation (1).
100 (seconds) × 3 (steps) × 4 (setting contents) = 1200 (seconds) ... (1)

In the selection unit 153, the data to be processed is 100/20 times, that is, 5 times, that of the search unit 152, so the estimation unit 154 calculates the processing time of the selection unit 153 for predictor A_1 as in equation (2).
100 (seconds) × 5^2 (increase in computational cost) × 3 (steps) × 1 (setting content) = 7500 (seconds) ... (2)
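Equations (1) and (2) can be reproduced with a short Python sketch. The function and parameter names are assumptions for illustration. The exponent 2 corresponds to the O(n^2) growth assumed for predictor A_1.

```python
def search_time(unit_seconds, n_steps, n_settings):
    """Equation (1): every setting of every step is evaluated on the sample."""
    return unit_seconds * n_steps * n_settings


def selection_time(unit_seconds, sample_percent, exponent, n_steps):
    """Equation (2): one setting per step, on data 100/sample_percent times larger."""
    scale = 100 / sample_percent          # 100/20 = 5 in the example
    return unit_seconds * scale ** exponent * n_steps


t_search = search_time(100, n_steps=3, n_settings=4)
t_select = selection_time(100, sample_percent=20, exponent=2, n_steps=3)
total = t_search + t_select               # compared against the time limit
print(t_search, t_select, total)
```

The sketch yields 1200 seconds for the search unit and 7500 seconds for the selection unit, so the total compared with the time limit is 8700 seconds.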

Here, the subsequent processing differs depending on whether the processing time calculated by the estimation unit 154 (the sum of (1) and (2)) is less than a predetermined time limit. When the required time estimated by the estimation unit 154 is less than the predetermined time limit, the search unit 152 performs a further search using sample data extracted by the extraction unit 151 at a ratio higher than the predetermined ratio.

On the other hand, when the required time estimated by the estimation unit 154 is equal to or greater than the time limit, the search unit 152 performs the search while excluding prediction algorithms whose prediction methods are similar. In this case, the search unit 152 keeps excluding prediction algorithms until the estimated time falls below the time limit.

Specifically, the predictor information 142 first stores a category for each prediction algorithm, that is, for each predictor. Then, when the required time estimated by the estimation unit 154 is equal to or greater than the time limit, the search unit 152 refers to the predictor information 142 and excludes one of a plurality of predictors that share the same category. For example, the search unit 152 excludes predictors in descending order of execution time, based on preset information about execution time.

For example, suppose that in FIG. 3 predictor A_1 is RandomForest, predictor A_2 is ExtraRandomTrees, predictor A_3 is LogisticRegression, and predictor A_4 is LinearSVM. Suppose that the predictor information 142 sets "decision tree category" as the category of A_1 and A_2 and "linear predictor category" as the category of A_3 and A_4. Suppose also that the predictor information 142 specifies that the order of execution time, longest first, is A_2, A_1, A_3, A_4.

In this case, when the required time estimated by the estimation unit 154 is equal to or greater than the time limit, the search unit 152 first excludes A_2, which has the longest execution time among the predictors whose category contains a plurality of predictors. If the required time is still estimated to be equal to or greater than the time limit even after A_2 is excluded, the search unit 152 excludes A_3, which has the longest execution time among the remaining predictors whose category contains a plurality of predictors. The search unit 152 repeats the exclusion of predictors until no category contains a plurality of predictors or until the required time is no longer estimated to be equal to or greater than the time limit.
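The exclusion procedure can be sketched in Python as follows. The function name `exclude_predictors`, the `estimate_time` callback, and the per-predictor time figures are hypothetical; only the categories and the execution-time order come from the example above.

```python
from collections import Counter


def exclude_predictors(predictors, categories, slowest_first, estimate_time, time_limit):
    """Drop predictors until the estimate fits or no category has duplicates.

    predictors:    list of predictor names.
    categories:    dict mapping predictor name -> category.
    slowest_first: predictor names ordered from longest execution time.
    estimate_time: callable(list of names) -> estimated seconds.
    """
    remaining = list(predictors)
    while estimate_time(remaining) >= time_limit:
        counts = Counter(categories[p] for p in remaining)
        # Only predictors whose category still holds two or more predictors
        # may be excluded, slowest first.
        candidates = [p for p in slowest_first
                      if p in remaining and counts[categories[p]] > 1]
        if not candidates:
            break
        remaining.remove(candidates[0])
    return remaining


categories = {"A_1": "decision tree", "A_2": "decision tree",
              "A_3": "linear", "A_4": "linear"}
slowest_first = ["A_2", "A_1", "A_3", "A_4"]
# Assumed per-predictor estimates in seconds, for illustration only.
seconds = {"A_1": 3000, "A_2": 4000, "A_3": 2000, "A_4": 1000}


def estimate(names):
    return sum(seconds[n] for n in names)


print(exclude_predictors(list(categories), categories, slowest_first, estimate, 5000))
```

With the assumed figures, A_2 is excluded first and A_3 second, after which the estimate (4000 seconds) is below the limit and A_1 and A_4 remain.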

When the selection unit 153 selects a plurality of combinations of setting contents, the combining unit 155 combines the plurality of prediction models corresponding to those combinations using a predetermined ensemble technique. For example, as shown in FIG. 11, when the selection unit 153 selects two pipelines, two prediction models, ModelA_1 and ModelA_3, can be built. FIG. 11 is a diagram for explaining the combining of prediction models according to the first embodiment. In this case, the combining unit 155 combines ModelA_1 and ModelA_3 with weights w_1 and w_2, respectively, using the predetermined ensemble technique.

Furthermore, as shown in FIG. 12, the combining unit 155 takes as the best model whichever of ModelA_1, ModelA_3, and the combined EnsembleModel has the highest generalization performance. FIG. 12 is a diagram for explaining the combining of prediction models according to the first embodiment.
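The description leaves the ensemble technique open. As one possible reading, the following Python sketch assumes weighted averaging of predicted class probabilities and uses accuracy on held-out labels as a stand-in for generalization performance. The probabilities, weights, and function names are all hypothetical.

```python
def weighted_ensemble(prob_lists, weights):
    """Weighted average of each model's positive-class probabilities."""
    total = sum(weights)
    n = len(prob_lists[0])
    return [sum(w * probs[i] for w, probs in zip(weights, prob_lists)) / total
            for i in range(n)]


def accuracy(probs, labels, cutoff=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    hits = sum((p >= cutoff) == bool(y) for p, y in zip(probs, labels))
    return hits / len(labels)


def pick_best(candidates, labels):
    """Return the candidate name with the highest held-out accuracy."""
    return max(candidates, key=lambda name: accuracy(candidates[name], labels))


labels = [1, 0, 1, 0]                      # held-out labels (assumed)
model_a1 = [0.9, 0.6, 0.4, 0.2]            # ModelA_1 probabilities (assumed)
model_a3 = [0.6, 0.3, 0.7, 0.6]            # ModelA_3 probabilities (assumed)
w1, w2 = 0.5, 0.5                          # weights w_1, w_2 (assumed)

candidates = {
    "ModelA_1": model_a1,
    "ModelA_3": model_a3,
    "EnsembleModel": weighted_ensemble([model_a1, model_a3], [w1, w2]),
}
print(pick_best(candidates, labels))
```

In this toy data the ensemble classifies all four samples correctly while each single model makes at least one error, so the ensemble is chosen. On other data a single model may equally well be the best model.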

The verification unit 156 verifies the prediction model built from the pipeline selected by the selection unit 153. When the selection unit 153 selects a pipeline, the verification unit 156 trains the predictor on the learning data according to the determined pipeline and builds a prediction model. The verification unit 156 then measures, as the test accuracy, the prediction accuracy of the built prediction model on test data separate from the learning data. For example, the analysis device 10 may use the test accuracy measured here as its final output. Verifying with test data different from the learning data also makes it possible to check for overfitting and underfitting.

[Processing of the First Embodiment]
The processing flow of the analysis device 10 according to the first embodiment will be described with reference to FIGS. 13 and 14. FIGS. 13 and 14 are flowcharts showing the processing flow of the analysis device according to the first embodiment. As shown in FIG. 13, the analysis device 10 first reads the learning data (step S101). Next, the extraction unit 151 extracts sample data of a predetermined sample size from the learning data (step S102).

Next, when there is an unselected predictor (step S103, Yes), the search unit 152 refers to the predictor information 142 and selects the next predictor (step S104). The search unit 152 then determines a pipeline for the selected predictor using the read learning data (step S105). The search unit 152 may determine a plurality of pipelines.

Next, the processing by which the search unit 152 determines a pipeline (step S105 in FIG. 13) will be described in detail with reference to FIG. 14. As shown in FIG. 14, when there is an unselected step (step S201, Yes), the step selection unit 152a refers to the setting information 141 and selects the next step (step S202). The next step is the unselected step that comes earliest in the execution order. In this embodiment, each step corresponds to one data preprocessing operation. On the other hand, when there is no unselected step (step S201, No), the analysis device 10 searches for hyperparameters (step S207).

When the setting content candidates for the step selected by the step selection unit 152a include an unselected setting content (step S203, Yes), the accuracy calculation unit 152b selects the next setting content (step S204). On the other hand, when there is no unselected setting content (step S203, No), the setting content determination unit 152c adopts, as the setting content of the step selected by the step selection unit 152a, the setting content with the highest prediction accuracy calculated by the accuracy calculation unit 152b (step S206).

Having selected a setting content, the accuracy calculation unit 152b calculates the prediction accuracy of the prediction model built from the pipeline to which the selected setting content is applied (step S205). Here, the accuracy calculation unit 152b can calculate the prediction accuracy by cross-validation using the learning data divided into a predetermined number of parts. The accuracy calculation unit 152b repeats steps S203 to S205 until no unselected setting content remains.
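The step-by-step determination of FIG. 14 can be sketched in Python as a greedy loop. The `evaluate` callback stands in for the cross-validated accuracy calculation, and the step names, setting names, and toy scores are assumptions for illustration. The hyperparameter search of step S207 is omitted.

```python
def determine_pipeline(steps, evaluate):
    """Greedy search: fix the best setting of each step in execution order.

    steps:    list of (step_name, [candidate settings]) in execution order.
    evaluate: callable(dict step_name -> setting) -> prediction accuracy.
    """
    chosen = {}
    for step_name, candidates in steps:                 # S201, S202
        best_setting, best_accuracy = None, float("-inf")
        for setting in candidates:                      # S203, S204
            trial = {**chosen, step_name: setting}
            acc = evaluate(trial)                       # S205
            if acc > best_accuracy:
                best_setting, best_accuracy = setting, acc
        chosen[step_name] = best_setting                # S206
    return chosen


# Toy stand-in for cross-validated accuracy (assumed values).
BONUS = {"median": 0.10, "mean": 0.05, "standard": 0.08, "pca": 0.02}


def toy_evaluate(trial):
    return 0.60 + sum(BONUS.get(value, 0.0) for value in trial.values())


steps = [
    ("missing_value_imputation", ["mean", "median", "mode", "zero"]),
    ("scaling", ["none", "standard", "minmax", "robust"]),
    ("dimensionality_reduction", ["none", "pca", "svd", "select_k"]),
]
print(determine_pipeline(steps, toy_evaluate))
```

Because each step is settled before the next one is examined, three steps with four candidates each need 3 × 4 = 12 evaluations rather than 4^3 = 64, which matches the factor used in equation (1).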

Returning to FIG. 13, when there is no unselected predictor (step S103, No), the estimation unit 154 estimates the time for the entire processing (step S106). When the estimated time is less than the time limit (step S107, Yes), the estimation unit 154 increases the sample rate (step S108). The extraction unit 151 then extracts sample data again at the increased sample rate (step S102). Thereafter, the analysis device 10 repeats the processing until the time estimated by the estimation unit 154 is no longer less than the time limit.

On the other hand, when the time estimated by the estimation unit 154 is not less than the time limit (step S107, No), the search unit 152 reduces the prediction algorithms (step S109). The selection unit 153 then selects, from the pipelines determined by the search unit 152, the pipelines whose prediction accuracy is equal to or greater than the threshold, and further performs a hyperparameter search on the selected pipelines (step S110). The verification unit 156 then builds a prediction model based on the selected pipeline (step S111) and verifies the built prediction model using the test data (step S112).
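The sample-rate loop of steps S102 to S108 can be sketched in Python as follows. The description does not say by how much the sample rate grows or what happens at 100%, so the fixed increment, the cap at 100%, and the toy estimator are assumptions.

```python
def grow_sample_rate(estimate_seconds, time_limit, start_percent=20, step_percent=20):
    """Raise the sample rate until the estimated time reaches the limit.

    estimate_seconds: callable(sample_percent) -> estimated total seconds;
                      stands in for steps S102 to S106.
    Returns the final sample percentage and its estimated time.
    """
    percent = start_percent
    while True:
        estimated = estimate_seconds(percent)
        if estimated >= time_limit:          # S107, No: go on to S109
            return percent, estimated
        if percent >= 100:                   # assumed stop: nothing left to add
            return percent, estimated
        percent = min(100, percent + step_percent)   # S108


def toy_estimate(percent):
    # Assumed quadratic growth, scaled so that 20% gives 8700 seconds.
    return 8700 * (percent / 20) ** 2


print(grow_sample_rate(toy_estimate, time_limit=50000))
```

With the toy estimator and an assumed limit of 50000 seconds, the loop passes 20% and 40% and stops at 60%, the first rate whose estimate is no longer below the limit.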

[Effects of the First Embodiment]
When data is input, the extraction unit 151 extracts a predetermined ratio of the input data as sample data, and when the predetermined ratio is increased, it further extracts the increased ratio of the input data as sample data. Each time the extraction unit 151 extracts sample data, the estimation unit 154 estimates the time required to perform the predetermined processing on the input data, based on the time taken to perform part of the predetermined processing on the sample data, and increases the predetermined ratio if the estimated time is less than a preset time limit. The search unit 152 searches the candidate combinations of setting contents, which are applied to a plurality of processes executed when building a prediction model based on given data, for combinations such that a prediction model built from the sample data with the combination applied has a prediction accuracy satisfying a predetermined condition. From the combinations found by the search unit 152, the selection unit 153 selects those for which a prediction model built from the input data with the combination applied has a prediction accuracy equal to or greater than a predetermined threshold. In this way, pipelines likely to give high prediction accuracy are roughly determined using the sample data, and the final pipeline is selected using all the data, so that a prediction model with high prediction accuracy can be built while the time required for the pipeline search is shortened.

FIG. 15 shows the prediction accuracy of prediction models built by analyses with the same execution time, using auto-sklearn and using this embodiment. FIG. 15 is a diagram for explaining the effects of the first embodiment. As shown in FIG. 15, this embodiment achieved higher prediction accuracy on 8 of the 12 types of data. On the remaining 4 types, the prediction accuracy was almost equivalent.

FIG. 16 shows the execution time and prediction accuracy when the processing of the search unit 152 and the selection unit 153 is performed without sample extraction (automation rule) and when this embodiment is used. FIG. 16 is a diagram for explaining the effects of the first embodiment. As shown in FIG. 16, this embodiment reduced the processing time by about 70% with almost no loss of prediction accuracy.

When the required time estimated by the estimation unit 154 is equal to or greater than the time limit, the search unit 152 performs the search while excluding prediction algorithms whose prediction methods are similar. As a result, the processing time can be reduced further when time is short, and a prediction model with higher prediction accuracy can be built when time allows.

The search unit 152 searches for combinations of setting contents such that the prediction model built from the sample data, with the combination applied to the plurality of processes, has a prediction accuracy equal to or greater than a second threshold, and also for combinations whose prediction accuracy has no statistically significant difference from a prediction accuracy that is equal to or greater than the second threshold. This makes it possible to search for optimal hyperparameters in a short time.

When the selection unit 153 selects a plurality of combinations of setting contents, the combining unit 155 combines the plurality of prediction models corresponding to those combinations using a predetermined ensemble technique. As a result, a prediction model with higher prediction accuracy can be built than when each pipeline is selected on its own.

[Second Embodiment]
The first embodiment described the case where the estimation unit 154 estimates the processing time after the search unit 152 has determined the pipelines. As noted above, the processing by the estimation unit 154 may instead be performed at any time during the processing of the search unit 152, or before that processing is executed. In the second embodiment, the estimation unit 154 estimates the processing time before the search unit 152 starts the processing for determining pipelines.

[Processing of the Second Embodiment]
The processing flow of the analysis device 10 according to the second embodiment will be described with reference to FIG. 17. FIG. 17 is a flowchart showing the processing flow of the analysis device according to the second embodiment. As shown in FIG. 17, the analysis device 10 first reads the learning data (step S301). Next, the extraction unit 151 extracts sample data of a predetermined sample size from the learning data (step S302).

Next, when there is an unselected predictor (step S303, Yes), the estimation unit 154 refers to the predictor information 142 and selects the next predictor (step S304). The estimation unit 154 then measures the execution time on the sample data and predicts the execution time on the learning data (step S305).

Here, the estimation unit 154 actually performs cross-validation using the sample data and measures the time it takes. Specifically, the estimation unit 154 performs the same processing as the accuracy calculation unit 152b, as shown in FIG. 6. However, because the purpose of the estimation unit 154 is to measure the time required for the processing, it need not calculate the prediction accuracy.

The accuracy calculation unit 152b preprocesses the sample data and then has the predictor learn the sample data. In contrast, the estimation unit 154 does not preprocess the sample data and has the learner learn it directly. The estimation unit 154 also predicts the execution time on the learning data using equations (1) and (2) above.

The estimation unit 154 selects the predictors one after another and executes step S305 for each. When there is no unselected predictor (step S303, No), the estimation unit 154 estimates the time for the entire processing (step S306). When the estimated time is less than the time limit (step S307, Yes), the estimation unit 154 increases the sample rate (step S308). The extraction unit 151 then extracts sample data again at the increased sample rate (step S302). Thereafter, the analysis device 10 repeats the processing until the time estimated by the estimation unit 154 is no longer less than the time limit.

On the other hand, when the time estimated by the estimation unit 154 is not less than the time limit (step S307, No), the search unit 152 reduces the prediction algorithms (step S309). Next, when there is an unselected predictor (step S310, Yes), the search unit 152 refers to the predictor information 142 and selects the next predictor (step S311). The search unit 152 then determines a pipeline for the selected predictor using the read learning data (step S312). The search unit 152 may determine a plurality of pipelines. The processing by which the search unit 152 determines a pipeline is the same as the processing shown in FIG. 14.

When there is no unselected predictor (step S310, No), the selection unit 153 selects, from the pipelines determined by the search unit 152, the pipelines whose prediction accuracy is equal to or greater than the threshold, and further performs a hyperparameter search on the selected pipelines (step S313). The verification unit 156 then builds a prediction model based on the selected pipeline (step S314) and verifies the built prediction model using the test data (step S315).

[Other Embodiments]
When the features of the input data have a predetermined data characteristic, the search unit 152 may restrict its search to those candidate combinations of setting contents that are associated in advance with that data characteristic. For example, for data in which zeros make up a large proportion of the values, such as text data, that is, when the data is sparse, a linear predictor often gives sufficient accuracy. For this reason, when the learning data or the sample data is sparse, the search unit 152 excludes the non-linear predictors from the predictors. For example, the search unit 152 determines that the data is sparse when sparse features, that is, features whose proportion of zeros is equal to or greater than a threshold r_f, account for a proportion of all features that is equal to or greater than a threshold r_a. The thresholds r_f and r_a can each be set to, for example, 0.9.
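The sparsity test can be sketched in Python as follows, assuming the data is given as a list of equal-length rows. The function names and the `linear` flags are hypothetical; the thresholds default to the example value 0.9.

```python
def is_sparse(rows, r_f=0.9, r_a=0.9):
    """Return True when enough features are mostly zero.

    A feature is sparse when its proportion of zeros is >= r_f; the data is
    sparse when the proportion of sparse features is >= r_a.
    """
    n_rows = len(rows)
    n_features = len(rows[0])
    sparse_features = 0
    for j in range(n_features):
        zeros = sum(1 for row in rows if row[j] == 0)
        if zeros / n_rows >= r_f:
            sparse_features += 1
    return sparse_features / n_features >= r_a


def predictors_to_search(rows, linear):
    """Drop non-linear predictors when the data is sparse.

    linear: dict mapping predictor name -> True if the predictor is linear.
    """
    if is_sparse(rows):
        return [name for name, flag in linear.items() if flag]
    return list(linear)


# 20 x 20 identity-like matrix: every feature is 95% zeros.
sparse_rows = [[1 if i == j else 0 for j in range(20)] for i in range(20)]
dense_rows = [[1.0] * 5 for _ in range(5)]
linear = {"A_1": False, "A_2": False, "A_3": True, "A_4": True}
print(predictors_to_search(sparse_rows, linear))
```

For the sparse toy matrix only the linear predictors A_3 and A_4 remain in the search, while for the dense matrix all four predictors are kept.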

[System Configuration, etc.]
The components of each illustrated device are functional and conceptual, and need not be physically configured as illustrated. That is, the specific form of distribution and integration of the devices is not limited to what is illustrated, and all or part of them may be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like. Furthermore, all or any part of the processing functions performed by each device can be realized by a CPU and a program analyzed and executed by that CPU, or as hardware using wired logic.

Of the processes described in this embodiment, all or part of the processes described as being performed automatically can also be performed manually, and all or part of the processes described as being performed manually can also be performed automatically by known methods. In addition, the processing procedures, control procedures, specific names, and information including various data and parameters shown in the above document and drawings can be changed arbitrarily unless otherwise specified.

[Program]
In one embodiment, the analysis device 10 can be implemented by installing, on a desired computer, an analysis program that performs the above analysis as packaged software or online software. For example, by causing an information processing device to execute the analysis program, the information processing device can be made to function as the analysis device 10. The information processing device here includes desktop and notebook personal computers. The category of information processing devices also includes mobile communication terminals such as smartphones, mobile phones, and PHS (Personal Handyphone System) terminals, as well as slate terminals such as PDAs (Personal Digital Assistants).

The analysis device 10 can also be implemented as an analysis server device that treats a terminal device used by a user as a client and provides the client with services related to the above analysis. For example, the analysis server device is implemented as a server device that provides an analysis service taking learning data as input and producing a pipeline or a prediction model as output. In this case, the analysis server device may be implemented as a Web server, or as a cloud that provides the above analysis services by outsourcing.

FIG. 18 is a diagram showing an example of a computer that executes the analysis program. The computer 1000 has, for example, a memory 1010 and a CPU 1020. The computer 1000 also has a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. These components are connected by a bus 1080.

The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM 1012. The ROM 1011 stores, for example, a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to a hard disk drive 1090. The disk drive interface 1040 is connected to a disk drive 1100. A removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1100. The serial port interface 1050 is connected to, for example, a mouse 1110 and a keyboard 1120. The video adapter 1060 is connected to, for example, a display 1130.

The hard disk drive 1090 stores, for example, an OS 1091, an application program 1092, a program module 1093, and program data 1094. That is, the program that defines each process of the analysis device 10 is implemented as the program module 1093, in which computer-executable code is written. The program module 1093 is stored in, for example, the hard disk drive 1090. For example, the program module 1093 for executing processing equivalent to the functional configuration of the analysis device 10 is stored in the hard disk drive 1090. Note that the hard disk drive 1090 may be replaced by an SSD.

The setting data used in the processing of the above embodiment is stored as the program data 1094 in, for example, the memory 1010 or the hard disk drive 1090. The CPU 1020 then reads the program module 1093 and the program data 1094 stored in the memory 1010 or the hard disk drive 1090 into the RAM 1012 as needed and executes them.

The program module 1093 and the program data 1094 need not be stored in the hard disk drive 1090. They may instead be stored in, for example, a removable storage medium and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program module 1093 and the program data 1094 may be stored in another computer connected via a network (a LAN, a WAN (Wide Area Network), or the like) and read from that computer by the CPU 1020 via the network interface 1070.

Reference Signs List
10 analysis device
11 input unit
12 output unit
13 communication control unit
14 storage unit
15 control unit
141 setting information
142 predictor information
151 extraction unit
152 search unit
152a step selection unit
152b accuracy calculation unit
152c setting content determination unit
153 selection unit
154 estimation unit
155 synthesis unit
156 verification unit

Claims

An analysis device comprising:
an extraction unit that, when data is input, extracts a predetermined ratio of the input data as sample data and, when the predetermined ratio is increased, further extracts the increased predetermined ratio of the input data as sample data;
an estimation unit that, each time sample data is extracted by the extraction unit, uses the sample data to estimate, for each prediction algorithm for which a category is set in advance, the time required to perform a predetermined process on the input data, based on the execution time required to perform a part of the predetermined process, and that increases the predetermined ratio when the estimated time is shorter than a preset time limit;
a search unit that, when the estimated time is equal to or longer than the time limit, excludes prediction algorithms from a plurality of prediction algorithms for which the same category is set, in a preset order based on the execution time, and then searches candidate combinations of setting contents, the setting contents being applied to a plurality of processes executed when constructing a prediction model based on predetermined data and including selection of the prediction algorithm, for a combination of setting contents such that the prediction accuracy of a prediction model based on the sample data, constructed when the combination is applied to the plurality of processes, satisfies a predetermined condition; and
a selection unit that selects, from the combinations of setting contents found by the search unit, a combination of setting contents for which the prediction accuracy of a prediction model based on the input data, constructed when the combination is applied to the plurality of processes, is equal to or greater than a predetermined threshold.
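
For illustration only, the interplay of the extraction unit, the estimation unit, and the exclusion of prediction algorithms might be sketched as follows. This is one possible reading under stated assumptions, not the claimed implementation: the linear extrapolation in `estimate_full_time`, the comparison of the summed estimate against the time limit, the doubling of the ratio, the slowest-first exclusion order, and every function and algorithm name are hypothetical choices made for this example.

```python
import random


def extract_sample(data, ratio, seed=0):
    """Take the first `ratio` fraction of one fixed shuffle of `data`.

    Reusing the same seed means a larger ratio extends the earlier sample.
    """
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    return shuffled[: max(1, int(len(shuffled) * ratio))]


def estimate_full_time(partial_time, sample_size, full_size):
    """Assumed estimator: scale the measured time linearly to the full data."""
    return partial_time * full_size / sample_size


def choose_ratio_and_algorithms(data, categories, measure_time, time_limit,
                                ratio=0.1, growth=2.0):
    """Grow the sampling ratio while the estimate fits the limit, then prune."""
    active = dict(categories)  # algorithm name -> category
    while True:
        sample = extract_sample(data, ratio)
        estimates = {
            name: estimate_full_time(measure_time(name, sample), len(sample), len(data))
            for name in active
        }
        total = sum(estimates.values())
        if total < time_limit and ratio < 1.0:
            ratio = min(1.0, ratio * growth)  # estimate fits: extract more data
            continue
        # Estimate reached the limit: drop the slowest algorithms first, but
        # only when another algorithm of the same category remains.
        for name in sorted(estimates, key=estimates.get, reverse=True):
            if total < time_limit:
                break
            if sum(1 for c in active.values() if c == active[name]) > 1:
                total -= estimates[name]
                del active[name]
        return ratio, sorted(active)


# Toy timing model (hypothetical): seconds as a function of sample size.
COSTS = {
    "svm": lambda n: 1e-5 * n * n,
    "random_forest": lambda n: 1e-3 * n,
    "ridge": lambda n: 5e-4 * n,
}
CATEGORIES = {"svm": "nonlinear", "random_forest": "nonlinear", "ridge": "linear"}

final_ratio, survivors = choose_ratio_and_algorithms(
    data=list(range(1000)),
    categories=CATEGORIES,
    measure_time=lambda name, sample: COSTS[name](len(sample)),
    time_limit=3.0,
)
```

In this toy run the ratio grows once, the estimate then exceeds the limit, and the slowest algorithm that shares a category with another algorithm is excluded before the search would begin.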

An analysis device comprising:
an extraction unit that, when data is input, extracts a predetermined ratio of the input data as sample data and, when the predetermined ratio is increased, further extracts the increased predetermined ratio of the input data as sample data;
an estimation unit that, each time sample data is extracted by the extraction unit, uses the sample data to estimate, for each prediction algorithm for which a category is set in advance, the time required to perform a predetermined process on the input data, based on the execution time required to perform a part of the predetermined process, and that increases the predetermined ratio when the estimated time is shorter than a preset time limit;
a search unit that, when the estimated time is equal to or longer than the time limit, excludes prediction algorithms from a plurality of prediction algorithms for which the same category is set, in a preset order based on the execution time, and then searches candidate combinations of setting contents, the setting contents being applied to a plurality of processes executed when constructing a prediction model based on predetermined data and each relating to one of selection of the prediction algorithm, preprocessing of input data, and hyperparameters, for a combination of setting contents such that the prediction accuracy of a prediction model based on the sample data, constructed when the combination is applied to the plurality of processes, satisfies a predetermined condition; and
a selection unit that selects, from the combinations of setting contents found by the search unit, a combination of setting contents for which the prediction accuracy of a prediction model based on the input data, constructed when the combination is applied to the plurality of processes, is equal to or greater than a predetermined threshold.

The analysis device according to claim 1 or 2, wherein, when a feature amount of the input data has a predetermined data characteristic, the search unit searches, among the candidate combinations of setting contents, those candidate combinations that are associated in advance with the predetermined data characteristic.
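
By way of example only, narrowing the candidates according to a data characteristic could look like the sketch below. The characteristic name `has_missing_values`, the candidate tuples, and both helper functions are hypothetical and are not taken from the claims.

```python
def detect_characteristics(rows):
    """Return the set of (hypothetical) data characteristics found in `rows`."""
    characteristics = set()
    if any(value is None for row in rows for value in row):
        characteristics.add("has_missing_values")
    return characteristics


def restrict_candidates(candidates, associated, characteristics):
    """Keep only candidates associated in advance with a detected characteristic.

    Falls back to all candidates when no detected characteristic has an entry.
    """
    for name in sorted(characteristics):
        if name in associated:
            allowed = associated[name]
            return [c for c in candidates if c in allowed]
    return list(candidates)


# Each candidate is a (preprocessing, prediction algorithm) combination.
CANDIDATES = [
    ("impute_mean", "random_forest"),
    ("no_preprocessing", "random_forest"),
    ("impute_mean", "ridge"),
]
ASSOCIATED = {
    "has_missing_values": {("impute_mean", "random_forest"), ("impute_mean", "ridge")},
}

rows = [[1.0, None], [2.0, 3.0]]
narrowed = restrict_candidates(CANDIDATES, ASSOCIATED, detect_characteristics(rows))
```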

The analysis device according to any one of claims 1 to 3, further comprising a synthesis unit that, when a plurality of combinations of setting contents are selected by the selection unit, combines, using a predetermined ensemble method, the plurality of prediction models corresponding to the respective combinations of setting contents.
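
The claim leaves the ensemble method open. As one assumed example, majority voting over the predictions of the selected models could be sketched as follows; the function `majority_vote` and the toy threshold classifiers are hypothetical stand-ins.

```python
from collections import Counter


def majority_vote(models, x):
    """Return the label predicted by the most models (ties go to the label seen first)."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


# Three toy classifiers standing in for models built from selected combinations.
models = [
    lambda x: "high" if x > 5 else "low",
    lambda x: "high" if x > 7 else "low",
    lambda x: "high" if x > 3 else "low",
]

prediction = majority_vote(models, 6)
```

Other ensemble methods, such as averaging numeric predictions, would fit the same interface.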

The analysis device according to claim 1 or 2, wherein the search unit:
in steps that sequentially determine the setting content of each of the plurality of processes executed when constructing the prediction model, selects the step to be executed next each time a setting content is determined;
calculates the prediction accuracy of each prediction model constructed when, among the plurality of processes, the processes whose setting contents have been determined are executed with the determined setting contents applied and the process corresponding to the next step is executed with each of the setting content candidates applied; and
compares the calculated prediction accuracies and determines, among the setting content candidates, the candidate with the highest prediction accuracy as the setting content of the process corresponding to the next step.
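
The stepwise determination recited above resembles a greedy search, which might be sketched as follows. The callable `evaluate`, the policy `select_next_step`, and the toy accuracy table `GAINS` are assumptions introduced for illustration; in practice `evaluate` would build and score a prediction model on the sample data.

```python
def greedy_search(candidates, evaluate, select_next_step=None):
    """Fix one process setting at a time, keeping the most accurate candidate.

    candidates: dict mapping each process name to its setting candidates.
    evaluate: callable taking a partial settings dict, returning accuracy.
    select_next_step: hypothetical policy choosing the next step to decide.
    """
    if select_next_step is None:
        select_next_step = lambda remaining, decided: remaining[0]
    decided = {}
    remaining = list(candidates)
    while remaining:
        step = select_next_step(remaining, decided)
        remaining.remove(step)
        # Apply the decided settings plus each candidate for the next step.
        decided[step] = max(
            candidates[step],
            key=lambda option: evaluate({**decided, step: option}),
        )
    return decided


# Toy accuracy contributions (hypothetical values, not measured results).
GAINS = {
    "none": 0.00, "standardize": 0.02,
    "ridge": 0.60, "random_forest": 0.70,
    "shallow": 0.05, "deep": 0.03,
}
CANDIDATES = {
    "preprocessing": ["none", "standardize"],
    "algorithm": ["ridge", "random_forest"],
    "hyperparameter": ["shallow", "deep"],
}

best = greedy_search(CANDIDATES, lambda s: sum(GAINS[v] for v in s.values()))
```

A greedy search evaluates only the sum of the candidate counts per step rather than their product, at the cost of possibly missing combinations whose benefit appears only jointly.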

An analysis method performed by an analysis device, the method comprising:
an extraction step of, when data is input, extracting a predetermined ratio of the input data as sample data and, when the predetermined ratio is increased, further extracting the increased predetermined ratio of the input data as sample data;
an estimation step of, each time sample data is extracted in the extraction step, using the sample data to estimate, for each prediction algorithm for which a category is set in advance, the time required to perform a predetermined process on the input data, based on the execution time required to perform a part of the predetermined process, and increasing the predetermined ratio when the estimated time is shorter than a preset time limit;
a search step of, when the estimated time is equal to or longer than the time limit, excluding prediction algorithms from a plurality of prediction algorithms for which the same category is set, in a preset order based on the execution time, and then searching candidate combinations of setting contents, the setting contents being applied to a plurality of processes executed when constructing a prediction model based on predetermined data and including selection of the prediction algorithm, for a combination of setting contents such that the prediction accuracy of a prediction model based on the sample data, constructed when the combination is applied to the plurality of processes, satisfies a predetermined condition; and
a selection step of selecting, from the combinations of setting contents found in the search step, a combination of setting contents for which the prediction accuracy of a prediction model based on the input data, constructed when the combination is applied to the plurality of processes, is equal to or greater than a predetermined threshold.

An analysis method performed by an analysis device, the method comprising:
an extraction step of, when data is input, extracting a predetermined ratio of the input data as sample data and, when the predetermined ratio is increased, further extracting the increased predetermined ratio of the input data as sample data;
an estimation step of, each time sample data is extracted in the extraction step, using the sample data to estimate, for each prediction algorithm for which a category is set in advance, the time required to perform a predetermined process on the input data, based on the execution time required to perform a part of the predetermined process, and increasing the predetermined ratio when the estimated time is shorter than a preset time limit;
a search step of, when the estimated time is equal to or longer than the time limit, excluding prediction algorithms from a plurality of prediction algorithms for which the same category is set, in a preset order based on the execution time, and then searching candidate combinations of setting contents, the setting contents being applied to a plurality of processes executed when constructing a prediction model based on predetermined data and each relating to one of selection of the prediction algorithm, preprocessing of input data, and hyperparameters, for a combination of setting contents such that the prediction accuracy of a prediction model based on the sample data, constructed when the combination is applied to the plurality of processes, satisfies a predetermined condition; and
a selection step of selecting, from the combinations of setting contents found in the search step, a combination of setting contents for which the prediction accuracy of a prediction model based on the input data, constructed when the combination is applied to the plurality of processes, is equal to or greater than a predetermined threshold.

An analysis program for causing a computer to function as the analysis device according to any one of claims 1 to 5.