JP6863926B2

JP6863926B2 - Data analysis system and data analysis method

Info

Publication number: JP6863926B2
Application number: JP2018083408A
Authority: JP
Inventors: 前川　拓也; 拓也前川
Original assignee: Hitachi Solutions Ltd
Current assignee: Hitachi Solutions Ltd
Priority date: 2018-04-24
Filing date: 2018-04-24
Publication date: 2021-04-21
Anticipated expiration: 2038-04-24
Also published as: WO2019207910A1; JP2019191895A

Description

The present disclosure relates to a data analysis system.

Machine learning techniques such as neural networks are attracting attention, and attempts are being made to solve various problems using machine learning models obtained through machine learning. For example, Patent Document 1 describes a prediction device comprising: a similar case extraction unit 1 that, given a known case set and a prediction case, extracts from the known case set a similar case set, that is, a set of cases similar to the prediction case; a confidence calculation unit 2 that calculates the confidence of a predicted attribute value from the similar case set; and a reliability measure calculation unit 3 that calculates a reliability measure of that confidence from the similar case set and the confidence. The device is configured to output the confidence of a predicted attribute value together with the reliability measure of that confidence.

Japanese Unexamined Patent Publication No. 2003-323601

However, the method described in Patent Document 1 supports the user's subsequent judgment on a prediction result by attaching, to the confidence of a prediction result based on similar cases, a reliability measure indicating how reliable that confidence is. It does not let the user know how much each explanatory variable contributes to the prediction result. That is, the user cannot know which factors led from the input data to the prediction result. In other words, the user has been using the machine learning model while the relationship between the explanatory variables and the objective variable (the prediction result) in the neural network remains unknown. It has therefore been difficult for the user to know what judgment to make based on the prediction result.

The present invention has been made in view of this situation, and provides a technique that visualizes the degree of influence of explanatory variables on an objective variable so that the user can grasp what judgment should be made based on a prediction result.

A representative example of the invention disclosed in the present application is as follows. A data analysis system comprises an arithmetic unit that executes a program and a storage device connected to the arithmetic unit. The system includes: a feature node calculation unit in which the arithmetic unit divides an input data set according to specified division conditions and calculates feature nodes representing the characteristics of the distribution structure of each divided data set, the input data set consisting of either a plurality of explanatory variables that a machine learning model used during training or a data set obtained by processing those explanatory variables; a score calculation unit in which the arithmetic unit generates neighborhood data of input data that includes the feature nodes and, based on the explanatory variables of the generated neighborhood data and the objective variable data obtained by inputting the neighborhood data into the machine learning model, calculates a score representing the relationship between the explanatory variables and the objective variable; and an output processing unit in which the arithmetic unit outputs an output result including the score.

According to one aspect of the present invention, the degree of influence of explanatory variables on an objective variable can be visualized. Problems, configurations, and effects other than those described above will become clear from the following description of the embodiment.

A diagram showing the configuration of the data analysis system of this embodiment.
A diagram showing the data structure of the data analysis system of this embodiment.
A flowchart of the overall process of this embodiment.
A flowchart of the feature node calculation process of this embodiment.
A flowchart of the score calculation process of this embodiment.
A flowchart of the node mapping process of this embodiment.
A flowchart of the output process of this embodiment.
A diagram showing an example of the component map of this embodiment.
A diagram showing an example of the hit map of this embodiment.
A diagram showing an example of the score map of this embodiment.
A diagram showing an example of the node map of this embodiment.

<Embodiment 1>
Hereinafter, an embodiment of the present invention will be described with reference to the drawings.

In this embodiment, the machine learning model has been trained in advance. The system refers to the training data used in that training and uses the trained machine learning model to obtain output results. The machine learning model returns a k-dimensional output vector for a d-dimensional input vector, and in this embodiment the output of the machine learning model is described as the classification probabilities for k classification classes.

FIG. 1 is a diagram showing the configuration of the data analysis system of this embodiment.

The data analysis system of this embodiment is a computer that analyzes the relationship between input data and output data in machine learning, and has an input device 101, an output device 102, a display device 103, a processing device 104, and a storage device 111.

The input device 101 is a keyboard, a mouse, or the like, and is an interface that receives input from the user. The output device 102 is a printer or the like, and is an interface that outputs program execution results in a form the user can view. The display device 103 is a display such as a liquid crystal display, and is likewise an interface that outputs program execution results in a form the user can view. A terminal connected to the data analysis system via a network may provide the input device 101, the output device 102, and the display device 103.

The processing device 104 consists of a processor (arithmetic unit) that executes programs and a memory that stores programs and data. Specifically, the processor executes a program to realize the input processing unit 106, the feature node calculation unit 107, the score calculation unit 108, the node mapping unit 109, and the output processing unit 110. Part of the processing that the processor performs by executing the program may be executed by another arithmetic unit (for example, an FPGA).

The memory includes a ROM, which is a non-volatile storage element, and a RAM, which is a volatile storage element. The ROM stores immutable programs (for example, a BIOS). The RAM is a high-speed, volatile storage element such as a DRAM (Dynamic Random Access Memory), and temporarily stores the program executed by the processor and the data used during program execution.

The storage device 111 is a large-capacity, non-volatile storage device such as a magnetic storage device (HDD) or a flash memory (SSD). The storage device 111 stores the programs executed by the processing device 104 and the data that the processing device 104 uses when executing them. Specifically, the storage device 111 stores the data needed for the series of processes and the output results, including the input data table 112, the normalization information table 113, the division condition table 114, the node information table 115, the node distance table 116, the score table 117, and the weighted average score table 118. A program is read from the storage device 111, loaded into the memory, and executed by the processor.

The data analysis system may have a communication interface that controls communication with other devices according to a predetermined protocol.

The program executed by the processing device 104 is provided to the data analysis system via removable media (CD-ROM, flash memory, etc.) or a network, and is stored in the non-volatile storage device 111, which is a non-transitory storage medium. For this reason, the data analysis system preferably has an interface for reading data from removable media.

The data analysis system is a computer system configured on one physical computer or on a plurality of logically or physically configured computers, and may operate on a virtual machine built on a plurality of physical computer resources.

FIG. 2 is a diagram showing the data structure of the data analysis system of this embodiment.

The input data table 112 stores the training data of the machine learning model, processed into the format used in the series of processes performed by the data analysis system of this embodiment, and includes explanatory variables 1 to d (201) and objective variables 1 to k (202).

Explanatory variables 1 to d (201) represent the d-dimensional vector that is the input data of the machine learning model. In machine learning, data is often normalized per variable; in this embodiment, such normalized data is restored to its original numerical values using the normalization information table 113 before being stored. When the training data of the machine learning model is a time series, a variable named x can be flattened into variables named x_t0, x_t1, x_t2, ..., one for the value at each time point. In that case the number of dimensions of the explanatory variables 201 does not match the number of input dimensions of the machine learning model, so the data format is converted each time data in the input format of this embodiment is fed to the machine learning model. Objective variables 1 to k (202) are the k-dimensional vector that is the output of the machine learning model.
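The flattening of a time-series variable x into x_t0, x_t1, ... and the conversion back to the model's input format could be sketched as follows. This is only an illustration: the function names `flatten_time_series` and `unflatten_time_series` and the dictionary-based record layout are assumptions, not part of the disclosed system.

```python
def flatten_time_series(record, steps):
    """Expand {name: [v0, v1, ...]} into {name_t0: v0, name_t1: v1, ...}."""
    flat = {}
    for name, values in record.items():
        if isinstance(values, (list, tuple)):
            if len(values) != steps:
                raise ValueError(f"{name}: expected {steps} time steps")
            for t, v in enumerate(values):
                flat[f"{name}_t{t}"] = v
        else:
            flat[name] = values  # scalar variables are kept unchanged
    return flat


def unflatten_time_series(flat, series_names, steps):
    """Inverse conversion, applied before data is passed to the model."""
    series_keys = {f"{n}_t{t}" for n in series_names for t in range(steps)}
    record = {k: v for k, v in flat.items() if k not in series_keys}
    for name in series_names:
        record[name] = [flat[f"{name}_t{t}"] for t in range(steps)]
    return record


sample = {"age": 30, "x": [1.0, 2.0, 3.0]}  # hypothetical record
flat = flatten_time_series(sample, steps=3)
restored = unflatten_time_series(flat, ["x"], steps=3)
```

Keeping both directions as small pure functions makes it easy to convert the data format each time the model is called, as the text requires.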

The normalization information table 113 stores information about the normalization performed when the machine learning model was trained, and includes variable ID 203, variable name 204, data type 205, mean 206, standard deviation 207, and model data format correspondence information 208.

The variable ID 203 is an index that identifies an element of the explanatory variables 201. The variable name 204 is the name of that explanatory variable. The data type 205 is the data type of that explanatory variable (for example, logical, integer, or floating point).

The mean 206 and standard deviation 207 store the mean and standard deviation used in the normalization performed when the machine learning model was trained. For variables that are not normalized, such as logical variables, the mean may be set to 0 and the standard deviation to 1. The model data format correspondence information 208 stores the information needed to convert between the input format of the machine learning model and the input format handled by the data analysis system of this embodiment when the two differ. For example, for data that includes a time series, a variable x is expanded into x_t0, x_t1, ..., so recording the correspondence between the index before expansion and the indices after expansion makes conversion in both directions possible.
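As a minimal sketch of how the mean and standard deviation columns might be applied, the following code normalizes a row and restores it. The `NormInfo` class and the sample values are hypothetical; the default of mean 0 and standard deviation 1 mirrors the convention described above for variables that are not normalized.

```python
from dataclasses import dataclass


@dataclass
class NormInfo:
    name: str
    dtype: str
    mean: float = 0.0  # mean 0 and std 1 leave the value unchanged
    std: float = 1.0


def normalize(row, table):
    """Map original values to the scale the model was trained on."""
    return [(v - info.mean) / info.std for v, info in zip(row, table)]


def denormalize(row, table):
    """Restore normalized values to the original numerical scale."""
    return [v * info.std + info.mean for v, info in zip(row, table)]


# Hypothetical table: one normalized variable, one logical variable.
table = [NormInfo("age", "int", 40.0, 10.0), NormInfo("is_member", "bool")]
normalized = normalize([50, 1], table)
restored = denormalize(normalized, table)
```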

The division condition table 114 stores the conditions under which the feature node calculation unit 107 divides the input data table 112, and includes condition ID 209, division condition 210, data count 211, map size 212, and aggregation flags 1 to k (213).

The condition ID 209 is identification information for identifying a condition recorded in the division condition table 114. The division condition 210 is a condition for dividing the input data to obtain one data set. It may be, for example, a character string such as an SQL SELECT statement. The division condition 210 may describe a specific value or range for an explanatory variable, a condition on the value of an objective variable, or a combination of these. The data count 211 is the number of data items in the input data selected by the division condition.
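One way to picture a division condition expressed as an SQL-like string is the sketch below, which uses Python's built-in `sqlite3` module. The table name `input_data`, the column names, the sample rows, and the two conditions are all invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE input_data (age REAL, visits REAL, member_rank INTEGER)"
)
conn.executemany(
    "INSERT INTO input_data VALUES (?, ?, ?)",
    [(25, 3, 1), (41, 10, 2), (37, 1, 1), (52, 7, 3)],
)

# Hypothetical division conditions keyed by condition ID.
division_conditions = {1: "member_rank = 1", 2: "member_rank = 2 AND age >= 40"}


def split(connection, where_clause):
    # The clause is assumed to come from trusted configuration;
    # string interpolation like this is not safe for untrusted input.
    query = f"SELECT age, visits, member_rank FROM input_data WHERE {where_clause}"
    return connection.execute(query).fetchall()


datasets = {cid: split(conn, cond) for cid, cond in division_conditions.items()}
# The data count column is simply the size of each selected data set.
data_counts = {cid: len(rows) for cid, rows in datasets.items()}
```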

The map size 212 stores the map size that the feature node calculation unit 107 uses in the node vector calculation step 403 of FIG. 4. When the map size is set automatically, the map size 212 may initially be set to NULL or the like, and the automatically determined value may be stored afterwards. The aggregation flags 1 to k (213) are an array of k flags, one per objective variable, used in the objective score aggregation step 504 of FIG. 5. Only the scores for objective variables whose flag is set to 1 are aggregated into the objective score. For example, in an analysis of a member management system aimed at moving members up from their current rank, each current rank is set as a division condition, and the flags of the objective variables corresponding to predicted ranks higher than the current rank are set to 1.

The node information table 115 stores the feature nodes calculated by the feature node calculation unit 107, and includes condition ID 214, node ID 215, hit count 216, hit rate 217, coordinates 218, explanatory variables 1 to d (219), and objective variables 1 to k (220).

The condition ID 214 is identification information (condition ID 209) for identifying a condition recorded in the division condition table 114. The node ID 215 is identification information of a node belonging to the condition identified by the condition ID 214. The hit count 216 is, for the node identified by the node ID 215, the number of data items in the data set produced by the division condition that are closer to that node than to any other node. The hit rate 217 is the hit count 216 divided by the data count 211. The coordinates 218 are the result of the node mapping process 304 shown in FIG. 3.

Explanatory variables 1 to d (219) and objective variables 1 to k (220) have the same format as the input data table 112 and form a node vector that represents the characteristics of the distribution structure of the input data set. This vector need not coincide with any data item in the input data table, and need not conform to the type specified in data type 205. For example, even where a logical or integer type is specified, the value can be stored as floating-point data.

The node distance table 116 stores the distance between each pair of nodes, computed on node vectors consisting of the explanatory variables 219 of the node information table 115, or of the explanatory variables 219 together with the objective variables 220. It includes node-from 221, node-to 222, and distance 223.

The node-from 221 and node-to 222 are each identification information for identifying a node recorded in the node information table 115. Their values may be a pair of condition ID 214 and node ID 215, or an index into the node information table 115. The distance 223 is the distance between the node vectors of node-from 221 and node-to 222.

The node distance table 116 may also be represented as a two-dimensional array. In that case, the index into the node information table 115 is used for the rows and columns.
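The two representations of the node distance table can be illustrated with a short sketch. The source does not state which distance metric is used; Euclidean distance is assumed here, and the function names are hypothetical.

```python
import math


def distance_matrix(node_vectors):
    """Two-dimensional array form: dist[i][j] for node indices i and j."""
    n = len(node_vectors)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(node_vectors[i], node_vectors[j])  # assumed metric
            dist[i][j] = dist[j][i] = d
    return dist


def to_long_form(dist):
    """Row form: (node_from, node_to, distance) for every ordered pair."""
    n = len(dist)
    return [(i, j, dist[i][j]) for i in range(n) for j in range(n) if i != j]


nodes = [[0.0, 0.0], [3.0, 4.0], [0.0, 8.0]]  # hypothetical node vectors
matrix = distance_matrix(nodes)
rows = to_long_form(matrix)
```

The array form gives constant-time lookup by index, while the row form matches the node-from, node-to, distance columns described above.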

The score table 117 stores the results calculated by the score calculation unit 108, and includes objective variable ID 224, condition ID 225, node ID 226, and scores 227 for explanatory variables 1 to d.

The objective variable ID 224 stores the element number within the k-dimensional output vector of the machine learning model. The condition ID 225 and node ID 226 are identification information for identifying a node recorded in the node information table 115, and use the same values as the condition ID 214 and node ID 215 of the node information table 115. The scores 227 for explanatory variables 1 to d are the results calculated by the score calculation unit 108, and are stored for each combination of objective variable ID 224, condition ID 225, node ID 226, and explanatory variable.

The score table 117 also stores the objective scores and weighted average scores described with reference to FIG. 5. For an objective score, the objective variable ID 224 is set to a value such as -1. For a weighted average score, both the objective variable ID 224 and the node ID 226 are set to a value such as -1. This indicates that the entry does not identify a specific objective variable or node.

The weighted average score table 118 stores scores for each division condition and explanatory variable, in a form where the objective variable ID 224 and node ID 226 are unspecified (for example, -1). Specifically, the weighted average score table 118 is a list in which the weighted average scores calculated in step 505 of the score calculation process 303 (FIG. 5), described later, are separated by division condition, the scores of the explanatory variables are sorted in descending order of absolute value, and each score is listed together with its variable name. The weighted average score table 118 is generated in step 701 of the output process 305 (FIG. 7). With the weighted average score table 118, the user can easily identify the explanatory variables with high influence for each target group represented by a division condition, and can compare the rank and score of explanatory variables across division conditions. For example, attribute A has a large influence under conditions 1 and 2, while attribute J has a large influence under conditions 3 and 4. In addition, the score of attribute I has opposite signs, so applying the same measure may produce opposite effects. In this way, the table can be used to plan measures for the target group indicated by each condition.
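The per-condition ranking described above, sorted by descending absolute score and listed with variable names, might be built as in the following sketch. The function name, attribute names, and score values are made up for illustration.

```python
def build_weighted_average_table(scores, variable_names):
    """scores: {condition_id: [weighted average score per explanatory variable]}.

    Returns {condition_id: [(variable_name, score), ...]} ordered by
    descending absolute value, so the most influential variables come first.
    """
    table = {}
    for condition_id, row in scores.items():
        pairs = zip(variable_names, row)
        table[condition_id] = sorted(pairs, key=lambda p: abs(p[1]), reverse=True)
    return table


names = ["A", "I", "J"]  # hypothetical attribute names
scores = {1: [0.9, -0.4, 0.1], 3: [0.2, 0.5, -0.8]}  # hypothetical scores
ranking = build_weighted_average_table(scores, names)
```

Sorting by absolute value keeps strongly negative scores near the top, which is what lets a sign reversal such as the one noted for attribute I stand out.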

FIG. 3 is a flowchart of the overall process of this embodiment.

First, the input processing unit 106 executes the input process (301). For example, the input processing unit 106 refers to the normalization information table 113, converts the training data of the machine learning model from the model's input format to the input format of this embodiment, restores the normalized values to their original values, and stores the result in the input data table 112.

Next, the feature node calculation unit 107 executes the feature node calculation process (302). For example, the feature node calculation unit 107 divides the input data table 112 according to the division condition table 114, calculates feature nodes from each divided data set, and stores the result in the node information table 115. Details of the feature node calculation process are described with reference to FIG. 4.

Next, the score calculation unit 108 executes the score calculation process (303). For example, the score calculation unit 108 calculates scores representing the degree of influence of the explanatory variables and stores the result in the score table 117. Details of the score calculation process are described with reference to FIG. 5.

Next, the node mapping unit 109 executes the node mapping process (304). For example, the node mapping unit 109 maps the feature nodes obtained in step 302 to a low-dimensional space. Details of the node mapping process are described with reference to FIG. 6.

Next, the output processing unit 110 executes the output process (305), and the overall process ends. Details of the output process are described with reference to FIG. 7.

FIG. 4 is a flowchart of the feature node calculation process 302 of this embodiment.

First, the feature node calculation unit 107 loops a variable p from 1 to the number of records in the division condition table 114 (401). Steps 402 to 405 are then executed for the p-th division condition.

Next, the feature node calculation unit 107 performs data division (402). For example, it selects from the input data table 112 the data that satisfies the division condition 210 of the p-th division condition. The selected data set is then normalized using the normalization information table 113.

Next, the feature node calculation unit 107 calculates node vectors (403). For example, using a clustering method such as the k-means method, it takes the distribution structure of the selected data set into account and calculates node vectors that represent its characteristics with a smaller number of nodes. This embodiment applies a self-organizing map (hereinafter, SOM). A SOM is a type of neural network represented by nodes arranged in a grid and edges connecting adjacent nodes. Each node is assigned a reference vector in the same format as the input data. For each training sample of the SOM, the reference vector of the node closest to that sample (hereinafter, the BMU, for Best Matching Unit) is updated so as to approach the sample, and the reference vectors of the nodes connected to the BMU are updated in the same way. By repeating this process, the complex distribution structure of the training data can be mapped onto the geometric structure of the nodes. Since the SOM is a known method, a detailed description is omitted.
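The update rule described above can be sketched in a few lines of standard-library Python. This is a simplified illustration rather than the disclosed implementation: the Gaussian neighborhood function, the linear decay of the learning rate and radius, the initialization from random samples, and all parameter values are assumptions.

```python
import math
import random


def train_som(data, rows, cols, epochs=50, lr0=0.5, seed=0):
    """Train a small rectangular SOM; returns {(row, col): reference_vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Each grid node starts from a copy of a randomly chosen training sample.
    nodes = {(r, c): list(rng.choice(data))
             for r in range(rows) for c in range(cols)}
    sigma0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs
        lr = lr0 * decay                  # learning rate shrinks over time
        sigma = max(sigma0 * decay, 0.5)  # and so does the neighborhood radius
        for x in rng.sample(data, len(data)):  # visit samples in random order
            # Best Matching Unit: node whose reference vector is closest to x.
            bmu = min(nodes, key=lambda pos: math.dist(nodes[pos], x))
            for pos, w in nodes.items():
                grid_d2 = (pos[0] - bmu[0]) ** 2 + (pos[1] - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2.0 * sigma ** 2))
                # Pull the BMU and its grid neighbors toward the sample.
                for k in range(dim):
                    w[k] += lr * h * (x[k] - w[k])
    return nodes


# Hypothetical normalized data forming two clusters.
data = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]]
som = train_som(data, rows=2, cols=2)
```

The grid dimensions passed as `rows` and `cols` play the role of the map size held in the division condition table.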

The reference vector of each node calculated by the SOM is stored in the node information table 115 in the form of explanatory variables 219 and objective variables 220.

The training data for the SOM can consist of the explanatory variables only, or of the explanatory variables together with the objective variables. Which format to use should be set in advance. The objective variables 220 in the output follow the format chosen for this training data.

Next, the feature node calculation unit 107 counts the number of hits for each node (404). For each node calculated in step 403, the number of data items in the selected data set whose BMU is that node is calculated as the hit count 216. The hit rate 217 is calculated by dividing the hit count by the number of selected data items.
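Counting hits amounts to assigning each data item to its nearest node, as in the sketch below. The function name and sample values are hypothetical, and Euclidean distance is assumed.

```python
import math
from collections import Counter


def count_hits(nodes, data):
    """Return (hit counts, hit rates), one entry per node in `nodes`."""
    # For each data item, find the index of its Best Matching Unit.
    bmu_indices = (
        min(range(len(nodes)), key=lambda i: math.dist(nodes[i], x))
        for x in data
    )
    counter = Counter(bmu_indices)
    hits = [counter.get(i, 0) for i in range(len(nodes))]
    rates = [h / len(data) for h in hits]
    return hits, rates


nodes = [[0.0, 0.0], [10.0, 10.0]]  # hypothetical node vectors
data = [[1.0, 1.0], [0.0, 2.0], [9.0, 9.0], [2.0, 0.0]]
hits, rates = count_hits(nodes, data)
```

Because every data item has exactly one BMU, the hit rates within one division condition sum to 1.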

Next, the feature node calculation unit 107 stores the calculated results in the data storage area (405). Because the node vectors calculated in step 403 are normalized, they are restored to the original scale using the normalization information table 113 before being stored.

When the loop of steps 401 to 405 finishes, the feature node calculation process 302 ends.

FIG. 5 is a flowchart of the score calculation process 303 of this embodiment.

First, the score calculation unit 108 loops a variable i from 1 to the number of records in the node information table 115 (501). Steps 502 to 504 are then executed for the i-th node.

Next, the score calculation unit 108 generates a neighborhood data set for node i and the machine learning model's predictions for it (502). Neighborhood data is vector data located around the d-dimensional vector represented by the explanatory variables of the node specified by the variable i. In this embodiment, neighborhood data is generated as random numbers following a normal distribution whose mean is the value of each explanatory variable of node i and whose standard deviation is half the standard deviation in the normalization information, although other generation methods may be used. The number of data items in the neighborhood data set should be specified in advance. Prediction by the machine learning model is performed after normalization using the normalization information table 113 and conversion using the model data format correspondence information 208.
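The neighborhood generation rule in this step (mean at the node's value, standard deviation equal to half the normalization standard deviation) can be sketched as follows. The function name, the sample count, and the sample values are assumptions; passing the samples to the trained model is omitted.

```python
import random


def generate_neighborhood(node_vector, stds, n_samples, seed=None):
    """Draw samples around a node from independent normal distributions.

    node_vector: explanatory variable values of the node (original scale).
    stds: per-variable standard deviations from the normalization table.
    """
    rng = random.Random(seed)
    return [
        [rng.gauss(mu, 0.5 * sd) for mu, sd in zip(node_vector, stds)]
        for _ in range(n_samples)
    ]


# Hypothetical node with two explanatory variables.
samples = generate_neighborhood([40.0, 3.0], [10.0, 2.0], n_samples=200, seed=1)
mean_first = sum(s[0] for s in samples) / len(samples)
```

Drawing each variable independently is the simplest reading of the text; a different sampling scheme could be substituted without changing the later steps.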

Next, the score calculation unit 108 performs local model estimation on the generated neighborhood data set and the machine learning model's prediction results (503). Step 503 yields scores that represent the relationship between the explanatory variables and the objective variable in the neighborhood data. In this embodiment, linear model estimation is applied to the neighborhood data set and the prediction results, and the estimated parameters are used as the scores. That is, the output Y of the machine learning model for the d-dimensional explanatory variables X = (x_1, x_2, ..., x_d) is approximated by a linear model of the form Ŷ = S_1 x_1 + S_2 x_2 + ... + S_d x_d + C, and the estimated parameter S_i is taken as the score for input x_i. Here, Y, Ŷ, S_i, and C are k-dimensional vectors. Because linear model estimation is a known technique, a detailed description is omitted.
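One way to carry out the local estimation of step 503 is ordinary least squares. The sketch below assumes the linear model has one k-dimensional parameter S_i per explanatory variable plus a constant term C, and solves the normal equations with plain Gauss-Jordan elimination; the embodiment does not say which estimator is used, so the function name `fit_local_linear` and the choice of least squares are assumptions.

```python
def fit_local_linear(xs, ys):
    """Least-squares fit of ys ~ sum_i S_i * x_i + C.

    xs: n rows of d explanatory values; ys: n rows of k model outputs.
    Returns (scores, intercept): scores[i] is the k-vector S_i.
    Assumes the neighborhood data are not degenerate (non-singular system).
    """
    d, k = len(xs[0]), len(ys[0])
    rows = [list(x) + [1.0] for x in xs]      # extra column of ones for C
    m = d + 1
    # Augmented normal equations [A^T A | A^T Y]
    aug = [
        [sum(r[p] * r[q] for r in rows) for q in range(m)]
        + [sum(r[p] * y[j] for r, y in zip(rows, ys)) for j in range(k)]
        for p in range(m)
    ]
    for col in range(m):                      # Gauss-Jordan, partial pivoting
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        for r in range(m):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    weights = [row[m:] for row in aug]        # (d + 1) rows of k values
    return weights[:d], weights[d]


# Hypothetical neighborhood: d = 2 inputs, k = 2 outputs, exactly linear.
xs = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
ys = [[2 * a - b + 3, a + 0.5] for a, b in xs]
scores, intercept = fit_local_linear(xs, ys)
```

Here `scores[0]` plays the role of S_1 and `scores[1]` the role of S_2. In practice a numerical library routine would normally replace the hand-written solver.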

Next, the score calculation unit 108 aggregates the scores obtained in step 503 according to the aggregation flag 213 to calculate the target score (504). Specifically, the scores of the elements whose flag is 1 are aggregated for each explanatory variable.

When the loop from step 501 to step 504 ends, the score calculation unit 108 applies the hit rate 217 to the target scores to calculate weighted average scores (505). A weighted average score is calculated for each explanatory variable over all nodes with the same condition ID.
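Steps 504 and 505 can be illustrated together. The sketch assumes that "aggregating" means summing the flagged elements of each k-dimensional score, and that the hit rates act as weights normalized by their sum; both readings, and the function names `target_scores` and `weighted_average_scores`, are assumptions for the example.

```python
def target_scores(node_scores, flags):
    """Step 504 (assumed form): for each explanatory variable, sum the
    score elements whose aggregation flag is 1."""
    return [sum(s for s, f in zip(vec, flags) if f == 1)
            for vec in node_scores]


def weighted_average_scores(per_node_targets, hit_rates):
    """Step 505 (assumed form): hit-rate-weighted mean per explanatory
    variable over all nodes with the same condition ID."""
    total = sum(hit_rates)
    d = len(per_node_targets[0])
    return [
        sum(w * t[i] for w, t in zip(hit_rates, per_node_targets)) / total
        for i in range(d)
    ]


flags = [1, 0, 1]                      # hypothetical aggregation flags, k = 3
node_a = [[0.2, 0.5, 0.3], [-0.1, 0.0, 0.4]]   # scores S_1, S_2 at node A
node_b = [[1.0, 9.0, 1.0], [0.0, 9.0, -1.0]]   # scores S_1, S_2 at node B
targets = [target_scores(node_a, flags), target_scores(node_b, flags)]
averages = weighted_average_scores(targets, hit_rates=[0.75, 0.25])
```

With these made-up numbers, node A contributes three times as much as node B because its hit rate is three times larger.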

The score calculation unit 108 then stores the calculated results in the data storage area (506) and ends the score calculation process.

FIG. 6 is a flowchart of the node mapping process 304 of this embodiment. In this embodiment, multidimensional scaling (hereinafter, MDS) is used to map the set of grid-shaped planar SOM nodes for each division condition to two-dimensional coordinates; however, the nodes may have a different geometric structure, and the mapping space may have a different number of dimensions.

MDS is a technique for mapping nodes in a multidimensional vector space to a low-dimensional space, such as two or three dimensions, so that the distances between nodes are reproduced as closely as possible. Because MDS is a known technique, a detailed description is omitted. In this embodiment, MDS is initialized in a way that takes the geometric structure of the SOM nodes into account.

First, the node mapping unit 109 generates the node distance table 116 (601). In this embodiment, each feature node vector is taken to be the normalized explanatory variables 219, and the distance table is generated using the Euclidean distance.
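A distance table of this kind can be built in a few lines. The sketch below uses the standard library's `math.dist` for the Euclidean distance; the function name `distance_table` and the sample node vectors are made up for illustration.

```python
import math


def distance_table(nodes):
    """Pairwise Euclidean distances between feature node vectors."""
    return [[math.dist(a, b) for b in nodes] for a in nodes]


# Three hypothetical normalized node vectors.
table = distance_table([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
```

The table is symmetric with zeros on the diagonal, so an implementation could store only the upper triangle if memory matters.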

Next, the node mapping unit 109 initializes each variable (602). Specifically, lt, lb, rt, and rb are defined as the indexes of the upper-left, lower-left, upper-right, and lower-right nodes in the grid-shaped SOM node structure, and all are set to -1. Next, y is set to 0. The array Pos is defined as an array that stores the coordinates of each node. Sw and Sh are defined as node coordinate arrays in the x direction and the y direction, respectively; their sizes are determined by the map size 212. All elements of Pos, Sw, and Sh are initialized to 0.

Next, the node mapping unit 109 loops a variable p from 1 to the number of records in the division condition table 114 (603). Steps 604 to 609 are then executed for the p-th division condition.

Next, if rb is 0 or more (Yes in 604), the node mapping unit 109 sets y to the maximum value in the array Sh plus a predetermined number (for example, 2) (605). The predetermined number may be changed to any appropriate value.

On the other hand, if rb is negative (No in 604), the node mapping unit 109 proceeds to step 606 without doing anything.

Next, the indexes of the four corner nodes for division condition p are set to lt, lb, rt, and rb (606). Here, lt can be set to rb + 1, and the remaining variables can be set according to the map size 212.

Next, the node mapping unit 109 sets values in Sw and Sh (607). In this embodiment, the values are obtained by dividing the distance between nodes lt and rt, and the distance between nodes lt and lb, evenly according to the map size.

Next, the node mapping unit 109 adds y to each element of Sh (608). To shift in the x-axis direction instead, a variable x can be defined and the same processing as for y applied to Sw.

Next, the node mapping unit 109 sets the coordinates of nodes lt through rb in Pos (609). For example, the coordinates of the node at row i and column j in the SOM node structure may be set to (Sw[i], Sh[j]).

When the loop from step 603 to step 609 ends, MDS is applied with Pos as the initial coordinates of the nodes (610).
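The initialization in steps 602 to 609, whose result Pos is handed to MDS, can be sketched as follows. Several details are assumptions made for the example: nodes are stored condition by condition in a flat list, each grid is indexed so that i runs along the x direction and j along the y direction, and the flat index of grid position (i, j) is lt + j * width + i. The function name `init_positions` and the parameter `gap` (the "predetermined number") are hypothetical, and the MDS step itself is not shown.

```python
import math


def init_positions(dist, n_conditions, width, height, gap=2.0):
    """Initial 2-D coordinates for grid-shaped SOM nodes, one grid per
    division condition, stacked along the y axis."""
    lt = lb = rt = rb = -1                              # step 602
    y = 0.0
    pos = [(0.0, 0.0)] * (n_conditions * width * height)
    sw = [0.0] * width
    sh = [0.0] * height
    for _ in range(n_conditions):                       # step 603
        if rb >= 0:                                     # steps 604-605
            y = max(sh) + gap
        lt = rb + 1                                     # step 606
        rt = lt + width - 1
        lb = lt + (height - 1) * width
        rb = lt + width * height - 1
        dx = dist[lt][rt] / max(width - 1, 1)           # step 607
        dy = dist[lt][lb] / max(height - 1, 1)
        sw = [i * dx for i in range(width)]
        sh = [j * dy + y for j in range(height)]        # step 608
        for j in range(height):                         # step 609
            for i in range(width):
                pos[lt + j * width + i] = (sw[i], sh[j])
    return pos


# Two hypothetical division conditions, each with a 2 x 2 map.
vectors = [(0, 0), (1, 0), (0, 2), (1, 2), (5, 5), (8, 5), (5, 6), (8, 6)]
dist = [[math.dist(a, b) for b in vectors] for a in vectors]
pos = init_positions(dist, n_conditions=2, width=2, height=2)
```

In this example the second grid starts at y = 4, that is, the largest y coordinate of the first grid (2) plus the gap (2), so the two grids do not overlap before MDS refines the layout.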

Next, the node mapping unit 109 stores the result in the storage area (611) and ends the node mapping process.

FIG. 7 is a flowchart of the output process 305 of this embodiment.

First, the output processing unit 110 enumerates the weighted average scores and generates the weighted average score table 118 (701). As described above, the weighted average score table 118 separates the weighted average scores by division condition, sorts the scores of the explanatory variables in descending order of absolute value, and lists them together with the variable names.
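The table construction in step 701 amounts to a sort by absolute value within each division condition. The following sketch shows one way to do it; the function name `score_table`, the dictionary layout, and the condition and variable names are invented for the example.

```python
def score_table(weighted_scores):
    """weighted_scores: {condition: {variable_name: weighted average score}}.
    Returns, per condition, (variable_name, score) pairs sorted by
    descending absolute score."""
    return {
        cond: sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)
        for cond, scores in weighted_scores.items()
    }


table_118 = score_table({
    "rank A": {"age": 0.1, "visits": -0.7, "spend": 0.4},
})
```

Sorting by absolute value keeps strongly negative influences, such as `visits` here, at the top of the list alongside strongly positive ones.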

Next, the output processing unit 110 displays component maps of the node vectors (702). A component map visualizes the values of a specific explanatory variable 219 or objective variable 220 at each node under the same condition, laid out according to the geometric structure of the SOM nodes and the map size. For example, when the map size is m × n, the component map of explanatory variable i displays the values of explanatory variable i at all nodes with the same condition ID in the node information table as an m × n image, with each value drawn in the color corresponding to it.
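As a rough illustration of the component map layout, the sketch below arranges the per-node values of one explanatory variable into an m × n grid and scales them to 0..255 so they can serve as gray levels or colormap indexes. Row-major node ordering and the linear scaling are assumptions; the function name `component_map` is hypothetical.

```python
def component_map(values, m, n):
    """Arrange per-node values of one variable into an m x n grid of
    intensity levels (0..255), assuming row-major node order."""
    if len(values) != m * n:
        raise ValueError("expected m * n node values")
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0            # avoid dividing by zero on flat maps
    return [
        [round(255 * (values[r * n + c] - lo) / span) for c in range(n)]
        for r in range(m)
    ]


# Hypothetical values of one explanatory variable on a 2 x 3 map.
grid = component_map([0.0, 0.2, 1.0, 1.0, 0.2, 0.0], m=2, n=3)
```

The same layout could be reused for the hit map and score map, since they are described as using the visualization method of step 702.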

As illustrated in FIG. 8, the component map of this embodiment images, for a specific division condition, the values of each explanatory variable 219 based on the geometric structure of the nodes. If the processing in step 403 was performed on vectors that also include the objective variable 202, a component map using the objective variable 220 can also be displayed. Component maps make it possible to grasp visually the correlations among explanatory variables and between explanatory variables and the objective variable.

Next, the output processing unit 110 displays a hit map (703). The hit map visualizes the number of hits 216 (or its logarithm) or the hit rate 217 using the visualization method of step 702.

As illustrated in FIG. 9, the hit map of this embodiment images the number of hits by color coding based on the logarithm of the hit rate 217. The numerical hit counts may also be displayed, as in the figure. The hit map shows, for example, which nodes lie in dense regions of the training data distribution.

Next, the output processing unit 110 displays a score map (704). The score map visualizes the score 227 or the target score for a specific explanatory variable using the visualization method of step 702.

As illustrated in FIG. 10, the score map of this embodiment images the score 227 of each explanatory variable by color coding. For example, assigning green to a score of 0 and shifting gradually toward red for positive scores and toward blue for negative scores makes it easy to see which explanatory variable has a strong influence at which node position. In addition, as shown in the figure, comparing the pattern with the component map of the corresponding explanatory variable reveals what values the explanatory variable takes at nodes where its influence is high.
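The green, red, and blue color coding described above could be realized with a simple linear blend. The sketch below is one possible mapping, not the one used in FIG. 10; the function name `score_to_rgb` and the clipping parameter `max_abs` are assumptions.

```python
def score_to_rgb(score, max_abs):
    """Map a score to an (R, G, B) triple: 0 -> green, positive -> red,
    negative -> blue, clipped at +/- max_abs."""
    t = max(-1.0, min(1.0, score / max_abs)) if max_abs > 0 else 0.0
    strength = abs(t)
    green = round(255 * (1 - strength))
    if t >= 0:
        return (round(255 * strength), green, 0)
    return (0, green, round(255 * strength))


neutral = score_to_rgb(0.0, max_abs=1.0)
positive = score_to_rgb(2.0, max_abs=2.0)
negative = score_to_rgb(-5.0, max_abs=2.0)   # clipped to the blue end
```

Choosing `max_abs` as the largest absolute score across all maps would keep the colors comparable between explanatory variables.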

Next, the output processing unit 110 displays a node map (705). The node map visualizes each node as a point in a low-dimensional space using the per-node coordinates 218 calculated in step 304. The shape and color of the point representing each node may be set according to the explanatory variable values and objective variable values in the node information table, the per-variable scores and target scores in the score table, the division condition, and so on.

As illustrated in FIG. 11, the node map of this embodiment plots the nodes in a two-dimensional space based on the node coordinates 218 under each division condition. The geometric structure of the nodes under a specific division condition may also be drawn as grid lines. The node map shows the positional relationships among nodes across multiple division conditions. For example, when the current rank is used as the division condition, looking at nodes that lie close to one another makes it easy to find nodes that are likely to move up in rank or are at high risk of moving down. Differences in characteristics between such neighboring nodes can be examined by directly comparing values in the node information table or by using component maps.

Then, the process ends.

The visualization methods described above can be executed in any order according to the user's instructions, and may be combined and displayed simultaneously.

As described above, the data analysis system of this embodiment includes: a feature node calculation unit 107 that divides, under specified division conditions, an input data set consisting of a plurality of explanatory variables used by a machine learning model during training, or an input data set consisting of a data set obtained by processing those explanatory variables, and calculates feature nodes representing the features of the distribution structure of each divided data set; a score calculation unit 108 that generates neighborhood data of input data including the feature nodes and calculates, based on the explanatory variables of the generated neighborhood data and the objective variable data obtained by inputting the neighborhood data into the machine learning model, scores representing the relationship between the explanatory variables and the objective variable; and an output processing unit 110 that outputs an output result including the scores. As a result, for a trained machine learning model, the degree of influence of each explanatory variable on the objective variable can be calculated and visualized for each target group indicated by a division condition. In addition, feature nodes representing the distribution structure can describe the characteristics of a data set with fewer data points than the training data. Furthermore, even when the training data are sparse and do not cover the whole space, neighborhood data can represent the characteristics of the data set and complement the feature nodes. In short, the characteristics of the data set can be represented with a small amount of data, which reduces the amount of computation.

Further, because the feature node calculation unit 107 calculates the feature nodes based on the input data set to which a self-organizing map is applied, the feature nodes can be calculated accurately.

Further, because the feature node calculation unit 107 calculates the feature nodes using an input data set consisting of the plurality of explanatory variables used by the machine learning model during training and the objective variable calculated by the machine learning model, or an input data set consisting of a data set obtained by processing those explanatory variables and that objective variable, the objective variable can be compared on the maps.

Further, because the feature node calculation unit 107 divides the input data set according to a division condition that includes at least one of a specific value or range of a specific explanatory variable and a specific value (for example, the maximum value) or range of an element of the objective variable, or according to a division condition expressed as a combination of these, the analysis can be narrowed down to a target group. That is, by changing the attributes according to the purpose, data for a group having specific attributes can be analyzed rather than data for the whole population.

Further, the score calculation unit 108 calculates, for each explanatory variable, a score corresponding to the format of the objective variable by applying linear model estimation to the explanatory variable data and the objective variable data. Because a linear model is simple and easy to handle, the result is easy for the user to understand, which improves confidence in the result. In particular, with a linear model, multiple attributes can be combined by summing probabilities, which is intuitive for the user.

Further, because the score calculation unit 108 calculates the target score by aggregating, from among the elements of the objective variable, the portion designated for each division condition, the analysis can be narrowed down to a target group. That is, by changing the attributes according to the purpose, data for a group having specific attributes can be analyzed rather than data for the whole population.

Further, because the score calculation unit 108 calculates, from the calculated scores and the calculated target scores, a weighted average score for each explanatory variable based on the number of surrounding data points for each feature node under each division condition, the characteristics of the data can be represented correctly with the density distribution taken into account.

Further, because the system includes the node mapping unit 109, which maps the feature nodes calculated by the feature node calculation unit 107 to a two-dimensional space under each division condition, the characteristics of a group can be presented in an easy-to-understand manner.

Further, because the node mapping unit 109 generates data for imaging and displaying, based on the geometric structure of the nodes, the feature node values for each explanatory variable, the calculated scores, and the weighted average scores calculated from the scores and target scores, data can be compared between groups with different attributes while the distance relationships between nodes are maintained.

Further, because the node mapping unit 109 initializes the feature node vectors, or the feature node vectors including the objective variable component, based on the geometric structure of the feature nodes of the division condition and then performs the mapping by applying multidimensional scaling, the score map can show clearly which attributes have high influence and which have low influence.

Further, when the input data set is time-series data containing explanatory variables at predetermined time intervals, data in which each such explanatory variable is expanded into independent variables from a certain point in the past up to the present are used as the input data, and the rules used for the expansion are stored. Data whose input data set is time-series data can therefore be analyzed.
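The time-series expansion can be pictured as turning each variable's recent history into separate lagged columns while recording how each column was derived. The sketch below is one hypothetical realization: the function name `expand_time_series`, the column naming pattern `name_t-lag`, and the representation of a rule as a (variable, lag) pair are all assumptions.

```python
def expand_time_series(record, window):
    """Expand each variable's last `window` values (oldest to newest) into
    independent columns, and return the rules that map columns back to
    the original variable and lag."""
    expanded, rules = {}, {}
    for name, values in record.items():
        recent = values[-window:]
        lags = range(len(recent) - 1, -1, -1)      # e.g. 2, 1, 0
        for lag, value in zip(lags, recent):
            column = f"{name}_t-{lag}"
            expanded[column] = value
            rules[column] = (name, lag)
    return expanded, rules


# Hypothetical monthly series, expanded over the last three time points.
expanded, rules = expand_time_series({"sales": [3, 5, 8, 13]}, window=3)
```

Keeping `rules` alongside the expanded data is what would allow scores for the lagged columns to be traced back to the original time-series variable.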

The present invention is not limited to the embodiments described above and includes various modifications and equivalent configurations within the spirit of the appended claims. For example, the embodiments above are described in detail to explain the present invention clearly, and the present invention is not necessarily limited to configurations having all of the described elements. Part of the configuration of one embodiment may be replaced with the configuration of another embodiment. The configuration of another embodiment may be added to the configuration of one embodiment. For part of the configuration of each embodiment, other configurations may be added, deleted, or substituted.

Each of the configurations, functions, processing units, processing means, and the like described above may be realized partly or wholly in hardware, for example by designing them as integrated circuits, or in software, by having a processor interpret and execute programs that realize the respective functions.

Information such as the programs, tables, and files that realize each function can be stored in a storage device such as a memory, a hard disk, or an SSD (Solid State Drive), or on a recording medium such as an IC card, an SD card, or a DVD.

The control lines and information lines shown are those considered necessary for explanation and do not necessarily represent all the control lines and information lines required in an implementation. In practice, almost all components may be considered to be connected to one another.

101 ... Input device
102 ... Output device
103 ... Display device
104 ... Processing device
105 ... Program
106 ... Input processing unit
107 ... Feature node calculation unit
108 ... Score calculation unit
109 ... Node mapping unit
110 ... Output processing unit
111 ... Storage device
112 ... Input data table
113 ... Normalization information table
114 ... Division condition table
115 ... Node information table
116 ... Node distance table
117 ... Score table

Claims

A data analysis system comprising:
an arithmetic unit that executes a program, and a storage device connected to the arithmetic unit,
wherein the arithmetic unit includes a feature node calculation unit that divides, under specified division conditions, an input data set consisting of a plurality of explanatory variables used by a machine learning model at the time of training, or an input data set consisting of a data set obtained by processing the explanatory variables, and calculates feature nodes representing the features of the distribution structure of each of the divided data sets,
the arithmetic unit includes a score calculation unit that generates neighborhood data of input data including the feature nodes and calculates, based on the explanatory variables of the generated neighborhood data and the data of the objective variable obtained by inputting the neighborhood data into the machine learning model, a score representing the relationship between the explanatory variables and the objective variable, and
the arithmetic unit includes an output processing unit that outputs an output result including the score.

The data analysis system according to claim 1,
wherein the feature node calculation unit calculates the feature nodes based on the input data set to which a self-organizing map is applied.

The data analysis system according to claim 1,
wherein the feature node calculation unit calculates the feature nodes using an input data set consisting of the plurality of explanatory variables used by the machine learning model at the time of training and the objective variable calculated by the machine learning model, or an input data set consisting of a data set obtained by processing the explanatory variables and the objective variable.

The data analysis system according to claim 1,
wherein the feature node calculation unit divides the input data set according to a division condition that includes at least one of a specific value or range of a specific explanatory variable and a specific value or range of an element of the objective variable, or according to a division condition expressed by a combination thereof.

The data analysis system according to claim 1,
wherein the score calculation unit calculates, for each explanatory variable, a score corresponding to the format of the objective variable by applying linear model estimation based on the data of the explanatory variables and the data of the objective variable.

The data analysis system according to claim 1,
wherein the score calculation unit calculates a target score by aggregating, from among the elements of the objective variable, the portion designated for each division condition.

The data analysis system according to claim 6,
wherein the score calculation unit calculates, for the calculated score and the calculated target score, a weighted average score for each explanatory variable based on the number of peripheral data for each feature node under each of the division conditions.

The data analysis system according to claim 1,
wherein the arithmetic unit includes a node mapping unit that maps the feature nodes calculated by the feature node calculation unit to a two-dimensional space under each of the division conditions.

The data analysis system according to claim 7,
wherein the arithmetic unit includes a node mapping unit that maps the feature nodes calculated by the feature node calculation unit to a two-dimensional space under each of the division conditions, and
the node mapping unit generates data for imaging and displaying, based on the geometric structure of the nodes, the value of the feature node for each explanatory variable, the calculated score, and the weighted average score calculated for the score and the target score.

The data analysis system according to claim 8,
wherein the node mapping unit initializes the vectors of the feature nodes, or the vectors of the feature nodes including an objective variable component, based on the geometric structure of the feature nodes of the division condition, and then performs the mapping by applying multidimensional scaling.

The data analysis system according to claim 1,
wherein, when the input data set is time-series data including explanatory variables for each predetermined time, data obtained by expanding the explanatory variables as independent variables from a certain point in the past to the present time is used as the input data, and
the arithmetic unit stores the rules used for the expansion.

A data analysis method executed by a computer,
the computer having an arithmetic unit that executes a program and a storage device connected to the arithmetic unit,
the method comprising:
the arithmetic unit dividing, under specified division conditions, an input data set consisting of a plurality of explanatory variables used by a machine learning model at the time of training, or an input data set consisting of a data set obtained by processing the explanatory variables;
the arithmetic unit calculating feature nodes representing the features of the distribution structure of each of the divided data sets;
the arithmetic unit generating neighborhood data of input data including the feature nodes;
the arithmetic unit calculating, based on the explanatory variables of the generated neighborhood data and the data of the objective variable obtained by inputting the neighborhood data into the machine learning model, a score representing the relationship between the explanatory variables and the objective variable; and
the arithmetic unit outputting an output result including the score.