JP2018147280A

JP2018147280A - Data analysis device and data analysis method

Info

Publication number: JP2018147280A
Application number: JP2017042472A
Authority: JP
Inventors: 光晴大峡; Mitsuharu Ohazama; 孝泰羽根; Takayasu Hane
Original assignee: Hitachi Solutions Ltd
Current assignee: Hitachi Solutions Ltd
Priority date: 2017-03-07
Filing date: 2017-03-07
Publication date: 2018-09-20

Abstract

PROBLEM TO BE SOLVED: To provide a technique that allows a user to easily determine what kind of measures are effective in response to a prediction result. SOLUTION: A data analysis device 1 includes: a data set generation unit 124 that generates explanatory variable data to be input to a machine learning model and inputs the generated data to the machine learning model to obtain objective variable data; and a model evaluation unit 125 that calculates the relationship between the explanatory variables and the objective variable based on the explanatory variable data generated by the data set generation unit 124 and the objective variable data. SELECTED DRAWING: Figure 1

Description

The present disclosure relates to a data analysis device and a data analysis method for analyzing the relationship between input data and output data in machine learning.

Machine learning techniques such as neural networks have been attracting attention, and machine learning models obtained through machine learning are used to solve a variety of problems.

Patent Document 1 proposes a technique that uses machine learning for applications such as financial credit assessment, detection of fraudulent credit card customers, and detection of unauthorized network access. The method described in Patent Document 1 supports the user's subsequent judgment on a prediction result by attaching, to the certainty factor of a prediction based on similar cases, a reliability measure that indicates how reliable that certainty factor is.

Patent Document 1: JP 2003-323601 A

However, with the method described in Patent Document 1, the user cannot know how much each individual explanatory variable contributes to the prediction result. That is, the user cannot know which factors led from the input data to the prediction result. Put differently, users have been using machine learning models without knowing how the explanatory variables of the neural network relate to the objective variable, which is the prediction result. It has therefore been difficult for the user to know what judgment to make based on the prediction result.

The present disclosure has been made in view of this situation and provides a technique that lets a user easily know what judgment to make based on a prediction result.

To solve the above problem, a representative data analysis device of the present disclosure includes: a data set generation unit that generates explanatory variable data to be input to a machine learning model and inputs the generated explanatory variable data to the machine learning model to obtain objective variable data; and a model evaluation unit that calculates the relationship between the explanatory variables and the objective variable based on the explanatory variable data generated by the data set generation unit and the objective variable data.

Likewise, a representative data analysis method of the present disclosure includes: a step of generating explanatory variable data to be input to a machine learning model and inputting the generated explanatory variable data to the machine learning model to obtain objective variable data; and a step of calculating the relationship between the explanatory variables and the objective variable based on the generated explanatory variable data and the objective variable data.

According to the present disclosure, the degree of influence on the objective variable can be calculated for each explanatory variable of a trained machine learning model. This makes it easier to infer which factors led the machine learning model to its output, and easier for the user to decide on subsequent measures.
Problems, configurations, and effects other than those described above will become apparent from the following description of embodiments and the accompanying drawings.

FIG. 1 is a functional block diagram showing a schematic configuration of the data analysis device according to the embodiment.
FIG. 2 is a diagram showing an example of model data.
FIG. 3 is a diagram showing an example of input/output data.
FIG. 4 is a diagram showing an example of simulation data.
FIG. 5 is a diagram showing an example of a display screen for evaluation data.
FIG. 6 is a flowchart for explaining the simulation processing.
FIG. 7 is a flowchart for explaining the evaluation processing.

Embodiments of the present invention are described below with reference to the accompanying drawings. Note that the embodiment is only one example of how the present invention may be realized and does not limit its technical scope. Components common to the drawings are given the same reference numerals. In this specification, a trained model is a model obtained by machine learning and is also referred to as a machine learning model.

<Configuration of the data analysis device>
FIG. 1 is a functional block diagram showing a schematic configuration of the data analysis device 1 according to the embodiment. The data analysis device 1 includes a central processing unit (processor) 100 that performs the necessary arithmetic and control processing, an input/output device 110 for inputting and outputting data, a program memory 120 that stores the programs required for processing by the central processing unit 100, and a storage device 130 that stores data to be processed by the central processing unit 100 and data that the central processing unit 100 has already processed.

The input/output device 110 includes output devices such as a display unit 111 for displaying data and a printer (not shown), together with a keyboard 112 and a pointing device 113 such as a mouse for performing operations on the displayed data, for example selecting a menu.

The program memory 120 stores a simulation program 121 that feeds various input data to a trained model generated by machine learning and obtains the output results, an evaluation program 122 that analyzes the results of the simulation processing, and an output program 123 that displays the analysis results of the simulation processing on the display unit 111. Each processing program is stored in the program memory 120 as program code, and each process is realized when the central processing unit 100 executes the corresponding program code.

The central processing unit 100 functions as the data set generation unit 124 by executing the simulation program 121, as the model evaluation unit 125 by executing the evaluation program 122, and as the evaluation output unit 126 by executing the output program 123.

The data set generation unit 124 generates explanatory variable data to be input to the machine learning model and inputs the generated data to that model to obtain objective variable data. The model evaluation unit 125 calculates the relationship between the explanatory variables and the objective variable based on the explanatory variable data generated by the data set generation unit 124 and the objective variable data. This relationship is, for example, the statistical correlation between the explanatory variables and the objective variable. The evaluation output unit 126 displays the relationship calculated by the model evaluation unit 125 on the display unit 111, together with an interface for entering the explanatory variables of the machine learning model.

The storage device 130 stores model data 131, which is the data of a trained model generated in advance by machine learning; input/output data 132, which describes the input/output format of the training data used when the model data 131 was generated; simulation data 133, which is generated for simulation based on the input/output data 132; and evaluation data 134, which is obtained by analyzing the simulation data 133. The storage device 130 may be a storage system located remotely and accessed over a network.

The processing programs and data described above can also be provided on various recording media such as a CD-ROM, a DVD-ROM, or a USB memory.

<Model data>
FIG. 2 is a diagram showing an example of the model data 131 in the storage device 130. The model data has a data type 201 that consists of the setting data of the trained model obtained by machine learning and the trained model itself. The setting data and the trained model each consist of data items 202 and their values 203.

The data items of the setting data include, for example, the number of input dimensions and the number of output dimensions. The number of input dimensions is the number of explanatory variables input to the machine learning model, and the number of output dimensions is the number of objective variables the model outputs. For example, for a machine learning model that takes one day's sales, the next day's probability of precipitation, and the day of the week as explanatory variables and predicts the next day's sales as the objective variable, the number of input dimensions is 3 and the number of output dimensions is 1.

The data items of the trained model include the parameters that characterize the trained model, which takes input data such as the above and produces output data. The user does not need to know these parameters; they are used automatically within the program when the machine learning model is called.

<Input/output data>
FIG. 3 is a diagram showing an example of the input/output data 132 in the storage device 130. The input/output data consists of a data item 301 that holds the inputs and outputs of the model, that is, the explanatory variables and objective variables; a data type 302 that indicates the kind of data of each variable held in the data item 301; and a value range 303 that indicates the range of values of each variable held in the data item 301.

The data type 302 holds a value such as floating point or binary. The floating point type is a data type whose values are real numbers, for example 0.1. The binary type is a data type that holds only two kinds of values, such as male/female or purchased/not purchased, and stores 0 or 1 to distinguish the two. The value range 303 holds the range of values the variable can take. For example, if the data type is floating point and the value range is 0-1, the variable holds a real value between 0 and 1.

<Simulation data>
FIG. 4 is a diagram showing an example of the simulation data 133 in the storage device 130. The simulation data stores combinations of explanatory variable groups 401, 402, and 403 and the objective variable 404 that is output when those explanatory variables are input to the model. The numbers of explanatory variables and objective variables are determined by the number of input dimensions and the number of output dimensions in the model data 131. The values of the explanatory variable groups are chosen, without duplication, from the values within the value range of each explanatory variable in the input/output data 132.

The example in FIG. 4 shows the data set obtained when a simulation is run based on the input/output data shown in FIG. 3. In this example, the values of explanatory variables 1 and 3 are generated at random from the range 0-1, and the value of explanatory variable 2 is generated at random as either 0 or 1. Objective variable 1 records the value obtained by inputting the generated explanatory variables 1 to 3 to the machine learning model.
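The random generation described for FIG. 4 can be sketched roughly as follows. This is only an illustration under stated assumptions: the `VariableSpec` class, the `"float"`/`"binary"` labels, and the function names are hypothetical and do not appear in the source, and duplicates are rejected with a simple set, which assumes far more distinct combinations exist than are requested.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class VariableSpec:
    """One row of the input/output data (hypothetical representation)."""
    name: str
    dtype: str   # "float" or "binary" (labels assumed for this sketch)
    low: float
    high: float


def sample_value(spec, rng):
    """Draw one value that respects the data type and value range."""
    if spec.dtype == "binary":
        return float(rng.choice((0, 1)))
    return rng.uniform(spec.low, spec.high)


def generate_records(specs, count, seed=0):
    """Generate `count` distinct explanatory-variable tuples at random."""
    rng = random.Random(seed)
    seen = set()
    records = []
    while len(records) < count:
        row = tuple(sample_value(s, rng) for s in specs)
        if row not in seen:          # keep the records free of duplicates
            seen.add(row)
            records.append(row)
    return records


# Three explanatory variables shaped like the FIG. 3 example.
SPECS = [
    VariableSpec("explanatory_1", "float", 0.0, 1.0),
    VariableSpec("explanatory_2", "binary", 0.0, 1.0),
    VariableSpec("explanatory_3", "float", 0.0, 1.0),
]
records = generate_records(SPECS, count=50)
```

The objective variable column is left out here on purpose; it is filled in only after the generated rows have been passed through the trained model.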

<Evaluation data>
FIG. 5 is a diagram showing an example of a display screen for the evaluation data 134 in the storage device 130. The evaluation data consists of evaluation values for each explanatory variable of the model under evaluation and evaluation values for the model as a whole. Specifically, the evaluation data 134 stores an explanatory variable 501, which is the ID of each explanatory variable, and one or more evaluation items for each explanatory variable. In FIG. 5, the standard partial regression coefficient 502 is stored as evaluation item 1 and the significance probability (p-value) as evaluation item 2. As evaluation values for the model as a whole, a model evaluation item 504 and its value, the model evaluation value 505, are stored.

The evaluation data display screen shown in FIG. 5 is displayed on the display unit 111 when, for example, the user selects the machine learning model to be used: the evaluation output unit 126 reads the evaluation data 134 for that model and outputs it together with an interface for entering explanatory variable values for the model.

<Outline of processing in the data analysis device>
The processing performed in the data analysis device 1 configured as described above is as follows. First, the central processing unit 100 executes the simulation program 121 and functions as the data set generation unit 124. The data set generation unit 124 reads the model data 131 and the input/output data 132 stored in the storage device 130 and runs a simulation. The central processing unit 100 stores the simulation data 133 obtained by the simulation in the storage device 130.

Next, the central processing unit 100 executes the evaluation program 122 and functions as the model evaluation unit 125. The model evaluation unit 125 reads the simulation data 133 from the storage device 130, calculates an evaluation value for each explanatory variable of the machine learning model and evaluation values for the model as a whole, and stores the calculated values in the storage device 130 as the evaluation data 134.

The central processing unit 100 then executes the output program 123 and functions as the evaluation output unit 126, which displays the contents of the evaluation data 134 on the display unit 111. Each of these processes is described in detail below.

<Simulation processing>
FIG. 6 is a flowchart for explaining the simulation processing executed by the data set generation unit 124. In the simulation processing, the data set generation unit 124 generates sets of explanatory variable values such as those shown in FIG. 4. It then determines the combination patterns of those explanatory variable values and the objective variable values obtained when the generated values are input to the machine learning model. For example, several million to several hundred million combination patterns of explanatory and objective variables are generated.

In step 601, the data set generation unit 124 reads the model data 131 and the input/output data 132.

In step 602, the data set generation unit 124 first reads the number of input dimensions and the number of output dimensions from the model data 131, and the data type and value range of each explanatory variable from the input/output data 132. Based on these, it then generates combination patterns of explanatory variable values that exhaustively cover the value range of each explanatory variable.

For example, with the input/output data of FIG. 3, if a combination of values is written as (value of explanatory variable 1, value of explanatory variable 2, value of explanatory variable 3), combination patterns such as (0.1, 0, 0.2), (0.1, 0, 0.3), and (0.1, 1, 0.2) are output, as in FIG. 4. At this point, the value of objective variable 1 is not yet set. How exhaustive the combination patterns are, that is, the granularity of the data, can be set freely. In general, higher coverage improves the accuracy of the model evaluation but increases processing time; lower coverage reduces accuracy but shortens processing time.
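One way to picture the exhaustive generation of step 602 is a Cartesian product over per-variable grids, where the number of grid points plays the role of the granularity setting. The sketch below is an assumption-laden illustration, not the patented implementation: the `(dtype, low, high)` tuples, the `steps` parameter, and the function names are invented for this example.

```python
from itertools import product


def grid_values(dtype, low, high, steps):
    """Return the candidate values of one explanatory variable.

    Binary variables always yield [0, 1]; floating point variables are
    split into `steps` evenly spaced points across the value range.
    """
    if dtype == "binary":
        return [0, 1]
    if steps < 2:
        return [low]
    # Rounding keeps values such as 0.30000000000000004 printing as 0.3.
    return [round(low + (high - low) * i / (steps - 1), 10) for i in range(steps)]


def enumerate_patterns(specs, steps):
    """Enumerate every combination of candidate values (full factorial)."""
    axes = [grid_values(dtype, low, high, steps) for dtype, low, high in specs]
    return list(product(*axes))


# Hypothetical specs mirroring FIG. 3: float 0-1, binary, float 0-1.
SPECS = [("float", 0.0, 1.0), ("binary", 0, 1), ("float", 0.0, 1.0)]
patterns = enumerate_patterns(SPECS, steps=11)   # 11 * 2 * 11 combinations
```

Raising `steps` increases coverage and the number of rows multiplicatively, which matches the trade-off between evaluation accuracy and processing time noted above.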

In step 603, the data set generation unit 124 reads the trained model from the model data, inputs the simulation data generated in step 602 to the trained model, and obtains the output results, that is, the objective variable values.

In step 604, the data set generation unit 124 stores each objective variable value obtained in step 603 in the record of the corresponding explanatory variable group, thereby updating the simulation data.
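Steps 603 and 604 amount to running each generated row through the trained model and writing the result back into the same record. A minimal sketch follows; `toy_model` is a hypothetical stand-in for the trained model (the source does not specify how the model is invoked), and the dictionary record layout is likewise an assumption.

```python
def toy_model(x1, x2, x3):
    """Hypothetical stand-in for the trained model loaded from the model data."""
    return 0.5 * x1 + 0.3 * x2 - 0.2 * x3


def fill_targets(records, model):
    """Step 603/604 sketch: predict for each record and store the result."""
    for record in records:
        record["objective_1"] = model(*record["explanatory"])
    return records


# Records as they might look after step 602: the objective is still unset.
patterns = [(0.1, 0, 0.2), (0.1, 0, 0.3), (0.1, 1, 0.2)]
simulation_data = [{"explanatory": p, "objective_1": None} for p in patterns]
simulation_data = fill_targets(simulation_data, toy_model)
```

With millions of rows, a real implementation would more likely batch the inputs and call the model once per batch; the per-row loop is kept here only for readability.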

<Evaluation processing>
FIG. 7 is a flowchart for explaining the evaluation processing executed by the model evaluation unit 125 and the evaluation output unit 126. The evaluation processing takes simulation data such as that in FIG. 4 and outputs evaluation results such as those shown in FIG. 5.

In step 701, the model evaluation unit 125 reads the simulation data 133.

In step 702, the model evaluation unit 125 applies an evaluation model to the simulation data 133. The evaluation model is used to determine, for each explanatory variable, its degree of influence on the objective variable; multiple regression analysis, for example, can serve as the evaluation model. Any evaluation model that achieves this purpose may be used.

The following description assumes that multiple regression analysis is used as the evaluation model. Applying multiple regression analysis to the records of the simulation data yields a standard partial regression coefficient and a significance probability (p-value) for each explanatory variable. The standard partial regression coefficient expresses how strongly an explanatory variable influences the objective variable, with the scales of the explanatory variables unified. Comparing the standard partial regression coefficients of the explanatory variables therefore shows the relative size of each variable's influence. The p-value indicates how reliable the standard partial regression coefficient calculated by the multiple regression analysis is. In general, when the p-value is below 5%, the explanatory variable can be judged to be "related" to the objective variable.
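As a rough illustration of the standard partial regression coefficient, the sketch below z-scores every column and solves the resulting normal equations with plain Gaussian elimination. It is a teaching sketch under assumptions, not the device's implementation: the function names and the synthetic linear "model output" are invented, and p-values are deliberately omitted because they require a t-distribution that the Python standard library does not provide.

```python
from itertools import product
from statistics import fmean, pstdev


def zscore(column):
    """Rescale a column to mean 0 and (population) standard deviation 1."""
    mean, sd = fmean(column), pstdev(column)
    return [(v - mean) / sd for v in column]


def solve(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for i in range(n):
        pivot = max(range(i, n), key=lambda r: abs(m[r][i]))
        m[i], m[pivot] = m[pivot], m[i]
        for r in range(i + 1, n):
            factor = m[r][i] / m[i][i]
            for c in range(i, n + 1):
                m[r][c] -= factor * m[i][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def standardized_coefficients(columns, y):
    """Least-squares coefficients after z-scoring every variable."""
    zx, zy, n = [zscore(c) for c in columns], zscore(y), len(y)
    corr = [[sum(p * q for p, q in zip(ci, cj)) / n for cj in zx] for ci in zx]
    r_xy = [sum(p * q for p, q in zip(ci, zy)) / n for ci in zx]
    return solve(corr, r_xy)


# Synthetic simulation data: a full grid and an assumed linear model output.
grid = [i / 10 for i in range(11)]
rows = list(product(grid, (0, 1), grid))
x1, x2, x3 = (list(col) for col in zip(*rows))
y = [0.5 * a + 0.3 * b - 0.2 * c for a, b, c in rows]
betas = standardized_coefficients([x1, x2, x3], y)
```

Because all variables share one scale after z-scoring, the magnitudes in `betas` can be compared directly, and the sign shows the direction of influence. In practice a statistics library would typically supply both these coefficients and the p-values.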

Applying multiple regression analysis to the records of the simulation data also yields evaluation items for the model as a whole, namely the coefficient of determination and the adjusted coefficient of determination. The coefficient of determination is the proportion of the total variation of the objective variable that can be explained by all the explanatory variables, and indicates how well the regression equation fits the sample data. The adjusted coefficient of determination takes the number of explanatory variables into account, compensating for the drawback that the ordinary coefficient of determination grows as explanatory variables are added. The output values obtained in this way are stored in the storage device 130 as the evaluation data 134.
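For reference, the two model-level evaluation items can be computed with the textbook formulas R² = 1 − SS_res / SS_tot and adjusted R² = 1 − (1 − R²)(n − 1) / (n − p − 1), where n is the number of records and p the number of explanatory variables. The sketch below assumes the regression predictions are already available; the function names and sample numbers are illustrative only.

```python
from statistics import fmean


def r_squared(y, y_hat):
    """Coefficient of determination: share of variation explained by the fit."""
    mean_y = fmean(y)
    ss_res = sum((obs - pred) ** 2 for obs, pred in zip(y, y_hat))
    ss_tot = sum((obs - mean_y) ** 2 for obs in y)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n_records, n_explanatory):
    """Penalise R^2 for the number of explanatory variables used."""
    return 1.0 - (1.0 - r2) * (n_records - 1) / (n_records - n_explanatory - 1)


# Illustrative values: five observations and the regression's predictions.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.1, 1.9, 3.2, 3.8, 5.0]
r2 = r_squared(observed, predicted)
r2_adj = adjusted_r_squared(r2, n_records=len(observed), n_explanatory=1)
```

The adjusted value is never larger than the plain one when at least one explanatory variable is used, which is the correction the text describes.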

In step 703, the evaluation output unit 126 displays the evaluation data 134 on the display unit 111. Displaying the data on screen in this way makes it possible to quantify and understand how strongly each explanatory variable influences the objective variable in the trained model.

<Summary>
As described above, according to this embodiment, explanatory variable values are input to the machine learning model either exhaustively or in sufficiently large numbers, the corresponding objective variable values are obtained, and an evaluation model is applied to the results. This makes it possible to quantify, for each explanatory variable, its degree of influence on the objective variable. As a result, it becomes easier to understand which explanatory variables most strongly influenced the objective variable output for a given set of explanatory variables. Furthermore, compared with applying the evaluation model to the trained model using only the training data and correct-answer data used during training, running an exhaustive simulation as in this embodiment allows the trained model to be evaluated more precisely.

The data analysis device 1 of the embodiment also displays, for example, the relationship between the explanatory variables and the objective variable on the display unit 111 together with an interface for entering the explanatory variables of the machine learning model. When using the machine learning model, the user can therefore enter explanatory variable values while checking the evaluation of that model shown on the display unit 111. This allows the user to verify the reliability of the model's output more quantitatively.

The present invention is not limited to the embodiment as described; at the implementation stage, the components may be modified without departing from the gist of the invention. Various inventions can be formed by appropriately combining the components disclosed in the embodiment. For example, some components may be removed from the full set shown in the embodiment, and components from different embodiments may be combined as appropriate.

Each configuration, function, processing unit, processing means, and the like described in the embodiment may be implemented partly or wholly in hardware, for example by designing them as an integrated circuit. They may also be implemented in software, with a processor interpreting and executing programs that realize the respective functions. Information such as the programs, tables, and files that realize each function can be stored in a recording or storage device such as a memory, a hard disk, or an SSD (Solid State Drive), or on a recording or storage medium such as an IC card, an SD card, or a DVD. Some or all of the programs executed by the data analysis device 1 of this embodiment may be implemented in dedicated hardware or may be modularized. The programs may be installed on each computer from a program distribution server or from storage media.

In the embodiment described above, only the control lines and information lines considered necessary for the explanation are shown; not all control lines and information lines of an actual product are necessarily shown. All components may be interconnected.

DESCRIPTION OF SYMBOLS
100: Central processing unit (processor)
110: Input/output device
111: Display unit
112: Keyboard
113: Mouse
120: Program memory
121: Simulation program
122: Evaluation program
123: Output program
124: Data set generation unit
125: Model evaluation unit
126: Evaluation output unit
130: Storage device
131: Model data
132: Input/output data
133: Simulation data
134: Evaluation data

Claims

1. A data analysis device comprising:
a data set generation unit that generates explanatory variable data to be input to a machine learning model and inputs the generated explanatory variable data to the machine learning model to obtain objective variable data; and
a model evaluation unit that calculates a relationship between the explanatory variable and the objective variable based on the explanatory variable data generated by the data set generation unit and the objective variable data.

2. The data analysis device according to claim 1, further comprising an evaluation output unit that displays, on a display unit, the relationship between the explanatory variable and the objective variable calculated by the model evaluation unit together with an interface for inputting the explanatory variable of the machine learning model.

3. The data analysis device according to claim 1, wherein the model evaluation unit calculates at least one of a significance probability and a standard partial regression coefficient for each explanatory variable by performing multiple regression analysis on a data set of the explanatory variable data and the objective variable data.

4. The data analysis device according to claim 1, wherein the model evaluation unit calculates a coefficient of determination indicating the degree to which the objective variable can be explained by the explanatory variable.

5. A data analysis method comprising:
a step of generating explanatory variable data to be input to a machine learning model and inputting the generated explanatory variable data to the machine learning model to obtain objective variable data; and
a step of calculating a relationship between the explanatory variable and the objective variable based on the generated explanatory variable data and the objective variable data.

6. The data analysis method according to claim 5, further comprising a step of displaying, on a display unit, the calculated relationship between the explanatory variable and the objective variable together with an interface for inputting the explanatory variable of the machine learning model.

7. The data analysis method according to claim 5, wherein the step of calculating the relationship between the explanatory variable and the objective variable includes calculating at least one of a significance probability and a standard partial regression coefficient by performing multiple regression analysis on a data set of the explanatory variable data and the objective variable data.

8. The data analysis method according to claim 5, wherein the step of calculating the relationship between the explanatory variable and the objective variable is a step of calculating a coefficient of determination indicating the degree to which the objective variable can be explained by the explanatory variable.