WO2019207910A1 - Data analysis system and data analysis method - Google Patents

Data analysis system and data analysis method Download PDF

Info

Publication number
WO2019207910A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
node
score
analysis system
data analysis
Prior art date
Application number
PCT/JP2019/005167
Other languages
French (fr)
Japanese (ja)
Inventor
前川 拓也
Original Assignee
株式会社日立ソリューションズ
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 株式会社日立ソリューションズ (Hitachi Solutions, Ltd.)
Publication of WO2019207910A1 publication Critical patent/WO2019207910A1/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Definitions

  • This disclosure relates to a data analysis system.
  • Machine learning technologies such as neural networks are attracting attention. Attempts have been made to solve various problems using a machine learning model obtained by machine learning.
  • For example, Japanese Patent Application Laid-Open No. 2003-323601 describes a prediction device comprising: a similar case extraction unit 1 that, given a known case set and a prediction case, extracts from the known case set a similar case set, i.e., a set of cases similar to the prediction case; a certainty factor calculation unit 2 that calculates the certainty factor of a prediction attribute value from the similar case set; and a reliability measure calculation unit 3 that calculates a reliability measure of that certainty factor from the similar case set and the certainty factor. The device outputs the certainty factor of a prediction attribute value together with its reliability measure.
  • However, the method described in Japanese Patent Application Laid-Open No. 2003-323601 merely supports the user's subsequent judgment on a prediction result by attaching a reliability measure, which indicates the reliability of the certainty factor, to the certainty factor of a prediction based on similar cases.
  • The user cannot know the degree of contribution of each explanatory variable to the prediction result. That is, the user cannot know what factors in the input data led to the prediction result. In other words, the user has been using the machine learning model while the relationship between the explanatory variables and the objective variable (the prediction result) of the neural network remains unknown. For this reason, it is difficult for the user to know what kind of judgment to make based on the prediction result.
  • The present invention has been made in view of this situation, and provides a technique for visualizing the degree of influence of each explanatory variable on the objective variable, making it possible to grasp what judgment should be made based on the prediction result.
  • A representative example of the invention is as follows: a data analysis system comprising an arithmetic device that executes a program and a storage device connected to the arithmetic device, wherein the arithmetic device provides a feature node calculation unit that divides an input data set consisting of a plurality of explanatory variables used by a machine learning model during learning, or an input data set consisting of data obtained by processing those explanatory variables, according to a specified division condition, and calculates feature nodes representing the features of the distribution structure of each divided data set; a score calculation unit that generates neighborhood data of input data including the feature nodes and calculates, based on the explanatory variables of the generated neighborhood data and the objective variable data obtained by inputting the neighborhood data to the machine learning model, a score representing the relationship between each explanatory variable and the objective variable; and an output processing unit that outputs an output result including the score.
  • In this embodiment, the machine learning model has been trained in advance; the system refers to the learning data used in that training and obtains output results using the trained machine learning model. The machine learning model returns a k-dimensional output vector in response to a d-dimensional input vector, and the description assumes that the output signal of the machine learning model in this embodiment corresponds to the classification probabilities of k classification classes.
  • FIG. 1 is a diagram showing the configuration of the data analysis system of this embodiment.
  • the data analysis system of this embodiment is a computer that analyzes the relationship between input data and output data in machine learning, and includes an input device 101, an output device 102, a display device 103, a processing device 104, and a storage device 111.
  • the input device 101 is a keyboard, a mouse, or the like, and is an interface that receives input from the user.
  • the output device 102 is a printer or the like, and is an interface that outputs the execution result of the program in a format that can be visually recognized by the user.
  • the display device 103 is a display device such as a liquid crystal display device, and is an interface that outputs the execution result of the program in a format that can be visually recognized by the user.
  • a terminal connected to the data analysis system via a network may provide the input device 101, the output device 102, and the display device 103.
  • the processing device 104 includes a processor (arithmetic device) that executes a program and a memory that stores the program and data. Specifically, the input processing unit 106, the feature node calculation unit 107, the score calculation unit 108, the node mapping unit 109, and the output processing unit 110 are realized by the processor executing the program. Note that a part of processing performed by the processor executing the program may be executed by another arithmetic device (for example, FPGA).
  • the memory includes a ROM that is a nonvolatile storage element and a RAM that is a volatile storage element.
  • the ROM stores an immutable program (for example, BIOS).
  • The RAM is a high-speed, volatile storage element such as a DRAM (Dynamic Random Access Memory), and temporarily stores the program executed by the processor and the data used when the program is executed.
  • the storage device 111 is a large-capacity non-volatile storage device such as a magnetic storage device (HDD) or a flash memory (SSD).
  • the storage device 111 stores data used by the processing device 104 when executing the program and a program executed by the processing device 104.
  • Specifically, the storage device 111 stores the data necessary for the series of processes and the output results, including an input data table 112, a normalization information table 113, a division condition table 114, a node information table 115, a node distance table 116, a score table 117, and a weighted average score table 118.
  • the program is read from the storage device 111, loaded into the memory, and executed by the processor.
  • the data analysis system may have a communication interface that controls communication with other devices according to a predetermined protocol.
  • the program executed by the processing device 104 is provided to the data analysis system via a removable medium (CD-ROM, flash memory, etc.) or a network, and is stored in the nonvolatile storage device 111 that is a non-temporary storage medium. For this reason, the data analysis system may have an interface for reading data from a removable medium.
  • The data analysis system is a computer system configured on one physical computer or on a plurality of logically or physically configured computers, and may operate on a virtual machine built on a plurality of physical computer resources.
  • FIG. 2 is a diagram showing the data structure of the data analysis system of this embodiment.
  • the input data table 112 stores data obtained by processing the learning data of the machine learning model into a format used in a series of processes by the data analysis system of the present embodiment, and includes the explanatory variables 1 to d (201) and the objective variables 1 to k (202) is included.
  • the explanatory variables 1 to d (201) represent d-dimensional vectors that are input data of the machine learning model.
  • In machine learning, however, data is often normalized for each variable. In this embodiment, the normalized data is converted back to the original numerical values using the normalization information table 113 before being stored.
  • When the learning data of the machine learning model is a time series, a variable x can be flattened into per-time-point variables named x_t0, x_t1, x_t2, and so on, as sketched below. In this case, the number of dimensions of the explanatory variables 201 does not match the number of input dimensions of the machine learning model, and the data format is converted each time data in the input format of this embodiment is input to the machine learning model.
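  • As a rough illustration (not the patent's code), the flattening and the retained index correspondence might look like the following Python sketch; the DataFrame layout and column names are assumptions:

```python
import pandas as pd

def flatten_time_series(df, var_name):
    """Flatten time-series values of `var_name` into x_t0, x_t1, ... columns.

    Assumes `df` holds one row per (entity, time) pair with columns
    'entity', 't' and `var_name`; these names are hypothetical.
    """
    wide = df.pivot(index="entity", columns="t", values=var_name)
    wide.columns = [f"{var_name}_t{t}" for t in wide.columns]
    # keep the pre-/post-expansion correspondence, as stored in the
    # model data format correspondence information 208
    correspondence = {var_name: list(wide.columns)}
    return wide, correspondence
```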
  • the objective variables 1 to k (202) are k-dimensional vectors that are output results of the machine learning model.
  • The normalization information table 113 stores information on the normalization processing performed when the machine learning model was trained, and contains the data of a variable ID 203, a variable name 204, a data type 205, an average 206, a standard deviation 207, and model data format correspondence information 208.
  • the variable ID 203 is an index that identifies the element of the explanatory variable 201.
  • the variable name 204 is the name of the explanatory variable.
  • the data type 205 is the data type of the explanatory variable (for example, logical type, integer type, floating point type, etc.).
  • The average 206 and the standard deviation 207 store the average and the standard deviation used in the normalization processing when the machine learning model was trained. For variables not subjected to normalization, such as logical variables, the average may be set to 0 and the standard deviation to 1.
  • The model data format correspondence information 208 stores information for mutual conversion when the input format of the machine learning model differs from the input format handled by the data analysis system of this embodiment. For example, for data including a time series, the variable x is expanded into x_t0, x_t1, and so on, so describing the correspondence between the pre-expansion and post-expansion indexes enables mutual conversion.
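  • A minimal sketch of this normalization and its inverse, assuming simple z-score normalization with the average 206 and standard deviation 207:

```python
import numpy as np

def normalize(x: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    # z-score normalization, as used when the model was trained
    return (x - mean) / std

def denormalize(z: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    # inverse transform, used to store original values in the input data table
    return z * std + mean

# for variables without normalization (e.g. logical type), mean=0 and std=1
# make both functions the identity, matching the table defaults above
```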
  • The division condition table 114 stores the conditions by which the feature node calculation unit 107 divides the input data table 112, and contains the data of a condition ID 209, a division condition 210, a number of data 211, a map size 212, and aggregation flags 1 to k (213).
  • the condition ID 209 is identification information for identifying a condition recorded in the division condition table 114.
  • the division condition 210 is a condition for dividing the input data to obtain one data set. For example, a character string such as an SQL select statement may be used.
  • the division condition 210 may describe a specific value or range for the explanatory variable, or a combination of values for the objective variable.
  • the number of data 211 is the number of data in the input data selected according to the division condition.
  • The map size 212 stores the map size used by the feature node calculation unit 107 in the node vector calculation step 403 of FIG. 4.
  • When the map size is set automatically, the value of the map size 212 may initially be NULL or the like, with the result of the automatic setting stored afterwards.
  • The aggregation flags 1 to k (213) are an array of k flags for the objective variables, used in the objective score aggregation process 504 of FIG. 5. Only the scores of the objective variables whose flag is set to 1 are aggregated into the objective score. For example, in an analysis aiming at rank-up from the current rank in a member management system, each current rank is set as a division condition, and the flags of the objective variables corresponding to predicted ranks higher than the current rank are set to 1.
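  • As an illustrative sketch (assuming the input data table is a pandas DataFrame and the division conditions are pandas query strings, standing in for the SQL-like strings described above):

```python
import numpy as np
import pandas as pd

def divide(input_df: pd.DataFrame, division_condition: str) -> pd.DataFrame:
    # select the rows matching one division condition 210,
    # e.g. "current_rank == 'silver'" (a hypothetical condition)
    return input_df.query(division_condition)

def objective_score(scores: np.ndarray, flags: np.ndarray) -> np.ndarray:
    # aggregate the scores of the objective variables whose flag 213 is 1;
    # scores: (k, d) per objective/explanatory variable, flags: (k,)
    return scores[flags == 1].sum(axis=0)
```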
  • The node information table 115 stores the feature node calculation results of the feature node calculation unit 107, and contains the data of a condition ID 214, a node ID 215, a hit number 216, a hit rate 217, coordinates 218, explanatory variables 1 to d (219), and objective variables 1 to k (220).
  • the condition ID 214 is identification information (condition ID 209) for identifying a condition recorded in the division condition table 114.
  • the node ID 215 is node identification information that satisfies the condition specified by the condition ID 214.
  • The hit number 216 is, for the node identified by the node ID 215, the number of data items in the data set divided by the division condition that are closer to that node than to any other node.
  • the hit rate 217 is a value obtained by dividing the hit number 216 by the data number 211.
  • the coordinates 218 are the processing result of the node mapping process 304 shown in FIG.
  • The explanatory variables 1 to d (219) and the objective variables 1 to k (220) are data in the same format as the input data table 112, and form the node vectors representing the features of the distribution structure of the input data set. These vectors need not match any data item in the input data table and need not follow the type specified in the data type 205; for example, even if a logical or integer type is specified, the values can be stored as floating point data.
  • The node distance table 116 stores, for the explanatory variables 219 of the node information table 115 (or for node vectors formed by appending the objective variables 220 to the explanatory variables 219), the distance between each pair of nodes, and contains the data of a node from 221, a node to 222, and a distance 223.
  • the node from 221 and the node to 222 are identification information for specifying the nodes recorded in the node information table 115, respectively.
  • the values of the node from 221 and the node to 222 may be a set of the condition ID 214 and the node ID 215, or may be an index on the node information table 115.
  • a distance 223 is a distance of a node vector between the node from 221 and the node to 222.
  • the node distance table 116 may be expressed as a two-dimensional array.
  • In this case, the index of the node information table 115 is used for the rows and columns.
  • the score table 117 stores the calculation result of the score calculation unit 108, and includes data of the objective variable ID 224, the condition ID 225, the node ID 226, and the score 227 of the explanatory variables 1 to d.
  • the objective variable ID 224 stores the element number of the k-dimensional vector in the output result of the machine learning model.
  • the condition ID 225 and the node ID 226 are identification information for specifying the node recorded in the node information table 115, and use values common to the condition ID 214 and the node ID 215 of the node information table 115.
  • the score 227 of the explanatory variables 1 to d is a calculation result of the score calculation unit 108, and is stored for each objective variable ID 224, condition ID 225, node ID 226, and explanatory variable.
  • The score table 117 also stores the objective score and the weighted average score described with reference to FIG. 5. The objective score is recorded with the objective variable ID 224 set to -1 or the like, and the weighted average score is recorded with both the objective variable ID 224 and the node ID 226 set to -1 or the like, indicating that no specific objective variable or node is identified.
  • The weighted average score table 118 stores, for each division condition, the score of each explanatory variable. Specifically, it partitions the weighted average scores calculated in step 505 of the score calculation process 303 (FIG. 5, described later) by division condition, sorts the scores of the explanatory variables in descending order of absolute value, and lists them together with the variable names. The weighted average score table 118 is generated in step 701 of the output process 305 (FIG. 7).
  • With the weighted average score table 118, the user can easily grasp, for each target segment represented by a division condition, the explanatory variables with a high degree of influence, and can compare the rank and score of an explanatory variable across division conditions. For example, the influence of attribute A may be large under conditions 1 and 2, while the influence of attribute J is large under conditions 3 and 4. If the score of attribute I has opposite signs under two conditions, applying the same measure may produce opposite effects. The table can thus be used for planning measures for the target segment indicated by each condition.
  • FIG. 3 is a flowchart of the overall processing of this embodiment.
  • the input processing unit 106 executes input processing (301).
  • The input processing unit 106 refers to the normalization information table 113, converts the learning data of the machine learning model from the model's input format to the input format of this embodiment, returns the normalized numerical values to their original values, and stores the result in the input data table 112.
  • the feature node calculation unit 107 executes feature node calculation processing (302). For example, the feature node calculation unit 107 divides the input data table 112 according to the division condition table 114, calculates a feature node from each divided data set, and stores the result in the node information table. Details of the feature node calculation process will be described with reference to FIG.
  • the score calculation unit 108 executes a score calculation process (303). For example, the score calculation unit 108 calculates a score representing the degree of influence of the explanatory variable, and stores the result in the score table. Details of the score calculation process will be described with reference to FIG.
  • the node mapping unit 109 executes a node mapping process (304). For example, the node mapping unit 109 maps the feature node obtained in step 302 to the low-dimensional space. Details of the node mapping process will be described with reference to FIG.
  • the output processing unit 110 executes the output process (305) and ends the process. Details of the output process will be described with reference to FIG.
  • FIG. 4 is a flowchart of the feature node calculation process 302 of this embodiment.
  • the feature node calculation unit 107 loops the variable p from 1 to the number of data in the division condition table 114 (401). Thereafter, the processing from step 402 to step 405 is executed for the p-th division condition.
  • The feature node calculation unit 107 performs the data division process (402). For example, data satisfying the division condition 210 of the p-th division condition is selected from the input data table 112, and the selected data set is normalized using the normalization information table 113.
  • The feature node calculation unit 107 calculates node vectors (403). For example, node vectors representing the features of the selected data set are calculated with a smaller number of nodes, taking the distribution structure into account, using a clustering method such as the k-means method.
  • In this embodiment, a self-organizing map (hereinafter abbreviated as SOM) is applied.
  • A SOM is a kind of neural network expressed by nodes arranged in a grid and by edges connecting adjacent nodes. Each node is assigned a reference vector in the same format as the input data.
  • In SOM learning, the reference vector of the node closest to each learning data item (hereinafter referred to as the BMU, for Best Matching Unit) and the reference vectors of the nodes connected to the BMU are updated so as to approximate the learning data. By repeating this process, the complicated distribution structure of the learning data can be mapped onto the geometric structure of the nodes. Since SOM is a known technique, a detailed description is omitted.
  • The reference vector of each node calculated as a result of the SOM is stored in the node information table 115 as the explanatory variables 219 and the objective variables 220.
  • The format of the learning data used when executing the SOM can be either the explanatory variables alone or the combination of the explanatory variables and the objective variables; which format to use is preferably set in advance. The explanatory variables 219 and the objective variables 220 stored as the output result follow the input format of this learning data.
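  • A minimal numpy sketch of such SOM training (an assumed illustration, not the patent's implementation; the grid size, learning rate, and neighborhood schedule are arbitrary choices):

```python
import numpy as np

def train_som(data, rows, cols, iters=1000, lr0=0.5, sigma0=None, seed=0):
    """Train a rows x cols SOM on `data` (n_samples x dim); return the
    reference vectors (node vectors) with shape (rows*cols, dim)."""
    rng = np.random.default_rng(seed)
    sigma0 = sigma0 or max(rows, cols) / 2.0
    # node positions on the grid, used by the neighborhood function
    grid = np.array([(i, j) for i in range(rows) for j in range(cols)], float)
    # initialize reference vectors from random data samples
    w = data[rng.integers(0, len(data), rows * cols)].astype(float)
    for t in range(iters):
        x = data[rng.integers(0, len(data))]
        lr = lr0 * np.exp(-t / iters)
        sigma = sigma0 * np.exp(-t / iters)
        bmu = np.argmin(((w - x) ** 2).sum(axis=1))  # Best Matching Unit
        # pull the BMU and its grid neighbors toward the data item
        d2 = ((grid - grid[bmu]) ** 2).sum(axis=1)
        h = np.exp(-d2 / (2 * sigma ** 2))
        w += lr * h[:, None] * (x - w)
    return w
```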
  • the feature node calculation unit 107 counts the number of hits for each node (404).
  • For each node, the number of data items in the selected data set whose BMU is that node is calculated as the value of the hit number 216.
  • The hit rate 217 is calculated by dividing this by the number of selected data items.
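  • A sketch of this counting, under the same assumptions as the SOM sketch above:

```python
import numpy as np

def hit_counts(data, node_vectors):
    """Return the hit number 216 and hit rate 217 for each node."""
    # BMU for every data item: index of the nearest node vector
    d2 = ((data[:, None, :] - node_vectors[None, :, :]) ** 2).sum(axis=2)
    bmu = d2.argmin(axis=1)
    hits = np.bincount(bmu, minlength=len(node_vectors))
    return hits, hits / len(data)  # divide by the number of selected items
```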
  • The feature node calculation unit 107 stores the calculated result in the data storage area (405). At this time, since the node vectors calculated in step 403 are normalized, the normalization information table 113 is used to restore the original values before the result is stored.
  • When the loop from step 401 to step 405 ends, the feature node calculation process 302 ends.
  • FIG. 5 is a flowchart of the score calculation process 303 of this embodiment.
  • the score calculation unit 108 loops the variable i from 1 to the number of data items in the node information table 115 (501). Thereafter, the processing from step 502 to step 504 is executed for the i-th node.
  • the score calculation unit 108 generates a neighborhood data set of the node i and a prediction result of the machine learning model corresponding thereto (502).
  • the neighborhood data is vector data located around the d-dimensional vector represented by the explanatory variable of the node specified by the variable i.
  • As the method for generating the neighborhood data, this embodiment uses random numbers following a normal distribution whose mean is the explanatory variable value of node i and whose standard deviation is half the standard deviation recorded in the normalization information.
  • The number of data items in the neighborhood data set may be designated in advance. Prediction with the machine learning model can be executed by performing normalization using the normalization information table and conversion using the model data format correspondence information 208.
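  • A sketch of this generation step (array shapes are assumptions; `predict` stands in for the trained machine learning model, with normalization and format conversion assumed to happen inside it):

```python
import numpy as np

def neighborhood_data(node_x, std, n_samples, predict, rng=None):
    """Generate neighborhood data of a feature node and its predictions.

    node_x: (d,) explanatory variables of node i (denormalized);
    std: (d,) standard deviation 207 from the normalization information;
    predict: callable mapping (n, d) inputs to (n, k) class probabilities.
    """
    if rng is None:
        rng = np.random.default_rng()
    # normal distribution centered on the node, with half the recorded std
    X = rng.normal(loc=node_x, scale=std / 2.0, size=(n_samples, len(node_x)))
    Y = predict(X)
    return X, Y
```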
  • the score calculation unit 108 performs local model estimation processing on the generated neighborhood data set and the prediction result of the machine learning model (503).
  • In step 503, a score representing the relationship between each explanatory variable and the objective variable is obtained for the neighborhood data. Specifically, a linear model of the form Y = S1·x1 + ... + Sd·xd + C is estimated from the neighborhood data set and the corresponding predictions, where the output Y, each score Si, and the intercept C are k-dimensional vectors and xi is the i-th explanatory variable. Since linear model estimation is a known technique, a detailed description is omitted.
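  • A sketch of the local model estimation using ordinary least squares (scikit-learn is an assumed choice here; the text only states that linear model estimation is a known technique):

```python
from sklearn.linear_model import LinearRegression

def local_scores(X, Y):
    """Fit a local linear model Y ≈ X @ S.T + C on the neighborhood data.

    X: (n, d) neighborhood explanatory variables; Y: (n, k) model outputs.
    Returns S with shape (k, d): the score 227 of each explanatory variable
    for each objective variable, i.e. column i of S is the k-dim Si.
    """
    reg = LinearRegression().fit(X, Y)
    return reg.coef_  # (k, d) coefficient matrix
```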
  • The score calculation unit 108 calculates the objective score by aggregating the scores obtained in step 503 according to the aggregation flags 213 (504). Specifically, the scores of the elements whose flag is 1 are summed for each explanatory variable.
  • The score calculation unit 108 calculates the weighted average score by weighting the objective score by the hit rate 217 (505).
  • the weighted average score is calculated for each explanatory variable for all nodes having the same condition ID.
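  • A sketch combining steps 504 and 505 for one condition ID (array shapes are assumptions):

```python
import numpy as np

def weighted_average_score(objective_scores, hit_rates):
    """Hit-rate-weighted average of the objective scores (step 505).

    objective_scores: (n_nodes, d) objective score per node from step 504,
    for all nodes sharing one condition ID; hit_rates: (n_nodes,) hit
    rate 217 of each node. Returns a (d,) score per explanatory variable.
    """
    weighted = hit_rates[:, None] * objective_scores
    return weighted.sum(axis=0) / hit_rates.sum()
```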
  • the score calculation unit 108 stores the calculated result in the data storage area (506), and ends the score calculation process.
  • FIG. 6 is a flowchart of the node mapping process 304 of this embodiment.
  • In this embodiment, a multidimensional scaling method (hereinafter abbreviated as MDS) is used.
  • MDS is a technique for mapping nodes in a multidimensional vector space to a low-dimensional space such as 2D or 3D, performing the mapping so that the distances between nodes are reproduced as faithfully as possible. Since MDS is a known technique, a detailed description is omitted. In this embodiment, when MDS is applied, the initialization takes the geometric structure of the SOM nodes into account.
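  • A sketch using scikit-learn's MDS with a precomputed distance matrix and an explicit initial configuration, as an assumed realization of the initialization described above:

```python
from sklearn.manifold import MDS

def map_nodes(distance_matrix, init_coords, seed=0):
    """Map nodes to 2D, reproducing the node distance table 116.

    distance_matrix: (n, n) distances between node vectors;
    init_coords: (n, 2) initial coordinates derived from the SOM grid
    (e.g. the Pos array of FIG. 6); n_init must be 1 when init is given.
    """
    mds = MDS(n_components=2, dissimilarity="precomputed",
              n_init=1, random_state=seed)
    return mds.fit_transform(distance_matrix, init=init_coords)
```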
  • the node mapping unit 109 generates a node distance table 116 (601).
  • In this embodiment, each feature node vector is taken as the normalized explanatory variables 219, and the distance table is generated using the Euclidean distance.
  • the node mapping unit 109 initializes each variable (602). Specifically, first, lt, lb, rt, and rb are respectively defined as upper left, lower left, upper right, and lower right node indexes in the lattice-like SOM node structure, and all are set to -1. Next, y is set to 0. Next, the array Pos is defined as an array for storing the coordinates of each node. Then, Sw and Sh are defined as node coordinate arrays in the x direction and the y direction, respectively. This array size is determined by the map size 212. All elements of Pos, Sw, and Sh are initialized with 0.
  • the node mapping unit 109 loops the variable p from 1 to the number of data items in the division condition table 114 (603). Thereafter, the processing from step 604 to step 609 is executed for the p-th division condition.
  • When p is greater than 1 (step 604), the node mapping unit 109 sets y to the maximum value in the array Sh plus a predetermined number (e.g., 2) (605).
  • The predetermined number may be changed to an appropriate value.
  • When p equals 1, the node mapping unit 109 proceeds to step 606 without doing anything.
  • The node mapping unit 109 sets the indexes of the four corner nodes (upper left, lower left, upper right, lower right) of the nodes of division condition p to lt, lb, rt, and rb, respectively (606). At this time, lt can be set to rb + 1 and the remaining variables can be set according to the map size 212.
  • the node mapping unit 109 sets values for Sw and Sh (607). In this embodiment, a value obtained by equally dividing the distance between the nodes lt and rt and the distance between lt and lb according to the map size is set.
  • the node mapping unit 109 adds y to each element of Sh (608). If it is desired to move in the x-axis direction, a variable x may be defined and the same processing as y may be applied to Sw.
  • the node mapping unit 109 sets the coordinates of the nodes lt to rb to Pos (609).
  • The coordinates of the node at the position of row i and column j in the SOM node structure may be set to (Sw[i], Sh[j]).
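  • A sketch of steps 606 to 609 for one division condition (an assumed reading of the variable roles; `dist` is the node-vector distance function and `y_offset` plays the role of y):

```python
import numpy as np

def place_grid(node_vectors, rows, cols, dist, y_offset):
    """Assign 2D coordinates to one division condition's SOM grid.

    node_vectors: (rows*cols, d) reference vectors in row-major order.
    """
    lt, rt, lb = 0, cols - 1, (rows - 1) * cols  # corner node indexes
    # step 607: equally divide the lt-rt and lt-lb distances per map size
    Sw = np.linspace(0.0, dist(node_vectors[lt], node_vectors[rt]), cols)
    Sh = np.linspace(0.0, dist(node_vectors[lt], node_vectors[lb]), rows)
    Sh = Sh + y_offset  # step 608: add y to each element of Sh
    # step 609: x from the column index, y from the row index (assumed)
    pos = np.array([(Sw[j], Sh[i]) for i in range(rows) for j in range(cols)])
    return pos
```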
  • the node mapping unit 109 stores the result in the storage area (611), and ends the node mapping process.
  • FIG. 7 is a flowchart of the output process 305 of this embodiment.
  • the output processing unit 110 lists the weighted average scores and generates the weighted average score table 118 (701).
  • the weighted average score table 118 divides the weighted average score for each division condition, sorts the scores of each explanatory variable in descending order of absolute values, and lists the variable names.
  • the output processing unit 110 displays a component map of the node vector (702).
  • The component map visualizes the values of a specific explanatory variable 219 or objective variable 220 of each node under the same condition, using the geometric structure of the SOM nodes and the map size. For example, in the component map of explanatory variable i with map size m × n, the values of explanatory variable i at all nodes having the same condition ID in the node information table are displayed as an m × n image colored according to the values.
  • The component map of the present embodiment images the values of each explanatory variable 219, for a specific division condition, based on the geometric structure of the nodes.
  • a component map using the objective variable 220 can also be displayed. With the component map, the correlation between each explanatory variable and the correlation between the explanatory variable and the objective variable can be visually grasped.
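  • A sketch of the component map rendering with matplotlib (the colormap and layout are assumed visualization choices):

```python
import numpy as np
import matplotlib.pyplot as plt

def component_map(values, rows, cols, title):
    """Display one variable's node values as an m x n image (step 702).

    values: length rows*cols array in the SOM grid's row-major order,
    e.g. explanatory variable i of all nodes with the same condition ID.
    """
    img = np.asarray(values).reshape(rows, cols)
    plt.imshow(img, cmap="viridis")  # color corresponds to the value
    plt.title(title)
    plt.colorbar()
    plt.show()
```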
  • the output processing unit 110 displays a hit map (703).
  • the hit map is obtained by visualizing the hit number 216 (or its logarithm) or the hit rate 217 using the visualization method of step 702.
  • For example, the hit map is imaged by color coding based on the logarithm of the hit rate 217, as illustrated in FIG. 9. Further, as shown in the figure, the numerical value of the hit number may also be displayed. With the hit map, the dense nodes in the learning data distribution can be grasped.
  • the output processing unit 110 displays a score map (704).
  • the score map is obtained by visualizing the score 227 or the objective score for a specific explanatory variable using the visualization method in step 702.
  • The score map is imaged by color coding based on the score 227 of each explanatory variable, as illustrated in FIG. 10. For example, if the color is green where the score is 0 and changes gradually to red in the positive direction and to blue in the negative direction, explanatory variables with a high degree of influence and the sign of their influence are easy to grasp. Further, as shown in the figure, by comparing the pattern with the component map of the corresponding explanatory variable, the values taken by the explanatory variable at nodes with a high degree of influence can be grasped.
  • the output processing unit 110 displays a node map (705).
  • the node map is obtained by visualizing each node as a point in a low-dimensional space based on the coordinates 218 for each node calculated in step 304.
  • the shape and color of a point representing each node may be set according to the value of the explanatory variable in the node information table, the value of the objective variable, the score for each explanatory variable in the score table, the objective score, the division condition, and the like.
  • The node map of this embodiment plots the nodes in a two-dimensional space based on the node coordinates 218 under each division condition, as illustrated in FIG. 11. Further, the geometric structure of the nodes under a specific division condition may be displayed with grid lines.
  • With the node map, the positional relationships of the nodes across a plurality of division conditions can be grasped. For example, when the current rank is used as the division condition, looking at mutually close nodes makes it easy to find nodes that are likely to rank up or are at high risk of ranking down. Differences in features from neighboring nodes can be compared directly using the values in the node information table, or compared using the component map.
  • As described above, the data analysis system of this embodiment includes: a feature node calculation unit 107 that divides an input data set consisting of a plurality of explanatory variables used by the machine learning model during learning, or an input data set consisting of data obtained by processing those explanatory variables, according to a specified division condition, and calculates feature nodes representing the features of the distribution structure of each divided data set; a score calculation unit 108 that generates neighborhood data of input data including the feature nodes and calculates, based on the explanatory variables of the generated neighborhood data and the objective variable data obtained by inputting the neighborhood data to the machine learning model, a score representing the relationship between each explanatory variable and the objective variable; and an output processing unit 110 that outputs an output result including the score.
  • the feature node representing the feature of the distribution structure can represent the feature of the data set with less data than the learning data.
  • Furthermore, representing the features of the data set around each feature node with the neighborhood data complements the feature nodes. That is, the characteristics of the data set can be represented with a small amount of data, and the amount of calculation can be reduced.
  • the feature node calculation unit 107 calculates the feature node based on the input data set to which the self-organizing map is applied, so that the feature node can be accurately calculated.
  • Further, the feature node calculation unit 107 calculates the feature nodes using an input data set consisting of a plurality of explanatory variables used by the machine learning model during learning and the objective variables calculated by the machine learning model, or an input data set consisting of data obtained by processing those explanatory variables and objective variables, so the objective variables can be compared on a map.
  • Further, the feature node calculation unit 107 divides the input data set according to a division condition that includes at least one of a specific value or range of a specific explanatory variable and a specific value (for example, the maximum value) or range of an element of the objective variable, or a division condition expressed by a combination of these, so analysis with the target segment narrowed down can be performed. That is, rather than the entire population, data of a group having a specific attribute can be analyzed, with the attribute changed according to the purpose.
  • Further, the score calculation unit 108 calculates a score corresponding to the format of the objective variable for each explanatory variable by applying linear model estimation to the data of the explanatory variables and the data of the objective variables. Because a linear model is simple and easy to handle, it is easy for the user to understand and can improve the reliability of the results. In particular, with a linear model, when a plurality of attributes are combined, the calculation can be performed as a sum of probabilities, which is intuitively easy for the user to understand.
  • Further, the score calculation unit 108 calculates the objective score by summing, among the elements of the objective variable, the elements specified for each division condition, so analysis with the target segment narrowed down can be performed. That is, rather than the entire population, data of a group having a specific attribute can be analyzed, with the attribute changed according to the purpose.
  • Further, the score calculation unit 108 calculates a weighted average score for each explanatory variable from the calculated scores and the calculated objective scores, based on the number of surrounding data items of each feature node under each division condition, so the data characteristics can be expressed correctly in consideration of the distribution.
  • Further, the data analysis system includes a node mapping unit 109 that maps the feature nodes calculated by the feature node calculation unit 107 into a two-dimensional space for each division condition, so the characteristics of each group can be expressed in an easily understandable way.
  • Further, the node mapping unit 109 generates data for displaying, as images based on the geometric structure of the nodes, the feature node values for each explanatory variable, the calculated scores, and the weighted average scores calculated from the scores and the objective scores, so data can be compared between groups of different attributes while the distance relationships between nodes are maintained.
  • Further, the node mapping unit 109 applies the multidimensional scaling method after initializing the feature node vectors, or the feature node vectors including the objective variable components, based on the geometric structure of the feature nodes under each division condition, so attributes with high influence and attributes with low influence can be expressed in an easy-to-understand manner using the score map.
  • Further, when the input data set is time-series data including explanatory variables for each predetermined time, the data obtained by expanding the explanatory variables from a certain point in the past to the present as independent variables is used as input data, and the rules used for the expansion are stored, so the input data set can be analyzed as time-series data.
  • the present invention is not limited to the above-described embodiments, and includes various modifications and equivalent configurations within the spirit of the appended claims.
  • the above-described embodiments have been described in detail for easy understanding of the present invention, and the present invention is not necessarily limited to those having all the configurations described.
  • a part of the configuration of one embodiment may be replaced with the configuration of another embodiment.
  • Further, for a part of the configuration of each embodiment, other configurations may be added, deleted, or substituted.
  • Each of the above-described configurations, functions, processing units, processing means, and the like may be realized in hardware by designing a part or all of them with, for example, an integrated circuit, or may be realized in software by a processor interpreting and executing a program that realizes each function.
  • Information such as programs, tables, and files that realize each function can be stored in a storage device such as a memory, a hard disk, and an SSD (Solid State Drive), or a recording medium such as an IC card, an SD card, and a DVD.
  • Control lines and information lines indicate what is considered necessary for explanation, and not all control lines and information lines necessary for implementation are necessarily shown. In practice, almost all components may be considered to be connected to each other.

Abstract

A data analysis system provided with a computation device for executing a program and a storage device connected to the computation device. The computation device is provided with: a feature node calculation unit for dividing an input data set comprising a plurality of explanatory variables used during learning by a machine learning model, or an input data set comprising a data set in which the explanatory variables have been processed, under a designated division condition, and calculating a feature node that represents the feature of the distribution structure of each of the divided data sets; a score calculation unit for generating neighbor data of input data that includes the feature node and calculating, on the basis of an explanatory variable of the generated neighbor data and the data of an objective variable obtained by inputting the neighbor data to the machine learning model, a score that represents the relationship between the explanatory variable and the objective variable; and an output processing unit for outputting an output result that includes the score.

Description

Data analysis system and data analysis method

Incorporation by reference

This application claims the priority of Japanese Patent Application No. 2018-83408, filed on April 24, 2018, the contents of which are incorporated into the present application by reference.
This disclosure relates to a data analysis system.

Machine learning technologies such as neural networks are attracting attention, and attempts have been made to solve various problems using machine learning models obtained by machine learning. For example, Japanese Patent Application Laid-Open No. 2003-323601 describes a prediction device comprising: a similar case extraction unit 1 that, given a known case set and a prediction case, extracts from the known case set a similar case set, i.e., a set of cases similar to the prediction case; a certainty factor calculation unit 2 that calculates the certainty factor of a prediction attribute value from the similar case set; and a reliability measure calculation unit 3 that calculates a reliability measure of that certainty factor from the similar case set and the certainty factor. The device outputs the certainty factor of a prediction attribute value together with its reliability measure.

However, the method described in Japanese Patent Application Laid-Open No. 2003-323601 merely supports the user's subsequent judgment on a prediction result by attaching a reliability measure, which indicates the reliability of the certainty factor, to the certainty factor of a prediction based on similar cases; the user cannot know the degree of contribution of each explanatory variable to the prediction result. That is, the user cannot know what factors in the input data led to the prediction result. In other words, the user has been using the machine learning model while the relationship between the explanatory variables and the objective variable (the prediction result) of the neural network remains unknown. For this reason, it is difficult for the user to know what kind of judgment to make based on the prediction result.

The present invention has been made in view of this situation, and provides a technique for visualizing the degree of influence of each explanatory variable on the objective variable, making it possible to grasp what judgment should be made based on the prediction result.

A representative example of the invention disclosed in the present application is as follows: a data analysis system comprising an arithmetic device that executes a program and a storage device connected to the arithmetic device, wherein the arithmetic device provides a feature node calculation unit that divides an input data set consisting of a plurality of explanatory variables used by a machine learning model during learning, or an input data set consisting of data obtained by processing those explanatory variables, according to a specified division condition, and calculates feature nodes representing the features of the distribution structure of each divided data set; a score calculation unit that generates neighborhood data of input data including the feature nodes and calculates, based on the explanatory variables of the generated neighborhood data and the objective variable data obtained by inputting the neighborhood data to the machine learning model, a score representing the relationship between each explanatory variable and the objective variable; and an output processing unit that outputs an output result including the score.

According to one aspect of the present invention, the degree of influence of explanatory variables on the objective variable can be visualized. Problems, configurations, and effects other than those described above will become apparent from the following description of the embodiments.

FIG. 1 is a diagram showing the configuration of the data analysis system of this embodiment. FIG. 2 is a diagram showing the data structure of the data analysis system of this embodiment. FIG. 3 is a flowchart of the overall processing of this embodiment. FIG. 4 is a flowchart of the feature node calculation process of this embodiment. FIG. 5 is a flowchart of the score calculation process of this embodiment. FIG. 6 is a flowchart of the node mapping process of this embodiment. FIG. 7 is a flowchart of the output process of this embodiment. FIG. 8 is a diagram showing an example of the component map of this embodiment. FIG. 9 is a diagram showing an example of the hit map of this embodiment. FIG. 10 is a diagram showing an example of the score map of this embodiment. FIG. 11 is a diagram showing an example of the node map of this embodiment.
<Example 1>

Embodiments of the present invention will be described below with reference to the drawings.
In this embodiment, the machine learning model has been trained in advance; the system refers to the learning data used in that training and obtains output results using the trained machine learning model. The machine learning model returns a k-dimensional output vector in response to a d-dimensional input vector, and the description assumes that the output signal of the machine learning model in this embodiment corresponds to the classification probabilities of k classification classes.

FIG. 1 is a diagram showing the configuration of the data analysis system of this embodiment.

The data analysis system of this embodiment is a computer that analyzes the relationship between input data and output data in machine learning, and includes an input device 101, an output device 102, a display device 103, a processing device 104, and a storage device 111.

The input device 101 is a keyboard, a mouse, or the like, and is an interface that receives input from the user. The output device 102 is a printer or the like, and is an interface that outputs the execution results of the program in a format the user can view. The display device 103 is a display such as a liquid crystal display, and is an interface that outputs the execution results of the program in a format the user can view. A terminal connected to the data analysis system via a network may provide the input device 101, the output device 102, and the display device 103.

The processing device 104 includes a processor (arithmetic device) that executes a program and a memory that stores the program and data. Specifically, the input processing unit 106, the feature node calculation unit 107, the score calculation unit 108, the node mapping unit 109, and the output processing unit 110 are realized by the processor executing the program. A part of the processing performed by the processor may instead be executed by another arithmetic device (for example, an FPGA).

The memory includes a ROM, which is a nonvolatile storage element, and a RAM, which is a volatile storage element. The ROM stores an immutable program (for example, the BIOS). The RAM is a high-speed, volatile storage element such as a DRAM (Dynamic Random Access Memory), and temporarily stores the program executed by the processor and the data used when the program is executed.

The storage device 111 is a large-capacity, nonvolatile storage device such as a magnetic storage device (HDD) or a flash memory (SSD). The storage device 111 stores the data used by the processing device 104 when executing the program and the program executed by the processing device 104. Specifically, the storage device 111 stores the data necessary for the series of processes and the output results, including an input data table 112, a normalization information table 113, a division condition table 114, a node information table 115, a node distance table 116, a score table 117, and a weighted average score table 118. The program is read from the storage device 111, loaded into the memory, and executed by the processor.

The data analysis system may have a communication interface that controls communication with other devices according to a predetermined protocol.

The program executed by the processing device 104 is provided to the data analysis system via a removable medium (CD-ROM, flash memory, etc.) or a network, and is stored in the nonvolatile storage device 111, a non-transitory storage medium. For this reason, the data analysis system may have an interface for reading data from a removable medium.

The data analysis system is a computer system configured on one physical computer or on a plurality of logically or physically configured computers, and may operate on a virtual machine built on a plurality of physical computer resources.

FIG. 2 is a diagram showing the data structure of the data analysis system of this embodiment.
The input data table 112 stores the learning data of the machine learning model processed into the format used in the series of processes by the data analysis system of this embodiment, and includes explanatory variables 1 to d (201) and objective variables 1 to k (202).

The explanatory variables 1 to d (201) represent the d-dimensional vectors that are the input data of the machine learning model. In machine learning, however, data is often normalized for each variable; in this embodiment, the normalized data is converted back to the original numerical values using the normalization information table 113 before being stored. When the learning data of the machine learning model is a time series, a variable x can be flattened into per-time-point variables named x_t0, x_t1, x_t2, and so on. In this case, the number of dimensions of the explanatory variables 201 does not match the number of input dimensions of the machine learning model, and the data format is converted each time data in the input format of this embodiment is input to the machine learning model. The objective variables 1 to k (202) are the k-dimensional vectors that are the output results of the machine learning model.

The normalization information table 113 stores information on the normalization processing performed when the machine learning model was trained, and contains the data of a variable ID 203, a variable name 204, a data type 205, an average 206, a standard deviation 207, and model data format correspondence information 208.

The variable ID 203 is an index that identifies an element of the explanatory variables 201. The variable name 204 is the name of the explanatory variable. The data type 205 is the data type of the explanatory variable (for example, logical type, integer type, floating point type, etc.).

The average 206 and the standard deviation 207 store the average and the standard deviation used in the normalization processing when the machine learning model was trained. For variables not subjected to normalization, such as logical variables, the average may be set to 0 and the standard deviation to 1. The model data format correspondence information 208 stores information for mutual conversion when the input format of the machine learning model differs from the input format handled by the data analysis system of this embodiment. For example, for data including a time series, the variable x is expanded into x_t0, x_t1, and so on, so describing the correspondence between the pre-expansion and post-expansion indexes enables mutual conversion.

The division condition table 114 stores the conditions by which the feature node calculation unit 107 divides the input data table 112, and contains the data of a condition ID 209, a division condition 210, a number of data 211, a map size 212, and aggregation flags 1 to k (213).

The condition ID 209 is identification information that identifies a condition recorded in the division condition table 114. The division condition 210 is a condition for dividing the input data to obtain one data set; for example, a character string such as an SQL select statement may be used. The division condition 210 may describe a specific value or range for an explanatory variable, or a combination of conditions on the values of the objective variables. The number of data 211 is the number of data items in the input data selected according to the division condition.

The map size 212 stores the map size used by the feature node calculation unit 107 in the node vector calculation step 403 of FIG. 4. When the map size is set automatically, the value of the map size 212 may initially be NULL or the like, with the result of the automatic setting stored afterwards. The aggregation flags 1 to k (213) are an array of k flags for the objective variables, used in the objective score aggregation process 504 of FIG. 5. Only the scores of the objective variables whose flag is set to 1 are aggregated into the objective score. For example, in an analysis aiming at rank-up from the current rank in a member management system, each current rank is set as a division condition, and the flags of the objective variables corresponding to predicted ranks higher than the current rank are set to 1.
The node information table 115 stores the feature node calculation results produced by the feature node calculation unit 107, and contains a condition ID 214, a node ID 215, a hit count 216, a hit rate 217, coordinates 218, explanatory variables 1 to d (219), and objective variables 1 to k (220).
The condition ID 214 is the identification information (condition ID 209) of a condition recorded in the division condition table 114. The node ID 215 identifies a node belonging to the condition specified by the condition ID 214. The hit count 216 is, for the node identified by the node ID 215, the number of records in the data set selected by the division condition for which that node is closer than any other node. The hit rate 217 is the hit count 216 divided by the data count 211. The coordinates 218 are the result of the node mapping process 304 shown in FIG. 3.
The explanatory variables 1 to d (219) and the objective variables 1 to k (220) are data in the same format as the input data table 112, and constitute the node vector that represents the features of the distribution structure of the input data set. This vector need not coincide with any record in the input data table, and need not follow the type specified in the data type 205; for example, even if a Boolean or integer type is specified, the values can be stored as floating-point data.
The node distance table 116 stores the distance between each pair of nodes, computed over the explanatory variables 219 of the node information table 115, or over the node vector obtained by appending the objective variables 220 to the explanatory variables 219. It contains a node-from 221, a node-to 222, and a distance 223.
The node-from 221 and node-to 222 are identification information specifying nodes recorded in the node information table 115. Their values may be pairs of the condition ID 214 and the node ID 215, or indexes into the node information table 115. The distance 223 is the distance between the node vectors of the node-from 221 and the node-to 222.
The node distance table 116 may also be represented as a two-dimensional array, in which case the indexes of the node information table 115 are used for the rows and columns.
The score table 117 stores the calculation results of the score calculation unit 108, and contains an objective variable ID 224, a condition ID 225, a node ID 226, and scores 227 of the explanatory variables 1 to d.
The objective variable ID 224 stores the element number of the k-dimensional vector in the output of the machine learning model. The condition ID 225 and node ID 226 are identification information specifying a node recorded in the node information table 115, and use the same values as the condition ID 214 and node ID 215 of that table. The scores 227 of the explanatory variables 1 to d are the calculation results of the score calculation unit 108, stored per objective variable ID 224, condition ID 225, node ID 226, and explanatory variable.
The score table 117 also stores the objective score and the weighted average score described with FIG. 5. For the objective score, the objective variable ID 224 is set to -1 or the like; for the weighted average score, both the objective variable ID 224 and the node ID 226 are set to -1 or the like, indicating that the score is not tied to a specific objective variable or node.
The weighted average score table 118 stores the scores per division condition and per explanatory variable in this form, with the objective variable ID 224 and node ID 226 left unspecified as -1. Specifically, the weighted average score table 118 groups the weighted average scores calculated in step 505 of the score calculation process 303 (FIG. 5), described later, by division condition, sorts the scores of the explanatory variables in descending order of absolute value, and lists them together with the variable names. The weighted average score table 118 is generated in step 701 of the output process 305 (FIG. 7). With this table, the user can easily grasp which explanatory variables have a high degree of influence for each target segment represented by a division condition, and can compare the rank and score of each explanatory variable across division conditions. For example, attribute A may have a large influence under conditions 1 and 2, while attribute J has a large influence under conditions 3 and 4. Also, if the score of attribute I has opposite signs under different conditions, applying the same measure may produce the opposite effect. In this way, the table can be used to plan measures for the target segment indicated by each condition.
FIG. 3 is a flowchart of the overall processing of this embodiment.
First, the input processing unit 106 executes the input process (301). For example, the input processing unit 106 refers to the normalization information table 113, converts the training data of the machine learning model from the model's input format into the input format of this embodiment, restores the normalized values to their original values, and stores the result in the input data table 112.
Next, the feature node calculation unit 107 executes the feature node calculation process (302). For example, the feature node calculation unit 107 divides the input data table 112 according to the division condition table 114, calculates feature nodes from each divided data set, and stores the result in the node information table. The details of the feature node calculation process are described with FIG. 4.
Next, the score calculation unit 108 executes the score calculation process (303). For example, the score calculation unit 108 calculates scores representing the degree of influence of the explanatory variables and stores the result in the score table. The details of the score calculation process are described with FIG. 5.
Next, the node mapping unit 109 executes the node mapping process (304). For example, the node mapping unit 109 maps the feature nodes obtained in step 302 into a low-dimensional space. The details of the node mapping process are described with FIG. 6.
Next, the output processing unit 110 executes the output process (305), and the processing ends. The details of the output process are described with FIG. 7.
FIG. 4 is a flowchart of the feature node calculation process 302 of this embodiment.
First, the feature node calculation unit 107 loops a variable p from 1 to the number of entries in the division condition table 114 (401). The processing of steps 402 to 405 is then executed for the p-th division condition.
Next, the feature node calculation unit 107 performs the data division process (402). For example, it selects from the input data table 112 the records that satisfy the division condition 210 of the p-th division condition. The selected data set is normalized using the normalization information table.
Next, the feature node calculation unit 107 calculates node vectors (403). For example, a clustering method such as k-means is used to compute, from the selected data set, node vectors that represent its features with a smaller number of nodes while respecting its distribution structure. In this embodiment, a self-organizing map (hereinafter abbreviated as SOM) is applied. An SOM is a kind of neural network represented by nodes arranged on a grid and edges connecting adjacent nodes. Each node is assigned a reference vector in the same format as the input data. During training, the reference vector of the node whose distance to a training sample is smallest (hereinafter abbreviated as the BMU, for Best Matching Unit) is updated toward that sample, and the reference vectors of the nodes connected to the BMU are updated toward it as well. Since the SOM is a well-known technique, a detailed description is omitted. By repeating this process, the complex distribution structure of the training data can be mapped onto the geometric structure of the nodes.
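As an illustration only, the following is a minimal numpy sketch of this SOM update rule; the learning-rate and neighborhood schedules, the random initialization, and the function names are simplifying assumptions, not the patent's prescription:

```python
import numpy as np

def train_som(data, m, n, iters=1000, lr0=0.5, sigma0=None, seed=0):
    """Train an m x n SOM on data of shape (num_samples, d)."""
    rng = np.random.default_rng(seed)
    d = data.shape[1]
    sigma0 = sigma0 or max(m, n) / 2.0
    weights = rng.normal(size=(m, n, d))            # reference vectors
    grid = np.stack(np.meshgrid(np.arange(m), np.arange(n), indexing="ij"), axis=-1)
    for t in range(iters):
        lr = lr0 * (1.0 - t / iters)                # decaying learning rate
        sigma = sigma0 * (1.0 - t / iters) + 1e-3   # shrinking neighborhood
        x = data[rng.integers(len(data))]
        # BMU: node whose reference vector is closest to the sample
        dists = np.linalg.norm(weights - x, axis=2)
        bmu = np.unravel_index(np.argmin(dists), (m, n))
        # Pull the BMU and its grid neighbors toward the sample
        grid_dist2 = ((grid - np.array(bmu)) ** 2).sum(axis=2)
        h = np.exp(-grid_dist2 / (2 * sigma ** 2))[:, :, None]
        weights += lr * h * (x - weights)
    return weights
```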
The reference vector of each node calculated as the result of the SOM is stored in the node information table 115 in the form of the explanatory variables 219 and the objective variables 220.
The format of the training data when executing the SOM can be set to either the explanatory variables alone, or the explanatory variables together with the objective variables; which format to use is preferably set in advance. The objective variables 220 in the output result then follow the input format of this training data.
Next, the feature node calculation unit 107 counts the number of hits per node (404). Here, for each node calculated in step 403, the number of records in the selected data set for which that node is the BMU is calculated as the value of the hit count 216. The hit rate 217 is calculated by dividing this by the number of selected records.
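Continuing the sketch above, the hit counts and hit rates could be computed as follows (again one possible implementation, not the patent's):

```python
import numpy as np

def hit_counts(data, weights):
    """Count, per node, how many records have that node as their BMU."""
    m, n, _ = weights.shape
    counts = np.zeros((m, n), dtype=int)
    flat = weights.reshape(m * n, -1)
    for x in data:
        bmu = np.argmin(np.linalg.norm(flat - x, axis=1))
        counts[np.unravel_index(bmu, (m, n))] += 1
    return counts, counts / len(data)   # hit count 216, hit rate 217
```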
Next, the feature node calculation unit 107 stores the calculated results in the data storage area (405). Since the node vectors calculated in step 403 are normalized, they are restored to their original scale using the normalization information table 113, and the result is stored.
When the loop from step 401 to step 405 ends, the feature node calculation process 302 ends.
FIG. 5 is a flowchart of the score calculation process 303 of this embodiment.
First, the score calculation unit 108 loops a variable i from 1 to the number of entries in the node information table 115 (501). The processing of steps 502 to 504 is then executed for the i-th node.
Next, the score calculation unit 108 generates a neighborhood data set of node i and the prediction results of the machine learning model for it (502). Neighborhood data are vector data located around the d-dimensional vector represented by the explanatory variables of the node specified by the variable i. In this embodiment, the neighborhood data are generated as random numbers drawn from a normal distribution whose mean is the value of the explanatory variables of node i and whose standard deviation is half the standard deviation in the normalization information, although other generation methods may be used. The number of records in the neighborhood data set is preferably specified in advance. Prediction with the machine learning model can be executed by applying normalization using the normalization information table and conversion using the model data format correspondence information 208.
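A minimal sketch of this neighborhood generation under the stated normal-distribution assumption (variable names, the sample count, and the `model`/`normalize` helpers in the usage comment are illustrative):

```python
import numpy as np

def generate_neighborhood(node_vec, norm_std, n_samples=500, seed=0):
    """Draw neighborhood samples around a node's explanatory-variable vector.

    node_vec: (d,) explanatory variables of node i (used as the mean).
    norm_std: (d,) standard deviations from the normalization information;
              half of these are used as the sampling standard deviation.
    """
    rng = np.random.default_rng(seed)
    return rng.normal(loc=node_vec, scale=norm_std / 2.0,
                      size=(n_samples, len(node_vec)))

# Usage (model and normalize() are assumed to be given elsewhere):
# X_nb = generate_neighborhood(node_vec, norm_std)
# Y_nb = model.predict(normalize(X_nb))
```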
Next, the score calculation unit 108 performs a local model estimation process on the generated neighborhood data set and the corresponding prediction results of the machine learning model (503). In step 503, a score representing the relationship between the explanatory variables and the objective variables is obtained for the neighborhood data. In this embodiment, linear model estimation is applied to the neighborhood data set and the machine learning model's predictions, and the estimated parameters are used as the scores. That is, the output Y of the machine learning model for the d-dimensional explanatory variables X = (x1, x2, ..., xd) is approximated by the linear model below, and the estimated parameter Si is used as the score for the input xi. Here, Y, Si, and C are k-dimensional vectors. Since linear model estimation is a well-known technique, a detailed description is omitted.
$$ Y = \sum_{i=1}^{d} S_i x_i + C $$
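As a sketch, this estimation can be carried out with ordinary least squares; here numpy's `lstsq` solves for the d slope vectors Si and the intercept C jointly (a simplification; the patent does not name a specific estimator):

```python
import numpy as np

def local_linear_scores(X_nb, Y_nb):
    """Fit Y ~ sum_i Si * xi + C by least squares.

    X_nb: (n, d) neighborhood samples; Y_nb: (n, k) model predictions.
    Returns S of shape (d, k) (the scores) and C of shape (k,).
    """
    A = np.hstack([X_nb, np.ones((len(X_nb), 1))])   # append intercept column
    coef, *_ = np.linalg.lstsq(A, Y_nb, rcond=None)  # shape (d + 1, k)
    return coef[:-1], coef[-1]                       # S (d, k), C (k,)
```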
Next, the score calculation unit 108 aggregates the scores obtained in step 503 according to the aggregation flags 213 to calculate the objective score (504). Specifically, the scores of the elements whose flag is 1 are summed per explanatory variable.
Then, when the loop from step 501 to step 504 ends, the score calculation unit 108 applies the hit rates 217 to the objective scores to calculate the weighted average scores (505). A weighted average score is calculated per explanatory variable over all nodes with the same condition ID.
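A minimal sketch of steps 504 and 505 together (the array layout is an assumption; since the hit rates of one condition's nodes sum to 1, the hit-rate-weighted sum is already the weighted average):

```python
import numpy as np

def weighted_average_scores(S_per_node, hit_rates, agg_flags):
    """S_per_node: (num_nodes, d, k) scores per node of one condition;
    hit_rates: (num_nodes,) hit rates 217; agg_flags: (k,) 0/1 flags.
    Returns (d,) weighted average scores."""
    # Step 504: objective score per node = sum over flagged objective variables
    objective = (S_per_node * agg_flags).sum(axis=2)      # (num_nodes, d)
    # Step 505: hit-rate-weighted average over the nodes of the condition
    return (objective * hit_rates[:, None]).sum(axis=0)   # (d,)
```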
The score calculation unit 108 then stores the calculated results in the data storage area (506), and the score calculation process ends.
FIG. 6 is a flowchart of the node mapping process 304 of this embodiment. In this embodiment, multidimensional scaling (hereinafter abbreviated as MDS) is used to map the per-division-condition sets of grid-arranged planar SOM nodes to two-dimensional coordinates, but the geometric structure of the nodes and the target space may have other dimensionalities.
MDS is a technique for mapping nodes in a multidimensional vector space into a low-dimensional space such as two or three dimensions, placing them so that the distances between nodes are reproduced as faithfully as possible. Since MDS is a well-known technique, a detailed description is omitted. In this embodiment, when MDS is applied, the initialization takes the geometric structure of the SOM nodes into account.
First, the node mapping unit 109 generates the node distance table 116 (601). In this embodiment, each feature node vector is taken to be the normalized explanatory variables 219, and the distance table is generated using the Euclidean distance.
Next, the node mapping unit 109 initializes the variables (602). Specifically, lt, lb, rt, and rb are first defined as the node indexes of the upper-left, lower-left, upper-right, and lower-right corners of the grid-shaped SOM node structure, and all are set to -1. Next, y is set to 0. Next, an array Pos is defined to store the coordinates of each node. Then Sw and Sh are defined as arrays of node coordinates in the x and y directions, respectively; their sizes are determined by the map size 212. All elements of Pos, Sw, and Sh are initialized to 0.
Next, the node mapping unit 109 loops a variable p from 1 to the number of entries in the division condition table 114 (603). The processing of steps 604 to 609 is then executed for the p-th division condition.
Next, if rb is 0 or more (Yes in 604), the node mapping unit 109 sets y to the maximum value in the array Sh plus a predetermined number (for example, 2) (605). The predetermined number may be changed to any appropriate value.
On the other hand, if rb is negative (No in 604), the node mapping unit 109 does nothing and proceeds to step 606.
Next, the four corner node indexes for the nodes of division condition p are set to lt, lb, rt, and rb, respectively (606). At this point, lt can be set to rb + 1 and the remaining variables set according to the map size 212.
Next, the node mapping unit 109 sets the values of Sw and Sh (607). In this embodiment, the distance between nodes lt and rt and the distance between nodes lt and lb are divided equally according to the map size, and the resulting values are set.
Next, the node mapping unit 109 adds y to each element of Sh (608). To shift in the x-axis direction instead, a variable x can be defined and the same processing applied to Sw.
Next, the node mapping unit 109 sets the coordinates of nodes lt through rb in Pos (609). In this processing, for example, the coordinates of the node at row i, column j of the SOM node structure may be set to (Sw[i], Sh[j]).
Then, when the loop over steps 604 to 609 that began at step 603 ends, MDS is applied using Pos as the initial coordinates of the nodes (610).
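As a sketch of steps 602 to 610, the grid initialization and the MDS call might look as follows. It assumes scikit-learn's MDS with a precomputed dissimilarity matrix and an initial configuration passed to `fit_transform`; the unit-spaced grid and vertical-offset scheme are a simplified reading of the flowchart, not the patent's exact procedure:

```python
import numpy as np
from sklearn.manifold import MDS

def initial_positions(map_sizes):
    """Stack each condition's m x n SOM grid below the previous one,
    leaving a vertical gap of 2, as initial MDS coordinates."""
    pos, y = [], 0.0
    for (m, n) in map_sizes:            # one (rows, cols) pair per condition
        sw = np.arange(n, dtype=float)              # x coordinates
        sh = np.arange(m, dtype=float) + y          # y coordinates, shifted
        for i in range(m):
            for j in range(n):
                pos.append((sw[j], sh[i]))
        y = sh.max() + 2.0                          # gap before next condition
    return np.array(pos)

def map_nodes(dist_matrix, map_sizes, seed=0):
    """dist_matrix: (num_nodes, num_nodes) node distance table 116."""
    pos0 = initial_positions(map_sizes)
    mds = MDS(n_components=2, dissimilarity="precomputed",
              n_init=1, random_state=seed)
    return mds.fit_transform(dist_matrix, init=pos0)
```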
Next, the node mapping unit 109 stores the result in the storage area (611), and the node mapping process ends.
FIG. 7 is a flowchart of the output process 305 of this embodiment.
First, the output processing unit 110 lists the weighted average scores and generates the weighted average score table 118 (701). As described above, the weighted average score table 118 groups the weighted average scores by division condition, sorts the scores of the explanatory variables in descending order of absolute value, and lists them together with the variable names.
Next, the output processing unit 110 displays a component map of the node vectors (702). A component map visualizes the value of a particular explanatory variable 219 or objective variable 220 of each node under the same condition, using the geometric structure of the SOM nodes and the map size. For example, when the map size is m x n, the component map of explanatory variable i displays the values of explanatory variable i over all nodes with the same condition ID in the node information table as an m x n image, with colors corresponding to the values.
As illustrated in FIG. 8, the component map of this embodiment renders, for a specific division condition, the value of the explanatory variable 219 as an image for each explanatory variable, based on the geometric structure of the nodes. Also, when step 403 processed vectors that also included the objective variables 202, component maps using the objective variables 220 can be displayed as well. Component maps make it possible to visually grasp correlations among the explanatory variables, and between the explanatory variables and the objective variables.
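A component map of this kind can be sketched with matplotlib (illustrative only; the patent does not prescribe a plotting library or colormap):

```python
import matplotlib.pyplot as plt

def show_component_map(node_vectors, m, n, var_idx, title=""):
    """node_vectors: (m*n, d) reference vectors of one condition's nodes,
    in row-major grid order. Renders variable var_idx as an m x n image."""
    img = node_vectors[:, var_idx].reshape(m, n)
    plt.imshow(img, cmap="viridis")
    plt.colorbar(label=f"variable {var_idx}")
    plt.title(title)
    plt.show()
```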
Next, the output processing unit 110 displays a hit map (703). A hit map visualizes the hit count 216 (or its logarithm) or the hit rate 217 using the visualization method of step 702.
As illustrated in FIG. 9, the hit map of this embodiment renders the hit counts as an image, color-coded based on the logarithm of the hit rate 217. The numerical hit counts may also be displayed, as in the figure. A hit map makes it possible to identify, for example, nodes where the training data distribution is dense.
Next, the output processing unit 110 displays a score map (704). A score map visualizes the scores 227 for a specific explanatory variable, or the objective scores, using the visualization method of step 702.
As illustrated in FIG. 10, the score map of this embodiment renders the scores as an image, color-coded per explanatory variable based on the score 227. For example, by setting a score of 0 to green and grading toward red in the positive direction and blue in the negative direction, it is easy to grasp at which node positions which explanatory variables have a strong influence. Also, as in the figure, comparing the pattern with the component map of the corresponding explanatory variable reveals how the values of that explanatory variable behave at the nodes with high influence.
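The blue-green-red color coding described above could be sketched as follows; the colormap endpoints and the small guard values are assumptions, while `LinearSegmentedColormap.from_list` and `TwoSlopeNorm` are standard matplotlib API:

```python
import matplotlib.pyplot as plt
from matplotlib.colors import LinearSegmentedColormap, TwoSlopeNorm

def show_score_map(scores, m, n):
    """scores: (m*n,) numpy array of one explanatory variable's scores,
    in row-major grid order. Score 0 maps to green."""
    cmap = LinearSegmentedColormap.from_list("score", ["blue", "green", "red"])
    # Guards keep vmin < 0 < vmax even if all scores share one sign.
    norm = TwoSlopeNorm(vmin=min(scores.min(), -1e-9), vcenter=0.0,
                        vmax=max(scores.max(), 1e-9))
    plt.imshow(scores.reshape(m, n), cmap=cmap, norm=norm)
    plt.colorbar(label="score")
    plt.show()
```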
Next, the output processing unit 110 displays a node map (705). A node map visualizes each node as a point in the low-dimensional space, using the per-node coordinates 218 calculated in step 304. The shape, color, and so on of the point representing each node may be set according to the values of the explanatory variables and objective variables in the node information table, the per-explanatory-variable scores in the score table, the objective score, the division condition, and the like.
As illustrated in FIG. 11, the node map of this embodiment plots the nodes in a two-dimensional space based on the node coordinates 218 under each division condition. The geometric structure of the nodes under a specific division condition may also be displayed with grid lines. A node map makes it possible to grasp the positional relationships among the nodes across multiple division conditions. For example, when the current rank is used as the division condition, looking at nearby nodes makes it easy to find nodes that appear likely to rise in rank, or that carry a high risk of falling. Differences in features from such neighboring nodes can be examined by directly comparing the values in the node information table, or by using the component maps.
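A node map can be sketched as a scatter plot over the MDS coordinates (illustrative; coloring by division condition is just one of the options the text mentions):

```python
import matplotlib.pyplot as plt

def show_node_map(coords, condition_ids):
    """coords: (num_nodes, 2) coordinates 218; condition_ids: per-node labels."""
    plt.scatter(coords[:, 0], coords[:, 1], c=condition_ids, cmap="tab10")
    plt.colorbar(label="division condition")
    plt.title("Node map")
    plt.show()
```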
The process then ends.
The visualization methods described above can be executed in any order according to the user's instructions, and may also be combined and displayed simultaneously.
As described above, the data analysis system of this embodiment comprises: a feature node calculation unit 107 that divides, according to a specified division condition, an input data set consisting of the plurality of explanatory variables used by the machine learning model during training, or an input data set consisting of a data set in which those explanatory variables have been processed, and that calculates feature nodes representing the features of the distribution structure of each divided data set; a score calculation unit 108 that generates neighborhood data of input data including the feature nodes and calculates, based on the explanatory variables of the generated neighborhood data and the objective variable data obtained by inputting the neighborhood data into the machine learning model, scores representing the relationship between the explanatory variables and the objective variables; and an output processing unit 110 that outputs results including the scores. With this configuration, for a trained machine learning model, the degree of influence of each explanatory variable on the objective variables can be calculated and visualized for each target segment indicated by a division condition. The feature nodes, which represent the features of the distribution structure, can express the features of a data set with less data than the training data. Even when the training data are few and not exhaustive, the neighborhood data can express the features of the data set and complement the feature nodes. In other words, the features of a data set can be represented with a small amount of data, reducing the amount of computation.
Further, since the feature node calculation unit 107 calculates the feature nodes based on the input data set to which a self-organizing map has been applied, the feature nodes can be calculated accurately.
Further, the feature node calculation unit 107 calculates the feature nodes using an input data set consisting of the plurality of explanatory variables used by the machine learning model during training and the objective variables calculated by the machine learning model, or an input data set consisting of a data set in which those explanatory variables and objective variables have been processed, so the objective variables can be compared on the maps.
Further, the feature node calculation unit 107 divides the input data set by a division condition that includes at least one of a specific value or range of a specific explanatory variable and a specific value (for example, the maximum value) or range of an element of the objective variables, or by a division condition expressed by a combination of these, so the analysis can be narrowed to a target segment. That is, by varying the attributes according to the purpose rather than analyzing the whole population, data of a group having specific attributes can be analyzed.
Further, the score calculation unit 108 calculates a score corresponding to the format of the objective variables for each explanatory variable by applying linear model estimation based on the explanatory variable data and the objective variable data. Since a linear model is simple and easy to handle, it is easy for the user to understand, and the reliability of the results can be improved. In particular, with a linear model, when multiple attributes are integrated, the result can be computed as a sum of contributions, which is intuitively easy for the user to grasp.
Further, the score calculation unit 108 calculates the objective score by aggregating, among the elements of the objective variables, the parts specified for each division condition, so the analysis can be narrowed to a target segment. That is, by varying the attributes according to the purpose rather than analyzing the whole population, data of a group having specific attributes can be analyzed.
Further, for the calculated scores and the calculated objective scores, the score calculation unit 108 calculates a weighted average score for each explanatory variable based on the number of neighboring data items of each feature node under each division condition, so the characteristics of the data can be represented correctly, taking the density distribution into account.
Further, the system includes a node mapping unit 109 that maps the feature nodes calculated by the feature node calculation unit 107 into a two-dimensional space under each division condition, so the characteristics of each group can be presented in an easily understandable way.
Further, the node mapping unit 109 generates data for rendering as images, based on the geometric structure of the nodes, the values of the feature nodes for each explanatory variable, the calculated scores, and the weighted average scores calculated for the scores and objective scores, so data can be compared across groups with different attributes while preserving the distance relationships between nodes.
Further, the node mapping unit 109 initializes the vectors of the feature nodes, or the feature node vectors including the objective variable components, based on the geometric structure of the feature nodes of each division condition, and then applies multidimensional scaling to perform the mapping, so the score maps can present attributes with high and low influence in an easily understandable way.
Further, when the input data set is time-series data containing explanatory variables at fixed time intervals, data obtained by expanding those explanatory variables into independent variables from some past point in time to the present are used as the input data, and the rules used for the expansion are stored, so time-series input data sets can also be analyzed.
The present invention is not limited to the embodiments described above, and includes various modifications and equivalent configurations within the spirit of the appended claims. For example, the embodiments above are described in detail to explain the present invention clearly, and the present invention is not necessarily limited to configurations having all of the described elements. Part of the configuration of one embodiment may be replaced with the configuration of another embodiment, the configuration of another embodiment may be added to the configuration of a given embodiment, and parts of the configuration of each embodiment may have other configurations added, deleted, or substituted.
Each of the configurations, functions, processing units, processing means, and the like described above may be realized in hardware, for example by designing some or all of them as integrated circuits, or realized in software by a processor interpreting and executing programs that implement the respective functions.
Information such as the programs, tables, and files that implement each function can be stored in a storage device such as a memory, hard disk, or SSD (Solid State Drive), or on a recording medium such as an IC card, SD card, or DVD.
The control lines and information lines shown are those considered necessary for explanation, and not necessarily all the control lines and information lines required for implementation. In practice, almost all of the components may be considered mutually connected.

Claims (12)

  1.  A data analysis system comprising:
     an arithmetic device that executes a program; and
     a storage device connected to the arithmetic device,
     wherein the arithmetic device provides:
     a feature node calculation unit that divides, according to a specified division condition, an input data set consisting of a plurality of explanatory variables used by a machine learning model during training, or an input data set consisting of a data set in which the explanatory variables have been processed, and calculates feature nodes representing features of a distribution structure of each divided data set;
     a score calculation unit that generates neighborhood data of input data including the feature nodes, and calculates, based on explanatory variables of the generated neighborhood data and objective variable data obtained by inputting the neighborhood data into the machine learning model, a score representing a relationship between the explanatory variables and the objective variables; and
     an output processing unit that outputs an output result including the score.
  2.  The data analysis system according to claim 1,
     wherein the feature node calculation unit calculates the feature nodes based on the input data set to which a self-organizing map has been applied.
  3.  The data analysis system according to claim 1,
     wherein the feature node calculation unit calculates the feature nodes using an input data set consisting of the plurality of explanatory variables used by the machine learning model during training and objective variables calculated by the machine learning model, or an input data set consisting of a data set in which the explanatory variables and the objective variables have been processed.
  4.  The data analysis system according to claim 1,
     wherein the feature node calculation unit divides the input data set by a division condition including at least one of a specific value or range of a specific explanatory variable and a specific value or range of an element of the objective variables, or by a division condition expressed by a combination thereof.
  5.  The data analysis system according to claim 1,
     wherein the score calculation unit calculates a score corresponding to the format of the objective variables for each explanatory variable by applying linear model estimation based on the data of the explanatory variables and the data of the objective variables.
  6.  The data analysis system according to claim 1,
     wherein the score calculation unit calculates an objective score by aggregating, among the elements of the objective variables, the parts specified for each division condition.
  7.  The data analysis system according to claim 6,
     wherein the score calculation unit calculates, for the calculated score and the calculated objective score, a weighted average score for each explanatory variable based on the number of neighboring data items of each feature node under each division condition.
  8.  The data analysis system according to claim 1,
     wherein the arithmetic device further provides a node mapping unit that maps the feature nodes calculated by the feature node calculation unit into a two-dimensional space under each division condition.
  9.  The data analysis system according to claim 7,
     wherein the arithmetic device further provides a node mapping unit that maps the feature nodes calculated by the feature node calculation unit into a two-dimensional space under each division condition, and
     wherein the node mapping unit generates data for rendering as images, based on a geometric structure of the nodes, the values of the feature nodes for each explanatory variable, the calculated scores, and the weighted average scores calculated for the scores and the objective scores.
  10.  The data analysis system according to claim 8,
     wherein the node mapping unit initializes the vectors of the feature nodes, or feature node vectors including objective variable components, based on the geometric structure of the feature nodes of the division condition, and then applies multidimensional scaling to perform the mapping.
  11.  The data analysis system according to claim 1,
     wherein, when the input data set is time-series data containing explanatory variables at fixed time intervals, data obtained by expanding the explanatory variables into independent variables from a past point in time to the present are used as the input data, and
     wherein the arithmetic device stores the rules used for the expansion.
  12.  A data analysis method executed by a computer,
     wherein the computer has an arithmetic device that executes a program and a storage device connected to the arithmetic device, and
     wherein the method comprises:
     dividing, by the arithmetic device, according to a specified division condition, an input data set consisting of a plurality of explanatory variables used by a machine learning model during training, or an input data set consisting of a data set in which the explanatory variables have been processed;
     calculating, by the arithmetic device, feature nodes representing features of a distribution structure of each divided data set;
     generating, by the arithmetic device, neighborhood data of input data including the feature nodes;
     calculating, based on explanatory variables of the generated neighborhood data and objective variable data obtained by inputting the neighborhood data into the machine learning model, a score representing a relationship between the explanatory variables and the objective variables; and
     outputting, by the arithmetic device, an output result including the score.
PCT/JP2019/005167 2018-04-24 2019-02-13 Data analysis system and data analysis mehtod WO2019207910A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2018083408A JP6863926B2 (en) 2018-04-24 2018-04-24 Data analysis system and data analysis method
JP2018-083408 2018-04-24

Publications (1)

Publication Number Publication Date
WO2019207910A1

Family

ID=68295150

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2019/005167 WO2019207910A1 (en) 2018-04-24 2019-02-13 Data analysis system and data analysis mehtod

Country Status (2)

Country Link
JP (1) JP6863926B2 (en)
WO (1) WO2019207910A1 (en)


Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7353952B2 (en) 2019-12-09 2023-10-02 株式会社日立製作所 Analysis system and method
CN113344214B (en) * 2021-05-31 2022-06-14 北京百度网讯科技有限公司 Training method and device of data processing model, electronic equipment and storage medium
JP7314328B1 (en) 2022-01-11 2023-07-25 みずほリサーチ&テクノロジーズ株式会社 LEARNING SYSTEMS, LEARNING METHODS AND LEARNING PROGRAMS


Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016009569A1 (en) * 2014-07-17 2016-01-21 Necソリューションイノベータ株式会社 Attribute factor analysis method, device, and program

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
TANAKA, MASAHIRO ET AL.: "Clustering by Using Self Organizing Map", IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, vol. 2, 25 February 1996 (1996-02-25), pages 301 - 304 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7472999B2 (en) 2020-10-08 2024-04-23 富士通株式会社 OUTPUT CONTROL PROGRAM, OUTPUT CONTROL METHOD, AND INFORMATION PROCESSING APPARATUS
JP7263567B1 (en) 2022-01-11 2023-04-24 みずほリサーチ&テクノロジーズ株式会社 Information selection system, information selection method and information selection program
WO2023136118A1 (en) * 2022-01-11 2023-07-20 みずほリサーチ&テクノロジーズ株式会社 Information selection system, information selection method, and information selection program
JP2023102156A (en) * 2022-01-11 2023-07-24 みずほリサーチ&テクノロジーズ株式会社 Information selection system, information selection method, and information selection program

Also Published As

Publication number Publication date
JP6863926B2 (en) 2021-04-21
JP2019191895A (en) 2019-10-31

Similar Documents

Publication Publication Date Title
WO2019207910A1 (en) Data analysis system and data analysis mehtod
CN108009643B (en) A kind of machine learning algorithm automatic selecting method and system
Zhang et al. Local density adaptive similarity measurement for spectral clustering
Liu et al. Robust graph mode seeking by graph shift
US20140149412A1 (en) Information processing apparatus, clustering method, and recording medium storing clustering program
US9208278B2 (en) Clustering using N-dimensional placement
Van Leuken et al. Selecting vantage objects for similarity indexing
JP2000311246A (en) Similar image display method and recording medium storing similar image display processing program
Richards et al. Clustering and unsupervised classification
Hamel Visualization of support vector machines with unsupervised learning
Rahman et al. A flexible multi-layer self-organizing map for generic processing of tree-structured data
Chepushtanova et al. Persistence images: An alternative persistent homology representation
KR100895261B1 (en) Inductive and Hierarchical clustering method using Equilibrium-based support vector
Zhang et al. Bow pooling: a plug-and-play unit for feature aggregation of point clouds
KR101577249B1 (en) Device and method for voronoi cell-based support clustering
CN102855624B (en) A kind of image partition method based on broad sense data fields and Ncut algorithm
JP5370267B2 (en) Image processing system
JP6663323B2 (en) Data processing method, data processing device, and program
Karpagam et al. Improved content-based classification and retrieval of images using support vector machine
Li et al. A fast color image segmentation approach using GDF with improved region-level Ncut
Stanescu et al. A comparative study of some methods for color medical images segmentation
CN117609412B (en) Spatial object association method and device based on network structure information
Meng et al. Determining the number of clusters in co-authorship networks using social network theory
Chen et al. Dynamic image segmentation algorithm in 3D descriptions of remote sensing images
CN110750661B (en) Method, device, computer equipment and storage medium for searching image

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19792692

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19792692

Country of ref document: EP

Kind code of ref document: A1